
Critical PyTorch Tensor Shape Mismatch in QLib Factor Experiment #1068

@Zizhao-HUANG

Description


🐛 Bug Description

When using the rdagent framework to run a qlib factor development workflow, the qlib.contrib.model.pytorch_transformer.TransformerModel consistently fails during the initial training epoch. It throws a RuntimeError due to a tensor shape mismatch when attempting to reshape the input batch.

The core of the issue is that the actual number of features in the processed data batch does not align with the d_feat parameter used by the model at runtime inside the Docker container. This occurs even after implementing a robust, multi-layered preventive fix system designed to dynamically detect the feature count from the data and synchronize it with all relevant YAML configuration files before training begins.

The error persists across runs with different input sizes (43008 in the latest run, 47104 previously), but the failure is always the same: shape '[2048, 20, -1]' is invalid for input of size .... This strongly suggests that the configuration file carrying the updated d_feat is either not being read by the qrun process inside the container, or is being overridden by a default configuration.
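For context, the failure can be reproduced outside of Qlib with a few lines of PyTorch. The feature count of 21 used below is not taken from the workspace; it is inferred from the element count in the error message and is only an assumption for illustration:

    import torch

    batch_size = 2048      # batch size reported in the error
    d_feat_config = 20     # d_feat the model uses at runtime
    actual_features = 21   # assumed: 43008 elements / 2048 samples

    x = torch.randn(batch_size, actual_features)  # 2048 * 21 = 43008 elements
    # The model reshapes the flat feature vector into [N, d_feat, T]; with the
    # wrong d_feat the element count is not divisible, and reshape raises:
    x.reshape(batch_size, d_feat_config, -1)
    # RuntimeError: shape '[2048, 20, -1]' is invalid for input of size 43008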

To Reproduce

Steps to reproduce the behavior:

  1. Initialize a qlib factor development workflow using the rdagent framework, which utilizes rdagent.scenarios.qlib.developer.factor_runner.
  2. Generate a set of new alpha factors. In our case, this resulted in a combined feature set where the number of features does not equal the default d_feat in the Qlib template configurations.
  3. Execute the main workflow (e.g., rdagent qlib proposal). The system processes the factors and prepares them in a combined_factors_df.parquet file. Our automated scripts correctly detect the new feature count and update the host's YAML configuration files (a quick way to verify that count is sketched after this list).
  4. Observe the Docker container logs. The qrun process, initiated by rdagent, starts training the TransformerModel and immediately fails on the first batch of the first epoch with the RuntimeError.
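A quick way to confirm the feature count referenced in step 3 is to inspect the generated parquet directly. This is only a diagnostic sketch; the path is assumed to be the workspace-relative location mentioned above:

    import pandas as pd

    df = pd.read_parquet("combined_factors_df.parquet")  # path relative to the workspace
    print(f"feature columns: {df.shape[1]}")             # the value d_feat must match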

Expected Behavior

The system is expected to:

  1. Automatically and accurately detect the number of features (num_features) from the final combined_factors_df.parquet file.
  2. Dynamically update the d_feat parameter in all relevant YAML configuration files (e.g., conf_combined_factors.yaml, conf_baseline.yaml) to match num_features.
  3. Launch the Docker container for training, ensuring that the Qlib workflow loads and uses this updated configuration.
  4. Initialize the TransformerModel with the correct d_feat value and proceed with training without a tensor-shape RuntimeError.
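For reference, after synchronization the model section of the workflow config is expected to end up looking roughly like the fragment below (assuming the standard Qlib workflow config layout; the value 21 is inferred from the latest run and is only illustrative):

    task:
      model:
        class: TransformerModel
        module_path: qlib.contrib.model.pytorch_transformer
        kwargs:
          d_feat: 21   # must match the feature count of combined_factors_df.parquet
          # ... other model kwargs unchanged ...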

Screenshot

[Screenshot of the Docker container log showing the RuntimeError]

The key error message from the Docker container log is as follows.

RuntimeError: shape '[2048, 20, -1]' is invalid for input of size 43008

Environment

Note: The following information was collected using rdagent collect_info.

  • Name of current operating system: Windows
  • Processor architecture: AMD64
  • System, version, and hardware information: Windows-10-10.0.22631-SP0
  • Version number of the system: 10.0.22631
  • Python version: 3.10.18 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:08:55) [MSC v.1929 64 bit (AMD64)]
  • Container ID: 0afd3072ec56...
  • Container Name: suspicious_hugle
  • Container Status: exited
  • Image ID used by the container: sha256:f9e9103a0266...
  • Image tag used by the container: []
  • Container port mapping: {}
  • Container Label: {'com.nvidia.volumes.needed': 'nvidia_driver', 'org.opencontainers.image.ref.name': 'ubuntu', 'org.opencontainers.image.version': '22.04'}
  • Startup Commands: /bin/sh -c pip install tables
  • RD-Agent version: 0.7.0

Additional Notes

Root Cause Analysis

The mathematical conflict is clear: the input tensor has 43008 elements, and the model attempts to reshape it into [batch_size, d_feat, -1], i.e. [2048, 20, -1]. For this to be valid, the total number of elements must be a multiple of 2048 * 20 = 40960, but 43008 / 40960 = 1.05 is not an integer, so the operation fails. Conversely, 43008 / 2048 = 21, so the batch actually carries 21 features per sample. This proves that the d_feat used by the model at runtime is 20, not the actual feature count from the data.
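The same arithmetic as a quick sanity check, using only the numbers from the error message and the runtime configuration:

    elements = 43008       # total elements in the failing batch
    batch_size = 2048
    d_feat_runtime = 20    # d_feat the model actually used

    print(elements / (batch_size * d_feat_runtime))  # 1.05 -> not an integer, reshape fails
    print(elements // batch_size)                    # 21   -> features actually present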

Primary Hypothesis

Our primary hypothesis is that the Qlib workflow inside the Docker container is not using the modified YAML configuration file that resides in the mounted host workspace. It might be loading a cached or default version of the config from a different location within the container, thus ignoring the dynamic d_feat value we set just before execution.
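One way to test this hypothesis is to print the d_feat that the container actually sees immediately before qrun starts. The sketch below is only a suggestion; the config path and mount point are assumptions and should be replaced with the path that appears in the qrun command line of the container logs:

    import yaml

    # Assumed location of the synced config inside the container; adjust to the
    # path actually passed to qrun.
    CONFIG_PATH = "/workspace/conf_combined_factors.yaml"

    with open(CONFIG_PATH) as f:
        cfg = yaml.safe_load(f)

    print("d_feat inside container:", cfg["task"]["model"]["kwargs"].get("d_feat"))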

Evidence of Unsuccessful Fixes

We have implemented a comprehensive, multi-stage preventive fix system to address this, which includes:

  1. Dynamic Feature Detection: A script in factor_runner.py that reads the generated .parquet file and accurately counts the number of feature columns.
  2. Automated Config Synchronization: A Python script (temp_sync_script.py) that is executed on the host to parse and rewrite the d_feat key in conf_combined_factors.yaml, conf_baseline.yaml, etc., with the correct feature count (a minimal sketch of this logic appears below).
  3. Data Type Normalization: Code to standardize DataFrame columns and indices to str before saving to HDF5/Parquet, successfully eliminating PyTables performance warnings.
  4. Independent Validation Tool: A standalone script (tensor_compatibility_validator.py) to confirm the mismatch between data and config.

Logs on the host machine confirm that these scripts execute successfully: the feature count is correctly identified, and the YAML files in the workspace are modified as expected. However, the error from within the container proves that these changes are not taking effect at the model level.
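For reference, the detection-and-sync logic described in items 1 and 2 is roughly equivalent to the sketch below. The helper is illustrative only (it is not the actual factor_runner.py or temp_sync_script.py code) and assumes the standard task.model.kwargs.d_feat layout in each config:

    import pandas as pd
    import yaml

    def sync_d_feat(parquet_path: str, config_paths: list[str]) -> int:
        """Detect the feature count and write it into each config's d_feat."""
        num_features = pd.read_parquet(parquet_path).shape[1]
        for path in config_paths:
            with open(path) as f:
                cfg = yaml.safe_load(f)
            cfg["task"]["model"]["kwargs"]["d_feat"] = num_features
            # Note: round-tripping with PyYAML drops comments and anchors; the
            # real sync script may edit the files differently.
            with open(path, "w") as f:
                yaml.safe_dump(cfg, f, sort_keys=False)
        return num_features

    sync_d_feat(
        "combined_factors_df.parquet",
        ["conf_combined_factors.yaml", "conf_baseline.yaml"],
    )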

Attached log: C:\Windows\system32\cmd.exe 2025-07-13.txt
