🐛 Bug Description
When using the rdagent framework to run a qlib factor development workflow, the qlib.contrib.model.pytorch_transformer.TransformerModel consistently fails during the initial training epoch. It throws a RuntimeError due to a tensor shape mismatch when attempting to reshape the input batch.
The core of the issue is that the actual number of features in the processed data batch does not align with the d_feat parameter used by the model at runtime inside the Docker container. This occurs even after implementing a robust, multi-layered preventive fix system designed to dynamically detect the feature count from the data and synchronize it with all relevant YAML configuration files before training begins.
The error recurs with different input sizes (43008 in the latest run, previously 47104), but the root cause remains the same: shape '[2048, 20, -1]' is invalid for input of size .... This strongly suggests that the configuration file with the updated d_feat is either not being read correctly by the qrun process inside the container, or is being overridden by a default configuration.
To Reproduce
Steps to reproduce the behavior:
- Initialize a `qlib` factor development workflow using the `rdagent` framework, which utilizes `rdagent.scenarios.qlib.developer.factor_runner`.
- Generate a set of new alpha factors. In our case, this resulted in a combined feature set whose number of features does not equal the default `d_feat` in the Qlib template configurations.
- Execute the main workflow (e.g., `rdagent qlib proposal`). The system processes the factors and prepares them in a `combined_factors_df.parquet` file. Our automated scripts correctly detect the new feature count and update the host's YAML configuration files.
- Observe the Docker container logs. The `qrun` process, initiated by `rdagent`, starts training the `TransformerModel` and immediately fails on the first batch of the first epoch with the `RuntimeError`.
Expected Behavior
The system is expected to:
- Automatically and accurately detect the number of features (`num_features`) from the final `combined_factors_df.parquet` file.
- Dynamically update the `d_feat` parameter in all relevant YAML configuration files (e.g., `conf_combined_factors.yaml`, `conf_baseline.yaml`) to match `num_features` (a minimal sketch of these two steps follows this list).
- Launch the Docker container for training, ensuring that the Qlib workflow loads and uses this updated configuration.
- The `TransformerModel` should initialize with the correct `d_feat` value and proceed with training without a tensor shape `RuntimeError`.
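
As a reference point, the detection-and-sync step can be expressed as a minimal sketch like the one below. It assumes a flat column layout in the parquet file and the usual `task: model: kwargs: d_feat` nesting of a qlib workflow config; the actual logic lives in `factor_runner.py` and `temp_sync_script.py` and handles more cases.

```python
import pandas as pd
import yaml

# Minimal sketch of feature detection + config sync (assumptions: flat columns in the
# parquet file, and d_feat located at task.model.kwargs.d_feat in the workflow config).
num_features = pd.read_parquet("combined_factors_df.parquet").shape[1]

for cfg_path in ("conf_combined_factors.yaml", "conf_baseline.yaml"):
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    cfg["task"]["model"]["kwargs"]["d_feat"] = num_features  # keep the model's input width in sync with the data
    with open(cfg_path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)
```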
Screenshot
The key error message from the Docker container log is as follows.
RuntimeError: shape '[2048, 20, -1]' is invalid for input of size 43008
Environment
Note: The following information was collected using rdagent collect_info.
- Name of current operating system: Windows
- Processor architecture: AMD64
- System, version, and hardware information: Windows-10-10.0.22631-SP0
- Version number of the system: 10.0.22631
- Python version: 3.10.18 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:08:55) [MSC v.1929 64 bit (AMD64)]
- Container ID: 0afd3072ec56...
- Container Name: suspicious_hugle
- Container Status: exited
- Image ID used by the container: sha256:f9e9103a0266...
- Image tag used by the container: []
- Container port mapping: {}
- Container Label: {'com.nvidia.volumes.needed': 'nvidia_driver', 'org.opencontainers.image.ref.name': 'ubuntu', 'org.opencontainers.image.version': '22.04'}
- Startup Commands: /bin/sh -c pip install tables
- RD-Agent version: 0.7.0
Additional Notes
Root Cause Analysis
The mathematical conflict is clear: the input tensor has 43008 elements. The model attempts to reshape it into a tensor of shape [batch_size, d_feat, -1], which translates to [2048, 20, -1]. For this to be valid, the total number of elements must be a multiple of 2048 * 20 = 40960. Since 43008 / 40960 = 1.05 is not an integer, the operation fails. In fact, 43008 / 2048 = 21, so each sample carries 21 flattened feature values, which is not a multiple of 20. This proves that the d_feat used by the model at runtime is 20, not the actual feature count from the data.
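
The failure can be reproduced in isolation with a few lines of PyTorch; the reshape below mirrors the `reshape(len(x), d_feat, -1)` call that the error message points to, with the numbers taken from the log above:

```python
import torch

# 2048 samples x 21 flattened feature values = 43008 elements, but the model assumes d_feat = 20.
batch = torch.randn(2048, 21)
try:
    batch.reshape(2048, 20, -1)
except RuntimeError as e:
    print(e)  # shape '[2048, 20, -1]' is invalid for input of size 43008
```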
Primary Hypothesis
Our primary hypothesis is that the Qlib workflow inside the Docker container is not using the modified YAML configuration file that resides in the mounted host workspace. It might be loading a cached or default version of the config from a different location within the container, thus ignoring the dynamic d_feat value we set just before execution.
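
One way to test this hypothesis is to ask the container itself which `d_feat` it sees, e.g. by running a small script through `docker exec` against the mounted workspace. The path and key layout below are assumptions based on a typical qlib workflow config and would need to be adjusted to the real workspace layout:

```python
import yaml

# Hypothetical check, intended to run inside the container (e.g. via
# `docker exec <container> python check_d_feat.py`). Path and key nesting are assumptions.
CONFIG_PATH = "/workspace/conf_combined_factors.yaml"

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

print("d_feat visible inside the container:", cfg["task"]["model"]["kwargs"]["d_feat"])
```

If this prints the updated value while the model still reshapes with d_feat = 20, the override would be happening later in the qrun pipeline rather than at the file level.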
Evidence of Unsuccessful Fixes
We have implemented a comprehensive, multi-stage preventive fix system to address this, which includes:
- Dynamic Feature Detection: A script in `factor_runner.py` that reads the generated `.parquet` file and accurately counts the number of feature columns.
- Automated Config Synchronization: A Python script (`temp_sync_script.py`) executed on the host to parse and rewrite the `d_feat` key in `conf_combined_factors.yaml`, `conf_baseline.yaml`, etc., with the correct feature count.
- Data Type Normalization: Code to standardize DataFrame columns and indices to `str` before saving to HDF5/Parquet, successfully eliminating `PyTables` performance warnings.
- Independent Validation Tool: A standalone script (`tensor_compatibility_validator.py`) to confirm the mismatch between data and config (a simplified sketch of this check appears at the end of this section).
Logs on the host machine confirm that these scripts execute successfully: the feature count is correctly identified, and the YAML files in the workspace are modified as expected. However, the error from within the container proves that these changes are not taking effect at the model level.
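
For completeness, here is a simplified sketch of the kind of cross-check the standalone validator performs (file paths and the YAML key layout are assumptions, not the exact contents of `tensor_compatibility_validator.py`):

```python
import pandas as pd
import yaml

PARQUET_PATH = "combined_factors_df.parquet"
CONFIG_PATH = "conf_combined_factors.yaml"

num_features = pd.read_parquet(PARQUET_PATH).shape[1]
with open(CONFIG_PATH) as f:
    d_feat = yaml.safe_load(f)["task"]["model"]["kwargs"]["d_feat"]

print(f"features in data: {num_features}, d_feat in config: {d_feat}")
# The model's reshape requires the per-sample feature count to be a multiple of d_feat;
# our sync scripts set the two equal, so any difference here indicates a stale config.
if num_features % d_feat:
    print("MISMATCH: reshape([batch, d_feat, -1]) will raise a RuntimeError.")
```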