
Critical PyTorch Tensor Shape Mismatch in QLib Factor Experiment #1068

@Zizhao-HUANG

Description


🐛 Bug Description

When using the rdagent framework to run a qlib factor development workflow, the qlib.contrib.model.pytorch_transformer.TransformerModel consistently fails during the initial training epoch. It throws a RuntimeError due to a tensor shape mismatch when attempting to reshape the input batch.

The core of the issue is that the actual number of features in the processed data batch does not align with the d_feat parameter used by the model at runtime inside the Docker container. This occurs even after implementing a robust, multi-layered preventive fix system designed to dynamically detect the feature count from the data and synchronize it with all relevant YAML configuration files before training begins.

The error persists across runs with different input sizes (43008 in the latest run, 47104 previously), but the failure is always the same: shape '[2048, 20, -1]' is invalid for input of size .... This strongly suggests that the configuration file carrying the updated d_feat is either not being read by the qrun process inside the container, or is being overridden by a default configuration.
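For context, the failure can be reproduced outside of Qlib with a few lines of PyTorch. The feature count of 21 used below is not taken from the workspace; it is inferred from the element count in the error message and is only an assumption for illustration:

    import torch

    batch_size = 2048      # batch size reported in the error
    d_feat_config = 20     # d_feat the model uses at runtime
    actual_features = 21   # assumed: 43008 elements / 2048 samples

    x = torch.randn(batch_size, actual_features)  # 2048 * 21 = 43008 elements
    # The model reshapes the flat feature vector into [N, d_feat, T]; with the
    # wrong d_feat the element count is not divisible, and reshape raises:
    x.reshape(batch_size, d_feat_config, -1)
    # RuntimeError: shape '[2048, 20, -1]' is invalid for input of size 43008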

To Reproduce

Steps to reproduce the behavior:

  1. Initialize a qlib factor development workflow using the rdagent framework, which utilizes rdagent.scenarios.qlib.developer.factor_runner.
  2. Generate a set of new alpha factors. In our case, this resulted in a combined feature set where the number of features does not equal the default d_feat in the Qlib template configurations.
  3. Execute the main workflow (e.g., rdagent qlib proposal). The system processes the factors and prepares them in a combined_factors_df.parquet file. Our automated scripts correctly detect the new feature count and update the host's YAML configuration files (a quick way to verify that count is sketched after this list).
  4. Observe the Docker container logs. The qrun process, initiated by rdagent, starts training the TransformerModel and immediately fails on the first batch of the first epoch with the RuntimeError.
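A quick way to confirm the feature count referenced in step 3 is to inspect the generated parquet directly. This is only a diagnostic sketch; the path is assumed to be the workspace-relative location mentioned above:

    import pandas as pd

    df = pd.read_parquet("combined_factors_df.parquet")  # path relative to the workspace
    print(f"feature columns: {df.shape[1]}")             # the value d_feat must match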

Expected Behavior

The system is expected to:

  1. Automatically and accurately detect the number of features (num_features) from the final combined_factors_df.parquet file.
  2. Dynamically update the d_feat parameter in all relevant YAML configuration files (e.g., conf_combined_factors.yaml, conf_baseline.yaml) to match num_features.
  3. Launch the Docker container for training, ensuring that the Qlib workflow loads and uses this updated configuration.
  4. Initialize the TransformerModel with the correct d_feat value and proceed with training without a tensor-shape RuntimeError.
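For reference, after synchronization the model section of the workflow config is expected to end up looking roughly like the fragment below (assuming the standard Qlib workflow config layout; the value 21 is inferred from the latest run and is only illustrative):

    task:
      model:
        class: TransformerModel
        module_path: qlib.contrib.model.pytorch_transformer
        kwargs:
          d_feat: 21   # must match the feature count of combined_factors_df.parquet
          # ... other model kwargs unchanged ...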

Screenshot

[Screenshot of the Docker container log showing the RuntimeError]

The key error message from the Docker container log is as follows.

RuntimeError: shape '[2048, 20, -1]' is invalid for input of size 43008

Environment

Note: The following information was collected using rdagent collect_info.

  • Name of current operating system: Windows
  • Processor architecture: AMD64
  • System, version, and hardware information: Windows-10-10.0.22631-SP0
  • Version number of the system: 10.0.22631
  • Python version: 3.10.18 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:08:55) [MSC v.1929 64 bit (AMD64)]
  • Container ID: 0afd3072ec56...
  • Container Name: suspicious_hugle
  • Container Status: exited
  • Image ID used by the container: sha256:f9e9103a0266...
  • Image tag used by the container: []
  • Container port mapping: {}
  • Container Label: {'com.nvidia.volumes.needed': 'nvidia_driver', 'org.opencontainers.image.ref.name': 'ubuntu', 'org.opencontainers.image.version': '22.04'}
  • Startup Commands: /bin/sh -c pip install tables
  • RD-Agent version: 0.7.0

Additional Notes

Root Cause Analysis

The mathematical conflict is clear: the input tensor has 43008 elements, and the model attempts to reshape it into [batch_size, d_feat, -1], i.e. [2048, 20, -1]. For this to be valid, the total number of elements must be a multiple of 2048 * 20 = 40960, but 43008 / 40960 = 1.05 is not an integer, so the operation fails. Conversely, 43008 / 2048 = 21, so the batch actually carries 21 features per sample. This proves that the d_feat used by the model at runtime is 20, not the actual feature count from the data.
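The same arithmetic as a quick sanity check, using only the numbers from the error message and the runtime configuration:

    elements = 43008       # total elements in the failing batch
    batch_size = 2048
    d_feat_runtime = 20    # d_feat the model actually used

    print(elements / (batch_size * d_feat_runtime))  # 1.05 -> not an integer, reshape fails
    print(elements // batch_size)                    # 21   -> features actually present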

Primary Hypothesis

Our primary hypothesis is that the Qlib workflow inside the Docker container is not using the modified YAML configuration file that resides in the mounted host workspace. It might be loading a cached or default version of the config from a different location within the container, thus ignoring the dynamic d_feat value we set just before execution.
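One way to test this hypothesis is to print the d_feat that the container actually sees immediately before qrun starts. The sketch below is only a suggestion; the config path and mount point are assumptions and should be replaced with the path that appears in the qrun command line of the container logs:

    import yaml

    # Assumed location of the synced config inside the container; adjust to the
    # path actually passed to qrun.
    CONFIG_PATH = "/workspace/conf_combined_factors.yaml"

    with open(CONFIG_PATH) as f:
        cfg = yaml.safe_load(f)

    print("d_feat inside container:", cfg["task"]["model"]["kwargs"].get("d_feat"))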

Evidence of Unsuccessful Fixes

We have implemented a comprehensive, multi-stage preventive fix system to address this, which includes:

  1. Dynamic Feature Detection: A script in factor_runner.py that reads the generated .parquet file and accurately counts the number of feature columns.
  2. Automated Config Synchronization: A Python script (temp_sync_script.py) that is executed on the host to parse and rewrite the d_feat key in conf_combined_factors.yaml, conf_baseline.yaml, etc., with the correct feature count (a minimal sketch of this logic appears below).
  3. Data Type Normalization: Code to standardize DataFrame columns and indices to str before saving to HDF5/Parquet, successfully eliminating PyTables performance warnings.
  4. Independent Validation Tool: A standalone script (tensor_compatibility_validator.py) to confirm the mismatch between data and config.

Logs on the host machine confirm that these scripts execute successfully: the feature count is correctly identified, and the YAML files in the workspace are modified as expected. However, the error from within the container proves that these changes are not taking effect at the model level.
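For reference, the detection-and-sync logic described in items 1 and 2 is roughly equivalent to the sketch below. The helper is illustrative only (it is not the actual factor_runner.py or temp_sync_script.py code) and assumes the standard task.model.kwargs.d_feat layout in each config:

    import pandas as pd
    import yaml

    def sync_d_feat(parquet_path: str, config_paths: list[str]) -> int:
        """Detect the feature count and write it into each config's d_feat."""
        num_features = pd.read_parquet(parquet_path).shape[1]
        for path in config_paths:
            with open(path) as f:
                cfg = yaml.safe_load(f)
            cfg["task"]["model"]["kwargs"]["d_feat"] = num_features
            # Note: round-tripping with PyYAML drops comments and anchors; the
            # real sync script may edit the files differently.
            with open(path, "w") as f:
                yaml.safe_dump(cfg, f, sort_keys=False)
        return num_features

    sync_d_feat(
        "combined_factors_df.parquet",
        ["conf_combined_factors.yaml", "conf_baseline.yaml"],
    )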

Attached log: C:\Windows\system32\cmd.exe 2025-07-13.txt
