diff --git a/FireRPFNetExtension.md b/FireRPFNetExtension.md new file mode 100644 index 0000000000..4024c4a7b1 --- /dev/null +++ b/FireRPFNetExtension.md @@ -0,0 +1,209 @@ +# FireRPFNet Models - Quick Start Guide + +Custom 3D object detection models using FireRPFNet architecture with Fire Modules, Residual connections, and CBAM attention. + +## ๐Ÿ”ฅ FireRPFNet Variants + +- **FireRPFNetV2**: Enhanced 3D LiDAR backbone with improved attention +- **FireRPFNet2D**: 2D image backbone variant for camera features + +**Plug-and-Play Design:** +- **FireRPFNetV2** can replace SECOND backbone in any model (BEVFusion is one example shown here) +- **FireRPFNet2D** can be used as an efficient image backbone in multi-modal architectures +- Simply update the backbone config to integrate into your existing models + +--- + +## ๐Ÿ“‹ Available Models + +| Model | Config | Image Backbone | LiDAR Backbone | Dataset | Modality | +|-------|--------|---------------|----------------|---------|----------| +| MVXNet-Squeeze | `configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py` | SQUEEZE | **FireRPFNetV2** | KITTI | Multi-modal | +| MVXNet-Fire2D | `configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py` | **FireRPFNet2D** | **FireRPFNetV2** | KITTI | Multi-modal | +| BEVFusion-Lidar | `projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py` | - | **FireRPFNetV2** | nuScenes | LiDAR-only | +| BEVFusion-Cam | `projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py` | Swin-T | **FireRPFNetV2** | nuScenes | Multi-modal | + +--- + +## ๐Ÿš€ Installation + +Follow the official MMDetection3D installation guide: https://mmdetection3d.readthedocs.io/en/latest/get_started.html + +**Quick Setup:** +```bash +# Install dependencies +pip install -U openmim +mim install mmengine +mim install 'mmcv>=2.0.0rc4' +mim install 'mmdet>=3.0.0' + +# Install mmdetection3d +cd mmdetection3d +pip install -v -e . 
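+
+# Optional sanity check (a small sketch, assuming the editable install above succeeded):
+python -c "import mmdet3d; print(mmdet3d.__version__)"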
+``` + +--- + +## ๐Ÿ“ฆ Dataset Setup + +### KITTI (MVXNet models) +```bash +# Download from http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d +# Organize: data/kitti/training/{image_2, velodyne, calib, label_2} + +# Create data infos +python tools/create_data.py kitti --root-path ./data/kitti --out-dir ./data/kitti --extra-tag kitti +``` + +### nuScenes (BEVFusion models) +```bash +# Download from https://www.nuscenes.org/download +# Organize: data/nuscenes/{samples, sweeps, v1.0-trainval} + +# Create data infos +python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes +``` + +--- + +## ๐Ÿ‹๏ธ Training Commands + +### MVXNet Models (KITTI) + +**Model 1: SqueezeFPN + FireRPFNetV2** +```bash +# Single GPU +python tools/train.py configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py + +``` +- Batch size: 2/GPU | Epochs: 20 | LR: 0.001 | Val: Every 5 epochs + +**Model 2: FireRPFNet2D + FireRPFNetV2** +```bash +# Single GPU +python tools/train.py configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py + +``` +- Batch size: 4/GPU | Epochs: 16 | LR: 0.001 | Val: Every 2 epochs | Early stopping enabled + +--- + +### BEVFusion Models (nuScenes) + +**Model 3: BEVFusion LiDAR-only + FireRPFNetV2** +```bash +# +python tools/train.py projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py +``` +- Batch size: 4/GPU | Epochs: 20 | LR: 0.0002 | Cyclic scheduler + +**Model 4: BEVFusion Multi-Modal + FireRPFNetV2** +```bash + +# With mixed precision +python tools/train.py \ + projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py \ + --amp +``` +- Batch size: 4/GPU (32 total) | Epochs: 6 | LR: 0.0002 | Val: Every epoch + +--- + +## ๐Ÿงช Testing + +### MVXNet Models +```bash +# Single GPU +python tools/test.py CONFIG CHECKPOINT + +# Multi-GPU +bash tools/dist_test.sh CONFIG CHECKPOINT 4 +``` + +**Examples:** +```bash +# Model 1 +python tools/test.py \ + configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py \ + work_dirs/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class/best_checkpoint.pth + +# Model 2 +bash tools/dist_test.sh \ + configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py \ + work_dirs/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class/best_checkpoint.pth 4 +``` + +### BEVFusion Models +```bash +# Model 3 +bash tools/dist_test.sh \ + projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py \ + work_dirs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d/best_checkpoint.pth 8 + +# Model 4 +bash tools/dist_test.sh \ + projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py \ + work_dirs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d/best_checkpoint.pth 8 +``` + +--- + +## ๐Ÿ’ก Tips + +**Resume Training:** +```bash +python tools/train.py CONFIG --resume work_dirs/MODEL_NAME/epoch_X.pth +``` + +**Specify GPUs:** +```bash +CUDA_VISIBLE_DEVICES=0,1,2,3 bash tools/dist_train.sh CONFIG 4 +``` + +**Debug Mode:** +```bash +python tools/train.py CONFIG \ + --cfg-options data.train_dataloader.num_workers=0 \ + data.train_dataloader.batch_size=1 +``` + +**Monitor Training:** +```bash +tensorboard --logdir=work_dirs/ +``` + +--- + +## ๐Ÿ› Common Issues + +**CUDA OOM:** Reduce batch size in config or via `--cfg-options data.train_dataloader.batch_size=1` + +**Dataset not found:** Verify paths and run `python tools/create_data.py` 
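+
+For example, you can sanity-check the expected KITTI layout before regenerating the infos (a quick sketch, assuming the default `data/kitti` root used above):
+
+```bash
+ls data/kitti/training            # expect: calib  image_2  label_2  velodyne
+ls data/kitti/kitti_infos_*.pkl   # info files produced by tools/create_data.py
+```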
+ +**Import errors:** Reinstall with `pip install -v -e .` + +--- + +## ๐Ÿ“š References + +- [MMDetection3D Documentation](https://mmdetection3d.readthedocs.io) +- [KITTI Dataset](http://www.cvlibs.net/datasets/kitti/) +- [nuScenes Dataset](https://www.nuscenes.org/) + +--- + +## ๐Ÿ“ Citation + +```bibtex +@article{firerpfnet2024, + title={FireRPFNet: Efficient 3D Object Detection with Fire Modules and Attention}, + author={Aravind Singh}, + journal={arXiv preprint}, + year={2024} +} +``` + +--- + +**Happy Training! ๐Ÿš€** + diff --git a/configs/_base_/models/centerpoint_pillar02_squeeze_squeezefpn_nus.py b/configs/_base_/models/centerpoint_pillar02_squeeze_squeezefpn_nus.py new file mode 100644 index 0000000000..04375a2a1f --- /dev/null +++ b/configs/_base_/models/centerpoint_pillar02_squeeze_squeezefpn_nus.py @@ -0,0 +1,91 @@ +voxel_size = [0.2, 0.2, 8] +model = dict( + type='CenterPoint', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_layer=dict( + max_num_points=20, + voxel_size=voxel_size, + max_voxels=(30000, 40000))), + pts_voxel_encoder=dict( + type='PillarFeatureNet', + in_channels=5, + feat_channels=[64], + with_distance=False, + voxel_size=(0.2, 0.2, 8), + norm_cfg=dict(type='BN1d', eps=1e-3, momentum=0.01), + legacy=False), + pts_middle_encoder=dict( + type='PointPillarsScatter', in_channels=64, output_shape=(512, 512)), + pts_backbone=dict( + type='SQUEEZE', + in_channels=64, + out_channels=[64, 128, 256 , 512], + #layer_nums=[3, 5, 5], + #layer_strides=[2, 2, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + pts_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], + out_channels=[512, 512, 512, 512], + #upsample_strides=[0.5, 1, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False), + #use_conv_for_no_stride=True + ), + pts_bbox_head=dict( + type='CenterHead', + in_channels=sum([128, 128, 128,128]), + #in_channels=256, + tasks=[ + dict(num_class=1, class_names=['car']), + dict(num_class=2, class_names=['truck', 'construction_vehicle']), + dict(num_class=2, class_names=['bus', 'trailer']), + dict(num_class=1, class_names=['barrier']), + dict(num_class=2, class_names=['motorcycle', 'bicycle']), + dict(num_class=2, class_names=['pedestrian', 'traffic_cone']), + ], + common_heads=dict( + reg=(2, 2), height=(1, 2), dim=(3, 2), rot=(2, 2), vel=(2, 2)), + share_conv_channel=64, + bbox_coder=dict( + type='CenterPointBBoxCoder', + post_center_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0], + max_num=500, + score_threshold=0.1, + out_size_factor=4, + voxel_size=voxel_size[:2], + code_size=9), + separate_head=dict( + type='SeparateHead', init_bias=-2.19, final_kernel=3), + loss_cls=dict(type='mmdet.GaussianFocalLoss', reduction='mean'), + loss_bbox=dict( + type='mmdet.L1Loss', reduction='mean', loss_weight=0.25), + norm_bbox=True), + # model training and testing settings + train_cfg=dict( + pts=dict( + grid_size=[512, 512, 1], + voxel_size=voxel_size, + out_size_factor=4, + dense_reg=1, + gaussian_overlap=0.1, + max_objs=500, + min_radius=2, + code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2])), + test_cfg=dict( + pts=dict( + post_center_limit_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0], + max_per_img=500, + max_pool_nms=False, + min_radius=[4, 12, 10, 1, 0.85, 0.175], + score_threshold=0.1, + pc_range=[-51.2, -51.2], + out_size_factor=4, + voxel_size=voxel_size[:2], + nms_type='rotate', + pre_max_size=1000, + 
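+            # rotated NMS: keep the top `pre_max_size` candidates and return at most `post_max_size` boxes after suppression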
post_max_size=83, + nms_thr=0.2))) diff --git a/configs/_base_/models/centerpoint_voxel01_squeeze_squeezefpn_nus.py b/configs/_base_/models/centerpoint_voxel01_squeeze_squeezefpn_nus.py new file mode 100644 index 0000000000..c36e7de268 --- /dev/null +++ b/configs/_base_/models/centerpoint_voxel01_squeeze_squeezefpn_nus.py @@ -0,0 +1,46 @@ +model = dict( + type='VoxelNet', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_layer=dict( + max_num_points=5, + point_cloud_range=[0, -40, -3, 70.4, 40, 1], + voxel_size=[0.05, 0.05, 0.1], + max_voxels=(16000, 40000))), + voxel_encoder=dict(type='HardSimpleVFE'), + middle_encoder=dict( + type='SparseEncoder', + in_channels=4, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + backbone=dict( + type='SQUEEZE', + in_channels=3, + out_channels=[64, 128, 256, 512], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256, 512], + out_channels=[256, 256, 256, 256], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False), + conv_cfg=dict(type='Conv2d', bias=False)), + bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, + feat_channels=256, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[[0, -40, -1.8, 70.4, 40, -1.8]], + sizes=[[1.6, 3.9, 1.56]], + rotations=[0, 1.57]), + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict(type='FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25), + loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0), + loss_dir=dict(type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.2)), + train_cfg=dict(assigner=dict(type='MaxIoUAssigner')), + test_cfg=dict(use_rotate_nms=True, nms_across_levels=False, nms_pre=1000, nms_thr=0.01, score_thr=0.1, min_bbox_size=0, max_num=500) +) diff --git a/configs/centerpoint/centerpoint_pillar02_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py b/configs/centerpoint/centerpoint_pillar02_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..4ed2c5df84 --- /dev/null +++ b/configs/centerpoint/centerpoint_pillar02_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,253 @@ +_base_ = [ + '../_base_/datasets/nus-3d.py', + '../_base_/models/centerpoint_pillar02_squeeze_squeezefpn_nus.py', + '../_base_/schedules/cyclic-20e.py', '../_base_/default_runtime.py' +] + +# If point cloud range is changed, the models should also change their point +# cloud range accordingly +point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0] +# Using calibration info convert the Lidar-coordinate point cloud range to the +# ego-coordinate point cloud range could bring a little promotion in nuScenes. 
+# point_cloud_range = [-51.2, -52, -5.0, 51.2, 50.4, 3.0] +# For nuScenes we usually do 10-class detection +class_names = [ + 'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', + 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone' +] +data_prefix = dict(pts='samples/LIDAR_TOP', img='', sweeps='sweeps/LIDAR_TOP') +model = dict( + data_preprocessor=dict( + voxel_layer=dict(point_cloud_range=point_cloud_range)), + pts_voxel_encoder=dict(point_cloud_range=point_cloud_range), + pts_bbox_head=dict(bbox_coder=dict(pc_range=point_cloud_range[:2])), + # model training and testing settings + train_cfg=dict(pts=dict(point_cloud_range=point_cloud_range)), + test_cfg=dict(pts=dict(pc_range=point_cloud_range[:2]))) + +dataset_type = 'NuScenesDataset' +data_root = 'data/nuscenes/' +backend_args = None + +db_sampler = dict( + data_root=data_root, + info_path=data_root + 'nuscenes_dbinfos_train.pkl', + rate=1.0, + prepare=dict( + filter_by_difficulty=[-1], + filter_by_min_points=dict( + car=5, + truck=5, + bus=5, + trailer=5, + construction_vehicle=5, + traffic_cone=5, + barrier=5, + motorcycle=5, + bicycle=5, + pedestrian=5)), + classes=class_names, + sample_groups=dict( + car=2, + truck=3, + construction_vehicle=7, + bus=4, + trailer=6, + barrier=2, + motorcycle=6, + bicycle=6, + pedestrian=2, + traffic_cone=2), + points_loader=dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=[0, 1, 2, 3, 4], + backend_args=backend_args), + backend_args=backend_args) + +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + use_dim=[0, 1, 2, 3, 4], + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + dict(type='ObjectSample', db_sampler=db_sampler), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.3925, 0.3925], + scale_ratio_range=[0.95, 1.05], + translation_std=[0, 0, 0]), + dict( + type='RandomFlip3D', + sync_2d=False, + flip_ratio_bev_horizontal=0.5, + flip_ratio_bev_vertical=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectNameFilter', classes=class_names), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=['points', 'gt_bboxes_3d', 'gt_labels_3d']) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + use_dim=[0, 1, 2, 3, 4], + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1333, 800), + pts_scale_ratio=1, + flip=False, + transforms=[ + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D') + ]), + dict(type='Pack3DDetInputs', keys=['points']) +] + +train_dataloader = dict( + batch_size=4, + dataset=dict( + ann_file='nuscenes_infos_train.pkl', + backend_args=None, + box_type_3d='LiDAR', + data_prefix=dict( + img='', pts='samples/LIDAR_TOP', sweeps='sweeps/LIDAR_TOP'), + data_root='data/nuscenes/', + metainfo=dict(classes=[ + 'car', + 'truck', + 'construction_vehicle', + 'bus', + 'trailer', + 'barrier', + 'motorcycle', + 'bicycle', + 'pedestrian', + 'traffic_cone', + ]), + 
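+        # LiDAR-only input for this variant: camera images are not loaded (use_camera=False)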
modality=dict(use_camera=False, use_lidar=True), + pipeline=[ + dict( + backend_args=None, + coord_type='LIDAR', + load_dim=5, + type='LoadPointsFromFile', + use_dim=5), + dict( + backend_args=None, + pad_empty_sweeps=True, + remove_close=True, + sweeps_num=9, + type='LoadPointsFromMultiSweeps', + use_dim=[ + 0, + 1, + 2, + 3, + 4, + ]), + dict( + type='LoadAnnotations3D', + with_bbox_3d=True, + with_label_3d=True), + dict( + rot_range=[ + -0.3925, + 0.3925, + ], + scale_ratio_range=[ + 0.95, + 1.05, + ], + translation_std=[ + 0, + 0, + 0, + ], + type='GlobalRotScaleTrans'), + dict( + flip_ratio_bev_horizontal=0.5, + flip_ratio_bev_vertical=0.5, + sync_2d=False, + type='RandomFlip3D'), + dict( + point_cloud_range=[ + -51.2, + -51.2, + -5.0, + 51.2, + 51.2, + 3.0, + ], + type='PointsRangeFilter'), + dict( + point_cloud_range=[ + -51.2, + -51.2, + -5.0, + 51.2, + 51.2, + 3.0, + ], + type='ObjectRangeFilter'), + dict( + classes=[ + 'car', + 'truck', + 'construction_vehicle', + 'bus', + 'trailer', + 'barrier', + 'motorcycle', + 'bicycle', + 'pedestrian', + 'traffic_cone', + ], + type='ObjectNameFilter'), + dict(type='PointShuffle'), + dict( + keys=[ + 'points', + 'gt_bboxes_3d', + 'gt_labels_3d', + ], + type='Pack3DDetInputs'), + ], + test_mode=False, + type='NuScenesDataset', + use_valid_flag=True), + num_workers=4, + persistent_workers=True, + sampler=dict(shuffle=True, type='DefaultSampler')) +test_dataloader = dict( + dataset=dict(pipeline=test_pipeline, metainfo=dict(version='v1.0-mini', classes=class_names))) +val_dataloader = dict( + dataset=dict(pipeline=test_pipeline, metainfo=dict(version='v1.0-mini', classes=class_names))) + +train_cfg = dict(by_epoch=True, max_epochs=20, val_interval=20) diff --git a/configs/centerpoint/centerpoint_voxel01_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py b/configs/centerpoint/centerpoint_voxel01_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..6c423cbbcd --- /dev/null +++ b/configs/centerpoint/centerpoint_voxel01_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,160 @@ +_base_ = [ + '../_base_/datasets/nus-3d.py', + '../_base_/models/centerpoint_voxel01_squeeze_squeezefpn_nus.py', + '../_base_/schedules/cyclic-20e.py', '../_base_/default_runtime.py' +] + +# If point cloud range is changed, the models should also change their point +# cloud range accordingly +point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0] +# Using calibration info convert the Lidar-coordinate point cloud range to the +# ego-coordinate point cloud range could bring a little promotion in nuScenes. 
+# point_cloud_range = [-51.2, -52, -5.0, 51.2, 50.4, 3.0] +# For nuScenes we usually do 10-class detection +class_names = [ + 'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', + 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone' +] +data_prefix = dict(pts='samples/LIDAR_TOP', img='', sweeps='sweeps/LIDAR_TOP') +model = dict( + data_preprocessor=dict( + voxel_layer=dict(point_cloud_range=point_cloud_range)), + pts_bbox_head=dict(bbox_coder=dict(pc_range=point_cloud_range[:2])), + # model training and testing settings + train_cfg=dict(pts=dict(point_cloud_range=point_cloud_range)), + test_cfg=dict(pts=dict(pc_range=point_cloud_range[:2]))) + +dataset_type = 'NuScenesDataset' +data_root = 'data/nuscenes/' +backend_args = None + +db_sampler = dict( + data_root=data_root, + info_path=data_root + 'nuscenes_dbinfos_train.pkl', + rate=1.0, + prepare=dict( + filter_by_difficulty=[-1], + filter_by_min_points=dict( + car=5, + truck=5, + bus=5, + trailer=5, + construction_vehicle=5, + traffic_cone=5, + barrier=5, + motorcycle=5, + bicycle=5, + pedestrian=5)), + classes=class_names, + sample_groups=dict( + car=2, + truck=3, + construction_vehicle=7, + bus=4, + trailer=6, + barrier=2, + motorcycle=6, + bicycle=6, + pedestrian=2, + traffic_cone=2), + points_loader=dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=[0, 1, 2, 3, 4], + backend_args=backend_args), + backend_args=backend_args) + +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + use_dim=[0, 1, 2, 3, 4], + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + dict(type='ObjectSample', db_sampler=db_sampler), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.3925, 0.3925], + scale_ratio_range=[0.95, 1.05], + translation_std=[0, 0, 0]), + dict( + type='RandomFlip3D', + sync_2d=False, + flip_ratio_bev_horizontal=0.5, + flip_ratio_bev_vertical=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectNameFilter', classes=class_names), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=['points', 'gt_bboxes_3d', 'gt_labels_3d']) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + use_dim=[0, 1, 2, 3, 4], + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1333, 800), + pts_scale_ratio=1, + flip=False, + transforms=[ + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range) + ]), + dict(type='Pack3DDetInputs', keys=['points']) +] + +train_dataloader = dict( + _delete_=True, + batch_size=4, + num_workers=4, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='CBGSDataset', + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='nuscenes_infos_train.pkl', + pipeline=train_pipeline, + metainfo=dict(classes=class_names), + test_mode=False, + data_prefix=data_prefix, + use_valid_flag=True, + 
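+            # use_valid_flag keeps only ground-truth boxes containing at least one LiDAR/radar point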
# we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. + box_type_3d='LiDAR', + backend_args=backend_args))) +test_dataloader = dict( + dataset=dict(pipeline=test_pipeline, metainfo=dict(classes=class_names))) +val_dataloader = dict( + dataset=dict(pipeline=test_pipeline, metainfo=dict(classes=class_names))) + +train_cfg = dict(val_interval=20) diff --git a/configs/mvxnet/mvxnet_efficiency_es_fpn_fire_rpfnet_kitti-3d-3class.py b/configs/mvxnet/mvxnet_efficiency_es_fpn_fire_rpfnet_kitti-3d-3class.py new file mode 100644 index 0000000000..03774f5541 --- /dev/null +++ b/configs/mvxnet/mvxnet_efficiency_es_fpn_fire_rpfnet_kitti-3d-3class.py @@ -0,0 +1,175 @@ +# MVX-Net | EfficientNet-ES + FPN (camera) | Fire-RPFNet (LiDAR) +# KITTI 3-class full stand-alone config. + +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# ----------------------------------------------------------------------------- +# Geometry +# ----------------------------------------------------------------------------- +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +# ----------------------------------------------------------------------------- +# Model definition +# ----------------------------------------------------------------------------- +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict(max_num_points=-1, point_cloud_range=point_cloud_range, + voxel_size=voxel_size, max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], std=[1., 1., 1.], + bgr_to_rgb=False, pad_size_divisor=32), + + # ---------------- Camera branch ---------------- + img_backbone=dict( + type='mmdet.EfficientNet', # torchvision impl + arch='es', # efficientnet-es (small, fast) + out_indices=(0, 3, 5, 6), # C1,C3,C5,C6 like MVX example + frozen_stages=1, + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True), + img_neck=dict( + type='mmdet.FPN', + in_channels=[32, 48, 192, 1280], # channels for eff-es layers + out_channels=512, + num_outs=4, + norm_cfg=dict(type='BN', requires_grad=False)), + + # ---------------- LiDAR voxel encoder -------------- + pts_voxel_encoder=dict( + type='DynamicVFE', in_channels=4, feat_channels=[64, 64], + with_distance=False, voxel_size=voxel_size, + with_cluster_center=True, with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', img_channels=512, pts_channels=64, + mid_channels=128, out_channels=128, img_levels=[0,1,2,3], + align_corners=False, activate_out=True, fuse_out=False)), + + # ---------------- Sparse middle encoder ------------ + pts_middle_encoder=dict( + type='SparseEncoder', in_channels=128, + sparse_shape=[41, 1600, 1408], order=('conv', 'norm', 'act')), + + # ---------------- Fire-RPFNet backbone ------------- + pts_backbone=dict( + #type='RPFNet', + type='FireRPFNet', + in_channels=256, + layer_channels=[128, 256, 256, 256], with_cbam=True), + pts_neck=None, + + # ---------------- Anchor head ---------------------- + pts_bbox_head=dict( + type='Anchor3DHead', num_classes=3, + in_channels=256, feat_channels=256, use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[[0,-40,-0.6,70.4,40,-0.6], + [0,-40,-0.6,70.4,40,-0.6], + [0,-40,-1.78,70.4,40,-1.78]], + sizes=[[0.8,0.6,1.73],[1.76,0.6,1.73],[3.9,1.6,1.56]], + rotations=[0, 1.57], reshape_out=False), + 
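+        # one anchor size per class (Pedestrian, Cyclist, Car), each placed at yaw 0 and ~pi/2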
assigner_per_size=True, diff_rad_by_sin=True, assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict(type='mmdet.FocalLoss', use_sigmoid=True, gamma=2.0, + alpha=0.25, loss_weight=1.0), + loss_bbox=dict(type='mmdet.SmoothL1Loss', beta=1.0/9.0, loss_weight=2.0), + loss_dir=dict(type='mmdet.CrossEntropyLoss', use_sigmoid=False, loss_weight=0.2)), + + train_cfg=dict( + pts=dict(assigner=[ + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, neg_iou_thr=0.45, min_pos_iou=0.45, ignore_iof_thr=-1)], + allowed_border=0, pos_weight=-1, debug=False)), + test_cfg=dict( + pts=dict(use_rotate_nms=True, nms_across_levels=False, nms_thr=0.01, + score_thr=0.1, min_bbox_size=0, nms_pre=100, max_num=50)) +) + +# ----------------------------------------------------------------------------- +# Dataset & pipelines +# ----------------------------------------------------------------------------- + +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +train_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, + with_bbox=True, with_label=True), + dict(type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[-0.78539816,0.78539816], + scale_ratio_range=[0.95,1.05], translation_std=[0.2,0.2,0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict(type='Pack3DDetInputs', keys=['points','img','gt_bboxes_3d','gt_labels_3d', + 'gt_bboxes','gt_labels']) +] + +test_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='MultiScaleFlipAug3D', img_scale=(1280,384), pts_scale_ratio=1, + flip=False, transforms=[ + dict(type='Resize', scale=0, keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[0,0], scale_ratio_range=[1.,1.], + translation_std=[0,0,0]), + dict(type='RandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points','img']) +] + +modality = dict(use_lidar=True, use_camera=True) + +train_dataloader = dict( + batch_size=2, num_workers=4, sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict(type='RepeatDataset', times=2, dataset=dict( + type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, filter_empty_gt=False, metainfo=metainfo, + box_type_3d='LiDAR', backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, num_workers=1, 
sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict(type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, metainfo=metainfo, test_mode=True, + box_type_3d='LiDAR', backend_args=backend_args)) + +test_dataloader = val_dataloader + +# ----------------------------------------------------------------------------- +# Optimizer / runtime +# ----------------------------------------------------------------------------- +optim_wrapper = dict(optimizer=dict(lr=0.001, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +val_evaluator = dict(type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict(type='Det3DLocalVisualizer', vis_backends=vis_backends, + name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=5) diff --git a/configs/mvxnet/mvxnet_efficiency_es_fpn_second_fpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_efficiency_es_fpn_second_fpn_kitti-3d-3class.py new file mode 100644 index 0000000000..358b67f6a7 --- /dev/null +++ b/configs/mvxnet/mvxnet_efficiency_es_fpn_second_fpn_kitti-3d-3class.py @@ -0,0 +1,275 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.EfficientNet', # Use EfficientNet + arch='es', # Choose the EfficientNet variant (b0, b1, b2, etc.) 
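+        # NOTE: img_neck.in_channels below must match the feature channels this backbone emits at the selected out_indices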
+ out_indices=(0, 3, 5, 6), # You can change this depending on which layers you need + frozen_stages=1, # Freeze the first stage (if needed) + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + ), # Important: Use 'pytorch' style + img_neck=dict( + type='mmdet.FPN', + in_channels=[32, 48, 192, 1280], # Correct in_channels for EfficientNet es + out_channels=512, + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], # Adjust if the number of FPN outputs changes + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, # Might need adjustment + feat_channels=512, # Might need adjustment + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + train_cfg=dict( + pts=dict( + assigner=[ + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], 
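+        # image scale is sampled between (320, 96) and (1280, 384), i.e. 0.25x-1.0x of the full training resolution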
keep_ratio=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_efficiency_es_fpn_squeeze_fpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_efficiency_es_fpn_squeeze_fpn_kitti-3d-3class.py new file mode 100644 index 0000000000..bb001db24f --- /dev/null +++ b/configs/mvxnet/mvxnet_efficiency_es_fpn_squeeze_fpn_kitti-3d-3class.py @@ -0,0 +1,279 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.EfficientNet', # Use EfficientNet + arch='es', # Choose the EfficientNet variant (b0, b1, b2, etc.) 
+ out_indices=(0, 3, 5, 6), # You can change this depending on which layers you need + frozen_stages=1, # Freeze the first stage (if needed) + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + ), # Important: Use 'pytorch' style + img_neck=dict( + type='mmdet.FPN', + in_channels=[32, 48, 192, 1280], # Correct in_channels for EfficientNet es + out_channels=512, + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], # Adjust if the number of FPN outputs changes + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SQUEEZE', + in_channels=256, + out_channels=[64, 128, 256 , 512], + #layer_nums=[3, 5, 5], + #layer_strides=[2, 2, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + pts_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], + out_channels=[512, 512, 512, 512], + #upsample_strides=[0.5, 1, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False)), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, # Might need adjustment + feat_channels=512, # Might need adjustment + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + train_cfg=dict( + pts=dict( + assigner=[ + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + 
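+        # KITTI .bin point clouds store (x, y, z, reflectance), hence load_dim=4 / use_dim=4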
load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_efficiency_fpn_second_fpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_efficiency_fpn_second_fpn_kitti-3d-3class.py new file mode 100644 index 0000000000..cdfdd134b5 --- /dev/null +++ b/configs/mvxnet/mvxnet_efficiency_fpn_second_fpn_kitti-3d-3class.py @@ -0,0 +1,275 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.EfficientNet', # Use EfficientNet + arch='b2', # Choose the EfficientNet variant (b0, b1, b2, etc.) 
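+        # NOTE: keep img_neck.in_channels below in sync with the feature channels of the chosen EfficientNet variant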
+ out_indices=(0, 3, 5, 6), # You can change this depending on which layers you need + frozen_stages=1, # Freeze the first stage (if needed) + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + ), # Important: Use 'pytorch' style + img_neck=dict( + type='mmdet.FPN', + in_channels=[32, 48, 352, 1408], # Correct in_channels for EfficientNet b0 + out_channels=512, + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], # Adjust if the number of FPN outputs changes + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, # Might need adjustment + feat_channels=512, # Might need adjustment + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + train_cfg=dict( + pts=dict( + assigner=[ + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], 
keep_ratio=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_efficiency_fpn_squeeze_fpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_efficiency_fpn_squeeze_fpn_kitti-3d-3class.py new file mode 100644 index 0000000000..cfeaf4d2b4 --- /dev/null +++ b/configs/mvxnet/mvxnet_efficiency_fpn_squeeze_fpn_kitti-3d-3class.py @@ -0,0 +1,279 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.EfficientNet', # Use EfficientNet + arch='b2', # Choose the EfficientNet variant (b0, b1, b2, etc.) 
+ out_indices=(0, 3, 5, 6), # You can change this depending on which layers you need + frozen_stages=1, # Freeze the first stage (if needed) + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + ), # Important: Use 'pytorch' style + img_neck=dict( + type='mmdet.FPN', + in_channels=[32, 48, 352, 1408], # Correct in_channels for EfficientNet b0 + out_channels=512, + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], # Adjust if the number of FPN outputs changes + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SQUEEZE', + in_channels=256, + out_channels=[64, 128, 256 , 512], + #layer_nums=[3, 5, 5], + #layer_strides=[2, 2, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + pts_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], + out_channels=[512, 512, 512, 512], + #upsample_strides=[0.5, 1, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False)), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, # Might need adjustment + feat_channels=512, # Might need adjustment + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + train_cfg=dict( + pts=dict( + assigner=[ + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + 
load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py b/configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py new file mode 100644 index 0000000000..59fbf66703 --- /dev/null +++ b/configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py @@ -0,0 +1,241 @@ +# MVX-Net with FireRPFNet2D (image) + FireRPFNetV2 (LiDAR) +# Full Fire+CBAM pipeline for both modalities +# KITTI 3-class (Car, Pedestrian, Cyclist) + +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# ----------------------------------------------------------------------------- +# Geometry +# ----------------------------------------------------------------------------- +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +# ----------------------------------------------------------------------------- +# Model +# ----------------------------------------------------------------------------- +model = dict( + type='DynamicMVXFasterRCNN', + # -------------------------------------------------- + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + + # ----------------------- FireRPFNet2D image branch ----------------------- + img_backbone=dict( + type='FireRPFNet2D', + in_channels=3, + out_channels=[64, 128, 256, 512], # Multi-scale outputs + blocks_per_stage=[2, 2, 2, 2], # 2 Fire blocks per stage + with_cbam=True, # Enable CBAM attention + stem_channels=64, + out_indices=(0, 1, 2, 3), # Output all 4 scales + frozen_stages=-1, # No frozen stages + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + norm_eval=False), + + img_neck=dict( + type='mmdet.FPN', + 
in_channels=[64, 128, 256, 512], # From FireRPFNet2D stages + out_channels=256, # Unified output channels + num_outs=4, + norm_cfg=dict(type='BN', requires_grad=False)), + + # ----------------------- LiDAR voxel encoder ---------------- + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, # From FPN + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3], + align_corners=False, + activate_out=True, + fuse_out=False)), + + # ----------------------- Sparse middle encoder -------------- + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + + # ----------------------- FireRPFNetV2 backbone ------------- + pts_backbone=dict( + type='FireRPFNetV2', + in_channels=256, # output of SparseEncoder + out_channels=[128, 256, 256, 256], + with_cbam=True, + multi_scale_output=False), # Single-scale output + + pts_neck=None, # No additional neck needed + + # ----------------------- Anchor head ------------------------ + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, + feat_channels=256, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, + loss_weight=1.0), + loss_bbox=dict(type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, + loss_weight=2.0), + loss_dir=dict(type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + + # ----------------------- Train / Test cfg ------------------- + train_cfg=dict( + pts=dict( + assigner=[ + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, neg_iou_thr=0.45, min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, pos_weight=-1, debug=False)), + + test_cfg=dict( + pts=dict(use_rotate_nms=True, nms_across_levels=False, nms_thr=0.01, + score_thr=0.1, min_bbox_size=0, nms_pre=100, max_num=50)) +) + +# ----------------------------------------------------------------------------- +# Dataset & pipelines +# ----------------------------------------------------------------------------- + +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +train_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, 
with_label_3d=True, + with_bbox=True, with_label=True), + dict(type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict(type='Pack3DDetInputs', keys=['points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', + 'gt_bboxes', 'gt_labels']) +] + +test_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='MultiScaleFlipAug3D', img_scale=(1280, 384), pts_scale_ratio=1, + flip=False, + transforms=[ + dict(type='Resize', scale=0, keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[0, 0], scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] + +modality = dict(use_lidar=True, use_camera=True) + +train_dataloader = dict( + batch_size=4, num_workers=2, sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict(type='RepeatDataset', times=1, dataset=dict( + type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, filter_empty_gt=False, metainfo=metainfo, + box_type_3d='LiDAR', backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, num_workers=1, sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict(type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, metainfo=metainfo, test_mode=True, + box_type_3d='LiDAR', backend_args=backend_args)) + +test_dataloader = val_dataloader + +# ----------------------------------------------------------------------------- +# Optimizer / Schedulers / Runtime +# ----------------------------------------------------------------------------- +optim_wrapper = dict(optimizer=dict(lr=0.001, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +val_evaluator = dict(type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +# Add EarlyStoppingHook +custom_hooks = [ + dict( + type='EarlyStoppingHook', + monitor='Kitti metric/pred_instances_3d/KITTI/Car_3D_AP40_moderate_strict', + patience=5, # Number of epochs to wait before stopping + rule='greater', # Stop when the metric stops increasing + min_delta=0.001, # Minimum change to qualify as improvement + ) +] +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=16, val_interval=2) + + +# Add checkpoint configuration +default_hooks = dict( + checkpoint=dict( + type='CheckpointHook', + interval=2, + save_best='Kitti metric/pred_instances_3d/KITTI/Car_3D_AP40_moderate_strict', + rule='greater', + max_keep_ckpts=15 # Keep only the best 5 checkpoints + ) +) \ No newline at end of file diff --git 
a/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_320x92_kitti-3d-3class.py b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_320x92_kitti-3d-3class.py new file mode 100644 index 0000000000..a006107aa7 --- /dev/null +++ b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_320x92_kitti-3d-3class.py @@ -0,0 +1,277 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.ResNet', + depth=50, + num_stages=4, + out_indices=(0, 1, 2, 3), + frozen_stages=1, + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + style='caffe'), + img_neck=dict( + type='mmdet.FPN', + in_channels=[256, 512, 1024, 2048], + out_channels=256, + # make the image features more stable numerically to avoid loss nan + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + 
min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class.py b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class.py index f6c750d9f7..5ea62980bc 100644 --- a/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class.py +++ b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class.py @@ -269,5 +269,7 @@ visualizer = dict( type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + # You may need to download the model first is the network is unstable -load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_nus-3d-3class.py b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_nus-3d-3class.py new file mode 100644 index 0000000000..944eb56d5d --- /dev/null +++ b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_nus-3d-3class.py @@ -0,0 +1,273 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.ResNet', + depth=50, + 
num_stages=4, + out_indices=(0, 1, 2, 3), + frozen_stages=1, + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + style='caffe'), + img_neck=dict( + type='mmdet.FPN', + in_channels=[256, 512, 1024, 2048], + out_channels=256, + # make the image features more stable numerically to avoid loss nan + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.3, + min_pos_iou=0.3, + ignore_iof_thr=-1), + allowed_border=0, + code_weight=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2], + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +#dataset_type = 'KittiDataset' +#data_root = 'data/kitti/' +#class_names = ['Pedestrian', 'Cyclist', 'Car'] +dataset_type = 'NuScenesDataset' +data_root = 'data/nuscenes/' +class_names = [ + 'car', 'truck', 'trailer', 'bus', 'construction_vehicle', 'bicycle', + 'motorcycle', 'pedestrian', 'traffic_cone', 'barrier' +] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +#data_prefix = dict(pts='samples/LIDAR_TOP', img='', sweeps='sweeps/LIDAR_TOP') +data_prefix = dict( + pts='samples/LIDAR_TOP', + CAM_FRONT='samples/CAM_FRONT', + sweeps='sweeps/LIDAR_TOP') +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + #dict(type='LoadImageFromFileMono3D', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, 
with_label_3d=True), + dict( + type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + #dict(type='LoadImageFromFileMono3D', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + #dict(type='Pack3DDetInputs', keys=['points', 'img']) + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='nuscenes_infos_train.pkl', + pipeline=train_pipeline, + metainfo=metainfo, + modality=input_modality, + #modality=modality, + #data_prefix=dict( + # pts='training/velodyne_reduced', img='training/image_2'), + test_mode=False, + data_prefix=data_prefix, + default_cam_key='CAM_FRONT', + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args)) + + +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='nuscenes_infos_val.pkl', + pipeline=test_pipeline, + metainfo=metainfo, + modality=input_modality, + #modality=modality, + #data_prefix=dict( + # pts='training/velodyne_reduced', img='training/image_2'), + data_prefix=data_prefix, + default_cam_key='CAM_FRONT', + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +val_dataloader = test_dataloader + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +#val_evaluator = dict( +# type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +val_evaluator = dict( + type='NuScenesMetric', + data_root=data_root, + ann_file=data_root + 'nuscenes_infos_val.pkl', + metric='bbox', + backend_args=backend_args) +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_320x92_kitti-3d-3class.py b/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_320x92_kitti-3d-3class.py new file mode 100644 index 0000000000..70176a2749 --- /dev/null +++ b/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_320x92_kitti-3d-3class.py @@ -0,0 +1,281 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.ResNet', + depth=50, + num_stages=4, + out_indices=(0, 1, 2, 3), + frozen_stages=1, + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + style='caffe'), + img_neck=dict( + type='mmdet.FPN', + in_channels=[256, 512, 1024, 2048], + out_channels=256, + # make the image features more stable numerically to avoid loss nan + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SQUEEZE', + in_channels=256, + out_channels=[64, 128, 256 , 512], + #layer_nums=[3, 5, 5], + #layer_strides=[2, 2, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', 
bias=False)), + pts_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], + out_channels=[512, 512, 512, 512], + #upsample_strides=[0.5, 1, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False)), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), 
+ dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. + box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_8xb2-80e_kitti-3d-3class.py b/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_8xb2-80e_kitti-3d-3class.py new file mode 100644 index 0000000000..bed54ef717 --- /dev/null +++ b/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_8xb2-80e_kitti-3d-3class.py @@ -0,0 +1,279 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.ResNet', + depth=50, + num_stages=4, + out_indices=(0, 1, 2, 3), + frozen_stages=1, + norm_cfg=dict(type='BN', 
requires_grad=False), + norm_eval=True, + style='caffe'), + img_neck=dict( + type='mmdet.FPN', + in_channels=[256, 512, 1024, 2048], + out_channels=256, + # make the image features more stable numerically to avoid loss nan + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SQUEEZE', + in_channels=256, + out_channels=[64, 128, 256 , 512], + #layer_nums=[3, 5, 5], + #layer_strides=[2, 2, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + pts_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], + out_channels=[512, 512, 512, 512], + #upsample_strides=[0.5, 1, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False)), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, 
with_label_3d=True), + dict( + type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_mobilenetv2_fpn_fire_rpfnet_kitti-3d-3class.py b/configs/mvxnet/mvxnet_mobilenetv2_fpn_fire_rpfnet_kitti-3d-3class.py new file mode 100644 index 0000000000..a9a1e13151 --- /dev/null +++ b/configs/mvxnet/mvxnet_mobilenetv2_fpn_fire_rpfnet_kitti-3d-3class.py @@ -0,0 +1,181 @@ +# Full MVX-Net config: MobileNetV2+FPN (camera) + RPFNet (LiDAR). 
+# KITTI 3-class (Car, Pedestrian, Cyclist) + +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# ----------------------------------------------------------------------------- +# Geometry +# ----------------------------------------------------------------------------- +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +# ----------------------------------------------------------------------------- +# Model +# ----------------------------------------------------------------------------- +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict(max_num_points=-1, point_cloud_range=point_cloud_range, + voxel_size=voxel_size, max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, pad_size_divisor=32), + + # ----------------------- image branch ----------------------- + img_backbone=dict( + type='mmdet.MobileNetV2', + out_indices=(0, 1, 2, 3), + frozen_stages=1, + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True), + img_neck=dict( + type='mmdet.FPN', + in_channels=[16, 24, 32, 64], + out_channels=256, + num_outs=4, + norm_cfg=dict(type='BN', requires_grad=False)), + + # ----------------------- LiDAR voxel encoder ---------------- + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', img_channels=256, pts_channels=64, + mid_channels=128, out_channels=128, img_levels=[0,1,2,3], + align_corners=False, activate_out=True, fuse_out=False)), + + # ----------------------- Sparse middle encoder -------------- + pts_middle_encoder=dict( + type='SparseEncoder', in_channels=128, + sparse_shape=[41, 1600, 1408], order=('conv', 'norm', 'act')), + + # ----------------------- FireRPFNet backbone -------------------- + pts_backbone=dict( + # type='RPFNet', + type='FireRPFNet', + in_channels=256, + layer_channels=[128, 256, 256, 256], with_cbam=True), + pts_neck=None, + + # ----------------------- Anchor head ------------------------ + pts_bbox_head=dict( + type='Anchor3DHead', num_classes=3, + in_channels=256, feat_channels=256, use_direction_classifier=True, + anchor_generator=dict(type='Anchor3DRangeGenerator', + ranges=[[0,-40,-0.6,70.4,40,-0.6],[0,-40,-0.6,70.4,40,-0.6], + [0,-40,-1.78,70.4,40,-1.78]], + sizes=[[0.8,0.6,1.73],[1.76,0.6,1.73],[3.9,1.6,1.56]], + rotations=[0,1.57], reshape_out=False), + assigner_per_size=True, diff_rad_by_sin=True, assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict(type='mmdet.FocalLoss', use_sigmoid=True, gamma=2.0, + alpha=0.25, loss_weight=1.0), + loss_bbox=dict(type='mmdet.SmoothL1Loss', beta=1.0/9.0, + loss_weight=2.0), + loss_dir=dict(type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + + train_cfg=dict( + pts=dict(assigner=[ + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, neg_iou_thr=0.45, min_pos_iou=0.45, + 
ignore_iof_thr=-1)], allowed_border=0, pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict(use_rotate_nms=True, nms_across_levels=False, nms_thr=0.01, + score_thr=0.1, min_bbox_size=0, nms_pre=100, max_num=50)) +) + +# ----------------------------------------------------------------------------- +# Dataset & pipelines (unchanged) +# ----------------------------------------------------------------------------- + +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +train_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, + with_bbox=True, with_label=True), + dict(type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[-0.78539816,0.78539816], + scale_ratio_range=[0.95,1.05], translation_std=[0.2,0.2,0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict(type='Pack3DDetInputs', keys=['points','img','gt_bboxes_3d','gt_labels_3d', + 'gt_bboxes','gt_labels']) +] + +test_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='MultiScaleFlipAug3D', img_scale=(1280,384), pts_scale_ratio=1, + flip=False, transforms=[ + dict(type='Resize', scale=0, keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[0,0], scale_ratio_range=[1.,1.], + translation_std=[0,0,0]), + dict(type='RandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points','img']) +] + +modality = dict(use_lidar=True, use_camera=True) + +train_dataloader = dict( + batch_size=2, num_workers=4, sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict(type='RepeatDataset', times=2, dataset=dict( + type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, filter_empty_gt=False, metainfo=metainfo, + box_type_3d='LiDAR', backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, num_workers=1, sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict(type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, metainfo=metainfo, test_mode=True, + box_type_3d='LiDAR', backend_args=backend_args)) + +test_dataloader = val_dataloader + +# ----------------------------------------------------------------------------- +# Optimizer / runtime +# ----------------------------------------------------------------------------- +optim_wrapper = dict(optimizer=dict(lr=0.001, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +val_evaluator = dict(type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = 
dict(type='Det3DLocalVisualizer', vis_backends=vis_backends, + name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=5) diff --git a/configs/mvxnet/mvxnet_mobilenetv2_fpn_second_fpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_mobilenetv2_fpn_second_fpn_kitti-3d-3class.py new file mode 100644 index 0000000000..fa5021492c --- /dev/null +++ b/configs/mvxnet/mvxnet_mobilenetv2_fpn_second_fpn_kitti-3d-3class.py @@ -0,0 +1,275 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.MobileNetV2', # Use MobileNetV2 backbone + out_indices=(0, 1, 2, 3), # Extract features from these layers + frozen_stages=1, # Freeze the first stage (if needed) + norm_cfg=dict(type='BN', requires_grad=False), # Use BatchNorm + norm_eval=True, + ), + img_neck=dict( + type='mmdet.FPN', # Use Feature Pyramid Network (FPN) for neck + in_channels=[16, 24, 32, 64], # Adjust the input channels according to MobileNetV2 (could vary with the model) + out_channels=256, # Number of output channels from the FPN + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5, # Output feature maps from 5 levels + ), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + 
neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=4, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=5) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py b/configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py new file mode 100644 index 0000000000..d3920cbf8c --- /dev/null +++ b/configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py @@ -0,0 +1,225 @@ +# Stand-alone MVX-Net (SqueezeFPN camera branch) + PillarNet-LTS (RPFNet) +# for KITTI 3-class. No dependency on other MVX configs โ€“ only schedule & +# default_runtime are inherited. 
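+#
+# For reference (illustrative, taken from the sibling mvxnet_sqeezefpn_secfpn
+# configs in this patch): the baseline LiDAR branch uses the stock SECOND
+# backbone plus a SECONDFPN neck, i.e.
+#   pts_backbone=dict(type='SECOND', in_channels=256, layer_nums=[5, 5],
+#                     layer_strides=[1, 2], out_channels=[128, 256]),
+#   pts_neck=dict(type='SECONDFPN', in_channels=[128, 256],
+#                 upsample_strides=[1, 2], out_channels=[256, 256]),
+# This config swaps those two keys for a FireRPFNetV2 backbone with
+# pts_neck=None; everything else stays the same.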
+ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# ----------------------------------------------------------------------------- +# Geometry +# ----------------------------------------------------------------------------- +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +# ----------------------------------------------------------------------------- +# Model +# ----------------------------------------------------------------------------- +model = dict( + type='DynamicMVXFasterRCNN', + # -------------------------------------------------- + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + + # ----------------------- image branch ----------------------- + img_backbone=dict( + type='SQUEEZE', + in_channels=3, + out_channels=[64, 128, 256, 512], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + img_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256, 512], + out_channels=[512, 512, 512, 512], + norm_cfg=dict(type='BN', requires_grad=False)), + + # ----------------------- LiDAR voxel encoder ---------------- + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3], + align_corners=False, + activate_out=True, + fuse_out=False)), + + # ----------------------- Sparse middle encoder -------------- + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + + # ----------------------- FireRPFNet backbone ------------- + pts_backbone=dict( + type='FireRPFNetV2', + in_channels=256, # output of SparseEncoder + out_channels=[128, 256, 256, 256], + with_cbam=True), + + pts_neck=None, # RPFNet is already deep enough + + # ----------------------- Anchor head ------------------------ + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, + feat_channels=256, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, + loss_weight=1.0), + loss_bbox=dict(type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, + loss_weight=2.0), + loss_dir=dict(type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + + # ----------------------- Train / Test cfg ------------------- + train_cfg=dict( + pts=dict( + assigner=[ + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', 
iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, neg_iou_thr=0.45, min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, pos_weight=-1, debug=False)), + + test_cfg=dict( + pts=dict(use_rotate_nms=True, nms_across_levels=False, nms_thr=0.01, + score_thr=0.1, min_bbox_size=0, nms_pre=100, max_num=50)) +) + +# ----------------------------------------------------------------------------- +# Dataset & pipelines (identical to original MVX squeeze-FPN config) +# ----------------------------------------------------------------------------- + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +train_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, + with_bbox=True, with_label=True), + dict(type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict(type='Pack3DDetInputs', keys=['points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', + 'gt_bboxes', 'gt_labels']) +] + +test_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='MultiScaleFlipAug3D', img_scale=(1280, 384), pts_scale_ratio=1, + flip=False, + transforms=[ + dict(type='Resize', scale=0, keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[0, 0], scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] + +modality = dict(use_lidar=True, use_camera=True) + +train_dataloader = dict( + batch_size=2, num_workers=2, sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict(type='RepeatDataset', times=2, dataset=dict( + type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, filter_empty_gt=False, metainfo=metainfo, + box_type_3d='LiDAR', backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, num_workers=1, sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict(type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, metainfo=metainfo, test_mode=True, + box_type_3d='LiDAR', backend_args=backend_args)) + +test_dataloader = val_dataloader + +# ----------------------------------------------------------------------------- +# Optimizer / Schedulers / Runtime +# 
----------------------------------------------------------------------------- +optim_wrapper = dict(optimizer=dict(lr=0.001, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +val_evaluator = dict(type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + + +# optim_wrapper = dict( +# optimizer=dict(weight_decay=0.01), +# clip_grad=dict(max_norm=35, norm_type=2), +# ) +# val_evaluator = dict( +# type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=5) + +# Optional: if you reduced channels you can shrink head +# model['pts_bbox_head']['in_channels'] = 256 +# model['pts_bbox_head']['feat_channels'] = 256 diff --git a/configs/mvxnet/mvxnet_sqeezefpn_rpfnet_kitti-3d-3class.py b/configs/mvxnet/mvxnet_sqeezefpn_rpfnet_kitti-3d-3class.py new file mode 100644 index 0000000000..7784db0c1a --- /dev/null +++ b/configs/mvxnet/mvxnet_sqeezefpn_rpfnet_kitti-3d-3class.py @@ -0,0 +1,221 @@ +# Stand-alone MVX-Net (SqueezeFPN camera branch) + PillarNet-LTS (RPFNet) +# for KITTI 3-class. No dependency on other MVX configs โ€“ only schedule & +# default_runtime are inherited. + +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# ----------------------------------------------------------------------------- +# Geometry +# ----------------------------------------------------------------------------- +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +# ----------------------------------------------------------------------------- +# Model +# ----------------------------------------------------------------------------- +model = dict( + type='DynamicMVXFasterRCNN', + # -------------------------------------------------- + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + + # ----------------------- image branch ----------------------- + img_backbone=dict( + type='SQUEEZE', + in_channels=3, + out_channels=[64, 128, 256, 512], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + img_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256, 512], + out_channels=[512, 512, 512, 512], + norm_cfg=dict(type='BN', requires_grad=False)), + + # ----------------------- LiDAR voxel encoder ---------------- + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3], + align_corners=False, + activate_out=True, + fuse_out=False)), + + # ----------------------- Sparse middle encoder -------------- + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + + # ----------------------- RPFNet backbone ------------- + pts_backbone=dict( + type='RPFNet', + in_channels=256, # 
output of SparseEncoder + layer_channels=[128, 256, 256, 256], + with_cbam=True), + + pts_neck=None, # RPFNet is already deep enough + + # ----------------------- Anchor head ------------------------ + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, + feat_channels=256, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, + loss_weight=1.0), + loss_bbox=dict(type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, + loss_weight=2.0), + loss_dir=dict(type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + + # ----------------------- Train / Test cfg ------------------- + train_cfg=dict( + pts=dict( + assigner=[ + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, neg_iou_thr=0.45, min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, pos_weight=-1, debug=False)), + + test_cfg=dict( + pts=dict(use_rotate_nms=True, nms_across_levels=False, nms_thr=0.01, + score_thr=0.1, min_bbox_size=0, nms_pre=100, max_num=50)) +) + +# ----------------------------------------------------------------------------- +# Dataset & pipelines (identical to original MVX squeeze-FPN config) +# ----------------------------------------------------------------------------- + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +train_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, + with_bbox=True, with_label=True), + dict(type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict(type='Pack3DDetInputs', keys=['points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', + 'gt_bboxes', 'gt_labels']) +] + +test_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='MultiScaleFlipAug3D', img_scale=(1280, 384), pts_scale_ratio=1, + flip=False, + transforms=[ + dict(type='Resize', scale=0, keep_ratio=True), + dict(type='GlobalRotScaleTrans', 
rot_range=[0, 0], scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] + +modality = dict(use_lidar=True, use_camera=True) + +train_dataloader = dict( + batch_size=2, num_workers=4, sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict(type='RepeatDataset', times=2, dataset=dict( + type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, filter_empty_gt=False, metainfo=metainfo, + box_type_3d='LiDAR', backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, num_workers=1, sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict(type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, metainfo=metainfo, test_mode=True, + box_type_3d='LiDAR', backend_args=backend_args)) + +test_dataloader = val_dataloader + +# ----------------------------------------------------------------------------- +# Optimizer / Schedulers / Runtime +# ----------------------------------------------------------------------------- +optim_wrapper = dict(optimizer=dict(lr=0.001, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +val_evaluator = dict(type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + + +# optim_wrapper = dict( +# optimizer=dict(weight_decay=0.01), +# clip_grad=dict(max_norm=35, norm_type=2), +# ) +# val_evaluator = dict( +# type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=5) diff --git a/configs/mvxnet/mvxnet_sqeezefpn_secfpn_2x_scale_kitti-3d-3class.py b/configs/mvxnet/mvxnet_sqeezefpn_secfpn_2x_scale_kitti-3d-3class.py new file mode 100644 index 0000000000..24331ef868 --- /dev/null +++ b/configs/mvxnet/mvxnet_sqeezefpn_secfpn_2x_scale_kitti-3d-3class.py @@ -0,0 +1,275 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='SQUEEZE', + in_channels=3, + out_channels=[64, 128, 256 , 512], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False) + ), + img_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], # Correct in_channels for EfficientNet b0 + out_channels=[512, 512, 512, 512], + norm_cfg=dict(type='BN', requires_grad=False), + #num_outs=4 + ), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + 
point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + dict( + type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + #dict( + # type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + 
type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. + box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_sqeezefpn_secfpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_sqeezefpn_secfpn_kitti-3d-3class.py new file mode 100644 index 0000000000..95a4a68c60 --- /dev/null +++ b/configs/mvxnet/mvxnet_sqeezefpn_secfpn_kitti-3d-3class.py @@ -0,0 +1,275 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + 
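+ # Note (descriptive comment, not part of the original config): with
+ # voxel_type='dynamic', max_num_points=-1 together with max_voxels=(-1, -1)
+ # below is the conventional way to disable the per-voxel point cap and the
+ # voxel-count limit, so the DynamicVFE encoder sees all points.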
point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='SQUEEZE', + in_channels=3, + out_channels=[64, 128, 256 , 512], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False) + ), + img_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], # Correct in_channels for EfficientNet b0 + out_channels=[512, 512, 512, 512], + norm_cfg=dict(type='BN', requires_grad=False), + #num_outs=4 + ), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', 
backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/exp_list.sh b/exp_list.sh new file mode 100755 index 0000000000..9a089ee0cd --- /dev/null +++ b/exp_list.sh @@ -0,0 +1,33 @@ +#!/bin/bash +CONFIG_FILES=( +mvxnet_efficiency_es_fpn_second_fpn_kitti-3d-3class +mvxnet_efficiency_es_fpn_squeeze_fpn_kitti-3d-3class +mvxnet_efficiency_fpn_second_fpn_kitti-3d-3class +mvxnet_efficiency_fpn_squeeze_fpn_kitti-3d-3class +mvxnet_fpn_dv_second_secfpn_320x92_kitti-3d-3class +mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class +mvxnet_fpn_dv_second_squeezefpn_320x92_kitti-3d-3class +mvxnet_fpn_dv_second_squeezefpn_8xb2-80e_kitti-3d-3class +mvxnet_mobilenetv2_fpn_second_fpn_kitti-3d-3class +mvxnet_sqeezefpn_secfpn_kitti-3d-3class +mvxnet_sqeezefpn_secfpn_2x_scale_kitti-3d-3class +) + +TIMESTAMP=$(TZ='America/Los_Angeles' date +%m%d) +for item in "${CONFIG_FILES[@]}"; do + echo "############" + echo + workdir="work_dirs/"${item} + mkdir -p ${workdir} + config="configs/mvxnet/"${item}.py + saved_model=$workdir/epoch_20.pth + echo "Config:" $config + echo "Workdir:" $workdir + echo "Saved Model:" $saved_model + + echo "Train Command:" + echo "python tools/train.py $config 2>&1 | tee $workdir/train-${TIMESTAMP}.log" + echo "Test Command:" + echo "python tools/test.py $workdir/$item.py $saved_model 2>&1 | tee $workdir/test-${TIMESTAMP}.log" + +done diff --git a/mmdet3d/apis/inferencers/multi_modality_det3d_inferencer.py b/mmdet3d/apis/inferencers/multi_modality_det3d_inferencer.py index 6717bb18c8..ecc4295ef6 100644 --- a/mmdet3d/apis/inferencers/multi_modality_det3d_inferencer.py +++ b/mmdet3d/apis/inferencers/multi_modality_det3d_inferencer.py @@ -74,95 +74,177 @@ def _inputs_to_list(self, - dict: the value with key 'points' is - Directory path: return all files in the directory - other cases: return a list containing the string. The string - could be a path to file, a url or other types of string according - to the task. 
+ could be a path to file, a url or other types of string according + to the task. Args: inputs (Union[dict, list]): Inputs for the inferencer. + cam_type (str): Camera type. Defaults to 'CAM2'. Returns: list: List of input for the :meth:`preprocess`. """ + processed_inputs_list = [] + if isinstance(inputs, dict): - assert 'infos' in inputs - infos = inputs.pop('infos') - - if isinstance(inputs['img'], str): - img, pcd = inputs['img'], inputs['points'] - backend = get_file_backend(img) - if hasattr(backend, 'isdir') and isdir(img) and isdir(pcd): - # Backends like HttpsBackend do not implement `isdir`, so - # only those backends that implement `isdir` could accept - # the inputs as a directory + if 'infos' not in inputs: + raise ValueError("Input dictionary must contain an 'infos' key pointing to the .pkl file.") + infos_path = inputs.pop('infos') + + # Determine the actual list of input samples + # This handles cases where 'img' and 'pcd' might be directories + current_sample_dicts = [] + if isinstance(inputs.get('img'), str) and isinstance(inputs.get('points'), str): + img_path_input, pcd_path_input = inputs['img'], inputs['points'] + # Check if these are directories + backend = get_file_backend(img_path_input) + if hasattr(backend, 'isdir') and isdir(img_path_input) and isdir(pcd_path_input): img_filename_list = list_dir_or_file( - img, list_dir=False, suffix=['.png', '.jpg']) + img_path_input, list_dir=False, suffix=['.png', '.jpg', '.jpeg', '.PNG', '.JPG', '.JPEG']) # Added more suffixes pcd_filename_list = list_dir_or_file( - pcd, list_dir=False, suffix='.bin') - assert len(img_filename_list) == len(pcd_filename_list) - - inputs = [{ - 'img': join_path(img, img_filename), - 'points': join_path(pcd, pcd_filename) - } for pcd_filename, img_filename in zip( - pcd_filename_list, img_filename_list)] - - if not isinstance(inputs, (list, tuple)): - inputs = [inputs] - - # get cam2img, lidar2cam and lidar2img from infos - info_list = mmengine.load(infos)['data_list'] - assert len(info_list) == len(inputs) - for index, input in enumerate(inputs): - data_info = info_list[index] - img_path = data_info['images'][cam_type]['img_path'] - if isinstance(input['img'], str) and \ - osp.basename(img_path) != osp.basename(input['img']): + pcd_path_input, list_dir=False, suffix='.bin') + + if len(img_filename_list) != len(pcd_filename_list): + raise ValueError( + f"Mismatch in number of images ({len(img_filename_list)}) and " + f"point cloud files ({len(pcd_filename_list)}) " + f"in directories '{img_path_input}' and '{pcd_path_input}'.") + + for pcd_filename, img_filename in zip(pcd_filename_list, img_filename_list): + current_sample_dicts.append({ + 'img': join_path(img_path_input, img_filename), + 'points': join_path(pcd_path_input, pcd_filename) + }) + else: # Assume single file paths if not directories + current_sample_dicts = [inputs.copy()] # Use a copy of the original input dict + elif not isinstance(inputs, (list, tuple)): # If inputs['img'] wasn't a string, but inputs itself is a dict. + current_sample_dicts = [inputs.copy()] + else: # This case should ideally not be hit if input 'inputs' is a dict. + raise ValueError("Unexpected structure for 'inputs' dictionary.") + + + all_info_data = mmengine.load(infos_path)['data_list'] + + for single_input_sample_dict in current_sample_dicts: + if 'img' not in single_input_sample_dict or not isinstance(single_input_sample_dict['img'], str): + raise ValueError(f"Each input sample must have an 'img' key with a string path. 
Problematic sample: {single_input_sample_dict}") + + input_img_basename = osp.basename(single_input_sample_dict['img']) + found_data_info = None + + for data_info_candidate in all_info_data: + if 'images' not in data_info_candidate or \ + cam_type not in data_info_candidate['images'] or \ + 'img_path' not in data_info_candidate['images'][cam_type]: + # Silently skip malformed entries or log a warning + # warnings.warn(f"Skipping malformed info entry: {data_info_candidate.get('sample_idx', 'Unknown sample')}") + continue + + info_img_path = data_info_candidate['images'][cam_type]['img_path'] + if osp.basename(info_img_path) == input_img_basename: + found_data_info = data_info_candidate + break + + if found_data_info is None: + available_img_names = [ + osp.basename(info['images'][cam_type]['img_path']) + for info in all_info_data + if 'images' in info and cam_type in info['images'] and 'img_path' in info['images'][cam_type] + ] + example_names = ", ".join(list(set(available_img_names))[:5]) raise ValueError( - f'the info file of {img_path} is not provided.') + f"Could not find info for image '{input_img_basename}' (from path: {single_input_sample_dict['img']}) " + f"in '{infos_path}'. Checked {len(all_info_data)} entries. " + f"Example image basenames in info file: {example_names}" + ) + + # Add camera parameters from found_data_info to the input sample cam2img = np.asarray( - data_info['images'][cam_type]['cam2img'], dtype=np.float32) + found_data_info['images'][cam_type]['cam2img'], dtype=np.float32) lidar2cam = np.asarray( - data_info['images'][cam_type]['lidar2cam'], + found_data_info['images'][cam_type]['lidar2cam'], dtype=np.float32) - if 'lidar2img' in data_info['images'][cam_type]: + if 'lidar2img' in found_data_info['images'][cam_type]: lidar2img = np.asarray( - data_info['images'][cam_type]['lidar2img'], + found_data_info['images'][cam_type]['lidar2img'], dtype=np.float32) else: lidar2img = cam2img @ lidar2cam - input['cam2img'] = cam2img - input['lidar2cam'] = lidar2cam - input['lidar2img'] = lidar2img + + # Create a new dict for the processed input to avoid modifying the original list's dicts + processed_sample = single_input_sample_dict.copy() + processed_sample['cam2img'] = cam2img + processed_sample['lidar2cam'] = lidar2cam + processed_sample['lidar2img'] = lidar2img + processed_inputs_list.append(processed_sample) + elif isinstance(inputs, (list, tuple)): - # get cam2img, lidar2cam and lidar2img from infos - for input in inputs: - assert 'infos' in input - infos = input.pop('infos') - info_list = mmengine.load(infos)['data_list'] - assert len(info_list) == 1, 'Only support single sample' \ - 'info in `.pkl`, when input is a list.' - data_info = info_list[0] - img_path = data_info['images'][cam_type]['img_path'] - if isinstance(input['img'], str) and \ - osp.basename(img_path) != osp.basename(input['img']): + # This branch handles cases where 'inputs' is already a list of dicts. + # The original logic assumes each dict in the list has its own 'infos' + # and that this info file contains exactly one entry. + # This part is kept similar to original for now, but may need adjustment + # if a global info file is to be used for list inputs too. 
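+ # Illustrative example of the expected list input (paths are placeholders,
+ # not files shipped with this patch):
+ #   inputs = [dict(points='demo/000008.bin', img='demo/000008.png',
+ #                  infos='demo/kitti_infos_val.pkl')]
+ # Each item carries its own 'infos' file; the matching entry is found by
+ # comparing image basenames in the loop below.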
+ for single_input_item_dict in inputs: + if not isinstance(single_input_item_dict, dict) or 'infos' not in single_input_item_dict: + raise ValueError("When inputs is a list, each item must be a dict containing an 'infos' key.") + + infos_path_item = single_input_item_dict.pop('infos') + current_info_list = mmengine.load(infos_path_item)['data_list'] + + # Original code for list inputs expects one info entry per file. + # To make it search, you'd adapt the logic from the isinstance(inputs, dict) block above. + # For now, sticking to a modified version of the original assertion for clarity. + input_img_basename_item = osp.basename(single_input_item_dict['img']) + data_info_to_use = None + if len(current_info_list) == 1: + # If only one entry, check if it matches, then use it. + candidate = current_info_list[0] + if 'images' in candidate and cam_type in candidate['images'] and \ + osp.basename(candidate['images'][cam_type]['img_path']) == input_img_basename_item: + data_info_to_use = candidate + else: + raise ValueError( + f"Single info entry in '{infos_path_item}' does not match input image '{input_img_basename_item}'.") + else: + # If multiple entries, search for the right one. + for candidate in current_info_list: + if 'images' in candidate and cam_type in candidate['images'] and \ + osp.basename(candidate['images'][cam_type]['img_path']) == input_img_basename_item: + data_info_to_use = candidate + break + if data_info_to_use is None: + raise ValueError( + f"Could not find matching info for image '{input_img_basename_item}' in '{infos_path_item}' " + f"(which has {len(current_info_list)} entries) when inputs is a list.") + + # Consistency check (original) + img_path_from_info = data_info_to_use['images'][cam_type]['img_path'] + if isinstance(single_input_item_dict.get('img'), str) and \ + osp.basename(img_path_from_info) != osp.basename(single_input_item_dict['img']): raise ValueError( - f'the info file of {img_path} is not provided.') + f"Mismatch: info file '{img_path_from_info}' vs input image '{single_input_item_dict['img']}'.") + cam2img = np.asarray( - data_info['images'][cam_type]['cam2img'], dtype=np.float32) + data_info_to_use['images'][cam_type]['cam2img'], dtype=np.float32) lidar2cam = np.asarray( - data_info['images'][cam_type]['lidar2cam'], + data_info_to_use['images'][cam_type]['lidar2cam'], dtype=np.float32) - if 'lidar2img' in data_info['images'][cam_type]: + if 'lidar2img' in data_info_to_use['images'][cam_type]: lidar2img = np.asarray( - data_info['images'][cam_type]['lidar2img'], + data_info_to_use['images'][cam_type]['lidar2img'], dtype=np.float32) else: lidar2img = cam2img @ lidar2cam - input['cam2img'] = cam2img - input['lidar2cam'] = lidar2cam - input['lidar2img'] = lidar2img - - return list(inputs) + + processed_sample = single_input_item_dict.copy() + processed_sample['cam2img'] = cam2img + processed_sample['lidar2cam'] = lidar2cam + processed_sample['lidar2img'] = lidar2img + processed_inputs_list.append(processed_sample) + else: + raise TypeError(f"Unsupported input type: {type(inputs)}. 
Expected dict or list.") + + return processed_inputs_list def _init_pipeline(self, cfg: ConfigType) -> Compose: """Initialize the test pipeline.""" diff --git a/mmdet3d/datasets/transforms/dbsampler.py b/mmdet3d/datasets/transforms/dbsampler.py index 56e8440b74..093cdfb170 100644 --- a/mmdet3d/datasets/transforms/dbsampler.py +++ b/mmdet3d/datasets/transforms/dbsampler.py @@ -280,7 +280,7 @@ def sample_all(self, s_points_list.append(s_points) gt_labels = np.array([self.cat2label[s['name']] for s in sampled], - dtype=np.long) + dtype=np.int64) if ground_plane is not None: xyz = sampled_gt_bboxes[:, :3] diff --git a/mmdet3d/datasets/transforms/loading.py b/mmdet3d/datasets/transforms/loading.py index 383c44536f..c1b9b8c395 100644 --- a/mmdet3d/datasets/transforms/loading.py +++ b/mmdet3d/datasets/transforms/loading.py @@ -244,6 +244,9 @@ def transform(self, results: dict) -> dict: if 'CAM2' in results['images']: filename = results['images']['CAM2']['img_path'] results['cam2img'] = results['images']['CAM2']['cam2img'] + elif 'CAM_FRONT' in results['images']: + filename = results['images']['CAM_FRONT']['img_path'] + results['cam2img'] = results['images']['CAM_FRONT']['cam2img'] elif len(list(results['images'].keys())) == 1: camera_type = list(results['images'].keys())[0] filename = results['images'][camera_type]['img_path'] diff --git a/mmdet3d/models/backbones/__init__.py b/mmdet3d/models/backbones/__init__.py index 64102bec1f..1badced7cb 100644 --- a/mmdet3d/models/backbones/__init__.py +++ b/mmdet3d/models/backbones/__init__.py @@ -4,18 +4,22 @@ from .cylinder3d import Asymm3DSpconv from .dgcnn import DGCNNBackbone from .dla import DLANet +from .fire_rpfnet import FireRPFNet, FireRPFNetV2, FireRPFNet2D from .mink_resnet import MinkResNet from .minkunet_backbone import MinkUNetBackbone from .multi_backbone import MultiBackbone from .nostem_regnet import NoStemRegNet from .pointnet2_sa_msg import PointNet2SAMSG from .pointnet2_sa_ssg import PointNet2SASSG +from .rpfnet import RPFNet from .second import SECOND from .spvcnn_backone import MinkUNetBackboneV2, SPVCNNBackbone +from .squeezenet import SQUEEZE __all__ = [ 'ResNet', 'ResNetV1d', 'ResNeXt', 'SSDVGG', 'HRNet', 'NoStemRegNet', 'SECOND', 'DGCNNBackbone', 'PointNet2SASSG', 'PointNet2SAMSG', 'MultiBackbone', 'DLANet', 'MinkResNet', 'Asymm3DSpconv', - 'MinkUNetBackbone', 'SPVCNNBackbone', 'MinkUNetBackboneV2' + 'MinkUNetBackbone', 'SPVCNNBackbone', 'MinkUNetBackboneV2','SQUEEZE', + 'RPFNet', 'FireRPFNet', 'FireRPFNetV2', 'FireRPFNet2D' ] diff --git a/mmdet3d/models/backbones/fire_rpfnet.py b/mmdet3d/models/backbones/fire_rpfnet.py new file mode 100644 index 0000000000..b8d62666d8 --- /dev/null +++ b/mmdet3d/models/backbones/fire_rpfnet.py @@ -0,0 +1,471 @@ +"""FireRPFNet: Fire Module + CBAM Attention Backbones. + +This module provides efficient backbones combining Fire modules (SqueezeNet-inspired) +with CBAM attention for both 2D image and 3D point cloud (BEV) feature extraction. + +Variants: + - FireRPFNet: Original BEV backbone + - FireRPFNetV2: Enhanced BEV backbone with multi-scale support + - FireRPFNet2D: 2D image backbone for multimodal detection + +References: + - SqueezeNet Fire Module: https://arxiv.org/abs/1602.07360 + - CBAM: https://arxiv.org/abs/1807.06521 + - MVXNet: https://arxiv.org/abs/1904.01649 +""" + +import torch +from torch import nn +from mmcv.cnn import build_norm_layer +from mmdet3d.registry import MODELS + + +class FireBlock(nn.Module): + """SqueezeNet-style fire module with residual shortcut for BEV features. 
+ + Original implementation for point cloud BEV backbones (FireRPFNet, FireRPFNetV2). + Squeezes from input channels for compatibility with trained models. + + Args: + in_ch (int): Input channels. + out_ch (int): Output channels of the expand concat. + norm_cfg (dict): Normalization config. + """ + + def __init__(self, in_ch, out_ch, norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)): + super().__init__() + # Squeeze from INPUT channels (original BEV behavior) + squeeze_ch = max(16, in_ch // 4) + + # Squeeze path (no stride, BEV maintains resolution) + self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1, bias=False) + self.squeeze_bn = build_norm_layer(norm_cfg, squeeze_ch)[1] + + # Expand paths (1x1 and 3x3 in parallel) + self.expand1x1 = nn.Conv2d(squeeze_ch, out_ch // 2, 1, bias=False) + self.expand3x3 = nn.Conv2d(squeeze_ch, out_ch // 2, 3, padding=1, bias=False) + self.expand_bn = build_norm_layer(norm_cfg, out_ch)[1] + + self.act = nn.ReLU(inplace=True) + + # Residual connection with projection if needed + self.downsample = None + if in_ch != out_ch: + self.downsample = nn.Sequential( + nn.Conv2d(in_ch, out_ch, 1, bias=False), + build_norm_layer(norm_cfg, out_ch)[1], + ) + + def forward(self, x): + identity = x + + # Squeeze + x = self.act(self.squeeze_bn(self.squeeze(x))) + + # Expand (parallel 1x1 and 3x3) + out1 = self.expand1x1(x) + out3 = self.expand3x3(x) + out = torch.cat([out1, out3], dim=1) + out = self.expand_bn(out) + + # Residual connection + if self.downsample is not None: + identity = self.downsample(identity) + + return self.act(out + identity) + + +class FireBlock2D(nn.Module): + """SqueezeNet-style fire module with residual shortcut for 2D images. + + Adapted for hierarchical image feature extraction with downsampling support. + Squeezes from output channels for consistent behavior during resolution changes. + + Args: + in_ch (int): Input channels. + out_ch (int): Output channels of the expand concat. + stride (int): Stride for downsampling. Default: 1. + norm_cfg (dict): Normalization config. + """ + + def __init__(self, in_ch, out_ch, stride=1, norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)): + super().__init__() + self.stride = stride + # Squeeze from OUTPUT channels (for stable behavior across resolution changes) + squeeze_ch = max(16, out_ch // 4) + + # Squeeze path (with optional stride for downsampling) + self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1, stride=stride, bias=False) + self.squeeze_bn = build_norm_layer(norm_cfg, squeeze_ch)[1] + + # Expand paths (1x1 and 3x3 in parallel) + self.expand1x1 = nn.Conv2d(squeeze_ch, out_ch // 2, 1, bias=False) + self.expand3x3 = nn.Conv2d(squeeze_ch, out_ch // 2, 3, padding=1, bias=False) + self.expand_bn = build_norm_layer(norm_cfg, out_ch)[1] + + self.act = nn.ReLU(inplace=True) + + # Residual connection with projection if needed + self.downsample = None + if in_ch != out_ch or stride != 1: + self.downsample = nn.Sequential( + nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False), + build_norm_layer(norm_cfg, out_ch)[1], + ) + + def forward(self, x): + identity = x + + # Squeeze + x = self.act(self.squeeze_bn(self.squeeze(x))) + + # Expand (parallel 1x1 and 3x3) + out1 = self.expand1x1(x) + out3 = self.expand3x3(x) + out = torch.cat([out1, out3], dim=1) + out = self.expand_bn(out) + + # Residual connection + if self.downsample is not None: + identity = self.downsample(identity) + + return self.act(out + identity) + + +class CBAM(nn.Module): + """Convolutional Block Attention Module (CBAM). 
+ + Applies sequential channel and spatial attention to input features. + + Args: + ch (int): Number of input channels. + reduction (int): Channel reduction ratio for MLP. Default: 16. + + Reference: + Woo et al., "CBAM: Convolutional Block Attention Module", ECCV 2018. + """ + + def __init__(self, ch, reduction=16): + super().__init__() + # Channel attention + self.channel_att = nn.Sequential( + nn.AdaptiveAvgPool2d(1), + nn.Flatten(), + nn.Linear(ch, ch // reduction, bias=False), + nn.ReLU(inplace=True), + nn.Linear(ch // reduction, ch, bias=False), + nn.Sigmoid(), + ) + + # Spatial attention + self.spatial_att = nn.Sequential( + nn.Conv2d(2, 1, 7, padding=3, bias=False), + nn.Sigmoid(), + ) + + def forward(self, x): + b, c, _, _ = x.size() + + # Channel attention + att_c = self.channel_att(x).view(b, c, 1, 1) + x = x * att_c + + # Spatial attention + att_s = self.spatial_att( + torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True)[0]], dim=1) + ) + return x * att_s + + +# ============================================================================= +# BEV Backbones (for Point Cloud) +# ============================================================================= + +@MODELS.register_module() +class FireRPFNet(nn.Module): + """Residual FireNet backbone (SqueezeNet-inspired) with CBAM. + + first version designed as a drop-in replacement for RPFNet in BEV pipelines. + Processes BEV features from sparse 3D convolution without downsampling. + + Args: + in_channels (int): Input channels. Default: 256. + out_channels (tuple[int]): Output channels for each stage. + Default: (128, 256, 256, 256). + with_cbam (bool): Whether to use CBAM attention. Default: True. + norm_cfg (dict): Normalization config. + Default: dict(type='BN', eps=1e-3, momentum=0.01). + """ + + def __init__(self, + in_channels=256, + out_channels=(128, 256, 256, 256), + with_cbam=True, + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)): + super().__init__() + layers = [] + ch = in_channels + for out_ch in out_channels: + block = FireBlock(ch, out_ch, norm_cfg=norm_cfg) + stage = [block] + if with_cbam: + stage.append(CBAM(out_ch)) + layers.append(nn.Sequential(*stage)) + ch = out_ch + self.stages = nn.ModuleList(layers) + + def forward(self, x): + """Forward pass. + + Args: + x (torch.Tensor): BEV features (N, C, H, W). + + Returns: + tuple[torch.Tensor]: Single-element tuple with last stage output. + """ + for stage in self.stages: + x = stage(x) + return (x, ) + + +@MODELS.register_module() +class FireRPFNetV2(nn.Module): + """Enhanced Residual FireNet backbone with multi-scale support. + + Designed as a drop-in replacement for RPFNet/SECOND in BEV pipelines. + Can output single-scale or multi-scale features for use with/without FPN necks. + + Args: + in_channels (int): Input channels. Default: 256. + out_channels (tuple[int] | list[int]): Output channels for each stage. + Default: (128, 256, 256, 256). + with_cbam (bool): Whether to use CBAM attention after each stage. + Default: True. + multi_scale_output (bool): If True, returns multi-scale features from all stages + (for use with SECONDFPN neck). If False, returns only the last stage output + (backward compatible, for use without neck). Default: False. + norm_cfg (dict): Normalization config. + Default: dict(type='BN', eps=1e-3, momentum=0.01). + + Example: + >>> # Single-scale output (no neck) + >>> pts_backbone = dict( + ... type='FireRPFNetV2', + ... in_channels=256, + ... out_channels=[128, 256, 256, 256], + ... 
multi_scale_output=False) + + >>> # Multi-scale output (with SECONDFPN) + >>> pts_backbone = dict( + ... type='FireRPFNetV2', + ... in_channels=256, + ... out_channels=[128, 256, 256, 256], + ... multi_scale_output=True) + >>> pts_neck = dict( + ... type='SECONDFPN', + ... in_channels=[128, 256, 256, 256], + ... upsample_strides=[1, 2, 4, 8], + ... out_channels=[128, 128, 128, 128]) + """ + + def __init__(self, + in_channels=256, + out_channels=(128, 256, 256, 256), + with_cbam=True, + multi_scale_output=False, + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)): + super().__init__() + self.multi_scale_output = multi_scale_output + layers = [] + ch = in_channels + for out_ch in out_channels: + block = FireBlock(ch, out_ch, norm_cfg=norm_cfg) + stage = [block] + if with_cbam: + stage.append(CBAM(out_ch)) + layers.append(nn.Sequential(*stage)) + ch = out_ch + self.stages = nn.ModuleList(layers) + + def forward(self, x): + """Forward pass. + + Args: + x (torch.Tensor): BEV features (N, C, H, W). + + Returns: + tuple[torch.Tensor]: + - If multi_scale_output=False: Single-element tuple with last stage output + - If multi_scale_output=True: Multi-element tuple with all stage outputs + """ + if self.multi_scale_output: + # Return multi-scale features for FPN neck + outs = [] + for stage in self.stages: + x = stage(x) + outs.append(x) + return tuple(outs) + else: + # Return only last stage (backward compatible) + for stage in self.stages: + x = stage(x) + return (x, ) + + +# ============================================================================= +# 2D Image Backbone +# ============================================================================= + +@MODELS.register_module() +class FireRPFNet2D(nn.Module): + """FireRPFNet2D: Efficient 2D image backbone for multimodal 3D detection. + + This backbone uses Fire modules (SqueezeNet-inspired) with CBAM attention + for efficient feature extraction from RGB images. It outputs multi-scale features + suitable for FPN necks in MVXNet-style architectures. + + Architecture: + - Stem: Conv 7ร—7 stride=2 + MaxPool โ†’ H/4, W/4 + - Stage 1: Fire blocks (stride=1) โ†’ H/4, W/4 + - Stage 2: Fire blocks (stride=2 in first) โ†’ H/8, W/8 + - Stage 3: Fire blocks (stride=2 in first) โ†’ H/16, W/16 + - Stage 4: Fire blocks (stride=2 in first) โ†’ H/32, W/32 + - Each block optionally followed by CBAM attention + + Args: + in_channels (int): Input image channels (typically 3 for RGB). Default: 3. + out_channels (tuple[int]): Output channels for each stage. + Default: (64, 128, 256, 512). + blocks_per_stage (tuple[int]): Number of Fire blocks per stage. + Default: (2, 2, 2, 2). + with_cbam (bool): Whether to use CBAM attention after each block. + Default: True. + stem_channels (int): Channels in stem conv. Default: 64. + out_indices (tuple[int]): Output feature indices for multi-scale. + Default: (0, 1, 2, 3) - all stages. + frozen_stages (int): Stages to be frozen (stop grad and set eval mode). + -1 means not freezing any stages. Default: -1. + norm_cfg (dict): Normalization config. + Default: dict(type='BN', eps=1e-3, momentum=0.01). + norm_eval (bool): Whether to set norm layers to eval mode. Default: False. + + Example: + >>> # Standard configuration + >>> img_backbone = dict( + ... type='FireRPFNet2D', + ... in_channels=3, + ... out_channels=[64, 128, 256, 512], + ... stem_channels=64, + ... with_cbam=True) + + >>> # Lightweight configuration (~40% fewer params) + >>> img_backbone = dict( + ... type='FireRPFNet2D', + ... out_channels=[48, 96, 192, 384], + ... 
stem_channels=48, + ... with_cbam=True) + + >>> # Without attention + >>> img_backbone = dict( + ... type='FireRPFNet2D', + ... out_channels=[64, 128, 256, 512], + ... with_cbam=False) + """ + + def __init__(self, + in_channels=3, + out_channels=(64, 128, 256, 512), + blocks_per_stage=(2, 2, 2, 2), + with_cbam=True, + stem_channels=64, + out_indices=(0, 1, 2, 3), + frozen_stages=-1, + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + norm_eval=False): + super().__init__() + + assert len(out_channels) == len(blocks_per_stage), \ + "out_channels and blocks_per_stage must have same length" + + self.num_stages = len(out_channels) + self.out_indices = out_indices + self.frozen_stages = frozen_stages + self.norm_eval = norm_eval + self.with_cbam = with_cbam + + # Stem: initial convolution to lift channels (H/4, W/4) + self.stem = nn.Sequential( + nn.Conv2d(in_channels, stem_channels, kernel_size=7, stride=2, + padding=3, bias=False), + build_norm_layer(norm_cfg, stem_channels)[1], + nn.ReLU(inplace=True), + nn.MaxPool2d(kernel_size=3, stride=2, padding=1) + ) + + # Build stages + self.stages = nn.ModuleList() + in_ch = stem_channels + + for stage_idx, (out_ch, num_blocks) in enumerate(zip(out_channels, blocks_per_stage)): + blocks = [] + for block_idx in range(num_blocks): + # First block of stages 1-3 uses stride=2 for downsampling + stride = 2 if (stage_idx > 0 and block_idx == 0) else 1 + + # Fire block (use FireBlock2D for image backbone) + fire_block = FireBlock2D(in_ch, out_ch, stride=stride, norm_cfg=norm_cfg) + blocks.append(fire_block) + + # Optional CBAM attention + if with_cbam: + blocks.append(CBAM(out_ch)) + + in_ch = out_ch + + self.stages.append(nn.Sequential(*blocks)) + + self._freeze_stages() + + def _freeze_stages(self): + """Freeze stages parameters and set to eval mode.""" + if self.frozen_stages >= 0: + self.stem.eval() + for param in self.stem.parameters(): + param.requires_grad = False + + for i in range(0, self.frozen_stages + 1): + if i < len(self.stages): + m = self.stages[i] + m.eval() + for param in m.parameters(): + param.requires_grad = False + + def forward(self, x): + """Forward pass. + + Args: + x (torch.Tensor): Input images (N, C, H, W), typically (N, 3, H, W). + + Returns: + tuple[torch.Tensor]: Multi-scale feature maps from selected stages. + Each tensor has shape (N, C_i, H_i, W_i). 
+ """ + x = self.stem(x) # Initial downsampling: H/4, W/4 + + outs = [] + for stage_idx, stage in enumerate(self.stages): + x = stage(x) + if stage_idx in self.out_indices: + outs.append(x) + + return tuple(outs) + + def train(self, mode=True): + """Set the module in training mode.""" + super(FireRPFNet2D, self).train(mode) + self._freeze_stages() + if mode and self.norm_eval: + for m in self.modules(): + # trick: eval have effect on BatchNorm only + if isinstance(m, nn.BatchNorm2d): + m.eval() diff --git a/mmdet3d/models/backbones/rpfnet.py b/mmdet3d/models/backbones/rpfnet.py new file mode 100644 index 0000000000..54bc0f516b --- /dev/null +++ b/mmdet3d/models/backbones/rpfnet.py @@ -0,0 +1,98 @@ +import torch +from torch import nn +from mmcv.cnn import build_norm_layer +from mmdet3d.registry import MODELS + + +class BasicBlock(nn.Module): + """Simple residual 2-D conv block used in PillarNet-LTS (RPFN).""" + + def __init__(self, in_channels, out_channels, norm_cfg): + super().__init__() + self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False) + self.bn1 = build_norm_layer(norm_cfg, out_channels)[1] + self.act = nn.ReLU(inplace=True) + self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False) + self.bn2 = build_norm_layer(norm_cfg, out_channels)[1] + if in_channels != out_channels: + self.downsample = nn.Sequential( + nn.Conv2d(in_channels, out_channels, 1, bias=False), + build_norm_layer(norm_cfg, out_channels)[1], + ) + else: + self.downsample = None + + def forward(self, x): + identity = x + out = self.act(self.bn1(self.conv1(x))) + out = self.bn2(self.conv2(out)) + if self.downsample is not None: + identity = self.downsample(identity) + out = self.act(out + identity) + return out + + +class CBAM(nn.Module): + """Lightweight CBAM attention (channel + spatial).""" + + def __init__(self, channels, reduction=16): + super().__init__() + self.mlp = nn.Sequential( + nn.AdaptiveAvgPool2d(1), + nn.Flatten(), + nn.Linear(channels, channels // reduction, bias=False), + nn.ReLU(inplace=True), + nn.Linear(channels // reduction, channels, bias=False), + nn.Sigmoid(), + ) + self.spatial = nn.Sequential( + nn.Conv2d(2, 1, 7, padding=3, bias=False), + nn.Sigmoid(), + ) + + def forward(self, x): + # channel attention + b, c, _, _ = x.size() + channel_att = self.mlp(x).view(b, c, 1, 1) + x = x * channel_att + # spatial attention + spatial_att = self.spatial(torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True)[0]], dim=1)) + x = x * spatial_att + return x + + +@MODELS.register_module() +class RPFNet(nn.Module): + """Residual Pillar Feature Network backbone (simplified). + + Args: + in_channels (int): #Channels of input BEV feature map (from SparseEncoder). + layer_channels (list[int]): Output channels for each residual stage. + with_cbam (bool): If True, append a CBAM after each stage. + norm_cfg (dict): Norm config dict. + """ + + def __init__(self, + in_channels=256, + layer_channels=(128, 256, 256, 256), + with_cbam=True, + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)): + super().__init__() + layers = [] + ch = in_channels + for out_ch in layer_channels: + block = BasicBlock(ch, out_ch, norm_cfg) + stage = [block] + if with_cbam: + stage.append(CBAM(out_ch)) + layers.append(nn.Sequential(*stage)) + ch = out_ch + self.stages = nn.ModuleList(layers) + + def forward(self, x): + # x: (B, C, H, W) BEV feature map + for stage in self.stages: + x = stage(x) + # Anchor3DHead expects a tuple/list of multi-scale features. 
+    # We return a single-scale tuple to stay compatible.
+    return (x, )
diff --git a/mmdet3d/models/backbones/squeezenet.py b/mmdet3d/models/backbones/squeezenet.py
new file mode 100644
index 0000000000..c42393b2dc
--- /dev/null
+++ b/mmdet3d/models/backbones/squeezenet.py
@@ -0,0 +1,112 @@
+import warnings
+
+from mmengine.model import BaseModule
+from mmdet3d.registry import MODELS
+from mmcv.cnn import build_conv_layer, build_norm_layer
+import torch
+import torch.nn as nn
+from typing import Sequence, Optional
+
+
+@MODELS.register_module()
+class SQUEEZE(BaseModule):
+    """Backbone network using the SqueezeNet architecture.
+
+    Args:
+        in_channels (int): Input channels.
+        out_channels (list[int]): Output channels for multi-scale feature maps.
+        norm_cfg (dict): Config dict of normalization layers.
+        conv_cfg (dict): Config dict of convolutional layers.
+    """
+
+    def __init__(self,
+                 in_channels: int = 3,
+                 out_channels: Sequence[int] = [64, 128, 256],
+                 norm_cfg: dict = dict(type='BN', eps=1e-3, momentum=0.01),
+                 conv_cfg: dict = dict(type='Conv2d', bias=False),
+                 init_cfg: Optional[dict] = None,
+                 pretrained: Optional[str] = None) -> None:
+        super(SQUEEZE, self).__init__(init_cfg=init_cfg)
+        self.conv_cfg = conv_cfg
+        self.norm_cfg = norm_cfg
+
+        # SqueezeNet stem and Fire modules. The original SqueezeNet stem
+        # (7x7 conv, 96 channels) is replaced here by a lighter 3x3/64 stem.
+        self.features = nn.Sequential(
+            build_conv_layer(conv_cfg, in_channels, 64, kernel_size=3, stride=2),
+            build_norm_layer(norm_cfg, 64)[1],
+            nn.ReLU(inplace=True),
+            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),
+            self._make_fire_module(64, 16, 64, 64),
+            self._make_fire_module(128, 16, 64, 64),
+            self._make_fire_module(128, 32, 128, 128),
+            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),
+            self._make_fire_module(256, 32, 128, 128),
+            self._make_fire_module(256, 48, 192, 192),
+            self._make_fire_module(384, 48, 192, 192),
+            self._make_fire_module(384, 64, 256, 256),
+            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),
+            self._make_fire_module(512, 64, 256, 256),
+        )
+
+        if isinstance(pretrained, str):
+            warnings.warn('DeprecationWarning: pretrained is deprecated, '
+                          'please use "init_cfg" instead')
+            self.init_cfg = dict(type='Pretrained', checkpoint=pretrained)
+        else:
+            self.init_cfg = dict(type='Kaiming', layer='Conv2d')
+
+    def _make_fire_module(self, in_channels, squeeze_channels,
+                          expand1x1_channels, expand3x3_channels):
+        layers = nn.Sequential()
+
+        # Squeeze layer
+        squeeze = nn.Sequential(
+            build_conv_layer(self.conv_cfg, in_channels, squeeze_channels, kernel_size=1),
+            build_norm_layer(self.norm_cfg, squeeze_channels)[1],
+            nn.ReLU(inplace=True)
+        )
+        layers.add_module('squeeze', squeeze)
+
+        # Expand 1x1 layer
+        expand1x1 = nn.Sequential(
+            build_conv_layer(self.conv_cfg, squeeze_channels, expand1x1_channels, kernel_size=1),
+            build_norm_layer(self.norm_cfg, expand1x1_channels)[1],
+            nn.ReLU(inplace=True)
+        )
+        layers.add_module('expand1x1', expand1x1)
+
+        # Expand 3x3 layer
+        expand3x3 = nn.Sequential(
+            build_conv_layer(self.conv_cfg, squeeze_channels, expand3x3_channels, kernel_size=3, padding=1),
+            build_norm_layer(self.norm_cfg, expand3x3_channels)[1],
+            nn.ReLU(inplace=True)
+        )
+        layers.add_module('expand3x3', expand3x3)
+
+        # Concatenation of the expand branches is handled in forward().
+        return layers
+
+    def forward(self, x):
+        """Forward function with correct concatenation for
Fire modules.""" + x = self.features[0](x) # handled here as the initial layers are not fire modules + targeted_layers = [1,5, 8, 13] + outs = [] + for idx, layer in enumerate(self.features[1:], 1): + #print(idx,":",layer) + if isinstance(layer, nn.Sequential) and 'squeeze' in layer._modules: + # This is a Fire module, handle separately + squeeze_output = layer.squeeze(x) + x1 = layer.expand1x1(squeeze_output) + x3 = layer.expand3x3(squeeze_output) + x = torch.cat([x1, x3], 1) + else: + # Normal layer + x = layer(x) + if(idx in targeted_layers): + outs.append(x) + #print("Outs x",idx , x.shape) + #print(len(outs)) + return outs diff --git a/mmdet3d/models/layers/norm.py b/mmdet3d/models/layers/norm.py index 9a85278723..03f4950f13 100644 --- a/mmdet3d/models/layers/norm.py +++ b/mmdet3d/models/layers/norm.py @@ -120,6 +120,8 @@ def forward(self, input: Tensor) -> Tensor: Returns: Tensor: Has shape (N, C, H, W), same shape as input. """ + if input.dtype == torch.float16: + input = input.to(torch.float32) # casting to torch.float32 assert input.dtype == torch.float32, \ f'input should be in float32 type, got {input.dtype}' using_dist = dist.is_available() and dist.is_initialized() diff --git a/mmdet3d/models/necks/__init__.py b/mmdet3d/models/necks/__init__.py index 53b885cb16..fb60020e4a 100644 --- a/mmdet3d/models/necks/__init__.py +++ b/mmdet3d/models/necks/__init__.py @@ -5,8 +5,9 @@ from .imvoxel_neck import IndoorImVoxelNeck, OutdoorImVoxelNeck from .pointnet2_fp_neck import PointNetFPNeck from .second_fpn import SECONDFPN +from .squeeze_fpn import SQUEEZEFPN __all__ = [ 'FPN', 'SECONDFPN', 'OutdoorImVoxelNeck', 'PointNetFPNeck', 'DLANeck', - 'IndoorImVoxelNeck' + 'IndoorImVoxelNeck','SQUEEZEFPN' ] diff --git a/mmdet3d/models/necks/squeeze_fpn.py b/mmdet3d/models/necks/squeeze_fpn.py new file mode 100644 index 0000000000..1d50277e4c --- /dev/null +++ b/mmdet3d/models/necks/squeeze_fpn.py @@ -0,0 +1,110 @@ +import torch +from mmcv.cnn import build_conv_layer, build_norm_layer, build_upsample_layer +from mmengine.model import BaseModule +from torch import nn + +from mmdet3d.registry import MODELS + + +class LastLevelMaxPool(nn.Module): + def __init__(self): + super(LastLevelMaxPool, self).__init__() + self.pool = nn.MaxPool2d(kernel_size=1, stride=2, padding=0) + + def forward(self, x): + return self.pool(x) + +@MODELS.register_module() +class SQUEEZEFPN(BaseModule): + """FPN using SqueezeNet architecture. + + Args: + in_channels (list[int]): Input channels of multi-scale feature maps. + out_channels (list[int]): Output channels of feature maps. + norm_cfg (dict): Config dict of normalization layers. + upsample_cfg (dict): Config dict of upsample layers. + conv_cfg (dict): Config dict of conv layers. + init_cfg (dict or :obj:`ConfigDict` or list[dict or :obj:`ConfigDict`], + optional): Initialization config dict. 
+ """ + + def __init__(self, + in_channels=[64, 128, 256, 512], + out_channels=[256, 256, 256, 256], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False), + conv_cfg=dict(type='Conv2d', bias=False), + init_cfg=None): + super(SQUEEZEFPN, self).__init__(init_cfg=init_cfg) + print("out_channels", len(out_channels), "in_channels", len(in_channels)) + print("out_channels", out_channels, len(out_channels), "in_channels", in_channels, len(in_channels)) + assert len(out_channels) == len(in_channels) + self.in_channels = in_channels + self.out_channels = out_channels + + + self.lateral_convs = nn.ModuleList([ + nn.Conv2d(in_channel, out_channels[0], kernel_size=1) + for in_channel in in_channels + ]) + self.fpn_convs = nn.ModuleList([ + nn.Conv2d(out_channels[0], out_channels[0], kernel_size=3, padding=1) + for _ in range(len(in_channels)) + ]) + self.last_level_pool = LastLevelMaxPool() + + # self.deblocks = nn.ModuleList() + # for i, out_channel in enumerate(out_channels): + # upsample_layer = build_upsample_layer( + # upsample_cfg, + # in_channels=in_channels[i], + # out_channels=out_channel, + # kernel_size=2, + # stride=2) + # deblock = nn.Sequential( + # upsample_layer, + # build_norm_layer(norm_cfg, out_channel)[1], + # nn.ReLU(inplace=True) + # ) + # self.deblocks.append(deblock) + + def forward(self, x): + """Forward function. + + Args: + x (List[torch.Tensor]): Multi-level features with 4D Tensor in + (N, C, H, W) shape. + + Returns: + list[torch.Tensor]: Multi-level feature maps. + """ + # print("x", len(x), "in_channels", len(self.in_channels)) + assert len(x) == len(self.in_channels) + + lateral_features = [lateral_conv(feat) for lateral_conv, feat in zip(self.lateral_convs, x)] + + for i in range(len(lateral_features) - 2, -1, -1): + # print(i) + # print(i,lateral_features[i].shape) + # print(i+1,lateral_features[i+1].shape) + shape_of_tensor = lateral_features[i].size() + + # Extract specific dimensions for upsampleing + batch_size = shape_of_tensor[0] # not using + y_dimension = shape_of_tensor[1] + x_height = shape_of_tensor[2] + x_width = shape_of_tensor[3] + lateral_features[i] += nn.functional.interpolate(lateral_features[i + 1], size=(x_height, x_width), mode='nearest') + #print(i,lateral_features[i].shape) + + + # Apply the FPN convolutions + fpn_features = [fpn_conv(feat) for fpn_conv, feat in zip(self.fpn_convs, lateral_features)] + # for i, feature in enumerate(fpn_features): + # print(f"FPN Feature {i} shape: {feature.shape}") + pool = self.last_level_pool(lateral_features[0]) + # fpn_features.append(pool) + # for i, feature in enumerate(fpn_features): + # print(f"FPN Feature {i} shape: {feature.shape}") + # print(pool.shape) + return tuple(lateral_features) diff --git a/projects/BEVFusion/bevfusion/bevfusion.py b/projects/BEVFusion/bevfusion/bevfusion.py index 9f56934e66..50791f0851 100644 --- a/projects/BEVFusion/bevfusion/bevfusion.py +++ b/projects/BEVFusion/bevfusion/bevfusion.py @@ -56,7 +56,7 @@ def __init__( fusion_layer) if fusion_layer is not None else None self.pts_backbone = MODELS.build(pts_backbone) - self.pts_neck = MODELS.build(pts_neck) + self.pts_neck = MODELS.build(pts_neck) if pts_neck is not None else None self.bbox_head = MODELS.build(bbox_head) @@ -279,7 +279,8 @@ def extract_feat( x = features[0] x = self.pts_backbone(x) - x = self.pts_neck(x) + if self.pts_neck is not None: + x = self.pts_neck(x) return x diff --git 
a/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py b/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..c705d52b0b --- /dev/null +++ b/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,237 @@ +_base_ = [ + './bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py' +] +point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0] +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +model = dict( + type='BEVFusion', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + bgr_to_rgb=False), + img_backbone=dict( + type='mmdet.SwinTransformer', + embed_dims=96, + depths=[2, 2, 6, 2], + num_heads=[3, 6, 12, 24], + window_size=7, + mlp_ratio=4, + qkv_bias=True, + qk_scale=None, + drop_rate=0.0, + attn_drop_rate=0.0, + drop_path_rate=0.2, + patch_norm=True, + out_indices=[1, 2, 3], + with_cp=False, + convert_weights=True, + init_cfg=dict( + type='Pretrained', + checkpoint= # noqa: E251 + 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth' # noqa: E501 + )), + img_neck=dict( + type='GeneralizedLSSFPN', + in_channels=[192, 384, 768], + out_channels=256, + start_level=0, + num_outs=3, + norm_cfg=dict(type='BN2d', requires_grad=True), + act_cfg=dict(type='ReLU', inplace=True), + upsample_cfg=dict(mode='bilinear', align_corners=False)), + view_transform=dict( + type='DepthLSSTransform', + in_channels=256, + out_channels=80, + image_size=[256, 704], + feature_size=[32, 88], + xbound=[-54.0, 54.0, 0.3], + ybound=[-54.0, 54.0, 0.3], + zbound=[-10.0, 10.0, 20.0], + dbound=[1.0, 60.0, 0.5], + downsample=2), + fusion_layer=dict( + type='ConvFuser', in_channels=[80, 256], out_channels=256)) + +train_pipeline = [ + dict( + type='BEVLoadMultiViewImageFromFiles', + to_float32=True, + color_type='color', + backend_args=backend_args), + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + load_dim=5, + use_dim=5, + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='LoadAnnotations3D', + with_bbox_3d=True, + with_label_3d=True, + with_attr_label=False), + dict( + type='ImageAug3D', + final_dim=[256, 704], + resize_lim=[0.38, 0.55], + bot_pct_lim=[0.0, 0.0], + rot_lim=[-5.4, 5.4], + rand_flip=True, + is_train=True), + dict( + type='BEVFusionGlobalRotScaleTrans', + scale_ratio_range=[0.9, 1.1], + rot_range=[-0.78539816, 0.78539816], + translation_std=0.5), + dict(type='BEVFusionRandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict( + type='ObjectNameFilter', + classes=[ + 'car', 'truck', 'construction_vehicle', 'bus', 'trailer', + 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone' + ]), + # Actually, 'GridMask' is not used here + dict( + type='GridMask', + use_h=True, + use_w=True, + max_epoch=6, + rotate=1, + offset=False, + ratio=0.5, + mode=1, + prob=0.0, + fixed_prob=True), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ], + meta_keys=[ + 'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar', 
+ 'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx', + 'lidar_path', 'img_path', 'transformation_3d_flow', 'pcd_rotation', + 'pcd_scale_factor', 'pcd_trans', 'img_aug_matrix', + 'lidar_aug_matrix', 'num_pts_feats' + ]) +] + +test_pipeline = [ + dict( + type='BEVLoadMultiViewImageFromFiles', + to_float32=True, + color_type='color', + backend_args=backend_args), + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + load_dim=5, + use_dim=5, + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='ImageAug3D', + final_dim=[256, 704], + resize_lim=[0.48, 0.48], + bot_pct_lim=[0.0, 0.0], + rot_lim=[0.0, 0.0], + rand_flip=False, + is_train=False), + dict( + type='PointsRangeFilter', + point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]), + dict( + type='Pack3DDetInputs', + keys=['img', 'points', 'gt_bboxes_3d', 'gt_labels_3d'], + meta_keys=[ + 'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar', + 'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx', + 'lidar_path', 'img_path', 'num_pts_feats' + ]) +] + +train_dataloader = dict( + dataset=dict( + dataset=dict(pipeline=train_pipeline, modality=input_modality))) +val_dataloader = dict( + dataset=dict(pipeline=test_pipeline, modality=input_modality)) +test_dataloader = val_dataloader + +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.33333333, + by_epoch=False, + begin=0, + end=500), + dict( + type='CosineAnnealingLR', + begin=0, + T_max=6, + end=6, + by_epoch=True, + eta_min_ratio=1e-4, + convert_to_iter_based=True), + # momentum scheduler + # During the first 8 epochs, momentum increases from 1 to 0.85 / 0.95 + # during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1 + dict( + type='CosineAnnealingMomentum', + eta_min=0.85 / 0.95, + begin=0, + end=2.4, + by_epoch=True, + convert_to_iter_based=True), + dict( + type='CosineAnnealingMomentum', + eta_min=1, + begin=2.4, + end=6, + by_epoch=True, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=6, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +# Default setting for scaling LR automatically +# - `enable` means enable scaling LR automatically +# or not by default. +# - `base_batch_size` = (8 GPUs) x (4 samples per GPU). 
+auto_scale_lr = dict(enable=False, base_batch_size=32) + +default_hooks = dict( + logger=dict(type='LoggerHook', interval=50), + checkpoint=dict(type='CheckpointHook', interval=1)) +del _base_.custom_hooks + +work_dir = './work_dirs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d' diff --git a/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py b/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..20467f63a8 --- /dev/null +++ b/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,238 @@ +_base_ = [ + './bevfusion_lidar_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py' +] +point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0] +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +model = dict( + type='BEVFusion', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + bgr_to_rgb=False), + img_backbone=dict( + type='mmdet.SwinTransformer', + embed_dims=96, + depths=[2, 2, 6, 2], + num_heads=[3, 6, 12, 24], + window_size=7, + mlp_ratio=4, + qkv_bias=True, + qk_scale=None, + drop_rate=0.0, + attn_drop_rate=0.0, + drop_path_rate=0.2, + patch_norm=True, + out_indices=[1, 2, 3], + with_cp=False, + convert_weights=True, + init_cfg=dict( + type='Pretrained', + checkpoint= # noqa: E251 + 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth' # noqa: E501 + )), + img_neck=dict( + type='GeneralizedLSSFPN', + in_channels=[192, 384, 768], + out_channels=256, + start_level=0, + num_outs=3, + norm_cfg=dict(type='BN2d', requires_grad=True), + act_cfg=dict(type='ReLU', inplace=True), + upsample_cfg=dict(mode='bilinear', align_corners=False)), + view_transform=dict( + type='DepthLSSTransform', + in_channels=256, + out_channels=80, + image_size=[256, 704], + feature_size=[32, 88], + xbound=[-54.0, 54.0, 0.3], + ybound=[-54.0, 54.0, 0.3], + zbound=[-10.0, 10.0, 20.0], + dbound=[1.0, 60.0, 0.5], + downsample=2), + fusion_layer=dict( + type='ConvFuser', in_channels=[80, 256], out_channels=256)) + +train_pipeline = [ + dict( + type='BEVLoadMultiViewImageFromFiles', + to_float32=True, + color_type='color', + backend_args=backend_args), + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + load_dim=5, + use_dim=5, + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='LoadAnnotations3D', + with_bbox_3d=True, + with_label_3d=True, + with_attr_label=False), + dict( + type='ImageAug3D', + final_dim=[256, 704], + resize_lim=[0.38, 0.55], + bot_pct_lim=[0.0, 0.0], + rot_lim=[-5.4, 5.4], + rand_flip=True, + is_train=True), + dict( + type='BEVFusionGlobalRotScaleTrans', + scale_ratio_range=[0.9, 1.1], + rot_range=[-0.78539816, 0.78539816], + translation_std=0.5), + dict(type='BEVFusionRandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict( + type='ObjectNameFilter', + classes=[ + 'car', 'truck', 'construction_vehicle', 'bus', 'trailer', + 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone' + ]), + # Actually, 'GridMask' is not used here + dict( + type='GridMask', + 
use_h=True, + use_w=True, + max_epoch=6, + rotate=1, + offset=False, + ratio=0.5, + mode=1, + prob=0.0, + fixed_prob=True), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ], + meta_keys=[ + 'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar', + 'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx', + 'lidar_path', 'img_path', 'transformation_3d_flow', 'pcd_rotation', + 'pcd_scale_factor', 'pcd_trans', 'img_aug_matrix', + 'lidar_aug_matrix', 'num_pts_feats' + ]) +] + +test_pipeline = [ + dict( + type='BEVLoadMultiViewImageFromFiles', + to_float32=True, + color_type='color', + backend_args=backend_args), + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + load_dim=5, + use_dim=5, + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='ImageAug3D', + final_dim=[256, 704], + resize_lim=[0.48, 0.48], + bot_pct_lim=[0.0, 0.0], + rot_lim=[0.0, 0.0], + rand_flip=False, + is_train=False), + dict( + type='PointsRangeFilter', + point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]), + dict( + type='Pack3DDetInputs', + keys=['img', 'points', 'gt_bboxes_3d', 'gt_labels_3d'], + meta_keys=[ + 'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar', + 'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx', + 'lidar_path', 'img_path', 'num_pts_feats' + ]) +] + +train_dataloader = dict( + dataset=dict( + dataset=dict(pipeline=train_pipeline, modality=input_modality))) +val_dataloader = dict( + dataset=dict(pipeline=test_pipeline, modality=input_modality)) +test_dataloader = val_dataloader + +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.33333333, + by_epoch=False, + begin=0, + end=500), + dict( + type='CosineAnnealingLR', + begin=0, + T_max=6, + end=6, + by_epoch=True, + eta_min_ratio=1e-4, + convert_to_iter_based=True), + # momentum scheduler + # During the first 8 epochs, momentum increases from 1 to 0.85 / 0.95 + # during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1 + dict( + type='CosineAnnealingMomentum', + eta_min=0.85 / 0.95, + begin=0, + end=2.4, + by_epoch=True, + convert_to_iter_based=True), + dict( + type='CosineAnnealingMomentum', + eta_min=1, + begin=2.4, + end=6, + by_epoch=True, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=6, val_interval=1) +val_cfg = dict() +test_cfg = dict() + + +# Default setting for scaling LR automatically +# - `enable` means enable scaling LR automatically +# or not by default. +# - `base_batch_size` = (8 GPUs) x (4 samples per GPU). 
+auto_scale_lr = dict(enable=False, base_batch_size=32) + +default_hooks = dict( + logger=dict(type='LoggerHook', interval=50), + checkpoint=dict(type='CheckpointHook', interval=1)) +del _base_.custom_hooks + +work_dir = './work_dirs/bevfusion_lidar-cam_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d' + +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) diff --git a/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py b/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..561097ebf5 --- /dev/null +++ b/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,18 @@ +_base_ = './bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py' + +# Override the point cloud backbone with FireRPFNet +# FireRPFNet is more memory-efficient than SECOND while maintaining good performance +# It uses Fire modules (SqueezeNet-style) with CBAM attention +model = dict( + pts_backbone=dict( + _delete_=True, # Completely replace the base backbone config + type='FireRPFNetV2', + in_channels=256, # Output channels from BEVFusionSparseEncoder + out_channels=[128, 256, 256, 512], # 4 stages with increasing channels + with_cbam=True, # Enable Channel and Spatial Attention + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)), + pts_neck=None +) + +# Update the work directory +work_dir = './work_dirs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d' diff --git a/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py b/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..76ad10fbf9 --- /dev/null +++ b/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,31 @@ +_base_ = './bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py' + +model = dict( + pts_backbone=dict( + _delete_=True, # Completely replace the base backbone config + type='FireRPFNetV2', + in_channels=256, # Output channels from BEVFusionSparseEncoder + out_channels=[128, 256, 256, 256], # 4 stages with multi-scale outputs + with_cbam=True, # Enable Channel and Spatial Attention + multi_scale_output=True, # CRITICAL: Enable multi-scale feature extraction + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)), + + pts_neck=dict( + _delete_=True, # Replace the base neck config + type='SECONDFPN', + in_channels=[128, 256, 256, 256], # Must match FireRPFNetV2 out_channels + out_channels=[128, 128, 128, 128], # Uniform output channels for fusion + upsample_strides=[1, 1, 1, 1], # CRITICAL: No upsampling (same resolution) + norm_cfg=dict(type='BN', eps=0.001, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False), + use_conv_for_no_stride=True), # Use 1x1 conv when stride=1 + + # Update bbox_head to match concatenated neck output + # SECONDFPN concatenates all outputs: 128 * 4 = 512 channels + bbox_head=dict( + in_channels=512, # 128 * 4 from SECONDFPN concatenation + ) +) + +# Update the work directory to distinguish from base config +work_dir = './work_dirs/bevfusion_lidar_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d'
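
As a quick sanity check on the backbone/neck wiring described above, the sketch below builds `FireRPFNetV2` (multi-scale) and `SECONDFPN` from their config dicts and verifies the concatenated 512-channel output expected by the bbox head. This is a minimal sketch, assuming this fork of mmdetection3d is installed so both modules are reachable through the registry; the 180x180 BEV grid is only an illustrative size for the 0.075 m voxels and the ±54 m range.

```python
# Minimal sketch: FireRPFNetV2 (multi_scale_output=True) + SECONDFPN should yield a
# single BEV map with 4 x 128 = 512 channels, matching bbox_head.in_channels above.
# Assumes this fork of mmdetection3d is installed; the input size is illustrative.
import torch

import mmdet3d.models  # noqa: F401  (imports the model packages so registration runs)
from mmdet3d.registry import MODELS

backbone = MODELS.build(
    dict(
        type='FireRPFNetV2',
        in_channels=256,
        out_channels=[128, 256, 256, 256],
        with_cbam=True,
        multi_scale_output=True))
neck = MODELS.build(
    dict(
        type='SECONDFPN',
        in_channels=[128, 256, 256, 256],
        out_channels=[128, 128, 128, 128],
        upsample_strides=[1, 1, 1, 1],
        use_conv_for_no_stride=True))

bev = torch.randn(1, 256, 180, 180)  # dummy BEV features from the sparse encoder
feats = backbone(bev)                # 4 feature maps, all at the input resolution
fused = neck(feats)                  # SECONDFPN concatenates the 4 levels

print([f.shape for f in feats])      # channels 128, 256, 256, 256 at 180x180
print(fused[0].shape)                # torch.Size([1, 512, 180, 180])
```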