diff --git a/FireRPFNetExtension.md b/FireRPFNetExtension.md new file mode 100644 index 0000000000..4024c4a7b1 --- /dev/null +++ b/FireRPFNetExtension.md @@ -0,0 +1,209 @@ +# FireRPFNet Models - Quick Start Guide + +Custom 3D object detection models using FireRPFNet architecture with Fire Modules, Residual connections, and CBAM attention. + +## ๐Ÿ”ฅ FireRPFNet Variants + +- **FireRPFNetV2**: Enhanced 3D LiDAR backbone with improved attention +- **FireRPFNet2D**: 2D image backbone variant for camera features + +**Plug-and-Play Design:** +- **FireRPFNetV2** can replace SECOND backbone in any model (BEVFusion is one example shown here) +- **FireRPFNet2D** can be used as an efficient image backbone in multi-modal architectures +- Simply update the backbone config to integrate into your existing models + +--- + +## ๐Ÿ“‹ Available Models + +| Model | Config | Image Backbone | LiDAR Backbone | Dataset | Modality | +|-------|--------|---------------|----------------|---------|----------| +| MVXNet-Squeeze | `configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py` | SQUEEZE | **FireRPFNetV2** | KITTI | Multi-modal | +| MVXNet-Fire2D | `configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py` | **FireRPFNet2D** | **FireRPFNetV2** | KITTI | Multi-modal | +| BEVFusion-Lidar | `projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py` | - | **FireRPFNetV2** | nuScenes | LiDAR-only | +| BEVFusion-Cam | `projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py` | Swin-T | **FireRPFNetV2** | nuScenes | Multi-modal | + +--- + +## ๐Ÿš€ Installation + +Follow the official MMDetection3D installation guide: https://mmdetection3d.readthedocs.io/en/latest/get_started.html + +**Quick Setup:** +```bash +# Install dependencies +pip install -U openmim +mim install mmengine +mim install 'mmcv>=2.0.0rc4' +mim install 'mmdet>=3.0.0' + +# Install mmdetection3d +cd mmdetection3d +pip install -v -e . 
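+
+# Optional sanity check (a small sketch, assuming the editable install above succeeded):
+python -c "import mmdet3d; print(mmdet3d.__version__)"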
+``` + +--- + +## ๐Ÿ“ฆ Dataset Setup + +### KITTI (MVXNet models) +```bash +# Download from http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d +# Organize: data/kitti/training/{image_2, velodyne, calib, label_2} + +# Create data infos +python tools/create_data.py kitti --root-path ./data/kitti --out-dir ./data/kitti --extra-tag kitti +``` + +### nuScenes (BEVFusion models) +```bash +# Download from https://www.nuscenes.org/download +# Organize: data/nuscenes/{samples, sweeps, v1.0-trainval} + +# Create data infos +python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes +``` + +--- + +## ๐Ÿ‹๏ธ Training Commands + +### MVXNet Models (KITTI) + +**Model 1: SqueezeFPN + FireRPFNetV2** +```bash +# Single GPU +python tools/train.py configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py + +``` +- Batch size: 2/GPU | Epochs: 20 | LR: 0.001 | Val: Every 5 epochs + +**Model 2: FireRPFNet2D + FireRPFNetV2** +```bash +# Single GPU +python tools/train.py configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py + +``` +- Batch size: 4/GPU | Epochs: 16 | LR: 0.001 | Val: Every 2 epochs | Early stopping enabled + +--- + +### BEVFusion Models (nuScenes) + +**Model 3: BEVFusion LiDAR-only + FireRPFNetV2** +```bash +# +python tools/train.py projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py +``` +- Batch size: 4/GPU | Epochs: 20 | LR: 0.0002 | Cyclic scheduler + +**Model 4: BEVFusion Multi-Modal + FireRPFNetV2** +```bash + +# With mixed precision +python tools/train.py \ + projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py \ + --amp +``` +- Batch size: 4/GPU (32 total) | Epochs: 6 | LR: 0.0002 | Val: Every epoch + +--- + +## ๐Ÿงช Testing + +### MVXNet Models +```bash +# Single GPU +python tools/test.py CONFIG CHECKPOINT + +# Multi-GPU +bash tools/dist_test.sh CONFIG CHECKPOINT 4 +``` + +**Examples:** +```bash +# Model 1 +python tools/test.py \ + configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py \ + work_dirs/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class/best_checkpoint.pth + +# Model 2 +bash tools/dist_test.sh \ + configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py \ + work_dirs/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class/best_checkpoint.pth 4 +``` + +### BEVFusion Models +```bash +# Model 3 +bash tools/dist_test.sh \ + projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py \ + work_dirs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d/best_checkpoint.pth 8 + +# Model 4 +bash tools/dist_test.sh \ + projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py \ + work_dirs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d/best_checkpoint.pth 8 +``` + +--- + +## ๐Ÿ’ก Tips + +**Resume Training:** +```bash +python tools/train.py CONFIG --resume work_dirs/MODEL_NAME/epoch_X.pth +``` + +**Specify GPUs:** +```bash +CUDA_VISIBLE_DEVICES=0,1,2,3 bash tools/dist_train.sh CONFIG 4 +``` + +**Debug Mode:** +```bash +python tools/train.py CONFIG \ + --cfg-options data.train_dataloader.num_workers=0 \ + data.train_dataloader.batch_size=1 +``` + +**Monitor Training:** +```bash +tensorboard --logdir=work_dirs/ +``` + +--- + +## ๐Ÿ› Common Issues + +**CUDA OOM:** Reduce batch size in config or via `--cfg-options data.train_dataloader.batch_size=1` + +**Dataset not found:** Verify paths and run `python tools/create_data.py` 
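+
+For example, you can sanity-check the expected KITTI layout before regenerating the infos (a quick sketch, assuming the default `data/kitti` root used above):
+
+```bash
+ls data/kitti/training            # expect: calib  image_2  label_2  velodyne
+ls data/kitti/kitti_infos_*.pkl   # info files produced by tools/create_data.py
+```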
+ +**Import errors:** Reinstall with `pip install -v -e .` + +--- + +## ๐Ÿ“š References + +- [MMDetection3D Documentation](https://mmdetection3d.readthedocs.io) +- [KITTI Dataset](http://www.cvlibs.net/datasets/kitti/) +- [nuScenes Dataset](https://www.nuscenes.org/) + +--- + +## ๐Ÿ“ Citation + +```bibtex +@article{firerpfnet2024, + title={FireRPFNet: Efficient 3D Object Detection with Fire Modules and Attention}, + author={Aravind Singh}, + journal={arXiv preprint}, + year={2024} +} +``` + +--- + +**Happy Training! ๐Ÿš€** + diff --git a/configs/_base_/models/centerpoint_pillar02_squeeze_squeezefpn_nus.py b/configs/_base_/models/centerpoint_pillar02_squeeze_squeezefpn_nus.py new file mode 100644 index 0000000000..04375a2a1f --- /dev/null +++ b/configs/_base_/models/centerpoint_pillar02_squeeze_squeezefpn_nus.py @@ -0,0 +1,91 @@ +voxel_size = [0.2, 0.2, 8] +model = dict( + type='CenterPoint', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_layer=dict( + max_num_points=20, + voxel_size=voxel_size, + max_voxels=(30000, 40000))), + pts_voxel_encoder=dict( + type='PillarFeatureNet', + in_channels=5, + feat_channels=[64], + with_distance=False, + voxel_size=(0.2, 0.2, 8), + norm_cfg=dict(type='BN1d', eps=1e-3, momentum=0.01), + legacy=False), + pts_middle_encoder=dict( + type='PointPillarsScatter', in_channels=64, output_shape=(512, 512)), + pts_backbone=dict( + type='SQUEEZE', + in_channels=64, + out_channels=[64, 128, 256 , 512], + #layer_nums=[3, 5, 5], + #layer_strides=[2, 2, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + pts_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], + out_channels=[512, 512, 512, 512], + #upsample_strides=[0.5, 1, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False), + #use_conv_for_no_stride=True + ), + pts_bbox_head=dict( + type='CenterHead', + in_channels=sum([128, 128, 128,128]), + #in_channels=256, + tasks=[ + dict(num_class=1, class_names=['car']), + dict(num_class=2, class_names=['truck', 'construction_vehicle']), + dict(num_class=2, class_names=['bus', 'trailer']), + dict(num_class=1, class_names=['barrier']), + dict(num_class=2, class_names=['motorcycle', 'bicycle']), + dict(num_class=2, class_names=['pedestrian', 'traffic_cone']), + ], + common_heads=dict( + reg=(2, 2), height=(1, 2), dim=(3, 2), rot=(2, 2), vel=(2, 2)), + share_conv_channel=64, + bbox_coder=dict( + type='CenterPointBBoxCoder', + post_center_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0], + max_num=500, + score_threshold=0.1, + out_size_factor=4, + voxel_size=voxel_size[:2], + code_size=9), + separate_head=dict( + type='SeparateHead', init_bias=-2.19, final_kernel=3), + loss_cls=dict(type='mmdet.GaussianFocalLoss', reduction='mean'), + loss_bbox=dict( + type='mmdet.L1Loss', reduction='mean', loss_weight=0.25), + norm_bbox=True), + # model training and testing settings + train_cfg=dict( + pts=dict( + grid_size=[512, 512, 1], + voxel_size=voxel_size, + out_size_factor=4, + dense_reg=1, + gaussian_overlap=0.1, + max_objs=500, + min_radius=2, + code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2])), + test_cfg=dict( + pts=dict( + post_center_limit_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0], + max_per_img=500, + max_pool_nms=False, + min_radius=[4, 12, 10, 1, 0.85, 0.175], + score_threshold=0.1, + pc_range=[-51.2, -51.2], + out_size_factor=4, + voxel_size=voxel_size[:2], + nms_type='rotate', + pre_max_size=1000, + 
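+            # rotated NMS: keep the top `pre_max_size` candidates and return at most `post_max_size` boxes after suppression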
post_max_size=83, + nms_thr=0.2))) diff --git a/configs/_base_/models/centerpoint_voxel01_squeeze_squeezefpn_nus.py b/configs/_base_/models/centerpoint_voxel01_squeeze_squeezefpn_nus.py new file mode 100644 index 0000000000..c36e7de268 --- /dev/null +++ b/configs/_base_/models/centerpoint_voxel01_squeeze_squeezefpn_nus.py @@ -0,0 +1,46 @@ +model = dict( + type='VoxelNet', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_layer=dict( + max_num_points=5, + point_cloud_range=[0, -40, -3, 70.4, 40, 1], + voxel_size=[0.05, 0.05, 0.1], + max_voxels=(16000, 40000))), + voxel_encoder=dict(type='HardSimpleVFE'), + middle_encoder=dict( + type='SparseEncoder', + in_channels=4, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + backbone=dict( + type='SQUEEZE', + in_channels=3, + out_channels=[64, 128, 256, 512], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256, 512], + out_channels=[256, 256, 256, 256], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False), + conv_cfg=dict(type='Conv2d', bias=False)), + bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, + feat_channels=256, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[[0, -40, -1.8, 70.4, 40, -1.8]], + sizes=[[1.6, 3.9, 1.56]], + rotations=[0, 1.57]), + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict(type='FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25), + loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0), + loss_dir=dict(type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.2)), + train_cfg=dict(assigner=dict(type='MaxIoUAssigner')), + test_cfg=dict(use_rotate_nms=True, nms_across_levels=False, nms_pre=1000, nms_thr=0.01, score_thr=0.1, min_bbox_size=0, max_num=500) +) diff --git a/configs/centerpoint/centerpoint_pillar02_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py b/configs/centerpoint/centerpoint_pillar02_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..4ed2c5df84 --- /dev/null +++ b/configs/centerpoint/centerpoint_pillar02_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,253 @@ +_base_ = [ + '../_base_/datasets/nus-3d.py', + '../_base_/models/centerpoint_pillar02_squeeze_squeezefpn_nus.py', + '../_base_/schedules/cyclic-20e.py', '../_base_/default_runtime.py' +] + +# If point cloud range is changed, the models should also change their point +# cloud range accordingly +point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0] +# Using calibration info convert the Lidar-coordinate point cloud range to the +# ego-coordinate point cloud range could bring a little promotion in nuScenes. 
+# point_cloud_range = [-51.2, -52, -5.0, 51.2, 50.4, 3.0] +# For nuScenes we usually do 10-class detection +class_names = [ + 'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', + 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone' +] +data_prefix = dict(pts='samples/LIDAR_TOP', img='', sweeps='sweeps/LIDAR_TOP') +model = dict( + data_preprocessor=dict( + voxel_layer=dict(point_cloud_range=point_cloud_range)), + pts_voxel_encoder=dict(point_cloud_range=point_cloud_range), + pts_bbox_head=dict(bbox_coder=dict(pc_range=point_cloud_range[:2])), + # model training and testing settings + train_cfg=dict(pts=dict(point_cloud_range=point_cloud_range)), + test_cfg=dict(pts=dict(pc_range=point_cloud_range[:2]))) + +dataset_type = 'NuScenesDataset' +data_root = 'data/nuscenes/' +backend_args = None + +db_sampler = dict( + data_root=data_root, + info_path=data_root + 'nuscenes_dbinfos_train.pkl', + rate=1.0, + prepare=dict( + filter_by_difficulty=[-1], + filter_by_min_points=dict( + car=5, + truck=5, + bus=5, + trailer=5, + construction_vehicle=5, + traffic_cone=5, + barrier=5, + motorcycle=5, + bicycle=5, + pedestrian=5)), + classes=class_names, + sample_groups=dict( + car=2, + truck=3, + construction_vehicle=7, + bus=4, + trailer=6, + barrier=2, + motorcycle=6, + bicycle=6, + pedestrian=2, + traffic_cone=2), + points_loader=dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=[0, 1, 2, 3, 4], + backend_args=backend_args), + backend_args=backend_args) + +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + use_dim=[0, 1, 2, 3, 4], + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + dict(type='ObjectSample', db_sampler=db_sampler), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.3925, 0.3925], + scale_ratio_range=[0.95, 1.05], + translation_std=[0, 0, 0]), + dict( + type='RandomFlip3D', + sync_2d=False, + flip_ratio_bev_horizontal=0.5, + flip_ratio_bev_vertical=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectNameFilter', classes=class_names), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=['points', 'gt_bboxes_3d', 'gt_labels_3d']) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + use_dim=[0, 1, 2, 3, 4], + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1333, 800), + pts_scale_ratio=1, + flip=False, + transforms=[ + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D') + ]), + dict(type='Pack3DDetInputs', keys=['points']) +] + +train_dataloader = dict( + batch_size=4, + dataset=dict( + ann_file='nuscenes_infos_train.pkl', + backend_args=None, + box_type_3d='LiDAR', + data_prefix=dict( + img='', pts='samples/LIDAR_TOP', sweeps='sweeps/LIDAR_TOP'), + data_root='data/nuscenes/', + metainfo=dict(classes=[ + 'car', + 'truck', + 'construction_vehicle', + 'bus', + 'trailer', + 'barrier', + 'motorcycle', + 'bicycle', + 'pedestrian', + 'traffic_cone', + ]), + 
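+        # LiDAR-only input for this variant: camera images are not loaded (use_camera=False)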
modality=dict(use_camera=False, use_lidar=True), + pipeline=[ + dict( + backend_args=None, + coord_type='LIDAR', + load_dim=5, + type='LoadPointsFromFile', + use_dim=5), + dict( + backend_args=None, + pad_empty_sweeps=True, + remove_close=True, + sweeps_num=9, + type='LoadPointsFromMultiSweeps', + use_dim=[ + 0, + 1, + 2, + 3, + 4, + ]), + dict( + type='LoadAnnotations3D', + with_bbox_3d=True, + with_label_3d=True), + dict( + rot_range=[ + -0.3925, + 0.3925, + ], + scale_ratio_range=[ + 0.95, + 1.05, + ], + translation_std=[ + 0, + 0, + 0, + ], + type='GlobalRotScaleTrans'), + dict( + flip_ratio_bev_horizontal=0.5, + flip_ratio_bev_vertical=0.5, + sync_2d=False, + type='RandomFlip3D'), + dict( + point_cloud_range=[ + -51.2, + -51.2, + -5.0, + 51.2, + 51.2, + 3.0, + ], + type='PointsRangeFilter'), + dict( + point_cloud_range=[ + -51.2, + -51.2, + -5.0, + 51.2, + 51.2, + 3.0, + ], + type='ObjectRangeFilter'), + dict( + classes=[ + 'car', + 'truck', + 'construction_vehicle', + 'bus', + 'trailer', + 'barrier', + 'motorcycle', + 'bicycle', + 'pedestrian', + 'traffic_cone', + ], + type='ObjectNameFilter'), + dict(type='PointShuffle'), + dict( + keys=[ + 'points', + 'gt_bboxes_3d', + 'gt_labels_3d', + ], + type='Pack3DDetInputs'), + ], + test_mode=False, + type='NuScenesDataset', + use_valid_flag=True), + num_workers=4, + persistent_workers=True, + sampler=dict(shuffle=True, type='DefaultSampler')) +test_dataloader = dict( + dataset=dict(pipeline=test_pipeline, metainfo=dict(version='v1.0-mini', classes=class_names))) +val_dataloader = dict( + dataset=dict(pipeline=test_pipeline, metainfo=dict(version='v1.0-mini', classes=class_names))) + +train_cfg = dict(by_epoch=True, max_epochs=20, val_interval=20) diff --git a/configs/centerpoint/centerpoint_voxel01_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py b/configs/centerpoint/centerpoint_voxel01_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..6c423cbbcd --- /dev/null +++ b/configs/centerpoint/centerpoint_voxel01_squeeze_squeezefpn_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,160 @@ +_base_ = [ + '../_base_/datasets/nus-3d.py', + '../_base_/models/centerpoint_voxel01_squeeze_squeezefpn_nus.py', + '../_base_/schedules/cyclic-20e.py', '../_base_/default_runtime.py' +] + +# If point cloud range is changed, the models should also change their point +# cloud range accordingly +point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0] +# Using calibration info convert the Lidar-coordinate point cloud range to the +# ego-coordinate point cloud range could bring a little promotion in nuScenes. 
+# point_cloud_range = [-51.2, -52, -5.0, 51.2, 50.4, 3.0] +# For nuScenes we usually do 10-class detection +class_names = [ + 'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', + 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone' +] +data_prefix = dict(pts='samples/LIDAR_TOP', img='', sweeps='sweeps/LIDAR_TOP') +model = dict( + data_preprocessor=dict( + voxel_layer=dict(point_cloud_range=point_cloud_range)), + pts_bbox_head=dict(bbox_coder=dict(pc_range=point_cloud_range[:2])), + # model training and testing settings + train_cfg=dict(pts=dict(point_cloud_range=point_cloud_range)), + test_cfg=dict(pts=dict(pc_range=point_cloud_range[:2]))) + +dataset_type = 'NuScenesDataset' +data_root = 'data/nuscenes/' +backend_args = None + +db_sampler = dict( + data_root=data_root, + info_path=data_root + 'nuscenes_dbinfos_train.pkl', + rate=1.0, + prepare=dict( + filter_by_difficulty=[-1], + filter_by_min_points=dict( + car=5, + truck=5, + bus=5, + trailer=5, + construction_vehicle=5, + traffic_cone=5, + barrier=5, + motorcycle=5, + bicycle=5, + pedestrian=5)), + classes=class_names, + sample_groups=dict( + car=2, + truck=3, + construction_vehicle=7, + bus=4, + trailer=6, + barrier=2, + motorcycle=6, + bicycle=6, + pedestrian=2, + traffic_cone=2), + points_loader=dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=[0, 1, 2, 3, 4], + backend_args=backend_args), + backend_args=backend_args) + +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + use_dim=[0, 1, 2, 3, 4], + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + dict(type='ObjectSample', db_sampler=db_sampler), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.3925, 0.3925], + scale_ratio_range=[0.95, 1.05], + translation_std=[0, 0, 0]), + dict( + type='RandomFlip3D', + sync_2d=False, + flip_ratio_bev_horizontal=0.5, + flip_ratio_bev_vertical=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectNameFilter', classes=class_names), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=['points', 'gt_bboxes_3d', 'gt_labels_3d']) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + use_dim=[0, 1, 2, 3, 4], + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1333, 800), + pts_scale_ratio=1, + flip=False, + transforms=[ + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range) + ]), + dict(type='Pack3DDetInputs', keys=['points']) +] + +train_dataloader = dict( + _delete_=True, + batch_size=4, + num_workers=4, + persistent_workers=True, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='CBGSDataset', + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='nuscenes_infos_train.pkl', + pipeline=train_pipeline, + metainfo=dict(classes=class_names), + test_mode=False, + data_prefix=data_prefix, + use_valid_flag=True, + 
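+            # use_valid_flag keeps only ground-truth boxes containing at least one LiDAR/radar point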
# we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. + box_type_3d='LiDAR', + backend_args=backend_args))) +test_dataloader = dict( + dataset=dict(pipeline=test_pipeline, metainfo=dict(classes=class_names))) +val_dataloader = dict( + dataset=dict(pipeline=test_pipeline, metainfo=dict(classes=class_names))) + +train_cfg = dict(val_interval=20) diff --git a/configs/mvxnet/mvxnet_efficiency_es_fpn_fire_rpfnet_kitti-3d-3class.py b/configs/mvxnet/mvxnet_efficiency_es_fpn_fire_rpfnet_kitti-3d-3class.py new file mode 100644 index 0000000000..03774f5541 --- /dev/null +++ b/configs/mvxnet/mvxnet_efficiency_es_fpn_fire_rpfnet_kitti-3d-3class.py @@ -0,0 +1,175 @@ +# MVX-Net | EfficientNet-ES + FPN (camera) | Fire-RPFNet (LiDAR) +# KITTI 3-class full stand-alone config. + +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# ----------------------------------------------------------------------------- +# Geometry +# ----------------------------------------------------------------------------- +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +# ----------------------------------------------------------------------------- +# Model definition +# ----------------------------------------------------------------------------- +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict(max_num_points=-1, point_cloud_range=point_cloud_range, + voxel_size=voxel_size, max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], std=[1., 1., 1.], + bgr_to_rgb=False, pad_size_divisor=32), + + # ---------------- Camera branch ---------------- + img_backbone=dict( + type='mmdet.EfficientNet', # torchvision impl + arch='es', # efficientnet-es (small, fast) + out_indices=(0, 3, 5, 6), # C1,C3,C5,C6 like MVX example + frozen_stages=1, + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True), + img_neck=dict( + type='mmdet.FPN', + in_channels=[32, 48, 192, 1280], # channels for eff-es layers + out_channels=512, + num_outs=4, + norm_cfg=dict(type='BN', requires_grad=False)), + + # ---------------- LiDAR voxel encoder -------------- + pts_voxel_encoder=dict( + type='DynamicVFE', in_channels=4, feat_channels=[64, 64], + with_distance=False, voxel_size=voxel_size, + with_cluster_center=True, with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', img_channels=512, pts_channels=64, + mid_channels=128, out_channels=128, img_levels=[0,1,2,3], + align_corners=False, activate_out=True, fuse_out=False)), + + # ---------------- Sparse middle encoder ------------ + pts_middle_encoder=dict( + type='SparseEncoder', in_channels=128, + sparse_shape=[41, 1600, 1408], order=('conv', 'norm', 'act')), + + # ---------------- Fire-RPFNet backbone ------------- + pts_backbone=dict( + #type='RPFNet', + type='FireRPFNet', + in_channels=256, + layer_channels=[128, 256, 256, 256], with_cbam=True), + pts_neck=None, + + # ---------------- Anchor head ---------------------- + pts_bbox_head=dict( + type='Anchor3DHead', num_classes=3, + in_channels=256, feat_channels=256, use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[[0,-40,-0.6,70.4,40,-0.6], + [0,-40,-0.6,70.4,40,-0.6], + [0,-40,-1.78,70.4,40,-1.78]], + sizes=[[0.8,0.6,1.73],[1.76,0.6,1.73],[3.9,1.6,1.56]], + rotations=[0, 1.57], reshape_out=False), + 
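+        # one anchor size per class (Pedestrian, Cyclist, Car), each placed at yaw 0 and ~pi/2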
assigner_per_size=True, diff_rad_by_sin=True, assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict(type='mmdet.FocalLoss', use_sigmoid=True, gamma=2.0, + alpha=0.25, loss_weight=1.0), + loss_bbox=dict(type='mmdet.SmoothL1Loss', beta=1.0/9.0, loss_weight=2.0), + loss_dir=dict(type='mmdet.CrossEntropyLoss', use_sigmoid=False, loss_weight=0.2)), + + train_cfg=dict( + pts=dict(assigner=[ + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, neg_iou_thr=0.45, min_pos_iou=0.45, ignore_iof_thr=-1)], + allowed_border=0, pos_weight=-1, debug=False)), + test_cfg=dict( + pts=dict(use_rotate_nms=True, nms_across_levels=False, nms_thr=0.01, + score_thr=0.1, min_bbox_size=0, nms_pre=100, max_num=50)) +) + +# ----------------------------------------------------------------------------- +# Dataset & pipelines +# ----------------------------------------------------------------------------- + +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +train_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, + with_bbox=True, with_label=True), + dict(type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[-0.78539816,0.78539816], + scale_ratio_range=[0.95,1.05], translation_std=[0.2,0.2,0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict(type='Pack3DDetInputs', keys=['points','img','gt_bboxes_3d','gt_labels_3d', + 'gt_bboxes','gt_labels']) +] + +test_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='MultiScaleFlipAug3D', img_scale=(1280,384), pts_scale_ratio=1, + flip=False, transforms=[ + dict(type='Resize', scale=0, keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[0,0], scale_ratio_range=[1.,1.], + translation_std=[0,0,0]), + dict(type='RandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points','img']) +] + +modality = dict(use_lidar=True, use_camera=True) + +train_dataloader = dict( + batch_size=2, num_workers=4, sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict(type='RepeatDataset', times=2, dataset=dict( + type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, filter_empty_gt=False, metainfo=metainfo, + box_type_3d='LiDAR', backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, num_workers=1, 
sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict(type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, metainfo=metainfo, test_mode=True, + box_type_3d='LiDAR', backend_args=backend_args)) + +test_dataloader = val_dataloader + +# ----------------------------------------------------------------------------- +# Optimizer / runtime +# ----------------------------------------------------------------------------- +optim_wrapper = dict(optimizer=dict(lr=0.001, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +val_evaluator = dict(type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict(type='Det3DLocalVisualizer', vis_backends=vis_backends, + name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=5) diff --git a/configs/mvxnet/mvxnet_efficiency_es_fpn_second_fpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_efficiency_es_fpn_second_fpn_kitti-3d-3class.py new file mode 100644 index 0000000000..358b67f6a7 --- /dev/null +++ b/configs/mvxnet/mvxnet_efficiency_es_fpn_second_fpn_kitti-3d-3class.py @@ -0,0 +1,275 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.EfficientNet', # Use EfficientNet + arch='es', # Choose the EfficientNet variant (b0, b1, b2, etc.) 
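+        # NOTE: img_neck.in_channels below must match the feature channels this backbone emits at the selected out_indices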
+ out_indices=(0, 3, 5, 6), # You can change this depending on which layers you need + frozen_stages=1, # Freeze the first stage (if needed) + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + ), # Important: Use 'pytorch' style + img_neck=dict( + type='mmdet.FPN', + in_channels=[32, 48, 192, 1280], # Correct in_channels for EfficientNet es + out_channels=512, + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], # Adjust if the number of FPN outputs changes + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, # Might need adjustment + feat_channels=512, # Might need adjustment + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + train_cfg=dict( + pts=dict( + assigner=[ + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], 
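+        # image scale is sampled between (320, 96) and (1280, 384), i.e. 0.25x-1.0x of the full training resolution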
keep_ratio=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_efficiency_es_fpn_squeeze_fpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_efficiency_es_fpn_squeeze_fpn_kitti-3d-3class.py new file mode 100644 index 0000000000..bb001db24f --- /dev/null +++ b/configs/mvxnet/mvxnet_efficiency_es_fpn_squeeze_fpn_kitti-3d-3class.py @@ -0,0 +1,279 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.EfficientNet', # Use EfficientNet + arch='es', # Choose the EfficientNet variant (b0, b1, b2, etc.) 
+ out_indices=(0, 3, 5, 6), # You can change this depending on which layers you need + frozen_stages=1, # Freeze the first stage (if needed) + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + ), # Important: Use 'pytorch' style + img_neck=dict( + type='mmdet.FPN', + in_channels=[32, 48, 192, 1280], # Correct in_channels for EfficientNet es + out_channels=512, + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], # Adjust if the number of FPN outputs changes + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SQUEEZE', + in_channels=256, + out_channels=[64, 128, 256 , 512], + #layer_nums=[3, 5, 5], + #layer_strides=[2, 2, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + pts_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], + out_channels=[512, 512, 512, 512], + #upsample_strides=[0.5, 1, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False)), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, # Might need adjustment + feat_channels=512, # Might need adjustment + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + train_cfg=dict( + pts=dict( + assigner=[ + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + 
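+        # KITTI .bin point clouds store (x, y, z, reflectance), hence load_dim=4 / use_dim=4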
load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_efficiency_fpn_second_fpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_efficiency_fpn_second_fpn_kitti-3d-3class.py new file mode 100644 index 0000000000..cdfdd134b5 --- /dev/null +++ b/configs/mvxnet/mvxnet_efficiency_fpn_second_fpn_kitti-3d-3class.py @@ -0,0 +1,275 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.EfficientNet', # Use EfficientNet + arch='b2', # Choose the EfficientNet variant (b0, b1, b2, etc.) 
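+        # NOTE: keep img_neck.in_channels below in sync with the feature channels of the chosen EfficientNet variant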
+ out_indices=(0, 3, 5, 6), # You can change this depending on which layers you need + frozen_stages=1, # Freeze the first stage (if needed) + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + ), # Important: Use 'pytorch' style + img_neck=dict( + type='mmdet.FPN', + in_channels=[32, 48, 352, 1408], # Correct in_channels for EfficientNet b0 + out_channels=512, + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], # Adjust if the number of FPN outputs changes + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, # Might need adjustment + feat_channels=512, # Might need adjustment + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + train_cfg=dict( + pts=dict( + assigner=[ + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], 
keep_ratio=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_efficiency_fpn_squeeze_fpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_efficiency_fpn_squeeze_fpn_kitti-3d-3class.py new file mode 100644 index 0000000000..cfeaf4d2b4 --- /dev/null +++ b/configs/mvxnet/mvxnet_efficiency_fpn_squeeze_fpn_kitti-3d-3class.py @@ -0,0 +1,279 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.EfficientNet', # Use EfficientNet + arch='b2', # Choose the EfficientNet variant (b0, b1, b2, etc.) 
+ out_indices=(0, 3, 5, 6), # You can change this depending on which layers you need + frozen_stages=1, # Freeze the first stage (if needed) + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + ), # Important: Use 'pytorch' style + img_neck=dict( + type='mmdet.FPN', + in_channels=[32, 48, 352, 1408], # Correct in_channels for EfficientNet b0 + out_channels=512, + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], # Adjust if the number of FPN outputs changes + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SQUEEZE', + in_channels=256, + out_channels=[64, 128, 256 , 512], + #layer_nums=[3, 5, 5], + #layer_strides=[2, 2, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + pts_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], + out_channels=[512, 512, 512, 512], + #upsample_strides=[0.5, 1, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False)), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, # Might need adjustment + feat_channels=512, # Might need adjustment + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + train_cfg=dict( + pts=dict( + assigner=[ + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + 
load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py b/configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py new file mode 100644 index 0000000000..59fbf66703 --- /dev/null +++ b/configs/mvxnet/mvxnet_firerpfnet2dfpn_fire_rpfnet_kitti-3d-3class.py @@ -0,0 +1,241 @@ +# MVX-Net with FireRPFNet2D (image) + FireRPFNetV2 (LiDAR) +# Full Fire+CBAM pipeline for both modalities +# KITTI 3-class (Car, Pedestrian, Cyclist) + +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# ----------------------------------------------------------------------------- +# Geometry +# ----------------------------------------------------------------------------- +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +# ----------------------------------------------------------------------------- +# Model +# ----------------------------------------------------------------------------- +model = dict( + type='DynamicMVXFasterRCNN', + # -------------------------------------------------- + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + + # ----------------------- FireRPFNet2D image branch ----------------------- + img_backbone=dict( + type='FireRPFNet2D', + in_channels=3, + out_channels=[64, 128, 256, 512], # Multi-scale outputs + blocks_per_stage=[2, 2, 2, 2], # 2 Fire blocks per stage + with_cbam=True, # Enable CBAM attention + stem_channels=64, + out_indices=(0, 1, 2, 3), # Output all 4 scales + frozen_stages=-1, # No frozen stages + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + norm_eval=False), + + img_neck=dict( + type='mmdet.FPN', + 
in_channels=[64, 128, 256, 512], # From FireRPFNet2D stages + out_channels=256, # Unified output channels + num_outs=4, + norm_cfg=dict(type='BN', requires_grad=False)), + + # ----------------------- LiDAR voxel encoder ---------------- + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, # From FPN + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3], + align_corners=False, + activate_out=True, + fuse_out=False)), + + # ----------------------- Sparse middle encoder -------------- + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + + # ----------------------- FireRPFNetV2 backbone ------------- + pts_backbone=dict( + type='FireRPFNetV2', + in_channels=256, # output of SparseEncoder + out_channels=[128, 256, 256, 256], + with_cbam=True, + multi_scale_output=False), # Single-scale output + + pts_neck=None, # No additional neck needed + + # ----------------------- Anchor head ------------------------ + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, + feat_channels=256, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, + loss_weight=1.0), + loss_bbox=dict(type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, + loss_weight=2.0), + loss_dir=dict(type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + + # ----------------------- Train / Test cfg ------------------- + train_cfg=dict( + pts=dict( + assigner=[ + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, neg_iou_thr=0.45, min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, pos_weight=-1, debug=False)), + + test_cfg=dict( + pts=dict(use_rotate_nms=True, nms_across_levels=False, nms_thr=0.01, + score_thr=0.1, min_bbox_size=0, nms_pre=100, max_num=50)) +) + +# ----------------------------------------------------------------------------- +# Dataset & pipelines +# ----------------------------------------------------------------------------- + +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +train_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, 
with_label_3d=True, + with_bbox=True, with_label=True), + dict(type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict(type='Pack3DDetInputs', keys=['points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', + 'gt_bboxes', 'gt_labels']) +] + +test_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='MultiScaleFlipAug3D', img_scale=(1280, 384), pts_scale_ratio=1, + flip=False, + transforms=[ + dict(type='Resize', scale=0, keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[0, 0], scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] + +modality = dict(use_lidar=True, use_camera=True) + +train_dataloader = dict( + batch_size=4, num_workers=2, sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict(type='RepeatDataset', times=1, dataset=dict( + type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, filter_empty_gt=False, metainfo=metainfo, + box_type_3d='LiDAR', backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, num_workers=1, sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict(type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, metainfo=metainfo, test_mode=True, + box_type_3d='LiDAR', backend_args=backend_args)) + +test_dataloader = val_dataloader + +# ----------------------------------------------------------------------------- +# Optimizer / Schedulers / Runtime +# ----------------------------------------------------------------------------- +optim_wrapper = dict(optimizer=dict(lr=0.001, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +val_evaluator = dict(type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +# Add EarlyStoppingHook +custom_hooks = [ + dict( + type='EarlyStoppingHook', + monitor='Kitti metric/pred_instances_3d/KITTI/Car_3D_AP40_moderate_strict', + patience=5, # Number of epochs to wait before stopping + rule='greater', # Stop when the metric stops increasing + min_delta=0.001, # Minimum change to qualify as improvement + ) +] +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=16, val_interval=2) + + +# Add checkpoint configuration +default_hooks = dict( + checkpoint=dict( + type='CheckpointHook', + interval=2, + save_best='Kitti metric/pred_instances_3d/KITTI/Car_3D_AP40_moderate_strict', + rule='greater', + max_keep_ckpts=15 # Keep only the best 5 checkpoints + ) +) \ No newline at end of file diff --git 
a/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_320x92_kitti-3d-3class.py b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_320x92_kitti-3d-3class.py new file mode 100644 index 0000000000..a006107aa7 --- /dev/null +++ b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_320x92_kitti-3d-3class.py @@ -0,0 +1,277 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.ResNet', + depth=50, + num_stages=4, + out_indices=(0, 1, 2, 3), + frozen_stages=1, + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + style='caffe'), + img_neck=dict( + type='mmdet.FPN', + in_channels=[256, 512, 1024, 2048], + out_channels=256, + # make the image features more stable numerically to avoid loss nan + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + 
min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class.py b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class.py index f6c750d9f7..5ea62980bc 100644 --- a/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class.py +++ b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class.py @@ -269,5 +269,7 @@ visualizer = dict( type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + # You may need to download the model first is the network is unstable -load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_nus-3d-3class.py b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_nus-3d-3class.py new file mode 100644 index 0000000000..944eb56d5d --- /dev/null +++ b/configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_nus-3d-3class.py @@ -0,0 +1,273 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.ResNet', + depth=50, + 
num_stages=4, + out_indices=(0, 1, 2, 3), + frozen_stages=1, + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + style='caffe'), + img_neck=dict( + type='mmdet.FPN', + in_channels=[256, 512, 1024, 2048], + out_channels=256, + # make the image features more stable numerically to avoid loss nan + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=dict( + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.3, + min_pos_iou=0.3, + ignore_iof_thr=-1), + allowed_border=0, + code_weight=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2], + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +#dataset_type = 'KittiDataset' +#data_root = 'data/kitti/' +#class_names = ['Pedestrian', 'Cyclist', 'Car'] +dataset_type = 'NuScenesDataset' +data_root = 'data/nuscenes/' +class_names = [ + 'car', 'truck', 'trailer', 'bus', 'construction_vehicle', 'bicycle', + 'motorcycle', 'pedestrian', 'traffic_cone', 'barrier' +] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +#data_prefix = dict(pts='samples/LIDAR_TOP', img='', sweeps='sweeps/LIDAR_TOP') +data_prefix = dict( + pts='samples/LIDAR_TOP', + CAM_FRONT='samples/CAM_FRONT', + sweeps='sweeps/LIDAR_TOP') +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + #dict(type='LoadImageFromFileMono3D', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, 
with_label_3d=True), + dict( + type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + #dict(type='LoadImageFromFileMono3D', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + #dict(type='Pack3DDetInputs', keys=['points', 'img']) + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='nuscenes_infos_train.pkl', + pipeline=train_pipeline, + metainfo=metainfo, + modality=input_modality, + #modality=modality, + #data_prefix=dict( + # pts='training/velodyne_reduced', img='training/image_2'), + test_mode=False, + data_prefix=data_prefix, + default_cam_key='CAM_FRONT', + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args)) + + +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='nuscenes_infos_val.pkl', + pipeline=test_pipeline, + metainfo=metainfo, + modality=input_modality, + #modality=modality, + #data_prefix=dict( + # pts='training/velodyne_reduced', img='training/image_2'), + data_prefix=data_prefix, + default_cam_key='CAM_FRONT', + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +val_dataloader = test_dataloader + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +#val_evaluator = dict( +# type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +val_evaluator = dict( + type='NuScenesMetric', + data_root=data_root, + ann_file=data_root + 'nuscenes_infos_val.pkl', + metric='bbox', + backend_args=backend_args) +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_320x92_kitti-3d-3class.py b/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_320x92_kitti-3d-3class.py new file mode 100644 index 0000000000..70176a2749 --- /dev/null +++ b/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_320x92_kitti-3d-3class.py @@ -0,0 +1,281 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.ResNet', + depth=50, + num_stages=4, + out_indices=(0, 1, 2, 3), + frozen_stages=1, + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True, + style='caffe'), + img_neck=dict( + type='mmdet.FPN', + in_channels=[256, 512, 1024, 2048], + out_channels=256, + # make the image features more stable numerically to avoid loss nan + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SQUEEZE', + in_channels=256, + out_channels=[64, 128, 256 , 512], + #layer_nums=[3, 5, 5], + #layer_strides=[2, 2, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', 
bias=False)), + pts_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], + out_channels=[512, 512, 512, 512], + #upsample_strides=[0.5, 1, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False)), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), 
+ dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. + box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_8xb2-80e_kitti-3d-3class.py b/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_8xb2-80e_kitti-3d-3class.py new file mode 100644 index 0000000000..bed54ef717 --- /dev/null +++ b/configs/mvxnet/mvxnet_fpn_dv_second_squeezefpn_8xb2-80e_kitti-3d-3class.py @@ -0,0 +1,279 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.ResNet', + depth=50, + num_stages=4, + out_indices=(0, 1, 2, 3), + frozen_stages=1, + norm_cfg=dict(type='BN', 
requires_grad=False), + norm_eval=True, + style='caffe'), + img_neck=dict( + type='mmdet.FPN', + in_channels=[256, 512, 1024, 2048], + out_channels=256, + # make the image features more stable numerically to avoid loss nan + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SQUEEZE', + in_channels=256, + out_channels=[64, 128, 256 , 512], + #layer_nums=[3, 5, 5], + #layer_strides=[2, 2, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + pts_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], + out_channels=[512, 512, 512, 512], + #upsample_strides=[0.5, 1, 2], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False)), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, 
with_label_3d=True), + dict( + type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_mobilenetv2_fpn_fire_rpfnet_kitti-3d-3class.py b/configs/mvxnet/mvxnet_mobilenetv2_fpn_fire_rpfnet_kitti-3d-3class.py new file mode 100644 index 0000000000..a9a1e13151 --- /dev/null +++ b/configs/mvxnet/mvxnet_mobilenetv2_fpn_fire_rpfnet_kitti-3d-3class.py @@ -0,0 +1,181 @@ +# Full MVX-Net config: MobileNetV2+FPN (camera) + RPFNet (LiDAR). 
+# KITTI 3-class (Car, Pedestrian, Cyclist) + +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# ----------------------------------------------------------------------------- +# Geometry +# ----------------------------------------------------------------------------- +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +# ----------------------------------------------------------------------------- +# Model +# ----------------------------------------------------------------------------- +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict(max_num_points=-1, point_cloud_range=point_cloud_range, + voxel_size=voxel_size, max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, pad_size_divisor=32), + + # ----------------------- image branch ----------------------- + img_backbone=dict( + type='mmdet.MobileNetV2', + out_indices=(0, 1, 2, 3), + frozen_stages=1, + norm_cfg=dict(type='BN', requires_grad=False), + norm_eval=True), + img_neck=dict( + type='mmdet.FPN', + in_channels=[16, 24, 32, 64], + out_channels=256, + num_outs=4, + norm_cfg=dict(type='BN', requires_grad=False)), + + # ----------------------- LiDAR voxel encoder ---------------- + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', img_channels=256, pts_channels=64, + mid_channels=128, out_channels=128, img_levels=[0,1,2,3], + align_corners=False, activate_out=True, fuse_out=False)), + + # ----------------------- Sparse middle encoder -------------- + pts_middle_encoder=dict( + type='SparseEncoder', in_channels=128, + sparse_shape=[41, 1600, 1408], order=('conv', 'norm', 'act')), + + # ----------------------- FireRPFNet backbone -------------------- + pts_backbone=dict( + # type='RPFNet', + type='FireRPFNet', + in_channels=256, + layer_channels=[128, 256, 256, 256], with_cbam=True), + pts_neck=None, + + # ----------------------- Anchor head ------------------------ + pts_bbox_head=dict( + type='Anchor3DHead', num_classes=3, + in_channels=256, feat_channels=256, use_direction_classifier=True, + anchor_generator=dict(type='Anchor3DRangeGenerator', + ranges=[[0,-40,-0.6,70.4,40,-0.6],[0,-40,-0.6,70.4,40,-0.6], + [0,-40,-1.78,70.4,40,-1.78]], + sizes=[[0.8,0.6,1.73],[1.76,0.6,1.73],[3.9,1.6,1.56]], + rotations=[0,1.57], reshape_out=False), + assigner_per_size=True, diff_rad_by_sin=True, assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict(type='mmdet.FocalLoss', use_sigmoid=True, gamma=2.0, + alpha=0.25, loss_weight=1.0), + loss_bbox=dict(type='mmdet.SmoothL1Loss', beta=1.0/9.0, + loss_weight=2.0), + loss_dir=dict(type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + + train_cfg=dict( + pts=dict(assigner=[ + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, neg_iou_thr=0.45, min_pos_iou=0.45, + 
ignore_iof_thr=-1)], allowed_border=0, pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict(use_rotate_nms=True, nms_across_levels=False, nms_thr=0.01, + score_thr=0.1, min_bbox_size=0, nms_pre=100, max_num=50)) +) + +# ----------------------------------------------------------------------------- +# Dataset & pipelines (unchanged) +# ----------------------------------------------------------------------------- + +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +train_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, + with_bbox=True, with_label=True), + dict(type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[-0.78539816,0.78539816], + scale_ratio_range=[0.95,1.05], translation_std=[0.2,0.2,0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict(type='Pack3DDetInputs', keys=['points','img','gt_bboxes_3d','gt_labels_3d', + 'gt_bboxes','gt_labels']) +] + +test_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='MultiScaleFlipAug3D', img_scale=(1280,384), pts_scale_ratio=1, + flip=False, transforms=[ + dict(type='Resize', scale=0, keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[0,0], scale_ratio_range=[1.,1.], + translation_std=[0,0,0]), + dict(type='RandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points','img']) +] + +modality = dict(use_lidar=True, use_camera=True) + +train_dataloader = dict( + batch_size=2, num_workers=4, sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict(type='RepeatDataset', times=2, dataset=dict( + type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, filter_empty_gt=False, metainfo=metainfo, + box_type_3d='LiDAR', backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, num_workers=1, sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict(type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, metainfo=metainfo, test_mode=True, + box_type_3d='LiDAR', backend_args=backend_args)) + +test_dataloader = val_dataloader + +# ----------------------------------------------------------------------------- +# Optimizer / runtime +# ----------------------------------------------------------------------------- +optim_wrapper = dict(optimizer=dict(lr=0.001, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +val_evaluator = dict(type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = 
dict(type='Det3DLocalVisualizer', vis_backends=vis_backends, + name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=5) diff --git a/configs/mvxnet/mvxnet_mobilenetv2_fpn_second_fpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_mobilenetv2_fpn_second_fpn_kitti-3d-3class.py new file mode 100644 index 0000000000..fa5021492c --- /dev/null +++ b/configs/mvxnet/mvxnet_mobilenetv2_fpn_second_fpn_kitti-3d-3class.py @@ -0,0 +1,275 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='mmdet.MobileNetV2', # Use MobileNetV2 backbone + out_indices=(0, 1, 2, 3), # Extract features from these layers + frozen_stages=1, # Freeze the first stage (if needed) + norm_cfg=dict(type='BN', requires_grad=False), # Use BatchNorm + norm_eval=True, + ), + img_neck=dict( + type='mmdet.FPN', # Use Feature Pyramid Network (FPN) for neck + in_channels=[16, 24, 32, 64], # Adjust the input channels according to MobileNetV2 (could vary with the model) + out_channels=256, # Number of output channels from the FPN + norm_cfg=dict(type='BN', requires_grad=False), + num_outs=5, # Output feature maps from 5 levels + ), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=256, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3, 4], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + 
neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=4, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=5) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py b/configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py new file mode 100644 index 0000000000..d3920cbf8c --- /dev/null +++ b/configs/mvxnet/mvxnet_sqeezefpn_fire_rpfnet_kitti-3d-3class.py @@ -0,0 +1,225 @@ +# Stand-alone MVX-Net (SqueezeFPN camera branch) + PillarNet-LTS (RPFNet) +# for KITTI 3-class. No dependency on other MVX configs โ€“ only schedule & +# default_runtime are inherited. 
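+#
+# For reference (illustrative, taken from the sibling mvxnet_sqeezefpn_secfpn
+# configs in this patch): the baseline LiDAR branch uses the stock SECOND
+# backbone plus a SECONDFPN neck, i.e.
+#   pts_backbone=dict(type='SECOND', in_channels=256, layer_nums=[5, 5],
+#                     layer_strides=[1, 2], out_channels=[128, 256]),
+#   pts_neck=dict(type='SECONDFPN', in_channels=[128, 256],
+#                 upsample_strides=[1, 2], out_channels=[256, 256]),
+# This config swaps those two keys for a FireRPFNetV2 backbone with
+# pts_neck=None; everything else stays the same.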
+ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# ----------------------------------------------------------------------------- +# Geometry +# ----------------------------------------------------------------------------- +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +# ----------------------------------------------------------------------------- +# Model +# ----------------------------------------------------------------------------- +model = dict( + type='DynamicMVXFasterRCNN', + # -------------------------------------------------- + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + + # ----------------------- image branch ----------------------- + img_backbone=dict( + type='SQUEEZE', + in_channels=3, + out_channels=[64, 128, 256, 512], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + img_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256, 512], + out_channels=[512, 512, 512, 512], + norm_cfg=dict(type='BN', requires_grad=False)), + + # ----------------------- LiDAR voxel encoder ---------------- + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3], + align_corners=False, + activate_out=True, + fuse_out=False)), + + # ----------------------- Sparse middle encoder -------------- + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + + # ----------------------- FireRPFNet backbone ------------- + pts_backbone=dict( + type='FireRPFNetV2', + in_channels=256, # output of SparseEncoder + out_channels=[128, 256, 256, 256], + with_cbam=True), + + pts_neck=None, # RPFNet is already deep enough + + # ----------------------- Anchor head ------------------------ + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, + feat_channels=256, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, + loss_weight=1.0), + loss_bbox=dict(type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, + loss_weight=2.0), + loss_dir=dict(type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + + # ----------------------- Train / Test cfg ------------------- + train_cfg=dict( + pts=dict( + assigner=[ + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', 
iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, neg_iou_thr=0.45, min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, pos_weight=-1, debug=False)), + + test_cfg=dict( + pts=dict(use_rotate_nms=True, nms_across_levels=False, nms_thr=0.01, + score_thr=0.1, min_bbox_size=0, nms_pre=100, max_num=50)) +) + +# ----------------------------------------------------------------------------- +# Dataset & pipelines (identical to original MVX squeeze-FPN config) +# ----------------------------------------------------------------------------- + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +train_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, + with_bbox=True, with_label=True), + dict(type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict(type='Pack3DDetInputs', keys=['points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', + 'gt_bboxes', 'gt_labels']) +] + +test_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='MultiScaleFlipAug3D', img_scale=(1280, 384), pts_scale_ratio=1, + flip=False, + transforms=[ + dict(type='Resize', scale=0, keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[0, 0], scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] + +modality = dict(use_lidar=True, use_camera=True) + +train_dataloader = dict( + batch_size=2, num_workers=2, sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict(type='RepeatDataset', times=2, dataset=dict( + type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, filter_empty_gt=False, metainfo=metainfo, + box_type_3d='LiDAR', backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, num_workers=1, sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict(type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, metainfo=metainfo, test_mode=True, + box_type_3d='LiDAR', backend_args=backend_args)) + +test_dataloader = val_dataloader + +# ----------------------------------------------------------------------------- +# Optimizer / Schedulers / Runtime +# 
----------------------------------------------------------------------------- +optim_wrapper = dict(optimizer=dict(lr=0.001, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +val_evaluator = dict(type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + + +# optim_wrapper = dict( +# optimizer=dict(weight_decay=0.01), +# clip_grad=dict(max_norm=35, norm_type=2), +# ) +# val_evaluator = dict( +# type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=5) + +# Optional: if you reduced channels you can shrink head +# model['pts_bbox_head']['in_channels'] = 256 +# model['pts_bbox_head']['feat_channels'] = 256 diff --git a/configs/mvxnet/mvxnet_sqeezefpn_rpfnet_kitti-3d-3class.py b/configs/mvxnet/mvxnet_sqeezefpn_rpfnet_kitti-3d-3class.py new file mode 100644 index 0000000000..7784db0c1a --- /dev/null +++ b/configs/mvxnet/mvxnet_sqeezefpn_rpfnet_kitti-3d-3class.py @@ -0,0 +1,221 @@ +# Stand-alone MVX-Net (SqueezeFPN camera branch) + PillarNet-LTS (RPFNet) +# for KITTI 3-class. No dependency on other MVX configs โ€“ only schedule & +# default_runtime are inherited. + +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# ----------------------------------------------------------------------------- +# Geometry +# ----------------------------------------------------------------------------- +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +# ----------------------------------------------------------------------------- +# Model +# ----------------------------------------------------------------------------- +model = dict( + type='DynamicMVXFasterRCNN', + # -------------------------------------------------- + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + + # ----------------------- image branch ----------------------- + img_backbone=dict( + type='SQUEEZE', + in_channels=3, + out_channels=[64, 128, 256, 512], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False)), + img_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256, 512], + out_channels=[512, 512, 512, 512], + norm_cfg=dict(type='BN', requires_grad=False)), + + # ----------------------- LiDAR voxel encoder ---------------- + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3], + align_corners=False, + activate_out=True, + fuse_out=False)), + + # ----------------------- Sparse middle encoder -------------- + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + + # ----------------------- RPFNet backbone ------------- + pts_backbone=dict( + type='RPFNet', + in_channels=256, # 
output of SparseEncoder + layer_channels=[128, 256, 256, 256], + with_cbam=True), + + pts_neck=None, # RPFNet is already deep enough + + # ----------------------- Anchor head ------------------------ + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=256, + feat_channels=256, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, + loss_weight=1.0), + loss_bbox=dict(type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, + loss_weight=2.0), + loss_dir=dict(type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + + # ----------------------- Train / Test cfg ------------------- + train_cfg=dict( + pts=dict( + assigner=[ + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, neg_iou_thr=0.2, min_pos_iou=0.2, + ignore_iof_thr=-1), + dict(type='Max3DIoUAssigner', iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, neg_iou_thr=0.45, min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, pos_weight=-1, debug=False)), + + test_cfg=dict( + pts=dict(use_rotate_nms=True, nms_across_levels=False, nms_thr=0.01, + score_thr=0.1, min_bbox_size=0, nms_pre=100, max_num=50)) +) + +# ----------------------------------------------------------------------------- +# Dataset & pipelines (identical to original MVX squeeze-FPN config) +# ----------------------------------------------------------------------------- + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +train_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, + with_bbox=True, with_label=True), + dict(type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict(type='GlobalRotScaleTrans', rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict(type='Pack3DDetInputs', keys=['points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', + 'gt_bboxes', 'gt_labels']) +] + +test_pipeline = [ + dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=4, use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='MultiScaleFlipAug3D', img_scale=(1280, 384), pts_scale_ratio=1, + flip=False, + transforms=[ + dict(type='Resize', scale=0, keep_ratio=True), + dict(type='GlobalRotScaleTrans', 
rot_range=[0, 0], scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] + +modality = dict(use_lidar=True, use_camera=True) + +train_dataloader = dict( + batch_size=2, num_workers=4, sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict(type='RepeatDataset', times=2, dataset=dict( + type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, filter_empty_gt=False, metainfo=metainfo, + box_type_3d='LiDAR', backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, num_workers=1, sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict(type=dataset_type, data_root=data_root, modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict(pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, metainfo=metainfo, test_mode=True, + box_type_3d='LiDAR', backend_args=backend_args)) + +test_dataloader = val_dataloader + +# ----------------------------------------------------------------------------- +# Optimizer / Schedulers / Runtime +# ----------------------------------------------------------------------------- +optim_wrapper = dict(optimizer=dict(lr=0.001, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +val_evaluator = dict(type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + + +# optim_wrapper = dict( +# optimizer=dict(weight_decay=0.01), +# clip_grad=dict(max_norm=35, norm_type=2), +# ) +# val_evaluator = dict( +# type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') + +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=5) diff --git a/configs/mvxnet/mvxnet_sqeezefpn_secfpn_2x_scale_kitti-3d-3class.py b/configs/mvxnet/mvxnet_sqeezefpn_secfpn_2x_scale_kitti-3d-3class.py new file mode 100644 index 0000000000..24331ef868 --- /dev/null +++ b/configs/mvxnet/mvxnet_sqeezefpn_secfpn_2x_scale_kitti-3d-3class.py @@ -0,0 +1,275 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='SQUEEZE', + in_channels=3, + out_channels=[64, 128, 256 , 512], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False) + ), + img_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], # Correct in_channels for EfficientNet b0 + out_channels=[512, 512, 512, 512], + norm_cfg=dict(type='BN', requires_grad=False), + #num_outs=4 + ), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + 
point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + dict( + type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + #dict( + # type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + 
type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. + box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/configs/mvxnet/mvxnet_sqeezefpn_secfpn_kitti-3d-3class.py b/configs/mvxnet/mvxnet_sqeezefpn_secfpn_kitti-3d-3class.py new file mode 100644 index 0000000000..95a4a68c60 --- /dev/null +++ b/configs/mvxnet/mvxnet_sqeezefpn_secfpn_kitti-3d-3class.py @@ -0,0 +1,275 @@ +_base_ = ['../_base_/schedules/cosine.py', '../_base_/default_runtime.py'] + +# model settings +voxel_size = [0.05, 0.05, 0.1] +point_cloud_range = [0, -40, -3, 70.4, 40, 1] + +model = dict( + type='DynamicMVXFasterRCNN', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + voxel=True, + voxel_type='dynamic', + voxel_layer=dict( + max_num_points=-1, + 
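+ # Note (descriptive comment, not part of the original config): with
+ # voxel_type='dynamic', max_num_points=-1 together with max_voxels=(-1, -1)
+ # below is the conventional way to disable the per-voxel point cap and the
+ # voxel-count limit, so the DynamicVFE encoder sees all points.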
point_cloud_range=point_cloud_range, + voxel_size=voxel_size, + max_voxels=(-1, -1)), + mean=[102.9801, 115.9465, 122.7717], + std=[1.0, 1.0, 1.0], + bgr_to_rgb=False, + pad_size_divisor=32), + img_backbone=dict( + type='SQUEEZE', + in_channels=3, + out_channels=[64, 128, 256 , 512], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + conv_cfg=dict(type='Conv2d', bias=False) + ), + img_neck=dict( + type='SQUEEZEFPN', + in_channels=[64, 128, 256 , 512], # Correct in_channels for EfficientNet b0 + out_channels=[512, 512, 512, 512], + norm_cfg=dict(type='BN', requires_grad=False), + #num_outs=4 + ), + pts_voxel_encoder=dict( + type='DynamicVFE', + in_channels=4, + feat_channels=[64, 64], + with_distance=False, + voxel_size=voxel_size, + with_cluster_center=True, + with_voxel_center=True, + point_cloud_range=point_cloud_range, + fusion_layer=dict( + type='PointFusion', + img_channels=512, + pts_channels=64, + mid_channels=128, + out_channels=128, + img_levels=[0, 1, 2, 3], + align_corners=False, + activate_out=True, + fuse_out=False)), + pts_middle_encoder=dict( + type='SparseEncoder', + in_channels=128, + sparse_shape=[41, 1600, 1408], + order=('conv', 'norm', 'act')), + pts_backbone=dict( + type='SECOND', + in_channels=256, + layer_nums=[5, 5], + layer_strides=[1, 2], + out_channels=[128, 256]), + pts_neck=dict( + type='SECONDFPN', + in_channels=[128, 256], + upsample_strides=[1, 2], + out_channels=[256, 256]), + pts_bbox_head=dict( + type='Anchor3DHead', + num_classes=3, + in_channels=512, + feat_channels=512, + use_direction_classifier=True, + anchor_generator=dict( + type='Anchor3DRangeGenerator', + ranges=[ + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -0.6, 70.4, 40.0, -0.6], + [0, -40.0, -1.78, 70.4, 40.0, -1.78], + ], + sizes=[[0.8, 0.6, 1.73], [1.76, 0.6, 1.73], [3.9, 1.6, 1.56]], + rotations=[0, 1.57], + reshape_out=False), + assigner_per_size=True, + diff_rad_by_sin=True, + assign_per_class=True, + bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder'), + loss_cls=dict( + type='mmdet.FocalLoss', + use_sigmoid=True, + gamma=2.0, + alpha=0.25, + loss_weight=1.0), + loss_bbox=dict( + type='mmdet.SmoothL1Loss', beta=1.0 / 9.0, loss_weight=2.0), + loss_dir=dict( + type='mmdet.CrossEntropyLoss', use_sigmoid=False, + loss_weight=0.2)), + # model training and testing settings + train_cfg=dict( + pts=dict( + assigner=[ + dict( # for Pedestrian + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Cyclist + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.35, + neg_iou_thr=0.2, + min_pos_iou=0.2, + ignore_iof_thr=-1), + dict( # for Car + type='Max3DIoUAssigner', + iou_calculator=dict(type='BboxOverlapsNearest3D'), + pos_iou_thr=0.6, + neg_iou_thr=0.45, + min_pos_iou=0.45, + ignore_iof_thr=-1), + ], + allowed_border=0, + pos_weight=-1, + debug=False)), + test_cfg=dict( + pts=dict( + use_rotate_nms=True, + nms_across_levels=False, + nms_thr=0.01, + score_thr=0.1, + min_bbox_size=0, + nms_pre=100, + max_num=50))) + +# dataset settings +dataset_type = 'KittiDataset' +data_root = 'data/kitti/' +class_names = ['Pedestrian', 'Cyclist', 'Car'] +metainfo = dict(classes=class_names) +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None +train_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', 
backend_args=backend_args), + dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True), + #dict( + # type='RandomResize', scale=[(640, 192), (2560, 768)], keep_ratio=True), + dict( + type='RandomResize', scale=[(320, 96), (1280, 384)], keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[-0.78539816, 0.78539816], + scale_ratio_range=[0.95, 1.05], + translation_std=[0.2, 0.2, 0.2]), + dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ]) +] +test_pipeline = [ + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=4, + use_dim=4, + backend_args=backend_args), + dict(type='LoadImageFromFile', backend_args=backend_args), + dict( + type='MultiScaleFlipAug3D', + img_scale=(1280, 384), + pts_scale_ratio=1, + flip=False, + transforms=[ + # Temporary solution, fix this after refactor the augtest + dict(type='Resize', scale=0, keep_ratio=True), + dict( + type='GlobalRotScaleTrans', + rot_range=[0, 0], + scale_ratio_range=[1., 1.], + translation_std=[0, 0, 0]), + dict(type='RandomFlip3D'), + dict( + type='PointsRangeFilter', point_cloud_range=point_cloud_range), + ]), + dict(type='Pack3DDetInputs', keys=['points', 'img']) +] +modality = dict(use_lidar=True, use_camera=True) +train_dataloader = dict( + batch_size=2, + num_workers=2, + sampler=dict(type='DefaultSampler', shuffle=True), + dataset=dict( + type='RepeatDataset', + times=2, + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_train.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=train_pipeline, + filter_empty_gt=False, + metainfo=metainfo, + # we use box_type_3d='LiDAR' in kitti and nuscenes dataset + # and box_type_3d='Depth' in sunrgbd and scannet dataset. 
+ box_type_3d='LiDAR', + backend_args=backend_args))) + +val_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + modality=modality, + ann_file='kitti_infos_val.pkl', + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) +test_dataloader = dict( + batch_size=1, + num_workers=1, + sampler=dict(type='DefaultSampler', shuffle=False), + dataset=dict( + type=dataset_type, + data_root=data_root, + ann_file='kitti_infos_val.pkl', + modality=modality, + data_prefix=dict( + pts='training/velodyne_reduced', img='training/image_2'), + pipeline=test_pipeline, + metainfo=metainfo, + test_mode=True, + box_type_3d='LiDAR', + backend_args=backend_args)) + +optim_wrapper = dict( + optimizer=dict(weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2), +) +val_evaluator = dict( + type='KittiMetric', ann_file='data/kitti/kitti_infos_val.pkl') +test_evaluator = val_evaluator + +vis_backends = [dict(type='LocalVisBackend')] +visualizer = dict( + type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer') + +train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=20, val_interval=1) + +# You may need to download the model first is the network is unstable +#load_from = 'https://download.openmmlab.com/mmdetection3d/pretrain_models/mvx_faster_rcnn_detectron2-caffe_20e_coco-pretrain_gt-sample_kitti-3-class_moderate-79.3_20200207-a4a6a3c7.pth' # noqa diff --git a/exp_list.sh b/exp_list.sh new file mode 100755 index 0000000000..9a089ee0cd --- /dev/null +++ b/exp_list.sh @@ -0,0 +1,33 @@ +#!/bin/bash +CONFIG_FILES=( +mvxnet_efficiency_es_fpn_second_fpn_kitti-3d-3class +mvxnet_efficiency_es_fpn_squeeze_fpn_kitti-3d-3class +mvxnet_efficiency_fpn_second_fpn_kitti-3d-3class +mvxnet_efficiency_fpn_squeeze_fpn_kitti-3d-3class +mvxnet_fpn_dv_second_secfpn_320x92_kitti-3d-3class +mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class +mvxnet_fpn_dv_second_squeezefpn_320x92_kitti-3d-3class +mvxnet_fpn_dv_second_squeezefpn_8xb2-80e_kitti-3d-3class +mvxnet_mobilenetv2_fpn_second_fpn_kitti-3d-3class +mvxnet_sqeezefpn_secfpn_kitti-3d-3class +mvxnet_sqeezefpn_secfpn_2x_scale_kitti-3d-3class +) + +TIMESTAMP=$(TZ='America/Los_Angeles' date +%m%d) +for item in "${CONFIG_FILES[@]}"; do + echo "############" + echo + workdir="work_dirs/"${item} + mkdir -p ${workdir} + config="configs/mvxnet/"${item}.py + saved_model=$workdir/epoch_20.pth + echo "Config:" $config + echo "Workdir:" $workdir + echo "Saved Model:" $saved_model + + echo "Train Command:" + echo "python tools/train.py $config 2>&1 | tee $workdir/train-${TIMESTAMP}.log" + echo "Test Command:" + echo "python tools/test.py $workdir/$item.py $saved_model 2>&1 | tee $workdir/test-${TIMESTAMP}.log" + +done diff --git a/mmdet3d/apis/inferencers/multi_modality_det3d_inferencer.py b/mmdet3d/apis/inferencers/multi_modality_det3d_inferencer.py index 6717bb18c8..ecc4295ef6 100644 --- a/mmdet3d/apis/inferencers/multi_modality_det3d_inferencer.py +++ b/mmdet3d/apis/inferencers/multi_modality_det3d_inferencer.py @@ -74,95 +74,177 @@ def _inputs_to_list(self, - dict: the value with key 'points' is - Directory path: return all files in the directory - other cases: return a list containing the string. The string - could be a path to file, a url or other types of string according - to the task. 
+ could be a path to file, a url or other types of string according + to the task. Args: inputs (Union[dict, list]): Inputs for the inferencer. + cam_type (str): Camera type. Defaults to 'CAM2'. Returns: list: List of input for the :meth:`preprocess`. """ + processed_inputs_list = [] + if isinstance(inputs, dict): - assert 'infos' in inputs - infos = inputs.pop('infos') - - if isinstance(inputs['img'], str): - img, pcd = inputs['img'], inputs['points'] - backend = get_file_backend(img) - if hasattr(backend, 'isdir') and isdir(img) and isdir(pcd): - # Backends like HttpsBackend do not implement `isdir`, so - # only those backends that implement `isdir` could accept - # the inputs as a directory + if 'infos' not in inputs: + raise ValueError("Input dictionary must contain an 'infos' key pointing to the .pkl file.") + infos_path = inputs.pop('infos') + + # Determine the actual list of input samples + # This handles cases where 'img' and 'pcd' might be directories + current_sample_dicts = [] + if isinstance(inputs.get('img'), str) and isinstance(inputs.get('points'), str): + img_path_input, pcd_path_input = inputs['img'], inputs['points'] + # Check if these are directories + backend = get_file_backend(img_path_input) + if hasattr(backend, 'isdir') and isdir(img_path_input) and isdir(pcd_path_input): img_filename_list = list_dir_or_file( - img, list_dir=False, suffix=['.png', '.jpg']) + img_path_input, list_dir=False, suffix=['.png', '.jpg', '.jpeg', '.PNG', '.JPG', '.JPEG']) # Added more suffixes pcd_filename_list = list_dir_or_file( - pcd, list_dir=False, suffix='.bin') - assert len(img_filename_list) == len(pcd_filename_list) - - inputs = [{ - 'img': join_path(img, img_filename), - 'points': join_path(pcd, pcd_filename) - } for pcd_filename, img_filename in zip( - pcd_filename_list, img_filename_list)] - - if not isinstance(inputs, (list, tuple)): - inputs = [inputs] - - # get cam2img, lidar2cam and lidar2img from infos - info_list = mmengine.load(infos)['data_list'] - assert len(info_list) == len(inputs) - for index, input in enumerate(inputs): - data_info = info_list[index] - img_path = data_info['images'][cam_type]['img_path'] - if isinstance(input['img'], str) and \ - osp.basename(img_path) != osp.basename(input['img']): + pcd_path_input, list_dir=False, suffix='.bin') + + if len(img_filename_list) != len(pcd_filename_list): + raise ValueError( + f"Mismatch in number of images ({len(img_filename_list)}) and " + f"point cloud files ({len(pcd_filename_list)}) " + f"in directories '{img_path_input}' and '{pcd_path_input}'.") + + for pcd_filename, img_filename in zip(pcd_filename_list, img_filename_list): + current_sample_dicts.append({ + 'img': join_path(img_path_input, img_filename), + 'points': join_path(pcd_path_input, pcd_filename) + }) + else: # Assume single file paths if not directories + current_sample_dicts = [inputs.copy()] # Use a copy of the original input dict + elif not isinstance(inputs, (list, tuple)): # If inputs['img'] wasn't a string, but inputs itself is a dict. + current_sample_dicts = [inputs.copy()] + else: # This case should ideally not be hit if input 'inputs' is a dict. + raise ValueError("Unexpected structure for 'inputs' dictionary.") + + + all_info_data = mmengine.load(infos_path)['data_list'] + + for single_input_sample_dict in current_sample_dicts: + if 'img' not in single_input_sample_dict or not isinstance(single_input_sample_dict['img'], str): + raise ValueError(f"Each input sample must have an 'img' key with a string path. 
Problematic sample: {single_input_sample_dict}") + + input_img_basename = osp.basename(single_input_sample_dict['img']) + found_data_info = None + + for data_info_candidate in all_info_data: + if 'images' not in data_info_candidate or \ + cam_type not in data_info_candidate['images'] or \ + 'img_path' not in data_info_candidate['images'][cam_type]: + # Silently skip malformed entries or log a warning + # warnings.warn(f"Skipping malformed info entry: {data_info_candidate.get('sample_idx', 'Unknown sample')}") + continue + + info_img_path = data_info_candidate['images'][cam_type]['img_path'] + if osp.basename(info_img_path) == input_img_basename: + found_data_info = data_info_candidate + break + + if found_data_info is None: + available_img_names = [ + osp.basename(info['images'][cam_type]['img_path']) + for info in all_info_data + if 'images' in info and cam_type in info['images'] and 'img_path' in info['images'][cam_type] + ] + example_names = ", ".join(list(set(available_img_names))[:5]) raise ValueError( - f'the info file of {img_path} is not provided.') + f"Could not find info for image '{input_img_basename}' (from path: {single_input_sample_dict['img']}) " + f"in '{infos_path}'. Checked {len(all_info_data)} entries. " + f"Example image basenames in info file: {example_names}" + ) + + # Add camera parameters from found_data_info to the input sample cam2img = np.asarray( - data_info['images'][cam_type]['cam2img'], dtype=np.float32) + found_data_info['images'][cam_type]['cam2img'], dtype=np.float32) lidar2cam = np.asarray( - data_info['images'][cam_type]['lidar2cam'], + found_data_info['images'][cam_type]['lidar2cam'], dtype=np.float32) - if 'lidar2img' in data_info['images'][cam_type]: + if 'lidar2img' in found_data_info['images'][cam_type]: lidar2img = np.asarray( - data_info['images'][cam_type]['lidar2img'], + found_data_info['images'][cam_type]['lidar2img'], dtype=np.float32) else: lidar2img = cam2img @ lidar2cam - input['cam2img'] = cam2img - input['lidar2cam'] = lidar2cam - input['lidar2img'] = lidar2img + + # Create a new dict for the processed input to avoid modifying the original list's dicts + processed_sample = single_input_sample_dict.copy() + processed_sample['cam2img'] = cam2img + processed_sample['lidar2cam'] = lidar2cam + processed_sample['lidar2img'] = lidar2img + processed_inputs_list.append(processed_sample) + elif isinstance(inputs, (list, tuple)): - # get cam2img, lidar2cam and lidar2img from infos - for input in inputs: - assert 'infos' in input - infos = input.pop('infos') - info_list = mmengine.load(infos)['data_list'] - assert len(info_list) == 1, 'Only support single sample' \ - 'info in `.pkl`, when input is a list.' - data_info = info_list[0] - img_path = data_info['images'][cam_type]['img_path'] - if isinstance(input['img'], str) and \ - osp.basename(img_path) != osp.basename(input['img']): + # This branch handles cases where 'inputs' is already a list of dicts. + # The original logic assumes each dict in the list has its own 'infos' + # and that this info file contains exactly one entry. + # This part is kept similar to original for now, but may need adjustment + # if a global info file is to be used for list inputs too. 
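+ # Illustrative example of the expected list input (paths are placeholders,
+ # not files shipped with this patch):
+ #   inputs = [dict(points='demo/000008.bin', img='demo/000008.png',
+ #                  infos='demo/kitti_infos_val.pkl')]
+ # Each item carries its own 'infos' file; the matching entry is found by
+ # comparing image basenames in the loop below.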
+ for single_input_item_dict in inputs: + if not isinstance(single_input_item_dict, dict) or 'infos' not in single_input_item_dict: + raise ValueError("When inputs is a list, each item must be a dict containing an 'infos' key.") + + infos_path_item = single_input_item_dict.pop('infos') + current_info_list = mmengine.load(infos_path_item)['data_list'] + + # Original code for list inputs expects one info entry per file. + # To make it search, you'd adapt the logic from the isinstance(inputs, dict) block above. + # For now, sticking to a modified version of the original assertion for clarity. + input_img_basename_item = osp.basename(single_input_item_dict['img']) + data_info_to_use = None + if len(current_info_list) == 1: + # If only one entry, check if it matches, then use it. + candidate = current_info_list[0] + if 'images' in candidate and cam_type in candidate['images'] and \ + osp.basename(candidate['images'][cam_type]['img_path']) == input_img_basename_item: + data_info_to_use = candidate + else: + raise ValueError( + f"Single info entry in '{infos_path_item}' does not match input image '{input_img_basename_item}'.") + else: + # If multiple entries, search for the right one. + for candidate in current_info_list: + if 'images' in candidate and cam_type in candidate['images'] and \ + osp.basename(candidate['images'][cam_type]['img_path']) == input_img_basename_item: + data_info_to_use = candidate + break + if data_info_to_use is None: + raise ValueError( + f"Could not find matching info for image '{input_img_basename_item}' in '{infos_path_item}' " + f"(which has {len(current_info_list)} entries) when inputs is a list.") + + # Consistency check (original) + img_path_from_info = data_info_to_use['images'][cam_type]['img_path'] + if isinstance(single_input_item_dict.get('img'), str) and \ + osp.basename(img_path_from_info) != osp.basename(single_input_item_dict['img']): raise ValueError( - f'the info file of {img_path} is not provided.') + f"Mismatch: info file '{img_path_from_info}' vs input image '{single_input_item_dict['img']}'.") + cam2img = np.asarray( - data_info['images'][cam_type]['cam2img'], dtype=np.float32) + data_info_to_use['images'][cam_type]['cam2img'], dtype=np.float32) lidar2cam = np.asarray( - data_info['images'][cam_type]['lidar2cam'], + data_info_to_use['images'][cam_type]['lidar2cam'], dtype=np.float32) - if 'lidar2img' in data_info['images'][cam_type]: + if 'lidar2img' in data_info_to_use['images'][cam_type]: lidar2img = np.asarray( - data_info['images'][cam_type]['lidar2img'], + data_info_to_use['images'][cam_type]['lidar2img'], dtype=np.float32) else: lidar2img = cam2img @ lidar2cam - input['cam2img'] = cam2img - input['lidar2cam'] = lidar2cam - input['lidar2img'] = lidar2img - - return list(inputs) + + processed_sample = single_input_item_dict.copy() + processed_sample['cam2img'] = cam2img + processed_sample['lidar2cam'] = lidar2cam + processed_sample['lidar2img'] = lidar2img + processed_inputs_list.append(processed_sample) + else: + raise TypeError(f"Unsupported input type: {type(inputs)}. 
Expected dict or list.") + + return processed_inputs_list def _init_pipeline(self, cfg: ConfigType) -> Compose: """Initialize the test pipeline.""" diff --git a/mmdet3d/datasets/transforms/dbsampler.py b/mmdet3d/datasets/transforms/dbsampler.py index 56e8440b74..093cdfb170 100644 --- a/mmdet3d/datasets/transforms/dbsampler.py +++ b/mmdet3d/datasets/transforms/dbsampler.py @@ -280,7 +280,7 @@ def sample_all(self, s_points_list.append(s_points) gt_labels = np.array([self.cat2label[s['name']] for s in sampled], - dtype=np.long) + dtype=np.int64) if ground_plane is not None: xyz = sampled_gt_bboxes[:, :3] diff --git a/mmdet3d/datasets/transforms/loading.py b/mmdet3d/datasets/transforms/loading.py index 383c44536f..c1b9b8c395 100644 --- a/mmdet3d/datasets/transforms/loading.py +++ b/mmdet3d/datasets/transforms/loading.py @@ -244,6 +244,9 @@ def transform(self, results: dict) -> dict: if 'CAM2' in results['images']: filename = results['images']['CAM2']['img_path'] results['cam2img'] = results['images']['CAM2']['cam2img'] + elif 'CAM_FRONT' in results['images']: + filename = results['images']['CAM_FRONT']['img_path'] + results['cam2img'] = results['images']['CAM_FRONT']['cam2img'] elif len(list(results['images'].keys())) == 1: camera_type = list(results['images'].keys())[0] filename = results['images'][camera_type]['img_path'] diff --git a/mmdet3d/models/backbones/__init__.py b/mmdet3d/models/backbones/__init__.py index 64102bec1f..1badced7cb 100644 --- a/mmdet3d/models/backbones/__init__.py +++ b/mmdet3d/models/backbones/__init__.py @@ -4,18 +4,22 @@ from .cylinder3d import Asymm3DSpconv from .dgcnn import DGCNNBackbone from .dla import DLANet +from .fire_rpfnet import FireRPFNet, FireRPFNetV2, FireRPFNet2D from .mink_resnet import MinkResNet from .minkunet_backbone import MinkUNetBackbone from .multi_backbone import MultiBackbone from .nostem_regnet import NoStemRegNet from .pointnet2_sa_msg import PointNet2SAMSG from .pointnet2_sa_ssg import PointNet2SASSG +from .rpfnet import RPFNet from .second import SECOND from .spvcnn_backone import MinkUNetBackboneV2, SPVCNNBackbone +from .squeezenet import SQUEEZE __all__ = [ 'ResNet', 'ResNetV1d', 'ResNeXt', 'SSDVGG', 'HRNet', 'NoStemRegNet', 'SECOND', 'DGCNNBackbone', 'PointNet2SASSG', 'PointNet2SAMSG', 'MultiBackbone', 'DLANet', 'MinkResNet', 'Asymm3DSpconv', - 'MinkUNetBackbone', 'SPVCNNBackbone', 'MinkUNetBackboneV2' + 'MinkUNetBackbone', 'SPVCNNBackbone', 'MinkUNetBackboneV2','SQUEEZE', + 'RPFNet', 'FireRPFNet', 'FireRPFNetV2', 'FireRPFNet2D' ] diff --git a/mmdet3d/models/backbones/fire_rpfnet.py b/mmdet3d/models/backbones/fire_rpfnet.py new file mode 100644 index 0000000000..b8d62666d8 --- /dev/null +++ b/mmdet3d/models/backbones/fire_rpfnet.py @@ -0,0 +1,471 @@ +"""FireRPFNet: Fire Module + CBAM Attention Backbones. + +This module provides efficient backbones combining Fire modules (SqueezeNet-inspired) +with CBAM attention for both 2D image and 3D point cloud (BEV) feature extraction. + +Variants: + - FireRPFNet: Original BEV backbone + - FireRPFNetV2: Enhanced BEV backbone with multi-scale support + - FireRPFNet2D: 2D image backbone for multimodal detection + +References: + - SqueezeNet Fire Module: https://arxiv.org/abs/1602.07360 + - CBAM: https://arxiv.org/abs/1807.06521 + - MVXNet: https://arxiv.org/abs/1904.01649 +""" + +import torch +from torch import nn +from mmcv.cnn import build_norm_layer +from mmdet3d.registry import MODELS + + +class FireBlock(nn.Module): + """SqueezeNet-style fire module with residual shortcut for BEV features. 
+ + Original implementation for point cloud BEV backbones (FireRPFNet, FireRPFNetV2). + Squeezes from input channels for compatibility with trained models. + + Args: + in_ch (int): Input channels. + out_ch (int): Output channels of the expand concat. + norm_cfg (dict): Normalization config. + """ + + def __init__(self, in_ch, out_ch, norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)): + super().__init__() + # Squeeze from INPUT channels (original BEV behavior) + squeeze_ch = max(16, in_ch // 4) + + # Squeeze path (no stride, BEV maintains resolution) + self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1, bias=False) + self.squeeze_bn = build_norm_layer(norm_cfg, squeeze_ch)[1] + + # Expand paths (1x1 and 3x3 in parallel) + self.expand1x1 = nn.Conv2d(squeeze_ch, out_ch // 2, 1, bias=False) + self.expand3x3 = nn.Conv2d(squeeze_ch, out_ch // 2, 3, padding=1, bias=False) + self.expand_bn = build_norm_layer(norm_cfg, out_ch)[1] + + self.act = nn.ReLU(inplace=True) + + # Residual connection with projection if needed + self.downsample = None + if in_ch != out_ch: + self.downsample = nn.Sequential( + nn.Conv2d(in_ch, out_ch, 1, bias=False), + build_norm_layer(norm_cfg, out_ch)[1], + ) + + def forward(self, x): + identity = x + + # Squeeze + x = self.act(self.squeeze_bn(self.squeeze(x))) + + # Expand (parallel 1x1 and 3x3) + out1 = self.expand1x1(x) + out3 = self.expand3x3(x) + out = torch.cat([out1, out3], dim=1) + out = self.expand_bn(out) + + # Residual connection + if self.downsample is not None: + identity = self.downsample(identity) + + return self.act(out + identity) + + +class FireBlock2D(nn.Module): + """SqueezeNet-style fire module with residual shortcut for 2D images. + + Adapted for hierarchical image feature extraction with downsampling support. + Squeezes from output channels for consistent behavior during resolution changes. + + Args: + in_ch (int): Input channels. + out_ch (int): Output channels of the expand concat. + stride (int): Stride for downsampling. Default: 1. + norm_cfg (dict): Normalization config. + """ + + def __init__(self, in_ch, out_ch, stride=1, norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)): + super().__init__() + self.stride = stride + # Squeeze from OUTPUT channels (for stable behavior across resolution changes) + squeeze_ch = max(16, out_ch // 4) + + # Squeeze path (with optional stride for downsampling) + self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1, stride=stride, bias=False) + self.squeeze_bn = build_norm_layer(norm_cfg, squeeze_ch)[1] + + # Expand paths (1x1 and 3x3 in parallel) + self.expand1x1 = nn.Conv2d(squeeze_ch, out_ch // 2, 1, bias=False) + self.expand3x3 = nn.Conv2d(squeeze_ch, out_ch // 2, 3, padding=1, bias=False) + self.expand_bn = build_norm_layer(norm_cfg, out_ch)[1] + + self.act = nn.ReLU(inplace=True) + + # Residual connection with projection if needed + self.downsample = None + if in_ch != out_ch or stride != 1: + self.downsample = nn.Sequential( + nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False), + build_norm_layer(norm_cfg, out_ch)[1], + ) + + def forward(self, x): + identity = x + + # Squeeze + x = self.act(self.squeeze_bn(self.squeeze(x))) + + # Expand (parallel 1x1 and 3x3) + out1 = self.expand1x1(x) + out3 = self.expand3x3(x) + out = torch.cat([out1, out3], dim=1) + out = self.expand_bn(out) + + # Residual connection + if self.downsample is not None: + identity = self.downsample(identity) + + return self.act(out + identity) + + +class CBAM(nn.Module): + """Convolutional Block Attention Module (CBAM). 
+ + Applies sequential channel and spatial attention to input features. + + Args: + ch (int): Number of input channels. + reduction (int): Channel reduction ratio for MLP. Default: 16. + + Reference: + Woo et al., "CBAM: Convolutional Block Attention Module", ECCV 2018. + """ + + def __init__(self, ch, reduction=16): + super().__init__() + # Channel attention + self.channel_att = nn.Sequential( + nn.AdaptiveAvgPool2d(1), + nn.Flatten(), + nn.Linear(ch, ch // reduction, bias=False), + nn.ReLU(inplace=True), + nn.Linear(ch // reduction, ch, bias=False), + nn.Sigmoid(), + ) + + # Spatial attention + self.spatial_att = nn.Sequential( + nn.Conv2d(2, 1, 7, padding=3, bias=False), + nn.Sigmoid(), + ) + + def forward(self, x): + b, c, _, _ = x.size() + + # Channel attention + att_c = self.channel_att(x).view(b, c, 1, 1) + x = x * att_c + + # Spatial attention + att_s = self.spatial_att( + torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True)[0]], dim=1) + ) + return x * att_s + + +# ============================================================================= +# BEV Backbones (for Point Cloud) +# ============================================================================= + +@MODELS.register_module() +class FireRPFNet(nn.Module): + """Residual FireNet backbone (SqueezeNet-inspired) with CBAM. + + first version designed as a drop-in replacement for RPFNet in BEV pipelines. + Processes BEV features from sparse 3D convolution without downsampling. + + Args: + in_channels (int): Input channels. Default: 256. + out_channels (tuple[int]): Output channels for each stage. + Default: (128, 256, 256, 256). + with_cbam (bool): Whether to use CBAM attention. Default: True. + norm_cfg (dict): Normalization config. + Default: dict(type='BN', eps=1e-3, momentum=0.01). + """ + + def __init__(self, + in_channels=256, + out_channels=(128, 256, 256, 256), + with_cbam=True, + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)): + super().__init__() + layers = [] + ch = in_channels + for out_ch in out_channels: + block = FireBlock(ch, out_ch, norm_cfg=norm_cfg) + stage = [block] + if with_cbam: + stage.append(CBAM(out_ch)) + layers.append(nn.Sequential(*stage)) + ch = out_ch + self.stages = nn.ModuleList(layers) + + def forward(self, x): + """Forward pass. + + Args: + x (torch.Tensor): BEV features (N, C, H, W). + + Returns: + tuple[torch.Tensor]: Single-element tuple with last stage output. + """ + for stage in self.stages: + x = stage(x) + return (x, ) + + +@MODELS.register_module() +class FireRPFNetV2(nn.Module): + """Enhanced Residual FireNet backbone with multi-scale support. + + Designed as a drop-in replacement for RPFNet/SECOND in BEV pipelines. + Can output single-scale or multi-scale features for use with/without FPN necks. + + Args: + in_channels (int): Input channels. Default: 256. + out_channels (tuple[int] | list[int]): Output channels for each stage. + Default: (128, 256, 256, 256). + with_cbam (bool): Whether to use CBAM attention after each stage. + Default: True. + multi_scale_output (bool): If True, returns multi-scale features from all stages + (for use with SECONDFPN neck). If False, returns only the last stage output + (backward compatible, for use without neck). Default: False. + norm_cfg (dict): Normalization config. + Default: dict(type='BN', eps=1e-3, momentum=0.01). + + Example: + >>> # Single-scale output (no neck) + >>> pts_backbone = dict( + ... type='FireRPFNetV2', + ... in_channels=256, + ... out_channels=[128, 256, 256, 256], + ... 
multi_scale_output=False) + + >>> # Multi-scale output (with SECONDFPN) + >>> pts_backbone = dict( + ... type='FireRPFNetV2', + ... in_channels=256, + ... out_channels=[128, 256, 256, 256], + ... multi_scale_output=True) + >>> pts_neck = dict( + ... type='SECONDFPN', + ... in_channels=[128, 256, 256, 256], + ... upsample_strides=[1, 2, 4, 8], + ... out_channels=[128, 128, 128, 128]) + """ + + def __init__(self, + in_channels=256, + out_channels=(128, 256, 256, 256), + with_cbam=True, + multi_scale_output=False, + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)): + super().__init__() + self.multi_scale_output = multi_scale_output + layers = [] + ch = in_channels + for out_ch in out_channels: + block = FireBlock(ch, out_ch, norm_cfg=norm_cfg) + stage = [block] + if with_cbam: + stage.append(CBAM(out_ch)) + layers.append(nn.Sequential(*stage)) + ch = out_ch + self.stages = nn.ModuleList(layers) + + def forward(self, x): + """Forward pass. + + Args: + x (torch.Tensor): BEV features (N, C, H, W). + + Returns: + tuple[torch.Tensor]: + - If multi_scale_output=False: Single-element tuple with last stage output + - If multi_scale_output=True: Multi-element tuple with all stage outputs + """ + if self.multi_scale_output: + # Return multi-scale features for FPN neck + outs = [] + for stage in self.stages: + x = stage(x) + outs.append(x) + return tuple(outs) + else: + # Return only last stage (backward compatible) + for stage in self.stages: + x = stage(x) + return (x, ) + + +# ============================================================================= +# 2D Image Backbone +# ============================================================================= + +@MODELS.register_module() +class FireRPFNet2D(nn.Module): + """FireRPFNet2D: Efficient 2D image backbone for multimodal 3D detection. + + This backbone uses Fire modules (SqueezeNet-inspired) with CBAM attention + for efficient feature extraction from RGB images. It outputs multi-scale features + suitable for FPN necks in MVXNet-style architectures. + + Architecture: + - Stem: Conv 7ร—7 stride=2 + MaxPool โ†’ H/4, W/4 + - Stage 1: Fire blocks (stride=1) โ†’ H/4, W/4 + - Stage 2: Fire blocks (stride=2 in first) โ†’ H/8, W/8 + - Stage 3: Fire blocks (stride=2 in first) โ†’ H/16, W/16 + - Stage 4: Fire blocks (stride=2 in first) โ†’ H/32, W/32 + - Each block optionally followed by CBAM attention + + Args: + in_channels (int): Input image channels (typically 3 for RGB). Default: 3. + out_channels (tuple[int]): Output channels for each stage. + Default: (64, 128, 256, 512). + blocks_per_stage (tuple[int]): Number of Fire blocks per stage. + Default: (2, 2, 2, 2). + with_cbam (bool): Whether to use CBAM attention after each block. + Default: True. + stem_channels (int): Channels in stem conv. Default: 64. + out_indices (tuple[int]): Output feature indices for multi-scale. + Default: (0, 1, 2, 3) - all stages. + frozen_stages (int): Stages to be frozen (stop grad and set eval mode). + -1 means not freezing any stages. Default: -1. + norm_cfg (dict): Normalization config. + Default: dict(type='BN', eps=1e-3, momentum=0.01). + norm_eval (bool): Whether to set norm layers to eval mode. Default: False. + + Example: + >>> # Standard configuration + >>> img_backbone = dict( + ... type='FireRPFNet2D', + ... in_channels=3, + ... out_channels=[64, 128, 256, 512], + ... stem_channels=64, + ... with_cbam=True) + + >>> # Lightweight configuration (~40% fewer params) + >>> img_backbone = dict( + ... type='FireRPFNet2D', + ... out_channels=[48, 96, 192, 384], + ... 
stem_channels=48, + ... with_cbam=True) + + >>> # Without attention + >>> img_backbone = dict( + ... type='FireRPFNet2D', + ... out_channels=[64, 128, 256, 512], + ... with_cbam=False) + """ + + def __init__(self, + in_channels=3, + out_channels=(64, 128, 256, 512), + blocks_per_stage=(2, 2, 2, 2), + with_cbam=True, + stem_channels=64, + out_indices=(0, 1, 2, 3), + frozen_stages=-1, + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + norm_eval=False): + super().__init__() + + assert len(out_channels) == len(blocks_per_stage), \ + "out_channels and blocks_per_stage must have same length" + + self.num_stages = len(out_channels) + self.out_indices = out_indices + self.frozen_stages = frozen_stages + self.norm_eval = norm_eval + self.with_cbam = with_cbam + + # Stem: initial convolution to lift channels (H/4, W/4) + self.stem = nn.Sequential( + nn.Conv2d(in_channels, stem_channels, kernel_size=7, stride=2, + padding=3, bias=False), + build_norm_layer(norm_cfg, stem_channels)[1], + nn.ReLU(inplace=True), + nn.MaxPool2d(kernel_size=3, stride=2, padding=1) + ) + + # Build stages + self.stages = nn.ModuleList() + in_ch = stem_channels + + for stage_idx, (out_ch, num_blocks) in enumerate(zip(out_channels, blocks_per_stage)): + blocks = [] + for block_idx in range(num_blocks): + # First block of stages 1-3 uses stride=2 for downsampling + stride = 2 if (stage_idx > 0 and block_idx == 0) else 1 + + # Fire block (use FireBlock2D for image backbone) + fire_block = FireBlock2D(in_ch, out_ch, stride=stride, norm_cfg=norm_cfg) + blocks.append(fire_block) + + # Optional CBAM attention + if with_cbam: + blocks.append(CBAM(out_ch)) + + in_ch = out_ch + + self.stages.append(nn.Sequential(*blocks)) + + self._freeze_stages() + + def _freeze_stages(self): + """Freeze stages parameters and set to eval mode.""" + if self.frozen_stages >= 0: + self.stem.eval() + for param in self.stem.parameters(): + param.requires_grad = False + + for i in range(0, self.frozen_stages + 1): + if i < len(self.stages): + m = self.stages[i] + m.eval() + for param in m.parameters(): + param.requires_grad = False + + def forward(self, x): + """Forward pass. + + Args: + x (torch.Tensor): Input images (N, C, H, W), typically (N, 3, H, W). + + Returns: + tuple[torch.Tensor]: Multi-scale feature maps from selected stages. + Each tensor has shape (N, C_i, H_i, W_i). 
+ """ + x = self.stem(x) # Initial downsampling: H/4, W/4 + + outs = [] + for stage_idx, stage in enumerate(self.stages): + x = stage(x) + if stage_idx in self.out_indices: + outs.append(x) + + return tuple(outs) + + def train(self, mode=True): + """Set the module in training mode.""" + super(FireRPFNet2D, self).train(mode) + self._freeze_stages() + if mode and self.norm_eval: + for m in self.modules(): + # trick: eval have effect on BatchNorm only + if isinstance(m, nn.BatchNorm2d): + m.eval() diff --git a/mmdet3d/models/backbones/rpfnet.py b/mmdet3d/models/backbones/rpfnet.py new file mode 100644 index 0000000000..54bc0f516b --- /dev/null +++ b/mmdet3d/models/backbones/rpfnet.py @@ -0,0 +1,98 @@ +import torch +from torch import nn +from mmcv.cnn import build_norm_layer +from mmdet3d.registry import MODELS + + +class BasicBlock(nn.Module): + """Simple residual 2-D conv block used in PillarNet-LTS (RPFN).""" + + def __init__(self, in_channels, out_channels, norm_cfg): + super().__init__() + self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False) + self.bn1 = build_norm_layer(norm_cfg, out_channels)[1] + self.act = nn.ReLU(inplace=True) + self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False) + self.bn2 = build_norm_layer(norm_cfg, out_channels)[1] + if in_channels != out_channels: + self.downsample = nn.Sequential( + nn.Conv2d(in_channels, out_channels, 1, bias=False), + build_norm_layer(norm_cfg, out_channels)[1], + ) + else: + self.downsample = None + + def forward(self, x): + identity = x + out = self.act(self.bn1(self.conv1(x))) + out = self.bn2(self.conv2(out)) + if self.downsample is not None: + identity = self.downsample(identity) + out = self.act(out + identity) + return out + + +class CBAM(nn.Module): + """Lightweight CBAM attention (channel + spatial).""" + + def __init__(self, channels, reduction=16): + super().__init__() + self.mlp = nn.Sequential( + nn.AdaptiveAvgPool2d(1), + nn.Flatten(), + nn.Linear(channels, channels // reduction, bias=False), + nn.ReLU(inplace=True), + nn.Linear(channels // reduction, channels, bias=False), + nn.Sigmoid(), + ) + self.spatial = nn.Sequential( + nn.Conv2d(2, 1, 7, padding=3, bias=False), + nn.Sigmoid(), + ) + + def forward(self, x): + # channel attention + b, c, _, _ = x.size() + channel_att = self.mlp(x).view(b, c, 1, 1) + x = x * channel_att + # spatial attention + spatial_att = self.spatial(torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True)[0]], dim=1)) + x = x * spatial_att + return x + + +@MODELS.register_module() +class RPFNet(nn.Module): + """Residual Pillar Feature Network backbone (simplified). + + Args: + in_channels (int): #Channels of input BEV feature map (from SparseEncoder). + layer_channels (list[int]): Output channels for each residual stage. + with_cbam (bool): If True, append a CBAM after each stage. + norm_cfg (dict): Norm config dict. + """ + + def __init__(self, + in_channels=256, + layer_channels=(128, 256, 256, 256), + with_cbam=True, + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)): + super().__init__() + layers = [] + ch = in_channels + for out_ch in layer_channels: + block = BasicBlock(ch, out_ch, norm_cfg) + stage = [block] + if with_cbam: + stage.append(CBAM(out_ch)) + layers.append(nn.Sequential(*stage)) + ch = out_ch + self.stages = nn.ModuleList(layers) + + def forward(self, x): + # x: (B, C, H, W) BEV feature map + for stage in self.stages: + x = stage(x) + # Anchor3DHead expects a tuple/list of multi-scale features. 
+    # We return a single-scale tuple to stay compatible.
+    return (x, )
diff --git a/mmdet3d/models/backbones/squeezenet.py b/mmdet3d/models/backbones/squeezenet.py
new file mode 100644
index 0000000000..c42393b2dc
--- /dev/null
+++ b/mmdet3d/models/backbones/squeezenet.py
@@ -0,0 +1,112 @@
+import warnings
+
+from mmengine.model import BaseModule
+from mmdet3d.registry import MODELS
+from mmcv.cnn import build_conv_layer, build_norm_layer
+import torch
+import torch.nn as nn
+from typing import Sequence, Optional
+
+
+@MODELS.register_module()
+class SQUEEZE(BaseModule):
+    """Backbone network using the SqueezeNet architecture.
+
+    Args:
+        in_channels (int): Input channels.
+        out_channels (list[int]): Output channels for multi-scale feature maps.
+        norm_cfg (dict): Config dict of normalization layers.
+        conv_cfg (dict): Config dict of convolutional layers.
+    """
+
+    def __init__(self,
+                 in_channels: int = 3,
+                 out_channels: Sequence[int] = [64, 128, 256],
+                 norm_cfg: dict = dict(type='BN', eps=1e-3, momentum=0.01),
+                 conv_cfg: dict = dict(type='Conv2d', bias=False),
+                 init_cfg: Optional[dict] = None,
+                 pretrained: Optional[str] = None) -> None:
+        super(SQUEEZE, self).__init__(init_cfg=init_cfg)
+        self.conv_cfg = conv_cfg
+        self.norm_cfg = norm_cfg
+
+        # SqueezeNet stem and Fire modules. The original SqueezeNet stem
+        # (7x7 conv, 96 channels) is replaced here by a lighter 3x3/64 stem.
+        self.features = nn.Sequential(
+            build_conv_layer(conv_cfg, in_channels, 64, kernel_size=3, stride=2),
+            build_norm_layer(norm_cfg, 64)[1],
+            nn.ReLU(inplace=True),
+            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),
+            self._make_fire_module(64, 16, 64, 64),
+            self._make_fire_module(128, 16, 64, 64),
+            self._make_fire_module(128, 32, 128, 128),
+            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),
+            self._make_fire_module(256, 32, 128, 128),
+            self._make_fire_module(256, 48, 192, 192),
+            self._make_fire_module(384, 48, 192, 192),
+            self._make_fire_module(384, 64, 256, 256),
+            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),
+            self._make_fire_module(512, 64, 256, 256),
+        )
+
+        if isinstance(pretrained, str):
+            warnings.warn('DeprecationWarning: pretrained is deprecated, '
+                          'please use "init_cfg" instead')
+            self.init_cfg = dict(type='Pretrained', checkpoint=pretrained)
+        else:
+            self.init_cfg = dict(type='Kaiming', layer='Conv2d')
+
+    def _make_fire_module(self, in_channels, squeeze_channels,
+                          expand1x1_channels, expand3x3_channels):
+        layers = nn.Sequential()
+
+        # Squeeze layer
+        squeeze = nn.Sequential(
+            build_conv_layer(self.conv_cfg, in_channels, squeeze_channels, kernel_size=1),
+            build_norm_layer(self.norm_cfg, squeeze_channels)[1],
+            nn.ReLU(inplace=True)
+        )
+        layers.add_module('squeeze', squeeze)
+
+        # Expand 1x1 layer
+        expand1x1 = nn.Sequential(
+            build_conv_layer(self.conv_cfg, squeeze_channels, expand1x1_channels, kernel_size=1),
+            build_norm_layer(self.norm_cfg, expand1x1_channels)[1],
+            nn.ReLU(inplace=True)
+        )
+        layers.add_module('expand1x1', expand1x1)
+
+        # Expand 3x3 layer
+        expand3x3 = nn.Sequential(
+            build_conv_layer(self.conv_cfg, squeeze_channels, expand3x3_channels, kernel_size=3, padding=1),
+            build_norm_layer(self.norm_cfg, expand3x3_channels)[1],
+            nn.ReLU(inplace=True)
+        )
+        layers.add_module('expand3x3', expand3x3)
+
+        # Concatenation of the expand branches is handled in forward().
+        return layers
+
+    def forward(self, x):
+        """Forward function with correct concatenation for
Fire modules.""" + x = self.features[0](x) # handled here as the initial layers are not fire modules + targeted_layers = [1,5, 8, 13] + outs = [] + for idx, layer in enumerate(self.features[1:], 1): + #print(idx,":",layer) + if isinstance(layer, nn.Sequential) and 'squeeze' in layer._modules: + # This is a Fire module, handle separately + squeeze_output = layer.squeeze(x) + x1 = layer.expand1x1(squeeze_output) + x3 = layer.expand3x3(squeeze_output) + x = torch.cat([x1, x3], 1) + else: + # Normal layer + x = layer(x) + if(idx in targeted_layers): + outs.append(x) + #print("Outs x",idx , x.shape) + #print(len(outs)) + return outs diff --git a/mmdet3d/models/layers/norm.py b/mmdet3d/models/layers/norm.py index 9a85278723..03f4950f13 100644 --- a/mmdet3d/models/layers/norm.py +++ b/mmdet3d/models/layers/norm.py @@ -120,6 +120,8 @@ def forward(self, input: Tensor) -> Tensor: Returns: Tensor: Has shape (N, C, H, W), same shape as input. """ + if input.dtype == torch.float16: + input = input.to(torch.float32) # casting to torch.float32 assert input.dtype == torch.float32, \ f'input should be in float32 type, got {input.dtype}' using_dist = dist.is_available() and dist.is_initialized() diff --git a/mmdet3d/models/necks/__init__.py b/mmdet3d/models/necks/__init__.py index 53b885cb16..fb60020e4a 100644 --- a/mmdet3d/models/necks/__init__.py +++ b/mmdet3d/models/necks/__init__.py @@ -5,8 +5,9 @@ from .imvoxel_neck import IndoorImVoxelNeck, OutdoorImVoxelNeck from .pointnet2_fp_neck import PointNetFPNeck from .second_fpn import SECONDFPN +from .squeeze_fpn import SQUEEZEFPN __all__ = [ 'FPN', 'SECONDFPN', 'OutdoorImVoxelNeck', 'PointNetFPNeck', 'DLANeck', - 'IndoorImVoxelNeck' + 'IndoorImVoxelNeck','SQUEEZEFPN' ] diff --git a/mmdet3d/models/necks/squeeze_fpn.py b/mmdet3d/models/necks/squeeze_fpn.py new file mode 100644 index 0000000000..1d50277e4c --- /dev/null +++ b/mmdet3d/models/necks/squeeze_fpn.py @@ -0,0 +1,110 @@ +import torch +from mmcv.cnn import build_conv_layer, build_norm_layer, build_upsample_layer +from mmengine.model import BaseModule +from torch import nn + +from mmdet3d.registry import MODELS + + +class LastLevelMaxPool(nn.Module): + def __init__(self): + super(LastLevelMaxPool, self).__init__() + self.pool = nn.MaxPool2d(kernel_size=1, stride=2, padding=0) + + def forward(self, x): + return self.pool(x) + +@MODELS.register_module() +class SQUEEZEFPN(BaseModule): + """FPN using SqueezeNet architecture. + + Args: + in_channels (list[int]): Input channels of multi-scale feature maps. + out_channels (list[int]): Output channels of feature maps. + norm_cfg (dict): Config dict of normalization layers. + upsample_cfg (dict): Config dict of upsample layers. + conv_cfg (dict): Config dict of conv layers. + init_cfg (dict or :obj:`ConfigDict` or list[dict or :obj:`ConfigDict`], + optional): Initialization config dict. 
+ """ + + def __init__(self, + in_channels=[64, 128, 256, 512], + out_channels=[256, 256, 256, 256], + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False), + conv_cfg=dict(type='Conv2d', bias=False), + init_cfg=None): + super(SQUEEZEFPN, self).__init__(init_cfg=init_cfg) + print("out_channels", len(out_channels), "in_channels", len(in_channels)) + print("out_channels", out_channels, len(out_channels), "in_channels", in_channels, len(in_channels)) + assert len(out_channels) == len(in_channels) + self.in_channels = in_channels + self.out_channels = out_channels + + + self.lateral_convs = nn.ModuleList([ + nn.Conv2d(in_channel, out_channels[0], kernel_size=1) + for in_channel in in_channels + ]) + self.fpn_convs = nn.ModuleList([ + nn.Conv2d(out_channels[0], out_channels[0], kernel_size=3, padding=1) + for _ in range(len(in_channels)) + ]) + self.last_level_pool = LastLevelMaxPool() + + # self.deblocks = nn.ModuleList() + # for i, out_channel in enumerate(out_channels): + # upsample_layer = build_upsample_layer( + # upsample_cfg, + # in_channels=in_channels[i], + # out_channels=out_channel, + # kernel_size=2, + # stride=2) + # deblock = nn.Sequential( + # upsample_layer, + # build_norm_layer(norm_cfg, out_channel)[1], + # nn.ReLU(inplace=True) + # ) + # self.deblocks.append(deblock) + + def forward(self, x): + """Forward function. + + Args: + x (List[torch.Tensor]): Multi-level features with 4D Tensor in + (N, C, H, W) shape. + + Returns: + list[torch.Tensor]: Multi-level feature maps. + """ + # print("x", len(x), "in_channels", len(self.in_channels)) + assert len(x) == len(self.in_channels) + + lateral_features = [lateral_conv(feat) for lateral_conv, feat in zip(self.lateral_convs, x)] + + for i in range(len(lateral_features) - 2, -1, -1): + # print(i) + # print(i,lateral_features[i].shape) + # print(i+1,lateral_features[i+1].shape) + shape_of_tensor = lateral_features[i].size() + + # Extract specific dimensions for upsampleing + batch_size = shape_of_tensor[0] # not using + y_dimension = shape_of_tensor[1] + x_height = shape_of_tensor[2] + x_width = shape_of_tensor[3] + lateral_features[i] += nn.functional.interpolate(lateral_features[i + 1], size=(x_height, x_width), mode='nearest') + #print(i,lateral_features[i].shape) + + + # Apply the FPN convolutions + fpn_features = [fpn_conv(feat) for fpn_conv, feat in zip(self.fpn_convs, lateral_features)] + # for i, feature in enumerate(fpn_features): + # print(f"FPN Feature {i} shape: {feature.shape}") + pool = self.last_level_pool(lateral_features[0]) + # fpn_features.append(pool) + # for i, feature in enumerate(fpn_features): + # print(f"FPN Feature {i} shape: {feature.shape}") + # print(pool.shape) + return tuple(lateral_features) diff --git a/projects/BEVFusion/bevfusion/bevfusion.py b/projects/BEVFusion/bevfusion/bevfusion.py index 9f56934e66..50791f0851 100644 --- a/projects/BEVFusion/bevfusion/bevfusion.py +++ b/projects/BEVFusion/bevfusion/bevfusion.py @@ -56,7 +56,7 @@ def __init__( fusion_layer) if fusion_layer is not None else None self.pts_backbone = MODELS.build(pts_backbone) - self.pts_neck = MODELS.build(pts_neck) + self.pts_neck = MODELS.build(pts_neck) if pts_neck is not None else None self.bbox_head = MODELS.build(bbox_head) @@ -279,7 +279,8 @@ def extract_feat( x = features[0] x = self.pts_backbone(x) - x = self.pts_neck(x) + if self.pts_neck is not None: + x = self.pts_neck(x) return x diff --git 
a/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py b/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..c705d52b0b --- /dev/null +++ b/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,237 @@ +_base_ = [ + './bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py' +] +point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0] +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +model = dict( + type='BEVFusion', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + bgr_to_rgb=False), + img_backbone=dict( + type='mmdet.SwinTransformer', + embed_dims=96, + depths=[2, 2, 6, 2], + num_heads=[3, 6, 12, 24], + window_size=7, + mlp_ratio=4, + qkv_bias=True, + qk_scale=None, + drop_rate=0.0, + attn_drop_rate=0.0, + drop_path_rate=0.2, + patch_norm=True, + out_indices=[1, 2, 3], + with_cp=False, + convert_weights=True, + init_cfg=dict( + type='Pretrained', + checkpoint= # noqa: E251 + 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth' # noqa: E501 + )), + img_neck=dict( + type='GeneralizedLSSFPN', + in_channels=[192, 384, 768], + out_channels=256, + start_level=0, + num_outs=3, + norm_cfg=dict(type='BN2d', requires_grad=True), + act_cfg=dict(type='ReLU', inplace=True), + upsample_cfg=dict(mode='bilinear', align_corners=False)), + view_transform=dict( + type='DepthLSSTransform', + in_channels=256, + out_channels=80, + image_size=[256, 704], + feature_size=[32, 88], + xbound=[-54.0, 54.0, 0.3], + ybound=[-54.0, 54.0, 0.3], + zbound=[-10.0, 10.0, 20.0], + dbound=[1.0, 60.0, 0.5], + downsample=2), + fusion_layer=dict( + type='ConvFuser', in_channels=[80, 256], out_channels=256)) + +train_pipeline = [ + dict( + type='BEVLoadMultiViewImageFromFiles', + to_float32=True, + color_type='color', + backend_args=backend_args), + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + load_dim=5, + use_dim=5, + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='LoadAnnotations3D', + with_bbox_3d=True, + with_label_3d=True, + with_attr_label=False), + dict( + type='ImageAug3D', + final_dim=[256, 704], + resize_lim=[0.38, 0.55], + bot_pct_lim=[0.0, 0.0], + rot_lim=[-5.4, 5.4], + rand_flip=True, + is_train=True), + dict( + type='BEVFusionGlobalRotScaleTrans', + scale_ratio_range=[0.9, 1.1], + rot_range=[-0.78539816, 0.78539816], + translation_std=0.5), + dict(type='BEVFusionRandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict( + type='ObjectNameFilter', + classes=[ + 'car', 'truck', 'construction_vehicle', 'bus', 'trailer', + 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone' + ]), + # Actually, 'GridMask' is not used here + dict( + type='GridMask', + use_h=True, + use_w=True, + max_epoch=6, + rotate=1, + offset=False, + ratio=0.5, + mode=1, + prob=0.0, + fixed_prob=True), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ], + meta_keys=[ + 'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar', 
+ 'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx', + 'lidar_path', 'img_path', 'transformation_3d_flow', 'pcd_rotation', + 'pcd_scale_factor', 'pcd_trans', 'img_aug_matrix', + 'lidar_aug_matrix', 'num_pts_feats' + ]) +] + +test_pipeline = [ + dict( + type='BEVLoadMultiViewImageFromFiles', + to_float32=True, + color_type='color', + backend_args=backend_args), + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + load_dim=5, + use_dim=5, + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='ImageAug3D', + final_dim=[256, 704], + resize_lim=[0.48, 0.48], + bot_pct_lim=[0.0, 0.0], + rot_lim=[0.0, 0.0], + rand_flip=False, + is_train=False), + dict( + type='PointsRangeFilter', + point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]), + dict( + type='Pack3DDetInputs', + keys=['img', 'points', 'gt_bboxes_3d', 'gt_labels_3d'], + meta_keys=[ + 'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar', + 'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx', + 'lidar_path', 'img_path', 'num_pts_feats' + ]) +] + +train_dataloader = dict( + dataset=dict( + dataset=dict(pipeline=train_pipeline, modality=input_modality))) +val_dataloader = dict( + dataset=dict(pipeline=test_pipeline, modality=input_modality)) +test_dataloader = val_dataloader + +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.33333333, + by_epoch=False, + begin=0, + end=500), + dict( + type='CosineAnnealingLR', + begin=0, + T_max=6, + end=6, + by_epoch=True, + eta_min_ratio=1e-4, + convert_to_iter_based=True), + # momentum scheduler + # During the first 8 epochs, momentum increases from 1 to 0.85 / 0.95 + # during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1 + dict( + type='CosineAnnealingMomentum', + eta_min=0.85 / 0.95, + begin=0, + end=2.4, + by_epoch=True, + convert_to_iter_based=True), + dict( + type='CosineAnnealingMomentum', + eta_min=1, + begin=2.4, + end=6, + by_epoch=True, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=6, val_interval=1) +val_cfg = dict() +test_cfg = dict() + +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) + +# Default setting for scaling LR automatically +# - `enable` means enable scaling LR automatically +# or not by default. +# - `base_batch_size` = (8 GPUs) x (4 samples per GPU). 
+auto_scale_lr = dict(enable=False, base_batch_size=32) + +default_hooks = dict( + logger=dict(type='LoggerHook', interval=50), + checkpoint=dict(type='CheckpointHook', interval=1)) +del _base_.custom_hooks + +work_dir = './work_dirs/bevfusion_lidar-cam_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d' diff --git a/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py b/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..20467f63a8 --- /dev/null +++ b/projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,238 @@ +_base_ = [ + './bevfusion_lidar_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py' +] +point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0] +input_modality = dict(use_lidar=True, use_camera=True) +backend_args = None + +model = dict( + type='BEVFusion', + data_preprocessor=dict( + type='Det3DDataPreprocessor', + mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375], + bgr_to_rgb=False), + img_backbone=dict( + type='mmdet.SwinTransformer', + embed_dims=96, + depths=[2, 2, 6, 2], + num_heads=[3, 6, 12, 24], + window_size=7, + mlp_ratio=4, + qkv_bias=True, + qk_scale=None, + drop_rate=0.0, + attn_drop_rate=0.0, + drop_path_rate=0.2, + patch_norm=True, + out_indices=[1, 2, 3], + with_cp=False, + convert_weights=True, + init_cfg=dict( + type='Pretrained', + checkpoint= # noqa: E251 + 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth' # noqa: E501 + )), + img_neck=dict( + type='GeneralizedLSSFPN', + in_channels=[192, 384, 768], + out_channels=256, + start_level=0, + num_outs=3, + norm_cfg=dict(type='BN2d', requires_grad=True), + act_cfg=dict(type='ReLU', inplace=True), + upsample_cfg=dict(mode='bilinear', align_corners=False)), + view_transform=dict( + type='DepthLSSTransform', + in_channels=256, + out_channels=80, + image_size=[256, 704], + feature_size=[32, 88], + xbound=[-54.0, 54.0, 0.3], + ybound=[-54.0, 54.0, 0.3], + zbound=[-10.0, 10.0, 20.0], + dbound=[1.0, 60.0, 0.5], + downsample=2), + fusion_layer=dict( + type='ConvFuser', in_channels=[80, 256], out_channels=256)) + +train_pipeline = [ + dict( + type='BEVLoadMultiViewImageFromFiles', + to_float32=True, + color_type='color', + backend_args=backend_args), + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + load_dim=5, + use_dim=5, + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='LoadAnnotations3D', + with_bbox_3d=True, + with_label_3d=True, + with_attr_label=False), + dict( + type='ImageAug3D', + final_dim=[256, 704], + resize_lim=[0.38, 0.55], + bot_pct_lim=[0.0, 0.0], + rot_lim=[-5.4, 5.4], + rand_flip=True, + is_train=True), + dict( + type='BEVFusionGlobalRotScaleTrans', + scale_ratio_range=[0.9, 1.1], + rot_range=[-0.78539816, 0.78539816], + translation_std=0.5), + dict(type='BEVFusionRandomFlip3D'), + dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range), + dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range), + dict( + type='ObjectNameFilter', + classes=[ + 'car', 'truck', 'construction_vehicle', 'bus', 'trailer', + 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone' + ]), + # Actually, 'GridMask' is not used here + dict( + type='GridMask', + 
use_h=True, + use_w=True, + max_epoch=6, + rotate=1, + offset=False, + ratio=0.5, + mode=1, + prob=0.0, + fixed_prob=True), + dict(type='PointShuffle'), + dict( + type='Pack3DDetInputs', + keys=[ + 'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes', + 'gt_labels' + ], + meta_keys=[ + 'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar', + 'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx', + 'lidar_path', 'img_path', 'transformation_3d_flow', 'pcd_rotation', + 'pcd_scale_factor', 'pcd_trans', 'img_aug_matrix', + 'lidar_aug_matrix', 'num_pts_feats' + ]) +] + +test_pipeline = [ + dict( + type='BEVLoadMultiViewImageFromFiles', + to_float32=True, + color_type='color', + backend_args=backend_args), + dict( + type='LoadPointsFromFile', + coord_type='LIDAR', + load_dim=5, + use_dim=5, + backend_args=backend_args), + dict( + type='LoadPointsFromMultiSweeps', + sweeps_num=9, + load_dim=5, + use_dim=5, + pad_empty_sweeps=True, + remove_close=True, + backend_args=backend_args), + dict( + type='ImageAug3D', + final_dim=[256, 704], + resize_lim=[0.48, 0.48], + bot_pct_lim=[0.0, 0.0], + rot_lim=[0.0, 0.0], + rand_flip=False, + is_train=False), + dict( + type='PointsRangeFilter', + point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]), + dict( + type='Pack3DDetInputs', + keys=['img', 'points', 'gt_bboxes_3d', 'gt_labels_3d'], + meta_keys=[ + 'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar', + 'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx', + 'lidar_path', 'img_path', 'num_pts_feats' + ]) +] + +train_dataloader = dict( + dataset=dict( + dataset=dict(pipeline=train_pipeline, modality=input_modality))) +val_dataloader = dict( + dataset=dict(pipeline=test_pipeline, modality=input_modality)) +test_dataloader = val_dataloader + +param_scheduler = [ + dict( + type='LinearLR', + start_factor=0.33333333, + by_epoch=False, + begin=0, + end=500), + dict( + type='CosineAnnealingLR', + begin=0, + T_max=6, + end=6, + by_epoch=True, + eta_min_ratio=1e-4, + convert_to_iter_based=True), + # momentum scheduler + # During the first 8 epochs, momentum increases from 1 to 0.85 / 0.95 + # during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1 + dict( + type='CosineAnnealingMomentum', + eta_min=0.85 / 0.95, + begin=0, + end=2.4, + by_epoch=True, + convert_to_iter_based=True), + dict( + type='CosineAnnealingMomentum', + eta_min=1, + begin=2.4, + end=6, + by_epoch=True, + convert_to_iter_based=True) +] + +# runtime settings +train_cfg = dict(by_epoch=True, max_epochs=6, val_interval=1) +val_cfg = dict() +test_cfg = dict() + + +# Default setting for scaling LR automatically +# - `enable` means enable scaling LR automatically +# or not by default. +# - `base_batch_size` = (8 GPUs) x (4 samples per GPU). 
+auto_scale_lr = dict(enable=False, base_batch_size=32) + +default_hooks = dict( + logger=dict(type='LoggerHook', interval=50), + checkpoint=dict(type='CheckpointHook', interval=1)) +del _base_.custom_hooks + +work_dir = './work_dirs/bevfusion_lidar-cam_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d' + +optim_wrapper = dict( + type='OptimWrapper', + optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.01), + clip_grad=dict(max_norm=35, norm_type=2)) diff --git a/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py b/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..561097ebf5 --- /dev/null +++ b/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,18 @@ +_base_ = './bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py' + +# Override the point cloud backbone with FireRPFNet +# FireRPFNet is more memory-efficient than SECOND while maintaining good performance +# It uses Fire modules (SqueezeNet-style) with CBAM attention +model = dict( + pts_backbone=dict( + _delete_=True, # Completely replace the base backbone config + type='FireRPFNetV2', + in_channels=256, # Output channels from BEVFusionSparseEncoder + out_channels=[128, 256, 256, 512], # 4 stages with increasing channels + with_cbam=True, # Enable Channel and Spatial Attention + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)), + pts_neck=None +) + +# Update the work directory +work_dir = './work_dirs/bevfusion_lidar_voxel0075_firerpfnet_8xb4-cyclic-20e_nus-3d' diff --git a/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py b/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py new file mode 100644 index 0000000000..76ad10fbf9 --- /dev/null +++ b/projects/BEVFusion/configs/bevfusion_lidar_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d.py @@ -0,0 +1,31 @@ +_base_ = './bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py' + +model = dict( + pts_backbone=dict( + _delete_=True, # Completely replace the base backbone config + type='FireRPFNetV2', + in_channels=256, # Output channels from BEVFusionSparseEncoder + out_channels=[128, 256, 256, 256], # 4 stages with multi-scale outputs + with_cbam=True, # Enable Channel and Spatial Attention + multi_scale_output=True, # CRITICAL: Enable multi-scale feature extraction + norm_cfg=dict(type='BN', eps=1e-3, momentum=0.01)), + + pts_neck=dict( + _delete_=True, # Replace the base neck config + type='SECONDFPN', + in_channels=[128, 256, 256, 256], # Must match FireRPFNetV2 out_channels + out_channels=[128, 128, 128, 128], # Uniform output channels for fusion + upsample_strides=[1, 1, 1, 1], # CRITICAL: No upsampling (same resolution) + norm_cfg=dict(type='BN', eps=0.001, momentum=0.01), + upsample_cfg=dict(type='deconv', bias=False), + use_conv_for_no_stride=True), # Use 1x1 conv when stride=1 + + # Update bbox_head to match concatenated neck output + # SECONDFPN concatenates all outputs: 128 * 4 = 512 channels + bbox_head=dict( + in_channels=512, # 128 * 4 from SECONDFPN concatenation + ) +) + +# Update the work directory to distinguish from base config +work_dir = './work_dirs/bevfusion_lidar_voxel0075_firerpfnet_with_neck_8xb4-cyclic-20e_nus-3d'
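
As a quick sanity check on the backbone/neck wiring described above, the sketch below builds `FireRPFNetV2` (multi-scale) and `SECONDFPN` from their config dicts and verifies the concatenated 512-channel output expected by the bbox head. This is a minimal sketch, assuming this fork of mmdetection3d is installed so both modules are reachable through the registry; the 180x180 BEV grid is only an illustrative size for the 0.075 m voxels and the ±54 m range.

```python
# Minimal sketch: FireRPFNetV2 (multi_scale_output=True) + SECONDFPN should yield a
# single BEV map with 4 x 128 = 512 channels, matching bbox_head.in_channels above.
# Assumes this fork of mmdetection3d is installed; the input size is illustrative.
import torch

import mmdet3d.models  # noqa: F401  (imports the model packages so registration runs)
from mmdet3d.registry import MODELS

backbone = MODELS.build(
    dict(
        type='FireRPFNetV2',
        in_channels=256,
        out_channels=[128, 256, 256, 256],
        with_cbam=True,
        multi_scale_output=True))
neck = MODELS.build(
    dict(
        type='SECONDFPN',
        in_channels=[128, 256, 256, 256],
        out_channels=[128, 128, 128, 128],
        upsample_strides=[1, 1, 1, 1],
        use_conv_for_no_stride=True))

bev = torch.randn(1, 256, 180, 180)  # dummy BEV features from the sparse encoder
feats = backbone(bev)                # 4 feature maps, all at the input resolution
fused = neck(feats)                  # SECONDFPN concatenates the 4 levels

print([f.shape for f in feats])      # channels 128, 256, 256, 256 at 180x180
print(fused[0].shape)                # torch.Size([1, 512, 180, 180])
```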