Commit 01b319f

Release VOC model

1 parent 9baa7cc commit 01b319f

5 files changed: +590 -2 lines changed

detection/README.md — 15 additions & 2 deletions

```diff
@@ -115,7 +115,7 @@ Prepare datasets according to the guidelines in [MMDetection v2.28.1](https://gi
 
 </details>
 
-<details open>
+<details>
 <summary> Dataset: LVIS </summary>
 <br>
 <div>
@@ -128,7 +128,7 @@ Prepare datasets according to the guidelines in [MMDetection v2.28.1](https://gi
 
 </details>
 
-<details open>
+<details>
 <summary> Dataset: OpenImages </summary>
 <br>
 <div>
@@ -141,6 +141,19 @@ Prepare datasets according to the guidelines in [MMDetection v2.28.1](https://gi
 
 </details>
 
+<details>
+<summary> Dataset: VOC 2007 & 2012 </summary>
+<br>
+<div>
+
+| method | backbone | VOC 2007 | VOC 2012 | #param | Config | Download |
+| :----: | :-----------: | :------: | :------: | :----: | :---------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------: |
+| DINO | InternImage-H | 94.0 | 97.2 | 2.18B | [config](./configs/voc/dino_4scale_cbinternimage_h_objects365_voc07.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_voc0712.pth) |
+
+</div>
+
+</details>
+
 ## Evaluation
 
 To evaluate our `InternImage` on COCO val, run:
```

detection/configs/voc/README.md — 15 additions & 0 deletions (new file)

```markdown
# PASCAL VOC

## Introduction

PASCAL VOC 2007 is a widely used benchmark for object detection, classification, and segmentation in computer vision. It contains 9,963 images with 24,640 annotated objects across 20 categories, such as people, animals, and vehicles, split into training (2,501 images), validation (2,510 images), and test (4,952 images) sets. Each object is labeled with a class, a bounding box, and attributes such as "difficult" or "truncated". VOC 2007 popularized the mean Average Precision (mAP) metric, which remains a standard for evaluating object detection models.

PASCAL VOC 2012 extends VOC 2007 with more diverse images and annotations: 11,540 images and 27,450 object instances covering the same 20 categories. Beyond object detection and classification, it provides more detailed annotations for semantic segmentation. The dataset is split into training (5,717 images), validation (5,823 images), and a test set with hidden labels. VOC 2012 remains a common benchmark for deep learning models and a foundation for modern object detection and segmentation research.

## Model Zoo

### DINO + CB-InternImage

| backbone | pretrain | VOC 2007 | VOC 2012 | #param | Config | Download |
| :--------------: | :--------: | :------: | :------: | :----: | :---------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------: |
| CB-InternImage-H | Objects365 | 94.0 | 97.2 | 2.18B | [config](./dino_4scale_cbinternimage_h_objects365_voc07.py) | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_voc0712.pth) |
```
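The introduction above notes that VOC 2007 popularized mAP. As an aside, here is a minimal sketch of the 11-point interpolated AP used by the VOC 2007 protocol (illustrative only, not part of this commit; `voc07_ap` and its precision/recall inputs are hypothetical):

```python
import numpy as np

def voc07_ap(recall: np.ndarray, precision: np.ndarray) -> float:
    """11-point interpolated AP (VOC 2007 protocol).

    recall/precision are cumulative values over ranked detections.
    AP averages, over recall thresholds 0.0, 0.1, ..., 1.0, the maximum
    precision achieved at recall >= threshold.
    """
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        mask = recall >= t
        # Max precision at recall >= t; 0 if recall never reaches t.
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return float(ap)
```

mAP is then the mean of this AP over the 20 VOC classes.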
detection/configs/voc/dino_4scale_cbinternimage_h_objects365_voc07.py — 185 additions & 0 deletions (new file)
```python
_base_ = [
    '../_base_/datasets/voc0712.py',
    '../_base_/default_runtime.py'
]
load_from = 'https://huggingface.co/OpenGVLab/InternImage/resolve/main/dino_4scale_cbinternimage_h_objects365_80classes.pth'
model = dict(
    type='CBDINO',
    backbone=dict(
        type='CBInternImage',
        core_op='DCNv3',
        channels=320,
        depths=[6, 6, 32, 6],
        groups=[10, 20, 40, 80],
        mlp_ratio=4.,
        drop_path_rate=0.5,
        norm_layer='LN',
        layer_scale=None,
        offset_scale=1.0,
        post_norm=False,
        dw_kernel_size=5,  # for InternImage-H/G
        res_post_norm=True,  # for InternImage-H/G
        level2_post_norm=True,  # for InternImage-H/G
        level2_post_norm_block_ids=[5, 11, 17, 23, 29],  # for InternImage-H/G
        center_feature_scale=True,  # for InternImage-H/G
        with_cp=True,
        out_indices=[(0, 1, 2, 3), (1, 2, 3)],
        init_cfg=None,
    ),
    neck=[dict(
        type='CBChannelMapper',
        in_channels=[640, 1280, 2560],
        kernel_size=1,
        out_channels=256,
        act_cfg=None,
        norm_cfg=dict(type='GN', num_groups=32),
        num_outs=4)],
    bbox_head=dict(
        type='CBDINOHead',
        num_query=900,
        num_classes=20,
        in_channels=2048,  # TODO
        sync_cls_avg_factor=True,
        as_two_stage=True,
        with_box_refine=True,
        dn_cfg=dict(
            type='CdnQueryGenerator',
            noise_scale=dict(label=0.5, box=1.0),  # 0.5, 0.4 for DN-DETR
            group_cfg=dict(dynamic=True, num_groups=None, num_dn_queries=1000)),
        transformer=dict(
            type='DinoTransformer',
            two_stage_num_proposals=900,
            encoder=dict(
                type='DetrTransformerEncoder',
                num_layers=6,
                transformerlayers=dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=dict(
                        type='MultiScaleDeformableAttention',
                        embed_dims=256,
                        dropout=0.0),  # 0.1 for DeformDETR
                    feedforward_channels=2048,  # 1024 for DeformDETR
                    ffn_cfgs=dict(
                        type='FFN',
                        embed_dims=256,
                        feedforward_channels=2048,
                        num_fcs=2,
                        ffn_drop=0.,
                        use_checkpoint=True,
                        act_cfg=dict(type='ReLU', inplace=True),),
                    ffn_dropout=0.0,  # 0.1 for DeformDETR
                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
            decoder=dict(
                type='DinoTransformerDecoder',
                num_layers=6,
                return_intermediate=True,
                transformerlayers=dict(
                    type='DetrTransformerDecoderLayer',
                    attn_cfgs=[
                        dict(
                            type='MultiheadAttention',
                            embed_dims=256,
                            num_heads=8,
                            dropout=0.0),  # 0.1 for DeformDETR
                        dict(
                            type='MultiScaleDeformableAttention',
                            num_levels=4,
                            embed_dims=256,
                            dropout=0.0),  # 0.1 for DeformDETR
                    ],
                    feedforward_channels=2048,  # 1024 for DeformDETR
                    ffn_cfgs=dict(
                        type='FFN',
                        embed_dims=256,
                        feedforward_channels=2048,
                        num_fcs=2,
                        ffn_drop=0.,
                        use_checkpoint=True,
                        act_cfg=dict(type='ReLU', inplace=True),),
                    ffn_dropout=0.0,  # 0.1 for DeformDETR
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
                                     'ffn', 'norm')))),
        positional_encoding=dict(
            type='SinePositionalEncoding',
            num_feats=128,
            temperature=20,
            normalize=True),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),  # 2.0 in DeformDETR
        loss_bbox=dict(type='L1Loss', loss_weight=5.0),
        loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
    # training and testing settings
    train_cfg=dict(
        assigner=dict(
            type='HungarianAssigner',
            cls_cost=dict(type='FocalLossCost', weight=2.0),
            reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
            iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0)),
        snip_cfg=dict(
            type='v3',
            weight=0.1)),
    test_cfg=dict(max_per_img=300))  # TODO: Originally 100
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
# train_pipeline, NOTE the img_scale and the Pad's size_divisor is different
# from the default setting in mmdet.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Resize',
         img_scale=[(2000, 600), (2000, 1200)],
         multiscale_mode='range',
         keep_ratio=True),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2000, 1000),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=2,
    train=dict(pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))
    # test=dict(
    #     type='VOCDataset',
    #     ann_file='./data/VOCdevkit/VOC2012test/ImageSets/Main/test.txt',
    #     img_prefix='./data/VOCdevkit/VOC2012test/',
    #     pipeline=test_pipeline,))
# optimizer
optimizer = dict(
    type='AdamW', lr=0.0001/2, weight_decay=0.0001,
    constructor='CustomLayerDecayOptimizerConstructor',
    paramwise_cfg=dict(num_layers=50, layer_decay_rate=0.94,
                       depths=[6, 6, 32, 6], offset_lr_scale=1e-3))
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[])
runner = dict(type='IterBasedRunner', max_iters=20000)
checkpoint_config = dict(interval=500, max_keep_ckpts=3)
evaluation = dict(interval=500, save_best='auto')
```
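The optimizer block relies on `CustomLayerDecayOptimizerConstructor` from this repo, with `layer_decay_rate=0.94` over `num_layers=50`. As a rough sketch of the layer-wise decay idea (not the constructor's actual code), each layer's learning rate shrinks geometrically with its distance from the head:

```python
# Illustrative only: layer-wise lr decay gives layers farther from the
# output head geometrically smaller learning rates.
base_lr = 0.0001 / 2       # matches the config's lr
num_layers = 50            # matches paramwise_cfg
layer_decay_rate = 0.94

# layer 0 = stem, layer num_layers = head; the scale grows toward the head.
scales = [layer_decay_rate ** (num_layers - i) for i in range(num_layers + 1)]
print(f'stem lr: {base_lr * scales[0]:.2e}')   # most heavily decayed
print(f'head lr: {base_lr * scales[-1]:.2e}')  # equals base_lr
```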
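For reference, a minimal inference sketch using the MMDetection v2.x Python API (the detection README pins v2.28.1). It assumes the repo's custom modules (e.g. `CBDINO`, `CBInternImage`) are importable so they register with mmdet; the local paths are placeholders:

```python
from mmdet.apis import init_detector, inference_detector

# Placeholders: point these at your checkout and the downloaded checkpoint.
config_file = 'detection/configs/voc/dino_4scale_cbinternimage_h_objects365_voc07.py'
checkpoint_file = 'dino_4scale_cbinternimage_h_objects365_voc0712.pth'

model = init_detector(config_file, checkpoint_file, device='cuda:0')
# For a detector, returns a per-class list of [x1, y1, x2, y2, score] arrays.
result = inference_detector(model, 'demo.jpg')
```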
