山东大学学报(工学版) [Journal of Shandong University (Engineering Science)], 2025, Vol. 55, Issue 1: 1-14. DOI: 10.6040/j.issn.1672-3961.0.2024.162
• Machine Learning and Data Mining •
聂秀山,赵润虎,宁阳*,刘新锋
NIE Xiushan, ZHAO Runhu, NING Yang*, LIU Xinfeng
Abstract: Conventional object detection methods are trained for specific scenarios: every object to be recognized must be manually annotated, and the detector can only recognize the annotated categories. As the application scenarios of object detection keep expanding, a detector trained on one specific scenario cannot meet the demands of diverse scenarios, and the generalization ability of object detection methods has become a research hotspot. Across scenarios, the same object may carry inconsistent labels and different objects may differ greatly in appearance, so a detector trained on a specific scenario fails to generalize to other scenarios. To address these challenges, researchers have proposed open-vocabulary object detection, which exploits large-scale image-vocabulary knowledge to extend detectors from specific scenarios to open scenarios. Detectors are typically extended to open scenarios in two ways: methods based on large-scale image-caption data and methods based on pretrained vision-language models. Image-caption-based methods usually extract object-related vocabulary knowledge from massive data and inject it into the detector, whereas vision-language-model-based methods directly exploit pretrained knowledge to extend the detector. Open-vocabulary object detection models can be applied to different scenarios without retraining, which makes them more practical and effective.
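Both families of methods share one underlying mechanism: the detector's fixed classifier is replaced by class-name text embeddings in a joint vision-language space, so region features are classified by similarity to an arbitrary vocabulary. The sketch below illustrates this idea only; the stub text encoder, the random region features, the embedding dimension, and the temperature are illustrative assumptions rather than details of any surveyed method.

```python
# Minimal sketch of an open-vocabulary classification head (illustrative only).
# Assumptions: region features would come from a detector's RoI head, and
# class-name embeddings from a frozen vision-language text encoder; both are
# stubbed with random tensors so the example runs stand-alone.
import torch
import torch.nn.functional as F

EMBED_DIM = 512      # assumed joint embedding dimension (CLIP-like)
TEMPERATURE = 0.01   # assumed softmax temperature for similarity logits


def encode_class_names(class_names):
    """Stand-in for a frozen VLM text encoder: maps each class name
    (e.g., wrapped in a prompt like 'a photo of a {name}') to an embedding."""
    torch.manual_seed(0)  # deterministic stub
    return F.normalize(torch.randn(len(class_names), EMBED_DIM), dim=-1)


def classify_regions(region_features, text_embeddings):
    """Classify regions by cosine similarity to class-name embeddings.
    Novel categories can be added at inference time simply by extending the
    text embeddings, without retraining the detector."""
    region_features = F.normalize(region_features, dim=-1)
    logits = region_features @ text_embeddings.t() / TEMPERATURE
    return logits.softmax(dim=-1)


if __name__ == "__main__":
    # Open vocabulary: base classes seen in training plus novel ones.
    vocabulary = ["person", "car", "dog", "umbrella", "surfboard"]
    text_emb = encode_class_names(vocabulary)

    # Stub for per-region features produced by a detector's RoI head.
    regions = torch.randn(3, EMBED_DIM)

    probs = classify_regions(regions, text_emb)
    for i, p in enumerate(probs):
        print(f"region {i}: {vocabulary[p.argmax().item()]} ({p.max().item():.2f})")
```

In actual methods surveyed here (e.g., ViLD or RegionCLIP), the stub encoder corresponds to a pretrained vision-language text tower and the region features are distilled from or aligned to its image tower; the sketch only shows why the vocabulary can be extended without retraining the detector.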