山东大学学报(工学版) [Journal of Shandong University (Engineering Science)], 2025, Vol. 55, Issue 1: 1-14. DOI: 10.6040/j.issn.1672-3961.0.2024.162
• Machine Learning and Data Mining •
聂秀山,赵润虎,宁阳*,刘新锋
NIE Xiushan, ZHAO Runhu, NING Yang*, LIU Xinfeng
Abstract: Conventional object detection methods are trained for specific scenarios: every object to be recognized must be manually annotated, and the detector can only recognize the annotated categories. As the application scenarios of object detection keep expanding, a detector trained on one specific scenario cannot meet the demands of diverse scenarios, and the generalization ability of object detection methods has become a research hotspot. Across scenarios, the same object may carry inconsistent labels and different objects may differ greatly in appearance, so a detector trained on a specific scenario fails to generalize to other scenarios. To address these challenges, researchers have proposed open-vocabulary object detection, which exploits large-scale image-vocabulary knowledge to extend detectors from specific scenarios to open scenarios. Detectors are typically extended to open scenarios in two ways: methods based on large-scale image-caption data and methods based on pretrained vision-language models. Image-caption-based methods usually extract object-related vocabulary knowledge from massive data and inject it into the detector, whereas vision-language-model-based methods directly exploit pretrained knowledge to extend the detector. Open-vocabulary object detection models can be applied to different scenarios without retraining, which makes them more practical and effective.
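Both families of methods share one underlying mechanism: the detector's fixed classifier is replaced by class-name text embeddings in a joint vision-language space, so region features are classified by similarity to an arbitrary vocabulary. The sketch below illustrates this idea only; the stub text encoder, the random region features, the embedding dimension, and the temperature are illustrative assumptions rather than details of any surveyed method.

```python
# Minimal sketch of an open-vocabulary classification head (illustrative only).
# Assumptions: region features would come from a detector's RoI head, and
# class-name embeddings from a frozen vision-language text encoder; both are
# stubbed with random tensors so the example runs stand-alone.
import torch
import torch.nn.functional as F

EMBED_DIM = 512      # assumed joint embedding dimension (CLIP-like)
TEMPERATURE = 0.01   # assumed softmax temperature for similarity logits


def encode_class_names(class_names):
    """Stand-in for a frozen VLM text encoder: maps each class name
    (e.g., wrapped in a prompt like 'a photo of a {name}') to an embedding."""
    torch.manual_seed(0)  # deterministic stub
    return F.normalize(torch.randn(len(class_names), EMBED_DIM), dim=-1)


def classify_regions(region_features, text_embeddings):
    """Classify regions by cosine similarity to class-name embeddings.
    Novel categories can be added at inference time simply by extending the
    text embeddings, without retraining the detector."""
    region_features = F.normalize(region_features, dim=-1)
    logits = region_features @ text_embeddings.t() / TEMPERATURE
    return logits.softmax(dim=-1)


if __name__ == "__main__":
    # Open vocabulary: base classes seen in training plus novel ones.
    vocabulary = ["person", "car", "dog", "umbrella", "surfboard"]
    text_emb = encode_class_names(vocabulary)

    # Stub for per-region features produced by a detector's RoI head.
    regions = torch.randn(3, EMBED_DIM)

    probs = classify_regions(regions, text_emb)
    for i, p in enumerate(probs):
        print(f"region {i}: {vocabulary[p.argmax().item()]} ({p.max().item():.2f})")
```

In actual methods surveyed here (e.g., ViLD or RegionCLIP), the stub encoder corresponds to a pretrained vision-language text tower and the region features are distilled from or aligned to its image tower; the sketch only shows why the vocabulary can be extended without retraining the detector.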