Journal of Shandong University (Engineering Science), 2025, Vol. 55, Issue 1: 1-14. DOI: 10.6040/j.issn.1672-3961.0.2024.162

• Machine Learning & Data Mining •    

Survey of open vocabulary object detection methods

NIE Xiushan, ZHAO Runhu, NING Yang*, LIU Xinfeng   

  1. School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, Shandong, China
  • Published: 2025-02-20

CLC Number: TP391