
山东大学学报 (工学版) ›› 2019, Vol. 49 ›› Issue (6): 25-35.doi: 10.6040/j.issn.1672-3961.0.2019.244

• 控制科学与工程——机器人专题 •

基于深度学习的图像自动标注方法综述

常致富,周风余*,王玉刚,沈冬冬,赵阳

  1. 山东大学控制科学与工程学院, 山东 济南 250061
  • 收稿日期:2019-05-22 出版日期:2019-12-20 发布日期:2019-12-17
  • 通讯作者: 周风余 E-mail:zfchang2018@gmail.com;zhoufengyu@sdu.edu.cn
  • 作者简介:常致富(1994—),男,山东济宁人,硕士研究生,主要研究方向为深度学习,图像理解. E-mail:zfchang2018@gmail.com
  • 基金资助:
    国家重点研发计划项目(2017YFB1302400);国家自然科学基金(61773242);山东省重大科技创新工程项目(2017CXGC0926);山东省重点研发计划(公益类专项)项目(2017GGX30133)

A survey of image captioning methods based on deep learning

Zhifu CHANG, Fengyu ZHOU*, Yugang WANG, Dongdong SHEN, Yang ZHAO

  1. School of Control Science and Engineering, Shandong University, Jinan 250061, Shandong, China
  • Received:2019-05-22 Online:2019-12-20 Published:2019-12-17
  • Contact: Fengyu ZHOU E-mail:zfchang2018@gmail.com;zhoufengyu@sdu.edu.cn
  • Supported by:
    National Key Research and Development Program of China (2017YFB1302400); National Natural Science Foundation of China (61773242); Major Science and Technology Innovation Project of Shandong Province (2017CXGC0926); Key Research and Development Program (Public Welfare Program) of Shandong Province (2017GGX30133)

摘要:

图像自动标注是目前计算机视觉和自然语言处理交叉研究领域的一个研究热点。对图像自动标注领域中的深度学习方法进行综述:针对国内外研究现状,按照基于多模态空间、基于多区域、基于编码-解码、基于强化学习和基于生成式对抗网络5个类别进行详细综述;介绍该领域相关的数据集和评价标准,对比不同图像自动标注方法的优缺点;在分析当前研究现状的基础上,提出该领域亟待解决的3个关键问题,进一步指出未来的研究方向,并对本研究进行总结。

关键词: 图像自动标注, 多模态空间, 多区域, 编码-解码, 强化学习, 生成式对抗网络

Abstract:

Image captioning is a cross-research direction of computer vision and natural language processing. This paper aimed to survey the deep learning methods in the field of image captioning. Image captioning methods based on deep learning were summarized into five categories: multimodal space based methods, multi-region based methods, encoder-decoder based methods, reinforcement learning based methods, and generative adversarial network based methods. The datasets and evaluation metrics were introduced, and the experimental results of different methods were compared. Three key problems and future research directions for image captioning were presented, and the survey was summarized.

Key words: image captioning, multimodal space, multi-region, encoder-decoder, reinforcement learning, generative adversarial networks
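Of the five categories above, the encoder-decoder framework is the common backbone: a CNN encoder maps the image to a feature vector, and an RNN decoder emits a word sequence one token at a time. A minimal greedy-decoding sketch in Python; the toy encoder, vocabulary, and transition scores are invented for illustration and stand in for a trained network, they are not from the paper:

```python
# Toy illustration of encoder-decoder caption generation with greedy decoding.
# encode_image() and TRANSITIONS are hypothetical stand-ins for a CNN encoder
# and a trained RNN decoder.

def encode_image(image):
    # Stand-in for a CNN encoder: reduce an image (2-D list) to a feature vector.
    return [sum(row) / len(row) for row in image]

# Hypothetical next-word scores playing the role of the decoder's softmax output.
TRANSITIONS = {
    "<start>": {"a": 0.9, "the": 0.1},
    "a": {"dog": 0.7, "cat": 0.3},
    "dog": {"runs": 0.8, "<end>": 0.2},
    "cat": {"<end>": 1.0},
    "runs": {"<end>": 1.0},
}

def greedy_decode(features, max_len=10):
    # At each step, emit the highest-scoring next word until <end> (greedy search).
    word, caption = "<start>", []
    for _ in range(max_len):
        nxt = max(TRANSITIONS[word], key=TRANSITIONS[word].get)
        if nxt == "<end>":
            break
        caption.append(nxt)
        word = nxt
    return " ".join(caption)

features = encode_image([[0.1, 0.2], [0.3, 0.4]])
print(greedy_decode(features))  # a dog runs
```

Real systems replace greedy search with beam search and condition each step on the image features; the control flow, however, is the same decode-until-<end> loop shown here.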

中图分类号: TP24

图1 图像的文本描述实例

图2 基于深度学习的图像自动标注方法分类

图3 基于多模态空间的图像自动标注方法示意图

图4 基于多模态空间的图像自动标注方法模型图

图5 基于多区域的图像自动标注方法示意图

图6 基于多区域的图像自动标注方法模型图

图7 基于编码-解码框架的图像自动标注方法示意图

图8 基于注意力机制的图像自动标注方法示意图

图9 基于注意力机制的图像自动标注方法模型图

表1 图像自动标注领域常用数据集

图像集 | 数量/张 | 标注类别 | 发布时间 | 发布机构
MSCOCO数据集 | 328 000 | 图像级 | 2014年 | 微软公司
Flickr8k数据集 | 8 000 | 图像级 | 2013年 | 伊利诺伊大学香槟分校
Flickr30k数据集 | 30 000 | 图像级 | 2015年 | 伊利诺伊大学香槟分校
Visual Genome数据集 | 108 000 | 区域级 | 2017年 | 斯坦福大学
IAPR TC-12数据集 | 20 000 | 图像级 | 2006年 | 国际模式识别协会
MIT-Adobe FiveK数据集 | 5 000 | 图像级 | 2011年 | 麻省理工学院和Adobe公司

表2 图像自动标注方法在MSCOCO数据集上的试验结果对比

名称 | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr | ROUGE-L | SPICE
Adaptive Attention via A Visual Sentinel[28] | 0.742 | 0.580 | 0.439 | 0.332 | 0.266 | 1.085 | — | —
SCN[51] | 0.741 | 0.578 | 0.444 | 0.341 | 0.261 | 1.041 | — | —
Actor-Critic Sequence Training[37] | — | — | — | 0.344 | 0.267 | 1.162 | 0.558 | —
SCST[36] | — | — | — | 0.319 | 0.255 | 1.060 | 0.543 | —
LSTM-A[34] | 0.734 | 0.567 | 0.430 | 0.326 | 0.254 | 1.000 | 0.540 | 0.186
Language CNN[52] | 0.720 | 0.550 | 0.410 | 0.300 | 0.240 | 0.960 | — | 0.176
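BLEU-n in the table is built from modified n-gram precision between a candidate caption and its reference captions: each candidate n-gram count is clipped by its count in the reference before dividing by the candidate total. A minimal sketch of that core computation (simplified to a single reference, omitting BLEU's brevity penalty and geometric mean over n):

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # Clip each candidate n-gram count by its count in the reference,
    # then divide by the total number of candidate n-grams.
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref[g]) for g, count in cand.items())
    return clipped / max(1, sum(cand.values()))

cand = "a dog runs on the grass".split()
ref = "a dog is running on the grass".split()
print(modified_precision(cand, ref, 1))  # 5 of 6 unigrams match
print(modified_precision(cand, ref, 2))  # 3 of 5 bigrams match
```

The clipping step is what prevents a degenerate caption such as "the the the" from scoring perfectly; full BLEU then combines these precisions for n = 1..4 with a brevity penalty.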
1 LI F F, IYER A, KOCH C, et al. What do we perceive in a glance of a real-world scene?[J]. Journal of Vision, 2007, 7(1): 10.
2 KARPATHY A, LI F F.Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Boston, USA: IEEE, 2015: 3128-3137.
3 HE K, ZHANG X, REN S, et al. Delving deep into rectifiers: surpassing human-level performance on imagenet classification[C]//Proceedings of the IEEE International Conference on Computer Vision.Santiago, Chile: IEEE, 2015: 1026-1034.
4 REN S, HE K, GIRSHICK R, et al. Faster r-cnn: towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems. Montreal, Canada: NIPS, 2015: 91-99.
5 SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Advances in Neural Information Processing Systems. Montreal, Canada: NIPS, 2014: 3104-3112.
6 HOSSAIN M, SOHEL F, SHIRATUDDIN M F, et al. A comprehensive study of deep learning for image captioning[J]. arXiv preprint, 2018.
7 FU K, LI J, JIN J, et al. Image-text surgery: efficient concept learning in image captioning by generating pseudopairs[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018(99): 1-12.
8 GAO L, FAN K, SONG J, et al.Deliberate attention networks for image captioning[C]//AAAI-19. Honolulu, USA: AAAI, 2019: 8320-8327.
9 CHEN F, JI R, SUN X, et al. Groupcap: group-based image captioning with structured relevance and diversity constraints[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 1345-1353.
10 彭宇新, 綦金玮, 黄鑫. 多媒体内容理解的研究现状与展望[J]. 计算机研究与发展, 2019, 56 (1): 183- 208.
PENG Y X , QI J W , HUANG X . Current research status and prospects on multimedia content understanding[J]. Journal of Computer Research and Development, 2019, 56 (1): 183- 208.
11 FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: generating sentences from images[C]//European Conference on Computer Vision. Berlin, Germany: Springer, 2010: 15-29.
12 ORDONEZ V, KULKARNI G, BERG T L. Im2text: describing images using 1 million captioned photographs[C]//Advances in Neural Information Processing Systems. Granada, Spain: Curran Associates Inc, 2011: 1143-1151.
13 YANG Y, TEO C L, DAUMÉ H, et al. Corpus-guided sentence generation of natural images[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Edinburgh, United Kingdom: ACL, 2011: 444-454.
14 LI S, KULKARNI G, BERG T L, et al. Composing simple image descriptions using web-scale n-grams[C]//Proceedings of the Fifteenth Conference on Computational Natural Language Learning. Portland, USA: ACL, 2011: 220-228.
15 LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553): 436-444. doi: 10.1038/nature14539
16 KIROS R, SALAKHUTDINOV R, ZEMEL R. Multimodal neural language models[C]//International Conference on Machine Learning. Beijing, China: IMLS, 2014: 595-603.
17 KARPATHY A, JOULIN A, LI F F. Deep fragment embeddings for bidirectional image sentence mapping[C]//Advances in Neural Information Processing Systems. Montreal, Canada: NIPS, 2014: 1889-1897.
18 MAO J, XU W, YANG Y, et al. Deep captioning with multimodal recurrent neural networks (m-RNN)[J]. arXiv preprint, 2014.
19 JOHNSON J, KARPATHY A, LI F F. Densecap: fully convolutional localization networks for dense captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 4565-4574.
20 YANG L, TANG K, YANG J, et al. Dense captioning with joint inference and visual context[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 2193-2202.
21 KRISHNA R , ZHU Y , GROTH O , et al. Visual genome: connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123 (1): 32- 73.
22 CHO K, VAN MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. arXiv preprint, 2014.
23 VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 3156-3164.
24 JIA X, GAVVES E, FERNANDO B, et al. Guiding the long-short term memory model for image caption generation[C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 2407-2415.
25 MAO J, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 11-20.
26 WANG C, YANG H, BARTZ C, et al. Image captioning with deep bidirectional LSTMs[C]//Proceedings of the 2016 ACM on Multimedia Conference. Amsterdam, United Kingdom: ACM, 2016: 988-997.
27 XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//International Conference on Machine Learning.Lile, France: IMLS, 2015: 2048-2057.
28 LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 375-383.
29 PEDERSOLI M, LUCAS T, SCHMID C, et al. Areas of attention for image captioning[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 1242-1250.
30 TAVAKOLI H R, SHETTY R, BORJI A, et al. Paying attention to descriptions generated by image captioning models[C]//Proceedings of the IEEE International Conference on Computer Vision.Venice, Italy: IEEE, 2017: 2487-2496.
31 ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 6077-6086.
32 PARK C C, KIM B, KIM G. Attend to you: personalized image captioning with context sequence memory networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 895-903.
33 YOU Q, JIN H, WANG Z, et al. Image captioning with semantic attention[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 4651-4659.
34 YAO T, PAN Y, LI Y, et al. Boosting image captioning with attributes[C]//Proceedings of the IEEE International Conference on Computer Vision.Venice, Italy: IEEE, 2017: 4894-4902.
35 REN Z, WANG X, ZHANG N, et al. Deep reinforcement learning-based image captioning with embedding reward[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017: 1151-1159.
36 RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Honolulu, USA: IEEE, 2017: 7008-7024.
37 ZHANG L, SUNG F, LIU F, et al. Actor-critic sequence training for image captioning[J]. arXiv preprint, 2017.
38 KONDA V R, TSITSIKLIS J N. Actor-critic algorithms[C]//Advances in Neural Information Processing Systems. Denver, USA: NIPS, 2000: 1008-1014.
39 DAI B, FIDLER S, URTASUN R, et al. Towards diverse and natural image descriptions via a conditional GAN[J]. arXiv preprint, 2017.
40 SHETTY R, ROHRBACH M, HENDRICKS L A, et al. Speaking the same language: matching machine to human captions by adversarial training[C]//2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017: 4155-4164.
41 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft coco: common objects in context[C]//European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014: 740-755.
42 HODOSH M, YOUNG P, HOCKENMAIER J. Framing image description as a ranking task: data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899. doi: 10.1613/jair.3994
43 PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models[C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 2641-2649.
44 GRUBINGER M, CLOUGH P, MÜLLER H, et al. The IAPR TC-12 benchmark: a new evaluation resource for visual information systems[C]//International Workshop OntoImage. Genoa, Italy: OntoImage, 2006: 13-55.
45 BYCHKOVSKY V, PARIS S, CHAN E, et al. Learning photographic global tonal adjustment with a database of input/output image pairs[C]//CVPR 2011. Piscataway, USA: IEEE, 2011: 97-104.
46 PAPINENI K, ROUKOS S, WARD T, et al. Bleu: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, USA: ACL, 2002: 311-318.
47 LIN C Y. Rouge: a package for automatic evaluation of summaries[J]. Text Summarization Branches Out, 2004: 74-81.
48 BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, USA: ACL, 2005: 65-72.
49 VEDANTAM R, ZITNICK C L, PARIKH D. Cider: consensus-based image description evaluation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 4566-4575.
50 ANDERSON P, FERNANDO B, JOHNSON M, et al. Spice: semantic propositional image caption evaluation[C]//European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 382-398.
51 GAN Z, GAN C, HE X, et al. Semantic compositional networks for visual captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 5630-5639.
52 GU J, WANG G, CAI J, et al. An empirical study of language cnn for image captioning[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 1222-1231.