Journal of Shandong University (Engineering Science) ›› 2019, Vol. 49 ›› Issue (6): 25-35. doi: 10.6040/j.issn.1672-3961.0.2019.244

• Control Science & Engineering - Special Topic on Robot •

A survey of image captioning methods based on deep learning

Zhifu CHANG, Fengyu ZHOU*, Yugang WANG, Dongdong SHEN, Yang ZHAO

  1. School of Control Science and Engineering, Shandong University, Jinan 250061, Shandong, China
  • Received: 2019-05-22  Online: 2019-12-20  Published: 2019-12-17
  • Contact: Fengyu ZHOU  E-mail: zfchang2018@gmail.com; zhoufengyu@sdu.edu.cn
  • Supported by:
    National Key Research and Development Program of China (2017YFB1302400); National Natural Science Foundation of China (61773242); Major Science and Technology Innovation Project of Shandong Province (2017CXGC0926); Key Research and Development Program of Shandong Province (Public Welfare Special Project) (2017GGX30133)

Abstract:

Image captioning is a cross-disciplinary research direction between computer vision and natural language processing. This paper aimed to summarize the deep learning methods in the field of image captioning. Image captioning methods based on deep learning were grouped into five categories: methods based on multimodal space, multi-region, the encoder-decoder framework, reinforcement learning, and generative adversarial networks. Commonly used datasets and evaluation metrics were introduced, and the experimental results of different methods were compared. Three key problems and future research directions for image captioning were presented and summarized.

Key words: image captioning, multimodal space, multi-region, encoder-decoder, reinforcement learning, generative adversarial networks

CLC Number: 

  • TP24
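Among the five categories summarized in the abstract, the encoder-decoder framework (see Fig.7) is the most widely used: a CNN encodes the image into a feature vector and an RNN decodes that vector into a word sequence. The following is a minimal sketch of that paradigm in PyTorch; the layer sizes, vocabulary size, and toy batch are illustrative assumptions, and real systems typically use a pretrained CNN such as ResNet together with beam-search decoding.

```python
import torch
import torch.nn as nn

class EncoderCNN(nn.Module):
    # Tiny stand-in CNN that maps an image to one feature vector.
    def __init__(self, embed_size=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_size)

    def forward(self, images):
        return self.fc(self.conv(images).flatten(1))  # (B, embed_size)

class DecoderRNN(nn.Module):
    # LSTM language model conditioned on the image feature.
    def __init__(self, vocab_size=1000, embed_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # The image feature is fed as the first step of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)  # (B, T+1, vocab_size) word logits

if __name__ == "__main__":
    encoder, decoder = EncoderCNN(), DecoderRNN()
    images = torch.randn(4, 3, 224, 224)        # toy batch of 4 RGB images
    captions = torch.randint(0, 1000, (4, 12))  # toy caption token ids
    logits = decoder(encoder(images), captions)
    print(logits.shape)                         # torch.Size([4, 13, 1000])
```

At training time the decoder is driven by the ground-truth caption (teacher forcing) and optimized with cross-entropy over the word logits; at test time words are sampled or searched one step at a time.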

Fig.1

Text description example of the image

Fig.2

Classification of image captioning methods based on deep learning

Fig.3

Illustration of image captioning method based on multimodal space

Fig.4

Model diagrams of image captioning methods based on multimodal space

Fig.5

Illustration of image captioning method based on multi-region

Fig.6

Model diagrams of image captioning methods based on multi-region

Fig.7

Illustration of image captioning method based on encoder-decoder

Fig.8

Illustration of image captioning method based on attention mechanism

Fig.9

Model diagrams of image captioning methods based on attention mechanism

Table 1

Comparison of commonly used image captioning datasets

Dataset | Number of images | Annotation level | Release year | Publisher
MSCOCO | 328 000 | Image-level | 2014 | Microsoft
Flickr8k | 8 000 | Image-level | 2013 | University of Illinois at Urbana-Champaign
Flickr30k | 30 000 | Image-level | 2015 | University of Illinois at Urbana-Champaign
Visual Genome | 10 800 | Region-level | 2017 | Stanford University
IAPR TC-12 | 20 000 | Image-level | 2006 | International Association for Pattern Recognition
MIT-Adobe FiveK | 5 000 | Image-level | 2011 | MIT and Adobe
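Each image in the caption datasets of Table 1 is paired with several reference sentences. As a hedged sketch of how such data is accessed in practice, the snippet below reads MSCOCO captions through the official pycocotools API; the annotation file path is an assumption that depends on where the 2014 caption annotations were downloaded.

```python
# Sketch: list the reference captions of one MSCOCO image via pycocotools.
from pycocotools.coco import COCO

ann_file = "annotations/captions_train2014.json"  # assumed local path
coco = COCO(ann_file)

img_id = coco.getImgIds()[0]                # pick the first image id
ann_ids = coco.getAnnIds(imgIds=img_id)     # caption annotation ids for that image
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])                   # each image has several human captions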

Table 2

Comparison of experimental results of image captioning methods on the MSCOCO dataset

Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr | ROUGE-L | SPICE
Adaptive Attention via A Visual Sentinel[28] | 0.742 | 0.580 | 0.439 | 0.332 | 0.266 | 1.085 | — | —
SCN[51] | 0.741 | 0.578 | 0.444 | 0.341 | 0.261 | 1.041 | — | —
Actor-Critic Sequence Training[37] | — | — | — | 0.344 | 0.267 | 1.162 | 0.558 | —
SCST[36] | — | — | — | 0.319 | 0.255 | 1.060 | 0.543 | —
LSTM-A[34] | 0.734 | 0.567 | 0.430 | 0.326 | 0.254 | 1.000 | 0.540 | 0.186
Language CNN[52] | 0.720 | 0.550 | 0.410 | 0.300 | 0.240 | 0.960 | — | 0.176
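The BLEU-n scores in Table 2 measure n-gram overlap between a generated caption and its reference captions. Below is a minimal sketch of how such scores can be computed with NLTK; the candidate and reference sentences are toy examples, and published results normally use corpus-level BLEU over the full test set rather than a single sentence.

```python
# Sketch: BLEU-1..BLEU-4 for one toy caption against two toy references.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is riding a horse on the beach".split(),
    "a person rides a horse along the shore".split(),
]
candidate = "a man rides a horse on the beach".split()

smooth = SmoothingFunction().method1            # avoids zero scores for short texts
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = sentence_bleu(references, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```

METEOR, ROUGE-L, CIDEr, and SPICE in the same table are computed analogously from the candidate-reference pairs but use synonym matching, longest common subsequences, TF-IDF weighted n-grams, and scene-graph tuples, respectively.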
1 LI F F , IYER A , KOCH C , et al. What do we perceive in a glance of a real-world scene[J]. Journal of Vision, 2007, 7 (1): 10- 10.
2 KARPATHY A, LI F F.Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Boston, USA: IEEE, 2015: 3128-3137.
3 HE K, ZHANG X, REN S, et al. Delving deep into rectifiers: surpassing human-level performance on imagenet classification[C]//Proceedings of the IEEE International Conference on Computer Vision.Santiago, Chile: IEEE, 2015: 1026-1034.
4 REN S, HE K, GIRSHICK R, et al. Faster r-cnn: towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems. Montreal, Canada: NIPS, 2015: 91-99.
5 SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Advances in Neural Information Processing Systems. Montreal, Canada: NIPS, 2014: 3104-3112.
6 HOSSAIN M , SOHEL F , SHIRATUDDIN M F , et al. A comprehensive study of deep learning for image captioning[J]. Arxiv: Computer Vision and Pattern Recognition, 2018.
7 FU K , LI J , JIN J , et al. Image-text surgery:efficient concept learning in image captioning by generating pseudopairs[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, (99): 1- 12.
8 GAO L, FAN K, SONG J, et al.Deliberate attention networks for image captioning[C]//AAAI-19. Honolulu, USA: AAAI, 2019: 8320-8327.
9 CHEN F, JI R, SUN X, et al. Groupcap: group-based image captioning with structured relevance and diversity constraints[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 1345-1353.
10 彭宇新, 綦金玮, 黄鑫. 多媒体内容理解的研究现状与展望[J]. 计算机研究与发展, 2019, 56 (1): 183- 208.
PENG Y X , QI J W , HUANG X . Current research status and prospects on multimedia content understanding[J]. Journal of Computer Research and Development, 2019, 56 (1): 183- 208.
11 FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: generating sentences from images[C]//European Conference on Computer Vision. Berlin, Germany: Springer, 2010: 15-29.
12 ORDONEZ V, KULKARNI G, BERG T L. Im2text: describing images using 1 million captioned photographs[C]//Advances in Neural Information Processing Systems. Granada, Spain: Curran Associates Inc, 2011: 1143-1151.
13 YANG Y, TEO C L, DAUMÉ H, et al. Corpus-guided sentence generation of natural images[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Edinburgh, United Kingdom: ACL, 2011: 444-454.
14 LI S, KULKARNI G, BERG T L, et al. Composing simple image descriptions using web-scale n-grams[C]//Proceedings of the Fifteenth Conference on Computational Natural Language Learning. Portland, USA: ACL, 2011: 220-228.
15 LECUN Y , BENGIO Y , HINTON G . Deep learning[J]. Nature, 2015, 521 (7553): 436.
doi: 10.1038/nature14539
16 KIROS R, SALAKHUTDINOV R, ZEMEL R. Multimodal neural language models[C]//International Conference on Machine Learning. Beijing, China: IMLS, 2014: 595-603.
17 KARPATHY A, JOULIN A, LI F F. Deep fragment embeddings for bidirectional image sentence mapping[C]//Advances in Neural Information Processing Systems. Montreal, Canada: NIPS, 2014: 1889-1897.
18 MAO J , XU W , YANG Y , et al. Deep captioning with multimodal recurrent neural networks (m-rnn)[J]. Arxiv: Computer Vision and Pattern Recognition, 2014.
19 JOHNSON J, KARPATHY A, LI F F. Densecap: fully convolutional localization networks for dense captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 4565-4574.
20 YANG L, TANG K, YANG J, et al. Dense captioning with joint inference and visual context[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 2193-2202.
21 KRISHNA R , ZHU Y , GROTH O , et al. Visual genome: connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123 (1): 32- 73.
22 CHO K , VAN MERRIËNBOER B , GULCEHRE C , et al. Learning phrase representations using rnn encoder-decoder for statistical machine translation[J]. Arxiv: Computer Vision and Pattern Recognition, 2014.
23 VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 3156-3164.
24 JIA X, GAVVES E, FERNANDO B, et al. Guiding the long-short term memory model for image caption generation[C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 2407-2415.
25 MAO J, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 11-20.
26 WANG C, YANG H, BARTZ C, et al. Image captioning with deep bidirectional LSTMs[C]//Proceedings of the 2016 ACM on Multimedia Conference. Amsterdam, United Kingdom: ACM, 2016: 988-997.
27 XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//International Conference on Machine Learning.Lile, France: IMLS, 2015: 2048-2057.
28 LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 375-383.
29 PEDERSOLI M, LUCAS T, SCHMID C, et al. Areas of attention for image captioning[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 1242-1250.
30 TAVAKOLI H R, SHETTY R, BORJI A, et al. Paying attention to descriptions generated by image captioning models[C]//Proceedings of the IEEE International Conference on Computer Vision.Venice, Italy: IEEE, 2017: 2487-2496.
31 ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 6077-6086.
32 CHUNSEONG PARK C, KIM B, KIM G. Attend to you: personalized image captioning with context sequence memory networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 895-903.
33 YOU Q, JIN H, WANG Z, et al. Image captioning with semantic attention[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 4651-4659.
34 YAO T, PAN Y, LI Y, et al. Boosting image captioning with attributes[C]//Proceedings of the IEEE International Conference on Computer Vision.Venice, Italy: IEEE, 2017: 4894-4902.
35 REN Z, WANG X, ZHANG N, et al. Deep reinforcement learning-based image captioning with embedding reward[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017: 1151-1159.
36 RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Honolulu, USA: IEEE, 2017: 7008-7024.
37 ZHANG L , SUNG F , LIU F , et al. Actor-critic sequence training for image captioning[J]. Arxiv: Computer Vision and Pattern Recognition, 2017.
38 KONDA V R, TSITSIKLIS J N. Actor-critic algorithms[C]//Advances in Neural Information Processing Systems. Denver, USA: NIPS, 2000: 1008-1014.
39 DAI B , FIDLER S , URTASUN R , et al. Towards diverse and natural image descriptions via a conditional gan[J]. Arxiv: Computer Vision and Pattern Recognition, 2017.
40 SHETTY R, ROHRBACH M, HENDRICKS L A, et al. Speaking the same language: matching machine to human captions by adversarial training[C]//2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017: 4155-4164.
41 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft coco: common objects in context[C]//European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014: 740-755.
42 HODOSH M , YOUNG P , HOCKENMAIER J . Framing image description as a ranking task: data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47, 853- 899.
doi: 10.1613/jair.3994
43 PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models[C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 2641-2649.
44 GRUBINGER M, CLOUGH P, MÜLLER H, et al. The iapr tc-12 benchmark: a new evaluation resource for visual information systems[C]//International Workshop Ontoimage. Genoa, Italy: OntoImage, 2006: 13-55.
45 BYCHKOVSKY V, PARIS S, CHAN E, et al. Learning photographic global tonal adjustment with a database of input/output image pairs[C]//CVPR 2011. Piscataway, USA: IEEE, 2011: 97-104.
46 PAPINENI K, ROUKOS S, WARD T, et al. Bleu: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Istanbul, Turkey: ACL, 2002: 311-318.
47 LIN C Y . Rouge: a package for automatic evaluation of summaries[J]. Text Summarization Branches Out, 2004, 74- 81.
48 BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the Acl Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Istanbul, Turkey: ACL, 2005: 65-72.
49 VEDANTAM R, LAWRENCE ZITNICK C, PARIKH D. Cider: consensus-based image description evaluation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 4566-4575.
50 ANDERSON P, FERNANDO B, JOHNSON M, et al. Spice: semantic propositional image caption evaluation[C]//European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 382-398.
51 GAN Z, GAN C, HE X, et al. Semantic compositional networks for visual captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 5630-5639.
52 GU J, WANG G, CAI J, et al. An empirical study of language cnn for image captioning[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 1222-1231.