
1 Introduction

With the emergence of diverse high-tech electronic products, important information is no longer carried by a single traditional medium such as text, video, or audio, but by a variety of media. Cross-media and multi-modal data are heterogeneous at the level of the underlying data yet similar in their high-level semantic expression, which gives rise to the “semantic gap” problem. In recent years, more and more scholars and researchers have become involved in the study of multi-modal data [1, 2].

Image scene description is an important research direction in the field of image understanding. Its object of study is image information represented by color, texture, and shape, and its target task is to produce a text sequence that accurately describes the image content, spanning the two modalities of image and text. The traditional image description task involves two key issues, image content representation and classification judgment: finding the most representative image scenes and then learning and training a scene-category classification model on them. Describing image content often requires complex manual work such as extracting visual features and designing a visual dictionary, and it depends on the researchers' prior knowledge; this work has a large influence on the classification result. The main source of these problems lies in the differences between image and text data in their underlying representations. How to eliminate these differences and fuse data from multiple modalities is the core content of this paper.

Since Hinton put forward the concept of deep learning in 2006, a large number of papers on deep learning have been published, and it has been successfully applied to computer vision, speech recognition, natural language processing, and other fields. In multi-modal data fusion, taking image and text as an example, deep neural networks can use different models to extract features from the two modalities and embed both images and text into similar feature vector spaces. By fusing images and text in this feature space, image semantic description based on deep learning can be realized.

2 Related Work

The study of semantic description of traditional image scenes is mainly based on single-modality images. To reduce the “semantic gap”, an image analysis layer such as a visual dictionary is constructed between the low-level visual features and the high-level semantic information; image semantic description must establish the mapping relationship between low-level visual features and high-level semantics. In recent years, more and more scholars have begun to pay attention to the research field of multimodal data fusion. Sawant et al. (2011) argue that, in addition to the visual features used in traditional object monitoring and scene interpretation, it is necessary to capture abstract concepts such as events, locations, and personalities [3]. Ma et al. (2011) carried out related research on image and text fusion and proposed a data fusion framework based on image content and tags: a new random walk model that uses a fusion parameter to balance the influence of content and labels [4]. Hollink et al. (2005) describe a semi-automatic image annotation algorithm for specific domains, whose goal is to identify domain features to increase semantic understanding [5].

In recent years, research on deep neural networks has remained intense and breakthroughs have continued to be made, largely thanks to convolutional neural networks and recurrent neural networks [6]; most researchers build on these two types of deep neural network models and have done a great deal of optimization work. Karpathy and Li (2015) and Xu et al. (2015) did extensive work in the field of image understanding [7, 8]. Karpathy and Li propose an image semantic alignment model that extracts the key information in the image and aligns it with the keywords in the sentence description. However, these approaches all use a plain RNN structure in the language model and lack semantic relevance.

3 Deep Neural Network Model Based on Multi-attention Mechanism

In this paper, we propose a deep neural network based on a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (LSTM) to describe an image in sentences, and we introduce a multi-attention mechanism on this basis. When we describe an image, we need to pay attention to the content of the image as well as to the language itself. When we produce the word “cat,” we focus on the cat region of the image and ignore the rest. Predicting a word therefore requires an attention mechanism not only in the language model but also over the image.

3.1 CNN for Feature Extraction

Russakovsky et al. (2015) summarized the competition results of each team in the ImageNet competition over the preceding five years, together with brief descriptions of their algorithms [9]. Among them, the VGG network performed particularly well. Simonyan and Zisserman (2014) observed that a 7 × 7 filter can be decomposed into several 3 × 3 filters, and the smaller filters allow for more flexible channels [10]. We use the convolutional layers of the VGG-16 network to extract image features as the input to the image branch of our model.

Fig. 1. Calculation of a single LSTM node

3.2 LSTM for Sequence Prediction

A DNN may be able to extract excellent feature representations, but an LSTM is better than a DNN at word prediction. The LSTM is an improved recurrent neural network [11]. A plain RNN cannot retain much information: there is only one state in the hidden layer, and it is very sensitive to short-term input. If we add another state c to store long-term information, the problem is alleviated. The LSTM uses gates to control the long-term state c. A gate can be expressed as:

$$ g(\mathbf{x}) = \sigma(W\mathbf{x} + \mathbf{b}) $$
(1)

W is the weight matrix of the gate and \( \mathbf{b} \) is the bias term. \( \sigma \) is the sigmoid function with range (0, 1), so a gate can be partially open rather than simply on or off. The structure of a single node in the LSTM network is shown in Fig. 1. The final output of the LSTM is jointly controlled by the output gate and the cell state:

$$ h_t = o_t \circ \tanh(c_t) $$
(2)

Because of the forget gate, the LSTM can retain information from long ago, and because of the input gate, it can prevent currently irrelevant content from entering the memory.
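To make the gating concrete, the following is a minimal sketch of a single LSTM step in Python/NumPy, following Eqs. (1) and (2); the parameter names (W_f, W_i, W_o, W_c and their biases) are illustrative, not the exact variables of our implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: gates as in Eq. (1), output as in Eq. (2).

    x: input vector; h_prev, c_prev: previous hidden and cell states.
    params: dict of weight matrices W_* (applied to [h_prev; x]) and biases b_*.
    """
    z = np.concatenate([h_prev, x])                        # shared input to every gate
    f = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate cell state
    c = f * c_prev + i * c_tilde                           # long-term state c
    h = o * np.tanh(c)                                     # Eq. (2): h_t = o_t ∘ tanh(c_t)
    return h, c
```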

3.3 Multi-attention Mechanism

The attention mechanism (AM) is one of the most important developments in the NLP field in the past few years, and in most current papers it is attached to the Encoder-Decoder framework. However, the AM can be used as a general idea. When a plain RNN model generates a language sequence, the predicted next word is related only to its preceding n words, where n is generally less than 5.

$$ y_i = f(y_{i-1}, y_{i-2}, y_{i-3}, \cdots, C) $$
(3)

In the above formula, C represents the semantic encoding. It is clearly unreasonable for every word to share the same semantic encoding C, and a word is not related only to its most recent predecessors. So we give each word a probability distribution that expresses its relevance to the other words, and replace C with \( C_i \).

$$ C_i = \sum\nolimits_{j=1}^{T_x} \alpha_{ij} h_j $$
(4)

\( T_x \) indicates the number of other words associated with \( C_i \), \( \alpha_{ij} \) represents the attention probability between the two words, and \( h_j \) is the information of the word itself. The remaining question is how to compute this probability distribution. It is usually obtained by computing the similarity between the current input information \( H_i \) and the previous information \( h_j \); in the word2vec model, this corresponds to the distance between two vectors.
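As a minimal sketch of Eq. (4), assuming dot-product similarity followed by a softmax to turn similarities into the attention probabilities \( \alpha_{ij} \) (other similarity functions could be substituted without changing the overall scheme):

```python
import numpy as np

def text_attention(H_i, hs):
    """Compute the context vector C_i of Eq. (4).

    H_i: current query vector of size d; hs: array of shape (T_x, d) holding h_1..h_{T_x}.
    """
    scores = hs @ H_i                        # similarity between H_i and each h_j
    alpha = np.exp(scores - scores.max())    # softmax turns scores into probabilities
    alpha /= alpha.sum()
    C_i = (alpha[:, None] * hs).sum(axis=0)  # weighted sum: C_i = Σ_j α_ij h_j
    return C_i, alpha
```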

After passing through the multi-layer convolutional structure, the image information is compressed into a vector I. When predicting each word, the model needs to attend to some of the information in this vector. The attention parameters W for the image are what we need to train.

$$ A_i = I \circ W_i $$
(5)

Finally, the functional relationship for each predicted word is given below, and the model is shown in Fig. 2.

Fig. 2. Deep neural network of multi-attention mechanism

$$ y_i = f(A_i, C_i, y_{i-1}, y_{i-2}, y_{i-3}, \cdots) $$
(6)
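A minimal sketch of one decoding step in the spirit of Eqs. (5) and (6): the image attention \( A_i \) is an elementwise product of the image vector with learned weights, and it is fed to an LSTM step together with the text context \( C_i \) and the previous word embedding. The function and variable names are illustrative, not the exact implementation.

```python
import numpy as np

def decode_step(I, W_i, C_i, y_prev_emb, h_prev, c_prev, params, W_vocab, b_vocab):
    """One word-prediction step combining image and text attention.

    I: image feature vector; W_i: learned image-attention weights for step i.
    C_i: text context vector from Eq. (4); y_prev_emb: embedding of the last word.
    lstm_step is the function sketched in Sect. 3.2.
    """
    A_i = I * W_i                                   # Eq. (5): elementwise image attention
    x = np.concatenate([A_i, C_i, y_prev_emb])      # inputs to f(·) in Eq. (6)
    h, c = lstm_step(x, h_prev, c_prev, params)     # recurrent state carries earlier words
    logits = W_vocab @ h + b_vocab                  # score every vocabulary word
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # softmax over the vocabulary
    return probs, h, c
```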

4 Experiments

In order to compare with the prior art, we conducted a large number of experiments using the BLEU metric to evaluate the effectiveness of our model. The experiments use the Flickr8k data set and contrast our model with traditional image scene description methods.

4.1 Experimental Environment and Parameter Deployment

The experiments use the popular TensorFlow framework and the VGG-16 model to extract image features. In the whole model, the batch size is set to 8 to relieve the pressure on GPU memory. The fully connected layers act as a “classifier” for the whole convolutional neural network, while the convolutional layers, pooling layers, and activation layers can be viewed as mapping the image into the feature space. Deeper networks generally extract richer features, at the cost of more computation.
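As a minimal sketch of this feature-extraction step, assuming the pre-trained VGG-16 weights that ship with tf.keras (the exact preprocessing pipeline in our experiments may differ):

```python
import tensorflow as tf

# VGG-16 without its fully connected "classifier" head: only the convolutional
# layers are kept, so the output is a feature map rather than class scores.
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                  input_shape=(300, 300, 3))
vgg.trainable = False  # the CNN is used as a fixed feature extractor

def extract_features(image_batch):
    """image_batch: float tensor of shape (batch, 300, 300, 3); batch size 8 in our setup."""
    x = tf.keras.applications.vgg16.preprocess_input(image_batch)
    return vgg(x)  # feature maps fed to the image branch of the model
```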

The experiments use the Flickr8k dataset, which includes Flickr8k_Dataset and Flickr8k_text. We resize each image to 300 × 300 pixels so that more samples can be trained; in Flickr8k_text, each image corresponds to four different descriptions. We use word2vec to vectorize the words: for semantically similar words, the Euclidean distances between their vectors are usually very small. We then keep only the 200 most frequent words and replace the rest with [UNK], which reduces the load on the model.
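A small sketch of the vocabulary truncation, assuming the captions have already been tokenized into lists of words (the cutoff of 200 follows the setup above):

```python
from collections import Counter

def build_vocab(tokenized_captions, vocab_size=200, unk_token="[UNK]"):
    """Keep only the `vocab_size` most frequent words; map everything else to [UNK]."""
    counts = Counter(word for caption in tokenized_captions for word in caption)
    keep = {word for word, _ in counts.most_common(vocab_size)}
    vocab = [unk_token] + sorted(keep)
    word_to_id = {word: idx for idx, word in enumerate(vocab)}

    def encode(caption):
        return [word_to_id.get(word, word_to_id[unk_token]) for word in caption]

    return vocab, encode
```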

4.2 Evaluating Indicator

The popular automatic evaluation method is the BLEU algorithm proposed by IBM [12]. The BLEU method first counts the number of matching n-grams between the reference sentences and the generated sentence, and then takes the ratio of this count to the number of n-grams in the generated sentence as the evaluation index. It focuses on the precision of the words or phrases in the generated sentence. The precision of each order of n-gram can be calculated by the following formula:

$$ P_{n} = \frac{\sum_{i} \sum_{k} \min \left( h_{k}(c_{i}),\ \max_{j \in m} h_{k}(s_{ij}) \right)}{\sum_{i} \sum_{k} h_{k}(c_{i})} $$
(7)

The upper limit of n is 4, i.e., we compute n-gram precision up to 4-grams.
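As a minimal sketch of this evaluation, using NLTK's sentence_bleu (the weight vectors shown correspond to BLEU-1 through BLEU-4; the actual scoring script may differ in smoothing details):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_scores(references, candidate):
    """references: list of tokenized reference captions; candidate: tokenized model output."""
    smooth = SmoothingFunction().method1  # avoids zero scores for short sentences
    weights = {1: (1, 0, 0, 0), 2: (0.5, 0.5, 0, 0),
               3: (1/3, 1/3, 1/3, 0), 4: (0.25, 0.25, 0.25, 0.25)}
    return {n: sentence_bleu(references, candidate, weights=w, smoothing_function=smooth)
            for n, w in weights.items()}
```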

4.3 Results and Discussion

There are two comparisons in the experiment: one against traditional basic models, and the other against current mainstream models. As can be seen from Table 1, the keyword accuracy of the Multi-AM model is better than that of the traditional models.

Table 1. Comparison with other methods on the Flickr8k dataset

From the results in Table 2, the accuracy for one and two keywords is better than that of other mainstream algorithms, but the results for obtaining more keywords are weaker.

Table 2. Comparison with other mainstream models on the Flickr8k dataset

5 Conclusions

Semantic description of images is a complex task, and frameworks based on deep learning have become the current mainstream method. This paper proposes a deep neural network with a multi-attention mechanism: generating a sentence requires understanding both the grammar of the sentence and the content of the image, and each word attends to the image content and to its context to different degrees. Experimental results show that the multi-attention mechanism achieves higher scores under the BLEU evaluation criteria, which indicates that the method can extract more keywords.

Future research should address the lack of sentence-description data sets and improve the semantic matching of sentences. We also need to accurately localize each predicted word in the image, which is very difficult.