
1 Introduction

1.1 Why Should We Learn to Learn from Web Data?

Large annotated datasets, powerful hardware and deep learning techniques are enabling outstanding machine learning results, not only in traditional classification problems but also in more challenging tasks such as image captioning or language translation. Deep neural networks allow building pipelines that can learn patterns from any kind of data with impressive results. One of the bottlenecks of training deep neural networks, though, is the availability of properly annotated data, since deep learning techniques are data hungry. Despite the existence of large-scale annotated datasets such as ImageNet [11], COCO [16] or Places [40] and tools for human annotation such as Amazon Mechanical Turk, the lack of data limits the application of deep learning to specific problems where it is difficult or economically non-viable to obtain proper annotations.

A common strategy to overcome this problem is to first train models on generic datasets such as ImageNet and then fine-tune them to other areas using smaller, specific datasets [36]. Still, we depend on the existence of annotated data to train our models. Another strategy to overcome the insufficiency of data is to use computer graphics techniques to generate artificial data inexpensively. However, while synthetic data has proven to be a valuable source of training data for many applications such as pedestrian detection [19], image semantic segmentation [28] and scene text detection and recognition [8, 26], it is still not easy to generate realistic complex images for some tasks.

An alternative to these strategies is learning from freely available, weakly annotated multimodal data. The Web and Social Media offer an immense amount of images accompanied by other information such as the image title, description or date. This data is noisy and unstructured, but it is free and nearly unlimited. Designing algorithms to learn from Web data is an interesting research area, as it would decouple the evolution of deep learning from the scaling of human-annotated datasets, given the enormous amount of existing Web and Social Media data.

1.2 How to Learn from Web Data?

In some works, such as the WebVision Challenge [14], Web data is used to build a classification dataset: queries are made to search engines using class names and the retrieved images are labeled with the querying class. In such a configuration the learning is limited to a set of pre-established classes and thus cannot generalize to new ones. While working with image labels is very convenient for training traditional visual models, the semantics of such a discrete space is very limited compared with the richness of human language when describing an image. Instead, we define here a scenario where, by exploiting distributional semantics in a given text corpus, we can learn from every word associated with an image. As illustrated in Fig. 1, by leveraging the richer semantics encoded in the learnt embedding space, we can infer previously unseen concepts even though they might not be explicitly present in the training set.

The noisy and unstructured text associated with Web images provides information about the image content that we can use to learn visual features. One strategy to do that is to embed the multimodal data (images and text) in the same vectorial space. In this work we represent text using five different state-of-the-art methods and embed images in the learnt semantic space by means of a regression CNN. We compare the performance of the different text space configurations on a text-based image retrieval task.

Fig. 1. Top-ranked results of combined text queries by our semantic image retrieval model. The learnt joint image-text embedding permits learning a rich semantic manifold, covering even previously unseen concepts that are not explicitly present in the training set.

Fig. 2. First retrieved images for multimodal queries (concepts are added or removed to bias the results) with Word2Vec on WebVision.

2 Related Work

Multimodal image and text embeddings have lately been a very active research area. The possibility of learning jointly from different kinds of data has motivated this field of study, where both general and applied research has been done. DeViSE [22] proposes a pipeline that, instead of learning to predict ImageNet classes, learns to infer the Word2Vec [21] representations of their labels. The result is a model that makes semantically relevant predictions even when it makes errors, and generalizes to classes outside its labeled training set. Gordo and Larlus [7] use captions associated with images to learn a common embedding space for images and text through which they perform semantic image retrieval. They use a tf-idf based BoW representation over the image captions as a semantic similarity measure between images and train a CNN to minimize a margin loss based on the distances of triplets of query-similar-dissimilar images. Gomez et al. [5] use LDA [1] to extract topic probabilities from a collection of Wikipedia articles and train a CNN to embed the associated images in the same topic space. Wang et al. [32] propose a method to learn a joint embedding of images and text for image-to-text and text-to-image retrieval, training a neural network to embed Word2Vec [21] text representations and CNN-extracted features in the same space.

Beyond semantic retrieval, joint image-text embeddings have also been used in more specific applications. Patel et al. [23] use LDA [1] to learn a joint image-text embedding and generate contextualized lexicons for images using only visual information. Gordo et al. [6] embed word images in a semantic space, relying on the graph taxonomy provided by WordNet [27], to perform text recognition. In a more specific application, Salvador et al. [29] propose a joint embedding of food images and their recipes to identify ingredients, using Word2Vec [21] and LSTM representations to encode ingredient names and cooking instructions, and a CNN to extract visual features from the associated images.

Robustness against noisy data has also been addressed by the community, though usually in an implicit way. Patrini et al. [24] address the problem of training a deep neural network with label noise using a loss correction approach, and Xiao et al. [33] propose a method to train a network with a limited number of clean labels and millions of noisy labels. Fu et al. [4] propose an image tagging method robust to noisy training data, and Xu et al. [34] address social image tagging correction and completion. Zhang et al. [20] show how label noise affects the CNN training process and its generalization error.

2.1 Contributions

The work presented here provides a performance comparison of five state-of-the-art text embeddings in multimodal learning, showing results on three different datasets. Furthermore, it shows that multimodal learning can be applied to Web and Social Media data, achieving results in text-based image retrieval competitive with pipelines trained on human-annotated data. Finally, a new dataset formed by Instagram images and their associated texts is presented: InstaCities1M.

3 Multimodal Text-Image Embedding

One of the objectives of this work is to serve as a fair comparison of different text embedding methods when learning from Web and Social Media data. We therefore design a pipeline to test the different methods under the same conditions, where the text embedding is a module that can be replaced by any text representation.

The proposed pipeline is as follows. First, we train the text embedding model on a dataset composed of pairs of images and correlated texts (I, x). Second, we use the text embedding model to generate vectorial representations of those texts. Given a text instance x, we denote its embedding by \(\phi (x) \in \mathbb {R}^d\). Third, we train a CNN to regress those text embeddings directly from the correlated images. Given an image I, its representation in the embedding space is denoted by \(\psi (I) \in \mathbb {R}^d\). Thereby the CNN learns to embed images in the vectorial space defined by the text embedding model. The trained CNN model is then used to generate visual embeddings for the test set images. Figure 3 shows a diagram of the visual embedding training pipeline and the retrieval procedure.

In the image retrieval stage, the vectorial representation of the querying text in the joint text-image space is computed using the text embedding model. Image queries can also be handled by using the visual embedding model instead of the text embedding model to generate the query representation. Furthermore, we can build complex queries by combining different query representations with simple algebra in the joint text-image space. To retrieve the image \(I_R\) most semantically similar to a query \(x_q\), we compute the cosine similarity between its vectorial representation \(\phi (x_q)\) and the visual embeddings of the test set images \(\psi (I_T)\), and retrieve the nearest image in the joint text-image space:

$$\begin{aligned} I_R = {\mathop {\hbox {arg max}}\limits _{I_T \in \text {Test}}} \frac{ \langle \phi (x_q) , \psi (I_T) \rangle }{||\phi (x_q)|| \cdot ||\psi (I_T)||}. \end{aligned}$$
(1)
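
For illustration, here is a minimal sketch of this retrieval step in Python (NumPy), assuming the text and visual embeddings have already been computed; the array and function names are hypothetical:

```python
import numpy as np

def retrieve(query_vec, image_embeddings, top_k=5):
    """Rank test images by cosine similarity to a query embedding.

    query_vec: (d,) query representation, e.g. phi(x_q) or psi(I_q).
    image_embeddings: (N, d) visual embeddings psi(I_T) of the test set.
    Returns the indices of the top_k most similar test images.
    """
    q = query_vec / np.linalg.norm(query_vec)
    V = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    sims = V @ q                       # cosine similarity with every test image
    return np.argsort(-sims)[:top_k]   # most similar images first

# Complex queries are built with simple algebra in the joint space, e.g.
# (phi and psi stand for the hypothetical embedding functions):
#   q = phi("skyline") + phi("night")          # combine two text concepts
#   q = psi(query_image) - 0.5 * phi("snow")   # remove a concept from an image query
```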

State-of-the-art text embedding methods trained on large text corpora are very good at generating representations of text in a vector space where semantically similar concepts fall close to each other. The proposed pipeline leverages the semantic structure of those text embedding spaces by training a visual embedding model that generates vectorial representations of images in the same space, mapping semantically similar images close to each other, and also close to the texts correlated with the image content. Note that the proposed joint text-image embedding can be extended to other tasks besides image retrieval, such as image annotation, tagging or captioning.

Fig. 3. Pipeline of the visual embedding model training and the image retrieval by text.

3.1 Visual Embedding

A CNN is trained to regress text embeddings from the correlated images, minimizing a sigmoid cross-entropy loss that reduces the distance between the text and image embeddings. Let \(\{(I_n, x_n)\}_{n=1:N}\) be a batch of image-text pairs. With \(\sigma (\cdot )\) the component-wise sigmoid function, we denote \(p_n = \sigma (\phi (x_n))\) and \(\hat{p}_n = \sigma (\psi (I_n))\), and define the loss as:

$$\begin{aligned} L = -\tfrac{1}{N}\sum _{n=1}^{N} [\, p_n \log \hat{p}_n + (1-p_n) \log (1-\hat{p}_n) \, ], \end{aligned}$$
(2)

where the sum’s inner expression is averaged over all vector components. The GoogleNet architecture [30] is used, customizing the last layer to regress a vector of the same dimensionality as the text embedding. We train with a Stochastic Gradient Descent optimizer with a learning rate of 0.001, multiplied by 0.1 every 100,000 iterations, and a momentum of 0.9. The batch size is set to 120, and random cropping and mirroring are used as online data augmentation. With these settings the CNN trainings converge after around 300K–500K iterations. We use the Caffe [10] framework and initialize with the ImageNet [11] trained model to speed up training. Notice that, despite initializing with a model trained on human-annotated data, this does not imply a dependence on annotated data, since the resulting model can generalize to many more concepts than the ImageNet classes. We trained one model from scratch and obtained similar results, although more training iterations were needed.
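
As an illustration, the loss in Eq. (2) can be written as follows; this is a sketch in NumPy (not the actual Caffe layer), where `text_emb` and `img_emb` hold the batched \(\phi(x_n)\) and \(\psi(I_n)\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cross_entropy_loss(text_emb, img_emb, eps=1e-7):
    """Eq. (2): cross-entropy between sigmoided text and image embeddings.

    text_emb: (N, d) batch of text embeddings phi(x_n), acting as soft targets.
    img_emb:  (N, d) batch of CNN outputs psi(I_n).
    """
    p = sigmoid(text_emb)        # soft targets p_n
    p_hat = sigmoid(img_emb)     # predictions p_hat_n
    ce = -(p * np.log(p_hat + eps) + (1.0 - p) * np.log(1.0 - p_hat + eps))
    return ce.mean()             # averaged over vector components and the batch
```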

3.2 Text Embeddings

Text vectorization methods are diverse in terms of architecture and the text structure they are designed to deal with. Some methods are oriented to vectorizing individual words and others to vectorizing full texts or paragraphs. In this work we consider the top-performing text embeddings and test them in our pipeline to evaluate their performance when learning from Web and Social Media data. Here we briefly explain the main characteristics of each text embedding method used.

LDA [1]: Latent Dirichlet Allocation learns latent topics from a collection of text documents and maps words to a vector of probabilities over those topics. It can describe a document by assigning a topic distribution to it, which in turn has word distributions assigned. An advantage of this method is that it gives interpretable topics.

Word2Vec [21]: Using large amounts of unannotated plain text, Word2Vec learns relationships between words automatically using a feed-forward neural network. It builds distributed semantic representations of words using their context, considering words both before and after the target word.

FastText [2]: An extension of Word2Vec which treats each word as composed of character n-grams, learning representations for n-grams instead of words. The vector of a word is the sum of its character n-gram vectors, so it can generate embeddings for out-of-vocabulary words.

Doc2Vec [12]: Extends the Word2Vec idea to documents. Instead of learning feature representations for words, it learns them for sentences or documents.

GloVe [25]: It is a count-based model. It learns the vectors by essentially doing dimensionality reduction on the co-occurrence counts matrix. Training is performed on aggregated global word-word co-occurrence statistics from a corpus.

To the best of our knowledge, this is the first time these text embeddings are trained from scratch on the same corpus and evaluated on the image retrieval by text task. We used the Gensim implementations of LDA, Word2Vec, FastText and Doc2Vec, and the GloVe implementation by Maciej Kula. While LDA and Doc2Vec can generate embeddings for documents, Word2Vec, GloVe and FastText only generate word embeddings. To get document embeddings from these methods, we consider two standard strategies: first, computing the document embedding as the mean embedding of its words; second, computing a tf-idf weighted mean of the words in the document. For all embeddings a dimensionality of 400 has been used. This value has been selected because it is the one used in the Doc2Vec paper [12], which compares Doc2Vec with other text embedding methods, and it is enough to obtain optimal performance with Word2Vec, FastText and GloVe, as shown in [21], [2] and [25] respectively. For LDA a dimensionality of 200 has also been considered.
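
Below is a minimal sketch of the two aggregation strategies for the word-level embeddings (plain mean and tf-idf weighted mean), assuming a trained Gensim word-vector model and a precomputed idf dictionary; both inputs are placeholders here:

```python
import numpy as np
from collections import Counter

def doc_embedding(tokens, wv, idf=None, dim=400):
    """Aggregate word vectors into a document embedding.

    tokens: list of words from an image's caption and hashtags.
    wv: word -> vector mapping (e.g. Gensim KeyedVectors) supporting `in` and [].
    idf: optional word -> idf weight dict; if given, a tf-idf weighted mean is
         computed, otherwise a plain mean over the document's word tokens.
    """
    counts = Counter(t for t in tokens if t in wv)   # drop out-of-vocabulary words
    if not counts:
        return np.zeros(dim)
    words = list(counts)
    vecs = np.stack([wv[w] for w in words])
    if idf is None:
        weights = np.array([counts[w] for w in words], dtype=float)     # term frequency
    else:
        weights = np.array([counts[w] * idf.get(w, 0.0) for w in words])
        if weights.sum() == 0:
            weights = np.ones(len(words))                               # fall back to plain mean
    return np.average(vecs, axis=0, weights=weights)
```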

4 Experiments

4.1 Benchmarks

InstaCities1M. A dataset formed by Instagram images associated with one of the 10 most populated English-speaking cities in the world (one of these city names appears in the image caption). It contains 100K images for each city, making a total of 1M images, split into 800K training images, 50K validation images and 150K test images. The interest of this dataset is that it is formed by recent Social Media data. The text associated with the images consists of the description and the hashtags written by the photo uploaders, so it is the kind of freely available data that would be very interesting to be able to learn from. The InstaCities1M dataset is available at https://gombru.github.io/2018/08/01/InstaCities1M/.

WebVision [15]. It contains more than 2.4 million images crawled from the Flickr website and Google Images search. The same 1,000 concepts as in the ILSVRC 2012 dataset [11] are used for querying images. The textual information accompanying those images (caption, user tags and description) is provided. The validation set, which is used as the test set in this work, contains 50K images.

MIRFlickr [9]. It contains 25,000 images collected from Flickr, annotated using 24 predefined semantic concepts. 14 of those concepts are divided into two categories: (1) strong correlation concepts and (2) weak correlation concepts. The correlation between an image and a concept is strong if the concept appears predominantly in the image. For differentiation, we denote strong correlation concepts by a suffix "*". Counting strong and weak concepts separately, we get 38 concepts in total. All images in the dataset are annotated with at least one of those concepts. Additionally, all images have associated tags collected from Flickr. Following the experimental protocol in [13, 17, 18, 35], tags that appear fewer than 20 times are first removed, and then instances without tags or annotations are removed.

4.2 Retrieval on InstaCities1M and WebVision Datasets

Experiment Setup. To evaluate the learnt joint embeddings, we define a set of textual queries and check visually whether the top-5 retrieved images contain the querying concept. We define 24 different queries. Half of them are single-word queries and the other half two-word queries. They have been selected to cover a wide range of semantic concepts that are usually present in Web and Social Media data. Both simple and complex queries are divided into four categories: urban, weather, food and people. The simple queries are: car, skyline, bike; sunrise, snow, rain; ice-cream, cake, pizza; woman, man, kid. The complex queries are: yellow + car, skyline + night, bike + park; sunrise + beach, snow + ski, rain + umbrella; ice-cream + beach, chocolate + cake, pizza + wine; woman + bag, man + boat, kid + dog. For complex queries, only images containing both querying concepts are considered correct.
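
The P@5 numbers reported in Tables 1 and 2 then reduce to a simple computation; here is a short sketch, assuming each query comes with manually assigned binary relevance flags for its ranked retrievals (a hypothetical input format):

```python
def precision_at_k(relevance, k=5):
    """relevance: 0/1 flags for the ranked retrieved images of a single query."""
    return sum(relevance[:k]) / float(k)

def mean_precision_at_k(relevance_per_query, k=5):
    """relevance_per_query: one relevance list per query; returns the mean P@k."""
    return sum(precision_at_k(r, k) for r in relevance_per_query) / len(relevance_per_query)
```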

Table 1. Performance on InstaCities1M and WebVision. First column shows the mean P@5 for all the queries, second for the simple queries and third for complex queries.
Table 2. Performance on transfer learning. First column shows the mean P@5 for all the queries, second for the simple queries and third for complex queries.
Fig. 4. First retrieved images for city-related complex queries with Word2Vec on InstaCities1M.

Fig. 5. First retrieved images for non-object text queries with Word2Vec on InstaCities1M.

Results and Conclusions. Tables 1 and 2 show the mean Precision at 5 for the InstaCities1M and WebVision datasets and for transfer learning between those datasets. To compute the transfer learning results, we train the model on one dataset and test on the other. Figures 1 and 4 show the first retrieved images for some complex textual queries. Figure 5 shows results for non-object queries, demonstrating that our pipeline works beyond traditional instance-level retrieval. Figure 2 shows that retrieval also works with multimodal queries combining an image and text.

For complex queries, where we demand two concepts to appear in the retrieved images, we obtain good results for those queries whose concepts tend to appear together. For instance, we generally retrieve correct images for "skyline + night" and for "bike + park", but we do not retrieve images for "dog + kid". When failing with these complex queries, images where only one of the two querying concepts appears are usually retrieved. Figure 6 shows that in some cases images corresponding to semantic concepts between the two querying concepts are retrieved, which shows that the learnt common embedding space has a semantic structure.

The performance is generally better on InstaCities1M than on WebVision. The reason is that the queries are closer to the kind of images people tend to post on Instagram than to the ImageNet classes. However, the transfer learning results show that WebVision is a better dataset to train on than InstaCities1M. The results also show that all the tested text embedding methods work quite well for simple queries. However, LDA fails when trained on WebVision. This is because LDA learns latent topics with semantic sense from the training data; every WebVision image is associated with one of the 1,000 ImageNet classes, which strongly influences the topic learning, so the embedding fails when the queries are not related to those classes. The top-performing methods are GloVe when training with InstaCities1M and Word2Vec when training with WebVision, but the difference between their performances is small. FastText achieves good performance on WebVision but poor performance on InstaCities1M compared to the other methods. One explanation is that, while Social Media data contains more colloquial vocabulary, WebVision contains domain-specific and diverse vocabulary, and since FastText learns representations for character n-grams, it is more suitable for learning representations from corpora that are morphologically rich. Doc2Vec does not work well on either dataset, because it is oriented to texts larger than those accompanying images in Web and Social Media. For the word embedding methods Word2Vec and GloVe, the results computing the text representation as the mean or as the tf-idf weighted mean of the word embeddings are similar.

Error Analysis. The most remarkable sources of errors are listed and explained below.

Visual Features Confusion: Errors due to confusion between visually similar objects, for instance retrieving images of a quiche when querying "pizza". These errors could be avoided using more data and higher-dimensional representations, since the problem is the lack of training data to learn visual features that generalize to unseen samples.

Errors from the Dataset Statistics: An important source of errors is due to dataset statistics. As an example, the WebVision dataset contains a "snow leopard" class with many images of that concept. The word "snow" appears frequently in the descriptions correlated with those images, so the net learns to embed together the word "snow" and the visual features of a snow leopard. Since there are many more images of "snow leopard" than of "snow", when we query "snow" we get snow leopard images. Figure 7 shows this error and how we can use complex multimodal queries to bias the results.

Fig. 6. First retrieved images for simple (left and right columns) and complex weighted queries with Word2Vec on InstaCities1M.

Fig. 7. First retrieved images for text queries using Word2Vec on WebVision. Concepts are removed to bias the results.

Words with Different Meanings or Uses: Words with different meanings, or words that people use in different scenarios, introduce unexpected behaviors. For instance, when we query "woman + bag" on the InstaCities1M dataset we usually retrieve images of pink bags. The reason is that people tend to write "woman" in an image caption when pink stuff appears. Those are considered errors in our evaluation, but inferring which images people relate to certain words in Social Media can be a very interesting research direction.

4.3 Retrieval in the MIRFlickr Dataset

To compare the performance of our pipeline with other image retrieval by text systems, we use the MIRFlickr dataset, which is typically used to train and evaluate image retrieval systems. The objective is to demonstrate the quality of the multimodal embeddings learnt solely from Web data by comparing them to supervised methods.

Table 3. MAP on the image by text retrieval task on MIRFlickr as defined in [18, 35].
Table 4. MAP on the image by text retrieval task on MIRFlickr as defined in [38].
Table 5. AP scores for 38 semantic concepts and MAP on MIRFlickr. Blue numbers compare our method trained with InstaCities and other methods trained with the target dataset.

Experiment Setup. We consider three different experiments: (1) Using as queries the tags accompanying the query images and computing the MAP over all queries; a retrieved image is considered correct if it shares at least one tag with the query image. For this experiment the splits used are a 5% query set and a 95% training and retrieval set, as defined in [18, 35]. (2) Using the class names as queries; a retrieved image is considered correct if it is tagged with the query concept. For this experiment the splits used are a 50% training set and a 50% retrieval set, as defined in [31]. (3) Same as experiment 1 but using the MIRFlickr train-test split proposed by Zhang et al. [38].
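
A sketch of the MAP computation for experiment 1, where an image is relevant to a query if it shares at least one tag with the query image; the data structures are hypothetical:

```python
import numpy as np

def average_precision(relevance):
    """relevance: binary flags over the full ranked retrieval list of one query."""
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # precision at each rank
    return float((precisions * rel).sum() / rel.sum())

def mean_average_precision(query_tags, ranked_tags_per_query):
    """query_tags[i]: tag set of query image i.
    ranked_tags_per_query[i]: tag sets of the retrieved images, in ranked order."""
    aps = []
    for q_tags, ranked in zip(query_tags, ranked_tags_per_query):
        rel = [1 if q_tags & r_tags else 0 for r_tags in ranked]
        aps.append(average_precision(rel))
    return sum(aps) / len(aps)
```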

Results and Conclusions. Tables 3 and 4 show the results of experiments 1 and 3 respectively. We observe that our pipeline, trained with Web and Social Media data in a multimodal self-supervised fashion, achieves competitive results. When trained with the target dataset, our pipeline outperforms the other methods. Table 5 shows the results of experiment 2. Our pipeline with the GloVe tf-idf text embedding trained on InstaCities1M outperforms state-of-the-art methods in most of the classes and in MAP. If we train with the target dataset, results improve significantly. Notice that, despite being applied here to the classes and tags existing in MIRFlickr, our pipeline is generic and has learnt to produce joint image and text embeddings for many more semantic concepts, as seen in the qualitative examples.

4.4 Comparing the Image and Text Embeddings

Experiment Setup. To evaluate how well the CNN has learnt to map images to the text embedding space, and the semantic quality of that space, we perform the following experiment: we build random image pairs from the MIRFlickr dataset and compute the cosine similarity between both their image and their text embeddings. In Fig. 8 we plot the image embedding distance vs. the text embedding distance of 20,000 random image pairs. If the CNN has correctly learnt to map images to the text embedding space, the distance between the image embeddings and the distance between the text embeddings of a pair should be similar, and points in the plot should fall around the identity line \(y=x\). Also, if the learnt space has a semantic structure, both the distance between image embeddings and the distance between text embeddings should be smaller for pairs sharing more tags: the color of each plotted point reflects the number of common tags of the image pair, so pairs sharing more tags should be closer to the axis origin.

Fig. 8. Text embedding distance (X) vs. image embedding distance (Y) for random image pairs, for the LDA, Word2Vec and GloVe embeddings trained with InstaCities1M. Distances have been normalized to [0,1]. Points are red if the pair does not share any tag, orange if it shares 1, light orange if it shares 2, yellow if it shares 3 and green if it shares more. R\(^{2}\) is the coefficient of determination of image and text distances. (Color figure online)

As an example, take a dog image with the tag "dog", a cat image with the tag "cat" and a scarab image with the tag "scarab". If the text embedding has been learnt correctly, the distance between the projections of the dog and scarab tags in the text embedding space should be bigger than the one between the dog and cat tags, but smaller than the one between completely unrelated pairs. If the CNN has correctly learnt to embed the images of those animals in the text embedding space, the distance between the dog and cat image embeddings should be similar to the one between their tag embeddings (and the same for any pair), so the point given by the pair should fall on the identity line. Furthermore, that point should be nearer to the coordinate origin than the point given by the dog and scarab pair, which should also fall on the identity line, and nearer to the coordinate origin than another pair that has no relation at all.
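
A minimal sketch of this pairwise analysis, assuming matrices holding the visual and text embeddings of the MIRFlickr images (variable names are placeholders):

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pairwise_distance_analysis(img_emb, txt_emb, n_pairs=20000, seed=0):
    """img_emb, txt_emb: (N, d) arrays where row i holds psi(I_i) and phi(x_i).
    Returns the text distances, image distances and the R^2 of a linear fit."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(img_emb), size=(n_pairs, 2))   # random image pairs
    d_txt = np.array([cosine_distance(txt_emb[i], txt_emb[j]) for i, j in idx])
    d_img = np.array([cosine_distance(img_emb[i], img_emb[j]) for i, j in idx])
    # For a simple linear regression, R^2 equals the squared Pearson correlation.
    r = np.corrcoef(d_txt, d_img)[0, 1]
    return d_txt, d_img, r ** 2
```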

Results and Conclusions. The plots for both the Word2Vec and the GloVe embeddings show a similar shape. The resulting blob is elongated along the \(y=x\) direction, which shows that both image and text embeddings tend to provide similar distances for an image pair. The blob is thinner and closer to the identity line when the distances are smaller (i.e. when the image pairs are related), which means that the embeddings can provide a valid distance for semantic concepts that are close enough (dog, cat), but fail to infer distances between weakly related concepts (car, skateboard). The colors of the points in the plots show that the learnt space has a semantic structure: points corresponding to pairs with more tags in common are closer to the coordinate origin and have smaller distances between the image and the text embedding. From the colors it can also be deduced that the CNN is good at inferring distances for related image pairs: there are only a few pairs having more than 3 tags in common with an image embedding distance bigger than 0.6, while there are many pairs with bigger distances that have no tags in common. However, the visual embedding sometimes fails and infers small distances for image pairs that are not related, such as those pairs having no tags in common and an image embedding distance below 0.2.

The plot for the LDA embedding shows that the learnt joint embedding is not as good, either in terms of the CNN mapping images to the text embedding space or in terms of the semantic structure of the space. The blob does not follow the identity line direction as closely, which means that the CNN and LDA do not infer similar distances for the images and texts of a pair. The point colors show that the CNN infers smaller distances for more similar image pairs only when the pairs are very related.

The coefficient of determination R\(^{2}\) measures the proportion of the variance in a dependent variable that is predictable from a predictor variable via linear regression. In this case, it can be interpreted as a measure of how well image distances can be predicted from text distances and, therefore, of how well the visual embedding has learnt to map images to the joint image-text space. It confirms the visual inspection of our plots, showing that the visual embeddings trained with Word2Vec and GloVe representations have learnt a much more accurate mapping than LDA, and that Word2Vec is better in terms of that mapping.

5 Conclusions

In this work we learn a joint visual and textual embedding using Web and Social Media data and benchmark state-of-the-art text embeddings on the image retrieval by text task. We conclude that GloVe and Word2Vec are the best-suited embeddings for this data, with similar performance, and that they achieve results competitive with supervised methods on the image retrieval by text task. We show that our models go beyond instance-level image retrieval to semantic retrieval, and that they can handle multi-concept queries as well as multimodal queries composed of a visual query and a text modifier that biases the results. We clearly outperform the state of the art on the MIRFlickr dataset when training on the target data. The code used in this project is available at https://github.com/gombru/LearnFromWebData.