
1 Introduction

Automatic image annotation has become an important and challenging problem due to the semantic gap between low-level visual features and high-level semantics. The state-of-the-art techniques of image auto-annotation can be roughly divided into two schools of thought. The first defines auto-annotation as a traditional supervised classification problem, which treats each word (or semantic concept) as an independent class and creates a different classifier for every word. This approach computes similarity at the visual level and annotates a new image by propagating the corresponding words. The second perspective treats images and texts as equivalent data. It attempts to discover the correlation between visual features and textual words on an unsupervised basis by estimating the joint distribution of features and words, thus posing annotation as statistical inference in a graphical model. Under this perspective, images are treated as bags of words and features, each of which is assumed to be generated by a hidden variable. Various approaches differ in the definition of the states of the hidden variable: some associate them with images in the database, while others associate them with image clusters or latent aspects (topics).

As latent aspect models, PLSA [8] and latent Dirichlet allocation (LDA) [3] have been successfully applied to annotate and retrieve images. PLSA-WORDS [12] is a representative approach, which achieves the annotation task by constraining the latent space to ensure its consistency in words. However, since standard PLSA can only handle discrete quantities (such as textual words), this approach quantizes feature vectors into discrete visual words for PLSA modeling, so its annotation performance is sensitive to the clustering granularity. GM-PLSA [11] deals with the data of different modalities according to their characteristics: it assumes that the feature vectors in an image are governed by a Gaussian distribution under a given latent aspect rather than a multinomial one, and employs continuous PLSA and standard PLSA to model visual features and textual words respectively. The model learns the correlation between the two modalities by an asymmetric learning approach and can then predict semantic annotations for unseen images.

Canonical correlation analysis (CCA) is a data analysis and dimensionality reduction method similar to PCA. While PCA deals with only one data space, CCA is a technique for joint dimensionality reduction across two spaces that provide heterogeneous representations of the same data, and it remains a classical but powerful method for analyzing such paired multi-view data. Since CCA can be interpreted as an approximation to Gaussian PLSA and can also be regarded as an extension of Fisher linear discriminant analysis (FDA) to multi-label classification [1], learning topic models through CCA is not only computationally efficient, but also promising for multi-label image annotation and retrieval.

However, CCA requires the data to be rigorously paired, i.e., in one-to-one correspondence across the views, due to its correlation definition; this requirement is usually not satisfied in real-world applications for various reasons. To cope with this problem, several extensions of CCA have been proposed to utilize the meaningful prior information hidden in additional unpaired data. Blaschko et al. [2] propose semi-supervised Laplacian regularization of kernel canonical correlation (SemiLRKCCA) to find a set of highly correlated directions by exploiting the intrinsic manifold geometry of all data (paired and unpaired). SemiCCA [10] resembles manifold regularization, i.e., it uses the global structure of the whole training data, including both paired and unpaired samples, to regularize CCA; consequently, SemiCCA seamlessly bridges CCA and principal component analysis (PCA) and inherits characteristics of both. Gu et al. [6] propose partially paired locality correlation analysis (PPLCA), which deals with the semi-paired scenario of wireless sensor network localization by combining the neighbourhood structure information in the data. Most recently, Chen et al. [4] present a general dimensionality reduction framework for semi-paired and semi-supervised multi-view data which naturally generalizes existing related works by using different kinds of prior information. Based on this framework, they develop a novel dimensionality reduction method, termed semi-paired and semi-supervised generalized correlation analysis (S2GCA), which exploits a small amount of paired data to perform CCA.

We propose a semi-supervised variant of CCA named SemiPCCA based on the probabilistic model for CCA. The estimation of the SemiPCCA model parameters is affected by the unpaired multi-view data (e.g. unlabelled images), which reveal the global structure within each modality. We then present an automatic image annotation method based on SemiPCCA. By estimating the relevance between images and words using labelled and unlabelled images together, this method is shown to be more accurate than previously published methods.

This paper is organized as follows. After introducing the framework of the proposed SemiPCCA model in Sect. 2, we formally present our automatic image annotation method based on SemiPCCA in Sect. 3. Finally, Sect. 4 presents experimental results and Sect. 5 concludes the paper.

2 Framework

In this section, we first review a probabilistic model for CCA. Armed with this probabilistic reformulation, we then present our semi-supervised variant of CCA named SemiPCCA. The estimation of the SemiPCCA model parameters is affected by the unpaired multi-view data (e.g. unlabelled images), which reveal the global structure within each modality.

2.1 Probabilistic Canonical Correlation Analysis

In [1], Bach and Jordan propose a probabilistic interpretation of CCA. In this model, two random vectors \(x_1 \in \mathbb {R}^{m_1}\) and \(x_2 \in \mathbb {R}^{m_2}\) are considered to be generated by the same latent variable \(z \in \mathbb {R}^d (\min {(m_1,m_2)} \geqslant d \geqslant 1)\) and are thus “correlated” with each other.

Specifically, the observations of \(x_1\) and \(x_2\) are generated from the same latent variable z (Gaussian with zero mean and unit variance) through unknown linear transformations \(W_1\) and \(W_2\) by adding Gaussian noise \(\varepsilon _1\) and \(\varepsilon _2\), i.e.,

$$\begin{aligned} \begin{aligned} P\left( z\right) \sim&\mathcal {N} \left( 0,I_d \right) ,\\ P\left( \varepsilon _1\right) \sim \mathcal {N} \left( 0,\varPsi _1 \right) ,&P\left( \varepsilon _2\right) \sim \mathcal {N} \left( 0,\varPsi _2 \right) ,\\ x_1=W_1 z+\mu _1+&\varepsilon _1,W_1 \in \mathbb {R}^{m_1 \times d},\\ x_2=W_2 z+\mu _2+&\varepsilon _2,W_2 \in \mathbb {R}^{m_2 \times d}. \end{aligned} \end{aligned}$$
(1)
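
To make the generative process concrete, the following minimal numpy sketch draws paired observations from the model of Eq. (1); all parameter values (`W1`, `mu1`, `Psi1`, etc.) are illustrative placeholders supplied by the caller, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pcca(N, W1, W2, mu1, mu2, Psi1, Psi2):
    """Draw N paired observations following Eq. (1): z ~ N(0, I_d) and
    x_i = W_i z + mu_i + eps_i with eps_i ~ N(0, Psi_i)."""
    d = W1.shape[1]
    Z = rng.standard_normal((N, d))                       # latent variables z
    E1 = rng.multivariate_normal(np.zeros(len(mu1)), Psi1, size=N)
    E2 = rng.multivariate_normal(np.zeros(len(mu2)), Psi2, size=N)
    X1 = Z @ W1.T + mu1 + E1                              # view-1 observations
    X2 = Z @ W2.T + mu2 + E2                              # view-2 observations
    return X1, X2
```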

From [1], the corresponding maximum-likelihood estimations to the unknown parameters \(\mu _1\), \(\mu _2\), \(W_1\), \(W_2\), \(\varPsi _1\) and \(\varPsi _2\) are

$$\begin{aligned} \begin{aligned} \hat{\mu }_1=\frac{1}{N}\sum _{i=1}^{N} x_1^i,&\hat{\mu }_2=\frac{1}{N}\sum _{i=1}^{N} x_2^i,\\ \hat{W}_1=\widetilde{\varSigma }_{11}U_{1d}M_1,&\hat{W}_2=\widetilde{\varSigma }_{22}U_{2d}M_2, \\ \hat{\varPsi }_1=\widetilde{\varSigma }_{11}-\hat{W}_1\hat{W}_1^T,&\hat{\varPsi }_2=\widetilde{\varSigma }_{22}-\hat{W}_2\hat{W}_2^T, \end{aligned} \end{aligned}$$
(2)

where \(\widetilde{\varSigma }_{11}\), \(\widetilde{\varSigma }_{22}\) have the same meaning as in standard CCA, the columns of \(U_{1d}\) and \(U_{2d}\) are the first d canonical directions, \(P_d\) is the diagonal matrix whose diagonal elements are the first d canonical correlations, and \(M_1\), \(M_2 \in \mathbb {R}^{d \times d}\), with spectral norms smaller than one, satisfy \(M_1M_2^T=P_d\). In what follows, we let \(M_1=M_2=(P_d)^{1 / 2}\). The posterior expectations of z given \(x_1\) and \(x_2\) are

$$\begin{aligned} \begin{aligned} E \left( z \vert x_1 \right) =M_1^T U_{1d}^T \left( x_1-\hat{\mu }_1 \right) , \\ E \left( z \vert x_2 \right) =M_2^T U_{2d}^T \left( x_2-\hat{\mu }_2 \right) . \end{aligned} \end{aligned}$$
(3)

Thus, \(E \left( z \vert x_1 \right) \) and \(E \left( z \vert x_2 \right) \) lie in the d-dimensional subspaces that are identical with those of standard CCA.
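
For reference, the sketch below computes the maximum-likelihood parameters of Eq. (2) and the posterior expectation of Eq. (3) from paired data via standard CCA. It is a minimal sketch assuming non-singular sample covariances and the choice \(M_1=M_2=(P_d)^{1/2}\) made above, not a numerically robust implementation.

```python
import numpy as np

def _invsqrt(S):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def pcca_ml(X1, X2, d):
    """ML estimates of Eq. (2) from paired rows of X1 (N x m1), X2 (N x m2)."""
    X1c, X2c = X1 - X1.mean(0), X2 - X2.mean(0)           # subtract mu_1, mu_2
    N = len(X1)
    S11, S22 = X1c.T @ X1c / N, X2c.T @ X2c / N           # sample covariances
    S12 = X1c.T @ X2c / N
    U, P, Vt = np.linalg.svd(_invsqrt(S11) @ S12 @ _invsqrt(S22))
    U1d = _invsqrt(S11) @ U[:, :d]                        # canonical directions
    U2d = _invsqrt(S22) @ Vt.T[:, :d]
    M = np.diag(np.sqrt(P[:d]))                           # M1 = M2 = P_d^{1/2}
    W1, W2 = S11 @ U1d @ M, S22 @ U2d @ M
    Psi1, Psi2 = S11 - W1 @ W1.T, S22 - W2 @ W2.T
    E_z_x1 = X1c @ U1d @ M                                # rows: E(z | x1), Eq. (3)
    return W1, W2, Psi1, Psi2, E_z_x1
```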

2.2 Semi-supervised PCCA

Consider a set of paired samples of size \(N_p\), \({X}_1^P=\{({x}_1^i)\}_{i=1}^{N^p}\) and \({X}_2^P=\{({x}_2^i)\}_{i=1}^{N^p}\), where each sample \({x}_1^i\) (resp. \({x}_2^i\)) is represented as a vector of dimension \(m_1\) (resp. \(m_2\)). When the number of paired samples is small, CCA tends to overfit the given paired samples. Let us therefore consider the situation where unpaired samples \({X}_1^U=\{({x}_1^j)\}_{j=N^p+1}^{N^1}\) and/or \({X}_2^U=\{({x}_2^k)\}_{k=N^p+1}^{N^2}\) are additionally provided, where \({X}_1^U\) and \({X}_2^U\) might be generated independently of each other. Since the original CCA and PCCA cannot directly incorporate such unpaired samples, we propose a novel method named Semi-supervised PCCA (SemiPCCA) that avoids overfitting by utilizing the additional unpaired samples. See Fig. 1 for an illustration of the graphical model of the SemiPCCA model.

Fig. 1. Graphical model for Semi-supervised PCCA. The box denotes a plate comprising a data set of \(N_p\) paired observations, together with the additional unpaired samples.

The whole observation is now \(D=\{({x}_1^i,{x}_2^i)\}_{i=1}^{N^p} \cup \{({x}_1^j)\}_{j=N^p+1}^{N^1} \cup \{({x}_2^k)\}_{k=N^p+1}^{N^2}\). Under the assumption that all data points are independent, the likelihood is calculated as

$$\begin{aligned} \begin{aligned} L(\varTheta )=\prod _{i=1}^{N^p}P({x}_1^i,{x}_2^i;\varTheta )\prod _{j=N^p+1}^{N^1}P({x}_1^j;\varTheta )\prod _{k=N^p+1}^{N^2}P({x}_2^k;\varTheta ) \end{aligned} \end{aligned}$$
(4)

In the SemiPCCA model, for paired samples \(\{({x}_1^i,{x}_2^i)\}_{i=1}^{N^p}\), \({x}_1^i\) and \({x}_2^i\) are considered to be generated by the same latent variable \({z}^i\), and \(P({x}_1^i,{x}_2^i)\) is calculated as in the PCCA model, i.e.

$$\begin{aligned} \begin{aligned} P({x}_1^i,{x}_2^i;\varTheta ) \sim \mathcal {N} \left( \left( \begin{array} {c} \mu _1 \\ \mu _2 \end{array} \right) , \left( \begin{array} {cc} W_1W_1^T+\varPsi _1 &{} W_1W_2^T \\ W_2W_1^T &{} W_2W_2^T +\varPsi _2 \\ \end{array} \right) \right) . \end{aligned} \end{aligned}$$
(5)

Whereas for unpaired observations \({X}_1^U=\{({x}_1^j)\}_{j=N^p+1}^{N^1}\) and/or \({X}_2^U=\{({x}_2^k)\}_{k=N^p+1}^{N^2}\), \({x}_1^j\) and \({x}_2^k\) are separately generated from the latent variables \({z_1}^j\) and \({z_2}^k\) with linear transformations \(W_1\) and \(W_2\) by adding Gaussian noise \(\varepsilon _1\) and \(\varepsilon _2\). From Eq. (1),

$$\begin{aligned} \begin{aligned} P({x}_1^j;&\varTheta ) \sim \mathcal {N} \left( \mu _1, W_1W_1^T+\varPsi _1\right) , \\ P({x}_2^k;&\varTheta ) \sim \mathcal {N} \left( \mu _2, W_2W_2^T+\varPsi _2\right) . \end{aligned} \end{aligned}$$
(6)

For the means of \({x}_1\) and \({x}_2\) we have

$$\begin{aligned} \hat{\mu _1} = \frac{1}{N_1}\sum _{i=1}^{N_1} x_1^i, \hat{\mu _2} = \frac{1}{N_2}\sum _{i=1}^{N_2} x_2^i, \end{aligned}$$
(7)

which are just the sample means. Since they remain the same across all EM iterations, we can centre the data \({X}_1^P\cup {X}_1^U\), \({X}_2^P\cup {X}_2^U\) by subtracting these means at the beginning and ignore these parameters in the learning process. For simplicity, \(x_1^i\), \(x_2^i\), \(x_1^j\) and \(x_2^k\) denote the centred vectors in the following.
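
In code, this preprocessing amounts to two lines per view (a sketch continuing the notation above, with `X1p` and `X1u` as row matrices; view 2 is handled identically):

```python
import numpy as np

mu1_hat = np.vstack([X1p, X1u]).mean(axis=0)   # Eq. (7): sample mean over all view-1 data
X1p, X1u = X1p - mu1_hat, X1u - mu1_hat        # centre once before running EM
```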

For the two mapping matrices, we have the updates

$$\begin{aligned} \begin{aligned} \hat{W_1} = ( \sum _{i=1}^{N_p} x_1^i{\langle {z}^i \rangle }^T +&\sum _{j=N_p+1}^{N_1} x_1^j{\langle {z_1}^j \rangle }^T ) { ( \sum _{i=1}^{N_p} \langle {z}^i{{z}^i}^T \rangle + \sum _{j=N_p+1}^{N_1} \langle {z_1}^j{{z_1}^j}^T \rangle ) }^{-1} \end{aligned} \end{aligned}$$
(8)
$$\begin{aligned} \begin{aligned} \hat{W_2} = ( \sum _{i=1}^{N_p} x_2^i{\langle {z}^i \rangle }^T +&\sum _{k=N_p+1}^{N_2} x_2^k{\langle {z_2}^k \rangle }^T ) { ( \sum _{i=1}^{N_p} \langle {z}^i{{z}^i}^T \rangle + \sum _{k=N_p+1}^{N_2} \langle {z_2}^k{{z_2}^k}^T \rangle ) }^{-1} \end{aligned} \end{aligned}$$
(9)

Finally the noise levels are updated as

$$\begin{aligned} \begin{aligned} \hat{\varPsi _1} = \frac{1}{N_1} \{ (&\sum _{i=1}^{N_p}( x_1^i - \hat{W_1}{\langle {z}^i \rangle } ){( x_1^i - \hat{W_1}{\langle {z}^i \rangle } )}^T \\+&\sum _{j=N_p+1}^{N_1}( x_1^j - \hat{W_1}{\langle {z_1}^j \rangle } ){( x_1^j - \hat{W_1}{\langle {z_1}^j \rangle } )}^T ) \} \end{aligned} \end{aligned}$$
(10)
$$\begin{aligned} \begin{aligned} \hat{\varPsi _2} = \frac{1}{N_2} \{ (&\sum _{i=1}^{N_p}( x_2^i - \hat{W_2}{\langle {z}^i \rangle } ){( x_2^i - \hat{W_2}{\langle {z}^i \rangle } )}^T\\&+ \sum _{k=N_p+1}^{N_2}( x_2^k - \hat{W_2}{\langle {z_2}^k \rangle } ){( x_2^k - \hat{W_2}{\langle {z_2}^k \rangle } )}^T ) \} \end{aligned} \end{aligned}$$
(11)
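
The sketch below implements one EM iteration for view 1, following Eqs. (8) and (10) exactly as written; the view-2 updates of Eqs. (9) and (11) are symmetric. All rows are assumed centred as described above, and the E-step posterior moments coincide with Eqs. (14)-(15) of Sect. 3. This is a minimal sketch, not an optimised implementation.

```python
import numpy as np

def posterior(W, Psi):
    """Posterior gain G = W^T (W W^T + Psi)^{-1} and covariance I - G W of the
    latent variable given a centred observation (cf. Eqs. (14)-(15))."""
    G = np.linalg.solve(W @ W.T + Psi, W).T
    return G, np.eye(W.shape[1]) - G @ W

def em_step(X1p, X2p, X1u, W1, W2, Psi1, Psi2):
    """One EM update of (W1, Psi1); rows of X1p/X2p are paired, X1u unpaired."""
    Np, N1u = len(X1p), len(X1u)
    W = np.vstack([W1, W2])
    Psi = np.block([[Psi1, np.zeros((Psi1.shape[0], Psi2.shape[1]))],
                    [np.zeros((Psi2.shape[0], Psi1.shape[1])), Psi2]])
    # E-step: the posterior covariance is shared by all samples within a group
    Gp, Sp = posterior(W, Psi)            # paired samples (x1, x2)
    G1, S1 = posterior(W1, Psi1)          # unpaired view-1 samples
    Zp = np.hstack([X1p, X2p]) @ Gp.T     # rows are <z^i>
    Z1 = X1u @ G1.T                       # rows are <z_1^j>
    # M-step, Eq. (8): sum <z z^T> = sum <z><z>^T + N * posterior covariance
    Szz = Zp.T @ Zp + Np * Sp + Z1.T @ Z1 + N1u * S1
    Sxz = X1p.T @ Zp + X1u.T @ Z1
    W1_new = Sxz @ np.linalg.inv(Szz)
    # M-step, Eq. (10): residual covariance over all N_1 = N_p + N_1u samples
    Rp, Ru = X1p - Zp @ W1_new.T, X1u - Z1 @ W1_new.T
    Psi1_new = (Rp.T @ Rp + Ru.T @ Ru) / (Np + N1u)
    return W1_new, Psi1_new
```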

2.3 Projections in SemiPCCA Model

Analogous to the PCCA model, the projection of a labelled image \(({x}_1^i,{x}_2^i)\) in SemiPCCA model is directly given by Eq. (3).

Although this result looks similar to that of the PCCA model, the learning of \(W_1\) and \(W_2\) is influenced by the unpaired samples. Unpaired samples reveal the global structure of all the samples in each domain. Note that once a basis in one sample space is rectified, the corresponding basis in the other sample space is also rectified so that the correlations between the two bases are maximized.

3 Annotation of Unlabelled Images

We now present an automatic image annotation method based on the SemiPCCA, which estimates the association between images and words by using the labelled and unlabelled images together.

Let \({X}_1^P=\{({x}_1^i)\}_{i=1}^{N^p}\) and \({X}_2^P=\{({x}_2^i)\}_{i=1}^{N^p}\) be the set of labelled images and their corresponding semantic features, of size \(N_p\) and with \(m_1\) and \(m_2\) dimensions respectively, and let \({X}_1^U=\{({x}_1^j)\}_{j=N^p+1}^{N^1}\) be a set of unlabelled images.

The first step is to extract the image features and label features of the training samples and to generate the essential latent space by fitting the SemiPCCA.

In the context of automatic image annotation, only \({X}_1^U\) exists, whereas \({X}_2^U\) is empty. Hence, for the mapping matrix \(W_2\) and the noise level \(\varPsi _2\), the updates change as follows,

$$\begin{aligned} \begin{aligned} \hat{W_2} = ( \sum _{i=1}^{N_p} x_2^i{\langle {z}^i \rangle }^T ) { ( \sum _{i=1}^{N_p} \langle {z}^i{{z}^i}^T \rangle ) }^{-1} \end{aligned} \end{aligned}$$
(12)
$$\begin{aligned} \begin{aligned} \hat{\varPsi _2} = \frac{1}{N_p} \{ (&\sum _{i=1}^{N_p}( x_2^i - \hat{W_2}{\langle {z}^i \rangle } ){( x_2^i - \hat{W_2}{\langle {z}^i \rangle } )}^T ) \} \end{aligned} \end{aligned}$$
(13)
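
In other words, these are the updates of Eqs. (9) and (11) with the unpaired view-2 sums dropped. Reusing the quantities from the `em_step` sketch of Sect. 2.2 (a hypothetical continuation, with `X2u` empty):

```python
# Eq. (12): W2 from the paired statistics only
W2_new = (X2p.T @ Zp) @ np.linalg.inv(Zp.T @ Zp + Np * Sp)
# Eq. (13): residual covariance over the N_p paired samples
R2 = X2p - Zp @ W2_new.T
Psi2_new = (R2.T @ R2) / Np
```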

Using this model, we derive the posterior probability of a sample in the latent space. When only an image feature \({x}_1\) is given, the posterior probability \(P({z_1} \vert {x}_1)\) of the estimated latent variable \(z_1\) becomes a normal distribution whose mean and variance are,

$$\begin{aligned} \begin{aligned}&\mu _{z_1} = \hat{W_1}^T(\hat{W_1}\hat{W_1}^T+\hat{\varPsi _1})^{-1} \left( {x}_1 - \hat{\mu _1} \right) , \\&\varPsi _{z_1} = I - \hat{W_1}^T(\hat{W_1}\hat{W_1}^T+\hat{\varPsi _1})^{-1}\hat{W_1}, \end{aligned} \end{aligned}$$
(14)

respectively. Also, when both an image feature \({x}_1\) and semantic feature \({x}_2\) are given, the posterior probability \(P({z} \vert {x}_1,{x}_2)\) becomes,

$$\begin{aligned} \begin{aligned}&\mu _{z} = \hat{W}{}^T(\hat{W}\hat{W}{}^T+\hat{\varPsi })^{-1} \left( \left( \begin{array} {c} x_1 \\ x_2 \end{array} \right) - \hat{\mu } \right) , \\&\varPsi _{z} = I - \hat{W}{}^T(\hat{W}\hat{W}{}^T+\hat{\varPsi })^{-1}\hat{W}. \end{aligned} \end{aligned}$$
(15)
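
Reusing the `posterior` helper from the Sect. 2.2 sketch, Eqs. (14) and (15) can be evaluated as follows; `W1_hat`, `W2_hat`, `Psi1_hat`, `Psi2_hat`, `m1`, `m2` are assumed to come from the fitted model, and `x1`, `x2` are assumed already centred (so the \(-\hat{\mu }\) terms are implicit).

```python
import numpy as np

# Eq. (14): latent posterior from the image feature alone
G1, Psi_z1 = posterior(W1_hat, Psi1_hat)     # gain and posterior covariance
mu_z1 = G1 @ x1

# Eq. (15): latent posterior from image and semantic features jointly
W = np.vstack([W1_hat, W2_hat])
Psi = np.block([[Psi1_hat, np.zeros((m1, m2))],
                [np.zeros((m2, m1)), Psi2_hat]])
G, Psi_z = posterior(W, Psi)
mu_z = G @ np.concatenate([x1, x2])
```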

The second step is to map the labelled training images \(\{T_i^{(P)} = ({x}_1^i, {x}_2^i)\}_{i=1}^{N^p}\) and the unlabelled images \(\{Q_j^{(U)} = {x}_1^j\}_{j=N^p+1}^{N^1}\) to the latent space with the posterior probabilities \(P({z} \vert {x}_1,{x}_2)\) and \(P({z} \vert {x}_1)\) respectively, and K-L distance is used for measuring the similarity between two images.

We define the similarity between two samples as follows. When two labelled images \(T_i^{(P)} = ({x}_1^i, {x}_2^i)\) and \(T_j^{(P)} = ({x}_1^j, {x}_2^j)\) are available, their similarity is defined as,

$$\begin{aligned} \begin{aligned} D\left( T_i^{(P)}, T_j^{(P)} \right) = {\left( \mu _z^i - \mu _z^j \right) }^T {\varPsi _z}^{-1} \left( \mu _z^i - \mu _z^j \right) , \end{aligned} \end{aligned}$$
(16)

which measures essential similarities in terms of both appearance and semantics.

Furthermore, when the label feature of one of the two samples is not available, e.g. for a labelled image \(T_i^{(P)} = ({x}_1^i, {x}_2^i)\) and an unlabelled image \(Q_j^{(U)} = {x}_1^j\), which is the usual case in automatic image annotation, our framework still enables measuring similarities with semantic aspects even in the absence of label features, and their similarity becomes:

$$\begin{aligned} \begin{aligned} D\left( T_i^{(P)}, Q_j^{(U)} \right) = {\left( \mu _z^i - \mu _{z_1}^j \right) }^T \left( \frac{{\varPsi _z}^{-1} + {\varPsi _{z_1}}^{-1}}{2}\right) \left( \mu _z^i - \mu _{z_1}^j \right) \end{aligned} \end{aligned}$$
(17)
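
Both similarity measures translate directly into code; a minimal sketch assuming the posterior means and covariances computed above:

```python
import numpy as np

def dist_paired(mu_i, mu_j, Psi_z):
    """Eq. (16): distance between two labelled images in the latent space."""
    diff = mu_i - mu_j
    return diff @ np.linalg.solve(Psi_z, diff)

def dist_mixed(mu_i, Psi_z, mu_j, Psi_z1):
    """Eq. (17): labelled vs. unlabelled image, averaging the two inverse
    posterior covariances."""
    diff = mu_i - mu_j
    A = (np.linalg.inv(Psi_z) + np.linalg.inv(Psi_z1)) / 2.0
    return diff @ A @ diff
```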

As described above, we can now formalize a new image annotation method. Let \(x_{new}\) denote a newly input image. To annotate \(x_{new}\) with some words, we calculate the posterior probability of a word w given \({x}_{new}\), which is represented as

$$\begin{aligned} \begin{aligned} P\left( {w} \vert {x}_{new} \right) =\sum _{i=1}^{N_p} P\left( {w} \vert T_i^{(P)} \right) P\left( T_i^{(P)} \vert {x}_{new} \right) . \end{aligned} \end{aligned}$$
(18)

The posterior probability \(P\left( T_i^{(P)} \vert x_{new}\right) \) of each labelled image \(T_i^{(P)}\) under the above similarity measurement is defined as follows,

$$\begin{aligned} \begin{aligned} P\left( T_i^{(P)} \vert x_{new} \right) = \frac{ exp\left( - D\left( T_i^{(P)}, x_{new} \right) \right) }{ \sum _{i=1}^{N_p}{exp\left( - D\left( T_i^{(P)}, x_{new} \right) \right) } }, \end{aligned} \end{aligned}$$
(19)

where the denominator is a normalization term so that \(\sum _{i=1}^{N_p}P\left( T_i^{(P)} \vert x_{new}\right) = 1\). \(P\left( {w} \vert T_i^{(P)} \right) \) corresponds to the sample-to-label model, which is defined as

$$\begin{aligned} \begin{aligned} P\left( {w} \vert T_i^{(P)} \right) = \mu \delta _{w,T_i^{(P)}} + (1-\mu )\frac{N_w}{NW}, \end{aligned} \end{aligned}$$
(20)

where \(N_w\) is the number of images that contain w in the training data set, \(\delta _{w,T_i^{(P)}} = 1\) if the word w is annotated in the training sample \(T_i^{(P)}\) and \(\delta _{w,T_i^{(P)}} = 0\) otherwise, \(\mu \) is a smoothing parameter between zero and one, and NW is the number of words.

The words are sorted in descending order of the posterior probability \(P\left( {w} \vert {x}_{new} \right) \). The highest ranked words are used to annotate the image \({x}_{new}\).
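
Putting Eqs. (18)-(20) together, the whole annotation step fits in a short sketch. Here `delta` (the binary word-occurrence matrix of the training set), `N_w` (the per-word image counts) and `lam` (the smoothing weight written as \(\mu \) in Eq. (20)) are illustrative names, NW is read as the vocabulary size as stated after Eq. (20), and `dist_mixed` is the Eq. (17) sketch above.

```python
import numpy as np

def annotate(mu_new, Psi_new, train_mus, Psi_z, delta, N_w, lam, top=5):
    """Return the indices of the `top` words ranked by Eq. (18)."""
    NW = delta.shape[1]
    # Eq. (19): normalized exp(-distance) over the labelled training images
    d = np.array([dist_mixed(mu_i, Psi_z, mu_new, Psi_new) for mu_i in train_mus])
    w = np.exp(-(d - d.min()))              # shift for numerical stability
    p_img = w / w.sum()
    # Eq. (20): sample-to-label model with smoothing
    p_word = lam * delta + (1 - lam) * (N_w / NW)
    # Eq. (18): mix the per-image word models and rank
    scores = p_img @ p_word
    return np.argsort(-scores)[:top]
```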

4 Experiments

This section describes the results for the automatic image annotation task.

We use Corel5K and Corel30K to evaluate the performance of the proposed method. Corel5K contains 5,000 image-label pairs, and each image is manually annotated with one to five words. The training data contains 371 words, 260 of which appear in the test data.

The Corel30K dataset is an extension of the Corel5K dataset based on a substantially larger database, which tries to correct some of the limitations of Corel5K, such as the small number of examples and the small size of the vocabulary. Corel30K contains 31,695 images and 5,587 words.

We follow the methodology of previous works: 500 images from Corel5K are used as the test data, and training sets of 1500, 2250 and 4500 images respectively are selected from Corel5K, along with the remaining training images in Corel5K and the 31,695 images in Corel30K, which act as the unlabelled images for estimating the parameters of the SemiPCCA.

4.1 Feature Representation

As the image feature, we use the color higher-order local auto-correlation (Color-HLAC) features, a powerful global image feature for color images. Generally, global image features are suitable for realizing scalable systems because they can be extracted quite fast, and they are well suited for unconstrained image-level annotation.

The Color-HLAC features enumerate all combinations of mask patterns that define autocorrelations of neighboring points, capturing both color and texture information simultaneously. In this paper we use at most the 2nd-order correlations, whose dimension is 714; this 2nd-order Color-HLAC feature is reduced by PCA to 80 dimensions.

We extract Color-HLAC features at two scales (1/1 and 1/2 size) to obtain robustness against scale change, and we extract them both from the normal images and from edge images obtained with the Sobel filter. In all, the final image features have 320 dimensions.

As for the label feature, we use the word histogram. In this work, each image is annotated with only a few words, so the word histogram reduces to a binary feature.
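
For illustration, a minimal sketch of this binary label feature; `annotations` (a list of word lists, one per image) and `vocab` (the fixed word list) are hypothetical names:

```python
import numpy as np

def word_histogram(annotations, vocab):
    """Binary label features: entry (i, j) is 1 iff vocab[j] annotates image i."""
    index = {w: j for j, w in enumerate(vocab)}
    H = np.zeros((len(annotations), len(vocab)))
    for i, words in enumerate(annotations):
        for w in words:
            H[i, index[w]] = 1.0
    return H
```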

4.2 Evaluation and Results

In this section, the performance of our model (SemiPCCA) is compared with several other models. Image annotation performance is evaluated by comparing the captions automatically generated for the test set with the human-produced ground truth. Following the methodology of previous works, we define the automatic annotation as the five semantic words with the largest posterior probability and compute the recall and precision of every word in the test set. For a given semantic word, recall = B/C and precision = B/A, where A is the number of images automatically annotated with the word, B is the number of images correctly annotated with the word, and C is the number of images having the word in the ground-truth annotation. The average word precision and word recall values summarize the system performance.
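
This protocol can be summarized in a short sketch; `pred` and `truth` are assumed to be binary image-by-word matrices for the test set (e.g. built with the `word_histogram` sketch above), and words that are never predicted are assigned zero precision:

```python
import numpy as np

def mean_per_word_pr(pred, truth):
    """Mean per-word precision (B/A) and recall (B/C) over the test set."""
    A = pred.sum(axis=0)                  # images auto-annotated with the word
    B = (pred * truth).sum(axis=0)        # images correctly annotated
    C = truth.sum(axis=0)                 # ground-truth occurrences
    precision = np.divide(B, A, out=np.zeros(len(A)), where=A > 0)
    recall = np.divide(B, C, out=np.zeros(len(C)), where=C > 0)
    return precision.mean(), recall.mean()
```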

Table 1. Performance comparison of different automatic image annotation models on Corel5k dataset.

Table 1 shows the results obtained by the proposed method and various previously proposed methods: CRM [9], MBRM [5], PLSA-WORDS [12] and GM-PLSA [11], using Corel5K. In order to compare with these previous models, we divide the dataset into two parts: a training set of 4,500 images and a test set of 500 images. We report the results on two sets of words: the subset of the 49 best words and the complete set of all 260 words that occur in the training set. From the table, we can see that our model performs significantly better than all the other models. We believe that the reason is that the SemiPCCA models the visual and textual data using the labelled and unlabelled images together.

5 Conclusions

This paper presents an automatic image annotation method based on the SemiPCCA. By estimating the association between images and words using the labelled and unlabelled images together, the method is shown to be more accurate than previously published methods. Experiments on the Corel dataset show that our approach is promising for semantic image annotation. In comparison with several state-of-the-art annotation models, our approach achieves higher accuracy and superior effectiveness.