1 Introduction

Twitter is a valuable source of feedback for companies that want to probe the public perception of their brands. Whereas sentiment analysis has been extensively applied to social media messages (see [17] among many), other dimensions of brand perception have received less attention [13], especially those related to marketing. In particular, marketing specialists are highly interested in: (a) knowing the position of a tweet author in the purchase funnel (that is, which stage of the customer journey the author is in); (b) knowing which element or elements of the marketing mix (Footnote 1) the text refers to; and (c) knowing the author’s affective situation with respect to a brand in the tweet.

This paper presents the MAS Corpus, a Spanish corpus of tweets of interest for marketing specialists, in which messages are labeled in the three aforementioned dimensions. The corpus is freely available at http://mascorpus.linkeddata.es/ and has been developed in the context of the Spanish research project LPS BIGGER (Footnote 2), which analyzed different dimensions of tweets in order to extract relevant information for marketing purposes. A first version of the corpus containing only the sentiment analysis annotations was released as the Corpus for Sentiment Analysis towards Brands (SAB) and was described in [16]. Following this work, we have expanded the corpus by tagging the messages in the two remaining dimensions described above: the purchase funnel and the marketing mix. Tweets that were almost identical to others have been removed. The categories of each of the three aspects tagged in the corpus (Sentiment Analysis, Marketing Mix and Purchase Funnel) can be found in Fig. 1.

Fig. 1. The three different aspects of an opinion of interest for marketing: Purchase Funnel (when the opinion is expressed with regard to the purchase journey), Marketing Mix (what is evaluated in the opinion) and Sentiment Analysis (emotions expressed with regard to the brand).

2 Related Work

2.1 Sentiment Analysis

Although Sentiment Analysis is a major field in Natural Language Processing, most work in Spanish focuses on polarity [6, 11], and efforts towards emotions are scarce [23]. The sources of existing corpora also differ from our aims, since they tend to use specific websites [15, 18] and different social networks [6, 11, 23], and are limited to domains such as tourism [15] and medical opinions [18]. The scarcity of publicly available Twitter corpora is mainly due to two related facts. First, Twitter’s policy on text dissemination allows only the IDs of the tweets to be provided, not the text or the username. Second, Twitter periodically deletes tweets from its servers. As a result, even when the IDs are available, the texts in a corpus can become impossible to retrieve, and the corpus is then no longer available as a test resource. An extended review of works in Spanish Sentiment Analysis with regard to our needs can be found in [16].

2.2 Purchase Funnel

Although different purchase funnel interpretations have been suggested in the literature [4, 7], we have based our approach on the one defined in the LPS BIGGER project and already used in [26]. This purchase funnel consists of four different stages (Awareness, Evaluation, Purchase and Postpurchase), which reflect how the client gets to know the product, investigates or compares it to other options, acquires it, and actually uses and reviews it, respectively.

To the best of our knowledge, there are no public Spanish corpora containing purchase funnel annotations, since the only work in Spanish on this topic that the authors are aware of did not release the dataset used [26]. Nevertheless, the concept of Purchase Intention has been widely covered in the literature, especially for marketing purposes in English. Unlike Sentiment Analysis, Purchase Intention tries to detect whether the client intends to buy a product, rather than whether they like it or not [27]. Starting with the WISH corpus [9], which covers wishes in several domains and sources (including product reviews), most works aim to discriminate between different kinds of user intentions: in [22], the analysis focuses on suggestions and wishes for products and services, both in a private dataset and in part of the previously mentioned WISH corpus; an analysis of different intentions in tweets can also be found in [14].

Finally, the categories most similar to the ones in our purchase funnel interpretation are those in [5], where the authors differentiate between several kinds of intention, some of which (such as wish, compare or complain) are easily mappable to our purchase funnel stages. The corpus used in [10], which classifies reviews into pre-purchase and post-purchase, also shares our “timeline” interpretation of the purchase funnel. Outside the marketing domain, corpora labeled with purchase funnel tags for a specific domain have also been published, e.g., for London musicals and recreational events [8].

2.3 Marketing Mix

Although the original concept of marketing mix [3] contained twelve elements for manufacturers, the most widespread categorization for marketing is the one proposed in [12], consisting of four aspects (price, product, promotion, place) often known as “the four Ps” (or 4Ps) and revisited several times in the literature [25]. Nevertheless, while the marketing mix is a well-known and widespread concept in the marketing field, in NLP the task of identifying these facets is often simply referred to as detecting or recognizing “aspects”, with some exceptions in the literature [2]. This task has often been tackled in English [19, 21], while for Spanish only a few datasets contain information about aspects, such as those in [6, 20].

3 Tagging Criteria

The corpus consists of more than 3k tweets about brands (see Table 1) from different sectors, namely Food, Automotive, Banking, Beverages, Sports, Retail and Telecom. When several brands appear in one tweet, just one of them (the marked one) is considered in the tagging process; the same tweet can therefore appear several times in the corpus, each time considering a different brand. Every tweet is tagged in the dimensions shown in Fig. 1; more than one tag is possible in the sentiment and marketing mix dimensions (except for simultaneously tagging pairs of directly opposed emotions), while the purchase funnel, as it represents a path along the purchase journey, takes only one tag per tweet. We describe each dimension below, along with a brief report on the criteria used for tagging each category (the complete criteria document, along with statistics on the corpus, can be downloaded with it).
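
The tagging constraints just described can be made concrete with a small validation sketch. It is only an illustration and not part of the corpus tooling: the class and function names are hypothetical, the tag spellings are our own, and the list of opposed emotion pairs is an assumption (the criteria only state explicitly that fear is the opposite of trust).

```python
from dataclasses import dataclass, field

# Assumed pairs of directly opposed emotions. Only fear/trust is stated
# explicitly in the criteria; the remaining pairs are a guess.
OPPOSED_EMOTIONS = [
    ("hate", "love"),
    ("sadness", "happiness"),
    ("fear", "trust"),
    ("dissatisfaction", "satisfaction"),
]
EMOTIONS = {emotion for pair in OPPOSED_EMOTIONS for emotion in pair}
FUNNEL_TAGS = {"awareness", "evaluation", "purchase",
               "postpurchase", "ambiguous", "NC2"}
MIX_TAGS = {"product", "price", "promotion", "place"}


@dataclass
class Annotation:
    """One (tweet, brand) pair; the same tweet may occur once per brand."""
    tweet_id: str
    brand: str
    funnel: str                                  # exactly one tag per tweet
    emotions: set = field(default_factory=set)   # empty set stands for NC2
    mix: set = field(default_factory=set)        # empty set stands for NC2


def validation_errors(annotation):
    """Return a list of human-readable constraint violations."""
    errors = []
    if annotation.funnel not in FUNNEL_TAGS:
        errors.append(f"unknown purchase funnel tag: {annotation.funnel}")
    for tag in sorted(annotation.emotions - EMOTIONS):
        errors.append(f"unknown emotion: {tag}")
    for tag in sorted(annotation.mix - MIX_TAGS):
        errors.append(f"unknown marketing mix tag: {tag}")
    for first, second in OPPOSED_EMOTIONS:
        if first in annotation.emotions and second in annotation.emotions:
            errors.append(f"opposed emotions: {first}/{second}")
    return errors
```

Modeling NC2 as an empty set for the two multi-label dimensions is a design choice of this sketch: it makes it impossible to combine NC2 with another tag by construction.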

Table 1. Sectors and Brands in the corpus.

3.1 Sentiment Analysis

A tweet can be tagged with one or several emotions, as long as it does not contain directly opposite emotions, or with an NC2 label, meaning that it contains no emotions. Each basic emotion also embraces secondary emotions (described in Table 2, and also used by Aguado [1]), and a combination of them can express more complex feelings often seen in customers, as shown in the following examples:

  • When a customer is unable to find a desired product, the post is tagged as sadness (for the unavailability) and satisfaction (because it reveals previous satisfaction with the brand, which justifies the effort of continuing to look for that exact product instead of switching to one from another brand).

  • When a post shows that a purchase is recurrent, it is tagged as trust, referring to the loyalty of the client.

  • Emoticons of love are tagged as love and musical ones as happiness (unless they are used ironically). Love typically implies happiness.

  • Happiness can only be tagged for an already acquired product or service, not for future purchases. The wish for a brand (“I would like to have X”) is tagged as trust and satisfaction, and possibly love (depending on the degree of the desire).

  • A deep dissatisfaction is hate, a deep satisfaction is love.

  • Fear is understood as the opposite of trust.

  • A message of the form “I am satisfied with X, but…” is tagged as satisfaction and trust, plus whichever negative sentiment follows.

  • When the emotion is not obvious, the message must be marked as NC2 (sometimes the meaning is not evident from the message alone). Messages that are evident advertising by the brand community managers, automatic messages or messages related to brand-events are also tagged as NC2.

  • The emotion must be sparked by the brand/product, and it is not necessarily the same as the one expressed in the message. If the author of a message says that he is sad and is going to have a Mahou (a Spanish beer brand), the tag should not be sadness, but a positive emotion.

3.2 Purchase Funnel

Each tweet can belong to a stage in the purchase funnel, be ambiguous or be related to a brand without the author being involved in the purchase (such as is the case of posts of the brand itself). Different phases and concrete examples are tagged in the corpus as follows:

Awareness. The first contact of the client with the brand (either showing a willingness to buy or not), usually expressed in first person and mentioning advertising, videos, publicity campaigns, etc. Some examples of awareness would be:

  (1) I just loved last Movistar ad.

  (2) I like the videos in Nike’s YouTube channel.

Evaluation. The post implies some research on the brand (such as questions or requests for confirmation) or a comparison with others (by showing preferences among them, for instance), and some interest in acquiring a product or service. Examples of evaluation would be the following:

  (3) I prefer Citroen to more expensive brands, such as Mercedes or BMW.

  (4) Looking for a second-hand Kia Sorento in NY, please send me a DM.

Purchase. There is a direct reference to the moment of a purchase or to a clear intention of purchase (usually in first person). Some examples:

  (5) I’ve finally decided to switch to Movistar.

  (6) Buying my brand new blue Citroen right now!

Table 2. Main emotions and their secondary emotions.

Postpurchase. Texts referring to a past purchase or to a current experience, implying ownership of a product. This class is especially complex, since the interpretation of the same linguistic patterns changes depending on the kind of product, as already discussed in [26] and exemplified in the sentences below:

  (7) I like Heineken, the taste is so good.
      I would love a Heineken!

  (8) I like BMWs, they are so classy!
      I would love a BMW!

In (7), the client has likely tasted that beer brand before; people do not tend to like or want beverages they have no experience with (at least not without mentioning it, as in “I want to taste the new Heineken.”). The same cannot be inferred for more expensive items, even when expressed the same way, as happens in (8): someone can like a car (such as its appearance or its engine) without having used it or intending to. This is why our criteria state that this kind of expression must be tagged as Postpurchase for some brands (depending on the sector) and as Ambiguous for others, since there can be several possible and equally likely interpretations. This kind of expression has therefore been tagged:

  • As ambiguous for more expensive products of which people tend to give opinion without acquiring them. In our case, this applies just to the AUTOMOTIVE sector.

  • As postpurchase for products that are easier to acquire, and are likely to have been tried before talking about them. This includes the BEVERAGES, BANKING, FOOD, RETAIL, TELECOM and SPORTS sectors.
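
The sector-dependent rule in the two items above amounts to a simple lookup. The following sketch illustrates it; the function name is hypothetical, and deciding whether a tweet actually contains a liking or wishing expression such as those in (7) and (8) is assumed to have been done beforehand by a human tagger.

```python
# Sectors where liking or wishing for a product does not imply ownership.
AMBIGUOUS_SECTORS = {"AUTOMOTIVE"}
# Sectors where the product has most likely been tried before.
POSTPURCHASE_SECTORS = {"BEVERAGES", "BANKING", "FOOD",
                        "RETAIL", "TELECOM", "SPORTS"}


def funnel_tag_for_liking(sector):
    """Purchase funnel tag for expressions such as 'I like X' or 'I would love an X'."""
    key = sector.strip().upper()
    if key in AMBIGUOUS_SECTORS:
        return "Ambiguous"
    if key in POSTPURCHASE_SECTORS:
        return "Postpurchase"
    raise ValueError(f"sector not covered by the criteria: {sector!r}")
```

For instance, `funnel_tag_for_liking("Automotive")` yields `"Ambiguous"`, whereas `funnel_tag_for_liking("Beverages")` yields `"Postpurchase"`.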

Ambiguous. This category includes critical posts, suggestions and recommendations, along with posts where it is not clear in which stage the customer is (such as the case mentioned above).

  (9) Do not buy Milka!

  (10) Loving the new Kia!

NC2. Includes impersonal messages without opinions (such as corporate news or responses of the brand to clients), questions implying no personal evaluation or intention (for instance, involving a third person), texts with sale or rental offers with no mention of real use experience, etc.

  (11) 2008 Hyundai for sale.

  (12) My aunt didn’t like the Kia.

3.3 Marketing Mix

We have added an NC2 class to McCarthy’s four original Ps to indicate that none of the four aspects is addressed in the tweet. Note that, unlike the purchase funnel, several marketing mix tags can appear in the same tweet (except for NC2). A brief explanation of each of the categories tagged for marketing mix, along with examples and part of the criteria, is given below:

Product. This category encompasses texts related to the features of the product (such as its quality, performance or taste), along with references to design (such as size, colors or packaging) or warranty, such as in the following examples:

  (13) I find the new iPhone too big for my pocket.

  (14) I love the new mix Milka Oreo!

Note that when someone loves/likes something (such as food), we assume it refers to some feature of a product (such as its taste), so we tag it as Product.

Promotion. Texts referring to all the promotions and programs of the brand aimed at increasing sales and ensuring visibility for its products or for the brand itself, such as advertisements, sponsorships (such as prizes, sports teams or events), special offers, job offers, promotional articles, etc.

  (15) Freaking out with the new 2x1 @Ikea!

  (16) La Liga BBVA is the best league in the world.

Price. Includes economic aspects of a product, such as references to its value or promotions involving discounts or price drops (which must also be tagged as Promotion). Examples of texts that should be tagged as Price would be the following:

  (17) I’m afraid that I can’t afford the new Toyota.

  (18) Yesterday I saw the same Adidas for just 40€!

Place. Aspects related to commercialization, such as the physical places where products are distributed (for instance, if a product is difficult to find) and customer service (in every stage of the purchase: information, at the point of sale, postpurchase, technical support, etc.).

  (19) I love the new Milka McFlurry at McDonalds.

  (20) Already three malls and unable to find the new Nike Pegasus!

NC2. Impersonal messages of the brand, news or texts that include none of the aspects mentioned before.

  (21) Nike is paying no tax!

  (22) I can’t decide between Puleva and Pascual.

4 The MAS Corpus

4.1 Building the Corpus

A different approach was used for the Marketing Mix and Purchase Funnel tagging with respect to the Sentiment Analysis tagging procedure described in [16] (where three taggers acted independently with just a common criteria document). This responds to the need to streamline the whole tagging process, which is both difficult and time-consuming for taggers. The new procedure is briefly described below:

  1. A first version of the criteria document was written, based on the study of the literature and previous experience within the LPS BIGGER project.

  2. Tagger 1 then tagged a representative part of the corpus (about 800 tweets), highlighting the main doubts and dubious tweets with regard to the criteria, which were then revised: new tagging examples were added, and some nuances and special cases were rewritten.

  3. Taggers 2 and 3 revised the tags by Tagger 1, paying special attention to tweets marked as dubious: if an agreement was reached, the tagging was updated accordingly; otherwise, the tweet was tagged as Ambiguous or NC2.

  4. Each tagger then took a part of the corpus and tagged it following the new criteria, highlighting doubts again; these tweets were revised with the remaining taggers, reaching an agreement on the final unique tags in the corpus.
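
The resolution rule applied to dubious tweets in step 3 can be sketched as follows. This is a simplified illustration rather than the procedure actually followed, which relied on discussion among taggers: here, agreement is modeled as all taggers proposing the same tag, and the choice of fallback per dimension (Ambiguous for the purchase funnel, NC2 for the marketing mix) is our assumption. The function name and the dimension keys are hypothetical.

```python
# Assumed fallback tag per dimension when taggers cannot agree.
FALLBACK = {
    "purchase_funnel": "Ambiguous",
    "marketing_mix": "NC2",
}


def resolve_tag(dimension, proposals):
    """Return the agreed tag, or the fallback tag of the dimension.

    `proposals` holds the tag each tagger supports after reviewing
    a tweet that was marked as dubious.
    """
    if not proposals:
        raise ValueError("at least one proposal is required")
    if len(set(proposals)) == 1:
        return proposals[0]          # agreement reached
    return FALLBACK[dimension]       # no agreement
```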

Table 3. Total and average (per tweet) statistics on the corpus. Stanford CoreNLP was used for POS information, while patterns were used for detecting hashtags (‘#’), mentions (‘@’) and URLs (‘www.*’/‘http*’).
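
The pattern-based detection mentioned in the caption of Table 3 could look like the sketch below. The regular expressions are our reading of the patterns named there (‘#’, ‘@’, ‘www.*’/‘http*’); the exact expressions used to compute the table are not given, so these are assumptions, as is the function name.

```python
import re

HASHTAG = re.compile(r"#\w+")                     # '#' followed by word characters
MENTION = re.compile(r"@\w+")                     # '@' followed by word characters
URL = re.compile(r"\b(?:http\S*|www\.\S*)")       # 'http*' or 'www.*'


def social_counts(text):
    """Count hashtags, user mentions and URLs in a single tweet."""
    return {
        "hashtags": len(HASHTAG.findall(text)),
        "mentions": len(MENTION.findall(text)),
        "urls": len(URL.findall(text)),
    }
```

Per-tweet averages such as those reported in the table would then follow from summing these counts over the corpus and dividing by the number of tweets.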

4.2 Publishing the Corpus as Linked Data

We maintain the RDF representation used in the previous version of the corpus, again using our own vocabulary (Footnote 3) to express the purchase funnel and the marketing mix. We also reuse Marl [28] and Onyx [24] for emotions and polarity, and SIOC (Footnote 4) and GoodRelations (Footnote 5) for post and brand representation. Links to the entries of brands and companies in external databases such as Thomson Reuters’ PermID (Footnote 6) and DBpedia (Footnote 7) extend the information in the tweets. Figure 2 shows an example of a tweet from the corpus tagged in these dimensions, while Fig. 3 shows information on the company. An overview of this example is depicted in Fig. 4.

Table 4. Statistics on the corpus for the Marketing Mix dimension. Each column shows the percentage of NC2 (none of the others) and of each of the four Ps described in Sect. 3, namely Product (PROD), Price (PRI), Promotion (PROM) and Place (PLA).

Table 5. Statistics on the corpus for the Purchase Funnel categories. Each column shows the percentage of each of the tags described in Sect. 3, namely NC2 (none of the others), Evaluation (EVA), Awareness (AWA), Purchase (PUR), Postpurchase (POS) and Ambiguous (AMB).

Table 6. Statistics on the emotions in the corpus. Column ANY shows the percentage of posts with any emotion (that is, non-neutral posts); the remaining columns show the percentage of each category among these non-neutral posts: Hate (HAT), Sadness (SAD), Fear (FEA), Dissatisfaction (DIS), Satisfaction (SAT), Trust (TRU), Happiness (HAP) and Love (LOV).

Fig. 2. Sample tagged post “Nike’s 2002–2004 T-shirts and Adidas’ 2006–2008 are the love of my life”, with information such as the tweet ID (sioc:id), the text (sioc:content, if available), emotion and polarity expressed towards the brand (marl and onyx), and Purchase Funnel and Marketing Mix tags (sabd).

Fig. 3. Extra information on a brand (Nike) and its company (Nike Inc).

Fig. 4. Example of a tweet using our vocabulary and reusing ontologies (Marl, Onyx, GoodRelations and SIOC) and linking to external resources (DBpedia and PermID), as described in Sect. 4.2.

4.3 Corpus Description

The final corpus contains 3,763 tweets. Statistics on linguistic information in the corpus can be found in Table 3, along with data specific to social media, such as the number of hashtags, user mentions and URLs. The distribution of categories varies depending on the sector. Mentions of Place are, for instance, more common in Sports than in other sectors, such as Beverages or Telecom, as shown in Table 4. The stage at which opinions are expressed also differs: tweets in the Food sector tend to refer to the Postpurchase phase, while those in other sectors tend to be more ambiguous or to refer to earlier phases, as shown in Table 5. Regarding emotions, some of them appear only in certain domains, such as Fear in Banking, as shown in Table 6.
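
Per-sector distributions of this kind can be derived from the annotations with a few lines of code. The sketch below is a hedged illustration: the function name and the input layout (one `(sector, tags)` pair per tweet) are hypothetical, and the percentages are computed over all tweets of a sector, which is how we read Tables 4 and 5 (Table 6 instead normalizes over non-neutral posts only).

```python
from collections import Counter, defaultdict


def tag_percentages(rows):
    """Percentage of tweets per sector that carry each tag.

    `rows` is an iterable of (sector, tags) pairs, one per tweet.
    Because a tweet may carry several tags in a multi-label dimension,
    the percentages of a sector need not add up to 100.
    """
    totals = Counter()
    counts = defaultdict(Counter)
    for sector, tags in rows:
        totals[sector] += 1
        for tag in set(tags):        # count each tag once per tweet
            counts[sector][tag] += 1
    return {
        sector: {tag: 100.0 * n / totals[sector]
                 for tag, n in counts[sector].items()}
        for sector in totals
    }
```

With toy input (not corpus data) such as two Sports tweets tagged `["place"]` and `["product", "place"]`, the function reports Place for 100% and Product for 50% of the Sports tweets.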

5 Conclusions

Whereas the SAB corpus provided a collection of tweets tagged with labels useful for performing Sentiment Analysis towards brands, this new corpus is of interest for marketing analysis in a broader sense. Using categories specifically designed for marketing, such as the four Ps, the MAS Corpus gives marketing professionals additional information on habits and behaviors, on strong and weak points of the whole purchase experience, and also full insights into concrete aspects of each client’s reviews.