
1 Introduction

In this study, we analyze the characteristics of highly sensitive consumers whose reports diffuse widely in the food market. The motivation for this analysis lies in innovation theory, which was first proposed by Rogers in 1962 and has since been applied successfully in a wide range of fields, including communication, agriculture, public health, criminal justice, and marketing. In Japan, Morio (2012) researched “the appropriateness of companies in introducing new kinds of unique vegetables”; that study assumed long sales periods, whereas we target foods, whose sales periods are short. According to innovation theory, if an innovation can be disseminated to Innovators and Early Adopters, it subsequently spreads to the Early Majority and Late Majority (Fig. 1), and the market share rises significantly (Fig. 2).

Fig. 1. Diffusion process

Fig. 2. Category classification

The characteristics of Innovators and Early Adopters are that they are interested in product innovation, are sensitive to trends, constantly collect new information themselves, and have a great influence on other consumers. In this study, these two layers together are called the “High Sensitivity Layer”. Following innovation theory, we expect that products the High Sensitivity Layer keeps its eyes on and sympathizes with will diffuse to the Early and Late Majority, expanding market share. That is, reports that collect a lot of empathy within the High Sensitivity Layer are the ones that diffuse among Innovators and Early Adopters, so we seek to grasp the characteristics of the highly sensitive consumers who gather such empathy. Observing how these consumers respond to new food products may fulfill the purpose of test marketing. In this study, we report the characteristics of these highly sensitive consumers.

2 Data Summary

In this study, we used the “Minrepo Data” provided by INTAGE Inc. and distributed by the IDR Dataset Service of the National Institute of Informatics. We analyzed reports posted on “Minrepo”, a lifestyle-posting and enterprise co-creation SNS (Social Networking Service) application. Among the data included in each report, the variables shown in Table 1 were mainly used. We targeted the 6,367 reports with genre “Foods report” and type “Bought or Received”, covering the period from January 1, 2016 to June 30, 2016. The Minrepo official website (2018) states that it gathers “high quality word of mouth contributions, which express products more attractively, with high sensitivity consumers, from photos and comments”, and calls these contributions “real life actuality data”. That is, Minrepo is an SNS where realistic daily-life data of highly sensitive consumers gather, and the “reports” posted by Minrepo users receive responses from similarly sensitive consumers.

Table 1. Mainly used variables

Posts on this SNS resemble tweets on Twitter: a report is posted when there is something the submitter is interested in or wishes to share. The differences from Twitter are that there is no limit on the number of characters and that an emotion can be attached at posting time. Figure 3 shows the number of characters per report. Even though there is no character limit, about 50% of users post 100 characters or fewer; of the remainder, about 45% post within 300 characters, so as a whole about 95% of reports are within 300 characters. Figure 4 shows the gender of each report’s author: among posts about food, women outnumber men by about 20 percentage points. Figure 5 shows the age distribution of the posted reports. Whereas SNSs such as Twitter and Instagram have many young users, on Minrepo users aged 35 to 50 account for about 75% and users aged 15 to 30 for only about 15%, so there are few young people. Figure 6 shows the proportion of emotions attached at posting. Reports about food posted like a soliloquy, without any emotion, account for about a quarter. Among the remaining reports, positive emotions account for about 70% of the total, whereas negative emotions account for only 3%; users very often post a report when something pleasant happens.

Fig. 3. Number of characters

Fig. 4. Gender proportion

Fig. 5. Age distribution

Fig. 6. Feelings proportion

The data used in this study included image files; we created a variable (Number of images) by simply counting the number of attached images. Figure 7 shows the distribution of the number of attached images over all reports. There are almost no reports with no attached image; reports with exactly one image account for about 55%, and reports with two or three images account for about 22% each, showing that many people attach multiple images.

Fig. 7. Proportion of the number of attached images

On the popularity tab of Minrepo, reports posted within the last 3 days are sorted and introduced in descending order of “favorites”, and the reports displayed there generally have more than 10 favorites. Figure 8 shows the distribution of the number of favorites over all reports. Note that the number of favorites in Fig. 8 is the total accumulated since posting; the magnitude of that number itself has no meaning. What matters is whether the number of favorites exceeds 10 within 3 days. In this study, we define a report with 10 or more favorites as an “empathy report” and a report with fewer than 10 favorites as a “non-empathy report”.

Fig. 8. Distribution of the number of favorites
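As a concrete illustration of this labeling step, the following is a minimal sketch in Python, assuming the reports are available as a pandas DataFrame; the file name and the “favorites” column name are hypothetical, since the actual Minrepo schema is not reproduced here.

```python
import pandas as pd

# Minimal labeling sketch. The file name and the "favorites" column are
# hypothetical; the actual Minrepo schema is not shown in this paper.
reports = pd.read_csv("minrepo_food_reports.csv")

# A report with 10 or more favorites is an "empathy report" (1),
# otherwise a "non-empathy report" (0).
reports["empathy"] = (reports["favorites"] >= 10).astype(int)
print(reports["empathy"].value_counts())
```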

3 Build a Deep Learning Model

3.1 Outline of Deep Learning Model

For the purpose of creating a generalized model, we build a deep learning model using the data described in Sect. 2. Deep learning is “a type of parametric function approximation”; what distinguishes it from other approaches is the property that “arbitrary functions can be approximated with arbitrary precision by using a large parameter set”, a property not found in linear models. Learning is conducted with the “empathy report” and “non-empathy report” labels defined in Sect. 2 as the objective variable, so the model built in this study is a supervised binary classification model. Given the data we handle, we adopt a network structure called the multi-layer perceptron (Fig. 9); the perceptron was originally developed by Rosenblatt (1958). As explanatory variables, we select “number of images”, “emotion”, “gender”, and “age”.

Fig. 9. Architecture of multi-layer perceptron

The starting point is logistic regression, which predicts the probability (1) that an input belongs to each of the binary categories.

$$ P\left( \hat{y}^{(i)} = \{0,1\} \mid x^{(i)} \right) = \mathrm{softmax}\left( W x^{(i)} + b \right) $$
(1)

Here, \( x^{(i)} \) is the input, the parenthesized superscript is the data index, and \( \hat{y}^{(i)} \) is the output. The coupling weight \( W \) is a \( (C, D) \) matrix and the bias term \( b \) is a \( C \)-dimensional vector (\( C \): number of categories, \( D \): dimension of each data point). The accuracy of the model is improved by adjusting the parameters \( \{W, b\} \) so that the predictions come close to the supervised data. The softmax function takes a \( d \)-dimensional vector as its argument and returns \( d \) probability values, as defined in (2).

$$ \mathrm{softmax}(v)_{i} = \frac{\exp(v_{i})}{\sum_{j=1}^{d} \exp(v_{j})} $$
(2)

The agreement between the probability values thus obtained and the observed categories \( \{c^{(i)}\}_{i=1}^{N_{\mathrm{data}}} \) is measured by the cross-entropy loss function (3), and the parameters are adjusted accordingly. We minimize this loss function by stochastic gradient descent.

$$ L = - \sum_{i=1}^{N_{\mathrm{data}}} \log P\left( \hat{y}^{(i)} = c^{(i)} \mid x^{(i)} \right) $$
(3)
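To make Eqs. (1)–(3) concrete, the following sketch computes the softmax probabilities and the cross-entropy loss in NumPy; the weights, bias, and data point are illustrative values, not the parameters learned in this study.

```python
import numpy as np

def softmax(v):
    """Softmax of a d-dimensional vector, as in Eq. (2)."""
    e = np.exp(v - v.max())          # shift for numerical stability
    return e / e.sum()

def cross_entropy(probs, labels):
    """Cross-entropy loss of Eq. (3): negative log-probability
    assigned to the observed category of each data point."""
    return -np.sum(np.log(probs[np.arange(len(labels)), labels]))

# Toy logistic-regression forward pass (W, b, x are illustrative values).
W = np.array([[0.5, -0.2], [-0.3, 0.8]])   # (C, D) with C = D = 2
b = np.array([0.1, -0.1])                  # C-dimensional bias
x = np.array([1.0, 2.0])                   # one data point

probs = softmax(W @ x + b)                 # Eq. (1)
print(probs, cross_entropy(probs[None, :], np.array([1])))
```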

In the logistic regression described above, there was only one pair of coupling weight \( W \) and bias term \( b \) performing a linear transformation. In a multilayer perceptron with \( k \) hidden layers, there are \( (k + 1) \) pairs of coupling weights and biases, and they transform the input \( h^{0} = x \) as in (4).

$$ \begin{aligned} h^{1} & = \mathrm{activate}_{1}\left( W^{1} h^{0} + b^{1} \right) \\ h^{2} & = \mathrm{activate}_{2}\left( W^{2} h^{1} + b^{2} \right) \\ & \;\; \vdots \\ h^{k} & = \mathrm{activate}_{k}\left( W^{k} h^{k-1} + b^{k} \right) \\ h & = W h^{k} + b \end{aligned} $$
(4)

Here, \( h^{0} \) is the input layer, \( h^{k} \) is the \( k \)-th hidden layer, and \( h \) is the output layer.
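The transformation (4) can be sketched directly in NumPy as below; the layer sizes and the use of ReLU activations are assumptions for illustration (the actual activation functions are chosen by Bayesian optimization in Sect. 3.2).

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def mlp_forward(x, weights, biases, activations):
    """Forward pass of Eq. (4): k hidden layers followed by a linear
    output layer. `weights`/`biases` hold k+1 pairs; `activations`
    holds the k hidden-layer activation functions."""
    h = x
    for W, b, act in zip(weights[:-1], biases[:-1], activations):
        h = act(W @ h + b)               # h^l = activate_l(W^l h^{l-1} + b^l)
    return weights[-1] @ h + biases[-1]  # h = W h^k + b

# Illustrative shapes only: 4 inputs, two hidden layers, 2 output classes.
rng = np.random.default_rng(0)
shapes = [(8, 4), (8, 8), (2, 8)]
weights = [rng.normal(size=s) for s in shapes]
biases = [np.zeros(s[0]) for s in shapes]
logits = mlp_forward(np.ones(4), weights, biases, [relu, relu])
print(logits)
```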

The responses to a report also include a “wish count” and a “number of activities”, but these two are attached to a report together with its “favorites” and are therefore not used as explanatory variables. One thing to keep in mind when choosing explanatory variables is multicollinearity, a problem especially in multiple regression analysis. According to Yoshida (1987), multicollinearity is essentially the problem that, when explanatory variables are correlated, the correlation matrix is not regular, so the solution cannot be obtained or is unstable. In this study, regularization by Dropout is carried out, which serves both to prevent overfitting and to mitigate multicollinearity; Srivastava et al. (2014) showed that this regularization prevents overfitting.

As a precaution, we examine the relationships between the explanatory variables. Based on their scales (Table 2), we use Spearman’s rank correlation coefficient between ordinal and ratio scales, the correlation ratio between a nominal scale and an ordinal or ratio scale, and Cramér’s coefficient of association between nominal scales. The results are shown in Table 3.

Table 2. Scale chart
Table 3. Relationship between explanatory variables
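The following sketch shows how such scale-dependent association measures can be computed with SciPy and pandas; the toy DataFrame merely stands in for the Minrepo data, whose actual column names are not shown here.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, spearmanr

# Toy stand-in for the Minrepo variables (column names are hypothetical).
reports = pd.DataFrame({
    "number_of_images": [1, 2, 3, 1, 2, 3, 1, 1],
    "age": [36, 43, 57, 29, 41, 50, 33, 45],
    "gender": ["F", "F", "M", "F", "M", "F", "F", "M"],
    "emotion": ["joy", "none", "joy", "joy", "none", "joy", "none", "joy"],
})

def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a nominal and a numeric variable."""
    groups = [values[categories == c] for c in np.unique(categories)]
    ss_between = sum(len(g) * (g.mean() - values.mean()) ** 2 for g in groups)
    ss_total = ((values - values.mean()) ** 2).sum()
    return np.sqrt(ss_between / ss_total)

def cramers_v(x, y):
    """Cramér's coefficient of association between two nominal variables."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

rho, _ = spearmanr(reports["number_of_images"], reports["age"])  # ordinal x ratio
eta = correlation_ratio(reports["gender"].to_numpy(), reports["age"].to_numpy())
v = cramers_v(reports["emotion"], reports["gender"])             # nominal x nominal
print(rho, eta, v)
```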

3.2 Hyperparameter Setting

In order to create a deep learning model, a number of hyperparameters must be set. First, of the 6,367 reports, 80% (5,094 reports) were used for training and the remaining 20% (1,273 reports) for validation.

Next, we set the number of hidden layers in the multi-layer perceptron. According to Asakawa (2014), with only one hidden layer we can solve only linearly separable problems, so at least two hidden layers are needed. Also, according to Nielsen (2014), as the number of hidden layers increases, stochastic gradient descent suffers from the vanishing/unstable gradient problem, so the number of hidden layers should not be increased too much. For these reasons, we set two hidden layers in this model.

For each hidden layer, three hyperparameters must be set: the number of output dimensions, the activation function, and the dropout ratio. In this model, these three are set by Bayesian optimization, targeting the accuracy obtained from the confusion matrix.
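As a sketch of this search, the snippet below uses Optuna, whose default TPE sampler is one common Bayesian optimization method (the tool actually used is not named in this paper); `build_and_evaluate` is a hypothetical helper that trains the MLP with the proposed settings and returns validation accuracy.

```python
import optuna

def objective(trial):
    # Per-layer hyperparameters: output dimensions, activation, dropout ratio.
    params = {}
    for layer in (1, 2):
        params[f"units{layer}"] = trial.suggest_int(f"units{layer}", 8, 128)
        params[f"activation{layer}"] = trial.suggest_categorical(
            f"activation{layer}", ["relu", "tanh", "sigmoid"])
        params[f"dropout{layer}"] = trial.suggest_float(f"dropout{layer}", 0.0, 0.5)
    # Hypothetical helper: trains the MLP and returns validation accuracy.
    return build_and_evaluate(params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```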

Finally, we set up the training itself. For the multi-layer perceptron, the parameters governing learning are the optimization function, the learning rate, the number of epochs, and the mini-batch size.

Because the best optimization function differs depending on the data, we compare the accuracy of the representative functions “Adam”, “SimpleSGD”, “AdaDelta”, “AdaGrad”, and “RMSprop”, keeping all other parameters the same, and select the one with the highest accuracy. From Table 4, we set “AdaDelta” as the optimization function of this model.

Table 4. Accuracy for each optimization function
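A comparison like Table 4 could be run as follows. The framework is not named in this paper, so this sketch assumes tf.keras, with plain SGD standing in for “SimpleSGD”; `build_model`, `x_train`, `y_train`, `x_val`, and `y_val` are hypothetical placeholders for the model and the 80/20 split described above.

```python
from tensorflow import keras

optimizers = {
    "Adam": keras.optimizers.Adam(),
    "SimpleSGD": keras.optimizers.SGD(),     # plain SGD stands in for "SimpleSGD"
    "AdaDelta": keras.optimizers.Adadelta(),
    "AdaGrad": keras.optimizers.Adagrad(),
    "RMSprop": keras.optimizers.RMSprop(),
}
for name, opt in optimizers.items():
    model = build_model()                    # hypothetical 2-hidden-layer MLP
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=64, epochs=100, verbose=0)
    _, acc = model.evaluate(x_val, y_val, verbose=0)
    print(f"{name}: {acc:.3f}")
```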

According to Taguchi (1998), increasing the learning rate makes the error converge faster but unstably, while decreasing it makes the error more stable but slower to converge. However, since “AdaDelta” was chosen as the optimization function of this model, setting an initial learning rate became unnecessary; the details are described by its developer, Zeiler (2012). In fact, even when the learning rate was changed substantially, no change in accuracy was observed (Table 5). Because a learning rate must still be set for training to run, the default value of 0.0001 is used in this model.

Table 5. Accuracy for each learning rate

Regarding the mini-batch size, it is preferable to use a power of two (\( 2^{n} \)) for memory access efficiency. The smaller the size, the higher the accuracy tends to be, but overfitting becomes more likely. Starting from the default value of 64, we compared 128, 64, and 32, and set 64, which gave the best result (Table 6).

Table 6. Accuracy for each mini-batch size

The number of epochs is the number of times the training data is learned repeatedly. In deep learning, parameters are learned by repeated passes over the training data, so enough epochs must be taken, but not so many that overfitting occurs; Santos (2007) reports on the error caused by taking too many epochs. Therefore, we set the number of epochs using early stopping, a method proposed by Prechelt (1998) that stops learning when the error begins to converge, thereby avoiding overfitting. In this model, we set the number of epochs to 100, at which the error began to converge.
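The final training setup can be sketched as follows, again assuming tf.keras; the early-stopping patience value is an assumption, since the text only states that learning stops when the error begins to converge.

```python
from tensorflow import keras

# Early stopping (Prechelt 1998): halt when validation error stops improving.
# patience=5 is an assumption, not a value stated in the paper.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model = build_model()  # hypothetical 2-hidden-layer MLP with Dropout
model.compile(optimizer=keras.optimizers.Adadelta(learning_rate=0.0001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=64, epochs=100, callbacks=[early_stop], verbose=0)
```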

With the above settings, we finally obtained a model with 73.3% accuracy.

4 Decision Tree Analysis

In Sect. 3, a deep learning model was created. In this section, we apply decision tree analysis to that model’s predictions to report the features of highly sensitive consumers clearly, both visually and numerically. The positive features of decision tree analysis are that the output is easy to interpret and that various scales can be handled for both explanatory and objective variables; the primary reason for using it is that the classification process can be visualized in an easy-to-understand manner. Of course, it also has negative features: the analysis is prone to overfitting, which may lower generalization performance. Overfitting refers to learning the observed data excessively, reacting sensitively even to noise, so that the fit to unknown data worsens. Since visualizing the classification process understandably is our top priority, we make efforts to minimize overfitting; creating the deep learning model first also serves to bring the observed data closer to the true categories and to improve generalization performance.

The explanatory variables used in this model take the values shown in Table 7. We apply the model created as a generalized model to predict whether an “empathy report” or a “non-empathy report” will be written, and analyze the features of the predicted binary labels by decision tree analysis. Table 8 shows the settings used for the decision tree analysis. After creating the decision tree model, we pruned it to improve generalization performance.

Table 7. Value to be taken by explanatory variable
Table 8. Setting of decision tree analysis.
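A sketch of this step with scikit-learn (the tool actually used is not named) is shown below; `X` holds the explanatory variables of Table 7, `y_pred` the binary labels predicted by the deep learning model, and the `ccp_alpha` value stands in for the pruning step whose exact settings are in Table 8.

```python
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Fit a Gini-based tree to the deep learning model's predicted labels,
# then visualize the branching conditions and class proportions (Fig. 10).
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.001)
tree.fit(X, y_pred)
plot_tree(tree, feature_names=["images", "emotion", "gender", "age"],
          class_names=["non-empathy", "empathy"], filled=True)
```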

The Gini coefficient used as the branching criterion is defined by (5).

$$ \mathrm{Gini\ coefficient} = 1 - \sum_{i=1}^{c} p_{i}^{2} $$
(5)

Figure 10 shows the results of the decision tree analysis. What matters in this model are the branching conditions and the class proportions at each node. For example, when the number of attached images is 1 or less, the proportion of non-empathy reports is 92.7%, so such reports are almost always non-empathy reports. On the other hand, when the number of attached images is 3 and the age exceeds 57, the report is almost always an empathy report.

Fig. 10. Result of decision tree analysis

5 Discussion

Based on the results of the decision tree analysis, a report is more likely to be empathized with as the number of attached images increases. Reports with only 0 or 1 attached images are classified as non-empathy reports 92.7% of the time; therefore, 2 or more images are the first condition for an empathy report. Visual information carries more information than plain text and makes empathy easier to acquire.

Next, among the groups with a large proportion of empathy reports, reports with 2 attached images gain empathy when the author is 43 or older, and reports with 3 attached images when the author is 36 or older. Under the condition of two or more attached images, older posters acquire empathy more easily; indeed, the decision tree shows that the proportion of empathy reports increases with age.

Conversely, at age 30 or under, 97% or more of reports are classified as non-empathy reports even when the number of images is 2 or 3. With 2 images at age 39 or under, and with 3 images even at age 35 or under, 70% or more are classified as non-empathy reports. Therefore, the reports that easily gain empathy are understood to come from the “arafō” generation (arafō: a Japanese term referring to people aged around 40).

In addition to the number of images and age, the explanatory variables include emotion and gender, but the model created in this study did not identify emotion or gender as important conditions.

When developing a new food product, innovation theory says that attention must be paid to the high sensitivity layer. From the results of this study, we believe that attaching many images and observing the reactions of the older users within the high sensitivity layer can serve the role of test marketing (although verification is still necessary).

6 Future Work

In this study, we constructed a deep learning model that predicts the binary labels “non-empathy report” and “empathy report”, but its explanatory variables were limited to those that can be interpreted numerically (including nominal scales). Accuracy could be further improved by building a deep learning model on top of text mining, to learn what kinds of words easily attract sympathy; searching for such empathy-inducing words is also interesting in itself. Since we examined empathy report by report this time, associating reports by user may also yield interesting knowledge.

In addition, the data included image files. Since the importance of images is also indicated by the decision tree results, we want to verify what types of images easily attract empathy, for example by tagging the images with AI and applying text mining to the tags.

Finally, we chose decision tree analysis with the highest priority on ease of interpretation. Logistic regression analysis is another binary classification model, and we would like to verify it as well.