P2P Lending Analysis Using the Most Relevant Graph-Based Features

Cui, Lixin; Bai, Lu; Wang, Yue; Bai, Xiao; Zhang, Zhihong; Hancock, Edwin R.

doi:10.1007/978-3-319-49055-7_1

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10029)

Included in the following conference series:

Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR)

Abstract

Peer-to-Peer (P2P) lending relies on online platforms that facilitate borrowing and investment transactions. A central problem for these P2P platforms is how to identify the most influential factors that are closely related to credit risk. This problem is inherently complex due to the various forms of risk and the numerous influencing factors involved. Moreover, raw P2P lending data are often high-dimensional, highly correlated and unstable, making the problem even less tractable for traditional statistical and machine learning approaches. To address these problems, we develop a novel filter-based feature selection method for P2P lending analysis. Unlike most traditional feature selection methods that use vectorial features, the proposed method is based on graph-based features and thus incorporates the relationships between pairwise feature samples into the feature selection process. Since the graph-based features are by nature complete weighted graphs, we use the steady state random walk to encapsulate the main characteristics of the graph-based features. Specifically, we compute a probability distribution of the walk visiting the vertices. Furthermore, we measure the discriminant power of each graph-based feature with respect to the target feature, through the Jensen-Shannon divergence measure between the probability distributions from the random walks. We select an optimal subset of features based on the most relevant graph-based features, through the Jensen-Shannon divergence measure. Unlike most existing state-of-the-art feature selection methods, the proposed method can accommodate both continuous and discrete target features. Experiments demonstrate the effectiveness and usefulness of the proposed feature selection algorithm on the problem of P2P lending platforms in China.

1 Introduction

Online Peer-to-Peer (P2P) lending has recently emerged as a useful financing alternative where individuals can borrow and lend money directly through an online trading platform without the help of institutional intermediaries such as banks [24]. Despite its explosive development, recent years have witnessed several acute problems, such as high default rates among borrowers and the bankruptcy of a large number of P2P lending platforms [15]. To protect personal investors from economic losses and ensure the smooth and effective operation of the P2P lending industry, it is necessary to develop efficient credit risk assessment methods. Indeed, state-of-the-art P2P lending platforms such as Prosper and Lending Club have utilized credit rating models to evaluate the risk of each loan [6].

Credit risk evaluation decisions are inherently complex due to the various forms of risks and the numerous influencing factors involved [4]. Along this line, tremendous efforts have been devoted to developing quantitative credit rating methods due to their effectiveness. These methods can be broadly divided into two groups: traditional statistical methods [5] and machine learning approaches [20]. Statistical methods are difficult to apply because of the complex dependencies between the various factors that influence the final credit risk evaluation. On the other hand, machine learning approaches, such as tree-based classifiers, support vector machines (SVM), and neural networks (NN), do not require the factors to be independent and identically distributed (i.i.d.), and are capable of tackling computationally intensive credit risk evaluation problems. However, with the emergence of the Internet and E-Commerce, P2P lending data sets are growing ever larger. In addition, these raw data are often high-dimensional, highly correlated, and unstable. These characteristics of P2P lending data present new challenges for traditional machine learning algorithms, which may consume a large amount of computational time and cannot process the information effectively.

To mitigate this problem, one effective approach is to apply feature selection during data preprocessing, before running the learning algorithms [7]. By choosing a small subset of the most informative features that ideally is necessary and sufficient to describe the target concept [10], feature selection is capable of solving data mining and pattern recognition problems on data sets involving a large number of features. Several studies have explored the advantages of feature selection for credit risk evaluation in P2P lending. For instance, Malekipirbazari and Aksakalli [15] proposed a random forest based classification method for predicting borrower status. To reduce data dimensionality, they proposed a feature selection method based on the information gain of each individual feature. Jin and Zhu [21] used a random forest method to evaluate the significance of each feature and selected a feature subset based on this measure. By comparing the performance of decision tree, SVM, and NN on a dataset from Lending Club, the authors demonstrated the effectiveness of feature selection for credit risk analysis in P2P lending. Despite their usefulness in solving credit risk evaluation problems in P2P lending, most existing feature selection methods treat each feature as a vector, and thus ignore the relationships between pairwise samples in each feature. This drawback leads to significant information loss. Thus, developing an effective feature selection method remains a challenge.

The aim of this paper is to address the aforementioned shortcoming, by developing a new feature selection method. We commence by transforming each vectorial feature into a graph-based feature, that not only incorporates relationships between pairwise feature samples but also reflects richer characteristics than the original vectorial features. We also transform the target feature into a graph-based target feature. Furthermore, we use the steady state random walk and compute a probability distribution of the walk visiting the vertices. With the probability distribution for the graph-based features and the graph-based target feature to hand, we measure the discriminant power of each graph-based feature with respect to the graph-based target feature, through the Jensen-Shannon divergence measure between the probability distributions from the random walks. We select an optimal subset of features based on the most relevant graph-based features, through the Jensen-Shannon divergence measure. Unlike most existing state-of-the-art feature selection methods, the proposed method can accommodate both continuous and discrete target features. Experiments demonstrate the effectiveness and usefulness of the proposed feature selection algorithm on the problem of P2P lending platforms in China.

This paper is organized as follows. Section 2 briefly reviews related work on feature selection methods. Section 3 presents preliminary concepts that will be used in this work. Section 4 defines the proposed feature selection method. Section 5 presents the experimental evaluation of the proposed approach on a dataset collected from a large P2P lending portal. Section 6 concludes this work.

2 Literature Review

Feature selection has been a fundamental research topic in data mining and machine learning [8]. By choosing from the input data a subset of features that maximizes a generalized performance criterion, feature selection reduces the high dimensionality of the original data, improves learning performance, and provides faster and more cost-effective predictors [7].

Feature selection methods can be broadly divided into two categories, depending on their interaction with the classifier [18]. Filter-based methods [23] are independent of the classifier and focus on the intrinsic properties of the original data. They usually provide a feature weighting or ranking based on some evaluation criterion and output a subset of selected features. By contrast, wrapper approaches [13] perform a search for an optimal subset of features using the outcome of a classifier as guidance. Often, the results obtained by wrapper methods are better than those obtained by filter methods, but the computational cost is also much higher.

Generally, evaluation criteria are of great significance for feature selection and a great variety of effective evaluation criteria have been proposed to locate informative features. These methods include distance [12], correlation [9], information entropy [14], rough set theory [3], etc. Among them, the correlation criterion and its extensions are probably one of the most widely used criteria to characterize the relevance between features, due to its good performance and ease of implementation. For instance, Hall [9] employed some correlation measures to evaluate the optimal feature subsets based on the assumption that a good subset contains features which are highly correlated to the class, yet uncorrelated to each other. In [19], a supervised feature selection method which builds a dissimilarity space by hierarchical clustering with conditional mutual information is developed.

Broadly speaking, there are two types of methods to measure the correlation between features of the same type. One is the traditional linear correlation and the other is based upon information theory. For the first type, the most well-known similarity measure between two features x and y is the linear correlation coefficient $\mathrm {sim}(x,y)$, defined as

$$\begin{aligned} \mathrm {sim}(x,y)=\mathrm {cov}(x,y)/\sqrt{\mathrm {var}(x)\mathrm {var}(y)}. \end{aligned}$$

Here $\mathrm {var}(\cdot )$ represents the variance of a feature and $\mathrm {cov}(x,y)$ denotes the covariance between features x and y. Other measures in this category are basically variations of this measure, including the maximal information compression index and the least square regression error. Although this type of similarity measure can reduce redundancy among relevant features, it has several shortcomings. First, the assumption of linear correlation between features is often not reasonable, because many real-world data sets, such as those from P2P lending and finance, exhibit very complex nonlinear relationships. Second, the linear correlation measure is not applicable when discrete data are involved.
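The first shortcoming can be seen on a toy example. The sketch below is an illustration only (the function name `linear_correlation` and the sample values are assumptions, not taken from the paper): it evaluates $\mathrm {sim}(x,y)$ on an exactly linear pair and on a pair related by $y=x^2$, where the dependence is perfect but the linear correlation vanishes.

```python
import math

def linear_correlation(x, y):
    """sim(x, y) = cov(x, y) / sqrt(var(x) * var(y)) for two equal-length features."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)

# Toy features (illustrative values, not from the paper's dataset).
x = [1.0, 2.0, 3.0, 4.0]
linear = linear_correlation(x, [2.0 * v + 1.0 for v in x])       # y = 2x + 1
quadratic = linear_correlation([-2.0, -1.0, 1.0, 2.0],
                               [4.0, 1.0, 1.0, 4.0])             # y = x^2
```

Here `linear` is 1 up to rounding, while `quadratic` is 0 even though y is fully determined by x.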

To address these shortcomings, various information theory based correlation measures such as information gain [22] and symmetrical uncertainty [17] have been proposed. The amount by which the entropy of x decreases reflects the additional information about x provided by y, and is called information gain, which is expressed as

$$\begin{aligned} IG(x,y)=H(x)-H(x|y)=H(y)-H(y|x). \end{aligned}$$

Here $H(\cdot )$ denotes the entropy of a feature and H(x|y) refers to the entropy of x after observing the values of another discrete feature y. Despite its efficiency, information gain is biased towards features with more values. Symmetrical uncertainty compensates for this bias by normalizing its value to the range [0, 1]. It can be defined as

$$\begin{aligned} SU(x,y)=2\left[ \frac{IG(x,y)}{H(x)+H(y)}\right] . \end{aligned}$$

If $SU(x,y)=1$, it indicates that features x and y are completely related. Otherwise, if SU takes the value of zero, it suggests that x and y are totally independent.
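As a minimal sketch of these two measures for discrete features, the code below estimates entropies from empirical frequencies and normalizes the information gain by the sum $H(x)+H(y)$, following the standard definition of symmetrical uncertainty. The function names, the base-2 logarithm, and the toy values are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete feature."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(x | y): entropy of x within each group of equal y, weighted by group size."""
    n = len(x)
    groups = {}
    for xv, yv in zip(x, y):
        groups.setdefault(yv, []).append(xv)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def information_gain(x, y):
    return entropy(x) - conditional_entropy(x, y)

def symmetrical_uncertainty(x, y):
    # Assumes at least one of the two features is non-constant.
    return 2.0 * information_gain(x, y) / (entropy(x) + entropy(y))

x = [0, 0, 1, 1]
su_same = symmetrical_uncertainty(x, [1, 1, 0, 0])   # y determines x
su_indep = symmetrical_uncertainty(x, [0, 1, 0, 1])  # y says nothing about x
```

On these toy inputs `su_same` evaluates to 1 and `su_indep` to 0, matching the two extreme cases described above.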

Although many robust correlation-based evaluation criteria have been developed in the literature, there is no existing method that can incorporate relationships between pairwise samples of each feature dimension into the feature selection process. We also notice that a number of existing feature selection criteria implicitly select features that preserve sample relationships, which can be inferred from either a predefined distance metric or label information [25]. This indicates that it would be beneficial to incorporate sample relationships into the feature selection algorithm.

3 Preliminary Concepts

3.1 Probability Distributions from the Steady State Random Walk

Assume G(V, E) is a graph with vertex set V, edge set E, and a weight function $\omega :V \times V\rightarrow \mathbb {R}^+$. If $\omega (u,v) > 0$ ($\omega (u,v)=\omega (v,u)$), we say that (u, v) is an edge of G, i.e., the vertices $u\in V$ and $v \in V$ are adjacent. The vertex degree matrix of G is a diagonal matrix D whose elements are given by

$$\begin{aligned} D(v,v)=d(v)=\sum _{u\in V}\omega (v,u). \end{aligned}$$

(1)

Based on [2], the probability of the steady state random walk visiting each vertex v is

$$\begin{aligned} p(v)=d(v)/\sum _{u \in V}d(u). \end{aligned}$$

(2)

Furthermore, from the probability distribution $P=\{p(1),\ldots ,p(v),\ldots ,p(|V|)\}$, we can straightforwardly compute the Shannon entropy of G as

$$\begin{aligned} H_S(G)=-\sum _{v\in V}p(v)\log p(v). \end{aligned}$$

(3)
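Equations (1) to (3) can be sketched in a few lines. The code below is an illustration under stated assumptions: the graph is given as a dense symmetric weight matrix (a list of rows), the natural logarithm is used, and a graph with zero total weight falls back to the uniform distribution, a case the text does not discuss.

```python
import math

def steady_state_distribution(weights):
    """p(v) = d(v) / sum_u d(u), with d(v) the row sum of the weight matrix."""
    degrees = [sum(row) for row in weights]
    total = sum(degrees)
    if total == 0:
        # Assumption: use the uniform distribution when all weights are zero.
        return [1.0 / len(degrees)] * len(degrees)
    return [d / total for d in degrees]

def shannon_entropy(p):
    """H_S = -sum_v p(v) log p(v), with 0 log 0 taken as 0."""
    return -sum(pv * math.log(pv) for pv in p if pv > 0)

# Symmetric weights of a small 3-vertex graph (illustrative values).
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
p = steady_state_distribution(W)   # degrees are 1, 2, 1
h = shannon_entropy(p)
```

For this graph the degrees are 1, 2 and 1, so the walk visits the middle vertex with probability 1/2 and each end vertex with probability 1/4.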

3.2 Jensen-Shannon Divergence

In information theory, the Jensen-Shannon divergence (JSD) is a dissimilarity measure between probability distributions over potentially structured data, e.g., trees, graphs, etc. It is related to the Shannon entropy of the two distributions [2]. Consider two (discrete) probability distributions $\mathcal {P}=(p_1,\ldots ,p_m,\ldots ,p_M)$ and $\mathcal {Q}=(q_1,\ldots ,q_m,\ldots ,q_M)$. The classical Jensen-Shannon divergence between $\mathcal {P}$ and $\mathcal {Q}$ is defined as

$$\begin{aligned}&D_{JS}(\mathcal {P},\mathcal {Q}) = H_S\Big (\frac{\mathcal {P}+\mathcal {Q}}{2}\Big ) - \frac{1}{2} H_S(\mathcal {P}) - \frac{1}{2} H_S(\mathcal {Q}) \nonumber \\&=-\sum _{m=1}^M \frac{p_m+q_m}{2} \log \frac{p_m+q_m}{2} + \frac{1}{2}\sum _{m=1}^M {p_m} \log {p_m}+ \frac{1}{2} \sum _{m=1}^M {q_m} \log {q_m}, \end{aligned}$$

(4)

where $H_S(\cdot )$ is the Shannon entropy of a probability distribution. Note that the JSD has been used as a means of measuring the information theoretic dissimilarity of graphs. However, in this work, we are more interested in a similarity measure between features. Thus, we define the JSD-based similarity measure as the exponential of the negative JSD, i.e.,

$$\begin{aligned} S(\mathcal {P},\mathcal {Q})=\exp \{-D_{JS}(\mathcal {P},\mathcal {Q})\}. \end{aligned}$$

(5)
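A minimal sketch of Eqs. (4) and (5), assuming both distributions are plain lists of equal length and that the natural logarithm is used (the function names are illustrative). With the natural logarithm the JSD is bounded above by $\log 2$, so the similarity $S$ lies between 1/2 (distributions with disjoint support) and 1 (identical distributions).

```python
import math

def shannon_entropy(p):
    """H_S(P) = -sum_m p_m log p_m, with 0 log 0 taken as 0."""
    return -sum(v * math.log(v) for v in p if v > 0)

def jensen_shannon_divergence(p, q):
    """D_JS(P, Q) = H_S((P + Q) / 2) - H_S(P) / 2 - H_S(Q) / 2."""
    mixture = [(a + b) / 2.0 for a, b in zip(p, q)]
    return (shannon_entropy(mixture)
            - 0.5 * shannon_entropy(p)
            - 0.5 * shannon_entropy(q))

def jsd_similarity(p, q):
    """S(P, Q) = exp(-D_JS(P, Q))."""
    return math.exp(-jensen_shannon_divergence(p, q))

same = jsd_similarity([0.25, 0.5, 0.25], [0.25, 0.5, 0.25])   # identical
disjoint = jsd_similarity([1.0, 0.0], [0.0, 1.0])             # no overlap
```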

4 The Feature Selection Method on Graph-Based Features

4.1 Construction of Graph-Based Features

In this subsection, we transform each vectorial feature into a new graph-based feature, that is, a complete weighted graph. The main advantage of the new feature representation is that the graph-based feature can incorporate the relationships between samples of each original vectorial feature, thus leading to less information loss. Given a dataset having N features denoted as $\mathcal {X}=\{ \mathbf f _{1},\ldots ,\mathbf f _{i},\ldots ,\mathbf f _{N} \}\in \mathbb {R}^{M\times N}$, where $\mathbf f _{i}$ represents the i-th vectorial feature that has M samples, we transform each vectorial feature $\mathbf f _{i}$ into a graph-based feature $\mathbf G _{i}(V_i,E_i)$, where each vertex $v_{a}\in V_i$ indicates the a-th sample $f_{a}$ of $\mathbf f _{i}$, each pair of vertices $v_{a}$ and $v_{b}$ is connected by a weighted edge $(v_{a},v_{b})\in E_i$, and the weight $w(v_{a},v_{b})$ is the Euclidean distance between the two samples, i.e.,

$$\begin{aligned} w(v_{a},v_{b})= \sqrt{ (f_{a} - f_{b}) (f_{a} - f_{b})^T }. \end{aligned}$$

(6)
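For a scalar feature, Eq. (6) reduces to the absolute difference between two samples. The sketch below builds the resulting weight matrix as a list of rows; the function name and the sample values are illustrative assumptions.

```python
def graph_based_feature(feature):
    """Weight matrix of the complete graph over samples: w(a, b) = |f_a - f_b|."""
    return [[abs(fa - fb) for fb in feature] for fa in feature]

f = [1.0, 4.0, 6.0]          # three samples of one feature (illustrative)
W = graph_based_feature(f)
degrees = [sum(row) for row in W]
```

The matrix is symmetric with a zero diagonal, and a vertex's degree is the total distance from its sample to all the others, so samples far from the rest receive larger degrees.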

Similarly, if the samples of the target feature $\mathbf Y =\{y_{1},\ldots ,y_{a},\ldots ,y_{b},\ldots ,y_{M}\}^T$ are continuous, its graph-based feature $\varvec{\hat{\mathbf{G }}}(\hat{V},\hat{E})$ can also be computed using Eq. (6), where each vertex $\hat{v}_a\in \hat{V}$ represents the a-th sample $y_{a}$. However, in some cases, the samples $y_{a}$ of the target feature Y may take discrete values $c=1,2,\ldots ,C$. In this case, we first compute a graph-based target feature $\varvec{\hat{\mathbf{G }}}_{i}(\hat{V}_{i},\hat{E}_{i})$ for each feature $\mathbf f _{i}$, where the weight $w(\hat{v}_{ia},\hat{v}_{ib})$ of each edge $(\hat{v}_{ia},\hat{v}_{ib})\in \hat{E}_i$ is

$$\begin{aligned} w(\hat{v}_{ia},\hat{v}_{ib})= \sqrt{ (\mu _{ia} - \mu _{ib}) (\mu _{ia} - \mu _{ib})^T }, \end{aligned}$$

(7)

where $\mu _{ia}$ is the mean value of all samples in $\mathbf f _{i}$ whose target feature samples take the same discrete value $c=y_a$. Moreover, based on [11], we also compute the Fisher score $F(\mathbf f _{i})$ for each feature $\mathbf f _{i}$ as

$$\begin{aligned} F(\mathbf f _{i}) = \frac{\sum _{c=1}^{C}n_{c}(\mu _{c}-\mu )^{2}}{\sum _{c=1}^{C}n_{c}\sigma _{c}^{2}}, \end{aligned}$$

(8)

where $\mu _{c}$ and $\sigma _{c}^{2}$ are the mean and variance of the samples of $\mathbf f _{i}$ whose target value is c, $\mu $ is the mean of feature $\mathbf f _{i}$, and $n_{c}$ is the number of samples in $\mathbf f _{i}$ whose target value is c. From Eq. (8), we observe that the Fisher score $F(\mathbf f _{i})$ reveals the quality of the graph-based target feature $\varvec{\hat{\mathbf{G }}}_i$ for $\mathbf f _i$. In other words, a higher Fisher score means a better target feature graph. As a result, the graph-based target feature $\varvec{\hat{\mathbf{G }}}(\hat{V},\hat{E})$ can be identified by

$$\begin{aligned} \varvec{\hat{\mathbf{G }}}(\hat{V},\hat{E}) = \varvec{\hat{\mathbf{G }}}_{i^*}(\hat{V}_{i^*},\hat{E}_{i^*}), \end{aligned}$$

(9)

where

$$\begin{aligned} i^*=\arg \max _{i} F(\mathbf f _{i}). \end{aligned}$$

(10)
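The construction for a discrete target can be sketched as follows. This is an illustration under stated assumptions: features are scalar, the class variance is the population variance (the text does not say whether the sample or population form is intended), at least one class has nonzero variance, and all names and values are hypothetical.

```python
def class_statistics(feature, labels):
    """Per-class (count, mean, variance) of a feature, grouped by target value."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(label, []).append(value)
    stats = {}
    for label, values in groups.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
        stats[label] = (len(values), mean, var)
    return stats

def fisher_score(feature, labels):
    """Eq. (8): between-class scatter over within-class scatter."""
    stats = class_statistics(feature, labels)
    overall = sum(feature) / len(feature)
    between = sum(n * (mean - overall) ** 2 for n, mean, _ in stats.values())
    within = sum(n * var for n, _, var in stats.values())
    return between / within  # assumes nonzero within-class variance

def target_graph(feature, labels):
    """Eq. (7): replace each sample by its class mean, then take pairwise distances."""
    stats = class_statistics(feature, labels)
    mu = [stats[label][1] for label in labels]
    return [[abs(a - b) for b in mu] for a in mu]

labels = [1, 1, 2, 2]
features = [[1.0, 2.0, 5.0, 6.0],   # separates the two classes well
            [1.0, 5.0, 2.0, 6.0]]   # mixes them
scores = [fisher_score(f, labels) for f in features]
best = max(range(len(features)), key=lambda i: scores[i])   # Eq. (10)
G_hat = target_graph(features[best], labels)                # Eq. (9)
```

In this toy case the first feature has the higher Fisher score, so its class-mean graph is used as the graph-based target feature; edges inside a class have weight 0 and edges across classes have weight equal to the gap between the class means.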

4.2 Feature Selection Based on Relevant Graph-Based Features

We aim to select an optimal subset of features. Specifically, by measuring the Jensen-Shannon divergence between graph-based features, we compute the discriminant power of each vectorial feature with respect to the target feature. For a set of N features $\mathbf f _{1},\ldots ,\mathbf f _{i},\ldots ,\mathbf f _{j},\ldots ,\mathbf f _{N}$ and the associated continuous or discrete target feature $\mathbf Y $, the relevance degree or discriminant power of the feature $\mathbf f _{i}$ with respect to $\mathbf Y $ is

$$\begin{aligned} R_{\mathbf{f }_{i},\mathbf Y }=S(\mathbf G _{i},\varvec{\hat{\mathbf{G }}}), \end{aligned}$$

(11)

where $\mathbf G _{i}$ and $\varvec{\hat{\mathbf{G }}}$ are the graph-based features of $\mathbf f _{i}$ and $\mathbf Y $, and S is the JSD-based similarity measure defined in Eq. (5), applied to the steady state random walk distributions of the two graphs. The graph-based target feature $\varvec{\hat{\mathbf{G }}}$ is constructed via Eq. (6) for a continuous target feature, or selected via Eq. (9) for a discrete target feature. Based on the relevance degree of each feature $\mathbf f _{i}$ with respect to the target feature $\mathbf Y $ computed by Eq. (11), we can rank the original vectorial features in descending order and then select a subset of the most relevant features.
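Putting the pieces together for a continuous target, the ranking step can be sketched as below. This is a sketch under assumptions rather than the authors' implementation: features are scalar, the natural logarithm is used, a constant feature falls back to the uniform distribution, and the feature names and values are made up for illustration.

```python
import math

def steady_state(feature):
    """Visiting probabilities on the complete graph with weights |f_a - f_b|."""
    degrees = [sum(abs(fa - fb) for fb in feature) for fa in feature]
    total = sum(degrees)
    if total == 0:
        # Assumption: a constant feature yields the uniform distribution.
        return [1.0 / len(feature)] * len(feature)
    return [d / total for d in degrees]

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

def relevance(feature, target):
    """Eq. (11): exp(-JSD) between the two random-walk distributions."""
    p, q = steady_state(feature), steady_state(target)
    mix = [(a + b) / 2.0 for a, b in zip(p, q)]
    jsd = entropy(mix) - 0.5 * entropy(p) - 0.5 * entropy(q)
    return math.exp(-jsd)

def rank_features(features, target):
    """Return feature names sorted by decreasing relevance, plus the scores."""
    scores = {name: relevance(values, target) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

target = [1.0, 2.0, 3.0, 10.0]
features = {
    "scaled_copy": [2.0, 4.0, 6.0, 20.0],   # target multiplied by 2
    "shuffled": [10.0, 1.0, 3.0, 2.0],      # same values, different samples
}
ranking, scores = rank_features(features, target)
```

Because the degrees are normalized by their sum, rescaling a feature leaves its distribution unchanged, so `scaled_copy` attains the maximum relevance of 1 and is ranked first, while `shuffled` scores strictly lower.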

5 Experiments

We evaluate the effectiveness of the proposed graph-based feature selection algorithm on the problem of P2P lending platforms in China. This is of great significance for the credit risk analysis of P2P platforms because the P2P lending industry has developed rapidly since 2007, and many platforms have suffered from severe problems such as borrower default and bankruptcy. More specifically, we use a dataset of 200 P2P platforms collected from a famous P2P lending portal in China (http://www.wdzj.com/). For each platform, we use 19 features including: (1) transaction volume, (2) total turnover, (3) total number of borrowers, (4) total number of investors, (5) online time, which refers to the foundation year of the platform, (6) operation time, i.e., number of months since the foundation of the platform, (7) registered capital, (8) weighted turnover, (9) average term of loan, (10) average full mark time, i.e., tender period of a loan raised to the required full capital, (11) average amount borrowed, i.e., average loan amount of each successful borrower, (12) average amount invested, which is the average investment amount of each successful investor, (13) loan dispersion, i.e., the ratio of the repayment amount to the total capital, (14) investment dispersion, the ratio of the invested amount to the total capital, (15) average times of borrowing, (16) average times of investment, (17) loan balance, (18) popularity, and (19) interest rate.

5.1 The Most Influential Features for Credit Risks (Continuous Target Features)

We first use the proposed feature selection algorithm to identify the most influential features which are most relevant to the interest rate of P2P platforms. In finance, the interest rates of P2P lending can also be interpreted as the rate of return on a loan (for investors), and the higher the rate of return, the greater the likelihood of default. Identifying the most relevant features to the interest rate can help investors effectively manage the credit risks involved in P2P lending [24]. Therefore, in our experiment, we set the interest rate as our continuous target feature. Our purpose is to identify the features that are most influential for the credit risks of the P2P platforms. To realize this goal, we use the proposed feature selection algorithm to rank the remaining 18 features according to their similarities to the target label in descending order. The results are shown in Table 1.

Table 1. Influential factors for bankruptcy problems for P2P lending platforms in China

Results and Discussions: The results show that registered capital, operation time, average amount invested, loan dispersion, and average times of investment are the top five features most relevant to the interest rate (target feature). These results are consistent with finance theory. For instance, a larger registered capital indicates stronger financial stability of the platform. In addition, a longer operation time usually implies that the platform has accumulated abundant risk management knowledge and skills, which help to maintain a lower credit risk level. Moreover, a more dispersed loan rate often indicates a higher degree of security for the platform, which implies a relatively lower interest rate. The average amount invested and average times of investment indicate investors’ preferences for less risky platforms. By contrast, features such as average times of borrowing and average amount borrowed are of less relevance because they reflect the financing needs of the borrowers and are less related to the credit risks of the platforms.

Comparisons: In this section, we compare the proposed feature selection (FS) method with two widely used methods, namely correlation analysis (CA) and multiple linear regression (MLR). Table 2 presents a comparison of the results obtained via these methods. Each method identifies the 10 features that have the highest correlation with the interest rate. It can be noticed that the most influential factors identified by the proposed method tend to be more consistent with the factors selected by MLR, whereas CA ranks different features higher. For example, among the top five most influential factors, both FS and MLR select operation time and loan dispersion. This is reasonable because a more dispersed loan rate often indicates a higher degree of security for the platform, which implies a relatively lower interest rate. Also, a longer operation time often indicates that the platform has accumulated abundant risk management knowledge and skills, which help to maintain a lower credit risk level. These results are consistent with finance theory and demonstrate the effectiveness and usefulness of the proposed method for identifying the most influential factors for credit risk analysis of P2P lending platforms.

Table 2. Comparison of three methods

5.2 Classification for the Credit Rating (Discrete Target Features)

We further evaluate the performance of the proposed method when the target features are discrete. We set the credit rating (taking discrete values) as the target feature, and our purpose is to identify the most influential features for the credit rating of the P2P lending platforms in China. These rating values are collected from the “Report on the Development of the P2P lending industry in China, 2014–2015”, issued by the Financial Research Institute of the Chinese Academy of Social Sciences. Due to the strict evaluation criteria involved, only 104 P2P platforms are included in this report, among which only 42 platforms belong to the 200 P2P platforms used in the above data set. Therefore, we take these 42 platforms as samples for evaluation.

In our experiment, we set the discrete credit rating targets as the classification labels. Because the 42 platforms are categorized into four classes according to their credit rating values, we set the number of classes as four. We randomly select $50\,\%$ of the 42 samples as training data, and use the other half for testing. By repeating this selection process 10 times, we obtain 10 random partitions of the original data. For each of the 10 partitions of the original data, we perform a 10-fold cross-validation using a C-Support Vector Machine (C-SVM) to evaluate the classification accuracy associated with the selected features located via different feature selection methods. These methods include: (1) the proposed feature selection method (GS), (2) the Fisher Score method (FS) [11], and (3) the Mutual Information based method (MI) [16]. We perform cross-validation on the testing samples taken from the feature selection process. Specifically, the entire sample is randomly partitioned into 10 subsets and then we choose one subset for testing and use the remaining 9 subsets for training, and this procedure is repeated 10 times. The final accuracy is computed by averaging the accuracies from each of the random subsets, over all the 10 partitions. The classification accuracy of each feature selection method for different numbers of the most influential features is shown in Fig. 1.

Figure 1 indicates that the proposed method (GS) achieves the best classification accuracy ($31.50\,\%$) while requiring the lowest number of features, i.e., 3 features. In contrast, the FS and MI methods require 3 and 4 features for their best classification accuracies of $30.50\,\%$ and $29.00\,\%$, respectively. The reason for this effectiveness is that only the proposed method incorporates the sample relationship into the feature selection process, and thus encapsulates more information. Although the classification accuracy is only $31.50\,\%$, it is very promising because dividing 42 samples into four different classes is a very challenging classification task. Thus, the classification accuracy demonstrates the effectiveness of the proposed method.

6 Conclusion

In this paper, we have developed a novel feature selection algorithm to conduct credit risk analysis for P2P lending platforms. Unlike most existing feature selection methods, the proposed method is based on graph-based features and encapsulates global topological information of features into the feature selection process. The proposed method thus avoids the information loss between feature samples that arises in traditional feature selection methods. Using a dataset collected from a famous P2P portal in China, we demonstrate the effectiveness of our method.

The proposed feature selection method ignores the redundancy between pairwise features. As a result, the optimal subset of selected features may include redundant features. Furthermore, the proposed method cannot adaptively select the most informative feature subset. To address these problems, future work will be aimed at proposing a new framework that can adaptively select the most informative and least redundant graph-based feature subset. It would also be interesting to propose new approaches for establishing graph-based features from the original vectorial features. Finally, note that the similarity measure between a graph-based feature and the target graph-based feature defined by Eq. (11) is the Jensen-Shannon diffusion graph kernel [1, 2] over probability distributions. In fact, one can also adopt alternative graph kernels. In other words, the proposed framework provides a way of developing feature selection methods associated with graph kernels. It would be interesting to explore the performance of the proposed method with different graph kernels in future work.

References

Bai, L., Bunke, H., Hancock, E.R.: An attributed graph kernel from the Jensen-Shannon divergence. In: Proceedings of ICPR, pp. 88–93 (2014). DBLP:conf/icpr/2014
Google Scholar
Bai, L., Rossi, L., Bunke, H., Hancock, E.R.: Attributed graph kernels using the Jensen-Tsallis q-differences. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML PKDD 2014. LNCS (LNAI), vol. 8724, pp. 99–114. Springer, Heidelberg (2014). doi:10.1007/978-3-662-44848-9_7
Google Scholar
Chen, Y., Miao, D., Wang, R.: A rough set approach to feature selection based on ant colony optimization. Pattern Recogn. Lett. 31(3), 226–233 (2010)
Article Google Scholar
Crook, J.N., Edelman, D., Thomas, L.C.: Recent developments in consumer credit risk assessment. Eur. J. Oper. Res. 183(3), 1447–1465 (2007)
Article MathSciNet MATH Google Scholar
Hand, D.J., Henley, W.E.: Statistical classification methods in consumer credit scoring: a review. J. R. Stat. Soc. Ser. A 160(3), 523–541 (1997)
Article Google Scholar
Guo, Y., Zhou, W., Luo, C., Liu, C., Xiong, H.: Instance-based credit risk assessment for investment decisions in P2P lending. Eur. J. Oper. Res. 249(2), 417–426 (2016)
Article MathSciNet MATH Google Scholar
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
MATH Google Scholar
Hájek, P., Michalak, K.: Feature selection in corporate credit rating prediction. Knowl.-Based Syst. 51, 72–84 (2013)
Article Google Scholar
Hall, M.A.: Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the ICML, pp. 359–366 (2000)
Google Scholar
Han, J., Sun, Z., Hao, H.: Selecting feature subset with sparsity and low redundancy for unsupervised learning. Knowl.-Based Syst. 86, 210–223 (2015)
Article Google Scholar
He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: Advances in Neural Information Processing Systems 18 [Neural Information Processing Systems, NIPS 2005, Vancouver, British Columbia, Canada, 5–8 December 2005], pp. 507–514 (2005)
Google Scholar
Huang, Y., McCullagh, P.J., Black, N.D.: An optimization of relieff for classification in large datasets. Data Knowl. Eng. 68(11), 1348–1356 (2009)
Article Google Scholar
Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artif. Intell. 97(1–2), 273–324 (1997)
Article MATH Google Scholar
Last, M., Kandel, A., Maimon, O.: Information-theoretic algorithm for feature selection. Pattern Recogn. Lett. 22(6/7), 799–811 (2001)
Article MATH Google Scholar
Malekipirbazari, M., Aksakalli, V.: Risk assessment in social lending via random forests. Expert Syst. Appl. 42(10), 4621–4631 (2015)
Pohjalainen, J., Räsänen, O., Kadioglu, S.: Feature selection methods and their combinations in high-dimensional classification of speaker likability, intelligibility and personality traits. Comput. Speech Lang. 29(1), 145–171 (2015)
Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge (1992)
Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
Sotoca, J.M., Pla, F.: Supervised feature selection by clustering using conditional mutual information-based distances. Pattern Recogn. 43(6), 2068–2081 (2010)
Yeh, I.-C., Lien, C.-H.: The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Syst. Appl. 36(2), 2473–2480 (2009)
Jin, Y., Zhu, Y.D.: A data-driven approach to predict default risk of loan for online Peer-to-Peer (P2P) lending. In: Proceedings of Fifth International Conference on Communication Systems and Network Technologies, pp. 609–613 (2015)
Yu, L., Liu, H.: Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res. 5, 1205–1224 (2004)
Zhang, D., Chen, S., Zhou, Z.-H.: Constraint score: a new filter method for feature selection with pairwise constraints. Pattern Recogn. 41(5), 1440–1451 (2008)
Zhao, H., Le, W., Liu, Q., Ge, Y., Chen, E.: Investment recommendation in P2P lending: a portfolio perspective with risk management. In: Proceedings of ICDM, pp. 1109–1114 (2014)
Zhao, Z., Wang, L., Liu, H., Ye, J.: On similarity preserving feature selection. IEEE Trans. Knowl. Data Eng. 25(3), 619–632 (2013)

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant nos. 61602535, 61503422 and 61402389), and the Open Projects Program of National Laboratory of Pattern Recognition. Lu Bai is supported by the program for innovation research in Central University of Finance and Economics. Edwin R. Hancock is supported by a Royal Society Wolfson Research Merit Award. Lixin Cui is supported by the Young Scholar Development Fund of Central University of Finance and Economics, No. QJJ1540.

Author information

Authors and Affiliations

School of Information, Central University of Finance and Economics, Beijing, China
Lixin Cui, Lu Bai & Yue Wang
School of Computer Science and Engineering, Beihang University, Beijing, China
Xiao Bai
Software School, Xiamen University, Xiamen, Fujian, China
Zhihong Zhang
Department of Computer Science, University of York, York, UK
Edwin R. Hancock

Corresponding author

Correspondence to Lu Bai.

Editor information

Editors and Affiliations

Data 61 - CSIRO, Canberra, Australia
Antonio Robles-Kelly
Pattern Recognition Laboratory, Technical University of Delft, Delft, The Netherlands
Marco Loog
Electrical and Electronic Engineering, University of Cagliari, Cagliari, Italy
Battista Biggio
Computación e IA, Universidad de Alicante, Alicante, Spain
Francisco Escolano
Computer Science, University of York, York, United Kingdom
Richard Wilson

Cite this paper

Cui, L., Bai, L., Wang, Y., Bai, X., Zhang, Z., Hancock, E.R. (2016). P2P Lending Analysis Using the Most Relevant Graph-Based Features. In: Robles-Kelly, A., Loog, M., Biggio, B., Escolano, F., Wilson, R. (eds) Structural, Syntactic, and Statistical Pattern Recognition. S+SSPR 2016. Lecture Notes in Computer Science, vol 10029. Springer, Cham. https://doi.org/10.1007/978-3-319-49055-7_1

DOI: https://doi.org/10.1007/978-3-319-49055-7_1
Published: 05 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49054-0
Online ISBN: 978-3-319-49055-7
eBook Packages: Computer Science (R0)