
1 Introduction

1.1 Background and Purpose

Latin American economies have grown rapidly since 2003, after undergoing currency crises and economic turmoil from the early 1930s until the early 1960s. As a result of this growth, the consumer finance market in country X in Latin America has expanded owing to external demand. In addition, because part of the poor class has moved into the middle class, financial services have spread to people who could not obtain loans in the past. On the other hand, many debtors do not understand the contents of their loan contracts, and defaults due to excessive debt are increasing. For this reason, lenders in Latin America, including financial institutions, manufacturers, and retail stores, need to measure the credit risk of each debtor who takes out a loan and to make credit decisions based on the size of that risk.

Many financial institutions earn interest from debtors by financing them. It is also not uncommon for manufacturers and retailers to sell products on loan. Credit risk is the risk of incurring losses because the value of the loaned funds decreases or is lost when the financial situation of the debtor deteriorates. The state in which the debtor does not return the borrowed funds, and repayment is delayed or stopped, is called default. If the debtor defaults, the likelihood that the funds will be returned to the lender is almost nil. Accurately grasping credit risk is therefore an important issue for stable management. Financial institutions, primarily banks, may assign ratings to debtors to determine whether to lend to them. This is called an internal rating system. With the introduction of the new BIS regulation by the Basel Committee on Banking Supervision [1], establishing and validating an appropriate internal rating system has become an issue.

In order to grasp credit risk quantitatively, a statistical credit risk model (also called a credit scoring model, among other names) is constructed in many cases. A typical model is the binomial logit model, which estimates the default probability with the binary default/non-default outcome as the objective variable. The model is widely used in practical credit risk measurement because it explains short-term defaults well and is comparatively easy to calculate.

The purpose of this research is to construct and evaluate a credit risk model based on about 14,000 auto loan records in country X in Latin America, to understand from the analysis results the characteristics of debtors who defaulted within one year, and to grasp the default situation in Latin America. The significance of this research lies in applying the logic of the internal ratings performed at financial institutions to auto loans, and in focusing on Latin America, where the number of defaults has increased. The binomial logit model was adopted in this research and is used to estimate the probability of default within one year. Research on credit risk models often seeks financial indicators that affect bankruptcies, and there are still few studies of default probability estimation for individual debtors. Above all, to the best of our knowledge, no research on auto loans exists. Because of the small number of research cases, we decided that credit risk measurement with a general binomial logit model is more appropriate in this research than starting with complicated models. In addition, AUC was used as the evaluation index of model accuracy. As with the binomial logit model, AUC is an evaluation index generally used in credit risk management practice.

1.2 Previous Research

There are many prior studies on credit risk models, but most of them deal with companies. Since the subject of analysis in this study is individuals, previous studies on individual debtors are described below. However, all of the following studies use Japanese data, and similar results cannot necessarily be obtained for Latin American data. With this in mind, the results of previous research are used as a reference for this research.

  • Hibiki et al. [4]

This study constructed a credit scoring model using about 350,000 education loan records in Japan. Hibiki et al. use a logistic regression model and evaluate it with AR. Education loans in Japan mainly finance the educational expenses of admission or advancement and are provided to students’ parents. Characteristics of education loan users, such as age distribution and income level, are stable regardless of the observation period of the data, and the study shows that the parameters did not differ significantly even when the lending year or the combination of variables was changed. Furthermore, verification with in-sample and out-of-sample data showed that AR did not decrease much, revealing that a model with little overfitting could be constructed. The results show the usefulness of the credit scoring model for education loans. However, from a practical viewpoint, the influence of concrete variables is not disclosed, in order to avoid dangers such as false declarations by applicants.

  • Okumura and Kakamu [2]

Credit Scoring Models Using Hierarchical Bayes Model: An Application to Inter-bank Consortium Mortgage Data.

This study constructed a credit scoring model using Japan’s inter-bank consortium data. Using a hierarchical Bayesian probit model, it verifies whether estimation accuracy improves over the existing model. The ROC curve and AUC are used for model evaluation. To build a highly accurate scoring model, securing default samples is a key point. However, using data pooled across multiple banks assumes homogeneity between banks, and the results obtained represent only average features. To address this problem, Okumura and Kakamu focused on regionality and, by using a hierarchical model, revealed the differences between regions relatively easily. In summary, the verification showed that default differs by region and that the hierarchical Bayesian model performs better than the existing model.

Apart from the above research, examples of explanatory variables that have commonly been used for credit risk models in past projects [6] are described below.

Items regarded as important in credit screening include the repayment ratio, the loan ratio, and the transaction history with financial institutions. Because these items allow a comprehensive judgment of the burden of repayment relative to income and of the debtor’s saving tendency, they generally have a strong relationship with default.

The explanatory variables among the revenue assessment items, which measure the stability of income, are the debtor’s type of business, type of occupation, company size, and years of service. Because the stability of the debtor’s income derives from the stability of the workplace, these items generally have a strong relationship with default.

The spending assessment items, which measure how well repayment resources can be maintained and thus view the debtor from the opposite side to revenue assessment, include the debtor’s dependents, the number of children, and the children’s ages. However, since the application form contains relatively few items related to expenditure compared with items related to income, the data available for analysis are also limited. For this reason, there is much room for research on expenditure assessment items, which have high analytical value.

Besides information at the time of application, the debtor’s deposit balance and outstanding card loan balance are examples of time-dependent variables. They can be expected to have strong explanatory power not only at the time of application but also as variables that change over time. In particular, periodically checking the trend of the deposit balance and updating the debtor’s credit risk as the balance rises and falls is a useful method for the ongoing management of credit risk.

In addition, variables such as the gender and residential area of the debtor may be used for model construction, but they do not always affect creditworthiness.

1.3 Paper Composition

This paper is organized as follows. Section 1 describes the background and purpose of this research and the related prior research. Section 2 outlines the data used for analysis. Section 3 describes the credit risk model used and its evaluation method. Section 4 presents the results of the analysis and discusses them. Section 5 summarizes this research and describes future work.

2 Summary of Data

This section summarizes the auto loan data used in the analysis.

2.1 Summary of Data

The data used in this research are records of customers who purchased products on loan in country X in Latin America. The data period is from September 1, 2010 to June 30, 2012. The target customers are 14,304 people living in X.

The data items are age, gender, marital status, presence or absence of regular work, presence or absence of educational history, home ownership, income (monthly income), down payment of the loan, amount borrowed, interest rate, and default (whether the debtor defaulted within one year of purchase). In addition, two new items were created: a repayment capacity index, obtained by dividing the amount borrowed by income, and a down payment ratio, obtained by dividing the down payment by the product price. Furthermore, when an item had missing values, a missing flag (1 if the value was missing, 0 if not) was created as a new item for that item, and 0 was substituted for the original missing value. Missing flags were created for gender, the repayment capacity index, the down payment ratio, and the interest rate. The above items are used as explanatory variables of the binomial logit model. The objective variable is default within one year.
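The derived items and missing flags described above can be sketched as follows. This is only an illustration under stated assumptions: the field names (`loan_amount`, `monthly_income`, `down_payment`, `product_price`) are hypothetical, since the source does not give column names, and treating a zero denominator as missing is an assumption that the text does not state.

```python
from typing import Dict, Optional, Tuple


def ratio_with_missing_flag(
    numerator: Optional[float], denominator: Optional[float]
) -> Tuple[float, int]:
    """Return (value, missing_flag) for a ratio item.

    When an input is missing (None), the value is set to 0 and the flag to 1,
    following the zero-substitution rule in the text. Treating a zero
    denominator the same way is an assumption made for this sketch.
    """
    if numerator is None or denominator is None or denominator == 0:
        return 0.0, 1
    return numerator / denominator, 0


def build_features(record: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Create the two derived items and their missing flags for one customer."""
    # Hypothetical field names; not taken from the original data set.
    index, index_missing = ratio_with_missing_flag(
        record.get("loan_amount"), record.get("monthly_income")
    )
    ratio, ratio_missing = ratio_with_missing_flag(
        record.get("down_payment"), record.get("product_price")
    )
    return {
        "repayment_capacity_index": index,
        "repayment_capacity_index_missing": index_missing,
        "down_payment_ratio": ratio,
        "down_payment_ratio_missing": ratio_missing,
    }
```

Keeping the flag next to the zero-filled value lets the model distinguish a true zero from a substituted one.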

3 Method of Analysis

This section summarizes credit risk models, the binomial logit model used in this research, and the ROC curve and AUC used as evaluation indices of the model. The software used for analysis is IBM® SPSS® Statistics (ver. 22).

3.1 Summary of Credit Risk Model

Credit risk models are roughly divided into statistical models and option approach models. In a statistical model, default judgment and default probability estimation are performed based on the financial data of the debtor. Typical models are discriminant analysis, the logit model, and the hazard model. In general, the more data are available, the higher the explanatory power of the model that can be created. Statistical models are therefore often applied to small and medium enterprises, for which data are abundant. The option approach model estimates the default probability using market data such as stock prices and corporate bond interest rates. A typical example is the Merton model, which can estimate the default probability in real time by obtaining market data for large listed companies. However, the option approach model is not a rating; it is often used as a monitoring tool for listed companies. Besides these, models such as neural networks and support vector machines are increasingly applied on an experimental basis.

The reason for adopting the binomial logit model in this research is its ease of application to the analysis data. Discriminant analysis and the probit model have properties similar to those of the logit model. However, because the distributional assumptions of these models are difficult to satisfy, they are difficult to apply to actual data. Unlike the logit model, the hazard model explains relatively long-term defaults. Since this research builds a model for default within one year, the binomial logit model is judged to be appropriate.

3.2 Binomial Logit Model

The binomial logit model used in this research is shown below.

$$ P_{i} = \frac{1}{1 + \exp\left( Z_{i} \right)} \quad (0 < P_{i} < 1) $$
(1)
$$ Z_{i} = \ln\left( \frac{1 - P_{i}}{P_{i}} \right) = \alpha + \sum_{j = 1}^{n} \beta_{j} x_{ij} \quad (i = 1, \ldots, I) $$
(2)

In (1), \( P_{i} \) is the probability that debtor \( i \) defaults within one year (the value ranges from 0 to 1). In (2), \( Z_{i} \) is the log odds. Here \( i \) indexes the debtors, \( I \) is the number of debtors, \( n \) is the number of explanatory variables, \( x_{ij} \) is the \( j \)-th explanatory variable for debtor \( i \), and \( \alpha \) and \( \beta_{j} \) are the parameters estimated by the maximum likelihood method. With this sign convention, \( Z_{i} \) is the log odds of non-default, so the default probability decreases as \( Z_{i} \) increases.

Assuming that the outcomes \( y_{i} \in \{ 0, 1 \} \) (1 for default, 0 for non-default) are independent, the likelihood function is given by (3).

$$ L = \prod_{i = 1}^{I} P_{i}^{y_{i}} \left( 1 - P_{i} \right)^{1 - y_{i}} $$
(3)

Since maximizing (3) is equivalent to maximizing its logarithm, the parameters are estimated by maximizing the log-likelihood function \( \ln L \) given in (4).

$$ \ln L = \sum_{i = 1}^{I} \left\{ y_{i} \ln P_{i} + \left( 1 - y_{i} \right) \ln\left( 1 - P_{i} \right) \right\} $$
(4)
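As an illustration of equations (1), (2) and (4), the following minimal sketch evaluates the default probability and the log-likelihood, and maximizes the latter by plain gradient ascent. It is not the procedure used in this research, which relied on SPSS. The function names, the learning rate and the number of steps are assumptions made only for this example. Note that it keeps the sign convention of (1), in which a larger log odds means a lower default probability.

```python
import math
from typing import List, Sequence, Tuple


def default_probability(alpha: float, betas: Sequence[float], x: Sequence[float]) -> float:
    """Equations (1)-(2): Z = alpha + sum_j beta_j * x_j, P = 1 / (1 + exp(Z))."""
    z = alpha + sum(b * v for b, v in zip(betas, x))
    return 1.0 / (1.0 + math.exp(z))


def log_likelihood(alpha: float, betas: Sequence[float],
                   xs: Sequence[Sequence[float]], ys: Sequence[int]) -> float:
    """Equation (4): sum over debtors of y*ln(P) + (1 - y)*ln(1 - P)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = default_probability(alpha, betas, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit(xs: Sequence[Sequence[float]], ys: Sequence[int],
        learning_rate: float = 0.5, steps: int = 2000) -> Tuple[float, List[float]]:
    """Maximize the log-likelihood by gradient ascent (illustrative settings).

    Under the sign convention of (1), d lnL / dZ_i = P_i - y_i, so the
    gradient for beta_j is sum_i (P_i - y_i) * x_ij.
    """
    n_obs, n_vars = len(xs), len(xs[0])
    alpha, betas = 0.0, [0.0] * n_vars
    for _ in range(steps):
        residuals = [default_probability(alpha, betas, x) - y for x, y in zip(xs, ys)]
        alpha += learning_rate * sum(residuals) / n_obs
        for j in range(n_vars):
            betas[j] += learning_rate * sum(r * x[j] for r, x in zip(residuals, xs)) / n_obs
    return alpha, betas
```

For example, with one binary explanatory variable where two of three debtors default when it is 0 and one of three when it is 1, `fit` should return a positive coefficient, because a higher value of the variable lowers the default probability under this convention.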

3.3 ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) are widely used for evaluating credit risk models. Alongside AR (Accuracy Ratio), they are the most frequently used indices for credit risk models. The ROC curve is obtained by comparing the predicted default probability with the actual default/non-default outcome: as the decision threshold varies, the true positive rate is plotted against the false positive rate. AUC, defined as the area under this curve, is an index of the accuracy of the model. If the model has no explanatory power and default occurs randomly, AUC is 0.5. If default is predicted perfectly, AUC is 1. In other words, the closer AUC is to 1, the higher the explanatory power of the model.

$$ 0.5\, \le \,{\text{AUC}}\, \le \,1 $$
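AUC can also be read as the probability that a randomly chosen defaulter receives a higher predicted default probability than a randomly chosen non-defaulter. The sketch below computes it directly from that pairwise definition, counting ties as one half. The function name and the toy inputs are assumptions for illustration, and the quadratic loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUC: share of (defaulter, non-defaulter) pairs ranked correctly.

    scores are predicted default probabilities; labels are 1 for default and
    0 for non-default. Tied scores count as half a correct pair.
    """
    defaulters = [s for s, y in zip(scores, labels) if y == 1]
    non_defaulters = [s for s, y in zip(scores, labels) if y == 0]
    if not defaulters or not non_defaulters:
        raise ValueError("AUC needs at least one default and one non-default")
    correct = 0.0
    for d in defaulters:
        for n in non_defaulters:
            if d > n:
                correct += 1.0
            elif d == n:
                correct += 0.5
    return correct / (len(defaulters) * len(non_defaulters))
```

A model whose ranking is inverted would score below 0.5 with this calculation; the range above applies to models that rank at least as well as chance.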

4 Results

4.1 Results of Binomial Logit Model

Of the results of fitting the binomial logit model, only the explanatory variables whose significance probability (p-value) was below 5% are shown in Table 1.

Table 1. Result of executing the binomial logit model

Table 1 shows that marital status greatly influences default. A married debtor with a family is thought to default more easily because expenditure is larger than for a debtor without a family. In addition, debtors with missing values in the income and down payment items were found to default more easily. Since information on auto loan applications is often filled in by the debtors themselves, this suggests that debtors who do not fill in the information correctly are more likely to default. On the other hand, the repayment capacity index, which was expected to be most related to default, did not show a significant result. A possible cause is that about 1,000 records have an index of almost 0, and these records adversely affected the result.

The value of AUC was 0.625. Since AUC ranges from 0.5 to 1 and a value closer to 1 indicates higher accuracy, the model in this research did not achieve high accuracy.
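For readers more used to AR, the reported AUC can be converted with the standard relationship between the two indices. This conversion is added only for orientation and was not part of the original analysis:

$$ {\text{AR}} = 2 \times {\text{AUC}} - 1 = 2 \times 0.625 - 1 = 0.25 $$

On this scale, 0 corresponds to a model without explanatory power and 1 to perfect prediction.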

5 Conclusions and Future Work

5.1 Conclusions and Discussion

The purpose of this research was to build a binomial logit model for loan data, against the background of increasing defaults in Latin America, and to grasp the characteristics of debtors in Latin America. The results showed that married debtors and debtors whose information is missing default more easily. In credit decisions, particular attention should therefore be paid to the income and expenditure situation of married debtors with families. In addition, efforts are needed to check whether the necessary items have been entered accurately. However, since the AUC indicating the accuracy of the model was not high, the combination of variables and the model used need to be reconsidered.

5.2 Future Work

Future work on this research includes outlier processing and data segmentation. Since the analysis in this research used all the data, the distribution of each item became complicated, which is considered to have worsened the accuracy of the model. The complicated distributions are also thought to be the reason the repayment capacity index was not significant. For this reason, first, outliers will be excluded or transformed to eliminate distribution bias. Second, segmenting the data by age or region can address the instability of the model caused by differences in distribution. Finally, the binomial logit model will be fitted for each segment to make the debtors’ default factors clearer. However, since there are only about 14,000 records, the number of records in each segment will decrease, and the analysis results may become worse. In addition, since overfitting may occur, the appropriate number of segments will be considered carefully.
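The two preprocessing steps proposed above could look roughly like the sketch below. It is one possible reading and not a method specified in this research: clipping to empirical percentiles stands in for "excluded or converted", and the percentile levels and age boundaries are hypothetical values chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def clip_to_percentiles(values: Sequence[float],
                        lower: float = 0.01, upper: float = 0.99) -> List[float]:
    """Replace values outside the chosen empirical percentiles with the
    percentile values themselves (a simple way to tame outliers)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low_value = ordered[int(lower * last)]
    high_value = ordered[int(upper * last)]
    return [min(max(v, low_value), high_value) for v in values]


def segment_counts(ages: Sequence[int], defaults: Sequence[int],
                   boundaries: Tuple[int, ...] = (30, 45, 60)) -> Dict[int, Tuple[int, int]]:
    """Return {age band: (number of records, number of defaults)}.

    Band 0 is the youngest; the boundaries are hypothetical. Inspecting these
    counts before fitting a model per segment helps to spot segments that are
    too small to support a stable estimate.
    """
    counts: Dict[int, Tuple[int, int]] = {}
    for age, y in zip(ages, defaults):
        band = sum(age >= b for b in boundaries)
        n, d = counts.get(band, (0, 0))
        counts[band] = (n + 1, d + y)
    return counts
```

Checking the number of defaults in each band before fitting is a cheap guard against the small-sample and overfitting concerns raised above.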