Caravan Customer Identification

Tommy Lam
Apr 22, 2019

This study aims to identify potential caravan insurance customers. It begins with exploratory data analysis, followed by building several models to predict potential customers.

Programming Language: R
Methodologies: Random Forest, KNN Algorithm, Logistic Regression Classification

Read the full project with code here

Exploratory Data Analysis

Data exploration is the early stage of data analysis. It lets users explore the data and get a brief understanding of what it looks like, which in turn informs the design of further analysis at a later stage.

During data exploration, basic graphs of the demographics and correlations are used to build a basic understanding of the data.

Age distribution

The 40-50 age group is the largest, followed by 30-40 and 50-60.

Age split among caravan users and non-users

The following graph shows the age distribution split between caravan users and non-users.

Average age of caravan users and non-users:

User:     45.1
Non-User: 44.9

From the graph and table, we can see that the age distribution is similar between caravan users and non-users (average ages of 45.1 and 44.9), which indicates that age is not a significant factor in this study.
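As a rough sketch of how such a group comparison could be computed in R, the snippet below averages age per group with tapply. The data is simulated and the column names age and caravan are assumptions made for illustration, not the names used in the original project.

```r
# Simulated stand-in data; `age` and `caravan` are hypothetical column names.
set.seed(1)
customers <- data.frame(
  age     = round(rnorm(200, mean = 45, sd = 12)),
  caravan = rep(c(0, 1), times = c(180, 20))  # 0 = non-user, 1 = user
)

# Mean age within each group.
mean_age <- tapply(customers$age, customers$caravan, mean)
print(round(mean_age, 1))

# Gap between the two group means; a small gap suggests age separates the groups poorly.
age_gap <- unname(mean_age["1"] - mean_age["0"])
print(round(age_gap, 1))
```

A difference in means is only a first check; comparing the full distributions, as the graph does, gives a more complete picture.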

Average Income

The graph below shows the distribution of average income for caravan users and non-users. Caravan users generally have a higher income than non-users.

Average Household Size

This graph shows that caravan customers tend to have a larger household size.

Now, we will focus on the caravan users.

Customer main type

The following graph shows the share of each customer main type among caravan users:

The largest customer type is 'Family with grown ups'. 'Driven Growers' and 'Average Family' are second and third respectively.

Correlation

In this section, we will explore the correlation between caravan insurance usage and the rest of the variables, so as to discover the important factors that drive customers to use caravan insurance.

The following table shows the variables most strongly correlated with being a caravan insurance user.

V47 (Contribution car policies), V43 (Purchasing power class), V44 (Contribution private third party insurance) and V68 (Number of car policies) are the top correlated variables.

This suggests that customers with car policies show a strong relation to caravan usage. Those with a higher contribution to private third party insurance are also more related to caravan usage.
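A ranking like the one above could be produced along the following lines. This is a minimal sketch on simulated data: the predictor names (car_contribution, purchasing_power, noise) are hypothetical placeholders rather than the real dataset columns, and the target is generated so that it depends on car_contribution.

```r
set.seed(42)
n <- 500

# Hypothetical predictors standing in for the real variables.
car_contribution <- rpois(n, 3)
purchasing_power <- sample(1:8, n, replace = TRUE)
noise            <- rnorm(n)

# Simulated target whose probability rises with car_contribution.
caravan <- rbinom(n, 1, plogis(-3 + 0.5 * car_contribution))
customers <- data.frame(car_contribution, purchasing_power, noise, caravan)

# Correlation of every predictor with the target.
predictors <- setdiff(names(customers), "caravan")
cors <- sapply(customers[predictors], function(x) cor(x, customers$caravan))

# Sort by absolute correlation, strongest first.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by absolute value keeps strongly negative correlations near the top as well, which matters when a variable lowers the chance of being a customer.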

Summary

Caravan customers tend to have a higher income and a larger household. They would be the target group to focus on in future marketing planning.

Also, according to the correlation analysis, we should pay more attention to customers with a higher contribution to car policies, since they are more likely to have caravan insurance.

Predictive Modelling

1. Logistic regression classification

The logistic regression model outputs a predicted probability between 0 and 1 for each customer. The following graph shows the distribution of the predictions from the logistic regression model applied to the testing data.
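For readers who want to reproduce this step, here is a minimal sketch of fitting a logistic regression with glm and scoring held-out data. The data is simulated, and the variable names (x1, x2, y) and the 700/300 split are assumptions for illustration only.

```r
set.seed(123)
n <- 1000

# Simulated predictors and a rare binary target (hypothetical stand-in data).
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-2.5 + 1.0 * x1))
dat <- data.frame(y, x1, x2)

# Simple train/test split.
train_idx <- sample(n, size = 700)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit the logistic regression on the training data.
fit <- glm(y ~ ., data = train, family = binomial)

# type = "response" returns probabilities; type = "link" returns log-odds.
prob     <- predict(fit, newdata = test, type = "response")
log_odds <- predict(fit, newdata = test, type = "link")

print(summary(prob))
print(range(log_odds))
```

The probabilities always fall between 0 and 1, while the log-odds are unbounded, so it is worth checking which scale a prediction is on before choosing a decision boundary.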

In the next step, since we need to predict the potential customers, only 0 or 1 is allowed as a prediction on the testing data. Therefore, we need to define a decision boundary that converts each predicted value into 0 or 1.

In order to identify the decision boundary, the prediction results are calculated at each boundary level.

From the graph, in order to balance both accuracy and specificity, 0.075 would be the optimum decision boundary for logistic regression classification.
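The boundary search described above can be sketched as a loop over candidate thresholds. The example below uses simulated probabilities and labels, and the helper evaluate_threshold is a hypothetical function written for this illustration. It reports accuracy together with the share of each class that is correctly identified; which of those two rates is called "specificity" depends on which class is treated as the positive one.

```r
set.seed(7)
n <- 2000

# Simulated labels (about 6% customers) and hypothetical predicted probabilities.
actual <- rbinom(n, 1, 0.06)
prob   <- plogis(rnorm(n, mean = ifelse(actual == 1, -2, -3), sd = 1))

# Hypothetical helper: metrics for one decision boundary.
evaluate_threshold <- function(threshold, prob, actual) {
  predicted <- as.integer(prob >= threshold)
  c(threshold           = threshold,
    accuracy            = mean(predicted == actual),
    recall_customer     = sum(predicted == 1 & actual == 1) / sum(actual == 1),
    recall_non_customer = sum(predicted == 0 & actual == 0) / sum(actual == 0))
}

# Evaluate a grid of candidate boundaries.
thresholds <- seq(0.025, 0.5, by = 0.025)
results <- t(sapply(thresholds, evaluate_threshold, prob = prob, actual = actual))
print(round(results, 3))

# Confusion matrix at one chosen boundary.
chosen    <- 0.075
predicted <- factor(as.integer(prob >= chosen), levels = c(0, 1))
conf_mat  <- table(Predicted = predicted, Actual = factor(actual, levels = c(0, 1)))
print(conf_mat)
```

With a rare positive class, accuracy alone favours predicting 0 for everyone, which is why a second metric is tracked alongside it when picking the boundary.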

Here is the confusion matrix with a decision boundary of 0.075:

Stepwise Logistic regression

Stepwise regression is a method for selecting a subset of variables for the regression. The expectation is that a model built on the selected variables can give a better result.
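One common way to do this in R is the step function from the stats package, which adds or drops variables based on AIC. The sketch below assumes that approach; the original project may have used a different function or criterion, and the data and variable names here are simulated placeholders.

```r
set.seed(2019)
n <- 800

# Simulated data: only x1 and x2 actually influence the target.
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-2 + 1.2 * dat$x1 - 0.8 * dat$x2))

# Start from the full model with every predictor.
full_fit <- glm(y ~ ., data = dat, family = binomial)

# Stepwise selection by AIC; trace = 0 suppresses the step-by-step log.
step_fit <- step(full_fit, direction = "both", trace = 0)

print(formula(step_fit))
print(c(full = AIC(full_fit), stepwise = AIC(step_fit)))
```

Because step only accepts moves that lower the AIC, the selected model never has a higher AIC than the starting model, although that does not guarantee better predictions on new data.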

A similar table is used to identify the decision boundary.

The result is similar to the non-stepwise logistic regression: 0.075 is suggested as the optimal decision boundary to balance both accuracy and specificity.

KNN Classifier

KNN (k-nearest neighbors) is one of the most popular machine learning algorithms. The 'K' refers to the number of nearest neighbors that vote on the class of each testing data point.
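To make the voting idea concrete, here is a minimal sketch using knn from the class package, first with K = 3 and then across several values of K. The two-dimensional data is simulated, and the helper make_data and the labels "user" and "non_user" are hypothetical names introduced for this example.

```r
library(class)  # recommended package bundled with standard R installations

set.seed(10)

# Hypothetical helper: two overlapping groups of points in two dimensions.
make_data <- function(n_per_class) {
  x <- rbind(matrix(rnorm(2 * n_per_class, mean = 0),   ncol = 2),
             matrix(rnorm(2 * n_per_class, mean = 1.5), ncol = 2))
  y <- factor(rep(c("non_user", "user"), each = n_per_class))
  list(x = x, y = y)
}

train <- make_data(100)
test  <- make_data(50)

# Each test point takes the majority class of its 3 nearest training points.
pred_k3 <- knn(train = train$x, test = test$x, cl = train$y, k = 3)
print(table(Predicted = pred_k3, Actual = test$y))

# Repeat for several values of K and record the accuracy of each.
k_values <- c(1, 3, 5, 7, 9)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train$x, test = test$x, cl = train$y, k = k)
  mean(pred == test$y)
})
print(data.frame(k = k_values, accuracy = round(accuracy, 3)))
```

Odd values of K are used here so that a two-class vote cannot end in a tie among the neighbors.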

Here is an example of using KNN classifier when k = 3:

The graph below shows the model performance under different values of K.

As we can see, after K = 3, the accuracy stays almost constant while the specificity drops significantly. This means the model cannot identify the potential customers efficiently beyond K = 3. Therefore, K = 3 would be the optimal value for this KNN classifier. Here is the summary of KNN classification with K = 3: