Feature Extraction using Response code on Customer Transaction Prediction

Prassena Kannan
Feb 1, 2020 · 7 min read

After reading this post you will know:

  1. What is Feature Extraction?
  2. What is Response code?
  3. Python code to compute response code
  4. Model Selection
  5. Why LightGBM is used?
  6. BayesianOptimization for LightGBM
  7. Training LightGBM model

1. What is Feature Extraction?

Feature Extraction, also known as feature creation, is the process of constructing new features from existing data to train a machine learning model. This step can be more important than the actual modeling, because a machine learning algorithm only learns from the data we give it, and creating features that are relevant to the task is absolutely crucial. A few simple feature extraction techniques are computing the mean, median, mode, variance, standard deviation, kurtosis and so on of a feature or of each observation.

2. What is Response code?

Response code is a powerful feature extraction technique: it encodes the probability of a label given a feature value. Response code is easier to understand with an example, so let's consider the sample data below, which I created on my own to explain response code in a simpler way.
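(The original table was embedded as an image; the table below is a reconstruction consistent with all the counts used in the calculations that follow.)

| Age | Weight | Target |
|-----|--------|--------|
| 30  | 80     | 1      |
| 30  | 80     | 1      |
| 20  | 80     | 1      |
| 30  | 70     | 1      |
| 20  | 70     | 1      |
| 30  | 70     | 0      |
| 20  | 70     | 0      |
| 30  | 60     | 0      |
| 20  | 60     | 0      |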

Age and Weight are the features in the above table. For each unique value in a feature we need to calculate the response code. Let's take the age feature; the unique values in this feature are 20 and 30. To calculate the response code for age 30 with respect to target 1, find the probability of target 1 given that age is 30.

Number of times the age feature has the value 30 = 5
Number of times the target is 1 when the age feature is 30 = 3
Probability of target 1 given that age is 30 = 3/5

The same way we can calculate for age 20: it will be 2/4, as label 1 occurred 2 times out of the 4 rows where the age feature is 20.

Now let's calculate for the weight feature; there are 3 unique values in the weight feature: 60, 70 and 80. For 60 it will be 0/2, that is 0, because the label is never 1 when the weight is 60. For 70 the response code is 2/4, and for 80 it is 3/3.

In the below snippet, AGE_r and Weight_r are the response code features which we are adding to the original data.
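(The original snippet was an image; the table below reconstructs it from the values computed above: 3/5 = 0.6 for age 30, 2/4 = 0.5 for age 20, and 0.0, 0.5, 1.0 for weights 60, 70, 80.)

| Age | Weight | Target | AGE_r | Weight_r |
|-----|--------|--------|-------|----------|
| 30  | 80     | 1      | 0.6   | 1.0      |
| 30  | 80     | 1      | 0.6   | 1.0      |
| 20  | 80     | 1      | 0.5   | 1.0      |
| 30  | 70     | 1      | 0.6   | 0.5      |
| 20  | 70     | 1      | 0.5   | 0.5      |
| 30  | 70     | 0      | 0.6   | 0.5      |
| 20  | 70     | 0      | 0.5   | 0.5      |
| 30  | 60     | 0      | 0.6   | 0.0      |
| 20  | 60     | 0      | 0.5   | 0.0      |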

Response code calculated with respect to target 1

The same way we can calculate the response code with respect to target 0: find the probability of target 0 given that age is 30, or 1 - (probability of target 1 given that age is 30).

Number of times the target is 0 when the age feature is 30 = 2
Number of times the age feature has the value 30 = 5
Probability of target 0 given that age is 30 = 2/5
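For the same sample data, the remaining target-0 response codes are: age 20 → 2/4 = 0.5, weight 60 → 2/2 = 1.0, weight 70 → 2/4 = 0.5, and weight 80 → 0/3 = 0.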

Response code calculated with respect to target 0

Since this is a binary classification problem, it is enough to calculate the response code for any one class. If it is a multiclass classification problem, we need to calculate (number of classes - 1) response codes for each feature, i.e. if we have 4 labels then we need to calculate 4 - 1 = 3 response codes for each feature.

3. Python code to compute response code for 2 class classification problem

Below is a code snippet of the implementation of response code in Python.
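The original snippet was embedded as an image; here is a minimal sketch matching the description below (the function names and the 'target' column name are my own assumptions):

```python
import pandas as pd

def response_code(df, feature, target='target'):
    # P(target == 1 | feature value): the mean of a 0/1 target within
    # each group of rows sharing the same feature value
    probs = df.groupby(feature)[target].mean()
    return df[feature].map(probs)

# Iterate over every feature and add its response code as a new column
def add_response_codes(df, target='target'):
    for feature in [c for c in df.columns if c != target]:
        df[feature + '_r'] = response_code(df, feature, target)
    return df

# Example on the toy data above: AGE_r for age 30 comes out as 3/5 = 0.6
toy = pd.DataFrame({'AGE': [30, 30, 20, 30, 20, 30, 20, 30, 20],
                    'Weight': [80, 80, 80, 70, 70, 70, 70, 60, 60],
                    'target': [1, 1, 1, 1, 1, 0, 0, 0, 0]})
print(add_response_codes(toy))
```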

Code to Compute Response Code

The function takes the data frame and the feature for which we need to compute the response code, and returns the response code for that particular feature.

In the above code we iterate over every feature in the data frame, compute its corresponding response code, and add a new column to the original data frame so that we can train on the new additional features. Please find the whole code in my GitHub repository by clicking here.

You can find the EDA for the problem here. I am not covering EDA in this blog, as my main focus is to explain how response code helps in improving model performance.

Now that we are done with computing the response code, let's train the model with the additional new features and see the model performance.

4. Model Selection

There are a lot of machine learning models out there for classification tasks. Instead of trying all the models and comparing their performance, we can look at the data we have and judge what type of model will perform well. As the dimensionality of the data set is not very high, tree-based models will perform well compared to linear models like logistic regression or SVM. There are many variants of tree-based models: decision trees, random forests, bagging, boosting, etc. Let's try LightGBM.

5. LightGBM

I'm going with LightGBM, a boosting model with decision trees as base learners, because it has faster training speed, higher efficiency and lower memory usage. The LightGBM documentation is well written and easy to understand.

The main challenge comes when we need to set values for the hyperparameters; LightGBM has a lot of hyperparameters to be set for good results. There are 3 popular ways to find the best hyperparameters: GridSearchCV, RandomizedSearchCV and Bayesian optimization. In this blog we will see how to get optimal parameters for LightGBM using BayesianOptimization.

6. BayesianOptimization for LightGBM

This section is split into 3 small sub-topics for better understanding:
1. Installing Bayesian global optimization library
2. Loading the data
3. Function to be optimized (LightGBM)

6.1 Installing Bayesian global optimization library

In order to use BayesianOptimization, it is mandatory to have the bayesian-optimization package pre-installed. The one line of code below will do the installation for you.
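The package from the fmfn/BayesianOptimization repository is published on PyPI as bayesian-optimization, so in a notebook cell:

```python
!pip install bayesian-optimization
```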

6.2 Loading Data

The data I am using here is the customer transaction prediction data set from Kaggle. You can find the data here. Below is an overview of the raw data, where we have 202 columns, of which 200 are independent variables named 'var_*', one is the target column and one is an ID_code.

The raw data was passed through the response code function we created, and the response-code-augmented data is stored in a CSV so it can be used for training different models in different kernels. The below snippet shows loading of the data with the extracted features.
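A sketch of the loading step, assuming the augmented data was saved to a file named train_with_response_code.csv (the file name is an assumption):

```python
import pandas as pd

# Load the response-code-augmented training data saved earlier
df = pd.read_csv('train_with_response_code.csv')
print(df.shape)  # expected: (200000, 402) - 400 features + ID_code + target
```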

We have 200,000 data points with 400 independent features and one target variable in the new dataframe. The variables named '*_r' are the newly added response code features.

6.3 Function to be optimized (LightGBM)

In this function we define the parameters which need to be tuned. In our case I am tuning these 8 parameters: lambda_l1, lambda_l2, learning_rate, max_bin, max_depth, min_data_in_leaf, min_gain_to_split and num_leaves, as I consider these highly important parameters. If the reader wants to tune other parameters, they can add them to the function. We split the data using StratifiedKFold, which gives two sets of indices as output that can be used to split the data into train data and validation data. We set the AUC score as the return value, which we want to maximize. The best parameters will be the values which maximize the AUC score.
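A sketch of such a function, consistent with the description above; the dataframe name df, the fold count, the number of boosting rounds and the fixed settings are assumptions:

```python
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

X = df.drop(columns=['ID_code', 'target']).values
y = df['target'].values

def lgb_evaluate(lambda_l1, lambda_l2, learning_rate, max_bin, max_depth,
                 min_data_in_leaf, min_gain_to_split, num_leaves):
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'lambda_l1': lambda_l1,
        'lambda_l2': lambda_l2,
        'learning_rate': learning_rate,
        # BayesianOptimization proposes floats, so cast the
        # integer-valued parameters back to int
        'max_bin': int(max_bin),
        'max_depth': int(max_depth),
        'min_data_in_leaf': int(min_data_in_leaf),
        'min_gain_to_split': min_gain_to_split,
        'num_leaves': int(num_leaves),
        'verbosity': -1,
    }
    # StratifiedKFold yields (train indices, validation indices)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    train_idx, valid_idx = next(skf.split(X, y))
    train_set = lgb.Dataset(X[train_idx], label=y[train_idx])
    model = lgb.train(params, train_set, num_boost_round=500)
    preds = model.predict(X[valid_idx])
    # Return the validation AUC; this is what the optimizer maximizes
    return roc_auc_score(y[valid_idx], preds)
```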

After defining the function, we need to pass values for its parameters. The values are sent as ranges, not as single values; each range indicates the lower and upper bound that parameter can take. Below is the code where we set the values.
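A sketch of the bounds dictionary; the exact ranges from the original snippet are not recoverable, so these bounds are illustrative assumptions:

```python
pbounds = {
    'lambda_l1': (0, 5),
    'lambda_l2': (0, 5),
    'learning_rate': (0.005, 0.3),
    'max_bin': (32, 512),
    'max_depth': (3, 15),
    'min_data_in_leaf': (20, 200),
    'min_gain_to_split': (0, 1),
    'num_leaves': (16, 256),
}
```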

Now we call BayesianOptimization, which takes the function to be optimized and the bounds of the values the function takes. There are two variables, init_points and n_iter, which define the total number of iterations used to come up with the best parameter values.
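Roughly:

```python
from bayes_opt import BayesianOptimization

optimizer = BayesianOptimization(
    f=lgb_evaluate,   # the function to be optimized, defined above
    pbounds=pbounds,  # the parameter bounds set above
    random_state=42,
)
```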

Everything is in place; now we need to call the maximize function from the Bayesian optimization framework to maximize the function's return value. As soon as we call maximize, it fits on the train data and validates on the validation data. After completing all iterations, we get the best parameters, which gave the max AUC score.
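A sketch of the call (the iteration counts are assumptions):

```python
# init_points: random exploration steps; n_iter: Bayesian optimization steps
optimizer.maximize(init_points=5, n_iter=20)
print(optimizer.max)  # best AUC found and the parameters that produced it
```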

Calling maximize function to start optimization

Below are the values of the parameters which gave the max AUC score. Note that the values are of float type; that's because BayesianOptimization searches over continuous values in the ranges we gave. We can convert them to integers and use them for the final training.

Best parameters, which gave the max AUC

7. Training LightGBM model

There is not much to do now: just set the best parameters we got through BayesianOptimization and fit the model. I am fitting the model with 8 StratifiedKFold splits, which is like training 8 models on different data sets, and saving all 8 models for future use. Run the below code and take a break; when you come back, exciting results will be waiting for you.
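A sketch of the final training loop; the boosting round count, early-stopping patience and file names are assumptions, and X, y and optimizer come from the snippets above:

```python
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

# Pull the tuned parameters and cast the integer-valued ones back to int
best_params = {k: int(v) if k in ('max_bin', 'max_depth', 'min_data_in_leaf',
                                  'num_leaves') else v
               for k, v in optimizer.max['params'].items()}
best_params.update({'objective': 'binary', 'metric': 'auc', 'verbosity': -1})

models = []
skf = StratifiedKFold(n_splits=8, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    train_set = lgb.Dataset(X[train_idx], label=y[train_idx])
    valid_set = lgb.Dataset(X[valid_idx], label=y[valid_idx])
    model = lgb.train(best_params, train_set, num_boost_round=2000,
                      valid_sets=[valid_set],
                      callbacks=[lgb.early_stopping(100)])
    models.append(model)
    model.save_model(f'lgb_fold_{fold}.txt')  # keep all 8 models for later
```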

Setting best param and training

The AUC score I got from training on the response code features is 0.90572, which is way better than the 0.89437 I got from training on the data without the response code features.

You can find both of my models, trained without and with the response code features, here.

Confusion Matrix

The confusion matrix throws some light on how well my model predicts class 0 and class 1. The diagonal elements are larger than the off-diagonal elements, which implies that the true negative and true positive counts are higher than the false negative and false positive counts.
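A sketch of how such a matrix can be produced from the saved fold models (averaging the fold predictions and thresholding at 0.5 are my assumptions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Average the 8 fold models' predicted probabilities, then threshold
preds = np.mean([m.predict(X) for m in models], axis=0)
print(confusion_matrix(y, (preds > 0.5).astype(int)))
```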

Thanks for your time; I hope the post is helpful. In case of any clarification, please feel free to reach out to me. I will catch up soon with another blog.

References

https://en.wikipedia.org/wiki/Feature_extraction
https://lightgbm.readthedocs.io/en/latest/
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
https://github.com/fmfn/BayesianOptimization
https://www.kaggle.com/fayzur/lgb-bayesian-parameters-finding-rank-average

Follow me for more ML updates

GitHub : https://github.com/prassena
LinkedIn : https://www.linkedin.com/in/prassena-k-738367140/
