The basics of creating a machine learning model


How do we create a machine learning model?
Here I am going to walk through a step-by-step approach to designing a machine learning model. You can use either R or Python; the steps to create the model remain the same.

First of all, we need to understand the business problem. That's the basic, right?! Even if you are not an expert in that business domain, it is still extremely important to understand the data and the problem we are trying to solve with this statistical analysis.

Now, let's perform some basic EDA (Exploratory Data Analysis). Here we do an initial analysis of the data. There are various commands available in R and Python that help with this, and we can also use a tool like Tableau for data mining. Remember, Tableau is not only for glossy business reports; it's a very effective tool for statistical data analysis as well. You can also perform A/B tests or chi-squared tests to verify the findings from data mining.
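In Python, a few pandas commands cover most of this first look at the data. A minimal sketch, using a tiny made-up dataset in place of real business data:

```python
import pandas as pd

# Toy dataset standing in for real business data (hypothetical values)
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "sales":  [250.0, 180.0, None, 320.0],
})

print(df.shape)           # dimensions: (rows, columns)
print(df.dtypes)          # data type of each column
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # count of missing values per column
```

With a real file you would start from `pd.read_csv(...)` instead; the same handful of commands immediately reveals column types, ranges, and missing data.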

After EDA, we might need to transform the data as required. This includes taking care of null values, correcting skewed data, filtering out incomplete records, and so on. ETL tools like SSIS can make this process repeatable and easy to debug when working with an uncleaned version of the data.

Once our cleaned data set is prepared, we are ready to start building a model!

Let's take care of the categorical variables in the dataset. As we know, mathematical and statistical models can't derive anything directly from categorical variables. We need to feed them either binary indicators or actual numerical values representing the attribute. The encoder libraries in R and Python make this dummy-variable creation a cakewalk!
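In Python, `pd.get_dummies` is one such encoder. A minimal sketch with a made-up `city` column:

```python
import pandas as pd

# Hypothetical dataset with one categorical column
df = pd.DataFrame({
    "city":  ["Pune", "Mumbai", "Pune", "Delhi"],
    "spend": [100, 250, 175, 300],
})

# One-hot encode; drop_first avoids the dummy-variable trap
# (k categories need only k-1 indicator columns)
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded.columns.tolist())
```

scikit-learn's `OneHotEncoder` does the same job when you want the encoding to live inside a model pipeline.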

Split the data into training and test sets. The proportion depends on the volume of data available, but it is usually 75:25 or 80:20.
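With scikit-learn this is one call. A sketch on toy arrays, using the 80:20 split mentioned above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix (10 samples)
y = np.arange(10)                 # toy target

# 80:20 split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```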

Decide which model we need to choose: a regression model, a classification model, or a clustering model? This is where understanding the business problem and the data will help us.
Is our output going to be a continuous value or a categorical value?
Do we want to implement a linear model or a non-linear one?

Once we have answers to the above questions, create the model using the respective R/Python libraries and fit and train the model on the training data set (the test set stays held back for evaluation).
We want to try various models and check which one fits better and which gives better predictions.
Use a grid search algorithm to help find the best model and the best parameters.
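scikit-learn's `GridSearchCV` implements this search. A minimal sketch on synthetic data, with a deliberately small and illustrative parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real problem
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative grid; real grids would cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=5
)
search.fit(X, y)  # trains one model per grid cell per fold
print(search.best_params_)
```

The same pattern works across model types, so you can run it once per candidate model family and compare the best scores.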

A confusion matrix will help us verify the model. Also perform k-fold cross-validation to make sure the deviation among the fold accuracies is acceptable.
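Both checks are a few lines in scikit-learn. A sketch on synthetic data, assuming a simple logistic regression as the model under test:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Confusion matrix on the held-out test set
print(confusion_matrix(y_test, model.predict(X_test)))

# 10-fold cross-validation: a small std means the accuracy is stable
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean(), scores.std())
```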

So at this stage we have our model, the one that gives us maximum accuracy. But how do we know which features affect the output significantly compared to the rest of the variables? Most of the time, just predicting the value of the dependent variable is not sufficient. We need to identify feature rankings so that the business can take the required action. Algorithms like RFE and Random Forest will give us these rankings.
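Both ranking approaches mentioned above are available in scikit-learn. A sketch on synthetic data with made-up feature names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
features = [f"f{i}" for i in range(6)]  # hypothetical feature names

# Random Forest: impurity-based importance score per feature
forest = RandomForestClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(features, forest.feature_importances_),
                key=lambda p: p[1], reverse=True)
print(ranked)

# RFE: recursively eliminates the weakest features; rank 1 = selected
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=3).fit(X, y)
print(dict(zip(features, rfe.ranking_)))
```

Either output translates directly into the "which levers matter most" answer the business is looking for.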

And we are done! 
We just have to make sure we present the findings effectively. The graphs we generate should be self-explanatory, giving crisp and clear answers to the business problem. At the same time, don't forget that the data can reveal something that is a totally new perspective. So keep an open mind! Listen to the data. And we are ready to tell the story we discovered!

