Predicting Employee Attrition using Machine Learning
Analyzing the causes of employee attrition and building an ML model for future forecasts
Employee attrition is the loss of employees through means other than dismissal, where the departing employees are not replaced. Natural causes such as retirement, resignation and health reasons all fall under employee attrition.
Apart from the natural causes of employee attrition, there are other causes that could influence an employee to leave their current employment. A lot of these reasons are related to job satisfaction and personal fulfilment.
When employees feel dissatisfied with their working conditions or feel unaccomplished in their jobs, they are more likely to leave. Employers are often left in the lurch by such decisions because there are usually no plans to replace these staff.
In this project, I used machine learning to predict whether an employee will leave the company. This is a supervised learning problem because the data has a column that shows which employees have left the company.
The data used for this project is from Kaggle and contains employee details such as age, monthly income and number of years in the company. The data can be accessed here.
The dataset contains the details of members of staff from three different departments — Human Resources, Research and Development and Sales.
Only 16.12% of the staff have left the company, which is a small fraction of the company's workforce.
Although the attrition rate is low, male staff appear to be the most affected. The graph on the left shows that the members of staff who have left earn less than 1M, which suggests that income is a determinant of employee attrition.
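As a rough sketch of how figures like these can be computed with pandas, the snippet below works on a tiny made-up table rather than the Kaggle data. The "Yes"/"No" values in the Attrition column are an assumption about how the column is coded.

```python
import pandas as pd

# Hypothetical miniature dataset; the real data has many more rows and columns.
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No", "No", "No"],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Male", "Female", "Male"],
})

# Share of employees who left, as a percentage of all staff.
attrition_rate = (df["Attrition"] == "Yes").mean() * 100

# Attrition rate within each gender, so groups of different sizes stay comparable.
rate_by_gender = (
    df.assign(left=df["Attrition"] == "Yes")
      .groupby("Gender")["left"]
      .mean() * 100
)

print(f"Overall attrition: {attrition_rate:.2f}%")
print(rate_by_gender.round(2))
```

Comparing rates within each group, rather than raw counts, avoids mistaking a larger group for a more affected one.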
Supervised and Unsupervised Learning
The difference between supervised and unsupervised learning is that, in supervised learning, we know what the output should look like, so we build a model that maps each input to its expected output. The mapping is a function learned by studying the relationship between the inputs and the known outputs.
In unsupervised learning, we do not have an expected output. Instead, the data is studied with the intent of discovering its underlying structure.
In the data used in this project, in addition to the various input columns, we have an output column, Attrition, which shows whether an employee has left the company or not.
When training a machine learning model, there are various algorithms that can be used, depending on the type of data.
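To make the distinction concrete, here is a minimal sketch on made-up points (not the attrition data). The supervised model is given the known labels, while the clustering model only sees the inputs. KMeans is used purely as an illustrative unsupervised algorithm and is not part of this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Toy data (hypothetical): two well-separated groups of points.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])  # known output, e.g. stayed (0) or left (1)

# Supervised: the model is fitted on the inputs AND the known output.
supervised = RandomForestClassifier(random_state=0).fit(X, y)
predicted = supervised.predict([[1.1, 1.0], [5.1, 5.0]])

# Unsupervised: the model sees only the inputs and groups them on its own.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = unsupervised.labels_

print(predicted)
print(clusters)
```

The cluster numbers carry no meaning of their own. Unlike the supervised labels, they only say which points were grouped together.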
Some of the popular machine learning algorithms used are Naïve Bayes, Decision Tree, Linear regression, Logistic regression, K-nearest neighbors, Learning vector quantization, Support vector machines and Random forest classifiers.
For this project, I applied the random forest classifier. I also experimented with the Linear regression and Decision Tree algorithms, but I chose the random forest classifier because it gave me the highest accuracy for my prediction.
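One possible way to compare candidate algorithms is cross-validated accuracy. The sketch below uses synthetic data from make_classification as a stand-in, so the scores it prints say nothing about the attrition dataset. The 5-fold setup is an assumption and not necessarily how the comparison in this project was run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), with roughly 16% positives.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=0
)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Averaging over several folds makes the comparison less dependent on one particular train/test split.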
Steps to creating the model
- Import the data: Since I worked with pandas, I used the pandas library to import my data
import pandas as pd
df = pd.read_csv("file.csv")
- Clean the data: The accuracy of your machine learning model is dependent on the quality of your data. For data to be of high quality, it must be properly cleaned. Cleaning operations include taking out duplicate values, ensuring there are no null values, using the right data types for each column etc.
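A minimal sketch of these cleaning operations is shown below on a made-up table. The values are hypothetical, and dropping rows with missing values is only one option (filling them in is another).

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing value
# and a numeric column stored as text.
raw = pd.DataFrame({
    "Age": ["41", "41", "37", "29"],
    "MonthlyIncome": [5993.0, 5993.0, None, 2090.0],
    "Attrition": ["Yes", "Yes", "No", "No"],
})

# Take out duplicate rows.
cleaned = raw.drop_duplicates().copy()

# Use the right data type for each column.
cleaned["Age"] = cleaned["Age"].astype(int)

# Count the null values per column, then drop the rows that contain any.
missing_per_column = cleaned.isnull().sum()
cleaned = cleaned.dropna()

print(missing_per_column)
print(cleaned)
```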
- Preprocess the data: This includes selecting the features relevant to your research and dropping the irrelevant ones.
feature_names = ['Age', 'Department', 'Gender', 'MaritalStatus', 'EnvironmentSatisfaction', 'HourlyRate', 'JobSatisfaction', 'MonthlyIncome', 'PercentSalaryHike', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole']
- Also, some of the chosen columns in the dataset will be of string or Boolean types. These types are not ideal for training because most models work with numeric data. Therefore, these columns (if they are to be used in your prediction) have to be converted to a numeric data type through encoding.
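from sklearn.preprocessing import LabelEncoder
# df_copy is assumed to be a working copy of the imported data, e.g. df_copy = df.copy()
labelencoder = LabelEncoder()
# Encode the target column as integers (one integer per class)
df_copy['Attrition_code'] = labelencoder.fit_transform(df_copy['Attrition'])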
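from sklearn.preprocessing import OneHotEncoder
# creating instance of one-hot-encoder; sparse_output=False returns a dense array
# (scikit-learn versions before 1.2 name this parameter sparse)
OHenc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)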
From the above, I used two styles of encoding, label encoding and one-hot encoding, to encode various features.
- Split the data into two: the training data and the test data. The essence of this is that we train the machine learning model on one part of the data and use the other part to test the accuracy of the model's predictions. A common split ratio is 80/20: 80% for training and the remaining 20% for testing. A larger training set is preferred because the more data the model learns from, the more accurate its predictions tend to be.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, random_state=0)
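The effect of the split can be checked on stand-in arrays, as in the sketch below. The stratify argument is an optional addition that is not in the code above. It keeps the share of leavers similar in both parts, which may help when one class is as small as it is here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 100 rows, 16 of them labelled as having left.
X = np.arange(200).reshape(100, 2)
Y = np.array([1] * 16 + [0] * 84)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0, stratify=Y
)

print(len(X_train), len(X_test))    # sizes of the two parts
print(Y_train.mean(), Y_test.mean())  # share of leavers in each part
```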
- Create the Model:
from sklearn.ensemble import RandomForestClassifier
clr = RandomForestClassifier()
I used the RandomForest classifier as earlier stated.
- Train the Model: This involves fitting the model to the training data, that is, calling the model's fit method with the training features and their labels.
- Make the predictions:
prediction = clr.predict(OH_X_test)
- Evaluate the model:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(Y_test, prediction)
The accuracy of this model is 0.8299 (83%). I tried two other machine learning algorithms, as mentioned earlier: DecisionTree and Linear regression. The DecisionTree had an accuracy of about 76%, while the Linear regression had an error rate of 0.36; hence my decision to stick with the RandomForest classifier.
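The train, predict and evaluate steps can be sketched end to end as follows. This runs on synthetic stand-in data, so its numbers are unrelated to the result above. The majority-class baseline, recall and confusion matrix are additions not used in this project. They are included because, with only about 16% of staff leaving, a model that always predicts "stayed" would already score close to 84% accuracy on a test set with the same balance, so accuracy alone can be hard to interpret.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical) with roughly 16% positives.
X, Y = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0, stratify=Y
)

# Train: fit the classifier on the training features and labels.
clr = RandomForestClassifier(random_state=0)
clr.fit(X_train, Y_train)

# Predict on the held-out test features.
prediction = clr.predict(X_test)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, Y_train)
baseline_accuracy = accuracy_score(Y_test, baseline.predict(X_test))

# Evaluate: accuracy, recall on the positive class, and the confusion matrix.
accuracy = accuracy_score(Y_test, prediction)
recall = recall_score(Y_test, prediction)
matrix = confusion_matrix(Y_test, prediction)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy:    {accuracy:.3f}")
print(f"recall (leavers):  {recall:.3f}")
print(matrix)
```

Recall on the positive class shows how many of the actual leavers the model catches, which is usually the question an employer cares about.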
The complete implementation of this exercise can be viewed on GitHub.
Kindly reach out to me on Twitter