PROJECT REPORT
Employee Attrition Predictive Model
Submitted towards the partial fulfillment of the criteria for award of PGA by
Imarticus
Submitted By:
Eldin B. Joseph
Andrew Kurian Jacob
Vivekanandan R.
Amit Kumar
Course and Batch: DSP Batch 12 (2018-19)
Abstract
Keywords
Disclaimer: *Data shared by the customer is confidential and sensitive; it must not be used for any purpose other than the capstone project submission for the PGA. The name and demographic details of the enterprise are kept confidential at the owner's request.
Acknowledgements
We are using this opportunity to express our gratitude to everyone who supported us
throughout the course of this group project. We are thankful for their aspiring guidance,
invaluably constructive criticism and friendly advice during the project work. We are
sincerely grateful to them for sharing their truthful and illuminating views on a number
of issues related to the project.
Further, we were fortunate to have Rahul Sarkar as our mentor. He readily shared his
immense knowledge of data analytics and guided us in a manner that enhanced our
data skills.
We wish to thank all the faculty, as this project utilized knowledge gained from every
course that formed the PGA program.
We certify that the work done by us for conceptualizing and completing this project is
original and authentic.
Date: March 28, 2019 Eldin B. Joseph
Place: Bengaluru Andrew Kurian Jacob
Vivekanandan R.
Amit Kumar
Certificate of Completion
I hereby certify that the project titled “Employee Attrition” was undertaken and
completed under my supervision by Eldin B. Joseph, Andrew Kurian Jacob,
Vivekanandan R. and Amit Kumar from the batch of DSP (Oct 2018).
Mentor:
Date: March 28, 2019
Place – Bengaluru
Table of Contents
Abstract................................................................................................................................................2
Acknowledgements..............................................................................................................................2
Certificate of Completion.....................................................................................................................3
CHAPTER 1: INTRODUCTION.............................................................................................................5
1.1 Title & Objective of the study..............................................................................................5
1.2 Need of the Study.................................................................................................................5
1.3 Business or Enterprise under study....................................................................................5
1.4 Business Model of Enterprise..............................................................................................5
1.5 Data Sources.........................................................................................................................5
1.6 Tools & Techniques....................................................6
CHAPTER 2: DATA PREPARATION AND UNDERSTANDING.............................................................6
2.1 Phase I – Data Extraction and Cleaning:...................................................................................6
2.2 Phase II - Feature Engineering..................................................................................................6
2.3 Data Dictionary:.........................................................................................................................8
2.4 Exploratory Data Analysis:......................................................................................................12
CHAPTER 3: FITTING MODELS TO DATA.........................................................................................14
CHAPTER 4: KEY FINDINGS..............................................................................................................20
CHAPTER 5: RECOMMENDATIONS AND CONCLUSION..................................................................21
CHAPTER 6: REFERENCES................................................................................................................22
List of Figures
1. Attrition rate in dependent variable
2. Distribution by employee department
3. Distribution of satisfaction levels
4. ROC curve
5. Decision Tree
6. Random forest model
7. Significance of different variables with attrition
8. Model comparison
List of Tables
1. Data Dictionary
CHAPTER 1: INTRODUCTION
1.1 Title & Objective of the study
‘Employee Attrition’ refers to the rate of employee turnover in an organization. This study aims to
identify the important factors that determine whether an employee continues to work in the
organization.
1.2 Need of the Study
Understanding why employees may want to leave an organization greatly helps employers in the
hiring process and in improving working conditions for employees. It also helps in building an
employee-friendly working atmosphere.
1.3 Business or Enterprise under study
We have received the IBM HR Analytics employee dataset, which records many parameters such as
income, years worked, and distance travelled.
1.4 Business Model of Enterprise
IBM is a leading diversified technology company with a broad range of business
offerings across IT hardware, software, and services segments. IBM combines its
broad mix of capabilities to provide integrated solutions and platforms to its clients.
IBM was founded in 1911. IBM operates in more than 175 countries across the globe
and has over 400,000 employees. IBM has five business segments: Global
Technology Services (GTS), Global Business Services (GBS), Software, Systems
Hardware, and Global Financing.
1.5 Data Sources
The data was procured from an Excel file containing records of multiple employees from the IBM HR
Analytics dataset, covering their working conditions and attrition.
1.6 Tools & Techniques
Tools: We use R, an analytical software environment used for building models for descriptive,
prescriptive and predictive analytic approaches.
Techniques: We build multiple classification models and use them to predict on a sample dataset. The
models are compared on the basis of model parameters and the best one is selected.
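The later chapters refer to `traindata` and `testdata` without showing how they were created. A minimal sketch of a typical split, assuming a 70/30 partition with the caret package (the actual ratio and seed are not stated in this report):

```r
library(caret)

set.seed(123)  # illustrative seed for reproducibility
# stratified split preserving the Attrition class proportions
idx <- createDataPartition(fulldata$Attrition, p = 0.70, list = FALSE)
traindata <- fulldata[idx, ]
testdata  <- fulldata[-idx, ]
```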
CHAPTER 2: DATA PREPARATION AND UNDERSTANDING
One of the first steps we took was to outline the sequence of steps we would follow
for our project. Each of these steps is elaborated below.
2.1 Phase I – Data Extraction and Cleaning:
Missing Value Analysis and Treatment:
There were no missing values.
Handling Outliers:
Outliers in the numeric columns were capped at their respective upper
percentile values.
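The capping code is not shown in the report; a minimal sketch, assuming a 95th-percentile cap (the actual columns and percentile used are illustrative assumptions):

```r
# Illustrative sketch: cap values above an upper percentile at that percentile
cap_outliers <- function(x, probs = 0.95) {
  cap <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(x, cap)  # values above the cap are replaced by the cap value
}

# example application to one numeric column (column choice is illustrative)
data$MonthlyIncome <- cap_outliers(data$MonthlyIncome)
```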
Feature Extraction:
We have removed the variables which had zero variability and other variables
which were irrelevant to the model.
## checking for variability
names(data[, nearZeroVar(data)])
# removing variables "EmployeeCount", "Over18" and "StandardHours" as they have
# zero variability
# also removing "EmployeeNumber" as it is irrelevant to the model
fulldata <- subset(data, select = -c(EmployeeCount, Over18, StandardHours, EmployeeNumber))
2.2 Phase II - Feature Engineering
We have introduced two new variables using existing ones to improve their predictions. They are
loyalty and volatile rate.
loyalty <- fulldata2$YearsAtCompany / fulldata2$TotalWorkingYears
loyalty <- round(loyalty, digits = 2)
volatile <- fulldata2$TotalWorkingYears / fulldata2$NumCompaniesWorked
volatile <- round(volatile, digits = 2)
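Note that both ratios can produce NaN or Inf when the denominator is zero (e.g. an employee with NumCompaniesWorked of 0). One way to guard against this, assuming such values are replaced with 0 (the actual treatment is not shown in the report):

```r
volatile <- fulldata2$TotalWorkingYears / fulldata2$NumCompaniesWorked
volatile[!is.finite(volatile)] <- 0  # replace Inf/NaN arising from zero denominators
volatile <- round(volatile, digits = 2)
```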
2.3 Data Dictionary:
Column Name | Data Type | Description | Example
Age | Integer | Age of employee | 49
Attrition | Factor (2 levels) | Whether the employee has left the organization | No
BusinessTravel | Factor (3 levels) | Rate of employee travel for business purposes | Travel_Frequently
DailyRate | Integer | Remuneration expected by employee on a daily basis | 279
Department | Factor (3 levels) | Department of work | Sales
DistanceFromHome | Integer | Distance between employee's home and place of work | 8
Education | Integer | Level of education of employee | 2
EducationField | Factor (6 levels) | Employee's field of study | Medical
EmployeeCount | Integer | No. of employees under the same identification | 1
EmployeeNumber | Integer | Number assigned to employee | 12
EnvironmentSatisfaction | Integer | Level of employee's satisfaction with working environment | 3
Gender | Factor (2 levels) | Gender of employee | Male
HourlyRate | Integer | Remuneration expected by employee on an hourly basis | 92
JobInvolvement | Integer | Level of involvement shown in work | 2
JobLevel | Integer | Hierarchical position of employee | 1
JobRole | Factor (9 levels) | Job position assigned to employee | Sales Executive
JobSatisfaction | Integer | Level of employee's job satisfaction | 4
MaritalStatus | Factor (3 levels) | Employee's marital status | Married
MonthlyIncome | Integer | Monthly remuneration of employee | 2090
MonthlyRate | Integer | Monthly remuneration charged by employee | 2396
NumCompaniesWorked | Integer | No. of companies the employee worked at prior to the current one | 1
Over18 | Factor (1 level) | Whether the worker is aged over 18 | Yes
OverTime | Factor (2 levels) | Whether the employee works overtime | Yes
PercentSalaryHike | Integer | Percentage of salary increment of employee | 11
PerformanceRating | Integer | Employee's job performance rating | 3
RelationshipSatisfaction | Integer | Level of satisfaction with employee's relationships | 4
StandardHours | Integer | No. of hours employee works | 80
StockOptionLevel | Integer | Level of company stock options offered to employee | 1
TotalWorkingYears | Integer | No. of years employee has been working | 6
TrainingTimesLastYear | Integer | No. of times employee was given training in the past year | 2
WorkLifeBalance | Integer | Level of employee's balance between work and personal life | 3
YearsAtCompany | Integer | No. of years employee has worked at the current company | 8
YearsInCurrentRole | Integer | No. of years employee has worked in the current position | 7
YearsSinceLastPromotion | Integer | No. of years since employee's last promotion | 1
YearsWithCurrManager | Integer | No. of years employee has spent under the current manager | 5
Data definitions for categorical variables:
Education: 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor'
EnvironmentSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'
JobInvolvement: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'
JobSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'
PerformanceRating: 1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding'
RelationshipSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'
WorkLifeBalance: 1 'Bad' 2 'Good' 3 'Better' 4 'Best'
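If labelled values are preferred for plotting or reporting, these coded columns can be converted to ordered factors. A sketch for Education (the other coded columns follow the same pattern):

```r
# map the integer Education codes to their labels as ordered factor levels
fulldata$Education <- factor(fulldata$Education,
                             levels = 1:5,
                             labels = c("Below College", "College", "Bachelor",
                                        "Master", "Doctor"),
                             ordered = TRUE)
```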
2.4 Exploratory Data Analysis:
## checking attrition rate in the dependent variable
table(fulldata$Attrition)
ggplot(data = fulldata, aes(x = Attrition, fill = Attrition)) +
  geom_bar(color = "grey40", alpha = 1) +
  theme(panel.grid = element_blank()) +
  theme(panel.grid.major.y = element_line("grey")) +
  ggtitle("Employee Attrition")
Fig.1: Attrition rate in dependent variable
# distribution by employee department
ggplot(data = fulldata, aes(x = Department, fill = Department)) +
  geom_bar(color = "grey40", alpha = 1) +
  theme(panel.grid = element_blank()) +
  theme(panel.grid.major.y = element_line("grey"))
Fig.2: Distribution by employee department
# job satisfaction distribution
ggplot(data = fulldata, aes(x = JobSatisfaction, fill = Attrition)) +
  geom_bar(col = "grey40", alpha = 1) +
  xlab("Satisfaction levels") + ylab("Count") +
  ggtitle("Distribution of satisfaction levels") +
  theme(panel.grid = element_blank()) +
  theme(panel.grid.major.y = element_line("grey")) +
  theme(text = element_text(family = "Decima WE", color = "black"))
Fig.3: Distribution of satisfaction levels
CHAPTER 3: FITTING MODELS TO DATA
Logistic Regression
We created multiple models by removing insignificant variables one at a time and
comparing the AIC values of the models with one another. After careful evaluation,
the model with the lowest AIC was selected for validation.
log_model1 <- glm(Attrition ~ ., data = traindata, family = binomial)
summary(log_model1)
# removing insignificant variables one at a time
m1 <- update(log_model1, . ~ . - MaritalStatus)
summary(m1)
m2 <- update(m1, . ~ . - PerformanceRating)
summary(m2)
m3 <- update(m2, . ~ . - DailyRate)
# ... (intermediate models m4 to m15, each dropping one more variable) ...
m16 <- update(m15, . ~ . - TrainingTimesLastYear)
summary(m16)
The final model was then used to predict on the test data, and its accuracy was
evaluated using a confusion matrix.
test_prob <- predict(m16, testdata, type = "response")
head(test_prob)
test_class <- ifelse(test_prob >= 0.50, 1, 0)
head(test_class)
confusionMatrix(table(test_class, testdata$Attrition), positive = "1")
The ROC curve is plotted to find the best possible threshold value for the model.
# using the ROCR package
train_prob <- predict(m16, traindata, type = "response")
roc_pred <- prediction(train_prob, traindata$Attrition)
roc_curve <- performance(roc_pred, "tpr", "fpr")
plot(roc_curve, print.cutoffs.at = seq(0, 1, by = 0.1), colorize = TRUE)
Fig.4: ROC curve
From the ROC curve, a threshold value of 0.6 was selected as the optimal one to
obtain maximum efficiency.
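Applying the chosen 0.6 threshold to the test probabilities then gives the final class predictions; a minimal sketch, reusing the `test_prob` computed earlier:

```r
# reclassify at the threshold chosen from the ROC curve
test_class <- ifelse(test_prob >= 0.60, 1, 0)
confusionMatrix(table(test_class, testdata$Attrition), positive = "1")
```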
Accuracy : 0.8818
95% CI : (0.8486, 0.91)
No Information Rate : 0.8468
P-Value [Acc > NIR] : 0.01946
Kappa : 0.5101
Mcnemar's Test P-Value : 0.13442
Sensitivity : 0.9457
Specificity : 0.5286
Pos Pred Value : 0.9173
Neg Pred Value : 0.6379
Prevalence : 0.8468
Detection Rate : 0.8009
Detection Prevalence : 0.8731
Balanced Accuracy : 0.7372
SVM (Support Vector Machines)
We created the first model using a linear kernel.
svm_model1 <- svm(as.factor(Attrition) ~ ., data = traindata, kernel = "linear")
summary(svm_model1)
svm_prob <- predict(svm_model1, testdata)
head(svm_prob)
confusionMatrix(table(svm_prob, testdata$Attrition), positive = "1")
We then created models with radial and polynomial kernels. Comparing all the
models, the maximum efficiency was obtained with the first (linear) model.
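The radial and polynomial models mentioned above are not reproduced in the report; they would follow the same pattern as the linear model, e.g.:

```r
# alternative kernels, built and evaluated the same way as svm_model1
svm_model2 <- svm(as.factor(Attrition) ~ ., data = traindata, kernel = "radial")
svm_model3 <- svm(as.factor(Attrition) ~ ., data = traindata, kernel = "polynomial")

svm_prob2 <- predict(svm_model2, testdata)
confusionMatrix(table(svm_prob2, testdata$Attrition), positive = "1")
```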
Accuracy : 0.8818
95% CI : (0.8486, 0.91)
No Information Rate : 0.8468
P-Value [Acc > NIR] : 0.019460
Kappa : 0.4701
Mcnemar's Test P-Value : 0.001749
Sensitivity : 0.9612
Specificity : 0.4429
Pos Pred Value : 0.9051
Neg Pred Value : 0.6739
Prevalence : 0.8468
Detection Rate : 0.8140
Detection Prevalence : 0.8993
Balanced Accuracy : 0.7020
Decision Tree
The decision tree model we built achieved 84.68% accuracy.
dt_model1<-rpart(Attrition~.,data = traindata,method = "class")
summary(dt_model1)
dt_pred=predict(dt_model1,testdata,type = "class")
head(dt_pred)
confusionMatrix(table(dt_pred,testdata$Attrition),positive = "1")
Accuracy : 0.8468
95% CI : (0.8105, 0.8786)
No Information Rate : 0.8468
P-Value [Acc > NIR] : 0.53184
Kappa : 0.2939
Mcnemar's Test P-Value : 0.00125
Sensitivity : 0.9457
Specificity : 0.3000
Pos Pred Value : 0.8819
Neg Pred Value : 0.5000
Prevalence : 0.8468
Detection Rate : 0.8009
Detection Prevalence : 0.9081
Balanced Accuracy : 0.6229
Fig.5: Decision Tree
Random forest
We built three random forest models, exploring further with each. The first
provided 86.65% accuracy. We then plotted the variable importance graph to
identify the important variables and built the second model on that basis. The
third model was built by adjusting tuning parameters such as nodesize, ntree and
mtry. Of these three models, the first provided the best results.
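The random forest code is not reproduced in the report; a sketch of the models described above, with illustrative tuning values (the actual nodesize, ntree and mtry values are not stated):

```r
library(randomForest)

# Model 1: default settings
rf_model1 <- randomForest(as.factor(Attrition) ~ ., data = traindata)

# variable importance plot, used to select predictors for the second model
varImpPlot(rf_model1)

# Model 3: adjusted tuning parameters (values here are illustrative assumptions)
rf_model3 <- randomForest(as.factor(Attrition) ~ ., data = traindata,
                          ntree = 500, mtry = 6, nodesize = 5)

# evaluate the best model on the test data
rf_pred <- predict(rf_model1, testdata)
confusionMatrix(table(rf_pred, testdata$Attrition), positive = "1")
```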
Fig.6: Random forest model
Accuracy : 0.8665
95% CI : (0.8319, 0.8963)
No Information Rate : 0.8468
P-Value [Acc > NIR] : 0.134
Kappa : 0.228
Mcnemar's Test P-Value : 7.496e-13
Sensitivity : 0.9948
Specificity : 0.1571
Pos Pred Value : 0.8671
Neg Pred Value : 0.8462
Prevalence : 0.8468
Detection Rate : 0.8425
Detection Prevalence : 0.9716
Balanced Accuracy : 0.5760
Fig.7: Significance of different variables with attrition
CHAPTER 4: KEY FINDINGS
We successfully created four classification models and plotted their characteristics
in the graph below.
Fig.8: Model comparison
From the final output of each model, the random forest provided the maximum
efficiency. It surpassed the other models with the highest sensitivity, although it
also had the lowest specificity. From the precision and accuracy point of view,
logistic regression and SVM gave better results; however, taking all the
characteristics of a model into account, the random forest stands apart from the
other models.
CHAPTER 5: RECOMMENDATIONS AND CONCLUSION
From the random forest model, we see that monthly income is the key factor that
stands out in determining an employee's attrition. Other crucial factors include
whether the employee works overtime, age, hourly rate, daily rate, monthly rate,
total working years and distance from home. We recommend increasing monthly
income in proportion to overtime worked, which should lower the attrition rate.
Therefore, we can safely conclude that employees take financial benefits and personal
convenience into account when deciding whether to leave the organization, and that
factors such as JobRole and Education play a less significant part in this decision.
The random forest model, which is essentially an ensemble of decision trees, helps
us reach this conclusion because it assesses attrition over many permutations of
the different variables.
CHAPTER 6: REFERENCES
[Link]
[Link]
[Link]
[Link]
[Link]