PROJECT REPORT

Employee Attrition Predictive Model

Submitted towards the partial fulfillment of the criteria for award of PGA by
Imarticus

Submitted By:
Eldin B. Joseph
Andrew Kurian Jacob
Vivekanandan R.
Amit Kumar

Course and Batch: DSP Batch 12(2018-19)


Abstract

This report analyzes IBM employee attrition data to identify the key factors that influence an employee's decision to leave the company. The data was cleaned and explored, two new features were engineered, and four classification models (logistic regression, SVM, decision tree and random forest) were fitted to predict attrition. The random forest model performed best, identifying monthly income, overtime, age, pay rates, total working years and distance from home as the most important factors.

Keywords: employee attrition, classification, logistic regression, SVM, decision tree, random forest, R

Disclaimer: *Data shared by the customer is confidential and sensitive; it should not be used for any purposes apart from
capstone project submission for PGA. The name and demographic details of the enterprise are kept confidential as per
the owners’ request and binding.

Acknowledgements
[To be changed by candidates as per their requirement]

We are using this opportunity to express our gratitude to everyone who supported us
throughout the course of this group project. We are thankful for their aspiring guidance,
invaluably constructive criticism and friendly advice during the project work. We are
sincerely grateful to them for sharing their truthful and illuminating views on a number
of issues related to the project.

Further, we were fortunate to have ___Rahul Sarkar_____________ as our mentor. He
readily shared his immense knowledge in data analytics and guided us in a manner that
enhanced our data skills.

We wish to thank all the faculty, as this project utilized knowledge gained from every
course that formed the PGA program.

We certify that the work done by us for conceptualizing and completing this project is
original and authentic.

Date: March 28, 2019 Eldin B. Joseph

Place: Bengaluru Andrew Kurian Jacob

Vivekanandan R.

Amit Kumar

Certificate of Completion
[To be changed by candidates as per their requirement]

I hereby certify that the project titled “Employee Attrition” was undertaken and
completed under my supervision by Eldin B. Joseph, Andrew Kurian Jacob,
Vivekanandan R. and Amit Kumar from the batch of DSP (Oct 2018).

Mentor:

Date: March 28, 2019

Place – Bengaluru

Table of Contents
Abstract................................................................................................................................................2
Acknowledgements..............................................................................................................................2
Certificate of Completion.....................................................................................................................3
CHAPTER 1: INTRODUCTION.............................................................................................................5
1.1 Title & Objective of the study..............................................................................................5
1.2 Need of the Study.................................................................................................................5
1.3 Business or Enterprise under study....................................................................................5
1.4 Business Model of Enterprise..............................................................................................5
1.5 Data Sources.........................................................................................................................5
1.6 Tools & Techniques....................................................................................................6
CHAPTER 2: DATA PREPARATION AND UNDERSTANDING.............................................................6
2.1 Phase I – Data Extraction and Cleaning:...................................................................................6
2.2 Phase II - Feature Engineering..................................................................................................6
2.3 Data Dictionary:.........................................................................................................................8
2.4 Exploratory Data Analysis:......................................................................................................12
CHAPTER 3: FITTING MODELS TO DATA.........................................................................................14
CHAPTER 4: KEY FINDINGS..............................................................................................................20
CHAPTER 5: RECOMMENDATIONS AND CONCLUSION..................................................................21
CHAPTER 6: REFERENCES................................................................................................................22

List of Figures

1. Attrition rate in dependent variable


2. Distribution by employee department

3. Distribution of satisfaction levels
4. ROC curve
5. Decision Tree
6. Random forest model
7. Significance of different variables with attrition
8. Model comparison

List of Tables

1. Data Dictionary

CHAPTER 1: INTRODUCTION

1.1 Title & Objective of the study


‘Employee Attrition’ refers to the rate of employee turnover of an organization. This study aims to
find the important factors that decide whether an employee continues to work in the organization or
not.

1.2 Need of the Study


Understanding why employees may want to leave an organization greatly helps employers improve
their hiring process and working conditions for employees. It also helps in building an
employee-friendly working atmosphere.

1.3 Business or Enterprise under study


We have received the employee data of IBM HR Analytics, which covers many parameters such as income,
years worked, and distance travelled.

1.4 Business Model of Enterprise


IBM is a leading diversified technology company with a broad range of business
offerings across IT hardware, software, and services segments. IBM combines its
broad mix of capabilities to provide integrated solutions and platforms to its clients.
IBM was founded in 1911. IBM operates in more than 175 countries across the globe
and has over 400,000 employees. IBM has five business segments: Global
Technology Services (GTS), Global Business Services (GBS), Software, Systems
Hardware, and Global Financing. 

1.5 Data Sources
The data was procured from an Excel file containing records of multiple IBM HR Analytics
employees, covering their working conditions and attrition.

1.6 Tools & Techniques


Tools: We use R, an analytical software environment used for building models with descriptive,
prescriptive and predictive analytic approaches.

Techniques: We build multiple classification models and use them to predict on a sample dataset. The
models are compared on the basis of model parameters, and the best one is selected.

CHAPTER 2: DATA PREPARATION AND UNDERSTANDING

One of the first steps we engaged in was to outline the sequence of steps to be
followed for our project. Each of these steps is elaborated below.

2.1 Phase I – Data Extraction and Cleaning:


 Missing Value Analysis and Treatment:
There were no missing values.
 Handling Outliers:
Outliers of the respective columns were capped at their respective highest
percentiles.
 Feature Extraction:
We have removed the variables which had zero variability and other variables
which were irrelevant to the model.

##checking for variability

names(data[,nearZeroVar(data)])
# removing variables "EmployeeCount", "Over18" and "StandardHours" as they have
# zero variability
# also removing EmployeeNumber as it is irrelevant
fulldata<-
subset(data,select=-c(EmployeeCount,Over18,StandardHours,EmployeeNumber))
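The percentile capping described in the outlier-handling step above can be sketched as follows; the 95th percentile is an assumed cut-off, as the report does not state the exact percentile used:

```r
# Cap a numeric vector at a high percentile (0.95 here is an assumption).
cap_outliers <- function(x, p = 0.95) {
  hi <- quantile(x, probs = p, na.rm = TRUE)
  pmin(x, hi)  # values above the cap are pulled down to the cap
}

x <- c(10, 12, 11, 13, 500)  # 500 is an extreme outlier
capped <- cap_outliers(x)
```

Applying `cap_outliers` column-wise (for example with `sapply` over the numeric columns) reproduces this treatment for the whole dataset.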

2.2 Phase II - Feature Engineering


We have introduced two new variables derived from existing ones to improve the predictions: loyalty
and volatile rate.

# loyalty: share of the employee's career spent at the current company
loyalty<-fulldata2$YearsAtCompany/fulldata2$TotalWorkingYears
loyalty<-round(loyalty,digits = 2)
# volatile rate: average years per company worked at
# (records with NumCompaniesWorked equal to 0 would produce Inf here and
# need separate treatment)
volatile<-fulldata2$TotalWorkingYears/fulldata2$NumCompaniesWorked
volatile<-round(volatile,digits = 2)

2.3 Data Dictionary:

Column Name | Data Type | Description | Example
Age | Integer | Age of employee | 49
Attrition | Factor with 2 levels | Whether the employee has left the organization | No
BusinessTravel | Factor with 3 levels | Rate of employee travelling for business purposes | Travel_Frequently
DailyRate | Integer | Remuneration expected by employee on a daily basis | 279
Department | Factor with 3 levels | Department of work | Sales
DistanceFromHome | Integer | Distance between employee’s home and place of work | 8
Education | Integer | Level of education of employee | 2
EducationField | Factor with 6 levels | Employee’s field of study | Medical
EmployeeCount | Integer | No. of employees under same identification | 1
EmployeeNumber | Integer | Number assigned to employee | 12
EnvironmentSatisfaction | Integer | Level of employee’s satisfaction with working environment | 3
Gender | Factor with 2 levels | Gender of employee | Male
HourlyRate | Integer | Remuneration expected by employee on an hourly basis | 92
JobInvolvement | Integer | Level of involvement shown in work | 2
JobLevel | Integer | Hierarchical position of employee | 1
JobRole | Factor with 9 levels | Job position assigned to employee | Sales Executive
JobSatisfaction | Integer | Level of employee’s job satisfaction | 4
MaritalStatus | Factor with 3 levels | Employee’s marital status | Married
MonthlyIncome | Integer | Monthly remuneration of employee | 2090
MonthlyRate | Integer | Monthly remuneration charged by employee | 2396
NumCompaniesWorked | Integer | No. of companies employee has worked at prior to the current one | 1
Over18 | Factor with 1 level | Whether the worker is aged over 18 | Yes
OverTime | Factor with 2 levels | Whether the employee works overtime | Yes
PercentSalaryHike | Integer | Percentage of salary increment of employee | 11
PerformanceRating | Integer | Employee’s job performance rating | 3
RelationshipSatisfaction | Integer | Level of satisfaction with employee’s relationships | 4
StandardHours | Integer | No. of hours employee works | 80
StockOptionLevel | Integer | Level of company stock options offered to employee | 1
TotalWorkingYears | Integer | No. of years employee has been working | 6
TrainingTimesLastYear | Integer | No. of times employee was given training in the past year | 2
WorkLifeBalance | Integer | Level of employee’s balance between work and personal life | 3
YearsAtCompany | Integer | No. of years employee has worked at current company | 8
YearsInCurrentRole | Integer | No. of years employee has worked in the current position | 7
YearsSinceLastPromotion | Integer | No. of years since employee’s last promotion | 1
YearsWithCurrManager | Integer | No. of years employee has spent under the current manager | 5
Data definitions for categorical variables:

Education: 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor'

EnvironmentSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

JobInvolvement: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

JobSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

PerformanceRating :1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding'

RelationshipSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'


WorkLifeBalance: 1 'Bad' 2 'Good' 3 'Better' 4 'Best'
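These integer codes can optionally be recoded into labelled ordered factors using the mappings above; a minimal sketch on toy data (the models in this report use the integer codes directly):

```r
# Toy stand-in for one ordinal column; the real data holds the same codes.
df <- data.frame(Education = c(2, 3, 5, 1))
# Recode using the Education mapping listed above.
df$Education <- factor(df$Education, levels = 1:5,
                       labels = c("Below College", "College", "Bachelor",
                                  "Master", "Doctor"),
                       ordered = TRUE)
```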

2.4 Exploratory Data Analysis:

##checking attrition rate in dependent variable

table(fulldata$Attrition)
ggplot(data = fulldata, aes(x= Attrition, fill=Attrition))+
  geom_bar(color="grey40",alpha=1)+
  theme(panel.background = element_blank())+
  theme(axis.line.y = element_line("grey"))+
  ggtitle("Employee Attrition")

Fig.1: Attrition rate in dependent variable

# distribution by employee department

ggplot(data = fulldata, aes(x= Department,fill=Department))+
  geom_bar(color="grey40",alpha=1)+
  theme(panel.background = element_blank())+
  theme(axis.line.y = element_line("grey"))

Fig.2: Distribution by employee department

#job satisfaction distribution

ggplot(data = fulldata,aes(x=JobSatisfaction,fill=Attrition))+
  geom_bar(col="grey40",alpha=1)+
  xlab("Satisfaction levels")+ylab("Count")+
  ggtitle("Distribution of satisfaction levels")+
  theme(panel.background = element_blank())+
  theme(axis.line.y = element_line("grey"))+
  theme(text = element_text(family = "Decima WE",color = "black"))

Fig.3: Distribution of satisfaction levels

CHAPTER 3: FITTING MODELS TO DATA

Logistic Regression

We created multiple models by removing insignificant variables one at a time and
comparing the AIC values of the models with one another. After careful evaluation,
the model with the lowest AIC value was selected for validation.
log_model1<-glm(Attrition~.,data = traindata,family = binomial)
summary(log_model1)
#removing insignificant variables
m1<-update(log_model1,.~.-MaritalStatusMarried)
summary(m1)
m2<-update(m1,.~.-PerformanceRating)
summary(m2)
m3<-update(m2,.~.-DailyRate)
……………………………………………………
…………………………………………………..
m16<-update(m15,.~.-TrainingTimesLastYear)
summary(m16)
Prediction was done using the final model on the test data, and the accuracy of the
model was evaluated using a confusion matrix:
test_prob<-predict(log_model2,testdata,type="response")
head(test_prob)
test_class<-ifelse(test_prob>=0.50,1,0)
head(test_class)

confusionMatrix(table(test_class,testdata$Attrition),positive = "1")
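The manual backward elimination shown above can also be automated with `step()`, which drops terms by AIC. The following is a sketch on synthetic data (`toy` is an illustrative stand-in, not the project's dataset):

```r
# Backward elimination by AIC, mirroring the chain of update() calls above.
set.seed(1)
toy <- data.frame(
  Attrition     = rbinom(200, 1, 0.2),
  MonthlyIncome = rnorm(200, 5000, 1500),
  Age           = sample(20:60, 200, replace = TRUE),
  DailyRate     = rnorm(200, 800, 200)
)
full    <- glm(Attrition ~ ., data = toy, family = binomial)
reduced <- step(full, direction = "backward", trace = 0)
```

`step()` only removes a term while doing so lowers the AIC, so the reduced model's AIC is never worse than the full model's.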

The ROC curve is plotted to find the best possible threshold value for the model.

roc_pred<-prediction(train_prob,traindata$Attrition)
roc_curve<-performance(roc_pred,"tpr","fpr")
plot(roc_curve,print.cutoffs.at=seq(0,1,by=0.1),colorize=TRUE)

Fig.4: ROC curve

From the ROC curve, a threshold value of 0.6 is selected as optimal in order to get the
maximum efficiency.
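One common way to justify such a cut-off is to choose the cutoff that maximises Youden's J statistic (tpr − fpr) from the same ROCR objects. The helper below is a hypothetical sketch, assuming `train_prob` and `traindata` exist as in the code above:

```r
library(ROCR)

# Return the probability cutoff maximising Youden's J (tpr - fpr).
best_cutoff <- function(prob, actual) {
  pred    <- prediction(prob, actual)
  perf    <- performance(pred, "tpr", "fpr")
  cutoffs <- perf@alpha.values[[1]]  # candidate cutoffs
  j       <- perf@y.values[[1]] - perf@x.values[[1]]  # tpr - fpr
  cutoffs[which.max(j)]
}

# best_cutoff(train_prob, traindata$Attrition)
```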

Accuracy : 0.8818
95% CI : (0.8486, 0.91)
No Information Rate : 0.8468
P-Value [Acc > NIR] : 0.01946

Kappa : 0.5101
Mcnemar's Test P-Value : 0.13442

Sensitivity : 0.9457
Specificity : 0.5286
Pos Pred Value : 0.9173
Neg Pred Value : 0.6379
Prevalence : 0.8468
Detection Rate : 0.8009
Detection Prevalence : 0.8731
Balanced Accuracy : 0.7372

SVM (Support Vector Machines)

We first created a model with a linear kernel:

svm_model1<-svm(as.factor(Attrition)~.,data = traindata,kernel="linear")
summary(svm_model1)
svm_prob<-predict(svm_model1,testdata)
head(svm_prob)
confusionMatrix(table(svm_prob,testdata$Attrition),positive = "1")

We then created models with radial and polynomial kernels. Comparing all the models,
the maximum efficiency was obtained with the first (linear) model.
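The kernel comparison described here can be sketched as follows, assuming `traindata` and `testdata` as in the models above (a sketch, not the project's exact code):

```r
library(e1071)  # provides svm()

# Fit one SVM per kernel type and compare test-set accuracy.
for (k in c("linear", "radial", "polynomial")) {
  m   <- svm(as.factor(Attrition) ~ ., data = traindata, kernel = k)
  acc <- mean(predict(m, testdata) == testdata$Attrition)
  cat(sprintf("%-10s accuracy: %.4f\n", k, acc))
}
```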

Accuracy : 0.8818
95% CI : (0.8486, 0.91)
No Information Rate : 0.8468
P-Value [Acc > NIR] : 0.019460

Kappa : 0.4701
Mcnemar's Test P-Value : 0.001749

Sensitivity : 0.9612
Specificity : 0.4429
Pos Pred Value : 0.9051
Neg Pred Value : 0.6739
Prevalence : 0.8468
Detection Rate : 0.8140
Detection Prevalence : 0.8993
Balanced Accuracy : 0.7020

Decision Tree

The decision tree model we built provided 84.68% accuracy:

dt_model1<-rpart(Attrition~.,data = traindata,method = "class")


summary(dt_model1)
dt_pred=predict(dt_model1,testdata,type = "class")
head(dt_pred)
confusionMatrix(table(dt_pred,testdata$Attrition),positive = "1")

Accuracy : 0.8468
95% CI : (0.8105, 0.8786)
No Information Rate : 0.8468
P-Value [Acc > NIR] : 0.53184

Kappa : 0.2939
Mcnemar's Test P-Value : 0.00125

Sensitivity : 0.9457
Specificity : 0.3000
Pos Pred Value : 0.8819
Neg Pred Value : 0.5000
Prevalence : 0.8468
Detection Rate : 0.8009
Detection Prevalence : 0.9081
Balanced Accuracy : 0.6229

Fig.5: Decision Tree

Random forest

We created three random forest models, exploring further in each one. The first
provided 86.65% accuracy. We then plotted the variable importance graph to identify
the important variables and built the second model on that basis. The third model
was built by adjusting tuning parameters such as nodesize, ntree and mtry. Of these
three models, the first provided the best results.
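The three models described above can be sketched as below, again assuming `traindata` and `testdata` as before; the tuning values shown are illustrative, not the ones actually used:

```r
library(randomForest)

# Model 1: default hyperparameters.
rf1 <- randomForest(as.factor(Attrition) ~ ., data = traindata)

# Variable importance, used to select features for model 2.
varImpPlot(rf1)
imp <- importance(rf1)

# Model 3: tuned hyperparameters (illustrative values).
rf3 <- randomForest(as.factor(Attrition) ~ ., data = traindata,
                    ntree = 500, mtry = 6, nodesize = 5)

# Compare test-set accuracy of the candidates.
mean(predict(rf1, testdata) == testdata$Attrition)
```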

Fig.6: Random forest model

Accuracy : 0.8665
95% CI : (0.8319, 0.8963)
No Information Rate : 0.8468
P-Value [Acc > NIR] : 0.134

Kappa : 0.228
Mcnemar's Test P-Value : 7.496e-13

Sensitivity : 0.9948
Specificity : 0.1571
Pos Pred Value : 0.8671
Neg Pred Value : 0.8462
Prevalence : 0.8468
Detection Rate : 0.8425
Detection Prevalence : 0.9716
Balanced Accuracy : 0.5760

Fig.7: Significance of different variables with attrition

CHAPTER 4: KEY FINDINGS
We successfully created four classification models and plotted their characteristics in
the graph below.

Fig.8: Model comparison


From the final output of each model, it was random forest that provided the maximum
efficiency. It surpassed the other models in areas such as maximum sensitivity, albeit
with minimum specificity. Although from a precision and accuracy point of view logistic
regression and SVM gave better results, taking all the characteristics of a model into
account, random forest stands apart from the other models.

CHAPTER 5: RECOMMENDATIONS AND CONCLUSION

From the random forest model, we see that monthly income is a key factor that stands
out in determining an employee’s attrition. Other crucial factors include whether the
employee works overtime, age, hourly rate, daily rate, monthly rate, total working
years and distance from home. We recommend that increasing monthly income in
proportion to overtime worked will lower the attrition rate.

Therefore, we can safely conclude that employees take financial benefits and personal
convenience into account when deciding whether to leave the organization, and that
factors such as JobRole and Education play a less significant part in this decision.
The random forest model, which is essentially an ensemble of decision trees, helps us
reach this conclusion because it assesses attrition across many permutations of the
different variables.

CHAPTER 6: REFERENCES


