Churn Modelling Dataset Overview

The document walks through loading and preprocessing a churn modelling dataset with Pandas and Scikit-learn. It loads the dataset, cleans the data by dropping identifier columns and encoding categorical variables, splits the data into training and test sets, standardizes the numeric features, and builds a neural network that classifies customers as churned or not churned based on their attributes. The model contains an input layer, four hidden layers (256, 512, 256, and 128 units), and a single-node output layer with a sigmoid activation.

  • Data Cleaning
  • Loading the Dataset
  • Separating Features and Labels
  • Splitting the Dataset
  • Encoding Categorical Data
  • Building the Neural Network Model
  • Training the Model
  • Model Prediction and Evaluation

In [1]: import pandas as pd

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn import preprocessing

Loading the Dataset


First, we load the dataset and inspect the number of rows and columns, the column data types, and any NULL values.

In [2]: df = pd.read_csv('churn_modelling.csv')

In [3]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RowNumber 10000 non-null int64
1 CustomerId 10000 non-null int64
2 Surname 10000 non-null object
3 CreditScore 10000 non-null int64
4 Geography 10000 non-null object
5 Gender 10000 non-null object
6 Age 10000 non-null int64
7 Tenure 10000 non-null int64
8 Balance 10000 non-null float64
9 NumOfProducts 10000 non-null int64
10 HasCrCard 10000 non-null int64
11 IsActiveMember 10000 non-null int64
12 EstimatedSalary 10000 non-null float64
13 Exited 10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB

In [4]: df.head()

Out[4]:    RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  Tenure    Balance  ...
        0          1    15634602  Hargrave          619    France  Female   42       2       0.00  ...
        1          2    15647311      Hill          608     Spain  Female   41       1   83807.86  ...
        2          3    15619304      Onio          502    France  Female   42       8  159660.80  ...
        3          4    15701354      Boni          699    France  Female   39       1       0.00  ...
        4          5    15737888  Mitchell          850     Spain  Female   43       2  125510.82  ...

Data Cleaning

In [5]: df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)

In [6]: df.isnull().sum()

Out[6]: CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64

In [7]: df.describe()

Out[7]:   CreditScore           Age        Tenure        Balance  NumOfProducts    HasCrCard  IsActiveMember  ...
count    10000.000000  10000.000000  10000.000000   10000.000000   10000.000000  10000.00000    10000.000000  ...
mean       650.528800     38.921800      5.012800   76485.889288       1.530200      0.70550        0.515100  ...
std         96.653299     10.487806      2.892174   62397.405202       0.581654      0.45584        0.499797  ...
min        350.000000     18.000000      0.000000       0.000000       1.000000      0.00000        0.000000  ...
25%        584.000000     32.000000      3.000000       0.000000       1.000000      0.00000        0.000000  ...
50%        652.000000     37.000000      5.000000   97198.540000       1.000000      1.00000        1.000000  ...
75%        718.000000     44.000000      7.000000  127644.240000       2.000000      1.00000        1.000000  ...
max        850.000000     92.000000     10.000000  250898.090000       4.000000      1.00000        1.000000  ...

Separating the features and the labels


In [8]: X = df.iloc[:, :df.shape[1]-1].values   # Independent variables
y = df.iloc[:, -1].values   # Dependent variable
X.shape, y.shape

Out[8]: ((10000, 10), (10000,))

Encoding categorical (string-based) data.

In [9]: print(X[:8,1], '... will now become: ')



label_X_country_encoder = LabelEncoder()
X[:,1] = label_X_country_encoder.fit_transform(X[:,1])
print(X[:8,1])

['France' 'Spain' 'France' 'France' 'Spain' 'Spain' 'France' 'Germany'] ... will now become:
[0 2 0 0 2 2 0 1]

In [10]: print(X[:6,2], '... will now become: ')



label_X_gender_encoder = LabelEncoder()
X[:,2] = label_X_gender_encoder.fit_transform(X[:,2])
print(X[:6,2])

['Female' 'Female' 'Female' 'Female' 'Female' 'Male'] ... will now become:
[0 0 0 0 0 1]
One-hot encode the countries: convert the single string-based Geography column into separate binary columns, one per country.

In [11]: transform = ColumnTransformer([("countries", OneHotEncoder(), [1])], remainder="passthrough")  # one-hot encode column 1 (Geography), pass the other columns through unchanged

X = transform.fit_transform(X)
X

Out[11]: array([[1.0, 0.0, 0.0, ..., 1, 1, 101348.88],
        [0.0, 0.0, 1.0, ..., 0, 1, 112542.58],
        [1.0, 0.0, 0.0, ..., 1, 0, 113931.57],
        ...,
        [1.0, 0.0, 0.0, ..., 0, 1, 42085.58],
        [0.0, 1.0, 0.0, ..., 1, 0, 92888.52],
        [1.0, 0.0, 0.0, ..., 1, 0, 38190.78]], dtype=object)

Dropping one of the dummy columns avoids redundancy (the dummy-variable trap): if the two remaining country columns are both 0, the row must belong to the country whose column was dropped. An equivalent encoder-level approach is sketched after the code below.

In [12]: X = X[:,1:]
X.shape

Out[12]: (10000, 11)
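The same result can be obtained by letting the encoder drop the redundant column itself. This is a minimal sketch, not part of the original notebook, assuming a scikit-learn version whose OneHotEncoder supports the drop='first' parameter; it would replace cells In [11]–[12] and operate on the X from cell In [10] (before encoding):

    # Sketch: one-hot encode Geography and drop the first dummy column in one step,
    # avoiding the dummy-variable trap without the manual X = X[:, 1:] slice.
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    transform_dropfirst = ColumnTransformer(
        [("countries", OneHotEncoder(drop='first'), [1])],  # encode column 1 (Geography)
        remainder="passthrough")                            # keep all other columns as-is
    # X = transform_dropfirst.fit_transform(X)              # would yield the same (10000, 11) shape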

Splitting the Dataset


Training and Test Set

In [13]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Standardize the numeric columns of the train and test data:

['CreditScore','Age','Tenure','Balance','NumOfProducts','EstimatedSalary']

In [14]: sc = StandardScaler()
X_train[:, np.array([2,4,5,6,7,10])] = sc.fit_transform(X_train[:, np.array([2,4,5,6,7,10])])
X_test[:, np.array([2,4,5,6,7,10])] = sc.transform(X_test[:, np.array([2,4,5,6,7,10])])

In [15]: sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train

Out[15]: array([[-0.5698444 ,  1.74309049,  0.16958176, ...,  0.64259497,
         -1.03227043,  1.10643166],
        [ 1.75486502, -0.57369368, -2.30455945, ...,  0.64259497,
          0.9687384 , -0.74866447],
        [-0.5698444 , -0.57369368, -1.19119591, ...,  0.64259497,
         -1.03227043,  1.48533467],
        ...,
        [-0.5698444 , -0.57369368,  0.9015152 , ...,  0.64259497,
         -1.03227043,  1.41231994],
        [-0.5698444 ,  1.74309049, -0.62420521, ...,  0.64259497,
          0.9687384 ,  0.84432121],
        [ 1.75486502, -0.57369368, -0.28401079, ...,  0.64259497,
         -1.03227043,  0.32472465]])

Initialize & build the model


Architecture: INPUT (number of independent feature columns) -> hidden layers, each with an activation function -> OUTPUT (1 node) with a sigmoid activation.

In [16]: from keras.models import Sequential



# Initializing the ANN
classifier = Sequential()

In [17]: from keras.layers import Dense



# The number of nodes (dimensions) in a hidden layer should be roughly the average of the input and output layers.
# This adds the input layer (by specifying input_dim) AND the first hidden layer (units).
classifier.add(Dense(activation = 'relu', input_dim = 11, units=256, kernel_initializer='uniform'))

In [18]: # Adding the remaining hidden layers
classifier.add(Dense(activation = 'relu', units=512, kernel_initializer='uniform'))
classifier.add(Dense(activation = 'relu', units=256, kernel_initializer='uniform'))
classifier.add(Dense(activation = 'relu', units=128, kernel_initializer='uniform'))

In [19]: # Adding the output layer


# Notice that we do not need to specify input_dim here.
# We have an output of 1 node, which is the desired dimension of our output (stay with the bank or exit).
# We use the sigmoid activation because we want probability outcomes.
classifier.add(Dense(activation = 'sigmoid', units=1, kernel_initializer='uniform'))

In [20]: # Create optimizer with default learning rate


# sgd_optimizer = keras.optimizers.SGD()
# Compile the model
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [21]: classifier.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 256) 3072

dense_1 (Dense) (None, 512) 131584

dense_2 (Dense) (None, 256) 131328

dense_3 (Dense) (None, 128) 32896

dense_4 (Dense) (None, 1) 129

=================================================================
Total params: 299,009
Trainable params: 299,009
Non-trainable params: 0
_________________________________________________________________
In [22]: classifier.fit(
X_train, y_train,
validation_data=(X_test,y_test),
epochs=20,
batch_size=32
)

Epoch 1/20
250/250 [==============================] - 1s 3ms/step - loss: 0.4378 - accuracy: 0.8163 - val_loss: 0.3689 - val_accuracy: 0.8480
Epoch 2/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3638 - accuracy: 0.8510 - val_loss: 0.3509 - val_accuracy: 0.8590
Epoch 3/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3485 - accuracy: 0.8575 - val_loss: 0.3412 - val_accuracy: 0.8585
Epoch 4/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3435 - accuracy: 0.8631 - val_loss: 0.3468 - val_accuracy: 0.8645
Epoch 5/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3407 - accuracy: 0.8600 - val_loss: 0.3440 - val_accuracy: 0.8620
Epoch 6/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3394 - accuracy: 0.8621 - val_loss: 0.3385 - val_accuracy: 0.8645
Epoch 7/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3332 - accuracy: 0.8629 - val_loss: 0.3372 - val_accuracy: 0.8625
Epoch 8/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3264 - accuracy: 0.8695 - val_loss: 0.3410 - val_accuracy: 0.8605
Epoch 9/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3258 - accuracy: 0.8680 - val_loss: 0.3419 - val_accuracy: 0.8640
Epoch 10/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3226 - accuracy: 0.8700 - val_loss: 0.3418 - val_accuracy: 0.8630
Epoch 11/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3198 - accuracy: 0.8700 - val_loss: 0.3416 - val_accuracy: 0.8650
Epoch 12/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3154 - accuracy: 0.8725 - val_loss: 0.3461 - val_accuracy: 0.8580
Epoch 13/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3122 - accuracy: 0.8749 - val_loss: 0.3400 - val_accuracy: 0.8615
Epoch 14/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3094 - accuracy: 0.8739 - val_loss: 0.3443 - val_accuracy: 0.8620
Epoch 15/20
250/250 [==============================] - 1s 3ms/step - loss: 0.3051 - accuracy: 0.8761 - val_loss: 0.3529 - val_accuracy: 0.8570
Epoch 16/20
250/250 [==============================] - 1s 3ms/step - loss: 0.2997 - accuracy: 0.8790 - val_loss: 0.3541 - val_accuracy: 0.8595
Epoch 17/20
250/250 [==============================] - 1s 3ms/step - loss: 0.2939 - accuracy: 0.8774 - val_loss: 0.3487 - val_accuracy: 0.8635
Epoch 18/20
250/250 [==============================] - 1s 3ms/step - loss: 0.2882 - accuracy: 0.8836 - val_loss: 0.3876 - val_accuracy: 0.8530
Epoch 19/20
250/250 [==============================] - 1s 3ms/step - loss: 0.2871 - accuracy: 0.8796 - val_loss: 0.3724 - val_accuracy: 0.8515
Epoch 20/20
250/250 [==============================] - 1s 3ms/step - loss: 0.2782 - accuracy: 0.8854 - val_loss: 0.3754 - val_accuracy: 0.8615

Out[22]: <keras.callbacks.History at 0x7f1a35f238b0>

Predict the results using 0.5 as a threshold

In [23]: y_pred = classifier.predict(X_test)


y_pred

63/63 [==============================] - 0s 977us/step

Out[23]: array([[0.2910964 ],
[0.17213912],
[0.14321707],
...,
[0.00948479],
[0.1535894 ],
[0.11742345]], dtype=float32)

In [24]: # To use the confusion matrix, we need to convert the probabilities that a customer will leave into binary values.
# So we will use the cutoff value 0.5 to indicate whether they are likely to exit or not.
y_pred = (y_pred > 0.5)
y_pred

Out[24]: array([[False],
[False],
[False],
...,
[False],
[False],
[False]])

Print the Accuracy score and confusion matrix

In [25]: from sklearn.metrics import confusion_matrix, classification_report



cm1 = confusion_matrix(y_test, y_pred)
cm1

Out[25]: array([[1526,   69],
        [ 208,  197]])

In [26]: print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.96      0.92      1595
           1       0.74      0.49      0.59       405

    accuracy                           0.86      2000
   macro avg       0.81      0.72      0.75      2000
weighted avg       0.85      0.86      0.85      2000

In [27]: accuracy_model1 = ((cm1[0][0]+cm1[1][1])*100)/(cm1[0][0]+cm1[1][1]+cm1[0][1]+cm1[1][0])


print (accuracy_model1, '% of testing data was classified correctly')

86.15 % of testing data was classified correctly

Common questions


StandardScaler is employed to normalize the dataset by scaling features such that they have a mean of zero and a standard deviation of one. This is crucial for optimizing the performance of machine learning algorithms that are sensitive to feature scaling, such as neural networks, by ensuring that each feature contributes equally to the result.
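A minimal sketch of how the scaler is applied without leaking test-set statistics, mirroring cells In [14]–[15]; the toy arrays here are hypothetical stand-ins for the real feature matrices:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X_train_demo = np.array([[600., 40.], [700., 30.], [500., 50.]])  # toy training features
    X_test_demo  = np.array([[650., 45.]])                            # toy test features

    sc = StandardScaler()
    X_train_scaled = sc.fit_transform(X_train_demo)  # learn mean/std from training data only
    X_test_scaled  = sc.transform(X_test_demo)       # reuse the same mean/std on the test data

    print(X_train_scaled.mean(axis=0))  # ~0 for each column
    print(X_train_scaled.std(axis=0))   # ~1 for each column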

The dataset is split into training and testing sets using train_test_split, with a test size of 20%. Splitting the dataset is crucial to evaluate the model's performance and to ensure it generalizes well to unseen data.
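A minimal sketch of the split used in cell In [13], with hypothetical toy data standing in for X and y; the stratify argument shown in the variant is an optional addition not used in the notebook, which keeps the churn/non-churn ratio identical in both splits:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Toy stand-ins for the notebook's X and y (10 rows, 3 features)
    X_demo = np.arange(30).reshape(10, 3)
    y_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 1])

    # 80/20 split with a fixed seed, as in cell In [13]
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

    # Optional variant (not in the notebook): preserve the class balance of y in both sets
    X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)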

Precision and recall often trade off against each other in this kind of model. Precision measures the accuracy of the positive predictions, while recall measures how well the model captures the positive class. High precision with low recall indicates that the model makes few false-positive errors but misses many actual positives. In this document, both precision and especially recall for the exit class were lower than for the non-exit class, reflecting the difficulty of correctly identifying customers who exited.
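The class-1 (exited) numbers in the classification report can be reproduced directly from the confusion matrix printed in Out[25]:

    # Confusion matrix from Out[25]: rows = actual (0, 1), columns = predicted (0, 1)
    tn, fp = 1526, 69    # actual non-exit: 1526 correctly kept, 69 wrongly flagged as exit
    fn, tp = 208, 197    # actual exit: 208 missed, 197 correctly flagged

    precision_exit = tp / (tp + fp)   # 197 / 266 ≈ 0.74
    recall_exit    = tp / (tp + fn)   # 197 / 405 ≈ 0.49
    f1_exit = 2 * precision_exit * recall_exit / (precision_exit + recall_exit)  # ≈ 0.59
    print(round(precision_exit, 2), round(recall_exit, 2), round(f1_exit, 2))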

The training accuracy and validation accuracy provide insight into model performance and generalization. An increasing training accuracy with a corresponding increase in validation accuracy suggests good generalization; however, if the validation accuracy consistently remains lower than the training accuracy, it can indicate overfitting. Across the 20 epochs here, training accuracy (reaching about 0.885) and validation accuracy (about 0.86) stayed close, suggesting the model generalizes reasonably well, although the rising validation loss in the later epochs hints at the onset of overfitting; one common remedy is sketched below.
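If the late-epoch divergence between training and validation loss were a concern, an early-stopping callback is a standard remedy. This is a sketch of an addition that is not in the original notebook, assuming the usual Keras callback API:

    from keras.callbacks import EarlyStopping

    # Stop training once val_loss has not improved for 3 consecutive epochs,
    # and roll the weights back to the best epoch seen so far.
    early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

    # classifier.fit(X_train, y_train,
    #                validation_data=(X_test, y_test),
    #                epochs=20, batch_size=32,
    #                callbacks=[early_stop])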

The transformation of categorical string data into numerical format is done using LabelEncoder and OneHotEncoder. LabelEncoder converts features like 'Geography' and 'Gender' into integer codes, while OneHotEncoder then expands the Geography codes into separate binary columns.
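An equivalent pandas-only route, not used in the notebook, is pd.get_dummies; drop_first=True plays the same role as slicing off the first dummy column in cell In [12]. The small DataFrame here is a hypothetical stand-in built from a few example rows:

    import pandas as pd

    # Toy frame with the same string columns as the dataset
    df_demo = pd.DataFrame({
        'CreditScore': [619, 608, 502],
        'Geography':   ['France', 'Spain', 'Germany'],
        'Gender':      ['Female', 'Female', 'Male'],
        'Exited':      [1, 0, 1],
    })

    # One call expands each string column into binary dummy columns;
    # drop_first=True removes the first (redundant) category of each feature.
    df_encoded = pd.get_dummies(df_demo, columns=['Geography', 'Gender'], drop_first=True)
    print(df_encoded.columns.tolist())
    # ['CreditScore', 'Exited', 'Geography_Germany', 'Geography_Spain', 'Gender_Male']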

The model's performance was evaluated using accuracy, precision, recall, F1-score, and the confusion matrix. Accuracy measures the percentage of correctly predicted instances. Precision and recall evaluate the number of true positives among the predicted positives and among the actual positives, respectively. The F1-score balances precision and recall. The confusion matrix provides a detailed breakdown of true positive, true negative, false positive, and false negative predictions.

The neural network model is a Sequential model with five Dense layers: a first hidden layer of 256 units (which also declares the 11-dimensional input), three further hidden layers with 512, 256, and 128 units, all using ReLU activation, and an output layer of one unit with a sigmoid activation for the binary classification.
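The parameter counts in the summary of cell In [21] follow directly from these layer sizes: each Dense layer has (inputs × units) weights plus units biases. A quick check:

    # Dense layer parameters = inputs * units + units (biases)
    layers = [(11, 256), (256, 512), (512, 256), (256, 128), (128, 1)]
    params = [n_in * n_out + n_out for n_in, n_out in layers]
    print(params)       # [3072, 131584, 131328, 32896, 129] -- matches the summary
    print(sum(params))  # 299009 total trainable parameters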

The "dimensionality reduction" step in preprocessing refers to dropping one of the one-hot country columns after encoding. One-hot encoding first expands Geography into three binary columns; since the three always sum to one, a single column is redundant and can be removed without losing information, because zeros in the two remaining columns imply the dropped country. This avoids the dummy-variable trap while keeping the feature matrix compact.

A confusion matrix reveals how many predictions were correctly or incorrectly classified across all classes. It shows the true positives, true negatives, false positives, and false negatives, offering detailed insight into the model's behavior. In this document, it was used to calculate the accuracy, precision, recall, and F1-score of the model, which were then used to assess the model's predictive capabilities.

The Adam optimizer benefits the ANN training process by combining the advantages of two other extensions of stochastic gradient descent: the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp). It adapts the learning rate for each parameter, facilitating faster convergence.
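Passing optimizer='adam' in cell In [20] uses Keras's default Adam learning rate of 0.001. A minimal sketch of the equivalent explicit form, not in the original notebook, which makes the learning rate easy to tune:

    from keras.optimizers import Adam

    # classifier is the Sequential model built in cells In [16]-[19]
    adam_optimizer = Adam(learning_rate=0.001)  # same default Keras uses for the string 'adam'
    classifier.compile(optimizer=adam_optimizer,
                       loss='binary_crossentropy',
                       metrics=['accuracy'])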
