Predicting Student Dropout Risk
Student: D. Mithilesh
Submission Date: 28/09/2025
A comprehensive data science approach to identifying at-risk students using machine learning algorithms
"Data reveals. Action prevents."
Project Introduction
Objective
Develop a predictive model to identify students at risk of dropping out using comprehensive
data analysis and machine learning techniques.
Dataset Overview
• Student records with attendance patterns
• Academic performance metrics
• Demographic and behavioral indicators
Tools & Techniques
• Platforms: Python, Excel, Google Sheets
• Algorithms: K-NN Classification
• Analysis: K-Means Clustering
Exploratory Data Analysis
Comprehensive analysis revealed critical patterns in student behavior and academic performance that correlate with dropout risk.
• Attendance Rate: 85% (average attendance among successful students)
• Average GPA: 2.8 (mean GPA of continuing students)
• Risk Indicators: 42% (students showing multiple warning signs)
K-NN Classification Methodology
K-Nearest Neighbors algorithm identifies dropout risk by analyzing similarity patterns between
students based on key performance indicators.
01
Data Normalization
Standardize features for fair comparison
02
Distance Calculation
Compute Euclidean distances between students
03
Neighbor Selection
Identify K closest similar students
04
Outcome Prediction
Classify based on majority neighbor outcomes
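The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual code: the feature pair (attendance %, GPA), the sample records, the function names `standardize` and `knn_predict`, and the choice of k = 3 are all assumptions made for the example.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def standardize(rows):
    """Step 01: z-score each column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero on constant columns
    scaled = [[(v - m) / s for v, m, s in zip(r, mus, sigmas)] for r in rows]
    return scaled, mus, sigmas


def knn_predict(train_X, train_y, query, k=3):
    scaled, mus, sigmas = standardize(train_X)
    # Scale the query with the training statistics so it lives in the same space.
    q = [(v - m) / s for v, m, s in zip(query, mus, sigmas)]
    # Step 02: Euclidean distance from the query to every training student.
    dists = sorted((math.dist(x, q), y) for x, y in zip(scaled, train_y))
    # Step 03: keep the k closest students. Step 04: majority vote on their outcomes.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical records: (attendance %, GPA). Not taken from the project dataset.
train_X = [(55, 1.9), (60, 2.1), (58, 2.0), (90, 3.4), (88, 3.1), (92, 3.6)]
train_y = ["High Risk", "High Risk", "High Risk", "Low Risk", "Low Risk", "Low Risk"]

prediction = knn_predict(train_X, train_y, query=(62, 2.2), k=3)
print(prediction)
```

An odd k avoids tied votes when there are only two outcome classes.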
K-Means Clustering Methodology
Initialize Centroids
Place K random cluster centers in the data space
Assign Data Points
Group students to nearest centroid based on characteristics
Recompute Centers
Update centroid positions based on cluster members
Iterate Until Stable
Repeat process until clusters converge
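The loop described above can be sketched as follows, again with the standard library only. Several details are assumptions for illustration: the `kmeans` function name, the pre-scaled sample points, k = 3, and initializing centroids from randomly chosen data points (one common way of placing random centers).

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialize centroids: here, k distinct data points picked at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assign data points: each student goes to the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Iterate until stable: stop once no student changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute centers: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical, already-normalized features: (attendance fraction, GPA / 4).
points = [
    (0.50, 0.45), (0.55, 0.50), (0.52, 0.40),
    (0.75, 0.65), (0.72, 0.70), (0.78, 0.68),
    (0.95, 0.90), (0.92, 0.95), (0.97, 0.88),
]

centroids, assignments = kmeans(points, k=3)
for point, cluster in zip(points, assignments):
    print(point, "-> cluster", cluster)
```

K-Means can settle on different groupings depending on the starting centroids, so in practice it is usually run several times with different seeds and the tightest result is kept.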
K-NN Classification Results
The K-NN model achieved strong predictive accuracy and identified high-risk students with high precision.
Student ID   Distance   Prediction
ST_001       0.23       High Risk
ST_002       0.45       Low Risk
ST_003       0.31       High Risk
ST_004       0.67       Low Risk
ST_005       0.19       High Risk
• Model Accuracy: 87% (correct predictions on test data)
• Precision Rate: 82% (true positive identification)
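For reference, accuracy and precision can be computed from true and predicted labels as sketched below. The label lists and the `accuracy_and_precision` helper are hypothetical and only show the arithmetic; they do not reproduce the 87% and 82% figures reported above.

```python
def accuracy_and_precision(y_true, y_pred, positive="High Risk"):
    pairs = list(zip(y_true, y_pred))
    # True positives: flagged as high risk and actually high risk.
    tp = sum(t == positive and p == positive for t, p in pairs)
    # False positives: flagged as high risk but actually not.
    fp = sum(t != positive and p == positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


# Hypothetical test labels, for illustration only.
H, L = "High Risk", "Low Risk"
y_true = [H, L, H, L, H, L, L, H]
y_pred = [H, L, H, H, H, L, L, L]

accuracy, precision = accuracy_and_precision(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Accuracy counts every correct prediction, while precision looks only at the students flagged as high risk and asks how many of those flags were right.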
K-Means Clustering Results
High Risk Cluster
Students with low attendance (<60%) and declining grades. Requires immediate intervention.
Moderate Risk Cluster
Average performers with inconsistent patterns. Benefit from targeted support programs.
Low Risk Cluster
Strong academic performance with consistent engagement. Minimal intervention needed.
Key Insights and Learnings
Attendance is the Strongest Predictor
Students with <70% attendance show 3x higher dropout risk
Early Detection Enables Intervention
ML models identify at-risk students 2 semesters in advance
Multiple Factors Create Compound Risk
Combination of low grades and poor engagement amplifies
dropout probability
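A statement such as "3x higher dropout risk" for students under 70% attendance is a relative risk: the dropout rate in the low-attendance group divided by the rate in everyone else. The sketch below shows that calculation. The counts and the `relative_risk` helper are hypothetical, chosen only to make the arithmetic easy to follow, and are not the project's data.

```python
def relative_risk(dropouts_low, total_low, dropouts_other, total_other):
    """Ratio of dropout rates: low-attendance group vs. all other students."""
    rate_low = dropouts_low / total_low
    rate_other = dropouts_other / total_other
    return rate_low / rate_other


# Hypothetical counts: 24 of 64 low-attendance students dropped out,
# compared with 32 of 256 students at or above the attendance threshold.
rr = relative_risk(24, 64, 32, 256)
print(f"relative risk = {rr:.1f}x")
```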
Challenges and Recommendations
Data Quality Challenges
• Missing attendance records for 15% of students
• Inconsistent grading scales across departments
• Limited socioeconomic background data
Future Recommendations
• Expand dataset to include 5+ years of historical data
• Implement Random Forest and Neural Network models
• Integrate real-time data collection systems
Conclusion and Impact
Project Summary
Successfully developed a predictive model achieving 87% accuracy in identifying student dropout
risk using K-NN classification and K-Means clustering techniques.
Broader Implications
AI-driven early warning systems can transform educational outcomes by enabling proactive
interventions, potentially saving thousands of academic careers annually.
References: Documentation, IIT-M Data Science and AI course videos, Google AI.
Tools: ChatGPT, Chrome, MS Excel, MS PowerPoint.