Rubixe Internship Overview Report
4. Tasks Performed
4.1 Week-wise Summary
4.2 Glimpse of Project Execution
5. Mini Projects
5.1 Handwritten Digits Recognition
5.2 FIFA 20 Player Analysis
5.3 PUBG Match Winner Prediction
5.4 Heart Disease Prediction
5.5 Outputs
6. Conclusion
CHAPTER 1:
Aspect: Details
Organization Name: Rubixe™
Organization Logo: (logo image)
Established: 2015
Product Specialization:
- Predictive Analytics
- AI-Driven Business Intelligence
- Custom ML and NLP Tools
Key Domains Served:
- Retail
- Finance
- Healthcare
- Manufacturing
Rubixe serves clients across a broad spectrum of industries, reflecting the versatility of its
AI-driven solutions. The company’s industry focus is not limited to one sector; instead, it
includes major domains such as Healthcare, Finance and Banking, Information
Technology, Retail, Manufacturing, Education, Agriculture, and
E-commerce. For instance, in healthcare, Rubixe might develop predictive analytics for
patient data, whereas in finance it could implement AI for fraud detection or algorithmic
trading support. In retail and e-commerce, Rubixe’s solutions could involve personalized
recommendation systems and inventory optimization, while in manufacturing, they might
focus on predictive maintenance and automation of assembly lines. This wide industry
coverage (from startups to large enterprises) indicates Rubixe’s adaptive approach in
applying AI to different contexts. The company’s team includes domain experts who
understand the unique challenges of each sector, ensuring that AI solutions are effective
and relevant to the client’s field. By maintaining a diverse industry focus, Rubixe not only
expands its market reach but also enriches its expertise, as lessons learned in one domain
can often inspire innovative applications in another.
CHAPTER 2:
This chapter provides an overview of the technical domain central to the internship:
Artificial Intelligence and Data Science. It covers the fundamental concepts of AI and
data science, discusses some of their real-world applications, and reflects on the future
scope of these fields. Understanding the domain is crucial, as the internship work involved
applying AI and data science techniques to solve practical problems in various projects.
The following sections define Artificial Intelligence and Data Science, outline key
application areas, and consider emerging trends and future developments.
2.3 Applications
AI and data science have a wide array of applications across different industries, many of
which are becoming integral to how modern organizations operate. A few notable
application areas include:
Healthcare: AI and data science are used for diagnosing diseases, predicting
patient outcomes, and personalizing treatment plans. For example, machine
learning models can analyze medical images (X-rays, MRIs) to detect anomalies or
predict the likelihood of conditions such as cancer or heart disease. In this
internship’s Heart Disease Prediction project, the application of data science in
healthcare was evident, as a model was developed to predict cardiac risk based on
patient data. Such models can assist doctors in early detection of high-risk patients,
potentially saving lives by enabling proactive care.
Finance: In the financial industry, AI algorithms detect fraudulent transactions,
assess credit risk, automate trading strategies, and provide customer service through
chatbots. Data science enables banks and financial firms to analyze vast quantities
of transactions and client data to find patterns indicative of fraud or to segment
customers for tailored services. The ability of AI to learn from historical financial
data helps in making predictions about market trends or default probabilities,
thereby supporting more informed and efficient financial decision-making.
Retail and E-commerce: Businesses leverage AI for recommendation systems
(suggesting products to customers based on their browsing and purchase history),
inventory management, and demand forecasting. Data science techniques analyze
customer behavior data to optimize pricing strategies and marketing campaigns.
For instance, an AI system can predict which products are likely to be popular in
the next season and ensure they are stocked appropriately.
Transportation and Automotive: A high-profile application of AI is in
autonomous vehicles (self-driving cars) where computer vision and deep learning
are used to navigate roads and make real-time decisions. Additionally, AI optimizes
logistics and supply chain operations by predicting the fastest delivery routes and
minimizing fuel consumption.
Manufacturing: AI-driven predictive maintenance systems analyze sensor data
from machinery to predict failures before they occur, thus avoiding costly
downtime. Robotics powered by AI also enable automation of complex
manufacturing tasks with precision and adaptability.
Entertainment and Sports Analytics: Streaming services and media companies
use data science to analyze viewer preferences and recommend content. In sports,
AI and data analysis are used to evaluate player performance and devise game
strategies. One of the internship projects, the FIFA 20 Player Analysis, mirrors real-
world sports analytics where player data is examined to glean insights on
performance and value.
These examples barely scratch the surface; AI and data science are also transforming
education (through personalized learning platforms), agriculture (through crop and soil
monitoring), and government (through smart cities and policy planning), among others. In
every application, the goal is to harness data to make better decisions, automate tasks, and
create intelligent behavior in software or machines. The projects completed during the
internship each correspond to a real-world application of AI: computer vision in the digit
recognition task, sports analytics in the FIFA data task, game outcome prediction in the
PUBG task, and medical risk prediction in the heart disease task. This demonstrates the
versatility of AI and data science techniques across domains.
2.4 Future Scope
The future of Artificial Intelligence and Data Science is both exciting and expansive. As
computational power grows and algorithms become more sophisticated, AI systems are
expected to achieve even higher levels of performance and autonomy. One emerging trend
is the development of more generalized AI models (such as large language models and
advanced neural networks) that can perform a wider range of tasks and learn with less
human supervision. These models are increasingly capable of understanding context and
generating human-like responses, which could revolutionize fields like customer service,
content creation, and beyond.
In the realm of data science, the volume of data generated globally is increasing
exponentially (often referred to as the data deluge). This creates a growing opportunity to
extract insights, but also necessitates advancements in big data technologies, cloud
computing, and data engineering to handle and process data efficiently. The future will
likely see data science integrated even more into business decision processes, with real-
time analytics and dashboards guiding strategic and operational choices in organizations.
Another important aspect of the future of AI is its integration with IoT (Internet of Things)
and edge computing. As more devices become smart and interconnected, AI algorithms
will be deployed on distributed devices (from smartphones to industrial sensors) to provide
instant insights and automation at the source of data. This could enable smarter homes,
smarter cities, and responsive industrial systems.
In terms of societal impact, AI and data science are poised to significantly influence the
job market and skill requirements. There is an increasing demand for professionals skilled
in AI/ML, and educational curricula are evolving to include these topics from early stages.
However, the rise of AI also brings challenges that will shape its future: ethical
considerations and governance. Issues such as data privacy, algorithmic bias, and
transparency of AI decisions are under scrutiny. Future AI systems will need to be
developed with fairness and accountability in mind, leading to the growth of fields like AI
ethics and explainable AI.
Overall, the scope of AI and data science is set to expand into virtually every field.
Continuous research and development are making AI solutions more powerful and
accessible. For a practitioner like an intern entering this field, it means the learning is
ongoing—new tools, techniques, and best practices emerge regularly. The projects done
during the internship provide a foundation, but staying updated will be crucial. In the near
future, one can expect more automation of routine tasks, more accurate predictive models
improving decision-making, and AI tackling complex problems (like climate modeling or
drug discovery) that were previously beyond reach. The trajectory suggests that AI and
data science will remain at the forefront of technological innovation, driving progress in
the coming years.
CHAPTER 3:
A variety of software tools, programming platforms, and libraries were used throughout
the internship to accomplish the tasks and projects. As the internship work centered on data
analysis and machine learning, the tools chosen are standard in the data science community.
This chapter describes the main tools and technologies employed, namely the Python
programming language, the Jupyter Notebook environment, and several important Python
libraries such as Pandas, NumPy, Scikit-learn, etc. Each of these played a critical role in
enabling the intern to implement solutions efficiently and effectively.
3.1 Python
Python was the primary programming language used during the internship. Python is a
high-level, interpreted language known for its simplicity and readability, which makes it
particularly well-suited for data science and AI projects. One of the key advantages of
Python is its extensive collection of libraries and frameworks for scientific computing and
machine learning (for example, Pandas for data manipulation, or TensorFlow/PyTorch for
deep learning). Python’s syntax is concise and clear, allowing developers to prototype ideas
quickly with relatively few lines of code. During the internship projects, Python was used
for tasks such as loading and preprocessing datasets, implementing machine learning
algorithms, and evaluating model performance. The language’s interactive nature
(especially when used in notebooks) helped in iteratively developing solutions: the intern
could write a portion of code, run it, inspect the output, and adjust as needed in a rapid
development cycle. Additionally, Python’s strong community support and documentation
meant that any issues encountered could be resolved by referencing existing resources.
Overall, Python’s role was foundational, providing the programming backbone for all
analyses and experiments conducted in the internship.
3.2 Jupyter Notebook
The Jupyter Notebook was the primary development environment for the projects. Jupyter
Notebook is an open-source web-based interactive platform that allows developers to
combine live code, visualizations, and narrative text in a single document. This format was
extremely useful for the internship, as it facilitated a literate programming approach: one
can execute code step by step and document the thought process alongside the results. Each
project was organized in a dedicated Jupyter notebook, as per the internship requirements.
Using notebooks, the intern performed data exploration by writing code to, for example,
display the first few rows of a dataset or plot distributions, and then immediately viewed
the results (tables, charts) inline. The notebooks also made it convenient to tune machine
learning models and record observations; for instance, the intern could run multiple training
experiments in sequence and compare outcomes all within one notebook environment.
Another advantage is the ease of presentation—completed notebooks with outputs and
markdown explanations can serve as final reports or records of the work. This was helpful
for sharing progress with mentors and for final submission, as the notebook contained both
the code and the interpretation of results (including any challenges faced and how they
were addressed). The interactive and incremental nature of Jupyter notebooks significantly
enhanced productivity and clarity during the internship.
3.3 Python Libraries
Several specialized Python libraries were utilized to carry out data science tasks efficiently.
The major libraries and frameworks used during the internship include:
Pandas: This library was used for data manipulation and analysis. Pandas provides
the DataFrame structure, which is excellent for handling tabular data (similar to
spreadsheets or SQL tables). With Pandas, the intern could easily load datasets
(from CSV files, for example), clean data (handle missing values, filter rows, create
new columns), and perform aggregations or summary statistics. Throughout the
projects, Pandas was essential for preparing data before feeding it into machine
learning models—such as selecting relevant features for the FIFA player data or
merging information in the PUBG match data.
NumPy: NumPy is the fundamental package for scientific computing with Python,
providing support for large, multi-dimensional arrays and matrices, along with a
collection of mathematical functions to operate on these arrays. In the context of
this internship, NumPy was frequently used under the hood (many other libraries
like Pandas and Scikit-learn are built on NumPy arrays for performance). Directly,
it was used for tasks such as numerical computations, generating random numbers
(e.g., splitting data into random train/test sets before using library functions), and
implementing any algorithmic logic that required array manipulations. NumPy's efficient
array operations underpin much of the numerical work in the other libraries.
Keras (with TensorFlow): Keras is a high-level neural network API that
provides a user-friendly way to define and train deep learning models. While the
internship primarily focused on classical machine learning via Scikit-learn, the use
of Keras for the digit classification task introduced the intern to the basics of deep
learning (such as constructing neural network layers and using an optimizer to
adjust weights based on training data).
In summary, these libraries formed the toolkit that enabled effective handling of all aspects
of the projects—from reading and exploring data (Pandas, NumPy) to building predictive
models (Scikit-learn, TensorFlow) and visualizing results (Matplotlib, Seaborn). Mastery
of these tools is fundamental for any data science project, and the internship provided
ample opportunity to apply and strengthen skills in using each of them.
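To illustrate how these libraries interlock in a typical workflow, here is a minimal, self-contained sketch; the data is synthetic and is not one of the internship datasets:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Build a small synthetic tabular dataset with NumPy and Pandas
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
# The label depends on the features, so a model can learn it
df["label"] = (df["feature_a"] + df["feature_b"] > 0).astype(int)

# Split, train, and evaluate with Scikit-learn
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["label"], test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

The same load-clean-split-fit-evaluate pattern recurs, at larger scale, in each of the mini projects described in Chapter 5.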
CHAPTER 4:
TASKS PERFORMED
The internship was structured over 15 weeks, and each week had specific objectives and
deliverables. This section provides a week-wise summary of the tasks performed, the
learning activities, and the progress made throughout the internship. The work gradually
progressed from introductory orientation and training in the initial weeks to more complex
project work in the later weeks. By breaking down the experience week by week, it
becomes clear how the intern’s responsibilities evolved and how the mini projects were
distributed across the internship timeline. The summary below highlights key activities and
accomplishments for each week.
CHAPTER 5:
MINI PROJECTS
This chapter provides detailed descriptions of the four mini projects carried out during the
internship. Each project was a substantial practical exercise in applying data science and
machine learning techniques to a specific problem domain. The projects are presented in
the order they were undertaken:
For each project, the problem statement is outlined, followed by the approach and tools
used, and finally the outcomes and insights gained. The narrative also touches on any
significant challenges faced and how they were addressed. These projects illustrate the
application of theoretical knowledge in real scenarios and demonstrate the learning
progression of the intern.
Problem Statement: The goal of this project was to develop a model that can accurately
recognize handwritten digits (0 through 9) from images. This is a classic problem in
computer vision and machine learning, often approached using the MNIST dataset, which
contains thousands of 28x28 pixel grayscale images of handwritten digits. The task was
two-fold: first, perform an exploratory data analysis on the digit image data, and second,
build and compare multiple classification models to determine which provides the best
performance.
Approach: The intern began by exploring the MNIST dataset. Summary statistics like the
count of images per digit (which is typically uniform in MNIST) were checked, and sample
images of each digit were visualized to understand the variation in handwriting styles. Each
image in the dataset is essentially 784 features (pixels) when flattened, with each pixel
intensity as a value. Recognizing that dealing with all pixels directly can be
computationally intensive, the intern implemented efficient data handling using
NumPy arrays. No significant data cleaning was required, as MNIST is a well-prepared
dataset, but normalization of pixel values (scaling from 0-255 to 0-1) was done to aid
certain algorithms.
For the modeling part, three types of classifiers were built and evaluated: a K-Nearest
Neighbors (KNN) classifier, a Support Vector Machine (SVM), and a feed-forward neural
network. The neural network consisted
of an input layer (784 nodes), one or two hidden layers with a reasonable number
of neurons (e.g., 128 neurons each) using ReLU activation, and an output layer of
10 neurons (one for each digit class) using softmax activation. The network was
trained for several epochs on the training data, using a portion of it as a validation
set to monitor performance.
Throughout these experiments, the intern used the same training and test splits for fair
comparison, and tracked the accuracy of each model on the test set. Cross-validation was
also used for the non-neural models to ensure results were consistent across different
subsets of data.
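The model comparison described above can be sketched with Scikit-learn alone, using its small built-in 8x8 digits dataset as a stand-in for full 28x28 MNIST (so there are 64 pixel features rather than 784), and an MLPClassifier in place of the Keras network:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Scikit-learn's 8x8 digits dataset stands in for MNIST here
X, y = load_digits(return_X_y=True)
X = X / 16.0  # normalize pixel intensities to the 0-1 range, as in the project

# Use the same split for every model, for a fair comparison
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "MLP": MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

On this smaller dataset all three models score highly, which mirrors the project's finding that even simple models perform well on digit recognition.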
Results: All models achieved a high degree of accuracy, which is expected given that
MNIST is a well-researched problem where even simple models perform reasonably. The
KNN model, while straightforward, achieved a decent accuracy (around 95%) but was the
slowest at prediction time, since it had to compute distances to many training points for
each new image. The SVM model improved on accuracy, reaching around 97-98% test
accuracy, and proved to be a strong candidate. The neural network model also reached
about 98% accuracy on the test set after tuning (for instance, training for 20 epochs, using
an appropriate optimizer like Adam, and perhaps adding a dropout layer to avoid
overfitting). In the end, the neural network slightly outperformed the SVM by a small
margin in accuracy, and importantly, it produced very fast predictions once trained (since
classification is just a series of matrix multiplications). Therefore, the neural network was
suggested as the best model for deployment, given its combination of accuracy and
efficiency.
Model Comparison and Insights: The project required not just finding the best model but
also documenting the comparison. The intern noted that the SVM was effective but training
it on the full dataset took significant time and memory. The KNN was simple but not
scalable to real-time use with large datasets. The neural network required tuning of
hyperparameters like the number of neurons and epochs, but once tuned, generalized very
well to unseen data. One interesting observation was how the models made mistakes: by
examining some of the images that were misclassified by each model, the intern found that
certain digits that were written in an unusual way (for example, a sloppy "5" that looked
like a "6") confused even the best model. This highlighted the inherent challenge in
handwriting recognition and the importance of large diverse training data.
Challenges: A key challenge encountered was choosing the right hyperparameters without
overfitting. For instance, if the neural network was made too large (too many hidden
neurons or layers), it could memorize the training images and perform slightly worse on
test images. The intern addressed this by using validation data to guide model complexity
and by employing regularization techniques (such as dropout). Another challenge was
computational resources: training the neural network and SVM on a standard laptop was
time-consuming, so the intern utilized minibatch training for the neural net and used only
a subset of data for initial SVM tuning. Documentation of these challenges and how they
were mitigated was included in the project report as required.
In conclusion, the Handwritten Digits Recognition project was successful, achieving near
state-of-the-art performance on the MNIST dataset. It provided the intern with hands-on
experience in computer vision and classification modeling. The project demonstrated how
different algorithms can be applied to the same problem and reinforced the importance of
model evaluation and selection based on both performance and practicality.
5.1.1 Outputs
5.2 FIFA 20 Player Analysis
Problem Statement: The FIFA 20 Player Analysis project was centered on deriving
insights from a comprehensive dataset of football (soccer) players. The dataset spanned
players featured in EA Sports’ FIFA 20 game, including various attributes such as age,
nationality, overall skill rating, potential rating, wage, value, and a host of skill metrics
(like speed, shooting, passing, defending, etc.). The project had multiple objectives:
Perform a thorough data analysis to discover key trends (for example, identify
which countries produce the most professional players in the dataset).
Investigate the relationship between player attributes and age (to determine the
typical peak age of player performance/improvement).
Examine the salary (or value) differences among offensive player positions (strikers
vs. right-wingers vs. left-wingers) to see which position tends to be valued the most.
Apply clustering to group players based on their attributes, in order to find natural
groupings (which could correspond to player types or positions). This project was
less about predictive modeling and more about exploratory analysis and
unsupervised learning (clustering) to make sense of a real-world sports data set.
Approach: The intern’s approach began with data cleaning and exploration. The FIFA
dataset, usually provided as a CSV, was loaded into a Pandas DataFrame. The intern
checked for missing values or anomalies; for instance, some player entries might have
missing wage information or could have a value of 0 for certain skills if not applicable. In
such cases, decisions were made on whether to fill, drop, or ignore those fields. Units were
standardized (heights might be given in inches in the raw data but were converted to
centimeters for consistency, as noted in the dataset documentation).
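The cleaning steps can be sketched as follows; the column names (`height_in`, `wage_eur`) are assumed for illustration and are not necessarily the actual FIFA 20 column names:

```python
import pandas as pd

# Hypothetical slice of the FIFA player data (column names are assumptions)
players = pd.DataFrame({
    "short_name": ["Player A", "Player B", "Player C"],
    "height_in": [70.0, None, 74.0],   # height in inches, with a missing entry
    "wage_eur": [50000, 0, 120000],
})

# Fill the missing height with the median rather than dropping the row
players["height_in"] = players["height_in"].fillna(players["height_in"].median())

# Standardize units: inches -> centimeters
players["height_cm"] = players["height_in"] * 2.54

# Treat a wage of 0 as missing information rather than a real salary
players["wage_eur"] = players["wage_eur"].replace(0, pd.NA)
print(players[["short_name", "height_cm", "wage_eur"]])
```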
To find the top countries producing players, the intern grouped the data by
nationality and counted the number of players per country. This resulted in a
ranking of countries. A bar chart was created to visualize the top 10 countries. As
expected, countries with large football talent pools (like England, Spain,
Germany, Argentina, Brazil, France, etc.) featured in the top list, confirming
real-world expectations that those nations produce many professional players.
Investigating player improvement with age involved analyzing the overall rating
vs. age. The intern plotted age on the x-axis and overall rating on the y-axis for all
players. A trend emerged: younger players (in their teens and early 20s) usually
have lower overall ratings (since they are still developing), and ratings improve as
age increases, peaking in the late 20s. After around 30 years of age, many players’
ratings begin to decline. To pinpoint an approximate “peak age,” the intern
computed the average overall rating for players in each age and observed the
maximum. The analysis suggested that players tend to reach their peak performance
(highest overall ability) around 27-29 years old, after which the average rating
either plateaus or decreases. This aligns with common sports science knowledge
that late 20s are a footballer’s prime years.
For the salary comparison among offensive positions, the dataset had information
on player positions (often multiple positions, but the primary position was
considered). The intern filtered players by position: Strikers (ST), Left Wingers
(LW), and Right Wingers (RW). The average wage of players in each of these
categories was calculated, or alternately, a distribution of wages for each category
was plotted. The insight drawn was that strikers often command high wages,
frequently being the focal point of a team’s attack and often in high demand.
However, exceptional wingers (like world-class left or right wingers) also have
very high wages, so the data might show significant overlap. In many team settings,
the difference was not huge, but if averaged across the dataset, strikers showed a
slightly higher mean wage. The intern described this finding with the caveat that
wage can be influenced by many factors (like overall rating and club wealth)
beyond just the position.
Clustering analysis: The intern selected a subset of attributes to cluster players.
Using too many attributes (FIFA has dozens per player) could complicate
clustering, so focus was on key performance attributes that define play style, such
as pace (speed), shooting, passing, dribbling, defending, and physical abilities.
These attributes were normalized (to ensure, for example, that the scale of shooting
did not dominate the other attributes) before the clustering algorithm was applied.
Results: The analysis produced the following findings:
Offensive Position Salaries: The wage analysis suggested that while strikers
generally earn the most (given their crucial role in scoring), the difference between
top strikers and top wingers was not vast. It was noted that market value is often
more tied to overall skill and star power than just position – for instance, a superstar
right winger could earn more than an average striker. The intern concluded that
among attacking roles, strikers slightly edge out others in average value, but each
position has elite players with very high wages.
Player Clusters: The clustering provided a neat segmentation of players. It
effectively grouped players by playing style and strengths. This could be useful for
things like identifying what type of player a club needs (for example, if a club needs
a fast dribbler, they look at players from the cluster characterized by high pace and
dribbling). The intern provided example players from each cluster to illustrate, like
“Cluster 1 (speedy wingers) included players such as X and Y, who are known for
their acceleration and crossing ability,” etc. The clusters mostly aligned with known
positions but also highlighted outliers (like an unusually fast defender might appear
in a cluster with midfielders).
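A minimal sketch of the clustering step, assuming K-Means (the report does not name the specific algorithm) and synthetic attribute data in place of the real player table:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for six key attributes (pace, shooting, passing,
# dribbling, defending, physical) of two broad player types
rng = np.random.default_rng(7)
attackers = rng.normal(loc=[85, 80, 75, 82, 35, 65], scale=5, size=(150, 6))
defenders = rng.normal(loc=[65, 40, 60, 55, 82, 80], scale=5, size=(150, 6))
X = np.vstack([attackers, defenders])

# Normalize so no single attribute dominates the distance computation
X_scaled = StandardScaler().fit_transform(X)

# K-Means with k=2 recovers the two underlying player types
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_
print("Cluster sizes:", np.bincount(labels))
```

In practice the number of clusters k would be chosen by inspecting the resulting groupings (or a criterion such as the elbow method) rather than being known in advance.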
Challenges: Working with the FIFA dataset had its challenges. One was the sheer number
of features – with over 80 attributes, dimensionality reduction or careful feature selection
was needed for clustering to be meaningful. The intern overcame this by focusing on the
most relevant features for the questions at hand. Another challenge was that some data
points could skew analysis (for example, if a particular country had an excessively large
number of low-tier players, it might top the count but those players might be of lower
quality; hence, the intern also looked at average player ratings by country to complement
the count).
In summary, the FIFA 20 Player Analysis project allowed the intern to practice data
analysis skills on a rich dataset and extract insights akin to what a sports analyst or football
club might be interested in. The project did not result in a single predictive model, but
rather a collection of findings and visualizations that tell a story about the data. These
findings were well-documented, and the intern gained experience in handling real-world
data complexities, as well as presenting analytical results clearly.
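The exploratory steps (country counts and peak age) can be sketched as follows; the column names `nationality`, `age`, and `overall` are assumed stand-ins for the real dataset's columns, and the sample is tiny and synthetic:

```python
import pandas as pd

# Tiny synthetic sample with the columns the analysis relied on
df = pd.DataFrame({
    "nationality": ["England", "Spain", "England", "Brazil", "Spain", "England"],
    "age": [21, 28, 28, 34, 28, 24],
    "overall": [70, 88, 85, 80, 84, 76],
})

# Top countries by player count
top_countries = df["nationality"].value_counts().head(10)
print(top_countries)

# Approximate "peak age": the age whose mean overall rating is highest
peak_age = df.groupby("age")["overall"].mean().idxmax()
print("Peak age in this sample:", peak_age)
```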
5.3 PUBG Match Winner Prediction
Problem Statement: The PUBG Match Winner Prediction project was focused on
analyzing game data from the popular online multiplayer game PlayerUnknown’s
Battlegrounds (PUBG). The primary objective was to create a model that predicts a
player’s (or team’s) likelihood of winning a match (often represented by a variable like
“winPlacePerc,” which is a normalized ranking outcome of the match). In addition, the
project sought to determine which in-game factors most influence the probability of
winning. The dataset contained extensive match statistics for players or teams in many
PUBG matches, including features such as number of kills, damage dealt to opponents,
distance traveled on foot or by vehicle, number of survival items used (heals, boosts), and
more. By analyzing and modeling this data, the intern aimed to capture how gameplay
metrics translate into winning chances, essentially teaching a model to predict match
outcomes from player performance stats.
Approach: The intern’s approach combined exploratory data analysis with predictive
modeling:
Model Evaluation: The regression models were scored with Mean Absolute Error
(MAE) or Root Mean Squared Error (RMSE), as predicting the exact rank can
be noisy, but minimizing average error is a good target. If the intern also attempted
classification (like predicting top-10 placement vs not), then accuracy and F1-score
were looked at.
Feature Importance Analysis: After training the best model (Random Forest or
XGBoost), the intern extracted feature importances. This gave a ranking of which
features contributed most to the prediction. Typically, one would expect features
like kills, walkDistance, and damageDealt to be very important. Also, heals and
boosts (items used) often indicate a long survival and thus impact outcome. The
intern documented these, as they directly answer the “important factors” part of the
problem.
Results: The model achieved a reasonable level of accuracy in predicting win outcomes.
For example, the Random Forest model might have an RMSE of around 0.05-0.06 in
predicting winPlacePerc (on a 0-1 scale), which translates to an average error of about 5-
6% in rank prediction, or an MAE of say 0.04 (4%). In more intuitive terms, if the true
placement of a player is 1st (100% winPlacePerc), the model might predict something like
0.95 (95%), or if a player placed 50th percentile, the model might predict 0.46 or 0.55, etc.
This is quite good given the inherent randomness (a player with good stats could still lose
due to a single mistake at the end, etc.). If classification was attempted for, say, “will this
team win the match (yes/no)”, the accuracy might have been in the high 80s to 90% range,
but note that in a large match of 100 players, predicting “no” for everyone yields 99%
accuracy trivially (since only one wins), so classification would need careful interpretation
(hence regression is more informative).
Kills: This was one of the top predictors. The more kills a player (or team) has, the
more likely they survived longer and eliminated competition, which correlates with
winning. The model recognized this, assigning a high importance to the kills
feature.
The intern also highlighted an interesting insight: while kills and damage are critical, a
player doesn’t need the absolute highest kills to win—strategic play and survival reflected
by distance and healing can sometimes compensate. For example, a player with moderate
kills but who stayed safe and healed at the right times could still win, which the model
accounts for via those features.
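The feature-importance extraction can be sketched with a Random Forest on synthetic data shaped like the PUBG features; the relationship between the features and winPlacePerc below is fabricated for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for per-player PUBG stats; here winPlacePerc is driven
# by kills and walkDistance, loosely mirroring the report's findings
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "kills": rng.poisson(2, n),
    "damageDealt": rng.gamma(2.0, 100.0, n),
    "walkDistance": rng.gamma(2.0, 800.0, n),
    "heals": rng.poisson(1, n),
})
y = (0.1 * X["kills"] + 0.0003 * X["walkDistance"]
     + rng.normal(0, 0.05, n)).clip(0, 1)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Because `heals` has no influence on the synthetic target, it receives a near-zero importance, which is the kind of signal the intern used to rank real features.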
Challenges: One major challenge was the size of the data. PUBG data sets can be millions
of rows. The intern had to sample data or use computational resources efficiently (like
using vectorized operations and perhaps leveraging the GPU for XGBoost if available).
Another challenge was multicollinearity between features; for instance, killPoints,
rankPoints (legacy ranking metrics in some PUBG data) correlate with kills and placement.
The intern decided to drop or ignore some of these to not leak information (some datasets
have a variable winPoints which correlates directly with winPlacePerc, which if used
would trivially let the model cheat; so it was excluded). Ensuring a proper train-test split
by match (to avoid data leak where one team’s info could influence prediction of another
in the same match if not careful) was also handled. The intern learned the importance of
domain context—realizing, for example, that in a squad match, individual stats alone might
be insufficient and one might need team aggregates. For simplicity, if the project kept to
solo match data, this issue is minimized.
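The match-aware split mentioned above can be done with scikit-learn's GroupShuffleSplit, which keeps every row from a given match on the same side of the split. A minimal sketch on a toy frame (the matchId values are made up):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame: rows are players, grouped by the match they played in
df = pd.DataFrame({
    "matchId": ["m1", "m1", "m2", "m2", "m3", "m3", "m4", "m4"],
    "kills":   [3, 0, 5, 1, 2, 2, 0, 4],
    "winPlacePerc": [1.0, 0.2, 1.0, 0.4, 0.8, 0.6, 0.1, 1.0],
})

# Split whole matches into train/test so no match straddles the boundary
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["matchId"]))

train_matches = set(df.loc[train_idx, "matchId"])
test_matches = set(df.loc[test_idx, "matchId"])
print(train_matches, test_matches)  # disjoint sets of matches
```

Because the groups are disjoint, no information from a test match can leak into training through teammates or opponents from the same game.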
In conclusion, the PUBG Match Winner Prediction project successfully demonstrated how
data science can be applied to gaming analytics. The intern built a model that can
reasonably predict outcomes and, more importantly, extracted the key factors that drive
success in the game. These factors (staying alive, securing kills, and continuously looting
and moving) resonate with actual player strategies, thereby validating the model’s findings.
The project enhanced the intern’s skills in handling large datasets, feature engineering, and
interpreting machine learning models in terms of real-world behavior.
5.3.1 Outputs
5.4 Heart Disease Prediction
Problem Statement: The Heart Disease Prediction project dealt with a medical dataset
and aimed to create a machine learning model to predict whether a person has heart disease
based on various health measurements. The dataset (a heart disease dataset, possibly
derived from sources like the UCI Heart Disease dataset or a Kaggle cardiovascular
dataset) included features such as age, sex, chest pain type, resting blood pressure,
cholesterol level, fasting blood sugar, resting ECG results, maximum heart rate achieved,
exercise-induced angina, ST depression (oldpeak) and others, along with a target label
indicating the presence of heart disease. In addition to building a predictive model, the
project required providing suggestions to the hospital on how to act on these predictions to
prevent life-threatening events. In essence, the intern needed to not only classify patients
as at risk or not, but also interpret the model to highlight risk factors and recommend
preventive measures.
Next, categorical features were encoded: for instance, thal categories (“normal”,
“fixed defect”, “reversible defect”) were converted to numeric dummy variables.
The chest pain type (4 categories) might also be one-hot encoded or treated as an
ordinal variable.
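The dummy-variable encoding described above is a one-liner with pandas. This sketch uses toy patient rows (the column names match the dataset description, the values are invented):

```python
import pandas as pd

# Toy patient records with the categorical columns mentioned above
df = pd.DataFrame({
    "thal": ["normal", "fixed defect", "reversible defect", "normal"],
    "chest_pain_type": [0, 2, 1, 3],
    "age": [54, 61, 45, 67],
})

# One-hot encode the categoricals; drop_first removes one redundant dummy
encoded = pd.get_dummies(df, columns=["thal", "chest_pain_type"], drop_first=True)
print(encoded.columns.tolist())
```

With drop_first=True, a k-category column becomes k-1 dummy columns, avoiding perfect collinearity in models like logistic regression.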
Exploratory Analysis: The intern likely examined how each feature correlates
with the presence of heart disease. For example, they might have observed that
patients with heart disease in the data tend to have higher frequencies of certain
attributes: a higher proportion of them have exercise-induced angina, higher resting
blood pressure on average, lower max heart rate achieved (since heart disease can
limit exercise capacity), etc. Perhaps a quick check: the average age of those with
heart disease might be higher than those without, and the proportion of males might
be higher (as historically some heart datasets have more male patients). This
provided a sanity check that the data is consistent with medical knowledge (e.g.,
older age and certain symptoms correlate with disease).
Model Development: The intern initially tried a Logistic Regression model,
which is a common choice for binary medical outcomes. Logistic regression gives
a probabilistic output (probability of heart disease) and allows one to gauge
influence of each feature through its coefficients. The model was trained using a
portion of data and evaluated on a validation set. Given a moderate dataset size
(often heart disease datasets have a few hundred rows; some newer ones have
more), cross-validation (like 5-fold CV) was likely used to make the most of the
data and ensure stability of results.
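The logistic-regression-with-5-fold-CV workflow described above looks roughly like this sketch; the data here is synthetic (make_classification), standing in for the real ~300-row heart dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the small heart dataset (13 features, ~300 rows)
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

# Scale features, then fit logistic regression; 5-fold CV stabilizes the estimate
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging over five folds is what makes a few hundred rows go further: every row is used for both training and validation, just never at the same time.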
o Decision Tree: A decision tree could find rule-based patterns, like “if
oldpeak > 1.5 and exercise_induced_angina = yes, then disease = yes”. The
intern tuned the tree to avoid overfitting (setting a max depth).
o Random Forest: This ensemble of trees often improves accuracy. The
intern trained a Random Forest classifier and used its out-of-bag error or
cross-validation score to gauge performance.
Ultimately, the Random Forest or logistic regression likely emerged as top choices
– Random Forest for raw accuracy, logistic for interpretability. Suppose the
Random Forest achieved an accuracy of around 88% on validation, and logistic
regression was around 85%. The intern might choose the Random Forest as the
final model to maximize predictive performance.
Evaluation: To properly evaluate, the intern looked at the confusion matrix from
the best model: how many patients were correctly identified as having heart disease
vs missed (false negatives) vs false alarms (false positives). In a medical context,
catching as many true cases as possible (high recall/sensitivity) is important, even if it
means some false positives (which can later be tested further by doctors). The precision is
also considered because too many false positives could overwhelm resources or
cause unnecessary alarm. The intern reported metrics like Accuracy, Precision,
Recall, and F1-score for the positive class (heart disease). For example, the model
might have a recall of 90% (meaning it catches 90% of true heart disease cases) and
a precision of 85% (meaning 85% of those it flags as high risk indeed have the
disease).
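The confusion-matrix breakdown and the per-class metrics described above can be produced directly with scikit-learn. The labels below are hypothetical, chosen so the recall works out to the 90% figure used as an example:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical validation labels and predictions (1 = heart disease)
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# ravel() unpacks the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(classification_report(y_true, y_pred, target_names=["no disease", "disease"]))
```

Here 9 of 10 true cases are caught (recall 0.9) at the cost of 2 false alarms, which is exactly the trade-off the paragraph above discusses.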
Insights and Recommendations: Beyond the numbers, the intern interpreted the
model. Using the logistic regression coefficients or Random Forest feature
importances, the significant predictors were identified. Common results: oldpeak,
thal (defect type), max_heart_rate_achieved, exercise_induced_angina,
chest_pain_type, and num_major_vessels are usually top indicators. For instance,
the model might clearly show that having exercise-induced angina and ST
depression (oldpeak) greatly increase the probability of heart disease. High
cholesterol and high resting blood pressure are also contributors but in some
datasets they are not as statistically significant as the exercise and ECG-related
variables. The intern highlighted that:
o Exercise-induced angina (angina during exercise): Strong predictor –
patients experiencing this have a much higher risk of heart disease.
o Oldpeak (ST depression): Every unit increase in oldpeak significantly
raises the odds of heart disease, indicating stress-induced ischemia.
o Chest pain type: Certain types of chest pain (especially typical angina pain)
were strongly associated with heart condition.
o Max heart rate achieved: Lower values (i.e., inability to reach high heart
rates during exercise) often corresponded with presence of heart disease.
o Major vessels colored (from angiography): The more vessels that showed
blockage, the higher the risk.
These align with medical intuition and thus add credibility to the model’s decisions.
Based on these findings, the intern formulated recommendations for the hospital:
o Patients who are older and exhibit multiple risk factors (like high blood
pressure, high cholesterol, abnormal ECG results such as ST depression,
and exercise-induced angina) should be prioritized for further
cardiovascular evaluation (such as a stress test or angiogram if not already
done) and preventive treatment.
o The hospital could integrate the model’s output into their routine check-up
process. If a patient’s data is fed into the model and the predicted risk is
above a certain threshold, an alert could be generated for doctors to conduct
more thorough examinations for heart disease.
o Lifestyle modifications should be suggested for at-risk patients: e.g.,
encourage patients with high cholesterol and high blood pressure (even if
no current heart disease) to adopt diet changes, exercise (as tolerated), and
possibly medication (like statins or antihypertensives) to mitigate these risk
factors. The model’s identification of these factors reinforces their importance.
o Regular screening: For patients flagged by the model (or even those with
borderline risk), schedule regular follow-ups. If the dataset covered some
demographic findings (maybe it showed men over 50 with certain factors
are high risk), then ensuring such demographics are regularly screened
would be a recommendation.
The intern likely phrased these suggestions in a general manner since specific
medical advice would be determined by doctors, but the idea was to show how the
data insights can lead to action: focusing on modifiable risk factors (blood pressure,
cholesterol, blood sugar if diabetic), recommending exercise or smoking cessation
if that data were present (some heart datasets include a smoking indicator), etc.
Results: The predictive model achieved solid performance in identifying heart disease. For
example, on a held-out test set, it might have achieved around 85-90% accuracy, with a
sensitivity around 0.9 (meaning it caught 90% of those with disease) and specificity around
0.8-0.85 (correctly ruling out 80-85% of those without disease). This is within a useful
range for a screening tool (though in practice, medical models may undergo further
validation). The key risk factors identified by the model matched well-known medical
knowledge, giving trust in the model.
The hospital suggestions were formulated based on these results: emphasizing early
detection and risk factor management. For instance, because the data showed that many
heart disease patients had high blood pressure and cholesterol, the intern suggested that
controlling these through lifestyle or medication can reduce future heart disease incidents.
Additionally, since exercise ECG indicators (like oldpeak and exercise angina) were
crucial, the intern suggested that stress testing could be used more frequently in the
hospital’s routine for patients above a certain risk profile.
Challenges: One challenge was dealing with the relatively small size of some heart disease
datasets (some have only ~300 patients). This can make it hard for complex models to
generalize. The intern used cross-validation and possibly combined data from multiple
sources if available to improve robustness. Another challenge is class imbalance if present
(if, say, only 30% had disease); this was handled by ensuring performance metrics beyond
accuracy were used, and possibly using techniques like oversampling the minority class or
adjusting classifier thresholds to improve sensitivity. Medical data also can have
multicollinearity (e.g., age and some other factors might correlate), but tree-based models
handle that inherently, whereas logistic regression required checking VIFs or stepwise
selection. The intern navigated these by focusing on the most informative variables and
verifying the model against known medical criteria.
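One of the imbalance-handling techniques mentioned above, lowering the classifier's decision threshold to favor sensitivity, can be sketched as follows; the data is synthetic with roughly 30% positives, standing in for the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~30% positive class), not the real patients
X, y = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Compare the default 0.5 threshold against a lowered one favoring recall
proba = clf.predict_proba(X_te)[:, 1]
recall_default = recall_score(y_te, proba >= 0.5)
recall_low = recall_score(y_te, proba >= 0.3)
print(f"recall @0.5: {recall_default:.2f}, recall @0.3: {recall_low:.2f}")
```

Lowering the threshold can only flag more patients as positive, so recall never decreases; the price is extra false positives, which in a screening setting is usually acceptable.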
In conclusion, the Heart Disease Prediction project allowed the intern to work on a socially
impactful problem, applying data science to healthcare. The project reinforced the
importance of interpretability in models used for critical decisions and demonstrated how
data-driven models can align with domain expertise to provide both accurate predictions
and actionable insights. The intern concluded the project by summarizing that machine
learning can indeed assist clinicians by flagging high-risk patients and by highlighting
which factors to pay attention to, thus contributing to better preventive care and resource
allocation in the hospital.
CHAPTER 6:
CONCLUSION
The fifteen-week internship at Rubixe seamlessly connected academic theory with hands-
on experience in AI and data science, allowing me to apply Python-based workflows, from
data cleaning and visualization to model selection and performance tuning, across a range
of real-world challenges. Under structured mentorship, I learned to document reproducible
analyses in Jupyter Notebooks, balance accuracy versus efficiency when choosing
algorithms, and translate domain knowledge into actionable insights. Highlights include
Handwritten Digit Recognition, Football Player Analytics, PUBG Winner Prediction, and
Heart Disease Risk Modelling. Through these projects, I strengthened my problem-solving
adaptability, project-management skills, and confidence in deploying data-driven
solutions—laying a solid foundation for a future career in AI and Data Science.