Introduction to Data Science and Careers
Unit I
What is Data Science?
Data Science, also known as data-driven science, makes use of scientific methods, processes and systems to extract knowledge or insights from data in various forms, whether structured or unstructured. It uses advanced hardware, programming systems and algorithms to solve problems that involve data, and it is where artificial intelligence is heading.
Data Science is an interdisciplinary field that lets you learn from both organised and unorganised data. With data science, you can turn a business problem into a research project and then apply it as a real-world solution. Its application areas include:
• Gaming Industry
• Health Care
• Predictive Analysis
• Image Recognition
• Recommendation systems
• Finance
• Computer Vision
• Internet Search
• Speech Recognition
• Education
The following are some career paths and job opportunities in data science:
• Data Analyst
• Data Scientist
• Database Administrator
• Data Architect
• Hadoop Engineer
According to IDC, worldwide data will reach 175 zettabytes by 2025. Data Science helps
businesses to comprehend vast amounts of data from different sources, extract useful
insights, and make better data-driven choices. Data Science is used extensively in several
industrial fields, such as marketing, healthcare, finance, banking, and policy work.
• Data is the oil of the modern age. With the proper tools, technologies, and algorithms,
we can leverage data to create a unique competitive edge.
• Data Science may assist in detecting fraud using sophisticated machine learning
techniques.
• You may use sentiment analysis to determine the brand loyalty of your customers. This
helps you to make better and quicker choices.
According to Forbes, the total quantity of data generated, copied, recorded, and consumed in
the globe surged by about 5,000% between 2010 and 2020, from 1.2 trillion gigabytes to 59
trillion gigabytes.
• Companies are being outcompeted by firms that make judgments based on data. For example, Ford posted a loss of $12.6 billion in 2006. Following the loss, the company hired a senior data scientist to manage its data and undertook a three-year turnaround, which ultimately resulted in the sale of almost 2,300,000 automobiles and a profit for 2009 as a whole.
Big Data refers to the vast volumes of data generated at high velocity from a variety of sources.
This data is characterized by the three V's: Volume, Velocity, and Variety.
1. Volume: Big Data involves large datasets that are too complex for traditional data
processing tools to handle. These datasets can range from terabytes to petabytes of
information.
2. Velocity: Big Data is generated in real-time or near real-time, requiring fast processing
to extract meaningful insights.
3. Variety: The data comes in multiple forms, including structured data (like databases),
semi-structured data (like XML files), and unstructured data (like text, images, and
videos).
Data Science is an interdisciplinary field that utilizes scientific methods, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. It
encompasses a variety of techniques from statistics, machine learning, data mining, and big
data analytics.
Data scientists typically perform three core tasks:
1. Analyze: They examine complex datasets to identify patterns, trends, and correlations.
2. Model: Using statistical models and machine learning algorithms, they create
predictive models that can forecast future trends or behaviors.
3. Interpret: They translate data findings into actionable business strategies and
decisions.
The current landscape perspective of data science is characterized by several key trends and
developments:
• AI and Machine Learning: Data science is increasingly intertwined with artificial
intelligence (AI) and machine learning, enabling more advanced predictive analytics
and automation of data-driven processes.
• Big Data: The proliferation of big data continues to shape the data science landscape,
with organizations leveraging large volumes of data from diverse sources to gain
insights and make informed decisions.
• Ethics and Privacy: There is a growing emphasis on ethical considerations and privacy
concerns in data science, with a focus on responsible data usage, transparency, and
compliance with regulations such as GDPR.
• Automation and Scalability: There is a shift towards automating repetitive data tasks
and scaling data science processes through the use of cloud-based platforms and tools,
enabling greater efficiency and agility.
• Continuous Learning and Upskilling: Given the rapid evolution of data science
technologies and methodologies, professionals in the field are increasingly focused on
continuous learning and upskilling to stay abreast of the latest developments.
Data science is a multidisciplinary field that combines statistics, computer science, and
domain expertise to extract insights and knowledge from data. The skills required for data
science can be broadly classified into technical skills, domain expertise, and soft skills.
1. Technical skills:
Data science requires proficiency in programming languages such as Python or R, data
visualization tools like Tableau or Power BI, databases such as SQL, and machine
learning algorithms. Data scientists should have a solid understanding of data
manipulation and analysis techniques, including data cleaning, transformation, and
feature engineering.
2. Domain expertise:
Data scientists should have an understanding of the business domain in which they
work. For example, a data scientist in healthcare should have knowledge of medical
terminologies and healthcare workflows. Similarly, a data scientist in finance should
have an understanding of financial instruments and markets.
3. Soft skills:
Soft skills like communication, collaboration, and problem-solving are essential for a
successful data scientist. Data scientists should be able to communicate complex
technical concepts to non-technical stakeholders in a clear and concise manner. They
should also be able to work collaboratively in a team environment, and have strong
problem-solving skills to identify and solve complex problems.
Data science is an interdisciplinary field that involves using statistical and computational
techniques to extract insights from data. Some of the key skills required for a career in data
science include:
• Data wrangling: the ability to clean, organize, and manipulate large datasets is an
important skill for data preparation.
• Data visualization: the ability to create clear and effective visualizations of data is
important for communicating insights and findings to others.
• Domain knowledge: understanding the specific industry or business context in which data is being analyzed is important for interpreting and applying the insights generated.
1. Math Skills:
• Multivariable Calculus & Linear Algebra: These two subjects are very important as they help us understand the various machine learning algorithms that play an important role in Data Science.
2. Programming Skills:
• Non-Relational Databases: These come in many types; the most commonly used are:
i) Column: Cassandra, HBase
ii) Document: MongoDB, CouchDB
iii) Key-value: Redis, Dynamo
• Machine Learning: This is one of the most important parts of data science and a hot research topic, so new developments are made every year. You should at least know the common algorithms of supervised and unsupervised learning. Many libraries are available in Python and R.
List of Python libraries:
i) Basic libraries: NumPy, SciPy, Pandas, IPython, matplotlib
ii) Libraries for machine learning: scikit-learn, Theano, TensorFlow
iii) Libraries for data mining & natural language processing: Scrapy, NLTK, Pattern
3. Domain Knowledge: Most people overlook this, thinking it is not important, but it is crucial. The whole purpose of data science is to extract useful insights from data so that they benefit a company's business. If you don't understand the business side of your company, such as how its business model works and how you could make it better, your analyses will be of little use to the company.
Linear algebra simplifies the management and analysis of large datasets. It is widely used in data science and machine learning to understand data, especially when there are many features. In this section we'll explore the importance of linear algebra in data science, its key concepts, real-world applications and the challenges learners face.
Linear algebra in data science refers to the use of mathematical concepts involving vectors,
matrices and linear transformations to manipulate and analyse data. It provides useful
algorithms and processes in data science such as machine learning, statistics and big data
analytics. It turns theoretical data models into practical solutions that can be used in real-
world situations. It helps us:
• Use techniques like dimensionality reduction to simplify large datasets while keeping
important patterns.
Below are some important linear algebra topics that are widely used in data science.
1. Vectors
A vector is an ordered array of numbers that represents a point or direction in space. In data science, vectors are used to represent data points, features or coefficients in machine learning models.
• Vectors
• Vector Operations
• Vector Norms
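The vector operations and norms listed above can be tried directly in NumPy. A minimal sketch; the sample values are illustrative:

```python
import numpy as np

# Two feature vectors, e.g. [height_cm, weight_kg] for two people
a = np.array([170.0, 65.0])
b = np.array([165.0, 59.0])

print(a + b)              # element-wise addition
print(3 * a)              # scalar multiplication
print(np.dot(a, b))       # dot product
print(np.linalg.norm(a))  # Euclidean (L2) norm
```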
2. Matrices
• Matrix
• Matrix Operations
• Matrix Transpose
• Identity Matrix
• Zero Matrix
• Sparse matrix
• Inverse of a Matrix
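The matrix operations above can be sketched in NumPy as well (the matrix values are illustrative):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

print(A.T)                # transpose
print(np.eye(2))          # 2x2 identity matrix

A_inv = np.linalg.inv(A)  # inverse (exists because det(A) != 0)
print(A @ A_inv)          # approximately the identity, confirming the inverse
```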
3. Matrix Decomposition
Matrix decomposition is a process where we break down a complex matrix into simpler, more manageable parts. Common methods include LU decomposition, QR decomposition and Singular Value Decomposition.
• LU Decomposition
• QR Decomposition
• Cholesky Decomposition
• Eigenvalue Decomposition
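As an illustrative sketch, one of these decompositions (SVD) can be computed and verified in NumPy; the matrix below is made up:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Singular Value Decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(S)  # singular values, largest first

# Rebuild A from its factors to verify the decomposition
A_rebuilt = U @ np.diag(S) @ Vt
print(np.allclose(A, A_rebuilt))
```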
4. Determinants
The determinant of a square matrix is a single number that tells us whether the matrix is invertible. It is important when we need to find the best possible answer or when we are solving systems of linear equations.
• Determinants
• Properties of Determinants
• Relationship with Invertibility of Matrices
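A quick NumPy check of how the determinant signals invertibility (the matrices are illustrative):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])  # det = 4*3 - 2*1 = 10, so A is invertible
B = np.array([[2.0, 4.0],
              [1.0, 2.0]])  # det = 0, so B is singular (not invertible)

print(np.linalg.det(A))
print(np.linalg.det(B))
```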
5. Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors are used in various data science algorithms such as PCA for dimensionality reduction and feature extraction.
• Applications
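A minimal NumPy sketch of computing an eigenpair and verifying the defining relation (the matrix is illustrative):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)

# Verify A v = lambda v for the first eigenpair
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))
```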
6. Vector Spaces and Subspaces
A vector space is a set of vectors that can be scaled and added together; subspaces are subsets of a vector space, used for understanding data structures and transformations in machine learning.
• Vector Spaces
• Linear Independence
• Linear Transformation
• Span
• Column Space
• Null Space
• Gaussian Elimination
8. Orthogonality
Two vectors are considered orthogonal when their dot product is zero. Data science makes use of orthogonality for selecting features during dimensionality reduction and for establishing whether model components operate independently.
• Orthogonal Vectors
• Orthogonal Matrices
• Orthogonal Projections
• Gram-Schmidt Process
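A small NumPy sketch of orthogonality and orthogonal projection (the vectors are illustrative):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([-2.0, 1.0])

# A dot product of zero means the vectors are orthogonal
print(np.dot(u, v))

# Orthogonal projection of w onto u: (w.u / u.u) * u
w = np.array([3.0, 4.0])
proj = (np.dot(w, u) / np.dot(u, u)) * u
print(proj)

# The residual w - proj is orthogonal to u
print(np.dot(w - proj, u))
```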
PCA is a dimensionality reduction technique that transforms data into a smaller set of variables that capture the most significant variance. It's used for feature extraction and noise reduction.
• Dimensionality Reduction
Optimization means finding the best possible solution to a problem. Linear algebra applies this concept to problems involving least-squares regression as well as linear regression and other machine learning models.
• Cost Functions
• Objective Functions
• Linear Programming
• Simplex Method
• Newton's Method
• Lagrange Multipliers
• Image Processing and Computer Vision - Linear Algebra allows processing of images
through various transformations and compression techniques as well as extracting
features from datasets.
• Clustering and Classification - The algorithms k-means clustering and Support Vector
Machines (SVM) use Linear Algebra to group or classify data points effectively.
Learning linear algebra presents challenges to data science students for several reasons:
• The learning curve feels steep because beginners find topics such as matrix inversion and eigenvalue decomposition hard to handle.
• Learners face confusion when looking at the many linear algebra applications spread across different disciplines.
PCA is commonly used for data preprocessing for use with machine learning algorithms. It
helps to remove redundancy, improve computational efficiency and make data easier to
visualize and analyze especially when dealing with high-dimensional data.
PCA uses linear algebra to transform data into new features called principal components. It finds these by calculating eigenvectors (directions) and eigenvalues (importance) from the covariance matrix. PCA selects the top components with the highest eigenvalues and projects the data onto them to simplify the dataset.
Note: It prioritizes the directions where the data varies the most because more variation =
more useful information.
Imagine you’re looking at a messy cloud of data points like stars in the sky and want to simplify
it. PCA helps you find the "most important angles" to view this cloud so you don’t miss the big
patterns. Here’s how it works step by step:
Different features may have different units and scales, like salary vs. age. To compare them fairly, PCA first standardizes the data so that each feature has:
• A mean of 0
• A standard deviation of 1
Z = (X − μ) / σ
where X is a feature value, μ is the feature's mean and σ is its standard deviation.
Next, PCA calculates the covariance matrix to see how features relate to each other, i.e. whether they increase or decrease together. The covariance between two features x₁ and x₂ is:
cov(x₁, x₂) = Σᵢ₌₁ⁿ (x₁ᵢ − x̄₁)(x₂ᵢ − x̄₂) / (n − 1)
where x₁ᵢ and x₂ᵢ are the i-th observations, x̄₁ and x̄₂ are the feature means and n is the number of data points.
PCA identifies new axes where the data spreads out the most:
• 1st Principal Component (PC1): The direction of maximum variance (most spread).
• 2nd Principal Component (PC2): The next best direction, perpendicular to PC1 and so
on.
These directions come from the eigenvectors of the covariance matrix and their importance
is measured by eigenvalues. For a square matrix A an eigenvector X (a non-zero vector) and
its corresponding eigenvalue λ satisfy:
AX = λX
This means that multiplying A by the eigenvector X only stretches or shrinks X by the factor λ; the direction of X does not change.
After calculating the eigenvalues and eigenvectors PCA ranks them by the amount of
information they capture. We then:
1. Select the top k components that capture most of the variance, e.g. 95%.
This means we reduce the number of features (dimensions) while keeping the important
patterns in the data.
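The steps above (standardize, compute the covariance matrix, eigendecompose, project) can be sketched from scratch in NumPy; the two-feature dataset below is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 points, 2 strongly correlated features
x1 = rng.normal(0, 1, 100)
x2 = 2 * x1 + rng.normal(0, 0.3, 100)
X = np.column_stack([x1, x2])

# Step 1: standardize each feature to mean 0, std 1
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features
C = np.cov(Z, rowvar=False)

# Step 3: eigendecomposition, sorted by eigenvalue (variance captured), descending
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Step 4: project onto the top principal component (2D -> 1D)
pc1_scores = Z @ vecs[:, 0]
print("share of variance captured by PC1:", vals[0] / vals.sum())
```

Because the two features are highly correlated, PC1 alone captures nearly all the variance, which is exactly why projecting onto it loses little information.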
In the above image the original dataset has two features "Radius" and "Area" represented by
the black axes. PCA identifies two new directions: PC₁ and PC₂ which are the principal
components.
• These new axes are rotated versions of the original ones. PC₁ captures the maximum
variance in the data meaning it holds the most information while PC₂ captures the
remaining variance and is perpendicular to PC₁.
• The spread of data is much wider along PC₁ than along PC₂. This is why PC₁ is chosen
for dimensionality reduction. By projecting the data points (blue crosses) onto PC₁ we
effectively transform the 2D data into 1D and retain most of the important structure
and patterns.
Hence, PCA uses a linear transformation that preserves the most variance in the data using the fewest dimensions. It involves the following steps:
We import the necessary libraries like pandas, numpy, scikit-learn, seaborn and matplotlib to visualize results.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
We make a small dataset with three features Height, Weight, Age and Gender.
data = {
    'Height': [170, 165, 180, 175, 160, 172, 168, 177, 162, 158],
    'Weight': [65, 59, 75, 68, 55, 70, 62, 74, 58, 54],
    'Age': [30, 25, 35, 28, 22, 32, 27, 33, 24, 21],
    # 'Gender' labels below are illustrative; the original values were lost
    'Gender': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
}
df = pd.DataFrame(data)
print(df)
Output:
Dataset
X = df.drop('Gender', axis=1)
y = df['Gender']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # scale the features only, not the label
• We reduce the data from 3 features to 2 new features called principal components.
These components capture most of the original information but in fewer dimensions.
• We split the data into 70% training and 30% testing sets.
• We train a logistic regression model on the reduced training data and predict gender
labels on the test set.
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
The confusion matrix compares actual vs predicted labels. This makes it easy to see where
predictions were correct or wrong.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
Output:
Confusion matrix
y_numeric = pd.factorize(y)[0]
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_numeric)
plt.colorbar(label='Target classes')
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_numeric)
plt.colorbar(label='Target classes')
plt.tight_layout()
plt.show()
PCA Algorithm
• Left Plot Before PCA: This shows the original standardized data plotted using the first
two features. There is no guarantee of clear separation between classes as these are
raw input dimensions.
• Right Plot After PCA: This displays the transformed data using the top 2 principal
components. These new components capture the maximum variance often showing
better class separation and structure making it easier to analyze or model.
2. Noise Reduction: Eliminates components with low variance, enhancing data clarity.
3. Data Compression: Represents data with fewer components, reducing storage needs and speeding up processing.
4. Outlier Detection: Identifies unusual data points by showing which ones deviate
significantly in the reduced space.
2. Data Scaling Sensitivity: Requires proper scaling of data before application, or the results may be misleading.
3. Information Loss: Reducing dimensions may lose some important information if too
few components are kept.
4. Assumption of Linearity: Works best when relationships between variables are linear
and may struggle with non-linear data.
6. Risk of Overfitting: Using too many components or working with a small dataset might
lead to models that don't generalize well.
Statistics is like a toolkit we use to understand and make sense of information. It helps us
collect, organize, analyze and interpret data to find patterns, trends and relationships in the
world around us.
From analyzing scientific experiments to making informed business decisions, statistics plays
an important role across many fields such as science, economics, social sciences, engineering
and sports. Whether it's calculating the average test score in a classroom or predicting election
outcomes based on a sample, statistics gives us tools to make data-driven decisions.
Types of Statistics
There are commonly two types of statistics, which are discussed below:
1. Descriptive Statistics: Descriptive statistics helps us simplify and organize big chunks of data, making large amounts of data easier to understand.
2. Inferential Statistics: Inferential statistics uses data from a sample to draw conclusions or make predictions about the larger population it came from.
Types of Data
1. Qualitative Data: This data is descriptive. For example: she is beautiful, he is tall.
2. Quantitative Data: This is numerical information. For example: a horse has four legs. It is further divided into:
• Discrete Data: It takes only fixed, countable values, such as the number of students in a class.
• Continuous Data: It is not fixed but has a range of values and can be measured, such as height or weight.
Basics of Statistics

Parameters | Definition | Formulas
Mean (μ) (sample/population) | Average of the entire group | μ = Σx / N
Standard Deviation | Measures how spread out the data is from the mean | Population: σ = √((1/N) Σ(xᵢ − μ)²); Sample: s = √((1/(n−1)) Σ(xᵢ − x̄)²)
Class Interval (CI) | Range of values in a group | CI = Upper Limit − Lower Limit
1. Mean: The mean is calculated by summing all values present in the sample or population and dividing by the total number of values.
Formula: Mean (μ) = Sum of Values / Number of Values
2. Median: The median is the middle value of a dataset when arranged from lowest to highest or highest to lowest; to find the median, the data must be sorted. For an odd number of data points, the median is the middle value; for an even number, it is the average of the two middle values.
3. Mode: The most frequently occurring value in the sample or population is called the mode.
Measure of Dispersion
• Range: Range is the difference between the maximum and minimum values of the
Sample.
• Variance (σ²): Variance measures how spread out values are from the mean by quantifying the dispersion around the mean.
Formula: σ² = Σ(X − μ)² / n
• Standard Deviation (σ): Standard Deviation is the square root of variance. The
measuring unit of S.D. is same as the Sample values' unit. It indicates the average
distance of data points from the mean and is widely used due to its intuitive
interpretation.
Formula: σ = √(σ²) = √(Σ(X − μ)² / n)
• Interquartile Range (IQR): The range between the first quartile (Q1) and the third quartile (Q3). It is less sensitive to extreme values than the range. To compute the IQR, arrange the data in ascending order, then take Q1 and Q3 as the medians of the lower and upper halves of the dataset.
Formula: IQR = Q3 − Q1
• Mean Absolute Deviation: The average of the absolute differences between each data point and the mean. It provides a measure of the average deviation from the mean.
• Coefficient of Variation (CV): The ratio of the standard deviation to the mean, expressed as a percentage:
CV = (σ / μ) × 100
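As an illustrative sketch, the IQR can be computed with NumPy's percentile function (the sample data is made up for the example):

```python
import numpy as np

data = [2, 4, 4, 5, 7, 9, 11, 12, 13, 20]

q1 = np.percentile(data, 25)  # first quartile
q3 = np.percentile(data, 75)  # third quartile
iqr = q3 - q1
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
```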
Measure of Shape
1. Skewness
Skewness measures the asymmetry of a distribution around its mean: a positive skew has a longer right tail, while a negative skew has a longer left tail.
2. Kurtosis
Kurtosis quantifies the degree to which a probability distribution deviates from the normal
distribution. It assesses the "tailedness" of the distribution, indicating whether it has heavier
or lighter tails than a normal distribution. High kurtosis implies more extreme values in the
distribution, while low kurtosis indicates a flatter distribution.
Types of Kurtosis
Measure of Relationship
• Covariance: Covariance measures the degree to which two variables change together.
Cov(x, y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / n
• Correlation: Correlation measures the strength and direction of the linear relationship
between two variables. It is represented by correlation coefficient which ranges from
-1 to 1. A positive correlation indicates a direct relationship, while a negative
correlation implies an inverse relationship. Pearson's correlation coefficient is given by:
ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)
Probability Theory
Bayes Theorem
P(A∣B) = P(B∣A) × P(A) / P(B)
where
• P(A∣B): Probability of event A given that event B has occurred (posterior probability).
• P(B∣A): Probability of event B given that event A has occurred (likelihood).
• P(A): Probability of event A on its own (prior probability).
• P(B): Total probability of event B (evidence).
• Empirical Distribution Function (EDF): Estimates the CDF using observed sample data.
1. Normal Distribution
The normal distribution is a continuous, bell-shaped distribution defined by its mean μ and standard deviation σ.
Formula: f(X | μ, σ) = (1 / (σ√(2π))) e^(−0.5 ((X − μ)/σ)²)
Empirical Rule (68-95-99.7 Rule): ~68% data within 1σ, ~95% within 2σ, ~99.7% within 3σ.
Central Limit Theorem: The Central Limit Theorem (CLT) states that, regardless of the shape of the original population distribution, the sampling distribution of the sample mean becomes approximately normal as the sample size grows large.
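The CLT can be checked empirically with a small simulation; the exponential population below is illustrative (its mean and standard deviation are both 1):

```python
import numpy as np

rng = np.random.default_rng(42)

# Heavily skewed population: exponential with mean 1
population = rng.exponential(scale=1.0, size=100_000)

# Sampling distribution of the mean for samples of size n = 50
sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]

# The means cluster near the population mean, with spread ~ sigma / sqrt(n)
print(np.mean(sample_means))  # close to 1.0
print(np.std(sample_means))   # close to 1 / sqrt(50) ~ 0.141
```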
2. Student t-distribution
f(t) = [Γ((df + 1)/2) / (√(df·π) Γ(df/2))] (1 + t²/df)^(−(df + 1)/2)
3. Chi-square Distribution
f(x) = (1 / (2^(k/2) Γ(k/2))) x^(k/2 − 1) e^(−x/2)
4. Binomial Distribution
The binomial distribution models the number of successes in a fixed number of independent
Bernoulli trials, where each trial has the same probability of success (p).
Formula: P(X = k) = C(n, k) p^k (1 − p)^(n−k), where C(n, k) = n! / (k! (n − k)!)
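A minimal sketch of this PMF in plain Python; the helper name binom_pmf is made up for the example:

```python
from math import comb

# P(X = k) for n independent trials with success probability p
def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 heads in 5 fair coin flips: 10 * 0.5^5 = 0.3125
print(binom_pmf(5, 3, 0.5))
```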
5. Poisson Distribution
The Poisson distribution models the number of events that occur in a fixed interval of time or space. It is characterized by a single parameter (λ), the average rate of occurrence.
Formula: P(X = k) = e^(−λ) λ^k / k!
6. Uniform Distribution
The uniform distribution represents a constant probability for all outcomes in a given range.
Formula: f(X) = 1 / (b − a), for a ≤ X ≤ b
• Bias: The difference between an estimator’s expected value and the true parameter.
Bias(θ̂) = E(θ̂) − θ
Hypothesis Testing
Hypothesis testing makes inferences about a population parameter based on sample statistic.
1. Null Hypothesis (H₀): There is no significant difference or effect.
2. Alternative Hypothesis (H₁): There is a significant difference or effect, i.e. the statement in the null hypothesis can be false.
3. Degrees of freedom: Degrees of freedom (df) in statistics represent the number of values or quantities in the final calculation of a statistic that are free to vary. It is commonly computed as the sample size minus one (n − 1).
• If p ≤ α: reject H₀
• If p > α: fail to reject H₀
• Type I Error: occurs when the null hypothesis is true but the statistical test incorrectly rejects it. It is often referred to as a "false positive" or "alpha error."
• Type II Error: occurs when the null hypothesis is false but the statistical test fails to reject it. It is often referred to as a "false negative."
7. Confidence Intervals: A confidence interval is a range of values that is used to estimate the
true value of a population parameter with a certain level of confidence. It provides a measure
of the uncertainty or margin of error associated with a sample statistic, such as the sample
mean or proportion.
An e-commerce company wants to know if a website redesign affects average user session
time.
Hypotheses:
• H₀: No change (μ_after − μ_before = 0)
• H₁: A change (μ_after − μ_before ≠ 0)
Interpretation: if the p-value falls below the chosen significance level, conclude that the redesign changed the average session time.
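A hedged sketch of how such a comparison might be run with a two-sample t-test in scipy; the session times below are synthetic, generated so that a real difference exists:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical session times (minutes) before and after the redesign
before = rng.normal(loc=5.0, scale=1.5, size=200)
after = rng.normal(loc=6.0, scale=1.5, size=200)

# Two-sample t-test of H0: mu_after - mu_before = 0
t_stat, p_value = stats.ttest_ind(after, before)
print("t =", round(t_stat, 3), "p =", round(p_value, 6))

alpha = 0.05
print("reject H0" if p_value <= alpha else "fail to reject H0")
```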
Statistical Tests
Parametric tests are statistical methods that assume the data follows a normal distribution.
One-Sample Tests:
Z-test: Z = (X̄ − μ) / (σ / √n)
t-test: t = (X̄ − μ) / (s / √n)
Two-Sample Tests:
Z-test: Z = (X̄₁ − X̄₂) / √(σ₁²/n₁ + σ₂²/n₂)
t-test: t = (X̄₁ − X̄₂) / √(s₁²/n₁ + s₂²/n₂)
F-test: F = s₁² / s₂²
Paired t-Test:
t = d̄ / (s_d / √n), where d is the difference between paired observations
ANOVA:
1. One-way ANOVA: Tests the impact of one categorical variable. For the error term, SSE = ΣΣ(xᵢⱼ − x̄ᵢ)² with df = N − k and MSE = SSE / (N − k), where N is the total number of observations and k is the number of groups.
2. Two-way ANOVA: Tests impact of two categorical variables and their interaction
Chi-Squared Test
The chi-squared test is a statistical test used to determine if there is a significant association between two categorical variables. It compares the observed frequencies in a contingency table with the expected frequencies.
Formula: χ² = Σ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ
This test can also be performed on big data with a large number of observations.
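As an illustrative sketch, scipy's chi2_contingency runs this test on a contingency table; the counts below are made up:

```python
from scipy.stats import chi2_contingency

# Contingency table: rows = group, columns = outcome (observed counts)
observed = [[30, 70],
            [45, 55]]

chi2, p, dof, expected = chi2_contingency(observed)
print("chi2 =", round(chi2, 3), "p =", round(p, 4), "dof =", dof)
```

Note that for 2x2 tables scipy applies Yates' continuity correction by default.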
Non-Parametric Test
Non-parametric tests do not make assumptions about the distribution of the data. They are useful when data does not meet the assumptions required for parametric tests.
A/B testing, also known as split testing, is a method used to compare two versions (A and B)
of a webpage, app, or marketing asset to determine which one performs better.
Example: a product manager changes a website's "Shop Now" button color from green to blue to improve the click-through rate (CTR). After formulating null and alternative hypotheses, users are divided into A and B groups and CTRs are recorded. Statistical tests like the chi-square or t-test are applied at a 5% significance level. If the p-value is below 5%, the manager may conclude that changing the button color significantly affects CTR, informing decisions for permanent implementation.
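One way to sketch such an A/B comparison is a two-proportion z-test; the click counts below are hypothetical:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical click data: group A (green button) vs group B (blue button)
clicks_a, users_a = 120, 2400   # CTR 5.0%
clicks_b, users_b = 156, 2400   # CTR 6.5%

p_a, p_b = clicks_a / users_a, clicks_b / users_b
p_pool = (clicks_a + clicks_b) / (users_a + users_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value
print("z =", round(z, 3), "p =", round(p_value, 4))
```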
Regression
Simple linear regression models the relationship between a predictor X and a response Y as Y = α + βX, where:
• α is the intercept
• β is the regression coefficient (slope)
The regression coefficient is a measure of the strength and direction of the relationship between a predictor variable (independent variable) and the response variable (dependent variable):
β = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
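The slope and intercept formulas can be applied directly with NumPy; the data below is illustrative, roughly following y = 2x + 1:

```python
import numpy as np

# Illustrative data roughly following y = 2x + 1 with noise
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares slope and intercept from the formulas above
beta = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha = Y.mean() - beta * X.mean()
print("beta =", round(beta, 3), "alpha =", round(alpha, 3))
```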
Descriptive Statistic
Statistics is the foundation of data science. Descriptive statistics are simple tools that help us
understand and summarize data. They show the basic features of a dataset, like the average,
highest and lowest values and how spread out the numbers are. It's the first step in making
sense of information.
There are three categories for standard classification of descriptive statistics methods, each
serving different purposes in summarizing and describing data. They help us understand:
1. Where the data centers (Measures of Central Tendency)
2. How the data spreads out (Measures of Variability)
3. How often values occur (Frequency Distribution)
Statistical values that describe the central position within a dataset. There are three main
measures of central tendency:
Mean: the sum of observations divided by the total number of observations, i.e. the average.
x̄ = Σx / n
where,
• x = Observations
• n = number of terms
Let's look at an example of how we can find the mean of a dataset using a Python code implementation. Before implementing it, we should have some basic knowledge of numpy and scipy.
import numpy as np

# Sample data (illustrative; chosen to match the output shown)
arr = [4, 8, 10]

# Mean
mean = np.mean(arr)
print("Mean =", mean)
Output
Mean = 7.333333333333333
Mode: The most frequently occurring value in the dataset. It’s useful for categorical data and
in cases where knowing the most common choice is crucial.
import statistics

# Sample data
arr = [1, 2, 2, 3]

# Mode
mode = statistics.mode(arr)
print("Mode =", mode)
Output:
Mode = 2
Median: The median is the middle value in a sorted dataset. If the number of values is odd,
it's the center value, if even, it's the average of the two middle values. It's often better than
the mean for skewed data.
import numpy as np

# Sample data
arr = [1, 2, 3, 4]

# Median
median = np.median(arr)
print("Median =", median)
Output
Median = 2.5
Note: All implementations above use Python's numpy and statistics libraries.
Central tendency measures are the foundation for understanding data distribution and
identifying anomalies. For example, the mean can reveal trends, while the median highlights
skewed distributions.
2. Measure of Variability
Knowing not just where the data centers but also how it spreads out is important. Measures of variability, also called measures of dispersion, help us see the spread or distribution of observations in a dataset. They help in identifying outliers, assessing model assumptions and understanding data variability in relation to the mean. The key measures of variability include:
1. Range: describes the difference between the largest and smallest data points in our dataset. The bigger the range, the greater the spread of data, and vice versa. While easy to compute, the range is sensitive to outliers; it can give a quick sense of the data spread but should be complemented with other statistics.
# Sample data
arr = [1, 2, 3, 4, 5]

# Finding max and min
Maximum = max(arr)
Minimum = min(arr)

Range = Maximum - Minimum
print("Range =", Range)
Output
Range = 4
2. Variance: the average of the squared differences of each value from the mean.
σ² = Σ(x − μ)² / N
where,
• N = number of terms
• μ = mean
import statistics

# Sample data
arr = [1, 2, 3, 4, 5]

# Variance (sample variance)
var = statistics.variance(arr)
print("Var =", var)
Output
Var = 2.5
3. Standard deviation: Standard deviation is widely used to measure the extent of variation
or dispersion in data. It's especially important when assessing model performance (e.g.,
residuals) or comparing datasets with different means.
It is defined as the square root of the variance. It is calculated by finding the mean, subtracting each value from the mean, squaring the results, summing them, dividing by the number of terms, and taking the square root.
σ = √(Σ(x − μ)² / N)
where,
• N = number of terms
• μ = mean
import statistics
arr = [1, 2, 3, 4, 5]
# Sample standard deviation
std = statistics.stdev(arr)
print("Std =", std)
Output
Std = 1.5811388300841898
Variability measures are important in residual analysis to check how well a model fits the data.
A frequency distribution table is a powerful way to summarize how data points are distributed across different categories or intervals. It helps identify patterns, outliers and the overall structure of the dataset, and is often the first step in understanding a dataset before applying more advanced analytical methods or creating visualizations like histograms or pie charts. A frequency table typically shows:
• Frequency counts
• Relative frequencies (percentages)
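A minimal sketch of building frequency counts and relative frequencies with Python's collections.Counter (the grades are illustrative):

```python
from collections import Counter

grades = ['A', 'B', 'B', 'C', 'A', 'B', 'D', 'C', 'B', 'A']

counts = Counter(grades)
total = len(grades)

# Frequency and relative frequency per category
for grade in sorted(counts):
    freq = counts[grade]
    print(f"{grade}: frequency={freq}, relative={freq / total:.0%}")
```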
Statistical Inference
Statistical inference is the process of using data analysis to infer properties of an underlying
distribution of a population. It is a branch of statistics that deals with making inferences about
a population based on data from a sample.
Consider a scenario where you are presented with a bag filled with beans of different
shapes and colours, too big for you to count each bean individually. The task is to
determine the proportion of red-coloured beans without spending much effort and time. This
is how statistical inference works in this context: you pick a small random sample, a
handful, and calculate the proportion of red beans in it. You have used a small subset,
your handful of beans, to draw an inference about a much larger population, the entire bag
of beans.
• Parameter Estimation
• Hypothesis Testing
Parameter Estimation
Parameter estimation is another primary goal of statistical inference. Parameters are
quantified traits or properties of the population you are studying, such as the population
mean or the population variance. Imagine measuring every person in a town just to compute
the mean; this is a daunting, if not impossible, task. Thus, most of the time, we use
estimates.
• Point Estimation
• Interval Estimation
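Both ideas can be sketched with Python's standard library. The sample values below are illustrative, and 1.96 is the usual large-sample z value for 95% confidence:

```python
import math
import statistics

# Hypothetical sample of measurements (illustrative only)
sample = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 5.0]

mean = statistics.mean(sample)     # point estimate of the population mean
s = statistics.stdev(sample)       # sample standard deviation
n = len(sample)

# Interval estimate: mean ± 1.96 * s / sqrt(n) (large-sample 95% CI)
margin = 1.96 * s / math.sqrt(n)
print(f"point estimate: {mean:.3f}")
print(f"95% CI: ({mean - margin:.3f}, {mean + margin:.3f})")
```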
Hypothesis Testing
Hypothesis testing is used to make decisions or draw conclusions about a population based on
sample data. It involves formulating a hypothesis about the population parameter, collecting
sample data, and then using statistical methods to determine whether the data provide
enough evidence to reject or fail to reject the hypothesis.
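A minimal sketch of computing a one-sample t statistic (the fuel-economy numbers are invented; in practice you would compare the statistic against the t distribution or use a statistics library):

```python
import math
import statistics

# Hypothetical sample; H0: population mean is at least 25 (illustrative)
sample = [24.1, 25.3, 23.8, 24.6, 24.9, 23.5, 24.2, 24.7]
mu0 = 25.0

mean = statistics.mean(sample)
s = statistics.stdev(sample)
n = len(sample)

# One-sample t statistic: t = (x̄ - μ0) / (s / √n)
t = (mean - mu0) / (s / math.sqrt(n))
print(f"sample mean = {mean:.3f}, t = {t:.3f}")
# Compare |t| against the critical t value for n - 1 degrees of freedom
```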
There are various methods of statistical inference, some of these methods are:
• Parametric Methods
• Non-parametric Methods
• Bayesian Methods
Parametric Methods
Parametric statistical methods assume that the data are drawn from a population
characterized by a known probability distribution, most commonly the normal distribution,
which allows one to make inferences about the population in question. For example, t-tests
and ANOVA are parametric tests that give accurate results under the assumption that the
data are approximately normally distributed.
Non-Parametric Methods
These are less assumptive, more flexible analysis methods for data that do not follow a
normal distribution. They are also used when one is uncertain about meeting the assumptions
of parametric methods, or when the data are limited or inadequate. Non-parametric tests
include the Wilcoxon signed-rank test and the Kruskal-Wallis test, among others.
• Example: A biologist has collected plant-health data as an ordinal variable; since
the sample is small and the normality assumption is not met, the biologist can use
the Kruskal-Wallis test.
Bayesian Methods
Bayesian statistics is distinct from conventional methods in that it includes prior knowledge
and beliefs. It determines the various potential probabilities of a hypothesis being genuine in
the light of current and previous knowledge. Thus, it allows updating the likelihood of beliefs
with new data.
• Example: Consider a situation where a doctor is investigating a new treatment and has
a prior belief about its success rate. Upon conducting a new clinical trial, the doctor
uses a Bayesian method to update this "prior belief" with the data from the new trials
to estimate the true success rate of the treatment.
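The doctor's update can be sketched with a Beta-Binomial model, a standard conjugate choice for success rates (the prior parameters and trial counts below are made up for illustration):

```python
# Beta-binomial update: prior Beta(a, b) combined with observed trial results
a, b = 8, 2                  # prior belief: roughly an 80% success rate
successes, failures = 14, 6  # hypothetical results from the new clinical trial

# Posterior is Beta(a + successes, b + failures)
a_post, b_post = a + successes, b + failures
posterior_mean = a_post / (a_post + b_post)
print(f"posterior mean success rate: {posterior_mean:.2f}")
```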
• Hypothesis Testing
• Confidence Intervals
• Regression Analysis
Hypothesis Testing
One of the central parts of statistical analysis is hypothesis testing, which draws
inferences about a population from sample data. Hypothesis testing is a structured
technique that involves formulating two opposing hypotheses, choosing a significance level
(alpha), computing a test statistic, and making a decision based on the outcome. Two types
of hypotheses are distinguished: a null hypothesis H0, signifying no significant
difference, and an alternative hypothesis H1 (or Ha), expressing a significant effect or
difference.
• Example: Suppose a car manufacturing company claims that their new car model gives a
mileage of not less than 25 miles/gallon. An independent agency collects data for a
sample of these cars and performs a hypothesis test. The null hypothesis is that the
car does give a mileage of not less than 25 miles/gallon, tested against the
alternative hypothesis that it does not. The sample data are then used to either reject
or fail to reject the null hypothesis.
Confidence Intervals
A confidence interval (CI) gives a range of plausible values for a population parameter,
together with a confidence level, usually 95%. In simpler terms, CIs provide an estimate of
the population value and the level of uncertainty that comes with it.
• Example: A study on health records could report that the 95% CI for average blood
pressure is 120-130. In other words, we can be 95% confident that the average blood
pressure of the whole population lies between 120 and 130.
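A quick simulation illustrates what the 95% actually means: across repeated samples, roughly 95% of intervals constructed this way contain the true mean (all values below are simulated, not from any study):

```python
import math
import random
import statistics

random.seed(0)

# Simulation sketch: how often does a 95% CI capture the true mean?
true_mean, true_sd, n, trials = 125.0, 10.0, 30, 200
hits = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    m = statistics.mean(sample)
    s = statistics.stdev(sample)
    half = 1.96 * s / math.sqrt(n)
    if m - half <= true_mean <= m + half:
        hits += 1
print(f"coverage over {trials} trials: {hits / trials:.2f}")  # close to 0.95
```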
Regression Analysis
Linear regression, at its most basic, examines how a dependent variable Y varies with an
independent variable X. The regression equation, Y = a + bX + e, which describes the
best-fit line through the data points, quantifies this variation. Multiple regression
extends this to the relationship between a dependent variable and two or more independent
variables.
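The coefficients a and b of the best-fit line can be computed with the usual least-squares formulas; a small sketch on made-up data:

```python
# Least-squares fit of Y = a + bX (illustrative data, not from the text)
X = [1, 2, 3, 4, 5]
Y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Slope b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept a = ȳ - b·x̄
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) \
    / sum((x - mean_x) ** 2 for x in X)
a = mean_y - b * mean_x
print(f"Y = {a:.3f} + {b:.3f}X")
```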
Statistical inference has a wide range of applications across various fields. Here are some
common applications:
• Clinical Trials: In medical research, statistical inference is used to analyze clinical trial
data to determine the effectiveness of new treatments or interventions. Researchers
use statistical methods to compare treatment groups, assess the significance of results,
and make inferences about the broader population of patients.
Problem Definition → Data Collection → Data Preprocessing → Exploratory Data Analysis → Modeling →
Evaluation → Deployment → Monitoring
• Problem Definition: What question(s) are you trying to answer? What are the
business / real world objectives, what are success criteria?
• Data Collection: Acquire data from one or more sources. May involve APIs,
databases, scraping, surveys, sensors.
• Data Preprocessing: Cleaning, integrating, transforming, reducing, sometimes
discretizing data to prepare for analysis.
• Exploratory Data Analysis (EDA): Understand data via summary statistics and
visualization; check distributions, relationships, anomalies, missingness.
• Modeling: Choose algorithms, train models.
• Evaluation: Quantify performance, compare models, check whether models meet
criteria.
• Deployment: Putting model into production or decision process.
• Monitoring & Maintenance: Ensuring model continues to perform; dealing with
drift, updating as needed.
2. Data Preprocessing
Preprocessing deals with preparing raw data for analysis and modeling. Raw data is rarely
“clean” or “ready.”
Goal: Fix or remove corrupted, incomplete, incorrectly formatted, or noisy parts of data.
Common Tasks:
Challenges:
• Schema mismatches: Different tables may represent the same concept with
different attribute names or units
• Entity resolution: Matching records referring to the same real-world entity
• Redundancy and contradictions: Conflicting values from different sources
Example:
• You have customer data from two systems: system A has cust_id, first_name, last_name,
DOB; system B has id, name, age, email. Need to map fields, standardize name formats,
decide how to compute age / DOB, unify customer_id.
Goal: Reduce data size (in terms of number of attributes, number of records) while retaining
as much of the important information as possible.
Techniques:
Goal: Change format or scale of data to improve suitability for analysis or modeling.
Tasks include:
Goal: Convert continuous attributes into discrete buckets (bins), which can sometimes
simplify models, help with interpretability, or be required by some algorithms.
Methods:
Summary Statistics
Visualization Tools
• Univariate plots:
o Histogram
o Box plot
o Density plot
o Bar chart (for categorical)
• Bivariate plots:
o Scatter plot
o Correlation heatmap
o Bar plot comparing categories vs numeric summary
• Multivariate / higher-dimensional:
o Pair plot (scatter matrix)
o Principal component plots
o Heatmaps
o Parallel coordinates
• Other visual tools:
o Missingness matrix (to see patterns in missing data)
o Outlier plots
o Time-series plots if data has time component
Example of a Plot
Imagine we have a dataset of housing: features include price, area, num_rooms, age_of_house.
Example Hypothetical:
• TP = 70
• TN = 900
• FP = 30
• FN = 50
Then:
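Plugging these hypothetical counts into the standard metric formulas:

```python
# Metrics from the hypothetical confusion matrix above
TP, TN, FP, FN = 70, 900, 30, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)          # also the true positive rate (TPR)
fpr = FP / (FP + TN)             # false positive rate
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, FPR={fpr:.3f}, F1={f1:.3f}")
```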
As we vary the decision threshold (if the classifier produces probabilities), we get different
(FPR, TPR) pairs → ROC curve.
• AUC (Area Under ROC Curve): summarises overall performance across all thresholds.
AUC = 1 means perfect, 0.5 means random guess for binary classification.
Interpretation: A good ROC is one that bulges up toward the top-left corner.
• Types:
1. Independent two-sample t-test: comparing metric from two different
models on independent folds / datasets
2. Paired t-test: comparing two models on the same dataset / splits (e.g.
cross-validation folds) — accounts for correlation
• Hypotheses:
o Null hypothesis H0: means are equal
o Alternative H1: means are different
• Procedure:
1. Collect metric values (e.g. accuracies) from multiple runs/folds for model A
and model B.
2. Compute t-statistic.
3. Compare with critical t (or compute p-value).
4. If p < α (e.g. 0.05), reject H0 → conclude that one model is significantly
better.
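The paired procedure can be sketched directly on per-fold accuracies (the values below are invented for illustration):

```python
import math
import statistics

# Hypothetical per-fold accuracies for two models on the SAME folds
acc_a = [0.81, 0.79, 0.84, 0.80, 0.82]
acc_b = [0.78, 0.77, 0.80, 0.79, 0.78]

# Paired t-test works on the per-fold differences
diffs = [a - b for a, b in zip(acc_a, acc_b)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)

t = mean_d / (sd_d / math.sqrt(n))
print(f"t = {t:.3f} with {n - 1} degrees of freedom")
# Compare with the critical t value (e.g. 2.776 at α = 0.05, df = 4)
```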
• Missing values: bmi missing for id=2, glucose missing for id=3.
o Strategy: Impute using median or mean (for numeric)
o Or for glucose: maybe predictive imputation using other features.
• Outliers: Suppose a record with bmi = 1000 (typo). Cap or remove.
• Duplicates: Remove duplicate ids if any.
• Suppose we also have another table with patient’s lab reports, which has extra
features like cholesterol, etc. Merge on id. Handle differences in naming, units.
• Feature selection: Maybe features like id are irrelevant. Maybe some lab test is
irrelevant.
• Dimensionality reduction: If there are many lab test features, use PCA to reduce.
• Scale numeric features (age, bmi, glucose) using z-score normalization (subtract
mean, divide by std dev).
• Encode gender as 0/1.
• If glucose is heavily skewed, maybe do log transformation.
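Steps such as z-score scaling and 0/1 encoding can be sketched as follows (the ages and genders below are hypothetical records):

```python
import statistics

# Hypothetical patient records (illustrative only)
ages = [25, 40, 33, 58, 47]
genders = ["M", "F", "F", "M", "F"]

# z-score normalization: subtract the mean, divide by the std dev
mu = statistics.mean(ages)
sd = statistics.stdev(ages)
z_ages = [(a - mu) / sd for a in ages]

# Encode gender as 0/1
gender_codes = [0 if g == "M" else 1 for g in genders]

print([round(z, 2) for z in z_ages])
print(gender_codes)
```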
Step D: Modeling
• Pick a classification model (say logistic regression, random forest). Split data into
training/test sets or do cross-validation.
Step E: Evaluation
1. Introduction
What is machine learning?
Machine learning (ML) is a branch of artificial intelligence (AI) and computer science
focused on building computer systems that learn from data: it uses data and algorithms to
imitate the way that humans learn, gradually improving its accuracy.
The broad range of techniques ML encompasses enables software applications to improve their
performance over time.
Machine learning is an important component of the growing field of data science.
Through the use of statistical methods, algorithms are trained to make classifications or
predictions, and to uncover key insights in data mining projects.
Machine learning algorithms are trained to find relationships and patterns in data. They use
historical data as input to make predictions, classify information, cluster data points, reduce
dimensionality and even help generate new content.
Automation: Machine learning can operate largely autonomously in many fields, without the
need for constant human intervention. For example, robots perform essential process steps
in manufacturing plants.
Finance Industry: Machine learning is growing in popularity in the finance industry. Banks
are mainly using ML to find patterns inside the data but also to prevent fraud. Ex. PayPal
Government organization: The government makes use of ML to manage public safety and
utilities. Take the example of China with its massive face recognition. The government uses
Artificial intelligence to prevent jaywalking.
Healthcare industry: Healthcare was one of the first industries to use machine learning with
image detection. Drug discovery, Personalized treatment
Marketing: AI is used broadly in marketing thanks to abundant access to data. Before the
age of mass data, researchers developed advanced mathematical tools like Bayesian analysis
to estimate the value of a customer. With the boom of data, marketing departments rely on
AI to optimize customer relationships and marketing campaigns.
Retail industry: Machine learning is used in the retail industry to analyze customer behavior,
predict demand, and manage inventory. It also helps retailers to personalize the shopping
experience for each customer by recommending products based on their past purchases and
preferences. Ex. Netflix, YouTube
Machine learning has become essential for solving problems across numerous areas, such
as
1. Computational finance (credit scoring, algorithmic trading)
2. Computer vision (facial recognition, motion tracking, object detection)
3. Computational biology (DNA sequencing, brain tumor detection, drug discovery)
4. Automotive, aerospace, and manufacturing (predictive maintenance)
5. Natural language processing (voice recognition)
Forward Pass: In the forward pass, the machine learning algorithm takes in input data and
produces an output; depending on the algorithm, the model computes its predictions.
Loss Function: The loss function, also known as the error or cost function, is used to evaluate
the accuracy of the predictions made by the model. The function compares the predicted output
of the model to the actual output and calculates the difference between them. This difference
is known as error or loss. The goal of the model is to minimize the error or loss function by
adjusting its internal parameters.
Model Optimization Process: The model optimization process is the iterative process of
adjusting the internal parameters of the model to minimize the error or loss function. This is
done using an optimization algorithm, such as gradient descent.
If the model can fit the training data points better, the weights are adjusted to reduce
the discrepancy between the known examples and the model's estimates. The algorithm repeats
this "evaluate and optimize" process, updating weights autonomously, until a threshold of
accuracy is met.
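This evaluate-and-optimize loop can be illustrated with gradient descent on a one-parameter model y = w·x (toy data; w = 2 is the assumed true weight):

```python
# Minimal "evaluate and optimize" loop: gradient descent on MSE for y = w * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by the (assumed) true weight w = 2

w, lr = 0.0, 0.01
for _ in range(500):
    # forward pass: predictions with the current weight
    preds = [w * x for x in xs]
    # loss gradient: d(MSE)/dw = (2/n) * Σ (pred - y) * x
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # optimization step: move the weight against the gradient
    w -= lr * grad
print(f"learned w ≈ {w:.3f}")
```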
Study the Problems: The first step is to study the problem. This step involves understanding
the business problem and defining the objectives of the model.
Data Collection: When the problem is well-defined, we can collect the relevant data required
for the model. The data could come from various sources such as databases, APIs, or web
scraping.
Data Preparation: Once the problem-related data is collected, it is a good idea to check
the data properly and put it into the desired format so that the model can use it to find
the hidden patterns. This can be done in the following steps:
Data cleaning
Data Transformation
Exploratory Data Analysis and Feature Engineering; split the dataset for training and
testing.
Model Selection: The next step is to select the appropriate machine learning algorithm that is
suitable for our problem. This step requires knowledge of the strengths and weaknesses of
different algorithms. Sometimes we use multiple models and compare their results and select
the best model as per our requirements.
Model building and Training: After selecting the algorithm, we have to build the model.
In traditional machine learning, building the model is straightforward, often just a matter
of a few hyperparameter tunings.
In deep learning, we have to define the layer-wise architecture, including input and output
sizes, the number of nodes in each layer, the loss function, the gradient-descent
optimizer, etc.
After that, the model is trained using the preprocessed dataset.
Model Evaluation: Once the model is trained, it can be evaluated on the test dataset to
determine its accuracy and performance using different techniques like classification report,
F1 score, precision, recall, ROC Curve, Mean Square error, absolute error, etc.
Model Tuning: Based on the evaluation results, the model may need to be tuned or optimized
to improve its performance. This involves tweaking the hyperparameters of the model.
Deployment: Once the model is trained and tuned, it can be deployed in a production
environment to make predictions on new data. This step requires integrating the model into an
existing software system or creating a new system for the model.
Monitoring and Maintenance: Finally, it is essential to monitor the model’s performance in
the production environment and perform maintenance tasks as required. This involves
monitoring for data drift, retraining the model as needed, and updating the model as new data
becomes available.
Classification: These refer to algorithms that address classification problems where the output
variable is categorical; for example, yes or no, true or false, male or female, etc. Real-world
applications of this category are evident in spam detection and email filtering.
Some known classification algorithms include the Random Forest Algorithm, Decision Tree
Algorithm, Logistic Regression Algorithm, and Support Vector Machine Algorithm.
Regression: Regression algorithms handle regression problems, where the output variable is
continuous and is predicted from the input variables. Examples include weather prediction,
market trend analysis, etc.
Popular regression algorithms include the Simple Linear Regression Algorithm, Multivariate
Regression Algorithm, Decision Tree Algorithm, and Lasso Regression.
Unsupervised Learning:
Unsupervised learning refers to a learning technique that’s devoid of supervision. Here, the
machine is trained using an unlabeled dataset and is enabled to predict the output without any
supervision. An unsupervised learning algorithm aims to group the unsorted dataset based on
the input’s similarities, differences, and patterns.
For example, consider an input dataset of images of a fruit-filled container. Here, the images
are not known to the machine learning model. When we input the dataset into the ML model,
the task of the model is to identify the pattern of objects, such as color, shape, or differences
seen in the input images and categorize them.
Unsupervised machine learning is further classified into two types:
Clustering: The clustering technique refers to grouping objects into clusters based on
parameters such as similarities or differences between objects. For example, grouping
customers by the products they purchase.
Some known clustering algorithms include the K-Means Clustering Algorithm, Mean-Shift
Algorithm, and DBSCAN Algorithm. (Principal Component Analysis and Independent Component
Analysis are, strictly speaking, dimensionality-reduction techniques often used alongside
clustering.)
Association: Association learning refers to identifying typical relations between the variables
of a large dataset. It determines the dependency of various data items and maps associated
variables. Typical applications include web usage mining and market data analysis.
Popular algorithms obeying association rules include the Apriori Algorithm, Eclat Algorithm,
and FP- Growth Algorithm.
Semi-supervised learning
Semi-supervised learning comprises characteristics of both supervised and unsupervised
machine learning. It uses the combination of labeled and unlabeled datasets to train its
algorithms. Using both types of datasets, semi-supervised learning overcomes the drawbacks
of the options mentioned above.
Consider the example of a college student. A student learning a concept under a teacher's
supervision is analogous to supervised learning; a student working through material
entirely on their own corresponds to unsupervised learning; and semi-supervised learning
sits in between, with the student revising concepts on their own after some guidance from
the teacher.
Reinforcement learning
Reinforcement learning is a feedback-based process. Here, the AI component automatically
takes stock of its surroundings by trial and error, takes action, learns from experiences,
and improves performance. The component is rewarded for each good action and penalized for
every wrong move. Thus, the reinforcement learning component aims to maximize the rewards
by performing good actions.
Unlike supervised learning, reinforcement learning lacks labeled data, and the agents learn via
experiences only. Consider video games. Here, the game specifies the environment, and each
move of the reinforcement agent defines its state.
Regression:
Machine Learning Regression is a technique for investigating the relationship between
independent variables or features and a dependent variable or outcome.
Solving regression problems is one of the most common applications for machine learning
models, especially in supervised machine learning.
Regression is used to identify patterns and relationships within a dataset, which can then
be applied to new and unseen data. This makes regression a key element of machine learning
in finance, where it is often leveraged to help forecast portfolio performance or stock
prices and trends.
Common uses for machine learning regression models include:
Types of regression:
Simple Regression
Used to predict a continuous dependent variable based on a single independent variable. Simple
linear regression should be used when there is only a single independent variable.
Multiple Regression
Used to predict a continuous dependent variable based on multiple independent variables.
Multiple linear regression should be used when there are multiple independent variables
Simple Linear Regression
Multiple linear regression
Logistic regression
Regression Algorithms
There are many different types of regression algorithms, but some of the most common include:
Linear regression: Linear regression is one of the simplest and most widely used statistical
models. This assumes that there is a linear relationship between the independent and dependent
variables. This means that the change in the dependent variable is proportional to the change
in the independent variables.
Support vector regression (SVR): Support vector regression (SVR) is a regression algorithm
based on the support vector machine (SVM). SVM is primarily used for classification tasks,
but it can also be applied to regression. SVR fits a function that keeps predictions within
a margin of tolerance (epsilon) around the actual values, penalizing only the errors that
fall outside that margin.
Decision tree regression: Decision tree regression is a type of regression algorithm that
builds a decision tree to predict the target value. A decision tree is a tree-like structure that
consists of nodes and branches. Each node represents a decision, and each branch represents
the outcome of that decision. The goal of decision tree regression is to build a tree that can
accurately predict the target value for new data points.
Ridge Regression: Ridge regression is a type of linear regression used to prevent
overfitting by adding an L2 penalty on the coefficients. Overfitting occurs when the model
learns the training data too well and is unable to generalize to new data.
Lasso regression: Lasso regression is another type of linear regression used to prevent
overfitting. It does this by adding a penalty term to the loss function that shrinks some
weights and sets others exactly to zero.
Random forest regression: Random forest regression is an ensemble method that combines
multiple decision trees to predict the target value. Ensemble methods are a type of machine
learning algorithm that combines multiple models to improve the performance of the overall
model. Random forest regression works by building a large number of decision trees, each of
which is trained on a different subset of the training data. The final prediction is made by
averaging the predictions of all of the trees.
Characteristics of Regression
Here are the characteristics of the regression:
Continuous Target Variable: Regression deals with predicting continuous target variables
that represent numerical values. Examples include predicting house prices, forecasting sales
figures, or estimating patient recovery times.
Error Measurement: Regression models are evaluated based on their ability to minimize
the error between the predicted and actual values of the target variable. Common error metrics
include mean absolute error (MAE), mean squared error (MSE), and root mean squared error
(RMSE).
Model Complexity: Regression models range from simple linear models to more complex
nonlinear models. The choice of model complexity depends on the complexity of the
relationship between the input features and the target variable.
Logistic Regression
Purpose:
Used for binary classification problems (Yes/No, 0/1).
Equation:
[
P(Y=1) = \frac{1}{1 + e^{-(β_0 + β_1X)}}
]
Output:
A probability between 0 and 1, which is thresholded (commonly at 0.5) to assign the class
label.
Applications:
• Spam detection
• Disease diagnosis (e.g., diabetes prediction)
• Customer churn prediction
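The equation above can be evaluated directly. The coefficients here are assumptions chosen for illustration, not fitted values:

```python
import math

def sigmoid(z):
    """Logistic function mapping any real z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Illustrative coefficients (b0, b1 are assumptions, not fitted values)
b0, b1 = -4.0, 0.05
x = 100                     # e.g. a hypothetical glucose reading
p = sigmoid(b0 + b1 * x)    # P(Y = 1 | X = x)
label = 1 if p >= 0.5 else 0
print(f"P(Y=1) = {p:.3f} -> class {label}")
```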
Classifiers Overview
A classifier is an algorithm that maps input data to a category (label).
Common classifiers:
k-Nearest Neighbors (k-NN)
Classify a data point based on the majority label among its k nearest neighbors.
Steps:
1. Choose value of k.
2. Calculate distance (Euclidean, Manhattan, etc.).
3. Find k nearest points.
4. Assign class based on majority vote.
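The steps above can be sketched in a few lines (the training points and labels below are a toy dataset):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest (Euclidean) neighbours."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = [label for _, label in nearest]
    return Counter(votes).most_common(1)[0][0]

# Tiny illustrative dataset: (point, label)
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 7), "B"), ((6, 7), "B")]

print(knn_predict(train, (2, 2)))   # near the "A" group
print(knn_predict(train, (6, 5)))   # near the "B" group
```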
Advantages:
Disadvantages:
k-Means Clustering
Type: Unsupervised Learning
Purpose:
Groups data into k clusters such that each point belongs to the nearest centroid.
Algorithm Steps:
Applications:
• Customer segmentation
• Image compression
• Document clustering
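The standard algorithm alternates between assigning points to the nearest centroid and recomputing centroids; a minimal plain-Python sketch on toy data:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign to nearest centroid, then recompute centroids."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest centroid
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # update step: centroid = mean of its cluster (keep old if empty)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```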
Decision Tree
Concept:
A tree-like structure where each internal node represents a test on an attribute, each branch
represents an outcome, and each leaf node represents a class label.
Algorithm:
Key Metrics:
• Entropy (H):
[
H(S) = -\sum p_i \log_2 p_i
]
• Information Gain (IG):
[
IG = H(parent) - \sum \frac{|child|}{|parent|}H(child)
]
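These metrics can be computed directly; a small sketch using a hypothetical 50/50 parent split and one candidate split:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p_i log2 p_i over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical parent node: 5 "yes", 5 "no" (entropy of a 50/50 split is 1.0)
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]          # one candidate split's children
right = ["yes"] + ["no"] * 4

ig = entropy(parent) \
     - (len(left) / len(parent)) * entropy(left) \
     - (len(right) / len(parent)) * entropy(right)
print(f"IG = {ig:.3f}")
```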
Advantages:
• Easy to interpret.
• Handles numerical and categorical data.
Disadvantages:
• Prone to overfitting.
• Small changes in data may alter the tree.
Where:
Types:
Applications:
Ensemble Methods
Concept:
Types:
Random Forest
Concept:
Steps:
1. Randomly select data samples (with replacement).
2. Build decision trees on each sample.
3. For classification — use majority voting.
4. For regression — take average prediction.
Advantages:
• High accuracy.
• Reduces overfitting.
• Handles missing values.
Feature Selection
Selecting the most important features to:
• Reduce overfitting
• Improve accuracy
• Decrease computation time
Types:
(a) Filter Methods
2. Nearest Neighbour
4. Decision Trees
5. Boosted Trees
6. Random Forest
7. Neural Networks
Logistic Regression:
It is a statistical method for analysing a data set in which there are one or more independent
variables that determine an outcome. The outcome is measured with a dichotomous variable
(in which there are only two possible outcomes). The goal of logistic regression is to find the
best fitting model to describe the relationship between the dichotomous characteristic of
interest (dependent variable = response or outcome variable) and a set of independent (predictor
or explanatory) variables. This is better than other binary classification like nearest neighbour
since it also explains quantitatively the factors that lead to classification.
Decision Trees:
Decision tree builds classification or regression models in the form of a tree structure. It breaks
down a data set into smaller and smaller subsets while at the same time an associated decision
tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A
decision node has two or more branches and a leaf node represents a classification or decision.
The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
Decision trees can handle both categorical and numerical data.
Random Forest:
Random forests or random decision forests are an ensemble learning method for classification,
regression and other tasks, that operate by constructing a multitude of decision trees at training
time and outputting the class that is the mode of the classes (classification) or mean prediction
(regression) of the individual trees. Random decision forests correct for decision trees’
habit of overfitting to their training set.
Neural Network:
A neural network consists of units (neurons), arranged in layers, which convert an input vector
into some output. Each unit takes an input, applies a (often nonlinear) function to it and then
passes the output on to the next layer. Generally the networks are defined to be feed-forward:
a unit feeds its output to all the units on the next layer, but there is no feedback to the previous
layer. Weightings are applied to the signals passing from one unit to another, and it is these
weightings which are tuned in the training phase to adapt a neural network to the particular
problem at hand.
Unit IV
What is Clustering?
Definition:
Clustering is an unsupervised learning technique that groups data points into clusters
(groups) such that:
• Points within the same cluster are similar (high intra-cluster similarity),
• Points from different clusters are dissimilar (low inter-cluster similarity).
Used in: Market segmentation, social network analysis, document clustering, image
segmentation, anomaly detection, etc.
a) Euclidean Distance
Formula:
d(p, q) = √( Σ (pᵢ − qᵢ)² )
Example:
For points (1,2) and (4,6): d = √((4 − 1)² + (6 − 2)²) = √(9 + 16) = 5
b) Manhattan Distance
d(p, q) = Σ |pᵢ − qᵢ|
c) Cosine Similarity (or Distance)
d) Minkowski Distance
d(p, q) = ( Σ |pᵢ − qᵢ|^p )^(1/p)
e) Jaccard Distance
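All of these measures can be computed in a few lines of Python; the numeric ones use the example points (1,2) and (4,6), and Jaccard (defined over sets) uses two illustrative sets:

```python
import math

p, q = (1, 2), (4, 6)

euclidean = math.dist(p, q)                           # √(3² + 4²) = 5
manhattan = sum(abs(a - b) for a, b in zip(p, q))     # 3 + 4 = 7
minkowski3 = sum(abs(a - b) ** 3 for a, b in zip(p, q)) ** (1 / 3)

# Cosine similarity: dot product over the product of magnitudes
dot = sum(a * b for a, b in zip(p, q))
cos_sim = dot / (math.hypot(*p) * math.hypot(*q))

# Jaccard distance over sets: 1 - |A ∩ B| / |A ∪ B| (illustrative sets)
A, B = {1, 2, 3}, {2, 3, 4}
jaccard = 1 - len(A & B) / len(A | B)

print(euclidean, manhattan, round(minkowski3, 3),
      round(cos_sim, 3), jaccard)
```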
2. Clustering Approaches
2.1 Hierarchical Agglomerative Clustering (HAC)
Concept:
Builds a hierarchy (tree) of clusters by merging the closest clusters step-by-step until all
points are in one cluster.
Steps:
Single Linkage: Minimum distance between any two points (nearest neighbor)
Complete Linkage: Maximum distance between points (farthest neighbor)
Average Linkage: Average distance between all pairs
Ward’s Method: Minimizes total within-cluster variance
Output:
Dendrogram
Hierarchical output visualized as a tree; user cuts the tree at desired level to get clusters.
Example of HAC
Advantages
Disadvantages
Example
Advantages
Disadvantages
Key Parameters
Definitions
Advantages
Disadvantages
4. Clustering Tendency
Before clustering, check if the data has natural clusters.
1. Hopkins statistic
o Values close to 1 → strong clustering tendency
o Close to 0.5 → random data (no clusters)
2. Visual methods
o Scatter plots
o Distance matrix heatmap
o Pairplots (for small dimensions)
5. Clustering Quality
Metrics used to evaluate clustering results.
1. Silhouette Score
s = (b − a) / max(a, b)
2. Davies-Bouldin Index
(Labels required)