
Introduction to Data Science and Careers

Data Science is an interdisciplinary field that utilizes scientific methods and algorithms to extract insights from structured and unstructured data, with applications across various industries such as healthcare, finance, and gaming. The demand for data science skills is growing, requiring proficiency in programming, statistics, machine learning, and domain knowledge. Key trends in the field include the integration of AI, big data, ethical considerations, and the need for continuous learning and collaboration.


Data Science

Unit I
What is Data Science?

Data Science, also known as data-driven science, makes use of scientific methods,
processes, and systems to extract knowledge or insights from data in various forms, i.e. either
structured or unstructured. Data Science uses advanced hardware, programming
systems, and algorithms to solve problems that have to do with data, and it is the direction in
which artificial intelligence is heading.

Data Science is an interdisciplinary field that lets you learn from both structured and
unstructured data. With data science, you can turn a business problem into a research project
and then turn that research into a real-world solution.

Data Science Applications

The following are the applications of data science:

• Gaming Industry

• Health Care

• Medical Image Analysis

• Predictive Analysis

• Image Recognition

• Recommendation systems

• Airline Routing Planning

• Finance

• Improvement in Health Care services

• Computer Vision

• Efficient Management of Energy

• Internet Search

• Speech Recognition

• Education

Data Science Careers / Job Opportunities

The following are some career paths and job opportunities in data science:
• Data Analyst

• Data Scientist

• Database Administrator

• Big Data Engineer

• Data Mining Engineer

• Machine Learning Engineer

• Data Architect

• Hadoop Engineer

• Data Warehouse Architect

Why Data Science?

According to IDC, worldwide data will reach 175 zettabytes by 2025. Data Science helps
businesses to comprehend vast amounts of data from different sources, extract useful
insights, and make better data-driven choices. Data Science is used extensively in several
industrial fields, such as marketing, healthcare, finance, banking, and policy work.

Here are some significant advantages of using data science and analytics technology:

• Data is the oil of the modern age. With the proper tools, technologies, and algorithms,
we can leverage data to create a unique competitive edge.

• Data Science may assist in detecting fraud using sophisticated machine learning
techniques.

• It helps you avoid severe financial losses.

• Enables the development of intelligent machines

• You may use sentiment analysis to determine the brand loyalty of your customers. This
helps you to make better and quicker choices.

• It enables you to propose the appropriate product to the appropriate consumer in
order to grow your company.

Need for Data Science

The data we have and how much data we generate

According to Forbes, the total quantity of data generated, copied, recorded, and consumed in
the globe surged by about 5,000% between 2010 and 2020, from 1.2 trillion gigabytes to 59
trillion gigabytes.

How have companies benefited from Data Science?


• Several businesses are undergoing data transformation (converting their IT
architecture to one that supports Data Science), and data boot camps are widespread.
There is a straightforward explanation for this: Data Science provides valuable insights.

• Companies that do not make judgments based on data are being outcompeted by firms
that do. For example, Ford posted a loss of $12.6 billion in 2006. Following the loss,
the company hired a senior data scientist to manage its data and undertook a three-
year turnaround, which ultimately resulted in the sale of almost 2,300,000 automobiles
and a profit for 2009 as a whole.

What is Big Data?

Big Data refers to the vast volumes of data generated at high velocity from a variety of sources.
This data is characterized by the three V's: Volume, Velocity, and Variety.

1. Volume: Big Data involves large datasets that are too complex for traditional data
processing tools to handle. These datasets can range from terabytes to petabytes of
information.

2. Velocity: Big Data is generated in real-time or near real-time, requiring fast processing
to extract meaningful insights.

3. Variety: The data comes in multiple forms, including structured data (like databases),
semi-structured data (like XML files), and unstructured data (like text, images, and
videos).

What is Data Science?

Data Science is an interdisciplinary field that utilizes scientific methods, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. It
encompasses a variety of techniques from statistics, machine learning, data mining, and big
data analytics.

Data Scientists use their expertise to:

1. Analyze: They examine complex datasets to identify patterns, trends, and correlations.

2. Model: Using statistical models and machine learning algorithms, they create
predictive models that can forecast future trends or behaviors.

3. Interpret: They translate data findings into actionable business strategies and
decisions.

Current Landscape Perspective of Data Science

The current landscape perspective of data science is characterized by several key trends and
developments:
• AI and Machine Learning: Data science is increasingly intertwined with artificial
intelligence (AI) and machine learning, enabling more advanced predictive analytics
and automation of data-driven processes.

• Big Data: The proliferation of big data continues to shape the data science landscape,
with organizations leveraging large volumes of data from diverse sources to gain
insights and make informed decisions.

• Ethics and Privacy: There is a growing emphasis on ethical considerations and privacy
concerns in data science, with a focus on responsible data usage, transparency, and
compliance with regulations such as GDPR.

• Interdisciplinary Collaboration: Data science is increasingly viewed as a
multidisciplinary field, requiring collaboration between data scientists, domain
experts, and stakeholders to ensure the meaningful interpretation and application of
data-driven insights.

• Data Visualization and Interpretability: The importance of effective data visualization
and interpretability techniques is on the rise, as stakeholders seek to comprehend and
communicate complex data findings in a more accessible manner.

• Automation and Scalability: There is a shift towards automating repetitive data tasks
and scaling data science processes through the use of cloud-based platforms and tools,
enabling greater efficiency and agility.

• Continuous Learning and Upskilling: Given the rapid evolution of data science
technologies and methodologies, professionals in the field are increasingly focused on
continuous learning and upskilling to stay abreast of the latest developments.

Skill sets needed (Data Science : Skills Required)

Data science is an interdisciplinary field of scientific methods, processes, algorithms, and
systems used to extract knowledge or insights from data in various forms, either structured or
unstructured, similar to data mining. Big Data Analytics and Data Science are now common
terms in the IT industry because organizations need them to deal with the huge amounts of
data we generate these days. Let's find out what the required skills are:

Data science is a multidisciplinary field that combines statistics, computer science, and
domain expertise to extract insights and knowledge from data. The skills required for data
science can be broadly classified into technical skills, domain expertise, and soft skills.

1. Technical skills:
Data science requires proficiency in programming languages such as Python or R, data
visualization tools like Tableau or Power BI, databases such as SQL, and machine
learning algorithms. Data scientists should have a solid understanding of data
manipulation and analysis techniques, including data cleaning, transformation, and
feature engineering.

2. Domain expertise:
Data scientists should have an understanding of the business domain in which they
work. For example, a data scientist in healthcare should have knowledge of medical
terminologies and healthcare workflows. Similarly, a data scientist in finance should
have an understanding of financial instruments and markets.

3. Soft skills:
Soft skills like communication, collaboration, and problem-solving are essential for a
successful data scientist. Data scientists should be able to communicate complex
technical concepts to non-technical stakeholders in a clear and concise manner. They
should also be able to work collaboratively in a team environment, and have strong
problem-solving skills to identify and solve complex problems.

In summary, data science requires technical proficiency in programming languages, data
analysis, and machine learning algorithms, domain expertise in the relevant field, and strong
soft skills such as communication, collaboration, and problem-solving. A well-rounded data
scientist with expertise in these areas can extract insights and knowledge from data and drive
business value.

Data science is an interdisciplinary field that involves using statistical and computational
techniques to extract insights from data. Some of the key skills required for a career in data
science include:

• Programming skills: proficiency in one or more programming languages such as
Python, R, or SQL is essential for working with data.

• Statistics and probability: understanding of statistical concepts such as probability
distributions, hypothesis testing, and regression analysis is necessary for data analysis
and modeling.

• Machine learning: knowledge of machine learning algorithms and techniques for
building predictive models is crucial for data science.

• Data wrangling: the ability to clean, organize, and manipulate large datasets is an
important skill for data preparation.

• Data visualization: the ability to create clear and effective visualizations of data is
important for communicating insights and findings to others.

• Communication skills: being able to explain complex data concepts to non-technical
stakeholders is critical for data science.

• Domain knowledge: understanding the specific industry or business context in which
data is being analyzed is important for interpreting and applying the insights generated.
1. Math Skills:

• Multivariable Calculus & Linear Algebra: These two things are very important
as they help us in understanding various machine learning algorithms which
play an important role in Data Science.

• Probability & Statistics: Understanding statistics is very important, as it is the
mathematical foundation of data analysis. Probability theory is also central to
statistics and is a prerequisite for learning machine learning.

2. Programming Skills:

• Programming Knowledge: You need to have a good grasp of programming
concepts such as data structures and algorithms. The languages commonly used are
Python, R, Java, and Scala. C++ is also used in some places where performance is
extremely important.

• Relational Databases: You need to know SQL databases such as MySQL or Oracle so
that you can fetch the required data from them whenever needed.

• Non-Relational Databases: These come in many types, but the most commonly used
are: i) Column: Cassandra, HBase; ii) Document: MongoDB, CouchDB; iii) Key-value:
Redis, Dynamo

• Distributed Computing: This is one of the most important skills for handling large
amounts of data, because we cannot process that much data on a single system. The
tools mainly used are Apache Hadoop and Spark. Hadoop has two main parts: HDFS
(Hadoop Distributed File System), which is used for storing data over a distributed file
system, and MapReduce, by which we process data. MapReduce programs can be
written in Java or Python. There are many other tools as well, such as Pig and Hive.

• Machine Learning: This is one of the most important parts of data science and a hot
research topic, so new developments are made every year. You at least need to know
the common algorithms of supervised and unsupervised learning. Many libraries are
available in Python and R. Python libraries include: i) Basic libraries: NumPy, SciPy,
Pandas, IPython, Matplotlib; ii) Libraries for machine learning: scikit-learn, Theano,
TensorFlow; iii) Libraries for data mining and natural language processing: Scrapy,
NLTK, Pattern

3. Domain Knowledge: Most people ignore this, thinking it is not important, but it is very
important. The whole purpose of data science is to extract useful insights from data so
that they can benefit a company's business. If you do not understand the business side
of your company, such as how its business model works and how you can improve it,
your analysis will be of little use to the company.

Linear Algebra Required for Data Science

Linear algebra simplifies the management and analysis of large datasets. It is widely used in
Data Science and machine learning to understand data especially when there are many
features. In this article we’ll explore the importance of linear algebra in data science, its key
concepts, real-world applications and the challenges learners face.

Linear Algebra in Data Science

Linear algebra in data science refers to the use of mathematical concepts involving vectors,
matrices and linear transformations to manipulate and analyse data. It provides useful
algorithms and processes in data science such as machine learning, statistics and big data
analytics. It turns theoretical data models into practical solutions that can be used in real-
world situations. It helps us:

• Represent datasets as vectors and matrices.

• Perform operations like scaling, rotation and projection on data efficiently.

• Use techniques like dimensionality reduction to simplify large datasets while keeping
important patterns.

Below are some important linear algebra topics that are widely used in data science.

1. Vectors

A vector is an ordered array of numbers that represents a point or direction in space. In data
science, vectors are used to represent data points, features, or coefficients in machine learning
models.
• Vectors

• Vector Operations

• Vector Norms
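These vector operations map directly onto NumPy; a minimal sketch (the feature values are made up for illustration):

```python
import numpy as np

# Two data points, each described by three features (illustrative values)
v = np.array([3.0, 4.0, 0.0])
w = np.array([1.0, 2.0, 2.0])

v_plus_w = v + w                     # element-wise vector addition
scaled = 2 * v                       # scalar multiplication
dot = np.dot(v, w)                   # dot product: 3*1 + 4*2 + 0*2 = 11
l2_norm = np.linalg.norm(v)          # Euclidean (L2) norm: sqrt(3^2 + 4^2) = 5
l1_norm = np.linalg.norm(v, ord=1)   # L1 norm: |3| + |4| + |0| = 7
```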

2. Matrices

A matrix is a two-dimensional array of numbers. Matrices are used to represent datasets,
transformations, or linear systems, where rows typically represent observations and columns
represent features.

• Matrix

• Matrix Operations

• Matrix Transpose

• Identity Matrix

• Zero Matrix

• Sparse matrix

• Inverse of a Matrix
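The basic matrix operations listed above can be sketched in NumPy (the matrix values are illustrative):

```python
import numpy as np

# A small matrix: rows as observations, columns as features
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

At = A.T                    # transpose: rows and columns swapped
I = np.eye(2)               # 2x2 identity matrix
AI = A @ I                  # multiplying by the identity leaves A unchanged
A_inv = np.linalg.inv(A)    # inverse exists because det(A) = -2 is non-zero
check = A @ A_inv           # multiplying A by its inverse gives the identity
```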

3. Matrix Decomposition

Matrix decomposition is the process of breaking a complex matrix down into simpler, more
manageable parts. Common decompositions include LU decomposition, QR decomposition,
and Singular Value Decomposition.

• LU Decomposition

• QR Decomposition

• Cholesky Decomposition

• Non-Negative Matrix Factorization (NMF)

• Eigenvalue Decomposition

• Singular Value Decomposition (SVD)
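As a quick illustration, NumPy's SVD factors a matrix into three parts that multiply back to the original; a minimal sketch with illustrative numbers:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Singular Value Decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

A_rebuilt = U @ np.diag(S) @ Vt   # recombining the factors recovers A
```

The singular values in S are returned in descending order, which is what makes truncating small ones (for compression or noise reduction) straightforward.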

4. Determinants

The determinant of a square matrix is a single number that tells us whether the matrix is
invertible. It is important when we need to find the best possible answer or when we are
solving systems of linear equations.

• Determinants

• Properties of Determinants
• Relationship with Invertibility of Matrices
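A minimal NumPy sketch of the link between the determinant and invertibility:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # second row is twice the first

det_A = np.linalg.det(A)     # 2*3 - 1*1 = 5, non-zero -> A is invertible
det_B = np.linalg.det(B)     # 1*4 - 2*2 = 0 -> B is singular (no inverse)
```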

5. Eigenvalues and Eigenvectors

Eigenvalues and eigenvectors are used in various data science algorithms such as PCA for
dimensionality reduction and feature extraction.

• Eigen Values & Eigen Vectors

• Finding Eigenvalues and Eigenvectors

• Applications
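A small sketch of the eigenvalue equation AX = λX in NumPy, using a symmetric matrix like the covariance matrices PCA works with:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)   # eigenvalues, eigenvectors (as columns)

x = eigvecs[:, 0]    # first eigenvector
lam = eigvals[0]     # its eigenvalue
# A acting on x only stretches it by lam: A @ x equals lam * x
```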

6. Vector Spaces and Subspaces

A vector space is a set of vectors that can be scaled and added together and subspaces are
subsets of a vector space used for understanding data structures and transformations in
machine learning.

• Vector Spaces

• Linear Independence

• Linear Transformation

• Span

• Basis and Dimensions

• Column Space

• Null Space
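Linear independence of a set of column vectors can be checked via the matrix rank; a minimal sketch (the matrices are illustrative):

```python
import numpy as np

# Columns of M1 are linearly independent
M1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])

# Second column of M2 is 2x the first, so the columns are dependent
M2 = np.array([[1.0, 2.0],
               [2.0, 4.0],
               [3.0, 6.0]])

rank1 = np.linalg.matrix_rank(M1)   # 2: columns span a 2D subspace
rank2 = np.linalg.matrix_rank(M2)   # 1: columns only span a line
```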

7. Systems of Linear Equations

Systems of linear equations can be represented as matrices. Solving systems of linear
equations is essential in regression analysis, optimization and neural networks.

• Gaussian Elimination

• Homogeneous Linear Systems

• Least Squares Solutions
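Both an exactly determined system and a least-squares solution can be sketched in NumPy (the numbers are illustrative):

```python
import numpy as np

# Exactly determined: 2x + y = 5 and x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])
x = np.linalg.solve(A, b)   # unique solution: x = 1, y = 3

# Overdetermined (more equations than unknowns): least-squares line fit
A2 = np.array([[1.0, 1.0],
               [1.0, 2.0],
               [1.0, 3.0]])    # columns: intercept term, x-value
b2 = np.array([1.0, 2.0, 2.0])
coef, *_ = np.linalg.lstsq(A2, b2, rcond=None)  # minimizes ||A2 @ coef - b2||
```

This least-squares step is exactly what linear regression does under the hood.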

8. Orthogonality

Two vectors are considered orthogonal when their dot product is zero. Data science makes
use of orthogonality for selecting features, conducting dimensionality reduction, and
establishing whether model components operate independently.

• Orthogonal Vectors

• Orthogonal Matrices
• Orthogonal Projections

• Gram-Schmidt Process
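A minimal sketch of orthogonality and orthogonal projection in NumPy:

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([-2.0, 1.0])

dot_uv = np.dot(u, v)   # 1*(-2) + 2*1 = 0, so u and v are orthogonal

# Orthogonal projection of w onto u
w = np.array([3.0, 1.0])
proj = (np.dot(w, u) / np.dot(u, u)) * u
residual = w - proj     # the leftover part of w, orthogonal to u
```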

9. Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that transforms data into a smaller set of
variables that capture the most significant variance. It is used for feature extraction and noise
reduction.

• Covariance Matrix and Its Role

• Dimensionality Reduction

10. Optimization in Linear Algebra

Optimization means finding the best possible solution to a problem. Linear algebra applies
this concept to problems such as least-squares regression, linear regression models, and the
training of machine learning models.

• Gradient Descent Method

• Cost Functions

• Objective Functions

• Linear Programming

• Simplex Method

• Newton's Method

• Conjugate Gradient Method

• Lagrange Multipliers
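As one concrete instance of the ideas above, the gradient descent method can solve a least-squares regression problem; a minimal sketch with synthetic, noise-free data:

```python
import numpy as np

# Fit y ~= X @ w by gradient descent on the cost J(w) = (1/n) * ||X @ w - y||^2
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])            # first column is the intercept term
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated from y = 1 + 2*x, no noise

w = np.zeros(2)   # start from the zero vector
lr = 0.05         # learning rate (step size)
for _ in range(5000):
    grad = (2 / len(y)) * X.T @ (X @ w - y)   # gradient of the cost
    w -= lr * grad                            # step opposite the gradient

# w converges toward [1, 2]: intercept 1, slope 2
```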

Applications of Linear Algebra in Data Science

• Recommender Systems - Recommender systems depend on linear algebra to
generate personalized suggestions for Spotify, Netflix, and other streaming platforms.

• Dimensionality Reduction - Dimensionality reduction simplifies extensive datasets
while maintaining the essential structure. PCA decreases the amount of data while
enhancing usability for both humans and machines.
• NLP - In NLP, word embeddings like Word2Vec or GloVe represent words as vectors.
Word relationships are then computed through linear algebra operations such as dot
products and matrix multiplication.

• Image Processing and Computer Vision - Linear Algebra allows processing of images
through various transformations and compression techniques as well as extracting
features from datasets.

• Clustering and Classification - The algorithms k-means clustering and Support Vector
Machines (SVM) use Linear Algebra to group or classify data points effectively.

• Data Transformation and Preprocessing - Linear algebra is used in data preprocessing
through its applications in transforming and reshaping data points before machine
learning algorithms are applied.

Challenges in Linear Algebra

Learning linear algebra presents challenges to data science students for several key reasons:

• Linear algebra introduces abstract theoretical principles, including vectors, matrices,
and transformations, that can be difficult to grasp.

• The learning curve feels steep because beginners find operations such as matrix
inversion and eigenvalue decomposition challenging to handle.

• Learners can be confused by the many different linear algebra applications across
different disciplines.

Principal Component Analysis(PCA)

PCA (Principal Component Analysis) is a dimensionality reduction technique used in data
analysis and machine learning. It helps you reduce the number of features in a dataset
while keeping the most important information. It transforms your original features into new
features; these new features do not overlap with each other (they are uncorrelated), and the
first few capture most of the important variation found in the original data.

PCA is commonly used for data preprocessing for use with machine learning algorithms. It
helps to remove redundancy, improve computational efficiency and make data easier to
visualize and analyze especially when dealing with high-dimensional data.

How Principal Component Analysis Works

PCA uses linear algebra to transform data into new features called principal components. It
finds these by calculating eigenvectors (directions) and eigenvalues (importance) from the
covariance matrix. PCA selects the top components with the highest eigenvalues and projects
the data onto them to simplify the dataset.
Note: It prioritizes the directions where the data varies the most because more variation =
more useful information.

Imagine you’re looking at a messy cloud of data points like stars in the sky and want to simplify
it. PCA helps you find the "most important angles" to view this cloud so you don’t miss the big
patterns. Here’s how it works step by step:

Step 1: Standardize the Data

Different features may have different units and scales like salary vs. age. To compare them
fairly PCA first standardizes the data by making each feature have:

• A mean of 0

• A standard deviation of 1

Z = (X − μ) / σ

where:

• μ is the mean of the independent features, μ = {μ1, μ2, ⋯, μm}

• σ is the standard deviation of the independent features, σ = {σ1, σ2, ⋯, σm}

Step 2: Calculate Covariance Matrix

Next, PCA calculates the covariance matrix to see how features relate to each other, i.e.,
whether they increase or decrease together. The covariance between two features x1 and x2
is:

cov(x1, x2) = Σᵢ₌₁ⁿ (x1ᵢ − x̄1)(x2ᵢ − x̄2) / (n − 1)

Where:

• x̄1 and x̄2 are the mean values of features x1 and x2

• n is the number of data points

The value of a covariance can be positive, negative, or zero.

Step 3: Find the Principal Components

PCA identifies new axes where the data spreads out the most:

• 1st Principal Component (PC1): The direction of maximum variance (most spread).

• 2nd Principal Component (PC2): The next best direction, perpendicular to PC1 and so
on.

These directions come from the eigenvectors of the covariance matrix and their importance
is measured by eigenvalues. For a square matrix A an eigenvector X (a non-zero vector) and
its corresponding eigenvalue λ satisfy:
AX = λX

This means:

• When A acts on X it only stretches or shrinks X by the scalar λ.

• The direction of X remains unchanged; hence eigenvectors define "stable directions"
of A.

Eigenvalues help rank these directions by importance.

Step 4: Pick the Top Directions & Transform Data

After calculating the eigenvalues and eigenvectors PCA ranks them by the amount of
information they capture. We then:

1. Select the top k components that capture most of the variance (for example, 95%).

2. Transform the original dataset by projecting it onto these top components.

This means we reduce the number of features (dimensions) while keeping the important
patterns in the data.

In the above image, the original dataset has two features, "Radius" and "Area", represented
by the black axes. PCA identifies two new directions, PC₁ and PC₂, which are the principal
components.

• These new axes are rotated versions of the original ones. PC₁ captures the maximum
variance in the data meaning it holds the most information while PC₂ captures the
remaining variance and is perpendicular to PC₁.

• The spread of data is much wider along PC₁ than along PC₂. This is why PC₁ is chosen
for dimensionality reduction. By projecting the data points (blue crosses) onto PC₁ we
effectively transform the 2D data into 1D and retain most of the important structure
and patterns.

Implementation of Principal Component Analysis in Python

Hence PCA uses a linear transformation that is based on preserving the most variance in the
data using the least number of dimensions. It involves the following steps:

Step 1: Importing Required Libraries

We import the necessary libraries, such as pandas, NumPy, scikit-learn, Seaborn, and
Matplotlib, to process data and visualize results.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Creating Sample Dataset

We make a small dataset with three features (Height, Weight, Age) and a Gender label.

data = {
    'Height': [170, 165, 180, 175, 160, 172, 168, 177, 162, 158],
    'Weight': [65, 59, 75, 68, 55, 70, 62, 74, 58, 54],
    'Age': [30, 25, 35, 28, 22, 32, 27, 33, 24, 21],
    'Gender': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]  # 1 = Male, 0 = Female
}
df = pd.DataFrame(data)
print(df)

Output:

The 10-row DataFrame printed by print(df).

Step 3: Standardizing the Data


Since the features have different scales (Height vs. Age), we standardize the data. This makes
all features have mean = 0 and standard deviation = 1, so that no feature dominates just
because of its units.

X = df.drop('Gender', axis=1)
y = df['Gender']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 4: Applying PCA algorithm

• We reduce the data from 3 features to 2 new features called principal components.
These components capture most of the original information but in fewer dimensions.

• We split the data into 70% training and 30% testing sets.

• We train a logistic regression model on the reduced training data and predict gender
labels on the test set.

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Step 5: Evaluating with Confusion Matrix

The confusion matrix compares actual vs predicted labels. This makes it easy to see where
predictions were correct or wrong.

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Female', 'Male'], yticklabels=['Female', 'Male'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

Output:

The confusion matrix heatmap.

Step 6: Visualizing PCA Result

y_numeric = pd.factorize(y)[0]

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_numeric, cmap='coolwarm', edgecolor='k', s=80)
plt.xlabel('Original Feature 1')
plt.ylabel('Original Feature 2')
plt.title('Before PCA: Using First 2 Standardized Features')
plt.colorbar(label='Target classes')

plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_numeric, cmap='coolwarm', edgecolor='k', s=80)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('After PCA: Projected onto 2 Principal Components')
plt.colorbar(label='Target classes')

plt.tight_layout()
plt.show()

PCA Algorithm
• Left Plot Before PCA: This shows the original standardized data plotted using the first
two features. There is no guarantee of clear separation between classes as these are
raw input dimensions.

• Right Plot After PCA: This displays the transformed data using the top 2 principal
components. These new components capture the maximum variance often showing
better class separation and structure making it easier to analyze or model.

Advantages of Principal Component Analysis

1. Multicollinearity Handling: Creates new, uncorrelated variables to address issues
when original features are highly correlated.

2. Noise Reduction: Eliminates components with low variance, enhancing data clarity.

3. Data Compression: Represents data with fewer components, reducing storage needs
and speeding up processing.

4. Outlier Detection: Identifies unusual data points by showing which ones deviate
significantly in the reduced space.

Disadvantages of Principal Component Analysis

1. Interpretation Challenges: The new components are combinations of original
variables, which can be hard to explain.

2. Data Scaling Sensitivity: Requires proper scaling of data before application or results
may be misleading.

3. Information Loss: Reducing dimensions may lose some important information if too
few components are kept.

4. Assumption of Linearity: Works best when relationships between variables are linear
and may struggle with non-linear data.

5. Computational Complexity: Can be slow and resource-intensive on very large
datasets.

6. Risk of Overfitting: Using too many components or working with a small dataset might
lead to models that don't generalize well.

Statistics For Data Science

Statistics is like a toolkit we use to understand and make sense of information. It helps us
collect, organize, analyze and interpret data to find patterns, trends and relationships in the
world around us.

From analyzing scientific experiments to making informed business decisions, statistics plays
an important role across many fields such as science, economics, social sciences, engineering
and sports. Whether it's calculating the average test score in a classroom or predicting election
outcomes based on a sample, statistics gives us tools to make data-driven decisions.

Types of Statistics

There are commonly two types of statistics, which are discussed below:

1. Descriptive Statistics: Descriptive statistics helps us simplify and organize big chunks
of data, making large amounts of data easier to understand.

2. Inferential Statistics: Inferential statistics is a little different. It uses a smaller sample
of data to draw conclusions about a larger group, helping us predict and draw
conclusions about a population.

What is Data in Statistics?

Data is a collection of observations; it can be in the form of numbers, words, measurements,
or statements.

Types of Data

1. Qualitative Data: This data is descriptive. For example: she is beautiful, he is tall, etc.

2. Quantitative Data: This is numerical information. For example: a horse has four legs.

• Discrete Data: It has a particular fixed value and can be counted.

• Continuous Data: It is not fixed but has a range of data and can be measured.

Basics of Statistics

Basic formulas of statistics are:

• Population Mean (μ): average of the entire group.
Formula: μ = Σx / N

• Sample Mean (x̄): average of a subset of the population.
Formula: x̄ = Σx / n

• Standard Deviation: measures how spread out the data is from the mean.
Population: σ = √( Σ(xᵢ − μ)² / N ); Sample: s = √( Σ(xᵢ − x̄)² / (n − 1) )

• Variance: shows how far values are from the mean, squared.
Population: σ² = Σ(x − μ)² / N; Sample: s² = Σ(x − x̄)² / (n − 1)

• Class Interval (CI): range of values in a group.
Formula: CI = Upper Limit − Lower Limit

• Frequency (f): how often a value appears.
Formula: count of occurrences

• Range (R): difference between the largest and smallest values.
Formula: Range = Max − Min

Measure of Central Tendency

1. Mean: The mean is calculated by summing all values present in the sample and dividing by
the total number of values present in the sample or population.

Formula: Mean (μ) = (Sum of Values) / (Number of Values)

2. Median: The median is the middle value of a dataset when arranged from lowest to highest
or highest to lowest; to find the median, the data must be sorted. For an odd number of
data points the median is the middle value, and for an even number of data points the
median is the average of the two middle values.

3. Mode: The most frequently occurring value in the Sample or Population is called as Mode.

Measure of Dispersion

• Range: Range is the difference between the maximum and minimum values of the
Sample.

• Variance (σ²): Variance measures how spread out values are from the mean by
measuring the dispersion around the mean.
Formula: σ² = Σ(X − μ)² / n

• Standard Deviation (σ): Standard Deviation is the square root of variance. The
measuring unit of S.D. is same as the Sample values' unit. It indicates the average
distance of data points from the mean and is widely used due to its intuitive
interpretation.

Formula: σ = √(σ²) = √( Σ(X − μ)² / n )

• Interquartile Range (IQR): The range between the first quartile (Q1) and the third
quartile (Q3). It is less sensitive to extreme values than the range. To compute the IQR,
arrange the data in ascending order, then take the median of the lower half of the
dataset (Q1) and the median of the upper half (Q3).

Formula: IQR = Q3 − Q1

• Quartiles: Quartiles divide the dataset into four equal parts:

Q1 is the median of the lower half (25th percentile)
Q2 is the median of the whole dataset (50th percentile)
Q3 is the median of the upper half (75th percentile)

• Mean Absolute Deviation: The average of the absolute differences between each data
point and the mean. It provides a measure of the average deviation from the mean.

Formula: Mean Absolute Deviation = Σ|X − μ| / n
• Coefficient of Variation (CV):
CV is the ratio of the standard deviation to the mean, expressed as a percentage. It is
useful for comparing the relative variability of different datasets.

CV = (σ / μ) × 100

Measure of Shape

1. Skewness

Skewness is the measure of asymmetry of probability distribution about its mean.


Types of Skewed Data

• Positive Skew (Right): Mean > Median

• Negative Skew (Left): Mean < Median

• Symmetrical: Mean = Median

2. Kurtosis

Kurtosis quantifies the degree to which a probability distribution deviates from the normal
distribution. It assesses the "tailedness" of the distribution, indicating whether it has heavier
or lighter tails than a normal distribution. High kurtosis implies more extreme values in the
distribution, while low kurtosis indicates a flatter distribution.

Types of Kurtosis

• Mesokurtic: Normal distribution (kurtosis = 3)

• Leptokurtic: Heavy tails (kurtosis > 3)

• Platykurtic: Light tails (kurtosis < 3)

Measure of Relationship
• Covariance: Covariance measures the degree to which two variables change together.

Cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / n

• Correlation: Correlation measures the strength and direction of the linear relationship
between two variables. It is represented by correlation coefficient which ranges from
-1 to 1. A positive correlation indicates a direct relationship, while a negative
correlation implies an inverse relationship. Pearson's correlation coefficient is given by:

ρ(X, Y) = Cov(X, Y) / (σ_X · σ_Y)

Probability Theory

Here are some basic concepts or terminologies used in probability:

• Sample Space: The set of all possible outcomes in a probability experiment.

• Event: A subset of the sample space.

• Joint Probability (Intersection of Events): Probability that events A and B both occur.
For independent events: P(A and B) = P(A) × P(B)

• Union of Events: Probability that event A or B occurs.
Formula: P(A or B) = P(A) + P(B) − P(A and B)

• Conditional Probability: Probability of event A given that event B has occurred.
Formula: P(A | B) = P(A and B) / P(B)

Bayes Theorem

Bayes' Theorem is a fundamental concept in probability theory that relates conditional
probabilities. It is named after the Reverend Thomas Bayes, who first introduced the theorem.
Bayes' Theorem is a mathematical formula that provides a way to update probabilities based
on new evidence. The formula is as follows:

P(A | B) = P(B | A) × P(A) / P(B)

where

• P(A | B): Probability of event A given that event B has occurred (posterior probability).
• P(B | A): Probability of event B given that event A has occurred (likelihood).
• P(A): Probability of event A before observing B (prior probability).
• P(B): Probability of event B (evidence).
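The update rule can be sketched numerically. The disease-testing numbers below (1% prevalence, 95% sensitivity, 5% false-positive rate) are illustrative assumptions, not values from the text:

```python
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical diagnostic test: A = "has disease", B = "tests positive"
p_disease = 0.01            # prior P(A)
p_pos_given_disease = 0.95  # likelihood P(B|A)
p_pos_given_healthy = 0.05  # false-positive rate

# Evidence P(B) via the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.161
```

Even with a fairly accurate test, the posterior probability stays modest because the prior is small.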

Types of Probability Functions

• Probability Mass Function (PMF): Gives the probability that a discrete random variable
takes a specific value.

• Probability Density Function (PDF): Describes the likelihood of a continuous random
variable falling within a particular range.

• Cumulative Distribution Function (CDF): Gives the probability that a random variable
will take a value less than or equal to a given value.

• Empirical Distribution Function (EDF): Estimates the CDF using observed sample data.

Probability Distributions Functions

1. Normal or Gaussian Distribution

The normal distribution is a continuous probability distribution characterized by its bell-
shaped curve and described by its mean (μ) and standard deviation (σ).

Formula: f(x | μ, σ) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Empirical Rule (68-95-99.7 Rule): ~68% data within 1σ, ~95% within 2σ, ~99.7% within 3σ.

Use: Detecting outliers, modeling natural phenomena.

Central Limit Theorem: The Central Limit Theorem (CLT) states that, regardless of the shape
of the original population distribution, the sampling distribution of the sample mean will be
approximately normally distributed, provided the sample size is sufficiently large.
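A quick simulation sketch of the CLT using only the Python standard library; the uniform population and the sample sizes are illustrative choices:

```python
import random
import statistics

random.seed(42)

# Draw 1000 sample means, each from a sample of size 30
# taken from a Uniform(0, 1) population (population mean 0.5)
sample_means = [
    statistics.mean(random.random() for _ in range(30))
    for _ in range(1000)
]

# The sample means concentrate around the population mean,
# approximately normally, even though the population is uniform
print(round(statistics.mean(sample_means), 2))  # ~0.5
```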

2. Student t-distribution

The t-distribution, also known as Student's t-distribution, is a probability distribution used
for inference about a mean when the sample is small and the population standard deviation
is unknown.

f(t) = Γ((df + 1)/2) / ( √(df·π) · Γ(df/2) ) · (1 + t²/df)^(−(df + 1)/2)

• Parameter: Degrees of freedom (df).


• Use: Hypothesis testing with small samples.

3. Chi-square Distribution

The chi-squared distribution, denoted χ², is a probability distribution used in statistics; it is
related to the sum of squared standard normal deviates.

f(x) = ( 1 / (2^(k/2) Γ(k/2)) ) · x^(k/2 − 1) · e^(−x/2)

4. Binomial Distribution

The binomial distribution models the number of successes in a fixed number of independent
Bernoulli trials, where each trial has the same probability of success (p).

Formula: P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)

5. Poisson Distribution

The Poisson distribution models the number of events that occur in a fixed interval of time or
space. It's characterized by a single parameter (λ), the average rate of occurrence.

Formula: P(X = k) = e^(−λ) · λ^k / k!

6. Uniform Distribution

The uniform distribution represents a constant probability for all outcomes in a given range.

Formula: f(x) = 1 / (b − a), for a ≤ x ≤ b
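As a sketch, the binomial and Poisson formulas above can be evaluated directly with Python's standard library (the parameter values are illustrative):

```python
import math

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(X = k) = e^(-lambda) * lambda^k / k!
    return math.exp(-lam) * lam**k / math.factorial(k)

# Probability of exactly 5 heads in 10 fair coin flips
print(round(binomial_pmf(5, 10, 0.5), 4))  # 0.2461

# Probability of exactly 2 events when the average rate is 3
print(round(poisson_pmf(2, 3), 4))         # 0.224
```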

Parameter estimation for Statistical Inference

• Population: Population is the entire group about which conclusions are drawn.

• Sample: Sample is a subset of the population used to make inferences.

• Expectation (E[x]): Expectation is average or expected value of a random variable.

• Parameter: A numerical value that describes a population (e.g., μ, σ, p).

• Statistic: A value computed from sample data to estimate a population parameter.

• Estimation: The process of inferring population parameters from sample statistics.

• Estimator: A rule or formula to estimate an unknown parameter.

• Bias: The difference between an estimator’s expected value and the true parameter.

Bias(θ̂) = E(θ̂) − θ

Hypothesis Testing

Hypothesis testing makes inferences about a population parameter based on a sample statistic.
1. Null Hypothesis (H₀): There is no significant difference or effect.

2. Alternative Hypothesis (H₁): There is a significant difference or effect, i.e., the null hypothesis can be rejected.

3. Degrees of freedom: Degrees of freedom (df) in statistics represent the number of values
or quantities in the final calculation of a statistic that are free to vary. It is commonly
computed as the sample size minus one (n − 1).

4. Level of Significance (α): This is the threshold used to determine statistical significance.
Common values are 0.05, 0.01, or 0.10.

5. p-value: The p-value is the probability of observing results at least as extreme as those measured, assuming H₀ is true.

• If p ≤ α: reject H₀

• If p > α: fail to reject H₀

6. Type I Error and Type II Error

• Type I Error: Occurs when the null hypothesis is true, but the statistical test
incorrectly rejects it. It is often referred to as a "false positive" or "alpha error."

• Type II Error: Occurs when the null hypothesis is false, but the statistical test fails
to reject it. It is often referred to as a "false negative."

7. Confidence Intervals: A confidence interval is a range of values that is used to estimate the
true value of a population parameter with a certain level of confidence. It provides a measure
of the uncertainty or margin of error associated with a sample statistic, such as the sample
mean or proportion.
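A minimal sketch of a 95% confidence interval for a mean, assuming a large enough sample that the normal critical value 1.96 applies (the summary numbers are illustrative):

```python
import math

def confidence_interval_95(mean, sd, n):
    # CI = mean ± z * (sd / sqrt(n)), with z = 1.96 for 95% confidence
    margin = 1.96 * sd / math.sqrt(n)
    return mean - margin, mean + margin

low, high = confidence_interval_95(mean=4.2, sd=1.5, n=60)
print(round(low, 2), round(high, 2))  # 3.82 4.58
```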

Example of Hypothesis Testing (Website Redesign)

An e-commerce company wants to know if a website redesign affects average user session
time.

• Before: Mean = 3.5 min, SD = 1.2, n = 50

• After: Mean = 4.2 min, SD = 1.5, n = 60

Hypotheses:
• H₀: No change (μ_after − μ_before = 0)

• H₁: Positive change (μ_after − μ_before > 0)

Significance Level: α = 0.05


Test: Difference in means -> calculate p-value

Interpretation:

• If p < 0.05: Redesign significantly increased session time

• If p ≥ 0.05: No significant effect
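The redesign example can be worked through numerically. The sketch below computes a Welch-style two-sample test statistic from the summary figures and approximates the one-sided p-value with the standard normal CDF, which is reasonable for these sample sizes:

```python
import math

# Summary statistics from the redesign example
m1, s1, n1 = 3.5, 1.2, 50   # before
m2, s2, n2 = 4.2, 1.5, 60   # after

# Welch test statistic for a difference in means
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
z = (m2 - m1) / se

# One-sided p-value via the standard normal CDF
p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(z, 2))     # 2.72
print(p_value < 0.05)  # True -> reject H0
```

Here p ≈ 0.003, well below 0.05, so the data support a significant increase in session time.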

Statistical Tests

Parametric tests are statistical methods that assume the data follows a particular
distribution, typically the normal distribution.

Z-test: Tests whether a sample mean differs from a known population mean. Used when the
population standard deviation is known and the sample size is large.

One-sample: Z = (X̄ − μ) / (σ / √n)

Two-sample: Z = (X̄₁ − X̄₂) / √(σ₁²/n₁ + σ₂²/n₂)

t-test: Compares means when the population standard deviation is unknown. Used for small
samples or when the population standard deviation is unknown.

One-sample: t = (X̄ − μ) / (s / √n)

Two-sample: t = (X̄₁ − X̄₂) / √(s₁²/n₁ + s₂²/n₂)

Paired: t = d̄ / (s_d / √n), where d̄ is the mean of the paired differences.

F-test: Compares variances of two or more groups, to test whether group variances are
significantly different.

Formula: F = s₁² / s₂²

ANOVA (Analysis Of Variance)


Source of Variation    Sum of Squares             Degrees of Freedom   Mean Squares        F-Value

Between Groups         SSB = Σ nⱼ(x̄ⱼ − x̄)²       df₁ = k − 1          MSB = SSB/(k − 1)   F = MSB/MSE

Error (Within Groups)  SSE = Σ Σ (xᵢⱼ − x̄ⱼ)²      df₂ = N − k          MSE = SSE/(N − k)

Total                  SST = SSB + SSE            df₃ = N − 1

There are mainly two types of ANOVA:

1. One-way ANOVA: Compares means of 3+ groups.

• H₀: All group means are equal

• H₁: At least one group differs

2. Two-way ANOVA: Tests impact of two categorical variables and their interaction

Chi-Squared Test

The chi-squared test is a statistical test used to determine if there is a significant association
between two categorical variables. It compares the observed frequencies in a contingency
table with the expected frequencies.

Formula: χ² = Σ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ

This test can also be performed on large datasets with many observations.
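A minimal sketch of the chi-squared statistic for a hypothetical 2×2 contingency table (the observed counts are made up for illustration):

```python
# Observed 2x2 contingency table, e.g. group (rows) vs outcome (columns)
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected frequency under independence: E_ij = row_i * col_j / total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

print(chi2)  # 4.0 -> exceeds the 3.841 critical value at df = 1, alpha = 0.05
```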

Non-Parametric Test

Non-parametric test does not make assumptions about the distribution of the data. They are
useful when data does not meet the assumptions required for parametric tests.

• Mann-Whitney U Test: Mann-Whitney U Test is used to determine whether there is a


difference between two independent groups when the dependent variable is ordinal
or continuous. Applicable when assumptions for a t-test are not met. In it we rank all
data points, combines the ranks and calculates the test statistic.

• Kruskal-Wallis Test: Kruskal-Wallis Test is used to determine whether there are


differences among three or more independent groups when the dependent variable is
ordinal or continuous. Non-parametric alternative to one-way ANOVA.
A/B Testing or Split Testing

A/B testing, also known as split testing, is a method used to compare two versions (A and B)
of a webpage, app, or marketing asset to determine which one performs better.

Example: a product manager changes a website's "Shop Now" button color from green to blue
to improve the click-through rate (CTR). After formulating null and alternative hypotheses,
users are divided into A and B groups and CTRs are recorded. Statistical tests like the chi-square
or t-test are applied at a 5% significance level. If the p-value is below 0.05, the manager may
conclude that changing the button color significantly affects CTR, informing the decision for
permanent implementation.

Regression

Regression is a statistical technique used to model the relationship between a dependent


variable and one or more independent variables.

The equation for regression: y = α + βx

Where,

• y is the dependent variable,

• x is the independent variable

• α is the intercept

• β is the regression coefficient.

The regression coefficient measures the strength and direction of the relationship between
a predictor variable (independent variable) and the response variable (dependent variable):

β = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
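The coefficient formula can be applied directly in a few lines of Python (the data points are illustrative):

```python
# Simple linear regression y = alpha + beta * x, fit by least squares
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# beta = sum((x_i - x_mean)(y_i - y_mean)) / sum((x_i - x_mean)^2)
beta = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        / sum((xi - x_mean) ** 2 for xi in x))

# alpha follows from forcing the line through (x_mean, y_mean)
alpha = y_mean - beta * x_mean

print(round(beta, 2), round(alpha, 2))  # 0.6 2.2
```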

Descriptive Statistic

Statistics is the foundation of data science. Descriptive statistics are simple tools that help us
understand and summarize data. They show the basic features of a dataset, like the average,
highest and lowest values and how spread out the numbers are. It's the first step in making
sense of information.


Types of Descriptive Statistics

There are three categories for standard classification of descriptive statistics methods, each
serving different purposes in summarizing and describing data. They help us understand:
1. Where the data centers (Measures of Central Tendency)

2. How spread out the data is (Measure of Variability)

3. How the data is distributed (Measures of Frequency Distribution)

1. Measures of Central Tendency

Statistical values that describe the central position within a dataset. There are three main
measures of central tendency:


Mean: is the sum of observations divided by the total number of observations. It is also
defined as average which is the sum divided by count.

x̄ = Σx / n

where,

• x = Observations

• n = number of terms

Let's look at an example of how we can find the mean of a dataset using a Python code
implementation. Before implementing it, we should have some basic knowledge
about numpy and scipy.

import numpy as np

# Sample Data

arr = [5, 6, 11]

# Mean

mean = np.mean(arr)

print("Mean = ", mean)

Output

Mean = 7.333333333333333
Mode: The most frequently occurring value in the dataset. It’s useful for categorical data and
in cases where knowing the most common choice is crucial.

import scipy.stats as stats

# sample Data

arr = [1, 2, 2, 3]

# Mode

mode = stats.mode(arr)

print("Mode = ", mode)

Output:

Mode = ModeResult(mode=array([2]), count=array([2]))

Median: The median is the middle value in a sorted dataset. If the number of values is odd,
it's the center value, if even, it's the average of the two middle values. It's often better than
the mean for skewed data.

import numpy as np

# sample Data

arr = [1, 2, 3, 4]

# Median

median = np.median(arr)

print("Median = ", median)

Output

Median = 2.5

Note: All implementations are performed using the numpy library in Python. If you want to
learn and understand more about it, refer to the link.

Central tendency measures are the foundation for understanding data distribution and
identifying anomalies. For example, the mean can reveal trends, while the median highlights
skewed distributions.

2. Measure of Variability

Knowing not just where the data centers but also how it spreads out is important. Measures
of variability, also called measures of dispersion, help us see the spread or distribution of
observations in a dataset. They help in identifying outliers, assessing model assumptions and
understanding data variability in relation to its mean. The key measures of variability include:
1. Range: describes the difference between the largest and smallest data points in our data
set. The bigger the range, the greater the spread of the data, and vice versa. While easy to
compute, the range is sensitive to outliers. This measure can provide a quick sense of the data
spread but should be complemented with other statistics.

Range = Largest data value - smallest data value

import numpy as np

# Sample Data

arr = [1, 2, 3, 4, 5]

# Finding Max

Maximum = max(arr)

# Finding Min

Minimum = min(arr)

# Difference Of Max and Min

Range = Maximum-Minimum

print("Maximum = {}, Minimum = {} and Range = {}".format(

Maximum, Minimum, Range))

Output

Maximum = 5, Minimum = 1 and Range = 4

2. Variance: is defined as the average squared deviation from the mean. It is calculated by
finding the difference between every data point and the mean, squaring these differences,
adding them all together, and then dividing by the number of data points present in our
data set.

σ² = Σ(x − μ)² / N

where,

• x -> Observation under consideration

• N -> number of terms

• mu -> Mean

import statistics

# sample data

arr = [1, 2, 3, 4, 5]
# variance (statistics.variance computes the sample variance, n - 1 denominator)

print("Var = ", statistics.variance(arr))

Output

Var = 2.5

3. Standard deviation: Standard deviation is widely used to measure the extent of variation
or dispersion in data. It's especially important when assessing model performance (e.g.,
residuals) or comparing datasets with different means.

It is defined as the square root of the variance. It is calculated by finding the mean,
subtracting the mean from each value and squaring the result, summing all these squared
deviations, dividing by the number of terms, and finally taking the square root.

σ = √( Σ(x − μ)² / N )

where,

• x = Observation under consideration

• N = number of terms

• mu = Mean

import statistics

arr = [1, 2, 3, 4, 5]

print("Std = ", statistics.stdev(arr))  # sample standard deviation (n - 1)

Output

Std = 1.5811388300841898

Variability measures are important in residual analysis to check how well a model fits the data.

3. Measures of Frequency Distribution

A frequency distribution table is a powerful way to summarize how data points are
distributed across different categories or intervals. It helps identify patterns, outliers and the
overall structure of the dataset. It is often the first step in understanding the dataset before
applying more advanced analytical methods or creating visualizations like histograms or pie
charts.

A frequency distribution table includes measures such as:

• Data intervals or categories

• Frequency counts
• Relative frequencies (percentages)

• Cumulative frequencies when needed

For Frequency Distribution – Histogram, Bar Graph, Frequency Polygon and Pie Chart read
article: Frequency Distribution – Table, Graphs, Formula.

Statistical Inference

Statistical inference is the process of using data analysis to infer properties of an underlying
distribution of a population. It is a branch of statistics that deals with making inferences about
a population based on data from a sample.

Statistical inference is based on probability theory and probability distributions. It involves


making assumptions about the population and the sample, and using statistical models to
analyze the data. In this article, we will be discussing it in detail.

Statistical Inference

Statistical inference is the process of drawing conclusions or making predictions about a


population based on data collected from a sample of that population. It involves using
statistical methods to analyze sample data and make inferences or predictions about
parameters or characteristics of the entire population from which the sample was drawn.

Consider a scenario where you are presented with a bag filled with beans of different shapes
and colours, far too many to count individually. The task is to determine the proportion of
red-coloured beans without spending much effort and time. This is how statistical inference
works in this context.

You simply pick a small random sample, a handful, and calculate the proportion of red beans
in it. You have used a small subset, your handful of beans, to draw an inference about a much
larger population: the entire bag of beans.

Branches of Statistical Inference


There are two main branches of statistical inference:

• Parameter Estimation

• Hypothesis Testing

Parameter Estimation

Parameter estimation is a primary goal of statistical inference. Parameters are quantified
traits or properties of the population you are studying, such as the population mean and
population variance. Imagine measuring each person in a town to compute the mean; this is
a daunting, if not impossible, task. Thus, most of the time, we use estimates.

There are two broad methods of parameter estimation:

• Point Estimation

• Interval Estimation

Hypothesis Testing

Hypothesis testing is used to make decisions or draw conclusions about a population based on
sample data. It involves formulating a hypothesis about the population parameter, collecting
sample data, and then using statistical methods to determine whether the data provide
enough evidence to reject or fail to reject the hypothesis.

Statistical Inference Methods

There are various methods of statistical inference, some of these methods are:

• Parametric Methods

• Non-parametric Methods

• Bayesian Methods

Let's discuss these methods in detail as follows:

Parametric Methods

Parametric statistical methods assume that the data is drawn from a population
characterized by a known probability distribution, most commonly the normal distribution,
which allows one to make inferences about the population in question. For example, t-tests
and ANOVA are parametric tests that give accurate results under the assumption that the
data is normally distributed.

• Example: A psychologist may ask himself if there is a measurable difference, on


average, between the IQ scores of women and men. To test his theory, he draws
samples from each group and assumes they are both normally distributed. He can opt
for a parametric test such as t-test and assess if the mean disparity is statistically
significant.

Non-Parametric Methods

These are less assumptive and more flexible analysis methods for dealing with data that does
not follow a normal distribution. They are also used when one is uncertain about meeting the
assumptions for parametric methods, or when data is scarce or inadequate. Some of the
non-parametric tests include the Wilcoxon signed-rank test and the Kruskal-Wallis test,
among others.

• Example: A biologist has collected data on plant health as an ordinal variable; since
the sample is small and the normality assumption is not met, the biologist can use the
Kruskal-Wallis test.

Bayesian Methods

Bayesian statistics is distinct from conventional methods in that it includes prior knowledge
and beliefs. It determines the various potential probabilities of a hypothesis being genuine in
the light of current and previous knowledge. Thus, it allows updating the likelihood of beliefs
with new data.

• Example: consider a situation where a doctor is investigating a new treatment and has
the prior belief about the success rate of the treatment. Upon conducting a new clinical
trial, the doctor uses Bayesian method to update his “prior belief” with the data from
the new trials to estimate the true success rate of the treatment.

Statistical Inference Techniques

Some of the common techniques for statistical inference are:

• Hypothesis Testing

• Confidence Intervals

• Regression Analysis

Let's discuss these in detail as follows:

Hypothesis Testing

One of the central parts of statistical analysis is hypothesis testing, which draws conclusions
about a population from sample data. Hypothesis testing may be defined as a structured
technique that includes formulating two opposing hypotheses, choosing an alpha level,
computing a test statistic, and making a decision based on the obtained outcomes. Two
types of hypotheses can be distinguished: a null hypothesis H₀, signifying no significant
difference, and an alternative hypothesis H₁ (or Ha), expressing a significant effect or difference.
• Example: If a car manufacturing company claims that their new car model gives a
mileage of not less than 25 miles/gallon, an independent agency can collect data for a
sample of these cars and perform a hypothesis test. The null hypothesis would be that
the car does give a mileage of not less than 25 miles/gallon, tested against the
alternative hypothesis that it does not. The sample data would then be used to either
reject or fail to reject the null hypothesis.

Confidence Intervals (CI)

A confidence interval gives a range of possible values within which the population parameter
is likely to lie, at a given confidence level, usually 95%. In simpler terms, CIs provide an
estimate of the population value together with the level of uncertainty that comes with it.

• Example: A study on health records could show that the 95% CI for average blood
pressure is 120-130. In other words, we are 95% confident that the average blood
pressure of the whole population lies between 120 and 130.

Regression Analysis

Regression analysis examines relationships between variables; multiple regression involves
more than two variables. Linear regression, at its most basic level, examines how a dependent
variable Y varies with an independent variable X. The regression equation, Y = a + bX + e,
which is the best-fit line through the data points, quantifies this variation.

• Example: Consider a company that wants to understand the effect of its advertising
on sales. With multiple regression, the effects of several advertising channels can be
quantified at once: Y is the predicted outcome (sales), while X1, X2, and X3 are the
observed variables used to predict it.

Applications of Statistical Inference

Statistical inference has a wide range of applications across various fields. Here are some
common applications:

• Clinical Trials: In medical research, statistical inference is used to analyze clinical trial
data to determine the effectiveness of new treatments or interventions. Researchers
use statistical methods to compare treatment groups, assess the significance of results,
and make inferences about the broader population of patients.

• Quality Control: In manufacturing and industrial settings, statistical inference is used


to monitor and improve product quality. Techniques such as hypothesis testing and
control charts are employed to make inferences about the consistency and reliability
of production processes based on sample data.
• Market Research: In business and marketing, statistical inference is used to analyze
consumer behavior, conduct surveys, and make predictions about market trends.
Businesses use techniques such as regression analysis and hypothesis testing to draw
conclusions about customer preferences, demand for products, and effectiveness of
marketing strategies.

• Economics and Finance: In economics and finance, statistical inference is used to


analyze economic data, forecast trends, and make decisions about investments and
financial markets. Techniques such as time series analysis, regression modeling, and
Monte Carlo simulations are commonly used to make inferences about economic
indicators, asset prices, and risk management.
Unit 2

1. The Data Science Process


A high-level roadmap of how a data science project is usually structured. Following this
helps avoid pitfalls, ensures good practice, and makes work reproducible.

Problem Definition → Data Collection → Data Preprocessing → Exploratory Data Analysis → Modeling →
Evaluation → Deployment → Monitoring

• Problem Definition: What question(s) are you trying to answer? What are the
business / real world objectives, what are success criteria?
• Data Collection: Acquire data from one or more sources. May involve APIs,
databases, scraping, surveys, sensors.
• Data Preprocessing: Cleaning, integrating, transforming, reducing, sometimes
discretizing data to prepare for analysis.
• Exploratory Data Analysis (EDA): Understand data via summary statistics and
visualization; check distributions, relationships, anomalies, missingness.
• Modeling: Choose algorithms, train models.
• Evaluation: Quantify performance, compare models, check whether models meet
criteria.
• Deployment: Putting model into production or decision process.
• Monitoring & Maintenance: Ensuring model continues to perform; dealing with
drift, updating as needed.

2. Data Preprocessing
Preprocessing deals with preparing raw data for analysis and modeling. Raw data is rarely
“clean” or “ready.”

2.1 Data Cleaning

Goal: Fix or remove corrupted, incomplete, incorrectly formatted, or noisy parts of data.

Common Tasks:

• Handling missing values


o Remove rows / columns
o Imputation: mean, median, mode, interpolation
o More sophisticated: using modeling to predict missing values
• Dealing with noise / outliers
o Detect via statistical methods (Z-score, IQR)
o Either remove, cap (winsorizing), or transform
• Duplicates removal
o Exact duplicates
o Near-duplicates (similar records)
• Correct inconsistencies
o Uniform units (e.g. kg vs grams)
o Fix categorical labels (“male”, “Male”, “M”, etc.)
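A minimal cleaning sketch in plain Python showing mean imputation and label standardization; the records and the label map are hypothetical:

```python
# Hypothetical records: weight in kg (None = missing), inconsistent gender labels
records = [
    {"weight": 70.0, "gender": "Male"},
    {"weight": None, "gender": "male"},
    {"weight": 65.0, "gender": "M"},
    {"weight": 80.0, "gender": "female"},
]

# Mean imputation for missing weights
known = [r["weight"] for r in records if r["weight"] is not None]
mean_weight = sum(known) / len(known)
for r in records:
    if r["weight"] is None:
        r["weight"] = mean_weight

# Standardize inconsistent categorical labels to one canonical form
label_map = {"male": "M", "m": "M", "female": "F", "f": "F"}
for r in records:
    r["gender"] = label_map[r["gender"].lower()]

print(records[1])  # imputed weight ~71.67, gender "M"
```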

2.2 Data Integration

Goal: Combine data from multiple sources into a unified dataset.

Challenges:

• Schema mismatches: Different tables may represent the same concept with
different attribute names or units
• Entity resolution: Matching records referring to the same real-world entity
• Redundancy and contradictions: Conflicting values from different sources

Example:

• You have customer data from two systems: system A has cust_id, first_name, last_name,
DOB; system B has id, name, age, email. Need to map fields, standardize name formats,
decide how to compute age / DOB, unify customer_id.

2.3 Data Reduction

Goal: Reduce data size (in terms of number of attributes, number of records) while retaining
as much of the important information as possible.

Techniques:

• Dimensionality reduction: PCA, LDA, feature selection, embedding methods


• Feature selection: removing irrelevant or redundant features
• Sampling: random sampling, stratified sampling
• Aggregation: summarizing (e.g. daily averages of hourly data)
• Numerosity reduction: using histograms, clustering, etc.

2.4 Data Transformation

Goal: Change format or scale of data to improve suitability for analysis or modeling.

Tasks include:

• Normalization / scaling: min-max scaling, z-score (standardization)


• Log / power / root transformations: for skewed data
• Encoding categorical variables: one-hot encoding, ordinal encoding, dummy
variables
• Aggregation: combining data (e.g. total sales per month)
• Feature generation / engineering: combining or creating new features from existing
ones, e.g. interaction terms, ratios
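The scaling and encoding tasks above can be sketched in plain Python (the values are illustrative):

```python
import statistics

values = [10, 20, 30, 40, 50]

# Min-max scaling: (x - min) / (max - min) maps values into [0, 1]
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: (x - mean) / std (population std here)
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
zscores = [(v - mu) / sigma for v in values]

# One-hot encoding of a categorical variable
colors = ["red", "green", "red"]
categories = sorted(set(colors))  # ["green", "red"]
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(minmax)   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(one_hot)  # [[0, 1], [1, 0], [0, 1]]
```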
2.5 Data Discretization

Goal: Convert continuous attributes into discrete buckets (bins), which can sometimes
simplify models, help with interpretability, or be required by some algorithms.

Methods:

• Equal-width binning: divide range into intervals of equal width


• Equal-frequency (quantile) binning: bins such that each has roughly same number of
records
• Clustering-based discretization: group data points using clustering and treat clusters
as categories
• Decision tree based discretization: use tree splits to define bins
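Equal-width and equal-frequency binning can be sketched in plain Python; the dataset and bin count are illustrative:

```python
data = [3, 7, 8, 12, 15, 21, 22, 29, 31, 44]

def equal_width_bins(values, k):
    # Split the range [min, max] into k intervals of equal width;
    # returns the bin index for each value (last bin closed on the right)
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_freq_bins(values, k):
    # Assign values to k bins with roughly equal counts, by rank
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, idx in enumerate(order):
        bins[idx] = min(int(rank / per_bin), k - 1)
    return bins

print(equal_width_bins(data, 4))
print(equal_freq_bins(data, 4))
```

Note how the two methods disagree: equal-width bins can be sparsely populated when the data is skewed, while equal-frequency bins vary in width.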

3. Exploratory Data Analysis (EDA)


Before modeling, you need to know what the data looks like.

3.1 Basic Tools: Summary Statistics, Plots & Graphs

Summary Statistics

• Univariate: mean, median, mode, minimum, maximum, standard deviation,


quartiles, skewness, kurtosis
• Bivariate / Multivariate: covariance, correlation coefficients (Pearson, Spearman),
cross-tabulations (for categorical data)
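These summary statistics are all available in Python's standard library; the sketch below uses invented glucose and BMI values and adds a hand-rolled Pearson correlation for the bivariate case:

```python
import statistics as st

glucose = [85, 92, 101, 110, 118, 126, 140, 155, 170, 210]

print("mean  :", st.mean(glucose))
print("median:", st.median(glucose))
print("stdev :", round(st.stdev(glucose), 2))
print("quartiles:", st.quantiles(glucose, n=4))   # Q1, Q2, Q3

def pearson(x, y):
    """Pearson correlation coefficient between two numeric lists."""
    mx, my = st.mean(x), st.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

bmi = [21, 23, 24, 26, 27, 29, 31, 33, 35, 40]
print("corr(glucose, bmi):", round(pearson(glucose, bmi), 3))
```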

Visualization Tools

• Univariate plots:
o Histogram
o Box plot
o Density plot
o Bar chart (for categorical)
• Bivariate plots:
o Scatter plot
o Correlation heatmap
o Bar plot comparing categories vs numeric summary
• Multivariate / higher-dimensional:
o Pair plot (scatter matrix)
o Principal component plots
o Heatmaps
o Parallel coordinates
• Other visual tools:
o Missingness matrix (to see patterns in missing data)
o Outlier plots
o Time-series plots if data has time component
Example of a Plot

Imagine we have a dataset of housing: features include price, area, num_rooms, age_of_house.

• Histogram of price to see distribution (is it skewed?).


• Scatter plot of area vs price to see how strongly area affects price.
• Box plot of price grouped by num_rooms to see variations by room count.

3.2 Philosophy of EDA

• Curiosity: Be open to discovering unexpected patterns


• Iterative: EDA is not linear; often revisit earlier steps with new insights
• Visualization + statistics: Both are important; stats give quantitative backbone,
graphs provide intuition and detection of anomalies
• Hypothesis generation: Use EDA to formulate hypotheses to test in modeling or
further analysis
• Transparency: Clean, documented EDA helps trust and reproducibility

4. Evaluation of Classification Methods


After preparing data and building models, you need to evaluate how good they are.

4.1 Confusion Matrix & Related Metrics

A confusion matrix for a binary classifier looks like this:

                  Predicted Positive     Predicted Negative
Actual Positive   True Positive (TP)     False Negative (FN)
Actual Negative   False Positive (FP)    True Negative (TN)

From these, define:

• Accuracy = (TP + TN) / (TP + FP + TN + FN)


• Precision (Positive Predictive Value) = TP / (TP + FP)
• Recall (Sensitivity / True Positive Rate) = TP / (TP + FN)
• Specificity (True Negative Rate) = TN / (TN + FP)
• F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

Example Hypothetical:

Suppose in a medical test for a disease you have:

• TP = 70
• TN = 900
• FP = 30
• FN = 50
Then:

• Accuracy = (70 + 900) / (70+900+30+50) = 970/1050 ≈ 0.9238


• Precision = 70 / (70 + 30) = 70/100 = 0.70
• Recall = 70 / (70 + 50) = 70/120 ≈ 0.5833
• F1 = 2 * (0.70 * 0.5833) / (0.70 + 0.5833) ≈ 0.636
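The calculations above can be packaged into a small helper function; running it on the same counts reproduces the numbers:

```python
def metrics(tp, tn, fp, fn):
    """Standard classification metrics from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# The medical-test counts from the example above:
acc, prec, rec, spec, f1 = metrics(tp=70, tn=900, fp=30, fn=50)
print(f"accuracy={acc:.4f} precision={prec:.2f} recall={rec:.4f} f1={f1:.3f}")
# accuracy=0.9238 precision=0.70 recall=0.5833 f1=0.636
```

Note that a naive classifier predicting "no disease" for everyone would score (900 + 50) / 1050 ≈ 0.905 accuracy with zero recall, which is why accuracy alone is misleading on imbalanced data.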

4.2 ROC Curve & AUC

ROC = Receiver Operating Characteristic. Plots:

• X-axis: False Positive Rate (FPR) = FP / (FP + TN)


• Y-axis: True Positive Rate (TPR) = Recall = TP / (TP + FN)

As we vary the decision threshold (if the classifier produces probabilities), we get different
(FPR, TPR) pairs → ROC curve.

• AUC (Area Under ROC Curve): summarises overall performance across all thresholds.
AUC = 1 means perfect, 0.5 means random guess for binary classification.

Interpretation: A good ROC is one that bulges up toward the top-left corner.
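The threshold sweep can be computed by hand for a tiny toy example; the labels and scores below are invented, and the AUC is obtained with the trapezoidal rule:

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2
    return area

labels = [1, 1, 0, 0]           # true classes
scores = [0.9, 0.8, 0.3, 0.1]   # predicted probabilities
pts = roc_points(labels, scores)
print(pts)        # the curve hugs the top-left corner
print(auc(pts))   # 1.0, since the classes are perfectly separated
```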

4.3 Student’s T-test

Used to test if there is a significant difference between means of two groups. In


classification contexts, often to compare performance metrics (like accuracy) across two
models.

• Types:
1. Independent two-sample t-test: comparing metric from two different
models on independent folds / datasets
2. Paired t-test: comparing two models on the same dataset / splits (e.g.
cross-validation folds) — accounts for correlation
• Hypotheses:
o Null hypothesis H0: the means are equal
o Alternative hypothesis H1: the means are different
• Procedure:
1. Collect metric values (e.g. accuracies) from multiple runs/folds for model A
and model B.
2. Compute t-statistic.
3. Compare with critical t (or compute p-value).
4. If p < α (e.g. 0.05), reject H0 → conclude that one model is significantly
better than the other.
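The paired version of this procedure can be sketched with only the standard library. The per-fold accuracies below are invented for illustration, and only the t-statistic is computed; the p-value would normally come from a t-table or a stats library such as SciPy's ttest_rel:

```python
import math
import statistics as st

def paired_t(a, b):
    """t-statistic for a paired t-test on per-fold metrics of two models."""
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    # t = mean(d) / (stdev(d) / sqrt(n)), with sample standard deviation
    return st.mean(diffs) / (st.stdev(diffs) / math.sqrt(n))

# Hypothetical per-fold accuracies for two models (5 folds):
model_a = [0.85, 0.88, 0.89, 0.86, 0.87]
model_b = [0.90, 0.91, 0.92, 0.88, 0.90]

t = paired_t(model_a, model_b)
print(round(t, 3))  # compare with the critical t for df = 4 (2.776 at α = 0.05)
```

Here t ≈ 6.53 far exceeds 2.776, so with this made-up data we would reject H0 and conclude Model B performs significantly better.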

5. Worked Example (End-to-end)


Let's walk through a hypothetical example using a dataset for predicting whether a person
has diabetes based on features like age, BMI, blood pressure, glucose level, etc.

Step A: Data Collection & Problem Definition

• Problem: Predict binary outcome “Diabetic” / “Not Diabetic.”


• Success criteria: High recall (because missing a diabetic case is costly), but also
precision should not be too low.

Step B: Data Preprocessing

Suppose our raw data looks like this (simplified):

id  age  bmi   blood_pressure  glucose_level  gender  outcome
1   45   27.5  80              140            Male    Yes
2   34   –     70              120            Female  No
3   50   30.2  85              –              Male    Yes
... ...  ...   ...             ...            ...     ...

B1. Data Cleaning

• Missing values: bmi missing for id=2, glucose missing for id=3.
o Strategy: Impute using median or mean (for numeric)
o Or for glucose: maybe predictive imputation using other features.
• Outliers: Suppose a record with bmi = 1000 (typo). Cap or remove.
• Duplicates: Remove duplicate ids if any.
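The imputation and outlier-capping strategies above can be sketched in plain Python; the bmi values (including the 1000.0 typo) are invented, with None standing in for a missing value:

```python
import statistics as st

def impute_median(column):
    """Replace missing values (None) with the median of the observed values."""
    observed = [x for x in column if x is not None]
    med = st.median(observed)
    return [med if x is None else x for x in column]

def cap_outliers(column, lo, hi):
    """Clip values outside a plausible range (a bmi of 1000 is clearly a typo)."""
    return [min(max(x, lo), hi) for x in column]

bmi = [27.5, None, 30.2, 1000.0, 24.8]
bmi = impute_median(bmi)          # None -> median of observed values (28.85)
bmi = cap_outliers(bmi, 10, 60)   # 1000.0 -> 60
print(bmi)
```

Note the order matters here: capping before imputing would keep the typo from distorting the median; with median imputation the typo happens not to matter, but with mean imputation it would.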

B2. Data Integration

• Suppose we also have another table with patient’s lab reports, which has extra
features like cholesterol, etc. Merge on id. Handle differences in naming, units.

B3. Data Reduction

• Feature selection: Maybe features like id are irrelevant. Maybe some lab test is
irrelevant.
• Dimensionality reduction: If there are many lab test features, use PCA to reduce.

B4. Data Transformation

• Scale numeric features (age, bmi, glucose) using z-score normalization (subtract
mean, divide by std dev).
• Encode gender as 0/1.
• If glucose is heavily skewed, maybe do log transformation.

B5. Data Discretization


• Discretize age into bins (e.g. <30, 30-45, 45-60, >60) for interpretability or for
algorithms which benefit.

Step C: Exploratory Data Analysis (EDA)

• Univariate analysis: Histograms of glucose, BMI, boxplots of blood pressure. Check


for skewed distributions.
• Bivariate: Scatter plots of glucose vs. BMI, or age vs. glucose. Box plot of glucose
grouped by outcome (Yes/No).
• Missingness pattern: Plot which features have more missing values; check if
missingness correlates with outcome.
• Correlation heatmap: Among numeric variables, see which are highly correlated (to
detect multicollinearity).

Step D: Modeling

• Pick a classification model (say logistic regression, random forest). Split data into
training/test sets or do cross-validation.

Step E: Evaluation

• Compute confusion matrix on test set. Suppose results:

                Pred = Yes    Pred = No
Actual = Yes    TP = 80       FN = 20
Actual = No     FP = 15       TN = 185

From this:

• Accuracy = (80 + 185) / (80+185+15+20) = 265 / 300 ≈ 88.33%


• Precision = 80 / (80 + 15) ≈ 84.21%
• Recall = 80 / (80 + 20) = 80%
• F1-score = 2 * (0.8421 * 0.80) / (0.8421 + 0.80) ≈ 0.820
• Plot ROC curve: using predicted probabilities, compute TPR vs FPR for multiple
thresholds; compute AUC, say 0.92.

Step F: Comparing Models with Student’s T-Test

• Suppose we trained two models: Model A (logistic regression), Model B (random


forest).
• Do 10-fold cross-validation, record accuracy in each fold for both models:
o Model A accuracies: 0.85, 0.88, 0.89, …
o Model B accuracies: 0.90, 0.91, 0.92, …
• Use paired t-test to test whether mean accuracy for Model B > Model A is
statistically significant.
• If p-value < 0.05, conclude Model B is significantly better.
Unit 3

Basic Machine Learning Algorithms

1. Introduction
What is machine learning?
Machine learning (ML) is a branch of artificial intelligence (AI) and computer science focused
on building computer systems that learn from data: it uses data and algorithms to imitate the
way that humans learn, gradually improving its accuracy.
The broad range of techniques ML encompasses enables software applications to improve their
performance over time.
Machine learning is an important component of the growing field of data science.
Through the use of statistical methods, algorithms are trained to make classifications or
predictions, and to uncover key insights in data mining projects.
Machine learning algorithms are trained to find relationships and patterns in data. They use
historical data as input to make predictions, classify information, cluster data points, reduce
dimensionality and even help generate new content.

Machine learning is widely applicable


E-commerce, social media and news organizations to suggest content based on a customer’s
past behaviour, for example, use recommendation engines.
Machine learning algorithms and machine vision are a critical component of self-driving cars,
helping them navigate the roads safely. In healthcare, machine learning is used to diagnose and
suggest treatment plans.
Other common ML use cases include fraud detection, spam filtering, malware threat detection,
predictive maintenance and business process automation.

Automation: Machine learning lets systems operate largely autonomously, with little or no
human intervention. For example, robots perform essential process steps in manufacturing
plants.

Finance Industry: Machine learning is growing in popularity in the finance industry. Banks
are mainly using ML to find patterns inside the data but also to prevent fraud. Ex. PayPal

Government organization: The government makes use of ML to manage public safety and
utilities. Take the example of China with its massive face recognition. The government uses
Artificial intelligence to prevent jaywalking.

Healthcare industry: Healthcare was one of the first industries to use machine learning with
image detection. Drug discovery, Personalized treatment

Marketing: Broad use of AI is done in marketing thanks to abundant access to data. Before
the age of mass data, researchers develop advanced mathematical tools like Bayesian analysis
to estimate the value of a customer. With the boom of data, the marketing department relies on
AI to optimize customer relationships and marketing campaigns.
Retail industry: Machine learning is used in the retail industry to analyze customer behavior,
predict demand, and manage inventory. It also helps retailers to personalize the shopping
experience for each customer by recommending products based on their past purchases and
preferences. Ex. Netflix, YouTube

Transportation: Machine learning is used in the transportation industry to optimize routes,


reduce fuel consumption, and improve the overall efficiency of transportation systems. It also
plays a role in autonomous vehicles, where ML algorithms are used to make decisions about
navigation and safety.
Ex. Uber, Ola

Machine learning has become essential for solving problems across numerous areas, such
as
1. Computational finance (credit scoring, algorithmic trading)
2. Computer vision (facial recognition, motion tracking, object detection)
3. Computational biology (DNA sequencing, brain tumor detection, drug discovery)
4. Automotive, aerospace, and manufacturing (predictive maintenance)
5. Natural language processing (voice recognition)

Difference between AI, Machine Learning and Traditional Programming


Machine Learning:
• A subset of artificial intelligence (AI) that focuses on learning from data to develop an
algorithm that can be used to make a prediction.
• Uses a data-driven approach; models are typically trained on historical data and then used
to make predictions on new data.
• Can find patterns and insights in large datasets that might be difficult for humans to
discover.
• Now used in various AI-based tasks like chatbot question answering, self-driving cars, etc.

Traditional Programming:
• Rule-based code is written by the developers depending on the problem statement.
• Typically rule-based and deterministic; it has no self-learning features like Machine
Learning and AI.
• Totally dependent on the intelligence of developers, so it has very limited capability.
• Often used to build applications and software systems that have specific functionality.

Artificial Intelligence:
• Involves making the machine as capable as possible, so that it can perform tasks that
typically require human intelligence.
• Can involve many different techniques, including Machine Learning and Deep Learning, as
well as traditional rule-based programming.
• Sometimes uses a combination of both data and pre-defined rules, which gives it a great
edge in solving complex tasks, with good accuracy, that seem impossible for humans.
• A broad field that includes many different applications, including natural language
processing, computer vision, and robotics.
How machine-learning algorithms work
Machine Learning works in the following manner.

Forward Pass: In the Forward Pass, the machine learning algorithm takes in input data and
produces an output. Depending on the model algorithm it computes the predictions.

Loss Function: The loss function, also known as the error or cost function, is used to evaluate
the accuracy of the predictions made by the model. The function compares the predicted output
of the model to the actual output and calculates the difference between them. This difference
is known as error or loss. The goal of the model is to minimize the error or loss function by
adjusting its internal parameters.

Model Optimization Process: The model optimization process is the iterative process of
adjusting the internal parameters of the model to minimize the error or loss function. This is
done using an optimization algorithm, such as gradient descent.
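The forward pass, loss, and optimization loop can be sketched for the simplest possible model, y ≈ w·x, fitted by gradient descent on mean squared error. The toy data is invented and lies exactly on y = 2x:

```python
def gradient_descent(xs, ys, lr=0.1, steps=100):
    """Fit y ≈ w*x by minimizing mean squared error with gradient descent."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Forward pass: predictions with the current weight.
        preds = [w * x for x in xs]
        # Gradient of the loss: d/dw (1/n) Σ (w*x - y)^2 = (2/n) Σ x*(w*x - y)
        grad = 2 / n * sum(x * (p - y) for x, p, y in zip(xs, preds, ys))
        # Optimization step: move the weight against the gradient.
        w -= lr * grad
    return w

# Toy data generated by y = 2x; the learned weight should approach 2.
w = gradient_descent([1, 2, 3], [2, 4, 6])
print(round(w, 4))  # ≈ 2.0
```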

Future Scope (Needs) of Machine Learning


The scope of Machine Learning is not limited to the investment sector. Rather, it is expanding
across all fields such as banking and finance, information technology, media & entertainment,
gaming, and the automotive industry.

1. Automotive Industry (smart cars and smart homes, IoT)


2. Robotics
3. Quantum Computing
4. Computer Vision
5. Predicting security breaches, finding malware and other anomalies in data
6. Personalized recommendations (ex: Netflix, Amazon)
7. Improving online search results based on preferences
8. Natural language processing
9. Wearable technology, especially in healthcare

Limitations of Machine Learning


1. Lack of Transparency and Interpretability
2. Overfitting and Underfitting
3. Bias and Discrimination
4. Limited Data Availability
5. Computational Resources
6. Lack of Causality
7. Ethical Considerations
A machine cannot learn if there is no data available. Moreover, a dataset that lacks diversity
makes learning difficult: the data needs enough heterogeneity for the machine to learn
meaningful insights, and an algorithm can rarely extract information when there are no or few
variations. A common rule of thumb is to have at least 20 observations per group; fewer than
that leads to poor evaluation and prediction.

How machine learning works


A Decision Process: In general, machine learning algorithms are used to make a prediction or
classification. Based on some input data, which can be labeled or unlabeled, your algorithm
will produce an estimate about a pattern in the data.
An Error Function: An error function evaluates the prediction of the model. If there are
known examples, an error function can make a comparison to assess the accuracy of the model.

A Model Optimization Process: If the model can fit better to the data points in the training
set, then weights are adjusted to reduce the discrepancy between the known example and the
model estimate. The algorithm will repeat this “evaluate and optimize” process, updating
weights autonomously until a threshold of accuracy has been met.

Machine Learning lifecycle:


The lifecycle of a machine learning project involves a series of steps that include:

Study the Problems: The first step is to study the problem. This step involves understanding
the business problem and defining the objectives of the model.

Data Collection: When the problem is well-defined, we can collect the relevant data required
for the model. The data could come from various sources such as databases, APIs, or web
scraping.

Data Preparation: Once the problem-related data has been collected, it should be checked
and put into the desired format so that the model can use it to find hidden patterns. This can
be done in the following steps:

Data cleaning
Data transformation
Exploratory data analysis and feature engineering
Splitting the dataset for training and testing

Model Selection: The next step is to select the appropriate machine learning algorithm that is
suitable for our problem. This step requires knowledge of the strengths and weaknesses of
different algorithms. Sometimes we use multiple models and compare their results and select
the best model as per our requirements.

Model building and Training: After selecting the algorithm, we have to build the model.
In traditional machine learning, building the model is straightforward, often requiring only a
little hyperparameter tuning.
In deep learning, we have to define the layer-wise architecture along with the input and output
sizes, the number of nodes in each layer, the loss function, the gradient descent optimizer, etc.
After that, the model is trained using the preprocessed dataset.

Model Evaluation: Once the model is trained, it can be evaluated on the test dataset to
determine its accuracy and performance using different techniques like classification report,
F1 score, precision, recall, ROC Curve, Mean Square error, absolute error, etc.

Model Tuning: Based on the evaluation results, the model may need to be tuned or optimized
to improve its performance. This involves tweaking the hyperparameters of the model.

Deployment: Once the model is trained and tuned, it can be deployed in a production
environment to make predictions on new data. This step requires integrating the model into an
existing software system or creating a new system for the model.
Monitoring and Maintenance: Finally, it is essential to monitor the model’s performance in
the production environment and perform maintenance tasks as required. This involves
monitoring for data drift, retraining the model as needed, and updating the model as new data
becomes available.

Types of Machine Learning


1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-supervised learning
4. Reinforcement Machine Learning

Supervised Machine Learning:-


This type of ML involves supervision, where machines are trained on labeled datasets and
enabled to predict outputs based on the provided training. The labeled dataset specifies that
some input and output parameters are already mapped. Hence, the machine is trained with the
input and corresponding output. The machine is then made to predict the outcome on a test
dataset in subsequent phases.
For example, consider an input dataset of parrot and crow images. Initially, the machine is
trained to understand the pictures, including the parrot and crow’s color, eyes, shape, and size.
The primary objective of the supervised learning technique is to map the input variable (a) with
the output variable (b). Supervised machine learning is further classified into two broad
categories:

Classification: These refer to algorithms that address classification problems where the output
variable is categorical; for example, yes or no, true or false, male or female, etc. Real-world
applications of this category are evident in spam detection and email filtering.
Some known classification algorithms include the Random Forest Algorithm, Decision Tree
Algorithm, Logistic Regression Algorithm, and Support Vector Machine Algorithm.

Regression: Regression algorithms handle regression problems where the output variable is
continuous and modeled as a function of the input variables. These are used to predict
continuous output variables. Examples include weather prediction, market trend analysis, etc.
Popular regression algorithms include the Simple Linear Regression Algorithm, Multivariate
Regression Algorithm, Decision Tree Algorithm, and Lasso Regression.

Unsupervised Learning:
Unsupervised learning refers to a learning technique that’s devoid of supervision. Here, the
machine is trained using an unlabeled dataset and is enabled to predict the output without any
supervision. An unsupervised learning algorithm aims to group the unsorted dataset based on
the input’s similarities, differences, and patterns.
For example, consider an input dataset of images of a fruit-filled container. Here, the images
are not known to the machine learning model. When we input the dataset into the ML model,
the task of the model is to identify the pattern of objects, such as color, shape, or differences
seen in the input images and categorize them.
Unsupervised machine learning is further classified into two types:

Clustering: The clustering technique refers to grouping objects into clusters based on
parameters such as similarities or differences between objects. For example, grouping
customers by the products they purchase.
Some known clustering algorithms include the K-Means Clustering Algorithm, Mean-Shift
Algorithm, DBSCAN Algorithm, Principal Component Analysis, and Independent Component
Analysis.

Association: Association learning refers to identifying typical relations between the variables
of a large dataset. It determines the dependency of various data items and maps associated
variables. Typical applications include web usage mining and market data analysis.
Popular algorithms obeying association rules include the Apriori Algorithm, Eclat Algorithm,
and FP- Growth Algorithm.

Semi-supervised learning
Semi-supervised learning comprises characteristics of both supervised and unsupervised
machine learning. It uses the combination of labeled and unlabeled datasets to train its
algorithms. Using both types of datasets, semi-supervised learning overcomes the drawbacks
of the options mentioned above.
Consider an example of a college student. A student learning a concept under a teacher's
supervision in college is supervised learning; a student revising the same concept alone,
without any guidance, resembles unsupervised learning. Semi-supervised learning combines
both: some concepts are learned with the teacher's guidance and the rest through self-study.

Reinforcement learning
Reinforcement learning is a feedback-based process. Here, the AI component automatically
takes stock of its surroundings by the hit & trial method, takes action, learns from experiences,
and improves performance. The component is rewarded for each good action and penalized for
every wrong move. Thus, the reinforcement learning component aims to maximize the rewards
by performing good actions.
Unlike supervised learning, reinforcement learning lacks labeled data, and the agents learn via
experiences only. Consider video games. Here, the game specifies the environment, and each
move of the reinforcement agent defines its state.

Regression:
Machine Learning Regression is a technique for investigating the relationship between
independent variables or features and a dependent variable or outcome.
Solving regression problems is one of the most common applications for machine learning
models, especially in supervised machine learning.
Regression is used to identify patterns and relationships within a dataset, which can then be
applied to new and unseen data. This makes regression a key element of machine learning in
finance, and is often leveraged to help forecast portfolio performance or stock costs and trends.
Common use for machine learning regression models include:

1. Forecasting continuous outcomes like house prices, stock prices, or sales.


2. Predicting the success of future retail sales or marketing campaigns to ensure resources are
used effectively.
3. Predicting customer or user trends, such as on streaming services or e-commerce websites.
4. Analysing datasets to establish the relationships between variables and an output.
5. Predicting interest rates or stock prices from a variety of factors.
6. Creating time series visualisations.

Types of regression:
 Simple Regression
Used to predict a continuous dependent variable based on a single independent variable. Simple
linear regression should be used when there is only a single independent variable.
 Multiple Regression
Used to predict a continuous dependent variable based on multiple independent variables.
Multiple linear regression should be used when there are multiple independent variables
 Simple Linear Regression
 Multiple linear regression
 Logistic regression

Regression Algorithms
There are many different types of regression algorithms, but some of the most common include:

 Linear regression: Linear regression is one of the simplest and most widely used statistical
models. This assumes that there is a linear relationship between the independent and dependent
variables. This means that the change in the dependent variable is proportional to the change
in the independent variables.

 Polynomial regression: Polynomial regression is used to model nonlinear relationships


between the dependent variable and the independent variables. It adds polynomial terms to the
linear regression model to capture more complex relationships.

 Support vector regression (SVR): Support vector regression (SVR) is a type of regression
algorithm that is based on the support vector machine (SVM) algorithm. SVM is primarily
used for classification tasks, but it can be adapted for regression. SVR works by fitting a
function (hyperplane) that keeps as many training points as possible within a small margin
around it (the ε-insensitive tube), penalizing only points that fall outside that margin.

 Decision tree regression: Decision tree regression is a type of regression algorithm that
builds a decision tree to predict the target value. A decision tree is a tree-like structure that
consists of nodes and branches. Each node represents a decision, and each branch represents
the outcome of that decision. The goal of decision tree regression is to build a tree that can
accurately predict the target value for new data points.

 Ridge Regression: Ridge regression is a type of linear regression that is used to prevent
overfitting. Overfitting occurs when the model learns the training data too well and is unable
to generalize to new data

 Lasso regression: Lasso regression is another type of linear regression that is used to prevent
overfitting. It does this by adding an L1 penalty term to the loss function that shrinks some
weights and sets others exactly to zero, effectively performing feature selection.

 Random forest regression: Random forest regression is an ensemble method that combines
multiple decision trees to predict the target value. Ensemble methods are a type of machine
learning algorithm that combines multiple models to improve the performance of the overall
model. Random forest regression works by building a large number of decision trees, each of
which is trained on a different subset of the training data. The final prediction is made by
averaging the predictions of all of the trees.
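As a minimal illustration of the simplest of these algorithms, simple linear regression has a closed-form least-squares solution. The toy data below is invented and lies exactly on y = 1 + 2x:

```python
def fit_simple_linear(xs, ys):
    """Closed-form least squares for y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    # Intercept: the fitted line passes through the point of means.
    b0 = my - b1 * mx
    return b0, b1

# Toy data on the line y = 1 + 2x:
b0, b1 = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```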

Characteristics of Regression
Here are the characteristics of the regression:
 Continuous Target Variable: Regression deals with predicting continuous target variables
that represent numerical values. Examples include predicting house prices, forecasting sales
figures, or estimating patient recovery times.

 Error Measurement: Regression models are evaluated based on their ability to minimize
the error between the predicted and actual values of the target variable. Common error metrics
include mean absolute error (MAE), mean squared error (MSE), and root mean squared error
(RMSE).

 Model Complexity: Regression models range from simple linear models to more complex
nonlinear models. The choice of model complexity depends on the complexity of the
relationship between the input features and the target variable.

 Overfitting and Underfitting: Regression models are susceptible to overfitting and


underfitting. Overfitting occurs when the model fits the training data too well and fails to
generalize to new data. Underfitting occurs when the model fails to capture the underlying
patterns in the data.

 Interpretability: The interpretability of regression models varies depending on the


algorithm used. Simple linear models are highly interpretable, while more complex models
may be more difficult to interpret.

Logistic Regression
Purpose:
Used for binary classification problems (Yes/No, 0/1).

Equation:

P(Y=1) = 1 / (1 + e^(-(β0 + β1·X)))

Output:

Gives probability values between 0 and 1.


Decision rule:

• If ( P(Y=1) > 0.5 ), predict 1 (positive class)


• Else, predict 0 (negative class)

Applications:
• Spam detection
• Disease diagnosis (e.g., diabetes prediction)
• Customer churn prediction
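The sigmoid output and the 0.5 decision rule can be sketched directly; the coefficients below are hypothetical, not fitted to real data:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict(x, b0, b1, threshold=0.5):
    """Logistic-regression decision rule for a single feature."""
    p = sigmoid(b0 + b1 * x)
    return (1 if p > threshold else 0), p

# Hypothetical coefficients, e.g. probability of disease vs. glucose level:
b0, b1 = -4.0, 0.03
print(predict(100, b0, b1))  # low glucose  -> class 0
print(predict(180, b0, b1))  # high glucose -> class 1
```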
Classifiers Overview
A classifier is an algorithm that maps input data to a category (label).
Common classifiers:

• k-Nearest Neighbors (k-NN)


• Decision Trees
• Naïve Bayes
• Random Forest
• Support Vector Machines (SVM)

k-Nearest Neighbors (k-NN)


Idea:

Classify a data point based on the majority label among its k nearest neighbors.

Steps:

1. Choose value of k.
2. Calculate distance (Euclidean, Manhattan, etc.).
3. Find k nearest points.
4. Assign class based on majority vote.

Advantages:

• Simple and non-parametric.


• Works well for small datasets.

Disadvantages:

• Computationally expensive for large datasets.


• Sensitive to irrelevant features.
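The four steps above fit in a few lines of Python; the labelled points and query are made up:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labelled point.
    dists = sorted(
        ((sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, label)
         for x, label in train)
    )
    # Majority vote among the k closest labels.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(train, (1.5, 1.5), k=3))  # "A": two of the three nearest are A
```

Because every prediction scans the whole training set, the cost grows with its size, which is the "computationally expensive" drawback noted above.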

k-Means Clustering
Type: Unsupervised Learning

Purpose:

Groups data into k clusters such that each point belongs to the nearest centroid.

Algorithm Steps:

1. Select k cluster centroids.


2. Assign each data point to the nearest centroid.
3. Recalculate centroids.
4. Repeat until convergence.

Applications:

• Customer segmentation
• Image compression
• Document clustering
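The assign/recompute loop (Lloyd's algorithm) can be sketched on one-dimensional toy data; the values and starting centroids are invented, and the sketch assumes no cluster ever becomes empty:

```python
def kmeans_1d(data, centroids, iters=10):
    """k-means on 1-D data: assign points, then recompute centroids."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in data:
            # 1. Assign each point to its nearest centroid.
            idx = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[idx].append(x)
        # 2. Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) for c in clusters]
    return sorted(centroids)

data = [1, 2, 3, 10, 11, 12]
print(kmeans_1d(data, centroids=[1, 12]))  # [2.0, 11.0]
```

With this data the algorithm converges after one pass: the two centroids settle at the means of the two obvious groups.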

Decision Tree
Concept:

A tree-like structure where each internal node represents a test on an attribute, each branch
represents an outcome, and each leaf node represents a class label.

Algorithm:

• ID3, CART, or C4.5

Key Metrics:

• Entropy (H):
H(S) = − Σ p_i log2(p_i)
• Information Gain (IG):
IG = H(parent) − Σ (|child| / |parent|) × H(child)

Advantages:
• Easy to interpret.
• Handles numerical and categorical data.

Disadvantages:

• Prone to overfitting.
• Small changes in data may alter the tree.
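The entropy and information-gain formulas can be checked numerically on the classic play-tennis counts (9 positive / 5 negative examples, split on the Wind attribute):

```python
import math

def entropy(pos, neg):
    """H(S) = -Σ p_i log2 p_i for a two-class node."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# Parent node: 9 positive / 5 negative examples.
h_parent = entropy(9, 5)                               # ≈ 0.940

# Split on Wind: weak branch = 6+/2-, strong branch = 3+/3-.
ig = h_parent - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(h_parent, 3), round(ig, 3))                # 0.94 0.048
```

An IG of only 0.048 tells the tree-builder that Wind is a weak split; an attribute with higher gain would be chosen first.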

Naïve Bayes Classifier


Concept:

Based on Bayes Theorem and assumes independence among predictors.


P(C|X) = P(X|C) × P(C) / P(X)

Where:

• (P(C|X)): Posterior probability


• (P(C)): Prior probability
• (P(X|C)): Likelihood

Types:

• Gaussian Naïve Bayes


• Multinomial Naïve Bayes
• Bernoulli Naïve Bayes

Applications:

• Email spam filtering


• Sentiment analysis
• Document classification
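Bayes' theorem itself is easy to verify numerically. The sketch below uses a hypothetical disease test (1% prevalence, 90% sensitivity, 5% false-positive rate), expanding P(X) by the law of total probability:

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(C|X) = P(X|C)·P(C) / P(X), with P(X) expanded by total probability."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical disease test: 1% prevalence, 90% sensitivity, 5% false positives.
p = posterior(prior=0.01, likelihood=0.90, likelihood_given_not=0.05)
print(round(p, 3))  # 0.154: a positive test still leaves the disease unlikely
```

This is the same machinery a Naive Bayes classifier applies, except that the likelihood P(X|C) is factored into a product over features under the independence assumption.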

Ensemble Methods
Concept:

Combines multiple base learners (weak models) to form a strong model.

Types:

1. Bagging (Bootstrap Aggregating):


o Trains multiple models on random subsets of data.
o Example: Random Forest
2. Boosting:
o Trains models sequentially, each focusing on errors of previous ones.
o Examples: AdaBoost, Gradient Boosting, XGBoost
3. Stacking:
o Combines predictions from multiple models using a meta-classifier.

Random Forest
Concept:

An ensemble of multiple Decision Trees trained using bagging.

Steps:
1. Randomly select data samples (with replacement).
2. Build decision trees on each sample.
3. For classification — use majority voting.
4. For regression — take average prediction.

Advantages:

• High accuracy.
• Reduces overfitting.
• Handles missing values.
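The aggregation steps (3 and 4 above) reduce to a vote and an average; the per-tree predictions below are invented:

```python
from collections import Counter

def forest_classify(tree_predictions):
    """Random-forest aggregation for classification: majority vote."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Random-forest aggregation for regression: average of the trees."""
    return sum(tree_predictions) / len(tree_predictions)

print(forest_classify(["Yes", "No", "Yes", "Yes", "No"]))  # "Yes"
print(forest_regress([210.0, 198.5, 205.0]))               # 204.5
```

The trees themselves would each be fit on a bootstrap sample with a random subset of features; only the final aggregation is shown here.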

Feature Generation and Selection


1. Feature Generation

Creating new features from raw data to improve model performance.


Examples:

• Polynomial features (e.g., x², x³)


• Aggregations (e.g., mean, sum)
• Text vectorization (TF-IDF)
• Feature encoding (One-Hot, Label)

Feature Selection
Selecting the most important features to:

• Reduce overfitting
• Improve accuracy
• Decrease computation time

Types:
(a) Filter Methods

• Use statistical measures to rank features.


• Examples:
o Correlation coefficient
o Chi-square test
o Mutual Information

(b) Wrapper Methods

• Use a predictive model to evaluate combinations of features.


• Examples:
o Forward selection
o Backward elimination
o Recursive Feature Elimination (RFE)

(c) Embedded Methods


• Feature selection occurs during model training.
• Examples:
o Decision Trees (feature importance)
o Random Forests (Gini importance)
o LASSO (L1 regularization)
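A filter method in miniature: rank features by their absolute Pearson correlation with the target. This is a sketch with made-up data; the helper name is mine:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

target = [1, 2, 3, 4]
features = {'useful': [2, 4, 6, 8], 'noise': [5, 1, 4, 2]}
ranked = sorted(features,
                key=lambda f: abs(pearson(features[f], target)),
                reverse=True)
print(ranked)  # ['useful', 'noise']
```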

Machine Learning Models

Types of classification algorithms in Machine Learning:

1. Linear Classifiers: Logistic Regression, Naive Bayes Classifier

2. Nearest Neighbour

3. Support Vector Machines

4. Decision Trees

5. Boosted Trees

6. Random Forest

7. Neural Networks

Naive Bayes Classifier (Generative Learning Model):

It is a classification technique based on Bayes’ Theorem with an assumption of independence


among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature. Even if these
features depend on each other or upon the existence of the other features, all of these properties
independently contribute to the probability. Naive Bayes model is easy to build and particularly
useful for very large data sets. Along with its simplicity, Naive Bayes can outperform
even highly sophisticated classification methods on some problems.

Nearest Neighbour:

The k-nearest-neighbour algorithm is a classification algorithm, and it is supervised: it takes a


bunch of labelled points and uses them to learn how to label other points. To label a new point,
it looks at the labelled points closest to that new point (those are its nearest neighbours) and
has those neighbours vote, so whichever label most of the neighbours have becomes the label for
the new point (the “k” is the number of neighbours it checks).
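The voting procedure described above fits in a few lines of Python (a toy sketch; `math.dist` computes Euclidean distance):

```python
import math
from collections import Counter

def knn_classify(point, labelled, k):
    """Label `point` by majority vote among its k nearest labelled points."""
    nearest = sorted(labelled, key=lambda item: math.dist(point, item[0]))[:k]
    votes = [label for _, label in nearest]
    return Counter(votes).most_common(1)[0][0]

train = [((1, 1), 'A'), ((1, 2), 'A'), ((8, 8), 'B'), ((9, 8), 'B')]
print(knn_classify((2, 2), train, k=3))  # 'A'
```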

Logistic Regression (Predictive Learning Model):

It is a statistical method for analysing a data set in which there are one or more independent
variables that determine an outcome. The outcome is measured with a dichotomous variable
(in which there are only two possible outcomes). The goal of logistic regression is to find the
best fitting model to describe the relationship between the dichotomous characteristic of
interest (dependent variable = response or outcome variable) and a set of independent (predictor
or explanatory) variables. An advantage over other binary classifiers such as nearest neighbour
is that logistic regression also quantifies how much each factor contributes to the classification.
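Prediction with a fitted logistic regression model reduces to a weighted sum pushed through the logistic (sigmoid) function. The coefficients below are assumed, purely for illustration:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(x, weights, bias):
    """P(outcome = 1 | x) for given (hypothetical) fitted coefficients."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

print(sigmoid(0))                                              # 0.5
print(round(predict_proba([2.0, 1.0], [0.8, -0.5], -0.2), 3))  # 0.711
```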

Decision Trees:

Decision tree builds classification or regression models in the form of a tree structure. It breaks
down a data set into smaller and smaller subsets while at the same time an associated decision
tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A
decision node has two or more branches, and a leaf node represents a classification or decision.
The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
Decision trees can handle both categorical and numerical data.

Random Forest:

Random forests, or random decision forests, are an ensemble learning method for classification,
regression and other tasks. They operate by constructing a multitude of decision trees at training
time and outputting the class that is the mode of the classes (classification) or the mean
prediction (regression) of the individual trees. Random decision forests correct for decision
trees’ habit of overfitting to their training set.

Neural Network:

A neural network consists of units (neurons), arranged in layers, which convert an input vector
into some output. Each unit takes an input, applies an (often nonlinear) function to it and then
passes the output on to the next layer. Generally, the networks are defined to be feed-forward:
a unit feeds its output to all the units on the next layer, but there is no feedback to the previous
layer. Weightings are applied to the signals passing from one unit to another, and it is these
weightings which are tuned in the training phase to adapt a neural network to the particular
problem at hand.

Unit IV

What is Clustering?
Definition:
Clustering is an unsupervised learning technique that groups data points into clusters
(groups) such that:

• Points within the same cluster are similar (high intra-cluster similarity),
• Points from different clusters are dissimilar (low inter-cluster similarity).

Used in: Market segmentation, social network analysis, document clustering, image
segmentation, anomaly detection, etc.

Choosing Distance Metrics


Distance metrics determine how similarity between points is measured. The choice affects the
shape, size, and behavior of clusters.

1.1 Common Distance Metrics

Metric               Formula (for points x, y)    Suitable For               Remarks
Euclidean Distance   √Σ(xᵢ − yᵢ)²                 Continuous data            Most common, geometric distance
Manhattan Distance   Σ|xᵢ − yᵢ|                   Continuous data            Sum of absolute differences
Minkowski Distance   (Σ|xᵢ − yᵢ|ᵖ)^(1/p)          Continuous data            Generalizes Euclidean (p = 2) and Manhattan (p = 1)
Cosine Similarity    (x·y) / (‖x‖‖y‖)             Text / High-dimensional    Measures angle, not magnitude
Jaccard Index        |A∩B| / |A∪B|                Binary / set data          Similarity between sets

Choosing the right metric depends on:

• Data type (continuous, categorical, binary)


• Scale of features
• Sensitivity to outliers

a) Euclidean Distance

• Most commonly used.


• Measures straight-line distance.
• Works well with compact, spherical clusters.

Formula:

d(p, q) = √Σ(pᵢ − qᵢ)²

Example:
For points (1,2) and (4,6):

√((1 − 4)² + (2 − 6)²) = √(9 + 16) = 5

b) Manhattan Distance

• Sum of absolute differences.


• Good for grid-like structures.

d(p, q) = Σ|pᵢ − qᵢ|
c) Cosine Similarity (or Distance)

• Measures angle between vectors.


• Good for text, high-dimensional data.

Cosine distance = 1 − cosine similarity.

d) Minkowski Distance

Generalization of Euclidean and Manhattan:

d(p, q) = (Σ|pᵢ − qᵢ|ᵖ)^(1/p)

e) Jaccard Distance

• For binary/categorical data.


• Measures dissimilarity between sets.
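The four metrics above, side by side, in a small pure-Python sketch (helper names are mine):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1 - dot / norms

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two sets."""
    return 1 - len(a & b) / len(a | b)

print(euclidean((1, 2), (4, 6)))               # 5.0 (the worked example above)
print(manhattan((1, 2), (4, 6)))               # 7
print(cosine_distance((1, 0), (0, 1)))         # 1.0 (orthogonal vectors)
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```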

How to choose a distance metric?


Data Type                         Recommended Metric
Numeric, continuous               Euclidean, Manhattan
Sparse high-dimensional (text)    Cosine
Binary                            Jaccard
Non-linear structures             Density-based metrics (e.g., DBSCAN uses Euclidean by default)

2. Clustering Approaches
2.1 Hierarchical Agglomerative Clustering (HAC)
Concept:

Builds a hierarchy (tree) of clusters by merging the closest clusters step-by-step until all
points are in one cluster.

Steps:

1. Start with each data point as a single cluster.


2. Compute pairwise distance matrix.
3. Merge the two clusters that are closest (minimum distance).
4. Update the distance matrix.
5. Repeat until one cluster remains.

Linkage Criteria (distance between clusters)

Method Description
Single Linkage Minimum distance between any two points (nearest neighbor)
Complete Linkage Maximum distance between points (farthest neighbor)
Average Linkage Average distance between all pairs
Ward’s Method Minimizes total within-cluster variance

Output:

A Dendrogram (tree diagram) shows how clusters merge at different distances.

Pros & Cons:

Pros: Easy to visualize; no need to predefine K.

Cons: Computationally expensive (O(n²)); sensitive to noise.

Dendrogram

Hierarchical output visualized as a tree; user cuts the tree at desired level to get clusters.

Example of HAC

Points: A(1,1), B(2,1), C(8,8)

• Start: {A}, {B}, {C}


• Merge A and B first (closest)
• Then merge AB with C
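The first merge step of this example can be verified in code: with single linkage, the closest pair of clusters is {A} and {B} (a toy sketch, not a full HAC implementation):

```python
import math

points = {'A': (1, 1), 'B': (2, 1), 'C': (8, 8)}
clusters = {name: [pt] for name, pt in points.items()}

def closest_pair(clusters):
    """Single linkage: minimum point-to-point distance across clusters."""
    items = list(clusters.items())
    best = None
    for i, (n1, pts1) in enumerate(items):
        for n2, pts2 in items[i + 1:]:
            d = min(math.dist(p, q) for p in pts1 for q in pts2)
            if best is None or d < best[0]:
                best = (d, n1, n2)
    return best

d, n1, n2 = closest_pair(clusters)
print(d, n1, n2)  # 1.0 A B -- so A and B merge first
```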

Advantages

• No need to pre-specify number of clusters.


• Dendrogram gives rich structure.
• Works well for small datasets.

Disadvantages

• High computational cost (O(n²)).


• Not good for large datasets.
• Sensitive to noise and outliers.

2.2 K-Means Clustering (Lloyd's Algorithm)


Partition data into K clusters such that each point belongs to the nearest mean (centroid).
Goal: Minimize within-cluster sum of squares (WCSS).

Algorithm (Lloyd's Algorithm)

1. Choose K (number of clusters).


2. Randomly initialize K centroids.
3. Assign each point to the nearest centroid (using distance metric).
4. Recompute centroids as the mean of assigned points.
5. Repeat steps 3–4 until centroids stabilize (no change).
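Steps 2–5 can be sketched as a short pure-Python loop (the toy data and initial centroids are my own):

```python
import math

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: alternate assignment and centroid recomputation."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:  # step 3: assign each point to the nearest centroid
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [  # step 4: recompute each centroid as the cluster mean
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl
            else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(1, 1), (1, 2), (8, 8), (9, 8)]
cents, groups = kmeans(pts, centroids=[(0, 0), (10, 10)])
print(cents)  # [(1.0, 1.5), (8.5, 8.0)]
```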

Example

Cluster customers into k = 2 groups based on age and income.

Initial centroids → C1, C2


Iteratively adjust assignments and centroids.

Advantages

• Simple and fast.


• Works well for large datasets.
• Best for spherical clusters.

Disadvantages

• Must choose k beforehand.


• Sensitive to initialization (can get stuck in local minima).
• Poor for non-spherical and varying-density clusters.
• Sensitive to outliers (means get distorted).

2.3 DBSCAN (Density-Based Spatial Clustering of


Applications with Noise)
Clusters are formed based on density of points.

Key Parameters

• ε (epsilon) → neighborhood radius


• minPts → minimum points to form a dense region

Definitions

• Core point: ≥ minPts in ε radius


• Border point: within ε of a core point
• Noise point: neither core nor border

How DBSCAN Works

1. Choose an unvisited point.


2. If it’s a core point → start a new cluster.
3. Find all density-connected points.
4. Mark remaining points as noise or border.
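The core/border/noise definitions translate directly into code (a brute-force sketch, not a full DBSCAN with cluster expansion):

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.
    Neighbourhood counts include the point itself."""
    labels = {}
    for p in points:
        neighbours = [q for q in points if math.dist(p, q) <= eps]
        labels[p] = 'core' if len(neighbours) >= min_pts else None
    for p in points:
        if labels[p] is None:  # not core: border if near a core point, else noise
            near_core = any(labels[q] == 'core' and math.dist(p, q) <= eps
                            for q in points if q != p)
            labels[p] = 'border' if near_core else 'noise'
    return labels

pts = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(classify_points(pts, eps=1.5, min_pts=3))
# the three clustered points are core; (5, 5) is noise
```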

Example – Finding clusters in arbitrary shapes

Points forming two moon-shaped clusters.


K-means fails (forces circular clusters).
DBSCAN succeeds due to density logic.

Advantages

• No need to specify number of clusters.


• Handles noise/outliers well.
• Finds arbitrary-shaped clusters.

Disadvantages

• Struggles with varying densities.


• Requires tuning of ε and minPts.
• Not ideal for high-dimensional data.

3. Relative Merits of Each Method


Method     Strengths                                       Weaknesses
HAC        No need for k; dendrogram; interpretable        Slow; sensitive to noise; not good for large datasets
K-Means    Fast; scalable; simple                          Requires k; assumes spherical clusters; bad with outliers
DBSCAN     Handles noise; arbitrary shapes; no k needed    Fails on varying density; parameter tuning

4. Clustering Tendency
Before clustering, check if the data has natural clusters.

Tools for measuring tendency

1. Hopkins statistic
o Values close to 1 → strong clustering tendency
o Close to 0.5 → random data (no clusters)
2. Visual methods
o Scatter plots
o Distance matrix heatmap
o Pairplots (for small dimensions)

5. Clustering Quality
Metrics used to evaluate clustering results.

Internal Validation Metrics

(Use only data, no labels)

1. Silhouette Score

s = (b − a) / max(a, b)

• a = avg. intra-cluster distance


• b = avg. nearest-cluster distance
Range: −1 to 1
High = well-separated and compact clusters
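A one-line illustration of the score (the values for a and b are made up):

```python
def silhouette(a, b):
    """s = (b - a) / max(a, b) for a single point."""
    return (b - a) / max(a, b)

print(silhouette(a=1.0, b=4.0))  # 0.75  -> compact, well separated
print(silhouette(a=4.0, b=1.0))  # -0.75 -> the point may be misassigned
```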

2. Davies-Bouldin Index

Lower values = better.

External Validation Metrics

(Labels required)

• Adjusted Rand Index (ARI)


• Purity
• Mutual Information
