Classification in Data Mining – Introduction
1. Introduction to Data Mining
Data Mining is the process of discovering hidden, previously unknown, and useful patterns
from large volumes of data stored in databases, data warehouses, or other information
repositories. It combines techniques from statistics, machine learning, artificial intelligence,
and database systems to transform raw data into meaningful knowledge.
Among the various data mining tasks such as association rule mining, clustering, regression,
outlier analysis, and classification, classification is one of the most important and widely used
techniques.
2. What is Classification?
Classification is a supervised learning technique in data mining that is used to predict
categorical class labels for new data instances based on past observations.
In classification:
The data objects belong to predefined classes
A model (classifier) is built using training data
The model is then used to classify unseen or test data
Definition
Classification is the process of finding a model that describes and distinguishes data classes or
concepts, for the purpose of being able to use the model to predict the class of objects whose
class label is unknown.
3. Role of Classification in Data Warehousing
Data warehouses store integrated, historical, and summarized data from multiple sources.
Classification techniques are applied on warehouse data to:
Support decision making
Enable predictive analysis
Discover trends and future outcomes
Example:
Classifying customers as high-value, medium-value, or low-value
Classifying loan applicants as safe or risky
4. Supervised Learning Nature of Classification
Classification is called supervised learning because:
The class labels are known in advance
The learning algorithm is trained using labeled data
Example:
| Age | Income | Student | Class   |
|-----|--------|---------|---------|
| 22  | Low    | Yes     | Buy     |
| 45  | High   | No      | Not Buy |
Here, Class is the target attribute.
5. Steps Involved in Classification
1. Data Collection
Data is collected from databases, data warehouses, or external sources.
2. Data Preprocessing
Includes:
Data cleaning
Handling missing values
Data transformation
Data normalization
3. Training Phase
A portion of data is used to build the classification model
Known as training dataset
4. Testing Phase
The model is tested using unseen data
Known as test dataset
5. Model Evaluation
Performance is evaluated using:
Accuracy
Precision
Recall
Confusion matrix
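The evaluation metrics above can be computed directly from actual and predicted labels. The following is a minimal Python sketch; the `Buy`/`Not Buy` labels and the sample predictions are hypothetical.

```python
from collections import Counter

def evaluate(actual, predicted, positive="Buy"):
    """Binary confusion matrix plus accuracy, precision, and recall."""
    counts = Counter(zip(actual, predicted))           # (actual, predicted) pairs
    tp = counts[(positive, positive)]
    fp = sum(v for (a, p), v in counts.items() if p == positive and a != positive)
    fn = sum(v for (a, p), v in counts.items() if a == positive and p != positive)
    tn = sum(counts.values()) - tp - fp - fn
    total = sum(counts.values())
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical test-set labels versus model predictions
actual    = ["Buy", "Buy", "Not Buy", "Not Buy", "Buy"]
predicted = ["Buy", "Not Buy", "Not Buy", "Buy", "Buy"]
print(evaluate(actual, predicted))
```

Accuracy here is 3/5 = 0.6; precision and recall are both 2/3, since one `Buy` was missed and one `Not Buy` was wrongly flagged.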
6. Classification Techniques
1. Decision Tree Classification
Uses tree-like structures
Easy to understand and interpret
Example: ID3, C4.5, CART
2. Bayesian Classification
Based on Bayes’ Theorem
Naïve Bayes additionally assumes independence between attributes
Example: Naïve Bayes Classifier
3. Rule-Based Classification
Uses IF-THEN rules
Simple and interpretable
4. k-Nearest Neighbor (k-NN)
Classifies based on the nearest data points
Distance-based approach
5. Artificial Neural Networks
Inspired by the human brain
Suitable for complex patterns
Requires large datasets
6. Support Vector Machines (SVM)
Finds optimal separating hyperplane
Effective for high-dimensional data
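As a concrete sketch of one of these techniques, the distance-based k-NN approach (technique 4) fits in a few lines of Python. The 2-D training points (age, income in thousands) and the choice `k = 3` below are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs;
    Euclidean distance is used.
    """
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (age, income in thousands) -> class
train = [((22, 30), "Buy"), ((25, 35), "Buy"),
         ((45, 90), "Not Buy"), ((50, 85), "Not Buy")]
print(knn_predict(train, (24, 32)))   # 2 of the 3 nearest are "Buy"
```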
7. Classification vs Clustering
| Classification          | Clustering            |
|-------------------------|-----------------------|
| Supervised learning     | Unsupervised learning |
| Predefined class labels | No predefined labels  |
| Predictive              | Descriptive           |
| Requires training data  | No training data      |
8. Applications of Classification
1. Business
Customer segmentation
Credit risk analysis
Market basket analysis
2. Banking and Finance
Loan approval
Fraud detection
Credit scoring
3. Healthcare
Disease diagnosis
Patient risk classification
Medical image analysis
4. Education
Student performance prediction
Dropout analysis
5. E-Commerce
Product recommendation
User behavior prediction
9. Advantages of Classification
Helps in decision making
Automates prediction tasks
Improves business strategies
Handles large datasets efficiently
10. Limitations of Classification
Requires labeled data
Model accuracy depends on data quality
Overfitting may occur
Some algorithms are complex and time-consuming
11. Importance of Classification in Data Mining
Classification plays a crucial role in:
Predictive analytics
Knowledge discovery
Business intelligence
Strategic planning
It helps organizations anticipate future trends, reduce risks, and maximize profits.
Statistical-Based Algorithms for Classification in Data Mining
1. Introduction
Statistical-based classification algorithms use principles of statistics and probability
theory to assign class labels. These algorithms assume that data follows certain probabilistic
distributions and use statistical inference to make decisions.
They are widely used because:
They provide a sound mathematical foundation
They handle uncertainty and noise
They give probabilistic outputs
They work well with large datasets
2. Characteristics of Statistical Classification Algorithms
Statistical-based classifiers generally have the following characteristics:
1. Use probability models
2. Assume data distribution (Gaussian, Bernoulli, Multinomial, etc.)
3. Estimate parameters from training data
4. Predict class with maximum probability
5. Based on Bayes’ theorem or statistical decision theory
3. Types of Statistical-Based Classification Algorithms
The major statistical-based algorithms used for classification in Data Mining are:
1. Bayesian Classification
2. Naïve Bayes Classifier
3. Bayesian Belief Networks
4. Logistic Regression
5. Linear Discriminant Analysis (LDA)
6. Quadratic Discriminant Analysis (QDA)
7. k-Nearest Neighbor (k-NN) (partly statistical)
4. Bayesian Classification
4.1 Concept
Bayesian classification is based on Bayes’ Theorem, which calculates the posterior probability
of a class given a data sample.
Bayes’ Theorem:
\[
P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}
\]
Where:
\(P(C_i \mid X)\) → Posterior probability of class \(C_i\) given sample \(X\)
\(P(X \mid C_i)\) → Likelihood of \(X\) given class \(C_i\)
\(P(C_i)\) → Prior probability of class \(C_i\)
\(P(X)\) → Evidence (marginal probability of \(X\))
Classification Rule:
Assign sample X to the class with maximum posterior probability.
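The rule can be illustrated with a small numeric sketch for the safe/risky loan example; the prior and likelihood values below are hypothetical.

```python
# Hypothetical priors and likelihoods for a two-class loan example:
# classes "safe" and "risky", observed evidence X = "low income".
prior = {"safe": 0.7, "risky": 0.3}
likelihood = {"safe": 0.2, "risky": 0.6}   # P(X | C_i)

# Evidence P(X) by the law of total probability
evidence = sum(prior[c] * likelihood[c] for c in prior)

# Posterior P(C_i | X) from Bayes' theorem; assign the maximum-posterior class
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

Even though "safe" has the larger prior (0.7), the higher likelihood of low income under "risky" pushes the posterior toward "risky" (0.5625 vs 0.4375).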
4.2 Advantages
Strong mathematical foundation
Works well with missing data
Handles uncertainty effectively
4.3 Disadvantages
Requires estimation of probability distributions
Computationally expensive for complex data
5. Naïve Bayes Classifier
5.1 Introduction
Naïve Bayes is one of the most popular statistical classifiers in data mining. It is called naïve because
it assumes conditional independence between attributes.
Independence Assumption:
\[
P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
\]
5.2 Algorithm Steps
1. Calculate the prior probability \(P(C)\) for each class
2. Calculate the likelihood \(P(x_i \mid C)\) for each attribute value
3. Compute posterior probability using Bayes’ theorem
4. Assign class with highest probability
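These four steps can be sketched in Python for categorical attributes. The training rows below are hypothetical, and no smoothing is applied, so a zero count zeroes out the whole product.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(C) and P(x_i | C) from categorical training data."""
    n = len(labels)
    prior = {c: k / n for c, k in Counter(labels).items()}          # step 1
    cond = defaultdict(Counter)     # (attribute index, class) -> value counts
    for row, c in zip(rows, labels):
        for i, value in enumerate(row):
            cond[(i, c)][value] += 1                                # step 2

    def predict(row):
        scores = {}
        for c in prior:
            p = prior[c]
            for i, value in enumerate(row):
                counts = cond[(i, c)]
                p *= counts[value] / sum(counts.values())   # P(x_i | C)
            scores[c] = p                                           # step 3
        return max(scores, key=scores.get)                          # step 4

    return predict

# Hypothetical training data: (income, student) -> class
rows   = [("Low", "Yes"), ("Low", "Yes"), ("High", "No"), ("High", "Yes")]
labels = ["Buy", "Buy", "Not Buy", "Not Buy"]
predict = train_naive_bayes(rows, labels)
print(predict(("Low", "Yes")))
```

A production classifier would add Laplace smoothing and work in log space to avoid underflow; this sketch keeps the raw product to mirror the formula above.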
5.3 Types of Naïve Bayes
1. Gaussian Naïve Bayes – for continuous data
2. Multinomial Naïve Bayes – for text classification
3. Bernoulli Naïve Bayes – for binary features
5.4 Advantages
Simple and fast
Requires only a small amount of training data
Performs well in text mining (spam detection)
5.5 Disadvantages
Independence assumption is unrealistic
Less accurate when attributes are correlated
6. Bayesian Belief Networks (BBN)
6.1 Definition
A Bayesian Belief Network is a directed acyclic graph (DAG) that represents probabilistic
relationships among variables.
Nodes → Random variables
Edges → Conditional dependencies
6.2 Features
Represents joint probability distribution
Removes naïve independence assumption
Uses conditional probability tables (CPT)
6.3 Advantages
Handles complex dependencies
Useful in medical diagnosis and expert systems
6.4 Disadvantages
Structure learning is complex
High computational cost
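The DAG factorization a BBN encodes can be sketched as a product of CPT lookups. The three-node chain Cloudy → Rain → WetGrass and all probability values below are hypothetical.

```python
# Hypothetical CPTs for a three-node chain: Cloudy -> Rain -> WetGrass
p_cloudy = {True: 0.5, False: 0.5}
p_rain_given_cloudy = {True: {True: 0.8, False: 0.2},
                       False: {True: 0.1, False: 0.9}}
p_wet_given_rain = {True: {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}

def joint(cloudy, rain, wet):
    """Joint probability via the DAG factorization:
    P(C, R, W) = P(C) * P(R | C) * P(W | R)."""
    return (p_cloudy[cloudy]
            * p_rain_given_cloudy[cloudy][rain]
            * p_wet_given_rain[rain][wet])

# The eight joint probabilities must sum to 1
total = sum(joint(c, r, w) for c in (True, False)
            for r in (True, False) for w in (True, False))
print(total)
```

This is what "represents the joint probability distribution" means in practice: three small tables stand in for the full eight-entry joint table.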
7. Logistic Regression
7.1 Introduction
Logistic regression is a statistical classification technique used for binary classification.
It models the probability of a class using the logistic (sigmoid) function.
Sigmoid Function:
\[
P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}}
\]
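Evaluating this model is a single weighted sum pushed through the sigmoid. The coefficients below (beta0 = -1.0, beta1 = 0.5, beta2 = 0.25) are hypothetical; in practice they would be fitted from training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, betas):
    """P(Y = 1 | X) for a logistic model with the given coefficients."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

# Hypothetical coefficients for two predictors
p = predict_proba([1.0, 2.0], beta0=-1.0, betas=[0.5, 0.25])
label = 1 if p >= 0.5 else 0
print(p, label)
```

Classifying at the 0.5 threshold makes the decision boundary linear in the predictors, as noted below.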
7.2 Characteristics
Outputs probability between 0 and 1
Decision boundary is linear
No distribution assumption for predictors
7.3 Advantages
Easy to interpret
Efficient for large datasets
7.4 Disadvantages
Cannot model complex non-linear relationships
8. Linear Discriminant Analysis (LDA)
8.1 Introduction
LDA is a statistical method used to find a linear combination of features that best separates
classes.
Assumptions:
Data follows Gaussian distribution
Equal covariance matrices for all classes
8.2 Advantages
Works well when assumptions hold
Reduces dimensionality
8.3 Disadvantages
Sensitive to outliers
Poor performance when assumptions fail
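In the simplest case, one feature, two classes, equal priors, and equal variance, the LDA decision boundary reduces to the midpoint of the class means. The samples below are hypothetical; real LDA uses a pooled covariance matrix over several features.

```python
from statistics import mean

def lda_1d(class_a, class_b):
    """Two-class, one-feature LDA under equal priors and equal variance:
    the decision boundary is the midpoint of the class means."""
    m_a, m_b = mean(class_a), mean(class_b)
    threshold = (m_a + m_b) / 2

    def predict(x):
        # Nearest class mean, which is equivalent to the threshold rule
        return "A" if abs(x - m_a) < abs(x - m_b) else "B"

    return threshold, predict

# Hypothetical one-feature samples for two classes
threshold, predict = lda_1d([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
print(threshold, predict(4.0))
```

This also hints at the listed outlier sensitivity: a single extreme sample shifts a class mean, and with it the boundary.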
9. Quadratic Discriminant Analysis (QDA)
Extension of LDA
Allows different covariance matrices
Produces quadratic decision boundaries
10. k-Nearest Neighbor (k-NN) as Statistical Classifier
Uses distance-based probability
Class is assigned based on majority of nearest neighbors
Non-parametric and instance-based
11. Comparison of Statistical Classification Algorithms
| Algorithm           | Distribution Assumption | Speed     | Accuracy |
|---------------------|-------------------------|-----------|----------|
| Naïve Bayes         | Strong                  | Very fast | Moderate |
| Bayesian Network    | Moderate                | Slow      | High     |
| Logistic Regression | None                    | Fast      | High     |
| LDA                 | Gaussian                | Fast      | High     |
| QDA                 | Gaussian                | Medium    | High     |
12. Applications
Spam filtering
Medical diagnosis
Credit scoring
Fraud detection
Text classification
Customer segmentation