
Data Mining Classification Techniques

Uploaded by

chandananyc1
Copyright
© All Rights Reserved

Classification in Data Mining – Introduction

1. Introduction to Data Mining


Data Mining is the process of discovering hidden, previously unknown, and useful patterns
from large volumes of data stored in databases, data warehouses, or other information
repositories. It combines techniques from statistics, machine learning, artificial intelligence,
and database systems to transform raw data into meaningful knowledge.

Data mining tasks include association rule mining, clustering, regression, outlier analysis,
and classification. Of these, classification is one of the most important and widely used
techniques.

2. What is Classification?
Classification is a supervised learning technique in data mining that is used to predict
categorical class labels for new data instances based on past observations.

In classification:

  The data objects belong to predefined classes
  A model (classifier) is built using training data
  The model is then used to classify unseen or test data

Definition

Classification is the process of finding a model that describes and distinguishes data classes or
concepts, for the purpose of being able to use the model to predict the class of objects whose
class label is unknown.

3. Role of Classification in Data Warehousing


Data warehouses store integrated, historical, and summarized data from multiple sources.
Classification techniques are applied on warehouse data to:

  Support decision making
  Enable predictive analysis
  Discover trends and future outcomes

Examples:
 Classifying customers as high-value, medium-value, or low-value
 Classifying loan applicants as safe or risky

4. Supervised Learning Nature of Classification


Classification is called supervised learning because:

  The class labels of the training data are known in advance
  The learning algorithm is trained using this labeled data

Example:

Age   Income   Student   Class
22    Low      Yes       Buy
45    High     No        Not Buy

Here, Class is the target attribute.

5. Steps Involved in Classification


1. Data Collection

Data is collected from databases, data warehouses, or external sources.

2. Data Preprocessing

Includes:

 Data cleaning
 Handling missing values
 Data transformation
 Data normalization
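
As a minimal sketch of two of the preprocessing steps above, the following Python fills missing values with the mean and applies min-max normalization. The function names and the `ages` list are hypothetical, chosen only for illustration; mean imputation and min-max scaling are one common choice among several.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute column with one missing value.
ages = [22, None, 45, 31]
cleaned = fill_missing_with_mean(ages)
scaled = min_max_normalize(cleaned)
```

After these two calls every value in `scaled` lies between 0 and 1, which keeps attributes with large ranges from dominating distance-based classifiers.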

3. Training Phase

  A portion of the labeled data, known as the training dataset, is used to build the classification model

4. Testing Phase

  The model is tested on unseen data, known as the test dataset
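
The training and testing phases can be sketched as a simple holdout split. This is an illustrative helper, not a prescribed method: the function name `train_test_split`, the 25% test fraction, and the seed are all assumptions.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold out a fraction of them as the test set."""
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = records[:]          # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 20 record ids.
records = list(range(20))
train, test = train_test_split(records, test_fraction=0.25, seed=42)
```

The two sets never share a record, so the test set stays unseen during training.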

5. Model Evaluation

Performance is evaluated using:

 Accuracy
 Precision
 Recall
 Confusion matrix
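
The measures above all derive from the four cells of a binary confusion matrix. The sketch below assumes a two-class problem with "Buy" treated as the positive class; the label lists are made up for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical true labels and model predictions.
actual = ["Buy", "Buy", "Not Buy", "Not Buy", "Buy", "Not Buy"]
predicted = ["Buy", "Not Buy", "Not Buy", "Buy", "Buy", "Not Buy"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="Buy")
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many are right
recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many are found
```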

6. Classification Techniques

1. Decision Tree Classification

  Uses a tree structure in which internal nodes test attributes and leaves assign class labels
  Easy to understand and interpret
  Examples: ID3, C4.5, CART
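
ID3 chooses the attribute to split on by information gain, the reduction in class entropy after the split. The following is a rough sketch of that calculation only, not a full tree builder; the four rows and their labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical training rows.
rows = [
    {"Income": "Low", "Student": "Yes"},
    {"Income": "High", "Student": "No"},
    {"Income": "Low", "Student": "No"},
    {"Income": "High", "Student": "Yes"},
]
labels = ["Buy", "Not Buy", "Not Buy", "Buy"]

gain_student = information_gain(rows, labels, "Student")
gain_income = information_gain(rows, labels, "Income")
```

In this toy data Student separates the classes perfectly while Income does not separate them at all, so a gain-based tree would split on Student first.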

2. Bayesian Classification

  Based on Bayes’ Theorem
  The naïve variant assumes conditional independence between attributes given the class
  Example: Naïve Bayes Classifier

3. Rule-Based Classification

  Uses IF-THEN rules
  Simple and interpretable
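
A rule-based classifier can be sketched as an ordered list of IF-THEN rules with a default class. The rules and the default below are invented for illustration and are not derived from any real dataset.

```python
# Each rule pairs a condition (the IF part) with a class label (the THEN part).
RULES = [
    (lambda r: r["Student"] == "Yes" and r["Income"] == "Low", "Buy"),
    (lambda r: r["Student"] == "No" and r["Income"] == "High", "Not Buy"),
]
DEFAULT_CLASS = "Not Buy"  # assumed fallback when no rule fires


def classify(record, rules=RULES, default=DEFAULT_CLASS):
    """Return the label of the first rule whose condition matches the record."""
    for condition, label in rules:
        if condition(record):
            return label
    return default


first = classify({"Income": "Low", "Student": "Yes"})
second = classify({"Income": "Medium", "Student": "No"})
```

Rules are tried in order, so placing the more specific rules first is a design choice that affects the result when several rules could match.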

4. k-Nearest Neighbor (k-NN)

  Classifies a new instance by the majority class among its k nearest training points
  Distance-based approach

5. Artificial Neural Networks

  Inspired by the human brain
  Suitable for complex, non-linear patterns
  Typically requires large training datasets

6. Support Vector Machines (SVM)


  Finds the separating hyperplane with the maximum margin between classes
  Effective for high-dimensional data

7. Classification vs Clustering

Classification                   Clustering
Supervised learning              Unsupervised learning
Predefined class labels          No predefined labels
Predictive                       Descriptive
Requires labeled training data   No labeled training data

8. Applications of Classification
1. Business

  Customer segmentation (into predefined segments)
  Credit risk analysis
  Customer churn prediction

2. Banking and Finance

 Loan approval
 Fraud detection
 Credit scoring

3. Healthcare

 Disease diagnosis
 Patient risk classification
 Medical image analysis

4. Education

  Student performance prediction
  Dropout analysis

5. E-Commerce

 Product recommendation
  User behavior prediction

9. Advantages of Classification

  Helps in decision making
  Automates prediction tasks
  Improves business strategies
  Many classification algorithms scale to large datasets

10. Limitations of Classification


 Requires labeled data
 Model accuracy depends on data quality
  Overfitting may occur: a model can fit the training data well yet generalize poorly
  Some algorithms are computationally expensive to train

11. Importance of Classification in Data Mining


Classification plays a crucial role in:

 Predictive analytics
 Knowledge discovery
 Business intelligence
 Strategic planning

It helps organizations anticipate future trends, reduce risks, and maximize profits.

Statistical-Based Algorithms for Classification in Data Mining

1. Introduction

Statistical-based classification algorithms use principles of statistics and probability
theory to assign class labels. They assume that the data follow a probability distribution
and use statistical inference to decide which class is most likely.

They are widely used because:

  They rest on a sound mathematical foundation
  They handle uncertainty and noise
  They give probabilistic outputs
  They work well with large datasets

2. Characteristics of Statistical Classification Algorithms


Statistical-based classifiers generally have the following characteristics:

1. Use probability models
2. Assume a data distribution (Gaussian, Bernoulli, Multinomial, etc.)
3. Estimate parameters from training data
4. Predict the class with maximum probability
5. Are based on Bayes’ theorem or statistical decision theory

3. Types of Statistical-Based Classification Algorithms


The major statistical-based algorithms used for classification in Data Mining are:

1. Bayesian Classification
2. Naïve Bayes Classifier
3. Bayesian Belief Networks
4. Logistic Regression
5. Linear Discriminant Analysis (LDA)
6. Quadratic Discriminant Analysis (QDA)
7. k-Nearest Neighbor (k-NN) (partly statistical)

4. Bayesian Classification

4.1 Concept

Bayesian classification is based on Bayes’ Theorem, which calculates the posterior probability
of a class given a data sample.

Bayes’ Theorem:

\[
P(C_i | X) = \frac{P(X | C_i) \cdot P(C_i)}{P(X)}
\]

Where:

  \(P(C_i | X)\) → Posterior probability of class \(C_i\)
  \(P(X | C_i)\) → Likelihood
  \(P(C_i)\) → Prior probability
  \(P(X)\) → Evidence

Classification Rule:

Assign sample X to the class with maximum posterior probability.
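
To make the rule concrete, here is a small worked example in Python. The priors and likelihoods are made-up numbers, and the function name `posteriors` is illustrative; the evidence P(X) is obtained by summing P(X | C_i) · P(C_i) over the classes.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' theorem: posterior = likelihood * prior / evidence."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())          # P(X), summed over all classes
    return {c: joint[c] / evidence for c in joint}


# Hypothetical values for a single sample X.
priors = {"Buy": 0.6, "Not Buy": 0.4}        # P(C_i)
likelihoods = {"Buy": 0.2, "Not Buy": 0.1}   # P(X | C_i)

post = posteriors(priors, likelihoods)
predicted = max(post, key=post.get)          # class with maximum posterior
```

With these numbers the joint terms are 0.12 and 0.04, so the posterior for "Buy" is 0.12 / 0.16 = 0.75 and X is assigned to "Buy".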

4.2 Advantages

  Strong mathematical foundation
  Can accommodate missing attribute values
  Handles uncertainty effectively

4.3 Disadvantages

  Requires estimation of probability distributions
  Computationally expensive for complex data

5. Naïve Bayes Classifier


5.1 Introduction
Naïve Bayes is one of the most popular statistical classifiers in data mining. It is called
naïve because it assumes that the attributes are conditionally independent given the class.

Independence Assumption:

\[
P(X | C) = \prod_{i=1}^{n} P(x_i | C)
\]

5.2 Algorithm Steps

1. Calculate the prior probability \(P(C)\) for each class
2. Calculate the likelihood \(P(x_i | C)\) for each attribute
3. Compute the posterior probability using Bayes’ theorem
4. Assign the class with the highest probability
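
The four steps can be sketched for categorical attributes as follows. This is a minimal illustration using plain frequency counts with no smoothing, so a value never seen with a class gets probability zero. The six training rows and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Steps 1 and 2: count classes and attribute values per class."""
    class_counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    # value_counts[class][attribute][value] -> number of occurrences
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            value_counts[label][attribute][value] += 1
    return priors, value_counts, class_counts


def predict_naive_bayes(model, row):
    """Steps 3 and 4: compute posteriors and pick the most probable class."""
    priors, value_counts, class_counts = model
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attribute, value in row.items():
            # P(x_i | C) estimated as a relative frequency within class c
            score *= value_counts[c][attribute][value] / class_counts[c]
        scores[c] = score
    evidence = sum(scores.values())
    post = {c: (s / evidence if evidence else 0.0) for c, s in scores.items()}
    return max(post, key=post.get), post


# Hypothetical training data.
rows = [
    {"Income": "Low", "Student": "Yes"},
    {"Income": "High", "Student": "No"},
    {"Income": "Low", "Student": "No"},
    {"Income": "High", "Student": "Yes"},
    {"Income": "Low", "Student": "Yes"},
    {"Income": "High", "Student": "No"},
]
labels = ["Buy", "Not Buy", "Not Buy", "Buy", "Buy", "Buy"]

model = train_naive_bayes(rows, labels)
label, post = predict_naive_bayes(model, {"Income": "High", "Student": "No"})
```

For the query (High, No), the score for "Buy" is 4/6 · 2/4 · 1/4 and the score for "Not Buy" is 2/6 · 1/2 · 2/2, which gives "Not Buy" a posterior of 2/3.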

5.3 Types of Naïve Bayes

1. Gaussian Naïve Bayes – for continuous data
2. Multinomial Naïve Bayes – for count data, such as word frequencies in text classification
3. Bernoulli Naïve Bayes – for binary features
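
For the Gaussian variant, the per-attribute likelihood is usually taken to be a normal density whose parameters are estimated separately for each class. The symbols \(\mu_{i,C}\) and \(\sigma_{i,C}^2\), the mean and variance of attribute \(x_i\) within class \(C\), are notation introduced here and do not appear elsewhere in these notes.

```latex
\[
P(x_i | C) = \frac{1}{\sqrt{2\pi\sigma_{i,C}^{2}}}
             \exp\left( -\frac{(x_i - \mu_{i,C})^{2}}{2\sigma_{i,C}^{2}} \right)
\]
```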

5.4 Advantages

  Simple and fast
  Can work with relatively small training datasets
  Performs well in text mining (e.g., spam detection)

5.5 Disadvantages

  The independence assumption rarely holds in real data
  Less accurate when attributes are correlated

6. Bayesian Belief Networks (BBN)


6.1 Definition
A Bayesian Belief Network is a directed acyclic graph (DAG) that represents probabilistic
relationships among variables.

  Nodes → Random variables
  Edges → Conditional dependencies

6.2 Features

  Represents the joint probability distribution
  Relaxes the naïve independence assumption by modeling dependencies explicitly
  Uses conditional probability tables (CPTs)
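
As an illustration of how CPTs define a joint distribution, the sketch below encodes a hypothetical three-node network in which Rain and Sprinkler are both parents of WetGrass. The structure and every probability are invented for the example; the joint probability is the product of each node's probability given its parents.

```python
from itertools import product

# Hypothetical CPTs. Rain and Sprinkler have no parents.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.1, False: 0.9}
# P(WetGrass = True | Rain, Sprinkler)
P_WET_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """Joint probability as the product of each node's conditional probability."""
    p_wet = P_WET_GIVEN[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * (p_wet if wet else 1.0 - p_wet)


def prob_rain_given_wet():
    """P(Rain = True | WetGrass = True), by summing out Sprinkler."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / evidence


p = prob_rain_given_wet()
```

Enumerating every assignment, as done here, is only practical for very small networks.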

6.3 Advantages

  Handles complex dependencies
  Useful in medical diagnosis and expert systems

6.4 Disadvantages

  Structure learning is complex
  High computational cost

7. Logistic Regression

7.1 Introduction

Logistic regression is a statistical classification technique used primarily for binary classification.

It models the probability of a class using the logistic (sigmoid) function.

Sigmoid Function:

\[
P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \dots + \beta_nX_n)}}
\]

7.2 Characteristics

  Outputs a probability between 0 and 1
  The decision boundary is linear in the features
  Makes no distributional assumption about the predictors
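
A prediction with an already fitted model can be sketched as below. The coefficient values are made up, fitting the coefficients is not shown, and the 0.5 threshold is a common default rather than a requirement.

```python
from math import exp


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def predict_proba(x, beta0, betas):
    """P(Y=1 | X) for feature vector x, intercept beta0 and coefficients betas."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)


def predict(x, beta0, betas, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(x, beta0, betas) >= threshold else 0


# Hypothetical fitted coefficients.
beta0, betas = -1.0, [0.5, 0.25]
p = predict_proba([2.0, 1.0], beta0, betas)   # z = -1.0 + 1.0 + 0.25 = 0.25
y = predict([2.0, 1.0], beta0, betas)
```

Because the class changes exactly where the linear score z crosses zero, the decision boundary is a hyperplane in feature space.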

7.3 Advantages

 Easy to interpret
 Efficient for large datasets

7.4 Disadvantages

  Cannot model complex non-linear relationships unless non-linear terms are added as features

8. Linear Discriminant Analysis (LDA)


8.1 Introduction

LDA is a statistical method used to find a linear combination of features that best separates
classes.

Assumptions:

  Each class follows a Gaussian distribution
  All classes share the same covariance matrix
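
Under these two assumptions, the classifier reduces to comparing one linear score per class and choosing the largest. The notation is introduced here for illustration: \(\mu_k\) is the mean of class \(k\), \(\Sigma\) the shared covariance matrix, and \(\pi_k\) the prior probability of class \(k\).

```latex
\[
\delta_k(x) = x^{T} \Sigma^{-1} \mu_k
            - \frac{1}{2} \mu_k^{T} \Sigma^{-1} \mu_k
            + \log \pi_k
\]
```

Each score is linear in \(x\), which is why the boundaries between classes are linear.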

8.2 Advantages

  Works well when assumptions hold
  Can also be used for dimensionality reduction

8.3 Disadvantages

 Sensitive to outliers
 Poor performance when assumptions fail

9. Quadratic Discriminant Analysis (QDA)


  Extension of LDA
  Allows each class its own covariance matrix
  Produces quadratic decision boundaries
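
With a separate covariance matrix \(\Sigma_k\) per class, the per-class score keeps a term that is quadratic in \(x\). As an illustration, using notation introduced here (\(\mu_k\) for the class mean and \(\pi_k\) for the class prior):

```latex
\[
\delta_k(x) = -\frac{1}{2} \log |\Sigma_k|
            - \frac{1}{2} (x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k)
            + \log \pi_k
\]
```

The sample is assigned to the class with the largest score. If every \(\Sigma_k\) is equal, the quadratic terms cancel between classes and the boundaries become linear again.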

10. k-Nearest Neighbor (k-NN) as Statistical Classifier


  Estimates class membership from the distances to the nearest training points
  Class is assigned by majority vote among the k nearest neighbors
  Non-parametric and instance-based
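
A minimal k-NN sketch follows, assuming Euclidean distance and a simple majority vote. The training points, labels, and the choice k = 3 are all hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),   # Euclidean distance to the query
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training data forming two groups.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_classify(points, labels, (2, 2), k=3)
near_b = knn_classify(points, labels, (6, 5), k=3)
```

No model is built in advance: the training data itself is stored and all the work happens at query time, which is what "instance-based" refers to.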

11. Comparison of Statistical Classification Algorithms


Algorithm             Distribution Assumption   Speed       Accuracy
Naïve Bayes           Strong                    Very Fast   Moderate
Bayesian Network      Moderate                  Slow        High
Logistic Regression   None                      Fast        High
LDA                   Gaussian                  Fast        High
QDA                   Gaussian                  Medium      High

12. Applications
 Spam filtering
 Medical diagnosis
 Credit scoring
 Fraud detection
 Text classification
 Customer segmentation
