Introduction to Data Mining
Intro to Data Mining 1
What to Cover
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining
Necessity Is the Mother of Invention
Data explosion problem
Automated data collection tools and mature database
technology have led to tremendous amounts of data
accumulated in, and still to be analyzed in, databases, data
warehouses, and other information repositories
We are drowning in data, but starving for knowledge!
Solution
Data warehousing and data mining
Data warehousing and on-line analytical processing
Mining interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network
DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational,
OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific,
engineering, etc.)
Evolution of Database Technology (continued)
1990s:
Data mining, data warehousing, multimedia
databases, and Web databases
2000s:
Stream data management and mining
Data mining with a variety of applications
Web technology and global information systems
What Is Data Mining?
Data Mining is the process of discovering new
correlations, patterns, and trends by digging into
(mining) large amounts of data stored in
warehouses, using artificial intelligence, statistical
and mathematical techniques.
Data mining can also be defined as “The nontrivial
extraction of implicit, previously unknown, and
potentially useful information from data”.
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, etc.
Why Data Mining?
Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management
(CRM), market basket analysis, cross selling, market
segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
Fraud detection and detection of unusual patterns
(outliers)
Why Data Mining?
Other Applications
Text mining (newsgroups, email, documents) and Web
mining
Stream data mining
DNA and bio-data analysis
Market Analysis and Management
Where does the data come from?
Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of “model” customers who share the same
characteristics: interests, income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis
Associations/correlations between product sales, and
prediction based on such associations
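The association idea above can be sketched by counting how often pairs of products sell together. This is a minimal illustration, not a full association-rule miner: the function name `pair_rules`, the thresholds, and the sample baskets are all hypothetical. Support is the fraction of baskets containing both items; confidence is the fraction of baskets with the first item that also contain the second.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    # How many baskets contain each item, and each unordered pair of items.
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)

# Hypothetical market baskets (assumption, for illustration only).
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "beer"],
]
rules = pair_rules(transactions)
for x, y, support, confidence in rules:
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

With these made-up baskets only the bread/milk pair clears both thresholds. A real analysis would also consider larger itemsets and a measure such as lift.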
Market Analysis and Management
Customer profiling
What types of customers buy what products (clustering or
classification)
Customer requirement analysis
identify the best products for different customers
predict what factors will attract new customers
Provision of summary information
multidimensional summary reports
statistical summary information (data central tendency and
variation)
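The statistical summary mentioned above (central tendency and variation) can be sketched with Python's standard `statistics` module. The segment names and spending figures are invented for illustration, and `summarize` is a hypothetical helper, not part of any particular tool.

```python
import statistics

def summarize(values):
    """Return central tendency and variation measures for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
        "range": max(values) - min(values),
    }

# Hypothetical monthly spending per customer, grouped by a made-up segment.
spending_by_segment = {
    "students": [20, 35, 35, 50, 60],
    "families": [80, 90, 100, 110, 120],
}
report = {segment: summarize(v) for segment, v in spending_by_segment.items()}
for segment, stats in report.items():
    print(segment, stats)
```

Grouping by one attribute (segment) before summarizing is the simplest form of a multidimensional report. Adding more grouping attributes gives further dimensions.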
Corporate Analysis & Risk Management
Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
Resource planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and apply a class-based pricing
procedure
set pricing strategy in a highly competitive market
Fraud Detection & Mining Unusual Patterns
Approaches: Clustering & model construction for frauds,
outlier analysis
Applications: Health care, retail, credit card service,
telecommunications
Auto insurance: ring of collisions
Money laundering: suspicious monetary transactions
Medical insurance
Professional patients, ring of doctors, and ring of
references
Unnecessary or correlated screening tests
Fraud Detection & Mining Unusual Patterns
Telecommunications: phone-call fraud
Phone call model: destination of the call, duration,
time of day or week. Analyze patterns that deviate
from an expected norm
Retail industry
Analysts estimate that 38% of retail shrink is due to
dishonest employees
Anti-terrorism
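The phone-call model above looks for patterns that deviate from an expected norm. One simple way to sketch that idea is a z-score test on call duration: flag calls that lie more than a chosen number of standard deviations from the mean. The threshold of 2.0, the function name `flag_outliers`, and the sample durations are assumptions for illustration. Real fraud models would combine destination, duration, and time of day.

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return indices of values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing deviates from the norm
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Hypothetical call durations in minutes for one subscriber.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 60]
suspicious = flag_outliers(call_minutes)
print(suspicious)  # index of the 60-minute call
```

A single extreme value inflates both the mean and the standard deviation, so robust alternatives (for example, median-based measures) may be preferable on small samples.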
Data Mining: A KDD Process
Data mining: core of the knowledge discovery process
[Figure: KDD process flow: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Data Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Steps of a KDD Process
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of
effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction,
invariant representation.
Choosing functions of data mining
summarization, classification, regression, association,
clustering.
Steps of a KDD Process
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant
patterns, etc.
Use of discovered knowledge
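The KDD steps listed above can be sketched as a chain of small functions, one per step. Everything here is a toy under stated assumptions: the record fields (`age`, `product`), the age cutoff of 30, and the `min_count` threshold are invented, and the "mining" step is only a frequency count standing in for a real algorithm.

```python
from collections import Counter

def select(records, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: discretize numeric age into two coarse bands."""
    return [{**r, "age": "young" if r["age"] < 30 else "older"} for r in records]

def mine(records):
    """Data mining: count each (age band, product) combination."""
    return Counter((r["age"], r["product"]) for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

# Hypothetical raw purchase records; record 3 has a missing age.
raw = [
    {"id": 1, "age": 22, "product": "phone"},
    {"id": 2, "age": 25, "product": "phone"},
    {"id": 3, "age": None, "product": "laptop"},
    {"id": 4, "age": 41, "product": "laptop"},
    {"id": 5, "age": 27, "product": "phone"},
    {"id": 6, "age": 45, "product": "phone"},
]
target = clean(select(raw, ["age", "product"]))
patterns = evaluate(mine(transform(target)), min_count=2)
print(patterns)
```

Keeping each step as a separate function mirrors the process view: any one stage (for example, cleaning) can be replaced without touching the others.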
Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top. Layers from top to bottom:]
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Architecture: Typical Data Mining System
[Figure: layered architecture. Components from top to bottom:]
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge base
Database or data warehouse server
Data cleaning & data integration, filtering
Databases, Data Warehouse
Data Mining: On What Kinds of Data?
Relational database
Data warehouse
Transactional database
Advanced database and information repository
Object-relational database
Spatial and temporal data
Time-series data
Stream data
Multimedia database
Heterogeneous and legacy database
Text databases & WWW
Data Mining Functionalities
Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics
Association (correlation and causality)
Classification and Prediction
Construct models (functions) that describe and distinguish
classes or concepts for future prediction
Presentation: decision-tree, classification rule, neural
network
Predict some unknown or missing numerical values
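Classification, as described above, builds a model from labeled examples and then uses it to predict the class of new ones. A minimal sketch is a decision stump (a one-level decision tree) on a single numeric attribute. The attribute (income), the labels, and the function names `fit_stump` and `predict` are assumptions chosen for illustration.

```python
def fit_stump(values, labels):
    """Learn a rule 'value <= t -> left label, else right label'
    with the fewest errors on the training data."""
    classes = sorted(set(labels))
    points = sorted(set(values))
    if len(points) < 2:
        raise ValueError("need at least two distinct attribute values")
    best = None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        for left in classes:
            for right in classes:
                errors = sum(
                    (left if v <= threshold else right) != y
                    for v, y in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    return best[1:]

def predict(stump, value):
    """Apply the learned rule to a new, unlabeled value."""
    threshold, left, right = stump
    return left if value <= threshold else right

# Hypothetical training data: income (in thousands) and whether the customer bought.
incomes = [18, 22, 25, 40, 52, 60]
bought = ["no", "no", "no", "yes", "yes", "yes"]
stump = fit_stump(incomes, bought)
print(stump, predict(stump, 30), predict(stump, 45))
```

The learned stump reads directly as a classification rule of the form "if income <= threshold then no, else yes", which is one of the presentation forms named above.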
Data Mining Functionalities
Cluster analysis
Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: a data object that does not comply with the general
behavior of the data
Noise or exception? No! Outliers are useful in fraud detection and
rare-event analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
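Cluster analysis, as described above, groups unlabeled data so that members of a group are similar to each other and dissimilar to other groups. One common technique is k-means. The sketch below runs it on one-dimensional data with evenly spaced starting centers. The house prices, the choice of k = 2, and the function name `kmeans_1d` are assumptions for illustration, and the code expects k >= 2.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's k-means on 1-D data with evenly spaced initial centers (k >= 2)."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        new_centers = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, groups

# Hypothetical house prices in thousands.
prices = [100, 110, 120, 300, 310, 320]
centers, groups = kmeans_1d(prices, k=2)
print(centers, groups)
```

No class labels are supplied. The two price bands emerge from the data alone, which is the difference from classification.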
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithms, and Other Disciplines]
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing
knowledge: knowledge fusion
Major Issues in Data Mining
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of
abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
Summary
Data mining: discovering interesting patterns from large amounts
of data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P.
Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in
Databases and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD
Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
Where to Find References?
Data mining and KDD (SIGKDD: CD-ROM)
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery, KDD Explorations
Database systems (SIGMOD: CD-ROM)
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT,
DASFAA
Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
Journals: Machine Learning, Artificial Intelligence, etc.
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. Visualization and Computer Graphics, etc.
Recommended Reference Books
R. Agrawal, J. Han, and H. Mannila, Readings in Data Mining: A Database Perspective, Morgan
Kaufmann (in preparation)
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge
Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge
Discovery, Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Springer-Verlag, 2001
T. M. Mitchell, Machine Learning, McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2001