Data Warehousing Concepts and ETL Process
Data Warehousing
Objectives:
- To learn basic concepts of data warehouse and its architecture, model
- To study data transformation process of data warehouse
- To study data cube structure and its representation, schema model
- To study the different retrieval operations from a data cube
1.0 Introduction
Data warehousing provides architectures and tools that help decision makers act on data collected from different sources. A data warehouse is a central store into which various types of data are loaded and then accessed by knowledge workers. Data in organizations is now growing rapidly: organizations collect diverse kinds of data from many information sources and maintain it in large data stores. Gathering that data and accessing it efficiently at a single site raises challenges, and considerable industrial effort and ongoing research are devoted to storing and integrating historical information. A warehouse maintains data in a multidimensional structure, so writing queries against it is complex. Efficient retrieval matters most of all: analysis of historical data is what enables knowledge workers to make sound decisions for the growth of the organization.
1.1 Definition
According to William H. Inmon, a leading architect in the construction of data warehouse
systems, “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process” (Inmon, 1996).
Subject-oriented:
A data warehouse is organized around major subjects such as customer, supplier, product, library, and sales. It presents a view around a particular subject rather than around day-to-day operations, and it focuses on modeling and analyzing the collected data for decision makers.
Integrated:
A data warehouse combines multiple heterogeneous data sources, including files, online data records, registers, and manuals. Before this data is uploaded into the warehouse, data preprocessing is performed: different data cleaning and integration methods are applied so that the loaded data is consistent.
Time-variant:
Data is stored from a historical perspective, typically covering the past five to ten years in analyzed (summarized) form. Unlike an operational system, the warehouse does not hold the records of day-to-day operations.
1
Nonvolatile:
A data warehouse is a physically separate data store. Data is loaded into it after preprocessing and is then retained permanently, available to knowledge workers whenever needed. Because this store is separate from the local operational store, it does not require transaction, recovery, or concurrency-control mechanisms; it needs only two operations, loading the data and accessing the records.
Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems provide management tools that are useful to knowledge workers. Data cleaning and data transformation are important steps that improve data quality and therefore produce better output from data mining techniques.
Figure 1.1: ETL Process (JNTUH-R13)
A data warehouse addresses these issues by storing data from the different sources in one place. Before data is stored in the warehouse, it must pass through the ETL (Extract, Transform, Load) process: data is extracted from the sources (marketing, sales, purchase, etc.), transformed into a unified format such as cust_id, cust_name, sales_amt, and other related fields, and loaded into one consistent structure. After the ETL process, the data can be used by any data warehouse or BI (Business Intelligence) tool to produce reports in different formats.
The ETL process helps companies analyze business data when shaping marketing strategies and decisions; it is the step that transforms data on its way into the data warehouse.
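The extract–transform–load flow described above can be sketched in a few lines of Python. The unified field names (cust_id, cust_name, sales_amt) follow the example in the text; the two source layouts and their values are hypothetical.

```python
# Minimal ETL sketch: extract rows from two hypothetical sources,
# transform them into the unified (cust_id, cust_name, sales_amt) format,
# and load them into one target list standing in for the warehouse table.

def extract():
    # Two sources with different layouts (assumed for illustration).
    marketing = [{"id": 1, "name": "Asha", "amount": 1200.0}]
    sales = [("2", "Ravi", "950.50")]  # CSV-style tuples of strings
    return marketing, sales

def transform(marketing, sales):
    unified = []
    for row in marketing:
        unified.append({"cust_id": int(row["id"]),
                        "cust_name": row["name"].strip(),
                        "sales_amt": float(row["amount"])})
    for cid, name, amt in sales:
        unified.append({"cust_id": int(cid),
                        "cust_name": name.strip(),
                        "sales_amt": float(amt)})
    return unified

def load(rows, warehouse):
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
```

In a real pipeline, extract() would read from the operational systems and load() would write to the warehouse's fact and dimension tables; the structure of the three steps stays the same.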
▪ The operational database has a limited structure and access pattern, whereas the data warehouse has vast storage capacity and is maintained independently.
▪ The major task of online operational database systems is to perform online transaction and query processing. These systems are called online transaction processing (OLTP) systems. OLTP records data from day-to-day transactions: regular operations such as sales, banking, manufacturing, purchasing, selling, payroll, and accounting.
▪ Data warehouse systems serve experts and knowledge workers, who retrieve data efficiently and play the role of data analysts in decision making. Such systems are known as online analytical processing (OLAP) systems.
Figure 1.2: Cube – Multidimensional Model (Jiawei Han, 2012)
Data cubes provide fast access to summarized data and are widely used in the data mining process. A data cube presents abstract views of the data at several levels: the lowest level is called the base cuboid, the highest level is called the apex cuboid, and the levels in between are intermediate cuboids. Because the cube created for each level of abstraction is itself called a cuboid, a "data cube" may instead refer to a lattice of cuboids, where each higher level of abstraction further reduces the resulting data size. (Jiawei Han, 2012)
A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. Dimensions are the entities with respect to which an organization keeps records; each dimension table has multiple descriptive columns in addition to a primary key column.
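The lattice of cuboids is easy to enumerate: for n dimensions (ignoring concept hierarchies), there are 2^n cuboids, from the base cuboid (all dimensions) down to the apex (no dimensions). A small sketch:

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Enumerate every cuboid (subset of dimensions) of a data cube."""
    cuboids = []
    for k in range(len(dimensions), -1, -1):   # base cuboid first, apex last
        for combo in combinations(dimensions, k):
            cuboids.append(combo)
    return cuboids

lattice = cuboid_lattice(["time", "item", "location"])
# 2**3 = 8 cuboids: the base cuboid ("time", "item", "location"),
# three 2-D cuboids, three 1-D cuboids, and the apex cuboid ().
```

Each cuboid corresponds to one level of summarization; materializing some or all of them is what makes OLAP queries fast.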
Example:
ElectronicsCatalog – a relational database described by the following relation tables:
Customer, Item, Employee, Branch, Purchase, Items_Sold, and Works_At.
Customer (Cust_ID, Cust_Name, Address, Age, Income, Credit_Info, Category,
Occupation)
Item (Item_ID, Item_Name, Brand, Category, Type, Price, Mfd_City, Supplier, Cost)
Employee (Emp_ID, Emp_Name, Category, Group, Salary, Commission)
Branch (Branch_ID, Branch_Name, Address)
Purchase (Trans_ID, Cust_ID, Emp_ID, Pur_date, Pur_Time, Method_Paid, Amt)
Items_Sold (Trans_ID, Item_ID, Qty)
Works_At (Emp_ID, Branch_ID)
ElectronicsCatalog may create a sales data warehouse in order to keep the records of sales
of an electronics product. The dimensions are time, item, branch, and location. These
dimensions allow the store to keep the records of sales in monthwise (time), itemwise
(item), branchwise (branch) and citywise (location).
Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. A dimension may also be described by multiple tables, called sub-dimension tables. Dimension tables can be defined by the database programmer or database designer. All the dimensions are connected to each other through a fact table, the central table that holds keys to all the dimension tables.
For example, a dimension table for item may contain the attributes item_name, brand, and
item_type.
Facts are numeric measures: think of them as the quantities by which we want to analyze relationships between dimensions. A fact holds aggregate values from a table.
Examples of facts for a sales data warehouse include amount_sold (the sales amount in rupees) and units_sold (the number of units sold).
The fact table contains the names of the facts, or measures, as well as keys to each of the
related dimension tables. (Jiawei Han, 2012)
For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois.
The states can in turn be mapped to the country (e.g., Canada or the United States) to
which they belong. These mappings form a concept hierarchy for the dimension location,
mapping a set of low-level concepts (i.e., cities) to higher-level, more general concepts
(i.e., countries). This concept hierarchy is illustrated in Figure 1.3. (Jiawei Han, 2012)
Figure 1.3 A concept hierarchy for location (Jiawei Han, 2012)
Figure 1.4 (a) a hierarchy for location and (b) a lattice for time (Jiawei Han, 2012)
- A concept hierarchy that is a total or partial order among the attributes in a database schema is called a schema hierarchy.
- Concept hierarchies that are common to many applications (e.g., for time) may be predefined in the data mining system.
- Concept hierarchies may also be defined by discretizing or grouping the values of a given dimension or attribute, resulting in a set-grouping hierarchy.
- A total or partial order can be defined among groups of values.
- Concept hierarchies may be provided manually by system users, domain experts, or
knowledge engineers, or may be automatically generated based on statistical analysis of
the data distribution. (Jiawei Han, 2012)
Similarly, it can be shown that min_N() and max_N() (which find the N minimum and N maximum values, respectively, in a given set) and standard_deviation() are algebraic aggregate functions. A measure is algebraic if it is obtained by applying an algebraic aggregate function. (Jiawei Han, 2012)
Holistic: An aggregate function is holistic if there is no constant bound on the storage size needed to describe a subaggregate; that is, there is no algebraic function with M arguments (where M is a constant) that characterizes the computation.
Common examples of holistic functions include median(), mode(), and rank(). A measure is holistic if it is obtained by applying a holistic aggregate function.
Most large data cube applications require efficient computation of distributive and
algebraic measures. It is difficult to compute holistic measures efficiently. (Jiawei Han,
2012)
Star schema: The most common modeling structure is star schema, in which the data
warehouse contains
▪ A large central table (fact table) containing the data, with no redundancy, and
▪ A set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph places the dimension tables around the fact table at the center. The fact table holds keys to the different dimension tables and may also hold summarized (measure) columns. (Jiawei Han, 2012)
A star schema for ElectronicsCatalog sales is shown in Figure 1.6. Sales are considered
along four dimensions: time, item, branch, and location. The schema contains a central fact
table for sales that contains keys to each of the four dimensions, along with two measures:
dollars sold and units sold.
To minimize the size of the fact table, dimension identifiers (e.g., time key and item key) are system-generated identifiers.
define cube sales star [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
The define cube statement defines a data cube called sales star, which corresponds to the
central sales fact table. This command specifies the dimensions and the two measures,
dollars sold and units sold. The data cube has four dimensions, namely, time, item, branch,
and location.
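The same star schema can be sketched as ordinary SQL tables. A minimal version using Python's built-in sqlite3 module, with the dimension tables reduced to a few columns and the table/column names chosen here for illustration:

```python
import sqlite3

# Minimal star schema: one central fact table referencing four dimension
# tables, mirroring the sales star definition in the text (abbreviated).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
conn.execute("INSERT INTO dim_time VALUES (1, 'Q1', 2024)")
conn.execute("INSERT INTO dim_item VALUES (1, 'TV', 'Sony')")
conn.execute("INSERT INTO dim_branch VALUES (1, 'B1')")
conn.execute("INSERT INTO dim_location VALUES (1, 'Chicago', 'USA')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 1, 605.0, 2)")

# A star-join query: total dollars sold per quarter.
total = conn.execute("""
    SELECT t.quarter, SUM(f.dollars_sold)
    FROM fact_sales f JOIN dim_time t ON f.time_key = t.time_key
    GROUP BY t.quarter
""").fetchone()
```

The query illustrates the typical access pattern of a star schema: join the fact table to one or more dimension tables, then group by a dimension attribute.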
Snowflake schema: The snowflake schema is another data warehouse model, in which some dimension tables are normalized and split into sub-dimension tables.
The additional sub-tables spread a single fact table and its dimension tables into a more scattered form whose shape resembles a snowflake, which is why it is called the snowflake schema.
The major difference between the snowflake and star schema models is that the dimension
tables of the snowflake model may be kept in normalized form to reduce redundancies.
Snowflake schema contains sub dimension tables.
A snowflake schema for ElectronicsCatalog sales is given in Figure 1.7. Here, the sales
fact table is identical to that of the star schema in Figure 1.6.
The main difference between the two schemas is in the definition of dimension tables. The
single dimension table for item in the star schema is normalized in the snowflake schema,
resulting in new item and supplier tables.
For example, the item dimension table now contains the attributes item key, item name,
brand, type, and supplier key, where supplier key is linked to the supplier dimension table,
containing supplier key and supplier type information.
Similarly, the single dimension table for location in the star schema can be normalized into
two new tables: location and city. The city key in the new location table links to the city
dimension. A further normalization can be performed on state and country in the snowflake
schema shown in Figure 1.7.
This definition is similar to that of sales star, except that, here, the item and location
dimension tables are normalized.
For instance, the item dimension of the sales star data cube has been normalized in the sales snowflake cube into two dimension tables, item and supplier. Note that the
dimension definition for supplier is specified within the definition for item. Defining
supplier in this way implicitly creates a supplier key in the item dimension table definition.
Similarly, the location dimension of the sales star data cube has been normalized in the
sales snowflake cube into two dimension tables, location and city. The dimension
definition for city is specified within the definition for location. In this way, a city key is
implicitly created in the location dimension table definition.
A fact constellation schema is shown in Figure 1.8. This schema specifies two fact tables,
sales and shipping. The sales table definition is identical to that of the star schema (Figure
1.6). The shipping table has five dimensions, or keys—item key, time key, shipper key,
from location, and to location—and two measures—dollars cost and units shipped. A fact
constellation schema allows dimension tables to be shared between fact tables.
For example, the dimensions tables for time, item, and location are shared between the
sales and shipping fact tables. (Jiawei Han, 2012)
A fact constellation is a hybrid structure that can include both star and snowflake schemas; multiple fact tables and multiple dimension tables can be present in a fact constellation schema.
A define cube statement is used to define data cubes for sales and shipping, corresponding
to the two fact tables of the schema. Note that the time, item, and location dimensions of
the sales cube are shared with the shipping cube.
This is indicated for the time dimension, for example, as follows. Under the define cube
statement for shipping, the statement “define dimension time as time in cube sales” is
specified.
1.9 OLAP Operations
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.
OLAP operations let users view data from different perspectives, which supports data analysis and easy retrieval. OLAP provides a user-friendly environment for interactive data analysis.
(Jiawei Han, 2012)
Typical OLAP operations are illustrated in Figure 1.11. In our case the multidimensional cube of ElectronicsCatalog sales is placed at the center. The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time with respect to quarters, and item with respect to item types.
Roll Up or Drill Up: The roll-up operation performs an aggregation on a data cube, either
by climbing up a concept hierarchy for a dimension or by dimension reduction. Figure 1.9
shows the result of a roll-up operation performed on the central cube by climbing up the
concept hierarchy for location given in Figure 1.9.
This hierarchy was defined as the total order “street < city < province or state < country.”
The roll-up operation shown aggregates the data by ascending the location hierarchy from
the level of city to the level of country. In other words, rather than grouping the data by
city, the resulting cube groups the data by country. When roll-up is performed by
dimension reduction, one or more dimensions are removed from the given cube.
For example, consider a sales data cube containing only the location and time dimensions.
Roll-up may be performed by removing, say, the time dimension, resulting in an
aggregation of the total sales by location, rather than by location and by time. (Jiawei Han,
2012)
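Both forms of roll-up amount to re-grouping the measure at a coarser level. A sketch with pandas, using hypothetical city-level sales figures:

```python
import pandas as pd

# City-level sales with each city's country (hypothetical figures).
sales = pd.DataFrame({
    "city":    ["Vancouver", "Toronto", "Chicago", "New York"],
    "country": ["Canada", "Canada", "USA", "USA"],
    "dollars_sold": [1000.0, 1500.0, 800.0, 1200.0],
})

# Roll-up by climbing the location hierarchy from city to country.
by_country = sales.groupby("country")["dollars_sold"].sum()

# Roll-up by dimension reduction: remove location entirely.
grand_total = sales["dollars_sold"].sum()
```

Grouping by country collapses the two Canadian and two US cities into one row each; dropping the dimension collapses everything into a single grand total.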
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to
more detailed data. Drill-down can be realized by either stepping down a concept hierarchy
for a dimension or introducing additional dimensions.
Figure 1.10 shows the result of a drill-down operation performed on the central cube by
stepping down a concept hierarchy for time defined as “day < month < quarter < year.”
Drill-down occurs by descending the time hierarchy from the level of quarter to the more
detailed level of month.
The resulting data cube details the total sales per month rather than summarizing them by
quarter. (Jiawei Han, 2012)
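Drill-down presupposes that the data is kept at the finer level (month), so the quarterly view can be expanded back out. A sketch with hypothetical figures:

```python
import pandas as pd

# Base data already stored per month; the quarterly view is an aggregate of it.
monthly = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "dollars_sold": [100.0, 120.0, 80.0, 90.0, 110.0, 130.0],
})

quarterly = monthly.groupby("quarter")["dollars_sold"].sum()  # rolled-up view
# Drill-down from quarter to month: step back down to the finer grouping.
per_month = monthly.set_index("month")["dollars_sold"]
```

The quarterly series is what roll-up produces; drill-down simply presents the underlying monthly detail again.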
Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
Figure 1.12 shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = “Q1.” The dice operation defines a subcube by performing a selection on two or more dimensions.
Figure 1.13 shows a dice operation on the central cube based on the following selection criteria that involve three dimensions: (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”). (Jiawei Han, 2012)
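On a flattened representation of the cube, a slice is a selection on one dimension and a dice a selection on several at once. A sketch mirroring the criteria above, with hypothetical rows:

```python
import pandas as pd

cube = pd.DataFrame({
    "location": ["Toronto", "Vancouver", "Chicago", "Toronto"],
    "time":     ["Q1", "Q2", "Q1", "Q3"],
    "item":     ["computer", "home entertainment", "phone", "computer"],
    "dollars_sold": [500.0, 700.0, 400.0, 600.0],
})

# Slice: select on the single dimension time (time = "Q1").
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on three dimensions at once.
dice = cube[cube["location"].isin(["Toronto", "Vancouver"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["home entertainment", "computer"])]
```

Both operations return a subcube: the same dimensions and measures, restricted to the selected member values.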
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data
axes in view to provide an alternative data presentation.
Figure 1.14 shows a pivot operation where the item and location axes in a 2-D slice are
rotated. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cube
into a series of 2-D planes. (Jiawei Han, 2012)
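Pivoting a 2-D slice just swaps which dimension runs along the rows and which along the columns; in pandas that is a pivot followed by a transpose (data values are hypothetical):

```python
import pandas as pd

slice_2d = pd.DataFrame({
    "item":     ["TV", "TV", "phone", "phone"],
    "location": ["Chicago", "Toronto", "Chicago", "Toronto"],
    "dollars_sold": [100.0, 150.0, 200.0, 250.0],
})

# item on the rows, location on the columns...
view = slice_2d.pivot(index="item", columns="location", values="dollars_sold")
# ...and the pivot (rotate) operation swaps the two axes.
rotated = view.T
```

No data changes, only the presentation: the same cells appear with the axes exchanged.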
OLAP offers analytical modelling capabilities, including a calculation engine for deriving
ratios, variance, and so on, and for computing measures across multiple dimensions. It can
generate summarizations, aggregations, and hierarchies at each granularity level and at
every dimension intersection. OLAP also supports functional models for forecasting, trend
analysis, and statistical analysis. OLAP engine is a powerful data analysis tool.
Figure 1.15: Starnet Model (Jiawei Han, 2012)
This starnet consists of four radial lines, representing concept hierarchies for the
dimensions location, customer, item, and time, respectively. Each line consists of footprints
representing abstraction levels of the dimension.
For example, the time line has four footprints: “day,” “month,” “quarter,” and “year.” A
concept hierarchy may involve a single attribute (e.g., date for the time hierarchy) or
several attributes (e.g., the concept hierarchy for location involves the attributes street, city,
province or state, and country).
In order to examine the item sales at ElectronicsCatalog, users can roll up along the time
dimension from month to quarter, or, say, drill down along the location dimension from
country to city.
Concept hierarchies can be used to generalize data by replacing low-level values (such as
“day” for the time dimension) by higher-level abstractions (such as “year”), or to specialize
data by replacing higher-level abstractions with lower-level values. (Jiawei Han, 2012)
MOLAP vs. ROLAP:
- In MOLAP, data retrieval is very fast; in ROLAP, retrieval is slower because data is retrieved from different tables.
- MOLAP stores data sets in a multidimensional form; ROLAP uses a simple tabular format.
- Retrieving data from MOLAP is easy for new users; ROLAP requires technical skill to retrieve data.
In this era of rapid growth in Big Data and real-time processing platforms, the data warehouse is still a reliable option for data storage, analysis, and indexing. A data warehouse is a relational database designed for complex queries (e.g., written in DMQL, the Data Mining Query Language) that retrieve analytical results from stored historical data.
Teradata
Teradata is a market leader in the data warehousing space that brings more than 30
years of history to the table. Teradata’s EDW (enterprise data warehouse)
platform provides businesses with robust, scalable hybrid-storage capabilities and
analytics from mounds of unstructured and structured data leading to real-time business
intelligence insights, trends, and opportunities. Teradata also offers a cloud-based
DBMS solution via its Aster Database platform.
Oracle
Oracle has been a household name in relational databases and data warehousing for decades. Oracle 12c Database is an industry standard for high-performance, scalable, optimized data warehousing. The company’s specialized
platform for the data warehousing side is the Oracle Exadata Machine. There are an
estimated 390,000 Oracle DBMS customers worldwide.
Questions:
xxxXxxx
Unit – II
Data Warehouse Architecture
Objectives:
- To learn basic concepts of data warehouse and its architecture, model
- To study data loading and preprocessing techniques
- To overcome and resolve the issues of data loading into a data warehouse
- To implement data warehouse
The data warehouse view contains both fact and dimension tables, as explained in the previous chapter. The business query view is similar to the data views in Oracle and can be used by different users as per their requirements. Queries written to retrieve data from the data warehouse can be expressed in Data Mining Query Language (DMQL), which provides the data viewpoint of the user. (Jiawei Han, 2012)
- The external data sources are operated by different database engines or one kind of
gateways. This is a program interface where data can be extracted in a specified format
and operated by DBMS servers.
- A database engine or gateway uses ODBC (Open Database Connectivity) and OLEDB (Object Linking and Embedding, Database) by Microsoft, or JDBC (Java Database Connectivity), for handling external data sources.
- Information about the stored data is then kept in a separate database repository called the metadata repository, which holds all kinds of storage information about the data and its content in the data warehouse. (Jiawei Han, 2012)
- The middle tier is an OLAP server, implemented using a specialized database server that handles ROLAP and MOLAP data. ROLAP handles data kept in relational databases, while MOLAP handles data available in multidimensional arrays, flat files, etc.
- The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects. It is a small part
of data warehouse. Data marts are usually implemented on low-cost departmental servers
that are Unix/Linux or Windows based. The implementation cycle of a data mart is more
likely to be measured in weeks rather than months or years. It may involve complex
integration in the long run if its design and planning were not enterprise wide.
For example, a marketing data mart may confine its subjects to customer, item, and sales.
The data contained in data marts tend to be summarized.
Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata. For example, the index of a book serves as a metadata for the
contents in the book. In other words, we can say that metadata is the summarized data that
leads us to detailed data. In terms of data warehouse, we can define metadata as follows.
• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Classification of Metadata
• Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes
structural information such as primary and foreign key attributes and indices.
• Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means
the history of data migrated and transformation applied on it.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database
servers. (Jiawei Han, 2012)
The bottom-up approach to the design, development, and deployment of independent data marts provides flexibility, low cost, and rapid return on investment. However, it can lead to problems when integrating various disparate data marts into a consistent enterprise data warehouse.
First, a high-level corporate data model is defined within a reasonably short period (such as one or two months) that provides a corporate-wide, consistent, integrated view of data among different subjects and potential usages. This high-level model, which will need to be refined in the further development of enterprise data warehouses and departmental data marts, will greatly reduce future integration problems.
Second, independent data marts can be implemented in parallel with the enterprise
warehouse based on the same corporate data model set noted before.
Third, distributed data marts can be constructed to integrate different data marts via hub
servers.
Finally, a multitier data warehouse is constructed where the enterprise warehouse is the
sole custodian of all warehouse data, which is then distributed to the various dependent
data marts. (Jiawei Han, 2012)
Typical preprocessing includes data cleaning, data integration, data reduction, and data transformation operations. These operations are applied to the raw data to prepare it for further processing; together they form the conversion mechanism that creates processed data for loading into the data warehouse. (JNTUH-R13)
▪ Human or computer errors made during data entry operations
▪ Errors in data transmission or conversion
▪ Data containing errors or outliers, e.g., InterestRate = “-10.98”
Figure 2.3 Forms of data pre-processing (Jiawei Han, 2012)
a) Missing Values
The following techniques can handle missing values before data is loaded into a data warehouse:
i. Ignoring the tuple: This can be done when a tuple lacks the label or value that identifies its case (for example, when the category of the record is not specified). However, if most of a column's values are missing, discarding those tuples loses data quality, so this is not an effective technique when much of the data is affected.
ii. Filling in missing data manually: Sometimes column values can be filled in by hand, based on past entries for the respective column. The entered value is an assumption, the process is time-consuming, and it is unsuitable where the past data is not relevant.
iii. Using a global constant: All missing values can be replaced by the same constant, such as “unknown”, “NA”, or “—”. The constant to be used must be agreed upon before the replacement is made.
iv. Using a central value: The data can be classified based on its past entries, and the missing values replaced by the mean or mode of the respective class.
v. Using the most probable value: The missing values can be studied against previous entries and predicted with regression, a Bayesian technique, or decision tree induction.
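Techniques iii and iv above can be sketched with pandas; the column names, fill choices, and values here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "category":   ["A", "A", "B", "B"],
    "income":     [30000.0, None, 50000.0, None],
    "occupation": ["clerk", None, "engineer", "teacher"],
})

# (iii) Global constant: replace missing categorical values with "Unknown".
df["occupation"] = df["occupation"].fillna("Unknown")

# (iv) Class-wise mean: fill missing income with the mean of its category.
df["income"] = df.groupby("category")["income"].transform(
    lambda s: s.fillna(s.mean()))
```

Filling by class mean (rather than the overall mean) keeps the imputed values consistent with similar records, which is the point of classifying the data first.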
b) Noisy data:
Noise is a random error or variance in a measured variable. Data smoothing techniques can be used to remove noisy data.
iii. Regression: smooth by fitting the data to regression functions. Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other.
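Regression smoothing replaces noisy observations with values predicted from the fitted function. A least-squares line with numpy, using hypothetical data that is roughly linear:

```python
import numpy as np

# Noisy observations of an approximately linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = slope * x + intercept by least squares...
slope, intercept = np.polyfit(x, y, deg=1)
# ...and smooth the data by replacing each y with its fitted value.
y_smooth = slope * x + intercept
```

The fitted values lie exactly on the line, so the random fluctuations around the trend are removed.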
b) Data Transformation
Data transformation can involve the following:
i. Smoothing: which works to remove noise from the data
ii. Aggregation: where summary or aggregation operations are applied to the data.
For example, daily sales data may be aggregated to compute weekly and annual totals.
iii. Generalization of the data: where low-level or “primitive” (raw) data are
replaced by higher-level concepts using concept hierarchies.
For example, categorical attributes, like street, can be generalized to higher-
level concepts, like city or country.
iv. Normalization: where the attribute data are scaled to fall within a small
specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
v. Attribute construction (feature construction): this is where new attributes are
constructed and added from the given set of attributes to help the mining
process.
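Min-max normalization (item iv) maps each value linearly into the target range. A small sketch, with hypothetical values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

ages = [18, 30, 45, 60]
scaled = min_max_normalize(ages)   # scaled into [0.0, 1.0]
```

The smallest value maps to new_min and the largest to new_max; everything in between keeps its relative position.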
2.4 Data cube aggregation: This reduces the data to the level of abstraction needed in the analysis. Queries about aggregated information should be answered using a data cube when possible. Data cubes store multidimensional aggregated information. Figure 2.6 shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each branch.
Figure 2.6 Cube Structure (JNTUH-R13)
Each cell holds an aggregate data value, corresponding to the data point in
multidimensional space. The following database consists of sales per quarter for the
years 1997-1999.
Suppose the analyst is interested in annual sales rather than sales per quarter; the above data can be aggregated so that the resulting data summarizes the total sales per year instead of per quarter. The resulting data is smaller in volume, without loss of the information necessary for the analysis task. (JNTUH-R13)
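The quarter-to-year aggregation described here is a plain group-by-year sum. A sketch with hypothetical quarterly figures for 1997–1999:

```python
import pandas as pd

quarterly_sales = pd.DataFrame({
    "year":    [1997] * 4 + [1998] * 4 + [1999] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales":   [224, 408, 350, 586, 300, 410, 370, 600, 320, 450, 400, 610],
})

# Aggregate away the quarter level: total sales per year.
annual_sales = quarterly_sales.groupby("year")["sales"].sum()
```

Twelve quarterly rows are reduced to three annual rows, with no loss of the information needed for a year-level analysis.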
ii. Help to eliminate irrelevant features and reduce noise
iii. Reduce time and space required in data mining
iv. Allow easier visualization
c. Dimensionality reduction techniques
i. Principal component analysis
ii. Singular value decomposition
iii. Supervised and nonlinear techniques (e.g., feature selection)
b. Histogram
i. Divide data into buckets and store average (sum) for each bucket. A
bucket represents an attribute-value/frequency pair
ii. It can be constructed optimally in one dimension using dynamic
programming.
iii. It divides up the range of possible values in a data set into classes or
groups. For each group, a rectangle (bucket) is constructed with a base
length equal to the range of values in that specific group, and an area
proportional to the number of observations falling into that group.
iv. The buckets are displayed along a horizontal axis, while the height of a bucket represents the average frequency of the values.
Figure 2.8 Histogram Sample (JNTUH-R13)
Histograms are highly effective at approximating both sparse and dense data, as
well as highly skewed, and uniform data.
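An equal-width histogram in the sense described above, dividing the value range into buckets and keeping one count per bucket, can be built with numpy (the prices are hypothetical):

```python
import numpy as np

# Item prices; reduce the 15 raw values to 3 equal-width buckets.
prices = np.array([1, 1, 5, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30])
counts, edges = np.histogram(prices, bins=3)
# counts holds one frequency per bucket, edges the bucket boundaries:
# the 15 raw values are reduced to 3 (range, frequency) pairs.
```

This is data reduction: queries about the price distribution can now be answered approximately from three numbers instead of fifteen.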
Clustering: data can also be reduced by grouping objects into clusters. The quality of clusters is measured by their diameter (the maximum distance between any two objects in the cluster) or by centroid distance (the average distance of each cluster object from its centroid).
d. Sampling
Sampling can be used as a data reduction technique since it allows a large data
set to be represented by a much smaller random sample (or subset) of the data.
Suppose that a large data set, D, contains N tuples. Let's have a look at some
possible samples for D.
i. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
ii. Simple random sample with replacement (SRSWR) of size n: This is
similar to SRSWOR, except that each time a tuple is drawn from D, it is
recorded and then replaced. That is, after a tuple is drawn, it is placed
back in D so that it may be drawn again.
iii. Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters”, then an SRS of m clusters can be obtained, where m < M.
For example, tuples in a database are usually retrieved a page at a time,
so that each page can be considered a cluster.
A reduced data representation can be obtained by applying, say,
SRSWOR to the pages, resulting in a cluster sample of the tuples.
iv. Stratified sample: If D is divided into mutually disjoint parts called
“strata", a stratified sample of D is generated by obtaining a SRS at each
stratum. This helps to ensure a representative sample, especially when
the data are skewed.
For example, a stratified sample may be obtained from customer data,
where stratum is created for each customer age group.
In this way, the age group having the smallest number of customers is sure to
be represented.
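The four sampling schemes above can be sketched with the standard library's random module. The function names and the per-stratum rounding rule are my own choices for illustration, not a canonical implementation:

```python
import random

def srswor(data, n, seed=0):
    """SRSWOR: draw n of the N tuples, each equally likely, no repeats."""
    return random.Random(seed).sample(data, n)

def srswr(data, n, seed=0):
    """SRSWR: each drawn tuple is replaced, so it may be drawn again."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n)]

def stratified_sample(data, key, frac, seed=0):
    """Take an SRS within each stratum so small strata stay represented."""
    rng = random.Random(seed)
    strata = {}
    for row in data:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * frac))  # at least one per stratum
        sample.extend(rng.sample(rows, k))
    return sample
```

With 90 "young" and 10 "senior" customers and `frac=0.1`, the stratified sample keeps one senior customer, matching the age-group example above.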
Concept Hierarchy
▪ A concept hierarchy for a given numeric attribute defines a discretization of the
attribute.
▪ Concept hierarchies can be used to reduce the data by collecting and replacing
low level concepts (such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
▪ A concept hierarchy defines a sequence of mappings from set of low-level
concepts to higher-level, more general concepts.
▪ It organizes the values of attributes or dimensions into gradual levels of
abstraction. Concept hierarchies are useful in mining at multiple levels of abstraction.
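The age example above can be sketched as a simple mapping. The cut points 35 and 60 are illustrative assumptions, not values from the text:

```python
def age_concept(age):
    """Map a numeric age to a higher-level concept (a discretization)."""
    if age < 35:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"
```

Replacing each numeric age with `age_concept(age)` reduces the data while preserving the level of abstraction needed for mining.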
Questions:
xxxXxxx
Unit – III
Introduction to Data Mining
Objectives:
- To learn basic concepts of data mining and its functionalities
- To study data mining classification and mining task primitives
- To study different issues and data retrieval in data mining
- To study the applicability of data mining
Example:
Data Mining can be applicable in following scenarios:
- Studying the profile of Credit Card Holders
- Living standards of Heart Attack Patients
- Buying Behavior of Middle-Class People
- Seasonal expenses made by people
- Medical Diagnosis
In the above examples, mining plays an analytical role: it studies or retrieves
data/reports based on roughly 5–6 years of past data for a given case. It generates
new information by reasoning about a given situation and retrieving the most probable
knowledge. This knowledge will be helpful for further decisions when the same
situation arises.
Data mining supports applying marketing strategies for non-moving products in a
shopping mall. It can also be applied in medical diagnosis, where doctors can give
prescriptions based on the history of a patient's diseases.
The basic task of data mining is to retrieve knowledge from a large data set and
convert it into an understandable format for further use. It is helpful in
extracting information. The key properties of data mining are
▪ Extraction of knowledge or pattern
▪ Outcome Prediction
▪ Knowledge Intelligence
▪ Retrieval and focus based on large data set
Figure 3.1 KDD Process (Jiawei Han, 2012)
The support (s) of an association rule X ➔ Y is the percentage of transactions in the
database that contain X ∪ Y.
The confidence (or strength) of an association rule X ➔ Y is the ratio of the number of
transactions that contain X ∪ Y to the number of transactions that contain X.
The support count is the absolute support, while the probability based on the
support count is the relative support. Consider the following set of transactions,
where transaction ID 10 contains items A, B, and C, and so on.

Transaction ID   Items
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%
Consider the example of a supermarket in which 1000 customers purchased different
items, including soap, washing powder, shampoo, conditioner, biscuits, and so on.
Let us find the support and confidence of two items; in our case, shampoo and
conditioner, because most of the customers bought these two items frequently.
Out of 1000 transactions, 200 transactions contain shampoo and 300 transactions
contain conditioner. Of those 300 transactions, 100 include shampoo and conditioner
bought together. Based on this data, we will compute support and confidence.
Support
Support is the default popularity of any item. You can calculate the Support by the
following formula. In our example,
Support (Shampoo) = (Transactions involving Shampoo) / (Total Transactions)
= 200/1000
= 20%
Confidence
You can calculate the Confidence by the following formula. In our example,
Confidence = (Transactions involving both Conditioner and Shampoo) /
(Total Transactions involving Shampoo)
= 100/200
= 50%
This says that 50% of the customers who bought shampoo also bought conditioner.
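The two formulas can be checked with short helpers over the small transaction table given earlier (transactions 10–40 with items A–F). The helper names are my own:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(X ∪ Y) / support(X) for the rule X ➔ Y."""
    joint = set(antecedent) | set(consequent)
    return support(transactions, joint) / support(transactions, antecedent)

# the four transactions from the table above
txns = [["A", "B", "C"], ["A", "C"], ["A", "D"], ["B", "E", "F"]]
```

Here `support(txns, ["A"])` gives 0.75 and `confidence(txns, ["A"], ["C"])` gives 2/3, consistent with the table of frequent patterns.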
3.1.5 Classification:
▪ It predicts categorical class labels
▪ It classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying attribute and uses it in classifying new data
▪ Typical Applications
o credit approval
o target marketing
o medical diagnosis
o treatment effectiveness analysis
Classification can be defined as the process of designing a model that assigns
categorical data to one of a set of classes, where each class is identified by a
class label. The model is built by analyzing a set of training data and can then be
used to predict class labels for future data.
Example:
An airport security screening station is used to determine if passengers are potential
terrorists or criminals. To do this, the face of each passenger is scanned and its basic
pattern (distance between eyes, size and shape of mouth, head, etc.) is identified. This
pattern is compared to entries in a database to see if it matches any patterns that are
associated with known offenders.
3.1.6 Prediction:
Prediction is a form of analysis closely related to classification, through which we can
predict a situation based on past relevant data. Classification and prediction are based on
each other: using classified data, we can easily retrieve similar kinds of data, and
this can be helpful in making future predictions.
Example:
Predicting the share market status of particular shares is a difficult problem. One
approach is to study the status of the particular shares and their values over the last
week, month, and year. The status of the current economy and relevant business
strategies are also studied. By considering all these parameters, one can predict the
future growth of a particular share, highlighting the likely profit or loss in its
value. The prediction must be made with respect to the time the data were collected.
Classification
◼ Classification and label prediction
◼ Construct models (functions) based on some training examples
◼ Describe and distinguish classes or concepts for future prediction, e.g.,
classify countries based on (climate), or classify cars based on (gas mileage)
◼ Predict some unknown class labels
◼ Typical methods: decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, …
◼ Typical applications: credit card fraud detection, direct marketing,
classifying stars, diseases, web-pages, …

Cluster Analysis
◼ Unsupervised learning (i.e., class label is unknown)
◼ Group data to form new categories (i.e., clusters), e.g., cluster houses to
find distribution patterns
◼ Principle: maximizing intra-class similarity and minimizing interclass
similarity
◼ Many methods and applications

Outlier Analysis
◼ Outlier: a data object that does not comply with the general behavior of the
data
◼ Noise or exception? One person's garbage could be another person's treasure
◼ Methods: by-product of clustering or regression analysis, …
◼ Useful in fraud detection, rare events analysis
c. Classification according to the kind of knowledge discovered: this classification
categorizes data mining systems based on the kind of knowledge discovered or data
mining functionalities, such as characterization, discrimination, association,
classification, clustering, etc. Some systems tend to be comprehensive systems
offering several data mining functionalities together.
d. Classification according to mining techniques used: Data mining systems
employ and provide different techniques. This classification categorizes data
mining systems according to the data analysis approach used such as machine
learning, neural networks, genetic algorithms, statistics, visualization, database
oriented or data warehouse-oriented, etc. The classification can also take into
account the degree of user interaction involved in the data mining process such as
query-driven systems, interactive exploratory systems, or autonomous systems. A
comprehensive system would provide a wide variety of data mining techniques to
fit different situations and options, and offer different degrees of user interaction.
3.4 Integration of a Data Mining System with a Database or a Data Warehouse
System
The differences between the following architectures for the integration of a data mining
system with a database or data warehouse system are as follows.
a. No coupling:
The data mining system uses sources such as flat files to obtain the initial data set to
be mined since no database system or data warehouse system functions are
implemented as part of the process. Thus, this architecture represents a poor design
choice.
b. Loose coupling:
The data mining system is not integrated with the database or data warehouse
system beyond their use as the source of the initial data set to be mined, and
possible use in storage of the results. Thus, this architecture can take advantage of
the flexibility, efficiency and features such as indexing that the database and data
warehousing systems may provide. However, it is difficult for loose coupling to
achieve high scalability and good performance with large data sets as many such
systems are memory-based.
c. Semi-tight coupling:
Some of the data mining primitives, such as aggregation, sorting, or precomputation
of statistical functions, are efficiently implemented in the database or data
warehouse system, for use by the data mining system during mining-query
processing. Also, some frequently used intermediate mining results can be
precomputed and stored in the database or data warehouse system, thereby enhancing
the performance of the data mining system.
d. Tight coupling:
The database or data warehouse system is fully integrated as part of the data mining
system and thereby provides optimized data mining query processing. Thus, the
data mining subsystem is treated as one functional component of an information
system. This is a highly desirable architecture, as it facilitates efficient
implementations of data mining functions, high system performance, and an
integrated information processing environment.
From the descriptions of the architectures provided above, it can be seen that tight
coupling is the best alternative when technical and implementation issues are set
aside. However, as much of the technical infrastructure needed in a tightly coupled
system is still evolving, implementation of such a system is non-trivial. Therefore,
the most popular architecture is currently semi-tight coupling, as it provides a
compromise between loose and tight coupling.
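The difference between loose and semi-tight coupling can be illustrated with SQLite. In the loose style every tuple crosses into the mining code, while in the semi-tight style a mining primitive (aggregation) is pushed into the DBMS. The sales table and its values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("soap", 2.0), ("soap", 3.0), ("shampoo", 5.0)])

# Loose coupling: fetch raw tuples, aggregate inside the mining code
totals = {}
for item, amount in conn.execute("SELECT item, amount FROM sales"):
    totals[item] = totals.get(item, 0.0) + amount

# Semi-tight coupling: the DBMS computes the aggregate; only summaries cross
db_totals = dict(conn.execute("SELECT item, SUM(amount) FROM sales GROUP BY item"))
```

Both styles produce the same totals, but the semi-tight version transfers far less data, which is exactly the performance argument made above.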
Figure 3.4: Data Mining Architecture (Jiawei Han, 2012)
The architecture of a typical data mining system may have the following major
components
▪ Database, data warehouse, World Wide Web, or other information repository: data
cleaning and data integration techniques may be performed on the data.
▪ Database or data warehouse server: it is responsible for fetching the relevant data
based on the user's data mining request.
▪ Data mining engine: it consists of a set of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis, etc.
▪ Knowledge base – It is the domain knowledge used to guide the search or evaluate
the interestingness of resulting patterns
deviation analysis, and similarity analysis. These tasks may use the same
database in different ways and require the development of numerous data
mining techniques.
o Interactive mining of knowledge at multiple levels of abstraction: Since it is
difficult to know exactly what can be discovered within a database, the data
mining process should be interactive.
o Incorporation of background knowledge: Background knowledge, or
information regarding the domain under study, may be used to guide the
discovery patterns. Domain knowledge related to databases, such as integrity
constraints and deduction rules, can help focus and speed up a data mining
process, or judge the interestingness of discovered patterns.
o Data mining query languages and ad-hoc data mining: Knowledge of
relational query languages (such as SQL) is required, since they allow users to
pose ad-hoc queries for data retrieval.
o Presentation and visualization of data mining results: Discovered
knowledge should be expressed in high-level languages, visual representations,
so that the knowledge can be easily understood and directly usable by humans
o Handling outlier or incomplete data: The data stored in a database may
reflect outliers: noise, exceptional cases, or incomplete data objects. These
objects may confuse the analysis process, causing overfitting of the data to the
knowledge model constructed. As a result, the accuracy of the discovered
patterns can be poor. Data cleaning methods and data analysis methods that
can handle outliers are required.
o Pattern evaluation: refers to interestingness of pattern: A data mining
system can uncover thousands of patterns. Many of the patterns discovered may
be uninteresting to the given user, representing common knowledge or lacking
novelty. Several challenges remain regarding the development of techniques to
assess the interestingness of discovered patterns.
semi-structured, or unstructured data with diverse data semantics poses great
challenges to data mining.
Questions:
xxxXxxx
Unit – IV
Mining Association Rules
Objectives:
- To learn basic concepts of association rule mining
- To study the data support and confidence in frequent itemsets
- To study Apriori algorithm to retrieve the most frequent itemsets from a large data
set and its applicability in business
4 Basic Concepts
Association Rule
Association rules are used to show relationships between data items. They are based
on frequent patterns. A frequent pattern is a set of items, subsequences,
substructures, etc. that occurs frequently in a data set. Frequent patterns raise
questions such as:
What products were often purchased together?
- Beer and diapers, Milk and Butter/Bread
What products are purchased one after other?
- PC followed by digital camera
- TV set followed by VCD player
▪ Is there a structure defining relationships in the items purchased?
▪ What kinds of DNA are sensitive to this new drug?
▪ Can we automatically classify web documents?
To answer these questions, we need the concept of an association, which exhibits a
relationship between two items.
Association mining is also helpful for season-wise selling of products: it guides
which products to keep on the shelves of shopping malls to attract customers' buying
behavior. Market basket analysis also provides a place where 'non-moving items' can
be given good treatment, so they can be sold with an offer.
The market basket analyzes customer buying habits by finding associations between the
different items that customers place in their “shopping baskets” (Figure 4.1). The
discovery of these associations can help retailers to develop marketing strategies by
identifying the items which are frequently purchased together by customers. A typical
example of frequent itemset mining is market basket analysis. (Jiawei Han, 2012)
For instance, if customers are buying milk, how likely are they to also buy bread (and
what kind of bread) on the same trip to the supermarket? This information can lead to
increased sales by helping retailers do selective marketing and plan their shelf space.
(Jiawei Han, 2012)
4.2.1 Steps to perform multilevel association rules from transactional database are:
Step1: consider frequent item sets
Step2: arrange the items in hierarchy form
Step3: find the Items at the lower level (expected to have lower support)
Step4: Apply association rules on frequent item sets
Step5: Use some methods and identify frequent itemsets
Note: Support is categorized into two types: uniform support (the same minimum
support threshold at all levels of the hierarchy) and reduced support (a lower
threshold at lower levels).
4.2.2 Mining multidimensional association rules from Relational databases and data
warehouse
Multi-dimensional association rule: Multi-dimensional association rule can be defined
as the statement which contains only two (or) more predicates/dimensions. We can
perform the following association rules on relational database and data warehouse
▪ Boolean dimensional association rule:
Boolean dimensional association rule can be defined as comparing existing
predicates/dimensions with non-existing predicates/dimensions
▪ Single dimensional association rule:
Single dimensional association rule can be defined as the statement which contains
only single predicate/dimension
Algorithm Explanation:
Example:
Consider following list of items for a given transactions (Jiawei Han, 2012)
There are nine transactions in this database, that is, |D| = 9.
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of
candidate 1-itemsets, C1.
The algorithm simply scans all of the transactions in order to count the number of
occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set
of frequent 1-itemsets, L1, can then be determined.
It consists of the candidate 1-itemsets satisfying minimum support. In our example,
all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1
to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2
during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those
candidate2-itemsets in C2 having minimum support.
6. The set of candidate 3-itemsets, C3, is generated. From the join step, we first
get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2,
I3, I5}, {I2, I4, I5}}.
Based on the Apriori property that all subsets of a frequent itemset must also be
frequent, we can determine that the four latter candidates cannot possibly be
frequent.
7. The transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. (Jiawei
Han, 2012)
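The steps above can be sketched as a minimal Apriori implementation, run here on the nine-transaction AllElectronics database from Han (2012) referenced in this example. The join is simplified to all k-combinations of currently frequent items; the prune step makes it equivalent to the L(k-1) ⋈ L(k-1) join:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return every frequent itemset (as a frozenset) with its support count."""
    # C1 / L1: count individual items, keep those meeting min_sup
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_sup}
    result = dict(frequent)
    k = 2
    while frequent:
        items = sorted({i for s in frequent for i in s})
        # Join (simplified) + prune: every (k-1)-subset must be frequent
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        # Scan D to accumulate support counts of the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= set(t))
                  for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(frequent)
        k += 1
    return result

# The nine transactions of the example (|D| = 9), min_sup = 2
D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
     ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
     ["I1", "I2", "I3"]]
```

With min_sup = 2, `apriori(D, 2)` derives, among others, the frequent 3-itemsets {I1, I2, I3} and {I1, I2, I5}, each with support count 2, matching the walkthrough above.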
Illustration
Figure 4.5: Apriori Algorithm Implementation (Jiawei Han, 2012)
Example:
Consider the following data on which Apriori algorithm can be applied to find the association rule
for frequent item set.
Sample Data D
Transaction ID Items
100 Bread, Milk, Butter
200 Bread, Butter
300 Bread, Eggs
400 Milk, Biscuit, Chips
Step 2: Compare candidate support count with minimum support (50%)
Example:
We reexamine the mining of transaction database, D, of Table 4.3 using the frequent
pattern growth approach. An FP-tree is then constructed as follows:
First, create the root of the tree, labeled with “null.” Scan database D a second time.
The items in each transaction are processed in L order (i.e., sorted according to
descending support count), and a branch is created for each transaction.
For example, the scan of the first transaction, “T100: I1, I2, I5,” which contains three
items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with
three nodes, [I2: 1], [I1: 1], and [I5: 1], where I2 is linked as a child to the root, I1 is
linked to I2, and I5 is linked to I1.
The second transaction, T200, contains the items I2 and I4 in L order, which would result
in a branch where I2 is linked to the root and I4 is linked to I2. This branch would share a
common prefix, I2, with the existing path for T100.
Therefore, we instead increment the count of the I2 node by 1, and create a new node, [I4:
1], which is linked as a child to [I2: 2].
In general, when considering the branch to be added for a transaction, the count of each
node along a common prefix is incremented by 1, and nodes for the items following the
prefix are created and linked accordingly.
To facilitate tree traversal, an item header table is built so that each item points to its
occurrences in the tree via a chain of node-links.
The tree obtained after scanning all the transactions is shown in Figure 4.6 with the
associated node-links.
In this way, the problem of mining frequent patterns in databases is transformed into that
of mining the FP-tree. (Jiawei Han, 2012)
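The two-scan construction just described can be sketched as follows. The class and function names are my own; ties in support count are broken lexicographically, which here reproduces the L order I2, I1, I3, I4, I5:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Scan 1: collect support counts and the set of frequent items
    sup = Counter(i for t in transactions for i in t)
    freq = {i for i, n in sup.items() if n >= min_sup}
    root = FPNode(None, None)  # the "null" root
    # Scan 2: insert each transaction's frequent items in L order
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-sup[i], i))
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = FPNode(i, node)  # new branch node
            node = node.children[i]
            node.count += 1  # shared-prefix counts are incremented
    return root
```

On the nine transactions of this example, the root gets a child [I2: 7] whose child is [I1: 4], and a second child [I1: 2], matching the branches described for T100 and T200 above. (A full FP-growth miner would also need the header table of node-links, omitted here for brevity.)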
FP Tree Mining
The FP-tree is mined as follows. Start from each frequent length-1 pattern (as an initial
suffix pattern), construct its conditional pattern base (a “sub-database,” which consists
of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then
construct its (conditional) FP-tree, and perform mining recursively on the tree.
The pattern growth is achieved by the concatenation of the suffix pattern with the frequent
patterns generated from a conditional FP-tree. Mining of the FP-tree is summarized in
Table 4.4 and detailed as follows.
Table 4.4: FP Tree Mining
We first consider I5, which is the last item in L, rather than the first. The reason for
starting at the end of the list will become apparent as we explain the FP-tree mining
process.
I5 occurs in two FP-tree branches of Figure 4.7. (The occurrences of I5 can easily be
found by following its chain of node-links.)
The paths formed by these branches are [I2, I1, I5: 1] and [I2, I1, I3, I5: 1]. Therefore,
considering I5 as a suffix, its corresponding two prefix paths are [I2, I1: 1] and [I2, I1, I3:
1], which form its conditional pattern base.
For I4, its two prefix paths form the conditional pattern base, {[I2 I1: 1], [I2: 1]}, which
generates a single-node conditional FP-tree, [I2: 2], and derives one frequent pattern, [I2,
I4: 2].
Similar to the preceding analysis, I3’s conditional pattern base is {[I2, I1: 2], [I2:
2], [I1: 2]}. Its conditional FP-tree has two branches, [I2: 4, I1: 2] and [I1: 2], as
shown in Figure 4.7, which generates the set of patterns {[I2, I3: 4], [I1, I3: 4],
[I2, I1, I3: 2]}.
Finally, I1’s conditional pattern base is [I2: 4], with an FP-tree that contains only one
node, [I2: 4], which generates one frequent pattern, [I2, I1: 4]. This mining process is
summarized in the algorithm. (Jiawei Han, 2012)
Algorithm: FP growth
Input:
- D, a transaction database;
- Min_sup, the minimum support count threshold.
Method:
1. The FP-tree is constructed in the following steps:
a. Scan the transaction database D once. Collect F, the set of frequent items, and
their support counts.
b. Sort F in support count descending order as L, the list of frequent items.
c. Create the root of the FP-tree, and label it as “null.” For each transaction
Trans in D, do the following:
Select and sort the frequent items in Trans according to the order of L.
Let the sorted frequent-item list in Trans be [p|P], where p is the first element and
P is the remaining list.
2. The FP-tree is mined by calling FP_growth(FP_tree, null), which is implemented as
follows.
Procedure FP_growth(Tree, α)
if Tree contains a single path P then
    for each combination (denoted as β) of the nodes in the path P
        generate pattern β ∪ α with support_count = minimum support count of nodes in β;
else for each ai in the header of Tree
{
    generate pattern β = ai ∪ α with support_count = ai.support_count;
    construct β’s conditional pattern base and then β’s conditional FP-tree Treeβ;
    if Treeβ ≠ ∅ then
        call FP_growth(Treeβ, β);
}
Questions:
xxxXxxx
Unit – V
Classification and Prediction
Objectives:
- To learn basic concepts of classification and prediction
- To study supervised and unsupervised learning, data classification technique
- To study rule-based classification technique and decision tree induction
- To study different classification algorithms
5.0 Introduction:
Classification and prediction are two forms of data analysis. They can be used to extract
models describing important data classes or to predict future data trends. Classification
predicts categorical (discrete, unordered) labels, while prediction models continuous-valued
functions.
For example, we can build a classification model to categorize bank loan applications as
either safe or risky, or a prediction model to predict the expenditures of potential customers
on computer equipment given their income and occupation.
Most algorithms are memory resident, typically assuming a small data size. Recent data
mining research has built on such work, developing scalable classification and prediction
techniques capable of handling large disk-resident data.
Figure 5.1 Model Construction (Jiawei Han, 2012)
b. Unsupervised learning (clustering)
• The class labels of training data are unknown
• Given a set of measurements, observations, etc. with the aim of establishing
the existence of classes or clusters in the data
a. Data cleaning:
This refers to the pre-processing of data in order to remove or reduce noise (by
applying smoothing techniques) and the treatment of missing values (e.g., by replacing
a missing value with the most commonly occurring value for that attribute, or with the
most probable value based on statistics). Although most classification algorithms have
some mechanism for handling noisy or missing data, this step can help reduce
confusion during learning.
b. Relevance analysis:
Many of the attributes in the data may be redundant. Correlation analysis can be used
to identify whether any two given attributes are statistically related.
For example, a strong correlation between attributes A1 and A2 would suggest that one
of the two could be removed from further analysis.
A database may also contain irrelevant attributes. Attribute subset selection can be used
in these cases to find a reduced set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original distribution
obtained using all attributes.
Relevance analysis, in the form of correlation analysis and attribute subset selection,
can be used to detect attributes that do not contribute to the classification or prediction
task. Such analysis can help improve classification efficiency and scalability.
For example, numeric values for the attribute income can be generalized to discrete
ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can
be generalized to higher-level concepts, like city.
Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principal components analysis to discretization techniques, such as
binning, histogram analysis, and clustering.
Example
Consider the following table, which contains information about customers' age group,
income, student status, and credit rating, used in deciding whether they buy a
computer system.
Age Income Student Credit Ratings
<=30 High No Fair
<=30 High No Excellent
31…40 High No Fair
>40 Medium No Fair
>40 Low Yes Fair
>40 Low Yes Excellent
31…40 Low Yes Excellent
<=30 Medium No Fair
<=30 Low Yes Fair
>40 Medium Yes Fair
<=30 Medium Yes Excellent
31…40 Medium No Excellent
31…40 High Yes Fair
>40 Medium No Excellent
Table 5.1 Computer Buying Conditions
▪ Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the
classifier will predict that X belongs to the class having the highest posterior
probability, conditioned on X. That is, the naïve Bayesian classifier predicts
that tuple X belongs to the class Ci if and only if
▪ As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If
the class prior probabilities are not known, then it is commonly assumed
that the classes are equally likely, that is, P(C1) = P(C2) = …= P(Cm), and
we would therefore maximize P(X|Ci). Otherwise, we maximize
P(X|Ci)P(Ci).
▪ Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). In order to reduce computation in evaluating
P(X|Ci), the naive assumption of class conditional independence is made.
This presumes that the values of the attributes are conditionally independent
of one another, given the class label of the tuple. Thus,
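The maximization of P(X|Ci)P(Ci) under the class-conditional independence assumption can be sketched as a count-based classifier over Table 5.1. Note an assumption: Table 5.1 omits the buys_computer class column, so the labels below are taken from the AllElectronics example in Han (2012); the function names are my own, and no Laplace smoothing is applied:

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Ci) and per-attribute P(xj|Ci) from categorical data."""
    n = len(labels)
    class_n = Counter(labels)
    prior = {c: class_n[c] / n for c in class_n}
    cond = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(c, j)][v] += 1

    def classify(x):
        def score(c):  # P(X|Ci)P(Ci) with the independence assumption
            p = prior[c]
            for j, v in enumerate(x):
                p *= cond[(c, j)][v] / class_n[c]
            return p
        return max(prior, key=score)
    return classify

# Rows of Table 5.1 in order: (age, income, student, credit_rating)
rows = [("<=30", "High", "No", "Fair"), ("<=30", "High", "No", "Excellent"),
        ("31...40", "High", "No", "Fair"), (">40", "Medium", "No", "Fair"),
        (">40", "Low", "Yes", "Fair"), (">40", "Low", "Yes", "Excellent"),
        ("31...40", "Low", "Yes", "Excellent"), ("<=30", "Medium", "No", "Fair"),
        ("<=30", "Low", "Yes", "Fair"), (">40", "Medium", "Yes", "Fair"),
        ("<=30", "Medium", "Yes", "Excellent"), ("31...40", "Medium", "No", "Excellent"),
        ("31...40", "High", "Yes", "Fair"), (">40", "Medium", "No", "Excellent")]
# buys_computer labels, assumed from the Han (2012) example
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
classify = train_naive_bayes(rows, labels)
```

Under these labels, the tuple X = (age <= 30, income = medium, student = yes, credit = fair) is classified as "yes", because P(X|yes)P(yes) exceeds P(X|no)P(no).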
Example:
▪ The inputs to the network correspond to the attributes measured for each
training tuple. The inputs are fed simultaneously into the units making up
the input layer. These inputs pass through the input layer and are then
weighted and fed simultaneously to a second layer known as a hidden layer.
▪ The outputs of the hidden layer units can be input to another hidden layer,
and so on. The number of hidden layers is arbitrary.
▪ The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network’s prediction for given tuples
5.11.2 Rule Extraction from a Decision Tree
▪ Rules are easier to understand than large trees
▪ One rule is created for each path from the root to a leaf
▪ Each attribute-value pair along a path forms a conjunction: the leaf holds
the class prediction
▪ Rules are mutually exclusive and exhaustive
▪ Example:
Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = yes
IF age = old AND credit_rating = fair THEN buys_computer = no
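The rule set translates directly into code, one branch per root-to-leaf path (reading the last rule's age as "old", so that the rules stay mutually exclusive and exhaustive):

```python
def buys_computer(age, student, credit_rating):
    """IF-THEN rules extracted from the buys_computer decision tree."""
    if age == "young":
        return "yes" if student == "yes" else "no"
    if age == "mid-age":
        return "yes"
    # age == "old": this subtree splits on credit_rating
    return "yes" if credit_rating == "excellent" else "no"
```

Because the rules are mutually exclusive and exhaustive, exactly one branch fires for every input tuple.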
Questions:
xxxXxxx
Unit – VI
Cluster Analysis
Objectives:
- To learn basic concepts of data clustering and its classification, model
- To study the different applications of cluster analysis
- To study data clustering methods and its applicability in data warehouse
6 Introduction:
Cluster analysis or clustering is the process of partitioning a set of data objects into
subsets. Each subset is a cluster, such that objects in a cluster are similar to one another,
yet dissimilar to objects in other clusters. The set of clusters resulting from a cluster
analysis can be referred to as a clustering. Clustering is the discovery of unknown
groups of data distributed across different clusters; it is also known as unsupervised
learning. In clustering we do not start from predefined groups of data: the data is
scattered, and similar kinds of data are grouped together, while dissimilarity from
other clusters is an equal concern. Because a cluster is a collection of
data objects that are similar to one another within the cluster and dissimilar to objects in
other clusters, a cluster of data objects can be treated as an implicit class. In this sense,
clustering is sometimes called automatic classification.
For example, exceptional cases in credit card transactions, such as very expensive and
infrequent purchases, may be of interest as possible fraudulent activities.
In data mining, it can be used for data distribution based on characteristics of cluster for
detailed analysis. It can be used as pre-processing step for the algorithms where data
characterization, distribution can be retrieved. It can be used for identification of different
outliers.
As a branch of statistics, cluster analysis has been extensively studied, with the main focus
on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids,
and several other methods also have been built into many statistical analysis software
packages or systems, such as S-Plus, SPSS, and SAS.
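A minimal version of k-means, the distance-based method mentioned above, can be sketched for 2-D points. The naive random initialization and fixed iteration count are simplifying assumptions; real implementations add smarter seeding and convergence checks:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Assign each point to its nearest centroid, then recompute the means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive initialization from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest centroid by squared Euclidean distance
            j = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters
```

On two well-separated groups of points, the centroids converge to the group means within a few iterations, maximizing intra-cluster similarity and inter-cluster dissimilarity.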
In machine learning, classification is known as supervised learning because the class label
information is given, that is, the learning algorithm is supervised in that it is told the class
membership of each training tuple. Clustering is known as unsupervised learning because
the class label information is not present. For this reason, clustering is a form of learning
by observation, rather than learning by examples.
In data mining, efforts have focused on finding methods for efficient and effective cluster
analysis in large databases. Active themes of research focus on the scalability of clustering
methods, the effectiveness of methods for clustering complex shapes (e.g., nonconvex) and
types of data (e.g., text, graphs, and images), high-dimensional clustering techniques (e.g.,
clustering objects with thousands of features), and methods for clustering mixed numerical
and nominal data in large databases. (Jiawei Han, 2012)
▪ Marketing: Help marketers discover distinct groups in their customer bases, and then
use this knowledge to develop targeted marketing programs
▪ Land use: Identification of areas of similar land use in an earth observation database
▪ Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
▪ City-planning: Identifying groups of houses according to their house type, value, and
geographical location
▪ Earthquake studies: Observed earthquake epicenters should be clustered along
continental faults
▪ Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing.
▪ In business, clustering can help marketers discover distinct groups in their customer
bases and characterize customer groups based on purchasing patterns.
▪ In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
▪ Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
▪ Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
▪ Clustering can also be used for outlier detection, Applications of outlier detection
include the detection of credit card fraud and the monitoring of criminal activities in
electronic commerce.
Clustering can be used to organize the search results into groups and present the results in a
concise and easily accessible way. Clustering techniques have been developed to cluster
documents into topics, which are commonly used in information retrieval practice.
a. Scalability:
Many clustering algorithms work well on small data sets containing only a few
hundred data objects; a large database, however, may contain millions of objects,
particularly in a Web search scenario.
Clustering on only a sample of a given large data set may lead to biased results.
Therefore, highly scalable clustering algorithms are needed.
g. Capability of clustering high-dimensionality data:
A data set can contain numerous dimensions or attributes. When clustering
documents, for example, each keyword can be regarded as a dimension, and there
are often thousands of keywords. Most clustering algorithms are good at handling
low-dimensional data such as data sets involving only two or three dimensions.
Finding clusters of data objects in a high dimensional space is challenging,
especially considering that such data can be very sparse and highly skewed.
h. Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of
constraints. Suppose that your job is to choose the locations for a given number of
new automatic teller machines (ATMs) in a city. To decide upon this, you may
cluster households while considering constraints such as the city’s rivers and
highway networks and the types and number of customers per cluster. A
challenging task is to find data groups with good clustering behavior that satisfy
specified constraints.
For example, in text mining, we may want to organize a corpus of documents into
multiple general topics, such as “politics” and “sports,” each of which may have
subtopics, For instance, “football,” “basketball,” “baseball,” and “hockey” can exist
as subtopics of “sports.” The latter four subtopics are at a lower level in the
hierarchy than “sports.”
k. Separation of clusters:
Some methods partition data objects into mutually exclusive clusters. When
clustering customers into groups so that each group is taken care of by one
manager, each customer may belong to only one group. In some other situations,
the clusters may not be exclusive, that is, a data object may belong to more than one
cluster. For example, when clustering documents into topics, a document may be
related to multiple topics. Thus, the topics as clusters may not be exclusive.
l. Similarity measure:
Some methods determine the similarity between two objects by the distance
between them. Such a distance can be defined on Euclidean space, a road network,
a vector space, or any other space. In other methods, the similarity may be defined
by connectivity based on density or contiguity, and may not rely on the absolute
distance between two objects. Similarity measures play a fundamental role in the
design of clustering methods. While distance-based methods can often take
advantage of optimization techniques, density- and continuity-based methods can
often find clusters of arbitrary shape.
m. Clustering space:
Many of the clustering methods search for clusters within the entire given data
space. These methods are useful for low-dimensionality data sets. With high
dimensional data, however, there can be many irrelevant attributes, which can make
similarity measurements unreliable. Consequently, clusters found in the full space
are often meaningless. It’s often better to instead search for clusters within different
subspaces of the same data set. Subspace clustering discovers clusters and
subspaces (often of low dimensionality) that manifest object similarity.
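The distance-based similarity measures discussed under item l can be sketched
minimally in Python. This is an illustrative example, not from the source; the
function name is a hypothetical helper:

```python
import math

def euclidean(p, q):
    """Distance between two objects in Euclidean space:
    closer objects are treated as more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # → 5.0
```

Other spaces (a road network, a vector space) only require swapping in a different
distance function with the same signature.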
Table 6.1: Overview of clustering methods
6.3.5 Model Based Method
Model-based methods hypothesize a model for each of the clusters and find the
best fit of the data to the given model. A model-based algorithm may locate
clusters by constructing a density function that reflects the spatial distribution of
the data points. It also leads to a way of automatically determining the number
of clusters based on standard statistics, taking “noise” or outliers into account
and thus yielding robust clustering methods. (VSSUT)
The quality of a partitioning is typically measured by the within-cluster sum of
squared error:
E = Σ_{i=1..k} Σ_{p ∈ Ci} |p − mi|²
where
E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and
mi is the mean of cluster Ci.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
Select any k objects from D as the initial cluster centers;
repeat
i. (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
ii. update the cluster means, that is, calculate the mean value of the
objects for each cluster;
until no change;
Figure 6.1: Clustering of a set of objects using the k-means method (Jiawei Han, 2012)
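The steps above can be sketched in Python. This is a minimal illustration of the
k-means (Lloyd's) procedure, not code from the source; all names are illustrative:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Mean value of the objects in a cluster."""
    n = len(cluster)
    return tuple(sum(x) / n for x in zip(*cluster))

def kmeans(points, k, max_iter=100, seed=0):
    """Partition `points` into k clusters by iterative relocation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # select k initial cluster centers
    for _ in range(max_iter):
        # (re)assign each object to the cluster with the nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[i].append(p)
        # update the cluster means
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:           # until no change
            break
        centers = new_centers
    return clusters, centers
```

Running it on six points forming two well-separated groups recovers the two groups
regardless of which initial centers are sampled.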
6.4.2 k-Medoids Method:
The k-means algorithm is sensitive to outliers: an object with an extremely large
value can substantially distort the mean of a cluster. Instead of taking the mean
value of the objects in a cluster as
a reference point, we can pick actual objects to represent the clusters, using one
representative object per cluster. Each remaining object is clustered with the
representative object to which it is the most similar.
The partitioning is then done by minimizing the sum of the absolute error:
E = Σ_{j=1..k} Σ_{p ∈ Cj} |p − Oj|
where
E is the sum of the absolute error for all objects in the data set,
p is the point in space representing a given object in cluster Cj, and
Oj is the representative object of Cj.
Figure 6.2: Clustering of a set of objects using the k-medoids method (Jiawei Han, 2012)
Case 2: p currently belongs to representative object Oj. If Oj is replaced by
Orandom as a representative object and p is closest to Orandom, then p is
reassigned to Orandom.
Case 3: p currently belongs to representative object Oi, i ≠ j. If Oj is replaced by
Orandom as a representative object and p is still closest to Oi, then the
assignment does not change.
Case 4: p currently belongs to representative object Oi, i ≠ j. If Oj is replaced by
Orandom as a representative object and p is closest to Orandom, then p is
reassigned to Orandom.
The k-medoids method is more robust than k-means in the presence of noise and
outliers, because a medoid is less influenced by outliers or other extreme values
than a mean. However, its processing is more costly than the k-means method.
(Jiawei Han, 2012)
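The absolute-error criterion behind k-medoids can be illustrated with a short
sketch. This is a hypothetical helper, not the source's code; Manhattan distance is
chosen purely for illustration:

```python
def manhattan(p, q):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def absolute_error(points, medoids, dist):
    """E = sum of |p - Oj| over all objects: each point is charged its
    distance to the nearest representative object (medoid)."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

# A swap Oj -> Orandom is kept only if it lowers the total cost E.
points = [(1, 1), (2, 2), (10, 10)]
print(absolute_error(points, [(2, 2)], manhattan))            # → 18
print(absolute_error(points, [(2, 2), (10, 10)], manhattan))  # → 2
```

Because medoids are actual data objects and the error is absolute rather than
squared, a single extreme value distorts the result far less than it distorts a mean.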
The quality of a hierarchical clustering method suffers because once a step (a merge
or a split) is performed, it can never be undone; the method cannot backtrack to
correct an erroneous decision. Hierarchical clustering methods can be further
classified as either agglomerative or divisive, depending on whether the hierarchical
decomposition is formed in a bottom-up or top-down fashion.
6.5.1 Agglomerative hierarchical clustering:
This bottom-up strategy starts by placing each object in its own cluster and then
merges these atomic clusters into larger and larger clusters, until all of the
objects are in a single cluster or until certain termination conditions are
satisfied. Most hierarchical clustering methods belong to this category. They
differ only in their definition of inter-cluster similarity.
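As a minimal sketch of the bottom-up strategy, the following illustrative function
(not from the source) starts from singleton clusters and repeatedly merges the
closest pair under single-linkage similarity:

```python
def agglomerate(points, target_k, dist):
    """Bottom-up clustering: place each object in its own cluster, then
    merge the pair of clusters with the smallest single-link distance
    until target_k clusters remain (the termination condition)."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage: distance between the closest members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the two clusters
    return clusters
```

Replacing the `min` over member pairs with `max` or an average gives complete-
linkage or average-linkage variants: the methods differ only in that definition.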
It can be used in fraud detection, for example, by detecting unusual usage of credit
cards or telecommunication services. In addition, it is useful in customized marketing
for identifying the spending behaviour of customers with extremely low or extremely
high incomes, or in medical analysis for finding unusual responses to various medical
treatments.
Questions:
Unit – VII
Web Structure Mining
Objectives:
- To learn basic concepts of web structure mining, web links, social network,
crawlers and community discovery on internet
- To study the data mining over internet and web based searching
- To discover and analyze web usage pattern and new developments
- To create awareness towards marketing over social networking sites
Second, the number of out-links can be used to identify Web pages that act as hubs,
where a hub is one or a set of Web pages that point to many authoritative pages of the
same topic
f. Object reconciliation.
In object reconciliation, the task is to predict whether two objects are the same based on
their attributes and links. This task is common in information extraction, duplication
elimination, object consolidation, and citation matching, and is also known as record
linkage or identity uncertainty.
Examples include predicting whether two websites are mirrors of each other, whether
two citations actually refer to the same paper, and whether two apparent disease strains
are really the same.
g. Group detection.
Group detection is a clustering task. It predicts when a set of objects belong to the same
group or cluster, based on their attributes as well as their link structure. An area of
application is the identification of Web communities, where a Web community is a
collection of Web pages that focus on a particular theme or topic. A similar example in
the bibliographic domain is the identification of research communities.
h. Sub graph detection.
Sub graph identification finds characteristic sub graphs within networks. An example
from biology is the discovery of sub graphs corresponding to protein structures. In
chemistry, we can search for sub graphs representing chemical substructures.
i. Metadata mining.
Metadata are data about data. Metadata provide semi-structured data about unstructured
data, ranging from text and Web data to multimedia data-bases. It is useful for data
integration tasks in many domains.
Social network analysis is useful for the Web because the Web is essentially a virtual
society, where each page can be regarded as a social actor and each hyperlink as a
relationship.
Many of the results from social networks can be adapted and extended for use in the
Web context. The ideas from social network analysis are indeed instrumental to the success
of Web search engines. (Liu, 2011)
When a publication (also called a paper) cites another publication, a relationship is
established between the publications. Citation analysis uses these relationships (links) to
perform various types of analysis. A citation can represent many types of links, such as
links between authors, publications, journals and conferences, and fields, or even between
countries. (Liu, 2011). When a research scholar publishes his work globally the reader can
cite the publication which establishes a link to the author and it improves the bibliography.
7.2.1 Co-Citation
Co-citation is used to measure the similarity of two papers (or publications). If papers i
and j are both cited by paper k, then they may be said to be related in some sense to
each other, even though they do not directly cite each other. Figure. 7.1 shows that
papers i and j are co-cited by paper k. If papers i and j are cited together by many
papers, it means that i and j have a strong relationship or similarity. The more papers
they are cited by, the stronger their relationship is. (Liu, 2011)
Fig. 7.1: Paper i and paper j are co-cited by paper k (Liu, 2011)
Fig. 7.2: Both paper i and paper j cite paper k (Liu, 2011)
Bibliographic coupling is also symmetric and is regarded as a similarity measure of
two papers in clustering. On the Web, the hubs and authorities found by the HITS
algorithm are directly related to the co-citation and bibliographic coupling matrices.
(Liu, 2011)
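Both measures can be computed directly from the citation adjacency matrix L, where
L[i][j] = 1 if paper i cites paper j: co-citation counts are the entries of LᵀL,
and bibliographic coupling counts are the entries of LLᵀ. A small sketch (the data
is invented for illustration):

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# L[i][j] = 1 if paper i cites paper j
L = [
    [0, 0, 1, 0],  # paper 0 cites paper 2
    [0, 0, 1, 0],  # paper 1 cites paper 2 -> papers 0, 1 are coupled
    [0, 0, 0, 0],
    [1, 1, 0, 0],  # paper 3 cites papers 0 and 1 -> papers 0, 1 are co-cited
]

cocitation = matmul(transpose(L), L)  # (L^T L)[i][j]: papers citing both i and j
coupling   = matmul(L, transpose(L))  # (L L^T)[i][j]: papers cited by both i and j
print(cocitation[0][1], coupling[0][1])  # → 1 1
```

The diagonal of LᵀL gives each paper's citation count, and the diagonal of LLᵀ
gives the length of each paper's reference list.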
An authority is a page with many in-links. The idea is that the page may have authoritative
content on some topic and thus many people trust it and thus link to it.
A hub is a page with many out-links. The page serves as an organizer of the information
on a particular topic and points to many good authority pages on the topic. When a user
comes to this hub page, he/she will find many useful links which take him/her to good
content pages on the topic. Figure 7.3 shows an authority page and a hub page.
The key idea of HITS is that a good hub points to many good authorities and a good
authority is pointed to by many good hubs. Thus, authorities and hubs have a mutual
reinforcement relationship. (Liu, 2011)
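The mutual reinforcement between hubs and authorities can be sketched as a simple
iterative computation, in the spirit of HITS. This is an illustrative sketch, not a
faithful reproduction of the published algorithm (it omits the root/base set
construction):

```python
def normalize(v):
    s = sum(v) or 1.0
    return [x / s for x in v]

def hits(links, n, iters=50):
    """links: list of (i, j) pairs meaning page i links to page j.
    Returns (authority, hub) score lists, each normalized to sum to 1."""
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iters):
        # a good authority is pointed to by many good hubs
        new_auth = [0.0] * n
        for i, j in links:
            new_auth[j] += hub[i]
        # a good hub points to many good authorities
        new_hub = [0.0] * n
        for i, j in links:
            new_hub[i] += new_auth[j]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub
```

On a toy graph where pages 0 and 3 both link to pages 1 and 2, the iteration
assigns high hub scores to 0 and 3 and high authority scores to 1 and 2.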
dense bipartite sub-graphs, which represent strong relationships between groups. Apart
from the Web, communities also exist in emails and text documents. (Liu, 2011)
There are many reasons for discovering communities. Following are three main reasons:
(Kumar, 1993)
▪ Communities provide valuable and possibly the most reliable, timely, and up-to-
date information resources for a user interested in them.
▪ They represent the sociology of the Web: studying them gives insights into the
evolution of the Web.
▪ They enable target advertising at a very precise level.
Community Manifestation in Data: Given a data set, which can be a set of Web pages, a
collection of emails, or a set of text documents, we want to find communities of entities in
the data. The system needs to discover the hidden community structures. The first issue
that we need to know is how communities manifest themselves. From such manifested
evidences, the system can discover possible communities. Different types of data may have
different forms of manifestation. We give three examples. (Liu, 2011)
a. Web Pages:
▪ Hyperlinks: A group of content creators sharing a common interest is usually inter-
connected through hyperlinks. That is, members in a community are more likely to
be connected among themselves than outside the community.
▪ Content words: Web pages of a community contain words that are related to the
community theme.
b. Emails:
▪ Email exchange between entities: Members of a community are more likely to
communicate with one another.
▪ Content words: Email contents of a community also contain words related to the
theme of the community.
c. Text documents:
▪ Co-occurrence of entities: Members of a community are more likely to appear
together in the same sentence and/or the same document.
▪ Content words: Words in sentences indicate the community theme.
▪ Another use is to monitor Web sites and pages of interest, so that a user or
community can be notified when new information appears in certain places.
▪ There are also malicious applications of crawlers, for example, that harvest email
addresses to be used by spammers or collect personal information to be used in
phishing and other identity theft attacks.
In fact, crawlers are the main consumers of Internet bandwidth. They collect pages for
search engines to build their indexes. Well known search engines such as Google, Yahoo!
and MSN run very efficient universal crawlers designed to gather all pages irrespective of
their content. Other crawlers, sometimes called preferential crawlers, are more targeted.
They attempt to download only pages of certain types or topics. (Liu, 2011)
a. The frontier is the main data structure, which contains the URLs of unvisited pages.
b. Typical crawlers attempt to store the frontier in the main memory for efficiency. Based
on the declining price of memory and the spread of 64-bit processors, quite a large
frontier size is feasible.
c. Yet the crawler designer must decide which URLs have low priority and thus get
discarded when the frontier is filled up.
d. Note that given some maximum size, the frontier will fill up quickly due to the high
fan-out of pages.
e. Even more importantly, the crawler algorithm must specify the order in which new
URLs are extracted from the frontier to be visited.
These mechanisms determine the graph search algorithm implemented by the crawler.
Figure 7.5 shows the flow of a basic sequential crawler. Such a crawler fetches one page at
a time, making inefficient use of its resources. (Liu, 2011)
Fig. 7.5: Flow chart of a basic sequential crawler. (Liu, 2011)
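The frontier mechanics described above can be sketched as follows. Network fetching
is replaced here by a hard-coded link graph so the sketch stays self-contained; all
names are illustrative:

```python
from collections import deque

# Simulated Web: URL -> list of out-links (a stand-in for real fetching)
WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def crawl(seed, max_pages=10):
    """Basic sequential crawler: pop a URL from the frontier, 'fetch' the
    page, extract its links, and add unseen URLs back to the frontier."""
    frontier = deque([seed])          # URLs of unvisited pages
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()      # FIFO order -> breadth-first crawl
        visited.append(url)           # a real crawler would fetch here
        for link in WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl("a"))  # → ['a', 'b', 'c', 'd']
```

Swapping the FIFO queue for a priority queue keyed on an estimated page score turns
this breadth-first crawler into a preferential one; capping the queue length forces
the discard decisions discussed above.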
Even without the luxury of a text classifier, a topical crawler can be smart about
preferentially exploring regions of the Web that appear relevant to the target topic by
comparing features collected from visited pages with cues in the topic description.
To illustrate a topical crawler with its advantages and limitations, let us consider the
MySpiders applet ([Link]). Figure 8.6 shows a screenshot of
this application.
In this example the user has launched a population of crawlers with the query “search
censorship in france” using the InfoSpiders algorithm. The crawler reports some seed pages
obtained from a search engine, but also a relevant blog page (bottom left) that was not
returned by the search engine. This page was found by one of the agents, called Spider2,
crawling autonomously from one of the seeds. We can see that Spider2 spawned a new
agent, Spider13, who started crawling for pages also containing the term “italy.” Another
agent, Spider5, spawned two agents one of which, Spider11, identified and internalized the
relevant term “engine.”
A crawling task can thus be viewed as a constrained multi-objective search problem. The
wide variety of objective functions, coupled with the lack of appropriate knowledge about
the search space, make such a problem challenging. In the remainder of this section we
briefly discuss the theoretical conditions necessary for topical crawlers to function, and the
empirical evidence supporting the existence of such conditions. Then we review some of
the machine learning techniques that have been successfully applied to identify and exploit
useful cues for topical crawlers. (Liu, 2011)
7.10 Evaluation
Given the goal of building a “good” crawler, a critical question is how to evaluate crawlers
so that one can reliably compare two crawling algorithms and conclude that one is “better”
than the other. In principle, a crawler should be evaluated through the application it
supports: its performance depends on the crawling algorithm used and on the indexing
scheme of the application. Early crawler evaluations compared crawling algorithms on
a limited number of queries/tasks without considering statistical significance. Since
then, the Web crawling field has developed well-defined performance measures for
evaluating and comparing disparate crawling strategies on common tasks. (Liu, 2011)
The evaluation of crawlers requires a sufficient number of crawl runs over different
topics, as well as sound methodologies that consider the temporal nature of crawler
outputs. A further difficulty is the general unavailability of relevance sets for
particular topics or queries. Meaningful experiments involving real users to assess
the relevance of pages as they are crawled are extremely problematic: crawls against
the live Web pose serious time constraints and would be overly burdensome to the
subjects.
To circumvent these problems, crawler evaluation typically relies on defining measures for
automatically estimating page relevance and quality. (Liu, 2011)
a. A page may be considered relevant if it contains some or all of the keywords in the
topic/query.
b. The frequency with which the keywords appear on the page may also be considered
(Cho, 1998).
c. While the topic of interest to the user is often expressed as a short query, a longer
description may be available in some cases.
d. Similarity between the short or long description and each crawled page may be used
to judge the page's relevance.
e. The pages used as the crawl's seed URLs may be combined together into a single
document, and the cosine similarity between this document and a crawled page may
serve as the page’s relevance score.
f. A classifier may be trained to identify relevant pages. The training may be done
using seed pages or other pre-specified relevant pages as positive examples. The
trained classifier then provides boolean or continuous relevance scores for each of
the crawled pages. Note that if the same classifier, or a classifier trained on the
same labeled examples, is used both to guide a (focused) crawler and to evaluate it,
the evaluation is not unbiased. Clearly the evaluating classifier would be biased in
favor of crawled pages. To partially address this issue, an evaluation classifier may
be trained on a different set than the crawling classifier. Ideally the training sets
should be disjoint. At a minimum the training set used for evaluation must be
extended with examples not available to the crawler (Pant G. a., 2005).
g. Another set of evaluation criteria can be obtained by scaling or normalizing any of
the above performance measures by the critical resources used by a crawler. This
way, one can compare crawling algorithms by way of performance/cost analysis.
h. For example, with limited network bandwidth one may see latency as a major
bottleneck for a crawling task. The time spent by a crawler on network I/O can be
monitored and applied as a scaling factor to normalize precision or recall.
Using such a measure, a crawler designed to preferentially visit short pages, or pages from
fast servers, would outperform one that can locate pages of equal or even better quality but
less efficiently. (Liu, 2011)
Another requirement is to disclose the nature of the crawler using the User-Agent
HTTP header. The value of this header should include not only a name and version
number of the crawler, but also a pointer to where Web administrators may find
information about the crawler. A Web site is created for this purpose and its URL is
included in the User-Agent field. Another piece of useful information is an email
contact, to be specified in the From header.
Finally, crawler etiquette requires compliance with the Robot Exclusion Protocol.
This is a de facto standard providing a way for Web server administrators to
communicate which files may not be accessed by a crawler. The file provides access
policies for different crawlers, identified by the User-agent field. When discussing the
interactions between information providers and search engines or other applications
that rely on Web crawlers, confusion sometime arises between the ethical, technical,
and legal ramifications of the Robot Exclusion Protocol. Compliance with the protocol
is an ethical issue, and non-compliant crawlers can justifiably be shunned by the Web
community. Crawlers may disguise themselves as browsers by sending a browser's
identifying string in the User-Agent header. This way a server administrator may not
immediately detect lack of compliance with the Exclusion Protocol, but an aggressive
request profile is likely to reveal the true nature of the crawler.
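Compliance with the Robot Exclusion Protocol is straightforward to implement; the
Python standard library ships a parser for it. The robots.txt content below is
invented for illustration (a real crawler would fetch it from the server):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([                      # parse robots.txt lines fetched elsewhere
    "User-agent: BadBot",
    "Disallow: /",              # this crawler is banned entirely
    "",
    "User-agent: *",
    "Disallow: /private/",      # default policy for all other crawlers
])

print(rp.can_fetch("BadBot", "/index.html"))       # → False
print(rp.can_fetch("GoodBot", "/index.html"))      # → True
print(rp.can_fetch("GoodBot", "/private/x.html"))  # → False
```

A well-behaved crawler checks `can_fetch` with its own honest User-Agent string
before requesting any URL on the site.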
Web administrators may attempt to improve the ranking of their pages in a search
engine by providing different content depending on whether a request originates from a
browser or a search engine crawler, as determined by inspecting the request's User-
Agent header. This technique, called cloaking, is frowned upon by search engines,
which remove sites from their indices when such abuses are detected.
One of the most serious challenges for crawlers originates from the rising popularity of
pay-per-click advertising. If a crawler is not to follow advertising links, it needs to have
a robust detection algorithm to discriminate ads from other links. A bad crawler may
also pretend to be a genuine user who clicks on the advertising links in order to collect
more money from merchants for the hosts of advertising links.
Interestingly, the gap between humans and crawlers may be narrowing from both sides.
While crawlers become smarter, some humans are dumbing down their content to make
it more accessible to crawlers.
For example, some online news providers use simpler titles than can be easily classified
and interpreted by a crawler as opposed or in addition to witty titles that can only be
understood by humans.
A business may wish to disallow crawlers from its site if it provides a service by which
it wants to entice human users to visit the site, say to make a profit via ads on the site.
A competitor crawling the information and mirroring it on its own site, with different
ads, is a clear violator not only of the Robot Exclusion Protocol but also possibly of
copyright law.
There are many browser extensions that allow users to perform all kinds of tasks that
deviate from the classic browsing activity, including hiding ads, altering the appearance
and content of pages, adding and deleting links, adding functionality to pages, pre-
fetching pages, and so on.
- What about an individual user who wants to access the information but
automatically hide the ads?
- Should they identify themselves through the User-Agent header as distinct from the
browser with which they are integrated?
- Should a server be allowed to exclude them? And should they comply with such
exclusion policies?
These too are questions about ethical crawler behaviors that remain open for the
moment. (Liu, 2011)
Users are consumers of information provided by search engines, search engines are
consumers of information provided by crawlers, and crawlers are consumers of
information provided by users (authors). This one-directional loop does not allow, for
example, information to flow from a search engine (say, the queries submitted by users)
to a crawler. It is likely that commercial search engines will soon leverage the huge
amounts of data collected from their users to focus their crawlers on the topics most
important to the searching public. To investigate this idea in the context of a vertical
search engine, a system was built in which the crawler and the search engine engage in
a symbiotic relationship (Pant G. S., 2004). The crawler feeds the search engine which
in turn helps the crawler. It was found that such a symbiosis can help the system learn
about a community's interests and serve such a community with better focus.
Universal crawlers have to somehow focus on the most “important” pages given the
impossibility to cover the entire Web and keep a fresh index of it. This has led to the
use of global prestige measures such as PageRank to bias universal crawlers, either
explicitly or implicitly through the long-tailed structure of the Web graph. An
important problem with these approaches is that the focus is dictated by popularity
among “average” users and disregards the heterogeneity of user interests. A page about
a mathematical theorem may appear quite uninteresting to the average user, if one
compares it to a page about a pop star using indegree or PageRank as a popularity
measure. Yet the math page may be highly relevant and important to a small
community of users (mathematicians). Future crawlers will have to learn to
discriminate between low-quality pages and high-quality pages that are relevant to very
small communities. (Liu, 2011)
Social networks have recently received much attention among Web users as vehicles to
capture commonalities of interests and to share relevant information. We are witnessing
an explosion of social and collaborative engines in which user recommendations,
opinions, and annotations are aggregated and shared. Mechanisms include tagging (e.g.,
[Link] and [Link]), ratings (e.g. [Link]), voting (e.g., [Link]), and
hierarchical similarity ([Link]). One key advantage of social systems is that
they empower humans rather than depending on crawlers to discover relevant
resources. Further, the aggregation of user recommendations gives rise to a natural
notion of trust. Crawlers could be designed to expand the utility of information
collected through social systems. (Liu, 2011)
Social networks can emerge not only by mining a central repository of user-provided
resources, but also by connecting hosts associated with individual users or communities
scattered across the Internet. Imagine a user creating its own micro-search engine by
employing a personalized topical crawler, seeded for example with a set of bookmarked
pages.
The integration of effective personalized/topical crawlers with adaptive query routing
algorithms is the key to the success of peer-based social search systems. Many
synergies may be exploited in this integration by leveraging contextual information
about the local peer that is readily available to the crawler, as well as information about
the peer's neighbors that can be mined through the stream of queries and results routed
through the local peer. (Liu, 2011)
Web usage mining process can be divided into three inter-dependent stages:
- data collection and pre-processing,
- pattern discovery, and
- pattern analysis.
In the pre-processing stage, the clickstream data is cleaned and partitioned into a set of
user transactions representing the activities of each user during different visits to the
site. The site content or structure, as well as semantic domain knowledge from site
ontologies (such as product catalogs or concept hierarchies), may also be used in pre-
processing or to enhance user transaction data.
In the pattern discovery stage, statistical, database, and machine learning operations are
performed to obtain hidden patterns. These patterns reflect the typical behavior of
users, as well as summary statistics on Web resources, sessions, and users.
In the final stage of the process, the discovered patterns and statistics are further
processed and filtered, possibly resulting in aggregate user models that can be used as input
to applications such as recommendation engines, visualization tools, and Web analytics
and report generation tools. The overall process is depicted in Figure 7.8.
Fig. 7.8: The Web usage mining process (Liu, 2011)
The data preparation process is often the most time consuming and computationally
intensive step in the Web usage mining process, and often requires the use of special
algorithms and heuristics not commonly employed in other domains.
This process is critical to the successful extraction of useful patterns from the data. The
process may involve pre-processing the original data, integrating data from multiple
sources, and transforming the integrated data into a form suitable for input into specific
data mining operations. Collectively, we refer to this process as data preparation.
Usage data preparation presents a number of unique challenges which have led to a
variety of algorithms and heuristic techniques for pre-processing tasks such as data
fusion and cleaning, user and session identification, and page view identification.
The successful application of data mining techniques to Web usage data is highly
dependent on the correct application of the pre-processing tasks. In the context of e-
commerce data analysis, these techniques have been extended to allow for the
discovery of important and insightful user and site metrics.
Fig. 7.9: Steps in data preparation for Web usage mining (Liu, 2011)
Figure 7.9 provides a summary of the primary tasks and elements in usage data pre-
processing. We begin by providing a summary of data types commonly used in Web
usage mining and then provide a brief discussion of some of the primary data
preparation tasks. (Liu, 2011)
Page views are semantically meaningful entities to which mining tasks are applied
(such as pages or products). Given a set of page views P = {p1, p2, ..., pn}, a user
transaction t of length l can be represented as

t = <(p_1^t, w(p_1^t)), (p_2^t, w(p_2^t)), ..., (p_l^t, w(p_l^t))>

where each p_i^t = p_j for some j in {1, 2, ..., n}, and w(p_i^t) is the weight associated with
page view p_i^t in transaction t, representing its significance.
The weights can be determined in a number of ways, in part based on the type of
analysis or the intended personalization framework.
When weights are based on the time a user spends on a page, the duration of the last
page view in a session cannot be measured from the log. One commonly used option is
therefore to set the weight of the last page view to the mean duration of that page, taken
across all sessions in which the page view does not occur as the last one. It is common
to use a normalized value of page duration instead of the raw duration in order to
account for user variances. In some applications, the log of the page view duration is
used as the weight to reduce noise in the data.
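A small sketch of duration-based weighting under these conventions follows. The toy sessions and function names are hypothetical; the point is that the unknown duration of each session's last page view is replaced by that page's mean duration across sessions where it is not last, as described above:

```python
import math

# Toy data: each session is a list of (page, seconds_spent) pairs; the last
# page view's duration cannot be measured from the log, so it is None.
sessions = [
    [("/home", 30.0), ("/products", 120.0), ("/checkout", None)],
    [("/checkout", 45.0), ("/home", None)],
]

def mean_duration(page, sessions):
    """Mean time on `page` over sessions where it is NOT the last page view."""
    ds = [d for s in sessions for p, d in s[:-1] if p == page]
    return sum(ds) / len(ds) if ds else 0.0

def weights(session, sessions, use_log=False):
    """Duration-based weights; the last page view gets the cross-session mean."""
    out = []
    for page, dur in session:
        if dur is None:                       # last page view in the session
            dur = mean_duration(page, sessions)
        # Optional log transform to reduce noise, as mentioned in the text.
        w = math.log(1 + dur) if use_log else dur
        out.append((page, w))
    return out

w = weights(sessions[0], sessions)
# "/checkout" is last in this session, so its weight is its mean duration
# (45.0) taken from the session where it does not occur last.
```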
For many data mining tasks, such as clustering and association rule mining, where the
ordering of page views in a transaction is not relevant, we can represent each user
transaction as a vector over the n-dimensional space of page views.
Fig. 7.10: An example of a user-page view matrix (or transaction matrix) (Liu, 2011)
Given the transaction t above, the transaction vector t (we use a bold face lower case
letter to represent a vector) is given by:

t = (w_{p1}^t, w_{p2}^t, ..., w_{pn}^t)

where each w_{pj}^t = w(p_i^t) for some i in {1, 2, ..., l} if p_j appears in the transaction t,
and w_{pj}^t = 0 otherwise.
Thus, conceptually, the set of all user transactions can be viewed as an m×n user-page
view matrix (also called the transaction matrix), denoted by UPM.
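As a concrete illustration, a handful of transactions with binary weights can be turned into such a matrix. The page names and transactions here are invented for the example:

```python
# Fixed page view space P = {p1, ..., pn} and a few user transactions,
# here with binary weights (1 if the page view occurs in the transaction).
pages = ["/home", "/products", "/cart", "/checkout"]
transactions = [
    ["/home", "/products"],
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/cart"],
]

# Each transaction becomes an n-dimensional vector over the page view space;
# stacking the m vectors row-wise gives the m x n user-page view matrix UPM.
UPM = [[1 if p in t else 0 for p in pages] for t in transactions]
# UPM == [[1, 1, 0, 0], [1, 1, 1, 1], [1, 0, 1, 0]]
```

With duration-based weights, the 1s would simply be replaced by the (normalized) time spent on each page view.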
Given a set of transactions in the user-page view matrix as described above, a variety of
unsupervised learning techniques can be applied to obtain patterns. Techniques such as
clustering of transactions (or sessions) can lead to the discovery of important user or
visitor segments.
Other techniques such as item (e.g., page view) clustering and association or sequential
pattern mining can find important relationships among items based on the navigational
patterns of users in the site.
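Transaction clustering of this kind can be sketched with a plain k-means over binary transaction vectors. The deterministic initialization and tiny matrix below are simplifications for the sketch, not a production clustering setup:

```python
def kmeans(vectors, k, iters=20):
    """Plain k-means on transaction vectors (squared Euclidean distance)."""
    # Deterministic init for the sketch: first k distinct vectors as centroids.
    centroids = []
    for v in vectors:
        if v not in centroids:
            centroids.append(list(v))
        if len(centroids) == k:
            break
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[d.index(min(d))].append(v)
        # Update step: recompute each centroid as the cluster mean.
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster becomes empty
                centroids[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return clusters, centroids

# Two obvious visitor segments: "browsers" vs. "buyers".
upm = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 1, 1, 1], [0, 1, 1, 1]]
clusters, centroids = kmeans(upm, k=2)
```

Each resulting cluster of transaction vectors corresponds to a candidate visitor segment; in practice a library implementation (and a better initialization) would be used.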
demographics in order to perform market segmentation in e-commerce
applications or provide personalized Web content to the users with similar
interests.
iv. Analysis of user groups based on their demographic attributes (e.g., age, gender,
income level, etc.) may lead to the discovery of valuable business intelligence.
v. Usage-based clustering has also been used to create Web-based “user
communities” reflecting similar interests of groups of users.
Given the mapping of user transactions into a multi-dimensional space as
vectors of page views, standard clustering algorithms, such as k-means, can
partition this space into groups of transactions that are close to each other based
on a measure of distance or similarity among the vectors. Transaction clusters
can represent user or visitor segments based on their navigational behavior or
other attributes that have been captured in the transaction file. Each transaction
cluster may potentially contain thousands of user transactions involving
hundreds of page view references. The ultimate goal in clustering user
transactions is to provide the ability to analyze each segment for deriving
business intelligence, or to use them for tasks such as personalization.
vi. Another approach to creating an aggregate view of each cluster is to compute
the centroid (or the mean vector) of each cluster. The dimension value for each
page view in the mean vector is computed by finding the ratio of the sum of the
page view weights across transactions to the total number of transactions in the
cluster. If page view weights in the original transactions are binary, then the
dimension value of a page view p in a cluster centroid represents the percentage
of transactions in the cluster in which p occurs. Thus, the centroid dimension
value of p provides a measure of its significance in the cluster. Page views in
the centroid can be sorted according to these weights and lower weight page
views can be filtered out. The resulting set of page view-weight pairs can be
viewed as an “aggregate usage profile” representing the interests or behavior of
a significant group of users. (Liu, 2011)
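The centroid computation and profile filtering described in (vi) can be sketched as follows. The cluster data and the 0.5 weight threshold are arbitrary illustrative choices:

```python
# A cluster of binary user transactions over four page views.
pages = ["/home", "/products", "/cart", "/specials"]
cluster = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]

# Centroid (mean vector): with binary weights, each dimension is the
# fraction of transactions in the cluster containing that page view.
n = len(cluster)
centroid = [sum(col) / n for col in zip(*cluster)]

# Aggregate usage profile: page view/weight pairs sorted by weight, with
# low-weight page views filtered out (threshold 0.5 is an assumption).
profile = sorted(
    ((p, w) for p, w in zip(pages, centroid) if w >= 0.5),
    key=lambda pw: -pw[1],
)
# [('/products', 1.0), ('/home', 0.75), ('/cart', 0.75)]
```

Here "/products" appears in every transaction of the cluster, so its centroid weight is 1.0, while "/specials" (0.25) falls below the threshold and is dropped from the profile.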
For example, if a site does not provide direct linkage between two pages A and
B, the discovery of a rule, A → B, would indicate that providing a direct
hyperlink from A to B might aid users in finding the intended information. Both
association analysis (among products or page views) and statistical correlation
analysis (generally among customers or visitors) have been used successfully in
Web personalization and recommender systems. (Liu, 2011)
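A minimal sketch of the support and confidence computation behind such a rule A → B follows (toy transactions only; a full association miner such as Apriori would first enumerate frequent itemsets):

```python
# Toy page view transactions; suppose pages A and B have no direct
# hyperlink between them on the site.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def rule_stats(antecedent, consequent, transactions):
    """Support and confidence of the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / n, (both / ante if ante else 0.0)

support, confidence = rule_stats({"A"}, {"B"}, transactions)
# support = 3/5 = 0.6, confidence = 3/4 = 0.75: A -> B holds in most
# sessions containing A, suggesting a direct link from A to B may help users.
```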
Questions:
Bibliography
Web Resources:
Büchner, A. G., & Mulvenna, M. D. (1998). Discovering internet marketing intelligence through online
analytical web usage mining. ACM SIGMOD Record, 27(4), 54-61.
Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer
Networks, 161-172.
Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques (3rd ed.). USA: Elsevier.
JNTUH-R13. (n.d.). Data Warehousing and Data Mining. Retrieved from [Link]
Liu, B. (2011). Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (2nd ed.). Berlin:
Springer.
Tanasa, D., & Trousse, B. (2005). Advanced data preprocessing for intersites Web usage mining.
IEEE Intelligent Systems, 59-65.
VSSUT, Burla. (n.d.). Lecture Notes on Data Mining and Data Warehousing. Burla: Dept. of CSE & IT,
VSSUT.