Unit – I

Data Warehousing

Objectives:
- To learn basic concepts of data warehouse and its architecture, model
- To study data transformation process of data warehouse
- To study data cube structure and its representation, schema model
- To study the different retrieval operations from a data cube

1.0 Introduction
Data warehousing provides architectures and tools that help decision makers execute their plans based on data collected from different sources. A data warehouse is a central store into which various types of data are loaded and which is accessed mainly by knowledge workers. Data in organizations is growing rapidly: organizations typically collect diverse kinds of data and maintain it in large data stores fed by many clustered information sources. Gathering this data and making it efficiently accessible at one site brings challenges and complexities, and considerable industrial effort and ongoing research are devoted to storing and integrating historical information. A warehouse maintains data in a multidimensional structure, so query writing against it is more involved; nevertheless, retrieval is the central activity, and the analysis of historical data leads knowledge workers to make sound decisions for the growth of the organization.

1.1 Definition
According to William H. Inmon, a leading architect in the construction of data warehouse
systems, “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process” (Inmon, 1996).

The four keywords (subject-oriented, integrated, time-variant, and nonvolatile) differentiate data warehouses from traditional data store systems. (Inmon, 1996)

Subject-oriented:
A data warehouse is organized around major subjects such as customer, supplier, product, library, and sales. It provides a view of a particular subject rather than of the organization's ongoing operations, focusing on the modeling and analysis of the collected data.
Integrated:
A data warehouse is constructed by combining multiple heterogeneous data sources, including files, online data records, registers, and manuals. Before this data is loaded into the warehouse, data preprocessing is performed, applying data cleaning methods to ensure the data enters the warehouse in a consistent form.
Time-variant:
Data is stored from a historical perspective; a warehouse typically keeps data about the past five to ten years in analyzed (summarized) form. It does not maintain the records used in day-to-day operations.

Nonvolatile:
A data warehouse is a physically separate data store. Data is stored after preprocessing and remains permanently available, to be accessed as needed by knowledge workers. Because the warehouse store is separate from the local operational store, it does not require recovery or consistency-control mechanisms; it needs only two operations: loading the data and accessing the records.

Example of Data Warehouse


- Insurance Sector
- Retail Banking
- Telephone Customer Usage
- Farmer Loans
In the above examples, a data warehouse maintains the relevant data for roughly the past 5-10 years. The data is held in the form of data cubes (discussed in a later section); the warehouse maintains only analytically processed data.

1.2 ETL Process (Extraction, Transformation, and Loading)


A data warehouse system uses various back-end tools and utilities to support and control its functionality: they gather, update, and refresh both the existing data and newly added data. This is depicted in the data warehouse architecture (Figure 1.1).

Following are the basic functions to be carried in data warehouse:


Data extraction collects data from multiple heterogeneous and external sources.
Data cleaning detects errors in the data and rectifies them where possible, as part of the overall data preprocessing task.
Data transformation converts data from its legacy or host format to the warehouse format, so that a single uniform format is maintained; various data conversion techniques are applied here.
Load sorts, summarizes, consolidates, and computes views; checks integrity; and builds sequences, indexes, and clusters. This is the fundamental process that populates the warehouse: the data is also normalized and converted into data cubes.
Refresh propagates updates from the data sources to the warehouse.

Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems provide management tools that are useful to knowledge workers. Data cleaning and data transformation are important steps for improving data quality, and thus the quality of subsequent data mining results.

Figure 1.1: ETL Process (JNTUH-R13)

Example Illustration of ETL Process


Consider a retail store with different sub-departments such as Sales, Marketing, Purchase, Inventory, and Import/Export. It is very difficult for each department to manage customer data independently: each department may store customer details in its own specific format. The sales department may key customer records by cust_id, while the marketing department records customer details under cust_name. In this situation it is tedious to retrieve, say, the monthly sales information of a specific customer, and retrieving customer profiles for marketing purposes is equally difficult.

The solution to this kind of issue is a data warehouse, in which we can store the data from the different sources. Before being stored in the warehouse, the data must pass through the ETL operations: data is extracted from the sources (marketing, sales, purchase, etc.) and transformed into a single format with fields such as cust_id, cust_name, sales_amt, and other related attributes, giving a unified structure. After the ETL process the data can be used by any data warehouse or BI (Business Intelligence) tool to produce reports in different formats.

The ETL process helps companies analyse business data for devising marketing strategies and making decisions; it is the step that transforms data into the warehouse.
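The flow above can be sketched in a few lines of Python. This is a minimal illustration only: the department record layouts, the field names (cust_id, cust_name, sales_amt), and the sample rows are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical source data: each department keeps customers in its own format.
sales_rows = [{"cust_id": 101, "amt": 2500.0}, {"cust_id": 102, "amt": 900.0}]
marketing_rows = [{"cust_id": 101, "cust_name": "Amira"}, {"cust_id": 102, "cust_name": "Ravi"}]

def extract() -> Tuple[List[Dict], List[Dict]]:
    """Extract: collect raw rows from the heterogeneous sources."""
    return sales_rows, marketing_rows

def transform(sales: List[Dict], marketing: List[Dict]) -> List[Dict]:
    """Transform: merge on cust_id into one unified warehouse format."""
    names = {r["cust_id"]: r["cust_name"] for r in marketing}
    return [
        {"cust_id": r["cust_id"],
         "cust_name": names.get(r["cust_id"], "UNKNOWN"),  # simple cleaning fallback
         "sales_amt": r["amt"]}
        for r in sales
    ]

warehouse: List[Dict] = []

def load(rows: List[Dict]) -> None:
    """Load: append the cleaned, unified rows into the warehouse store."""
    warehouse.extend(rows)

load(transform(*extract()))
print(warehouse[0])  # unified record combining sales and marketing fields
```

Once the rows share one structure, any reporting or BI layer can query them uniformly.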

1.3 Differences Between Database Systems and Data Warehouses


▪ A data warehouse stores a large amount of data compared with an operational database.
▪ An operational database has a limited structure and access pattern, whereas a data warehouse has vast storage capacity and is maintained independently of the operational systems.
▪ The major task of online operational database systems is to perform online transaction and query processing. These systems are called online transaction processing (OLTP) systems. OLTP records data from day-to-day transactions arising in regular operations such as sales, banking, manufacturing, purchasing, selling, payroll, and accounting.
▪ Data warehouse systems serve experts and knowledge workers acting as data analysts, retrieving data efficiently to support decision making. Such systems are known as online analytical processing (OLAP) systems.

1.4 Difference between OLTP and OLAP


The major distinguishing features of OLTP and OLAP are summarized as follows:

Parameters        OLTP                                 OLAP
Users             clerk, IT professional, DBA          knowledge worker, manager, executive, analyst
Function          day-to-day operations                decision support, long-term operations
DB design         application-oriented, ER based,      subject-oriented, star/snowflake
                  current                              schemas, historical
Data              current, up-to-date, detailed,       historical, summarized, multidimensional,
                  flat relational, isolated            integrated, consolidated
Usage             repetitive                           ad hoc
Access            read/write, index/hash               lots of scans, mostly read
                  on primary key
Unit of work      short, simple transaction            complex query
No. of records    tens                                 millions
accessed
No. of users      thousands                            hundreds
DB size           100 MB-GB                            100 GB-TB
Metric            transaction throughput               query throughput, response time
View              detailed, flat relational            summarized, multidimensional
Table 1.0: Difference between OLTP and OLAP

1.5 Overview of Multidimensional Data Model


A multidimensional data model presents a multidimensional view of data. The data is heterogeneous and has several attributes. Such a model can be represented with the help of a data cube: a data cube represents multiple views of the data.

Figure 1.2: Cube – Multidimensional Model (Jiawei Han, 2012)

Data cubes provide fast access to summarized data, which is heavily used in the data mining process. A data cube is organized into levels, known as abstract views of the data: the lowest level is called the base cuboid, the highest level is called the apex cuboid, and the levels in between are intermediate cuboids. Data cubes created for varying levels of abstraction are referred to as cuboids, so that a "data cube" may in fact refer to a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size. (Jiawei Han, 2012)
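The size of this lattice is easy to compute. A short sketch, using the standard cuboid-count formula (the product over dimensions of L_i + 1, where L_i is the number of concept-hierarchy levels of dimension i, and the extra "+1" accounts for aggregating that dimension away entirely); the example level counts are assumptions:

```python
from math import prod

def num_cuboids(levels_per_dim):
    """Total cuboids in the lattice: product of (L_i + 1) over all
    dimensions, where L_i is the number of hierarchy levels of
    dimension i and the '+1' stands for fully aggregating it away."""
    return prod(levels + 1 for levels in levels_per_dim)

# With no hierarchies (one level per dimension), an n-D cube has 2**n cuboids:
print(num_cuboids([1, 1, 1]))  # 3-D cube: 8 cuboids

# Assumed hierarchies: time with 4 levels (day < month < quarter < year),
# location with 3 levels (city < state < country), item with 1 level:
print(num_cuboids([4, 3, 1]))  # 40 cuboids
```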

A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. Dimensions are the entities with respect to which an organization wants to keep records; each dimension is described by a table with multiple columns, including a primary key column.

Example:
ElectronicsCatalog – a relational database described by the following relation tables:
Customer, Item, Employee and Branch.
Customer (Cust_ID, Cust_Name, Address, Age, Income, Credit_Info, Category,
Occupation)
Item (Item_ID, Item_Name, Brand, Category, Type, Price, Mfd_City, Supplier, Cost)
Employee (Emp_ID, Emp_Name, Category, Group, Salary, Commission)
Branch (Branch_ID, Branch_Name, Address)
Purchase (Trans_ID, Cust_ID, Emp_ID, Pur_date, Pur_Time, Method_Paid, Amt)
Items_Sold (Trans_ID, Item_ID, Qty)
Works_At (Emp_ID, Branch_ID)

A data cube for the summarized sales data of ElectronicsCatalog is represented using the following dimensions:
Address (with city values – Chicago, New York, Toronto, Vancouver)
Time (with quarter values – Q1, Q2, Q3, Q4)
Item (with item type values – Home Entertainment, Computer, Phone, Security)
The aggregate value stored in each cell of cube is Sales_Amt (in thousands)

ElectronicsCatalog may create a sales data warehouse in order to keep records of the sales of its electronics products. The dimensions are time, item, branch, and location; they allow the store to keep sales records by month (time), by item (item), by branch (branch), and by city (location).

Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. A dimension may also be described by multiple tables, called sub-dimension tables. Dimension tables can be defined by the database programmer or database designer. All the dimensions are connected to one another through a fact table, the central table that links all the dimension tables.

For example, a dimension table for item may contain the attributes item_name, brand, and
item_type.

Facts are numeric measures. Think of them as the quantities by which we want to analyze relationships between dimensions; the fact table holds their aggregate values.

Examples of facts for a sales data warehouse include amount_sold (sales amount in rupees) and units_sold (number of units sold).

The fact table contains the names of the facts, or measures, as well as keys to each of the
related dimension tables. (Jiawei Han, 2012)

1.6 Concept Hierarchies


A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Consider a concept hierarchy for the dimension location. City values for location include Vancouver, Toronto, New York, and Chicago. Each city can be mapped to the state to which it belongs.

For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois.

The states can in turn be mapped to the country (e.g., Canada or the United States) to
which they belong. These mappings form a concept hierarchy for the dimension location,
mapping a set of low-level concepts (i.e., cities) to higher-level, more general concepts
(i.e., countries). This concept hierarchy is illustrated in Figure 1.3. (Jiawei Han, 2012)

Figure 1.3 A concept hierarchy for location (Jiawei Han, 2012)

Many concept hierarchies are implicit within the database schema.


For example, suppose that the dimension location is described by the attributes number,
street, city, state, pin, and country. These attributes are related by a total order, forming a
concept hierarchy such as “street < city < state < country.” This hierarchy is shown in
Figure 1.4(a). (Jiawei Han, 2012)

The attributes of a dimension may be organized in a partial order, forming a lattice. An example of a partial order for the time dimension, based on the attributes day, week, month, quarter, and year, is "day < {month < quarter; week} < year." This lattice structure is shown in Figure 1.4(b). (Jiawei Han, 2012)

Figure 1.4 (a) a hierarchy for location and (b) a lattice for time (Jiawei Han, 2012)

- A concept hierarchy that defines a total or partial order among the attributes in a database schema is called a schema hierarchy.
- Concept hierarchies that are common to many applications (e.g., for time) may be predefined in the data mining system.
- Concept hierarchies may also be defined by discretizing or grouping the values of a given dimension or attribute, resulting in a set-grouping hierarchy.
- A total or partial order can be defined among groups of values.
- Concept hierarchies may be provided manually by system users, domain experts, or knowledge engineers, or may be generated automatically based on statistical analysis of the data distribution. (Jiawei Han, 2012)
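The location hierarchy above can be sketched as a pair of mapping tables. The dictionaries below are a hypothetical hand-built hierarchy in the spirit of Figure 1.3 (Toronto's and New York's state assignments are standard geography, not taken from the figure):

```python
# Hypothetical mapping tables for the hierarchy city < province_or_state < country.
CITY_TO_STATE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
STATE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}

def generalize(city: str) -> str:
    """Climb the hierarchy: map a low-level city value to its country."""
    return STATE_TO_COUNTRY[CITY_TO_STATE[city]]

print(generalize("Vancouver"))  # Canada
print(generalize("Chicago"))    # USA
```

Rolling data up from city to country is then just re-keying records through generalize().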

1.7 Measures: Their Categorization and Computation


A data cube measure is a numeric function that can be evaluated at each point in the data cube space. A measure value is computed for a given point by aggregating the data corresponding to the dimension-value pairs defining that point. Measures can be organized into three categories (distributive, algebraic, and holistic) based on the kind of aggregate function used. (Jiawei Han, 2012)

Distributive: An aggregate function is distributive if it can be computed in a distributed manner, as follows: for example, sum() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing sum() for each subcube, and then summing up the partial sums obtained for each subcube. Hence, sum() is a distributive aggregate function; count(), min(), and max() are other distributive aggregate functions.

A measure is distributive if it is obtained by applying a distributive aggregate function.


Distributive measures can be computed efficiently because of the way the computation can
be partitioned. (Jiawei Han, 2012)

Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function.
For example, avg() (average) can be computed as sum()/count(), where both sum() and count() are distributive aggregate functions.

Similarly, it can be shown that min_N() and max_N() (which find the N minimum and N maximum values, respectively, in a given set) and standard_deviation() are algebraic aggregate functions. A measure is algebraic if it is obtained by applying an algebraic aggregate function. (Jiawei Han, 2012)

Holistic: An aggregate function is holistic if there is no constant bound on the storage size needed to describe a subaggregate; that is, there is no algebraic function with M arguments (where M is a constant) that characterizes the computation.
Common examples of holistic functions include median(), mode(), and rank(). A measure is holistic if it is obtained by applying a holistic aggregate function.

Most large data cube applications require efficient computation of distributive and algebraic measures; holistic measures are difficult to compute efficiently. (Jiawei Han, 2012)
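The three categories can be demonstrated on a tiny partitioned data set. A sketch (the numbers are arbitrary and the two partitions simulate subcubes of a partitioned cube):

```python
from statistics import median

data = [1, 2, 3, 10, 20, 30]
partitions = [data[:3], data[3:]]  # simulate two subcubes of a partitioned cube

# Distributive: the global sum equals the sum of the per-partition sums.
partial_sums = [sum(p) for p in partitions]
assert sum(data) == sum(partial_sums)

# Algebraic: avg() is computable from two distributive results, sum and count.
partial_counts = [len(p) for p in partitions]
avg = sum(partial_sums) / sum(partial_counts)
assert avg == sum(data) / len(data)

# Holistic: median() has no constant-size subaggregate. Combining the
# per-partition medians does not, in general, give the true global median:
combined = median([median(p) for p in partitions])  # median of [2, 20] = 11.0
true_median = median(data)                          # (3 + 10) / 2 = 6.5
assert combined != true_median
print(avg, combined, true_median)
```

This is exactly why distributive and algebraic measures parallelize well while holistic measures need the full data (or approximation techniques).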

1.8 Schemas for Multidimensional Data Models


The entity-relationship (ER) data model is commonly used in designing relational databases, where the schema represents the relationships between entities; such a model suits online transaction processing. A data warehouse, by contrast, requires a schema that represents its dimensions and facts in a multidimensional view: a database schema of this kind structures the warehouse as a multidimensional model suitable for data analysis.

To represent a multidimensional model in a data warehouse we have three schema models: the star schema, the snowflake schema, and the fact constellation schema.

Star schema: The most common modeling structure is the star schema, in which the data warehouse contains
▪ a large central table (fact table) containing the bulk of the data, with no redundancy, and
▪ a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph places the dimension tables in a radial pattern around the central fact table, which references each dimension table and may also hold summarized (measure) columns. (Jiawei Han, 2012)

A star schema for ElectronicsCatalog sales is shown in Figure 1.6. Sales are considered along four dimensions: time, item, branch, and location. The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures: dollars_sold and units_sold.

To minimize the size of the fact table, dimension identifiers (e.g., time_key and item_key) are system-generated identifiers.

Figure 1.6: Star Schema Model


In the star schema, each dimension is represented by only one table, and each table contains a set of attributes. The primary keys of all the dimension tables are kept in the central fact table (here sales_fact, with the aggregate columns dollars_sold and units_sold).

define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

The define cube statement defines a data cube called sales_star, which corresponds to the central sales fact table. It specifies the dimensions and the two measures, dollars_sold and units_sold; the data cube has four dimensions, namely time, item, branch, and location.
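In a relational implementation the star schema is simply tables linked by foreign keys. A minimal sketch using Python's built-in sqlite3 module (the table names, columns, and sample rows are illustrative assumptions; the branch and location dimensions are omitted for brevity):

```python
import sqlite3

# Build a tiny star schema in memory: one fact table whose keys
# point at the time and item dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim,
    item_key INTEGER REFERENCES item_dim,
    dollars_sold REAL,
    units_sold INTEGER);
INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (10, 'TV', 'home entertainment'), (11, 'Laptop', 'computer');
INSERT INTO sales_fact VALUES (1, 10, 605.0, 5), (1, 11, 825.0, 3), (2, 10, 400.0, 4);
""")

# Query the fact table through its dimensions: dollars sold per item type in Q1.
rows = con.execute("""
    SELECT i.type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item_dim i ON f.item_key = i.item_key
    WHERE t.quarter = 'Q1'
    GROUP BY i.type
    ORDER BY i.type
""").fetchall()
print(rows)  # [('computer', 825.0), ('home entertainment', 605.0)]
```

The WHERE clause on the time dimension and the GROUP BY on the item dimension are exactly the kind of slicing and aggregation that OLAP operations perform over this schema.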

Snowflake schema: The snowflake schema is another data warehouse model, in which some dimension tables are normalized and split into sub-dimension tables. It therefore contains additional tables, giving a scattered arrangement of one fact table and multiple dimension tables whose shape resembles a snowflake, which is why it is called the snowflake schema.
The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancy, with sub-dimension tables holding the factored-out attributes.
A snowflake schema for ElectronicsCatalog sales is given in Figure 1.7. Here, the sales fact table is identical to that of the star schema in Figure 1.6.

Figure 1.7: Snowflake Schema Model

The main difference between the two schemas is in the definition of dimension tables. The
single dimension table for item in the star schema is normalized in the snowflake schema,
resulting in new item and supplier tables.

For example, the item dimension table now contains the attributes item key, item name,
brand, type, and supplier key, where supplier key is linked to the supplier dimension table,
containing supplier key and supplier type information.

Similarly, the single dimension table for location in the star schema can be normalized into
two new tables: location and city. The city key in the new location table links to the city
dimension. A further normalization can be performed on state and country in the snowflake
schema shown in Figure 1.7.

define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city (city_key, city, province_or_state, country))

This definition is similar to that of sales star, except that, here, the item and location
dimension tables are normalized.

For instance, the item dimension of the sales_star data cube has been normalized in the sales_snowflake cube into two dimension tables, item and supplier. Note that the dimension definition for supplier is specified within the definition for item; defining supplier in this way implicitly creates a supplier_key in the item dimension table definition. Similarly, the location dimension of the sales_star data cube has been normalized in the sales_snowflake cube into two dimension tables, location and city. The dimension definition for city is specified within the definition for location, so a city_key is implicitly created in the location dimension table definition.

Fact constellation: This is a mixture of the star schema and snowflake schema models, sometimes called a hybrid data warehouse structure. A fact constellation shares multiple fact tables and dimension tables, possibly with additional sub-dimension and fact tables. This kind of schema can be viewed as a collection of stars, and hence is also called a galaxy schema or a fact constellation.

A fact constellation schema is shown in Figure 1.8. This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema (Figure 1.6). The shipping table has five dimensions, or keys (item_key, time_key, shipper_key, from_location, and to_location) and two measures (dollars_cost and units_shipped). A fact constellation schema allows dimension tables to be shared between fact tables.

For example, the dimensions tables for time, item, and location are shared between the
sales and shipping fact tables. (Jiawei Han, 2012)

Figure 1.8: Fact Constellation Schema Model

As a hybrid structure incorporating both the star and snowflake schemas, a fact constellation schema may contain multiple facts as well as multiple dimensions.

define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

A define cube statement is used to define a data cube for each of sales and shipping, corresponding to the two fact tables of the schema. Note that the time, item, and location dimensions of the sales cube are shared with the shipping cube. This is indicated, for the time dimension for example, by the statement "define dimension time as time in cube sales" under the define cube statement for shipping.

1.9 OLAP Operations
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.

OLAP operations allow the data to be viewed from different perspectives, supporting data analysis and easy retrieval. OLAP provides a user-friendly environment for interactive data analysis. (Jiawei Han, 2012)

Typical OLAP operations are illustrated in Figure 1.11. In our case a multidimensional cube for ElectronicsCatalog sales is placed at the center. The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types.

Roll Up or Drill Up: The roll-up operation performs an aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Figure 1.9 shows the result of a roll-up operation performed on the central cube by climbing up the concept hierarchy for location given in Figure 1.3.

Figure 1.9: Roll Up Operation

This hierarchy was defined as the total order "street < city < province or state < country." The roll-up operation shown aggregates the data by ascending the location hierarchy from the level of city to the level of country: rather than grouping the data by city, the resulting cube groups the data by country. When roll-up is performed by dimension reduction, one or more dimensions are removed from the given cube.

For example, consider a sales data cube containing only the location and time dimensions.
Roll-up may be performed by removing, say, the time dimension, resulting in an
aggregation of the total sales by location, rather than by location and by time. (Jiawei Han,
2012)

Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to
more detailed data. Drill-down can be realized by either stepping down a concept hierarchy
for a dimension or introducing additional dimensions.

Figure 1.10: Drill Down Operation

Figure 1.10 shows the result of a drill-down operation performed on the central cube by
stepping down a concept hierarchy for time defined as “day < month < quarter < year.”
Drill-down occurs by descending the time hierarchy from the level of quarter to the more
detailed level of month.

The resulting data cube details the total sales per month rather than summarizing them by
quarter. (Jiawei Han, 2012)

Figure 1.11 Typical OLAP operations (Jiawei Han, 2012)

Slice and dice: The slice operation perform a selection on one dimension of the given
cube, resulting in a subcube.

Figure 1.12 shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = "Q1." The dice operation defines a subcube by performing a selection on two or more dimensions.

Figure 1.12: Slice Operation

Figure 1.13 shows a dice operation on the central cube based on the following selection criteria involving three dimensions: (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer"). (Jiawei Han, 2012)

Figure 1.13: Dice Operation

Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data
axes in view to provide an alternative data presentation.

Figure 1.14 shows a pivot operation where the item and location axes in a 2-D slice are
rotated. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cube
into a series of 2-D planes. (Jiawei Han, 2012)

Figure 1.14: Pivot Operation

OLAP offers analytical modelling capabilities, including a calculation engine for deriving ratios, variance, and so on, and for computing measures across multiple dimensions. It can generate summarizations, aggregations, and hierarchies at each granularity level and at every dimension intersection. OLAP also supports functional models for forecasting, trend analysis, and statistical analysis. The OLAP engine is a powerful data analysis tool.
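The roll-up, slice, and dice operations described above can be sketched over a tiny in-memory cube. The cell values and the city-to-country mapping below are made up for illustration; each cube is just a dictionary from dimension-value tuples to the sales measure (in thousands):

```python
from collections import defaultdict

# Hypothetical base cuboid: (city, quarter, item_type) -> sales (thousands).
cells = {
    ("Toronto", "Q1", "computer"): 605,  ("Toronto", "Q2", "computer"): 350,
    ("Vancouver", "Q1", "computer"): 1087, ("Vancouver", "Q1", "phone"): 38,
    ("New York", "Q1", "computer"): 968,  ("Chicago", "Q2", "phone"): 89,
}
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada",
                   "New York": "USA", "Chicago": "USA"}

def roll_up(cube):
    """Roll-up on location: climb city -> country and re-aggregate."""
    out = defaultdict(int)
    for (city, quarter, item), amt in cube.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += amt
    return dict(out)

def slice_(cube, quarter):
    """Slice: select one value on the time dimension, yielding a 2-D subcube."""
    return {(c, i): amt for (c, q, i), amt in cube.items() if q == quarter}

def dice(cube, cities, quarters):
    """Dice: select on two (or more) dimensions, yielding a smaller subcube."""
    return {k: v for k, v in cube.items() if k[0] in cities and k[1] in quarters}

print(roll_up(cells)[("Canada", "Q1", "computer")])        # 605 + 1087 = 1692
print(slice_(cells, "Q1")[("Vancouver", "phone")])         # 38
print(len(dice(cells, {"Toronto", "Vancouver"}, {"Q1"})))  # 3 cells remain
```

Drill-down is the inverse of roll_up and needs the finer-grained base data, which is why warehouses keep the base cuboid rather than only its aggregates.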

1.10 Efficient Processing of OLAP Queries


The querying of multidimensional databases can be based on a starnet model, which
consists of radial lines emanating from a central point, where each line represents a concept
hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint.
These represent the granularities available for use by OLAP operations such as drill-down
and roll-up. A starnet query model for the ElectronicsCatalog data warehouse is shown in
Figure 1.15. (Jiawei Han, 2012)

Figure 1.15: Starnet Model (Jiawei Han, 2012)

This starnet consists of four radial lines, representing concept hierarchies for the
dimensions location, customer, item, and time, respectively. Each line consists of footprints
representing abstraction levels of the dimension.

For example, the time line has four footprints: “day,” “month,” “quarter,” and “year.” A
concept hierarchy may involve a single attribute (e.g., date for the time hierarchy) or
several attributes (e.g., the concept hierarchy for location involves the attributes street, city,
province or state, and country).

In order to examine the item sales at ElectronicsCatalog, users can roll up along the time dimension from month to quarter, or, say, drill down along the location dimension from country to city.

Concept hierarchies can be used to generalize data by replacing low-level values (such as
“day” for the time dimension) by higher-level abstractions (such as “year”), or to specialize
data by replacing higher-level abstractions with lower-level values. (Jiawei Han, 2012)

1.11 Type of OLAP Servers:


To make OLAP operations efficient when retrieving data from a multidimensional data model, a specialized server is required. A specialized SQL server supports a data mining query language (DMQL) and complex queries over star and snowflake schemas in a read-only environment. The following OLAP server types are used by different vendors, as their requirements dictate. (JNTUH-R13)

Relational OLAP (ROLAP):


- ROLAP is responsible for operating a relational database. The tables used in RDBMS
packages can be handled in ROLAP. The fact and dimension tables are stored in a
relational table and a new table are created to hold summarized data. It is depending on
a data available schema structure.
18
- In the view of schema structure and to perform OLAP operations in relational OLAP
one has to use SELECT and WHERE clause. It supports SQL queries in slicing and
dicing operations.
- ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required to
answer the question.
- ROLAP also has the ability to drill down to the lowest level of detail in the database.
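As a sketch of how a ROLAP slice maps to a SELECT/WHERE query, the following uses an in-memory SQLite database with an illustrative fact table; the table and column names are hypothetical, not from any specific product:

```python
# Hypothetical ROLAP-style slice: the fact table lives in a relational
# database, and a "slice" is a SELECT with a WHERE clause fixing one
# dimension (here, quarter = 'Q1'), aggregated per item.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (item TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("TV", "Q1", 400.0), ("TV", "Q2", 350.0), ("Phone", "Q1", 250.0)],
)

# Slice on the time dimension, then aggregate per item.
rows = conn.execute(
    "SELECT item, SUM(amount) FROM sales_fact "
    "WHERE quarter = 'Q1' GROUP BY item ORDER BY item"
).fetchall()
print(rows)  # [('Phone', 250.0), ('TV', 400.0)]
```

No pre-calculated cube is consulted: the query runs against the base tables each time, which is exactly the trade-off the bullet points above describe.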

Multidimensional OLAP (MOLAP):


- Where ROLAP operates on relational databases, MOLAP operates on and handles the different operations over a data cube.
- MOLAP can store data of various kinds: files, data in row-column (tabular) format, or data in array format.
- OLAP operations such as slice and pivot are handled easily by MOLAP.
- Initial processing of the multidimensional data can be done in MOLAP before the entire data cube is handled.
- A data cube can hold multiple views of the data, and these views are easily operated on using MOLAP.
- Compared with ROLAP, MOLAP is better suited to handling multi-view data in a warehouse.
- Basic data pre-processing operations are also performed while loading data into MOLAP.
- MOLAP tools provide easy retrieval of data from a data cube, and data in an existing dataset can easily be updated or modified.

Difference between MOLAP and ROLAP


- Data retrieval: in MOLAP, retrieval is very fast; in ROLAP it is slower, as data must be retrieved from different tables.
- Storage: MOLAP stores data sets in multidimensional (cube) form; ROLAP uses a simple tabular format.
- Ease of use: retrieving data from MOLAP is easy for new users; ROLAP requires technical skill to retrieve data.

Hybrid OLAP (HOLAP):


- It holds both ROLAP and MOLAP data.
- It uses a specialized data structure to maintain databases storing huge amounts of data.
- HOLAP is responsible for holding and operating on both relational databases and multidimensional data (data cubes).
- HOLAP tools handle all kinds of data conversion techniques for loading data into a data warehouse.
- HOLAP removes data anomalies that occur in MOLAP and ROLAP by performing data pre-processing.
- It retrieves data from the data warehouse effectively.
1.12 Data Warehouse Tools

In this era of rapid growth in Big Data and real-time processing platforms, the data warehouse is still a reliable option for data storage, analysis, and indexing. A data warehouse is a relational database designed for writing complex queries (e.g., in DMQL, the Data Mining Query Language) to retrieve analytical results based on stored historical data.
Teradata
Teradata is a market leader in the data warehousing space that brings more than 30
years of history to the table. Teradata’s EDW (enterprise data warehouse)
platform provides businesses with robust, scalable hybrid-storage capabilities and
analytics from mounds of unstructured and structured data leading to real-time business
intelligence insights, trends, and opportunities. Teradata also offers a cloud-based
DBMS solution via its Aster Database platform.

Oracle
Oracle is basically the household name in relational databases and data warehousing
and has been so for decades. Oracle 12c Database is the industry standard for high
performance scalable, optimized data warehousing. The company’s specialized
platform for the data warehousing side is the Oracle Exadata Machine. There are an
estimated 390,000 Oracle DBMS customers worldwide.

Amazon Web Services (AWS)


The whole shift in data storage and warehousing to the cloud over the last several years
has been momentous and Amazon has been a market leader in that whole paradigm.
Amazon offers a whole ecosystem of data storage tools and resources that complement
its cloud services platform. Amazon was the overall leader in data warehousing
customer satisfaction and experience in last year’s survey.

Cloudera
Cloudera has emerged in recent years as a major enterprise provider of Hadoop-based data storage and processing solutions. Cloudera offers an Enterprise Data Hub (EDH) as its variety of operational data store, or data warehouse. The EDH is Cloudera’s proprietary framework for the “information-driven enterprise” and focuses on “batch processing, interactive SQL, enterprise search, and advanced analytics—together with the robust security, governance, data protection, and management that enterprises require.”

Questions:

1. What is data warehouse? Explain with one example.


2. Describe the ETL process in detail.
3. Differentiate between OLTP and OLAP.
4. Explain different Schemas used in data warehouse.
5. Explain different OLAP operations.
6. Write a note on:
a. Data Cube
b. MOLAP
c. ROLAP

xxxXxxx

Unit – II
Data Warehouse Architecture

Objectives:
- To learn basic concepts of data warehouse and its architecture, model
- To study data loading and preprocessing techniques
- To overcome and resolve the issues of data loading into a data warehouse
- To implement data warehouse

2.0 Steps for Design & Construction of Data Warehouse


The data warehouse design process is divided into the following views, on the basis of which construction is carried out. It is also known as the Business Analysis Framework. The data warehouse is built under different modules, called views. Four different views must be considered in the design of a data warehouse:
▪ Top-down view,
▪ Data source view,
▪ Data warehouse view,
▪ Business query view.
The top-down view allows the selection of the information required for loading into the data warehouse; the relevant information about a particular subject is stored. It is a basic requirement for designing the data warehouse.
The data source view exposes the information managed by operational systems. It allows a heterogeneous collection of data, coming from different sources such as files, catalogues, and government manuals/reports, to be stored in a specified format.

The data warehouse view contains both fact and dimension tables, as explained in the previous chapter. The business query view is similar to the data views of Oracle; it can be used by different users as per their requirements. Queries written to retrieve data from the data warehouse are represented in the Data Mining Query Language (DMQL). This view presents the data from the end user's viewpoint. (Jiawei Han, 2012)

2.1 Data Warehouse Architecture


The data warehouse architecture is presented in Figure 2.1. It is divided into three tiers, each with its own tools and utilities. These tools and utilities perform data extraction, cleaning, and transformation, as well as the load and refresh functions used to update the data warehouse.
- The bottom tier is a warehouse database server. It maintains data in a relational
database system.
- Back-end tools and utilities are used to input the data/records into the bottom tier from
transactional databases and other data sources.

- External data sources are accessed through database engines or gateways. A gateway is a program interface through which data can be extracted in a specified format and operated on by DBMS servers.
- A database engine or gateway uses ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) from Microsoft, or JDBC (Java Database Connectivity), for handling external data sources.
- A separate repository, called the metadata repository, is also maintained; it holds all kinds of storage information about the data and its content in the data warehouse. (Jiawei Han, 2012)
- The middle tier is an OLAP server, implemented using a specialized database server that handles ROLAP and MOLAP data: ROLAP handles relational databases, while MOLAP handles data available in arrays, flat files, etc.
- The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

Figure 2.1: Data warehouse Architecture (Jiawei Han, 2012)


Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration from different external data sources and operational systems. It typically contains detailed as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build. (Jiawei Han, 2012)

Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific selected subjects. It is a small part of a data warehouse. Data marts are usually implemented on low-cost departmental servers that are Unix/Linux or Windows based. The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years. It may involve complex integration in the long run if its design and planning were not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.

For example, a marketing data mart may confine its subjects to customer, item, and sales.
The data contained in data marts tend to be summarized.

Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata. For example, the index of a book serves as a metadata for the
contents in the book. In other words, we can say that metadata is the summarized data that
leads us to detailed data. In terms of data warehouse, we can define metadata as follows.
• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.

Classification of Metadata

Metadata can be broadly categorized into three categories −

• Business Metadata − It has the data ownership information, business definition,


and changing policies.

• Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes
structural information such as primary and foreign key attributes and indices.

• Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means
the history of data migrated and transformation applied on it.

Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database
servers. (Jiawei Han, 2012)

2.2 Data Warehouse Implementation


The top-down development of an enterprise warehouse serves as a systematic solution and minimizes integration problems. However, it is expensive, takes a long time to develop, and lacks flexibility due to the difficulty of achieving consistency and consensus on a common data model for the entire organization.

The bottom-up approach to the design, development, and deployment of independent data marts provides flexibility, low cost, and rapid return on investment. However, it can lead to problems when integrating various disparate data marts into a consistent enterprise data warehouse.

A recommended method for the development of data warehouse systems is to implement the warehouse in an incremental and evolutionary manner, as shown in Figure 2.2.

Figure 2.2: Data warehouse Implementation (Jiawei Han, 2012)

First, a high-level corporate data model is defined within a reasonably short period (such as one or two months) that provides a corporate-wide, consistent, integrated view of data among different subjects and potential usages. This high-level model, which will be refined in the further development of enterprise data warehouses and departmental data marts, greatly reduces future integration problems.

Second, independent data marts can be implemented in parallel with the enterprise
warehouse based on the same corporate data model set noted before.

Third, distributed data marts can be constructed to integrate different data marts via hub
servers.

Finally, a multitier data warehouse is constructed where the enterprise warehouse is the
sole custodian of all warehouse data, which is then distributed to the various dependent
data marts. (Jiawei Han, 2012)

2.3 Data Pre-processing Overview


Data preprocessing comprises techniques that operate on facts and raw data to convert them into processed data that can be used in further operations. It is responsible for converting raw data into refined data: it removes defects and errors and performs related operations so that a structured data set is available for the data mining process.

Typical preprocessing includes data cleaning, data integration, data reduction, and data transformation operations. These operations are applied to raw data to prepare it for further processing; they form the conversion mechanism by which processed data is created and loaded into the data warehouse. (JNTUH-R13)

2.3.1 Need for Data Pre-Processing


Data in the real world is dirty or not in a standard format. It can be incomplete, noisy, and inconsistent, so that users cannot easily read or understand it. It should be processed to improve its quality; then it can be searched, the required data can be retrieved easily, and data mining techniques can be applied to it to get quality results.
▪ Without quality data, the result is irrelevant output, which means no quality result; and without quality results, proper decisions cannot be taken.
▪ If much irrelevant and redundant information is present, it is difficult to retrieve proper knowledge from such raw data.

Incomplete data may come from:


▪ “Not applicable (NA, NIL, “----")” records when collected
▪ Improper assumptions about the time at which the data was collected and analyzed
▪ Human/hardware/software problems
▪ Lacking column or field values
▪ Lacking certain attributes of interest, or capturing important fields only
▪ Containing only aggregate data
e.g., PostGiven = “ ”.

Noisy data (incorrect values) may come from


▪ Faulty data collection by the input devices
▪ Human or computer errors during data entry operations
▪ Errors in data transmission or conversion
▪ Data containing errors or outlier values
e.g., InterestRate = “-10.98”

Inconsistent data may come from:


▪ Different data sources, external files
▪ Functional dependency violation (e.g., modifying some linked data)
▪ Discrepancies in codes or names
e.g., Age = “42”, Birthday = “11/07/1981” – is the month the 7th or the 11th…?

Dirty data may come from:


▪ Different data sources, external files
▪ Improper assumptions, or human/hardware/software problems
▪ Data not in the proper format
▪ Dirty data may involve any of the above cases: incomplete data, noisy data, etc.
e.g., the name of a particular entity may fall into 2-3 sub-parts, viz. firstname, middlename, surname, title, etc.
The name “Ramesh Sharma” may also appear as “Sharma Ramesh”, and so on.

2.3.2 Major Tasks in Data Pre-processing


▪ Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
▪ Data integration
Integration of multiple databases, data cubes, or files
▪ Data transformation
Normalization and aggregation
▪ Data reduction
Obtains reduced representation in volume but produces the same or similar
analytical results
▪ Data discretization
Part of data reduction but with importance, especially for numerical data
The following figure 2.3 shows the different forms of data pre-processing techniques.

Figure 2.3 Forms of data pre-processing (Jiawei Han, 2012)

2.3.3 Data Cleaning


The data cleaning process fills in missing or incomplete values by substituting logical values, which in turn smooths out noisy data. The process identifies outliers and corrects inconsistencies in the data. There are various methods for handling these problems:

a) Missing Values
Following are some techniques to handle missing values before loading into a
data warehouse. It includes:
i. Ignoring the tuple: This can be done when a tuple lacks a proper label highlighting the information for a certain case, i.e., the category of the data is not specified and is missing. If this happens for most of the column values, however, ignoring such tuples loses data quality, so it is not an effective technique when most of the data is affected.
ii. Filling missing data (manually): Sometimes it is possible to fill in some column values manually, based on past entries of the respective column; the value entered is an assumption made by the person filling it in. This is a time-consuming process and is not suitable where past data is not relevant.
iii. Use of a global constant: It is possible to replace all missing values by some constant. The constant to be assumed must be specified before the replacement; for example, if column values are “unknown”, “NA”, or “—”, they can all be replaced by a single global constant.
iv. In some situations, the data can be classified based on past entries and the missing values replaced by the attribute's mean or mode.
v. The missing values can be predicted from previous entries by applying regression, Bayesian techniques, or decision tree induction.
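Two of the strategies above, the global constant of point iii and the mean of point iv, can be sketched as follows; the column values and missing-value markers are illustrative:

```python
# Sketch of two missing-value strategies: replacement by a global
# constant, and replacement by the column mean. Ages are invented.
def fill_with_constant(values, constant=0, missing=(None, "NA", "--")):
    """Replace every missing marker with a single global constant."""
    return [constant if v in missing else v for v in values]

def fill_with_mean(values, missing=(None, "NA", "--")):
    """Replace missing markers with the mean of the known values."""
    known = [v for v in values if v not in missing]
    mean = sum(known) / len(known)
    return [mean if v in missing else v for v in values]

ages = [25, None, 35, "NA", 40]
print(fill_with_constant(ages))  # [25, 0, 35, 0, 40]
print(fill_with_mean(ages))      # missing slots get (25+35+40)/3
```

In practice the mean is often computed per class rather than globally, which is the classification-based refinement mentioned in point iv.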

b) Noisy data:
Noise is a random error or variance in a measured variable. Data smoothing techniques can be used to remove noisy data.

Following are data smoothing techniques:


i. Binning methods: Binning smooths a sorted data value by consulting its “neighbourhood”, i.e., the values around it. The sorted values are distributed into a number of “buckets”, or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing.
In this technique:
▪ First, the given data is sorted in ascending or descending order
▪ Then the sorted list is partitioned into equi-depth bins (equal parts)
▪ Then one can smooth by bin means, by bin medians, by bin boundaries, etc.
a. Smoothing by bin means: Each value in the bin is replaced by the mean
value of the bin.
b. Smoothing by bin medians: Each value in the bin is replaced by the bin
median.
c. Smoothing by bin boundaries: The minimum and maximum values of a bin are identified as the bin boundaries. Each bin value is replaced by the closest boundary value.

ii. Clustering: Outliers in the data may be detected by clustering, where


similar values are organized into groups, or clusters. Values that fall outside
of the set of clusters may be considered outliers.

Figure 2.4 Clustering (JNTUH-R13)

iii. Regression: Data can be smoothed by fitting it to a regression function.
Linear regression involves finding the best-fitting line through two variables, so that one variable can be used to predict the other.

Figure 2.5 Linear Regression Graph (JNTUH-R13)

Multiple linear regression is an extension of linear regression, where more


than two variables are involved and the data are fit to a multidimensional
surface. Using regression to find a mathematical equation to fit the data
helps smooth out the noise.
iv. Rule specification: To correct noisy data, rules can be specified for each value of a given attribute, and each value verified against them. Care should be taken to define rules for handling null-value adjustments.
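The binning technique described above can be sketched as follows, smoothing by bin means and by bin boundaries over equi-depth bins. The price values are the classic illustrative example; the sketch assumes the number of values divides evenly into the bins:

```python
# Equi-depth binning: sort, partition into bins of equal depth, then
# smooth each bin. Prices are illustrative; len(data) is assumed to be
# divisible by n_bins for simplicity.
def smooth_by_bin_means(data, n_bins):
    data = sorted(data)
    depth = len(data) // n_bins
    out = []
    for i in range(0, len(data), depth):
        bin_vals = data[i:i + depth]
        mean = sum(bin_vals) / len(bin_vals)   # every value becomes the bin mean
        out.extend([mean] * len(bin_vals))
    return out

def smooth_by_boundaries(data, n_bins):
    data = sorted(data)
    depth = len(data) // n_bins
    out = []
    for i in range(0, len(data), depth):
        bin_vals = data[i:i + depth]
        lo, hi = bin_vals[0], bin_vals[-1]     # bin boundaries
        out.extend([lo if v - lo <= hi - v else hi for v in bin_vals])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))   # bins [4,8,15] [21,21,24] [25,28,34] -> means 9, 22, 29
print(smooth_by_boundaries(prices, 3))  # [4, 4, 15, 21, 21, 24, 25, 25, 34]
```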

2.3.4 Data Integration and Transformation


a) Data Integration
It combines data from multiple sources into a central store. There are a number of issues to consider during data integration.
Issues:
Issues:
i. Schema integration: refers to the integration of metadata from different sources.
ii. Entity identification problem: identifying when an entity in one data source corresponds to an entity in another.
For example, customer_id in one data source and customer_no in another data source may refer to the same entity.
iii. Detecting and resolving data value conflicts: Attribute values from
different sources can be different due to different representations, different
scales.
E.g. metric vs. British units
iv. Redundancy: It is another issue while performing data integration.
Redundancy can occur due to the following reasons:
▪ Object identification: The same attribute may have different names in
different data source
▪ Derived Data: one attribute may be derived from another attribute.

b) Data Transformation
Data transformation can involve the following:
i. Smoothing: which works to remove noise from the data
ii. Aggregation: where summary or aggregation operations are applied to the data.
For example, daily sales data may be aggregated to compute weekly and annual totals.
iii. Generalization of the data: where low-level or “primitive” (raw) data are
replaced by higher-level concepts using concept hierarchies.
For example, categorical attributes, like street, can be generalized to higher-
level concepts, like city or country.
iv. Normalization: where the attribute data are scaled to fall within a small
specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
v. Attribute construction (feature construction): this is where new attributes are
constructed and added from the given set of attributes to help the mining
process.
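As an example of point iv, min-max normalization rescales attribute values into a target range. The income figures below are illustrative:

```python
# Min-max normalization: values are linearly rescaled so the smallest
# maps to new_min and the largest to new_max. Incomes are invented.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [
        (v - old_min) / span * (new_max - new_min) + new_min
        for v in values
    ]

incomes = [12000, 73600, 98000]
print(min_max_normalize(incomes))  # [0.0, ~0.716, 1.0]
```

Passing new_min=-1.0 instead yields the [-1.0, 1.0] range also mentioned above.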

2.3.5 Data Reduction techniques


These techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume while maintaining the integrity of the original data. Data reduction includes:
a. Data cube aggregation, where aggregation operations are applied to the data in
the construction of a data cube.
b. Attribute subset selection, where irrelevant, weakly relevant or redundant
attributes or dimensions may be detected and removed.
c. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size; examples include wavelet transforms and principal components analysis.
d. Numerosity reduction, where the data are replaced or estimated by alternative,
smaller data representations such as parametric models or nonparametric
methods such as clustering, sampling, and the use of histograms.
e. Discretization and concept hierarchy generation, where raw data values for
attributes are replaced by ranges or higher conceptual levels. Data
Discretization is a form of numerosity reduction that is very useful for the
automatic generation of concept hierarchies. (JNTUH-R13)

2.4 Data cube aggregation: This reduces the data to the concept level needed in the analysis. Queries for aggregated information should be answered using the data cube when possible, since data cubes store multidimensional aggregated information. Figure 2.6 shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each branch.

Figure 2.6 Cube Structure (JNTUH-R13)

Each cell holds an aggregate data value, corresponding to the data point in
multidimensional space. The following database consists of sales per quarter for the
years 1997-1999.

Figure 2.7 Levels of Abstraction in Data Cube (JNTUH-R13)

Suppose the analyst is interested in annual sales rather than sales per quarter. The above data can then be aggregated so that the result summarizes total sales per year instead of per quarter. The resulting data set is smaller in volume, without loss of the information necessary for the analysis task. (JNTUH-R13)
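The aggregation just described can be sketched as follows, with invented quarterly figures rolled up to annual totals:

```python
# Sketch of data cube aggregation: quarterly sales records
# (year, quarter, amount; invented numbers) are rolled up to annual
# totals, reducing volume without losing what the analysis needs.
def aggregate_annual(quarterly_rows):
    annual = {}
    for year, _quarter, amount in quarterly_rows:
        annual[year] = annual.get(year, 0) + amount
    return annual

rows = [
    (1997, "Q1", 224), (1997, "Q2", 408), (1997, "Q3", 350), (1997, "Q4", 586),
    (1998, "Q1", 300), (1998, "Q2", 416),
]
print(aggregate_annual(rows))  # {1997: 1568, 1998: 716}
```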

2.5 Dimensionality Reduction


Dimensionality reduction reduces the data set size by removing irrelevant attributes; methods of attribute subset selection are applied here.
a. Curse of dimensionality
i. When dimensionality increases, data becomes increasingly sparse
ii. Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
iii. The possible combinations of subspaces will grow exponentially
b. Dimensionality reduction
i. Avoid the curse of dimensionality

ii. Help to eliminate irrelevant features and reduce noise
iii. Reduce time and space required in data mining
iv. Allow easier visualization
c. Dimensionality reduction techniques
i. Principal component analysis
ii. Singular value decomposition
iii. Supervised and nonlinear techniques (e.g., feature selection)
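As a simple illustration of attribute subset selection, one possible heuristic (among many) is to drop attributes whose variance falls below a threshold, on the grounds that near-constant columns carry little information. The table below is invented:

```python
# Low-variance attribute filter: a simple attribute-subset-selection
# heuristic. Attribute names and values are invented for illustration.
def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def select_attributes(table, threshold=0.1):
    """table: dict of attribute name -> list of numeric values.
    Keep only attributes whose variance exceeds the threshold."""
    return [name for name, col in table.items() if variance(col) > threshold]

data = {
    "age":    [25, 32, 47, 51],
    "flag":   [1, 1, 1, 1],       # constant column: variance 0, dropped
    "income": [30.0, 42.5, 55.0, 61.0],
}
print(select_attributes(data))  # ['age', 'income']
```

Supervised techniques would instead score each attribute against the class label; this unsupervised filter is only a first pass.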

2.6 Numerosity Reduction


Data volume can be reduced by choosing alternative, smaller forms of data representation. The methods can be:
▪ Parametric
▪ Non-parametric
Parametric: assume the data fits some model, then estimate the model parameters and store only the parameters instead of the actual data.
Non-parametric: histograms, clustering, and sampling are used to store a reduced form of the data. The main numerosity reduction techniques are:
a. Regression and log linear models:
i. Can be used to approximate the given data
ii. In linear regression, the data are modelled to fit a straight-line using
Y = α + β X,
where α, β are coefficients
iii. Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed into the above.
iv. Log-linear model: The multi-way table of joint probabilities is
approximated by a product of lower-order tables.
v. For example, the joint probability P(a, b, c, d) can be approximated as a product of lower-order tables: P(a, b, c, d) = α(a, b) · β(a, c) · γ(a, d) · δ(b, c, d).

b. Histogram
i. Divide the data into buckets and store the average (or sum) for each bucket. A bucket represents an attribute-value/frequency pair.
ii. It can be constructed optimally in one dimension using dynamic
programming.
iii. It divides up the range of possible values in a data set into classes or
groups. For each group, a rectangle (bucket) is constructed with a base
length equal to the range of values in that specific group, and an area
proportional to the number of observations falling into that group.
iv. The buckets are displayed along a horizontal axis, while the height of a bucket represents the average frequency of the values in it.

Figure 2.8 Histogram Sample (JNTUH-R13)

Histograms are highly effective at approximating both sparse and dense data, as
well as highly skewed, and uniform data.
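A minimal equi-width histogram can be sketched as follows; the value range is divided into equal-width buckets and each bucket stores a count. The price values are invented:

```python
# Equi-width histogram: the [min, max] range is split into n_buckets
# equal-width intervals; each interval stores a count that stands in
# for the raw values. Prices are illustrative.
def equi_width_histogram(values, n_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # min(...) folds the maximum value into the last bucket
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return counts

prices = [1, 1, 5, 5, 5, 8, 8, 10, 14, 14, 15, 18, 20, 20, 21, 25, 28, 30]
print(equi_width_histogram(prices, 3))  # [8, 6, 4]
```

An equi-depth variant would instead choose bucket boundaries so that each bucket holds roughly the same number of values.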

c. Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are “similar" to one another and “dissimilar" to objects in other clusters. Similarity is commonly defined in terms of how “close" the objects are in space, based on a distance function.

Figure 2.10 Clustering (JNTUH-R13)

The quality of clusters may be measured by their diameter (the maximum distance between any two objects in the cluster) or by centroid distance (the average distance of each cluster object from its centroid).

d. Sampling
Sampling can be used as a data reduction technique since it allows a large data
set to be represented by a much smaller random sample (or subset) of the data.
Suppose that a large data set, D, contains N tuples. Let's have a look at some
possible samples for D.
i. Simple random sample without replacement (SRSWOR) of size n:
This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.

ii. Simple random sample with replacement (SRSWR) of size n: This is
similar to SRSWOR, except that each time a tuple is drawn from D, it is
recorded and then replaced. That is, after a tuple is drawn, it is placed
back in D so that it may be drawn again.
iii. Cluster sample: If the tuples in D are grouped into M mutually disjoint
“clusters", then a SRS of m clusters can be obtained, where m < M.
For example, tuples in a database are usually retrieved a page at a time,
so that each page can be considered a cluster.
A reduced data representation can be obtained by applying, say,
SRSWOR to the pages, resulting in a cluster sample of the tuples.
iv. Stratified sample: If D is divided into mutually disjoint parts called
“strata", a stratified sample of D is generated by obtaining a SRS at each
stratum. This helps to ensure a representative sample, especially when
the data are skewed.
For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group.
The age group having the smallest number of customers is then sure to be represented.

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, n, rather than to N, the size of the data set. Hence, sampling complexity is potentially sub-linear in the size of the data. When applied to data reduction, sampling is most commonly used to estimate the answer to an aggregate query. (JNTUH-R13)
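The sampling schemes above can be sketched with the standard library's random module; the data set D and the strata are invented, and a fixed seed keeps the run reproducible:

```python
# SRSWOR, SRSWR, and a stratified sample over an invented data set.
import random

random.seed(42)  # fixed seed so the example is reproducible
D = list(range(100))  # N = 100 tuples

# SRSWOR: n draws with no tuple repeated.
srswor = random.sample(D, 10)

# SRSWR: each drawn tuple is replaced, so repeats are possible.
srswr = [random.choice(D) for _ in range(10)]

# Stratified sample: an SRS is taken from each stratum (here, invented
# age groups partitioning D).
strata = {"young": list(range(0, 40)), "senior": list(range(40, 100))}
stratified = {name: random.sample(group, 3) for name, group in strata.items()}

print(len(srswor), len(set(srswor)))  # 10 10 (no duplicates in SRSWOR)
```

A cluster sample would instead treat pre-existing groups (e.g., disk pages) as the units and apply SRSWOR to the groups themselves.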

2.7 Discretization and Concept hierarchies


Discretization:
▪ Data Discretization techniques can be used to reduce the number of values for a
given continuous attribute, by dividing the range of the attribute into intervals.
▪ Interval labels can then be used to replace actual data values.

Concept Hierarchy
▪ A concept hierarchy for a given numeric attribute defines a Discretization of the
attribute.
▪ Concept hierarchies can be used to reduce the data by collecting and replacing
low level concepts (such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
▪ A concept hierarchy defines a sequence of mappings from set of low-level
concepts to higher-level, more general concepts.
▪ It organizes the values of attributes or dimensions into gradual levels of abstraction. Concept hierarchies are useful for mining at multiple levels of abstraction.
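Discretizing a numeric attribute into the higher-level concepts mentioned above amounts to a simple mapping; the age interval boundaries below are illustrative:

```python
# Discretization of a numeric attribute (age) into higher-level
# concepts; the interval boundaries are invented for illustration.
def discretize_age(age):
    if age < 40:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

ages = [23, 45, 67, 38, 59]
print([discretize_age(a) for a in ages])
# ['young', 'middle-aged', 'senior', 'young', 'middle-aged']
```

Replacing raw ages with these interval labels is exactly the reduction the discretization section describes: many distinct values collapse to a few concepts.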

Questions:

1. Explain data warehouse architecture.


2. Explain different components of data warehouse architecture.
3. What is Data Preprocessing? Explain its different techniques.
4. Explain the process of data cleaning with example.
5. Write a note on:
a. Data discretization
b. Concept Hierarchy
c. Data Reduction

xxxXxxx

Unit – III
Introduction to Data Mining
Objectives:
- To learn basic concepts of data mining and its functionalities
- To study data mining classification and mining task primitives
- To study different issues and data retrieval in data mining
- To study the applicability of data mining

3.0 Fundamentals of Data Mining


Data mining is the retrieval of knowledge from large amounts of data: it retrieves important patterns from the data warehouse. Data mining is also referred to as knowledge mining, or knowledge discovery from databases. It is a computational process of discovering patterns in large data sets, involving methods from artificial intelligence, machine learning, statistics, and database systems.

Example:
Data Mining can be applicable in following scenarios:
- Studying the profile of Credit Card Holders
- Living standards of Heart Attack Patients
- Buying Behavior of Middle-Class People
- Seasonal expenses made by people
- Medical Diagnosis
In the above examples, mining plays an analytical role: studying or retrieving data and reports based on about 5-6 years of past data for a given case. It generates new information about a given situation and retrieves the most probable knowledge, which is helpful for further decisions when the same situation arises.

Data mining supports marketing strategies for non-moving products in a shopping mall. It is also applicable in medical diagnosis, where doctors can give prescriptions based on the history of a patient's diseases.

The basic task of data mining is to retrieve knowledge from a large data set and convert it into an understandable format for further use. It is helpful in extracting information. The key properties of data mining are:
▪ Extraction of knowledge or pattern
▪ Outcome Prediction
▪ Knowledge Intelligence
▪ Retrieval and focus based on large data set

Figure 3.1 KDD Process (Jiawei Han, 2012)

Knowledge Discovery of Data (KDD)


Knowledge discovery as a process is depicted in figure 3 . 1 and consists of an
iterative sequence of the following steps:
i. Data cleaning: to remove noisy, irrelevant, and dirty data
ii. Data integration: where multiple data sources may be combined
iii. Data selection: where relevant data are selected for easy retrieval from the
large data set
iv. Data transformation: where data are transformed into an appropriate format by
applying standard transformation techniques, such as summary or aggregation
operations
v. Data mining: the essential process, where different sets of algorithms are used
to retrieve the relevant information or extract data patterns (knowledge)
vi. Pattern evaluation: to identify the interesting patterns representing
knowledge, based on selective measures of interestingness
vii. Knowledge presentation: where visualization and knowledge representation
techniques are used to present the mined knowledge to the user, representing the
data in a user-understandable format

3.1 Data Mining Functionalities


Data mining functionalities are used to retrieve different types of patterns in
data mining tasks.
The data mining tasks can be classified into two categories:
▪ Descriptive: mining tasks that characterize the general properties of the data;
evaluation is based on aggregated data.
▪ Predictive: mining tasks that perform inference on the current data in order to
make predictions; evaluation is based on statistical relevance.

3.1.1 Concept/class description: characterization and discrimination


Data can be associated with classes or concepts. A concept/class description
describes a given set of data in a concise manner, aggregating the data and
presenting its interesting patterns. These descriptions can be derived via
▪ data characterization, by summarizing the data of the class under study
▪ data discrimination, by comparison of the target class with one or a set of
comparative classes
▪ Both data characterization and discrimination.

3.1.2 Data characterization


It is a summarization of the general characteristics or features of a target class of data.
Example:
A data mining system should be able to produce a description summarizing the
characteristics of a student who has obtained more than 75% in every exam; the result
could be a general profile of the student.

3.1.3 Data Discrimination


It is a comparison of the general features of target-class data objects with the
general features of objects from one or a set of contrasting classes. Data
discrimination can be presented in different forms: pie charts, bar charts, curves,
or multidimensional data cubes/tables. It can also be expressed as rules that
generate patterns, where a pattern represents the relations between data objects.
Example
Based on the performance of students in a semester exam, one can easily obtain a
detailed description of the student profiles, expressed across various parameters
and contrasted between groups of students. This comparative information is called
data discrimination.

3.1.4 Mining Frequent Patterns, Association and Correlations


It is the discovery of association rules showing attribute–value conditions that
occur frequently together in a given set of data. Association rules are used to
show relationships between data items. A frequent pattern is a pattern that occurs
frequently in a data set.

The support (s) for an association rule X ⇒ Y is the percentage of transactions in
the database that contain X ∪ Y.

The confidence or strength for an association rule X ⇒ Y is the ratio of the number
of transactions that contain X ∪ Y to the number of transactions that contain X.

The support count is the absolute support, while the probability derived from the
support count is the relative support. Consider the following set of transactions,
where transaction ID 10 contains items A, B, and C, and so on.

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Table 3.1: Transaction Details


In general, support and confidence can be defined as:
Support(X ⇒ Y) = P(X ∪ Y)
Confidence(X ⇒ Y) = P(Y | X)
                  = P(X ∪ Y) / P(X)

For the transactions of Table 3.1, the frequent patterns and their support values
are:

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

Table 3.2: Computation of Support and Confidence


For the rule A ⇒ C:
Support = support({A} ∪ {C}) = 50%
Confidence = support({A} ∪ {C}) / support({A}) = 50% divided by 75% ➔ 66.6%
Here, {A, C} ➔ 50% is shown in the table above as the last row,
and {A} ➔ 75% is shown in the table above as the first row.
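The computation for Table 3.1 can be checked with a short script. This is a plain-Python sketch; the helper names `support` and `confidence` are our own, not from any library.

```python
# Transactions from Table 3.1, one set of items per transaction.
transactions = [
    {"A", "B", "C"},   # TID 10
    {"A", "C"},        # TID 20
    {"A", "D"},        # TID 30
    {"B", "E", "F"},   # TID 40
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X => Y, i.e. P(X U Y) / P(X)."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"A"}, transactions))             # 0.75
print(support({"A", "C"}, transactions))        # 0.5
print(confidence({"A"}, {"C"}, transactions))   # 0.666...
```

The printed values reproduce the 75%, 50%, and 66.6% figures derived above.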
We will see another example, which illustrates the concept of Support and Confidence.

Consider the example of a supermarket in which 1000 customers purchased different
items, including soap, washing powder, shampoo, conditioner, biscuits, and so on.
Let us find the support and confidence of any two items. In our case, we find
the support and confidence for two items – shampoo and conditioner – because
most of the customers bought these two items together frequently.

Of the 1000 transactions, 200 contain shampoo and 300 contain conditioner. Among
these, 100 transactions include shampoo and conditioner bought together. Based on
this data, we will compute support and confidence.

Support
Support is the default popularity of any item. You can calculate the Support by the
following formula. In our example,
Support (Shampoo) = (Transactions involving Shampoo) / (Total Transactions)
= 200/1000
= 20%

Confidence
You can calculate the Confidence by the following formula. In our example,
Confidence = (Transactions involving both Conditioner and Shampoo) /
(Total Transactions involving Shampoo)
= 100/200
= 50%
This says that 50% of the customers who bought Shampoo also bought Conditioner.

3.1.5 Classification:
▪ It predicts categorical class labels
▪ It classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying attribute and uses it in classifying new data
▪ Typical Applications
o credit approval
o target marketing
o medical diagnosis
o treatment effectiveness analysis
Classification can be defined as the process of designing a model based on
different or similar kinds of classes. It classifies categorical data into
different classes, and the classes are grouped under particular class labels. The
model is built by analyzing a set of training data and can be helpful in
predicting future values.
Example:
An airport security screening station is used to determine whether passengers are
potential terrorists or criminals. To do this, the face of each passenger is
scanned and its basic pattern (distance between the eyes, size and shape of the
mouth, head, etc.) is identified. This pattern is compared to entries in a database
to see if it matches any patterns associated with known offenders.
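The screening scenario can be sketched as a nearest-neighbor match of a scanned feature vector against a database of known patterns. The feature values, database entries, and distance threshold below are hypothetical, chosen only to illustrate the matching idea.

```python
import math

# Hypothetical face-pattern database: (eye distance, mouth width) per known offender.
known_offenders = {
    "offender_1": (6.2, 4.8),
    "offender_2": (5.1, 5.5),
}

def matches(scanned, database, threshold=0.5):
    """Return the closest known pattern if it lies within `threshold`
    (Euclidean distance), otherwise None."""
    best_id, best_dist = None, float("inf")
    for person_id, pattern in database.items():
        dist = math.dist(scanned, pattern)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    return best_id if best_dist <= threshold else None

print(matches((6.1, 4.9), known_offenders))  # close to offender_1
print(matches((9.0, 9.0), known_offenders))  # far from every pattern -> None
```

A real system would use many more features and a trained classifier rather than a fixed threshold; the sketch only shows the compare-against-database step.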

3.1.6 Prediction:
Prediction is one of the classification techniques, through which we can predict a
situation based on relevant past data. Classification and prediction are based on
each other: using classified data, we can easily retrieve similar kinds of data,
which is helpful in making future predictions.
Example:
Predicting the stock market status of particular shares is a difficult problem. One
approach is to study the status of the particular shares and their values over the
last week, month, and year. The status of the current economy and relevant business
strategies are also studied. By considering all these parameters, the future growth
of a particular share can be predicted; this highlights the expected profit or loss
of particular share values. The prediction must be made with respect to the time
the data were collected.
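One very simple prediction technique for such time-ordered values is a moving average over recent history. This is a sketch only; the prices and the window size are illustrative, and a moving average is far too crude for actual trading decisions.

```python
def moving_average_forecast(values, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    recent = values[-window:]
    return sum(recent) / len(recent)

# Hypothetical closing prices for a share over the past week
prices = [101.0, 103.0, 102.0, 105.0, 107.0]

forecast = moving_average_forecast(prices, window=3)
print(f"Predicted next price: {forecast:.2f}")  # (102 + 105 + 107) / 3 = 104.67
```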

3.1.7 Classification vs. Prediction


▪ Classification and prediction are related to each other but differ in some
cases.
▪ Classification constructs a set of models based on similar kinds of data; it
groups different sets of data of a similar kind.
▪ Prediction can be applied to classified data; it predicts missing or
unavailable data values.
▪ Prediction can identify outliers, while classification can separate dissimilar
kinds of data into different groups.

3.1.8 Clustering analysis


Clustering analyzes data objects without considering a specific class or category.
Objects are grouped together based on their attribute values, whether those
attributes are similar or dissimilar; this grouping is called clustering, and each
cluster is formed as a class of objects. Clustering facilitates the organization of
different sets of observations into a structure, which may be a hierarchy of
classes: similar objects are grouped together and dissimilar objects are placed
into different groups.

Figure 3.3 Clustering (JNTUH-R13)


Example:
In shopping mall, the different products are grouped together in a particular shelf. The
soap section may contain different kinds of soap products, it may include shampoo,
conditioners, bathing soap as well washing soaps. In grocery section, it is a cluster of
some dissimilar products. In dress material section, it contains all kinds of dress
material including shirts, trousers and other. It may be called a cluster of dress materials
which is clubbed together for easy accession to the customers. In geographical section
42
the land is divided into different region. It is a second type of cluster where buildings
having one cluster and other land region is another cluster.
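Grouping by similarity can be sketched with a tiny k-means on 2-D points. This is a pure-Python illustration, not a production clusterer; the points, initial centers, and iteration count are all assumptions made for the example.

```python
import math

def kmeans(points, centers, iterations=10):
    """Very small k-means: assign each point to the nearest center, then
    recompute each center as the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two visually obvious groups of points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)   # roughly (1.33, 1.33) and (8.33, 8.33)
```

Each returned cluster corresponds to one "shelf": objects land in the group whose center (average attribute values) they are closest to.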

3.1.9 Classification vs. Clustering


▪ In classification, you have a set of predefined classes, and each object is
placed in the group to which it belongs; exact similarity exists between the
objects. It is supervised learning.
▪ In clustering, you do not have a set of predefined classes as in classification.
Objects are grouped together based on their attributes, and no exact similarity
exists between the objects. It is unsupervised learning.

3.1.10 Outlier analysis:


A database may contain data objects that do not comply with the general model of
the data. These data objects are outliers; in other words, data objects that do not
fall within any cluster are called outlier data objects. Noisy or exceptional data
are also called outlier data. The analysis of outlier data is referred to as
outlier mining.
Example:
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges
incurred by the same account. Outlier values may also be detected with respect to the
location and type of purchase, or the purchase frequency.
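A simple way to flag such purchases is a z-score test against the account's usual charges. This is a sketch; the charge amounts and the threshold of 2 standard deviations are illustrative assumptions, and real fraud detection uses far more sophisticated models.

```python
import statistics

def find_outliers(amounts, threshold=2.0):
    """Flag values whose z-score exceeds `threshold` standard deviations."""
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)
    return [a for a in amounts if abs(a - mean) / stdev > threshold]

# Hypothetical charges on one account: mostly small, one extreme purchase
charges = [20.0, 25.0, 22.0, 30.0, 24.0, 500.0]
print(find_outliers(charges))  # [500.0]
```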

The data mining functionalities are summarized in the following table.

Generalization
◼ Information integration and data warehouse construction
  ◼ Data cleaning, transformation, integration, and the multidimensional data
    model
◼ Data cube technology
  ◼ Scalable methods for computing (i.e., materializing) multidimensional
    aggregates
  ◼ OLAP (online analytical processing)
◼ Multidimensional concept description: characterization and discrimination
  ◼ Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet
    region

Association and Correlation Analysis
◼ Frequent patterns (or frequent itemsets)
  ◼ What items are frequently purchased together in your Walmart?
◼ Association, correlation vs. causality
  ◼ A typical association rule: Diaper → Beer [0.5%, 75%] (support, confidence)
  ◼ Are strongly associated items also strongly correlated?
◼ How to mine such patterns and rules efficiently in large datasets?
◼ How to use such patterns for classification, clustering, and other
  applications?

Classification
◼ Classification and label prediction
  ◼ Construct models (functions) based on some training examples
  ◼ Describe and distinguish classes or concepts for future prediction
    ◼ E.g., classify countries based on (climate), or classify cars based on
      (gas mileage)
  ◼ Predict some unknown class labels
◼ Typical methods
  ◼ Decision trees, naïve Bayesian classification, support vector machines,
    neural networks, rule-based classification, pattern-based classification,
    logistic regression, …
◼ Typical applications
  ◼ Credit card fraud detection, direct marketing, classifying stars, diseases,
    web-pages, …

Cluster Analysis
◼ Unsupervised learning (i.e., class label is unknown)
◼ Group data to form new categories (i.e., clusters), e.g., cluster houses to
  find distribution patterns
◼ Principle: maximizing intra-class similarity and minimizing inter-class
  similarity
◼ Many methods and applications

Outlier Analysis
◼ Outlier: a data object that does not comply with the general behavior of the
  data
◼ Noise or exception? One person's garbage could be another person's treasure
◼ Methods: by-product of clustering or regression analysis, …
◼ Useful in fraud detection, rare-events analysis

3.2 Classification of Data Mining Systems


There are many data mining systems available or being developed. Some are
specialized systems dedicated to a given data source or confined to limited data
mining functionalities; others are more versatile and comprehensive.
Data mining systems can be categorized according to various criteria, among which
are the following:
a. Classification according to the type of data source mined: this classification
categorizes data mining systems according to the type of data handled such as
spatial data, multimedia data, time-series data, text data, World Wide Web, etc.
b. Classification according to the data model drawn on: this classification
categorizes data mining systems based on the data model involved such as
relational database, object-oriented database, data warehouse, transactional, etc.

c. Classification according to the kind of knowledge discovered: this classification
categorizes data mining systems based on the kind of knowledge discovered or data
mining functionalities, such as characterization, discrimination, association,
classification, clustering, etc. Some systems tend to be comprehensive systems
offering several data mining functionalities together.
d. Classification according to mining techniques used: Data mining systems
employ and provide different techniques. This classification categorizes data
mining systems according to the data analysis approach used such as machine
learning, neural networks, genetic algorithms, statistics, visualization, database
oriented or data warehouse-oriented, etc. The classification can also take into
account the degree of user interaction involved in the data mining process such as
query-driven systems, interactive exploratory systems, or autonomous systems. A
comprehensive system would provide a wide variety of data mining techniques to
fit different situations and options, and offer different degrees of user interaction.

3.3 Data Mining Task Primitives


a. Task-relevant data: This primitive specifies the data upon which mining is to be
performed. It involves specifying the database and tables or data warehouse
containing the relevant data, conditions for selecting the relevant data, the relevant
attributes or dimensions for exploration, and instructions regarding the ordering or
grouping of the data retrieved.
b. Knowledge type to be mined: This primitive specifies the specific data mining
function to be performed, such as characterization, discrimination, association,
classification, clustering, or evolution analysis. As well, the user can be more
specific and provide pattern templates that all discovered patterns must match.
These templates or meta patterns (also called meta rules or meta queries), can be
used to guide the discovery process.
c. Background knowledge: This primitive allows users to specify knowledge they
have about the domain to be mined. Such knowledge can be used to guide the
knowledge discovery process and evaluate the patterns that are found. Of the
several kinds of background knowledge, this chapter focuses on concept
hierarchies.
d. Pattern interestingness measure: This primitive allows users to specify functions
that are used to separate uninteresting patterns from knowledge and may be used to
guide the mining process, as well as to evaluate the discovered patterns. This allows
the user to confine the number of uninteresting patterns returned by the process, as
a data mining process may generate a large number of patterns. Interestingness
measures can be specified for such pattern characteristics as simplicity, certainty,
utility and novelty.
e. Visualization of discovered patterns: This primitive refers to the form in which
discovered patterns are to be displayed. In order for data mining to be effective in
conveying knowledge to users, data mining systems should be able to display the
discovered patterns in multiple forms such as rules, tables, cross tabs (cross-
tabulations), pie or bar charts, decision trees, cubes or other visual representations.

3.4 Integration of a Data Mining System with a Database or a Data Warehouse
System,
The differences between the following architectures for the integration of a data mining
system with a database or data warehouse system are as follows.
a. No coupling:
The data mining system uses sources such as flat files to obtain the initial data set to
be mined since no database system or data warehouse system functions are
implemented as part of the process. Thus, this architecture represents a poor design
choice.
b. Loose coupling:
The data mining system is not integrated with the database or data warehouse
system beyond their use as the source of the initial data set to be mined, and
possible use in storage of the results. Thus, this architecture can take advantage of
the flexibility, efficiency and features such as indexing that the database and data
warehousing systems may provide. However, it is difficult for loose coupling to
achieve high scalability and good performance with large data sets as many such
systems are memory-based.
c. Semi tight coupling:
Some of the data mining primitives, such as aggregation, sorting, or
precomputation of statistical functions, are efficiently implemented in the
database or data warehouse system, for use by the data mining system during
mining-query processing. Also, some frequently used intermediate mining results
can be precomputed and stored in the database or data warehouse system, thereby
enhancing the performance of the data mining system.
d. Tight coupling:
The database or data warehouse system is fully integrated as part of the data mining
system and thereby provides optimized data mining query processing. Thus, the
data mining subsystem is treated as one functional component of an information
system. This is a highly desirable architecture, as it facilitates efficient
implementations of data mining functions, high system performance, and an
integrated information processing environment.

From the descriptions of the architectures provided above, tight coupling is the
best alternative without regard to technical or implementation issues. However, as
much of the technical infrastructure needed in a tightly coupled system is still
evolving, implementation of such a system is non-trivial. Therefore, the most
popular architecture is currently semi-tight coupling, as it provides a compromise
between loose and tight coupling.

3.5 Data Mining Architecture


Data Mining is the process of discovering interesting Knowledge from large amounts
of data stored in data bases, data warehouses or other information repositories (Jiawei
Han, 2012)

Figure 3.4: Data Mining Architecture (Jiawei Han, 2012)

The architecture of a typical data mining system may have the following major
components
▪ Database, Data warehouse, World wide web or other information repository-Data
cleaning and data integration techniques may be performed on the data

▪ Database or Data Warehouse Server-It is responsible for fetching the relevant data
based on the user’s data mining request.

▪ Data mining engine-It consists of a set of functional modules for tasks such as
characterization, association and correlation analysis, classification,
prediction, cluster analysis, outlier analysis, etc.

▪ Knowledge base – It is the domain knowledge used to guide the search or evaluate
the interestingness of resulting patterns

▪ Pattern evaluation module-It applies interestingness measures to filter out
uninteresting discovered patterns

▪ Graphical user interface-It allows the user to specify a data mining query

3.6 Major issues in Data Mining


Major issues in data mining is regarding mining methodology, user interaction,
performance, and diverse data types

a. Mining methodology and user-interaction issues:


o Mining different kinds of knowledge in databases: Since different users can
be interested in different kinds of knowledge, data mining should cover a wide
spectrum of data analysis and knowledge discovery tasks, including data
characterization, discrimination, association, classification, clustering, trend and
deviation analysis, and similarity analysis. These tasks may use the same
database in different ways and require the development of numerous data
mining techniques.
o Interactive mining of knowledge at multiple levels of abstraction: Since it is
difficult to know exactly what can be discovered within a database, the data
mining process should be interactive.
o Incorporation of background knowledge: Background knowledge, or
information regarding the domain under study, may be used to guide the
discovery patterns. Domain knowledge related to databases, such as integrity
constraints and deduction rules, can help focus and speed up a data mining
process, or judge the interestingness of discovered patterns.
o Data mining query languages and ad-hoc data mining: Data mining query
languages (analogous to relational query languages such as SQL) are required,
since they allow users to pose ad-hoc queries for data retrieval.
o Presentation and visualization of data mining results: Discovered
knowledge should be expressed in high-level languages, visual representations,
so that the knowledge can be easily understood and directly usable by humans
o Handling outlier or incomplete data: The data stored in a database may
reflect outliers: noise, exceptional cases, or incomplete data objects. These
objects may confuse the analysis process, causing overfitting of the data to the
knowledge model constructed. As a result, the accuracy of the discovered
patterns can be poor. Data cleaning methods and data analysis methods that
can handle outliers are required.
o Pattern evaluation: refers to interestingness of pattern: A data mining
system can uncover thousands of patterns. Many of the patterns discovered may
be uninteresting to the given user, representing common knowledge or lacking
novelty. Several challenges remain regarding the development of techniques to
assess the interestingness of discovered patterns,

b. Performance issues: These include efficiency, scalability, and parallelization
of data mining algorithms.
o Efficiency and scalability of data mining algorithms: To effectively extract
information from a huge amount of data in databases, data mining algorithms
must be efficient and scalable.
o Parallel, distributed, and incremental updating algorithms: Such algorithms
divide the data into partitions, which are processed in parallel. The results from
the partitions are then merged.
c. Issues relating to the diversity of database types
o Handling of relational and complex types of data: Since relational databases
and data warehouses are widely used, the development of efficient and
effective data mining systems for such data is important.
o Mining information from heterogeneous databases and global information
systems: Local and wide-area computer networks (such as the Internet)
connect many sources of data, forming huge, distributed, and heterogeneous
databases. The discovery of knowledge from different sources of structured,

48
semi-structured, or unstructured data with diverse data semantics poses great
challenges to data mining.

3.7 Applications of Data Mining


Following are some of the areas where data mining plays a vital role.
- Web page analysis: from web page classification, clustering to PageRank & HITS
algorithms
- Collaborative analysis & recommender systems
- Basket data analysis to targeted marketing
- Biological and medical data analysis: classification, cluster analysis (microarray data
analysis), biological sequence analysis, biological Pattern analysis
- Data mining and software engineering, Networking
- From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis
Manager, Oracle Data Mining Tools) to invisible data mining

Questions:

1. Explain KDD process in detail.


2. Explain different functionalities of Data Mining.
3. Describe different data mining task primitives.
4. What are the different issues in data mining? Explain.
5. Write a note on:
a. Data Mining
b. Data Mining Applications

xxxXxxx

Unit – IV
Mining Association Rules
Objectives:
- To learn basic concepts of association rule mining
- To study the data support and confidence in frequent itemsets
- To study Apriori algorithm to retrieve the most frequent itemsets from a large data
set and its applicability in business

4 Basic Concepts
Association Rule
Association rules are used to show relationships between data items. They are based
on frequent patterns. A frequent pattern is a set of items, subsequences,
substructures, etc. that occurs frequently in a data set. Frequent patterns raise
questions such as:
What products were often purchased together?
- Beer and diapers, Milk and Butter/Bread
What products are purchased one after the other?
- PC followed by digital camera
- TV set followed by VCD player
▪ Is there a structure defining relationships in the items purchased?
▪ What kinds of DNA are sensitive to this new drug?
▪ Can we automatically classify web documents?
To answer these questions, we need the concept of an association, which exhibits a
relationship between two items.

Association rule mining:


Association rule mining finds frequent patterns, associations, correlations, or
causal structures among sets of items or objects in transaction databases,
relational databases, and other information repositories. It is applicable in
basket data analysis, cross-marketing, catalog design, clustering, classification,
shopping-mall shelf arrangements, etc.

4.1 Market Basket Analysis


Frequent itemset mining leads to the discovery of associations and correlations
among items in large data sets. Data is continuously uploaded into the data
warehouse from different data marts and other data sources, and organizations are
interested in mining different interesting patterns from their databases. The
discovery of such interesting patterns is helpful in the business decision-making
process: it supports catalog design, cross-marketing, and customer shopping
behavior analysis.

It is also helpful for season-wise selling of products, placing products on
shopping-mall shelves so as to attract the customers' buying behavior. Market
basket analysis also provides a place where 'non-moving items' can be given good
treatment, so they can be sold with an offer.

Market basket analysis studies customer buying habits by finding associations
between the different items that customers place in their "shopping baskets"
(Figure 4.1). The discovery of these associations can help retailers develop
marketing strategies by identifying the items that are frequently purchased
together by customers. A typical example of frequent itemset mining is market
basket analysis. (Jiawei Han, 2012)

Figure 4.1 Market basket analysis (Jiawei Han, 2012)

For instance, if customers are buying milk, how likely are they to also buy bread (and
what kind of bread) on the same trip to the supermarket? This information can lead to
increased sales by helping retailers do selective marketing and plan their shelf space.
(Jiawei Han, 2012)

4.2 Mining multilevel association rules from transactional databases


Multilevel association rules can be defined as applying association rules over different
levels of data abstraction

Figure 4.2 Multi level Association Rule

4.2.1 Steps to perform multilevel association rule mining on a transactional
database:
Step 1: Consider the frequent itemsets.
Step 2: Arrange the items in hierarchical form.
Step 3: Find the items at the lower levels (expected to have lower support).
Step 4: Apply association rules to the frequent itemsets.
Step 5: Use appropriate methods to identify the frequent itemsets.
Note: Support is categorized into two types

Uniform Support: the same minimum support for all levels

Figure 4.3 Uniform Support

Reduced Support: reduced minimum support at lower levels

Figure 4.4: Reduced Support
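The effect of counting support at different hierarchy levels can be sketched by generalizing items to their parent category before counting. The taxonomy and transactions below are hypothetical; the point is only that support grows as items are rolled up the hierarchy.

```python
# Hypothetical item taxonomy: leaf item -> higher-level category.
taxonomy = {
    "skim milk": "milk",
    "2% milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
}

transactions = [
    {"skim milk", "wheat bread"},
    {"2% milk"},
    {"skim milk"},
    {"white bread"},
]

def support_count(item, transactions):
    """Number of transactions containing the given item."""
    return sum(1 for t in transactions if item in t)

def generalize(transactions, taxonomy):
    """Replace each leaf item with its parent category (one level up)."""
    return [{taxonomy.get(i, i) for i in t} for t in transactions]

high = generalize(transactions, taxonomy)
print(support_count("skim milk", transactions))  # 2  (lower level: lower support)
print(support_count("milk", high))               # 3  (higher level: higher support)
```

This is why uniform minimum support across levels can miss low-level itemsets, motivating the reduced-support strategy above.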

4.2.2 Mining multidimensional association rules from Relational databases and data
warehouse
Multi-dimensional association rule: a rule that contains two or more
predicates/dimensions. We can perform the following kinds of association rules on a
relational database or data warehouse:
▪ Boolean dimensional association rule:
Boolean dimensional association rule can be defined as comparing existing
predicates/dimensions with non-existing predicates/dimensions
▪ Single dimensional association rule:
A single-dimensional association rule can be defined as a rule that contains only
a single predicate/dimension.

Multi-dimensional association rules can be applied to different types of
attributes, which are:
▪ Categorical Attributes: finite number of possible values, no ordering among
values
▪ Quantitative Attributes: numeric, implicit ordering among values
▪ Relational database can be viewed in the form of table. We can easily perform
concept hierarchy. The concept hierarchy is also termed as generalization. It can be
used to retrieve the frequent item sets. The item sets hierarchy can be obtained and
represented in a tree structure format.
▪ Generalization: replacing low level attributes with high level attributes called
generalization
▪ Data warehouse can be viewed in the form of multidimensional data model (uses
data cubes) in order to find out the frequent patterns.

4.3 From association mining to correlation analysis:


Association mining can be defined as finding frequent patterns, associations, correlations,
or causal structures among sets of items or objects in transaction databases, relational
databases, and other information repositories.
Association mining can be extended to correlation analysis based on interestingness
measures of frequent items:
▪ among frequent items, we perform correlation analysis
▪ correlation analysis means one frequent item is dependent on another frequent
item.

4.4 Constraint-based association mining:


A constraint-based association mining can be defined as applying different types of
constraints on different types of knowledge:
a. Knowledge type constraint
Based on classification, association, we are applying the knowledge type
constraints.
b. Data constraint
It is a basis for identifying the products value in certain duration
Ex: Find the products which sold together in the year, month
c. Dimension/level constraints
in relevance to region, price, brand, customer category
d. Rule constraints
On the form of the rules to be mined (e.g., number of predicates, etc.)
For example: whether giving an offer on some products affects target sales or not.
e. Interestingness constraints
▪ Thresholds on measures of interestingness
▪ Identification of products on customer buying behavior, style
▪ Strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
4.5 Apriori Algorithm

Find frequent itemsets using an iterative level-wise approach based on candidate
generation. (Jiawei Han, 2012)
Input:
    D, a database of transactions;
    min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.

Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database D do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_sup;
end
return L = ∪k Lk;

Algorithm Explanation:

- Apriori employs an iterative approach known as a level-wise search, where k-itemsets


are used to explore (k+1)-itemsets.
- First, the set of frequent 1-itemsets is found by scanning the database to accumulate
the count for each item, and collecting those items that satisfy minimum support.
- The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-
itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be
found.
- The finding of each Lk requires one full scan of the database.
- A two-step process is followed in Apriori consisting of join and prune action.

Example:
Consider the following list of items for the given transactions (Jiawei Han, 2012)

TID List of item IDs


T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3
Table 4.3: Database Transaction (Jiawei Han, 2012)

There are nine transactions in this database, that is, |D| = 9.

Steps:
1. In the first iteration of the algorithm, each item is a member of the set of
candidate 1-itemsets, C1.
The algorithm simply scans all of the transactions in order to count the number of
occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set
of frequent 1-itemsets, L1, can then be determined.
It consists of the candidate 1-itemsets satisfying minimum support. In our example,
all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1
to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2
during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those
candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3: from the join step, we first
get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5},
{I2, I4, I5}}.
Based on the Apriori property that all subsets of a frequent item set must also be
frequent, we can determine that the four latter candidates cannot possibly be
frequent.
7. The transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. (Jiawei
Han, 2012)

Illustration

Figure 4.5: Apriori Algorithm Implementation (Jiawei Han, 2012)

Example:
Consider the following data on which Apriori algorithm can be applied to find the association rule
for frequent item set.

Sample Data D

Transaction ID Items
100 Bread, Milk, Butter
200 Bread, Butter
300 Bread, Eggs
400 Milk, Biscuit, Chips

Step 1: Scan data D for count of each candidate

Item set Support


{Bread} 3
{Milk} 2
{Butter} 2
{Biscuit} 1
{Chips} 1
{Eggs} 1

Step 2: Compare candidate support count with minimum support (50%)

Item set Support


{Bread} 3
{Milk} 2
{Butter} 2

Step 3: Generate candidate from above table & its support

Item set Support


{Bread, Milk} 1
{Bread, Butter} 2
{Milk, Butter} 1

Step 4: Compare candidate with minimum support

Item set Support


{Bread, Butter} 2

Step 5: So the data contains frequent item set {Bread, Butter}


Therefore, the association rule can be set as Bread->Butter or Butter->Bread
It can be decided then that these are the frequent items purchased by the customers in a
combination and accordingly the marketing strategy can be decided for the store.
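The level-wise candidate generation described above can be sketched in a few lines of code. The sketch below runs the same join, prune, and scan steps on the sample data D with a minimum support count of 2, and recovers the frequent itemset {Bread, Butter}:

```python
from itertools import combinations

# Sketch of Apriori on the sample data above (minimum support count = 2).
transactions = [
    {"Bread", "Milk", "Butter"},
    {"Bread", "Butter"},
    {"Bread", "Eggs"},
    {"Milk", "Biscuit", "Chips"},
]
MIN_SUP = 2

def frequent_itemsets(db, min_sup):
    # L1: frequent 1-itemsets from one scan of the database
    counts = {}
    for t in db:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    all_frequent = dict((s, counts[s]) for s in Lk)
    k = 1
    while Lk:
        # Join step: Lk joined with itself to form candidate (k+1)-itemsets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: drop candidates having an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Scan the database to count the surviving candidates
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        all_frequent.update((c, counts[c]) for c in Lk)
        k += 1
    return all_frequent

for itemset, sup in sorted(frequent_itemsets(transactions, MIN_SUP).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Running this reproduces the steps above: L1 = {Bread: 3, Milk: 2, Butter: 2}, and of the 2-itemset candidates only {Bread, Butter} reaches the minimum support count of 2.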

4.6 FP Tree Growth Algorithm

FP-growth (frequent-pattern growth) is a method for retrieving frequent patterns that
is based on a divide-and-conquer strategy. (Jiawei Han, 2012)
▪ First, it compresses the database representing frequent items into a frequent pattern
tree, or FP-tree, which has (stores) the itemset association information.
▪ It then divides the compressed database into a set of conditional databases; each
associated with one frequent item or “pattern fragment,” and mines each database
separately.
▪ For each “pattern fragment,” only its associated data sets need to be examined.
▪ This approach reduces the size of the data sets to be searched, along with the “growth”
of patterns being examined.
▪ This process can be recursively applied to any projected database if its FP-tree still
cannot fit in main memory.

Example:
We reexamine the mining of transaction database, D, of Table 4.3 using the frequent
pattern growth approach. An FP-tree is then constructed as follows:

First, create the root of the tree, labeled with “null.” Scan database D a second time.
The items in each transaction are processed in L order (i.e., sorted according to
descending support count), and a branch is created for each transaction.

For example, the scan of the first transaction, “T100: I1, I2, I5,” which contains three
items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with
three nodes, [I2: 1], [I1: 1], and [I5: 1], where I2 is linked as a child to the root, I1 is
linked to I2, and I5 is linked to I1.

The second transaction, T200, contains the items I2 and I4 in L order, which would result
in a branch where I2 is linked to the root and I4 is linked to I2. This branch would share a
common prefix, I2, with the existing path for T100.

Therefore, we instead increment the count of the I2 node by 1, and create a new node, [I4:
1], which is linked as a child to [I2: 2].

In general, when considering the branch to be added for a transaction, the count of each
node along a common prefix is incremented by 1, and nodes for the items following the
prefix are created and linked accordingly.

Figure 4.6 FP Tree (Jiawei Han, 2012)

To facilitate tree traversal, an item header table is built so that each item points to its
occurrences in the tree via a chain of node-links.
The tree obtained after scanning all the transactions is shown in Figure 4.6 with the
associated node-links.

In this way, the problem of mining frequent patterns in databases is transformed into that
of mining the FP-tree. (Jiawei Han, 2012)
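The two database scans and the prefix-sharing insertion described above can be sketched as follows. The code builds the FP-tree for the transactions of Table 4.3 with a minimum support count of 2 (node-links and the header table are omitted to keep the sketch short):

```python
from collections import Counter

# Sketch: building the FP-tree of Table 4.3 (minimum support count = 2).
# Each node keeps an item name, a count, and its children; header-table
# node-links are omitted for brevity.

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
    ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]
MIN_SUP = 2

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

# Scan 1: support counts, then the list L sorted by descending support
support = Counter(i for t in transactions for i in t)
L = [i for i, c in support.most_common() if c >= MIN_SUP]

# Scan 2: insert each transaction in L order, sharing common prefixes
root = Node(None)
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in L), key=L.index):
        child = node.children.setdefault(item, Node(item))
        child.count += 1   # increment along the shared prefix
        node = child

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"[{child.item}: {child.count}]")
        show(child, depth + 1)

show(root)
```

The resulting tree matches Figure 4.6: the root has a branch starting [I2: 7] with children [I1: 4], [I4: 1], and [I3: 2], plus a second branch [I1: 2] → [I3: 2].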

FP Tree Mining
The FP-tree is mined as follows. Start from each frequent length-1 pattern (as an initial
suffix pattern), construct its conditional pattern base (a “sub-database,” which consists
of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then
construct its (conditional) FP-tree, and perform mining recursively on the tree.

The pattern growth is achieved by the concatenation of the suffix pattern with the frequent
patterns generated from a conditional FP-tree. Mining of the FP-tree is summarized in
Table 4.4 and detailed as follows.

Table 4.4: FP Tree Mining

We first consider I5, which is the last item in L, rather than the first. The reason for
starting at the end of the list will become apparent as we explain the FP-tree mining
process.

I5 occurs in two FP-tree branches of Figure 4.7. (The occurrences of I5 can easily be
found by following its chain of node-links.)

The paths formed by these branches are [I2, I1, I5: 1] and [I2, I1, I3, I5: 1]. Therefore,
considering I5 as a suffix, its corresponding two prefix paths are [I2, I1: 1] and [I2, I1, I3:
1], which form its conditional pattern base.

Using this conditional pattern base as a transaction database, we build an I5-conditional


FP-tree, which contains only a single path, [I2: 2, I1: 2]; I3 is not included because its
support count of 1 is less than the minimum support count. The single path generates all
the combinations of frequent patterns: [I2, I5: 2], [I1, I5: 2], [I2, I1, I5: 2].

For I4, its two prefix paths form the conditional pattern base, {[I2 I1: 1], [I2: 1]}, which
generates a single-node conditional FP-tree, [I2: 2], and derives one frequent pattern, [I2,
I4: 2].

Figure 4.7 FP Tree Mining (Jiawei Han, 2012)

Similar to the preceding analysis, I3’s conditional pattern base is {[I2, I1: 2], [I2: 2], [I1:
2]}. Its conditional FP-tree has two branches, [I2: 4, I1: 2] and [I1: 2], as shown in Figure
4.7, which generates the set of patterns {[I2, I3: 4], [I1, I3: 4], [I2, I1, I3: 2]}.

Finally, I1’s conditional pattern base is [I2: 4], with an FP-tree that contains only one
node, [I2: 4], which generates one frequent pattern, [I2, I1: 4]. This mining process is
summarized in the algorithm. (Jiawei Han, 2012)

Algorithm: FP growth

Input:
- D, a transaction database;
- Min_sup, the minimum support count threshold.

Output: The complete set of frequent patterns.

Method:
1. The FP-tree is constructed in the following steps:
a. Scan the transaction database D once. Collect F, the set of frequent items, and
their support counts.
b. Sort F in support count descending order as L, the list of frequent items.
c. Create the root of an FP-tree, and label it as “null.”

For each transaction Trans in D do the following:

Select and sort the frequent items in Trans according to the order of L.

Let the sorted frequent item list in Trans be [p|P], where p is the first element and P is
the remaining list.

Call insertTree ([p|P], T), which is performed as follows:


If T has a child N such that N.item_name = p.item_name, then
increment N’s count by 1;
else create a new node N, and let its count be 1, its parent link be linked to T,
and its node-link be linked to the nodes with the same item_name via the
node-link structure.
If P is nonempty, call insertTree (P, N) recursively.

2. The FP-tree is mined by calling FP_growth(FP-tree, null), which is implemented as
follows.
Procedure FP_growth(Tree, α)
if Tree contains a single path P then
for each combination (denoted as β) of the nodes in the path P
generate pattern β ∪ α with support_count = minimum support count of nodes in
β;
else for each ai in the header of Tree
{
generate pattern β = ai ∪ α with support_count = ai.support_count;
construct β’s conditional pattern base and then β’s conditional FP_tree Treeβ ;
if Treeβ ≠ ∅; then
call FP_growth(Treeβ , β);
}

Questions:

1. Explain Market Basket analysis with example.


2. Explain Apriori Algorithm.
3. Explain association rule mining.
4. Explain FP Tree growth algorithm.
5. Write a note on:
a. Constraint-based association mining
b. Association mining to correlation analysis

xxxXxxx

Unit – V
Classification and Prediction

Objectives:
- To learn basic concepts of classification and prediction
- To study supervised and unsupervised learning, data classification technique
- To study rule-based classification technique and decision tree induction
- To study different classification algorithms

5.0 Introduction:
Classification and prediction are two forms of data analysis. It can be used to extract
models describing important data classes or to predict future data trends. Classification
predicts categorical (discrete, unordered) labels, prediction models continuous valued
functions.
For example, we can build a classification model to categorize bank loan applications as
either safe or risky, or a prediction model to predict the expenditures of potential customers
on computer equipment given their income and occupation.

A predictor is constructed that predicts a continuous-valued function, or ordered value, as


opposed to a categorical label. Regression analysis is a statistical methodology that is most
often used for numeric prediction. Many classification and prediction methods have been
proposed by researchers in machine learning, pattern recognition, and statistics.

Most algorithms are memory resident, typically assuming a small data size. Recent data
mining research has built on such work, developing scalable classification and prediction
techniques capable of handling large disk-resident data.

5.0.1 Classification—A Two-Step Process


a. Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined
by the class label attribute
• The set of tuples used for model construction: training set
• The model is represented as classification rules, decision trees, or
mathematical formulae

Figure 5.1 Model Construction (Jiawei Han, 2012)

b. Model usage: for classifying future or unknown objects


• Estimate accuracy of the model
• The known label of test sample is compared with the classified result from
the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set, otherwise over-fitting will occur

Figure 5.2 Model Usage (Jiawei Han, 2012)

5.0.2 Supervised vs. Unsupervised Learning


a. Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set

b. Unsupervised learning (clustering)
• The class labels of training data are unknown
• Given a set of measurements, observations, etc. with the aim of establishing
the existence of classes or clusters in the data

5.1 Preparing Data for Classification and Prediction:


The following pre-processing steps may be applied to the data to help improve the
accuracy, efficiency, and scalability of the classification or prediction process.

a. Data cleaning:
This refers to the pre-processing of data in order to remove or reduce noise (by
applying smoothing techniques) and the treatment of missing values (e.g., by replacing
a missing value with the most commonly occurring value for that attribute, or with the
most probable value based on statistics). Although most classification algorithms have
some mechanism for handling noisy or missing data, this step can help reduce
confusion during learning.

b. Relevance analysis:
Many of the attributes in the data may be redundant. Correlation analysis can be used
to identify whether any two given attributes are statistically related.
For example, a strong correlation between attributes A1 and A2 would suggest that one
of the two could be removed from further analysis.

A database may also contain irrelevant attributes. Attribute subset selection can be used
in these cases to find a reduced set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original distribution
obtained using all attributes.

Relevance analysis, in the form of correlation analysis and attribute subset selection,
can be used to detect attributes that do not contribute to the classification or prediction
task. Such analysis can help improve classification efficiency and scalability.

c. Data Transformation and Reduction


The data may be transformed by normalization. Normalization involves scaling all
values for a given attribute so that they fall within a small specified range, such as -1 to
+1 or 0 to 1. The data can also be transformed by generalizing it to higher-level
concepts. Concept hierarchies may be used for this purpose. This is particularly useful
for continuous valued attributes.

For example, numeric values for the attribute income can be generalized to discrete
ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can
be generalized to higher-level concepts, like city.

Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principal components analysis to discretization techniques, such as
binning, histogram analysis, and clustering.

5.2 Comparing Classification and Prediction Methods:


a. Accuracy:
The accuracy of a classifier refers to the ability of a given classifier to correctly
predict the class label of new or previously unseen data (i.e., tuples without class
label information).
The accuracy of a predictor refers to how well a given predictor can guess the value
of the predicted attribute for new or previously unseen data.
b. Speed:
This refers to the computational costs involved in generating and using the given
classifier or predictor.
c. Robustness:
This is the ability of the classifier or predictor to make correct predictions given
noisy data or data with missing values.
d. Scalability:
This refers to the ability to construct the classifier or predictor efficiently given
large amounts of data.
e. Interpretability:
This refers to the level of understanding and insight that is provided by the
classifier or predictor. Interpretability is subjective and therefore more difficult to
assess.

5.3 Classification by Decision Tree Induction


Decision tree induction is the learning of decision trees from class-labelled training
tuples. A decision tree is a flowchart-like tree structure, where each internal node
(non - leaf node) denotes a test on an attribute, each branch represents an outcome of
the test, and each leaf node (or terminal node) holds a class label. The topmost node in
a tree is the root node. A typical decision tree is shown in Figure 5.3.

Figure 5.3: A decision tree for the concept buys computer


Each internal (non-leaf) node represents a test on an attribute. Each leaf node represents
a class (either buys_computer = yes or buys_computer = no), indicating whether the
customer is likely to purchase a computer. Internal nodes are denoted by rectangles, and
leaf nodes are denoted by ovals. Decision trees are the basis of several commercial rule
induction systems.
Decision tree generation consists of:
a. Tree construction
▪ At start, all the training examples are at the root
▪ Partition examples recursively based on selected attributes
b. Tree pruning
▪ Identify and remove branches that reflect noise or outliers
c. Use of decision tree:
▪ Classifying an unknown sample
▪ Test the attribute values of the sample against the decision tree

5.4 Algorithm for Decision Tree Induction


a. Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized
in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
b. Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
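The heuristic mentioned in the basic algorithm, information gain, can be sketched directly. The code below computes the gain of splitting on one attribute; the tiny labeled sample is an illustrative assumption (Table 5.1 in the text does not list the class labels):

```python
from math import log2
from collections import Counter

# Sketch: the information-gain heuristic used to select a split attribute.
# The tiny labeled sample below is an illustrative assumption; the class
# labels are not taken from the text.

rows = [  # (age, class)
    ("<=30", "no"), ("<=30", "no"), ("31...40", "yes"),
    (">40", "yes"), (">40", "yes"), (">40", "no"),
]

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows):
    labels = [cls for _, cls in rows]
    before = entropy(labels)
    # Expected information remaining after splitting on the attribute
    after = 0.0
    for value in {v for v, _ in rows}:
        subset = [cls for v, cls in rows if v == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

print(round(info_gain(rows), 3))
```

On this sample the class entropy before splitting is 1.0 bit (3 yes, 3 no), and splitting on age leaves about 0.459 bits, so the gain is about 0.541; the attribute with the highest gain is chosen as the test attribute.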

5.5 Extracting Classification Rules from Trees


- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand

Example
Consider the following table, which contains the age group, income, student status, and
credit rating attributes used to decide whether a customer buys a computer system.

Age Income Student Credit Ratings
<=30 High No Fair
<=30 High No Excellent
31…40 High No Fair
>40 Medium No Fair
>40 Low Yes Fair
>40 Low Yes Excellent
31…40 Low Yes Excellent
<=30 Medium No Fair
<=30 Low Yes Fair
>40 Medium Yes Fair
<=30 Medium Yes Excellent
31…40 Medium No Excellent
31…40 High Yes Fair
>40 Medium No Excellent
Table 5.1 Computer Buying Conditions

IF age = “<=30” AND student = “no” THEN buys_computer = “no”


IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”

5.6 Avoid Overfitting in Classification


a. The generated tree may overfit the training data
- Too many branches, some may reflect anomalies due to noise or outliers
- Result is in poor accuracy for unseen samples
b. Two approaches to avoid over fitting
• Pre pruning:
i. Halt tree construction early—do not split a node if this would result
in the goodness measure falling below a threshold
ii. Difficult to choose an appropriate threshold
• Post pruning:
i. Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
ii. Use a set of data different from the training data to decide which is
the “best pruned tree”

5.7 Bayesian Classification:


▪ Bayesian classifiers are statistical classifiers. It can predict class membership
probabilities, such as the probability that a given tuple belongs to a particular class.
▪ Probabilistic learning: Calculate explicit probabilities for hypothesis, among the
most practical approaches to certain types of learning problems
▪ Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct. Prior knowledge can be combined with
observed data.
▪ Probabilistic prediction: Predict multiple hypotheses, weighted by their
probabilities
▪ Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured
▪ Bayesian classification is based on Bayes’ theorem.

5.8 Bayes’ Theorem:


▪ Let X be a data tuple. In Bayesian terms, X is considered "evidence" and it is
described by measurements made on a set of n attributes.
▪ Let H be some hypothesis, such as that the data tuple X belongs to a specified class
C.
▪ For classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the "evidence" or observed data tuple X.
▪ P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on
X.
▪ Bayes’ theorem is useful in that it provides a way of calculating the posterior
probability, P(H|X), from P(H), P(X|H), and P(X).
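The calculation P(H|X) = P(X|H) P(H) / P(X) is a one-liner. The probabilities below are illustrative assumptions, with H = "customer buys a computer" and X = "customer is a student":

```python
# Sketch: Bayes' theorem P(H|X) = P(X|H) * P(H) / P(X) on assumed numbers.
# H: "customer buys a computer"; X: "customer is a student".
# All three probabilities are illustrative assumptions.

p_h = 0.5          # prior P(H)
p_x_given_h = 0.6  # likelihood P(X|H)
p_x = 0.4          # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x   # posterior P(H|X)
print(p_h_given_x)  # 0.75
```

Observing that the customer is a student raises the probability of a purchase from the prior 0.5 to the posterior 0.75.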

5.9 Naïve Bayesian Classification:


The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
▪ Let D be a training set of tuples and their associated class labels. As usual,
each tuple is represented by an n-dimensional attribute vector, X = (x1, x2,
…,xn), depicting n measurements made on the tuple from n attributes,
respectively, A1, A2, …, An.

▪ Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the
classifier will predict that X belongs to the class having the highest posterior
probability, conditioned on X. That is, the naïve Bayesian classifier predicts
that tuple X belongs to the class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.

▪ Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is
called the maximum posteriori hypothesis. By Bayes’ theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X).

▪ As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If
the class prior probabilities are not known, then it is commonly assumed
that the classes are equally likely, that is, P(C1) = P(C2) = …= P(Cm), and
we would therefore maximize P(X|Ci). Otherwise, we maximize
P(X|Ci)P(Ci).
▪ Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). In order to reduce computation in evaluating
P(X|Ci), the naive assumption of class-conditional independence is made.
This presumes that the values of the attributes are conditionally independent
of one another, given the class label of the tuple. Thus,

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).

▪ We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), : : : , P(xn|Ci)


from the training tuples. For each attribute, we look at whether the attribute
is categorical or continuous-valued.

▪ For instance, to compute P(X|Ci), we consider the following:


• If Ak is categorical, then P(xk|Ci) is the number of tuples of class
Ci in D having the value xk for Ak, divided by |Ci,D|, the number
of tuples of class Ci in D.
• If Ak is continuous-valued, then we need to do a bit more work,
but the calculation is pretty straightforward.
▪ A continuous-valued attribute is typically assumed to have a Gaussian
distribution with mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))

▪ In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each
class Ci. The classifier predicts that the class label of tuple X is the class Ci if
and only if

P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
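For categorical attributes the whole procedure reduces to counting. The sketch below trains a naïve Bayesian classifier on a tiny (age, student) sample; the training tuples are illustrative assumptions, not the full table from the text:

```python
from collections import Counter, defaultdict

# Sketch: naive Bayes for categorical attributes. Each training tuple is
# (attribute-value tuple, class label); the tiny training set is an
# illustrative assumption. (A real implementation would add Laplace
# smoothing to avoid zero counts.)

train = [  # (age, student) -> buys_computer
    (("<=30", "no"), "no"), (("<=30", "yes"), "yes"),
    (("31...40", "no"), "yes"), ((">40", "yes"), "yes"),
    ((">40", "no"), "no"), (("<=30", "no"), "no"),
]

priors = Counter(cls for _, cls in train)   # counts for P(Ci)
cond = defaultdict(Counter)                 # counts for P(xk|Ci)
for x, cls in train:
    for k, value in enumerate(x):
        cond[cls][(k, value)] += 1

def predict(x):
    n = len(train)
    best_cls, best_p = None, -1.0
    for cls, prior in priors.items():
        # P(X|Ci)P(Ci) under class-conditional independence
        p = prior / n
        for k, value in enumerate(x):
            p *= cond[cls][(k, value)] / prior
        if p > best_p:
            best_cls, best_p = cls, p
    return best_cls

print(predict(("<=30", "yes")))  # a young student -> "yes"
```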

5.10 Bayesian Belief Network:


▪ The backpropagation algorithm performs learning on a multilayer feed-
forward neural network. (Note that this network model is distinct from a
Bayesian belief network, which is a directed acyclic graph of probabilistic
dependencies among variables.)
▪ It iteratively learns a set of weights for prediction of the class label of tuples.
▪ A multilayer feed-forward neural network consists of an input layer, one or
more hidden layers, and an output layer.

Example:
▪ The inputs to the network correspond to the attributes measured for each
training tuple. The inputs are fed simultaneously into the units making up
the input layer. These inputs pass through the input layer and are then
weighted and fed simultaneously to a second layer known as a hidden layer.
▪ The outputs of the hidden layer units can be input to another hidden layer,
and so on. The number of hidden layers is arbitrary.

Figure 5.4 Neural Network

▪ The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network’s prediction for given tuples

5.11 Rule Based Classification


5.11.1 Using IF-THEN Rules for Classification
▪ Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
▪ Rule antecedent/precondition vs. rule consequent
▪ Assessment of a rule: coverage and accuracy
ncovers = # of tuples covered by R
ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
▪ If more than one rule is triggered, need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that
has the “toughest” requirement (i.e., with the most attribute tests)
– Class-based ordering: decreasing order of prevalence or
misclassification cost per class
– Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
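The coverage and accuracy formulas above can be evaluated directly. The sketch below scores the example rule R on a small labeled data set; the sample tuples are illustrative assumptions:

```python
# Sketch: coverage and accuracy of an IF-THEN rule over a labeled data set.
# Tuples are (age, student, buys_computer); the rows are illustrative.

D = [
    ("youth", "yes", "yes"), ("youth", "no", "no"),
    ("youth", "yes", "no"), ("senior", "yes", "yes"),
    ("senior", "no", "no"),
]

# R: IF age = youth AND student = yes THEN buys_computer = yes
covered = [t for t in D if t[0] == "youth" and t[1] == "yes"]
correct = [t for t in covered if t[2] == "yes"]

coverage = len(covered) / len(D)        # n_covers / |D|
accuracy = len(correct) / len(covered)  # n_correct / n_covers
print(coverage, accuracy)  # 0.4 0.5
```

R covers 2 of the 5 tuples (coverage 0.4) and classifies 1 of those 2 correctly (accuracy 0.5).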

5.11.2 Rule Extraction from a Decision Tree
▪ Rules are easier to understand than large trees
▪ One rule is created for each path from the root to a leaf
▪ Each attribute-value pair along a path forms a conjunction: the leaf holds
the class prediction
▪ Rules are mutually exclusive and exhaustive

Figure 5.5 Rule Extraction

▪ Example:
Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = yes
IF age = old AND credit_rating = fair THEN buys_computer = no

5.11.3 Rule Extraction from the Training Data


▪ Sequential covering algorithm: Extracts rules directly from training data
▪ Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
▪ Rules are learned sequentially, each for a given class Ci will cover many
tuples of Ci but none (or few) of the tuples of other classes
▪ Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rules are
removed
– The process repeats on the remaining tuples unless termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold
▪ Compared with decision-tree induction, which learns a set of rules
simultaneously, sequential covering learns rules one at a time

Questions:

1. Explain supervised and unsupervised classification.


2. Explain decision tree induction with example.
3. Describe rule-based classification with example.
4. Explain different applications of Prediction in data mining.
5. Write a note on:
a. Attribute selection measure
b. Tree pruning

xxxXxxx

Unit – VI
Cluster Analysis
Objectives:
- To learn basic concepts of data clustering and its classification, model
- To study the different applications of cluster analysis
- To study data clustering methods and its applicability in data warehouse

6 Introduction:
Cluster analysis or clustering is the process of partitioning a set of data objects into
subsets. Each subset is a cluster, such that objects in a cluster are similar to one another,
yet dissimilar to objects in other clusters. The set of clusters resulting from a cluster
analysis can be referred to as a clustering. It is the discovery of previously unknown
groups in the data, and it is also known as unsupervised learning. In clustering we do not
start with predefined groups; objects are grouped by their similarity, and dissimilarity
separates one cluster from another. Because a cluster is a collection of
data objects that are similar to one another within the cluster and dissimilar to objects in
other clusters, a cluster of data objects can be treated as an implicit class. In this sense,
clustering is sometimes called automatic classification.

Clustering is also called data segmentation in some applications because clustering


partitions large data sets into groups according to their similarity. Clustering can also be
used for outlier detection, where outliers (values that are “far away” from any cluster)
may be more interesting than common cases. Applications of outlier detection include the
detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.

For example, exceptional cases in credit card transactions, such as very expensive and
infrequent purchases, may be of interest as possible fraudulent activities.

In data mining, it can be used for data distribution based on characteristics of cluster for
detailed analysis. It can be used as pre-processing step for the algorithms where data
characterization, distribution can be retrieved. It can be used for identification of different
outliers.

As a branch of statistics, cluster analysis has been extensively studied, with the main focus
on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids,
and several other methods also have been built into many statistical analysis software
packages or systems, such as S-Plus, SPSS, and SAS.
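The k-means method mentioned above repeats two steps: assign each object to its nearest cluster center, then recompute each center as the mean of its cluster. A minimal sketch on one-dimensional points (the data values and initial centers are illustrative assumptions):

```python
# Sketch: the k-means method on 1-D points (k = 2).
# Data points and initial centers are illustrative assumptions.

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
k = 2
centers = [1.0, 9.0]   # initial centers (often chosen at random)

for _ in range(10):    # iterate: assign points, then recompute means
    clusters = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    centers = [sum(c) / len(c) for c in clusters]

print(centers)  # [1.5, 8.5]
```

The two natural groups (around 1.5 and around 8.5) are recovered after the first iteration; further iterations leave the centers unchanged.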

In machine learning, classification is known as supervised learning because the class label
information is given, that is, the learning algorithm is supervised in that it is told the class
membership of each training tuple. Clustering is known as unsupervised learning because
the class label information is not present. For this reason, clustering is a form of learning
by observation, rather than learning by examples.

In data mining, efforts have focused on finding methods for efficient and effective cluster
analysis in large databases. Active themes of research focus on the scalability of clustering
methods, the effectiveness of methods for clustering complex shapes (e.g., nonconvex) and
types of data (e.g., text, graphs, and images), high-dimensional clustering techniques (e.g.,
clustering objects with thousands of features), and methods for clustering mixed numerical
and nominal data in large databases. (Jiawei Han, 2012)

6.1 Clustering Applications


Data clustering is under vigorous development. Contributing areas of research include data
mining, statistics, machine learning, spatial database technology, information retrieval,
Web search, biology, marketing, and many other application areas.

▪ Marketing: Help marketers discover distinct groups in their customer bases, and then
use this knowledge to develop targeted marketing programs
▪ Land use: Identification of areas of similar land use in an earth observation database
▪ Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
▪ City-planning: Identifying groups of houses according to their house type, value, and
geographical location
▪ Earthquake studies: Observed earthquake epicenters should be clustered along
continental faults
▪ Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing.
▪ In business, clustering can help marketers discover distinct groups in their customer
bases and characterize customer groups based on purchasing patterns.
▪ In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
▪ Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
▪ Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
▪ Clustering can also be used for outlier detection. Applications of outlier detection
include the detection of credit card fraud and the monitoring of criminal activities in
electronic commerce.

Clustering can be used to organize the search results into groups and present the results in a
concise and easily accessible way. Clustering techniques have been developed to cluster
documents into topics, which are commonly used in information retrieval practice.

6.2 Requirements of Cluster Analysis


Clustering is a challenging research field. In this section, you will learn about the
requirements for clustering as a data mining tool, as well as aspects that can be used for
comparing clustering methods. The following are typical requirements of clustering in
data mining.

a. Scalability:
Many clustering algorithms work well on small data sets; a large database, however,
may contain millions of objects, particularly in Web search scenarios.
Clustering on only a sample of a given large data set may lead to biased results.
Therefore, highly scalable clustering algorithms are needed.

b. Ability to deal with different types of attributes:


Many algorithms are designed to cluster numeric (interval-based) data. However,
applications may require clustering other types of objects, such as binary, nominal
(categorical), and ordinal data, or mixtures of these data types. Applications may
also involve complex data types such as graphs, sequences, images, and documents.

c. Discovery of clusters with arbitrary shape:


Many clustering algorithms determine clusters based on Euclidean or Manhattan
distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. However, a cluster could be of any
shape. Cluster analysis on sensor readings can detect interesting phenomena. We
may want to use clustering to find the frontier of a running forest fire, which is
often not spherical. It is important to develop algorithms that can detect clusters of
arbitrary shape.
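To make the two distance measures mentioned above concrete, here is a minimal Python sketch; the sample coordinates are invented for illustration:

```python
import math

def euclidean(p, q):
    # straight-line distance: square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # city-block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Algorithms built on either measure tend to favor compact, roughly spherical clusters, which is why density-based alternatives are needed for arbitrarily shaped clusters.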

d. Requirements for domain knowledge to determine input parameters:


Many clustering algorithms require users to provide domain knowledge in the form
of input parameters such as the desired number of clusters. Parameters are hard to
determine, especially for high-dimensionality data sets and where users have yet to
grasp a deep understanding of their data. Requiring the specification of domain
knowledge not only burdens users, but also makes the quality of clustering difficult
to control.

e. Ability to deal with noisy data:


Most real-world data sets contain outliers and/or missing, unknown, or erroneous
data. Sensor readings, for example, are often noisy—some readings may be
inaccurate due to the sensing mechanisms, and some readings may be erroneous
due to interferences from surrounding transient objects. Clustering algorithms can
be sensitive to such noise and may produce poor-quality clusters. Therefore, we
need clustering methods that are robust to noise.

f. Incremental clustering and insensitivity to input order:


In many applications, incremental updates (representing newer data) may arrive at
any time. Some clustering algorithms cannot incorporate incremental updates into
existing clustering structures and, instead, have to recompute a new clustering from
scratch. Clustering algorithms may also be sensitive to the input data order. That is,
given a set of data objects, clustering algorithms may return dramatically different
clustering depending on the order in which the objects are presented. Incremental
clustering algorithms and algorithms that are insensitive to the input order are
needed.

g. Capability of clustering high-dimensionality data:
A data set can contain numerous dimensions or attributes. When clustering
documents, for example, each keyword can be regarded as a dimension, and there
are often thousands of keywords. Most clustering algorithms are good at handling
low-dimensional data such as data sets involving only two or three dimensions.
Finding clusters of data objects in a high dimensional space is challenging,
especially considering that such data can be very sparse and highly skewed.

h. Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of
constraints. Suppose that your job is to choose the locations for a given number of
new automatic teller machines (ATMs) in a city. To decide upon this, you may
cluster households while considering constraints such as the city’s rivers and
highway networks and the types and number of customers per cluster. A
challenging task is to find data groups with good clustering behavior that satisfy
specified constraints.

i. Interpretability and usability:


Users want clustering results to be interpretable, comprehensible, and usable. That
is, clustering may need to be tied in with specific semantic interpretations and
applications. It is important to study how an application goal may influence the
selection of clustering features and clustering methods. The following are
orthogonal aspects with which clustering methods can be compared:

j. The partitioning criteria:


In some methods, all the objects are partitioned so that no hierarchy exists among
the clusters. That is, all the clusters are at the same level conceptually. Such a
method is useful, for example, for partitioning customers into groups so that each
group has its own manager. Alternatively, other methods partition data objects
hierarchically, where clusters can be formed at different semantic levels.

For example, in text mining, we may want to organize a corpus of documents into
multiple general topics, such as “politics” and “sports,” each of which may have
subtopics, For instance, “football,” “basketball,” “baseball,” and “hockey” can exist
as subtopics of “sports.” The latter four subtopics are at a lower level in the
hierarchy than “sports.”

k. Separation of clusters:
Some methods partition data objects into mutually exclusive clusters. When
clustering customers into groups so that each group is taken care of by one
manager, each customer may belong to only one group. In some other situations,
the clusters may not be exclusive, that is, a data object may belong to more than one
cluster. For example, when clustering documents into topics, a document may be
related to multiple topics. Thus, the topics as clusters may not be exclusive.

l. Similarity measure:
Some methods determine the similarity between two objects by the distance
between them. Such a distance can be defined on Euclidean space, a road network,
a vector space, or any other space. In other methods, the similarity may be defined
by connectivity based on density or contiguity, and may not rely on the absolute
distance between two objects. Similarity measures play a fundamental role in the
design of clustering methods. While distance-based methods can often take
advantage of optimization techniques, density- and continuity-based methods can
often find clusters of arbitrary shape.

m. Clustering space:
Many of the clustering methods search for clusters within the entire given data
space. These methods are useful for low-dimensionality data sets. With high
dimensional data, however, there can be many irrelevant attributes, which can make
similarity measurements unreliable. Consequently, clusters found in the full space
are often meaningless. It’s often better to instead search for clusters within different
subspaces of the same data set. Subspace clustering discovers clusters and
subspaces (often of low dimensionality) that manifest object similarity.

To conclude, clustering algorithms have several requirements. These factors include


scalability and the ability to deal with different types of attributes, noisy data, incremental
updates, clusters of arbitrary shape, and constraints. Interpretability and usability are also
important. In addition, clustering methods can differ with respect to the partitioning level,
whether or not clusters are mutually exclusive, the similarity measures used, and whether
or not subspace clustering is performed. (VSSUT)

6.3 Clustering Methods


There are many clustering algorithms in the literature. It is difficult to provide a crisp
categorization of clustering methods because these categories may overlap so that a method
may have features from several categories. In general, the major fundamental clustering
methods can be classified into the following categories:

Table 6.1: Overview of clustering methods

6.3.1 Partitioning methods:


▪ A partitioning method constructs k partitions of the data, where each partition
represents a cluster and k <= n. That is, it classifies the data into k groups, which
together satisfy the following requirements:
o Each group must contain at least one object, and
o Each object must belong to exactly one group.
▪ A partitioning method creates an initial partitioning. It then uses an iterative
relocation technique that attempts to improve the partitioning by moving objects
from one group to another.
▪ The general criterion of a good partitioning is that objects in the same cluster
are close or related to each other, whereas objects of different clusters are far
apart or very different. (VSSUT)

6.3.2 Hierarchical methods:


▪ A hierarchical method creates a hierarchical decomposition of the given set of
data objects. A hierarchical method can be classified as being either
agglomerative or divisive, based on how the hierarchical decomposition is
formed.
o The agglomerative approach, also called the bottom-up approach, starts
with each object forming a separate group. It successively merges the
objects or groups that are close to one another, until all of the groups are
merged into one or until a termination condition holds.
o The divisive approach, also called the top-down approach, starts with all
of the objects in the same cluster. In each successive iteration, a cluster
is split up into smaller clusters, until eventually each object is in one
cluster, or until a termination condition holds.
▪ Hierarchical methods suffer from the fact that once a step (merge or split) is
done, it can never be undone. This rigidity is useful in that it leads to smaller
computation costs by not having to worry about a combinatorial number of
different choices. There are two approaches to improving the quality of
hierarchical clustering:
o Perform careful analysis of object "linkages" at each hierarchical
partitioning, such as in Chameleon, or
o Integrate hierarchical agglomeration and other approaches by first using
a hierarchical agglomerative algorithm to group objects into micro
clusters, and then performing macro clustering on the micro clusters
using another clustering method such as iterative relocation. (VSSUT)

6.3.3 Density-based methods:


Most partitioning methods cluster objects based on the distance between
objects. Such methods can find only spherical-shaped clusters and encounter
difficulty in discovering clusters of arbitrary shapes. Other clustering
methods have been developed based on the notion of density. Their general idea
is to continue growing a given cluster as long as the density (number of objects
or data points) in the “neighborhood” exceeds some threshold. For example, for
each data point within a given cluster, the neighborhood of a given radius has to
contain at least a minimum number of points. Such a method can be used to
filter out noise or outliers and discover clusters of arbitrary shape. Density-
based methods can divide a set of objects into multiple exclusive clusters, or a
hierarchy of clusters. Typically, density-based methods consider exclusive
clusters only, and do not consider fuzzy clusters. Moreover, density-based
methods can be extended from full space to subspace clustering. (VSSUT)

6.3.4 Grid-based methods:


Grid-based methods quantize the object space into a finite number of cells that
form a grid structure. All the clustering operations are performed on the grid
structure (i.e., on the quantized space). The main advantage of this approach is
its fast processing time, which is typically independent of the number of data
objects and dependent only on the number of cells in each dimension in the
quantized space. Using grids is often an efficient approach to many spatial data
mining problems, including clustering.

Therefore, grid-based methods can be integrated with other clustering methods


such as density-based methods and hierarchical methods. Some clustering
algorithms integrate the ideas of several clustering methods, so that it is
sometimes difficult to classify a given algorithm as uniquely belonging to only
one clustering method category. Furthermore, some applications may have
clustering criteria that require the integration of several clustering techniques.
(VSSUT)

6.3.5 Model Based Method
Model-based methods hypothesize a model for each of the clusters and find the
best fit of the data to the given model. A model-based algorithm may locate
clusters by constructing a density function that reflects the spatial distribution of
the data points. It also leads to a way of automatically determining the number
of clusters based on standard statistics, taking "noise" or outliers into account
and thus yielding robust clustering methods. (VSSUT)

6.3.6 Constraint Based Method

It is a clustering approach that performs clustering by incorporating user-specified
or application-oriented constraints. A constraint expresses a user's expectation or
describes properties of the desired clustering results, and provides an effective
means of communicating with the clustering process. Various kinds of constraints
can be specified, either by a user or as per application requirements. Spatial
clustering is employed in the presence of obstacles and under user-specified
constraints. In addition, semi-supervised clustering employs pairwise constraints
in order to improve the quality of the resulting clustering. (VSSUT)

6.4 Partitioning Methods


The simplest and most fundamental version of cluster analysis is partitioning, which
organizes the objects of a set into several exclusive groups or clusters. To keep the
problem specification concise, we can assume that the number of clusters is given as
background knowledge. This parameter is the starting point for partitioning methods.
The most well-known and commonly used partitioning methods are
▪ The k-Means Method
▪ k-Medoids Method

6.4.1 Centroid-Based Technique: The K-Means Method:


The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intra cluster similarity is high but the
inter cluster similarity is low.
Cluster similarity is measured in regard to the mean value of the objects in a
cluster, which can be viewed as the cluster’s centroid or center of gravity.
The k-means algorithm proceeds as follows. (Jiawei Han, 2012)

▪ First, it randomly selects k of the objects, each of which initially represents a
cluster mean or center.
▪ For each of the remaining objects, an object is assigned to the cluster to
which it is the most similar, based on the distance between the object and
the cluster mean.
▪ It then computes the new mean for each cluster.
▪ This process iterates until the criterion function converges.
Typically, the square-error criterion is used, defined as

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} |p − mi|²

where
E is the sum of the squared error for all objects in the data set,
p is the point in space representing a given object, and
mi is the mean of cluster Ci.

The k-means partitioning algorithm:


The k-means algorithm is for partitioning, where each cluster’s center is
represented by the mean value of the objects in the cluster. (Jiawei Han, 2012)

Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) select any k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects
for each cluster;
(5) until no change;


Figure 6.1: Clustering of a set of objects using the k-means method (Jiawei Han, 2012)
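The steps above can be sketched in Python. This is a minimal illustration (random initialization, squared Euclidean distance, toy coordinates), not a production implementation:

```python
import random

def k_means(points, k, max_iters=100):
    """Minimal k-means sketch over a list of coordinate tuples."""
    centers = random.sample(points, k)  # select k objects as initial cluster means
    for _ in range(max_iters):
        # (re)assign each object to the cluster with the nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # update the cluster means
        new_centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:  # stop when the means no longer change
            break
        centers = new_centers
    return centers, clusters

random.seed(1)
centers, clusters = k_means([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)
```

Note that the result depends on the random initial means, which is why k-means is usually run several times and the partitioning with the lowest square error is kept.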

6.4.2 k-Medoids Method:
The k - Means algorithm is majorly focused on outliers. It may create a noise in
large varying data. Instead of taking the mean value of the objects in a cluster as
a reference point, we can pick actual objects to represent the clusters, using one
representative object per cluster. Each remaining object is clustered with the
representative object to which it is the most similar.

The partitioning method is performed based on the principle of minimizing the
sum of the dissimilarities between each object and its corresponding reference
point. That is, an absolute-error criterion is used, defined as

E = Σ_{j=1}^{k} Σ_{p ∈ Cj} |p − Oj|

where
E is the sum of the absolute error for all objects in the data set,
p is the point in space representing a given object in cluster Cj, and
Oj is the representative object of Cj.

The k-medoids method is also known as partitioning around medoids. The
Partitioning Around Medoids (PAM) algorithm is a popular realization of k-medoids
clustering. It tackles the problem in an iterative, greedy way. The initial
representative objects are chosen arbitrarily. The iterative process replaces
representative objects with non-representative objects, and continues as long as
the quality of the resulting clustering is improved. This quality is estimated using
a cost function that measures the average dissimilarity between an object and the
representative object of its cluster.

To determine whether a non-representative object, Orandom, is a good replacement
for a current representative object, Oj, the following four cases are examined for
each of the non-representative objects.

Figure 6.2: Clustering of a set of objects using the k-mediods method (Jiawei Han, 2012)

Case 1: p currently belongs to representative object Oj. If Oj is replaced by
Orandom as a representative object and p is closest to one of the other
representative objects Oi, i ≠ j, then p is reassigned to Oi.
Case 2: p currently belongs to representative object Oj. If Oj is replaced by
Orandom as a representative object and p is closest to Orandom, then p is
reassigned to Orandom.
Case 3: p currently belongs to representative object Oi, i ≠ j. If Oj is replaced by
Orandom as a representative object and p is still closest to Oi, then the
assignment does not change.
Case 4: p currently belongs to representative object Oi, i ≠ j. If Oj is replaced by
Orandom as a representative object and p is closest to Orandom, then p is
reassigned to Orandom.

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on
medoid or central objects.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or
seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest
representative object;
(4) randomly select a nonrepresentative object, Orandom;
(5) compute the total cost, S, of swapping representative object, Oj, with
Orandom;
(6) if S < 0 then swap Oj with Orandom to form the new set of k
representative objects;
(7) until no change;

The k-medoids method is more robust than k-means in the presence of noise and
outliers, because a medoid is less influenced by outliers or other extreme values
than a mean. However, its processing is more costly than the k-means method.
(Jiawei Han, 2012)
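A compact sketch of the PAM idea — swap a medoid for a non-medoid whenever doing so lowers the absolute-error criterion E. The data points and the use of Manhattan distance are assumptions made for this illustration:

```python
def dissimilarity(p, q):
    # Manhattan distance between two coordinate tuples
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # absolute-error criterion E: each object contributes its distance
    # to its nearest representative object (medoid)
    return sum(min(dissimilarity(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])  # arbitrarily choose k initial representatives
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids or m not in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate  # the swap improves clustering quality
                    improved = True
    return medoids

data = [(0, 0), (1, 0), (0, 1), (5, 5), (5, 6), (6, 5)]
print(sorted(pam(data, 2)))  # [(0, 0), (5, 5)]
```

Because medoids are actual data objects, a single extreme value cannot drag a representative away from its cluster, which is the robustness property noted above.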

6.5 Hierarchical Methods


A hierarchical clustering method works by grouping data objects into a hierarchy or
"tree" of clusters; that is, the data may be partitioned into groups at different levels.

The quality of a hierarchical clustering method suffers because adjustments cannot
be made once objects are merged into a group or a group is split: the operation
cannot be backtracked to the original state. Hierarchical clustering methods can be
further classified as either agglomerative or divisive, depending on whether the
hierarchical decomposition is formed in a bottom-up or top-down fashion.

6.5.1 Agglomerative hierarchical clustering:
This bottom-up strategy starts by placing each object in its own cluster and then
merges these atomic clusters into larger and larger clusters, until all of the
objects are in a single cluster or until certain termination conditions are
satisfied. Most hierarchical clustering methods belong to this category. They
differ only in their definition of inter cluster similarity.

6.5.2 Divisive hierarchical clustering:


This top-down strategy does the reverse of agglomerative hierarchical clustering
by starting with all objects in one cluster. It sub divides the cluster into smaller
and smaller pieces, until each object forms a cluster on its own or until it
satisfies certain termination conditions, such as a desired number of clusters is
obtained or the diameter of each cluster is within a certain threshold.

Hierarchical clustering methods can encounter difficulties regarding the


selection of merge or split points. Such a decision is critical, because once a
group of objects is merged or split, the process at the next step will operate on
the newly generated clusters. It will neither undo what was done previously, nor
perform object swapping between clusters. Thus, merge or split decisions, if not
well chosen, may lead to low-quality clusters. The methods do not scale well
because each decision of merge or split needs to examine and evaluate many
objects or clusters.

A promising direction for improving the clustering quality of hierarchical


methods is to integrate hierarchical clustering with other clustering techniques,
resulting in multiple-phase (or multiphase) clustering.
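The bottom-up (agglomerative) strategy can be sketched as follows, using single linkage (distance between the closest pair of members across two clusters); the coordinates are made up for the example:

```python
def single_link_agglomerative(points, k):
    """Merge the two nearest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]  # each object starts in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(abs(a[0] - b[0]) + abs(a[1] - b[1])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge: this step cannot be undone
    return clusters

groups = single_link_agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)
```

The comment on the merge line reflects the rigidity discussed above: once a merge is executed, later iterations operate only on the newly formed clusters and never revisit the decision.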

6.6 Outlier Analysis:


An outlier is an object that does not fit into any particular group or cluster; its
values differ from those of the other objects, so it can easily be separated from the
group. Many data mining algorithms try to minimize the influence of such outliers
on the data, but this could result in the loss of important hidden information,
because one person's noise could be another person's signal. The outliers
themselves may be of particular interest, such as in the case of fraud detection,
where outliers may indicate fraudulent activity. Thus, outlier detection and
analysis is an interesting data mining task, referred to as outlier mining.

The outlier mining problem can be viewed as two sub problems:


• Define what data can be considered as inconsistent in a given data set, and
• Find an efficient method to mine the outliers so defined.

It can be used in fraud detection, for example, by detecting unusual usage of credit
cards or telecommunication services. In addition, it is useful in customized marketing
for identifying the spending behaviour of customers with extremely low or extremely
high incomes, or in medical analysis for finding unusual responses to various medical
treatments.
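As a minimal illustration of the first subproblem — defining which data are "inconsistent" — a simple statistical rule flags values that lie far from the mean; the transaction amounts below are hypothetical:

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

amounts = [40, 42, 38, 41, 39, 40, 500]  # hypothetical credit card transactions
print(z_score_outliers(amounts))  # [500]
```

Real outlier mining uses far more robust definitions (distance-based, density-based), since a single extreme value inflates the mean and standard deviation it is being compared against.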

Questions:

1. Explain K – means algorithm.


2. Explain the partitioning around medoids (PAM) technique.
3. What are the different clustering methods? Explain the different types of data used in
cluster analysis.
4. Explain different applications of clustering.
5. Write a note on:
a. Grid Based Clustering Method
b. Hierarchical Method of clustering

xxxXxxx

Unit – VII
Web Structure Mining
Objectives:
- To learn basic concepts of web structure mining, web links, social network,
crawlers and community discovery on internet
- To study the data mining over internet and web based searching
- To discover and analyze web usage pattern and new developments
- To create awareness towards marketing over social networking sites

7 Web Link Mining


Link mining is a central concept in web-based mining. Social network data come
from heterogeneous, multi-relational, and semi-structured sources. Web link mining
is concerned with social networks, web page links, hypertext, and web-based
analysis, and focuses on descriptive and predictive modelling. Web link mining
involves the following tasks, listed here with examples from various domains (Liu, 2011):
a. Link-based object classification.
In traditional classification methods, objects are classified based on the attributes that
describe them. Link-based classification predicts the category of an object based not
only on its attributes, but also on its links, and on the attributes of linked objects.
Web page classification is a well-recognized example of link-based classification. It
predicts the category of a Web page based on word occurrence (words that occur on the
page) and anchor text (the hyperlink words, that is, the words you click on when you
click on a link), both of which serve as attributes.
b. Object type prediction.
This predicts the type of an object, based on its attributes and its links, and on the
attributes of objects linked to it. In the bibliographic domain, we may want to predict
the venue type of a publication as conference, journal, or workshop.
c. Link type prediction.
This predicts the type or purpose of a link, based on properties of the objects involved.
For instance, we may try to predict whether two people who know each other are
family members, co-workers, or acquaintances.
d. Predicting link existence.
Unlike link type prediction, where we know a connection exists between two objects
and we want to predict its type, instead we may want to predict whether a link exists
between two objects. Examples include predicting whether there will be a link between
two Web pages.
e. Link cardinality estimation.
There are two forms of link cardinality estimation.
First, we may predict the number of links to an object. This is useful, for instance, in
predicting the authoritativeness of a Web page based on the number of links to it (in-
links).

Second, the number of out-links can be used to identify Web pages that act as hubs,
where a hub is one or a set of Web pages that point to many authoritative pages on
the same topic.
f. Object reconciliation.
In object reconciliation, the task is to predict whether two objects are the same based on
their attributes and links. This task is common in information extraction, duplication
elimination, object consolidation, and citation matching, and is also known as record
linkage or identity uncertainty.
Examples include predicting whether two websites are mirrors of each other, whether
two citations actually refer to the same paper, and whether two apparent disease strains
are really the same.
g. Group detection.
Group detection is a clustering task. It predicts when a set of objects belong to the same
group or cluster, based on their attributes as well as their link structure. An area of
application is the identification of Web communities, where a Web community is a
collection of Web pages that focus on a particular theme or topic. A similar example in
the bibliographic domain is the identification of research communities.
h. Sub graph detection.
Sub graph identification finds characteristic sub graphs within networks. An example
from biology is the discovery of sub graphs corresponding to protein structures. In
chemistry, we can search for sub graphs representing chemical substructures.
i. Metadata mining.
Metadata are data about data. Metadata provide semi-structured information about
unstructured data, ranging from text and Web data to multimedia databases. They
are useful for data integration tasks in many domains.

7.1 Social Network Analysis


Social network analysis is the study of social entities (people in an organization,
called actors) and their interactions and relationships. The interactions and
relationships can be represented
with a network or graph, where each vertex (or node) represents an actor and each link
represents a relationship. From the network we can study the properties of its structure, and
the role, position and prestige of each social actor. We can also find various kinds of sub-
graphs, e.g., communities formed by groups of actors.

Social network analysis is useful for the Web because the Web is essentially a virtual
society, where each page can be regarded as a social actor and each hyperlink as a
relationship.
Many of the results from social networks can be adapted and extended for the use in the
Web context. The ideas from social network analysis are indeed instrumental to the success
of Web search engines. (Liu, 2011)

7.2 Co-Citation and Bibliographic Coupling


The area of research concerned with links is the citation analysis of scholarly publications.
A scholarly publication usually cites related prior work to acknowledge the origins of some
ideas in the publication and to compare the new proposal with existing work. Citation
analysis is an area of bibliometric research, which studies citations to establish the
relationships between authors and their work.

When a publication (also called a paper) cites another publication, a relationship is
established between the publications. Citation analysis uses these relationships (links) to
perform various types of analysis. A citation can represent many types of links, such as
links between authors, publications, journals and conferences, and fields, or even between
countries. (Liu, 2011) When a research scholar publishes work, readers who cite the
publication establish a link to the author and strengthen the bibliographic record.

7.2.1 Co-Citation
Co-citation is used to measure the similarity of two papers (or publications). If papers i
and j are both cited by paper k, then they may be said to be related in some sense to
each other, even though they do not directly cite each other. Fig. 7.1 shows that
papers i and j are co-cited by paper k. If papers i and j are cited together by many
papers, it means that i and j have a strong relationship or similarity. The more papers
they are cited by, the stronger their relationship is. (Liu, 2011)

Fig. 7.1: Paper i and paper j are co-cited by paper k (Liu, 2011)

7.2.2 Bibliographic Coupling


Bibliographic coupling operates on a similar principle, but in a way, it is the mirror
image of co-citation. Bibliographic coupling links papers that cite the same articles so
that if papers i and j both cite paper k, they may be said to be related, even though they
do not directly cite each other. The more papers they both cite, the stronger their
similarity is. Fig. 7.2 shows both papers i and j citing (referencing) paper k. (Liu, 2011)

Fig. 7.2: Both paper i and paper j cite paper k (Liu, 2011)

Bibliographic coupling is also symmetric and, like co-citation, is regarded as a similarity
measure of two papers in clustering. On the Web, the hubs and authorities found by the
HITS algorithm are directly related to the co-citation and bibliographic coupling matrices.
(Liu, 2011)
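Computationally, bibliographic coupling is the mirror image of co-citation: with the same toy encoding (L[k][i] = 1 when paper k cites paper i), the coupling matrix is B = L L^T, counting references shared by two citing papers. A minimal sketch:

```python
def bib_coupling(L):
    """Bibliographic coupling matrix B = L L^T: B[k][l] counts the
    references that papers k and l both cite."""
    m, n = len(L), len(L[0])
    return [[sum(L[k][i] * L[l][i] for i in range(n)) for l in range(m)]
            for k in range(m)]

# Toy citation matrix: rows = citing papers, columns = cited papers.
L = [[1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]]
B = bib_coupling(L)
print(B[0][1])  # papers 0 and 1 share 2 references
```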

7.3 Page Rank


The main ideas of Page Rank and HITS are really quite similar. Page Rank has emerged as
the dominant link analysis model for Web search, partly due to its query-independent
evaluation of Web pages and its ability to combat spamming, and partly due to Google’s
business success. Page Rank relies on the democratic nature of the Web by using its vast
link structure as an indicator of an individual page's quality. In essence, Page Rank
interprets a hyperlink from page x to page y as a vote, by page x, for page y. Page Rank
looks at more than just the sheer number of votes or links that a page receives. It also
analyzes the page that casts the vote. Votes cast by pages that are themselves
“important” weigh more heavily and help to make other pages more “important.” This is
exactly the idea of rank prestige in social networks. (Liu, 2011)
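The voting idea can be sketched with a simple power-iteration implementation. This is a minimal illustrative version, not Google's implementation; the toy graph and the damping factor d = 0.85 are the usual textbook choices:

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank. links[x] = list of pages x links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start with a uniform distribution
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages} # "teleport" term
        for x in pages:
            out = links[x]
            if out:                            # each out-link is a vote worth pr[x]/|out|
                share = pr[x] / len(out)
                for y in out:
                    new[y] += d * share
            else:                              # dangling page: spread its score evenly
                for y in pages:
                    new[y] += d * pr[x] / n
        pr = new
    return pr

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(g)
```

In this toy graph, page c receives votes from both a and b, so it ends up with the highest score; the scores always sum to 1.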

7.4 Authorities and Hub Ranking


Authorities and hub ranking are the concepts based on HITS. When the user issues a search
query, HITS (Hypertext Induced Topic Search) first expands the list of relevant pages
returned by a search engine and then produces two rankings of the expanded set of pages,
authority ranking and hub ranking. (Liu, 2011)

An authority is a page with many in-links. The idea is that the page may have authoritative
content on some topic and thus many people trust it and thus link to it.
A hub is a page with many out-links. The page serves as an organizer of the information
on a particular topic and points to many good authority pages on the topic. When a user
comes to this hub page, he/she will find many useful links which take him/her to good
content pages on the topic. Figure 7.3 shows an authority page and a hub page.

Fig. 7.3: An authority page and a hub page (Liu, 2011)

The key idea of HITS is that a good hub points to many good authorities and a good
authority is pointed to by many good hubs. Thus, authorities and hubs have a mutual
reinforcement relationship. (Liu, 2011)
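This mutual reinforcement can be sketched as an alternating update: authority scores are summed from in-linking hubs, hub scores from linked-to authorities, with normalization after each step. A minimal sketch over a made-up toy graph:

```python
def hits(links, iters=50):
    """HITS iteration. links[x] = list of pages x points to.
    Returns (authority, hub) score dictionaries."""
    pages = set(links) | {y for ys in links.values() for y in ys}
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(iters):
        # authority: sum of hub scores of the pages that link to it
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        s = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / s for p, v in auth.items()}
        # hub: sum of authority scores of the pages it links to
        hub = {p: sum(auth[y] for y in links.get(p, ())) for p in pages}
        s = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / s for p, v in hub.items()}
    return auth, hub

# Toy graph: h1 and h2 both point to the content pages a1 and a2.
g = {"h1": ["a1", "a2"], "h2": ["a1", "a2"]}
auth, hub = hits(g)
```

As expected, a1 and a2 come out as authorities (many in-links) and h1 and h2 as hubs (many out-links).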

7.5 Community Discovery


A community is simply a group of entities (e.g., people or organizations) that shares a
common interest or is involved in an activity or event. The communities are represented by

dense bipartite sub-graphs, which represent strong relationships within groups. Apart
from a Web, communities also exist in emails and text documents. (Liu, 2011)
There are many reasons for discovering communities. Following are three main reasons:
(Kumar, 1993)
▪ Communities provide valuable and possibly the most reliable, timely, and up-to-
date information resources for a user interested in them.
▪ They represent the sociology of the Web: studying them gives insights into the
evolution of the Web.
▪ They enable target advertising at a very precise level.

Community Manifestation in Data: Given a data set, which can be a set of Web pages, a
collection of emails, or a set of text documents, we want to find communities of entities in
the data. The system needs to discover the hidden community structures. The first issue
that we need to address is how communities manifest themselves. From such manifested
evidence, the system can discover possible communities. Different types of data may have
different forms of manifestation. We give three examples. (Liu, 2011)

a. Web Pages:
▪ Hyperlinks: A group of content creators sharing a common interest is usually inter-
connected through hyperlinks. That is, members in a community are more likely to
be connected among themselves than outside the community.
▪ Content words: Web pages of a community contain words that are related to the
community theme.

b. Emails:
▪ Email exchange between entities: Members of a community are more likely to
communicate with one another.
▪ Content words: Email contents of a community also contain words related to the
theme of the community.

c. Text documents:
▪ Co-occurrence of entities: Members of a community are more likely to appear
together in the same sentence and/or the same document.
▪ Content words: Words in sentences indicate the community theme.

7.6 Web Crawlers


Web crawlers, also known as spiders or robots, are programs that automatically
download Web pages. Since information on the Web is scattered among billions of pages
served by millions of servers around the globe, users who browse the Web can follow
hyperlinks to access information, virtually moving from one page to the next. A crawler
can visit many sites to collect information that can be analyzed and mined in a central
location, either online (as it is downloaded) or off-line (after it is stored). There are many
applications for Web crawlers. (Liu, 2011)
▪ One is business intelligence, whereby organizations collect information about their
competitors and potential collaborators.

▪ Another use is to monitor Web sites and pages of interest, so that a user or
community can be notified when new information appears in certain places.
▪ There are also malicious applications of crawlers, for example, that harvest email
addresses to be used by spammers or collect personal information to be used in
phishing and other identity theft attacks.

In fact, crawlers are the main consumers of Internet bandwidth. They collect pages for
search engines to build their indexes. Well known search engines such as Google, Yahoo!
and MSN run very efficient universal crawlers designed to gather all pages irrespective of
their content. Other crawlers, sometimes called preferential crawlers, are more targeted.
They attempt to download only pages of certain types or topics. (Liu, 2011)

7.7 Basic Crawler Algorithm


A crawler is a graph search algorithm. The Web can be seen as a large graph with pages as
its nodes and hyperlinks as its edges. A crawler starts from a few of the nodes (seeds) and
then follows the edges to reach other nodes. The process of fetching a page and extracting
the links within it is analogous to expanding a node in graph search. (Liu, 2011)

a. The frontier is the main data structure, which contains the URLs of unvisited pages.
b. Typical crawlers attempt to store the frontier in the main memory for efficiency. Based
on the declining price of memory and the spread of 64-bit processors, quite a large
frontier size is feasible.
c. Yet the crawler designer must decide which URLs have low priority and thus get
discarded when the frontier is filled up.
d. Note that given some maximum size, the frontier will fill up quickly due to the high
fan-out of pages.
e. Even more importantly, the crawler algorithm must specify the order in which new
URLs are extracted from the frontier to be visited.
These mechanisms determine the graph search algorithm implemented by the crawler.
Figure 7.5 shows the flow of a basic sequential crawler. Such a crawler fetches one page at
a time, making inefficient use of its resources. (Liu, 2011)

Fig. 7.5: Flow chart of a basic sequential crawler. (Liu, 2011)
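The flow of the basic sequential crawler can be sketched as follows. This is a minimal sketch, not a production crawler: `fetch(url)` and `extract_links(page)` are hypothetical caller-supplied helpers standing in for HTTP download and HTML link extraction.

```python
from collections import deque

def crawl(seeds, fetch, extract_links, max_pages=100, max_frontier=10_000):
    """Basic sequential crawler: fetch one page at a time, expanding
    links like nodes in a graph search."""
    frontier = deque(seeds)       # URLs of unvisited pages (FIFO = breadth-first)
    visited = set(seeds)          # dedupe URLs already queued or fetched
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()  # the extraction order defines the search strategy
        page = fetch(url)
        if page is None:          # fetch failed; skip this URL
            continue
        pages[url] = page
        for link in extract_links(page):
            # drop new URLs once the frontier reaches its maximum size
            if link not in visited and len(frontier) < max_frontier:
                visited.add(link)
                frontier.append(link)
    return pages
```

Swapping the FIFO deque for a priority queue changes the graph search order, which is exactly how preferential crawlers are derived from this skeleton.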

7.8 Universal Crawlers


General purpose search engines use Web crawlers to maintain their indices, amortizing the
cost of crawling and indexing over the millions of queries received between successive
index updates (though indexers are designed for incremental updates). These large-scale
universal crawlers differ from the concurrent breadth-first crawlers described above along
two major dimensions (Liu, 2011):
a. Performance: They need to scale up to fetching and processing hundreds of thousands
of pages per second. This calls for several architectural improvements.
b. Policy: They strive to cover as much as possible of the most important pages on the
Web, while maintaining their index as fresh as possible. These goals are, of course,
conflicting, so crawlers must carefully balance coverage against freshness.

7.9 Topical Crawlers


For many preferential crawling tasks, labeled (positive and negative) examples of pages are
not available in sufficient numbers to train a focused crawler before the crawl starts.
Instead, we typically have a small set of seed pages and a description of a topic of interest
to a user or user community. The topic can consist of one or more example pages (possibly
the seeds) or even a short query. Preferential crawlers that start with only such information
are often called topical crawlers [8, 14, 38]. They do not have text classifiers to guide
crawling.

Even without the luxury of a text classifier, a topical crawler can be smart about
preferentially exploring regions of the Web that appear relevant to the target topic by
comparing features collected from visited pages with cues in the topic description.
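One common way to exploit such cues is a best-first crawler: the frontier becomes a priority queue, and each unvisited URL is scored by the relevance of the page that cited it. The sketch below assumes hypothetical `fetch(url)` and `extract_links(url)` helpers and uses a simple lexical overlap as the relevance cue; real topical crawlers use richer features.

```python
import heapq

def topical_crawl(seeds, topic_words, fetch, extract_links, max_pages=50):
    """Best-first topical crawler sketch. URLs inherit the relevance
    score of the page that linked to them."""
    topic = {w.lower() for w in topic_words}
    frontier = [(-1.0, u) for u in seeds]   # heapq is a min-heap, so negate scores
    heapq.heapify(frontier)
    seen = set(seeds)
    results = []
    while frontier and len(results) < max_pages:
        _, url = heapq.heappop(frontier)    # visit the most promising URL first
        page = fetch(url)
        if page is None:
            continue
        # relevance cue: fraction of topic words found in the page text
        relevance = len(set(page.lower().split()) & topic) / len(topic)
        results.append((url, relevance))
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance, link))
    return results
```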

Fig. 7.6: Screenshot of the MySpiders applet in action. (Liu, 2011)

To illustrate a topical crawler with its advantages and limitations, let us consider the
MySpiders applet ([Link]). Figure 7.6 shows a screenshot of
this application.

In this example the user has launched a population of crawlers with the query “search
censorship in france” using the InfoSpiders algorithm. The crawler reports some seed pages
obtained from a search engine, but also a relevant blog page (bottom left) that was not
returned by the search engine. This page was found by one of the agents, called Spider2,
crawling autonomously from one of the seeds. We can see that Spider2 spawned a new
agent, Spider13, who started crawling for pages also containing the term “italy.” Another
agent, Spider5, spawned two agents one of which, Spider11, identified and internalized the
relevant term “engine.”

A crawling task can thus be viewed as a constrained multi-objective search problem. The
wide variety of objective functions, coupled with the lack of appropriate knowledge about

the search space, make such a problem challenging. In the remainder of this section we
briefly discuss the theoretical conditions necessary for topical crawlers to function, and the
empirical evidence supporting the existence of such conditions. Then we review some of
the machine learning techniques that have been successfully applied to identify and exploit
useful cues for topical crawlers. (Liu, 2011)

7.10 Evaluation
Given the goal of building a “good” crawler, a critical question is how to evaluate crawlers
so that one can reliably compare two crawling algorithms and conclude that one is “better”
than the other. In principle, a crawler should be evaluated through the application it
supports: its performance depends on the crawling algorithm used and on the application's
indexing scheme. In practice, crawler evaluation has often been carried out by comparing
crawling algorithms on a limited number of queries/tasks, without considering statistical
significance. The Web crawling field has since developed frameworks for evaluating and
comparing disparate crawling strategies on common tasks through well-defined performance
measures. (Liu, 2011)

The evaluation of crawlers requires a sufficient number of crawl runs over different topics,
as well as sound methodologies that consider the temporal nature of crawler outputs.
Evaluation is complicated by the general unavailability of relevance sets for particular
topics or queries. Meaningful experiments involving real users to assess the relevance of
pages as they are crawled are extremely problematic: crawls against the live Web pose
serious time constraints and would be overly burdensome to the subjects.

To circumvent these problems, crawler evaluation typically relies on defining measures for
automatically estimating page relevance and quality. (Liu, 2011)
a. A page may be considered relevant if it contains some or all of the keywords in the
topic/query.
b. The frequency with which the keywords appear on the page may also be considered
(Cho, 1998).
c. While the topic of interest to the user is often expressed as a short query, a longer
description may be available in some cases.
d. Similarity between the short or long description and each crawled page may be used
to judge the page's relevance.
e. The pages used as the crawl's seed URLs may be combined together into a single
document, and the cosine similarity between this document and a crawled page may
serve as the page’s relevance score.
f. A classifier may be trained to identify relevant pages. The training may be done
using seed pages or other pre-specified relevant pages as positive examples. The
trained classifier then provides boolean or continuous relevance scores for each of
the crawled pages. Note that if the same classifier, or a classifier trained on the
same labeled examples, is used both to guide a (focused) crawler and to evaluate it,
the evaluation is not unbiased. Clearly the evaluating classifier would be biased in
favor of crawled pages. To partially address this issue, an evaluation classifier may
be trained on a different set than the crawling classifier. Ideally the training sets
should be disjoint. At a minimum the training set used for evaluation must be
extended with examples not available to the crawler (Pant G. a., 2005).

g. Another set of evaluation criterion can be obtained by scaling or normalizing any of
the above performance measures by the critical resources used by a crawler. This
way, one can compare crawling algorithms by way of performance/cost analysis.
h. For example, with limited network bandwidth one may see latency as a major
bottleneck for a crawling task. The time spent by a crawler on network I/O can be
monitored and applied as a scaling factor to normalize precision or recall.

Using such a measure, a crawler designed to preferentially visit short pages, or pages from
fast servers, would outperform one that can locate pages of equal or even better quality but
less efficiently. (Liu, 2011)
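The cosine-similarity relevance score from item (e) above can be sketched over simple term-frequency vectors; the seed document and crawled page below are made-up examples:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two documents represented as
    term-frequency (bag-of-words) vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)                 # dot product over shared terms
    na = math.sqrt(sum(v * v for v in a.values()))    # vector norms
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

seed_doc = "search engine censorship policy"
page = "a policy report on search engine censorship in europe"
relevance = cosine_similarity(seed_doc, page)  # score in [0, 1]
```

The score is 1 for identical term distributions and 0 for documents with no terms in common, making it a convenient automatic relevance estimate.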

7.11 Crawler Ethics and Conflicts


A crawler running on the Web must avoid overwhelming any one server with requests;
mechanisms to prevent or control the request load are therefore needed. In the extreme
case a server inundated with requests from an aggressive crawler would become unable
to respond to other requests, resulting in an effective denial of service attack by the
crawler. In a concurrent crawler, this task
can be carried out by the frontier manager, when URLs are dequeued and passed to
individual threads or processes. Preventing server overload is just one of a number of
policies required of ethical Web agents. Such policies are often collectively referred to
as crawler etiquette. (Liu, 2011)

Another requirement is to disclose the nature of the crawler using the User-Agent
HTTP header. The value of this header should include not only a name and version
number of the crawler, but also a pointer to where Web administrators may find
information about the crawler. A Web site is created for this purpose and its URL is
included in the User-Agent field. Another piece of useful information is the email
contact, which can be specified in the From header.

Finally, crawler etiquette requires compliance with the Robot Exclusion Protocol.
This is a de facto standard providing a way for Web server administrators to
communicate which files may not be accessed by a crawler. The robots.txt file provides
access policies for different crawlers, identified by the User-agent field. When discussing the
interactions between information providers and search engines or other applications
that rely on Web crawlers, confusion sometime arises between the ethical, technical,
and legal ramifications of the Robot Exclusion Protocol. Compliance with the protocol
is an ethical issue, and non-compliant crawlers can justifiably be shunned by the Web
community. Crawlers may disguise themselves as browsers by sending a browser's
identifying string in the User-Agent header. This way a server administrator may not
immediately detect lack of compliance with the Exclusion Protocol, but an aggressive
request profile is likely to reveal the true nature of the crawler.
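Compliance with the Robot Exclusion Protocol can be checked programmatically; Python's standard library ships a robots.txt parser. A minimal sketch, with a made-up robots.txt body parsed directly (no network access):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: all crawlers are excluded from /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.com/private/x.html"))  # False
print(rp.can_fetch("MyCrawler", "http://example.com/public/y.html"))   # True
```

An ethical crawler consults `can_fetch` before enqueueing any URL, using the same crawler name it sends in the User-Agent header.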

Web administrators may attempt to improve the ranking of their pages in a search
engine by providing different content depending on whether a request originates from a
browser or a search engine crawler, as determined by inspecting the request's User-
Agent header. This technique, called cloaking, is frowned upon by search engines,
which remove sites from their indices when such abuses are detected.

One of the most serious challenges for crawlers originates from the rising popularity of
pay-per-click advertising. If a crawler is not to follow advertising links, it needs to have
a robust detection algorithm to discriminate ads from other links. A bad crawler may
also pretend to be a genuine user who clicks on the advertising links in order to collect
more money from merchants for the hosts of advertising links.

Crawlers need to become increasingly sophisticated to prevent insidious forms of spam
from polluting and exploiting the Web environment. Malicious crawlers are also
becoming smarter in their efforts, not only to spam but also to steal personal
information and in general to deceive people and crawlers for illicit gains.

Interestingly, the gap between humans and crawlers may be narrowing from both sides.
While crawlers become smarter, some humans are dumbing down their content to make
it more accessible to crawlers.

For example, some online news providers use simpler titles than can be easily classified
and interpreted by a crawler as opposed or in addition to witty titles that can only be
understood by humans.

A business may wish to disallow crawlers from its site if it provides a service by which
it wants to entice human users to visit the site, say to make a profit via ads on the site.
A competitor crawling the information and mirroring it on its own site, with different
ads, is a clear violator not only of the Robot Exclusion Protocol but also possibly of
copyright law.

There are many browser extensions that allow users to perform all kinds of tasks that
deviate from the classic browsing activity, including hiding ads, altering the appearance
and content of pages, adding and deleting links, adding functionality to pages, pre-
fetching pages, and so on.
- What about an individual user who wants to access the information but
automatically hide the ads?
- Should they identify themselves through the User-Agent header as distinct from the
browser with which they are integrated?
- Should a server be allowed to exclude them? And should they comply with such
exclusion policies?
These too are questions about ethical crawler behaviors that remain open for the
moment. (Liu, 2011)

7.12 Some New Developments


The typical use of (universal) crawlers has been for creating and maintaining indexes
for general purpose search engines. A more diverse use of (topical) crawlers is
emerging both for client and server-based applications. Topical crawlers are becoming
important tools to support applications such as specialized Web portals (a.k.a.
“vertical” search engines), live crawling, and competitive intelligence. (Liu, 2011)

Users are consumers of information provided by search engines, search engines are
consumers of information provided by crawlers, and crawlers are consumers of
information provided by users (authors). This one-directional loop does not allow, for
example, information to flow from a search engine (say, the queries submitted by users)
to a crawler. It is likely that commercial search engines will soon leverage the huge
amounts of data collected from their users to focus their crawlers on the topics most
important to the searching public. To investigate this idea in the context of a vertical
search engine, a system was built in which the crawler and the search engine engage in
a symbiotic relationship (Pant G. S., 2004). The crawler feeds the search engine which
in turn helps the crawler. It was found that such a symbiosis can help the system learn
about a community's interests and serve such a community with better focus.

Universal crawlers have to somehow focus on the most “important” pages given the
impossibility to cover the entire Web and keep a fresh index of it. This has led to the
use of global prestige measures such as PageRank to bias universal crawlers, either
explicitly or implicitly through the long-tailed structure of the Web graph. An
important problem with these approaches is that the focus is dictated by popularity
among “average” users and disregards the heterogeneity of user interests. A page about
a mathematical theorem may appear quite uninteresting to the average user, if one
compares it to a page about a pop star using indegree or PageRank as a popularity
measure. Yet the math page may be highly relevant and important to a small
community of users (mathematicians). Future crawlers will have to learn to
discriminate between low-quality pages and high-quality pages that are relevant to very
small communities. (Liu, 2011)

Social networks have recently received much attention among Web users as vehicles to
capture commonalities of interests and to share relevant information. We are witnessing
an explosion of social and collaborative engines in which user recommendations,
opinions, and annotations are aggregated and shared. Mechanisms include tagging (e.g.,
[Link] and [Link]), ratings (e.g. [Link]), voting (e.g., [Link]), and
hierarchical similarity ([Link]). One key advantage of social systems is that
they empower humans rather than depending on crawlers to discover relevant
resources. Further, the aggregation of user recommendations gives rise to a natural
notion of trust. Crawlers could be designed to expand the utility of information
collected through social systems. (Liu, 2011)

For example, it would be straightforward to obtain seed URLs relevant to specific
communities of all sizes. Crawlers would then explore the Web for other resources in
the neighborhood of these seed pages, exploiting topical locality to locate and index
other pages relevant to those communities.

Social networks can emerge not only by mining a central repository of user-provided
resources, but also by connecting hosts associated with individual users or communities
scattered across the Internet. Imagine a user creating its own micro-search engine by
employing a personalized topical crawler, seeded for example with a set of bookmarked
pages.

The integration of effective personalized/topical crawlers with adaptive query routing
algorithms is the key to the success of peer-based social search systems. Many
synergies may be exploited in this integration by leveraging contextual information
about the local peer that is readily available to the crawler, as well as information about
the peer's neighbors that can be mined through the stream of queries and results routed
through the local peer. (Liu, 2011)

7.13 Web Usage Mining


Web usage mining refers to the automatic discovery and analysis of patterns in
clickstreams and user transactions on the Web, together with other associated data
collected or generated as a result of user interactions with Web resources on one or
more Web sites.
The goal is to capture, model, and analyze the behavioral patterns and profiles of users
interacting with a Web site.

The discovered patterns are usually represented as collections of pages, objects, or
resources that are frequently accessed or used by groups of users with common needs
or interests.
This will help the organizations to determine:
- the life-time value of clients,
- design cross-marketing strategies across products and services,
- evaluate the effectiveness of promotional campaigns,
- optimize the functionality of Web-based applications,
- provide more personalized content to visitors, and
- find the most effective logical structure for their Web space

Web usage mining process can be divided into three inter-dependent stages:
- data collection and pre-processing,
- pattern discovery, and
- pattern analysis.

In the pre-processing stage, the clickstream data is cleaned and partitioned into a set of
user transactions representing the activities of each user during different visits to the
site. The site content or structure, as well as semantic domain knowledge from site
ontologies (such as product catalogs or concept hierarchies), may also be used in pre-
processing or to enhance user transaction data.

In the pattern discovery stage, statistical, database, and machine learning operations are
performed to uncover hidden patterns. These patterns reflect the typical behavior of
users, as well as summary statistics on Web resources, sessions, and users.

In the final stage of the process, the discovered patterns and statistics are further
processed and filtered, possibly resulting in aggregate user models that can be used as input
to applications such as recommendation engines, visualization tools, and Web analytics
and report generation tools. The overall process is depicted in Figure 7.8.

Fig. 7.8: The Web usage mining process (Liu, 2011)

7.13.1 Data Collection and Pre-Processing


An important task in any data mining application is the creation of a suitable target data
set to which data mining and statistical algorithms can be applied. This is particularly
important in Web usage mining due to the characteristics of clickstream data and its
relationship to other related data collected from multiple sources and across multiple
channels.

The data preparation process is often the most time consuming and computationally
intensive step in the Web usage mining process, and often requires the use of special
algorithms and heuristics not commonly employed in other domains.

This process is critical to the successful extraction of useful patterns from the data. The
process may involve pre-processing the original data, integrating data from multiple
sources, and transforming the integrated data into a form suitable for input into specific
data mining operations. Collectively, we refer to this process as data preparation.
Usage data preparation presents a number of unique challenges which have led to a
variety of algorithms and heuristic techniques for pre-processing tasks such as data
fusion and cleaning, user and session identification, page view identification.

The successful application of data mining techniques to Web usage data is highly
dependent on the correct application of the pre-processing tasks. In the context of e-
commerce data analysis, these techniques have been extended to allow for the
discovery of important and insightful user and site metrics.

Fig. 7.9: Steps in data preparation for Web usage mining (Liu, 2011)

Figure 7.9 provides a summary of the primary tasks and elements in usage data pre-
processing. We begin by providing a summary of data types commonly used in Web
usage mining and then provide a brief discussion of some of the primary data
preparation tasks. (Liu, 2011)

7.13.2 Cleaning and Filtering


In large-scale Web sites, it is typical that the content served to users comes from
multiple Web or application servers. In some cases, multiple servers with redundant
content are used to reduce the load on any particular server.
- Data fusion/filtering refers to the merging of log files from several Web and
application servers. This may require global synchronization across these
servers. In the absence of shared embedded session ids, heuristic methods based
on the “referrer” field in server logs along with various sessionization and user
identification methods can be used to perform the merging. This step is essential
in “inter-site” Web usage mining where the analysis of user behavior is
performed over the log files of multiple related Web sites (Tanasa, 2005).
- Data cleaning is usually site-specific, and involves tasks such as, removing
extraneous references to embedded objects that may not be important for the
purpose of analysis, including references to style files, graphics, or sound files.
- The cleaning process also may involve the removal of at least some of the data
fields (e.g., number of bytes transferred or version of HTTP protocol used, etc.)
that may not provide useful information in analysis or data mining tasks.
- Data cleaning also entails the removal of references due to crawler navigations.
It is not uncommon for a typical log file to contain a significant (sometimes as
high as 50%) percentage of references resulting from search engine or other
crawlers (or spiders). Well-known search engine crawlers can usually be
identified and removed by maintaining a list of known crawlers. (Liu, 2011)
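The cleaning steps above can be sketched as a simple log filter. The record schema (dicts with `url` and `agent` fields) and the crawler name list are illustrative assumptions, not a standard format:

```python
# Illustrative list of known crawler names; real deployments maintain
# much longer, regularly updated lists.
KNOWN_CRAWLERS = ("googlebot", "bingbot", "slurp")

def clean_log(entries):
    """Drop requests for embedded objects (style sheets, graphics, sounds)
    and entries generated by known crawlers."""
    skip_ext = (".css", ".gif", ".png", ".jpg", ".wav", ".ico")
    cleaned = []
    for e in entries:  # e: dict with 'url' and 'agent' fields (assumed schema)
        if e["url"].lower().endswith(skip_ext):
            continue
        if any(c in e["agent"].lower() for c in KNOWN_CRAWLERS):
            continue
        cleaned.append(e)
    return cleaned
```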

7.13.3 Data Modeling for Web Usage Mining


Usage data pre-processing results in a set of n page views, P = {p1, p2, ···, pn}, and a
set of m user transactions, T = {t1,t2,···,tm}, where each ti in T is a subset of P.

Page views are semantically meaningful entities to which mining tasks are applied
(such as pages or products).

Conceptually, we view each transaction t as an l-length sequence of ordered pairs:

t = <(p1^t, w(p1^t)), (p2^t, w(p2^t)), ···, (pl^t, w(pl^t))>

where each pi^t = pj for some j in {1, 2, ···, n}, and w(pi^t) is the weight associated
with page view pi^t in transaction t, representing its significance.

The weights can be determined in a number of ways, in part based on the type of
analysis or the intended personalization framework.

For example, in collaborative filtering applications, which rely on the profiles of
similar users to make recommendations to the current user, weights may be based on
user ratings of items. In Web usage mining tasks the weights are either binary,
representing the existence or non-existence of a page view in the transaction; or they
can be a function of the duration of the page view in the user’s session. In the case of
time durations, it should be noted that usually the time spent by a user on the last page
view in the session is not available.

One commonly used option is to set the weight for the last page view to be the mean
time duration for the page taken across all sessions in which the page view does not
occur as the last one. It is common to use a normalized value of page duration instead
of raw time duration in order to account for user variances. In some applications, the
log of page view duration is used as the weight to reduce the noise in the data.

For many data mining tasks, such as clustering and association rule mining, where the
ordering of page views in a transaction is not relevant, we can represent each user
transaction as a vector over the n-dimensional space of page views.

Fig. 7.10: An example of a user-page view matrix (or transaction matrix) (Liu, 2011)

Given the transaction t above, the transaction vector t (we use a bold face lower case
letter to represent a vector) is given by:

t = (w_p1^t, w_p2^t, ···, w_pn^t)

where w_pj^t = w(pi^t) for some i in {1, 2, ···, l} if pj appears in the transaction t,
and w_pj^t = 0 otherwise.

Thus, conceptually, the set of all user transactions can be viewed as an m×n user-page
view matrix (also called the transaction matrix), denoted by UPM.

An example of a hypothetical user-page view matrix is depicted in Figure 7.10. In this
example, the weights for each page view are the amount of time (e.g., in seconds) that a
particular user spent on the page view. These weights must be normalized to account
for variances in viewing times by different users. It should also be noted that the
weights may be composite or aggregate values in cases where the page view represents
a collection or sequence of pages and not a single page.
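The normalization mentioned above can be as simple as scaling each transaction's durations so they sum to one. A minimal sketch over a hypothetical (made-up) user-pageview matrix:

```python
def normalize_rows(upm):
    """Scale each transaction (row) so its page-view weights sum to 1,
    reducing the effect of per-user variance in viewing times."""
    out = []
    for row in upm:
        total = sum(row)
        out.append([w / total if total else 0.0 for w in row])
    return out

# Hypothetical user-pageview matrix: rows = transactions,
# columns = page views, entries = seconds spent.
UPM = [[10, 0, 30],
       [ 5, 5,  0]]
print(normalize_rows(UPM))  # [[0.25, 0.0, 0.75], [0.5, 0.5, 0.0]]
```

Taking the logarithm of durations before normalizing, as mentioned above, is a drop-in variation of the same idea.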

Given a set of transactions in the user-page view matrix as described above, a variety of
unsupervised learning techniques can be applied to obtain patterns. These techniques
such as clustering of transactions (or sessions) can lead to the discovery of important
user or visitor segments.

Other techniques such as item (e.g., page view) clustering and association or sequential
pattern mining can find important relationships among items based on the navigational
patterns of users in the site.

It is possible to integrate other sources of knowledge, such as semantic information
from the content of Web pages, with the Web usage mining process. Generally, the
textual features from the content of Web pages represent the underlying semantics of
the site. (Liu, 2011)

7.13.4 Discovery and Analysis of Web Usage Patterns


The types and levels of analysis, performed on the integrated usage data, depend on the
ultimate goals of the analyst and the desired outcomes. Following are the types of
pattern discovery and analysis techniques employed in the Web usage mining domain.
(Liu, 2011)

a. Session and Visitor Analysis


i. Statistical analysis of session and visitor data is the most common form of
analysis. The data is aggregated by predetermined units
such as days, sessions, visitors, or domains. Standard statistical techniques can
be used on this data to gain knowledge about visitor behavior. This is the
approach taken by most commercial tools available for Web log analysis.
Reports based on this type of analysis may include information about most
frequently accessed pages, average view time of a page, average length of a
path through a site, common entry and exit points, and other aggregate
measures. Despite a lack of depth in this type of analysis, the resulting
knowledge can be potentially useful for improving the system performance, and
providing support for marketing decisions.
ii. OLAP provides a more integrated framework for analysis with a higher degree
of flexibility. The data source for OLAP analysis is usually a multidimensional
data warehouse which integrates usage, content, and e-commerce data at
different levels of aggregation for each dimension. OLAP tools allow changes
in aggregation levels along each dimension during the analysis. Analysis
dimensions in such a structure can be based on various fields available in the
log files, and may include time duration, domain, requested resource, user
agent, and referrers. This allows the analysis to be performed on portions of the
log related to a specific time interval, or at a higher level of abstraction with
respect to the URL path structure. (Cho, 1998)
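The aggregate measures described in (i) above, such as most frequently accessed pages, average path length, and common entry and exit points, can be computed directly from preprocessed session data. The sessions below are hypothetical:

```python
from collections import Counter
from statistics import mean

# Hypothetical preprocessed sessions: ordered (page, view_time_seconds) pairs.
sessions = [
    [("home", 5), ("products", 40), ("cart", 12)],
    [("home", 3), ("about", 20)],
    [("products", 55), ("cart", 18), ("checkout", 30)],
]

page_hits = Counter(p for s in sessions for p, _ in s)
avg_path_length = mean(len(s) for s in sessions)
entry_pages = Counter(s[0][0] for s in sessions)
exit_pages = Counter(s[-1][0] for s in sessions)

print(page_hits.most_common(3))   # most frequently accessed pages
print(avg_path_length)            # average length of a path through the site
print(entry_pages, exit_pages)    # common entry and exit points
```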

b. Cluster Analysis and Visitor Segmentation


i. Clustering is a data mining technique that groups together a set of items having
similar characteristics. In the usage domain, there are two kinds of interesting
clusters that can be discovered: user clusters and page clusters.
ii. Clustering of user records (sessions or transactions) is one of the most
commonly used analysis tasks in Web usage mining and Web analytics.
iii. Clustering of users tends to establish groups of users exhibiting similar
browsing patterns. Such knowledge is especially useful for inferring user
demographics in order to perform market segmentation in e-commerce
applications or provide personalized Web content to the users with similar
interests.
iv. Analysis of user groups based on their demographic attributes (e.g., age, gender,
income level, etc.) may lead to the discovery of valuable business intelligence.
v. Usage-based clustering has also been used to create Web-based “user
communities” reflecting similar interests of groups of users.
Given the mapping of user transactions into a multi-dimensional space as
vectors of page views, standard clustering algorithms, such as k-means, can
partition this space into groups of transactions that are close to each other based
on a measure of distance or similarity among the vectors. Transaction clusters
can represent user or visitor segments based on their navigational behavior or
other attributes that have been captured in the transaction file. Each transaction
cluster may potentially contain thousands of user transactions involving
hundreds of page view references. The ultimate goal in clustering user
transactions is to provide the ability to analyze each segment for deriving
business intelligence, or to use them for tasks such as personalization.
vi. Another approach in creating an aggregate view of each cluster is to compute
the centroid (or the mean vector) of each cluster. The dimension value for each
page view in the mean vector is computed by finding the ratio of the sum of the
page view weights across transactions to the total number of transactions in the
cluster. If page view weights in the original transactions are binary, then the
dimension value of a page view p in a cluster centroid represents the percentage
of transactions in the cluster in which p occurs. Thus, the centroid dimension
value of p provides a measure of its significance in the cluster. Page views in
the centroid can be sorted according to these weights and lower weight page
views can be filtered out. The resulting set of page view-weight pairs can be
viewed as an “aggregate usage profile” representing the interests or behavior of
a significant group of users. (Liu, 2011)
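The centroid computation and low-weight filtering in (vi) can be sketched as follows, assuming binary page-view weights and a small hypothetical cluster:

```python
# A hypothetical transaction cluster with binary page-view weights over
# the page view space ["A", "B", "C", "D"].
pageviews = ["A", "B", "C", "D"]
cluster = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]

# Centroid: for each page view, the ratio of the summed weights to the
# number of transactions; with binary weights this is the fraction of
# cluster transactions in which the page view occurs.
n = len(cluster)
centroid = [sum(t[j] for t in cluster) / n for j in range(len(pageviews))]

# Aggregate usage profile: page views sorted by significance, with
# low-weight page views (here below 0.5) filtered out.
profile = sorted(
    ((p, w) for p, w in zip(pageviews, centroid) if w >= 0.5),
    key=lambda pw: pw[1],
    reverse=True,
)
print(profile)
```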

c. Association and Correlation Analysis


Association rule discovery and statistical correlation analysis can find groups of
items or pages that are commonly accessed or purchased together. This enables
Web sites to organize the site content more efficiently, or to provide effective
cross-sale product recommendations. The mining of association rules in Web
transaction data has many advantages.

For example, a high-confidence rule such as

{special-offers/, /products/software/} → shopping-cart

might provide some indication that a promotional campaign on
software products is positively affecting online sales. Such rules can also be
used to optimize the structure of the site.

For example, if a site does not provide direct linkage between two pages A and
B, the discovery of a rule, A → B, would indicate that providing a direct
hyperlink from A to B might aid users in finding the intended information. Both
association analysis (among products or page views) and statistical correlation
analysis (generally among customers or visitors) have been used successfully in
Web personalization and recommender systems. (Liu, 2011)
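A minimal sketch of how the support and confidence of a rule like the one above are computed; the transactions are a toy example:

```python
# Toy transactions of page views, each a set of accessed pages.
transactions = [
    {"special-offers", "products/software", "shopping-cart"},
    {"special-offers", "products/software", "shopping-cart"},
    {"special-offers", "products/software"},
    {"home", "about"},
]

antecedent = {"special-offers", "products/software"}
consequent = {"shopping-cart"}

n = len(transactions)
# Count transactions containing the antecedent, and those containing both sides.
sup_a = sum(antecedent <= t for t in transactions)
sup_ac = sum((antecedent | consequent) <= t for t in transactions)

support = sup_ac / n          # fraction of transactions containing both sides
confidence = sup_ac / sup_a   # estimate of P(consequent | antecedent)
print(support, confidence)
```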

d. Analysis of Sequential and Navigational Patterns


i. The technique of sequential pattern mining attempts to find inter-session
patterns such that the presence of a set of items is followed by another item in a
time-ordered set of sessions or episodes. By using this approach, Web marketers
can predict future visit patterns which will be helpful in placing advertisements
aimed at certain user groups.
ii. Other types of temporal analysis that can be performed on sequential patterns
include trend analysis, change point detection, or similarity analysis. In the
context of Web usage data, sequential pattern mining can be used to capture
frequent navigational paths among user trails.
iii. Sequential patterns (SPs) in Web usage data capture the Web page trails that are
often visited by users, in the order that they were visited. Sequential patterns are
those sequences of items that frequently occur in a sufficiently large proportion
of (sequence) transactions.
iv. The view of Web transactions as sequences of page views allows for a number
of useful and well-studied models to be used in discovering or analyzing user
navigation patterns. One such approach is to model the navigational activities in
the Web site as a Markov model: each page view (or a category) can be
represented as a state and the transition probability between two states can
represent the likelihood that a user will navigate from one state to the other.
This representation allows for the computation of a number of useful user or site
metrics. (Liu, 2011)
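The Markov-model view in (iv) can be sketched by estimating transition probabilities from observed trails; the trails below are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical user trails (ordered sequences of page views).
trails = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "C"],
]

# Count transitions between consecutive states (page views).
counts = defaultdict(Counter)
for trail in trails:
    for src, dst in zip(trail, trail[1:]):
        counts[src][dst] += 1

# Transition probability: likelihood that a user navigates from one
# state to another, estimated as the normalized transition count.
P = {
    src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
    for src, dsts in counts.items()
}
print(P["B"])
```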

e. Classification and Prediction based on Web User Transactions


i. Classification is the task of mapping a data item into one of several predefined
classes. In the Web domain, one is interested in developing a profile of users
belonging to a particular class or category. This requires an extraction and
selection of features that best describe the properties of the given class or
category.
ii. Classification techniques play an important role in Web analytics applications
for modeling the users according to various predefined metrics.
For example, given a set of user transactions, the sum of purchases made by
each user within a specified period of time can be computed. A classification
model can then be built based on this enriched data in order to classify users
into those who have a high propensity to buy and those who do not, taking into
account features such as users’ demographic attributes, as well as their
navigational activities.
iii. There are also a large number of other classification problems that are based on
the Web usage data. The approach to solving any classification problem is
basically the same, i.e.,
a. model the problem as a classification learning problem,
b. identify and/or construct a set of relevant features for learning, and
c. apply a supervised learning method or regression method. (Liu, 2011)
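The three steps above can be illustrated with a deliberately simple sketch: constructed features per user (step b) and a 1-nearest-neighbour rule as the supervised method (step c). All records and feature values are invented for illustration:

```python
# Enriched user records: (pages_viewed, total_purchases) feature vectors
# labelled by propensity to buy. Values are illustrative only.
train = [
    ((12, 300.0), "high"),
    ((15, 450.0), "high"),
    ((3, 0.0), "low"),
    ((5, 20.0), "low"),
]

def classify(user, training=train):
    """1-nearest-neighbour over the constructed features: assign the
    label of the closest training record (Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(training, key=lambda rec: dist(rec[0], user))[1]

print(classify((14, 400.0)))   # near the "high" examples
print(classify((4, 10.0)))     # near the "low" examples
```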

Questions:

1. Explain the implications of social networks in data mining.
2. Explain web usage mining in detail.
3. Explain crawler ethics and conflicts.
4. Explain community discovery.
5. Write a note on:
a. Citation, Co-citation
b. Similarity search

xxxXxxx

Bibliography
Web Resources:
(2019, August). Retrieved from [Link]

(2019, July). Retrieved from smartworld: [Link]

(2019, June). Retrieved from Smartzworld: [Link]

Büchner, A. a. (1998). Discovering internet marketing intelligence through online analytical web usage
mining. ACM SIGMOD Record, 54 - 61.

Cho, J. H.-M. (1998). Efficient crawling through URL. Computer Networks, 161-172.

Inmon, W. H. (1996). Building the Data Warehouse. JohnWiley & Sons.

Jiawei Han, M. K. (2012). Data Mining Concepts and Techniques. USA: Elsevier.

JNTUH-R13. (n.d.). Data Warehousing and Data Mining. Retrieved from [Link]

Kumar, R. P. (1993). Trawling the Web. Computer Networks, 11-16.

Liu, B. (2011). Web Data Mining (2 ed.). Chicago: Springer.

Pant, G. a. (2005). Learning to crawl: Comparing classification. ACM Transactions on Information
Systems, 430-462.

Pant, G. S. (2004). Search engine-crawler symbiosis: Adapting to community interests. Research and
Advanced Technology for Digital Libraries, 221-232.

Tanasa, D. a. (2005). Advanced data preprocessing for intersites web Usage Mining. Intelligent
Systems, 59 - 65.

VSSUT, B. (n.d.). Lecture Notes on Data Mining and Data Warehousing. VSSUT, Burla: DEPT OF CSE &
IT.
