An Introduction to Data
Warehousing and Data Mining
Before turning to the data warehouse, you should know:
What is data?
The evolution of database systems.
What are data, information and knowledge?
As the world entered the 21st century, we are facing new and challenging
problems. More than ever before, governments, industry and the wider
community need information to help them make decisions to tackle these
problems.
Before one can present and interpret information there has to be a process of
gathering and sorting data. Just as crude oil is the raw material from which petrol
is distilled, so too, data can be viewed as the raw material from which
information is obtained. Therefore, a good definition of data is:
Data
Data are observations or facts which when collected, organized and evaluated
become information or knowledge.
Information
Information is data that has been organized to serve a useful purpose.
Knowledge
Knowledge is informal; it involves culture and, generally, know-how that a
person acquires through life experience.
Evolution in Database Management
Ancient to modern:
All records were stored in an unordered format, which guaranteed neither the
quality of the data nor an efficient search technique. Then the concept of
design arrived, which led to better reliability and performance.
1960’s:
Computers became cost-effective for private companies as their storage
capability increased. Two main data models were developed: the
network model (CODASYL) and the hierarchical model (IMS).
1970-1972:
E.F. Codd proposed the relational model for databases in a landmark paper on
how to think about databases.
1976:
P. Chen proposed the Entity-Relationship (ER) model for database design,
giving yet another important insight into conceptual data models.
1980's:
SQL (Structured Query Language) became the “intergalactic standard”.
1990:
ODBC and the beginning of Object Database Management Systems (ODBMS).
Late-1990’s:
OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing).
Future trends:
Huge (terabyte-scale) systems are appearing and will require novel means of
handling and analyzing data. Successors to SQL (and perhaps to the RDBMS) will
emerge in the future. Most likely SQL will be overtaken by XML and other
emerging techniques.
What a Data Warehouse Is
A data warehouse is the center of the architecture for the information
systems of the future. It supports informational processing by providing
a solid platform of integrated, historical data from which to do analysis.
It provides the facility for integration in a world of unintegrated
application systems. It is built in an evolutionary, step-at-a-time
fashion. It organizes and stores the data needed for informational,
analytical processing over a long historical time perspective. There is
indeed a world of promise in building and maintaining a data warehouse.
A data warehouse is a:
subject oriented,
integrated,
time variant,
non-volatile
collection of data in support of management's decision-making process.
Subject Oriented:
Data that gives information about a particular subject instead
of about a company's ongoing operations.
Integrated:
Data that is gathered into the data warehouse from a variety
of sources and merged into a coherent whole.
Time-variant:
All data in the data warehouse is identified with a particular
time period.
Non-volatile:
Data is stable in a data warehouse. More data is added but
data is never removed. This enables management to gain a
consistent picture of the business.
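The four properties above can be illustrated with a minimal sketch. The `Fact` and `Warehouse` names, the fields and the sample figures are all hypothetical and chosen only for illustration; this is not a real warehouse design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)          # frozen: a stored fact cannot be modified
class Fact:
    subject: str                 # subject-oriented: e.g. "enrollment"
    source: str                  # integrated: which source system supplied it
    period: str                  # time-variant: every fact carries a time period
    value: float


class Warehouse:
    """Hypothetical append-only store; note there is no update or delete."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def load(self, facts: List[Fact]) -> None:
        # Non-volatile: data is only ever added.
        self._facts.extend(facts)

    def history(self, subject: str) -> Dict[str, float]:
        # Merge all sources into one total per time period for a subject.
        totals: Dict[str, float] = {}
        for fact in self._facts:
            if fact.subject == subject:
                totals[fact.period] = totals.get(fact.period, 0.0) + fact.value
        return dict(sorted(totals.items()))


# Assumed sample data from two imaginary source systems.
warehouse = Warehouse()
warehouse.load([
    Fact("enrollment", "site_a", "2003", 100.0),
    Fact("enrollment", "site_b", "2003", 200.0),
    Fact("enrollment", "site_a", "2004", 350.0),
])
enrollment_history = warehouse.history("enrollment")
```

Each later load adds further periods to `enrollment_history` without changing the earlier ones, which is what gives management a consistent picture over time.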
Operational vs. Informational Systems
Perhaps the most important concept that has come out of the Data Warehouse
movement is the recognition that there are two fundamentally different types
of information systems in all organizations: operational systems and
informational systems.
"Operational systems" are just what their name implies; they are the systems
that help us run the enterprise operation day-to-day. These are the backbone
systems of any enterprise, our "order entry", "inventory", "manufacturing",
"payroll" and "accounting" systems. Because of their importance to the
organization, operational systems were almost always the first parts of the
enterprise to be computerized. Over the years, these operational systems have
been extended and rewritten, enhanced and maintained to the point that they
are completely integrated into the organization. Indeed, most large
organizations around the world today couldn't operate without their
operational systems and the data that these systems maintain.
On the other hand, there are other functions that go on within the enterprise that
have to do with planning, forecasting and managing the organization. These
functions are also critical to the survival of the organization, especially in our
current fast-paced world. Functions like "marketing planning", "engineering
planning" and "financial analysis" also require information systems to support
them. But these functions are different from operational ones, and the types of
systems and information required are also different. These knowledge-based
functions are supported by informational systems.
"Informational systems" have to do with analyzing data and making decisions,
often major decisions, about how the enterprise will operate, now and in the
future. And not only do informational systems have a different focus from
operational ones, they often have a different scope. Where operational data
needs are normally focused upon a single area, informational data needs often
span a number of different areas and need large amounts of related operational
data.
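The difference in focus and scope can be sketched with Python's built-in `sqlite3` module. The `orders` table, its columns and its rows are assumptions made up for this example: the operational access touches a single record, while the informational query spans all of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL, status TEXT)"
)
# Hypothetical sample rows; ids are assigned 1..4 in insertion order.
conn.executemany(
    "INSERT INTO orders (region, year, amount, status) VALUES (?, ?, ?, ?)",
    [
        ("north", 2003, 120.0, "shipped"),
        ("north", 2004, 80.0, "open"),
        ("south", 2003, 200.0, "shipped"),
        ("south", 2004, 150.0, "shipped"),
    ],
)

# Operational use: update and read one order as part of day-to-day work.
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (2,))
one_order = conn.execute(
    "SELECT status FROM orders WHERE id = ?", (2,)
).fetchone()

# Informational use: summarize many rows across regions and years
# to support planning and analysis.
totals = conn.execute(
    "SELECT region, year, SUM(amount) FROM orders "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()
```

In practice the two kinds of query usually run against separate systems, so that analysis does not slow down daily operations.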
Why Data Warehousing
• Quality decisions come from quality data.
• Problems with real life data:
– Data needs to be integrated from different sources
– Missing values
– Noisy and inconsistent values
– Data is not at the right level of aggregation
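The problems listed above can be sketched on a tiny example. The records, the 0 to 100 valid range for marks and the choice to fill gaps with the mean are all assumptions for illustration; real cleaning rules depend on the data.

```python
from statistics import mean

# Hypothetical records from two different sources.
site_a = [{"student": "S1", "marks": 72}, {"student": "S2", "marks": None}]
site_b = [{"student": "s3 ", "marks": 64}, {"student": "S4", "marks": 640}]


def clean(records):
    # Inconsistent values: normalize student ids to one format.
    rows = [
        {"student": r["student"].strip().upper(), "marks": r["marks"]}
        for r in records
    ]
    # Noisy values: treat marks outside the assumed 0..100 range as missing.
    for row in rows:
        if row["marks"] is not None and not 0 <= row["marks"] <= 100:
            row["marks"] = None
    # Missing values: fill with the mean of the known marks (one simple choice).
    fill = mean(r["marks"] for r in rows if r["marks"] is not None)
    for row in rows:
        if row["marks"] is None:
            row["marks"] = fill
    return rows


# Integration: merge both sources, then clean them together.
integrated = clean(site_a + site_b)

# Aggregation: move from per-student detail to one summary figure.
average_marks = mean(r["marks"] for r in integrated)
```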
Why Use Data Mining Today
Human analysis skills are inadequate:
– Volume and dimensionality of the data
– High data growth rate
Availability of:
– Data
– Storage
– Computational power
– Off-the-shelf software
Competition on service, not only on price (banks,
phone companies, hotel chains, rental car
companies)
Personalization
What is Data Mining
Data Mining, or Knowledge Discovery in Databases (KDD) as it is
also known, is the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data. This
encompasses a number of different technical approaches, such as
clustering, data summarization, learning classification rules, finding
dependency networks, analysing changes, and detecting anomalies.
In other words, data mining is the search for relationships and global
patterns that exist in large databases but are 'hidden' among the vast
amount of data, such as a relationship between student data and their
progress reports. These relationships represent valuable knowledge
about the database and the objects in it and, if the database is a
faithful mirror, about the real world registered by the
database.
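As a very small illustration of searching for a relationship in student data, the sketch below measures how strongly attendance and marks move together using the Pearson correlation coefficient. The figures are invented, and a correlation is only one of the simplest signals; real data mining uses the broader range of techniques listed above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical student data: attendance (%) and marks in the progress report.
attendance = [55, 60, 70, 80, 90, 95]
marks = [40, 48, 55, 66, 78, 85]

r = pearson(attendance, marks)
# A value of r close to 1 suggests that higher attendance goes with higher marks.
```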
Preprocessing and Mining
[Figure: Original Data → (Data Integration and Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Model Construction) → Patterns → (Interpretation) → Knowledge]
Convergence of Three Key Technologies
Common Uses of Data Mining
• Direct mail marketing
• Web site personalization
• Credit card fraud detection
• Bioinformatics
• Cheminformatics
• Text mining & analysis
• Market basket analysis
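Market basket analysis, the last use listed above, is commonly described with two measures: support (how often items appear together) and confidence (how often the rule holds when its left side appears). A minimal sketch, with made-up baskets:

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction also containing consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


bread_and_milk = support({"bread", "milk"}, baskets)
bread_implies_milk = confidence({"bread"}, {"milk"}, baskets)
```

Rules with high support and high confidence are the candidates a retailer would examine first.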
CASE STUDY
OF SAURASHTRA UNIVERSITY
RAJKOT - GUJARAT
TECHNOLOGICAL HETEROGENEITY
Rajkot (Oracle)
Surendranagar (DEC Sybase)
Junagadh (IBM DB2)
Porbandar (FoxPro)
Amreli (Oracle)
Jamnagar (Sql Ser)
The technological environment is typically heterogeneous.
[Figure: Rajkot with the outlying sites Junagadh, Porbandar, Amreli, Jamnagar and Surendranagar]
Detailed data is refreshed into the global warehouse
from the outlying sites.
[Figure: Rajkot with the outlying sites Surendranagar, Junagadh, Jamnagar, Porbandar and Amreli]
The global data model is used to identify and define the
system of record at the outlying sites.
LEVELS OF GRANULARITY
Detailed data from outlying sites is added and aggregated upward
until the global Data Warehouse is populated.
The system of record remains at the outlying site level.
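The upward aggregation described on this slide can be sketched as follows. The site names come from the case study, but the faculties and enrollment counts are invented, and `populate_global` is a hypothetical helper rather than part of the actual system.

```python
from collections import defaultdict

# Hypothetical detailed rows kept at each outlying site: (faculty, students).
# The sites remain the system of record; the global level only holds summaries.
site_detail = {
    "Junagadh": [("Science", 120), ("Arts", 80)],
    "Amreli": [("Science", 60), ("Commerce", 40)],
    "Jamnagar": [("Arts", 50), ("Commerce", 70)],
}


def populate_global(site_detail):
    """Aggregate site-level detail upward into coarser global summaries."""
    by_site = {}
    by_faculty = defaultdict(int)
    for site, rows in site_detail.items():
        by_site[site] = sum(count for _, count in rows)
        for faculty, count in rows:
            by_faculty[faculty] += count
    return {
        "by_site": by_site,                 # one level up: total per site
        "by_faculty": dict(by_faculty),     # across sites: total per faculty
        "total": sum(by_site.values()),     # coarsest level: whole university
    }


global_warehouse = populate_global(site_detail)
```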
LOCAL WAREHOUSES
Each of the outlying sites can have its own
local Data Warehouse.
The local Data Warehouses at the outlying
site can feed the global Data Warehouse.
DRILL DOWN
Drill down starts at the global Data Warehouse
and goes to the outlying sites.
Metadata is the glue that holds the global data
environment together. Metadata is
required across the globe.
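Drill down and the role of shared metadata can be sketched together. Everything here is assumed for illustration: the `FIELDS` tuple stands in for metadata that every site agrees on, and the counts are invented.

```python
# Shared metadata (assumed): every site describes its rows with the same fields.
FIELDS = ("faculty", "students")

# Hypothetical detail held at the outlying sites.
site_detail = {
    "Junagadh": [("Science", 120), ("Arts", 80)],
    "Amreli": [("Science", 60), ("Commerce", 40)],
}

# Summary figures as they would appear in the global Data Warehouse.
global_summary = {
    site: sum(count for _, count in rows) for site, rows in site_detail.items()
}


def drill_down(site):
    """Start from the global summary figure and fetch the site's detail rows."""
    detail = [dict(zip(FIELDS, row)) for row in site_detail[site]]
    return {"site": site, "summary": global_summary[site], "detail": detail}


amreli = drill_down("Amreli")
```

Because every site uses the same field names, the detail rows returned for any site can be read the same way from the global level.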
BUILDING THE GLOBAL DATA WAREHOUSE ITERATIVELY
The global Data Warehouse is built and populated
iteratively, in phases.
STAGING AREAS
Staging areas can be created for the detailed refreshment
data as it moves to the global Data Warehouse.
SUPPORTING MORE THAN ONE GLOBAL DATA WAREHOUSE
The outlying sites can support more than one global
Data Warehouse.
Outcomes from the System
• Discover the stages and status of teaching and research work undertaken by faculty members.
• Access information about the learning environment for students and optimize it for specific needs.
• Track the allotment of work, monitor allotted work and follow up on it to increase effectiveness and productivity.
• Optimize the time needed to conduct university exams.
• Monitor departmental activities in teaching, research and administration.
• Promote collaborative work and establish communication to increase its effectiveness.
• Record keeping and data mining will generate data that assists quality maintenance and quality improvement, leading to assessment for ISO 9000 and accreditation by NAAC and NBA (AICTE).
• Account intelligence.
• Budget analysis and setting of budget targets.
• Budget monitoring on a time scale.
• Warehousing of student feedback data and generation of intelligent conclusions.
• Create syllabus information for each subject of a course program and derive conclusions, based on analysis, about accommodating required subjects and, to a certain extent, about course content detail. This approach will help to update the curriculum, keeping pace with emerging technology and its quick implementation by industry.