
Knowledge Discovery in Databases (KDD) Lect 4

Uploaded by

asiimwemartinkab

Knowledge Discovery in Databases (KDD)
• Some people treat data mining as synonymous with
knowledge discovery, while others view data mining
as an essential step in the knowledge discovery
process.
• Here is the list of steps involved in the knowledge
discovery process:

• Data Cleaning - In this step, noise and inconsistent data are
removed.
• Data Integration - In this step, multiple data sources are
combined.
• Data Selection - In this step, data relevant to the analysis task
are retrieved from the database.
• Data Transformation - In this step, data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations.
• Data Mining - In this step, intelligent methods are applied in
order to extract data patterns.
• Pattern Evaluation - In this step, data patterns are evaluated.
• Knowledge Presentation - In this step, the discovered knowledge
is presented to the user.
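The steps listed above can be sketched as a small pipeline. This is only an illustrative toy, not a real mining system: the records, the field names, the function names and the threshold of 100.0 are all assumptions made for the example, and the "mining" step is a stand-in that simply picks the region with the highest mean amount.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [
    {"customer": "c1", "region": "north", "amount": 120.0},
    {"customer": "c2", "region": "north", "amount": None},  # noisy record
    {"customer": "c3", "region": "south", "amount": 80.0},
]
source_b = [
    {"customer": "c4", "region": "south", "amount": 100.0},
    {"customer": "c5", "region": "north", "amount": 140.0},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: aggregate amounts into a mean per region."""
    by_region = {}
    for r in records:
        by_region.setdefault(r["region"], []).append(r["amount"])
    return {region: mean(vals) for region, vals in by_region.items()}

def mine(summary):
    """Data mining (toy stand-in): region with the highest mean amount."""
    return max(summary, key=summary.get)

def evaluate(pattern, summary, threshold=100.0):
    """Pattern evaluation: keep the pattern only if it clears a threshold."""
    return summary[pattern] >= threshold

integrated = integrate(clean(source_a), clean(source_b))
selected = select(integrated, ["region", "amount"])
summary = transform(selected)
top_region = mine(summary)
is_interesting = evaluate(top_region, summary)

# Knowledge presentation: report the result in readable form.
print(f"Top region: {top_region} (mean {summary[top_region]:.1f}), "
      f"interesting: {is_interesting}")
```

Each function corresponds to one step of the list, so the order of the calls mirrors the order of the KDD process.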
Data Warehouse:
• A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile
collection of data in support of management's
decision making process.
• Subject-Oriented: A data warehouse can be
used to analyze a particular subject area. For
example, "sales" can be a particular subject.
• Integrated: A data warehouse integrates data
from multiple data sources. For example,
source A and source B may have different
ways of identifying a product, but in a data
warehouse, there will be only a single way of
identifying a product.
• Time-Variant: Historical data is kept in a data warehouse.
For example, one can retrieve data from 3 months, 6
months, 12 months, or even older data from a data
warehouse. This contrasts with a transaction system,
where often only the most recent data is kept. For
example, a transaction system may hold only the most
recent address of a customer, whereas a data warehouse
can hold all addresses associated with a customer.
• Non-volatile: Once data is in the data warehouse, it will
not change. So, historical data in a data warehouse
should never be altered.
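The time-variant and non-volatile properties can be illustrated with the customer address example above. The following is a minimal sketch under assumed names (`oltp_addresses`, `warehouse_addresses`, `change_address`, `address_as_of` are all hypothetical): the transaction system overwrites the current address, while the warehouse only ever appends rows.

```python
from datetime import date

# Transaction system: one current address per customer, overwritten in place.
oltp_addresses = {}

# Warehouse: append-only history of (customer, address, valid_from) rows.
warehouse_addresses = []

def change_address(customer_id, address, on):
    oltp_addresses[customer_id] = address                    # overwrite
    warehouse_addresses.append((customer_id, address, on))   # append only

def address_as_of(customer_id, day):
    """Return the latest warehouse address recorded on or before `day`."""
    rows = [r for r in warehouse_addresses
            if r[0] == customer_id and r[2] <= day]
    return max(rows, key=lambda r: r[2])[1] if rows else None

change_address("c1", "12 Hill Road", date(2023, 1, 10))
change_address("c1", "7 Lake Street", date(2023, 9, 1))

print(oltp_addresses["c1"])                     # current address only
print(address_as_of("c1", date(2023, 6, 1)))    # historical lookup
```

Because existing warehouse rows are never modified, a question about any past date keeps giving the same answer after later changes.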
Data Warehouse Design Process:
• A data warehouse can be built using a top-
down approach, a bottom-up approach, or a
combination of both.
• The top-down approach starts with the overall design
and planning. It is useful in cases where the technology
is mature and well known, and where the business
problems that must be solved are clear and well
understood.
• The bottom-up approach starts with experiments and
prototypes. This is useful in the early stage of business
modeling and technology development.
• It allows an organization to move forward at
considerably less expense and to evaluate the benefits
of the technology before making significant
commitments.
• In the combined approach, an organization
can exploit the planned and strategic nature of
the top-down approach while retaining the
rapid implementation and opportunistic
application of the bottom-up approach.
• The warehouse design process consists of the
following steps:
• Choose a business process to model, for example,
orders, invoices, shipments, inventory, account
administration, sales, or the general ledger. If the
business process is organizational and involves
multiple complex object collections, a data
warehouse model should be followed. However, if the
process is departmental and focuses on the analysis
of one kind of business process, a data mart model
should be chosen.
• Choose the grain of the business process. The grain is
the fundamental, atomic level of data to be
represented in the fact table for this process, for
example, individual transactions, individual daily
snapshots, and so on.
• Choose the dimensions that will apply to each fact
table record. Typical dimensions are time, item,
customer, supplier, warehouse, transaction type, and
status.
• Choose the measures that will populate each fact table
record. Typical measures are numeric additive
quantities like dollars sold and units sold.
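The design steps above (business process, grain, dimensions, measures) can be made concrete with a small star schema. This is a sketch using Python's built-in `sqlite3` module; the table names, column names and sample rows are hypothetical and chosen only to match the "dollars sold" and "units sold" measures mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions chosen for the (hypothetical) sales process
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
    CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, name TEXT);

    -- Grain: one row per individual sales transaction line
    CREATE TABLE fact_sales (
        time_id      INTEGER REFERENCES dim_time(time_id),
        item_id      INTEGER REFERENCES dim_item(item_id),
        dollars_sold REAL,     -- additive measure
        units_sold   INTEGER   -- additive measure
    );
""")
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [(1, "2024-01-05", "2024-01"), (2, "2024-02-03", "2024-02")])
conn.executemany("INSERT INTO dim_item VALUES (?, ?)",
                 [(1, "pen"), (2, "notebook")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 3.0, 2), (1, 2, 10.0, 1), (2, 1, 4.5, 3)])

# Because the measures are additive, they can be summed along any dimension.
rows = conn.execute("""
    SELECT t.month, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM fact_sales AS f JOIN dim_time AS t USING (time_id)
    GROUP BY t.month ORDER BY t.month
""").fetchall()
print(rows)
```

Choosing a finer grain (individual transactions rather than daily snapshots) keeps more detail at the cost of a larger fact table; coarser summaries can always be derived from it, but not the reverse.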
Data Warehouse Architecture:
Data Warehouses usually have a three-level
(tier) architecture that includes:
1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools)
A bottom-tier
• The bottom tier consists of the data warehouse
server, which is almost always an RDBMS. It may
include several specialized data marts and a
metadata repository.
• Data from operational databases and external
sources (such as user profile data provided by
external consultants) are extracted using application
program interfaces called gateways. A gateway is
provided by the underlying DBMS and allows
client programs to generate SQL code to be
executed at a server.
A middle-tier
• The middle tier consists of an OLAP server for fast
querying of the data warehouse.
• The OLAP server is implemented using either
• (1) A Relational OLAP (ROLAP) model, i.e., an extended
relational DBMS that maps operations on multidimensional
data to standard relational operations, or
• (2) A Multidimensional OLAP (MOLAP) model, i.e., a
special-purpose server that directly implements
multidimensional data and operations.
A top-tier
• The top tier contains front-end tools for displaying
results provided by OLAP, as well as additional tools for data
mining of the OLAP-generated data.
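The difference between the ROLAP and MOLAP models described for the middle tier can be sketched on a tiny data set. This is an illustration only, not how any particular OLAP server is built: the sample facts, the `sales` table and the `"ALL"` roll-up member are assumptions made for the example. The ROLAP side uses Python's built-in `sqlite3` module.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sales facts: (region, quarter, amount)
facts = [("north", "Q1", 100.0), ("north", "Q2", 150.0),
         ("south", "Q1", 80.0), ("south", "Q2", 70.0)]

# ROLAP-style: facts live in a relational table; a roll-up over the
# quarter dimension becomes an ordinary GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", facts)
rolap = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# MOLAP-style: cells are stored in a structure indexed directly by
# dimension coordinates, with roll-up totals kept under an "ALL" member.
cube = defaultdict(float)
for region, quarter, amount in facts:
    cube[(region, quarter)] += amount
    cube[(region, "ALL")] += amount
molap = {region: total for (region, q), total in cube.items() if q == "ALL"}

print(rolap)
print(molap)
```

Both approaches give the same totals; the ROLAP version computes them with a relational query at request time, while the MOLAP version reads them from cells that were aggregated when the cube was built.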
• The metadata repository stores information that
defines DW objects. It includes the following
parameters and information for the middle and the
top-tier applications:
• A description of the DW structure, including the
warehouse schema, dimensions, hierarchies, data
mart locations and contents, etc.
• Operational metadata, which usually describes the
currency level of the stored data (active, archived
or purged) and warehouse monitoring information
(usage statistics, error reports, audit trails, etc.).
• System performance data, which includes indices,
used to improve data access and retrieval
performance.
• Information about the mapping from operational
databases, which provides source RDBMSs and
their contents, cleaning and transformation rules,
etc.
• Summarization algorithms, predefined queries, and
reports.
• Business data, which include business terms and
definitions, ownership information, etc.
What is an Operational Data Store (ODS)?

• An ODS has been described by Inmon and Imhoff
(1996) as a subject-oriented, integrated, volatile,
current-valued data store, containing only detailed
corporate data.
• A data warehouse, by contrast, is a reporting
database that includes relatively recent as well as
historical information and may also include
aggregate data.
• The ODS is subject-oriented. It is organized
around the significant information subjects of an
enterprise. In a university, the subjects may be
students, lecturers and courses, while in a
company the subjects might be customers,
salespersons and products.
• The ODS is integrated. That is, it is a collection of
subject-oriented records from a variety of systems
that provides an enterprise-wide view of the
information.
• The ODS is current-valued. That is, an ODS is
up to date and reflects the current status of the
data.
• An ODS does not contain historical
information.
• Since the OLTP system data is changing all the
time, data from the underlying sources refresh the
ODS as regularly and frequently as possible.
• The ODS is volatile. That is, the data in the
ODS changes frequently as new data refreshes
the ODS.
• The ODS is detailed. That is, the ODS is detailed
enough to serve the needs of the operational
management staff in the enterprise. The
granularity of the information in the ODS does
not have to be precisely the same as in the
source OLTP system.
ODS Design and Implementation
• The extraction of data from source databases needs
to be efficient, and the quality of records needs to
be maintained.
• Since the data is refreshed regularly and frequently,
suitable checks are required to ensure the quality of
data after each refresh.
• An ODS is a read-only database apart from the regular
refreshes by the OLTP systems. Users should
not be allowed to update ODS information.
• Populating an ODS involves an acquisition
phase of extracting, transforming and loading
information from OLTP source systems.
• This procedure is called ETL.
• After the database has been populated, checking
for anomalies and testing for performance are
essential before an ODS system can go online.
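The ETL refresh and quality checks described above can be sketched as follows. This is a minimal illustration with made-up rows and hypothetical function names; a real ODS would read from and write to databases rather than Python lists and dictionaries. Loading overwrites by key, which reflects the volatile, current-valued nature of an ODS.

```python
# Hypothetical rows extracted from an OLTP source system.
oltp_rows = [
    {"id": 1, "name": " alice ", "balance": "120.50"},
    {"id": 2, "name": "Bob", "balance": "80"},
    {"id": 1, "name": "Alice", "balance": "95.00"},  # later update for id 1
    {"id": 3, "name": "", "balance": "oops"},         # fails quality checks
]

def extract(rows):
    """Extract: a real system would read from the source database here."""
    return list(rows)

def transform(row):
    """Transform: normalise names and parse balances; None if invalid."""
    name = row["name"].strip().title()
    try:
        balance = float(row["balance"])
    except ValueError:
        return None
    if not name:
        return None
    return {"id": row["id"], "name": name, "balance": balance}

def load(ods, rows):
    """Load: overwrite by key, so the ODS keeps only current values.

    Returns the number of rows rejected by the quality checks.
    """
    rejected = 0
    for row in rows:
        cleaned = transform(row)
        if cleaned is None:
            rejected += 1
        else:
            ods[cleaned["id"]] = cleaned
    return rejected

ods = {}
rejected = load(ods, extract(oltp_rows))
print(ods)
print("rejected:", rejected)
```

Counting rejected rows after each refresh is one simple form of the quality check mentioned above; a sudden jump in the count would be a signal to inspect the source data.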
Difference between Operational Data Stores and Data Warehouse
Operational Data Store | Data Warehouse
An ODS is meant for operational reporting and supports current or near real-time reporting requirements. | A data warehouse is intended for historical and trend analysis, usually reporting on a large volume of data.
An ODS consists of only a short window of data. | A data warehouse includes the entire history of data.
It is typically detailed data only. | It contains summarized and detailed data.
It is used for detailed decision making and operational reporting. | It is used for long-term decision making and management reporting.
It is used at the operational level. | It is used at the managerial level.
It serves as a conduit for data between operational and analytics systems. | It serves as a repository for cleansed and consolidated data sets.
It is updated often as the transaction system generates new data. | It is usually updated in batch processing mode on a set schedule.