Data Warehousing & Mining Prof. J. N. Rajurkar MIET Bhandara
In general, a data-driven DSS emphasizes access to and manipulation of a time series of internal company data, and sometimes external and real-time data.
Simple file systems accessed by query and retrieval tools provide the most elementary level of functionality.
A data-driven DSS with online analytical processing provides the highest level of functionality and decision support, linked to the analysis of large collections of historical data. Executive information systems are examples of data-driven DSS.
One of the first data-driven DSS was built using an APL-based software package called AAIMS, an acronym for An Analytical Information Management System.
Business Intelligence (BI) is sometimes used interchangeably with briefing books, report and query tools, and executive information systems. In general, business intelligence systems are data-driven DSS.
Communication-driven DSS use network and communications technologies to facilitate decision-relevant collaboration and communication.
In these systems, communication technologies are the dominant architectural component. Tools such as groupware, video conferencing, and computer-based bulletin boards are the primary technologies.
In the past few years, voice and video delivered using the Internet Protocol have greatly expanded the possibilities for synchronous communication-driven DSS.
A document-driven DSS uses computer storage and processing technologies to provide document retrieval and analysis.
Large document databases may include scanned documents, hypertext documents, images, sounds, and videos.
Web (WWW) technologies have significantly increased the availability of documents and facilitated the development of document-driven DSS.
Knowledge-based DSS can suggest or recommend actions to managers. These DSS are person-computer systems with specialized problem-solving expertise.
Such systems have been called suggestion DSS.
Artificial intelligence systems have been developed to detect fraud and expedite financial transactions, and many medical diagnostic systems have been based on AI.
The MYCIN project for blood disease diagnosis is an example of a knowledge-based DSS.
Beginning in approximately 1995, the World Wide Web (WWW) on the global Internet provided a technology platform for further extending the capabilities and development of computerized decision support.
The release of HTML with the form tag and tables was a turning point in the development of Web-based DSS.
A Web-based decision support system delivers decision support information to managers using Web browsers such as Netscape Navigator or Internet Explorer.
Or A data warehouse refers to a data repository that is maintained separately from an organization's operational databases. DW systems allow for the integration of a variety of application systems.
Or According to William H. Inmon, “A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.”
Or A data warehouse is a single, complete, and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context.
Or A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of a huge amount of data which helps in the management decision-making process.
Subject Oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modelling and analysis of data for decision makers. Hence data warehouses typically provide a simple and concise view of a particular subject by excluding data that are not useful for decision support.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
Time Variant: Data are stored to provide information from a historic perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains an element of time, either implicitly or explicitly. The time-variant nature of the data in a data warehouse allows for analysis of the past, relates information to the present, and enables forecasts for the future.
Non-Volatile: Non-volatile means that once data have entered the warehouse, they should not change. A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, or concurrency control mechanisms.
Data Granularity: When users query the data warehouse for analysis, they usually start by looking at summary data. Therefore it is efficient to keep the data summarized at different levels. Depending on the query, we can then go to the particular level of detail that satisfies it. Data granularity refers to the level of detail: the lower the level of detail, the finer the data granularity.
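The idea of keeping data at several granularity levels can be sketched in plain Python; the records, column names, and roll-up keys below are purely hypothetical:

```python
from collections import defaultdict

# Hypothetical sales records at the finest granularity: one per transaction.
sales = [
    ("2023-01-05", "pen", 10.0),
    ("2023-01-07", "pen", 15.0),
    ("2023-02-10", "book", 40.0),
    ("2023-02-12", "pen", 5.0),
]

def summarize(records, key):
    """Roll transactions up to the coarser granularity given by `key`."""
    totals = defaultdict(float)
    for date, product, amount in records:
        totals[key(date, product)] += amount
    return dict(totals)

# Medium granularity: totals per (month, product).
by_month_product = summarize(sales, lambda d, p: (d[:7], p))

# Coarsest granularity: totals per month only.
by_month = summarize(sales, lambda d, p: d[:7])

print(by_month)  # {'2023-01': 25.0, '2023-02': 45.0}
```

A query about monthly revenue is answered from the coarse summary, while a query about an individual transaction drills down to the detailed records.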
Q.) What do you mean by Strategic Information? Describe its characteristic features. (W-2015)
Or
Explain the Compelling need for data warehousing.
Ans: Strategic information (SI) is information that helps companies change or otherwise alter their business strategy and/or structure. It is typically used to streamline and quicken the reaction time to environmental changes and to aid in achieving a competitive advantage.
The executives and managers who are responsible for keeping the enterprise competitive need
information to make proper decisions. They need information to formulate the business strategies, establish
goals and monitor results.
The type of information needed to make decisions in the formulation and execution of business strategies and objectives is broad-based and encompasses the entire organization. We may combine all these types of essential information into one group and call it strategic information.
Processing large volumes of data and providing interactive analysis require extra computing power. The explosive increase in computing power and its lower cost make the provision of strategic information feasible.
DW planning: This phase is aimed at determining the scope and goals of the DW, and determines the number of data marts and the order in which they are to be implemented according to business priorities and technical constraints. At this stage the physical architecture of the system must be defined.
Data mart design and implementation: This macro-phase will be repeated for each data mart to be
implemented and will be discussed in more detail in the following. At each iteration a new data mart is
designed and deployed. Multidimensional modelling of each data mart must be carried out considering the
available conformed dimensions and the constraints deriving from previous implementations.
DW maintenance and evolution: DW maintenance mainly concerns performance optimization that must be
periodically carried out due to user requirements that change according to the problems and the opportunities
the managers run into. On the other hand, DW evolution concerns keeping the DW schema up-to-date with
respect to the business domain and the business requirement changes.
Knowledge Discovery from Data (KDD) is the process of discovering useful knowledge from a collection of data.
Major KDD application areas include marketing, fraud detection and telecommunications.
The KDD process includes the following iterative steps:
Data Cleaning: This step is used to remove noise and inconsistent data.
Data Integration: In this step, multiple data sources may be combined.
Data Selection: In this step, the data relevant to the analysis task are retrieved from the database; data not relevant to the analysis task are omitted.
Data Transformation: In this step, data are transformed and consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.
Data Mining: This is the essential process in which intelligent methods are applied to extract data patterns.
Pattern Evaluation: This step is used to identify the truly interesting patterns representing
knowledge based on interestingness measures.
Knowledge presentation: In this step visualization and knowledge representation techniques are
used to present mined knowledge to users.
Steps including data cleaning, data integration, data selection, and data transformation are the data pre-processing steps.
Data mining step may interact with the user or knowledge base. The interesting patterns are presented to the
user and may be stored as new knowledge in the knowledge base.
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. Data sources may include databases, data warehouses, the Web, and other information repositories.
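As a purely illustrative sketch, the KDD steps listed above can be walked through on toy data; all records, attributes, and the spending threshold here are hypothetical:

```python
from collections import defaultdict

raw_source_a = [{"id": 1, "age": 25, "spend": 200.0},
                {"id": 2, "age": None, "spend": 150.0}]   # noisy record
raw_source_b = [{"id": 3, "age": 40, "spend": 900.0}]

# 1. Data cleaning: drop records with missing values.
def clean(records):
    return [r for r in records if all(v is not None for v in r.values())]

# 2. Data integration: combine multiple sources.
integrated = clean(raw_source_a) + clean(raw_source_b)

# 3. Data selection: keep only attributes relevant to the task.
selected = [{"age": r["age"], "spend": r["spend"]} for r in integrated]

# 4. Data transformation: e.g. discretize spend into bands.
def transform(r):
    r = dict(r)
    r["band"] = "high" if r["spend"] >= 500 else "low"
    return r

transformed = [transform(r) for r in selected]

# 5. Data mining: an intentionally trivial pattern extractor,
#    computing the average age per spending band.
sums, counts = defaultdict(float), defaultdict(int)
for r in transformed:
    sums[r["band"]] += r["age"]
    counts[r["band"]] += 1
patterns = {band: sums[band] / counts[band] for band in sums}

# 6./7. Pattern evaluation and knowledge presentation would filter and
#       display the patterns; here we simply print them.
print(patterns)  # {'low': 25.0, 'high': 40.0}
```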
Q.) Why do you need a separate data staging area in a DWH? Explain its functions. (W-15)
Ans: A staging area is an intermediate storage area used for data processing during the extract, transform and load (ETL) process. The data staging area sits between the data sources and the data targets, which are often data warehouses, data marts, or other data repositories. It is also called a landing zone.
The primary motivations for its use are to increase the efficiency of ETL processes, ensure data integrity, and support data quality operations. The functions of the staging area include the following:
Consolidation: One of the primary functions performed by a staging area is consolidation of data
from multiple source systems. In performing this function the staging area acts as a large "bucket" in
which data from multiple source systems can be temporarily placed for further processing.
Alignment: Aligning data includes standardization of reference data across multiple source systems
and validation of relationships between records and data elements from different sources.
Minimizing contention: The staging area and ETL processes it supports are often designed with a
goal of minimizing contention within source systems.
Independent scheduling/multiple targets: The staging area can support hosting of data to be
processed on independent schedules, and data that is meant to be directed to multiple targets.
Change detection: This functionality is particularly useful when the source systems do not support
reliable forms of change detection, such as system-enforced time stamping.
Cleansing data: Data cleansing includes identification and removal (or update) of invalid data from
the source systems.
Data archiving and troubleshooting: The staging area can be used to maintain historical records during the load process, or to push data into a target archive structure.
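A minimal sketch of the consolidation and alignment functions, assuming two hypothetical source systems that use different codes for the same country:

```python
# Hypothetical source records from two systems feeding the staging area.
source_crm = [{"cust": "C1", "country": "UK"}]
source_erp = [{"cust": "C2", "country": "GBR"}]

# Alignment: standardize reference data across the source systems
# (the code mapping here is an illustrative assumption).
COUNTRY_MAP = {"UK": "GB", "GBR": "GB"}

def stage(record, system):
    staged = dict(record)
    staged["country"] = COUNTRY_MAP.get(staged["country"], staged["country"])
    staged["source_system"] = system   # keep lineage for troubleshooting
    return staged

# Consolidation: one "bucket" temporarily holding data from all sources.
staging_area = [stage(r, "crm") for r in source_crm] + \
               [stage(r, "erp") for r in source_erp]

print([r["country"] for r in staging_area])  # ['GB', 'GB']
```

After this staging step, downstream loading into the warehouse sees a single, consistently coded set of records.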
By a multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.
3. Top Tier: This tier is the front-end client layer. It holds the query tools, reporting tools, analysis tools, and data mining tools (e.g., trend analysis, prediction, and so on).
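As an illustration of the MOLAP idea of directly storing multidimensional data, the sketch below holds a tiny sales cube as a dense nested array and rolls it up along one dimension; all dimensions and cell values are hypothetical:

```python
# cube[product][month][region] = sales amount (hypothetical values).
products = ["pen", "book"]
months = ["Jan", "Feb"]
regions = ["N", "S"]

cube = [
    [[10, 5], [7, 3]],    # pen:  Jan [N, S], Feb [N, S]
    [[20, 8], [6, 4]],    # book: Jan [N, S], Feb [N, S]
]

# Roll-up over the region dimension: sales per product per month.
by_product_month = [[sum(cell) for cell in rows] for rows in cube]

# Roll-up over everything: the grand total.
total = sum(sum(sum(cell) for cell in rows) for rows in cube)

print(by_product_month)  # [[15, 10], [28, 10]]
print(total)             # 63
```

Because the data are stored positionally, aggregations are simple array traversals rather than relational joins, which is the performance appeal of MOLAP storage.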
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. A virtual warehouse is easy to build, but building one requires excess capacity on the operational database servers.
Data Mart
A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups within an organization.
In other words, we can say that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to specific subjects.
Points to remember about data marts:
Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
The life cycle of a data mart may be complex in the long run if its planning and design are not organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.
Enterprise Warehouse
An enterprise warehouse collects all of the information and subjects spanning an entire organization.
It provides enterprise-wide data integration.
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
This type of warehouse can be implemented on traditional mainframes, supercomputer servers, or parallel architecture platforms.
Ans:- Metadata is simply defined as data about data; data used to describe other data is metadata. Metadata in a data warehouse defines the warehouse objects and acts as a directory that helps the decision support system locate the contents of the data warehouse. Metadata is a road map to the data warehouse, created for the data names and definitions of a given data warehouse. For example, the index of a book serves as metadata for the contents of the book.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It contains the following metadata:
Business metadata - It contains the data ownership information, business definition, and changing
policies.
Operational metadata - It includes currency of data and data lineage. Currency of data refers to the
data being active, archived, or purged. Lineage of data means history of data migrated and
transformation applied on it.
Algorithms used for summarization - These include measure and dimension definition algorithms, partitions, subject areas, aggregation and summarization, and predefined queries and reports.
Data for mapping from the operational environment to the data warehouse - This metadata includes source databases and their contents, data extraction, data partitioning, cleaning and transformation rules, and data refresh and purging rules.
Data related to system performance - This includes indices and profiles that improve data access and retrieval performance, replication cycles, and so on.
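One way to picture such a repository is as a nested directory structure. The sketch below is purely illustrative; every element name, owner, and lineage value in it is a hypothetical example, not part of any real warehouse:

```python
# A toy metadata repository holding the categories listed above.
metadata_repository = {
    "business": {
        "sales_amount": {"owner": "finance dept",
                         "definition": "net sales in USD"},
    },
    "operational": {
        "sales_amount": {"currency": "active",   # active/archived/purged
                         "lineage": ["erp.orders", "etl.currency_convert"]},
    },
    "mapping": {
        "sales_amount": {"source": "erp.orders.net_total",
                         "transformation": "convert to USD",
                         "refresh": "nightly"},
    },
}

def lineage(element):
    """Use the repository as a directory to locate an element's history."""
    return metadata_repository["operational"][element]["lineage"]

print(lineage("sales_amount"))  # ['erp.orders', 'etl.currency_convert']
```

A decision support tool would consult such a structure to discover where a warehouse element came from and how it was transformed.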
Types of Metadata
Operational Metadata
Extraction and Transformation Metadata
End-User Metadata
Operational Metadata: Data for the data warehouse come from several operational systems of the enterprise. These source systems contain different data structures, and the data elements selected for the data warehouse have various field lengths and data types. In selecting data from the source systems for the data warehouse, we split records, combine parts of records from different source files, and deal with multiple coding schemes and field lengths. When we deliver information to the end-users, we must be able to tie it back to the original source data sets. Operational metadata contain all of this information about the operational data sources.
Extraction and Transformation Metadata: Extraction and transformation metadata contain data about the extraction of data from the source systems, namely the extraction frequencies, extraction methods, and business rules for the data extraction. This category of metadata also contains information about all the data transformations that take place in the data staging area.
End-User Metadata. The end-user metadata is the navigational map of the data warehouse. It enables the
end-users to find information from the data warehouse. The end-user metadata allows the end-users to use
their own business terminology and look for information in those ways in which they normally think of the
business.
Ans:
The operational systems such as order processing, inventory control, claims processing, outpatient billing, and so on are not designed or intended to provide strategic information. If we need the ability to provide strategic information, we must get it from altogether different types of systems. Only specially designed decision support systems, or informational systems, can provide strategic information.
Operational systems are online transaction processing (OLTP) systems. These are the systems that
are used to run the day-to-day core business of the company. They support the basic business
processes of the company. These systems typically get the data into the database.
On the other hand, specially designed and built decision-support systems are not meant to run the
core business processes. They are used to watch how the business runs, and then make strategic
decisions to improve the business.
From the data analyst’s point of view, decision support data differ from operational data in three
main areas: time span, granularity, and dimensionality.
Time span: Operational data cover a short time frame. In contrast, decision support data tend to
cover a longer time frame.
Granularity (level of aggregation): Decision support data must be presented at different levels of
aggregation, from highly summarized to near-atomic.
Dimensionality: Operational data focus on representing individual transactions rather than on the
effects of the transactions over time. In contrast, data analysts tend to include many data dimensions
and are interested in how the data relate over those dimensions.
Benefits of DSS
Quick retrieval
The ability to share information across the company
Support for simultaneous read requests through pre-defined queries
The large amount of business-related data that can be stored
Ans:
Database systems are one of the key enabling forces behind business transformation.
Database system technology also needs to be efficient in terms of storage and speed.
Modern database systems thus need to build high-reliability mechanisms into their designs.
Performance evaluation of database system technology is thus an important concern. Performance evaluation of a database is a non-trivial activity, made more complicated by the existence of different flavors of database systems tuned for specific requirements.
The database is the shared resource at the centre of such a system. The database's function is the optimal storage of data while maintaining the correctness of the data and the consistency of the system at all times.
A database management system is a complex set of software programs that controls the organization, storage, management, and retrieval of data in a database.
Equivalently, a database management system is a complex set of software programs that allows multiple users to access, create, update, and retrieve data to and from the database.
The storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system. The storage manager is responsible for tasks such as interaction with the file manager and the efficient storing, retrieving, and updating of data.
Users are differentiated by the way they want to interact with the system. Specialized users write specialized database applications that do not fit into the traditional data-processing framework.
Sophisticated users form requests in a database query language.
Naïve users invoke one of the permanent application programs that have been written previously.
A data model is simply a way of structuring data; it also defines the set of operations that can be performed on the data. The flat model consists of a single, two-dimensional array of data elements.
The network model organizes data using two fundamental structures called records and sets. A relational database contains multiple tables, each similar to a single flat-model table.
The dimensional model is often implemented on top of the relational model using a star schema, consisting of one table containing the facts and surrounding tables containing the dimensions.
Object database models aim to avoid the overhead (referred to as the impedance mismatch) of converting information between the relational and object representations.
After we have extracted data from various operational systems and external sources, we have to prepare the data for storing in the data warehouse.
The extracted data are available in different formats; hence different functions, such as transformation, are applied in the staging area to prepare them for loading.
Data staging provides a place and an area with a set of functions to clean, change, combine, and convert the data for storage and use in the data warehouse.
E) Metadata Component
Metadata in data warehouse is similar to the data dictionary or data catalog in the database
management system.
The data dictionary contains data about the data in the database. Similarly metadata component is the
data about the data in data warehouse.
F) Management and Control Component
This component of the data warehouse architecture sits on top of all the other components.
The management and control component coordinates the services and activities within the data warehouse.
This component controls the data transformation and data transfer into data warehouse storage.
It monitors the movement of data into staging area and from there into data warehouse storage itself.
The management and control component interacts with the metadata component to perform the management and control functions.
Ans: A data warehouse is never static; it evolves as the business expands. As the business evolves, its
requirements keep changing and therefore a data warehouse must be designed to ride with these changes.
Hence a data warehouse system needs to be flexible. The delivery method is a variant of the joint application
development approach adopted for the delivery of a data warehouse.
1. Standard Reports:
Usage: Reports that require infrequent structural changes, and can be easily accessed electronically.
2. Queries
3. Analytical Applications
4. OLAP Analysis
Purpose: Provides ability to perform summary, detailed or trend analysis on requested data.
6. Data Mining
Ans: To design an effective data warehouse we need to understand and analyze business needs and construct
a business analysis framework. A data warehouse can be built using a top-down approach, a bottom-up
approach or a combination of both.
The top-down approach starts with overall design and planning. It is useful in cases where the
technology is mature and well known, and where the business problems that must be solved are clear and
well understood.
The bottom up approach starts with experiments and prototypes. This is useful in the early stage of
business modeling and technology development. It allows an organization to move forward at considerably
less expense and to evaluate the technological benefits before making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.
From the software engineering point of view, the design and construction of a data warehouse may consist of
the following steps: planning, requirements study, problem analysis, warehouse design, data integration and
testing, and finally deployment of the data warehouse. Large software systems can be developed using one
of two methodologies: the waterfall method or the spiral method. The waterfall method performs a
structured and systematic analysis at each step before proceeding to the next, which is like a waterfall,
falling from one step to the next. The spiral method involves the rapid generation of increasingly functional
systems, with short intervals between successive releases. This is considered a good choice for data
warehouse development, especially for data marts, because the turnaround time is short, modifications can
be done quickly, and new designs and technologies can be adapted in a timely manner.
1. Choose a business process to model (e.g., orders, invoices, shipments, inventory, account administration,
sales, or the general ledger). If the business process is organizational and involves multiple complex object
collections, a data warehouse model should be followed. However, if the process is departmental and
focuses on the analysis of one kind of business process, a data mart model should be chosen.
2. Choose the business process grain, which is the fundamental, atomic level of data to be represented in the
fact table for this process (e.g., individual transactions, individual daily snapshots, and so on).
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item,
customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric additive
quantities like dollars_sold and units_sold.
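The four design choices above can be sketched as a tiny star schema in plain Python; the business process (sales), the transaction grain, and all keys and values below are hypothetical illustrations:

```python
# Step 3: dimension tables (time and item), keyed by surrogate keys.
dim_time = {1: {"day": "2023-01-05"}, 2: {"day": "2023-01-06"}}
dim_item = {10: {"name": "pen"}, 11: {"name": "book"}}

# Steps 1, 2, and 4: a fact table for the sales process at transaction
# grain, carrying the additive measures dollars_sold and units_sold.
fact_sales = [
    {"time_key": 1, "item_key": 10, "dollars_sold": 20.0, "units_sold": 4},
    {"time_key": 1, "item_key": 11, "dollars_sold": 35.0, "units_sold": 1},
    {"time_key": 2, "item_key": 10, "dollars_sold": 10.0, "units_sold": 2},
]

# A typical query: total dollars_sold per item name. The fact rows are
# joined to the item dimension, then the additive measure is aggregated.
totals = {}
for row in fact_sales:
    name = dim_item[row["item_key"]]["name"]
    totals[name] = totals.get(name, 0.0) + row["dollars_sold"]

print(totals)  # {'pen': 30.0, 'book': 35.0}
```

Because the measures are additive, any roll-up (by day, by item, or overall) is a simple sum over the fact rows selected through the dimension tables.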
Because data warehouse construction is a difficult and long-term task, its implementation scope
should be clearly defined. The goals of an initial data warehouse implementation should be specific,
achievable, and measurable. This involves determining the time and budget allocations, the subset of the
organization that is to be modeled, the number of data sources selected, and the number and types of
departments to be served.
Once a data warehouse is designed and constructed, the initial deployment of the warehouse includes
initial installation, roll-out planning, training, and orientation. Platform upgrades and maintenance must also
be considered.
Various kinds of data warehouse design tools are available. Data warehouse development tools
provide functions to define and edit metadata repository contents (e.g., schemas, scripts, or rules), answer
queries, output reports, and ship metadata to and from relational database system catalogs. Planning and
analysis tools study the impact of schema changes and of refresh performance when changing refresh rates
or time windows.