Data Mining Warehousing I & II
Characteristics
Data warehouses are repositories of high-volume information. They are centralized stores of all the
data a company may generate, formed by relational databases and designed for query and analysis.
Data warehouses allow for quick, accurate access to structured data via predefined queries.
i. Subject-oriented:
The warehouse organizes data around the essential subjects of the business (such as customers and
products) rather than around applications such as inventory management or order processing.
ii. Integrated:
Data from several sources are extracted and transformed into a consistent, uniform format,
regardless of the original source. For example, coding conventions are standardized: M = male, F =
female.
iii. Time-variant:
Data are organized by various time periods (e.g. weekly, monthly, annually).
iv. Non-volatile:
The warehouse’s database is not updated in real time. It is periodically updated via the
uploading of data, protecting it from the influence of momentary change.
There are a number of steps and processes in building a warehouse.
First, you must identify where the relevant data is stored. This can be a challenge. When the
Commonwealth Bank opted to implement CRM in its retail banking business, it found that relevant
customer data were resident on over 80 separate systems.
Secondly, data must be extracted from those systems. It is possible that when these systems were
developed they were not expected to align with other systems. The data then needs to be transformed
into a standardized, consistent and clean format. Data in different systems may have been stored in
different forms. Also, the cleanliness of data from different parts of the business may vary.
The culture in sales may be very driven by quarterly performance targets. Getting sales
representatives to maintain their customer files may not be straightforward. Much of their
information may be in their heads. On the other hand, direct marketers may be very dedicated to
keeping their data in good shape.
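The transformation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (name, gender), the source codes in GENDER_CODES and the two sample rows are hypothetical and do not come from any real system. It shows how values stored in different forms might be mapped onto a single warehouse convention (M / F), with unmappable values flagged as None rather than guessed.

```python
# Hypothetical mapping from source-system codes onto the warehouse convention.
GENDER_CODES = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "2": "F",
}


def standardize_gender(raw):
    """Return 'M', 'F', or None when the source value cannot be mapped."""
    if raw is None:
        return None
    return GENDER_CODES.get(str(raw).strip().lower())


def transform(record):
    """Return a cleaned copy of one source record (a dict)."""
    cleaned = dict(record)
    # Collapse stray whitespace and normalize capitalization of the name.
    cleaned["name"] = " ".join(str(record.get("name", "")).split()).title()
    cleaned["gender"] = standardize_gender(record.get("gender"))
    return cleaned


# Two made-up rows from two imaginary source systems.
crm_row = {"name": "  jane   DOE ", "gender": "female"}
billing_row = {"name": "John Smith", "gender": 1}

print(transform(crm_row))      # {'name': 'Jane Doe', 'gender': 'F'}
print(transform(billing_row))  # {'name': 'John Smith', 'gender': 'M'}
```

Returning None for unknown codes keeps bad data visible, so it can be reviewed instead of being silently loaded.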
After transformation, the data then needs to be uploaded into the warehouse. Archival data that have
little relevance to today’s operations may be set aside, or only uploaded if there is sufficient space.
Recent operational and transactional data from the various functions, channels and touch points will
most probably be prioritized for uploading. Refreshing the data in the warehouse is important. This
may be done on a daily or weekly basis depending upon the speed of change in the business and its
environment.
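A periodic refresh of this kind might look like the following sketch, which uses Python's built-in sqlite3 module as a stand-in for the warehouse database. The table name customer_snapshot, its columns, and the sample dates are assumptions made for illustration only. Each batch is appended with its load date and earlier rows are never overwritten, which reflects the time-variant and non-volatile characteristics listed earlier.

```python
import sqlite3
from datetime import date


def load_batch(conn, rows, load_date):
    """Append one batch of already-transformed rows, stamped with its load date."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS customer_snapshot (
               customer_id INTEGER,
               segment     TEXT,
               load_date   TEXT
           )"""
    )
    conn.executemany(
        "INSERT INTO customer_snapshot (customer_id, segment, load_date) "
        "VALUES (?, ?, ?)",
        [(r["customer_id"], r["segment"], load_date.isoformat()) for r in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only

# Two weekly refreshes (made-up data): the customer changes segment in between.
load_batch(conn, [{"customer_id": 1, "segment": "retail"}], date(2024, 1, 1))
load_batch(conn, [{"customer_id": 1, "segment": "premium"}], date(2024, 1, 8))

history = conn.execute(
    "SELECT load_date, segment FROM customer_snapshot "
    "WHERE customer_id = 1 ORDER BY load_date"
).fetchall()
print(history)  # [('2024-01-01', 'retail'), ('2024-01-08', 'premium')]
```

Because both snapshots are kept, the warehouse can answer questions about how a customer looked at each point in time, not only how they look today.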
A data warehouse maintains a copy of information from the source transaction systems.
This architectural complexity provides the opportunity to:
a. Integrate data from multiple source systems, enabling a central view across the enterprise. This
benefit is always valuable, but particularly so when the organization has grown by merger.
b. Improve data quality by providing consistent codes and descriptions, and by flagging or even
fixing bad data.
c. Provide a single common data model for all data of interest regardless of the data’s source.
d. Restructure the data so that it delivers excellent query performance, even for complex analytic
queries, without impacting the operational systems.
e. Add value to operational business applications, notably customer relationship management (CRM)
systems.
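The idea of a single common data model can be illustrated with a small Python sketch. Everything named here is hypothetical: the Customer class, the two source layouts (a CRM system and a billing system) and their field names are invented for the example. Each source gets its own small adapter function, and all records end up in the same shape regardless of where they came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Hypothetical common model shared by every source system."""
    customer_id: str
    full_name: str
    source: str


def from_crm(row):
    """Adapter for an imaginary CRM layout: {'id': ..., 'name': ...}."""
    return Customer(f"crm-{row['id']}", row["name"], "crm")


def from_billing(row):
    """Adapter for an imaginary billing layout with split name fields."""
    return Customer(
        f"billing-{row['acct_no']}", f"{row['first']} {row['last']}", "billing"
    )


# Made-up sample rows in each source system's own format.
crm_rows = [{"id": 7, "name": "Jane Doe"}]
billing_rows = [{"acct_no": "A-19", "first": "John", "last": "Smith"}]

customers = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
for c in customers:
    print(c)
```

Keeping the source name on each record is one possible design choice; it makes it easier to trace a warehouse row back to the system it was extracted from.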
Comparison of an operational (OLTP) system with a data warehouse:
Purpose of data
  OLTP system: To control and run fundamental business tasks.
  Data warehouse: To help with planning, problem solving, and decision support.
What the data reveals
  OLTP system: A snapshot of ongoing business processes.
  Data warehouse: Multi-dimensional views of various kinds of business activities.
Inserts and Updates
  OLTP system: Short and fast inserts and updates initiated by end users.
  Data warehouse: Periodic long-running batch jobs refresh the data.
Queries
  OLTP system: Relatively standardized and simple queries returning relatively few records.
  Data warehouse: Often complex queries involving aggregations.
Processing Speed
  OLTP system: Typically very fast.
  Data warehouse: Depends on the amount of data involved; batch data refreshes and complex queries
  may take many hours; query speed can be improved by creating indexes.
Space Requirements
  OLTP system: Can be relatively small if historical data is archived.
  Data warehouse: Larger due to the existence of aggregation structures and history data; requires
  more indexes than OLTP.
Database Design
  OLTP system: Highly normalized with many tables.
  Data warehouse: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.
Backup and Recovery
  OLTP system: Back up religiously; operational data is critical to run the business, and data loss
  is likely to entail significant monetary loss and legal liability.
  Data warehouse: Instead of regular backups, some environments may consider simply reloading the
  OLTP data as a recovery method.
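To make the warehouse side of this comparison more concrete, here is a minimal star-schema sketch using Python's built-in sqlite3 module. The table names (fact_sales, dim_product), the index name and the sample figures are assumptions chosen for illustration. It shows one fact table joined to one dimension table, an index on the join column, and an aggregation query of the kind a warehouse typically serves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only

conn.executescript("""
    -- Dimension table: descriptive attributes of each product.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);

    -- Fact table: one row per sale, pointing at the dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        sale_month TEXT,
        amount     REAL
    );

    -- Index on the join column to speed up analytic queries.
    CREATE INDEX idx_sales_product ON fact_sales(product_id);

    -- Made-up sample data.
    INSERT INTO dim_product VALUES (1, 'loans'), (2, 'cards');
    INSERT INTO fact_sales VALUES
        (1, '2024-01', 100.0),
        (1, '2024-02', 150.0),
        (2, '2024-01', 40.0);
""")

# Aggregation across the fact table, grouped by a dimension attribute.
totals = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(totals)  # [('cards', 40.0), ('loans', 250.0)]
```

An OLTP query would more typically fetch or update a single row by key; the grouped SUM over many rows is what distinguishes the analytic workload.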