Reference Short Notes For Mid Term Papers: CS614 - Data Warehousing
OLTP                                 | DWH
Primary key used                     | Primary key NOT used
No concept of Primary Index          | Primary index used
May use a single table               | Uses multiple tables
Few rows returned                    | Many rows returned
High selectivity of query            | Low selectivity of query
Indexing on primary key (unique)     | Indexing on primary index (non-unique)
A complete repository of historical corporate data extracted from transaction systems that is
available for ad-hoc access by knowledge workers
There are, and can be, many applications of a data warehouse; it is not possible to discuss all of them. Some representative applications are listed below:
o Fraud detection
o Profitability analysis
o Direct mail/database marketing
o Credit risk prediction
o Customer retention modeling
o Yield management
o Inventory management
1) Collapsing Tables.
2) Pre-Joining (see the sketch after the list below).
ADVANTAGES
• Storage
• Performance
• Maintenance
• Ease-of-use
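Where an example may help: below is a minimal sketch of pre-joining in SQLite run from Python. The customer/sale tables, columns, and data are assumptions for illustration only, not from the handout; collapsing tables applies the same idea to tables in a one-to-one relationship.

```python
import sqlite3

# Hypothetical normalized source tables; all names and values are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Ali', 'Lahore'), (2, 'Sara', 'Karachi');
    INSERT INTO sale VALUES (10, 1, 500.0), (11, 1, 250.0), (12, 2, 900.0);
""")

# Pre-joining: materialize the join once, so later queries read a single
# de-normalized table instead of joining at query time (storage traded
# for performance and ease-of-use).
con.execute("""
    CREATE TABLE sale_denorm AS
    SELECT s.sale_id, s.amount, c.cust_id, c.name, c.city
    FROM sale s JOIN customer c ON c.cust_id = s.cust_id
""")

for row in con.execute("SELECT city, SUM(amount) FROM sale_denorm GROUP BY city"):
    print(row)   # e.g. ('Karachi', 900.0) and ('Lahore', 750.0)
```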
Fast: Delivers information to the user at a fairly constant rate i.e. O(1) time. Most queries
answered in less than 5 seconds.
Analysis: Performs basic numerical and statistical analysis of the data, pre-defined by an
application developer or defined ad hoc by the user.
Shared: Implements the security requirements necessary for sharing potentially confidential
data across a large user population.
Information: Accesses all the data and information necessary and relevant for the
application, wherever it may reside and not limited by volume.
Maintenance issue: Every data item received must be aggregated into every cube (assuming
“to-date” summaries are maintained), which is a lot of work.
Storage issue: As dimensions get less detailed (e.g., year vs. day), cubes get much smaller, but the
storage consequences of building hundreds of cubes can be significant, i.e., a lot of space.
Scalability: MOLAP often has difficulty scaling when the size of dimensions becomes large; the
breakpoint is typically around a dimension cardinality of 64,000.
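As a rough illustrative calculation (an assumption for scale, not a figure from the handout): if every subset of d dimensions is pre-aggregated, the cube yields 2^d summary cuboids, so 10 dimensions already give 2^10 = 1,024 cubes, and hierarchy levels within each dimension (day/month/year) multiply that count further.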
One logical cube of data can be spread across multiple physical cubes on separate (or the
same) servers.
The divide & conquer cube partitioning approach helps alleviate the scalability
limitations of MOLAP implementations.
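A minimal sketch of the divide & conquer idea, assuming a toy cube held as an in-memory dict and partitioned along the region dimension (regions standing in for separate cube servers); all names and numbers are illustrative.

```python
from collections import defaultdict

# Toy "logical cube": cells keyed by (region, month) -> sales.
cube = {("North", "Jan"): 100, ("North", "Feb"): 120,
        ("South", "Jan"): 80,  ("South", "Feb"): 90}

# Divide: spread the logical cube over physical partitions by region.
partitions = defaultdict(dict)
for (region, month), sales in cube.items():
    partitions[region][(region, month)] = sales

# Conquer: answer a query by asking each partition and combining results.
def total_for_month(month):
    return sum(sales
               for part in partitions.values()
               for (_, m), sales in part.items() if m == month)

print(total_for_month("Jan"))   # 180
```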
Used to query two dissimilar cubes by creating a third “virtual” cube through a join between the
two cubes.
Logically similar to a relational view i.e. linking two (or more) cubes along common dimension(s).
o Deployment of significantly larger dimension tables as compared to MOLAP, using secondary storage.
o Aggregate awareness allows some front-end tools to use pre-built summary tables.
o Star schema designs are usually used to facilitate ROLAP querying (covered in the next lecture).
Maintenance.
Non-standard hierarchy of dimensions.
Non-standard conventions.
Explosion of storage space requirement.
Aggregation pitfalls.
HOLAP (page 87)
- HOLAP (Hybrid OLAP) allows co-existence of pre-built MOLAP cubes alongside relational OLAP
(ROLAP) structures.
De-Normalization
Dimensional Modeling (DM)
(1) Distributive
(2) Algebraic
(3) Holistic
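A minimal sketch of the three classes on hypothetical partitioned data: SUM is distributive (combinable from partial sums), AVG is algebraic (derivable from a fixed-size summary such as sum and count), and MEDIAN is holistic (no constant-size partial summary is enough).

```python
import statistics

# Hypothetical data split across two partitions (e.g., two cube chunks).
part1, part2 = [3, 5, 7], [2, 8]

# Distributive: SUM of the whole equals the SUM of the partial SUMs.
total = sum([sum(part1), sum(part2)])

# Algebraic: AVG is computable from a fixed-size summary (sum, count)
# of each partition.
s, n = sum(part1) + sum(part2), len(part1) + len(part2)
average = s / n

# Holistic: MEDIAN cannot be derived from constant-size partial summaries;
# it needs the full data (or an approximation).
median = statistics.median(part1 + part2)

print(total, average, median)   # 25 5.0 5
```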
Transactional fact tables don’t have records for events that don’t occur.
Advantage:
Disadvantage: Lack of information.
Handling Multi-valued Dimensions? (page 110)
Simple to implement
No tracking of history
Logical Extraction (page 121)
  o Full Extraction
  o Incremental Extraction
Physical Extraction
  o Online Extraction
  o Offline Extraction
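A minimal sketch of incremental extraction, assuming a hypothetical source table that carries a last_updated column and a stored watermark from the previous run; all names are illustrative.

```python
import sqlite3

# Hypothetical source system with a change-tracking column.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_updated TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-02-01")])

def incremental_extract(conn, watermark):
    """Pull only the rows changed since the last run (the watermark)."""
    rows = conn.execute(
        "SELECT id, amount, last_updated FROM orders WHERE last_updated > ?",
        (watermark,)).fetchall()
    new_watermark = max((r[2] for r in rows), default=watermark)
    return rows, new_watermark

# A full extraction would read everything; the incremental run reads only
# the delta since the stored watermark.
delta, watermark = incremental_extract(src, "2024-01-15")
print(delta, watermark)   # [(2, 20.0, '2024-02-01')] 2024-02-01
```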
Legacy vs. OLTP
In offline extraction, data is NOT extracted directly from the source system; instead, it is staged
explicitly outside the original source system.
Basic tasks
I. Selection
II. Splitting/Joining
III. Conversion
IV. Summarization
V. Enrichment
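A minimal sketch that walks one hypothetical record through the five tasks; field names, code tables, and lookup values are assumptions for illustration.

```python
# Hypothetical extracted record; field names are illustrative only.
record = {"name": "Ali Khan", "gender": "1", "city": "LHR", "sales": "500"}

# I.  Selection: keep only the fields of interest.
selected = {k: record[k] for k in ("name", "gender", "city", "sales")}

# II. Splitting/Joining: split a combined field into its parts.
first_name, last_name = selected["name"].split(" ", 1)

# III. Conversion: standardize codes and data types.
gender = {"1": "M", "2": "F"}.get(selected["gender"], "U")
sales = float(selected["sales"])

# IV. Summarization: roll detail up to the grain being loaded (per city here).
summary = {selected["city"]: sales}

# V.  Enrichment: add derived/looked-up information not in the source.
city_name = {"LHR": "Lahore", "KHI": "Karachi"}.get(selected["city"], "Unknown")

print(first_name, last_name, gender, summary, city_name)
```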
Once we have transformed data, there are three primary loading strategies (a small sketch follows this list):
o Full data refresh with BLOCK INSERT or ‘block slamming’ into an empty table.
o Incremental data refresh with BLOCK INSERT or ‘block slamming’ into existing (populated) tables.
o Trickle/continuous feed with constant data collection and loading, using row-level insert and update operations.
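A minimal sketch in SQLite contrasting the set-based and row-level paths; a real warehouse would use its bulk loader utility for block slamming, so this only illustrates the idea.

```python
import sqlite3

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

# Block insert ("block slamming"): load a whole batch in one set-based call.
batch = [(1, 10.0), (2, 20.0), (3, 30.0)]
dwh.executemany("INSERT INTO sales VALUES (?, ?)", batch)

# Trickle/continuous feed: rows arrive one at a time and are applied with
# row-level insert/update operations (INSERT OR REPLACE stands in here).
def trickle(row_id, amount):
    dwh.execute("INSERT OR REPLACE INTO sales VALUES (?, ?)", (row_id, amount))

trickle(3, 35.0)   # update of an existing row
trickle(4, 40.0)   # brand new row
print(dwh.execute("SELECT * FROM sales ORDER BY id").fetchall())
```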
- Things would have been simpler in the presence of operational systems, but that is not always the case.
- Manual data collection and entry: nothing wrong with that, but it has the potential to introduce lots of problems.
- Data is never perfect. The cost of perfection is extremely high compared to its value.
“Some” Issues
We are talking not about weekly data, but about data spread over years.
Historical data sits on tapes, which are serial and very slow to mount, etc.
Lots of processing and I/O are needed to handle large data volumes effectively.
Efficient interconnect bandwidth is needed to transfer large amounts of data from legacy sources
to the DWH.
Fill-in forms, e.g., addresses.
- Inconsistent output, i.e., HTML tags which mark interesting fields might be different on different pages.
ETL:
Extract, Transform, Load in which data transformation takes place on a separate transformation server.
ELT:
Extract, Load, Transform in which data transformation takes place on the data warehouse server.
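A minimal sketch of the difference, using a trivial capitalization step as the “transformation”; here Python stands in for the transformation server and SQLite for the warehouse, and all names are illustrative.

```python
import sqlite3

raw = [("ali", "lahore"), ("sara", "karachi")]
dwh = sqlite3.connect(":memory:")

# ETL: transform on the way in (in Python, standing in for a separate
# transformation server), then load the cleaned rows.
dwh.execute("CREATE TABLE etl_customer (name TEXT, city TEXT)")
dwh.executemany("INSERT INTO etl_customer VALUES (?, ?)",
                [(n.title(), c.title()) for n, c in raw])

# ELT: load the raw rows first, then let the warehouse engine transform
# them with SQL inside the warehouse.
dwh.execute("CREATE TABLE stg_customer (name TEXT, city TEXT)")
dwh.executemany("INSERT INTO stg_customer VALUES (?, ?)", raw)
dwh.execute("""CREATE TABLE elt_customer AS
               SELECT UPPER(SUBSTR(name,1,1)) || SUBSTR(name,2) AS name,
                      UPPER(SUBSTR(city,1,1)) || SUBSTR(city,2) AS city
               FROM stg_customer""")

print(dwh.execute("SELECT * FROM etl_customer").fetchall())
print(dwh.execute("SELECT * FROM elt_customer").fetchall())
```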
> Decisions taken at the government level using wrong data result in undesirable outcomes.
> In direct mail marketing, sending letters to wrong addresses results in loss of money and a bad reputation.
Lexical Errors
Irregularities
Semantically Dirty Data
Coverage Anomalies
  o Missing Attributes
  o Missing Records
o Dropping records.
o “Manually” filling missing values.
o Using a global constant as filler.
o Using the attribute mean (or median) as filler.
o Using the most probable value as filler.
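A minimal sketch of the filler strategies on a hypothetical age attribute; the constant, mean, and most-frequent-value fillers shown here correspond to three of the options above.

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing attribute value.
ages = [25, 30, None, 30, 40]
known = [a for a in ages if a is not None]

# Using a global constant as filler.
filled_const = [a if a is not None else -1 for a in ages]

# Using the attribute mean as filler.
filled_mean = [a if a is not None else mean(known) for a in ages]

# Using the most probable (most frequent) value as filler.
filled_mode = [a if a is not None else mode(known) for a in ages]

print(filled_const, filled_mean, filled_mode)
```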
1) Statistical
2) Pattern Based
3) Clustering
4) Association Rules
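A minimal sketch of the statistical approach only (the first of the four), flagging values far from the mean as suspects; the two-standard-deviation threshold and the data are assumptions for illustration.

```python
from statistics import mean, stdev

# Hypothetical numeric column; one value is clearly anomalous.
amounts = [100, 102, 98, 105, 99, 101, 5000]

mu, sigma = mean(amounts), stdev(amounts)

# Statistical cleansing: flag values far from the mean (2-sigma rule here).
suspects = [x for x in amounts if abs(x - mu) > 2 * sigma]
print(suspects)   # [5000]
```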
Data coming from outside the organization owning the DWH can be of even lower quality,
i.e., different representations for the same entity, and transcription or typographical errors.
Step 1: Create Key - Compute a key for each record in the list by extracting relevant fields or
portions of fields.
Step 2: Sort - Sort the records in the data list using the key of Step 1.
Step 3: Merge - Move a fixed-size window through the sequential list of records, limiting the
comparisons for matching records to those records in the window. If the size of the window
is w records, then every new record entering the window is compared with the previous w-1 records.
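A minimal sketch of the three steps on hypothetical name/city records; the key rule (first three letters of the name plus the city) and the match test are simplifying assumptions standing in for a real equational theory.

```python
# Hypothetical records to de-duplicate.
records = [
    {"id": 1, "name": "Muhammad Ali", "city": "Lahore"},
    {"id": 2, "name": "Mohammad Ali", "city": "Lahore"},
    {"id": 3, "name": "Sara Khan",    "city": "Karachi"},
    {"id": 4, "name": "Sarah Khan",   "city": "Karachi"},
]

# Step 1: Create key - extract portions of relevant fields.
def make_key(r):
    return (r["name"][:3] + r["city"][:3]).upper()

# Step 2: Sort - order the whole list by that key so likely matches
# end up near each other.
records.sort(key=make_key)

# Step 3: Merge - slide a window of w records; each record entering the
# window is compared only with the previous w-1 records.
def similar(a, b):
    # Simplistic stand-in for an equational-theory rule set.
    return a["city"] == b["city"] and a["name"].split()[-1] == b["name"].split()[-1]

w, matches = 3, []
for i, rec in enumerate(records):
    for prev in records[max(0, i - (w - 1)):i]:
        if similar(prev, rec):
            matches.append((prev["id"], rec["id"]))

print(matches)   # [(2, 1), (3, 4)]
```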
Selection of Keys
Effectiveness is highly dependent on the key selected to sort the records (e.g., middle name vs. last name).
A key is a sequence of a subset of attributes, or sub-strings within the attributes, chosen from the
record.
The keys are used for sorting the entire dataset with the intention that matched candidates
will appear close to each other.
- Since the data is dirty, the keys WILL also be dirty, and matching records will not come together.
- The solution is to use external standard source files to validate the data and resolve any data conflicts.
BSN Method: Equational Theory (page 162)
Fields that appear first in the key have higher discriminating power than those appearing after them.
If the NID number is the first attribute in the key, then 81684854432 and 18684854432 are highly
likely to fall in windows that are far apart.
Law #4 - “Data quality problems increase with the age of the system!”
Law #5 – “The less likely something is to occur, the more traumatic it will be when it happens!”
Total Quality Management (TQM) (page 169)
The TQM approach advocates the involvement of all employees in the continuous improvement
process, with the ultimate goal being customer satisfaction.
Controllable Costs
Resultant Costs
Accuracy
Completeness
Consistency
Timeliness
Uniqueness
Interpretability
Ratios and Min-Max (the two metric types)
Simple Ratios:
  o Free-of-error
  o Completeness
  o Consistency
Min-Max:
  o Believability
  o Appropriate Amount of Data
  o Timeliness
  o Accessibility
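A minimal sketch of both metric types on hypothetical data: completeness as a simple ratio of non-missing values, and believability as a min operation over assumed indicator scores (all scaled 0 to 1).

```python
# Hypothetical column values; None marks a missing entry.
phone_numbers = ["0300-1234567", None, "0321-7654321", None]

# Simple ratio: 1 - (undesirable outcomes / total outcomes);
# here completeness = fraction of non-missing values.
completeness = 1 - phone_numbers.count(None) / len(phone_numbers)

# Min operation: rate a multi-faceted dimension conservatively by taking
# the minimum of its underlying indicators (assumed already scaled 0..1).
believability_indicators = {"source_reputation": 0.9,
                            "plausibility": 0.7,
                            "internal_consistency": 0.8}
believability = min(believability_indicators.values())

print(completeness, believability)   # 0.5 0.7
```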
The occurrences of each domain value within each coded attribute of the database.
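A small sketch of that profiling step, counting how often each domain value occurs in a hypothetical coded attribute so that unexpected codes stand out.

```python
from collections import Counter

# Hypothetical coded attribute (e.g., a gender code column).
gender_codes = ["M", "F", "M", "M", "X", "F", "M"]

# Occurrences of each domain value within the coded attribute; the
# unexpected code "X" is immediately visible.
print(Counter(gender_codes))   # Counter({'M': 4, 'F': 2, 'X': 1})
```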