Chapter 2
Data Warehousing in Business Intelligence
LEARNING OBJECTIVES
• Understand the basic definitions and concepts of data warehouses
• Understand data warehousing architectures
• Describe the processes used in developing and managing data warehouses
• Explain data warehousing operations
• Explain the role of data warehouses in decision support
• Explain data integration and the extraction, transformation, and load (ETL) processes
• Describe real-time (active) data warehousing
• Understand data warehouse administration and security issues
2.2 DATA WAREHOUSING PROCESS OVERVIEW

Organizations, private and public, continuously collect data, information, and knowledge at an increasingly accelerated rate and store them in computerized systems. Maintaining and using these data and information becomes extremely complex, especially as scalability issues arise. In addition, the number of users needing to access the information continues to increase as a result of improved reliability and availability of network access, especially the Internet. Working with multiple databases, whether or not they are integrated in a data warehouse, has become an extremely difficult task requiring considerable expertise, but it can provide immense benefits far exceeding its cost (see the opening vignette and Application Case 2.2).
Application Case 2.2
Data Warehousing Supports First American Corporation's Corporate Strategy

First American Corporation changed its corporate strategy from a traditional banking approach to one centered on CRM. This enabled First American to transform itself from a company that lost $60 million in 1990 to an innovative financial services leader a decade later. The successful implementation of this strategy would not have been possible without its VISION data warehouse, which stores information about customer behavior such as products used, buying preferences, and client-value positions. VISION provides:
• Identification of the top 20 percent of profitable customers
• Identification of the 40 to 50 percent of unprofitable customers
• Retention strategies
• Lower-cost distribution channels
• Strategies to expand customer relationships
• Redesigned information flows

Access to information through a data warehouse can enable both evolutionary and revolutionary change. First American achieved revolutionary change, moving itself into the "sweet 16" of financial services corporations.

Sources: Based on B. L. Cooper, H. J. Watson, B. H. Wixom, and D. L. Goodhue, "Data Warehousing Supports Corporate Strategy at First American Corporation," MIS Quarterly, Vol. 24, No. 4, 2000, pp. 547-567; and B. L. Cooper, H. J. Watson, B. H. Wixom, and D. L. Goodhue, "Data Warehousing Supports Corporate Strategy at First American Corporation," SIM International Conference, Atlanta, August 15-19, 1999.
Many organizations need to create data warehouses: massive data stores of time-series data for decision support. Data are imported from various external and internal resources and are cleansed and organized in a manner consistent with the organization's needs. After the data are populated in the data warehouse, data marts can be loaded for a specific area or department. Alternatively, data marts can be created first, as needed, and then integrated into an EDW. Often, though, data marts are not developed; instead, data are simply loaded onto PCs or left in their original state for direct manipulation using BI tools.

Figure 2.1 shows the data warehouse concept. The following are the major components of the data warehousing process:
• Data sources. Data are sourced from multiple independent operational "legacy" systems and possibly from external data providers (such as the U.S. Census). Data may also come from an online transaction processing (OLTP) or ERP system. Web data in the form of Web logs may also feed a data warehouse.
• Data extraction and transformation. Data are extracted and properly transformed using custom-written or commercial software called ETL.
• Data loading. Data are loaded into a staging area, where they are transformed and cleansed. The data are then ready to load into the data warehouse and/or data marts.
• Comprehensive database. Essentially, this is the EDW that supports all decision analysis by providing relevant summarized and detailed information originating from many different sources.
• Metadata. Metadata are maintained so that they can be assessed by IT personnel and users. Metadata include software programs about data and rules for organizing data summaries that are easy to index and search, especially with Web tools.
• Middleware tools. Middleware tools enable access to the data warehouse. Power users such as analysts may write their own SQL queries (a brief illustration follows this list). Others may employ a managed query environment, such as Business Objects, to access data. There are many front-end applications that business users can use to interact with data stored in the data repositories, including data mining, OLAP, reporting tools, and data visualization tools.
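As a concrete, if simplified, illustration of the middleware point above, the sketch below shows the kind of ad hoc SQL query a power user might issue against warehouse data. It is written in Python against an in-memory SQLite database purely as a stand-in; the sales_fact table and its columns are invented for this example and do not come from any particular product.

```python
import sqlite3

# A tiny in-memory stand-in for a warehouse fact table; names are invented
# for illustration. A real analyst would connect to Oracle, SQL Server, DB2,
# Teradata, etc. through the middleware layer instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, quarter TEXT, sales_amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("East", "2024-Q1", 1200.0), ("West", "2024-Q1", 950.0), ("East", "2024-Q2", 700.0)],
)

# A typical ad hoc power-user query: total sales by region for one quarter.
query = """
    SELECT region, SUM(sales_amount) AS total_sales
    FROM sales_fact
    WHERE quarter = '2024-Q1'
    GROUP BY region
    ORDER BY total_sales DESC
"""
for region, total_sales in conn.execute(query):
    print(f"{region}: {total_sales:,.2f}")
conn.close()
```

A managed query environment would generate comparable SQL behind a point-and-click interface; the underlying request sent to the warehouse is the same kind of summarizing query.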
SECTION 2.2 REVIEW QUESTIONS
1. Describe the data warehousing process.
2. Describe the major components of a data warehouse.
3. Identify the role of middleware tools.
2.3 DATA WAREHOUSING ARCHITECTURES

There are several basic information system architectures that can be used for data warehousing. Generally speaking, these architectures are commonly called client/server or n-tier architectures, of which two-tier and three-tier architectures are the most common (see Figures 2.2 and 2.3), but sometimes there is simply one tier. These multitiered architectures are known to be capable of serving the needs of large-scale, performance-demanding information systems such as data warehouses. Referring to the use of n-tiered architectures for data warehousing, Hoffer et al. (2007) distinguished among these architectures by dividing the data warehouse into three parts:
1. The data warehouse itself, which contains the data and associated software.
2. Data acquisition (back-end) software, which extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse.
3. Client (front-end) software, which allows users to access and analyze data from the warehouse (a DSS/BI/business analytics [BA] engine).

In a three-tier architecture, operational systems contain the data and the software for data acquisition in one tier (i.e., the server), the data warehouse is another tier, and the third tier includes the DSS/BI/BA engine (i.e., the application server) and the client (see Figure 2.2). Data from the warehouse are processed twice and deposited in an additional multidimensional database, organized for easy multidimensional analysis and presentation, or replicated in data marts. The advantage of the three-tier architecture is its separation of the functions of the data warehouse, which eliminates resource constraints and makes it possible to easily create data marts.
Several issues must be considered when deciding which architecture to use. Among them are the following:
• Which database management system (DBMS) should be used? Most data warehouses are built using relational database management systems (RDBMS). Oracle (Oracle Corporation, oracle.com), SQL Server (Microsoft Corporation, microsoft.com/sql/), and DB2 (IBM Corporation, ibm.com/software/data/db2/) are the ones most commonly used. Each of these products supports both client/server and Web-based architectures.
• Will parallel processing and/or partitioning be used? Parallel processing enables multiple CPUs to process data warehouse query requests simultaneously and provides scalability. Data warehouse designers need to decide whether the database tables will be partitioned (i.e., split into smaller tables) for access efficiency and what the criteria will be (a small partitioning sketch follows this list). This is an important consideration that is necessitated by the large amounts of data contained in a typical data warehouse. A survey on parallel and distributed data warehouses can be found in Furtado (2009). Teradata (teradata.com) has successfully adopted and often commented on its novel implementation of this approach.
• Will data migration tools be used to load the data warehouse? Moving data from an existing system into a data warehouse is a tedious and laborious task. Depending on the diversity and the location of the data assets, migration may be a relatively simple procedure or, on the contrary, a months-long project. The results of a thorough assessment of the existing data assets should be used to determine whether to use migration tools, and if so, what capabilities to seek in those commercial tools.
• What tools will be used to support data retrieval and analysis? Often it is necessary to use specialized tools to periodically locate, access, analyze, extract, transform, and load necessary data into a data warehouse. A decision has to be made on (i) developing the migration tools in-house, (ii) purchasing them from a third-party provider, or (iii) using the ones provided with the data warehouse system. Overly complex, real-time migrations warrant specialized third-party ETL tools.
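As referenced in the partitioning bullet above, the following is a conceptual sketch rather than production DDL: it mimics horizontal (range) partitioning by splitting a fact table into per-year tables behind a view. Real warehouse DBMSs declare partitions natively and prune them automatically at query time; the table names here are invented for illustration.

```python
import sqlite3

# Conceptual sketch of horizontal (range) partitioning: fact rows are split
# into per-year tables, and a view unions them back together so the table
# still looks like one logical fact table to users.
conn = sqlite3.connect(":memory:")
for year in (2023, 2024):
    conn.execute(f"CREATE TABLE sales_fact_{year} (order_date TEXT, amount REAL)")

conn.execute("INSERT INTO sales_fact_2023 VALUES ('2023-05-01', 100.0)")
conn.execute("INSERT INTO sales_fact_2024 VALUES ('2024-03-15', 250.0)")

conn.execute(
    "CREATE VIEW sales_fact AS "
    "SELECT * FROM sales_fact_2023 UNION ALL SELECT * FROM sales_fact_2024"
)

# A query restricted to 2024 only needs to touch the 2024 partition.
total_2024 = conn.execute("SELECT SUM(amount) FROM sales_fact_2024").fetchone()[0]
print("2024 total:", total_2024)
conn.close()
```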
Alternative Data Warehousing Architectures

At the highest level, data warehouse architecture design viewpoints can be categorized into enterprise-wide data warehouse (EDW) design and data mart (DM) design (Golfarelli and Rizzi, 2009). In Figure 2.5 (parts a-e), we show some alternatives to the basic architectural design types that are neither pure EDW nor pure DM, but in between or beyond the traditional architectural structures. Notable new ones include the hub-and-spoke and federated architectures.

The five architectures shown in Figure 2.5 (parts a-e) are proposed by Ariyachandra and Watson (2005, 2006a, and 2006b). Previously, in an extensive study, Sen and Sinha (2005) identified 15 different data warehousing methodologies. The sources of these methodologies are classified into three broad categories: core-technology vendors, infrastructure vendors, and information-modeling companies.
a. Independent data marts. This is arguably the simplest and the least costly architecture alternative. The data marts are developed to operate independently of each other to serve the needs of individual organizational units. Because of this independence, they may have inconsistent data definitions and different dimensions and measures, making it difficult to analyze data across the data marts (i.e., it is difficult, if not impossible, to get to the "one version of the truth").
b. Data mart bus architecture. This architecture is a viable alternative to the independent data marts, where the individual marts are linked to each other via some kind of middleware. Because the data are linked among the individual marts, there is a better chance of maintaining data consistency across the enterprise (at least at the metadata level). Even though it allows for complex data queries across data marts, the performance of these types of analyses may not be at a satisfactory level.
c. Hub-and-spoke architecture. This is perhaps the most famous data warehousing architecture today. Here the attention is focused on building a scalable and maintainable infrastructure (often developed in an iterative way, subject area by subject area) that includes a centralized data warehouse and several dependent data marts (each for an organizational unit). This architecture allows for easy customization of user interfaces and reports. On the negative side, this architecture lacks the holistic enterprise view and may lead to data redundancy and data latency.
d. Centralized data warehouse. The centralized data warehouse architecture is similar to the hub-and-spoke architecture except that there are no dependent data marts; instead, there is a gigantic enterprise data warehouse that serves the needs of all organizational units. This centralized approach provides users with access to all data in the data warehouse instead of limiting them to data marts. In addition, it reduces the amount of data the technical team has to transfer or change, therefore simplifying data management and administration. If designed and implemented properly, this architecture provides a timely and holistic view of the enterprise to whomever, whenever, and wherever they may be within the organization. The centralized data warehouse architecture, which is advocated mainly by Teradata Corp., advises using data warehouses without any data marts (see Figure 2.6).
e. Federated data warehouse. The federated approach is a concession to the natural forces that undermine the best plans for developing a perfect system. It uses all possible means to integrate analytical resources from multiple sources to meet changing needs or business conditions. Essentially, the federated approach involves integrating disparate systems. In a federated architecture, existing decision support structures are left in place, and data are accessed from those sources as needed. The federated approach is supported by middleware vendors that propose distributed query and join capabilities. These eXtensible Markup Language (XML)-based tools offer users a global view of distributed data sources, including data warehouses, data marts, Web sites, documents, and operational systems. When users choose query objects from this view and press the submit button, the tool automatically queries the distributed sources, joins the results, and presents them to the user. Because of performance and data quality issues, most experts agree that federated approaches work well to supplement data warehouses, not replace them (Eckerson, 2005).
Which Architecture Is the Best?

Ever since data warehousing became a critical part of modern enterprises, the question of which data warehouse architecture is the best has been a topic of regular discussion. The two gurus of the data warehousing field, Bill Inmon and Ralph Kimball, are at the heart of this discussion. Inmon advocates the hub-and-spoke architecture (e.g., the Corporate Information Factory), whereas Kimball promotes the data mart bus architecture with conformed dimensions. Other architectures are possible, but these two options are fundamentally different approaches.
A major purpose of a data warehouse is to integrate data from multiple systems. Various integration technologies enable data and metadata integration:
• Enterprise application integration (EAI)
• Service-oriented architecture (SOA)
• Enterprise information integration (EII)
• Extraction, transformation, and load (ETL)
Enterprise application integration (EAI) provides a vehicle for pushing data from source systems into the data warehouse. It involves integrating application functionality and is focused on sharing functionality (rather than data) across systems, thereby enabling flexibility and reuse. Traditionally, EAI solutions have focused on enabling application reuse at the application programming interface level. More recently, EAI has been accomplished by using SOA coarse-grained services (a collection of business processes or functions) that are well defined and documented. Using Web services is a specialized way of implementing an SOA. EAI can be used to facilitate data acquisition directly into a near-real-time data warehouse or to deliver decisions to the OLTP systems. There are many different approaches to and tools for EAI implementation.
Enterprise information integration (EII) is an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases. It is a mechanism for pulling data from source systems to satisfy a request for information. EII tools use predefined metadata to populate views that make integrated data appear relational to end users. XML may be the most important aspect of EII because XML allows data to be tagged either at creation time or later. These tags can be extended and modified to accommodate almost any area of knowledge (Kay, 2005).
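To illustrate the tagging idea, here is a tiny sketch using Python's standard xml.etree.ElementTree module. The element names (customer, segment, profitability) are invented for this example; the point is only that tags can be added or extended after the data are created.

```python
import xml.etree.ElementTree as ET

# Minimal sketch of tagging data with XML, as EII tools do when presenting
# integrated views. Element and attribute names are invented for illustration.
customer = ET.Element("customer", id="C-1001")
ET.SubElement(customer, "name").text = "Acme Retail"
ET.SubElement(customer, "segment").text = "Wholesale"

# Tags can be extended later without disturbing the existing ones, e.g. adding
# a profitability tag computed downstream.
ET.SubElement(customer, "profitability").text = "high"

print(ET.tostring(customer, encoding="unicode"))
```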
Physical data integration has conventionally been the main mechanism for creating an integrated view with data warehouses and data marts. With the advent of EII tools (Kay, 2005), new virtual data integration patterns are feasible. Manglik and Mehra (2005) discussed the benefits and constraints of new data integration patterns that can expand traditional physical methodologies to present a comprehensive view for the enterprise.

We next turn to the approach for loading data into the warehouse: ETL.
Extraction, Transformation, and Load (ETL)

At the heart of the technical side of the data warehousing process is extraction, transformation, and load (ETL). ETL technologies, which have existed for some time, are instrumental in the process and use of data warehouses. The ETL process is an integral component in any data-centric project. IT managers are often faced with challenges because the ETL process typically consumes 70 percent of the time in a data-centric project.

The ETL process consists of extraction (i.e., reading data from one or more databases), transformation (i.e., converting the extracted data from its previous form into the form it needs to be in so that it can be placed into a data warehouse or simply another database), and load (i.e., putting the data into the data warehouse). Transformation occurs by using rules or lookup tables or by combining the data with other data. The three database functions are integrated into one tool to pull data out of one or more databases and place them into another, consolidated database or a data warehouse.

ETL tools also transport data between sources and targets, document how data elements (e.g., metadata) change as they move between source and target, exchange metadata with other applications as needed, and administer all runtime processes and operations (e.g., scheduling, error management, audit logs, statistics). ETL is extremely important for data integration as well as for data warehousing. The purpose of the ETL process is to load the warehouse with integrated and cleansed data.
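The following is a minimal, self-contained sketch of the three ETL steps just described, using Python with pandas and SQLite as stand-ins for the source system, the transformation rules (including a lookup table), and the target warehouse. All table, column, and lookup names are invented for illustration; commercial ETL tools add the metadata management, scheduling, and error handling discussed above.

```python
import sqlite3
import pandas as pd

# Extraction: read raw order rows from an operational (OLTP) source.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INT, country_code TEXT, amount TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "US", "100.50"), (2, "DE", "80.00"), (3, "us", "42.25")],
)
raw = pd.read_sql_query("SELECT * FROM orders", source)

# Transformation: apply cleansing rules and a lookup table, as described above.
country_lookup = pd.DataFrame(
    {"country_code": ["US", "DE"], "country_name": ["United States", "Germany"]}
)
raw["country_code"] = raw["country_code"].str.upper()   # rule: standardize codes
raw["amount"] = raw["amount"].astype(float)             # rule: fix data types
clean = raw.merge(country_lookup, on="country_code", how="left")

# Load: write the conformed rows into the warehouse (or a staging area first).
warehouse = sqlite3.connect(":memory:")
clean.to_sql("orders_fact", warehouse, index=False)
print(pd.read_sql_query("SELECT * FROM orders_fact", warehouse))
```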
2.5 DATA WAREHOUSE DEVELOPMENT

A data warehousing project is a major undertaking for any organization. It is more complicated than a simple mainframe selection and implementation project because it comprises and influences many departments and many input and output interfaces, and it can be part of a CRM business strategy. A data warehouse provides several benefits that can be classified as direct and indirect. Direct benefits include the following:
• End users can perform extensive analysis in numerous ways.
• A consolidated view of corporate data (i.e., a single version of the truth) is possible.
• Better and more timely information is possible. A data warehouse permits information processing to be relieved from costly operational systems onto low-cost servers; therefore, many more end-user information requests can be processed more quickly.
• Enhanced system performance can result. A data warehouse frees production processing because some operational system reporting requirements are moved to DSS.
• Data access is simplified.
Data Warehouse Vendors

McCloskey (2002) cited six guidelines that need to be considered when developing a vendor list: financial strength, ERP linkages, qualified consultants, market share, industry experience, and established partnerships. Data can be obtained from trade shows and corporate Web sites, as well as by submitting requests for specific product information. Van den Hoven (1998) differentiated three types of data warehousing products. The first type handles functions such as locating, extracting, transforming, cleansing, transporting, and loading the data into the data warehouse. The second type is a data management tool: a database engine that stores and manages the data warehouse as well as the metadata. The third type is a data access tool that provides end users with access to analyze the data in the data warehouse. This may include query generators, visualization, EIS, OLAP, and data mining.
Data Warehouse Development Approaches

Many organizations need to create the data warehouses used for decision support. Two competing approaches are employed. The first approach is that of Bill Inmon, who is often called "the father of data warehousing." Inmon supports a top-down development approach that adapts traditional relational database tools to the development needs of an enterprise-wide data warehouse, also known as the EDW approach. The second approach is that of Ralph Kimball, who proposes a bottom-up approach that employs dimensional modeling, also known as the data mart approach. Knowing how these two models are alike and how they differ helps us understand the basic data warehouse concepts (Breslin, 2004). Table 2.3 compares the two approaches. We describe these approaches in detail next.
THE INMON MODEL: THE EDW APPROACH Inmon's approach emphasizes top-down development, employing established database development methodologies and tools, such as entity-relationship diagrams (ERD), and an adjustment of the spiral development approach. The EDW approach does not preclude the creation of data marts. The EDW is the ideal in this approach because it provides a consistent and comprehensive view of the enterprise. Murtaza (1998) presented a framework for developing EDW.
THE KIMBALL MODEL: THE DATA MART APPROACH Kimball's data mart strategy is a "plan big, build small" approach. A data mart is a subject-oriented or department-oriented data warehouse. It is a scaled-down version of a data warehouse that focuses on the requests of a specific department, such as marketing or sales. This model applies dimensional data modeling, which starts with tables. Kimball advocated a development methodology that entails a bottom-up approach, which in the case of data warehouses means building one data mart at a time.
WHICH MODEL IS BEST? There is no one-size-fits-all strategy for data warehousing. An enterprise's data warehousing strategy can evolve from a simple data mart to a complex data warehouse in response to user demands, the enterprise's business requirements, and the enterprise's maturity in managing its data resources. For many enterprises, a data mart is frequently a convenient first step to acquiring experience in constructing and managing a data warehouse while presenting business users with the benefits of better access to their data; in addition, a data mart commonly indicates the business value of data warehousing. Ultimately, obtaining an EDW is ideal (see Application Case 2.5). However, the development of individual data marts can often provide many benefits along the way toward developing an EDW, especially if the organization is unable or unwilling to invest in a large-scale project. Data marts can also demonstrate feasibility and success in providing benefits. This could potentially lead to an investment in an EDW.
Representation of Data in the Data Warehouse

A typical data warehouse structure is shown in Figure 2.1. Many variations of data warehouse architecture are possible (see Figure 2.5). No matter what the architecture is, the design of data representation in the data warehouse has always been based on the concept of dimensional modeling. Dimensional modeling is a retrieval-based system that supports high-volume query access. Representation and storage of data in a data warehouse should be designed in such a way that it not only accommodates but also boosts the processing of complex multidimensional queries. Often, the star schema and the snowflake schema are the means by which dimensional modeling is implemented in data warehouses.
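To make the dimensional-modeling idea concrete, the sketch below creates a minimal star schema, one fact table joined to two dimension tables, in an in-memory SQLite database from Python. The table and column names are invented for illustration; a snowflake schema would further normalize the dimension tables.

```python
import sqlite3

# Minimal star schema sketch: a central fact table keyed to two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_key   INTEGER PRIMARY KEY,
        full_date  TEXT,
        month      TEXT,
        year       INTEGER
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        category    TEXT
    );
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        units_sold  INTEGER,
        revenue     REAL
    );
""")

# A typical dimensional query joins the fact table to its dimensions.
query = """
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
"""
print(conn.execute(query).fetchall())   # empty until the tables are populated
conn.close()
```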
Analysis of Data in the Data Warehouse

Once the data are properly stored in a data warehouse, they can be used in various ways to support organizational decision making. OLAP is arguably the most commonly used data analysis technique in data warehouses, and it has been growing in popularity due to the exponential increase in data volumes and the recognition of the business value of data-driven analytics. Simply put, OLAP is an approach to quickly answering ad hoc questions by executing multidimensional analytical queries against the data stored in a data warehouse or data mart.
OLAP versus OLTP

OLTP is a term used for a transaction system that is primarily responsible for capturing and storing data related to day-to-day business functions such as ERP, CRM, SCM, POS, and so on. An OLTP system addresses a critical business need, automating daily business transactions and running real-time reports and routine analysis. But these systems are not designed for ad hoc analysis and complex queries that deal with a large number of data items. OLAP, on the other hand, is designed to address this need by providing ad hoc analysis of organizational data much more effectively and efficiently. OLAP and OLTP rely heavily on each other: OLAP uses the data captured by OLTP, and OLTP automates the business processes that are managed by decisions supported by OLAP.
• Slice: A slice is a subset of a multidimensional array (usually a two-dimensional representation) corresponding to a single value set for one (or more) of the dimensions not in the subset. A simple slicing operation on a three-dimensional cube is shown in Figure 2.9.
• Dice: The dice operation is a slice on more than two dimensions of a data cube.
• Drill Down/Up: Drilling down or up is a specific OLAP technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down).
• Roll Up: A roll up involves computing all of the data relationships for one or more dimensions. To do this, a computational relationship or formula might be defined.
• Pivot: A pivot is used to change the dimensional orientation of a report or an ad hoc query-page display.
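A rough way to see these operations in code is to treat a small flat table as a cube with three dimensions (product, region, quarter) and apply slice, dice, roll-up, and pivot with pandas. This is only an analogy to what an OLAP engine does internally; the data and names are invented for illustration.

```python
import pandas as pd

# A tiny three-dimensional "cube" (product x region x quarter) held as a flat table.
cube = pd.DataFrame({
    "product": ["TV", "TV", "Radio", "Radio", "TV", "Radio"],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "quarter": ["Q1", "Q1", "Q1", "Q1", "Q2", "Q2"],
    "sales":   [100, 80, 30, 25, 120, 40],
})

# Slice: fix one dimension (quarter = Q1).
q1_slice = cube[cube["quarter"] == "Q1"]

# Dice: restrict more than one dimension at once.
dice = cube[(cube["quarter"] == "Q1") & (cube["region"].isin(["East"]))]

# Roll up / drill down: aggregate to a coarser level (per product),
# or go back to the detailed rows for the most granular view.
rollup = cube.groupby("product")["sales"].sum()

# Pivot: change the orientation of the report (regions become columns).
pivot = cube.pivot_table(values="sales", index="product",
                         columns="region", aggfunc="sum")

print(q1_slice, dice, rollup, pivot, sep="\n\n")
```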
VARIATIONS OF OLAP OLAP has a few variations; among them, ROLAP, MOLAP, and HOLAP are the most common.

ROLAP stands for Relational Online Analytical Processing. ROLAP is an alternative to the MOLAP (Multidimensional OLAP) technology. While both ROLAP and MOLAP analytic tools are designed to allow analysis of data through the use of a multidimensional data model, ROLAP differs significantly in that it does not require the pre-computation and storage of information. Instead, ROLAP tools access the data in a relational database and generate SQL queries to calculate information at the appropriate level when an end user requests it. With ROLAP, it is possible to create additional database tables (summary tables or aggregations) that summarize the data at any desired combination of dimensions. While ROLAP uses a relational database source, generally the database must be carefully designed for ROLAP use.
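A sketch of the ROLAP idea appears below: a multidimensional request ("revenue by year and product category") is answered by SQL generated at query time against the relational star schema, and the same query can optionally be materialized as a summary (aggregate) table. The schema follows the earlier star schema sketch and is illustrative only, not the output of any particular ROLAP product.

```python
import sqlite3

# What a ROLAP engine effectively does: translate a multidimensional request
# into SQL against the relational star schema at query time, optionally
# materializing an aggregate table for frequently used dimension combinations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, revenue REAL);
    INSERT INTO dim_date VALUES (1, 2023), (2, 2024);
    INSERT INTO dim_product VALUES (10, 'Audio'), (20, 'Video');
    INSERT INTO fact_sales VALUES (1, 10, 50.0), (2, 10, 75.0), (2, 20, 200.0);
""")

# SQL generated on the fly for the requested level of aggregation.
generated_sql = """
    SELECT d.year, p.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
"""
print(conn.execute(generated_sql).fetchall())

# Optional summary (aggregation) table that can be reused instead of
# re-scanning the detailed fact rows on every request.
conn.execute("CREATE TABLE agg_year_category AS " + generated_sql)
conn.close()
```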
2.6 DATA WAREHOUSING IMPLEMENTATION ISSUES

Implementing a data warehouse is generally a massive effort that must be planned and executed according to established methods. However, the project lifecycle has many facets, and no single person can be an expert in each area. Here we discuss specific ideas and issues as they relate to data warehousing. Inmon (2006) provided a set of actions that a data warehouse systems programmer may use to tune a data warehouse.

Reeves (2009) and Solomon (2005) provided some guidelines regarding the critical questions that must be asked, some risks that should be weighed, and some processes that can be followed to help ensure a successful data warehouse implementation. They compiled a list of 11 major tasks that could be performed in parallel:
1. Establishment of service-level agreements and data-refresh requirements
2. Identification of data sources and their governance policies
3. Data quality planning
4. Data model design
5. ETL tool selection
6. Relational database software and platform selection
7. Data transport
8. Data conversion
9. Reconciliation process
10. Purge and archive planning
11. End-user support

bich-2.ngjfyjdkzxzkckzxzkxzkxkgxjgyityutxjgyutxppt

  • 1.
    Chapter -2 Data warehousingin Business Intelligence
  • 2.
    • LEARNING OBJECTIVES •• Understand the basic definitions and • concepts of data warehouses • • Understand data warehousing • architectures • • Describe the processes used in • developing and managing data • warehouses • • Explain data warehousing operations • • Explain the role of data warehouses in • decision support • • Explain data integration and the • extraction, transformation, and load • (ETL) processes • 11 Describe real~time (active) data • warehousing • • Understand data warehouse • administration and security issues
  • 3.
    2.2 DATA WAREHOUSINGPROCESS OVERVIEW • Organizations, private and public, continuously collect data, information, and knowledge at • an increasingly accelerated rate and store them in computerized systems. Maintaining and • using these data and information becomes extremely complex, especially as scalability issues • arise. In addition, the number of users needing to access the information continues to • increase as a result of improved reliability and availabi lity of network access, especially the • Internet. Working with multiple databases, either integrated in a data warehouse or not, has • become an e>..1:remely difficult task requiring considerable expertise, but it can provide immense • benefits far exceeding its cost (see the opening vignette and Application Case 2.2).
  • 4.
    Application Case 2.2 DataWarehousing Supports First American Corporation's Corporate Strategy • First American Corporation changed its corporate strategy from a traditional banking approach to one that was centered on CRM. This enabled First • American to transform itself from a company that lost $60 million in 1990 to an innovative financial services leader a decade later. The successful implementation • of this strategy would not have been possible without its VISION data warehouse, which stores information about customer behavior such as products used, buying preferences, and client-value positions. VISION provides: • Identification of the top 20 percent of profitable customers • Identification of the 40 to 50 percent of unprofitable customers
  • 5.
    • Retention strategies •Lower-cost distribution channels change, moving itself into the "sweet 16" of financial • services corporations. • • Strategies to expand customer re la tionships • • Redesigned information flows • Access to information through a data warehouse • can enable both evolutionary and revolutionary • change. First American achieved revolutionary • Sources: Based on B. L. Cooper, H. J. Vatson, B. H. Wi,xom, and • D. L. Goodhue, "Data Warehousing Supports Corporate Strategy • at First American Corporation,"M/S Quarterly, Vol. 24, No. 4, • 2000, pp. 547-567; and B. L. Cooper, H.). Watson, B. H. Wixom, • and D. L. Goodhue, "Data Warehousing Supports Corporate • Strategy at First American Corporation," SIM International • Conference, Atlanta, August 15-19, 1999.
  • 6.
    • Many organizationsneed to create data warehouses-massive data stores of timeseries • data for decision support. Data are imported from various external and internal • resources and are cleansed and organized in a manner consistent with the organization's • needs. After the data are populated in the data warehouse, data marts can be loaded for a • specific area or department. Alternatively, data marts can be created first, as needed, and • then integrated into an EDW. Often, though, data marts are not developed, but data are • simply loaded onto PCs or left in their original state for direct manipulation using BI tools. • In Figure 2.1, we show the data warehouse concept. The following are the major • components of the data warehousing process: • • Data sources. Data are sourced from multiple independent operational "legacy" • systems and possibly from external data providers (such as the U.S. Census). Data • may also come from an online transaction processing (OLTP) or ERP system. Web • data in the form of Web logs may also feed a data warehouse. • • Data extraction and tra11sfor111atio11. Data are extracted and properly transformed • using custom-written or commercial software called ETL. • • Data loading. Data are loaded into a staging area, where they are transformed and • cleansed. The data are then ready to load into the data warehouse and/or data marts. • • Compreheusive database. Essentially, this is the EDW to support all decision • analysis by providing relevant summarized and detailed information originating • from many different sources.
  • 8.
    • • Metadata.Metadata are maintained so that they can be assessed by IT personnel • and users. Metadata include soft-vvare programs about data and rules for organizing • data summaries that are easy to index and search, especially with Web tools. • • Middleware tools. Middleware tools enable access to the data warehouse. Power • users such as analysts may write their own SQL queries. Others may employ a managed • query environment, such as Business Objects, to access data. There are many front-end • applications that business users can use to interact with data stored in the data repositories, • including data mining, OLAP, repo1ting tools, and data visualization tools.
  • 9.
    • SECTION 2.2REVIEW QUESTIONS • 1. Describe the data warehousing process. • 2. Describe the major components of a data warehouse. • 3. Identify the role of middleware tools.
  • 10.
    • 2.3 DATAWAREHOUSING ARCHITECTURES • There are several basic information system architectures that can be used for data warehousing. • Generally speaking, these architectures are commonly called client/ server or • n-tier architectures, of which two-tier and three-tier architectures are the most common • (see Figures 2.2 and 2.3), but sometimes there is simply one tier. These types of multitiered • architectures are known to be capable of serving the needs of large-scale, • performance-demanding information systems such as data warehouses. Referring to the • use of n-tiered architectures for data warehousing, Hoffer et al. (2007) distinguished • among these architectures by dividing the data warehouse into three parts: 1. The data warehouse itself, which contains the data and associated software. 2. Data acquisition (back-end) softvvare, which extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse. 3. Client (front-end) software, which allows users to access and analyze data from the warehouse (a DSS/ Bl/ business analytics [BA] engine) In a three-tier architecture, operational systems contain the data and the software for data acquisition in one tier (i .e. , the server), the data warehouse is another tier, and the third tier includes the DSS/ Bl/BA engine (i.e. , the application server) and the client (see Figure 2.2). Data from the warehouse are processed twice and deposited in an additional multidimensional database, organized for easy multidimensional analysis and presentation, or replicated in data marts. The advantage of the three-tier architecture is its separation of the functions of the data warehouse, which eliminates resource constraints and makes it possible to easily create data marts.
  • 13.
    • 2.3 DATAWAREHOUSING ARCHITECTURES • There are several basic information system architectures that can be used for data warehousing. • Generally speaking, these architectures are commonly called client/ server or • n-tier architectures, of which two-tier and three-tier architectures are the most common • (see Figures 2.2 and 2.3), but sometimes there is simply one tier. These types of multitiered • architectures are known to be capable of serving the needs of large-scale, • performance-demanding information systems such as data warehouses. Referring to the • use of n-tiered architectures for data warehousing, Hoffer et al. (2007) distinguished • among these architectures by dividing the data warehouse into three parts: • 1. The data warehouse itself, which conta ins the data and associated software. • 2. Data acquisition (back-end) softvvare, which extracts data from legacy systems and • external sources, consolidates and summarizes them, and loads them into the data • warehouse. • 3. Client (front-end) software, which allows users to access and analyze data from the • warehouse (a DSS/ Bl/ business analytics [BA] engine) • In a three-tier architecture, operational systems contain the data and the software • for data acquisition in one tier (i .e. , the server), the data warehouse is another tier, and • the third tier includes the DSS/ Bl/BA engine (i.e. , the application server) and the client • (see Figure 2.2). Data from the warehouse are processed twice and deposited in an additional • multidimensional database, organized for easy multidimensional analysis • and presentation, or replicated in data marts. The advantage of the three-tier architecture • is its separation of the functions of the data warehouse, which eliminates resource constraints • and makes it possible to easily create data marts.
  • 16.
    • Several issuesmust be considered when deciding which architecture to use. Among • them are the following: • • Which database manageme11t system (DBMS) should be used? Most data • warehouses are built using relational database management systems (RDBMS). • Oracle (Oracle Corporation, oracle.com), SQL Server (Microsoft Corporation, • microsoft.com/sql/), and DB2 (IBM Corporation, 306.ibm.com/software/data/ • db2/) are the ones most commonly used. Each of these products supports both • client/ server and Web-based architectures. • • parallel processi11g a11d/or partitioning be used? Parallel processing • enables multiple CPUs to process data warehouse query requests simultaneously and • provides scalability. Data warehouse designers need to decide whether the database • tables will be partitioned (i.e., split into smaller tables) for access efficiency and what • the criteria will be. This is an important consideration that is necessitated by the large • amounts of data contained in a typical data warehouse. A recent su1vey on parallel • and distributed data warehouses can be found in Furtado (2009). Teradata (teradata. • com) has successfully adopted and often commented on its novel implementation of • this approach. • • data migration tools be used to load the data wa·rehouse? Moving • data from an existing system into a data warehouse is a tedious and laborious • task. Depending on the diversity and the location of the data assets, migration • may be a relatively simple procedure or (in contrary) a months-long project. The • results of a thorough assessment of the existing data assets should be used to determine • whether to use migration tools, and if so, what capabilities to seek in • those commercial tools. • • bat tools will be used to supp01·t data retrieval and analysis? Often it is • necessary to use specialized tools to periodically locate, access, analyze, extract, • transform, and load necessary data into a data warehouse. A decision has to be • made on (i) developing the migration tools in-house, (ii) purchasing them from a • third-party provider, or (iii) using the ones provided with the data warehouse system. • Overly complex, real-time migrations warrant specialized third-party ETL tools
  • 17.
    • Alternative DataWarehousing Architectures • At the highest level, data warehouse architecture design viewpoints can be categorized into • enterprise-wide data warehouse (EDW) design and data mart (DM) design (Golfarelli and • Rizzi, 2009). In Figure 2.5 (parts a-e), we show some alternatives to the basic architectural • design types that are neither pure EDW nor pure DM, but in between or beyond the traditional • architectural structures. Notable new ones include hub-and-spoke and federated
  • 19.
    • The fivearchitectures shown in Figure 2.5 (pa rts a-e) are proposed by • Ariyacbandra and Watson (2005, 2006a, and 2006b). Previously, in an e:>-.1:ensive study, Sen • and Sinha (2005) identified 15 different data warehousing methodologies. The sources of • these methodologies are classified into three broad categories: core-technology vendors, infrastructure • vendors, and information-modeling companies. • a. independent data marts. This is arguably the simplest and the least costly architecture • alternative. The data marts are developed to operate independently of each • other to serve for the needs of individual organizational units. Because of the independence, • they may have inconsistent data definitions and different dimensions and • measures, making it difficult to analyze data across the data marts (i.e., it is difficult, • if not impossible, to get to the "one version of the truth"). • b. Data mart bus architecture. This architecture is a viable alternative to the independent • data marts where the individual marts are linked to each other via • some kind of middleware. Because the data are linked among the individual • marts, there is a better chance of maintaining data consistency across the enterprise • (at least at the metadata level). Even though it allows for complex data • queries across data marts, the performance of these types of analysis may not be • at a satisfactory level.
  • 20.
    • c. Hub-and-spokearchitecture. This is perhaps the most famous data warehousing • architecture today. Here the attention is focused on building a scalable and maintainable infrastructure (often developed in an iterative way, subject area by subject area) that includes a centralized data warehouse and several dependent data marts (each for an organizational unit). This architecture allows for easy and customization of user interfaces and reports. On the negative side, this architecture lacks the holistic enterprise view, and may lead to data redundancy and data latency.
  • 21.
    • d. Centralizeddata warehouse. The centralized data warehouse architecture is • similar to the hub-and-spoke architecture except that there are no dependent data • marts; instead, there is a gigantic enterprise data warehouse that serves for the • needs of all organizational units. This centra lized approach provides users with • access to all data in the data warehouse instead of limiting them to data marts. In • addition , it reduces the amount of data the technical team has to transfer or • change, therefore simplifying data management and administra tion. If designed • and implemented properly, this architecture provides a timely and holistic view of • the enterprise to whomever, whenever, and wherever they may be within the organization. • The central data warehouses architecture, which is advocated mainly • by Teradata Corp., advises using data warehouses without any data marts (see • Figure 2.6).
  • 22.
    • e .Federated data warehouse. The federated approach is a concession to the natural forces that undermine the best plans for developing a perfect system. It uses all possible means to integrate analytical resources from multiple sources to meet changing needs or business conditions. Essentially, the federated approach involves integrating disparate systems. In a federated architecture, existing decision support structures are left in place, and data are accessed from those sources as needed. The federated approach is supported by middleware vendors that propose distributed query and join capabilities. These eXtensible Markup Language (X!VIL)-based tools offer users a global view of distributed data sources, including data warehouses, data marts, Web sites, documents, and operational systems. When users choose query objects from this view and press the submit button, the tool automatically queries the distributed sources, joins the results, and presents them to the user. Because of performance and data quality issues, most experts agree that federated approaches work well to supplement data warehouses, not replace them (Eckerson, 2005).
  • 24.
    • Which ArchitectureIs the Best? • Ever since data warehousing became a critical part of modern enterprises, the question of which data warehouse architecture is the best has been a topic of regular discussion. The two gurus of the data warehousing field, Bill Inmon and Ralph Kimball, are at the heart
  • 25.
    • of thisdiscussion. Inmon advocates the hub- and-spoke architecture (e.g., the Corporate Information Factory), whereas Kimball promotes the data mart bus architecture with conformed dimensions. Other architectures are possible, but these two options are fundamentally different approaches.
  • 27.
    • A majorpurpose of a data warehouse is to integrate data from multiple systems. • Various integration technologies enable data and metadata integration: • • Enterprise application integration (EAI) • • Service-oriented architecture (SOA) • • Enterprise information integration (Ell) • • Extraction, transformation, and load (ETL)
  • 28.
    • Enterprise applicationintegration (EAi) provides a vehicle for pushing data from source systems into the data warehouse. It involves integrating application functionality and is focused on sharing functionality (rather than data) across systems, thereby enabling flexibility and reuse. Traditionally, EAI solutions have focused on enabling application reuse at the application programming interface level. Recently, EAI is accomplished by using SOA coarse-grained services (a collection of business processes or functions) that are well defined and documented. Using Web services is a specialized way of implementing an SOA. EAI can be used to facilitate data acquisition directly into a near real- time data warehouse or to deliver decisions to the OLTP systems. There are many different approaches to and tools for EAI implementation.
  • 29.
    Enterp1·ise information integration(Ell) is an evolving tool space that promises real-time data integration from a variety of sources such as relational databases, Web se1vices, and multidimensional databases. It is a mechanism for pulling data from source systems to satisfy a request for information. EU tools use predefined metadata to populate views that make integrated data appear relational to encl users. XML may be the most important aspect of Ell because XML allows data to be tagged either at creation time or later. • These tags can be extended and modified to accommodate almost any area of knowledge • (Kay, 2005). • Physical data integration has conventionally been the main mechanism for creating an integrated view with data warehouses and data marts. • With the advent of Ell tools (Kay, 2005), new virtual data integration patterns are feasible. Manglik and Mehra (2005) discussed the benefits and constraints of new data integration patterns that can expand traditional physical methodologies to present a comprehensive view for the enterprise. • We next turn to the approach for loading data into the warehouse: ETL.
  • 30.
    • Extraction, Transformation,and Load (ETL} • At the heart of the technical side of the data warehousing process is extraction, transformation, • and load (ETL). ETL technologies, which have existed for some time, are instn_1- • mental in the process and use of data warehouses. The ETL process is an integral component • in any data-centric project. IT managers are often faced with challenges because the ETL • process typically consumes 70 percent of the time in a data-centric project. • The ETL process consists of extraction (i.e., reading data from one or more databases), • transformation (i.e ., converting the extracted data from its previous form into the • form in which it needs to be so that it can be placed into a data warehouse or simply another • database), and load (i.e., putting the data into the data warehouse). Transformation • occurs by using rules or lookup tables or by combining the data with other data. The • three database functions are integrated into one tool to pull data out of one or more databases • and place them into another, consolidated database or a data warehouse. • ETL tools also transport data between sources and targets, document how data • elements (e.g., metaclata) change as they move between source and target, exchange metadata • with other applications as needed, and administer all runtin1e processes and operations • (e.g., scheduling, error management, audit logs, statistics). ETL is extremely impo1tant for • data integration as well as for data warehousing. The purpose of the ETL process is to load
  • 33.
    • A majorpurpose of a data warehouse is to integrate data from multiple systems. • Various integration technologies enable data and metaclata integration: • • Enterprise application integration (EAI) • • Service-oriented architecture (SOA) • • Enterprise information integration (Ell) • • Extraction, transformation, and load (ETL)
  • 34.
    • Enterprise applicationintegration (EAi) provides a vehicle for pushing data • from source systems into the data warehouse. It involves integrating application functionality and is focused on sharing functionality (rather than data) across systems, there by enabling flexibility and reuse. Traditionally, EAI solutions have focused on enabling application reuse at the application programming interface level. • Recently, EAI is accomplished by using SOA coarse-grained services (a collection of business processes or functions) that are well defined and documented. Using Web services is a specialized way of implementing an SOA. EAI can be used to facilitate data acquisition directly into a near real-time data warehouse or to deliver decisions to the OLTP systems. There are many different approaches to and tools for EAI implementation.
  • 35.
    • Enterp1·ise informationintegration (Ell) • is an evolving tool space that promises real-time data integration from a variety of sources such as relational databases, Web se1vices, and multidimensional databases. It is a mechanism for pulling data from source systems to satisfy a request for information. EU tools use predefined metadata to populate views that make integrated data appear relational to encl users. XML may be the most important aspect of Ell because XML allows data to be tagged either at creation time or later. • These tags can be extended and modified to accommodate almost any area of knowledge (Kay, 2005). • Physical data integration has conventionally been the main mechanism for creating an integrated view with data warehouses and data marts. With the advent of Ell tools (Kay, 2005), new virtual data integration patterns are feasible. Manglik and Mehra (2005) discussed the benefits and constraints of new data integration patterns that can expand traditional physical methodologies to present a comprehensive view for the enterprise. • We next turn to the approach for loading data into the warehouse: ETL.
  • 36.
    • Extraction, Transformation,and Load (ETL} At the heart of the technical side of the data warehousing process is extraction, transformation, and load (ETL). ETL technologies, which have existed for some time, are instn_1- mental in the process and use of data warehouses. The ETL process is an integral component in any data-centric project. • IT managers are often faced with challenges because the ETL process typically consumes 70 percent of the time in a data-centric project. • The ETL process consists of extraction (i.e., reading data from one or more databases), transformation (i.e ., converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database), and load (i.e., putting the data into the data warehouse). Transformation occurs by using rules or lookup tables or by combining the data with other data. The three database functions are integrated into one tool to pull data out of one or more databases and place them into another, consolidated database or a data warehouse.
2.5 DATA WAREHOUSE DEVELOPMENT

A data warehousing project is a major undertaking for any organization. It is more complicated than a simple, mainframe selection and implementation project because it comprises and influences many departments and many input and output interfaces, and it can be part of a CRM business strategy. A data warehouse provides several benefits that can be classified as direct and indirect. Direct benefits include the following:

• End users can perform extensive analysis in numerous ways.
• A consolidated view of corporate data (i.e., a single version of the truth) is possible.
• Better and more-timely information is possible. A data warehouse permits information processing to be relieved from costly operational systems onto low-cost servers; therefore, many more end-user information requests can be processed more quickly.
• Enhanced system performance can result. A data warehouse frees production processing because some operational system reporting requirements are moved to DSS.
• Data access is simplified.
Data Warehouse Vendors

McCloskey (2002) cited six guidelines that need to be considered when developing a vendor list: financial strength, ERP linkages, qualified consultants, market share, industry experience, and established partnerships. Data can be obtained from trade shows and corporate Web sites, as well as by submitting requests for specific product information. Van den Hoven (1998) differentiated three types of data warehousing products. The first type handles functions such as locating, extracting, transforming, cleansing, transporting, and loading the data into the data warehouse. The second type is a data management tool: a database engine that stores and manages the data warehouse as well as the metadata. The third type is a data access tool that provides end users with access to analyze the data in the data warehouse. This may include query generators, visualization, EIS, OLAP, and data mining.
Data Warehouse Development Approaches

Many organizations need to create the data warehouses used for decision support. Two competing approaches are employed. The first approach is that of Bill Inmon, who is often called "the father of data warehousing." Inmon supports a top-down development approach that adapts traditional relational database tools to the development needs of an enterprise-wide data warehouse, also known as the EDW approach. The second approach is that of Ralph Kimball, who proposes a bottom-up approach that employs dimensional modeling, also known as the data mart approach. Knowing how these two models are alike and how they differ helps us understand the basic data warehouse concepts (Breslin, 2004). Table 2.3 compares the two approaches. We describe these approaches in detail next.
THE INMON MODEL: THE EDW APPROACH Inmon's approach emphasizes top-down development, employing established database development methodologies and tools, such as entity-relationship diagrams (ERD), and an adjustment of the spiral development approach. The EDW approach does not preclude the creation of data marts. The EDW is the ideal in this approach because it provides a consistent and comprehensive view of the enterprise. Murtaza (1998) presented a framework for developing EDWs.
THE KIMBALL MODEL: THE DATA MART APPROACH Kimball's data mart strategy is a "plan big, build small" approach. A data mart is a subject-oriented or department-oriented data warehouse. It is a scaled-down version of a data warehouse that focuses on the requests of a specific department, such as marketing or sales. This model applies dimensional data modeling, which starts with tables. Kimball advocated a development methodology that entails a bottom-up approach, which in the case of data warehouses means building one data mart at a time.
WHICH MODEL IS BEST? There is no one-size-fits-all strategy for data warehousing. An enterprise's data warehousing strategy can evolve from a simple data mart to a complex data warehouse in response to user demands, the enterprise's business requirements, and the enterprise's maturity in managing its data resources. For many enterprises, a data mart is frequently a convenient first step to acquiring experience in constructing and managing a data warehouse while presenting business users with the benefits of better access to their data; in addition, a data mart commonly demonstrates the business value of data warehousing. Ultimately, obtaining an EDW is ideal (see Application Case 2.5). However, the development of individual data marts can often provide many benefits along the way toward developing an EDW, especially if the organization is unable or unwilling to invest in a large-scale project. Data marts can also demonstrate feasibility and success in providing benefits, which could potentially lead to an investment in an EDW.
Representation of Data in Data Warehouse

A typical data warehouse structure is shown in Figure 2.1, and many variations of data warehouse architecture are possible (see Figure 2.5). No matter what the architecture is, the design of data representation in the data warehouse is based on the concept of dimensional modeling. Dimensional modeling is a retrieval-based system that supports high-volume query access. Representation and storage of data in a data warehouse should be designed in a way that not only accommodates but also boosts the processing of complex multidimensional queries. Often, the star schema and the snowflake schema are the means by which dimensional modeling is implemented in data warehouses.
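As a rough sketch of what a star schema looks like in practice, the following SQLite example builds one fact table surrounded by three dimension tables and runs a typical star-join query. The table and column names are hypothetical, chosen only to illustrate the pattern.

```python
# Star schema sketch: a central fact table joined to dimension tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, year INT, quarter INT, month INT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category TEXT, brand TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, region TEXT, city TEXT);
CREATE TABLE sales_fact  (date_key INT, product_key INT, store_key INT,
                          units_sold INT, revenue REAL);
""")

# A multidimensional question: revenue by region and quarter for one category.
query = """
SELECT s.region, d.quarter, SUM(f.revenue) AS revenue
FROM sales_fact f
JOIN date_dim    d ON f.date_key    = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
JOIN store_dim   s ON f.store_key   = s.store_key
WHERE p.category = 'Electronics'
GROUP BY s.region, d.quarter
ORDER BY s.region, d.quarter;
"""
print(con.execute(query).fetchall())   # empty until the fact table is loaded
```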
Analysis of Data in Data Warehouse

Once the data are properly stored in a data warehouse, they can be used in various ways to support organizational decision making. OLAP is arguably the most commonly used data analysis technique in data warehouses, and it has been growing in popularity due to the exponential increase in data volumes and the recognition of the business value of data-driven analytics. Simply, OLAP is an approach to quickly answering ad hoc questions by executing multidimensional analytical queries against organizational data repositories (i.e., data warehouses and data marts).
OLAP versus OLTP

OLTP (online transaction processing) is a term used for transaction systems that are primarily responsible for capturing and storing data related to day-to-day business functions, such as ERP, CRM, SCM, POS, and so on. OLTP systems address a critical business need, automating daily business transactions and running real-time reports and routine analysis. However, these systems are not designed for ad hoc analysis and complex queries that deal with a large number of data items. OLAP, on the other hand, is designed to address this need by providing ad hoc analysis of organizational data much more effectively and efficiently. OLAP and OLTP rely heavily on each other: OLAP uses the data captured by OLTP, and OLTP automates the business processes that are managed by decisions supported by OLAP.
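The contrast can be sketched in a few lines, under an assumed, hypothetical orders table: an OLTP statement touches a handful of rows inside a business transaction, whereas an OLAP-style query scans and aggregates the accumulated history to answer an ad hoc question.

```python
# OLTP vs. OLAP query styles against a hypothetical schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
               customer_id INT, order_date TEXT, amount REAL)""")

# OLTP style: capture one day-to-day transaction.
con.execute("INSERT INTO orders (customer_id, order_date, amount) VALUES (?, ?, ?)",
            (42, "2013-03-01", 199.99))
con.commit()

# OLAP style: an ad hoc analytical question over the history.
monthly = con.execute("""
    SELECT substr(order_date, 1, 7) AS month,
           COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY month
    ORDER BY month
""").fetchall()
print(monthly)
```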
• Slice: A slice is a subset of a multidimensional array (usually a two-dimensional representation) corresponding to a single value set for one (or more) of the dimensions not in the subset. A simple slicing operation on a three-dimensional cube is shown in Figure 2.9.
• Dice: The dice operation is a slice on more than two dimensions of a data cube.
• Drill Down/Up: Drilling down or up is a specific OLAP technique whereby the user navigates among levels of data, ranging from the most summarized (up) to the most detailed (down).
• Roll Up: A roll up involves computing all of the data relationships for one or more dimensions. To do this, a computational relationship or formula might be defined.
• Pivot: A pivot is used to change the dimensional orientation of a report or an ad hoc query-page display.

These operations are illustrated in the short sketch below.
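The following pandas sketch shows rough equivalents of the operations just listed; the sales data and dimension names are made up for the example and are not part of the text.

```python
# Slice, dice, roll up/drill down, and pivot over a toy sales data set.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2012, 2012, 2012, 2013, 2013, 2013],
    "region":  ["East", "West", "East", "East", "West", "West"],
    "product": ["TV", "TV", "Radio", "TV", "Radio", "Radio"],
    "revenue": [100, 150, 80, 120, 90, 60],
})

slice_2012 = sales[sales["year"] == 2012]                   # slice: fix one dimension
dice = sales[(sales["year"] == 2013) &
             (sales["region"] == "West")]                   # dice: fix several dimensions
roll_up = sales.groupby("region")["revenue"].sum()          # roll up to the region level
drill_down = sales.groupby(["region", "product"])["revenue"].sum()  # drill back down
pivot = sales.pivot_table(index="region", columns="year",
                          values="revenue", aggfunc="sum")  # pivot: reorient the report
print(pivot)
```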
VARIATIONS OF OLAP OLAP has a few variations; among them, ROLAP, MOLAP, and HOLAP are the most common ones.

ROLAP stands for Relational Online Analytical Processing. ROLAP is an alternative to the MOLAP (Multidimensional OLAP) technology. While both ROLAP and MOLAP analytic tools are designed to allow analysis of data through the use of a multidimensional data model, ROLAP differs significantly in that it does not require the pre-computation and storage of information. Instead, ROLAP tools access the data in a relational database and generate SQL queries to calculate information at the appropriate level when an end user requests it. With ROLAP, it is possible to create additional database tables (summary tables or aggregations) that summarize the data at any desired combination of dimensions. While ROLAP uses a relational database source, generally the database must be carefully designed for ROLAP use.
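A minimal sketch of the ROLAP idea follows: no cube is pre-computed; instead, a SQL aggregate is generated against the relational warehouse at the moment the user asks for a particular combination of dimensions. The fact table, measure, and dimension names are hypothetical, and the string-building is deliberately simplified (no identifier quoting or validation).

```python
# ROLAP-style query generation: build GROUP BY SQL on demand.
def build_rolap_query(dimensions, measure="revenue", fact_table="sales_fact"):
    """Generate an aggregate query for the dimensions the end user requested."""
    dims = ", ".join(dimensions)
    return (f"SELECT {dims}, SUM({measure}) AS {measure} "
            f"FROM {fact_table} GROUP BY {dims}")

# The same tool answers different ad hoc requests by generating different SQL:
print(build_rolap_query(["region"]))
print(build_rolap_query(["region", "quarter"]))
```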
2.6 DATA WAREHOUSING IMPLEMENTATION ISSUES

Implementing a data warehouse is generally a massive effort that must be planned and executed according to established methods. However, the project lifecycle has many facets, and no single person can be an expert in each area. Here we discuss specific ideas and issues as they relate to data warehousing. Inmon (2006) provided a set of actions that a data warehouse systems programmer may use to tune a data warehouse.

Reeves (2009) and Solomon (2005) provided some guidelines regarding the critical questions that must be asked, some risks that should be weighed, and some processes that can be followed to help ensure a successful data warehouse implementation. They compiled a list of 11 major tasks that could be performed in parallel:

1. Establishment of service-level agreements and data-refresh requirements
2. Identification of data sources and their governance policies
3. Data quality planning
4. Data model design
5. ETL tool selection
6. Relational database software and platform selection
7. Data transport
8. Data conversion
9. Reconciliation process
10. Purge and archive planning
11. End-user support