ch2-dbs

The document discusses the concepts of Decision Support Systems (DSS) and Data Warehousing, highlighting their roles in improving decision-making and data management. It outlines various types of DSS, including data-driven, model-driven, and knowledge-driven systems, as well as the structure and benefits of data warehouses. Additionally, it addresses challenges and limitations associated with DSS and the importance of data warehouses in providing integrated and historical data for strategic decisions.

Uploaded by

shbhamare123
Copyright
© © All Rights Reserved

Module 2

Extraction, Transformation and Loading


Content
• Decision support systems, Escalating need for strategic information,
Failures of past decision-support systems, Operational versus
decision-support systems,
• Defining a data warehouse by its features, Data warehouses and data
marts, Data warehouse architecture,
• Data warehouse modeling vs operational database modeling,
Features of a good dimensional model.
DSS
• A decision support system (DSS) is a computer program application
used to improve a company's decision-making capabilities. It analyzes
large amounts of data and presents an organization with the best
possible options available.
• Decision support systems bring together data and knowledge from
different areas and sources to provide users with information beyond
the usual reports and summaries.
• This is intended to help people make informed decisions.
DSS
• Typical information a decision support application might gather and
present includes the following:
• comparative sales figures between one week and the next;
• projected revenue figures based on new product sales assumptions; and
• the consequences of different decisions.
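The first two kinds of information listed above can be sketched in a few lines of Python. This is only a minimal illustration: the sales figures, the unit price and the function names are assumptions made for the example, not values from the slides.

```python
# Hypothetical weekly sales figures (illustrative numbers only).
weekly_sales = {"week_1": 12000.0, "week_2": 13800.0}


def week_over_week_change(previous: float, current: float) -> float:
    """Return the percentage change from the previous week to the current one."""
    return (current - previous) / previous * 100.0


def projected_revenue(units_expected: int, unit_price: float) -> float:
    """Project revenue from an assumed number of new-product sales."""
    return units_expected * unit_price


change = week_over_week_change(weekly_sales["week_1"], weekly_sales["week_2"])
projection = projected_revenue(units_expected=500, unit_price=40.0)
print(f"Week-over-week change: {change:.1f}%")
print(f"Projected revenue: {projection:.2f}")
```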
DSS V/s Operational Applications
• A decision support system is an informational application as opposed
to an operational application.
• Informational applications provide users with relevant information
based on a variety of data sources to support better-informed
decision-making.
• Operational applications, by contrast, record the details of business
transactions, including the data required for the decision-support
needs of a business.
DSS Components
• A typical DSS consists of three different parts: a knowledge base, a
software system and a user interface.
1. Knowledge base.
• A knowledge base is an integral part of a decision support system
database, containing information from both internal and external
sources.
• It is a library of information related to particular subjects and is the
part of a DSS that stores information used by the system's reasoning
engine to determine a course of action.
DSS Components
2. Software system. The software system is composed of model
management systems.
• A model is a simulation of a real-world system with the goal of
understanding how the system works and how it can be improved.
• Organizations use models to predict how outcomes will change with
different adjustments to the system.
• For example, models can be helpful for understanding systems that
are too complicated, too expensive or too dangerous to fully explore
in real life.
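As a rough illustration of how a model predicts outcomes under different adjustments, the sketch below evaluates a toy profit model for several candidate prices. The linear demand curve, the cost figures and all names are assumptions chosen for the example.

```python
def demand(price: float) -> int:
    """Assumed linear demand curve: higher prices sell fewer units."""
    return max(0, int(1000 - 20 * price))


def profit(price: float, unit_cost: float, fixed_cost: float) -> float:
    """Profit for one scenario under the assumed demand curve."""
    return (price - unit_cost) * demand(price) - fixed_cost


# What-if analysis: the same model evaluated under different price adjustments.
candidate_prices = [20.0, 25.0, 30.0]
results = {p: profit(p, unit_cost=10.0, fixed_cost=2000.0) for p in candidate_prices}
best_price = max(results, key=results.get)

for price, value in results.items():
    print(f"price={price:.2f} -> profit={value:.2f}")
print(f"Best candidate price: {best_price:.2f}")
```

A real model management system would hold many such models; the point here is only that changing an input and re-running the model is what lets the decision maker compare alternatives.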
DSS Components
3. User interface. The user interface enables easy system navigation.
• The primary goal of the decision support system's user interface is to
make it easy for the user to manipulate the data that is stored on it.
• Businesses can use the interface to evaluate the effectiveness of DSS
transactions for the end users.
• DSS interfaces include simple windows, complex menu-driven
interfaces and command-line interfaces.
Example – DSS in Health Care
Types of DSS
1. Data-driven DSS
• A data-driven DSS gives users access to a large amount of internal and external data.
• DSS will query a database using the web, an external server or a company's
mainframe.
• It relies on data mining to provide patterns and information about the data being
assessed.
• Users rely on data-driven decision support systems to make decisions about
businesses, inventories and products.
• Managers might find data-driven decision support systems most helpful when
analyzing current and historical data to report on the conditions of a department or
the business.
Software examples of a data-driven DSS
• Geographic Information Systems (GIS)
• File drawer systems
• Executive information systems
• Computer-based databases with query systems
Types of DSS
2. Model-driven DSS
• A model-driven DSS allows a user to analyze and manipulate specific models
of data, such as statistics, finances or scheduling.
• These decision support systems are specific to the type of model the user
wants to interact with and typically offer less data than other DSS types.
• They analyze scenarios and data to allow the user to manipulate a model,
such as creating a work schedule.
• They might use simple analysis tools or complex statistics, depending on the
model's purpose and the user's needs.
• Managers, staff and third parties who interact with a business might use a
model-driven DSS.
Software examples of a model-driven DSS
• Scheduling software
• Financial modeling
• Decision analysis modeling
• Optimization software
Types of DSS
3. Knowledge-driven DSS
• With a knowledge-driven DSS, a knowledge-management system monitors
continually updated data about an organization to support decisions.
• The DSS uses diagnosis, prediction, interpretation and classification to
recommend actions consistent with the business.
• A knowledge-driven DSS can be helpful to managers because it performs tasks
faster than a human might.
• They can also help consumers decide which products and services to buy. This
kind of DSS often relies on a data-mining component.
• Managers, staff and external users, such as customers, might use a
knowledge-driven DSS.
Software examples of a knowledge-driven DSS
• Software that identifies new or current customers who might be
interested in products
• Product selection software
Types of DSS
4. Document-driven DSS
• A document-driven DSS retrieves unstructured information from a
variety of electronic sources.
• It searches webpages, documents in databases and other information
based on a user's search terms to gather relevant information.
• A document-driven DSS might be specific to a business' private files or
as broad as a common internet search engine.
• Anyone using a database's search function or an internet search
engine is using a document-driven DSS.
Software examples of a document-driven DSS
• Search engines
• Database search software
• Article databases with search functions
Types of DSS
5. Communication-driven DSS
• A communication-driven DSS uses tools to support communication and
collaboration. Email is an example of a communication-driven DSS.
• This type of DSS includes sharing tools that allow multiple people to work on
a project at once and software that allows for digital communication
between people.
• It improves a shared project's efficiency and effectiveness and can help
facilitate meetings and conversations.
• Internal team members, virtual business meeting hosts and online chat and
video meeting software users can benefit from a communication-driven
DSS.
Software examples of a communication-driven DSS
• Chat and instant messaging services
• Collaboration software, such as document sharing and editing
software
• Email
Types of DSS
6. Intelligent DSS
• Any DSS with artificial intelligence in its design is an example of intelligent
DSS (IDSS).
• Within an IDSS, AI does data mining and processing to filter through large
datasets.
• An IDSS is designed to offer services similar to those of a human consultant. It is
programmed to identify patterns and trends to guide decision-making.
• It can also resolve problems and analyze solutions. AI components add
capabilities, such as fuzzy logic and machine learning, to a DSS. Managers,
diagnosticians and other decision-makers might use an IDSS.
Software examples of an intelligent DSS
• Smart manufacturing systems
• Medical diagnostic systems
Types of DSS
7. Manual DSS
• A manual DSS relies on individuals instead of computers to support
decision-making.
• A group of experts analyzes the strengths, weaknesses, opportunities
and threats of their organization or project.
• A manual DSS is much slower than a computer-based DSS, but certain
types of analysis still need a human eye at every step. Economists,
executives and managers might use a manual DSS.
Examples of manual DSS include:
• Cost-benefit analyses
• Decision matrices
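A decision matrix can be worked by hand, but the arithmetic is easy to sketch in Python. The criteria, weights, options and scores below are hypothetical and only show how a weighted score is computed for each option.

```python
# Hypothetical criteria weights (they sum to 1.0) and 1-5 scores per option.
criteria_weights = {"cost": 0.5, "quality": 0.3, "delivery": 0.2}
options = {
    "Supplier A": {"cost": 4, "quality": 3, "delivery": 5},
    "Supplier B": {"cost": 3, "quality": 5, "delivery": 4},
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Sum each criterion score multiplied by its weight."""
    return sum(scores[criterion] * weight for criterion, weight in weights.items())


totals = {name: weighted_score(scores, criteria_weights) for name, scores in options.items()}
best_option = max(totals, key=totals.get)

for name, total in totals.items():
    print(f"{name}: {total:.2f}")
print(f"Highest weighted score: {best_option}")
```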
Types of DSS
8. Hybrid DSS
• A hybrid DSS combines parts of multiple DSS types to create a complex
outcome. Large issues in industries such as finance and health care
might require the tools of multiple decision support systems, such as a
knowledge-driven DSS and a data-driven DSS.
• A hybrid DSS might use additional software to help these components
work together. Sometimes a human analyzes and combines the results of
each DSS.
• A hybrid DSS might also describe a system in which a human works with
a DSS to extract and manipulate data. Medical professionals, financial
decision-makers and researchers might use a hybrid DSS.
Software examples of a hybrid DSS

• Risk assessment
• Clinical DSS
• Web-based DSS
Failures, Uncertainties and Limitations in Decision Support...
• Technological knowledge of users is required. ...
• Hard to Quantify Factors. ...
• Hard to collect all of related data. ...
• Processing Model Limitations and Assumptions. ...
• System design failures. ...
• Organization Resistance.
1. Technological knowledge of users is required
• Although decision support systems have become more user-friendly in recent years, users' lack of
technological knowledge remains an issue, especially in small businesses. Most decision support systems
still require knowledge of technical terms for the analysis.

2. Hard to Quantify Factors
• In the real world, some values cannot be pinned down and some factors are hard to quantify, such as future
interest rates, new legislation or product shelf life, all of which may need to be considered in the analysis.
Even when the decision support system presents a definite-looking result, decision makers must use their
own judgment in making the final decision.

3. Hard to collect all of related data
• At times, data are recorded incorrectly, errors in the data go unnoticed, or some data cannot be recorded
at all. Some data must be estimated during analysis. Thus, the value produced by decision support tools
may differ from what it should be.
4. Processing Model Limitations and Assumptions
• As with data-processing models in economics, decision makers may not
be fully aware of the limitations or assumptions of the particular
processing model. Assumptions and limitations take the form “if the
situation satisfies these conditions, the result should be…”, but real
situations cannot be controlled to match the assumptions and
limitations of the decision support model.
5. System design failures
• Because each user's problems are different, it is a challenge for decision
support system developers to design a program that supports every person.
Some decision makers do not know exactly what they want or what they can
obtain from a decision support system, so requirements may not be captured well.
• A decision support system may be designed in a way that does not match what
decision makers actually want. When it is used, its results may not be what the
decision maker wants, and the information obtained may not be sufficient to
make a decision.
6. Organization Resistance
• Any new technology change will cause resistance from some users or
stakeholders. Some people may fear having to learn a new system, or
fear losing status or influence in the organization. Sometimes
developers may not receive adequate cooperation from users in the
organization, or users may have no intention of using the DSS.
• The resulting system may not be what users want. The benefit of using
a DSS for a given issue may not be as great as expected.
Section 2
• A data warehouse is a relational database management system (RDBMS) construct built to meet
decision-support requirements, which differ from those of transaction processing systems.
• It can be loosely described as any centralized data repository which can be queried for
business benefits.
• It is a database that stores information oriented to satisfy decision-making requests. It is a
group of decision support technologies aimed at enabling the knowledge worker
(executive, manager, and analyst) to make better and faster decisions.
• So, data warehousing supports architectures and tools for business executives to
systematically organize, understand and use their information to make strategic decisions.
• A data warehouse environment contains an extraction, transformation, and loading (ETL)
solution, an online analytical processing (OLAP) engine, client analysis tools, and other
applications that handle the process of gathering information and delivering it to business
users.
Data Warehouse
• A Data Warehouse (DW) is a relational database that is designed for
query and analysis rather than transaction processing. It includes
historical data derived from transaction data from one or more
sources.
• A Data Warehouse provides integrated, enterprise-wide, historical
data and focuses on providing support for decision-makers for data
modeling and analysis.
Data Warehouse
• A Data Warehouse is a group of data specific to the entire organization, not only to
a particular group of users.
• It is not used for daily operations and transaction processing but used for making
decisions.
• A Data Warehouse can be viewed as a data system with the following attributes:
• It is a database designed for investigative tasks, using data from various applications.
• It supports a relatively small number of clients with relatively long interactions.
• It includes current and historical data to provide a historical perspective of information.
• Its usage is read-intensive.
• It contains a few large tables.
• "Data Warehouse is a subject-oriented, integrated, and time-variant store of
information in support of management's decisions."
Subject-Oriented

• A data warehouse focuses on the modeling and analysis of data for
decision-makers. Therefore, data warehouses typically provide a
concise and straightforward view around a particular subject, such as
customer, product, or sales, instead of the global organization's
ongoing operations.
• This is done by excluding data that are not useful concerning the
subject and including all data needed by the users to understand the
subject.
Integrated

• A data warehouse integrates various heterogeneous data sources like
RDBMS, flat files, and online transaction records. It requires
performing data cleaning and integration during data warehousing to
ensure consistency in naming conventions, attribute types, etc.,
among different data sources.
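A minimal sketch of what integration means in practice: two sources describe the same kind of entity with different field names and attribute types, and both are mapped onto one schema. The source names, field names and sample rows are hypothetical.

```python
# Two hypothetical sources that name and type the same attributes differently.
crm_rows = [{"cust_id": "101", "full_name": "Asha Rao"}]
billing_rows = [{"CustomerID": 102, "Name": "R. Mehta"}]

# Mapping from each source's field names to the warehouse naming convention.
CRM_FIELDS = {"cust_id": "customer_id", "full_name": "customer_name"}
BILLING_FIELDS = {"CustomerID": "customer_id", "Name": "customer_name"}


def integrate(rows: list, field_map: dict) -> list:
    """Rename fields to the unified schema and coerce the key to an integer."""
    unified_rows = []
    for row in rows:
        unified = {field_map[name]: value for name, value in row.items()}
        unified["customer_id"] = int(unified["customer_id"])
        unified_rows.append(unified)
    return unified_rows


customers = integrate(crm_rows, CRM_FIELDS) + integrate(billing_rows, BILLING_FIELDS)
print(customers)
```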
Time-Variant

• Historical information is kept in a data warehouse. For example, one
can retrieve data from 3 months, 6 months, 12 months, or even
further back from a data warehouse. This contrasts with a
transaction system, where often only the most current data is kept.
Non-Volatile

• The data warehouse is a physically separate data store, which is
transformed from the source operational RDBMS.
• Operational updates of data do not occur in the data warehouse,
i.e., update, insert, and delete operations are not performed.
• It usually requires only two procedures in data accessing: initial
loading of data and access to data.
• Therefore, the DW does not require transaction processing, recovery,
and concurrency capabilities, which allows for substantial speedup of
data retrieval. Non-volatile means that once data has entered the
warehouse, it should not change.
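The two procedures named above, loading and access, can be sketched as a small Python class that exposes nothing else. The class name and its methods are illustrative assumptions; a real warehouse enforces this through its load process and permissions rather than through application code.

```python
from types import MappingProxyType


class WarehouseTable:
    """Toy non-volatile store: rows can be loaded and read, never changed."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Append new rows; each row is stored as a read-only mapping."""
        for row in rows:
            self._rows.append(MappingProxyType(dict(row)))

    def query(self, predicate):
        """Return copies of the rows that satisfy the predicate."""
        return [dict(row) for row in self._rows if predicate(row)]


table = WarehouseTable()
table.load([{"year": 2023, "amount": 100.0}, {"year": 2024, "amount": 150.0}])
recent = table.query(lambda row: row["year"] == 2024)
print(recent)
```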
Types of Data Warehouse

Enterprise Data Warehouse (EDW)
• An enterprise data warehouse serves as a key or central database that facilitates decision-support
services throughout the enterprise.
• The advantage is that it provides access to cross-organizational information, offers a unified approach to
data representation, and allows running complex queries.
Operational Data Store (ODS)
• An operational data store is refreshed in real time.
• It is often preferred for routine activities like storing employee records.
• It is required when data warehouse systems do not support the reporting needs of the business.
Data Mart
• A data mart is a subset of a data warehouse built to maintain a particular department, region, or
business unit.
• Every department of a business has a central repository or data mart to store data. The data from the
data mart is stored in the ODS periodically.
• The ODS then sends the data to the EDW, where it is stored and used.
Examples
• Investment and insurance companies use data warehouses
primarily to analyze customer and market trends and allied data
patterns.
• In sub-sectors like Forex and stock markets, data warehouses play a
significant role because a single point difference can result in huge
losses across the board.
Data Warehouse
Data Warehouse and ETL
Benefits of a Data Warehouse
• Improved data consistency
• Better business decisions
• Easier access to enterprise data for end-users
• Better documentation of data
• Reduced computer costs and higher productivity
• Enabling end-users to run ad-hoc queries or reports without degrading
the performance of operational systems
• Collection of related data from various sources into one place
Examples of Data Warehouse
• SQL Data Warehouse is a cloud-based Enterprise Data Warehouse
(EDW) that leverages Massively Parallel Processing (MPP) to quickly run
complex queries across petabytes of data. Use SQL Data Warehouse as
a key component of a big data solution.
• Excel is a popular spreadsheet application that can be used to store
data from a variety of sources. While Excel is not a traditional data
warehouse application, it can be used to create a data warehouse.
• Azure SQL Data Warehouse is a cloud-based data warehouse that
helps in creating and delivering a data warehouse. Azure SQL Data
Warehouse is capable of processing large volumes of relational and
non-relational data.
Terminology
• Data warehouses, data lakes, and data marts are different cloud
storage solutions.
• A data warehouse stores data in a structured format. It is a central
repository of preprocessed data for analytics and business
intelligence.
• A data mart is a data warehouse that serves the needs of a specific
business unit, like a company’s finance, marketing, or sales
department.
• A data lake is a central repository for raw data and unstructured data.
You can store data first and process it later on.
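The relationship between a warehouse and a data mart can be sketched as a filter over warehouse rows. The row layout and the department names below are hypothetical, and a real data mart would normally be its own set of tables rather than an in-memory list.

```python
# Hypothetical warehouse rows covering several business units.
warehouse_rows = [
    {"department": "finance", "metric": "invoices", "value": 120},
    {"department": "marketing", "metric": "campaigns", "value": 8},
    {"department": "finance", "metric": "refunds", "value": 5},
]


def build_data_mart(rows: list, department: str) -> list:
    """Keep only the rows that serve one business unit."""
    return [row for row in rows if row["department"] == department]


finance_mart = build_data_mart(warehouse_rows, "finance")
print(finance_mart)
```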
Data Warehouse V/S Data Mart
Data Warehouse and Data Mart
Assignment Question
1. How can AWS help different organizations with data storage
solutions in the context of:
• data warehouse
• data mart
• data lakes
2. When should a data warehouse, a data lake and a data mart be used?
Explain with examples.
Data Warehouse Architecture

• A data warehouse architecture is a method of defining the overall architecture
of data communication, processing and presentation that exists for end-client
computing within the enterprise.
• Each data warehouse is different, but all are characterized by standard vital
components.
• Production applications such as payroll, accounts payable, product purchasing
and inventory control are designed for online transaction processing (OLTP).
Such applications gather detailed data from day-to-day operations.
• Data warehouse applications are designed to support the user's ad-hoc data
requirements, an activity recently dubbed online analytical processing (OLAP).
These include applications such as forecasting, profiling, summary reporting,
and trend analysis.
Three Architectures – Data Warehouse
• Data Warehouse Architecture: Basic
• Data Warehouse Architecture: With Staging Area
• Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Basic
Operational System -
An operational system is a term used in data warehousing to
refer to a system that processes the day-to-day
transactions of an organization.

Flat Files -
A flat file system is a system of files in which transactional data is
stored, and every file in the system must have a different name.

Meta Data -
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes,
including:
• Metadata summarizes necessary information about data, which
can make finding and working with particular instances of data
easier. For example, author, date created, date modified, and
file size are examples of very basic document metadata.
• Metadata is used to direct a query to the most appropriate data
source.
Data Warehouse Basic
Lightly and highly summarized data

• This area of the data warehouse holds all the predefined lightly and highly summarized (aggregated) data generated by the
warehouse manager.

• The goal of the summarized information is to speed up query performance. The summarized data is updated continuously as
new information is loaded into the warehouse.
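The difference between lightly and highly summarized data can be sketched with two levels of aggregation over the same detail rows. The sample rows and the choice of month and region as grouping keys are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical detail rows: (ISO date, region, sale amount).
detail_rows = [
    ("2024-01-03", "north", 120.0),
    ("2024-01-17", "north", 80.0),
    ("2024-02-05", "north", 50.0),
    ("2024-01-09", "south", 200.0),
]


def lightly_summarized(rows):
    """Aggregate detail rows by (month, region)."""
    totals = defaultdict(float)
    for iso_date, region, amount in rows:
        totals[(iso_date[:7], region)] += amount
    return dict(totals)


def highly_summarized(rows):
    """Aggregate detail rows by region only."""
    totals = defaultdict(float)
    for _, region, amount in rows:
        totals[region] += amount
    return dict(totals)


by_month_region = lightly_summarized(detail_rows)
by_region = highly_summarized(detail_rows)
print(by_month_region)
print(by_region)
```

A query for the yearly total per region can then read the small `by_region` result instead of scanning every detail row, which is where the speedup comes from.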

End-User access Tools

• The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These
users interact with the warehouse using end-user access tools.

• Examples of end-user access tools include:

• Reporting and Query Tools
• Application Development Tools
• Executive Information Systems Tools
• Online Analytical Processing Tools
• Data Mining Tools
Data Warehouse Architecture: With Staging Area

We must clean and process operational data before putting it
into the warehouse.

We can do this programmatically, although data
warehouses typically use a staging area (a place where data is
processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation
for operational data coming from multiple source
systems, especially for enterprise data warehouses where
all relevant data of an enterprise is consolidated.
Data Warehouse with staging area and data marts
Properties
1. Separation: Analytical and transactional processing should
be kept apart as much as possible.

2. Scalability: Hardware and software architectures should
be simple to upgrade as the data volume that has to be
managed and processed, and the number of user
requirements that have to be met, progressively increase.

3. Extensibility: The architecture should be able to host
new operations and technologies without redesigning the
whole system.

4. Security: Monitoring access is necessary because of
the strategic data stored in data warehouses.

5. Administrability: Data warehouse management should
not be complicated.
Types of Data warehouse
Single-Tier Architecture

• Single-tier architecture is not frequently
used in practice.
• Its purpose is to minimize the amount of
data stored; to reach this goal,
it removes data redundancies.
• The vulnerability of this architecture lies in its
failure to meet the requirement for
separation between analytical and
transactional processing.
• Analysis queries are submitted to operational
data after the middleware interprets them.
• In this way, queries affect transactional
workloads.
Two-Tier Architecture
Source layer: A data warehouse system uses heterogeneous sources
of data. That data is initially stored in corporate relational databases
or legacy databases, or it may come from an information system
outside the corporate walls.
Data Staging: The data stored in the sources should be extracted,
cleansed to remove inconsistencies and fill gaps, and integrated to
merge heterogeneous sources into one standard schema. The
so-called Extraction, Transformation, and Loading (ETL) tools can
combine heterogeneous schemata, extract, transform, cleanse,
validate, filter, and load source data into a data warehouse.
Data Warehouse layer: Information is saved to one logically
centralized individual repository: a data warehouse. The data
warehouse can be directly accessed, but it can also be used as a
source for creating data marts, which partially replicate data
warehouse contents and are designed for specific enterprise
departments.
Analysis: In this layer, integrated data is efficiently and flexibly
accessed to issue reports, dynamically analyze information, and
simulate hypothetical business scenarios. It should feature aggregate
information navigators, complex query optimizers, and
user-friendly GUIs.
Three-tier architecture
• The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and the data
warehouse.

• The main advantage of the reconciled layer is that it creates a standard reference data
model for the whole enterprise.
• It also separates the problems of source data extraction and integration from those of data
warehouse population.
• In some cases, the reconciled layer is also used directly to better accomplish some
operational tasks, such as producing daily reports that cannot be satisfactorily prepared
using the corporate applications, or periodically generating data flows to feed external
processes so that they benefit from cleaning and integration.
Three-tier architecture
Three-tier architecture
• Data warehouses usually have a three-level (tier) architecture that includes:
• Bottom Tier (Data Warehouse Server)
• Middle Tier (OLAP Server)
• Top Tier (Front-end Tools).
• The bottom tier consists of the data warehouse server, which is almost always an
RDBMS. It may include several specialized data marts and a metadata repository.
• Data from operational databases and external sources (such as user profile data provided
by external consultants) are extracted using application program interfaces called
gateways. A gateway is provided by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
• Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
Linking and Embedding, Database), by Microsoft, and JDBC (Java Database Connectivity).
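The gateway idea, a client program handing SQL to a database through a standard interface, can be illustrated with Python's built-in `sqlite3` module. This is only an analogy: `sqlite3` follows Python's DB-API rather than ODBC, OLE DB or JDBC, and the `orders` table and its contents are made up for the example.

```python
import sqlite3

# An in-memory database stands in for an operational source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,)])

# The client program generates SQL; the database engine executes it.
extracted = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
conn.close()

print(extracted)
print(total)
```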
Metadata
• The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:

• A description of the DW structure, including the warehouse schema, dimension, hierarchies, data
mart locations, and contents, etc.
• Operational metadata, which usually describes the currency level of the stored data, i.e., active,
archived or purged, and warehouse monitoring information, i.e., usage statistics, error reports,
audit, etc.
• System performance data, which includes indices, used to improve data access and retrieval
performance.
• Information about the mapping from operational databases, which provides source RDBMSs and
their contents, cleaning and transformation rules, etc.
• Summarization algorithms, predefined queries, and reports.
• Business data, which includes business terms and definitions, ownership information, etc.
Principles of Data Warehousing
• Load Performance - Data warehouses require incremental loading of new data on a periodic basis
within narrow time windows; performance of the load process should be measured in
hundreds of millions of rows and gigabytes per hour and must not artificially constrain the
volume of data the business requires.
• Load Processing - Many steps must be taken to load new or updated data into the data
warehouse, including data conversion, filtering, reformatting, indexing, and metadata update.
• Data Quality Management - Fact-based management demands the highest data quality. The
warehouse ensures local consistency, global consistency, and referential integrity despite
"dirty" sources and massive database size.
• Query Performance - Fact-based management must not be slowed by the performance of the
data warehouse RDBMS; large, complex queries must complete in seconds, not days.
• Terabyte Scalability - Data warehouse sizes are growing at astonishing rates. Today they range
in size from a few gigabytes to hundreds of gigabytes, and up to terabyte-sized data warehouses.
Operational Data Stores

• An ODS has been described by Inmon and Imhoff (1996) as a subject-oriented,
integrated, volatile, current-valued data store, containing only detailed corporate
data. A data warehouse, by contrast, is a reporting database that includes relatively
recent as well as historical information and may also include aggregate data.
• The ODS is subject-oriented. It is organized around the significant information subjects of
an enterprise. In a university, the subjects may be students, lecturers and courses, while in
a company the subjects might be customers, salespersons and products.
• The ODS is integrated. That is, it is a collection of subject-oriented records from a variety of
systems that provides an enterprise-wide view of the information.
• The ODS is current-valued. That is, an ODS is up-to-date and follows the current status of
the data. An ODS does not contain historical information. Since the OLTP system data is
changing all the time, data from underlying sources refreshes the ODS as regularly and
frequently as possible.
Operational Data Store
• The ODS is volatile. That is, the data in the ODS frequently changes as new
data refreshes the ODS.
• The ODS is detailed. That is, the ODS is detailed enough to serve the needs of the
operational management staff in the enterprise. The granularity of the
information in the ODS does not have to be precisely the same as in the
source OLTP system.
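The contrast between a volatile, current-valued ODS and a warehouse that keeps history can be sketched with two plain data structures. The account keys, balances and dates are hypothetical.

```python
# The ODS keeps one current value per key; the warehouse keeps every version.
ods_current = {}
warehouse_history = []


def refresh(key: str, value: float, as_of: str) -> None:
    """Apply one change from an OLTP source to both stores."""
    ods_current[key] = value                        # overwrite: volatile, current-valued
    warehouse_history.append((as_of, key, value))   # append: history is preserved


refresh("account-1", 500.0, "2024-01-01")
refresh("account-1", 350.0, "2024-01-02")

print(ods_current)
print(warehouse_history)
```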
Structure of ODS
Operational Monitoring
• Flash monitoring and reporting tools are like a
dashboard that provides meaningful online information on
the operational status of the enterprise.
• This is achieved by using ODS data as
input to the flash monitoring and reporting tools, to
provide business users with a continuously refreshed,
enterprise-wide view of operations without creating
unwanted interruptions or additional load on
transaction-processing systems.
Zero Latency Enterprise (ZLE)
• The Gartner Group has used the term Zero Latency Enterprise (ZLE)
for near real-time integration of operational information so that there
is no unnecessary delay in getting data from one part or one system of
an enterprise to another system that needs the data.
• A ZLE data store is like an ODS that is integrated and up-to-date. The
objective of a ZLE data store is to give management a single view of
enterprise information by bringing together relevant information in
real time and providing management with a "360-degree" view of
the customer.
Features of ZLE
• It has a consolidated view of the enterprise's operational information. It
has a high level of availability, and it involves online refreshing of
data. ZLE requires data that is as current as possible. Since a ZLE
needs to support a large number of concurrent users, for example, call
centre users, fast turnaround time for transactions and 24/7
availability are required.
Operational Data Store V/S Data Warehouse
ETL Process
• The mechanism of extracting information from source systems and
bringing it into the data warehouse is commonly called ETL, which stands
for Extraction, Transformation and Loading.

• The ETL process requires active input from various stakeholders,
including developers, analysts, testers and top executives, and is technically
challenging.
• To maintain its value as a tool for decision-makers, the data warehouse
system needs to change as the business changes. ETL is a recurring
activity (daily, weekly, monthly) of a data warehouse system and needs
to be agile, automated, and well documented.
Extraction
• Extraction is the operation of extracting information from a source
system for further use in a data warehouse environment. This is the
first stage of the ETL process.
• Extraction process is often one of the most time-consuming tasks in
the ETL.
• The source systems might be complicated and poorly documented,
and thus determining which data needs to be extracted can be
difficult.
• The data has to be extracted several times in a periodic manner to
supply all changed data to the warehouse and keep it up-to-date.
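One common way to supply only changed data on each periodic run is to remember when the last extraction happened and select rows modified since then. The sketch below assumes every source row carries an `updated_at` timestamp; that field name and the sample rows are assumptions, and real sources may need other change-detection techniques.

```python
from datetime import datetime

# Hypothetical source rows carrying a last-modified timestamp.
source_rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 14, 30)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 8, 15)},
]


def extract_changed(rows: list, last_extracted_at: datetime) -> list:
    """Return only the rows modified after the previous extraction run."""
    return [row for row in rows if row["updated_at"] > last_extracted_at]


last_run = datetime(2024, 1, 2, 0, 0)
changed = extract_changed(source_rows, last_run)

# The newest timestamp seen becomes the starting point for the next run.
next_run_marker = max(row["updated_at"] for row in changed)
print([row["id"] for row in changed])
```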
Cleaning
• The cleansing stage is crucial in a data warehouse system because it is supposed to improve
data quality. The primary data cleansing features found in ETL tools are rectification and
homogenization. They use specific dictionaries to rectify typing mistakes and to recognize
synonyms, as well as rule-based cleansing to enforce domain-specific rules and define
appropriate associations between values.

• The following examples show why data cleaning is essential:
• If an enterprise wishes to contact its customers or its suppliers, a complete, accurate and up-to-date
list of contact addresses, email addresses and telephone numbers must be available.
• If a customer or supplier calls, the staff responding should be able to quickly find the person in the
enterprise database, but this requires that the caller's name or his/her company name is listed in
the database.
• If a customer appears in the databases with two or more slightly different names or different account
numbers, it becomes difficult to update the customer's information.
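Dictionary-based rectification and the detection of slightly different names for the same customer can be sketched as follows. The correction dictionary, the normalization rule and the sample names are illustrative assumptions; production ETL tools use much richer dictionaries and matching rules.

```python
from collections import defaultdict

# Hypothetical dictionary mapping typing mistakes and synonyms to one value.
CITY_CORRECTIONS = {"mumbay": "mumbai", "bombay": "mumbai"}


def clean_city(raw: str) -> str:
    """Rectify known misspellings and synonyms of a city name."""
    key = raw.strip().lower()
    return CITY_CORRECTIONS.get(key, key)


def normalize_name(raw: str) -> str:
    """Lower-case, drop periods and collapse whitespace."""
    return " ".join(raw.lower().replace(".", " ").split())


def find_duplicates(names: list) -> list:
    """Group names that become identical after normalization."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


duplicates = find_duplicates(["A. Rao", "a  rao", "R. Mehta"])
print(clean_city(" Bombay "))
print(duplicates)
```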
Transformation
• Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled data layer.
• The following points must be rectified in this phase:
• Loose texts may hide valuable information. For example, XYZ PVT Ltd does not explicitly
show that this is a private limited company.
• Different formats can be used for individual data. For example, a date can be saved as a string
or as three integers.
• Following are the main transformation processes aimed at populating the reconciled data
layer:
• Conversion and normalization that operate on both storage formats and units of measure to make data
uniform.
• Matching that associates equivalent fields in different sources.
• Selection that reduces the number of source fields and records.
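The three transformation processes listed above can be sketched on a single record. The field names, the alias table and the unit factors are assumptions chosen for the example; the date handling covers the two formats mentioned earlier, a string or three integers.

```python
from datetime import date

# Matching: equivalent fields in different sources map to one target name.
FIELD_ALIASES = {"date": "order_date", "order_dt": "order_date",
                 "weight": "weight", "wt": "weight", "unit": "unit"}
# Conversion: factors that bring weights to a single unit of measure.
WEIGHT_TO_KG = {"kg": 1.0, "g": 0.001}


def parse_date(value):
    """Accept an ISO date string or a (year, month, day) triple."""
    if isinstance(value, str):
        return date.fromisoformat(value)
    year, month, day = value
    return date(year, month, day)


def transform(record: dict) -> dict:
    # Selection: fields without an alias (for example free-text notes) are dropped.
    matched = {FIELD_ALIASES[k]: v for k, v in record.items() if k in FIELD_ALIASES}
    return {
        "order_date": parse_date(matched["order_date"]),
        "weight_kg": matched["weight"] * WEIGHT_TO_KG[matched["unit"]],
    }


row_a = transform({"date": "2024-03-05", "weight": 500, "unit": "g", "notes": "fragile"})
row_b = transform({"order_dt": (2024, 3, 5), "wt": 2, "unit": "kg"})
print(row_a)
print(row_b)
```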
ETL Process
Loading
• Loading is the process of writing the data into the target database. During
the load step, it is necessary to ensure that the load is performed correctly
and with as few resources as possible.
• Loading can be carried out in two ways:
• Refresh: Data warehouse data is completely rewritten. This means that
older data is replaced. Refresh is usually used in combination with static
extraction to populate a data warehouse initially.
• Update: Only those changes applied to source information are added to the
data warehouse. An update is typically carried out without deleting or
modifying preexisting data. This method is used in combination with
incremental extraction to update data warehouses regularly.
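The two loading modes can be sketched against an in-memory SQLite database standing in for the warehouse. The `sales` table, its columns and the sample rows are hypothetical, and the function names are chosen only for this illustration.

```python
import sqlite3


def refresh_load(conn, rows):
    """Refresh: rewrite the whole table, replacing older data."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM sales")
        conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)


def update_load(conn, changed_rows):
    """Update: add only the changes, leaving existing rows untouched."""
    with conn:
        conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", changed_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

refresh_load(conn, [(1, 10.0), (2, 20.0)])   # initial population
update_load(conn, [(3, 30.0)])               # later incremental load
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)
```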
ETL V/s ELT
ETL V/S ELT
Assignment

https://www.javatpoint.com/types-of-data-warehouses

• Choose one type


• Prepare a presentation with 5 slides
• Present it in the class