
Unit III Data Provisioning and Data Visualization

Data warehouse
A data warehouse is an enterprise system used for the analysis and reporting of
structured and semi-structured data from multiple sources, such as point-of-sale
transactions, marketing automation, customer relationship management, and more.

A data warehouse is a collection of data from heterogeneous sources organised under a unified
schema. There are two approaches to constructing a data warehouse: the top-down approach and
the bottom-up approach, explained below.

1. Top-down approach:

The essential components are discussed below:


1. External Sources –
An external source is any source from which data is collected, irrespective of the type of
data. The data can be structured, semi-structured or unstructured.

2. Staging Area –
Since the data extracted from the external sources does not follow a particular
format, it needs to be validated and standardised before being loaded into the data warehouse.
An ETL tool is typically used for this purpose.

• E (Extract): Data is extracted from the external data sources.

• T (Transform): Data is transformed into the standard format.

• L (Load): Data is loaded into the data warehouse after being transformed into the
standard format.

3. Data warehouse –
After cleansing, the data is stored in the data warehouse as the central repository. It
stores the metadata, while the actual data is distributed to the data
marts. Note that in this top-down approach the data warehouse stores the data in its purest
(most detailed) form.

4. Data Marts –
A data mart is also part of the storage component. It stores the information of a
particular function of an organisation that is handled by a single authority. There
can be as many data marts in an organisation as there are functions. We can also say
that a data mart contains a subset of the data stored in the data warehouse.

5. Data Mining –
Data mining is the practice of analysing the large volumes of data present in the data
warehouse. It is used to find the hidden patterns present in the database or data warehouse
with the help of data mining algorithms.

This approach was defined by Inmon: the data warehouse is built first as a central repository for
the complete organisation, and the data marts are created from it after the complete
data warehouse has been created.

Advantages of Top-Down Approach –


1. Since the data marts are created from the data warehouse, it provides a consistent
dimensional view across the data marts.

2. Also, this model is considered the strongest model for coping with business changes. That's
why big organisations prefer to follow this approach.

3. Creating a data mart from the data warehouse is easy.


4. Improved data consistency: The top-down approach promotes data consistency by
ensuring that all data marts are sourced from a common data warehouse. This
ensures that all data is standardized, reducing the risk of errors and inconsistencies in
reporting.
5. Easier maintenance: Since all data marts are sourced from a central data warehouse,
it is easier to maintain and update the data in a top-down approach. Changes can be
made to the data warehouse, and those changes will automatically propagate to all
the data marts that rely on it.
6. Better scalability: The top-down approach is highly scalable, allowing organizations
to add new data marts as needed without disrupting the existing infrastructure. This
is particularly important for organizations that are experiencing rapid growth or have
evolving business needs.
7. Improved governance: The top-down approach facilitates better governance by
enabling centralized control of data access, security, and quality. This ensures that
all data is managed consistently and that it meets the organization's standards for
quality and compliance.
8. Reduced duplication: The top-down approach reduces data duplication by ensuring
that data is stored only once in the data warehouse. This saves storage space and
reduces the risk of data inconsistencies.
9. Better reporting: The top-down approach enables better reporting by providing a
consistent view of data across all data marts. This makes it easier to create accurate
and timely reports, which can improve decision-making and drive better business
outcomes.
10.Better data integration: The top-down approach enables better data integration by
ensuring that all data marts are sourced from a common data warehouse. This makes
it easier to integrate data from different sources and provides a more complete view
of the organization's data.

Disadvantages of Top-Down Approach –


1. The cost and time taken in designing and maintaining it are very high.
2. Complexity: The top-down approach can be complex to implement and maintain,
particularly for large organizations with complex data needs. The design and
implementation of the data warehouse and data marts can be time-consuming and
costly.
3. Lack of flexibility: The top-down approach may not be suitable for organizations
that require a high degree of flexibility in their data reporting and analysis. Since the
design of the data warehouse and data marts is pre-determined, it may not be
possible to adapt to new or changing business requirements.
4. Limited user involvement: The top-down approach can be dominated by IT
departments, which may lead to limited user involvement in the design and
implementation process. This can result in data marts that do not meet the specific
needs of business users.
5. Data latency: The top-down approach may result in data latency, particularly when
data is sourced from multiple systems. This can impact the accuracy and timeliness
of reporting and analysis.
6. Data ownership: The top-down approach can create challenges around data
ownership and control. Since data is centralized in the data warehouse, it may not be
clear who is responsible for maintaining and updating the data.
7. Cost: The top-down approach can be expensive to implement and maintain,
particularly for smaller organizations that may not have the resources to invest in a
large-scale data warehouse and associated data marts.
8. Integration challenges: The top-down approach may face challenges in integrating
data from different sources, particularly when data is stored in different formats or
structures. This can lead to data inconsistencies and inaccuracies.

2. Bottom-up approach:

1. First, the data is extracted from external sources (the same as in the top-down
approach).

2. Then, the data goes through the staging area (as explained above) and is loaded into data
marts instead of the data warehouse. The data marts are created first and provide
reporting capability; each addresses a single business area.

3. These data marts are then integrated into the data warehouse.

This approach was given by Kimball: the data marts are created first and provide a thin
view for analysis, and the data warehouse is created after the complete set of data marts has
been created.

Advantages of Bottom-Up Approach –


1. As the data marts are created first, reports are quickly generated.

2. We can accommodate more data marts here, and in this way the
data warehouse can be extended.

3. Also, the cost and time taken in designing this model are comparatively low.
4. Incremental development: The bottom-up approach supports incremental
development, allowing for the creation of data marts one at a time. This allows for
quick wins and incremental improvements in data reporting and analysis.
5. User involvement: The bottom-up approach encourages user involvement in the
design and implementation process. Business users can provide feedback on the data
marts and reports, helping to ensure that the data marts meet their specific needs.
6. Flexibility: The bottom-up approach is more flexible than the top-down approach, as
it allows for the creation of data marts based on specific business needs. This
approach can be particularly useful for organizations that require a high degree of
flexibility in their reporting and analysis.
7. Faster time to value: The bottom-up approach can deliver faster time to value, as the
data marts can be created more quickly than a centralized data warehouse. This can
be particularly useful for smaller organizations with limited resources.
8. Reduced risk: The bottom-up approach reduces the risk of failure, as data marts can
be tested and refined before being incorporated into a larger data warehouse. This
approach can also help to identify and address potential data quality issues early in
the process.
9. Scalability: The bottom-up approach can be scaled up over time, as new data marts
can be added as needed. This approach can be particularly useful for organizations
that are growing rapidly or undergoing significant change.
10.Data ownership: The bottom-up approach can help to clarify data ownership and
control, as each data mart is typically owned and managed by a specific business
unit. This can help to ensure that data is accurate and up-to-date, and that it is being
used in a consistent and appropriate way across the organization.

Disadvantages of Bottom-Up Approach –


1. This model is not as strong as the top-down approach, because the dimensional view of the
data marts is not as consistent as it is in that approach.
2. Data silos: The bottom-up approach can lead to the creation of data silos, where
different business units create their own data marts without considering the needs of
other parts of the organization. This can lead to inconsistencies and redundancies in
the data, as well as difficulties in integrating data across the organization.
3. Integration challenges: Because the bottom-up approach relies on the integration of
multiple data marts, it can be more difficult to integrate data from different sources
and ensure consistency across the organization. This can lead to issues with data
quality and accuracy.
4. Duplication of effort: In a bottom-up approach, different business units may
duplicate effort by creating their own data marts with similar or overlapping data.
This can lead to inefficiencies and higher costs in data management.
5. Lack of enterprise-wide view: The bottom-up approach can result in a lack of
enterprise-wide view, as data marts are typically designed to meet the needs of
specific business units rather than the organization as a whole. This can make it
difficult to gain a comprehensive understanding of the organization's data and
business processes.
6. Complexity: The bottom-up approach can be more complex than the top-down
approach, as it involves the integration of multiple data marts with varying levels of
complexity and granularity. This can make it more difficult to manage and maintain
the data warehouse over time.
7. Risk of inconsistency: Because the bottom-up approach allows for the creation of
data marts with different structures and granularities, there is a risk of inconsistency
in the data. This can make it difficult to compare data across different parts of the
organization or to ensure that reports are accurate and reliable.
Schemas
A schema is a logical description of the entire database. It includes the name and description of
records of all record types, including all associated data items and aggregates. Much like a
database, a data warehouse also requires a schema to be maintained. A database uses a relational
model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. In this
section, we will discuss the schemas used in a data warehouse.

Star Schema
• Each dimension in a star schema is represented with only one dimension table.
• This dimension table contains the set of attributes.
• Consider, for example, the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.

• There is a fact table at the center. It contains the keys to each of the four dimensions.
• The fact table also contains the attributes, namely dollars sold and units sold.

Note − Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key, street,
city, province_or_state, country}. This constraint may cause data redundancy. For example,
"Vancouver" and "Victoria" both the cities are in the Canadian province of British
Columbia. The entries for such cities may cause data redundancy along the attributes
province_or_state and country.
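To make the layout concrete, here is a minimal sketch of this star schema in Python with pandas (a hypothetical illustration; the miniature tables, the column values and the use of only two of the four dimensions are assumptions, not part of the notes). The fact table holds only keys and measures, and a query joins it back to the dimension tables.

import pandas as pd

# Dimension tables: one table per dimension, each holding a set of attributes.
location = pd.DataFrame({
    "location_key": [1, 2],
    "city": ["Vancouver", "Victoria"],
    "province_or_state": ["British Columbia", "British Columbia"],
    "country": ["Canada", "Canada"],
})
time_dim = pd.DataFrame({"time_key": [10, 11], "quarter": ["Q1", "Q2"], "year": [2023, 2023]})

# Fact table at the centre: foreign keys to the dimensions plus the measures.
sales_fact = pd.DataFrame({
    "time_key": [10, 10, 11],
    "location_key": [1, 2, 1],
    "dollars_sold": [1200.0, 800.0, 950.0],
    "units_sold": [12, 8, 9],
})

# A typical star-schema query: total dollars sold per city and quarter.
report = (sales_fact
          .merge(location, on="location_key")
          .merge(time_dim, on="time_key")
          .groupby(["city", "quarter"])["dollars_sold"]
          .sum())
print(report)

In a real warehouse the same join pattern is usually expressed in SQL against much larger tables; the point is only that every query fans out from the central fact table through its keys.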

Snowflake Schema
• Some dimension tables in the Snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike the Star schema, the dimension tables in a Snowflake schema are normalized. For
example, the item dimension table of the star schema is normalized and split into two
dimension tables, namely the item and supplier tables.
• Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier_key.
• The supplier key is linked to the supplier dimension table. The supplier dimension
table contains the attributes supplier_key and supplier_type.

Note − Due to normalization in the Snowflake schema, redundancy is reduced;
therefore, it becomes easier to maintain and saves storage space.

Fact Constellation Schema


• A fact constellation has multiple fact tables. It is also known as a galaxy schema.
• Consider, for example, two fact tables, namely sales and shipping.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.
• The shipping fact table also contains two measures, namely dollars cost and units
shipped.
• It is also possible to share dimension tables between fact tables. For example, the time,
item, and location dimension tables are shared between the sales and shipping fact
tables.

Schema Definition

Multidimensional schema is defined using Data Mining Query Language (DMQL). The
two primitives, cube definition and dimension definition, can be used for defining the
data warehouses and data marts.

Syntax for Cube Definition


define cube < cube_name > [ < dimension_list > ]: < measure_list >
Syntax for Dimension Definition
define dimension < dimension_name > as ( < attribute_or_dimension_list > )
Star Schema Definition
The star schema that we have discussed can be defined using Data Mining Query
Language (DMQL) as follows −
define cube sales star [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Snowflake Schema Definition
Snowflake schema can be defined using DMQL as follows −
define cube sales snowflake [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key,
supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state,
country))
Fact Constellation Schema Definition
Fact constellation schema can be defined using DMQL as follows −
define cube sales [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define cube shipping [time, item, shipper, from location, to location]:

dollars cost = sum(cost in dollars), units shipped = count(*)


define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube
sales, shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales

Data Quality
Data quality is a measure of the condition of data based on factors such as accuracy,
completeness, consistency, reliability and whether it's up to date. Measuring data quality
levels can help organizations identify data errors that need to be resolved and assess whether
the data in their IT systems is fit to serve its intended purpose.

The emphasis on data quality in enterprise systems has increased as data processing has
become more intricately linked with business operations and organizations increasingly use
data analytics to help drive business decisions. Data quality management is a core
component of the overall data management process, and data quality improvement
efforts are often closely tied to data governance programs that aim to ensure data is
formatted and used consistently throughout an organization.

Why data quality is important


Bad data can have significant business consequences for companies. Poor-quality data is
often pegged as the source of operational snafus, inaccurate analytics and ill-conceived
business strategies. Examples of the economic damage data quality problems can
cause include added expenses when products are shipped to the wrong customer addresses,
lost sales opportunities because of erroneous or incomplete customer records, and fines for
improper financial or regulatory compliance reporting.

Consulting firm Gartner said in 2021 that bad data quality costs organizations an average of
$12.9 million per year. Another figure that's still often cited is a calculation by IBM that the
annual cost of data quality issues in the U.S. amounted to $3.1 trillion in 2016. And in an
article he wrote for the MIT Sloan Management Review in 2017, data quality consultant
Thomas Redman estimated that correcting data errors and dealing with the business
problems caused by bad data costs companies 15% to 25% of their annual revenue on
average.

In addition, a lack of trust in data on the part of corporate executives and business managers
is commonly cited among the chief impediments to using business intelligence (BI) and
analytics tools to improve decision-making in organizations. All of that makes an effective
data quality management strategy a must.

What is good data quality?


Data accuracy is a key attribute of high-quality data. To avoid transaction processing
problems in operational systems and faulty results in analytics applications, the data that's
used must be correct. Inaccurate data needs to be identified, documented and fixed to ensure
that business executives, data analysts and other end users are working with good
information.

Other aspects, or dimensions, that are important elements of good data quality include the
following:

• completeness, with data sets containing all of the data elements they should;

• consistency, where there are no conflicts between the same data values in different
systems or data sets;

• uniqueness, indicating a lack of duplicate data records in databases and data warehouses;

• timeliness or currency, meaning that data has been updated to keep it current and is
available to use when it's needed;

• validity, confirming that data contains the values it should and is structured properly; and

• conformity to the standard data formats created by an organization.

Meeting all of these factors helps produce data sets that are reliable and trustworthy. A long
list of additional dimensions of data quality can also be applied -- some examples include
appropriateness, credibility, relevance, reliability and usability.
These metrics can
be used to track data quality levels and how quality issues affect business operations.
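As a rough illustration of how a few of these dimensions can be measured in practice, the sketch below computes simple completeness, uniqueness and validity figures with pandas; the customer table, its column names and the e-mail pattern are all assumptions made for the example, not rules taken from the notes.

import re
import pandas as pd

# Hypothetical extract with a missing e-mail, a duplicated key and an invalid value.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
})

# Completeness: share of non-missing values in each column.
completeness = customers.notna().mean()

# Uniqueness: share of rows that duplicate the key column.
duplicate_rate = customers.duplicated(subset=["customer_id"]).mean()

# Validity: share of e-mail values matching an assumed format rule.
pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
valid_email_rate = (customers["email"].dropna()
                    .map(lambda v: bool(pattern.match(v))).mean())

print(completeness, duplicate_rate, valid_email_rate, sep="\n")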

How to determine data quality


As a first step toward determining their data quality levels, organizations typically inventory
their data assets and do baseline studies to measure the relative accuracy, uniqueness and
validity of data sets. The established baseline ratings can then be compared against the data
in systems on an ongoing basis to help identify new data quality issues.

Another common step is to create a set of data quality rules based on business requirements
for both operational and analytics data. Such rules specify required quality levels in data sets
and detail what different data elements need to include so they can be checked for accuracy,
consistency and other data quality attributes. After the rules are in place, a data management
team typically conducts a data quality assessment to measure the quality of data sets and
document data errors and other problems -- a procedure that can be repeated at regular
intervals to maintain the highest data quality levels possible.

Various methodologies for such assessments have been developed. For example, data
managers at UnitedHealth Group's Optum healthcare services subsidiary created the Data
Quality Assessment Framework (DQAF) in 2009 to formalize a method for assessing its data
quality. The DQAF provides guidelines for measuring data quality based on four
dimensions: completeness, timeliness, validity and consistency. Optum has publicized details
about the framework as a possible model for other organizations.

The International Monetary Fund (IMF), which oversees the global monetary system and
lends money to economically troubled nations, has also specified an assessment
methodology with the same name as the Optum one. Its framework focuses on accuracy,
reliability, consistency and other data quality attributes in the statistical data that member
countries must submit to the IMF. In addition, the U.S. government's Office of the National
Coordinator for Health Information Technology has detailed a data quality framework for
patient demographic data collected by healthcare organizations.

Data quality management tools and techniques


Data quality projects typically also involve several other steps. For example, a data quality
management cycle outlined by data management consultant David Loshin begins with
identifying and measuring the effect that bad data has on business operations. Next, data
quality rules are defined, performance targets for improving relevant data quality metrics are
set, and specific data quality improvement processes are designed and put in place.

Those processes include data cleansing, or data scrubbing, to fix data errors, plus work to
enhance data sets by adding missing values, more up-to-date information or additional
records. The results are then monitored and measured against the performance targets, and
any remaining deficiencies in data quality provide a starting point for the next round of
planned improvements. Such a cycle is intended to ensure that efforts to improve overall
data quality continue after individual projects are completed.

To help streamline such efforts, data quality software tools can match records, delete
duplicates, validate new data, establish remediation policies and identify personal data in
data sets; they also do data profiling to collect information about data sets and identify
possible outlier values. Augmented data quality functions are an emerging set of capabilities
that software vendors are building into their tools to automate tasks and procedures,
primarily through the use of artificial intelligence (AI) and machine learning.

Management consoles for data quality initiatives support creation of data handling rules,
discovery of data relationships and automated data transformations that may be part of data
quality maintenance efforts. Collaboration and workflow enablement tools have also become
more common, providing shared views of corporate data repositories to data quality
managers and data stewards, who are charged with overseeing particular data sets.

Data quality tools and improvement processes are often incorporated into data governance
programs, which typically use data quality metrics to help demonstrate their business value
to companies. They're also key components of master data management (MDM) initiatives
that create central registries of master data on customers, products and supply chains, among
other data domains.

Benefits of good data quality


From a financial standpoint, maintaining high data quality levels enables organizations to
reduce the cost of identifying and fixing bad data in their systems. Companies are also able
to avoid operational errors and business process breakdowns that can increase operating
expenses and reduce revenues.

In addition, good data quality increases the accuracy of analytics applications, which can
lead to better business decision-making that boosts sales, improves internal processes and
gives organizations a competitive edge over rivals. High-quality data can help expand the
use of BI dashboards and analytics tools, as well -- if analytics data is seen as trustworthy,
business users are more likely to rely on it instead of basing decisions on gut feelings or their
own spreadsheets.

Effective data quality management also frees up data management teams to focus on more
productive tasks than cleaning up data sets. For example, they can spend more time helping
business users and data analysts take advantage of the available data in systems and
promoting data quality best practices in business operations to minimize data errors.

Emerging data quality challenges


For many years, the burden of data quality efforts centered on structured data stored
in relational databases since they were the dominant technology for managing data. But the
nature of data quality problems expanded as big data systems and cloud computing became
more prominent. Increasingly, data managers also need to focus on the quality of
unstructured and semistructured data, such as text, internet clickstream records, sensor data
and network, system and application logs. In addition, data quality now often needs to be
managed in a combination of on-premises and cloud systems.

The growing use of AI tools and machine learning applications in organizations further
complicates the data quality process, as does the adoption of real-time data streaming
platforms that funnel large volumes of data into corporate systems on a continuous basis.
Complex data pipelines created to support data science and advanced analytics work add to
the challenges, too.

Data quality demands are also expanding due to the implementation of new data privacy and
protection laws, most notably the European Union's General Data Protection Regulation
(GDPR) and the California Consumer Privacy Act (CCPA). Both measures give people the
right to access the personal data that companies collect about them, which means
organizations must be able to find all of the records on an individual in their systems without
missing any because of inaccurate or inconsistent data.

Fixing data quality issues


Data quality managers, analysts and engineers are primarily responsible for fixing data errors
and other data quality problems in organizations. They're collectively tasked with finding
and cleansing bad data in databases and other data repositories, often with assistance and
support from other data management professionals, particularly data stewards and data
governance program managers.

However, it's also a common practice to involve business users, data scientists and other
analysts in the data quality process to help reduce the number of data quality issues created
in systems. Business participation can be achieved partly through data governance programs
and interactions with data stewards, who frequently come from business units. In addition,
though, many companies run training programs on data quality best practices for end users.
A common mantra among data managers is that everyone in an organization is responsible
for data quality.

Data quality vs. data integrity


Data quality and data integrity are sometimes referred to interchangeably; alternatively,
some people treat data integrity as a facet of data accuracy or a separate dimension of data
quality. More generally, though, data integrity is seen as a broader concept that combines
data quality, data governance and data protection mechanisms to address data accuracy,
consistency and security as a whole.

In that broader view, data integrity focuses on integrity from both logical and physical
standpoints. Logical integrity includes data quality measures and database attributes such as
referential integrity, which ensures that related data elements in different database tables are
valid. Physical integrity involves access controls and other security measures designed to
prevent data from being modified or corrupted by unauthorized users, as well as backup and
disaster recovery protections.

What is data profiling?

Data profiling is the process of reviewing source data, understanding its structure, content and
interrelationships, and identifying its potential for data projects.

Data profiling is a crucial part of:

Data warehouse and business intelligence (DW/BI) projects—data profiling can uncover
data quality issues in data sources, and what needs to be corrected in ETL.

Data conversion and migration projects—data profiling can identify data quality issues,
which you can handle in scripts and data integration tools copying data from source to target.
It can also uncover new requirements for the target system.
Source system data quality projects—data profiling can highlight data which suffers from
serious or numerous quality issues, and the source of the issues (e.g. user inputs, errors in
interfaces, data corruption).

Data profiling involves:

Collecting descriptive statistics like min, max, count and sum.


Collecting data types, length and recurring patterns.
Tagging data with keywords, descriptions or categories.
Performing data quality assessment and assessing the risk of performing joins on the data.
Discovering metadata and assessing its accuracy.
Identifying distributions, key candidates, foreign-key candidates, functional dependencies,
embedded value dependencies, and performing inter-table analysis.

Types of data profiling

There are three main types of data profiling:

Structure discovery

Validating that data is consistent and formatted correctly, and performing mathematical
checks on the data (e.g. sum, minimum or maximum). Structure discovery helps understand
how well data is structured—for example, what percentage of phone numbers do not have
the correct number of digits.
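A minimal sketch of such a structure-discovery check, assuming a phone-number column and an expected length of ten digits (both assumptions made for illustration), could look like this:

import pandas as pd

# Hypothetical phone numbers pulled from a source extract.
phones = pd.Series(["4165550100", "555-0123", "41655501", None], name="phone")

# Strip everything that is not a digit, then flag values that are not ten digits long.
digits = phones.dropna().str.replace(r"\D", "", regex=True)
bad_share = (digits.str.len() != 10).mean() * 100
print(f"{bad_share:.1f}% of phone numbers do not have 10 digits")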

Content discovery

Looking into individual data records to discover errors. Content discovery identifies which
specific rows in a table contain problems, and which systemic issues occur in the data (for
example, phone numbers with no area code).

Relationship discovery
Discovering how parts of the data are interrelated. For example, key relationships between
database tables, references between cells or tables in a spreadsheet. Understanding
relationships is crucial to reusing data; related data sources should be united into one or
imported in a way that preserves important relationships.

Data profiling steps—an efficient process for data profiling

Ralph Kimball, a father of data warehouse architecture, suggests a four-step process for data
profiling:

Use data profiling at project start to discover if data is suitable for analysis—and make a "go
/ no go" decision on the project.

Identify and correct data quality issues in source data, even before starting to move it into the
target database.

Identify data quality issues that can be corrected by Extract-Transform-Load (ETL), while
data is moved from source to target. Data profiling can uncover if additional manual
processing is needed.

Identify unanticipated business rules, hierarchical structures and foreign key / primary key
relationships, and use them to fine-tune the ETL process.

Data profiling and data quality analysis best practices

Basic data profiling techniques:

Distinct count and percent—identifies natural keys, distinct values in each column that can
help process inserts and updates. Handy for tables without headers.

Percent of zero / blank / null values—identifies missing or unknown data. Helps ETL
architects set up appropriate default values.

Minimum / maximum / average string length—helps select appropriate data types and
sizes in target database. Enables setting column widths just wide enough for the data, to
improve performance.
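The three basic techniques above can be sketched in a few lines of pandas; the snippet below is a hypothetical illustration (the input file name and the presence of text columns are assumptions), not a prescribed tool.

import pandas as pd

df = pd.read_csv("source_extract.csv")  # assumed source file

profile = pd.DataFrame({
    # Distinct count and percent: candidate natural keys approach 100% distinct.
    "distinct_count": df.nunique(),
    "distinct_pct": df.nunique() / len(df) * 100,
    # Percent of null values: highlights missing or unknown data.
    "null_pct": df.isna().mean() * 100,
})

# Minimum / maximum / average string length: helps size columns in the target database.
text_cols = df.select_dtypes(include="object")
lengths = text_cols.apply(lambda col: col.dropna().str.len())
profile.loc[text_cols.columns, "min_len"] = lengths.min()
profile.loc[text_cols.columns, "max_len"] = lengths.max()
profile.loc[text_cols.columns, "avg_len"] = lengths.mean()

print(profile)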
Advanced data profiling techniques:

Key integrity—ensures keys are always present in the data, using zero/blank/null analysis.
Also, helps identify orphan keys, which are problematic for ETL and future analysis.

Cardinality—checks relationships like one-to-one, one-to-many, many-to-many, between
related data sets. This helps BI tools perform inner or outer joins correctly.

Pattern and frequency distributions—checks if data fields are formatted correctly, for
example if emails are in a valid format. Extremely important for data fields used for
outbound communications (emails, phone numbers, addresses).
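A sketch of the key-integrity and cardinality checks described above might look as follows; the orders and customers tables and their column names are invented for the example.

import pandas as pd

# Hypothetical fact extract and the dimension table it should reference.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_key": [10, 11, 99]})
customers = pd.DataFrame({"customer_key": [10, 11, 12]})

# Key integrity: fact rows whose customer_key has no matching dimension row (orphan keys).
orphans = orders[~orders["customer_key"].isin(customers["customer_key"])]
print("orphan keys:", orphans["customer_key"].tolist())

# Cardinality: how many orders reference each customer; any count > 1 means one-to-many.
counts = orders.groupby("customer_key").size()
relationship = "one-to-many" if (counts > 1).any() else "one-to-one"
print("relationship between customers and orders:", relationship)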

6 data profiling tools—open source and commercial

Data profiling, a tedious and labor-intensive activity, can be automated with tools to make
large data projects more feasible. These tools are an essential part of a data analytics stack.

Open source data profiling tools

1. Quadient DataCleaner—key features include:


Data quality, data profiling and data wrangling
Detect and merge duplicates
Boolean analysis
Completeness analysis
Character set distribution
Date gap analysis
Reference data matching
2. Aggregate Profiler (Open Source Data Quality and Profiling)—key features include:
Data profiling, filtering, and governance
Similarity checks
Data enrichment
Real time alerting for data issues or changes
Basket analysis with bubble chart validation
Single customer view
Dummy data creation
Metadata discovery
Anomaly discovery and data cleansing tool
Hadoop integration
3. Talend Open Studio—a suite of open source tools, data quality features include:
Customizable data assessment
A pattern library
Analytics with graphical charts
Fraud pattern detection
Column set analysis
Advanced matching
Time column correlation

ETL Architecture and what is ETL

Extract, transform, and load (ETL) is the process of combining data from multiple
sources into a large, central repository called a data warehouse. ETL uses a set of
business rules to clean and organize raw data and prepare it for storage, data
analytics, and machine learning (ML).
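A minimal end-to-end sketch of this pattern, assuming a hypothetical sales.csv extract (with order_id, order_date and amount columns) and a local SQLite file standing in for the warehouse, could look like this; it illustrates the extract-transform-load flow rather than any particular tool.

import sqlite3
import pandas as pd

# Extract: pull raw data out of the source system (here, a flat-file export).
raw = pd.read_csv("sales.csv")

# Transform: apply business rules to clean and standardise the raw data.
clean = (raw
         .drop_duplicates()
         .dropna(subset=["order_id"])
         .assign(order_date=lambda d: pd.to_datetime(d["order_date"]),
                 amount=lambda d: d["amount"].astype(float)))

# Load: write the conformed data into the central repository.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("fact_sales", conn, if_exists="append", index=False)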

ETL Architecture
ETL stands for Extract, Transform, and Load. In today's data warehousing world, this term is extended
to E-MPAC-TL, or Extract, Monitor, Profile, Analyze, Cleanse, Transform, and Load. In other words,
the extended process also focuses on data quality and metadata.
Extraction
The main goal of extraction is to collect the data from the source system as quickly as possible and
with as little inconvenience to those source systems as possible. The most applicable extraction method should be
chosen for each situation: source date/time stamps, database log tables, or a hybrid approach.

Transform and Loading


Transforming and loading the data is all about integrating the data and finally moving the combined data to
the presentation area, which can be accessed by the front-end tools of the end-user community. Here, the
emphasis should be on the functionality offered by the ETL tool and on using it most effectively. It is not
enough to simply use an ETL tool. In a medium to large scale data warehouse environment, it is important to
standardize the data as much as possible instead of going for customization. ETL should reduce the
throughput time of the different source-to-target development activities, which form the bulk of the
traditional ETL effort.

Monitoring
Monitoring of the data enables verification of the data that is moved throughout the entire ETL
process, and it has two main objectives. Firstly, the data should be screened. A proper balance should be
struck between screening the incoming data as much as possible and not slowing down the entire ETL
process when too much checking is done. Here, the inside-out approach used in Ralph Kimball's
screening technique could be applied. This technique consistently captures all errors based on a
pre-defined set of metadata business rules and enables reporting on them through a simple star
schema, which gives a view of the data quality evolution over time. Secondly, the focus should be
on ETL performance. This metadata information can be plugged into all dimension and fact
tables and can be called an audit dimension.

Quality Assurance
Quality Assurance consists of checks between the different stages that can be defined depending on the need;
these checks verify the completeness of the values: do we still have the same number of
records, or the same totals of specific measures, between different ETL stages? This information should be captured
as metadata. Finally, data lineage should be maintained throughout the entire ETL process, including the
error records produced.
Data Profiling
Data profiling is used to generate statistics about the sources; its objective is to understand the
sources. Data profiling uses analytical techniques to discover the actual content, structure, and
quality of the data by analyzing and validating data patterns and formats and by identifying and
validating redundant data across the data sources. It is essential to use the correct tool to
automate this process, given the huge amount and variety of data involved.

Data Analysis
Data Analysis is used to analyze the results of the profiled data. Through this analysis it is easier to
identify data quality problems such as missing data, inconsistent data, invalid data, constraint problems,
and parent-child issues such as orphans and duplicates. It is essential to capture the results of this assessment
correctly. Data analysis becomes the communication medium between the source and the data
warehouse team for tackling the outstanding issues. The source-to-target mapping highly depends on the
quality of the source analysis.

Source Analysis
In the source analysis, the focus should be not only on the sources but also on their surroundings, in order to obtain
the source documentation. It is important to understand the future of the source applications, the current data
issues at the origin, and the corresponding data models/metadata repositories, and to receive a walkthrough of the
source model and business rules from the source owners. It is crucial to set up frequent meetings with the owners of
the sources to detect changes which might impact the data warehouse and the associated ETL process.

Cleansing
In this step, the errors that were found are fixed, based on a pre-defined set of metadata
rules. Here, a distinction needs to be made between completely and partly rejected records, and the issues are
handled either through manual correction or by fixing the data automatically, for example by correcting
inaccurate data fields, adjusting the data format, etc.

What are Staging Areas, Data Marts and Cubes?


A staging area, or landing zone, is an intermediate storage area used for data processing
during the extract, transform and load (ETL) process. The data staging area sits between
the data source(s) and the data target(s), which are often data warehouses, data marts, or
other data repositories.

The Data Staging Area is a temporary storage area for data copied from Source Systems.
In a Data Warehousing Architecture, a Data Staging Area is mostly necessary for time
considerations. In other words, before data can be incorporated into the Data
Warehouse, all essential data must be readily available.

It is not possible to retrieve all data from all Operational databases at the same time
because of varying Business Cycles, Data Processing Cycles, Hardware, and Network
Resource Restrictions, and Geographical Variables.

Here's all you need to know about the Data Staging Area, as well as some key pointers to
keep in mind before you start the process.

What is Data Staging?

During the Extract, Transform, and Load (ETL) process, a Staging Area, also known as
a landing zone, is an interim storage region used for Data Processing. The Data Staging
Area is located in between the Data Source(s) and the Data Target(s), which are typically
Data Warehouses, Data Marts, or other Data Repositories.

Data Staging spaces are frequently ephemeral in nature, with their contents being wiped
before performing an ETL process or shortly after it has been completed successfully.
However, there are architectures for staging areas that are designed to hold data for long
periods of time for preservation or debugging purposes.


Why do you need to Stage Data?

There is no designated space available for testing data transformations in a direct data
integration strategy, where the data is extracted, transformed, and then loaded to the new
storage. Before being loaded to the target system, data from the source can be replicated,
reformatted, and tested in a data warehouse staging area.

Most firms today have several Data Sources to derive information. Before being loaded
into the new system, the extracted data must be polished and cleansed, as well as have the
right format and structure. A Staging space is useful in this situation. Data is altered,
replicated as needed, linked and aggregated if necessary, and then cleansed in this
intermediate layer.

What is a Data Staging Area?

The Data Staging Area is made up of the Data Staging Server software and the data
store archive (repository) of the outcomes of the extraction, transformation, and loading
activities in the data warehousing process.
The archival repository stores cleansed, converted data and attributes for loading into
Data Marts and Data Warehouses, while the Data Staging software server saves and alters
data taken from OLTP data sources.

A Data Staging Area is a design concept for a Data Pipeline. It is a location where
raw/unprocessed data is stored before being modified for downstream usage. Database
tables, files in a Cloud Storage System, and other staging regions are examples.

Example:

It's reasonable to extract sales data on a daily basis, but daily extracts aren't appropriate
for financial data that needs to be reconciled at the end of the month. Similarly, extracting
"customer" data from a database in Singapore at noon eastern standard time may be
appropriate, but it is not appropriate for "customer" data in a Chicago database.

Data in the Data Warehouse can be permanent (i.e., it lasts for a long time) or transitory
(i.e., it only remains around temporarily). A data
warehouse staging area is not required for all enterprises. For many firms, using ETL to
replicate data straight from operational databases into the Data Warehouse is a viable
option.

External Staging

The area where data staging takes place outside a data warehouse is commonly referred
to as External Staging. This area is often hosted by cloud storage providers such
as Google Cloud Storage (GCS) or Amazon Web Services Simple Storage Service (AWS S3).

Internal Staging

Unified cloud data warehouses in modern approaches often use an internal staging
process that involves creating raw tables separate from the rest of the warehouse. These
raw tables then undergo a transformation, cleaning, and normalization process in an
'ELT staging area'. A final layer is then used to present only the cleaned and prepared
data to BI tooling and business users, allowing data teams to curate a single source of
truth, reduce complexity, and mitigate data sprawl.
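In code, this internal (ELT-style) staging pattern amounts to landing the extract untouched in a raw staging table and only then deriving a cleaned table from it; the sketch below uses SQLite and invented table and column names purely to show the layering.

import sqlite3
import pandas as pd

raw = pd.read_csv("crm_export.csv")  # assumed raw extract from a source system

with sqlite3.connect("warehouse.db") as conn:
    # 1. Land the data unmodified in a raw staging table.
    raw.to_sql("stg_customers_raw", conn, if_exists="replace", index=False)

    # 2. Transform inside the warehouse: deduplicate and normalise in the ELT staging layer.
    staged = pd.read_sql("SELECT * FROM stg_customers_raw", conn)
    cleaned = (staged.drop_duplicates(subset=["customer_id"])
                     .assign(email=lambda d: d["email"].str.strip().str.lower()))

    # 3. Expose only the cleaned, prepared table to BI tools and business users.
    cleaned.to_sql("dim_customers", conn, if_exists="replace", index=False)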

What is the Role of a Data Staging Area in Warehouse ETL?

• A Data Staging Area has a single purpose: to hold raw data from Source Systems
and to provide a space for transitory tables created during the transformation
process.
• The function of the Data Staging Area varies depending on the design
methodology and ETL toolset, but the target audience is always the ETL
process and the Data Engineers who are responsible for designing and maintaining
the ETL.
• This does not diminish the need for data governance in a Data Lake, but it does
make it much easier to manage when contrasted to a large number of individuals
and processes that may access it.


What are the Advantages of a Staging Area?

The efficiency of employing a Data Staging Area is heavily dependent on the business
requirements and the ETL system's operating environment. Here are a few reasons why
a staging area is worth having in a data integration setup.

• Recoverability: In the event that operations are corrupted, data should be
retrievable. As a result, stage data after it has been collected from the source and
after any substantial transformations have been made to it. If data
corruption occurs during the later phases, the staging steps will function as
recovery points.
• Backup: Backups let you store, compress, and archive data all the way down to the
database level. When large amounts of data are sent, one issue that usually arises is
data backup on such a large scale. Data can be sent in pieces that can be readily
preserved and archived using staging areas.
• Auditing: By comparing the original input files (with transformation rules) and the
output data files, staged data can make the auditing process easier. The data
connectivity between the source and the target is lost as the ETL process becomes
more complicated, which can lengthen and complicate the auditing process.
Staging methods enable a smooth auditing procedure while keeping the data
lineage intact.

What are the Disadvantages of a Data Staging Area?

• If the transformation process slows down, the staging procedure will come to a halt
as well.
• There might be some variation in the use of disc space because the data must be
dumped into a local area.

Staging & Data Warehouse Architecture

In a typical data warehouse architecture, end users have immediate access, through the data
warehouse, to data collected from a variety of Source Systems.
Alongside metadata and raw data from a standard OLTP system, the warehouse also holds
a new type of data called Summary Data. Summaries are a way to pre-compute frequent,
time-consuming operations so that data can be retrieved in a fraction of a second. A
common data warehouse query, for example, might be to get August sales. In an Oracle
database, a materialized view is one form of summary.

An Enterprise Data Warehouse (EDW) is a centralized repository of raw data that serves as the hub
of your data warehousing system. By storing all essential business information in the
most complete format, an EDW delivers a 360-degree insight into an organization's
business.

Data Warehouse with a Staging Area

You must clean and prepare your operational data before putting it into the
warehouse. This can be done programmatically, although most data
warehouses employ a staging area instead.
A staging area makes data cleansing and consolidation of operational data
from numerous Source Systems easier, especially for corporate data
warehouses that consolidate all of an organization's important data.

Data Warehouse with Data Marts & Staging Areas

Although the architecture described above is quite standard, you may want to alter
the architecture of your warehouse for different groups within your company. You can do
this by incorporating Data Marts, which are systems tailored to a particular line of business. For
example, purchasing, sales, and inventory can each be given their own data mart.
In this case, a financial analyst could mine past data to develop forecasts about
client behavior or examine historical data for purchases and sales.

Essential Steps in Data Staging Areas

The data staging area serves as the gateway to successful data integration and
analytics, preparing data for subsequent stages by ensuring it's cleansed,
transformed, and ready for analysis.

Let’s delve into the sequential steps undertaken within the data staging
area:
1. Data Extraction

This initial step involves extracting data from various source systems, including
databases, CRM systems, and ERP solutions. The staging area serves as the primary
landing zone for this extracted data, allowing it to be collated in one centralized location
for further processing.

2. Data Profiling

Before proceeding, it's crucial to understand the nature and quality of incoming data by
assessing completeness, consistency, and anomalies. Data profiling provides insights into
potential issues that need addressing in subsequent steps.

3. Data Cleansing

Raw data often contains errors, duplicates, or inconsistencies that need to be identified
and rectified. Cleansing operations in the staging area ensure that data is of high quality
before it proceeds further.

4. Data Transformation

Since data sources can have varying structures, formats, and standards, transformation
becomes vital to align data with the target system's schema. Transformation processes
ensure data compatibility with the target system.

5. Data Validation

Once data is cleansed and transformed, it's imperative to validate it against specific
business rules or criteria to ensure accuracy and relevancy. The staging area enforces
validation checks to further ensure data quality before proceeding.

6. Data Integration

The staging area acts as the common ground where data from multiple sources is combined,
ensuring cohesiveness in the resultant dataset.

7. Temporal Storage

Sometimes, there's a need to hold processed data temporarily before loading it into the
target system. The staging area provides buffering capacity, ensuring data readiness for
subsequent stages.
8. Data Loading

In this final step, the prepared data is loaded into the target system, such as a data
warehouse or data mart. The staging area ensures a smooth process, as data is already in a
compatible format, cleaned, and validated.

In summary, the data staging area serves as a meticulously organized processing hub,
guiding data through essential steps to ensure its accuracy, reliability, and readiness for
insightful analysis.

What is a data mart?


A data mart is a smaller subsection of a data warehouse built specifically for a particular
subject area, business function, or group of users. The main idea is to provide a specific
part of an organization with data that is the most relevant for their analytical needs. For
example, the sales or finance teams can use a data mart containing sales information only
to make quarterly or yearly reports and projections. Since data marts provide analytical
capabilities for a restricted area of a data warehouse, they offer isolated security and
isolated performance.

Data mart vs data warehouse vs data lake vs OLAP cube

Data lakes, data warehouses, and data marts are all data repositories of different sizes.
Apart from the size, there are other significant characteristics to highlight.

A data mart is a subject-oriented relational database commonly containing a subset of
DW data that is specific to a particular business department of an enterprise, e.g., a
marketing department. Data marts get information from relatively few sources and are
small in size — less than 100 GB. They typically contain structured data and take less
time for setup — normally 3 to 6 months for on-premise solutions.

A data lake is a central repository used to store massive amounts of both structured and
unstructured data coming from a great variety of sources. Data lakes accept raw data,
eliminating the need for prior cleansing and processing. As far as the size, they can be
home to many files, where even one file can be larger than 100 GB. Depending on the
goal, it may take weeks or months to set up a data lake. Moreover, not all organizations
use data lakes.
Types of data marts

Based on how data marts are related to the data warehouse as well as external and
internal data sources, they can be categorized as dependent, independent, and hybrid.
Let's elaborate on each one.

Dependent data marts are the subdivisions of a larger data warehouse that serves as a
centralized data source. This is what is known as the top-down approach — you first
create a data warehouse and then design data marts on top of it. Within this sort of
relationship, data marts do not interact with data sources directly. Based on the subjects,
different sets of data are clustered inside a data warehouse, restructured, and loaded into
respective data marts from where they can be queried.

Dependent data marts are well suited for larger companies that need better control over
the systems, improved performance, and lower telecommunication costs.
Independent data marts act as standalone systems, meaning they can work without a
data warehouse. They receive data from external and internal data sources directly. The
data presented in independent data marts can be then used for the creation of a data
warehouse. This approach is called bottom-up.

Often, the motivation behind choosing independent data marts is shorter time to market.
They work great for small to medium-sized companies.

So, the key difference between dependent and independent data marts is in the way they
get data from sources. The step involving data transfer, filtering, and loading into either a
data warehouse or data mart is called the extract-transform-load (ETL) process. When
dealing with dependent data marts, the central data warehouse already keeps data
formatted and cleansed, so ETL tools will do little work. On the other hand, independent
data marts require the complete ETL process for data to be injected.
Data mart implementation steps
The process of creating data marts may be complicated and differ depending on the needs of
a particular company. In most cases, there are five core steps: designing the data mart,
constructing it, transferring data, configuring access to the repository, and finally managing it.
We'll walk you through each step in more detail.

Data mart designing


The first thing you do when implementing a data mart is deciding on the scope of the project
and its design. Since data marts are subject-oriented databases, this step involves
determining a subject or a topic to which data stored in a mart will be related. In addition to
collecting information about technical specifications, you need to decide on business
requirements during this phase too. It is also necessary to identify the data sources related to
the subject and design the logical and physical structure of the data mart.
Data mart constructing
Once the scope of work is established, here comes the second step that involves constructing
the logical and physical structures of the data mart architecture designed during the first
phase.

• Logical structure refers to the scenario where data exists in the form of virtual tables or
views separated from the warehouse logically, not physically. Virtual data marts may be a
good option when resources are limited.
• Physical structure refers to the scenario where a database is physically separated from the
warehouse. The database may be cloud-based or on-premises.
Also, this step requires the creation of the schema objects (e.g., tables, indexes) and setting
up data access structures.

It is essential to perform a detailed requirement collection before implementing any scenario,
since different organizations may need different types of data marts.
