BI UNIT 3 NOTES PDF
Data warehouse
A data warehouse is an enterprise system used for the analysis and reporting of
structured and semi-structured data from multiple sources, such as point-of-sale
transactions, marketing automation, customer relationship management, and more.
1. Top-down approach:
2. Staging Area –
Since the data extracted from external sources does not follow a particular format, it
needs to be validated before it can be loaded into the data warehouse. An ETL tool is
recommended for this purpose.
3. Data Warehouse –
After cleansing, the data is stored in the data warehouse as the central repository. It
actually stores the metadata, while the actual data is stored in the data marts. Note that
in this top-down approach the data warehouse stores the data in its purest form.
4. Data Marts –
A data mart is also part of the storage component. It stores the information of a
particular business function of the organisation that is handled by a single authority.
There can be as many data marts as the organisation has functions. In other words, a
data mart contains a subset of the data stored in the data warehouse.
5. Data Mining –
Data mining is the practice of analysing the large volumes of data present in the data
warehouse. It is used to find the hidden patterns present in the database or the data
warehouse with the help of data mining algorithms.
2. Also, this model is considered the strongest in the face of business changes. That is
why big organisations prefer to follow this approach.
2. Bottom-up approach:
1. First, the data is extracted from external sources (just as in the top-down approach).
2. Then the data passes through the staging area (as explained above) and is loaded into
data marts instead of the data warehouse. The data marts are created first and provide
reporting capability; each addresses a single business area.
This approach was described by Kimball as follows: data marts are created first and provide a
thin view for analysis, and the data warehouse is created after the complete set of data
marts has been built.
2. More data marts can be accommodated here, and in this way the data warehouse can
be extended.
3. Also, the cost and time taken to design this model are comparatively low.
4. Incremental development: The bottom-up approach supports incremental
development, allowing for the creation of data marts one at a time. This allows for
quick wins and incremental improvements in data reporting and analysis.
5. User involvement: The bottom-up approach encourages user involvement in the
design and implementation process. Business users can provide feedback on the data
marts and reports, helping to ensure that the data marts meet their specific needs.
6. Flexibility: The bottom-up approach is more flexible than the top-down approach, as
it allows for the creation of data marts based on specific business needs. This
approach can be particularly useful for organizations that require a high degree of
flexibility in their reporting and analysis.
7. Faster time to value: The bottom-up approach can deliver faster time to value, as the
data marts can be created more quickly than a centralized data warehouse. This can
be particularly useful for smaller organizations with limited resources.
8. Reduced risk: The bottom-up approach reduces the risk of failure, as data marts can
be tested and refined before being incorporated into a larger data warehouse. This
approach can also help to identify and address potential data quality issues early in
the process.
9. Scalability: The bottom-up approach can be scaled up over time, as new data marts
can be added as needed. This approach can be particularly useful for organizations
that are growing rapidly or undergoing significant change.
10. Data ownership: The bottom-up approach can help to clarify data ownership and
control, as each data mart is typically owned and managed by a specific business
unit. This can help to ensure that data is accurate and up-to-date, and that it is being
used in a consistent and appropriate way across the organization.
Star Schema
Each dimension in a star schema is represented with only one-dimension table.
This dimension table contains the set of attributes.
As an example, consider the sales data of a company with respect to four dimensions,
namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key, street,
city, province_or_state, country}. This constraint may cause data redundancy. For example,
the cities "Vancouver" and "Victoria" are both in the Canadian province of British
Columbia. The entries for such cities cause data redundancy along the attributes
province_or_state and country.
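To make the structure concrete, here is a minimal sketch (not part of the original notes) showing how such a star schema could be created with Python and SQLite; the table and column names follow the example above, but the database name and data types are assumptions.

import sqlite3

conn = sqlite3.connect("sales_dw.db")
cur = conn.cursor()

# One denormalized table per dimension (item and branch omitted for brevity).
cur.execute("""CREATE TABLE IF NOT EXISTS dim_time (
    time_key INTEGER PRIMARY KEY,
    day INTEGER, day_of_week TEXT, month INTEGER, quarter TEXT, year INTEGER)""")
cur.execute("""CREATE TABLE IF NOT EXISTS dim_location (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, province_or_state TEXT, country TEXT)""")

# Central fact table: one foreign key per dimension plus the measures.
cur.execute("""CREATE TABLE IF NOT EXISTS fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER,
    branch_key INTEGER,
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold INTEGER)""")
conn.commit()

Note that province_or_state and country are repeated in every dim_location row, which is exactly the redundancy mentioned above for Vancouver and Victoria.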
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the star schema, the dimension tables in a snowflake schema are normalized. For
example, the item dimension table of the star schema is normalized and split into two
dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension
table contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the Snowflake schema, the redundancy is reduced and
therefore it becomes easier to maintain and saves storage space.
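As a follow-on sketch (again illustrative, with assumed names), the snowflake version normalizes the item dimension by moving the supplier attributes into their own table:

import sqlite3

conn = sqlite3.connect("sales_dw.db")
cur = conn.cursor()

# Supplier attributes are split out of the item dimension.
cur.execute("""CREATE TABLE IF NOT EXISTS dim_supplier (
    supplier_key INTEGER PRIMARY KEY,
    supplier_type TEXT)""")

# The item dimension now keeps only a foreign key to the supplier table.
cur.execute("""CREATE TABLE IF NOT EXISTS dim_item (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT, type TEXT, brand TEXT,
    supplier_key INTEGER REFERENCES dim_supplier(supplier_key))""")
conn.commit()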
Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The
two primitives, cube definition and dimension definition, can be used for defining the
data warehouses and data marts.
Star Schema Definition
The star schema discussed above can be defined using DMQL as follows −
define cube sales star [time, item, branch, location]:
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Snowflake Schema Definition
Snowflake schema can be defined using DMQL as follows −
define cube sales snowflake [time, item, branch, location]:
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key,
supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state,
country))
Fact Constellation Schema Definition
Fact constellation schema can be defined using DMQL as follows −
define cube sales [time, item, branch, location]:
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define cube shipping [time, item, shipper, from location, to location]:
Data Quality
Data quality is a measure of the condition of data based on factors such as accuracy,
completeness, consistency, reliability and whether it's up to date. Measuring data quality
levels can help organizations identify data errors that need to be resolved and assess whether
the data in their IT systems is fit to serve its intended purpose.
The emphasis on data quality in enterprise systems has increased as data processing has
become more intricately linked with business operations and organizations increasingly use
data analytics to help drive business decisions. Data quality management is a core
component of the overall data management process, and data quality improvement
efforts are often closely tied to data governance programs that aim to ensure data is
formatted and used consistently throughout an organization.
Consulting firm Gartner said in 2021 that bad data quality costs organizations an average of
$12.9 million per year. Another figure that's still often cited is a calculation by IBM that the
annual cost of data quality issues in the U.S. amounted to $3.1 trillion in 2016. And in an
article he wrote for the MIT Sloan Management Review in 2017, data quality consultant
Thomas Redman estimated that correcting data errors and dealing with the business
problems caused by bad data costs companies 15% to 25% of their annual revenue on
average.
In addition, a lack of trust in data on the part of corporate executives and business managers
is commonly cited among the chief impediments to using business intelligence (BI) and
analytics tools to improve decision-making in organizations. All of that makes an effective
data quality management strategy a must.
Other aspects, or dimensions, that are important elements of good data quality include the
following:
completeness, with data sets containing all of the data elements they should;
consistency, where there are no conflicts between the same data values in different
systems or data sets;
uniqueness, indicating a lack of duplicate data records in databases and data warehouses;
timeliness or currency, meaning that data has been updated to keep it current and is
available to use when it's needed;
validity, confirming that data contains the values it should and is structured properly.
Meeting all of these factors helps produce data sets that are reliable and trustworthy. A long
list of additional dimensions of data quality can also be applied -- some examples include
appropriateness, credibility, relevance, reliability and usability.
Data quality metrics based on these dimensions can be used to track data quality levels and
how quality issues affect business operations.
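As a hypothetical illustration of how such dimensions can be turned into measurable metrics, the following Python sketch computes completeness, uniqueness, validity and timeliness for a small customer table; the column names, sample values and thresholds are all assumptions.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
    "last_updated": pd.to_datetime(["2024-01-05", "2024-06-01", "2024-06-01", "2021-02-10"]),
})

# Completeness: share of non-null values per column.
completeness = customers.notna().mean()

# Uniqueness: share of rows whose key is not a duplicate.
uniqueness = 1 - customers["customer_id"].duplicated().mean()

# Validity: share of emails matching a simple pattern.
validity = customers["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean()

# Timeliness: share of records updated within the last two years.
timeliness = (customers["last_updated"] > pd.Timestamp.now() - pd.DateOffset(years=2)).mean()

print(completeness, uniqueness, validity, timeliness, sep="\n")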
Another common step is to create a set of data quality rules based on business requirements
for both operational and analytics data. Such rules specify required quality levels in data sets
and detail what different data elements need to include so they can be checked for accuracy,
consistency and other data quality attributes. After the rules are in place, a data management
team typically conducts a data quality assessment to measure the quality of data sets and
document data errors and other problems -- a procedure that can be repeated at regular
intervals to maintain the highest data quality levels possible.
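A small, hedged sketch of what such business-driven quality rules and a repeatable assessment might look like in code; the rule names, fields and sample records are invented for illustration.

# Each rule pairs a name with a predicate applied to one record (a dict).
rules = {
    "order_id is present": lambda r: r.get("order_id") is not None,
    "amount is positive": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] > 0,
    "country is a 2-letter code": lambda r: isinstance(r.get("country"), str) and len(r["country"]) == 2,
}

records = [
    {"order_id": 101, "amount": 25.0, "country": "CA"},
    {"order_id": None, "amount": -3.0, "country": "Canada"},
]

# The assessment counts failures per rule and can be re-run at regular intervals.
report = {name: sum(not check(r) for r in records) for name, check in rules.items()}
print(report)  # every rule fails once for the second record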
Various methodologies for such assessments have been developed. For example, data
managers at UnitedHealth Group's Optum healthcare services subsidiary created the Data
Quality Assessment Framework (DQAF) in 2009 to formalize a method for assessing its data
quality. The DQAF provides guidelines for measuring data quality based on four
dimensions: completeness, timeliness, validity and consistency. Optum has publicized details
about the framework as a possible model for other organizations.
The International Monetary Fund (IMF), which oversees the global monetary system and
lends money to economically troubled nations, has also specified an assessment
methodology with the same name as the Optum one. Its framework focuses on accuracy,
reliability, consistency and other data quality attributes in the statistical data that member
countries must submit to the IMF. In addition, the U.S. government's Office of the National
Coordinator for Health Information Technology has detailed a data quality framework for
patient demographic data collected by healthcare organizations.
Data quality improvement processes typically include data cleansing, or data scrubbing, to
fix data errors, plus work to
enhance data sets by adding missing values, more up-to-date information or additional
records. The results are then monitored and measured against the performance targets, and
any remaining deficiencies in data quality provide a starting point for the next round of
planned improvements. Such a cycle is intended to ensure that efforts to improve overall
data quality continue after individual projects are completed.
To help streamline such efforts, data quality software tools can match records, delete
duplicates, validate new data, establish remediation policies and identify personal data in
data sets; they also do data profiling to collect information about data sets and identify
possible outlier values. Augmented data quality functions are an emerging set of capabilities
that software vendors are building into their tools to automate tasks and procedures,
primarily through the use of artificial intelligence (AI) and machine learning.
Management consoles for data quality initiatives support creation of data handling rules,
discovery of data relationships and automated data transformations that may be part of data
quality maintenance efforts. Collaboration and workflow enablement tools have also become
more common, providing shared views of corporate data repositories to data quality
managers and data stewards, who are charged with overseeing particular data sets.
Data quality tools and improvement processes are often incorporated into data governance
programs, which typically use data quality metrics to help demonstrate their business value
to companies. They're also key components of master data management (MDM) initiatives
that create central registries of master data on customers, products and supply chains, among
other data domains.
In addition, good data quality increases the accuracy of analytics applications, which can
lead to better business decision-making that boosts sales, improves internal processes and
gives organizations a competitive edge over rivals. High-quality data can help expand the
use of BI dashboards and analytics tools, as well -- if analytics data is seen as trustworthy,
business users are more likely to rely on it instead of basing decisions on gut feelings or their
own spreadsheets.
Effective data quality management also frees up data management teams to focus on more
productive tasks than cleaning up data sets. For example, they can spend more time helping
business users and data analysts take advantage of the available data in systems and
promoting data quality best practices in business operations to minimize data errors.
The growing use of AI tools and machine learning applications in organizations further
complicates the data quality process, as does the adoption of real-time data streaming
platforms that funnel large volumes of data into corporate systems on a continuous basis.
Complex data pipelines created to support data science and advanced analytics work add to
the challenges, too.
Data quality demands are also expanding due to the implementation of new data privacy and
protection laws, most notably the European Union's General Data Protection Regulation
(GDPR) and the California Consumer Privacy Act (CCPA). Both measures give people the
right to access the personal data that companies collect about them, which means
organizations must be able to find all of the records on an individual in their systems without
missing any because of inaccurate or inconsistent data.
However, it's also a common practice to involve business users, data scientists and other
analysts in the data quality process to help reduce the number of data quality issues created
in systems. Business participation can be achieved partly through data governance programs
and interactions with data stewards, who frequently come from business units. In addition,
though, many companies run training programs on data quality best practices for end users.
A common mantra among data managers is that everyone in an organization is responsible
for data quality.
Viewed more broadly, data integrity focuses on integrity from both logical and physical
standpoints. Logical integrity includes data quality measures and database attributes such as
referential integrity, which ensures that related data elements in different database tables are
valid. Physical integrity involves access controls and other security measures designed to
prevent data from being modified or corrupted by unauthorized users, as well as backup and
disaster recovery protections.
Data profiling is the process of reviewing source data, understanding structure, content and
interrelationships, and identifying potential for data projects.
Data warehouse and business intelligence (DW/BI) projects—data profiling can uncover
data quality issues in data sources, and what needs to be corrected in ETL.
Data conversion and migration projects—data profiling can identify data quality issues,
which you can handle in scripts and data integration tools copying data from source to target.
It can also uncover new requirements for the target system.
Source system data quality projects—data profiling can highlight data which suffers from
serious or numerous quality issues, and the source of the issues (e.g. user inputs, errors in
interfaces, data corruption).
Structure discovery
Validating that data is consistent and formatted correctly, and performing mathematical
checks on the data (e.g. sum, minimum or maximum). Structure discovery helps understand
how well data is structured—for example, what percentage of phone numbers do not have
the correct number of digits.
Content discovery
Looking into individual data records to discover errors. Content discovery identifies which
specific rows in a table contain problems, and which systemic issues occur in the data (for
example, phone numbers with no area code).
Relationship discovery
Discovering how parts of the data are interrelated. For example, key relationships between
database tables, references between cells or tables in a spreadsheet. Understanding
relationships is crucial to reusing data; related data sources should be united into one or
imported in a way that preserves important relationships.
Ralph Kimball, a father of data warehouse architecture, suggests a four-step process for data
profiling:
Use data profiling at project start to discover if data is suitable for analysis—and make a "go
/ no go" decision on the project.
Identify and correct data quality issues in source data, even before starting to move it into
target database.
Identify data quality issues that can be corrected by Extract-Transform-Load (ETL), while
data is moved from source to target. Data profiling can uncover if additional manual
processing is needed.
Identify unanticipated business rules, hierarchical structures and foreign key / primary key
relationships, and use them to fine-tune the ETL process.
Distinct count and percent—identifies natural keys, distinct values in each column that can
help process inserts and updates. Handy for tables without headers.
Percent of zero / blank / null values—identifies missing or unknown data. Helps ETL
architects setup appropriate default values.
Minimum / maximum / average string length—helps select appropriate data types and
sizes in target database. Enables setting column widths just wide enough for the data, to
improve performance.
Advanced data profiling techniques:
Key integrity—ensures keys are always present in the data, using zero/blank/null analysis.
Also, helps identify orphan keys, which are problematic for ETL and future analysis.
Pattern and frequency distributions—checks if data fields are formatted correctly, for
example if emails are in a valid format. Extremely important for data fields used for
outbound communications (emails, phone numbers, addresses).
Data profiling, a tedious and labor intensive activity, can be automated with tools, to make
huge data projects more feasible. These are essential to your data analytics stack.
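For illustration, many of the basic and advanced techniques listed above can be approximated with a short pandas script; this is only a sketch (the input file and the email column are assumptions), not a substitute for a dedicated profiling tool.

import pandas as pd

df = pd.read_csv("source_extract.csv")  # assumed source extract

profile = {}
for col in df.columns:
    series = df[col]
    profile[col] = {
        # Distinct count and percent: helps spot natural keys.
        "distinct": series.nunique(),
        "distinct_pct": series.nunique() / len(df),
        # Percent of null values: missing or unknown data.
        "null_pct": series.isna().mean(),
        # Min/max/average string length: guides target column sizing.
        "min_len": series.astype(str).str.len().min(),
        "max_len": series.astype(str).str.len().max(),
        "avg_len": series.astype(str).str.len().mean(),
    }

# Pattern check for a field used in outbound communications (assumes an "email" column).
profile["email"]["valid_format_pct"] = (
    df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean()
)

print(pd.DataFrame(profile).T)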
Extract, transform, and load (ETL) is the process of combining data from multiple
sources into a large, central repository called a data warehouse. ETL uses a set of
business rules to clean and organize raw data and prepare it for storage, data
analytics, and machine learning (ML).
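To make the flow concrete, here is a minimal, hypothetical ETL sketch in Python; the file name, column names and the SQLite target are all assumptions rather than a prescribed implementation.

import sqlite3
import pandas as pd

# Extract: pull raw data from a source (a CSV here; an API or OLTP database in practice).
raw = pd.read_csv("pos_transactions.csv")

# Transform: apply business rules to clean and organize the raw data.
clean = (
    raw.dropna(subset=["transaction_id"])           # reject records missing the key
       .drop_duplicates(subset=["transaction_id"])  # remove duplicate extractions
       .assign(amount=lambda d: d["amount"].abs())  # enforce non-negative amounts
)

# Load: write the prepared data into the central repository (the data warehouse).
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("fact_transactions", conn, if_exists="append", index=False)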
ETL Architecture
ETL stands for Extract, Transform, and Load. In today's data warehousing world, this term is extended
to E-MPAC-TL, or Extract, Monitor, Profile, Analyze, Cleanse, Transform, and Load. In other words, the
extended process puts an explicit focus on data quality and metadata.
Extraction
The main goal of extraction is to collect the data from the source systems as quickly as possible and
with as little inconvenience to those source systems as possible. The most applicable extraction method
should be chosen for each situation: source date/time stamps, database log tables, or a hybrid approach.
Monitoring
Monitoring of the data enables verification of the data that is moved throughout the entire ETL
process and has two main objectives. Firstly, the data should be screened. A proper balance is needed
between screening the incoming data as much as possible and not slowing down the entire ETL
process when too much checking is done. Here the inside-out approach used in Ralph Kimball's
screening technique can be applied. This technique captures all errors consistently, based on a
pre-defined set of metadata business rules, and enables reporting on them through a simple star
schema, which gives a view of how data quality evolves over time. Secondly, the focus should be on
ETL performance. This metadata information can be plugged into all dimension and fact tables and can
be called an audit dimension.
Quality Assurance
Quality Assurance comprises processes between the different stages that can be defined depending on
the need. These processes check the completeness of the values: do we still have the same number of
records or the same totals of specific measures between different ETL stages? This information should
be captured as metadata. Finally, data lineage should be foreseen throughout the entire ETL process,
including the error records produced.
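As a hedged sketch of the monitoring and quality-assurance ideas above, row counts and timestamps can be captured as audit metadata at each ETL stage and then reconciled; the stage names and records are invented for illustration.

from datetime import datetime, timezone

audit_log = []  # in a real warehouse this would feed an audit dimension table

def record_stage(stage_name, rows):
    """Capture audit metadata (row count, timestamp) for one ETL stage."""
    audit_log.append({
        "stage": stage_name,
        "row_count": len(rows),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    })
    return rows

extracted = record_stage("extract", [{"id": 1}, {"id": 2}, {"id": 3}])
loaded = record_stage("load", [{"id": 1}, {"id": 2}])

# Completeness check between stages: do we still have the same number of records?
if audit_log[0]["row_count"] != audit_log[-1]["row_count"]:
    print("Record count mismatch between extract and load:", audit_log)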
Data Profiling
Data profiling is used to generate statistics about the sources; its objective is to get to know the
sources. Data profiling uses analytical techniques to discover the actual content, structure, and
quality of the data by analyzing and validating data patterns and formats and by identifying and
validating redundant data across the data sources. It is essential to use the correct tool to
automate this process, given the huge amount and variety of data involved.
Data Analysis
Data Analysis is used to analyze the results of the profiled data. It makes it easier to
identify data quality problems such as missing data, inconsistent data, invalid data, constraint problems,
and parent-child issues such as orphans and duplicates. It is essential to capture the results of this
assessment correctly. Data analysis becomes the communication medium between the source and data
warehouse teams for tackling the outstanding issues. The source-to-target mapping depends heavily on
the quality of the source analysis.
Source Analysis
In source analysis, the focus should be not only on the sources themselves but also on their
surroundings, in order to obtain the source documentation. This includes understanding the future of the
source applications, the current data issues at the origin, and the corresponding data models and
metadata repositories, as well as receiving a walkthrough of the source model and business rules from
the source owners. It is crucial to set up frequent meetings with the source owners to detect changes
that might impact the data warehouse and the associated ETL process.
Cleansing
In this phase, the errors that were found are fixed, based on the metadata of a pre-defined set of
rules. Here, a distinction needs to be made between completely and partly rejected records, and the
issues are resolved either through manual correction or by fixing the data directly, for example by
correcting inaccurate data fields or adjusting the data format.
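A small, hypothetical sketch of the distinction drawn above: records failing a mandatory rule are rejected for manual review, while the rest are fixed in place by correcting fields and adjusting formats (the column names and rules are assumptions).

import pandas as pd

staged = pd.DataFrame({
    "customer_id": [1, None, 3],
    "phone": ["020-555-0100", "555 0101", "unknown"],
    "country": ["gb", "GB ", "GB"],
})

# Completely rejected: records without the mandatory key go to an error table for manual correction.
rejected = staged[staged["customer_id"].isna()]

# Partly corrected: remaining records have inaccurate fields fixed and formats adjusted.
cleansed = (
    staged.dropna(subset=["customer_id"])
          .assign(
              country=lambda d: d["country"].str.strip().str.upper(),
              phone=lambda d: d["phone"].str.replace(r"[^\d]", "", regex=True),
          )
)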
The Data Staging Area is a temporary storage area for data copied from Source Systems.
In a Data Warehousing Architecture, a Data Staging Area is mostly necessary for time
considerations. In other words, before data can be incorporated into the Data
Warehouse, all essential data must be readily available.
It is not possible to retrieve all data from all Operational databases at the same time
because of varying Business Cycles, Data Processing Cycles, Hardware, and Network
Resource Restrictions, and Geographical Variables.
Here's all you need to know about the Data Staging Area, as well as some key pointers to
keep in mind before you start the process.
During the Extract, Transform, and Load (ETL) process, a Staging Area, also known as
a landing zone, is an interim storage region used for Data Processing. The Data Staging
Area is located in between the Data Source(s) and the Data Target(s), which are typically
Data Warehouses, Data Marts, or other Data Repositories.
Data Staging spaces are frequently ephemeral in nature, with their contents being wiped
before performing an ETL process or shortly after it has been completed successfully.
However, there are architectures for staging areas that are designed to hold data for long
periods of time for preservation or debugging purposes.
There is no designated space available for testing data transformations in a direct data
integration strategy, where the data is extracted, transformed, and then loaded to the new
storage. Before being loaded to the target system, data from the source can be replicated,
reformatted, and tested in a staging area in the data warehouse.
Most firms today have several Data Sources to derive information. Before being loaded
into the new system, the extracted data must be polished and cleansed, as well as have the
right format and structure. A Staging space is useful in this situation. Data is altered,
replicated as needed, linked and aggregated if necessary, and then cleansed in this
intermediate layer.
The Data Staging Area is made up of the Data Staging Server software and the data
store archive (repository) of the outcomes of the extraction, transformation, and loading
activities in the data warehousing process.
The archival repository stores cleansed, converted data and attributes for loading into
Data Marts and Data Warehouses, while the Data Staging software server saves and alters
data taken from OLTP data sources.
A Data Staging Area is a design concept for a Data Pipeline. It is a location where
raw/unprocessed data is stored before being modified for downstream usage. Database
tables, files in a Cloud Storage System, and other staging regions are examples.
Example:
It's reasonable to extract sales data on a daily basis, but daily extracts aren't appropriate
for financial data that needs to be reconciled at the end of the month. Similarly, extracting
"customer" data from a database in Singapore at noon eastern standard time may be
appropriate, but it is not appropriate for "customer" data in a Chicago database.
Data in the Data Warehouse can be permanent (i.e., it lasts for a long time) or transitory
(i.e., it only remains around temporarily). A data
warehouse staging space is not required for all enterprises. For many firms, using ETL to
replicate data straight from operational databases into the Data Warehouse is a viable
option.
External Staging
The area where data staging takes place outside a data warehouse is commonly referred
to as External Staging. This area is often hosted by cloud storage providers such
as Google Cloud Storage (GCS) or Amazon Web Simple Storage Solution (AWS S3).
Internal Staging
Unified cloud data warehouses in modern approaches often use an internal staging
process that involves creating raw tables separate from the rest of the warehouse. These
raw tables then undergo a transformation, cleaning, and normalization process in an
'ELT staging area'. A final layer is then used to present only the cleaned and prepared
data to BI tooling and business users, allowing data teams to curate a single source of
truth, reduce complexity, and mitigate data sprawl.
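A hedged sketch of this internal, ELT-style staging pattern, assuming a SQLite database stands in for the warehouse and that the source extract has order_id, country and amount columns: the raw table is loaded untouched, and the cleaned table is what BI tools query.

import sqlite3
import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    # Load: land the source extract as-is in a raw staging table.
    pd.read_csv("orders_extract.csv").to_sql("raw_orders", conn, if_exists="replace", index=False)

    # Transform inside the warehouse: build the cleaned table exposed to BI tooling.
    conn.execute("DROP TABLE IF EXISTS clean_orders")
    conn.execute("""
        CREATE TABLE clean_orders AS
        SELECT DISTINCT order_id,
               UPPER(TRIM(country)) AS country,
               CAST(amount AS REAL) AS amount
        FROM raw_orders
        WHERE order_id IS NOT NULL
    """)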
A Data Staging Area has a single purpose: to hold raw data from Source Systems
and to provide a space for transitory tables created during the transformation
process.
The function of the Data Staging Area varies depending on the design
methodology and ETL toolset, but the Target Audience is always the ETL
process and the Data Engineers who are responsible for designing and maintaining
the ETL.
This does not diminish the need for data governance in a Data Lake, but it does
make it much easier to manage when contrasted to a large number of individuals
and processes that may access it.
The efficiency of employing the Data Staging Area is heavily dependent on the business
requirements and the ETL system's operating environment. There are also a few drawbacks of a
staging area to keep in mind when evaluating data integration products:
If the transformation process slows down, the staging procedure will come to a halt as
well.
There might be some variation in the use of disc space, because the data must be
dumped into a local area.
A Data Warehouse's Architecture is seen in the image below. Through the data
warehouse, end users have immediate access to data collected from a variety of Source
Systems.
The image above shows metadata and raw data from a standard OLTP system, as well as
a new type of data, called Summary Data. Summaries are a way to pre-compute frequent,
time-consuming processes so that data can be retrieved in a fraction of a second. A
common data warehouse query, for example, might be to get August sales. In an Oracle database, a
summary is implemented as a materialized view.
An Enterprise Data Warehouse (EDW) is a centralized repository of raw data that serves as the hub
of your data warehousing system. By storing all essential business information in the
most complete format, an EDW delivers a 360-degree insight into an organization's
business.
As demonstrated in the image below, you must clean and prepare your
operational data before putting it into the warehouse. This can be done
programmatically, although most data warehouses employ a staging area instead.
A Staging Area makes data cleansing and consolidation for operational data
from numerous Source Systems easier, especially for corporate data
warehouses that consolidate all of an organization's important data. This
typical architecture is depicted in the image below.
Although the architecture shown in the image above is quite standard, you may want to alter
the architecture of your warehouse for different groups within your company. You can do
this by incorporating Data Marts, which are systems tailored to a specific industry. The
image below depicts a scenario in which purchasing, sales, and inventory are all separated.
In this case, a financial analyst would wish to mine past data to develop forecasts about
client behavior or examine historical data for purchases and sales.
The data staging area serves as the gateway to successful data integration and
analytics, preparing data for subsequent stages by ensuring it's cleansed,
transformed, and ready for analysis.
Let’s delve into the sequential steps undertaken within the data staging
area:
1. Data Extraction
This initial step involves extracting data from various source systems, including
databases, CRM systems, and ERP solutions. The staging area serves as the primary
landing zone for this extracted data, allowing it to be collated in one centralized location
for further processing.
2. Data Profiling
Before proceeding, it's crucial to understand the nature and quality of incoming data by
assessing completeness, consistency, and anomalies. Data profiling provides insights into
potential issues that need addressing in subsequent steps.
3. Data Cleansing
Raw data often contains errors, duplicates, or inconsistencies that need to be identified
and rectified. Cleansing operations in the staging area ensure that data is of high quality
before it proceeds further.
4. Data Transformation
Since data sources can have varying structures, formats, and standards, transformation
becomes vital to align data with the target system's schema. Transformation processes
ensure data compatibility with the target system.
5. Data Validation
Once data is cleansed and transformed, it's imperative to validate it against specific
business rules or criteria to ensure accuracy and relevancy. The staging area enforces
validation checks to further ensure data quality before proceeding.
6. Data Integration
The staging area acts as the ground for data integration, ensuring cohesiveness in the
resultant dataset.
7. Temporal Storage
Sometimes, there's a need to hold processed data temporarily before loading it into the
target system. The staging area provides buffering capacity, ensuring data readiness for
subsequent stages.
8. Data Loading
In this final step, the prepared data is loaded into the target system, such as a data
warehouse or data mart. The staging area ensures a smooth process, as data is already in a
compatible format, cleaned, and validated.
In summary, the data staging area serves as a meticulously organized processing hub,
guiding data through essential steps to ensure its accuracy, reliability, and readiness for
insightful analysis.
Data lakes, data warehouses, and data marts are all data repositories of different sizes.
Apart from the size, there are other significant characteristics to highlight.
A data lake is a central repository used to store massive amounts of both structured and
unstructured data coming from a great variety of sources. Data lakes accept raw data,
eliminating the need for prior cleansing and processing. As far as the size, they can be
home to many files, where even one file can be larger than 100 GB. Depending on the
goal, it may take weeks or months to set up a data lake. Moreover, not all organizations
use data lakes.
Types of data marts
Based on how data marts are related to the data warehouse as well as external and
internal data sources, they can be categorized as dependent, independent, and hybrid.
Let's elaborate on each one.
Dependent data marts are the subdivisions of a larger data warehouse that serves as a
centralized data source. This is something known as the top-down approach — you first
create a data warehouse and then design data marts on top of it. Within this sort of
relationship, data marts do not interact with data sources directly. Based on the subjects,
different sets of data are clustered inside a data warehouse, restructured, and loaded into
respective data marts from where they can be queried.
Dependent data marts are well suited for larger companies that need better control over
the systems, improved performance, and lower telecommunication costs.
Independent data marts act as standalone systems, meaning they can work without a
data warehouse. They receive data from external and internal data sources directly. The
data presented in independent data marts can be then used for the creation of a data
warehouse. This approach is called bottom-up.
Often, the motivation behind choosing independent data marts is shorter time to market.
They work great for small to medium-sized companies.
So, the key difference between dependent and independent data marts is in the way they
get data from sources. The step involving data transfer, filtering, and loading into either a
data warehouse or data mart is called the extract-transform-load (ETL) process. When
dealing with dependent data marts, the central data warehouse already keeps data
formatted and cleansed, so ETL tools will do little work. On the other hand, independent
data marts require the complete ETL process for data to be injected.
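As a hedged sketch of the dependent case (reusing the hypothetical clean_orders table from the internal-staging example above), populating a sales data mart is little more than copying the relevant, already-cleansed subset out of the warehouse.

import sqlite3

warehouse = sqlite3.connect("warehouse.db")
mart = sqlite3.connect("sales_mart.db")

# Dependent data mart: the warehouse data is already cleansed, so the ETL does little work.
rows = warehouse.execute(
    "SELECT order_id, country, amount FROM clean_orders WHERE country = 'GB'"
).fetchall()

mart.execute("CREATE TABLE IF NOT EXISTS gb_sales (order_id INTEGER, country TEXT, amount REAL)")
mart.executemany("INSERT INTO gb_sales VALUES (?, ?, ?)", rows)
mart.commit()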
Data mart implementation steps
The process of creating data marts may be complicated and differ depending on the needs of
a particular company. In most cases, there are five core steps such as designing a data mart,
constructing it, transferring data, configuring access to a repository, and finally managing it.
We'll walk you through each step in more detail.
Logical structure refers to the scenario where data exists in the form of virtual tables or
views separated from the warehouse logically, not physically. Virtual data marts may be a
good option when resources are limited.
Physical structure refers to the scenario where a database is physically separated from the
warehouse. The database may be cloud-based or on-premises.
Also, this step requires the creation of the schema objects (e.g., tables, indexes) and setting
up data access structures.