The ETL process in data warehousing involves the extraction, transformation, and loading of data: data is extracted from operational databases, transformed to resolve conflicts and quality issues and to match the data warehouse schema, and loaded into the target data warehouse structures. As source data and business needs change, the ETL process must also evolve to maintain the data warehouse's value as a business decision-making tool.
Data Warehouse:
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
Reconciled data: detailed, current data intended to be the single, authoritative source for all decision support.
Extraction:
The Extract step covers the extraction of data from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible.
Data Transformation:
Data transformation is the component of data reconciliation that converts data from the format of the source operational systems to the format of the enterprise data warehouse.
Data Loading:
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them again only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
2. UNIT 2
ETL Process and Maintenance of Data Warehouse
Data Extraction, Data Transformation, Data Loading, Data Quality, Data Warehouse design reviews, Testing and Monitoring the data warehouse.
3. • ETL is a process in Data Warehousing and it stands for Extract, Transform and Load.
• It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and finally loads it into the data warehouse system.
• It is a data integration process that combines data from multiple data sources into a single, consistent data store in the data warehouse.
• The ETL tool may be customized to suit the needs of the enterprise.
(E.g.) ETL tool sets for long-term analysis and usage of data in banking, insurance claims, retail sales history, etc.
ETL Process
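To make the three steps concrete, here is a minimal sketch in Python (not part of the original slides); the source rows, field names and the in-memory "warehouse" list are hypothetical stand-ins for a real source system and a real warehouse database.

```python
# Minimal ETL sketch: extract from a (hypothetical) source, transform in a
# staging list, then load into a (hypothetical) warehouse structure.

def extract():
    # Stand-in for reading from an operational source system.
    return [
        {"cust_id": "101", "name": " alice ", "country": "U.S.A"},
        {"cust_id": "102", "name": "Bob",     "country": "United States"},
    ]

def transform(rows):
    # Resolve format and coding differences into one standard form.
    country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
    staged = []
    for row in rows:
        staged.append({
            "cust_id": int(row["cust_id"]),                   # data type conversion
            "name": row["name"].strip().title(),              # cleaning
            "country": country_map.get(row["country"], row["country"]),  # decoding
        })
    return staged

def load(rows, warehouse):
    # Stand-in for inserting into the data warehouse tables.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

In a real pipeline each function would talk to an actual source system, staging area and warehouse, but the shape of the flow is the same.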
6. • Google BigQuery.
• Amazon Redshift.
• Informatica – PowerCenter.
• IBM – InfoSphere Information Server.
• Oracle Data Integrator.
• SQL Server Integration Services.
ETL Tools
7. 1) Scalability – virtually unlimited scalability is available at the click of a button, i.e. capacity can be changed in size as needed.
2) Simplicity – it saves time and resources and avoids a lot of complexity.
3) Out of the box – open source ETL requires customization and cloud-based ETL requires integration.
4) Compliance – it provides an easy way to avoid complicated and risky compliance setups.
5) Long-term costs – open source ETL tools are cheaper up front, but may cost more in the long run.
Benefits of ETL Tools
8. Extraction (E):
The first step is the extraction of data: the source system's data is accessed first and prepared for further processing so that the required values can be extracted.
Data is extracted from sources in various formats like relational databases, NoSQL stores, XML, flat files, etc.
It is important to store extracted data in a staging area, not directly in the data warehouse, as a faulty load could damage the warehouse and rolling it back would be much more difficult.
Phase (Steps) of ETL Process
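As a small illustration of landing extracted data in a staging area before it ever touches the warehouse, the sketch below uses Python's built-in sqlite3 module as a stand-in staging database; the table, column names and source rows are hypothetical.

```python
import sqlite3

# Hypothetical staging database: extracted rows are landed here first so a
# failed run can be discarded and re-run without touching the warehouse.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_customer (cust_id INTEGER, name TEXT, city TEXT)")

def extract_from_source():
    # Stand-in for querying the operational source system.
    return [(101, "Alice", "Austin"), (102, "Bob", "Boston")]

rows = extract_from_source()
staging.executemany("INSERT INTO stg_customer VALUES (?, ?, ?)", rows)
staging.commit()

# The warehouse load later reads from staging, never directly from the source.
print(staging.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0])
```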
9. Extraction has three approaches -
a) Update Notification – when data is changed or altered in the source system, the source notifies the ETL process about the change.
b) Incremental Extract – many systems are incapable of providing notification but are efficient enough to track down the changes made to the source data.
c) Full Extract – the whole data set is extracted when the system is neither able to notify nor able to track down the changes. An old copy of the data is maintained to identify the changes.
(E.g.) conversion of phone numbers and email addresses to a standard form, validation of address fields, etc.
Phase (Steps) of ETL Process
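The difference between a full extract and an incremental extract driven by tracked changes can be sketched as follows; the source rows and the "updated_at" watermark column are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Hypothetical source rows with a last-modified timestamp the source maintains.
SOURCE_ROWS = [
    {"id": 1, "name": "Alice", "updated_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"id": 2, "name": "Bob",   "updated_at": datetime(2024, 2, 9, tzinfo=timezone.utc)},
]

def full_extract():
    # Used when the source can neither notify nor track changes.
    return list(SOURCE_ROWS)

def incremental_extract(last_extract_time):
    # Used when the source can track changes: pull only rows modified
    # since the previous extract (the "watermark").
    return [r for r in SOURCE_ROWS if r["updated_at"] > last_extract_time]

watermark = datetime(2024, 2, 1, tzinfo=timezone.utc)
print(len(full_extract()), len(incremental_extract(watermark)))  # 2 1
```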
10. The data extraction issues are -
Source Identification – identify source applications and source structures.
Method of extraction – for each data source, define whether the extraction process is manual or tool-based.
Extraction frequency – for each data source, establish how frequently the data extraction must be done (daily, weekly, quarterly, etc.).
Time Window – for each data source, denote the time window for the extraction process.
Job sequencing – determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
Exception handling – determine how to handle input records that can't be extracted.
Phase (Steps) of ETL Process
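These decisions are typically recorded as extraction metadata for each source. The configuration below is a hypothetical sketch of how they might be captured; all source names and settings are invented for illustration.

```python
# Hypothetical per-source extraction metadata covering the issues listed above:
# source identification, extraction method, frequency, time window,
# job sequencing, and exception handling.
EXTRACTION_CONFIG = {
    "orders_db": {
        "source_system": "order entry OLTP",
        "method": "tool-based incremental",
        "frequency": "daily",
        "time_window": "01:00-03:00",
        "depends_on": [],                 # job sequencing: no upstream job
        "on_bad_record": "write to reject file and continue",
    },
    "crm_export": {
        "source_system": "CRM flat-file export",
        "method": "manual full extract",
        "frequency": "weekly",
        "time_window": "Sun 02:00-05:00",
        "depends_on": ["orders_db"],      # must wait for the orders job to finish
        "on_bad_record": "abort the load and alert the ETL team",
    },
}
```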
11. The following are the guidelines adopted for extracts as best practices -
The extract processing should identify changes since the last extract.
Interface record types should be defined for the extraction based on entities in the data warehouse model.
(E.g.) Client information extracted from a source may be categorized into person attributes, contact point information, etc.
When changes are sent to the data warehouse, all current attributes for the changed entity should also be sent.
Any interface record should be categorized as either –
Records which have been added to the operational database since the last extract, or
Records which have been deleted from the operational database since the last extract.
Phase (Steps) of ETL Process
12. Transformation (T):
The second step of the ETL process is transformation.
A set of rules or functions is applied to the extracted data to convert it into a single standard format.
It includes dimension conversion, aggregation, joining, derivation and calculation of new values.
Phase (Steps) of ETL Process
13. Transformation involves the following processes or tasks –
a) Filtering – loading only certain attributes into the data warehouse.
b) Cleaning – filling up NULL values with some default values, mapping U.S.A, United States, and America into USA, etc.
c) Joining – joining multiple attributes into one.
d) Splitting – splitting a single attribute into multiple attributes.
e) Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
Phase (Steps) of ETL Process
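A small sketch of these tasks on hypothetical extracted rows follows; the field names, code mappings and order data are invented for illustration.

```python
# Sketch of the transformation tasks above on hypothetical extracted rows:
# filtering (only selected attributes), cleaning, joining, and sorting.
customers = [
    {"id": 2, "first": "Bob",   "last": "Stone", "country": "United States",
     "phone": None, "internal_flag": "x"},
    {"id": 1, "first": "Alice", "last": "Reed",  "country": "U.S.A",
     "phone": "555-0100", "internal_flag": "y"},
]
orders = [{"cust_id": 1, "amount": 120.0}, {"cust_id": 2, "amount": 75.5}]

country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

transformed = []
for c in customers:
    transformed.append({
        "id": c["id"],                                   # filtering: internal_flag is not loaded
        "full_name": f'{c["first"]} {c["last"]}',        # joining two attributes into one
        "country": country_map.get(c["country"], c["country"]),  # cleaning: map variants to USA
        "phone": c["phone"] or "UNKNOWN",                # cleaning: default for NULL values
    })

# joining with a second extracted source, then sorting on the key attribute
amount_by_cust = {o["cust_id"]: o["amount"] for o in orders}
for row in transformed:
    row["order_amount"] = amount_by_cust.get(row["id"], 0.0)
transformed.sort(key=lambda r: r["id"])
print(transformed)
```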
14. Major Data Transformation Types:
a) Format Revisions – these include changes to the data types and lengths of individual fields.
(E.g.) A product package type may be indicated by both codes and names, with the fields stored as numeric and text data types.
b) Decoding of fields – this is common when dealing with multiple source systems where the same data items are described by different field values.
(E.g.) The coding for Male and Female may be 1 and 2 in one source system and M and F in another source system.
c) Splitting of single fields – earlier systems stored name, address, city and state data together in a single field, but the individual components need to be stored in separate fields in the data warehouse.
Phase (Steps) of ETL Process
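Decoding of fields and splitting of single fields might look like the sketch below; the gender codes and the combined address layout are assumptions, not taken from any specific source system.

```python
# Sketch: decoding coded fields and splitting a combined field, using
# hypothetical codes and field layouts.

GENDER_CODES = {"1": "Male", "2": "Female", "M": "Male", "F": "Female"}

def decode_gender(value):
    # Different source systems code the same item differently (1/2 vs M/F);
    # the warehouse stores one agreed representation.
    return GENDER_CODES.get(str(value).strip().upper(), "Unknown")

def split_address(single_field):
    # An older system stored "street, city, state" in one field; the
    # warehouse keeps the components in separate columns.
    street, city, state = (part.strip() for part in single_field.split(",", 2))
    return {"street": street, "city": city, "state": state}

print(decode_gender(1), decode_gender("f"))
print(split_address("12 Main St, Springfield, IL"))
```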
15. d) Merging of information – this does not mean the merging of several fields to create a single field of data.
(E.g.) Information about a product may come from different data sources, with product code and package type coming from another data source. Merging of information denotes the combination of product code, description and package type into a single entity.
e) Character set conversion – this is the conversion of character sets to an agreed standard character set for text data in the data warehouse.
f) Conversion of units of measurement – it is required to convert metrics based on overseas operations so that the numbers are all in one standard unit of measurement.
g) Date/Time conversion – this is the representation of date and time in standard formats.
Phase (Steps) of ETL Process
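Unit-of-measurement and date/time conversion can be sketched as below; the conversion factors and source date formats are illustrative assumptions.

```python
from datetime import datetime

# Sketch: converting units of measurement and date formats from different
# (hypothetical) source conventions into one warehouse standard.

def to_kilograms(value, unit):
    factors = {"kg": 1.0, "lb": 0.453592, "g": 0.001}
    return round(value * factors[unit], 3)

def to_standard_date(text, source_format):
    # e.g. a US source sends MM/DD/YYYY, a European source DD.MM.YYYY;
    # the warehouse stores ISO 8601 (YYYY-MM-DD).
    return datetime.strptime(text, source_format).date().isoformat()

print(to_kilograms(10, "lb"))                       # 4.536
print(to_standard_date("03/25/2024", "%m/%d/%Y"))   # 2024-03-25
print(to_standard_date("25.03.2024", "%d.%m.%Y"))   # 2024-03-25
```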
16. h) Summarization – this type of transformation creates summaries to be loaded in the data warehouse instead of loading the most granular level of data.
(E.g.) It is not necessary to store every single credit card transaction in the data warehouse; instead, summarize the daily transactions for each credit card and store the summary data.
i) Key Restructuring – while extracting data from input sources, look at the primary keys of the extracted records and come up with keys for the fact and dimension tables based on the keys in the extracted records.
j) Deduplication – in most companies, customer files have several records for the same customer, which leads to the creation of additional records by mistake. The aim is to maintain one record per customer and link all duplicates in the source systems to this single record.
Phase (Steps) of ETL Process
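Summarization to a daily grain and deduplication with a cross-reference might be sketched like this; the transaction and customer records, keys and field names are hypothetical.

```python
from collections import defaultdict

# Sketch: summarizing card transactions to a daily grain and de-duplicating
# customer records, with hypothetical field names.

transactions = [
    {"card": "1111", "date": "2024-03-01", "amount": 25.0},
    {"card": "1111", "date": "2024-03-01", "amount": 40.0},
    {"card": "2222", "date": "2024-03-01", "amount": 10.0},
]

# Summarization: one row per card per day instead of one row per transaction.
daily_totals = defaultdict(float)
for t in transactions:
    daily_totals[(t["card"], t["date"])] += t["amount"]
print(dict(daily_totals))

# Deduplication: keep one surviving record per customer key, remembering
# which source records it supersedes.
customers = [
    {"source_id": "A-7", "email": "a.reed@example.com", "name": "Alice Reed"},
    {"source_id": "B-3", "email": "A.Reed@example.com", "name": "A. Reed"},
]
survivors, cross_reference = {}, {}
for c in customers:
    key = c["email"].lower()
    if key not in survivors:
        survivors[key] = c
    cross_reference[c["source_id"]] = key
print(survivors, cross_reference)
```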
17. Loading (L):
The third and final step of the ETL process is loading.
Transformed data is finally loaded into the data warehouse.
Data is loaded into the data warehouse frequently, but at regular intervals.
Indexes and constraints previously applied to the data need to be disabled before loading commences.
The rate and period of loading depend on requirements and vary from system to system.
During the loads, the data warehouse has to be offline.
A time period should be identified when loads may be scheduled without affecting data warehouse users.
It should be considered to divide the whole load process into smaller chunks and populate a few files at a time.
Phase (Steps) of ETL Process
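A sketch of the load step follows, using sqlite3 purely as a stand-in target. SQLite has no "disable index" command, so the sketch drops and recreates the index around a chunked bulk load to mimic the idea; a production warehouse would use its own bulk loader and also disable constraints.

```python
import sqlite3

# Sketch of the load step: drop the index before a bulk load, load in
# smaller chunks, then rebuild the index. Table and index names are hypothetical.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (sale_id INTEGER, product TEXT, amount REAL)")
dw.execute("CREATE INDEX idx_sales_product ON sales_fact(product)")

staged_rows = [(i, f"product-{i % 5}", float(i)) for i in range(10_000)]

dw.execute("DROP INDEX idx_sales_product")          # "disable" indexing during the load
CHUNK = 1_000                                       # load in smaller chunks
for start in range(0, len(staged_rows), CHUNK):
    dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                   staged_rows[start:start + CHUNK])
    dw.commit()
dw.execute("CREATE INDEX idx_sales_product ON sales_fact(product)")  # re-enable afterwards

print(dw.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0])   # 10000
```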
18. Mode of Loading (L):
a) Load – If the targeted table to be loaded already exists and data exists in the table, the load process wipes out the existing data and applies the data from the incoming file. If the table is empty before loading, the load process simply applies the data from the incoming file.
b) Append – It is an extension of the load. If data already exists in the table, the append process unconditionally adds the incoming data, preserving the existing data in the target table. An incoming duplicate record may be rejected during the append process.
c) Destructive Merge – in this mode, the incoming data is applied to the target data. If an incoming record matches the key of an existing record, the matching target record is updated; if not, the incoming record is added to the target table.
Phase (Steps) of ETL Process
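The load, append and destructive merge modes can be sketched on a small in-memory target keyed by customer id; this is a simplification, since a real target would be warehouse tables.

```python
# Sketch of the first three load modes on a hypothetical in-memory target
# table keyed by customer id.

def load_mode(target, incoming):
    # Load: wipe whatever exists and apply the incoming data.
    target.clear()
    target.update({r["id"]: r for r in incoming})

def append_mode(target, incoming):
    # Append: keep existing rows; reject incoming duplicates of existing keys.
    for r in incoming:
        if r["id"] not in target:
            target[r["id"]] = r

def destructive_merge(target, incoming):
    # Destructive merge: update the row when the key matches, else add it.
    for r in incoming:
        target[r["id"]] = r

target = {}
load_mode(target, [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}])
append_mode(target, [{"id": 2, "name": "Bobby"}, {"id": 3, "name": "Cara"}])
destructive_merge(target, [{"id": 2, "name": "Bobby"}])
print(target)  # id 2 kept as "Bob" by append, then updated to "Bobby" by the merge
```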
19. Mode of Loading (L):
d) Constructive Merge – this mode is the opposite of destructive merge, i.e. if an incoming record matches the key of an existing record, the existing record is left in place, the incoming record is added, and the added record is marked as superseding the old record.
e) Initial Load – loads the whole data warehouse in a single run. It is possible to split the load into separate subloads and run them as single loads. If more than one run is needed to create a single table, it may be scheduled to run over several days.
f) Incremental Loads – these are the applications of ongoing changes from the source systems. They need a method to preserve the periodic nature of the changes in the data warehouse. Constructive merge mode is an appropriate method for incremental loads.
g) Full Refresh – this application of data involves periodically rewriting the entire data warehouse. Sometimes partial refreshes rewrite only specific tables, but partial refreshes are rare, because a dimension table is intricately tied to the fact table.
Phase (Steps) of ETL Process
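Constructive merge, where a matching incoming record supersedes rather than overwrites the existing one, might be sketched as follows; the "effective" and "current" fields are hypothetical bookkeeping columns.

```python
from datetime import date

# Sketch of constructive merge: a matching incoming record does not replace
# the existing one; it is added and marked as superseding it.
history = [
    {"cust_id": 1, "city": "Austin", "effective": date(2023, 1, 1), "current": True},
]

def constructive_merge(history, incoming, effective):
    for row in history:
        if row["cust_id"] == incoming["cust_id"] and row["current"]:
            row["current"] = False               # old record is kept, not overwritten
    history.append({**incoming, "effective": effective, "current": True})

constructive_merge(history, {"cust_id": 1, "city": "Boston"}, date(2024, 6, 1))
print(history)   # both versions are kept; only the newest is flagged as current
```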
20. Data Quality (DQ) in Data Warehouse
What is Data Quality?
For an IT professional, data accuracy is most often the notion associated with a data element.
(E.g.) Consider Customer as an entity; it has attributes like Customer Name, Customer Address, Customer State, Customer Mobile No, etc.
The data is accurate when the attribute values of the customer entity correctly describe the particular customer.
(i.e.) If data is fit for the purpose for which it is intended, then such data has quality.
Data quality is related to the usage of the data item as defined by the users.
Data quality in operational systems requires that database records conform to field validation edits, which is data quality, but single field edits alone don't constitute data quality.
21. Definition:
Data quality refers to the overall utility of a dataset as a function of its ability to be easily processed and analyzed for other uses, usually by a database, data warehouse, or data analytics system.
Data quality in a data warehouse is not just the quality of individual data items but the quality of the full, integrated system as a whole.
(E.g.) In an online ordering system, while entering data about customers in the order entry application, we may collect demographics of each customer. Sometimes these demographic factors are not needed or receive little attention. When such data is later accessed, the integrated whole lacks data quality.
(E.g.) Some items of customer information may be important and others may not be, when we are filling in or writing any application form (especially in a banking process).
22. Data Accuracy Vs Data Quality
Difference between Data Accuracy and Data Quality:
1. Data Accuracy: a specific instance of an entity accurately represents that occurrence of the entity. Data Quality: the data item is exactly fit for the purpose for which the business users have defined it.
2. Data Accuracy: the data element is defined in terms of database technology. Data Quality: a wider concept grounded in the specific business of the company.
3. Data Accuracy: the data element conforms to validation constraints. Data Quality: relates not just to single data elements but to the system as a whole.
4. Data Accuracy: individual data items have the correct data types. Data Quality: form and content of data elements are consistent across the whole system.
5. Data Accuracy: traditionally relates to operational systems. Data Quality: essentially needed in a corporate-wide data warehouse for business users.
23. Characteristics (Dimensions) of Data Quality
The data quality dimensions are -
1. Accuracy:
The value stored in the system for a data element is the right value for that
occurrence of the data element. (E.g.) storing the correct customer address.
2. Domain Integrity:
The data value of an attribute falls in the range of allowable, defined values.
(E.g.) Male and Female for gender data element.
3. Data Type:
Value for a data attribute is actually stored as the data type defined for the
attribute. (E.g.) Name field is defined as ‘text’.
4. Consistency:
The form and content of a data field are the same across multiple source
systems. (E.g.) the product code for Product A is 1234 in every source system.
24. 5. Redundancy:
The same data must not be stored in more than one place in a system.
6. Completeness:
There are no missing values for a given attribute in the system.
7. Duplication:
Duplication of records in a system is completely resolved. (E.g.) duplicate
records are identified and a cross-reference is created.
8. Conformance to Business Rules:
The values of each data item adhere to prescribed business rules. (E.g.) in
auction system, sale price can’t be less than the reserve price.
9. Structural Definiteness:
Wherever a data item can naturally be structured into individual components,
the item must carry this well-defined structure. (E.g.) names are divided
into first name, middle name and last name, which helps reduce missing values.
25. 10. Data Anomaly:
A field must be used only for the purpose for which it is defined. (E.g.) the
third address line, provided for long addresses, should contain only the third
line of the address, not phone or fax numbers.
11. Clarity:
A data element may possess all the other characteristics of quality data, but
it is of little value if the users do not understand its meaning clearly.
12. Timely:
The users determine the timeliness of the data. (E.g.) updating the customer
database on a daily basis.
13. Usefulness:
Every data element in the data warehouse must satisfy some requirements of the
collection of users.
14. Adherence to Data Integrity Rules:
The data stored in the relational databases of the source system must adhere to
entity integrity and referential integrity rules.
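As an illustration of how several of these dimensions can be checked programmatically, the short Python sketch below tests completeness, domain integrity, data type and duplication over a hypothetical list of customer records; the field names and allowed values are assumptions made only for this example.

# Minimal, illustrative data quality checks over a hypothetical list of customer records.
from collections import Counter

ALLOWED_GENDERS = {"Male", "Female"}           # domain integrity: assumed allowed values
MANDATORY_FIELDS = ("customer_id", "name")     # completeness: assumed mandatory attributes

def quality_report(records):
    issues = []
    key_counts = Counter(r.get("customer_id") for r in records)
    for i, rec in enumerate(records):
        # Completeness: no missing values for mandatory attributes.
        for field in MANDATORY_FIELDS:
            if not rec.get(field):
                issues.append((i, f"missing {field}"))
        # Domain integrity: gender must fall within the allowed values.
        if rec.get("gender") not in ALLOWED_GENDERS:
            issues.append((i, "gender outside allowed domain"))
        # Data type: mobile number stored as digits only.
        if not str(rec.get("mobile_no", "")).isdigit():
            issues.append((i, "mobile_no is not numeric"))
        # Duplication: more than one record shares the same key.
        if key_counts[rec.get("customer_id")] > 1:
            issues.append((i, "duplicate customer_id"))
    return issues

records = [
    {"customer_id": 1, "name": "A. Kumar", "gender": "Male", "mobile_no": "9876543210"},
    {"customer_id": 1, "name": "", "gender": "Unknown", "mobile_no": "98-76"},
]
for row, problem in quality_report(records):
    print(f"record {row}: {problem}")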
26. Data Quality Challenges (Problems) in DW
[Bar chart: Data Warehouse Challenges – percentage (0%–50%) reported for
Database Performance, Management Expectations, Business Rules, Data
Transformation, User Expectations, Data Modeling and Data Quality.]
27. Data Quality Framework
The framework involves both IT professionals and user representatives and
proceeds through the following steps –
Establish a Data Quality Steering Committee.
Agree on a suitable data quality framework.
Identify the business functions affected most by bad data.
Institute data quality policy and standards.
Select high-impact data elements and determine priorities.
Define quality measurement parameters and benchmarks.
Plan and execute data cleansing for high-impact data elements (initial data
cleansing efforts).
Plan and execute data cleansing for other, less severe elements (ongoing data
cleansing efforts).
28. Data Quality – Participants and Roles
The data quality initiatives bring together participants from the user
departments and the IT department –
User Department: Data Consumer, Data Producer, Data Expert, Data Integrity
Specialist.
IT Department: Data Correction Authority, Data Consistency Expert, Data Policy
Administrator.
29. The responsibilities for the roles are -
1. Data Consumers:
Uses the data warehouse for queries, reports and analysis. Establishes the
acceptable levels of data quality.
2. Data Producer:
Responsible for the quality of data input into the source systems.
3. Data Expert:
Expert in the subject matter and the data itself of the source systems.
Responsible for identifying pollution in the source systems.
4. Data Policy Administrator:
Ultimately responsible for resolving data corruption as data is
transformed and moved into the data warehouse.
30. The responsibilities for the roles are -
5. Data Integrity Specialist:
Responsible for ensuring that the data in the source systems conforms to
the business rules.
6. Data Correction Authority:
Responsible for actually applying the data cleansing techniques through
the use of tools or in-house programs.
7. Data Consistency Expert:
Responsible for ensuring that all data within the data warehouse (various
data marts) are fully synchronized.
31. Data Quality Tools
The useful data quality tools are -
1. Categories of Data Cleansing Tools:
These tools assist in two ways –
Data error discovery tools work on the source data to identify inaccuracies and
inconsistencies.
Data correction tools help fix the corrupt data; they use a series of algorithms
to parse, transform, match, consolidate and correct the data.
2. Error Discovery features:
The following list describes error discovery functions that data cleansing
tools are capable of performing –
Quickly and easily identify duplicate records.
Identify data items whose values are outside the range of legal domain values.
Find inconsistent data.
32. Check for range of allowable values.
Detect inconsistencies among data items from different sources.
Allow users to identify and quantify data quality problems.
Monitor trends in data quality over time.
Report to users on the quality of data used for analysis.
Reconcile problems of RDBMS referential integrity.
3. Data Correction features:
The following list describes the typical error correction functions that data cleansing
tools are capable of performing –
Normalize inconsistent data.
Improve merging of data from dissimilar data sources.
Group and relate customer records belonging to the same household.
Provide measurements of data quality.
Standardize data elements to common formats.
Validate for allowable values.
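A rough Python sketch of the discovery-plus-correction idea is given below: records are first standardized to common formats and then grouped into likely households; the normalization rules and the matching key are simplified assumptions, not the behaviour of any particular cleansing tool.

# Illustrative data cleansing sketch: standardize formats, then group likely duplicates.
import re

STATE_SYNONYMS = {"tamilnadu": "TN", "tamil nadu": "TN", "tn": "TN"}   # assumed mapping

def standardize(record):
    """Correction step: normalize a record to common formats."""
    rec = dict(record)
    rec["name"] = " ".join(rec.get("name", "").split()).title()
    rec["state"] = STATE_SYNONYMS.get(rec.get("state", "").strip().lower(),
                                      rec.get("state", "").strip().upper())
    rec["mobile_no"] = re.sub(r"\D", "", rec.get("mobile_no", ""))     # keep digits only
    return rec

def household_key(record):
    """Simplified matching key: normalized address plus state."""
    addr = re.sub(r"\W", "", record.get("address", "").lower())
    return (addr, record.get("state"))

def group_households(records):
    """Discovery step: relate records that appear to belong to the same household."""
    groups = {}
    for rec in map(standardize, records):
        groups.setdefault(household_key(rec), []).append(rec)
    return groups

raw = [
    {"name": "ravi  kumar", "state": "Tamil Nadu", "mobile_no": "98765-43210", "address": "12, Anna Salai"},
    {"name": "Priya Kumar", "state": "TN", "mobile_no": "9876501234", "address": "12 Anna Salai"},
]
for key, members in group_households(raw).items():
    print(key, "->", [m["name"] for m in members])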
33. 4. DBMS for Quality Control:
The database management system can be used as a tool for data quality control
in many ways; in particular, an RDBMS prevents several types of errors from
creeping into the data warehouse –
Domain integrity – provide domain value edits. Prevent entry of data if entered
data value is outside the defined limits of value.
Update security – prevent unauthorized updates to the databases, which stops
unauthorized users from updating data in an incorrect way.
Entity Integrity Checking – ensure that duplicate records with same primary key
value are not entered.
Minimize missing values – ensure that nulls are not allowed in mandatory fields.
Referential Integrity Checking – ensure that relationships based on foreign
keys are preserved.
Conformance to Business rules – use trigger programs and stored procedures to
enforce business rules.
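The sketch below shows how such controls might be expressed in SQL, executed here through Python's sqlite3 module purely for illustration; the table names and the auction business rule are hypothetical, and a production warehouse would normally rely on a full-scale RDBMS.

# Illustrative RDBMS-level quality controls using SQLite via Python's standard library.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")          # enable referential integrity checking

conn.executescript("""
CREATE TABLE product (
    product_code INTEGER PRIMARY KEY              -- entity integrity: no duplicate keys
);
CREATE TABLE auction_sale (
    sale_id       INTEGER PRIMARY KEY,
    sale_price    REAL NOT NULL CHECK (sale_price >= 0),      -- domain integrity
    reserve_price REAL NOT NULL CHECK (reserve_price >= 0),
    product_code  INTEGER NOT NULL                             -- minimize missing values
                  REFERENCES product(product_code)             -- referential integrity
);
-- Conformance to a business rule via a trigger: sale price cannot be below reserve price.
CREATE TRIGGER check_sale_price BEFORE INSERT ON auction_sale
WHEN NEW.sale_price < NEW.reserve_price
BEGIN
    SELECT RAISE(ABORT, 'sale price below reserve price');
END;
""")

conn.execute("INSERT INTO product VALUES (1234)")
conn.execute("INSERT INTO auction_sale VALUES (1, 150.0, 100.0, 1234)")    # accepted
try:
    conn.execute("INSERT INTO auction_sale VALUES (2, 80.0, 100.0, 1234)") # rejected by trigger
except sqlite3.IntegrityError as err:
    print("rejected:", err)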
34. Benefits of Data Quality
Some specific areas where data quality brings definite benefits are -
Analysis with timely information.
Better Customer Service.
Newer opportunities.
Reduced costs and Risks.
Improved Productivity.
Reliable Strategic Decision Making.
35. Data Warehouse Design Reviews
One of the most effective techniques for ensuring quality in the
operational environment is the design review.
Errors can be detected and resolved prior to coding through a design
review.
The cost benefit of identifying errors early in the development life cycle
is enormous.
Design review is usually done on completion of the physical design of an
application.
Some of the issues around operational design review are as follows –
Transaction performance
System availability
Project readiness
Batch window adequacy
Capacity
User requirements satisfaction
36. Views of Data Warehouse Design
The four views regarding a data warehouse design must be considered –
1. Top-Down View:
This allows the selection of relevant information necessary for data
warehouse.
This information matches current and future business needs.
2. Data Source View:
It exposes the information being captured, stored and managed by
operational systems.
This information may be documented at various levels of detail and
accuracy, from individual data source tables to integrated data source
tables.
Data sources are often modeled by traditional data modeling techniques,
such as the entity-relationship model.
37. Views of Data Warehouse Design
3. Data Warehouse View:
This view includes fact tables and dimension tables.
It represents the information that is stored inside the data
warehouse, including pre-calculated totals and counts.
Information regarding the source, date and time of origin is added
to provide historical context.
4. Business Query View:
This view is the data perspective in the data warehouse from the
end-user’s view point.
38. Data Warehouse Design Approaches
A data warehouse can be built using three approaches -
a) The top-down approach:
It starts with the overall design and planning.
It is useful in cases where the technology is mature and well
known, and where the business problems that must be solved
are clear and well understood.
The process begins with an ETL process working from external
data sources.
In the top-down model, integration between the data warehouse
and the data marts is automatic, as long as the data marts are
maintained as subsets of the data warehouse.
39. Data Warehouse Design Approaches
b) The bottom-up approach:
The bottom-up approach starts with experiments and prototypes.
This is useful in the early stage of business modeling and
technology development.
It allows an organization to move forward at considerably less
expense and to evaluate the benefits of the technology before
making significant commitments.
This approach is to construct the data warehouse incrementally
over time from independently developed data marts.
In this approach, data flows from sources into data marts, then into
the data warehouse.
40. Data Warehouse Design Approaches
c) The Combined approach:
In this approach, both the top-down approach and bottom-up
approaches are exploited.
In combined approach, an organization can exploit the planned
and strategic nature of top-down approach while retaining the
rapid implementation and opportunistic application of the
bottom-up approach.
41. Data Warehouse Design Process
The general data warehouse design process involves the following steps -
Step 1: Choosing the appropriate Business process:
Based on the needs and requirements, there are two types of models: the data
warehouse model and the data mart model.
A data warehouse model is chosen if the business process is organisational
and involves many complex object collections.
A data mart model is chosen if the business process is departmental and
focuses on the analysis of one particular process.
Step 2: Choosing the grain of the business process:
The grain is the fundamental, atomic level of data represented in the fact
table for the chosen business process.
(E.g.) individual snapshots, individual transactions, etc.
42. Data Warehouse Design Process
Step 3: Choosing the Dimensions:
It includes selecting various dimensions such as time, item,
status, etc., which apply to each fact table record.
Step 4: Choosing the measures:
It includes selecting the numeric measures, such as items_sold,
euros_sold, etc., which populate each fact table
record.
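To tie the four steps together, here is a hedged sketch of the resulting star schema for an assumed retail-sales process (grain: one row per item sold per transaction; dimensions: time and item; measures: items_sold and euros_sold), again expressed as SQL run through Python's sqlite3 module.

# Illustrative star schema for an assumed retail-sales business process.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY,
    day      TEXT, month TEXT, year INTEGER
);
CREATE TABLE dim_item (
    item_key  INTEGER PRIMARY KEY,
    item_name TEXT, brand TEXT
);
CREATE TABLE fact_sales (
    time_key   INTEGER REFERENCES dim_time(time_key),   -- dimension key
    item_key   INTEGER REFERENCES dim_item(item_key),   -- dimension key
    items_sold INTEGER,                                  -- measure
    euros_sold REAL                                      -- measure
);
""")
print("star schema created")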
43. Testing & Monitoring the Data Warehouse
Definition:
Data Warehouse testing is the process of building and
executing comprehensive test cases to ensure that data in a
warehouse has integrity and is reliable, accurate and consistent
with the organization’s data framework.
Testing is very important for data warehouse systems for data
validation and to make them work correctly and efficiently.
Data Warehouse Testing is a series of Verification and
Validation activities performed to check for the quality and
accuracy of the Data Warehouse and its contents.
44. There are five basic levels of testing performed on a data warehouse –
1. Unit Testing:
This type of testing is performed at the developer's end.
In unit testing, each unit / component of the modules is tested separately.
Each module of the data warehouse, i.e., program, SQL
script, procedure, Unix shell script, is validated and tested.
2. Integration Testing:
In this type of testing, the various individual units / modules of the
application are brought together or combined and then tested against
a number of inputs.
It is performed to detect the fault in integrated modules and to test
whether the various components are performing well after integration.
45. 3. System Testing:
System testing is the form of testing that validates and tests the whole data
warehouse application.
This type of testing is performed by the technical testing team.
This test is conducted after developer’s team performs unit testing and the
main purpose of this testing is to check whether the entire system is working
altogether or not.
4. Acceptance Testing:
To verify that the entire solution meets the business requirements and
successfully supports the business processes from a user’s perspective.
5. System Assurance Testing:
To ensure and verify the operational readiness of the system in a production
environment.
This is also referred to as the warranty period coverage.
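As an example of what a unit test for a single ETL component might look like, the sketch below uses Python's unittest module to check a hypothetical transformation that standardizes source dates into the warehouse format; both the function and the chosen formats are assumptions for illustration.

# Illustrative unit test for one ETL transformation (hypothetical function and formats).
import unittest
from datetime import datetime

def to_warehouse_date(source_value):
    """Assumed transformation: convert a DD/MM/YYYY source string to YYYY-MM-DD."""
    return datetime.strptime(source_value, "%d/%m/%Y").strftime("%Y-%m-%d")

class TestDateTransformation(unittest.TestCase):
    def test_valid_date_is_standardized(self):
        self.assertEqual(to_warehouse_date("05/11/2023"), "2023-11-05")

    def test_invalid_date_is_rejected(self):
        # Bad source data should fail loudly rather than load silently.
        with self.assertRaises(ValueError):
            to_warehouse_date("2023-11-05")

if __name__ == "__main__":
    unittest.main()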
46. Challenges of data warehouse testing are -
Data selection from multiple sources, and the analysis that follows,
poses a great challenge.
Owing to the volume and complexity of the data, certain testing strategies
are time consuming.
ETL testing requires strong SQL skills (including Hive SQL), which poses
challenges for testers who have limited SQL skills.
Redundant data in the data warehouse, and inconsistent or
inaccurate reports.
47. Data Warehouse Testing Process
Testing a data warehouse is a multi-step process that involves
activities like identifying business requirements, designing test
cases, setting up a test framework, executing the test cases and
validating data.
The steps of the testing process are –
Step 1: Identify the various entry points:
As loading data into a warehouse involves multiple stages, it’s
essential to find out the various entry points to test data at each of
those stages.
If testing is done only at the destination, it can be confusing when
errors are found as it becomes more difficult to determine the root
cause.
48. Step 2: Prepare the required collaterals:
Two fundamental collaterals required for the testing process are a database
schema representation and a mapping document.
The mapping document is usually a spreadsheet which maps each
column in the source database to the destination database.
A data integration solution can help generate the mapping document,
which is then used as an input to design test cases.
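As a small illustration, the mapping document can drive test-case generation; the Python sketch below assumes the spreadsheet has been exported as a hypothetical mapping.csv with source_table, source_column, target_table and target_column columns, and turns each row into a pair of count queries whose results should match.

# Illustrative sketch: derive simple column-level checks from a hypothetical mapping.csv.
import csv

def load_mapping(path="mapping.csv"):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def generate_test_cases(mapping_rows):
    """Turn each mapping row into a pair of SQL statements whose results should match."""
    cases = []
    for row in mapping_rows:
        cases.append({
            "name": f"{row['source_table']}.{row['source_column']} -> "
                    f"{row['target_table']}.{row['target_column']}",
            "source_sql": f"SELECT COUNT({row['source_column']}) FROM {row['source_table']}",
            "target_sql": f"SELECT COUNT({row['target_column']}) FROM {row['target_table']}",
        })
    return cases

if __name__ == "__main__":
    for case in generate_test_cases(load_mapping()):
        print(case["name"])
        print("  compare:", case["source_sql"], "vs", case["target_sql"])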
Step 3: Design an elastic, automated and integrated testing framework:
ETL is not a one-time activity. While some data is loaded all at once and
some through batches, new updates may trickle in
through streaming queues.
A testing framework design has to be generic and architecturally flexible
to accommodate new and diverse data sources and types, larger volumes,
and the ability to work seamlessly with cloud and on-premises environments.
49. Integrating the test framework with an automated data solution
(that contains features as discussed in the previous section)
increases the efficiency of the testing process.
Step 4: Adopt a comprehensive testing approach:
The testing framework needs to aim for 100% coverage of the
data warehousing process.
It is important to design multiple testing approaches such as unit,
integration, functional, and performance testing.
The data itself has to be scrutinized through many checks, which
include looking for duplicates, matching record counts,
completeness, accuracy, loss of data, and correctness of
transformation.
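A minimal sketch of a few such checks between a source table and its warehouse counterpart is shown below, again using sqlite3 in Python purely for illustration; the table and column names are assumptions.

# Illustrative source-vs-target reconciliation checks (assumed tables and columns).
import sqlite3

def reconcile(conn, source_table, target_table, key_column):
    checks = {}
    # Record counts should match: no loss of data during the load.
    src = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    checks["record_count_match"] = (src == tgt)
    # No duplicate business keys in the target.
    dupes = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT {key_column} FROM {target_table} "
        f"GROUP BY {key_column} HAVING COUNT(*) > 1)").fetchone()[0]
    checks["no_duplicates"] = (dupes == 0)
    # Completeness: no NULL business keys slipped through.
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {target_table} WHERE {key_column} IS NULL").fetchone()[0]
    checks["no_null_keys"] = (nulls == 0)
    return checks

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_orders (order_id INTEGER, amount REAL);
CREATE TABLE dw_orders  (order_id INTEGER, amount REAL);
INSERT INTO src_orders VALUES (1, 10.0), (2, 20.0);
INSERT INTO dw_orders  VALUES (1, 10.0), (2, 20.0);
""")
print(reconcile(conn, "src_orders", "dw_orders", "order_id"))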
50. Testing Operational Environment
There are a number of aspects that need to be tested, as below –
1. Security:
A separate security document is required for security testing. This document
contains a list of disallowed operations, and tests are devised for each of them.
2. Scheduler:
Scheduling software is required to control the daily operations of a data
warehouse. It needs to be tested during system testing. The scheduling
software requires an interface with the data warehouse, which will need the
scheduler to control overnight processing and the management of
aggregations.
3. Disk Configuration:
Disk configuration also needs to be tested to identify I/O bottlenecks. The test
should be performed multiple times with different settings.
51. 4. Management Tools:
It is required to test all the management tools during system
testing. Here is the list of tools that need to be tested.
• Event manager
• System manager
• Database manager
• Configuration manager
• Backup recovery manager
52. Testing the Database
The database is tested in following three ways –
1. Testing the database manager and monitoring tools:
To test the database manager and the monitoring tools, they should be used in
the creation, running, and management of a test database.
2. Testing database features:
Here is the list of features that we have to test −
– Querying in parallel
– Create index in parallel
– Data load in parallel
3. Testing database performance:
Query execution plays a very important role in data warehouse performance
measures. There are sets of fixed queries that need to be run regularly and
they should be tested.
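One simple way to exercise a fixed set of queries and record their timings is sketched below; the query list and tables are assumptions for illustration, and a real warehouse would use its own regular query workload together with the DBMS's monitoring tools.

# Illustrative timing harness for a fixed set of warehouse queries (queries are assumptions).
import sqlite3
import time

FIXED_QUERIES = [
    ("total_sales", "SELECT SUM(euros_sold) FROM fact_sales"),
    ("sales_by_item", "SELECT item_key, SUM(items_sold) FROM fact_sales GROUP BY item_key"),
]

def time_queries(conn, queries):
    results = []
    for name, sql in queries:
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        results.append((name, time.perf_counter() - start))
    return results

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (item_key INTEGER, items_sold INTEGER, euros_sold REAL);
INSERT INTO fact_sales VALUES (1, 2, 20.0), (2, 1, 15.0);
""")
for name, seconds in time_queries(conn, FIXED_QUERIES):
    print(f"{name}: {seconds:.6f} s")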
53. Data Warehouse Monitoring
Data warehouse monitoring helps to understand how
the data warehouse is performing.
Some of the several reasons for monitoring are –
It ensures top performance.
It ensures excellent usability.
It ensures the business can run efficiently.
It prevents security issues.
It ensures governance and compliance.