Big Data Architecture Group 1 PROJECT
(INFO8115-23S-SEC1)
GROUP 1
GROUP MEMBERS
TABLE OF CONTENTS
2. RACI CHART ..... 5
10. SUMMARY ..... 51
DOCUMENT HISTORY CHART
RACI CHART

| Task / Team Members | RAHUL SETHI | KATHIKAY AGARWAL | HARSH DODIYA | BHARATH | NAVYA TEJA | PAARTH ARORA |
|---|---|---|---|---|---|---|
| Task 1: ENTITY RELATIONSHIP DIAGRAM | R | A | R | R | C | I |
| Task 2: NORMALIZATION | A | R | C | R | I | C |
| Task 3: BUSINESS INTELLIGENCE | I | A | R | I | C | R |
| Task 4: DATA WAREHOUSING AND BI DATA | C | I | A | R | I | R |
| Task 5: DATA INTEGRATION | R | C | R | I | R | A |
| Task 6: DATA SOURCES | C | R | I | C | A | R |
DATA SOURCE NORMALIZING

| Transaction # | Date | Customer | Customer Address | Customer Flyer #1 | Customer Flyer #2 | Product Description | Product Department | Product Manufacturer | Product Manufacturer Address | Store | Product Cost | Units Purchased | Extended Purchased Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 01/15/2020 | John Smith | 44 Main Street, Toronto ([email protected]) | Weekly, Monthly | Promo #1 | Eggs | Dairy | Burnbrae | 940 Matheson Blvd E, Mississauga | Sobeys | $5.00 | 3 | $15.00 |
| 2 | 01/16/2020 | John Smith | 44 Main Street, Toronto ([email protected]) | Weekly, Monthly | Promo #1 | Bread | Baked Goods | Weston Foods | 125 The Queensway, Etobicoke | Zehrs | $6.00 | 4 | $24.00 |
| 3 | 01/15/2021 | John Smith | 44 Main Street, Toronto ([email protected]) | Weekly, Monthly | Promo #1 | Bacon | Dairy | Schneiders | P.O. Box 61016, Winnipeg, MB | Costco | $7.00 | 5 | $35.00 |
| 4 | 01/16/2021 | Jane Wright | 31 Main Street, Kitchener ([email protected]) | Annual | Promo #2 | Eggs | Dairy | Burnbrae | 940 Matheson Blvd E, Mississauga | Sobeys | $5.01 | 3 | $15.03 |
| 5 | 01/15/2022 | Jane Wright | 31 Main Street, Kitchener ([email protected]) | Annual | Promo #2 | Bread | Produce | Weston Foods | 125 The Queensway, Etobicoke | Zehrs | $6.00 | 4 | $24.00 |
| 6 | 01/16/2022 | Jane Wright | 31 Main Street, Kitchener ([email protected]) | Annual | Promo #2 | Bacon | Fridge | Schneiders | P.O. Box 61016, Winnipeg, MB | Costco | $7.00 | 5 | $35.00 |
| 7 | 01/15/2022 | Jane Smith | 24 Main Street, Waterloo ([email protected]) | Monthly, Annual | Promo #3 | Vanilla Ice Cream | Frozen | Burnbrae | 940 Matheson Blvd E, Mississauga | Zehrs | $9.00 | 4 | $36.00 |
| 8 | 01/16/2022 | Jane Smith | 24 Main Street, Waterloo ([email protected]) | Monthly, Annual | Promo #3 | Eggs | Dairy | Burnbrae | 940 Matheson Blvd E, Mississauga | Longos | $5.50 | 2 | $11.00 |
ENTITY RELATIONSHIP (ER) DIAGRAM
NORMALIZATION
4NF: A relation is in 4NF if it is in Boyce-Codd normal form (BCNF) and has no multi-valued dependencies. For example, if a customer can independently have several addresses and several flyer subscriptions, keeping both in one relation creates a multi-valued dependency; 4NF removes it by splitting them into separate relations.
FIRST NORMAL FORM (1NF)
Transaction Details Table:
SECOND NORMAL FORM (2NF)

TRANSACTION DETAILS TABLE:

| Transaction # | Date | Customer | Product Description | Product Cost | Units Purchased | Extended Purchased Amount |
|---|---|---|---|---|---|---|
| 1 | 1/15/20 | John ... | Eggs | $5.00 | 3 | $15.00 |

PROMOTIONS TABLE:

| Promotion Name | Promotion Flyer #1 | Promotion Flyer #2 |
|---|---|---|
| Promo #1 | Weekly, Monthly | Promo #1 |
TRANSACTIONFACT TABLE:

| Transaction # | Date | Customer ID | Promotion ID | Product ID | Store ID | Product Cost | Units Purchased | Extended Purchased Amount |
|---|---|---|---|---|---|---|---|---|
| 1 | ... | AA | P1 | Pid 1 | 1 | $5.00 | 3 | $15.00 |
| 2 | ... | BB | P2 | Pid 2 | 2 | $6.00 | 4 | $24.00 |

DATEDIMENSION TABLE:

| Date | Day | Year | Month |
|---|---|---|---|
| 01/16/2020 | 16 | 2020 | 1 |
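To make the normalization steps above concrete, here is a minimal pandas sketch of splitting the flat transaction extract into dimension tables and a fact table. The column names, surrogate-key scheme, and sample rows are illustrative stand-ins, not the project's actual schema.

```python
import pandas as pd

# Flat extract resembling the DATA SOURCE NORMALIZING table (two sample rows).
flat = pd.DataFrame({
    "transaction_id": [1, 2],
    "date": ["01/15/2020", "01/16/2020"],
    "customer": ["John Smith", "John Smith"],
    "product": ["Eggs", "Bread"],
    "store": ["Sobeys", "Zehrs"],
    "product_cost": [5.00, 6.00],
    "units_purchased": [3, 4],
})

# Dimension tables: one row per distinct customer / product / store,
# each keyed by a surrogate ID.
def make_dim(df, col, key):
    dim = df[[col]].drop_duplicates().reset_index(drop=True)
    dim[key] = dim.index + 1
    return dim

customer_dim = make_dim(flat, "customer", "customer_id")
product_dim = make_dim(flat, "product", "product_id")
store_dim = make_dim(flat, "store", "store_id")

# Fact table: swap descriptive columns for foreign keys, keep the measures.
fact = (flat
        .merge(customer_dim, on="customer")
        .merge(product_dim, on="product")
        .merge(store_dim, on="store"))
fact["extended_amount"] = fact["product_cost"] * fact["units_purchased"]
fact = fact[["transaction_id", "date", "customer_id", "product_id",
             "store_id", "product_cost", "units_purchased", "extended_amount"]]
print(fact)
```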
CASE STUDY - BIG DATA ARCHITECTING FOR NEW COMPANY
■ Reporting & Alerts
• BUSINESS INTELLIGENCE (BI) architecture is the
framework that a company uses to deploy business intelligence ■ Ad-hoc Analysis
and analytics software or applications. This framework includes a ■ Dashboards
variety of components including IT systems and different software
tools that the company plans to use to collect, integrate, store and ■ OLAP
analyze data. ■ MS Office Integration
• In a comprehensive BI architecture, components work together to ■ Predictive Analytics
provide a cohesive and effective system for collecting, processing,
analyzing, and presenting data to support decision-making ■ Data Discovery
processes within an organization. Each component plays a specific ■ Data Visualization
role in enhancing the overall BI capabilities and facilitating data-
■ Big Data Analytics
driven insights.
■ Mobile BI
14
REPORTING AND ALERTS:
• Reporting involves generating structured documents or summaries of data to
communicate information to stakeholders. Alerts are notifications triggered by
predefined criteria, informing users about specific events or changes in data.
Tools:
Tableau: Creates interactive and shareable dashboards and reports, supports data
exploration.
Microsoft Power BI: Generates visual reports and dashboards with integration into
Microsoft Office.
AD HOC ANALYSIS
• Ad hoc analysis allows users to perform on-the-fly queries and exploration of data to answer specific questions that arise outside of regular reporting requirements.
TOOL: Datapine
• Datapine is ad hoc reporting software that provides independence, flexibility, and usability while helping users answer vital queries quickly and accurately; its data visualization and reporting tools check all the boxes.
• Aimed specifically at the end user, its dashboards and self-service reporting tools are intuitive and accessible, which means users don't have to possess a wealth of technical knowledge to utilize the platform.
Here are some reasons why organizations might consider using the datapine tool for ad hoc analysis:
• Customizable Dashboards: Users can create customized dashboards that are tailored to their specific needs. This flexibility allows users to arrange visualizations, metrics, and data filters in a way that aligns with their analysis goals.
• Ad Hoc Querying: Ad hoc analysis often involves asking spontaneous questions and getting instant answers. Datapine's query capabilities enable users to perform on-the-fly analysis by creating custom queries and filters to explore data interactively.
• Real-Time Data Analysis: Depending on its capabilities, datapine may offer real-time data connectivity. This feature is beneficial for scenarios where analyzing up-to-the-minute data is critical, such as monitoring live campaigns or tracking real-time performance metrics.
DASHBOARDS
Tableau
• Dashboards are useful across different industries and verticals because they're highly customizable.
• Because dashboards use visualizations like tables, graphs, and charts, others who aren't as close to the data can quickly and easily understand the story it tells or the insights it reveals.
• Dashboards tend to give a high-level view of broad amounts of data and are often created to answer a single question.
• Dashboards collect data from several sources so that non-technical users may access and analyze it more readily.
Compelling Reasons why organizations might choose to use Tableau for dashboard analysis:
Security and Governance: Tableau offers features for data security and governance, allowing organizations to control who can
access and modify dashboards. This is important for maintaining data integrity and compliance.
Advanced Analytics Integration: Tableau can integrate with advanced analytics tools, enabling organizations to incorporate
statistical analysis, predictive modeling, and other data science techniques into their dashboards.
OLAP (ONLINE ANALYTICAL PROCESSING)
• OLAP enables users to interactively analyze multidimensional data, such as drilling down into
details or pivoting dimensions to gain insights from different angles.
Tools:
Oracle OLAP: Offers multidimensional analysis and modeling
capabilities for in-depth data exploration.
Microsoft Analysis Services (SSAS): Empowers multidimensional
analysis through the use of cubes for enhanced insights.
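As a concrete illustration of the multidimensional operations OLAP supports, here is a small pandas sketch, a generic stand-in rather than Oracle OLAP or SSAS, showing a slice-and-dice aggregation and a drill-down. The store and department values echo the sample data earlier in this deck.

```python
import pandas as pd

# A tiny fact table with two dimensions (store, department) and one measure.
sales = pd.DataFrame({
    "store": ["Sobeys", "Sobeys", "Zehrs", "Zehrs", "Costco"],
    "department": ["Dairy", "Baked Goods", "Dairy", "Frozen", "Dairy"],
    "amount": [15.00, 24.00, 11.00, 36.00, 35.00],
})

# "Slice and dice": total amount by store x department, like a 2-D cube view.
cube = sales.pivot_table(index="store", columns="department",
                         values="amount", aggfunc="sum", fill_value=0)
print(cube)

# "Drill down": from the store level into department detail for one store.
print(sales[sales["store"] == "Zehrs"].groupby("department")["amount"].sum())
```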
MS OFFICE INTEGRATION:
• Users can interface with Office for iOS, Office for Android, Office Online, and Excel, PowerPoint, or Word Mobile from their own applications and web-based experiences, allowing them to smoothly shift from third-party solutions to working in Office applications.
Oracle
• Users can interact with content servers and the files on them
directly from a variety of Microsoft Office products after installing
the Desktop client software on their computer.
• They can open files from a content server ("check out"), save files to
a content server ("check in"), search for files on a server, compare
document revisions on a server, and insert files from a server or
links to these files into the current document.
Reasons why organizations might use Oracle tools for Microsoft Office integration:
• Familiar Interface: Integrating Oracle tools with Microsoft Office leverages the familiarity of Office applications, enabling users
to interact with and analyze data using a user interface they are comfortable with.
• Automated Workflows: Oracle tools can enable automation of processes such as data extraction, transformation, and loading
(ETL) directly within Microsoft Office, streamlining repetitive tasks.
• Security and Control: Oracle tools can provide security features to control data access and ensure that sensitive data remains
protected when integrated with Microsoft Office documents.
PREDICTIVE ANALYTICS
• Predictive analytics involves using historical data and statistical
algorithms to make predictions about future events or outcomes. This
can help organizations make proactive decisions.
Tools:
IBM SPSS: Performs statistical analysis and predictive modelling.
RapidMiner: Offers data mining and predictive analytics.
REASONS FOR IBM SPSS AND RAPIDMINER TOOL USED:
• Statistical Analysis: IBM SPSS is renowned for its comprehensive
statistical analysis capabilities, allowing users to explore, summarize, and
visualize data to identify trends, patterns, and correlations.
• Predictive Modeling: SPSS enables the creation and validation of
predictive models, helping users make informed decisions by forecasting
future outcomes based on historical data.
• Automated Machine Learning: The tool includes automated machine
learning capabilities, allowing users to quickly experiment with various
algorithms and techniques to find the best model fit.
• Open Source Foundation: RapidMiner's open-source heritage fosters a
collaborative environment and enables users to customize and extend
the platform to meet specific predictive analytics needs.
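The workflow described above, fitting a model on historical data and forecasting future outcomes, can be sketched in a few lines. The example below uses scikit-learn as a generic stand-in for SPSS or RapidMiner; the data and the relationship between promotion spend and units purchased are synthetic, invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical historical data: units purchased vs. promotion spend.
rng = np.random.default_rng(42)
promo_spend = rng.uniform(0, 100, size=(200, 1))
units = 3.0 * promo_spend[:, 0] + rng.normal(0, 10, size=200)

# Fit on historical data, then score predictions on held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    promo_spend, units, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
print("Predicted units at $80 spend:", model.predict([[80.0]])[0])
```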
DATA DISCOVERY
• Data discovery is the process of extracting meaningful patterns from data. This is achieved by collecting data from a
wide variety of sources and then applying advanced analytics to it to identify specific patterns or themes.
DOMO
■ Domo analytics is a complete cloud analytics platform that was created for
large enterprises and small businesses.
■ Domo is an outlier due to its unique focus on ease of use for both business
and technical users.
■ In 2022, Gartner called Domo's "consumer design focus" a significant advantage for the company and its customers.
■ Gartner also considers its "speed of deployment" a significant advantage,
with over 1,000 platform API connectors and a low-code / no-code
environment for creating analytics content and custom analytics apps.
DATA VISUALIZATION:
• Data visualization turns data into visuals, making it easier to understand, digest, and make important business decisions from. Data visualization creates actionable insights your team might not have found otherwise.
Zoho Analytics
■ Zoho Analytics is a data visualization tool that allows users to import data from
a variety of data sources for in-depth analysis.
■ With a drag-and-drop interface, users can create insightful reports and
dashboards with a range of data visualization tools.
■ Users can collaborate on reports and dashboards with their coworkers and
decide what others may see and do with the reports provided to them.
■ Publish reports and dashboards via email or embed them on websites.
■ It integrates with other Zoho applications and even offers a free plan.
REASONS:
Prebuilt Widgets and Templates: Zoho Analytics offers a range of prebuilt widgets, templates, and dashboards that accelerate
the process of creating visualizations. Users can customize these templates to match their specific requirements.
Affordability: Zoho Analytics offers different pricing tiers, making it an affordable option for small and medium-sized businesses
that need data visualization capabilities.
BIG DATA ANALYTICS
• Big data analytics involves processing and analyzing large and complex data sets to extract valuable insights that can
inform business decisions.
Apache Hadoop:
■ Apache Hadoop is an open-source framework designed to process and analyze large
datasets in a distributed computing environment. It's based on a distributed file
system (HDFS) and a programming model called MapReduce.
Google BigQuery:
■ Google BigQuery is a fully managed data warehouse and analytics service provided
by Google Cloud. It's designed for high-speed analysis of large datasets using SQL
queries.
REASONS TO ADOPT THESE TOOLS:
Distributed Processing: Hadoop divides tasks into smaller subtasks that can be processed
on multiple machines, leading to faster data processing.
Cost-Effective: Hadoop can run on commodity hardware, providing cost-efficient
solutions for processing large volumes of data.
Serverless Architecture: BigQuery eliminates the need for infrastructure management;
users can focus on writing SQL queries and analyzing data.
Integration with Google Cloud: BigQuery seamlessly integrates with other Google Cloud
services, allowing users to build comprehensive data pipelines and workflows.
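To make the distributed-processing point concrete, here is a minimal PySpark sketch that aggregates transaction data; Spark splits the aggregation into tasks that run in parallel across a cluster. The data, storage path, and app name are hypothetical, and a local Spark installation is assumed.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; in production this would point at a cluster.
spark = SparkSession.builder.appName("txn-analytics").getOrCreate()

# Hypothetical transaction data; in practice this would be read from HDFS
# or cloud storage, e.g. spark.read.parquet("hdfs://.../transactions").
txns = spark.createDataFrame(
    [("Sobeys", "Eggs", 15.00), ("Zehrs", "Bread", 24.00),
     ("Costco", "Bacon", 35.00), ("Zehrs", "Ice Cream", 36.00)],
    ["store", "product", "amount"])

# The aggregation is distributed across executors and collected for display.
totals = txns.groupBy("store").agg(F.sum("amount").alias("total_sales"))
totals.show()

spark.stop()
```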
MOBILE BI
• Mobile Business Intelligence (Mobile BI) refers to the practice of delivering business intelligence tools, data, insights, and
analytics to mobile devices such as smartphones and tablets. It enables users to access and interact with business data and
analytics while on the go, allowing for timely decision-making and enhanced collaboration. Mobile BI extends the benefits of
traditional business intelligence to a mobile and often remote workforce.
Google Analytics
■ Google Analytics is primarily known as a web analytics tool, used to track
and analyze user interactions on websites and web applications. However,
it can also be utilized for Mobile Business Intelligence (Mobile BI)
purposes, which involves tracking and analyzing user behavior and
interactions within mobile applications.
■ Google Analytics can be connected with various other platforms such as Google AdSense, Google Optimize 360, Google Search Console, Google Data Studio, and Salesforce Marketing Cloud, which helps increase its functionality ten-fold.
Here's how Google Analytics can be applied to Mobile BI:
Mobile App Tracking: Google Analytics can be integrated into mobile applications to track user activities, engagement, and
interactions within the app.
Segmentation and User Demographics: Google Analytics allows you to segment your app users based on various attributes, such
as geographic location, device type, operating system, and more.
DATA WAREHOUSE & BI DATA
DATA MART
• A data mart is a subset of a data warehouse that focuses on a certain business function or user group. It comprises pre-
aggregated and relevant data to assist a certain department's analytical needs.
• Oracle Database - Oracle offers a variety of tools for creating and managing data marts, such as Oracle Data Integrator
(ODI) for ETL, Oracle OLAP for analytical processing, and Oracle BI Publisher for reporting.
Microsoft SQL Server Analysis Services (SSAS)
■ Microsoft SQL Server Analysis Services (SSAS) is a powerful tool that enables the
creation of data marts, which are specialized subsets of data warehouses. It allows
you to design multidimensional or tabular data models tailored to the analytical
needs of specific business departments.
■ SSAS seamlessly integrates with other Microsoft BI tools like Power BI and Excel,
providing a cohesive end-to-end solution for data analysis and visualization.
Reasons for Using SSAS AND ORACLE DATABASE:
• Integration with Microsoft Tools: If your organization heavily relies on Microsoft
tools like Excel and Power BI, using SSAS ensures seamless integration and
enhances the end-to-end analytical workflow.
• Enterprise-Grade Reliability: If your data marts need to handle critical, high-
volume data and require robust reliability, Oracle Database's proven track record
can be an advantage.
• Security and Compliance: If your organization operates in industries with strict
data security and compliance requirements, Oracle Database's security features
can help meet those needs.
ANALYTICAL SANDBOX
■ An analytical sandbox is a specialized environment created for the purpose of performing
exploratory data analysis, testing, and experimenting with data without affecting the operational
or production systems. It serves as a safe and isolated space where data professionals, analysts,
and data scientists can work with data to gain insights, develop models, and test hypotheses
without the risk of disrupting the primary data sources or processes.
Amazon Redshift: Amazon Redshift is a fully managed data warehouse service offered by AWS. Its Redshift Spectrum feature allows users to create an analytical sandbox by running SQL queries directly on data stored in Amazon S3. By integrating with AWS services like Amazon QuickSight and Amazon Athena, Redshift Spectrum provides a complete ecosystem for ad-hoc data analysis and reporting within the analytical sandbox.
Apache Hadoop: Hadoop is an open-source distributed computing system for storing and processing massive amounts of data in a scalable and cost-effective way. It's great for large data analytics and can be coupled with other data processing and analysis platforms such as Apache Hive, Spark, and Pig.
OLAP CUBES
■ Online Analytical Processing (OLAP) cubes are multidimensional data
structures that enable rapid data analysis from multiple dimensions or
perspectives. OLAP cubes allow users to quickly aggregate and explore data
to gain insights into business performance, trends, and patterns. They are
particularly useful for complex queries and reporting scenarios involving
large volumes of data.
■ Using IBM Cognos for OLAP Cubes:
1. Dimensional Modeling: With IBM Cognos, you can create and define
dimensions that represent various attributes or perspectives of your data.
Dimensions help organize data hierarchies and provide context for analysis.
2. Cubing Services: Cognos provides cubing services that allow you to design
and build OLAP cubes based on your dimensional model. These cubes store
pre-aggregated data, enhancing query performance for complex analyses.
3. Measure Definitions: Cognos enables you to define measures, which are the
numerical values you want to analyze. Measures can represent metrics, KPIs,
or any other quantitative data points.
4. Hierarchy Creation: Hierarchies represent the drill-down paths within
dimensions. Cognos allows you to create hierarchies that users can use to
navigate and analyze data at different levels of granularity.
5. Aggregations and Calculations: You can define aggregations and calculations
within Cognos OLAP cubes to derive new measures, perform calculations,
and create custom business logic.
REFINED BIG DATA
■ A "refined big data tool" refers to a specialized software or technology that has undergone optimization, improvements, or enhancements to better handle the challenges and complexities of processing, analyzing, and managing large and complex datasets. These refinements are typically aimed at improving performance, scalability, efficiency, usability, and functionality to meet the specific demands of big data applications. These solutions range from data-cleaning tools such as OpenRefine to distributed computing frameworks like Apache Hadoop and Apache Spark that handle the scale and complexity of big data processing.
OpenRefine, formerly known as Google Refine, is an open-source data
cleaning and transformation tool designed to help users clean, organize,
and refine messy and inconsistent data. It's especially useful for preparing
and preprocessing data for analysis, visualization, and other data-related
tasks. OpenRefine focuses on data quality improvement by providing a
user-friendly interface for data manipulation and enrichment.
Apache Spark is a powerful open-source data processing and analytics
framework that can be used for refining and processing data at scale. It's
designed to handle large-scale data processing tasks efficiently and offers
various features that make it suitable for refining data.
Reasons for Using Apache Spark for Data Refinement:
• Speed and Performance: Spark's in-memory processing and optimized
execution engine provide fast data processing, even for complex refinement
operations.
• Rich Ecosystem: Spark integrates with various libraries and tools, expanding its
capabilities for data refinement and processing.
ENTERPRISE DATA WAREHOUSE
■ An Enterprise Data Warehouse (EDW) is a central repository that consolidates and integrates
data from various sources within an organization. It serves as a comprehensive, unified data
source for reporting, analysis, and decision-making across the enterprise. EDWs are designed
to support complex queries, historical data storage, and data governance.
Teradata
■ Teradata is a highly scalable and performant Enterprise Data Warehouse (EDW) tool,
employing a shared-nothing architecture and parallel processing to handle large data
volumes and complex queries efficiently.
■ It supports advanced analytics and machine learning capabilities, integrating with popular
tools like R and Python, empowering data scientists and analysts to perform sophisticated
analytics within the data warehouse.
■ Teradata enables seamless data integration and consolidation from various sources, offering
data transformation and quality management features to ensure reliable and consistent data
for analysis.
■ The platform emphasizes comprehensive security and data governance, providing robust
access controls, encryption, auditing, and data lineage to safeguard sensitive information and
comply with regulatory requirements.
Amazon Redshift: Amazon Redshift is an AWS-provided, fully managed data warehousing solution. It is intended for high-performance analysis of huge datasets and works well with other AWS services.
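Since Redshift speaks the PostgreSQL wire protocol, an EDW query can be issued from a standard PostgreSQL client such as psycopg2. The sketch below assumes a hypothetical cluster endpoint, credentials, and a transaction_fact table like the one in the normalization section.

```python
import psycopg2

# Hypothetical connection details for a Redshift cluster endpoint.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="warehouse", user="analyst", password="...")

with conn.cursor() as cur:
    # A typical EDW query: total sales by store from a consolidated fact table.
    cur.execute("""
        SELECT store_id, SUM(extended_amount) AS total_sales
        FROM transaction_fact
        GROUP BY store_id
        ORDER BY total_sales DESC;
    """)
    for store_id, total_sales in cur.fetchall():
        print(store_id, total_sales)

conn.close()
```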
MASTER DATA MANAGEMENT
■ Master Data Management (MDM) is a process and technology-driven approach that ensures consistency, accuracy, and governance
of an organization's critical data entities (e.g., customers, products, employees) across various systems and applications. MDM
solutions help prevent data duplication, improve data quality, and establish a single version of truth for master data.
Informatica MDM
■ Informatica is a well-known MDM vendor, and its MDM product offers powerful data governance, data quality, and data integration
features.
■ Informatica MDM is a robust and widely used Master Data Management solution that helps organizations manage and govern
master data across different domains, such as customers, products, and suppliers.
■ The tool provides comprehensive data modelling and data quality management features, ensuring the accuracy and consistency of
master data across all systems and applications.
■ Informatica MDM offers a flexible data governance framework, allowing organizations to define and enforce business rules and
policies for data management and stewardship.
■ Its integration capabilities enable seamless synchronization of master data across systems, providing a reliable and authoritative
source of truth for master data within the organization.
Reasons to consider using Informatica MDM for your master data management needs:
1. Data Quality and Accuracy: Informatica MDM provides tools for data profiling, cleansing, and enrichment, ensuring that your master data is accurate, complete, and consistent. This enhances the quality of your data, leading to better decision-making.
2. 360-Degree View of Data: Informatica MDM helps in creating a comprehensive and unified view of master data by consolidating information from multiple sources and systems. This 360-degree view facilitates better understanding and analysis of data.
OPERATIONAL DATA STORE
■ A database or data storage system that acts as a real-time or near-real-time staging area for operational data is known as an Operational Data
Store (ODS). It collects and combines data from many transactional systems, making it available for operational reporting and business process
assistance.
Oracle Database:
■ Oracle Database is a mature and reliable relational database system that can serve as an operational data store for real-time data processing and
transactional workloads.
■ The database's ACID compliance ensures data integrity, while its performance optimization features guarantee efficient handling of operational
data.
■ Oracle Database supports a variety of data types and provides robust query and indexing capabilities, making it suitable for diverse operational
data storage requirements.
■ With built-in security features and high availability options, Oracle Database ensures the consistent and reliable storage and retrieval of critical
operational data.
Microsoft SQL Server: SQL Server's database management system, as well as SQL Server Integration Services (SSIS) for data extraction,
transformation, and loading (ETL) activities, may be utilised to create an ODS.
CONS of using MS SQL Server and Oracle Database for an
operational data store:
• Scale Limitations
• Complexity
• Licensing Costs
• Learning Curve
BI REPOSITORIES
■ Repositories are central storage places where source code, project files, and associated materials are saved and maintained in the context of software development and version control. Repositories are used by developers to monitor changes, collaborate on code, and keep track of version history. BI repositories play the analogous role for BI content: a central, governed store for reports, dashboards, and data models.
Microsoft Power BI
■ Microsoft Power BI is a leading business intelligence platform that provides a secure and centralized repository for storing and managing
reports, dashboards, and data models.
■ The tool offers seamless integration with other Microsoft products like Excel and SharePoint, facilitating easy sharing and collaboration
on BI artifacts within the organization.
■ Power BI's user-friendly interface allows business users to interact with and explore data visualizations, making it accessible to a broad
audience within the organization.
■ With its cloud-based service, Power BI offers scalable and reliable storage and sharing capabilities, supporting organizations of all sizes in their BI initiatives.
Reasons why organizations opt to use Power BI for their BI repositories:
1. Integration with Microsoft Ecosystem: If your organization already uses Microsoft tools like Excel, SharePoint, and SQL Server, Power BI seamlessly integrates with these products, providing a unified ecosystem for data management and analytics.
2. Data Source Connectivity: Power BI supports a wide range of data sources, both on-premises and cloud-based, allowing you to connect to various databases, APIs, and services to consolidate data from different sources.
STAGING
■ Staging refers to the intermediate storage area where data is temporarily held before being loaded into the target data
warehouse or database. The staging area allows for data transformation, cleansing, and validation before it becomes part of
the operational or analytical systems. It acts as a buffer between the source systems and the destination, ensuring data
integrity and consistency during the data loading process.
Apache Kafka
■ Apache Kafka is a distributed streaming platform that excels as a staging area for real-time data streaming and event
processing.
■ The platform's support for data partitioning and replication ensures data availability and resilience, making it a reliable staging
solution for critical data pipelines.
Apache NiFi: Apache NiFi is an open-source data integration platform that supports data intake, transformation, and routing. It is suited for real-time data transportation and streaming data applications.
Reasons for Staging with Apache Kafka:
1. Decoupling Data Producers and Consumers: Kafka acts as a message broker, allowing producers to publish data independently of consumers. Staging data in Kafka allows producers to send data without worrying about whether consumers are ready to process it immediately.
2. Data Buffering and Flow Control: Staging data in Kafka provides a buffer between data producers and consumers. This buffering allows for flow control, ensuring that consumers can process data at their own pace without overwhelming the system.
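A minimal sketch of this staging pattern using the kafka-python client: the producer publishes raw transaction events to a staging topic, and a separate consumer drains them at its own pace. The broker address at localhost:9092 and the topic name are assumptions for illustration.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish raw transaction events to a staging topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("transactions-staging", {"transaction_id": 1, "amount": 15.00})
producer.flush()

# Consumer side (typically a separate process): drain the buffered events.
consumer = KafkaConsumer(
    "transactions-staging",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")))
for message in consumer:
    print(message.value)  # load into the warehouse staging area here
    break  # stop after one message in this sketch
```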
DATA INTEGRATION
• Data integration is a fundamental component of Business Intelligence (BI) architecture that involves combining, transforming, and consolidating data from various sources into a unified and accessible format for analysis, reporting, and decision-making. In the BI architectural category, data integration plays a pivotal role in ensuring that accurate, consistent, and relevant data is available to support informed business decisions.
• Data Virtualization
• ETL / ELT
• Data Services
• Data Integration
• Master Data Management
DATA VIRTUALIZATION
• Data Virtualization is a modern approach to data integration where data from various sources is made available for
consumption without the need to physically store it in an intermediate repository. Instead, data is accessed in real-time across
its native platforms. This method abstracts the underlying data sources and provides a unified, single view to the end-users. It’s
like having a 'virtual' layer that sits between data sources and consumers, and can translate queries into the respective format
for each data source, consolidating results on-the-fly. This approach is agile, scalable, and can reduce data redundancy and
storage costs.
Tool: Denodo
Data Virtualization is about creating an abstraction layer that aggregates data from various sources, making it accessible to users without them needing to know the data's actual location. In the context of this case study, the company deals with different datasets such as Operational/Transactional systems (Finance, Marketing, Sales), in-house solutions (Finance Excel spreadsheets), and web-based datasets (Product Reviews). Denodo will allow seamless integration and provide a unified view of this diverse data, ensuring that decision-makers have a comprehensive view without the complexities of sourcing the data.
Reasons to Use Denodo for Data Virtualization:
• Data Agility: Denodo's ability to quickly access and integrate data from various
sources supports agile decision-making and business responsiveness.
• Data Integration Simplification: Denodo simplifies the data integration process by
abstracting the complexity of underlying data sources.
• Self-Service Analytics: Denodo empowers business users to access and combine data
on their own, reducing the IT bottleneck.
ETL (EXTRACT, TRANSFORM, LOAD) & ELT (EXTRACT, LOAD, TRANSFORM)
ETL is a traditional approach to data integration where data is first extracted from various source systems, then transformed (cleaned, enriched, and made consistent) in an intermediate staging area, and finally loaded into a destination system like a data warehouse. The transformation occurs before loading, ensuring that the data in the target system is already in the desired format. ELT reverses the last two steps: data is loaded into the target system first and transformed there, taking advantage of the target platform's processing power.
Apache NiFi :
• Apache NiFi is a data integration and flow management tool designed for
automating data movement, transformation, and enrichment. Here are reasons
to consider using Apache NiFi for ETL/ELT:
• Data Flow Automation: NiFi offers a visual interface for designing data flows,
making it easy to automate ETL/ELT tasks without extensive coding.
• Data Transformation: NiFi provides processors for data transformation,
enrichment, validation, and cleaning within the data flow.
Apache Spark :
• Apache Spark is a powerful open-source data processing and analytics
framework that can be used for both ETL and ELT processes. Here are reasons to
consider using Apache Spark for ETL/ELT:
• Speed: Spark's in-memory processing capability accelerates data processing,
making ETL/ELT tasks faster compared to traditional batch processing.
• Flexibility: Spark supports batch processing, interactive queries, real-time
streaming, and machine learning within the same framework, offering flexibility
in your data pipeline.
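As a toy end-to-end example of the ETL pattern itself, independent of NiFi or Spark, here is a plain-Python sketch using sqlite3: extract from a CSV export, transform and validate the rows, then load them into a staging table. The file name, column names, and table are hypothetical.

```python
import csv
import sqlite3

# Extract: read raw rows from a source CSV export (hypothetical file).
with open("transactions_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean types, derive the extended amount, drop malformed rows.
cleaned = []
for r in rows:
    try:
        cost = float(r["product_cost"].lstrip("$"))
        units = int(r["units_purchased"])
    except (KeyError, ValueError):
        continue  # skip rows that fail validation
    cleaned.append((r["transaction_id"], r["store"], cost, units, cost * units))

# Load: write the conformed rows into the warehouse staging table.
conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS staging_transactions (
    transaction_id TEXT, store TEXT, product_cost REAL,
    units_purchased INTEGER, extended_amount REAL)""")
conn.executemany("INSERT INTO staging_transactions VALUES (?, ?, ?, ?, ?)",
                 cleaned)
conn.commit()
conn.close()
```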
DATA SERVICES
• Data Services refer to the provision and consumption of data-related functionalities through services, typically web services.
These services can include data retrieval, data update, data transformation, and more. They allow for a modular approach to
data access and management, where applications and systems interact with data through standardized service calls, abstracting
away the complexities of direct data storage and retrieval.
Apache Kafka
Kafka provides a unified platform for data streams, ensuring real-time data processing
capabilities. This could be essential for the company if they wish to monitor certain metrics in
real-time, such as live sales figures or real-time feedback from product reviews. Reasons to use
Apache Kafka for data services:
• Real-Time Data Streaming: Kafka provides a platform for ingesting, processing, and
distributing real-time data streams, making it suitable for delivering up-to-date data to data
services.
GraphQL:
GraphQL is a query language and runtime for APIs that allows clients to request specific data
from a server, enabling efficient and flexible data fetching. GraphQL is often used to provide data
services with a more tailored approach compared to traditional REST APIs.
Reasons to use GraphQL for data services:
1. Flexible Data Retrieval: GraphQL allows clients to request only the data they need, reducing over-fetching and under-fetching of data, which is common in REST APIs.
2. Single Endpoint: GraphQL uses a single endpoint for queries, simplifying API interactions and reducing the need for multiple endpoints.
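To show what this tailored data fetching looks like in practice, here is a small sketch that posts a GraphQL query over HTTP with the requests library. The endpoint URL and the schema's field names are hypothetical.

```python
import requests

# Hypothetical GraphQL endpoint exposed by a data service.
url = "https://example.com/graphql"

# The client names exactly the fields it needs -- no more, no less.
query = """
query RecentTransactions($limit: Int!) {
  transactions(limit: $limit) {
    transactionId
    store { name }
    extendedAmount
  }
}
"""

resp = requests.post(url, json={"query": query, "variables": {"limit": 5}})
resp.raise_for_status()
for txn in resp.json()["data"]["transactions"]:
    print(txn["transactionId"], txn["store"]["name"], txn["extendedAmount"])
```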
DATA INTEGRATION
• Data Integration is the process of combining data from different sources into a single, unified view or system. It addresses
challenges like data discrepancies, redundancies, and inconsistencies. The goal is to provide a holistic and consistent view of
data across an organization, ensuring that decision-makers, applications, and analytics tools have timely access to clean, reliable,
and integrated data. It can involve various methodologies and tools, including those mentioned in this list.
Talend
Data integration is pivotal for a unified and streamlined data architecture. Talend, being a comprehensive platform for both batch
and real-time data integration, will ensure data from finance, marketing, sales, and even from in-house Excel spreadsheets
seamlessly converge into the central repository.
Key reasons for using Talend for data integration:
• Broad Connectivity: Talend supports a wide range of connectors for various data
sources and systems, including databases, cloud services, APIs, flat files, and more.
This makes it versatile for integrating data from different platforms.
• ETL and ELT Capabilities: Talend can handle both traditional ETL processes (Extract,
Transform, Load) and modern ELT processes (Extract, Load, Transform) based on
your data integration needs.
• Code Generation: Talend generates optimized code in various programming
languages (Java, SQL, etc.) for executing data integration processes, offering
performance and scalability.
• Data Warehouse Integration: Integrate with popular data warehouses like
Amazon Redshift, Google BigQuery, and Snowflake.
Master Data Management (MDM)
• MDM refers to the comprehensive process of defining, governing, and managing core business entities in a consistent and unified
manner across the enterprise. These entities might include customers, products, suppliers, and other "master" data objects.
MDM ensures that the organization has a single source of truth for these critical data entities.
Informatica MDM (Master Data Management):
• It is a platform designed to help organizations manage and ensure the accuracy, consistency, and quality of their master data
across various systems and applications. Here are the pros and cons of using Informatica MDM for master data management:
Pros of Using Informatica MDM:
• Golden Record Management: Informatica MDM identifies and manages "golden records,"
which are the most accurate and complete versions of master data entities.
• Data Enrichment: Informatica MDM allows for data enrichment through the integration of
external data sources and services, improving the quality and depth of master data.
• Hierarchy Management: Informatica MDM supports the creation and management of
hierarchical relationships between master data entities, which is crucial for accurate reporting
and analysis.
Cons of Using Informatica MDM:
• Complex Implementation: Implementing Informatica MDM can be complex and time-
consuming due to the need for data modeling, integration, and customization.
• High Initial Cost: The upfront cost of licensing, implementation, and training can be significant,
especially for smaller organizations.
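A toy illustration of the golden-record idea in plain pandas, not Informatica's actual matching engine: records for the same customer arrive from two systems, are matched on a key, and the most recently updated attributes survive. The matching rule here is deliberately simplistic.

```python
import pandas as pd

# The same customer arrives from two systems with conflicting details.
records = pd.DataFrame({
    "source": ["crm", "billing", "crm"],
    "email": ["[email protected]", "[email protected]", "[email protected]"],
    "name": ["John Smith", "J. Smith", "Jane Wright"],
    "updated": pd.to_datetime(["2022-01-15", "2022-03-01", "2022-02-10"]),
})

# Simplistic survivorship rule: match on email, keep the newest attributes.
golden = (records.sort_values("updated")
                 .groupby("email", as_index=False)
                 .last())
print(golden)  # one "golden record" per customer
```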
DATA SOURCES
APPLICATION SERVICES
• "Application services" typically refer to the various software tools, platforms, and services used to design, develop, deploy, and manage BI applications and solutions. These services play a crucial role in leveraging big data technologies for BI purposes.
Tools and their application services in the BI architectural category, along with reasons for their use:
• Apache Hive: Provides a SQL-like query language (HiveQL) to query and analyze data stored in HDFS.
Reasons: These tools support the storage and processing of large datasets, making them suitable for handling big data in BI
applications.
• Teradata, Microsoft SQL Server Data Warehouse: On-premises or cloud-based data warehousing platforms provide powerful
analytics capabilities.
Reasons: These platforms provide high-speed querying and storage for analytical data used in BI applications.
• Talend, Apache NiFi: These tools help in extracting, transforming, and loading (ETL) data from various sources into the BI system.
Reasons: ETL tools streamline the process of integrating data from various sources into the BI environment.
• Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform: Cloud services offer scalable infrastructure and tools for
big data processing, storage, and analytics.
Reasons: Cloud platforms provide on-demand resources for building and scaling BI solutions without large upfront investments.
CLOUD APPLICATION & DATABASES
• Cloud application and database refers to utilizing cloud-based applications and databases for storing, managing, and analyzing
data to support BI processes. These cloud-based tools play a significant role in modern BI architectures by offering scalability,
flexibility, and ease of use.
Google Data Studio:
• Reasons for Use: Google Data Studio is a cloud-based tool for creating customizable reports and dashboards that are shareable and interactive. It's suitable for integrating data from various Google services.
Qlik Sense Cloud:
• Reasons for Use: Qlik Sense Cloud is a cloud-based self-service analytics platform that allows users to create, explore, and share visualizations. It's designed for collaborative data discovery.
Google Cloud SQL:
• Reasons for Use: Google Cloud SQL offers managed database services for MySQL, PostgreSQL, and SQL Server. It's designed to simplify database management for BI applications.
Azure Cosmos DB:
• Reasons for Use: Azure Cosmos DB is a globally distributed NoSQL database service. It's suitable for BI applications that require global availability and low-latency queries.
BUSINESS PROCESSES (CONT'D)
• Business processes refer to the structured workflows, methodologies, and activities that organizations implement to utilize BI solutions effectively. These processes are supported by a variety of big data tools to ensure that data-driven insights are integrated into decision-making.
Some categories of tools commonly used for various aspects of business processes:
Workflow and Process Automation:
• Workflow Management Systems: Tools like Trello, Asana, and Monday.com help teams organize and streamline their workflows and tasks.
• Business Process Management (BPM) Software: Tools like Nintex, Appian, and Bonita facilitate the design, automation, and optimization of business processes.
Data Cleansing and Transformation:
• Data Cleaning Tools: OpenRefine, Trifacta, and DataWrangler help clean and transform raw data into usable formats.
• Data Transformation Tools: Apache Spark, Python, and SQL are used to preprocess and transform data for analysis.
Customer Relationship Management (CRM):
• CRM Software: Salesforce, HubSpot, and Zoho CRM manage customer interactions, sales, and marketing activities.
Advanced Analytics and Machine Learning:
• Data Science Platforms: Tools like IBM Watson Studio and Databricks provide collaborative environments for data science and analytics.
• Machine Learning Tools: Python libraries like scikit-learn and TensorFlow, along with R, are used for advanced analytics and predictive modeling.
BUSINESS PROCESSES
ENTERPRISE APPLICATIONS
■ Enterprise applications are software systems designed to support and streamline complex business processes and operations
within an organization. These applications typically cover various functional areas such as customer relationship management
(CRM), enterprise resource planning (ERP), human resources management (HRM), supply chain management (SCM), and
more.
SAP Business Suite
Benefits:
• Comprehensive suite of integrated enterprise applications for various business functions.
• Offers modules for ERP, CRM, HR, supply chain management, and more.
Drawbacks:
• High implementation and licensing costs, making it more suitable for larger enterprises.
• Complex and lengthy implementation process.
Oracle E-Business Suite
Benefits:
• Integrated suite of applications for financials, supply chain, HR, and more.
• Robust reporting and analytics capabilities.
Drawbacks:
• High licensing costs and additional fees for add-ons and customizations.
• Complex implementation and integration process.
DATA SERVICES
• Data services are crucial for ensuring that data is accessible, reliable, and usable for various purposes, including analytics,
reporting, and decision-making. Here are some key points about the data services category, along with commonly used tools and
their reasons for use:
Tools: Informatica PowerCenter, Talend, Apache NiFi.
Reasons for Use: Data integration tools help organizations create a single source of truth, improve data quality, and support better decision-
making by providing accurate and up-to-date information.
Talend Data Services Platform
Benefits:
• Comprehensive data integration platform with data services capabilities.
• Supports API creation, data access, and seamless integration into applications.
Drawbacks:
• Pricing may be a concern for smaller organizations.
• Complexity can be overwhelming for simple data integration tasks.
• Requires more memory and processing power for handling extensive data transformations.
Benefits of Informatica PowerCenter:
Metadata Management: The platform offers robust metadata management capabilities, allowing users to document, track, and manage the
lineage of data transformations, ensuring data governance and compliance.
Strong Ecosystem: Informatica has a large community of users, extensive documentation, and a range of training resources, making it easier
for users to learn and troubleshoot the platform.
EXTRACTS
• "Extracts" refer to subsets of data that are obtained from a larger dataset for specific purposes, such as analysis, reporting, or data integration. Extracts are commonly used to improve efficiency by working with a smaller portion of data, especially when dealing with large and complex datasets.
Informatica PowerCenter:
• Reasons for Use: Informatica PowerCenter provides a comprehensive platform
for data integration and ETL processes. It offers features for data cleansing,
transformation, and data quality, making it suitable for creating accurate and
reliable data extracts.
SAS Data Integration Studio:
• Reasons for Use: SAS Data Integration Studio is part of the SAS suite and offers
comprehensive ETL capabilities. It's suitable for organizations that use SAS for
analytics and want to create data extracts for analysis.
IBM InfoSphere DataStage:
• Reasons for Use: IBM InfoSphere DataStage provides ETL capabilities with a
focus on data quality and data integration. It's suitable for organizations
looking to create high-quality data extracts for analysis and reporting.
Alteryx:
• Reasons for Use: Alteryx offers a self-service data preparation and ETL
platform. It's suitable for analysts and business users who need to create
extracts and perform data transformations without extensive technical skills.
SPREADSHEETS
• Spreadsheets are widely used for tasks such as data entry, calculations, reporting, and basic data analysis. Here are some
common spreadsheet tools:
Microsoft Excel:
Reasons for Use: Microsoft Excel is one of the most widely used spreadsheet tools. It offers a comprehensive set of
features for data manipulation, calculations, charting, and analysis. It's suitable for individuals, businesses, and
organizations of all sizes.
• Pros: Extensive features for data analysis, including functions, formulas, pivot tables, and charts.
Familiar user interface and compatibility with other Microsoft Office applications.
Integration with external data sources and add-ins for extended functionality.
• Cons: Limited collaboration features, especially in real-time.
Large datasets can slow down performance.
May require some learning to fully utilize advanced features.
Google Sheets:
• Reasons for Use: Google Sheets is a cloud-based spreadsheet tool that offers collaborative features and integration with other Google Workspace applications. It's suitable for collaborative projects and remote teams.
• Pros: Real-time collaboration and sharing capabilities.
Cloud-based, accessible from any device with an internet connection.
Basic data analysis features and integration with Google Drive.
• Cons: Fewer advanced features compared to desktop-based spreadsheet tools.
Limited offline access without internet connectivity.
May have limitations for complex data analysis tasks.
UNSTRUCTURED DOCUMENTS
• Tools for handling unstructured documents are designed to manage, analyze, and extract valuable information from documents
that do not have a predefined structure, such as text files, PDFs, emails, images, and more. These tools use techniques like
natural language processing (NLP) and machine learning to process and derive insights from unstructured content.
Apache Tika: Reasons for Use: Apache Tika is an open-source library that extracts text and metadata from various types of unstructured documents. It's suitable for developers looking to integrate document parsing capabilities into their applications.
• Pros:
• Supports a wide range of document formats, including text, PDF, HTML, images, and more.
• Provides a consistent API for extracting content and metadata from documents.
• Cons:
• Primarily a library; requires integration into existing applications.
• Limited out-of-the-box features for more advanced document analysis.
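A quick sketch with the tika Python package, which drives Apache Tika from Python (a Java runtime is assumed to be available, and the file name is hypothetical):

```python
from tika import parser

# Parse a document of (almost) any format; Tika detects the type itself.
parsed = parser.from_file("quarterly_report.pdf")

# Extracted metadata and text content come back in a dictionary.
print(parsed["metadata"].get("Content-Type"))   # detected format
print((parsed["content"] or "")[:500])          # first 500 chars of text
```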
IBM Watson Discovery: Reasons for Use: IBM Watson Discovery is an AI-powered platform that analyzes unstructured text data for insights. It's suitable for organizations looking to gain insights from textual content.
• Pros:
• Utilizes AI and NLP to extract insights, entities, relationships, and sentiment from text.
• Supports multiple document formats, including PDFs, web pages, and more.
• Cons:
• Commercial software with associated costs.
• Limited to IBM's ecosystem and might require data to be transferred to the cloud.
SUMMARY

Data Visualization and Reporting: TABLEAU, QLIKVIEW, MS POWER BI
Ad Hoc Analysis: DATAPINE
OLAP and Multidimensional Analysis: Oracle OLAP, Microsoft Analysis Services (SSAS)
Data Integration and MDM: Talend, Informatica PowerCenter / MDM, Apache NiFi
Data Warehousing and Big Data: Amazon Redshift, Google BigQuery, Apache Spark, Apache Kafka
REFERENCES
https://www.tableau.com/learn/articles/business-intelligence
https://peakup.org/blog/power-bi-create-data-alerts/
https://www.getapp.com/business-intelligence-analytics-software/a/sas-visual-analytics/
https://www.techtarget.com/searchbusinessanalytics/definition/ad-hoc-analysis
https://www.datapine.com/blog/ad-hoc-reporting-analysis-meaning-benefits-examples/#tool-example
https://www.tableau.com/learn/articles/dashboards/what-is
https://aws.amazon.com/what-is/olap/#:~:text=Online%20analytical%20processing%20(OLAP)%20is,smart%20meters%2C%20and%20internal%20systems.
https://learn.microsoft.com/en-us/office/client-developer/integration/integrate-with-office
https://docs.oracle.com/middleware/11119/wcc/use-desktop/ms_office.htm
https://www.ibm.com/products/spss-statistics
https://snowplow.io/blog/data-discovery/
https://www.graphable.ai/software/what-is-domo-analytics/#:~:text=Domo%20analytics%20is%20a%20complete,all%20in%20a%20collaborative%20setting.
https://www.forbes.com/advisor/business/software/best-data-visualization-tools/#zoho_analytics_section
https://www.sas.com/en_ca/insights/analytics/big-data-analytics.html
https://careerfoundry.com/en/blog/data-analytics/data-analytics-tools/#apache-spark
https://dynamics.folio3.com/blog/mobile-business-intelligence-bi/
REFERENCES (CONT'D)
https://kafka.apache.org/
https://powerbi.microsoft.com/en-ca/business-analysts/
https://www.informatica.com/ca/products/master-data-management.html
https://spark.apache.org/
https://www.teradata.com/
https://www.oracle.com/ca-en/database/
https://learn.microsoft.com/en-us/analysis-services/analysis-services-overview?view=asallproducts-allversions
https://aws.amazon.com/redshift/
https://learn.microsoft.com/en-us/analysis-services/ssas-overview?view=asallproducts-allversions
Q/A
THANK YOU