EDM - E1 - Data Architecture and Modeling - Data Architecture v1.1

Data Architecture and Modeling

Uploaded by

mukhopadhyay00

EDM - E1 - Data Architecture and Modeling –

Data Architecture

June 26, 2024 TCS Public


EDM – E1 – Data Architecture and Modeling – Training Lecture Series

• EDM – E1 – Data Architecture and Modeling - Data Architecture

• EDM – E1 – Data Architecture and Modeling - Data Modeling Overview

• EDM – E1 – Data Architecture and Modeling - Normalization

• EDM – E1 – Data Architecture and Modeling - Dimensional Modeling



Data Architecture Agenda
• Data Architecture Overview
• Data Visualization
• Data Flow Diagrams
• Business Data
• Data Storage & Administration
• Data Storage vendors
• RAID
• Backup, Recovery, and Archival purging
• Data Modeling
• Corporate 3NF Modeling
• Dimensional Modeling
• Organizational Roles
• Functional Activities of a DBA
• References


Today’s Enterprise Data – A Reality Check
• Highly Dispersed

• No well-defined Ownership

• No well-defined standards on Data Quality and Security

• Highly Duplicated

• Highly Manual Processes

• Organizational Conflicts (“Trust In Me…..”)

• Disparate set of technologies for implementing the same processes in different departments

• The rate at which business requirements come into the system is far greater than the rate of
implementation by IT

• 1 IT :: M Business

• IT might be considered as a “Cost Center” – “Data” is NOT a “Strategic Asset”

• Oftentimes, Data Architects are visibly required (especially through critical failures) and are
given high levels of responsibility, but without the resources or authority to match the
expectations



Enterprise Data Management (EDM) Framework

• Data Governance (DG): Achieve organizational alignment (people, along with processes and technologies) on the governing of data management issues throughout the enterprise.

• Master Data Management (MDM): Identify and maintain a single, unified view of reference data across the enterprise.

• Data Security (DS): Ensure that data is kept safe and is accessible to pertinent users and groups, internally and externally. It ensures access to data is suitably controlled throughout the flow of data in an enterprise, by setting up data security policies and processes.

• Metadata Management (MM): Ensure that metadata is properly created, stored, and controlled. Metadata (Data about Data) is information about data’s meaning (semantics) and structure (syntax, constraints & relationships).

• Data Quality Management (DQM): Eliminate bad data and increase the reliability, effectiveness, and usability of data. Typically this includes updating, standardizing, and de-duplicating of data. Establishes processes that ensure quality data on a regular and consistent basis.

• Data Architecture (DA): Specify how data is created, processed, stored, utilized and distributed. Covers data modeling, data flow analysis, tuning, storage, visualization, and infrastructure.


EDM Core Activities and Interactions

[Figure: EDM components (Data Governance, Data Quality, Data Security, Metadata Management, MDM, Data Architecture) with their core activities and the interacting systems]

Core activities:
• Data Stewardship
• Data Privacy
• Data Visualization
• Data Analysis
• Data Discovery
• Data Profiling
• Data Cleansing
• Data Monitoring
• Data Modeling (Logical)
• Data Modeling (Physical)
• Data Flow Analysis
• Data Storage
• Database Administration

Interacting systems and initiatives: BI/DWH, Data Integration, Data Migration, Analytics, Data Consolidation, Data Synchronization, Data Mining, ERP, CRM, SCM, BPM/SOA, KM, Unstructured Data


Data Architecture Overview

• Specifies how data is processed, stored, and utilized in an enterprise. Covers mainly Data Modeling, Storage and Administration, Policies, Visualization, and Infrastructure

• Takes into account enterprise-wide data needs. Comes in handy when an enterprise pursues compliance with regulations such as Basel or Sarbanes-Oxley

• Storage covers Retention, Backup, Recovery, Archival, and Purging (longevity) of data

• Infrastructure covers software, hardware, networking, RDBMS etc. Defines policies for usage of various techniques for data compression, performance tuning, capacity planning, growth etc.



Data Flow Diagram (DFD)
• A Data Flow Diagram (DFD) is a graphical representation of the
"flow" of data through an information system. – Ref. 2

• A data flow diagram can also be used for the visualization of data
processing (structured design).

• It is common practice for a designer to draw a context-level DFD first, which shows the interaction between the system and outside entities.

• With a data flow diagram, users are able to visualize
– how the system will operate,
– what the system will accomplish, and
– how the system will be implemented.


DFD Development Approaches
• Top-Down Approach
– The system designer makes a context level DFD, which shows the interaction (data flows)
between the system (represented by one process) and the system environment (represented by
terminators).
– The system is decomposed in a lower-level DFD (Level 0) into a set of processes, data stores, and
the data flows between these processes and data stores.
– Each process is then decomposed into an even lower-level diagram containing its sub-processes.
– This approach then continues on the subsequent sub-processes until a necessary and
sufficient level of detail is reached; a process at that level is called a primitive process
(aka chewable in one bite).

• Event Partitioning Approach – Ref. 3
– Construct a detailed DFD:
• A list of all events is made.
• For each event, a process is constructed.
• Each process is linked (with incoming data flows) directly with other processes or via
data stores, so that it has enough information to respond to a given event.
• The reaction of each process to a given event is modeled by an outgoing data flow.
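
The event partitioning steps above can be sketched in a few lines of Python. This is only an illustration: the `Process` class, the `partition_by_events` helper, and the two sample events are hypothetical names chosen for the sketch, not part of any DFD tool or of the referenced text.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One DFD process built for a single event (hypothetical structure)."""
    name: str
    inputs: list = field(default_factory=list)   # incoming data flows / data stores
    outputs: list = field(default_factory=list)  # outgoing data flows (the reaction)


def partition_by_events(events):
    """Build one process per event from {event: (incoming flows, outgoing flows)}."""
    return [
        Process(name=f"Handle {event}", inputs=list(incoming), outputs=list(outgoing))
        for event, (incoming, outgoing) in events.items()
    ]


# Step 1: list the events (sample events, assumed for illustration).
events = {
    "Customer places order": (["order details", "Customers store"], ["order confirmation"]),
    "Payment received": (["payment", "Orders store"], ["receipt"]),
}

# Steps 2-4: one process per event, with its incoming and outgoing data flows.
processes = partition_by_events(events)
for process in processes:
    print(process.name, "<-", process.inputs, "->", process.outputs)
```

Each event yields exactly one process, so the list of events directly drives the shape of the detailed DFD.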



Simple DFD Notation
[Figure: simple DFD notation]


Data Flow Diagram
[Figures: two example data flow diagrams]


Data Flow Diagram Tools
• CA ERwin Data Modeler – A data modeling tool
• ConceptDraw – A Windows and Mac OS X data flow diagramming tool
• Dia – A free, open-source diagramming tool with flowchart support
• Kivio – A free, open-source diagramming tool for KDE
• Microsoft Visio – A Windows diagramming tool which includes very basic DFD support (images only, does not record data flows)
• SmartDraw – A Windows diagramming tool with Yourdon and Coad process notations and Gane and Sarson process notation
• System Architect – An enterprise architecture tool, supporting Coad/Yourdon, Gane & Sarson, Ward/Mellor, and SSADM notations and techniques
• DFDdeveloper – An open source software application that allows Microsoft Office users to create interactive leveled data flow diagrams and data dictionaries


Business Data: Sample Links

• Overview
– http://web.cs.wpi.edu/~matt/courses/cs563/talks/datavis.html

• Modern Approaches
– http://www.smashingmagazine.com/2007/08/02/data-visualization-modern-approaches/

• The Good, The Bad, and The Ugly
– http://www.math.yorku.ca/SCS/Gallery/

• A Periodic Table of Visualization Methods
– http://www.visual-literacy.org/periodic_table/periodic_table.html



The Ever Growing Digital Universe – IDC 2008

• “The Digital Universe in 2007 stood at 281 Billion Gigabytes and with an annual
growth rate of almost 60%, it is projected to reach nearly 1.8 Zettabytes in 2011.”

• Causes:
– Digital Cameras
– Digital Surveillance Cameras
– Digital Television
– Better understanding of replication trends
– Increasing internet access in emerging countries
– Sensor-based applications
– Data Centers supporting cloud computing and social networks

• Digital Universe in 2007 / person == 45 GB or, total of 17 billion 8 GB iPhones !!!

• A person’s digital shadow – digital information generated about the average person
on a daily basis (financial records, mailing lists, surfing histories, images by
security cameras etc.) – now surpasses the amount of digital information
individuals actively create themselves (creating pictures, sending emails, digital
voice calls etc.)
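
The IDC projection quoted at the top of this slide can be checked with simple compound growth. The sketch below treats "almost 60%" as exactly 60% and uses decimal units (1 zettabyte = 1000 exabytes); both are assumptions made for the arithmetic, and `project_size` is a hypothetical helper name.

```python
def project_size(start_eb, annual_growth, years):
    """Compound a starting size (in exabytes) forward at a fixed annual growth rate."""
    return start_eb * (1 + annual_growth) ** years


size_2007_eb = 281   # 281 billion gigabytes = 281 exabytes (figure from the quote)
growth = 0.60        # "almost 60%" treated as exactly 60% for this sketch

size_2011_eb = project_size(size_2007_eb, growth, 2011 - 2007)
size_2011_zb = size_2011_eb / 1000   # decimal units assumed

print(f"Projected 2011 size: {size_2011_zb:.2f} ZB")   # prints 1.84 ZB
```

Four years of 60% growth multiplies the starting size by about 6.55, which lands close to the "nearly 1.8 Zettabytes" in the quote.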



Data Storage
• Refers to storing structured, unstructured, or semi-structured data

• For structured data storage, mainly databases are used. Unstructured data
can be stored in proprietary systems, and the latest trends in databases
allow storage of unstructured data as well

• Depending on the data volumes, internal or external hard disks are used. Latest trends in database technologies allow data to be compressed and stored

• Apart from ‘raw’ data, RDBMS systems have overheads in terms of indexes, partitions etc., which cause the size on disk to be 2-4 times that of the raw data

• Capacity planning for data storage takes into account existing data
volumes and future growth. Inadequate capacity planning can impact the
performance of systems
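
A rough capacity estimate can combine the 2-4x RDBMS overhead with expected growth. The following is a minimal sketch only: the function name, the 500 GB starting volume, the 30% annual growth, and the 3-year horizon are all assumptions chosen for illustration.

```python
def estimate_disk_gb(raw_gb, overhead_factor, annual_growth, years):
    """Estimate disk space needed after `years` of compound growth.

    overhead_factor is the ratio of on-disk size to raw data size
    (2 to 4 according to the rule of thumb on this slide).
    """
    future_raw_gb = raw_gb * (1 + annual_growth) ** years
    return future_raw_gb * overhead_factor


# Hypothetical inputs: 500 GB of raw data today, growing 30% per year, 3-year horizon.
low = estimate_disk_gb(500, 2.0, 0.30, 3)
high = estimate_disk_gb(500, 4.0, 0.30, 3)
print(f"Plan for roughly {low:.0f} to {high:.0f} GB of disk")
```

Reporting a range rather than a single number keeps the uncertainty in the overhead factor visible to whoever sizes the storage.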



BI Components – Data Store
[Figure: data store components of a BI solution]

• Staging Area
– Transient Staging Area
– Persistent Staging Area (e.g. Data Replication Area)
• xDW
– Integrated data of CPM, Risk etc. solutions
– Unstructured Data Storage
• ADW
– Real Time Trickle feed + Heavy Batch Load
– Operational (simple) + Strategic (complex) Queries
• Repositories (ETL/Reporting etc.)
– Metadata storage for ETL, Reporting, DB, Data Model etc. tools
• ODS
– Real Time / operational Data Storage
– Low latency loads
– Simple-Average Operational Queries
• EDW, DW, DM
– DW / EDW: High volume Data Integration in batch mode
– Data Marts: Subject Area specific with medium data volumes
• MDDB
– Small scale data volumes primarily for OLAP Analysis
– Specialized in Planning, Budgeting, Forecasting and What-if / scenario building applications

DW Scale | Raw Data Size | DB Size
Small | 50GB | 100-200GB
Medium | 100-500GB | 500GB-1TB
Large | 500GB-1TB | 1-3TB
VLDW | 1TB+ | 3TB+


Data Storage Vendors – Gartner MQ for DW Databases
[Figure: Gartner MQ for DW Databases]


Backup Recovery and Archival Purging
• Backups are used to restore a system to its original state after a data loss
event

• Taking data backups is a periodic activity, whereas recovery takes place only after a data loss or system crash event has occurred

• Archival is the process of transferring data from online to offline storage. It deletes data from online storage and migrates it to offline storage

• Backups differ from archives in the sense that archives are the primary
copy of data and backups are a secondary copy of data

• Purging is deleting data that is no longer needed. Data Warehouse purging is simpler compared with OLTP purging. Database partitioning is one of
the common techniques generally used for purging old data.

• Impacts on performance and regulatory guidelines generally define the data retention and purging strategy for an organization
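
The interplay of archival, purging, and partitioning can be sketched as a small classification routine over monthly partitions. Everything here is hypothetical: the function name, the 12-month online window, and the 84-month retention period are assumptions for illustration, since real thresholds come from performance needs and regulatory guidelines.

```python
from datetime import date


def classify_partitions(partitions, today, online_months, retention_months):
    """Sort monthly partitions (year, month) into keep / archive / purge lists.

    - younger than online_months: stays in online storage
    - between online_months and retention_months: moved to offline archive
    - older than retention_months: purged (dropped as a whole partition)
    """
    plan = {"keep": [], "archive": [], "purge": []}
    for year, month in partitions:
        age_in_months = (today.year - year) * 12 + (today.month - month)
        if age_in_months < online_months:
            plan["keep"].append((year, month))
        elif age_in_months < retention_months:
            plan["archive"].append((year, month))
        else:
            plan["purge"].append((year, month))
    return plan


# Hypothetical policy: 12 months online, 7 years (84 months) total retention.
plan = classify_partitions(
    partitions=[(2024, 5), (2023, 1), (2016, 3)],
    today=date(2024, 6, 26),
    online_months=12,
    retention_months=84,
)
print(plan)
```

Because each partition is handled as a unit, purging becomes a matter of dropping whole partitions rather than deleting individual rows, which is why partitioning is a common technique for purging old data.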



RAID Storage Subsystems

• RAID (Redundant Array of Inexpensive Disks)
– Later, the term became “Redundant Array of Independent Disks”

• RAID can provide Fault Tolerance

• Provides fault tolerance against disk failures only; other hardware devices are not covered

• RAID Levels

• RAID Performance


RAID-0
• Striping across disks
• No fault tolerance
• Use all of the disk space
• Best Performance
• Lose one disk and lose the data

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16
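
The block layout in the diagram above follows a simple round-robin rule. As a minimal sketch, assuming four disks and 1-based block numbers as in the diagram (the function name is hypothetical):

```python
def locate_block(block, num_disks):
    """Map a 1-based block number to its (disk, stripe) position, both 1-based."""
    index = block - 1
    disk = index % num_disks + 1      # blocks rotate across the disks
    stripe = index // num_disks + 1   # a new stripe (row) starts every num_disks blocks
    return disk, stripe


for block in (1, 7, 13):
    disk, stripe = locate_block(block, num_disks=4)
    print(f"block {block}: disk {disk}, stripe {stripe}")
```

Consecutive blocks sit on different disks, so large reads and writes are spread over all drives; the same layout means any single disk failure leaves gaps in every stripe.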



RAID-1

• Mirroring – provides a copy or mirror
• Good fault tolerance
• Disk Space Available = Disk Drives / 2
• RAID-1 does not have stripes

Disk 1 | Disk 1 Mirror


RAID-1 I/O Characteristics

• Physical Reads = Logical Reads
• Physical Writes = Logical Writes x 2
• Good read performance – can simultaneously read from both disks in the mirrored pair


RAID-5

• Striping
• Distributed Parity
• Disk space = Disk Drives - 1

1 2 3 Parity

4 5 Parity 6

7 Parity 8 9

Parity 10 11 12


RAID-5 Write

Step 1: Data and Parity are Read

Step 2: New Parity is Calculated (the XOR engine calculates parity)

Step 3: Data and Parity are Written

1 2 3 Parity

4 5 Parity 6

7 Parity 8 9

Parity 10 11 12


RAID-5 I/O Characteristics

• Reading simply reads from the disk that contains the data
• Writing
– Reads the data stripe
– Reads the parity stripe
– XORs the data and writes it out
– Calculates a new parity and writes it out
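
The write sequence above can be illustrated with plain XOR arithmetic on byte strings. This is a simplified sketch with hypothetical helper names and made-up two-byte blocks; real controllers work on full stripe units in hardware or in the storage driver.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_data, old_parity, new_data):
    """New parity for a small write: old parity XOR old data XOR new data."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


# One stripe on a 4-disk array: three data blocks plus parity (sample values).
d1, d2, d3 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
parity = xor_bytes(xor_bytes(d1, d2), d3)

# Overwrite d2: read old data and old parity, then write new data and new parity.
new_d2 = b"\xaa\xbb"
new_parity = updated_parity(d2, parity, new_d2)

# The shortcut agrees with recomputing parity from the whole stripe.
assert new_parity == xor_bytes(xor_bytes(d1, new_d2), d3)

# If the disk holding d1 fails, its block is rebuilt from the survivors.
rebuilt_d1 = xor_bytes(xor_bytes(new_parity, new_d2), d3)
assert rebuilt_d1 == d1
```

The two reads (old data, old parity) and two writes (new data, new parity) are the four physical I/Os behind a single logical RAID-5 write.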



RAID Cost Summary
[Figure: protection versus cost for RAID-1, RAID-5, and RAID-0]


RAID Performance Summary
[Figure: protection versus performance for RAID-1, RAID-5, and RAID-0]


RAID Summary
• RAID-0
– No Protection
– Best Performance
– Least Cost

• RAID-1
– Best Protection
– Good Performance
– Most Expensive

• RAID-5
– Good Protection
– Worst Performance
– Least Expensive Fault-Tolerant Option


Summary of RAID

• RAID overhead can significantly affect performance

• RAID 0 offers no protection and no overhead

• RAID 1 offers good protection but with a 2x write overhead

• RAID 5 offers moderate protection but with a 4x write overhead
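
The capacity formulas and write overheads from the RAID slides can be put side by side in a short sketch. The function names and the 8-disk example are assumptions for illustration; the factors are the ones stated in this deck (no overhead for RAID 0, 2x for RAID 1, 4x for RAID 5).

```python
# Physical I/Os per logical write, as stated on the slides.
PHYSICAL_IOS_PER_WRITE = {0: 1, 1: 2, 5: 4}


def usable_disks(level, total_disks):
    """Number of disks' worth of usable space for RAID 0, 1, or 5."""
    if level == 0:
        return total_disks          # all disk space is used
    if level == 1:
        return total_disks / 2      # half the drives hold mirror copies
    if level == 5:
        return total_disks - 1      # one disk's worth of space holds parity
    raise ValueError(f"unsupported RAID level: {level}")


def physical_ios(level, logical_writes):
    """Physical I/Os generated by a number of logical writes."""
    return logical_writes * PHYSICAL_IOS_PER_WRITE[level]


# Hypothetical array of 8 disks handling 100 logical writes.
for level in (0, 1, 5):
    print(f"RAID-{level}: {usable_disks(level, 8)} usable disks, "
          f"{physical_ios(level, 100)} physical I/Os")
```

Seen together, the numbers show the trade-off summarized above: RAID 5 gives up the least capacity among the fault-tolerant levels but pays the most per write.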




What is Modeling and Data Modeling?
• Modeling is an efficient and effective way to represent the organization’s
needs. It provides information in a graphical way so that the members of an
organization can understand and communicate the business rules and
processes

• Data Modeling refers to structuring and organizing data in order to present a logical and graphical representation of the information needs

• It provides the basis for physical implementation: these structures are then typically implemented in a Database Management System

• The process of Data Modeling also imposes constraints or limitations on the data placed within the structure

• The goal of a Data Modeling exercise is to model the “perceived real world” of the “business”


Data Model and Modelers
• A Data Model is a conceptual representation of data structures (tables)
required for a database and is very powerful in expressing and
communicating the business requirements.

• A Data Model visually represents the nature of data, the business rules governing the data, and how it will be organized in the database.

• A good data model (Ref. 1)
– Is built on consistent application of sound technique
– Embodies a business context
– Is used to improve the business

• Data Modelers are responsible for designing the data model. They
communicate with functional teams to understand the business
requirements and with technical teams to implement the database.


[Figure: Development Cycle - Example of Logical Data Modeling]


Characteristics of a high-quality Data Model
• Embodies Business Plans, Policies, and Strategies

• Uses Recognized Set of Rules

• Involves Domain Experts

• Can be transformed into High-quality Design

• Is created in context of other business architecture elements

• Is created in the context of Enterprise Architecture

• Is created in context of Overall Data Quality Lifecycle

• Depends on support infrastructure

• Involves the right stakeholders


[Figure: Example: Corporate 3NF Relational Model]


Dimensional Modeling – Definitions and Terms
• A dimensional data model comprises one or more dimension tables
and fact tables.

• A dimension table describes the business entities of an enterprise, represented as hierarchical, categorical information such as
time, departments, locations, and products. Dimension tables are
sometimes called lookup or reference tables.

• A fact (measure) table contains measures (sales gross value, total units
sold) and dimension columns. These dimension columns are actually
foreign keys from the respective dimension tables.

• The performance of queries in a dimensional data model can be significantly improved when materialized views are used.

• A materialized view is a pre-computed table comprising aggregated or joined data from fact and possibly dimension tables; it is also known as a
summary or aggregate table.
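
The terms above can be made concrete with a tiny star schema built in an in-memory SQLite database. This is a sketch under stated assumptions: all table names, column names, and rows are invented for illustration, and because SQLite has no materialized views, a pre-computed summary table created with CREATE TABLE ... AS SELECT stands in for one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive, categorical information.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: measures plus foreign keys to the dimension tables.
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    time_id     INTEGER REFERENCES dim_time(time_id),
    units_sold  INTEGER,
    gross_value REAL
);

INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_time VALUES (1, 2024, 5), (2, 2024, 6);
INSERT INTO fact_sales VALUES (1, 1, 10, 100.0), (1, 2, 5, 50.0), (2, 2, 3, 45.0);

-- Stand-in for a materialized view: a pre-computed aggregate table.
CREATE TABLE agg_sales_by_category AS
SELECT p.category,
       SUM(f.units_sold)  AS units_sold,
       SUM(f.gross_value) AS gross_value
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
GROUP BY p.category;
""")

# Reports read the small summary table instead of re-joining the fact table.
summary = conn.execute(
    "SELECT category, units_sold, gross_value "
    "FROM agg_sales_by_category ORDER BY category"
).fetchall()
print(summary)
```

In a database with native materialized views, the summary would be refreshed by the engine; with the plain table used here, it has to be rebuilt whenever the fact table changes.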



[Figure: Example of Dimensional Modeling]


Corporate 3NF (Relational) vs. Dimensional Data Modeling

Relational Data Model | Dimensional Data Model
Data is stored in an RDBMS | Data is stored in an RDBMS or multidimensional databases
Tables are units of storage | Cubes are units of storage
Data is normalized and used for OLTP; optimized for OLTP processing | Data is denormalized and used in data warehouses and data marts; optimized for OLAP
Several tables and chains of relationships among them | Few tables; fact tables are connected to dimension tables
Volatile (several updates) and time variant | Non-volatile and time invariant
Normal reports | User-friendly, interactive, drag-and-drop multidimensional OLAP reports

Data Modeling Overview and Dimensional Modeling will each be covered as a separate topic.


Challenges in a Data Modeling World
• 4 Modelers → at least 6 different Data Models!

• Models not “up-to-date” or non-existent! – non-strategic

• Models incomplete – not enough interface with business, or business is not clear

• Sometimes, conflicting requirements from the business

• ETL (3NF) vs. Reporting (Dimensional) – Clash of the Titans (Understanding & Skills)

• “It is easier to change the Physical Structure, because that is what we see”

• Not enough documentation

• Logical consistency, but not necessarily physical consistency



Challenges in a Data Modeling World
• Believe it or not, tools like ERwin have not progressed much over the years

• Not enough funding → Modelers often overloaded → Onus is on modelers to chase the
business(es)

• Pull vs. Push
– Modelers chase the business initially
– Later, the business should be providing any changes or new functionality

• Different naming standards and tools when mergers happen → Harder to integrate and
maintain

• Standards on data quality, security, metadata management, and master data management
differ in degrees of implementation across departments

• Business asks data modelers to take decisions on definitions/meaning of data inconsistencies, rather than the other way around

• The rate at which businesses move is much faster than the rate at which data models get updated


Information Architecture – Steps

Perspective | Owner | IA Stages
Scope | Planner | Context Definition
Enterprise Model | Program Owner / Sponsor | Conceptual Data Model
System Model | Information Architect | Logical Data Model
Technology Model | DB Architect | Physical Data Model
Component Model | Developer | Data Structures
Functioning Enterprise | User | Data Governance


Key Roles – Database Architect Team
Role Title (Allocation): Description of Primary Responsibility

• Senior Data Architect (Full time): Owns and executes the Enterprise Logical Data Modeling activity. Is responsible for constructing the Logical Data Model deliverable. Plays a lead role in shaping the course of the Data Modeling activity by facilitating a thorough understanding of the entities and relationships.

• Data Architect (Full time): Reports to the Senior Data Architect and performs the functions of a junior data modeler. May occasionally lead a modeling session. Helps in drilling down into lower levels of detail in specific subject areas, as well as in documenting issues and resolving conflicts, in addition to drawing the E-R diagrams and capturing the meta data.

• Meta Data Administrator (Full time): Supports the Enterprise Data Modeling exercise with a robust set of tools for capture and reporting of meta data. Responsible for ensuring that the meta data tools are up and available, and also able to respond to evolving meta data needs in the organization.

• Business Process Owner (Part time): Owns one or more of the business processes that interact with the data being modeled. The process owners can in most cases identify 80% of the key attributes of an entity, in a group of entities.

• Business Analyst (Full time): Creates process models for all processes that interact with the data. Can drill down into very low levels of detail about the business process. Primarily focuses on the Business Process - Entity/Attribute detail.

• Application Architect (Part time): Supports the Enterprise Data Modeling exercise with a robust set of tools for capture and reporting of meta data. Responsible for ensuring that the meta data tools are up and available, and also able to respond to evolving meta data needs in the organization.

• Data Base Administrator (Part time): Owns the physical data base at the table, columns, views and indexes level of detail. Sometimes, a DBA can contribute to the development of a semantic model owing to their deep insight into the existing views and meanings of data.

• Subject Area Expert (Part time): A person who is an expert in the specific subject area being discussed. During the course of an Enterprise Data Exercise, many different subject matter experts will be consulted to help the team better understand the processes and the underlying detail.


Functional Activities of a DataBase Administrator (DBA) Team
• DBMS
– Backup & Recovery
– Archival & Purging
– Security (Users, Groups, Roles, Permissions, Access)
– Monitoring (CPU, Disk I/O, Space)
– Performance Tuning & Optimization (Indexing Strategies, SQL, Statistics, Access Plan Analysis, Caching)
– Log Management
– Capacity Planning
– Scheduling
– Replication
– Data Architecture
• Environment (H/W and S/W)
– Dev, QA, Production
– Change Control and Testing
– Version Control Standards & Policies
– Naming Conventions, Guidelines
– Compliance, Security, Modeling, Query, Installation Standards
• Crisis Management
– Virus/Hacker attacks
– H/W and S/W failures
– Deadlocks, Skewed Data, Runaway Queries
• Project Management & Governance
– Skills & Training
– Mentoring
– Budgeting
– Scheduling & SLA Compliance
– Management Updates & Reporting
– New Project introduction
– Team Interactions (Data Architect, Data Modelers, Data Stewards, Database Developers, Business Users, QA)
– Tool Selection



References

1. The Data Modeling Handbook – A Best-Practice Approach to Building Quality Data Models – Michael Reingruber, William W. Gregory, John Wiley & Sons, Inc., ISBN 0-471-05290-6

2. Wikipedia – www.wikipedia.org

3. Modern Structured Analysis – Edward Yourdon, ISBN 978-0135986240


Dimensional Modeling (contd.)
• A dimensional database is designed and tuned to support the analysis of
business trends and projections.

• Dimensional Modeling optimizes the database for data retrieval and analysis.

• Some of the decisions to be made during the design of a dimensional model are:
– The business processes to be selected for analysis of the subject area to be
modeled.
– Granularity of the fact tables.
– Dimensions and hierarchies to be identified for each fact table.
– Measures for the fact tables.
– Attributes for each dimension table.
– Pattern selection (Star schema, Snowflake schema or Starflake schema)
