Dev Analytics Aggregated DB
Design Analysis
Analytics Problem Context
• Broadly the problem can be broken into three parts
1. Aggregate from one or more sources
2. Bring the aggregated data from the BE to FE DBs
3. Serve up data in a performant way through the portals
• Fundamental Requirements
• Backend Warehouse
• That can comfortably process the expected data volume in a safe, robust manner
• That is highly available
• Frontend DB to serve the queries from the portal
• Should serve queries in an efficient manner
• Must be highly available
• To mitigate the effects of disk performance for concurrent queries, read scale-out is required
• Scale out Model
• The above two are components of a stamp unit, which would be replicated as instances to support further scale-out
Part 1 : Process data
• What happens?
• A large number of rows is inserted into the mining warehouse, so CPU/memory usage can spike
• So, we chunk the data to control resource usage
• Are we set up for efficient writes to the SQL data files?
• Data and log file configurations matter
• Exploit contiguity on disk
• Replication to other secondaries/primaries (for HA) can bring the system to its knees
• So, we chunk the data and provide breathing space
• We need to monitor the replication backlogs
• Are we set up to best contain the replication overheads?
• Log files play a major role
• Read Efficiencies
• Write Efficiencies
• Spindle separation
• Data Categories
• Data that needs to be strongly consistent across partitions
• Scalability
• Currently we have 2 partitions?
• Reference to a document on the effort required to add another partition?
• Real-time
• Currently we process data once a day
• Current Indexing / Query Design Mantra
• Data arrives in time order, so prioritize clustered indexes for writes
• For reads, rely on scan performance (this needs to be validated)
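The chunking idea above can be sketched as follows. This is a minimal illustration, not the deck's actual loader: an in-memory SQLite database stands in for the warehouse, and the `events` table, the `load_in_chunks` helper, the chunk size and the pause length are all assumptions made for the example. Each chunk is committed in its own transaction, and the optional pause is the "breathing space" that lets replication drain.

```python
import sqlite3
import time
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_in_chunks(conn, rows, chunk_size=1000, pause_s=0.0):
    """Insert rows one chunk per transaction, pausing between chunks.

    Returns the number of chunks committed.
    """
    chunks = 0
    for batch in chunked(rows, chunk_size):
        with conn:  # commits the chunk, or rolls it back on error
            conn.executemany(
                "INSERT INTO events (event_time, app_id, value) VALUES (?, ?, ?)",
                batch,
            )
        chunks += 1
        time.sleep(pause_s)  # breathing space between chunks
    return chunks


# Hypothetical stand-in for the warehouse and its time-ordered input.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_time INTEGER, app_id INTEGER, value REAL)")
rows = [(t, t % 7, float(t)) for t in range(2500)]
n_chunks = load_in_chunks(conn, rows, chunk_size=1000)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Smaller chunks bound the size of each transaction (and therefore the log growth per commit) at the cost of more round trips; the right size would need to be measured against the replication backlog.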
Part 1 : Process Data (Cont)
• Data Categories
• Data that needs to be strongly consistent across partitions
• And highly available?
• Data that needs to be replicated
• Checkpoint/Restart Model?
• Scalability
• Currently we have 2 partitions?
• Reference to a document on the effort required to add another partition?
• Real-time
• Currently we process data once a day
• Current Indexing / Query Design Mantra
• Data arrives in time order, so prioritize clustered indexes for writes
• For reads, rely on scan performance (this needs to be validated)
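One possible reading of the checkpoint/restart question above, sketched under stated assumptions: the daily run is split into chunks, and the index of the next chunk is persisted after each one completes. The file name `daily_run.ckpt`, the JSON layout and the `fail_at` switch (used only to simulate a crash) are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next chunk to process (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_chunk"]


def save_checkpoint(path, next_chunk):
    """Record progress atomically: write a temp file, then rename over the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_chunk": next_chunk}, f)
    os.replace(tmp, path)


def run(chunks, path, process, fail_at=None):
    """Process chunks from the last checkpoint; `fail_at` simulates a crash."""
    for i in range(load_checkpoint(path), len(chunks)):
        if i == fail_at:
            raise RuntimeError(f"simulated failure at chunk {i}")
        process(chunks[i])
        save_checkpoint(path, i + 1)


chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
processed = []
ckpt = os.path.join(tempfile.mkdtemp(), "daily_run.ckpt")

try:
    run(chunks, ckpt, processed.extend, fail_at=2)  # first attempt dies midway
except RuntimeError:
    pass
run(chunks, ckpt, processed.extend)  # the restart resumes at chunk 2
```

A crash between `process` and `save_checkpoint` would cause that chunk to be processed again on restart, so the per-chunk work should be idempotent (or the checkpoint should be written in the same transaction as the data).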
Part 2: Move Data to FE while it is serving Data
• What happens
• User queries some application specific or trend data
• While the user query is being processed, new data arrives and is inserted in time order
• These writes are chunked and currently occur over a period of 4 hours
• The writes fragment the non-clustered indexes, and the new data requires SQL statistics to be updated
• We have no partitioning in the FE, so defrag and update statistics are costly operations that also cause CPU/memory usage spikes
• In general, large tables make everything inefficient
• Mantra to write without disrupting the query performance
• Partitioning is highly recommended simply because the fact tables are large
• Table Partitioning
• Note that the partitions are in time order, which also helps purge old data
• Active Passive is a poor cousin
Part 3: Serving Data
• What happens
• User queries some application specific or trend data
• Trend/category-wide data requires querying records across applications
• Some queries require processing at the FE because of the nature of the queries
• The user is asked to select a filter, say over Country (35 choices?), Age (3) and Gender (3), and numbers for the top 20 in that category/subcategory are displayed
• Pre-aggregation and sorting of this size/nature also causes DB size bloat
• Feature Use
• For this kind of heavy lifting, it would be useful to determine to what extent this feature is being used
• Queries can be and are disrupted by the writes that occur daily
• Index Mantras
• Rely on Clustered Indexes to the extent possible
• Use a few carefully chosen non-clustered indexes as necessary
• Query Mantras
• Intelligent Point Scans
• These serve by removing the need for churning data from time order into application order
• Use of Specific Joins
• Nested Loops Join: http://blogs.msdn.com/craigfr/archive/2006/07/26/679319.aspx
• Hash Join: http://blogs.msdn.com/craigfr/archive/2006/08/10/687630.aspx
• Merge Join: http://blogs.msdn.com/craigfr/archive/2006/08/03/merge-join.aspx
• Be conscious of table sizes when designing your sprocs
• Ensure that statistics updates are enabled; auto update is recommended, and a forced update after data insertions is desirable
• When not to rely on SQL to make the appropriate choice for you? Ashish?
• Index and Query Reference Spec
• Ashish’s spec
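To make the bloat concern above concrete, here is a rough back-of-the-envelope sketch. The filter cardinalities are the ones quoted on the slide (which itself marks the country count as uncertain); the `preaggregated_rows` helper and its `categories` and `days` inputs are hypothetical and are not figures from the deck.

```python
from math import prod

# Filter cardinalities as quoted on the slide ("35 choices?" is the slide's own caveat).
filters = {"country": 35, "age": 3, "gender": 3}
top_n = 20

combinations = prod(filters.values())      # distinct filter selections
rows_per_category = combinations * top_n   # pre-aggregated top-N rows per category


def preaggregated_rows(categories, days):
    """Rows stored if every category keeps a top-N list per combination per day.

    `categories` and `days` are hypothetical inputs, not figures from the deck.
    """
    return rows_per_category * categories * days
```

Under these assumptions there are 315 filter combinations and 6,300 pre-aggregated rows per category/subcategory per refresh, before multiplying by the number of categories and retained days. That multiplier is why measuring actual feature use is worthwhile before paying for the storage.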
Active Passive Solution
• Passive
• Helps control the data insertion process without disruption
• Cons
• Not the best utilization of resources
• Monitoring Intensive
• Certainly not elegant
• Data that is sent to Passive (Set A) on Day 1 has to be fed to Passive (Set B) on Day 2, while it also receives data for Day 1
• Nor is it robust to backlogs, should they arise
• At this point we only know that the FE will at least continue to serve, though in these situations it may serve stale data for longer than we would like
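The double-feeding cost described above can be illustrated with a small simulation. This is a sketch of one interpretation, assuming two sets that swap roles after every daily load and a passive set that must first catch up on whatever the active set already holds; `ReplicaSet` and `run_active_passive` are hypothetical names.

```python
class ReplicaSet:
    """A set of FE databases that is either serving (active) or loading (passive)."""

    def __init__(self, name):
        self.name = name
        self.days = []  # days of data loaded so far

    def load(self, day):
        if day not in self.days:
            self.days.append(day)


def run_active_passive(num_days):
    """Simulate daily loads; return (day, passive set name, days loaded that day)."""
    active, passive = ReplicaSet("B"), ReplicaSet("A")
    log = []
    for day in range(1, num_days + 1):
        # The passive set first catches up on what the active set already has,
        # then takes today's data.
        backlog = [d for d in active.days if d not in passive.days]
        todo = backlog + [day]
        for d in todo:
            passive.load(d)
        log.append((day, passive.name, todo))
        active, passive = passive, active  # swap roles: the fresh set now serves
    return log


log = run_active_passive(4)
```

From Day 2 onward each passive set loads two days' worth of data, so every day is written twice overall. If a load overruns, the swap is delayed and the active set keeps serving stale data, which is the backlog weakness noted above.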
Table Partitioning
• General Approach
• Create date-wise partitions with two empty partitions mapped to file-groups
• Empty partitions are used for data insertion and data removal
• Defrag and Update Statistics can be performed on staging partitions created on appropriate file groups
• Since splitting an empty partition is a metadata-only operation, it is quick and helps maintain the same partition configuration
• By appropriate file group separation, you mitigate the impact of conflicting i/o operations
• References
• http://www.google.com/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=2&ved=0CC8QFjAB&url=http%3A%2F%2Fdownload.microsoft.com%2Fdownload%2FD%2FB%2FD%2FDBDE7972-1EB9-470A-BA18-58849DB3EB3B%2FPartTableAndIndexStrat.docx&ei=eKkpUo-lKIauiQewpYGgCw&usg=AFQjCNHKMusGnaIp9EzsR94YGq8OJRPM1w&bvm=bv.51773540,d.aGc
• http://msdn.microsoft.com/en-us/library/aa964122(SQL.90).aspx
• We believe this is the right long term path for Analytics FE DB evolution
• Even if you offload the BE processing to COSMOS, serving the portal queries still requires an FE that allows efficient insertion and removal of data while servicing requests from the portals
• This is an important cog in the design of an optimal and efficient FE unit
• Active/Passive is a stopgap measure with lower utilization of hardware assets
• Need a POC to validate if the query overheads are acceptable (within the desired and acceptable limit)
• References on parallel query execution overheads
• http://technet.microsoft.com/en-us/library/ms345599(v=sql.105).aspx
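The general approach above is commonly called a sliding window. The toy model below sketches the sequence of steps only; it is not SQL Server code. In SQL Server the corresponding operations are `ALTER PARTITION FUNCTION ... SPLIT RANGE` / `MERGE RANGE` and `ALTER TABLE ... SWITCH PARTITION`. The class name, the retention length and the dates are hypothetical.

```python
from collections import OrderedDict


class SlidingWindowTable:
    """Toy model of a date-partitioned fact table with a fixed retention window."""

    def __init__(self, retention_days):
        self.retention_days = retention_days
        self.partitions = OrderedDict()  # day -> list of rows, oldest first

    def split_new_day(self, day):
        """Split the empty tail partition to create an empty slot for `day`."""
        if day in self.partitions:
            raise ValueError(f"partition for {day} already exists")
        self.partitions[day] = []

    def switch_in(self, day, staging_rows):
        """Swap a fully prepared staging table into the empty partition for `day`."""
        if self.partitions[day]:
            raise ValueError("target partition must be empty before switch-in")
        self.partitions[day] = staging_rows  # reference swap, rows are not copied

    def switch_out_oldest(self):
        """Detach the oldest partition and merge away its boundary."""
        return self.partitions.popitem(last=False)

    def load_day(self, day, staging_rows):
        """Run one daily cycle; return the list of days purged."""
        self.split_new_day(day)
        self.switch_in(day, staging_rows)
        purged = []
        while len(self.partitions) > self.retention_days:
            purged.append(self.switch_out_oldest()[0])
        return purged


table = SlidingWindowTable(retention_days=3)
days = ["2013-09-01", "2013-09-02", "2013-09-03", "2013-09-04"]
purged_by_day = {d: table.load_day(d, [f"row-{d}-{i}" for i in range(2)]) for d in days}
```

The point of the model is that the serving table is only touched by cheap swap steps: the expensive work (loading, defrag, update statistics) happens on the staging data before `switch_in`, and purging is a detach rather than a row-by-row delete.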
Other General References
• Fast track Recommendations
• http://msdn.microsoft.com/en-us/library/gg567302.aspx
• http://msdn.microsoft.com/en-us/library/gg605238.aspx
• Best practice recommendations
• http://technet.microsoft.com/en-us/library/dd578580(v=sql.100).aspx

Editor's Notes

  • #4: SQL is the underlying technology. It must be well understood. We need appropriate configurations so that the underlying mechanisms work for us rather than against us.