[PASS Summit 2016] Azure DocumentDB: A Deep Dive into Advanced Features

Azure DocumentDB: Deep Dive into Advanced Features
Aravind Ramachandran, Program Manager, Azure DocumentDB (@arkramac)
Andrew Liu, Program Manager, Azure DocumentDB (@aliuy8)

3 V’s of data: Endless possibilities
Learning
Gaming
Retail
Telematics
Mobile Apps
IoT

The 2x2s of database tradeoffs

DocumentDB: Capabilities
Guaranteed low latency
• <10ms reads/<15ms writes @ P99
• Requests are served from the local region
• Write-optimized, latch-free database engine designed for SSDs and low-latency access
• Synchronous and automatic document indexing at sustained ingestion rates
Elastic and limitless global scale
• Independently scale throughput and storage, locally and globally
• Transparent partition management and routing
Multiple consistency levels
• Multiple well defined consistency levels
• Intuitive programming model for relaxed consistency models
• Clear PACELC tradeoffs and 99.99% availability SLAs
SQL and JavaScript – schema free
• Automatic tree-path-based indexing
• No schemas or secondary indices required upfront
• SQL and JavaScript language-integrated queries
• Hash, range, and spatial indexes
• Multi-document, JavaScript language-integrated transactions

DocumentDB resource model
Resources
• Identified by their logical and stable URI
• Represented as JSON documents
• Partitioned, spanning machines, clusters and regions
Resource model
• Stateless interaction (HTTP and TCP)
• Hierarchical overlay atop the partitioning model
Partitioning model
• Grid partitioning: horizontal based on hash/range and vertical across regions
• Each partition is made highly available via a replica set
[Figure: partitions grouped into partition sets across US-East, US-West and N Europe (global distribution); each partition is a replica set within its region (local distribution)]

Accessing DocumentDB
[Figure: access from Java, .NET, Ruby, …]

Let’s talk about…
• Modeling JSON Documents
• Collections and Scaling
• Query and Indexing
• Global Distribution
• Tips and Best Practices
Everything you need to know to build blazing fast, planet-scale applications!

Let’s talk about JSON documents
"With great power comes great responsibility"
- Uncle Ben

Data normalization
How do approaches differ?

Come as you are

Modeling Data: The Relational Way
[Figure: relational schema with tables Person, Address, ContactDetail, ContactDetailType and the link table PersonContactDetailLnk (PersonId, ContactDetailId); tables carry Id key columns]

Modeling Data: The Document Way
[Figure: a Person document (Id) with embedded Addresses (Address, …) and ContactDetails (ContactDetail, …)]
{
  "id": "0ec1ab0c-de08-4e42-a429-...",
  "addresses": [
    { "street": "1 Redmond Way",
      "city": "Redmond", "state": "WA",
      "zip": 98052 }
  ],
  "contactDetails": [
    { "type": "home", "detail": "555-1212" },
    { "type": "email", "detail": "me@ms.com" }
  ],
  ...
}

To embed, or to reference, that is the question

Data modeling with denormalization
Try to model your entity as a self-contained document.
Generally, use embedded data models when:
• There are "contains" relationships between entities
• There are one-to-few relationships between entities
• Embedded data changes infrequently
• Embedded data won’t grow without bound
• Embedded data is integral to data in a document
{
  "id": "1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "addresses": [
    {
      "line1": "100 Some Street",
      "line2": "Unit 1",
      "city": "Seattle",
      "state": "WA",
      "zip": 98012
    }
  ],
  "contactDetails": [
    { "email": "thomas@andersen.com" },
    { "phone": "+1 555 555-5555", "extension": 5555 }
  ]
}
Denormalizing typically provides better read performance

Data modeling with referencing
In general, use normalized data models when:
• Write performance is more important than read performance
• Representing one-to-many relationships
• Representing many-to-many relationships
• Related data changes frequently
Referencing provides more flexibility than embedding, at the cost of more round trips to read data.
{
  "id": "xyz",
  "username": "user xyz"
}
{
  "id": "address_xyz",
  "userid": "xyz",
  "address": {
    …
  }
}
{
  "id": "contact_xyz",
  "userid": "xyz",
  "email": "user@user.com",
  "phone": "555 5555"
}
Normalizing typically provides better write performance
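The read-side cost of referencing can be made concrete with a small sketch. It uses an in-memory stand-in for a collection; `makeStore` and its `get` method are hypothetical helpers for illustration, not part of any DocumentDB SDK, and each `get` is counted as one round trip.

```javascript
// In-memory stand-in for a collection; every get() counts as one round trip.
// makeStore and get() are hypothetical, not a DocumentDB SDK API.
function makeStore(docs) {
  const byId = new Map(docs.map((d) => [d.id, d]));
  return {
    roundTrips: 0,
    get(id) {
      this.roundTrips += 1;
      return byId.get(id);
    },
  };
}

// Embedded model: one self-contained document.
const embeddedStore = makeStore([
  { id: "xyz", username: "user xyz", email: "user@user.com", phone: "555 5555" },
]);
const embeddedUser = embeddedStore.get("xyz");

// Referenced model: the same data split across documents linked by id.
const referencedStore = makeStore([
  { id: "xyz", username: "user xyz" },
  { id: "contact_xyz", userid: "xyz", email: "user@user.com", phone: "555 5555" },
]);
const user = referencedStore.get("xyz");
const contact = referencedStore.get("contact_" + user.id);
const assembledUser = { ...user, email: contact.email, phone: contact.phone };

console.log("embedded reads:", embeddedStore.roundTrips);
console.log("referenced reads:", referencedStore.roundTrips);
```

Both models yield the same user view; the referenced one needs a second read, but a contact change touches only the small contact document.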

Hybrid models
No magic bullet
Hybrid approach:
• Model on a property level (as opposed to record level)
• Optimize your data model for your workload (as opposed to blindly following types)
• Modeling impacts RU due to document size
{
  "id": "1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "countOfBooks": 3,
  "books": [1, 2, 3],
  "images": [
    { "thumbnail": "http://....png" },
    { "profile": "http://....png" }
  ]
}
{
  "id": 1,
  "name": "DocumentDB 101",
  "authors": [
    { "id": 1, "name": "Thomas Andersen", "thumbnail": "http://....png" },
    { "id": 2, "name": "William Wakefield", "thumbnail": "http://....png" }
  ]
}
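A hybrid model means some properties are references (the `books` id array) while others are denormalized copies or pre-computed aggregates (`countOfBooks`, the author name inside the book). A minimal sketch of keeping them in step on a write, assuming a hypothetical `addBook` helper and a made-up book title:

```javascript
// Hypothetical helper: returns updated copies of an author and a new book.
// It does not talk to a database; persisting both documents is left out.
function addBook(author, book) {
  const updatedAuthor = {
    ...author,
    books: [...author.books, book.id],      // reference: ids only
    countOfBooks: author.countOfBooks + 1,  // pre-computed aggregate
  };
  const updatedBook = {
    ...book,
    // Denormalized copy of the author fields a book page would display.
    authors: [
      ...(book.authors || []),
      { id: author.id, name: author.firstName + " " + author.lastName },
    ],
  };
  return { author: updatedAuthor, book: updatedBook };
}

const thomas = {
  id: "1",
  firstName: "Thomas",
  lastName: "Andersen",
  countOfBooks: 3,
  books: [1, 2, 3],
};
// "DocumentDB 201" is an invented title used only for this example.
const added = addBook(thomas, { id: 4, name: "DocumentDB 201" });
console.log(added.author.countOfBooks, added.book.authors);
```

The trade-off is visible in the code: reads of either document need no join, but one logical write now updates two documents.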

Measuring Throughput (Request Units)
• Each replica gets a fixed budget of request units
• Request Unit/sec (RU) is the normalized currency for % IOPS, % CPU and % memory
• Operations on documents consume request units (RUs)
• Requests get rate limited if they exceed the reserved RU/sec
• Customers pay for reserved request units by the hour
[Figure: incoming requests to a replica plotted against Min RU/sec and Max RU/sec, with regions labelled Quiescent, No throttling and Rate limit]
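The budget-and-rate-limit idea can be sketched as a per-second counter. This is a simplified model for intuition only: the `RuBudget` class, the one-second fixed window and the RU charges below are assumptions, not the service's actual accounting.

```javascript
// Minimal per-second RU budget. The class and the charges used below are
// illustrative assumptions, not values or behaviour taken from DocumentDB.
class RuBudget {
  constructor(reservedRuPerSec) {
    this.reserved = reservedRuPerSec;
    this.windowStart = 0; // second in which the current window began
    this.consumed = 0;    // RUs spent in the current window
  }

  // Returns true if the request is served, false if it is rate limited.
  tryConsume(nowSec, charge) {
    const second = Math.floor(nowSec);
    if (second !== this.windowStart) {
      this.windowStart = second; // new one-second window: budget resets
      this.consumed = 0;
    }
    if (this.consumed + charge > this.reserved) return false;
    this.consumed += charge;
    return true;
  }
}

const budget = new RuBudget(10);
const outcomes = [
  budget.tryConsume(0.1, 5), // served (5 of 10 used)
  budget.tryConsume(0.5, 5), // served (10 of 10 used)
  budget.tryConsume(0.9, 1), // rate limited: budget exhausted
  budget.tryConsume(1.2, 1), // served: next second, budget reset
];
console.log(outcomes);
```

A client that sees a rate-limited response would typically wait and retry rather than fail the operation.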

What are partitions?
[Figure: a collection divided into Partition 1, Partition 2, …, Partition i, …, Partition n]

What are partitions?
[Figure: documents grouped into partitions by city, e.g. London, Paris, New York, Houston, Chicago, New Delhi, Mumbai, Boston, Berlin, …]
Partition Key = city
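One way to picture how a partition key value maps to a partition is hash routing. The sketch below uses a simple djb2-style string hash and a fixed partition count of 4; both are stand-ins chosen for illustration, since the slides do not show the hash function or partition count the service uses.

```javascript
// Simple deterministic string hash (djb2-style). This is a stand-in: the
// hash DocumentDB actually uses is not described in the slides.
function hashString(s) {
  let h = 5381;
  for (let i = 0; i < s.length; i++) {
    h = (h * 33 + s.charCodeAt(i)) >>> 0; // keep within unsigned 32 bits
  }
  return h;
}

// Map a partition key value to one of partitionCount partitions.
function partitionFor(partitionKeyValue, partitionCount) {
  return hashString(partitionKeyValue) % partitionCount;
}

const cities = [
  "London", "Paris", "New York", "Houston", "Chicago",
  "New Delhi", "Mumbai", "Boston", "Berlin",
];
const placement = new Map();
for (const city of cities) {
  const p = partitionFor(city, 4);
  if (!placement.has(p)) placement.set(p, []);
  placement.get(p).push(city);
}
console.log(placement);
```

All documents with the same city land on the same partition, which is why the choice of key decides how evenly data and requests spread.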

Partitioning patterns
Writes should scale across Partition Keys

Partitioning patterns
Reads should minimize cross-partition lookups
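The read-side pattern can be illustrated by counting how many partitions a query has to visit. The partition layout and the `query` helper here are hypothetical; they only model the difference between a query that supplies the partition key and one that does not.

```javascript
// Hypothetical layout: documents already grouped into partitions by city.
const partitions = [
  { cities: ["London", "Paris"], docs: [{ city: "London", name: "a" }, { city: "Paris", name: "b" }] },
  { cities: ["New York", "Houston"], docs: [{ city: "Houston", name: "c" }] },
  { cities: ["Mumbai", "Berlin"], docs: [{ city: "Berlin", name: "b" }] },
];

// Runs a filter and reports how many partitions had to be visited.
function query(predicate, partitionKeyValue) {
  const targets = partitionKeyValue === undefined
    ? partitions                                                       // no key: fan out to all
    : partitions.filter((p) => p.cities.includes(partitionKeyValue)); // key known: route to one
  const results = targets.flatMap((p) => p.docs.filter(predicate));
  return { results, partitionsVisited: targets.length };
}

const byCity = query((d) => d.city === "Berlin", "Berlin");
const byName = query((d) => d.name === "b");
console.log(byCity.partitionsVisited, byName.partitionsVisited);
```

Filtering on the partition key touches a single partition, while filtering on another property fans out to every partition.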

Recipe for Choosing Partition Key

Let's talk about Query and Indexing

DocumentDB: SQL and JavaScript queries
Document 1:
{
  "locations": [
    { "country": "Germany", "city": "Berlin" },
    { "country": "France", "city": "Paris" }
  ],
  "headquarter": "Belgium",
  "exports": [{ "city": "Moscow" }, { "city": "Athens" }]
}
Document 2:
{
  "locations": [{ "country": "Germany", "city": "Bonn", "revenue": 200 }],
  "headquarter": "Italy",
  "exports": [
    { "city": "Berlin", "dealers": [{ "name": "Hans" }] },
    { "city": "Athens" }
  ]
}
[Figure: each document drawn as a tree; property names and array indices (locations, 0, 1, country, city, …) are interior nodes and values (Germany, Berlin, …) are leaves]
SQL:
SELECT C.locations
FROM company C
WHERE C.headquarter = "Belgium"
JavaScript:
function businessLogic() {
    var country = "Belgium";
    __.filter(function(x) { return x.headquarter === country; });
}
Result:
{
  "results": [
    {
      "locations": [
        { "country": "Germany", "city": "Berlin" },
        { "country": "France", "city": "Paris" }
      ]
    }
  ]
}
[Figure: the result drawn as a tree rooted at results]
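To see what the query on this slide computes, the sketch below runs the same filter and projection over the two sample documents in plain JavaScript. It uses `Array.prototype.filter` and `map` in place of the server-side `__` object, so it runs in Node without a database; `locationsByHeadquarter` is a hypothetical name.

```javascript
const companies = [
  {
    locations: [
      { country: "Germany", city: "Berlin" },
      { country: "France", city: "Paris" },
    ],
    headquarter: "Belgium",
    exports: [{ city: "Moscow" }, { city: "Athens" }],
  },
  {
    locations: [{ country: "Germany", city: "Bonn", revenue: 200 }],
    headquarter: "Italy",
    exports: [{ city: "Berlin", dealers: [{ name: "Hans" }] }, { city: "Athens" }],
  },
];

// Client-side equivalent of:
//   SELECT C.locations FROM company C WHERE C.headquarter = "Belgium"
function locationsByHeadquarter(docs, country) {
  return docs
    .filter((c) => c.headquarter === country)  // WHERE clause
    .map((c) => ({ locations: c.locations })); // SELECT projection
}

const results = locationsByHeadquarter(companies, "Belgium");
console.log(JSON.stringify({ results }, null, 2));
```

Only the first document matches, and only its `locations` property is returned.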

Indexing under the hood
• Logically the index is a union of all the document trees
• Structure is contributed by the interior nodes; instance values are the leaves
• Structural information and instance values are normalized into a unifying concept of JSON-Path
• Dynamic encoding of postings lists (E-WAH/differential)
Terms                   Postings List
$/location/0/           1, 2
location/0/country/     1, 2
location/0/city/        1, 2
0/country/Germany       1, 2
1/country/France        2
…                       …
0/city/Moscow           2
0/dealers/0             2
[Figure: index tree showing the common structure shared across documents; highlighted paths illustrate range and ORDER BY queries, wildcard queries, and spatial queries (coordinates)]
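The "union of document trees" idea can be sketched as an inverted index from JSON paths to the ids of the documents that contain them. This is a simplification under stated assumptions: it stores full root-to-node paths in a plain `Map`, whereas the table above shows shorter path segments and compressed postings lists; `pathsOf` and `buildIndex` are hypothetical names.

```javascript
// Walk a JSON value and emit one term per path from the root to each node.
function pathsOf(value, prefix = "$") {
  if (value === null || typeof value !== "object") {
    return [prefix + "/" + String(value)]; // leaf: path ends in the instance value
  }
  const terms = [];
  // For arrays, Object.entries yields the indices "0", "1", … as keys.
  for (const [key, child] of Object.entries(value)) {
    const path = prefix + "/" + key;
    terms.push(path); // interior node: structure only
    terms.push(...pathsOf(child, path));
  }
  return terms;
}

// Merge the per-document trees into one term -> postings-list map.
function buildIndex(docsById) {
  const index = new Map();
  for (const [docId, doc] of Object.entries(docsById)) {
    for (const term of new Set(pathsOf(doc))) {
      if (!index.has(term)) index.set(term, []);
      index.get(term).push(Number(docId));
    }
  }
  return index;
}

const index = buildIndex({
  1: {
    locations: [{ country: "Germany", city: "Berlin" }, { country: "France", city: "Paris" }],
    headquarter: "Belgium",
  },
  2: {
    locations: [{ country: "Germany", city: "Bonn" }],
    headquarter: "Italy",
  },
});
console.log(index.get("$/locations/0/country/Germany"));
console.log(index.get("$/headquarter/Italy"));
```

An equality filter then becomes a single lookup of the term built from the path and the value, and shared structure such as `$/locations/0/country` is stored once for all documents.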

Queries that use the index
• Equality: =
• Range: <, >, <=, >=
• ORDER BY
• String operators: STARTSWITH
• Spatial operators: ST_WITHIN and ST_DISTANCE
• Array operators: ARRAY_CONTAINS
• Schema operators: IS_DEFINED, IS_NUMBER, IS_STRING, …

Indexing Policies
Configuration                 Level            Options
Automatic                     Per collection   True (default) or False; override with each document write
Indexing Mode                 Per collection   Consistent, Lazy, and None; None for KV workloads
Included and excluded paths   Per path         Individual path or recursive includes (? and *)
Indexing Type                 Per path         Supports Hash, Range, and Spatial
Indexing Precision            Per path         Supports 1 – 100 per path (and max); trade off storage, query RUs and write RUs
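As an illustration of how these options combine, here is a sketch of an indexing policy written as a JavaScript object. The property names follow the DocumentDB indexing-policy JSON as best recalled and should be checked against the service documentation before use; the excluded path `/contactDetails/*` is a made-up example, and `-1` is assumed to denote maximum precision.

```javascript
// Sketch of an indexing policy. Property names are believed to match the
// DocumentDB policy JSON of the time but are not verified here; the excluded
// path is a hypothetical example.
const indexingPolicy = {
  automatic: true,            // index every document unless overridden per write
  indexingMode: "consistent", // alternatives named on the slide: lazy, none
  includedPaths: [
    {
      path: "/*",             // recursive include of the whole document
      indexes: [
        { kind: "Range", dataType: "Number", precision: -1 }, // assumed: -1 = max
        { kind: "Hash", dataType: "String", precision: 3 },
      ],
    },
  ],
  excludedPaths: [{ path: "/contactDetails/*" }],
};

console.log(JSON.stringify(indexingPolicy, null, 2));
```

Excluding paths that are never queried and lowering precision are the levers for trading index storage and write RUs against query RUs.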

Let’s talk about Planet-Scale

Guaranteed low latency
“I want my data wherever my users are.”

Guaranteed high availability
Globally. With policy based failover.
99.99%

Multi-region DocumentDB databases
[Figure: a DocumentDB collection distributed across US-East, US-West and India. One region holds the primary replica-sets and two hold secondary replica-sets, each provisioned with 2M RUs. Partitions form partition sets across regions (global distribution) and replica-sets within a region (local distribution).]
Total RUs = Provisioned RUs x Number of regions
In this example: 2M RUs x 3 regions = 6M RUs
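The formula on this slide as a one-line function, using the slide's own numbers; `totalRequestUnits` is a hypothetical name.

```javascript
// Total RUs = provisioned RUs per region x number of regions.
function totalRequestUnits(provisionedRuPerRegion, regionCount) {
  return provisionedRuPerRegion * regionCount;
}

const total = totalRequestUnits(2000000, 3);
console.log(total); // 6000000
```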

Programmable data consistency
“It's hard to write distributed apps.”
Strong consistency, high latency
Eventual consistency, low latency

Consistency Levels
• PACELC Theorem and the associated tradeoffs

Consistency Levels
• Strong, Bounded Staleness, Session, and Eventual
Strong → Bounded Staleness → Session → Eventual
Left to right: weaker consistency, better read scalability, lower write latency
[Figure: for each level, clients reading from and writing to a primary (P) and secondary (S) replicas]
• Bounded Staleness: consistent prefix reads; reads lag behind writes by at most K prefixes or T interval
• Session: monotonic reads, monotonic writes and read-your-writes guarantee
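The difference between eventual and session reads can be shown with a toy model: one primary, one secondary that lags until `replicate()` is called, and a version number that acts as a session token. Everything here (the `Replica` class, the token scheme, the fallback to the primary) is an assumption made for illustration and does not describe the service's internals.

```javascript
// Toy replica: stores data plus the version of the last write it applied.
class Replica {
  constructor() {
    this.version = 0;
    this.data = new Map();
  }
  apply(key, value, version) {
    this.data.set(key, value);
    this.version = version;
  }
}

const primary = new Replica();
const secondary = new Replica();
const pending = []; // writes not yet shipped to the secondary

function write(key, value) {
  const version = primary.version + 1;
  primary.apply(key, value, version);
  pending.push({ key, value, version });
  return version; // acts as the session token
}

function replicate() {
  for (const w of pending.splice(0)) secondary.apply(w.key, w.value, w.version);
}

// Eventual read: served by the secondary, so the result may be stale.
function readEventual(key) {
  return secondary.data.get(key);
}

// Session read: use the secondary only if it has caught up to the token.
function readSession(key, sessionToken) {
  const replica = secondary.version >= sessionToken ? secondary : primary;
  return replica.data.get(key);
}

const token = write("greeting", "hello");
const staleRead = readEventual("greeting");         // not replicated yet
const sessionRead = readSession("greeting", token); // read-your-writes
replicate();
const laterRead = readEventual("greeting");         // secondary has caught up
console.log(staleRead, sessionRead, laterRead);
```

The session read always sees the client's own write, while the eventual read only does so once replication has caught up.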

DocumentDB Recent Updates
• Automatic Expiration via Time-To-Live (TTL)
• Expanded Geo-Spatial support for Polygons and Lines
• Preview Support for
• Local Emulator
• IP Filtering
• Self-Service Backup + Restore
• Protocol Support for MongoDB

Session Evaluations
Your feedback is important and valuable.
3 ways to access:
• Go to passSummit.com
• Download the GuideBook App and search: PASS Summit 2016
• Follow the QR code link displayed on session signage throughout the conference venue and in the program guide
Submit by 5pm Friday November 6th to WIN prizes

Thank You
Learn more from
Azure DocumentDB
askdocdb@microsoft.com or follow @DocumentDB
