Elassandra :
Elasticsearch as a Cassandra Secondary Index
September the 8th, 2016 - LLD20
© DataStax, All Rights Reserved.
Vincent Royer
Elassandra Author
Rémi Trouville
Strapdata Co-Founder
We have been working together for 8 years in the banking/insurance industry
Today’s objectives:
• Share our vision and excitement about our project
• Get your feedback on Elassandra
• Meet NoSQL gurus to exchange ideas, solutions, questions… and beers
1 Introduction
2 How Elassandra works
3 Cool Features
4 Elassandra’s ecosystem
5 Roadmap
6 Q&A
Elassandra’s status (2016/09/08)
Current usage
• Used for non-critical data, including:
  • Application logs
  • Server monitoring (CPU, memory…)
  • Consolidation and reporting from various SQL databases
Current status of Elassandra
• Still in beta
• Needs testing on larger deployments
Production-ready release targeted for the end of 2016
2 How Elassandra works
Elasticsearch design
• The master node manages and broadcasts the cluster state
• Only primary nodes support write operations
• On failure, a new master or primary node is elected
• By default, 5 shards per index
[Diagram: one master node, three primary nodes, and six replica nodes, annotated with "Write operations" and "Read operations".]
Elassandra design
• Elasticsearch code is embedded in Cassandra nodes
• Documents are stored as rows in Cassandra tables (no more _source in Elasticsearch)
• A custom secondary index synchronously updates Elasticsearch indices
Terminology
Elasticsearch            Cassandra           Description
Mapping                  Schema              Defines data structures
Cluster                  Virtual datacenter  An Elassandra datacenter is an Elasticsearch cluster
Index                    Keyspace            An index relies on a keyspace
Type                     Table               Each document type is stored in a Cassandra table
Document                 Row                 A document is stored as a Cassandra row where _id is the primary key
Field                    Column              Each indexed field is backed by a Cassandra column
Object or nested fields  User Defined Type   Automatically created User Defined Type to store Elasticsearch objects
Elassandra Write Path
[Diagram: a REST "index a document" request reaches node 1 (coordinator). In the Elasticsearch layer it triggers a dynamic mapping update; in the Cassandra layer it becomes INSERT INTO (…) VALUES (…). On node 2 and node 3, the secondary index indexes the document, including _token, into the local shard.]
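The translation from an indexed JSON document to a CQL INSERT can be sketched in Python. This is only an illustration under stated assumptions: `doc_to_insert` is a hypothetical helper, not Elassandra code, and it assumes every field is stored in a list-typed column, as in the twitter.tweet table of this deck.

```python
def doc_to_insert(keyspace, table, doc_id, doc):
    """Build a parameterized CQL INSERT for one JSON document.

    Assumption: every field is stored in a list-typed column, so scalar
    values are wrapped in a single-element list.
    """
    fields = sorted(doc)
    columns = ['"_id"'] + fields
    values = [doc_id] + [
        doc[name] if isinstance(doc[name], list) else [doc[name]]
        for name in fields
    ]
    placeholders = ", ".join(["%s"] * len(columns))
    cql = "INSERT INTO {}.{} ({}) VALUES ({})".format(
        keyspace, table, ", ".join(columns), placeholders
    )
    return cql, values


cql, params = doc_to_insert(
    "twitter", "tweet", "1",
    {"user": "vince", "message": "look at Elassandra", "size": 50},
)
print(cql)
print(params)
```

The statement and its parameters are kept separate so that a driver could bind the values rather than splice them into the CQL text.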
Elassandra Search Path
[Diagram: a REST search request reaches node 1 (coordinator). Search phase: node 1, node 2, and node 3 each query their local shard with a _token filter to avoid duplicates. Fetch phase: the Cassandra layer runs SELECT fields FROM index.type WHERE PK=_id, then the response is returned.]
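Why a _token filter avoids duplicates can be shown with a small in-memory sketch. It assumes each node is given a half-open token range (start, end], the notation used by token_ranges in the cluster state; the node and document structures are hypothetical stand-ins, not Elassandra data structures.

```python
def in_range(token, token_range):
    """True if token falls in the half-open range (start, end]."""
    start, end = token_range
    return start < token <= end


def search(nodes, term):
    """Query each node's local documents, keeping only those whose token
    lies in the range assigned to that node for this search."""
    hits = []
    for node in nodes:
        for doc in node["docs"]:
            if term in doc["message"] and in_range(doc["_token"], node["range"]):
                hits.append(doc["_id"])
    return sorted(hits)


# With a replication factor of 2, both nodes index both documents.
docs = [
    {"_id": "a", "_token": -50, "message": "look at Elassandra"},
    {"_id": "b", "_token": 70, "message": "Elassandra search"},
]
nodes = [
    {"name": "node1", "range": (-100, 0), "docs": docs},
    {"name": "node2", "range": (0, 100), "docs": docs},
]
print(search(nodes, "Elassandra"))  # ['a', 'b']
```

Without the range check, each document would be returned twice, once per replica.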
Elasticsearch cluster state
The cluster state has 3 main sections:
1. Cluster information (cluster name, node ids)
2. Metadata (mapping definitions, indices and data structures, stored in a Cassandra table)
3. Routing table to route search operations (built locally from the Cassandra topology)
{
  "cluster_name" : "Test Cluster",
  "version" : 7,
  "state_uuid" : "SkMDaaB-RA6n0DhmHZaTow",
  "master_node" : "e8b9c9f0-0c07-4845-9c02-211a4dbf7ea6",
  "blocks" : { },
  "nodes" : {
    "e8b9c9f0-0c07-4845-9c02-211a4dbf7ea6" : {
      "name" : "localhost",
      "status" : "ALIVE",
      "transport_address" : "127.0.0.1:9300",
      "attributes" : {
        "rack" : "rack1",
        "data" : "true",
        "data_center" : "dc1",
        "master" : "true"
      }
    }
  },
  "metadata" : {
    "version" : 2,
    "cluster_uuid" : "e8b9c9f0-0c07-4845-9c02-211a4dbf7ea6",
    "templates" : { },
    "indices" : {
      "twitter" : {
        "state" : "open",
        "settings" : {
          "index" : {
            "creation_date" : "1471681453347",
            "number_of_shards" : "1",
            "number_of_replicas" : "0",
            "uuid" : "j4zZS2eOTHaDcW3r1e_1DA",
            "version" : {
              "created" : "2010199"
            }
          }
        },
        "mappings" : {
          "tweet" : {
            "properties" : {
              "size" : { "type" : "long" },
              "post_date" : {
                "format" : "strict_date_optional_time||epoch_millis",
                "type" : "date"
              },
              "message" : { "type" : "string" },
              "user" : { "type" : "string" }
            }
          }
        }
      }
    }
  },
  "routing_table" : {
    "indices" : {
      "twitter" : {
        "shards" : {
          "0" : [ {
            "state" : "STARTED",
            "primary" : true,
            "node" : "e8b9c9f0-0c07-4845-9c02-211a4dbf7ea6",
            "relocating_node" : null,
            "shard" : 0,
            "index" : "twitter",
            "version" : 4,
            "token_ranges" : [ "(-9223372036854775808,9223372036854775807]" ],
            "allocation_id" : {
              "id" : "SdDlnqLXTuacrlHpaJkAwA"
            }
          } ]
        }
      }
    }
  }
}
Elasticsearch mapping storage in Cassandra
The Elasticsearch mapping is stored in:
• A Cassandra table, elastic_admin.metadata
• The internal Cassandra system keyspace
On node bootstrap (first start of a node):
Data is pulled from other nodes and indexed in Elasticsearch.
=> Bootstrapping provides Elasticsearch resharding.
On node startup:
Data recovered from commit logs is indexed in Elasticsearch.
=> This ensures consistency after a failure.
Masterless mapping management
When a node updates the Elasticsearch mapping:
• A Paxos transaction on the elastic_admin.metadata table ensures that no concurrent modification can occur.
• The gossip protocol is used:
  • to notify all nodes to reload the new mapping
  • to check that all nodes have applied the new mapping
  • to broadcast shard status
=> No more Elasticsearch master node
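The effect of the Paxos transaction can be pictured as a compare-and-set on a versioned row. The sketch below is an in-memory simulation only: `MetadataTable` and its version column are hypothetical stand-ins, and the real conditional update happens in Cassandra, not in Python.

```python
class MetadataTable:
    """In-memory stand-in for a row updated with compare-and-set."""

    def __init__(self):
        self.version = 0
        self.mapping = {}

    def update_if_version(self, expected_version, new_mapping):
        """Apply the update only if nobody else changed the row first."""
        if self.version != expected_version:
            return False
        self.mapping = new_mapping
        self.version += 1
        return True


table = MetadataTable()
seen = table.version  # both nodes read version 0

# Node A wins the race; node B's update is rejected and must be retried.
ok_a = table.update_if_version(seen, {"tweet": {"user": "string"}})
ok_b = table.update_if_version(seen, {"tweet": {"size": "long"}})
print(ok_a, ok_b, table.version)  # True False 1
```

A rejected node would reload the current mapping, reapply its change, and try again with the new version.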
Cross Datacenter Replication
[Diagram: two datacenters, dc1 and dc2, each hosting an Elasticsearch cluster with its own Kibana; the Elasticsearch mapping and data are replicated between them.]
Cassandra hinted handoff and repair ensure data consistency.
Elassandra: Backup & Restore
Back up Elasticsearch Lucene files like Cassandra SSTables
• Cassandra flushes memtables and secondary indices when snapshotting
• Lucene files are immutable, like Cassandra SSTables
• Snapshot = hard links to immutable SSTables + Lucene files
Benefits:
• Consistent backup of Cassandra data and Elasticsearch indices
• Cassandra as primary storage (no shared FS needed)
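A hard-link snapshot of immutable files can be sketched as follows. This is a minimal illustration, not the Cassandra snapshot implementation: the directory layout and file names are hypothetical, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def snapshot(data_dir, snapshot_dir):
    """Hard-link every file of data_dir into snapshot_dir (no data copy)."""
    os.makedirs(snapshot_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(data_dir)):
        source = os.path.join(data_dir, name)
        if os.path.isfile(source):
            os.link(source, os.path.join(snapshot_dir, name))
            linked.append(name)
    return linked


root = tempfile.mkdtemp()
data_dir = os.path.join(root, "data")
os.makedirs(data_dir)
for name in ("sstable-1-Data.db", "segment_1.cfs"):  # hypothetical file names
    with open(os.path.join(data_dir, name), "w") as handle:
        handle.write("immutable content")

snap_dir = os.path.join(root, "snapshots", "backup1")
print(snapshot(data_dir, snap_dir))
```

Because the files never change after being written, the links stay consistent with the moment of the snapshot and take no extra disk space until the originals are deleted.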
3 Cool Features
Elassandra provides bi-directional mapping
Inserting a document via the Elasticsearch API creates or updates the underlying CQL schema:
PUT /twitter/tweet/1 {
  "user" : "vince",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "look at Elassandra",
  "size": 50
}
CREATE KEYSPACE twitter WITH …
CREATE TABLE twitter.tweet (
  "_id" text PRIMARY KEY,
  message list<text>,
  post_date list<timestamp>,
  size list<bigint>,
  user list<text>
)
Discover the Elasticsearch mapping from an existing CQL schema:
PUT /twitter/_mapping/tweet {
  "discover" : ".*"
}
PUT /twitter/_mapping/tweet {
  "tweet" : {
    "properties" : {
      "message" : {
        "type" : "string"
      },
      "post_date" : {
        "type" : "date",
        "format" : "strict_date_optional_time||epoch_millis"
      },
      "size" : {
        "type" : "long"
      },
      "user" : {
        "type" : "string"
      }
    }
  }
}
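The mapping-to-schema direction can be sketched in Python. The type correspondences (string to text, date to timestamp, long to bigint, each wrapped in a list) are read off the example on this slide; `create_table_cql` is a hypothetical helper and covers only those three types, not the full Elassandra type mapping.

```python
# Type correspondences taken from the example on this slide.
ES_TO_CQL = {"string": "text", "date": "timestamp", "long": "bigint"}


def create_table_cql(keyspace, table, properties):
    """Render a CREATE TABLE statement for an Elasticsearch mapping."""
    lines = ['    "_id" text PRIMARY KEY']
    for field in sorted(properties):
        cql_type = ES_TO_CQL[properties[field]["type"]]
        lines.append("    {} list<{}>".format(field, cql_type))
    return "CREATE TABLE {}.{} (\n{}\n);".format(
        keyspace, table, ",\n".join(lines)
    )


mapping = {
    "message": {"type": "string"},
    "post_date": {"type": "date"},
    "size": {"type": "long"},
    "user": {"type": "string"},
}
print(create_table_cql("twitter", "tweet", mapping))
```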
Elassandra supports nested documents with UDT
Nested documents are stored in a Cassandra User Defined Type dynamically
generated from the Elasticsearch mapping.
curl -XPUT "http://$NODE:9200/directory/users/1" -d '{
"group" : "fans",
"name" : {
"first" : "John",
"last" : "Smith"
}
}'
CREATE KEYSPACE directory WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'} AND durable_writes = true;
CREATE TYPE directory.users_name (
last frozen<list<text>>,
first frozen<list<text>>
);
CREATE TABLE directory.users (
"_id" text PRIMARY KEY,
group list<text>,
name list<frozen<users_name>>
);
CREATE CUSTOM INDEX elastic_users_name_idx ON directory.users (name) USING 'org.elasticsearch.cassandra.index.ExtendedElasticSecondaryIndex';
CREATE CUSTOM INDEX elastic_users_group_idx ON directory.users (group) USING 'org.elasticsearch.cassandra.index.ExtendedElasticSecondaryIndex';
Many Elasticsearch indices for a keyspace
The content of a keyspace can be indexed in several Elasticsearch indices with different mappings.
Standard Cassandra index rebuild (uses C* compaction manager threads):
nodetool rebuild_index <keyspace> <tablename> elastic_<tablename>
Benefit: change index mappings with zero downtime
Partitioned indices for log analysis with Kibana
At index time, a partition function builds the target Elasticsearch index name.
• Time-frame indices are removed when they are too old.
• A default TTL on the underlying C* tables removes old logs.
• This comes at a cost: Lucene term dictionaries are duplicated across indices.
curl -XPUT "http://localhost:9200/logs_${YEAR}" -d '{
  "settings": {
    "keyspace": "logs",
    "index.partition_function": "year logs_{0,date,yyyy} date_field"
  }
}'
curl -s -XGET "http://localhost:9200/_cat/indices?v"
health status index     pri rep docs.count docs.deleted store.size pri.store.size
green  open   kibana      2   1         14            1     54.3kb         54.3kb
green  open   logs_2005   2   0   22770654       874988      2.8gb          2.8gb
green  open   logs_2006   2   0   93003294      5466480       12gb           12gb
green  open   logs_2007   2   0  118455836      4856867     15.1gb         15.1gb
green  open   logs_2008   2   0  131107405      5969785     16.8gb         16.8gb
green  open   logs_2009   2   0   58150615      1296827      7.4gb          7.4gb
green  open   logs_2010   2   0         23            2     86.8kb         86.8kb
green  open   logs_2011   2   0          0            0       142b           142b
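How a partition function could resolve the target index name is sketched below. It rests on assumptions: that the setting has three space-separated parts (a name, a pattern, and the field to read), and that the pattern only uses the "{0,date,yyyy}" placeholder from the example above. `target_index` is a hypothetical helper, not the Elassandra implementation.

```python
from datetime import datetime


def target_index(partition_function, document):
    """Resolve the index name for a document.

    Assumption: the setting is "<name> <pattern> <field>" and the pattern
    only contains the "{0,date,yyyy}" placeholder.
    """
    _name, pattern, field = partition_function.split(" ")
    value = datetime.strptime(document[field], "%Y-%m-%dT%H:%M:%S")
    return pattern.replace("{0,date,yyyy}", value.strftime("%Y"))


doc = {"date_field": "2009-11-15T14:12:12", "message": "look at Elassandra"}
print(target_index("year logs_{0,date,yyyy} date_field", doc))  # logs_2009
```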
4 Elassandra’s ecosystem
Elassandra + Kibana
Kibana for search and data visualization :
cqlsh> SELECT "_id",title,"kibanaSavedObjectMeta" FROM kibana.visualization;
_id | title | kibanaSavedObjectMeta
-------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------
lastfm-by-age | ['lastfm by age'] | [{searchSourceJSON: ['{"index":"lastfm_*","query":{"query_string":{"query":"*","analyze_wildcard":true}},"filter":[]}']}]
lastfm-histogram | ['lastfm histogram'] | [{searchSourceJSON: ['{"index":"lastfm_*","query":{"query_string":{"query":"*","analyze_wildcard":true}},"filter":[]}']}]
lastfm-by-country | ['lastfm by country'] | [{searchSourceJSON: ['{"index":"lastfm_*","query":{"query_string":{"analyze_wildcard":true,"query":"*"}},"filter":[]}']}]
lastfm-count | ['lastfm count'] | [{searchSourceJSON: ['{"index":"lastfm_*","query":{"query_string":{"query":"*","analyze_wildcard":true}},"filter":[]}']}]
Even Kibana’s configuration is in Cassandra
Elassandra tools & plugins
Elassandra supports Elasticsearch tools & plugins
• Logstash & Beats
• ElasticHQ (royrusso)
• Elasticsearch-sql (NLPchina)
• JDBC sql4es (Anchormen)
Elassandra + Kibana + PrestoDB + CQLInject
Denormalizing our dataset for visualization with Kibana
cqlinject jdbc "SELECT A.a, B.b FROM A INNER JOIN B on A.c=B.c"
1. Execute a JDBC request on PrestoDB.
2. From the response metadata, add new columns to the target C* table.
3. Refresh the Elasticsearch mapping from the C* table.
4. Write the result back to Elassandra.
[Diagram: Kibana, cqlinject, and a group of six workers labeled E* + PrestoDB.]
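Step 2 of the list above, adding columns derived from the response metadata, can be sketched as follows. `missing_column_statements` is a hypothetical helper and the keyspace, table, and column names are made up for illustration; the sketch only builds the statements and does not talk to Cassandra or PrestoDB.

```python
def missing_column_statements(keyspace, table, existing_columns, result_columns):
    """Return ALTER TABLE statements for result columns the table lacks.

    result_columns is a list of (name, cql_type) pairs, as could be derived
    from the metadata of a query response.
    """
    statements = []
    for name, cql_type in result_columns:
        if name not in existing_columns:
            statements.append(
                "ALTER TABLE {}.{} ADD {} {};".format(keyspace, table, name, cql_type)
            )
    return statements


existing = {"_id", "a"}
result = [("a", "text"), ("b", "bigint")]
print(missing_column_statements("demo", "joined", existing, result))
```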
Elassandra + Spark
Principle: the Elasticsearch-Hadoop connector creates 1 partition per shard, and Elassandra has only 1 shard per node.
Benefits:
• Workers/executors read and write locally on Elassandra nodes.
• Elassandra resharding makes it possible to scale out Cassandra + Elasticsearch + Spark.
• The Elasticsearch-Spark connector supports pushdown.
How:
A slight modification of the elasticsearch-hadoop connector adds a token_ranges filter from the coordinator routing table, to avoid duplicates if nodes have overlapping routing tables.
[Diagram: six executors labeled E* + Spark.]
Time Series with Elassandra
Storing large time series in Cassandra
• N tables with different levels of precision and retention
• Daily rollup batches on each node to aggregate local data and compute metadata (min/max/avg/stdev…)
• Automatic purge with default TTL + DateTieredCompactionStrategy
Searching with only metric names and metadata indexed
• Metadata enrichment by joining other sources of data (e.g. datacenters, applications, hardware info…)
• Search with regex on any metadata to display relevant time series
[Diagram: data injection into E* with local daily batches; a Grapher serves a web browser that searches and displays time series.]
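The metadata computed by a daily rollup can be sketched with the Python standard library. The sample values are invented, and using the population standard deviation for "stdev" is an assumption, since the slide does not say which variant is used.

```python
import statistics


def daily_rollup(points):
    """Aggregate one day of raw values into summary metadata."""
    return {
        "min": min(points),
        "max": max(points),
        "avg": statistics.mean(points),
        # Assumption: population standard deviation.
        "stdev": statistics.pstdev(points),
    }


cpu_load = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(daily_rollup(cpu_load))
```

Only these small summaries and the metric names would need to be indexed for search, while the raw points stay in Cassandra tables.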
Write throughput
Cassandra = 30k writes/s, Elassandra = 30k writes/s
• Write throughput is the same as long as the node is not overloaded
• CPU usage: x2 for Elassandra
• Threads and loaded classes: x2 for Elassandra
Test setup: 2-node cluster, RF=1, Google Cloud VM n1-highcpu-16 (16 vCPU, 14.4 GB memory)
5 Roadmap
Elassandra Roadmap
Make it a deployable, enterprise-grade solution:
• Improve the documentation and packaging
• Implement missing Elasticsearch features
• Upgrade to Cassandra 3.0.<latest> and Elasticsearch 2.<latest>
• Make it ready for Windows
• Provide security features (SSL, LDAP, document- and field-level security)
• Deliver professional services
More about us…
http://www.elassandra.io
https://github.com/vroyer/elassandra
Vincent Royer
Elassandra Author
vroyer@strapdata.com
Rémi Trouville
Strapdata Co-Founder
rtrouville@strapdata.com
6 Q&A
Thank you

Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vincent Royer, Independent) | C* Summit 2016

  • 1. Elassandra : Elasticsearch as a Cassandra Secondary Index September the 8th, 2016 - LLD20
  • 2. © DataStax, All Rights Reserved. Elassandra : Elasticsearch as a Cassandra secondary index 2 Vincent Royer Elassandra Author Rémi Trouville Strapdata Co-Founder We have been working together for 8 years in the banking/insurance industry Today’s objectives : • Sharing our vision and excitement about our project • Receiving feedback from you all about elassandra • Meeting NOSQL gurus to exchange ideas, solutions, questions, … and beers &
  • 3. © DataStax, All Rights Reserved. 1 Introduction 2 How Elassandra works ? 3 Cool Features 4 Elassandra’s ecosystem 5 Roadmap 6 Q&A 3
  • 4. © DataStax, All Rights Reserved. 4 Elassandra’s status (2016/09/08) Current Usage • Used for non-critical data including • Application logs • Server monitoring (CPU, memory…) • Consolidation and reporting from various SQL databases. Current status of Elassandra • Still in beta version • Needs testing on larger deployments Production-ready targeted End of 2016
  • 5. © DataStax, All Rights Reserved. 1 Introduction 2 How Elassandra works ? 3 Cool Features 4 Elassandra’s ecosystem 5 Roadmap 6 Q&A 5
  • 6. Write operations © DataStax, All Rights Reserved. • The master node manages and broadcasts the cluster state • Only primary nodes supports write operations • On failure, a new master or primary node is elected • By default, 5 shards per index 6 Elasticsearch design Master Node Primary Node Primary Node Primary Node Replica Node Replica Node Replica Node Replica Node Replica Node Replica Node Read operations
  • 7. © DataStax, All Rights Reserved. • Elasticsearch code is embedded in Cassandra nodes • Documents are stored as row in a Cassandra tables (no more _source in Elasticsearch) • A custom secondary index synchronously updates elasticsearch indices 7 Elassandra design
  • 8. Terminology Elasticsearch Cassandra Description Mapping Schema Defines data structures Cluster Virtual Datacenter An elassandra datacenter is an elastic search cluster Index Keyspace An index relies on a keyspace. Type Table Each document type is stored in a cassandra table Document Row A document is stored as a cassandra row where _id is the primary key. Field Column Each indexed field is backed by a cassandra column Object or nested fields User Defined Type Automatically created User Defined Type to store elasticsearch objects.
  • 9. © DataStax, All Rights Reserved. Elassandra Write Path 9 shard Secondary index Secondary index REST index a document shard Dynamic Mapping Update node 1 (coordinator) node 2 node 3 Elasticsearch Layer Cassandra Layer INSERT INTO (…) VALUES (…) index a document including _token index a document including _token
  • 10. © DataStax, All Rights Reserved. Elassandra Search Path 10 shard Secondary index Secondary index REST search shard node 1 (coordinator) node 2 node 3 Elasticsearch Layer Cassandra Layer SELECT fields FROM index.type WHERE PK=_id Secondary index shard Search phase with _token filter to avoid duplicates Fetch phase Response
  • 11. © DataStax, All Rights Reserved. Elasticsearch cluster state Cluster state has 3 main sections : 1. Cluster information (cluster name, node ids) 2. Metadata (mapping definition, indices and data structures, stored in a Cassandra) 3. Routing table to route search operations (Built locally from the Cassandra topology) 11 "metadata" : { "version" : 2, "cluster_uuid" : "e8b9c9f0-0c07-4845-9c02-211a4dbf7ea6", "templates" : { }, "indices" : { "twitter" : { "state" : "open", "settings" : { "index" : { "creation_date" : "1471681453347", "number_of_shards" : "1", "number_of_replicas" : "0", "uuid" : "j4zZS2eOTHaDcW3r1e_1DA", "version" : { "created" : "2010199" } } }, "mappings" : { "tweet" : { "properties" : { "size" : { "type" : "long"}, "post_date" : { "format" : "strict_date_optional_time||epoch_millis", "type" : "date" }, "message" : {"type" : "string"}, "user" : {"type" : « string" } } } } "routing_table" : { "indices" : { "twitter" : { "shards" : { "0" : [ { "state" : "STARTED", "primary" : true, "node" : "e8b9c9f0-0c07-4845-9c02-211a4dbf7ea6", "relocating_node" : null, "shard" : 0, "index" : "twitter", "version" : 4, "token_ranges" : [ "(-9223372036854775808,9223372036854775807]" ], "allocation_id" : { "id" : "SdDlnqLXTuacrlHpaJkAwA" } } ] } } } } { "cluster_name" : "Test Cluster", "version" : 7, "state_uuid" : "SkMDaaB-RA6n0DhmHZaTow", "master_node" : "e8b9c9f0-0c07-4845-9c02-211a4dbf7ea6", "blocks" : { }, "nodes" : { "e8b9c9f0-0c07-4845-9c02-211a4dbf7ea6" : { "name" : "localhost", "status" : "ALIVE", "transport_address" : "127.0.0.1:9300", "attributes" : { "rack" : "rack1", "data" : "true", "data_center" : "dc1", "master" : "true" } } }, }
  • 12. © DataStax, All Rights Reserved. 12 Elasticsearch mapping storage in Cassandra Elasticsearch mapping is stored in : • A Cassandra table elastic_admin.metadata • In the internal cassandra system keyspace. On node bootstrap (first start of a node) Data are pulled from other nodes and are indexed in elasticsearch => Bootstrapping provides elasticsearch resharding. On node startup : Recovered data from commitlogs are indexed in elasticsearch. => This ensures consistency after a failure.
  • 13. © DataStax, All Rights Reserved. 13 Masterless mapping management When a node update the elasticsearch mapping : • A PAXOS transaction on elastic_admin.metadata table ensures no concurrent modification can be done. • The GOSSIP protocol is used • to notify all the nodes to reload the new mapping • to check that all nodes have applied this new mapping • to broadcast shards status. => No more elasticsearch master node
  • 14. © DataStax, All Rights Reserved. 14 Cross Datacenter Replication dc1 dc2 Elasticsearch mapping and data replication Elasticsearch cluster Elasticsearch cluster Kibana Kibana Cassandra Hinted Handoff and Repair ensures data consistency
  • 15. © DataStax, All Rights Reserved. 15 Elassandra : Backup & Restore Backup Elasticsearch Lucene files like Cassandra SSTables • Cassandra flush memtables and secondary indices when snapshotting • Lucene file are immutable like cassandra SSTables • Snapshot = hard link on immutable SSTables + lucene files. Benefits : • Consistent backup of Cassandra and Elasticsearch indices • Cassandra as a primary storage (No shared FS needed)
  • 16. 1 Introduction 2 How Elassandra works? 3 Cool Features 4 Elassandra’s ecosystem 5 Roadmap 6 Q&A
  • 17. Elassandra provides bi-directional mapping
      Inserting a document via the Elasticsearch APIs creates/updates the underlying CQL schema:

      PUT /twitter/tweet/1
      {
        "user" : "vince",
        "post_date" : "2009-11-15T14:12:12",
        "message" : "look at Elassandra",
        "size" : 50
      }

      CREATE KEYSPACE twitter WITH …
      CREATE TABLE twitter.tweet (
        "_id" text PRIMARY KEY,
        message list<text>,
        post_date list<timestamp>,
        size list<bigint>,
        user list<text>
      )

      Discover the Elasticsearch mapping from an existing CQL schema:

      PUT /twitter/_mapping/tweet
      { "discover" : ".*" }

      PUT /twitter/_mapping/tweet
      {
        "twitter" : {
          "properties" : {
            "message" : { "type" : "string" },
            "post_date" : { "type" : "date", "format" : "strict_date_optional_time||epoch_millis" },
            "size" : { "type" : "long" },
            "user" : { "type" : "string" }
          }
        }
      }
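The correspondence between the Elasticsearch mapping and the CQL schema on this slide can be sketched as a small translation function. This is a simplified illustration: the `ES_TO_CQL` table covers only the three types visible on the slide, and `cql_columns` and `create_table` are hypothetical helpers, not Elassandra functions.

```python
# Type correspondence read off the slide; the real mapping covers more types.
ES_TO_CQL = {"string": "text", "long": "bigint", "date": "timestamp"}


def cql_columns(properties):
    """Turn Elasticsearch 'properties' into CQL column definitions.

    Every field becomes a list<...> column, as in the twitter.tweet table.
    """
    columns = ['"_id" text PRIMARY KEY']
    for field in sorted(properties):
        es_type = properties[field]["type"]
        columns.append(f"{field} list<{ES_TO_CQL[es_type]}>")
    return columns


def create_table(keyspace, table, properties):
    """Render a CREATE TABLE statement for the given mapping."""
    body = ",\n  ".join(cql_columns(properties))
    return f"CREATE TABLE {keyspace}.{table} (\n  {body}\n)"
```

Calling `create_table("twitter", "tweet", properties)` with the `tweet` properties from the slide yields the same five columns as the CQL shown there.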
  • 18. Elassandra supports nested documents with UDT
      Nested documents are stored in a Cassandra User Defined Type dynamically generated from the Elasticsearch mapping.

      curl -XPUT "http://$NODE:9200/directory/users/1" -d '{
        "group" : "fans",
        "name" : { "first" : "John", "last" : "Smith" }
      }'

      CREATE KEYSPACE directory
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'}
        AND durable_writes = true;

      CREATE TYPE directory.users_name (
        last frozen<list<text>>,
        first frozen<list<text>>
      );

      CREATE TABLE directory.users (
        "_id" text PRIMARY KEY,
        group list<text>,
        name list<frozen<users_name>>
      );

      CREATE CUSTOM INDEX elastic_users_name_idx ON directory.users (name)
        USING 'org.elasticsearch.cassandra.index.ExtendedElasticSecondaryIndex';
      CREATE CUSTOM INDEX elastic_users_group_idx ON directory.users (group)
        USING 'org.elasticsearch.cassandra.index.ExtendedElasticSecondaryIndex';
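The naming and shape of the generated User Defined Type can be sketched as follows. The `<table>_<field>` naming rule and the `frozen<list<...>>` wrapping are inferred from the single example on the slide, so treat both as assumptions; `udt_for` is a hypothetical helper.

```python
def udt_for(keyspace, table, field, sub_fields):
    """Render a CREATE TYPE statement for one nested object field.

    sub_fields maps each sub-field name to its CQL scalar type.
    """
    columns = ",\n  ".join(
        f"{name} frozen<list<{cql_type}>>" for name, cql_type in sub_fields.items()
    )
    return f"CREATE TYPE {keyspace}.{table}_{field} (\n  {columns}\n);"
```

For the document above, `udt_for("directory", "users", "name", {"last": "text", "first": "text"})` renders the `directory.users_name` type shown on the slide.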
  • 19. Many Elasticsearch indices for a keyspace
      The content of a keyspace can be indexed in many Elasticsearch indices with various mappings.
      Standard Cassandra index rebuild (uses C* compaction manager threads):
      nodetool rebuild_index <keyspace> <tablename> elastic_<tablename>
      Benefit: change index mappings with zero downtime.
  • 20. Partitioned indices for log analysis with Kibana
      At index time, a partition function builds the target Elasticsearch index name.
      • Time-frame indices are removed when too old.
      • A default TTL on the underlying C* tables removes old logs.
      • Comes with a cost: duplicate Lucene term dictionaries.

      curl -XPUT "http://localhost:9200/logs_${YEAR}" -d '{
        "settings" : {
          "keyspace" : "logs",
          "index.partition_function" : "year logs_{0,date,yyyy} date_field"
        }
      }'

      curl -s -XGET "http://localhost:9200/_cat/indices?v"
      health status index     pri rep docs.count docs.deleted store.size pri.store.size
      green  open   kibana      2   1         14            1     54.3kb         54.3kb
      green  open   logs_2005   2   0   22770654       874988      2.8gb          2.8gb
      green  open   logs_2006   2   0   93003294      5466480       12gb           12gb
      green  open   logs_2007   2   0  118455836      4856867     15.1gb         15.1gb
      green  open   logs_2008   2   0  131107405      5969785     16.8gb         16.8gb
      green  open   logs_2009   2   0   58150615      1296827      7.4gb          7.4gb
      green  open   logs_2010   2   0         23            2     86.8kb         86.8kb
      green  open   logs_2011   2   0          0            0       142b           142b
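How a partition function such as `year logs_{0,date,yyyy} date_field` could resolve to an index name is sketched below. The reading of the setting as "function name, pattern, field name" and the treatment of `{0,date,yyyy}` as a Java MessageFormat-style placeholder are assumptions based on its shape; the sketch only translates the `yyyy`, `MM` and `dd` pattern letters, and both helpers are hypothetical.

```python
import re
from datetime import datetime

PLACEHOLDER = re.compile(r"\{0,date,([^}]+)\}")


def parse_partition_function(spec):
    """Split the setting into (function name, index pattern, field name)."""
    name, pattern, field = spec.split(" ", 2)
    return name, pattern, field


def index_name(pattern, value):
    """Substitute the date placeholder of the pattern with the field value."""

    def render(match):
        # Translate the few supported Java date letters to strftime codes.
        fmt = match.group(1)
        fmt = fmt.replace("yyyy", "%Y").replace("MM", "%m").replace("dd", "%d")
        return value.strftime(fmt)

    return PLACEHOLDER.sub(render, pattern)


def target_index(spec, document):
    """Pick the index that a document should be written to."""
    _, pattern, field = parse_partition_function(spec)
    return index_name(pattern, document[field])
```

A log line dated 2009 would then land in `logs_2009`, which matches the per-year indices listed in the `_cat/indices` output.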
  • 21. 1 Introduction 2 How Elassandra works? 3 Cool Features 4 Elassandra’s ecosystem 5 Roadmap 6 Q&A
  • 22. Elassandra + Kibana
      Kibana for search and data visualization:

      cqlsh> SELECT "_id",title,"kibanaSavedObjectMeta" FROM kibana.visualization;

       _id               | title                 | kibanaSavedObjectMeta
      -------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------
       lastfm-by-age     | ['lastfm by age']     | [{searchSourceJSON: ['{"index":"lastfm_*","query":{"query_string":{"query":"*","analyze_wildcard":true}},"filter":[]}']}]
       lastfm-histogram  | ['lastfm histogram']  | [{searchSourceJSON: ['{"index":"lastfm_*","query":{"query_string":{"query":"*","analyze_wildcard":true}},"filter":[]}']}]
       lastfm-by-country | ['lastfm by country'] | [{searchSourceJSON: ['{"index":"lastfm_*","query":{"query_string":{"analyze_wildcard":true,"query":"*"}},"filter":[]}']}]
       lastfm-count      | ['lastfm count']      | [{searchSourceJSON: ['{"index":"lastfm_*","query":{"query_string":{"query":"*","analyze_wildcard":true}},"filter":[]}']}]

      Even Kibana’s configuration is stored in Cassandra.
  • 23. Elassandra tools & plugins
      Elassandra supports Elasticsearch tools & plugins:
      • Logstash & Beats
      • ElasticHQ (royrusso)
      • Elasticsearch-sql (NLPchina)
      • JDBC sql4es (Anchormen)
  • 24. Elassandra + Kibana + PrestoDB + CQLInject
      Denormalizing our dataset for visualization with Kibana:
      cqlinject jdbc "SELECT A.a, B.b FROM A INNER JOIN B on A.c=B.c"
      1. Execute a JDBC request on PrestoDB.
      2. From the response metadata, add new columns to the target C* table.
      3. Refresh the Elasticsearch mapping from the C* table.
      4. Write the result back to Elassandra.
      [Diagram: Kibana, cqlinject, and six workers on E* + PrestoDB nodes.]
  • 25. Elassandra + Spark
      Principle: the Elasticsearch-Hadoop connector creates 1 partition per shard, whereas Elassandra has only 1 shard on each node.
      Benefits:
      • Workers/executors read and write locally on Elassandra nodes.
      • Elassandra resharding makes it possible to scale out Cassandra + Elasticsearch + Spark together.
      • The Elasticsearch-Spark connector supports pushdown.
      How: a slight modification of the Elasticsearch-Hadoop connector adds a token_ranges filter, taken from the coordinator routing table, to avoid duplicates when nodes have overlapping routing tables.
      [Diagram: six executors on E* + Spark nodes.]
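The effect of a token_ranges filter can be illustrated with a small Python sketch: when two nodes both hold a replica of the same rows, restricting each node's read to its assigned `(start,end]` ranges makes every row come back exactly once. The range syntax follows the `token_ranges` entry shown in the cluster state earlier in the deck; the helper names, the document layout with a precomputed `token` key, and the omission of wrap-around ranges are simplifications made for this example.

```python
import re

RANGE_RE = re.compile(r"\((-?\d+),(-?\d+)\]")


def parse_range(text):
    """Parse a '(start,end]' token range into a pair of integers."""
    match = RANGE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a token range: {text!r}")
    return int(match.group(1)), int(match.group(2))


def in_ranges(token, ranges):
    """True if the token falls in one of the half-open (start, end] ranges."""
    return any(start < token <= end for start, end in ranges)


def read_partition(local_docs, token_ranges):
    """Return only the local documents whose token this node is assigned."""
    ranges = [parse_range(r) for r in token_ranges]
    return [doc for doc in local_docs if in_ranges(doc["token"], ranges)]
```

Without the filter, two executors reading from overlapping replicas would each return the shared rows, and the job would see them twice.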
  • 26. Time Series with Elassandra
      Storing large time series in Cassandra:
      • N tables with different levels of precision and retention
      • Daily rollup batches on each node to aggregate local data and compute metadata (min/max/avg/stdev…)
      • Automatic purge with default TTL + DateTieredCompactionStrategy
      Searching by indexing only metric names and metadata:
      • Metadata enrichment by joining other sources of data (e.g. datacenters, applications, hardware info…)
      • Search with a regex on any metadata to display the relevant time series
      [Diagram: data injection into E*, local daily batches, and a grapher serving search and display to a web browser.]
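The per-metric summary that a daily rollup batch computes can be sketched with the Python standard library. This is an illustration of the aggregation only: the `(metric_name, value)` input shape and the `daily_rollup` name are assumptions, and the population standard deviation is used here because the slide does not say which variant is computed.

```python
import statistics


def daily_rollup(points):
    """Aggregate one day of (metric_name, value) samples per metric."""
    by_metric = {}
    for metric, value in points:
        by_metric.setdefault(metric, []).append(value)
    return {
        metric: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": statistics.mean(values),
            "stdev": statistics.pstdev(values),
        }
        for metric, values in by_metric.items()
    }
```

Only these small summaries and the metric names need to be indexed for search, while the raw points stay in the Cassandra tables.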
  • 27. Write throughput
      Cassandra = 30k writes/s
      Elassandra = 30k writes/s
      • Write throughput is the same as long as the node is not overloaded.
      • CPU usage x2 for Elassandra
      • (#threads + #classes) x2 for Elassandra
      Test setup: 2-node cluster, RF=1, Google Cloud VM n1-highcpu-16 (16 vCPU, 14.4 GB memory)
  • 28. 1 Introduction 2 How Elassandra works? 3 Cool Features 4 Elassandra’s ecosystem 5 Roadmap 6 Q&A
  • 29. Elassandra Roadmap
      Make it a deployed, enterprise-grade solution:
      • Improve the documentation and packaging
      • Implement missing Elasticsearch features
      • Upgrade to Cassandra 3.0.<latest> and Elasticsearch 2.<latest>
      • Make it ready for Windows
      • Provide security features (SSL, LDAP, document- and field-level security)
      • Deliver professional services
  • 30. More about us …
      http://www.elassandra.io
      https://github.com/vroyer/elassandra
      Vincent Royer, Elassandra Author
      Rémi Trouville, Strapdata Co-Founder
  • 31. 1 Introduction 2 How Elassandra works? 3 Features 4 Use case examples 5 Roadmap 6 Q&A