Building AuroraObjects
Who am I?
● Wido den Hollander (1986)
● Co-owner and CTO of PCextreme B.V., a Dutch hosting company
● Ceph trainer and consultant at 42on B.V.
● Part of the Ceph community since late 2009
– Wrote the Apache CloudStack integration
– libvirt RBD storage pool support
– PHP and Java bindings for librados
PCextreme?
● Founded in 2004
● Medium-sized ISP in the Netherlands
● 45,000 customers
● Started as a shared hosting company
● Datacenter in Amsterdam
What is AuroraObjects?
● Under the name “Aurora” my hosting company
PCextreme B.V. has two services:
– AuroraCompute, a CloudStack-based public cloud backed by Ceph's RBD
– AuroraObjects, a public object store using Ceph's
RADOS Gateway
● AuroraObjects is a public RADOS Gateway
service (S3 only) running in production
The RADOS Gateway (RGW)
● Serves objects using either Amazon's S3 or OpenStack's Swift protocol
● All objects are stored in RADOS; the gateway is just an abstraction layer between HTTP/S3 and RADOS
The RADOS Gateway
Our ideas
● We wanted to cache frequently accessed
objects using Varnish
– Only possible with anonymous clients
● SSL should be supported
● Storage shared between the Compute and Objects services
● 3x replication
Varnish
● A caching reverse HTTP proxy
– Very fast
● Up to 100k requests/s
– Configurable using the Varnish Configuration
Language (VCL)
– Used by Facebook and eBay
● Not a part of Ceph, but can be used with the
RADOS Gateway
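A minimal VCL sketch of the "anonymous clients only" rule (hypothetical, Varnish 3 syntax, not our production configuration): any request carrying an S3 Authorization header bypasses the cache and goes straight to RGW.

sub vcl_recv {
    # Signed (authenticated) S3 requests must always reach RGW,
    # otherwise authentication, ACL checks and accounting would be skipped
    if (req.http.Authorization) {
        return (pass);
    }
    # Anonymous GET/HEAD requests may be answered from cache
    if (req.request == "GET" || req.request == "HEAD") {
        return (lookup);
    }
    return (pass);
}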
The Gateways
● SuperMicro 1U
– AMD Opteron 6200 series CPU
– 128GB RAM
● 20Gbit LACP trunk
● 4 nodes
● Varnish runs locally with RGW on each node
– Uses the RAM to cache objects
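The local RAM cache is simply Varnish's malloc storage backend; as an illustrative example (the listen address, VCL path and size are assumptions, not our exact settings), varnishd would be started along the lines of:

varnishd -a :80 -f /etc/varnish/rgw.vcl -s malloc,100G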
The Ceph cluster
● SuperMicro 2U chassis
– AMD Opteron 4334 CPU
– 32GB RAM
– Intel S3500 80GB SSD for OS
– Intel S3700 200GB SSD for Journaling
– 6x Seagate 3TB 7200RPM drive for OSD
● 2Gbit LACP trunk
● 18 nodes
● ~320TB of raw storage
Our problems
● When we cache Objects in Varnish, they don't
show up in the usage accounting of the RGW
– The HTTP request never reaches RGW
● When an object changes we have to purge all caches to maintain cache consistency
– A user might change an ACL or modify an object with a PUT request
● We wanted to make cached requests cheaper than non-cached requests
Our solution: Logstash
● All requests go from Varnish into Logstash and
into ElasticSearch
– From ElasticSearch we do the usage accounting
● When Logstash sees a PUT or DELETE request it makes a local request which sends out a multicast to all other RGW nodes to purge that specific object
● We also store bucket storage usage in
ElasticSearch so we have an average over the
month
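The purge itself is an ordinary HTTP PURGE request handled by Varnish on each gateway; how the multicast is delivered between the nodes is not shown here. A simplified sketch of the receiving side (Varnish 3 style, the ACL range is an example):

acl purgers {
    "localhost";
    "10.0.0.0"/24;    # example range for the other gateway nodes
}

sub vcl_recv {
    if (req.request == "PURGE") {
        if (!client.ip ~ purgers) {
            error 405 "Purging not allowed";
        }
        # Look up the object so vcl_hit/vcl_miss can purge it
        return (lookup);
    }
}

sub vcl_hit {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged";
    }
}

sub vcl_miss {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged (not in cache)";
    }
}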
LogStash and ElasticSearch
● varnishncsa → logstash → redis → elasticsearch
input {
pipe {
command => "/usr/local/bin/varnishncsa.logstash"
type => "http"
}
}
● And we simply execute varnishncsa
varnishncsa -F '%{VCL_Log:client}x %{VCL_Log:proto}x %{VCL_Log:authorization}x %{Bucket}o %m %{Host}i %U %b %s %{Varnish:time_firstbyte}x %{Varnish:hitmiss}x'
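The redis hand-off in the pipeline above is plain Logstash configuration; a sketch with example host and key names, not the exact production settings:

# On each gateway, ship events into redis
output {
    redis {
        host => "127.0.0.1"
        data_type => "list"
        key => "logstash"
    }
}

# On the indexer, read from redis and store into ElasticSearch
input {
    redis {
        host => "127.0.0.1"
        data_type => "list"
        key => "logstash"
    }
}
output {
    elasticsearch {
        host => "localhost"
    }
}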
%{Bucket}o?
● With %{<header>}o you can display the value of the response header <header>:
– %{Server}o: Apache 2
– %{Content-Type}o: text/html
● We patched RGW (the patch is in master) so that it can optionally return the bucket name in the response:
200 OK
Connection: close
Date: Tue, 25 Feb 2014 14:42:31 GMT
Server: AuroraObjects
Content-Length: 1412
Content-Type: application/xml
Bucket: "ceph"
X-Cache-Hit: No
● Setting 'rgw expose bucket = true' in ceph.conf makes RGW return the Bucket header
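In ceph.conf the setting goes into the RGW client section; the section name below is an example and depends on how the gateway instance is named:

[client.radosgw.gateway]
    rgw expose bucket = true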
Usage accounting
● We query RGW only for storage usage and also store that in ElasticSearch
● ElasticSearch is used for all traffic accounting
– Allows us to differentiate between cached and
non-cached traffic
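A sketch of how the cached/non-cached split can be pulled out of ElasticSearch; the index pattern and the field names (hitmiss from the varnishncsa format above, bytes for %b) are assumptions about the mapping, not the exact production query:

curl -s 'http://localhost:9200/logstash-*/_search?pretty' -d '{
    "size": 0,
    "aggs": {
        "cache": {
            "terms": { "field": "hitmiss" },
            "aggs": {
                "traffic_bytes": { "sum": { "field": "bytes" } }
            }
        }
    }
}'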
Back to Ceph: CRUSHMap
● A good CRUSHMap design should reflect the
physical topology of your Ceph cluster
– All machines have a single power supply
– The datacenter has an A and a B power circuit
● We use an STS (Static Transfer Switch) to create a third power circuit
● With CRUSH we store each replica on a different power circuit
– When a circuit fails, we lose only 1/3 of the Ceph cluster
– Each power circuit has its own switching / network
The CRUSHMap
type 7 powerfeed
host ceph03 {
alg straw
hash 0
item osd.12 weight 1.000
item osd.13 weight 1.000
..
}
powerfeed powerfeed-a {
alg straw
hash 0
item ceph03 weight 6.000
item ceph04 weight 6.000
..
}
root ams02 {
alg straw
hash 0
item powerfeed-a
item powerfeed-b
item powerfeed-c
}
rule powerfeed {
ruleset 4
type replicated
min_size 1
max_size 3
step take ams02
step chooseleaf firstn 0 type powerfeed
step emit
}
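The rule is then attached to the RGW data pool; the pool name below is the default of that era and the ruleset number matches the rule above (pre-Luminous syntax):

$ ceph osd pool set .rgw.buckets crush_ruleset 4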
The CRUSHMap
Testing the CRUSHMap
● With crushtool you can test your CRUSHMap
● $ crushtool -c ceph.zone01.ams02.crushmap.txt -o /tmp/crushmap
● $ crushtool -i /tmp/crushmap --test --rule 4 --num-rep 3 --show-statistics
● This shows you the result of the CRUSHMap:
rule 4 (powerfeed), x = 0..1023, numrep = 3..3
CRUSH rule 4 x 0 [36,68,18]
CRUSH rule 4 x 1 [21,52,67]
..
CRUSH rule 4 x 1023 [30,41,68]
rule 4 (powerfeed) num_rep 3 result size == 3: 1024/1024
● Manually verify those locations are correct
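Once the output looks sane, the compiled map can be injected into the running cluster with the standard Ceph commands (keep a backup of the current map first):

$ ceph osd getcrushmap -o /tmp/crushmap.backup
$ ceph osd setcrushmap -i /tmp/crushmap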
A summary
● We cache anonymously accessed objects with
Varnish
– Allows us to process thousands of requests per
second
– Saves us I/O on the OSDs
● We use LogStash and ElasticSearch to store all
requests and do usage accounting
● With CRUSH we store each replica on a different
power circuit
Resources
● LogStash: http://www.logstash.net/
● ElasticSearch: http://www.elasticsearch.net/
● Varnish: http://www.varnish-cache.org/
● CRUSH: http://ceph.com/docs/master/
● E-Mail: wido@42on.com
● Twitter: @widodh