Lessons from database failures
Colin Charles, Chief Evangelist, Percona Inc.

colin.charles@percona.com / byte@bytebot.net

http://www.bytebot.net/blog/ | @bytebot on Twitter

Percona Live Europe Amsterdam, Netherlands

5 October 2016
whoami
• Chief Evangelist (in the CTO office), Percona Inc

• Focusing on the MySQL ecosystem (MySQL, Percona Server, MariaDB
Server), as well as the MongoDB ecosystem (Percona Server for
MongoDB) + 100% open source tools from Percona like Percona
Monitoring & Management, Percona XtraBackup, Percona Toolkit, etc.

• Founding team of MariaDB Server (2009-2016), previously at Monty
Program Ab, merged with SkySQL Ab, now MariaDB Corporation

• Formerly MySQL AB (exit: Sun Microsystems)

• Past lives include Fedora Project (FESCo), OpenOffice.org

• MySQL Community Contributor of the Year Award winner 2014
Agenda
• Backups (and verification)

• Replication (and failover)

• Security (and encryption)
ma.gnolia.com
ma.gnolia.com’s failure
• January 30 2009: complete outage

• February 17 2009: data corruption in the UDB, essentially dead

• What happened?

• Ruby on Rails on four self-hosted Mac Minis, a couple of
Xserves, 500GB+ MySQL 5 DB

• Filesystem corruption, corrupted database backup

• No versioning, didn’t check if the backups worked, made use of
rsync to back up the database over a FireWire network
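The two gaps named above (no versioning, no verification) can be sketched in a few lines. This is a minimal illustration, not what ma.gnolia.com ran: the function names `sha256_of` and `versioned_backup`, the `.vNNNNNN` naming scheme and the `keep` parameter are all assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def versioned_backup(source: Path, backup_dir: Path, keep: int = 7) -> Path:
    """Copy `source` to a new numbered version and verify the copy.

    Earlier versions are never overwritten in place, so a corrupted
    source cannot silently replace the last good backup.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme: <file>.v000001, <file>.v000002, ...
    existing = sorted(backup_dir.glob(f"{source.name}.v*"))
    next_no = int(existing[-1].suffix[2:]) + 1 if existing else 1
    target = backup_dir / f"{source.name}.v{next_no:06d}"

    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        target.unlink()
        raise IOError(f"backup of {source} failed checksum verification")

    # Prune everything except the newest `keep` versions.
    for old in (existing + [target])[:-keep]:
        old.unlink()
    return target
```

A matching checksum only shows that the copy is faithful to the source; it does not show that the source itself restores cleanly. Restoring the backup into a scratch server remains the real test.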
ma.gnolia.com today?
• EC2 for the app with EBS snapshots, RDS with snapshots, Multi-AZ
deployment

• Self-hosted?

• xtrabackup
• START TRANSACTION WITH CONSISTENT SNAPSHOT +
mysqldump --single-transaction --master-data
• Back up a replica

• Replication event checksums
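As a rough sketch of the mysqldump bullet, the command line can be assembled like this. The host name, user and output path are placeholders, the function name `mysqldump_command` is made up for the example, and the password is assumed to come from an option file rather than the command line.

```python
import shlex


def mysqldump_command(host: str, user: str, outfile: str) -> list[str]:
    """Build an argument list for a consistent logical backup of InnoDB tables."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--single-transaction",  # dump inside one consistent-snapshot transaction
        "--master-data=2",       # record the binlog file/position as a comment
        "--all-databases",
        f"--result-file={outfile}",
    ]


# Placeholder values; pass the list to subprocess.run() to execute it.
print(shlex.join(mysqldump_command("replica1.example.com", "backup", "/backups/full.sql")))
```

Pointing the dump at a replica keeps the load off the master. Note that `--master-data` needs binary logging enabled on the server being dumped.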
Couchsurfing, 2006
Couchsurfing problems
1. major, avoidable hard drive crash

2. incremental backups weren’t executed in the correct manner, and
twelve of our most important data files didn’t survive
Time-delayed replication
• MySQL 5.6+ has time-delayed replication. Stop replication when you
know a mistake has happened before it propagates to all the slaves.

• Feature suggestion since 2001! Bug reported August 2006
(mysql#21639). Pushed June 2010 (WL#344). GA February 2013.
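On a real MySQL 5.6+ replica the delay is set with `CHANGE MASTER TO MASTER_DELAY = N`. The toy model below only illustrates why the delay helps; the `DelayedReplica` class, its fields and the sample events are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DelayedReplica:
    """Toy model of a replica that applies events `delay` seconds late."""
    delay: int
    applied: list[str] = field(default_factory=list)
    running: bool = True
    position: int = 0

    def catch_up(self, binlog: list[tuple[int, str]], now: int) -> None:
        """Apply, in order, every pending event old enough to pass the delay."""
        while self.running and self.position < len(binlog):
            timestamp, statement = binlog[self.position]
            if now - timestamp < self.delay:
                break  # too recent: keep it queued
            self.applied.append(statement)
            self.position += 1

    def stop(self) -> None:
        """Stand-in for stopping replication on the delayed replica."""
        self.running = False


binlog = [
    (0, "INSERT INTO users VALUES (1, 'alice')"),
    (100, "DROP TABLE users"),  # the mistake, made at t=100
]
replica = DelayedReplica(delay=3600)
replica.catch_up(binlog, now=3600)    # only the INSERT is old enough
replica.stop()                        # operator notices the mistake in time
replica.catch_up(binlog, now=10_000)  # stopped, so the DROP is never applied
print(replica.applied)
```

The delay buys a window in which the bad statement is queued but not yet applied, which is the point at which replication has to be stopped.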
Why replicate?
• Scale out

• [automatic] (master) failover

• Geographical redundancy across multiple data centres

• Online schema changes
Replication
• Asynchronous (default)

• (Enhanced loss-less) Semi-synchronous (plugin)

• Synchronous (Galera, group replication, NDBCLUSTER)

• DRBD
Frameworks
• MySQL-MMM

• Severalnines ClusterControl

• Orchestrator

• MySQL MHA

• Tungsten Replicator

• 5.6+ utilities:
mysqlfailover,
mysqlrpladmin

• Percona Replication Manager
(https://github.com/percona/percona-pacemaker-agents/)

• Replication Manager
(github.com/tanji/replication-manager)
GitHub
https://github.com/blog/1261-github-availability-this-week
Fully automated failover a good idea?
• False alarms

• Repeated failover

• Overloaded master? MHA doesn’t allow a failover within 8h of the
previous one, unless --last_failover_minute=n is set

• Data loss

• id=103 latest, relay logs at id=101 => loss

• group commit in the binary log

• Split brain
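Two of the risks above lend themselves to a small sketch: the guard against repeated failover, and the data-loss arithmetic from the id=103 / id=101 example. This is an illustration only and not MHA's implementation; the function names and the constant are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_FAILOVER_INTERVAL = timedelta(hours=8)  # mirrors the 8h window on the slide


def failover_allowed(last_failover: Optional[datetime], now: datetime,
                     min_interval: timedelta = MIN_FAILOVER_INTERVAL) -> bool:
    """Refuse a new failover if the previous one was too recent (anti-flapping)."""
    return last_failover is None or now - last_failover >= min_interval


def events_lost(master_last_id: int, candidate_last_id: int) -> int:
    """Count events the master committed that the promotion candidate never received."""
    return max(0, master_last_id - candidate_last_id)


now = datetime(2016, 10, 5, 12, 0)
print(failover_allowed(now - timedelta(hours=2), now))  # too soon after the last one
print(events_lost(103, 101))                            # events 102 and 103 are gone
```

With asynchronous replication the second number can be non-zero at any moment, which is why the conclusion recommends semi-synchronous replication alongside a failover tool.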
Proxies
• MariaDB MaxScale

• MaxScale as binlog server @ Booking - to replace intermediate
masters (downloads binlog from master, saves to disk, serves to
slave as if served from master)

• Popular use: load balancing Galera clusters

• MySQL Router + MySQL Fabric

• ProxySQL

• Used alongside Galera clusters too
Sharding
• SPIDER

• Tungsten Replicator

• Tumblr JetPants
Vitess
• Servers & tools to scale MySQL for web written in Go

• Has MariaDB support too (*)

• Python client interface

• DML annotation, connection pooling, shard management, workflow
management, zero downtime restarts

• Has become much easier to use: http://vitess.io/ (with the help of
Kubernetes)
Failwhales
• Twitter started on MySQL, and is still MySQL - you just need to
“evolve”

• Gizzard (sharding), Mesos + Apache Cotton

• Digg started on MySQL, migrated to Cassandra, and came back to
MySQL
Security
• Philippines voter data leak leaves 55m at risk: 338GB MySQL dump

• Ashley Madison: 6.9GB compressed dump, 36m email addresses
leaked, 9.6m credit card transactions

• Patreon: 13.7GB MySQL dump, 99 tables
Mossack Fonseca: Panama Papers
Prevent SQL injections
• MariaDB MaxScale database firewall filter

• Configurable filter actions on rule match (allow the query, block
the query or ignore the match), logging of matching and/or
non-matching queries

• MySQL Enterprise firewall
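The allow / block / ignore model described above can be sketched as a tiny rule engine. This is not MaxScale's rule syntax or API: the `Rule` class, the rule names and the regular expressions are hypothetical, and pattern matching this crude would be far too weak for production use.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    IGNORE = "ignore"


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    action: Action


# Hypothetical rules, evaluated in order.
RULES = [
    Rule("no-stacked-queries", re.compile(r";\s*\S"), Action.BLOCK),
    Rule("no-tautology", re.compile(r"\bOR\s+1\s*=\s*1\b", re.IGNORECASE), Action.BLOCK),
    Rule("audit-deletes", re.compile(r"^\s*DELETE\b", re.IGNORECASE), Action.IGNORE),
]


def check_query(sql: str, rules=RULES, log=None) -> bool:
    """Return True if the query may proceed, False if a rule blocks it."""
    for rule in rules:
        if rule.pattern.search(sql):
            if log is not None:
                log.append((rule.name, sql))  # log every matching query
            if rule.action is Action.BLOCK:
                return False
            if rule.action is Action.ALLOW:
                return True
            # IGNORE: the match is recorded, evaluation continues.
    return True
```

A filter like this sits between the application and the database, so it catches injected statements even when the application code fails to.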
Encryption at rest
• MariaDB Server 10.1: table or tablespace encryption

• design goal: Encrypt all user data that may touch the disk — InnoDB
data, InnoDB logs, binary logs, temporary tables, temporary files

• key management on the filesystem? [no key rotation] Amazon KMS? 

• caveats: mysqlbinlog needs work with encrypted binlogs; Galera
Cluster gcache isn’t encrypted

• MySQL 5.7: only encrypts InnoDB tablespaces (innodb_file_per_table;
logs unencrypted)
In conclusion…
• Use semi-sync replication with a failover solution that ensures you
don’t fail over too often

• Make good backups. Test them. Save them.

• You’ll most definitely need to shard your data: use proven
frameworks and get a proxy involved. Complete backups with
multi-source replication when needed.

• Use mysqldump and xtrabackup together (and mydumper for
parallel backup/restore; mysqlpump)

• Security is key: prevent SQL injections, encrypt your data at rest
It’s 2016, you don’t want this…
Percona Monitoring and Management (PMM)
• http://pmmdemo.percona.com/
Thank you. Q&A?
colin.charles@percona.com / byte@bytebot.net
@bytebot on Twitter | http://www.bytebot.net/blog/
slides: slideshare.net/bytebot