Introduction to Databases - query optimizations for MySQL

Databases
Kodok Márton
■ simb.ro
■ kodokmarton.eu
■ twitter.com/martonkodok
■ facebook.com/marton.kodok
■ stackoverflow.com/users/243782/pentium10
23 May, 2013 @Sapientia

Relational Databases
● A relational database is essentially a group of tables (entities).
● Tables are made up of columns and rows.
● Those tables have constraints, and relationships are defined between them.
● Relational databases are queried using SQL
● Multiple tables being accessed in a single query are "joined" together, typically by
a criteria defined in the table relationship columns.
● Normalization is a data-structuring model used with relational databases that
ensures data consistency and removes data duplication.

Non-Relational Databases (NoSQL)
● Key-value stores are the simplest NoSQL
databases. Every single item in the database is
stored as an attribute name, or key, together with its
value. Examples of key-value stores are Riak and
MongoDB. Some key-value stores, such as Redis,
allow each value to have a type, such as "integer",
which adds functionality.
● Document databases can contain many different
key-value pairs, or key-array pairs, or even nested
documents
● Graph stores are used to store information about
networks, such as social connections
● Wide-column stores such as Cassandra and
HBase are optimized for queries over large datasets,
and store columns of data together, instead of rows.

SQL vs NoSQL
The purpose of this presentation is NOT about
SQL vs NoSQL.
Let’s be blunt: none of them are difficult, we need both of them.
We evolve.

Spreadsheets/Frontends
Excel and Access are not a database.
Let’s be blunt: Excel does not need more than 256 columns and 65 536 rows.

DDL (Data Definition Language)
CREATE TABLE employees (
id INTEGER(11) PRIMARY KEY,
first_name VARCHAR(50) NULL,
last_name VARCHAR(75) NOT NULL,
dateofbirth DATE NULL
);
ALTER TABLE sink ADD bubbles INTEGER;
ALTER TABLE sink DROP COLUMN bubbles;
DROP TABLE employees;
RENAME TABLE My_table TO
Tmp_table;
TRUNCATE TABLE My_table;
CREATE, ALTER, DROP, RENAME, TRUNCATE

SQL (Structured Query Language)
INSERT INTO My_table
(field1, field2, field3)
VALUES
('test', 'N', NULL);
SELECT Book.title AS Title,
COUNT(*) AS Authors
FROM Book JOIN Book_author
ON Book.isbn = Book_author.isbn
GROUP BY Book.title;
UPDATE My_table
SET field1 = 'updated value'
WHERE field2 = 'N';
DELETE FROM My_table
WHERE field2 = 'N';
CRUD (CREATE, READ, UPDATE, DELETE)

Indexes (fast lookup + constraints)
Constraint:
a. PRIMARY
b. UNIQUE
c. FOREIGN KEY
● Index reduce the amount of data the server has to examine
● can speed up reads but can slow down inserts and updates
● is used to enforce constraints
Type:
1. BTREE
- can be used for look-ups and sorting
- can match the full value
- can match a leftmost prefix ( LIKE 'ma%' )
2. HASH
- only supports equality comparisons: =, IN()
- can't be used for sorting
3. FULLTEXT
- only MyISAM tables
- compares words or phrases
- returns a relevance value

Some Queries That Can Use BTREE Index
● point look-up
SELECT * FROM students WHERE grade = 100;
● open range
SELECT * FROM students WHERE grade > 75;
● closed range
SELECT * FROM students WHERE 70 < grade < 80;
● special range
SELECT * FROM students WHERE name LIKE 'ma%';
Multi Column Indexes Useful for sorting/where
CREATE INDEX `salary_name_idx` ON emp(salary, name);
SELECT salary, name FROM emp ORDER BY salary, name;
(5000, 'john') < (5000, 'michael')
(9000, 'philip') < (9999, 'steve')

Indexing InnoDB Tables
● data is clustered by primary key
● primary key is implicitly appended to all indexes
CREATE INDEX fname_idx ON emp(firstname);
actually creates KEY(firstname, id) internally
Avoid long primary keys!
TS-09061982110055-12345
349950002348857737488334
supercalifragilisticexpialidocious

How MySQL Uses Indexes
● looking up data
● joining tables
● sorting
● avoiding reading data
MySQL chooses only ONE index per table.

DON'Ts
● don't follow optimization rules blindly
● don't create an index for every column in your table
thinking that it will make things faster
● don't create duplicate indexes
ex.
BAD:
create index firstname_ix on Employee(firstname);
create index lastname_ix on Employee(lastname);
GOOD:
create index first_last_ix on Employee(firstname, lastname);
create index id_ix on Employee(id);

DOs
● use index for optimizing look-ups, sorting and retrieval of
data
● use short primary keys if possible when using the
InnoDB storage engine
● extend index if you can, instead of creating new indexes
● validate performance impact as you're doing changes
● remove unused indexes

Speeding it up
● proper table design (3nf)
● understand query cache (internal of MySQL)
● EXPLAIN syntax
● proper indexes
● MySQL server daemon optimization
● Slow Query Logs
● Stored Procedures
● Profiling
● Redundancy -> Master : Slave
● Sharding
● mix database servers based on business logic
-> Memcache, Redis, MongoDB

EXPLAIN
EXPLAIN SELECT * FROM attendees
WHERE conference_id = 123 AND registration_status > 0
table possible_keys key rows
attendees NULL NULL 14052
The three most important columns returned by EXPLAIN
1) Possible keys
● All the possible indexes which MySQL could have used
● Based on a series of very quick lookups and calculations
2) Chosen key
3) Rows scanned
● Indication of effort required to identify your result set
-> Interpreting the results

Interpreting the results
● No suitable indexes for this query
● MySQL had to do a full table scan
● Full table scans are almost always the slowest query
● Full table scans, while not always bad, are usually an indication that an
index is required
-> Adding indexes

Adding indexes
ALTER TABLE ADD INDEX conf (conference_id);
ALTER TABLE ADD INDEX reg (registration_status);
attendees conf, reg conf 331
● MySQL had two indexes to choose from, but discarded “reg”
● “reg” isn't sufficiently unique
● The spread of values can also be a factor (e.g when 99% of rows contain
the same value)
● Index “uniqueness” is called cardinality
● There is scope for some performance increase... Lower server load,
quicker response
-> Choosing a better index

Choosing a better index
ALTER TABLE ADD INDEX reg_conf_index (registration_status, conference_id);
WHERE registration_status > 0 AND conference_id = 123
attendees reg, conf,
reg_conf_index
reg_conf_index 204
● reg_conf_index is a much better choice
● Note that the other two keys are still available, just not as effective
● Our query is now served well by the new index
-> Using it wrong

Watch for WHERE column order
DELETE INDEX conf; DELETE INDEX reg;
EXPLAIN SELECT * FROM attendees WHERE conference_id = 123
● Without the “conf” index, we're back to square one
● The order in which fields were defined in a composite index affects whether it is available for
use in a query
● Remember, we defined our index : (registration_status, conference_id)
Potential workaround:
EXPLAIN SELECT * FROM attendees WHERE registration_status >= -1 AND
conference_id = 123
attendees reg_conf_index reg_conf_index 204

JOINs
● JOINing together large data sets (>= 10,000) is really where EXPLAIN
becomes useful
● Each JOIN in a query gets its own row in EXPLAIN
● Make sure each JOIN condition is FAST
● Make sure each joined table is getting to its result set as quickly as possible
● The benefits compound if each join requires less effort

Simple JOIN example
EXPLAIN SELECT * FROM conferences c
JOIN attendees a ON c.conference_id = a.conference_id
WHERE conferences.location_id = 2 AND conferences.topic_id IN (4,6,1) AND
attendees.registration_status > 1
table type possible_keys key rows
conferences ref conference_topic conference_topic 15
attendees ALL NULL NULL 14052
● Looks like I need an index on attendees.conference_id
● Another indication of effort, aside from rows scanned
● Here, “ALL” is bad – we should be aiming for “ref”
● There are 13 different values for “type”
● Common values are:
const, eq_ref, ref, fulltext, index_merge, range, all
http://dev.mysql.com/doc/refman/5.0/en/using-explain.html

The "extra" column
With every EXPLAIN, you get an “extra” column, which shows additional
operations invoked to get your result set.
Some example “extra” values:
● Using index
● Using where
● Using temporary table
● Using filesort
There are many more “extra” values which are discussed in the MySQL manual:
Distinct, Full scan, Impossible HAVING, Impossible WHERE, Not exists
http://dev.mysql.com/doc/refman/5.0/en/explain-output.html#explain-join-types
table type possible_keys key rows extra
attendees ref conf conf 331 Using where
Using filesort

Using filesort
Avoid, because:
● Doesn't use an index
● Involves a full scan of your result set
● Employs a generic (i.e. one size fits all) algorithm
● Creates temporary tables
● Uses the filesystem (seek)
● Will get slower with more data
It's not all bad...
● Perfectly acceptable provided you get to your
● result set as quickly as possible, and keep it predictably small
● Sometimes unavoidable - ORDER BY RAND()
● ORDER BY operations can use indexes to do the sorting!

Using filesort
WHERE conference_id = 123 ORDER BY surname
ALTER TABLE attendees ADD INDEX conf_surname (conference_id, surname);
We've avoided a filesort!
table possible_keys key rows extra
attendees conference_id conference_id 331 Using filesort
MySQL is using an index, but it's sorting the results slowly
table possible_keys key rows extra
attendees conference_id,
conf_surname
conf_surname 331

NoSQL engines
● Redis
● MongoDB
● Cassandra
● CouchDB
● DynamoDB
● Riak
● Membase
● HBase
Data is created, updated, deleted, retrieved using API calls.
All application and data integrity logic is contained in the application code.

Redis
● Written In: C/C++
● Main point: Blazing fast
● License: BSD
● Protocol: Telnet-like
● Disk-backed in-memory database,
● Master-slave replication
● Simple values or hash tables by keys,
● but complex operations like ZREVRANGEBYSCORE.
● INCR & co (good for rate limiting or statistics)
● Has sets (also union/diff/inter)
● Has lists (also a queue; blocking pop)
● Has hashes (objects of multiple fields)
● Sorted sets (high score table, good for range queries)
● Redis has transactions (!)
● Values can be set to expire (as in a cache)
● Pub/Sub lets one implement messaging (!)
Best used: For rapidly changing data with a foreseeable database size (should fit mostly in memory).
For example: Stock prices. Analytics. Real-time data collection. Real-time communication.
SET uid:1000:username antirezr
uid:1000:followers
uid:1000:following
GET foo => bar
INCR foo => 11
LPUSH mylist a (now mylist holds one element list 'a')
LPUSH mylist b (now mylist holds 'b,a')
LPUSH mylist c (now mylist holds 'c,b,a')
SADD myset a
SADD myset b
SADD myset foo
SADD myset bar
SCARD myset => 4
SMEMBERS myset => bar,a,foo,b

MongoDB
● Written In: C++
● Main point: Retains some friendly properties of SQL. (Query, index)
● License: AGPL (Drivers: Apache)
● Protocol: Custom, binary (BSON)
● Master/slave replication (auto failover with replica sets)
● Sharding built-in
● Queries are javascript expressions / Run arbitrary javascript functions server-side
● Uses memory mapped files for data storage
● Performance over features
● Journaling (with --journal) is best turned on
● On 32bit systems, limited to ~2.5Gb
● An empty database takes up 192Mb
● Has geospatial indexing
Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If
you need good performance on a big DB.
For example: For most things that you would do with MySQL or PostgreSQL, but having predefined
columns really holds you back.

MongoDB -> JSON
The MongoDB examples assume a collection named users that contain
documents of the following prototype:
{
_id: ObjectID("509a8fb2f3f4948bd2f983a0"),
user_id: "abc123",
age: 55,
status: 'A'
}

MongoDB -> insert
SQL MongoDB
CREATE TABLE users (
id MEDIUMINT NOT NULL
AUTO_INCREMENT,
user_id Varchar(30),
age Number,
status char(1),
PRIMARY KEY (id)
)
db.users.insert( {
user_id: "abc123",
age: 55,
status: "A"
} )
Implicitly created on first insert operation.
The primary key _id is automatically added
if _id field is not specified.

MongoDB -> Alter, Index, Select
SQL MongoDB
ALTER TABLE users
ADD join_date DATETIME
db.users.update(
{ },
{ $set: { join_date: new Date() } },
{ multi: true }
)
CREATE INDEX idx_user_id_asc
ON users(user_id)
db.users.ensureIndex( { user_id: 1 } )
SELECT user_id, status
FROM users
WHERE status = "A"
db.users.find(
{ status: "A" },
{ user_id: 1, status: 1, _id: 0 }
)
SELECT *
FROM users
WHERE status = "A"
OR age = 50
db.users.find(
{ $or: [ { status: "A" } ,
{ age: 50 } ] }
)

Introduction to Databases - query optimizations for MySQL

More Related Content

What's hot (20)

Similar to Introduction to Databases - query optimizations for MySQL (20)

More from Márton Kodok (20)

Recently uploaded (20)

Introduction to Databases - query optimizations for MySQL