SC4x W3L1 TopicsInDatabases v2
Indexes and Performance
Indexes
• When database tables get large, performance can suffer: finding or updating a record becomes slow or costly
Index example
• Customers table, where CustomerID is primary key
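A minimal sketch of the idea in Python's built-in sqlite3 module, using a hypothetical Customers table (the table and column names are assumptions). The primary key is indexed automatically; a query on a non-key column such as LastName scans the whole table until an index is added:

```python
import sqlite3

# Hypothetical Customers table; CustomerID is the primary key,
# so SQLite indexes it automatically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, LastName TEXT)")
conn.executemany("INSERT INTO Customers VALUES (?, ?)",
                 [(i, f"Name{i}") for i in range(1000)])

# Without an index on LastName: the query plan shows a full table SCAN
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Customers WHERE LastName = 'Name500'"
).fetchall()
print(plan)  # plan detail mentions a SCAN of Customers

# After adding an index, the planner switches to an index SEARCH
conn.execute("CREATE INDEX idx_lastname ON Customers (LastName)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Customers WHERE LastName = 'Name500'"
).fetchall()
print(plan)  # plan detail mentions SEARCH ... USING INDEX idx_lastname
```

`EXPLAIN QUERY PLAN` makes the trade-off visible without timing anything: the same query goes from scanning every row to a direct index lookup.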
Optimization using indexes
• Search engines are optimized using 'text retrieval engines'
  – Large text corpora can be indexed, such that every keyword that might be searched will have an index
  – These index entries can then be optimized based on the number of occurrences or the rank of a keyword among all possible matches for that keyword
  – The frequency or importance of links to a web site, the usage, and other factors may be used to optimize these indexes
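The keyword-to-index idea above can be sketched as an inverted index in a few lines of Python. The documents and counts here are made up; real engines add ranking signals such as link importance on top of this structure:

```python
from collections import defaultdict

# Minimal inverted index: map each keyword to the documents that
# contain it, with occurrence counts so results can be ranked by
# term frequency. Document texts are illustrative.
docs = {
    1: "fast shipping and fast delivery",
    2: "delivery delayed by carrier",
    3: "carrier picked up the shipping label",
}

index = defaultdict(dict)              # keyword -> {doc_id: count}
for doc_id, text in docs.items():
    for word in text.split():
        index[word][doc_id] = index[word].get(doc_id, 0) + 1

def search(keyword):
    # Rank matching documents by how often the keyword occurs
    hits = index.get(keyword, {})
    return sorted(hits, key=hits.get, reverse=True)

print(search("delivery"))   # documents 1 and 2 contain "delivery"
print(search("fast"))       # [1] -- "fast" occurs twice in document 1
```

A search never touches documents that lack the keyword, which is why indexed lookup stays fast as the corpus grows.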
Key points from lesson
• Indexes are used to make searching a database faster
Databases and data warehouses
Databases and data warehouses
• The best practices and techniques discussed so far make sense for databases and Online Transaction Processing (OLTP); data warehouses instead support Online Analytical Processing (OLAP)
• Each has pros and cons, and each is ideal for different purposes
Use cases
OLTP                                                                   | OLAP
Manage real-time business operations                                   | Perform analytics and reporting
Supports implementation of business tasks                              | Supports data-driven decision making
e.g. Transactions in an online store                                   | e.g. Data mining and machine learning
e.g. Dashboard showing health of business over the last few days/hours | e.g. Forecasting based on historic data
Concurrency is paramount                                               | Concurrency may not be important
Key differences
OLTP                                                              | OLAP
Optimized for writes (updates and inserts)                        | Optimized for reads
Normalized using normal forms, few duplicates                     | Denormalized, many duplicates
Few indexes, many tables                                          | Many indexes, fewer tables
Optimized for simple queries                                      | Used for complex queries
Uncommon to store metrics which can be derived from existing data | Common to store derived metrics
Overall performance: designed to be very fast for small writes and simple reads | Overall performance: designed to be very fast for reads and large writes, however relatively slower because data tends to be much larger
Differences in normalization
• Star schema is common in OLAP
  – E.g.: fact table is a historical record of shipments to customers, including price and locations:
    · Date dimension: full date, year, quarter, month, week
    · Carrier dimension: carrier, type of freight, address, location
    · Customer dimension: name, address, income bracket, percentiles based on demographic data from census
https://commons.wikimedia.org/wiki/File:Star-schema.png
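The shipment example above can be sketched as SQLite tables; all table and column names here are illustrative assumptions, and the single fact row is made up:

```python
import sqlite3

# Star schema sketch: one fact table of shipments referencing
# date, carrier, and customer dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT,
                           year INTEGER, quarter INTEGER, month INTEGER, week INTEGER);
CREATE TABLE dim_carrier  (carrier_key INTEGER PRIMARY KEY, carrier TEXT,
                           freight_type TEXT, location TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT,
                           address TEXT, income_bracket TEXT);
CREATE TABLE fact_shipments (date_key INTEGER, carrier_key INTEGER,
                             customer_key INTEGER, price REAL);

INSERT INTO dim_date     VALUES (1, '2017-03-01', 2017, 1, 3, 9);
INSERT INTO dim_carrier  VALUES (1, 'FastFreight', 'LTL', 'Boston');
INSERT INTO dim_customer VALUES (1, 'Acme Corp', '1 Main St', 'high');
INSERT INTO fact_shipments VALUES (1, 1, 1, 125.50);
""")

# A typical OLAP query: total shipment price by year and carrier
row = conn.execute("""
    SELECT d.year, c.carrier, SUM(f.price)
    FROM fact_shipments f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_carrier c ON f.carrier_key = c.carrier_key
    GROUP BY d.year, c.carrier
""").fetchone()
print(row)  # (2017, 'FastFreight', 125.5)
```

Every analytical query joins the central fact table to small dimension tables, which is the shape the star schema is named for.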
Key points from lesson
• Data storage in a production database tends to be optimized
for business operation tasks
NoSQL
NoSQL
• Joining two large tables together in a normalized relational
database can be slow
• Simple reads and writes are very fast with NoSQL solutions
https://neo4j.com/developer/graph-database/
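A sketch of why simple reads and writes are fast in a key-value style NoSQL store: each record lives under a single key, so a lookup is one hash access with no joins. A plain dict stands in here for a real store such as Redis or DynamoDB, and all data is made up:

```python
# In-memory stand-in for a key-value document store
store = {}

def put(key, document):
    store[key] = document          # O(1) write, no normalization step

def get(key):
    return store.get(key)          # O(1) read, no joins

# The whole customer, including nested orders, is one document --
# data that a normalized relational design would split across tables
# and reassemble with joins at query time.
put("customer:42", {
    "name": "Acme Corp",
    "orders": [{"id": 1, "total": 99.0}, {"id": 2, "total": 15.5}],
})

print(get("customer:42")["orders"][0]["total"])  # 99.0
```

The trade-off mirrors the OLTP/OLAP comparison: duplicating data inside documents buys fast access at the cost of the consistency guarantees normalization provides.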
Major programming and database stacks
https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAfQAAAAJDlmNDIxZTVmLTQ1MzQtNDc0Mi00YTI2LWU3ZDg3ZTRjYWU4NQ.png
Key points from lesson
• NoSQL may be faster in some use cases; design the data model around the application
Cloud services
Cloud computing
• Infrastructure as a service (IaaS): outsource hardware
  – User provides operating system, database, app, etc.
  – Most flexible, higher upfront total IT cost
  – Example: Rackspace
Example: AWS data analytics pipeline
https://aws.amazon.com/data-warehouse/
Key points from lesson
• PaaS removes the high start-up cost of owning hardware, and the IT cost of maintaining hardware
• With PaaS, users are responsible for learning the best way to implement these services – they are not foolproof
Data Cleaning
Data cleaning: example problems
• Date format mismatch
• Correct text case
• Split first and last names
• Column mismatch/offset
• Remove records with missing values or impute missing values
based on logic
• Prepare two datasets for a merge with imperfect key
matching
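A few of the cleaning steps above, sketched on a single made-up record (the field names and formats are assumptions): normalizing a date format, fixing text case, and splitting first and last names.

```python
from datetime import datetime

# Illustrative messy record: wrong case, ambiguous date format,
# first and last name in one field.
raw = {"name": "jANE doe", "order_date": "03/15/2017"}

# Date format mismatch: parse MM/DD/YYYY and emit ISO 8601
clean_date = datetime.strptime(raw["order_date"], "%m/%d/%Y").strftime("%Y-%m-%d")

# Correct text case, then split first and last names
first, last = raw["name"].title().split(" ", 1)

record = {"first_name": first, "last_name": last, "order_date": clean_date}
print(record)  # {'first_name': 'Jane', 'last_name': 'Doe', 'order_date': '2017-03-15'}
```

Real cleaning jobs apply steps like these across millions of rows, which is where the tool choices discussed next start to matter.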
Filtering large datasets
• Suppose you receive a very large dataset from a third party which contains order data for their 2000 stores, and:
  – You only need stores which are in the Northeast
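One way to handle this without loading the whole file into memory is to stream it row by row with Python's csv module. The file paths, the `state` column, and the Northeast state list are all assumptions for the sketch:

```python
import csv

# Assumed definition of "Northeast" for this sketch
NORTHEAST = {"MA", "NH", "VT", "ME", "RI", "CT", "NY", "NJ", "PA"}

def filter_northeast(in_path, out_path):
    """Copy only Northeast rows from a large CSV, one row at a time."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:                 # only one row in memory at a time
            if row["state"] in NORTHEAST:
                writer.writerow(row)
```

Because it never holds more than one row, this approach works on files far larger than RAM, the same property that makes the Unix tools discussed below attractive.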
Types of data cleaning solutions
• Off-the-shelf software for (or including) data preparation
  – Examples: Trifacta, Paxata, Alteryx, SAS – among the most user-friendly options
Off the shelf tools for data preparation
• Graphical user interfaces: no programming required, enabling collaboration with non-programmers
• Can join disparate data sources together
• Works on large data sets
• Reproducible steps and workflow
• Offers version control
Open-source programming languages
• Working in data frames and data dictionaries requires some programming skills, but the languages are relatively Google-friendly
• A broad set of computational and scientific tools is available
• Reproducibility, version control, visual inspection and step-wise data processing
• Free
Unix command line tools
• Developed in the 1970s and still relevant today
• Not as user friendly as previous options, but very fast and
versatile
• Google-friendly for common applications
• Excellent for breaking up large datasets that would crash
other software
• Free
• Built into Unix-based systems
http://teaching.idallen.com/cst8207/13f/notes/data/awkgrepsedpwd.gif
grep, sed and awk with regex
• regex (regular expressions) – used to specify a text pattern
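The same regex syntax used by grep, sed and awk is available in Python's re module. A sketch, using a pattern that pulls US-style ZIP codes out of made-up address lines:

```python
import re

# Illustrative input lines
lines = [
    "Acme Corp, Cambridge MA 02139",
    "no zip on this line",
    "Beta LLC, Boston MA 02110-1301",
]

# Pattern: exactly 5 digits, with an optional -NNNN extension,
# bounded so it doesn't match inside longer digit runs
zip_pattern = re.compile(r"\b\d{5}(?:-\d{4})?\b")

for line in lines:
    match = zip_pattern.search(line)
    if match:
        print(match.group())  # 02139, then 02110-1301
```

The equivalent filter in grep would use the same pattern; regex knowledge transfers directly between the command-line tools and the programming-language option above.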
Key points from lesson
• Data almost always needs to be cleaned or pre-processed
before it can be inserted into a database