Topics in Databases

MIT Center for Transportation & Logistics
ctl.mit.edu
Motivating questions
• How can relational databases be optimized for performance?

• When is a normalized relational database not the best solution?

• What are the benefits of storing data on the cloud?

• How is data cleaned and pre-processed?

2
Indexes and Performance

3
Indexes
• When database tables get large, performance can suffer, making it
  slow or costly to find or update a record

• An index is a separate data object, stored in the database, that
  lists the records in order
  - Primary keys and foreign keys are automatically indexed

• Indexed columns can be searched rapidly

Company Index             CustNbr   Company                  CustRep   CreditLimit
Amaratunga Enterprises    211       Connor Co                89        50000.00
Connor Co                 522       Amaratunga Enterprises   89        40000.00
Feni Fabricators          890       Feni Fabricators         53        1000000.00
4
Indexes
• Each index may need to be updated when a row is updated
  - Indexes slow updates, insertions and deletions
  - If a database is mostly read, index the most frequently selected
    attributes to improve performance
  - If a database is mostly updated, use as few indexes as possible
  - Practical maximum of 3 or 4 indexes per table

5
Index example
• Customers table, where CustomerID is the primary key
  - We also want to search by:
    - Customer name (last, first)
    - City, state
    - Postal (zip) code
    - Address
  - Index the name, city/state, zip and address (see the SQL sketch below)
    - Four indexes: slower inserts, updates and deletes, but fast lookups
    - If the customer database is fairly stable (few updates), this is fine
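A minimal sketch of those four indexes in SQL. The column names
(LastName, FirstName, State, City, PostalCode, Address) are assumptions
for illustration; the slides only name the attributes informally.

CREATE INDEX IX_CustomerName ON Customers (LastName, FirstName);
CREATE INDEX IX_City_State   ON Customers (State, City);
CREATE INDEX IX_PostalCode   ON Customers (PostalCode);
CREATE INDEX IX_Address      ON Customers (Address);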
6
Index syntax
CREATE INDEX IX_City_State
ON Customers (State, City)

• Makes searching for customers within a specific state faster
• Makes searching for customers within a specific state and a
  specific city faster
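For illustration, both of the following queries can use the composite
index because State is its leading column; a query filtering on City
alone generally cannot. Column names other than State and City are
assumptions.

SELECT CustNbr, Company
FROM Customers
WHERE State = 'MA';                       -- can use IX_City_State (leading column)

SELECT CustNbr, Company
FROM Customers
WHERE State = 'MA' AND City = 'Boston';   -- can use IX_City_State (both columns)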

7
Optimization using indexes
• Search engines are optimized using 'text retrieval engines'
  - Large text corpora can be indexed so that every keyword that
    might be searched has an index entry
  - These index entries can then be optimized based on the number of
    occurrences or the rank of a keyword among all possible matches
    for that keyword
  - The frequency or importance of links to a web site, its usage,
    and other factors may be used to optimize these indexes

8
Key points from lesson
• Indexes are used to make searching a database faster

• Many indexes will slow updates, insertions and deletions, so
  prioritize indexes on the most frequently selected attributes

• Think about how the database is used in the business to determine
  the best use of indexes

9
Databases and data warehouses

10
Databases and data warehouses
• Best practices and techniques discussed so far make sense for
  databases used for Online Transaction Processing (OLTP)

• Data warehouses are another choice for storing data; these are
  optimized for Online Analytical Processing (OLAP)

• Each has pros and cons, and each is ideal for different purposes

11
Use cases
OLTP                                         OLAP
Manage real-time business operations         Perform analytics and reporting
Supports implementation of business tasks    Supports data-driven decision making
e.g. Transactions in an online store         e.g. Data mining and machine learning
e.g. Dashboard showing health of business    e.g. Forecasting based on historic data
     over the last few days/hours
Concurrency is paramount                     Concurrency may not be important

12
Key differences
OLTP                                            OLAP
Optimized for writes (updates and inserts)      Optimized for reads
Normalized using normal forms, few duplicates   Denormalized, many duplicates
Few indexes, many tables                        Many indexes, fewer tables
Optimized for simple queries                    Used for complex queries
Uncommon to store metrics which can be          Common to store derived metrics
     derived from existing data
Overall performance: designed to be very        Overall performance: designed to be very
     fast for small writes and simple reads          fast for reads and large writes, but
                                                     relatively slower because the data
                                                     tends to be much larger

13
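To make the "simple vs. complex queries" row concrete, here is a hedged
sketch: an OLTP-style point lookup against an operational table and an
OLAP-style aggregation over a fact table. All table and column names
are illustrative assumptions, not from the slides.

-- OLTP: simple, indexed point lookup that touches one row
SELECT OrderID, Status
FROM Orders
WHERE OrderID = 10452;

-- OLAP: complex aggregation that scans a large fact table
SELECT c.IncomeBracket, d.Year, SUM(f.Price) AS TotalRevenue
FROM ShipmentFacts f
JOIN CustomerDim c ON f.CustomerKey = c.CustomerKey
JOIN DateDim d     ON f.DateKey     = d.DateKey
GROUP BY c.IncomeBracket, d.Year;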
Differences in normalization
• Star schema is common in OLAP
  - E.g.: the fact table is a historical record of shipments to customers,
    including price and locations (see the DDL sketch below):
    - Date dimension: full date, year, quarter, month, week
    - Carrier dimension: carrier, type of freight, address, location
    - Customer dimension: name, address, income bracket, percentiles
      based on demographic data from census

https://commons.wikimedia.org/wiki/File:Star-schema.png
14
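A minimal star-schema sketch in SQL based on the dimensions listed
above. Table names, key names and data types are assumptions for
illustration, not taken from the slides.

CREATE TABLE DateDim (
    DateKey   INT PRIMARY KEY,
    FullDate  DATE,
    Year      INT,
    Quarter   INT,
    Month     INT,
    Week      INT
);

CREATE TABLE CarrierDim (
    CarrierKey  INT PRIMARY KEY,
    Carrier     VARCHAR(100),
    FreightType VARCHAR(50),
    Address     VARCHAR(200)
);

CREATE TABLE CustomerDim (
    CustomerKey   INT PRIMARY KEY,
    Name          VARCHAR(100),
    Address       VARCHAR(200),
    IncomeBracket VARCHAR(20)
);

-- Fact table: one row per shipment, keyed to each dimension
CREATE TABLE ShipmentFacts (
    DateKey     INT REFERENCES DateDim (DateKey),
    CarrierKey  INT REFERENCES CarrierDim (CarrierKey),
    CustomerKey INT REFERENCES CustomerDim (CustomerKey),
    Price       DECIMAL(10, 2),
    Origin      VARCHAR(100),
    Destination VARCHAR(100)
);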
Key points from lesson
• Data storage in a production database tends to be optimized for
  business operation tasks

• Companies interested in analyzing their data may not want to run
  big queries and risk the performance of their database

• Data warehouses can be designed differently, storing data in a way
  that optimizes use for analytics

15
NoSQL

16
NoSQL
• Joining two large tables together in a normalized relational
  database can be slow

• NoSQL databases offer an alternative solution, where data are not
  stored in explicit table structures with relationships

• Simple reads and writes are very fast with NoSQL solutions

• NoSQL databases are easy to scale because of their simplified
  structures

• Not all problems are suited to NoSQL technologies
17
Common types of NoSQL databases
• Key-value database – "look-up table" or "dictionary"
  - Simplest examples can be made more complex with formats like JSON
  - Each record may have different data fields or attributes, which
    are stored together with a unique key
  - Data model is not predefined; empty fields are not stored

  Simple key-value store:
  Key    Value
  1000   Chris
  1001   Julie
  1002   Clark
  1003   MA

  Key-value store with JSON values:
  Key    Value
  1000   {name: "Chris", language: "English", city: "Boston", state: "MA"}
  1001   {name: "Julie", state: "NY"}
  1002   {name: "Clark", language: "French", language: "Spanish"}
18
Common types of NoSQL databases
• Document-oriented database
  - Similar to a key-value database
  - Key-value pairs can be further grouped into collections,
    typically related to entities
  - Note the duplicated data

  Employees
  Key    Value
  1000   {name: "Chris", language: "English", city: "Boston", state: "MA"}
  1001   {name: "Julie", state: "NY"}
  1002   {name: "Clark", language: "French", language: "Spanish"}

  Offices
  Key    Value
  10000  {name: "Boston Office", city: "Boston", state: "MA",
          employee: "Chris", employee: "Clark"}
  10001  {name: "New York Office", state: "NY", employee: "Julie"}
19
Common types of NoSQL databases
• Column-oriented database
  - Imagine "indexing" every column in a relational database with the
    row ID
  - "RowID" acts as the "key" and the attribute acts as the "value"
  - Now information about each column is stored more efficiently for
    some queries

  Employees
  RowID   First Name    State
  1       Christopher   Massachusetts
  2       Julie         New York
  3       Christopher   New York

  First Name     RowID
  Christopher    1, 3
  Julie          2

  State          RowID
  Massachusetts  1
  New York       2, 3
20
Common types of NoSQL databases
• Graph database
  - Connections are stored explicitly rather than created at query
    time (as joins are in a relational database); efficient for
    highly connected systems

https://neo4j.com/developer/graph-database/
21
Major programming and database stacks

https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAfQAAAAJDlmNDIxZTVmLTQ1MzQtNDc0Mi00YTI2LWU3ZDg3ZTRjYWU4NQ.png
22
Key points from lesson
• NoSQL may be faster in some use cases; design the data model
  around the application

• Transactional business data is still commonly stored in relational
  databases due to consistency and how reads and writes are handled

• Companies may use multiple types of databases for their various
  operational tasks

23
Cloud services

24
Cloud computing
• Infrastructure as a service (IaaS): outsource hardware
  - User provides operating system, database, app, etc.
  - Most flexible, higher upfront total IT cost
  - Example: Rackspace

• Platform as a service (PaaS): outsource the operating environment
  - Cloud platform provides OS, database, etc.
  - Examples: Amazon, Microsoft, Google

• Software as a service (SaaS): outsource software
  - Configure the third-party app
  - Examples: Salesforce, HubSpot
  - Least flexible, lower upfront total IT cost
25
Benefits of the cloud
• Low start-up cost
• Low risk development and testing
• Managed hardware (and software)
• Global reach
• Highly available and highly durable
• Scale on demand, pay for what you use
• In some cases, can work with local infrastructure if needed

26
Example: AWS data analytics pipeline

https://aws.amazon.com/data-warehouse/
27
Key points from lesson
• PaaS removes the high start-up cost of owning hardware, and the IT
  cost of maintaining hardware

• You can rely on the IT of large PaaS companies to create services
  that are highly available and store data in a highly durable manner

• With PaaS, users are responsible for learning the best way to
  implement these services – they are not foolproof

28
Data Cleaning

29
Data cleaning: example problems
• Date format mismatch
• Correct text case
• Split first and last names
• Column mismatch/offset
• Remove records with missing values, or impute missing values based
  on logic
• Prepare two datasets for a merge with imperfect key matching
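If the raw data has already been loaded into a staging table, a few of
these fixes can be sketched in SQL. The table and column names
(RawCustomers, FullName, etc.) are assumptions, and the string
functions shown (SUBSTR, INSTR) are SQLite-style and vary by engine; in
practice this work is often done in Python/R or a data-prep tool before
loading, as discussed on the following slides.

-- Normalize text case
UPDATE RawCustomers
SET State = UPPER(State);

-- Split "Last, First" names into two columns
UPDATE RawCustomers
SET LastName  = TRIM(SUBSTR(FullName, 1, INSTR(FullName, ',') - 1)),
    FirstName = TRIM(SUBSTR(FullName, INSTR(FullName, ',') + 1));

-- Drop records with missing values
DELETE FROM RawCustomers
WHERE OrderDate IS NULL OR OrderTotal IS NULL;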

30
Filtering large datasets
• Suppose you receive a very large dataset from a third party which
  contains order data for their 2000 stores, and:
  - You only need stores which are in the Northeast
  - You need orders that have a total of over $20.00
  - You need orders which are processed after 2015 (see the query
    sketch below)
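A sketch of the three filters as a single query, assuming the data
sits in a table. The table and column names (Orders, StoreRegion,
OrderTotal, OrderDate) are assumptions for illustration.

SELECT *
FROM Orders
WHERE StoreRegion = 'Northeast'
  AND OrderTotal  > 20.00
  AND OrderDate  >= '2016-01-01';   -- "processed after 2015"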

31
Types of data cleaning solutions
• Off-the-shelf software for (or including) data preparation    (most user friendly)
  - Trifacta, Paxata, Alteryx, SAS

• Open-source programming languages
  - Python, R

• Unix command line tools                                        (least user friendly)
  - regex, grep, sed, awk
32
Off the shelf tools for data preparation
• Graphical user interface; no programming required, which enables
  collaboration with non-programmers
• Can join disparate data sources together
• Works on large data sets
• Reproducible steps and workflow
• Offers version control

• Software is not free

33
Open-source programming languages
• Working in data frames and data dictionaries requires some
  programming skills, but the languages are relatively Google-friendly
• A broad set of computational and scientific tools is available
• Reproducibility, version control, visual inspection and stepwise
  data processing

• Free

34
Unix command line tools
• Developed in the 1970s and still relevant today
• Not as user friendly as previous options, but very fast and
versatile
• Google-friendly for common applications
• Excellent for breaking up large datasets that would crash
other software

• Free
• Built-in to Unix-based systems

http://teaching.idallen.com/cst8207/13f/notes/data/awkgrepsedpwd.gif
35
grep, sed and awk with regex
• regex (regular expressions) – used to specify a text pattern

• grep (globally search a regular expression and print) – searches
  for text that matches a specified pattern

• sed (stream editor) – used to find and replace text that matches a
  specified pattern, and more

• awk – used to find a specified pattern and perform some action on
  it, and more; nice support for delimited data

36
Key points from lesson
• Data almost always needs to be cleaned or pre-processed before it
  can be inserted into a database

• Data cleaning can be performed with free software and tools;
  however, the learning curve for these can be steeper than that of
  commercial software

37
