SC4x W3L1 TopicsInDatabases v2
Indexes and Performance
Indexes
• When database tables get large, performance can suffer: finding or updating a record becomes slow or costly
Index example
• Customers table, where CustomerID is primary key
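A minimal sketch of the idea in Python's built-in sqlite3 module, using a hypothetical Customers table (the table and column names are assumptions). The primary key is indexed automatically; a query on a non-key column such as LastName scans the whole table until an index is added:

```python
import sqlite3

# Hypothetical Customers table; CustomerID is the primary key,
# so SQLite indexes it automatically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, LastName TEXT)")
conn.executemany("INSERT INTO Customers VALUES (?, ?)",
                 [(i, f"Name{i}") for i in range(1000)])

# Without an index on LastName: the query plan shows a full table SCAN
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Customers WHERE LastName = 'Name500'"
).fetchall()
print(plan)  # plan detail mentions a SCAN of Customers

# After adding an index, the planner switches to an index SEARCH
conn.execute("CREATE INDEX idx_lastname ON Customers (LastName)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Customers WHERE LastName = 'Name500'"
).fetchall()
print(plan)  # plan detail mentions SEARCH ... USING INDEX idx_lastname
```

`EXPLAIN QUERY PLAN` makes the trade-off visible without timing anything: the same query goes from scanning every row to a direct index lookup.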
Optimization using indexes
• Search engines are optimized using 'text retrieval engines'
  – Large text corpora can be indexed, such that every keyword that might be searched will have an index
  – These index entries can then be optimized based on the number of occurrences or the rank of a keyword among all possible matches for that keyword
  – The frequency or importance of links to a web site, the usage, and other factors may be used to optimize these indexes
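The keyword-to-index idea above can be sketched as an inverted index in a few lines of Python. The documents and counts here are made up; real engines add ranking signals such as link importance on top of this structure:

```python
from collections import defaultdict

# Minimal inverted index: map each keyword to the documents that
# contain it, with occurrence counts so results can be ranked by
# term frequency. Document texts are illustrative.
docs = {
    1: "fast shipping and fast delivery",
    2: "delivery delayed by carrier",
    3: "carrier picked up the shipping label",
}

index = defaultdict(dict)              # keyword -> {doc_id: count}
for doc_id, text in docs.items():
    for word in text.split():
        index[word][doc_id] = index[word].get(doc_id, 0) + 1

def search(keyword):
    # Rank matching documents by how often the keyword occurs
    hits = index.get(keyword, {})
    return sorted(hits, key=hits.get, reverse=True)

print(search("delivery"))   # documents 1 and 2 contain "delivery"
print(search("fast"))       # [1] -- "fast" occurs twice in document 1
```

A search never touches documents that lack the keyword, which is why indexed lookup stays fast as the corpus grows.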
Key points from lesson
• Indexes are used to make searching a database faster
Databases and data warehouses
Databases and data warehouses
• The best practices and techniques discussed so far make sense for databases and Online Transaction Processing (OLTP); data warehouses instead support Online Analytical Processing (OLAP)
• Each has pros and cons, and each is ideal for different purposes
Use cases
OLTP                                                                   | OLAP
Manage real-time business operations                                   | Perform analytics and reporting
Supports implementation of business tasks                              | Supports data-driven decision making
e.g. Transactions in an online store                                   | e.g. Data mining and machine learning
e.g. Dashboard showing health of business over the last few days/hours | e.g. Forecasting based on historic data
Concurrency is paramount                                               | Concurrency may not be important
Key differences
OLTP                                                              | OLAP
Optimized for writes (updates and inserts)                        | Optimized for reads
Normalized using normal forms, few duplicates                     | Denormalized, many duplicates
Few indexes, many tables                                          | Many indexes, fewer tables
Optimized for simple queries                                      | Used for complex queries
Uncommon to store metrics which can be derived from existing data | Common to store derived metrics
Overall performance: designed to be very fast for small writes and simple reads | Overall performance: designed to be very fast for reads and large writes, however relatively slower because data tends to be much larger
Differences in normalization
• Star schema is common in OLAP
  – E.g.: fact table is a historical record of shipments to customers, including price and locations:
    · Date dimension: full date, year, quarter, month, week
    · Carrier dimension: carrier, type of freight, address, location
    · Customer dimension: name, address, income bracket, percentiles based on demographic data from census
https://commons.wikimedia.org/wiki/File:Star-schema.png
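The shipment example above can be sketched as SQLite tables; all table and column names here are illustrative assumptions, and the single fact row is made up:

```python
import sqlite3

# Star schema sketch: one fact table of shipments referencing
# date, carrier, and customer dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT,
                           year INTEGER, quarter INTEGER, month INTEGER, week INTEGER);
CREATE TABLE dim_carrier  (carrier_key INTEGER PRIMARY KEY, carrier TEXT,
                           freight_type TEXT, location TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT,
                           address TEXT, income_bracket TEXT);
CREATE TABLE fact_shipments (date_key INTEGER, carrier_key INTEGER,
                             customer_key INTEGER, price REAL);

INSERT INTO dim_date     VALUES (1, '2017-03-01', 2017, 1, 3, 9);
INSERT INTO dim_carrier  VALUES (1, 'FastFreight', 'LTL', 'Boston');
INSERT INTO dim_customer VALUES (1, 'Acme Corp', '1 Main St', 'high');
INSERT INTO fact_shipments VALUES (1, 1, 1, 125.50);
""")

# A typical OLAP query: total shipment price by year and carrier
row = conn.execute("""
    SELECT d.year, c.carrier, SUM(f.price)
    FROM fact_shipments f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_carrier c ON f.carrier_key = c.carrier_key
    GROUP BY d.year, c.carrier
""").fetchone()
print(row)  # (2017, 'FastFreight', 125.5)
```

Every analytical query joins the central fact table to small dimension tables, which is the shape the star schema is named for.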
Key points from lesson
• Data storage in a production database tends to be optimized
for business operation tasks
NoSQL
NoSQL
• Joining two large tables together in a normalized relational
database can be slow
• Simple reads and writes are very fast with NoSQL solutions
https://neo4j.com/developer/graph-database/
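A sketch of why simple reads and writes are fast in a key-value style NoSQL store: each record lives under a single key, so a lookup is one hash access with no joins. A plain dict stands in here for a real store such as Redis or DynamoDB, and all data is made up:

```python
# In-memory stand-in for a key-value document store
store = {}

def put(key, document):
    store[key] = document          # O(1) write, no normalization step

def get(key):
    return store.get(key)          # O(1) read, no joins

# The whole customer, including nested orders, is one document --
# data that a normalized relational design would split across tables
# and reassemble with joins at query time.
put("customer:42", {
    "name": "Acme Corp",
    "orders": [{"id": 1, "total": 99.0}, {"id": 2, "total": 15.5}],
})

print(get("customer:42")["orders"][0]["total"])  # 99.0
```

The trade-off mirrors the OLTP/OLAP comparison: duplicating data inside documents buys fast access at the cost of the consistency guarantees normalization provides.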
Major programming and database stacks
https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAfQAAAAJDlmNDIxZTVmLTQ1MzQtNDc0Mi00YTI2LWU3ZDg3ZTRjYWU4NQ.png
Key points from lesson
• NoSQL may be faster in some use cases; design the data model around the application
Cloud services
Cloud computing
• Infrastructure as a service (IaaS): outsource hardware
  – User provides operating system, database, app, etc.
  – Most flexible, higher upfront total IT cost
  – Example: Rackspace
Example: AWS data analytics pipeline
https://aws.amazon.com/data-warehouse/
Key points from lesson
• PaaS removes the high start-up cost of owning hardware, and the IT cost of maintaining hardware
• With PaaS, users are responsible for learning the best way to implement these services – they are not foolproof
Data Cleaning
Data cleaning: example problems
• Date format mismatch
• Correct text case
• Split first and last names
• Column mismatch/offset
• Remove records with missing values or impute missing values
based on logic
• Prepare two datasets for a merge with imperfect key
matching
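A few of the cleaning steps above, sketched on a single made-up record (the field names and formats are assumptions): normalizing a date format, fixing text case, and splitting first and last names.

```python
from datetime import datetime

# Illustrative messy record: wrong case, ambiguous date format,
# first and last name in one field.
raw = {"name": "jANE doe", "order_date": "03/15/2017"}

# Date format mismatch: parse MM/DD/YYYY and emit ISO 8601
clean_date = datetime.strptime(raw["order_date"], "%m/%d/%Y").strftime("%Y-%m-%d")

# Correct text case, then split first and last names
first, last = raw["name"].title().split(" ", 1)

record = {"first_name": first, "last_name": last, "order_date": clean_date}
print(record)  # {'first_name': 'Jane', 'last_name': 'Doe', 'order_date': '2017-03-15'}
```

Real cleaning jobs apply steps like these across millions of rows, which is where the tool choices discussed next start to matter.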
Filtering large datasets
• Suppose you receive a very large dataset from a third party which contains order data for their 2000 stores, and:
  – You only need stores which are in the Northeast
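One way to handle this without loading the whole file into memory is to stream it row by row with Python's csv module. The file paths, the `state` column, and the Northeast state list are all assumptions for the sketch:

```python
import csv

# Assumed definition of "Northeast" for this sketch
NORTHEAST = {"MA", "NH", "VT", "ME", "RI", "CT", "NY", "NJ", "PA"}

def filter_northeast(in_path, out_path):
    """Copy only Northeast rows from a large CSV, one row at a time."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:                 # only one row in memory at a time
            if row["state"] in NORTHEAST:
                writer.writerow(row)
```

Because it never holds more than one row, this approach works on files far larger than RAM, the same property that makes the Unix tools discussed below attractive.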
Types of data cleaning solutions
• Off-the-shelf software for (or including) data preparation
  – Examples: Trifacta, Paxata, Alteryx, SAS – among the most user-friendly options
Off the shelf tools for data preparation
• Graphical user interfaces: no programming required, enabling collaboration with non-programmers
• Can join disparate data sources together
• Works on large data sets
• Reproducible steps and workflow
• Offers version control
Open-source programming languages
• Working in data frames and data dictionaries requires some programming skills, but the languages are relatively Google-friendly
• A broad set of computational and scientific tools is available
• Reproducibility, version control, visual inspection and step-wise data processing
• Free
Unix command line tools
• Developed in the 1970s and still relevant today
• Not as user friendly as previous options, but very fast and
versatile
• Google-friendly for common applications
• Excellent for breaking up large datasets that would crash
other software
• Free
• Built into Unix-based systems
http://teaching.idallen.com/cst8207/13f/notes/data/awkgrepsedpwd.gif
grep, sed and awk with regex
• regex (regular expressions) – used to specify a text pattern
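The same regex syntax used by grep, sed and awk is available in Python's re module. A sketch, using a pattern that pulls US-style ZIP codes out of made-up address lines:

```python
import re

# Illustrative input lines
lines = [
    "Acme Corp, Cambridge MA 02139",
    "no zip on this line",
    "Beta LLC, Boston MA 02110-1301",
]

# Pattern: exactly 5 digits, with an optional -NNNN extension,
# bounded so it doesn't match inside longer digit runs
zip_pattern = re.compile(r"\b\d{5}(?:-\d{4})?\b")

for line in lines:
    match = zip_pattern.search(line)
    if match:
        print(match.group())  # 02139, then 02110-1301
```

The equivalent filter in grep would use the same pattern; regex knowledge transfers directly between the command-line tools and the programming-language option above.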
Key points from lesson
• Data almost always needs to be cleaned or pre-processed
before it can be inserted into a database