Streamlined Data Ingestion With Pandas Chapter3

Introduction to databases
STREAMLINED DATA INGESTION WITH PANDAS

Amany Mahfouz
Instructor
Relational Databases
Data about entities is organized into tables
Each row or record is an instance of an entity

Each column has information about an attribute

Tables can be linked to each other via unique keys

Support more data, multiple simultaneous users, and data quality controls

Data types are specified for each column

SQL (Structured Query Language) to interact with databases

Common Relational Databases

SQLite databases are computer files

Connecting to Databases
Two-step process:
1. Create way to connect to database

2. Query database

Creating a Database Engine

sqlalchemy's create_engine() makes an engine to handle database connections


Needs string URL of database to connect to

SQLite URL format: sqlite:///filename.db

Querying Databases
pd.read_sql(query, engine) to load in data from a database

Arguments
query : String containing SQL query to run or table to load

engine : Connection/database engine object
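
The call described above can be sketched as follows. This is a minimal illustration, not the course's setup: it assumes an in-memory SQLite database with made-up rows instead of a file such as data.db, and it passes a standard-library sqlite3 connection, which pd.read_sql also accepts for SQLite, in place of a SQLAlchemy engine.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory database stands in for data.db,
# and the rows below are made-up illustrative values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER, tmin INTEGER)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [("2018-01-01", 19, 7), ("2018-01-02", 26, 13)],
)

# First argument: the SQL query string; second: the connection object
weather = pd.read_sql("SELECT * FROM weather", conn)
print(weather.shape)
```

Note that loading a table by bare name (rather than by query) relies on SQLAlchemy, so this sketch uses a full SELECT statement.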

SQL Review: SELECT
Used to query data from a database
Basic syntax:
SELECT [column_names] FROM [table_name];

To get all data in a table:


SELECT * FROM [table_name];

Code style: keywords in ALL CAPS, semicolon (;) to end a statement
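
As a rough illustration of the two SELECT forms above, the sketch below runs them with Python's built-in sqlite3 module. The in-memory database and the sample rows are assumptions made up for the example.

```python
import sqlite3

# Assumption: a tiny in-memory weather table with made-up values
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, tavg REAL, tmax INTEGER)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [("2018-01-01", 13.0, 19), ("2018-01-02", 19.5, 26)],
)

# Every column of every row
all_rows = conn.execute("SELECT * FROM weather;").fetchall()

# Only the named columns
some_cols = conn.execute("SELECT date, tmax FROM weather;").fetchall()

print(all_rows)
print(some_cols)
```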

Getting Data from a Database
# Load pandas and sqlalchemy's create_engine
import pandas as pd
from sqlalchemy import create_engine

# Create database engine to manage connections
engine = create_engine("sqlite:///data.db")

# Load entire weather table by table name
weather = pd.read_sql("weather", engine)

# Load entire weather table with SQL
weather = pd.read_sql("SELECT * FROM weather", engine)

print(weather.head())

       station                         name  latitude  ...  prcp  snow  tavg  tmax  tmin
0  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          52    42
1  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          48    39
2  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          48    42
3  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          51    40
4  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.75   0.0          61    50

[5 rows x 13 columns]

Refining imports with SQL queries
SELECTing Columns
SELECT [column_names] FROM [table_name];

Example:
SELECT date, tavg
FROM weather;

WHERE Clauses
Use a WHERE clause to selectively import records

SELECT [column_names]
FROM [table_name]
WHERE [condition];

Filtering by Numbers
Compare numbers with mathematical operators
=

> and >=

< and <=

<> (not equal to)

Example:
SELECT *
FROM weather
WHERE tmax > 32;
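
The comparison operators listed above can be tried out with a small sketch. It assumes an in-memory SQLite table with three made-up tmax values (30, 32, 35) and counts the rows each condition keeps.

```python
import sqlite3

# Assumption: made-up readings, chosen so each operator gives a different count
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("2018-01-01", 30), ("2018-01-02", 32), ("2018-01-03", 35)],
)

def count_where(condition):
    """Count the weather rows that satisfy a WHERE condition."""
    sql = f"SELECT COUNT(*) FROM weather WHERE {condition};"
    return conn.execute(sql).fetchone()[0]

above = count_where("tmax > 32")      # strictly greater than
at_least = count_where("tmax >= 32")  # greater than or equal to
not_equal = count_where("tmax <> 32") # not equal to

print(above, at_least, not_equal)
```

The helper builds SQL from a string only because the conditions are hard-coded here; it is not a pattern for user-supplied input.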

Filtering Text
Match exact strings with the = sign and the text to match
String matching is case-sensitive

Example:
/* Get records about incidents in Brooklyn */
SELECT *
FROM hpd311calls
WHERE borough = 'BROOKLYN';
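
To see the case-sensitivity point in practice, here is a small sketch against an in-memory SQLite table. The rows, including the mixed-case 'Brooklyn' entry, are made up for illustration, and the UPPER() workaround at the end is an addition that the slides do not cover.

```python
import sqlite3

# Assumption: made-up rows, one of them with mixed-case spelling
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?)",
    [("BROOKLYN",), ("Brooklyn",), ("BRONX",)],
)

def count_where(condition):
    """Count the calls that satisfy a WHERE condition."""
    sql = f"SELECT COUNT(*) FROM hpd311calls WHERE {condition};"
    return conn.execute(sql).fetchone()[0]

exact = count_where("borough = 'BROOKLYN'")            # matches one spelling only
lower = count_where("borough = 'brooklyn'")            # matches nothing
normalized = count_where("UPPER(borough) = 'BROOKLYN'")  # matches both spellings

print(exact, lower, normalized)
```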

SQL and pandas
# Load libraries
import pandas as pd
from sqlalchemy import create_engine
# Create database engine
engine = create_engine("sqlite:///data.db")
# Write query to get records from Brooklyn
query = """SELECT *
FROM hpd311calls
WHERE borough = 'BROOKLYN';"""
# Query the database
brooklyn_calls = pd.read_sql(query, engine)
print(brooklyn_calls.borough.unique())

['BROOKLYN']

Combining Conditions: AND
WHERE clauses with AND return records that meet all conditions

# Write query to get records about plumbing in the Bronx


and_query = """SELECT *
FROM hpd311calls
WHERE borough = 'BRONX'
AND complaint_type = 'PLUMBING';"""
# Get calls about plumbing issues in the Bronx
bx_plumbing_calls = pd.read_sql(and_query, engine)

# Check record count


print(bx_plumbing_calls.shape)

(2016, 8)

Combining Conditions: OR
WHERE clauses with OR return records that meet at least one condition

# Write query to get records about water leaks or plumbing


or_query = """SELECT *
FROM hpd311calls
WHERE complaint_type = 'WATER LEAK'
OR complaint_type = 'PLUMBING';"""
# Get calls that are about plumbing or water leaks
leaks_or_plumbing = pd.read_sql(or_query, engine)

# Check record count


print(leaks_or_plumbing.shape)

(10684, 8)
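
The difference between AND and OR is easiest to see on a handful of rows. The sketch below assumes a four-row in-memory table with made-up calls; the counts it produces are specific to that sample, not to the course dataset.

```python
import sqlite3

# Assumption: four made-up calls
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [
        ("BRONX", "PLUMBING"),
        ("BRONX", "WATER LEAK"),
        ("BROOKLYN", "PLUMBING"),
        ("QUEENS", "HEAT/HOT WATER"),
    ],
)

# AND: a row must satisfy both conditions
and_count = conn.execute(
    """SELECT COUNT(*)
       FROM hpd311calls
       WHERE borough = 'BRONX'
         AND complaint_type = 'PLUMBING';"""
).fetchone()[0]

# OR: a row needs to satisfy only one of the conditions
or_count = conn.execute(
    """SELECT COUNT(*)
       FROM hpd311calls
       WHERE complaint_type = 'WATER LEAK'
          OR complaint_type = 'PLUMBING';"""
).fetchone()[0]

print(and_count, or_count)
```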

More complex SQL queries
Getting DISTINCT Values
Get unique values for one or more columns with SELECT DISTINCT
Syntax:
SELECT DISTINCT [column_names] FROM [table_name];

Remove duplicate records:
SELECT DISTINCT * FROM [table_name];

/* Get unique street addresses and boroughs */


SELECT DISTINCT incident_address,
borough
FROM hpd311calls;
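
A short sketch of how DISTINCT treats one column versus a combination of columns, assuming an in-memory table with made-up addresses (one row is an exact duplicate):

```python
import sqlite3

# Assumption: made-up addresses; the first two rows are identical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (incident_address TEXT, borough TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [
        ("1 MAIN ST", "BRONX"),
        ("1 MAIN ST", "BRONX"),
        ("1 MAIN ST", "QUEENS"),
        ("2 OAK AVE", "BRONX"),
    ],
)

# Unique values of a single column
boroughs = conn.execute("SELECT DISTINCT borough FROM hpd311calls;").fetchall()

# Unique combinations of two columns
address_pairs = conn.execute(
    "SELECT DISTINCT incident_address, borough FROM hpd311calls;"
).fetchall()

print(len(boroughs), len(address_pairs))
```

With several columns, DISTINCT compares whole combinations, so '1 MAIN ST' appears twice in the result because it occurs in two different boroughs.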

Aggregate Functions
Query a database directly for descriptive statistics
Aggregate functions
SUM

AVG

MAX

MIN

COUNT

Aggregate Functions
SUM, AVG, MAX, MIN
Each takes a single column name
SELECT AVG(tmax) FROM weather;

COUNT
Get number of rows that meet query conditions
SELECT COUNT(*) FROM [table_name];

Get number of unique values in a column


SELECT COUNT(DISTINCT [column_names]) FROM [table_name];
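
The aggregate functions above can be sketched on a toy table. This assumes an in-memory SQLite database with four made-up tmax readings (30, 40, 50, 40), picked so the results are easy to verify by hand.

```python
import sqlite3

# Assumption: four made-up readings, one value repeated
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (tmax INTEGER)")
conn.executemany("INSERT INTO weather VALUES (?)", [(30,), (40,), (50,), (40,)])

# Each aggregate takes a single column and returns one summary value
total, average, highest, lowest = conn.execute(
    "SELECT SUM(tmax), AVG(tmax), MAX(tmax), MIN(tmax) FROM weather;"
).fetchone()

# COUNT(*) counts rows; COUNT(DISTINCT ...) counts unique values
row_count = conn.execute("SELECT COUNT(*) FROM weather;").fetchone()[0]
unique_count = conn.execute(
    "SELECT COUNT(DISTINCT tmax) FROM weather;"
).fetchone()[0]

print(total, average, highest, lowest, row_count, unique_count)
```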

GROUP BY
Aggregate functions calculate a single summary statistic by default
Summarize data by categories with GROUP BY statements

Remember to also select the column you're grouping by!

/* Get counts of plumbing calls by borough */


SELECT borough,
COUNT(*)
FROM hpd311calls
WHERE complaint_type = 'PLUMBING'
GROUP BY borough;

Counting by Groups
# Create database engine
engine = create_engine("sqlite:///data.db")

# Write query to get plumbing call counts by borough


query = """SELECT borough, COUNT(*)
FROM hpd311calls
WHERE complaint_type = 'PLUMBING'
GROUP BY borough;"""

# Query database and create dataframe


plumbing_call_counts = pd.read_sql(query, engine)

Counting by Groups
print(plumbing_call_counts)

borough COUNT(*)
0 BRONX 2016
1 BROOKLYN 2702
2 MANHATTAN 1413
3 QUEENS 808
4 STATEN ISLAND 178
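
The output above labels the count column COUNT(*), which is awkward to reference from pandas. One possible refinement, sketched here under assumptions, is to name the column with a SQL alias. The alias call_count, the in-memory database, and the sample rows are all hypothetical, and a sqlite3 connection stands in for the SQLAlchemy engine.

```python
import sqlite3

import pandas as pd

# Assumption: a few made-up calls in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [
        ("BRONX", "PLUMBING"),
        ("BRONX", "PLUMBING"),
        ("BROOKLYN", "PLUMBING"),
        ("BRONX", "HEAT/HOT WATER"),
    ],
)

# AS gives the aggregate a column name of our choosing (hypothetical alias)
query = """SELECT borough, COUNT(*) AS call_count
           FROM hpd311calls
           WHERE complaint_type = 'PLUMBING'
           GROUP BY borough;"""

plumbing_call_counts = pd.read_sql(query, conn)
print(plumbing_call_counts)
```

The aliased column can then be accessed as plumbing_call_counts.call_count or plumbing_call_counts["call_count"].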

Loading multiple tables with joins
Keys
Database records have unique identifiers, or keys

Joining Tables
SELECT *
FROM hpd311calls
JOIN weather
ON hpd311calls.created_date = weather.date;

Use dot notation (table.column) when working with multiple tables

Default join only returns records whose key values appear in both tables

Make sure join keys are the same data type or nothing will match
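
The default join behavior described above can be checked on a small example. This sketch assumes two in-memory tables with made-up rows, where dates are stored as text in both tables so the keys are directly comparable. One call date has no weather record, and one weather date has no calls.

```python
import sqlite3

# Assumption: made-up rows; dates stored as TEXT in both tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (created_date TEXT, complaint_type TEXT)")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [
        ("2018-01-01", "PLUMBING"),
        ("2018-01-01", "HEAT/HOT WATER"),
        ("2018-01-03", "PLUMBING"),  # no weather record for this date
    ],
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("2018-01-01", 19), ("2018-01-02", 26)],  # no calls on 2018-01-02
)

# Dot notation makes clear which table each column comes from
joined = conn.execute(
    """SELECT hpd311calls.created_date,
              hpd311calls.complaint_type,
              weather.tmax
       FROM hpd311calls
       JOIN weather
         ON hpd311calls.created_date = weather.date;"""
).fetchall()

print(joined)
```

Only the two calls from 2018-01-01 survive the join, since that is the only key value present in both tables.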

Joining and Filtering
/* Get only heat/hot water calls and join in weather data */
SELECT *
FROM hpd311calls
JOIN weather
ON hpd311calls.created_date = weather.date
WHERE hpd311calls.complaint_type = 'HEAT/HOT WATER';

Joining and Aggregating
/* Get call counts by borough */
SELECT hpd311calls.borough,
COUNT(*)
FROM hpd311calls
GROUP BY hpd311calls.borough;

Joining and Aggregating
/* Get call counts by borough
and join in population and housing counts */
SELECT hpd311calls.borough,
COUNT(*),
boro_census.total_population,
boro_census.housing_units
FROM hpd311calls
JOIN boro_census
ON hpd311calls.borough = boro_census.borough
GROUP BY hpd311calls.borough;

query = """SELECT hpd311calls.borough,
COUNT(*),
boro_census.total_population,
boro_census.housing_units
FROM hpd311calls
JOIN boro_census
ON hpd311calls.borough = boro_census.borough
GROUP BY hpd311calls.borough;"""

call_counts = pd.read_sql(query, engine)


print(call_counts)

         borough  COUNT(*)  total_population  housing_units
0          BRONX     29874           1455846         524488
1       BROOKLYN     31722           2635121        1028383
2      MANHATTAN     20196           1653877         872645
3         QUEENS     11384           2339280         850422
4  STATEN ISLAND      1322            475948         179179

Review
SQL order of keywords
SELECT

FROM

JOIN

WHERE

GROUP BY
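
A single query that uses all five keywords in the order listed above might look like the following sketch. It assumes an in-memory database with a few made-up calls, reuses the Bronx and Queens population figures shown earlier, and introduces a hypothetical alias, plumbing_calls. A sqlite3 connection stands in for the SQLAlchemy engine.

```python
import sqlite3

import pandas as pd

# Assumption: made-up calls; population figures reused from the output above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT, complaint_type TEXT)")
conn.execute("CREATE TABLE boro_census (borough TEXT, total_population INTEGER)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [
        ("BRONX", "PLUMBING"),
        ("BRONX", "PLUMBING"),
        ("BRONX", "HEAT/HOT WATER"),
        ("QUEENS", "PLUMBING"),
    ],
)
conn.executemany(
    "INSERT INTO boro_census VALUES (?, ?)",
    [("BRONX", 1455846), ("QUEENS", 2339280)],
)

# Keywords in order: SELECT, FROM, JOIN, WHERE, GROUP BY
query = """SELECT hpd311calls.borough,
                  COUNT(*) AS plumbing_calls,
                  boro_census.total_population
           FROM hpd311calls
           JOIN boro_census
             ON hpd311calls.borough = boro_census.borough
           WHERE hpd311calls.complaint_type = 'PLUMBING'
           GROUP BY hpd311calls.borough;"""

result = pd.read_sql(query, conn)
print(result)
```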
