2.1 Combining Data Frames

The document discusses data transformation and provides examples of filtering and joining data frames in Pandas and SQL. It demonstrates how to clean data, select rows that meet certain conditions, and join multiple data frames to combine data. The overall goal is to clean and link different data sources to answer business questions.
Data Transformation

Supply Chain Analytics

Instructor: Shagufta Henna


Lab (5pm)
• Task-1: Using Pandas, filter the rows of the "exec_df" dataframe
whose birth date ("born") falls in the range "1950-01-01" to
"1960-01-01". Store the result, including the executive names,
in a new dataframe and print it.
• Task-2: Using SQL, filter the rows of the "exec_df" dataframe
whose birth date ("born") falls in the range "1950-01-01" to
"1960-01-01". Print the returned results.
We Have Structured Data – Are We There Yet?

company_data_df company_ceos_df execs_df

We have lots of data, but fragmented!


How do we:
(1) Make sure it’s encoded the way we want
(2) Stitch the data together?
We Now Know How to Bring Data into Pandas

Populating our Big Data Toolbox:

Basic operations (building blocks) to clean and link tables

As we build towards a solution for our company-CEO question


The Main Programming Models We’ll Learn

1. Pandas dataframes – direct control, not persistent


2. SQL – automatically optimized, persistent

3. Soon: Apache Spark, which combines all of the above!

Let’s start by reviewing a few things we need to do…


company_ceos_df:
   Company             Executive    Title                           Since    Notes                                             Updated
0  Accenture           Julie Sweet  CEO[1]                          2019     Succeeded Pierre Nanterme, Passed Away            2019-01-31
1  Aditya Birla Group  Kumar Birla  Chairman[2]                     1995[2]  Part of the Birla family business house in India  2018-10-01
2  Being Short         Meghan       Chairman, president and CEO[3]  2007     Formerly with Apple Inc.                          2018-10-01

execs_df: need to clean name; need to filter missing data
Road Map

• Column-wise operations
• Filtering rows
• Joining or combining tables
Columnwise Operations
Data Cleaning

• Our first task: “clean” the contents of the name column in exec_df, which has underscores instead of spaces
• To do this: we need to start with projection…


Projecting from a Dataframe
Projection in Pandas has two forms:

# Double brackets: return a dataframe
exec_df[['name', 'born']]

   name              born
0  Julie_Sweet       NaT
1  Kumar_Birla       1967-06-14
2  Shantanu_Narayen  NaT

# Single brackets: one column as a Series
exec_df['name']

0  Julie_Sweet
1  Kumar_Birla
2  Shantanu_Narayen
3  Garo_H._Armen
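The difference between the two bracket forms can be checked directly. The snippet below is a minimal sketch on a toy stand-in for exec_df (three rows copied from the output above; the real dataframe is larger), and the variable names `subset` and `names` are only illustrative.

```python
import pandas as pd

# Toy stand-in for exec_df, using values shown on the slide
exec_df = pd.DataFrame({
    "name": ["Julie_Sweet", "Kumar_Birla", "Shantanu_Narayen"],
    "born": pd.to_datetime([None, "1967-06-14", None]),
})

subset = exec_df[["name", "born"]]  # double brackets: a DataFrame
names = exec_df["name"]             # single brackets: a Series

print(type(subset).__name__)
print(type(names).__name__)
```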
Computing Over a Series with a Function

It’s best NOT to iterate over the elements and modify them
• Instead: call apply with a function!

def to_space(x):
    return x.replace('_', ' ')

# *apply* to each element, returning a new Series
exec_df['name'].apply(to_space)

0  Julie Sweet
1  Kumar Birla
2  Shantanu Narayen
Computing Over a Series with a Lambda Function

It’s best NOT to iterate over the elements and modify them
• Instead: call apply with a function!

0 Julie Sweet
1 Kumar Birla
2 Shantanu Narayen
# *apply* to each element, returning a new Series
exec_df['name'].apply(lambda x: x.replace('_', ' '))
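As a quick check that the named function and the lambda are interchangeable, here is a small sketch on a toy Series (the three names come from the slides; `by_function` and `by_lambda` are illustrative names). It also shows that apply returns a new Series and leaves the original untouched.

```python
import pandas as pd

names = pd.Series(["Julie_Sweet", "Kumar_Birla", "Shantanu_Narayen"])

def to_space(x):
    # Replace underscores with spaces in a single string
    return x.replace("_", " ")

by_function = names.apply(to_space)
by_lambda = names.apply(lambda x: x.replace("_", " "))

print(by_function.equals(by_lambda))  # both forms give the same Series
print(names[0])                       # the original Series is not modified
```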
Inserting the Results Back into the Dataframe
# Let's clean the name!
exec_df['clean_name'] = exec_df['name'].apply(lambda x: x.replace('_', ' '))

   name              page                                            born        clean_name
0  Julie_Sweet       https://en.wikipedia.org/wiki/Julie_Sweet       NaT         Julie Sweet
1  Kumar_Birla       https://en.wikipedia.org/wiki/Kumar_Birla       1967-06-14  Kumar Birla
2  Shantanu_Narayen  https://en.wikipedia.org/wiki/Shantanu_Narayen  NaT         Shantanu Narayen
Can I Change Column Names?
exec_df.rename(columns={'name': 'old_name'})
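One detail worth checking when renaming: by default, rename returns a new dataframe rather than changing the existing one. A minimal sketch on toy data (one row, illustrative only):

```python
import pandas as pd

exec_df = pd.DataFrame({
    "name": ["Julie_Sweet"],
    "clean_name": ["Julie Sweet"],
})

# rename returns a new dataframe; exec_df keeps its original column names
renamed_df = exec_df.rename(columns={"name": "old_name"})

print(list(renamed_df.columns))
print(list(exec_df.columns))
```

To keep the new names, assign the result back, for example `exec_df = exec_df.rename(columns={'name': 'old_name'})`.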
Another Variation of the Same Idea

• Sometimes our dataframe is in a database on disk – using SQL we can fetch just the portions we want and do our computations…

select col1, col2 AS new_name1, expr1 AS new_name2
from table
What if the Data is in a Database?
Dataframes must fit in memory… but what if the data is too large, or stored in a database?

# conn is an open database connection
exec_df.to_sql('exec', conn, if_exists="replace")

pd.read_sql_query("select name, replace(name, '_', ' ') as clean_name from exec", conn)

   name         clean_name
0  Julie_Sweet  Julie Sweet
1  Kumar_Birla  Kumar Birla
2  Meghan       Meghan
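The slides do not show how conn is created. The sketch below assumes an in-memory SQLite database opened with Python's sqlite3 module, which is one option among several, and uses a three-row toy exec_df so that the whole round trip runs on its own.

```python
import sqlite3
import pandas as pd

# Assumption: an in-memory SQLite database stands in for the course database
conn = sqlite3.connect(":memory:")

exec_df = pd.DataFrame({"name": ["Julie_Sweet", "Kumar_Birla", "Meghan"]})

# Write the dataframe to a table called exec (index=False skips the row index)
exec_df.to_sql("exec", conn, if_exists="replace", index=False)

# Let the database compute the cleaned column and return a dataframe
clean_df = pd.read_sql_query(
    "select name, replace(name, '_', ' ') as clean_name from exec", conn
)
print(clean_df)

conn.close()
```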
The Story So Far
• We’ve seen how to extract single columns and subsets of columns via
projection
• The apply operation allows us to do computation over a field or the
contents of a row
• We can assign the results back to columns in a dataframe

What about filtering “bad” rows?


Filtering Rows
Selecting Items

We treat rows very differently from columns:

Columns are properties – a name, a date, etc.
    dataframe[['col1', 'col2']]

A row represents a sample or instance
    dataframe rows with particular values
Selecting a Row by a Name

Suppose we want to find the URL for clean_name = Julie Sweet…

In Pandas, we need to do this by first defining a series of True / False (Boolean) values
Selecting Only Some of the Rows

Filter “mask”: each row is given a True or False; we only return those with True
… And the Page URL
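The code for these selection slides did not survive extraction, so the following is a reconstruction under assumptions rather than the original slide code: a toy exec_df with clean_name and page columns, and illustrative variable names `mask`, `matches`, and `page_urls`.

```python
import pandas as pd

exec_df = pd.DataFrame({
    "clean_name": ["Julie Sweet", "Kumar Birla", "Shantanu Narayen"],
    "page": [
        "https://en.wikipedia.org/wiki/Julie_Sweet",
        "https://en.wikipedia.org/wiki/Kumar_Birla",
        "https://en.wikipedia.org/wiki/Shantanu_Narayen",
    ],
})

# Step 1: build the Boolean mask, one True / False per row
mask = exec_df["clean_name"] == "Julie Sweet"

# Step 2: index with the mask to keep only the rows marked True
matches = exec_df[mask]

# Step 3: project the page column from the selected rows
page_urls = exec_df.loc[mask, "page"]
print(page_urls.tolist())
```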
Selection in SQL…
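In SQL, selection is expressed with a where clause. The sketch below assumes an in-memory SQLite database and a toy exec table; the table and column names follow the earlier slides, but the query itself is illustrative and not taken from the deck.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

exec_df = pd.DataFrame({
    "clean_name": ["Julie Sweet", "Kumar Birla"],
    "page": [
        "https://en.wikipedia.org/wiki/Julie_Sweet",
        "https://en.wikipedia.org/wiki/Kumar_Birla",
    ],
})
exec_df.to_sql("exec", conn, if_exists="replace", index=False)

# where keeps only the rows that satisfy the condition
selected_df = pd.read_sql_query(
    "select clean_name, page from exec where clean_name = 'Julie Sweet'", conn
)
print(selected_df)

conn.close()
```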
Can We Filter People with Missing Birthdays?
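One way to do this in Pandas is to build the mask with notna() (or isna() for the complement). This is a sketch on toy data with the born values shown earlier, not necessarily the approach used on the original slide.

```python
import pandas as pd

exec_df = pd.DataFrame({
    "name": ["Julie_Sweet", "Kumar_Birla", "Shantanu_Narayen"],
    "born": pd.to_datetime([None, "1967-06-14", None]),
})

# Keep only the rows where born is present (not NaT)
has_birthday_df = exec_df[exec_df["born"].notna()]

# The complement: rows with a missing birthday
missing_birthday_df = exec_df[exec_df["born"].isna()]

print(has_birthday_df)
```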
Now We Have a Basic Set of Operations
Over (Single) Tables
• Projection – pull out a “slice” of the table on certain columns
• Selection – pull out rows matching conditions
• apply() to compute over columns
• In-place updates for Dataframes
• Basic SQL: select columns, expressions from table where conditions

• Next: let’s look at combining tables!


Joining Tables
Can We Put the Data Together?

company_ceos_df execs_df

• We now want to “connect” the data – we can do this via a merge, aka a join, which matches rows with the same value
The Join (Merge) Operation
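The slide's figure is missing here, so the following sketch illustrates the default (inner) merge on toy versions of the two dataframes. Column names follow the slides; the rows are a small subset chosen for illustration.

```python
import pandas as pd

company_ceos_df = pd.DataFrame({
    "Executive": ["Julie Sweet", "Kumar Birla", "Meghan"],
    "Company": ["Accenture", "Aditya Birla Group", "Being Short"],
})
exec_df = pd.DataFrame({
    "clean_name": ["Julie Sweet", "Kumar Birla", "Shantanu Narayen"],
    "born": pd.to_datetime([None, "1967-06-14", None]),
})

# Default how="inner": keep only rows whose keys match on both sides
joined_df = company_ceos_df.merge(
    exec_df, left_on="Executive", right_on="clean_name"
)
print(joined_df[["Executive", "Company", "born"]])
```

In this toy data, Meghan and Shantanu Narayen have no partner on the other side, so they drop out of the result.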
In SQL
company_ceos_df.to_sql('company_ceos', conn, if_exists="replace")
exec_df.to_sql('executives', conn, if_exists="replace")

pd.read_sql_query('select Executive, Company, born from company_ceos '
                  'join executives on Executive=clean_name', conn)
An Issue!
company_ceos_df[['Executive', 'Company']]
exec_df[['clean_name', 'born']]

Join Output
Let’s Try the Join Again As a Left Outerjoin with an Indicator

company_ceos_df[['Executive', 'Company']].merge(exec_df[['clean_name', 'born']],
    left_on=['Executive'], right_on=['clean_name'],
    how="left", indicator=True)
More Generally
result_df = company_ceos_df[['Executive', 'Company']].merge(
    exec_df[['clean_name', 'born']],
    left_on=['Executive'],
    right_on=['clean_name'], how="outer", indicator=True)

# Keep only the rows that matched on one side but not the other
result_df[result_df['_merge'] != 'both']
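To see what the indicator column reports, here is a self-contained sketch of the same outer merge on toy data (three rows per side, chosen so that one row is unmatched on each side). The variable name `unmatched_df` is illustrative.

```python
import pandas as pd

company_ceos_df = pd.DataFrame({
    "Executive": ["Julie Sweet", "Kumar Birla", "Meghan"],
    "Company": ["Accenture", "Aditya Birla Group", "Being Short"],
})
exec_df = pd.DataFrame({
    "clean_name": ["Julie Sweet", "Kumar Birla", "Shantanu Narayen"],
    "born": pd.to_datetime([None, "1967-06-14", None]),
})

# Outer join: keep every row from both sides, matched or not
result_df = company_ceos_df[["Executive", "Company"]].merge(
    exec_df[["clean_name", "born"]],
    left_on=["Executive"],
    right_on=["clean_name"],
    how="outer",
    indicator=True,
)

# _merge is 'both', 'left_only', or 'right_only' for each row
unmatched_df = result_df[result_df["_merge"] != "both"]
print(unmatched_df)
```

Rows marked left_only have a company entry but no executive record, and rows marked right_only are the reverse, which makes the indicator a convenient way to find names that failed to match.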
Exact-Match (Inner) Joins vs Outerjoins

Joins allow us to match on sub-fields
• By default, and in Pandas, they are on equality only

Outerjoin will include “partial” rows when one side (e.g., the left)
doesn’t have a match on the other side (e.g., the right)
Composition! Joining a Joined Result

name is ambiguous, so we need to give a table variable to company_data

note we only have 10 matches!!! outerjoin?
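The query for this slide was lost in extraction. The sketch below shows the general shape of a composed join with table variables (aliases), under several assumptions: an in-memory SQLite database, a hypothetical company_data table whose only column known from the slide is name, a made-up sector column with placeholder values, and aliases cc, e, and cd chosen for illustration.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

company_ceos_df = pd.DataFrame({
    "Executive": ["Julie Sweet", "Kumar Birla", "Meghan"],
    "Company": ["Accenture", "Aditya Birla Group", "Being Short"],
})
exec_df = pd.DataFrame({
    "name": ["Julie_Sweet", "Kumar_Birla"],
    "clean_name": ["Julie Sweet", "Kumar Birla"],
})
# Hypothetical table: "sector" and its values are placeholders
company_data_df = pd.DataFrame({
    "name": ["Accenture", "Aditya Birla Group"],
    "sector": ["sector A", "sector B"],
})

company_ceos_df.to_sql("company_ceos", conn, if_exists="replace", index=False)
exec_df.to_sql("executives", conn, if_exists="replace", index=False)
company_data_df.to_sql("company_data", conn, if_exists="replace", index=False)

# Both executives and company_data have a column called name,
# so each reference is qualified with a table variable (e or cd)
composed_df = pd.read_sql_query(
    "select cd.name as company, cc.Executive, e.name as exec_name, cd.sector "
    "from company_ceos cc "
    "join executives e on cc.Executive = e.clean_name "
    "join company_data cd on cc.Company = cd.name "
    "order by cd.name",
    conn,
)
print(composed_df)

conn.close()
```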
Recap: Joins

• In Pandas: merge combines rows from two tables if they exactly match on column values
• In SQL we can specify a more general condition for the join
• Outerjoins are the same as (inner) joins EXCEPT when there’s no match for a tuple
• We can compose joins to link multiple tables
Next Time
How do we address two open issues:

Detecting and cleaning errors in the data
