Lecture3 Pandas and Scraping

The document outlines Lecture #3 of CS109A, focusing on Exploratory Data Analysis (EDA) using Python's Pandas library and web scraping techniques. It includes announcements about class sections, homework, and resources, as well as a structured approach to EDA, emphasizing the importance of data quality. Key topics covered include data storage, cleaning, visualization, and common functions in Pandas for data manipulation and analysis.

Credit: Toronto Zoo

CS109A, PROTOPAPAS, RADER, TANNER 1


Lecture #3: Getting our hands dirty: pandas and web scraping
CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader, and Chris Tanner

ANNOUNCEMENTS
• Standard Sections:
• Fridays (start 9/13) @ 10:30am (1 Story St Room 306)
• Mondays (start 9/16) @ 4:30pm (Science Center 110)
• Advanced Sections (A-Sections):
• Wednesday (start 9/18) @ 4:30pm (TBD)
• Homework 0 isn’t graded for accuracy; however, Homework 1 is, and it’ll be released today @ 3pm.
• Inclusion & Diversity Statements and Academic Honesty
documents are now on syllabus. Read them!
ANNOUNCEMENTS
• Ed is where the discussions and quizzes reside
• Quizzes are under the ‘Sway’ tab
• If you can’t connect to Ed, try logging out of Canvas, then
back into Canvas
• We are looking to change our lecture room, due to
current space limitations.

ANNOUNCEMENTS
• Access GitHub for all content (“git clone” and “git pull” are your friends)

BACKGROUND

Background

So far, we’ve learned:

Lecture 1          What is Data Science?
Lectures 1 & 2     The Data Science Process
Lecture 2          Data: types, formats, issues, etc.
Lecture 2          Visualization (briefly)
This lecture       How to quickly prepare data and scrape the web
Future lectures    How to model data

Background
The Data Science Process:
Ask an interesting question

Get the Data

Explore the Data

Model the Data

Communicate/Visualize the Results

Background
The Data Science Process:
Ask an interesting question

Get the Data (this lecture)

Explore the Data (this lecture)

Model the Data

Communicate/Visualize the Results

Lecture Outline

• Exploratory Data Analysis (EDA):
• Without Pandas (part 1) – These slides
• With Pandas (part 2) – Mostly Jupyter Notebook
• Data concerns (part 3) – These slides
• Web Scraping with Beautiful Soup (part 4) – Mix

Exploratory Data Analysis (EDA)
Why?
• EDA encompasses the “explore data” part of the data
science process
• EDA is crucial but often overlooked:
• If your data is bad, your results will be bad
• Conversely, understanding your data well can help you
create smart, appropriate models

Exploratory Data Analysis (EDA)
What?
1. Store data in data structure(s) that will be convenient for
exploring/processing
(Memory is fast. Storage is slow)
2. Clean/format the data so that:
– Each row represents a single object/observation/entry
– Each column represents an attribute/property/feature of that
entry
– Values are numeric whenever possible
– Columns contain atomic properties that cannot be further decomposed*

* Unlike food waste, which can be composted. Please consider composting food scraps.
Exploratory Data Analysis (EDA)
What? (continued)
3. Explore global properties: use histograms, scatter plots, and
aggregation functions to summarize the data
4. Explore group properties: group like-items together to compare
subsets of the data (are the comparison results
reasonable/expected?)

This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up on in subsequent analysis.

EDA: without Pandas

Say we have a small dataset of the top 50 most-streamed Spotify songs, globally, for 2019.

NOTE: The following music data are used purely for illustrative, educational purposes. The data, including song titles, may include explicit language. Harvard, including myself and the rest of the CS109 staff, does not endorse any of the contents or the songs themselves, and we apologize if it is offensive to anyone in any way.

EDA: without Pandas
top50.csv
Each row represents a distinct song. The columns are:
• ID: a unique ID (i.e., 1-50)
• TrackName: Name of the Track
• ArtistName: Name of the Artist
• Genre: the genre of the track
• BeatsPerMinute: The tempo of the song.
• Energy: The energy of a song - the higher the value, the more energetic.
• Danceability: The higher the value, the easier it is to dance to this song.
• Loudness: The higher the value, the louder the song.
• Liveness: The higher the value, the more likely the song is a live recording.
• Valence: The higher the value, the more positive mood for the song.
• Length: The duration of the song (in seconds).
• Acousticness: The higher the value, the more acoustic the song is.
• Speechiness: The higher the value, the more spoken words the song contains.
• Popularity: The higher the value, the more popular the song is.
EDA: without Pandas
top50.csv

Q1: What are some ways we can store this file in data structure(s) using regular Python (not the Pandas library)?

EDA: without Pandas
top50.csv

Possible Solution #1: A 2D array (i.e., matrix)
data = [][]
Weaknesses:
• What are the row and column names? Need separate lists for them, which is clumsy.
• Looking up a column name in a list is O(N). We’d need 2 dictionaries just for column names: col_name -> index and index -> col_name.
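To make the weakness concrete, here is a minimal sketch of the 2D-list approach. The rows and their numeric values are made-up placeholders rather than the real top50.csv contents, and all variable names are illustrative assumptions.

```python
# Column names have to live outside the data, in a separate list.
col_names = ["Track.Name", "Artist.Name", "Length", "Popularity"]

# Made-up placeholder rows (NOT the real top50.csv values).
data = [
    ["Song A", "Artist X", 191, 79],
    ["Song B", "Artist Y", 302, 92],
    ["Song C", "Artist Z", 186, 85],
]

# Two dictionaries translate between column names and positions.
col_to_index = {name: i for i, name in enumerate(col_names)}
index_to_col = {i: name for i, name in enumerate(col_names)}

# Reading one cell needs an extra lookup to find the column position.
length_of_second_song = data[1][col_to_index["Length"]]
print(length_of_second_song)
```

Every access goes through the helper dictionaries, which is the clumsiness the slide points at.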
EDA: without Pandas
top50.csv

Possible Solution #2: A list of dictionaries
list:
Item 1 = {“Track.Name”: “Senorita”, “Artist.Name”: “Shawn Mendes”, “Genre”: “Canadian pop”, …}

Item 2 = {“Track.Name”: “China”, “Artist.Name”: “Anuel AA”, “Genre”: “reggaetón flow”, … }

Item 3 = {“Track.Name”: “boyfriend”, “Artist.Name”: “Ariana Grande”, “Genre”: “dance pop”, … }

EDA: list of dictionaries
Possible Solution #2: A list of dictionaries

From lecture3.ipynb
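The notebook code for this slide is not reproduced in this text. As a rough sketch of one way to build the list of dictionaries with the standard library, assuming a header row like the one in top50.csv: the in-memory CSV text below is a made-up stand-in for the real file, and its values are placeholders.

```python
import csv
import io

# Stand-in for open("top50.csv"); rows and values are made-up placeholders.
csv_text = """Track.Name,Artist.Name,Genre,Length,Popularity
Song A,Artist X,pop,191,79
Song B,Artist Y,dance pop,302,92
Song C,Artist Z,pop,186,85
"""

songs = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # DictReader yields every value as a string, so convert numeric columns.
    row["Length"] = int(row["Length"])
    row["Popularity"] = int(row["Popularity"])
    songs.append(row)

print(songs[0]["Track.Name"], songs[0]["Length"])
```

Each row becomes one dictionary keyed by column name, so no separate name-to-index mapping is needed.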
EDA: list of dictionaries
Possible Solution #2: A list of dictionaries
Q2: Write code to print all songs (Artist and Track name) that are longer than 4 minutes (240 seconds):

From lecture3.ipynb
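One possible answer to Q2, sketched with a list comprehension. This is not necessarily the notebook's solution; the three songs and their lengths are made-up placeholders.

```python
# Made-up placeholder songs (NOT the real top50.csv values).
songs = [
    {"Track.Name": "Song A", "Artist.Name": "Artist X", "Length": 191},
    {"Track.Name": "Song B", "Artist.Name": "Artist Y", "Length": 302},
    {"Track.Name": "Song C", "Artist.Name": "Artist Z", "Length": 186},
]

# Keep only the songs longer than 240 seconds (4 minutes).
long_songs = [song for song in songs if song["Length"] > 240]

for song in long_songs:
    print(song["Artist.Name"], "-", song["Track.Name"])
```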
EDA: list of dictionaries
Possible Solution #2: A list of dictionaries
Q3: Write code to print the most popular song (artist and track) – if ties, show all ties.

From lecture3.ipynb
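A sketch of one way to answer Q3, done in two passes: find the top score, then collect every song that matches it. The data are made-up placeholders, with a tie added on purpose.

```python
# Made-up placeholder songs; two of them tie for the top score.
songs = [
    {"Track.Name": "Song A", "Artist.Name": "Artist X", "Popularity": 79},
    {"Track.Name": "Song B", "Artist.Name": "Artist Y", "Popularity": 92},
    {"Track.Name": "Song C", "Artist.Name": "Artist Z", "Popularity": 85},
    {"Track.Name": "Song D", "Artist.Name": "Artist W", "Popularity": 92},
]

# Pass 1: the highest popularity score in the dataset.
top_score = max(song["Popularity"] for song in songs)

# Pass 2: every song that reaches that score, so ties are all kept.
most_popular = [song for song in songs if song["Popularity"] == top_score]

for song in most_popular:
    print(song["Artist.Name"], "-", song["Track.Name"])
```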
EDA: list of dictionaries
Possible Solution #2: A list of dictionaries
Q4: Write code to print the songs (and their attributes), sorted by popularity (highest-scoring ones first).
list:
Item 1 = {“Track.Name”: “Senorita”, “Artist.Name”: “Shawn Mendes”, “Genre”: “Canadian pop”, …}

Item 2 = {“Track.Name”: “China”, “Artist.Name”: “Anuel AA”, “Genre”: “reggaetón flow”, … }

Item 3 = {“Track.Name”: “boyfriend”, “Artist.Name”: “Ariana Grande”, “Genre”: “dance pop”, … }

Cumbersome to move dictionaries around in a list. Problematic even if we don’t move the dictionaries.
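For Q4, a minimal sketch using the built-in sorted() with a key function, which returns a new list and leaves the original order untouched. The songs and scores are made-up placeholders.

```python
# Made-up placeholder songs (NOT the real top50.csv values).
songs = [
    {"Track.Name": "Song A", "Artist.Name": "Artist X", "Popularity": 79},
    {"Track.Name": "Song B", "Artist.Name": "Artist Y", "Popularity": 92},
    {"Track.Name": "Song C", "Artist.Name": "Artist Z", "Popularity": 85},
]

# sorted() builds a new list; reverse=True puts the highest scores first.
ranked = sorted(songs, key=lambda song: song["Popularity"], reverse=True)

for song in ranked:
    print(song)
```

Even in this short form, any analysis that needs a different ordering has to rebuild the whole list, which is part of why the slide calls the approach cumbersome.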
EDA: list of dictionaries
Possible Solution #2: A list of dictionaries
Q5: How could you check for null/empty entries? This is only 50 entries. Imagine if we had 500,000.
list:
Item 1 = {“Track.Name”: “Senorita”, “Artist.Name”: “Shawn Mendes”, “Genre”: “Canadian pop”, …}

Item 2 = {“Track.Name”: “China”, “Artist.Name”: “Anuel AA”, “Genre”: “reggaetón flow”, … }

Item 3 = {“Track.Name”: “boyfriend”, “Artist.Name”: “Ariana Grande”, “Genre”: “dance pop”, … }
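One way to approach Q5 is to scan every value of every dictionary. The sketch below assumes that "empty" means None or an empty string, which is a choice made for illustration; the data are made-up placeholders with two gaps inserted deliberately.

```python
# Made-up placeholder songs with two deliberately missing values.
songs = [
    {"Track.Name": "Song A", "Genre": "pop", "Length": 191},
    {"Track.Name": "Song B", "Genre": "", "Length": 302},
    {"Track.Name": "Song C", "Genre": "pop", "Length": None},
]

# Record (row position, column name) for each value that is None or "".
missing = [
    (i, key)
    for i, song in enumerate(songs)
    for key, value in song.items()
    if value is None or value == ""
]

print(missing)
```

The scan touches every cell, so its cost grows with rows times columns, and the definition of "empty" has to be maintained by hand.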

EDA: list of dictionaries
Possible Solution #2: A list of dictionaries
Q6: Imagine we had another table* (i.e., a .csv file), spotify_aux.csv. How could we combine its data with our already-existing dataset?

* 3rd column is made-up by me. Random values. Pretend they’re accurate.
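The contents of spotify_aux.csv are not reproduced in this text, so the sketch below assumes a shared "Track.Name" key and a hypothetical extra column called "ExtraValue". Both names and all values are assumptions for illustration only.

```python
# Existing dataset (made-up placeholder values).
songs = [
    {"Track.Name": "Song A", "Popularity": 79},
    {"Track.Name": "Song B", "Popularity": 92},
]

# Hypothetical auxiliary rows; "ExtraValue" is an invented column name.
aux_rows = [
    {"Track.Name": "Song B", "ExtraValue": 7},
    {"Track.Name": "Song A", "ExtraValue": 3},
]

# Index the auxiliary rows by the shared key for O(1) lookups.
aux_by_track = {row["Track.Name"]: row for row in aux_rows}

merged = []
for song in songs:
    combined = dict(song)  # copy so the original dictionary is not mutated
    combined.update(aux_by_track.get(song["Track.Name"], {}))
    merged.append(combined)

print(merged)
```

Songs without a match in the auxiliary table simply keep their original keys here, which is one of several possible join behaviors.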

EDA: with Pandas!

Kung Fu Panda is property of DreamWorks and Paramount Pictures

EDA: with Pandas
What / Why?
• Pandas is an open-source Python library (anyone can contribute)
• Provides high-performance, easy-to-use data structures and data analysis tools
• Unlike the NumPy library, which provides multi-dimensional arrays, Pandas provides a 2D table object called DataFrame (akin to a spreadsheet with column names and row labels).
• Used by a lot of people
EDA: with Pandas
How
• Import the pandas library (conventionally renamed to pd: import pandas as pd)
• Use the read_csv() function
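A minimal sketch of these two steps. To stay self-contained it reads from an in-memory string instead of the real top50.csv, and the rows are made-up placeholders.

```python
import io

import pandas as pd

# Stand-in for the real file; values are made-up placeholders.
csv_text = """Track.Name,Artist.Name,Genre,Length,Popularity
Song A,Artist X,pop,191,79
Song B,Artist Y,dance pop,302,92
Song C,Artist Z,pop,186,85
"""

# With the actual file on disk this would be: pd.read_csv("top50.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
```

read_csv() infers column types, so Length and Popularity arrive as integers without manual conversion.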

EDA: with Pandas
Common Pandas functions
High-level viewing:
• head() – first N observations
• tail() – last N observations
• columns – names of the columns (an attribute, not a method)
• describe() – statistics of the quantitative data
• dtypes – the data types of the columns (an attribute, not a method)
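The viewing tools listed above can be tried on a tiny DataFrame. This is only a sketch; the rows are made-up placeholders rather than the real Spotify data.

```python
import pandas as pd

# Made-up placeholder rows (NOT the real top50.csv values).
df = pd.DataFrame({
    "Track.Name": ["Song A", "Song B", "Song C"],
    "Length": [191, 302, 186],
    "Popularity": [79, 92, 85],
})

first_two = df.head(2)   # first N rows (N defaults to 5)
last_one = df.tail(1)    # last N rows
summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column

print(df.columns)        # attribute: no parentheses
print(df.dtypes)         # attribute: no parentheses
print(summary)
```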

EDA: with Pandas
Common Pandas functions
Accessing/processing:
• df[“column_name”]
• df.column_name
• .max(), .min(), .idxmax(), .idxmin()
• df[<conditional statement>] (boolean filtering of rows)
• .loc[] – label-based accessing
• .iloc[] – integer position-based accessing
• .sort_values()
• .isnull(), .notnull()
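A short sketch exercising the accessors above on made-up placeholder data. The row labels "a", "b", "c" are an assumption chosen to make the difference between .loc[] and .iloc[] visible.

```python
import pandas as pd

# Made-up placeholder rows with explicit row labels.
df = pd.DataFrame(
    {
        "Track.Name": ["Song A", "Song B", "Song C"],
        "Length": [191, 302, 186],
        "Popularity": [79, 92, 85],
    },
    index=["a", "b", "c"],
)

lengths = df["Length"]                        # bracket access works for any column name
same_lengths = df.Length                      # attribute access fails for names like "Track.Name"
most_popular_label = df["Popularity"].idxmax()  # row label of the maximum
long_songs = df[df["Length"] > 240]           # boolean filtering
by_label = df.loc["b", "Track.Name"]          # label-based
by_position = df.iloc[0, 0]                   # integer position-based
ranked = df.sort_values("Popularity", ascending=False)
any_missing = df.isnull().any().any()         # True if any cell is null

print(most_popular_label, by_label, by_position, any_missing)
```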

EDA: with Pandas
Common Pandas functions
Grouping/Splitting/Aggregating:
• .groupby(), .get_group()
• .merge()
• pd.concat()
• .aggregate()
• .append() (removed in pandas 2.0; use pd.concat() instead)
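The grouping and combining functions above, sketched on made-up placeholder data. The second table and its "ExtraValue" column are hypothetical and exist only to give merge() something to join.

```python
import pandas as pd

# Made-up placeholder rows (NOT the real top50.csv values).
songs = pd.DataFrame({
    "Track.Name": ["Song A", "Song B", "Song C"],
    "Genre": ["pop", "dance pop", "pop"],
    "Popularity": [79, 92, 85],
})

# Hypothetical second table sharing the "Track.Name" key.
aux = pd.DataFrame({
    "Track.Name": ["Song A", "Song B", "Song C"],
    "ExtraValue": [3, 7, 5],
})

by_genre = songs.groupby("Genre")
pop_only = by_genre.get_group("pop")                          # rows of one group
stats = by_genre["Popularity"].aggregate(["mean", "max"])     # per-group summaries
merged = songs.merge(aux, on="Track.Name")                    # column-wise join on a key
stacked = pd.concat([songs, songs], ignore_index=True)        # row-wise stacking

print(stats)
```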

EDA: with Pandas

Now, let’s open the lecture3.ipynb notebook for some real-time practice.

Data Concerns

When determining if a dataset is sound to use, it can be useful to think about these four questions:
• Did it come from a trustworthy, authoritative source?
• Is the data a complete sample?
• Does the data seem correct?
• (optional) Is the data stored efficiently or does it have redundancies?

Data Concerns: the format

• Oftentimes, there may not exist a single dataset that contains all of the information we are interested in.
• May need to merge existing datasets
• Important to do so in a sound and efficient format

Data Concerns: the format
For example, say we have two datasets:

Dataset 1: Top 200 most-frequent streams per day (for June 2019)
SpotifySongID, # of Streams, Date
2789179, 42003, 06-01
... (200 rows for 06-01)
3819390, 89103, 06-01
4492014, 52923, 06-02
... (200 rows for 06-02)
8593013, 189145, 06-02
...
6,000 x 3

Dataset 2: Top 50 most streamed in 2019, so far
SpotifySongID, Artist, Track, [10 acoustic features]
2789179, Billie Eilish, bad guy, 3.2, 5.9, …
... (50 rows)
3901829, Outkast, Elevators, 9.3, 5.1, …
50 x 13
Data Concerns: the format
For example, say we have the two datasets above (Dataset 1: 6,000 x 3; Dataset 2: 50 x 13).

We are interested in determining if songs with high danceability are more popular during the weekends of June than weekdays in June. What should our merged table look like? Concerns?
Data Concerns: the format

This is wasteful, as it has 10 acoustic features, artist, and track repeated many times for each unique song.

Datasets Merged (poorly)
SpotifySongID, # of Streams, Date, Artist, Track, [10 acoustic features]
2789179, 42003, 06-01, Billie Eilish, bad guy, 3.2, 5.9, …
... (200 rows for 06-01)
3819390, 89103, 06-01, Outkast, Elevators, 9.3, 5.1, …
4492014, 52923, 06-02, …
... (200 rows for 06-02)
8593013, 189145, 06-02, …
6,000 x 15 = 90,000 cells
Data Concerns: the format

Some rows may have null values for # of Streams (if the song wasn’t popular in June)

Datasets Merged (better)
SpotifySongID, Artist, Track, [10 acoustic features], 06-01 Streams, 06-02 Streams, …
2789179, Billie Eilish, bad guy, 3.2, 5.9, …, 42003, 42831, 43919
... (50 rows)
3901829, Outkast, Elevators, 9.3, 5.1, …, 29109, 27193, 25982
50 x 70 = 3,500 cells
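A sketch of how the wide ("better") layout could be produced with pandas: pivot the per-day stream counts so each date becomes a column, then left-join onto the song table. The IDs and stream counts are the ones shown on the slide; the acoustic features are left out, and the variable names are illustrative assumptions.

```python
import pandas as pd

# Long format: one row per (song, day), as in Dataset 1.
streams = pd.DataFrame({
    "SpotifySongID": [2789179, 2789179, 3901829, 3901829],
    "Date": ["06-01", "06-02", "06-01", "06-02"],
    "Streams": [42003, 42831, 29109, 27193],
})

# One row per song, as in Dataset 2 (acoustic features omitted here).
songs = pd.DataFrame({
    "SpotifySongID": [2789179, 3901829],
    "Artist": ["Billie Eilish", "Outkast"],
    "Track": ["bad guy", "Elevators"],
})

# Long -> wide: each date becomes its own "<date> Streams" column.
wide = streams.pivot(index="SpotifySongID", columns="Date", values="Streams")
wide.columns = [f"{date} Streams" for date in wide.columns]
wide = wide.reset_index()

# Left join keeps every song; songs with no June streams would get nulls.
merged = songs.merge(wide, on="SpotifySongID", how="left")

print(merged)
```

The left join is what produces the null values mentioned above for songs that never appear in the daily top 200.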

Data Concerns: the format

• Is the data correctly constructed (or are values wrong)?
• Is there redundant data in our merged table?
• Missing values?

Sources of data

• Data can come from:
• You curate it
• Someone else provides it, all pre-packaged for you
• Someone else provides an API
• Someone else has available content, and you try to take it (web scraping)

Web scraping

• Web servers
• A server is a long-running process (also called a daemon) which listens on one or more pre-specified ports
• It responds to requests, which are sent using a protocol called HTTP (HTTPS is the secure version)
• Our browser sends these requests and downloads the content, then displays it
• Status codes: 2xx – request was successful; 4xx – client error, often `page not found` (404) or an incorrectly formed request (400); 5xx – server error (the server failed while handling the request)
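The status-code classes can be checked in code. The helper below is a hypothetical illustration (describe_status is not part of any library); it uses only the standard library's HTTPStatus enum for the example codes.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Map an HTTP status code to the coarse class it belongs to."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"  # e.g. 1xx informational, 3xx redirects


for status in (HTTPStatus.OK, HTTPStatus.NOT_FOUND, HTTPStatus.INTERNAL_SERVER_ERROR):
    print(int(status), status.phrase, "->", describe_status(status))
```

A scraper would typically check the class of the response code before trying to parse the body.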

Web scraping

• Using programs to download or otherwise get data from online
• Often much faster than manually copying data!
• Transfer the data into a form that is compatible with your code
• Legal and moral issues (per Lecture 2)

Web scraping

• Requests (Python library): gets a webpage for you
• requests.get(url)
• BeautifulSoup library parses webpages (.html content) for you!
• Use BeautifulSoup to find all the text or all the links on a page
• Documentation: http://crummy.com/software/BeautifulSoup
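A minimal sketch of the parsing step, assuming the bs4 package is installed. To avoid a network call, a small made-up HTML string stands in for the page that requests.get(url) would download.

```python
from bs4 import BeautifulSoup

# With a live page you would fetch the HTML first, for example:
#   import requests
#   html = requests.get(url).text
# Here a small made-up page stands in for the downloaded content.
html = """
<html><body>
  <h1>Example page</h1>
  <p>First paragraph.</p>
  <a href="/about">About</a>
  <a href="https://example.com/data">Data</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# All visible text, with pieces separated by a single space.
page_text = soup.get_text(" ", strip=True)

# The target of every link that actually has an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(page_text)
print(links)
```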

Web scraping

• Question: how can we get a list of all image URLs?
• Question: how can we navigate through subsequent pages (i.e., a crawler) recursively?
• Question: could we crawl the entire web?
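One possible answer to the first question, sketched under the assumption that bs4 is installed: collect the src attribute of every <img> tag and resolve it against the page's address. The function name, sample HTML, and base URL are all made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def image_urls(html: str, base_url: str) -> list:
    """Return absolute URLs for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]


# Made-up page fragment: one relative src, one absolute src, one image without src.
sample_html = (
    '<div><img src="/img/a.png">'
    '<img src="https://example.com/b.jpg">'
    '<img alt="no src"></div>'
)

urls = image_urls(sample_html, "https://example.com/gallery/")
print(urls)
```

A crawler could apply the same idea to <a> tags, keeping a set of already-visited URLs so that it does not loop forever.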
