Lecture3 Pandas and Scraping
ANNOUNCEMENTS
• Standard Sections:
• Fridays (start 9/13) @ 10:30am (1 Story St Room 306)
• Mondays (start 9/16) @ 4:30pm (Science Center 110)
• Advanced Sections (A-Sections):
• Wednesday (start 9/18) @ 4:30pm (TBD)
• Homework 0 isn’t graded for accuracy; however,
• Homework 1 is, and it’ll be released today @ 3pm.
• Inclusion & Diversity Statements and Academic Honesty
documents are now on syllabus. Read them!
CS109A, PROTOPAPAS, RADER, TANNER 3
ANNOUNCEMENTS
• Ed is where the discussions and quizzes reside
• Quizzes are under the ‘Sway’ tab
• If you can’t connect to Ed, try logging out of Canvas, then
back into Canvas
• We are looking to change our lecture room, due to
current space limitations.
Q1: What are some ways we can store this file in
data structure(s) using regular Python (not the
Pandas library)?
Possible Solution #1: A 2D array (i.e., matrix)
data = [][]
Weaknesses:
• What are the row and column names? We'd need separate lists for them, which is clumsy.
• Lists are O(N) to search. We'd need 2 dictionaries just for
column names: col_name -> index and index -> col_name
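To make the weakness concrete, here is a minimal sketch of the 2D-list approach. The names `header`, `col_to_index`, and `index_to_col` are hypothetical, and the second row is a placeholder rather than real data.

```python
# Column names live outside the matrix, so we must track them ourselves.
header = ["Track.Name", "Artist.Name", "Genre"]

# The "matrix": one inner list per song (second row is a made-up placeholder).
data = [
    ["Senorita", "Shawn Mendes", "Canadian pop"],
    ["Track B", "Artist B", "Genre B"],
]

# Two dictionaries just to translate between column names and positions.
col_to_index = {name: i for i, name in enumerate(header)}
index_to_col = {i: name for i, name in enumerate(header)}

# Constant-time lookup through the dictionary...
first_artist = data[0][col_to_index["Artist.Name"]]

# ...versus a linear scan of the header list.
slow_index = header.index("Genre")

print(first_artist, slow_index, index_to_col[slow_index])
```

Every column added or reordered means updating the header and both dictionaries together, which is the clumsiness the slide points at.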
EDA: without Pandas
top50.csv
Possible Solution #2: A list of dictionaries
Item 1 = {“Track.Name”: “Senorita”, “Artist.Name”: “Shawn Mendes”, “Genre”: “Canadian pop”, …}
From lecture3.ipynb
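The notebook code itself is not reproduced on the slide. As a rough sketch, a list of dictionaries can be built with the standard library's `csv.DictReader`. Here an in-memory string stands in for top50.csv, and its second row is a made-up placeholder.

```python
import csv
import io

# Stand-in for the contents of top50.csv (assumption: only three of the
# columns are shown, and the second data row is a placeholder).
csv_text = """Track.Name,Artist.Name,Genre
Senorita,Shawn Mendes,Canadian pop
Track B,Artist B,Genre B
"""

# DictReader uses the first line as keys and yields one dictionary per row.
songs = list(csv.DictReader(io.StringIO(csv_text)))

print(songs[0])
print(len(songs), "songs loaded")
```

With a real file, `io.StringIO(csv_text)` would be replaced by `open("top50.csv", newline="")`. Note that every value arrives as a string, so numeric columns need an explicit `int(...)` or `float(...)` before any comparison.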
EDA: list of dictionaries
Possible Solution #2: A list of dictionaries
Q2: Write code to print all songs (Artist and
Track name) that are longer than 4 minutes (240
seconds):
From lecture3.ipynb
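One possible answer to Q2, sketched on a tiny made-up list rather than the real file. The keys `Length.` and `Popularity` and all row values are assumptions for illustration; only `Track.Name` and `Artist.Name` appear on the slides.

```python
# Made-up rows; lengths are in seconds and already converted to int.
songs = [
    {"Track.Name": "Track A", "Artist.Name": "Artist A", "Length.": 191, "Popularity": 79},
    {"Track.Name": "Track B", "Artist.Name": "Artist B", "Length.": 263, "Popularity": 92},
    {"Track.Name": "Track C", "Artist.Name": "Artist C", "Length.": 240, "Popularity": 92},
]

# Keep only songs strictly longer than 4 minutes (240 seconds).
long_songs = [
    (song["Artist.Name"], song["Track.Name"])
    for song in songs
    if song["Length."] > 240
]

for artist, track in long_songs:
    print(f"{artist}: {track}")
```

A song of exactly 240 seconds is excluded here because the question asks for songs longer than 4 minutes.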
Q3: Write code to print the most popular song
(artist and track); if there are ties, show all of them.
From lecture3.ipynb
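A sketch of one way to answer Q3: find the maximum score first, then collect every song that matches it so that ties are kept. The `Popularity` key and the rows are assumptions for illustration.

```python
# Made-up rows with a deliberate tie at the top.
songs = [
    {"Track.Name": "Track A", "Artist.Name": "Artist A", "Popularity": 79},
    {"Track.Name": "Track B", "Artist.Name": "Artist B", "Popularity": 92},
    {"Track.Name": "Track C", "Artist.Name": "Artist C", "Popularity": 92},
]

# Pass 1: the highest popularity score in the data.
top_score = max(song["Popularity"] for song in songs)

# Pass 2: every song that reaches that score (handles ties).
most_popular = [
    (song["Artist.Name"], song["Track.Name"])
    for song in songs
    if song["Popularity"] == top_score
]

for artist, track in most_popular:
    print(f"{artist}: {track} ({top_score})")
```

Calling `max(songs, key=...)` alone would return only the first of the tied songs, which is why the second pass is needed.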
Q4: Write code to print the songs (and their
attributes) sorted by popularity
(highest-scoring ones first).
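A sketch of one way to answer Q4 using the built-in `sorted` with a key function. As before, the `Popularity` key and the rows are assumptions for illustration.

```python
# Made-up rows, not the real dataset.
songs = [
    {"Track.Name": "Track A", "Artist.Name": "Artist A", "Popularity": 79},
    {"Track.Name": "Track B", "Artist.Name": "Artist B", "Popularity": 92},
    {"Track.Name": "Track C", "Artist.Name": "Artist C", "Popularity": 92},
]

# sorted() returns a new list; reverse=True puts the highest scores first.
# The sort is stable, so tied songs keep their original relative order.
ranked = sorted(songs, key=lambda song: song["Popularity"], reverse=True)

for song in ranked:
    print(song)
```

Even this simple task needs a lambda and a manual print loop, which is part of the motivation for moving to a dedicated tabular library.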
Data Concerns: the format
For example, say we have two datasets:
Dataset 1: Top 200 most-frequent streams per day (for June 2019)
SpotifySongID, # of Streams, Date
2789179, 42003, 06-01
… (200 rows per day)
3819390, 89103, 06-01
4492014, 52923, 06-02
… (200 rows per day)
8593013, 189145, 06-02
6,000 x 3

Dataset 2: Top 50 most streamed in 2019, so far
SpotifySongID, Artist, Track, [10 acoustic features]
2789179, Billie Eilish, bad guy, 3.2, 5.9, …
… (50 rows)
3901829, Outkast, Elevators, 9.3, 5,1, …
50 x 13

We are interested in determining if songs with high danceability are more popular during the weekends of June than weekdays in June. What should our merged table look like? Concerns?
Data Concerns: the format
Datasets Merged (poorly)
SpotifySongID, # of Streams, Date, Artist, Track, [10 acoustic features]
Some rows may have null values for # of Streams (if the song wasn’t popular in June)
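The null-value concern can be reproduced with a small pandas sketch. The song IDs, artists, tracks, and stream counts are the ones shown in the two datasets above. The column name `Streams`, the choice of a left merge, and the omission of the acoustic features are assumptions made for illustration.

```python
import pandas as pd

# A few rows in the shape of Dataset 1 (daily streams).
streams = pd.DataFrame({
    "SpotifySongID": [2789179, 3819390, 4492014],
    "Streams": [42003, 89103, 52923],
    "Date": ["06-01", "06-01", "06-02"],
})

# Two rows in the shape of Dataset 2 (acoustic features omitted).
songs = pd.DataFrame({
    "SpotifySongID": [2789179, 3901829],
    "Artist": ["Billie Eilish", "Outkast"],
    "Track": ["bad guy", "Elevators"],
})

# Keep every song from Dataset 2, attaching stream rows where they exist.
merged = songs.merge(streams, on="SpotifySongID", how="left")

# Songs with no matching stream rows end up with NaN in Streams and Date.
missing = merged[merged["Streams"].isna()]

print(merged)
print(missing["Track"].tolist())
```

Switching `how` to `"inner"` would silently drop the unmatched song instead, so the choice of join type decides whether the gaps show up as nulls or as missing rows.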
• Web servers
• A server is a long-running process (also called a daemon)
that listens on one or more pre-specified ports
• It responds to requests, which are sent using a protocol
called HTTP (HTTPS is the secure version)
• Our browser sends these requests, downloads the
content, and then displays it
• Status codes: 2xx means the request was successful; 4xx means a client
error, often `page not found` (404) or an incorrectly formed request (400);
5xx means a server error (the server failed while handling the request)
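The request/response cycle above can be tried locally with only the standard library. This is a minimal sketch, not how a production server is written: `DemoHandler`, `status_class`, and `fetch_status` are hypothetical names, and the server only answers on the local machine.

```python
import threading
from http import HTTPStatus
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves "/" and answers 404 for anything else."""

    def do_GET(self):
        if self.path == "/":
            body = b"hello"
            self.send_response(HTTPStatus.OK)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(HTTPStatus.NOT_FOUND)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the demo quiet


def status_class(code):
    """Map a status code to the broad classes listed in the bullets."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


def fetch_status(port, path):
    """Send one GET request to the local server and return its status code."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        conn.close()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

ok_status = fetch_status(port, "/")
missing_status = fetch_status(port, "/no-such-page")

server.shutdown()
server.server_close()

print(ok_status, status_class(ok_status))
print(missing_status, status_class(missing_status))
```

The client here plays the role of the browser: it sends a request, reads the response, and inspects the status code before doing anything with the content.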