Homework Assignment #2: (Data Wrangling Principles)
Congratulations are in order! Your initial project as a Data Intern for the Movie Intelligence
Department at Wholoo.com was very well-received by your boss. She’s super excited by the
possibilities that Data Science has to offer, but unfortunately she’s been hearing some scary
tales at cocktail parties about the “data wrangling” problem. Your next assignment is to become
“dangerous” in your knowledge of data wrangling: read the assigned mini-book (hopefully
you’ve done that already!), analyze the problems and potential solutions posed by two popular
movie data sets, and do some actual prototyping to show that you can wrangle data with the
best of them. You’ll continue to use the IMDB data in this next assignment - but in addition,
you’ll make use of another publicly available movie data set called MovieLens.
Get Ready...
Finish reading Principles of Data Wrangling: Practical Techniques for Data Preparation if
you have not already done so.
Your next step is to grab a copy of the relevant MovieLens data. Go to the MovieLens web site
(https://grouplens.org/datasets/movielens/) and download a copy of the latest Full version of the
MovieLens Latest Datasets (under “recommended for education and development”). Your goal
is to convince your boss that you’ve really got this, and just wrangling the smaller version of the
data might not convince her. (You also want to give her a sense of how well Postgres can work
as a data wrangling tool at scale.)
Go...!
It’s time to get back to work. First read the README file that’s next to the data’s download link
and then unzip the downloaded data (focusing only on four files: links.csv, movies.csv,
tags.csv and ratings.csv). Your boss wants you to examine the MovieLens data and,
considering both this new data and the IMDB data that you’ve already begun using, to answer
the following questions. You can use whatever tools you like to explore the new data, but you
shouldn’t need anything more than your favorite text editor at first - in fact, you should use a text
editor even if you also use other tools, as it will be important to examine the raw data in order to
properly assess the data’s “issues” and answer all of the questions. If you want, you could also
download the Small version of the data, or better yet, just create your own samples of the files
by throwing away all but the first 11 lines (like we did for you for IMDB). Once you have a feel for
the data, you can load it and explore it further using SQL queries if you like.

STATS 170A Winter 2019 Carey/Minin

Note: When answering the following questions, you can consider each of the files (tables) as
being its own dataset in the book’s terminology.
1. Your work begins at the Raw Data Stage. It’s time to familiarize yourself with the data by
following the book’s prescription for doing so.
a. Start by understanding the structure of your newly acquired MovieLens data. To
do so, for each of the four data files, briefly answer the Basic Questions to
Assess Structure on pp. 16-17 of the book.
b. Next, improve your understanding of the semantics and granularity of the data by
briefly answering the Basic Questions to Assess Data Granularity on p. 18 of the
book for each of the four data files.
c. Since this data was neatly assembled for you, skip the Accuracy and Temporality
analyses (though do review those questions so you know what you’re skipping).
Instead, do a scope analysis by briefly answering the Basic Questions to Assess
Data Scope on pp. 22-23 for each of the four data files.
2. Now that you’ve hopefully wrapped your head around the new data by studying the files
that have been extracted for you by the friendly folks at MovieLens, it’s time to think
about the problem of wrangling it. We have talked a little about ETL as a part of the data
pipeline; ETL stands for Extract-Transform-Load. You’re going to take a slightly different
approach, an approach that is sometimes called ELT, short for Extract-Load-Transform.
The extraction has been done for you, so the next step is Load: create four tables in
Postgres to hold “as is” versions of the MovieLens data files (ignoring the “genome”
files, as you won’t be using them in this assignment). Load the four files into your tables
and run SQL count queries on each to verify that all of the data made it in.
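To make the Load step concrete, here is a sketch of what one of the four tables might look like. The column names and types below are read off the MovieLens README - treat them as assumptions, not as the required schema:

```sql
-- One way to create and bulk-load the ratings file "as is".
-- Column names/types follow the MovieLens README; adjust as needed.
CREATE TABLE ratings (
    userId      INTEGER,
    movieId     INTEGER,
    rating      NUMERIC(2,1),
    "timestamp" BIGINT
);

-- psql's client-side \copy reads the CSV from your machine;
-- HEADER skips the first line of column names.
\copy ratings FROM 'ratings.csv' WITH (FORMAT csv, HEADER true);

-- Sanity check: this count should match the file's line count
-- (minus the header line).
SELECT COUNT(*) FROM ratings;
```

Note that `\copy` runs client-side, so the CSV only needs to be readable on your machine, not on the Postgres server.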
3. E and L are done, so now it’s T-time (so to speak). It’s time to drill into the transformation
and refinement opportunities posed by this data. Carefully examine the four data files
(and the IMDB files when asked to do so) again and tackle the following questions:
a. There is at least one piece of MovieLens information that’s currently embedded
in a place where it doesn’t belong - i.e., it’s a classic value extraction use case!
Identify this piece of information - which column of which table is it in, and how is
it modeled presently? Now create a SQL view whose results are better suited to
exposing this information for analysis - i.e., a view whose query body returns a
wider and tabular result with this information moved from its current location into
a separate field. Show the result of doing a SELECT * from your view (ORDERed
by a different field and with a LIMIT of 10 rows to keep your answer short-ish).
(Hint: Read up on PostgreSQL’s “substring” function and other string functions.)
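As a shape hint only - the piece of information, and the table it lives in, are yours to identify - a value-extraction view built on pattern matching might look like the sketch below. Every name here, and the assumption that the buried value is a four-digit number in trailing parentheses, is hypothetical:

```sql
-- Hypothetical value-extraction sketch: substring(... FROM pattern) applies
-- a POSIX regex and returns the parenthesized subexpression (or NULL if
-- there is no match). Table/column names are placeholders.
CREATE VIEW extracted AS
SELECT movieId,
       substring(some_text_col FROM '\((\d{4})\)\s*$')::INTEGER AS extracted_value,
       trim(regexp_replace(some_text_col, '\s*\(\d{4}\)\s*$', '')) AS remainder
FROM some_table;

SELECT * FROM extracted ORDER BY extracted_value LIMIT 10;
```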
b. Another common data transformation involves extracting (“lifting”) the contents of
one field into a few separate fields for subsequent use and analysis. Do you see
any fields in the IMDB tables for which this might be helpful? Create and query a
SQL view (or views), like you did in (a), for each table that has such an
2
STATS 170A Winter 2019 Carey/Minin
opportunity and then use a similar SELECT * query strategy to show the results
of your work. (Hint: Think back to question 3(k) in Homework 1. :-))
c. In the MovieLens dataset, two of the files (ratings and tags) contain timestamp
fields. The README explains: "Timestamps represent seconds since midnight
Coordinated Universal Time (UTC) of January 1, 1970". PostgreSQL does not
store timestamps as raw epoch seconds, and in its current format the field is
interpretable neither by humans nor by PostgreSQL as a datetime value. Create
views for the two tables that show the timestamp values in a "human readable"
format. (Hint: Look at the "to_timestamp()" function, among other date/time
functions, in the PostgreSQL documentation.)
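The conversion itself is a one-liner; a sketch for the ratings table (the view and output column names are just suggestions) might be:

```sql
-- to_timestamp(double precision) turns epoch seconds into a timestamptz.
-- "timestamp" is quoted since it doubles as a type name.
CREATE VIEW ratings_readable AS
SELECT userId,
       movieId,
       rating,
       to_timestamp("timestamp") AS rated_at
FROM ratings;

SELECT * FROM ratings_readable LIMIT 10;
```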
d. Aggregation is an inter-record data transformation step that is sometimes useful
when preparing data for further analysis. Take the MovieLens ratings table as an
example, and explain how (and why) one might choose to aggregate its ratings if
one wanted it to potentially be combinable with the IMDB rating information. As
you did in (a) and (b), create a view that transforms the data in this way and then
query your view to show a sample of its contents.
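Since IMDB ships one pre-aggregated row per title (an average rating plus a vote count) while MovieLens ships one row per user-movie rating, the granularities don’t match. One plausible sketch of the roll-up (view and column names are suggestions):

```sql
-- Roll per-user MovieLens ratings up to one row per movie, mirroring the
-- shape of IMDB's per-title averageRating/numVotes.
CREATE VIEW ml_movie_ratings AS
SELECT movieId,
       AVG(rating) AS avg_rating,
       COUNT(*)    AS num_ratings
FROM ratings
GROUP BY movieId;

SELECT * FROM ml_movie_ratings ORDER BY movieId LIMIT 10;
```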
e. The movies table, as provided, would likely be the most problematic MovieLens
data set to analyze with SQL in its current form. Explain briefly why this is so.
How might you transform the table to make it more amenable to processing using
tabular tools (e.g., SQL or - soon - Pandas)? Create a view (or set of views) that
transform the data in that way and then demo your view(s) via SELECT *.
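If the trouble turns out to be a multi-valued column (you should say which, and why), one common fix is to explode it into one row per value. This sketch assumes a '|'-separated column of the kind the MovieLens README describes:

```sql
-- Sketch: explode an assumed '|'-separated multi-valued column into one
-- row per value, using string_to_array + unnest. Names are placeholders.
CREATE VIEW exploded AS
SELECT movieId,
       unnest(string_to_array(multi_valued_col, '|')) AS single_value
FROM movies;
```

The resulting tall, one-value-per-row shape is what GROUP BY and join-based analyses (in SQL or Pandas) expect.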
f. The links table allows you to integrate three datasets: IMDB, MovieLens and a
third dataset called TMDB (The Movie Database). It is common sense that each
movie should have a unique identifier; however, it turns out that TMDB IDs are
not unique in this table. i) Which non-missing (non-null) TMDB ID repeats the
most? ii) What movie names are associated with that ID in the MovieLens
dataset? iii) In your opinion, why are TMDB IDs not unique here? iv) What
other information can you derive from the IMDB dataset about this movie?
(Hint: You will need to get the IMDB IDs from MovieLens and use the table
titleBasics to answer the last sub-part of this question.)
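For part (i), a simple grouping query will do; the column names below follow the MovieLens README:

```sql
-- Find the non-null tmdbId that appears most often in links.
SELECT tmdbId, COUNT(*) AS occurrences
FROM links
WHERE tmdbId IS NOT NULL
GROUP BY tmdbId
ORDER BY occurrences DESC
LIMIT 1;
```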
g. Your boss got hooked on a TV show last year called “Wisdom of the Crowd”,
so now she wants to exploit that same kind of wisdom for movie analytics at
Wholoo.com. Hey - each of your data collections includes reviews for movies...!
Think more about the transformation steps or operations (in addition to your
answer to (d)) needed to combine the two “well”. Impress your boss by creating a
view whose columns include an id (e.g., the tconst) and title for each movie along
with review information (in separate columns) from both IMDB and MovieLens.
Make sure your view includes what each data set knows about the number of
contributors to its review values. In addition, your view should compute a third,
combined, review value for each movie that has been reviewed in both data sets.
Be sure to compute the combined value in a way that makes sense - and briefly
explain and justify your approach. Last but not least, make sure that your view
includes movies with reviews in either of the two data sets (i.e., not just those
movies with reviews in both).
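One shape such a view might take is sketched below. This is a sketch only: the helper views, the join key handling, and the weighting scheme are all assumptions you should replace with (and justify as) your own choices.

```sql
-- Sketch only. Assumes two hypothetical helper views:
--   imdb_agg(tconst, title, imdb_rating, imdb_votes)  -- from IMDB's ratings
--   ml_agg(tconst, title, ml_rating, ml_votes)        -- a (d)-style MovieLens
--                                                     -- aggregate joined
--                                                     -- through links/movies
-- (links.imdbId lacks tconst's 'tt' prefix and zero-padding, so the join
-- key needs string massaging that is omitted here.)
CREATE VIEW combined_reviews AS
SELECT COALESCE(i.tconst, m.tconst) AS tconst,
       COALESCE(i.title,  m.title)  AS title,
       i.imdb_rating,
       i.imdb_votes,
       m.ml_rating,
       m.ml_votes,
       -- Weight each source's average by its contributor count; the
       -- MovieLens 0.5-5 scale is doubled to match IMDB's 1-10 scale
       -- (one possible choice - justify your own).
       (COALESCE(i.imdb_rating * i.imdb_votes, 0)
          + COALESCE(m.ml_rating * 2 * m.ml_votes, 0))
       / NULLIF(COALESCE(i.imdb_votes, 0) + COALESCE(m.ml_votes, 0), 0)
         AS combined_rating
FROM imdb_agg i
FULL OUTER JOIN ml_agg m ON m.tconst = i.tconst;
```

The FULL OUTER JOIN plus COALESCE on the key columns is what keeps movies reviewed in only one of the two data sets, and NULLIF guards the division when neither side has votes.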
h. After showing your boss your views, she’s sold on them. Materialize each of your
views for future use by creating a new table for it and populating the table using
an INSERT statement whose body is a SELECT * from the view. Voila! You’ve
just successfully wrangled your first data - and relatively large data, at that…!
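One pattern for the materialization step (view and table names are placeholders): copy the view’s schema with a never-true CREATE TABLE AS, then populate it with the prescribed INSERT ... SELECT:

```sql
-- Copy the view's column names and types, but no rows...
CREATE TABLE ml_movie_ratings_mat AS
SELECT * FROM ml_movie_ratings WHERE false;

-- ...then materialize the view's current contents into the table.
INSERT INTO ml_movie_ratings_mat
SELECT * FROM ml_movie_ratings;

SELECT COUNT(*) FROM ml_movie_ratings_mat;  -- sanity check against the view
```

Unlike the view, the materialized table will not reflect later changes to the underlying data - which is exactly the snapshot behavior your boss asked for here.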
What To Turn In
Use your favorite word processing application (Word, Google Docs, …) to document your
analysis and use Canvas to turn in a PDF file of your write-up. In a separate file, create a “sql”
file that shows your technical thoughts (i.e., the resulting queries/views) for each question.
(Please be sure to put your name somewhere on the first page of your writeup. :-))