Notes On Intro To Data Science Udacity
Created 12/20/13
Updated 01/20/14, Updated 02/21/14, Updated 04/06/14, Updated 04/19/14, Updated 04/26/14, Updated 05/04/14
Updated 05/17/14, Updated 05/31/14
Introduction
The Introduction to Data Science class (Udacity UD359) will survey the foundational topics in data science,
namely:
Data Manipulation
Data Analysis with Statistics and Machine Learning
Data Communication with Information Visualization
Data at Scale -- Working with Big Data
The class will focus on breadth and present the topics briefly instead of focusing on a single topic in depth. This
will give you the opportunity to sample and apply the basic techniques of data science.
Presenters are Dave Holtz (Yub, TrialPay) and Cheng-Han Lee (formerly at Microsoft, now at Udacity).
Course duration and time required: approximately 2 months, assuming 6 hr/wk (work at your own pace).
URL: https://www.udacity.com/course/ud359
There is a similar course on Coursera, taught by Bill Howe of University of Washington.
Related Groups
https://groups.google.com/forum/#!forum/datsciprojects - the Monday night group (6:30), organized by Mike Wilber.
https://groups.google.com/forum/#!forum/spark-ml-and-big-data-analytics - the Wednesday night group (6:30), organized by Richard Walker.
Lectures
Lesson 1: Introduction to Data Science
The definitions here include the idea that a data scientist has math/stats knowledge, domain knowledge, and good
analysis skills. They don't presuppose any specific set of tools, though R and MATLAB-like tools are important. The
definitions also include a focus on communication skills, to get the results across.
At this point we start to work with an R + Python environment, which the student should install on a computer,
though the initial exercises can be run interactively on the web.
Lesson project: Titanic data. Can we predict who will survive? In this case, we were asked to fill in some Python
code that implements a heuristic that generates a prediction. The first heuristic performs at about 70%, and the later
questions ask you to improve it to get better and better results.
import pandas

# Peek at the survivors and a few relevant columns to inform a heuristic
# (file_path points at the Titanic CSV).
predictions = {}
df = pandas.read_csv(file_path)
columns = ['Survived', 'Sex', 'Age', 'SibSp', 'Parch']
print(df[df['Survived'] == 1][columns])
You could export the data and run a regression analysis on it, for instance, but in this exercise you are on your own
to come up with a heuristic.
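As a sketch of the kind of rule involved, here is a minimal heuristic, assuming the Kaggle Titanic column names ('PassengerId', 'Sex', 'Survived'); the women-survive rule is my illustration, not the course's graded answer:

import pandas

def simple_heuristic(file_path):
    # Predict survival with a simple rule: females survive, males do not.
    # Column names follow the Kaggle Titanic CSV (an assumption here).
    predictions = {}
    df = pandas.read_csv(file_path)
    for _, passenger in df.iterrows():
        predictions[passenger['PassengerId']] = 1 if passenger['Sex'] == 'female' else 0
    return predictions

# Accuracy against the labelled training data:
# df = pandas.read_csv(file_path)
# accuracy = (df['PassengerId'].map(simple_heuristic(file_path)) == df['Survived']).mean()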
Final Lectures:
Pi-Chaun (Data Scientist @ Google): What is Data Science?
Gabor (Data Scientist @ Twitter): What is Data Science?
Summary: there is no cookie-cutter way to become a data scientist, but a strong mathematical background is very
important. Our next lesson will focus on getting data and loading it so that you can get value from it.
The lectures point out that data wrangling can be 70% of your time.
Common Data Formats - examples of each of the common formats; baseball data is used here. JSON data looks like a dictionary.
The examples are based on the last.fm API. The structure of the request is encoded in the URL.
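A minimal sketch of this style of request, assuming the requests library is available; the method, api_key and response keys below are placeholders, not taken from the lesson:

import json
import requests

# Hypothetical last.fm-style request: the query parameters are placeholders;
# the point is that the whole request lives in the URL.
url = ('http://ws.audioscrobbler.com/2.0/'
       '?method=artist.gettopalbums&artist=Cher&api_key=YOUR_KEY&format=json')
data = json.loads(requests.get(url).text)

# The parsed JSON behaves like nested Python dictionaries and lists;
# the key path below is illustrative only.
print(data['topalbums']['album'][0]['name'])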
Two ways to deal with missing values: partial deletion, and imputing data. Approaches to imputing data include
using the average, or performing linear regression. There are functions within pandas to perform this in two steps:
x = numpy.mean(baseball['weight'])
baseball['weight'] = baseball['weight'].fillna(x)
Easy Imputation
Impute using Linear Regression
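A sketch of the regression flavour of imputation, assuming a baseball DataFrame with 'height' and 'weight' columns and a hypothetical file name, and using scipy's linregress as a stand-in for whatever routine the lecture uses:

import pandas
from scipy import stats

baseball = pandas.read_csv('baseball_data.csv')   # assumed file name

# Fit weight ~ height on the rows where both values are present...
known = baseball[baseball['weight'].notnull() & baseball['height'].notnull()]
slope, intercept, r, p, se = stats.linregress(known['height'], known['weight'])

# ...then fill the missing weights with the predicted values.
missing = baseball['weight'].isnull()
baseball.loc[missing, 'weight'] = intercept + slope * baseball.loc[missing, 'height']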
Lesson Project: Wrangle NYC subway and weather data (11 parts)
Step 1: Number of rainy days - an SQL query (a pandasql sketch follows this list).
Step 2: Temperature on foggy and non-foggy days - an SQL query.
Step 5: Fixing NYC subway turnstile data - this focused on operations using pandas data frames, and
deriving a corrected one.
Step 7: Filtering non-regular data - another example of using pandas data frames, such as
turnstile_data[(turnstile_data.DESCn == 'REGULAR')]
Step 9: Get Hourly Exits - in this case we compare a count value, for a data set that has hourly values,
with the value in the row above, for the prior hour.
Step 10: Time to hour - this was a way to learn some of the operations on a datetime, such as extracting a
field, converting to a string and extracting a character, etc. Several different versions of solutions were posted.
Step 11: Reformat subway dates - demonstrated a few formatting operations on a datetime, including
converting back to a string, and extracting a substring.
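For Step 1, a hedged sketch of what the pandasql query might look like; the weather_data name, the CSV file and the 'rain' column are assumptions about the lesson's data, not the official solution:

import pandas
import pandasql

# Assumed file and column names; adjust to the actual weather CSV.
weather_data = pandas.read_csv('weather_underground.csv')

q = """
SELECT count(*) AS num_rainy_days
FROM weather_data
WHERE cast(rain AS integer) = 1;
"""
rainy_days = pandasql.sqldf(q, locals())
print(rainy_days)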
Statistical Rigor
Kurt (Data Scientist @ Twitter) - Why is Stats Useful?
Introduction to Normal Distribution
T Test
Welch T Test
By this point, we have seen how to understand tests, and seen a formulation in Python.
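A minimal formulation with scipy, using made-up samples; equal_var=False is what turns the ordinary two-sample t-test into Welch's version:

import numpy as np
import scipy.stats

# Two synthetic samples with unequal variances (illustrative only).
a = np.random.normal(loc=5.0, scale=1.0, size=200)
b = np.random.normal(loc=5.3, scale=2.5, size=150)

# equal_var=False gives Welch's t-test (no equal-variance assumption);
# equal_var=True would be the ordinary two-sample t-test.
t, p = scipy.stats.ttest_ind(a, b, equal_var=False)
print(t, p)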
Non-Parametric Tests
These don't assume that the data is drawn from any specific probability distribution. An example is the
Mann-Whitney U test.
Non-Normal Data
Shapiro-Wilk test
Linear Regression
For instance, predict home runs given information about a baseball player.
Cost Function
This introduces gradient-descent (steepest-descent) methods, which are built around the idea of a cost function; the
cost function is typically the sum of the squares of the errors.
This is intended to be a discussion of an algorithm, but it is rather weak. One of the students added a note with a
link to the Coursera class on machine learning as a better source.
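As a sketch of the algorithm being described, with variable names of my own choosing (features is assumed to be an m-by-n numpy array, values a length-m array; the cost below is the standard least-squares cost scaled by 1/2m):

import numpy as np

def compute_cost(features, values, theta):
    # J(theta) = (1 / 2m) * sum((X.theta - y)^2)
    m = len(values)
    errors = np.dot(features, theta) - values
    return np.sum(errors ** 2) / (2.0 * m)

def gradient_descent(features, values, theta, alpha, num_iterations):
    # Repeatedly step theta in the direction of steepest descent of the cost.
    m = len(values)
    cost_history = []
    for _ in range(num_iterations):
        errors = np.dot(features, theta) - values
        theta = theta - (alpha / m) * np.dot(features.T, errors)
        cost_history.append(compute_cost(features, values, theta))
    return theta, cost_history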
Coefficient of Determination
This value is also called R-squared. The closer it is to 1, the better our model fits the data.
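In code, under the usual definition (data and predictions being arrays of equal length; this is my own small helper, not the course's):

import numpy as np

def compute_r_squared(data, predictions):
    # R^2 = 1 - (sum of squared residuals) / (total sum of squares)
    data = np.asarray(data, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    ss_res = np.sum((data - predictions) ** 2)
    ss_tot = np.sum((data - np.mean(data)) ** 2)
    return 1 - ss_res / ss_tot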
The final lectures list other issues to take into account: gradient descent is only one way to implement linear
regression; overfitting is another issue, and cross-validation is one approach to it; and the cost function may have local minima.
Lesson Project: Analyze NYC subway and weather data. We will be analyzing data, and modeling links between
weather and ridership, for instance.
Step 1: Exploratory data analysis - examine the hourly entries in our NYC subway data and determine
what distribution the data follows. In this case, we are using matplotlib with pandas, although there has
been little discussion of matplotlib in the source materials.

# Assumes: import matplotlib.pyplot as plt; import numpy as np; import scipy.stats;
# and that turnstile_weather is the loaded DataFrame of turnstile + weather data.
plt.figure()
x = turnstile_weather[turnstile_weather['rain'] == 1]
y = turnstile_weather[turnstile_weather['rain'] == 0]
x['ENTRIESn_hourly'].hist(color='r', bins=30, alpha=0.5, label='Rain')
y['ENTRIESn_hourly'].hist(color='b', bins=30, alpha=0.5, label='No rain')
plt.xlabel("Ridership")
plt.ylabel("Counts")
plt.legend()
plt.show()

Step 2: Welch's t-test - this just asks whether you think the t-test applies. Since the data is not
normal, it doesn't.
Step 3: Mann-Whitney U test - there is not a lot of explanation here, simply a request to run this test,
which is built into the stats libraries.

a = turnstile_weather['ENTRIESn_hourly'][turnstile_weather['rain'] == 1]
b = turnstile_weather['ENTRIESn_hourly'][turnstile_weather['rain'] == 0]
with_rain_mean = np.mean(a)
without_rain_mean = np.mean(b)
U, p = scipy.stats.mannwhitneyu(a, b)
Step 4: Rainy day ridership vs. non-rainy day ridership - this asks you a question about interpreting the
M-W test.
Step 5: Linear regression - this asks you to integrate the cost function and the calls to gradient descent into
an analysis of the ridership data. Since this dataset is much larger than the baseball set, you run on a subset
of the data.
Step 6: Plot residuals - this adds to Step 5, and asks you to generate a histogram of (predicted minus actual).
Step 7: Compute R-squared - this also adds to Step 5.
Step 8: Non-gradient-descent linear regression - this is an example of using the ordinary least squares
analytical formulation, which is built into a function called stats.linregress. However, it requires that the
input matrix avoid problems such as collinearity, and I had trouble coming up with a set of columns that didn't
exhibit this behavior; a hedged alternative is sketched below.
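Because stats.linregress handles only a single predictor, a hedged alternative for Step 8 is statsmodels' ordinary least squares (statsmodels is in the install list in Appendix A); the feature columns below are only an example and still need to be chosen to avoid collinearity:

import statsmodels.api as sm

def predictions_ols(weather_turnstile):
    # Example feature set; 'rain', 'Hour' and 'meantempi' are assumed to be
    # columns of the turnstile/weather DataFrame, but any non-collinear subset works.
    features = weather_turnstile[['rain', 'Hour', 'meantempi']]
    features = sm.add_constant(features)          # intercept term
    values = weather_turnstile['ENTRIESn_hourly']

    model = sm.OLS(values, features)
    results = model.fit()
    return results.predict(features)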
The different possible ways to encode information are ranked by people's perception.
Types of charts
We have seen the following diagram in a similar course, and want to add it here:
[Diagram of chart types - not reproduced in this text version.]
Plotting in Python
We are going to use ggplot instead of matplotlib. The former looks nicer, and has a grammar of graphics: a set of
graphing components, similar to d3/nvd3. This in turn encourages us to think about scales and other low-level
elements of the chart, rather than simply having all of the configuration provided to us, not visible or controllable.
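A small example in the style of the Python ggplot port; the DataFrame is invented, and the particular components shown (aes, geom_point, geom_line, ggtitle, xlab, ylab) are meant as an illustration of the grammar rather than a recipe from the course:

import pandas
from ggplot import *   # Python port of ggplot2

# Invented data, just to show building a chart from components
# (aesthetics, geoms, labels) rather than one monolithic call.
df = pandas.DataFrame({'hour': range(24),
                       'entries': [100 + 50 * h for h in range(24)]})
plot = ggplot(aes(x='hour', y='entries'), data=df) + \
    geom_point() + geom_line() + \
    ggtitle('Hourly entries (invented data)') + \
    xlab('Hour of day') + ylab('Entries')
print(plot)   # printing the plot object renders it in this ggplot port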
Data Scales
Commentary
After reading the help files of Python's ggplot clone, the documentation basically says: 'Warning, ggplot is NOT Pythonic!' This basically means
it is weird to use. Don't worry; it was designed to be weird. It follows the rules of a famous book called 'Grammar of Graphics' (a book written before most
of us were born!).
Don't worry about it being weird; you can just copy and paste some code, then modify it for yourself.
Lesson 5: MapReduce
This includes a discussion of how much data counts as big data; the answer was several terabytes or more.
Basics of MapReduce
Mapper
Reducer
At this point, we have been introduced to the concepts, and the idea that MapReduce partitions a large data problem
has been described.
We then write mapper and reducer functions in Python. However, they appear to be operating on the same datasets
as in earlier lessons (Aadhaar data), so we are not really using Hadoop or a compute cluster. But this
is similar to the use of MapReduce in our MongoDB work back in 2012.
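A hedged sketch of the Hadoop-streaming style used in these exercises: the mapper and the reducer each read lines from stdin and write tab-separated key/value pairs to stdout; the comma-separated layout and the column index used for the key are assumptions, not the actual Aadhaar schema:

import sys

def mapper():
    # Emit "key<TAB>1" for each record; the key is taken from the 4th
    # comma-separated field (an assumed column, not the real Aadhaar layout).
    for line in sys.stdin:
        fields = line.strip().split(',')
        if len(fields) > 3:
            print('{0}\t1'.format(fields[3]))

def reducer():
    # Input arrives sorted by key, so a running count per key can be totalled.
    count, old_key = 0, None
    for line in sys.stdin:
        key, value = line.strip().split('\t')
        if old_key is not None and key != old_key:
            print('{0}\t{1}'.format(old_key, count))
            count = 0
        old_key = key
        count += int(value)
    if old_key is not None:
        print('{0}\t{1}'.format(old_key, count))

# Wired up with Hadoop streaming, or locally (hypothetical file names):
#   cat data.csv | python mapper.py | sort | python reducer.py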
MapReduce Ecosystem
At this point the discussion starts to include the term Hadoop, and really large configurations are described. It
would appear that setting up such examples, using compute resources at classroom scale, is beyond the scope
of the course.
So we get to hear from industry experts about Hadoop, Hive, and Pig. Hive is a data-warehouse layer on top of Hadoop
that makes Hadoop jobs easier to create by exposing a SQL-like query language. Pig is a high-level platform for
creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin lifts
the programming out of the Java MapReduce idiom into a notation that makes MapReduce programming high level,
similar to that of SQL for RDBMS systems. Pig Latin can be extended using UDFs (User Defined Functions), which
the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language. Hive was
developed at Facebook, and Pig at Yahoo.
Appendix A: Set Up
Local installation
You would need to install the following Python libraries and packages to run the assignments on your own
computer:
pandas
numpy
scipy
statsmodels
ggplot
matplotlib
pandasql
We would highly recommend that you install Anaconda, which should contain most of the libraries and packages that
you need to work on the assignments.
One caveat is that Anaconda does not include pandasql, but installing it after Anaconda is as easy as:
pip install -U pandasql
From the same company there is a hosted data science toolbox, www.wakari.io - nothing to install, and the Anaconda
distribution is included.
Virtual machine
There is a Vagrant specification for a virtual machine here:
https://github.com/asimihsan/intro-to-data-science-udacity
By following these instructions, which are also present in the Git repository README file, you will be able to
create a virtual machine on Linux, Mac OS X, or Windows that includes all dependencies required for this
course, and additionally be able to use IPython Notebooks, which make following this class much easier.
1. Install Git, then clone this repo to your computer: git clone [email protected]:asimihsan/intro-to-data-science-udacity.git
2. Check for errors. There should be none. A warning about the version of the Guest Additions is harmless.
For more basic information on using Vagrant refer to the official documentation: http://docs.vagrantup.com/v2/gettingstarted/index.html
After starting the virtual machine you can run an IPython Notebook server by running the following inside the guest VM:
ipython notebook --ip 0.0.0.0 --pylab inline. Then on your host machine browse to http://localhost:58888. Congratulations!
Baseball data
http://www.seanlahman.com/baseball-archive/statistics/
Contains player information and scores going back to 1871.