Notes On Intro To Data Science Udacity
Created 12/20/13
Updated 01/20/14, Updated 02/21/14, Updated 04/06/14, Updated 04/19/14, Updated 04/26/14, Updated 05/04/14
Updated 05/17/14, Updated 05/31/14
Introduction
The Introduction to Data Science class (Udacity UD359) will survey the foundational topics in data science,
namely:
Data Manipulation
Data Analysis with Statistics and Machine Learning
Data Communication with Information Visualization
Data at Scale -- Working with Big Data
The class will focus on breadth and present the topics briefly instead of focusing on a single topic in depth. This
will give you the opportunity to sample and apply the basic techniques of data science.
Presenters are Dave Holtz (Yub, TrialPay) and Cheng-Han Lee (formerly at Microsoft, now at Udacity).
Course duration and time required: approximately 2 months, assuming 6 hr/wk (work at your own pace).
URL: https://www.udacity.com/course/ud359
There is a similar course on Coursera, taught by Bill Howe of University of Washington.
Related Groups
https://groups.google.com/forum/#!forum/datsciprojects - the Monday night group (6:30), organized by Mike Wilber.
https://groups.google.com/forum/#!forum/spark-ml-and-big-data-analytics - the Wednesday night group (6:30), organized by Richard Walker.
Lectures
Lesson 1: Introduction to Data Science
The definitions here include the idea that a data scientist has math/stats knowledge, domain knowledge, and good
analysis skills. They don't presuppose any specific set of tools, though R and MATLAB-like tools are important. The
definitions also include a focus on communication skills, to get the results across.
At this point we start to work with an R + Python environment, which the student should install on a computer,
though the initial exercises can be run interactively on the web.
Lesson project: Titanic data. Can we predict who will survive? In this case, we were asked to fill in some Python
code that implements a heuristic that generates a prediction. The first heuristic performs at about 70%, and the later
questions ask you to improve it to get better and better results.
import pandas

# Peek at the survivors and a few relevant columns to inform a heuristic
# (file_path points at the Titanic CSV).
predictions = {}
df = pandas.read_csv(file_path)
columns = ['Survived', 'Sex', 'Age', 'SibSp', 'Parch']
print(df[df['Survived'] == 1][columns])
You could export the data and run a regression analysis on it, for instance, but in this exercise you are on your own
to come up with a heuristic.
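As a sketch of the kind of rule involved, here is a minimal heuristic, assuming the Kaggle Titanic column names ('PassengerId', 'Sex', 'Survived'); the women-survive rule is my illustration, not the course's graded answer:

import pandas

def simple_heuristic(file_path):
    # Predict survival with a simple rule: females survive, males do not.
    # Column names follow the Kaggle Titanic CSV (an assumption here).
    predictions = {}
    df = pandas.read_csv(file_path)
    for _, passenger in df.iterrows():
        predictions[passenger['PassengerId']] = 1 if passenger['Sex'] == 'female' else 0
    return predictions

# Accuracy against the labelled training data:
# df = pandas.read_csv(file_path)
# accuracy = (df['PassengerId'].map(simple_heuristic(file_path)) == df['Survived']).mean()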
Final Lectures:
Pi-Chaun (Data Scientist @ Google): What is Data Science?
Gabor (Data Scientist @ Twitter): What is Data Science?
Summary: there is no cookie-cutter way to become a data scientist, but a strong mathematical background is very
important. Our next lesson will focus on getting data and loading it so that you can get value from it.
The lectures point out that data wrangling can be 70% of your time.
Common Data Formats - examples of each of the common formats; baseball data is used here. JSON data looks like a dictionary.
The examples are based on the last.fm API. The structure of the request is encoded in the URL.
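A minimal sketch of this style of request, assuming the requests library is available; the method, api_key and response keys below are placeholders, not taken from the lesson:

import json
import requests

# Hypothetical last.fm-style request: the query parameters are placeholders;
# the point is that the whole request lives in the URL.
url = ('http://ws.audioscrobbler.com/2.0/'
       '?method=artist.gettopalbums&artist=Cher&api_key=YOUR_KEY&format=json')
data = json.loads(requests.get(url).text)

# The parsed JSON behaves like nested Python dictionaries and lists;
# the key path below is illustrative only.
print(data['topalbums']['album'][0]['name'])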
Two ways to deal with missing values: partial deletion, and imputing data. Approaches to imputing data include
using the average, or performing linear regression. There are functions within pandas to perform this in two steps:
x = numpy.mean(baseball['weight'])
baseball['weight'] = baseball['weight'].fillna(x)
Easy Imputation
Impute using Linear Regression
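A sketch of the regression flavour of imputation, assuming a baseball DataFrame with 'height' and 'weight' columns and a hypothetical file name, and using scipy's linregress as a stand-in for whatever routine the lecture uses:

import pandas
from scipy import stats

baseball = pandas.read_csv('baseball_data.csv')   # assumed file name

# Fit weight ~ height on the rows where both values are present...
known = baseball[baseball['weight'].notnull() & baseball['height'].notnull()]
slope, intercept, r, p, se = stats.linregress(known['height'], known['weight'])

# ...then fill the missing weights with the predicted values.
missing = baseball['weight'].isnull()
baseball.loc[missing, 'weight'] = intercept + slope * baseball.loc[missing, 'height']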
Lesson Project: Wrangle NYC subway and weather data (11 parts)
Step 1: Number of rainy days - an SQL query (a pandasql sketch follows this list).
Step 2: Temperature on foggy and non-foggy days - an SQL query.
Step 5: Fixing NYC subway turnstile data - this focused on operations using pandas data frames, and
deriving a corrected one.
Step 7: Filtering non-regular data - another example of using pandas data frames, such as
turnstile_data[(turnstile_data.DESCn == 'REGULAR')]
Step 9: Get Hourly Exits - in this case we compare a count value, for a data set that has hourly values,
with the value in the row above, for the prior hour.
Step 10: Time to hour - this was a way to learn some of the operations on a datetime, such as extracting a
field, converting to a string and extracting a character, etc. Several different versions of solutions were posted.
Step 11: Reformat subway dates - demonstrated a few formatting operations on a datetime, including
converting back to a string, and extracting a substring.
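For Step 1, a hedged sketch of what the pandasql query might look like; the weather_data name, the CSV file and the 'rain' column are assumptions about the lesson's data, not the official solution:

import pandas
import pandasql

# Assumed file and column names; adjust to the actual weather CSV.
weather_data = pandas.read_csv('weather_underground.csv')

q = """
SELECT count(*) AS num_rainy_days
FROM weather_data
WHERE cast(rain AS integer) = 1;
"""
rainy_days = pandasql.sqldf(q, locals())
print(rainy_days)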
Statistical Rigor
Kurt (Data Scientist @ Twitter) - Why is Stats Useful?
Introduction to Normal Distribution
T Test
Welch T Test
By this point, we have seen how to understand tests, and seen a formulation in Python.
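A minimal formulation with scipy, using made-up samples; equal_var=False is what turns the ordinary two-sample t-test into Welch's version:

import numpy as np
import scipy.stats

# Two synthetic samples with unequal variances (illustrative only).
a = np.random.normal(loc=5.0, scale=1.0, size=200)
b = np.random.normal(loc=5.3, scale=2.5, size=150)

# equal_var=False gives Welch's t-test (no equal-variance assumption);
# equal_var=True would be the ordinary two-sample t-test.
t, p = scipy.stats.ttest_ind(a, b, equal_var=False)
print(t, p)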
Non-Parametric Tests
These don't assume that the data is drawn from any specific probability distribution. An example is the
Mann-Whitney U test.
Non-Normal Data
Shapiro-Wilk test
Linear Regression
For instance, predict home runs given information about a baseball player.
Cost Function
This introduces gradient-descent (steepest-descent) methods, which are built around the idea of a cost function; the
cost function is typically the sum of the squares of the errors.
This is intended to be a discussion of an algorithm, but it is rather weak. One of the students added a note with a
link to the Coursera class on machine learning as a better source.
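As a sketch of the algorithm being described, with variable names of my own choosing (features is assumed to be an m-by-n numpy array, values a length-m array; the cost below is the standard least-squares cost scaled by 1/2m):

import numpy as np

def compute_cost(features, values, theta):
    # J(theta) = (1 / 2m) * sum((X.theta - y)^2)
    m = len(values)
    errors = np.dot(features, theta) - values
    return np.sum(errors ** 2) / (2.0 * m)

def gradient_descent(features, values, theta, alpha, num_iterations):
    # Repeatedly step theta in the direction of steepest descent of the cost.
    m = len(values)
    cost_history = []
    for _ in range(num_iterations):
        errors = np.dot(features, theta) - values
        theta = theta - (alpha / m) * np.dot(features.T, errors)
        cost_history.append(compute_cost(features, values, theta))
    return theta, cost_history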
Coefficient of Determination
This value is also called R-squared. The closer it is to 1, the better our model fits the data.
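In code, under the usual definition (data and predictions being arrays of equal length; this is my own small helper, not the course's):

import numpy as np

def compute_r_squared(data, predictions):
    # R^2 = 1 - (sum of squared residuals) / (total sum of squares)
    data = np.asarray(data, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    ss_res = np.sum((data - predictions) ** 2)
    ss_tot = np.sum((data - np.mean(data)) ** 2)
    return 1 - ss_res / ss_tot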
The final lectures list other issues to take into account: gradient descent is only one way to implement linear
regression; overfitting is another issue, and cross-validation is one approach to it; and the cost function may have local minima.
Lesson Project: Analyze NYC subway and weather data. We will be analyzing data, and modeling links between
weather and ridership, for instance.
Step 1: Exploratory data analysis - examine the hourly entries in our NYC subway data and determine
what distribution the data follows. In this case, we are using matplotlib with pandas, although there has
been little discussion of matplotlib in the source materials.

# Assumes: import matplotlib.pyplot as plt; import numpy as np; import scipy.stats;
# and that turnstile_weather is the loaded DataFrame of turnstile + weather data.
plt.figure()
x = turnstile_weather[turnstile_weather['rain'] == 1]
y = turnstile_weather[turnstile_weather['rain'] == 0]
x['ENTRIESn_hourly'].hist(color='r', bins=30, alpha=0.5, label='Rain')
y['ENTRIESn_hourly'].hist(color='b', bins=30, alpha=0.5, label='No rain')
plt.xlabel("Ridership")
plt.ylabel("Counts")
plt.legend()
plt.show()

Step 2: Welch's t-test - this just asks whether you think the t-test applies. Since the data is not
normal, it doesn't.
Step 3: Mann-Whitney U test - there is not a lot of explanation here, simply a request to run this test,
which is built into the stats libraries.

a = turnstile_weather['ENTRIESn_hourly'][turnstile_weather['rain'] == 1]
b = turnstile_weather['ENTRIESn_hourly'][turnstile_weather['rain'] == 0]
with_rain_mean = np.mean(a)
without_rain_mean = np.mean(b)
U, p = scipy.stats.mannwhitneyu(a, b)
Step 4: Rainy day ridership vs. non-rainy day ridership - this asks you a question about interpreting the
M-W test.
Step 5: Linear regression - this asks you to integrate the cost function and the calls to gradient descent into
an analysis of the ridership data. Since this dataset is much larger than the baseball set, you run on a subset
of the data.
Step 6: Plot residuals - this adds to Step 5, and asks you to generate a histogram of (predicted minus actual).
Step 7: Compute R-squared - this also adds to Step 5.
Step 8: Non-gradient-descent linear regression - this is an example of using the ordinary least squares
analytical formulation, which is built into a function called stats.linregress. However, it requires that the
input matrix avoid problems such as collinearity, and I had trouble coming up with a set of columns that didn't
exhibit this behavior; a hedged alternative is sketched below.
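Because stats.linregress handles only a single predictor, a hedged alternative for Step 8 is statsmodels' ordinary least squares (statsmodels is in the install list in Appendix A); the feature columns below are only an example and still need to be chosen to avoid collinearity:

import statsmodels.api as sm

def predictions_ols(weather_turnstile):
    # Example feature set; 'rain', 'Hour' and 'meantempi' are assumed to be
    # columns of the turnstile/weather DataFrame, but any non-collinear subset works.
    features = weather_turnstile[['rain', 'Hour', 'meantempi']]
    features = sm.add_constant(features)          # intercept term
    values = weather_turnstile['ENTRIESn_hourly']

    model = sm.OLS(values, features)
    results = model.fit()
    return results.predict(features)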
The different possible ways to encode information are ranked by people's perception.
Types of charts
We have seen the following diagram in a similar course, and want to add it here:
[Diagram of chart types - not reproduced in this text version.]
Plotting in Python
We are going to use ggplot instead of matplotlib. The former looks nicer, and has a grammar of graphics: a set of
graphing components, similar to d3/nvd3. This in turn encourages us to think about scales and other low-level
elements of the chart, rather than simply having all of the configuration provided to us, not visible or controllable.
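A small example in the style of the Python ggplot port; the DataFrame is invented, and the particular components shown (aes, geom_point, geom_line, ggtitle, xlab, ylab) are meant as an illustration of the grammar rather than a recipe from the course:

import pandas
from ggplot import *   # Python port of ggplot2

# Invented data, just to show building a chart from components
# (aesthetics, geoms, labels) rather than one monolithic call.
df = pandas.DataFrame({'hour': range(24),
                       'entries': [100 + 50 * h for h in range(24)]})
plot = ggplot(aes(x='hour', y='entries'), data=df) + \
    geom_point() + geom_line() + \
    ggtitle('Hourly entries (invented data)') + \
    xlab('Hour of day') + ylab('Entries')
print(plot)   # printing the plot object renders it in this ggplot port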
Data Scales
Commentary
After reading the help files of Python's ggplot clone, the documentation basically says: 'Warning, ggplot is NOT Pythonic!' This basically means
it is weird to use. Don't worry; it was designed to be weird. It follows the rules of a famous book called 'Grammar of Graphics' (a book written before most
of us were born!).
Don't worry about it being weird; you can just copy and paste some code, then modify it for yourself.
Lesson 5: MapReduce
This includes a discussion of how much data counts as big data; the answer was several terabytes or more.
Basics of MapReduce
Mapper
Reducer
At this point, we have been introduced to the concepts, and the idea that MapReduce partitions a large data problem
has been described.
We then write mapper and reducer functions in Python. However, they appear to be operating on the same datasets
as in earlier lessons (Aadhaar data), so we are not really using Hadoop or a compute cluster. But this
is similar to the use of MapReduce in our MongoDB work back in 2012.
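A hedged sketch of the Hadoop-streaming style used in these exercises: the mapper and the reducer each read lines from stdin and write tab-separated key/value pairs to stdout; the comma-separated layout and the column index used for the key are assumptions, not the actual Aadhaar schema:

import sys

def mapper():
    # Emit "key<TAB>1" for each record; the key is taken from the 4th
    # comma-separated field (an assumed column, not the real Aadhaar layout).
    for line in sys.stdin:
        fields = line.strip().split(',')
        if len(fields) > 3:
            print('{0}\t1'.format(fields[3]))

def reducer():
    # Input arrives sorted by key, so a running count per key can be totalled.
    count, old_key = 0, None
    for line in sys.stdin:
        key, value = line.strip().split('\t')
        if old_key is not None and key != old_key:
            print('{0}\t{1}'.format(old_key, count))
            count = 0
        old_key = key
        count += int(value)
    if old_key is not None:
        print('{0}\t{1}'.format(old_key, count))

# Wired up with Hadoop streaming, or locally (hypothetical file names):
#   cat data.csv | python mapper.py | sort | python reducer.py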
MapReduce Ecosystem
At this point the discussion starts to include the term Hadoop, and really large configurations are described. It
would appear that setting up such examples, using compute resources at classroom scale, is beyond the scope
of the course.
So we get to hear from industry experts about Hadoop, Hive, and Pig. Hive is a data-warehouse layer on top of Hadoop
that makes Hadoop jobs easier to create by exposing a SQL-like query language. Pig is a high-level platform for
creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin lifts
the programming out of the Java MapReduce idiom into a notation that makes MapReduce programming high level,
similar to that of SQL for RDBMS systems. Pig Latin can be extended using UDFs (User Defined Functions), which
the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language. Hive was
developed at Facebook, and Pig at Yahoo.
Appendix A: Set Up
Local installation
You would need to install the following Python libraries and packages to run the assignments on your own
computer:
pandas
numpy
scipy
statsmodels
ggplot
matplotlib
pandasql
We would highly recommend that you install Anaconda, which should contain most of the libraries and packages that
you need to work on the assignments.
One caveat is that Anaconda does not include pandasql, but installing it after Anaconda is as easy as:
pip install -U pandasql
From the same company there is a hosted data science toolbox, www.wakari.io - nothing to install, and the Anaconda
distribution is included.
Virtual machine
There is a Vagrant specification for a virtual machine here:
https://github.com/asimihsan/intro-to-data-science-udacity
By following these instructions, which are also present in the Git repository README file, you will be able to
create a virtual machine on Linux, Mac OS X, or Windows that includes all dependencies required for this
course, and additionally be able to use IPython Notebooks, which make following this class much easier.
1. Install Git, then clone this repo to your computer: git clone [email protected]:asimihsan/intro-to-data-science-udacity.git
2. Check for errors. There should be none. A warning about the version of the Guest Additions is harmless.
For more basic information on using Vagrant refer to the official documentation: http://docs.vagrantup.com/v2/gettingstarted/index.html
After starting the virtual machine you can run an IPython Notebook server by running the following inside the guest VM:
ipython notebook --ip 0.0.0.0 --pylab inline. Then on your host machine browse to http://localhost:58888. Congratulations!
Baseball data
http://www.seanlahman.com/baseball-archive/statistics/
Contains player information and scores going back to 1871.