
01 Introduction

Uploaded by

Tushar Chaudhari

Brian D. Davison
Lehigh University
DSCI 398
Introduction to Data Science
Outline for Today

• What is Data Science?

• Review course and syllabus

• Introductions

• The Data Scientist


Big Data “is at the foundation
of all the megatrends that are
happening today, from social to
mobile to cloud to gaming.”
–Chris Lynch, Vertica Systems
Our definition of “Big Data”

• An all-encompassing term for a collection of data so large that it is
difficult to process using traditional single-machine techniques

Proponents of big data solutions would also describe big data as a large
volume of unstructured data which cannot be handled by standard
databases.
Source: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Why is Big Data a big deal now?

We’ve always had “big” data
• E.g., Meteorology and Astronomical data

There is even more data now in more topic areas!
From the dawn of civilization until
2003, humankind generated five
exabytes of data. Now we produce
five exabytes every two days… and
the pace is accelerating.
Eric Schmidt (2010)
Executive Chairman, Google
https://www.weforum.org/agenda/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/
• The Internet reaches over 5B people
• Over 90% of people access the Internet via mobile devices
• Amount of data consumed last year was 79ZB!

Source: https://www.domo.com/learn/infographic/data-never-sleeps-9
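
To put the two figures quoted on these slides on a common scale, here is a rough back-of-envelope sketch. It assumes decimal units (1 ZB = 1000 EB) and a 365-day year. The 2010 quote counts data generated while the 79 ZB figure counts data consumed, so the ratio is only an order-of-magnitude comparison, not a like-for-like one. The variable names are illustrative.

```python
# Back-of-envelope comparison of the two data rates quoted on the slides.
# Assumption: decimal (SI) units, so 1 zettabyte = 1000 exabytes.
EB_PER_ZB = 1000
DAYS_PER_YEAR = 365

# "five exabytes every two days" (2010 quote)
quoted_2010_eb_per_day = 5 / 2

# "79ZB" consumed over one year, converted to exabytes per day
consumed_eb_per_day = 79 * EB_PER_ZB / DAYS_PER_YEAR

# How many times larger the daily figure is (different measures, so approximate)
ratio = consumed_eb_per_day / quoted_2010_eb_per_day

print(f"2010 quote:   {quoted_2010_eb_per_day:.1f} EB/day")
print(f"79 ZB a year: {consumed_eb_per_day:.1f} EB/day")
print(f"ratio:        about {ratio:.0f}x")
```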

How many of these services did you use this week?
Why now?

Answer #1: Compute costs have dropped

Source: http://www.networkworld.com/article/2358531/data-center/internet-guru-mary-meeker-says-enterprise-technology-is-getting-much-much-cheaper.html
Why now?

Answer #2: Bandwidth costs have dropped

Source: http://www.networkworld.com/article/2358531/data-center/internet-guru-mary-meeker-says-enterprise-technology-is-getting-much-much-cheaper.html
Why now?

Answer #3: Storage costs have dropped

http://www.mkomo.com/cost-per-gigabyte-update
Why Data Science Now?
• As a result of the factors (computing, storage, and communication costs)
just mentioned for big data, we are now drowning in data.
• A bigger shift in business itself: “information is power” and organizations
need to think about what data to collect and what information to extract
and how to use it optimally.
  • Sensors, e.g., transaction processing systems are critical (ATMs,
  point-of-sale scanners, web servers, IoT), as the eyes and ears of an org
  • Data warehouses provide access to historical data
  • Data mining provides the analytical/modeling toolkits
  • Together they provide organizations an effective “sense and respond”
  mechanism
Big data is not
about the data.
–Gary King, Harvard University, making the point that while data may
be plentiful, the real value is in the analytics
Hiding within those mounds
of data is knowledge that
could change the life of a
patient, or change the world.
–Atul Butte, Stanford School of Medicine
Data Science
Is a set of principles, concepts, and techniques that structure
thinking and analysis of data

Extracts useful information and knowledge from (large) volumes of
data by following a process with reasonably well-defined steps

Changes the way you think about data and its role in business

Offers new opportunities to leverage
• The volume, variety, velocity and veracity of data
• Powerful distributed computing
• Advanced algorithms
Data Science
The process of extracting meaning from data to generate data
products for
• the good of society
• advancement of science
• profits in business
A broad discipline, including significant aspects of
• computer science
• statistics and mathematics
• data mining and machine learning
• operations research
• domain expertise

See http://www.datasciencecentral.com/profiles/blogs/17-analytic-disciplines-compared
Some other (useful) definitions of Data Science

Data science is a ‘concept to unify statistics, data analysis, machine
learning and their related methods’ in order to ‘understand and
analyze actual phenomena’ with data.
-- One of the earliest mentions of the concept by Chikio Hayashi (1998), as quoted in Wikipedia
https://en.wikipedia.org/wiki/Data_science

Data science is the discipline of making data useful.
-- Cassie Kozyrkov (2018), Chief Decision Intelligence Engineer, Google
https://hackernoon.com/what-on-earth-is-data-science-eb1237d8cb37
One View
Data Science – everything relating to data gathering, cleansing,
preparation and analysis.
Data Analysis – human activities aimed at gaining insight into data. An
analyst can use Data Analytics tools to obtain results, or perform analysis
without special data processing.
Data Analytics – automating insights into data, using queries and data
aggregation. It can use Data Mining techniques to discover patterns in the
data.
Machine Learning – artificial intelligence techniques that use a training
dataset to build a model that can predict values of target variables.
Data Mining applies machine learning techniques to Big Data.

[Figure: diagram with labels Machine Learning, Data Gathering, Data Cleansing, Data Preparation]

Based on: https://onthe.io/learn/en/category/qna/What-is-the-difference-between-Data-Science,-Data-Analysis,-Big-Data,-Data-Analytics,--Data-Mining-and-Machine-Learning%3F
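
To make the distinction between Data Analytics and Machine Learning in these definitions concrete, here is a minimal sketch in plain Python. The records, field meanings, and function names are all hypothetical and invented for illustration. The first function is a query plus aggregation, in the spirit of the analytics definition. The second fits a straight line to a small training set by ordinary least squares, one simple way to build a model that predicts a target variable, and certainly not the only one.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (store, ad_spend, sales). Invented for illustration.
records = [
    ("north", 1.0, 3.1),
    ("north", 2.0, 4.9),
    ("south", 3.0, 7.2),
    ("south", 4.0, 8.8),
]


def average_sales_by_store(rows):
    """Data analytics: group rows by store and average the sales column."""
    groups = defaultdict(list)
    for store, _ad_spend, sales in rows:
        groups[store].append(sales)
    return {store: mean(values) for store, values in groups.items()}


def fit_line(xs, ys):
    """Machine learning: fit y = slope * x + intercept by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply the fitted model to an input it has not seen before."""
    slope, intercept = model
    return slope * x + intercept


# Analytics: summarize the data we already have.
summary = average_sales_by_store(records)

# Machine learning: train on (ad_spend -> sales), then predict a new case.
training_inputs = [ad_spend for _store, ad_spend, _sales in records]
training_targets = [sales for _store, _ad_spend, sales in records]
model = fit_line(training_inputs, training_targets)

print(summary)
print(predict(model, 5.0))
```

The aggregation only describes the records it is given, while the fitted model produces a value for an ad spend that never appeared in the training set.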


This course will cover fundamental concepts

There are three general types of concepts:
• Ways to think data-analytically, which will help to identify
appropriate data and consider appropriate methods
  • The data mining process (which we will see on another day)
• General approaches for extracting knowledge from data
  • E.g., data mining algorithms
• How data science fits in the organization and the
competitive landscape

We will also learn:
• How data science is used in a variety of fields
• Programming in R for mining, analysis and visualization
• Hands-on experience mining data
Syllabus
Textbooks
Who am I?
Prof. Davison (he/him)
Full Professor, Computer Science and Engineering Dept.
• Also teach courses like Computers, Internet and Society, Data Mining, Search Engines…
• Research in data mining, search engines, text mining, social networking
• Anywhere one needs to predict, recommend, rank, search, classify, or filter
Co-director for the new interdisciplinary MS in Data Science
Director for the undergraduate minor in Data Science at Lehigh University
Former Facebook Visiting Professor (Core Data Science team)
My students have gone to Yahoo, Google, Microsoft, LinkedIn, Baidu…
… and some are now managers of data science teams!

… Professor, scientist, engineer, father, brother, son, husband, friend.


Who are you?

Introduce yourself! Tell us about some of your identities
• Race / Ethnicity / Tribal affiliation
• Gender / Orientation / Sex / Pronouns
• Citizenship / State if USA / Immigration Status
• Language
• Religion / Spirituality
• Dis/Ability
• Age
• Degree / Year
• Career Goals

Tell us more online!

HW 1: Due Thursday: post pics, introduce yourself online


January 2016
What is a Data Scientist?
Data Scientist – an expert in Data Science
[Figure: diagram with labels Computer Science, Math and Statistics, Domain Expertise, Machine Learning, High Value Engineering, Traditional Research, Data Science]
[Figure: diagram with labels Data Science, Social Sciences, Computer Science, Math and Statistics, Domain Expertise, Socially informed Software Development, Domain-unaware Data Science, Socially-unaware Data Science, High Value Engineering, Traditional Research]
Many Roles in Data Science

“Data Scientist” (Geek?)
• Can do the actual modeling, building or extending tools when needed
• Applied statistician + computer scientist

Collaborator in a data-centric project
• Can translate from business to the execution

Researcher on the cutting edge
• Uses data science techniques to extract knowledge of value

Manager of a data-mining project
• Can understand the potential, evaluate a proposal and execution, and
interface with a broad variety of people

Strategist, Investor, …
• Envision opportunities, come up with novel ideas, evaluate the promise
of new ideas, design data science projects / companies conceptually
Jobs in the Data Science / Big Data Ecosystem

http://101.datascience.community/wp-content/uploads/2015/11/datasciencejobs.png