DAT100_Int_Data_Ana_Lec2_Intro II

The document provides an introduction to data analytics, defining data and its significance in various fields. It discusses the concept of datafication, the role of data science, and the essential skills required, including math, programming, and domain knowledge. The data science process is outlined in five steps: asking questions, obtaining data, exploring data, modeling, and communicating results.

Uploaded by Bahaa Mohd
Copyright © All Rights Reserved

DAT 100 - Introduction to Data Analytics
Lecture 2 – Introduction II

Dr. Elfadil Abdalla
January 24, 2025

INTRODUCTION TO DATA ANALYTICS

What is data?

Wikipedia:
Data (singular: datum) are individual units of information. A datum describes a single quality or quantity of some object.

Another definition:
Data is a collection of facts, such as numbers, words, measurements, observations, or even just descriptions of things.

Data All Around

Lots of data is being collected and warehoused:
o Web data, e-commerce
o Financial transactions, bank/credit transactions
o Online trading and purchasing
o Social networks

Data is the New Oil

It's valuable, but if unrefined it cannot really be used.

What to do with the collected data?
How to utilize data?

Digging for data: Datafication

According to [1], datafication is:
A technological trend turning many aspects of our life into data.
Or
A process of taking all aspects of life and turning them into data.

Once we datafy things, we can transform their purpose and turn the information into new forms of value.

[1] K. Cukier and V. Mayer-Schoenberger (2013). "The Rise of Big Data".

Datafication Examples

Social platforms (e.g. Facebook): Collect and monitor data on our actions and friendships to market products and services to us.
Insurance: Data used to update risk profile development and business models.
Banking: Data used to establish trustworthiness and the likelihood of a person paying back a loan.
Hiring and recruitment: Data used to replace personality tests.

What is a Data Scientist?

Facts about Data science

Data science is the art and science of acquiring knowledge through data.

Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to:
o Make decisions
o Predict the future
o Understand the past/present
o Create new industries/products
o etc.

Facts about Data science

Data science won't replace the human brain, but will complement it and work alongside it.

Data science should not be thought of as an end-all solution to our data problems; it is merely an opinion, a very informed opinion.

Why data science?

In this data age, it's clear that we have a surplus of data.

o But why should that necessitate an entire new set of vocabulary?

o What was wrong with our previous forms of analysis?

Why data science?

The sheer volume of data makes it literally impossible for a human to parse it in a reasonable time.

Data is collected in various forms and from different sources, and often arrives very unorganized.
Data can be missing, incomplete, or just flat out wrong.
Data can be on very different scales.
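As a minimal sketch of two of the problems above (missing values and very different scales), the snippet below drops incomplete rows and then rescales each column to the range 0 to 1. The records, field names, and helper functions are hypothetical and chosen only for illustration; dropping rows and min-max scaling are just one possible way to handle these issues.

```python
# Hypothetical records: age in years, income in dollars; None marks a missing value.
records = [
    {"age": 25, "income": 40_000},
    {"age": 40, "income": None},
    {"age": 31, "income": 52_000},
    {"age": 55, "income": 100_000},
    {"age": None, "income": 70_000},
]

def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]

def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

complete = drop_incomplete(records)
scaled_age = min_max_scale([row["age"] for row in complete])
scaled_income = min_max_scale([row["income"] for row in complete])

print(len(records) - len(complete), "rows dropped for missing values")
print("age (scaled):   ", scaled_age)
print("income (scaled):", scaled_income)
```

After scaling, age and income are both expressed on the same 0 to 1 range, so neither dominates simply because its raw numbers are larger.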
Main areas of data science

Understanding data science begins with three basic areas:
o Math/statistics: This is the use of equations and formulas to perform analysis.

o Computer programming: This is the ability to use code to create outcomes on the computer.

o Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on).

The data science Venn diagram

The following Venn diagram provides a visual representation of how the three areas of data science intersect:
[Figure: Venn diagram of math/statistics, computer programming, and domain knowledge]

Data science areas (cont.)

While having only two of these three qualities can make you intelligent, it will also leave a gap.
In order to gain knowledge from data, we must be able to:
o utilize computer programming (to access and manipulate data, develop models, visualize the results, etc.)
o understand the mathematics behind the models we derive
o above all, understand our analyses' place in the domain we are in (domain expertise allows you to apply concepts and results in a meaningful and effective way)

The math

Math and statistics knowledge allows you to theorize about and evaluate algorithms, and to tweak existing procedures to fit specific situations.
Math can be used to formalize relationships between variables.
We will study basic mathematical and statistical principles that are handy when doing data science.
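One way to see how math can formalize a relationship between two variables is to fit a straight line, y ≈ slope * x + intercept, by ordinary least squares. The sketch below is only an illustration: the data (hours studied vs. exam score) and the function name fit_line are made up, and a straight line is just one of many possible formal relationships.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied and the resulting exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
print(f"score ≈ {slope:.1f} * hours + {intercept:.1f}")
```

For this made-up data the fitted slope is about 4.1, which can be read as roughly four extra points per additional hour of study.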

Advice from Hadley Wickham, the Chief Scientist at RStudio

Computer programming

Computers help us accomplish tedious, time-consuming tasks that would otherwise take us ages to do manually.
Computer languages help us communicate with machine processors.
A computer can be programmed in many languages; similarly, data science can be done in many languages.
Python, Julia, and R are some of the many languages available to us.

Python

In this course we will learn Python for a variety of reasons:
o Python is an extremely simple language to read and write, even if you've never coded before.
o It is one of the most common languages, both in production and in academic settings (one of the fastest growing, as a matter of fact).
o The language's online community is vast and friendly.
o Python has prebuilt data science modules that data scientists can use.
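As a small taste of the readability and prebuilt modules mentioned above, the sketch below uses the `statistics` module from Python's standard library. The scores are hypothetical, and this module is only one example of what is available; the slides do not say which modules the course will use.

```python
import statistics

# Hypothetical exam scores for a small class.
scores = [72, 85, 90, 66, 85, 78]

average = statistics.mean(scores)      # arithmetic mean
middle = statistics.median(scores)     # middle value of the sorted scores
most_common = statistics.mode(scores)  # most frequent score
spread = statistics.stdev(scores)      # sample standard deviation

print("mean:", round(average, 2))
print("median:", middle)
print("mode:", most_common)
print("standard deviation:", round(spread, 2))
```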

Domain knowledge

This category focuses mainly on having knowledge about the particular topic you are working on.
Examples of such domains include medicine, marketing, banking, and industry.

Data Science Process

If the data has duplicates, missing values, or outliers, we may go back to collect more data or spend more time cleaning the dataset.

Examples of what the process can produce: a spam classifier, a search ranking algorithm, or a recommendation system.

Schutt, R., & O'Neil, C. (2013). Doing data science: Straight talk from the frontline.
Overview of the main steps

The five essential steps to perform data science are as follows:

1. Asking an interesting question
2. Obtaining the data
3. Exploring the data
4. Modeling the data
5. Communicating and visualizing the results

1. Asking an interesting question

This step can be seen as a brainstorming session.

Understand the problem that needs to be addressed and solved.

Data scientists have to frame the problem as a data science problem.

Thus, they need to learn the domain knowledge and combine technical knowledge with data to come up with a solution that drives business value.

2. Obtaining the data

Once the question is determined, it is time to look out into the world for the data that might be able to answer that question.

There are several sources of data, which can be private or public, for example:
o Open data, which is open to everyone (e.g. the World Health Organization (WHO) database)
o Data from companies
o Data from surveys
o Simulated data
o Etc.
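Of the sources listed above, simulated data is the easiest to produce directly in code. The sketch below draws hypothetical heights from a normal distribution using Python's standard `random` module; the function name, the mean of 170 cm, and the standard deviation of 10 cm are assumptions chosen only for illustration.

```python
import random

def simulate_heights(n, mean_cm=170.0, sd_cm=10.0, seed=42):
    """Draw n simulated heights (in cm) from a normal distribution.

    A fixed seed makes the simulation reproducible from run to run.
    """
    rng = random.Random(seed)
    return [rng.gauss(mean_cm, sd_cm) for _ in range(n)]

sample = simulate_heights(1000)
sample_mean = sum(sample) / len(sample)

print("number of simulated values:", len(sample))
print("sample mean (cm):", round(sample_mean, 1))
```

Because the seed is fixed, calling simulate_heights again with the same arguments returns the same values, which makes an analysis easier to repeat and check.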

3. Exploring the data / Exploratory data analysis (EDA)

The basic tools of EDA are plots, graphs, and summary statistics.

Generally speaking, it’s a method of:
o systematically going through the data,
o plotting distributions of all variables (using box plots),
o plotting time series of data,
o transforming variables,
o looking at all pairwise relationships between variables using scatterplot matrices, and
o generating summary statistics for all of them (computing each variable's mean, minimum, maximum, and upper and lower quartiles, and identifying outliers).
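The summary statistics in the last bullet can be sketched with the standard `statistics` module. The data and the function name summarize are hypothetical, and flagging values more than 1.5 times the interquartile range beyond the quartiles is one common convention for outliers (the rule behind box plot whiskers), not the only one.

```python
import statistics

def summarize(values):
    """Return the mean, minimum, maximum, quartiles, and outliers of a variable."""
    # Quartiles computed with the "inclusive" method (data treated as the whole population).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical daily sales counts with one suspicious value.
daily_sales = [12, 15, 14, 13, 16, 15, 14, 48]

summary = summarize(daily_sales)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the value 48 is flagged as an outlier, and it also pulls the mean well above the median, which is a typical sign worth investigating during EDA.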

EDA

With EDA, you want to understand the data, understand the shape of it, and try to connect your understanding of the process that generated the data to the data itself.

Although there’s lots of visualization involved in EDA, we distinguish between EDA and data visualization in that:
o EDA is done toward the beginning of analysis, and data visualization is done toward the end to communicate one’s findings.

4. Modeling the data

This step involves the use of statistical and machine learning models.

In this step, we are not only fitting and choosing models, we are also implementing mathematical validation metrics in order to quantify the models and their effectiveness.
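As a minimal sketch of what a validation metric looks like, the snippet below scores the predictions of a hypothetical spam classifier with three common metrics: accuracy, precision, and recall. The labels, predictions, and function name are made up for illustration, and which metrics matter depends on the problem.

```python
def classification_metrics(actual, predicted, positive="spam"):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        # Share of all predictions that were right.
        "accuracy": correct / len(pairs),
        # Of the messages flagged as spam, the share that really were spam.
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Of the real spam messages, the share that the model caught.
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }

# Hypothetical true labels and model predictions for eight messages.
actual = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam", "spam", "ham", "ham"]

metrics = classification_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Numbers like these let us compare candidate models on the same data instead of relying on impressions.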

5. Communicate and visualize the results

This could take the form of reporting the results to a manager or coworkers, or publishing a paper in a journal.

The main goal of data visualization is to have the reader quickly digest the data, including possible trends, relationships, and more.

We must ensure that we are making each visual as effective as possible.

Data Science Workflow

EXAMPLE: PREDICTING NEONATAL INFECTION
