
A Crash Course in Data Science Review

This document provides an overview of key concepts in data science, including definitions of data science, areas of statistics, machine learning techniques, software engineering practices, the structure of data science projects, outputs, and hallmarks of successful experiments. It discusses descriptive, inferential, and predictive statistics, as well as supervised and unsupervised machine learning. The document also outlines tools used in data science.


A Crash Course in Data Science: Review
Kartikeya Bolar

What is Data Science?
• Applied statistics?
• Applied machine learning?
• Database management?
• Answering specific questions with data?
• Deep learning?

Broad Areas of Statistics
• Descriptive: basic summary tables and exploratory data analysis.
  Focus is on profiling the data and exploring potential relationships.
• Inferential: the process of drawing conclusions about a population from a sample.
  Focus is on parameter estimates and relationships between parameters.
• Predictive: the process of obtaining predictions irrespective of the statistical significance of the relationship.
  Focus is on sampling variance.
• Experimental design: balancing observed and unobserved covariates that may contaminate our results.
  Focus is on cause and effect.
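
The difference between the descriptive and inferential views can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the slides: the sample values, the function names `describe` and `mean_confidence_interval`, and the use of a normal approximation (z = 1.96) are all assumptions made for the example.

```python
import math
import statistics


def describe(sample):
    """Descriptive view: summarize the sample itself."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "stdev": statistics.stdev(sample),
        "min": min(sample),
        "max": max(sample),
    }


def mean_confidence_interval(sample, z=1.96):
    """Inferential view: approximate 95% interval for the population mean.

    Assumes a normal approximation (z = 1.96), which is only a rough
    guide for a sample this small.
    """
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * standard_error, mean + z * standard_error


# Hypothetical sample, invented for illustration.
heights_cm = [162.0, 170.5, 168.2, 175.1, 159.8, 181.3, 172.4, 166.9]

summary = describe(heights_cm)
low, high = mean_confidence_interval(heights_cm)
print(summary)
print(f"Approximate 95% CI for the mean: ({low:.1f}, {high:.1f})")
```

The summary describes only the eight observed values, while the interval is a statement about the population they were drawn from.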

Machine Learning
• Generalizability is established by testing on novel datasets.
Types
• Supervised (focus on prediction, judged by prediction performance)
• Unsupervised (clustering, association, principal component analysis)
• Traditional statistical approaches often differ from ML approaches by placing a higher priority on parameter interpretability and simplicity (model specification) than on prediction performance.
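
Testing on novel data can be illustrated with a held-out split. The sketch below is an assumption-laden example rather than anything specified in the slides: the data are synthetic, the 75/25 split is arbitrary, and `fit_line` and `mean_squared_error` are hypothetical helper names.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Average squared gap between observed and predicted y."""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data: y = 2x + 1 plus Gaussian noise. Invented for illustration.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Hold out the last 25% as "novel" data never seen during fitting.
split = int(0.75 * len(xs))
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

intercept, slope = fit_line(train_x, train_y)
train_mse = mean_squared_error(train_x, train_y, intercept, slope)
test_mse = mean_squared_error(test_x, test_y, intercept, slope)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"train MSE={train_mse:.2f} test MSE={test_mse:.2f}")
```

A test error close to the training error suggests the fitted relationship carries over to data the model has not seen; a much larger test error suggests it does not.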

Software engineering for Data Science
• Software engineering is used to generalize data analyses into software so that they can be applied in different situations.
• Software packages provide a well-defined interface that abstracts away the low-level technical details of data analysis routines.
• Whether to develop a function or a package depends on how often the procedure or steps are repeated.
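
As a small illustration of generalizing a repeated step into a function, the sketch below wraps a "mean per group" calculation so it can be reused on unrelated datasets. The function name `group_means` and both datasets are hypothetical and exist only for this example.

```python
import statistics


def group_means(records, group_key, value_key):
    """Return the mean of value_key within each level of group_key."""
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record[value_key])
    return {group: statistics.mean(values) for group, values in groups.items()}


# Two unrelated, invented datasets that need the same analysis step.
sales = [
    {"region": "north", "revenue": 120.0},
    {"region": "south", "revenue": 80.0},
    {"region": "north", "revenue": 100.0},
]
trials = [
    {"arm": "control", "score": 3.0},
    {"arm": "treated", "score": 5.0},
    {"arm": "treated", "score": 7.0},
]

sales_means = group_means(sales, "region", "revenue")
trial_means = group_means(trials, "arm", "score")
print(sales_means)
print(trial_means)
```

Callers only need to know the interface (records, a grouping key, a value key), not how the grouping is carried out. Once several such functions accumulate, collecting them into a package becomes worthwhile.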

Structure of Data Science Project
• A data science project might start with exploratory data analysis or with defining/stating the question.
• Decision making is not part of the data analysis process.

Output of Data Science Experiment - 1
Output types
• Reports
• Presentations
Characteristics
• Clearly written
• Narrative
• Concise conclusions
• Omit the unnecessary
• Reproducible
Tools: R Markdown, knitr, Presenter

Output of Data Science Experiment - 2
Output types
• Interactive web pages (dashboards)
• Apps
Characteristics
• Easy to use
• Documentation
• Code commented
• Version control
Tools: R Markdown with Shiny, Shiny web apps, Tableau, Power BI

Hallmarks of Successful Data Science Experiment
• New knowledge is created.
• Decisions or policies are made based on the outcome of the experiment.
• A report, presentation or app with impact is created.
• It is learned that the data can't answer the question being asked of it.

Data scientist’s toolbox
• Data programming languages (e.g., R, Python)
• Scalable computing frameworks (e.g., Apache Spark, Hadoop MapReduce)
• Web servers (e.g., Amazon Web Services, RStudio Cloud, Azure)
• Help websites (e.g., Stack Overflow)
• Databases (e.g., SQL Server, Excel)
• Chat tools (e.g., Slack)
• Reproducibility tools (e.g., R Markdown)
• Data product development tools (e.g., Shiny, Tableau, Power BI)

Separating Hype from Value
• What is the question you are trying to answer with the data?
• Do you have the data to answer that question?
• If you could answer the question, could you use the answer?

