Approaches in data science [Slides]
Overview
Problem statement: It is important that we are able to make informed decisions and derive appropriate
insights from data.
An introduction to data science
| Quantitative and qualitative data analyses are important because they enable us to gain a more
comprehensive understanding of complex phenomena and make data-driven decisions.
Quantitative data analysis involves numerical measurement and statistical analysis.
It allows us to measure and analyse numerical data using statistical methods, enabling us to identify
patterns, trends, and relationships between variables.
It is useful for making predictions, testing hypotheses, and identifying cause-and-effect relationships.

Qualitative data analysis involves exploring patterns and themes in non-numerical data.
It allows us to explore and interpret non-numerical data, such as text, images, or videos.
It is useful for understanding the context of a problem and people’s attitudes, behaviours, etc.

Both types of analysis are important because they provide different ways of understanding and interpreting
data.
| The data science process is a systematic approach to transforming a data problem into a
data-driven solution.
1. State the problem or hypothesis
2. Find the right data sources
3. Remove, fix, and filter data
4. Understand the data
5. Select features, build, train, and validate
6. Deploy, test, and communicate
This approach to data science helps us to discover meaningful patterns, relationships, and trends and helps us
develop accurate and robust models. Various forms of this process are used across different data disciplines,
including data analytics, science, and engineering, under various names, such as OSEMN and CRISP-DM.
Problem statement
| The problem statement helps us define the scope and objectives of our analysis and ensures that our
insights are relevant.
A problem statement identifies the gap between the current (problem) state and the desired (outcome) state.
It should be specific, brief, concise, clear, unbiased, and measurable.
A problem statement may also be in the form of a hypothesis, which is a proposed cause and effect for a
particular phenomenon or problem which has not yet been proven correct.
Examples:
Statement: We need to report on estimated water and electricity income from different customer groups.
Hypothesis: The estimated water and electricity income from domestic customers is 30% lower than from
other customers.
Question: How much water and electricity income can we expect from commercial customers per month?
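
Once income per customer group is known, a hypothesis such as the one above comes down to simple arithmetic. The sketch below is a minimal illustration only: the income figures, the names `monthly_income` and `relative_difference`, and the choice to compare domestic income against the average of the other groups are all assumptions, not part of the slides.

```python
# Hypothetical monthly water and electricity income per customer group.
# These figures are invented for illustration only.
monthly_income = {
    "domestic": 70_000.0,
    "commercial": 110_000.0,
    "industrial": 90_000.0,
}


def relative_difference(value: float, reference: float) -> float:
    """Return the difference between value and reference as a fraction of reference."""
    return (value - reference) / reference


# Assumption: "other customers" means the average of all non-domestic groups.
other_groups = [v for k, v in monthly_income.items() if k != "domestic"]
other_average = sum(other_groups) / len(other_groups)

# Hypothesis: domestic income is 30% lower than that of other customers.
difference = relative_difference(monthly_income["domestic"], other_average)
print(f"Domestic income differs from other customers by {difference:.0%}")
```

A negative result means domestic income is lower than the reference. With real data, we would also ask whether the observed difference is statistically meaningful rather than relying on a single comparison.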
Data collection
| Data collection includes identifying and acquiring applicable data sources, both internally and
externally, which can help answer the problem statement.
We can use company data or open-source data, or collect our own data depending on the nature of our problem
and the analysis we would like to do.
Examples:
Data acquired from surveys such as market research and customer satisfaction surveys.
Queried data from databases or APIs (Application Programming Interfaces) such as sales data and employee
information.
Downloaded data from open sources and cloud repositories such as general census data.
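
As a rough illustration of querying data from a database, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns, and its rows are hypothetical stand-ins for a real company data source.

```python
import sqlite3

# In-memory database standing in for a company sales database (hypothetical schema).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE sales (customer_group TEXT, month TEXT, income REAL)"
)
connection.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("domestic", "2024-01", 700.0),
        ("domestic", "2024-02", 650.0),
        ("commercial", "2024-01", 1200.0),
        ("commercial", "2024-02", 1100.0),
    ],
)

# Query the total income per customer group.
rows = connection.execute(
    "SELECT customer_group, SUM(income) FROM sales "
    "GROUP BY customer_group ORDER BY customer_group"
).fetchall()
connection.close()

for group, total in rows:
    print(group, total)
```

In practice the connection would point at an existing database or API rather than one created on the spot, but the pattern of connecting, querying, and fetching rows stays the same.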
Data cleaning
| Data cleaning, also known as data wrangling, involves transforming raw data into usable formats.
We can use several cleaning techniques to ensure that our data are indeed accurate and of the required quality.
If our data are inaccurate, so will our insights be.
Examples:
Using spreadsheets or a programming language to remove irrelevant observations, handle missing values, fix
structural issues, etc.
Using regular expressions for pattern matching and replacing data.
Using data visualisation tools such as Power BI or spreadsheets for identifying outliers and anomalies.
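
The cleaning techniques above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the records, the field names (`customer`, `phone`, `usage_kl`), and the helper `clean_record` are all hypothetical.

```python
import re

# Hypothetical raw records: inconsistent spacing and case, a missing value, and a duplicate.
raw_records = [
    {"customer": "  Alice ", "phone": "082-555 0101", "usage_kl": "12.5"},
    {"customer": "BOB", "phone": "(083) 555-0102", "usage_kl": ""},
    {"customer": "  Alice ", "phone": "082-555 0101", "usage_kl": "12.5"},
]


def clean_record(record: dict) -> dict:
    """Normalise names, strip non-digits from phone numbers, and parse usage."""
    usage = record["usage_kl"].strip()
    return {
        # Fix structural issues: trim whitespace and use consistent capitalisation.
        "customer": record["customer"].strip().title(),
        # Regular expression: remove every character that is not a digit.
        "phone": re.sub(r"\D", "", record["phone"]),
        # Represent a missing value explicitly as None instead of an empty string.
        "usage_kl": float(usage) if usage else None,
    }


cleaned = []
seen = set()
for record in raw_records:
    row = clean_record(record)
    key = (row["customer"], row["phone"])
    if key not in seen:  # drop duplicate observations
        seen.add(key)
        cleaned.append(row)

print(cleaned)
```

Whether a missing value should be kept as `None`, filled in, or dropped depends on the analysis; the sketch only makes the gap explicit.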
| Exploratory data analysis (EDA) is an approach used to summarise the main characteristics of a
dataset using aggregations, fundamental statistics, and visualisation techniques.
Before we can gather insights or build a model, we first need to understand our data. We can use non-graphical
methods, such as descriptive statistics and correlation, or graphical (visualisation) methods to investigate our
data.
_Univariate analysis_ is the exploration of individual variables in a dataset, i.e. we only consider one
variable at a time.
We can use descriptive statistics such as the standard deviation, central tendency, and measures of
distribution.
We can use visualisations such as histograms, density plots, and box plots to understand the
characteristics of a variable.

In a _multivariate analysis_ we're more interested in the relationship between the different variables of our
dataset.
We use correlation to understand the strength and direction of the relationship between variables.
We can use visualisations such as heatmaps, scatter plots, and pair plots to investigate the relationship.
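
The non-graphical side of univariate and multivariate analysis can be sketched with Python's standard library. The two variables and their values below are hypothetical, and `pearson_correlation` is an illustrative helper written out in full so that the calculation is visible.

```python
import statistics

# Hypothetical monthly observations for two variables.
water_usage = [10.0, 12.0, 14.0, 16.0, 18.0]
electricity_usage = [200.0, 240.0, 250.0, 310.0, 350.0]

# Univariate analysis: describe one variable at a time.
mean_water = statistics.mean(water_usage)      # central tendency
median_water = statistics.median(water_usage)  # central tendency
stdev_water = statistics.stdev(water_usage)    # sample standard deviation


def pearson_correlation(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


# Multivariate analysis: strength and direction of the relationship.
correlation = pearson_correlation(water_usage, electricity_usage)

print(mean_water, median_water, round(stdev_water, 2), round(correlation, 3))
```

A coefficient close to 1 indicates a strong positive relationship, close to -1 a strong negative one, and close to 0 little linear relationship. A scatter plot of the same two variables would show the pattern graphically.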
Gather insights
| Gathering insights, also known as data dissemination, involves gathering and reporting the insights
derived from the analysis.
Insights may be gathered and reported to stakeholders through dashboards and reports that include text and
data visualisations.
Model building
| Model building involves selecting an appropriate algorithm and training the model on the data.
Model building is often iterative since a model will rarely give us the results we seek on the first try.
This means that we train and test a model repeatedly until we’ve found a suitable one before deploying it
into a larger system.
Some common tools and skills required for model building include:
Machine learning libraries such as Scikit-learn and TensorFlow for building models in Python.
Deep learning libraries such as Keras and PyTorch for building neural networks in Python.

The model-building cycle:
A. Select features
B. Build model (regression, classification, or other ML model)
C. Train model
D. Validate the results
E. Deploy the model
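
The train-and-validate part of this cycle can be illustrated without any external library. The sketch below fits a simple linear regression by ordinary least squares on a training subset and measures the error on held-out data. The data points and the helper names `fit_line` and `mean_absolute_error` are hypothetical; in practice a library such as Scikit-learn would usually handle these steps.

```python
# Hypothetical data: one feature x and one target y per observation.
data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 9.0), (5.0, 10.8),
        (6.0, 13.1), (7.0, 15.0), (8.0, 16.9)]

# Hold out the last two observations for validation.
# A fixed split keeps the example reproducible; real projects usually split randomly.
train, test = data[:6], data[6:]


def fit_line(points):
    """Train: ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return slope, mean_y - slope * mean_x


def mean_absolute_error(points, slope, intercept):
    """Validate: average absolute prediction error on unseen data."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


slope, intercept = fit_line(train)
error = mean_absolute_error(test, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, validation MAE={error:.2f}")
```

If the validation error is too high for our purpose, we return to an earlier step, for example selecting different features or a different type of model, and repeat the cycle.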
Model deployment
| Model deployment involves integrating the model into a large system or application.
Deployment bridges the gap between data science and real-world applications. Effective testing and
communication ensure the model is useful, reliable, and understood.
Maintenance: Monitor and maintain the model, archiving insights to facilitate future endeavours.
Optimisation: Regularly retrain the model with new data sources and make adjustments to improve
performance.

Diagram steps:
B. Maintain the model
C. Optimise the model
Types of analytics
| The type of analytics we apply depends on our goal and prescribes our approach to the data
analytics or data science process.
Descriptive analytics: Used to describe what has happened in the past. It's a summary of historical data
that provides insights into patterns, trends, and relationships within the data.
Examples: Dashboards and reports.

Diagnostic analytics: Used to determine why something has happened in the past. Helps organisations
understand the factors that contributed to a particular outcome.
Examples: Data mining and drill-down analysis.

Predictive analytics: Used to forecast what will happen in the future. Uses statistical models and machine
learning algorithms to identify patterns and trends in historical data to predict future outcomes.
Examples: Forecasting and risk modelling.

Prescriptive analytics: Used to recommend the best course of action for a given situation. Uses advanced
algorithms and optimisation techniques to suggest the optimal solution based on a variety of factors and
constraints.
Examples: Optimisation