Fundamentals of Data Source and Preparation For ML v31
• It is essential that data selection and preparation be carried out in a structured manner.
• For ML work, these are typical steps for data selection and preparation:
• Formulate and specify the data sourcing requirements.
• Identify suitable sources of data for the ML system
• Specify data collection processes.
• Determine how to format source data to make it consistent.
• Reduce the volume and dimensions of data, if possible; carry out attribute and record sampling to decide
which data attributes add value to the ML model
• Carry out data scaling if required, using decimal or min-max scaling methods
• Prime the data cleansing system
• Commence ML work by ingesting the data
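The scaling step listed above can be illustrated with a short Python sketch (the helper names and example values are illustrative, not from the slides); it shows min-max scaling into [0, 1] and decimal scaling by a power of ten:

```python
import numpy as np

def min_max_scale(x, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max] (assumes x is not constant)."""
    x = np.asarray(x, dtype=float)
    return new_min + (x - x.min()) * (new_max - new_min) / (x.max() - x.min())

def decimal_scale(x):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    x = np.asarray(x, dtype=float)
    j = int(np.floor(np.log10(np.abs(x).max()))) + 1
    return x / (10 ** j)

values = np.array([120, 455, 870, 1500])
print(min_max_scale(values))  # values mapped into [0, 1]
print(decimal_scale(values))  # divided by 10**4 -> [0.012, 0.0455, 0.087, 0.15]
```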
Data Selection and Preparation for ML
• A taxonomy of data collection actions for working with new or existing ML data can be presented
as shown.
Data Selection and Preparation for ML
• For developing ML models, there are publicly available datasets from a number of organisations
such as:
• Kaggle Datasets
• Amazon Datasets (Registry of Open Data on AWS)
• UCI Machine Learning Repository
• Yahoo WebScope
• Datasets subreddit
• Google's Datasets Search Engine
• Facebook
• Twitter
• These can be used to train supervised and semi-supervised learning ML systems for a variety of contexts.
Sourcing online data
• Bias is systematic error in the data sample. When manually creating representative data
samples for the ML system, consider using the following sampling techniques to reduce bias:
• Simple random sampling - a probabilistic approach where each dataset item from the population has an
equal probability of being selected
• Systematic sampling - sample selections are made using specified criteria that preserve
representativeness
• Stratified sampling - Each sample group preserves the compositional strata in the population
• Cluster sampling - creates groups of sampled items in a pre-specified manner
• Additionally, ensure that samples are of sufficient size: as a rule of thumb, the sample size should be at
least the square root of the population size to minimise sampling error or bias (see the sketch below).
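As a sketch of how stratified sampling and the square-root rule of thumb might look in practice with pandas; the DataFrame, the "region"/"income" columns, and the proportions are hypothetical:

```python
import pandas as pd

# Hypothetical population with a categorical stratum column "region".
population = pd.DataFrame({
    "region": ["north"] * 600 + ["south"] * 300 + ["west"] * 100,
    "income": range(1000),
})

# Rule of thumb from the slide: sample at least sqrt(N) records.
n = max(int(len(population) ** 0.5), 1)

# Stratified sampling: draw from each region in proportion to its share of the
# population, so the compositional strata are preserved in the sample.
sample = population.groupby("region").sample(frac=n / len(population), random_state=42)

print(len(sample))
print(sample["region"].value_counts(normalize=True))  # approx. 0.6 / 0.3 / 0.1
```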
Exploratory Data Analysis (EDA)
Data Science Process
Source: https://purnasaigudikandula.medium.com/exploratory-data-analysis-beginner-univariate-bivariate-and-
What is EDA?
• Exploratory data analysis (EDA) is used by data scientists to analyse and investigate data sets
and summarise their main characteristics, often employing data visualisation methods.
• It helps determine how best to manipulate data sources to get the answers you need, making it
easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check
assumptions.
• EDA is primarily used to see what data can reveal beyond the formal modelling or hypothesis-testing
task, and provides a better understanding of data set variables and the relationships between them.
• It can also help determine if the statistical techniques you are considering for data analysis are
appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA
techniques continue to be a widely used method in the data discovery process today.
Why is EDA important in ML?
• The main purpose of EDA is to help look at data before making any assumptions. It can help
identify obvious errors, better understand patterns within the data, detect outliers or
anomalous events, and find interesting relations among the variables.
• Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals.
• EDA also helps stakeholders by confirming they are asking the right questions.
• EDA can help answer questions about standard deviations, categorical variables, and confidence
intervals.
• Once EDA is complete and insights are drawn, its features can then be used for more
sophisticated data analysis or modelling, including machine learning.
EDA tools
• Specific statistical functions and techniques you can perform with EDA tools include:
• Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional
data containing many variables.
• For example, K-means clustering method in unsupervised learning where data points are assigned into K groups,
i.e. the number of clusters, based on the distance from each group’s centroid. It is commonly used in market
segmentation, pattern recognition, and image compression.
• Univariate visualisation of each field in the raw dataset, with summary statistics.
• Bivariate visualisations and summary statistics that allow you to assess the relationship between each
variable in the dataset and the target variable you’re looking at.
• Multivariate visualisations, for mapping and understanding interactions between different fields in the
data.
• Predictive models, such as linear regression, use statistics and data to predict outcomes.
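A minimal sketch of the univariate, bivariate, and multivariate visualisations listed above, using pandas and seaborn; the file name and the "churn" / "monthly_spend" columns are hypothetical placeholders:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")  # hypothetical tabular data set

# Univariate: summary statistics plus a histogram for every numeric field.
print(df.describe(include="all"))
df.hist(figsize=(10, 8))

# Bivariate: relationship between one feature and the target variable.
plt.figure()
sns.boxplot(data=df, x="churn", y="monthly_spend")

# Multivariate: pairwise scatter plots coloured by the target variable.
sns.pairplot(df, hue="churn")
plt.show()
```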
Types of EDA
• Types of EDA:
• Univariate non-graphical: This is the simplest form of data analysis, where the data being analysed consists
of just one variable. Since it is a single variable, it does not deal with causes or relationships. The main
purpose of univariate analysis is to describe the data and find patterns that exist within it.
• Univariate graphical: Non-graphical methods don't provide a full picture of the data, so graphical methods
are also required. Common types of univariate graphics include histograms, box plots, and stem-and-leaf plots.
• Multivariate nongraphical: Multivariate data arises from more than one variable. Multivariate non-
graphical EDA techniques generally show the relationship between two or more variables of the data
through cross-tabulation or statistics.
• Multivariate graphical: Multivariate data uses graphics to display relationships between two or more sets
of data. The most used graphic is a grouped bar plot or bar chart with each group representing one level
of one of the variables and each bar within a group representing the levels of the other variable.
Data Quality: Validity
• Set-membership constraints
• Values of a column come from a set of discrete values, e.g. enum values. For example, a person's
gender may be male or female.
Data Quality: Validity, Accuracy, Consistency, Uniformity and
Completeness
Accuracy
• This is the degree to which the ML data held represents true or correct values.
• Note that having valid data does not necessarily indicate that it is accurate.
• For example: a street address may be valid but may in fact not be the correct address for the candidate, resulting in inaccuracy.
Consistency
• This indicates the degree to which the ML data is consistent, within the same data set or across multiple data sets.
• Inconsistency occurs when two or more values in the data set contradict each other.
• For example: an individual's age of, say, 8 years would not be expected to correspond to a "married" marital status; or a customer is recorded in two different tables with two different addresses.
Data Quality: Validity, Accuracy, Consistency, Uniformity and
Completeness
Uniformity
• This is the degree to which measured data is specified using the same units of measure.
• For example:
• The weight may be recorded either in pounds or kilos.
• The date might follow the USA format or the European format.
• The currency is sometimes in USD and sometimes in GBP.
• To ensure uniformity, the data must be converted to a single unit of measure in every case.
Completeness
• This is the degree to which all of the required data needed to build the ML model is known to be available for sampling. Missing data impacts the completeness of data.
• An obvious way to counteract this is to try to ensure that data collection methods have no gaps or missing entries when collecting the data.
• However, this is not always possible, and for this reason we will discuss techniques for addressing and mitigating the effects of missing data on ML model building later in this lecture.
Data Cleaning
• The data cleaning workflow is a sequence of steps aimed at producing high-quality data, taking into
account all the criteria we have discussed so far:
• Inspection: Detect unexpected, incorrect, and inconsistent data
• Cleaning: Fix or remove the anomalies discovered
• Verifying: After cleaning, the results are inspected to verify correctness
• Reporting: A report about the changes made and the quality of the currently stored data is recorded
• Although it may seem a sequential process, data cleaning is in fact an iterative, ongoing process. One
can go back from verifying to inspection when new flaws are detected.
Data Cleaning: Inspection
Inspecting the data is time-consuming and requires using many methods for exploring the underlying data for error detection.
• Summary statistics about the data are helpful to give a general idea of the quality of the data. For example:
• Check whether a particular column conforms to particular standards or patterns.
• Is the data column recorded as a string or a number?
• How many values are missing?
• How many unique values are there in a column, and what is their distribution?
• Is this data set linked to, or does it have a relationship with, another?
• By analysing and visualising the data using statistical methods such as the mean, standard deviation, range, or quantiles, one can find values that are unexpected and thus erroneous. A short inspection sketch follows below.
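A possible inspection sketch with pandas; the file and column names are placeholders:

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical input file

# Is each column stored as the expected type (string vs. number)?
print(df.dtypes)

# How many values are missing per column?
print(df.isna().sum())

# How many unique values per column, and the distribution of one of them?
print(df.nunique())
print(df["country"].value_counts(dropna=False))  # "country" is a placeholder column

# Summary statistics (mean, std, quantiles) to spot unexpected values.
print(df.describe())
```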
Data Cleaning: Cleaning
Data cleaning involves different techniques depending on the problem and the data type. Different methods can be applied, each with its own trade-offs. Overall, incorrect data is either removed, corrected, or imputed.
• Irrelevant data: data that are not actually needed and do not fit the context of the problem we are trying to solve. If you are sure a piece of data is unimportant you may drop it; otherwise, explore the correlation matrix between feature variables.
• Duplicates: data points that are repeated in your dataset. A common symptom is two users having the same identity number, or the same article being scraped twice. A removal sketch follows below.
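The removal sketch, assuming pandas; the file and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("articles.csv")  # hypothetical scraped data set

# Drop columns that are irrelevant to the problem being solved.
df = df.drop(columns=["internal_notes", "page_render_time"])

# Remove exact duplicate rows, e.g. the same article scraped twice.
df = df.drop_duplicates()

# Remove rows sharing what should be a unique key, keeping the first occurrence.
df = df.drop_duplicates(subset=["article_id"], keep="first")

# Inspect correlations before dropping features that merely look unimportant.
print(df.corr(numeric_only=True))
```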
Data Cleaning: Cleaning
Missing values (MV) leave us with the question of what to do with them. There are three, or perhaps more, ways to deal with them (a sketch follows below):
• Drop
• If the MV in a column rarely happen and occur at random, we can simply drop the observations (rows) that have MV.
• If most of a column's values are missing and occur at random, we normally drop the whole column.
• Dropping MV is particularly useful when doing statistical analysis, since filling in the missing values may yield unexpected or biased results.
• Impute
• This means calculating the MV based on other observations:
• Statistical values, such as the mean and median (for real or continuous values) and the mode (for integer or categorical values).
• Linear regression: one can calculate the best-fit line between two variables based on the existing data.
• Hot-deck: copying values from other similar records. This is only useful if you have enough available data, and it can be applied to numerical and categorical data.
• Flag
• Filling in the MV leads to a loss of information, no matter what imputation method is used.
• Therefore, flagging MV (e.g. as NaN) allows them to be handled separately if required.
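A sketch of the drop / impute / flag options with pandas; the data set and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical data set with missing values

# Flag: record where values were missing before filling them in.
df["income_was_missing"] = df["income"].isna()

# Drop: remove rows with a missing value in a rarely-missing column...
df = df.dropna(subset=["age"])
# ...or drop a column whose values are mostly missing.
if df["fax_number"].isna().mean() > 0.8:
    df = df.drop(columns=["fax_number"])

# Impute: median for continuous values, mode for categorical values.
df["income"] = df["income"].fillna(df["income"].median())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
```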
Data Cleaning: Verifying and Reporting
Verifying
• When done, one should verify correctness by re-inspecting the data and making sure its rules and constraints do hold.
• For example, after filling in the missing data, the values might still violate some of the rules and constraints.
• This might involve some manual correction if it is not possible otherwise.
Reporting
• Reporting how healthy the data is, is equally important to cleaning.
• Use software/libraries to generate reports of the changes made, which rules were violated and how many times, and log the violations.
• Finally, no matter how robust the validation and cleaning process is, one will continue to encounter problems as new data comes in; the reporting process should help in understanding why they happened in the first place.
Handling Missing Data
• Handling missing data values (data cleansing) is a key element of the data preparation phase of
the CRISP-DM methodology.
• Missing data can occur due to a variety of factors and is dependent on the source of the data.
• Examples include data entry errors, equipment failure, poor user interface design not enforcing required
fields or computer bugs to name a few.
• Missing data can impact the correctness and validity of any discovery obtained from it, a
consequence that can be succinctly described by the principle "garbage in, garbage out".
• The quality of ML inference output is directly related to the completeness of the data.
• Generally, for large datasets, a missing-data rate of around 5% can be considered inconsequential, while anything
over 10% can introduce bias.
Handling Missing Data
• Before considering how missing data is best handled, it is useful to understand what the mechanism of missingness in
the data is, so that the appropriate method can be chosen to deal with it. The following types of missing data may
exist:
• Missing completely at random (MCAR)
• Given a dataset, missing values do not exhibit any pattern and are not dependent on other observed variables in the dataset.
• For example: if the probability of the data missing does not depend on the observed data and also not on any attributes of the
missing data then it is “missing completely at random”. Essentially, nothing has caused the missing data to be missing.
• Deleting records with missing attributes is typically applied when ML work is carried out for data
classification and the class labels themselves are missing.
• The following approaches can be applied:
• Listwise deletion:
• This discards all cases with incomplete information
• ML developers using listwise deletion will remove a case entirely if it is missing a value for one of the variables
included in the analysis.
• Pairwise deletion:
• This method omits cases based on the variables included in the analysis. As a result, analyses may be completed
on subsets of the data depending on where values are missing.
• For example, a case may contain three variables VAR1, VAR2, and VAR3. If it has a missing value for VAR1, this
does not prevent some statistical procedures from using the same case to analyse VAR2 and VAR3 (see the sketch below).
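A small pandas sketch contrasting listwise and pairwise behaviour with the VAR1–VAR3 example; the values are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "VAR1": [1.0, np.nan, 3.0, 4.0],
    "VAR2": [2.0, 2.5, np.nan, 4.5],
    "VAR3": [5.0, 6.0, 7.0, 8.0],
})

# Listwise deletion: any case missing any variable is discarded entirely.
listwise = df.dropna()
print(len(listwise))  # only the two complete cases remain

# Pairwise deletion: each pair of variables uses all cases available for it,
# so the case with VAR1 missing still contributes to the VAR2/VAR3 statistic.
print(df.corr(min_periods=2))
```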
Handling Missing Data: Deletion
• Listwise and pairwise deletion approaches generally have the following problems:
• Deleting an entire row of data with only one missing value, as in the listwise case, removes legitimate and valid
data that could have been used.
• Deleting an entire row assumes that the other values do not depend on, and are not somehow related to,
the missing value. Sometimes the absence of a value has meaning, or there is a correlation between a
missing value and other values in the dataset. In certain cases, other values can also be used
with high confidence to predict the missing value, using various methods covered in this week's discussions.
• In general, listwise and pairwise deletion are most suitable in MCAR cases, that is, when no
predictable pattern exists between the missing values and other values in the dataset. In all
other cases, imputation to replace missing values is the preferred approach.
Handling Missing Data: Imputation
• Model-based Imputation
• This approach finds a predictive model for each target variable in the data set that contains missing values. The model is
fitted on the observed data and subsequently used to generate imputations for the missing values. Several
commonly used imputation methods are special cases of model-based imputation, including mean imputation,
ratio imputation, and regression imputation.
• Examples: linear regression, logistic regression, and random forests (a sketch follows below).
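The sketch below illustrates model-based imputation with scikit-learn: SimpleImputer for the mean-imputation special case, and IterativeImputer for a regression-style model fitted on the observed data; the toy values are made up:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

X = np.array([
    [25.0, 50_000.0],
    [32.0, np.nan],
    [47.0, 81_000.0],
    [np.nan, 62_000.0],
])

# Mean imputation: a simple special case of model-based imputation.
print(SimpleImputer(strategy="mean").fit_transform(X))

# Regression-style imputation: each feature with missing values is modelled
# from the other features, and the fitted model generates the imputations.
print(IterativeImputer(random_state=0).fit_transform(X))
```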
Handling Missing Data: Hot-deck
• Hot-deck imputation
• This is a method for handling missing data in which each missing value is replaced with an observed response from a similar unit.
• Hot-decks only impute values from the original data, including distinctive features of the data that would be "smoothed out" by parametric methods.
• Classic hot-deck methods cannot include continuous variables in the conditioning set and are limited in the number of categorical variables, because the "curse of dimensionality" quickly makes the number of cells large, resulting in imputation from empty or thinly populated cells. They therefore often fail to capture multivariate relationships beyond basic ones such as univariate statistics within cells.
• Note: related sequential approaches are Last Observation Carried Forward (LOCF) and Next Observation Carried Backward (NOCB).
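A hedged sketch of donor-style imputation: scikit-learn's KNNImputer is used here as a stand-in that is close in spirit to hot-deck (it fills each missing value from similar observed records rather than a parametric model), and pandas ffill/bfill illustrate LOCF and NOCB on ordered data; the toy values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Donor-based imputation: each missing value is filled from similar records.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
print(KNNImputer(n_neighbors=2).fit_transform(X))

# LOCF / NOCB on a time-ordered series: carry the last (or next) observed
# value across the gap.
s = pd.Series([10.0, np.nan, np.nan, 14.0])
print(s.ffill())  # Last Observation Carried Forward
print(s.bfill())  # Next Observation Carried Backward
```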
Data Cleaning: Checklist
• Depending on the type of analysis you're doing, you need to accomplish six things in the
cleansing stage:
• Ditch all duplicate records that clog your server space and distort your analysis.
• Remove any rows or columns that aren't relevant to the problem you're trying to solve.
• Investigate, and possibly remove, missing or incomplete information.
• Nip any unwanted outliers you discovered during data exploration.
• Fix structural errors: typography, capitalisation, abbreviation, formatting, extra characters.
• Validate that your work is accurate, consistent, uniform and complete, documenting all tools and
techniques you used.
Ethical Dimension: AI Fairness 360
• Bias and Discrimination: AI systems have raised concerns about bias and discrimination, where
algorithms can unfairly impact certain groups based on sensitive attributes.
• Fairness in ML: Fairness in machine learning is the goal of ensuring AI systems make decisions
without discrimination. Achieving this is challenging due to historical biases in data.
• Fairness Metrics: Various fairness metrics have been developed to quantify bias, helping identify
and measure bias in data and model predictions.
• Bias Mitigation: Researchers are working on techniques to reduce bias in data and model
predictions while maintaining performance.
• Legal and Ethical Implications: Discrimination in AI has legal and ethical consequences,
prompting organizations to prioritize fairness.
• Open-Source Tools: Open-source toolkits like AIF360 help data scientists and engineers assess,
measure, and mitigate bias in AI models.
• Industry Initiatives: Many industries adopt guidelines to ensure fairness and transparency in AI,
recognizing its importance.
Ethical Dimension: AI Fairness 360
• AI Fairness 360 (AIF360) is an open-source toolkit developed by IBM that is designed to help
developers and data scientists mitigate bias and promote fairness in machine learning models.
• The toolkit provides a comprehensive set of algorithms and metrics to help identify and address bias in
data and models, making it a valuable tool for ensuring that machine learning systems are fair and
equitable.
• Some of the key features and components of AIF360 include:
• Bias Metrics: AIF360 provides a variety of fairness metrics that can be used to measure different aspects of bias in
datasets and models, such as disparate impact, disparate mistreatment, and individual fairness.
• Bias Mitigation Algorithms: The toolkit includes a range of algorithms that can be used to reduce bias in datasets
and models. These algorithms aim to adjust the data or model in such a way that fairness is improved without
sacrificing too much predictive accuracy.
• Preprocessing and Postprocessing Tools: AIF360 offers preprocessing techniques to transform data before training
a model and postprocessing techniques to adjust model predictions to achieve fairness.
• Data Exploration and Visualisation: AIF360 provides tools for exploring and visualizing bias in data, which can be
helpful in understanding the sources of bias and making informed decisions about how to address it.
• Extensibility: AIF360 is designed to be extensible, allowing developers to incorporate their own fairness metrics,
bias mitigation algorithms, and custom data preprocessing techniques into the toolkit.
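A minimal AIF360 sketch along the lines of the features above, assuming the toolkit's documented pattern of dataset classes, metric classes, and the Reweighing pre-processing algorithm; the toy data and group definitions are made up, and exact signatures should be checked against the AIF360 version in use:

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# Toy data: "sex" is the protected attribute, "hired" is the binary label.
df = pd.DataFrame({
    "sex":   [1, 1, 1, 0, 0, 0, 1, 0],
    "score": [0.9, 0.7, 0.8, 0.6, 0.4, 0.5, 0.3, 0.7],
    "hired": [1, 1, 1, 0, 0, 1, 0, 0],
})

dataset = BinaryLabelDataset(
    df=df, label_names=["hired"], protected_attribute_names=["sex"]
)
privileged, unprivileged = [{"sex": 1}], [{"sex": 0}]

# Measure bias before mitigation.
metric = BinaryLabelDatasetMetric(
    dataset, privileged_groups=privileged, unprivileged_groups=unprivileged
)
print("Disparate impact:", metric.disparate_impact())
print("Statistical parity difference:", metric.statistical_parity_difference())

# Pre-processing mitigation: reweigh instances to balance group outcomes.
rw = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
transformed = rw.fit_transform(dataset)
```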
Ethical Dimension: AI Fairness 360
Source URL:
https://aif360.res.ibm.com
Ethical Dimension: AI Fairness 360
Source URL:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8843908
Data Pre-processing Pipelines
CRISP-DM Methodology
Source:
https://thinkinsights.net/digital/crisp-dm/
CRISP-DM: Data Preparation
• CRISP-DM Stage 3, often referred to as "data wrangling" or "data scrubbing", generates the final data
set(s) for ML from the initial raw data. Note that data preparation steps are often performed iteratively.
• Drilling down further into the five Stage 3 sub-tasks (a pipeline sketch follows this slide):
• Select data : Ascertain which data sets will be used for ML and set criteria for inclusion/exclusion.
• Clean data: At this stage you correct, impute, or remove erroneous or missing values. Common activities at this stage
include -
• Correcting invalid entries
• Removing or setting rules for ignoring data noise.
• Transforming column values where needed (for example, one-hot encoding converts categorical text values into
binary indicator columns, e.g. True/False into 1 or 0).
• Dealing with missing values.
• Dealing with Outliers.
• Engineer the data (Feature Selection): Extract the features (attributes), or derive new features, that will be useful.
Suppress features that will not be needed. You can also compress data columns and rows by merging features.
• Integrate the data: Create new data sets by combining data from multiple sources.
• Re-Format the data: Re-format data as needed. For example, you might convert string values that store numbers to
numeric values so that you can perform mathematical operations.
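The pipeline sketch referred to above chains several Stage 3 sub-tasks (dealing with missing values, transforming column values, re-formatting scales) with scikit-learn; the file and column names are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data set: numeric and categorical columns with missing values.
df = pd.read_csv("loans_raw.csv")
numeric_cols = ["income", "age"]
categorical_cols = ["employment_type", "home_owner"]

numeric_prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # deal with missing values
    ("scale", StandardScaler()),                    # re-format to comparable scales
])
categorical_prep = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # transform text to numbers
])

preprocess = ColumnTransformer([
    ("numeric", numeric_prep, numeric_cols),
    ("categorical", categorical_prep, categorical_cols),
])

X_prepared = preprocess.fit_transform(df)
```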
CRISP-DM: Feature Engineering
• Feature Engineering (FE) is a process where domain knowledge of the data is used to create
additional relevant features that increase the predictive power of the ML algorithms and models.
• The pre-processing pipeline may additionally incorporate data transformational stages that
feature:
• Binning
• Converts continuous numeric values into categories or groups of values termed bins, which can then be encoded as dummy variables.
• Normalisation
• Re-scales the range of values for a given feature into a set range with a specified minimum and maximum values,
such as [0, 1] or [−1, 1].
• Standardisation
• Converts the data distribution to a standard normal distribution with a mean of zero and a standard deviation (σ)
of one.
• Standardisation is usually recommended when preparing data for Support Vector Machine (SVM), Principal
Component Analysis (PCA), and k-Nearest Neighbours (k-NN) ML model building.
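A short sketch of the three transformations, assuming pandas and scikit-learn; the age values and bin edges are made up:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = pd.Series([18, 22, 35, 41, 58, 63, 79])

# Binning: continuous ages into categorical groups (bins).
bins = pd.cut(ages, bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])
print(bins.value_counts())

x = ages.to_numpy(dtype=float).reshape(-1, 1)

# Normalisation: rescale into a fixed range such as [0, 1].
print(MinMaxScaler(feature_range=(0, 1)).fit_transform(x).ravel())

# Standardisation: zero mean and unit standard deviation.
print(StandardScaler().fit_transform(x).ravel())
```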
Alternative Approaches to CRISP-DM