
CP70066E Machine Learning

Fundamentals of Data Source and


Preparation for ML

Professor Jonathan Loo


Chair in Computing and Engineering
School of Engineering and Computing
University of West London
Lesson Outline

• Data Sourcing and Representativeness
• Selecting Data Sources for ML
• Sampling and Data Representativeness
• Addressing Sampling Bias Issues
• Assuring Data Quality
• Accuracy, Consistency and Uniformity Considerations
• Completeness and Validity Considerations
• Dealing with Noisy and Imperfect (Dirty) Data
• Handling Missing Data – Basic Data Imputation/Replacement Methods
• Handling Missing Data – Other Techniques: Hot-Deck Imputation and Listwise/Pairwise-Deletion Approaches
• Ethical Dimension: AI Fairness 360
• Pre-processing Pipelines for Data Cleansing
• CRISP-DM Formalism
• SEMMA and KDD Approaches
• Python Programming Toolsets for Coding Pre-Processing Stages
Data Sourcing and
Representation
Data Selection and Preparation for ML

• It is essential that data selection and preparation be carried out in a structured manner.
• For ML work, these are typical steps for data selection and preparation:
• Formulate and specify the data sourcing requirements.
• Identify suitable sources of data for the ML system
• Specify data collection processes.
• Determine how to format source data to make it consistent.
• Reduce the volume and dimensions of data, if possible; carry out attribute and record sampling to decide
which data attributes add value to the ML model
• Carry out data scaling if required, using decimal or min-max scaling methods
• Prime the data cleansing system
• Commence ML work by ingesting the data
Data Selection and Preparation for ML

• A taxonomy of data collection actions for working with new or existing ML data can be presented
as shown.
Data Selection and Preparation for ML

• For developing ML models, there are publicly available datasets from a number of organisations
such as:
• Kaggle Datasets
• Amazon Datasets (Registry of Open Data on AWS)
• UCI Machine Learning Repository
• Yahoo WebScope
• Datasets subreddit
• Google's Datasets Search Engine
• Facebook
• Twitter

• These can be used to train supervised and semi-supervised learning ML systems for a variety of
contexts.
Sourcing online data

• API (Application Programming Interface): using a prebuilt set of functions developed by a
company to access its services. Often pay-to-use. For example: Google Maps API, Facebook API,
Twitter API.
• RSS (Rich Site Summary): summarises frequently updated online content in a standard format.
Free to read if the site provides one. For example: news-related sites, blogs.
• Web scraping: using software, scripts, or manual extraction to pull data from what is displayed on a
page or what is contained in the HTML file (often in tables).
  • This method is used when sites don't offer APIs for accessing data, publish RSS feeds, or expose databases.

Twitter Intelligence Tool
https://github.com/twintproject/twint

Open Source Intelligence Tool
https://github.com/akat12/Open-source-Intelligence
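
• As an illustration of the web-scraping route, the sketch below parses an HTML table into a pandas DataFrame with BeautifulSoup. The inline HTML snippet is an invented placeholder; in practice the HTML would come from a live page, and a site's terms of use should always be checked before scraping.

```python
# Hedged web-scraping sketch: parse an HTML table with BeautifulSoup.
# In practice the HTML would be fetched from a page, e.g.:
#   html = requests.get("https://example.com/stats.html", timeout=10).text
import pandas as pd
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>country</th><th>population</th></tr>
  <tr><td>UK</td><td>67000000</td></tr>
  <tr><td>France</td><td>68000000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.find("table").find_all("tr")]

df = pd.DataFrame(rows[1:], columns=rows[0])   # first row used as the header
print(df)
```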
Data Sampling and Representativeness

• Bias is systematic error in the data sample. When manually creating representative data
samples for the ML system, consider using the following sampling techniques to reduce bias:
• Simple random sampling - a probabilistic approach where each dataset item from the population has an
equal probability of being selected
• Systematic sampling - sample selections are made using specified criteria that preserve
representativeness
• Stratified sampling - Each sample group preserves the compositional strata in the population
• Cluster sampling - creates groups of sampled items in a pre-specified manner

• Additionally, ensure that samples have sufficient size – typically sample size should be a
minimum of the square root of the population size to minimise sampling error or bias.
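
• A minimal sketch of simple random and stratified sampling is shown below; the toy data, the 'segment' column used for stratification, and the use of the square-root rule of thumb from the slide are illustrative assumptions.

```python
# Sketch: simple random vs. stratified sampling (toy data; 'segment' column assumed).
import math
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "customer_id": range(1, 101),
    "segment": ["retail"] * 70 + ["business"] * 30,   # 70/30 split in the population
})

# Rule of thumb from the slide: sample at least sqrt(N) records
n = max(int(math.sqrt(len(df))), 2)

# Simple random sampling: every record has an equal chance of selection
simple_sample = df.sample(n=n, random_state=42)

# Stratified sampling: preserve the 70/30 'segment' proportions in the sample
stratified_sample, _ = train_test_split(
    df, train_size=n, stratify=df["segment"], random_state=42)

print(stratified_sample["segment"].value_counts(normalize=True))
```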
Exploratory Data
Analysis (EDA)
Data Science Process

Source: https://purnasaigudikandula.medium.com/exploratory-data-analysis-beginner-univariate-bivariate-and-
What is EDA?

• Exploratory data analysis (EDA) is used by data scientists to analyse and investigate data sets
and summarise their main characteristics, often employing data visualisation methods.
• It helps determine how best to manipulate data sources to get the answers you need, making it
easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check
assumptions.
• EDA is primarily used to see what data can reveal beyond the formal modelling or hypothesis-
testing task, and it provides a better understanding of data set variables and the
relationships between them.
• It can also help determine if the statistical techniques you are considering for data analysis are
appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA
techniques continue to be a widely used method in the data discovery process today.
Why is EDA important in ML?

• The main purpose of EDA is to help look at data before making any assumptions. It can help
identify obvious errors, better understand patterns within the data, detect outliers or
anomalous events, and find interesting relations among the variables.
• Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals.
• EDA also helps stakeholders by confirming they are asking the right questions.
• EDA can help answer questions about standard deviations, categorical variables, and confidence
intervals.
• Once EDA is complete and insights are drawn, its features can then be used for more
sophisticated data analysis or modelling, including machine learning.
EDA tools

• Specific statistical functions and techniques you can perform with EDA tools include:
• Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional
data containing many variables.
• For example, K-means clustering method in unsupervised learning where data points are assigned into K groups,
i.e. the number of clusters, based on the distance from each group’s centroid. It is commonly used in market
segmentation, pattern recognition, and image compression.
• Univariate visualisation of each field in the raw dataset, with summary statistics.
• Bivariate visualisations and summary statistics that allow you to assess the relationship between each
variable in the dataset and the target variable you’re looking at.
• Multivariate visualisations, for mapping and understanding interactions between different fields in the
data.
• Predictive models, such as linear regression, use statistics and data to predict outcomes.
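
• The sketch below illustrates a few of these EDA operations with pandas and matplotlib; the toy 'age' and 'income' columns are assumed purely for illustration.

```python
# Sketch of basic univariate, bivariate and multivariate EDA (toy columns assumed).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(18, 70, 200),
                   "income": rng.normal(30000, 8000, 200)})

print(df.describe())          # univariate non-graphical: summary statistics
print(df.isna().sum())        # missing values per column

df["age"].hist(bins=20)       # univariate graphical: histogram
plt.xlabel("age")
plt.show()

df.plot.scatter(x="age", y="income")   # bivariate graphical: scatter plot
plt.show()

print(df.corr())              # multivariate non-graphical: correlation matrix
```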
Types of EDA

• Types of EDA:
• Univariate non-graphical: This is the simplest form of data analysis, where the data being analysed consists
of just one variable. Since it's a single variable, it doesn't deal with causes or relationships. The main
purpose of univariate analysis is to describe the data and find patterns that exist within it.
• Univariate graphical: Non-graphical methods don't provide a full picture of the data, so graphical methods
are also required. Common types of univariate graphics include stem-and-leaf plots, histograms and box plots.
• Multivariate non-graphical: Multivariate data arises from more than one variable. Multivariate non-
graphical EDA techniques generally show the relationship between two or more variables of the data
through cross-tabulation or statistics.
• Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or more sets
of data. The most used graphic is a grouped bar plot or bar chart, with each group representing one level
of one of the variables and each bar within a group representing the levels of the other variable.
Types of EDA

• Examples of multivariate graphics include:


• Scatter plot: which is used to plot data points on a horizontal and a vertical axis to show how much one
variable is affected by another.
• Multivariate chart: which is a graphical representation of the relationships between factors and a
response.
• Histograms: a bar plot in which each bar represents the frequency (count) or proportion (count/total
count) of cases for a range of values.
• Box plots: which graphically depict the five-number summary of minimum, first quartile, median, third
quartile, and maximum.
• Heat map: which is a graphical representation of data where values are depicted by colour.
• Run chart: which is a line graph of data plotted over time.
• Bubble chart: which is a data visualization that displays multiple circles (bubbles) in a two-dimensional
plot.
• Stem-and-leaf plots: show all data values and the shape of the distribution.
Data Quality Issues
Data Quality: Validity, Accuracy, Consistency, Uniformity and
Completeness
Validity is the degree to which the data conform to defined business rules or constraints.

• Data-Type Constraints
  • Values in a particular column must be of a particular datatype, e.g., boolean, numeric, date, etc.
• Range Constraints
  • Typically, numbers or dates should fall within a certain range.
• Mandatory Constraints
  • Certain columns cannot be empty.
• Unique Constraints
  • A field, or a combination of fields, must be unique across a dataset.
• Set-Membership Constraints
  • Values of a column come from a set of discrete values, e.g. enum values. For example, a person's gender may be recorded as male or female.
• Foreign-Key Constraints
  • As in relational databases, a foreign key column can't have a value that does not exist in the referenced primary key.
• Regular Expression Patterns
  • Text fields that have to be in a certain pattern. For example, phone numbers may be required to have the pattern (999) 999-9999.
• Cross-Field Validation
  • Certain conditions that span across multiple fields must hold. For example, a customer's insurance validity date cannot be earlier than the date of purchase of the insurance.
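
• A hedged sketch of how a few such validity constraints might be checked with pandas is shown below; the toy data and the column names (age, email, customer_id, phone, valid_from, purchased_on) are assumptions made for illustration.

```python
# Sketch of checking a few validity constraints with pandas (toy data, assumed columns).
import pandas as pd

df = pd.DataFrame({
    "customer_id":  [1, 2, 2, 4],
    "age":          [34, 150, 28, 41],
    "email":        ["a@x.com", None, "c@x.com", "d@x.com"],
    "phone":        ["(020) 794-6000", "12345", "(016) 123-4567", "(020) 555-0100"],
    "purchased_on": ["2021-01-10", "2021-02-01", "2021-03-05", "2021-04-01"],
    "valid_from":   ["2021-01-10", "2021-01-01", "2021-03-05", "2021-04-02"],
})

bad_age = df[~df["age"].between(0, 120)]                    # range constraint
missing_email = df[df["email"].isna()]                      # mandatory constraint
dup_ids = df[df["customer_id"].duplicated(keep=False)]      # unique constraint
bad_phone = df[~df["phone"].str.match(r"^\(\d{3}\) \d{3}-\d{4}$")]   # regex pattern
bad_dates = df[pd.to_datetime(df["valid_from"]) <            # cross-field validation
               pd.to_datetime(df["purchased_on"])]
```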
Data Quality: Validity, Accuracy, Consistency, Uniformity and
Completeness

Accuracy
• This is the degree to which the ML data held represents true or correct values.
• Note that having valid data does not necessarily indicate that it is accurate.
• For example:
  • A street address may be valid but may in fact not be the correct address for the candidate, resulting in inaccuracy.

Consistency
• This indicates the degree to which the ML data is consistent, within the same data set or across multiple data sets.
• Inconsistency occurs when two or more values in the data set contradict each other.
• For example:
  • An individual's age of, say, 8 years would not be expected to correspond to a married marital status.
  • A customer is recorded in two different tables with two different addresses.
Data Quality: Validity, Accuracy, Consistency, Uniformity and
Completeness

Uniformity
• This is the degree to which measured data is specified using the same units of measure.
• For example:
  • The weight may be recorded either in pounds or kilos.
  • The date might follow the USA format or the European format.
  • The currency is sometimes in USD and sometimes in GBP.
• To ensure uniformity, the data must be converted to a single unit of measure in every case.

Completeness
• This is the degree to which all of the required data needed to build the ML model is known to be available for sampling. Missing data impacts the completeness of data.
• An obvious way to counteract this is to try to ensure that data collection methods have no gaps or missing entries when collecting the data.
• However, this is not always possible, and for this reason we will discuss techniques for addressing and mitigating the effects of missing data on ML model building later in this lecture.
Data Cleaning

• The data cleaning workflow is a sequence of steps aiming at producing high-quality data and
taking into account all the criteria we have discussed so far.
• Inspection: Detect unexpected, incorrect, and inconsistent data
• Cleaning: Fix or remove the anomalies discovered
• Verifying: After cleaning, the results are inspected to verify correctness
• Reporting: A report about the changes made and the quality of the currently stored data is recorded

• It may seem a sequential process; in fact, data cleaning is an iterative, open-ended process. One
can go back from verifying to inspection when new flaws are detected.

Source: The Ultimate Guide to Data Cleaning


https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
Data Cleaning: Inspection

Inspecting the data is time-consuming and requires using many methods for exploring the underlying data for error
detection.

Data profiling
• Summary statistics about the data that help give a general idea about the quality of the data.
• For example:
  • Does a particular column conform to particular standards or patterns?
  • Is the data column recorded as a string or a number?
  • How many values are missing?
  • How many unique values are there in a column, and what is their distribution?
  • Is this data set linked to, or does it have a relationship with, another?

Visualisations
• By analysing and visualising the data using statistical measures such as mean, standard deviation, range, or quantiles, one can find values that are unexpected and thus erroneous.
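
• A minimal data-profiling sketch along these lines, using pandas (the toy data and column names are assumed):

```python
# Quick data-profiling sketch for the inspection step (toy data; columns assumed).
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 31, np.nan, 40],
                   "status": ["single", "married", "married", None]})

print(df.dtypes)                      # string, number, date ...?
print(df.isna().sum())                # how many values are missing?
print(df.nunique())                   # how many unique values per column?
print(df.describe(include="all"))     # summary statistics / distribution
print(df["status"].value_counts())    # distribution of a categorical column
```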
Data Cleaning: Cleaning

Data cleaning involves different techniques depending on the problem and the data type. Different methods can be applied,
each with its own trade-offs. Overall, incorrect data is either removed, corrected, or imputed.

Irrelevant data
• Data that are not actually needed and don't fit the context of the problem we are trying to solve.
• For example:
  • If we were analysing data about the general health of the population, unique data such as IDs, addresses and phone numbers wouldn't be necessary (column-wise).
  • Similarly, if you were interested in only one particular area (or year), you wouldn't want to include all other areas (years) (row-wise).
• Only if you are sure that a piece of data is unimportant may you drop it. Otherwise, explore the correlation matrix between feature variables.

Duplications
• Duplicates are data points that are repeated in your dataset.
• For example:
  • They often arise when data are combined from different sources.
  • A user may hit the submit button twice, thinking the form wasn't actually submitted.
  • A request to an online booking system was submitted twice to correct wrong information entered accidentally the first time.
• A common symptom is two users having the same identity number, or the same article being scraped twice.
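
• A short sketch of removing irrelevant columns/rows and duplicate records with pandas; the toy health-survey data and its column names are hypothetical.

```python
# Sketch of dropping irrelevant columns/rows and duplicates (toy data; columns assumed).
import pandas as pd

df = pd.DataFrame({
    "id":     [1, 2, 3, 4],
    "phone":  ["111", "222", "222", "333"],
    "region": ["London", "Leeds", "Leeds", "London"],
    "bmi":    [22.5, 27.1, 27.1, 24.3],
})

df = df.drop(columns=["id", "phone"])   # column-wise irrelevance: identifiers not needed
df = df.drop_duplicates()               # remove repeated records (rows 2 and 3 are identical)
df = df[df["region"] == "London"]       # row-wise irrelevance: keep only the area of interest
print(df)
```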
Data Cleaning: Cleaning

Missing values (MV) are often unavoidable, which leaves us with the question of what to do about them. There are three, or perhaps more, ways to deal with them:

Drop
• If the MV in a column rarely happen and occur at random, we can simply drop the observations (rows) that have MV.
• If most of a column's values are missing and occur at random, we normally drop the whole column.
• Dropping MV is particularly useful when doing statistical analysis, since filling in the missing values may yield unexpected or biased results.

Impute
• This means calculating the MV based on other observations:
  • Statistical values: such as the mean and median (for real or continuous values), and the mode (for integer or categorical values).
  • Linear regression: one can calculate the best-fit line between two variables based on the existing data.
  • Hot-deck: copying values from other, similar records. This is only useful if you have enough available data, and it can be applied to numerical and categorical data.

Flag
• Filling in the MV leads to a loss of information, no matter what imputation method is used.
• Therefore, flagging MV, e.g. as NaN, allows them to be handled separately if required.
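
• The drop/flag/impute options above might look as follows in pandas; the toy survey data and its columns are assumed for illustration.

```python
# Sketch of the drop / flag / impute options for missing values (toy data assumed).
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [32000, np.nan, 41000, np.nan, 38000],
                   "gender": ["F", "M", np.nan, "F", "F"]})

# Flag: remember which values were missing before any filling is done
df["income_was_missing"] = df["income"].isna()

# Drop: remove rows (or whole columns) where values are missing
rows_dropped = df.dropna(subset=["income"])

# Impute: median for a continuous column, mode for a categorical column
df["income"] = df["income"].fillna(df["income"].median())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
print(df)
```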
Data Cleaning: Verifying and Reporting

Verifying
• When done, one should verify correctness by re-inspecting the data and making sure the rules and constraints do hold.
• For example, after filling in the missing data, the data might still violate some of the rules and constraints.
• It might involve some manual correction if not possible otherwise.

Reporting
• Reporting how healthy the data is, is equally important to cleaning.
• Use software/libraries to generate reports of the changes made, which rules were violated and how many times, and log the violations.
• Finally, no matter how robust the validation and cleaning process is, one will continue to encounter issues as new data come in; the reporting process should help to understand why they happen in the first place.
Handling Missing Data

• Handling missing data values (data cleansing) is a key element of the data preparation phase of
the CRISP-DM methodology.
• Missing data can occur due to a variety of factors and is dependent on the source of the data.
• Examples include data entry errors, equipment failure, poor user interface design not enforcing required
fields or computer bugs to name a few.

• Missing data can impact the correctness and validity of any discovery obtained from it, a
consequence that can be succinctly described by the principle garbage in, garbage out.
• The quality of ML inference output is related directly to how much data is not missing.
• Generally, for large datasets, a missing rate of 5% can be considered inconsequential while anything
over 10% can lead to biases.
Handling Missing Data

• Before considering how missing data is best handled, it is useful to understand what the mechanism of missingness in
the data is, so that the appropriate method can be chosen to deal with it. The following types of missing data may
exist:
• Missing completely at random (MCAR)
• Given a dataset, missing values do not exhibit any pattern and are not dependent on other observed variables in the dataset.
• For example: if the probability of the data missing does not depend on the observed data and also not on any attributes of the
missing data then it is “missing completely at random”. Essentially, nothing has caused the missing data to be missing.

• Missing at random (MAR)
  • Given a dataset, the probability of data being missing depends on a pattern of observed variables within the dataset, and not on the missing values themselves.
  • For example: males may be less likely to admit to having alcohol problems; the missingness depends on sex (an observed variable) rather than on the actual level of alcohol consumption.

• Missing not at random (MNAR)
  • Given a dataset, the probability of data being missing depends on the missing values themselves (possibly in addition to the observed data).
  • For example:
    • High-income earners are less likely to report their annual earnings. Essentially, if data is missing based purely on the value of the attribute itself, then it is "not missing at random".
    • MNAR may also occur when data collection records only entries within a certain range, say between 1 and 10. Values outside the specified range of 1 to 10 will be missing, but these values are not missing due to randomness.
Handling Missing Data

• The process of filling in missing values is known as imputation. Knowing how to correctly fill in missing data is essential for producing accurate ML predictions.

• Filling missing values with predefined constants
  • Using constants such as "NA" or "UNKNOWN" is straightforward. However, the ML algorithm may try to group records sharing the constant value, treating it as a meaningful pattern with correlation.

• Filling missing values with measures of central tendency (mean, mode or median)
  • This is a statistical approach that replaces missing values using the mean if the data distribution is symmetric, the median if the dataset is moderately skewed, or the mode (the most frequently occurring value) if the dataset is heavily skewed or pattern-less.

• Filling missing values using the highest probabilistic values
  • The highest probabilistic value may be determined using inferential statistical methods such as regression, which can infer or predict values using probabilities.
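
• One possible way to apply these strategies is scikit-learn's SimpleImputer, sketched below on a small toy data frame; the data and the choice of strategies are illustrative assumptions.

```python
# Sketch using scikit-learn's SimpleImputer for the strategies above (toy data assumed).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                   "city": ["Leeds", np.nan, "London", "London"]})

# Predefined constant, e.g. "UNKNOWN", for a categorical field
const_imp = SimpleImputer(strategy="constant", fill_value="UNKNOWN")
df["city"] = const_imp.fit_transform(df[["city"]]).ravel()

# Central tendency: median for the numeric field ("mean" or "most_frequent" also available)
median_imp = SimpleImputer(strategy="median")
df["age"] = median_imp.fit_transform(df[["age"]]).ravel()
print(df)
```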
Handling Missing Data: Deletion

• Deleting records with missing attributes is applied when ML work is carried out for data
classification, and the class labels are missing.
• The following approaches can be applied:
• Listwise deletion:
• This discards all cases with incomplete information
• ML developers using listwise deletion will remove a case entirely if it is missing a value for one of the variables
included in the analysis.
• Pairwise deletion:
• This method omits cases based on the variables included in the analysis. As a result, analyses may be completed
on subsets of the data depending on where values are missing.
• A case may contain three variables VAR1, VAR2, and VAR3. A case may have a missing value for VAR1, but this
does not prevent some statistical procedures from using the same case to analyse variables VAR2 and VAR3.
Handling Missing Data: Deletion

• Listwise and pairwise deletion approaches generally have the following problems:
• Deleting an entire row of data with only one missing value, as in the listwise case, removes legitimate and valid
data that could have been used.
• Deleting an entire row assumes that the other values do not depend on, or are not somehow
related to, the missing value. Sometimes the absence of a value has meaning, or there is a correlation
between a missing value and other values in the dataset. In certain cases, other values can also be used
with high confidence to predict the missing value, using the various methods discussed this week.

• In general, listwise and pairwise deletion is most suitable in MCAR cases. That is, when no
predictable pattern exists between the missing values and other values in the dataset. In all
other cases, imputation to replace missing values is the preferred approach.
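
• A small sketch contrasting the two deletion approaches in pandas (the toy VAR1 to VAR3 data is invented): dropna() implements listwise deletion, while DataFrame.corr() computes statistics on pairwise-complete observations by default.

```python
# Sketch contrasting listwise and pairwise deletion with pandas (toy data assumed).
import numpy as np
import pandas as pd

df = pd.DataFrame({"VAR1": [1.0, np.nan, 3.0, 4.0],
                   "VAR2": [2.0, 5.0, np.nan, 8.0],
                   "VAR3": [1.5, 2.5, 3.5, 4.5]})

# Listwise deletion: any case with a missing value is discarded entirely
listwise = df.dropna()

# Pairwise deletion: each pairwise correlation uses all cases complete for that pair
pairwise_corr = df.corr()
print(listwise)
print(pairwise_corr)
```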
Handling Missing Data: Imputation

• Imputation methods can generally be categorised as:

• Donor-based imputation
  • Fills in the missing values for a given unit by copying observed values from another unit, the donor. Typically, the donor is chosen in such a way that it resembles the imputed unit as much as possible on one or more background characteristics. The rationale is that if the two units match (exactly or approximately) on a number of relevant auxiliary variables, it is likely that their scores on the target variable will also be similar.
  • Examples: predictive mean matching (PMM), hot-deck and nearest-neighbour imputation.

• Model-based imputation
  • Finds a predictive model for each target variable in the data set that contains missing values. The model is fitted on the observed data and subsequently used to generate imputations for the missing values. Several commonly used imputation methods are special cases of model-based imputation, including mean imputation, ratio imputation, and regression imputation.
  • Examples: linear regression, logistic regression, and random forests.
Handling Missing Data: PMM

• Predictive Mean Matching (PMM)


• This is a variant of the standard hot deck method that has been found to perform well in practice. It
forms donor pools based on the distance between the predicted means from a regression of the missing
variable on some covariates.
• PMM is a form of matching in which the missing values are replaced by a draw from observations with
similar predicted values of the missing variable.
• The main advantage of methods like PMM is that they allow more flexible conditioning sets and thereby
capture the relations between variables better, but they make more parametric assumptions which are
often chosen arbitrarily because specification tests are not available.
Handling Missing Data: Hot-deck

• Hot-deck imputation
• This is a method for handling missing data in
which each missing value is replaced with an
observed response from a similar unit.
• Hot-decks only impute values from the original
data, including distinctive features of the data that
would be "smoothed out" by parametric methods.
• Classic hot-deck methods cannot include continuous variables in the conditioning set and are limited in the number of categorical variables, because the "curse of dimensionality" quickly makes the number of cells large, which results in imputation from empty or thinly populated cells. They therefore often fail to capture multivariate relationships beyond basic ones such as univariate statistics within cells.

Note: common hot-deck variants include Last Observation Carried Forward (LOCF) and Next Observation Carried Backward (NOCB), described next.
Handling Missing Data: Hot-deck

• Last observation carried forward (LOCF)


• This method is a widely used implementation of the hot-deck method where the missing attribute value
is filled in with the last known attribute. This approach strongly assumes that the value is unchanged
from the last value which may be a pitfall in certain cases. The advantage of this type is that it's easy to
understand and communicate.

• Next observation carried backward (NOCB)


• This method fills in the missing value with the next known value. The approach strongly assumes that
the missing value is the same as the following value which can be a demerit of the technique in some
cases. As with LOCF, NOCB is fairly straightforward to implement.
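
• In pandas, LOCF and NOCB correspond to forward fill and backward fill; the toy series below is assumed purely for illustration.

```python
# Sketch of LOCF and NOCB using pandas forward/backward fill (toy series assumed).
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 14.0, np.nan, 18.0])

locf = s.ffill()   # Last Observation Carried Forward: propagate the last known value
nocb = s.bfill()   # Next Observation Carried Backward: use the next known value
print(locf)
print(nocb)
```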
Data Quality in Layman’s Terms

• Depending on the type of analysis you’re doing, you need to accomplish six things in the
cleansing stage:
• Ditch all duplicate records that clog your server space and distort your analysis.
• Remove any rows or columns that aren't relevant to the problem you're trying to solve.
• Investigate and possibly remove missing or incomplete info.
• Nip any unwanted outliers you discovered during data exploration.
• Fix structural errors: typography, capitalisation, abbreviation, formatting, extra characters.
• Validate that your work is accurate, consistent, uniform and complete, documenting all tools and
techniques you used.
Ethical Dimension: AI Fairness 360

• Bias and Discrimination: AI systems have raised concerns about bias and discrimination, where
algorithms can unfairly impact certain groups based on sensitive attributes.
• Fairness in ML: Fairness in machine learning is the goal of ensuring AI systems make decisions
without discrimination. Achieving this is challenging due to historical biases in data.
• Fairness Metrics: Various fairness metrics have been developed to quantify bias, helping identify
and measure bias in data and model predictions.
• Bias Mitigation: Researchers are working on techniques to reduce bias in data and model
predictions while maintaining performance.
• Legal and Ethical Implications: Discrimination in AI has legal and ethical consequences,
prompting organisations to prioritise fairness.
• Open-Source Tools: Open-source toolkits like AIF360 help data scientists and engineers assess,
measure, and mitigate bias in AI models.
• Industry Initiatives: Many industries adopt guidelines to ensure fairness and transparency in AI,
recognizing its importance.
Ethical Dimension: AI Fairness 360

• AI Fairness 360 (AIF360) is an open-source toolkit developed by IBM that is designed to help
developers and data scientists mitigate bias and promote fairness in machine learning models.
• The toolkit provides a comprehensive set of algorithms and metrics to help identify and address bias in
data and models, making it a valuable tool for ensuring that machine learning systems are fair and
equitable.
• Some of the key features and components of AIF360 include:
• Bias Metrics: AIF360 provides a variety of fairness metrics that can be used to measure different aspects of bias in
datasets and models, such as disparate impact, disparate mistreatment, and individual fairness.
• Bias Mitigation Algorithms: The toolkit includes a range of algorithms that can be used to reduce bias in datasets
and models. These algorithms aim to adjust the data or model in such a way that fairness is improved without
sacrificing too much predictive accuracy.
• Preprocessing and Postprocessing Tools: AIF360 offers preprocessing techniques to transform data before training
a model and postprocessing techniques to adjust model predictions to achieve fairness.
• Data Exploration and Visualisation: AIF360 provides tools for exploring and visualizing bias in data, which can be
helpful in understanding the sources of bias and making informed decisions about how to address it.
• Extensibility: AIF360 is designed to be extensible, allowing developers to incorporate their own fairness metrics,
bias mitigation algorithms, and custom data preprocessing techniques into the toolkit.
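
• A minimal sketch of how AIF360's metrics and a pre-processing mitigation algorithm might be invoked is given below; the toy data, the choice of "sex" as the protected attribute, and the privileged/unprivileged group definitions are assumptions made for illustration only.

```python
# Hedged AIF360 sketch: measure bias, then reweight the training data (toy data assumed).
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

df = pd.DataFrame({"feature": [0.2, 0.5, 0.9, 0.4],
                   "sex":     [0,   1,   1,   0],   # assumed protected attribute
                   "label":   [0,   1,   1,   0]})

dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["sex"])

priv = [{"sex": 1}]      # assumed privileged group
unpriv = [{"sex": 0}]    # assumed unprivileged group

metric = BinaryLabelDatasetMetric(dataset, privileged_groups=priv,
                                  unprivileged_groups=unpriv)
print(metric.disparate_impact())               # bias metric
print(metric.statistical_parity_difference())  # bias metric

# Pre-processing mitigation: reweight instances to balance the groups
rw = Reweighing(unprivileged_groups=unpriv, privileged_groups=priv)
dataset_transf = rw.fit_transform(dataset)
print(dataset_transf.instance_weights)
```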
Ethical Dimension: AI Fairness 360

• AIF360 has been widely used in the machine


learning community and is considered an important
resource for researchers and practitioners working
on fairness and ethics in AI.
• It helps address the challenges of bias and
discrimination that can arise when deploying
machine learning models in real-world applications,
such as lending, hiring, and criminal justice.
• By using AIF360, developers can work toward
creating more fair and equitable AI systems.

Source URLs:
https://aif360.res.ibm.com
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8843908
Data Pre-processing Pipelines
CRISP-DM Methodology

• Recall the Cross-Industry Standard Process for Data Mining (CRISP-DM)
methodology, introduced in our last lesson
when discussing ML deployment
approaches; it offers a structured approach
to deploying ML models and projects.
• CRISP-DM was conceived in 1996 and
became a European Union project under
the ESPRIT funding initiative in 1997. The
project was led by five companies:
Integral Solutions Ltd (ISL), Teradata,
Daimler AG, NCR Corporation and OHRA,
an insurance company.

Source:
https://thinkinsights.net/digital/crisp-dm/
CRISP-DM Methodology
CRISP-DM: Data Preparation

• CRISP-DM Stage 3, often referred to as "data wrangling" or "data scrubbing", helps generate the final data
set(s) for ML from the initial raw data. Note that data preparation stages are often performed iteratively.
• Drilling down further into the five Stage 3 sub-tasks:
• Select data : Ascertain which data sets will be used for ML and set criteria for inclusion/exclusion.
• Clean data: At this stage you correct, impute, or remove erroneous or missing values. Common activities at this stage
include -
• Correcting invalid entries
• Removing or setting rules for ignoring data noise.
• Transforming column values where needed (for example, one-hot encoding converts text values into predefined numbers,
e.g. True/False into 1 or 0).
• Dealing with missing values.
• Dealing with Outliers.
• Engineer the data (feature selection): Extract the features (attributes), or derive new features, that will be useful.
Suppress features that will not be needed. You can also compress data columns and rows by merging features.
• Integrate the data: Create new data sets by combining data from multiple sources.
• Re-Format the data: Re-format data as needed. For example, you might convert string values that store numbers to
numeric values so that you can perform mathematical operations.
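
• One way the cleaning, encoding and re-formatting sub-tasks above can be wired together is a scikit-learn preprocessing pipeline, sketched below; the column names and the specific steps chosen are assumptions for illustration.

```python
# Sketch of a scikit-learn preprocessing pipeline covering cleaning, encoding and
# re-formatting sub-tasks (column names and step choices are illustrative assumptions).
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [32000, 41000, np.nan, 38000],
                   "region": ["London", "Leeds", np.nan, "London"]})

numeric_prep = Pipeline([("impute", SimpleImputer(strategy="median")),
                         ("scale", StandardScaler())])
categorical_prep = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                             ("encode", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([("num", numeric_prep, ["age", "income"]),
                                ("cat", categorical_prep, ["region"])])

X_clean = preprocess.fit_transform(df)   # imputed, scaled, one-hot encoded feature matrix
print(X_clean)
```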
CRISP-DM: Feature Engineering

• Feature Engineering (FE) is a process where domain knowledge of the data is used to create
additional relevant features that increase the predictive power of the ML algorithms and models.

• The pre-processing pipeline may additionally incorporate data transformational stages that
feature:
• Binning
  • Converts continuous numeric values into categories or groups of values termed bins (also known as dummy variables).

• Normalisation
• Re-scales the range of values for a given feature into a set range with a specified minimum and maximum values,
such as [0, 1] or [−1, 1].

• Standardisation
• Converts the data distribution to a standard normal distribution with a mean of zero and a standard deviation (σ)
of one.
• Standardisation is usually recommended when preparing data for Support Vector Machines (SVM), Principal
Component Analysis (PCA), and k-Nearest Neighbours (k-NN) ML model building.
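
• A short sketch of binning, normalisation and standardisation in Python; the toy data and bin edges are assumed.

```python
# Sketch of binning, normalisation and standardisation (toy data and bin edges assumed).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [18, 25, 37, 52, 64],
                   "income": [21000, 30000, 45000, 80000, 62000]})

# Binning: convert continuous values into categories (bins)
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                        labels=["young", "middle", "older"])

# Normalisation: rescale income to the range [0, 1]
df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Standardisation: zero mean and unit standard deviation
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()
print(df)
```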
Alternative Approaches to CRISP-DM

• SEMMA (Sample, Explore, Modify, Model, Assess)
  • Refers to the process of conducting an ML project starting with the sampling phase and ending with the model evaluation phase.

• KDD (Knowledge Discovery in Databases)
  • Has five phases: selection, pre-processing, transformation, data mining and interpretation/evaluation, as shown in the diagram. KDD processes may be run iteratively, depending on the ML project.

• Both KDD and SEMMA focus on the more technical stages of data preparation and are thus generally easier to use.

Source:
https://www.semanticscholar.org/paper/KDD%2C-SEMMA-and-CRISP-DM%3A-a-parallel-overview-Azevedo-Santos/6bc30ac3f23d43ffc2254b0be24ec4217cf8c845
End of Lesson
