INTRODUCTION
In recent years, data acquisition systems have shown great potential for solving tasks
efficiently in different areas such as manufacturing industries, video processing, agriculture,
imaging, meteorological systems, and remote laboratories. In research, it is often necessary to
acquire a considerable number of samples manually using electronic devices. During
these processes, the measurements can be affected by errors and information loss that
compromise the analysis. Laboratory devices with improved software for
monitoring and control can help reduce time and obtain samples with good
accuracy.
Data acquisition means gathering or acquiring data from different sources; such data is
classified into primary data and secondary data. Primary data is collected by the
individual or organization doing the analysis. It is acquired in
different ways, such as experiments (e.g., wet-lab experiments like gene sequencing), observation
(e.g., surveys, sensors, in situ collection), simulations (e.g., theoretical models like climate
models), and web scraping (extracting or copying data directly from a website). Secondary
data is collected by someone else or comes from published data that is already available. The most popular
repositories include GitHub, Kaggle, KDnuggets, the UCI Machine Learning Repository, the US
Government's Open Data, FiveThirtyEight, Amazon Web Services, BuzzFeed, etc.
Acquisition of data can be processed quickly using traditional programming. On
the other hand, some advanced algorithms offer the possibility to interpret, self-assess, and
learn from data such as numerical variables, sound, or images; this is called machine
learning (ML). ML includes supervised learning, unsupervised learning, and semi-supervised
learning. These algorithms have been widely used for signal processing in applications such
as pattern recognition, analysis, and computer vision [1].
Sourcing suitable training data is the basis of an optimally programmed model and a successful AI application that operates as it was
intended once deployed. Common data acquisition strategies are as follows:
I. Public Databases
These are databases made accessible by businesses so that the data can be
used for machine learning, computer vision, natural language processing, and various other
AI application training purposes. The downside to tapping into public databases is that the data is
not easily customized or specifically suitable for a project's niche or focus. On the other hand,
open-source data can cut down on time and team resources by providing basic, subject-matter
data quickly and without expense.
II. In-house data sourcing
With in-house data sourcing, the training data for an ML engineer's algorithm and
model development is acquired from within their own organization. Engineers create the
databases themselves, or other specialists are brought in to help generate the necessary data
required for a project.
Such data is provided internally and does not require excessive expense. Internal data
sources are usually highly relevant to a team's specific needs. The data tends to be more
reliable, since ML developers know how and when it was generated, and if any need
comes up to further personalize the sets, they can simply do so without relying on an external
source or provider.
III. Crowdsourcing
Crowdsourcing is a standard, go-to approach to collecting training data.
The basic procedure is for an ML developer or organization to recruit outside assistance for
data acquisition efforts. Since the work is outsourced, it might not be possible to pass feedback
to the team or individuals assigned the work, making it unlikely that they will be able to
improve individually or as a team if the sourced data does not meet expectations from a
quality standpoint.
IV. Manual data collection
Manually collecting data involves acquiring the data from real-world settings. An
organization can collect or build the data by using tools or devices to monitor and derive real-world
data that can then be processed and used to train models.
These devices can range from online apps that collect data through user interaction
and surveys, to social media, sensors, and drones.
V. Data collection services
Data collection services are used by ML teams to meet their data collection needs. The quality and
suitability of the data acquired through a collection service or company must satisfy an ML
team's needs on a case-by-case basis [2].
Existing data is generally available only in limited quantities, while ML models rely on large
amounts of diverse data to develop accurate predictions in various contexts, and the demand
for data in ML applications, especially in deep learning, continues to diversify and grow rapidly. Data
augmentation techniques can be a good tool against these challenges that the artificial intelligence
world faces.
DATA AUGMENTATION
Data augmentation improves the performance and outcomes of machine learning
models by forming new and different examples to add to the training datasets. If the dataset in a machine
learning model is rich and sufficient, the model performs better and more accurately. One
of the steps in building a data model is cleaning the data, which is necessary for high-accuracy models.
However, if cleaning reduces the representativeness of the data, then the model cannot provide
good predictions for real-world inputs. Data augmentation techniques can make machine
learning models more robust by creating variations that the model may see in the real
world. Techniques for data augmentation based on artificial intelligence for
computer vision, audio, text, image, and advanced data augmentation are explained below.
i) COMPUTER VISION
Computer vision systems use artificial intelligence (AI) technology to mimic the
capabilities of the human brain that are responsible for object recognition and object
classification. Engineers train computers to recognize visual data by inputting vast amounts
of information. ML algorithms identify common patterns in these images or videos and
apply that knowledge to identify unknown images accurately.
A common strategy is position augmentation, in which geometric operations such as rotation are applied to a training image to create
new images.
Figure 1. Position augmentation (rotation): (a) original (b) simple rotation 90° left (c) simple rotation 90°
right (d) simple rotation 180° (e) advanced rotation 45° clockwise (f) advanced rotation 45° counterclockwise
(g) advanced rotation 270° clockwise (h) advanced rotation 270° counterclockwise (i) random rotation [3].
Another type of augmentation in computer vision is colour augmentation. This strategy adjusts
the elementary properties of a training image, such as its brightness, degree of contrast, or
saturation. These common image transformations change the hue, the dark and light balance,
and the separation between an image's darkest and lightest areas to create augmented images.
Figure 2 shows different colour augmentation techniques.
Figure 2 Colour Augmentation (blur) (a) original (b) blur (c) box blur (d) motion blur (e) radial blur (f) heavy
radial blur (g) Gaussian blur (h) stack blur (i) soften (j) noise injection [4].
Convolutional neural networks (CNNs) consist of layers of artificial
neurons that work together inside the computer. They use mathematical calculations to
automatically process different aspects of image data and gradually develop a combined
understanding of the image. CNNs utilize a labelling system to categorize visual data and
comprehend the whole image. They analyse images as pixels and give each pixel a label
value. The value is inputted to perform a mathematical operation called convolution and
make predictions about the picture. Like a human attempting to recognize an object at a
distance, a CNN first identifies outlines and simple shapes before filling in additional
details like colour, internal forms, and texture. Finally, it repeats the prediction process over
several iterations to improve accuracy. RNNs are similar to CNNs, but can process a series
of images to find links between them. While CNNs are used for single-image analysis,
RNNs can analyse videos and understand the relationships between images [5].
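As an illustrative sketch (not from the cited source), the convolution operation described above can be written in a few lines of PyTorch; the channel counts and image size here are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# One convolutional layer: slides small 3x3 filters over the pixel grid and
# produces feature maps that respond to outlines and simple shapes
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

image = torch.rand(1, 3, 64, 64)   # one RGB image of 64x64 pixel values
feature_maps = conv(image)
print(feature_maps.shape)          # torch.Size([1, 16, 64, 64])
```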
The next type of data augmentation is audio data augmentation.
ii) AUDIO DATA AUGMENTATION
Audio data augmentation includes injecting random or Gaussian noise into audio, fast-forwarding
parts, changing the speed of parts by a fixed rate, or altering the pitch. Different
audio data augmentation methods are discussed below, followed by a short code sketch.
Time Stretching: It increases or decreases the speed of audio playback without changing
its pitch. This allows simulating variations in the rhythm or speed of speech.
Pitch Shifting: It is used to alter pitch or frequency while maintaining audio duration. This
technique can simulate variations of the voice or musical notes.
Noise Addition: This method adds random noise to the audio signal. It makes the model
more robust against environmental noise or variations in recording conditions.
Amplitude Scaling: It adjusts the volume or intensity of the audio signal. It can simulate
different sound levels or distances between the audio source and the microphone.
Reverberation: These effects are applied to the audio clip to simulate different acoustic
environments. This can help the model to better generalize across different recording
settings.
Time and Frequency Masking: A random section of the audio is temporarily masked in
the time or frequency domain. This technique can simulate missing or damaged audio
segments.
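A minimal sketch of a few of these methods, assuming the numpy and librosa libraries and a hypothetical input file speech.wav:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)  # hypothetical input clip

stretched = librosa.effects.time_stretch(y, rate=1.25)      # time stretching
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # pitch shifting
noisy = y + 0.005 * np.random.randn(len(y))                 # noise addition
scaled = 0.5 * y                                            # amplitude scaling
```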
Figure 3 below shows the changes in an audio clip after applying different audio
augmentations.
Figure 3 Audio augmentations, (a) Original audio, (b) Audio after adding noise, (c) Audio after Stretching,
(d) Audio after shifting [6].
iii) IMAGE DATA AUGMENTATION
Image data augmentation creates new training images by transforming existing ones. Common techniques include:
Geometric transformations: randomly flip, crop, rotate, stretch, and zoom images. You
need to be careful about applying multiple transformations to the same images, as this can
reduce model performance.
Colour space transformations: randomly change RGB colour channels, contrast, and
brightness.
Kernel filters: randomly change the sharpness or blurring of the image.
Random erasing: delete some part of the initial image.
Mixing images: blending and mixing multiple images.
The application of all these image augmentation techniques is shown in Figure 5, and a short code sketch follows the figure.
Figure 5. Image data augmentation technique for increasing the size of an image dataset to train a deep neural network
algorithm. The drawing has been modified for the purpose of this study. The variables X and X_FL are
mathematical representations used in this study to mimic the image data augmentation strategy using fuzzy
mathematics to design the proposed system [8].
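As a rough sketch of how the listed techniques can be combined in practice, they are composed here with the torchvision library; the parameter values are illustrative assumptions, not recommendations from the cited study:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                # geometric: flip
    T.RandomRotation(degrees=30),                 # geometric: rotate
    T.RandomResizedCrop(size=224),                # geometric: crop and zoom
    T.ColorJitter(brightness=0.3, contrast=0.3),  # colour space transformation
    T.GaussianBlur(kernel_size=5),                # kernel filter: blurring
    T.ToTensor(),
    T.RandomErasing(p=0.5),                       # random erasing
])
# augmented = augment(pil_image)  # apply to a PIL image during training
```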
iv) ADVANCED DATA AUGMENTATION
Advanced data augmentation relies on techniques such as GANs, VAEs, adversarial training, and neural style transfer.
i) Generative Adversarial Networks (GANs):
GANs consist of two neural networks, a generator and a discriminator, which are trained
simultaneously. The generator creates synthetic data points, while the discriminator evaluates
the quality of the generated data by comparing it to the real data. The generator and
discriminator compete against each other, with the generator trying to create realistic data
points that can fool the discriminator, and the discriminator trying to accurately distinguish
between real and generated data. As the training progresses, the generator becomes better at
producing high-quality synthetic data.
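A minimal PyTorch sketch of this generator/discriminator game, using a toy two-dimensional Gaussian blob as a stand-in for a real dataset:

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 2, 64

# Generator: maps random noise vectors to synthetic data points
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
# Discriminator: outputs the probability that a data point is real
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(batch, data_dim) * 0.5 + 3.0  # stand-in for a real data batch

for step in range(1000):
    # Train the discriminator: real batch -> label 1, fake batch -> label 0
    fake = G(torch.randn(batch, latent_dim)).detach()
    loss_d = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Train the generator: try to make the discriminator label fakes as real
    fake = G(torch.randn(batch, latent_dim))
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

synthetic = G(torch.randn(10, latent_dim))  # new augmented samples
```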
ii) Variational Autoencoders (VAEs):
VAEs are a type of generative model that uses an encoder-decoder architecture. The encoder
learns a lower-dimensional representation of the input data, i.e. the latent space, while the decoder
reconstructs the input data from the latent space. VAEs impose a probabilistic structure on
the latent space, which allows them to generate new data points by sampling from the learned
distribution. They are particularly useful for data augmentation tasks where the input data
has a complex structure, such as images or text.
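A compact sketch of this encoder-decoder structure in PyTorch; the layer sizes are illustrative, and the reparameterisation step is what allows sampling from the learned latent distribution:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: encoder -> (mu, log_var) -> sampled latent -> decoder."""
    def __init__(self, data_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.log_var = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, data_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterisation trick: sample z from the learned distribution
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z), mu, log_var

vae = VAE()
# After training, new augmented samples come from decoding random latent vectors
samples = vae.decoder(torch.randn(5, 8))
```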
iii) Adversarial Training:
Adversarial attacks are imperceptible changes to images (pixel-level changes) that can
completely change the model's prediction. Here, images are transformed until the deep learning
model is deceived and fails to analyse the data correctly. These transformed or
augmented images are then added to the training examples to make the model robust against
adversarial attacks. In the image below, adding a small amount of noise to a picture
confuses the AI classifier into classifying a panda as a gibbon. Such alterations are added to the
training dataset to counter adversarial attacks. The addition of a little noise to a panda is
represented in Figure 6.
Figure 6. Augmented image of a panda generated by adding a little noise [9].
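The panda-to-gibbon example in Figure 6 was produced with the fast gradient sign method (FGSM). A minimal PyTorch sketch follows, assuming some trained classifier model and loss function loss_fn:

```python
import torch

def fgsm_example(model, loss_fn, x, y, epsilon=0.007):
    """Create an adversarial copy of image batch x for augmentation."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # Nudge every pixel slightly in the direction that increases the loss
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()
```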
iv) Neural Style Transfer:
A series of convolutional layers is trained such that images are deconstructed into separate
content and style representations. After separation, the content of one image is composed
with the style of another image to create an augmented style image. Thus, the content remains
the same but the style is changed. This increases the robustness of the model, as the model learns to
work independently of the style of the image. Figure 7 below shows an example of
a sunflower style applied to a photo of a person.
Figure 7. Style transfer: the style of the sunflower painting (b) (Van Gogh) is applied to the photo (a) to generate a new
stylised image (c) which keeps the main content of the photo while also containing the texture style of the
sunflower image [10].
All these techniques are used to collect data to train an ML model, but the data acquired through
data acquisition methods is imperfect: it contains many inconsistencies as well as
missing values, so it cannot be used directly as a training dataset for an ML model. That is where
data preprocessing comes into the picture.
DATA PREPROCESSING
Data preprocessing is a set of techniques that enhance the quality of raw data. It serves as the
foundation for data analysis and helps improve the success of machine learning on a given
task. The success of machine learning depends on the quality of the data; irrelevant
information, noise, and any kind of unreliable data will only degrade the output and
make the training phase more difficult. The steps involved in data
preprocessing are data cleaning, data normalisation, transformation, feature extraction and
selection, removing outliers, filling in missing values, etc. Figure 8 below
shows how all the steps mentioned are used to obtain an improved dataset.
Figure 8. The various steps involved in data preprocessing, and the order of the
different steps [11].
Broadly, data preprocessing includes data preparation, data transformation, and data
reduction. The techniques included in data preparation are data transformation, data
integration, data cleaning, and normalisation. All the steps of data preparation are generally
used to enhance the quality of the data. The aim of data reduction is to reduce the complexity of
the data using feature selection, instance selection, discretization of data, etc. Data
reduction generally reduces the data's dimensions, which helps in
reducing computational costs. Data transformation includes arranging the original
data into a suitable format. Data partitioning is used to divide the whole dataset into different
groups to enhance the sensitivity and reliability of the analysis. After successfully applying all
these data preprocessing stages, a final dataset is obtained that is suitable for a machine learning
algorithm [12]. A detailed analysis of the different techniques involved in data preprocessing is
given below, beginning with data cleaning.
I. DATA CLEANING
Data cleaning is one of the major data preprocessing steps, as it locates and fixes errors in the
data. Data that is inaccurate, damaged, improperly formatted, duplicated, or insufficient is
either deleted from a dataset or corrected, based on the significance of the data in training the ML
model. The different types of data found during cleaning, and how each type is treated, are represented in Figure 9
below.
Figure 9. This image explains the different types of data found during data cleaning and how each is treated [13].
Even if the results obtained from data with incorrect or missing values appear to be
correct, they are unreliable. Data cleaning lowers the errors that appear when
merging data from multiple sources. Data cleaning is time consuming, but removing
incorrect information must be done. Data mining is a crucial method for cleaning data
mechanically: it pulls hidden information from large datasets using a variety of data mining
approaches. Sometimes the acquired data has missing values that affect the
final result. This type of data is treated based on its importance in training the ML model.
The methods used to deal with such datasets are discussed in detail below.
MISSING VALUE IMPUTATION: When using data sources from the real world, incomplete
data is an unavoidable and major problem. A missing value is a datum that has not been stored,
due to reasons such as a faulty sampling process, cost restrictions, or limitations in the
acquisition process. The treatment of an incomplete dataset is shown in Figure 10.
Figure 10. When the obtained dataset is incomplete, missing value imputation is used to fill in the missing
data and transform it into a complete dataset [14].
When the portion of missing data is considered insignificant, the affected data samples are
simply discarded. The method of ignoring instances with unknown feature values is used in this
case: every instance with at least one unknown feature value is ignored. When the
missing data holds great significance, a missing value imputation method is applied
to replace the missing data with an inferred value. Different methods are available for
choosing the inferred value; they are given below, followed by a short code sketch.
Most Common Feature Value: The value of the feature that occurs most often is
selected to be the value for all the unknown values of the feature.
Mean substitution: In this method we substitute a feature's mean value,
computed from the available cases, to fill in missing data values in the remaining cases.
Regression or classification methods: A regression or classification model is
developed based on complete-case data for a given feature, which is treated as the outcome while all
other relevant features are used as predictors.
Hot deck imputation: The most similar case to the case with a missing value is identified,
and the most similar case's Y value is substituted for the missing
case's Y value.
Method of Treating Missing Feature Values as Special Values: The “unknown”
itself is treated as a new value for the features that contain missing values [15].
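A minimal sketch of mean substitution and most-common-value imputation with pandas and scikit-learn; the toy table is an illustrative assumption:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "city": ["Pune", "Delhi", None, "Delhi"]})

# Mean substitution for a numeric feature
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# Most common feature value (mode) for a categorical feature
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()
```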
Besides missing values, data points that are not related to the rest of the data may also be
obtained. These points have to be identified first, and then treated. Such a data
point is called an outlier, and the method used to identify this type of data is called outlier
detection.
OUTLIER DETECTION: Outliers are data points that are significantly different from
the rest of the dataset. They are often abnormal observations that arise due to inconsistent
data entry. There are two types of methods in data preprocessing for outlier detection, namely
statistical methods and clustering-based methods. Figure 11 below represents outliers in
a dataset.
Figure 11. Detection of outliers in a given dataset using a statistical method [16].
For detecting outliers in numerical data, statistical methods are used,
while clustering-based methods are used either to detect outliers directly or as a
primary step to identify data clusters; after the clusters are identified, a statistical method is
applied for outlier detection.
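A minimal numpy sketch of the statistical approach, flagging points more than a chosen number of standard deviations from the mean (the threshold here is an arbitrary choice):

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

data = np.array([10.2, 9.8, 10.1, 10.4, 9.9, 55.0])
print(data[zscore_outliers(data, threshold=2.0)])  # -> [55.]
```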
The data provided to data mining as the training set is assumed to be perfect, with
no disturbance. In the real world, big data is rarely perfect, which affects the quality of the data
mining technique. These problems make it mandatory to tackle the noise
present in the data. For a better understanding, the noise treatments used are discussed below.
NOISE TREATMENT: Noise present in the data can affect the input features as well as the output
values. Noise present in an input attribute is usually referred to as attribute noise, while noise that affects
the output attribute introduces great bias into the output. Noise can be treated either
by data polishing methods or by noise filters. With data polishing, noise correction is only
partial, which is sometimes beneficial, but it is a difficult task and usually limited
to small amounts of noise. That is why noise filters are used to identify and remove the noise
from the data during the training session; in addition, no modification of the data mining techniques is required
for this treatment of noise [17].
II. DATA INTEGRATION
The collected data is often stored across different data silos. Since there is no central
database or file system, data integration is used to give the user a unified view of the data.
III. DATA TRANSFORMATION
Data transformation, also called data preparation, ensures the compatibility of the data with the data
mining algorithm. It can be used to transform numerical data into categorical
data. The equal width method divides the range of a variable into several equally sized
intervals, where the number of intervals is predefined by the user. The equal frequency method
divides the data into several intervals, each containing approximately the same amount of data.
Data transformation can also be applied to transform categorical variables into numerical
ones, to facilitate the development of prediction models.
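A minimal pandas sketch of both binning methods; the toy values are illustrative:

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 67, 71])

equal_width = pd.cut(ages, bins=4)   # four equally sized value ranges
equal_freq = pd.qcut(ages, q=4)      # four intervals with ~equal counts
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```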
Since datasets are generally very large, dividing them into smaller subsets helps to analyse the
data more accurately and easily; for that, data partitioning is used.
DATA PARTITIONING: Data partitioning divides the whole dataset into several groups for
in-depth analysis. It splits the data into two or three non-overlapping subsets: the
training set, the validation set, and the test set. It is used to keep a subset of the available data
out of access and use it later for verification. A number of clustering algorithms have been
applied for data partitioning, such as k-means, hierarchical clustering, entropy weighting k-means
(EWKM), and fuzzy c-means clustering. Data normalization is another technique
involved in data transformation; it is used when there is a large difference between the
maximum and minimum values in the dataset.
DATA NORMALIZATION: Data normalization puts data into a consistent format and makes
sure that all values in the dataset are on the same scale. Widely scattered data is hard to analyse and
compare, and data normalization helps to solve this problem. After normalization,
unstructured data appears similar across all records. The most common methods for
normalization are listed below, with a short code sketch after the list.
Max-min normalisation is used when the relative difference between values is
required. It transforms the data to a specified range. This method is sensitive to outliers, as
their presence may dramatically change the data range. The formula used is:
x′ = (x − xmin)/(xmax − xmin)
where xmin and xmax refer to the minimum and maximum of variable x.
Z-score standardization scales data to mean 0 and standard deviation 1. It deals with
normally distributed data and is less affected by outliers. The
formula used is: x′ = (x − μ)/σ
where μ is the mean and σ is the standard deviation.
Log transformation is used with highly skewed distributions and makes data more
symmetric. The formula used is: x′ = log(x)
Decimal scaling moves the decimal point of values to a common place [18].
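A minimal numpy sketch of these formulas:

```python
import numpy as np

x = np.array([2.0, 5.0, 10.0, 100.0])

minmax = (x - x.min()) / (x.max() - x.min())  # max-min normalisation to [0, 1]
zscore = (x - x.mean()) / x.std()             # z-score standardisation
logged = np.log(x)                            # log transformation for skewed data
```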
Figure 12 below shows graphs representing the data before and after normalization.
Figure 12. The outliers present are removed after normalization, and more symmetric data is
obtained [19].
Normalized data makes the predictions of an ML model more accurate and also increases its
speed. The next important step of data preprocessing is data reduction, which reduces the
dimensionality of the data and makes algorithms faster.
IV. DATA REDUCTION
Analysing large data is time consuming and costly, hence reducing its size is
beneficial. Data reduction includes summarizing, choosing essential elements, looking for patterns in the available
data, and focusing on the important things. Many techniques exist to reduce the size of a dataset
without affecting the output; the choice depends on the type of data available and the
requirements. Some techniques for data reduction are discussed below, the first
being feature selection.
FEATURE SELECTION: Feature selection reduces the dimension of the dataset by removing
as many irrelevant features as possible, to obtain a subset of features from the original problem
that still appropriately describes it. This helps the learning algorithm operate faster and
more effectively. It can also be used at the data collection stage, saving cost and time in sampling and
sensing. Features that have an influence on the output are called relevant features, while features
having no influence on the output are called irrelevant features. A feature selection
algorithm has two components: a selection algorithm and an evaluation algorithm. The figure
below explains the process of feature selection.
Figure 13. Flow chart representing the steps involved in reducing data with the feature selection method,
including where the stopping criterion is used [20].
The selection algorithm generates proposed subsets of features and searches for an optimal subset,
while the evaluation algorithm determines how good a proposed feature set is, returning some measure
of goodness to the selection algorithm. A stopping criterion is also included to stop the
feature selection process, which might otherwise run through the space of subsets forever. Stopping
criteria can be based on whether the addition (or deletion) of any feature produces a
better subset, or on whether an optimal subset according to some evaluation function has been obtained.
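A minimal scikit-learn sketch of this split, where the ANOVA F-score serves as the evaluation function and keeping the k highest-scoring features is the selection strategy; the iris dataset is just an example:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Evaluation: ANOVA F-score; selection: keep the k highest-scoring features
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (150, 4) -> (150, 2)
```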
Space transformation is another way to reduce the dimensionality of a dataset. In this
case a whole new set of features is generated by combining the original ones. Another method
generally used for reducing data is data discretisation, which converts continuous values into
discrete intervals, as shown in Figure 14.
Figure 14. How continuous values of data are changed into discrete values using the data
discretisation method [21].
Data discretization can also be done using decision trees and correlation analysis. In
decision tree analysis, a top-down splitting technique is used through a supervised
procedure. Correlation analysis is also a supervised procedure, in which linear regression
techniques are used to find the best neighbouring intervals [22].
CONCLUSION
In summary, the acquired data, which is finite, goes through data augmentation to obtain a
practically unlimited supply of data using artificial intelligence technologies such as deep neural networks (DNNs),
convolutional neural networks (CNNs), generative adversarial networks (GANs), variational
autoencoders (VAEs), and neural style transfer. Data preprocessing techniques
such as data cleaning, data integration, data transformation, data partitioning, and data reduction
are then applied to this data before it is used to train the model.
REFERENCES
1. Automated Data Acquisition System Using a Neural Network for Prediction Response in a Mode-Locked Fiber Laser, Jose Ramon Martinez-Angulo, Eduardo Perez-Careta, Juan Carlos Hernandez-Garcia, Sandra Marquez-Figueroa, Jose Hugo Barron Zambrano, Daniel Jauregui-Vazquez, Jose David Filoteo-Razo, Jesus Pablo Lauterio-Cruz, Olivier Pottiez, Julian Moises Estudillo-Ayala and Roberto Rojas-Laguna [https://www.mdpi.com/2079-9292/9/8/1181]
2. 5 Data Acquisition Strategies for Supervised Machine Learning, The Superb AI Roundup [https://www.linkedin.com/pulse/5-data-acquisition-strategies-supervised-machine-learning-superbai]
6. Aspect-Based Sentiment Analysis of Customer Speech Data Using Deep Convolutional Neural Network and BiLSTM, Sivakumar Murugaiyan, U. Srivasulu Reddy [https://www.researchgate.net/publication/369036271_Aspect-Based_Sentiment_Analysis_of_Customer_Speech_Data_Using_Deep_Convolutional_Neural_Network_and_BiLSTM]
8. Can artificial intelligence assist project developers in long time management of energy projects? The case of CO2 capture and storage, Eric Buah, Lassi Linnaen, Huapeng Wu, Martim A. Kesse [https://www.mdpi.com/1996-1073/13/23/6259]
10. STaDA: Style Transfer as Data Augmentation, Xu Zheng, Tejo Chalasani, Koustav Ghosal, Aljosa Smolic [https://www.researchgate.net/publication/331777321_STaDA_Style_Transfer_as_Data_Augmentation]
11. Research on Data Preprocessing and Categorization Technique for Smartphone Review Analysis, Vivek Agarwal [https://www.researchgate.net/publication/291019609_Research_on_Data_Preprocessing_and_Categorization_Technique_for_Smartphone_Review_Analysis]
12. A Review on Data Preprocessing Techniques Toward Efficient and Reliable Knowledge Discovery From Building Operational Data, Cheng Fan, Meiling Chen, Xinghua Wang, Jiayuan Wang, Bufu Huang [https://www.researchgate.net/publication/350466129_A_Review_on_Data_Preprocessing_Techniques_Toward_Efficient_and_Reliable_Knowledge_Discovery_From_Building_Operational_Data]
13. Cleaning Big Data Streams: A Systematic Literature Review, Obaid Alotaibi, Eric Pardede, Sarah Tomy [https://www.mdpi.com/2227-7080/11/4/101]
14. Missing value imputation: a review and analysis of the literature (2006-2017), Wei-Chao Lin, Chih-Fong Tsai [https://link.springer.com/article/10.1007/s10462-019-09709-4]
15. Data Preprocessing for Supervised Learning, Sotiris Kotsiantis, Dimitris Kanellopoulos, P. E. Pintelas, p. 113 [https://www.researchgate.net/publication/228084519_Data_Preprocessing_for_Supervised_Learning]
17. Big data preprocessing: methods and prospects, Salvador García, Sergio Ramírez-Gallego, Julián Luengo, José Manuel Benítez and Francisco Herrera [https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0014-0]
18. A Review on Data Preprocessing Techniques Toward Efficient and Reliable Knowledge Discovery From Building Operational Data, Cheng Fan, Meiling Chen, Xinghua Wang, Jiayuan Wang and Bufu Huang [https://www.frontiersin.org/articles/10.3389/fenrg.2021.652801/full]
20. Data Preprocessing for Supervised Learning, Sotiris Kotsiantis, Dimitris Kanellopoulos, P. E. Pintelas, p. 114 [https://www.researchgate.net/publication/228084519_Data_Preprocessing_for_Supervised_Learning]