
I. INTRODUCTION

In recent years, data acquisition systems have shown great potential for solving tasks
efficiently in areas such as manufacturing, video processing, agriculture, imaging,
meteorological systems and remote laboratories. In research it is often necessary to acquire a
considerable number of samples manually using electronic devices, and during these processes
the measurements can be affected by errors and information loss. Laboratory devices with
improved software for monitoring and control can reduce acquisition time and yield samples
with good accuracy.
Data acquisition means gathering data from different sources; the acquired data is
classified into primary data and secondary data. Primary data is collected by the individual or
organization performing the analysis, and is acquired in different ways such as experiments
(e.g., wet-lab experiments like gene sequencing), observation (e.g., surveys, sensors, in situ
collection), simulations (e.g., theoretical models like climate models), and web scraping
(extracting or copying data directly from a website). Secondary data is collected by someone
else or obtained from published, publicly available data. Popular repositories include GitHub,
Kaggle, KDnuggets, the UCI Machine Learning Repository, the US Government's Open Data,
FiveThirtyEight, Amazon Web Services, BuzzFeed, etc.
Acquired data can be processed quickly using traditional programming. Some more
advanced algorithms, collectively called machine learning (ML), additionally offer the
possibility to interpret, self-assess, and learn from the data, whether it consists of numerical
variables, sound, or images. ML includes supervised learning, unsupervised learning, and
semi-supervised learning, and these algorithms have been widely used for signal processing in
applications such as pattern recognition, analysis, and computer vision [1].

METHODS OF DATA ACQUISITION


There are several methods of acquiring data, such as collecting new data,
converting/transforming legacy data, sharing/exchanging data, and purchasing data. These
include automated collection (e.g., sensor-derived data), the manual recording of empirical
observations, and obtaining existing data from other sources. A model cannot function without
training, and training is not possible without data.
The first step in creating an optimally programmed model, and a successful AI
application that operates as intended once deployed, is therefore to feed acquired data into the
ML algorithms. The main data acquisition approaches are described below.

I. Public Databases

Public databases are databases made accessible by businesses so that the data can be used for
machine learning, computer vision, natural language processing, and various other AI training
purposes. The downside to tapping into public databases is that the data is not easily
customized or specifically suited to a project's niche or focus. On the other hand, open-source
data can cut down on time and team resources by providing basic, subject-matter data quickly
and without expense.

II. In-House Data Sourcing

With in-house data sourcing, the training data for an algorithm and for model development is
acquired from within the ML engineer's own organization. Engineers create the databases
themselves, or other specialists are brought in to help generate the data required for a project.
Because the data is provided internally, it does not require excessive expense, and
internal data sources are usually highly relevant to a team's specific needs. The data also tends
to be more reliable, since ML developers know how and when it was generated, and if any need
arises to further personalize the sets, they can simply do so without relying on an external
source or provider.

III. Crowdsourcing

Crowdsourcing is a standard, go-to approach to collecting training data. The basic procedure
is for an ML developer or organization to recruit outside assistance for its data acquisition
efforts. Since the work is outsourced, it may not be possible to pass feedback to the team or
individuals assigned the work, making it unlikely that they will be able to improve individually
or as a team if the sourced data does not meet expectations from a quality standpoint.

IV. Manual Data Collection

Manually collecting data involves acquiring data from real-world settings. An organization
can collect or build the data by using tools or devices to monitor and derive real-world data
that can then be processed and used to train models.
These devices can range from online apps that collect data through user interaction
and surveys, to social media, sensors, and drones.
V. Data Collection Services

Finally, ML teams can hire a data collection service or company for their data collection needs;
the quality and suitability of the data acquired this way must be judged against the ML team's
requirements on a case-by-case basis [2].
Existing data is generally available only in limited quantity, while ML models,
especially deep learning models, rely on large amounts of diverse data to make accurate
predictions in various contexts, and the demand for such data continues to grow rapidly. Data
augmentation techniques can be a good tool against this challenge facing the artificial
intelligence world.
DATA AUGMENTATION

Data augmentation improves the performance and outcomes of machine learning models by
forming new and different examples from the training datasets. If the dataset for a machine
learning model is rich and sufficient, the model performs better and more accurately. One of
the steps in building a data model is cleaning the data, which is necessary for high-accuracy
models. However, if cleaning reduces the representativeness of the data, the model cannot
provide good predictions for real-world inputs. Data augmentation techniques make machine
learning models more robust by creating variations that the model may encounter in the real
world. Techniques for data augmentation of computer vision, audio, text, and image data,
together with more advanced AI-based augmentation methods, are explained below.

i) COMPUTER VISION

Computer vision systems use artificial intelligence (AI) technology to mimic the
capabilities of the human brain that are responsible for object recognition and object
classification. Engineers train computers to recognize visual data by inputting vast amounts
of information. ML algorithms identify common patterns in these images or videos and
apply that knowledge to identify unknown images accurately.

One form of augmentation in computer vision is position augmentation. This strategy
crops, flips, or rotates an input image to create augmented images. Cropping either resizes the
image or crops a small part of the original image to create a new one. Rotation, flipping, and
resizing transformations all alter the original image randomly, each applied with a given
probability, to produce new images.
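As an illustration, the following minimal sketch shows position augmentation with the Pillow library; the input file name sample.jpg, the chosen angles, and the crop box are placeholder assumptions, not values from this report.

    from PIL import Image
    import random

    img = Image.open("sample.jpg")          # placeholder input image

    # Rotation: rotate by a randomly chosen angle (expand keeps the whole image visible)
    rotated = img.rotate(random.choice([45, 90, 180, 270]), expand=True)

    # Flip: mirror the image horizontally
    flipped = img.transpose(Image.FLIP_LEFT_RIGHT)

    # Crop: take a random sub-region and resize it back to the original size
    w, h = img.size
    left, top = random.randint(0, w // 4), random.randint(0, h // 4)
    cropped = img.crop((left, top, left + w // 2, top + h // 2)).resize((w, h))

    for i, aug in enumerate([rotated, flipped, cropped]):
        aug.save(f"augmented_{i}.jpg")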

Figure 1 shows results after applying image augmentation techniques

Figure 1 Position Augmentation (rotation) (a) original (b) simple rotation 90⁰ left (c) simple rotation 90⁰
right (d) simple rotation 180⁰ (e) advanced rotation 45⁰ clockwise (f) advanced rotation 45⁰ counter clockwise
(g) advanced rotation 270⁰ clockwise (h) advanced rotation 270⁰ counter clockwise (i) random rotation [3] .

Another form of augmentation in computer vision is colour augmentation. This strategy adjusts
the elementary properties of a training image, such as its brightness, contrast, or saturation.
These common image transformations change the hue, the dark and light balance, and the
separation between an image's darkest and lightest areas to create augmented images.
Figure 2 shows different colour augmentation techniques

Figure 2 Colour Augmentation (blur) (a) original (b) blur (c) box blur (d) motion blur (e) radial blur (f) heavy
radial blur (g) Gaussian blur (h) stack blur (i) soften (j) noise injection [4].

Computer vision is based on deep learning (DL), convolutional neural networks (CNNs) and
recurrent neural networks (RNNs). DL is a type of ML that uses neural networks; DL neural
networks are made of many layers of software modules called artificial neurons that work
together inside the computer. They use mathematical calculations to automatically process
different aspects of image data and gradually develop a combined understanding of the image.
CNNs utilize a labelling system to categorize visual data and comprehend the whole image.
They analyse images as pixels and give each pixel a label value. The values are used to perform
a mathematical operation called convolution and make predictions about the picture. Like a
human attempting to recognize an object at a distance, a CNN first identifies outlines and
simple shapes before filling in additional details like colour, internal forms, and texture.
Finally, it repeats the prediction process over several iterations to improve accuracy. RNNs
are similar to CNNs, but can process a series of images to find links between them. While
CNNs are used for single-image analysis, RNNs can analyse videos and understand the
relationships between images [5]. The next data augmentation type is audio data
augmentation.

ii) AUDIO DATA AUGMENTATION

Audio data augmentation includes injecting random or Gaussian noise into the audio, fast-
forwarding parts, changing the speed of parts by a fixed rate, or altering the pitch. Different
audio data augmentation methods are discussed below.

Time Stretching: This increases or decreases the speed of audio playback without changing
its pitch. It allows simulating variations in the rhythm or speed of speech.

Pitch Shifting: It is used to alter pitch or frequency while maintaining audio duration. This
technique can simulate variations of the voice or musical notes.

Noise Addition: This method adds random noise to the audio signal. It makes the model
more robust against environmental noise or variations in recording conditions.

Amplitude Scaling: It adjusts the volume or intensity of the audio signal. It can simulate
different sound levels or distances between the audio source and the microphone.

Reverberation: These effects are applied to the audio clip to simulate different acoustic
environments. This can help the model to better generalize across different recording
settings.

Time and Frequency Masking: A random section of the audio is temporarily masked in
the time or frequency domain. This technique can simulate missing or damaged audio
segments.
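A minimal sketch of some of these augmentations using only NumPy is given below; the placeholder waveform, the assumed sample rate, and the parameter values are illustrative, not taken from this report.

    import numpy as np

    sr = 16000                                            # assumed sample rate (Hz)
    audio = np.random.randn(sr * 2).astype(np.float32)    # placeholder 2-second waveform

    # Noise addition: add low-amplitude Gaussian noise
    noisy = audio + 0.005 * np.random.randn(len(audio))

    # Amplitude scaling: change the overall loudness
    scaled = 0.5 * audio

    # Time shifting: roll the waveform forward by 100 ms
    shifted = np.roll(audio, int(0.1 * sr))

    # Time masking: zero out a random 100 ms segment to simulate missing audio
    start = np.random.randint(0, len(audio) - sr // 10)
    masked = audio.copy()
    masked[start:start + sr // 10] = 0.0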
Figure 3, given below, shows the changes in an audio clip after applying audio augmentation.
Figure 3 Audio augmentations, (a) Original audio, (b) Audio after adding noise, (c) Audio after Stretching,
(d) Audio after shifting [6].

iii) TEXT DATA AUGMENTATION


Text augmentation is a vital data augmentation technique for Natural Language Processing
(NLP) and other text-related sectors of ML. Transformations of text data include shuffling
sentences, changing the positions of words, replacing words with close synonyms, inserting
random words, and deleting random words. Different Text Data Augmentation methods are
mentioned below.
Word or sentence shuffling randomly changes the position of a word or sentence.
Word replacement replaces words with close synonyms.
Syntax-tree manipulation paraphrases the sentence using the same words.
Random word insertion inserts words at random.
Random word deletion deletes words at random.
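A rough sketch of the simpler transformations (shuffling, synonym replacement, random insertion and deletion) in plain Python is shown below; the tiny synonym dictionary and the example sentence are stand-ins for a real thesaurus or word-embedding lookup.

    import random

    SYNONYMS = {"very": ["really", "extremely", "quite"]}   # illustrative stand-in

    def shuffle_words(sentence):
        words = sentence.split()
        random.shuffle(words)
        return " ".join(words)

    def replace_synonyms(sentence):
        return " ".join(random.choice(SYNONYMS[w]) if w in SYNONYMS else w
                        for w in sentence.split())

    def random_insertion(sentence, word="indeed"):
        words = sentence.split()
        words.insert(random.randint(0, len(words)), word)
        return " ".join(words)

    def random_deletion(sentence, p=0.2):
        words = [w for w in sentence.split() if random.random() > p]
        return " ".join(words) if words else sentence

    text = "the movie was very good"
    print(shuffle_words(text), replace_synonyms(text),
          random_insertion(text), random_deletion(text), sep="\n")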

Text augmentation is shown in Figure 4

Figure 4 Text Augmentation by inserting synonyms of the word 'very' [7].

iv) IMAGE AUGMENTATION

Geometric transformations: randomly flip, crop, rotate, stretch, and zoom images. Care is
needed when applying multiple transformations to the same images, as this can reduce model
performance.
Colour space transformations: randomly change RGB colour channels, contrast, and
brightness.
Kernel filters: randomly change the sharpness or blurring of the image.
Random erasing: delete some part of the initial image.
Mixing images: blending and mixing multiple images.
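Two of these techniques, random erasing and image mixing (often called mixup), can be sketched directly on NumPy arrays as below; the image shapes, rectangle size, and mixing coefficient are illustrative assumptions.

    import numpy as np

    img_a = np.random.rand(224, 224, 3)   # placeholder images as float arrays
    img_b = np.random.rand(224, 224, 3)

    # Random erasing: overwrite a random rectangle of the image with noise
    erased = img_a.copy()
    y, x = np.random.randint(0, 174, size=2)
    erased[y:y + 50, x:x + 50, :] = np.random.rand(50, 50, 3)

    # Mixing images (mixup): blend two images with a coefficient lam
    lam = 0.7
    mixed = lam * img_a + (1 - lam) * img_b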
Application of all these image augmentation techniques is shown in Figure 5

Figure 5 Image data augmentation technique for increasing the size of an image dataset to train a deep neural
network algorithm. The drawing has been modified for the purpose of this study. The variables X and X_FL are
mathematical representations used in this study to mimic the image data augmentation strategy using fuzzy
mathematics to design the proposed system [8].

Some advanced technologies used for data augmentation are discussed below.

i) Generative adversarial networks (GANs):

GANs consist of two neural networks, a generator and a discriminator, which are trained
simultaneously. The generator creates synthetic data points, while the discriminator evaluates
the quality of the generated data by comparing it to the real data. The generator and
discriminator compete against each other, with the generator trying to create realistic data
points that can fool the discriminator, and the discriminator trying to accurately distinguish
between real and generated data. As the training progresses, the generator becomes better at
producing high-quality synthetic data.
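The generator/discriminator interplay can be illustrated with a deliberately small PyTorch sketch; the layer sizes, learning rate, and the 1-D "real data" distribution below are arbitrary assumptions for demonstration, not a production model.

    import torch
    import torch.nn as nn

    latent_dim, data_dim = 8, 2

    # Generator maps random noise to synthetic data points
    G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
    # Discriminator outputs the probability that its input is real
    D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCELoss()

    for step in range(1000):
        real = torch.randn(64, data_dim) * 0.5 + 2.0       # assumed "real" distribution
        fake = G(torch.randn(64, latent_dim))

        # Discriminator step: learn to separate real from generated samples
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: learn to fool the discriminator
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()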

7|Page
ii) Variational Autoencoders (VAEs):

VAEs are a type of generative model that uses an encoder-decoder architecture. The encoder
learns a lower-dimensional representation (i.e., a latent space) of the input data, while the decoder
reconstructs the input data from the latent space. VAEs impose a probabilistic structure on
the latent space, which allows them to generate new data points by sampling from the learned
distribution. These are particularly useful for data augmentation tasks where the input data
has a complex structure, such as images or text.
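A compact PyTorch sketch of the encoder-decoder structure with the reparameterization trick is given below; the layer sizes and 784-dimensional input are arbitrary assumptions (e.g., flattened 28x28 images), and new samples are produced by decoding random latent vectors.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyVAE(nn.Module):
        # Encoder compresses the input to a mean and log-variance of the latent space;
        # the decoder reconstructs the input from a latent sample.
        def __init__(self, in_dim=784, latent_dim=16):
            super().__init__()
            self.enc = nn.Linear(in_dim, 128)
            self.mu = nn.Linear(128, latent_dim)
            self.logvar = nn.Linear(128, latent_dim)
            self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim), nn.Sigmoid())

        def forward(self, x):
            h = F.relu(self.enc(x))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
            return self.dec(z), mu, logvar

    def vae_loss(recon, x, mu, logvar):
        # Reconstruction error plus KL divergence to the unit Gaussian prior
        rec = F.binary_cross_entropy(recon, x, reduction="sum")
        kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kld

    # New data points are generated by sampling the latent space and decoding
    model = TinyVAE()
    samples = model.dec(torch.randn(5, 16))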

iii) Adversarial training/Adversarial machine learning

Adversarial attacks are imperceptible changes to images (pixel-level changes) that can
completely change the model prediction. In adversarial training, images are transformed until
the deep learning model is deceived and fails to analyse the data correctly. These transformed
or augmented images are then added to the training examples to make the model robust to
adversarial attacks. In the image below, adding a small amount of noise to an image confuses
the AI classifier into classifying a panda as a gibbon. Such altered images are added to the
training dataset to counter adversarial attacks. The addition of a little noise to a panda image
is represented in Figure 6.

Figure 6 Augmented image of a panda generated by adding a little noise [9].
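One common way to generate such perturbed training images is the fast gradient sign method (FGSM); the sketch below is a hedged illustration that assumes a pretrained PyTorch classifier `model`, an input batch `x`, labels `y`, and a small step size epsilon, none of which come from this report.

    import torch
    import torch.nn.functional as F

    def fgsm_example(model, x, y, epsilon=0.01):
        # Compute the gradient of the loss with respect to the input pixels
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        # Step in the direction that increases the loss, then clamp to a valid pixel range
        x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()
        return x_adv

    # Adversarial training idea: mix adversarial examples into each training batch
    # x_adv = fgsm_example(model, x, y)
    # loss = F.cross_entropy(model(torch.cat([x, x_adv])), torch.cat([y, y]))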

iv) Neural Style Transfer:

In neural style transfer, a series of convolutional layers is trained such that images are
deconstructed and their content and style can be separated. After separation, the content of one
image is composed with the style of another image to create an augmented, stylised image.
Thus, the content remains the same but the style is changed. This increases the robustness of
the model, as the model learns to work independently of the style of the image. Figure 7 given
below shows an example of a sunflower style applied to a photo of a person.

Figure 7 Style Transfer: the style of sunflower (b) (Van Gogh) is applied to photo (a) to generate a new
stylised image (c), which keeps the main content of the photo while also containing the texture style of the
sunflower image [10].
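The content/style separation is usually built on feature maps from a pretrained CNN, with style captured by Gram matrices of those features. A hedged sketch of the two loss terms, assuming the feature maps have already been extracted (e.g., from a VGG-style network), is shown below.

    import torch

    def gram_matrix(features):
        # features: (channels, height, width) feature map from one CNN layer
        c, h, w = features.shape
        f = features.reshape(c, h * w)
        return f @ f.t() / (c * h * w)      # channel-by-channel correlations capture "style"

    def content_loss(gen_feat, content_feat):
        return torch.mean((gen_feat - content_feat) ** 2)

    def style_loss(gen_feat, style_feat):
        return torch.mean((gram_matrix(gen_feat) - gram_matrix(style_feat)) ** 2)

    # The stylised image is optimised to minimise
    #   total = content_weight * content_loss + style_weight * style_loss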

All these techniques are used to prepare data to train an ML model, but the data acquired
through data acquisition methods is imperfect: it contains many inconsistencies as well as
missing values, so it cannot be used directly as the training dataset for an ML model. This is
where data preprocessing comes into the picture.

DATA PREPROCESSING

Data preprocessing is a set of techniques that enhance the quality of raw data. It serves as the
foundation for data analysis and helps machine learning succeed on a given task. The success
of machine learning depends on the quality of the data, which is why irrelevant information
and noisy or otherwise unreliable data will only degrade the output and make the training
phase more difficult. The steps involved in data preprocessing are data cleaning, data
normalisation, transformation, feature extraction and selection, removal of outliers, filling in
of missing values, etc. Figure 8, given below, shows how all the steps mentioned are used to
obtain an improved dataset.

Figure 8 Depicts the various steps involved in data preprocessing and shows the order of the different
steps [11].

Broadly, data preprocessing includes data preparation, data transformation and data
reduction. The techniques included in data preparation are data transformation, data
integration, data cleaning and normalisation; all the steps of data preparation are generally
used to enhance the quality of the data. The aim of data reduction is to reduce the complexity
of the data using feature selection, instance selection, discretization of data, etc. Data
reduction generally reduces the data dimensions, and reducing the dimensionality of the data
helps reduce computational costs. Data transformation arranges the original data into a
suitable format. Data partitioning is used to divide the whole dataset into different groups to
enhance the sensitivity and reliability of the analysis. After all these data preprocessing stages
have been applied successfully, the final dataset is obtained, which is suitable for a machine
learning algorithm [12]. The different techniques involved in data preprocessing are analysed
in detail below, beginning with data cleaning.

I. DATA CLEANING

Data cleaning is one of the major data preprocessing steps, as it locates and fixes errors in the
data. Data that is inaccurate, damaged, improperly formatted, duplicated or insufficient is
either deleted from the dataset or corrected, depending on the significance of the data for
training the ML model. The different techniques used in cleaning data are represented in
Figure 9 given below.

Figure 9. This image explains the different types of data found during data cleaning and how each is treated [13].

Even if results obtained using data with incorrect or missing values appear to be
correct, they are unreliable. Data cleaning lowers the errors that appear when merging data
from multiple sources. Data cleaning is time consuming, but removing incorrect information
must be done. Data mining is a crucial method for cleaning data mechanically: it pulls hidden
information from large datasets using a variety of data mining approaches. Sometimes the
acquired data has missing values that affect the final result. This type of data is treated based
on its importance in training the ML model. The methods used to deal with such datasets are
discussed in detail below.

MISSING VALUE IMPUTATION: When using data sources from the real world, incomplete
data is an unavoidable and major problem. A missing value is a datum that has not been stored
for reasons such as a faulty sampling process, cost restrictions or limitations of the acquisition
process. The treatment of an incomplete dataset is shown in Figure 10.

Figure 10. Shows that when the obtained dataset is incomplete, the missing value imputation method is used to
fill in that data and transform it into a complete dataset [14].

When the portion of data that is missing is considered insignificant, the affected data
samples are simply discarded. The method of ignoring instances with unknown feature values
is used in this case: instances having at least one unknown feature value are ignored. When
the missing data holds great significance, a missing value imputation method is applied to
replace the missing data with an inferred value. There are different methods available to fill
in the missing value; the methods available to choose the inferred value are given below:
Most Common Feature Value: the value of the feature that occurs most often is
selected as the value for all the unknown values of the feature.
Mean substitution: the feature's mean value, computed from the available cases, is
substituted to fill in missing values in the remaining cases.
Regression or classification methods: a regression or classification model is developed
from the complete-case data for a given feature; that feature is treated as the outcome and all
other relevant features are used as predictors.
Hot deck imputation: the most similar complete case is matched to the case with a
missing value, and the most similar case's Y value is substituted for the missing case's Y value.
Treating missing feature values as special values: the "unknown" value itself is treated
as a new value for the features that contain missing values [15].
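A short sketch of the simpler strategies (dropping incomplete instances, mean and most-common-value substitution, and the "unknown" special value) using pandas is given below; the toy DataFrame and its column names are purely illustrative.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan],
                       "city": ["Delhi", "Pune", None, "Pune", "Pune"]})

    # Ignoring instances with unknown feature values: drop rows with any missing value
    dropped = df.dropna()

    # Treating missing values as a special value of their own
    df_special = df.fillna("unknown")

    # Mean substitution for a numerical feature
    df["age"] = df["age"].fillna(df["age"].mean())

    # Most common feature value (mode) for a categorical feature
    df["city"] = df["city"].fillna(df["city"].mode()[0])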
Besides missing values, data points that are unrelated to the rest of the data may also
be obtained. Such data points have to be identified first and then treated. They are called
outliers, and the method used to identify them is called outlier detection.

OUTLIER DETECTION: Outliers are data points that are significantly different from the rest
of the dataset. They are often abnormal observations that arise due to inconsistent data entry.
There are two kinds of outlier detection methods in data preprocessing: statistical methods
and clustering-based methods. Figure 11 given below represents outliers in a dataset.

Figure 11. It represents the detection of outliers in a given dataset using the statistical method [16].

For detecting outliers in numerical data, the statistical method is used, while the
clustering-based method is used either to detect outliers directly or as a preliminary step to
identify data clusters; after the clusters are identified, the statistical method is applied for
outlier detection.
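A statistical outlier check based on z-scores can be sketched in a few lines of NumPy; the sample values and the 2-standard-deviation threshold used here are arbitrary illustrative choices.

    import numpy as np

    data = np.array([10.2, 9.8, 10.1, 10.4, 9.9, 25.0, 10.0])   # 25.0 is the odd one out

    z_scores = (data - data.mean()) / data.std()
    outliers = data[np.abs(z_scores) > 2]     # points more than 2 std devs from the mean
    cleaned = data[np.abs(z_scores) <= 2]

    print("Outliers detected:", outliers)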
The data provided to data mining as the training set is assumed to be perfect and free
of disturbance. In the real world, big data is rarely perfect, which affects the quality of data
mining techniques. These problems make it mandatory to tackle the noise present in the data.
For better understanding, the noise treatment used is discussed below.

NOISE TREATMENT: Any noise present in the data can affect the input features as well as
the output values. Noise present in an input attribute is usually referred to as attribute noise,
while noise that affects the output attribute introduces great bias in the output. The treatment
of noise can be done either by the data polishing method or by using noise filters. In the data
polishing method, noise is corrected only partially, which is sometimes beneficial; it is a
difficult task and is usually limited to small amounts of noise. That is why noise filters, which
identify and remove noise from the data during training, are used; no modification of the data
mining techniques is required for this treatment of noise [17].
The collected data is often stored across different data silos with no central database
or file system, so data integration is used to give the user a unified view.

II. DATA TRANSFORMATION

Data transformation, also called data preparation, ensures the compatibility of the data with
the data mining algorithm. Data transformation can be used to transform numerical data into
categorical data. The equal width method divides the range of a variable into several equally
sized intervals, where the number of intervals is predefined by the user. The equal frequency
method divides the data into several intervals, each containing approximately the same amount
of data. Data transformation can also be applied to transform categorical variables into
numerical ones to facilitate the development of prediction models. Data is generally very large,
so transforming it into small subsets helps analyse it more accurately and easily; data
partitioning is used for that purpose.

DATA PARTITIONING: Data partitioning divides the whole dataset into several groups for
in-depth analysis. It divides the data into two or three non-overlapping subsets: the training
set, the validation set and the test set. It is used to keep a subset of the available data out of
access and use it later for verification. A number of clustering algorithms have been applied
for data partitioning, such as k-means, hierarchical clustering, entropy weighting k-means
(EWKM), and fuzzy c-means clustering. Data normalization is another technique involved in
data transformation; it is used when there is a large difference between the maximum and
minimum values in the dataset.

DATA NORMALIZATION: Data normalization puts data into a consistent format and makes
sure that all values in the dataset are on the same scale. Widely spread data is hard to analyse
and compare, and data normalization helps solve this problem; after normalization the data
appears on a comparable scale across all records. The most common methods for
normalization are:
Max-min normalisation is used when the relative difference between values is
required. It transforms the data to a specified range. This method is sensitive to outliers, as
their presence may dramatically change the data range. The formula used is:
x′ = (x − x_min)/(x_max − x_min)
where x_min and x_max refer to the minimum and maximum of the variable x.

Z-score standardization scales data to mean 0 and standard deviation 1. It is suited to
normally distributed data and is less affected by outliers. The formula used is:
x′ = (x − μ)/σ
where μ is the mean and σ is the standard deviation.

Log transformation is used with highly skewed distributions and makes the data more
symmetric. The formula used is: x′ = log(x)

Decimal scaling moves the decimal point of the values so that they share a common scale [18].
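These formulas translate directly into NumPy; a small sketch on an illustrative array of widely spread values is shown below.

    import numpy as np

    x = np.array([2.0, 5.0, 9.0, 14.0, 200.0])   # illustrative, widely spread values

    # Max-min normalisation: rescale to the range [0, 1]
    x_minmax = (x - x.min()) / (x.max() - x.min())

    # Z-score standardization: mean 0, standard deviation 1
    x_zscore = (x - x.mean()) / x.std()

    # Log transformation: compress a highly skewed distribution
    x_log = np.log(x)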
Figure 12 below shows a graph representing the data before and after normalization.

Figure 12 shows that all the outliers present are removed after normalization and more symmetric data is
obtained [19].

Normalized data makes the predictions of an ML model more accurate and also increases its
speed. The next important step of data preprocessing is data reduction, which reduces the
dimensionality of the data and makes the algorithm faster.

III. DATA REDUCTION

Analysing a large dataset is time consuming and costly, so reducing its size is beneficial. Data
reduction includes summarizing, choosing the essential elements, looking for patterns in the
available data, and focusing on the important things. Many techniques exist to reduce the size
of a dataset without affecting the output; the choice depends on the type of data and the
requirements. Some techniques for data reduction are discussed below, the first being feature
selection.

FEATURE SELECTION: Feature selection reduces the dimensionality of a dataset by
removing as many irrelevant features as possible, to obtain a subset of features from the
original problem that still describes it adequately. This helps the learning algorithm operate
faster and more effectively. It can also be used at the data collection stage, saving cost in time,
sampling and sensing. Features that influence the output are called relevant features, while
features having no influence on the output are called irrelevant features. A feature selection
algorithm has two components: a selection algorithm and an evaluation algorithm. The figure
below explains the process of feature selection.

Figure 13 Flow chart representing the steps involved in reducing data with the feature selection method. It also
shows where the stopping criterion is used [20].

The selection algorithm generates proposed feature subsets and searches for an optimal subset,
while the evaluation algorithm determines how good a proposed feature subset is, returning
some measure of goodness to the selection algorithm. A stopping criterion is included so that
the feature selection process does not run through the space of subsets forever. Stopping
criteria can be based on whether the addition (or deletion) of any feature fails to produce a
better subset, or on whether an optimal subset according to some evaluation function has been
obtained.
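As a simple filter-style example, scikit-learn's SelectKBest scores each feature against the target and keeps the top k; the synthetic dataset and the choice of k = 5 below are placeholder assumptions.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Synthetic data: 20 features, of which only a few are informative
    X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                               random_state=0)

    # Evaluation: score each feature with the ANOVA F-test; selection: keep the best 5
    selector = SelectKBest(score_func=f_classif, k=5)
    X_reduced = selector.fit_transform(X, y)

    print("Kept feature indices:", selector.get_support(indices=True))
    print("Reduced shape:", X_reduced.shape)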
Space transformation is another way to reduce the dimensionality of a dataset; in this
case a whole new set of features is generated by combining the original ones. Another method
generally used for reducing data is data discretisation.

DATA DISCRETISATION: Discretisation converts a huge number of data values into
smaller ones by converting the values of continuous attributes into a finite set of intervals with
minimum data loss. Discretisation can be supervised or unsupervised. Supervised
discretisation uses class data, while unsupervised discretisation depends on the way the
operation proceeds, working either with a top-down splitting strategy or a bottom-up merging
strategy. Histogram analysis, binning and cluster analysis are methods that are generally used
for data discretisation.
Histogram analysis involves a plot that represents the frequency distribution of a
continuous dataset; the histogram assists the inspection of the data distribution.
Binning is a data smoothing technique that helps group a huge number of continuous
values into a smaller number of bins.
In cluster analysis, a clustering algorithm is executed by dividing the values of a feature
x into clusters, which isolates the computational values of feature x. Figure 14 given below
represents the discretization of continuous data.

Figure 14. This represents how continuous data values are changed into discrete values using the data
discretisation method [21].
Data discretization can also be done using decision trees and correlation analysis. In
decision tree analysis, a top-down splitting technique is used in a supervised procedure.
Correlation analysis is also a supervised procedure, in which linear regression techniques are
used to find the best neighbouring intervals [22].
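Equal-width binning, equal-frequency binning, and a clustering-based variant can be sketched with pandas and scikit-learn as below; the sample values and the choice of three bins are arbitrary illustrative assumptions.

    import pandas as pd
    from sklearn.preprocessing import KBinsDiscretizer

    values = pd.Series([1, 2, 3, 4, 5, 8, 13, 21, 34, 55], dtype=float)

    # Equal width: 3 intervals of the same length
    equal_width = pd.cut(values, bins=3)

    # Equal frequency: 3 intervals holding roughly the same number of values
    equal_freq = pd.qcut(values, q=3)

    # Cluster-based discretisation: bin edges chosen by k-means
    kmeans_bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
    codes = kmeans_bins.fit_transform(values.to_numpy().reshape(-1, 1))

    print(pd.DataFrame({"value": values, "width": equal_width,
                        "freq": equal_freq, "kmeans": codes.ravel()}))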
CONCLUSION

In summary, the acquired data, which is finite, is passed through data augmentation to expand
it into a far larger dataset using Artificial Intelligence technologies such as deep neural
networks (DNNs), convolutional neural networks (CNNs), generative adversarial networks
(GANs), variational autoencoders (VAEs) and neural style transfer. Data preprocessing
techniques such as data cleansing, data integration, data transformation, data partitioning and
data reduction are then applied to this data before it is used to train the model.

REFERENCES

1. Automated Data Acquisition System Using a Neural Network for Prediction Response in a
Mode-Locked Fiber Laser, Jose Ramon Martinez-Angulo, Eduardo Perez-Careta, Juan Carlos
Hernandez-Garcia, Sandra Marquez-Figueroa, Jose Hugo Barron Zambrano, Daniel Jauregui-
Vazquez, Jose David Filoteo-Razo, Jesus Pablo Lauterio-Cruz, Olivier Pottiez, Julian Moises
Estudillo-Ayala and Roberto Rojas-Laguna [https://www.mdpi.com/2079-9292/9/8/1181]

2. 5 Data Acquisition Strategies for Supervised Machine Learning, The Superb AI Roundup
[https://www.linkedin.com/pulse/5-data-acquisition-strategies-supervised-machine-learning-superbai]

3. Performance improvement of deep learning models using image augmentation techniques,
Mamillapally Nagaraju, Priyanka Chawla, Neeraj Kumar
[https://www.researchgate.net/publication/358087878_Performance_improvement_of_Deep_Learning_Models_using_image_augmentation_techniques]

4. Performance improvement of deep learning models using image augmentation techniques,
Mamillapally Nagaraju, Priyanka Chawla, Neeraj Kumar
[https://www.researchgate.net/publication/358087878_Performance_improvement_of_Deep_Learning_Models_using_image_augmentation_techniques]

5. The essential guide to data augmentation in deep learning, Deval Shah
[https://www.v7labs.com/blog/data-augmentation-guide]

6. Aspect-based sentiment analysis of customer speech data using deep convolutional neural
network and BiLSTM, Sivakumar Murugaiyan, U. Srivasulu Reddy
[https://www.researchgate.net/publication/369036271_Aspect-Based_Sentiment_Analysis_of_Customer_Speech_Data_Using_Deep_Convolutional_Neural_Network_and_BiLSTM]

7. Text data augmentation in natural language processing with TextAttack, Priya Tidke
[https://www.analyticsvidhya.com/blog/2022/02/text-data-augmentation-in-natural-language-processing-with-texattack/]

8. Can artificial intelligence assist project developers in long time management of energy
projects? The case of CO2 capture and storage, Eric Buah, Lassi Linnaen, Huapeng Wu,
Martim A. Kesse [https://www.mdpi.com/1996-1073/13/23/6259]

9. The essential guide to data augmentation in deep learning, Deval Shah
[https://www.v7labs.com/blog/data-augmentation-guide]

10. STaDA: Style Transfer as Data Augmentation, Xu Zheng, Tejo Chalasani, Koustav Ghosal,
Aljosa Smolic
[https://www.researchgate.net/publication/331777321_STaDA_Style_Transfer_as_Data_Augmentation]

11. Research on data preprocessing and categorization technique for smartphone review
analysis, Vivek Agarwal
[https://www.researchgate.net/publication/291019609_Research_on_Data_Preprocessing_and_Categorization_Technique_for_Smartphone_Review_Analysis]

12. A Review on Data Preprocessing Techniques Toward Efficient and Reliable Knowledge
Discovery From Building Operational Data, Cheng Fan, Meiling Chen, Xinghua Wang,
Jiayuan Wang, Bufu Huang
[https://www.researchgate.net/publication/350466129_A_Review_on_Data_Preprocessing_Techniques_Toward_Efficient_and_Reliable_Knowledge_Discovery_From_Building_Operational_Data]

13. Cleaning Big Data Streams: A Systematic Literature Review, Obaid Alotaibi, Eric Pardede,
Sarah Tomy [https://www.mdpi.com/2227-7080/11/4/101]

14. Missing value imputation: a review and analysis of the literature (2006-2017), Wei-Chao
Lin, Chih-Fong Tsai [https://link.springer.com/article/10.1007/s10462-019-09709-4]

15. Data Preprocessing for Supervised Learning, Sotiris Kotsiantis, Dimitris Kanellopoulos,
P. E. Pintelas, p. 113
[https://www.researchgate.net/publication/228084519_Data_Preprocessing_for_Supervised_Learning]

16. What are outlier detection methods in data mining?, Utkarsh
[https://www.scaler.com/topics/data-mining-tutorial/outlier-detection-methods-in-data-mining/]

17. Big data preprocessing: methods and prospects, Salvador García, Sergio Ramírez-Gallego,
Julián Luengo, José Manuel Benítez and Francisco Herrera
[https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0014-0]

18. A Review on Data Preprocessing Techniques Toward Efficient and Reliable Knowledge
Discovery From Building Operational Data, Cheng Fan, Meiling Chen, Xinghua Wang,
Jiayuan Wang and Bufu Huang
[https://www.frontiersin.org/articles/10.3389/fenrg.2021.652801/full]

19. Data normalization in data mining, Adegboyega Aare
[https://aare.substack.com/p/data-normalization-in-data-mining]

20. Data Preprocessing for Supervised Learning, Sotiris Kotsiantis, Dimitris Kanellopoulos,
P. E. Pintelas, p. 114
[https://www.researchgate.net/publication/228084519_Data_Preprocessing_for_Supervised_Learning]

21. Data discretization, Devanshi Patel
[https://medium.com/codex/data-discretization-b5faa2b77f06]

22. Discretization in data mining [https://www.javatpoint.com/discretization-in-data-mining]
