
What is Data Collection?

Data collection is a methodical process of gathering and analyzing specific information to answer relevant questions and evaluate the results. It focuses on finding out all there is to know about a particular subject. The collected data is then subjected to hypothesis testing, which seeks to explain a phenomenon.

Data collection is defined as a systematic method of obtaining, observing, measuring, and analyzing accurate information to support research conducted by professionals, regardless of their field. The general data collection methods used in the process are essentially the same: there are specific standards that must be strictly followed and implemented to make sure that data is collected accurately.

If the appropriate procedures are not given due importance, a variety of problems can arise and affect the study or research being conducted.

The most common risk is the inability to identify answers and draw correct conclusions for the study, as well as failure to validate whether the results are correct. These risks can also lead to questionable research, which can greatly affect your credibility.

Types of Data Collection


Before discussing the various types of data collection, it is pertinent to note that data collection itself falls under two broad categories: primary data collection and secondary data collection.
Primary Data Collection

Primary data collection, by definition, is the gathering of raw data at the source. It is the process of collecting original data by a researcher for a specific research purpose. It can be further divided into two segments: qualitative and quantitative data collection methods.
● Qualitative Research Method
Qualitative data collection methods do not involve numbers or data that needs to be deduced through mathematical calculation; rather, they are based on non-quantifiable elements such as the feelings or opinions of respondents. An example of such a method is an open-ended questionnaire.

● Quantitative Method
Quantitative data are presented in numbers and require mathematical calculation to interpret. An example would be the use of a questionnaire with close-ended questions to arrive at figures that can be analyzed mathematically, for instance with correlation, regression, mean, mode, and median.

Secondary Data Collection

Secondary data collection, on the other hand, refers to the gathering of second-hand data collected by an individual who is not the original user. It is the process of collecting data that already exists, whether in published books, journals, and/or online portals. It is much less expensive and easier to collect.
Your choice between primary and secondary data collection depends on the nature, scope, and area of your research as well as its aims and objectives.
IMPORTANCE OF DATA COLLECTION
There are several underlying reasons for collecting data, especially for a researcher.
● Integrity of the Research
A key reason for collecting data, be it through quantitative or
qualitative methods is to ensure that the integrity of the
research question is indeed maintained.
● Reduce the likelihood of errors
The correct use of appropriate data collection methods reduces the likelihood of errors in the results.
● Decision Making
To minimize the risk of errors in decision-making, it is
important that accurate data is collected so that the
researcher doesn't make uninformed decisions.
● Save Cost and Time
Data collection saves the researcher time and funds that
would otherwise be misspent without a deeper understanding
of the topic or subject matter.
● To support a need for a new idea, change, and/or
innovation
To prove the need for a change in the norm or the
introduction of new information that will be widely accepted,
it is important to collect data as evidence to support these
claims.
What is a Data Collection Tool?
Data collection tools refer to the devices/instruments used to collect data, such as a paper questionnaire or a computer-assisted interviewing system. Case studies, checklists, interviews, observation, and surveys or questionnaires are all tools used to collect data.
It is important to decide the tools for data collection because
research is carried out in different ways and for different
purposes. The objective behind data collection is to capture
quality evidence that allows analysis to lead to the
formulation of convincing and credible answers to the posed
questions.
The following are the top 7 data collection methods:
INTERVIEW

An interview is a face-to-face conversation between two individuals with the sole purpose of collecting relevant information to satisfy a research purpose. Interviews are of different types, namely structured, semi-structured, and unstructured, each with a slight variation from the others.
● Structured Interviews - Simply put, this is a verbally administered questionnaire. In terms of depth, it is surface level and is usually completed within a short period. It is highly recommended for speed and efficiency, but it lacks depth.
● Semi-structured Interviews - In this method, there are several key questions that cover the scope of the areas to be explored. It allows a little more leeway for the researcher to explore the subject matter.
● Unstructured Interviews - It is an in-depth interview
that allows the researcher to collect a wide range of
information with a purpose. An advantage of this
method is the freedom it gives a researcher to combine
structure with flexibility even though it is more time-
consuming.
Pros
● In-depth information
● Freedom of flexibility
● Accurate data.
Cons
● Time-consuming
● Expensive to collect.
What are the best Data Collection Tools for Interviews?
For collecting data through interviews, here are a few tools
you can use to easily collect data.
● Audio Recorder
An audio recorder is used for recording sound on disc, tape,
or film. Audio information can meet the needs of a wide
range of people, as well as provide alternatives to print data
collection tools.
● Digital Camera
An advantage of a digital camera is that it can capture images and transmit them to a monitor screen when the need arises.
● Camcorder
A camcorder is used for collecting data through interviews. It
provides a combination of both an audio recorder and a video
camera. The data provided is qualitative in nature and allows
the respondents to answer questions asked exhaustively. If
you need to collect sensitive information during an interview,
a camcorder might not work for you as you would need to
maintain your subject’s privacy.
QUESTIONNAIRES

This is the process of collecting data through an instrument consisting of a series of questions and prompts to receive a response from the individuals it is administered to. Questionnaires are designed to collect data from a group.
For clarity, it is important to note that a questionnaire isn't a survey; rather, it forms a part of one. A survey is a process of data gathering involving a variety of data collection methods, including a questionnaire.
On a questionnaire, three kinds of questions are used: fixed-alternative, scale, and open-ended, with each question tailored to the nature and scope of the research.
Pros
● Can be administered in large numbers and is cost-
effective.
● It can be used to compare and contrast previous
research to measure change.
● Easy to visualize and analyze.
● Questionnaires offer actionable data.
● Respondent identity is protected.
● Questionnaires can cover all areas of a topic.
● Relatively inexpensive.
Cons
● Answers may be dishonest or the respondents lose
interest midway.
● Questionnaires can't produce qualitative data.
● Questions might be left unanswered.
● Respondents may have a hidden agenda.
● Not all questions can be analyzed easily.
What are the best Data Collection Tools for Questionnaire?
● Formplus Online Questionnaire
Formplus lets you create powerful forms to help you collect the information you need. Use the Formplus online questionnaire template to get actionable trends and measurable responses, conduct research, optimize knowledge of your brand, or simply get to know an audience. The form template is fast, free and fully customizable.
● Paper Questionnaire
A paper questionnaire is a data collection tool consisting of a
series of questions and/or prompts for the purpose of
gathering information from respondents. Mostly designed for
statistical analysis of the responses, they can also be used as
a form of data collection.

REPORTING

By definition, data reporting is the process of gathering and submitting data to be further subjected to analysis. The key aspect of data reporting is reporting accurate data, because inaccurate data reporting leads to uninformed decision making.
Pros
● Informed decision-making.
● Easily accessible.
Cons
● Self-reported answers may be exaggerated.
● The results may be affected by bias.
● Respondents may be too shy to give out all the details.
● Inaccurate reports will lead to uninformed decisions.
What are the best Data Collection Tools for Reporting?
Reporting tools enable you to extract and present data in
charts, tables, and other visualizations so users can find
useful information. You could source data for reporting from Non-Governmental Organization (NGO) reports, newspapers, website articles, and hospital records.
● NGO Reports
Contained in NGO reports is an in-depth and comprehensive
report on the activities carried out by the NGO, covering
areas such as business and human rights. The information
contained in these reports is research-specific and forms an
acceptable academic base for collecting data. NGOs often
focus on development projects which are organized to
promote particular causes.
● Newspapers
Newspaper data are relatively easy to collect and are
sometimes the only continuously available source of event
data. Even though there is a problem of bias in newspaper
data, it is still a valid tool in collecting data for Reporting.

● Website Articles
Gathering and using data contained in website articles is another tool for data collection. Collecting data from web articles is quicker and less expensive than many other methods. Two major disadvantages of this data reporting method are the biases inherent in the data collection process and possible security/confidentiality concerns.
● Hospital Care records
Health care involves a diverse set of public and private data
collection systems, including health surveys, administrative
enrollment and billing records, and medical records, used by
various entities, including hospitals, CHCs, physicians, and
health plans. The data provided is clear, unbiased and
accurate, but must be obtained under legal means as
medical data is kept with the strictest regulations.
EXISTING DATA

This is the introduction of new investigative questions in addition to, or other than, the ones originally used when the data was first gathered. It involves adding measurements to an existing study or research. An example would be sourcing data from an archive.
Pros
● Accuracy is very high.
● Easily accessible information.
Cons
● Problems with evaluation.
● Difficulty in understanding.
What are the Best Data Collection Tools for Existing Data?
The concept of Existing data means that data is collected
from existing sources to investigate research questions other
than those for which the data were originally gathered. Tools
to collect existing data include:
● Research Journals - Unlike newspapers and
magazines, research journals are intended for an
academic or technical audience, not general readers. A
journal is a scholarly publication containing articles
written by researchers, professors, and other experts.

● Surveys - A survey is a data collection tool for gathering information from a sample population, with the intention of generalizing the results to a larger population. Surveys have a variety of purposes and can be carried out in many ways depending on the objectives to be achieved.
OBSERVATION

This is a data collection method by which information on a phenomenon is gathered through observation. The observation may be carried out as a complete observer, an observer as a participant, a participant as an observer, or a complete participant. This method is a key basis for formulating a hypothesis.
Pros
● Easy to administer.
● Results tend to be more accurate.
● It is a universally accepted practice.
● It sidesteps respondents' unwillingness to complete a report.
● It is appropriate for certain situations.
Cons
● Some phenomena aren’t open to observation.
● It cannot be relied upon.
● Bias may arise.
● It is expensive to administer.
● Its validity cannot be predicted accurately.
What are the best Data Collection Tools for Observation?
Observation involves the active acquisition of information
from a primary source. Observation can also involve the
perception and recording of data via the use of scientific
instruments. The best tools for Observation are:
● Checklists - Checklists state specific criteria and allow users to gather information and make judgments about what subjects should know in relation to the outcomes. They offer systematic ways of collecting data about specific behaviours, knowledge, and skills.
● Direct observation - This is an observational study
method of collecting evaluative information. The
evaluator watches the subject in his or her usual
environment without altering that environment.
FOCUS GROUPS

Unlike quantitative research, which involves numerical data, this data collection method focuses on qualitative research. It falls under the primary category and is based on the feelings and opinions of the respondents. It involves asking open-ended questions to a group of individuals, usually ranging from 6-10 people, to provide feedback.
Pros
● Information obtained is usually very detailed.
● Cost-effective when compared to one-on-one interviews.
● It reflects speed and efficiency in the supply of results.
Cons
● Lacking depth in covering the nitty-gritty of a subject
matter.
● Bias might still be evident.
● Requires interviewer training
● The researcher has very little control over the outcome.
● A few vocal voices can drown out the rest.
● Difficulty in assembling an all-inclusive group.
What are the best Data Collection Tools for Focus Groups?
A focus group is a data collection method that is tightly facilitated and structured around a set of questions. The purpose of the meeting is to extract detailed responses to these questions from the participants. The best tools for tackling focus groups are:
● Two-Way - One group watches another group answer the questions posed by the moderator. After listening to what the other group has to offer, the group that listens is able to facilitate further discussion and could potentially draw different conclusions.
● Dueling-Moderator - There are two moderators who
play the devil’s advocate. The main positive of the
dueling-moderator focus group is to facilitate new ideas
by introducing new ways of thinking and varying
viewpoints.
COMBINATION RESEARCH

This method of data collection encompasses the use of innovative methods to enhance participation by both individuals and groups. Also under the primary category, it combines interviews and focus groups while collecting qualitative data. This method is key when addressing sensitive subjects.
Pros
● Encourage participants to give responses.
● It stimulates a deeper connection between participants.
● The relative anonymity of respondents increases
participation.
● It improves the richness of the data collected.
Cons
● It costs the most out of all the top 7.
● It's the most time-consuming.
What are the best Data Collection Tools for Combination Research?
The Combination Research method involves two or more data
collection methods, for instance, interviews as well as
questionnaires or a combination of semi-structured telephone
interviews and focus groups. The best tools for combination
research are:
● Online Survey - The two tools combined here are online interviews and questionnaires. This is a questionnaire that the target audience can complete over the Internet. It is timely, effective, and efficient, especially since the data to be collected is quantitative in nature.
● Dual-Moderator - The two tools combined here are focus groups and structured questionnaires. The structured questionnaires give a direction as to where the research is headed, while two moderators take charge of proceedings: one ensures the focus group session progresses smoothly, while the other makes sure that all the topics in question are covered. Dual-moderator focus groups typically result in a more productive session and essentially lead to an optimum collection of data.
Data Pre-processing
Data pre-processing is the process of transforming raw data into an
understandable format. It is also an important step in data mining as we
cannot work with raw data. The quality of the data should be checked
before applying machine learning or data mining algorithms.

Pre-processing of data is mainly about checking and improving data quality. Quality can be assessed in terms of factors such as accuracy, completeness, consistency, timeliness, believability, and interpretability.

Major Tasks in Data Pre-processing:


1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data cleaning:
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from datasets; it also replaces missing values. Some common data cleaning techniques are described below.

Handling missing values:


● Standard values like “Not Available” or “NA” can be used to replace the missing values.
● Missing values can also be filled manually, but this is not recommended when the dataset is big.
● The attribute’s mean value can be used to replace the missing value when the data is normally distributed, whereas in the case of a non-normal distribution the median value of the attribute can be used.
● While using regression or decision tree algorithms, the missing value can be replaced by the most probable value.
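As a rough illustration, here is a minimal pandas sketch of the mean/median filling options above, assuming a hypothetical numeric column named income; the column name and values are invented for demonstration.

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values in a numeric attribute.
df = pd.DataFrame({"income": [45000, 52000, np.nan, 61000, np.nan, 48000]})

# Option 1: flag missing entries with a standard label such as "NA"
# (note this changes the column's dtype to object).
flagged = df["income"].astype(object).fillna("NA")

# Option 2: fill with the mean (roughly appropriate when the data is
# normally distributed) or the median (more robust for skewed data).
mean_filled = df["income"].fillna(df["income"].mean())
median_filled = df["income"].fillna(df["income"].median())

print(mean_filled.tolist())
print(median_filled.tolist())
```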
Noisy:
Noisy data generally means data containing random error or unnecessary data points. Here are some of the methods used to handle noisy data.

1. Binning: This method is used to smooth noisy data. First the data is sorted, then the sorted values are separated and stored in the form of bins. There are three methods for smoothing the data in a bin (a short sketch of bin-mean smoothing appears after this list of methods):

● Smoothing by bin mean: In this method, the values in the bin are replaced by the mean value of the bin.
● Smoothing by bin median: In this method, the values in the bin are replaced by the median value of the bin.
● Smoothing by bin boundary: In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.

2. Regression: This is used to smooth the data and helps handle it when unnecessary data is present. For analysis purposes, regression helps decide which variables are suitable for our analysis.

3. Clustering: This is used for finding the outliers and also in grouping the
data. Clustering is generally used in unsupervised learning.
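As referenced in the binning method above, the following is a small, self-contained sketch of smoothing by bin means using equal-frequency bins; the price values are example data chosen purely for illustration.

```python
import numpy as np

def smooth_by_bin_mean(values, n_bins):
    """Equal-frequency binning: sort the data, split it into bins, and
    replace each value by its bin's mean (smoothing by bin means)."""
    data = np.sort(np.asarray(values, dtype=float))
    smoothed = []
    for bin_vals in np.array_split(data, n_bins):
        smoothed.extend([bin_vals.mean()] * len(bin_vals))
    return smoothed

# Example: 12 sorted prices split into 3 bins of 4 values each.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_mean(prices, n_bins=3))
# The first bin (4, 8, 9, 15) becomes four 9.0 values, and so on.
```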

Data integration:
Data integration is the process of bringing data from disparate sources together to provide users with a unified view. The premise of data integration is to make data more freely available and easier to consume and process by systems and users. Data integration done right can reduce IT costs, free up resources, improve data quality, and foster innovation, all without sweeping changes to existing applications or data structures. And though IT organizations have always had to integrate, the payoff for doing so has potentially never been greater than it is right now.

1. Schema integration: Integrates metadata (data that describes other data) from different sources.

2. Entity identification problem: Identifying the same entities across multiple databases. For example, the system or the user should know that the student_id in one database and the student name in another database belong to the same entity.

3. Detecting and resolving data value conflicts: The data taken from different databases may differ when merged; for instance, the attribute values from one database may differ from another database. For example, the date format may differ, such as “MM/DD/YYYY” versus “DD/MM/YYYY”.
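A minimal pandas sketch of these three concerns, assuming two hypothetical student tables whose key names and date formats differ; all column names and values are invented for illustration.

```python
import pandas as pd

# Two hypothetical sources describing the same students under different
# schemas: one stores "student_id" with a US-style date, the other "id"
# with a day-first date.
enrolment = pd.DataFrame({"student_id": [1, 2],
                          "enrol_date": ["09/01/2023", "10/15/2023"]})  # MM/DD/YYYY
results = pd.DataFrame({"id": [1, 2], "grade": ["A", "B"],
                        "exam_date": ["20/12/2023", "21/12/2023"]})     # DD/MM/YYYY

# Schema integration / entity identification: map both keys to one name.
results = results.rename(columns={"id": "student_id"})

# Resolve the data value conflict by parsing both date columns into a
# single datetime representation before merging.
enrolment["enrol_date"] = pd.to_datetime(enrolment["enrol_date"], format="%m/%d/%Y")
results["exam_date"] = pd.to_datetime(results["exam_date"], format="%d/%m/%Y")

# Unified view of both sources.
unified = enrolment.merge(results, on="student_id")
print(unified)
```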

Data reduction:
Data reduction is a process that reduces the volume of the original data and represents it in a much smaller form while maintaining the integrity of the original data. Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume, and by reducing the data, the efficiency of the data mining process is improved.
Data reduction does not materially affect the result obtained from data mining: the result obtained before and after data reduction is the same or almost the same.
Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. The reduction of the data may be in terms of the number of rows (records) or the number of columns (dimensions).

Techniques of Data Reduction


The following are the main techniques or methods of data reduction in data mining:
1. Dimensionality Reduction
Whenever we encounter weakly relevant data, we keep only the attributes required for our analysis. Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing the volume of the original data. It reduces data size by eliminating outdated or redundant features. Here are three methods of dimensionality reduction.
i. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both A and A' are of the same length. It is useful in reducing data because the wavelet-transformed data can be truncated: the compressed data is obtained by retaining only a small fragment of the strongest wavelet coefficients. The wavelet transform can be applied to data cubes, sparse data, or skewed data.
ii. Principal Component Analysis: Suppose we have a data set to be analyzed that has tuples with n attributes. Principal component analysis searches for k orthogonal vectors (k ≤ n) that can best represent the data set.
In this way, the original data can be projected onto a much smaller space, and dimensionality reduction is achieved. Principal component analysis can be applied to sparse and skewed data.
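A short scikit-learn sketch of dimensionality reduction with PCA, assuming synthetic random data with n = 5 attributes reduced to k = 2 components; the data set is invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 100 tuples with n = 5 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Project onto k = 2 principal components (k < n), giving a much
# smaller representation of the original data.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured per component
```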
iii. Attribute Subset Selection: A large data set has many attributes, some of which are irrelevant to data mining and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating these redundant and irrelevant attributes.
Attribute subset selection ensures that we get a good subset of the original attributes even after eliminating the unwanted ones, so that the resulting probability distribution of the data is as close as possible to the original distribution obtained using all the attributes.
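A minimal sketch of one simple form of attribute subset selection, using scikit-learn's VarianceThreshold to drop a near-constant (uninformative) attribute; the small matrix is invented for illustration, and real attribute selection would typically also consider relevance to the mining task.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: the third attribute is constant, so it carries
# no information for mining.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 12.0, 5.0],
              [3.0,  9.0, 5.0],
              [4.0, 11.0, 5.0]])

# Drop attributes whose variance falls below a small threshold; what
# remains is a subset of the original attributes.
selector = VarianceThreshold(threshold=0.1)
X_subset = selector.fit_transform(X)

print(selector.get_support())  # [ True  True False]
print(X_subset.shape)          # (4, 2)
```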

2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it in a much smaller form. This technique includes two types: parametric and non-parametric numerosity reduction.
1. Parametric: Parametric numerosity reduction stores only the model parameters instead of the original data. One family of parametric methods is regression and log-linear models.
● Regression and Log-Linear: Linear regression models a relationship between two attributes by fitting a linear equation to the data set. Suppose we need to model a linear function between two attributes:
y = wx + b
Here, y is the response attribute and x is the predictor attribute. In data mining terms, attributes x and y are numeric database attributes, whereas w and b are the regression coefficients. Multiple linear regression lets the response variable y be modelled as a linear function of two or more predictor variables. A log-linear model discovers the relationship between two or more discrete attributes in the database: given a set of tuples in n-dimensional space, the log-linear model is used to study the probability of each tuple in this multidimensional space. Regression and log-linear methods can be used for sparse and skewed data.
2. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques give a more uniform reduction irrespective of data size, but they may not achieve as high a volume of reduction as parametric techniques. The main non-parametric data reduction techniques are histograms, clustering, sampling, data cube aggregation, and data compression.

● Histogram: A histogram is a graph that represents a frequency distribution, describing how often each value appears in the data. A histogram uses the binning method to represent an attribute's data distribution; it uses disjoint subsets that we call bins or buckets.
A histogram can represent dense, sparse, uniform, or skewed data. Instead of only one attribute, a histogram can be implemented for multiple attributes; it can effectively represent up to five attributes.
● Clustering: Clustering techniques group similar objects from the data so that the objects in a cluster are similar to each other but dissimilar to objects in other clusters.
How similar the objects inside a cluster are can be calculated using a distance function: the greater the similarity between objects in a cluster, the closer they appear within the cluster. The quality of a cluster depends on its diameter, i.e., the maximum distance between any two objects in the cluster.
The cluster representation replaces the original data. This technique is more effective when the data can be classified into distinct clusters.
● Sampling: One of the methods used for data reduction is sampling, as it can reduce a large data set to a much smaller data sample. Below are the different ways in which we can sample a large data set D containing N tuples (a short sketch follows this list):
a) Simple random sample without replacement (SRSWOR) of size s: Here, s tuples are drawn from the N tuples in data set D (s < N), and a tuple cannot be drawn more than once. The probability of drawing any tuple from data set D is 1/N, meaning all tuples have an equal probability of being sampled.
b) Simple random sample with replacement (SRSWR) of size s: It is similar to SRSWOR, except that each drawn tuple is recorded and then replaced into data set D so that it can be drawn again.

c) Cluster sample: The tuples in data set D are clustered into M mutually disjoint subsets. Data reduction can then be applied by running SRSWOR on these clusters; a simple random sample of s clusters can be generated, where s < M.
d) Stratified sample: The large data set D is partitioned into mutually disjoint sets called 'strata'. A simple random sample is taken from each stratum to get stratified data. This method is effective for skewed data.
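As referenced above, here is a short pandas sketch of SRSWOR, SRSWR, and stratified sampling, assuming a synthetic data set D with N = 1000 tuples and an arbitrary stratum definition; it assumes a reasonably recent pandas version that supports sampling on grouped data.

```python
import pandas as pd

# Hypothetical data set D with N = 1000 tuples.
D = pd.DataFrame({"value": range(1000)})
s = 50

# Simple random sample without replacement (SRSWOR) of size s.
srswor = D.sample(n=s, replace=False, random_state=1)

# Simple random sample with replacement (SRSWR) of size s.
srswr = D.sample(n=s, replace=True, random_state=1)

# Stratified sample: partition D into strata and draw a small random
# sample from each stratum (here, 5 tuples per stratum).
D["stratum"] = D["value"] // 250
stratified = D.groupby("stratum", group_keys=False).sample(n=5, random_state=1)

print(len(srswor), len(srswr), len(stratified))  # 50 50 20
```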

3. Data Cube Aggregation

This technique is used to aggregate data into a simpler form. Data cube aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.
For example, suppose you have quarterly sales data for All Electronics from 2018 to 2022. If you want the annual sales per year, you simply aggregate the sales per quarter for each year. In this way, aggregation provides you with the required data, which is much smaller in size, and we thereby achieve data reduction even without losing any information.
Data cube aggregation eases multidimensional analysis: the data cube holds precomputed and summarized data, which makes data mining faster.
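A minimal pandas sketch of aggregating hypothetical quarterly sales up to annual totals; the figures are invented for illustration.

```python
import pandas as pd

# Hypothetical quarterly sales for two of the years.
sales = pd.DataFrame({
    "year":    [2018] * 4 + [2019] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "amount":  [200, 250, 300, 350, 220, 270, 310, 360],
})

# Aggregate the quarterly figures up to annual totals: a much smaller
# representation of the same information.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
#    year  amount
# 0  2018    1100
# 1  2019    1160
```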
4. Data Compression
Data compression modifies, encodes, or converts the structure of data in a way that consumes less space. It involves building a compact representation of information by removing redundancy and representing data in binary form. Compression from which the original data can be restored exactly is called lossless compression; in contrast, compression from which the original form cannot be fully restored is called lossy compression. Dimensionality and numerosity reduction methods are also used for data compression.
This technique reduces the size of files using different encoding mechanisms, such as Huffman encoding and run-length encoding. We can divide it into two types based on the compression technique:
i. Lossless Compression: Encoding techniques such as run-length encoding allow a simple and minimal reduction of data size. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from the original data but are still useful enough to retrieve information from. For example, the JPEG image format uses lossy compression, but we can still recover meaning equivalent to the original image. Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this kind of compression.
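A small sketch of lossless compression using run-length encoding in plain Python; the input string is arbitrary and the functions are illustrative, not a production codec.

```python
from itertools import groupby

def rle_encode(data):
    """Lossless run-length encoding: store each symbol once with its run length."""
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    """Restore the exact original sequence from the (symbol, count) pairs."""
    return "".join(symbol * count for symbol, count in pairs)

original = "AAAABBBCCD"
encoded = rle_encode(original)
print(encoded)                          # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded) == original)  # True: no information is lost
```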

5. Discretization Operation
The data discretization technique is used to divide continuous attributes into data with intervals. We replace the many constant values of an attribute with labels for a small number of intervals, which means that mining results can be shown in a concise and easily understandable way.
i. Top-down discretization: If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of the attribute and repeat this process until the end, the process is known as top-down discretization, also known as splitting.
ii. Bottom-up discretization: If you first consider all the constant values as split points and then discard some by combining neighbourhood values into intervals, the process is called bottom-up discretization.

Benefits of Data Reduction

The main benefit of data reduction is simple: the more data you can fit into a terabyte of disk space, the less capacity you will need to purchase. Here are some benefits of data reduction:
o Data reduction can save energy.
o Data reduction can reduce your physical storage costs.
o Data reduction can decrease your data center footprint.

Data reduction greatly increases the efficiency of a storage system and directly impacts your total spending on capacity.

Data Transformation:
Data transformation is the process of changing the format, structure, or values of data. For data
analytics projects, data may be transformed at two stages of the data pipeline. Organizations that
use on-premises data warehouses generally use an ETL (extract, transform, load) process, in
which data transformation is the middle step. Today, most organizations use cloud-based data
warehouses, which can scale compute and storage resources with latency measured in seconds or
minutes. The scalability of the cloud platform lets organizations skip preload transformations and
load raw data into the data warehouse, then transform it at query time — a model called ELT
(extract, load, transform).

The data are transformed in ways that are ideal for mining the data. The data
transformation involves steps that are:
1. Smoothing:
Smoothing is a process used to remove noise from the dataset using certain algorithms. It allows important features present in the dataset to be highlighted and helps in predicting patterns. When collecting data, it can be manipulated to eliminate or reduce variance or other forms of noise.
The concept behind data smoothing is that it can identify simple changes to help predict different trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not otherwise see.
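A minimal sketch of one common smoothing approach, a centered moving average over a hypothetical noisy series; the window size and values are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical noisy daily measurements (the 40 is an obvious spike).
series = pd.Series([10, 13, 9, 40, 11, 12, 10, 14, 9, 11])

# A 3-point centered rolling (moving) average dampens random spikes
# and makes the underlying trend easier to see.
smoothed = series.rolling(window=3, center=True).mean()
print(smoothed.round(2).tolist())
```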
2. Aggregation:
Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple data sources and integrated into a single data analysis description. This is a crucial step, since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used: gathering accurate data of high quality, and in a large enough quantity, is necessary to produce relevant results.
Aggregated data is useful for everything from decisions concerning financing or business strategy to product pricing, operations, and marketing strategies.
For example, sales data may be aggregated to compute monthly and annual totals.
3. Discretization:
This is the process of transforming continuous data into a set of small intervals. Most data mining activities in the real world involve continuous attributes, yet many existing data mining frameworks are unable to handle them. Also, even when a data mining task can manage a continuous attribute, it can significantly improve its efficiency by replacing the continuous attribute with its discrete values.
For example, (1-10, 11-20) or (age: young, middle age, senior).
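A short pandas sketch of discretizing a hypothetical age attribute both into equal-width intervals and into concept labels; the bin edges and labels are illustrative choices, not fixed definitions.

```python
import pandas as pd

# Hypothetical continuous age values.
ages = pd.Series([15, 22, 25, 37, 45, 58, 63, 71])

# Equal-width intervals such as (0-10], (10-20], ...
intervals = pd.cut(ages, bins=range(0, 81, 10))

# Concept labels: replace the continuous values with a small set of labels.
labels = pd.cut(ages, bins=[0, 30, 55, 120],
                labels=["young", "middle age", "senior"])

print(intervals.tolist())
print(labels.tolist())
```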

4. Attribute Construction:
New attributes are created from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.

5. Generalization:
This converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, an age attribute initially in numerical form (22, 25) is converted into a categorical value (young, old), or categorical attributes such as house addresses may be generalized to higher-level definitions such as town or country.

6. Normalization: Data normalization involves converting all data variables into a given range.
Techniques used for normalization are:
● Min-Max Normalization:
This transforms the original data linearly. Suppose min_P is the minimum and max_P is the maximum value of an attribute P, and [new_min, new_max] is the range we want to map into. A value v is normalized to v' by computing
v' = ((v - min_P) / (max_P - min_P)) * (new_max - new_min) + new_min
where v is the value you want to plot in the new range and v' is the new value you get after normalizing the old value.

For example:
Suppose the minimum and maximum values for the attribute profit (P) are Rs. 10,000 and Rs. 100,000, and we want to plot profit in the range [0, 1]. Using min-max normalization, the value Rs. 20,000 for attribute profit maps to
v' = (20,000 - 10,000) / (100,000 - 10,000) * (1 - 0) + 0 = 10,000 / 90,000 ≈ 0.11
and hence we get the value of v' as 0.11.


● Z-Score Normalization:
In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean of A and its standard deviation. A value v of attribute A is normalized to v' by computing
v' = (v - mean_A) / std_A
For example:
Let the mean of an attribute P be 60,000 and its standard deviation be 10,000. Using z-score normalization, a value of 85,000 for P is transformed to
v' = (85,000 - 60,000) / 10,000 = 2.5
and hence we get the value of v' to be 2.5.

● Decimal Scaling:
This normalizes the values of an attribute by moving the position of their decimal points. The number of places the decimal point is moved is determined by the maximum absolute value of attribute A. A value v of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1.
For example:
Suppose the values of an attribute P vary from -99 to 99, so the maximum absolute value of P is 99. To normalize the values we divide them by 100 (i.e., j = 2, the number of digits in the largest absolute value), so that the values come out as 0.98, 0.97 and so on.
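A small sketch implementing the three normalization formulas above and reproducing the worked values (0.11 and 2.5); the decimal-scaling helper shows one way to compute j and is an illustrative choice.

```python
import numpy as np

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale v into [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Z-score (zero-mean) normalization."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest integer with max(|v'|) < 1."""
    j = int(np.ceil(np.log10(np.max(np.abs(values)) + 1)))
    return np.asarray(values) / (10 ** j)

print(round(min_max(20_000, 10_000, 100_000), 2))  # 0.11
print(z_score(85_000, 60_000, 10_000))              # 2.5
print(decimal_scaling([-99, 77, 98]))               # [-0.99  0.77  0.98]
```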
