
DATA ANALYTICS

BASIC TERMINOLOGIES
BIG DATA

Big data is a field that treats ways to analyze, systematically extract information from, or
otherwise deal with data sets that are too large or complex to be dealt with by traditional
data-processing application software.

4V PROPERTIES OF BIG DATA

 Volume
 Variety
 Velocity
 Veracity

Volume of Big Data

The volume of data refers to the size of the data sets that need to be analyzed and processed,
which are now frequently larger than terabytes and petabytes. The sheer volume of the data
requires processing technologies distinct from traditional storage and processing capabilities.
In other words, the data sets in Big Data are too large to process with a regular laptop or
desktop processor. An example of a high-volume data set would be all credit card transactions
made in Europe on a single day.

Velocity of Big Data

Velocity refers to the speed with which data is generated. High velocity data is generated
at such a pace that it requires distinct (distributed) processing techniques. Examples of
data generated with high velocity are Twitter messages and Facebook posts.

Variety of Big Data

Variety makes Big Data really big. Big Data comes from a great variety of sources and
generally falls into one of three types: structured, semi-structured and unstructured data. The
variety in data types frequently requires distinct processing capabilities and specialist
algorithms. An example of a high-variety data set would be the CCTV audio and video files
that are generated at various locations in a city.

Veracity of Big Data

Veracity refers to the quality of the data that is being analyzed. High veracity data has many
records that are valuable to analyze and that contribute in a meaningful way to the overall
results. Low veracity data, on the other hand, contains a high percentage of meaningless data.
The non-valuable records in these data sets are referred to as noise. An example of a high veracity data
set would be data from a medical experiment or trial.

Data that is high volume, high velocity and high variety must be processed with advanced
tools (analytics and algorithms) to reveal meaningful information. Because of these
characteristics of the data, the knowledge domain that deals with the storage, processing, and
analysis of these data sets has been labeled Big Data.

FORMS OF DATA

• A collection of information stored in a particular file is represented as a form of data.

– STRUCTURED FORM

• Any form of relational database structure where relations between attributes are
possible, i.e. there exists a relation between rows and columns in the database
with a table structure. Eg: data handled using database programming languages
(SQL, Oracle, MySQL etc.).

– UNSTRUCTURED FORM

• Any form of data that does not have a predefined structure is represented as
unstructured data. Eg: video, images, comments, posts, and websites such as
blogs and Wikipedia.

– SEMI STRUCTURED DATA

• Does not have the tabular form of RDBMS data, but predefined, organized
formats are available.
• Eg: CSV, XML, JSON, TXT files with a tab separator etc.
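
As an illustration, the short Python sketch below reads a small JSON string and a small CSV string, two common semi-structured formats; the field names and values are made up for illustration.

    import csv, io, json

    # JSON (semi-structured): nested key/value pairs with no fixed table schema.
    record = json.loads('{"user": "alice", "posts": [{"id": 1, "text": "hello"}]}')
    print(record["posts"][0]["text"])            # -> hello

    # CSV (semi-structured): delimiter-separated rows interpreted through a header row.
    csv_text = "name,comment\nbob,nice post\ncarol,thanks"
    for row in csv.DictReader(io.StringIO(csv_text)):
        print(row["name"], row["comment"])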

SOURCES OF DATA

– There are two types of sources of data available.

– PRIMARY SOURCE OF DATA

• Eg: data created by an individual or a business concern on their own.

– SECONDARY SOURCE OF DATA

• Eg: data extracted from cloud servers and website sources (Kaggle, UCI,
AWS, Google Cloud, Twitter, Facebook, YouTube, GitHub etc.).

DATA ANALYSIS

Data analysis is a process of inspecting, cleansing, transforming and modeling data with
the goal of discovering useful information, informing conclusions and supporting
decision-making.

DATA ANALYTICS

• Data analytics is the science of analyzing raw data in order to draw conclusions about
that information. This information can then be used to optimize processes to increase
the overall efficiency of a business or system.

Types:

– Descriptive analytics Eg: (observation, case-study, surveys)

In descriptive analytics the result always leads to a probability among ‘n’ options,
where each option has an equal chance.

– Predictive analytics Eg: healthcare, sports, weather, insurance, social media analysis.

This type of analytics uses past data to predict future outcomes and make decisions based on
certain algorithms. In the case of a doctor, the doctor questions the patient about the
past in order to treat the illness through already existing procedures.

– Prescriptive analytics Eg: healthcare, banking.

Prescriptive analytics works with predictive analytics, which uses data to determine
near-term outcomes. Prescriptive analytics makes use of machine learning to help
businesses decide a course of action based on a computer program's predictions.
DIFFERENCE BETWEEN DATA ANALYTICS AND DATA ANALYSIS

Characteristics   Data Analytics                              Data Analysis

Form              Used in business to make decisions          A form of data analytics used in business
                  from data (data driven)                     to identify useful information in data

Structure         A process of data collection with           Cleaning and transforming the data
                  various strategies

Tools             Excel, Python, R etc.                       KNIME, NodeXL, RapidMiner etc.

Prediction        Analytics means we try to draw              Analysis means we always analyze what
                  conclusions about the future                has happened in the past

MACHINE LEARNING

• Machine learning is an application of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being
explicitly programmed.
• Machine learning focuses on the development of computer programs that can access
data and use it to learn for themselves.

Fig 0.2: Relation between machine learning and data analytics

In general, data is passed to a machine learning tool, which performs descriptive data analytics
through a set of algorithms built into it. Here both data analytics and data analysis are done by
the tool automatically. Hence we can say that data analysis is a sub-component of data analytics,
and data analytics is a sub-component of the machine learning tool, as described in
figure 0.2. The output of this machine learning tool is a model. From this model,
predictive analytics and prescriptive analytics can be performed, because the model feeds its
output back to the machine learning tool as data. This cycle continues till we get an efficient output.
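
The following is a minimal sketch of this cycle in Python, assuming scikit-learn as the machine learning tool (the text does not prescribe a particular tool) and a made-up toy data set: a descriptive summary is produced first, a model is then fitted, and the model's prediction becomes input for the next round of analytics.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy data set (assumed for illustration): hours studied vs. exam score.
    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([52, 58, 65, 71, 78])

    # Descriptive analytics / data analysis: summarise what has already happened.
    print("mean score:", y.mean(), "max score:", y.max())

    # Machine learning step: the tool builds a model from the data.
    model = LinearRegression().fit(X, y)

    # Predictive analytics: the model's output becomes new data for the next cycle.
    print("predicted score for 6 hours of study:", model.predict([[6]])[0])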
UNIT - I

1.1 DESIGN DATA ARCHITECTURE AND MANAGE THE DATA FOR ANALYSIS

Data architecture is composed of models, policies, rules or standards that govern which
data
is collected, and how it is stored, arranged, integrated, and put to use in data systems and
in organizations. Data is usually one of several architecture domains that form the pillars of
an enterprise architecture or solution architecture.

Various constraints and influences will have an effect on data architecture design. These
include enterprise requirements, technology drivers, economics, business policies and data
processing needs.

• Enterprise requirements
These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed), transaction
reliability, and transparent data management. In addition, the conversion of raw data such as
transaction records and image files into more useful information forms through such
features as data warehouses is also a common organizational requirement, since this enables
managerial decision making and other organizational processes. One of the architecture
techniques is the split between managing transaction data and (master) reference data.
Another one is splitting data capture systems from data retrieval systems (as done in a
data warehouse).

• Technology drivers
These are usually suggested by the completed data architecture and database
architecture designs. In addition, some technology drivers will derive from existing
organizational integration frameworks and standards, organizational economics, and
existing site resources (e.g. previously purchased software licensing).

• Economics
These are also important factors that must be considered during the data architecture phase.
It is possible that some solutions, while optimal in principle, may not be potential
candidates due to their cost. External factors such as the business cycle, interest rates,
market conditions, and legal considerations could all have an effect on decisions relevant to
data architecture.

 Business policies
Business policies that also drive data architecture design include internal organizational
policies, rules of regulatory bodies, professional standards, and applicable governmental
laws that can vary by applicable agency. These policies and rules help describe the
manner in which the enterprise wishes to process its data.
• Data processing needs
These include accurate and reproducible transactions performed in high volumes, data
warehousing for the support of management information systems (and potential data
mining), repetitive periodic reporting, ad hoc reporting, and support of various
organizational initiatives as required (e.g. annual budgets, new product development).

The general approach is based on designing the architecture at three levels of
specification, as shown below in figure 1.1:
 The Logical Level
 The Physical Level
 The Implementation Level

Fig 1.1: Three-level architecture in data analytics.

The logical view, or user's view, of data analytics represents data in a format that is
meaningful to a user and to the programs that process those data. That is, the logical
view tells the user, in user terms, what is in the database. The logical level consists of data
requirements and process models, which are processed using data modelling techniques to
produce a logical data model.

The physical level is created when we translate the top-level design into physical tables in
the database. This model is created by the database architect, software architects, software
developers or the database administrator. The input to this level comes from the logical level,
and various data modelling techniques are used here with input from software developers or
the database administrator. These data modelling techniques are various formats of
representation of data such as the relational data model, network model, hierarchical model,
object-oriented model, and entity-relationship model.
The implementation level contains details about the modification and presentation of data through
the use of various data mining tools such as R-Studio, WEKA, Orange etc. Each tool
has its own specific way of working and a different representation for viewing the same data.
These tools are very helpful to the user since they are user friendly and do not require much
programming knowledge from the user.

1.2 Various Sources of Data

Understand various primary sources of the Data

Data can be generated from two types of sources, namely primary and secondary.

Sources of Primary Data

The sources of generating primary data are:

 Observation Method
 Survey Method
 Experimental Method

Observation Method:

Fig 1.2: Data collections

An observation is a data collection method by which you gather knowledge of the
researched phenomenon through making observations of the phenomenon as and when it
occurs. The main aim is to focus on observations of human behavior, the use of the
phenomenon and human interactions related to the phenomenon. We can also make
observations on verbal and nonverbal expressions. In making and documenting observations,
we need to clearly differentiate our own observations from the observations provided
to us by other people. The range of data storage genres found in archives and
collections is suitable for documenting observations, e.g. audio, visual, textual
and digital, including sub-genres such as note taking, audio recording and video
recording.

There exist various observation practices, and our role as an observer may
vary according to the research approach. We make observations from either the
outsider or insider point of view in relation to the researched phenomenon and the
observation technique can be structured or unstructured. The degree of the outsider
or insider points of view can be seen as a movable point in a continuum between
the extremes of outsider and insider. If you decide to take the insider point of view,
you will be a participant observer in situ and actively participate in the observed
situation or community. The activity of a Participant observer in situ is called field
work. This observation technique has traditionally belonged to the data collection
methods of ethnology and anthropology. If you decide to take the outsider point of
view, you try to distance yourself from your own cultural ties and observe the
researched community as an outsider observer. These details are seen in figure 1.2.

Experimental Designs
There are a number of experimental designs that are used in carrying out an
experiment. However, market researchers have used four experimental designs most
frequently. These are:

CRD - Completely Randomized Design

A completely randomized design (CRD) is one where the treatments are
assigned completely at random, so that each experimental unit has the same chance
of receiving any one treatment. For the CRD, any difference among experimental
units receiving the same treatment is considered experimental error. Hence,
CRD is appropriate only for experiments with homogeneous experimental units,
such as laboratory experiments, where environmental effects are relatively easy to
control. For field experiments, where there is generally large variation among
experimental plots in such environmental factors as soil, the CRD is rarely used.
CRD is mainly used in the agricultural field.
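
A minimal Python sketch of a CRD assignment is given below; the four treatment labels and the twelve experimental units are hypothetical, and the only point illustrated is that every unit has the same chance of receiving any treatment.

    import random

    treatments = ["A", "B", "C", "D"]     # hypothetical treatments
    units = list(range(12))               # 12 hypothetical experimental units

    # Completely randomized design: replicate each treatment equally often,
    # then assign the labels to units purely at random.
    labels = treatments * (len(units) // len(treatments))
    random.shuffle(labels)
    assignment = dict(zip(units, labels))
    print(assignment)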

Randomized Block Design


In a randomized block design, the experimenter divides subjects into
subgroups called blocks, such that the variability within blocks is less than the
variability between blocks. Then, subjects within each block are randomly
assigned to treatment conditions. Compared to a completely randomized design,
this design reduces variability within treatment conditions and potential
confounding, producing a better estimate of treatment effects.
LSD - Latin Square Design
A Latin square is one of the experimental designs which has a balanced two-way
classification scheme, say for example a 4 x 4 arrangement. In this scheme each letter from
A to D occurs only once in each row and also only once in each column. It may be noted
that the balanced arrangement will not get disturbed if any row is interchanged with
another.
A B C D
B C D A
C D A B
D A B C

The balanced arrangement achieved in a Latin square is its main strength. In this
design, the comparisons among treatments will be free from both row and column
differences. Thus the magnitude of error will be smaller than in any other design.
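
The 4 x 4 arrangement shown above can be generated by the usual cyclic construction, rotating the first row one position at a time; a short Python sketch follows.

    letters = ["A", "B", "C", "D"]
    n = len(letters)

    # Cyclic construction: row i is the first row rotated left by i positions,
    # so each letter appears exactly once in every row and every column.
    square = [[letters[(i + j) % n] for j in range(n)] for i in range(n)]
    for row in square:
        print(" ".join(row))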

FD - Factorial Designs
This design allows the experimenter to test two or more variables simultaneously. It
also measures interaction effects of the variables and analyzes the impacts of each of the
variables. In a true experiment, randomization is essential so that the experimenter can infer
cause and effect without any bias.

Sources of Secondary Data


While primary data can be collected through questionnaires, depth interviews,
focus group interviews, case studies, experimentation and observation, secondary
data can be obtained through:

 Internal Sources - These are within the organization


 External Sources - These are outside the organization

Internal sources
If available, internal secondary data may be obtained with less time, effort and
money than the external secondary data. In addition, they may also be more
pertinent to the situation at hand since they are from within the organization. The
internal sources include

Accounting resources- These give a great deal of information which can be used by the
marketing researcher. They give information about internal factors.

Sales Force Reports- These give information about the sales of a product. The
information provided is from outside the organization.

Internal Experts- These are people who head the various departments. They
can give an idea of how a particular thing is working.

Miscellaneous Reports- These include information obtained from operational
reports. If the data available within the organization are unsuitable or
inadequate, the marketer should extend the search to external secondary data sources.

External Sources of Data


External Sources are sources which are outside the company in a larger
environment. Collection of external data is more difficult because the data have
much greater variety and the sources are much more numerous.

External data can be divided into following classes.

Government Publications- Government sources provide an extremely rich pool


of data for the researchers. In addition, many of these data are available free of cost
on internet websites. There are a number of government agencies generating data.
These are:

Registrar General of India- It is an office which generates demographic data. It


includes details of gender, age, occupation etc.

Central Statistical Organization- This organization publishes the national


accounts statistics. It contains estimates of national income for several years, growth
rate, and rate of major economic activities. Annual survey of Industries is also
published by the CSO. It gives information about the total number of workers
employed, production units, material used and value added by the manufacturer.

Director General of Commercial Intelligence- This office operates from Kolkata.


It gives information about foreign trade i.e. import and export. These figures are
provided region- wise and country-wise.

Ministry of Commerce and Industries- This ministry through the office of


economic advisor provides information on wholesale price index. These indices may
be related to a number of sectors like food, fuel, power, food grains etc. It also
generates All India Consumer Price Index numbers for industrial workers, urban
non-manual employees and agricultural labourers.

Planning Commission- It provides the basic statistics of Indian Economy.

Reserve Bank of India- This provides information on Banking Savings and


investment. RBI also prepares currency and finance reports.

Labour Bureau- It provides information on skilled, unskilled, white collared jobs etc.

National Sample Survey- This is done by the Ministry of Planning and it provides
social, economic, demographic, industrial and agricultural statistics.

Department of Economic Affairs- It conducts economic surveys and it also
generates information on income, consumption, expenditure, investment, savings
and foreign trade.

State Statistical Abstract- This gives information on various types of activities


related to the state like - commercial activities, education, occupation etc.

Non-Government Publications- These includes publications of various industrial


and trade associations, such as

The Indian Cotton Mill Association
Various chambers of commerce
The Bombay Stock Exchange (it publishes a directory containing financial
accounts, key profitability and other relevant matter)
Various Associations of Press Media
Export Promotion Council
Confederation of Indian Industries (CII)
Small Industries Development Board of India
Different mills like woolen mills, textile mills etc.


The only disadvantage of the above sources is that the data may be biased. They are
likely to colour their negative points.

Syndicate Services- These services are provided by certain organizations which


collect and tabulate the marketing information on a regular basis for a number of
clients who are the subscribers to these services. So the services are designed in such
a way that the information suits the subscriber. These services are useful in television
viewing, movement of consumer goods etc. These syndicate services provide
information and data from both households as well as institutions.

In collecting data from households they use three approaches:

Survey- They conduct surveys regarding lifestyle, sociographics and general topics.

Electronic Scanner Services- These are used to generate data on volume.

They collect data for institutions from wholesalers, retailers, and industrial firms.

1.2.1 Comparison of sources of data

Based on various features (cost, data, process, source, time etc.), the various
sources of data can be compared as per table 1.

Table 1: Difference between primary data and secondary data.


Comparison Feature         Primary data                          Secondary data

Meaning                    Data that is collected by the         Data that is collected by other
                           researcher                            people

Data                       Real-time data                        Past data

Process                    Very involved                         Quick and easy

Source                     Surveys, interviews, experiments,     Books, journals, publications etc.
                           questionnaires etc.

Cost effectiveness         Expensive                             Economical

Collection time            Long                                  Short

Specific                   Specific to the researcher's need     May not be specific to the
                                                                 researcher's need

Available                  Crude form                            Refined form

Accuracy and reliability   More                                  Less

1.3 Understanding Sources of Data from Sensor

Sensor data is the output of a device that detects and responds to some type
of input from the physical environment. The output may be used to provide
information or input to another system or to guide a process. Examples are as follows

 A photosensor detects the presence of visible light, infrared transmission (IR)


and/or ultraviolet (UV) energy.
 Lidar, a laser-based method of detection, range finding and mapping, typically
uses a low-power, eye-safe pulsing laser working in conjunction with a camera.
 A charge-coupled device (CCD) stores and displays the data for an image in
such a way that each pixel is converted into an electrical charge, the intensity of
which is related to a color in the color spectrum.
 Smart grid sensors can provide real-time data about grid conditions, detecting
outages, faults and load and triggering alarms.
 Wireless sensor networks combine specialized transducers with a
communications infrastructure for monitoring and recording conditions at diverse
locations. Commonly monitored parameters include temperature, humidity,
pressure, wind direction and speed, illumination intensity, vibration intensity,
sound intensity, powerline voltage, chemical concentrations, pollutant levels and
vital body functions.

1.4 Understanding Sources of Data from Signal

The simplest form of signal is a direct current (DC) that is switched on and
off; this is the principle by which the early telegraph worked. More complex signals
consist of an alternating-current (AC) or electromagnetic carrier that contains one
or more data streams.
Data must be transformed into electromagnetic signals prior to transmission
across a network. Data and signals can be either analog or digital. A signal is
periodic if it consists of a continuously repeating pattern.
1.5 Understanding Sources of Data from GPS

The Global Positioning System (GPS) is a space based navigation system


that provides location and time information in all weather conditions, anywhere on
or near the Earth where there is an unobstructed line of sight to four or more GPS
satellites. The system provides critical capabilities to military, civil, and commercial
users around the world. The United States government created the system, maintains
it, and makes it freely accessible to anyone with a GPS receiver.

1.6 Data Management

Data management is the development and execution of architectures,


policies, practices and procedures in order to manage the information lifecycle
needs of an enterprise in an effective manner.

1.7 Data Quality


Data quality refers to the state of qualitative or quantitative pieces of information.
There are many definitions of data quality, but data is generally considered high quality
if it is "fit for [its] intended uses in operations, decision making and planning".

The seven characteristics that define data quality are:

1. Accuracy and Precision


2. Legitimacy and Validity
3. Reliability and Consistency
4. Timeliness and Relevance
5. Completeness and Comprehensiveness
6. Availability and Accessibility
7. Granularity and Uniqueness

Accuracy and Precision: This characteristic refers to the exactness of the data.
It cannot have any erroneous elements and must convey the correct message without
being misleading. This accuracy and precision have a component that relates to its
intended use. Without understanding how the data will be consumed, ensuring
accuracy and precision could be off-target or more costly than necessary. For
example, accuracy in healthcare might be more important than in another industry
(which is to say, inaccurate data in healthcare could have more serious
consequences) and, therefore, justifiably worth higher levels of investment.

Legitimacy and Validity: Requirements governing data set the boundaries of this
characteristic. For example, on surveys, items such as gender, ethnicity, and
nationality are typically limited to a set of options and open answers are not
permitted. Any answers other than these would not be considered valid or legitimate
based on the survey’s requirement. This is the case for most data and must be
carefully considered when determining its quality. The people in each department
in an organization understand what data is valid or not to them, so the requirements
must be leveraged when evaluating data quality.

Reliability and Consistency: Many systems in today’s environments use and/or


collect the same source data. Regardless of what source collected the data or where it
resides, it cannot contradict a value residing in a different source or collected by a
different system. There must be a stable and steady mechanism that collects and
stores the data without contradiction or unwarranted variance.

Timeliness and Relevance: There must be a valid reason to collect the data to
justify the effort required, which also means it has to be collected at the right
moment in time. Data collected too soon or too late could misrepresent a
situation and drive
inaccurate decisions.

Completeness and Comprehensiveness: Incomplete data is as dangerous as


inaccurate data. Gaps in data collection lead to a partial view of the overall picture
to be displayed. Without a complete picture of how operations are running,
uninformed actions will occur. It’s important to understand the complete set of
requirements that constitute a comprehensive set of data to determine whether or
not the requirements are being fulfilled.

Availability and Accessibility: This characteristic can be tricky at times due to legal
and regulatory constraints. Regardless of the challenge, though, individuals need the
right level of access to the data in order to perform their jobs. This presumes that
the data exists and is available for access to be granted.

Granularity and Uniqueness: The level of detail at which data is collected is


important, because confusion and inaccurate decisions can otherwise occur.
Aggregated, summarized and manipulated collections of data could offer a
different meaning than the data implied at a lower level. An appropriate level of
granularity must be defined to provide sufficient uniqueness and distinctive
properties to become visible. This is a requirement for operations to function
effectively.
Noisy data is meaningless data. The term has often been used as a synonym
for corrupt data. However, its meaning has expanded to include any data that
cannot be understood and interpreted correctly by machines, such as unstructured
text.

An outlier is an observation that lies an abnormal distance from other values


in a random sample from a population. In a sense, this definition leaves it up to the
analyst (or a consensus process) to decide what will be considered abnormal.
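
One common (but not the only) way to make "abnormal distance" concrete is the 1.5 x IQR rule; the sketch below applies it to a made-up sample.

    import statistics

    sample = [12, 13, 13, 14, 15, 15, 16, 48]      # made-up sample; 48 looks suspicious
    q1, _, q3 = statistics.quantiles(sample, n=4)  # first and third quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in sample if x < low or x > high]
    print(outliers)                                 # -> [48]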
In statistics, missing data, or missing values, occur when no data value is
stored for the variable in an observation. Missing data are a common occurrence and
can have a significant effect on the conclusions that can be drawn from the data.
Missing values can be replaced by the following techniques (a short pandas sketch follows this list):

 Ignore the record with missing values.


 Replace the missing term with a constant.
 Fill in the missing value manually based on domain knowledge.
 Replace them with the mean (if the data is numeric) or the most frequent
value (if the data is categorical).
 Use modelling techniques such as decision trees, Bayes' algorithm, the
nearest neighbour algorithm etc.
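
As a minimal pandas sketch of two of the options above (the column names and values are hypothetical): mean imputation for a numeric column and most-frequent-value imputation for a categorical column.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age":  [25, np.nan, 40, 35],
        "city": ["Delhi", "Mumbai", None, "Mumbai"],
    })

    df["age"] = df["age"].fillna(df["age"].mean())         # numeric -> mean
    df["city"] = df["city"].fillna(df["city"].mode()[0])   # categorical -> most frequent value
    print(df)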

In computing, data deduplication is a specialized data compression


technique for eliminating duplicate copies of repeating data. Related and somewhat
synonymous terms are intelligent (data) compression and single instance (data)
storage.
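
A minimal sketch of record-level deduplication with pandas is shown below; the three-row data frame is made up, and drop_duplicates simply keeps the first copy of each repeated row.

    import pandas as pd

    df = pd.DataFrame({
        "name":  ["alice", "bob", "alice"],
        "email": ["a@x.com", "b@x.com", "a@x.com"],
    })

    deduplicated = df.drop_duplicates()   # keeps the first occurrence of each repeated row
    print(deduplicated)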

Noisy data

For objects, noise is considered an extraneous object.

For attributes, noise refers to modification of original values.

 Examples: distortion of a person’s voice when talking on a poor phone and


“snow” on television screen
 We can talk about the signal-to-noise ratio (SNR): a clean sine wave has a very
high SNR, while the same wave combined with noise has a much lower SNR.
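
A minimal NumPy sketch, assuming the common convention of expressing SNR as the ratio of signal power to noise power in decibels (the signal and noise values are made up):

    import numpy as np

    t = np.linspace(0, 1, 1000)
    signal = np.sin(2 * np.pi * 5 * t)            # clean 5 Hz sine wave
    noise = np.random.normal(0, 0.5, t.size)      # additive random noise

    def snr_db(sig, noi):
        # Ratio of signal power to noise power, expressed in decibels.
        return 10 * np.log10(np.mean(sig ** 2) / np.mean(noi ** 2))

    print("SNR of the noisy signal: %.1f dB" % snr_db(signal, noise))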

Origins of noise

 outliers -- values seemingly out of the normal range of data


 duplicate records -- good database design should minimize this (use
DISTINCT on SQL retrievals)
 incorrect attribute values -- again good db design and integrity
constraints should minimize this
 numeric-only fields -- deal with rogue strings or characters where numbers should be.
 null handling for attributes (nulls=missing values)
Missing Data Handling

Many causes: malfunctioning equipment, changes in experimental design, collation


of different data sources, measurement not possible. People may wish to not supply
information. Information is not applicable (children don't have annual income)

 Discard records with missing values


 Ordinal-continuous data, could replace with attribute means
 Substitute with a value from a similar instance
 Ignore missing values, i.e., just proceed and let the tools deal with them
 Treat missing values as equals (all share the same missing value code)
 Treat missing values as unequal values

BUT...Missing (null) values may have significance in themselves (e.g. missing


test in a medical examination, death date missing means still alive!)

Missing completely at random (MCAR)

 Missingness of a value is independent of attributes


 Fill in values based on the attribute as suggested above (e.g. attribute mean)
 Analysis may be unbiased overall

Missing at Random (MAR)

 Missingness is related to other variables


 Fill in values based on other values (e.g., from similar instances)
 Almost always produces a bias in the analysis

Missing Not at Random (MNAR)

 Missingness is related to unobserved measurements


 Informative or non-ignorable missingness

Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of
one another. This is a major issue when merging data from multiple, heterogeneous
sources.
 Examples: Same person with multiple email addresses

1.8 Data Preprocessing

Data preprocessing is a data mining technique that involves


transforming raw data into an understandable format. Real-world data is often
incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely
to contain many errors. Data preprocessing is a proven method of resolving such
issues.
Data goes through a series of steps during preprocessing:

 Data Cleaning: Data is cleansed through processes such as filling in


missing values, smoothing the noisy data, or resolving the inconsistencies
in the data.
 Data Integration: Data with different representations are put together and
conflicts within the data are resolved.
 Data Transformation: Data is normalized, aggregated and generalized.
 Data Reduction: This step aims to present a reduced representation of the
data in a data warehouse.
 Data Discretization: Involves reducing the number of values of a continuous
attribute by dividing the range of the attribute into intervals.

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.

(a). Missing Data:


This situation arises when some data is missing in the data. It can be
handled in various ways.
Some of them are:

1. Ignore the tuples:


This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill
the missing values manually, by attribute mean or the most
probable value.

(b). Noisy Data:


Noisy data is meaningless data that can’t be interpreted by machines. It can
be generated due to faulty data collection, data entry errors etc. It can be
handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size and then various methods
are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task.
Data discretization is the process of converting continuous data into
discrete buckets or intervals. Here's an example:
Example: Discretizing Age into Age Groups
Suppose we have a dataset with a continuous "Age" column:
Person   Age
A        23
B        35
C        42
D        51
E        67

If we want to discretize "Age" into categorical bins, we can define age
groups:
 0-30  → "Young"
 31-50 → "Middle-aged"
 51+   → "Senior"

Applying discretization:

Person   Age   Age Group
A        23    Young
B        35    Middle-aged
C        42    Middle-aged
D        51    Senior
E        67    Senior
This process simplifies analysis, especially for machine learning
models that prefer categorical features; a pandas sketch of this example is
shown after the list below. You can use methods like:
 Equal-width binning (dividing data into equal-sized ranges)
 Equal-frequency binning (each bin has roughly the same number of observations)
 K-means clustering (grouping similar values)
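
A minimal pandas sketch of the age-group example above, using pd.cut with explicit bin edges (the edges follow the 0-30 / 31-50 / 51+ groups defined earlier; the upper edge of 120 is an assumption):

    import pandas as pd

    ages = pd.DataFrame({"Person": list("ABCDE"), "Age": [23, 35, 42, 51, 67]})
    bins = [0, 30, 50, 120]                         # 0-30, 31-50, 51+
    labels = ["Young", "Middle-aged", "Senior"]
    ages["Age Group"] = pd.cut(ages["Age"], bins=bins, labels=labels)
    print(ages)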
Smoothing by Clustering (K-Means) is another option:
 Data points are grouped using clustering algorithms.
 Each value is replaced by its cluster centroid.

2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. The outliers may
go undetected, or they will fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to
1.0 or 0.0 to 1.0); a short sketch follows this list.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels
or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy.
For example, the attribute “city” can be converted to “country”.
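
A minimal sketch of min-max normalization to the range 0.0 to 1.0 (the values are made up for illustration):

    values = [12.0, 18.0, 25.0, 40.0]
    lo, hi = min(values), max(values)

    # Min-max normalization: map the smallest value to 0.0 and the largest to 1.0.
    normalized = [(v - lo) / (hi - lo) for v in values]
    print(normalized)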

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis
becomes harder when working with such volumes. To get rid of this problem, we use
data reduction techniques. They aim to increase storage efficiency and reduce data
storage and analysis costs.

The various steps to data reduction are:

1. Data Cube Aggregation:


Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and the
p-value of the attribute: an attribute having a p-value greater than the significance
level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example
regression models.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is called lossy.
The two effective methods of dimensionality reduction are wavelet transforms
and PCA (Principal Component Analysis).
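
A minimal scikit-learn sketch of PCA as a (lossy) dimensionality reduction step; the 4-feature toy matrix is made up for illustration.

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.array([
        [2.5, 2.4, 0.5, 1.0],
        [0.5, 0.7, 2.2, 1.1],
        [2.2, 2.9, 0.4, 0.9],
        [1.9, 2.2, 0.6, 1.2],
        [3.1, 3.0, 0.3, 0.8],
    ])

    pca = PCA(n_components=2)           # keep the 2 strongest principal components
    X_reduced = pca.fit_transform(X)    # shape goes from (5, 4) to (5, 2)
    print(X_reduced.shape, pca.explained_variance_ratio_)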
