● Quantitative Method
Quantitative methods produce data expressed in numbers and
require mathematical calculation to interpret. An example would be
the use of a questionnaire with closed-ended questions to arrive at
figures that can be analysed mathematically. Other quantitative
techniques include correlation and regression, and measures such
as the mean, mode and median.
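A minimal Python sketch of these quantitative measures, computed on a small set of invented questionnaire scores (the numbers are illustrative only):

```python
import statistics
import numpy as np

# Illustrative questionnaire scores (made-up data)
scores_a = [3, 4, 4, 5, 2, 4, 5, 3]
scores_b = [2, 4, 5, 5, 1, 3, 5, 2]

print("mean:", statistics.mean(scores_a))      # arithmetic average
print("median:", statistics.median(scores_a))  # middle value of the sorted scores
print("mode:", statistics.mode(scores_a))      # most frequent value

# Pearson correlation between the two sets of responses
print("correlation:", np.corrcoef(scores_a, scores_b)[0, 1])

# Simple linear regression: least-squares fit of scores_b on scores_a
w, b = np.polyfit(scores_a, scores_b, deg=1)
print(f"regression: y = {w:.2f}x + {b:.2f}")
```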
REPORTING
● Website Articles
Gathering and using data contained in website articles is
another tool for data collection. Collecting data from web
articles is a quicker and less expensive data collection method.
Two major disadvantages of using this data reporting method
are the biases inherent in the data collection process and
possible security/confidentiality concerns.
● Hospital Care records
Health care involves a diverse set of public and private data
collection systems, including health surveys, administrative
enrollment and billing records, and medical records, used by
various entities, including hospitals, CHCs, physicians, and
health plans. The data provided is clear, unbiased and
accurate, but must be obtained through legal means, as
medical data is protected by the strictest regulations.
EXISTING DATA
Pre-processing of data is mainly concerned with checking and improving data
quality. Data quality can be addressed with the following techniques:
1. Binning: This method is used to smooth or handle noisy data. First, the data is
sorted, and then the sorted values are separated and stored in the form of bins.
There are three methods for smoothing the data in a bin: smoothing by bin
means, by bin medians, and by bin boundaries (a small sketch follows this list).
2. Regression: This is used to smooth the data and helps to handle data
when unnecessary data is present. For analysis purposes, regression
helps to decide which variables are suitable for our analysis.
3. Clustering: This is used for finding outliers and for grouping the
data. Clustering is generally used in unsupervised learning.
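A minimal sketch of binning with smoothing by bin means, using invented values; equal-frequency bins are one common choice:

```python
import numpy as np

# Noisy, unsorted values (illustrative data)
data = [21, 8, 28, 4, 34, 25, 15, 9, 24]

# Sort the data, then split it into three equal-frequency bins
sorted_data = np.sort(data)
bins = np.array_split(sorted_data, 3)

# Smooth each bin by replacing its values with the bin mean
smoothed = [np.full(len(b), b.mean()) for b in bins]
print([b.tolist() for b in bins])      # [[4, 8, 9], [15, 21, 24], [25, 28, 34]]
print([s.tolist() for s in smoothed])  # [[7.0, 7.0, 7.0], [20.0, 20.0, 20.0], [29.0, 29.0, 29.0]]
```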
Data integration:
Data integration is the process of bringing data from disparate sources
together to provide users with a unified view. The premise of data integration
is to make data more freely available and easier to consume and process by
systems and users. Data integration done right can reduce IT costs, free up
resources, improve data quality, and foster innovation, all without sweeping
changes to existing applications or data structures. And though IT
organizations have always had to integrate, the payoff for doing so has
potentially never been as great as it is right now.
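As a small illustration of a unified view, the pandas sketch below merges two hypothetical sources (a CRM export and a billing extract) on a shared customer ID; all names and figures are assumptions made for the example:

```python
import pandas as pd

# Two disparate sources (hypothetical data)
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Asha", "Ben", "Chitra"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4],
                        "amount_due": [250.0, 99.5, 40.0]})

# Unified view: an outer join keeps customers present in either source
unified = crm.merge(billing, on="customer_id", how="outer")
print(unified)
```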
Data reduction:
Data reduction techniques ensure the integrity of the data while reducing its volume. Data
reduction is a process that reduces the volume of the original data and represents it in a
much smaller form. These techniques are used to obtain a reduced representation of the
dataset that is much smaller in volume while maintaining the integrity of the original data.
By reducing the data, the efficiency of the data mining process is improved, and it still
produces the same analytical results.
Data reduction does not affect the result obtained from data mining. That means the
result obtained from data mining before and after data reduction is the same or
almost the same.
Data reduction aims to represent the data more compactly. When the data size is smaller, it is
simpler to apply sophisticated and computationally expensive algorithms. The
reduction of the data may be in terms of the number of rows (records) or in terms of the
number of columns (dimensions).
2. Numerosity Reduction
The numerosity reduction reduces the original data volume and
represents it in a much smaller form. This technique includes two types:
parametric and non-parametric numerosity reduction.
1. Parametric: Parametric numerosity reduction stores only the model
parameters instead of the original data. One family of parametric
numerosity reduction methods is regression and log-linear models (see the
sketch after this list).
● Regression and Log-Linear: Linear regression models a
relationship between two attributes by fitting a linear
equation to the data set. Suppose we need to model a linear
function between two attributes:
y = wx + b
Here, y is the response attribute, and x is the predictor attribute. In
data mining terms, attribute x and attribute y are numeric
database attributes, whereas w and b are regression
coefficients. Multiple linear regression models the response variable y
as a linear function of two or more predictor variables. A log-
linear model discovers the relationship between two or more discrete
attributes in the database. Suppose we have a set of tuples
represented in n-dimensional space; the log-linear model is then used
to study the probability of each tuple in that multidimensional space.
Regression and log-linear methods can be used for sparse data and
skewed data.
2. Non-Parametric: A non-parametric numerosity reduction technique does not
assume any model. The non-parametric technique results in a more uniform
reduction, irrespective of data size, but it may not achieve as high a volume of
data reduction as the parametric technique. Common non-parametric data
reduction techniques include histograms, clustering, sampling, data cube
aggregation, and data compression.
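A minimal sketch of the parametric idea described above: fit y = wx + b to a numeric attribute pair and keep only the two regression coefficients instead of the original records (the values are invented for illustration):

```python
import numpy as np

# Original numeric attributes (illustrative): x = predictor, y = response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Parametric reduction: store only the regression coefficients w and b
w, b = np.polyfit(x, y, deg=1)
print(f"stored parameters: w={w:.3f}, b={b:.3f}")

# The original y values can later be approximated from x using just (w, b)
y_approx = w * x + b
print("reconstructed y:", np.round(y_approx, 2))
```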
5. Discretization Operation
The data discretization technique is used to divide attributes of a
continuous nature into data with intervals. Many continuous values of
an attribute are replaced with labels of small intervals, so that
mining results can be presented in a concise and easily understandable way.
i. Top-down discretization: If you first consider one or a couple of points
(so-called breakpoints or split points) to divide the whole range of attribute
values, and then repeat this method recursively on the resulting intervals, the
process is known as top-down discretization, also known as splitting.
ii. Bottom-up discretization: If you first consider all the continuous values as
potential split points and then discard some of them by merging
neighbourhood values into intervals, the process is called bottom-up
discretization, also known as merging.
Data Transformation:
Data transformation is the process of changing the format, structure, or values of data. For data
analytics projects, data may be transformed at two stages of the data pipeline. Organizations that
use on-premises data warehouses generally use an ETL (extract, transform, load) process, in
which data transformation is the middle step. Today, most organizations use cloud-based data
warehouses, which can scale compute and storage resources with latency measured in seconds or
minutes. The scalability of the cloud platform lets organizations skip preload transformations and
load raw data into the data warehouse and then transform it at query time, a model called ELT
(extract, load, transform).
The data are transformed in ways that are ideal for mining the data. The data
transformation involves steps that are:
1. Smoothing:
It is a process used to remove noise from the dataset using suitable
algorithms. It allows highlighting the important features present in the dataset and
helps in predicting patterns. When collecting data, the data can be manipulated to
eliminate or reduce variance and other forms of noise.
The concept behind data smoothing is that it can identify simple
changes that help predict different trends and patterns. This helps
analysts or traders who need to look at a lot of data, which can often be
difficult to digest, to find patterns they would not see otherwise.
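One possible sketch of smoothing is a simple moving average, shown below on an invented series; the window size of 3 is an arbitrary choice for illustration:

```python
import pandas as pd

# Noisy daily measurements (illustrative data)
series = pd.Series([10, 12, 9, 14, 30, 13, 11, 15, 12, 10])

# 3-point centred moving average smooths out spikes such as the value 30
smoothed = series.rolling(window=3, center=True).mean()
print(smoothed)
```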
2. Aggregation:
Data aggregation is the method of gathering, storing, and presenting data in
a summary format. The data may be obtained from multiple data sources and
integrated into a single data analysis description. This is a crucial
step, since the accuracy of data analysis insights is highly dependent on the
quantity and quality of the data used. Gathering accurate data of high quality,
and in a large enough quantity, is necessary to produce relevant results.
Aggregated data is useful for everything from decisions concerning
financing or product business strategy to pricing, operations, and
marketing strategies.
For example, sales data may be aggregated to compute monthly and annual
total amounts.
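A short pandas sketch of the sales example: per-transaction amounts (invented figures) are aggregated into monthly and annual totals:

```python
import pandas as pd

# Individual sales transactions (illustrative data)
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-12-30"]),
    "amount": [120.0, 80.0, 200.0, 150.0],
})

# Monthly and annual total amounts
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual = sales.groupby(sales["date"].dt.year)["amount"].sum()
print(monthly)
print(annual)
```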
3. Discretization:
It is a process of transforming continuous data into a set of small intervals. Most
real-world data mining tasks involve continuous attributes, yet many
of the existing data mining frameworks are unable to handle them.
Also, even if a data mining task can manage a continuous attribute, it can
significantly improve its efficiency by replacing the continuous attribute with
its discrete values.
For example, values may be grouped into intervals (1-10, 11-20), or age may
be mapped to categories (young, middle age, senior).
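The age example can be sketched with pandas by cutting a continuous attribute into labelled intervals; the ages and cut points below are illustrative assumptions:

```python
import pandas as pd

ages = pd.Series([15, 22, 34, 47, 58, 63, 71])

# Replace continuous ages with interval labels
labels = pd.cut(ages, bins=[0, 30, 60, 120],
                labels=["young", "middle age", "senior"])
print(labels.tolist())  # ['young', 'young', 'middle age', 'middle age', 'middle age', 'senior', 'senior']
```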
4. Attribute Construction:
New attributes are created from the given set of attributes and applied to
assist the mining process. This simplifies the original data and makes the
mining more efficient.
5. Generalization:
It converts low-level data attributes to high-level data attributes using a concept
hierarchy. For example, age initially in numerical form (22, 25) is converted
into a categorical value (young, old).
Likewise, categorical attributes, such as house addresses, may be
generalized to higher-level definitions, such as town or country.
● Min-Max Normalization:
● It maps a value v of an attribute from the original range [min, max] to a
new range [new_min, new_max] by computing
v' = (v - min) / (max - min) × (new_max - new_min) + new_min
For example:
Suppose the minimum and maximum values for an attribute profit (P) are Rs.
10,000 and Rs. 100,000, and we want to plot the profit in the range [0, 1].
Using min-max normalization, the value Rs. 20,000 for attribute profit is
plotted to (20,000 - 10,000) / (100,000 - 10,000) = 0.111.
● Z-Score Normalization:
● It normalizes a value v of an attribute using the mean and standard
deviation of that attribute: v' = (v - mean) / standard deviation
For example:
Let the mean of an attribute P = 60,000 and the standard deviation = 10,000 for
the attribute P. Using z-score normalization, a value of 85,000 for P is
transformed to (85,000 - 60,000) / 10,000 = 2.5.
● Decimal Scaling:
● It normalizes the values of an attribute by changing the position of
their decimal points.
● The number of places by which the decimal point is moved is
determined by the absolute maximum value of attribute A.
● A value, v, of attribute A is normalized to v' by computing
v' = v / 10^j, where j is the smallest integer such that the maximum
of |v'| is less than 1.
For example:
● Suppose the values of an attribute P vary from -99 to 99.
● The maximum absolute value of P is 99.
● For normalizing the values, we divide each number by 100 (i.e., j =
2, the number of digits in the largest absolute value), so that the values
come out as 0.99, 0.98, 0.97 and so on.
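The three normalizations can be sketched directly with the figures used above (profit range Rs. 10,000 to Rs. 100,000, mean 60,000, standard deviation 10,000, and values between -99 and 99 for decimal scaling):

```python
import numpy as np

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
v, p_min, p_max = 20_000, 10_000, 100_000
print((v - p_min) / (p_max - p_min))  # 0.111...

# Z-score normalization: v' = (v - mean) / std
v, mean, std = 85_000, 60_000, 10_000
print((v - mean) / std)               # 2.5

# Decimal scaling: v' = v / 10**j, with j chosen so that max(|v'|) < 1
values = np.array([-99, 23, 99])
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
print(values / 10**j)                 # [-0.99  0.23  0.99]
```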