21BCAD5C01 IDA Module 2 Notes
Module 2
Syllabus of the Module
Data Preparation (12 Hours)
Data Collection Methods: Primary, Secondary data; Pre-Processing: Data Cleaning, Data Integration
and Transformation, Data Discretization; Dimensionality Reduction, PCA, Feature Engineering and
Selection.
In organizations, while data warehouses and data marts are home to pre-processed data, data lakes
contain data in its natural or raw format.
But the possibility exists that your data still resides in Excel files on the desktop of a domain
expert.
Finding data even within your own company can sometimes be a challenge.
As companies grow, their data becomes scattered around many places. Knowledge of the data
may be dispersed as people change positions and leave the company.
Getting access to data is another difficult task.
Organizations understand the value and sensitivity of data and often have policies in place so
everyone has access to what they need and nothing more.
These policies translate into physical and digital barriers called Chinese walls.
Quality Factors
❑ Accuracy: Data must be correct. Outdated data, typos, and redundancies can affect a dataset’s
accuracy.
❑ Consistency: The data should have no contradictions.
❑ Inconsistent data may give you different answers to the same question.
❑ Completeness: The dataset shouldn’t have missing values or empty fields.
❑ This characteristic allows data scientists to perform accurate analyses as they have access to a
complete picture of the situation the data describes.
❑ Validity: A dataset is considered valid if the data samples appear in the correct format, are within a
specified range, and are of the right type.
❑ Invalid datasets are hard to organize and analyze.
❑ Timeliness: Data should be collected as soon as the event it represents occurs.
❑ As time passes, a dataset becomes less accurate and less useful because it no longer represents the
current reality. Therefore, the topicality and relevance of data are critical data quality
characteristics.
Noisy Data
Noise includes
❑ data having incorrect attribute values,
❑ duplicate or semi-duplicates of data points,
❑ data segments of no value/attributes for a specific research process, or
❑ unwanted/irrelevant attributes.
An outlier can be treated as noise, although some consider it a valid data point.
❑ For numeric values, you can use a scatter plot or box plot to identify outliers.
Outliers
An outlier is a data point that is noticeably different from the rest. It is an unusual data point that does
not follow the general trend of the rest of the data. It can be an extreme value.
Outliers can arise for a variety of reasons, such as errors in measurement or bad data collection (e.g.
Human error, Measuring Instrument error, Experiment design error etc.).
Machine learning algorithms are sensitive to the range and distribution of attribute values.
Data outliers can spoil and mislead the training process resulting in less accurate models and poorer
results.
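A minimal sketch of both checks in Python, assuming pandas and matplotlib are installed; the column name price and the sample values are purely illustrative:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Illustrative data: one value (95) is far from the rest
    df = pd.DataFrame({"price": [12, 14, 15, 13, 14, 95, 16, 13]})

    # Visual check: a box plot marks points beyond the whiskers as outliers
    df.boxplot(column="price")
    plt.show()

    # Numeric check: the 1.5 * IQR rule
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
    print(outliers)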
❑ Data Integration
Data integration is the process of combining/ merging data from different sources.
The data sources might be heterogenous in nature.
These sources may include multiple data cubes (A data cube in a data warehouse is a
multidimensional structure used to store data), databases, or flat files.
Data integration results in a coherent/ well-organized data store and provides a unified view of the
data.
The metadata of an attribute includes its name, what it means in the particular scenario, its data
type, the range of values it can accept, and the rules the attribute follows for null, blank, or zero
values.
Analyzing this metadata information will prevent error in schema integration.
Entity identification problem:
The real-world entities from multiple sources need to be matched correctly.
For example, suppose we have customer data from two different data sources. An entity from one data
source has customer_id and the entity from the other data source has customer_number.
Here, the data analyst or the integration tool needs to understand that these two entities refer to
the same attribute.
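As an illustrative sketch (assuming pandas; the table names crm and billing and their columns are hypothetical), the analyst can tell the merge explicitly that the two keys refer to the same entity:

    import pandas as pd

    # Hypothetical sources: the same customer key is named differently in each
    crm = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ravi"]})
    billing = pd.DataFrame({"customer_number": [1, 2], "balance": [250.0, 90.5]})

    # Map customer_id and customer_number onto each other during integration
    merged = pd.merge(crm, billing, left_on="customer_id", right_on="customer_number")
    print(merged)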
❑ Redundancy:
Redundancy is one of the big issues during data integration.
Redundant data is unimportant data or data that is no longer needed.
An attribute may be redundant if it can be derived or obtained from another attribute or set of
attributes.
For example, if one data set has the customer's age and the other data set has the customer's date of
birth, then age would be a redundant attribute, as it can be derived from the date of birth.
❖ Some redundancies can be detected by correlation analysis: the attributes are analyzed to detect their
interdependence, thereby revealing the correlation between them (collinearity).
❑ Collinearity:
Collinearity refers to the situation in which two or more predictor variables are closely related
to one another.
The presence of collinearity can pose problems, since it can be difficult to separate out the
individual effects of collinear variables on the response variable.
When faced with the problem of collinearity, there are two simple solutions.
The first is to drop one of the problematic variables from the data. This can usually be done
without much compromise to the prediction, since the presence of collinearity implies that the
information that this variable provides about the response is redundant in the presence of the
other variables.
The second solution is to combine the collinear variables into a single predictor. For
instance, we might take the average of standardized versions of these variables in order to
create a new variable.
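A small sketch of both solutions, assuming pandas; the columns x1, x2, and y are made-up data in which x2 is roughly twice x1:

    import pandas as pd

    df = pd.DataFrame({
        "x1": [1, 2, 3, 4, 5],
        "x2": [2.1, 3.9, 6.2, 8.1, 9.8],   # nearly collinear with x1
        "y":  [1.2, 1.9, 3.2, 3.8, 5.1],
    })

    # Correlation analysis: a Pearson correlation close to 1 flags collinearity
    print(df.corr())

    # Solution 1: drop one of the problematic variables
    reduced = df.drop(columns=["x2"])

    # Solution 2: combine them, e.g. the average of their standardized versions
    z = (df[["x1", "x2"]] - df[["x1", "x2"]].mean()) / df[["x1", "x2"]].std()
    df["x_combined"] = z.mean(axis=1)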
❑ Data Value Conflict:
Data conflict means the data merged from the different sources do not match.
Attribute values from different sources may differ for the same real-world entity. The
difference may arise because they are represented differently in the different data sets.
For example, the price of a hotel room may be represented in different currencies in different
cities. Likewise, the date format may differ like “MM/DD/YYYY” or “DD/MM/YYYY”.
Detection and resolution of data value conflicts need to be carried out carefully.
❑ Tuple Duplication:
Along with redundancies, data integration also has to deal with duplicate tuples.
Duplicate tuples may come in the resultant data if the denormalized table has been used as a
source for data integration. Normalization is used to remove redundant data from the
database and to store non-redundant and consistent data into it.
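A one-line pandas sketch (the customer table below is hypothetical) for removing duplicate tuples after integration:

    import pandas as pd

    # Duplicate tuples often appear when a denormalized table feeds the integration
    df = pd.DataFrame({"cust": ["A", "A", "B"], "city": ["Pune", "Pune", "Agra"]})
    deduped = df.drop_duplicates()
    print(deduped)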
❑ Data Transformation
Data transformation is the process of converting data from one format to another.
In essence, it involves methods for transforming data into formats from which machine learning
algorithms can learn efficiently.
Data transformation increases the efficiency of analytic processes, and it enables businesses to make
better data-driven decisions.
Data transformation can change the structure, format, or values of data through operations such as
aggregation, smoothing, normalization, discretization, and generalization.
❑ Data Aggregation
➢ Data is collected and presented in a view within the context of various time intervals:
Reporting period: The period over which data is collected for presentation. For example, Daily,
Weekly, Monthly, Quarterly, and Yearly.
Polling period: The time duration that determines how often resources are sampled for data. For
example, a group of resources might be polled every 5 minutes, meaning that a data point for each
resource is generated every 5 minutes.
❖ Example: weather forecasting, where sensor data for temperature and humidity is polled every 5 minutes,
aggregated to its minimum and maximum, and reported daily; or the daily average sales of a supermarket
store aggregated for monthly forecasting.
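A minimal sketch of the sensor example, assuming pandas and NumPy; the readings are random stand-ins for real temperature data:

    import numpy as np
    import pandas as pd

    # Hypothetical temperature readings polled every 5 minutes for two days
    idx = pd.date_range("2024-01-01", periods=576, freq="5min")
    temp = pd.Series(25 + np.random.randn(576), index=idx)

    # Aggregate to a daily reporting period: minimum and maximum per day
    daily = temp.resample("D").agg(["min", "max"])
    print(daily)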
❑ Data Smoothing
Data smoothing is performed to remove noise from a data set. This allows important patterns to more
clearly stand out.
When data collected over time displays random variation, smoothing techniques can be used to reduce
or cancel the effect of these variations.
When properly applied, these techniques smooth out the random variation in the time series data to
reveal underlying trends.
For example, Data smoothing can be used to help predict trends, such as those found in
securities prices, as well as in economic analysis.
Data analytics tools typically feature four different smoothing techniques: Exponential, Moving Average,
Double Exponential, and Holt-Winters.
Exponential and Moving Average are relatively simple smoothing techniques and should not
be performed on data sets involving seasonality.
Double Exponential and Holt-Winters are more advanced techniques that can be used on data
sets involving seasonality.
Seasonality is a characteristic of a time series in which the data experiences regular and predictable
changes that recur every calendar year. Any predictable fluctuation or pattern that recurs or repeats
over a one-year period is said to be seasonal.
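A short sketch of the two simpler techniques, assuming pandas and NumPy (Double Exponential and Holt-Winters need a dedicated time-series library and are not shown); the series below is synthetic:

    import numpy as np
    import pandas as pd

    # Synthetic noisy series: a smooth trend plus random variation
    s = pd.Series(np.sin(np.linspace(0, 6, 120)) + np.random.normal(0, 0.3, 120))

    sma = s.rolling(window=7).mean()   # moving average over a 7-point window
    ema = s.ewm(alpha=0.3).mean()      # exponential smoothing with smoothing factor 0.3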
❑ Discretization
Numerical input variables may have a highly skewed or non-standard distribution. This could be
caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more.
Many machine learning algorithms prefer or perform better when numerical input variables have a
standard probability distribution.
The discretization transform provides an automatic way to change a numeric input variable to have a
different data distribution, which in turn can be used as input to a predictive model.
Discretization is the process through which we can transform continuous variables into a discrete
form.
We do this by creating a set of contiguous intervals (or bins) that go across the range of our desired
variable.
For example, we can divide a continuous variable, weight, into the following groups: under 100 gm
(light), 100–200 gm (mid), and over 200 gm (heavy).
Methods: Equal-width, Equal-frequency, clustered/ k-means
❑ Equal-Width or Uniform Discretization:
Each bin has the same width in the span of possible values for the variable.
Separating all possible values into ‘N’ number of bins, each having the same width. Formula for
interval width:
Width = (maximum value - minimum value) / N, where N is the number of bins or intervals.
Example: 1-10: child; 11-20: teenager; 21-30: young; 31-40: middle-aged; 41-50: adults; 51-
60: aged; 61-70: senior citizens;
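A small sketch of equal-width binning with pandas; the ages are made up, and the bin edges mirror the example above:

    import pandas as pd

    ages = pd.Series([3, 8, 15, 22, 37, 45, 58, 63])

    # Seven fixed-width bins of width 10, labelled as in the example
    labels = ["child", "teenager", "young", "middle-aged", "adults", "aged", "senior citizens"]
    groups = pd.cut(ages, bins=[0, 10, 20, 30, 40, 50, 60, 70], labels=labels)
    print(groups)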
❑ Equal-Frequency or Quantile Discretization:
A quantile discretization transform will attempt to split the observations for each input variable into k
groups, where the number of observations assigned to each group is approximately equal.
Each bin has the same number of values, split based on percentiles.
Separating all possible values into ‘N’ bins, each having the same number of observations.
Intervals may correspond to quantile values.
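A minimal sketch using pandas' quantile-based binning on made-up values:

    import pandas as pd

    values = pd.Series([4, 7, 9, 12, 15, 18, 21, 25, 30, 42, 55, 61])

    # Four bins with approximately the same number of observations each
    quartile_bins = pd.qcut(values, q=4)
    print(quartile_bins.value_counts())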
❑ Clustered or K-Means Discretization:
Clusters are identified and observations / data points are assigned to each group.
We apply K-Means clustering to the continuous variable, thus dividing it into discrete groups or
clusters.
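A brief sketch, assuming scikit-learn is available; the single continuous column is illustrative:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    x = np.array([[1.2], [1.5], [1.7], [5.0], [5.3], [9.8], [10.1]])

    # Bin edges are placed by running k-means on the continuous variable
    kbins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
    x_binned = kbins.fit_transform(x)
    print(x_binned.ravel())   # cluster/bin index for each value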
❑ Normalization
Normalization is a technique often applied as part of data preparation for machine learning. The goal of
normalization is to change the values of numeric columns in the dataset to a common scale, without
distorting differences in the ranges of values. Not every dataset requires normalization for machine
learning; it is needed only when features have different ranges. Normalization gives equal
weight/importance to each variable so that no single variable steers model performance in one direction
just because it takes larger numeric values. In such cases, normalization can substantially improve
model accuracy.
Data normalization is a method to convert the source data into another format for effective processing.
The purpose of normalization is to transform data in a way that they are either dimensionless and/or
have similar distributions.
It offers several advantages, such as making data mining /machine learning algorithms more effective,
faster data extraction, etc.
This process of normalization is known by other names such as standardization, feature scaling etc.
Normalization is an essential step in data pre-processing in any machine learning application and
model fitting.
For example, consider a data set containing two features, age (x1) and income (x2), where age ranges
from 0–100 while income ranges from 20,000–500,000. Income is roughly a thousand times larger than
age, so these two features are on very different scales. When we do further analysis, such as multivariate
linear regression, the attribute income will intrinsically influence the result more due to its larger
values. But this doesn’t necessarily mean it is more important as a predictor.
Methods of Normalization:
Rescaling: also known as “min-max normalization”; the values are shifted and rescaled so that they
end up ranging between 0 and 1. It is the simplest of all methods and is calculated as:
x' = (x - min(x)) / (max(x) - min(x))
Mean normalization: This method uses the mean of the observations in the transformation process:
x' = (x - mean(x)) / (max(x) - min(x))
Z-score normalization: Also known as standardization, this technique uses the Z-score or “standard score”.
It is widely used in machine learning algorithms such as SVM and logistic regression:
z = (x - μ) / σ
Here, z is the standard score, μ is the population mean, and σ is the population standard deviation.
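A compact sketch applying all three methods with pandas to the hypothetical age/income data from the example above:

    import pandas as pd

    df = pd.DataFrame({"age": [23, 35, 47, 59], "income": [25000, 48000, 160000, 420000]})

    # Rescaling (min-max normalization) to the range [0, 1]
    minmax = (df - df.min()) / (df.max() - df.min())

    # Mean normalization
    mean_norm = (df - df.mean()) / (df.max() - df.min())

    # Z-score normalization (pandas uses the sample standard deviation here)
    zscore = (df - df.mean()) / df.std()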
❑ Generalization
Generalization is used to convert low-level data attributes to high-level data attributes by the use of
concept hierarchy.
Data generalization is the process of creating a more broad categorization of data in a database,
essentially ‘zooming out’ from the data to create a more general picture of trends or insights it
provides.
For example, If you have a data set that includes the ages of a group of people, the data generalization
process may look like this:
Original Data: Ages: 15, 17, 20, 26, 28, 31, 33, 37, 42, 42, 46, 48, 49, 54, 57, 57, 58, 59
Generalized Data: Ages: < 40: young; ≥ 40: old
Here, ages in the numerical form of raw data (e.g., 20, 54) are converted into categorical values
(young, old).
As another example, (Customer Name, Address Detail) can be generalized to (Customer Name, City).
Data generalization replaces a specific data value with one that is less precise, which may seem
counterproductive, but actually is a widely applicable and used technique in data mining, analysis, and
secure storage.
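A small pandas sketch of climbing the concept hierarchy (the customer records are hypothetical): full addresses are generalized to cities, and exact ages to age groups:

    import pandas as pd

    customers = pd.DataFrame({
        "name": ["Meera", "John"],
        "address": ["12 MG Road, Bengaluru", "4 Park St, Kolkata"],
        "age": [24, 54],
    })

    # Address Detail -> City, exact age -> coarse age group
    customers["city"] = customers["address"].str.split(",").str[-1].str.strip()
    customers["age_group"] = customers["age"].apply(lambda a: "young" if a < 40 else "old")
    print(customers[["name", "city", "age_group"]])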
Dimensionality Reduction
Usually, datasets contain a large number of attributes (i.e., features) and hundreds of thousands of
instances (i.e., data points), making the dataset voluminous.
High dimensional datasets pose several challenges in analysis projects.
In particular, many machine learning algorithms cannot be applied directly on a high dimensional
dataset.
Large amounts of data might sometimes produce worse performances in data analytics applications.
Dimensionality reduction techniques help us reduce high dimensional data into a lower-dimensional
format that is easier to analyse and visualize.
Low Variance Filter
Consider a variable in a dataset where all the values are the same, say 1, or vary very little.
When a dataset has constant variables, or variables whose values vary very little, they cannot help
improve the model’s performance. We need to identify these low-variance variables and eliminate them
when their variance falls below a threshold value.
However, one thing you must remember about the low variance filter method is that variance is range
dependent. Thus, normalization is a must before applying this dimensionality reduction technique.
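A sketch of the low variance filter with scikit-learn, on a made-up frame with one constant column; the data is normalized first because variance is range dependent:

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    df = pd.DataFrame({
        "constant": [7, 7, 7, 7, 7],
        "spread":   [1.0, 4.0, 2.5, 9.0, 6.0],
        "narrow":   [100.0, 100.2, 100.1, 100.0, 100.3],
    })

    # Min-max normalize first; a constant column has zero range, so fill the NaNs with 0
    scaled = ((df - df.min()) / (df.max() - df.min())).fillna(0.0)

    selector = VarianceThreshold(threshold=0.01)   # drop features with variance <= 0.01
    selector.fit(scaled)
    print(df.columns[selector.get_support()])      # columns that survive the filter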
Random Forest
One approach to dimensionality reduction is to generate a large and carefully constructed set of trees
against a target attribute and then use each attribute’s usage statistics to find the most informative
subset of features.
Specifically, we can generate a large set (e.g., 2000) of very shallow trees (2 levels deep), with each tree
trained on a small subset (e.g., 3) of the total attributes.
If an attribute is often selected as the best split, it is most likely an informative feature to retain.
A score calculated on the attribute usage statistics in the random forest tells us ‒ relative to the other
attributes ‒ which are the most predictive attributes.
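A sketch of the idea with scikit-learn; the data is synthetic, with only two of the eight features actually related to the target:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))                  # 8 candidate features
    y = (X[:, 0] + 2 * X[:, 3] > 0).astype(int)    # only features 0 and 3 matter

    # Many very shallow trees, each considering only a few features per split
    forest = RandomForestClassifier(n_estimators=2000, max_depth=2,
                                    max_features=3, random_state=0)
    forest.fit(X, y)

    # Usage-based importance scores: higher means more informative to retain
    for i, score in enumerate(forest.feature_importances_):
        print(f"feature {i}: {score:.3f}")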
Principal Component Analysis (PCA)
✓ Principal components are extracted in such a way that the first principal component captures the
maximum information / variance in the dataset. The larger the variability captured, the larger the
information captured by the component.
✓ The second principal component is computed to capture the remaining variance in the dataset and
is uncorrelated with (orthogonal to) the first principal component.
✓ The third principal component is computed to capture the variance not captured by the first two
principal components, and so on.
✓ For example, 10-dimensional data gives you 10 principal components; PCA tries to put the maximum
possible information in the first component, then the maximum remaining information in the second,
and so on, as illustrated in the sketch below.
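A minimal sketch with scikit-learn on synthetic 10-dimensional data (in practice the features are usually standardized before PCA):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))     # synthetic 10-dimensional data

    pca = PCA(n_components=3)          # keep only the first 3 principal components
    X_reduced = pca.fit_transform(X)

    # Fraction of total variance captured by each component, in decreasing order
    print(pca.explained_variance_ratio_)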