
UNIT-I

(Data Analytics)
Data Management: Design Data Architecture and manage the data for analysis,
understand various sources of Data like Sensors/Signals/GPS etc. Data
Management, Data Quality (noise, outliers, missing values, duplicate data) and
Data Preprocessing & Processing.

Design Data Architecture:


Data architecture is the process of standardizing how organizations collect, store,
transform, distribute, and use data. The goal is to deliver relevant data to the people
who need it, when they need it, and to help them make sense of it. Data architecture
design is a set of standards composed of certain policies, rules and models.
Data is usually one of several architecture domains that form the pillars of an
enterprise architecture or solution architecture. Data architecture is divided
into three essential models:
➢ Conceptual model
➢ Logical model
➢ Physical model

Conceptual model:
It is a business model which uses the Entity Relationship (ER) model to describe the
relations between entities and their attributes.
Logical model:
It is a model where the problem is represented in a logical form, such as rows and
columns of data, classes, XML tags and other DBMS techniques.
Physical model:
The physical model holds the database design details, such as which type of database
technology is suitable for the architecture.
Factors that influence Data Architecture:
A few influences that can have an effect on data architecture are business
requirements, business policies, the technology in use, business economics, and data
processing needs.
➢ Business requirements
➢ Business policies
➢ Technology in use
➢ Business economics
➢ Data processing needs
Business requirements:
These include factors such as the expansion of the business, the performance
of system access, data management, transaction management, making use of
raw data by converting it into image files and records, and then storing it in
data warehouses. Data warehouses are the main means of storing business
transactions.
Business policies:
The policies are rules that are useful for describing the way of processing
data. These policies are made by internal organizational bodies and other
government agencies.
Technology in use:
This includes drawing on previously completed data architecture designs as well
as existing licensed software purchases and database technology.
Business economics:
Economic factors such as business growth and loss, interest rates, loans, the
condition of the market, and the overall cost will also have an effect on the
architecture design.
Data processing needs:
These include factors such as mining of the data, large continuous
transactions, database management, and other data preprocessing needs.
Data management:
Data management is an administrative process that includes acquiring,
validating, storing, protecting, and processing required data to ensure the
accessibility, reliability, and timeliness of the data for its users.
Data management software is essential, as we are creating and
consuming data at unprecedented rates.
Data management is the practice of managing data as a valuable resource to
unlock its potential for an organization. Managing data effectively requires having
a data strategy and reliable methods to access, integrate, cleanse, govern, store
and prepare data for analytics. In our digital world, data pours into organizations
from many sources – operational and transactional systems, scanners, sensors,
smart devices, social media, video and text. But the value of data is not based on
its source, quality or format. Its value depends on what you do with it.
Motivation/Importance of Data management:
➢ Data management plays a significant role in an organization's ability to
generate revenue, control costs.
➢ Data management helps organizations to mitigate risks.
➢ It enables decision making in organizations.
What are the benefits of good data management?
➢ Optimum data quality
➢ Improved user confidence
➢ Efficient and timely access to data
➢ Improves decision making in an organization
Managing data Resources:
➢ An information system provides users with timely, accurate, and relevant
information.
➢ The information is stored in computer files. When files are properly arranged
and maintained, users can easily access and retrieve the information when
they need it.
➢ If the files are not properly managed, they can lead to chaos in information
processing.
➢ Even if the hardware and software are excellent, the information system
can be very inefficient because of poor file management.
Areas of Data Management:
Data Modeling: This is first creating a structure for the data that you collect and use,
and then organizing this data in a way that is easily accessible and efficient for
storing and pulling the data for reports and analysis.
Data Warehousing: This is storing data effectively so that it can be accessed and used
efficiently in the future.
Data Movement: This is the ability to move data from one place to another. For instance,
data needs to be moved from where it is collected to a database and then to an
end user.
Understand various sources of the Data:
Data are a special type of information generally obtained through observations,
surveys and inquiries, or generated as a result of human activity. Methods of data
collection are essential for anyone who wishes to collect data.
Data collection is a fundamental aspect and, as a result, there are different
methods of collecting data which, when used on one particular set, will result in
different kinds of data.
Collection of data refers to the purposeful gathering of information relevant to the
subject matter of the study from the units under investigation. The method of
collection of data mainly depends upon the nature, purpose and scope of the inquiry
on one hand, and the availability of resources and time on the other.
Data can be generated from two types of sources namely
1. Primary sources of data
2. Secondary sources of data
1. Primary sources of data:
Primary data refers to the first hand data gathered by the researcher himself.
Sources of primary data are surveys, observations, Experimental Methods.
Survey: The survey method is one of the primary sources of data, used
to collect quantitative information about items in a population. Surveys are
used in different areas for collecting data, in both the public and private sectors.
A survey may be conducted in the field by the researcher. The respondents are
contacted by the research person personally, telephonically or through mail. This
method takes a lot of time, efforts and money but the data collected are of high
accuracy, current and relevant to the topic.
When the questions are administered by a researcher, the survey is called a
structured interview or a researcher-administered survey.
Observations: Observation is another primary source of data. Observation is a
technique for obtaining information that involves measuring variables or gathering
the data necessary for measuring the variable under investigation.
Observation is defined as the accurate watching and noting of phenomena as they
occur in nature, with regard to cause-and-effect relations.
Interview: Interviewing is a technique that is primarily used to gain an
understanding of the underlying reasons and motivations for people’s attitudes,
preferences or behavior. Interviews can be undertaken on a personal one-to-one
basis or in a group.
Experimental Method: There are a number of experimental designs that are used
in carrying out an experiment. However, market researchers have used four
experimental designs most frequently. These are:
CRD - Completely Randomized Design
RBD - Randomized Block Design
LSD - Latin Square Design
FD - Factorial Designs
CRD: A completely randomized design (CRD) is one where the treatments are assigned
completely at random so that each experimental unit has the same chance of
receiving any one treatment.
CRD is appropriate only for experiments with homogeneous experimental
units.
RBD - The term Randomized Block Design has originated from agricultural
research. In this design several treatments of variables are applied to different
blocks of land to ascertain their effect on the yield of the crop. Blocks are formed
in such a manner that each block contains as many plots as the number of
treatments, so that one plot from each block is selected at random for each treatment.
The production of each plot is measured after the treatment is given. These data
are then interpreted and inferences are drawn by using the Analysis of Variance
technique, so as to know the effect of various treatments like different doses of
fertilizers, different types of irrigation, etc.
LSD - Latin Square Design - A Latin square is one of the experimental designs
which has a balanced two way classification scheme say for example - 4 X 4
arrangement. In this scheme each letter from A to D occurs only once in each row
and only once in each column. The balanced arrangement, it may be noted,
will not be disturbed if any row is interchanged with another.

The balanced arrangement achieved in a Latin square is its main strength. In this
design, the comparisons among treatments are free from both the differences
between rows and those between columns. Thus the magnitude of the error will be
smaller than in any other design.
FD - Factorial Designs - This design allows the experimenter to test two or more
variables simultaneously. It also measures interaction effects of the variables and
analyzes the impacts of each of the variables. In a true experiment, randomization
is essential so that the experimenter can infer cause and effect without any bias.
An experiment which involves multiple independent variables is known as a factorial
design.
A factor is a major independent variable; a level is a subdivision of a factor. For
example, an experiment may have two factors, time in instruction and setting, each
with two levels.

For example, suppose a botanist wants to understand the effects of sunlight
(low vs. high) and watering frequency (daily vs. weekly) on the growth of a
certain species of plant. This is an example of a 2×2 factorial design because there
are two independent variables, each with two levels:
Independent variable #1: Sunlight (Levels: Low, High)
Independent variable #2: Watering Frequency (Levels: Daily, Weekly)
2. Secondary sources of Data:
Secondary sources are data collected by someone else earlier. Secondary data are
data collected by a party not related to the research study, gathered for some other
purpose and at a different time in the past.
If the researcher uses these data, then they become secondary data for the
current user. Sources of secondary data are government publications, websites,
books, journal articles and internal records.
1. Internal Sources
2. External Sources
Internal sources are within the organization. External sources are outside the
organization.

➢ Internal Sources:
If available, internal secondary data may be obtained with less time, effort and
money than the external secondary data. In addition, they may also be more
pertinent to the situation at hand since they are from within the organization.
The internal sources include
Accounting resources- These give a great deal of information which can be used by
the marketing researcher. They give information about internal factors.
Sales Force Reports- These give information about the sales of a product. The
information provided is from outside the organization.
Internal Experts- These are the people heading the various departments.
They can give an idea of how a particular thing is working.
Miscellaneous Reports- This is the information obtained from operational reports.
If the data available within the organization are unsuitable or inadequate, the
marketer should extend the search to external secondary data sources.

➢ External Sources of Data:


External Sources are sources which are outside the company in a larger
environment. Collection of external data is more difficult because the data have
much greater variety and the sources are much more numerous.
Government Publications- Government sources provide an extremely rich pool
of data for the researchers. In addition, many of these data are available free of
cost on internet websites.
There are a number of government agencies generating data.
These are:
Registrar General of India- It is an office which generates demographic data. It
includes details of gender, age, occupation etc.
Central Statistical Organization- This organization publishes the national accounts
statistics. It contains estimates of national income for several years, growth rates,
and rates of major economic activities. The Annual Survey of Industries is also
published by the CSO. It gives information about the total number of workers
employed, production units, materials used and value added by the manufacturer.
Director General of Commercial Intelligence- This office operates from Kolkata.
It gives information about foreign trade i.e. import and export. These figures are
provided region-wise and country-wise.
Ministry of Commerce and Industries- This ministry, through the office of the
economic advisor, provides information on the wholesale price index. These indices
may be related to a number of sectors like food, fuel, power, food grains etc. It
also generates All India Consumer Price Index numbers for industrial workers,
urban non-manual employees and agricultural labourers.
Planning Commission- It provides the basic statistics of Indian Economy.
Reserve Bank of India- This provides information on Banking Savings and
investment. RBI also prepares currency and finance reports.
Labour Bureau- It provides information on skilled, unskilled and white-collar jobs.
National Sample Survey- This is done by the Ministry of Planning and it provides
social, economic, demographic, industrial and agricultural statistics.
Department of Economic Affairs- It conducts economic survey and it also
generates information on income, consumption, expenditure, investment, savings
and foreign trade.
State Statistical Abstract- This gives information on various types of activities
related to the state like - commercial activities, education, occupation etc.
Non-Government Publications- These include publications of various industrial
and trade associations, such as the Indian Cotton Mills Association and various
chambers of commerce.
Understand various sources of Data like Sensors/signal/GPS etc:
Sensor data:
➢ Sensor data is the output of a device that detects and responds to some
type of input from the physical environment. The output may be used to
provide information or input to another system or to guide a process.
➢ Here are a few examples of sensors, just to give an idea of the number and
diversity of their applications:
❖ A photosensor detects the presence of visible light, infrared
transmission (IR) and/or ultraviolet (UV) energy.
❖ Smart grid sensors can provide real-time data about grid
conditions, detecting outages, faults and load and triggering
alarms.
❖ Wireless sensor networks combine specialized transducers
with a communications infrastructure for monitoring and
recording conditions at diverse locations. Commonly monitored
parameters include temperature, humidity, pressure, wind
direction and speed, illumination intensity, vibration intensity,
sound intensity, powerline voltage, chemical concentrations,
pollutant levels and vital body functions.
Signal:
The simplest form of signal is a direct current (DC) that is switched on and off;
this is the principle by which the early telegraph worked. More complex signals
consist of an alternating-current (AC) or electromagnetic carrier that contains one
or more data streams.
Data must be transformed into electromagnetic signals prior to transmission across
a network. Data and signals can be either analog or digital. A signal is periodic if it
consists of a continuously repeating pattern.
Global Positioning System (GPS):
The Global Positioning System (GPS) is a space-based navigation system that
provides location and time information in all weather conditions, anywhere on or
near the Earth where there is an unobstructed line of sight to four or more GPS
satellites. The system provides critical capabilities to military, civil, and
commercial users around the world. The United States government created the
system, maintains it, and makes it freely accessible to anyone with a GPS receiver.
Quality of Data:
Data quality is the ability of your data to serve its intended purpose, based on
factors such as accuracy, completeness, consistency and reliability; these factors
play a huge role in determining data quality.
Accuracy:
Erroneous values that deviate from the expected. The causes for inaccurate data
can be various, which include:
➢ Human/computer errors during data entry and transmission
➢ Users deliberately submitting incorrect values (called disguised
missing data)
➢ Incorrect formats for input fields
➢ Duplication of training examples
Completeness:
Lacking attribute/feature values or values of interest. The dataset might be
incomplete due to:
➢ Unavailability of data
➢ Deletion of inconsistent data
➢ Deletion of data deemed irrelevant initially
Consistency: Inconsistent data means a data source containing discrepancies between
different data items. Some attributes representing a given concept may have
different names in different databases, causing inconsistencies and redundancies.
Naming inconsistencies may also occur for attribute values.
Reliability: Reliability means that data are reasonably complete and accurate,
meet the intended purposes, and are not subject to inappropriate alteration.
Some other features that also affect the data quality include timeliness (the data
is incomplete until all relevant information is submitted after certain time
periods), believability (how much the data is trusted by the user) and
interpretability (how easily the data is understood by all stakeholders).
To ensure high quality data, it’s crucial to preprocess it. To make the process easier,
data preprocessing is divided into four stages: data cleaning, data integration,
data reduction, and data transformation.
Data quality is also affected by
➢ Outliers
➢ Missing Values
➢ Noisy
➢ Duplicate Values
Outliers:
Outliers are extreme values that deviate from other observations on data, they
may indicate a variability in a measurement, experimental errors or a novelty. It
is a point or an observation that deviates significantly from the other
observations.
Outlier detection from graphical representation:
➢ Scatter plot and
➢ Box plot
Scatter plot:
Scatter plots are used to plot data points on a horizontal and a vertical axis in the
attempt to show how much one variable is affected by another. A scatterplot
uses dots to represent values for two different numeric variables.

Box plot:
A boxplot is a standardized way of displaying the distribution of data based on a
five-number summary (illustrated programmatically in the sketch after this list):
• Minimum
• First quartile (Q1)
• Median
• Third quartile (Q3)
• Maximum
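The same five-number summary drives the 1.5 × IQR rule that box plots use for whiskers. Below is a minimal sketch in Python, assuming pandas and a made-up list of values, of how the quartiles and the 1.5 × IQR limits can be used to flag outliers:

```python
import pandas as pd

# Hypothetical sample with two extreme values
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 21, 95, -40])

q1 = values.quantile(0.25)          # first quartile (Q1)
q3 = values.quantile(0.75)          # third quartile (Q3)
iqr = q3 - q1                       # interquartile range

# Points beyond 1.5 * IQR from the quartiles are flagged, as in a box plot
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)                     # expected to flag 95 and -40
```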
Most common causes of outliers on a data set:
➢ Data entry errors (human errors)
➢ Measurement errors (instrument errors)
➢ Experimental errors (data extraction or experiment
planning/executing errors)
➢ Intentional (dummy outliers made to test detection methods)
➢ Data processing errors (data manipulation or data set unintended
mutations)
➢ Sampling errors (extracting or mixing data from wrong or various
sources)
➢ Natural (not an error, novelties in data)
How to remove Outliers?
Most of the ways to deal with outliers are similar to the methods for missing values:
deleting observations, transforming them, binning them, treating them as a
separate group, imputing values and other statistical methods. Here, we will
discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if it is due to data entry error,
data processing error or outlier observations are very small in numbers. We can
also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate
outliers. Taking the natural log of a value reduces the variation caused by extreme
values. Binning is also a form of variable transformation. The decision tree algorithm
deals with outliers well because it bins variables. We can also use the process
of assigning weights to different observations.
Imputing: As with imputation of missing values, we can also impute outliers, using
mean, median or mode imputation methods. Before imputing values, we should
analyse whether an outlier is natural or artificial. If it is artificial, we can go with
imputing values. We can also use a statistical model to predict the values of outlier
observations and then impute them with the predicted values.
Missing data:
Missing data in the training data set can reduce the power / fit of a model or can lead
to a biased model, because we have not analysed the behaviour and relationships with
other variables correctly. It can lead to wrong prediction or classification.
Why my data has missing values?
We looked at the importance of treatment of missing values in a dataset. Now,
let’s identify the reasons for occurrence of these missing values. They may occur
at two stages:
1. Data Extraction: It is possible that there are problems with the extraction process.
In such cases, we should double-check for correct data with the data guardians. Some
hashing procedures can also be used to make sure the data extraction is correct.
Errors at the data extraction stage are typically easy to find and can be corrected
easily as well.
2. Data Collection: These errors occur at the time of data collection and are harder
to correct. They can be categorized into four types:
➢ Missing completely at random: This is a case when the probability of a
missing value is the same for all observations. For example:
respondents of a data collection process decide whether they will declare
their earnings after tossing a fair coin. If a head occurs, the respondent
declares his / her earnings and vice versa. Here each observation has
an equal chance of a missing value.
➢ Missing at random: This is a case when a variable is missing at random
and the missing ratio varies for different values / levels of other input
variables. For example: we are collecting data for age, and females have
a higher missing-value rate compared to males.
➢ Missing that depends on unobserved predictors: This is a case when
the missing values are not random and are related to the unobserved
input variable. For example: In a medical study, if a particular
diagnostic causes discomfort, then there is higher chance of drop out
from the study. This missing value is not at random unless we have
included “discomfort” as an input variable for all patients.
➢ Missing that depends on the missing value itself: This is a case when
the probability of missing value is directly correlated with missing
value itself. For example: People with higher or lower income are likely
to provide non-response to their earning.
Which are the methods to treat missing values?
1. Deletion: It is of two types: list-wise deletion and pair-wise deletion (see the
sketch after this list).
➢ In list-wise deletion, we delete observations where any of the variables is
missing. Simplicity is one of the major advantages of this method, but it
reduces the power of the model because it reduces the sample size.
➢ In pair-wise deletion, we perform analysis with all cases in which the
variables of interest are present. The advantage of this method is that it keeps
as many cases as possible available for analysis. One of its disadvantages is
that it uses different sample sizes for different variables.

➢ Deletion methods are used when the nature of missing data is “Missing
completely at random” else non random missing values can bias the model
output.
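As a rough illustration of the two deletion methods, the sketch below assumes pandas and a small made-up DataFrame; dropna() gives list-wise deletion, while pandas' pair-wise handling of NaN inside corr() illustrates pair-wise analysis:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with scattered missing values
df = pd.DataFrame({
    "age":      [25, 30, np.nan, 40, 35],
    "income":   [50, np.nan, 65, 80, 72],
    "manpower": [28, 31, 27, np.nan, 30],
})

# List-wise deletion: drop every row that has any missing variable
listwise = df.dropna()

# Pair-wise deletion: each statistic uses all cases where the variables
# involved are present, so different pairs use different sample sizes
pairwise_corr = df.corr(method="pearson")   # pandas skips NaN pair-wise

print(listwise)
print(pairwise_corr)
```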
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing
values with estimated ones. The objective is to employ known relationships that
can be identified in the valid values of the data set to assist in estimating the
missing values. Mean / Mode / Median imputation is one of the most frequently
used methods. It consists of replacing the missing data for a given attribute by
the mean or median (quantitative attribute) or mode (qualitative attribute) of all
known values of that variable.
It can be of two types:
➢ Generalized Imputation: In this case, we calculate the mean or median of
all non-missing values of the variable and then replace the missing values
with it. For example, if the variable “Manpower” has missing values, we take
the average of all non-missing values of “Manpower” (say 28.33) and then
replace the missing values with it.
➢ Similar case Imputation: In this case, we calculate the average separately for
each group over the non-missing values, for example for gender “Male” (29.75)
and “Female” (25), and then replace the missing values based on gender. For
“Male” we replace missing values of Manpower with 29.75 and for “Female”
with 25. A small sketch of both approaches follows this list.
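A minimal pandas sketch of both imputation types, using a made-up Gender/Manpower table (the figures 28.33, 29.75 and 25 quoted above refer to a worked table that is not reproduced here, so the numbers below are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical table with a missing "Manpower" value
df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Manpower": [30, 29, 25, np.nan, 31, 26],
})

# Generalized imputation: replace with the mean of all non-missing values
overall_mean = df["Manpower"].mean()
df["Generalized"] = df["Manpower"].fillna(overall_mean)

# Similar case imputation: replace with the mean of the same gender group
group_mean = df.groupby("Gender")["Manpower"].transform("mean")
df["SimilarCase"] = df["Manpower"].fillna(group_mean)

print(df)
```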
Noisy Data:
Noisy data is meaningless data that can’t be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc. It can be handled
in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided
into segments of equal size and then various methods are applied to smooth each
segment. Each segment is handled separately: one can replace all data in a segment
by its mean, or boundary values can be used to complete the task (see the sketch
after these methods).
➢ Smoothing by bin means: In smoothing by bin means, each value in a bin
is replaced by the mean value of the bin.
➢ Smoothing by bin median: In this method each bin value is replaced by
its bin median value.
➢ Smoothing by bin boundary: In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
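A small sketch of the three smoothing variants using NumPy, with an assumed sorted list of values split into equal-depth bins:

```python
import numpy as np

# Hypothetical sorted values partitioned into three equal-depth bins of four
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = data.reshape(3, 4)

# Smoothing by bin means: every value in a bin becomes the bin mean
smoothed_by_means = np.concatenate(
    [np.full(b.size, b.mean()) for b in bins])

# Smoothing by bin medians: every value becomes the bin median
smoothed_by_medians = np.concatenate(
    [np.full(b.size, np.median(b)) for b in bins])

# Smoothing by bin boundaries: replace each value with the closer of min/max
smoothed_by_bounds = np.concatenate(
    [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins])

print(smoothed_by_means)
print(smoothed_by_medians)
print(smoothed_by_bounds)
```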
Regression: Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
Clustering: This approach groups similar data into clusters. Outliers may go
undetected, or they will fall outside the clusters.
Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
Duplicate values: A dataset may include data objects which are duplicates of
one another. It may happen when say the same person submits a form more than
once. The term deduplication is often used to refer to the process of dealing with
duplicates. In most cases, the duplicates are removed so as to not give that
particular data object an advantage or bias, when running machine learning
algorithms.
Redundant data occurs when we merge data from multiple databases. If the
redundant data is not removed, incorrect results will be obtained during data
analysis. Redundant data occurs due to the following reasons.

➢ Object identification: The same attribute or object may have different names
in different databases.
➢ Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue.

Redundant attributes may be detected by correlation analysis. Careful integration
of the data from multiple sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality (see the sketch below).
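A short pandas sketch, with a made-up merged table, of removing duplicate records and using correlation analysis to spot a derivable (redundant) attribute:

```python
import pandas as pd

# Hypothetical merged data with a repeated record and a derivable attribute
df = pd.DataFrame({
    "customer_id":     [101, 102, 102, 103],
    "monthly_revenue": [1200.0, 800.0, 800.0, 950.0],
    "annual_revenue":  [14400.0, 9600.0, 9600.0, 11400.0],  # 12 * monthly
})

# Deduplication: drop rows that are exact duplicates of one another
deduped = df.drop_duplicates()

# Correlation analysis: a value close to +/-1 suggests a redundant attribute
corr = deduped["monthly_revenue"].corr(deduped["annual_revenue"])
print(deduped)
print(f"correlation = {corr:.2f}")   # 1.00 here, so annual_revenue is derivable
```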
Data Pre-processing:
Data preprocessing is a data mining technique that involves transforming raw
data into an understandable format. Real-world data is often incomplete,
inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain
many errors.
Major Tasks in Data Preprocessing are
➢ Data Cleaning
➢ Data Integration
➢ Data Transformation
➢ Data Reduction
1. Data Cleaning: Data is cleansed through processes such as filling in missing
values, smoothing the noisy data, or resolving the inconsistencies in the data.
Data cleaning tasks
➢ Fill in missing values
➢ Identify outliers and smooth out noisy data
➢ Correct inconsistent data
➢ Resolve redundancy caused by data integration
Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., occupation=“ ”
Noisy: containing errors or outlier values that deviate from the expected.
e.g., Salary=“-10”
Inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
2. Data Integration: Data with different representations are put together and
conflicts within the data are resolved. Integration of multiple databases, data
cubes, or files.

There are mainly 2 major approaches for data integration – one is “Tight coupling
approach” and another is “Loose coupling approach”.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component. In this
coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it in
a way the source database can understand and then sends the query directly to
the source databases to obtain the result.
Issues in Data Integration:
There are three issues to consider during data integration: schema integration,
redundancy detection, and resolution of data value conflicts. These are explained
briefly below.
➢ Schema Integration: Integrate metadata from different sources. Matching
real-world entities from multiple sources is referred to as the entity
identification problem.
➢ Redundancy: An attribute may be redundant if it can be derived or obtained
from another attribute or set of attributes. Inconsistencies in attributes can
also cause redundancies in the resulting data set. Some redundancies can
be detected by correlation analysis.
➢ Detection and resolution of data value conflicts: This is the third critical
issue in data integration. Attribute values from different sources may differ
for the same real-world entity. An attribute in one system may be recorded
at a lower level of abstraction than the “same” attribute in another.
3. Data Transformation: This step is taken in order to transform the data into forms
appropriate for the mining process. This involves the following ways.
➢ Normalization: It is done in order to scale the data values in a
specified range (-1.0 to 1.0 or 0.0 to 1.0).
Min-Max Normalization:
This transforms the original data linearly. Suppose that min_F is the minimum
and max_F is the maximum of an attribute F, and that the new range is
[new_min_F, new_max_F]. We have the formula:
v' = ((v − min_F) / (max_F − min_F)) × (new_max_F − new_min_F) + new_min_F
where v is the value you want to map into the new range and v' is the new value
you get after normalizing the old value (see the sketch after this list).

➢ Attribute Selection: In this strategy, new attributes are constructed from the
given set of attributes to help the mining process.
➢ Discretization: This is done to replace the raw values of numeric
attribute by interval levels or conceptual levels.
➢ Concept Hierarchy Generation: Here attributes are converted from
level to higher level in hierarchy. For Example-The attribute “city” can
be converted to “country”.
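A minimal sketch of min-max normalization in Python, assuming pandas and an attribute F rescaled into the range [0.0, 1.0]:

```python
import pandas as pd

# Hypothetical attribute F scaled into the new range [0.0, 1.0]
F = pd.Series([200, 300, 400, 600, 1000])

new_min, new_max = 0.0, 1.0
min_F, max_F = F.min(), F.max()

# v' = ((v - min_F) / (max_F - min_F)) * (new_max - new_min) + new_min
F_scaled = (F - min_F) / (max_F - min_F) * (new_max - new_min) + new_min
print(F_scaled.tolist())   # [0.0, 0.125, 0.25, 0.5, 1.0]
```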
4. Data Reduction: Data mining is a technique used to handle huge amounts of data,
and analysis becomes harder while working with such huge volumes. To get rid of
this, we use data reduction techniques, which aim to increase storage efficiency and
reduce data storage and analysis costs.
The various steps to data reduction are:
➢ Data Cube Aggregation: Aggregation operation is applied to data for the
construction of the data cube.
➢ Attribute Subset Selection: Only the highly relevant attributes should be used;
the rest can be discarded. For performing attribute selection, one can use the
level of significance and the p-value of the attribute: an attribute having a
p-value greater than the significance level can be discarded.
➢ Data compression
It reduces the size of the files using different encoding mechanisms
(Huffman Encoding & run-length Encoding). We can divide it into two types
based on their compression techniques.
Lossless Compression:
Encoding techniques (such as Run Length Encoding) allow a simple and minimal
reduction of data size. Lossless data compression uses algorithms to restore
the precise original data from the compressed data.
Lossy Compression:
Methods such as the Discrete Wavelet Transform and PCA (principal component
analysis) are examples of this compression. In lossy data compression, the
decompressed data may differ from the original data but are useful enough to
retrieve information from them.
➢ Numerosity Reduction: This enables storing a model of the data instead of
the whole data, for example regression models.
➢ Dimensionality Reduction: This reduces the size of data by encoding
mechanisms. It can be lossy or lossless. If the original data can be retrieved
after reconstruction from the compressed data, the reduction is called lossless
reduction; otherwise it is called lossy reduction. The two effective methods
of dimensionality reduction are wavelet transforms and PCA (Principal
Component Analysis), sketched below.
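As a rough illustration of lossy dimensionality reduction, the sketch below uses scikit-learn's PCA (an assumed tool choice, not prescribed by the notes) on a small synthetic matrix of correlated attributes:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 6 records with 4 correlated numeric attributes
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base * 2 + rng.normal(scale=0.01, size=(6, 2))])

# Keep only 2 principal components (4 attributes reduced to 2)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (6, 2)
print(pca.explained_variance_ratio_.sum())   # close to 1.0 for this data
```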
5. Data Discretization: Involves reducing the number of values of a
continuous attribute by dividing the range of the attribute into intervals. Data
discretization refers to a method of converting a huge number of data values into
smaller ones so that the evaluation and management of data become easy. In
other words, data discretization is a method of converting attributes values of
continuous data into a finite set of intervals with minimum data loss.
Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

After discretization:

Attribute   Values                        Label
Age         1, 5, 4, 9, 7                 Child
Age         11, 14, 17, 13, 18, 19        Young
Age         31, 33, 36, 42, 44, 46        Mature
Age         70, 74, 77, 78                Old
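A small pandas sketch of the same discretization, with interval edges assumed so that they reproduce the Child/Young/Mature/Old groups in the table above:

```python
import pandas as pd

age = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                 31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

# Assumed interval edges: (0, 10], (10, 30], (30, 60], (60, 120]
labels = ["Child", "Young", "Mature", "Old"]
age_group = pd.cut(age, bins=[0, 10, 30, 60, 120], labels=labels)

print(pd.DataFrame({"Age": age, "Group": age_group}))
```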


Important Questions
1. Discuss about Data Management? How to manage the data for analysis?
2. How to design Data Architecture? What are the factors that influence the
data architecture?
3. Define Primary Data sources and also explain about the types of primary
data sources?
4. What is Data Pre-processing? Discuss about various steps to pre-process
the data?
5. Explain about Missing values and how to eliminate missing values?
6. Define Quality of data and discuss various factors that affect the Quality of
data?
7. Discuss about various sources of data?
8. What is Noisy data? Explain about methods to handle noisy data?
9. Discuss about different methods in experimental data sources (CRD, RBD,
LSD and FA)?
10. Differentiate internal and external secondary data sources with examples?
11. What is data transformation in data preprocessing and discuss about
different normalization techniques?
12. Discuss the process of handling duplicate values in organizational data.
Briefly describe various sources of data like sensors, signals, GPS in data
management?

UNIT – II
INTRODUCTION TO ANALYTICS
2.1 Introduction to Analytics

As an enormous amount of data gets generated, the need to extract useful insights is a
must for a business enterprise. Data Analytics has a key role in improving your business.
Here are 4 main factors which signify the need for Data Analytics:

• Gather Hidden Insights – Hidden insights from data are gathered and then analyzed
with respect to business requirements.
• Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals to deal with further actions for a high rise in
business.
• Perform Market Analysis – Market analysis can be performed to understand the
strengths and the weaknesses of competitors.
• Improve Business Requirements – Analysis of data allows improving business
according to customer requirements and experience.

Data Analytics refers to the techniques to analyze data to enhance productivity and
business gain. Data is extracted from various sources and is cleaned and categorized to
analyze different behavioral patterns. The techniques and the tools used vary according to the
organization or individual.

Data analysts translate numbers into plain English. A Data Analyst delivers value to their
companies by taking information about specific topics and then interpreting, analyzing,
and presenting findings in comprehensive reports. So, if you have the capability to collect
data from various sources, analyze the data, gather hidden insights and generate reports, then
you can become a Data Analyst. Refer to the image below:

Fig 2.1 Data Analytics



In general, data analytics also involves a degree of human knowledge, as discussed below
in figure 2.2: under each type of analytics there is a part of human knowledge required
in prediction. Descriptive analytics requires the highest human input, while predictive
analytics requires less human input. In the case of prescriptive analytics no human input
is required, since all the data is predicted.

Fig 2.2 Data and Human Work

Fig 2.3 Data Analytics



2.2 Introduction to Tools and Environment

In general, data analytics involves three main parts: subject knowledge, statistics, and a
person with computer knowledge who can work on a tool to give insight into the business.
The mainly used tools are R and Python, as shown in figure 2.3.

With the increasing demand for Data Analytics in the market, many tools have emerged
with various functionalities for this purpose. Either open-source or user-friendly, the top tools
in the data analytics market are as follows.

• R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac
OS. It also provides tools to automatically install all packages as per user requirement.
• Python – Python is an open-source, object-oriented programming language which is easy
to read, write and maintain. It provides various machine learning and visualization
libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras etc. It can also work
with different data platforms such as a SQL server, a MongoDB database or JSON.
• Tableau Public – This is free software that connects to any data source such as Excel,
a corporate Data Warehouse etc. It then creates visualizations, maps, dashboards etc. with
real-time updates on the web.
• QlikView – This tool offers in-memory data processing with the results delivered to the
end-users quickly. It also offers data association and data visualization with data being
compressed to almost 10% of its original size.
• SAS – A programming language and environment for data manipulation and analytics,
this tool is easily accessible and can analyze data from different sources.
• Microsoft Excel – This tool is one of the most widely used tools for data analytics.
Mostly used for clients’ internal data, this tool analyzes the tasks that summarize the data
with a preview of pivot tables.
• RapidMiner – A powerful, integrated platform that can integrate with any data source
type such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase etc. This tool is
mostly used for predictive analytics, such as data mining, text analytics and machine
learning.
• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through its
modular data pipeline concept.
• OpenRefine – Also known as GoogleRefine, this data cleaning software will help you
clean up data for analysis. It is used for cleaning messy data, the transformation of data
and parsing data from websites.
• Apache Spark – One of the largest large-scale data processing engines, this tool executes
applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk.
This tool is also popular for data pipelines and machine learning model development.

Apart from the above-mentioned capabilities, a Data Analyst should also possess skills
such as Statistics, Data Cleaning, Exploratory Data Analysis, and Data Visualization. Also, if
you have knowledge of Machine Learning, then that would make you stand out from the
crowd.

2.3 Application of modelling a business & Need for business Modelling

Data analytics is mainly used in business by various concerns for the following
purposes, which vary according to business needs and are discussed below in
detail. Nowadays the majority of businesses deal with prediction over large amounts of
data.

Using big data as a fundamental factor in decision making requires new capabilities, and
most firms are far from accessing all data resources. Companies in various sectors have
acquired crucial insight from the structured data collected from different enterprise systems
and analyzed by commercial database management systems. Eg:

1.) Facebook and Twitter, to gauge the instantaneous influence of campaigns and to
examine consumer opinion about their products.
2.) Some companies, like Amazon, eBay, and Google, considered early leaders,
examine factors that control performance to determine what raises sales revenue and
user interactivity.

2.3.1 Utilizing Hadoop in Big Data Analytics.

Hadoop is an open-source software platform that enables the processing of large data sets in
a distributed computing environment. The literature discusses some concepts related to big
data and the rules for building, organizing and analyzing huge data sets in the business
environment; it offers a three-layer architecture and also indicates some graphical tools to
explore and represent unstructured data, and the authors specify how famous companies
could improve their business. Eg: Google, Twitter and Facebook show their interest in
processing big data within cloud environments.

Fig 2.4: Working of Hadoop – With Map Reduce Concept



The Map() step: Each worker node applies the Map() function to the local data and writes the
output to a temporary storage space. The Map() code is run exactly once for each K1 key
value, generating output that is organized by key values K2. A master node arranges it so that
for redundant copies of input data only one is processed.

The Shuffle() step: The map output is sent to the reduce processors, which assign the K2 key
value that each processor should work on, and provide that processor with all of the map-
generated data associated with that key value, such that all data belonging to one key are
located on the same worker node.

The Reduce() step: Worker nodes process each group of output data (per key) in parallel,
executing the user-provided Reduce() code; each function is run exactly once for each K2 key
value produced by the map step.

Produce the final output: The MapReduce system collects all of the reduce outputs and sorts
them by K2 to produce the final outcome.

Fig. 2.4 shows the classical “word count problem” using the MapReduce paradigm. As shown
in Fig. 2.4, initially a process will split the data into a subset of chunks that will later be
processed by the mappers. Once the key/values are generated by mappers, a shuffling process
is used to mix (combine) these key values (combining the same keys in the same worker
node). Finally, the reduce functions are used to count the words, generating a common
output as a result of the algorithm. As a result of the execution of mappers/reducers, the
output will generate a sorted list of word counts from the original text input.
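A single-process Python sketch of the word-count flow in Fig. 2.4 (a simulation of the Map, Shuffle and Reduce steps, not actual Hadoop code; the sample chunks are made up):

```python
from collections import defaultdict

def map_step(chunk):
    """Map(): emit (word, 1) pairs, where the word acts as the K2 key."""
    return [(word, 1) for word in chunk.split()]

def shuffle_step(mapped_pairs):
    """Shuffle(): group all values belonging to the same key together."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Reduce(): run once per K2 key, producing the word's total count."""
    return key, sum(values)

# Split the input into chunks, as the framework would before calling mappers
text_chunks = ["deer bear river", "car car river", "deer car bear"]
mapped = [pair for chunk in text_chunks for pair in map_step(chunk)]
reduced = dict(reduce_step(k, v) for k, v in shuffle_step(mapped).items())

print(sorted(reduced.items()))   # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```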

2.3.2 The Employment of Big Data Analytics on IBM.

IBM and Microsoft are prominent representatives. IBM offers many big data options
that enable users to store, manage, and analyze data through various resources, and it has a
good presence in the business intelligence and healthcare areas. Compared with IBM,
Microsoft has shown powerful work in the area of cloud computing activities and techniques.
Another example is Facebook and Twitter, who collect various data from users' profiles and
use it to increase their revenue.

2.3.3 The Performance of Data Driven Companies.

Big data analytics and business intelligence are united fields which have become widely
significant in the business and academic areas. Companies are permanently trying to gain
insight from extending the three V's (variety, volume and velocity) to support decision
making.

2.4 Databases
Database is an organized collection of structured information, or data, typically
stored electronically in a computer system. A database is usually controlled by
a database management system (DBMS).

The database can be divided into various categories such as text databases,
desktop database programs, relational database management systems (RDBMS), and NoSQL
and object-oriented databases.

A text database is a system that maintains a (usually large) text collection and
provides fast and accurate access to it. Eg: text books, magazines, journals, manuals, etc.

A desktop database is a database system that is made to run on a single computer
or PC. These simpler solutions for data storage are much more limited and constrained than
larger data center or data warehouse systems, where primitive database software is replaced
by sophisticated hardware and networking setups. Eg: Microsoft Excel, Open Access, etc.

A relational database (RDB) is a collective set of multiple data sets organized by
tables, records and columns. RDBs establish a well-defined relationship
between database tables. Tables communicate and share information, which facilitates data
searchability, organization and reporting. Eg: SQL databases such as Oracle, DB2, DBaaS etc.

NoSQL databases are non-tabular and store data differently than relational
tables. NoSQL databases come in a variety of types based on their data model. The main
types are document, key-value, wide-column, and graph. Eg: JSON document stores, MongoDB, CouchDB, etc.

Object-oriented databases (OODB) are databases that represent data in the form
of objects and classes. In object-oriented terminology, an object is a real-world entity, and a
class is a collection of objects. Object-oriented databases follow the fundamental principles
of object-oriented programming (OOP). Eg: C++, Java, C#, Smalltalk, LISP, etc.

2.5 Types of Data and variables

In any database we will be working with data to perform some kind of analysis and
prediction. In a relational database management system we normally use rows to represent
data and columns to represent the attributes.

In terms of big data we refer to a column from an RDBMS as an attribute or a variable.
This variable can be divided into two types: categorical data or qualitative data, and
continuous or discrete data called quantitative data, as shown below in figure 2.5.

Qualitative data or categorical data is normally represented as a variable that holds
characters, and it is divided into two types: nominal data and ordinal data.

In Nominal Data there is no natural ordering of the values in the attribute of the dataset.
Eg: colour, gender, nouns (name, place, animal, thing). These categories cannot be given a
predefined order; for example, there is no specific way to arrange the gender of 50 students
in a class. The first student can be male or female, and similarly for all 50 students, so
ordering is not valid.

In Ordinal Data there is a natural ordering of the values in the attribute of the dataset. Eg:
size (S, M, L, XL, XXL), rating (excellent, good, better, worst). In this example we can
quantify the data after ordering, which gives valuable insights into the data.

Fig 2.5: Types of Data Variables

Quantitative data (discrete or continuous data) can be further divided into two
types: discrete attributes and continuous attributes.

A Discrete Attribute takes only a finite number of numerical values (integers). Eg:
number of buttons, number of days for product delivery, etc. These data can be represented at
every specific interval in the case of time series data mining, or even in ratio-based entries.

A Continuous Attribute takes fractional (real) values. Eg: price, discount, height, weight,
length, temperature, speed, etc. These data can be represented at every specific interval in the
case of time series data mining, or even in ratio-based entries. A small sketch of these
variable types follows.
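A short pandas sketch of the four variable types, using made-up columns; declaring the ordinal column as an ordered categorical is what makes ordering operations meaningful:

```python
import pandas as pd

df = pd.DataFrame({
    "gender":  ["Male", "Female", "Female", "Male"],     # nominal: no order
    "size":    ["S", "XL", "M", "L"],                     # ordinal: natural order
    "buttons": [2, 4, 3, 2],                              # discrete (integers)
    "price":   [9.99, 24.50, 14.75, 11.20],               # continuous (fractions)
})

# Nominal data: categories without ordering
df["gender"] = pd.Categorical(df["gender"])

# Ordinal data: categories with a declared natural ordering
df["size"] = pd.Categorical(df["size"],
                            categories=["S", "M", "L", "XL", "XXL"], ordered=True)

print(df.dtypes)
print(df["size"].min(), "to", df["size"].max())   # ordering makes min/max meaningful
```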

2.6 Data Modelling Techniques

Data modelling is nothing but a process through which data is stored structurally in a
format in a database. Data modelling is important because it enables organizations to make
data-driven decisions and meet varied business goals.

The entire process of data modelling is not as easy as it seems, though. You are
required to have a deeper understanding of the structure of an organization and then propose
a solution that aligns with its end goals and helps it achieve the desired objectives.
Types of Data Models

Data modeling can be achieved in various ways. However, the basic concept of each
of them remains the same. Let’s have a look at the commonly used data modeling methods:

Hierarchical model

As the name indicates, this data model makes use of hierarchy to structure the data in
a tree-like format as shown in figure 2.6. However, retrieving and accessing data is difficult
in a hierarchical database. This is why it is rarely used now.

Fig 2.6: Hierarchical Model Structure

Relational model

Proposed as an alternative to the hierarchical model by an IBM researcher, here data is
represented in the form of tables. It reduces the complexity and provides a clear overview of
the data as shown below in figure 2.7.

Fig 2.7: Relational Model Structure


Network model

The network model is inspired by the hierarchical model. However, unlike the
hierarchical model, this model makes it easier to convey complex relationships as each record
can be linked with multiple parent records as shown in figure 2.8. In this model data can be
shared easily and the computation becomes easier.

Fig 2.8: Network Model Structure


Object-oriented model

This database model consists of a collection of objects, each with its own features and
methods. This type of database model is also called the post-relational database model, as
shown in figure 2.9.

Fig 2.9: Object-Oriented Model Structure

Entity-relationship model

The entity-relationship model, also known as the ER model, represents entities and their
relationships in a graphical format. An entity could be anything – a concept, a piece of data,
or an object.

Fig 2.10: Entity Relationship Diagram

The entity relationship diagram explains the relations between entities, with their
primary keys and foreign keys, as shown in figure 2.10. Along with this, it also shows the
multiple instances of the relations between tables.

Now that we have a basic understanding of data modeling, let’s see why it is important.

Importance of Data Modeling


• A clear representation of data makes it easier to analyze the data properly. It provides
a quick overview of the data which can then be used by the developers in varied
applications.
• Data modeling represents the data properly in a model. It rules out any chances of
data redundancy and omission. This helps in clear analysis and processing.
• Data modeling improves data quality and enables the concerned stakeholders to make
data-driven decisions.

Since a lot of business processes depend on successful data modeling, it is necessary to
adopt the right data modeling techniques for the best results.

Best Data Modeling Practices to Drive Your Key Business Decisions

Have a clear understanding of your end-goals and results



You will agree with us that the main goal behind data modeling is to equip your business and
contribute to its functioning. As a data modeler, you can achieve this objective only when
you know the needs of your enterprise correctly.
It is essential to make yourself familiar with the varied needs of your business so that you can
prioritize and discard the data depending on the situation.

Key takeaway: Have a clear understanding of your organization’s requirements and organize
your data properly.

Keep it sweet and simple and scale as you grow

Things will be sweet initially, but they can become complex in no time. This is why it is
highly recommended to keep your data models small and simple, to begin with.

Once you are sure of your initial models in terms of accuracy, you can gradually introduce
more datasets. This helps you in two ways. First, you are able to spot any inconsistencies in
the initial stages. Second, you can eliminate them on the go.

Key takeaway: Keep your data models simple. The best data modeling practice here is to use
a tool which can start small and scale up as needed.
Organize your data based on facts, dimensions, filters, and order

You can find answers to most business questions by organizing your data in terms of four
elements – facts, dimensions, filters, and order.

Let’s understand this better with the help of an example. Let’s assume that you run four e-
commerce stores in four different locations of the world. It is the year-end, and you want to
analyze which e-commerce store made the most sales.

In such a scenario, you can organize your data over the last year. Facts will be the overall
sales data of last 1 year, the dimensions will be store location, the filter will be last 12
months, and the order will be the top stores in decreasing order.

This way, you can organize all your data properly and position yourself to answer an array
of business intelligence questions without breaking a sweat.

Key takeaway: It is highly recommended to organize your data properly using individual
tables for facts and dimensions to enable quick analysis.
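A minimal pandas sketch of the store example, with made-up sales figures, showing the fact (sales amount), dimension (store), filter (last 12 months) and order (descending totals):

```python
import pandas as pd

# Hypothetical fact table of sales for four store locations
sales = pd.DataFrame({
    "store":  ["Delhi", "London", "Tokyo", "NewYork", "Delhi", "Tokyo"],
    "month":  ["2023-01", "2023-03", "2023-06", "2023-09", "2023-11", "2023-12"],
    "amount": [1200, 3400, 2100, 4300, 900, 2600],
})

recent = sales[sales["month"] >= "2023-01"]          # filter: last 12 months
by_store = (recent.groupby("store")["amount"]        # dimension: store location
                  .sum()                             # fact: overall sales
                  .sort_values(ascending=False))     # order: top stores first
print(by_store)
```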

Keep as much as is needed

While you might be tempted to keep all the data with you, do not ever fall for this trap!
Although storage is not a problem in this digital age, you might end up taking a toll on your
machines’ performance.

More often than not, just a small yet useful amount of data is enough to answer all the business-
related questions. Spending hugely on hosting enormous amounts of data only leads to
performance issues, sooner or later.

Key takeaway: Have a clear opinion on how much data you want to keep. Maintaining
more than what is actually required wastes your data modeling effort and leads to performance
issues.

Keep crosschecking before continuing

Data modeling is a big project, especially when you are dealing with huge amounts of data.
Thus, you need to be cautious enough. Keep checking your data model before continuing to
the next step.

For example, if you need to choose a primary key to identify each record in the dataset
properly, make sure that you are picking the right attribute. Product ID could be one such
attribute. Thus, even if two counts match, their product IDs can help you in distinguishing
each record. Keep checking if you are on the right track. Are the product IDs the same too? In
such cases, you will need to look for another dataset to establish the relationship.
Key takeaway: It is the best practice to maintain one-to-one or one-to-many relationships.
The many-to-many relationship only introduces complexity in the system.

Let them evolve


Data models are never written in stone. As your business evolves, it is essential to customize your data modeling accordingly and keep updating the models over time. The best practice here is to store your data models in an easy-to-manage repository so that you can make adjustments on the go.

Key takeaway: Data models become outdated quicker than you expect. It is necessary that
you keep them updated from time to time.

The Wrap Up

Data modeling plays a crucial role in the growth of businesses, especially when you want your organization to base its decisions on facts and figures. To achieve the varied business intelligence insights and goals, it is recommended to model your data correctly and use appropriate tools to ensure the simplicity of the system.

2.6 Missing Imputations

In statistics, imputation is the process of replacing missing data with substituted values. Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with list-wise deletion of cases that have missing values.

I. Do nothing to missing data


II. Fill the missing values in the dataset using mean, median.

Eg: for sample dataset given below

SNo  Column1  Column2  Column3
1    3        6        NaN
2    5        10       12
3    6        11       15
4    NaN      12       14
5    6        NaN      NaN
6    10       13       16

The missing values can be replaced using the column means (computed over the non-missing values: 6, 10.4 and 14.25) as follows:

SNo  Column1  Column2  Column3
1    3        6        14.25
2    5        10       12
3    6        11       15
4    6        12       14
5    6        10.4     14.25
6    10       13       16

Advantages:
• Works well with numerical dataset.
• Very fast and reliable.

Disadvantage:
• Does not work with categorical attributes
• Does not correlate relation between columns
• Not very accurate.
• Does not account for any uncertainty in data
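A minimal sketch of the mean (or median) imputation shown above using pandas; this is an illustration added for clarity, not code from the notes, and the DataFrame simply reproduces the sample table.

import numpy as np
import pandas as pd

# The sample dataset from the example above (NaN marks missing values)
df = pd.DataFrame({
    "Column1": [3, 5, 6, np.nan, 6, 10],
    "Column2": [6, 10, 11, 12, np.nan, 13],
    "Column3": [np.nan, 12, 15, 14, np.nan, 16],
})

# Column-mean imputation: each NaN is replaced by the mean of its column
df_mean = df.fillna(df.mean(numeric_only=True))

# Median imputation is the same idea, but more robust to outliers
df_median = df.fillna(df.median(numeric_only=True))

print(df_mean)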

III. Imputations using (most frequent) or (zero / constant) values


This can be used for categorical attributes.
Disadvantage:
• Does not correlate relation between columns
• Creates bias in data.

IV. Imputation using KNN


It creates a basic mean impute then uses the resulting complete list to construct a KDTree.
Then, it uses the resulting KDTree to compute nearest neighbours (NN). After it finds the k-
NNs, it takes the weighted average of them.

The k-nearest neighbours (KNN) algorithm is used for simple classification. The algorithm
uses ‘feature similarity’ to predict the values of any new data points. This means that the new
point is assigned a value based on how closely it resembles the points in the training set. This can
be very useful in making predictions about the missing values by finding the k’s closest
neighbours to the observation with missing data and then imputing them based on the non-
missing values in the neighbourhood.
Advantage:
• This method is very accurate than mean, median and mode

Disadvantage:
• Sensitive to outliers
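A minimal sketch of KNN-based imputation using scikit-learn's KNNImputer on the same small sample dataset; the choice of 2 neighbours and distance weighting is an arbitrary illustration, not a recommendation from the notes.

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [3, 6, np.nan],
    [5, 10, 12],
    [6, 11, 15],
    [np.nan, 12, 14],
    [6, np.nan, np.nan],
    [10, 13, 16],
])

# Each missing value is filled with the (distance-)weighted average of the
# corresponding feature in the k nearest rows
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_filled = imputer.fit_transform(X)
print(X_filled)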

UNIT-3
BLUE Property Assumptions

 The Gauss Markov theorem tells us that if a certain set of assumptions is met, the ordinary least squares (OLS) estimator of the regression coefficients is the Best Linear Unbiased Estimator (BLUE).

 There are five Gauss Markov assumptions (also called conditions):

 Linearity:
o The parameters we are estimating using the OLS method must be themselves
linear.
 Random:
o Our data must have been randomly sampled from the population.
 Non-Collinearity:
o The regressors being calculated aren’t perfectly correlated with each other.
 Exogeneity:
o The regressors aren’t correlated with the error term.
 Homoscedasticity:
o No matter what the values of our regressors might be, the variance of the error term is constant.

Purpose of the Assumptions


 The Gauss Markov assumptions guarantee the validity of ordinary least squares for
estimating regression coefficients.

 Checking how well our data matches these assumptions is an important part of estimating
regression coefficients.
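As a small illustration of what OLS actually estimates (not part of the original notes), the sketch below fits a straight line to made-up data using the least-squares solution beta_hat = (XᵀX)⁻¹Xᵀy via NumPy.

import numpy as np

# Made-up data: y is roughly 2x plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares estimate of [intercept, slope]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)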
DATA ANALYTICS UNIT 4

Object Segmentation

Object segmentation in data analytics involves dividing data into meaningful groups or
segments. Each segment shares common characteristics, which helps in analyzing patterns or
behaviors within a dataset. Object segmentation is widely used in various domains such as
marketing, finance, and healthcare to group data points that are similar, making it easier to
interpret and predict trends.

Regression vs. Segmentation

● Regression is a statistical method used to determine the relationship between a dependent


variable (target) and one or more independent variables (features). In data analytics,
regression is primarily used for prediction.
○ Example of Regression: Predicting house prices based on features like square
footage, number of bedrooms, and location. Linear regression would find the line
of best fit, representing the relationship between these features and the target
variable (price).
● Segmentation, on the other hand, involves grouping data points into distinct categories
based on similarity, without necessarily predicting a numerical output. It’s typically used
in clustering and classification tasks.
○ Example of Segmentation: Customer segmentation in retail, where customers are
grouped based on purchasing habits, age, or spending levels. This allows targeted
marketing strategies for each group (e.g., frequent buyers, occasional buyers).

Diagram: Below is a sample diagram illustrating the difference between regression and
segmentation. In regression, there’s a continuous line representing predictions. In segmentation,
data points are grouped into distinct clusters.
Example: In retail, regression might be used to predict sales based on historical sales data, while
segmentation can help group customers into categories (e.g., high-spenders, occasional buyers)
based on purchasing behavior.

Supervised and Unsupervised Learning


Supervised Learning:

● In supervised learning, the model is trained on labeled data where both input and the
corresponding output are provided.
● The primary goal is to learn the mapping function that relates input to output.
Examples:
○ Predicting house prices based on features like area, number of rooms, and
location.
○ Email spam classification.
● Techniques:
○ Regression: Predict continuous values, e.g., predicting stock prices.
○ Classification: Predict discrete categories, e.g., classifying emails as spam or
non-spam.
○ Linear Regression: For continuous target variables.
○ Logistic Regression: For binary classification tasks.
● Key Features:
○ Requires labeled data for training.
○ The output can be continuous (regression) or categorical (classification).
● Applications:
○ Predictive Analytics: Forecasting sales, predicting customer churn.
○ Classification Problems: Identifying whether an email is spam or not.

Example: Predicting housing prices based on features like area, location, and the number of
bedrooms.

The model learns from this data to predict prices for new housing data.

Advantages:

● Clear and interpretable results.


● Accurate predictions with quality labeled data.

Challenges:

● Labeled data can be expensive or time-consuming to obtain.


● Poor generalization if the training data is biased or insufficient.

Unsupervised Learning:
● In unsupervised learning, the model is trained on data without labeled outputs.
● It identifies patterns and relationships in the data.
● The goal is to identify underlying patterns, structures, or clusters within the data.
Examples:
○ Customer segmentation for marketing.
○ Identifying fraudulent transactions in financial data.
● Techniques:
○ Clustering: Grouping similar data points, e.g., K-means, hierarchical clustering.
○ Dimensionality Reduction: Reducing the number of features, e.g., PCA
(Principal Component Analysis).
● Key Features:
○ Does not rely on labeled outputs.
○ Focuses on exploring the dataset's hidden structures.
● Applications:
○ Customer Segmentation: Grouping customers based on purchasing behavior.
○ Anomaly Detection: Identifying fraudulent credit card transactions.

Example: Clustering shopping data to group customers based on their buying patterns.

Advantages:

● Automatically identifies meaningful patterns.


● Useful for exploratory data analysis.

Challenges:

● Results can be harder to interpret compared to supervised learning.


● The performance depends on the quality of the dataset and feature representation.

Supervised and Unsupervised Learning in Segmentation


Supervised Learning: In supervised segmentation, the model is trained using labeled data,
meaning each data point has a known output label or category. This is useful when there is prior
knowledge about the categories or groups in the data.

Example: A financial institution categorizes loan applicants as "low risk" or "high risk" based on
their credit history and income. The algorithm is trained on labeled data to classify new
applicants accordingly.

Unsupervised Learning: Here, the model is trained on unlabeled data, and it automatically
identifies patterns within the data. Clustering algorithms like K-means and DBSCAN are
commonly used for unsupervised segmentation.

Example: In marketing, clustering algorithms are applied to identify different customer groups
based on purchasing habits without predefined categories, enabling targeted marketing strategies.
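A minimal scikit-learn sketch contrasting the two approaches; the tiny arrays below are made-up illustration values, not data from the notes.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Supervised: labelled data (house area in sq. ft -> price)
X = np.array([[1000], [1500], [2000], [2500]])
y = np.array([200000, 280000, 360000, 450000])
reg = LinearRegression().fit(X, y)
print(reg.predict([[1800]]))      # predicted price for a new house

# Unsupervised: unlabelled data (annual spend, number of visits)
customers = np.array([[200, 2], [220, 3], [5000, 40], [5200, 38]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)                 # discovered customer segments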

Comparison Table:

Aspect             | Supervised Learning                         | Unsupervised Learning
Training data      | Labeled (input with known output)           | Unlabeled (input only)
Goal               | Learn a mapping from input to output        | Discover hidden patterns, structures, or clusters
Typical techniques | Regression, classification                  | Clustering, dimensionality reduction
Example            | Classifying loan applicants as low/high risk | Grouping customers by purchasing habits

Tree Building

Tree-building algorithms are widely used in supervised learning for both regression and
classification tasks.

Decision Trees

A decision tree is a flowchart-like structure used to make decisions or predictions. It consists of


nodes representing decisions or tests on attributes, branches representing the outcome of these
decisions, and leaf nodes representing final outcomes or predictions. Each internal node
corresponds to a test on an attribute, each branch corresponds to the result of the test, and each
leaf node corresponds to a class label or a continuous value.

Structure of a Decision Tree

1. Root Node: Represents the entire dataset and the initial decision to be made.

2. Internal Nodes: Represent decisions or tests on attributes. Each internal node has one

or more branches.
3. Branches: Represent the outcome of a decision or test, leading to another node.

4. Leaf Nodes/ Terminal Nodes: Represent the final decision or prediction. No further

splits occur at these nodes.


Working of Decision Trees

The process of creating a decision tree involves:

1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or

information gain, the best attribute to split the data is selected.


2. Splitting the Dataset: The dataset is split into subsets based on the selected attribute.

3. Repeating the Process: The process is repeated recursively for each subset, creating

a new internal node or leaf node until a stopping criterion is met (e.g., all instances in
a node belong to the same class or a predefined depth is reached).

Metrics for Splitting

● Gini Impurity: Measures the likelihood of an incorrect classification of a new instance if


it was randomly classified according to the distribution of classes in the dataset.
● Entropy: Measures the amount of uncertainty or impurity in the dataset.

● Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is
split on an attribute.
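A short sketch (added for illustration, not from the notes) of how these metrics are computed for a node with class proportions p: Gini = 1 − Σ pᵢ² and Entropy = −Σ pᵢ log₂ pᵢ; information gain is the parent's impurity minus the weighted average impurity of its children.

import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # ignore empty classes to avoid log2(0)
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5 and 1.0 -> most impure node
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # 0.0 and 0.0 -> pure node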

Advantages of Decision Trees

● Simplicity and Interpretability: Decision trees are easy to understand and interpret.
The visual representation closely mirrors human decision-making processes.
● Versatility: Can be used for both classification and regression tasks.
● No Need for Feature Scaling: Decision trees do not require normalization or scaling
of the data.
● Handles Non-linear Relationships: Capable of capturing non-linear relationships
between features and target variables.

Disadvantages of Decision Trees

● Overfitting: Decision trees can easily overfit the training data, especially if they are
deep with many nodes.
● Instability: Small variations in the data can result in a completely different tree being
generated.
● Bias towards Features with More Levels: Features with more levels can dominate
the tree structure.

Example of a Decision Tree:


Regression Trees: Used for predicting continuous outcomes.

Classification Trees: Used for predicting discrete categories.

Key Concepts:

● Decision Trees:
○ A tree-like structure where each internal node represents a feature, each branch
represents a decision rule, and each leaf node represents an output.
○ Algorithms: CART (Classification and Regression Trees), ID3.
● Regression Trees:
○ Used for predicting continuous values.
○ Example: Predicting the sales of a product based on pricing and advertising.
● Classification Trees:
○ Used for predicting discrete categories.
○ Example: Determining whether a customer will buy a product based on age and
income.
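A minimal sketch of a classification tree with scikit-learn, following the "will the customer buy?" example above; the (age, income) values and labels are purely illustrative.

from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up training data: [age, income]; label 1 = buys, 0 = does not buy
X = [[25, 30000], [40, 80000], [35, 60000], [22, 20000], [50, 90000]]
y = [0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))  # readable split rules
print(tree.predict([[30, 50000]]))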

Challenges:

● Overfitting: A tree model that is too complex and performs well on training data but
poorly on unseen data.
● Trees may become too complex, leading to poor performance on unseen data.
○ Solution: Use techniques like pruning (removing unnecessary nodes), limiting
tree depth, or ensembling methods like Random Forest.
● Pruning:
○ Reduces tree size by removing sections of the tree that provide little predictive
power.
○ Types: Pre-pruning (limit growth during training) and Post-pruning (simplify after
tree creation).
○ Pre-pruning: Stops tree growth early by limiting depth or the number of nodes.
○ Post-pruning: Simplifies a fully grown tree by removing redundant nodes.

Advanced Tree-Based Methods

● Random Forest: Builds multiple decision trees and aggregates their outputs for better
accuracy and robustness.
● Gradient Boosting Machines (GBM): Combines weak learners (small trees) iteratively
to improve overall performance.

Applications:

● Customer churn prediction.


● Loan approval systems.
● Risk assessment in insurance.

Diagram:

Below is an example of a decision tree:


Tree Building in Segmentation

Decision Trees: Decision trees are widely used in segmentation, especially when the data has a
clear hierarchy. A decision tree divides the data based on the value of specific features
(variables), making a series of splits that result in a tree-like structure.

Regression Trees: Used when the outcome is continuous. For instance, predicting the price of a
house based on features like size and location.

Classification Trees: Used when the outcome is categorical. For example, classifying emails as
"spam" or "not spam."

Example of Decision Tree in Segmentation:


In a customer segmentation task, decision trees might divide customers based on income levels,
purchase frequency, and age to classify them into groups like “Frequent Buyers,” “Occasional
Buyers,” and “Rare Buyers.”

Overfitting, Pruning & Complexity

Overfitting: A decision tree can become too complex by adding too many branches that fit the
training data very well but don’t generalize to new data. Overfitting makes the model less
effective in real-world scenarios.

Complexity: Large trees may become very complex and less interpretable. Simplifying the trees
with pruning can help make them easier to understand.
Pruning: To address overfitting, pruning techniques are applied. Pruning removes sections of the
tree that provide little predictive power, thereby reducing complexity and making the model
more generalizable.

To overcome overfitting, pruning techniques are used. Pruning reduces the size of the tree by
removing nodes that provide little power in classifying instances. There are two main types of
pruning:
● Pre-pruning (Early Stopping): Stops the tree from growing once it meets certain
criteria (e.g., maximum depth, minimum number of samples per leaf).
● Post-pruning: Removes branches from a fully grown tree that do not provide
significant power.

Example: In a medical diagnosis decision tree, some branches might only apply to specific cases
and are not representative of general patterns. Pruning these branches improves the model’s
accuracy on unseen data.
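A minimal sketch of pre-pruning and post-pruning with scikit-learn, using its built-in breast-cancer dataset purely for illustration; the depth, leaf-size, and ccp_alpha values are arbitrary choices, not tuned settings.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth / leaf-size limits
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

# Post-pruning: grow fully, then prune with cost-complexity parameter alpha
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

print(pre.score(X_te, y_te), post.score(X_te, y_te))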
Applications of Decision Trees

● Business Decision Making: Used in strategic planning and resource allocation.


● Healthcare: Assists in diagnosing diseases and suggesting treatment plans.
● Finance: Helps in credit scoring and risk assessment.
● Marketing: Used to segment customers and predict customer behavior.

Multiple Decision Trees

Multiple Decision Trees involve leveraging multiple decision tree algorithms to improve
prediction accuracy and reduce the risk of overfitting. These techniques are commonly used in
ensemble learning methods, where a group of trees collaborates to make better predictions than a
single tree.

Challenges of a Single Decision Tree

● Overfitting: Single trees may fit the training data too closely, resulting in poor
generalization to unseen data.
● Bias and Variance: A single tree may have high variance or high bias, depending on its
configuration.
● Stability: Small changes in the dataset can lead to significantly different trees.

Techniques Involving Multiple Decision Trees


There are several approaches to utilize multiple decision trees for improved performance:

1) Random Forest

● Concept:
○ Builds multiple decision trees on different subsets of the dataset (created through
bootstrapping) and features (randomly selected for each split).
○ The final prediction is obtained by:
■ Regression: Taking the average of predictions from all trees.
■ Classification: Using majority voting among the trees.
● Key Features:
○ Reduces overfitting by averaging multiple trees.
○ Handles missing data well.
○ Can measure feature importance.
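A minimal sketch of a Random Forest with scikit-learn on its built-in iris dataset (illustration only); the hyperparameters shown are common defaults rather than values from the notes.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample with random feature subsets
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)

print(rf.score(X_te, y_te))         # accuracy from majority voting
print(rf.feature_importances_)      # per-feature importance scores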

2) Gradient Boosting Trees

● Concept:
○ Builds trees sequentially, where each tree attempts to correct the errors of the
previous trees.
○ A loss function (e.g., Mean Squared Error) guides how trees are built.
○ Models like XGBoost, LightGBM, and CatBoost are popular implementations.
● Key Features:
○ High predictive accuracy.
○ Can handle both regression and classification tasks.
○ Requires careful tuning of hyperparameters (e.g., learning rate, number of trees).
● Example: Predicting product sales:
○ First tree predicts 100, but the actual value is 120.
○ Second tree tries to predict the residual (20).
○ Final prediction is the sum of predictions from all trees (see the sketch below).
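A minimal sketch of gradient boosting with scikit-learn's GradientBoostingRegressor on made-up weekly sales figures (the notes also mention XGBoost, LightGBM, and CatBoost, which expose similar APIs); the data and hyperparameters are illustrative only.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Made-up data: week number -> sales with a linear trend plus noise
X = np.arange(1, 21).reshape(-1, 1)
y = 100 + 5 * X.ravel() + np.random.RandomState(0).normal(0, 3, 20)

# Each new shallow tree is fitted to the residuals of the current ensemble
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=2)
gbm.fit(X, y)

print(gbm.predict([[21]]))          # forecast for the next week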

3) Bagging (Bootstrap Aggregation)

● Concept:
○ Trains multiple decision trees on different random subsets of the training data
(using bootstrapping).
○ Combines predictions by averaging (for regression) or voting (for classification).
● Key Features:
○ Reduces variance and avoids overfitting.
○ Often used as a base for Random Forest.
● Example: Predicting stock prices:
○ Each tree is trained on a different subset of the dataset.
○ Predictions are averaged to produce the final output.

4) Extremely Randomized Trees (Extra Trees)

● Concept:
○ A variant of Random Forest where splits are made randomly, rather than choosing
the best split.
○ Uses the entire dataset (no bootstrapping).
● Key Features:
○ Faster than Random Forest.
○ Adds additional randomness to reduce overfitting.

5) AdaBoost (Adaptive Boosting)

● Concept:
○ Focuses on improving weak learners (e.g., shallow decision trees).
○ Adjusts the weights of incorrectly predicted samples, so subsequent trees focus
more on them.
● Key Features:
○ Works well with imbalanced data.
○ Sensitive to outliers.
● Example: Classifying fraudulent transactions:
○ First tree classifies 90% of the data correctly.
○ Second tree focuses on the 10% misclassified cases, and so on.

Advantages of Multiple Decision Trees

● Improved Accuracy: Ensemble methods typically outperform single decision trees in


both regression and classification tasks.
● Robustness: Less sensitive to noise and outliers.
● Flexibility: Can be applied to diverse datasets and tasks.

Applications

● Healthcare: Disease diagnosis using Random Forest or Gradient Boosting Trees.


● Finance: Predicting credit risk or stock prices using ensemble methods.
● Marketing: Customer segmentation and sales forecasting.
● Energy: Predicting energy consumption or renewable energy production.

Time Series Methods

Time series analysis deals with data collected over time in sequential order, i.e., data points collected or recorded at specific time intervals. Such data can reveal trends, cycles, and seasonal patterns, and the aim is to understand these patterns and forecast future values. Time series analysis is crucial in fields like finance, retail, and meteorology, where forecasting future values based on historical patterns is valuable.

Key Concepts:

1. Trend: Long-term movement in the data.


2. Seasonality: Regular patterns or cycles in the data (e.g., monthly sales).
3. Noise: Random variation or irregularities.

Techniques:

● ARIMA (Auto-Regressive Integrated Moving Average):


○ A powerful technique for time series forecasting that combines:
■ AR (Auto-Regression): Relating current values to past values.
■ I (Integrated): Differencing to make the series stationary.
■ MA (Moving Average): Smoothing random errors.
○ Example: Forecasting electricity consumption over time.
● STL Decomposition:
○ Decomposes a time series into Seasonal, Trend, and Residual components.
○ Useful for identifying cyclical behavior in data, such as quarterly sales patterns.

Applications of time-series methods:

● Weather prediction.
● Sales forecasting.
● Anomaly detection in IoT data.

Example Diagram for Time Series:

ARIMA (Auto-Regressive Integrated Moving Average)

The ARIMA model is one of the most widely used statistical models for time series forecasting. It combines three main components—Auto-Regression (AR), Integration (I), and Moving Average (MA)—to predict future values based on past observations.

AR (Auto-Regressive): Uses past values to predict future values. For example, the sales on a
particular day could depend on the sales from previous days.

I (Integrated): Differencing the data to make it stationary. Stationarity means that the data’s
statistical properties (mean, variance) are consistent over time.
MA (Moving Average): Incorporates the dependency between an observation and residual
errors from previous observations.

Components of ARIMA

1. Auto-Regression (AR):
● Refers to a model that uses the relationship between a variable and its past values.
● Example: Predicting the current sales of a product based on sales in the previous
months.
● Represented as p: The number of lagged observations to include in the model.

2. Integration (I):
● Involves differencing the data to make it stationary (removing trends or
seasonality).
● Represented as d: The number of differencing operations required.
● Example: If sales consistently increase by 10 units every month, differencing will
subtract one month’s sales from the next to stabilize the trend.

3. Moving Average (MA):


● Uses the dependency between an observation and a residual error from a moving
average model applied to lagged observations.
● Represented as q: The number of lagged forecast errors in the prediction model.

Example: Suppose a retailer wants to forecast monthly sales for the next year. Using ARIMA,
the model would learn from monthly sales data over the past few years, capturing trends and
seasonality, to predict future sales values.
Parameters of ARIMA

Each component in ARIMA functions as a parameter with a standard notation. For ARIMA
models, a standard notation would be ARIMA with p, d, and q, where integer values substitute
for the parameters to indicate the type of ARIMA model used. The parameters can be defined as:

● p: the number of lag observations in the model, also known as the lag order.
● d: the number of times the raw observations are differenced; also known as the degree of
differencing.
● q: the size of the moving average window, also known as the order of the moving
average.

For example, a linear regression model includes the number and type of terms. A value of zero
(0), which can be used as a parameter, would mean that a particular component should not be
used in the model. This way, the ARIMA model can be constructed to perform the function of an
ARMA model, or even simple AR, I, or MA models.

Steps in Building an ARIMA Model

1. Check Stationarity:
○ Stationarity means that the statistical properties (mean, variance) of the time
series do not change over time.
○ Check whether the data is stationary. (ensure stationarity by testing)
○ If not stationary, apply differencing until the series becomes stationary.
2. Identify Parameters (p, d, q):
○ Use Autocorrelation Function (ACF) and Partial Autocorrelation Function
(PACF) plots to identify the values for p and q.
○ The differencing order d is determined by the number of times differencing was
applied.
3. Fit the Model:
○ Use the chosen p, d, and q values to fit the ARIMA model.
4. Validate the Model:
○ Check residual errors for randomness (using residual plots and statistical tests).
○ If residuals are not random, refine the model parameters.
5. Forecast:
○ Use the fitted ARIMA model to predict future values.

ARIMA Equation: The general ARIMA(p, d, q) model combines the AR, I, and MA components. Writing y't for the series after differencing d times, the model is

y't = c + φ1·y't−1 + … + φp·y't−p + εt + θ1·εt−1 + … + θq·εt−q

where c is a constant, φ1…φp are the auto-regressive coefficients, θ1…θq are the moving-average coefficients, and εt is the white-noise error term.
Applications of ARIMA

● Forecasting stock prices or financial market trends.


● Predicting electricity demand or energy usage.
● Sales and demand forecasting in retail.

Example: Using ARIMA for Time Series Prediction

Let’s assume you want to predict the daily sales of a product for the next week. You have daily
sales data for the past year, which shows both trend and seasonal patterns.

1. Step 1: Data Collection You collect daily sales data from your e-commerce platform for
the past year.
2. Step 2: Make the Data Stationary Before applying ARIMA, you check whether the
data is stationary. If not, you apply differencing to remove trends and seasonality.
3. Step 3: Choose ARIMA Model Parameters (p, d, q) You choose the order of the AR
(p), differencing (d), and MA (q) parts using statistical techniques like the ACF
(Auto-Correlation Function) and PACF (Partial Auto-Correlation Function).
4. Step 4: Train the Model You fit the ARIMA model using historical data.
5. Step 5: Make Predictions Once the model is trained, you can use it to make predictions
for the next 7 days.
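A minimal sketch of these steps with statsmodels; the file name, column names, and the (1, 1, 1) order are hypothetical placeholders chosen for illustration, not values from the notes.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical daily sales series indexed by date
sales = pd.read_csv("daily_sales.csv", parse_dates=["date"], index_col="date")["sales"]

# Fit ARIMA(p=1, d=1, q=1); p and q would normally come from ACF/PACF plots
model = ARIMA(sales, order=(1, 1, 1))
fitted = model.fit()

print(fitted.summary())             # check coefficients and residual diagnostics
print(fitted.forecast(steps=7))     # predictions for the next 7 days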

Measures of Forecast Accuracy (or) Evaluation Metrics for Forecast Accuracy

Forecast accuracy is crucial in evaluating the performance of predictive models. It ensures that
forecasts generated by models align closely with actual values. The measures of forecast
accuracy help quantify the error between predicted and observed values, enabling analysts to
choose or improve forecasting methods.

Importance of Measuring Forecast Accuracy

● Assessment: Determines how well a forecasting model performs.


● Comparison: Helps compare multiple models to select the most accurate one.
● Optimization: Identifies patterns in errors to improve the model.
● Applications: Widely used in business forecasting, energy demand prediction, financial
markets, supply chain planning, etc.
Types of Forecast Errors

a) Positive Error

● Indicates underestimation (forecast is lower than the actual value).

b) Negative Error

● Indicates overestimation (forecast is higher than the actual value).

Measures of Forecast Accuracy

1) Mean Absolute Error (MAE): Measures the average of the absolute errors between
predicted and actual values.

● Definition: The average of the absolute differences between observed and predicted
values.

● Characteristics:
○ Simple to calculate and interpret.
○ Treats all errors equally, irrespective of their magnitude.

2) Mean Squared Error (MSE): Measures the average of the squared errors between predicted
and actual values, emphasizing larger errors.

● Definition: The average of the squared differences between observed and predicted
values.
● Characteristics:
○ Penalizes larger errors more heavily.
○ Sensitive to outliers.

3) Root Mean Squared Error (RMSE): The square root of MSE, giving an indication of the
model’s prediction error in the original units.

● Definition: The square root of the MSE, providing error in the same units as the data.

● Characteristics:
○ Combines the advantages of MSE but is interpretable in the original scale of the
data.

Use Case: In weather forecasting, RMSE is commonly used to measure the accuracy of
temperature predictions.

4) Mean Absolute Percentage Error (MAPE)

● Definition: Measures the average percentage error between observed and predicted
values.
● Characteristics:
○ Expresses errors as percentages, making it scale-independent.
○ May give misleading results if actual values are close to zero.

5) Symmetric Mean Absolute Percentage Error (SMAPE)

● Definition: A variation of MAPE that accounts for symmetry in percentage errors.

● Characteristics:
○ Addresses the issue of zero or near-zero actual values.
○ Useful for more balanced percentage error calculations.

6) Mean Forecast Error (MFE)

● Definition: Measures the average error between observed and predicted values (signed).

● Characteristics:
○ Indicates bias in forecasts (negative MFE shows overestimation, positive MFE
shows underestimation).

7) Tracking Signal
● Definition: Monitors the consistency of forecast errors over time.

● Use:
○ Helps detect bias or systematic error in forecasts.

Selecting the Right Measure

● MAE: For simple and intuitive error evaluation.


● MSE/RMSE: When penalizing larger errors is important.
● MAPE/SMAPE: When percentage-based accuracy is more meaningful.
● MFE: To detect directional bias in forecasts.

Practical Examples

Other Applications:

● Weather forecasting.
● Stock market predictions.
● Predictive maintenance in manufacturing.

Tools for Forecast Accuracy

● Excel: Built-in statistical functions for MAE, MSE, etc.


● Python: Libraries such as sklearn, statsmodels, and numpy for calculating
forecast accuracy.
● R: Functions like accuracy() in the forecast package.
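A minimal sketch (added for illustration, not from the notes) computing several of these measures in Python with NumPy and scikit-learn, on a pair of made-up actual/forecast arrays.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([100, 120, 130, 110, 150])
forecast = np.array([110, 115, 128, 105, 160])

mae = mean_absolute_error(actual, forecast)
mse = mean_squared_error(actual, forecast)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((actual - forecast) / actual)) * 100   # percentage error
mfe = np.mean(actual - forecast)                             # signed bias

print(mae, mse, rmse, mape, mfe)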

STL (Seasonal-Trend decomposition using Loess)

STL decomposes a time series into three components:

Trend: The long-term movement in the data.


Seasonal: The repeating cycle or pattern (e.g., monthly or yearly).
Residual: The random noise or fluctuations that cannot be attributed to trend or seasonality.

Example: A retail company can use STL to decompose monthly sales data. This allows them to
separate seasonal effects (e.g., holiday sales boosts) from the underlying trend in sales growth.

STL Approach (Seasonal and Trend Decomposition Using Loess)

STL is a robust method used to decompose a time series into three components:

1. Seasonal Component: Represents the periodic patterns (e.g., weekly, monthly, or yearly
cycles).
2. Trend Component: Represents the long-term movement in the data (e.g., increasing
sales over years).
3. Residual (Remainder) Component: Represents the irregular or random variations in the
data.
Key Features of STL

● Uses Loess (Locally Estimated Scatterplot Smoothing) for flexible, non-linear


smoothing.
● Handles both additive and multiplicative time series models.
● Can accommodate seasonal changes over time (non-stationary seasonality).
● Allows customization of smoothing parameters for trend and seasonal components.

Steps in STL Decomposition

1. Input Data:
○ The time series data is provided as input.
○ Example: Monthly sales data over the past 3 years.
2. Seasonal Extraction:
○ The seasonal component is extracted using smoothing techniques.
○ This component captures repeating patterns (e.g., higher sales in December).
3. Trend Extraction:
○ The trend component is obtained by removing the seasonal component and
applying smoothing to capture the long-term movement.
4. Residual Calculation:
○ After removing the seasonal and trend components, the remainder (residuals) is
calculated, representing noise or unexplained variation.

Mathematical Representation

For an additive model:

Yt = Tt + St + Rt (observed value = trend + seasonal + residual)

For a multiplicative model:

Yt = Tt × St × Rt

Visualization of STL Decomposition

An STL decomposition yields the following outputs:

1. Original Time Series: The observed data.


2. Trend Component: Smoothed long-term trend.
3. Seasonal Component: Periodic pattern (e.g., monthly spikes).
4. Residual Component: Irregular noise or fluctuations.
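A minimal sketch of STL in Python using statsmodels; the file and column names are hypothetical, and period=12 assumes monthly data with yearly seasonality.

import pandas as pd
from statsmodels.tsa.seasonal import STL

# Hypothetical monthly sales series indexed by month
series = pd.read_csv("monthly_sales.csv", parse_dates=["month"], index_col="month")["sales"]

result = STL(series, period=12).fit()   # Loess-based decomposition

trend = result.trend                    # long-term movement
seasonal = result.seasonal              # repeating yearly pattern
residual = result.resid                 # what is left over (noise)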
Applications of STL

● Sales Forecasting: Identifying seasonal patterns in retail sales.


● Climate Analysis: Understanding temperature trends over time.
● Website Traffic: Analyzing weekly or monthly fluctuations.
● The difference between ARIMA and STL is given below:

Aspect      | ARIMA                                          | STL
Purpose     | Forecasting future values of a time series     | Decomposing a series into components
Output      | Point forecasts of future values               | Trend, seasonal, and residual components
Approach    | Auto-regression, differencing, moving average  | Loess (local) smoothing
Typical use | Sales, demand, or price forecasting            | Understanding seasonality and long-term trends before modelling
Data Serialization

Serialization refers to saving time-ordered data in a format that can be easily transmitted or
stored for later analysis. Common serialization formats include JSON and CSV. Serialization
ensures data integrity and allows time series data to be analyzed across different systems and
applications.

Example: In financial applications, serialization is used to store historical stock prices in JSON
format for real-time analytics and forecasting.
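A minimal sketch of serializing a small time series to JSON and CSV with pandas; the dates and closing prices are made-up illustration values.

import pandas as pd

prices = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=3, freq="D"),
    "close": [101.2, 102.8, 100.9],        # hypothetical closing prices
})

prices.to_json("prices.json", orient="records", date_format="iso")  # JSON file
prices.to_csv("prices.csv", index=False)                            # CSV file

restored = pd.read_json("prices.json", orient="records")            # read it back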

Data Extraction and Analysis for Prediction

Data Extraction: Select key features or variables from the dataset. In time series analysis, this
might involve identifying important dates, events, or anomalies.

Data Analysis: Apply models like ARIMA or machine learning algorithms (e.g., RNNs or
LSTMs) to analyze time series data. These models can learn patterns over time, allowing for
more accurate predictions.

Example: A company might extract sales data during promotional events, analyze trends during
these periods, and predict future sales for upcoming promotions using ARIMA.

Extract Features from Generated Model as Height, Average Energy etc., and
Analyze for Prediction

Feature extraction is a crucial step in machine learning and data analysis, where meaningful
information is derived from raw data or model outputs to improve prediction accuracy. In the
context of analyzing a generated model, features like Height, Average Energy, and other
derived metrics can provide insightful information for predictive analytics.

Understanding Features

Feature extraction is a fundamental process in data analysis and machine learning. It involves
identifying and deriving significant attributes from raw data or a model's output that can be
utilized for prediction. Features such as Height, Average Energy, and other derived metrics
serve as inputs for predictive models, helping to uncover patterns and trends.

Importance of Feature Extraction


● Definition: Feature extraction is the process of reducing the dimensionality of data while
retaining the most critical information.
● Objective: Simplify data representation without losing meaningful insights, enabling
more efficient and accurate predictive models.
● Applications: Used in various domains such as time-series forecasting, signal processing,
energy analysis, and healthcare diagnostics.

Key Features for Extraction

1) Height

● Definition: The peak value or maximum value in the dataset or model output.
● Purpose: Height often signifies the intensity or magnitude of a phenomenon, such as the
highest sales in a month, peak temperature in a year, or the maximum value in a
waveform.
● Significance:
○ Highlights extreme conditions or events.
○ Useful in trend detection and anomaly identification.
● Examples:
○ Stock Market: Height can represent the highest stock price in a time frame.
○ Energy Usage: The highest energy consumption during a day.
○ Waveform Analysis: The peak amplitude in signal processing.
○ Weather Analysis: The highest temperature recorded in a season.

2) Average Energy

● Definition: The mean of the energy values across the data, representing the overall
intensity or activity over time.
● Purpose: Helps in understanding the typical level of activity or variation in the data.
● Significance:
○ Provides a general measure of the dataset's activity over time.
○ Useful for understanding trends and deviations.
● Calculation: Average Energy = (1/N) Σ xi², i.e., the mean of the squared data values taken over all N samples.
● Examples:
○ Audio Signal Processing: Average energy can reflect the loudness or intensity of
a sound.
○ IoT Sensors: Average power consumption of a device over a period.
○ Time Series Data: Average sales per week for retail forecasting.
○ Signal Processing: Average amplitude of a waveform.
○ IoT Applications: Average sensor readings over a day.

Other Features

● Variance: Measures the spread of the data, indicating its variability.


● Frequency Components: Extracted using Fourier Transform for time-series data.
● Slope: Rate of change in the data, useful for identifying growth or decline trends.
● Cycle Length: Identifies periodic patterns in time-series data.

Feature Extraction Process

Feature extraction involves the following steps:

1. Generate Model Outputs:


○ Use predictive or analytical models (e.g., ARIMA, neural networks) to generate
data outputs such as time series, waveforms, or categorical predictions.
2. Identify Key Features:
○ Height: Find the maximum value in the dataset or specific regions of interest.
○ Average Energy: Calculate the average intensity or variation across the dataset.
3. Preprocess Data:
○ Normalize the data to ensure consistency and avoid scaling issues.
○ Remove noise or outliers that may distort the feature values.
4. Transform Data:
○ Use mathematical or statistical transformations (e.g., FFT for frequency analysis)
to derive features from complex data.
5. Store Features:
○ Combine extracted features into a structured dataset for training predictive
models.
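A minimal sketch (added for illustration, not from the notes) of extracting Height and Average Energy from a daily energy-usage series with NumPy; the readings are made-up values.

import numpy as np

usage = np.array([12.0, 14.5, 30.2, 13.8, 15.1, 29.7, 14.2])   # hypothetical kWh per day

height = usage.max()                        # peak value in the series
avg_energy = np.mean(usage ** 2)            # mean of squared values
variance = usage.var()                      # spread of the readings
slope = np.polyfit(np.arange(len(usage)), usage, 1)[0]   # rough trend per day

features = {"height": height, "avg_energy": avg_energy,
            "variance": variance, "slope": slope}
print(features)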

Feature Analysis for Prediction

After extracting features, the next step is to analyze them for their predictive capabilities.
a) Statistical Analysis

● Correlation: Measure how strongly a feature is associated with the target variable.
● Example: Correlate energy peaks (Height) with outdoor temperature.

b) Feature Selection

● Use algorithms like Recursive Feature Elimination (RFE) to select the most relevant
features.
● Example: Choose Height and Average Energy as key predictors for future energy
consumption.

c) Model Training

● Train machine learning models using extracted features.


● Models:
○ Regression: Predict continuous variables (e.g., sales, energy usage).
○ Classification: Categorize outcomes (e.g., high vs. low consumption).

d) Visualization for Insights

● Use graphs to identify relationships:


○ Scatter Plot: Height vs. prediction variable.
○ Line Graph: Average Energy trends over time.

e) Correlation Analysis

● Check the relationship between extracted features and the target variable.
● Example:
○ Height of sales peaks correlating with promotional campaigns.
○ Average energy consumption correlating with seasonal changes.

f) Feature Engineering

● Derive new features from the existing ones.


● Example:
○ Normalizing Height to calculate the percentage change over time.
○ Aggregating Average Energy for monthly trends.

g) Model Building

● Use features as inputs to machine learning models.


● Techniques:
○ Regression: Predicting a continuous variable (e.g., future energy consumption).
○ Classification: Categorizing patterns (e.g., low, medium, high energy usage).

h) Visualization

● Plot extracted features to understand trends and anomalies.


● Example:
○ Plot Height over time to observe recurring patterns.
○ Visualize Average Energy across different periods to detect changes.

Practical Example: Predicting Energy Usage

Scenario:

A smart grid collects data on daily energy usage. The goal is to predict future consumption
patterns.

Feature Extraction:

1. Height: Maximum energy usage in a day.


○ Insight: Identify peak usage during heat waves or holidays.
2. Average Energy: Mean daily consumption.
○ Insight: Provides baseline energy usage trends.

Visualization:

● Plot energy usage trends to highlight peaks (Height) and averages over time.

Predictive Model:

● Train a regression model with Height and Average Energy as features.


● Use the model to predict future high-usage periods and optimize resource allocation.

Advanced Techniques for Feature Analysis

a) Fourier Transform

● Extract frequency-domain features from time-series data.


● Example: Analyze periodic cycles in energy consumption.

b) Principal Component Analysis (PCA)

● Reduce the dimensionality of features while preserving variance.


● Example: Combine Height and Average Energy into a single principal component.

c) Machine Learning for Feature Selection

● Use decision trees, Random Forest, or LASSO regression to rank feature importance.

Applications of Feature-Based Predictions

a) Healthcare

● Height: Maximum heart rate during exercise.


● Average Energy: Mean activity level over a week.

b) Finance

● Height: Highest stock price in a month.


● Average Energy: Average daily transaction volume.

c) IoT and Smart Devices

● Height: Peak temperature detected by sensors.


● Average Energy: Mean power usage of appliances.

d) Energy Consumption Analysis

● Extract features like peak energy (Height) and average daily usage (Average Energy) to
predict future consumption and identify energy-saving opportunities.

e) Medical Diagnosis

● Use Height to identify peaks in heart rate or blood pressure signals.


● Calculate Average Energy in EEG or ECG signals to diagnose abnormalities.

f) Financial Forecasting

● Height can represent the highest stock price in a specific interval.


● Average Energy can be used to analyze the volatility of stock market trends.

g) Predictive Maintenance

● Height: Peak vibration in machinery may indicate mechanical issues.


● Average Energy: Increased average energy of a motor may signal inefficiency or wear.
Example: Practical Application

Dataset

● Consider a dataset of daily power consumption of a household for one year.

Step-by-Step Analysis

1. Feature Extraction:
○ Height: Identify the day with the highest power consumption (e.g., during
summer months with heavy air conditioning usage).
○ Average Energy: Calculate the daily average power consumption over the year.
2. Visualization:
○ Plot a time series of daily consumption, marking the highest points (Height).
○ Create a bar chart of average monthly energy usage.
3. Predictive Analysis:
○ Use features to predict periods of high energy usage.
○ Example Model: Linear regression to predict future daily consumption.
4. Insights:
○ Height might indicate days of peak activity (e.g., holidays or extreme weather).
○ Average Energy provides a baseline for typical consumption, helping identify
anomalies.

5.1 Standard Operating Procedures

SOPs are clear, step-by-step instructions that guide how to collect, clean, analyze, document,
and share data. They help ensure that everyone on the team works in a consistent, accurate,
and repeatable way.

 Collect – Get the data from reliable sources


 Clean – Fix errors, fill blanks, format consistently
 Explore – Use charts/stats to understand data
 Analyze – Apply the right tools/models
 Document – Write down steps and code clearly
 Share – Present insights with charts & summaries
 Store & Teach – Save work, share tips, help the team learn

Standard Operating Procedures (SOPs) for documentation and knowledge sharing ensure
consistency, reproducibility, and team collaboration.

✅ SOPs for Documentation in Data Analytics

1. Project Summary Template


o Include: problem statement, objectives, data sources, key stakeholders.

o Used as the first page of every analytics report.

2. Code Documentation

o Comment your code: explain complex logic, functions, and assumptions.

o Maintain a README file with:

 Project description

 Setup instructions

 Data schema

 How to run analysis or scripts

3. Version Control with Git

o Use Git for tracking code and notebook changes.

o Include meaningful commit messages (e.g., "added EDA for customer churn").

4. Jupyter Notebooks Best Practices

o Keep outputs clean

o Use headings and markdown cells to explain analysis steps

o Save final notebooks with outputs removed for clarity

5. Final Deliverables

o Store dashboards, presentations, and notebooks in a shared repository.

o Create a project handover document summarizing findings, limitations, and next steps.

🔄 SOPs for Knowledge Sharing

1. Use of Collaboration Platforms

o Tools like Slack, Confluence, Notion, or Google Docs are recommended.

o Maintain a shared folder structure (e.g., /Projects/2025_Q1/CustomerAnalysis).

2. Meeting Routines
o Weekly stand-ups or data team syncs for updates and challenges.

o Monthly retrospectives to capture lessons learned.

3. Documentation of Learnings

o After each project, document:

 What went well

 What could be improved

 Tips for future similar projects

4. Internal Wiki

o Build a searchable internal knowledge base with:

 Data definitions

 Common queries

 How-to guides for tools like SQL, Power BI, or ChatGPT

5. Prompt Engineering Guides

o Document prompts that work well with ChatGPT, like:

“Explain this SQL query step-by-step.”


“Summarize trends from this dashboard.”

6. Cross-Training

o Encourage team members to present short sessions on tools or recent projects.

5.2 Purpose and Scope Document


Clearly defining the purpose and scope of a project or task is crucial for success. The purpose
describes the overall goal and reason for the analysis, while the scope outlines the specific
boundaries, deliverables, and resources involved. This ensures alignment with stakeholders,
avoids scope creep, and leads to more efficient and focused data analysis efforts.

Defining the Purpose:

 Problem Statement:
Start with a clear understanding of the problem you are trying to solve with data analysis. What
are the business drivers, stakeholders, and the potential impact of the solution?

 Value Proposition:

Identify the specific benefits the data analysis will provide. How will it improve decision-making,
optimize processes, or generate new insights?

 SMART Goals:

Define Specific, Measurable, Achievable, Relevant, and Time-bound goals for the analysis.

Defining the Scope:

 Data Requirements:

Identify the specific data sources, types of data, and data quality standards needed for the
analysis. This helps determine the feasibility and complexity of the project.

 Deliverables:

Clearly outline the specific products or outputs of the analysis, such as reports, dashboards, or
models. This helps manage stakeholder expectations.

 Resources:

Determine the resources required for the project, including tools, software, and personnel.

 Boundaries:

Define the limits of the project, including what is included and excluded from the scope. This
helps prevent scope creep and ensures that the project stays focused.

 Timeline:

Establish a realistic timeline for the project, including milestones and deadlines.

Benefits of Clear Purpose and Scope:

 Alignment with Stakeholders:

Ensures that everyone involved understands the project's goals and objectives.

 Efficiency and Focus:

Helps teams concentrate on the most relevant tasks and avoid unnecessary work.

 Reduced Scope Creep:


Clearly defined boundaries prevent scope creep and ensure that the project stays on track.

 Improved Communication:

Facilitates clear and consistent communication among team members and stakeholders.

 Measurable Success:

Provides a framework for evaluating the project's success and impact.


Purpose and Scope Document

✅ What It Is:

A purpose and scope document explains:

 Why the project is being done (the purpose)

 What will and won’t be included (the scope)

It keeps everyone on the same page — from data analysts to business stakeholders.

✍️ What to Include:

Section         | What it means                                     | Example
🎯 Purpose      | What problem are we solving? Why does it matter?  | “To analyze customer churn to help improve retention strategies.”
🧭 Scope        | What’s included and excluded in this project?     | “Include customers from Jan–Dec 2024; exclude B2B clients.”
📊 Data Sources | Where the data is coming from                     | “Customer database, CRM export”
👥 Stakeholders | Who’s involved or affected                        | “Marketing team, customer support manager”
⏳ Timeline     | Key dates or phases                               | “Initial findings by May 15, final report by May 30”
🛠 Tools        | What tools/software will be used                  | “Python, Power BI, SQL Server”

🚀 Why It’s Important:

 Prevents scope creep (project getting too big or off track)

 Saves time by setting clear expectations

 Helps align with business goals

 Makes it easier to measure success

5.3 Intellectual Property


Intellectual Property Rights (IPR) are legal rights granted to creators or owners over their
intellectual creations. Intellectual Property (IP) refers to original works of the human mind,
such as inventions, literary and artistic works, designs, symbols, names, and images used in
business. These creations are vulnerable to plagiarism or unauthorized use, and IPR safeguards
them by preventing unapproved reproduction, distribution, or display.

Meaning of Intellectual Property Rights (IPR)

Intellectual Property Rights (IPR) refer to the legal protections granted to individuals or
businesses over their intangible assets, preventing unauthorized use or exploitation. These
rights ensure that creators maintain control over their work, including:

1. The right to reproduce

2. The right to sell

3. The right to create derivative works

IPR provides a temporary monopoly over the use of the protected property, and violations can
lead to strict legal penalties.
🔐 Why Is IPR Important?

1. Boosts Business Growth – Protects unique ideas from competitors, helping especially
small businesses maintain market share and grow.

2. Supports Marketing – Builds brand identity and prevents copying, making it easier to
connect with customers.

3. Protects Innovation – Secures exclusive rights to original ideas, preventing misuse by


others.

4. Attracts Funding – IPR assets can be sold, licensed, or used as collateral to raise money.

5. Expands Global Reach – Enables businesses to enter new markets and form international
partnerships via protected brands or patents.

Types of Intellectual Property Rights (IPR)


1. Copyright
Protects original works like literature, music, films, and art from unauthorized use. It
arises automatically upon creation but registration strengthens enforcement rights.

2. Trademark
Identifies and distinguishes goods or services using names, symbols, or logos (e.g.,
Apple, Audi). Registration is not mandatory but necessary to claim exclusive ownership.

3. Geographical Indication (GI)


Indicates the origin of products tied to specific regions, like Darjeeling tea or Kashmiri
Pashmina. It reflects quality, reputation, or characteristics linked to that location.

4. Patent
Grants exclusive rights to inventors over their inventions (not discoveries), such as new
devices or processes. A patent prevents others from making, using, or selling the
invention without permission.

5. Design
Protects the aesthetic or visual aspects of products (e.g., car shapes, kitchen tools). It
ensures exclusive rights over commercial production and sale based on the protected
design.
6. Plant Variety Protection
Provides rights to breeders for developing new plant varieties. It ensures protection for
genetically developed or selectively bred plant species under laws like the Plant Variety
Protection Act.

7. Semiconductor Integrated Circuits Layout Design


Secures rights for original layouts of semiconductor chips used in electronics. This
prevents unauthorized copying or commercial exploitation of circuit designs.

5.4 Copyright
Copyright refers to the legal protection of original works created during the data analytics
process. While raw data itself is not copyrightable, many outputs and tools used or produced
during data analysis can be protected.

What Copyright Protects in Data Analytics:

1. Code and Scripts

o Custom Python, R, SQL, or other language scripts written for data cleaning,
analysis, or visualization are protected as literary works.

2. Data Visualizations

o Unique charts, dashboards, graphs, and infographics (especially when creatively


designed) are eligible for copyright.

3. Reports and Documentation

o Written analyses, interpretations, and presentations based on data analysis are


protected.

4. Software Tools

o Proprietary tools or platforms developed for analytics may be copyrightable,


depending on their originality.

What Copyright Does NOT Protect:

 Raw data or facts (e.g., temperatures, population numbers) — as facts cannot be


owned.
 Ideas, methods, or algorithms — unless separately protected by patents or trade
secrets.

Why Copyright Matters in Data Analytics:

 Prevents others from copying or redistributing your original code or visualizations


without permission.

 Helps companies protect their investment in custom tools or analytic products.

 Encourages innovation by ensuring creators benefit from their work.

Best Practices:

 Always use data and code with proper licenses (e.g., open-source tools under MIT/GPL).

 Attribute sources when using third-party data or visualizations.

 Document ownership in collaborative projects.
