
Data Analytics for IoT Solutions

Steps to analyze an IoT solution

Module VI
Data generation, data gathering, data pre-processing, data
analysis, application of analytics, exploratory data
analysis
By
Dr Shola Usharani
Solve the Scenario
• For hunger reduction through smart farming, one IoT analytics application is
the monitoring and optimization of crop irrigation. IoT sensors can be
installed in the soil to measure moisture levels and transmit the data to a
central hub. The hub uses analytics to compute the crop's water
requirements based on crop type, soil type, and weather conditions. The
system can then trigger smart irrigation systems to apply the right amount
of water at the right time, reducing water waste and increasing crop yields.
The data collected and analyzed can also help farmers identify trends and
improve crop management decisions.
• Discuss data generation and data gathering for the given example.
• Identify the data preprocessing techniques required for it. How can IoT
analytics be defined for the above scenario?
Categories of IoT data
• Status data
– whether an appliance is off or on, whether there are available spots
at a property, etc
• Location data
– location data for fleet management, asset tracking, employee
monitoring
• Automation data
– control devices inside a house, vehicles on the road, and other moving
parts of any system
• Actionable data
– an extension of status data
– the system processes it and transforms it into easy-to-carry-out
instructions. Actionable data is often used in forecasting and
prediction, energy consumption and workplace efficiency
optimization, as well as during long-term decision-making
Benefits of IoT Data Collection
• Connected devices and the Internet of Things make big data collection possible
• Healthcare :
– healthcare professionals improve the precision of diagnosis, as well as the speed of post-op recovery.
– ensure safety inside the facility, tracking both patients and staff.
– improve medication adherence, monitor the effects of treatment, and prevent theft during shipping, as well
as at the warehouse.
• Manufacturing:
– create a favorable environment for the peak performance of factory equipment, monitor worker
productivity and integrate big data performance monitoring solutions for predictive analytics and
maintenance
• Agriculture:
– monitor farming sites and forecast the likelihood of natural disasters and their impact on crops;
sensor data helps design efficient plant treatment, monitor water usage, and reduce the
workforce needed to manage the site
• Energy
– By tracking the amount of electricity spent by the property, facility managers become more aware of
potential ways to reduce energy consumption
• Smart homes :
– Security systems, smart plugs, and other appliances all use IoT for data collection to ensure energy
efficiency, as well as in-house safety
• Transportation :
– Traffic congestion managers, virtual parking assistants, fleet management tools, and fuel consumption
monitoring devices are all sensor data collection examples
Data Analytics steps for IoT
solution
Data Creation or generation
• IoT data is the information collected by connected
devices — sensors, wearables, and others.
• The IoT data is nothing but data produced by those
things, the new services you can enable via those
connected things, and the business insights that the
data can reveal.
• In the world of IoT, the creation of massive amounts of
data from sensors is common.
– Modern jet engines are fitted with thousands of sensors
that generate a whopping 10 GB of data per second
• Need
– such volumes demand fast data analytics with minimal delay.
Data collection for Painting a house
Operations on Data Set
• Creating a table
• Inserting values into the table
• Modify the values: drop, change, group by
– using relational databases.
Data Set
• Structured data managed by an RDBMS
• Matrix form given by
– Oracle DB, PHP, CSV, Excel
– Column: a domain of interest (a characteristic or feature)
– Row: the record of an individual experience
• Each entry captures what is gained from an individual data point, by person or resource.
• What are you going to do with the data set?
– Data preprocessing.
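A minimal sketch of these table operations in pandas (the sensor table and column names are hypothetical, for illustration only):

```python
import pandas as pd

# Create a table (DataFrame) of illustrative soil-moisture readings
df = pd.DataFrame({
    "sensor_id": ["s1", "s2", "s1", "s2"],
    "moisture":  [31.5, 28.0, 33.2, 27.1],
    "crop":      ["wheat", "corn", "wheat", "corn"],
})

# Insert a new row into the table
df.loc[len(df)] = ["s3", 30.0, "wheat"]

# Modify the values: drop a column, change a value, group by a feature
df = df.drop(columns=["sensor_id"])             # drop
df.loc[0, "moisture"] = 32.0                    # change
means = df.groupby("crop")["moisture"].mean()   # group by
```

The same operations map onto SQL `CREATE TABLE`, `INSERT`, `ALTER`/`UPDATE`, and `GROUP BY` when a relational database is used instead.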
Sample Dataset
Data pre-processing
• Problems with data
– dealing with noisy, inaccurate, uncertain and real-time data.
– IoT data may contain missing and incomplete values that lead to poor
data quality.
– a lot of ambiguous information, and these data are large in size
– due to the constrained nature of IoT sensors and intermittent loss of
connectivity, IoT data at massive scale contains many irregularities
and uncertainties
• Preprocessing ensures completeness in IoT data, which improves
data quality.
• It removes the irregularities and uncertainties in the data.
• Before mining or modeling, the data must be passed through
improvement techniques known as data preprocessing.
Different data pre-processing
techniques
Data cleaning
• the process of eliminating the erroneous and
missing parts of the data.
• Handling these noisy and missing values can be
achieved in various ways
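Two common ways to handle missing and noisy values, sketched in Python with pandas (the readings and the plausible-range thresholds are illustrative assumptions):

```python
import pandas as pd
import numpy as np

# Illustrative sensor readings with a missing value and an outlier
readings = pd.Series([21.0, 21.4, np.nan, 21.1, 85.0, 21.3])

# Handle missing values: fill with the median (one common strategy)
cleaned = readings.fillna(readings.median())

# Handle noisy values: clip readings outside a plausible range
cleaned = cleaned.clip(lower=0, upper=50)

# Alternatively, drop rows with missing values instead of filling them
dropped = readings.dropna()
```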
Data Integration
• It combines data from different sources,
giving users an integrated view of this data.
Data integration is done mainly through two
approaches
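A merge on a shared key is the simplest form of such integration; a sketch in pandas (the field names and the two sources are hypothetical):

```python
import pandas as pd

# Two hypothetical sources: soil sensors and a weather feed
soil = pd.DataFrame({"field": ["A", "B"], "moisture": [31.5, 27.1]})
weather = pd.DataFrame({"field": ["A", "B"], "rain_mm": [2.0, 0.0]})

# Integrate on the shared key to give users one combined view
combined = soil.merge(weather, on="field")
```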
Data reduction
• used to obtain a data set that is much smaller in size
but yields the same analytical results. Data reduction
approaches are utilized to diminish unnecessary data
as well as improve the analytical process
Data Transformation
• the process of converting data from one format
to another. Data transformation
includes various functions to achieve the
desired format
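One simple transformation is min-max normalization, which converts raw values to a common [0, 1] scale; a minimal sketch in plain Python (the values are illustrative):

```python
# Min-max normalization: rescale raw values into the [0, 1] range
# that many analytics models expect.
values = [10.0, 20.0, 30.0, 40.0]

lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
```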
Data Modelling
• Dataset → preprocessing → ML modeling (categorization of
data) → analysis of data (understanding) → testing →
deployment.
Data analyzation
• Analyzing massive amounts of data in the most
efficient manner possible falls under the umbrella of data
analytics.
• IoT data analytics refers to the analysis of every
fragment of data generated by IoT devices at the right
time in order to extract intelligent insights
– Not all data is the same; it can be categorized and
analyzed in different ways.
• Depending on how data is categorized, various data
analytics tools and processing methods can be applied.
– Two important categorizations from an IoT perspective are
whether the data is structured or unstructured and
whether it is in motion or at rest.
Different categories of IoT data
Structured data
• Data follows a model or schema that defines how the data is represented or
organized, meaning it fits well with a traditional relational database
management system (RDBMS).
• Structured data takes a simple tabular form, for example a spreadsheet where
data occupies a specific cell and can be explicitly defined and referenced.
• It can be found in most computing systems and includes everything from
banking transactions and invoices to computer log files and router
configurations.
• Structured data is easily formatted, stored, queried, and processed.
• IoT sensor data often uses structured values, such as temperature, pressure,
humidity, and so on, which are all sent in a known format.
• A wide array of data analytics tools is readily available for processing this
type of data, from scripts to commercial software like Microsoft Excel and
Tableau.
Unstructured data
• Unstructured data lacks a logical schema for understanding and decoding the
data through traditional programming means.
• Examples of this data type include text, speech, images, and video.
• It is any data that does not fit neatly into a predefined data model.
• Data analytics methods that can be applied to unstructured data include
cognitive computing and machine learning.
• With machine learning applications, such as natural language processing
(NLP), you can decode speech.
• With image/facial recognition applications, you can extract critical
information from still images and video.
Different categories of IoT data
Data in motion
• Data in IoT networks that is in transit.
– Data in motion is often streamed continuously from IoT devices to
data processing and analytics platforms. This streaming data can
include sensor readings, telemetry data, event logs, or any other
information generated by IoT devices.
• The data from smart objects is considered data in motion as it passes
through the network en route to its final destination: edge or fog
computing.
• Also includes traditional client/server exchanges, such as web
browsing, file transfers, and email.
• Tools supported: Spark, Storm, and Flink
Data at rest
• Data in IoT networks that is being held or stored.
• Data saved to a hard drive, storage array, or USB drive.
• Typically found in IoT brokers or in some sort of storage array at the
data center.
What kind of data is it?
• In a smart city application, data includes real-
time traffic flow data from vehicle sensors or
environmental sensor readings from air
quality monitoring stations.
• Is it data at rest or data in motion?
Application of analytics
• Descriptive analysis
• Diagnostic analysis
• Predictive analysis
• Prescriptive analysis
Descriptive analysis

• delineates what has occurred and what is going on; it tells what is
happening, either now or in the past.
– For example, a thermometer in a truck engine reports
temperature values every second.
• It is used to pull the data at any moment to gain insight
into the current operating condition of the truck
engine.
– If the temperature value is too high, then there may be a
cooling problem or the engine may be experiencing too
much load.
• utilizing data aggregation and data mining techniques
Diagnostic Analysis
• Investigates problems by asking the question: “Why?”.
• In the example of the temperature sensor in the truck
engine, you might wonder why the truck engine failed.
– Diagnostic analysis might show that the temperature of
“the engine was too high, and the engine overheated”.
• Applying diagnostic analysis across the data generated
by a wide range of smart objects can provide a clear
picture of why a problem or an event occurred
Predictive Analysis
• It aims to foretell problems or issues before they occur.
– For example, with historical values of temperatures for the
truck engine, predictive analysis could provide an estimate
on the remaining life of certain components in the
engine.
• These components could then be proactively replaced
before failure occurs.
• Or perhaps if temperature values of the truck engine
start to rise slowly over time, this could indicate the
need for an oil change or some other sort of engine
cooling maintenance
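A minimal sketch of this idea, fitting a linear trend to hypothetical engine temperatures with NumPy (the readings, the 100 °C threshold, and the forecast horizon are illustrative assumptions, not the slide's data):

```python
import numpy as np

# Hypothetical daily average engine temperatures (°C) drifting upward
days = np.arange(10)
temps = 90.0 + 0.5 * days + np.array([0.1, -0.2, 0.0, 0.3, -0.1,
                                      0.2, -0.3, 0.1, 0.0, -0.1])

# Fit a linear trend: a very simple predictive model
slope, intercept = np.polyfit(days, temps, 1)

# Predict the temperature at day 30; crossing a threshold could
# trigger proactive maintenance before a failure occurs
forecast = slope * 30 + intercept
needs_maintenance = forecast > 100.0
```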
Prescriptive Analysis
• It builds on predictive analysis and recommends solutions for
upcoming problems.
• A prescriptive analysis of the temperature data from a
truck engine might calculate various alternatives to
cost-effectively maintain our truck
• These calculations could range from the cost necessary
for more frequent oil changes and cooling maintenance
to installing new cooling equipment on the engine or
upgrading to a lease on a model with a more powerful
engine.
• Prescriptive analysis looks at a variety of factors and
makes the appropriate recommendation
Complexity vs Value of Analysis
Descriptive analytics on IoT data
(what)
• Focuses on what’s happening, by monitoring the
status of IoT devices, machines, products and assets.
• Determines if things are going as planned, and notifies
if anomalies occur.
• Descriptive analytics is generally implemented as
dashboards that show current and historical
sensor data, key performance indicators (KPIs),
statistics and alerts.
• Addresses questions such as:
– Are there any anomalies that demand attention?
– What’s the utilization and throughput of this machine?
– How are consumers using our products?
– Where do my assets reside?
– How many components are we creating with this tool?
– How much energy is this machine using?
Descriptive Analysis
• Descriptive methodologies focus on analyzing
historic data for the purpose of identifying
patterns or trends.
• Analytic techniques that fall into this category are
most often associated with exploratory data
analysis which identifies central tendencies,
variations, and distributional shapes.
• Descriptive methodologies can also search for
underlying structures within data when no a priori
knowledge about patterns and relationships is
assumed.
• This can include correlation analysis, exploratory
factor analysis, principal component analysis,
trend analyses, and cluster analysis.
Descriptive Analysis in R
Programming
• In descriptive analysis, we describe our
data with the help of various representative
methods such as charts, graphs, tables,
Excel files, etc.
• Present it in a meaningful way so that it can
be easily understood.
• Some measures that are used to describe a
data set are
– Measures of central tendency and
– Measures of variability or dispersion.
Measure of central tendency
• It represents the whole set of data by a
single value.
• It gives us the location of central points.
• There are three main measures of central
tendency:
– Mean
– Mode
– Median
Measure of variability
• Measure of variability is known as the
spread of data or how well data is
distributed.
• The most common variability measures are:
– Range
– Variance
– Standard deviation
Mean: The sum of observations divided by the total number of
observations (the average: sum divided by count).
Median: The middle value of the data set; it splits the data into two
halves. If the number of elements in the data set is odd, the center
element is the median; if it is even, the median is the average of the
two central elements.
Mode: The value that has the highest frequency in the given data set.
The data set may have no mode if the frequency of all data points is
the same, and it can have more than one mode if two or more data
points share the same highest frequency.
Range: The difference between the largest and smallest data points in
the data set. The bigger the range, the greater the spread of the data,
and vice versa.
Range = largest data value – smallest data value
Variance: The average squared deviation from the mean. It is
calculated by finding the difference between every data point and the
mean, squaring those differences, adding them all, and then dividing
by the number of data points in the data set.
Standard Deviation: The square root of the variance. It is calculated
by finding the mean, subtracting each value from the mean, squaring
the result, adding all the squared values, dividing by the number of
terms, and taking the square root.
Quartiles: A quartile is a type of quantile. The first quartile (Q1) is
the middle number between the smallest number and the median of
the data set; the second quartile (Q2) is the median of the data set;
the third quartile (Q3) is the middle number between the median and
the largest value of the data set.
Interquartile Range: The interquartile range (IQR), also called the
midspread, middle 50%, or technically the H-spread, is the difference
between the third quartile (Q3) and the first quartile (Q1). It covers
the center of the distribution and contains 50% of the observations.
IQR = Q3 – Q1
summary(): The R function summary() can be used to display several
statistical summaries of either one variable or an entire data frame.
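These measures can be computed directly with Python's standard statistics module (the slides mention R's summary(); this Python sketch on an illustrative data set shows the same quantities):

```python
import statistics as st

data = [20, 30, 32, 35, 41, 41, 43, 47, 48]  # small illustrative sample

mean = st.mean(data)          # sum divided by count
median = st.median(data)      # middle value
mode = st.mode(data)          # most frequent value
rng = max(data) - min(data)   # largest - smallest
variance = st.pvariance(data) # average squared deviation from the mean
std_dev = st.pstdev(data)     # square root of the variance

# Quartiles and IQR (method="inclusive" matches the midpoint definition)
q1, q2, q3 = st.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
```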
Different forms of Descriptive
Analytics
Common forms of descriptive analytics: numerical data descriptive
statistics and categorical data descriptive statistics.
Classical Analyses
– Numerical data descriptive statistics
– Categorical data descriptive statistics
– Assumption of normality
– Assumption of homogeneity
– Assessing correlations
– Univariate statistical inference
– Multivariate statistical inference
– Bootstrapping for parameter estimates
Text Mining
• Tidying Text & Word Frequency
• Sentiment Analysis
• Term vs. Document Frequency
• Word Relationships
• Converting Between Tidy and Non-tidy Formats
Unsupervised Learning
• Principal Component Analysis
• K-means Cluster Analysis
• Hierarchical Cluster Analysis
Univariate Statistical Inference-
classical Analysis
• univariate statistical inference, specifically in the context of classical analysis, refers to the
process of making inferences or drawing conclusions about a single variable or population
parameter based on sample data. This approach is fundamental in statistical analysis and
hypothesis testing.
1. Population and Sample: In classical analysis, there's a clear distinction between the population
and the sample. The population refers to the entire group or set of individuals, items, or
observations of interest. The sample is a subset of the population selected for analysis.
Univariate statistical inference typically involves drawing conclusions about population
parameters based on sample statistics.
2. Parameter Estimation: One of the primary objectives in univariate statistical inference is to
estimate population parameters based on sample data. For example, if we're interested in
estimating the population mean (μ) or variance (σ²) of a variable, we use sample statistics
such as the sample mean (x̄) or sample variance (s²) as estimators.
Bootstrapping for Parameter
Estimates-classical Analysis
• Resampling methods are an
indispensable tool in modern
statistics.
• They involve repeatedly drawing
samples from a training set and re-
computing an item of interest on
each sample.
• Bootstrapping is one such
resampling method that repeatedly
draws independent samples from
our data set and provides a direct
computational way of assessing
uncertainty.
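A minimal bootstrap sketch in plain Python (the sample values and the number of resamples are illustrative assumptions):

```python
import random
import statistics as st

random.seed(42)
data = [4.1, 5.0, 4.8, 5.3, 4.6, 5.1, 4.9, 5.2, 4.7, 5.0]

# Repeatedly draw resamples (with replacement) and recompute the mean
boot_means = []
for _ in range(1000):
    resample = random.choices(data, k=len(data))
    boot_means.append(st.mean(resample))

# The spread of the bootstrap means is a direct measure of uncertainty
boot_means.sort()
ci_low, ci_high = boot_means[25], boot_means[974]  # ~95% interval
```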
Text Mining
• Creating Tidy Text
– A fundamental requirement to perform text
mining is to get text in a tidy format and
perform word frequency analysis.
– Text is often in an unstructured format so
performing even the most basic analysis
requires some re-structuring.
• Sentiment Analysis
– To understand the opinion or emotion in the
text from tidy text.
• Term vs. Document Frequency
– Focused on identifying the frequency of individual terms within
a document along with the sentiments that these words provide.
– It is also important to understand the importance that words
provide within and across documents.
• Term frequency (tf) identifies how frequently a word occurs in a
document.
• Many common words such as “the”, “is”, “for”, etc. typically top the
term frequency lists.
• One approach to correct for these common, yet low context
words, is to remove these words by using a list of stop words.
• Another approach is to use what is called a term’s inverse
document frequency (idf),
– which decreases the weight for commonly used words and
increases the weight for words that are not used very much in a
collection of documents.
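A tiny sketch of tf and idf in plain Python (the two documents are made up, and the simple log(N / n_containing) form of idf is assumed):

```python
import math
from collections import Counter

# Two tiny illustrative documents
docs = [
    "the engine is hot the engine is loud",
    "the sensor is small",
]
tokenized = [d.split() for d in docs]

# Term frequency: how often a word occurs in one document
tf = Counter(tokenized[0])

# Inverse document frequency: down-weights words common to all documents
def idf(term):
    n_containing = sum(term in doc for doc in tokenized)
    return math.log(len(docs) / n_containing)

# "the" appears in both documents, so its tf-idf weight drops to 0;
# "engine" appears in only one, so it keeps a positive weight
tfidf_the = tf["the"] * idf("the")
tfidf_engine = tf["engine"] * idf("engine")
```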
Principal Components Analysis
• Principal Component Analysis (PCA) involves the process by
which principal components are computed, and their role in
understanding the data.
• PCA is an unsupervised approach, which means that it is
performed on a set of variables X1, X2, …, Xp with no
associated response Y.
• PCA reduces the dimensionality of the data set, allowing most
of the variability to be explained using fewer variables.
• PCA is commonly used as one step in a series of analyses.
• The goal of PCA is to explain most of the variability in the
data with a smaller number of variables than the original data
set.
• For a large data set with p variables, we could examine
pairwise plots of each variable against every other variable,
but even for moderate p, the number of these plots becomes
excessive and not useful.
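A minimal PCA sketch via eigendecomposition of the covariance matrix, using NumPy on synthetic correlated data (an illustration of the idea, not a production implementation):

```python
import numpy as np

# Correlated 2-D data: most variability lies along one direction
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# PCA: center the data, then eigendecompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Fraction of variance explained by the leading principal component:
# here one component captures almost all the variability
explained = eigvals[-1] / eigvals.sum()
```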
K-means Cluster Analysis
• Clustering is a broad set of techniques for finding subgroups of
observations within a data set.
• When we cluster observations, we want observations in the
same group to be similar and observations in different groups to
be dissimilar.
• Because there isn’t a response variable, this is an unsupervised
method, which implies that it seeks to find relationships
between the n observations without being trained by a response
variable.
• Clustering allows us to identify which observations are alike,
and potentially categorize them therein.
• K-means clustering is the simplest and the most commonly used
clustering method for splitting a dataset into a set of k groups.
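A minimal sketch of Lloyd's k-means algorithm in NumPy on two synthetic, well-separated groups (the data and the crude initialization are illustrative assumptions):

```python
import numpy as np

# Two well-separated illustrative groups of 2-D points
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (20, 2)),
                 rng.normal(5, 0.3, (20, 2))])

# Minimal k-means (Lloyd's algorithm), k = 2
centroids = pts[[0, -1]].copy()  # crude initialization: two data points
for _ in range(10):
    # assign each point to its nearest centroid
    d = np.linalg.norm(pts[:, None] - centroids[None], axis=2)
    labels = d.argmin(axis=1)
    # move each centroid to the mean of its assigned points
    for k in range(2):
        centroids[k] = pts[labels == k].mean(axis=0)
```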
Hierarchical Cluster Analysis
• Hierarchical clustering is an alternative
approach to k-means clustering for identifying
groups in the dataset.
• It does not require us to pre-specify the
number of clusters to be generated as is
required by the k-means approach.
• Furthermore, hierarchical clustering has an
added advantage over K-means clustering in
that it results in an attractive tree-based
representation of the observations.
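A minimal agglomerative (single-linkage) sketch in plain Python on illustrative 1-D points; the sequence of merge distances is the information a dendrogram draws:

```python
# Minimal agglomerative (hierarchical) clustering sketch: repeatedly
# merge the two closest clusters; no need to pre-specify k up front.
points = [0.0, 0.4, 5.0, 5.3, 11.0]
clusters = [[p] for p in points]
merges = []  # the merge-distance sequence forms the tree (dendrogram)

def dist(a, b):
    # single linkage: distance between the closest members
    return min(abs(x - y) for x in a for y in b)

while len(clusters) > 1:
    # find and merge the closest pair of clusters
    d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                  for i in range(len(clusters))
                  for j in range(i + 1, len(clusters)))
    merges.append(d)
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
```

Cutting the tree where the merge distances jump (here, after the first two small merges) recovers the natural groups without choosing k in advance.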
Scenario
• An XYZ mobile company has its mobile v1 in the
current market. Due to heavy sales of the product,
the company wants to add some features and
release mobile v2, so it wants to analyze the
willingness of its customers from their feedback.
What kind of analytics is suitable here? Justify it.
References
• https://www.geeksforgeeks.org/descriptive-analysis-in-r-programming/
• https://uc-r.github.io/descriptive
Diagnostic analytics on IoT data
• Answers the question: why is something happening?
• Analyzes IoT data to identify core problems and to fix or
improve a service, product or process.
• Diagnostic capabilities are typically extensions to dashboards that
permit users to drill into data, compare it, and visualize
correlations and trends in an ad-hoc manner.
• Many organizations employ domain experts knowledgeable about a
specific process, machine, device or product, rather than data
scientists, to perform diagnostics on data.
• Addresses questions such as:
– Why is this machine producing more defective parts than other
machines?
– Why is this machine consuming excessive energy?
– Why aren’t we producing enough parts with this tool?
– Why are we getting a lot of product returns from American
customers?
How Does Diagnostic Analytics
Work?
• Diagnostic analytics uses a variety of
techniques to provide insights into the
causes of trends. These include:
• Data drilling:
– Drilling down into a dataset can reveal more detailed information
about which aspects of the data are driving the observed trends.
– For example, analysts may drill down into national sales data to
determine whether specific regions, customers or retail
channels are responsible for increased sales growth.
• Data mining
– hunts through large volumes of data to find patterns and
associations within the data.
• For example, data mining might reveal the most common factors associated
with a rise in insurance claims.
– Data mining can be conducted manually or automatically with
machine learning technology.

• Correlation analysis
– examines how strongly different variables are linked to each other.
– For example, sales of ice cream and refrigerated soda may soar on
hot days.
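Pearson's correlation coefficient quantifies this link; a self-contained sketch in plain Python (the temperature and sales figures are made up for illustration):

```python
def pearson(x, y):
    # Pearson correlation: +1 strong positive link, -1 strong negative, 0 none
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical daily temperature (°C) vs. ice-cream sales (units)
temp = [18, 22, 25, 29, 31, 35]
sales = [40, 55, 60, 80, 85, 100]
r = pearson(temp, sales)  # close to 1: sales rise strongly with temperature
```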
Three Diagnostic Analytics
Categories
• The diagnostic analytics process of determining
the root cause of a problem or trend typically
comprises three primary stages.
• Identify anomalies:
– Trends or anomalies highlighted by descriptive analysis may require
diagnostic analytics if the cause isn’t immediately obvious.
– In addition, it can sometimes be difficult to determine whether the
results of descriptive analysis really show a new trend, especially if
there’s a lot of natural variability in the data.
– In those cases, statistical analysis can help to determine whether the
results actually represent a departure from the norm.
• Discovery:
– The next step is to look for data that explains the
anomalies: data discovery.
– That may involve gathering external data as well as drilling into
internal data.
• For example, searching external data might reveal changes in supply chains,
new regulatory requirements, a shifting competitive landscape or weather
patterns that are associated with the anomalous data.

• Causal relationships:
– Further investigation can provide insights into whether the
associations in the data point to the true cause of the
anomaly.
– The fact that two events correlate doesn’t necessarily mean one
causes the other.
– Deeper examination of the data associated with the sales increase
can indicate which factor or factors were the most likely cause.
Benefits of Diagnostic Analytics
• Understanding the causes of business outcomes is
critical to a company’s ability to grow and learn from
mistakes.
• Diagnostic analytics lets companies zero in on the factors
that drive success or cause failure, including contributing
factors that may not be obvious at first glance.
• Diagnostic analytics can help to instill a data-driven
analytical culture throughout the business.
• When business leaders understand that the company has
the tools to investigate the cause of problems, they’re
more likely to use diagnostic analytics in their decision-
making.
Drawbacks of Diagnostic
Analytics
• A drawback of diagnostic analytics is that it focuses on
historical data;
• it can only help businesses understand why events
happened in the past.
• Further investigation may be needed to determine whether
the correlations revealed by diagnostic analytics really show
cause and effect.
• To look into the future, businesses need to use other analytic
techniques, such as predictive analytics, which examines the
potential future impact of trends and events, and
• Prescriptive analytics, which suggests actions businesses can
take to influence the outcome of those future trends.
Example Scenario II
• Recently I installed new software XYZ on my
laptop. After continuous notifications about a
laptop restart, I restarted my laptop. From the
next restart onward, I found that some file
formats were not recognized and my processor
became very slow. How will you analyze this
problem using descriptive and diagnostic
analytics? What kinds of questions will you try
to answer based on them?
References
• https://www.netsuite.com/portal/resource/articles/data-warehouse/diagnostic-analytics.shtml
Tools & technology supported
• Data Mining and Machine learning
• Big IoT data analytics
– creating a smarter IoT by extracting unseen
patterns, hidden correlations, trends, inferences,
and actionable insights
– facilitating greater intelligence, smarter decision
making, and enhanced performance, automation,
productivity, and accuracy
What is an IoT Search Engine (IOTSE)?
• It is a search tool that lists IoT devices connected to
the internet and supports search and identity
recognition of IoT devices.
• This tool provides a set of information related to IoT
devices, such as type, ports, operating system, and location.
• This tool provides a huge amount of data.
• The data provided must be analyzed and interpreted
with the aim of giving a better understanding and a
quick description, and of helping to make decisions.
• It solves the search problem of the Internet of Things
IoT Search Engines: importance of
Exploratory Data Analysis
• Exploratory data analysis (EDA) is applied to IoT
search engine data in order to analyze IOTSE
data with visual methods.
• Searching and identity visualization are
the essential factors that led to the
development of IOTSE
IoTSE steps & tools
• data collection, indexing, and display of results
• The most popular tools of this type are Shodan, Censys
and Thingful
Issues of data interpretation of IOT
search engine
• Several challenges
– interpretation of results, irrelevant results, and
analysis of queries and results
• Solution: exploratory data analysis can be
used to analyze and make sense of data from
the IoT search engine
Exploratory Data
Analysis
Importance of exploratory data
analysis?
• Exploratory data analysis (EDA) is used by data scientists to
analyze and investigate data sets and summarize their
main characteristics, employing data visualization
methods.
• It helps determine how best to manipulate data sources to
get the answers you need, making it easier for data scientists to
discover patterns, spot anomalies, test a hypothesis, or check
assumptions.
• EDA is primarily used to see what data can reveal beyond the
formal modeling or hypothesis testing task and provides a
better understanding of data set variables and the
relationships between them.
• It can also help determine if the statistical techniques you are
considering for data analysis are appropriate.
Contd.,
• Data Scientists widely use EDA to understand
datasets for decision-making and data cleaning
processes.
• EDA reveals crucial information about the data,
such as hidden patterns, outliers, variance,
covariance, correlations between features.
• This information is essential for designing
hypotheses and creating better-performing models.
Definitions: Exploratory data analysis
(EDA)
• It is an approach to analyzing data that serves to give
a quick and understandable description of a dataset.
• EDA is one of the steps of the data analysis process; it
consists of visualizing the data, extracting its main
characteristics, and providing better understanding
through visual methods. It is an important approach
for analyzing large amounts of data, such as data
generated in real time by IoT devices.
• It is an approach to analyzing datasets to summarize
their main characteristics, often with visual methods
Why is exploratory data analysis
important in data science?
• The main purpose of EDA is to help look at data before making any
assumptions.
• It can help identify obvious errors, as well as better understand patterns
within the data, detect outliers or anomalous events, find interesting
relations among the variables.
• Data scientists can use exploratory analysis to ensure the results they
produce are valid and applicable to any desired business outcomes and
goals.
• EDA also helps stakeholders by confirming they are asking the right
questions.
• EDA can help answer questions about standard deviations, categorical
variables, and confidence intervals.
• Once EDA is complete and insights are drawn, its features can then be used
for more sophisticated data analysis or modeling, including machine
learning.
Process of EDA
The position of EDA in ML modelling
Analysis by EDA
• Data exploration using numerical analysis
• Data exploration using visual analysis
– Preview data
– Check total number of entries and column types
– Check any null values
– Check duplicate entries
– Plot distribution of numeric data (univariate and
pairwise joint distribution)
– Plot count distribution of categorical data
– Analyse time series of numeric data by daily, monthly
and yearly frequencies
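The checks listed above map onto one-liners in pandas (a hypothetical sensor log with a null and a duplicate row, for illustration):

```python
import pandas as pd
import numpy as np

# Small illustrative sensor log with a null value and a duplicate row
df = pd.DataFrame({
    "device": ["a", "b", "a", "a"],
    "temp":   [21.0, np.nan, 21.0, 23.5],
})

preview = df.head()                            # preview data
n_rows = len(df)                               # total number of entries
col_types = df.dtypes                          # column types
null_counts = df.isnull().sum()                # check any null values
n_duplicates = df.duplicated().sum()           # check duplicate entries
numeric_summary = df["temp"].describe()        # distribution of numeric data
category_counts = df["device"].value_counts()  # counts of categorical data
```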
Exploratory data analysis tools
• Specific statistical functions and
techniques to perform with EDA tools
include:
– Clustering and dimension reduction techniques, which
help create graphical displays of high-dimensional data
containing many variables.
– Univariate visualization of each field in the raw dataset,
with summary statistics.
– Bivariate visualizations and summary statistics that
allow you to assess the relationship between each variable
in the dataset and the target variable you’re looking at.
– Multivariate visualizations, for mapping and
understanding interactions between different fields in the
data.
– K-means Clustering is a clustering method in unsupervised
learning where data points are assigned into K groups, i.e.
the number of clusters, based on the distance from each
group’s centroid.
– The data points closest to a particular centroid will be
clustered under the same category.
– K-means Clustering is commonly used in market
segmentation, pattern recognition, and image compression.
– Predictive models, such as linear regression, use statistics
and data to predict outcomes.
EDA categories
• It falls into two categories
– The univariate analysis involves analyzing one
feature, such as summarizing and finding the
feature patterns.
– The multivariate analysis technique shows the
relationship between two or more features using
cross-tabulation or statistics.
Types of exploratory data
analysis
There are four primary types of EDA:
• Univariate non-graphical.
• This is the simplest form of data analysis, where the data being
analyzed consists of just one variable.
• Since it’s a single variable, it doesn’t deal with causes or
relationships.
• The main purpose of univariate analysis is to describe the
data and find patterns that exist within it.
• Univariate graphical. Non-graphical methods don’t provide
a full picture of the data. Graphical methods are therefore
required. Common types of univariate graphics include:
– Stem-and-leaf plots, which show all data values and the shape of the distribution.
– Histograms, a bar plot in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values.
– Box plots, which graphically depict the five-number summary of minimum, first quartile,
median, third quartile, and maximum.
A stem-and-leaf plot of the values 20, 30, 32, 35, 41, 41,
43, 47, 48, 51, 53, 53, 54, 56, 57, 58, 58, 59, 60, 62, 64,
65, 65, 69, 71, 74, 77, 88 and 102
• Multivariate nongraphical: Multivariate data arises
from more than one variable.
• Multivariate non-graphical EDA techniques generally
show the relationship between two or more variables
of the data through cross-tabulation or statistics.
• Multivariate graphical: Multivariate data uses
graphics to display relationships between two or
more sets of data.
• The most used graphic is a grouped bar plot or bar
chart with each group representing one level of one of
the variables and each bar within a group
representing the levels of the other variable.
Other common types of multivariate graphics include:
• Scatter plot, which is used to plot data points on a
horizontal and a vertical axis to show how much one
variable is affected by another.
• Multivariate chart, which is a graphical
representation of the relationships between factors
and a response.
• Run chart, which is a line graph of data plotted over
time.
• Bubble chart, which is a data visualization that
displays multiple circles (bubbles) in a two-
dimensional plot.
• Heat map, which is a graphical representation of data
where values are depicted by color.
Reference
• https://www.ibm.com/cloud/learn/exploratory-data-analysis
References
• https://mite.ac.in/wp-content/uploads/2021/04/iot_module4.pdf
• Tausifa Jan Saleem, Mohammad Ahsan Chishti, “Data Analytics in
the Internet of Things: A Survey”, Scalable Computing, December 2019.
• Erwin Adi, Adnan Anwar, Zubair Baig, Sherali Zeadally, “Machine
learning and data analytics for the IoT”, 11 May 2020.
• https://towardsdatascience.com/exploratory-data-analysis-eda-a-practical-guide-and-template-for-structured-data-abfbf3ee3bd9
• https://www.digiteum.com/iot-data-collection/
