IoT Unit 3, 4, 5 Sem
1. Data Collection
o Gather data from sensors via protocols like MQTT, HTTP, or CoAP.
o Store data in databases (e.g., InfluxDB, MongoDB, Azure IoT Hub).
2. Data Cleaning and Preprocessing
o Remove missing or erroneous values.
o Normalize or scale data.
o Convert timestamps and unify formats.
3. Descriptive Analysis
o Calculate statistical metrics: mean, median, standard deviation.
o Identify patterns, trends, or outliers in the dataset.
4. Segmentation
o Divide data by device, location, or time window (e.g., hourly, daily).
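The preprocessing and analysis steps above can be sketched with pandas. This is a minimal, hedged illustration: the file name readings.csv and the columns device_id, timestamp, and temperature are assumptions, not part of the notes.

import pandas as pd

# 2. Cleaning and preprocessing (hypothetical sensor export)
df = pd.read_csv('readings.csv', parse_dates=['timestamp'])
df = df.dropna(subset=['temperature'])               # remove missing values
df = df.drop_duplicates()                            # remove duplicate rows
df['temperature'] = df['temperature'].astype(float)  # unify numeric format

# 3. Descriptive analysis
print(df['temperature'].describe())                  # mean, std, min/max, quartiles

# 4. Segmentation by device and hourly window
hourly = (df.set_index('timestamp')
            .groupby('device_id')['temperature']
            .resample('H').mean())
print(hourly.head())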
Python Libraries:
o Matplotlib, Seaborn – for basic and static plots
o Plotly, Bokeh – for interactive and web-based plots
o Jupyter Notebook – useful for analysis and experimentation
Dashboard Tools:
Specialized Platforms:
Summary:
Step | Description
Data Collection | From sensors/devices
Exploration | Clean, filter, analyze
Visualization | Charts, graphs, dashboards
Tools | Python, Grafana, Power BI, etc.
C. Probability Distributions
Time series analysis is the process of analyzing data that is collected over time, usually at
regular intervals (e.g., seconds, minutes, days, months). It’s used to understand patterns,
trends, and forecasts in data that changes with time — such as stock prices, weather, energy
usage, or IoT sensor readings.
Example:
Temperature recorded every hour over 24 hours.
Let's discuss the time series data types and their influence. There are two major types of time series data: stationary and non-stationary.
Stationary: the dataset has no trend or seasonality and should follow these thumb rules:
o The mean should remain essentially constant throughout the period under analysis.
o The variance should remain constant across different periods.
Non-stationary: the mean, variance, or other statistical properties change over time (for example, a series with a trend or a seasonal pattern).
With the help of time series analysis, we can prepare numerous time-based analyses and results.
📈 Common Techniques
1. Moving Average (MA)
Smooths short-term fluctuations by averaging values over a sliding window.
2. Autoregressive (AR)
Models each value as a weighted function of its own previous values.
3. ARIMA
Combines AR, Moving Average, and differencing to handle trends and noise.
4. Exponential Smoothing
Gives exponentially more weight to recent observations than to older ones.
5. Seasonal Decomposition
Splits the series into trend, seasonality, and residuals using techniques like:
o Additive model: when seasonality is constant over time
o Multiplicative model: when seasonality changes proportionally
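As a concrete illustration of technique 3, here is a minimal ARIMA sketch using statsmodels; the synthetic series and the order (1, 1, 1) are placeholder assumptions, not values from these notes.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic series with a mild trend plus noise (stand-in for an IoT sensor reading)
rng = np.random.default_rng(42)
series = pd.Series(np.cumsum(rng.normal(0.1, 1.0, 200)))

model = ARIMA(series, order=(1, 1, 1))   # (AR terms, differencing steps, MA terms)
fitted = model.fit()
print(fitted.forecast(steps=5))          # predict the next 5 values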
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic hourly temperature readings; in practice df would come from the collected sensor data
df = pd.DataFrame({'Date': pd.date_range('2024-01-01', periods=24, freq='H'),
                   'Temp': [20 + 0.5 * (h % 12) for h in range(24)]})

# Plot
plt.plot(df['Date'], df['Temp'])
plt.title('Temperature Over Time')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.show()
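For technique 5, a hedged sketch of seasonal decomposition with statsmodels follows; the one-week synthetic hourly series and period=24 are assumptions chosen only for illustration.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# One week of synthetic hourly temperatures with a daily cycle
idx = pd.date_range('2024-01-01', periods=24 * 7, freq='H')
temps = 20 + 5 * np.sin(2 * np.pi * idx.hour / 24) + np.random.normal(0, 0.5, len(idx))
series = pd.Series(temps, index=idx)

# Additive model: the seasonal component is assumed constant over time
result = seasonal_decompose(series, model='additive', period=24)
result.plot()        # trend, seasonal and residual panels
plt.show()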
Like many other models, TSA does not support missing values; they need to be handled (imputed or removed) before the analysis.
Data quality refers to the degree to which data is:
o Accurate
o Reliable
o Complete
o Consistent
o Timely
o Relevant
It ensures that organizations can use their data confidently for decision-making,
operations, and analytics.
Dimension | Description
Completeness | Whether all necessary data is present.
Timeliness | How current or outdated the data is.
Validity | Conformance to format/rules (e.g., date format).
Integrity | Trustworthiness and correctness of data.
Uniqueness | No duplicated entries.
Consistency | Uniformity across formats, sources, or time.
❗ Why is Data Quality Important?
Rise of Big Data & IoT → more reliance on data.
Essential for:
o Business Intelligence (BI)
o Machine Learning (ML)
o Operational Decision-Making
Bad Data = Bad Outcomes: In sectors like healthcare or finance, poor data quality
can lead to moral, legal, or financial consequences.
Enables:
o Accurate KPI tracking
o Workflow optimization
o Competitive edge over rivals
Trait | Meaning
Accuracy | No errors or false values.
Completeness | All required data is available.
Relevance | Only useful data for the task at hand.
Consolidation | No duplicates; one version of the truth.
Consistency | Standardized and conflict-free data.
Here are the key techniques to understand and manage Data Quality in any system
(including IoT, analytics, business databases, etc.):
1. Data Profiling
What it is: Analyzing datasets to understand their structure, patterns, and anomalies.
Use for:
o Discovering missing values, unusual patterns
o Frequency and distribution checks
Tools: Pandas (Python), Talend, Informatica
2. Data Validation
What it is: Finding and measuring gaps or null values in the data.
Approaches:
o NULL checks
o Time gap analysis (for time-series)
o Visual null maps (e.g., heatmaps)
5. Duplicate Detection
What it is: Identifying repeated records so they can be merged or removed.
7. Schema Validation
What it is: Checking that data matches the expected structure, field names, and data types.
8. Cross-Source Comparison
What it does: Checks consistency of values across multiple data sources.
Example: Comparing sales totals from CRM vs. billing database.
Benefit: Detect integration or sync issues.
Why it's used: Metadata (e.g., timestamp, device ID, source) helps verify the origin,
meaning, and accuracy of the data.
Example: Sensor data from an uncalibrated device may seem valid but is not reliable.
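Several of the techniques above (profiling, NULL checks, duplicate detection, a simple validity rule) can be sketched in pandas. The file name, column names, and temperature thresholds below are hypothetical.

import pandas as pd

# Hypothetical sensor table with device_id, timestamp, temperature columns
df = pd.read_csv('sensor_data.csv', parse_dates=['timestamp'])

# Profiling: structure, types, and basic distributions
df.info()
print(df.describe())

# Missing-value / NULL checks
print(df.isnull().sum())

# Duplicate detection
print(df.duplicated(subset=['device_id', 'timestamp']).sum())

# Simple validity rule: plausible temperature range (thresholds are assumptions)
invalid = df[(df['temperature'] < -40) | (df['temperature'] > 85)]
print(len(invalid), 'out-of-range readings')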
Unit-4
Data Science for IoT (Internet of Things) Analytics involves using data science methods and
tools to collect, process, analyze, and extract valuable insights from the vast amounts of data
generated by IoT devices. This intersection helps improve decision-making, automation, and
predictive capabilities in IoT applications.
2. Data Storage
Cloud Platforms: AWS IoT, Azure IoT Hub, Google Cloud IoT.
Databases: NoSQL (MongoDB, Cassandra), Time-series databases (InfluxDB),
Relational (MySQL, PostgreSQL).
3. Data Preprocessing
5. Data Visualization
Edge Analytics: Analysis done near the data source (low latency)
Cloud Analytics: Scalable, powerful processing (higher latency)
Languages: Python, R
Libraries: NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch, XGBoost
Visualization: Matplotlib, Seaborn, Plotly
IoT Platforms
Edge Analytics
📌 Definition:
"Machine learning is the science of getting computers to learn and act without being
explicitly programmed." — Arthur Samuel
📈 Real-World Applications
Healthcare: Predict disease risks
Finance: Fraud detection, credit scoring
Marketing: Recommendation engines
IoT: Smart home automation, predictive maintenance
Agriculture: Crop prediction, pest detection
Feature engineering is a critical step in preparing IoT data for machine learning or deep
learning models. Due to the unique characteristics of IoT data (high volume, time-series,
multi-source, real-time), it requires specialized techniques.
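Below is a minimal sketch of such feature engineering with pandas: lag and rolling-window features computed per device, plus a simple time-based feature. The window size and column names are assumptions.

import pandas as pd

# Hypothetical readings with device_id, timestamp, temperature
df = pd.read_csv('readings.csv', parse_dates=['timestamp'])
df = df.sort_values(['device_id', 'timestamp'])

grouped = df.groupby('device_id')['temperature']
df['temp_lag_1'] = grouped.shift(1)                                      # previous reading
df['temp_roll_mean'] = grouped.transform(lambda s: s.rolling(12).mean()) # 12-sample rolling mean
df['temp_roll_std'] = grouped.transform(lambda s: s.rolling(12).std())   # 12-sample rolling spread
df['hour_of_day'] = df['timestamp'].dt.hour                              # time-based feature

print(df.head())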
🔂 3. Walk-Forward Validation
Train on past data and test on the next time window, then roll the split forward through the series so the model is always evaluated on data that comes after its training period.
Use GroupKFold (e.g., per sensor/device) to avoid data leakage between correlated devices. A sketch of the time-based split follows below.
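A hedged sketch of walk-forward validation using scikit-learn's TimeSeriesSplit; the random data, feature count, and model are placeholders. GroupKFold would replace TimeSeriesSplit when splitting by sensor/device rather than by time.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Placeholder time-ordered data: 200 samples, 3 features
X = np.random.rand(200, 3)
y = np.random.rand(200)

tscv = TimeSeriesSplit(n_splits=5)   # each fold trains on the past, tests on the next block
for train_idx, test_idx in tscv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    print(mean_absolute_error(y[test_idx], preds))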
📊 Metrics to Use
The Bias-Variance Tradeoff is the balancing act required between models that are too simple and those that are too complex.
Bias refers to the error on the training data, while variance refers to the difference in error between the training and test data.
Underfitting models have high bias and low variance, whereas overfitting models have high variance and low bias.
A proper balance of bias and variance is essential for good generalisation and for avoiding overfitting or underfitting.
Bias in AI Algorithms
Bias in Machine Learning: This is the error that results from simplification when
trying to model a highly complex problem using an oversimplified model. The model
with high bias makes strong assumptions, and often results in underfitting.
o Example: A linear model to predict a non-linear relationship would likely have
a high bias and will make systematic errors in prediction.
Variance in Machine Learning: In simple terms, variance refers to how sensitive a model is to small changes in the data. If a model has high variance, it is too complex and fits too closely to the training data.
o Example: A deep learning model, if too deep, will memorise the noise in the
training data, leading to high variance and poor generalisation on new data.
High Bias Example: A shallow decision tree with only a few splits has a high bias
toward underfitting as it fails to capture intricate patterns in data.
High Variance Example: A deep decision tree with a lot of splits can be highly
variable and lead to overfitting due to its ability to model noisy training data.
Model choice | Effect on bias and variance
Simple models (e.g., linear regression) | Increase bias, reduce variance
Complex models (e.g., deep neural nets) | Reduce bias, increase variance
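The shallow-versus-deep decision tree contrast above can be reproduced with a small scikit-learn sketch comparing training and test error at two depths; the synthetic data and the chosen depths are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic non-linear data with noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 20):   # shallow = high bias (underfit), deep = high variance (overfit)
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth,
          mean_squared_error(y_train, tree.predict(X_train)),   # training error
          mean_squared_error(y_test, tree.predict(X_test)))     # test error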
List out all relevant data sources (sensors, devices, logs, user inputs).
Select only useful, actionable data for your goal.
✅ Data Cleaning
Remove or correct:
o Missing values
o Duplicates
o Outliers
o Inconsistent formats (e.g., date/time, units)
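For the outlier item above, one common approach (not prescribed by these notes) is the IQR rule; a minimal pandas sketch with a hypothetical temperature column:

import pandas as pd

# Hypothetical DataFrame with a 'temperature' column
df = pd.read_csv('readings.csv')

q1, q3 = df['temperature'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the IQR fences; alternatively, flag them for review
df_clean = df[df['temperature'].between(lower, upper)]
print(len(df) - len(df_clean), 'outliers removed')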
✅ Data Transformation
✅ Metadata Management
Track:
o Data origin (source)
o Units, timestamp format
o Collection method
o Sensor type/model
Useful for data lineage and debugging.
3. Data Governance
✅ Establish Policies
✅ Data Stewardship
✅ Access Control
Use data warehouses for structured, cleaned data (e.g., Redshift, BigQuery).
Use data lakes for raw, unstructured, or semi-structured data (e.g., S3, Azure Data
Lake).
✅ ETL/ELT Tools
✅ Visualization Tools
✅ Data Modeling
✅ Version Control
📁 Use a consistent folder and naming structure | Easy to navigate and retrieve datasets
🔁 Regularly audit and validate data | Keeps data clean and trustworthy
Linked analytical datasets refer to interconnected collections of data that are logically or
relationally joined together, often across different sources, to provide a comprehensive
view for analysis and decision-making.
These datasets are typically used in data analytics, business intelligence, and machine
learning where relationships between different data points are critical for generating
insights.
📘 Example:
🧩 Key Characteristics:
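To make the linking concrete, here is a minimal pandas sketch joining two hypothetical tables (sensor readings and device metadata) on a shared device_id key; the table contents are invented purely for illustration.

import pandas as pd

# Hypothetical linked tables sharing the device_id key
readings = pd.DataFrame({'device_id': [1, 1, 2],
                         'temperature': [21.5, 22.0, 19.8]})
devices = pd.DataFrame({'device_id': [1, 2],
                        'location': ['Lab A', 'Warehouse'],
                        'model': ['TMP-100', 'TMP-200']})

# Join readings with device metadata to get a combined analytical view
linked = readings.merge(devices, on='device_id', how='left')
print(linked.groupby('location')['temperature'].mean())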
2. Ingestion Layer
3. Metadata Store
6. Data Catalog
Function: Indexes all data assets for easy discovery and reuse.
Features:
o Searchable data inventory
o Tagging and classification
o Data usage tracking
(or)
✅ Managing a Data Lake
Managing a data lake involves ensuring functionality, security, and usability across its
lifecycle. A well-managed data lake enables efficient data storage, processing, and analytics
while maintaining compliance and performance.
🔐 3. Security
⚡ 4. Performance Optimization
🚀 5. Scalability
Utilize distributed storage (S3, ADLS) and compute engines (Spark, Presto).
Plan for future data growth without degrading performance.
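One storage layout commonly used for performance and growth (the notes do not prescribe a specific one) is columnar Parquet partitioned by a date column; a hedged pandas sketch with hypothetical paths and columns, requiring pyarrow:

import pandas as pd

# Hypothetical daily sensor extract
df = pd.read_csv('readings.csv', parse_dates=['timestamp'])
df['date'] = df['timestamp'].dt.date.astype(str)

# Columnar, partitioned layout: one folder per date, easy to prune at query time
df.to_parquet('datalake/sensor_readings/', partition_cols=['date'])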
6. Metadata Management
🧠 8. Resource Management
Category | Tools/Platforms
Role | Responsibility
Chief Data Officer (CDO) | Align data strategies with business goals
A data retention strategy defines how long different types of data are kept in a system and
when they should be archived or deleted. It ensures regulatory compliance, cost efficiency,
data relevance, and risk reduction.
🔍 Why is a Data Retention Strategy Important?
1. Compliance: Meets legal and industry regulations (e.g., GDPR, HIPAA).
2. Cost Control: Removes unnecessary or outdated data, reducing storage costs.
3. Performance: Optimizes system performance by reducing clutter.
4. Security & Risk Management: Limits exposure of sensitive or outdated data.
5. Data Governance: Ensures data is retained purposefully and consistently.
1. Data Classification
Categorize data by type and importance (e.g., financial data, logs, user data).
Assign retention policies based on classification.
2. Retention Policies
3. Archival Plan
Move less frequently used but important data to cold or archival storage.
Technologies: AWS Glacier, Azure Archive Storage.
4. Deletion Mechanisms
🏁 Example
Data Type | Retention Period | Storage Tier | Action After Expiry
Transaction Logs | 90 days | Hot Storage | Delete securely
Financial Data | 7 years | Cold Storage | Archive or backup
IoT Sensor Data | 30 days | Edge/Hot Storage | Aggregate, then delete
HR Records | 5 years post-exit | Secure Cold Store | Review, then delete securely
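As a sketch of how such retention periods could be applied in code, the snippet below flags expired records; the policy dictionary, file name, and columns are assumptions based on the example table above.

import pandas as pd

# Retention periods (days) taken from the example table above
retention_days = {'transaction_logs': 90, 'iot_sensor_data': 30}

# Hypothetical records table with a data_type and a created_at timestamp
records = pd.read_csv('records.csv', parse_dates=['created_at'])
now = pd.Timestamp.now()

# Flag records past their retention period for secure deletion or archiving
records['max_age_days'] = records['data_type'].map(retention_days)
expired = records[(now - records['created_at']).dt.days > records['max_age_days']]
print(len(expired), 'records due for deletion or archival')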