Data Analytics Assignment

BANSAL INSTITUTE OF ENGINEERING AND TECHNOLOGY
(Affiliated to Dr. APJ Abdul Kalam Technical University, Lucknow)
Even Semester (2022-23)
Assignment – 1

Programme: B.Tech. - IT    Semester: 6th    M.M.: 10
Course Code: 13    Subject: Data Analytics (KIT601)    Date: 02/03/2023
Knowledge Level (KL): KL1 - Remembering, KL2 - Understanding, KL3 - Applying, KL4 - Analyzing, KL5 - Evaluating, KL6 - Creating
Date of Assignment: 02/03/2023    Date of Submission: 10/03/2023
1. What is data analytics? What are the various sources of data? Explain their characteristics. (CO1, KL1)
2. Explain the different classifications of data with their advantages and disadvantages. (CO1, KL2)
3. What is Big Data, and what are the sources of Big Data? Explain their uses and applications. (CO1, KL2)
4. Explain the different types of data analytics, with examples. (CO1, KL1)
5. Explain the various phases of the data analytics life cycle.
Assignment of Data Analytics (KIT-601)
A data stream refers to a continuous, unbounded flow of data points generated over
time. Unlike traditional static datasets, data streams are dynamic and require real-time
processing to extract useful information. This constant flow can originate from various
sources like sensors, social media, financial markets, and network logs. Common sources and applications include:
1. Financial Markets: Real-time processing of stock prices and trading data helps in
making quick investment decisions and detecting market trends or anomalies.
2. Social Media: Analyzing real-time posts, comments, and user interactions to
understand trends, sentiment analysis, and customer feedback.
3. Sensor Networks: Continuous monitoring of environmental data (e.g.,
temperature, humidity, pollution levels) for applications in weather forecasting,
disaster management, and smart cities.
4. Network Traffic Monitoring: Real-time analysis of network traffic to detect
security threats, optimize performance, and manage bandwidth.
5. IoT (Internet of Things): Collecting and analyzing data from smart devices and
sensors in real-time to optimize operations, enhance user experiences, and
enable predictive maintenance.
Data streams enable timely insights and actions, crucial for applications where delayed
analysis could result in missed opportunities or increased risks.
The stream model involves continuous, ordered data sequences processed under
constraints of memory and latency, making it necessary to use algorithms that operate
in a single pass with incremental updates and approximations. It suits applications
requiring immediate processing and responses, as storing the entire stream is
impractical. A typical stream-processing architecture consists of the following components:
1. Data Source: Origin of the data streams, such as sensors, social media feeds, or
financial transactions.
2. Ingestion Layer: Initial point where data enters the system, involving preliminary
filtering, cleaning, and sometimes pre-processing to manage the high volume
and velocity.
3. Stream Processing Engine: Core component performing real-time processing
using algorithms designed for fast, efficient handling of streaming data. This
includes transformations, aggregations, and windowed operations.
4. Storage Layer: For temporary or permanent storage of processed or raw data,
often utilizing distributed databases or file systems for scalability.
5. Query Processor: Allows users to execute queries on streaming data, providing
real-time insights with possible approximate answers due to processing
constraints.
6. Output Interface: Delivers the processed results to users or systems, often via
dashboards, alerts, or API endpoints, facilitating real-time decision-making.
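A minimal sketch of such a pipeline in Python, assuming a simple in-memory setup; the sensor source, the plausibility filter, and the window size are illustrative choices, not part of any particular platform:

import random
import time
from collections import deque

def sensor_source(n_events=20):
    """Data source: emits (timestamp, temperature) readings."""
    for _ in range(n_events):
        yield (time.time(), round(random.uniform(15.0, 35.0), 2))

def ingest(stream):
    """Ingestion layer: basic filtering/cleaning of raw events."""
    for ts, temp in stream:
        if 0.0 <= temp <= 60.0:          # drop implausible readings
            yield ts, temp

def process(stream, window_size=5):
    """Stream processing engine: windowed aggregation (moving average)."""
    window = deque(maxlen=window_size)
    for ts, temp in stream:
        window.append(temp)
        yield ts, sum(window) / len(window)

# Output interface: print rolling averages as they are produced.
for ts, avg in process(ingest(sensor_source())):
    print(f"{ts:.0f}  moving_avg={avg:.2f}")

In production, the ingestion and processing layers would typically be backed by a distributed streaming platform rather than in-process generators.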
The DGIM (Datar-Gionis-Indyk-Motwani) algorithm approximates the number of 1s in the last N bits of a binary stream using a small number of exponentially sized buckets. Steps:
1. Buckets: Each bucket represents a block of 1s and is characterized by its size and
a timestamp marking the position of its last element in the stream. The size of
buckets is always a power of two.
2. Merging Buckets: When two buckets of the same size appear consecutively, they
merge into a single bucket of double the size, retaining the timestamp of the
more recent bucket. This maintains a logarithmic number of buckets relative to
the window size.
3. Estimation: To estimate the count of 1s in the last W bits, sum the sizes of all
buckets within the window. For the last bucket partially included in the window,
add half its size for a close approximation.
This method allows for efficient counting with reduced memory usage, making it
suitable for applications needing real-time analytics on high-velocity binary streams.
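A compact Python sketch of this bookkeeping, assuming the common DGIM formulation that keeps at most two buckets of each size; the bit stream and window length below are illustrative:

def dgim_buckets(bits, N):
    """Process a 0/1 stream, keeping DGIM buckets as (end_time, size) pairs, newest first."""
    buckets = []
    for t, bit in enumerate(bits):
        # Expire buckets that fall outside the window of the last N bits.
        buckets = [(end, size) for end, size in buckets if end > t - N]
        if bit == 1:
            buckets.insert(0, (t, 1))
            size = 1
            # Merge the two oldest buckets of a size when a third appears.
            while sum(1 for _, s in buckets if s == size) > 2:
                idx = [i for i, (_, s) in enumerate(buckets) if s == size][-2:]
                merged = (buckets[idx[0]][0], size * 2)   # keep newer timestamp
                buckets = [b for i, b in enumerate(buckets) if i not in idx]
                buckets.insert(idx[0], merged)
                size *= 2
    return buckets

def estimate_ones(buckets):
    """Sum all bucket sizes, counting only half of the oldest bucket."""
    if not buckets:
        return 0
    return sum(size for _, size in buckets[:-1]) + buckets[-1][1] // 2

bits = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b = dgim_buckets(bits, N=12)
print(b, "estimated ones in last 12 bits ~", estimate_ones(b))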
Counting distinct elements in a data stream is challenging due to the continuous and
potentially infinite nature of the stream. Probabilistic algorithms like HyperLogLog or
Flajolet-Martin (FM) are commonly used to estimate the number of unique elements
with high accuracy and low memory usage. These algorithms use hash functions to map
elements into a fixed-size data structure, allowing efficient approximate counting by
leveraging the properties of hash collisions. This approach is essential for applications
like network traffic analysis, web analytics, and database query optimization, where exact
counting is impractical due to the volume and velocity of data.
Finding the most popular elements in a data stream can be managed using a decaying
window approach, which assigns weights to elements that decrease over time, ensuring
recent data has more influence. This is implemented using techniques like Exponential
Decay, where each element's contribution is multiplied by an exponentially decaying
factor, making older data progressively less significant. This method is useful in
applications like recommendation systems, fraud detection, and dynamic content
prioritization, where trends and patterns must reflect the most current data, balancing
between historical context and recent activities.
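A small Python sketch of exponentially decayed popularity scores; the decay factor and the click stream are illustrative values:

def decayed_counts(stream, decay=0.99):
    """Maintain an exponentially decaying score per item: on every arrival,
    all scores are multiplied by `decay` and the arriving item gets +1."""
    scores = {}
    for item in stream:
        for key in scores:
            scores[key] *= decay          # older contributions fade
        scores[item] = scores.get(item, 0.0) + 1.0
        # Optionally drop items whose score has become negligible.
        scores = {k: v for k, v in scores.items() if v > 1e-6}
    return scores

clicks = ["A", "A", "B", "A", "C", "B", "B", "B", "C", "B"]
scores = decayed_counts(clicks, decay=0.9)
print(sorted(scores.items(), key=lambda kv: -kv[1]))   # most popular first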
In Big Data systems, Bloom filters are used to manage and query large datasets efficiently, typically to determine quickly whether an element is present in a set; they may return false positives but never false negatives.
Example:
For instance, in a spam detection system, Bloom Filters can quickly verify if an email
sender's address is from a known spam list. This technique saves memory and speeds up
queries significantly, especially in large-scale applications.
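A minimal Bloom filter sketch in Python, assuming a small bit array and k positions derived from a single SHA-256 digest; the sizes and hashing scheme are illustrative:

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bit positions from one digest (illustrative scheme).
        digest = hashlib.sha256(item.encode()).hexdigest()
        return [int(digest[i * 8:(i + 1) * 8], 16) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

spam = BloomFilter()
spam.add("known.spammer@example.com")
print(spam.might_contain("known.spammer@example.com"))  # True
print(spam.might_contain("friend@example.com"))          # almost surely False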
The FM algorithm estimates the number of distinct elements in a data stream using
probabilistic counting. It leverages the properties of hash functions to create a bitmap
representing the presence of elements.
1. Hashing: Each element in the stream is hashed to a long bit string using a uniform hash function.
2. Bit Patterns: Identify the position of the least significant 1-bit in the hashed
values.
3. Estimation: Use the maximum observed position to estimate the number of
distinct elements using a logarithmic transformation.
Applications: Estimating the number of unique visitors in web analytics, counting distinct IP addresses in network monitoring, and cardinality estimation for database query optimization.
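A tiny Flajolet-Martin sketch in Python, assuming a single MD5-based hash function and the simple 2^R estimate (practical versions average many hash functions and apply a correction factor):

import hashlib

def trailing_zeros(x):
    """Position of the least significant 1-bit (number of trailing zeros)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def fm_estimate(stream):
    """Flajolet-Martin style estimate of the number of distinct elements."""
    max_r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r          # often divided by a correction factor (~0.77)

stream = [1, 2, 3, 2, 1, 4, 5, 6, 7, 8, 9, 9, 3, 10, 11, 12]
print("estimated distinct elements:", fm_estimate(stream))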
The AMS (Alon-Matias-Szegedy) algorithm estimates the second frequency moment of a data stream (F2, the sum of squared element frequencies), a measure of the skew of the distribution that is useful for finding heavy hitters and understanding how the data is distributed.
1. Sketching: Maintain a sketch of the stream using random variables and hash
functions.
2. Estimation: Compute an unbiased estimate of the second moment from the
sketch.
Applications: Estimating join sizes in database query optimization, measuring skew in network traffic distributions, and supporting heavy-hitter detection.
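A simplified AMS sketch in Python for F2, assuming the stream length is known in advance and only a handful of random positions are sampled; both assumptions are for illustration:

import random
from collections import Counter

def ams_f2_estimate(stream, num_vars=10, seed=0):
    """Estimate F2 = sum over items of (frequency)^2 using AMS variables."""
    random.seed(seed)
    n = len(stream)
    positions = [random.randrange(n) for _ in range(num_vars)]
    estimates = []
    for p in positions:
        element = stream[p]
        # Count occurrences of `element` from position p to the end of the stream.
        c = sum(1 for x in stream[p:] if x == element)
        estimates.append(n * (2 * c - 1))     # unbiased estimator of F2
    return sum(estimates) / len(estimates)

stream = list("abcabdacdabdcaab")
exact = sum(f * f for f in Counter(stream).values())
print("exact F2:", exact, " AMS estimate:", ams_f2_estimate(stream))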
Technologies: Widely used stream-processing and real-time analytics platforms include Apache Kafka, Apache Spark Streaming, Apache Flink, and Apache Storm.
These technologies enable businesses to analyze data in motion, supporting use cases
like fraud detection, dynamic pricing, recommendation systems, and real-time
monitoring.
9. Explain the Three Categories of Prediction Methodologies.
1. Supervised Learning: Models are trained on labeled examples to predict a known target, as in regression and classification (e.g., predicting house prices or classifying emails as spam).
2. Unsupervised Learning: Models uncover hidden structure in unlabeled data, as in clustering and dimensionality reduction (e.g., customer segmentation).
3. Reinforcement Learning: An agent learns a policy by interacting with an environment and receiving rewards or penalties for its actions (e.g., recommendation ranking, game playing, robotics).
b. Nature of Data:
Data originates from various sources like sensors, transactional systems, social media,
and IoT devices. Understanding the nature of data is crucial for selecting appropriate
storage, processing, and analytical techniques, enabling effective handling of diverse
data types and sources.
Support Vector Machines (SVMs) are supervised learning models used for classification
and regression tasks. The key steps in SVM-based inference are:
1. Data Preparation:
Collect and preprocess labeled data, ensuring it’s suitable for training (e.g.,
normalization, handling missing values).
3. Kernel Selection:
Choose a kernel function (e.g., linear, polynomial, or RBF) that maps the
data into a feature space where the classes become separable.
4. Model Training:
Train the SVM model by finding the hyperplane that maximizes the margin
between different classes. This involves solving a quadratic optimization
problem to identify support vectors and define the decision boundary.
5. Model Validation:
Evaluate the trained model on held-out data (e.g., via cross-validation),
using metrics such as accuracy, precision, and recall, and tune
hyperparameters like C and gamma accordingly.
6. Inference:
Use the trained SVM model to classify new data points based on the
learned decision boundary, predicting labels for unseen instances.
SVMs are powerful for handling complex classification problems, particularly when the
data is not linearly separable, making them widely applicable in image recognition,
bioinformatics, and text classification.
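A brief sketch of these steps using scikit-learn's SVC on a toy dataset; the dataset, kernel choice, and parameter values are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Data preparation: load labeled data, split, and normalize features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3-4. Kernel selection and training: RBF kernel, margin-maximizing fit.
model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# 5. Validation: evaluate on held-out data.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6. Inference: predict labels for new, unseen measurements.
new_sample = scaler.transform([[5.1, 3.5, 1.4, 0.2]])
print("predicted class:", model.predict(new_sample)[0])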
Bayesian inference is a statistical method that updates the probability estimate for a
hypothesis as more evidence becomes available. It relies on Bayes' theorem, which
describes the relationship between conditional probabilities.
Key Concepts:
1. Prior Probability: The initial belief about a hypothesis before observing any data.
2. Likelihood: The probability of observing the data given the hypothesis.
3. Posterior Probability: The updated probability of the hypothesis after
considering the evidence.
Applications: Spam filtering, medical diagnosis, A/B testing, risk assessment, and parameter estimation in machine learning models.
Bayesian inference provides a flexible framework for incorporating new data, handling
uncertainty, and making probabilistic predictions, making it essential in various fields
such as finance, engineering, and artificial intelligence.
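A small worked example of Bayes' theorem, P(H|D) = P(D|H) * P(H) / P(D), using assumed numbers for a diagnostic test:

# Hypothesis H: a person has a condition; D: the test is positive.
prior = 0.01            # P(H): 1% prevalence (assumed)
likelihood = 0.95       # P(D|H): test sensitivity (assumed)
false_positive = 0.05   # P(D|not H): false-positive rate (assumed)

evidence = likelihood * prior + false_positive * (1 - prior)   # P(D)
posterior = likelihood * prior / evidence                      # P(H|D)
print(f"posterior probability of the condition given a positive test: {posterior:.3f}")
# ~0.161: the positive test raises the probability from 1% to about 16%.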
1. Descriptive Inference:
Objective: Summarize and characterize what has already happened in the
observed data.
Techniques: Summary statistics, exploratory visualization, cross-tabulation.
Applications: Performance dashboards, survey analysis, quality monitoring.
2. Predictive Inference:
Objective: Use historical data to estimate future or unseen outcomes.
Techniques: Regression models, time-series forecasting, machine learning
classifiers.
Applications: Demand forecasting, churn prediction, credit risk scoring.
3. Prescriptive Inference:
Objective: Recommend actions based on predictive insights to achieve
desired outcomes.
Techniques: Optimization algorithms, simulation models, decision
analysis.
Applications: Supply chain optimization, resource allocation, strategic
planning.
Each type of inference plays a critical role in transforming raw data into actionable
insights, enabling organizations to make informed decisions, anticipate future trends,
and optimize operations.
Steps:
1. Draw many resamples of the same size as the original dataset by sampling with replacement.
2. Compute the statistic of interest (e.g., mean, median, or model accuracy) on each resample.
3. Use the distribution of the resampled statistics to estimate standard errors and confidence intervals.
Importance:
Bootstrapping is essential in data science and machine learning for model validation,
hypothesis testing, and constructing confidence intervals, enhancing the reliability of
inferential statistics and decision-making.
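A short Python sketch of a percentile bootstrap confidence interval; the dataset and number of resamples are illustrative:

import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    random.seed(seed)
    stats = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in data]   # sample with replacement
        stats.append(stat(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return stat(data), (lo, hi)

data = [12, 15, 14, 10, 18, 21, 13, 16, 17, 15, 14, 19]
estimate, (lo, hi) = bootstrap_ci(data)
print(f"sample mean = {estimate:.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")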
15. What is Sampling and Sampling Distribution? Give a Detailed
Analysis.
Sampling is the process of selecting a representative subset of individuals or observations from a population in order to estimate characteristics of the whole population.
Types of Sampling: Probability methods include simple random sampling, stratified sampling, systematic sampling, and cluster sampling; non-probability methods include convenience and quota sampling.
Sampling Distribution: The probability distribution of a statistic (such as the sample mean) computed over repeated samples of the same size drawn from the population.
Detailed Analysis: By the Central Limit Theorem, the sampling distribution of the sample mean approaches a normal distribution as the sample size grows, with standard error σ/√n; this result underpins confidence intervals and hypothesis tests.
Importance:
Efficiency: Provides a way to draw inferences about populations with less cost
and effort.
Practicality: Enables analysis of large populations where a full census is
impractical.
Inferential Power: Facilitates understanding of the variability and reliability of
sample estimates.
Sampling and sampling distributions are fundamental concepts in statistics, ensuring
that conclusions drawn from samples are generalizable to the population, thereby
underpinning many empirical research and data analysis methodologies.
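A small Python simulation of the sampling distribution of the mean, assuming an illustrative skewed population:

import random
import statistics

random.seed(1)
# Population: a skewed (exponential-like) distribution of 100,000 values.
population = [random.expovariate(1 / 50) for _ in range(100_000)]

# Sampling distribution of the mean: draw many samples of size n, record each mean.
n, trials = 40, 2000
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(trials)]

print("population mean:", round(statistics.mean(population), 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("standard error (observed):", round(statistics.stdev(sample_means), 2))
print("standard error (sigma/sqrt(n)):", round(statistics.stdev(population) / n ** 0.5, 2))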
Analysis:
Objective: Explore and interpret data to uncover patterns, relationships, and
root causes.
Approach: Investigative, involving statistical methods, modeling, and
hypothesis testing.
Outcome: Generates insights and recommendations that explain why metrics
behave as they do.
Reporting:
Objective: Present data and findings in a structured format for review and
decision-making.
Approach: Descriptive, involving the use of dashboards, summaries, and
visualizations.
Outcome: Provides a snapshot of key metrics and performance indicators, aiding
in monitoring and assessment.
While analysis focuses on extracting and understanding insights from data, reporting is
about communicating these insights effectively to stakeholders, making both essential
for data-driven decision-making.
Prediction Error is the discrepancy between actual values and predicted values
produced by a model. It’s crucial for assessing model accuracy and guiding
improvements.
Regression Techniques:
1. Linear Regression:
Objective: Model the relationship between a dependent variable and one
or more independent variables using a linear equation.
Application: Predicting outcomes like sales, costs, or risk scores.
2. Multiple Regression:
Objective: Extend linear regression to include multiple independent
variables.
Application: Complex predictions involving several factors, such as
marketing impact analysis.
3. Polynomial Regression:
Objective: Model non-linear relationships by fitting a polynomial equation
to the data.
Application: Predicting growth curves, trends in scientific data.
4. Ridge and Lasso Regression:
Objective: Address multicollinearity and overfitting by adding
regularization terms to the regression equation.
Application: Predictive modeling in high-dimensional datasets.
Prediction error metrics such as mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE) guide the selection and tuning of regression models, ensuring their effectiveness and reliability in making accurate predictions across various applications.
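A brief Python sketch comparing these error metrics across linear, ridge, and lasso regression on synthetic data; the data-generating process and regularization strengths are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Illustrative data: advertising spend (x) vs. sales (y) with noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 1))
y = 3.0 * X[:, 0] + 20 + rng.normal(0, 15, size=200)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    pred = model.fit(X, y).predict(X)
    mse = mean_squared_error(y, pred)
    print(f"{name:6s}  MSE={mse:8.2f}  RMSE={mse ** 0.5:6.2f}  "
          f"MAE={mean_absolute_error(y, pred):6.2f}")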
1. Volume: The sheer scale of data generated, often ranging from terabytes to petabytes, which demands distributed storage and processing.
2. Velocity: The speed at which data is generated and must be ingested and analyzed, often in real time or near real time.
3. Variety: The diversity of data formats, spanning structured, semi-structured, and unstructured sources such as text, images, and logs.
4. Veracity: The trustworthiness and quality of data, including issues of noise, bias, inconsistency, and missing values.
5. Value:
The potential insights and benefits derived from analyzing big data.
Extracting value involves identifying relevant data, applying advanced
analytics, and translating findings into actionable business strategies,
driving competitive advantage and innovation.
These characteristics define the complexities and opportunities associated with Big Data,
guiding the development of technologies and methodologies to harness its potential
effectively.
a. Arcing Classifier: Arcing (adaptive resampling and combining) builds an ensemble sequentially, reweighting and resampling the training data so that later classifiers concentrate on instances that earlier classifiers misclassified.
Key Steps:
1. Initial Model Training: Train the first classifier on the original dataset.
2. Weight Adjustment: Increase weights for misclassified instances.
3. Resampling: Create a new training set by sampling with replacement, favoring
higher-weight instances.
4. Model Combination: Repeat the process, combining results from multiple
classifiers to improve overall accuracy.
Application: Used in tasks requiring high accuracy and robustness, such as image
recognition and fraud detection.
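A simplified arcing-style sketch in Python following these steps; the doubling weight-update rule and the base tree depth are illustrative simplifications:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.default_rng(0)
weights = np.ones(len(X)) / len(X)        # start with uniform instance weights
models = []

for _ in range(10):
    # Resample with replacement, favoring higher-weight (harder) instances.
    idx = rng.choice(len(X), size=len(X), replace=True, p=weights)
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[idx], y[idx])
    models.append(clf)
    # Increase weights of instances the current classifier misclassifies.
    miss = clf.predict(X) != y
    weights = weights * np.where(miss, 2.0, 1.0)
    weights = weights / weights.sum()

# Combine by majority vote across the classifiers.
votes = np.array([m.predict(X) for m in models])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble training accuracy:", (ensemble_pred == y).mean())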
b. Bagging Predictors: Bagging (bootstrap aggregating) trains multiple models independently on bootstrap samples of the data and aggregates their predictions to reduce variance.
Key Steps:
1. Bootstrap Sampling: Generate multiple training datasets by sampling with
replacement from the original dataset.
2. Independent Model Training: Train a separate model on each bootstrap
sample.
3. Aggregation: Combine predictions from all models, typically by averaging for
regression or majority voting for classification.
Application: Enhances the performance of unstable models like decision trees, making
it popular in random forests. Bagging is particularly effective when the model suffers
from high variance, providing more robust and reliable predictions.
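A short Python sketch of bagging decision trees by hand, mirroring the steps above; the number of trees and the synthetic dataset are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
rng = np.random.default_rng(1)
models = []

# 1-2. Bootstrap sampling and independent training of one tree per sample.
for _ in range(25):
    idx = rng.choice(len(X), size=len(X), replace=True)
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# 3. Aggregation: majority vote over the individual predictions.
votes = np.array([m.predict(X) for m in models])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("bagged training accuracy:", (bagged_pred == y).mean())

In practice, scikit-learn's BaggingClassifier and RandomForestClassifier provide ready-made implementations of this idea.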
Both arcing and bagging leverage the power of multiple models to achieve higher
accuracy and resilience, addressing the limitations of single models in complex
prediction tasks.