Data Analytics Assignment

BANSAL INSTITUTE OF ENGINEERING AND TECHNOLOGY
(Affiliated to Dr. APJ Abdul Kalam Technical University, Lucknow)
Even Semester (2022-23)
Assignment – 1

Programme: B.Tech. - IT    Semester: 6th    M.M.: 10
Course Code: 13    Subject: Data Analytics (KIT601)    Date: 02/03/2023
Knowledge Level (KL): KL1 - Remembering, KL2 - Understanding, KL3 - Applying, KL4 - Analyzing, KL5 - Evaluating, KL6 - Creating
Date of Assignment: 02/03/2023    Date of Submission: 10/03/2023
1. What is data analytics? What are the various sources of data? Explain their characteristics. (CO1, KL1)
2. Explain the different classifications of data with their advantages and disadvantages. (CO1, KL2)
3. What is Big Data, and what are the sources of Big Data? Explain their uses and applications. (CO1, KL2)
4. Explain the different types of data analytics, with examples. (CO1, KL1)
5. Explain the various phases of the data analytics life cycle.
Assignment of Data Analytics (KIT-601)
A data stream refers to a continuous, unbounded flow of data points generated over
time. Unlike traditional static datasets, data streams are dynamic and require real-time
processing to extract useful information. This constant flow can originate from various
sources like sensors, social media, financial markets, and network logs. Common sources and applications include:
1. Financial Markets: Real-time processing of stock prices and trading data helps in
making quick investment decisions and detecting market trends or anomalies.
2. Social Media: Analyzing real-time posts, comments, and user interactions to
understand trends, sentiment analysis, and customer feedback.
3. Sensor Networks: Continuous monitoring of environmental data (e.g.,
temperature, humidity, pollution levels) for applications in weather forecasting,
disaster management, and smart cities.
4. Network Traffic Monitoring: Real-time analysis of network traffic to detect
security threats, optimize performance, and manage bandwidth.
5. IoT (Internet of Things): Collecting and analyzing data from smart devices and
sensors in real-time to optimize operations, enhance user experiences, and
enable predictive maintenance.
Data streams enable timely insights and actions, crucial for applications where delayed
analysis could result in missed opportunities or increased risks.
The stream model involves continuous, ordered data sequences processed under
constraints of memory and latency, making it necessary to use algorithms that operate
in a single pass with incremental updates and approximations. It suits applications
requiring immediate processing and responses, as storing the entire stream is
impractical. A typical stream-processing architecture consists of the following components:
1. Data Source: Origin of the data streams, such as sensors, social media feeds, or
financial transactions.
2. Ingestion Layer: Initial point where data enters the system, involving preliminary
filtering, cleaning, and sometimes pre-processing to manage the high volume
and velocity.
3. Stream Processing Engine: Core component performing real-time processing
using algorithms designed for fast, efficient handling of streaming data. This
includes transformations, aggregations, and windowed operations.
4. Storage Layer: For temporary or permanent storage of processed or raw data,
often utilizing distributed databases or file systems for scalability.
5. Query Processor: Allows users to execute queries on streaming data, providing
real-time insights with possible approximate answers due to processing
constraints.
6. Output Interface: Delivers the processed results to users or systems, often via
dashboards, alerts, or API endpoints, facilitating real-time decision-making.
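A minimal sketch of such a pipeline in Python, assuming a simple in-memory setup; the sensor source, the plausibility filter, and the window size are illustrative choices, not part of any particular platform:

import random
import time
from collections import deque

def sensor_source(n_events=20):
    """Data source: emits (timestamp, temperature) readings."""
    for _ in range(n_events):
        yield (time.time(), round(random.uniform(15.0, 35.0), 2))

def ingest(stream):
    """Ingestion layer: basic filtering/cleaning of raw events."""
    for ts, temp in stream:
        if 0.0 <= temp <= 60.0:          # drop implausible readings
            yield ts, temp

def process(stream, window_size=5):
    """Stream processing engine: windowed aggregation (moving average)."""
    window = deque(maxlen=window_size)
    for ts, temp in stream:
        window.append(temp)
        yield ts, sum(window) / len(window)

# Output interface: print rolling averages as they are produced.
for ts, avg in process(ingest(sensor_source())):
    print(f"{ts:.0f}  moving_avg={avg:.2f}")

In production, the ingestion and processing layers would typically be backed by a distributed streaming platform rather than in-process generators.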
The DGIM (Datar-Gionis-Indyk-Motwani) algorithm approximates the number of 1s in the last N bits of a binary stream using a small number of exponentially sized buckets. Steps:
1. Buckets: Each bucket represents a block of 1s and is characterized by its size and
a timestamp marking the position of its last element in the stream. The size of
buckets is always a power of two.
2. Merging Buckets: When two buckets of the same size appear consecutively, they
merge into a single bucket of double the size, retaining the timestamp of the
more recent bucket. This maintains a logarithmic number of buckets relative to
the window size.
3. Estimation: To estimate the count of 1s in the last W bits, sum the sizes of all
buckets within the window. For the last bucket partially included in the window,
add half its size for a close approximation.
This method allows for efficient counting with reduced memory usage, making it
suitable for applications needing real-time analytics on high-velocity binary streams.
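A compact Python sketch of this bookkeeping, assuming the common DGIM formulation that keeps at most two buckets of each size; the bit stream and window length below are illustrative:

def dgim_buckets(bits, N):
    """Process a 0/1 stream, keeping DGIM buckets as (end_time, size) pairs, newest first."""
    buckets = []
    for t, bit in enumerate(bits):
        # Expire buckets that fall outside the window of the last N bits.
        buckets = [(end, size) for end, size in buckets if end > t - N]
        if bit == 1:
            buckets.insert(0, (t, 1))
            size = 1
            # Merge the two oldest buckets of a size when a third appears.
            while sum(1 for _, s in buckets if s == size) > 2:
                idx = [i for i, (_, s) in enumerate(buckets) if s == size][-2:]
                merged = (buckets[idx[0]][0], size * 2)   # keep newer timestamp
                buckets = [b for i, b in enumerate(buckets) if i not in idx]
                buckets.insert(idx[0], merged)
                size *= 2
    return buckets

def estimate_ones(buckets):
    """Sum all bucket sizes, counting only half of the oldest bucket."""
    if not buckets:
        return 0
    return sum(size for _, size in buckets[:-1]) + buckets[-1][1] // 2

bits = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b = dgim_buckets(bits, N=12)
print(b, "estimated ones in last 12 bits ~", estimate_ones(b))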
Counting distinct elements in a data stream is challenging due to the continuous and
potentially infinite nature of the stream. Probabilistic algorithms like HyperLogLog or
Flajolet-Martin (FM) are commonly used to estimate the number of unique elements
with high accuracy and low memory usage. These algorithms use hash functions to map
elements into a fixed-size data structure, allowing efficient approximate counting by
leveraging the properties of hash collisions. This approach is essential for applications
like network traffic analysis, web analytics, and database query optimization, where exact
counting is impractical due to the volume and velocity of data.
Finding the most popular elements in a data stream can be managed using a decaying
window approach, which assigns weights to elements that decrease over time, ensuring
recent data has more influence. This is implemented using techniques like Exponential
Decay, where each element's contribution is multiplied by an exponentially decaying
factor, making older data progressively less significant. This method is useful in
applications like recommendation systems, fraud detection, and dynamic content
prioritization, where trends and patterns must reflect the most current data, balancing
between historical context and recent activities.
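A small Python sketch of exponentially decayed popularity scores; the decay factor and the click stream are illustrative values:

def decayed_counts(stream, decay=0.99):
    """Maintain an exponentially decaying score per item: on every arrival,
    all scores are multiplied by `decay` and the arriving item gets +1."""
    scores = {}
    for item in stream:
        for key in scores:
            scores[key] *= decay          # older contributions fade
        scores[item] = scores.get(item, 0.0) + 1.0
        # Optionally drop items whose score has become negligible.
        scores = {k: v for k, v in scores.items() if v > 1e-6}
    return scores

clicks = ["A", "A", "B", "A", "C", "B", "B", "B", "C", "B"]
scores = decayed_counts(clicks, decay=0.9)
print(sorted(scores.items(), key=lambda kv: -kv[1]))   # most popular first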
In Big Data systems, Bloom filters are used to manage and query large datasets efficiently, typically to determine quickly whether an element is present in a set; they may return false positives but never false negatives.
Example:
For instance, in a spam detection system, Bloom Filters can quickly verify if an email
sender's address is from a known spam list. This technique saves memory and speeds up
queries significantly, especially in large-scale applications.
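A minimal Bloom filter sketch in Python, assuming a small bit array and k positions derived from a single SHA-256 digest; the sizes and hashing scheme are illustrative:

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bit positions from one digest (illustrative scheme).
        digest = hashlib.sha256(item.encode()).hexdigest()
        return [int(digest[i * 8:(i + 1) * 8], 16) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

spam = BloomFilter()
spam.add("known.spammer@example.com")
print(spam.might_contain("known.spammer@example.com"))  # True
print(spam.might_contain("friend@example.com"))          # almost surely False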
The FM algorithm estimates the number of distinct elements in a data stream using
probabilistic counting. It leverages the properties of hash functions to create a bitmap
representing the presence of elements.
1. Hashing: Each element in the stream is hashed to a long bit string using a uniform hash function.
2. Bit Patterns: Identify the position of the least significant 1-bit in the hashed
values.
3. Estimation: Use the maximum observed position to estimate the number of
distinct elements using a logarithmic transformation.
Applications: Estimating the number of unique visitors in web analytics, counting distinct IP addresses in network monitoring, and cardinality estimation for database query optimization.
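A tiny Flajolet-Martin sketch in Python, assuming a single MD5-based hash function and the simple 2^R estimate (practical versions average many hash functions and apply a correction factor):

import hashlib

def trailing_zeros(x):
    """Position of the least significant 1-bit (number of trailing zeros)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def fm_estimate(stream):
    """Flajolet-Martin style estimate of the number of distinct elements."""
    max_r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r          # often divided by a correction factor (~0.77)

stream = [1, 2, 3, 2, 1, 4, 5, 6, 7, 8, 9, 9, 3, 10, 11, 12]
print("estimated distinct elements:", fm_estimate(stream))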
The AMS (Alon-Matias-Szegedy) algorithm estimates the second frequency moment of a data stream (F2, the sum of squared element frequencies), a measure of the skew of the distribution that is useful for finding heavy hitters and understanding how the data is distributed.
1. Sketching: Maintain a sketch of the stream using random variables and hash
functions.
2. Estimation: Compute an unbiased estimate of the second moment from the
sketch.
Applications: Estimating join sizes in database query optimization, measuring skew in network traffic distributions, and supporting heavy-hitter detection.
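A simplified AMS sketch in Python for F2, assuming the stream length is known in advance and only a handful of random positions are sampled; both assumptions are for illustration:

import random
from collections import Counter

def ams_f2_estimate(stream, num_vars=10, seed=0):
    """Estimate F2 = sum over items of (frequency)^2 using AMS variables."""
    random.seed(seed)
    n = len(stream)
    positions = [random.randrange(n) for _ in range(num_vars)]
    estimates = []
    for p in positions:
        element = stream[p]
        # Count occurrences of `element` from position p to the end of the stream.
        c = sum(1 for x in stream[p:] if x == element)
        estimates.append(n * (2 * c - 1))     # unbiased estimator of F2
    return sum(estimates) / len(estimates)

stream = list("abcabdacdabdcaab")
exact = sum(f * f for f in Counter(stream).values())
print("exact F2:", exact, " AMS estimate:", ams_f2_estimate(stream))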
Technologies: Widely used stream-processing and real-time analytics platforms include Apache Kafka, Apache Spark Streaming, Apache Flink, and Apache Storm.
These technologies enable businesses to analyze data in motion, supporting use cases
like fraud detection, dynamic pricing, recommendation systems, and real-time
monitoring.
9. Explain the Three Categories of Prediction Methodologies.
1. Supervised Learning: Models are trained on labeled examples to predict a known target, as in regression and classification (e.g., predicting house prices or classifying emails as spam).
2. Unsupervised Learning: Models uncover hidden structure in unlabeled data, as in clustering and dimensionality reduction (e.g., customer segmentation).
3. Reinforcement Learning: An agent learns a policy by interacting with an environment and receiving rewards or penalties for its actions (e.g., recommendation ranking, game playing, robotics).
b. Nature of Data:
Data originates from various sources like sensors, transactional systems, social media,
and IoT devices. Understanding the nature of data is crucial for selecting appropriate
storage, processing, and analytical techniques, enabling effective handling of diverse
data types and sources.
Support Vector Machines (SVMs) are supervised learning models used for classification
and regression tasks. The key steps in SVM-based inference are:
1. Data Preparation:
Collect and preprocess labeled data, ensuring it’s suitable for training (e.g.,
normalization, handling missing values).
3. Kernel Selection:
Choose a kernel function (e.g., linear, polynomial, or RBF) that maps the
data into a feature space where the classes become separable.
4. Model Training:
Train the SVM model by finding the hyperplane that maximizes the margin
between different classes. This involves solving a quadratic optimization
problem to identify support vectors and define the decision boundary.
5. Model Validation:
Evaluate the trained model on held-out data (e.g., via cross-validation),
using metrics such as accuracy, precision, and recall, and tune
hyperparameters like C and gamma accordingly.
6. Inference:
Use the trained SVM model to classify new data points based on the
learned decision boundary, predicting labels for unseen instances.
SVMs are powerful for handling complex classification problems, particularly when the
data is not linearly separable, making them widely applicable in image recognition,
bioinformatics, and text classification.
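A brief sketch of these steps using scikit-learn's SVC on a toy dataset; the dataset, kernel choice, and parameter values are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Data preparation: load labeled data, split, and normalize features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3-4. Kernel selection and training: RBF kernel, margin-maximizing fit.
model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# 5. Validation: evaluate on held-out data.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6. Inference: predict labels for new, unseen measurements.
new_sample = scaler.transform([[5.1, 3.5, 1.4, 0.2]])
print("predicted class:", model.predict(new_sample)[0])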
Bayesian inference is a statistical method that updates the probability estimate for a
hypothesis as more evidence becomes available. It relies on Bayes' theorem, which
describes the relationship between conditional probabilities.
Key Concepts:
1. Prior Probability: The initial belief about a hypothesis before observing any data.
2. Likelihood: The probability of observing the data given the hypothesis.
3. Posterior Probability: The updated probability of the hypothesis after
considering the evidence.
Applications: Spam filtering, medical diagnosis, A/B testing, risk assessment, and parameter estimation in machine learning models.
Bayesian inference provides a flexible framework for incorporating new data, handling
uncertainty, and making probabilistic predictions, making it essential in various fields
such as finance, engineering, and artificial intelligence.
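A small worked example of Bayes' theorem, P(H|D) = P(D|H) * P(H) / P(D), using assumed numbers for a diagnostic test:

# Hypothesis H: a person has a condition; D: the test is positive.
prior = 0.01            # P(H): 1% prevalence (assumed)
likelihood = 0.95       # P(D|H): test sensitivity (assumed)
false_positive = 0.05   # P(D|not H): false-positive rate (assumed)

evidence = likelihood * prior + false_positive * (1 - prior)   # P(D)
posterior = likelihood * prior / evidence                      # P(H|D)
print(f"posterior probability of the condition given a positive test: {posterior:.3f}")
# ~0.161: the positive test raises the probability from 1% to about 16%.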
1. Descriptive Inference:
Objective: Summarize and characterize what has already happened in the
observed data.
Techniques: Summary statistics, exploratory visualization, cross-tabulation.
Applications: Performance dashboards, survey analysis, quality monitoring.
2. Predictive Inference:
Objective: Use historical data to estimate future or unseen outcomes.
Techniques: Regression models, time-series forecasting, machine learning
classifiers.
Applications: Demand forecasting, churn prediction, credit risk scoring.
3. Prescriptive Inference:
Objective: Recommend actions based on predictive insights to achieve
desired outcomes.
Techniques: Optimization algorithms, simulation models, decision
analysis.
Applications: Supply chain optimization, resource allocation, strategic
planning.
Each type of inference plays a critical role in transforming raw data into actionable
insights, enabling organizations to make informed decisions, anticipate future trends,
and optimize operations.
Steps:
1. Draw many resamples of the same size as the original dataset by sampling with replacement.
2. Compute the statistic of interest (e.g., mean, median, or model accuracy) on each resample.
3. Use the distribution of the resampled statistics to estimate standard errors and confidence intervals.
Importance:
Bootstrapping is essential in data science and machine learning for model validation,
hypothesis testing, and constructing confidence intervals, enhancing the reliability of
inferential statistics and decision-making.
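A short Python sketch of a percentile bootstrap confidence interval; the dataset and number of resamples are illustrative:

import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    random.seed(seed)
    stats = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in data]   # sample with replacement
        stats.append(stat(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return stat(data), (lo, hi)

data = [12, 15, 14, 10, 18, 21, 13, 16, 17, 15, 14, 19]
estimate, (lo, hi) = bootstrap_ci(data)
print(f"sample mean = {estimate:.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")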
15. What is Sampling and Sampling Distribution? Give a Detailed
Analysis.
Sampling is the process of selecting a representative subset of individuals or observations from a population in order to estimate characteristics of the whole population.
Types of Sampling: Probability methods include simple random sampling, stratified sampling, systematic sampling, and cluster sampling; non-probability methods include convenience and quota sampling.
Sampling Distribution: The probability distribution of a statistic (such as the sample mean) computed over repeated samples of the same size drawn from the population.
Detailed Analysis: By the Central Limit Theorem, the sampling distribution of the sample mean approaches a normal distribution as the sample size grows, with standard error σ/√n; this result underpins confidence intervals and hypothesis tests.
Importance:
Efficiency: Provides a way to draw inferences about populations with less cost
and effort.
Practicality: Enables analysis of large populations where a full census is
impractical.
Inferential Power: Facilitates understanding of the variability and reliability of
sample estimates.
Sampling and sampling distributions are fundamental concepts in statistics, ensuring
that conclusions drawn from samples are generalizable to the population, thereby
underpinning many empirical research and data analysis methodologies.
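A small Python simulation of the sampling distribution of the mean, assuming an illustrative skewed population:

import random
import statistics

random.seed(1)
# Population: a skewed (exponential-like) distribution of 100,000 values.
population = [random.expovariate(1 / 50) for _ in range(100_000)]

# Sampling distribution of the mean: draw many samples of size n, record each mean.
n, trials = 40, 2000
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(trials)]

print("population mean:", round(statistics.mean(population), 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("standard error (observed):", round(statistics.stdev(sample_means), 2))
print("standard error (sigma/sqrt(n)):", round(statistics.stdev(population) / n ** 0.5, 2))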
Analysis:
Objective: Explore and interpret data to uncover patterns, relationships, and
root causes.
Approach: Investigative, involving statistical methods, modeling, and
hypothesis testing.
Outcome: Generates insights and recommendations that explain why metrics
behave as they do.
Reporting:
Objective: Present data and findings in a structured format for review and
decision-making.
Approach: Descriptive, involving the use of dashboards, summaries, and
visualizations.
Outcome: Provides a snapshot of key metrics and performance indicators, aiding
in monitoring and assessment.
While analysis focuses on extracting and understanding insights from data, reporting is
about communicating these insights effectively to stakeholders, making both essential
for data-driven decision-making.
Prediction Error is the discrepancy between actual values and predicted values
produced by a model. It’s crucial for assessing model accuracy and guiding
improvements.
Regression Techniques:
1. Linear Regression:
Objective: Model the relationship between a dependent variable and one
or more independent variables using a linear equation.
Application: Predicting outcomes like sales, costs, or risk scores.
2. Multiple Regression:
Objective: Extend linear regression to include multiple independent
variables.
Application: Complex predictions involving several factors, such as
marketing impact analysis.
3. Polynomial Regression:
Objective: Model non-linear relationships by fitting a polynomial equation
to the data.
Application: Predicting growth curves, trends in scientific data.
4. Ridge and Lasso Regression:
Objective: Address multicollinearity and overfitting by adding
regularization terms to the regression equation.
Application: Predictive modeling in high-dimensional datasets.
Prediction error metrics such as mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE) guide the selection and tuning of regression models, ensuring their effectiveness and reliability in making accurate predictions across various applications.
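A brief Python sketch comparing these error metrics across linear, ridge, and lasso regression on synthetic data; the data-generating process and regularization strengths are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Illustrative data: advertising spend (x) vs. sales (y) with noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 1))
y = 3.0 * X[:, 0] + 20 + rng.normal(0, 15, size=200)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    pred = model.fit(X, y).predict(X)
    mse = mean_squared_error(y, pred)
    print(f"{name:6s}  MSE={mse:8.2f}  RMSE={mse ** 0.5:6.2f}  "
          f"MAE={mean_absolute_error(y, pred):6.2f}")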
1. Volume: The sheer scale of data generated, often ranging from terabytes to petabytes, which demands distributed storage and processing.
2. Velocity: The speed at which data is generated and must be ingested and analyzed, often in real time or near real time.
3. Variety: The diversity of data formats, spanning structured, semi-structured, and unstructured sources such as text, images, and logs.
4. Veracity: The trustworthiness and quality of data, including issues of noise, bias, inconsistency, and missing values.
5. Value:
The potential insights and benefits derived from analyzing big data.
Extracting value involves identifying relevant data, applying advanced
analytics, and translating findings into actionable business strategies,
driving competitive advantage and innovation.
These characteristics define the complexities and opportunities associated with Big Data,
guiding the development of technologies and methodologies to harness its potential
effectively.
a. Arcing Classifier: Arcing (adaptive resampling and combining) builds an ensemble sequentially, reweighting and resampling the training data so that later classifiers concentrate on instances that earlier classifiers misclassified.
Key Steps:
1. Initial Model Training: Train the first classifier on the original dataset.
2. Weight Adjustment: Increase weights for misclassified instances.
3. Resampling: Create a new training set by sampling with replacement, favoring
higher-weight instances.
4. Model Combination: Repeat the process, combining results from multiple
classifiers to improve overall accuracy.
Application: Used in tasks requiring high accuracy and robustness, such as image
recognition and fraud detection.
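A simplified arcing-style sketch in Python following these steps; the doubling weight-update rule and the base tree depth are illustrative simplifications:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.default_rng(0)
weights = np.ones(len(X)) / len(X)        # start with uniform instance weights
models = []

for _ in range(10):
    # Resample with replacement, favoring higher-weight (harder) instances.
    idx = rng.choice(len(X), size=len(X), replace=True, p=weights)
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[idx], y[idx])
    models.append(clf)
    # Increase weights of instances the current classifier misclassifies.
    miss = clf.predict(X) != y
    weights = weights * np.where(miss, 2.0, 1.0)
    weights = weights / weights.sum()

# Combine by majority vote across the classifiers.
votes = np.array([m.predict(X) for m in models])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble training accuracy:", (ensemble_pred == y).mean())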
b. Bagging Predictors: Bagging (bootstrap aggregating) trains multiple models independently on bootstrap samples of the data and aggregates their predictions to reduce variance.
Key Steps:
1. Bootstrap Sampling: Generate multiple training datasets by sampling with
replacement from the original dataset.
2. Independent Model Training: Train a separate model on each bootstrap
sample.
3. Aggregation: Combine predictions from all models, typically by averaging for
regression or majority voting for classification.
Application: Enhances the performance of unstable models like decision trees, making
it popular in random forests. Bagging is particularly effective when the model suffers
from high variance, providing more robust and reliable predictions.
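A short Python sketch of bagging decision trees by hand, mirroring the steps above; the number of trees and the synthetic dataset are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
rng = np.random.default_rng(1)
models = []

# 1-2. Bootstrap sampling and independent training of one tree per sample.
for _ in range(25):
    idx = rng.choice(len(X), size=len(X), replace=True)
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# 3. Aggregation: majority vote over the individual predictions.
votes = np.array([m.predict(X) for m in models])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("bagged training accuracy:", (bagged_pred == y).mean())

In practice, scikit-learn's BaggingClassifier and RandomForestClassifier provide ready-made implementations of this idea.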
Both arcing and bagging leverage the power of multiple models to achieve higher
accuracy and resilience, addressing the limitations of single models in complex
prediction tasks.