BD Imp Ques 1
UNIT 1
1.WHAT IS BIG DATA AND WHAT ARE ITS FEATURES 2M
A collection of datasets so huge and complex that it becomes difficult to process them using
traditional DBMS tools.
Big Data deals with large and complex datasets that can be structured, semi-structured, or
unstructured and will typically not fit into memory to be processed.
Features
Volume
Big Data is a vast ‘volume’ of data generated from many sources daily, such as machines, social
media platforms, etc.
Velocity
Velocity refers to the speed at which data is created, often in real time. It covers the rate at which
incoming data sets arrive, their rate of change, and bursts of activity. A primary concern of Big Data
is making rapidly arriving data available for processing quickly.
Variety
Big Data can be structured, unstructured, or semi-structured, collected from different sources. In
the past, data was collected only from databases and spreadsheets, but these days it arrives in
many forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Veracity
Veracity means how reliable and trustworthy the data is. Because data arrives from many sources,
it must be filtered, cleaned, and validated so that it can be handled and managed efficiently, which
is essential for business development.
Value
Value is an essential characteristic of big data. What matters is not the sheer amount of data we
process or store, but the valuable and reliable data that we store, process, and analyse.
3.WHAT ARE THE SOURCES OF BIG DATA AND HOW TO STORE AND PROCESS IT
Big Data has evolved over the years due to advancements in technology, the increasing amount of
data generation, and the need for efficient data management and analysis. The evolution can be
divided into the following stages:
Stage 1: Traditional data management
➢ Data was mostly structured and stored in relational databases (RDBMS) like MySQL, Oracle,
and SQL Server.
➢ Data processing was limited to small-scale datasets using manual queries and batch
processing.
➢ Businesses relied on data warehouses for analytics, but scalability was a challenge.
Stage 2: The rise of Big Data and Hadoop
➢ The internet, e-commerce, and social media led to a massive explosion of data.
➢ Traditional databases struggled to handle large, unstructured, and real-time data.
➢ Google introduced MapReduce (2004), and Apache Hadoop (2006) was developed for
distributed storage and parallel processing.
Stage 3: The modern Big Data era
➢ The rise of AI, Machine Learning (ML), and IoT led to even larger and more complex data
streams.
➢ Real-time processing (Apache Kafka, Flink, Storm) became essential for quick decision-
making.
➢ Edge computing and 5G enabled faster data processing closer to the source.
➢ Data privacy and security became critical due to strict regulations like GDPR and CCPA.
EXPLAIN THE USE CASES OF BIG DATA 6M
Big Data is transforming various industries by enabling better decision-making, automation, and
innovation.
Healthcare
➢ Predictive Analytics – AI-driven models analyze patient data to predict diseases (e.g., early
cancer detection).
➢ Genomics & Drug Discovery – Big Data helps in sequencing genomes and accelerating drug
research.
➢ Remote Patient Monitoring – IoT devices collect real-time health data for better patient
care.
Banking & Finance
➢ Fraud Detection – Banks use Big Data to analyze transaction patterns and detect fraud in
real time.
➢ Risk Management – Predicting credit risks and market fluctuations using machine learning
models.
➢ Algorithmic Trading – Automated stock trading based on real-time market trends.
Retail & E-commerce
➢ Personalized Recommendations – Platforms like Amazon and Flipkart use Big Data to suggest
products based on user behavior.
➢ Inventory & Supply Chain Management – Predicting demand and optimizing logistics using
data analytics.
➢ Customer Sentiment Analysis – Analyzing social media and reviews to understand consumer
preferences.
Transportation
➢ Traffic Management – Analyzing GPS and IoT data to optimize traffic flow and reduce
congestion.
➢ Public Transport Optimization – Big Data helps in scheduling and managing buses/trains
efficiently.
Energy & Utilities
➢ Energy Management – Smart grids analyze energy consumption patterns for better
distribution.
Media & Entertainment
➢ Content Recommendation – Netflix, YouTube, and Spotify use Big Data to suggest movies,
videos, and music.
➢ Audience Analytics – Understanding user engagement to improve content production.
➢ Ad Targeting – Big Data helps in showing personalized ads to users.
Manufacturing
➢ Predictive Maintenance – Sensors in machines detect issues before they fail, reducing
downtime.
➢ Quality Control – AI-driven analysis of production data to ensure quality products.
➢ Supply Chain Optimization – Using data analytics to enhance manufacturing and logistics.
2.EXPLAIN THE HADOOP ECOSYSTEM IN DETAIL
Hadoop Ecosystem
Apache Hadoop is an open-source framework designed for efficient storage and processing of Big
Data. It enables handling large datasets that traditional RDBMS cannot process effectively.
Supporting Components
HDFS (Hadoop Distributed File System)
➢ A distributed file system that splits data into blocks and stores them across multiple nodes.
➢ Provides high fault tolerance and scalability.
➢ Works efficiently with large datasets that do not fit into a single machine.
HBase (Hadoop Database)
➢ A NoSQL, column-oriented database that runs on top of HDFS and supports real-time
read/write access to large tables.
MapReduce
➢ The programming model for processing large datasets in parallel across the cluster (detailed below).
Hive
➢ A data-warehouse tool that provides an SQL-like query language (HiveQL) on top of Hadoop.
Pig
➢ A high-level scripting platform (Pig Latin) for transforming and analysing large datasets.
Mahout
➢ A library of scalable machine learning algorithms that run on Hadoop.
Avro
➢ A data serialization framework used to exchange data between Hadoop programs.
Sqoop
➢ A tool for transferring structured data between Hadoop and relational databases (RDBMS).
➢ Used for bulk imports and exports of data.
➢ Helps integrate traditional databases with Hadoop.
4.Data Management Layer
Oozie
➢ A workflow scheduler that manages and chains together Hadoop jobs.
Chukwa
➢ A data collection system for monitoring large distributed systems.
Flume
➢ A tool for ingesting and collecting streaming data (logs, events, sensor data, etc.).
➢ Works well for real-time analytics.
➢ Can transfer data to HDFS, HBase, or other storage systems.
ZooKeeper
➢ A centralized coordination service for distributed applications, handling configuration,
naming, and synchronization.
Benefits of the Hadoop Ecosystem
➢ Scalability
➢ Cost-Effective
➢ Fault Tolerance
Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to
store and maintain big data. Hadoop works on the MapReduce programming model, which was
introduced by Google.
Today many big-brand companies use Hadoop in their organizations to deal with big data, e.g.,
Facebook, Yahoo, Netflix, eBay, etc. Hadoop has four core components:
➢ MapReduce
➢ HDFS (Hadoop Distributed File System)
➢ YARN (Yet Another Resource Negotiator)
➢ Common Utilities or Hadoop Common
1.MapReduce
MapReduce is a programming model, based on the YARN framework, whose major feature is
distributed, parallel processing across a Hadoop cluster; this is what makes Hadoop so fast, since
serial processing is of little use when dealing with Big Data.
Map Phase – Data is processed in parallel by Map() functions, generating intermediate key-value
pairs.
Shuffle & Sort – Key-value pairs are grouped, shuffled, and sorted for processing.
Reduce Phase – Reduce() functions aggregate and process data to generate final results.
Output – The final processed data is stored in HDFS or other storage systems.
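To make the four phases concrete, here is a minimal pure-Python sketch that simulates them on a
handful of in-memory records (the sample data and variable names are illustrative assumptions,
not Hadoop APIs):

from itertools import groupby
from operator import itemgetter

# Sample input records (assumed data): (region, sale amount)
records = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 8)]

# Map phase: each record becomes an intermediate key-value pair.
mapped = [(region, amount) for region, amount in records]

# Shuffle & sort: pairs are sorted and grouped by key.
mapped.sort(key=itemgetter(0))
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}

# Reduce phase: values for each key are aggregated into the final result.
result = {region: sum(values) for region, values in grouped.items()}

print(result)  # {'east': 3, 'north': 17, 'south': 13}

In real Hadoop each phase runs in parallel across the cluster, with the shuffle moving data
between mapper and reducer nodes.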
➢ Parallel Processing
➢ Scalability
➢ Fault Tolerance
2.HDFS
HDFS is the storage layer of the Hadoop ecosystem, designed for fault-tolerance, high availability,
and scalability on commodity hardware.
NameNode (Master)
➢ Stores the file-system metadata (file names, permissions, and block locations) and
coordinates the DataNodes.
DataNode (Slave)
➢ Stores the actual data blocks and serves read/write requests from clients.
HDFS Operations
➢ File Block Division – HDFS splits files into large blocks (128 MB by default; configurable, e.g., 256 MB).
➢ Replication – Default replication factor = 3, storing each block on three nodes for fault
tolerance.
➢ Fault Tolerance – Ensures data availability even if nodes fail.
➢ Rack Awareness – Spreads replicas across racks to reduce network congestion and improve
fault tolerance.
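As a quick worked example of block division and replication (the 1 GB file size is an assumed value;
128 MB blocks and a replication factor of 3 are the HDFS defaults):

import math

file_size_mb = 1024        # assumed: a 1 GB file
block_size_mb = 128        # default HDFS block size
replication_factor = 3     # default HDFS replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication_factor

print(blocks)          # 8 blocks stored across the DataNodes
print(raw_storage_mb)  # 3072 MB of raw cluster storage for 1 GB of data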
3.YARN (Yet Another Resource Negotiator)
YARN is the framework on which MapReduce works. YARN performs two operations: job
scheduling and resource management. The purpose of the job scheduler is to divide a big task into
smaller jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing
can be maximized.
The job scheduler also keeps track of which jobs are important, which have higher priority, the
dependencies between jobs, and other information such as job timing. The resource manager
manages all the resources made available for running the Hadoop cluster.
Features of YARN
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4.Hadoop Common
Hadoop Common (the common utilities) is the set of Java libraries and files needed by all the other
components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce
to run the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common,
so failures must be handled automatically in software by the Hadoop framework.
Advantages of Hadoop
➢ Scalability
➢ Fault Tolerance
➢ Cost-Effective
Disadvantages of Hadoop
➢ Complexity
➢ High Latency
➢ Security Concerns
4.GIVE THE DIFFERENT HDFS OPERATIONS AND COMMANDS
HDFS OPERATIONS
HDFS (Hadoop Distributed File System) supports various operations for managing large datasets
efficiently. Key operations include creating, reading, writing, copying, moving, and deleting files
and directories.
Starting HDFS: a new cluster is formatted once with hdfs namenode -format, and the file system is
then started with the start-dfs.sh script.
HDFS commands help in managing files, directories, and data within the system. Commonly used
commands:
➢ hdfs dfs -ls / – lists files and directories
➢ hdfs dfs -mkdir /dir – creates a directory
➢ hdfs dfs -put local.txt /dir – copies a local file into HDFS
➢ hdfs dfs -get /dir/file.txt . – copies a file from HDFS to the local file system
➢ hdfs dfs -cat /dir/file.txt – displays a file's contents
➢ hdfs dfs -rm /dir/file.txt – deletes a file
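These shell commands can also be driven from a script; a minimal Python sketch using subprocess
(assumes a running Hadoop installation with hdfs on the PATH; the paths and file names are made up):

import subprocess

def hdfs(*args):
    # Runs an 'hdfs dfs' command and returns its output as text.
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-mkdir", "-p", "/user/data")           # create a directory
hdfs("-put", "local.txt", "/user/data/")     # upload a local file
print(hdfs("-ls", "/user/data"))             # list directory contents
print(hdfs("-cat", "/user/data/local.txt"))  # read a file back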
UNIT 2
1.WHY IS MAPREDUCE USED IN BIG DATA 2M
MapReduce is used in Big Data to process and analyse massive datasets efficiently across distributed
systems. It enables parallel computation, making it ideal for handling petabytes of data.
Map(): Processes input data in parallel and generates intermediate key-value pairs.
Reduce(): Aggregates and processes the mapped data to generate the final output.
➢ Parallel Processing
➢ Scalability
➢ Fault Tolerance
Anatomy of a File Write in HDFS
➢ Client Request – the client asks the NameNode to create the file.
➢ Block Allocation – the NameNode allocates blocks and chooses DataNodes for each replica.
➢ Pipeline Setup – the chosen DataNodes form a write pipeline.
➢ Replication – data is streamed to the first DataNode and forwarded along the pipeline.
➢ Acknowledgment – acknowledgments flow back through the pipeline to the client.
➢ Completion – the client informs the NameNode that the write is complete.
Anatomy of a File Read in HDFS
➢ Client Request – the client asks the NameNode for the file's block locations.
➢ Block Location – the NameNode returns the list of DataNodes holding each block.
➢ Nearest DataNode Selection – the client reads each block from the closest replica.
➢ Sequential Read – blocks are read one after another and streamed to the client.
➢ Reconstruction – the blocks are combined to reconstruct the original file.
➢ Completion – the read finishes and the connections to the DataNodes are closed.
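A purely conceptual Python sketch of the write pipeline (the node names are made up; real HDFS
streams packets between DataNodes over the network):

# The client sends a block to the first DataNode, which forwards it to the
# second, which forwards it to the third; acknowledgments flow back.
pipeline = ["datanode1", "datanode2", "datanode3"]  # assumed node names

def write_block(block, nodes):
    for node in nodes:                   # data flows forward along the pipeline
        print(f"{node} stored {block}")
    for node in reversed(nodes):         # acks flow backward to the client
        print(f"ack received from {node}")

write_block("block-0 (128 MB)", pipeline)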
3.WHAT IS SPARK
Apache Spark is an open-source, distributed processing system used for big data analytics. It utilizes
in-memory caching and optimized query execution for fast analytic queries against data of any size.
It provides development APIs in Java, Scala, Python, and R, and supports code reuse across multiple
workloads: batch processing, interactive queries, real-time analytics, machine learning, and graph
processing. A minimal example is shown after the feature list below.
Features of Spark:
➢ Swift Processing
➢ Reusability
➢ Real-Time Stream Processing
➢ Cost Efficient
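A minimal PySpark sketch showing in-memory caching and two different query styles on the same
cached data (assumes the pyspark package is installed; the sample rows are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

# A small in-memory dataset; cache() keeps it in memory so several
# queries can reuse it, which is where Spark's speed comes from.
data = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)], ["name", "age"])
data.cache()

data.filter(data.age > 30).show()            # interactive-style query
print(data.groupBy().avg("age").collect())   # batch-style aggregation

spark.stop()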
4.WHY IS APACHE SPARK STREAMING USED
Apache Spark Streaming is a scalable, fault-tolerant real-time data processing system that extends
the core Spark API. It supports both batch and streaming workloads and processes real-time data
from sources like Kafka, Flume, and Amazon Kinesis. The processed data can be stored in file
systems, databases, or live dashboards.
DStream
➢ A Discretized Stream (DStream) represents a continuous stream of data divided into small
batches.
➢ DStreams are built on RDDs (Resilient Distributed Datasets), enabling seamless integration
with Spark MLlib, Spark SQL, and other components.
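A minimal DStream word-count sketch (assumes pyspark; the host and port are placeholders that
need a text source such as nc -lk 9999, and note that recent Spark versions recommend Structured
Streaming over the DStream API shown here):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

# Each micro-batch of incoming lines becomes an RDD inside the DStream.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                               # print each batch's counts

ssc.start()
ssc.awaitTermination()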
Apache Mahout is an open-source project for building scalable machine learning algorithms. The
name comes from "mahout", a person who rides an elephant, reflecting its close association with
Apache Hadoop, whose logo is an elephant.
ML Techniques in Mahout
➢ Mahout mainly provides recommendation (collaborative filtering), clustering, and
classification algorithms.
➢ Mahout follows the "bring the math to the data" approach, reducing data movement for
better performance.
➢ It runs on Hadoop (MapReduce) and also supports Apache Spark and Apache Flink for
distributed processing.
Applications of Mahout
➢ Recommender systems that suggest products, movies, or music to users.
➢ Clustering documents or users into related groups.
➢ Classification tasks such as spam filtering.
Hadoop MapReduce
Hadoop MapReduce is a parallel processing framework for handling large-scale data across
distributed clusters in a fault-tolerant manner. It processes multi-terabyte datasets efficiently by
dividing tasks into Map and Reduce stages.
The MapReduce Word Count program calculates the frequency of each word in a text file stored in
HDFS.
1)Input Splits
➢ The input dataset is divided into smaller chunks called input splits.
➢ Each split is assigned to an individual Map task.
2)Mapping
➢ Each Map task reads its split line by line and emits a (word, 1) pair for every word.
3)Shuffling
➢ Pairs with the same word are grouped, sorted, and routed to the same Reducer.
4)Reducing
➢ The Reducer function takes the shuffled data and aggregates values.
➢ It sums up the occurrences of each word.
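Under Hadoop Streaming the word-count program is often written as two small Python scripts,
shown below as a minimal sketch (the file names are illustrative, and the streaming jar path varies
by installation):

# mapper.py: reads lines from stdin and emits one "word<TAB>1" pair per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py: the shuffled input arrives sorted by word, so a running
# count can be flushed each time the word changes
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

The job would be submitted with something like hadoop jar hadoop-streaming.jar -input /in
-output /out -mapper mapper.py -reducer reducer.py, with the exact jar path depending on the
installation.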
SOLVE MATRIX MULTIPLICATION 4M
➢ Here matrix A is a 2×2 matrix which means the number of rows(i)=2 and the number of
columns(j)=2
➢ Matrix B is also a 2×2 matrix where number of rows(j)=2 and number of columns(k)=2.
➢ Each cell of the matrices is labelled as Aij and Bjk.
➢ Now One step matrix multiplication has 1 mapper and 1 reducer.
➢ The formula is:
➢ Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
➢ Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i
# Since all dimensions here are 2, when k=1, i can take the two values 1 and 2, and each case can
have two further values j=1 and j=2; the same enumeration is repeated for k=2, so every element is
emitted under each (i, k) key it contributes to.
Reducer: for each key (i, k), make a sorted Alist and Blist, then compute Summation (Aij * Bjk) over j.
Output => ((i, k), sum)
From the Mapper computation, four keys are common: (1, 1), (1, 2), (2, 1) and (2, 2).
Make a separate list for Matrix A and B with the adjoining values taken from the Mapper step
above; a worked sketch follows below.
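A pure-Python simulation of the one-step mapper and reducer above, using assumed 2×2 matrices
(the element values are made up for illustration):

from collections import defaultdict

# Assumed input matrices, indexed from 1 as in the formulas above.
A = {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}
B = {(1, 1): 5, (1, 2): 6, (2, 1): 7, (2, 2): 8}
n = 2  # every dimension is 2 here

# Mapper: A(i,j) is emitted under (i,k) for all k; B(j,k) under (i,k) for all i.
pairs = defaultdict(list)
for (i, j), a in A.items():
    for k in range(1, n + 1):
        pairs[(i, k)].append(("A", j, a))
for (j, k), b in B.items():
    for i in range(1, n + 1):
        pairs[(i, k)].append(("B", j, b))

# Reducer: for each key (i,k), match Alist and Blist on j and sum Aij * Bjk.
for (i, k), values in sorted(pairs.items()):
    alist = {j: v for tag, j, v in values if tag == "A"}
    blist = {j: v for tag, j, v in values if tag == "B"}
    total = sum(alist[j] * blist[j] for j in alist)
    print(((i, k), total))   # e.g. ((1, 1), 19) since 1*5 + 2*7 = 19

The printed pairs ((1,1),19), ((1,2),22), ((2,1),43), ((2,2),50) are exactly the cells of the product
matrix A × B for these assumed values.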
YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the
bottleneck on the Job Tracker that was present in Hadoop 1.0.
YARN Components
➢ Client – submits MapReduce jobs/applications.
➢ Resource Manager – the master daemon that manages and allocates cluster resources.
➢ Node Manager – the slave daemon on each node that manages containers and reports to the
Resource Manager.
➢ Application Master – negotiates resources for a single application and tracks its progress.
➢ Container – a bundle of physical resources (RAM, CPU cores) on a single node.
YARN Application Workflow
1. The client submits an application to the Resource Manager.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch the containers.
6. The application code is executed in the containers.
7. The client contacts the Resource Manager/Application Master to monitor the application's
status.
8. Once the processing is complete, the Application Master un-registers with the Resource
Manager.
YARN Features
➢ Scalability
➢ Multi-tenancy
Advantages
➢ Flexibility
➢ Scalability
➢ Improved Performance
Disadvantages
➢ Complexity
➢ Overhead
➢ Limited Support
3.EXPLAIN THE SPARK ARCHITECTURE 8M
a) Driver Program
➢ The driver runs the application's main() function and creates the SparkContext/SparkSession.
➢ It converts the user program into tasks and schedules them on the executors.
b) Cluster Manager
➢ Allocates resources across the cluster; Spark can run on Standalone, YARN, Mesos, or
Kubernetes cluster managers. A configuration sketch follows the feature list below.
c) Executors
➢ Executors are distributed agents responsible for executing tasks on worker nodes.
➢ Every Spark application gets its own executors, which:
1. Execute tasks assigned by the driver.
2. Store intermediate computation results in memory (for caching).
3. Write output to HDFS, databases, or other storage systems.
➢ Fast Processing
➢ Fault Tolerance
➢ Scalability
➢ Multiple Language Support
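How the driver, cluster manager, and executors fit together is visible at the application entry
point; a minimal sketch (the resource values are assumed examples, and local[*] stands in for a
real cluster manager URL such as yarn):

from pyspark.sql import SparkSession

# The driver runs this code; the master URL selects the cluster manager,
# and the executor settings size the agents that actually run the tasks.
spark = (SparkSession.builder
         .appName("architecture-demo")
         .master("local[*]")
         .config("spark.executor.memory", "2g")   # memory per executor
         .config("spark.executor.cores", "2")     # cores per executor
         .getOrCreate())

rdd = spark.sparkContext.parallelize(range(1, 101), numSlices=4)
print(rdd.sum())   # tasks run on executors; the driver collects 5050

spark.stop()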
MLlib is the machine learning (ML) library of Apache Spark. It makes practical machine learning
scalable and easy.
1.Basic statistics
➢ Provides descriptive statistics such as mean, variance, correlation, and standard deviation.
➢ Example: Finding correlation between product sales and advertisement spending.
2.Classification and regression
➢ Supervised learning algorithms such as logistic regression, decision trees, and random forests.
3.Clustering
➢ Groups similar, unlabelled data points together (e.g., the K-Means algorithm).
4.Collaborative filtering
➢ Recommends items from user–item interaction data (e.g., the ALS algorithm); a minimal
sketch follows below.
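A minimal MLlib clustering sketch (assumes pyspark; the 2-D points are made-up data chosen so
that two clear clusters exist):

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("mllib-demo").master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)], ["x", "y"])

# MLlib estimators expect a single vector column of features.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=1).fit(features)    # cluster into two groups
model.transform(features).select("x", "y", "prediction").show()

spark.stop()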