
BDA

Q. What is big data?


Ans: Big data refers to extremely large and diverse collections of structured,
unstructured, and semi-structured data that continue to grow exponentially over time.
These datasets are so huge and complex in volume, velocity, and variety that
traditional data management systems cannot store, process, and analyze them.
Big data is used in machine learning, predictive modelling, and other advanced
analytics to solve business problems and make informed decisions.

Q. Explain the characteristics of big data analytics. What are the five Vs
of big data? (diagram)

Volume: Large volumes of data are generated daily from many sources, such as business
processes, machines, social media platforms, networks, and human interactions.
E.g.: Facebook generates approximately a billion messages a day, the "Like" button is
recorded around 4.5 billion times, and more than 350 million new posts are uploaded
each day.
Variety: Big data can be structured, unstructured, or semi-structured, collected from
different sources. In the past, data was collected only from databases and
spreadsheets, but today it arrives in many forms, such as PDFs, emails, audio,
social media posts, photos, and videos.
The data is categorized as below:
Structured data: Data in tabular form, stored in a relational database
management system.
Semi-structured data: The schema is not strictly defined,
e.g., JSON, XML, CSV, TSV, and email. OLTP (Online Transaction Processing)
systems are built to work with semi-structured data.
Unstructured data: Files with no predefined structure, such as log files, audio
files, and image files.
Quasi-structured data: Textual data with inconsistent formats that can be structured
only with time, effort, and specialized tools.
Example: Web server logs, i.e., log files created and maintained by a server that
contain a list of activities.
Veracity: Veracity refers to how reliable and trustworthy the data is. Because data
comes from many different sources, it must be filtered, cleaned, and translated so
that it can be handled and managed efficiently.
For example: Facebook posts with hashtags, which can be noisy and inconsistent.
Value: It is not the raw data we process or store that matters; the value lies in the
reliable data that we store, process, and analyze to support decisions.
Velocity: Velocity deals with the speed at which data flows from sources
such as application logs, business processes, networks, social media sites, sensors,
and mobile devices.
*Q. What is the Hadoop ecosystem? What are its main components
(HDFS, YARN, and MapReduce)?
Data Ingestion: Data ingestion is the first layer of a big data architecture. Data
generated from various sources, such as social media, sensors, IoT devices, and
SaaS platforms, needs to be collected and brought into a single warehouse or database.
There are three types of data ingestion techniques: batch ingestion, real-time
(streaming) ingestion, and a hybrid (lambda) approach.
HDFS: HDFS stands for Hadoop Distributed File System, and it is designed to run on
commodity servers. A typical HDFS architecture consists of a NameNode and several
DataNodes. A node can be thought of as a single computer; a collection of nodes
constitutes a cluster, and each cluster can contain thousands of nodes. When HDFS
receives data, it splits the data file into small chunks (blocks), typically 64 MB or
128 MB; the block size depends on the system configuration. While partitioning and
replicating the data, HDFS follows a principle called rack awareness. A rack is a
collection of 40-50 data nodes.
(Diagram: summary of how HDFS splits, replicates, and places blocks across data nodes.)
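
As an illustration, clients normally work with HDFS through its Java FileSystem API rather than with individual blocks; the minimal sketch below (the HDFS path is hypothetical) writes a small file, and HDFS performs the block splitting and replication behind this call.

// Minimal sketch using Hadoop's Java FileSystem API (the path is hypothetical).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);               // connects to the NameNode
        Path file = new Path("/user/demo/sample.txt");      // hypothetical HDFS path
        try (FSDataOutputStream out = fs.create(file)) {    // blocks are split and replicated internally
            out.writeUTF("hello hdfs");
        }
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));
        fs.close();
    }
}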

HBase: HBase is a column-oriented, non-relational database management system that
runs on top of HDFS. It operates similarly to HDFS: it has a master node to manage
the cluster, while slave nodes (region servers) store portions of the tables and
perform the read and write operations.
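
As a brief illustration (the table, column family, and row key below are hypothetical), a client writes and reads a cell through the HBase Java API, and the region server owning that portion of the table serves the request.

// Minimal HBase client sketch (table, column family, and row key are hypothetical).
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            Put put = new Put(Bytes.toBytes("row1"));                     // row key
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),   // column family : qualifier
                          Bytes.toBytes("Alice"));
            table.put(put);                                               // handled by the owning region server

            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}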
YARN: YARN stands for Yet Another Resource Negotiator. In Hadoop 1.0, MapReduce was
responsible for both processing and job-tracking tasks, but the utilisation of
resources turned out to be highly inefficient. YARN then took over the tasks of
resource distribution and job scheduling from MapReduce.
Oozie: Apache Oozie is an open-source Java web application for workflow scheduling
in a distributed cluster. It combines multiple jobs into a single unit, and Oozie
supports various job types from Hive, MapReduce, Pig, etc. There are three types of
Oozie jobs: workflow jobs, coordinator jobs, and bundle jobs.
MapReduce: MapReduce is responsible for processing huge amounts of data in a
parallel, distributed manner. It has two phases: Map and Reduce, and as the names
suggest, the Map phase always precedes the Reduce phase. In the Map stage, the data
is processed and converted into key-value pairs (tuples); the output of the map job
is then fed to the reducers as input.
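
The classic word-count job illustrates this flow; in the sketch below, the Mapper emits (word, 1) pairs for every word in a line, and the Reducer sums the counts for each word.

// Classic word-count Mapper and Reducer, the standard illustration of the MapReduce flow.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {  // break the line into words
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);                      // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {                     // add up the counts for this word
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));              // emit (word, total)
        }
    }
}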

Pig: Yahoo developed Apache Pig to analyse large amounts of data. This is what
MapReduce does too, but one fundamental problem with MapReduce is that it takes a
lot of code to perform the intended jobs; this is the primary reason why Pig was
developed. It has two significant components: Pig Latin, a high-level scripting
language, and the Pig engine, which translates Pig Latin scripts into MapReduce jobs.
Spark: One of the critical concerns with MapReduce is that it runs a job as a
sequential, multi-step process and must read data from the cluster and write
intermediate results back to the nodes at each step. MapReduce jobs therefore have
high latency, making them inefficient for real-time analytics. Spark addresses this
by keeping intermediate data in memory across the cluster, which makes it much
faster for iterative and near real-time workloads.
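
As a small illustration (the input path is hypothetical), the sketch below counts words with Spark's in-memory RDD API in Java instead of chaining multiple MapReduce jobs.

// Word count with Spark's in-memory RDD API (the input path is hypothetical).
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");  // hypothetical path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())   // split into words
                .mapToPair(word -> new Tuple2<>(word, 1))                        // (word, 1)
                .reduceByKey(Integer::sum);                                      // sum per word, in memory
            counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}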
Hive: Hive is a data warehousing tool designed to work with voluminous data, and it
works on top of HDFS and MapReduce. The Hive query language (HQL) is similar to SQL,
which makes it user-friendly. Hive queries are internally converted into MapReduce
or Spark jobs that run on Hadoop's distributed cluster.
Impala: Apache Impala is an open-source data warehouse tool for querying high-volume
data. Syntactically it is like HQL but provides highly optimised, faster queries
than Hive. Unlike Hive, it does not depend on MapReduce; instead, it has its own
engine, which stores intermediate results in memory, thus providing faster query
execution. It integrates easily with HDFS, HBase, and Amazon S3, and since Impala is
similar to SQL, the learning curve is not very steep.
ZooKeeper: Apache ZooKeeper is another essential member of the Hadoop family,
responsible for cross-node synchronisation and coordination. Hadoop applications
may need cross-cluster services, and deploying ZooKeeper takes care of this.
Applications create a znode within ZooKeeper and can synchronise their tasks across
the distributed cluster by updating their status in that znode.
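
As a rough sketch (the connection string and znode path are hypothetical), the Java snippet below creates a znode and updates its data; other nodes in the cluster can watch this znode to coordinate their work.

// Minimal ZooKeeper sketch (connection string and znode path are hypothetical).
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkStatusExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble; the watcher lambda simply ignores events here.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> { });

        String path = "/app-status";                              // hypothetical top-level znode
        if (zk.exists(path, false) == null) {
            zk.create(path, "STARTING".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.setData(path, "RUNNING".getBytes(), -1);               // -1 skips the version check
        System.out.println(new String(zk.getData(path, false, null)));
        zk.close();
    }
}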
Q. What are the issues and challenges of big data?
1. Sharing and Accessing Data:
• Datasets from external sources are often inaccessible.
• Sharing data can cause substantial challenges.
2. Privacy and Security:
• Most organizations are unable to maintain regular security checks because of the
sheer volume of data being generated; however, security checks and monitoring should
ideally be performed in real time, where they are most beneficial.
• Some organizations collect personal information in order to add value to their
business, deriving insights into people's lives that those people are unaware of.
3. Analytical Challenges:
• Big data raises significant analytical challenges, such as how to deal with a
problem when the data volume gets too large.
• Or how to identify the important data points.
4. Technical challenges:
• Quality of data:
o Collecting and storing large amounts of data comes at a cost. Big companies,
business leaders, and IT leaders often want large data storage.
o For better results and conclusions, big data should focus on storing quality,
relevant data rather than irrelevant data.
• Fault tolerance:
o Fault-tolerant computing is extremely hard and involves intricate algorithms.
o Technologies such as cloud computing and big data aim to ensure that, whenever a
failure occurs, the damage stays within an acceptable threshold so that the whole
task does not have to start again from scratch.
• Scalability:
o Big data projects can grow and evolve rapidly. The scalability issue of big data
has led organizations towards cloud computing.
o This raises challenges such as how to run and schedule various jobs so that the
goal of each workload is achieved cost-effectively.
*Q. What are the Hadoop Commands?

Command        Description

-rm            Removes a file or directory

-ls            Lists files with permissions and other details

-mkdir         Creates a directory at the given path in HDFS

-cat           Shows the contents of a file

-rmdir         Deletes an empty directory

-put           Uploads a file or folder from the local disk to HDFS

-rmr           Recursively deletes the file or folder identified by the path, including subfolders

-get           Copies a file or folder from HDFS to the local file system

-count         Counts the number of directories, files, and bytes under a path

-df            Shows free space

-getmerge      Merges multiple HDFS files into a single local file

-chmod         Changes file permissions

-copyToLocal   Copies files from HDFS to the local system

-stat          Prints statistics about the file or directory

-head          Displays the first kilobyte of a file

-usage         Returns the help for an individual command

-chown         Assigns a new owner and group to a file
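
For example, a few of these commands could be used as follows (the paths and file names are hypothetical):

hdfs dfs -mkdir /user/demo/data              # create a directory in HDFS
hdfs dfs -put sales.csv /user/demo/data/     # upload a local file to HDFS
hdfs dfs -ls /user/demo/data                 # list the directory with permissions
hdfs dfs -cat /user/demo/data/sales.csv      # print the file contents
hdfs dfs -get /user/demo/data/sales.csv .    # copy the file back to the local disk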


Q. What are Hadoop Archives?
Ans: A Hadoop Archive (HAR) is a way to organize and pack data within HDFS so as to
reduce the number of files and improve performance when accessing large datasets.
HAR helps manage and optimize the storage of a large number of small files in HDFS.
• Archive: Hadoop Archives bundle a large number of files into a single archive
file. This reduces the number of files in HDFS and can help improve performance by
reducing the NameNode's metadata overhead.
• Access: Data in HAR files can be accessed much like regular HDFS files, using the
har:// file system URI.
Example: If we have millions of small files (e.g., log files or data chunks), managing
these files can become inefficient due to the overhead of storing metadata. By
archiving these files into HAR files, we can improve performance and manageability.
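
For instance, a HAR file can be created with the hadoop archive command (the paths below are hypothetical):

hadoop archive -archiveName logs.har -p /user/demo logs /user/demo/archived
hdfs dfs -ls har:///user/demo/archived/logs.har     # browse the archive like a directory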

Q. Explain compression and serialization in Hadoop. Where is compression used? Why
is serialization important in Hadoop?

In Hadoop, compression and serialization are crucial techniques for optimizing data
storage and transmission. They reduce the size of data and enable efficient data
processing across the distributed Hadoop ecosystem, particularly in HDFS (Hadoop
Distributed File System) and MapReduce.
Ans: 1. Compression in Hadoop: Compression reduces the size of data files, which is
essential in a distributed environment like Hadoop, where large datasets are common.
By using compression, less space is needed for storage, and data transfer becomes
faster due to reduced input/output (I/O) overhead.
Use of Compression:
 HDFS Storage: Data stored on HDFS can be compressed to save disk space
and improve I/O performance.
 MapReduce Jobs: Input/output data can be compressed to reduce the time
spent reading from or writing to HDFS. Compression also helps reduce network
traffic between mappers and reducers.
Advantages of Compression:
 Less Disk Space Usage: Reduces storage costs.
 Faster Data Transmission: Decreases the amount of data transferred across
the network.
 Improved Performance: Speeds up processing tasks by reducing I/O
bottlenecks.
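
As a brief sketch (the codec choices here are illustrative assumptions), compression for the uses listed above is typically enabled through MapReduce job configuration in Java:

// Illustrative MapReduce job settings that enable compression (codec choices are assumptions).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut network traffic between mappers and reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec", SnappyCodec.class.getName());

        Job job = Job.getInstance(conf, "compressed-job");
        // Compress the final job output written to HDFS to save disk space.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}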
2. Serialization in Hadoop:
Serialization is the process of converting an object into a byte stream to store it or
transmit it to another system, where it can later be deserialized into its original form.
Hadoop relies on serialization to efficiently process and move data across its
distributed environment.
 Data Exchange: In a distributed system, data must be exchanged between
nodes (across mappers, reducers, and data nodes). Serialization allows this
data to be efficiently encoded and transferred.
 Storage: Data that is stored in HDFS or processed by MapReduce jobs needs
to be serialized to ensure that it can be efficiently written to and read from disk.
Serialization Frameworks in Hadoop:
Writable: Hadoop's native serialization format, optimized for speed and
compatibility with the Hadoop ecosystem.
Advantages: Lightweight, fast, and well-suited for Hadoop’s distributed
nature.
Disadvantage: Limited to Hadoop and Java environments; less
portable.
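
As a minimal sketch (the class and field names are hypothetical), a custom Writable implements write() and readFields() so that Hadoop can serialize the object to a byte stream and rebuild it on another node:

// Minimal custom Writable (class and field names are hypothetical).
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageView implements Writable {
    private String url;
    private long hits;

    @Override
    public void write(DataOutput out) throws IOException {    // serialize to a byte stream
        out.writeUTF(url);
        out.writeLong(hits);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize back into the object
        url = in.readUTF();
        hits = in.readLong();
    }
}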
Difference between compression and serialization:

Feature          Compression                              Serialization
Purpose          Reduce data size for storage and I/O     Encode data into a format for storage and transmission
Scope            Data files (HDFS, MapReduce)             In-memory data and data exchange
Common Formats   Gzip, Bzip2, Snappy, LZO                 Writable, Avro, Protobuf, Thrift
Use Cases        Reducing disk space, speeding up jobs    Transmitting data between nodes, RPCs
