
BDA

Q. What is big data?


Ans: Big data refers to extremely large and diverse collections of structured,
unstructured, and semi-structured data that continue to grow exponentially over time.
These datasets are so huge and complex in volume, velocity, and variety that
traditional data management systems cannot store, process, and analyze them.
Big data is used in machine learning, predictive modelling, and other advanced
analytics to solve business problems and make informed decisions.

Q. Explain the characteristics of big data analytics. What are the five Vs
of big data? (diagram)

Volume: Large volumes of data are generated daily from many sources, such as business
processes, machines, social media platforms, networks, and human interactions.
E.g.: Facebook generates approximately a billion messages a day, the "Like" button is
recorded around 4.5 billion times, and more than 350 million new posts are uploaded
each day.
Variety: Big data can be structured, unstructured, or semi-structured, collected from
different sources. In the past, data was collected only from databases and
spreadsheets, but today it arrives in many forms, such as PDFs, emails, audio,
social media posts, photos, and videos.
The data is categorized as below:
Structured data: Data in tabular form, stored in a relational database
management system.
Semi-structured data: The schema is not strictly defined,
e.g., JSON, XML, CSV, TSV, and email. OLTP (Online Transaction Processing)
systems are built to work with semi-structured data.
Unstructured data: Files with no predefined structure, such as log files, audio
files, and image files.
Quasi-structured data: Textual data with inconsistent formats that can be structured
only with time, effort, and specialized tools.
Example: Web server logs, i.e., log files created and maintained by a server that
contain a list of activities.
Veracity: Veracity refers to how reliable and trustworthy the data is. Because data
comes from many different sources, it must be filtered, cleaned, and translated so
that it can be handled and managed efficiently.
For example: Facebook posts with hashtags, which can be noisy and inconsistent.
Value: It is not the raw data we process or store that matters; the value lies in the
reliable data that we store, process, and analyze to support decisions.
Velocity: Velocity deals with the speed at which data flows from sources
such as application logs, business processes, networks, social media sites, sensors,
and mobile devices.
*Q. What is the Hadoop ecosystem? What are its main components
(HDFS, YARN, and MapReduce)?
Data Ingestion: Data ingestion is the first layer of a big data architecture. Data
generated from various sources, such as social media, sensors, IoT devices, and
SaaS platforms, needs to be collected and brought into a single warehouse or database.
There are three types of data ingestion techniques: batch ingestion, real-time
(streaming) ingestion, and a hybrid (lambda) approach.
HDFS: HDFS stands for Hadoop Distributed File System, and it is designed to run on
commodity servers. A typical HDFS architecture consists of a NameNode and several
DataNodes. A node can be thought of as a single computer; a collection of nodes
constitutes a cluster, and each cluster can contain thousands of nodes. When HDFS
receives data, it splits the data file into small chunks (blocks), typically 64 MB or
128 MB; the block size depends on the system configuration. While partitioning and
replicating the data, HDFS follows a principle called rack awareness. A rack is a
collection of 40-50 data nodes.
(Diagram: summary of how HDFS splits, replicates, and places blocks across data nodes.)
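
As an illustration, clients normally work with HDFS through its Java FileSystem API rather than with individual blocks; the minimal sketch below (the HDFS path is hypothetical) writes a small file, and HDFS performs the block splitting and replication behind this call.

// Minimal sketch using Hadoop's Java FileSystem API (the path is hypothetical).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);               // connects to the NameNode
        Path file = new Path("/user/demo/sample.txt");      // hypothetical HDFS path
        try (FSDataOutputStream out = fs.create(file)) {    // blocks are split and replicated internally
            out.writeUTF("hello hdfs");
        }
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));
        fs.close();
    }
}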

HBase: HBase is a column-oriented, non-relational database management system that
runs on top of HDFS. It operates similarly to HDFS: it has a master node to manage
the cluster, while slave nodes (region servers) store portions of the tables and
perform the read and write operations.
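
As a brief illustration (the table, column family, and row key below are hypothetical), a client writes and reads a cell through the HBase Java API, and the region server owning that portion of the table serves the request.

// Minimal HBase client sketch (table, column family, and row key are hypothetical).
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            Put put = new Put(Bytes.toBytes("row1"));                     // row key
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),   // column family : qualifier
                          Bytes.toBytes("Alice"));
            table.put(put);                                               // handled by the owning region server

            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}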
YARN: YARN stands for Yet Another Resource Negotiator. In Hadoop 1.0, MapReduce was
responsible for both processing and job-tracking tasks, but the utilisation of
resources turned out to be highly inefficient. YARN then took over the tasks of
resource distribution and job scheduling from MapReduce.
Oozie: Apache Oozie is an open-source Java web application for workflow scheduling
in a distributed cluster. It combines multiple jobs into a single unit, and Oozie
supports various job types from Hive, MapReduce, Pig, etc. There are three types of
Oozie jobs: workflow jobs, coordinator jobs, and bundle jobs.
MapReduce: MapReduce is responsible for processing huge amounts of data in a
parallel, distributed manner. It has two phases: Map and Reduce, and as the names
suggest, the Map phase always precedes the Reduce phase. In the Map stage, the data
is processed and converted into key-value pairs (tuples); the output of the map job
is then fed to the reducers as input.
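
The classic word-count job illustrates this flow; in the sketch below, the Mapper emits (word, 1) pairs for every word in a line, and the Reducer sums the counts for each word.

// Classic word-count Mapper and Reducer, the standard illustration of the MapReduce flow.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {  // break the line into words
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);                      // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {                     // add up the counts for this word
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));              // emit (word, total)
        }
    }
}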

Pig: Yahoo developed Apache Pig to analyse large amounts of data. This is what
MapReduce does too, but one fundamental problem with MapReduce is that it takes a
lot of code to perform the intended jobs; this is the primary reason why Pig was
developed. It has two significant components: Pig Latin, a high-level scripting
language, and the Pig engine, which translates Pig Latin scripts into MapReduce jobs.
Spark: One of the critical concerns with MapReduce is that it runs a job as a
sequential, multi-step process and must read data from the cluster and write
intermediate results back to the nodes at each step. MapReduce jobs therefore have
high latency, making them inefficient for real-time analytics. Spark addresses this
by keeping intermediate data in memory across the cluster, which makes it much
faster for iterative and near real-time workloads.
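
As a small illustration (the input path is hypothetical), the sketch below counts words with Spark's in-memory RDD API in Java instead of chaining multiple MapReduce jobs.

// Word count with Spark's in-memory RDD API (the input path is hypothetical).
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");  // hypothetical path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())   // split into words
                .mapToPair(word -> new Tuple2<>(word, 1))                        // (word, 1)
                .reduceByKey(Integer::sum);                                      // sum per word, in memory
            counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}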
Hive: Hive is a data warehousing tool designed to work with voluminous data, and it
works on top of HDFS and MapReduce. The Hive query language (HQL) is similar to SQL,
which makes it user-friendly. Hive queries are internally converted into MapReduce
or Spark jobs that run on Hadoop's distributed cluster.
Impala: Apache Impala is an open-source data warehouse tool for querying high-volume
data. Syntactically it is like HQL but provides highly optimised, faster queries
than Hive. Unlike Hive, it does not depend on MapReduce; instead, it has its own
engine, which stores intermediate results in memory, thus providing faster query
execution. It integrates easily with HDFS, HBase, and Amazon S3, and since Impala is
similar to SQL, the learning curve is not very steep.
ZooKeeper: Apache ZooKeeper is another essential member of the Hadoop family,
responsible for cross-node synchronisation and coordination. Hadoop applications
may need cross-cluster services, and deploying ZooKeeper takes care of this.
Applications create a znode within ZooKeeper and can synchronise their tasks across
the distributed cluster by updating their status in that znode.
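
As a rough sketch (the connection string and znode path are hypothetical), the Java snippet below creates a znode and updates its data; other nodes in the cluster can watch this znode to coordinate their work.

// Minimal ZooKeeper sketch (connection string and znode path are hypothetical).
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkStatusExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble; the watcher lambda simply ignores events here.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> { });

        String path = "/app-status";                              // hypothetical top-level znode
        if (zk.exists(path, false) == null) {
            zk.create(path, "STARTING".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.setData(path, "RUNNING".getBytes(), -1);               // -1 skips the version check
        System.out.println(new String(zk.getData(path, false, null)));
        zk.close();
    }
}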
Q. What are the issues and challenges of big data?
1. Sharing and Accessing Data:
• Datasets from external sources are often inaccessible.
• Sharing data can cause substantial challenges.
2. Privacy and Security:
• Most organizations are unable to maintain regular security checks because of the
sheer volume of data being generated; however, security checks and monitoring should
ideally be performed in real time, where they are most beneficial.
• Some organizations collect personal information in order to add value to their
business, deriving insights into people's lives that those people are unaware of.
3. Analytical Challenges:
• Big data raises significant analytical challenges, such as how to deal with a
problem when the data volume gets too large.
• Or how to identify the important data points.
4. Technical challenges:
• Quality of data:
o Collecting and storing large amounts of data comes at a cost. Big companies,
business leaders, and IT leaders often want large data storage.
o For better results and conclusions, big data should focus on storing quality,
relevant data rather than irrelevant data.
• Fault tolerance:
o Fault-tolerant computing is extremely hard and involves intricate algorithms.
o Technologies such as cloud computing and big data aim to ensure that, whenever a
failure occurs, the damage stays within an acceptable threshold so that the whole
task does not have to start again from scratch.
• Scalability:
o Big data projects can grow and evolve rapidly. The scalability issue of big data
has led organizations towards cloud computing.
o This raises challenges such as how to run and schedule various jobs so that the
goal of each workload is achieved cost-effectively.
*Q. What are the Hadoop Commands?

Command        Description

-rm            Removes a file or directory

-ls            Lists files with permissions and other details

-mkdir         Creates a directory at the given path in HDFS

-cat           Shows the contents of a file

-rmdir         Deletes an empty directory

-put           Uploads a file or folder from the local disk to HDFS

-rmr           Recursively deletes the file or folder identified by the path, including subfolders

-get           Copies a file or folder from HDFS to the local file system

-count         Counts the number of directories, files, and bytes under a path

-df            Shows free space

-getmerge      Merges multiple HDFS files into a single local file

-chmod         Changes file permissions

-copyToLocal   Copies files from HDFS to the local system

-stat          Prints statistics about the file or directory

-head          Displays the first kilobyte of a file

-usage         Returns the help for an individual command

-chown         Assigns a new owner and group to a file
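
For example, a few of these commands could be used as follows (the paths and file names are hypothetical):

hdfs dfs -mkdir /user/demo/data              # create a directory in HDFS
hdfs dfs -put sales.csv /user/demo/data/     # upload a local file to HDFS
hdfs dfs -ls /user/demo/data                 # list the directory with permissions
hdfs dfs -cat /user/demo/data/sales.csv      # print the file contents
hdfs dfs -get /user/demo/data/sales.csv .    # copy the file back to the local disk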


Q. What are Hadoop Archives?
Ans: A Hadoop Archive (HAR) is a way to organize and pack data within HDFS so as to
reduce the number of files and improve performance when accessing large datasets.
HAR helps manage and optimize the storage of a large number of small files in HDFS.
• Archive: Hadoop Archives bundle a large number of files into a single archive
file. This reduces the number of files in HDFS and can help improve performance by
reducing the NameNode's metadata overhead.
• Access: Data in HAR files can be accessed much like regular HDFS files, using the
har:// file system URI.
Example: If we have millions of small files (e.g., log files or data chunks), managing
these files can become inefficient due to the overhead of storing metadata. By
archiving these files into HAR files, we can improve performance and manageability.
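
For instance, a HAR file can be created with the hadoop archive command (the paths below are hypothetical):

hadoop archive -archiveName logs.har -p /user/demo logs /user/demo/archived
hdfs dfs -ls har:///user/demo/archived/logs.har     # browse the archive like a directory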

Q. Explain compression and serialization in Hadoop. Where is compression used? Why
is serialization important in Hadoop?

In Hadoop, compression and serialization are crucial techniques for optimizing data
storage and transmission. They reduce the size of data and enable efficient data
processing across the distributed Hadoop ecosystem, particularly in HDFS (Hadoop
Distributed File System) and MapReduce.
Ans: 1. Compression in Hadoop: Compression reduces the size of data files, which is
essential in a distributed environment like Hadoop, where large datasets are common.
By using compression, less space is needed for storage, and data transfer becomes
faster due to reduced input/output (I/O) overhead.
Use of Compression:
 HDFS Storage: Data stored on HDFS can be compressed to save disk space
and improve I/O performance.
 MapReduce Jobs: Input/output data can be compressed to reduce the time
spent reading from or writing to HDFS. Compression also helps reduce network
traffic between mappers and reducers.
Advantages of Compression:
 Less Disk Space Usage: Reduces storage costs.
 Faster Data Transmission: Decreases the amount of data transferred across
the network.
 Improved Performance: Speeds up processing tasks by reducing I/O
bottlenecks.
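
As a brief sketch (the codec choices here are illustrative assumptions), compression for the uses listed above is typically enabled through MapReduce job configuration in Java:

// Illustrative MapReduce job settings that enable compression (codec choices are assumptions).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut network traffic between mappers and reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec", SnappyCodec.class.getName());

        Job job = Job.getInstance(conf, "compressed-job");
        // Compress the final job output written to HDFS to save disk space.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}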
2. Serialization in Hadoop:
Serialization is the process of converting an object into a byte stream to store it or
transmit it to another system, where it can later be deserialized into its original form.
Hadoop relies on serialization to efficiently process and move data across its
distributed environment.
 Data Exchange: In a distributed system, data must be exchanged between
nodes (across mappers, reducers, and data nodes). Serialization allows this
data to be efficiently encoded and transferred.
 Storage: Data that is stored in HDFS or processed by MapReduce jobs needs
to be serialized to ensure that it can be efficiently written to and read from disk.
Serialization Frameworks in Hadoop:
Writable: Hadoop's native serialization format, optimized for speed and
compatibility with the Hadoop ecosystem.
Advantages: Lightweight, fast, and well-suited for Hadoop’s distributed
nature.
Disadvantage: Limited to Hadoop and Java environments; less
portable.
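
As a minimal sketch (the class and field names are hypothetical), a custom Writable implements write() and readFields() so that Hadoop can serialize the object to a byte stream and rebuild it on another node:

// Minimal custom Writable (class and field names are hypothetical).
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageView implements Writable {
    private String url;
    private long hits;

    @Override
    public void write(DataOutput out) throws IOException {    // serialize to a byte stream
        out.writeUTF(url);
        out.writeLong(hits);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize back into the object
        url = in.readUTF();
        hits = in.readLong();
    }
}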
Difference between compression and serialization:

Feature          Compression                              Serialization
Purpose          Reduce data size for storage and I/O     Encode data into a format for storage and transmission
Scope            Data files (HDFS, MapReduce)             In-memory data and data exchange
Common Formats   Gzip, Bzip2, Snappy, LZO                 Writable, Avro, Protobuf, Thrift
Use Cases        Reducing disk space, speeding up jobs    Transmitting data between nodes, RPCs
