Big-Data Final

Big data questions and answers

Big Data 2 marks

What is sequence file in Hadoop?


Definition: A SequenceFile is a flat, binary file type that serves as a container for data to be
used in Apache Hadoop distributed computing projects. SequenceFiles are used extensively
with MapReduce.
Use: SequenceFiles are used to store and compress many files that are smaller than the size at which Hadoop operates efficiently, which helps reduce the required disk space and I/O overhead.
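For illustration, a minimal Java sketch (not from the source; the output path and key/value types are assumptions) of packing many small records into a single SequenceFile:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq");   // assumed output location

        // The writer packs many small key-value records into one container file.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class));
        for (int i = 0; i < 100; i++) {
            writer.append(new IntWritable(i), new Text("record-" + i));
        }
        writer.close();
    }
}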

Define the three key design principles of Pig Latin.


1. Simplified Data Processing: Pig Latin simplifies data processing by providing a high-level
language that abstracts away the complexity of MapReduce programming.
2. Extensibility: Pig Latin is designed to be extensible, allowing users to define their own
functions and operators to process data in a custom way (see the UDF sketch after this list).
3. Optimization: Pig Latin optimizes data processing by automatically optimizing the execution
plan based on the data flow and available resources.
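To make the extensibility principle concrete, here is a minimal sketch (the class name is an assumption, not from the source) of a Java user-defined function (UDF) that a Pig Latin script could REGISTER and then call like a built-in function:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCaseUdf extends EvalFunc<String> {
    // Pig calls exec() once per input tuple; the UDF returns the transformed value.
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}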

Define the various file formats supported in HIVE.


1. Text File Format: stores data in plain text files with a delimiter.
2. Sequence File Format: stores binary key-value pairs in a compressed block-based format.
3. ORC File Format: stores data in a columnar format with compression and indexing.
4. Parquet File Format: stores data in a columnar format with efficient compression and
nested data support.
5. Avro File Format: stores data in a compact binary format with schema evolution support.
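As a brief illustration of choosing a storage format, a minimal sketch (the HiveServer2 URL and table names are assumptions, and no authentication is assumed) that issues CREATE TABLE statements with different STORED AS clauses over Hive's JDBC driver:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveFormatDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // register the Hive JDBC driver
        // Assumed HiveServer2 endpoint with no authentication.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // Same table definition, different on-disk formats:
            stmt.execute("CREATE TABLE logs_text (id INT, msg STRING) STORED AS TEXTFILE");
            stmt.execute("CREATE TABLE logs_orc  (id INT, msg STRING) STORED AS ORC");
            stmt.execute("CREATE TABLE logs_parq (id INT, msg STRING) STORED AS PARQUET");
        }
    }
}

SEQUENCEFILE and AVRO are accepted in the STORED AS clause in the same way.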

What is the difference between analysis and analytics?

S.No. | Data Analytics | Data Analysis
1. | It is described as a traditional or generic form of analytics. | It is described as a particularized form of analytics.
2. | It includes several stages, such as the collection of data followed by the inspection of business data. | To process data, raw data is first defined in a meaningful manner, then data cleaning and conversion are done to get meaningful information from the raw data.
3. | It supports decision making by analysing enterprise data. | It analyses the data by focusing on insights into business data.
4. | It uses various tools to process data, such as Tableau, Python, Excel, etc. | It uses different tools to analyse data, such as RapidMiner, OpenRefine, NodeXL, KNIME, etc.

What do you mean by semi-structured data?


Semi-structured: Semi-structured data is typically characterized by the use of metadata or tags
that provide additional information about the data elements.
For example, an XML document might contain tags that indicate the structure of the
document, but may also contain additional tags that provide metadata about the content, such
as author, date, or keywords.
What is YARN?
 YARN (Yet Another Resource Negotiator) is a component of Apache Hadoop that manages
resources and schedules tasks for distributed processing of large data sets.
 It separates the resource management and job scheduling/monitoring functions into
separate daemons, allowing for more efficient and flexible use of cluster resources.
 YARN enables Hadoop to support a wider range of processing models, including batch
processing, interactive processing, and real-time streaming.
 It also allows for the integration of other processing engines, such as Apache Spark,
Apache Flink, and Apache Storm
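A minimal sketch (assuming a reachable ResourceManager and a yarn-site.xml on the classpath) that uses the YARN client API to list the applications the ResourceManager is currently managing:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppList {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());   // picks up yarn-site.xml from the classpath
        client.start();
        // Ask the ResourceManager for every application it currently knows about.
        for (ApplicationReport app : client.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
        }
        client.stop();
    }
}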

What is HDFS?
 The Hadoop Distributed File System (HDFS) is the primary data storage system used by
Hadoop applications
 This storage layer of Hadoop is responsible for storing and managing data across a cluster.
 It is designed to handle large datasets by breaking them into smaller blocks and
distributing them across multiple machines.
 The data blocks are replicated to ensure fault tolerance and high availability

How does the NameNode manage DataNodes in HDFS?


 The NameNode manages the HDFS namespace and coordinates access to data stored on
DataNodes.
 It monitors the health of each DataNode and detects any failures.
 It ensures data reliability by redirecting clients to other available DataNodes that have a
copy of the data.
 It maintains a persistent record of the file system metadata

Define BIG Data.


Big Data definition: Big Data refers to extremely large data sets that may be analysed
computationally to reveal patterns, trends, and associations, especially relating to human
behaviour and interactions.
It is characterised by 4V’s:
 Volume: Huge Amount of data
 Variety: Different formats of data from various sources
 Velocity: High speed of accumulation of data
 Variability: Inconsistencies and uncertainty in data

State two characteristics of big data application.


Characteristics of Big Data Big data can be described by the following characteristics:
 Volume: The name Big Data itself is related to an enormous size. Whether particular
data can actually be considered Big Data or not depends upon the volume of data.
 Variety: Variety refers to heterogeneous sources and the nature of data, both structured
and unstructured. The variety (emails, photos, videos, monitoring devices, PDFs, audio) of
unstructured data poses certain issues for storage, mining and analysing data.
Write down the four computing resources of Big Data Storage.
The four computing resources of Big Data storage are:
 Storage capacity: The ability to accommodate the massive volume of data generated and
collected.
 Scalability: The capability to scale the storage infrastructure horizontally as data volume
grows.
 Data durability: Ensuring the reliability and resilience of data through replication, fault
tolerance, and backup strategies.
 Data accessibility: Facilitating efficient and fast access to stored data through
optimization, indexing, and metadata management.

What is throughput in Big Data?


 Throughput in Big Data refers to the amount of data that can be processed or
transferred within a given time frame.
 It is a measure of the system's ability to handle a high volume of data and perform
operations on it efficiently.
 Throughput is often used as a key performance indicator (KPI) for Big Data systems, as it
reflects the system's ability to meet the demands of processing large amounts of data in
real-time or near-real-time.
 High throughput is essential for applications such as streaming data processing, real-
time analytics, and high-performance computing.
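As a tiny worked example (the numbers are invented for illustration), throughput is simply data volume divided by elapsed time:

public class ThroughputExample {
    public static void main(String[] args) {
        double gigabytesProcessed = 600.0; // assumed: 600 GB of input handled
        double elapsedSeconds = 300.0;     // assumed: in 5 minutes
        double throughput = gigabytesProcessed / elapsedSeconds;
        System.out.println("Throughput = " + throughput + " GB/s"); // 2.0 GB/s
    }
}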

Big data 4 marks


There are two main components of HDFS:
 NameNode:
o It is the master node in HDFS.
o Centralized metadata repository for Hadoop Distributed File System (HDFS).
o Stores file system metadata like filenames, directories, & hierarchical structure.
o It is responsible for managing the file system namespace
o It periodically checkpoints the metadata onto disk to ensure its durability.
o It tracks the location of data blocks on DataNodes.
o Maintains the in-memory mapping of data blocks to DataNodes for efficient data
access.
 DataNode:
o It is the slave node in HDFS.
o Responsible for storing data blocks and serving read/write requests from clients.
o Each DataNode manages a set of data blocks stored on the local file system.
o Detects and reports data block corruption or failures to the NameNode.
o Replicates data blocks to ensure data availability and fault tolerance.
o Supports data streaming and direct access to data blocks for efficient data processing.
o Can be added or removed from the HDFS cluster dynamically without affecting the
overall system availability.
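A minimal Java sketch (the path and replication factor are assumptions) of the client-side view of this split of responsibilities: the NameNode answers metadata and block-location queries, while the file's bytes are streamed to and from DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/demo.txt");         // assumed HDFS path

        // Write a small file; HDFS splits it into blocks and replicates each block.
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {  // replication factor 3
            out.writeUTF("hello hdfs");
        }

        // Ask the NameNode which DataNodes hold the blocks of this file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block hosts: " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}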
Explain HADOOP architecture and how it works.
Hadoop Architecture: At its core, Hadoop has two major layers namely −
 Processing/Computation layer (MapReduce) –
 MapReduce is a programming model and processing framework in Hadoop that
allows for parallel and distributed processing of large datasets.
 It simplifies the processing of big data by dividing the work into two main phases:
the map phase (data is split into smaller chunks and processed in parallel across
multiple nodes) and the reduce phase (results from the map phase are combined
and aggregated to produce the final output.)
 MapReduce provides fault tolerance and automatic parallelization, making it suitable
for large-scale data processing tasks (see the word-count sketch at the end of this answer).
 Storage layer (Hadoop Distributed File System) –
 HDFS is the storage layer of Hadoop and is responsible for storing and managing
data across a cluster.
 It is designed to handle large datasets by breaking them into smaller blocks and
distributing them across multiple machines.
 The data blocks are replicated to ensure fault tolerance and high availability.
 Hadoop Common − these are Java libraries and utilities required by other Hadoop modules.
 Hadoop YARN − this is a framework for job scheduling and cluster resource management.

Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs –

 Data is initially divided into directories and files. Files are divided into uniformly sized blocks of
128 MB or 64 MB (preferably 128 MB).
 These files are then distributed across various cluster nodes for further processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.
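To make the map and reduce phases described above concrete, here is the classic word-count job in Java (input and output paths are assumptions):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: each input line is split into words, each emitted with a count of 1.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: counts for the same word are summed to produce the final total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local reduce before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));     // assumed input directory
        FileOutputFormat.setOutputPath(job, new Path("/output"));  // assumed output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The combiner is an optional local reduce step that shrinks the amount of data shuffled and sorted between the two phases.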
Evolution of Big Data: There are a lot of milestones in the evolution of Big Data which are
described below:
 Data Warehousing: In the 1990s, data warehousing emerged as a solution to store and
analyze large volumes of structured data.
 Hadoop: Hadoop was introduced in 2006; it is an open-source framework that provides
distributed storage and large-scale data processing.
 NoSQL Databases: In 2009, NoSQL databases were introduced, which provide a flexible
way to store and retrieve unstructured data.
 Cloud Computing: Cloud computing technology helps companies to store their
important data in data centers that are remote, and it saves their infrastructure cost
and maintenance costs.
 Machine Learning: Machine Learning algorithms are those algorithms that work on
large data, and analysis is done on a huge amount of data to get meaningful insights
from it. This has led to the development of artificial intelligence (AI) applications.

Characteristics of Big Data


Big data can be described by the following characteristics:
 Volume: The name Big Data itself is related to an enormous size. Whether particular
data can actually be considered Big Data or not depends upon the volume of data.
 Variety: Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. The variety (emails, photos, videos, monitoring
devices, PDFs, audio) of unstructured data poses certain issues for storage, mining
and analyzing data.
 Velocity: The term ‘velocity’ refers to the speed of generation of data. How fast the
data is generated and processed to meet the demands, determines real potential in
the data. The flow of data is massive and continuous.
 Variability: This refers to the inconsistency which can be shown by the data at
times, thus hampering the process of being able to handle and manage the data
effectively.
Discuss the functions of MongoDB Query Language and database Commands.

1. MongoDB Query Language (MQL):


 MQL provides a rich set of querying capabilities to retrieve data from MongoDB.
 It supports filtering based on criteria, sorting, limiting results, & projecting specific fields.
 It supports operations like grouping, summing, averaging, joining, and more.
 It supports various types of indexes, including single field, compound, text, and
geospatial indexes to improve query performance.
 MQL provides text search capabilities, allowing you to perform full-text search queries
on text fields, with support for text indexing, stemming, and relevance sorting.

2. Database Commands:
 Document Manipulation: Commands like insert, update, and delete allow you to create,
modify, and remove documents in collections.
 Index Management: MongoDB commands enable you to create, drop, and manage
indexes on collections to optimize query performance.
 Data Backup and Restoration: Commands like mongodump and mongorestore allow you
to create backups of databases or collections and restore them when needed.
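A minimal sketch using the MongoDB Java driver (connection string, database, and collection names are assumptions) that touches both sides: MQL-style querying with filters, sorting, and limits, plus commands for document manipulation and index management:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.Sorts;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class MongoDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders");

            // Document manipulation: insert and update.
            orders.insertOne(new Document("item", "book").append("qty", 5));
            orders.updateOne(Filters.eq("item", "book"), Updates.set("qty", 7));

            // Index management: create an index to speed up queries on "item".
            orders.createIndex(Indexes.ascending("item"));

            // Querying: filter, sort, and limit results.
            for (Document d : orders.find(Filters.gt("qty", 1))
                                    .sort(Sorts.descending("qty"))
                                    .limit(10)) {
                System.out.println(d.toJson());
            }

            // Document manipulation: delete.
            orders.deleteOne(Filters.eq("item", "book"));
        }
    }
}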
Explain the different tools to extract Big Data
a. Apache Hadoop:
 Distributed processing framework for big data.
 Utilizes the Hadoop Distributed File System (HDFS) for storing and processing large
datasets across clusters of computers.
b. Hive:
 Hive is an open-source big data tool.
 It allows programmers to analyze large data sets on Hadoop.
 It helps with querying and managing large data sets quickly.
c. MongoDB:
 MongoDB is an open-source NoSQL database that is cross-platform compatible and ships
with many built-in features.
 It is ideal for the business that needs fast and real-time data for instant decisions.
 It is ideal for users who want data-driven experience

Brief explanation of 4 types of NoSQL:

1. Key-Value Stores:
 Stores data as simple key-value pairs.
 Provides high performance and scalability.
 Suitable for caching, session management, and user profiles (see the sketch after this list).
 Examples: Redis, Riak, Amazon DynamoDB.
2. Document Databases:
 Stores data in semi-structured documents (e.g., JSON, XML).
 Offers flexibility with dynamic schemas.
 Handles complex, hierarchical, and evolving data structures.
 Examples: MongoDB, Couchbase, Apache CouchDB.
3. Column-Family Stores:
 Stores data in column families or column groups.
 Optimized for write-heavy workloads.
 Efficiently handles vast amounts of data.
 Used for large-scale data analytics and time-series data storage.
 Examples: Apache Cassandra, Apache HBase.
4. Graph Databases:
 Stores and represents relationships between entities.
 Utilizes graph structures with nodes and edges.
 Excellent for complex data modeling and graph traversals.
 Ideal for social networks, recommendation systems, and fraud detection.
 Examples: Neo4j, Amazon Neptune, ArangoDB
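As an illustration of the key-value model (type 1 above), a minimal sketch using the Jedis client for Redis; the host, port, and key are assumptions:

import redis.clients.jedis.Jedis;

public class KeyValueDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {   // assumed local Redis instance
            // Store and read back a simple key-value pair, e.g. a session entry.
            jedis.set("session:42", "alice");
            System.out.println(jedis.get("session:42"));     // prints "alice"
        }
    }
}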
Feature | Key-value Database | Document Database
Data Structure | Simple key-value pairs | Documents (often JSON or BSON)
Schema | Schema-less | Schema-less or schema-on-read
Querying | Limited query capabilities (usually based on key lookup) | Advanced querying using query languages or APIs
Data Organization | Flat structure, no hierarchical relationships | Supports hierarchical relationships and nested data
Indexing | Typically only indexes on keys | Indexes on various fields and attributes within documents
Scalability | Excellent scalability and performance for read-heavy workloads | Good scalability for both read- and write-heavy workloads
Use Cases | Caching, session management, simple key-based lookups | Content management systems, user profiles, e-commerce applications
Consistency | Eventual consistency (replication delay can lead to data inconsistencies) | Eventual consistency (replication delay can lead to data inconsistencies)
Examples | Redis, Riak | MongoDB, CouchDB

Criteria | Hadoop | RDBMS
Schema | Based on ‘Schema on Read’. | Based on ‘Schema on Write’.
Data Type | Structured, semi-structured and unstructured data. | Structured data.
Efficiency | Works better when the data size is big (i.e., in terabytes and petabytes). | Works better when the volume of data is low (in gigabytes).
Throughput | Hadoop has higher throughput compared to RDBMS. | RDBMS fails to achieve a higher throughput compared to the Apache Hadoop framework.
Latency / Response Time | Hadoop has high latency, since it processes data in batches. | RDBMS has low latency compared to Hadoop.
Speed | Writes are fast. | Reads are fast.
Cost | Open-source framework, free of cost. | Licensed software, paid.
Application | Data discovery; storage and processing of unstructured data. | OLTP and complex ACID transactions.
