Big-Data Final

Big data questions and answers

Big Data 2 marks

What is sequence file in Hadoop?


Definition: A SequenceFile is a flat, binary file type that serves as a container for data to be
used in Apache Hadoop distributed computing projects. SequenceFiles are used extensively
with MapReduce.
Use: SequenceFiles are used to store and compress many files that are smaller than the size at which Hadoop operates efficiently, which helps reduce the required disk space and I/O overhead.
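For illustration, a minimal Java sketch (not from the source; the output path and key/value types are assumptions) of packing many small records into a single SequenceFile:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq");   // assumed output location

        // The writer packs many small key-value records into one container file.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class));
        for (int i = 0; i < 100; i++) {
            writer.append(new IntWritable(i), new Text("record-" + i));
        }
        writer.close();
    }
}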

Define the three key design principles of Pig Latin.


1. Simplified Data Processing: Pig Latin simplifies data processing by providing a high-level
language that abstracts away the complexity of MapReduce programming.
2. Extensibility: Pig Latin is designed to be extensible, allowing users to define their own
functions and operators to process data in a custom way (see the UDF sketch after this list).
3. Optimization: Pig Latin optimizes data processing by automatically optimizing the execution
plan based on the data flow and available resources.
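To make the extensibility principle concrete, here is a minimal sketch (the class name is an assumption, not from the source) of a Java user-defined function (UDF) that a Pig Latin script could REGISTER and then call like a built-in function:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCaseUdf extends EvalFunc<String> {
    // Pig calls exec() once per input tuple; the UDF returns the transformed value.
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}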

Define the various file formats supported in HIVE.


1. Text File Format: stores data in plain text files with a delimiter.
2. Sequence File Format: stores binary key-value pairs in a compressed block-based format.
3. ORC File Format: stores data in a columnar format with compression and indexing.
4. Parquet File Format: stores data in a columnar format with efficient compression and
nested data support.
5. Avro File Format: stores data in a compact binary format with schema evolution support.
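As a brief illustration of choosing a storage format, a minimal sketch (the HiveServer2 URL and table names are assumptions, and no authentication is assumed) that issues CREATE TABLE statements with different STORED AS clauses over Hive's JDBC driver:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveFormatDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // register the Hive JDBC driver
        // Assumed HiveServer2 endpoint with no authentication.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // Same table definition, different on-disk formats:
            stmt.execute("CREATE TABLE logs_text (id INT, msg STRING) STORED AS TEXTFILE");
            stmt.execute("CREATE TABLE logs_orc  (id INT, msg STRING) STORED AS ORC");
            stmt.execute("CREATE TABLE logs_parq (id INT, msg STRING) STORED AS PARQUET");
        }
    }
}

SEQUENCEFILE and AVRO are accepted in the STORED AS clause in the same way.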

What is the difference between analysis and analytics?

S.No. | Data Analytics | Data Analysis
1. | It is described as a traditional or generic form of analytics. | It is described as a particularized form of analytics.
2. | It includes several stages, such as the collection of data followed by the inspection of business data. | To process data, raw data is first defined in a meaningful manner, then data cleaning and conversion are done to get meaningful information from the raw data.
3. | It supports decision making by analysing enterprise data. | It analyses the data by focusing on insights into business data.
4. | It uses various tools to process data, such as Tableau, Python, Excel, etc. | It uses different tools to analyse data, such as RapidMiner, OpenRefine, NodeXL, KNIME, etc.

What do you mean by semi-structured data?


Semi-structured: Semi-structured data is typically characterized by the use of metadata or tags
that provide additional information about the data elements.
For example, an XML document might contain tags that indicate the structure of the
document, but may also contain additional tags that provide metadata about the content, such
as author, date, or keywords.
What is YARN?
 YARN (Yet Another Resource Negotiator) is a component of Apache Hadoop that manages
resources and schedules tasks for distributed processing of large data sets.
 It separates the resource management and job scheduling/monitoring functions into
separate daemons, allowing for more efficient and flexible use of cluster resources.
 YARN enables Hadoop to support a wider range of processing models, including batch
processing, interactive processing, and real-time streaming.
 It also allows for the integration of other processing engines, such as Apache Spark,
Apache Flink, and Apache Storm
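A minimal sketch (assuming a reachable ResourceManager and a yarn-site.xml on the classpath) that uses the YARN client API to list the applications the ResourceManager is currently managing:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppList {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());   // picks up yarn-site.xml from the classpath
        client.start();
        // Ask the ResourceManager for every application it currently knows about.
        for (ApplicationReport app : client.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
        }
        client.stop();
    }
}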

What is HDFS?
 The Hadoop Distributed File System (HDFS) is the primary data storage system used by
Hadoop applications
 This storage layer of Hadoop is responsible for storing and managing data across a cluster.
 It is designed to handle large datasets by breaking them into smaller blocks and
distributing them across multiple machines.
 The data blocks are replicated to ensure fault tolerance and high availability

How does the NameNode manage DataNodes in HDFS?


 The NameNode manages the HDFS namespace and coordinates access to data stored on
DataNodes.
 It monitors the health of each DataNode and detects any failures.
 It ensures data reliability by redirecting clients to other available DataNodes that have a
copy of the data.
 It maintains a persistent record of the file system metadata

Define BIG Data.


Big Data definition: Big Data refers to extremely large data sets that may be analysed
computationally to reveal patterns, trends, and associations, especially relating to human
behaviour and interactions.
It is characterised by 4V’s:
 Volume: Huge Amount of data
 Variety: Different formats of data from various sources
 Velocity: High speed of accumulation of data
 Variability: Inconsistencies and uncertainty in data

State two characteristics of big data application.


Characteristics of Big Data Big data can be described by the following characteristics:
 Volume: The name Big Data itself is related to an enormous size. Whether particular
data can actually be considered Big Data or not depends upon the volume of data.
 Variety: Variety refers to heterogeneous sources and the nature of data, both structured
and unstructured. The variety (emails, photos, videos, monitoring devices, PDFs, audio) of
unstructured data poses certain issues for storage, mining and analysing data.
Write down the four computing resources of Big Data Storage.
The four computing resources of Big Data storage are:
 Storage capacity: The ability to accommodate the massive volume of data generated and
collected.
 Scalability: The capability to scale the storage infrastructure horizontally as data volume
grows.
 Data durability: Ensuring the reliability and resilience of data through replication, fault
tolerance, and backup strategies.
 Data accessibility: Facilitating efficient and fast access to stored data through
optimization, indexing, and metadata management.

What is throughput in Big Data?


 Throughput in Big Data refers to the amount of data that can be processed or
transferred within a given time frame.
 It is a measure of the system's ability to handle a high volume of data and perform
operations on it efficiently.
 Throughput is often used as a key performance indicator (KPI) for Big Data systems, as it
reflects the system's ability to meet the demands of processing large amounts of data in
real-time or near-real-time.
 High throughput is essential for applications such as streaming data processing, real-
time analytics, and high-performance computing.
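As a tiny worked example (the numbers are invented for illustration), throughput is simply data volume divided by elapsed time:

public class ThroughputExample {
    public static void main(String[] args) {
        double gigabytesProcessed = 600.0; // assumed: 600 GB of input handled
        double elapsedSeconds = 300.0;     // assumed: in 5 minutes
        double throughput = gigabytesProcessed / elapsedSeconds;
        System.out.println("Throughput = " + throughput + " GB/s"); // 2.0 GB/s
    }
}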

Big data 4 marks


There are two main components of HDFS:
 NameNode:
o It is the master node in HDFS.
o Centralized metadata repository for Hadoop Distributed File System (HDFS).
o Stores file system metadata like filenames, directories, & hierarchical structure.
o It is responsible for managing the file system namespace
o It periodically checkpoints the metadata onto disk to ensure its durability.
o It tracks the location of data blocks on DataNodes.
o Maintains the in-memory mapping of data blocks to DataNodes for efficient data
access.
 DataNode:
o It is the slave node in HDFS.
o Responsible for storing data blocks and serving read/write requests from clients.
o Each DataNode manages a set of data blocks stored on the local file system.
o Detects and reports data block corruption or failures to the NameNode.
o Replicates data blocks to ensure data availability and fault tolerance.
o Supports data streaming and direct access to data blocks for efficient data processing.
o Can be added or removed from the HDFS cluster dynamically without affecting the
overall system availability.
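A minimal Java sketch (the path and replication factor are assumptions) of the client-side view of this split of responsibilities: the NameNode answers metadata and block-location queries, while the file's bytes are streamed to and from DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/demo.txt");         // assumed HDFS path

        // Write a small file; HDFS splits it into blocks and replicates each block.
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {  // replication factor 3
            out.writeUTF("hello hdfs");
        }

        // Ask the NameNode which DataNodes hold the blocks of this file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block hosts: " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}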
Explain HADOOP architecture and how it works.
Hadoop Architecture: At its core, Hadoop has two major layers namely −
 Processing/Computation layer (MapReduce) –
 MapReduce is a programming model and processing framework in Hadoop that
allows for parallel and distributed processing of large datasets.
 It simplifies the processing of big data by dividing the work into two main phases:
the map phase (data is split into smaller chunks and processed in parallel across
multiple nodes) and the reduce phase (results from the map phase are combined
and aggregated to produce the final output.)
 MapReduce provides fault tolerance and automatic parallelization, making it suitable
for large-scale data processing tasks (see the word-count sketch at the end of this answer).
 Storage layer (Hadoop Distributed File System) –
 HDFS is the storage layer of Hadoop and is responsible for storing and managing
data across a cluster.
 It is designed to handle large datasets by breaking them into smaller blocks and
distributing them across multiple machines.
 The data blocks are replicated to ensure fault tolerance and high availability.
 Hadoop Common − these are Java libraries and utilities required by other Hadoop modules.
 Hadoop YARN − this is a framework for job scheduling and cluster resource management.

Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs –

 Data is initially divided into directories and files. Files are divided into uniformly sized blocks of
128 MB or 64 MB (preferably 128 MB).
 These files are then distributed across various cluster nodes for further processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.
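To make the map and reduce phases described above concrete, here is the classic word-count job in Java (input and output paths are assumptions):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: each input line is split into words, each emitted with a count of 1.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: counts for the same word are summed to produce the final total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local reduce before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));     // assumed input directory
        FileOutputFormat.setOutputPath(job, new Path("/output"));  // assumed output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The combiner is an optional local reduce step that shrinks the amount of data shuffled and sorted between the two phases.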
Evolution of Big Data: There are a lot of milestones in the evolution of Big Data which are
described below:
 Data Warehousing: In the 1990s, data warehousing emerged as a solution to store and
analyze large volumes of structured data.
 Hadoop: Hadoop was introduced in 2006; it is an open-source framework that provides
distributed storage and large-scale data processing.
 NoSQL Databases: In 2009, NoSQL databases were introduced, which provide a flexible
way to store and retrieve unstructured data.
 Cloud Computing: Cloud computing technology helps companies to store their
important data in data centers that are remote, and it saves their infrastructure cost
and maintenance costs.
 Machine Learning: Machine Learning algorithms are those algorithms that work on
large data, and analysis is done on a huge amount of data to get meaningful insights
from it. This has led to the development of artificial intelligence (AI) applications.

Characteristics of Big Data


Big data can be described by the following characteristics:
 Volume: The name Big Data itself is related to an enormous size. Whether particular
data can actually be considered Big Data or not depends upon the volume of data.
 Variety: Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. The variety (emails, photos, videos, monitoring
devices, PDFs, audio) of unstructured data poses certain issues for storage, mining
and analyzing data.
 Velocity: The term ‘velocity’ refers to the speed of generation of data. How fast the
data is generated and processed to meet the demands, determines real potential in
the data. The flow of data is massive and continuous.
 Variability: This refers to the inconsistency which can be shown by the data at
times, thus hampering the process of being able to handle and manage the data
effectively.
Discuss the functions of MongoDB Query Language and database Commands.

1. MongoDB Query Language (MQL):


 MQL provides a rich set of querying capabilities to retrieve data from MongoDB.
 It supports filtering based on criteria, sorting, limiting results, & projecting specific fields.
 It supports operations like grouping, summing, averaging, joining, and more.
 It supports various types of indexes, including single field, compound, text, and
geospatial indexes to improve query performance.
 MQL provides text search capabilities, allowing you to perform full-text search queries
on text fields, with support for text indexing, stemming, and relevance sorting.

2. Database Commands:
 Document Manipulation: Commands like insert, update, and delete allow you to create,
modify, and remove documents in collections.
 Index Management: MongoDB commands enable you to create, drop, and manage
indexes on collections to optimize query performance.
 Data Backup and Restoration: Commands like mongodump and mongorestore allow you
to create backups of databases or collections and restore them when needed.
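A minimal sketch using the MongoDB Java driver (connection string, database, and collection names are assumptions) that touches both sides: MQL-style querying with filters, sorting, and limits, plus commands for document manipulation and index management:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.Sorts;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class MongoDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders");

            // Document manipulation: insert and update.
            orders.insertOne(new Document("item", "book").append("qty", 5));
            orders.updateOne(Filters.eq("item", "book"), Updates.set("qty", 7));

            // Index management: create an index to speed up queries on "item".
            orders.createIndex(Indexes.ascending("item"));

            // Querying: filter, sort, and limit results.
            for (Document d : orders.find(Filters.gt("qty", 1))
                                    .sort(Sorts.descending("qty"))
                                    .limit(10)) {
                System.out.println(d.toJson());
            }

            // Document manipulation: delete.
            orders.deleteOne(Filters.eq("item", "book"));
        }
    }
}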
Explain the different tools to extract Big Data
a. Apache Hadoop:
 Distributed processing framework for big data.
 Utilizes the Hadoop Distributed File System (HDFS) for storing and processing large
datasets across clusters of computers.
b. Hive:
 Hive is an open-source big data tool.
 It allows programmers to analyze large data sets on Hadoop.
 It helps with querying and managing large data sets quickly.
c. MongoDB:
 MongoDB is an open-source NoSQL database that is cross-platform compatible and ships
with many built-in features.
 It is ideal for the business that needs fast and real-time data for instant decisions.
 It is ideal for users who want data-driven experience

Brief explanation of 4 types of NoSQL:

1. Key-Value Stores:
 Stores data as simple key-value pairs.
 Provides high performance and scalability.
 Suitable for caching, session management, and user profiles (see the sketch after this list).
 Examples: Redis, Riak, Amazon DynamoDB.
2. Document Databases:
 Stores data in semi-structured documents (e.g., JSON, XML).
 Offers flexibility with dynamic schemas.
 Handles complex, hierarchical, and evolving data structures.
 Examples: MongoDB, Couchbase, Apache CouchDB.
3. Column-Family Stores:
 Stores data in column families or column groups.
 Optimized for write-heavy workloads.
 Efficiently handles vast amounts of data.
 Used for large-scale data analytics and time-series data storage.
 Examples: Apache Cassandra, Apache HBase.
4. Graph Databases:
 Stores and represents relationships between entities.
 Utilizes graph structures with nodes and edges.
 Excellent for complex data modeling and graph traversals.
 Ideal for social networks, recommendation systems, and fraud detection.
 Examples: Neo4j, Amazon Neptune, ArangoDB
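As an illustration of the key-value model (type 1 above), a minimal sketch using the Jedis client for Redis; the host, port, and key are assumptions:

import redis.clients.jedis.Jedis;

public class KeyValueDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {   // assumed local Redis instance
            // Store and read back a simple key-value pair, e.g. a session entry.
            jedis.set("session:42", "alice");
            System.out.println(jedis.get("session:42"));     // prints "alice"
        }
    }
}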
Feature | Key-value Database | Document Database
Data Structure | Simple key-value pairs | Documents (often JSON or BSON)
Schema | Schema-less | Schema-less or schema-on-read
Querying | Limited query capabilities (usually based on key lookup) | Advanced querying using query languages or APIs
Data Organization | Flat structure, no hierarchical relationships | Supports hierarchical relationships and nested data
Indexing | Typically only indexes on keys | Indexes on various fields and attributes within documents
Scalability | Excellent scalability and performance for read-heavy workloads | Good scalability for both read- and write-heavy workloads
Use Cases | Caching, session management, simple key-based lookups | Content management systems, user profiles, e-commerce applications
Consistency | Eventual consistency (replication delay can lead to data inconsistencies) | Eventual consistency (replication delay can lead to data inconsistencies)
Examples | Redis, Riak | MongoDB, CouchDB

Criteria | Hadoop | RDBMS
Schema | Based on ‘Schema on Read’. | Based on ‘Schema on Write’.
Data Type | Structured, semi-structured and unstructured data. | Structured data.
Efficiency | Works better when the data size is big (i.e., in terabytes and petabytes). | Works better when the volume of data is low (in gigabytes).
Throughput | Hadoop has higher throughput compared to RDBMS. | RDBMS fails to achieve a higher throughput compared to the Apache Hadoop framework.
Latency / Response Time | Hadoop has high latency, since it processes data in batches. | RDBMS has low latency compared to Hadoop.
Speed | Writes are fast. | Reads are fast.
Cost | Open-source framework, free of cost. | Licensed software, paid.
Application | Data discovery; storage and processing of unstructured data. | OLTP and complex ACID transactions.
