Distributed and parallel computing play crucial roles in big data analysis by enabling the processing of vast amounts of data
efficiently and effectively. Here’s a breakdown of their roles:
Distributed Computing
Distributed computing involves a network of interconnected computers that work together to process data. This approach
offers several advantages for big data analysis:
1. Scalability: Distributed systems can easily scale out by adding more nodes to handle increasing data volumes.
2. Fault Tolerance: If one node fails, others can take over, ensuring continuous data processing.
3. Resource Sharing: Multiple users can share resources, optimizing the use of available computational power.
4. Cost-Effectiveness: Using a network of commodity hardware is often more cost-effective than investing in a single supercomputer.
Parallel Computing
Parallel computing, on the other hand, involves dividing a large problem into smaller sub-problems that can be solved
simultaneously. This method is particularly beneficial for big data analysis because:
1. Speed: By processing multiple tasks at the same time, parallel computing significantly reduces the time required for
data analysis.
2. Efficiency: It maximizes the use of available computational resources, leading to faster data processing.
3. Complex Problem Solving: Parallel computing can handle complex computations that would be infeasible on a single processor (see the sketch after this list).
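As an illustration of the parallel model described above, here is a minimal Python sketch that divides a summation into sub-problems and solves them simultaneously with a process pool; the data, chunk size, and use of Python's multiprocessing module are illustrative choices, not part of the original notes.

from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker solves one sub-problem independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))                     # illustrative input
    chunk_size = 100_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool() as pool:
        # The sub-problems are solved simultaneously on the available CPU cores.
        results = pool.map(partial_sum, chunks)
    print(sum(results))                               # same answer as sum(data)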
Frameworks like Hadoop and Spark leverage both distributed and parallel computing to manage and analyze big data:
Hadoop: Utilizes the Hadoop Distributed File System (HDFS) for distributed storage and MapReduce for parallel
processing of large data sets.
Spark: Provides in-memory processing capabilities, making it faster than Hadoop for certain tasks. It also supports distributed data processing.
By combining distributed and parallel computing, these frameworks enable organizations to process and analyze large
datasets efficiently, driving insights and innovation.
The Hadoop Ecosystem is a suite of tools and frameworks that work together to facilitate the storage, processing, and
analysis of large datasets. It is built around the core Hadoop framework, which includes the Hadoop Distributed File System
(HDFS) and the MapReduce programming model. Here’s a brief overview of the key components and their roles:
1. HDFS (Hadoop Distributed File System): A distributed file system that stores data across multiple machines,
providing high throughput access to data.
2. MapReduce: A programming model for processing large data sets with a distributed algorithm on a Hadoop cluster.
3. YARN (Yet Another Resource Negotiator): Manages resources and schedules jobs in the Hadoop cluster.
4. HBase: A NoSQL database that runs on top of HDFS, providing real-time read/write access to large datasets.
5. Hive: A data warehousing tool that provides SQL-like querying capabilities on top of Hadoop.
6. Pig: A high-level platform for creating MapReduce programs used with Hadoop.
7. Spark: An open-source cluster-computing framework that provides in-memory processing capabilities.
8. Sqoop: A tool for transferring data between Hadoop and relational databases.
9. Flume: A service for collecting and moving large amounts of log data.
Advantages of the Hadoop Ecosystem
1. Scalability: Hadoop can scale from a single server to thousands of machines, each offering local computation and storage.
2. Cost-Effectiveness: It uses commodity hardware, making it a cost-effective solution for storing and processing large datasets.
3. Flexibility: Hadoop can process structured, semi-structured, and unstructured data, allowing for a wide range of data types and sources.
4. Fault Tolerance: Data is replicated across multiple nodes, ensuring that the system can recover from hardware failures without data loss.
5. High Throughput: HDFS provides high throughput access to application data and is suitable for applications with large data sets.
6. Community Support: Being an open-source project, Hadoop has a large community of developers and users who contribute to its continuous improvement.
The Hadoop Ecosystem’s combination of these tools and advantages makes it a powerful platform for big data analytics,
enabling organizations to derive valuable insights from their data.
Hadoop Streaming is a utility that allows users to create and run MapReduce jobs with any executable or script as the
mapper and/or reducer. This flexibility means you can write MapReduce programs in languages other than Java, such as
Python, Perl, Ruby, or even shell scripts.
1. Input and Output: The utility reads input data from HDFS and passes it to the mapper script via standard input
(stdin). The mapper processes the data and outputs key-value pairs to standard output (stdout).
2. Mapper: Each mapper task launches the specified script as a separate process. The mapper reads input lines,
processes them, and outputs key-value pairs.
3. Shuffle and Sort: The framework sorts and transfers the mapper outputs to the reducers.
4. Reducer: Each reducer task also launches the specified script as a separate process. The reducer reads the sorted
key-value pairs, processes them, and outputs the final results to HDFS.
Example Command
hadoop jar hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc
(The exact path of the hadoop-streaming jar depends on the installation.)
In this example:
Input: Data is read from myInputDirs.
Output: Results are written to myOutputDir.
Mapper: /bin/cat simply passes each input line through unchanged.
Reducer: /usr/bin/wc counts the lines, words, and characters it receives.
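Beyond /bin/cat and /usr/bin/wc, the mapper and reducer can be scripts in any language. Below is a minimal word-count sketch in Python; the file names mapper.py and reducer.py are illustrative. Each script reads lines from standard input and writes tab-separated key-value pairs to standard output, which is what Hadoop Streaming expects.

#!/usr/bin/env python3
# mapper.py - emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - sums the counts per word; Hadoop delivers the input sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

These scripts would be supplied with -mapper mapper.py -reducer reducer.py and shipped to the cluster nodes (for example with the -files generic option).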
Hadoop Architecture
The Hadoop architecture consists of several key components that work together to process large datasets efficiently:
1. HDFS (Hadoop Distributed File System): A distributed file system that stores data across multiple nodes, providing
high throughput access to data.
2. MapReduce: A programming model for processing large data sets with a distributed algorithm on a Hadoop cluster.
3. YARN (Yet Another Resource Negotiator): Manages resources and schedules jobs in the Hadoop cluster.
4. Hadoop Common: The common utilities and libraries that support other Hadoop modules.
HDFS Architecture
NameNode: Manages the metadata and namespace of the file system. It keeps track of the files, directories, and
blocks.
DataNode: Stores the actual data blocks. DataNodes report to the NameNode with the list of blocks they store.
Secondary NameNode: Periodically merges the namespace image with the edit log so that the edit log does not grow
too large and NameNode restarts stay fast; it is a housekeeping helper, not a backup NameNode.
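As a small illustration of working against this architecture, the sketch below shells out to the standard hdfs dfs commands from Python; it assumes a running cluster, the Hadoop binaries on the PATH, and a local file named local_data.txt, all of which are assumptions made only for the example.

import subprocess

# Copy a local file into HDFS, then list the directory and read the file back.
# The NameNode resolves paths to block locations; DataNodes serve the actual bytes.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", "local_data.txt", "/user/demo/data.txt"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-cat", "/user/demo/data.txt"], check=True)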
MapReduce Architecture
JobTracker: Manages the MapReduce jobs, distributes tasks to TaskTrackers, and monitors their progress (MapReduce 1 only; in Hadoop 2 this role is taken over by YARN).
TaskTracker: Executes the individual tasks assigned by the JobTracker and reports progress.
YARN Architecture
ResourceManager: The cluster-wide master that allocates resources and schedules applications.
NodeManager: Manages resources on a single node and monitors the resource usage of containers.
ApplicationMaster: A per-application process that negotiates resources from the ResourceManager and coordinates task execution.
Advantages of Hadoop Streaming
1. Language Flexibility: Hadoop Streaming allows you to write MapReduce jobs in any language that can read from
standard input and write to standard output. This means you can use languages like Python, Perl, Ruby, or even
shell scripts, making it accessible to a wider range of developers.
2. Ease of Use: By using familiar scripting languages, Hadoop Streaming simplifies the process of writing MapReduce
programs. You don’t need to learn Java or other complex programming languages to leverage Hadoop’s power.
3. Integration: Hadoop Streaming seamlessly integrates with other components of the Hadoop ecosystem, such as
HDFS for storage and YARN for resource management. This integration ensures that your streaming jobs can
efficiently utilize the distributed computing resources available in a Hadoop cluster.
Q5. What is Big Data Analytics? Write some advantages of Big Data Analytics.
Big Data Analytics is the process of examining large and varied data sets—often referred to as big data—to uncover hidden
patterns, unknown correlations, market trends, customer preferences, and other useful business information. This analysis
helps organizations make informed decisions, improve operations, and gain a competitive edge.
Advantages of Big Data Analytics
1. Cost Reduction: Big data technologies, such as Hadoop and cloud-based analytics, bring significant cost advantages when it comes to storing large amounts of data. They also help identify more efficient ways of doing business.
2. Faster, Better Decision Making: With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new sources of data, businesses can analyze information immediately and make decisions based on what they've learned.
3. New Product and Service Innovations: By analyzing big data, companies can gauge customer needs and satisfaction through analytics. This helps in creating new products and services to meet customer needs.
4. Improved Customer Experience: Big data analytics helps in understanding customer behavior and preferences, allowing businesses to tailor their products and services to meet customer expectations more effectively.
5. Operational Efficiency: Analyzing big data can help streamline operations, reduce downtime, and improve overall efficiency by identifying bottlenecks and areas for improvement.
6. Risk Management: Big data analytics can enhance risk management by identifying potential risks and providing insights to mitigate them. This is particularly useful in industries like finance and insurance.
7. Competitive Advantage: Companies that leverage big data analytics can gain a competitive edge by making more informed decisions, optimizing their operations, and better understanding their market and customers.
Sampling Techniques
Sampling techniques are methods used to select a subset of individuals from a larger population to represent the whole.
Here are some common sampling techniques:
Probability Sampling
1. Simple Random Sampling: Every member of the population has an equal chance of being selected. This method is straightforward but requires a complete list of the population.
2. Systematic Sampling: Selects every nth member from a list of the population. The starting point is chosen randomly.
3. Stratified Sampling: Divides the population into subgroups (strata) based on a specific characteristic, then randomly samples from each subgroup. This ensures representation from all subgroups.
4. Cluster Sampling: Divides the population into clusters, randomly selects some clusters, and then samples all members from those clusters. This method is useful when the population is spread over a large area.
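The probability techniques above can be sketched with Python's standard library alone; the population, sample sizes, and strata below are made-up values used only for illustration.

import random

population = list(range(1, 1001))    # e.g. customer IDs 1..1000 (illustrative)

# 1. Simple random sampling: every member has an equal chance of selection.
simple = random.sample(population, 50)

# 2. Systematic sampling: every nth member after a random starting point.
n = 20
start = random.randrange(n)
systematic = population[start::n]

# 3. Stratified sampling: sample randomly within each subgroup (stratum).
strata = {"even": [x for x in population if x % 2 == 0],
          "odd":  [x for x in population if x % 2 == 1]}
stratified = [x for group in strata.values() for x in random.sample(group, 25)]

# 4. Cluster sampling: pick whole clusters at random and keep every member of them.
clusters = [population[i:i + 100] for i in range(0, len(population), 100)]
cluster_sample = [x for c in random.sample(clusters, 3) for x in c]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))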
Non-Probability Sampling
1. Convenience Sampling: Samples are selected based on ease of access. This method is quick and inexpensive but may not be representative of the population.
2. Quota Sampling: Ensures that specific characteristics are represented in the sample by setting quotas for subgroups.
3. Purposive (Judgmental) Sampling: Samples are selected based on the researcher's judgment about which members will be most useful or representative.
4. Snowball Sampling: Existing study subjects recruit future subjects from among their acquaintances. This method is often used in studies involving hard-to-reach populations.
Data Classification
Data classification is the process of organizing data into categories for its most effective and efficient use. Here are some
common data classification techniques:
1. Content-Based Classification: Classifies data based on its content. For example, emails containing sensitive information can be classified as confidential.
2. Context-Based Classification: Classifies data based on the context in which it was created or used. For instance, documents created by the finance department might be classified as financial data.
3. User-Based Classification: Relies on users to classify data based on their knowledge and judgment. This method is often used when users are best positioned to understand the sensitivity of the data.
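As a small, hedged illustration of content-based classification, the Python sketch below labels text as Confidential when it matches simple patterns for sensitive content; the patterns and labels are illustrative assumptions, not a production rule set.

import re

# Illustrative patterns for "sensitive" content (SSN-like and card-like numbers).
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    re.compile(r"\bconfidential\b", re.IGNORECASE),
]

def classify(text):
    # Content-based classification: inspect the data itself, not its context.
    if any(p.search(text) for p in SENSITIVE_PATTERNS):
        return "Confidential"
    return "Public"

print(classify("Quarterly newsletter draft"))   # Public
print(classify("Employee SSN: 123-45-6789"))    # Confidential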
Advantages of Data Classification
1. Improved Data Security: Helps in identifying and protecting sensitive data, reducing the risk of data breaches.
2. Regulatory Compliance: Ensures that data handling practices comply with relevant laws and regulations.
3. Efficient Data Management: Facilitates better data organization, making it easier to locate and retrieve information.
4. Cost Reduction: Helps in reducing storage and backup costs by identifying and managing data according to its importance.
Hadoop YARN (Yet Another Resource Negotiator) is a core component of the Hadoop ecosystem introduced in Hadoop 2.0.
It is responsible for resource management and job scheduling/monitoring in Hadoop clusters. YARN allows multiple data
processing engines such as batch processing, stream processing, interactive processing, and graph processing to run and
process data stored in HDFS.
1. ResourceManager (RM): The central authority that manages resources across the cluster. It has two main
components:
o Scheduler: Allocates resources to various running applications based on resource requirements and
constraints like capacities and queues. It does not monitor or track the status of applications.
o ApplicationsManager: Manages job submissions, negotiates the first container for executing the application-specific ApplicationMaster, and handles the restart of the ApplicationMaster container on failure.
2. NodeManager (NM): Runs on each node in the cluster and is responsible for managing containers, monitoring their resource usage (CPU, memory, disk, network), and reporting this information to the ResourceManager.
3. ApplicationMaster (AM): A per-application component that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor tasks. Each application has its own ApplicationMaster.
HBase is a distributed, scalable, and NoSQL database built on top of the Hadoop Distributed File System (HDFS). It is
designed to handle large amounts of sparse data, providing real-time read/write access to big data. HBase is modeled after
Google’s Bigtable and is part of the Hadoop ecosystem, offering a way to store and retrieve data in a column-oriented
format.
HBase Architecture
HBase architecture consists of several key components that work together to manage and process data efficiently:
1. HMaster:
o The master server that assigns regions to RegionServers, balances load across them, and handles schema operations such as creating or altering tables.
2. RegionServer:
o Serves read and write requests for the regions it hosts.
o Stores data in HDFS and uses MemStore for in-memory storage before flushing to disk.
3. Regions:
o Horizontal partitions of a table's data; regions are dynamically split and reassigned to balance the load across RegionServers.
4. ZooKeeper:
o Ensures high availability and reliability by coordinating the HMaster and RegionServers (tracking which servers are alive and where regions are assigned).
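A hedged sketch of real-time read/write access from Python using the third-party happybase client; it assumes an HBase Thrift server is reachable on localhost and that a table named users with a column family info already exists (both names are illustrative).

import happybase

connection = happybase.Connection("localhost")   # talks to the HBase Thrift server
table = connection.table("users")

# Write: row key -> {column_family:qualifier -> value}, all as bytes.
table.put(b"user1", {b"info:name": b"Alice", b"info:city": b"Pune"})

# Real-time read of a single row.
print(table.row(b"user1"))

# Scan a range of rows by key prefix.
for key, data in table.scan(row_prefix=b"user"):
    print(key, data)

connection.close()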
Q9. Explain File output format, Record writer and role of combiner in detail
In Hadoop MapReduce, the OutputFormat determines how the output of a job is written. It specifies the output directory
and the format in which the output data is stored. There are several types of OutputFormats provided by Hadoop:
1. TextOutputFormat: The default OutputFormat. It writes each key-value pair as a line of text, separated by a tab
character.
2. SequenceFileOutputFormat: Writes the output as a binary file containing serialized key-value pairs. It is more
efficient for large datasets.
3. NullOutputFormat: Discards the output, useful for jobs where the output is not needed.
Record Writer
The RecordWriter is responsible for writing the output key-value pairs from the Reducer to the output files. It is provided by
the OutputFormat and has two main functions:
1. write: Takes key-value pairs from the MapReduce job and writes them to the output file.
2. close: Closes the data stream to the output file, ensuring all data is properly written and resources are released.
The default RecordWriter is the LineRecordWriter, which writes each key-value pair as a line of text. Custom RecordWriters can be implemented to write data in different formats, such as CSV or JSON.
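To make the default layout concrete, the short Python sketch below reads back a file written by TextOutputFormat/LineRecordWriter, where each line holds one key-value pair separated by a tab; the file name part-r-00000 is the conventional reducer output name and is assumed here.

# Read records written by TextOutputFormat: one "key<TAB>value" pair per line.
with open("part-r-00000", encoding="utf-8") as f:
    for line in f:
        key, value = line.rstrip("\n").split("\t", 1)
        print(f"key={key!r}  value={value!r}")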
Role of Combiner
A Combiner is an optional component in the MapReduce framework that acts as a mini-reducer. It processes the output of
the Mapper before it is sent to the Reducer, reducing the amount of data transferred across the network. The main
functions of a Combiner are:
1. Data Reduction: Aggregates intermediate data to reduce the volume of data transferred to the Reducer. This can significantly improve the performance of the MapReduce job.
2. Local Aggregation: Performs local aggregation of data on the Mapper node, which helps in reducing the load on the Reducer (see the sketch after this list).
The Combiner class must implement the same interface as the Reducer, and it should only be used for operations that are commutative and associative, such as summing or counting; a naive average, for example, can give wrong results because the framework may apply the Combiner zero, one, or several times.
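The sketch below is a plain-Python illustration (not Hadoop code) of what a combiner does for a word count: it aggregates one mapper's (word, 1) pairs locally so that far fewer pairs have to cross the network to the reducer.

from collections import Counter

def combine(mapper_output):
    # Local aggregation on the mapper node: sum the counts per word.
    totals = Counter()
    for word, count in mapper_output:
        totals[word] += count
    return sorted(totals.items())

mapper_output = [("hadoop", 1), ("spark", 1), ("hadoop", 1), ("hadoop", 1)]
print(combine(mapper_output))   # [('hadoop', 3), ('spark', 1)] - 2 pairs instead of 4

In Hadoop Streaming the same effect is commonly obtained by passing the reducer script as the -combiner, which is safe here because summation is commutative and associative.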
Optimizing MapReduce jobs is crucial for improving performance and efficiency. Here are some key techniques for
optimizing MapReduce:
Disable Access Time Updates: Mount DFS and MapReduce storage with the -noatime option to disable access time updates, improving I/O performance.
Avoid RAID on TaskTracker and DataNode Machines: RAID can reduce performance; instead, configure mapred.local.dir and dfs.data.dir to point to one directory on each disk.
Monitor Resource Usage: Use monitoring tools to track swap and network usage. Reduce RAM allocation if swap is being used excessively.
Intermediate Data Compression: Enable LZO compression for intermediate data to reduce disk I/O during the shuffle phase. This can be done by setting mapred.compress.map.output to true.
Task Duration: Ensure each task runs for at least 1 minute to avoid the overhead of starting and stopping JVMs frequently.
Block Size Adjustment: For large datasets (e.g., over 1TB), increase the block size to 256MB or 512MB to reduce the number of tasks.
Mapper and Reducer Slots: Adjust the number of mapper tasks to be a multiple of the number of mapper slots in the cluster. Similarly, set the number of reducer tasks to be equal to or slightly less than the number of reducer slots.
4. Speculative Execution
Enable Speculative Execution: This feature allows the framework to run duplicate tasks for slow-running tasks,
ensuring faster job completion. It can be enabled by
setting mapreduce.map.speculative and mapreduce.reduce.speculative to true.
5. Data Locality
Data Locality Optimization: Ensure that data is processed on the node where it is stored to minimize network
I/O. Hadoop's scheduler attempts this by default, preferring to run each map task on a node (or rack) that already holds the input block.
6. Combiner Usage
Use Combiners: Implement combiners to perform local aggregation of intermediate data before it is sent to the reducer. This reduces the amount of data transferred across the network.
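A hedged sketch of how several of these settings could be applied when launching a streaming job from Python; the jar path, directories, and script names are placeholders, and it assumes the hadoop command is available on the machine.

import subprocess

cmd = [
    "hadoop", "jar", "hadoop-streaming.jar",      # jar path varies by installation
    # Compress intermediate (map output) data to cut shuffle I/O.
    "-D", "mapreduce.map.output.compress=true",
    # Run duplicate attempts for straggling map and reduce tasks.
    "-D", "mapreduce.map.speculative=true",
    "-D", "mapreduce.reduce.speculative=true",
    "-files", "mapper.py,reducer.py",
    "-input", "myInputDirs",
    "-output", "myOutputDir",
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
    # Local aggregation before the shuffle (safe for additive counts).
    "-combiner", "reducer.py",
]
subprocess.run(cmd, check=True)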
Hadoop is a powerful framework for processing large datasets across clusters of computers. It relies on several key
daemons (background processes) to manage and execute its tasks. Here are the main Hadoop daemons and their utilities:
1. NameNode
Role: Acts as the master server for the Hadoop Distributed File System (HDFS).
Utility: Manages the metadata and directory structure of all files and directories in the HDFS. It keeps track of
where data is stored across the cluster.
2. Secondary NameNode
Utility: Periodically merges the namespace image with the edit logs so the edit log stays small and NameNode restarts remain fast. It is not a backup NameNode but helps in housekeeping tasks.
3. DataNode
Utility: Stores the actual data blocks. It performs read and write operations as requested by the clients and the
NameNode.
4. ResourceManager
Utility: Allocates resources to various running applications. It is part of the YARN (Yet Another Resource Negotiator)
framework.
5. NodeManager
Utility: Monitors resource usage (CPU, memory, disk) and reports to the ResourceManager. It also manages the
execution of containers on the node.
MapReduce is a programming model and execution environment for processing large datasets in a distributed manner. It
simplifies data processing across large clusters of computers. Here’s a detailed look at how data is processed using the
MapReduce execution environment:
1. MapReduce Architecture
Map Function: Takes a set of input key-value pairs and produces a set of intermediate key-value pairs.
Reduce Function: Merges all intermediate values associated with the same intermediate key.
2. Execution Flow
a. Input Splitting
The input data is split into fixed-size chunks, typically 64MB or 128MB.
b. Mapping
Each map task processes a chunk of data and generates intermediate key-value pairs.
The output of the map function is stored on the local disk of the node running the map task.
c. Shuffling and Sorting
The framework sorts the intermediate key-value pairs by key and transfers them from the mappers to the reducers.
This step ensures that all values associated with a particular key are grouped together.
d. Reducing
Each reduce task receives a key and a list of values associated with that key.
The reduce function aggregates these values to produce the final output.
e. Output
The output is typically stored in multiple files, one per reduce task.
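To make the flow concrete, here is a minimal in-memory Python simulation of the same steps (splitting, mapping, shuffling and sorting, reducing) for a word count; it only illustrates the programming model and is not Hadoop itself.

from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: one input record -> intermediate (key, value) pairs.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: merge all values that share the same key.
    return key, sum(values)

documents = ["big data needs parallel processing",
             "hadoop processes big data"]          # illustrative input splits

# a/b. Input splitting and mapping: one "split" per line here.
intermediate = [pair for line in documents for pair in map_fn(line)]

# c. Shuffling and sorting: group all intermediate values by key.
intermediate.sort(key=itemgetter(0))

# d. Reducing: one reduce call per distinct key.
output = [reduce_fn(k, [v for _, v in grp])
          for k, grp in groupby(intermediate, key=itemgetter(0))]

# e. Output: the final (key, count) pairs.
print(output)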