BDA QB3

1. Explain Map Reduce Map tasks with the Map Reduce Programming Model.

A Map task in the MapReduce programming model processes input key-value pairs
(k1, v1), where k1 is a key (such as a byte offset into the input file) and v1 is the
associated value, typically a line or large string read from the input file. The map()
function, implemented in the Mapper class of the Hadoop Java API, generates zero or
more intermediate key-value pairs (k2, v2). These intermediate pairs are passed to the
Reduce task, which aggregates and transforms them into a smaller dataset using
reducing functions or combiners. The Mapper processes each input split independently,
regardless of overall input size, working on one record at a time. The number of Map
tasks (Nmap) depends on the input file size and the block size, typically 10–100 tasks
per node, but can be adjusted (as a hint to the framework) using the setNumMapTasks()
function. For instance, a 1 TB file with a 128 MB block size creates
1,048,576 MB / 128 MB = 8,192 Map tasks.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The four type parameters are the input and intermediate key-value types
// (k1, v1, k2, v2); the concrete types shown here are typical for text input.
public class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Your mapping logic here: emit zero or more intermediate
        // (k2, v2) pairs with context.write(k2, v2).
    }
}

2. Discuss how to compose MapReduce for calculations.

1. Counting and Summing

This is used for tasks like counting messages, words, or items in a dataset. The
Mapper emits 1 for each item encountered, and the Reducer sums up all these 1s to
calculate the total count.
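
A minimal word-count sketch of this pattern, assuming plain text input (class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, 1) for every token in the input line.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sum the 1s emitted for each word to get its total count.
class TokenCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}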

2. Sorting

Sorting involves organizing data in a particular order based on sorting keys. The
Mapper emits items as values paired with sorting keys, and the Reducer combines and
arranges these items according to the keys.

3. Finding Distinct Values

Finding distinct values is useful in scenarios like counting unique users in a web log.
There are two approaches:

 In the first, the Mapper emits dummy counters for each group, and the
Reducer totals these counters to find unique counts.
 In the second, the Mapper emits values with group IDs, and the Reducer
removes duplicates within each group and increments a counter for unique
groups.
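
A simplified sketch of the deduplication idea underlying both approaches: the Mapper emits each value as a key, the shuffle groups duplicates together, and the Reducer outputs each key once (class names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit the value itself as the key, with an empty payload.
public class DistinctMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
    }
}

// Reducer: duplicates arrive grouped under one key, so emit the key once.
class DistinctReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
    }
}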

4. Collating
Collating refers to grouping items that share the same value into a single collection.
The Mapper computes a function value for each item as a key and emits the item. The
Reducer groups items based on these keys and outputs the results.

5. Filtering or Parsing

Filtering or parsing focuses on extracting or transforming items that meet specific
conditions. The Mapper processes items one by one and emits only those that satisfy
the criteria, or outputs their transformed versions. The Reducer collects the filtered
items, saves them, and produces the final output.

6. Distributed Task Execution

Distributed task execution is used for large computations that are divided into smaller
partitions. The Mapper performs computations on each partition and emits results,
which the Reducer aggregates to produce the final output.

7. Graph Processing

Graph processing involves analyzing graphs, where nodes represent entities and edges
represent relationships. Path traversal methods process graphs by moving from one
node to another, generating results at each step. These results are passed as messages
for further traversal. Iterative message passing is used in cyclic paths.

3. Illustrate different Relational Algebraic operations in MapReduce.

Selection

Definition: Selection extracts from a relation the tuples that satisfy a given
condition C. In MapReduce, the Mapper tests C on each tuple t and emits (t, t) if the
test passes; the Reducer is simply the identity.

Projection

Definition: Projection selects specific attributes from a relation while eliminating
duplicates. The Mapper emits (t', t'), where t' is the tuple restricted to the chosen
attributes; the Reducer outputs one copy of each t', removing the duplicates created
by the projection.

Union

Definition: Union combines tuples from two relations, removing duplicates. The
Mapper emits (t, t) for every tuple of either relation; the Reducer outputs each key
exactly once.

Intersection

Definition: Intersection retrieves tuples present in both relations. The Mapper emits
(t, t); the Reducer outputs t only if it receives a value from each of the two relations.

Difference

Definition: Difference (R − S) retrieves tuples from one relation that are not in the
other. The Mapper tags each tuple with the name of its source relation, emitting
(t, R) or (t, S); the Reducer outputs t only when the tag R appears without the tag S.

Natural Join

Definition: Natural Join combines two relations on their common attributes,
eliminating duplicate attributes in the result. For R(a, b) and S(b, c), the Mapper
emits the join attribute b as the key with the tagged remainder of the tuple as the
value; the Reducer pairs every R-value with every S-value that shares the same b.
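
A minimal sketch of selection as a map-only job, assuming tab-separated input tuples and a hypothetical condition on the first attribute (names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Selection: emit only the tuples that satisfy the condition C;
// the Reducer (if any) is the identity, so this can run map-only.
public class SelectionMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] attributes = value.toString().split("\t");
        // Hypothetical condition C: the first attribute equals "Action".
        if (attributes.length > 0 && "Action".equals(attributes[0])) {
            context.write(value, NullWritable.get());
        }
    }
}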

4. Discuss HIVE (i) features (ii) architecture (iii) installation process


The Hive Server (Thrift) is an optional service that enables remote clients to submit
requests to Hive and retrieve results. It supports multiple programming languages for
sending requests.

The Hive CLI (Command Line Interface) is the most popular tool for interacting with
Hive. It can also run in local mode, in which case Hive uses local storage and
execution instead of HDFS on a Hadoop cluster.

The Web Interface provides users with the option to access Hive through a web
browser. This feature requires an HWI (Hive Web Interface) Server running on a
designated node.

The Metastore serves as the system catalog for Hive. It stores essential schema and
metadata, including table structures, databases, column details, data types, and
mappings to HDFS.

The Hive Driver is responsible for managing the lifecycle of a HiveQL statement. It
oversees the phases of compilation, optimization, and execution, ensuring that queries
are processed efficiently and results are delivered accurately.

Installation process: (i) Install Java and the JDK (javac) from the Oracle Java
download website. (ii) Set the PATH environment variable for jdk1.7.0_71. (iii) Install
Hadoop. (iv) Set the Hadoop environment variables (HADOOP_HOME,
HADOOP_MAPRED_HOME, HADOOP_COMMON_HOME, HADOOP_HDFS_HOME and related paths) and
configure the Hadoop property files. (v) Assign yarn.nodemanager.aux-services to
mapreduce_shuffle. (vi) Download Hive. (vii) Configure the Metastore for the server.
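
Step (v) corresponds to a property entry in yarn-site.xml along these lines:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>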

5. Compare HIVE and RDBMS.


6. Explain HIVE datatypes and file format.
7. Discuss HIVE data model and data flow sequences.
8. Explain HIVE built-in functions

9. Define HiveQL. Write a program to create, show, drop and query operations
taking a database for a toy company.

HiveQL (Hive Query Language) is a SQL-like query language used in Apache Hive
for managing and querying large datasets stored in a Hadoop environment. It
simplifies data analysis by providing a familiar SQL interface while operating on
distributed datasets.

-- 1. Create a database for the toy company
CREATE DATABASE toy_company;

-- 2. Use the toy_company database
USE toy_company;

-- 3. Create a table for storing toy details
CREATE TABLE toys (
    toy_id INT,
    toy_name STRING,
    category STRING,
    price FLOAT,
    stock INT
);

-- 4. Insert some data into the toys table
INSERT INTO TABLE toys VALUES
    (1, 'Action Figure', 'Action', 15.99, 100),
    (2, 'Doll', 'Fashion', 10.99, 200),
    (3, 'Lego Set', 'Building', 50.00, 50),
    (4, 'Puzzle', 'Educational', 20.50, 150);

-- 5. Query operations
-- a. Show all toys
SELECT * FROM toys;

-- b. Find toys under $20
SELECT toy_name, price FROM toys WHERE price < 20;

-- 6. Show all tables in the database
SHOW TABLES;

-- 7. Drop the toys table
DROP TABLE toys;

-- 8. Drop the toy_company database
DROP DATABASE toy_company;


10. Explain table partitioning, bucketing, views, join and aggregation in HiveQL.

· Table Partitioning:

 Organizes data into partitions based on column values (e.g., by date, region).
 Improves query performance by scanning only the relevant partitions.

· Bucketing:

 Divides data within partitions into buckets based on a hash function of a column.
 Ensures even distribution and efficient joins on bucketed columns.

· Views:

 Virtual tables created using CREATE VIEW.
 Simplify complex queries; a view stores only query logic, not data.

· Joins:

 Combine data from multiple tables based on a common column.
 Types: Inner, Left, Right, Full, and Cross joins.

· Aggregation:

 Performs operations like SUM, AVG, COUNT, etc., on grouped data.
 Achieved using GROUP BY and aggregate functions.

11. Explain PIG architecture with applications and features.

Pig scripts can be executed in three ways: Grunt Shell for interactive script execution,
Script Files for commands run on the Pig Server, and Embedded Scripts with UDFs
for custom functions.

Scripts are processed by a Parser, which checks syntax and types, generating a
Directed Acyclic Graph (DAG) to represent logical operators and data flows. The
Optimizer refines the DAG by splitting, merging, and reordering operators to reduce
data processing and improve efficiency.
Features

1. Ease of Use: High-level scripting language simplifies data processing.
2. Extensibility: Supports custom functions for specialized operations.
3. Data Handling: Manages structured, semi-structured, and unstructured data.
4. Optimized Execution: Automatically optimizes queries for faster execution.
5. Interoperability: Runs on top of Hadoop; integrates with HDFS and MapReduce.

Applications

 Processing large-scale unstructured and semi-structured data.
 Log analysis and web analytics.
 Data transformation and ETL (Extract, Transform, Load).
 Machine learning preprocessing.

12. Give differences between (i) PIG and MapReduce (ii) PIG and SQL
13. Give Pig Latin data model with Pig installation steps.
14. Explain Pig Relational Operators.

· FOREACH: Iterates over each tuple in a dataset and applies a transformation to
generate a new relation, typically used to compute expressions or derive new fields
from existing ones.
· FILTER: Filters out tuples from a dataset that do not meet a specified condition,
essentially narrowing down the dataset based on a predicate expression.
· GROUP: Groups the data based on a specified field (or fields), creating a nested
relation with each group containing tuples that share the same values in the group-by
field.
· ORDER BY: Sorts the dataset based on one or more fields, either in ascending or
descending order, allowing for organized data output.
· DISTINCT: Removes duplicate tuples from a dataset, ensuring that each tuple in
the result is unique.
· JOIN: Combines two datasets based on a common field (or condition), producing a
new dataset that contains tuples from both datasets where the join condition is
satisfied.

15. Illustrate user defined functions in PIG with a programming example.

User-Defined Functions (UDFs) in Pig are custom functions created by programmers
to perform operations not available among Pig's built-in functions. UDFs can be
written in programming languages such as Java, Python, and Ruby, and are used for
tasks such as filtering data or performing complex analysis. A Java UDF must extend
either the EvalFunc or FilterFunc class, with the core method exec() holding the
logic. EvalFunc is the base class for all evaluation functions and is parameterized
with the return type, such as String. FilterFunc extends EvalFunc<Boolean> and is
used when a function returns a Boolean value, making it applicable in operations
like FILTER or the bincond operator.

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// FilterFunc UDF: keep only tuples whose first field is one of the allowed ages.
public class IsCorrectAge extends FilterFunc {

    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            return (i == 11 || i == 20 || i == 25);
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
16. Explain the following: (i) text mining with text analytics process pipe line (ii)
text mining process and phases (iii) text mining challenges

17. Discuss the following: (i) naive bayes analysis (ii) support vector machines (iii)
binary classification

· Naive Bayes Classifier: A probabilistic classifier based on Bayes' Theorem with
the assumption that features are conditionally independent. It calculates the
probability of a class c given a set of features (words, in text classification) by
multiplying the individual feature probabilities:
P(c | w1, ..., wn) ∝ P(c) · P(w1 | c) · ... · P(wn | c). This simplicity makes it
effective for text classification tasks such as spam detection.

· Support Vector Machines (SVM): A supervised learning algorithm used for
classification tasks that finds the optimal hyperplane separating the different
classes. The hyperplane is chosen to maximize the margin, which is the distance to
the closest data points (the support vectors) from each class. SVM can be applied in
various domains, including text and image classification.

· Binary Classification: A type of classification where the goal is to categorize data
into one of two classes. For a given training dataset, the objective is to learn a
classifier that correctly predicts the class label (e.g., −1 or 1) for each data point.
In binary classification, the hyperplane divides the feature space into two regions,
one for each class.

18. Discuss (i) Web Mining (ii) Web Content (iii) Web Usage Analytics

Web Mining: Web mining is the process of extracting useful knowledge from web
data. It combines multiple disciplines, including data mining, machine learning,
natural language processing, statistics, and information retrieval. The main challenge
is discovering meaningful patterns and content from web data, offering both
opportunities and challenges for data mining.

Web Content Mining: This involves discovering information from web documents'
content. It can be done either by directly mining document contents or through search
engines, which are faster. Web content mining intersects with data mining and text
mining since web content often consists of text and is semi-structured or unstructured,
unlike traditional structured data used in data mining.

Web Usage Mining: This type of mining focuses on discovering patterns in web
usage data, particularly clickstream data generated from user interactions with web
resources. It includes analyzing data that is a consequence of users’ web activities,
such as navigation paths, and can help in understanding user behavior and improving
web resource management.

19. Explain (i) page rank (ii) structure of web and analysing a web graph's
authorities

(i) PageRank

PageRank is an algorithm used to rank web pages based on their authority or
importance. It was introduced by Larry Page and Sergey Brin in 1996 as part of the
Google search engine. PageRank is built on the idea that a page is important if it is
linked to by other important pages. Here's how it works:

In-degree (visibility): The number of incoming links (links pointing to a page).
The more links a page receives from other pages, the higher its rank.

Out-degree (luminosity): The number of links that a page points to. A page with
many outgoing links passes on less rank through each individual link, diluting the
PageRank it contributes to the pages it links to.

Earlier Approach: PageRank used in-degrees and out-degrees to determine the
authority of a page; a page with many in-links was considered authoritative.

New Approach: The newer version of PageRank, introduced by Page and Brin
(1998), takes the entire web into account. Instead of considering only the local
neighborhood of a page (its in-links and out-links), it also weighs the relative
authority of the parent (linking) pages. Because it looks at the quality of the pages
that link to a page, the ranking system is refined.
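
A minimal sketch of this idea as a power iteration over a tiny hard-coded link graph; the graph, the damping factor of 0.85, and the iteration count are all illustrative assumptions:

import java.util.Arrays;

// Each page shares its current rank equally among the pages it links to,
// so a link from a high-rank page contributes more than one from a low-rank page.
public class PageRankSketch {

    public static void main(String[] args) {
        int[][] outLinks = { {1, 2}, {2}, {0}, {0, 2} }; // page i links to outLinks[i]
        int n = outLinks.length;
        double d = 0.85;                    // damping factor (assumed)
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);         // start from a uniform distribution

        for (int iteration = 0; iteration < 50; iteration++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n); // random-jump component
            for (int i = 0; i < n; i++) {
                for (int j : outLinks[i]) {
                    next[j] += d * rank[i] / outLinks[i].length;
                }
            }
            rank = next;
        }
        System.out.println(Arrays.toString(rank));
    }
}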

(ii) Structure of Web and Analyzing a Web Graph's Authorities

Web Structure: The web can be represented as a directed graph, where each
webpage is a node, and hyperlinks between pages are edges. The structure is dynamic
and consists of pages with varying levels of connectivity (links).

Web Graph: A web graph is the visual representation of the web, showing the
relationships (links) between various web pages. Nodes represent web pages, and
edges represent hyperlinks. The structure of the graph influences how authority is
distributed across different pages.

Analyzing Web Authorities:

Authority: A page's authority is its importance or rank within the web graph. It is
determined by how many other pages link to it (in-links), and the authority of the
pages that link to it.

Web Graph Analysis: To analyze the authorities, we examine the link structure of
the web graph. Pages that have more inbound links from authoritative pages are
themselves considered more authoritative. The central idea is that a link from a page
with high authority carries more weight than a link from a less authoritative page.

PageRank Calculation: The PageRank algorithm assigns scores to pages based on
their in-links and the quality of those links. Pages that are linked to by many
authoritative pages have higher PageRank scores.

20. What are Hubs and Authorities.

Hubs and Authorities are concepts introduced in the HITS (Hyperlink-Induced
Topic Search) algorithm, developed by Jon Kleinberg in 1999. These concepts are
used to analyze web pages based on their connectivity and the quality of the pages
they link to; they are closely related to PageRank but focus on different aspects of
the link structure.

1. Hubs

A Hub is a web page that links to many other pages. In other words, a hub is a page
that acts as a directory or a resource that points to relevant, authoritative pages. Hubs
are considered valuable because they point to resources or content that could be useful
for users.

A good hub is one that links to many authoritative pages. However, simply linking to
many pages does not automatically make a page a good hub. The quality of the pages
it links to also matters.
2. Authorities

An Authority is a web page that is linked to by many hubs. In other words, an
authority is a page that is considered an expert or valuable resource in a particular
area; many hubs (which may be other web pages) link to it because it contains
relevant, authoritative information.

A good authority is one that receives links from many high-quality hubs. The value of
an authority increases when it is referenced by many reputable sources (hubs).
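
A minimal sketch of the HITS update rules on a tiny hard-coded graph, assuming a fixed number of iterations (the graph itself is illustrative):

import java.util.Arrays;

// HITS: authority(p) = sum of hub scores of pages linking to p;
// hub(p) = sum of authority scores of pages p links to;
// both vectors are normalized after every round.
public class HitsSketch {

    public static void main(String[] args) {
        boolean[][] link = {                // link[i][j]: page i links to page j
            {false, true,  true,  false},
            {false, false, true,  false},
            {true,  false, false, false},
            {false, false, true,  false},
        };
        int n = link.length;
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        Arrays.fill(auth, 1.0);

        for (int iteration = 0; iteration < 20; iteration++) {
            for (int j = 0; j < n; j++) {   // authority update
                auth[j] = 0;
                for (int i = 0; i < n; i++) {
                    if (link[i][j]) auth[j] += hub[i];
                }
            }
            for (int i = 0; i < n; i++) {   // hub update
                hub[i] = 0;
                for (int j = 0; j < n; j++) {
                    if (link[i][j]) hub[i] += auth[j];
                }
            }
            normalize(auth);
            normalize(hub);
        }
        System.out.println("authorities: " + Arrays.toString(auth));
        System.out.println("hubs: " + Arrays.toString(hub));
    }

    private static void normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (int i = 0; i < v.length; i++) v[i] /= norm;
        }
    }
}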

21. Explain Social Network as Graph and Social Network Analytics.

Social Network as Graph

A social network is often represented as a graph, where the nodes (also called
vertices) represent individuals (or organizations) and the edges (also called links)
represent the relationships or interactions between these individuals. These
relationships can take various forms, such as friendships, professional connections, or
shared interests.

In graph theory terms:

 Nodes represent entities, such as people or organizations.
 Edges represent the relationships or connections between these entities (e.g.,
friendship, collaboration, etc.).

Social Network Analytics

Social Network Analytics refers to the application of network analysis techniques to
understand the structure, behavior, and interactions within social networks. It involves
analyzing the relationships between individuals (nodes) and identifying patterns and
insights that can be valuable for various applications.

Key Metrics in Social Network Analytics:

(i) Degree
(ii) Closeness
(iii) Betweenness
(iv) Eigenvector
(v) Clustering coefficient
(vi) PageRank
(vii) Anomaly Detection

22. Discuss (i) clustering in social networks (ii) Sim rank (iii) counting triangles
and graph matches (iv) direct discovery of communities

(i) Clustering in Social Networks


Clustering in social networks refers to the tendency of nodes in a network to form
tightly-knit groups or communities. This is a key feature of social networks, where
individuals often interact with a small group of other people, leading to clusters of
nodes that are more densely connected within the group than with the outside world.
Clustering is used to identify groups of individuals who share common interests,
relationships, or characteristics.

There are two types of clustering coefficients:

 Local Clustering Coefficient: Measures the degree to which a vertex's
neighbors are also connected to each other. For a vertex v it can be computed as
C(v) = 2·T(v) / (deg(v)·(deg(v) − 1)), where T(v) is the number of triangles
passing through v and deg(v) is its degree.
 Global Clustering Coefficient: Measures the overall likelihood of neighbors
being connected across the entire network.

(ii) SimRank

SimRank is a similarity measure used in graph-based social network analysis to
determine how similar two vertices (nodes) are in terms of their structural
relationships within the graph. It formalizes the intuition that two vertices are similar
if they are referenced by similar vertices:
s(a, b) = (C / (|I(a)|·|I(b)|)) · Σi Σj s(Ii(a), Ij(b)), where I(x) is the set of
in-neighbors of x, C is a decay constant between 0 and 1, and s(a, a) = 1. It is
particularly used when measuring similarity between vertices of the same type, for
instance, two students in a social network or two researchers in a scientific
collaboration network.

(iii) Counting Triangles and Graph Matches

Triangle Counting:

A triangle in a graph is formed by three vertices where each vertex is connected to
the other two. Triangles are an important measure of clustering and local
cohesiveness in networks.

Triangle Count refers to the number of triangles that pass through each vertex in a
graph. A vertex is considered part of a triangle when it has two adjacent vertices that
are also connected to each other by an edge.
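
A minimal sketch of triangle counting on a small undirected graph stored as an adjacency matrix; requiring i < j < k ensures each triangle is counted exactly once (the graph is illustrative):

// Count triangles {i, j, k} with i < j < k in an undirected graph.
public class TriangleCount {

    public static void main(String[] args) {
        boolean[][] adj = new boolean[4][4]; // illustrative 4-vertex graph
        connect(adj, 0, 1);
        connect(adj, 1, 2);
        connect(adj, 0, 2);                  // vertices 0, 1, 2 form a triangle
        connect(adj, 2, 3);

        int triangles = 0;
        for (int i = 0; i < adj.length; i++) {
            for (int j = i + 1; j < adj.length; j++) {
                for (int k = j + 1; k < adj.length; k++) {
                    if (adj[i][j] && adj[j][k] && adj[i][k]) triangles++;
                }
            }
        }
        System.out.println("triangles: " + triangles); // prints 1
    }

    private static void connect(boolean[][] adj, int u, int v) {
        adj[u][v] = true; // undirected edge: set both directions
        adj[v][u] = true;
    }
}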

Graph Matches:

Graph matching involves comparing two graphs or sub-graphs to find similarities
based on properties such as vertex labels, edge labels, or geographic locations. This is
typically achieved through search or filtering algorithms.

(iv) Direct Discovery of Communities

In social network analysis, discovering communities involves identifying groups of
vertices that are more densely connected to each other than to the other vertices in the
network. There are several ways to directly detect communities in a social graph, and
the methods often focus on cliques, structurally cohesive blocks, and social circles:

Cliques:
A clique is a subset of vertices where every vertex is connected to every other vertex
in the set. In a social network, a clique represents a group of people who are all
connected to each other.

Direct Discovery of Cliques: Identifying cliques can directly reveal communities or
groups with high internal connectivity. Cliques are often used in social network
analysis to find groups of highly connected individuals, such as teams or close-knit
circles of friends.

Structurally Cohesive Blocks:

These are groups of nodes that are tightly connected and have high internal density
but may have few connections to the rest of the network. Identifying these blocks
helps uncover communities that are tightly knit but not necessarily cliques.

Social Circles:

Social circles are groups of individuals who are connected to each other through
common relationships or interactions. These can be neighborhoods, friendship groups,
or professional networks that are more loosely connected than cliques but still exhibit
a strong degree of interconnectedness.
