BDA QB3
A Map task in the MapReduce programming model processes input key-value pairs
(k1, v1), where k1 is typically an identifier for a record (such as a file offset or
document name) and v1 is the record itself, often a line of text from an input file.
The map() function, implemented in the Mapper class of the Hadoop Java API,
generates zero or more intermediate key-value pairs (k2, v2). These intermediate
pairs are passed to the Reduce task, which aggregates and transforms them into a
smaller dataset using reducing functions or combiners. Each Mapper processes its
input split independently, one record at a time, regardless of the total input size.
The number of Map tasks (Nmap) depends on the input file size and the block size,
typically 10–100 tasks per node, but can be adjusted (as a hint) using the
setNumMapTasks() method. For instance, a 1 TB file with a 128 MB block size would
create 8192 Map tasks (1 TB / 128 MB = 8192).
1. Counting
Counting is used for tasks like counting messages, words, or items in a dataset. The
Mapper emits a 1 for each item encountered, and the Reducer sums up all these 1s to
calculate the total count.
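A minimal pure-Python sketch of the counting pattern (the names map_count, reduce_count, and the tiny in-memory run_job "shuffle" are illustrative, not part of the Hadoop API):

```python
from collections import defaultdict

def map_count(record):
    # Mapper: emit (word, 1) for every word in the input record
    return [(word, 1) for word in record.split()]

def reduce_count(key, values):
    # Reducer: sum all the 1s emitted for this key
    return key, sum(values)

def run_job(records, mapper, reducer):
    # Tiny in-memory stand-in for the shuffle phase: group values by key
    groups = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

counts = run_job(["big data", "big analytics"], map_count, reduce_count)
# counts == {"big": 2, "data": 1, "analytics": 1}
```

The same run_job skeleton works for the other patterns below; only the mapper and reducer change.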
2. Sorting
Sorting involves organizing data in a particular order based on sorting keys. The
Mapper emits items as values paired with sorting keys, and the Reducer combines and
arranges these items according to the keys.
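A pure-Python sketch of the pattern, with sorted() standing in for the framework's sorted shuffle (the timestamped events are made up):

```python
# Each record is (sort_key, item); the Mapper emits the pair unchanged and
# the framework's shuffle delivers keys to the Reducer in sorted order.
records = [("2024-03-01", "login"), ("2024-01-15", "signup"),
           ("2024-02-10", "purchase")]
emitted = [(key, item) for key, item in records]        # map phase
sorted_items = [item for key, item in sorted(emitted)]  # sorted shuffle
# sorted_items == ["signup", "purchase", "login"]
```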
3. Finding Distinct Values
Finding distinct values is useful in scenarios like counting unique users in a web log.
There are two approaches:
In the first, the Mapper emits dummy counters for each group, and the
Reducer totals these counters to find unique counts.
In the second, the Mapper emits values with group IDs, and the Reducer
removes duplicates within each group and increments a counter for unique
groups.
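The second approach can be sketched in pure Python as two chained phases (the log data is made up):

```python
from collections import defaultdict

# Web log of (user, group) events; goal: distinct users per group
log = [("alice", "g1"), ("bob", "g1"), ("alice", "g1"), ("bob", "g2")]

# Phase 1 -- Mapper keys each event by the (user, group) pair, so the
# shuffle collapses duplicate events onto a single key
phase1 = defaultdict(list)
for user, group in log:
    phase1[(user, group)].append(1)
unique_pairs = list(phase1)   # Phase 1 Reducer: one record per unique pair

# Phase 2 -- Mapper re-keys by group; Reducer counts one per unique user
distinct_users = defaultdict(int)
for user, group in unique_pairs:
    distinct_users[group] += 1
# dict(distinct_users) == {"g1": 2, "g2": 1}
```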
4. Collating
Collating refers to grouping items that share the same value into a single collection.
The Mapper computes a function value for each item as a key and emits the item. The
Reducer groups items based on these keys and outputs the results.
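A pure-Python sketch of collating, grouping words by their first letter (the choice of function is illustrative):

```python
from collections import defaultdict

def f(item):
    # Function value used as the grouping key; here: the first letter
    return item[0]

items = ["apple", "ant", "bat", "ball", "cat"]
groups = defaultdict(list)
for item in items:                 # Mapper: emit (f(item), item)
    groups[f(item)].append(item)
# Reducer: each key's items are output as one collection
# dict(groups) == {"a": ["apple", "ant"], "b": ["bat", "ball"], "c": ["cat"]}
```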
5. Filtering or Parsing
Filtering (or parsing) keeps only the records that satisfy a given condition. The
Mapper evaluates each input item and emits it only if it passes the condition; the job
is often map-only, with the Reducer simply collecting the accepted items.
6. Distributed Task Execution
Distributed task execution is used for large computations that are divided into smaller
partitions. The Mapper performs the computation on each partition and emits a partial
result, which the Reducer aggregates to produce the final output.
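The distributed-task-execution pattern can be sketched in pure Python as partial results per partition (computing a global sum and mean here is an illustrative choice):

```python
# Each Mapper handles one partition of a large computation and emits a
# partial result; the Reducer combines the partials into the final answer.
data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, len(data), 25)]

def map_partial(partition):
    # Partial result for this partition: (sum, count)
    return (sum(partition), len(partition))

partials = [map_partial(p) for p in partitions]

# Reducer: aggregate the partial results
total = sum(s for s, n in partials)
count = sum(n for s, n in partials)
mean = total / count
# total == 5050, mean == 50.5
```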
7. Graph Processing
Graph processing involves analyzing graphs, where nodes represent entities and edges
represent relationships. Path traversal methods process graphs by moving from one
node to another, generating results at each step. These results are passed as messages
for further traversal. Iterative message passing is used in cyclic paths.
Selection
Definition: Selection retrieves all tuples of a relation that satisfy a given predicate.
Projection
Definition: Projection selects specific attributes from a relation while eliminating
duplicates.
Union
Definition: Union combines the tuples of two union-compatible relations, eliminating
duplicates.
Intersection
Definition: Intersection retrieves the tuples that appear in both relations.
Difference
Definition: Difference retrieves tuples from one relation that are not in the other.
Natural Join
Definition: Natural join combines tuples from two relations that match on their
common attributes.
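As a concrete sketch of one of these operators in the MapReduce model, set difference (R − S) tags each tuple with its source relation (pure Python; the relations are made up):

```python
from collections import defaultdict

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "b")]

# Mapper: tag each tuple with the name of its source relation
tagged = [(t, "R") for t in R] + [(t, "S") for t in S]

# Shuffle: group the tags by tuple
groups = defaultdict(list)
for t, tag in tagged:
    groups[t].append(tag)

# Reducer: emit a tuple only if it came from R but never from S
difference = [t for t, tags in groups.items() if "R" in tags and "S" not in tags]
# sorted(difference) == [(1, "a"), (3, "c")]
```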
The Hive CLI (Command Line Interface) is a popular tool for interacting with Hive.
It can run in local mode, which uses local storage instead of HDFS, or connect to a
full Hadoop cluster.
The Web Interface provides users with the option to access Hive through a web
browser. This feature requires an HWI (Hive Web Interface) Server running on a
designated node.
The Metastore serves as the system catalog for Hive. It stores essential schema and
metadata, including table structures, databases, column details, data types, and
mappings to HDFS.
The Hive Driver is responsible for managing the lifecycle of a HiveQL statement. It
oversees the phases of compilation, optimization, and execution, ensuring that queries
are processed efficiently and results are delivered accurately.
Installation process: (i) Install Java (java and javac) from the Oracle Java download
website. (ii) Set the path variables for jdk1.7.0_71. (iii) Install Hadoop. (iv) Set up
the shared HADOOP, MAPRED, COMMON, and HDFS directories and all related
files, then configure HADOOP and set its properties. (v) Assign
yarn.nodemanager.aux-services to mapreduce_shuffle. (vi) Download Hive.
(vii) Configure the metastore for the server.
9. Define HiveQL. Write a program to create, show, drop and query operations
taking a database for a toy company.
HiveQL (Hive Query Language) is a SQL-like query language used in Apache Hive
for managing and querying large datasets stored in a Hadoop environment. It
simplifies data analysis by providing a familiar SQL interface while operating on
distributed datasets.
-- 1. Create and use the database
CREATE DATABASE IF NOT EXISTS toy_company;
USE toy_company;
-- 2. Create a table
CREATE TABLE toys (
    toy_id INT,
    toy_name STRING,
    category STRING,
    price FLOAT,
    stock INT
);
-- 3. Show operations
SHOW DATABASES;
SHOW TABLES;
-- 4. Query operations
SELECT * FROM toys WHERE price < 500;
-- 5. Drop operations
DROP TABLE toys;
DROP DATABASE toy_company;
· Table Partitioning:
Organizes data into partitions based on column values (e.g., by date, region).
Improves query performance by scanning only relevant partitions.
· Bucketing:
Divides data into a fixed number of buckets by hashing a column, which helps with
sampling and joins.
· Views:
Named, stored queries that present a virtual table without materializing the data.
· Joins:
Combine rows from two tables on matching column values; Hive supports inner,
outer, and map-side joins.
· Aggregation:
Summarizes data with functions such as COUNT, SUM, AVG, MIN, and MAX,
usually together with GROUP BY.
Pig scripts can be executed in three ways: Grunt Shell for interactive script execution,
Script Files for commands run on the Pig Server, and Embedded Scripts with UDFs
for custom functions.
Scripts are processed by a Parser, which checks syntax and types, generating a
Directed Acyclic Graph (DAG) to represent logical operators and data flows. The
Optimizer refines the DAG by splitting, merging, and reordering operators to reduce
data processing and improve efficiency.
Features
Applications
12. Give differences between (i) PIG and MapReduce (ii) PIG and SQL
13. Give Pig Latin data model with Pig installation steps.
14. Explain Pig Relational Operators.
A complete Pig filter UDF built around this logic (the class name CheckValue is
illustrative); it returns true only when the input field is 11, 20, or 25:

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.backend.executionengine.ExecException;

public class CheckValue extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) {
        try {
            // Reject empty tuples
            if (input == null || input.size() == 0) return false;
            int i = (Integer) input.get(0);
            if (i == 11 || i == 20 || i == 25) {
                return true;
            } else {
                return false;
            }
        } catch (ExecException e) {
            return false;
        }
    }
}
16. Explain the following: (i) text mining with text analytics process pipe line (ii)
text mining process and phases (iii) text mining challenges
17. Discuss the following: (i) naive bayes analysis (ii) support vector machines (iii)
binary classification
18. Discuss (i) Web Mining (ii) Web Content (iii) Web Usage Analytics
Web Mining: Web mining is the process of extracting useful knowledge from web
data. It combines multiple disciplines, including data mining, machine learning,
natural language processing, statistics, and information retrieval. The main challenge
is discovering meaningful patterns and content from web data, offering both
opportunities and challenges for data mining.
Web Content Mining: This involves discovering information from web documents'
content. It can be done either by directly mining document contents or through search
engines, which are faster. Web content mining intersects with data mining and text
mining since web content often consists of text and is semi-structured or unstructured,
unlike traditional structured data used in data mining.
Web Usage Mining: This type of mining focuses on discovering patterns in web
usage data, particularly clickstream data generated from user interactions with web
resources. It includes analyzing data that is a consequence of users’ web activities,
such as navigation paths, and can help in understanding user behavior and improving
web resource management.
19. Explain (i) page rank (ii) structure of web and analysing a web graph for
authorities
(i) PageRank
In-degree (visibility): This is the number of incoming links (links pointing to a page).
The more links a page receives from other pages, the higher its rank.
Out-degree (luminosity): This refers to the number of links that a page points to. A
page with many outgoing links might dilute the PageRank of the pages it links to.
New Approach: The newer version of PageRank, introduced by Page and Brin
(1998), takes into account the entire web. Instead of just considering the local
neighborhood of a page (the in-links and out-links of a page), this method also
considers the relative authority of the parent links. It looks at the quality of the pages
that link to a page (the importance of the linking pages), thus refining the ranking
system.
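The idea can be sketched with a small power-iteration loop in Python (the three-page graph and the damping factor d = 0.85 are illustrative):

```python
# links[p] = pages that p points to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 0.85                      # damping factor
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):           # power iteration until (approximate) convergence
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:        # p shares its rank equally among its out-links
            new[q] += d * rank[p] / len(outs)
    rank = new

# C is linked by both A and B, so it ends up with the highest rank
best = max(rank, key=rank.get)
```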
Web Structure: The web can be represented as a directed graph, where each
webpage is a node, and hyperlinks between pages are edges. The structure is dynamic
and consists of pages with varying levels of connectivity (links).
Web Graph: A web graph is the visual representation of the web, showing the
relationships (links) between various web pages. Nodes represent web pages, and
edges represent hyperlinks. The structure of the graph influences how authority is
distributed across different pages.
Authority: A page's authority is its importance or rank within the web graph. It is
determined by how many other pages link to it (in-links), and the authority of the
pages that link to it.
Web Graph Analysis: To analyze the authorities, we examine the link structure of
the web graph. Pages that have more inbound links from authoritative pages are
themselves considered more authoritative. The central idea is that a link from a page
with high authority carries more weight than a link from a less authoritative page.
1. Hubs
A Hub is a web page that links to many other pages. In other words, a hub is a page
that acts as a directory or a resource that points to relevant, authoritative pages. Hubs
are considered valuable because they point to resources or content that could be useful
for users.
A good hub is one that links to many authoritative pages. However, simply linking to
many pages does not automatically make a page a good hub. The quality of the pages
it links to also matters.
2. Authorities
A good authority is one that receives links from many high-quality hubs. The value of
an authority increases when it is referenced by many reputable sources (hubs).
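A small Python sketch of the mutually reinforcing hub/authority update (the four-page graph h1, h2, a1, a2 is made up):

```python
# Adjacency: page -> pages it links to
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # Authority score: sum of hub scores of the pages linking in
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub score: sum of authority scores of the pages linked to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores stay bounded
    na, nh = sum(auth.values()), sum(hub.values())
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

# a1 is linked by both hubs, so it is the strongest authority;
# h1 links to both authorities, so it is the strongest hub
```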
A social network is often represented as a graph, where the nodes (also called
vertices) represent individuals (or organizations) and the edges (also called links)
represent the relationships or interactions between these individuals. These
relationships can take various forms, such as friendships, professional connections, or
shared interests.
(i) Degree: the number of direct connections a node has.
(ii) Closeness: how near a node is to all other nodes, based on shortest-path
distances.
(iii) Betweenness: how often a node lies on the shortest paths between other pairs of
nodes.
(iv) Eigenvector: a node's importance weighted by the importance of the nodes it
connects to.
(v) Clustering coefficient: the extent to which a node's neighbours are themselves
connected to one another.
(vi) PageRank: a link-based importance score that propagates rank along incoming
edges with damping.
(vii) Anomaly Detection: identifying nodes or edges whose connection patterns
deviate markedly from the rest of the network.
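As an illustration of the first two measures, degree and closeness can be computed on a toy friendship graph (the graph itself is made up):

```python
from collections import deque

# Undirected friendship graph: node -> neighbours
g = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}

# Degree centrality: number of direct connections
degree = {n: len(nbrs) for n, nbrs in g.items()}

def closeness(start):
    # Closeness: (n-1) divided by the total shortest-path distance to all others,
    # with distances found by breadth-first search
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return (len(g) - 1) / sum(d for d in dist.values() if d > 0)

# C touches every other node directly, so it has the highest degree and closeness
```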
22. Discuss (i) clustering in social networks (ii) Sim rank (iii) counting triangles
and graph matches (iv) direct discovery of communities
(ii) SimRank
SimRank is a link-based similarity measure built on the intuition that two nodes are
similar if they are referenced by similar nodes.
Triangle Counting:
Triangle Count refers to the number of triangles that pass through each vertex in a
graph. A vertex is considered part of a triangle when it has two adjacent vertices, and
those two vertices are also connected by an edge.
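A small Python sketch of per-vertex triangle counting on a made-up graph:

```python
from itertools import combinations

# Undirected graph as an edge set
edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def triangles_through(v):
    # A vertex is in a triangle when two of its neighbours are themselves adjacent
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])

per_vertex = {v: triangles_through(v) for v in adj}
# Each triangle is counted once at each of its three vertices
total = sum(per_vertex.values()) // 3
# total == 1 (the triangle A-B-C)
```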
Graph Matches:
Graph matching searches a large graph for occurrences of a smaller pattern graph,
i.e., subgraphs whose nodes and edges line up with the pattern.
Cliques:
A clique is a subset of vertices where every vertex is connected to every other vertex
in the set. In a social network, a clique represents a group of people who are all
connected to each other.
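A clique test is a direct pairwise check; a minimal Python sketch on a made-up network:

```python
from itertools import combinations

# Undirected network: node -> set of neighbours
g = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}

def is_clique(nodes):
    # Every pair of members must be directly connected
    return all(v in g[u] for u, v in combinations(nodes, 2))

# {A, B, C} is a clique; adding D breaks it because D only connects to C
```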
These are groups of nodes that are tightly connected and have high internal density
but may have few connections to the rest of the network. Identifying these blocks
helps uncover communities that are tightly knit but not necessarily cliques.
Social Circles:
Social circles are groups of individuals who are connected to each other through
common relationships or interactions. These can be neighborhoods, friendship groups,
or professional networks that are more loosely connected than cliques but still exhibit
a strong degree of interconnectedness.