
INDEX

Serial No. | List of Experiments | Date of Experiment | Date of Submission | Sign of Faculty

1. To draw and explain Hadoop architecture and ecosystem with the help of a case study.
2. Perform setting up and installing single-node Hadoop in a Windows environment.
3. To implement the following file management tasks in the Hadoop Distributed File System (HDFS): adding files and directories, retrieving files, deleting files.
4. Create a database 'STD' and make a collection (e.g. "student" with fields 'No., Stu_Name, Enrol., Branch, Contact, e-mail, Score') using MongoDB. Perform various operations in the following experiments.
5. Insert multiple records (at least 10) into the created student collection.
6. Execute the following queries on the collection created:
   a. Display data in proper format.
   b. Update the contact information of a specific student.
   c. Add a new field 'remark' to the document with the name 'REM'.
   d. Add a new field as No. 11, Stu_Name XYZ, Enrol. 00101, Branch VB, e-mail [email protected], Contact 098675345 without using an insert statement.
7. Create an employee table in MongoDB with 4 departments and 25 employees equally divided, along with one manager. The following fields should be added: Employee_ID, Dept_ID, First_Name, Last_Name, Salary (range between 20K and 60K). Now run the following queries:
   a. Find all the employees of a particular department where salary is < 40K.
   b. Find the highest salary for each department and fetch the names of such employees.
   c. Find all the employees who are on a salary lower than 30K, increase their salary by 10%, and display the results.
8. To design and implement a social network graph of 50 nodes and edges between nodes using the NetworkX library in Python.
9. Design and plot an asymmetric social network (sociograph) of 5 nodes (A, B, C, D, and E) such that A is directed to B, B is directed to D, D is directed to A, and D is directed to C.
10. Consider the above scenario (No. 09) and plot a weighted asymmetric graph; the weight range is between 20 and 50.
11. Implement the betweenness measure between nodes across the social network. (Assume a social network of 10 nodes.)
Name of Student: Class: BE
Enrolment No: Batch
Date of Experiment Date of Submission Submitted on:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

List of Experiments:

Question 1. To draw and explain Hadoop architecture and ecosystem with the help of a case study.

Answer:

Hadoop is an open-source framework for distributed storage and processing of large datasets. It consists
of a comprehensive ecosystem of components that work together to handle big data challenges. To
illustrate Hadoop's architecture and ecosystem, let's consider a case study involving a retail company,
"Retail Mart," which aims to analyze and derive insights from its vast number of sales and customer
data.

Hadoop Architecture and Ecosystem Components:

1. Hadoop Distributed File System (HDFS):


○ Retail Mart’s data includes sales transactions, customer profiles, inventory data, and
more. HDFS is used to store and manage these large datasets across a distributed
cluster of commodity hardware. Data is divided into blocks and replicated for fault
tolerance.
2. MapReduce:
○ Retail Mart wants to perform analytics on its sales data to gain insights. MapReduce is



a core processing framework in Hadoop. It allows parallel, distributed processing of
data stored in HDFS. In our case study, MapReduce jobs can be used to analyze sales
data, calculate revenues, and extract relevant insights.
3. YARN (Yet Another Resource Negotiator):
○ YARN manages cluster resources and schedules tasks for MapReduce jobs and other
applications. It enables efficient resource utilization by allocating resources
dynamically to various applications. Retail Mart can use YARN to ensure that its sales
data analysis does not overwhelm the cluster.
4. Hive:
○ Hive is a data warehousing and SQL-like query language for Hadoop. Retail Mart can
use Hive to run SQL queries on their sales data, generate reports, and create dashboards
for business intelligence.
5. Pig:
○ Pig is a high-level platform for creating MapReduce programs. It provides a scripting
language to process and analyze data. Retail Mart can use Pig to perform data
transformation and ETL (Extract, Transform, Load) operations on their raw sales
data.
6. HBase:
○ Retail Mart needs to store and access customer profiles and inventory data in a
scalable, real-time database. HBase, a NoSQL database on top of Hadoop, can be used
to achieve this. It provides fast and random access to large datasets.

7. Spark:
○ To perform more complex and iterative data processing, Retail Mart can utilize
Apache Spark, which is often used alongside Hadoop. Spark is well-suited for
machine learning, data streaming, and graph processing.
8. Sqoop:
○ Retail Mart wants to import data from their existing relational databases into
Hadoop. Sqoop is a tool for efficiently transferring data between Hadoop and
structured data stores like relational databases.
9. Flume and Kafka:
○ To ingest real-time data from online transactions and weblogs, Retail Mart can use
Flume and Kafka. These tools enable data streaming into Hadoop for immediate
analysis.

Case Study Scenario:

Retail Mart loads daily sales transaction data into HDFS, which includes information about products,
customers, and sales. They run MapReduce jobs to calculate daily, weekly, and monthly revenues and
perform market basket analysis to find correlations between products. The results are stored in HBase
for real-time access.

Retail Mart also utilizes Hive to create reports for management, showing sales trends and customer
behavior. They use Pig to clean and preprocess the data, while Spark helps them build
recommendation engines based on customer preferences.

In addition, Retail Mart employs Sqoop to import historical sales data from their existing relational
database into HDFS, ensuring all data is in one place for analysis. Flume and Kafka are used to stream
data from online transactions, enabling real-time analytics.
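For illustration, a small PySpark job of the kind described above might compute Retail Mart's daily revenue from the sales files in HDFS. This is only a sketch: the HDFS path and the column names sale_date and amount are assumed, not taken from the case study.

python

from pyspark.sql import SparkSession

# Start a Spark session (assumes Spark is configured to read from the Hadoop cluster)
spark = SparkSession.builder.appName("RetailMartDailyRevenue").getOrCreate()

# Hypothetical sales data in HDFS with columns such as sale_date, product_id, amount
sales = spark.read.csv("hdfs:///retailmart/sales/", header=True, inferSchema=True)

# Aggregate revenue per day
daily_revenue = sales.groupBy("sale_date").sum("amount")
daily_revenue.show()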

Name of Student: Class: BE


Enrolment No: Batch
Date of Experiment Date of Submission Submitted on:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Question 2. Perform setting up and installing single node Hadoop in a Windows environment.

Answer:

Prerequisites: Before you begin, ensure that you have the following prerequisites in place:

1. A Windows machine with sufficient system resources (RAM, CPU, and storage) to run
Hadoop.
2. Java Development Kit (JDK) installed. Hadoop requires Java. You can download and install
Oracle JDK or OpenJDK.
3. Download the Hadoop distribution for Windows. You can download it from the official
Apache Hadoop website.

Installation Steps:

1. Java Installation:
○ Install Java (if not already installed) and set the JAVA_HOME environment variable to
point to the Java installation directory. Make sure you have the correct version of Java
compatible with the Hadoop version you are using.
2. Hadoop Installation:
○ Extract the downloaded Hadoop distribution to a directory on your Windows
machine. For example, you can extract it to C:\hadoop.
3. Configuration:
○ Navigate to the C:\hadoop\etc\hadoop directory (or the location where you
extracted Hadoop) and edit the following configuration files:
■ core-site.xml: Add the following configuration to specify Hadoop's data
directory:
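The XML itself is not reproduced in this copy of the manual; a typical minimal core-site.xml for a single-node setup looks like the following (the port 9000 and the temporary-data path are assumed example values and should be adjusted to your machine):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/C:/hadoop/data/tmp</value>
  </property>
</configuration>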

● hdfs-site.xml: Configure the HDFS data and replication settings:
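Again as an assumed example (the directory paths are placeholders; replication is set to 1 because this is a single-node cluster):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/C:/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/C:/hadoop/data/datanode</value>
  </property>
</configuration>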

4. Formatting HDFS:

● Open a Command Prompt and navigate to the Hadoop bin directory, typically
located at C:\hadoop\bin.
● Run the following command to format the HDFS:

hdfs namenode -format

5. Starting Hadoop Services:


○ Start the Hadoop services using the following commands:
■ Start the NameNode and DataNode:



start-dfs.cmd

■ Start the ResourceManager and NodeManager:

start-yarn.cmd

Name of Student: Class: BE


Enrolment No: Batch
Date of Experiment Date of Submission Submitted on:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Question 3. To implement the following file management tasks in the Hadoop Distributed File System (HDFS): adding files and directories, retrieving files, deleting files.

Answer:

● In a Hadoop Distributed File System (HDFS), you can perform file


management tasks like adding files and directories, retrieving files, and
deleting files using Hadoop's command-line utilities. Here's how you can
execute each of these tasks:
○ Adding Files and Directories:
■ To add files and directories to HDFS, you can use the
hadoop fs command-line utility. The command for adding a
file from your local file system to HDFS is hadoop fs -copyFromLocal:

hadoop fs -copyFromLocal /local/path/to/source /hdfs/path/to/destination

● To create an HDFS directory, you can use the mkdir


command:

hadoop fs -mkdir /hdfs/path/to/directory

● Retrieving Files:
■ To retrieve files from HDFS to your local file system,
use the get command:

hadoop fs -get /hdfs/path/to/source /local/path/to/destination

● Deleting Files:
■ To delete files in HDFS, you can use the rm command. Be
cautious when using this command as it permanently deletes
files.
bash

hadoop fs -rm /hdfs/path/to/file

● To delete an empty directory, use the


rmdir command:

hadoop fs -rmdir /hdfs/path/to/empty_directory

● To delete a directory and its contents, use the -r
option with rm:

hadoop fs -rm -r /hdfs/path/to/directory

Name of Student: Class: BE

Enrolment No: Batch
Date of Experiment Date of Submission Submitted on:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Question 4. Create a database ‘STD’ and make a collection (e.g. "student" with fields 'No.,
Stu_Name, Enrol., Branch, Contact, e-mail, Score') using MongoDB. Perform various operations
in the following experiments.

Answer:

Create a Database:
In the MongoDB shell, you can create a database named 'STD' using the use command. If the database
doesn't exist, MongoDB will create it when you insert data into it:

use STD

Create a Collection and Insert Data:

1. Now, create a collection called 'student' and insert some sample data into it. You can use the
insertOne or insertMany method to add documents to the collection. Here's an example of
inserting a single document:
javascript

db.student.insertOne({
No: 1,
Stu_Name: "John Doe",
Enrol: "E12345",
Branch: "Computer Science",
Contact: "123-456-7890",
email: "[email protected]",
Score: 95
});

2. You can insert more documents using the insertOne or insertMany methods.

Querying Data:
db.student.find()

Updating Data:
db.student.updateOne(
{ Enrol: "E12345" },
{ $set: { Score: 98 } }
);

Deleting Data:
db.student.deleteOne({ No: 1 });
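As an optional check (not part of the original task), querying for the removed document afterwards should return nothing:

db.student.find({ No: 1 })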



Name of Student: Class: BE

Enrolment No: Batch

Date of Experiment Date of Submission Submitted on:

Remarks by faculty: Grade:

Signature of student: Signature of Faculty:

Question 5. Insert multiple records (at least 10) into the created student collection.

Answer:

Below is the code for inserting 10 records into the student collection:

db.student.insertMany([
{
No: 2,
Stu_Name: "Bhavik",
Enrol: "E23456",
Branch: "Computer Science",
Contact: "9876543210",
email: "[email protected]",
Score: 88
},
{
No: 3,
Stu_Name: "Bhavika",
Enrol: "E34567",
Branch: "Mathematics",
Contact: "9988776655",
email: "[email protected]",
Score: 73
},
{
No: 4,
Stu_Name: "Aeshna",
Enrol: "E45678",



Branch: "Physics",
Contact: "9871234560",
email: "[email protected]",
Score: 92
},
{

No: 5,
Stu_Name: "Ansh",
Enrol: "E56789",
Branch: "Chemistry",
Contact: "9877005500",
email: "[email protected]",
Score: 79
},
{
No: 6,
Stu_Name: "Devendra",
Enrol: "E67890",
Branch: "Mechanical Engineering",
Contact: "9966337700",
email: "[email protected]",
Score: 85
},
{
No: 7,
Stu_Name: "Yash",
Enrol: "E78901",
Branch: "Electrical Engineering",
Contact: "9977553344",
email: "[email protected]",
Score: 94
},
{
No: 8,
Stu_Name: "Anuj",
Enrol: "E89012",
Branch: "Economics", Contact:
"9876009900",
email: "[email protected]",
Score: 78
},
{
No: 9,
Stu_Name: "Aditya",
Enrol: "E90123",
Branch: "History",
Contact: "9888998899",
email: "[email protected]",
Score: 87
},
{

No: 10,
Stu_Name: "Jay", Enrol:
"E01234",
Branch: "Geography", Contact:
"9966558877",
email: "[email protected]",

Score: 91
},
{

No: 11,
Stu_Name: "Patel", Enrol:
"E12345",
Branch: "Business Administration",
Contact: "9876543000",
email: "[email protected]", Score:
75
}
]);
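As an optional check (not part of the original task), you can count the documents now stored in the collection; countDocuments is available in recent MongoDB shells:

db.student.countDocuments()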



Name of Student: Class: BE
Enrolment No: Batch
Date of Experiment Date of Submission Submitted on:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Question 6. Execute the following queries on the collection created.


a. Display data in proper format.
b. Update the contact information of a specific student.
c. Add a new field remark to the document with the name 'Yash'.
d. Add a new field as no 11, stu_name XYZ, enroll 00101, branch VB, e-mail [email protected]
Contact 098675345 without using insert statement.

Answer:

a. Display Data in Proper Format (in Ascending Alphabetical Order of Name):

javascript
db.student.find().sort({ Stu_Name: 1 }).pretty()



This query will retrieve all documents from the 'student' collection and display them in ascending order
of the 'Stu_Name' field, in a properly formatted manner using the pretty() method.

b. Update the Contact Information of a Specific Student (Name = Anuj):

javascript
db.student.updateOne({ Stu_Name: "Anuj" }, { $set: { Contact: "9876543210" } })

This query updates the 'Contact' field for the student with the name 'Anuj'. You can replace
"9876543210" with the new contact information as needed.

c. Add a New Field 'remark' to the Document with the Name 'Yash':

javascript
db.student.updateOne({ Stu_Name: "Yash" }, { $set: { remark: "New remark text" } })

This query adds a new field 'remark' with the value "New remark text" to the document where
'Stu_Name' is 'Yash'. You can adjust the value accordingly.

d. Add a New Field with Student No. 12 (Name: Shruti, Enroll: 00101, Branch: CSE, Email:
[email protected], Contact: 098675345): This task can be accomplished using the updateOne
method without an insert statement. To add a new student with No. 12 and the specified details, run the
following query:
javascript
db.student.updateOne(
{ No: 12 },
{
$set: {
Stu_Name: "Shruti",
Enrol: "00101",
Branch: "CSE",
email: "[email protected]",
Contact: "098675345",
Score: 0 // You can set an initial score or any other default value
}
},
{ upsert: true } // Use upsert to insert if the document doesn't exist
)
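Because no document with No: 12 exists yet, the upsert option makes updateOne insert a new document instead of modifying an existing one. As an optional check, you can fetch it back:

db.student.find({ No: 12 }).pretty()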



Name of Student: Class: BE
Enrolment No: Batch
Date of Experiment Date of Submission Submitted on:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:
Question 7. Create an employee table in MongoDB with 4 departments and 25 employees equally divided, along with one manager. The following fields should be added: Employee_ID, Dept_ID, First_Name, Last_Name, Salary (range between 20K and 60K). Now run the following queries:

a. Find all the employees of a particular department where salary is < 40K.
b. Find the highest salary for each department and fetch the name of such employees.
c. Find all the employees who are on a lesser salary than 30k; increase their salary by 10%
and display the results.

Answer:

1. Create the Employee Collection:

Assuming you are using the MongoDB shell, you can create the employee collection as follows:
// Create the Employee Collection
db.createCollection("employee")

// Insert Employee Data


var departments = ["HR", "Finance", "Marketing", "Engineering"];
var salaries = [20000, 30000, 40000, 50000, 60000];

// Insert employees for each department
for (var deptId = 1; deptId <= 4; deptId++) {
  // Insert the department manager
  db.employee.insertOne({
    Employee_ID: deptId,
    Dept_ID: deptId,
    First_Name: "Manager",
    Last_Name: "Dept " + deptId,
    Salary: salaries[4],
  });

  // Insert the 25 employees of this department dynamically
  for (var i = 1; i <= 25; i++) {
    db.employee.insertOne({
      Employee_ID: (deptId - 1) * 25 + i + 4,
      Dept_ID: deptId,
      First_Name: "Employee" + i,
      Last_Name: "Dept " + deptId,
      Salary: salaries[i % 5],
    });
  }
}
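As an optional sanity check (not part of the original task), count the documents per department; with the loop above, each department should contain 26 documents (one manager plus 25 employees):

db.employee.countDocuments({ Dept_ID: 1 })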

a. Find all the employees of a particular department where salary is less than 40K. For
example, to find employees in the "HR" department with a salary less than 40K:

db.employee.find({ Dept_ID: 1, Salary: { $lt: 40000 } })

b. Find the highest salary for each department and fetch the names of such employees:

db.employee.aggregate([
  {
    $group: {
      _id: "$Dept_ID",
      maxSalary: { $max: "$Salary" }
    }
  },
  {
    $lookup: {
      from: "employee",
      localField: "_id",
      foreignField: "Dept_ID",
      as: "employees"
    }
  },
  {
    $unwind: "$employees"
  },
  {
    // Keep only the employees whose salary equals the department maximum
    $match: { $expr: { $eq: ["$employees.Salary", "$maxSalary"] } }
  },
  {
    $project: {
      _id: 0,
      Dept_ID: "$_id",
      Employee_ID: "$employees.Employee_ID",
      First_Name: "$employees.First_Name",
      Last_Name: "$employees.Last_Name",
      Salary: "$maxSalary"
    }
  }
])

c. Find all employees who are on a salary less than 30K, increase their salary by 10%, and
display the results:

// Capture the affected employees first, so that exactly those records can be displayed after the raise
var lowPaidIds = db.employee.find({ Salary: { $lt: 30000 } }).toArray().map(function (e) { return e.Employee_ID; });

db.employee.updateMany(
  { Employee_ID: { $in: lowPaidIds } },
  { $mul: { Salary: 1.1 } }
)

// Display the updated records
db.employee.find({ Employee_ID: { $in: lowPaidIds } })



Name of Student: Class: BE
Enrolment No: Batch
Date of Experiment Date of Submission Submitted on:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:
Question 8. To design and implement a social network graph of 50 nodes and edges between nodes using
networkx library in Python.

Answer:
To design and implement a social network graph with 50 nodes and edges between nodes using the
NetworkX library in Python, follow these steps:

1. Install NetworkX:
If you haven't already, you need to install the NetworkX library. You can install it using pip:
bash
pip install networkx

2. Import NetworkX and Create a Graph:


In your Python script or Jupyter Notebook, import the NetworkX library and create a graph:
python

import networkx as nx

# Create an empty graph
G = nx.Graph()

3. Add Nodes:
You can add nodes to the graph. Since you want 50 nodes, you can do this programmatically:
python

num_nodes = 50
for i in range(1, num_nodes + 1):
    G.add_node(i)

4. Add Edges:
To create edges between nodes, you can use various methods, such as adding random
connections or defining a specific network structure. Here's an example of adding random edges
to create a connected graph:

python
import random

# Add random edges to create connections
for i in range(1, num_nodes + 1):
    for j in range(i + 1, num_nodes + 1):
        if random.random() < 0.1:  # Adjust the probability as needed
            G.add_edge(i, j)

5. Visualize the Graph (Optional):


If you want to visualize the network graph, you can use the matplotlib library along with
NetworkX. Make sure to install matplotlib:
bash

pip install matplotlib

Here's an example of how to visualize the graph:


Python:

import matplotlib.pyplot as plt


nx.draw(G, with_labels=True, node_color='lightpink', font_weight='bold')
plt.show()
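As an optional sanity check (not part of the original steps), you can print the size of the generated graph:

print(G.number_of_nodes(), "nodes and", G.number_of_edges(), "edges")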



Name of Student: Class: BE
Enrolment No: Batch
Date of Experiment Date of Submission Submitted on:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Question 9. Design and plot an asymmetric social network (socio graph) of 5 nodes (A, B, C, D, and E)
such that A is directed to B, B is directed to D, D is directed to A, and D is directed to C.

Answer:

1. Install NetworkX

If you haven't already, you need to install the NetworkX library. You can install it using pip:

pip install networkx

2. Import NetworkX and Create a Directed Graph:

In your Python script or Jupyter Notebook, import the NetworkX library and create a directed graph
(DiGraph):

import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()

3. Add Nodes:
Add nodes A, B, C, D, and E to the graph:
nodes = ["A", "B", "C", "D", "E"]
G.add_nodes_from(nodes)

4. Add Directed Edges:

Add directed edges between nodes A, B, D, and C as specified:

G.add_edge("A", "B")
G.add_edge("B", "D")
G.add_edge("D", "A")
G.add_edge("D", "C")

5. Visualize the Graph:


You can use NetworkX and matplotlib to visualize the directed sociograph:

pos = {
"A": (0, 1),
"B": (1, 2),
"C": (2, 1),
"D": (1, 0),
"E": (3, 2)
}
nx.draw(G, pos, with_labels=True, node_color='green', font_weight='bold', node_size=1000, arrowsize=20)
plt.show()
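Optionally, you can also print the edges to confirm the directions before or after plotting:

print(list(G.edges()))  # expected: [('A', 'B'), ('B', 'D'), ('D', 'A'), ('D', 'C')]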



Name of Student: Class: BE
Enrolment No: Batch
Date of Experiment Date of Submission Submitted on:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Question 10. Consider the above scenario (No. 09) and plot a weighted asymmetric graph, the
weight range is between 20 to 50.

Answer:

To create a weighted asymmetric graph based on the scenario provided in question 9, you can use
NetworkX in Python. In this case, you will assign random weights in the range of 20 to 50 to the
directed edges between nodes A, B, D, and C. Here's how you can achieve this:

1. Install and Import NetworkX


We have already installed and imported NetworkX; the same code is repeated here, with the random module added because it is used for the edge weights below:

pip install networkx

import networkx as nx
import matplotlib.pyplot as plt
import random

G = nx.DiGraph()

2. Add Nodes: Add nodes A, B, C, D, and E to the graph:

nodes = ["A", "B", "C", "D", "E"]


G.add_nodes_from(nodes)

3. Add Directed Edges with Random Weights: Add directed edges between nodes A, B, D, and C,
and assign random weights in the range of 20 to 50:

# Define the edges and assign random weights
edges = [("A", "B", random.randint(20, 50)),
         ("B", "D", random.randint(20, 50)),
         ("D", "A", random.randint(20, 50)),
         ("D", "C", random.randint(20, 50))]

for edge in edges:
    source, target, weight = edge
    G.add_edge(source, target, weight=weight)



4. Visualize the Weighted Directed Graph: Use NetworkX and matplotlib to visualize the
weighted asymmetric graph:

pos = {
"A": (0, 1),
"B": (1, 2),
"C": (2, 1),
"D": (1, 0),
"E": (3, 2)
}
# Extract edge weights for drawing
edge_weights = nx.get_edge_attributes(G, 'weight')

# Draw the weighted directed graph with edge labels


nx.draw(G, pos, with_labels=True, node_color='lightpink', font_weight='bold', node_size=1000,
arrowsize=20)
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_weights)
plt.show()
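Optionally, the randomly assigned weights can also be inspected as text:

print(nx.get_edge_attributes(G, 'weight'))  # maps each edge to its random weight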

Name of Student: Class: BE


Enrolment No: Batch
Date of Experiment Date of Submission Submitted on:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:
Question 11. Implement betweenness measure between nodes across the social network. (Assume the social
network of 10 nodes)

Answer:

To implement the betweenness centrality measure between nodes in a social network using Python
and NetworkX, you can follow these steps. Here, I'll assume a social network with 10 nodes for
demonstration:



1. Install NetworkX: If you haven't already, install the NetworkX library using pip:
bash

pip install networkx

2. Import NetworkX and Create a Graph: Import NetworkX and create a graph for your social
network:

import networkx as nx

# Create a graph (assuming a social network with 10 nodes)
G = nx.Graph()
nodes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
G.add_nodes_from(nodes)

3. Add Edges (Connections between Nodes): Define the connections (edges) between nodes to
represent your social network. This can be done based on your specific network structure:

# Example edges for demonstration (customize as needed)
edges = [
(1, 2), (1, 3), (1, 4),
(2, 3), (2, 4), (2, 5),
(3, 5), (3, 6), (3, 7),
(4, 7), (4, 8),
(5, 6), (5, 9),
(6, 9), (6, 10),
(7, 8), (7, 10),
(8, 10),
(9, 10)
]
G.add_edges_from(edges)

4. Calculate Betweenness Centrality: Now, calculate the betweenness centrality for each node in
the social network:

betweenness = nx.betweenness_centrality(G)

The betweenness_centrality function computes the normalized betweenness centrality for all nodes in the graph. The
result is stored in the betweenness dictionary, where the keys are node identifiers, and the values are the betweenness
centrality scores.

5. Display Betweenness Centrality Scores: You can print or analyze the betweenness
centrality scores for each node:

for node, centrality in betweenness.items():

    print(f"Node {node}: Betweenness Centrality = {centrality:.4f}")
