BIG data master
List of Experiments:
Question 1. To draw and explain Hadoop architecture and ecosystem with the help of a case study.
Answer:
Hadoop is an open-source framework for distributed storage and processing of large datasets. It consists
of a comprehensive ecosystem of components that work together to handle big data challenges. To
illustrate Hadoop's architecture and ecosystem, let's consider a case study involving a retail company,
"Retail Mart," which aims to analyze and derive insights from its vast volumes of sales and customer
data.
1. HDFS (Hadoop Distributed File System):
○ HDFS stores Retail Mart's raw sales and customer data across the cluster, providing
fault-tolerant, distributed storage.
2. YARN:
○ YARN manages cluster resources and schedules the jobs that process this data.
3. MapReduce:
○ MapReduce runs batch computations, such as revenue aggregation, over the data
stored in HDFS.
4. Hive:
○ Hive provides SQL-like querying over HDFS data for reporting.
5. Pig:
○ Pig offers a high-level scripting layer for cleaning and preprocessing data.
6. HBase:
○ HBase is a NoSQL store on top of HDFS for real-time read/write access to results.
7. Spark:
○ To perform more complex and iterative data processing, Retail Mart can utilize
Apache Spark, which is often used alongside Hadoop. Spark is well-suited for
machine learning, data streaming, and graph processing.
8. Sqoop:
○ Retail Mart wants to import data from their existing relational databases into
Hadoop. Sqoop is a tool for efficiently transferring data between Hadoop and
structured data stores like relational databases.
9. Flume and Kafka:
○ To ingest real-time data from online transactions and weblogs, Retail Mart can use
Flume and Kafka. These tools enable data streaming into Hadoop for immediate
analysis.
Retail Mart loads daily sales transaction data into HDFS, which includes information about products,
customers, and sales. They run MapReduce jobs to calculate daily, weekly, and monthly revenues and
perform market basket analysis to find correlations between products. The results are stored in HBase
for real-time access.
Retail Mart also utilizes Hive to create reports for management, showing sales trends and customer
behavior. They use Pig to clean and preprocess the data, while Spark helps them build
recommendation engines based on customer preferences.
In addition, Retail Mart employs Sqoop to import historical sales data from their existing relational
database into HDFS, ensuring all data is in one place for analysis. Flume and Kafka are used to stream
data from online transactions, enabling real-time analytics.
Big Data [704]
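The daily-revenue MapReduce job described above can be illustrated with a minimal pure-Python sketch of the map-shuffle-reduce model (the sale records below are hypothetical, for illustration only):

```python
from collections import defaultdict

# Hypothetical daily sales records: (product, amount)
sales = [("soap", 40), ("bread", 25), ("soap", 60), ("milk", 30), ("bread", 15)]

# Map: emit (key, value) pairs — here, (product, amount)
mapped = [(product, amount) for product, amount in sales]

# Shuffle: group all values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: sum the revenue per product
revenue = {product: sum(amounts) for product, amounts in grouped.items()}
print(revenue)  # {'soap': 100, 'bread': 40, 'milk': 30}
```

A real MapReduce job distributes the map and reduce phases across the cluster's nodes; the per-phase logic is the same.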
Question 2. Perform setting up and installing single node Hadoop in a Windows environment.
Answer:
Prerequisites: Before you begin, ensure that you have the following prerequisites in place:
1. A Windows machine with sufficient system resources (RAM, CPU, and storage) to run
Hadoop.
2. Java Development Kit (JDK) installed. Hadoop requires Java. You can download and install
Oracle JDK or OpenJDK.
3. Download the Hadoop distribution for Windows. You can download it from the official
Apache Hadoop website.
Installation Steps:
1. Java Installation:
○ Install Java (if not already installed) and set the JAVA_HOME environment variable to
point to the Java installation directory. Make sure you have the correct version of Java
compatible with the Hadoop version you are using.
2. Hadoop Installation:
○ Extract the downloaded Hadoop distribution to a directory on your Windows
machine. For example, you can extract it to C:\hadoop.
3. Configuration:
○ Navigate to the C:\hadoop\etc\hadoop directory (or the location where you
extracted Hadoop) and edit the following configuration files:
■ core-site.xml: Add the following configuration to specify the default
filesystem URI for HDFS:
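A minimal single-node core-site.xml might look like the following sketch (hdfs://localhost:9000 is the URI commonly used in single-node tutorials; adjust the host and port to your setup):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

Similarly, hdfs-site.xml on a single node usually sets dfs.replication to 1.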
4. Formatting HDFS:
● Open a Command Prompt and navigate to the Hadoop bin directory, typically
located at C:\hadoop\bin.
● Run the following command to format the HDFS NameNode:
hdfs namenode -format
5. Starting Hadoop:
● Start the HDFS and YARN daemons:
start-dfs.cmd
start-yarn.cmd
Question 3. To implement the following file management tasks in Hadoop System (HDFS):
Adding files and directories, retrieving files, Deleting files
Answer:
● Adding Files and Directories:
■ To create a directory in HDFS, use the mkdir command, and
to copy files from the local file system into HDFS, use the
put command:
hadoop fs -mkdir /hdfs/path/to/directory
hadoop fs -put /local/path/to/source /hdfs/path/to/destination
● Retrieving Files:
■ To retrieve files from HDFS to your local file system,
use the get command:
hadoop fs -get /hdfs/path/to/source /local/path/to/destination
● Deleting Files:
■ To delete files in HDFS, you can use the rm command. Be
cautious when using this command as it permanently deletes
files:
hadoop fs -rm /hdfs/path/to/file
hadoop fs -rm -r /hdfs/path/to/directory
Question 4. Create a database ‘STD’ and make a collection (e.g. "student" with fields 'No.,
Stu_Name, Enrol., Branch, Contact, e-mail, Score') using MongoDB. Perform various operations
in the following experiments.
Answer:
Create a Database:
In the MongoDB shell, you can create a database named 'STD' using the use command. If the database
doesn't exist, MongoDB will create it when you insert data into it:
use STD
1. Now, create a collection called 'student' and insert some sample data into it. You can use the
insertOne or insertMany method to add documents to the collection. Here's an example of
inserting a single document:
javascript
db.student.insertOne({
No: 1,
Stu_Name: "John Doe",
Enrol: "E12345",
Branch: "Computer Science",
Contact: "123-456-7890",
email: "[email protected]",
Score: 95
});
2. You can insert more documents using the insertOne or insertMany methods.
Querying Data:
db.student.find()
Updating Data:
db.student.updateOne(
{ Enrol: "E12345" },
{ $set: { Score: 98 } }
);
Deleting Data:
db.student.deleteOne({ No: 1 });
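The four operations above can be mimicked on a plain Python list of dictionaries — a useful mental model for how MongoDB matches filter documents against a collection (the records and values here are hypothetical):

```python
# In-memory "collection": a list of documents (dicts)
students = [{"No": 1, "Stu_Name": "John Doe", "Enrol": "E12345", "Score": 95}]

# insertOne: append a document
students.append({"No": 2, "Stu_Name": "Jane Roe", "Enrol": "E54321", "Score": 88})

# find({Enrol: "E12345"}): filter by field value
matches = [s for s in students if s["Enrol"] == "E12345"]

# updateOne({Enrol: "E12345"}, {$set: {Score: 98}}): mutate the first match
for s in students:
    if s["Enrol"] == "E12345":
        s["Score"] = 98
        break

# deleteOne({No: 1}): remove the matching document
students = [s for s in students if s["No"] != 1]
print(students)
```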
Question 5. Insert multiple records (at least 10) into the created student collection.
Answer:
db.student.insertMany([
{
No: 2,
Stu_Name: "Bhavik",
Enrol: "E23456",
Branch: "Computer Science",
Contact: "9876543210",
email: "[email protected]",
Score: 88
},
{
No: 3,
Stu_Name: "Bhavika",
Enrol: "E34567",
Branch: "Mathematics",
Contact: "9988776655",
email: "[email protected]",
Score: 73
},
{
No: 4,
Stu_Name: "Aeshna",
Enrol: "E45678"
},
{
No: 5,
Stu_Name: "Ansh",
Enrol: "E56789",
Branch: "Chemistry",
Contact: "9877005500",
email: "[email protected]",
Score: 79
},
{
No: 6,
Stu_Name: "Devendra",
Enrol: "E67890",
Branch: "Mechanical Engineering",
Contact: "9966337700",
email: "[email protected]",
Score: 85
},
{
No: 7,
Stu_Name: "Yash",
Enrol: "E78901",
Branch: "Electrical Engineering",
Contact: "9977553344",
email: "[email protected]",
Score: 94
},
{
No: 8,
Stu_Name: "Anuj",
Enrol: "E89012",
Branch: "Economics",
Contact: "9876009900",
email: "[email protected]",
Score: 78
},
{
No: 9,
Stu_Name: "Aditya",
Enrol: "E90123",
Branch: "History",
Contact: "9888998899",
email: "[email protected]",
Score: 87
},
{
No: 10,
Stu_Name: "Jay",
Enrol: "E01234",
Branch: "Geography", Contact:
"9966558877",
email: "[email protected]",
Score: 91
},
{
No: 11,
Stu_Name: "Patel",
Enrol: "E12345",
Branch: "Business Administration",
Contact: "9876543000",
email: "[email protected]", Score:
75
}
]);
Question 6. Perform the following operations on the 'student' collection:
a. Display all the records sorted by student name.
b. Update the contact number of the student named 'Anuj'.
c. Add a new field 'remark' to the document with the name 'Yash'.
d. Add a new student with No. 12.
Answer:
a. Display All the Records Sorted by Student Name:
javascript
db.student.find().sort({ Stu_Name: 1 }).pretty()
b. Update the Contact Number of the Student Named 'Anuj':
javascript
db.student.updateOne({ Stu_Name: "Anuj" }, { $set: { Contact: "9876543210" } })
This query updates the 'Contact' field for the student with the name 'Anuj'. You can replace
"9876543210" with the new contact information as needed.
c. Add a New Field 'remark' to the Document with the Name 'Yash':
javascript
db.student.updateOne({ Stu_Name: "Yash" }, { $set: { remark: "New remark text" } })
This query adds a new field 'remark' with the value "New remark text" to the document where
'Stu_Name' is 'Yash'. You can adjust the value accordingly.
d. Add a New Student with No. 12 (Name: Shruti, Enroll: 00101, Branch: CSE, Email:
[email protected], Contact: 098675345): This task can be accomplished with the updateOne
method and its upsert option, without a separate insert statement. To add a new student with No. 12 and
the specified details, run the following query:
javascript
db.student.updateOne(
{ No: 12 },
{
$set: {
Stu_Name: "Shruti",
Enrol: "00101",
Branch: "CSE",
email: "[email protected]",
Contact: "098675345",
Score: 0 // You can set an initial score or any other default value
}
},
{ upsert: true } // Use upsert to insert if the document doesn't exist
)
Question 7. Create an 'employee' collection (with fields such as Employee_ID, First_Name,
Last_Name, Dept_ID, and Salary) and perform the following operations:
a. Find all the employees of a particular department where salary is < 40K.
b. Find the highest salary for each department and fetch the name of such employees.
c. Find all the employees who are on a lesser salary than 30k; increase their salary by 10%
and display the results.
Answer:
Assuming you are using the MongoDB shell, you can create the employee collection as follows:
// Create the Employee Collection
db.createCollection("employee")
a. Find all the employees of a particular department where salary is less than 40K. For
example, to find employees in the "HR" department with a salary less than 40K:
db.employee.find({ Dept_ID: "HR", Salary: { $lt: 40000 } })
b. Find the highest salary for each department and fetch the names of such employees:
db.employee.aggregate([
{
$group: {
_id: "$Dept_ID",
maxSalary: { $max: "$Salary" }
}
},
{
$lookup: {
from: "employee",
localField: "_id",
foreignField: "Dept_ID",
as: "employees"
}
},
{
$unwind: "$employees"
},
{
$match: { $expr: { $eq: [ "$employees.Salary", "$maxSalary" ] } }
},
{
$project: {
_id: 0,
Dept_ID: "$_id",
Employee_ID: "$employees.Employee_ID",
First_Name: "$employees.First_Name",
Last_Name: "$employees.Last_Name",
Salary: "$maxSalary"
}
}
])
c. Find all employees who are on a salary less than 30K, increase their salary by 10%, and
display the results:
db.employee.updateMany(
{ Salary: { $lt: 30000 } },
{ $mul: { Salary: 1.1 } }
)
db.employee.find().pretty() // display the updated documents
Question 8. Design and implement a social network graph of 50 nodes, with edges between
nodes, using the NetworkX library in Python.
Answer:
To design and implement a social network graph with 50 nodes and edges between nodes using the
NetworkX library in Python, follow these steps:
1. Install NetworkX:
If you haven't already, you need to install the NetworkX library. You can install it using pip:
bash
pip install networkx
2. Import NetworkX and Create a Graph:
In your Python script, import the NetworkX library and create an undirected graph object:
python
import networkx as nx
G = nx.Graph()
3. Add Nodes:
You can add nodes to the graph. Since you want 50 nodes, you can do this programmatically:
python
num_nodes = 50
for i in range(1, num_nodes + 1):
G.add_node(i)
4. Add Edges:
To create edges between nodes, you can use various methods, such as adding random
connections or defining a specific network structure. Here's an example of adding random edges:
python
import random
num_edges = 100  # desired number of edges (illustrative)
while G.number_of_edges() < num_edges:
    u, v = random.sample(range(1, num_nodes + 1), 2)
    G.add_edge(u, v)
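Putting steps 1-4 together, here is a self-contained sketch; it uses nx.gnm_random_graph, a NetworkX helper that builds a graph with an exact number of nodes and random edges, as a convenient alternative to the manual loop (the edge count and seed are illustrative choices):

```python
import networkx as nx

num_nodes, num_edges = 50, 100
# Build a graph with exactly 50 nodes (labelled 0..49) and 100 random edges
G = nx.gnm_random_graph(num_nodes, num_edges, seed=42)

print(G.number_of_nodes(), G.number_of_edges())  # prints: 50 100
```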
Question 9. Design and plot an asymmetric social network (socio graph) of 5 nodes (A, B, C, D, and E)
such that A is directed to B, B is directed to D, D is directed to A, and D is directed to C.
Answer:
1. Install NetworkX
If you haven't already, you need to install the NetworkX library. You can install it using pip:
pip install networkx
2. Import NetworkX and Create a Directed Graph:
In your Python script or Jupyter Notebook, import the NetworkX library and create a directed graph
(DiGraph):
import networkx as nx
import matplotlib.pyplot as plt
G = nx.DiGraph()
3. Add Nodes:
0 Big Data [704] 1
Add nodes A, B, C, D, and E to the graph:
nodes = ["A", "B", "C", "D", "E"]
G.add_nodes_from(nodes)
4. Add Edges:
Add the directed edges described in the question:
G.add_edge("A", "B")
G.add_edge("B", "D")
G.add_edge("D", "A")
G.add_edge("D", "C")
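The direction of the edges can be verified programmatically before plotting; this small check (rebuilding the same DiGraph so it runs standalone) confirms the asymmetry:

```python
import networkx as nx

G = nx.DiGraph()
G.add_nodes_from(["A", "B", "C", "D", "E"])
G.add_edges_from([("A", "B"), ("B", "D"), ("D", "A"), ("D", "C")])

print(sorted(G.successors("D")))  # ['A', 'C'] — D points to both A and C
print(G.has_edge("A", "B"), G.has_edge("B", "A"))  # True False — edges are one-way
```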
Question 10. Consider the above scenario (No. 09) and plot a weighted asymmetric graph, where
the edge weights range between 20 and 50.
Answer:
To create a weighted asymmetric graph based on the scenario provided in question 9, you can use
NetworkX in Python. In this case, you will assign random weights in the range of 20 to 50 to the
directed edges between nodes A, B, D, and C. Here's how you can achieve this:
3. Add Directed Edges with Random Weights: Add directed edges between nodes A, B, D, and C,
and assign random weights in the range of 20 to 50:
import random
G.add_edge("A", "B", weight=random.randint(20, 50))
G.add_edge("B", "D", weight=random.randint(20, 50))
G.add_edge("D", "A", weight=random.randint(20, 50))
G.add_edge("D", "C", weight=random.randint(20, 50))
4. Plot the Graph: Define fixed positions for the nodes, then draw the graph and its edge weights:
pos = {
"A": (0, 1),
"B": (1, 2),
"C": (2, 1),
"D": (1, 0),
"E": (3, 2)
}
# Extract edge weights for drawing
edge_weights = nx.get_edge_attributes(G, 'weight')
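The drawing call itself is not shown above; a complete sketch might look like the following (it rebuilds the weighted graph so it runs standalone, and uses matplotlib's Agg backend with savefig so it also works without a display — drop those lines and call plt.show() instead for an interactive window):

```python
import random

import matplotlib
matplotlib.use("Agg")  # headless backend (assumption: no display available)
import matplotlib.pyplot as plt
import networkx as nx

G = nx.DiGraph()
G.add_nodes_from(["A", "B", "C", "D", "E"])
for u, v in [("A", "B"), ("B", "D"), ("D", "A"), ("D", "C")]:
    G.add_edge(u, v, weight=random.randint(20, 50))

pos = {"A": (0, 1), "B": (1, 2), "C": (2, 1), "D": (1, 0), "E": (3, 2)}
edge_weights = nx.get_edge_attributes(G, "weight")

nx.draw(G, pos, with_labels=True, node_color="lightblue", arrows=True)
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_weights)
plt.savefig("weighted_sociograph.png")
```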
Question 11. Implement the betweenness centrality measure between nodes of a social network
using NetworkX.
Answer:
To implement the betweenness centrality measure between nodes in a social network using Python
and NetworkX, you can follow these steps. Here, I'll assume a social network with 10 nodes for
demonstration:
2. Import NetworkX and Create a Graph: Import NetworkX and create a graph for your social
network:
import networkx as nx
G = nx.Graph()
3. Add Edges (Connections between Nodes): Define the connections (edges) between nodes to
represent your social network. This can be done based on your specific network structure:
4. Calculate Betweenness Centrality: Now, calculate the betweenness centrality for each node in
the social network:
betweenness = nx.betweenness_centrality(G)
The betweenness_centrality function computes the normalized betweenness centrality for all nodes in the graph. The
result is stored in the betweenness dictionary, where the keys are node identifiers, and the values are the betweenness
centrality scores.
5. Display Betweenness Centrality Scores: You can print or analyze the betweenness
centrality scores for each node:
for node, score in betweenness.items():
    print(f"Node {node}: {score:.4f}")
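Since the edge list in step 3 is left to the reader, here is an end-to-end sketch using an assumed 10-node path network (nodes 1 to 10 connected in a line; substitute your own edges) to show the full workflow:

```python
import networkx as nx

# Assumed example structure: a 10-node path 1-2-...-10
G = nx.Graph()
G.add_edges_from((i, i + 1) for i in range(1, 10))

# Normalized betweenness centrality for every node
betweenness = nx.betweenness_centrality(G)

# Print nodes from most to least central
for node, score in sorted(betweenness.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Node {node}: {score:.4f}")
```

On a path, the middle nodes (5 and 6) score highest because every shortest path between the two halves runs through them, while the endpoints score 0.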