BDA Assignment1 BE6 20

The document discusses types of big data; NoSQL data architecture patterns, including key-value, columnar, document and graph databases; the MapReduce algorithm, with its input splitting, mapping, shuffling, sorting and reducing phases; and collaborative filtering, a recommendation technique that finds similar users and recommends items those users like.

Assignment 1

Name: Harsh Mordharia Class and Roll No: BE 6/ 20


Subject: Big Data Analytics PRN: 20UF15936IT031

1. Define Big Data and elaborate on the types of Big Data.

Big Data refers to extremely large and complex datasets that cannot be easily
managed, processed, or analyzed using traditional data processing methods. It
typically involves datasets that are too large to be handled by conventional
database systems or software tools.

Types of Big Data can be categorized into three main categories:

I. Structured Data: This type of data refers to highly organized and formatted information, often stored in relational databases. Structured data is easily searchable and can be processed using traditional database management systems. Examples include data stored in tables with rows and columns, such as customer information, transaction records, and inventory data.

II. Unstructured Data: Unstructured data does not conform to a predefined data model or structure, making it more challenging to process and analyze. This type of data includes text, images, videos, social media posts, emails, and sensor data. Analyzing unstructured data often requires advanced techniques such as natural language processing (NLP), image recognition, and sentiment analysis.

III. Semi-Structured Data: Semi-structured data lies somewhere between structured and unstructured data. It has some organizational properties but does not fit neatly into a relational database or other traditional data models. Examples include XML files, JSON documents, log files, and NoSQL databases. Semi-structured data often requires specialized tools and techniques for processing and analysis, such as schema-on-read databases and document-oriented databases.

Each type of Big Data presents unique challenges and opportunities for
organizations seeking to extract insights and value from their data assets.
Effective management and analysis of Big Data require a combination of
advanced technologies, analytical tools, and expertise in data science and data
engineering.

2. Describe NoSQL Data Architecture Patterns

An architecture pattern is a logical way of categorizing how data will be stored in a database. NoSQL is a family of databases designed to store and operate on big data in flexible formats. It is widely used because of its flexibility and the wide variety of services built on it.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
Each is explained below.
1. Key-Value Store Database:

This is one of the most basic NoSQL models. As the name suggests, data is stored in the form of key-value pairs. The key is usually a simple string, integer or character sequence, but can also be a more advanced data type. The value is linked to the key. Key-value databases generally store data in a hash table in which each key is unique. The value can be of any type (JSON, BLOB (Binary Large Object), string, etc.). This pattern is commonly used in shopping websites and other e-commerce applications.

Advantages:
• Can handle large amounts of data and heavy load.
• Easy retrieval of data by key.
Limitations:
• Complex queries that span multiple key-value pairs can hurt performance.
• Many-to-many relationships are difficult to model.
Examples:
• DynamoDB
• Berkeley DB
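A minimal in-memory sketch of the key-value pattern (the class, key format and cart data below are illustrative, not any particular product's API):

```python
# Key-value store sketch: a hash table in which each unique key maps
# to an arbitrary value (string, dict/JSON, bytes, ...).
class KeyValueStore:
    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)

# Typical e-commerce usage: a shopping cart keyed by user id.
store = KeyValueStore()
store.put("user:42:cart", {"items": ["book", "pen"], "total": 12.5})
print(store.get("user:42:cart")["items"])  # ['book', 'pen']
```

Retrieval by key is a single hash lookup, which is why this pattern handles heavy load well; the trade-off is that queries on anything other than the key require scanning every value.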

2. Column Store Database:

Rather than storing data in relational tuples, data is stored in individual cells that are grouped into columns. Column-oriented databases operate on columns: large amounts of data are stored together column-wise. The format and titles of the columns can differ from one row to another, and every column is treated separately. Each column family may in turn contain multiple columns, much like a traditional table. In short, the column is the unit of storage in this pattern.
Advantages:
• Column data is readily available for analytical reads.
• Aggregate queries such as SUM, AVERAGE and COUNT can be performed efficiently on columns.
Examples:
• HBase
• Bigtable by Google
• Cassandra
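The column-wise layout, and the aggregate queries it favours, can be sketched as follows (the product data is hypothetical):

```python
# Columnar layout sketch: each column is stored contiguously,
# so an aggregate scans only the column it needs.
columns = {
    "product": ["pen", "book", "desk"],
    "price":   [1.5, 12.0, 80.0],
    "qty":     [100, 40, 5],
}

total = sum(columns["price"])            # SUM over one column only
average = total / len(columns["price"])  # AVERAGE over the same column
print(total, round(average, 2))          # 93.5 31.17
```

Because SUM and AVERAGE never touch the "product" or "qty" columns, a column store reads far less data for such queries than a row store would.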
3. Document Database:

The document database stores and retrieves data as key-value pairs, but here the values are called documents. A document is a complex data structure that can take the form of text, arrays, strings, JSON, XML or similar formats, and nested documents are very common. This model is effective because much of the data created today is semi-structured and naturally represented as JSON.
Advantages:
• This format is well suited to semi-structured data.
• Storage, retrieval and management of documents are easy.
Limitations:
• Queries that span multiple documents are challenging.
• Aggregation operations across documents can be inefficient.
Examples:
• MongoDB
• CouchDB
Figure – Document Store Model in form of JSON documents
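A small sketch of the document model: each document is a nested JSON-like structure, and documents in the same collection need not share a schema (the collection and field names here are hypothetical, not any specific database's API):

```python
# Document-store sketch: a collection of nested JSON-like documents
# with differing fields.
collection = [
    {"_id": 1, "name": "Asha", "address": {"city": "Mumbai"}},
    {"_id": 2, "name": "Ravi", "tags": ["student"]},  # different fields
]

# Query by a nested field, tolerating documents that lack it.
matches = [d for d in collection
           if d.get("address", {}).get("city") == "Mumbai"]
print([d["_id"] for d in matches])  # [1]
```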
4. Graph Databases:

This architecture pattern deals with the storage and management of data as graphs. A graph is a structure that depicts connections between two or more objects in the data. The objects or entities are called nodes and are joined by relationships called edges; each edge has a unique identifier, and each node serves as a point of contact in the graph. This pattern is commonly used in social networks, where there are a large number of entities and each entity has one or more characteristics connected by edges. In a relational database, tables are only loosely connected through joins, whereas in a graph the connections between nodes are explicit and first-class.
Advantages:
• Fast traversal, because relationships are stored directly as edges.
• Spatial data can be handled easily.
Limitations:
• Wrong connections may lead to infinite loops.
Examples:
• Neo4J
• FlockDB (used by Twitter)

Figure – Graph model format of NoSQL Databases
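The graph model can be sketched with an adjacency list and a traversal; the visited set is what guards against the infinite loops mentioned above (the social-network data is hypothetical):

```python
from collections import deque

# Graph sketch: nodes are entities, edges are relationships,
# stored as an adjacency list (e.g. a "follows" graph).
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def reachable(graph, start):
    """Breadth-first traversal; the visited set prevents revisiting
    nodes, so cycles cannot cause infinite loops."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable(edges, "alice")))  # ['alice', 'bob', 'carol', 'dave']
```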

3. Describe in detail Map Reduce Algorithm

MapReduce is a programming model and processing framework designed for processing and generating large-scale datasets in a distributed computing environment. It was popularized by Google and is now widely used in various big data processing systems such as Apache Hadoop.

Here's a detailed description of the MapReduce algorithm:

Input Splitting: Initially, the input data is divided into smaller chunks or splits, typically ranging from a few kilobytes to several gigabytes. These splits are then distributed across the nodes in a distributed computing cluster.

Mapping: The Mapping phase involves processing each input split independently. This phase applies a user-defined function called the "Map" function to each record within the input split. The Map function takes the input data and generates a set of intermediate key-value pairs. These key-value pairs are typically different from the input format and are often in a more structured form suitable for subsequent processing.

Shuffling and Sorting: In this phase, the intermediate key-value pairs produced by the Map tasks are sorted and grouped by their keys. This step is crucial for preparing the data for the next phase, which involves passing the grouped data to the Reduce tasks. The shuffling and sorting process ensures that all values associated with a particular key are sent to the same Reduce task.

Reducing: The Reduce phase involves applying another user-defined function called the "Reduce" function to each group of intermediate key-value pairs produced by the shuffling and sorting phase. The Reduce function takes a key and a list of values associated with that key and performs some aggregation or computation on the values. The output of the Reduce function is typically a set of aggregated results or a transformed dataset.

Output: Finally, the output of the Reduce tasks is collected and merged to
produce the final result. This result may be written to a distributed file system,
returned to the user's application, or used as input for subsequent MapReduce
jobs.
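The phases above can be sketched in a single-process word-count program, the classic MapReduce illustration (the function names, splits and data here are illustrative; in a real cluster the splits and the Map and Reduce tasks run on different nodes):

```python
from collections import defaultdict
from itertools import chain

def map_fn(split):
    # Map phase: emit one (word, 1) pair per word in the input split.
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    # Shuffle & sort phase: sort pairs by key and group values by key,
    # so all counts for one word land in the same group.
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce phase: aggregate the list of values for one key.
    return key, sum(values)

splits = ["big data big", "data systems"]        # input splitting
mapped = chain.from_iterable(map_fn(s) for s in splits)
result = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'systems': 1}
```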

MapReduce offers several advantages for processing large-scale datasets:

Scalability: MapReduce can scale horizontally across a distributed cluster of commodity hardware, allowing it to handle datasets of virtually any size.
Fault Tolerance: MapReduce frameworks like Hadoop are designed to handle
node failures and ensure that processing continues without data loss or
interruption.
Parallelism: MapReduce exploits parallelism by distributing tasks across
multiple nodes in the cluster, enabling faster processing of large datasets.
Simplicity: The MapReduce programming model abstracts away many of the
complexities of distributed computing, making it easier for developers to write
and debug large-scale data processing applications.

Overall, MapReduce has become a foundational paradigm for distributed data processing and has been instrumental in enabling the analysis of vast amounts of data in fields such as web search, social media analytics, and scientific research.

4. Write a note on Collaborative Filtering with examples

In collaborative filtering, we find similar users and recommend what those similar users like. In this type of recommendation system, we do not use the features of an item to recommend it; instead, we classify users into clusters of similar types and recommend items to each user according to the preferences of its cluster.
There are four main types of algorithms, or techniques, for building collaborative-filtering-based recommender systems:
• Memory-Based
• Model-Based
• Hybrid
• Deep Learning
Advantages of Collaborative Filtering-Based Recommender Systems:
Content-based recommender systems have limited use cases, higher time complexity, and depend on a limited set of item content; that is not the case with collaborative filtering. One of the main advantages of these recommender systems is that they are highly efficient at providing personalized content and are also able to adapt to changing user preferences.

Measuring Similarity
A simple movie-recommendation example helps to explain this. Consider a small utility matrix of users' ratings for movies. If User 1 and User 2 give nearly identical ratings to the movies they have both seen, we can conclude that Movie 3 is also likely to be moderately liked by User 1, and that Movie 4 will be a good recommendation for User 2. We can also see users with opposite tastes, such as User 1 and User 3. Likewise, if User 3 and User 4 share a common opinion of a movie, we can infer that Movie 4 is also likely to be disliked by User 4. This is collaborative filtering: we recommend to users the items liked by users with similar interests.

Cosine Similarity
We can also use cosine similarity between users to find users with similar interests: a larger cosine implies a smaller angle between two users' rating vectors, and hence more similar interests. We apply the cosine measure to pairs of rows of the utility matrix, assigning zero to all unfilled entries to simplify the calculation. A smaller cosine means a larger distance between the users; a larger cosine means a smaller angle between them, so we can recommend them similar items.

The cosine similarity between two rating vectors A and B is:

similarity(A, B) = (A · B) / (‖A‖ × ‖B‖) = ( Σᵢ Aᵢ Bᵢ ) / ( √(Σᵢ Aᵢ²) × √(Σᵢ Bᵢ²) )
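Under the zero-filling convention described above, the similarity can be computed directly (the rating vectors below are hypothetical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors;
    unrated items are filled with 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two users' ratings for four movies (0 = not rated).
user1 = [5, 4, 0, 1]
user2 = [5, 5, 0, 0]
print(round(cosine_similarity(user1, user2), 3))  # 0.982
```

The value near 1 indicates a small angle between the two rating vectors, so these users have similar tastes and are good candidates to receive each other's liked items as recommendations.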

Rounding the Data
In collaborative filtering we can round off the data to compare it more easily; for example, we can map ratings below 3 to 0 and ratings of 3 and above to 1. Applying this rounding to the previous example makes the data much more readable: we can see that User 1 and User 2 are more similar, and that User 3 and User 4 are more alike.
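The rounding step can be sketched as follows (using the threshold of 3 from the text; the sample ratings are hypothetical):

```python
# Rounding step: ratings below the threshold become 0 (dislike),
# ratings at or above it become 1 (like), so user vectors are
# easy to compare at a glance.
def binarize(ratings, threshold=3):
    return [0 if r < threshold else 1 for r in ratings]

print(binarize([5, 2, 4, 1]))  # [1, 0, 1, 0]
```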

Normalizing Ratings
To normalize, we take a user's average rating and subtract it from each of that user's ratings, leaving positive or negative values that can easily be classified into similar groups. By normalizing the data we can form clusters of users who give similar ratings to similar items, and then use these clusters to recommend items to the users.
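A minimal sketch of this mean-centering (the ratings are hypothetical):

```python
# Mean-centering: subtract the user's average rating from each rating,
# so positive values mean "above this user's average" and negative
# values mean "below it".
def normalize(ratings):
    avg = sum(ratings) / len(ratings)
    return [round(r - avg, 2) for r in ratings]

print(normalize([5, 3, 4]))  # [1.0, -1.0, 0.0]
```

This also corrects for users who rate generously or harshly overall: after centering, two users with the same relative preferences produce similar vectors even if their raw rating scales differ.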
