BDA Assignment1 BE6 20

The document discusses types of big data; NoSQL data architecture patterns, including key-value, columnar, document and graph databases; the MapReduce algorithm, with its input splitting, mapping, shuffling, sorting and reducing phases; and collaborative filtering, a recommendation technique that finds similar users and recommends items those users like.

Assignment 1

Name: Harsh Mordharia Class and Roll No: BE 6/ 20


Subject: Big Data Analytics PRN: 20UF15936IT031

1. Define Big Data and elaborate on the types of Big Data.

Big Data refers to extremely large and complex datasets that cannot be easily
managed, processed, or analyzed using traditional data processing methods. It
typically involves datasets that are too large to be handled by conventional
database systems or software tools.

Types of Big Data can be categorized into three main categories:

I. Structured Data: This type of data refers to highly organized and formatted information, often stored in relational databases. Structured data is easily searchable and can be processed using traditional database management systems. Examples include data stored in tables with rows and columns, such as customer information, transaction records, and inventory data.

II. Unstructured Data: Unstructured data does not conform to a predefined data model or structure, making it more challenging to process and analyze. This type of data includes text, images, videos, social media posts, emails, and sensor data. Analyzing unstructured data often requires advanced techniques such as natural language processing (NLP), image recognition, and sentiment analysis.

III. Semi-Structured Data: Semi-structured data lies somewhere between structured and unstructured data. It has some organizational properties but does not fit neatly into a relational database or other traditional data models. Examples include XML files, JSON documents, log files, and NoSQL databases. Semi-structured data often requires specialized tools and techniques for processing and analysis, such as schema-on-read databases and document-oriented databases.

Each type of Big Data presents unique challenges and opportunities for
organizations seeking to extract insights and value from their data assets.
Effective management and analysis of Big Data require a combination of
advanced technologies, analytical tools, and expertise in data science and data
engineering.

2. Describe NoSQL Data Architecture Patterns

An architecture pattern is a logical way of categorizing how data will be stored in a database. NoSQL is a family of databases designed to store and operate on big data in flexible formats. It is widely used because of its flexibility and the wide variety of services built on it.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
Each is explained below.
1. Key-Value Store Database:

This is one of the most basic NoSQL models. As the name suggests, data is stored in the form of key-value pairs. The key is usually a simple string, integer or character sequence, but can also be a more advanced data type. The value is linked to the key. Key-value databases generally store data in a hash table in which each key is unique. The value can be of any type (JSON, BLOB (Binary Large Object), string, etc.). This pattern is commonly used in shopping websites and other e-commerce applications.

Advantages:
• Can handle large amounts of data and heavy load.
• Easy retrieval of data by key.
Limitations:
• Complex queries that span multiple key-value pairs can hurt performance.
• Many-to-many relationships are difficult to model.
Examples:
• DynamoDB
• Berkeley DB
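A minimal in-memory sketch of the key-value pattern (the class, key format and cart data below are illustrative, not any particular product's API):

```python
# Key-value store sketch: a hash table in which each unique key maps
# to an arbitrary value (string, dict/JSON, bytes, ...).
class KeyValueStore:
    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)

# Typical e-commerce usage: a shopping cart keyed by user id.
store = KeyValueStore()
store.put("user:42:cart", {"items": ["book", "pen"], "total": 12.5})
print(store.get("user:42:cart")["items"])  # ['book', 'pen']
```

Retrieval by key is a single hash lookup, which is why this pattern handles heavy load well; the trade-off is that queries on anything other than the key require scanning every value.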

2. Column Store Database:

Rather than storing data in relational tuples, data is stored in individual cells that are grouped into columns. Column-oriented databases operate on columns: large amounts of data are stored together column-wise. The format and titles of the columns can differ from one row to another, and every column is treated separately. Each column family may in turn contain multiple columns, much like a traditional table. In short, the column is the unit of storage in this pattern.
Advantages:
• Column data is readily available for analytical reads.
• Aggregate queries such as SUM, AVERAGE and COUNT can be performed efficiently on columns.
Examples:
• HBase
• Bigtable by Google
• Cassandra
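The column-wise layout, and the aggregate queries it favours, can be sketched as follows (the product data is hypothetical):

```python
# Columnar layout sketch: each column is stored contiguously,
# so an aggregate scans only the column it needs.
columns = {
    "product": ["pen", "book", "desk"],
    "price":   [1.5, 12.0, 80.0],
    "qty":     [100, 40, 5],
}

total = sum(columns["price"])            # SUM over one column only
average = total / len(columns["price"])  # AVERAGE over the same column
print(total, round(average, 2))          # 93.5 31.17
```

Because SUM and AVERAGE never touch the "product" or "qty" columns, a column store reads far less data for such queries than a row store would.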
3. Document Database:

The document database stores and retrieves data as key-value pairs, but here the values are called documents. A document is a complex data structure that can take the form of text, arrays, strings, JSON, XML or similar formats, and nested documents are very common. This model is effective because much of the data created today is semi-structured and naturally represented as JSON.
Advantages:
• This format is well suited to semi-structured data.
• Storage, retrieval and management of documents are easy.
Limitations:
• Queries that span multiple documents are challenging.
• Aggregation operations across documents can be inefficient.
Examples:
• MongoDB
• CouchDB
Figure – Document Store Model in form of JSON documents
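A small sketch of the document model: each document is a nested JSON-like structure, and documents in the same collection need not share a schema (the collection and field names here are hypothetical, not any specific database's API):

```python
# Document-store sketch: a collection of nested JSON-like documents
# with differing fields.
collection = [
    {"_id": 1, "name": "Asha", "address": {"city": "Mumbai"}},
    {"_id": 2, "name": "Ravi", "tags": ["student"]},  # different fields
]

# Query by a nested field, tolerating documents that lack it.
matches = [d for d in collection
           if d.get("address", {}).get("city") == "Mumbai"]
print([d["_id"] for d in matches])  # [1]
```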
4. Graph Databases:

This architecture pattern deals with the storage and management of data as graphs. A graph is a structure that depicts connections between two or more objects in the data. The objects or entities are called nodes and are joined by relationships called edges; each edge has a unique identifier, and each node serves as a point of contact in the graph. This pattern is commonly used in social networks, where there are a large number of entities and each entity has one or more characteristics connected by edges. In a relational database, tables are only loosely connected through joins, whereas in a graph the connections between nodes are explicit and first-class.
Advantages:
• Fast traversal, because relationships are stored directly as edges.
• Spatial data can be handled easily.
Limitations:
• Wrong connections may lead to infinite loops.
Examples:
• Neo4J
• FlockDB (used by Twitter)

Figure – Graph model format of NoSQL Databases
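The graph model can be sketched with an adjacency list and a traversal; the visited set is what guards against the infinite loops mentioned above (the social-network data is hypothetical):

```python
from collections import deque

# Graph sketch: nodes are entities, edges are relationships,
# stored as an adjacency list (e.g. a "follows" graph).
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def reachable(graph, start):
    """Breadth-first traversal; the visited set prevents revisiting
    nodes, so cycles cannot cause infinite loops."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable(edges, "alice")))  # ['alice', 'bob', 'carol', 'dave']
```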

3. Describe in detail Map Reduce Algorithm

MapReduce is a programming model and processing framework designed for processing and generating large-scale datasets in a distributed computing environment. It was popularized by Google and is now widely used in various big data processing systems such as Apache Hadoop.

Here's a detailed description of the MapReduce algorithm:

Input Splitting: Initially, the input data is divided into smaller chunks or splits, typically ranging from a few kilobytes to several gigabytes. These splits are then distributed across the nodes in a distributed computing cluster.

Mapping: The Mapping phase involves processing each input split independently. This phase applies a user-defined function called the "Map" function to each record within the input split. The Map function takes the input data and generates a set of intermediate key-value pairs. These key-value pairs are typically different from the input format and are often in a more structured form suitable for subsequent processing.

Shuffling and Sorting: In this phase, the intermediate key-value pairs produced by the Map tasks are sorted and grouped by their keys. This step is crucial for preparing the data for the next phase, which involves passing the grouped data to the Reduce tasks. The shuffling and sorting process ensures that all values associated with a particular key are sent to the same Reduce task.

Reducing: The Reduce phase involves applying another user-defined function called the "Reduce" function to each group of intermediate key-value pairs produced by the shuffling and sorting phase. The Reduce function takes a key and a list of values associated with that key and performs some aggregation or computation on the values. The output of the Reduce function is typically a set of aggregated results or a transformed dataset.

Output: Finally, the output of the Reduce tasks is collected and merged to
produce the final result. This result may be written to a distributed file system,
returned to the user's application, or used as input for subsequent MapReduce
jobs.
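The phases above can be sketched in a single-process word-count program, the classic MapReduce illustration (the function names, splits and data here are illustrative; in a real cluster the splits and the Map and Reduce tasks run on different nodes):

```python
from collections import defaultdict
from itertools import chain

def map_fn(split):
    # Map phase: emit one (word, 1) pair per word in the input split.
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    # Shuffle & sort phase: sort pairs by key and group values by key,
    # so all counts for one word land in the same group.
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce phase: aggregate the list of values for one key.
    return key, sum(values)

splits = ["big data big", "data systems"]        # input splitting
mapped = chain.from_iterable(map_fn(s) for s in splits)
result = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'systems': 1}
```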

MapReduce offers several advantages for processing large-scale datasets:

Scalability: MapReduce can scale horizontally across a distributed cluster of commodity hardware, allowing it to handle datasets of virtually any size.
Fault Tolerance: MapReduce frameworks like Hadoop are designed to handle
node failures and ensure that processing continues without data loss or
interruption.
Parallelism: MapReduce exploits parallelism by distributing tasks across
multiple nodes in the cluster, enabling faster processing of large datasets.
Simplicity: The MapReduce programming model abstracts away many of the
complexities of distributed computing, making it easier for developers to write
and debug large-scale data processing applications.

Overall, MapReduce has become a foundational paradigm for distributed data processing and has been instrumental in enabling the analysis of vast amounts of data in fields such as web search, social media analytics, and scientific research.

4. Write a note on Collaborative Filtering with examples

In collaborative filtering, we find similar users and recommend what those similar users like. In this type of recommendation system, we do not use the features of an item to recommend it; instead, we classify users into clusters of similar types and recommend items to each user according to the preferences of its cluster.
There are four main types of algorithms, or techniques, for building collaborative-filtering-based recommender systems:
• Memory-Based
• Model-Based
• Hybrid
• Deep Learning
Advantages of Collaborative Filtering-Based Recommender Systems:
Content-based recommender systems have limited use cases, higher time complexity, and depend on a limited set of item content; that is not the case with collaborative filtering. One of the main advantages of these recommender systems is that they are highly efficient at providing personalized content and are also able to adapt to changing user preferences.

Measuring Similarity
A simple movie-recommendation example helps to explain this. Consider a small utility matrix of users' ratings for movies. If User 1 and User 2 give nearly identical ratings to the movies they have both seen, we can conclude that Movie 3 is also likely to be moderately liked by User 1, and that Movie 4 will be a good recommendation for User 2. We can also see users with opposite tastes, such as User 1 and User 3. Likewise, if User 3 and User 4 share a common opinion of a movie, we can infer that Movie 4 is also likely to be disliked by User 4. This is collaborative filtering: we recommend to users the items liked by users with similar interests.

Cosine Similarity
We can also use cosine similarity between users to find users with similar interests: a larger cosine implies a smaller angle between two users' rating vectors, and hence more similar interests. We apply the cosine measure to pairs of rows of the utility matrix, assigning zero to all unfilled entries to simplify the calculation. A smaller cosine means a larger distance between the users; a larger cosine means a smaller angle between them, so we can recommend them similar items.

The cosine similarity between two rating vectors A and B is:

similarity(A, B) = (A · B) / (‖A‖ × ‖B‖) = ( Σᵢ Aᵢ Bᵢ ) / ( √(Σᵢ Aᵢ²) × √(Σᵢ Bᵢ²) )
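Under the zero-filling convention described above, the similarity can be computed directly (the rating vectors below are hypothetical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors;
    unrated items are filled with 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two users' ratings for four movies (0 = not rated).
user1 = [5, 4, 0, 1]
user2 = [5, 5, 0, 0]
print(round(cosine_similarity(user1, user2), 3))  # 0.982
```

The value near 1 indicates a small angle between the two rating vectors, so these users have similar tastes and are good candidates to receive each other's liked items as recommendations.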

Rounding the Data
In collaborative filtering we can round off the data to compare it more easily; for example, we can map ratings below 3 to 0 and ratings of 3 and above to 1. Applying this rounding to the previous example makes the data much more readable: we can see that User 1 and User 2 are more similar, and that User 3 and User 4 are more alike.
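The rounding step can be sketched as follows (using the threshold of 3 from the text; the sample ratings are hypothetical):

```python
# Rounding step: ratings below the threshold become 0 (dislike),
# ratings at or above it become 1 (like), so user vectors are
# easy to compare at a glance.
def binarize(ratings, threshold=3):
    return [0 if r < threshold else 1 for r in ratings]

print(binarize([5, 2, 4, 1]))  # [1, 0, 1, 0]
```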

Normalizing Ratings
To normalize, we take a user's average rating and subtract it from each of that user's ratings, leaving positive or negative values that can easily be classified into similar groups. By normalizing the data we can form clusters of users who give similar ratings to similar items, and then use these clusters to recommend items to the users.
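A minimal sketch of this mean-centering (the ratings are hypothetical):

```python
# Mean-centering: subtract the user's average rating from each rating,
# so positive values mean "above this user's average" and negative
# values mean "below it".
def normalize(ratings):
    avg = sum(ratings) / len(ratings)
    return [round(r - avg, 2) for r in ratings]

print(normalize([5, 3, 4]))  # [1.0, -1.0, 0.0]
```

This also corrects for users who rate generously or harshly overall: after centering, two users with the same relative preferences produce similar vectors even if their raw rating scales differ.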
