BDA Assignment1 BE6 20
Big Data refers to extremely large and complex datasets that cannot be easily
managed, processed, or analyzed using traditional data processing methods. It
typically involves datasets that are too large to be handled by conventional
database systems or software tools.
Each type of Big Data presents unique challenges and opportunities for
organizations seeking to extract insights and value from their data assets.
Effective management and analysis of Big Data require a combination of
advanced technologies, analytical tools, and expertise in data science and data
engineering.
The key-value store is one of the most basic models of NoSQL databases. As the
name suggests, the data is stored in the form of key-value pairs. The key is
usually a string or an integer, but it can also be a more advanced data type.
The value is the data associated with that key. Key-value databases generally
store data as a hash table in which each key is unique. The value can be of any
type (JSON, BLOB (Binary Large Object), strings, etc.). This pattern is commonly
used in shopping websites and e-commerce applications.
Advantages:
• Can handle large amounts of data and heavy load.
• Easy retrieval of data by key.
Limitations:
• Complex queries may need to involve multiple key-value pairs, which can
slow performance.
• Data involving many-to-many relationships is hard to model and may lead to
conflicts.
Examples:
• DynamoDB
• Berkeley DB
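The hash-table idea described above can be sketched in a few lines of Python. This is only an illustration, not how DynamoDB or Berkeley DB are implemented: the class name KeyValueStore, its methods, and the cart key are hypothetical, and a plain dict stands in for the hash table.

```python
import json


class KeyValueStore:
    """Hypothetical in-memory key-value store backed by a dict (a hash table)."""

    def __init__(self):
        self._table = {}  # each key maps to exactly one value

    def put(self, key, value):
        # Writing to an existing key overwrites the old value, so keys stay unique.
        self._table[key] = value

    def get(self, key, default=None):
        # Lookup by key is a single hash-table access.
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)


store = KeyValueStore()

# The value is opaque to the store; here it is a JSON string for a shopping cart.
store.put("cart:session-42", json.dumps({"items": ["book", "pen"], "total": 17.5}))
cart = json.loads(store.get("cart:session-42"))
```

Because the store only understands keys, a question such as "which carts contain a pen" would require reading every value, which is the limitation noted above.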
In a column-oriented database, rather than storing data in relational tuples,
the data is stored in individual cells which are further grouped into columns.
Queries operate on columns, and the values of a column are stored together, so
large amounts of data can be read one column at a time. The format and names of
the columns can differ from one row to another. Every column is treated
separately, yet an individual column may itself contain multiple other columns.
In short, the column is the unit of storage in this model.
Advantages:
• Data in a column can be read quickly, without scanning entire rows.
• Aggregate queries such as SUM, AVERAGE, and COUNT can be performed easily
on columns.
Examples:
• HBase
• Bigtable by Google
• Cassandra
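To make the difference between row and column layout concrete, here is a small Python sketch. The product data and the helper to_columns are invented for illustration, and the sketch assumes every row has the same fields, which real column stores do not require.

```python
# Row-oriented layout: each record is kept together.
rows = [
    {"id": 1, "product": "pen", "price": 2.0},
    {"id": 2, "product": "book", "price": 12.5},
    {"id": 3, "product": "lamp", "price": 30.0},
]


def to_columns(records):
    """Regroup row records into one list of values per column name."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: all values of one column sit together.
columns = to_columns(rows)

# Aggregates read a single column and never touch the others.
total = sum(columns["price"])
count = len(columns["price"])
average = total / count
```

In the column layout, SUM, COUNT, and AVERAGE over price only need the price list, which is why such queries are cheap in this model.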
3. Document Database:
The document database stores and retrieves data in the form of key-value pairs,
but here the values are called documents. A document is a complex data
structure and can take the form of text, arrays, strings, JSON, XML, or any
similar format. The use of nested documents is also very common. This model is
effective because much of the data created today is in JSON form and is
semi-structured.
Advantages:
• This format is well suited to semi-structured data.
• Storing, retrieving, and managing documents is easy.
Limitations:
• Handling operations that span multiple documents is challenging.
• Aggregation operations may not work accurately.
Examples:
• MongoDB
• CouchDB
Figure: Document store model in the form of JSON documents
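A rough Python sketch of the document model follows. It does not use the MongoDB or CouchDB APIs: the sample documents and the helpers get_path and find are hypothetical, and Python dicts stand in for JSON documents.

```python
# Each document is a nested structure; documents need not share the same fields.
documents = [
    {"_id": 1, "name": "Asha", "address": {"city": "Mumbai"}, "tags": ["student"]},
    {"_id": 2, "name": "Ravi", "address": {"city": "Pune"}},
    {"_id": 3, "name": "Meera", "tags": ["faculty", "admin"]},
]


def get_path(document, dotted_path, default=None):
    """Follow a dotted path such as 'address.city' into nested dicts."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default  # the field is missing in this document
        current = current[part]
    return current


def find(collection, dotted_path, expected):
    """Return every document whose value at dotted_path equals expected."""
    return [doc for doc in collection if get_path(doc, dotted_path) == expected]


in_pune = find(documents, "address.city", "Pune")
```

The third document has no address at all, and the query simply skips it. That tolerance for missing fields is what makes the model a good fit for semi-structured data.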
4. Graph Databases:
This architecture pattern deals with the storage and management of data in
graphs. Graphs are structures that depict connections between two or more
objects in the data. The objects or entities are called nodes and are joined
together by relationships called edges. Each edge has a unique identifier, and
each node serves as a point of contact for the graph. This pattern is very
commonly used in social networks, where there are a large number of entities
and each entity has one or more characteristics that are connected by edges.
In the relational pattern, tables are only loosely connected, whereas in a
graph the relationships are stored explicitly and are tightly bound to the data.
Advantages:
• Fast traversal, because connections are stored directly.
• Spatial data can be easily handled.
Limitations:
• Wrong connections (for example, unintended cycles) may lead to infinite
loops during traversal.
Examples:
• Neo4j
• FlockDB (used by Twitter)
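The node-and-edge structure, and the risk of looping forever on a cycle, can be illustrated with a short Python sketch. The graph below is made up and the function name reachable is hypothetical; graph databases such as Neo4j use their own storage and query languages rather than this code.

```python
from collections import deque

# Adjacency list: node -> nodes it has an edge to. "A" and "C" form a cycle.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "D"],
    "D": [],
}


def reachable(graph, start):
    """Breadth-first traversal; returns nodes in the order they are first visited."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:  # skipping seen nodes makes cycles terminate
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


order = reachable(graph, "A")
```

Without the seen set, the edge from C back to A would send the traversal around the cycle indefinitely.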
Output: Finally, the output of the Reduce tasks is collected and merged to
produce the final result. This result may be written to a distributed file system,
returned to the user's application, or used as input for subsequent MapReduce
jobs.
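As a rough single-machine illustration of how reduce outputs are merged into a final result, consider the Python sketch below. The map, shuffle, and reduce steps are assumed to be the usual MapReduce phases, word counting is only an example job, and all function names are hypothetical. A real framework such as Hadoop runs these phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    reduced = [reduce_phase(key, values) for key, values in grouped.items()]
    # Output: collect and merge the reduce results into the final result.
    return dict(sorted(reduced))


result = run_job(["big data big ideas", "data pipelines"])
```

The dictionary returned by run_job plays the role of the final output, which could in turn be written to storage or passed to another job.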
Measuring Similarity
A simple example of a movie recommendation system helps to explain this. In
this scenario, User 1 and User 2 give nearly the same ratings to the movies
they have both rated, so we can conclude that Movie 3 will be liked about
averagely by User 1, while Movie 4 will be a good recommendation for User 2.
We can also see that some users have different tastes: User 1 and User 3 are
opposite to each other. User 3 and User 4 share a common interest in movies,
so on that basis we can say that Movie 4 will also be disliked by User 4. This
is collaborative filtering: we recommend to users the items that are liked by
users with similar interests.
Cosine Similarity
We can also use the cosine similarity between users to find users with similar
interests. A larger cosine implies a smaller angle between two users, and hence
more similar interests. We can compute the cosine between the rows of two users
in the utility matrix, assigning zero to all unfilled entries to make the
calculation easy. A smaller cosine means a larger distance between the users;
a larger cosine means a small angle between them, so we can recommend them
similar things.
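The calculation can be sketched in Python as below. The three rating vectors are invented for illustration and are not the ratings from the example above; unrated movies are filled with zero, as described.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user with no ratings is treated as similar to nobody
    return dot / (norm_u * norm_v)


# Hypothetical rows of a utility matrix (one entry per movie, 0 = not rated).
user_a = [5, 4, 0, 1]
user_b = [4, 5, 0, 2]
user_c = [1, 0, 5, 4]

sim_ab = cosine_similarity(user_a, user_b)  # similar tastes, larger cosine
sim_ac = cosine_similarity(user_a, user_c)  # different tastes, smaller cosine
```

One design caveat: filling blanks with zero treats "not rated" like a very low rating, which is one reason ratings are often normalized before similarity is computed.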
Normalizing Rating
In the process of normalizing, we take the average rating of a user and
subtract it from each rating that user has given. Above-average ratings become
positive and below-average ratings become negative, which makes it easier to
classify users into similar groups. By normalizing the data we can form
clusters of users who give similar ratings to similar items, and then use these
clusters to recommend items to the users.
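A minimal Python sketch of this step is shown below. It assumes the usual convention of subtracting the user's mean from each rating, so that above-average ratings come out positive. The ratings are invented, None marks a movie the user has not rated, and the function name normalize is hypothetical.

```python
def normalize(ratings):
    """Subtract the user's mean rating from each given rating (None = unrated)."""
    given = [r for r in ratings if r is not None]
    if not given:
        return list(ratings)  # nothing to normalize for a user with no ratings
    mean = sum(given) / len(given)
    return [None if r is None else r - mean for r in ratings]


# Two hypothetical users with the same preferences on different scales.
harsh = normalize([1, 2, None, 3])     # this user's mean is 2
generous = normalize([3, 4, None, 5])  # this user's mean is 4
```

After normalization both users have identical vectors, so a clustering step would place them in the same group even though their raw ratings differ.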