The document discusses NoSQL databases as an alternative to traditional SQL databases. It provides an overview of NoSQL databases, including their key features, data models, and popular examples like MongoDB and Cassandra. Some key points:
- NoSQL databases were developed to overcome limitations of SQL databases in handling large, unstructured datasets and high volumes of read/write operations.
- NoSQL databases come in various data models like key-value, column-oriented, and document-oriented. Popular examples discussed are MongoDB and Cassandra.
- MongoDB is a document database that stores data as JSON-like documents and supports flexible querying. Cassandra is a column-oriented database, originally developed at Facebook, that is highly scalable.
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT (IJCSEA Journal)
Relational database systems have been the standard storage system over the last forty years. Recently, advancements in technologies have led to an exponential increase in data volume, velocity, and variety beyond what relational databases can handle. Developers are turning to NoSQL, a non-relational database, for data storage and management. Some core features of database systems, such as ACID, have been compromised in NoSQL databases. This work proposes a hybrid database system for the storage and management of extremely voluminous data of diverse components, known as big data, such that the two models are integrated in one system to eliminate the limitations of the individual systems. The system is implemented in MongoDB, a NoSQL database, and SQL. The results obtained revealed that having these two databases in one system can enhance the storage and management of big data, bridging the gap between the relational and NoSQL storage approaches.
This document proposes a hybrid database system that integrates a NoSQL database (MongoDB) and a relational database (MySQL) to address the limitations of each individual system for big data storage and management. It discusses the properties of big data, reviews the approaches of relational and NoSQL databases, highlights their strengths and weaknesses, and then describes the proposed hybrid system that categorizes data as structured or unstructured and stores it in the appropriate database to leverage the benefits of both models. The system is designed to enhance big data storage and management by bridging the gaps between relational and NoSQL approaches.
A Study on Graph Storage Database of NoSQL (IJSCAI Journal)
This document summarizes a research paper on graph storage databases in NoSQL. It discusses big data and the need for alternative databases to handle large, diverse datasets. It defines the key aspects of big data including volume, velocity, variety and complexity. It also describes different types of NoSQL databases, focusing on the basic structure of graph databases. Graph databases use nodes and relationships to model connected data. The document compares several graph database systems and discusses advantages like performance and flexibility as well as disadvantages like complexity. It outlines several applications of graph databases in areas like social networks and logistics.
Big Data refers to huge volumes of both structured and unstructured data that are so large they are hard to process using current, traditional database tools and software technologies. The goal of Big Data Storage Management is to ensure a high level of data quality and availability for business intelligence and big data analytics applications. The graph database is not yet the most popular NoSQL database compared to the relational database, but it is a powerful NoSQL database that can handle large volumes of data very efficiently. It is very difficult to manage large volumes of data using traditional technology, and data retrieval time may grow as database size increases; NoSQL databases are available as a solution to this. This paper describes what big data storage management is, the dimensions of big data, types of data, what structured and unstructured data are, what a NoSQL database is, types of NoSQL databases, the basic structure of graph databases, their advantages, disadvantages, and application areas, and a comparison of various graph databases.
This document discusses NoSQL databases and compares MongoDB and Cassandra. It begins with an introduction to NoSQL databases and why they were created. It then describes the key features and data models of NoSQL databases including key-value, column-oriented, document, and graph databases. Specific details are provided about MongoDB and Cassandra, including their data structure, query operations, examples of usage, and enhancements. The document provides an in-depth overview of NoSQL databases and a side-by-side comparison of MongoDB and Cassandra.
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES (BIGTABLE & CA...) (IJCERT JOURNAL)
NoSQL is a database approach that provides a mechanism for the storage and retrieval of data modeled for the huge amounts of data used in big data and cloud computing. NoSQL systems are also called "Not only SQL" to emphasize that they may support SQL-like query languages. A basic classification of NoSQL is based on the data model: column, document, key-value, etc. The objective of this paper is to study and compare the implementation of various column-oriented data stores such as Bigtable and Cassandra.
Detailed slides of data resource management. The relationships among the many individual data elements stored in databases are based on one of several logical data structures, or models.
The document discusses NoSQL databases as an alternative to SQL databases that is better suited for large volumes of data where performance is critical. It explains that NoSQL databases sacrifice consistency for availability and partition tolerance. Some common types of NoSQL databases are document stores, key-value stores, column stores, and graph databases. NoSQL databases can scale out easily across multiple servers and provide features like automatic sharding and replication that help with distributing data and workload. However, NoSQL databases still lack maturity, support, and administration tools compared to SQL databases.
Analysis and evaluation of Riak KV cluster environment using Basho Bench (StevenChike)
This document analyzes and evaluates the performance of the Riak KV NoSQL database cluster using the Basho-bench benchmark tool. Experiments were conducted on a 5-node Riak KV cluster to test throughput and latency under different workloads, data sizes, and operations (read, write, update). The results found that Riak KV can handle large volumes of data and various workloads effectively with good throughput, though latency increased with larger data sizes. Overall, Riak KV is suitable for distributed big data environments where high availability, scalability and fault tolerance are important.
NoSQL is a non-relational database approach that accommodates a wide variety of data models. It is non-relational, distributed, flexible and scalable. The four main types of NoSQL databases are document databases, key-value stores, column-oriented databases, and graph databases. MongoDB is an example of a document-oriented NoSQL database. NoSQL databases offer benefits over relational databases like flexible schemas, horizontal scalability, and fast queries. Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of computers. It uses MapReduce as its parallel programming model and the Hadoop Distributed File System for storage.
The document discusses the rise of NoSQL databases. It notes that NoSQL databases are designed to run on clusters of commodity hardware, making them better suited than relational databases for large-scale data and web-scale applications. The document also discusses some of the limitations of relational databases, including the impedance mismatch between relational and in-memory data structures and their inability to easily scale across clusters. This has led many large websites and organizations handling big data to adopt NoSQL databases that are more performant and scalable.
The document provides an introduction to database management systems (DBMS). It discusses what a database is and the key components of a DBMS, including data, information, and the database management system itself. It also summarizes common database types and characteristics, as well as the purpose and advantages of using a database system compared to traditional file processing.
NoSQL databases allow for a variety of data models like key-value, document, columnar and graph formats. NoSQL stands for "not only SQL" and provides an alternative to relational databases. It is useful for large distributed datasets and prioritizes performance and scalability over rigid data consistency. Common NoSQL databases include key-value stores like Redis and Riak, document databases like MongoDB and CouchDB, wide-column stores like Cassandra and HBase, and graph databases like Neo4j and Titan.
NoSQL databases provide an alternative to traditional relational databases by allowing for flexible schemas and the ability to handle large volumes of data across several servers. The main types of NoSQL databases include document stores, key-value stores, wide-column stores, and graph databases. NoSQL databases offer advantages like horizontal scalability, high performance, and availability. However, they also present challenges around data modeling complexity, transaction support, and consistency. The choice between SQL and NoSQL depends on factors like an application's data structure and performance needs.
Challenges Management and Opportunities of Cloud DBA (Research Inventy)
Research Inventy provides an outlet for research findings and reviews in areas of Engineering and Computer Science found to be relevant for national and international development. Research Inventy is an open-access, peer-reviewed international journal whose primary objective is to provide research and applications related to Engineering, to stimulate new research ideas, and to foster practical application of research findings. The journal publishes original research of such high quality as to attract contributions from the relevant local and international communities.
This document provides an introduction to NoSQL databases. It discusses that NoSQL databases are non-relational, do not require a fixed table schema, and do not require SQL for data manipulation. It also covers characteristics of NoSQL such as not using SQL for queries, partitioning data across machines so JOINs cannot be used, and following the CAP theorem. Common classifications of NoSQL databases are also summarized such as key-value stores, document stores, and graph databases. Popular NoSQL products including Dynamo, BigTable, MongoDB, and Cassandra are also briefly mentioned.
Data Lake v Data Warehouse
Do you know the difference?
Data lakes and data warehouses are both storage systems for big data, but they have several key differences.
A data lake is designed to store raw data of all types, including structured, semi-structured, and unstructured data. It’s a great option for companies that benefit from raw data for machine learning.
A data warehouse is designed to be a repository for already structured data to be queried and analysed for very specific purposes. It’s a better fit for companies whose business analysts need to decipher analytics in a structured system.
Understanding these key differences is important for any aspiring data professional.
https://www.selectdistinct.co.uk/2024/01/02/difference-between-a-data-lake-and-a-data-warehouse/
#datawarehouse #datalake #dataanalytics
This document discusses multidimensional databases and provides comparisons to relational databases. It describes how multidimensional databases are optimized for data warehousing and online analytical processing (OLAP) applications. Key aspects covered include dimensional modeling using star and snowflake schemas, data storage in cubes with dimensions and members, and performance benefits of multidimensional databases for interactive analysis of large datasets to support decision making.
Big data refers to massive amounts of structured and unstructured data that is difficult to process using traditional databases due to its volume, velocity and variety. NoSQL databases provide an alternative for storing and analyzing big data by allowing flexible, schema-less models and scaling horizontally. While NoSQL databases offer benefits like flexibility and scalability, they also present challenges including lack of maturity compared to SQL databases and difficulties with analytics, administration and expertise.
IBM Data Analytics
Module 2: Overview of Data Repositories
A data repository is a general term used to refer to data that has been
collected, organized, and isolated so that it can be used for business
operations or mined for reporting and data analysis.
It can be a small or large database infrastructure with one or more
databases that collect, manage, and store data sets.
In this video, we will provide an overview of the different types of repositories your data might reside in, whether for use in business operations or mined for reporting and data analysis, such as:
1. Databases,
2. Data warehouses, and
3. Big data stores.
Databases.
Let’s begin with databases.
A database is a collection of data, or information, designed for the
input, storage, search and retrieval, and modification of data.
And a Database Management System, or DBMS, is a set of programs
that creates and maintains the database. It allows you to store,
modify, and extract information from the database using a
function called querying.
For example, if you want to find customers who have been inactive
for six months or more, using the query function, the database
management system will retrieve data of all customers from the
database that have been inactive for six months and more.
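As a rough illustration of that query, here is a minimal Python sketch using the built-in sqlite3 module; the customers table, its columns, and the database file are hypothetical stand-ins for whatever the real database defines:

    import sqlite3

    # Hypothetical customers table with a last_active date column.
    conn = sqlite3.connect("company.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "customer_id INTEGER PRIMARY KEY, name TEXT, last_active DATE)"
    )

    # The querying function: ask the DBMS for every customer who has
    # been inactive for six months or more.
    rows = conn.execute(
        "SELECT customer_id, name FROM customers "
        "WHERE last_active <= date('now', '-6 months')"
    ).fetchall()
    print(rows)
    conn.close()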
Even though a database and a DBMS mean different things, the terms are often used interchangeably.
There are different types of databases.
Several factors influence the choice of database, such as the
• Data type and structure,
• Querying mechanisms,
• Latency requirements,
• Transaction speeds, and
• Intended use of the data.
It’s important to mention two main types of databases here:
1. Relational databases, and 2. Non-relational databases.
Relational databases
• Relational databases, also referred to as RDBMSes, build on the organizational principles of flat files, with data organized into a tabular format with rows and columns following a well-defined structure and schema.
• However, unlike flat files, RDBMSes are optimized for data
operations and querying involving many tables and much larger
data volumes.
• Structured Query Language, or SQL, is the standard querying
language for relational databases.
Non-relational databases
Then we have non-relational databases, also known as NoSQL, or “Not
Only SQL”.
• Non-relational databases emerged in response to the volume,
diversity, and speed at which data is being generated today, mainly
influenced by advances in cloud computing, the Internet of Things,
and social media proliferation.
• Built for speed, flexibility, and scale, non-relational databases made it possible to store data in a schema-less or free-form fashion.
• NoSQL is widely used for processing big data.
Data Warehouse.
A data warehouse works as a central repository that merges
information coming from disparate sources and consolidates it
through the extract, transform, and load process, also known as the
ETL process, into one comprehensive database for analytics and
business intelligence.
At a very high-level, the ETL process helps you to
• extract data from different data sources,
• transform the data into a clean and
usable state, and
• load the data into the enterprise’s data
repository.
Related to Data Warehouses are the concepts of Data Marts and Data
Lakes, which we will cover later. Data
Marts and Data Warehouses have historically been relational, since
much of the traditional enterprise data has resided in RDBMSes.
However, with the emergence of NoSQL technologies and new
sources of data, non-relational data repositories are also now being
used for Data Warehousing.
Big Data Stores
Another category of data repositories are Big Data Stores, that include
distributed computational and storage infrastructure to store, scale,
and process very large data sets.
Summary
Overall, data repositories help to isolate data and make reporting and
analytics more efficient and credible while also serving as a data
archive.
RDBMS (Relational Database Management Systems)
A relational database is a collection of data organized into a table
structure, where the tables can be linked, or related, based on data
common to each. Tables are made of rows and columns, where rows
are the “records”, and the columns the “attributes”.
Let’s take the example of a customer table that maintains data about
each customer in a company. The columns, or attributes, in the
customer table are the Company ID, Company Name, Company
Address, and Company Primary Phone; and Each row is a customer
record.
Now let’s understand what we mean by tables being linked, or related,
based on data common to each.
Along with the customer table, the company also maintains
transaction tables that contain data describing multiple individual
transactions pertaining to each customer. The columns for the
transaction table might include the Transaction Date, Customer
ID, Transaction Amount, and Payment Method. The customer table
and the transaction tables can be related based on the common
Customer ID field. You can query the customer table to produce
reports such as a customer statement that consolidates all
transactions in a given period.
This capability of relating tables based on common data enables you
to retrieve an entirely new table from data in one or more tables with
a single query.
It also allows you to understand the relationships among all available
data and gain new insights for making better decisions.
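Here is a small, self-contained Python sketch of that single-query capability, using the built-in sqlite3 module; the tables mirror the customer and transaction example above, with all names and values invented for illustration (the transaction table is called txn because TRANSACTION is a reserved word in SQL):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            company_name TEXT
        );
        CREATE TABLE txn (
            txn_date DATE,
            customer_id INTEGER REFERENCES customer(customer_id),
            amount REAL
        );
        INSERT INTO customer VALUES (1, 'Acme Inc');
        INSERT INTO txn VALUES
            ('2024-01-05', 1, 250.0),
            ('2024-02-10', 1, 75.5);
    """)

    # One query relates the two tables on the common customer_id and
    # consolidates all transactions in a given period: a customer statement.
    statement = conn.execute("""
        SELECT c.company_name, COUNT(*) AS num_txns, SUM(t.amount) AS total
        FROM customer c
        JOIN txn t ON t.customer_id = c.customer_id
        WHERE t.txn_date BETWEEN '2024-01-01' AND '2024-03-31'
        GROUP BY c.customer_id
    """).fetchall()
    print(statement)  # [('Acme Inc', 2, 325.5)]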
Relational databases build on the organizational principles of flat files
such as spreadsheets, with data organized into rows and columns
following a well-defined structure and schema.
But this is where the similarity ends.
• Relational databases, by design, are ideal for the optimized storage, retrieval, and processing of large volumes of data, unlike spreadsheets, which have a limited number of rows and columns.
• Each table in a relational database has a unique set of rows and columns, and relationships can be defined between tables, which minimizes data redundancy.
• Moreover, you can restrict database fields to specific data types and values, which minimizes irregularities and leads to greater consistency and data integrity.
• Relational databases use SQL for querying data, which gives you the
advantage of processing millions of records and retrieving large
amounts of data in a matter of seconds.
• Moreover, the security architecture of relational databases
provides controlled access to data and also ensures that the
standards and policies for governing data can be enforced.
Relational databases range from small desktop systems to massive
cloud-based systems. They can be either:
• open-source and internally supported,
• open-source with commercial support, or
• commercial closed-source systems.
IBM DB2, Microsoft SQL Server, MySQL, Oracle Database, and
PostgreSQL are some of the popular relational databases.
Cloud-based relational databases, also referred to as Database-as-a-
Service, are gaining wide use as they have access to the limitless
compute and storage capabilities offered by the cloud.
Some of the popular cloud relational databases include Amazon
Relational Database Service (RDS), Google Cloud SQL, IBM DB2 on
Cloud, Oracle Cloud, and SQL Azure.
RDBMS is a mature and well-documented technology, making it easy
to learn and find qualified talent.
One of the most significant advantages of the relational database
approach is
• its ability to create meaningful information by joining tables.
Some of its other advantages include:
• Flexibility: Using SQL, you can add new columns, add new tables,
rename relations, and make other changes while the database is
running and queries are happening.
• Reduced redundancy: Relational databases minimize data
redundancy. For example, the information of a customer
appears in a single entry in the customer table, and the
transaction table pertaining to the customer stores a link to the
customer table.
• Ease of backup and disaster recovery: Relational databases offer
easy export and import options, making backup and
restore easy. Exports can happen while the database is running,
making restore on failure easy.
Cloud-based relational databases do continuous mirroring, which
means the loss of data on restore can be measured in seconds or less.
• ACID-compliance: ACID stands for Atomicity, Consistency,
Isolation, and Durability. And ACID compliance implies that the
data in the database remains accurate and consistent despite
failures, and database transactions are processed reliably.
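As a minimal sketch of atomicity, the following Python snippet uses the built-in sqlite3 module; the accounts table and the transfer amount are hypothetical. Either both updates commit together, or neither does:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

    try:
        # The connection as a context manager wraps a transaction:
        # commit if the block succeeds, roll back if it raises.
        with conn:
            conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
            conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    except sqlite3.Error:
        pass  # after a rollback, the data is exactly as it was before

    print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())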
Now we’ll look at some use cases for relational databases:
• Online Transaction Processing: OLTP applications are focused
on transaction-oriented tasks that run at high rates.
Relational databases are well suited for OLTP applications
because
• they can accommodate a large number of users;
• they support the ability to insert, update, or delete
small amounts of data; and
• they also support frequent queries and updates as well
as fast response times.
• Data warehouses: In a data warehousing environment, relational
databases can be optimized for online analytical processing (or
OLAP), where historical data is analyzed for business intelligence.
• IoT solutions: Internet of Things (IoT) solutions require speed as
well as the ability to collect and process data from edge devices,
which need a lightweight database solution.
This brings us to the limitations of RDBMS:
• RDBMS does not work well with semi-structured and unstructured
data and is, therefore, not suitable for extensive analytics on such
data.
• For migration between two RDBMSes, schemas and data types need to be identical between the source and destination tables.
• Relational databases have a limit on the length of data fields, which
means if you try to enter more information into a field than it can
accommodate, the information will not be stored.
Despite the limitations and the evolution of data in these times of big
data, cloud computing, IoT devices, and social media, RDBMS
continues to be the predominant technology for working with
structured data.
NoSQL
NoSQL, which stands for “not only SQL,” or sometimes “non-SQL” is a
non-relational database design that provides flexible schemas for the
storage and retrieval of data. NoSQL databases have existed for many
years but have only recently become more popular in the era of cloud,
big data, and high-volume web and mobile applications. They are
chosen today for their attributes around scale, performance, and ease
of use.
It's important to emphasize that the "No" in "NoSQL" is an
abbreviation for "not only" and not the actual word "No."
NoSQL databases are built for specific data models and have flexible
schemas that allow programmers to create and manage modern
applications. They do not use a traditional row/column/table database design with fixed schemas, and typically do not use the structured query language (or SQL) to query data, although some may support SQL or SQL-like interfaces.
What is a NoSQL database?
NoSQL (not only SQL), or non-SQL, is a non-relational database design that provides flexible schemas for the storage and retrieval of data.
• Gained greater popularity due to the emergence of cloud computing, big data, and high-volume web and mobile applications
• Chosen for their attributes around scale, performance, and ease of use
• Built for specific data models
• Has flexible schemas that allow programmers to create and manage modern applications
• Does not use a traditional row/column/table database design with fixed schemas
• Does not, typically, use the structured query language (or SQL) to query data
NoSQL allows data to be stored in a schema-less or free-form fashion. Any data, be it structured, semi-structured, or unstructured, can be stored in any record.
Based on the model being used for storing data, there are four
common types of NoSQL databases.
1. Key-value store,
2. Document-based,
3. Column-based, and
4. Graph-based.
Key-value store.
• Data in a key-value database is stored as a collection of key-value
pairs.
• The key represents an attribute of the data and is a unique
identifier.
• Both keys and values can be anything from simple integers or
strings to complex JSON documents.
• Key-value stores are great for storing user session data and user
preferences, making real-time recommendations and targeted
advertising, and in-memory data caching.
However, a key-value store may not be the best fit if you want to:
• query the data on a specific data value,
• create relationships between data values, or
• have multiple unique keys.
Key-value store tools: Redis, Memcached, and DynamoDB are some well-known examples in this category.
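As a brief, hedged sketch of the key-value model, here is how storing session data might look with the redis-py client against a hypothetical local Redis server; the key name and session fields are invented:

    import json
    import redis  # third-party client: pip install redis

    # Assumes a Redis server listening on localhost:6379.
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # The key is the unique identifier; the value can be a simple string
    # or, as here, a serialized JSON document.
    r.set("session:user:1001", json.dumps({"theme": "dark", "cart_items": 3}))
    r.expire("session:user:1001", 1800)  # session data often carries a TTL

    session = json.loads(r.get("session:user:1001"))
    print(session["theme"])  # lookups go by key only, not by value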
Document-based:
• Document databases store each record and its associated data
within a single document.
• They enable flexible indexing, powerful ad hoc queries, and
analytics over collections of documents.
• Document databases are preferable for eCommerce platforms,
medical records storage, CRM platforms, and analytics platforms.
However, a document-based database may not be the best option for you if you want to:
• run complex search queries, or
• perform multi-operation transactions.
MongoDB, DocumentDB, CouchDB, and Cloudant are some of the popular document-based databases.
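A short sketch of the document model, assuming a local MongoDB instance reachable with the pymongo client; the database, collection, and field names are hypothetical:

    from pymongo import MongoClient  # pip install pymongo

    client = MongoClient("mongodb://localhost:27017/")
    products = client["shop"]["products"]

    # Each record and its associated data live in one document.
    products.insert_one({
        "name": "laptop",
        "price": 999,
        "specs": {"ram_gb": 16, "storage": "512GB SSD"},  # nested data is fine
        "tags": ["electronics", "computers"],
    })

    # Flexible ad hoc queries, including over nested fields.
    for doc in products.find({"specs.ram_gb": {"$gte": 16}}):
        print(doc["name"], doc["price"])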
Column-based:
• Column-based models store data in cells grouped as columns of
data instead of rows.
• A logical grouping of columns, that is, columns that are usually
accessed together, is called a column family. For example, a
customer’s name and profile information will most likely be
accessed together but not their purchase history. So, customer
name and profile information data can be grouped into a column
family.
• Since column databases store all cells corresponding to a column as
a continuous disk entry, accessing and searching the data becomes
very fast.
• Column databases can be great for systems that require heavy
write requests, storing time-series data, weather data, and IoT
data.
However, a column-based database may not be the best option for you if you need to:
• use complex queries, or
• change your querying patterns frequently.
The most popular column databases are Cassandra and HBase.
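A small sketch of a write-heavy, time-series workload on a column family, assuming a local Cassandra node and the Python cassandra-driver; the keyspace and table are invented for illustration:

    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS iot
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS iot.readings (
            sensor_id text,
            reading_time timestamp,
            temperature double,
            PRIMARY KEY (sensor_id, reading_time)
        )
    """)

    # Heavy write workloads like sensor readings are the sweet spot.
    session.execute(
        "INSERT INTO iot.readings (sensor_id, reading_time, temperature) "
        "VALUES (%s, toTimestamp(now()), %s)",
        ("sensor-42", 21.5),
    )
    cluster.shutdown()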
Graph-based:
• Graph-based databases use a graphical model to represent and
store data.
• They are particularly useful for visualizing, analyzing, and finding
connections between different pieces of data.
The circles are nodes, and they contain the data. The arrows represent
relationships. Graph databases are an excellent choice for working
with connected data, which is data that contains lots of
interconnected relationships.
Graph databases are great for social networks, real-time product
recommendations, network diagrams, fraud detection, and access
management.
However, if you want to process high volumes of transactions, a graph database may not be the best choice for you, because graph databases are not optimized for large-volume analytics queries.
Neo4j and Cosmos DB are some of the more popular graph databases.
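A minimal sketch of nodes and relationships, assuming a local Neo4j server and the official Python driver; the credentials, labels, and names are placeholders:

    from neo4j import GraphDatabase  # pip install neo4j

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        # Nodes hold the data; the relationship is the arrow between them.
        session.run(
            "MERGE (a:Person {name: $a}) "
            "MERGE (b:Person {name: $b}) "
            "MERGE (a)-[:FOLLOWS]->(b)",
            a="Alice", b="Bob",
        )
        # Connected-data queries walk relationships directly.
        for record in session.run(
            "MATCH (p:Person)-[:FOLLOWS]->(f:Person) "
            "RETURN p.name AS who, f.name AS follows"
        ):
            print(record["who"], "follows", record["follows"])

    driver.close()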
Advantages of NoSQL
NoSQL was created in response to the limitations of traditional
relational database technology.
• The primary advantage of NoSQL is its ability to handle large
volumes of structured, semi-structured, and unstructured data.
Some of its other advantages include:
• The ability to run as distributed systems scaled across multiple data
centres, which enables them to take advantage of cloud computing
infrastructure;
• An efficient and cost-effective scale-out architecture that provides
additional capacity and performance with the addition of new
nodes;
• and Simpler design, better control over availability, and improved
scalability that enables you to be more agile, more flexible, and to
iterate more quickly.
To summarize the key differences between relational and non-relational databases:
Relational databases
• RDBMS schemas rigidly define how all data inserted into the database must be typed and composed.
• Maintaining high-end, commercial relational database management systems can be expensive.
• Support ACID compliance, which ensures the reliability of transactions and crash recovery.
• A mature and well-documented technology, which means the risks are more or less perceivable.
Non-relational databases
• NoSQL databases can be schema-agnostic, allowing unstructured and semi-structured data to be stored and manipulated.
• Specifically designed for low-cost commodity hardware.
• Most NoSQL databases are not ACID compliant.
• A relatively newer technology.
Data Marts, Data Lakes, ETL, and Data
Pipelines
Earlier in the course, we examined databases, data warehouses, and
big data stores.
Now we’ll go a little deeper in our exploration of data warehouses,
data marts, and data lakes; and also learn about the ETL process and
data pipelines.
Data Warehouses.
A data warehouse works like a multi-purpose storage for different use
cases. By the time the data comes into the warehouse, it has already
been modelled and structured for a specific purpose, meaning it is
analysis ready. As an organization, you would opt for a data
warehouse when you have massive amounts of data from your
operational systems that needs to be readily available for reporting
and analysis.
Data warehouses serve as the single source of truth—storing current
and historical data that has been cleansed, conformed, and
categorized.
A data warehouse is a multi-purpose enabler of operational and
performance analytics.
Data Marts.
A data mart is a sub-section of the data warehouse, built specifically
for a particular business function, purpose, or community of
users. The idea is to provide stakeholders data that is most relevant to
them, when they need it. For example, the sales or finance teams
accessing data for their quarterly reporting and projections.
• Since a data mart offers analytical capabilities for a restricted area
of the data warehouse,
• it offers isolated security and isolated performance.
• The most important role of a data mart is business-specific
reporting and analytics.
Data Lakes
A Data Lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data in its native format, classified and tagged with metadata. So, while a data warehouse stores data processed for a specific need, a data lake is a pool of raw data where each data element is given a unique identifier and is tagged with metatags for further use.
• You would opt for a data lake if you generate, or have access to, large volumes of data on an ongoing basis, but don’t want to be restricted to specific or pre-defined use cases.
• Unlike data warehouses, a data lake would retain all source data, without any exclusions, and the data could include all types and sources of data. Data lakes are sometimes also used as a staging area of a data warehouse.
• The most important role of a data lake is in predictive and advanced
analytics.
Now we come to the process that is at the heart of gaining value
from data—the Extract, Transform, and Load process, or ETL.
ETL is how raw data is converted into analysis-ready data. It is an
automated process in which you
• gather raw data from identified sources,
• extract the information that aligns with your reporting and analysis
needs,
• clean, standardize, and transform that data into a format that is
usable in the context of your organization;
• and load it into a data repository.
While ETL is a generic process, the actual job can be very different in
usage, utility, and complexity.
Extract is the step where data from source locations is collected for
transformation.
Data extraction could be through:
• Batch processing, meaning source data is moved in large chunks from the source to the target system at scheduled intervals.
• Tools for batch processing include Stitch and Blendo.
• Stream processing, which means source data is pulled in real-time
from the source and transformed while it is in transit and before it
is loaded into the data repository.
• Tools for stream processing include Apache Samza, Apache Storm,
and Apache Kafka.
Transform involves the execution of rules and functions that converts
raw data into data that can be used for analysis.
For example,
• making date formats and units of measurement consistent across
all sourced data,
• removing duplicate data,
• filtering out data that you do not need,
• enriching data, for example, splitting a full name into first, middle, and last names,
• establishing key relationships across tables,
• applying business rules and data validations.
Load is the step where processed data is transported to a destination
system or data repository.
• It could be: Initial loading, that is, populating all the data in the
repository,
• Incremental loading, that is, applying ongoing updates and
modifications as needed periodically;
• Full refresh, that is, erasing contents of one or more tables and
reloading with fresh data.
Load verification is an important part of this process step and includes checks for:
• missing or null values,
• server performance, and
• monitoring load failures.
It is vital to keep an eye on load failures and ensure the right recovery
mechanisms are in place.
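To tie the steps together, here is a toy ETL sketch in Python with pandas; the raw records, column names, and transformation rules are hypothetical stand-ins for the steps described above:

    import pandas as pd  # pip install pandas (2.x assumed for format="mixed")

    # Extract: in a real job this comes from source systems; a small
    # in-memory frame stands in for the raw data here.
    raw = pd.DataFrame({
        "full_name": ["Ada Lovelace", "Ada Lovelace", "Alan Turing"],
        "order_date": ["01/02/2024", "01/02/2024", "2024-02-03"],
        "amount": ["100", "100", "250"],
    })

    # Transform: remove duplicates, make formats consistent, enrich fields.
    df = raw.drop_duplicates().copy()
    df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")
    df["amount"] = pd.to_numeric(df["amount"])
    df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

    # Load verification: check for missing or null values before loading.
    assert not df.isnull().any().any(), "null values found; fix before load"

    # Load: write the analysis-ready data to the target repository.
    df.to_csv("orders_clean.csv", index=False)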
ETL has historically been used for batch workloads on a large scale. However, with the emergence of streaming ETL tools, it is increasingly being used for real-time streaming event data as well.
Data Pipeline
It’s common to see the terms ETL and data pipelines used
interchangeably. And although both move data from source to
destination,
data pipeline is a broader term that
• encompasses the entire journey of moving data from one system
to another, of which ETL is a subset.
• Data pipelines can be architected for batch processing, for
streaming data, and a combination of batch and streaming data.
In the case of streaming data, data processing or transformation,
happens in a continuous flow. This is particularly useful for data that
needs constant updating, such as data from a sensor monitoring
traffic. A data pipeline is a high performing system that
• supports both long-running batch queries and smaller interactive
queries.
• The destination for a data pipeline is typically a data lake, although
the data may also be loaded to different target destinations, such
as another application or a visualization tool.
• There are a number of data pipeline solutions available, most
popular among them being Apache Beam and Dataflow.
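As a hedged sketch of a simple batch pipeline using Apache Beam (named above), where the file paths and the cleaning step are hypothetical; the same pipeline shape can be pointed at streaming sources or run on a managed runner such as Dataflow:

    import apache_beam as beam  # pip install apache-beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("events.txt")            # source
            | "Clean" >> beam.Map(lambda line: line.strip().lower())  # transform
            | "Write" >> beam.io.WriteToText("events_clean")          # destination
        )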
Foundations of Big Data
Big Data
In this digital world, everyone leaves a trace. From our travel habits to our workouts and entertainment, the increasing number of internet-connected devices that we interact with on a daily basis record vast amounts of data about us. There's even a name for it: Big Data.
Ernst and Young offers the following definition: “Big data refers to the dynamic, large, and disparate volumes of data being created by people, tools, and machines. It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to drive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.” (Ernst and Young)
There is no one definition of big data, but there are certain elements that are common across the different definitions, such as velocity, volume, variety, veracity, and value. These are the V's of big data.
Velocity
Velocity is the speed at which data accumulates. Data is being
generated extremely fast in a process that never stops. Near or real-
time streaming, local, and cloud-based technologies can process
information very quickly.
Volume
Volume is the scale of the data or the increase in the amount of data
stored. Drivers of volume are the increase in data sources, higher
resolution sensors, and scalable infrastructure.
Variety
Variety is the diversity of the data. Structured data fits neatly into rows and columns in relational databases, while unstructured data is not organized in a predefined way, like tweets, blog posts, pictures, numbers, and video. Variety also reflects that data comes from different sources: machines, people, and processes, both internal and external to organizations. Drivers are mobile technologies, social media, wearable technologies, geo technologies, video, and many, many more.
Veracity
Veracity is the quality and origin of data and its conformity to facts and accuracy. Attributes include consistency, completeness, integrity, and ambiguity. Drivers include cost and the need for traceability. With the large amount of data available, the debate rages on about the accuracy of data in the digital age. Is the information real, or is it false?
Value
Value is our ability and need to turn data into value. Value isn't just
profit. It may have medical or social benefits, as well as customer,
employee or personal satisfaction. The main reason that people invest
time to understand big data is to derive value from it.
Let's look at some examples of the V's in action.
Velocity.
Velocity. Every 60 seconds, hours of footage are uploaded to YouTube,
which is generating data. Think about how quickly data
accumulates over hours, days, and years.
Volume.
Volume. The world population is approximately 7 billion people and
the vast majority are now using digital devices. Mobile phones,
desktop and laptop computers, wearable devices, and so on. These
devices all generate, capture, and store data: approximately 2.5 quintillion bytes every day. That's the equivalent of 10 million Blu-ray DVDs.
Variety.
Variety. Let's think about the different types of data. Text, pictures,
film, sound, health data from wearable devices, and many different
types of data from devices connected to the internet of things.
Veracity
Veracity. Eighty percent of data is considered to be unstructured and
we must devise ways to produce reliable and accurate insights. The
data must be categorized, analyzed, and visualized.
Data Scientists
Data scientists today derive insights from big data and cope with the challenges that these massive data sets present. The scale of the data being collected means that it's not feasible to use conventional data analysis tools; however, alternative tools that leverage distributed computing power can overcome this problem. Tools such as Apache Spark, Hadoop, and its ecosystem provide ways to extract, load, analyze, and process the data across distributed compute resources, providing new insights and knowledge.
This gives organizations more ways to connect with their customers
and enrich the services they offer. So next time you strap on your
smartwatch, unlock your smartphone, or track your workout,
remember your data is starting a journey that might take it all the way
around the world, through big data analysis and back to you.
Big Data Processing Tools
The Big Data processing technologies provide ways to work with large
sets of structured, semi-structured, and unstructured data so that
value can be derived from big data.
In some of the other videos, we discussed Big Data technologies such
as
1. NoSQL databases and 2. Data Lakes.
In this video, we are going to talk about three open-source technologies and the role they play in big data analytics:
1. Apache Hadoop, 2. Apache Hive, and 3. Apache Spark.
Apache Hadoop
Hadoop is a collection of tools that provides distributed storage and
processing of big data.
Apache Hive
Hive is a data warehouse for data query and analysis built on top of
Hadoop.
Apache Spark.
Spark is a distributed data analytics framework designed to perform
complex data analytics in real-time.
Hadoop
Hadoop, a Java-based open-source framework, allows distributed storage and processing of large datasets across clusters of computers. In a Hadoop distributed system, a node is a single computer, and a collection of nodes forms a cluster. Hadoop can scale up from a
single node to any number of nodes, each offering local storage and
computation. Hadoop provides a reliable, scalable, and cost-effective
solution for storing data with no format requirements.
Using Hadoop, you can:
• Incorporate emerging data formats, such as streaming audio, video,
social media sentiment, and clickstream data, along with structured,
semi-structured, and unstructured data not traditionally used in a
data warehouse.
• Provide real-time, self-service access for all stakeholders.
• Optimize and streamline costs in your enterprise data warehouse
by consolidating data across the organization and moving “cold”
data, that is, data that is not in frequent use, to a Hadoop-based
system.
One of the four main components of Hadoop is the Hadoop Distributed
File System, or HDFS, a storage system for big data that runs on
multiple commodity machines connected through a network.
❖ HDFS provides scalable and reliable big data storage by partitioning
files over multiple nodes.
❖ It splits large files across multiple computers, allowing parallel
access to them. Computations can, therefore, run in parallel on
each node where data is stored.
❖ It also replicates file blocks on different nodes to prevent data loss,
making it fault-tolerant.
Let’s understand this through an example. Consider a file that
includes phone numbers for everyone in the United States; the
numbers for people whose last names start with A might be stored on
server 1, B on server 2, and so on.
With Hadoop, pieces of this phonebook would be stored across the
cluster. To reconstruct the entire phonebook, your program would
need the blocks from every server in the cluster.
HDFS also replicates these smaller pieces onto two additional servers
by default, ensuring availability when a server fails. In addition to
higher availability, this offers multiple benefits. It allows the Hadoop
cluster to break up work into smaller chunks and run those jobs on all
servers in the cluster for better scalability. Finally, you gain the benefit
of data locality, which is the process of moving the computation closer
to the node on which the data resides. This is critical when working
with large data sets because it minimizes network congestion and
increases throughput.
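To make the partitioning and replication ideas concrete, here is a toy
Python sketch; it is not HDFS code, and the node names, block size,
and record set are invented purely for illustration.

# Toy illustration of HDFS-style block partitioning and replication.
# This is not HDFS itself; node names and block size are made up.
RECORDS = [f"person_{i}" for i in range(10)]   # stand-in for a file
NODES = ["node1", "node2", "node3", "node4", "node5"]
BLOCK_SIZE = 3      # records per block (HDFS uses bytes, e.g. 128 MB)
REPLICATION = 3     # default HDFS replication factor: three copies

# Partition: split the "file" into fixed-size blocks.
blocks = [RECORDS[i:i + BLOCK_SIZE]
          for i in range(0, len(RECORDS), BLOCK_SIZE)]

# Replicate: place each block on REPLICATION distinct nodes
# (simple round-robin placement for the demo).
placement = {}
for b, block in enumerate(blocks):
    placement[b] = [NODES[(b + r) % len(NODES)]
                    for r in range(REPLICATION)]

for b, nodes in placement.items():
    print(f"block {b} -> {nodes} : {blocks[b]}")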
Some of the other benefits that come from using HDFS include:
❖ Fast recovery from hardware failures, because HDFS is built to
detect faults and automatically recover.
❖ Access to streaming data, because HDFS supports high data
throughput rates.
❖ Accommodation of large data sets, because HDFS can scale to
hundreds of nodes, or computers, in a single cluster.
❖ Portability, because HDFS is portable across multiple hardware
platforms and compatible with a variety of underlying operating
systems.
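As a hedged sketch of how an application might read a file that HDFS
has spread across a cluster, the snippet below uses PySpark; the
namenode host, port, and file path are placeholders, not real
endpoints.

# A minimal sketch of reading HDFS-resident data with PySpark;
# "namenode", port 9000, and the file path are placeholder values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadFromHDFS").getOrCreate()

# Spark asks the HDFS namenode where each block lives and schedules
# tasks near those blocks (data locality).
lines = spark.read.text("hdfs://namenode:9000/data/phonebook.txt")
print(lines.count())   # touches every block, in parallel
spark.stop()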
Hive
Hive is open-source data warehouse software for reading, writing,
and managing large dataset files that are stored either in HDFS or in
other data storage systems such as Apache HBase.
Hadoop is intended for long sequential scans and, because Hive is
based on Hadoop, queries have very high latency—which means Hive
is less appropriate for applications that need very fast response times.
❖ Hive is not suitable for transaction processing that typically involves
a high percentage of write operations.
❖ Hive is better suited for data warehousing tasks such as ETL,
reporting, and data analysis and includes tools that enable easy
access to data via SQL.
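One hedged way to illustrate SQL access to Hive-managed data is
through PySpark with Hive support enabled, as sketched below; the
sales table and its region and amount columns are assumptions made
for the example.

# A minimal sketch, assuming a Hive metastore is configured and a
# hypothetical Hive table "sales" with columns region and amount.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveReporting")
         .enableHiveSupport()     # lets Spark query Hive tables via SQL
         .getOrCreate())

# A typical warehousing query: batch reporting, not fast OLTP lookups.
report = spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
""")
report.show()
spark.stop()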
Apache Spark
This brings us to Spark, a general-purpose data processing engine
designed to extract and process large volumes of data for a wide range
of applications,
including
❖ Interactive Analytics,
❖ Streams Processing,
❖ Machine Learning,
❖ Data Integration, and
❖ ETL.
Key attributes:
❖ It takes advantage of in-memory processing to significantly increase
the speed of computations, spilling to disk only when memory is
constrained.
❖ Spark has interfaces for major programming languages, including
Java, Scala, Python, R, and SQL.
❖ It can run using its standalone clustering technology as well as on
top of other infrastructures such as Hadoop.
❖ It can access data in a large variety of data sources, including HDFS
and Hive, making it highly versatile.
❖ The ability to process streaming data quickly and perform complex
analytics in real time is the key use case for Apache Spark.
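A minimal sketch of that use case, using Spark Structured Streaming's
built-in rate source, which synthesizes timestamped rows for testing;
the window length and row rate are arbitrary demo values.

# A Structured Streaming sketch using Spark's built-in "rate" source,
# which generates timestamped rows; the 5-second window and
# 10 rows/second rate are arbitrary demo values.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("StreamingDemo").getOrCreate()

# Source: a synthetic stream of (timestamp, value) rows.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Continuous analytics: count events per 5-second window.
counts = events.groupBy(window(events.timestamp, "5 seconds")).count()

# Sink: print each updated result table to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()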
Summary and Highlights
In this lesson, you have learned the following information:
A Data Repository is a general term that refers to data that has been
collected, organized, and isolated so that it can be used for reporting,
analytics, and archival purposes.
The different types of Data Repositories include:
• Databases, which can be relational or non-relational,
each differing in the organizational principles they follow, the
types of data they can store, and the tools that can be used to
query, organize, and retrieve data.
• Data Warehouses, that consolidate incoming data into one
comprehensive storehouse.
• Data Marts, that are essentially sub-sections of a data
warehouse, built to isolate data for a particular
business function or use case.
• Data Lakes, that serve as storage repositories for large
amounts of structured, semi-structured, and unstructured data
in their native format.
• Big Data Stores, that provide distributed computational and
storage infrastructure to store, scale, and process very large data
sets.
ETL, or Extract, Transform, and Load, is an automated process that
converts raw data into analysis-ready data (a toy sketch follows the
list below) by:
• Extracting data from source locations.
• Transforming raw data by cleaning, enriching, standardizing, and
validating it.
• Loading the processed data into a destination system or data
repository.
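Following the list above, here is a toy Python sketch of the three
steps; the source file raw_users.csv, its name and email fields, and
the SQLite destination warehouse.db are all assumptions made for
illustration.

# Toy ETL sketch: extract from a hypothetical "raw_users.csv",
# transform (clean/standardize/validate), and load into SQLite.
import csv
import sqlite3

# Extract: read raw records from the source file.
with open("raw_users.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: clean, standardize, and validate.
clean_rows = []
for row in raw_rows:
    email = row["email"].strip().lower()   # standardize
    if "@" not in email:                   # validate
        continue                           # drop invalid records
    clean_rows.append((row["name"].strip().title(), email))

# Load: write the processed data into the destination repository.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", clean_rows)
conn.commit()
conn.close()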
Data Pipeline, sometimes used interchangeably with
ETL, encompasses the entire journey of moving data from the source
to a destination data lake or application, using the ETL process.
Big Data refers to the vast amounts of data being produced every
moment of every day by people, tools, and machines. The sheer
velocity, volume, and variety of data challenge the tools and systems
used for conventional data. These challenges led to the emergence of
processing tools and platforms designed specifically for Big Data, such
as Apache Hadoop, Apache Hive, and Apache Spark.
Practice Quiz
Question 1: Structured Query Language, or SQL, is the standard
querying language for what type of data repository?
Answer: RDBMS.
SQL is the standard querying language for RDBMSs.
Question 2: In use cases for RDBMS, what is one of the reasons that
relational databases are so well suited for OLTP applications?
Answer: They support the ability to insert, update, or delete small
amounts of data.
This is one of the abilities of RDBMSs that makes them very well suited
for OLTP applications.
Question 3: Which NoSQL database type stores each record and its
associated data within a single document and also works well
with analytics platforms?
Answer: Document-based.
Document-based NoSQL databases store each record and its
associated data within a single document and work well with analytics
platforms.
Question 4: What type of data repository is used to isolate a subset of
data for a particular business function, purpose, or community of
users?
Answer: Data Mart.
A data mart is a sub-section of the data warehouse used to isolate a
subset of data for a particular business function, purpose, or
community of users.
Question 5: What does the attribute “Velocity” imply in the context
of Big Data?
Answer: The speed at which data accumulates.
Velocity, in the context of Big Data, is the speed at which data
accumulates.
Question 6: Which of the Big Data processing tools provides
distributed storage and processing of Big Data?
Answer: Hadoop.
Hadoop, a Java-based open-source framework, allows distributed
storage and processing of large datasets across clusters of computers.
Graded Quiz
Question 1: Data Marts and Data Warehouses have typically been
relational, but the emergence of what technology has enabled them to
be used for non-relational data as well?
Answer: NoSQL.
The emergence of NoSQL technology has made it possible for data
marts and data warehouses to be used for both relational and non-
relational data.
Question 2: What is one of the most significant advantages of an
RDBMS?
Answer: It is ACID-compliant.
ACID compliance is one of the most significant advantages of an RDBMS.
Question 3: Which one of the NoSQL database types uses a graphical
model to represent and store data, and is particularly useful for
visualizing, analyzing, and finding connections between different
pieces of data?
Answer: Graph-based.
Graph-based NoSQL databases use a graphical model to represent and
store data and are used for visualizing, analyzing, and finding
connections between different pieces of data.
Question 4: Which of the data repositories serves as a pool of raw
data and stores large amounts of structured, semi-structured, and
unstructured data in their native formats?
Answer: Data Lakes.
A Data Lake can store large amounts of structured, semi-structured,
and unstructured data in their native format, classified and tagged
with metadata.
Question 5: What does the attribute “Veracity” imply in the context
of Big Data?
Answer: Accuracy and conformity of data to facts.
Veracity, in the context of Big Data, refers to the accuracy and
conformity of data to facts.
Question 6: Apache Spark is a general-purpose data
processing engine designed to extract and process Big Data for a
wide range of applications. What is one of its key use cases?
Answer: Performing complex analytics in real time.
Spark is a general-purpose data processing engine used for performing
complex data analytics in real time.