Module 2
Overview of Data Repositories
A data repository is a general term used to refer to data that has been
collected, organized, and isolated so that it can be used for business
operations or mined for reporting and data analysis.
It can be a small or large database infrastructure with one or more
databases that collect, manage, and store data sets.
In this video, we will provide an overview of the different types of
repositories your data might reside in, such as:
1. Databases,
2. Data warehouses, and
3. Big data stores.
Databases.
Let’s begin with databases.
A database is a collection of data, or information, designed for the
input, storage, search and retrieval, and modification of data.
And a Database Management System, or DBMS, is a set of programs
that creates and maintains the database. It allows you to store,
modify, and extract information from the database using a
function called querying.
For example, if you want to find customers who have been inactive
for six months or more, then, using the query function, the database
management system will retrieve records for all customers in the
database who have been inactive for six months or more.
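As a minimal sketch of what such a query might look like in practice, here
is Python’s built-in sqlite3 module running SQL against a hypothetical
customer database (the file, table, and column names are illustrative
assumptions):

```python
import sqlite3

# Connect to a hypothetical customer database.
conn = sqlite3.connect("company.db")

# Ask the DBMS for every customer whose last activity is at least
# six months old. Table and column names are assumptions.
rows = conn.execute(
    """
    SELECT customer_id, customer_name, last_active
    FROM customers
    WHERE last_active <= date('now', '-6 months')
    """
).fetchall()

for customer_id, customer_name, last_active in rows:
    print(customer_id, customer_name, last_active)

conn.close()
```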
Even though a database and a DBMS mean different things, the terms
are often used interchangeably.
There are different types of databases.
Several factors influence the choice of database, such as:
• Data type and structure,
• Querying mechanisms,
• Latency requirements,
• Transaction speeds, and
• Intended use of the data.
It’s important to mention two main types of databases here:
1. Relational databases, and
2. Non-relational databases.
Relational databases
• Relational databases, also referred to as RDBMSes, build on the
organizational principles of flat files, with data organized into a
tabular format with rows and columns following a well-defined
structure and schema.
• However, unlike flat files, RDBMSes are optimized for data
operations and querying involving many tables and much larger
data volumes.
• Structured Query Language, or SQL, is the standard querying
language for relational databases.
Non-relational databases
Then we have non-relational databases, also known as NoSQL, or “Not
Only SQL”.
• Non-relational databases emerged in response to the volume,
diversity, and speed at which data is being generated today, mainly
influenced by advances in cloud computing, the Internet of Things,
and social media proliferation.
• Built for speed, flexibility, and scale,
• non-relational databases made it possible to store data in a
schema-less or free-form fashion.
• NoSQL is widely used for processing big data.
Data Warehouse.
A data warehouse works as a central repository that merges
information coming from disparate sources and consolidates it
through the extract, transform, and load process, also known as the
ETL process, into one comprehensive database for analytics and
business intelligence.
At a very high level, the ETL process helps you to:
• extract data from different data sources,
• transform the data into a clean and usable state, and
• load the data into the enterprise’s data repository.
Related to Data Warehouses are the concepts of Data Marts and Data
Lakes, which we will cover later. Data
Marts and Data Warehouses have historically been relational, since
much of the traditional enterprise data has resided in RDBMSes.
However, with the emergence of NoSQL technologies and new
sources of data, non-relational data repositories are also now being
used for Data Warehousing.
Big Data Stores
Another category of data repositories is Big Data Stores, which include
distributed computational and storage infrastructure to store, scale,
and process very large data sets.
Summary
Overall, data repositories help to isolate data and make reporting and
analytics more efficient and credible while also serving as a data
archive.
RDBMS (Relational Database Management Systems)
A relational database is a collection of data organized into a table
structure, where the tables can be linked, or related, based on data
common to each. Tables are made of rows and columns, where rows
are the “records”, and the columns the “attributes”.
Let’s take the example of a customer table that maintains data about
each customer in a company. The columns, or attributes, in the
customer table are the Company ID, Company Name, Company
Address, and Company Primary Phone; and each row is a customer
record.
Now let’s understand what we mean by tables being linked, or related,
based on data common to each.
Along with the customer table, the company also maintains
transaction tables that contain data describing multiple individual
transactions pertaining to each customer. The columns for the
transaction table might include the Transaction Date, Customer
ID, Transaction Amount, and Payment Method. The customer table
and the transaction tables can be related based on the common
Customer ID field. You can query the customer table to produce
reports such as a customer statement that consolidates all
transactions in a given period.
This capability of relating tables based on common data enables you
to retrieve an entirely new table from data in one or more tables with
a single query.
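As a minimal sketch of this in practice, here is the customer and
transaction example using Python’s built-in sqlite3 module (the schemas
and sample rows are illustrative assumptions based on the columns
named above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Minimal versions of the two tables described above.
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        company_name TEXT,
        company_address TEXT,
        primary_phone TEXT
    );
    CREATE TABLE transactions (
        transaction_date TEXT,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL,
        payment_method TEXT
    );
    """
)
conn.execute("INSERT INTO customers VALUES (1, 'Acme Co', '12 Main St', '555-0100')")
conn.execute("INSERT INTO transactions VALUES ('2021-03-01', 1, 250.0, 'credit')")
conn.execute("INSERT INTO transactions VALUES ('2021-03-15', 1, 90.5, 'debit')")

# One query joins the two tables on the common Customer ID field to
# produce a consolidated customer statement for a given period.
statement = conn.execute(
    """
    SELECT c.company_name, t.transaction_date, t.amount, t.payment_method
    FROM customers AS c
    JOIN transactions AS t ON t.customer_id = c.customer_id
    WHERE t.transaction_date BETWEEN '2021-03-01' AND '2021-03-31'
    """
).fetchall()

for row in statement:
    print(row)
```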
It also allows you to understand the relationships among all available
data and gain new insights for making better decisions.
Relational databases build on the organizational principles of flat files
such as spreadsheets, with data organized into rows and columns
following a well-defined structure and schema.
But this is where the similarity ends.
• Relational databases, by design, are ideal for the optimized storage,
retrieval, and processing of large volumes of data, unlike
spreadsheets, which have a limited number of rows and columns.
• Each table in a relational database has a unique set of rows and
columns, and relationships can be defined between tables, which
minimizes data redundancy.
• Moreover, you can restrict database fields to specific data types and
values, which minimizes irregularities and leads to greater
consistency and data integrity.
• Relational databases use SQL for querying data, which gives you the
advantage of processing millions of records and retrieving large
amounts of data in a matter of seconds.
• Moreover, the security architecture of relational databases
provides controlled access to data and also ensures that the
standards and policies for governing data can be enforced.
Relational databases range from small desktop systems to massive
cloud-based systems. They can be either:
• open-source and internally supported,
• open-source with commercial support, or
• commercial closed-source systems.
IBM DB2, Microsoft SQL Server, MySQL, Oracle Database, and
PostgreSQL are some of the popular relational databases.
Cloud-based relational databases, also referred to as Database-as-a-
Service, are gaining wide use as they have access to the limitless
compute and storage capabilities offered by the cloud.
Some of the popular cloud relational databases include Amazon
Relational Database Service (RDS), Google Cloud SQL, IBM DB2 on
Cloud, Oracle Cloud, and SQL Azure.
RDBMS is a mature and well-documented technology, making it easy
to learn and find qualified talent.
One of the most significant advantages of the relational database
approach is its ability to create meaningful information by joining
tables.
Some of its other advantages include:
• Flexibility: Using SQL, you can add new columns, add new tables,
rename relations, and make other changes while the database is
running and queries are happening.
• Reduced redundancy: Relational databases minimize data
redundancy. For example, the information of a customer
appears in a single entry in the customer table, and the
transaction table pertaining to the customer stores a link to the
customer table.
• Ease of backup and disaster recovery: Relational databases offer
easy export and import options, making backup and restore
straightforward. Exports can happen while the database is
running, making recovery on failure easier.
Cloud-based relational databases do continuous mirroring, which
means the loss of data on restore can be measured in seconds or less.
• ACID-compliance: ACID stands for Atomicity, Consistency,
Isolation, and Durability. And ACID compliance implies that the
data in the database remains accurate and consistent despite
failures, and database transactions are processed reliably.
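To illustrate atomicity, the “A” in ACID, here is a small sketch using
Python’s sqlite3 module (the account-transfer scenario and table are
assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

# Atomicity: both updates succeed together, or neither is applied.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on failure, the database is left exactly as it was

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
```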
Now we’ll look at some use cases for relational databases:
• Online Transaction Processing: OLTP applications are focused
on transaction-oriented tasks that run at high rates.
Relational databases are well suited for OLTP applications
because
• they can accommodate a large number of users;
• they support the ability to insert, update, or delete
small amounts of data; and
• they also support frequent queries and updates as well
as fast response times.
• Data warehouses: In a data warehousing environment, relational
databases can be optimized for online analytical processing (or
OLAP), where historical data is analyzed for business intelligence.
• IoT solutions: Internet of Things (IoT) solutions require speed as
well as the ability to collect and process data from edge devices,
which need a lightweight database solution.
This brings us to the limitations of RDBMS:
• RDBMS does not work well with semi-structured and unstructured
data and is, therefore, not suitable for extensive analytics on such
data.
• For migration between two RDBMSs, schemas and data types need
to be identical between the source and destination tables.
• Relational databases have a limit on the length of data fields, which
means if you try to enter more information into a field than it can
accommodate, the information will not be stored.
Despite the limitations and the evolution of data in these times of big
data, cloud computing, IoT devices, and social media, RDBMS
continues to be the predominant technology for working with
structured data.
NoSQL
NoSQL, which stands for “not only SQL,” or sometimes “non-SQL,” is a
non-relational database design that provides flexible schemas for the
storage and retrieval of data. NoSQL databases have existed for many
years but have only recently become more popular in the era of cloud,
big data, and high-volume web and mobile applications. They are
chosen today for their attributes around scale, performance, and ease
of use.
It's important to emphasize that the "No" in "NoSQL" is an
abbreviation for "not only" and not the actual word "No."
NoSQL databases are built for specific data models and have flexible
schemas that allow programmers to create and manage modern
applications. They do not use a traditional row/column/table
database design with fixed schemas, and they typically do not use the
structured query language (or SQL) to query data, although some may
support SQL or SQL-like interfaces.
NoSQL allows data to be stored in a schema-less or free-form fashion.
Any data, be it structured, semi-structured, or unstructured, can be
stored in any record.
Based on the model being used for storing data, there are four
common types of NoSQL databases.
1. Key-value store,
2. Document-based,
3. Column-based, and
4. Graph-based.
Key-value store.
• Data in a key-value database is stored as a collection of key-value
pairs.
• The key represents an attribute of the data and is a unique
identifier.
• Both keys and values can be anything from simple integers or
strings to complex JSON documents.
• Key-value stores are great for storing user session data and user
preferences, making real-time recommendations and targeted
advertising, and in-memory data caching.
However, a key-value store may not be the best fit if you:
• want to be able to query the data on specific data values,
• need relationships between data values, or
• need to have multiple unique keys.
Key-Value Store tools:
Redis, Memcached, and DynamoDB are some well-known examples
in this category.
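To illustrate the key-value pattern itself, here is a small sketch in which
a plain Python dict stands in for a key-value store such as Redis (the
session data shown is illustrative):

```python
import json

# A plain dict standing in for a key-value store: every lookup is by key.
store: dict[str, str] = {}

# Keys are unique identifiers; values can be anything, here JSON documents.
store["session:user42"] = json.dumps({"cart": ["sku-1", "sku-9"], "theme": "dark"})
store["prefs:user42"] = json.dumps({"language": "en", "currency": "USD"})

# Retrieval is a single key lookup -- fast, but there is no way to query
# by a value ("all users with theme=dark") without scanning everything.
session = json.loads(store["session:user42"])
print(session["cart"])
```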
Document-based:
• Document databases store each record and its associated data
within a single document.
• They enable flexible indexing, powerful ad hoc queries, and
analytics over collections of documents.
• Document databases are preferable for eCommerce platforms,
medical records storage, CRM platforms, and analytics platforms.
However, a document-based database may not be the best option if
you’re looking to:
• run complex search queries, or
• perform multi-operation transactions.
MongoDB, DocumentDB, CouchDB, and Cloudant are some of the
popular document-based databases.
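To illustrate the document model, here is a small sketch in which an
in-memory list of Python dicts stands in for a collection in a document
store such as MongoDB (the records are illustrative):

```python
# Each record and its associated data live together in one document.
orders = [
    {"order_id": 1, "customer": {"name": "Acme Co", "city": "Austin"},
     "items": [{"sku": "sku-1", "qty": 2}], "total": 40.0},
    {"order_id": 2, "customer": {"name": "Bolt Ltd", "city": "Denver"},
     "items": [{"sku": "sku-9", "qty": 1}], "total": 15.0},
]

# An ad hoc query can filter on any field, including nested ones,
# without a fixed schema defined up front.
austin_orders = [o for o in orders if o["customer"]["city"] == "Austin"]
print(austin_orders)
```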
Column-based:
• Column-based models store data in cells grouped as columns of
data instead of rows.
• A logical grouping of columns, that is, columns that are usually
accessed together, is called a column family. For example, a
customer’s name and profile information will most likely be
accessed together but not their purchase history. So, customer
name and profile information data can be grouped into a column
family.
• Since column databases store all cells corresponding to a column as
a continuous disk entry, accessing and searching the data becomes
very fast.
• Column databases can be great for systems that require heavy
write requests, storing time-series data, weather data, and IoT
data.
However, a column database may not be the best option if you need
to:
• use complex queries, or
• change your querying patterns frequently.
The most popular column databases are Cassandra and HBase.
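To illustrate why column-oriented layout speeds up such workloads, here
is a small sketch contrasting row and column layouts in plain Python (the
sensor readings are illustrative):

```python
# Row-oriented layout: each record keeps all its fields together.
rows = [
    {"ts": "2021-01-01T00:00", "sensor": "s1", "temp": 21.5},
    {"ts": "2021-01-01T00:05", "sensor": "s1", "temp": 21.7},
    {"ts": "2021-01-01T00:10", "sensor": "s1", "temp": 21.6},
]

# Column-oriented layout: all cells of one column are stored contiguously,
# so scanning or aggregating a single column touches far less data.
columns = {
    "ts": ["2021-01-01T00:00", "2021-01-01T00:05", "2021-01-01T00:10"],
    "sensor": ["s1", "s1", "s1"],
    "temp": [21.5, 21.7, 21.6],
}

# Averaging one column never reads the others.
print(sum(columns["temp"]) / len(columns["temp"]))
```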
Graph-based:
• Graph-based databases use a graphical model to represent and
store data.
• They are particularly useful for visualizing, analyzing, and finding
connections between different pieces of data.
In a graph visualization, the circles are nodes, and they contain the
data; the arrows represent relationships. Graph databases are an
excellent choice for working with connected data, which is data that
contains lots of interconnected relationships.
Graph databases are great for social networks, real-time product
recommendations, network diagrams, fraud detection, and access
management.
However, a graph database may not be the best choice if you want to
process high volumes of transactions, because graph databases are
not optimized for large-volume analytics queries.
Neo4J and Cosmos DB are some of the more popular graph databases.
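To illustrate how connected data is traversed, here is a small sketch in
which a Python adjacency list stands in for a graph database, with a
breadth-first search finding the shortest chain of relationships (the
social network is illustrative):

```python
from collections import deque

# An adjacency list standing in for a graph: nodes hold data,
# edges hold relationships ("follows", in this toy social network).
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def connection_path(start, goal):
    """Breadth-first search for the shortest chain of relationships."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in follows.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(connection_path("alice", "dave"))  # ['alice', 'bob', 'dave']
```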
Advantage of NoSQL
NoSQL was created in response to the limitations of traditional
relational database technology.
• The primary advantage of NoSQL is its ability to handle large
volumes of structured, semi-structured, and unstructured data.
Some of its other advantages include:
• The ability to run as distributed systems scaled across multiple data
centres, which enables them to take advantage of cloud computing
infrastructure;
• An efficient and cost-effective scale-out architecture that provides
additional capacity and performance with the addition of new
nodes;
• Simpler design, better control over availability, and improved
scalability, which enables you to be more agile, more flexible, and
to iterate more quickly.
To summarize the key differences between relational and non-
relational databases:
Relational databases:
• RDBMS schemas rigidly define how all data inserted into the
database must be typed and composed.
• Maintaining high-end, commercial relational database
management systems can be expensive.
• Support ACID-compliance, which ensures reliability of
transactions and crash recovery.
• A mature and well-documented technology, which means the
risks are more or less perceivable.
Non-relational databases:
• NoSQL databases can be schema-agnostic, allowing unstructured
and semi-structured data to be stored and manipulated.
• Specifically designed for low-cost commodity hardware.
• Most NoSQL databases are not ACID compliant.
• A relatively newer technology.
Data Marts, Data Lakes, ETL, and Data
Pipelines
Earlier in the course, we examined databases, data warehouses, and
big data stores.
Now we’ll go a little deeper in our exploration of data warehouses,
data marts, and data lakes; and also learn about the ETL process and
data pipelines.
Data Warehouses.
A data warehouse works like a multi-purpose storage for different use
cases. By the time the data comes into the warehouse, it has already
been modelled and structured for a specific purpose, meaning it is
analysis ready. As an organization, you would opt for a data
warehouse when you have massive amounts of data from your
operational systems that needs to be readily available for reporting
and analysis.
Data warehouses serve as the single source of truth—storing current
and historical data that has been cleansed, conformed, and
categorized.
A data warehouse is a multi-purpose enabler of operational and
performance analytics.
Data Marts.
A data mart is a sub-section of the data warehouse, built specifically
for a particular business function, purpose, or community of
users. The idea is to provide stakeholders with the data that is most
relevant to them, when they need it; for example, the sales or finance
teams accessing data for their quarterly reporting and projections.
• Since a data mart offers analytical capabilities for a restricted area
of the data warehouse, it offers isolated security and isolated
performance.
• The most important role of a data mart is business-specific
reporting and analytics.
Data Lakes
A Data Lake is a storage repository that can store large amounts of
structured, semi-structured, and unstructured data in their native
format, classified and tagged with metadata. So, while a data
warehouse stores data processed for a specific need, a data lake is a
pool of raw data where each data element is given a unique identifier
and is tagged with metatags for further use.
• You would opt for a data lake if you generate, or have access to,
large volumes of data on an ongoing basis, but don’t want to be
restricted to specific or pre-defined use cases.
• Unlike data warehouses, a data lake retains all source data, without
any exclusions, and the data could include all types of data sources
and types.
Data lakes are sometimes also used as a staging area of a data
warehouse.
• The most important role of a data lake is in predictive and advanced
analytics.
Now we come to the process that is at the heart of gaining value
from data—the Extract, Transform, and Load process, or ETL.
ETL is how raw data is converted into analysis-ready data. It is an
automated process in which you
• gather raw data from identified sources,
• extract the information that aligns with your reporting and analysis
needs,
• clean, standardize, and transform that data into a format that is
usable in the context of your organization;
• and load it into a data repository.
While ETL is a generic process, the actual job can be very different in
usage, utility, and complexity.
Extract is the step where data from source locations is collected for
transformation.
Data extraction could be through:
• Batch processing, meaning source data is moved in large chunks
from the source to the target system at scheduled intervals.
• Tools for batch processing include Stitch and Blendo.
• Stream processing, which means source data is pulled in real-time
from the source and transformed while it is in transit and before it
is loaded into the data repository.
• Tools for stream processing include Apache Samza, Apache Storm,
and Apache Kafka.
Transform involves the execution of rules and functions that convert
raw data into data that can be used for analysis.
For example,
• making date formats and units of measurement consistent across
all sourced data,
• removing duplicate data,
• filtering out data that you do not need,
• enriching data, for example, splitting a full name into first, middle,
and last names,
• establishing key relationships across tables,
• applying business rules and data validations.
Load is the step where processed data is transported to a destination
system or data repository.
Loading could be:
• Initial loading, that is, populating all of the data in the repository;
• Incremental loading, that is, applying ongoing updates and
modifications periodically, as needed; or
• Full refresh, that is, erasing the contents of one or more tables and
reloading them with fresh data.
Load verification, an important part of this process step, includes
checks for:
• missing or null values,
• server performance, and
• load failures.
It is vital to keep an eye on load failures and ensure the right recovery
mechanisms are in place.
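As a minimal end-to-end sketch of these ETL steps in Python (the CSV
layout, file names, table, and transformation rules are all illustrative
assumptions):

```python
import csv
import sqlite3

# --- Extract: gather raw data from an identified source (a CSV file). ---
with open("raw_customers.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))  # e.g. columns: full_name, signup_date

# --- Transform: clean, standardize, and enrich the raw records. ---
seen = set()
clean_rows = []
for row in raw_rows:
    name = row["full_name"].strip()
    if not name or name in seen:          # filter empties, remove duplicates
        continue
    seen.add(name)
    first, _, last = name.partition(" ")  # enrich: split the full name
    # standardize dates from MM/DD/YYYY to ISO YYYY-MM-DD
    month, day, year = row["signup_date"].split("/")
    clean_rows.append((first, last, f"{year}-{month:0>2}-{day:0>2}"))

# --- Load: move the processed data into the target repository. ---
conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS customers (first TEXT, last TEXT, signup TEXT)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", clean_rows)
conn.commit()

# Load verification: a simple check for missing values after loading.
nulls = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE first IS NULL OR signup IS NULL"
).fetchone()[0]
print(f"rows loaded: {len(clean_rows)}, rows with nulls: {nulls}")
```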
ETL has historically been used for batch workloads on a large
scale. However, with the emergence of streaming ETL tools, they are
increasingly being used for real-time streaming event data as well.
Data Pipeline
It’s common to see the terms ETL and data pipelines used
interchangeably. And although both move data from source to
destination,
a data pipeline is a broader term that encompasses the entire journey
of moving data from one system to another, of which ETL is a subset.
• Data pipelines can be architected for batch processing, for
streaming data, or for a combination of batch and streaming data.
In the case of streaming data, data processing or transformation,
happens in a continuous flow. This is particularly useful for data that
needs constant updating, such as data from a sensor monitoring
traffic. A data pipeline is a high-performing system that supports
both long-running batch queries and smaller interactive queries.
• The destination for a data pipeline is typically a data lake, although
the data may also be loaded to different target destinations, such
as another application or a visualization tool.
• There are a number of data pipeline solutions available, most
popular among them being Apache Beam and Dataflow.
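As a minimal sketch of a batch pipeline using Apache Beam’s Python SDK
(the sample records and transforms are illustrative; the same pipeline
code can be run on different backends, including Dataflow):

```python
import apache_beam as beam

# A tiny batch pipeline: read -> transform -> write, with the ETL work
# as the transform stages in the middle of the journey.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Extract" >> beam.Create(["alice,100", "bob,250", "alice,50"])
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "ToKeyValue" >> beam.Map(lambda kv: (kv[0], int(kv[1])))
        | "SumPerUser" >> beam.CombinePerKey(sum)
        | "Load" >> beam.Map(print)
    )
```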
Foundations of Big Data
Big Data
In this digital world, everyone leaves a trace. From our travel habits to
our workouts and entertainment, the increasing number of internet
connected devices that we interact with on a daily basis record vast
amounts of data about us. There’s even a name for it: Big Data.
Ernst and Young offers the following definition:
“Big data refers to the dynamic, large, and disparate volumes of data
being created by people, tools, and machines. It requires new,
innovative and scalable technology to collect, host, and analytically
process the vast amount of data gathered in order to drive real-time
business insights that relate to consumers, risk, profit, performance,
productivity management, and enhanced shareholder value.”
There is no one definition of big data, but there are certain elements
that are common across the different definitions, such as velocity,
volume, variety, veracity, and value. These are the V’s of big data.
Velocity
Velocity is the speed at which data accumulates. Data is being
generated extremely fast in a process that never stops. Near or real-
time streaming, local, and cloud-based technologies can process
information very quickly.
Volume
Volume is the scale of the data or the increase in the amount of data
stored. Drivers of volume are the increase in data sources, higher
resolution sensors, and scalable infrastructure.
Variety
Variety is the diversity of the data. Structured data fits neatly into
rows and columns in relational databases, while unstructured data is
not organized in a predefined way like tweets, blog posts, pictures,
numbers, and video. Variety also reflects that data comes from
different sources; machines, people, and processes, both internal and
external to organizations. Drivers are mobile technologies
social media, wearable technologies, geo technologies video, and
many, many more. Veracity is the quality and origin of data and its
conformity to facts and accuracy. Attributes include consistency,
completeness, integrity, and ambiguity. Drivers include cost and the
need for traceability. With the large amount of data available, the
debate rages on about the accuracy of data in the digital age. Is the
information real or is it false?
Value
Value is our ability and need to turn data into value. Value isn't just
profit. It may have medical or social benefits, as well as customer,
employee or personal satisfaction. The main reason that people invest
time to understand big data is to derive value from it.
Let's look at some examples of the V's in action.
Velocity.
Every 60 seconds, hours of footage are uploaded to YouTube,
which is generating data. Think about how quickly data
accumulates over hours, days, and years.
Volume.
The world population is approximately 7 billion people, and the vast
majority are now using digital devices: mobile phones, desktop and
laptop computers, wearable devices, and so on. These devices all
generate, capture, and store data, approximately 2.5 quintillion bytes
every day. That’s the equivalent of 10 million Blu-ray DVDs.
Variety.
Let’s think about the different types of data: text, pictures,
film, sound, health data from wearable devices, and many different
types of data from devices connected to the internet of things.
Veracity.
Eighty percent of data is considered to be unstructured, and
we must devise ways to produce reliable and accurate insights. The
data must be categorized, analyzed, and visualized.
Data Scientists
Data scientists, today, derive insights from big data and cope with
the challenges that these massive data sets present. The scale of the
data being collected means that it's not feasible to use conventional
data analysis tools. However, alternative tools that
leverage distributed computing power can overcome this
problem. Tools such as Apache Spark, Hadoop, and its
ecosystem provide ways to extract, load, analyze, and process
data across distributed compute resources, providing new
insights and knowledge.
This gives organizations more ways to connect with their customers
and enrich the services they offer. So next time you strap on your
smartwatch, unlock your smartphone, or track your workout,
remember your data is starting a journey that might take it all the way
around the world, through big data analysis and back to you.
Big Data Processing Tools
The Big Data processing technologies provide ways to work with large
sets of structured, semi-structured, and unstructured data so that
value can be derived from big data.
In some of the other videos, we discussed Big Data technologies such
as
1. NoSQL databases and 2. Data Lakes.
In this video, we are going to talk about three open-source
technologies and the role they play in big data analytics:
1. Apache Hadoop, 2. Apache Hive, and 3. Apache Spark.
Apache Hadoop
Hadoop is a collection of tools that provides distributed storage and
processing of big data.
Apache Hive
Hive is a data warehouse for data query and analysis built on top of
Hadoop.
Apache Spark.
Spark is a distributed data analytics framework designed to perform
complex data analytics in real-time.
Hadoop
Hadoop, a Java-based open-source framework, allows distributed
storage and processing of large datasets across clusters of
computers. In a Hadoop distributed system, a node is a single computer,
and a collection of nodes forms a cluster. Hadoop can scale up from a
single node to any number of nodes, each offering local storage and
computation. Hadoop provides a reliable, scalable, and cost-effective
solution for storing data with no format requirements.
Using Hadoop, you can:
• Incorporate emerging data formats, such
as streaming audio, video, social media sentiment, and clickstream
data, along with structured, semi-structured, and unstructured
data not traditionally used in a data warehouse.
• Provide real-time, self-service access for all stakeholders.
• Optimize and streamline costs in your enterprise data warehouse
by consolidating data across the organization and moving “cold”
data, that is, data that is not in frequent use, to a Hadoop-based
system.
One of the four main components of Hadoop is the Hadoop Distributed
File System, or HDFS, which is a storage system for big data that runs
on multiple commodity hardware servers connected through a network.
❖ HDFS provides scalable and reliable big data storage by partitioning
files over multiple nodes.
❖ It splits large files across multiple computers, allowing parallel
access to them. Computations can, therefore, run in parallel on
each node where data is stored.
❖ It also replicates file blocks on different nodes to prevent data loss,
making it fault-tolerant.
Let’s understand this through an example. Consider a file that
includes phone numbers for everyone in the United States; the
numbers for people with last name starting with A might be stored on
server 1, B on server 2, and so on.
With Hadoop, pieces of this phonebook would be stored across the
cluster. To reconstruct the entire phonebook, your program would
need the blocks from every server in the cluster.
HDFS also replicates these smaller pieces onto two additional servers
by default, ensuring availability when a server fails. In addition to
higher availability, this offers multiple benefits. It allows the Hadoop
cluster to break up work into smaller chunks and run those jobs on all
servers in the cluster for better scalability. Finally, you gain the benefit
of data locality, which is the process of moving the computation closer
to the node on which the data resides. This is critical when working
with large data sets because it minimizes network congestion and
increases throughput.
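As a toy Python sketch of this idea (not HDFS itself): partition a
“phonebook” into blocks, replicate each block on extra “servers”, and
reconstruct the whole file even after one server fails:

```python
# Toy model of HDFS-style partitioning and replication (illustrative only).
phonebook = [f"Person{i}: 555-{i:04d}" for i in range(12)]
BLOCK_SIZE = 4     # records per block
REPLICAS = 3       # HDFS keeps 3 copies of each block by default
NUM_SERVERS = 6

# Split the file into blocks.
blocks = [phonebook[i:i + BLOCK_SIZE]
          for i in range(0, len(phonebook), BLOCK_SIZE)]

# Place each block on REPLICAS different servers.
servers = {s: [] for s in range(NUM_SERVERS)}
placement = {}
for b, block in enumerate(blocks):
    homes = [(b + r) % NUM_SERVERS for r in range(REPLICAS)]
    placement[b] = homes
    for s in homes:
        servers[s].append((b, block))

# Reconstructing the file needs one live copy of every block, so the
# loss of any single server does not lose data.
failed = 2
recovered = []
for b in range(len(blocks)):
    home = next(s for s in placement[b] if s != failed)
    recovered.extend(dict(servers[home])[b])

assert recovered == phonebook
print("file reconstructed despite failure of server", failed)
```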
Some of the other benefits that come from using HDFS include:
❖ Fast recovery from hardware failures, because HDFS is built to
detect faults and automatically recover.
❖ Access to streaming data, because HDFS supports high data
throughput rates.
❖ Accommodation of large data sets, because HDFS can scale to
hundreds of nodes, or computers, in a single cluster.
❖ Portability, because HDFS is portable across multiple hardware
platforms and compatible with a variety of underlying operating
systems.
Hive
Hive is an open-source data warehouse software for reading, writing,
and managing large data set files that are stored directly in either
HDFS or other data storage systems such as Apache HBase.
Hadoop is intended for long sequential scans and, because Hive is
based on Hadoop, queries have very high latency—which means Hive
is less appropriate for applications that need very fast response times.
❖ Hive is not suitable for transaction processing that typically involves
a high percentage of write operations.
❖ Hive is better suited for data warehousing tasks such as ETL,
reporting, and data analysis and includes tools that enable easy
access to data via SQL.
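As a hedged sketch of how a client might issue a SQL-like query to Hive
from Python, assuming the third-party PyHive package and a reachable
HiveServer2 endpoint (host, port, and table names are illustrative):

```python
from pyhive import hive  # third-party client; assumed available

# Connect to a hypothetical HiveServer2 instance.
conn = hive.connect(host="hive.example.com", port=10000)
cursor = conn.cursor()

# Hive queries look like SQL but run as batch jobs over files in HDFS,
# so expect high latency -- suited to ETL and reporting, not OLTP.
cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)
```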
Apache Spark
This brings us to Spark, a general-purpose data processing engine
designed to extract and process large volumes of data for a wide range
of applications,
including
❖ Interactive Analytics,
❖ Streams Processing,
❖ Machine Learning,
❖ Data Integration, and
❖ ETL.
Key attributes:
❖ It takes advantage of in-memory processing to significantly increase
the speed of computations, spilling to disk only when memory
is constrained.
❖ Spark has interfaces for major programming languages, including
Java, Scala, Python, R, and SQL.
❖ It can run using its standalone clustering technology as well as on
top of other infrastructures such as Hadoop. And
❖ it can access data in a large variety of data sources, including HDFS
and Hive, making it highly versatile.
❖ The ability to process streaming data fast and perform complex
analytics in real-time is the key use case for Apache Spark.
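As a minimal PySpark sketch of this kind of analytics (assumes the
pyspark package is installed; the sample data is illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; the same code scales out to a cluster.
spark = SparkSession.builder.appName("sketch").getOrCreate()

# In-memory sample data standing in for a large distributed dataset.
df = spark.createDataFrame(
    [("s1", 21.5), ("s1", 21.7), ("s2", 19.2)],
    ["sensor", "temp"],
)

# Transformations are lazy and run in parallel across the cluster;
# in-memory processing keeps iterative analytics fast.
df.groupBy("sensor").agg(F.avg("temp").alias("avg_temp")).show()

spark.stop()
```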
Summary and Highlights
In this lesson, you have learned the following information:
A Data Repository is a general term that refers to data that has been
collected, organized, and isolated so that it can be used for reporting,
analytics, and also for archival purposes.
The different types of Data Repositories include:
• Databases, which can be relational or non-relational, each
following a set of organizational principles and differing in the
types of data they can store and the tools that can be used to
query, organize, and retrieve data.
• Data Warehouses, that consolidate incoming data into one
comprehensive storehouse.
• Data Marts, that are essentially sub-sections of a data
warehouse, built to isolate data for a particular
business function or use case.
• Data Lakes, that serve as storage repositories for large
amounts of structured, semi-structured, and unstructured data
in their native format.
• Big Data Stores, that provide distributed computational and
storage infrastructure to store, scale, and process very large data
sets.
ETL, or the Extract, Transform, and Load process, is an automated
process that converts raw data into analysis-ready data by:
• Extracting data from source locations.
• Transforming raw data by cleaning, enriching, standardizing, and
validating it.
• Loading the processed data into a destination system or data
repository.
Data Pipeline, sometimes used interchangeably with
ETL, encompasses the entire journey of moving data from the source
to a destination data lake or application, using the ETL process.
Big Data refers to the vast amounts of data that is being produced
each moment of every day, by people, tools, and machines. The sheer
velocity, volume, and variety of data challenge the tools and systems
used for conventional data. These challenges led to the emergence of
processing tools and platforms designed specifically for Big Data, such
as Apache Hadoop, Apache Hive, and Apache Spark.
Practice Quiz
Question 1: Structured Query Language, or SQL, is the standard
querying language for what type of data repository?
Answer: RDBMS
SQL is the standard querying language for RDBMSs.
Question 2: In use cases for RDBMS, what is one of the reasons that
relational databases are so well suited for OLTP applications?
Answer: Support the ability to insert, update, or delete small amounts
of data.
This is one of the abilities of RDBMSs that make them very well suited
for OLTP applications.
Question 3: Which NoSQL database type stores each record and its
associated data within a single document and also works well
with Analytics platforms?
Answer: Document-based
Document-based NoSQL databases store each record and its
associated data within a single document and work well with Analytics
platforms.
Question 4: What type of data repository is used to isolate a subset of
data for a particular business function, purpose, or community of
users?
Answer: Data Mart
A data mart is a sub-section of the data warehouse used to isolate a
subset of data for a particular business function, purpose, or
community of users.
Question 5: What does the attribute “Velocity” imply in the context
of Big Data?
Answer: The speed at which data accumulates.
Velocity, in the context of Big Data, is the speed at which data
accumulates.
Question 6: Which of the Big Data processing tools provides
distributed storage and processing of Big Data?
Answer: Hadoop
Hadoop, a Java-based open-source framework, allows distributed
storage and processing of large datasets across clusters of computers.
Graded Quiz
Question 1: Data Marts and Data Warehouses have typically been
relational, but the emergence of what technology has helped to let
these be used for non-relational data?
Answer: NoSQL
The emergence of NoSQL technology has made it possible for data
marts and data warehouses to be used for both relational and non-
relational data.
Question 2: What is one of the most significant advantages of an
RDBMS?
Answer: It is ACID-compliant.
ACID compliance is one of the significant advantages of an RDBMS.
Question 3: Which one of the NoSQL database types uses a graphical
model to represent and store data, and is particularly useful for
visualizing, analyzing, and finding connections between different
pieces of data?
Answer: Graph-based.
Graph-based NoSQL databases use a graphical model to represent and
store data and are used for visualizing, analyzing, and finding
connections between different pieces of data.
Question 4: Which of the data repositories serves as a pool of raw
data and stores large amounts of structured, semi-structured, and
unstructured data in their native formats?
Answer: Data Lakes.
A Data Lake can store large amounts of structured, semi-structured,
and unstructured data in their native format, classified and tagged
with metadata.
Question 5: What does the attribute “Veracity” imply in the context
of Big Data?
Answer: Accuracy and conformity of data to facts.
Veracity, in the context of Big Data, refers to the accuracy and
conformity of data to facts.
Question 6: Apache Spark is a general-purpose data
processing engine designed to extract and process Big Data for a
wide range of applications. What is one of its key use cases?
Answer: Perform complex analytics in real-time.
Spark is a general-purpose data processing engine used for performing
complex data analytics in real-time.
Ad

More Related Content

Similar to IBM Data Analytics Module 2 Overview of data Repositories. (20)

Report 2.0.docx
Report 2.0.docxReport 2.0.docx
Report 2.0.docx
pinstechwork
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
IJCERT JOURNAL
 
DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT
huma sh
 
No sql
No sqlNo sql
No sql
Neeraj Kaushik
 
Analysis and evaluation of riak kv cluster environment using basho bench
Analysis and evaluation of riak kv cluster environment using basho benchAnalysis and evaluation of riak kv cluster environment using basho bench
Analysis and evaluation of riak kv cluster environment using basho bench
StevenChike
 
UNIT-2.pptx
UNIT-2.pptxUNIT-2.pptx
UNIT-2.pptx
SIVAKUMARM603675
 
kfddnloiujhfsgklllmnbfhigldktktktkykydlhjjclj
kfddnloiujhfsgklllmnbfhigldktktktkykydlhjjcljkfddnloiujhfsgklllmnbfhigldktktktkykydlhjjclj
kfddnloiujhfsgklllmnbfhigldktktktkykydlhjjclj
pitogojaymark50
 
Brief introduction to NoSQL by fas mosleh
Brief introduction to NoSQL by fas moslehBrief introduction to NoSQL by fas mosleh
Brief introduction to NoSQL by fas mosleh
Fas (Feisal) Mosleh
 
What Is a Database Powerpoint Presentation.pptx
What Is a Database Powerpoint Presentation.pptxWhat Is a Database Powerpoint Presentation.pptx
What Is a Database Powerpoint Presentation.pptx
graciouspezoh
 
Nosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptxNosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptx
Radhika R
 
dbms introduction.pptx
dbms introduction.pptxdbms introduction.pptx
dbms introduction.pptx
ATISHAYJAIN847270
 
NOSQL -lecture 2 mongo database expalnation.pdf
NOSQL -lecture  2 mongo database expalnation.pdfNOSQL -lecture  2 mongo database expalnation.pdf
NOSQL -lecture 2 mongo database expalnation.pdf
AliNasser99
 
NoSQL
NoSQLNoSQL
NoSQL
Khawar Nehal [email protected]
 
Introduction to NoSQL database technology
Introduction to NoSQL database technologyIntroduction to NoSQL database technology
Introduction to NoSQL database technology
nicolausalex722
 
Challenges Management and Opportunities of Cloud DBA
Challenges Management and Opportunities of Cloud DBAChallenges Management and Opportunities of Cloud DBA
Challenges Management and Opportunities of Cloud DBA
inventy
 
No sql database
No sql databaseNo sql database
No sql database
vishal gupta
 
Data Lake v Data Warehouse. What is the difference?
Data Lake v Data Warehouse. What is the difference?Data Lake v Data Warehouse. What is the difference?
Data Lake v Data Warehouse. What is the difference?
Select Distinct Limited
 
Multidimensional Database Design & Architecture
Multidimensional Database Design & ArchitectureMultidimensional Database Design & Architecture
Multidimensional Database Design & Architecture
hasanshan
 
Introduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQLIntroduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQL
Tushar Shende
 
Redis Cashe is an open-source distributed in-memory data store.
Redis Cashe is an open-source distributed in-memory data store.Redis Cashe is an open-source distributed in-memory data store.
Redis Cashe is an open-source distributed in-memory data store.
Artan Ajredini
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
IJCERT JOURNAL
 
DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT
huma sh
 
Analysis and evaluation of riak kv cluster environment using basho bench
Analysis and evaluation of riak kv cluster environment using basho benchAnalysis and evaluation of riak kv cluster environment using basho bench
Analysis and evaluation of riak kv cluster environment using basho bench
StevenChike
 
kfddnloiujhfsgklllmnbfhigldktktktkykydlhjjclj
kfddnloiujhfsgklllmnbfhigldktktktkykydlhjjcljkfddnloiujhfsgklllmnbfhigldktktktkykydlhjjclj
kfddnloiujhfsgklllmnbfhigldktktktkykydlhjjclj
pitogojaymark50
 
Brief introduction to NoSQL by fas mosleh
Brief introduction to NoSQL by fas moslehBrief introduction to NoSQL by fas mosleh
Brief introduction to NoSQL by fas mosleh
Fas (Feisal) Mosleh
 
What Is a Database Powerpoint Presentation.pptx
What Is a Database Powerpoint Presentation.pptxWhat Is a Database Powerpoint Presentation.pptx
What Is a Database Powerpoint Presentation.pptx
graciouspezoh
 
Nosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptxNosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptx
Radhika R
 
NOSQL -lecture 2 mongo database expalnation.pdf
NOSQL -lecture  2 mongo database expalnation.pdfNOSQL -lecture  2 mongo database expalnation.pdf
NOSQL -lecture 2 mongo database expalnation.pdf
AliNasser99
 
Introduction to NoSQL database technology
Introduction to NoSQL database technologyIntroduction to NoSQL database technology
Introduction to NoSQL database technology
nicolausalex722
 
Challenges Management and Opportunities of Cloud DBA
Challenges Management and Opportunities of Cloud DBAChallenges Management and Opportunities of Cloud DBA
Challenges Management and Opportunities of Cloud DBA
inventy
 
Data Lake v Data Warehouse. What is the difference?
Data Lake v Data Warehouse. What is the difference?Data Lake v Data Warehouse. What is the difference?
Data Lake v Data Warehouse. What is the difference?
Select Distinct Limited
 
Multidimensional Database Design & Architecture
Multidimensional Database Design & ArchitectureMultidimensional Database Design & Architecture
Multidimensional Database Design & Architecture
hasanshan
 
Introduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQLIntroduction to Bigdata and NoSQL
Introduction to Bigdata and NoSQL
Tushar Shende
 
Redis Cashe is an open-source distributed in-memory data store.
Redis Cashe is an open-source distributed in-memory data store.Redis Cashe is an open-source distributed in-memory data store.
Redis Cashe is an open-source distributed in-memory data store.
Artan Ajredini
 

Recently uploaded (20)

FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Ad

IBM Data Analytics Module 2 Overview of data Repositories.

  • 1. 1 Module 2 Overview of Data Repositories A data repository is a general term used to refer to data that has been collected, organized, and isolated so that it can be used for business operations or mined for reporting and data analysis. It can be a small or large database infrastructure with one or more databases that collect, manage, and store data sets. In this video, we will provide an overview of the different types of repositories your data might reside in, 1. Such as databases, 2. Data warehouses, and 3. Big data stores, For use in business Mined for reporting Operations and data analysis
  • 2. 2 Databases. Let’s begin with databases. A database is a collection of data, or information, designed for the input, storage, search and retrieval, and modification of data. And a Database Management System, or DBMS, is a set of programs that creates and maintains the database. It allows you to store, modify, and extract information from the database using a function called querying. For example, if you want to find customers who have been inactive for six months or more, using the query function, the database management system will retrieve data of all customers from the database that have been inactive for six months and more.
  • 3. 3 Even though a database and DBMS mean different things the terms are often used interchangeably. There are different types of databases. Several factors influence the choice of database, such as the • Data type and structure, • Querying mechanisms, • Latency requirements, • Transaction speeds, and • Intended use of the data. It’s important to mention two main types of databases here— 1. Relational databases 2. Non-relational databases. Relational databases • Relational databases, also referred to as RDBMSes, build on the organizational principles of flat files, • with data organized into a tabular format with rows and columns following a • well-defined structure and schema. • However, unlike flat files, RDBMSes are optimized for data operations and querying involving many tables and much larger data volumes.
  • 4. 4 • Structured Query Language, or SQL, is the standard querying language for relational databases. Non-relational databases Then we have non-relational databases, also known as NoSQL, or “Not Only SQL”. • Non-relational databases emerged in response to the volume, diversity, and speed at which data is being generated today, mainly influenced by advances in cloud computing, the Internet of Things, and social media proliferation. • Built for speed, flexibility, and scale, • non-relational databases made it possible to store data in a schema-less or free-form fashion. • NoSQL is widely used for processing big data. Data Warehouse. A data warehouse works as a central repository that merges information coming from disparate sources and consolidates it through the extract, transform, and load process, also known as the ETL process, into one comprehensive database for analytics and business intelligence.
  • 5. 5 At a very high-level, the ETL process helps you to • extract data from different data sources, • transform the data into a clean and usable state, and • load the data into the enterprise’s data repository. Related to Data Warehouses are the concepts of Data Marts and Data Lakes, which we will cover later. Data Marts and Data Warehouses have historically been relational, since much of the traditional enterprise data has resided in RDBMSes. However, with the emergence of NoSQL technologies and new sources of data, non-relational data repositories are also now being used for Data Warehousing.
6
Big Data Stores
Another category of data repositories is Big Data Stores, which include distributed computational and storage infrastructure to store, scale, and process very large data sets.

Summary
Overall, data repositories help to isolate data and make reporting and analytics more efficient and credible, while also serving as a data archive.

RDBMS (Relational Database Management Systems)
A relational database is a collection of data organized into a table structure, where the tables can be linked, or related, based on data common to each. Tables are made of rows and columns, where the rows are the “records” and the columns the “attributes”.
7
Let’s take the example of a customer table that maintains data about each customer in a company. The columns, or attributes, in the customer table are the Company ID, Company Name, Company Address, and Company Primary Phone; and each row is a customer record.

Now let’s understand what we mean by tables being linked, or related, based on data common to each. Along with the customer table, the company also maintains transaction tables that contain data describing multiple individual transactions pertaining to each customer. The columns for the transaction table might include the Transaction Date, Customer ID, Transaction Amount, and Payment Method.

The customer table and the transaction table can be related based on the common Customer ID field. You can query the customer table to produce reports such as a customer statement that consolidates all transactions in a given period. This capability of relating tables based on common data enables you to retrieve an entirely new table from data in one or more tables with a single query.
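To make this concrete, here is a minimal sketch of the idea using Python’s built-in sqlite3 module. The table layout and values below are simplified stand-ins for the customer and transaction tables described above, not the exact schema from the course:

```python
import sqlite3

# In-memory database for illustration; a real system would use a persistent RDBMS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two related tables, linked by the common customer_id field.
cur.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
cur.execute("""CREATE TABLE txn (
    txn_date TEXT, customer_id INTEGER, amount REAL, payment_method TEXT,
    FOREIGN KEY (customer_id) REFERENCES customer(customer_id))""")

cur.execute("INSERT INTO customer VALUES (1, 'Acme Corp', '555-0100')")
cur.executemany("INSERT INTO txn VALUES (?, ?, ?, ?)", [
    ("2023-01-05", 1, 250.00, "credit"),
    ("2023-02-11", 1, 75.50, "debit"),
])

# A single query joins the two tables on customer_id to produce a new result table:
# a customer statement consolidating all transactions in a given period.
for row in cur.execute("""
    SELECT c.name, t.txn_date, t.amount
    FROM customer c JOIN txn t ON c.customer_id = t.customer_id
    WHERE t.txn_date BETWEEN '2023-01-01' AND '2023-03-31'
    ORDER BY t.txn_date"""):
    print(row)
```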
8
It also allows you to understand the relationships among all available data and gain new insights for making better decisions.

Relational databases build on the organizational principles of flat files such as spreadsheets, with data organized into rows and columns following a well-defined structure and schema. But this is where the similarity ends.
9
• Relational databases, by design, are ideal for the optimized storage, retrieval, and processing of large volumes of data, unlike spreadsheets, which have a limited number of rows and columns.
• Each table in a relational database has a unique set of rows and columns, and relationships can be defined between tables, which minimizes data redundancy.
• Moreover, you can restrict database fields to specific data types and values, which minimizes irregularities and leads to greater consistency and data integrity.
• Relational databases use SQL for querying data, which gives you the advantage of processing millions of records and retrieving large amounts of data in a matter of seconds.
• Moreover, the security architecture of relational databases provides controlled access to data and also ensures that the standards and policies for governing data can be enforced.

Relational databases range from small desktop systems to massive cloud-based systems. They can be either:
10
• open-source and internally supported,
• open-source with commercial support, or
• commercial closed-source systems.

IBM DB2, Microsoft SQL Server, MySQL, Oracle Database, and PostgreSQL are some of the popular relational databases.

Cloud-based relational databases, also referred to as Database-as-a-Service, are gaining wide use as they have access to the limitless compute and storage capabilities offered by the cloud. Some of the popular cloud relational databases include Amazon Relational Database Service (RDS), Google Cloud SQL, IBM DB2 on Cloud, Oracle Cloud, and SQL Azure.

RDBMS is a mature and well-documented technology, making it easy to learn and find qualified talent. One of the most significant advantages of the relational database approach is
11
its ability to create meaningful information by joining tables. Some of its other advantages include:
• Flexibility: Using SQL, you can add new columns, add new tables, rename relations, and make other changes while the database is running and queries are happening.
• Reduced redundancy: Relational databases minimize data redundancy. For example, the information of a customer appears as a single entry in the customer table, and the transaction table pertaining to the customer stores a link to that entry.
• Ease of backup and disaster recovery: Relational databases offer easy export and import options, making backup and restore easy. Exports can happen while the database is running, making restore on failure easy. Cloud-based relational databases do continuous mirroring, which means the loss of data on restore can be measured in seconds or less.
• ACID compliance: ACID stands for Atomicity, Consistency, Isolation, and Durability, and ACID compliance implies that the data in the database remains accurate and consistent despite failures, and that database transactions are processed reliably.
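To make the atomicity part of ACID concrete, here is a minimal sketch, again using Python’s sqlite3; the account table and the simulated failure are invented for the illustration. Either every statement in the transaction takes effect, or none do:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("INSERT INTO account VALUES (1, 100.0), (2, 50.0)")
conn.commit()

# Transfer funds between accounts as one atomic transaction:
# both updates succeed together, or the whole transfer is rolled back.
try:
    with conn:  # the connection acts as a transaction context manager
        conn.execute("UPDATE account SET balance = balance - 80 WHERE id = 1")
        conn.execute("UPDATE account SET balance = balance + 80 WHERE id = 2")
        # Simulate a failure before the transaction commits
        # (e.g., a business rule violation detected by the application).
        raise RuntimeError("transfer aborted")
except RuntimeError:
    pass

# Balances are unchanged because the failed transaction was rolled back.
print(conn.execute("SELECT id, balance FROM account").fetchall())
# [(1, 100.0), (2, 50.0)]
```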
12
Now we’ll look at some use cases for relational databases:
• Online Transaction Processing: OLTP applications are focused on transaction-oriented tasks that run at high rates. Relational databases are well suited for OLTP applications because they can accommodate a large number of users; they support the ability to insert, update, or delete small amounts of data; and they also support frequent queries and updates as well as fast response times.
• Data warehouses: In a data warehousing environment, relational databases can be optimized for online analytical processing (or OLAP), where historical data is analyzed for business intelligence.
• IoT solutions: Internet of Things (IoT) solutions require speed as well as the ability to collect and process data from edge devices, which need a lightweight database solution.
13
This brings us to the limitations of RDBMS:
• RDBMS does not work well with semi-structured and unstructured data and is, therefore, not suitable for extensive analytics on such data.
• For migration between two RDBMSes, the schemas and data types need to be identical between the source and destination tables.
• Relational databases have a limit on the length of data fields, which means that if you try to enter more information into a field than it can accommodate, the information will not be stored.

Despite these limitations, and the evolution of data in these times of big data, cloud computing, IoT devices, and social media, RDBMS continues to be the predominant technology for working with structured data.

NoSQL
NoSQL, which stands for “not only SQL,” or sometimes “non-SQL,” is a non-relational database design that provides flexible schemas for the storage and retrieval of data. NoSQL databases have existed for many years but have only recently become more popular in the era of cloud, big data, and high-volume web and mobile applications. They are
14
chosen today for their attributes around scale, performance, and ease of use. It’s important to emphasize that the “No” in “NoSQL” is an abbreviation for “not only” and not the actual word “No.”

NoSQL databases are built for specific data models and have flexible schemas that allow programmers to create and manage modern applications. They do not use a traditional row/column/table database design with fixed schemas, and they typically do not use the structured query language (or SQL) to query data, although some may support SQL or SQL-like interfaces.
15
NoSQL allows data to be stored in a schema-less or free-form fashion. Any data, be it structured, semi-structured, or unstructured, can be stored in any record. Based on the model being used for storing data, there are four common types of NoSQL databases:
1. Key-value store,
2. Document-based,
3. Column-based, and
4. Graph-based.

Key-value store
• Data in a key-value database is stored as a collection of key-value pairs.
• The key represents an attribute of the data and is a unique identifier.
• Both keys and values can be anything from simple integers or strings to complex JSON documents.
• Key-value stores are great for storing user session data and user preferences, making real-time recommendations and targeted advertising, and in-memory data caching.
However, if you need to query the data on specific data values, need relationships between data values, or need multiple unique keys, a key-value store may not be the best fit.
16
Key-value store tools: Redis, Memcached, and DynamoDB are some well-known examples in this category.
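For instance, here is a minimal sketch of the key-value pattern using the redis-py client; it assumes a Redis server running on localhost, and the key names and session fields are invented for the example:

```python
import json

import redis  # the redis-py client; assumes a Redis server on localhost:6379

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Each record is just a key and a value; here the value is a JSON document
# holding user session data, a classic key-value use case.
session = {"user_id": 42, "theme": "dark", "cart_items": 3}
r.set("session:42", json.dumps(session))

# Expire the session after 30 minutes, typical for session caching.
r.expire("session:42", 30 * 60)

# Retrieval is by key only; there is no query language over the values,
# which is why key-value stores suit fast lookups rather than complex queries.
print(json.loads(r.get("session:42")))
```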
17
Document-based:
• Document databases store each record and its associated data within a single document.
• They enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents.
• Document databases are preferable for eCommerce platforms, medical records storage, CRM platforms, and analytics platforms.
However, if you’re looking to run complex search queries and multi-operation transactions, a document-based database may not be the best option for you.
MongoDB, DocumentDB, CouchDB, and Cloudant are some of the popular document-based databases.
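As an illustration, here is a minimal sketch with the PyMongo client; it assumes a MongoDB server on localhost, and the database, collection, and field names are made up for the example:

```python
from pymongo import MongoClient  # assumes a MongoDB server on localhost:27017

client = MongoClient("localhost", 27017)
db = client["shop"]  # hypothetical database name

# Each record lives in a single document; documents in the same collection
# can carry different fields, thanks to the flexible schema.
db.products.insert_one({"name": "keyboard", "price": 49.99, "tags": ["usb", "mechanical"]})
db.products.insert_one({"name": "gift card", "price": 25.00, "denominations": [25, 50]})

# Flexible ad hoc queries over document fields.
for doc in db.products.find({"price": {"$lt": 30}}):
    print(doc["name"], doc["price"])
```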
18
Column-based:
• Column-based models store data in cells grouped as columns of data instead of rows.
• A logical grouping of columns, that is, columns that are usually accessed together, is called a column family. For example, a customer’s name and profile information will most likely be accessed together, but not their purchase history, so the customer name and profile information can be grouped into a column family.
• Since column databases store all cells corresponding to a column as a continuous disk entry, accessing and searching the data becomes very fast.
• Column databases can be great for systems that require heavy write requests, and for storing time-series data, weather data, and IoT data.
However, if you need to run complex queries or change your querying patterns frequently, this may not be the best option for you.
The most popular column databases are Cassandra and HBase.
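Here is a minimal sketch of this pattern using the Python cassandra-driver; it assumes a Cassandra node running on localhost, and the keyspace and time-series table are invented for the example:

```python
from cassandra.cluster import Cluster  # assumes a Cassandra node on localhost

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# A time-series table: readings for a sensor, clustered by time,
# a typical column-store workload with heavy writes.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id text, ts timestamp, temperature double,
        PRIMARY KEY (sensor_id, ts)
    )
""")

session.execute(
    "INSERT INTO demo.readings (sensor_id, ts, temperature) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.4),
)

# Queries are fast along the designed access path (the partition key);
# ad hoc queries over arbitrary columns are not the strength of this model.
rows = session.execute(
    "SELECT ts, temperature FROM demo.readings WHERE sensor_id = %s", ("sensor-1",)
)
for row in rows:
    print(row.ts, row.temperature)
```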
19
Graph-based:
• Graph-based databases use a graphical model to represent and store data.
• They are particularly useful for visualizing, analyzing, and finding connections between different pieces of data.
20
In a graph model, the circles are nodes, and they contain the data; the arrows represent relationships. Graph databases are an excellent choice for working with connected data, which is data that contains lots of interconnected relationships.

Graph databases are great for social networks, real-time product recommendations, network diagrams, fraud detection, and access management. However, if you want to process high volumes of transactions, a graph database may not be the best choice for you, because graph databases are not optimized for large-volume analytics queries.

Neo4j and Cosmos DB are some of the more popular graph databases.
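To give a flavor of the model, here is a minimal sketch using the official Neo4j Python driver; the connection details, credentials, and the FRIEND relationship are placeholders invented for the example:

```python
from neo4j import GraphDatabase  # assumes a Neo4j server; credentials are placeholders

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes hold the data; relationships (the "arrows") connect them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIEND]->(b)",
        a="Alice", b="Bob",
    )

    # Traversing relationships is the natural query style for connected data,
    # e.g., friend-of-a-friend product or contact recommendations.
    result = session.run(
        "MATCH (p:Person {name: $name})-[:FRIEND]->()-[:FRIEND]->(fof) "
        "RETURN DISTINCT fof.name AS name",
        name="Alice",
    )
    for record in result:
        print(record["name"])

driver.close()
```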
21
Advantages of NoSQL
NoSQL was created in response to the limitations of traditional relational database technology.
• The primary advantage of NoSQL is its ability to handle large volumes of structured, semi-structured, and unstructured data.
Some of its other advantages include:
22
• The ability to run as distributed systems scaled across multiple data centres, which enables them to take advantage of cloud computing infrastructure;
• An efficient and cost-effective scale-out architecture that provides additional capacity and performance with the addition of new nodes; and
• A simpler design, better control over availability, and improved scalability that enables you to be more agile, more flexible, and to iterate more quickly.

To summarize the key differences between relational and non-relational databases:

Relational databases
• RDBMS schemas rigidly define how all data inserted into the database must be typed and composed.
• Maintaining high-end, commercial relational database management systems can be expensive.
• Support ACID compliance, which ensures the reliability of transactions and crash recovery.
• A mature and well-documented technology, which means the risks are more or less perceivable.

Non-relational databases
• NoSQL databases can be schema-agnostic, allowing unstructured and semi-structured data to be stored and manipulated.
• Specifically designed for low-cost commodity hardware.
• Most NoSQL databases are not ACID-compliant.
• A relatively newer technology.
23
Data Marts, Data Lakes, ETL, and Data Pipelines
Earlier in the course, we examined databases, data warehouses, and big data stores. Now we’ll go a little deeper in our exploration of data warehouses, data marts, and data lakes, and also learn about the ETL process and data pipelines.

Data Warehouses
A data warehouse works like a multi-purpose storage for different use cases. By the time the data comes into the warehouse, it has already been modelled and structured for a specific purpose, meaning it is analysis-ready. As an organization, you would opt for a data warehouse when you have massive amounts of data from your operational systems that needs to be readily available for reporting and analysis.

Data warehouses serve as the single source of truth, storing current and historical data that has been cleansed, conformed, and categorized. A data warehouse is a multi-purpose enabler of operational and performance analytics.
24
Data Marts
A data mart is a sub-section of the data warehouse, built specifically for a particular business function, purpose, or community of users. The idea is to provide stakeholders with the data that is most relevant to them, when they need it; for example, the sales or finance teams accessing data for their quarterly reporting and projections.
• Since a data mart offers analytical capabilities for a restricted area of the data warehouse, it offers isolated security and isolated performance.
• The most important role of a data mart is business-specific reporting and analytics.

Data Lakes
A Data Lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data in their native
25
format, classified and tagged with metadata. So, while a data warehouse stores data processed for a specific need, a data lake is a pool of raw data where each data element is given a unique identifier and is tagged with metatags for further use.
• You would opt for a data lake if you generate, or have access to, large volumes of data on an ongoing basis, but don’t want to be restricted to specific or pre-defined use cases.
• Unlike data warehouses, a data lake would retain all source data, without any exclusions, and the data could include all types of data sources and types.
• Data lakes are sometimes also used as a staging area of a data warehouse.
• The most important role of a data lake is in predictive and advanced analytics.
26
Now we come to the process that is at the heart of gaining value from data: the Extract, Transform, and Load process, or ETL. ETL is how raw data is converted into analysis-ready data. It is an automated process in which you
• gather raw data from identified sources,
• extract the information that aligns with your reporting and analysis needs,
• clean, standardize, and transform that data into a format that is usable in the context of your organization, and
• load it into a data repository.
While ETL is a generic process, the actual job can be very different in usage, utility, and complexity.

Extract is the step where data from source locations is collected for transformation. Data extraction could be through:
• Batch processing, meaning source data is moved in large chunks from the source to the target system at scheduled intervals.
27
• Tools for batch processing include Stitch and Blendo.
• Stream processing, which means source data is pulled in real-time from the source and transformed while it is in transit, before it is loaded into the data repository.
• Tools for stream processing include Apache Samza, Apache Storm, and Apache Kafka.

Transform involves the execution of rules and functions that convert raw data into data that can be used for analysis. Examples include:
• making date formats and units of measurement consistent across all sourced data,
• removing duplicate data,
• filtering out data that you do not need,
• enriching data, for example, splitting a full name into first, middle, and last names,
• establishing key relationships across tables, and
• applying business rules and data validations.

Load is the step where processed data is transported to a destination system or data repository. It could be:
• Initial loading, that is, populating all the data in the repository;
• Incremental loading, that is, applying ongoing updates and modifications periodically, as needed; or
• Full refresh, that is, erasing the contents of one or more tables and reloading them with fresh data.

Load verification, which includes data checks for
• missing or null values,
28
• server performance, and
• monitoring load failures,
is an important part of this process step. It is vital to keep an eye on load failures and ensure the right recovery mechanisms are in place.
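Tying the three steps together, here is a minimal ETL sketch using pandas; the file names, column names, and transformation rules are invented for the illustration:

```python
import sqlite3

import pandas as pd

# Extract: gather raw data from an identified source (a CSV file here).
raw = pd.read_csv("customers_raw.csv")  # hypothetical source file

# Transform: clean, standardize, and reshape the data.
raw = raw.drop_duplicates()                              # remove duplicate records
raw["signup_date"] = pd.to_datetime(raw["signup_date"])  # consistent date format
raw = raw[raw["country"] == "US"]                        # filter out data not needed
# Enrich: split a full name into first and last names.
raw[["first_name", "last_name"]] = raw["full_name"].str.split(" ", n=1, expand=True)

# Load verification: check for missing or null values before loading.
assert raw["customer_id"].notna().all(), "null customer IDs found"

# Load: write the analysis-ready data into the target repository.
conn = sqlite3.connect("warehouse.db")  # stand-in for an enterprise repository
raw.to_sql("customers", conn, if_exists="replace", index=False)
```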
29
ETL has historically been used for batch workloads on a large scale. However, with the emergence of streaming ETL tools, it is increasingly being used for real-time streaming event data as well.

Data Pipelines
It’s common to see the terms ETL and data pipeline used interchangeably. And although both move data from source to destination, data pipeline is a broader term that encompasses the entire journey of moving data from one system to another, of which ETL is a subset.
• Data pipelines can be architected for batch processing, for streaming data, or for a combination of batch and streaming data.
• In the case of streaming data, data processing, or transformation, happens in a continuous flow. This is particularly useful for data that needs constant updating, such as data from a sensor monitoring traffic.
A data pipeline is a high-performing system that supports both long-running batch queries and smaller interactive queries. The destination for a data pipeline is typically a data lake, although the data may also be loaded to different target destinations, such as another application or a visualization tool. There are a number of data pipeline solutions available, the most popular among them being Apache Beam and Dataflow.
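For a feel of what a pipeline definition looks like, here is a minimal Apache Beam sketch in Python; it runs with the local runner, and the data and transform steps are toy examples:

```python
import apache_beam as beam  # Apache Beam Python SDK; uses the local runner by default

with beam.Pipeline() as pipeline:
    (
        pipeline
        # Source: in a real pipeline this could be files, Pub/Sub, Kafka, etc.
        | "Create" >> beam.Create(["12.3", "15.8", "bad", "9.4"])
        # Transform steps, applied element by element as data flows through.
        | "Drop invalid" >> beam.Filter(lambda s: s.replace(".", "", 1).isdigit())
        | "To float" >> beam.Map(float)
        # Sink: write results to a destination (here, local text files).
        | "Write" >> beam.io.WriteToText("readings")
    )
```

The same pipeline definition can be handed to different runners, which is how a service like Dataflow executes Beam pipelines at scale, in batch or streaming mode.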
30
Foundations of Big Data
In this digital world, everyone leaves a trace. From our travel habits to our workouts and entertainment, the increasing number of internet-connected devices that we interact with on a daily basis record vast amounts of data about us. There’s even a name for it: Big Data.
31
Ernst and Young offers the following definition: “Big data refers to the dynamic, large, and disparate volumes of data being created by people, tools, and machines. It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to drive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.”

There is no one definition of big data, but there are certain elements that are common across the different definitions, such as velocity, volume, variety, veracity, and value. These are the V’s of big data.

Velocity
Velocity is the speed at which data accumulates. Data is being generated extremely fast, in a process that never stops. Near or real-time streaming, local, and cloud-based technologies can process information very quickly.

Volume
Volume is the scale of the data, or the increase in the amount of data stored. Drivers of volume are the increase in data sources, higher-resolution sensors, and scalable infrastructure.
32
Variety
Variety is the diversity of the data. Structured data fits neatly into rows and columns in relational databases, while unstructured data is not organized in a predefined way, like tweets, blog posts, pictures, numbers, and video. Variety also reflects that data comes from different sources: machines, people, and processes, both internal and external to organizations. Drivers are mobile technologies, social media, wearable technologies, geo technologies, video, and many, many more.

Veracity
Veracity is the quality and origin of data and its conformity to facts and accuracy. Attributes include consistency, completeness, integrity, and ambiguity. Drivers include cost and the need for traceability. With the large amount of data available, the debate rages on about the accuracy of data in the digital age. Is the information real, or is it false?

Value
Value is our ability and need to turn data into value. Value isn’t just profit. It may have medical or social benefits, as well as customer, employee, or personal satisfaction. The main reason that people invest time to understand big data is to derive value from it.
33
Let’s look at some examples of the V’s in action.

Velocity: Every 60 seconds, hours of footage are uploaded to YouTube, which is generating data. Think about how quickly data accumulates over hours, days, and years.

Volume: The world population is approximately 7 billion people, and the vast majority are now using digital devices: mobile phones, desktop and laptop computers, wearable devices, and so on. These devices all generate, capture, and store data, approximately 2.5 quintillion bytes every day. That’s the equivalent of 10 million Blu-ray discs.

Variety: Let’s think about the different types of data: text, pictures, film, sound, health data from wearable devices, and many different types of data from devices connected to the Internet of Things.

Veracity: Eighty percent of data is considered to be unstructured, and we must devise ways to produce reliable and accurate insights. The data must be categorized, analyzed, and visualized.
34
Data Scientists
Data scientists today derive insights from big data and cope with the challenges that these massive data sets present. The scale of the data being collected means that it’s not feasible to use conventional data analysis tools; however, alternative tools that leverage distributed computing power can overcome this problem. Tools such as Apache Spark, Hadoop, and its ecosystem provide ways to extract, load, analyze, and process the data across distributed compute resources, providing new insights and knowledge. This gives organizations more ways to connect with their customers and enrich the services they offer. So the next time you strap on your smartwatch, unlock your smartphone, or track your workout, remember that your data is starting a journey that might take it all the way around the world, through big data analysis, and back to you.
35
Big Data Processing Tools
Big Data processing technologies provide ways to work with large sets of structured, semi-structured, and unstructured data so that value can be derived from big data. In some of the other videos, we discussed Big Data technologies such as NoSQL databases and Data Lakes. In this video, we are going to talk about three open-source technologies and the role they play in big data analytics:
1. Apache Hadoop: a collection of tools that provides distributed storage and processing of big data.
2. Apache Hive: a data warehouse for data query and analysis built on top of Hadoop.
3. Apache Spark: a distributed data analytics framework designed to perform complex data analytics in real-time.

Hadoop
Hadoop, a Java-based open-source framework, allows distributed storage and processing of large datasets across clusters of computers. In a Hadoop distributed system, a node is a single computer,
36
and a collection of nodes forms a cluster. Hadoop can scale up from a single node to any number of nodes, each offering local storage and computation. Hadoop provides a reliable, scalable, and cost-effective solution for storing data with no format requirements. Using Hadoop, you can:
• Incorporate emerging data formats, such as streaming audio, video, social media sentiment, and clickstream data, along with structured, semi-structured, and unstructured data not traditionally used in a data warehouse.
• Provide real-time, self-service access for all stakeholders.
• Optimize and streamline costs in your enterprise data warehouse by consolidating data across the organization and moving “cold” data, that is, data that is not in frequent use, to a Hadoop-based system.
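Hadoop’s processing model is MapReduce: a map step that emits key-value pairs and a reduce step that aggregates them, run in parallel across the cluster. The sketch below only imitates that flow locally in plain Python with a classic word count; on a real cluster, the map and reduce functions would run as distributed tasks (for example via Hadoop Streaming) rather than in a single process:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the input line."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce step: sum the counts emitted for a single word."""
    return (word, sum(counts))

# Toy input; on a cluster, each node would map its local blocks of the file.
lines = ["big data needs big tools", "hadoop stores big data"]

# Map phase, then shuffle/sort (grouping pairs by key), then reduce phase.
pairs = sorted(p for line in lines for p in mapper(line))
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reducer(word, (count for _, count in group)))
# ('big', 3), ('data', 2), ('hadoop', 1), ('needs', 1), ('stores', 1), ('tools', 1)
```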
37
One of the four main components of Hadoop is the Hadoop Distributed File System, or HDFS, which is a storage system for big data that runs on multiple commodity hardware servers connected through a network.
❖ HDFS provides scalable and reliable big data storage by partitioning files over multiple nodes.
❖ It splits large files across multiple computers, allowing parallel access to them. Computations can, therefore, run in parallel on each node where data is stored.
❖ It also replicates file blocks on different nodes to prevent data loss, making it fault-tolerant.

Let’s understand this through an example. Consider a file that includes phone numbers for everyone in the United States; the numbers for people with last names starting with A might be stored on server 1, B on server 2, and so on.
38
With Hadoop, pieces of this phonebook would be stored across the cluster. To reconstruct the entire phonebook, your program would need the blocks from every server in the cluster. HDFS also replicates these smaller pieces onto two additional servers by default, ensuring availability when a server fails.

In addition to higher availability, this offers multiple benefits. It allows the Hadoop cluster to break up work into smaller chunks and run those jobs on all servers in the cluster for better scalability. Finally, you gain the benefit of data locality, which is the process of moving the computation closer to the node on which the data resides. This is critical when working with large data sets because it minimizes network congestion and increases throughput.

Some of the other benefits that come from using HDFS include:
❖ Fast recovery from hardware failures, because HDFS is built to detect faults and automatically recover.
❖ Access to streaming data, because HDFS supports high data throughput rates.
❖ Accommodation of large data sets, because HDFS can scale to hundreds of nodes, or computers, in a single cluster.
39
❖ Portability, because HDFS is portable across multiple hardware platforms and compatible with a variety of underlying operating systems.

Hive
Hive is an open-source data warehouse software for reading, writing, and managing large data set files that are stored directly in either HDFS or other data storage systems, such as Apache HBase. Hadoop is intended for long sequential scans and, because Hive is based on Hadoop, queries have very high latency, which means Hive is less appropriate for applications that need very fast response times.
40
❖ Hive is not suitable for transaction processing that typically involves a high percentage of write operations.
❖ Hive is better suited for data warehousing tasks such as ETL, reporting, and data analysis, and includes tools that enable easy access to data via SQL.

Apache Spark
This brings us to Spark, a general-purpose data processing engine designed to extract and process large volumes of data for a wide range of applications, including:
❖ Interactive Analytics,
❖ Streams Processing,
❖ Machine Learning,
❖ Data Integration, and
❖ ETL.

Key attributes:
❖ Spark takes advantage of in-memory processing to significantly increase the speed of computations, spilling to disk only when memory is constrained.
❖ Spark has interfaces for major programming languages, including Java, Scala, Python, R, and SQL.
❖ It can run using its standalone clustering technology as well as on top of other infrastructures such as Hadoop, and
41
❖ it can access data in a large variety of data sources, including HDFS and Hive, making it highly versatile.
❖ The ability to process streaming data fast and perform complex analytics in real-time is the key use case for Apache Spark.
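As an illustration of Spark’s Python interface, here is a minimal PySpark sketch; the file path and column names are placeholders, and in a cluster deployment the same code could read from HDFS paths or Hive tables instead:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local session for illustration; on a cluster, the master and data sources
# (e.g., hdfs:// paths or Hive tables) would point at shared infrastructure.
spark = SparkSession.builder.appName("readings-demo").getOrCreate()

# Hypothetical sensor data; DataFrames are processed in memory across executors.
df = spark.read.csv("readings.csv", header=True, inferSchema=True)

# A small analytics job: average temperature per sensor, highest first.
(
    df.groupBy("sensor_id")
      .agg(F.avg("temperature").alias("avg_temp"))
      .orderBy(F.desc("avg_temp"))
      .show()
)

spark.stop()
```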
42
Summary and Highlights
In this lesson, you have learned the following information:

A Data Repository is a general term that refers to data that has been collected, organized, and isolated so that it can be used for reporting, analytics, and also for archival purposes. The different types of Data Repositories include:
• Databases, which can be relational or non-relational, each following a set of organizational principles, the types of data they can store, and the tools that can be used to query, organize, and retrieve data.
• Data Warehouses, which consolidate incoming data into one comprehensive storehouse.
• Data Marts, which are essentially sub-sections of a data warehouse, built to isolate data for a particular business function or use case.
• Data Lakes, which serve as storage repositories for large amounts of structured, semi-structured, and unstructured data in their native format.
• Big Data Stores, which provide distributed computational and storage infrastructure to store, scale, and process very large data sets.

The ETL, or Extract, Transform, and Load, process is an automated process that converts raw data into analysis-ready data by:
• Extracting data from source locations,
• Transforming raw data by cleaning, enriching, standardizing, and validating it, and
• Loading the processed data into a destination system or data repository.

A Data Pipeline, a term sometimes used interchangeably with ETL, encompasses the entire journey of moving data from the source to a destination data lake or application, using the ETL process.

Big Data refers to the vast amounts of data that are being produced each moment of every day, by people, tools, and machines. The sheer velocity, volume, and variety of this data challenge the tools and systems used for conventional data. These challenges led to the emergence of processing tools and platforms designed specifically for Big Data, such as Apache Hadoop, Apache Hive, and Apache Spark.

Practice Quiz
Question 1: Structured Query Language, or SQL, is the standard querying language for what type of data repository?
43
Answer: RDBMS. SQL is the standard querying language for RDBMSes.

Question 2: In use cases for RDBMS, what is one of the reasons that relational databases are so well suited for OLTP applications?
Answer: They support the ability to insert, update, or delete small amounts of data. This is one of the abilities of RDBMSes that makes them very well suited for OLTP applications.

Question 3: Which NoSQL database type stores each record and its associated data within a single document and also works well with analytics platforms?
Answer: Document-based. Document-based NoSQL databases store each record and its associated data within a single document and work well with analytics platforms.

Question 4: What type of data repository is used to isolate a subset of data for a particular business function, purpose, or community of users?
Answer: Data Mart. A data mart is a sub-section of the data warehouse used to isolate a subset of data for a particular business function, purpose, or community of users.

Question 5: What does the attribute “Velocity” imply in the context of Big Data?
Answer: The speed at which data accumulates.
44
Velocity, in the context of Big Data, is the speed at which data accumulates.

Question 6: Which of the Big Data processing tools provides distributed storage and processing of Big Data?
Answer: Hadoop. Hadoop, a Java-based open-source framework, allows distributed storage and processing of large datasets across clusters of computers.

Graded Quiz
Question 1: Data Marts and Data Warehouses have typically been relational, but the emergence of what technology has helped to let these be used for non-relational data?
Answer: NoSQL. The emergence of NoSQL technology has made it possible for data marts and data warehouses to be used for both relational and non-relational data.

Question 2: What is one of the most significant advantages of an RDBMS?
Answer: It is ACID-compliant. ACID compliance is one of the significant advantages of an RDBMS.

Question 3: Which one of the NoSQL database types uses a graphical model to represent and store data, and is particularly useful for visualizing, analyzing, and finding connections between different pieces of data?
Answer: Graph-based.
45
Graph-based NoSQL databases use a graphical model to represent and store data and are used for visualizing, analyzing, and finding connections between different pieces of data.

Question 4: Which of the data repositories serves as a pool of raw data and stores large amounts of structured, semi-structured, and unstructured data in their native formats?
Answer: Data Lakes. A Data Lake can store large amounts of structured, semi-structured, and unstructured data in their native format, classified and tagged with metadata.

Question 5: What does the attribute “Veracity” imply in the context of Big Data?
Answer: The accuracy and conformity of data to facts. Veracity, in the context of Big Data, refers to the accuracy and conformity of data to facts.

Question 6: Apache Spark is a general-purpose data processing engine designed to extract and process Big Data for a wide range of applications. What is one of its key use cases?
Answer: Performing complex analytics in real-time. Spark is a general-purpose data processing engine used for performing complex data analytics in real-time.