Unit 5 Concepts of Big Data and Data Lake

Uploaded by

jaysukhv234

What is Big Data?


Big Data refers to data sets so large that they cannot be stored or processed with ordinary
tools. We normally work with files measured in megabytes (Word documents, Excel sheets) or at
most gigabytes (movies, source code), whereas Big Data is measured in petabytes
(1 PB = 10^15 bytes) and beyond. It is often stated that almost 90% of today's data has been
generated in the past 3 years.
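
To get a feel for the scale, the unit arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes decimal (SI) units, and the 2 GB movie size and the helper names `to_bytes` and `how_many_fit` are made up for the example.

```python
# Decimal (SI) storage units expressed in bytes.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}


def to_bytes(value, unit):
    """Convert a size such as (2, "GB") into a number of bytes."""
    return value * UNITS[unit]


def how_many_fit(container_bytes, item_bytes):
    """Return how many whole items of item_bytes fit into container_bytes."""
    return container_bytes // item_bytes


one_petabyte = to_bytes(1, "PB")
movie = to_bytes(2, "GB")  # hypothetical 2 GB movie file

# One petabyte holds 10^15 / (2 * 10^9) = 500000 such movies.
print(how_many_fit(one_petabyte, movie))
```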

Types Of Big Data

Structured

• Any data that can be stored, accessed and processed in a fixed format is termed
‘structured’ data. Over time, computer science has developed mature techniques for
working with such data (where the format is well known in advance) and for deriving
value from it. The challenge today is scale: such data sets keep growing, with sizes
reaching the range of multiple zettabytes.

Unstructured

• Any data whose form or structure is unknown is classified as unstructured data. In
addition to its huge size, unstructured data poses multiple challenges when it is
processed to derive value. A typical example is a heterogeneous data source containing
a combination of simple text files, images, videos, etc. Nowadays organizations have a
wealth of data available to them, but they often do not know how to derive value from it
because the data is in its raw, unstructured form.

Semi-structured

• Semi-structured data contains elements of both forms. It looks structured, but its
structure is not defined by, for example, a table definition in a relational DBMS. An
example of semi-structured data is data represented in an XML file.
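
The three types can be contrasted with a small Python sketch. The sample records, names and e-mail address below are invented for illustration, and only the standard library is used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: every row follows the same fixed columns, known in advance.
structured = io.StringIO("id,name,price\n1,Pen,10\n2,Book,250\n")
rows = list(csv.DictReader(structured))

# Semi-structured: tags describe the data, but records may differ in shape
# (the second customer has no <email> element).
xml_text = """
<customers>
  <customer><name>Asha</name><email>asha@example.com</email></customer>
  <customer><name>Ravi</name></customer>
</customers>
"""
customers = [
    {child.tag: child.text for child in customer}
    for customer in ET.fromstring(xml_text)
]

# Unstructured: free text with no schema; any structure has to be inferred.
unstructured = "Great service, but the delivery was late by two days."
word_count = len(unstructured.split())

print(rows[1]["name"])   # Book
print(customers[1])      # {'name': 'Ravi'}
print(word_count)        # 10
```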

Characteristics (4 V’s)
• (i) Volume – The name Big Data itself refers to enormous size. The size of data plays a
crucial role in determining the value that can be derived from it, and whether a
particular data set counts as Big Data at all depends on its volume. Hence, ‘Volume’ is
one characteristic that must be considered when dealing with Big Data solutions.

Prep By Akansha Srivastav Page 1


• (ii) Variety – Variety refers to heterogeneous sources and to the nature of the data,
both structured and unstructured. In earlier days, spreadsheets and databases were the
only data sources considered by most applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis
applications. This variety of unstructured data poses certain issues for storing, mining
and analyzing data.

• (iii) Velocity – The term ‘velocity’ refers to the speed at which data is generated. How
fast the data is generated and processed to meet demand determines the real potential of
the data. Big Data velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, social media sites, sensors, mobile
devices, etc. The flow of data is massive and continuous.

• (iv) Variability – This refers to the inconsistency that the data can show at times,
which hampers the ability to handle and manage the data effectively.

Sources of Big Data


This data comes from many sources, such as:

o Social networking sites: Facebook, Google, LinkedIn and similar sites generate huge
amounts of data every day, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge volumes of logs
from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large amounts of data,
which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish
their plans accordingly; for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through
their daily transactions.


Traditional data

o Traditional data refers to structured data that is collected and stored in formats like
databases, spreadsheets, etc. Such data includes customer information, inventory
records, financial statements, etc.
o This data is stored in relational databases (queried with SQL) and handled with other
traditional data analysis tools. It can be processed easily, and even analyzed manually,
using traditional methods to gain insight into business operations. It can also be used
to create reports and visualizations, so it plays a vital role in making sound,
profitable decisions based on the trends and patterns in the data.
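
As a minimal sketch of traditional data handling, the snippet below stores a few inventory records in a relational table and produces a simple report with SQL. It uses Python's built-in `sqlite3` module with an in-memory database; the table name, columns and figures are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE inventory (item TEXT PRIMARY KEY, quantity INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [("Pen", 100, 10.0), ("Book", 20, 250.0), ("Bag", 5, 900.0)],
)

# A simple report: value of stock held per item, highest first.
report = conn.execute(
    "SELECT item, quantity * unit_price AS stock_value "
    "FROM inventory ORDER BY stock_value DESC"
).fetchall()

for item, stock_value in report:
    print(item, stock_value)
```

Because the data is small and has a fixed schema, a single database on one machine is enough here, which is exactly the situation the comparison below contrasts with big data.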

The main differences between traditional data and big data are as follows:

Traditional Data | Big Data
It is usually a small amount of data that can be collected and analyzed easily using traditional methods. | It is usually a large amount of data that cannot be processed and analyzed easily using traditional methods.
It is usually structured data and can be stored in spreadsheets, databases, etc. | It includes semi-structured, unstructured, and structured data.
It is often collected manually. | It is collected automatically with the use of automated systems.
It usually comes from internal systems. | It comes from various sources such as mobile devices, social media, etc.
It consists of data such as customer information, financial transactions, etc. | It consists of data such as images, videos, etc.
Analysis of traditional data can be done with the use of primary statistical methods. | Analysis of big data needs advanced analytics methods such as machine learning, data mining, etc.
Traditional methods to analyze data are slow and gradual. | Methods to analyze big data are fast and instant.
It is generated after the happening of an event. | It is generated every second.
It is typically processed in batches. | It is generated and processed in real time.
It is limited in its value and insights. | It provides valuable insights and patterns for good decision-making.
It contains reliable and accurate data. | It may contain unreliable, inconsistent, or inaccurate data because of its size and complexity.
It is used for simple and small business processes. | It is used for complex and big business processes.
It does not provide in-depth insights. | It provides in-depth insights.
It is easier to secure and protect than big data because of its small size and simplicity. | It is harder to secure and protect than traditional data because of its size and complexity.
It requires less time and money to store. | It requires more time and money to store.
It can be stored on a single computer or server. | It requires distributed storage across numerous systems.
It is less efficient than big data. | It is more efficient than traditional data.
It can be managed easily in a centralized structure. | It requires a decentralized infrastructure to manage the data.

What is a Data Warehouse?


A Data Warehouse (DW) is a relational database that is designed for query and analysis rather
than transaction processing. It includes historical data derived from transaction data from
one or more sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on
supporting decision-makers in data modeling and analysis.

A Data Warehouse is a collection of data specific to the entire organization, not only to a
particular group of users.

It is not used for daily operations and transaction processing but for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.

o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in
support of management's decisions."
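
A warehouse-style query can be sketched as follows. The example assumes a tiny star-like layout (one fact table of sales and one dimension table of products) held in an in-memory SQLite database; the table names `fact_sales` and `dim_product` and all figures are hypothetical, and a real warehouse would be far larger.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript(
    """
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, sale_year INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Stationery'), (2, 'Electronics');
    INSERT INTO fact_sales VALUES
        (1, 2022, 100.0), (1, 2023, 150.0),
        (2, 2022, 900.0), (2, 2023, 1200.0), (2, 2023, 300.0);
    """
)

# Read-only, subject-oriented (sales) and time-variant (per year) analysis.
yearly = dw.execute(
    """
    SELECT d.category, f.sale_year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category, f.sale_year
    ORDER BY d.category, f.sale_year
    """
).fetchall()

for category, year, total in yearly:
    print(category, year, total)
```

The query only reads data and summarizes history by subject and by year, which matches the read-intensive, time-variant character described above.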

Concept of Data Processing Techniques


Online Transaction Processing (OLTP)
o OLTP is an operational system that supports transaction-oriented applications in a 3-tier
architecture. It administers the day-to-day transactions of an organization. OLTP focuses
on query processing, on maintaining data integrity in multi-access environments, and on
effectiveness measured by the total number of transactions per second.

Characteristics of OLTP
The following are important characteristics of OLTP:

o OLTP uses transactions that involve small amounts of data.
o Indexed data in the database can be accessed easily.
o OLTP has a large number of users.
o It has short response times.
o Databases are directly accessible to end-users.
o OLTP uses a fully normalized schema for database consistency.
o It strictly performs only predefined operations on a small number of records.
o OLTP stores the records of the last few days or a week.
o It supports complex data models and tables.


Type of queries that an OLTP system can Process


An OLTP system is an online database-modifying system. It therefore supports queries that
insert, update, and delete information in the database, in addition to simple lookups.

POS system for OLTP

Consider the point-of-sale (POS) system of a supermarket. The following are sample queries
that this system can process:

o Retrieving the description of a particular product.
o Filtering all products related to a supplier.
o Searching for the record of a customer.
o Listing products with a price less than a given amount.
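
The four sample queries above can be sketched against a toy POS database. The schema (`products`, `customers`), the supplier names and the prices are assumptions made for illustration; the example uses Python's built-in `sqlite3` module.

```python
import sqlite3

pos = sqlite3.connect(":memory:")
pos.executescript(
    """
    CREATE TABLE products (
        id INTEGER PRIMARY KEY, description TEXT, supplier TEXT, price REAL
    );
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO products VALUES
        (1, 'Whole milk 1L', 'DairyCo', 60.0),
        (2, 'Butter 500g', 'DairyCo', 275.0),
        (3, 'Basmati rice 5kg', 'GrainMart', 650.0);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    """
)

# 1. Retrieve the description of a particular product.
description = pos.execute(
    "SELECT description FROM products WHERE id = ?", (2,)
).fetchone()[0]

# 2. Filter all products related to a supplier.
by_supplier = [
    row[0]
    for row in pos.execute(
        "SELECT description FROM products WHERE supplier = ? ORDER BY id",
        ("DairyCo",),
    )
]

# 3. Search for the record of a customer.
customer = pos.execute(
    "SELECT id, name FROM customers WHERE name = ?", ("Ravi",)
).fetchone()

# 4. List products with a price below a given amount.
cheaper = [
    row[0]
    for row in pos.execute(
        "SELECT description FROM products WHERE price < ? ORDER BY price", (300.0,)
    )
]

print(description, by_supplier, customer, cheaper)
```

Each query touches only a handful of rows and uses a predefined pattern with parameters, which is typical of OLTP workloads.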


Architecture of OLTP
Figure: OLTP Architecture

The architecture has the following components:

1. Business / Enterprise Strategy: Enterprise strategy deals with the issues that affect the
organization as a whole. It is typically developed at a high level within the firm, by
the board of directors or the top management.
2. Business Process: An OLTP business process is a set of activities and tasks that, once
completed, accomplish an organizational goal.
3. Customers, Orders, and Products: The OLTP database stores information about products,
orders (transactions), customers (buyers), suppliers (sellers), and employees.

4. ETL Processes: ETL (extract, transform, load) extracts the data from various RDBMS
source systems, transforms it (for example by applying concatenations, calculations,
etc.) and loads the processed data into the Data Warehouse system.
5. Data Mart and Data Warehouse: A Data Mart is a structure/access pattern specific to
data warehouse environments. It is used by OLAP to store processed data.
6. Data Mining, Analytics, and Decision Making: Data stored in the data mart and data
warehouse can be used for data mining, analytics, and decision making. This data helps
you discover patterns, analyze raw data, and make analytical decisions for your
organization’s growth.
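
The ETL step (item 4) can be illustrated with a minimal sketch. It assumes two in-memory SQLite databases standing in for an OLTP source and a warehouse; the `orders` and `sales_fact` tables and the `extract`, `transform` and `load` helpers are hypothetical names chosen for the example, not part of any particular ETL tool.

```python
import sqlite3

# Source OLTP system (assumed schema).
source = sqlite3.connect(":memory:")
source.executescript(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
        quantity INTEGER, unit_price REAL
    );
    INSERT INTO orders VALUES
        (1, 'Asha', 'Patel', 2, 50.0),
        (2, 'Ravi', 'Shah', 1, 120.0);
    """
)

# Target warehouse (assumed schema).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (order_id INTEGER, customer TEXT, total REAL)"
)


def extract(conn):
    """Extract: read the raw rows from the source system."""
    return conn.execute(
        "SELECT id, first_name, last_name, quantity, unit_price "
        "FROM orders ORDER BY id"
    ).fetchall()


def transform(rows):
    """Transform: concatenate the name parts and calculate the order total."""
    return [
        (order_id, f"{first} {last}", quantity * price)
        for order_id, first, last, quantity, price in rows
    ]


def load(conn, rows):
    """Load: write the processed rows into the warehouse in one transaction."""
    with conn:
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)


load(warehouse, transform(extract(source)))
loaded = warehouse.execute("SELECT * FROM sales_fact ORDER BY order_id").fetchall()
print(loaded)
```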

Example of OLTP Transaction


An example of an OLTP system is an ATM network. Assume that a couple has a joint account
with a bank. One day both reach different ATMs at precisely the same time and each wants to
withdraw the total amount present in the account.

Figure: OLTP for ATM

The person who completes the authentication process first will get the money. The OLTP
system makes sure that the amount withdrawn is never more than the amount present in the
account. The key point is that OLTP systems are optimized for processing transactions
correctly rather than for data analysis.

Other examples of OLTP systems are:

o Online banking
o Online airline ticket booking
o Sending a text message
o Order entry
o Adding a book to a shopping cart
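
The ATM rule (never pay out more than the balance) can be sketched as a conditional update inside a transaction. This is a simplified, single-process illustration: the two withdrawals run one after the other, standing in for the two ATMs, whereas a real banking system relies on the database's concurrency control across many connections. The `accounts` table and the `withdraw` helper are assumptions for the example.

```python
import sqlite3

bank = sqlite3.connect(":memory:")
bank.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
bank.execute("INSERT INTO accounts VALUES (1, 5000)")
bank.commit()


def withdraw(conn, account_id, amount):
    """Debit the account atomically; return True only if funds were sufficient."""
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
        # The WHERE clause matches no row when the balance is too low,
        # so rowcount tells us whether the debit actually happened.
        return cursor.rowcount == 1


first = withdraw(bank, 1, 5000)   # first ATM: succeeds
second = withdraw(bank, 1, 5000)  # second ATM: rejected, balance is already 0
balance = bank.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(first, second, balance)     # True False 0
```

Putting the balance check and the debit into one statement means there is no gap between "check" and "update" in which the other withdrawal could slip through.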


Advantages of OLTP
The following are the benefits of an OLTP system:

o OLTP offers accurate forecasts for revenue and expense.
o It provides a solid foundation for a stable business/organization through timely
modification of all transactions.
o OLTP makes transactions much easier for customers.
o It broadens the client base of an organization by speeding up and simplifying individual
processes.
o OLTP provides support for bigger databases.
o Partitioning of data for data manipulation is easy.
o It suits tasks that the system performs frequently.
o It suits work that touches only a small number of records.
o It suits tasks that involve insertion, update, or deletion of data.
o It is used when consistency and concurrency are needed to perform tasks in a way that
ensures greater availability.

Disadvantages of OLTP
Here are the drawbacks of an OLTP system:

o If the OLTP system faces hardware failures, online transactions are severely affected.
o OLTP systems allow multiple users to access and change the same data at the same time,
which can create conflicts that are hard to anticipate.
o If the server hangs even for seconds, a large number of transactions can be affected.
o OLTP requires a lot of staff working in groups in order to maintain inventory.
o Online transaction processing systems have no means of their own for transferring
products to buyers.
o OLTP makes the database much more susceptible to hackers and intruders.
o In B2B transactions, both buyers and suppliers may miss out on the efficiency
advantages that the system offers.
o Server failure may wipe out large amounts of data from the database.
o Only a limited number of queries and updates can be performed.


OLAP (Online Analytical Processing)

Online Analytical Processing (OLAP) is based on the multidimensional data model. It allows
managers and analysts to gain insight into information through fast, consistent, and
interactive access to it.
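
The multidimensional model can be pictured as a cube of facts indexed by dimensions. The sketch below is a plain-Python illustration, not how an OLAP server is actually implemented: the dimensions (region, product, quarter), the sales figures and the helpers `roll_up` and `slice_cube` are all assumed for the example.

```python
from collections import defaultdict

DIMENSIONS = ("region", "product", "quarter")

# Each fact: (region, product, quarter, sales amount).
facts = [
    ("North", "Pen", "Q1", 100),
    ("North", "Book", "Q1", 300),
    ("South", "Pen", "Q1", 150),
    ("South", "Pen", "Q2", 200),
    ("North", "Book", "Q2", 250),
]


def roll_up(rows, keep):
    """Sum sales over every dimension that is not listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[p] for p in positions)] += row[3]
    return dict(totals)


def slice_cube(rows, dimension, value):
    """Keep only the facts where the given dimension has the given value."""
    position = DIMENSIONS.index(dimension)
    return [row for row in rows if row[position] == value]


by_region = roll_up(facts, ["region"])
q1_by_region = roll_up(slice_cube(facts, "quarter", "Q1"), ["region"])

print(by_region)      # {('North',): 650, ('South',): 350}
print(q1_by_region)   # {('North',): 400, ('South',): 150}
```

An analyst moves between such views interactively: rolling up to fewer dimensions for a summary, or slicing on one value to look at a single quarter.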

Who uses OLAP and Why

OLAP applications are used by a variety of functions within an organization.

Finance and accounting
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling

Sales and marketing
o Sales analysis and forecasting
o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation

Production
o Production planning
o Defect analysis


Advantages

The advantages of OLAP are as follows:

o Business-focused multidimensional data.
o Business-focused calculations.
o Trustworthy data and calculations.
o Speed-of-thought analysis.
o Flexible, self-service reporting.

Disadvantages

o Pre-modeling is a must: traditional OLAP tools do not allow quick analysis of business
data without it.
o Heavy dependence on IT.
o Poor computation capability.
o Limited interactive analysis ability.
o Slow response.
o Abstract model.
o High potential risk.

OLTP vs. OLAP


Here are the important differences between OLTP and OLAP:

OLTP | OLAP
OLTP is an online transactional system. | OLAP is an online analysis and data retrieval process.
It is characterized by large numbers of short online transactions. | It is characterized by a large volume of data.
OLTP is an online database modifying system. | OLAP is an online database query management system.
OLTP uses a traditional DBMS. | OLAP uses the data warehouse.
Insert, update, and delete information in the database. | Mostly select operations.
OLTP and its transactions are the sources of data. | Different OLTP databases become the source of data for OLAP.
The OLTP database must maintain data integrity constraints. | The OLAP database is not modified frequently. Hence, data integrity is not an issue.
Its response time is in milliseconds. | Its response time is in seconds to minutes.
The data in the OLTP database is always detailed and organized. | The data in the OLAP process might not be organized.
Allows read/write operations. | Mostly read, rarely write.
It is a customer-oriented process. | It is a market-oriented process.
Queries in this process are standardized and simple. | Complex queries involving aggregations.
Needs a complete backup of the data combined with incremental backups. | OLAP only needs a backup from time to time. Backup is less important than for OLTP.
DB design is application-oriented. Example: database design changes with the industry, like retail, airline, banking, etc. | DB design is subject-oriented. Example: database design changes with subjects like sales, marketing, purchasing, etc.
It is used by data-critical users like clerks, DBAs and database professionals. | It is used by data knowledge users like workers, managers, and CEOs.
It is designed for real-time business operations. | It is designed for analysis of business measures by category and attributes.
Transaction throughput is the performance metric. | Query throughput is the performance metric.
This kind of database allows thousands of users. | This kind of database allows only hundreds of users.
It helps to increase users' self-service and productivity. | It helps to increase the productivity of business analysts.
Data warehouses historically have been a development project which may prove costly to build. | An OLAP cube is not an open SQL server data warehouse. Therefore, technical knowledge and experience are essential to managing the OLAP server.
It provides fast results for daily used data. | It ensures that the response to a query is consistently quick.
It is easy to create and maintain. | It lets the user create a view with the help of a spreadsheet.
OLTP is designed to have fast response time and low data redundancy, and is normalized. | A data warehouse is created uniquely so that it can integrate different data sources for building a consolidated database.

Concept of Data Lake


A Data Lake is a storage repository that can store large amounts of structured,
semi-structured, and unstructured data. It is a place to store every type of data in its
native format, with no fixed limits on account size or file size. It offers a high quantity
of data to increase analytic performance and native integration.

A Data Lake is like a large container, very similar to a real lake fed by rivers. Just as a
lake has multiple tributaries flowing into it, a data lake has structured data, unstructured
data, machine-to-machine data, and logs flowing in, in real time.
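
Storing data "in its native format" can be illustrated with a toy lake on the local file system. This is only a sketch: a temporary directory stands in for the lake, the file names and contents are invented, and the idea of applying types only when the data is read (often called schema-on-read) is shown with a hypothetical `read_orders` helper.

```python
import csv
import json
import tempfile
from pathlib import Path

# A temporary directory stands in for the data lake's storage.
lake = Path(tempfile.mkdtemp(prefix="lake_"))

# Ingest each source as-is, without forcing a common schema.
(lake / "orders.csv").write_text("order_id,amount\n1,250\n2,90\n")               # structured
(lake / "clicks.json").write_text(json.dumps([{"user": "asha", "page": "/home"}]))  # semi-structured
(lake / "review.txt").write_text("Delivery was late but the product is good.")   # unstructured


def read_orders(path):
    """Apply column types only at read time, when an analysis needs them."""
    with path.open(newline="") as handle:
        return [
            {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
            for row in csv.DictReader(handle)
        ]


stored = sorted(p.name for p in lake.iterdir())
orders = read_orders(lake / "orders.csv")
print(stored)
print(orders)
```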


• Ingestion Tier: The tiers on the left side of the data lake architecture diagram depict
the data sources. The data can be loaded into the data lake in batches or in real time.

• Insights Tier: The tiers on the right represent the research side, where insights from
the system are used. SQL queries, NoSQL queries, or even Excel can be used for data
analysis.

• HDFS: HDFS is a cost-effective solution for both structured and unstructured data. It is
a landing zone for all data that is at rest in the system.

• Distillation Tier: It takes data from the storage tier and converts it to structured data
for easier analysis.

• Processing Tier: It runs analytical algorithms and user queries in varying modes
(real-time, interactive, batch) to generate structured data for easier analysis.

• Unified Operations Tier: It governs system management and monitoring. It includes
auditing and proficiency management, data management, and workflow management.
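
The roles of the distillation and processing tiers can be sketched with a few raw log lines. Everything here is an assumption made for illustration: the log format, the `distill` and `process` helpers, and the choice to silently skip malformed lines. A real data lake would run such steps on a distributed engine rather than in a single Python process.

```python
from collections import Counter

# Raw data as it might sit in the storage tier (assumed log format).
raw_logs = [
    "2024-01-05 10:00:01 INFO login user=asha",
    "2024-01-05 10:00:07 ERROR payment user=ravi",
    "malformed line",
    "2024-01-05 10:01:15 INFO login user=ravi",
]


def distill(lines):
    """Distillation: convert raw lines into structured records, skipping bad ones."""
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5 or not parts[4].startswith("user="):
            continue  # does not match the assumed format
        date, time, level, event, user = parts
        records.append(
            {
                "date": date,
                "time": time,
                "level": level,
                "event": event,
                "user": user.split("=", 1)[1],
            }
        )
    return records


def process(records):
    """Processing: run a simple batch analysis over the structured records."""
    return Counter(record["event"] for record in records)


structured = distill(raw_logs)
event_counts = process(structured)
print(len(structured), dict(event_counts))
```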

Difference Between Data Lakes & Data Warehouse
