Unit 5 Concepts of Big Data and Data Lake
Structured
• Any data that can be stored, accessed, and processed in a fixed format is termed
‘structured’ data. Over time, computer science has achieved great success in developing
techniques for working with such data (where the format is well known in advance) and
deriving value from it. However, issues now arise when the size of such data grows to a
huge extent; typical sizes are in the range of multiple zettabytes.
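Since structured data has a format known in advance, it maps directly onto a relational table. A minimal sketch (table and column names are purely illustrative):

```python
# Structured data: a fixed, known-in-advance schema fits a relational table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee (name, dept, salary) VALUES (?, ?, ?)",
    [("Asha", "Sales", 52000.0), ("Ravi", "IT", 61000.0)],
)
# Because the format is fixed, querying it is straightforward.
rows = conn.execute("SELECT name, salary FROM employee WHERE dept = 'IT'").fetchall()
print(rows)  # [('Ravi', 61000.0)]
```

The well-defined schema is exactly what makes such data easy to store, access, and process with traditional tools.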
Unstructured
• Any data with an unknown form or structure is classified as unstructured data. In
addition to its huge size, unstructured data poses multiple challenges when it comes to
processing it to derive value. A typical example of unstructured data is a heterogeneous
data source containing a combination of simple text files, images, videos, etc.
Nowadays, organizations have a wealth of data available to them but, unfortunately,
they do not know how to derive value from it, since the data is in its raw,
unstructured form.
Semi-structured
• Semi-structured data can contain both forms of data. Semi-structured data appears
structured in form, but it is not actually defined by, for example, a table definition in a
relational DBMS. An example of semi-structured data is data represented in an XML file.
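A short sketch of what this means in practice: XML carries its structure in tags rather than a fixed relational schema, so fields can vary or be missing per record. The XML snippet and field names below are invented for illustration:

```python
# Semi-structured data: structure travels with the data (tags), not a schema.
import xml.etree.ElementTree as ET

xml_data = """
<customers>
  <customer id="1"><name>Asha</name><city>Pune</city></customer>
  <customer id="2"><name>Ravi</name></customer>
</customers>
"""
root = ET.fromstring(xml_data)
for c in root.findall("customer"):
    name = c.findtext("name")
    # No table definition forces this field to exist, so we must handle absence.
    city = c.findtext("city", default="unknown")
    print(c.get("id"), name, city)
```

Note how the second record simply omits a field — something a relational table definition would not allow without an explicit NULL.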
Characteristics (4 V’s)
• (i) Volume – The name Big Data itself refers to an enormous size. The size of data
plays a crucial role in determining the value that can be derived from it. Whether
particular data can actually be considered Big Data depends on its volume. Hence,
‘Volume’ is one characteristic that must be considered when dealing with Big Data
solutions.
• (ii) Variety – Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. In earlier days, spreadsheets and databases were the only
data sources considered by most applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis
applications. This variety of unstructured data poses issues for storing, mining, and
analyzing data.
• (iii) Velocity – The term ‘velocity’ refers to the speed at which data is generated.
How fast data is generated and processed to meet demands determines the real
potential in the data. Big Data velocity deals with the speed at which data flows in
from sources such as business processes, application logs, networks, social media
sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
• (iv) Variability – This refers to the inconsistency that data can show at times,
hampering the process of handling and managing the data effectively.
Sources of Big Data
o Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of
data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of
logs from which users’ buying trends can be traced.
o Weather stations: Weather stations and satellites give huge volumes of data, which
are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data
through their daily transactions.
Traditional data
o Traditional data refers to structured data that is collected and stored in formats like
databases, spreadsheets, etc. Such data includes customer information, inventory
records, financial statements, etc.
o This data is stored in relational (SQL) databases and handled with other traditional
data-analysis tools. It can be easily processed and analyzed using traditional methods to
gain insight into business operations. It can also be used to create reports and
visualizations, so it plays a vital role in making profitable decisions after understanding
the trends and patterns in the data.
The main differences between traditional data and big data are as
follows:
o Traditional data is usually a small amount that can be collected and analyzed easily
using traditional methods; big data is usually so large that it cannot be processed and
analyzed easily using traditional methods.
o Traditional data is usually structured and can be stored in spreadsheets, databases,
etc.; big data includes semi-structured, unstructured, and structured data.
o Traditional data is often collected manually; big data is collected automatically by
automated systems.
o Traditional data usually comes from internal systems; big data comes from various
sources such as mobile devices, social media, etc.
o Traditional data consists of data such as customer information, financial transactions,
etc.; big data consists of data such as images, videos, etc.
o Traditional data can be analyzed with basic statistical methods; big data needs
advanced analytics methods such as machine learning, data mining, etc.
o Traditional methods of analysis are slow and gradual; big data analysis methods are
fast and near-instant.
o Traditional data is limited in its value and insights; big data provides valuable insights
and patterns for good decision-making.
o Traditional data is used for simple, small business processes; big data is used for
complex, large business processes.
o Traditional data is easier to secure and protect because of its small size and simplicity;
big data is harder to secure and protect because of its size and complexity.
o Traditional data requires less time and money to store; big data requires more.
o Traditional data can be stored on a single computer or server; big data requires
distributed storage across numerous systems.
o Traditional data handling is less efficient than big data techniques.
A Data Warehouse is a collection of data specific to the entire organization, not only to a
particular group of users.
It is not used for daily operations and transaction processing, but for decision-making.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
Characteristics of OLTP
Following are important characteristics of OLTP:
Consider a point-of-sale system of a supermarket; the following are sample queries that
such a system can process:
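The source does not list the queries themselves, so here is a hedged sketch of the kind of short, frequent reads and writes a supermarket point-of-sale system might issue. Table and column names are assumptions:

```python
# Hypothetical point-of-sale OLTP queries: a price lookup and a sale record.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL, stock INTEGER);
INSERT INTO product (name, price, stock) VALUES ('Milk', 1.20, 50), ('Bread', 0.90, 30);
CREATE TABLE sale (id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER);
""")

# 1. Retrieve the price of a scanned item (a fast, indexed read).
price = conn.execute("SELECT price FROM product WHERE name = 'Milk'").fetchone()[0]

# 2. Record a sale and decrement stock (a short, frequent write).
conn.execute("INSERT INTO sale (product_id, qty) VALUES (1, 2)")
conn.execute("UPDATE product SET stock = stock - 2 WHERE id = 1")
conn.commit()

stock = conn.execute("SELECT stock FROM product WHERE id = 1").fetchone()[0]
print(price, stock)  # 1.2 48
```

Each operation touches only a row or two and completes in milliseconds — the defining workload of OLTP.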
Architecture of OLTP
Here is the architecture of OLTP:
OLTP Architecture
1. Business / Enterprise Strategy: Enterprise strategy deals with issues that affect the
organization as a whole. In OLTP, it is typically developed at a high level within the firm,
by the board of directors or top management.
2. Business Process: OLTP business process is a set of activities and tasks that, once
completed, will accomplish an organizational goal.
3. Customers, Orders, and Products: OLTP databases store information about products,
orders (transactions), customers (buyers), suppliers (sellers), and employees.
4. ETL Processes: ETL extracts data from the various RDBMS source systems, then
transforms the data (applying concatenations, calculations, etc.) and loads the
processed data into the Data Warehouse system.
5. Data Mart and Data warehouse: A Data Mart is a structure/access pattern specific to
data warehouse environments. It is used by OLAP to store processed data.
6. Data Mining, Analytics, and Decision Making: Data stored in the data mart and data
warehouse can be used for data mining, analytics, and decision making. This data helps
you to discover data patterns, analyze raw data, and make analytical decisions for your
organization’s growth.
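Step 4 above can be sketched as a tiny extract-transform-load pipeline. Everything here is illustrative (source/warehouse table names, the 18% tax calculation), not any specific product's API:

```python
# Minimal ETL sketch: extract from a source DB, transform, load into a warehouse.
import sqlite3

# Source OLTP system (illustrative schema).
src = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE orders (id INTEGER, first TEXT, last TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'Asha', 'Rao', 100.0), (2, 'Ravi', 'Nair', 250.0);
""")

# Target warehouse table (illustrative schema).
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_orders (id INTEGER, customer TEXT, amount_with_tax REAL)")

# Extract
rows = src.execute("SELECT id, first, last, amount FROM orders").fetchall()
# Transform: concatenate the name fields; apply a calculation (assumed 18% tax).
transformed = [(i, f"{f} {l}", round(a * 1.18, 2)) for i, f, l, a in rows]
# Load
dw.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)
dw.commit()
print(dw.execute("SELECT * FROM fact_orders").fetchall())
```

In a real deployment the same three phases run over many heterogeneous sources on a schedule, but the extract/transform/load shape is the same.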
Examples of OLTP transactions:
Online banking
Online airline ticket booking
Sending a text message
Order entry
Add a book to shopping cart
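What these examples share is the short atomic transaction. A sketch of the "add a book to shopping cart" case (schema and helper are hypothetical): the stock update and the cart insert either both succeed or both roll back:

```python
# Sketch of an atomic OLTP transaction: reserve stock + add to cart together.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, stock INTEGER)")
conn.execute("INSERT INTO book VALUES (1, 'Big Data Basics', 1)")
conn.execute("CREATE TABLE cart (book_id INTEGER)")

def add_to_cart(book_id):
    try:
        with conn:  # transaction: commits on success, rolls back on error
            (stock,) = conn.execute(
                "SELECT stock FROM book WHERE id=?", (book_id,)).fetchone()
            if stock < 1:
                raise ValueError("out of stock")
            conn.execute("UPDATE book SET stock = stock - 1 WHERE id=?", (book_id,))
            conn.execute("INSERT INTO cart VALUES (?)", (book_id,))
        return True
    except ValueError:
        return False

print(add_to_cart(1))  # True  - stock reserved and cart updated together
print(add_to_cart(1))  # False - rolls back, nothing is left half-done
```

Atomicity is what lets thousands of concurrent users place orders without the database ever showing a half-finished purchase.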
Advantages of OLTP
Following are the pros/benefits of OLTP system:
Disadvantages of OLTP
Here are cons/drawbacks of OLTP system:
If the OLTP system faces hardware failures, online transactions are severely affected.
OLTP systems allow multiple users to access and change the same data at the same
time, which can sometimes create conflicting situations.
If the server hangs for even a few seconds, a large number of transactions can be
affected.
OLTP requires a lot of staff working in groups in order to maintain inventory.
Online transaction processing systems do not by themselves have proper methods of
transferring products to buyers.
OLTP makes the database much more susceptible to hackers and intruders.
In B2B transactions, there is a chance that both buyers and suppliers miss out on the
efficiency advantages the system offers.
A server failure may wipe out large amounts of data from the database.
Only a limited number of queries and updates can be performed.
Differences between OLTP and OLAP:
o Source of data: OLTP and its transactions are the original source of data; for OLAP,
different OLTP databases become the source of data.
o Data integrity: an OLTP database must maintain data integrity constraints; an OLAP
database is not frequently modified, so data integrity is not an issue.
o Data organization: the data in an OLTP database is always detailed and organized;
the data in the OLAP process might not be organized.
o Backup: OLTP needs a complete backup of the data combined with incremental
backups; OLAP needs a backup only from time to time, and backup is less critical
than in OLTP.
o Users: OLTP is used by data-critical users such as clerks, DBAs, and database
professionals; OLAP is used by data-knowledge users such as workers, managers,
and CEOs.
o Purpose: OLTP helps to increase users’ self-service and productivity; OLAP helps to
increase the productivity of business analysts.
o Cost and skills: data warehouses have historically been development projects that
may prove costly to build; an OLAP cube is not an open SQL-server data warehouse,
so technical knowledge and experience are essential to manage the OLAP server.
A Data Lake is like a large container, very similar to a real lake fed by rivers. Just as a lake
has multiple tributaries coming in, a data lake has structured data, unstructured data,
machine-to-machine data, and logs flowing in, often in real time.
• Ingestion Tier: The tiers on the left side depict the data sources. The data can be
loaded into the data lake in batches or in real time.
• Insights Tier: The tiers on the right represent the research side, where insights from the
system are used. SQL queries, NoSQL queries, or even Excel can be used for data analysis.
• Distillation Tier: Takes data from the storage tier and converts it to structured data for
easier analysis.
• Processing Tier: Runs analytical algorithms and user queries in varying modes (real-time,
interactive, batch) to generate structured data for easier analysis.
• Unified Operations Tier: Governs system management and monitoring; it includes
auditing and proxy management.
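The ingestion-to-insights flow above can be sketched in miniature. The record formats, field names, and the "distillation" logic below are all invented to illustrate the idea of raw, mixed-format data being converted into one structured shape:

```python
# Toy data-lake flow: raw heterogeneous records land as-is (ingestion),
# then a distillation step converts them into one structured schema.
import json
import csv
import io

# Ingestion tier: events arrive in whatever shape each source emits.
raw_lake = [
    '{"sensor": "t1", "temp": 21.5}',          # machine-to-machine JSON
    "2024-01-01 INFO user=42 action=login",    # application log line
]

# Distillation tier: normalize every raw record into (kind, key, value) rows.
structured = []
for rec in raw_lake:
    if rec.lstrip().startswith("{"):
        d = json.loads(rec)
        structured.append(("sensor", d["sensor"], str(d["temp"])))
    else:
        fields = dict(p.split("=") for p in rec.split() if "=" in p)
        structured.append(("log", fields["user"], fields["action"]))

# Insights tier: structured rows can now feed SQL or spreadsheet tools.
out = io.StringIO()
csv.writer(out).writerows(structured)
print(out.getvalue())
```

The key design point the tiers capture: the lake stores data in its raw form first, and structure is imposed later, only when analysis needs it (schema-on-read).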