
BIG DATA ANALYSIS

M. TECH - SEM III

APRIL 1, 2024
DESH BANDHU BHATT
MDSU, AJMER
What is Big Data
What exactly is big data?

Big data is data that arrives in greater variety, in increasing volumes, and with higher velocity. These characteristics are known as the three "Vs."

We normally work with data of megabyte size (Word documents, Excel files) or at most gigabytes (movies, code repositories), but data at petabyte scale, i.e. 10^15 bytes, is called big data. It is often stated that almost 90% of today's data has been generated in the past three years.

The Three “Vs” of Big Data


1. Volume: The amount of data matters. With big data, you’ll have to process
high volumes of low-density, unstructured data. This can be data of unknown
value, such as X (formerly Twitter) data feeds, clickstreams on a web page or a
mobile app, or sensor-enabled equipment. For some organizations, this might
be tens of terabytes of data. For others, it may be hundreds of petabytes.

2. Velocity: Velocity is the fast rate at which data is received and (perhaps) acted
on. Normally, the highest velocity of data streams directly into memory versus
being written to disk. Some internet-enabled smart products operate in real
time or near real time and will require real-time evaluation and action.

3. Variety: Variety refers to the many types of data that are available. Traditional
data types were structured and fit neatly in a relational database. With the rise
of big data, data comes in new unstructured data types. Unstructured and
semistructured data types, such as text, audio, and video, require additional
preprocessing to derive meaning and support metadata.

In addition to the 3Vs, some have expanded this framework to include other
characteristics such as:

4. Veracity: This refers to the quality of the data. Big data may include
inaccurate, incomplete, or inconsistent data, and dealing with such data
quality issues is a significant challenge in big data analytics.

5. Value: Ultimately, the goal of analyzing big data is to derive value from it. This
could involve gaining insights, making predictions, optimizing processes, or
creating new products and services.

6. Variability: Refers to the inconsistency of the data flow. This can mean a
change in the data's velocity or volume, or it can mean the nature of the data
itself is changing.

Big data technologies and analytics techniques, such as Hadoop, Spark, NoSQL
databases, machine learning, and data mining, are employed to extract insights,
patterns, and trends from these massive datasets, enabling organizations to make
data-driven decisions and gain competitive advantages.

The History of Big Data


Although the concept of big data itself is relatively new, the origins of large data sets
go back to the 1960s and ‘70s when the world of data was just getting started with
the first data centers and the development of the relational database.

Around 2005, people began to realize just how much data users generated through
Facebook, YouTube, and other online services. Hadoop (an open source framework
created specifically to store and analyze big data sets) was developed that same year.
NoSQL also began to gain popularity during this time.

The development of open source frameworks, such as Hadoop (and more recently,
Spark) was essential for the growth of big data because they make big data easier to

work with and cheaper to store. In the years since then, the volume of big data has
skyrocketed. Users are still generating huge amounts of data—but it’s not just
humans who are doing it.

With the advent of the Internet of Things (IoT), more objects and devices are
connected to the internet, gathering data on customer usage patterns and product
performance. The emergence of machine learning has produced still more data.

While big data has come far, its usefulness is only just beginning. Cloud computing
has expanded big data possibilities even further. The cloud offers truly elastic
scalability, where developers can simply spin up ad hoc clusters to test a subset of
data. And graph databases are becoming increasingly important as well, with their
ability to display massive amounts of data in a way that makes analytics fast and
comprehensive.

Sources of Big Data


These data come from many sources like

 Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.

 E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.

 Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.

 Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and to do so they store the data of millions of users.

 Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.

Big Data Benefits


 Big data makes it possible for you to gain more complete answers because
you have more information.
 More complete answers mean more confidence in the data—which means a
completely different approach to tackling problems.

Big Data Use Cases
Big data can help you address a range of business activities, including customer
experience and analytics. Here are just a few.

 Product development: Companies like Netflix and Procter & Gamble use big
data to anticipate customer demand. They build predictive models for new
products and services by classifying key attributes of past and current
products or services and modeling the relationship between those attributes
and the commercial success of the offerings. In addition, P&G uses data and
analytics from focus groups, social media, test markets, and early store
rollouts to plan, produce, and launch new products.

 Predictive maintenance: Factors that can predict mechanical failures may be


deeply buried in structured data, such as the year, make, and model of
equipment, as well as in unstructured data that covers millions of log entries,
sensor data, error messages, and engine temperature. By analyzing these
indications of potential issues before the problems happen, organizations can
deploy maintenance more cost effectively and maximize parts and equipment
uptime.

 Customer experience: The race for customers is on. A clearer view of


customer experience is more possible now than ever before. Big data enables
you to gather data from social media, web visits, call logs, and other sources
to improve the interaction experience and maximize the value delivered. Start
delivering personalized offers, reduce customer churn, and handle issues
proactively.

 Fraud and compliance: When it comes to security, it’s not just a few rogue
hackers—you’re up against entire expert teams. Security landscapes and
compliance requirements are constantly evolving. Big data helps you identify
patterns in data that indicate fraud and aggregate large volumes of
information to make regulatory reporting much faster.

 Machine learning: Machine learning is a hot topic right now. And data—
specifically big data—is one of the reasons why. We are now able to teach
machines instead of program them. The availability of big data to train
machine learning models makes that possible.

 Operational efficiency: Operational efficiency may not always make the


news, but it’s an area in which big data is having the most impact. With big
data, you can analyze and assess production, customer feedback and returns,
and other factors to reduce outages and anticipate future demands. Big data

can also be used to improve decision-making in line with current market
demand.

 Drive innovation: Big data can help you innovate by studying


interdependencies among humans, institutions, entities, and process and then
determining new ways to use those insights. Use data insights to improve
decisions about financial and planning considerations. Examine trends and
what customers want to deliver new products and services. Implement
dynamic pricing. There are endless possibilities.

How Big Data Works / Solution


Big data gives you new insights that open up new opportunities and business
models. Getting started involves three key actions:

1. Integrate: Big data brings together data from many disparate sources and
applications. Traditional data integration mechanisms, such as extract, transform,
and load (ETL) generally aren’t up to the task. It requires new strategies and
technologies to analyze big data sets at terabyte, or even petabyte, scale.

During integration, you need to bring in the data, process it, and make sure it’s
formatted and available in a form that your business analysts can get started with.

2. Manage: Big data requires storage. Your storage solution can be in the cloud, on
premises, or both. You can store your data in any form you want and bring your
desired processing requirements and necessary process engines to those data
sets on an on-demand basis. Many people choose their storage solution
according to where their data is currently residing.

The cloud is gradually gaining popularity because it supports your current


compute requirements and enables you to spin up resources as needed.

3. Storage: To store this huge amount of data, Hadoop uses HDFS (the Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion. It works on the write-once, read-many-times principle.

4. Processing: The MapReduce paradigm is applied to the data distributed over the network to compute the required output (a minimal sketch of the idea follows this list).

5. Analyze: Tools such as Pig and Hive can be used to analyze the data.

6. Cost: Hadoop is open source, so licensing cost is no longer an issue.
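
To make the processing step concrete, below is a minimal, self-contained sketch of the MapReduce idea in plain Python. It only simulates the map, shuffle, and reduce phases on a single machine; in a real Hadoop cluster these phases run in parallel across many nodes and the input would come from HDFS rather than an in-memory list. All names and data here are illustrative.

```python
from collections import defaultdict

# Toy input: in Hadoop this would be blocks of a file stored in HDFS.
documents = [
    "big data needs distributed storage",
    "hadoop stores big data in hdfs",
    "mapreduce processes big data",
]

# Map phase: emit (key, value) pairs -- here, (word, 1) for every word.
def map_phase(doc):
    for word in doc.split():
        yield (word, 1)

# Shuffle phase: group all values by key (Hadoop does this across the network).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

# Reduce phase: combine the grouped values for each key into a final result.
def reduce_phase(word, counts):
    return (word, sum(counts))

word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())
print(word_counts)  # e.g. {'big': 3, 'data': 3, 'hadoop': 1, ...}
```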

Why big data
Big data has become increasingly important for several reasons:

1. Data-driven decision making: With the growth of big data, organizations


have access to vast amounts of information about their operations, customers,
markets, and more. Analyzing this data can provide valuable insights that can
inform decision-making processes, leading to better strategies, improved
efficiency, and competitive advantages.

2. Improved customer understanding: Big data analytics allows organizations


to better understand their customers' behaviors, preferences, and needs. By
analyzing large volumes of customer data from various sources such as social
media, online transactions, and customer interactions, companies can tailor
their products, services, and marketing efforts to meet customer expectations
more effectively.

3. Innovation and new business opportunities: Big data can uncover new
business opportunities and drive innovation. By analyzing market trends,
consumer behaviors, and emerging technologies, organizations can identify
new product and service offerings, enter new markets, and stay ahead of
competitors.

4. Optimized operations and processes: Big data analytics can help


organizations optimize their operations and processes. By analyzing data from
various sources such as supply chain systems, manufacturing equipment, and
operational sensors, companies can identify inefficiencies, reduce costs, and
improve productivity.

5. Risk management and fraud detection: Big data analytics can help
organizations mitigate risks and detect fraudulent activities. By analyzing
patterns and anomalies in large datasets, companies can identify potential
risks, such as credit card fraud, cybersecurity threats, or supply chain
disruptions, and take proactive measures to mitigate them.

6. Personalized experiences: Big data enables organizations to deliver


personalized experiences to customers, employees, and other stakeholders. By
analyzing individual preferences, behaviors, and past interactions, companies
can tailor products, services, and communications to meet specific needs and
preferences, enhancing customer satisfaction and loyalty.

Overall, big data has the potential to transform industries, drive innovation, and
create significant value for organizations across various sectors. However, effectively
harnessing the power of big data requires advanced analytics capabilities, robust
data management practices, and a strategic approach to data-driven decision
making.

Data storage and analysis


What is Big Data Storage?
Big data storage refers to the technologies developed to hold data sets too large for conventional systems. These technologies first emerged in the early 2000s, when companies were faced with storing massive amounts of data that they could not keep on their existing servers.

The problem was that traditional storage methods couldn't handle storing all this
data, so companies had to look for new ways to keep it. That's when Big Data
Storage came into being. It's a way for companies to store large amounts of data
without worrying about running out of space.

Big Data Storage Challenges


Big data is a hot topic in IT. Every month, more companies are adopting it to help
them improve their businesses. But with any new technology comes challenges and
questions, and big data is no exception.

The first challenge is how much storage you'll need for your big data system. If you're going to store large amounts of information about your customers and their behavior, you'll need a lot of space for that data to live.

It's not uncommon for large companies like Google or Facebook to have petabytes
(1 million gigabytes) of storage explicitly dedicated to their big data needs, and that's
only one company!

Another challenge with big data is how quickly it grows. Companies are constantly
gathering new types of information about their customer's habits and preferences,
and they're looking at ways they can use this information to improve their products
or services.

As a result, big data systems will continue growing exponentially until something stops them. This means it is essential for companies that want to use this technology effectively to plan how they will deal with that growth down the road, before it becomes too much to handle.

Big Data Storage Key Considerations


Big data storage is a complicated problem. There are many things to consider when
building the infrastructure for your big data project, but there are three key considerations to weigh before you move forward.

 Data velocity: Your data must be able to move quickly between processing
centers and databases for it to be helpful in real-time applications.

 Scalability: The system should be able to expand as your business does and
accommodate new projects as needed without disrupting existing workflows
or causing any downtime.

 Cost efficiency: Because big data projects can be so expensive, choosing a


system that reduces costs without sacrificing the quality of service or
functionality is essential.

Finally, consider how long you want your stored data to remain accessible. If you're
planning on keeping it for years (or even decades), you may need more than one
storage solution.

Key Insights for Big Data Storage


Big data storage is a critical part of any business. The sheer volume of data being
created and stored by companies is staggering and growing daily. But without a
proper strategy for storing and protecting this data, your business could be
vulnerable to hackers—and your bottom line could suffer.

Here are some critical insights for big data storage:

 Have a plan for how you'll organize your data before you start collecting it. This will ensure you can find what you need when you need it.

 Ensure your team understands that security is essential when dealing with sensitive information. Everyone in the company needs to be trained on best practices for protecting data and preventing hacks.

 Remember backup plans! You never want to be stuck, unable to access your information because something went wrong with the server or hardware it's stored on.

Data Storage Methods


Warehouse and cloud storage are two of the most popular options for storing big
data. Warehouse storage is typically done on-site, while cloud storage involves
storing your data offsite in a secure location.

 Warehouse Storage
Warehouse storage is one of the more common ways to store large amounts
of data, but it has drawbacks. For example, if you need immediate access to
your data and want to avoid delays or problems accessing it over the internet,
there might be better options than this. Also, warehouse storage can be
expensive if you're looking for long-term contracts or need extra personnel to
manage your warehouse space.

 Cloud Storage
Cloud storage is an increasingly popular option, since advances in services such as Amazon Web Services (AWS) make it easier than ever to use. With an object storage service like Amazon S3, you can store very large amounts of data on a pay-as-you-go basis without provisioning capacity in advance, and the service scales as your data grows (a short upload sketch follows this list).
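
As an illustration of cloud object storage, the sketch below uploads and retrieves a file with Amazon S3 using the boto3 library. The bucket name, object key, and file paths are placeholders, and the code assumes AWS credentials are already configured in the environment; treat it as a sketch of the pattern rather than a complete storage solution.

```python
import boto3

# Assumes AWS credentials are configured (e.g. via environment variables or ~/.aws).
s3 = boto3.client("s3")

bucket = "example-big-data-bucket"        # placeholder bucket name
key = "raw/clickstream-2024-04-01.csv"    # placeholder object key

# Upload a local file to the bucket.
s3.upload_file("clickstream.csv", bucket, key)

# List what is stored under the "raw/" prefix.
response = s3.list_objects_v2(Bucket=bucket, Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download the object back to local disk when it is needed for analysis.
s3.download_file(bucket, key, "clickstream_copy.csv")
```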

Data Storage Technologies


Apache Hadoop, Apache HBase, and Snowflake are three big data storage
technologies often used in the data lake analytics paradigm.

 Hadoop

Hadoop has gained considerable attention as it is one of the most common
frameworks to support big data analytics.

Hadoop is an open-source distributed processing framework that enables large data sets to be processed across clusters of computers. It was originally designed to process and store large data sets on clusters of commodity hardware.

 HBase
HBase is a NoSQL, column-oriented store that complements Hadoop. It is designed to efficiently manage large tables with billions of rows and millions of columns, and its performance can be tuned by adjusting memory usage, the number of servers, block size, and other settings (a small client sketch follows this list).

 Snowflake
Snowflake for Data Lake Analytics is an enterprise-grade cloud data platform for advanced analytics applications. It offers access to historical and streaming data from many sources and formats at scale, without requiring changes to existing applications or workflows, and it lets users quickly scale up their processing power as needed without worrying about infrastructure management tasks such as provisioning and maintenance.
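
For HBase, the short sketch below uses the happybase Python client to write and read a row from a column-family table. The host, table name, row keys, and column names are placeholders, and it assumes an HBase Thrift server is running; it is a sketch of the put/get/scan pattern, not a tuned production client.

```python
import happybase

# Assumes an HBase Thrift server is reachable on this host (placeholder).
connection = happybase.Connection("hbase-host")
table = connection.table("sensor_readings")  # placeholder table with column family 'cf'

# Write one cell per column; row keys and columns are byte strings in HBase.
table.put(b"machine-42#2024-04-01T10:00", {
    b"cf:temperature": b"71.5",
    b"cf:status": b"OK",
})

# Read the row back.
row = table.row(b"machine-42#2024-04-01T10:00")
print(row[b"cf:temperature"])

# Scan a key range -- HBase keeps rows sorted by key, which makes range scans cheap.
for key, data in table.scan(row_prefix=b"machine-42#"):
    print(key, data)

connection.close()
```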

Data storage
Data storage in big data typically involves the following:

1. Distributed Storage Systems: Traditional relational databases are often not


well-suited for handling big data due to their scalability limitations. Instead,
distributed storage systems are commonly used. These systems distribute data
across multiple nodes or servers, allowing for horizontal scalability and fault
tolerance. Examples include Hadoop Distributed File System (HDFS), Apache
HBase, and Amazon S3.

2. Data Warehouses: Data warehouses are specialized databases optimized for


analytical queries and reporting. They consolidate data from various sources
and provide a unified view for analysis. Data warehouses are often used for
storing structured data and are essential for business intelligence and
analytics. Popular data warehousing solutions include Amazon Redshift,
Google BigQuery, and Snowflake.

3. NoSQL Databases: NoSQL (Not Only SQL) databases are designed to handle
large volumes of unstructured or semi-structured data and are a key
component of big data ecosystems. NoSQL databases offer flexible data
models and horizontal scalability, making them well-suited for applications
such as web and mobile apps, real-time analytics, and content management
systems. Examples include MongoDB, Cassandra, and Apache CouchDB.

4. Data Lakes: Data lakes are storage repositories that can store vast amounts of
raw data in its native format until it's needed for analysis. Unlike traditional
data warehouses, which require structured data, data lakes can accommodate
structured, semi-structured, and unstructured data. Data lakes provide
flexibility and scalability for storing and analyzing diverse datasets. Popular
data lake solutions include Apache Hadoop, Apache Spark, and Amazon S3.

5. In-Memory Databases: In-memory databases store data in system memory


(RAM) rather than on disk, allowing for faster data access and processing. In-
memory databases are well-suited for real-time analytics, high-speed
transactions, and low-latency applications. Examples include SAP HANA, Redis,
and Apache Ignite.
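
As a small illustration of the in-memory model, the sketch below stores and reads values with Redis through the redis-py client. The host, key names, and values are placeholders, and a Redis server is assumed to be running locally; it is a sketch of the key-value access pattern, not a full caching layer.

```python
import redis

# Assumes a Redis server is running locally on the default port.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Writes and reads go to RAM, which is what keeps access latency so low.
r.set("session:1001:last_page", "/checkout")
r.incr("page_views:/checkout")              # atomic counter, handy for real-time metrics
r.expire("session:1001:last_page", 3600)    # expire the key after one hour

print(r.get("session:1001:last_page"))      # -> /checkout
print(r.get("page_views:/checkout"))        # -> 1
```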

What is big data analysis?


Big data analytics describes the process of uncovering trends, patterns, and
correlations in large amounts of raw data to help make data-informed decisions.
These processes use familiar statistical analysis techniques—like clustering and
regression—and apply them to more extensive datasets with the help of newer tools.
Big data has been a buzzword since the early 2000s, when software and hardware
capabilities made it possible for organizations to handle large amounts of
unstructured data. Since then, new technologies—from Amazon to smartphones—
have contributed even more to the substantial amounts of data available to
organizations. With the explosion of data, early innovation projects like Hadoop,
Spark, and NoSQL databases were created for the storage and processing of big
data. This field continues to evolve as data engineers look for ways to integrate the
vast amounts of complex information created by sensors, networks, transactions,
smart devices, web usage, and more. Even now, big data analytics methods are being

used with emerging technologies, like machine learning, to discover and scale more
complex insights.

How big data analysis works


Big data analytics refers to collecting, processing, cleaning, and analyzing large
datasets to help organizations operationalize their big data.

1. Collect Data
Data collection looks different for every organization. With today’s
technology, organizations can gather both structured and unstructured data
from a variety of sources — from cloud storage to mobile applications to in-
store IoT sensors and beyond. Some data will be stored in data
warehouses where business intelligence tools and solutions can access it
easily. Raw or unstructured data that is too diverse or complex for a
warehouse may be assigned metadata and stored in a data lake.

2. Process Data
Once data is collected and stored, it must be organized properly to get
accurate results on analytical queries, especially when it’s large and
unstructured. Available data is growing exponentially, making data processing
a challenge for organizations. One processing option is batch processing,
which looks at large data blocks over time. Batch processing is useful when
there is a longer turnaround time between collecting and analyzing
data. Stream processing looks at small batches of data at once, shortening
the delay time between collection and analysis for quicker decision-making.
Stream processing is more complex and often more expensive.

3. Clean Data
Data big or small requires scrubbing to improve data quality and get stronger
results; all data must be formatted correctly, and any duplicative or
irrelevant data must be eliminated or accounted for. Dirty data can obscure
and mislead, creating flawed insights.

4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced
analytics processes can turn big data into big insights. Some of these big data
analysis methods include:

o Data mining sorts through large datasets to identify patterns and


relationships by identifying anomalies and creating data clusters.

o Predictive analytics uses an organization’s historical data to make
predictions about the future, identifying upcoming risks and
opportunities.

o Deep learning imitates human learning patterns by using artificial


intelligence and machine learning to layer algorithms and find patterns
in the most complex and abstract data.
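
The cleaning and analysis steps above can be illustrated with a small sketch using pandas and scikit-learn: drop duplicates and missing values, then run a simple clustering (one of the data mining techniques mentioned). The column names and data are made up, and a real big data pipeline would run equivalent logic on a distributed engine such as Spark.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy customer data; a real pipeline would read from a data lake or warehouse.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "monthly_spend": [120.0, 80.0, 80.0, None, 300.0, 310.0],
    "visits": [10, 6, 6, 4, 25, 27],
})

# Clean: remove duplicate records and rows with missing values.
clean = df.drop_duplicates().dropna()

# Analyze: group customers into clusters based on spend and visit behaviour.
features = clean[["monthly_spend", "visits"]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clean = clean.assign(segment=kmeans.fit_predict(features))

print(clean)
```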

Supporting technologies for big data analysis


 Stream Processing: Stream processing systems analyze and process data in real time as it is generated or ingested. These systems are essential for handling high-velocity data streams from sources such as sensors, IoT devices, social media, and financial transactions. Stream processing technologies include Apache Kafka, Apache Flink, and Amazon Kinesis (a minimal consumer sketch follows this list).

 Data Analytics Platforms: Data analytics platforms provide tools and


frameworks for analyzing and deriving insights from big data. These platforms
typically include data visualization tools, machine learning algorithms, and
data processing frameworks. Examples include Apache Spark, TensorFlow, and
Microsoft Azure Machine Learning.

 Data Governance and Security: Data governance and security are critical
aspects of big data storage and analysis. Organizations must implement
robust data governance policies, access controls, encryption, and compliance
measures to protect sensitive data and ensure regulatory compliance.
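
To make the stream processing bullet above concrete, here is a minimal consumer sketch using the kafka-python client. The topic name, broker address, and JSON message format are assumptions, and the "processing" is just a running count; real deployments would typically use a framework such as Flink or Spark Structured Streaming for stateful, fault-tolerant processing.

```python
import json
from collections import Counter
from kafka import KafkaConsumer

# Assumes a Kafka broker at localhost:9092 and a topic named "clickstream"
# whose messages are JSON objects like {"user": "u1", "page": "/home"}.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

page_hits = Counter()

# Process events as they arrive instead of waiting for a nightly batch.
# This loop runs until the process is stopped.
for message in consumer:
    event = message.value
    page_hits[event["page"]] += 1
    print(message.offset, event["page"], page_hits[event["page"]])
```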

Big data analysis tools and technology


Big data analytics cannot be narrowed down to a single tool or technology. Instead,
several types of tools work together to help you collect, process, cleanse, and analyze
big data. Some of the major players in big data ecosystems are listed below.

 Hadoop is an open-source framework that efficiently stores and processes big


datasets on clusters of commodity hardware. This framework is free and can
handle large amounts of structured and unstructured data, making it a
valuable mainstay for any big data operation.

 NoSQL databases are non-relational data management systems that do not


require a fixed schema, making them a great option for big, raw, unstructured
data. NoSQL stands for “not only SQL,” and these databases can handle a
variety of data models.

 MapReduce is an essential component of the Hadoop framework, serving two functions. The first is mapping, which filters and distributes data to the various nodes within the cluster. The second is reducing, which organizes and reduces the results from each node to answer a query (a short sketch follows this list).

 YARN stands for “Yet Another Resource Negotiator.” It is another component


of second-generation Hadoop. The cluster management technology helps
with job scheduling and resource management in the cluster.

 Spark is an open source cluster computing framework that uses implicit data
parallelism and fault tolerance to provide an interface for programming entire
clusters. Spark can handle both batch and stream processing for fast
computation.

 Tableau is an end-to-end data analytics platform that allows you to prep,


analyze, collaborate, and share your big data insights. Tableau excels in self-
service visual analysis, allowing people to ask new questions of governed big
data and easily share those insights across the organization.
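
The MapReduce and Spark entries above can be tied together with a short PySpark sketch: the same map and reduce steps from the Hadoop world, expressed against Spark's RDD API. The input path is a placeholder, and the sketch assumes a local Spark installation with the pyspark package available.

```python
from pyspark.sql import SparkSession

# Run Spark locally; on a cluster the master URL would point at YARN or similar.
spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("logs.txt")  # placeholder input path (could be an HDFS or S3 URI)

counts = (
    lines.flatMap(lambda line: line.split())   # map: one record per word
         .map(lambda word: (word, 1))          # map: emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # reduce: sum the counts per word
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```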

The big benefits of big data analysis


The ability to analyze more data at a faster rate can provide big benefits to an
organization, allowing it to more efficiently use data to answer important questions.
Big data analytics is important because it lets organizations use colossal amounts of
data in multiple formats from multiple sources to identify opportunities and risks,
helping organizations move quickly and improve their bottom lines. Some benefits of
big data analytics include:

 Cost savings. Helping organizations identify ways to do business more


efficiently

 Product development. Providing a better understanding of customer needs

 Market insights. Tracking purchase behavior and market trends


The big challenges of big data analysis


Big data brings big benefits, but it also brings big challenges, such as new privacy and security concerns, accessibility for business users, and choosing the right solutions

for your business needs. To capitalize on incoming data, organizations will have to
address the following:

 Making big data accessible. Collecting and processing data becomes more
difficult as the amount of data grows. Organizations must make data easy and
convenient for data owners of all skill levels to use.

 Maintaining quality data. With so much data to maintain, organizations are


spending more time than ever before scrubbing for duplicates, errors,
absences, conflicts, and inconsistencies.

 Keeping data secure. As the amount of data grows, so do privacy and


security concerns. Organizations will need to strive for compliance and put
tight data processes in place before they take advantage of big data.

 Finding the right tools and platforms. New technologies for processing and
analyzing big data are developed all the time. Organizations must find the
right technology to work within their established ecosystems and address
their particular needs. Often, the right solution is also a flexible solution that
can accommodate future infrastructure changes.

Comparison with other systems


Traditional data: Traditional data is the structured data maintained by businesses of all sizes, from very small firms to large organizations. In a traditional database system, a centralized database architecture is used to store and maintain the data in a fixed format, or as fields in a file. Structured Query Language (SQL) is used for managing and accessing the data.

Big data: Big data can be considered an extension of traditional data. It deals with data sets that are too large or complex to manage with traditional data-processing application software, and it covers large volumes of structured, semi-structured, and unstructured data. Volume, velocity, variety, veracity, and value are the 5 V characteristics of big data. Big data refers not only to large amounts of data but also to extracting meaningful insights by analyzing huge and complex data sets.

The main differences between traditional data and big data as follows:

 Volume: Traditional data typically refers to small to medium-sized datasets


that can be easily stored and analyzed using traditional data processing
technologies. In contrast, big data refers to extremely large datasets that
cannot be easily managed or processed using traditional technologies.

 Variety: Traditional data is typically structured, meaning it is organized in a
predefined manner such as tables, columns, and rows. Big data, on the other
hand, can be structured, unstructured, or semi-structured, meaning it may
contain text, images, videos, or other types of data.

 Velocity: Traditional data is usually static and updated on a periodic basis. In


contrast, big data is constantly changing and updated in real-time or near
real-time.

 Complexity: Traditional data is relatively simple to manage and analyze.

Big data, on the other hand, is complex and requires specialized tools and
techniques to manage, process, and analyze.

 Value: Traditional data typically has a lower potential value than big data
because it is limited in scope and size. Big data, on the other hand, can
provide valuable insights into customer behavior, market trends, and other
business-critical information.

Some similarities between them, including:

 Data Quality: The quality of data is essential in both traditional and big data
environments. Accurate and reliable data is necessary for making informed
business decisions.

 Data Analysis: Both traditional and big data require some form of analysis to
derive insights and knowledge from the data. Traditional data analysis
methods typically involve statistical techniques and visualizations, while big
data analysis may require machine learning and other advanced techniques.

 Data Storage: In both traditional and big data environments, data needs to
be stored and managed effectively. Traditional data is typically stored in
relational databases, while big data may require specialized technologies such
as Hadoop, NoSQL, or cloud-based storage systems.

 Data Security: Data security is a critical consideration in both traditional and


big data environments. Protecting sensitive information from unauthorized
access, theft, or misuse is essential in both contexts.

 Business Value: Both traditional and big data can provide significant value to
organizations. Traditional data can provide insights into historical trends and
patterns, while big data can uncover new opportunities and help organizations
make more informed decisions.

The main differences between traditional data and big data are as follows:

 Traditional data is usually a small amount of data that can be collected and analyzed easily using traditional methods; big data is usually a large amount of data that cannot be processed and analyzed easily using traditional methods.

 Traditional data is usually structured and can be stored in spreadsheets, databases, etc.; big data includes structured, semi-structured, and unstructured data.

 Traditional data is often collected manually; big data is collected automatically with the use of automated systems.

 Traditional data usually comes from internal systems; big data comes from various sources such as mobile devices, social media, etc.

 Traditional data consists of data such as customer information, financial transactions, etc.; big data consists of data such as images, videos, etc.

 Traditional data can be analyzed with basic statistical methods; big data needs advanced analytics methods such as machine learning, data mining, etc.

 Traditional methods to analyze data are slow and gradual; methods to analyze big data are fast and near-instant.

 Traditional data is generated after an event happens; big data is generated every second.

 Traditional data is typically processed in batches; big data is generated and processed in real time.

 Traditional data is limited in its value and insights; big data provides valuable insights and patterns for good decision-making.

 Traditional data contains reliable and accurate data; big data may contain unreliable, inconsistent, or inaccurate data because of its size and complexity.

 Traditional data is used for simple and small business processes; big data is used for complex and big business processes.

 Traditional data does not provide in-depth insights; big data provides in-depth insights.

 Traditional data is easier to secure and protect because of its small size and simplicity; big data is harder to secure and protect because of its size and complexity.

 Traditional data requires less time and money to store; big data requires more time and money to store.

 Traditional data can be stored on a single computer or server; big data requires distributed storage across numerous systems.

 Traditional data processing is less efficient; big data processing is more efficient.

 Traditional data can easily be managed in a centralized structure; big data requires a decentralized infrastructure to manage the data.

Comparison with more systems


Comparing big data systems with other types of systems, such as traditional
databases or data processing systems, highlights the differences in their capabilities,
architectures, and use cases. Here's a comparison between big data systems and
other systems:

1. Traditional Databases:

 Structure: Traditional databases are typically relational databases with


structured data organized in tables with predefined schemas. Big data
systems, on the other hand, can handle structured, semi-structured, and
unstructured data.

 Scalability: Traditional databases may have scalability limitations, especially


when dealing with large volumes of data. Big data systems are designed for
horizontal scalability across distributed clusters of commodity hardware.

 Use Cases: Traditional databases are well-suited for transactional processing,
OLTP (Online Transaction Processing), and structured analytics. Big data
systems are better suited for handling large-scale analytics, real-time
processing, and unstructured data analysis.

2. Data Warehouses:

 Data Model: Data warehouses are optimized for structured data and typically
use a star or snowflake schema. Big data systems can handle various data
types, including structured, semi-structured, and unstructured data.

 Processing Paradigm: Data warehouses are designed for batch processing of


structured data. Big data systems support batch, real-time, and stream
processing paradigms.

 Scalability: Data warehouses may have limitations in scalability and may


require expensive hardware upgrades. Big data systems offer horizontal
scalability by adding more nodes to the cluster.

3. In-Memory Databases:

 Data Storage: In-memory databases store data in system memory (RAM),


providing faster access to data compared to disk-based storage. Big data
systems may use a combination of in-memory and disk-based storage
depending on the workload and requirements.

 Processing Speed: In-memory databases excel in low-latency applications


and real-time analytics. Big data systems can also support real-time
processing but may have slightly higher latency due to the distributed nature
of processing.

 Use Cases: In-memory databases are suitable for applications requiring high-
speed transactions, real-time analytics, and low-latency processing. Big data
systems are ideal for handling large-scale analytics, big data processing, and
complex data analysis tasks.

4. Stream Processing Systems:

 Data Processing: Stream processing systems analyze and process data in


real-time as it is generated or ingested. Big data systems can support stream
processing but may require additional components or frameworks for real-
time analytics.

 Scalability: Stream processing systems are designed to scale horizontally to


handle high-velocity data streams. Big data systems offer similar scalability but
may require additional configuration and optimization for stream processing
workloads.

 Use Cases: Stream processing systems are used for real-time analytics, event
processing, and IoT applications. Big data systems are suitable for batch
processing, large-scale analytics, and handling diverse data types.

In summary, big data systems offer greater flexibility, scalability, and support for
diverse data types compared to traditional databases, data warehouses, in-memory
databases, and stream processing systems. However, the choice between these
systems depends on the specific requirements, workload characteristics, and use
cases of the application.

Relational database management system


What is a Relational Database (RDBMS)?

A relational database is a type of database that stores and provides access to data
points that are related to one another. Relational databases are based on the
relational model, an intuitive, straightforward way of representing data in tables. In a
relational database, each row in the table is a record with a unique ID called the key.
The columns of the table hold attributes of the data, and each record usually has a
value for each attribute, making it easy to establish the relationships among data
points.

Relational database defined

A relational database (RDB) is a way of structuring information in tables, rows, and


columns. An RDB has the ability to establish links—or relationships–between
information by joining tables, which makes it easy to understand and gain insights
about the relationship between various data points.

Relational database management systems maintain data integrity by enforcing the following features:

 Entity Integrity: No two records of a database table can be complete duplicates.

 Referential Integrity: Rows of a table can be deleted only if they are not referenced by other tables; otherwise, deletion may lead to data inconsistency.

 User-defined Integrity: Rules defined by the users based on confidentiality


and access.

 Domain integrity: The columns of the database tables are enclosed within
some structured limits, based on default values, type of data or ranges.

The relational database model


Developed by E.F. Codd from IBM in the 1970s, the relational database model allows
any table to be related to another table using a common attribute. Instead of using
hierarchical structures to organize data, Codd proposed a shift to using a data model
where data is stored, accessed, and related in tables without reorganizing the tables
that contain them.

Think of the relational database as a collection of spreadsheet files that help


businesses organize, manage, and relate data. In the relational database model, each
“spreadsheet” is a table that stores information, represented as columns (attributes)
and rows (records or tuples).

Attributes (columns) specify a data type, and each record (or row) contains the value
of that specific data type. All tables in a relational database have an attribute known
as the primary key, which is a unique identifier of a row, and each row can be used to
create a relationship between different tables using a foreign key—a reference to a
primary key of another existing table.

Let’s take a look at how the relational database model works in practice:

Say you have a Customer table and an Order table.

The Customer table contains data about the customer:

 Customer ID (primary key)

 Customer name

 Billing address

 Shipping address

In the Customer table, the customer ID is a primary key that uniquely identifies who
the customer is in the relational database. No other customer would have the same
Customer ID.

The Order table contains transactional information about an order:

 Order ID (primary key)

 Customer ID (foreign key)

 Order date

 Shipping date

 Order status

Here, the primary key to identify a specific order is the Order ID. You can connect a
customer with an order by using a foreign key to link the customer ID from
the Customer table.

The two tables are now related based on the shared customer ID, which means you
can query both tables to create formal reports or use the data for other applications.
For instance, a retail branch manager could generate a report about all customers
who made a purchase on a specific date or figure out which customers had orders
that had a delayed delivery date in the last month.

The above explanation is meant to be simple. But relational databases also excel at
showing very complex relationships between data, allowing you to reference data in
more tables as long as the data conforms to the predefined relational schema of
your database.

As the data is organized as pre-defined relationships, you can query the data
declaratively. A declarative query is a way to define what you want to extract from
the system without expressing how the system should compute the result. This is at
the heart of a relational system as opposed to other systems.
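
Below is a minimal sketch of the Customer/Order example above, using Python's built-in sqlite3 module. Table and column names follow the example; the order table is named "orders" here because ORDER is a reserved word in SQL, and the data values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the example
cur = conn.cursor()

cur.executescript("""
CREATE TABLE customer (
    customer_id      INTEGER PRIMARY KEY,
    customer_name    TEXT,
    billing_address  TEXT,
    shipping_address TEXT
);
CREATE TABLE orders (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER REFERENCES customer(customer_id),  -- foreign key
    order_date    TEXT,
    shipping_date TEXT,
    order_status  TEXT
);
""")

cur.execute("INSERT INTO customer VALUES (1, 'Asha', '12 Lake Rd', '12 Lake Rd')")
cur.execute("INSERT INTO orders VALUES (101, 1, '2024-04-01', '2024-04-03', 'SHIPPED')")

# Declarative query: say WHAT you want (orders joined to customer names),
# not HOW the engine should compute it.
cur.execute("""
    SELECT c.customer_name, o.order_id, o.order_status
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.order_date = '2024-04-01'
""")
print(cur.fetchall())  # [('Asha', 101, 'SHIPPED')]
conn.close()
```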

How relational databases are structured


The relational model means that the logical data structures—the data tables, views,
and indexes—are separate from the physical storage structures. This separation
means that database administrators can manage physical data storage without
affecting access to that data as a logical structure. For example, renaming a database
file does not rename the tables stored within it.

The distinction between logical and physical also applies to database operations,
which are clearly defined actions that enable applications to manipulate the data and
structures of the database. Logical operations allow an application to specify the
content it needs, and physical operations determine how that data should be
accessed and then carry out the task.

Examples of relational databases
Now that you understand how relational databases work, you can begin to learn
about the many relational database management systems that use the relational
database model. A relational database management system (RDBMS) is a program
used to create, update, and manage relational databases. Some of the most well-
known RDBMSs include MySQL, PostgreSQL, MariaDB, Microsoft SQL Server, and
Oracle Database.

Cloud-based relational databases like Cloud SQL, Cloud Spanner and AlloyDB have
become increasingly popular as they offer managed services for database
maintenance, patching, capacity management, provisioning and infrastructure
support.

Characteristics of RDBMS
 Data must be stored in tabular form in DB file, that is, it should be organized
in the form of rows and columns.

 Each row of a table is called a record or tuple. The number of such records is known as the cardinality of the table.

 Each column of a table is called an attribute or field. The number of such columns is called the arity (or degree) of the table.

 No two records of a database table can be identical. Data duplication is avoided by using a candidate key, which is a minimal set of attributes required to identify each record uniquely.

 Tables are related to each other with the help of foreign keys.

 Database tables also allow NULL values; if any element of a table is not filled in or is missing, it becomes a NULL value, which is not equivalent to zero. (Note: a primary key cannot have a NULL value.)

Benefits of relational databases


The main benefit of the relational database model is that it provides an intuitive way
to represent data and allows easy access to related data points. As a result, relational
databases are most commonly used by organizations that need to manage large
amounts of structured data, from tracking inventory to processing transactional data
to application logging.

There are many other advantages to using relational databases to manage and store
your data, including:

1. Flexibility: It’s easy to add, update, or delete tables, relationships, and make
other changes to data whenever you need without changing the overall database
structure or impacting existing applications.

2. ACID compliance: Relational databases support the ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure data validity regardless of errors, failures, or other potential mishaps (a transaction sketch follows this list).

 Atomicity defines all the elements that make up a complete database


transaction.

 Consistency defines the rules for maintaining data points in a correct state
after a transaction.

 Isolation keeps the effect of a transaction invisible to others until it is


committed, to avoid confusion.

 Durability ensures that data changes become permanent once the


transaction is committed.

3. Ease of use: It’s easy to run complex queries using SQL, which enables even non-
technical users to learn how to interact with the database.

4. Collaboration: Multiple people can operate and access data simultaneously.


Built-in locking prevents simultaneous access to data when it’s being updated.

5. Built-in security: Role-based security ensures data access is limited to specific


users.

6. Database normalization: Relational databases employ a design technique


known as normalization that reduces data redundancy and improves data
integrity.

7. Tabular Structure: Data in an RDBMS is organized into tables, also known as


relations, where each table consists of rows (records) and columns (attributes).
Each row represents a single record, and each column represents a specific
attribute or field.

8. Data Integrity: RDBMSs enforce data integrity through constraints such as


primary keys, foreign keys, unique constraints, and check constraints. These
constraints help maintain the consistency and accuracy of data stored in the
database.

9. SQL (Structured Query Language): RDBMSs use SQL as the standard language
for querying and manipulating data. SQL provides a rich set of commands for
creating, querying, updating, and deleting data in relational databases. Common
SQL operations include SELECT, INSERT, UPDATE, DELETE, JOIN, and GROUP BY.

10. Data Relationships: RDBMSs allow the establishment of relationships between


tables using foreign keys. These relationships enable data normalization, integrity
enforcement, and efficient data retrieval through JOIN operations.

11. Indexing and Query Optimization: RDBMSs use indexing techniques to


optimize query performance by creating data structures that facilitate fast data
retrieval. Indexes are built on columns frequently used in query predicates,
allowing the database engine to quickly locate relevant data.

12. Multi-user Support: RDBMSs support concurrent access by multiple users or


applications. They provide mechanisms for managing concurrent access, ensuring
data consistency, and preventing conflicts between transactions.

13. Data Security: RDBMSs offer features for data security, including authentication,
authorization, and access control mechanisms. Administrators can define user
roles, privileges, and permissions to restrict access to sensitive data and database
operations.

14. Scalability: RDBMSs can scale vertically by upgrading hardware resources such as
CPU, memory, and storage capacity. Some RDBMSs also support horizontal
scalability through features like sharding, partitioning, and replication.

15. Commercial and Open Source Options: There are both commercial and open-
source RDBMS solutions available in the market. Examples of commercial
RDBMSs include Oracle Database, Microsoft SQL Server, and IBM Db2, while
popular open-source RDBMSs include MySQL, PostgreSQL, and SQLite.
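
As promised under the ACID compliance point above, here is a small sketch of atomicity using Python's built-in sqlite3 module: the two updates of a transfer either both commit or both roll back. The table, account names, and amounts are purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as one atomic transaction."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src))
            cur = conn.execute("SELECT balance FROM account WHERE name = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # triggers rollback of BOTH updates
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst))
    except ValueError as exc:
        print("transfer aborted:", exc)

transfer(conn, "alice", "bob", 30)    # succeeds: alice 70, bob 80
transfer(conn, "alice", "bob", 500)   # fails and rolls back: balances unchanged
print(conn.execute("SELECT * FROM account ORDER BY name").fetchall())
```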

Relational vs. non-relational databases


The main difference between relational and non-relational databases (NoSQL
databases) is how data is stored and organized. Non-relational databases do not
store data in a rule-based, tabular way. Instead, they store data as individual,
unconnected files and can be used for complex, unstructured data types, such as
documents or rich media files.

Unlike relational databases, NoSQL databases follow a flexible data model, making
them ideal for storing data that changes frequently or for applications that handle
diverse types of data.

Disadvantages of RDBMS
 High Cost and Extensive Hardware and Software Support: Huge costs and
setups are required to make these systems functional.

 Scalability: As more data is added, additional servers, processing power, and memory are required.

 Complexity: Voluminous data makes relationships harder to understand and may lower performance.

 Structured limits: The fields or columns of a relational database system are constrained within predefined limits, which may lead to loss of data.

Grid computing
What Is Grid Computing?
Grid computing is a distributed architecture of multiple computers connected by
networks to accomplish a joint task. These tasks are compute-intensive and difficult
for a single machine to handle. Several machines on a network collaborate under a
common protocol and work as a single virtual supercomputer to get complex tasks
done. This offers powerful virtualization by creating a single system image that
grants users and applications seamless access to IT capabilities.

How Grid Computing Works
A typical grid computing network consists of three machine types:

 Control node/server: A control node is a server or a group of servers that


administers the entire network and maintains the record for resources in a
network pool.
 Provider/grid node: A provider or grid node is a computer that contributes its
resources to the network resource pool.
 User: A user refers to the computer that uses the resources on the network to
complete the task.

Grid computing operates by running specialized software on every computer


involved in the grid network. The software coordinates and manages all the tasks of
the grid. Fundamentally, the software segregates the main task into subtasks and
assigns the subtasks to each computer. This allows all the computers to work
simultaneously on their respective subtasks. Upon completion of the subtasks, the
outputs of all computers are aggregated to complete the larger main task.

The software allows computers to communicate and share information on the


portion of the subtasks being carried out. As a result, the computers can consolidate
and deliver a combined output for the assigned main task.
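
The split-and-aggregate pattern described above can be sketched on a single machine with Python's multiprocessing module, where each worker process stands in for a grid node. The task (summing squares over a range) and the chunking scheme are purely illustrative; a real grid adds scheduling, security, and data movement across machines.

```python
from multiprocessing import Pool

def subtask(chunk):
    """Work assigned to one 'node': sum of squares for its slice of the range."""
    start, end = chunk
    return sum(i * i for i in range(start, end))

def main():
    n, workers = 10_000_000, 4
    step = n // workers
    # Control node role: split the main task into subtasks.
    chunks = [(i * step, (i + 1) * step) for i in range(workers)]
    chunks[-1] = (chunks[-1][0], n)  # last chunk absorbs any remainder

    # Provider node role: each worker process computes its subtask in parallel.
    with Pool(processes=workers) as pool:
        partial_results = pool.map(subtask, chunks)

    # Aggregation: combine subtask outputs into the final answer.
    print(sum(partial_results))

if __name__ == "__main__":
    main()
```
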
Key Components of Grid Computing
A grid computing environment consists of a set of primary grid components. As grid
designs and their expected usage vary, specific components may or may not always
be a part of the grid network. These components can be combined to form a hybrid
component in specific scenarios. Although the combination of elements may differ
depending on use cases, understanding their roles can help you while developing
grid-enabled applications.

Let’s understand the key components of a grid computing environment.

Grid Computing: Key Components


1. User interface

Today, users are well-versed with web portals. They provide a single interface that
allows users to view a wide variety of information. Similarly, a grid portal offers an
interface that enables users to launch applications with resources provided by the
grid.

The interface has a portal style to help users query and execute various functions on
the grid effectively. A grid user views a single, large virtual computer offering
computing resources, similar to an internet user who views a unified instance of
content on the web.

2. Security

Security is one of the major concerns for grid computing environments. Security
mechanisms can include authentication, authorization, data encryption, and others.
Grid security infrastructure (GSI) is an important ingredient here. It outlines
specifications that establish secret and tamper-proof communication between
software entities operating in a grid network.

It includes OpenSSL implementation and provides a single sign-on mechanism for


users to perform actions within the grid. It offers robust security by providing
authentication and authorization mechanisms for system protection.

3. Scheduler

On identifying the resources, the next step is to schedule the tasks to run on them. A
scheduler may not be needed if standalone tasks are to be executed that do not
showcase interdependencies. However, if you want to run specific tasks concurrently
that require inter-process communication, the job scheduler would suffice to
coordinate the execution of different subtasks.

Moreover, schedulers of different levels operate in a grid environment. For example,


a cluster may represent an independent resource with its own scheduler to manage
the nodes it contains. Hence, a high-level scheduler may sometimes be required to
accomplish the task done on the cluster, while the cluster employs its own separate
scheduler to handle work on its individual nodes.

4. Data management

Data management is crucial for grid environments. A secure and reliable mechanism
to move or make any data or application module accessible to various nodes within
the grid is necessary. Consider the Globus toolkit — an open-source toolkit for grid
computing.

It offers a data management component called grid access to secondary storage


(GASS). It includes GridFTP built on the standard FTP protocol and utilizes GSI
for user authentication and authorization. After authentication, the user can move
files using the GridFTP facility without going through the login process at every node.

5. Workload & resource management

The workload & resource component enables the actual launch of a job on a
particular resource, checks its status, and retrieves the results when the job is
complete. Say a user wants to execute an application on the grid. In that case, the
application should be aware of the available resources on the grid to take up the
workload.

Types of Grid Computing With Examples
Grid computing is divided into several types based on its uses and the task at hand.
Let’s understand the types of grid computing with some examples.

Grid Computing Types


1. Computational grid computing

Computational grids account for the largest share of grid computing usage across
industries today, and the trend is expected to stay the same over the years to come.
A computational grid comes into the picture when you have a task taking longer to
execute than expected. In this case, the main task is split into multiple subtasks, and
each subtask is executed in parallel on a separate node. Upon completion, the results
of the subtasks are combined to get the main task’s result. By splitting the task, the
end result is achieved up to n times faster (where ‘n’ denotes the number of subtasks)
than when a single machine executes the task alone.
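
The divide-and-combine pattern described above can be pictured with a small, hedged Python sketch. A local process pool stands in for the grid's worker nodes, and the subtask and split functions are hypothetical placeholders rather than part of any real grid middleware.

# A minimal sketch of the computational-grid pattern: split a main task into
# subtasks, run the subtasks in parallel, then combine the partial results.
# A local process pool stands in for independent grid nodes.
from multiprocessing import Pool

def subtask(chunk):
    # Hypothetical subtask: sum the squares of one chunk of the workload.
    return sum(x * x for x in chunk)

def split(data, n):
    # Split the main task's input into n roughly equal chunks.
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))              # the main task's input
    chunks = split(data, n=4)                  # one chunk per "node"
    with Pool(processes=4) as pool:
        partials = pool.map(subtask, chunks)   # subtasks run in parallel
    print(sum(partials))                       # combine the partial results

In a real grid, the scheduler would ship the chunks to remote nodes rather than to local processes, but the split, execute, and combine structure stays the same.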

Computational grids find application in several real-life scenarios. For example, a


computational grid can speed up the business report generation for a company with
an online marketplace. As time is an important factor for customers, the company
can use computational grids to generate reports within seconds rather than minutes.
Such grids result in substantial performance improvement compared to traditional
systems.

2. Data grid computing

Data grids refer to grids that split data onto multiple computers. Like computational
grids where computations are split, data grids enable placing data onto a network of
computers or storage. However, the grid virtually treats them as one despite the
splitting. Data grid computing allows several users to simultaneously access, change,
or transfer distributed data.

For instance, a data grid can be used as a large data store where each website stores
its own data on the grid. Here, the grid enables coordinated data sharing across all
grid users. Such a grid allows collaboration along with increased knowledge transfer
between grid users.

3. Collaborative grid computing

Collaborative grid computing solves problems by offering seamless collaboration.


This type of computing uses various technologies that support work between
individuals. As individual workers can readily access each other’s work and critical
information on time, it improves overall workforce productivity and creativity, which
benefits organizations massively.

It overcomes geographical barriers and adds capabilities that enhance work


experience by allowing remote individuals to work together. For example, with a
collaborative grid, all users can access and simultaneously work on text-based
documents, graphics, design files, and other work-related products.

4. Manuscript grid computing

Manuscript grid computing comes in handy when managing large volumes of image
and text blocks. This grid type allows the continuous accumulation of image and text
blocks while it processes and performs operations on previous block batches. It is a
simple grid computing framework where vast volumes of text or manuscripts and
images are processed in parallel.

5. Modular grid computing

Modular grid computing relates to disaggregating computing resources in a system


or chassis, where resources can include storage, GPUs, memory, and networking. IT

teams can then combine the required assets and computing resources to support
specific apps or services.

Fundamentally, in a modular grid, a set of resources is combined with software for


distinct applications. For example, CPU and GPU drives may reside in a server rack
chassis. They can be interconnected with an auxiliary high-speed and low-latency
fabric to create a server configuration that is optimized for a particular application.

When applications are created, a set of computing resources and services are defined
to support them. Subsequently, when the applications expire, computing support is
withdrawn, and resources are set free, making them available for other apps.
Practically, original equipment manufacturers (OEMs) play a key role in modular grid
computing as their cooperation is critical in creating modular grids that are
application-specific.

Top 5 Applications of Grid Computing


Grid computing acts as an enabling technology for developing several applications
across diverse fields like science, business, health, and entertainment. According to
Wipro’s 2021 report, cloud leaders expect a 29% increase in the usage of grid
computing as a complementary technology to boost cloud ROI by 2023.

As industries continue to streamline their IT infrastructure to better realize the true


potential of grids, grid infrastructure will evolve to match the pace of change and
provide stable platforms. Here are the top five applications of grid computing.

Grid Computing Applications
1. Life science

Life science is one of the fastest-growing application areas of grid computing.


Various life science disciplines such as computational biology, bioinformatics,
genomics, neuroscience, and others have embraced grid technology rapidly. Medical
practitioners can access, collect, and mine relevant data effectively. The grid also
enables medical staff to perform large-scale simulations and analyses and connect
remote instruments to existing medical infrastructure.

For example, the MCell project explores cellular microphysiology using sophisticated
‘Monte Carlo’ diffusion and chemical reaction algorithms to simulate and study
molecular interactions inside and outside cells. Grid technologies have enabled the
large-scale deployment of various MCell modules, as MCell now runs on a large pool
of resources, including clusters and supercomputers, to perform biochemical
simulations.

2. Engineering-oriented applications

Grid computing has contributed significantly to reducing the cost of resource-


intensive engineering applications. Several engineering services which require
collaborative design efforts and data-intensive testing facilities like the automotive or
aerospace industries have opted for grid technologies.
NASA Information Power Grid (NASA IPG) has deployed large-scale engineering-
oriented grid applications in the U.S. IPG is NASA’s computational grid with
distributed computing resources — from computers to large databases and scientific
instruments. One application that is of great interest to NASA is complete aircraft
design. A separate, often geographically distributed engineering team manages each
key aspect of an aircraft, such as the airframe, wing, stabilizer, engine, landing gear,
and human factors. The work of all the teams is integrated by a grid that employs
concurrent engineering for coordinating tasks.

In this way, grid computing also speeds up procedures involved in developing


engineering-oriented applications.

3. Data-oriented applications

Today, data is emerging from every corner — from sensors, smart gadgets, and
scientific instruments to many new IoT devices. With the explosion of data, grids
have a crucial role to play. Grids are being used to collect, store, and analyze data,
and at the same time, derive patterns to synthesize knowledge from that same data.

Distributed aircraft maintenance environment (DAME) is a fit use case of a data-


oriented application. DAME is a grid-based distributed diagnostic system for aircraft
engines developed in the U.K. It uses grid technology to manage large volumes of in-
flight data collected by operational aircraft. The data is used to design and develop a
decision support system for the diagnosis and maintenance of aircraft by utilizing
geographically distributed resources and data that are combined under a virtual
framework.

4. Scientific research collaboration (e-Science)

Universities and institutions participating in advanced research collaboration


programs have an enormous amount of data to analyze and process. Some examples
of these projects include data analysis work for high-energy physics experiments,
genome sequence analysis in COVID-19-like scenarios, and the development of
earth system models (ESM) by collecting data from several remote sensing sources.

Organizations involved in research collaboration require substantial storage space as


they regularly generate petabytes of data. They also need advanced computational
resources to perform data-intensive processing.

In this case, grid computing provides a resource-sharing mechanism by offering a


single virtual organization that shares computing capabilities. The virtual
supercomputer facilitates the on-demand sharing of resources and integrates a
secure framework for easy data access and interchange.

5. Commercial applications

Grid computing supports various commercial applications such as the online gaming
and entertainment industry, where computation-intensive resources, such as

computers and storage networks, are essential. The resources are selected based on
computing requirements in a gaming grid environment. It considers aspects such as
the volume of traffic and the number of participating players.

Such grids promote collaborative gaming and reduce the upfront cost of hardware
and software resources in on-demand-driven games. Moreover, in the media
industry, grid computing enhances the visual appearance of the motion picture by
adding special effects. The grid also helps theater film production as different
portions are processed concurrently, requiring less production time.

Why is grid computing important?


Organizations use grid computing for several reasons.

 Efficiency: With grid computing, you can break down an enormous, complex
task into multiple subtasks. Multiple computers can work on the subtasks
concurrently, making grid computing an efficient computational solution.

 Cost: Grid computing works with existing hardware, which means you can
reuse existing computers. You can save costs while accessing your excess
computational resources. You can also cost-effectively access resources from
the cloud.

 Flexibility: Grid computing is not constrained to a specific building or


location. You can set up a grid computing network that spans several regions.
This allows researchers in different countries to work collaboratively with the
same supercomputing power.

What are the use cases of grid computing?


The following are some common applications of grid computing.

 Financial services
Financial institutions use grid computing primarily to solve problems involving
risk management. By harnessing the combined computing powers in the grid,
they can shorten the duration of forecasting portfolio changes in volatile
markets.

 Gaming
The gaming industry uses grid computing to provide additional computational
resources for game developers. The grid computing system splits large tasks,
such as creating in-game designs, and allocates them to multiple machines.
This results in a faster turnaround for the game developers.

 Entertainment
Some movies have complex special effects that require a powerful computer
to create. The special effects designers use grid computing to speed up the
production timeline. They have grid-supported software that shares
computational resources to render the special-effect graphics.

 Engineering
Engineers use grid computing to perform simulations, create models, and
analyze designs. They run specialized applications concurrently on multiple
machines to process massive amounts of data. For example, engineers use
grid computing to reduce the duration of a Monte Carlo simulation, a technique
that uses repeated random sampling to estimate the range of possible outcomes
(a minimal sketch follows this list).
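
As a minimal, hedged illustration of the Monte Carlo case just mentioned, the Python sketch below estimates the value of pi by random sampling and divides the samples among worker processes, much as a grid would divide them among nodes; the worker and sample counts are arbitrary.

# Parallel Monte Carlo sketch: estimate pi by random sampling, with the
# samples divided among worker processes as a grid would divide them
# among its nodes.
import random
from multiprocessing import Pool

def count_hits(n_samples):
    # One subtask: count random points that fall inside the unit circle.
    rng = random.Random()
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    workers, samples_per_worker = 4, 250_000
    with Pool(processes=workers) as pool:
        hits = pool.map(count_hits, [samples_per_worker] * workers)
    print(4 * sum(hits) / (workers * samples_per_worker))   # roughly 3.14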

Advantages of Grid Computing:


1. It is not centralized: no dedicated servers are required, apart from the control node,
which only coordinates the work and does not take part in processing.
2. Multiple heterogeneous machines, i.e. machines with different operating
systems, can participate in a single grid computing network.
3. Tasks can be performed in parallel across various physical locations, and
users do not have to pay extra for the computing capacity they borrow.

Disadvantages of Grid Computing:


1. Grid software is still in an early stage of evolution.
2. A super-fast interconnect between computing resources is still needed for many
workloads.
3. Licensing software across many servers may make it prohibitive for some applications.
4. Many groups are reluctant to share their resources.
5. A failure in the control node can bring the whole network to a halt.

Volunteer computing
What is volunteer computing?
“Volunteer computing” is a type of distributed computing in which computer
owners can donate their spare computing resources (processing power, storage and
Internet connection) to one or more research projects.

Volunteer computing, also called VC, can be defined as a method of obtaining higher
computational throughput using everyday digital devices such as smartphones,
computers, laptops, and tablets. You can install a program that is capable of
downloading and executing tasks from the project servers in order to take part
in volunteer computing.

Volunteer computing in big data is becoming one of the important fields of
research these days. This section provides a detailed overview of volunteer
computing and the Big Data analytics aspects of it.

Why is volunteer computing important?


It's important for several reasons:

 Because of the huge number (> 1 billion) of PCs in the world, volunteer
computing can supply more computing power to science than does any other
type of computing. This computing power enables scientific research that
could not be done otherwise. This advantage will increase over time, because
the laws of economics dictate that consumer products such as PCs and game
consoles will advance faster than more specialized products, and that there
will be more of them.

 Volunteer computing power can't be bought; it must be earned. A research


project that has limited funding but large public appeal can get huge
computing power. In contrast, traditional supercomputers are extremely
expensive, and are available only for applications that can afford them (for
example, nuclear weapon design and espionage).
 Volunteer computing encourages public interest in science, and provides the
public with a voice in determining the directions of scientific research.

How does it compare to 'Grid computing'?


It depends on how you define 'Grid computing'. The term generally refers to the
sharing of computing resources within and between organizations, with the following
properties:

 Each organization can act as either producer or consumer of resources (hence


the analogy with the electrical power grid, in which electric companies can buy
and sell power to/from other companies, according to fluctuating demand).

 The organizations are mutually accountable. If one organization misbehaves,


the others can respond by suing them or refusing to share resources with
them.

This is different from volunteer computing. 'Desktop grid' computing - which uses
desktop PCs within an organization - is superficially similar to volunteer computing,
but because it has accountability and lacks anonymity, it is significantly different.
If your definition of 'Grid computing' encompasses all distributed computing (which
is silly - there's already a perfectly good term for that) then volunteer computing is a
type of Grid computing.

Basic Structure for a volunteer computing system.

Features of Volunteer Computing


 Homogeneous redundancy: identical work is sent only to machines of the same
platform, so that redundant results can be compared like for like.

 Locality scheduling: work is preferentially delivered to devices that already hold
the data it needs, so that requests are satisfied with less data movement.

 Result validation: the server sends an identical work unit to at least two users
and double-checks the reported results, using prediction models or a custom
validation approach.

 Even before the work unit is finalized, information from the work unit is
trickled to the server.

 Work is distributed depending on host characteristics: work units requiring more
than 512 MB of RAM are delivered only to hosts with at least that much RAM,
and additional tasks are sent to multi-core CPUs and GPUs.

 Both the client and the server require job scheduling to ensure that
different tasks meet their deadlines.

With these features, volunteer computing has become a major area of research
and a practical technology for everyday applications.

What is volunteer computing? Give an example of such computing.
 Volunteer computing refers to the use of idle or less used consumer digital
devices primarily to cut the cost of high-performance computation and
maintenance

 The devices include desktops, laptops, and fog devices

Let us now have an overview of volunteer computing in big data.

Overview of volunteer computing in big data


The following are the two important purposes for which volunteer computing is
being investigated,

 Performance of volunteer computing with multiple volunteers

 Convergence of the overall performance as a function of the number of


volunteers

It is important to avoid over-using volunteer devices so that surplus capacity can be
diverted to other purposes. The major aim of volunteer computing research in big
data is the optimization of available volunteers, that is, using the minimum number
of volunteers needed to establish high throughput. For this purpose, a platform with
the following characteristics is required:

 System performance assessment

 Behavioural dynamics assessment

 Assessing the volunteer opportunities

 Processing the generic big data concerns

Let us now look at some of the key enabling technologies for volunteer computing.

Key enabling technologies of volunteer computing in big data


 Edge computing

o It focuses on device-based direct data processing

o It makes use of attached sensors and gateway devices at sensor proximity

 Mist computing

o It is used to process data at the extreme edge of the network

o It consists of sensors and multiple microcontrollers

 Fog computing

o It is a network framework that spans the space between where data is created
and where it is stored

o The location for storage can be both cloud and local data center

 Centralised computing

o Mainframe computers are central to this model

o Dumb terminals and time-sharing are characteristic aspects of centralized
computing

 Utility computing

o It works under metered bandwidth and has self-service provisioning

o It provides for quick scalability

 Cloud computing

o Cloud computing provides three different types of services, as given below:
o Software as a service

o Platform as a service

o Infrastructure as a service

 Grid computing

o The processing is decentralized and highly parallel

o Commoditized hardware is its specialty

Next, let us consider the criteria for a good volunteer computing platform.

What are the criteria for good volunteer computing?


 It should be an easy-to-use platform for both programmers and volunteers, making
it simple to develop new applications and attractive for many volunteers to
participate

o It has to be capable of being applied to diverse applications apart from


mathematics and scientifically significant problems

o It must support more complex communication patterns and coarse-grained
applications, rather than being limited to embarrassingly parallel,
master-worker style applications

o It must ensure reliability even when malicious volunteers are present


without compromising on network performance

Let us now see the issues and concerns associated with volunteer computing
in big data.

Two major issues of volunteer computing in big data


 Data centers require specialised air-conditioning systems to remove hardware
heat, whereas such systems are not required by consumer devices

 In cold climates, the heat given off by consumer devices contributes to ambient
heating, which can bring the net energy cost of the computing close to zero.
The global deployment of volunteer computing is therefore examined for its
efficiency relative to data center computing

Let us now see why volunteer computing is well suited to high-throughput
applications.

What are the reasons to prefer volunteer computing?


 Since the most important reason for using volunteer computing is to increase the
task completion rate it is highly suited for high throughput computing

 Also reducing the turnaround time for completing a task is not the primary goal
of volunteer computing

 In cases of huge memory workloads, large storage demands, or a high ratio of
network communication to computation, volunteer computing in Big Data cannot
be used effectively

Hence without a doubt volunteer computing is highly accepted and suited for high
throughput computing applications.

Let us now discuss the unique characteristics of volunteer computing.

Key features of volunteer computing


1. Crowdsourced Computing Resources: Volunteer computing harnesses the
collective computational power of volunteers' devices, including desktop
computers, laptops, smartphones, and tablets. Volunteers install client software,
also known as volunteer computing clients or BOINC (Berkeley Open
Infrastructure for Network Computing) clients, which run in the background and
utilize idle CPU cycles to process computational tasks.

2. Decentralized Participation: Volunteer computing projects are open to anyone


who wishes to contribute their computing resources to scientific research,
humanitarian efforts, or other computational projects. There are no strict
requirements or eligibility criteria for participation, and volunteers can join or
leave projects at any time.

3. Diverse Applications: Volunteer computing projects cover a wide range of
scientific, research, and humanitarian domains, including astronomy, physics,
biology, climate modeling, drug discovery, cryptography, and social sciences.
These projects rely on volunteer computing to perform complex calculations,
simulations, and data analysis tasks that would otherwise require significant
computational resources.

4. Distributed Work Units: Volunteer computing projects break down large
computational tasks into smaller, manageable units known as work units or tasks.
These tasks are distributed to volunteer devices by a central server, which
coordinates the assignment and collection of completed work units. Volunteers
download, process, and return the results of their assigned tasks to the project
server (a small sketch of this pattern follows this list).

5. Community Engagement: Volunteer computing projects foster community


engagement and collaboration among participants, who share a common interest
in contributing to scientific research or other meaningful endeavors. Projects
often provide forums, discussion boards, and social media channels where
volunteers can interact, share experiences, and learn about project updates and
achievements.

6. Security and Privacy: Volunteer computing projects prioritize security and


privacy to protect volunteers' personal data and computing resources. Client
software is designed to operate within a secure sandbox environment, isolating
volunteer-contributed resources from the host system. Projects implement
encryption, authentication, and other security measures to safeguard data
transmission and storage.
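
To make the work-unit idea in point 4 concrete, here is a toy, hedged Python sketch of a coordinator that splits a job into work units, hands each unit to two simulated volunteers, and accepts a result only when the two copies agree. It illustrates the replication idea only and is not the actual BOINC server logic.

# Toy work-unit sketch: a coordinator splits a job into units, each unit is
# processed by two simulated volunteers, and a result is accepted only when
# the two replicas agree; otherwise the unit is reissued.
import random

def volunteer_process(unit):
    # Simulated volunteer: compute the unit's result (here, a simple sum).
    # A small error rate models unreliable or faulty hosts.
    result = sum(unit)
    if random.random() < 0.05:
        result += 1                      # occasional bad result
    return result

job = list(range(1000))
unit_size = 100
work_units = [job[i:i + unit_size] for i in range(0, len(job), unit_size)]

accepted = {}
for idx, unit in enumerate(work_units):
    while idx not in accepted:
        r1, r2 = volunteer_process(unit), volunteer_process(unit)
        if r1 == r2:                     # replicas agree, accept the result
            accepted[idx] = r1

print("job result:", sum(accepted.values()))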

Big data Tools used for volunteer computing


 Spark – supports fast in-memory computation

 Cassandra – offers high availability and scalability

 Hadoop – Big Data storage, processing, and analysis

 MongoDB – a cross-platform document database

 Storm – processing of unbounded data streams

Achieving high optimization and overall performance with minimal volunteers and
without resource overuse is the target of volunteer computing. Once a suitable
platform is in place, the next need is a big data algorithm. MapReduce is one of the
most successful big data algorithms and has been used in various applications; a
minimal sketch is given below, after which we look into the prominent
volunteer-computing-based big data techniques.
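
Here is that sketch: a minimal, single-process MapReduce-style word count in Python, where map emits (word, 1) pairs, a shuffle step groups the pairs by key, and reduce sums the counts. In a real deployment the map and reduce calls would run on distributed (or volunteer) nodes rather than in one process, and the documents are made-up examples.

# Minimal MapReduce-style word count: map emits (word, 1) pairs, the shuffle
# groups pairs by key, and reduce sums the counts for each word.
from collections import defaultdict

def map_phase(document):
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs big ideas", "volunteer computing needs data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))      # e.g. {'big': 2, 'data': 2, ...}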

Volunteer Computing Techniques


The following are the important machine learning and volunteer computing
based techniques that are utilized to enhance the quality of service parameters in
distributed systems

 Deep learning and Machine Learning

 Reinforcement Learning and Regression Analysis

 Artificial Neural Networks

Assessing the suitability of deep learning and machine learning methods for
volunteer computing systems is one of the important research directions in the
field. Let us now look into the big data techniques used in volunteer computing
networks.

Big data techniques for volunteer computing


The following are the crucial big data methodologies for volunteer computing
networks

 Loop control

o Loop aware task scheduling is used for scheduling the tasks

o It provides for checkpoints in fault tolerance

o It is used in Big data analysis applications involving multiple iterations

 K means clustering

o Cache aware decentralized task scheduling process is used

o It provides for decentralized control Framework in fault tolerance

o Scientific Discovery and data-intensive computing are the benefits of this


mechanism

 Additional combine phase apart from reducing phase

o MapReduce-based static scheduling is used

o Checkpointing fault tolerance is followed in this mechanism

o MapReduce computations with multiple lighter actions are its major


benefit

 Updates and detects

o MapReduce static scheduling mechanism is used

o Fault tolerance checkpointing is utilized

o Performance increases manifold even in cases where task-level memoization
offers no benefit

 Iterative computation with distinct map and reducing functions

o Static scheduling is used for MapReduce

o It provides for checkpointing in fault tolerance

o It enables asynchronous map task execution and reduces the overhead by


eliminating static shuffling
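
The last two techniques above, k-means clustering and iterative computation with distinct map and reduce functions, can be pictured with a small, hedged Python sketch: the map step assigns each point to its nearest centroid, the reduce step recomputes the centroids, and the two steps repeat. It is a single-process illustration of the iteration pattern, not a distributed implementation, and the data and centroids are made up.

# Iterative MapReduce-style k-means in one dimension: map assigns each point
# to its nearest centroid, reduce recomputes each centroid as the mean of the
# points assigned to it, and the two steps repeat for a fixed budget.
from collections import defaultdict

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 25.0, 26.0]
centroids = [1.0, 9.0, 26.0]             # initial guesses for k = 3

for _ in range(10):                      # fixed iteration budget
    # Map: group each point under the index of its nearest centroid.
    assignments = defaultdict(list)
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        assignments[nearest].append(p)
    # Reduce: new centroid = mean of its assigned points (unchanged if none).
    centroids = [sum(assignments[i]) / len(assignments[i]) if assignments[i]
                 else centroids[i] for i in range(len(centroids))]

print(centroids)                         # roughly [1.5, 8.5, 25.5]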

So far we have seen the importance of volunteer computing in establishing a


cost-effective and environment-friendly alternative computer framework in
place of the expensive and centralized infrastructure.

You can use volunteer computing for tasks that involve large utilization of resources
like Big Data analytics and scientific simulations by aggregating idle computer
devices like desktops, routers, and smart devices.

To harness the complete power of volunteer computing, novel techniques,
procedures, algorithms, and standards still have to be devised, and existing
technologies continue to be improved to enhance volunteer computing in big data.

Resource Management in Volunteer Computing


Research issues in Resource Management

 Resource provisioning – selection and discovery of resources

 Resource monitoring – utilization of resources

 Resource scheduling – allocation and mapping of resources

The following are various aspects of resource allocation which are considered to be
important issues in Resource Management under volunteer computing

 Cost and power-aware resource allocation

 QoS and content-aware allocation of resources

 SLA based allocation of resources

Let us now discuss the parameters used to configure and evaluate work in
volunteer computing networks.

What Are The Parameters Used For Volunteer Computing?


 Quorum

o The minimum number of matching successful results required by the validator is called the quorum

o With a strict majority of agreeing results, the outcome is considered accurate

 Size of the output file

o Output file size denotes the data amount sent by a client to the server
after executing a task

 Size of the input file

o Input file size stands for the data that is uploaded for processing into
the volunteer nodes

 Duration of tasks

o The total number of floating-point operations required to compute each task

Many such factors have to be considered when evaluating a volunteer computing
project and choosing a suitable simulation tool. Let us now look at the parameters
used for analysing the performance of volunteer computing networks.

Performance Analysis of Volunteer Computing
 Maximum error results

o When the number of client error results exceeds this maximum value, the
work unit is deemed to contain errors

 Maximum successful results

o When the number of successful results exceeds this value without reaching a
consensus, the work unit is flagged as containing an error

 Maximum total results

o When the total number of results created for the work unit exceeds this
value, the work unit is deemed to contain errors

 Target results

o The initial number of results (replicas) created for a work unit
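
A hedged Python sketch of how the quorum and maximum-result parameters above might interact: results for a work unit are collected until a quorum of identical results is reached (the accepted, canonical result) or the total number of results exceeds a configured maximum, at which point the unit is flagged as erroneous. The thresholds and the exact matching rule are illustrative and not the precise BOINC defaults.

# Illustrative quorum check for a work unit: keep collecting results until
# `quorum` identical results exist (accept) or `max_total_results` have been
# collected without consensus (flag the unit as erroneous).
from collections import Counter

def validate(results, quorum=2, max_total_results=5):
    counts = Counter()
    for i, result in enumerate(results, start=1):
        counts[result] += 1
        if counts[result] >= quorum:
            return ("accepted", result)      # canonical result found
        if i >= max_total_results:
            return ("error", None)           # too many results, no consensus
    return ("pending", None)                 # still waiting for more results

print(validate([42, 41, 42]))                # ('accepted', 42)
print(validate([1, 2, 3, 4, 5]))             # ('error', None)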


Examples of volunteer computing projects:


 SETI@home (Search for Extraterrestrial Intelligence at Home): Participants analyze
radio telescope data in search of signals from extraterrestrial civilizations.

 Folding@home: Volunteers simulate protein folding and other molecular


dynamics to advance biomedical research and drug discovery.

 World Community Grid: Participants contribute their computing power to tackle


global challenges such as health, poverty, and sustainability by running scientific
simulations and data analysis tasks.

Convergence of key trends


The convergence of key trends refers to the coming together or intersection of
multiple significant developments or advancements in various fields, leading to
synergistic effects and transformative outcomes. This convergence often amplifies
the impact of each trend and creates new opportunities, challenges, and possibilities.
Here are several examples of key trends that have converged or are converging to
shape the future:

1. Artificial Intelligence (AI) and Big Data: The combination of AI and big data
has led to groundbreaking advancements in data analytics, machine learning,
and predictive modeling. AI algorithms are increasingly used to analyze vast
amounts of data, extract meaningful insights, and drive decision-making in
diverse fields such as healthcare, finance, marketing, and autonomous
systems.

2. Internet of Things (IoT) and Edge Computing: The proliferation of IoT


devices and sensors, combined with edge computing capabilities, enables
real-time data processing and analysis at the network edge. This convergence
enhances efficiency, reduces latency, and supports applications such as smart
cities, industrial automation, and connected vehicles.

3. 5G Connectivity and Mobile Technology: The rollout of 5G networks and


advancements in mobile technology are transforming connectivity, enabling
faster data transfer speeds, lower latency, and greater network capacity. This
convergence facilitates innovations in areas such as augmented reality (AR),
virtual reality (VR), telemedicine, and autonomous vehicles.

4. Blockchain and Decentralized Finance (DeFi): The integration of blockchain


technology with decentralized finance (DeFi) platforms is reshaping the
financial industry by enabling peer-to-peer transactions, smart contracts, and
digital asset management without intermediaries. This convergence fosters
financial inclusion, transparency, and innovation in areas such as banking,
lending, and payments.

5. Renewable Energy and Energy Storage: The convergence of renewable


energy sources, such as solar and wind power, with advancements in energy
storage technologies, such as batteries and hydrogen fuel cells, is driving the
transition to a cleaner and more sustainable energy system. This convergence
facilitates grid flexibility, energy independence, and carbon reduction efforts.

6. Healthcare and Telemedicine: The intersection of healthcare services with


telemedicine platforms and digital health technologies is revolutionizing
healthcare delivery, patient care, and medical diagnostics. This convergence
expands access to healthcare services, improves remote monitoring
capabilities, and enhances personalized medicine approaches.

7. E-commerce and Last-Mile Delivery: The convergence of e-commerce


platforms with logistics and delivery services is reshaping the retail landscape,
leading to innovations in last-mile delivery, fulfillment centers, and supply

chain optimization. This convergence enables faster delivery times, improved
customer experiences, and greater convenience for consumers.

8. Sustainability and Circular Economy: The convergence of sustainability


initiatives with circular economy principles is driving efforts to reduce waste,
conserve resources, and minimize environmental impact. This convergence
promotes eco-friendly practices, product lifecycle management, and
sustainable consumption patterns across industries.

Overall, the convergence of key trends has profound implications for society,
economy, and technology, shaping the way we live, work, and interact in the digital
age. By recognizing and harnessing the synergies between these trends,
organizations and policymakers can unlock new opportunities and address complex
challenges in a rapidly evolving world.

Top 10 Big Data Trends For 2024


In today’s world, we are living in an era of a new digital world where Artificial
Intelligence and Machine Learning have reshaped businesses and society. You
might not be surprised that big data has taken over the perspective of seeing
through new market trends and making important decisions for the business. In fact,
with the growth of data, companies are now looking to adopt new methods to
optimize data on a larger scale. Big data has also played a crucial role during the
COVID-19 pandemic and has uplifted many sectors such as healthcare, e-commerce,
etc.

The big data market is expected to grow to around USD 200 billion by 2025. So,
let's check out the top 10 big data trends.

1. TinyML

TinyML is a branch of ML that runs on small, low-powered devices such as
microcontrollers. The best part about TinyML is that it runs with low latency at the
edge, on the devices themselves. It consumes only microwatts or milliwatts, roughly
1000x less than a standard GPU, which lets devices run for long periods, in some
cases years. Because processing happens locally on the device, little or no data has
to be sent off and stored elsewhere, which is a real advantage for privacy and safety
concerns.

2. AutoML

AutoML is often described as the modern face of ML. It is used to reduce human
involvement by automating the tasks needed to solve real-life problems, covering
the whole pipeline from raw data to a final ML model.

The aim of AutoML is to offer powerful learning techniques and models to
non-experts in ML. That said, although AutoML reduces the need for human
interaction, it is not going to replace human expertise completely.

3. Data Fabric

Data fabric has been trending for a while and will continue its dominance in the
coming years. It is an architecture and set of data services that work across cloud
environments, and it has been highlighted by Gartner as a top analytics trend,
although adoption still has to spread across the enterprise. It brings together key
data management technologies such as data pipelining, data integration, and data
governance. Enterprises have embraced it because it shortens the time needed to
surface business insights that support impactful business decisions.

4. Cloud Migration

In today’s world of technology, businesses are shifting towards the cloud. Cloud
migration has been trending for a while and is set to continue; not only businesses
but also individuals now rely heavily on cloud technology. Cloud migration is very
helpful in terms of performance, as it improves the speed and scalability of
operations, especially during heavy traffic.

5. Data Regulation

As industries change their working patterns and base business decisions on data,
managing operations has become easier, but big data is yet to make its full impact
on heavily regulated sectors. Some have started adopting big data architectures,
but there is a long way to go. Handling data at such a large scale comes with
significant responsibility: in sectors such as healthcare and law, data quality and
privacy cannot be compromised, and patient data, for example, cannot be left to AI
methods alone. Better data regulation is therefore expected to play a major role in
the coming years.

6. IoT

With the growing pace of technology, we are becoming more dependent on
connected devices. IoT has played a major role over the last few years and will play
an even more interesting role in the near future. Advanced data technologies and
architectures add value to IoT by monitoring devices and collecting data in different
forms. IoT is expected to operate at an even larger scale, storing and processing
data in real time to tackle problems in areas such as traffic management,
manufacturing, and healthcare.

7. NLP
Natural Language Processing is a kind of AI that interprets text or voice input
provided by humans. In short, it is used to understand what is being said, and it
works remarkably well. You can already find examples where you can ask a machine
to read aloud for you. NLP uses a range of methods to resolve ambiguity in language
and to give interactions a natural touch. Familiar examples are Apple’s Siri and
Google Assistant, where you speak to the AI and it provides useful information in
response.

8. Data Quality

Data quality became one of the most pressing concerns for companies in 2021, even
though relatively few companies openly acknowledge that it is an issue for them. To
date, many companies have not focused on the quality of the data coming from
their various mining tools, which has resulted in poor data management. If data is
driving decisions, then poor-quality data can lead a business to set the wrong
targets or address the wrong audience. Careful filtering and quality control are
therefore required to achieve real milestones.

9. Cyber Security

With the COVID-19 pandemic, when the world was forced to shut down and
companies had little choice but to adopt work from home, things began to change.
Even now, many people continue to work remotely. Everything has its pros and cons,
and remote work brings challenges that include cyber-attacks: employees working
outside the corporate security perimeter are a concern for companies, and attackers
have become more active in finding new ways to breach networks.

Taking this into consideration, XDR (Extended Detection and Response) and SOAR
(Security Orchestration, Automation and Response) have been introduced, which
help detect cyber-attacks by applying advanced security analytics across the
network. Cyber security therefore remains one of the major trends in big data and
analytics.

10. Predictive Analytics

Predictive analytics helps identify future trends and produce forecasts with the help
of statistical tools. It analyses patterns in data in a meaningful way and is widely
used for weather forecasting, but its techniques are not limited to that: it can be
applied to almost any data in which a pattern can be found.

Examples include the share market and product research. Based on the data
provided, it can flag in advance whether a market share is likely to dip, or, if you
want to launch a product, it can combine data from different regions and customer
interests to support the business decision. In today’s highly competitive
environment it is becoming ever more in demand and will remain a trend in the
coming years.
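
As a minimal, hedged illustration of the idea behind predictive analytics, the sketch below fits a straight-line trend to a short series of past sales figures using ordinary least squares and extrapolates one period ahead. The numbers are invented, and real forecasting would use much richer models and data.

# Fit a simple least-squares trend line to past observations and extrapolate
# one step ahead: the core idea behind basic predictive analytics.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x    # (slope, intercept)

months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 128, 150, 162, 170]       # hypothetical past sales
slope, intercept = fit_line(months, sales)
print(round(slope * 7 + intercept, 1))       # predicted sales for month 7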

Unstructured data
What is Unstructured Data?
Unstructured data is data that does not conform to a data model and has no easily
identifiable structure, so it cannot readily be used by a computer program. It is not
organised in a pre-defined manner and does not have a pre-defined data model,
so it is not a good fit for a mainstream relational database.

From 80% to 90% of data generated and collected by organizations is unstructured,
and its volumes are growing rapidly — many times faster than the rate of growth for
structured databases.

Unstructured data stores contain a wealth of information that can be used to guide
business decisions. However, unstructured data has historically been very difficult to
analyze. With the help of AI and machine learning, new software tools are emerging
that can search through vast quantities of it to uncover beneficial and actionable
business intelligence.

What is the meaning of Unstructured Data?

Unstructured data doesn’t have a predefined structure and is common in sources
like:

 Emails

 PDFs

 Images

 Audio files

 Video files

 Social media posts

While unstructured data doesn't have the same organization as structured data, you
can still analyze it to find trends and insights. To do this, businesses need to invest in
big data technologies like OpenText™ IDOL Unstructured Data Analytics to easily
process large amounts of unstructured data.

Characteristics of unstructured data


1. Lack of Structure: Unstructured data does not conform to a rigid schema or
predefined format. It can vary widely in content, format, and organization,
making it difficult to store, manage, and analyze using traditional database
systems.

2. High Volume: Unstructured data often constitutes a significant portion of the


data generated by organizations and individuals. With the proliferation of
digital content, social media interactions, and sensor devices, the volume of
unstructured data continues to grow rapidly.

3. Diverse Formats: Unstructured data comes in various formats, including text,


images, videos, audio recordings, PDFs, emails, and more. Each format may
require different processing techniques and tools for analysis and extraction of
useful information.

4. Complexity: Unstructured data may contain complex relationships, semantics,


and context that are not easily discernible using automated algorithms or
traditional data processing methods. Extracting meaningful insights from
unstructured data often requires advanced analytics techniques, natural
language processing (NLP), and machine learning algorithms.

5. Varied Sources: Unstructured data originates from diverse sources, including


social media platforms, websites, emails, sensors, multimedia devices, and IoT
(Internet of Things) devices. Integrating and analyzing data from multiple

sources can provide valuable insights but also presents challenges related to
data integration, quality, and governance.

Unstructured data vs. structured data


Let’s take structured data first: it’s usually stored in a relational database or RDBMS,
and is sometimes referred to as relational data. It can be easily mapped into
designated fields — for example, fields for zip codes, phone numbers, and credit
cards. Data that conforms to RDBMS structure is easy to search, both with human-
defined queries and with software.

Unstructured data, in contrast, doesn’t fit into these sorts of pre-defined data
models. It can’t be stored in an RDBMS. And because it comes in so many formats,
it’s a real challenge for conventional software to ingest, process, and analyze. Simple
content searches can be undertaken across textual unstructured data with the right
tools.

Beyond that, the lack of consistent internal structure doesn’t conform to what typical
data mining systems can work with. As a result, companies have largely been unable
to tap into value-laden data like customer interactions, rich media, and social
network conversations. Robust tools for doing so are only now being developed and
commercialized.

What are some examples of unstructured data?


Unstructured data can be created by people or generated by machines.

Here are some examples of the human-generated variety:

 Email: Email message fields are unstructured and cannot be parsed by traditional
analytics tools. That said, email metadata affords it some structure, and explains
why email is sometimes considered semi-structured data.

 Social media and websites: data from social networks like Twitter, LinkedIn, and
Facebook, and websites such as Instagram, photo-sharing sites, and YouTube.

 Mobile and communications data: For this category, look no further than text
messages, phone recordings, collaboration software, chat, and instant messaging.

 Media: This data includes digital photos, audio, and video files.

 Text Documents: Word documents, PDFs, emails, web pages, and text files
containing unstructured textual content.

 Multimedia Files: Images, videos, audio recordings, and multimedia


presentations.

 Social Media Data: Posts, comments, tweets, photos, videos, and other content
shared on social media platforms such as Facebook, Twitter, Instagram, and
LinkedIn.

 Sensor Data: Data collected from sensors embedded in IoT devices, industrial
equipment, vehicles, environmental monitoring systems, and wearable devices.

Here are some examples of unstructured data generated by machines:

 Scientific data: This includes oil and gas surveys, space exploration, seismic
imagery, and atmospheric data.

 Digital surveillance: This category features data like reconnaissance photos and
videos.

 Satellite imagery: This data includes weather data, land forms, and military
movements.

How is Unstructured Data stored?
Unstructured data is usually stored in a non-relational database like Hadoop or
NoSQL and processed by unstructured data analytics programs like OpenText IDOL.
These databases can store and process large amounts of unstructured data.

Common storage formats for unstructured data are:

 Text files (PDFs and emails)

 Image files (JPEGs and PNGs)

 Audio files (MP3s and WAVs)

 Video files (MPEGs and AVIs)

What are the benefits of Unstructured Data?


There are many benefits to working with unstructured data. Data scientists use
unstructured data to improve customer service, target marketing campaigns, and
make intelligent business decisions.

Some of the most common benefits of unstructured data are:

 Improved customer service: Businesses can provide better customer service by


analyzing customer sentiment in social media posts and online reviews.

 Targeted marketing campaigns: Marketing teams can use unstructured data to
identify customer needs and wants. This information can then help them create
targeted marketing campaigns.

 Better business decisions: Unstructured data can help businesses find trends and
insights that would otherwise be difficult to identify. This information ultimately
helps stakeholders make accurate judgments and improve their companies.

What companies can do with Unstructured Data after parsing?


Some companies have successfully parsed unstructured data through text
analytics and natural language processing (NLP). These technologies help
organizations sift through large amounts of unstructured data to find the nuggets of
information they are looking for. What's more, parsing through unstructured data
does hold several key benefits, such as:

 Limitless use: Unstructured data isn’t predefined, meaning owners can use it in
unlimited ways.

 Versatile formatting: Users can store unstructured data in various formats.

 Affordable storage cost: Enterprises have more raw, unstructured data than
structured information. Storing unstructured data is both convenient and cost-
effective.

 File extraction: Gain more insight from your data with support for over 1,500 file
formats, and a Document file reader and file extraction with standalone file
format detection, content decryption, text extraction, subfile processing, non-
native rendering, and structured export solution.

 AI Digital Assistant: Once data is analyzed, natural-language dialogues are


pulled from many different sources to provide highly matched answers to
questions. Visitors to your site can chat with an automated, human-like natural
language digital assistant.

 AI Video Surveillance & Analytics: Automatically monitor thousands of CCTV


cameras in real time or retrospectively. Tag video, send alerts, review, and
distribute to interested parties. Includes facial recognition, event analysis, license
plate recognition, and more.

 Law Enforcement Analytics & Media Analysis: Identify and extract facts from
video and image evidence during investigations. Collect, organize, classify, and
secure these assets faster while reducing costs and the strain on labor.

 Natural Language Q&A and Chatbot: Accesses a variety of sources for highly
matched answers and responds in a natural language format. Create a human
dialog chat experience for customers through AI and ML.
What are the challenges of Unstructured Data?
Working with unstructured data can be challenging. Since this type of information is
not organized in a predefined manner, it's more challenging to analyze.

In addition, unstructured data is often stored in a non-relational database, making it


more difficult to query. Some of the most common challenges of unstructured data
are:

 Security risks: Securing unstructured data can be complex since users can spread
this information across many storage formats and locations.

 Poor indexation: Because of its arbitrary nature, indexation is usually both a


challenging and error-prone process.

 Need for data scientists: Unstructured data usually requires data scientists to
parse through it and make interpretations.

 Expensive data analytics equipment: Advanced data analytics software is


necessary for parsing unstructured data, but it may be out of reach for companies
on a tight budget.

 Numerous data formats: Unstructured data doesn’t have a specific format,


which makes it difficult to use in its raw state.

How is Unstructured Data analyzed?


There are many ways to analyze unstructured data. Users can process unstructured
data using NLP techniques like text mining and sentiment analysis. In addition,
stakeholders can analyze unstructured data through tools that feature machine
learning.

Some standard methods for analyzing unstructured data are:

 Text mining: This technique extracts valuable information from text-based


sources. For example, text mining can analyze customer reviews to identify
patterns and trends.

 Sentiment analysis: This technique identifies emotions in text-based sources. For


example, sentiment analysis can examine social media posts to determine positive
or negative sentiments about a brand or product.

 Machine learning: This technique finds patterns and insights in data. For
example, tools that feature machine learning can inspect customer behavior to
identify trends.
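
As a toy illustration of the sentiment-analysis technique listed above, the hedged Python sketch below scores a few example reviews against a small hand-made lexicon of positive and negative words; production systems would instead rely on trained NLP models and far larger vocabularies.

# Toy lexicon-based sentiment analysis over unstructured review text: count
# positive and negative words and label each review by the difference.
POSITIVE = {"great", "love", "excellent", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "terrible", "disappointing"}

def sentiment(text):
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = [
    "Great product, fast delivery and helpful support!",
    "Terrible experience, the item arrived broken.",
    "It works.",
]
for review in reviews:
    print(sentiment(review), "-", review)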

How can OpenText IDOL Unstructured Data Analytics help?
OpenText unstructured data analytics platform helps organizations analyze this type
of information. OpenText IDOL includes tools and technologies that collect, process,
and analyze unstructured data.

Critical features of OpenText IDOL include:

 Image analytics: This feature enables businesses to extract meaning from


images. For example, image analytics can identify objects in a picture or find faces
in a crowded image.

 Audio analytics: This feature enables businesses to extract meaning from audio
files. For example, audio analytics can identify keywords in a conversation or
detect emotions in a voice.

 Repository Data access and connectors: Users can easily connect to various
data sources. This includes social media, enterprise applications, and databases.

 Cognitive search: OpenText IDOL enables businesses to find information using


natural language queries. For example, cognitive search can help data scientists
find documents that contain a certain keyword or phrase.

 Unstructured Data Analytics Software for OEM & SDKs: Use our software
development kit to build the apps and APIs you need to take advantage of your
unstructured data.

Advantages of Unstructured Data:


 It supports data that lacks a proper format or sequence

 The data is not constrained by a fixed schema

 It is very flexible due to the absence of a schema

 The data is portable

 It is very scalable

 It can deal easily with heterogeneous sources

 This type of data has a variety of business intelligence and analytics applications

Disadvantages of Unstructured data:

 It is difficult to store and manage unstructured data due to the lack of schema and
structure

 Indexing the data is difficult and error-prone because the structure is unclear and
there are no pre-defined attributes; as a result, search results are often not very accurate

 Ensuring data security is a difficult task

Industry examples of big data

1. Transportation

Big Data powers the GPS smartphone applications most of us depend on to get from
place to place in the least amount of time. GPS data sources include satellite images
and government agencies.

Airplanes generate enormous volumes of data, on the order of 1,000 gigabytes for
transatlantic flights. Aviation analytics systems ingest all of this to analyze fuel
efficiency, passenger and cargo weights, and weather conditions, with a view toward
optimizing safety and energy consumption.

Big Data simplifies and streamlines transportation through:

 Congestion management and traffic control

Thanks to Big Data analytics, Google Maps can now tell you the least traffic-
prone route to any destination.

 Route planning
Different itineraries can be compared in terms of user needs, fuel consumption, and other factors to plan for maximum efficiency.

 Traffic safety
Real-time processing and predictive analytics are used to pinpoint accident-
prone areas.

2. Advertising and Marketing

Ads have always been targeted towards specific consumer segments. In the past,
marketers have employed TV and radio preferences, survey responses, and focus
groups to try to ascertain people’s likely responses to campaigns. At best, these
methods amounted to educated guesswork.

Today, advertisers buy or gather huge quantities of data to identify what consumers
actually click on, search for, and “like.” Marketing campaigns are also monitored for
effectiveness using click-through rates, views, and other precise metrics.

For example, Amazon accumulates massive data stores on the purchases, delivery
methods, and payment preferences of its millions of customers. The company then
sells ad placements that can be highly targeted to very specific segments and
subgroups.

3. Banking and Financial Services

The financial industry puts Big Data and analytics to highly productive use, for:

 Fraud detection
Banks monitor credit cardholders’ purchasing patterns and other activity to
flag atypical movements and anomalies that may signal fraudulent
transactions.

 Risk management
Big Data analytics enable banks to monitor and report on operational
processes, KPIs, and employee activities.

 Customer relationship optimization


Financial institutions analyze data from website usage and transactions to
better understand how to convert prospects to customers and incentivize
greater use of various financial products.

 Personalized marketing
Banks use Big Data to construct rich profiles of individual customer lifestyles,
preferences, and goals, which are then utilized for micro-targeted marketing
initiatives.

4. Government

Government agencies collect voluminous quantities of data, but many, especially at the local level, don’t employ modern data mining and analytics techniques to extract real value from it.

Examples of agencies that do include the IRS and the Social Security Administration,
which use data analysis to identify tax fraud and fraudulent disability claims. The FBI
and SEC apply Big Data strategies to monitor markets in their quest to detect
criminal business activities. For years now, the Federal Housing Authority has been
using Big Data analytics to forecast mortgage default and repayment rates.

The Centers for Disease Control tracks the spread of infectious illnesses using data
from social media, and the FDA deploys Big Data techniques across testing labs to
investigate patterns of foodborne illness. The U.S. Department of Agriculture
supports agribusiness and ranching by developing Big Data-driven technologies.

Military agencies, with expert assistance from a sizable ecosystem of defense contractors, make sophisticated and extensive use of data-driven insights for domestic intelligence, foreign surveillance, and cybersecurity.

5. Media and Entertainment

The entertainment industry harnesses Big Data to glean insights from customer
reviews, predict audience interests and preferences, optimize programming
schedules, and target marketing campaigns.

Two conspicuous examples are Amazon Prime, which uses Big Data analytics to
recommend programming for individual users, and Spotify, which does the same to
offer personalized music suggestions.

6. Meteorology

Weather satellites and sensors all over the world collect large amounts of data for
tracking environmental conditions. Meteorologists use Big Data to:

 Study natural disaster patterns

 Prepare weather forecasts

 Understand the impact of global warming

 Predict the availability of drinking water in various world regions

 Provide early warning of impending crises such as hurricanes and tsunamis

7. Healthcare

Big Data is slowly but surely making a major impact on the huge healthcare industry.
Wearable devices and sensors collect patient data which is then fed in real-time to
individuals’ electronic health records. Providers and practice organizations are now
using Big Data for a number of purposes, including these:

 Prediction of epidemic outbreaks

 Early symptom detection to avoid preventable diseases

 Electronic health records

 Real-time alerting

 Enhancing patient engagement

 Prediction and prevention of serious medical conditions

 Strategic planning

 Research acceleration

 Telemedicine

 Enhanced analysis of medical images

8. Cybersecurity

While Big Data can expose businesses to a greater risk of cyberattacks, the same
datastores can be used to prevent and counteract online crime through the power of
machine learning and analytics. Historical data analysis can yield intelligence to
create more effective threat controls. And machine learning can warn businesses
when deviations from normal patterns and sequences occur, so that effective
countermeasures can be taken against threats such as ransomware attacks, malicious
insider programs, and attempts at unauthorized access.

After a company has suffered an intrusion or data theft, post-attack analysis can
uncover the methods used, and machine learning can then be deployed to devise
safeguards that will foil similar attempts in the future.

9. Education

Administrators, faculty, and stakeholders are embracing Big Data to help improve
their curricula, attract the best talent, and optimize the student experience. Examples
include:

 Customizing curricula
Big Data enables academic programs to be tailored to the needs of individual
students, often drawing on a combination of online learning, traditional on-
site classes, and independent study.

 Reducing dropout rates
Predictive analytics give educational institutions insights on student results, responses to proposed programs of study, and input on how students fare in the job market after graduation.

 Improving student outcomes
Analyzing students’ personal “data trails” can provide a better understanding of their learning styles and behaviors, and be used to create an optimal learning environment.

 Targeted international recruiting
Big Data analysis helps institutions more accurately predict applicants’ likely success. Conversely, it aids international students in pinpointing the schools best matched to their academic goals and most likely to admit them.

10. Telecommunications:

Network Optimization: Telecommunications companies leverage big data analytics to optimize network performance, improve service quality, and enhance customer experiences. Network data, call logs, and customer usage patterns are analyzed to identify network congestion points, predict capacity demands, and prioritize network upgrades.

Customer Churn Prediction: Big data analytics help telecom operators predict
customer churn, identify at-risk customers, and implement targeted retention
strategies. By analyzing customer behavior, usage patterns, and billing data, telecom

companies can personalize offers, improve customer satisfaction, and reduce churn
rates.

11. Retail and E-commerce:

Customer Analytics: Retailers analyze big data from various sources such as
transaction records, website visits, and social media interactions to understand
customer preferences, behavior patterns, and purchasing trends. This information is
used to personalize marketing campaigns, optimize product assortments, and
improve customer experiences.

Supply Chain Optimization: Big data analytics help retailers optimize inventory
management, supply chain logistics, and demand forecasting. By analyzing historical
sales data, weather patterns, and market trends, retailers can anticipate demand
fluctuations, reduce stockouts, and optimize inventory levels across their distribution
networks.

Web analytics
What is web analytics?
Web analytics is the process of analyzing the behavior of visitors to a website. This
involves tracking, reviewing and reporting data to measure web activity, including
the use of a website and its components, such as webpages, images and videos.

Data collected through web analytics may include traffic sources, referring sites, page
views, paths taken and conversion rates. The compiled data often forms a part of
customer relationship management analytics (CRM analytics) to facilitate and
streamline better business decisions.

Web analytics enables a business to retain customers, attract more visitors and
increase the dollar volume each customer spends.

Analytics can help in the following ways:

 Determine the likelihood that a given customer will repurchase a product after purchasing it in the past.

 Personalize the site to customers who visit it repeatedly.

 Monitor the amount of money individual customers or specific groups of customers spend.

 Observe the geographic regions from which the most and the least customers visit the site and purchase specific products.

 Predict which products customers are most and least likely to buy in the
future.

The objective of web analytics is to serve as a business metric for promoting specific
products to the customers who are most likely to buy them and to determine which
products a specific customer is most likely to purchase. This can help improve the
ratio of revenue to marketing costs.

In addition to these features, web analytics may track the clickthrough and drilldown
behavior of customers within a website, determine the sites from which customers
most often arrive, and communicate with browsers to track and analyze online
behavior. The results of web analytics are provided in the form of tables, charts and
graphs.

Follow these steps as part of the web analytics process.

Web analytics process


The web analytics process involves the following steps:

1. Setting goals. The first step in the web analytics process is for businesses to
determine goals and the end results they are trying to achieve. These goals
can include increased sales, customer satisfaction and brand awareness.
Business goals can be both quantitative and qualitative.

2. Collecting data. The second step in web analytics is the collection and
storage of data. Businesses can collect data directly from a website or web
analytics tool, such as Google Analytics. The data mainly comes
from Hypertext Transfer Protocol requests -- including data at the network
and application levels -- and can be combined with external data to interpret
web usage. For example, a user's Internet Protocol address is typically
associated with many factors, including geographic location and clickthrough
rates.

3. Processing data. The next stage of the web analytics funnel involves
businesses processing the collected data into actionable information.

4. Identifying key performance indicators (KPIs). In web analytics, a KPI is a


quantifiable measure to monitor and analyze user behavior on a website.
Examples include bounce rates, unique users, user sessions and on-site search
queries.

5. Developing a strategy. This stage involves implementing insights to


formulate strategies that align with an organization's goals. For example,
search queries conducted on-site can help an organization develop a content
strategy based on what users are searching for on its website.

6. Experimenting and testing. Businesses need to experiment with different


strategies in order to find the one that yields the best results. For example,
A/B testing is a simple strategy to help learn how an audience responds to
different content. The process involves creating two or more versions of
content and then displaying it to different audience segments to reveal which
version of the content performs better.

What is web analytics used for?


Web analytics is helpful for understanding which channels users come through to
your website. You can also identify popular site content by calculating the average
length of stay on your web pages and how users interact with them—including which
pages prompt users to leave.

The process of web analytics involves:

 Setting business goals: Defining the key metrics that will determine the success of your business and website

 Collecting data: Gathering information, statistics, and data on website visitors using analytics tools

 Processing data: Converting the raw data you’ve gathered into meaningful ratios, KPIs, and other information that tell a story

 Reporting data: Displaying the processed data in an easy-to-read format

 Developing an online strategy: Creating a plan to optimize the website experience to meet business goals

 Experimenting: Doing A/B tests to determine the best way to optimize website performance

You can use this information to optimize underperforming pages and further
promote higher-performing ones across your website. For example, French news
publisher Le Monde used analytics to inform a website redesign that increased
subscriber conversions by 46 percent and grew digital subscriptions by over 20
percent. Le Monde was able to identify which paid content users engaged with the
most, then use that information to highlight top-performing content on the
homepage.

The importance of web analytics


Your company’s website is probably the first place your users go to learn
more about your product. In fact, your website is also a product. That’s why the data
you collect on your website visitors can tell you a lot about them and their website
and product expectations.

Here are a few reasons why web analytics are important:

1. Understand your website visitors

Web analytics tools reveal key details about your site visitors—including their
average time spent on page and whether they’re a new or returning user—
and which content draws in the most traffic. With this information, you’ll learn
more about what parts of your website and product interest users and
potential customers the most.

For instance, an analytics tool might show you that a majority of your website
visitors are landing on your German site. You could use this information to

ensure you have a German version of your product that’s well translated to
meet the needs of these users.

2. Analyze website conversions

Conversions could mean real purchases, signing up for your newsletter, or filling out a contact form on your website. Web analytics can give you information about the total number of these conversions, how much you earned from the conversions, the percentage of conversions (number of conversions divided by the number of website sessions), and the abandonment rate. You can also see the “conversion path,” which shows you how your users moved through your site before they converted.

By looking at the above data, you can do conversion rate optimization (CRO).
CRO will help you design your website to achieve the optimum quantity and
quality of conversions.

Web analytics tools can also show you important metrics that help you boost
purchases on your site. Some tools offer an enhanced ecommerce tracking
feature to help you figure out which are the top-selling products on your
website. Once you know this, you can refine your focus on your top-sellers
and boost your product sales.

3. Boost your search engine optimization (SEO)

By connecting your web analytics tool with Google Search Console, it’s possible
to track which search queries are generating the most traffic for your site.
With this data, you’ll know what type of content to create to answer those
queries and boost your site’s search rankings.

It’s also possible to set up onsite search tracking to know what users are
searching for on your site. This search data can further help you generate
content ideas for your site, especially if you have a blog.

4. Understand top performing content

Web analytics tools will also help you learn which content is performing the
best on your site, so you can focus on the types of content that work and also
use that information to make product improvements. For instance, you may
notice blog articles that talk about design are the most popular on your
website. This might signal that your users care about the design feature of
your product (if you offer design as a product feature), so you can invest more
resources into the design feature. The popular content pieces on your website
could spark ideas for new product features, too.

5. Understand and optimize referral sources

Web analytics will tell you who your top referral sources are, so you know
which channels to focus on. If you’re getting 80% of your traffic from
Instagram, your company’s marketers will know that they should invest in ads
on that platform.

Web analytics also shows you which outbound links on your site people are
clicking on. Your company’s marketing team might discover a mutually
beneficial relationship with these external websites, so you can reach out to
them to explore partnership or cross-referral opportunities.

Example metrics to track with web analytics


Website performance metrics vary from company to company based on their goals
for their site. Here are some example KPIs that businesses should consider tracking
as a part of their web analytics practice.

 Page visits / Sessions

Page visits and sessions refer to the traffic to a webpage over a specific period
of time. The more visits, the more your website is getting noticed.

Keep in mind traffic is a relative success metric. If you’re seeing 200 visits a
month to a blog post, that might not seem like great traffic. But if those 200
visits represent high-intent views—views from prospects considering
purchasing your product—that traffic could make the blog post much more
valuable than a high-volume, low-intent piece.

 Source of traffic

Web analytics tools allow you to easily monitor your traffic sources and adjust
your marketing strategy accordingly. For example, if you’re seeing lots of
traffic from email campaigns, you can send out more email campaigns to
boost traffic.

 Total website conversion rate

Total website conversion rate refers to the percentage of people who complete a critically important action or goal on your website. A conversion could be a purchase or when someone signs up for your email list, depending on what you define as a conversion for your website. (A small sketch for computing this and related metrics follows this list.)

 Bounce rate

Bounce rate refers to how many people visit just one page on your website
and then leave your site.

Interpreting bounce rates is an art. A high bounce rate could be both negative
and positive for your business. It’s a negative sign since it shows people are
not interacting with other pages on your site, which might signal low
engagement among your site visitors. On the other hand, if they spend quality
time on a single page, it might indicate that users are getting all the
information they need, which could be a positive sign. That’s why you need to
investigate bounce rates further to understand what they might mean.

 Repeat visit rate

Repeat visit rate tells you how many people are visiting your website regularly
or repeatedly. This is your core audience since it consists of the website
visitors you’ve managed to retain. Usually, a repeat visit rate of 30% is good.
Anything below 20% shows your website is not engaging enough.

 Monthly unique visitors

Monthly unique visitors refers to the number of visitors who visit your site for
the first time each month.

This metric shows how effective your site is at attracting new visitors each
month, which is important for your growth. Ideally, a healthy website will show
a steady flow of new visitors to the site.

 Unique ecommerce metrics

Along with tracking these basic metrics, an ecommerce company’s team might
also track additional KPIs to understand how to boost sales:

o Shopping cart abandonment rate shows how many people leave their shopping carts without actually making a purchase. This number should be as low as possible.

o Other relevant ecommerce metrics include average order value and the average number of products per sale. You need to boost these metrics if you want to increase sales.
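
To make the formulas above concrete, here is a minimal Python sketch, assuming simple session counts, that computes conversion rate, bounce rate, repeat visit rate, and cart abandonment rate. The example numbers are made up for illustration.

    # Hypothetical monthly counts used only to illustrate the formulas.
    sessions = 12000
    conversions = 480
    single_page_sessions = 5400
    returning_visitors = 3300
    total_visitors = 10000
    carts_created = 900
    carts_purchased = 540

    conversion_rate = conversions / sessions * 100            # conversions / sessions
    bounce_rate = single_page_sessions / sessions * 100       # one-page visits / sessions
    repeat_visit_rate = returning_visitors / total_visitors * 100
    cart_abandonment_rate = (carts_created - carts_purchased) / carts_created * 100

    print(f"Conversion rate:       {conversion_rate:.1f}%")
    print(f"Bounce rate:           {bounce_rate:.1f}%")
    print(f"Repeat visit rate:     {repeat_visit_rate:.1f}%")
    print(f"Cart abandonment rate: {cart_abandonment_rate:.1f}%")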

What are the two main categories of web analytics?


The two main categories of web analytics are off-site web analytics and on-site web
analytics.

Off-site web analytics

The term off-site web analytics refers to the practice of monitoring visitor activity
outside of an organization's website to measure potential audience. Off-site web
analytics provides an industrywide analysis that gives insight into how a business is
performing in comparison to competitors. It refers to the type of analytics that
focuses on data collected from across the web, such as social media, search
engines and forums.

On-site web analytics

On-site web analytics refers to a narrower focus that uses analytics to track the
activity of visitors to a specific site to see how the site is performing. The data
gathered is usually more relevant to a site's owner and can include details on site
engagement, such as what content is most popular. Two technological approaches to
on-site web analytics include log file analysis and page tagging.

Log file analysis, also known as Log Management, is the process of analyzing data
gathered from log files to monitor, troubleshoot and report on the performance of a
website. Log files hold records of virtually every action taken on a network server,
such as a web server, email server, database server or file server.
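
As a rough sketch of log file analysis, the following Python snippet parses a few web-server access-log lines in the common log format and counts requests per page. The sample lines and field layout are assumptions for illustration; real log formats vary by server configuration.

    from collections import Counter

    # Two made-up access-log lines in (roughly) the common log format.
    log_lines = [
        '203.0.113.5 - - [01/Apr/2024:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120',
        '203.0.113.7 - - [01/Apr/2024:10:00:03 +0000] "GET /products.html HTTP/1.1" 200 7430',
    ]

    page_views = Counter()
    for line in log_lines:
        # The requested path is the second token inside the quoted request string.
        request = line.split('"')[1]          # e.g. 'GET /index.html HTTP/1.1'
        path = request.split()[1]
        page_views[path] += 1

    for path, count in page_views.most_common():
        print(path, count)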

Page tagging is the process of adding snippets of code into a website's HyperText
Markup Language code using a tag management system to track website visitors and
their interactions across the website. These snippets of code are called tags. When
businesses add these tags to a website, they can be used to track any number of
metrics, such as the number of pages viewed, the number of unique visitors and the
number of specific products viewed.

Common issues with web analytics


While web analytics can be extremely useful for optimizing the website experience,
there are some drawbacks to it. Some of these include:

1. Keeping track of too many metrics

There are so many data points available to track. It can be overwhelming to combine web analytics, product analytics, customer experience tools, heatmaps, and other business intelligence analytics to make sense of things.

As a general rule, only measure the metrics that are important to your
business goals, and ignore the rest. For example, if your primary goal is to
increase sales in a certain location, you don’t need metrics about anything
outside of that location.

2. Data is not always accurate

The data collected by analytics tools is not always accurate. Many users may
opt-out of analytics services, preventing web analytics tools from collecting
information on them. They may also block cookies, further preventing the
collection of their data and leading to a lot of missing information in the data
reported by analytics tools. As we move towards a cookieless world, you’ll
need to consider analytics solutions that track first-party data, rather than
relying on third-party data.

Your web analytics tool may also be using incorrect data filters, which may
skew the information it collects, making the data inaccurate and unreliable.
And there’s not much you can do with unreliable data.

3. Data privacy is at risk

Untracked or overly exposed data can cause privacy or security vulnerabilities.


People could reveal all sorts of personal information about themselves on
your website, including credit card details and their address. Any breach to an
analytics service provider that compromises your user data can be devastating
for your business’ reputation. Since privacy laws have become more
stringent over the last decade globally, it’s important you pay attention to
cyber security.

Website data is particularly sensitive. Make sure your web analytics tools have
proper monitoring procedures and security testing in place. Take steps to
protect your website against any potential threats.

4. Data doesn’t tell the whole story

While web analytics are useful to learn how users are interacting with your
website, they only scratch the surface when it comes to understanding user
behavior. Web analytics can tell you what users are doing, but not why they
do it. To understand behaviors, you need to go beyond web analytics and
leverage a behavioral analytics solution like Amplitude Analytics. By looking at
behavioral product data, you’ll see which actions drive higher engagement,
retention, and lifetime value.

Web analytics tools


Web analytics tools report important statistics on a website, such as where visitors
came from, how long they stayed, how they found the site and their online activity
while on the site. In addition to web analytics, these tools are commonly used for
product analytics, social media analytics and marketing analytics.

Web analytics tools, like Google Analytics, report important website statistics to
analyze the behavior of visitors as part of CRM analytics to facilitate and streamline
business decisions.

Some examples of web analytics tools include the following:

1. Web Analytics - Google Analytics

Analytics Tools offer an insight into the performance of your website, visitors’
behavior, and data flow. These tools are inexpensive and easy to use. Sometimes,
they are even free.

Google Analytics

Google Analytics is a freemium analytics tool that provides detailed statistics of web traffic. It is used by more than 60% of website owners.

Google Analytics helps you to track and measure visitors, traffic sources, goals, conversions, and other metrics. It basically generates reports on −

 Audience Analysis

 Acquisition Analysis

 Behavior Analysis

 Conversion Analysis

Let us discuss each one of them in detail.

Audience Analysis

As the name suggests, audience analysis gives you an overview of the audience who
visit your site along with their session history, page-views, bounce rate, etc. You can
trace the new as well as the returning users along with their geographical locations.
You can track −

 The age and gender of your audience under Demographics.

 The affinity reach and market segmentation under Interests.

 Language and location under Geo.

 New and returning visitors, their frequency, and engagement under Behavior.

 Browsers, Operating systems, and network of your audience in Technology.

 Mobile device info under Mobile.

 Custom variable report under Custom. This report shows the activity by
custom modules that you created to capture the selections.

 Benchmarking channels, locations, and devices under Benchmarking. Benchmarking allows you to compare your metrics with other related industries, so you can see what you need to do in order to catch up with or overtake the market.

 Flow of user activity under Users flow to see the path they took on your
website.

Acquisition Analysis

Acquisition means ‘to acquire.’ Acquisition analysis is carried out to find out the
sources from where your web traffic originates. Using acquisition analysis, you can −

 Capture traffic from all channels, particular source/medium, and from referrals.

 Trace traffic from AdWords (paid search).

 See traffic from search engines. Here, you can see Queries, triggered landing
pages, and geographical summary.

 Track social media traffic. It helps you to identify the networks where your users are engaged. You can see the referrals from where your traffic originates. You can also have a view of your hub activity, bookmarking site follow-ups, etc. In the same tab, you can have a look at your endorsements in detail. It helps you measure the impact of social media on your website.

 See which plug-ins gave you traffic.

 Have a look at all the campaigns you built throughout your website with
detailed statistics of paid/organic keywords and the cost incurred on it.

Behavior Analysis

Behavior analysis monitors users’ activities on a website. You can find behavioral data
under the following four segments −

 Site Content − It shows how many pages were viewed. You can see the
detailed interaction of data across all pages or in segments like content drill-
down, landing pages, and exit pages. Content drill-down is breaking up of
data into sub-folders. Landing page is the page where the user lands, and exit
page is where the user exits your site. You can measure the behavioral flow in
terms of content.

 Site Speed − Here, you can capture page load time, execution speed, and performance data. You can see how quickly the browser can parse through the page. Further, you can measure page timings and user timings, and get speed suggestions. It helps you to know where you are lagging.

 Site Search − It gives you a full picture of how the users search across your
site, what they normally look for, and how they arrive at a particular landing
page. You can analyze what they search for before landing on your website.

 Events − Events are visitors’ actions with content, which can be traced
independently. Example − downloads, sign up, log-in, etc.

Conversion Analysis

Conversion is a goal completion or a transaction by a user on your website. For example, download, checkout, buy, etc. To track conversions in analytics, you need to define a goal and set a URL that is traceable.

 Goals − Metrics that measure a profitable activity that you want the user to
complete. You can set them to track the actions. Each time a goal is achieved,
a conversion is added to your data. You can observe goal completion, value,
reverse path, and goal flow.

 Ecommerce − You can set ecommerce tracking to know what the users buy
from your website. It helps you to find product performance, sale
performance, transactions, and purchase time. Based on these data, you can
analyze what can be beneficial and what can incur you loss.

 Multi-channel funnels − Multi-channel funnels or MCF report the source of conversion: what role the website plays, what role referrals play in that conversion, and what steps users passed through from landing page to conversion. For example, a user searched for a query on the Google search page and visited the website, but did not convert. Later on, he directly typed your website name and made a purchase. All these activities can be traced in MCF.

 Attribution − Attribution modeling credits sales and conversions to touchpoints in conversion tracking. It lets you decide which platform, strategy, or module is the best for your business. Suppose a person visited your website through an AdWords ad and made no purchase. A month later, he visits via a social platform and again does not buy. The third time, he visits directly and converts. Here, the last-interaction model will credit direct for the conversion, whereas the first-interaction model will assign credit to the paid medium. This way, you can analyze which module should be credited for a conversion (a small attribution sketch follows).
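
The following minimal Python sketch contrasts the first-interaction and last-interaction attribution models described above, using an invented list of channel touchpoints per converting user.

    from collections import Counter

    # Hypothetical conversion paths: the ordered channels each converting user touched.
    conversion_paths = [
        ["Paid Search", "Social", "Direct"],
        ["Organic Search", "Direct"],
        ["Social", "Email", "Direct"],
    ]

    first_touch = Counter(path[0] for path in conversion_paths)   # first-interaction model
    last_touch = Counter(path[-1] for path in conversion_paths)   # last-interaction model

    print("First-interaction credit:", dict(first_touch))
    print("Last-interaction credit: ", dict(last_touch))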

2. Web Analytics - Optimizely

Optimizely is an optimization platform to test and validate changes and the present
look of your webpage. It also determines which layout to finally go with. It uses A/B
Testing, Multipage, and Multivariate Testing to improve and analyze your website.

A wonderful feature of Optimizely is that you do not need to be a technical expert. You just need to insert the code snippet provided by Optimizely into your HTML. After putting it in, you can track anything, take any action, and make any changes to your website.

Optimizely provides you administrative and management functionality to let you create accounts, organize projects, and run experiments. This facility helps you in tracking clicks, conversions, sign-ups, etc.

You are allowed to run tests and use custom integrations with Optimizely interface.
All you need is −

 Set up an account on Optimizely and add a generated script.

 Once you are done with it, select your test pages. It implies the factors you
want to run test on.

 Set Goals. To set goals, click on the flag icon at the top right of the page and
follow up the instructions. Check metrics you are looking for. Click Save.

 You can create variations with the visual editor, such as by changing text and images.

 The next step is monitoring your tests. You need to check which landing pages are performing well, what is attracting the visitors, and what the bounce rate is. Understand the statistics, filter out the non-performing areas, and conclude the test.

 You can run multipage tests using the JavaScript editor.

Optimizely gives you a better understanding of conversion rate optimization and running tests.

3. Web Analytics - KISSmetrics

KISSmetrics is a powerful web analytics tool that delivers key insights and user
interaction on your website. It defines a clear picture of users’ activities on your
website and collects acquisition data of every visitor.

You can use this service free for a month. After that, you can switch to a paid plan that suits you. KISSmetrics helps in improving sales by identifying cart-abandoned products. It helps you to know exactly when to follow up with your customers by tracking repeat buyers’ activity.

KISSmetrics helps you identify the following −

 Cart size

 Landing page conversion rate

 Customer activity on your portal

 Customer bounce points

 Cart abandoned products

 Customer occurrence before making a purchase

 Customer lifetime value, etc.

Summarizing KISSmetrics

 It gets you more customers by not letting you lose potential customers and
maintaining brand loyalty.

 It lets you judge which of your decisions are working well.

 It helps you identify data and trends, which contribute in customer acquisition.

Best Features of KISSmetrics

 Ability to track effective marketing channels.

 Figure out how much time a user takes to convert.

 Determine the degree to which users were engaged with your site.

 A convenient dashboard. You do not need to run around searching for figures.

Installation

Just sign-up for an account and customize accordingly.

Tracking

Add a JavaScript snippet under the <head> tag of the source code of your website.

Event Setting

By default, KISSmetrics sets two events for you − visited site and search engine hit.
To add more events, click on new event, add an attribute and record an event name.

Setting up Metrics

Click on create a new metric. Select your metric type from the list. Give metric name,
description, and event. Save metric.

Define Conversions

Define your conversions and track them. Select the number of times the event happened, give the metric a name and description, select the event, and save the metric again.

KISSmetrics can track web pages, mobile apps, mobile web, Facebook apps, and can blend all the data into one view, so you don’t need multiple analytics platforms.

4. Web Analytics - Crazy Egg

Crazy Egg is an online analytics application that provides you eye-tracking tools. It
generates heatmaps based on where people clicked on your website. Thus, it gives
you an idea on where to focus. It lets you filter data on top 15 referrers, search terms,
operating systems, etc.

To use Crazy Egg, a small piece of JavaScript code needs to be placed on your site
pages.

Once the code is on your site, Crazy Egg will track user behavior. Your servers will
create a report that shows you the clicks on the pages you are tracking. You can
review the reports in the dashboard within the member’s area of the Crazy Egg site.
Setting up Crazy Egg is a quick and easy task.

It offers you insights in four different ways −

 Heatmaps − It gives you a clear picture of where visitors clicked on your page, so you know where to make changes to improve conversions.

 Scrollmaps − It gives you insight into how far people scroll down on your page. With Crazy Egg, you can see where people leave your page, where exactly to hold them, and where to add more content to hold them for longer.

 Overlay Tool − It gives you an overlay report of the number of clicks occurring on each element of your page, so you can dig deeper into it.

 Confetti − Confetti distinguishes clicks for you, segmented by referral sources, search terms, etc. Once you know the origin of your clicks, you can uncover the traffic sources, put extra effort there, and earn more traffic and revenue.

Installation

Insert the JavaScript code into the source code of your website. Crazy Egg will track user behavior by default, and the servers generate reports providing you the view. Set up the dashboard to review the reports.

Web Analytics - Key Metrics


You need to identify a few key metrics for your business. You have a website and it has a tracking code in it. Now, you need to decide what you are going to measure. Analyzing the right metrics can help you retain your customers and keep them engaged.

What to Measure

1. Audience

 Pageviews − Pageviews is the number of views of a page. Multiple pageviews are possible in a single session. If pageviews improve, it will directly influence AdSense revenue and average time on the website.

 Bounce rate − Bounce rate reflects the percentage of visitors who leave after visiting only one page of your website. It helps you to know how many visitors do so. If the bounce rate of a website increases, its webmaster should be worried.

 Pages per session − Pages/session is the number of pages surfed in a single
session. For example, a user landed on your website and surfed 3 pages, then
the website pages/session is 3.

 Demographic info − Demographic data shows age and gender. With the help of demographic info, you can find the percentage of male and female visitors coming to your website. Analyzing this ratio, you can make a strategy for each gender. Age-group data helps you find what percentage of each age group is visiting your website, so you can make a strategy for the age group with the highest share of visitors.

 Devices − This data shows device info. In device info, you can easily find what percentage of visitors come from mobile, how many come from desktop, how many come from tablets, etc. If mobile traffic is high, then you need to make your website responsive.

2. Acquisition

Traffic sources − In acquisition, you have to check all your sources of traffic. The major sources of traffic are listed below; a small sketch that classifies sessions by traffic source follows the list.

 Organic traffic is the traffic coming through all search engines (Google,
Yahoo, Bing....)

 Social traffic is the traffic coming through all social media platforms (like −
Facebook, Twitter, Google+, ...)

 Referral traffic is the traffic coming through where your website is linked.

 Direct traffic is the traffic coming directly to your website. For example,
typing the url of your website, clicking on the link of your website given in
emails, etc.

 Source/Medium − This metrics gives you an idea of the sources from where
you are getting traffic (Google, Yahoo, Bing, Direct, Facebook...).
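
As a rough sketch of how sessions can be bucketed into the traffic sources above, the following Python snippet classifies a session by its referrer domain. The domain lists and sample referrers are assumptions for illustration; real analytics tools use much richer rules (UTM parameters, paid-click identifiers, etc.).

    from urllib.parse import urlparse

    SEARCH_ENGINES = {"google.com", "bing.com", "yahoo.com"}
    SOCIAL_SITES = {"facebook.com", "twitter.com", "instagram.com"}

    def classify_source(referrer):
        # No referrer at all means the visitor typed the URL or used a bookmark.
        if not referrer:
            return "Direct"
        domain = urlparse(referrer).netloc.lower().removeprefix("www.")
        if domain in SEARCH_ENGINES:
            return "Organic"
        if domain in SOCIAL_SITES:
            return "Social"
        return "Referral"

    for ref in ["https://www.google.com/search?q=shoes", "", "https://blog.example.com/post"]:
        print(repr(ref), "->", classify_source(ref))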

3. Site Content

 Landing pages − Landing pages are the pages where visitors land first (normally, the home page of a website is the main landing page). With the help of this metric, you can find the top pages of the website and analyze how many pages are getting 50% or more of the website’s traffic. So, you can easily find which type of content is working for you and, based on this analysis, plan the next content strategy.

 Site speed − Site speed is the metric used for checking page timing (average page load time). Using this metric, you can find which page is taking more time to load, how many pages have a high load time, etc.

Web Analytics - Data Sources


Data sources are simply the files created by a database management system or a data feed. The objective of keeping a data source is to encapsulate all information in one place and hide it from the users, e.g., payroll, inventory, etc.

Server Logs

Log files list actions that take place. They maintain files for every request invoked, for
example, the source of visitor, their next action, etc.

A server log is a simple text file that records activity on the server. It is created automatically and maintained by the server. With the help of a server log file, you can find the activity details of the website/pages. In the activity sheet, you can find the data with IP address, time/date, and pages. It gives you insight into the type of browser, country, and origin. These files are only for the webmasters, not for the website users. The statistics provided by server logs are used to examine traffic patterns segmented by day, week, or referrer.

Visitors' Data

Visitors’ data shows the total traffic of the website. It can be calculated by any web
analytics tool. With the help of visitors’ data, you can analyze your website
improvement and can update your servers accordingly. It may comprise of −

 A top-level view of metrics

 Age and Gender of visitors

 User behavior, their location and interests

 Technology they are using, e.g., browsers and operating systems

 Breakdown of your website on devices other than desktops

 User Flow

Search Engine Statistics

Search engine statistics show the data that is acquired through organic traffic. If the search engine traffic of a website has improved, then it means the website’s search ranking for its main keywords has improved. This data also helps you to −

 Find the revenue-generating keywords and the keywords that visitors type into search engines.

 See how different search engines affect your data.

 See where you are lagging and where you need to focus.

Conversion Funnels

A conversion funnel is the path by which a goal (product purchase, lead form completed, service contact form submitted, etc.) is completed. It is the series of steps visitors take to become customers. If a large number of visitors are leaving the website without any purchase, then you can use conversion funnels to analyze the following −

 Why are they leaving the website?

 Is there any problem with the conversion path?

 Is there any broken link in the conversion path or any other feature that is not
working in the conversion path?

Conversion funnels help you visualize the following aspects in the form of graphics (a small drop-off calculation sketch follows this list) −

 The hurdles the users are facing before converting

 Where the emotional behaviors of the users alter

 Where technical bugs become a nuisance for the customers
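
To make the funnel idea concrete, here is a minimal Python sketch that computes the step-to-step drop-off for an invented checkout funnel. The step names and counts are assumptions for illustration.

    # Hypothetical visitor counts at each step of a checkout funnel.
    funnel = [
        ("Product page", 10000),
        ("Add to cart", 2500),
        ("Checkout", 1200),
        ("Purchase", 800),
    ]

    for (step, count), (next_step, next_count) in zip(funnel, funnel[1:]):
        drop_off = (count - next_count) / count * 100
        print(f"{step} -> {next_step}: {drop_off:.1f}% drop-off")

    overall_conversion = funnel[-1][1] / funnel[0][1] * 100
    print(f"Overall funnel conversion: {overall_conversion:.1f}%")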

Web Analytics - Segmentation


Segmentation is the process that segregates the data to find the actionable items.
For example, you can categorize your entire website traffic data as one segment for a
“Country,” and one for a specific City.

For users, you can create segments such as those who purchased your products and those who only visited your website. During remarketing, you can target those audiences with the help of these segments.

Data Segmentation

Data segmentation is very useful for analyzing website traffic. In analytics, you can analyze traffic insights with the help of segmentation, and segments can be added directly in Google Analytics.

For a website, you can segment total traffic according to Acquisition, Goals, and Channels. Following are the types of acquisition segmentation −

 Organic Traffic − It shows only the organic traffic of the website. You can find
which search engine (Google, Yahoo, Bing, Baidu, Aol, etc.) is working for you.
With the help of organic traffic, you can also find the top keywords that send
traffic to your website.

 Referrals Traffic − This segment shows the total referrals traffic of the
website. With the help of this segment, you can find the top referrals website
that send traffic to your website.

 Direct Traffic − This segment helps you find the traffic that visit your website
directly.

 Social Traffic − With the help of the social segment, you can analyze social traffic: how much traffic you are getting from social media, and which platform (Facebook, G+, Twitter, Pinterest, StumbleUpon, Reddit, etc.) is sending traffic to your website. With the help of this segment, you can shape your future social media strategy. For example, if Facebook is sending the highest traffic to your website, then you can increase your Facebook post frequency.

 Paid Traffic − Paid traffic segment captures traffic through paid channels
(Google AdWords, Twitter ads...).

Analysis Using Segmentation

When you are done with your segments (that is, you have collected the data for each segment), the next step is analysis. Analysis is all about finding the actionable items in the data.

Example

Let’s map a table for analysis.

Month       Jan   Feb   Mar   April   May   June   July   Aug   Sep

Organic     40K   42K   40K   43K     45K   47K    57K    54K   60K

Referrals   5K    4K    5K    4K      6K    5K     4K     3K    4K

Social      1K    1K    2K    4K      2K    3K     5K     5K    4K

Analysis

 From the above table, you can see that your organic traffic is growing
(improved 20k in 9 months). Referrals traffic is going down. Social traffic has
also improved (1k to 4k).

 Find out the pages that send traffic in organic traffic. Analyze them.

 Find out which social platform is working for you

Actionable

 Add new pages according to organic traffic sender pages.

 Focus on the social media platform that is sending the highest traffic.

 Find out why your referrals traffic is going down. Was any link that was sending traffic earlier removed from the website? (A small sketch for computing these month-over-month changes follows.)
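
A minimal Python sketch of this kind of segment analysis, using the table above, computes the overall change per segment and flags declining ones. The numbers are copied from the example table.

    # Monthly traffic (in thousands) per segment, from the example table above.
    segments = {
        "Organic":   [40, 42, 40, 43, 45, 47, 57, 54, 60],
        "Referrals": [5, 4, 5, 4, 6, 5, 4, 3, 4],
        "Social":    [1, 1, 2, 4, 2, 3, 5, 5, 4],
    }

    for name, series in segments.items():
        change = series[-1] - series[0]
        trend = "growing" if change > 0 else "declining" if change < 0 else "flat"
        print(f"{name}: {series[0]}K -> {series[-1]}K ({change:+d}K, {trend})")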

Web Analytics - Dashboards


A dashboard is an interface showing the graphical status of the trends in your business’s key performance indicators. This helps you to take instantaneous and intelligent decisions. It gives you a visual display of important data that can be encapsulated in a single space, letting you monitor everything at a glance.

Dashboard Implementation

In Google Analytics, you can create dashboards according to your requirements. Dashboards are used for surfacing data; with their help, you can quickly analyze the data. In a dashboard, you have to create widgets as per your requirements.

Types of Dashboards

You can create dashboards according to your requirements. Following are the main
types of dashboards −

 SEO dashboard

 Content dashboard

 Website performance dashboard

 Real time overview dashboard

 Ecommerce dashboard

 Social Media dashboard

 PPC dashboard

In every dashboard, you have to create widgets. Widgets can be in graphical or numeric form.

For example, if you want to create a dashboard for SEO, you have to create a widget
for the total traffic, for the organic traffic, for the keywords, etc. You can analyze
these metrics with the help of SEO dashboard.

If you want to create a dashboard for website performance, then you have to create widgets for website average page load time, website server response time, page load time for mobile, and page load time by browser. With the help of these widgets, you can easily analyze the website’s performance.

Metrics for Every Dashboard

 Search Engine Optimization (SEO) − Organic traffic, website total traffic, keywords used in organic search, top landing pages, etc.

 Content − In content dashboard, you have to monitor traffic for blog section,
Conversion by blog post, and Top landing page by exit.

 Website Performance Dashboard − Avg. page load time, Mobile page load
time, Page load time by browser, and Website server response time.

 Real Time Overview Dashboard − In the real-time overview, you can set widgets for real-time traffic, real-time traffic sources, and real-time landing pages.

 Ecommerce Dashboard − In the ecommerce dashboard: total traffic, landings by product, and total sales by product.

 Social Media Dashboard − In the social media dashboard: traffic by social media channel, sales by social media, and the most socially shared content.

 PPC dashboard − In pay per click (PPC) dashboard, you need to include
clicks, impressions, CTR, converted clicks, etc.

Web Analytics - Conversion


Conversion is when a user visits your page and performs an action, for example,
purchase, sign-up, download, etc.

Goals

Goals are used in analytics for tracking completions of specific actions. With the help
of goals, you can measure the rate of success. Goals are measured differently in
different industries. For example, in an e-commerce website you can measure the
goal when a product gets sold. In a software company, you can measure the goal
when a software product is sold. In a marketing company, goals are measured when
a contact form is filled.

Types of Goals

Goals can be divided into the following categories −

 Destination Goal − A destination goal is completed when a user reaches a specific page of the website. Put the destination URL in the destination field to define the goal.

 Duration Goal − You can measure the user engagement with the help of
duration goal. You can specify hours, minutes, and second field to quantify the
goals. If a user spends more than that much of time on the page, then the
goal is completed.

 Event Goals − You can measure user interaction with events on the site. These are called event goals. You must have at least one event set up to compose this goal.

 Pages/session Goal − You can measure user engagement with the pages/session goal. First, you have to specify how many pageviews per session count as a completed goal. Then, with the help of the goal metric, you can analyze how many goals were completed. (A small goal-evaluation sketch follows this list.)
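
As a rough sketch of how these goal types could be evaluated, the following Python snippet checks a single session record against a destination goal, a duration goal, and a pages/session goal. The session fields and thresholds are assumptions for illustration.

    # A hypothetical session record.
    session = {
        "pages_viewed": ["/home", "/pricing", "/signup/thank-you"],
        "duration_seconds": 240,
    }

    # Example goal definitions (thresholds are made up).
    destination_goal = "/signup/thank-you"     # destination goal: a specific URL was reached
    duration_goal_seconds = 180                # duration goal: session lasted at least 3 minutes
    pages_per_session_goal = 3                 # pages/session goal: at least 3 pageviews

    completed = {
        "destination": destination_goal in session["pages_viewed"],
        "duration": session["duration_seconds"] >= duration_goal_seconds,
        "pages_per_session": len(session["pages_viewed"]) >= pages_per_session_goal,
    }
    print(completed)   # {'destination': True, 'duration': True, 'pages_per_session': True}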

Funnels

Funnels are the steps taken to complete your goals. With the help of funnels, you can review the steps toward goal completion. Let’s suppose that for an ecommerce company, a product sale is the goal completion. The funnel is then the series of steps to purchase that product. If most of the visitors leave the website after adding products to the cart, then you have to check why users are leaving. Is there any problem with the cart section? This can help you improve your product performance or the steps to sell the product.

Multi-Channel Funnels

The Multi-Channel Funnel (MCF) report shows how your marketing channels work together: how many conversions were completed and through which channels. In the MCF report, you can find the following data −

 Assisted Conversion − In assisted conversions, you can find which channel has assisted the highest number of conversions.

 Top Conversion Path − The top conversion path report shows the most common sequences of channels that lead to a conversion.

For example, a path such as Organic Search > Direct with 11 conversions means the users first interacted with your product via organic search; later on, they came to the website directly and made a purchase. With the help of this report, you can easily analyze your top conversion paths to improve your funnels (a small path-counting sketch follows).
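
A minimal Python sketch of such a report groups converting sessions by their ordered channel path and counts each path. The channel sequences are invented for illustration.

    from collections import Counter

    # Hypothetical channel paths for converting users.
    paths = [
        ("Organic Search", "Direct"),
        ("Organic Search", "Direct"),
        ("Social", "Email", "Direct"),
        ("Paid Search",),
    ]

    top_paths = Counter(paths)
    for path, count in top_paths.most_common():
        print(" > ".join(path), "-", count, "conversions")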

Web Analytics - Emerging Analytics


You need to leverage data to drive insights in order to learn customers’ behavior on your website. There is nothing new in that. What alters the game are the emerging analytics trends in social media, e-commerce, and mobile, as these are the new game changers in the digital world.

Social Media Analytics

Social Media Analytics comprises gathering data from social media platforms and analyzing it to derive information for making business decisions. It provides powerful customer insight to uncover sentiments across online sources. You can take control of social media analytics in order to predict customers’ behavior, discover patterns and trends, and make quick decisions to improve your online reputation. Social media analytics also lets you identify the primary influencers within specific network channels. Some of the popular social media analytics tools are discussed below.

Google Social Analytics

It is a free tool that lets you add social media results to your analysis report. You get to know what is being said about your business online: how many people interacted with your website through social media, and how many liked and shared your content.

 SumAll

It combines Twitter, Facebook, and Google Plus into one dashboard to give
you an overall view of what people are talking about you on social media.

 Facebook Insights

Facebook plays a major role in your marketing campaign. You need to familiarize yourself with Facebook data to make the most of it. You need to set up a page for your business to get the insights. It gives you information about who visited your page, saw your posts, liked your page, and shared it.

 Twitter Analytics

Twitter Analytics shows how many impressions each tweet received, what your engagement status is, and when you were at your peak.

 E-commerce Analytics

Business owners need to survive and thrive among tough competition. They
have to become big decision makers in order to survive in the market. This is
where Web Analytics play a critical role.

E-commerce analytics lets you figure out customer acquisition, user behavior, and conversion. In Google Analytics, you can get relevant information about your volume of sales, revenue by product, and the sources from which conversions occurred. You need all this information to find out where your business stands, and to boost e-commerce sales, generate leads, and enhance brand awareness.

 Mobile Analytics

Mobiles have emerged as one of the most significant tools of the past two decades. They have changed the way people communicate and innovate, and this has led to marketing driven by mobile apps.

Mobile apps have proved easy to access and engaging. Webmasters and online businesses need the support of mobile apps to round out their strategy. Once you are done building a mobile app, you’ll need to acquire new users, engage with them, and earn revenue. For this, you need mobile analytics. It helps marketers measure their apps better. For example −

o How many people are using your app

o How to optimize user experience

o How to prioritize

o What operating system to focus on

o How to visualize navigation path, etc.

Web Analytics A/B Testing


A/B Testing or split testing is a comparison between two variants of one aspect, say,
two versions of a webpage. It is like running an experiment between two or more
pages simultaneously to discover which one has the potential to convert more.

For example, e-commerce websites use A/B testing on products to discover which product has the potential to earn more revenue. A second example is an AdWords campaign manager running two ads for the same campaign to learn which of them works better.

A/B testing allows you to extract more out of your existing traffic. You can run A/B tests on headlines, ads, calls to action, links, images, landing pages, etc. A small sketch that compares two variants by conversion rate follows.
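
The following minimal Python sketch compares the observed conversion rates of two page variants; the visitor and conversion counts are made up for illustration, and a real test would also apply a significance test before declaring a winner.

    # Hypothetical A/B test results.
    variants = {
        "A (original headline)": {"visitors": 5000, "conversions": 260},
        "B (new headline)":      {"visitors": 5000, "conversions": 310},
    }

    rates = {}
    for name, data in variants.items():
        rates[name] = data["conversions"] / data["visitors"] * 100
        print(f"{name}: {rates[name]:.2f}% conversion rate")

    winner = max(rates, key=rates.get)
    print("Leading variant so far:", winner)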

Automated Reporting & Annotation

In Google Analytics, we can set up automated reporting. If we want a report of the website’s top 10 landing pages every Monday, we can schedule it in the email section so that it is sent to users automatically.

Annotation

With the help of annotations, we can record which tasks were done on which date. We can annotate updates in Google Analytics. Suppose a Google search update arrived on 21 March; we can then annotate 21 March as “Google update.” Annotations help us find the impact of a change.

Web Analytics - Actionable Reporting


Actionable reporting is the final part of the analysis. When you are done collecting data, the next step is actionable reporting. Graphing the data helps in writing actionable points. Always try to build graphs that show data trends, because visuals convey more information than plain text.

How to Prepare Actionable Report?

Let’s assume we have the following data available for an ecommerce company −

Country          USA   UK    Canada   Australia   China   India

Product sale     200   100   135      120         160     155

Budget Spent

Country              USA   UK   Canada   Australia   China   India

Budget spent in $    10K   9K   8K       9K          8K      5K

Actionable Points

 The highest revenue-generating country is the USA; increase the budget for the USA.

 India has high potential. If we double the budget, then we can make good revenue from India.

 China is doing well. We can increase the budget for China too.

 The UK is not up to the mark, so stop spending money there or find new techniques to improve sales.

 Canada and Australia need improvement. Try the next segment; if you find the same data in the next segment, then stop spending money there too. (A small cost-per-sale sketch follows.)
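
A minimal Python sketch of this comparison divides the budget spent by the units sold to get a rough cost per sale per country, which is one simple way to back up the actionable points above. The figures come from the example tables.

    # Figures from the example tables above (budget in dollars, sales in units).
    data = {
        "USA":       {"sales": 200, "budget": 10_000},
        "UK":        {"sales": 100, "budget": 9_000},
        "Canada":    {"sales": 135, "budget": 8_000},
        "Australia": {"sales": 120, "budget": 9_000},
        "China":     {"sales": 160, "budget": 8_000},
        "India":     {"sales": 155, "budget": 5_000},
    }

    # A lower cost per sale means the marketing budget is working harder in that country.
    for country, d in sorted(data.items(), key=lambda kv: kv[1]["budget"] / kv[1]["sales"]):
        cost_per_sale = d["budget"] / d["sales"]
        print(f"{country:<10} ${cost_per_sale:,.2f} per sale")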

Web Analytics - Terminology


We have listed here a set of terms that one should be familiar with while performing
web analytics −

 Benchmarking − A service that gives a view of how your website is


performing in contrast to others.

 Bounce Rate − The percentage of visits in which a user leaves without exploring further pages (a small calculation sketch follows this list).

 Click − An action of clicking on your webpages.

 Conversion − Conversion takes place when a goal is completed, e.g., purchase, registration, downloads, etc.

 Direct Traffic − Traffic coming directly on your website by clicking on your
website’s link or typing the URL of your website in the address bar.

 Filter − A rule that excludes/includes specific data from reports.

 Funnels − Steps visitors take to finally complete a goal.

 Goal − A metric that defines the success rate, e.g., sale or sign-up.

 Goal Conversion Rate − Percentage of visits on every goal achieved.

 Impression − The display of your website on the Internet.

 Keywords − Search queries that visitors use to find your website.

 Landing Page − The first page from where a visitor enters your website.

 New Visitor − The visitor who is coming to your website for the first time.

 Organic Traffic − Traffic for which you need not pay. It comes naturally, e.g.,
traffic from search engines.

 Paid Traffic − Traffic for which you need to pay, e.g., Google AdWords.

 Page View − Number of times a page is viewed.

 Returning Visitor − The visitors who have already visited your page earlier.
Returning visitors are an asset for any website.

 Time on Site − The average time a visitor spends on your site during a visit.

 Tracking Code − A small snippet of code inserted into the HTML of a page. This code captures information about visits to the page.

 Traffic − Flow of visitors to your website.

 Traffic Sources − The source from where traffic originates.
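
To make two of the metrics above concrete, here is a minimal Python sketch that computes a bounce rate and a goal conversion rate from raw visit counts. The numbers are hypothetical and serve only to show the arithmetic.

def bounce_rate(single_page_visits, total_visits):
    # Share of visits that left without exploring further pages
    return single_page_visits / total_visits

def goal_conversion_rate(goal_completions, total_visits):
    # Share of visits in which the goal (e.g. a sign-up) was completed
    return goal_completions / total_visits

visits, bounces, signups = 12_000, 5_400, 480  # hypothetical monthly figures
print(f"Bounce rate: {bounce_rate(bounces, visits):.1%}")                     # 45.0%
print(f"Goal conversion rate: {goal_conversion_rate(signups, visits):.1%}")   # 4.0%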

Benefits of web analytics


Are you considering investing in web analytics but wondering if it is worthwhile? If
yes, here are the key benefits of analytics.

1. Track Online Traffic

It provides web traffic information, such as your website’s visitors or users at any
given time, the time they spend on the site, where the traffic comes from, and how

visitors interact. Analytics provides easy-to-understand data, helping your company
know what produces the best results and how to invest resources more effectively.

2. Measuring Bounce Rate

The bounce rate is the percentage of visitors leaving your site without interacting
with it. A high bounce rate shows your content is not engaging or does not match
search intent. It is an important parameter that can help you improve user experience
and consequently increase conversion rate.

3. Optimizing Advertising Campaigns

Analytics increases return on investment on marketing campaigns by providing tools
that help with tracking and optimization. A few years back, advertisers did not have
the tools to track the performance of their ads in real time. Today, you can track
performance on the go and optimize accordingly.

4. Data-Driven Decisions

According to a recent survey, highly data-driven companies are three times more
likely to report massive decision-making improvements than organizations that rely
less on data.

Great businesses thrive on high-quality decisions. And the only way to make
decisions that drive business success is to base decisions on data. For example,
it is difficult (if not impossible) to do what your target audience wants if you know
little or nothing about their needs. Off-site and on-site analytics help you discover
what your ideal potential customer is searching for. This enables you to position your
business to attract them.

5. Competitive Edge

Web analytics provides information on your business’s performance and lets you
peep into your competitors' actions. Such intelligence makes outsmarting your
competitors easier. For instance, analytics helps uncover web content gaps, which
opens your company up to opportunities your competitors are missing out on.

6. Market Research Made Easy

Before recent advances in analytics technology, market research was highly costly
and time-consuming, and brands lacked access to detailed and personalized insights.
Web analytics solutions have streamlined market research. Today, you can do
thorough market research with minimal investment.

Big data and marketing

At first glance, marketing and big data might seem like an odd pair, but they’re
actually very complementary. Marketing is all about effectively reaching different
audiences, and data tells us what’s working and what’s not.

These days, there’s more available data than most businesses and marketing teams
know what to do with, which can lead to new opportunities if that data is accurately
interpreted and effectively deployed.

WHY IS BIG DATA IMPORTANT IN MARKETING?


Big data is becoming a fundamental tool in marketing. Data constantly informs
marketing teams of customer behaviors and industry trends, and is used to optimize
future efforts, create innovative campaigns and build lasting relationships with
customers.

Why big data matters to marketing


Having big data doesn’t automatically lead to better marketing – but the potential is
there. Think of big data as your secret ingredient, your raw material, your essential
element. It’s not the data itself that’s so important. Rather, it’s the insights derived
from big data, the decisions you make and the actions you take that make all the
difference.

By combining big data with an integrated marketing management strategy, marketing
organisations can make a substantial impact in these key areas:

 Customer engagement. Big data can deliver insight into not just who your
customers are, but where they are, what they want, how they want to be
contacted and when.

 Customer retention and loyalty. Big data can help you discover what influences customer loyalty and what keeps them coming back again and again.

 Marketing optimisation/performance. With big data, you can determine the optimal marketing spend across multiple channels, as well as continuously optimise marketing programs through testing, measurement and analysis.

Three types of big data that are a big deal for marketing
Customer: The big data category most familiar to marketing may include
behavioural, attitudinal and transactional metrics from such sources as marketing
campaigns, points of sale, websites, customer surveys, social media, online
communities and loyalty programs.

Operational: This big data category typically includes objective metrics that measure
the quality of marketing processes relating to marketing operations, resource
allocation, asset management, budgetary controls, etc.

Financial: Typically housed in an organisation’s financial systems, this big data
category may include sales, revenue, profits and other objective data types that
measure the financial health of the organisation.

The role of Big Data in Marketing


Big Data relates to the huge amounts of structured and unstructured data generated
daily. It has the potential to change how marketers do business.

Big Data enables marketers to gain a fuller comprehension of their customer
behavior, preferences, and demographics by gathering data from various sources,
such as social media, customer feedback, and website analytics.

With the data collected, marketing teams can:

 Optimize pricing decisions

 Improve customer relationship management

 Reduce customer churn.

Five benefits of using Big Data in Marketing
1. Effective predictive modeling

By analyzing customer data, analysts can predict which customers are most likely to
purchase in the future.

The benefit: Marketers use this information to develop targeted marketing campaigns
that are more effective at driving sales.

2. Better personalization

By analyzing customer data, marketing teams can personalize messages and offers
based on each customer’s preferences and behavior.

The benefit: This personalization can lead to increased customer engagement and
loyalty.

3. Optimizing marketing spend

By analyzing customer behavior and preferences, marketers can optimize their
spending budget.

The benefit: Targeting only valuable customers allows companies to get the most
out of their marketing efforts.

4. Reducing customer churn

Analysts can identify which customers are at risk of leaving by analyzing customer
behavior and preferences.

The benefit: Marketers can reduce customer churn by targeting these customers
with personalized offers and messages.

5. Improving customer experience

By analyzing customer behavior and preferences, marketing teams can identify
improvement areas related to customer experience.

The benefit: Companies can improve customer experience and loyalty by making
changes based on this information.

How Is Big Data Used in Marketing?


Big data offers a snapshot of a business’ user base, as trends, preferences and
behavior patterns can be revealed through data analysis. Companies use these data
insights to drive marketing strategies and product decisions.

CUSTOMER ENGAGEMENT AND RETENTION

Big data regarding customers provides marketers details about user demographics,
locations, and interests, which can be used to personalize the product experience
and increase customer loyalty over time.

MARKETING OPTIMIZATION AND PERFORMANCE

Big data solutions can help organize data and pinpoint which marketing campaigns,
strategies or social channels are getting the most traction. This lets marketers
allocate marketing resources and reduce costs for projects that aren’t yielding as
much revenue or meeting desired audience goals.

COMPETITOR TRACKING AND OPERATION ADJUSTMENT

Big data can also compare prices and marketing trends among competitors
to see what consumers prefer. Based on average industry standards,
marketers can then adjust product prices, logistics and other operations to
appeal to customers and remain competitive.

Challenges
The challenges related to the effective use of big data can be especially daunting for
marketing. That's because most analytics systems are not aligned to the marketing
organisation’s data, processes and decisions. For marketing, three of the biggest
challenges are:

 Knowing what data to gather. Data, data everywhere. You have enormous
volumes of customer, operational and financial data to contend with. But
more is not necessarily better – it has to be the right data.

 Knowing which analytical tools to use. As the volume of big data grows, the
time available for making decisions and acting on them is shrinking. Analytical
tools can help you aggregate and analyse data, as well as allocate relevant
insights and decisions appropriately throughout the organisation – but which
ones?

 Knowing how to go from data to insight to impact. Once you have the
data, how do you turn it into insight? And how do you use that insight to
make a positive impact on your marketing programs?

Three steps for going from big data to better marketing


Big data is a big deal in marketing. But there are a few things every marketer should
keep in mind to help ensure that big data will lead to big success:

1. Use big data to dig for deeper insight. Big data affords you the opportunity
to dig deeper and deeper into the data, peeling back layers to reveal richer
insights. The insights you gain from your initial analysis can be explored
further, with richer, deeper insights emerging each time. This level of insight
can help you develop specific strategies and actions to drive growth.

2. Get insights from big data to those who can use it. There’s no debating it –
CMOs need the meaningful insights that big data can provide; but so do
front-line store managers, and call centre phone staff, and sales associates,
and so on and so on. What good is insight if it stays within the confines of the
board room? Get it into the hands of those who can act on it.

3. Don’t try to save the world – at least not at first. Taking on big data can at
times seem overwhelming, so start out by focusing on a few key objectives.
What outcomes would you like to improve? Once you decide that, you can
identify what data you would need to support the related analysis. When
you’ve completed that exercise, move on to your next objective. And the next.

How big data is transforming marketing:


1. Consumer Insights: Big data analytics allow marketers to analyze vast
amounts of data from various sources, including social media, website
interactions, purchase history, and demographic information. By
understanding consumer preferences, interests, and behaviors, marketers can
create more targeted and personalized marketing campaigns.

2. Personalization: Big data enables marketers to deliver personalized experiences
to consumers across multiple channels, such as websites, email, social media, and
mobile apps. By leveraging data on individual preferences, browsing history, and
past interactions, marketers can tailor content, offers, and recommendations to
each consumer's specific needs and interests.

3. Predictive Analytics: Big data analytics tools use predictive modeling techniques
to forecast consumer behavior, trends, and outcomes. Marketers can use predictive
analytics to anticipate customer needs, identify potential leads, and optimize
marketing strategies for better results.

4. Campaign Optimization: Big data analytics provide insights into the performance
of marketing campaigns in real-time. Marketers can track key metrics such as
engagement, conversion rates, and return on investment (ROI) to optimize campaign
targeting, messaging, and allocation of resources for maximum effectiveness.

5. Customer Journey Analysis: Big data enables marketers to map and analyze
the entire customer journey across multiple touchpoints and channels. By
understanding the customer journey, marketers can identify pain points,
optimize user experiences, and implement strategies to guide consumers
through the sales funnel more effectively.

6. Social Media Listening: Big data tools allow marketers to monitor social
media conversations, sentiment, and trends in real-time. Social media listening
provides valuable insights into consumer opinions, brand perception, and
emerging topics, allowing marketers to engage with their audience proactively
and respond to feedback promptly.

7. Segmentation and Targeting: Big data enables marketers to segment their audience
into smaller, more defined groups based on demographic, behavioral, and
psychographic characteristics. Marketers can then target these segments with
tailored messaging and offers that resonate with their specific needs and
preferences.

8. Attribution Modeling: Big data analytics help marketers understand the
contribution of each marketing touchpoint to the overall customer journey and
conversion process. Attribution modeling allows marketers to allocate marketing
budgets more effectively and optimize spending across channels for maximum impact.
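
To illustrate point 8, the sketch below credits revenue to touchpoints under two simple attribution models, last-touch and linear. The conversion paths and channel names are hypothetical; real attribution tools work on much larger event streams but follow the same idea.

from collections import defaultdict

# Each path is the ordered list of touchpoints before a conversion, plus the revenue it produced.
paths = [
    (["organic_search", "email", "paid_ads"], 120.0),
    (["social", "paid_ads"], 80.0),
    (["email"], 40.0),
]

last_touch = defaultdict(float)
linear = defaultdict(float)
for touchpoints, revenue in paths:
    last_touch[touchpoints[-1]] += revenue          # all credit to the final touchpoint
    for channel in touchpoints:                     # equal credit to every touchpoint
        linear[channel] += revenue / len(touchpoints)

print("last-touch:", dict(last_touch))
print("linear:    ", {channel: round(credit, 1) for channel, credit in linear.items()})

Comparing the two allocations shows how strongly the choice of model shifts budget credit between channels.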

9 examples and use cases of big data marketing


Now, let's explore 9 examples and use cases that demonstrate the transformative
power of big data in marketing:

1. Discover new customer acquisition levers

Customers leave a footprint every time they interact with our digital marketing
content and the apps we build.

Use these breadcrumbs to dig into the reasons why certain leads turned into
customers.

With big data analytics tools, you can sift through vast amounts of data to discover
which lever was the crucial differentiator between a lead and a customer. Was it your
digital marketing target audience, the novel communication/referral channel, the
copywriting used, the choice of visuals, specific demographics, or something else?

Knowing the answer helps you focus your energies on what actually works, turning
eyeballs into paying customers.

2. Segment audiences for a personalized boost in customer engagement


Big data analytics helps you segment customers based on behavior, preferences, and
demographics.

This segmentation allows for highly personalized marketing campaigns that deliver
relevant content to the right audience, increasing engagement and conversions.

Here’s how you do it with Keboola:

1. Collect and unify your customer data, including web/app interactions, past
marketing campaigns, and CRM data with Keboola. Just a few clicks and
you’re ready for the next step.

2. Next, use Keboola’s Transformations to apply machine learning segmentation algorithms and split your customers into different segments based on their activity (a minimal clustering sketch follows this list).

3. Feed the segments back into the communication app of your choice. Keboola
can do this with a couple of clicks.

4. Roll out personalized campaigns for each customer segment to boost engagement.
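
The clustering sketch referred to in step 2 might look like the following. It assumes scikit-learn is available and uses hypothetical per-customer activity features; it is a generic k-means illustration, not Keboola's actual Transformation code.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical activity features per customer: [sessions_last_30d, purchases_last_90d, avg_order_value]
X = np.array([
    [25, 6, 80.0], [3, 0, 0.0], [18, 4, 55.0],
    [1, 0, 0.0], [30, 9, 120.0], [5, 1, 20.0],
])

# Scale the features so no single metric dominates, then split customers into three segments.
X_scaled = StandardScaler().fit_transform(X)
segments = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
print(segments)  # e.g. [0 1 0 1 2 1] -- segment labels to feed back into your messaging tool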

3. Use predictive analytics to communicate the right message

Apply predictive analytics to historical data to anticipate customer behavior and
preferences. This foresight empowers marketers to proactively tailor their strategies
to meet future demands and optimize their marketing efforts.

For example, by combining your stock levels with a demand forecast, you can
automate the promotion of in-stock items that are predicted to surge in demand to
speed up product movement out of your warehouse.
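
A toy version of that stock-plus-forecast rule could look like the sketch below. The SKUs, thresholds, and figures are all hypothetical; the point is simply that a forecast joined to inventory data can drive the promotion decision automatically.

# Hypothetical inventory and next-week demand forecast (units).
stock    = {"sku_tea": 500, "sku_coffee": 40, "sku_honey": 300, "sku_jam": 10}
forecast = {"sku_tea": 420, "sku_coffee": 35, "sku_honey": 90,  "sku_jam": 8}

# Promote items with enough units on hand whose forecast demand covers most of the stock.
to_promote = [sku for sku in stock if stock[sku] >= 50 and forecast[sku] / stock[sku] >= 0.8]
print(to_promote)  # ['sku_tea'] -- candidate for an automated promotion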

4. Foster customer retention and loyalty programs

Big data helps you identify factors contributing to customer churn and pinpoint the
specific customers who are most likely to jump ship.

Armed with this knowledge, marketers can design targeted customer loyalty
programs and retention strategies to retain valuable customers.

5. Use price optimization to increase revenue and profits

Analyze market trends and competitor pricing with big data to optimize your pricing
decisions for enhanced competitiveness and profitability.

Here’s an example. Olfin Car is a leading car dealer in the Czech Republic with
additional services in the field of financing, authorized car service, and insurance.

With Keboola, Olfin Car was able to automate data collection of all the product
offerings and pricing points across their competitors. By using advanced pricing
algorithms, Olfin Cars optimized the pricing of its products and services. This led to
a 760% increase in revenues in a single quarter.

6. Keep a finger on the pulse with sentiment analysis

Big data tools can analyze customer sentiment and feedback from various sources,
such as social media and reviews.

This sentiment analysis helps businesses understand how customers perceive their
brand and products, enabling them to address concerns and reinforce positive
experiences at scale.

7. Tap into A/B testing to discover what works


Don’t waste your marketing efforts on market research when identifying what
communication strategy works best.

Instead, test it.

Leveraging big data for A/B testing allows marketers to compare the performance of
different marketing strategies and identify the most effective approaches in a much
shorter time.

A/B testing showcase: A subject line worth its weight in gold - or 2 million dollars, to
be precise.

Obama’s presidential campaign will go down as one of the most successful A/B tests
in history.

What did the marketing team do? They rolled out different emails to a smaller batch
of their email list first, testing the impact of different subject lines. The champion was
then sent to the rest of the list.

The result? The top-performing email left the underperformer in the dust, bringing in
an extra 2 million dollars in donations.

8. Optimize the customer journey with big data insights

Understanding the customer journey through big data analysis enables marketers to
streamline touchpoints and improve overall customer experience.

For example, Marketing Intelligence used Keboola to reconstruct the customer
journeys along multiple touchpoints and across different attribution models.

With careful analysis of how different channels and pathways interact with each other
and what customer segment tends to convert, they helped their client save 30% on
marketing costs while increasing acquisition by informing them what conversion
paths work best.

9. Build data products that delight customers

Use all the before-mentioned big data practices together to build products that
address your customers with delightful messages.

Rohlik, the e-commerce unicorn, uses Keboola and real-time machine learning
algorithms to identify food items that will expire soon, discount them, and
automatically advertise them to price-conscious consumers (discovered through
customer segmentation).

This end-to-end automated marketing initiative helps Rohlik reduce food waste while
addressing the needs of a targeted customer profile.

Fraud and big data


Just as fraudsters are becoming more sophisticated in their attacks, so too are the
ways companies can protect their data. According to the Association of Certified
Fraud Examiners’ Report to the Nations, organizations that use proactive data
monitoring can reduce their fraud losses by an average of 54% and detect scams in half
the time.

Big data analytics is changing the way companies prevent fraud. AI, machine learning,
and data mining technologies are being used in tandem to counteract the hydra of
fraud attempts impacting more than 3 billion identities each year.

In summary, big data analytics techniques can help identify patterns of fraudulent
activity and provide actionable reports used to monitor and prevent fraud—for
businesses of all sizes. Here’s how.

Fraud Detection and Prevention Definition


Banking and healthcare fraud account for tens of billions of dollars in losses annually,
which results in compromised financial institutions, personal impact for bank clients,
and higher premiums for patients. Fraud detection and prevention refers to the
strategies undertaken to detect and prevent attempts to obtain money or property
through deception.

What is Fraud Detection and Prevention?
Fraudulent activities can encompass a wide range of cases, including money
laundering, cybersecurity threats, tax evasion, fraudulent insurance claims, forged
bank checks, identity theft, and terrorist financing, and are prevalent throughout
the financial, government, healthcare, public, and insurance sectors.

To combat this growing list of opportunities for fraudulent transactions,
organizations are implementing modern fraud detection and prevention technologies
and risk management strategies, which combine big data sources with real-time
monitoring, and apply adaptive and predictive analytics techniques, such as Machine
Learning, to create a risk of fraud score.

Detecting fraud with data analytics, fraud detection software and tools, and a fraud
detection and prevention program enables organizations to predict conventional
fraud tactics, cross-reference data through automation, manually and continually
monitor transactions and crimes in real time, and decipher new and sophisticated
schemes.

Fraud detection and prevention software is available in both proprietary and open
source versions. Common features in fraud analytics software include: a dashboard,
data import and export, data visualization, customer relationship management
integration, calendar management, budgeting, scheduling, multi-user capabilities,
password and access management, Application Programming Interfaces (API), two-
factor authentication, billing, and customer database management.

Fraud Detection and Prevention Techniques


Fraud data analytics methodologies can be categorized as either statistical data
analysis techniques or artificial intelligence (AI).

Statistical data analysis techniques include:

 Statistical parameter calculation, such as averages, quantiles, and performance metrics

 Regression analysis - estimates relationships between independent variables and a dependent variable

 Probability distributions and models

 Data matching - used to compare two sets of collected data, remove duplicate records, and identify links between sets (see the sketch after this list)

 Time-series analysis
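
Here is the data matching sketch referred to above. It merges customer records collected through two channels, drops duplicates on a normalised key, and notes which identities appear in both sources; the field names and records are hypothetical.

online = [{"email": "a@x.com", "name": "Ana"}, {"email": "b@x.com", "name": "Bo"}]
branch = [{"email": "B@x.com ", "name": "Bo"}, {"email": "c@x.com", "name": "Cy"}]

def key(record):
    # Normalise the matching key so trivially different spellings still match
    return record["email"].strip().lower()

merged = {key(r): r for r in online + branch}                  # later duplicates overwrite earlier ones
in_both = {key(r) for r in online} & {key(r) for r in branch}  # links between the two sets
print(list(merged.values()))
print("present in both sources:", in_both)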

AI techniques include:

 Data mining - data mining for fraud detection and prevention classifies and
segments data groups in which millions of transactions can be performed to
find patterns and detect fraud

 Neural networks - suspicious patterns are learned and used to detect further
repeats

 Machine Learning - fraud analytics Machine Learning automatically identifies characteristics found in fraud

 Pattern recognition - detects patterns or clusters of suspicious behavior

The four most crucial steps in the fraud prevention and detection process include:

 Capture and unify all manner of data types from every channel and
incorporate them into the analytical process.

 Continually monitor all transactions and employ behavioral analytics to facilitate real-time decisions.

 Incorporate analytics culture into every facet of the enterprise through data
visualization.

 Employ layered security techniques.

What are fraud analytics?


Fraud analytics is the process of using data analytics to identify and prevent fraud. It
involves collecting and analyzing large amounts of data to identify patterns and

anomalies that may indicate credit card fraud, identity theft, insurance fraud, and
other possible crimes.

Similarly, risk analytics uses data analytics to identify, assess, and manage risks. The
process includes collecting and analyzing large amounts of data to identify potential
risks, assessing the likelihood and impact of those risks, and developing a strategy to
mitigate the highest-priority risks.

The biggest advantage of big data analytics, and fraud and risk analytics accordingly,
is that it facilitates the use of large and complex data. Faster decisions can be made
in real time using data analytics techniques. Ultimately, thanks to big data analytics, a
company can better understand customer requests and flag those it deems
suspicious.

Fraud Detection Using Big Data Analytics


Fraud detection and prevention analytics relies on data mining and Machine
Learning, and is used in fraud analytics use cases such as payment fraud analytics,
financial fraud analytics, and insurance fraud detection analytics. Data mining reveals
meaningful patterns, turning raw, big datasets into valuable information. Machine
Learning then submits that information to either Supervised or Unsupervised
algorithms.

Supervised Machine Learning algorithms, such as logistic regression and time-series
analysis, learn from historical data and identify patterns of interest that require
further investigation. Unsupervised Machine Learning algorithms, such as cluster
analysis and peer group analysis, examine data without any identified fraud and
reveal new anomalies and patterns of interest. Data analysts and scientists can then
act on these anomalies.
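
As a rough sketch of the supervised case described above, the example below trains a logistic regression on a handful of labelled transactions and scores new ones. It assumes scikit-learn is available; the features, records, and labels are invented for illustration, and a real model would use far richer data.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled history: [amount, transactions_in_last_hour, is_foreign]; label 1 = confirmed fraud
X = np.array([[20, 1, 0], [15, 1, 0], [900, 6, 1], [40, 2, 0],
              [1200, 8, 1], [35, 1, 0], [700, 5, 1], [25, 1, 0]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Score incoming transactions: a fraud probability that a rule engine or analyst can act on
new_transactions = np.array([[30, 1, 0], [850, 7, 1]])
print(model.predict_proba(new_transactions)[:, 1])  # low score for the first, high for the second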

How can big data analytics help prevent fraud?


Successfully harnessing big data requires robust infrastructure, advanced analytics
techniques, and skilled professionals to process, analyze, and derive insights from
large and diverse datasets. These assets combine to identify patterns and anomalies
that could be signs of fraudulent behavior.

Big data analytics can help prevent fraud using a variety of techniques, such as data
mining, machine learning, and anomaly detection. For example, data mining can be
used to identify patterns of fraudulent activity, such as using stolen credit card
numbers or making multiple small payments in a short period of time. Machine
learning can be used to build models that can automatically detect fraudulent activity.
Anomaly detection, such as device intelligence, can identify when a malicious bot,
fraudster, or other bad actor is present on your site.

More specifically, here are some examples of how big data analytics can help
prevent fraud:

 Identifying patterns of fraudulent activity, such as using stolen credit card numbers or making multiple small payments in a short period of time.

 Detecting anomalies in data, such as a customer suddenly starting to make large purchases that are out of character with their normal spending patterns, or accessing their account from an unknown device (see the sketch after this list).

 Building fraud detection models that can automatically identify fraudulent activity. These models are typically trained on historical data and can be used to scan new data for signs of fraud.

 Mitigating fraud risk by identifying and addressing the root causes of fraud. For example, if fraud is occurring because customers are reusing passwords, big data analytics can be used to identify these customers and require them to change their passwords.
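
The anomaly-detection bullet above can be made concrete with a very small baseline check: compare a new purchase against the customer's own spending history and flag it when it deviates by more than a few standard deviations. The history, amounts, and threshold below are hypothetical.

from statistics import mean, stdev

def is_out_of_character(history, new_amount, z_threshold=3.0):
    # Flag a purchase that deviates strongly from this customer's usual spending
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_amount != mu
    return abs(new_amount - mu) / sigma > z_threshold

history = [22, 35, 18, 40, 27, 31, 25, 29]   # hypothetical past purchase amounts in dollars
print(is_out_of_character(history, 30))      # False -- in line with normal behaviour
print(is_out_of_character(history, 480))     # True  -- candidate for review or step-up authentication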

Capturing these benefits takes the right tools and implementation. Fraud.net offers
an award-winning fraud prevention platform to help digital businesses quickly detect
transactional anomalies and pinpoint fraud using artificial intelligence, big data, and
live-streaming visualizations.

What are the Common Problems in Big Data Analytics in Fraud Detection?
We mentioned the importance of big data analytics in detecting fraud. Although it
makes it easier to detect fraud, it can also bring some problems with it. Some of
these problems can be listed as:

 Unrelated or Insufficient Data: The data from the transactions may come
from many different sources. In some cases, false results can be obtained in
fraud detection due to insufficient or irrelevant data. Detection can be based
on the inappropriate rules used in the algorithm. Because of this risk of failure,
companies may be hesitant to use big data analytics and machine learning.

 High Costs: Big data analytics and fraud detection systems can bring costs such as software and hardware, the components needed to keep these systems running, and the time spent.

 Dynamic Fraud Methods: As technology develops, fraud methods develop at the same pace. In order to keep up and detect fraud, it is necessary to constantly monitor the data and feed the algorithms with new, accurate rules and data.

 Data Security: While processing data and making decisions with this data analytics system, the security of the data is also a concern, which means it should be checked continuously.

How big data is utilized in fraud detection and prevention:


1. Pattern Recognition: Big data analytics enables organizations to identify
patterns and anomalies in large volumes of transactional data. By analyzing
historical transaction records, user behavior, and other relevant data sources,
organizations can detect unusual patterns or deviations from normal behavior
that may indicate fraudulent activity.

2. Machine Learning Algorithms: Big data analytics leverages machine learning
algorithms to build predictive models for fraud detection. These algorithms learn
from past fraud cases and non-fraudulent transactions to detect patterns and
characteristics associated with fraudulent behavior. Machine learning models can
continuously evolve and adapt to new fraud schemes and tactics.

3. Real-Time Monitoring: Big data analytics enables real-time monitoring of
transactions, interactions, and events across digital channels. Organizations can
use streaming data analytics to detect suspicious activities as they occur, allowing
for immediate intervention and mitigation of fraud risks.

4. Behavioral Analysis: Big data analytics facilitates behavioral analysis to identify
deviations from normal behavior patterns. By establishing baseline behavior profiles
for individual users or entities, organizations can detect unusual activities, such as
sudden changes in spending patterns, unusual login locations, or atypical
transaction volumes.

5. Network Analysis: Big data analytics allows organizations to analyze networks of
relationships and connections between entities, such as customers, accounts, and
devices. Network analysis helps uncover complex fraud schemes involving multiple
parties, fraudulent transactions, and money laundering activities.

6. Data Integration and Fusion: Big data analytics integrates data from multiple
sources, including internal transactional data, external data feeds, social
media, and third-party data sources. By correlating and fusing diverse data
sources, organizations gain a more comprehensive view of fraud risks and can
identify potential red flags more effectively.

7. Geospatial Analysis: Big data analytics incorporates geospatial analysis to detect
location-based fraud patterns and trends. Organizations can analyze geographic
data, IP addresses, and device locations to identify suspicious activities, such as
transactions originating from high-risk locations or unusual travel patterns.
8. Scalability and Performance: Big data analytics platforms provide scalability
and performance to handle the volume, velocity, and variety of data
generated in fraud detection processes. Organizations can process large
datasets efficiently and in real-time, enabling timely detection and response to
fraud threats.

Big Data for Fraud Detection (Guide for All Industries)


Big data analytics is an effective solution for identifying behavioral patterns and
establishing strategies to help detect and prevent fraud in various business sectors.

Which industries have been more vulnerable?


According to ACFE Report to the Nations, the most vulnerable industries for fraud
are:

1. Banking and financial services

2. Government and public administration

3. Manufacturing

4. Healthcare

5. Retail

6. Energy

7. Insurance

8. Transportation and warehousing

9. Technology

10. Construction

Risk and big data
What are the risks of big data?
While it’s easy to get caught up in the opportunities big data offers, it’s not
necessarily a cornucopia of progress. If gathered, stored, or used wrongly, big data
poses some serious dangers. However, the key to overcoming these is to understand
them. So let’s get ahead of the curve.

Broadly speaking, the risks of big data can be divided into four main categories:
security issues, ethical issues, the deliberate abuse of big data by malevolent players
(e.g. organized crime), and unintentional misuse.

Risks Associated With Big Data
While big data can provide valuable insights, it also presents significant risks,
particularly for startups that lack the resources to invest in robust cybersecurity
measures. These risks include:

 Data breaches: As startups collect and store sensitive customer information, the risk of data breaches and cyber-attacks increases exponentially. Various factors, including weak passwords, unsecured networks and phishing attacks, can cause data breaches.

 Reputation damage: Reputation can suffer significantly due to a data breach, as it can cause customers to lose faith in the company, and negative publicity can harm the brand's perception. Therefore, it is crucial to prevent such incidents and maintain customer trust.

 Legal liabilities: A startup failing to protect customers' data may be liable for
legal damages. In addition, data breaches sometimes result in class-action
lawsuits, which can be costly and time-consuming to defend.

Major Risks & Threats That Come with Big Data


Analyzing such a large amount of data can come with various risks and threats that
can affect the business heavily. So organizations need to understand these risks and
threats and identify the best possible ways to minimize the risks. Let’s find out!

1. Privacy and Data Protection

When companies collect big data, the first risk that comes with it is data privacy.
This sensitive data is the backbone of many big companies, and if it leaks into the
wrong hands, such as cybercriminals or hackers, it can badly affect the business and
its reputation. In 2019, 4.1 billion records were exposed through data breaches,
according to the Risk-Based Security Mid-Year Data Breach Report.

So businesses should mainly focus on protecting their data’s privacy and security
from malicious attacks.

Big data cannot simply be stored anywhere; companies need to manage big servers to
hold this crucial information and protect it from the outside world. It is a
challenging and risky process, but businesses need to keep their big data protected.

Various companies are adapting new privacy regulations to protect their database.
Recently, many hackers have been attacking giant companies to steal their data for
monetary benefits.

So it clearly shows that the bigger the data a company has, the higher the chances of
malicious attacks. Companies must therefore ensure the security of the data with
high-level encryption.

2. Cost Management

Big data requires big costs for its maintenance, and companies should calculate the
costs of collecting, storing, analyzing, and reporting big data. So all companies need
to budget and plan well for maintaining big data.

If companies don’t plan for the management, they may face unpredictable costs,
which can affect the finances. The best way to manage big data costs is by
eliminating irrelevant data and analyzing the big data to find meaningful insights and
solutions to achieve their goals.

3. Unorganized Data

As we’ve discussed, Big Data is a combination of structured, semi-structured, and
unstructured data, and the major problem companies face while managing big data is
unorganized data. It is a complex process to categorize the data and make it
well-structured.

From small business to enterprise-level, handling unorganized data becomes hectic.
It requires a well-planned strategy to collect, store, diversify, eliminate and optimize
data to find meaningful insights that help businesses make profitable decisions.

4. Data Storage and Retention

Big Data is not just information that can be stored on a computer; it is a collection of
structured, semi-structured, and unstructured data from different sources that can
reach the size of zettabytes. To store big data, companies need a large server area
where all the big data is stored, processed, and analyzed.

This is why companies should be concerned about the storage space for big data;
otherwise, it can become a complex issue. Nowadays, companies leverage the power of
cloud-based services to store data and make access easy and secure.

5. Incompetent Big Data Analysis

It is estimated that the amount of data generated by users each day will reach 463
exabytes worldwide, according to the World Economic Forum. The main aim of big data
is to analyze it and find meaningful information that helps businesses make the right
business decisions and innovations. If an organization does not have a proper
analysis process, big data is just clutter that serves no purpose.

Analysis is what makes big data important, so companies should hire skilled data
analysts and use software that helps analyze big data and extract meaningful
insights.

Thus, before planning to work on big data, each business, from small to enterprise-
level, should hire professional analysts and use powerful technologies to analyze big
data.

6. Poor Data Quality

One key risk of big data is that organizations may end up with poor-quality,
irrelevant or out-of-date databases that will not help their business find anything
meaningful.

In Big Data, everything is stored, whether structured, semi-structured, or
unstructured, so there is a risk for organizations in collecting and analyzing the
data, because it may or may not be useful for their business depending on its
relevancy.

Many challenges come across while analyzing big data, and organizations must be
prepared for these outputs, try to eliminate irrelevant data, and focus on analyzing
relevant data to get meaningful insights.

7. Deployment Process

Deployment is a core process by which an organization collects and analyzes big data
and delivers meaningful insights within a given time period. Companies have two
options for data deployment: the first is an in-house deployment process, where big
data is collected and analyzed to find meaningful insights, but this process takes a
good amount of time.

The second option, a cloud-based solution, is more convenient, easy, and beneficial
because there is no internal infrastructure to set up for storing big data. The amount
of time needed to deploy meaningful insights from big data is important.

How can Big Data Be Integrated into Risk Management?


While evaluating risks, it is necessary to extract as much information as possible from
complex data, and especially to extract the right data, which results in a deeper
understanding of the risk and better risk management.

In fintech industries, big data helps identify an opportunity to provide efficient and
sustainable financial services. Overall, risk management is the process of finding and
controlling threats to the company's well-being and ways to minimize those threats.

Application of Big Data in Risk Management


Let's figure out the main benefits of big data in risk management.

 Fraud Identification

Predictive analysis is an excellent way to detect money laundering and other
fraudulent activities and prevent future losses. As a company collects terabytes of
data from different sources, Big Data Analytics guarantees close monitoring of such
platforms and activities. This can increase the probability of identifying the risk
and fraud before it happens.

 Improve Financial Health

Big data analytics effectively allows the organization to save money. Monitoring and
tracking resources, expenses, and crucial information helps identify when spending is
higher than expected. Through an automated process, problematic areas are flagged
for appropriate action by management.

 Credit Management

Credit can potentially paralyze the operations of your business. It is challenging to
manage the risk by manually checking just a few transactions; with Big Data Analytics
in place, one can identify misuse within the organization and check how long it has
been happening.

 Anti-Money Laundering

The traditional approach to anti-money laundering was rule-based, using descriptive
analytics to process complex structured data. This system has limitations such as the
lack of an automated algorithm, reliance on keyword searches, and manual sifting
through reports.

On the other hand, Big Data analytics helps to improve the existing processes
by using advanced statistical analysis of structured data and statistical text
mining of unstructured data. It generates real-time actionable insights and stops
money laundering in its tracks.

How big data contributes to risk management:


1. Risk Identification: Big data analytics helps organizations identify and
quantify risks by analyzing large volumes of data from diverse sources. By
collecting and integrating data from internal systems, external databases,
social media, and IoT devices, organizations gain a holistic view of potential
risks and vulnerabilities.

2. Predictive Analytics: Big data enables organizations to leverage predictive
analytics techniques to forecast and anticipate future risks. By analyzing historical
data patterns, market trends, and environmental factors, organizations can identify
emerging risks and take proactive measures to mitigate them before they escalate.

3. Fraud Detection: Big data analytics is instrumental in detecting fraudulent
activities and mitigating fraud risks. By analyzing transactional data, user behavior,
and other relevant data sources, organizations can identify suspicious patterns,
anomalies, and deviations indicative of fraudulent behavior.

4. Credit Risk Assessment: In the financial industry, big data analytics is used
for credit risk assessment and scoring. By analyzing borrower data, credit
history, payment behavior, and other relevant factors, lenders can assess the
creditworthiness of applicants and make informed decisions about lending
risks.

5. Operational Risk Management: Big data analytics helps organizations identify and
mitigate operational risks associated with business processes, supply chain
operations, and IT systems. By analyzing operational data, organizations can identify
inefficiencies, bottlenecks, and vulnerabilities that pose risks to business continuity
and performance.

6. Cybersecurity: Big data analytics is crucial for cybersecurity risk management,
enabling organizations to detect and respond to cyber threats in real-time. By
analyzing network traffic, log files, system alerts, and user behavior, organizations
can identify security breaches, malware infections, and insider threats before they
cause significant damage.

7. Supply Chain Risk Management: Big data analytics helps organizations manage
supply chain risks by monitoring and analyzing data across the entire supply chain
ecosystem. By tracking supplier performance, inventory levels, transportation routes,
and market dynamics, organizations can identify supply chain disruptions, mitigate
risks, and optimize resilience.

8. Healthcare Risk Assessment: In the healthcare industry, big data analytics is used
for risk assessment, patient safety, and clinical decision support. By analyzing
electronic health records (EHRs), medical imaging data, genomics data, and patient
outcomes, healthcare providers can identify health risks, predict disease progression,
and personalize treatment plans.

Big Data Risk / Security Issues and their solutions


Big Data comes with several security issues and dangers that can heavily impact
organizations. So it’s important for businesses to understand and resolve these
security issues.

Here are some common issues and dangers of big data along with their solution.
1. Data Storage

Problem: When businesses plan to store big data, the first problem they face is
storage space. Many companies leverage the power of cloud storage, but because the
data is accessible online, there is a chance of security issues. So some companies
prefer to own physical server storage for their databases.

One major data storage issue was faced by Amazon in 2017, when AWS cloud storage
became full and did not have space to run even basic operations; Amazon later
resolved the issue and managed its storage to prevent the problem in the future.

Solution: To resolve the problem, companies should store their sensitive data in an
on-premises database, and less sensitive data can be stored in cloud storage. But
still, there are security issues that can be resolved by hiring cybersecurity experts.

This may increase costs for organizations, but the value of the database is worth it.

2. Fake Data

Problem: Another big data issue many organizations may face is fake data. When
collecting data, companies require a relevant database that can be analyzed and used
to generate meaningful insights. However, irrelevant or fake data can waste an
organization's effort and cost in analyzing the data.

In 2016, Facebook faced the issue of fake data because its algorithms did not
recognise the difference between real and fake news, and it ended up with a flood of
nonsensical political content, according to Vox.

Solution: To validate the data source, organizations should do periodic assessments
and evaluations of databases to find irrelevant data and eliminate it, so that only
relevant data is left to analyze and generate results.

3. Data Access Control

Problem: When users get access to view, edit, or remove data, it may affect business
operations and privacy.

Here’s an example:

Netflix reported the loss of 200,000 subscribers in Q1 partly because users were
sharing their login details with friends and family to log in with the same account.
Later, Netflix took charge and restricted data accessibility to a limited number of
users per account.

Solution: The solution is to work with Identity and Access Management (IAM) to
simplify the process of controlling data access via identification, authentication, and
authorization. By following ISO standards, organizations can strengthen their IAM
practices.

4. Data Poisoning

Problem: Nowadays, almost every website has a chatbot, and hackers target these
Machine Learning models, leading to data poisoning, where an organization's training
data can be manipulated and injected with malicious examples.

Solution: The best way to resolve this issue is through outlier detection. It helps to
separate injected elements from the existing data distribution.
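
A very simple form of that outlier detection is an interquartile-range filter over a numeric feature of the training data, as sketched below. The sample values are hypothetical, and production defences against data poisoning are considerably more sophisticated.

def iqr_filter(values, k=1.5):
    # Keep only points inside the interquartile-range fence; crude injected extremes fall outside it
    data = sorted(values)
    n = len(data)
    q1, q3 = data[n // 4], data[(3 * n) // 4]
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]

samples = [0.9, 1.1, 1.0, 0.95, 1.05, 1.2, 9.5, -7.0]  # hypothetical feature values with two injected extremes
print(iqr_filter(samples))  # the injected 9.5 and -7.0 are removed before training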

Credit risk management


Credit risk refers to the probability of loss due to a borrower’s failure to make
payments on any type of debt. Credit risk management is the practice of mitigating
losses by assessing borrowers’ credit risk – including payment behavior and
affordability. This process has been a longstanding challenge for financial institutions.

Continued global economic crises, ongoing digitalization, recent developments in
technology and the increased use of artificial intelligence in banking have kept credit
risk management in the spotlight. As a result, regulators continue to demand
transparency and other improved capabilities in this space. They want to know that
banks have a thorough knowledge of customers and their associated credit risk. And
as Basel regulations evolve, banks will face an even bigger regulatory burden.

To comply with ever-changing regulatory requirements and to better manage risk,
many banks are overhauling their approaches to credit risk. But banks who view this
as strictly a compliance exercise are being short-sighted. Better credit risk
management presents an opportunity to improve overall performance and secure a
competitive advantage.

Key Takeaways
 Credit risk management refers to managing the probability of a company’s
losses if its borrowers default in repayment.

 The main purpose is to reduce the rising quantum of non-performing assets from customers and to recover them in due time with appropriate decisions.

 It is one of the most important tools for any lending company's long-term survival, since without proper mitigation strategies it will be very difficult to stay in the lending business due to rising NPAs and defaults.

 Every bank/NBFC has a separate department to take care of the quality of its portfolios and customers by framing appropriate risk-mitigating techniques.

What is Credit Risk Management?
Credit risk management refers to the process of assessing and mitigating the
potential risks associated with lending money or extending credit to individuals or
businesses. At its core, it’s about ensuring that borrowers are reliable and will fulfill
their repayment obligations.

One of the key aspects of credit risk management is evaluating the creditworthiness
of borrowers. This involves a thorough analysis of their financial history, credit
score, income stability, and other pertinent factors. By doing so, lenders can gauge a
borrower’s ability to repay the loan and make informed lending decisions.

Credit risk management also involves setting appropriate interest rates and credit
limits, as well as monitoring and managing the loan portfolio to identify and address
potential risks. Effective credit risk management helps businesses protect themselves
against financial losses and ensure the overall stability and profitability of their
business.

Here are the key components of credit risk management:


1. Credit Risk Identification: The first step in credit risk management is to
identify and understand the various types of credit risks that an organization
may face. This includes assessing the risk of default, late payments, non-
payment, or other adverse events that may lead to financial losses. Credit risk
can arise from individual borrowers, counterparties, or entire portfolios of
loans or credit exposures.

2. Credit Risk Assessment: Credit risk assessment involves evaluating the
creditworthiness of borrowers or counterparties to determine the likelihood of
default or non-payment. This process typically includes analyzing financial
statements, credit reports, credit scores, payment histories, collateral, and
other relevant factors. Credit risk assessment helps organizations make
informed decisions about whether to extend credit, set credit limits, or
approve loan applications.

3. Credit Scoring and Modeling: Credit scoring and modeling techniques are
used to quantify and predict credit risk based on historical data and statistical
analysis. Credit scoring models assign numerical scores to borrowers based on
their creditworthiness, probability of default, and risk profile. These models
help organizations automate credit decisions, streamline underwriting
processes, and assess risk consistently across different applicants.

4. Risk-Based Pricing: Risk-based pricing strategies involve pricing credit products
and services based on the level of credit risk associated with individual borrowers or
credit portfolios. Higher-risk borrowers may be charged higher interest rates, fees, or
other costs to compensate for the increased risk of default. Risk-based pricing helps
organizations manage risk while maintaining competitiveness and profitability in the
market (a small pricing sketch follows this list).

5. Credit Risk Mitigation: Credit risk mitigation involves implementing strategies
and measures to reduce the impact of credit risk on an organization's financial
health. Common risk mitigation techniques include requiring collateral or security for
loans, obtaining credit insurance or guarantees, diversifying credit exposures across
different borrowers or asset classes, and setting aside reserves or provisions for
potential loan losses.

6. Portfolio Management: Credit risk management also involves actively monitoring
and managing credit portfolios to optimize risk-return trade-offs and ensure portfolio
diversification. This includes regularly assessing the quality of loans, monitoring
borrower performance, identifying emerging risks, and taking corrective actions as
needed to mitigate risks and preserve asset quality.

7. Regulatory Compliance: Financial institutions are subject to regulatory
requirements and guidelines related to credit risk management, such as capital
adequacy regulations, stress testing requirements, and risk management standards.
Organizations must ensure compliance with relevant regulations and implement robust
governance frameworks to oversee credit risk management practices effectively.
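
Components 3 and 4 above can be tied together in a small sketch: take a probability of default produced by a credit scoring model, map it to a score band, and apply a risk-based premium on top of a base rate. The bands, thresholds, and rates below are hypothetical, not regulatory or industry values.

def risk_based_offer(prob_default, base_rate=0.06):
    # Map an estimated probability of default to a band and a risk-adjusted interest rate
    if prob_default < 0.02:
        band, premium = "A (low risk)", 0.00
    elif prob_default < 0.05:
        band, premium = "B (moderate risk)", 0.02
    elif prob_default < 0.10:
        band, premium = "C (elevated risk)", 0.05
    else:
        return "decline / manual review", None
    return band, base_rate + premium

# Hypothetical applicants scored by an upstream credit model
for pd_estimate in (0.01, 0.04, 0.08, 0.20):
    print(pd_estimate, risk_based_offer(pd_estimate))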

Challenges in Credit Risk Management

Credit risk management is really important for keeping the financial system stable.
But, it’s not easy and there are some major challenges that come with it. These
challenges are a big part of financial operations and need to be watched carefully.

In this section, we will talk about the major challenges finance professionals face
when they try to manage credit risk.

1. Data quality and accessibility: Data quality plays a crucial role in credit risk
evaluation. But most of the time the data available is not very reliable or easy
to get. Incomplete or inaccurate data can compromise the decision-making
process, necessitating robust strategies to ensure data integrity.

2. Global interconnectedness: When evaluating credit risk, businesses cannot afford
to overlook the global landscape they operate in. The interconnected nature of the
global financial system implies that issues in one sector can swiftly reverberate
across others. For instance, if a major organization is adversely affected by global
factors and defaults on its loans, it can trigger a ripple effect, impacting other
businesses connected through lending relationships or financial transactions.

3. Regulatory adherence: The constantly changing regulatory landscape adds
complexity to credit risk management. Banks and financial institutions must comply
with a multitude of regulations aimed at ensuring stability and transparency in the
financial system. Keeping up with these regulations and implementing effective risk
management practices can be a daunting task.

4. Rise of non-traditional lenders: The rise of non-traditional lenders and fintech
companies presents a challenge for credit risk management. These new players often
operate with different risk assessment models and rely on alternative data sources.
Incorporating these non-traditional approaches while ensuring accuracy and reliability
poses a challenge for risk managers.

5. Economic volatility: Financial landscapes are subject to unpredictable economic
fluctuations. The ever-shifting terrain of interest rates, inflation, and market
dynamics can substantially impact the creditworthiness of borrowers. Adapting to
these changes while safeguarding financial interests is an ongoing challenge.

6. Human factors: The human element introduces its own set of challenges.
Misjudgments, communication breakdowns, or ethical lapses can inject
unpredictability into credit risk management, underlining the importance of
strong internal controls.

What are the Credit Risk Management Steps?


Lenders should evaluate everything: from assessing the borrower's character through the personal details they provide, to checking how the borrower's property could help recover the amount in the event of default. Once the evaluation is done, lenders need to verify and validate the details before deciding whether to approve or reject the loan application.

However, there is more to credit risk management in banks than deciding whether to lend money to an applicant. To manage credit risk well, banks and other lending institutions should check the data sources they draw information from and validate their reliability. In addition, institutions can involve a third-party entity to assess whether the models and measures adopted for credit management are appropriate. Such a review helps identify weaknesses and leads to improvements in the framework.

A third-party unit is well placed to assess the entire system without bias. It monitors the active models and suggests changes based on its findings, using up-to-date datasets to reach valid conclusions. In addition, such entities help deploy advanced technology, like artificial intelligence and machine learning, to make risk management more efficient and accurate. As a result, institutions can manage credit risk effectively and stay prepared for emerging financial crime.

What are the 5 Cs of Credit Risk?


1. Character

For individual borrowers, character refers to their personal traits and credit history. It
encompasses factors such as their reliability, integrity, and creditworthiness.

In the case of commercial borrowers, character extends to the reputation and


credibility of the company’s management. It also includes the character of the
company’s ownership, particularly in the context of private corporations.

2. Capacity

Capacity refers to the borrower’s capability to assume and fulfill their debt
obligations. It encompasses the ability of both retail and commercial borrowers to
handle their debt.

In the case of commercial borrowers, assessing capacity involves examining various


debt service and coverage ratios that provide insights into the borrower’s ability to
manage their debt effectively.

3. Capital

Capital is commonly referred to as the borrower’s “wealth” or overall financial


stability. Lenders analyze the composition of debt and equity that underpins the
borrower’s asset base to assess their capital position.

It is important to determine if the borrower has the potential to obtain additional


funds from alternative sources. For business borrowers, this involves examining if
there are related companies with available liquidity.

In the case of personal borrowers who may lack an extensive credit history, it is
important to consider if a parent or family member could provide a guarantee to
support their loan application.

4. Collateral security

In structuring loans to mitigate credit risk, collateral security plays an important role.
It is of utmost importance to thoroughly assess the value of assets, their physical
location, the ease of transferring ownership and determining appropriate loan-to-
value ratios (LTVs), among other factors.

5. Conditions

Conditions encompass the purpose of the credit, external circumstances, and various
factors in the surrounding environment that can introduce risks or opportunities for a
borrower. These factors may involve political or macroeconomic conditions, as well
as the current stage of the economic cycle.

In the case of business borrowers, conditions also encompass industry-specific


challenges and social or technological developments that have the potential to
impact their competitive advantage.

Examples of Credit Risk Management

ABC Bank is dedicated to assisting individuals in obtaining the necessary finances for
their specific needs. To ensure ease of repayment, the bank maintains low-interest
rates. As a result, loans are accessible to individuals from all segments of society,
provided they meet certain minimum criteria.

In the interest of fairness, the lenders employ an automated system that only accepts
loan applications that meet the necessary requirements. This enables effective credit
risk management by limiting loan options to individuals with a specified income
level.

Principles
Strategies can vary, but certain basics must be in place for credit risk management tools and frameworks to be effective. The first requirement is a proper setup that ensures a suitable environment for credit risk assessment, with a clear protocol for proposing risk measures, approving them, and reviewing them from time to time.

Effective Business Credit Management Best Practices
Many businesses extend credit without properly assessing the customer's creditworthiness, even though they know it is risky. The reason is simple: salespersons are often in a hurry to onboard customers to meet their targets, and they pressure finance teams to extend credit without sufficient due diligence.

Keep in mind that effective credit risk management practices should be tailored to
the unique characteristics of each business. This includes identifying customers with a
history of frequent payment defaults and crafting a dynamic strategy to mitigate
credit risk. With that in mind here are the six most efficient credit risk management
best practices you need to know:

1. Provide online credit application forms

Introduce online credit application forms to make customer onboarding smoother and
faster. Make all the essential sections mandatory to avoid missing out on any critical
information.

An online application makes it easier to gather and store data. Accurate and
complete customer information makes your credit risk analysis process more robust.
Your credit application must collect the following data:

 Company information

 Bank information

 Commercial trade information

 Provisions in the event of non-payment

 The maximum time required to report a quality/quantity issue

 Terms of payment
 Description of how disputes would be resolved

 Your rights to end credit terms

 Data verification

2. Analyze and predict credit risk

You must consider two factors before you extend credit to your customer. First is the
creditworthiness of the customer, and the second is the impact on your cash flow if
the customer goes delinquent.

Before customer onboarding, review their payment history from financial institutions
and sources such as:

 Credit bureaus like D&B, Experian

 Credit groups like NACM

 Banks

 Public financials (such as income statements, balance sheets, financial ratios, etc.)

 Other sources such as personal guarantees or trade reference

Current and historical data available on these sources help improve your credit
scoring accuracy. It also lets you identify the creditworthiness and the potential risk
posed by any new customers. This approach helps create a strong functional
structure for credit risk management and decision-making.

3. Real-time credit risk monitoring

Credit risk management is a continuous process. In this constantly changing business


environment, periodic review of existing customers is essential. Real-time credit risk
monitoring keeps you updated about all the risks and opportunities. It helps to
identify and mitigate credit risk before it becomes a problem.

For example, if an existing customer is growing and they have strong financials, you
might want to consider increasing their credit limit to expand trade with them. But, if
an existing customer makes late payments to other vendors and shows signs of
delinquency, you might want to reach out to that customer and collect your payment
or modify payment terms at the earliest.

Using automated solutions is an efficient way to monitor risk in real-time. Such


solutions integrate easily with credit agencies and send alerts to credit risk
management teams directly. These solutions alert you in case of:

 An improvement in credit score

 A decline in credit score

 Bankruptcy

 Recent legal judgments

 Relocation of business

 Change in the management

4. Establish and follow a credit policy

A credit policy protects your business from financial risks and defaulting customers.
A well-defined credit policy allows you to make credit decisions quickly and set
payment terms. You must periodically review and update your credit policy to ensure
it meets changing market conditions and standards.

To avoid disagreements on credit limits between internal teams, clearly define


workflows and the person or team responsible for approving credit limits. To make
an effective credit policy, you must clearly document and communicate the
following:

 The mission of the credit team

 Goals of the organization

 Roles and responsibilities of team members

 Credit evaluation process

 Collection process

 Terms of sale

5. Use clear communication for payment terms and conditions

Communicating payment terms to customers clearly and on time is crucial to avoid


late payments and ensure healthy customer relationships. Here are some extra tips to
help you improve customer communications for better credit risk management:

 Establish precise payment terms

 Clarify interest rates and taxes

 Specify due dates and late payment penalties

 Include conditions for closing the credit limit

 Specify clauses for dispute management

6. Leverage automation for fast and accurate credit risk management


Mid-market organizations are increasingly adopting accounts receivable automation
solutions to keep pace with their growing clientele and to minimize credit risk.
Automation enables real-time credit management, lowers credit risk, and reduces
bad debts. Accounts receivable automation supports:

 Accurate credit decision-making

 Automated periodic reviews

 Faster customer onboarding

 Bank and trade reference validation

 Real-time credit risk alerts

 Alert-triggered credit reviews

Big data and algorithmic trading


History of Algorithmic Trading
Back in the 1980s, program trading was used on the New York Stock Exchange, with
arbitrage traders pre-programming orders to automatically trade when the S&P500’s
future and index prices were far apart. As markets moved to becoming fully
electronic, human presence on a trading floor gradually became redundant, and the
rise of high frequency traders emerged. A special class of algo traders with speed
and latency advantage of their trading software emerged to react faster to order
flows.

By 2009, high frequency trading firms were estimated to account for as much as 73%
of US equity trading volume.

What is Algorithmic Trading?


Advances in computing and communication technology have driven the rise of algorithmic trading. Algorithmic trading is the use of computer programs to enter trading orders, with the program deciding almost every aspect of the order, including its timing, price, and quantity.

In the past, investment research was based on day-to-day information and patterns. Market volatility is now higher than ever, and with it the level of risk, so investment banks have moved risk evaluation from an inter-day to an intra-day basis. RBI interest rates, key government policies, news from SEBI, quarterly results, geopolitical events, and many other factors can move the market sharply within a couple of seconds.

Investment banks use algorithmic trading, which houses a complex mechanism for deriving investment decisions from insightful data. Algorithmic trading uses complex mathematics to generate buy and sell orders for derivatives, equities, foreign exchange, and commodities at very high speed.

The core task of an algorithmic trading system is to estimate the risk-reward ratio of a potential trade and then trigger a buy or sell action. Risk analysts help banks define the trading and implementation rules, and they estimate market risk from the variation in the value of the assets in a portfolio; estimating the risk factors for a portfolio can involve billions of calculations. Algorithmic trading uses computer programs to automate these trading actions with little human intervention.

Algorithmic trading has been adopted by both institutional and individual investors and has proved profitable in practice. The soul of algorithmic trading is the trading strategy, built upon technical analysis rules, statistical methods, and machine learning techniques. The big data era is arriving, and although making use of big data in algorithmic trading is a challenging task, whoever digs out and uses the treasure buried in that data has a huge potential to take the lead and make a great profit.


Role of Big Data in Algorithmic Trading


1. Technical Analysis: Technical Analysis is the study of prices and price
behavior, using charts as the primary tool.

2. Real Time Analysis: The automated process enables computer to execute


financial trades at speeds and frequencies that a human trader cannot.

3. Machine Learning: With Machine Learning, algorithms are constantly fed


data and actually get smarter over time by learning from past mistakes,
logically deducing new conclusions based on past results and creating new
techniques that make sense based on thousands of unique factors.

Traditional Trading Architecture


It was found that traditional architectures could not scale up to the needs and demands of automated trading with DMA (direct market access). The latency from the origin of an event to order generation moved beyond the limits of human reaction and into the realm of milliseconds and microseconds. Order management also needs
to be more robust and capable of handling many more orders per second. Since the
time frame is minuscule compared to human reaction time, risk management also
needs to handle orders in real-time and in a completely automated way.

For example, even if the reaction time for an order is 1 millisecond (which is a lot
compared to the latencies we see today), the system is still capable of making 1000
trading decisions in a single second. Thus, each of these 1000 trading decisions
needs to go through the Risk management within the same second to reach the
exchange. You could say that when it comes to automated trading systems, this is
just a problem of complexity.

Another point which emerged is that since the architecture now involves automated
logic, 100 traders can now be replaced by a single automated trading system. This
adds scale to the problem. So each of the logical units generates 1000 orders and
100 such units mean 100,000 orders every second. This means that the decision-
making and order sending part needs to be much faster than the market data
receiver in order to match the rate of data.
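As a rough back-of-the-envelope check, the scale described above can be worked out directly. The figures in the sketch below are simply the illustrative numbers quoted in this section, not measurements from a real system:

```python
# Back-of-the-envelope scale check for an automated trading architecture.
# Figures are the illustrative ones from the text above, not measurements.

decisions_per_unit_per_sec = 1000   # one logical unit reacting every millisecond
logical_units = 100                 # automated strategies replacing individual traders

orders_per_sec = decisions_per_unit_per_sec * logical_units
budget_per_order_us = 1_000_000 / orders_per_sec  # microseconds available per order

print(f"Orders per second    : {orders_per_sec:,}")                        # 100,000
print(f"Budget per order     : {budget_per_order_us:.0f} microseconds")    # 10

# Every one of these orders must still pass through risk checks before it
# reaches the exchange, so the decision + risk path has to fit inside this budget.
```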

Automated Trading Architecture


The data flow of a typical algorithmic trading system is as follows. First, the trading system collects price data from the exchange (for cross-market arbitrage it needs price data from more than one exchange) and news data from providers such as Reuters and Bloomberg. Some systems also collect data from the web for deeper analysis, such as sentiment analysis. While the data is being collected, the system performs complex analysis on it to look for profitable opportunities, and sometimes it runs a simulation to see what a candidate action might result in. Finally, the system decides on the buy/sell/hold action, the order quantity, and the time to trade, and then generates trading signals. The signals can be transmitted directly to the exchanges in a predefined data format, and trading orders are executed immediately through an API exposed by the exchange without any human intervention. Some investors prefer to review the signals the system has generated and either initiate the trade manually or ignore the signal. Human intervention is a double-edged sword: on one hand, it can screen out unprofitable signals based on experience; on the other hand, humans make mistakes and cannot trade consistently, because fatigue, excessive pessimism, or excessive optimism affects their trading. In the author's opinion, if the algorithmic trading system is properly designed and thoroughly verified, it is better to let the system do everything, from data analysis to deciding on trading actions and executing the orders.

General Flow View of Algorithmic Trading

 Market Adapter (Data Feed)

 Complex Event Processing (Strategy)

 Order Routing (Execution)
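The three stages above can be sketched as a tiny pipeline. The sketch below is purely illustrative: the tick format, the price thresholds, and the printed "order" are assumptions standing in for a real feed handler, strategy engine, and broker or exchange API.

```python
from dataclasses import dataclass

@dataclass
class Tick:
    symbol: str
    price: float

def market_adapter(raw_feed):
    """Market adapter: normalize raw feed rows into Tick objects."""
    for symbol, price in raw_feed:
        yield Tick(symbol, float(price))

def strategy(ticks, threshold=100.0):
    """Complex event processing: emit a signal when a simple (toy) condition fires."""
    for tick in ticks:
        if tick.price < threshold:
            yield ("BUY", tick)
        elif tick.price > threshold * 1.05:
            yield ("SELL", tick)

def order_router(signals):
    """Order routing: a real system would call the broker/exchange API here."""
    for side, tick in signals:
        print(f"{side} {tick.symbol} @ {tick.price}")  # placeholder for an order call

if __name__ == "__main__":
    raw_feed = [("RDS", 99.2), ("RDS", 101.0), ("RDS", 106.3)]  # toy data
    order_router(strategy(market_adapter(raw_feed)))
```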

How Big Data can be used for Algorithmic Trading


There are several standard modules in a proprietary algorithm trading system,
including trading strategies, order execution, cash management and risk
management. Trading strategies are the core of an automated trading system.
Complex algorithms are used to analyze data (price data and news data) to capture anomalies in the market, identify profitable patterns, or detect the strategies of rivals and take advantage of that information. Various techniques are used in trading
strategies to extract actionable information from the data, including rules, fuzzy rules,
statistical methods, time series analysis, machine learning, as well as text mining.

 Technical Analysis and Rules

 Using of Statistics

 Artificial Intelligence, Machine Learning Based Algorithm Trading

 Text Mining for Algorithm Trading

 Levels the Playing Field to Stabilize Online Trade.
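To make the text-mining item above concrete, the following deliberately simplified sketch scores a news headline against small hand-picked word lists and maps the score to a trading signal. Real systems use far richer language models; the word lists and the mapping here are assumptions made only for illustration.

```python
# Toy sentiment rule for news-driven signals (illustrative only).
POSITIVE = {"beats", "record", "upgrade", "growth", "profit"}
NEGATIVE = {"miss", "downgrade", "fraud", "loss", "recall"}

def sentiment_score(headline: str) -> int:
    """Count positive words minus negative words in the headline."""
    words = headline.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def signal_from_news(headline: str) -> str:
    score = sentiment_score(headline)
    if score > 0:
        return "BUY"
    if score < 0:
        return "SELL"
    return "HOLD"

print(signal_from_news("Company beats estimates, posts record profit"))  # BUY
print(signal_from_news("Regulator announces fraud probe, shares miss"))  # SELL
```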

Algorithmic trading is the current trend in the financial world and machine learning
helps computers to analyze at rapid speed. The real-time picture that big data
analytics provides gives the potential to improve investment opportunities for
individuals and trading firms.

 Estimation of outcomes and returns.

Access to big data helps mitigate the risks of online trading and supports more precise predictions. Financial analytics helps tie together the factors that affect trends, pricing, and price behavior.

 Deliver accurate predictions

Big data can be used in combination with machine learning, which helps in making decisions based on logic rather than estimates and guesses. The data can be reviewed, and applications can be developed to update information on a regular basis so that predictions remain accurate.

 Backtesting Strategy

One of the key features of algorithmic trading is the ability to backtest. It is tough for discretionary traders to know which parts of their trading system work and which do not, since they cannot run the system on past data. With algo trading, you can run the algorithm on historical data to see whether it would have worked in the past. This provides a huge advantage, as it lets you remove flaws from a trading system before running it live, as illustrated in the sketch below.
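A minimal sketch of the backtesting idea follows: it replays a made-up historical price series through a naive momentum rule and tracks the hypothetical portfolio value. A serious backtest would also need transaction costs, slippage, and far more data.

```python
# Minimal backtest loop: replay historical prices through a simple rule.
prices = [100, 101, 99, 98, 102, 105, 103, 107]   # toy historical closes

cash, position = 1000.0, 0
for today, yesterday in zip(prices[1:], prices[:-1]):
    if today > yesterday and position == 0:        # naive momentum rule: buy on an up day
        position, cash = 1, cash - today           # buy one unit
    elif today < yesterday and position == 1:      # sell on a down day
        position, cash = 0, cash + today

final_value = cash + position * prices[-1]
print(f"Final portfolio value: {final_value:.2f} (started with 1000.00)")
```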

The Key Features of Algorithmic Trading


 Availability of Market and Company Data.

All trading algorithms are designed to act on real-time market data and price
quotes. A few programs are also customized to account for company
fundamentals data like EPS and P/E ratios. Any algorithmic trading software
should have a real-time market data feed, as well as a company data feed. It
should be available as a build-in into the system or should have a provision to
easily integrate from alternate sources.

 Connectivity to Various Markets.

Traders looking to work across multiple markets should note that each exchange
might provide its data feed in a different format, like TCP/IP, Multicast, or a FIX.
Your software should be able to accept feeds of different formats. Another option
is to go with third-party data vendors like Bloomberg and Reuters, which
aggregate market data from different exchanges and provide it in a uniform
format to end clients. The algorithmic trading software should be able to process
these aggregated feeds as needed.

 Latency.
This is the most important factor for algorithm trading. Latency is the time-delay
introduced in the movement of data points from one application to the other.
Consider the following sequence of events. It takes 0.2 seconds for a price quote
to come from the exchange to your software vendor’s data center (DC), 0.3
seconds from the data center to reach your trading screen, 0.1 seconds for your
trading software to process this received quote, 0.3 seconds for it to analyze and
place a trade, 0.2 seconds for your trade order to reach your broker, 0.3 seconds
for your broker to route your order to the exchange.

Total time elapsed = 0.2 + 0.3 + 0.1 + 0.3 + 0.2 + 0.3 = Total 1.4 seconds.

In today’s dynamic trading world, the original price quote would have changed
multiple times within this 1.4 second period. This delay could make or break your
algorithmic trading venture. One needs to keep this latency to the lowest possible
level to ensure that you get the most up-to-date and accurate information
without a time gap.

Latency has been reduced to microseconds, and every attempt should be made
to keep it as low as possible in the trading system. A few measures include having
direct connectivity to the exchange to get data faster by eliminating the vendor in
between; by improving your trading algorithm so that it takes less than 0.1+0.3 =
0.4 seconds for analysis and decision making; or by eliminating the broker and
directly sending trades to the exchange to save 0.2 seconds.

 Configurability and Customization.

Most algorithmic trading software offers standard built-in trade algorithms, such
as those based on a crossover of the 50-day moving average (MA) with the 200-
day MA. A trader may like to experiment by switching to the 20-day MA with the
100-day MA. Unless the software offers such customization of parameters, the
trader may be constrained by the fixed functionality of the built-ins. Whether buying or
building, the trading software should have a high degree of customization and
configurability.

 Functionality to Write Custom Programs.

MATLAB, Python, C++, JAVA, and Perl are the common programming languages
used to write trading software. Most trading software sold by the third-party
vendors offers the ability to write your own custom programs within it. This allows
a trader to experiment and try any trading concept he or she develops. Software
that offers coding in the programming language of your choice is obviously
preferred.

 Backtesting Feature on Historical Data.

Backtesting simulation involves testing a trading strategy on historical data. It


assesses the strategy’s practicality and profitability on past data, certifying it for
success (or failure or any needed changes). This mandatory feature also needs to
be accompanied by availability of historical data, on which the backtesting can be
performed.

 Integration with Trading Interface.

Algorithmic trading software places trades automatically based on the occurrence


of a desired criteria. The software should have the necessary connectivity to the
broker(s) network for placing the trade or a direct connectivity to the exchange to
send the trade orders.

 Plug-n-Play Integration.

A trader may be simultaneously using a Bloomberg terminal for price analysis, a


broker’s terminal for placing trades, and a MATLAB program for trend analysis.
Depending upon individual needs, the algorithmic trading software should have
easy plug-n-play integration and available APIs across such commonly used
trading tools. This ensures scalability, as well as integration.

Strategies used for Algorithmic Trading


 Execution

Imagine if you’re a huge sovereign wealth fund placing a 100 million order on
Apple shares. Do you think there will be enough sellers at the price you chose?
And what do you think will happen to the share price before the order gets filled?

This is where an algorithm can be used to break up orders and strategically place
them over the course of the trading day. In this case, the trader isn’t exactly
profiting from this strategy, but he’s more likely able to get a better price for his
entry.

 Arbitrage

Buying a dual-listed stock at a lower price in one market and simultaneously


selling it at a higher price in another market offers the price differential as risk-
free profit or arbitrage. If you see the price of a Chanel bag to be US$5000 in
France and US$6000 in Singapore, what would you do? The obvious answer
would be to buy in France and sell in Singapore. This is risk-free profit at no cost, earned as a spread between the two countries. Similarly, if one spots a price
difference in futures and cash markets, an algo trader can be alerted by this and
take advantage.
 Trend following

There are tons of investment gurus claiming to have the best strategies based on
technical analysis, relying on indicators like moving averages, momentum,
stochastics and many more. Some automated trading systems make use of these
indicators to trigger a buy and sell order. Trades are initiated based on the
occurrence of desirable trends, which are easy and straightforward to implement
through algorithms without getting into the complexity of predictive analysis.
Using 50- and 200-day moving averages is a popular trend-following strategy.
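A hedged sketch of the moving-average crossover rule follows. The window lengths are passed as parameters (defaulting to the classic 50/200 pair mentioned above, but reconfigurable to, say, 20/100), and the price series in the example is synthetic and shortened so the crossover is visible.

```python
def sma(values, window):
    """Simple moving average of the last `window` values (None until enough data)."""
    if len(values) < window:
        return None
    return sum(values[-window:]) / window

def crossover_signal(prices, fast=50, slow=200):
    """Return 'BUY' when the fast MA crosses above the slow MA, 'SELL' on the reverse."""
    prev_fast, prev_slow = sma(prices[:-1], fast), sma(prices[:-1], slow)
    cur_fast, cur_slow = sma(prices, fast), sma(prices, slow)
    if None in (prev_fast, prev_slow, cur_fast, cur_slow):
        return "HOLD"
    if prev_fast <= prev_slow and cur_fast > cur_slow:
        return "BUY"
    if prev_fast >= prev_slow and cur_fast < cur_slow:
        return "SELL"
    return "HOLD"

# Synthetic example with short windows so the crossover is visible.
prices = [14, 13, 12, 11, 10, 9, 12, 16]
print(crossover_signal(prices, fast=2, slow=4))   # BUY
```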

 Index Fund Rebalancing

Index funds have defined periods of rebalancing to bring their holdings to par
with their respective benchmark indices. This creates profitable opportunities for
algorithmic traders, who capitalize on expected trades that offer 20 to 80 basis
points profits depending on the number of stocks in the index fund just before
index fund rebalancing. Such trades are initiated via algorithmic trading systems
for timely execution and the best prices.

 Mathematical Model-based Strategies

Proven mathematical models, like the delta-neutral trading strategy, allow trading
on a combination of options and the underlying security. (Delta neutral is a
portfolio strategy consisting of multiple positions with offsetting positive and
negative deltas — a ratio comparing the change in the price of an asset, usually a
marketable security, to the corresponding change in the price of its derivative —
so that the overall delta of the assets in question totals zero.)

 Trading Range (Mean Reversion)

Mean reversion strategy is based on the concept that the high and low prices of
an asset are a temporary phenomenon that revert to their mean value (average
value) periodically. Identifying and defining a price range and implementing an
algorithm based on it allows trades to be placed automatically when the price of
an asset breaks in and out of its defined range.
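The mean-reversion idea can be expressed as a simple band check, as in the sketch below; the two-standard-deviation band width and the toy price history are illustrative assumptions, not recommended settings.

```python
import statistics

def mean_reversion_signal(history, price, band=2.0):
    """BUY below the lower band, SELL above the upper band, otherwise HOLD."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if price < mean - band * stdev:
        return "BUY"    # price has broken below its usual range
    if price > mean + band * stdev:
        return "SELL"   # price has broken above its usual range
    return "HOLD"

history = [50, 51, 49, 50, 52, 51, 50, 49]   # toy look-back window
print(mean_reversion_signal(history, 45.0))  # BUY  (well below the range)
print(mean_reversion_signal(history, 50.5))  # HOLD (inside the range)
```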

 Volume-weighted Average Price (VWAP)

Volume-weighted average price strategy breaks up a large order and releases


dynamically determined smaller chunks of the order to the market using stock-
specific historical volume profiles. The aim is to execute the order close to the
volume-weighted average price (VWAP).
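For reference, the VWAP benchmark itself is simply the volume-weighted mean of trade prices, VWAP = sum(price x volume) / sum(volume). A minimal sketch with made-up trades:

```python
def vwap(trades):
    """Volume-weighted average price: sum(price * volume) / sum(volume)."""
    total_value = sum(price * volume for price, volume in trades)
    total_volume = sum(volume for _, volume in trades)
    return total_value / total_volume

# Toy intraday trades as (price, volume) pairs.
trades = [(100.0, 500), (100.5, 300), (99.8, 200)]
print(f"VWAP = {vwap(trades):.2f}")   # (100*500 + 100.5*300 + 99.8*200) / 1000 = 100.11
```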

 Time Weighted Average Price (TWAP)

Time-weighted average price strategy breaks up a large order and releases


dynamically determined smaller chunks of the order to the market using evenly
divided time slots between a start and end time. The aim is to execute the order

close to the average price between the start and end times thereby minimizing
market impact.

 Percentage of Volume (POV)

Until the trade order is fully filled, this algorithm continues sending partial orders
according to the defined participation ratio and according to the volume traded
in the markets. The related “steps strategy” sends orders at a user-defined
percentage of market volumes and increases or decreases this participation rate
when the stock price reaches user-defined levels.

 Implementation Shortfall

The implementation shortfall strategy aims at minimizing the execution cost of an


order by trading off the real-time market, thereby saving on the cost of the order
and benefiting from the opportunity cost of delayed execution. The strategy will
increase the targeted participation rate when the stock price moves favourably
and decrease it when the stock price moves adversely.

Advantages and Disadvantages of Algorithmic Trading


Advantages
Algo-trading provides the following advantages

 Best Execution: Trades are often executed at the best possible prices.

 Low Latency: Trade order placement is instant and accurate (there is a high
chance of execution at the desired levels). Trades are timed correctly and instantly
to avoid significant price changes.

 Reduced transaction costs.

 Simultaneous automated checks on multiple market conditions.

 No Human Error: Reduced risk of manual errors or mistakes when placing trades.
Also negates human traders' tendency to be swayed by emotional and
psychological factors.

 Backtesting: Algo-trading can be backtested using available historical and real-


time data to see if it is a viable trading strategy.

Disadvantages
There are also several drawbacks or disadvantages of algorithmic trading to consider:

 Latency: Algorithmic trading relies on fast execution speeds and low latency,
which is the delay in the execution of a trade. If a trade is not executed quickly
enough, it may result in missed opportunities or losses.

 Black Swan Events: Algorithmic trading relies on historical data and


mathematical models to predict future market movements. However,
unforeseen market disruptions, known as black swan events, can occur, which
can result in losses for algorithmic traders.

 Dependence on Technology: Algorithmic trading relies on technology,


including computer programs and high-speed internet connections. If there
are technical issues or failures, it can disrupt the trading process and result in
losses.

 Market Impact: Large algorithmic trades can have a significant impact on


market prices, which can result in losses for traders who are not able to adjust
their trades in response to these changes. Algo trading has also been
suspected of increasing market volatility at times, even leading to so-
called flash crashes.

 Regulation: Algorithmic trading is subject to various regulatory requirements


and oversight, which can be complex and time-consuming to comply with.

 High Capital Costs: The development and implementation of algorithmic


trading systems can be costly, and traders may need to pay ongoing fees for
software and data feeds.

 Limited Customization: Algorithmic trading systems are based on pre-


defined rules and instructions, which can limit the ability of traders to
customize their trades to meet their specific needs or preferences.

 Lack of Human Judgment: Algorithmic trading relies on mathematical


models and historical data, which means that it does not take into account the
subjective and qualitative factors that can influence market movements. This
lack of human judgment can be a disadvantage for traders who prefer a more
intuitive or instinctive approach to trading.

Pros & Cons of Algorithmic Trading

Pros

 Instant order confirmation

 Potential for best price and lowest cost trades

 No human error in trade execution

 Not biased by human emotion

Cons

 Lack of human judgment in real-time

 Can lead to increased volatility or market instability at times

 High capital outlays to build and maintain software & hardware

 May be subject to additional regulatory scrutiny

Algo-Trading Time Scales


Much of the algo-trading today is high-frequency trading (HFT), which attempts to
capitalize on placing a large number of orders at rapid speeds across multiple
markets and multiple decision parameters based on preprogrammed instructions.

Algo-trading is used in many forms of trading and investment activities including:

 Mid- to long-term investors or buy-side firms—pension funds, mutual


funds, insurance companies—use algo-trading to purchase stocks in large
quantities when they do not want to influence stock prices with discrete,
large-volume investments.

 Short-term traders and sell-side participants—market makers (such as


brokerage houses), speculators, and arbitrageurs—benefit from automated
trade execution; in addition, algo-trading aids in creating sufficient liquidity
for sellers in the market.

 Systematic traders—trend followers, hedge funds, or pairs traders (a market-


neutral trading strategy that matches a long position with a short position in a
pair of highly correlated instruments such as two stocks, exchange-traded
funds (ETFs), or currencies)—find it much more efficient to program their
trading rules and let the program trade automatically.

Algorithmic trading provides a more systematic approach to active trading than


methods based on trader intuition or instinct.

Technical Requirements for Algorithmic Trading


Implementing the algorithm using a computer program is the final component of
algorithmic trading, accompanied by backtesting (trying out the algorithm on
historical periods of past stock-market performance to see if using it would have
been profitable). The challenge is to transform the identified strategy into an
integrated computerized process that has access to a trading account for placing
orders. The following are the requirements for algorithmic trading:

 Computer-programming knowledge to program the required trading strategy,
hired programmers, or pre-made trading software.

 Network connectivity and access to trading platforms to place orders.

 Access to market data feeds that will be monitored by the algorithm for
opportunities to place orders.

 The ability and infrastructure to backtest the system once it is built before it
goes live on real markets.

 Available historical data for backtesting depending on the complexity of rules


implemented in the algorithm.

An Example of Algorithmic Trading


Royal Dutch Shell (RDS) is listed on the Amsterdam Stock Exchange (AEX) and
London Stock Exchange (LSE). We start by building an algorithm to identify
arbitrage opportunities. Here are a few interesting observations:

 AEX trades in euros while LSE trades in British pound sterling.

 Due to the one-hour time difference, AEX opens an hour earlier than LSE
followed by both exchanges trading simultaneously for the next few hours
and then trading only in LSE during the last hour as AEX closes.

Can we explore the possibility of arbitrage trading on the Royal Dutch Shell stock
listed on these two markets in two different currencies?

Requirements:

 A computer program that can read current market prices.

 Price feeds from both LSE and AEX.

 A forex (foreign exchange) rate feed for GBP-EUR.

 Order-placing capability that can route the order to the correct exchange.

 Backtesting capability on historical price feeds.

The computer program should perform the following:

 Read the incoming price feed of RDS stock from both exchanges.

 Using the available foreign exchange rates, convert the price of one currency
to the other.

 If there is a large enough price discrepancy (discounting the brokerage costs)
leading to a profitable opportunity, then the program should place the buy
order on the lower-priced exchange and sell the order on the higher-priced
exchange.

 If the orders are executed as desired, the arbitrage profit will follow.
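A simplified sketch of that program logic is given below. The quotes, the FX rate, and the minimum edge are made-up numbers, and place_order is only a placeholder for real order routing, not an actual broker or exchange API.

```python
def arbitrage_check(price_lse_gbp, price_aex_eur, gbp_to_eur, min_edge_eur=0.05):
    """Compare the dual listing in a common currency and return an action, if any."""
    lse_in_eur = price_lse_gbp * gbp_to_eur          # convert the LSE quote to euros
    spread = price_aex_eur - lse_in_eur              # positive: AEX is the richer market
    if spread > min_edge_eur:                        # edge must cover brokerage costs
        return ("BUY LSE", "SELL AEX", spread)
    if -spread > min_edge_eur:
        return ("BUY AEX", "SELL LSE", -spread)
    return None

def place_order(action):
    print("Routing:", action)                        # placeholder for real order routing

# Made-up quotes: 24.10 GBP on LSE, 28.45 EUR on AEX, 1 GBP = 1.17 EUR.
decision = arbitrage_check(24.10, 28.45, 1.17)
if decision:
    buy_leg, sell_leg, edge = decision
    place_order(buy_leg)
    place_order(sell_leg)
    print(f"Expected gross edge: {edge:.2f} EUR per share (before costs and slippage)")
```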

Simple and easy! However, the practice of algorithmic trading is not that simple to
maintain and execute. Remember, if one investor can place an algo-generated trade,
so can other market participants. Consequently, prices fluctuate in milli- and even
microseconds. In the above example, what happens if a buy trade is executed but the
sell trade does not because the sell prices change by the time the order hits the
market? The trader will be left with an open position making the arbitrage strategy
worthless.

There are additional risks and challenges such as system failure risks, network
connectivity errors, time-lags between trade orders and execution and, most
important of all, imperfect algorithms. The more complex an algorithm, the more
stringent backtesting is needed before it is put into action.

Big data and healthcare


Big data refers to large data sets consisting of both structured and unstructured
data that are analyzed to find insights, trends, and patterns. Most commonly, big
data is defined by the three V’s – volume, velocity, and variety – meaning that it has a
high volume of data that is generated quickly and consisting of different data types,
such as text, images, graphs, or videos.

In health care, big data is generated by various sources and analyzed to guide
decision-making, improve patient outcomes, and decrease health care costs, among
other things. Some of the most common sources of big data in health care include
electronic health records (EHR), electronic medical records (EMRs), personal health
records (PHRs), and data produced by widespread digital health tools like wearable
medical devices and health apps on mobile devices.

What Is Big Data In Healthcare?

Big data in healthcare is a term used to describe massive volumes of information
created by the adoption of digital technologies that collect patients' records and
help in managing hospital performance; otherwise too large and complex for
traditional technologies.

The application of big data analytics in healthcare has many positive, and even life-saving, outcomes. In essence, big data refers to the vast quantities of information created by the digitization of everything, which is consolidated and analyzed by specific technologies. Applied to healthcare, it uses the specific health data of a population (or of a particular individual) to help prevent epidemics, cure diseases, cut down costs, and more.

Now that we live longer, treatment models have changed, and many of these changes are driven by data. Doctors want to understand as much as they can about a person, as early in their life as possible, so they can pick up warning signs of serious illness as they arise; treating any disease at an early stage is far simpler and less expensive. By utilizing key performance indicators and healthcare data analytics, prevention becomes better than cure, and drawing a comprehensive picture of a patient lets insurers provide a tailored package. This is the industry's attempt to tackle the silo problem with patient data: bits and pieces of it are collected and archived everywhere, in hospitals, clinics, surgeries, and so on, without any proper way of communicating between them.

That said, the amount of sources from which health professionals can gain insights
from their patients keeps growing. This data is normally coming in different formats
and sizes, which presents a challenge to the user. However, the current focus is no
longer on how “big” the data is but on how smartly it is managed. With the help of
the right technology, data can be extracted from the following sources of the
healthcare industry in a smart and fast way:

 Patient portals

 Research studies

 EHRs

 Wearable devices

 Search engines

 Generic databases

 Government agencies

 Payer records

 Staffing schedules

 Patient waiting room

Indeed, for years gathering huge amounts of data for medical use has been costly
and time-consuming. With today’s always-improving technologies, it becomes easier
not only to collect such data but also to create comprehensive healthcare
reports and convert them into relevant critical insights that can then be used to
provide better care. This is the purpose of healthcare data analysis: using data-driven
findings to predict and solve a problem before it is too late, but also assess methods
and treatments faster, keep better track of inventory, involve patients more in their
own health, and empower them with the tools to do so.

Big data examples in health care


Perhaps the most common source of big data in health care is electronic health
records (EHRs), which typically contain a patient’s medical history, demographic
information, medications, immunizations, test results, and progress notes. While in
the past this information was put down in hand-written files that were easily
misplaced, difficult to share, and occasionally illegible, today EHRs allow health care
professionals to easily access a patient’s pertinent medical information and provide
the best possible care.

By pairing the big data produced by EHRs with advanced analytics techniques like machine learning, medical researchers can create predictive models with various applications, such as predicting post-surgical complications, heart failure, or substance abuse.

 Predict the daily patients' income to tailor staffing accordingly

 Use Electronic Health Records (EHRs)

 Use real-time alerting for instant care

 Help in preventing opioid abuse in the US

 Enhance patient engagement in their own health

 Use health data for a better-informed strategic planning

 Research more extensively to cure cancer

 Use predictive analytics

 Reduce fraud and enhance data security

 Practice telemedicine

 Integrate medical imaging for a broader diagnosis

 Prevent unnecessary ER visits

 Smart staffing & personnel management

 Learning & development

 Advanced risk & disease management

 Suicide & self-harm prevention

 Improved supply chain management

 Financial facility management

 Developing new therapies & innovations

 Track and control mass diseases

 Improve the prescription process

 Mitigate human error

 Apple can alert people about heart problems

 Bluetooth helps asthma patients


Big data analytics in health care explained
Data professionals working in health care use big data for a variety of applications,
from simply improving the patient experience to creating complex machine learning
models capable of diagnosing medical conditions using X-ray scans. To accomplish
these feats, data professionals use analytics to effectively manage and analyze big
data to produce insights, identify patterns and trends, and guide decision-making.

The impact of big data in health care is huge, and the market has grown to match it.
According to research conducted by Allied Market Research in 2019, for example, the
North American market value for big data analytics in health care is projected to
reach $34.16 billion by 2025, several times higher than its $9.36 billion valuation in
2017 [4]. Just as big data lays the foundation for big advances in health care, it has
also drawn investment for further growth.

Big data applications in health care

Professionals in health care use big data for a wide range of purposes – from
developing insights in biomedical research to providing patients with personalized
medicine. Here are just some of the ways that big data is used in health care today:

 Employing predictive analytics to create machine learning models that can


predict the likelihood a patient might develop a particular disease.

 Providing real-time alerts to medical staff by continuously monitoring patient conditions within a facility (a minimal sketch of this idea follows the list below).

 Enhancing security surrounding the processing of sensitive medical data, such
as insurance claims and medical records.
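As a hedged illustration of the real-time alerting application above, the sketch below scans a stream of vital-sign readings against fixed thresholds. The thresholds, patient identifiers, and readings are invented for the example and are not clinical guidance.

```python
# Toy real-time alerting loop over streaming vital-sign readings.
# Thresholds and readings are invented for illustration, not clinical guidance.
THRESHOLDS = {"heart_rate": (40, 130), "spo2": (92, 100), "temp_c": (35.0, 39.0)}

def check_reading(patient_id, vital, value):
    low, high = THRESHOLDS[vital]
    if value < low or value > high:
        return f"ALERT: patient {patient_id} {vital}={value} outside [{low}, {high}]"
    return None

stream = [
    ("P-001", "heart_rate", 82),
    ("P-002", "spo2", 88),        # below the illustrative lower bound
    ("P-001", "temp_c", 39.6),    # above the illustrative upper bound
]

for patient_id, vital, value in stream:
    alert = check_reading(patient_id, vital, value)
    if alert:
        print(alert)              # in practice this would notify medical staff
```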

Benefits of big data in health care


Big data has the potential to improve health care for the better. Here are some of the
most common benefits of using big data in health care:

 Better patient care: More patient data means an opportunity to understand


the patient experience better and improve the care they receive.

 Improved research: Big data gives medical researchers unprecedented access


to a large volume of data and methods of collecting data. In turn, this data
can drive important medical breakthroughs that save lives.

 Smarter treatment plans: Analyzing the treatment plans that helped patients
(and those that didn’t) can help researchers create even better treatment plans
for future patients.

 Reduced health care costs for patients and health providers: Health care
can cost a lot. Big data offers the possibility of reducing the cost of obtaining
and providing health care by identifying appropriate treatment plans,
allocating resources intelligently, and identifying potential health issues before
they occur.

Big data analytics jobs in health care


There are many jobs that use big data analytics in health care. Here are some of the
most common that you’ll likely encounter as you explore the field:

 Health informatics specialist

 Health data engineer

 Health care data analyst

 Health care statistician

How to Use Big Data in Healthcare?


All in all, we’ve noticed three key trends through these 24 examples of healthcare
analytics: the patient experience will improve dramatically, including quality of

treatment and satisfaction levels; the overall health of the population can also be
enhanced on a sustainable basis, and operational costs can be reduced significantly.

Let’s have a look now at concrete examples of big data in healthcare:

a) Big Data in Healthcare Applied On a Hospital Dashboard

This healthcare dashboard below provides you with the overview needed as a
hospital director or as a facility manager. Gathering in one central point all the data
on every division of the hospital, the attendance, its nature, the costs incurred, etc.,
you have the big picture of your facility, which will be of great help to run it
smoothly.

You can see here the most important metrics concerning various aspects: the number
of patients that were welcomed in your facility, how long they stayed and where, how
much it cost to treat them, and the average waiting time in emergency rooms. Such a
holistic view helps top administrators to identify potential bottlenecks, spot trends
and patterns over time, and in general, assess the situation. This is key in order to
make better-informed decisions that will improve the overall operations
performance, with the goal of treating patients better and having the right staffing
resources.

b) Big Data Healthcare Application on Patients' Care

Another real-world application of healthcare big data analytics, our dynamic
patient KPI dashboard, is a visually-balanced tool designed to enhance service levels
as well as treatment accuracy across departments.

By bringing a wealth of patient-centric information together in one central location, medical institutions can create harmony between departments while streamlining care processes in many vital areas. For instance, bed occupancy rate metrics offer a window of insight into where resources might be required, while tracking canceled or missed appointments gives senior executives the data they need to reduce costly patient no-shows.

Here, you will find everything you need to enhance your level of patient care both in
real-time and in the long term. This is a visual innovation that has the power to
improve every type of medical institution, big or small.

Why We Need Big Data Analytics in Healthcare


By looking at our list of most insightful medical big data applications, you should
have a notion of how positive the use of analytics can be for this industry. If this is
not clear yet, here we will summarize the main points of importance by listing a few
benefits of big data in healthcare.

As mentioned, there’s a huge need for big data in healthcare, especially due to rising
costs in nations like the United States. As a McKinsey report states: “After more than
20 years of steady increases, healthcare expenses now represent 17.6 percent of GDP
— nearly $600 billion more than the expected benchmark for a nation of the United
States’s size and wealth.” This quote leads us to our first benefit.

Reducing costs

As stated above, costs are much higher than they should be, and they have been
rising for the past 20 years. Clearly, we are in need of some smart, data-driven
thinking in this area. And current incentives are changing as well: many insurance
companies are switching from fee-for-service plans (which reward using expensive
and sometimes unnecessary treatments and treating large amounts of patients
quickly) to plans that prioritize patient outcomes.

As the authors of the popular Freakonomics books have argued, financial incentives
matter – and incentives that prioritize patients' health over treating large numbers of
patients are a good thing. Why does this matter?

Well, in the previous scheme, healthcare providers had no direct incentive to share
patient information with one another, which made it harder to utilize the power of
analytics. Now that more of them are getting paid based on patient outcomes, they
have a financial incentive to share data that can be used to improve the lives of
patients while cutting costs for insurance companies.

Reducing medical errors

Physician decisions are becoming more and more evidence-based, meaning that
they rely on large swathes of research and clinical data as opposed to solely their
schooling and professional opinion. That said, the risk of human error is always a
latent threat. Even though doctors are highly trained professionals, they are still
human, and the risk of selecting the wrong medication or treatment can potentially
risk a person’s life. With the use of big data and the tools that we mentioned
throughout this post, professionals can be easily alerted when the wrong medication,
test, treatment, or other has been provided and remediate it immediately. In time,
this can significantly reduce the rates of medical errors and improve the facility’s
reputation.

As in many other industries, data gathering and management are getting bigger, and
professionals need help in the matter. This new treatment attitude means there is a
greater demand for big data analytics in healthcare facilities than ever before, and
the rise of SaaS BI tools is also answering that need.

Optimizing organizational and personnel management

While using data to ensure you are providing the best care to patients is
fundamental, there are also other operational areas in which it can assist the health

industry. Part of providing quality care is ensuring the facility works optimally, and
this can also be achieved with the help of big data.

By using the right BI software, professionals can gather and analyze real-time data
about the performance of their organization in areas such as operations and
finances, as well as personnel management. For instance, predictive analytics
technologies can provide relevant information regarding admission rates. These
insights can help define staffing schedules to cover demand as well as inventory for
medical supplies. This way, care facilities can stay one step ahead and ensure that
patients are getting the best experience possible.

Getting this level of insight in such an intuitive way allows managers to redirect
resources where they are most needed and optimize areas that are not performing
well to ensure the best return on investment possible.

Drive innovation and growth

Our last benefit is one that should be the clearest from the list of applications we
provided earlier. The use of big data in the care industry enables professionals to test
new technologies, drugs, and treatments to improve the quality of care given to
patients and battle diseases that were once thought of as unbeatable.

Thanks to wearable devices that can tell your heart rate, Bluetooth asthma inhalers
that gather insights to prevent attacks, and much more, doctors are able to use data
to understand how common diseases work and how certain external factors might be
affecting entire communities. Through that, they are able to provide personalized
quality care to each and every person that goes into a hospital.

There is no denying that the power of big data analytics is saving lives. That being
said, the process of managing data requires a lot of effort, and with that comes
challenges, which we will discuss below.

Obstacles to a Widespread Big Data Healthcare


The points mentioned above are just a few of the countless benefits of big data in
healthcare. That said, with benefits also come limitations. In order to provide a full
picture of the topic, we will now list a few obstacles and healthcare data challenges
that organizations can face when implementing analytics into their processes.

 Data integration and storage: One of the biggest hurdles standing in the
way of using big data in medicine is how medical data is spread across many
sources governed by different states, hospitals, and administrative
departments. The integration of these data sources would require developing
a new infrastructure where all data providers collaborate with each other.

 Data sharing: Equally important is implementing new online reporting
software and business intelligence strategy that will allow all relevant users to
be connected with the data. Healthcare needs to catch up with other
industries that have already moved from standard regression-based methods
to more future-oriented ones like predictive analytics, machine learning, and
graph analytics. This is done with the help of modern reporting tools such as
a dashboard creator that allows anyone to perform advanced analytics with
just a few clicks easily.

 Security and privacy: Security and privacy are constant concerns and one of
the biggest challenges of big data in healthcare. Every day, hospitals and care
centers deal with sensitive patient data that needs to be carefully protected.
Considering that this data comes from many different sources, security can be
a real challenge for these organizations. To address it, it is critical to follow
legal regulations, conduct regular audits, and train employees on data
protection best practices.

 Data literacy: Using big data and analytics in healthcare involves many
processes and tools to collect, clean, process, manage, and analyze the huge
amounts of data available. This requires a level of knowledge and skills that
can present a limitation for average users who are not acquainted with these
processes. However, while data literacy might have been one of the big
disadvantages of big data in healthcare, it is no longer the case.

So, even if these analytical services are not your cup of tea, you are a potential
patient, and so you should care about new healthcare analytics applications. Besides,
it’s good to take a look around sometimes and see how other industries cope with it.
They can inspire you to adapt and adopt some good ideas.

Challenges of Big Data Analytics in Healthcare


1. Capture

2. Cleaning

3. Storage

4. Security

5. Stewardship

6. Querying

7. Reporting

8. Visualization

9. Updating

10. Sharing

Here's how big data is transforming healthcare:

1. Clinical Decision Support: Big data analytics provides clinicians with access
to vast amounts of patient data, including electronic health records (EHRs),
medical imaging, genomics, and real-time monitoring data. Advanced
analytics techniques, such as machine learning and predictive modeling, help
clinicians make more informed decisions by identifying patterns, predicting
outcomes, and recommending personalized treatment plans.

2. Population Health Management: Big data analytics allows healthcare


organizations to analyze population-level data to identify trends, patterns, and
risk factors across large patient populations. By segmenting patient
populations based on demographic, clinical, and behavioral characteristics,
organizations can implement targeted interventions, preventive care
programs, and chronic disease management initiatives to improve population
health outcomes and reduce healthcare costs.

3. Precision Medicine: Big data analytics enables precision medicine


approaches by analyzing genomic data, biomarkers, and patient health
information to tailor treatments and therapies to individual patients' unique
characteristics and needs. By integrating genetic data with clinical data and
outcomes data, healthcare providers can identify genetic markers, predict
drug responses, and develop personalized treatment plans that optimize
efficacy and minimize adverse effects.

4. Healthcare Predictive Analytics: Big data analytics facilitates predictive


analytics in healthcare by forecasting disease outbreaks, predicting patient
outcomes, and identifying high-risk patients who may benefit from early
intervention or targeted preventive measures. Predictive models analyze
historical data, clinical variables, and environmental factors to identify patterns
and trends that can inform clinical decision-making and resource allocation.

5. Real-Time Monitoring and Remote Patient Monitoring: Big data analytics


powers real-time monitoring and remote patient monitoring solutions that
enable continuous monitoring of patients' vital signs, symptoms, and health
status outside of traditional healthcare settings. By analyzing streaming data
from wearable devices, sensors, and mobile health apps, healthcare providers
can detect health issues early, intervene proactively, and improve patient
outcomes while reducing hospitalizations and healthcare costs.

6. Healthcare Operations and Efficiency: Big data analytics helps healthcare


organizations optimize operational processes, resource utilization, and
workflow efficiency. By analyzing data on patient flow, staffing levels,
equipment usage, and supply chain logistics, organizations can identify
bottlenecks, streamline operations, and improve resource allocation to
enhance patient experiences and reduce wait times.

7. Healthcare Fraud Detection and Prevention: Big data analytics is used to


detect and prevent healthcare fraud, waste, and abuse by analyzing claims
data, billing patterns, and provider behaviors. Advanced analytics techniques,
such as anomaly detection and network analysis, help identify suspicious
activities, fraudulent claims, and improper billing practices, enabling payers
and healthcare organizations to mitigate financial losses and protect against
fraud risks.

Big Data in Medicine


 Clinical Decision Support: Big data analytics assists healthcare professionals in
making more informed decisions by analyzing vast amounts of patient data,
including electronic health records (EHRs), medical imaging, genomic data, and
real-time monitoring data. Machine learning algorithms help identify patterns,
predict outcomes, and recommend personalized treatment plans.

 Drug Discovery and Development: Big data analytics accelerates drug discovery
and development processes by analyzing genomic data, molecular interactions,
and clinical trial data. Predictive modeling and simulation techniques help identify
potential drug candidates, predict drug efficacy, and optimize dosage regimens.

 Precision Medicine: Big data enables precision medicine approaches by


analyzing patient-specific genetic, clinical, and lifestyle data to tailor treatments
and therapies to individual patients. By integrating genomic data with clinical
information, healthcare providers can identify genetic markers, predict drug
responses, and develop personalized treatment plans.

 Population Health Management: Big data analytics allows healthcare


organizations to analyze population-level data to identify trends, patterns, and
risk factors across large patient populations. By segmenting patient populations
based on demographic, clinical, and behavioral characteristics, organizations can
implement targeted interventions, preventive care programs, and chronic disease
management initiatives to improve population health outcomes.

Advertising and Big Data


Big data is transforming the relationship between companies and customers.
Analyzing large amounts of data for marketing purposes is not new, but recent
advancements in big data technology have given advertisers powerful new ways of
understanding consumers’ behaviors, needs and preferences. Big data helps you
optimize each customer’s demands and convert them into prospective purchasers.

A company’s ability to forecast growth accurately and devise a viable marketing plan
now relies heavily on the availability and analysis of information. By using big data
analytics in your planning and decision-making, your company will be well-equipped
to solve today’s advertising problems and anticipate tomorrow’s challenges.

The Role of Big Data Analytics in Digital Marketing Strategy
Big data is crucial in digital marketing because it provides companies with deep
insights about consumer behavior.

Google is an excellent example of big data analytics in action. Google leverages big
data to deduce what consumers want based on a variety of characteristics, such as
search history, geography, and trending topics. Big data mining is the secret
sauce behind Google's proactive, predictive marketing: determining what
consumers desire and how to incorporate that knowledge into the company’s ad
and product experiences.

But your company doesn’t have to be a tech giant to use big data analytics
successfully. Here are five key ways companies of all sizes can benefit from big data:

1. Receiving Data Analysis In Real Time: In the past, traditional scalable


database engine technologies could process and analyze vast data collections.
However, they did so at a glacial pace, requiring days or even weeks to
complete jobs that frequently produced "stale" outcomes. In contrast, big data
analytics systems can conduct complex procedures at breakneck speeds,
allowing for real-time analysis and insights.

2. Enabling Targeted Advertising: Big data allows your company to accumulate


more data on your visitors so you can target consumers with tailored
advertisements that they are more likely to view. Google and Facebook are
already doing this, but third-party merchants also have access to the same
capabilities.

3. Analyzing Customer Insights: Big data is invaluable in sentiment analysis,


which assesses how consumers feel about your business. With sentiment
analysis, you can analyze your audience's likes and dislikes and determine
whether they have positive, negative or neutral feelings toward your brand.
Big data provides detailed information about your company’s strengths and
weaknesses, which can strengthen a marketing strategy aimed at retaining
and wooing potential customers.

4. Creating Relevant Content: Big data helps you deliver tailored content that
aligns with your customers’ interests and needs. It provides the information
you need to create the right content for the right consumers on the right
channel at the right time.

5. Protecting Customer Privacy: Users’ concerns about digital privacy are


reasonable, and large-scale data mining has introduced a wide range of
applications to provide a smooth user experience and protect personal data in
the case of an attack. As your business collects more data, it becomes more
crucial to keep customers informed about how you store their information
and what actions you're taking to adhere to privacy and data protection rules.

Best Practices for Big Data Analytics in Advertising


Start with these best practices to get the most business value from big data analytics:

 Use analytic innovation to your advantage. Big data processing and


analytics innovations revolutionize how companies extract value from their
consumer data. We're witnessing a transition from techniques that provide
periodic snapshots in descriptive reports and dashboards to comprehensive
platforms that continuously analyze inbound data to create real-time forecasts
and prescriptions.

 Use a range of analytical approaches. You’ll need a flexible architecture that


welcomes variety in order to create a cohesive production environment from
multiple analytic models. Integrate models created by various tools that have
extendable libraries, web applications and standards.

 Balance expertise against automation. Even as technology expands, human


knowledge is still required in big data analytics. Work with well-trained data
scientists who have analytical skills informed by deep domain knowledge to
construct successful prediction and decision-making models.

 Build a big data analytical pipeline. Big data provides additional ears and
eyes for your marketing and advertising campaigns, empowering you to
respond to audience activity and influence consumer behavior in real time.
You now have the tools and know-how to develop effective big data
advertising campaigns, thanks to cloud technologies like Amazon Web
Services, Microsoft Azure, and Google Cloud.

The growth of big data analytics offers advertisers new opportunities for forecasting
trends and solving ongoing challenges. Embrace the power of big data to analyze
real-time data and customer insights and create targeted advertising and content
that hit the mark with your audience.

What is Big Data and How It is related to Advertising


Big Data is a general term that describes technologies and different methods for
analyzing and processing continuous flows of helpful information that are
enormously big and could not be processed without the help of machines.

For comparison, this text consists of approximately 130 lines or 1,100 words. Putting
big data into a single spreadsheet file could mean billions of lines of data in different
formats. For example, it could be data that includes all programmatic advertising
deals inside the BidsCube AdExchange and details about them just for one month.

In today's world, everyone is constantly generating data. It happens by using apps,


searching for information through search engines, online shopping, and even simply
traveling around cities and countries with your smartphone in your pocket. All this
creates a vast amount of valuable and helpful information. Furthermore, it can be
quickly collected, visualized, and carefully analyzed.

Let's look at a simple example to make it easier to understand the essence of Big
Data. Imagine a market where all products are arranged chaotically: bread near the
vegetables, fruit in the beverage department, vegetable oil next to the bathtub and
toiletries, and so on. With Big Data, it became possible to distribute all the goods
strictly in their places. But that's not all. You can easily find the product you want, see
expiration dates, learn about the benefits of that brand or variety of products, and
compare it with similar products.

Big Data is also a tool for the practical application of received information. It is
presented in a clear and convenient form, making it easy to solve everyday tasks and
make decisions. For example, you need to learn how to find your potential client and
offer a particular product at the right time for advertising campaigns. You can only
do this with a specific database.

 Targeted Advertising: Big data analytics enables advertisers to deliver


personalized and targeted advertisements to specific audience segments based
on demographic, behavioral, and psychographic data. By analyzing consumer
preferences, browsing history, and online interactions, advertisers can tailor ad
content, placement, and messaging to maximize relevance and engagement.

 Campaign Optimization: Big data analytics helps advertisers optimize


advertising campaigns by tracking key performance metrics, such as click-through
rates (CTR), conversion rates, and return on investment (ROI). By analyzing
campaign data in real-time, advertisers can identify trends, optimize ad creative,
and allocate ad spend more effectively to achieve campaign objectives (a small
worked example of these metrics follows this list).

 Customer Insights: Big data analytics provides advertisers with valuable insights
into consumer behavior, preferences, and purchasing patterns. By analyzing
customer data from multiple sources, including social media, website interactions,
and transaction history, advertisers can understand their target audience better,
identify market trends, and tailor marketing strategies to meet consumer needs.

 Attribution Modeling: Big data analytics helps advertisers attribute conversions


and sales to specific marketing channels, touchpoints, and campaigns. By
analyzing customer journeys and engagement paths, advertisers can determine
the impact of each marketing interaction on the overall conversion funnel and
allocate marketing budgets more effectively to maximize ROI.
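
To make these metrics concrete, here is a small worked example in Python. The campaign figures are invented purely for illustration; the formulas for CTR, conversion rate, and ROI are the standard ones referenced above.

    # Illustrative campaign figures (made up for this example).
    impressions = 120_000        # times the ad was shown
    clicks = 2_400               # times the ad was clicked
    conversions = 180            # purchases attributed to the campaign
    ad_spend = 3_000.00          # total cost of the campaign
    revenue = 9_450.00           # revenue attributed to the campaign

    ctr = clicks / impressions                # click-through rate
    conversion_rate = conversions / clicks    # share of clicks that convert
    roi = (revenue - ad_spend) / ad_spend     # return on investment

    print(f"CTR: {ctr:.2%}, conversion rate: {conversion_rate:.2%}, ROI: {roi:.1%}")

In practice these numbers would be aggregated continuously from streaming campaign data rather than typed in by hand.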
Big Data Technologies
Before big data technologies were introduced, data was managed with general-purpose
programming languages and basic structured query languages. However, these tools
were not efficient enough, because the information held by every organization and
domain kept growing continuously. It therefore became very important to handle such
huge data with efficient, stable technologies that could take care of the requirements
of clients and large organizations responsible for data production and control. Big
data technologies, the buzzword we hear so often these days, were introduced for
exactly these needs.

In this article, we are discussing the leading technologies that have expanded their
branches to help Big Data reach greater heights. Before we discuss big data
technologies, let us first understand briefly about Big Data Technology.

What is Big Data Technology?


Big data technology is defined as a software utility. It is primarily designed to
analyze, process, and extract information from extremely large and complex data
sets that traditional data processing software cannot deal with.

Among the technology concepts currently in vogue, big data technologies are widely
associated with many other technologies such as deep learning, machine learning,
artificial intelligence (AI), and the Internet of Things (IoT), all of which are massively
augmented by big data. In combination with these technologies, big data technologies
are focused on analyzing and handling large amounts of real-time and batch data.

Types of Big Data Technology


Before we start with the list of big data technologies, let us first discuss this
technology's broad classification. Big Data technology is primarily classified into the
following two types:

Operational Big Data Technologies

This type of big data technology mainly covers the basic data that people process
day to day. Typically, operational big data includes daily data such as online
transactions, social media activity, and the data of any particular organization or
firm, which is usually needed for analysis by software based on big data technologies.
This data can also be referred to as raw data that serves as the input for several
Analytical Big Data Technologies.

Some specific examples that include the Operational Big Data Technologies can be
listed as below:

 Online ticket booking system, e.g., buses, trains, flights, and movies, etc.

 Online trading or shopping from e-commerce websites like Amazon, Flipkart,


Walmart, etc.

 Online data on social media sites, such as Facebook, Instagram, Whatsapp,


etc.

 The employees' data or executives' particulars in multinational companies.

Analytical Big Data Technologies

Analytical Big Data is commonly referred to as an improved version of Big Data
Technologies. This type of big data technology is a bit more complex than
operational big data. Analytical big data is mainly used when performance criteria
come into play and important real-time business decisions are made based on
reports created by analyzing operational big data. This means that the actual
investigation of big data that matters for business decisions falls under this type
of big data technology.

Some common examples that involve the Analytical Big Data Technologies can be
listed as below:

 Stock marketing data

 Weather forecasting data and the time series analysis

 Medical health records where doctors can personally monitor the health status
of an individual

 Space mission databases, where every piece of information about a
mission is critically important

Top Big Data Technologies
We can categorize the leading big data technologies into the following four sections:

 Data Storage

 Data Mining

 Data Analytics

 Data Visualization

Let’s now examine the technologies falling under each of these categories with facts
and features, along with the companies that use them.
Data Storage

Typically, this type of big data technology includes infrastructure that allows data to
be fetched, stored, and managed, and is designed to handle massive amounts of
data. Various software programs are able to access, use, and process the collected
data easily and quickly. Among the most widely used big data technologies for this
purpose are:

1. Apache Hadoop

Apache Hadoop is an open-source, Java-based framework for storing and processing


big data, developed by the Apache Software Foundation. In essence, it provides a
distributed storage platform and processes big data using the MapReduce
programming model. The Hadoop framework is designed to automatically handle
hardware failures since they are common occurrences. Hadoop framework consists of
five modules, namely Hadoop Distributed File System (HDFS), Hadoop YARN (Yet
another Resource Negotiator), Hadoop MapReduce, Hadoop Common, and Hadoop
Ozone.

Companies using Hadoop: LinkedIn, Intel, IBM, MapR, Facebook, Microsoft,


Hortonworks, Cloudera, etc.

Key features:

 A distributed file system, called HDFS (Hadoop Distributed File System),


enables fast data transfer between nodes.

 HDFS is a fundamentally resilient file system. In Hadoop, data that is stored on


one node is also replicated on other nodes of the cluster to prevent data loss
in case of hardware or software failure.

 Hadoop is an inexpensive, fault-tolerant, and extremely flexible framework


capable of storing and processing data in any format (structured, semi-
structured, or unstructured).

 MapReduce is a built-in batch processing engine in Hadoop that splits large
computations across multiple nodes to ensure optimum performance and
load balancing.

2. MongoDB

MongoDB is an open-source, cross-platform, document-oriented database designed


to store and handle large amounts of data while providing high availability,
performance, and scalability. Since MongoDB does not store or retrieve data in the
form of tables, it is considered a NoSQL database. A new entrant to the data storage
field, MongoDB is very popular due to its document-oriented NoSQL features,
distributed key-value store, and MapReduce calculation capabilities. It was named
“Database Management System of the Year” by DB-Engines, which isn’t surprising
since NoSQL databases are more adept at handling Big Data than traditional RDBMS.

Companies using MongoDB: MySQL, Facebook, eBay, MetLife, Google, Shutterfly,


Aadhar, etc.

Key features:

 It integrates seamlessly with languages like Ruby, Python, and JavaScript, which
facilitates high coding velocity.

 A MongoDB database stores data in JSON documents, which provide a rich


data model that maps effortlessly to native programming languages.

 MongoDB has several features that are unavailable in a traditional RDBMS,


such as dynamic queries, secondary indexes, rich updates, sorting, and easy
aggregation.

 In document-based database systems, related data is stored in a single
document, making it possible to run queries faster than with a traditional
relational database, where related data is stored in multiple tables and later
combined using joins.
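
To illustrate the document model described above, here is a minimal sketch using the PyMongo driver. The connection string, database, collection, and field names are assumptions made up for this example, not part of any particular deployment.

    from pymongo import MongoClient

    # Connect to a local MongoDB server (placeholder address for this sketch).
    client = MongoClient("mongodb://localhost:27017")
    db = client["shop"]

    # Related data lives together in one JSON-like document -- no joins needed.
    db.orders.insert_one({
        "order_id": 1001,
        "customer": {"name": "Asha", "city": "Ajmer"},
        "items": [{"sku": "A12", "qty": 2}, {"sku": "B07", "qty": 1}],
    })

    # A dynamic query on a nested field.
    for order in db.orders.find({"customer.city": "Ajmer"}):
        print(order["order_id"], order["items"])

Because the whole order is stored as one document, the read above needs no join across tables, which is the performance point made in the last feature.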

3. RainStor

RainStor, developed by the company of the same name, is a database management
system for managing and analyzing big data. It uses a de-duplication technique to
streamline the storage of large volumes of reference data, eliminating duplicate
files. Additionally, it supports cloud storage and multi-tenancy. The RainStor
database product is available in two editions, Big Data Retention and Big Data
Analytics on Hadoop, which enable highly efficient data management and accelerate
data analysis and queries.

Companies using RainStor: Barclays, Reimagine Strategy, Credit Suisse, etc.

Key features:

 With RainStor, large enterprises can manage and analyze Big Data at the
lowest total cost.

 The enterprise database is built on Hadoop to support faster analytics.

 It allows you to run faster queries and analyses using both SQL queries and
MapReduce, leading to 10-100x faster results.

 RainStor provides the highest compression level. Data is compressed up to


40x (97.5 percent) or more compared to raw data and requires no re-inflation
when accessed.

4. Cassandra

Cassandra is an open-source, distributed NoSQL database that enables the in-depth


analysis of multiple sets of real-time data. It enables high scalability and availability
without compromising performance. To interact with the database, it uses CQL
(Cassandra Query Language). With scalability and fault tolerance on cloud
infrastructure or commodity hardware, it is an ideal platform for mission-critical
data processing. As a major Big Data tool, it accommodates all types of data formats,
including structured, semi-structured, and unstructured.

Companies using Cassandra: Facebook, GoDaddy, Netflix, GitHub, Rackspace,


Cisco, Hulu, eBay, etc.

Key Features:

 Cassandra’s decentralized architecture prevents single points of failure within


a cluster.

 Data is replicated across nodes and data centers, which makes Cassandra
suitable for enterprise applications that cannot afford data loss, even when an
entire data center fails.

 ACID (Atomicity, Consistency, Isolation, and Durability) are all supported by


Cassandra.

 It allows Hadoop integration with MapReduce. It also supports Apache Hive &
Apache Pig.

 Due to its scalability, Cassandra can be scaled up to accommodate more


customers and more data as required.

5. Hunk

Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters
using virtual indexes. This lets us use the Splunk Search Processing Language (SPL) to
analyze data. Also, Hunk allows us to report and visualize vast amounts of data from
Hadoop and NoSQL data sources.

Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming
language.

Data Mining

Data mining is the process of extracting useful information from raw data and
analyzing it. In many cases, raw data is very large, highly variable, and constantly
streaming at speeds that make data extraction nearly impossible without a special
technique. Among the most widely used big data technologies for data mining are:

6. Presto

Developed by Facebook, Presto is an open-source SQL query engine that enables


interactive query analysis on massive amounts of data. This distributed query
engine supports fast analytic queries on data sources of various sizes, from
gigabytes to petabytes. With this technology, it is possible to query data right where
it lives, without moving the data into separate analytics systems. It is possible even to
query data from multiple sources within a single query. It supports both relational
data sources (such as PostgreSQL, MySQL, Microsoft SQL Server, Amazon Redshift,
Teradata, etc.) and non-relational data sources (such as HDFS (Hadoop Distributed
File System), MongoDB, Cassandra, HBase, Amazon S3, etc.).

Companies using Presto: Repro, Netflix, Facebook, Airbnb, GrubHub, Nordstrom,


NASDAQ, Atlassian, etc.

Key Features:

 With Presto, you can query data wherever it resides, whether it is in Cassandra,
Hive, Relational databases, or even proprietary data stores.

 With Presto, multiple data sources can be queried at once. This allows you to
reference data from multiple databases in one query.

 It does not rely on MapReduce techniques and can retrieve data very quickly,
typically returning query responses within a few seconds to minutes.

 Presto supports standard ANSI SQL, making it easy to use. The ability to query
your data without learning a dedicated language is always a big plus, whether
you’re a developer or a data analyst. Additionally, it connects easily to the
most common BI (Business Intelligence) tools with JDBC (Java Database
Connectivity) connectors.
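
As a rough sketch of what querying Presto from Python can look like, the snippet below assumes the presto-python-client package and a Presto coordinator at an invented address; the host, catalog, schema, and table names are placeholders, not real endpoints.

    import prestodb  # presto-python-client (assumed to be installed)

    # Connect to a Presto coordinator (placeholder connection details).
    conn = prestodb.dbapi.connect(
        host="presto.example.internal",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="web",
    )
    cur = conn.cursor()

    # Standard ANSI SQL, run where the data lives -- no ETL into a separate system.
    cur.execute(
        "SELECT country, count(*) AS visits "
        "FROM page_views GROUP BY country ORDER BY visits DESC LIMIT 10"
    )
    for country, visits in cur.fetchall():
        print(country, visits)

The same query could reference tables from different catalogs (for example Hive and a relational source) in a single statement, which is Presto's federated-query selling point.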

7. RapidMiner

RapidMiner is an advanced open-source data mining tool for predictive analytics. It’s
a powerful data science platform that lets data scientists and big data analysts
analyze their data quickly. In addition to data mining, it enables model deployment
and model operation. With this solution, you will have access to all the machine
learning and data preparation capabilities you need to make an impact on your
business operations. By providing a unified environment for data preparation,
machine learning, deep learning, text mining, and predictive analytics, it aims to
enhance productivity for enterprise users of every skill level.

Companies using RapidMiner: Domino’s Pizza, McKinley Marketing Partners,


Windstream Communications, George Mason University, etc.

Key Features:

 There is an integrated platform for processing data, building machine learning


models, and deploying them.

 Further, it integrates the Hadoop framework with its inbuilt RapidMiner


Radoop.

 RapidMiner Studio provides access, loading, and analysis of any type of data,
whether it is structured data or unstructured data such as text, images, and
media.

 Automated predictive modeling is available in RapidMiner.

8. ElasticSearch

Built on Apache Lucene, Elasticsearch is an open-source, distributed, modern search


and analytics engine that allows you to search, index, and analyze data of all types.
Some of its most common use cases include log analytics, security intelligence,
operational intelligence, full-text search, and business analytics. Unstructured data
from various sources is retrieved and stored in a format that is highly optimized for
language-based searches. Users can easily search and explore a large volume of data
at a very fast speed. DB-Engines ranks Elasticsearch as the top enterprise search
engine.

Companies using ElasticSearch: Netflix, Facebook, Uber, Shopify, LinkedIn,


StackOverflow, GitHub, Instacart, etc.

Key Features:

 Using ElasticSearch, you can store and analyze structured and unstructured
data up to petabytes.

 By providing simple RESTful APIs and schema-free JSON documents,


Elasticsearch makes it easy to search, index, and query data.

 Moreover, it provides near real-time search, scalable search, and multitenancy


capabilities.

 Elasticsearch is written in Java, which makes it compatible with nearly every


platform.

 As a language-agnostic open-source application, Elasticsearch makes it easy
to extend its functionality with plugins and integrations.

 Several management tools, UIs (User Interfaces), and APIs (Application


Programming Interfaces) are provided for full control over data, cluster
operations, users, etc.
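
As a small sketch of the RESTful, schema-free workflow described above, the snippet below uses the official Python client against a hypothetical local cluster; the index name and document fields are invented for the example, and the exact keyword arguments can differ between client versions.

    from elasticsearch import Elasticsearch

    # Connect to a local Elasticsearch node (placeholder URL for this sketch).
    es = Elasticsearch("http://localhost:9200")

    # Index a schema-free JSON document.
    es.index(index="app-logs", document={
        "service": "checkout",
        "level": "ERROR",
        "message": "payment gateway timeout",
    })

    # Full-text search across the indexed documents.
    result = es.search(index="app-logs", query={"match": {"message": "timeout"}})
    for hit in result["hits"]["hits"]:
        print(hit["_source"]["service"], hit["_source"]["message"])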

Data Analytics

Big data analytics involves cleaning, transforming, and modeling data in order to
extract essential information that will aid in the decision-making process. You can
extract valuable insights from raw data by using data analytic techniques. Among the
information that big data analytics tools can provide are hidden patterns,
correlations, customer preferences, and statistical information about the market.
Listed below are a few types of data analysis technologies you should be familiar
with.

9. Apache Kafka

Apache Kafka is a popular open-source event store and streaming platform


developed by the Apache Software Foundation in Java and Scala. The platform is
used by thousands of organizations for streaming analytics, high-performance data
pipelines, data integration, and mission-critical applications. It is a fault-tolerant
messaging system based on a publish-subscriber model that can handle massive
data volumes. For real-time streaming data analysis, Apache Kafka can be integrated
seamlessly with Apache Storm and Apache Spark. Basically, Kafka is a system for
collecting, storing, reading, and analyzing streaming data at scale.

Companies using Kafka: Netflix, Goldman Sachs, Shopify, Target, Cisco, Spotify,
Intuit, Uber, etc.

Key Features:
 With Apache Kafka, scalability can be achieved in four dimensions: event
processors, event producers, event consumers, and event connectors. This
means that Kafka scales effortlessly without any downtime.

 Kafka is very reliable due to its distributed architecture, partitioning,


replicating, and fault-tolerance.

 You can publish and subscribe to messages at high throughput.

 The system guarantees zero downtime and no data loss.
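
To illustrate the publish-subscribe model in practice, here is a minimal sketch using the kafka-python package against a hypothetical local broker; the topic name and message payload are assumptions made for the example.

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Publish click events to a topic (placeholder broker address).
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("click-events", {"user_id": 42, "page": "/pricing"})
    producer.flush()

    # A separate process would consume and analyze the stream in near real time.
    consumer = KafkaConsumer(
        "click-events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:          # loops until interrupted
        print(message.value)          # e.g. {'user_id': 42, 'page': '/pricing'}

In a streaming-analytics setup, the consumer side would typically be an Apache Spark or Apache Storm job rather than a plain Python loop.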

10. Splunk

Splunk is a scalable, advanced software platform that searches, analyzes, and


visualizes machine-generated data from websites, applications, sensors, devices, etc.,
in order to provide metrics, diagnose problems, and gain insight into business
operations. In Splunk, real-time data is captured, indexed, and correlated into a
searchable repository, which can be used to generate Reports, Alerts, Graphs,
Dashboards, and Visualizations. In addition to application management, security, and
compliance, Splunk also provides web analytics and business intelligence. The advent
of big data makes Splunk capable of ingesting big data from a variety of sources,
which may or may not include machine data, and performing analytics on it.

Companies using Splunk: JPMorgan Chase, Lenovo, Wells Fargo, Verizon,


BookMyShow, John Lewis, Domino’s, Porsche, etc.

Key Features:

 Improve the performance of your business with automated operations,


advanced analytics, and end-to-end integrations.

 In addition to structured data formats like JSON and XML, Splunk can ingest
unstructured machine data like web and application logs.

 Splunk indexes the ingested data to enable faster search and querying based
on different conditions.

 Splunk provides analytical reports including interactive graphs, charts, and


tables, as well as allows sharing them with other people.

11. KNIME

KNIME (Konstanz Information Miner) is a free, open-source platform for analytics,


reporting, and integration of large sets of data. In addition to being intuitive and
open, KNIME actively incorporates new ideas and developments to make
understanding data and developing data science workflows and reusable
components as easy and accessible as possible. KNIME allows users to visually create
and design data flows (or pipelines), execute analysis steps selectively, and analyze
the results and models later using interactive views and widgets. As part of the core
version, there are hundreds of modules for integration, data transformations (such as
filters, converters, splitters, combiners, and joiners), as well as methods used for
analytics, statistics, data mining, and text analytics.

Companies using KNIME: Fiserv, Opplane, Procter & Gamble, Eaton Corporation,
etc.

Key Features:

 Additional Plugins are added via its Extension mechanism in order to extend
functionality.

 Furthermore, additional plugins provide integration of methods for image


mining, text mining, time-series analysis, and network analysis.

 The KNIME workflows can serve as data sets for creating report templates that
can be exported to a variety of file formats, including doc, pdf, ppt, xls, etc.

 Additionally, KNIME integrates a variety of open-source projects such as
machine learning algorithms from Spark, Weka, Keras, LIBSVM, and R projects;
as well as ImageJ, JFreeChart, and the Chemistry Development Kit.

 You can perform simple ETL operations with it.

12. Apache Spark

The most important and most awaited technology is now in sight – Apache Spark. It
is an open-source analytics engine that supports big data processing. This platform
features In-Memory Computing (IMC) for performing fast queries against data of any
size; a generalized Execution Model (GEM) that supports a wide range of
applications, as well as Java, Python, and Scala APIs for ease of development. These
APIs make it possible to hide the complexity of distributed processing behind simple,
high-level operators. Spark was introduced by the Apache Software Foundation to
speed up Hadoop computation.

Companies using Spark: Amazon, Oracle, Cisco, Netflix, Yahoo, eBay, Hortonworks,
etc.

Key Features:

 The Spark platform enables the execution of programs 100 times faster on
memory than Hadoop MapReduce or 10 times faster on disk.

 With Apache Spark, you can run an array of workloads including machine
learning, real-time analytics, interactive queries, and graph processing.

 Spark has convenient development interfaces (APIs) available in Java, Scala,


Python, and R for working with large datasets.

 A number of higher-level libraries are included with Spark, such as support for
SQL queries, machine learning, streaming data, and graph processing.
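
Below is a minimal PySpark sketch of the high-level APIs mentioned above; the input path and column names are invented placeholders rather than a real dataset.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start (or reuse) a Spark session.
    spark = SparkSession.builder.appName("sales-summary").getOrCreate()

    # Read a CSV file into a distributed DataFrame (placeholder path and columns).
    sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

    # Declarative, SQL-like transformations are optimized and executed in parallel.
    summary = (
        sales.groupBy("region")
             .agg(F.sum("amount").alias("total_amount"))
             .orderBy(F.desc("total_amount"))
    )
    summary.show(10)

    spark.stop()

The same DataFrame API scales from a laptop to a multi-node cluster without changes to the code, which is a large part of Spark's appeal.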

13. R-Language:
R is a programming language mainly used for statistical computing and graphics. It
is a free software environment used by leading data miners, practitioners, and
statisticians, and it is primarily beneficial for developing statistical software and
performing data analytics.

R 1.0 was released in February 2000 by the R Foundation. The language itself is
implemented mainly in C, Fortran, and R.


Companies like Barclays, American Express, and Bank of America use R-Language for
their data analytics needs.

14. Blockchain:

Blockchain is a technology that can be used in several applications related to


different industries, such as finance, supply chain, manufacturing, etc. It is primarily
used in processing operations like payments and escrow. This helps in reducing the
risks of fraud. Besides, it enhances the overall transaction processing speed,
increases financial privacy, and internationalizes markets. Additionally, it is also
used to fulfill the needs of shared ledger, smart contract, privacy, and consensus in
any Business Network Environment.

Blockchain technology was first described in 1991 by two researchers, Stuart Haber
and W. Scott Stornetta. However, blockchain saw its first real-world application in
January 2009, when Bitcoin was launched. A blockchain is a specific type of
distributed database, with implementations written in languages such as C++,
Python, and JavaScript. ORACLE, Facebook, and MetLife are a few of the top
companies using Blockchain technology.

Data Visualization

Data visualization is a way of visualizing data through a graphic representation. Data


visualization techniques utilize visual elements such as graphs, charts, and maps to
provide an easy way of viewing and interpreting trends, patterns, and outliers in data.
Data is processed to create graphic illustrations that enable people to grasp large
amounts of information in seconds. Below are a few top technologies for data
visualization.

15. Tableau

In the business intelligence and analytics industry, Tableau is the fastest growing tool
for Data Visualization. It makes it easy for users to create graphs, charts, maps, and
dashboards, for visualizing and analyzing data, thus aiding them in driving the
business forward. Using this platform, data is rapidly analyzed, resulting in interactive
dashboards and worksheets that display the results. With Tableau, users are able to
work on live datasets, obtaining valuable insights and enhancing decision-making.
You don’t need any programming knowledge to get started; even those without
relevant experience can create visualizations with Tableau right away.

Companies using Tableau: Accenture, Myntra, Nike, Skype, Coca-Cola, Wells Fargo,
Citigroup, Qlik, etc

Key Features:

 In Tableau, a user can easily create visualizations in the form of Bar charts, Pie
charts, Histograms, Treemaps, Box plots, Gantt charts, Bullet charts, and other
tools.

 Tableau supports a wide array of data sources, including on-premise files, CSV
and text files, Excel spreadsheets, relational and non-relational databases, cloud
data, and big data.

 Some of Tableau’s significant features include data blending and real-time


analytics.

 It allows real-time sharing of data in the form of dashboards, sheets, etc.


16. Plotly

Plotly is a Python library that facilitates interactive visualizations of big data. This tool
makes it possible to create superior graphs more quickly and efficiently. Plotly has
many advantages, including user-friendliness, scalability, reduced costs, cutting-edge
analytics, and flexibility. It offers a much richer set of libraries and APIs, including
Python, R, MATLAB, Arduino, Julia, etc. It can be used interactively within Jupyter
notebooks and Pycharm in order to create interactive graphs. With Plotly, we can
include interactive features such as buttons, sliders, and dropdowns to display
different perspectives on a graph.

Companies using Plotly: Paladins, Bitbank, etc.

Key Features:

 A unique feature of Plotly is its interactivity. Users can interact with graphs on
display, providing an enhanced storytelling experience.

 It’s like drawing on paper: you can draw anything you want. When compared
with other visualization tools like Tableau, Plotly enables full control over what
is being plotted.

 In addition to Seaborn and Matplotlib charts, Plotly also offers a wide range
of graphs and charts, such as statistical charts, scientific charts, financial
charts, geographical maps, and so forth.

 Furthermore, Plotly offers a broad range of AI and ML charts, which allow you
to step up your machine learning game.
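
Here is a minimal Plotly sketch of the kind of interactive chart described above; the tiny dataset is invented and stands in for an aggregate produced by a big data pipeline.

    import pandas as pd
    import plotly.express as px

    # A tiny invented aggregate standing in for big data query results.
    df = pd.DataFrame({
        "month": ["Jan", "Feb", "Mar", "Apr"],
        "visits": [1200, 1850, 1640, 2100],
    })

    # An interactive line chart: hover, zoom, and pan come for free.
    fig = px.line(df, x="month", y="visits", title="Monthly site visits")
    fig.show()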

Emerging Big Data Technologies


Apart from the above mentioned big data technologies, there are several other
emerging big data technologies. The following are some essential technologies
among them:

 TensorFlow: TensorFlow combines comprehensive libraries, flexible
ecosystem tools, and community resources that help researchers implement
the state of the art in Machine Learning. This ultimately allows developers
to build and deploy machine learning-powered applications in specific
environments (a minimal sketch follows this list).
TensorFlow was first released in 2015 by the Google Brain Team. It is mainly
based on C++, CUDA, and Python. Companies like Google, eBay, Intel, and Airbnb
are using this technology for their business requirements.

 Beam: Apache Beam consists of a portable API layer that helps build and
maintain sophisticated parallel-data processing pipelines. Apart from this, it
also allows the execution of built pipelines across a diversity of execution
engines or runners.
Apache Beam was introduced in June 2016 by the Apache Software
Foundation. It is written in Python and Java. Some leading companies like
Amazon, ORACLE, Cisco, and VerizonWireless are using this technology.

 Docker: Docker is a tool purpose-built to make it easier to create, deploy,
and run applications by using containers. Containers help developers pack up
applications properly, including all the required components such as libraries
and dependencies. Typically, containers bind all components and ship them
together as a single package.
Docker was introduced in March 2013 by Docker Inc. It is based on the Go
language. Companies like Business Insider, Quora, PayPal, and Splunk are
using this technology.

 Airflow: Airflow is a workflow automation and scheduling system. This
technology is mainly used to build, control, and maintain data pipelines. It
contains workflows designed using the DAG (Directed Acyclic Graph)
mechanism and consisting of different tasks. Developers can also define
workflows as code, which helps with testing, maintenance, and versioning.
Airflow was created at Airbnb in 2014 and became an Apache Software
Foundation top-level project in 2019. It is based on the Python language.
Companies like Checkr and Airbnb are using this leading technology.

 Kubernetes: Kubernetes is defined as a vendor-agnostic cluster and container


management tool made open-source in 2014 by Google. It provides a
platform for automation, deployment, scaling, and application container
operations in the host clusters.

Kubernetes 1.0 was released in July 2015, when Google donated the project to
the newly formed Cloud Native Computing Foundation. It is written in the Go
language. Companies like American Express,
Pear Deck, PeopleSource, and Northwestern Mutual are making good use of
this technology.
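
As the TensorFlow entry above mentions, here is a minimal Keras sketch; the toy data, layer sizes, and training settings are invented purely for illustration.

    import numpy as np
    import tensorflow as tf

    # Toy data: 1,000 samples with 20 features and a binary label.
    x = np.random.rand(1000, 20).astype("float32")
    y = (x.sum(axis=1) > 10).astype("float32")

    # A small feed-forward network built with the Keras API.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x, y, epochs=5, batch_size=32, verbose=0)

    print("training accuracy:", model.evaluate(x, y, verbose=0)[1])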

Introduction to HADOOP
Introduction
Hadoop is an open-source software framework that is used for storing and
processing large amounts of data in a distributed computing environment. It is
designed to handle big data and is based on the MapReduce programming model,
which allows for the parallel processing of large datasets.

What is Hadoop?
Hadoop is an open source software programming framework for storing a large
amount of data and performing the computation. Its framework is based on Java
programming with some native code in C and shell scripts.


Hadoop defined
Hadoop is an open source framework based on Java that manages the storage and
processing of large amounts of data for applications. Hadoop uses distributed
storage and parallel processing to handle big data and analytics jobs, breaking
workloads down into smaller workloads that can be run at the same time.

Four modules comprise the primary Hadoop framework and work collectively to form
the Hadoop ecosystem:

1. Hadoop Distributed File System (HDFS): As the primary component of


the Hadoop ecosystem, HDFS is a distributed file system in which
individual Hadoop nodes operate on data that resides in their local
storage. This removes network latency, providing high-throughput access
to application data. In addition, administrators don’t need to define
schemas up front.

2. Yet Another Resource Negotiator (YARN): YARN is a resource-


management platform responsible for managing compute resources in
clusters and using them to schedule users’ applications. It performs
scheduling and resource allocation across the Hadoop system.

3. MapReduce: MapReduce is a programming model for large-scale data


processing. In the MapReduce model, subsets of larger datasets and
instructions for processing the subsets are dispatched to multiple different
nodes, where each subset is processed by a node in parallel with other
processing jobs. After processing the results, individual subsets are
combined into a smaller, more manageable dataset.

4. Hadoop Common: Hadoop Common includes the libraries and utilities


used and shared by other Hadoop modules.

Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem
continues to grow and includes many tools and applications to help collect, store,
process, analyze, and manage big data. These include Apache Pig, Apache Hive,
Apache HBase, Apache Spark, Presto, and Apache Zeppelin.

What is Hadoop programming?


In the Hadoop framework, code is mostly written in Java but some native code is
based in C. Additionally, command-line utilities are typically written as shell scripts.
For Hadoop MapReduce, Java is most commonly used but through a module like
Hadoop streaming, users can use the programming language of their choice to
implement the map and reduce functions.
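
As a rough sketch of what Hadoop Streaming makes possible, the two small Python scripts below implement the classic word-count map and reduce functions. They simply read from standard input and write to standard output, which is how the streaming module wires user code into MapReduce; the file names and the exact job-submission command depend on your cluster.

    # --- mapper.py: emit "word<TAB>1" for every word on stdin ---
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # --- reducer.py: sum the counts for each word ---
    # (Hadoop sorts mapper output by key before it reaches the reducer.)
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

A job like this is typically submitted with the hadoop-streaming jar, passing the two scripts as the -mapper and -reducer options along with HDFS input and output paths.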

What is a Hadoop database?


Hadoop isn't a solution for data storage or relational databases. Instead, its purpose
as an open-source framework is to process large amounts of data simultaneously in
real-time.

Data is stored in the HDFS, however, this is considered unstructured and does not
qualify as a relational database. In fact, with Hadoop, data can be stored in an
unstructured, semi-structured, or structured form. This allows for greater flexibility for
companies to process big data in ways that meet their business needs and beyond.

What type of database is Hadoop?


Technically, Hadoop is not in itself a type of database such as SQL or RDBMS.
Instead, the Hadoop framework gives users a processing solution to a wide range of
database types.

Hadoop is a software ecosystem that allows businesses to handle huge amounts of


data in short amounts of time. This is accomplished by facilitating the use of parallel
computer processing on a massive scale. Various databases such as Apache HBase
can be dispersed amongst data node clusters contained on hundreds or thousands
of commodity servers.

What's the impact of Hadoop?


Hadoop was a major development in the big data space. In fact, it's credited with
being the foundation for the modern cloud data lake. Hadoop democratized
computing power and made it possible for companies to analyze and query big data
sets in a scalable manner using free, open source software and inexpensive, off-the-
shelf hardware.

This was a significant development because it offered a viable alternative to the


proprietary data warehouse (DW) solutions and closed data formats that had - until
then - ruled the day.

With the introduction of Hadoop, organizations quickly had access to the ability to
store and process huge amounts of data, increased computing power, fault
tolerance, flexibility in data management, lower costs compared to DWs, and greater
scalability. Ultimately, Hadoop paved the way for future developments in big data
analytics, like the introduction of Apache Spark.

What is Hadoop used for?


When it comes to Hadoop, the possible use cases are almost endless.

1. Retail

Large organizations have more customer data available on hand than ever
before. But often, it's difficult to make connections between large amounts of
seemingly unrelated data. When British retailer M&S deployed the Hadoop-
powered Cloudera Enterprise, they were more than impressed with the results.

Cloudera uses Hadoop-based support and services for the managing and
processing of data. Shortly after implementing the cloud-based platform,
M&S found they were able to successfully leverage their data for much
improved predictive analytics.

This led to more efficient warehouse use, prevented stock-outs during
"unexpected" peaks in demand, and gave them a huge advantage over the
competition.

2. Finance

Hadoop is perhaps more suited to the finance sector than any other. Early on,
the software framework was quickly pegged for primary use in handling the
advanced algorithms involved with risk modeling. It's exactly the type of risk
management that could help avoid the credit swap disaster that led to the
2008 recession.

Banks have realized that this same logic also applies to managing risk for
customer portfolios. Today, it's common for financial institutions to implement
Hadoop to better manage the financial security and performance of their
client's assets. JPMorgan Chase is just one of many industry giants that use
Hadoop to manage exponentially increasing amounts of customer data from
across the globe.

3. Healthcare

Whether nationalized or privatized, healthcare providers of any size deal with


huge volumes of data and customer information. Hadoop frameworks allow
for doctors, nurses and carers to have easy access to the information they
need when they need it and it also makes it easy to aggregate data that
provides actionable insights. This can apply to matters of public health, better
diagnostics, improved treatments and more.

Academic and research institutions can also leverage a Hadoop framework to


boost their efforts. Take for instance, the field of genetic disease which
includes cancer. We have the human genome mapped out and there are
nearly three billion base pairs in total. In theory, everything to cure an army of
diseases is now right in front of our faces.

But to identify complex relationships, systems like Hadoop will be necessary to


process such a large amount of information.

4. Security and law enforcement

Hadoop can help improve the effectiveness of national and local security, too.
When it comes to solving related crimes spread across multiple regions, a
Hadoop framework can streamline the process for law enforcement by
connecting two seemingly isolated events. By cutting down on the time to
make case connections, agencies will be able to put out alerts to other
agencies and the public as quickly as possible.

In 2013, The National Security Agency (NSA) concluded that the open-source
Hadoop software was superior to the expensive alternatives they'd been
implementing. They now use the framework to aid in the detection of
terrorism, cybercrime and other threats.

How does Hadoop work?


Hadoop allows for the distribution of datasets across a cluster of commodity
hardware. Processing is performed in parallel on multiple servers simultaneously.

Software clients input data into Hadoop. HDFS handles metadata and the distributed
file system. MapReduce then processes and converts the data. Finally, YARN divides
the jobs across the computing cluster.

All Hadoop modules are designed with a fundamental assumption that hardware
failures of individual machines or racks of machines are common and should be
automatically handled in software by the framework.

Hadoop query example


Here are a few examples of how to query Hadoop:

 Apache Hive

Apache Hive was the early go-to solution for how to query SQL with Hadoop.
This module emulates the behavior, syntax and interface of MySQL for
programming simplicity. It's a great option if you already heavily use Java
applications as it comes with a built-in Java API and JDBC drivers. Hive offers a
quick and straightforward solution for developers, but it's also quite limited, as
the software is rather slow and largely read-only (a minimal Python sketch of
querying Hive follows this list).

 IBM BigSQL

This offering from IBM is a high-performance massively parallel processing


(MPP) SQL engine for Hadoop. Its query solution caters to enterprises that
need ease of use in a stable and secure environment. In addition to accessing HDFS
data, it can also pull from RDBMS, NoSQL databases, WebHDFS and other
sources of data.
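
For illustration, here is a minimal sketch of running a HiveQL query from Python, as mentioned in the Apache Hive entry above. It assumes the PyHive package and a HiveServer2 endpoint at an invented address, so the host, database, and table names are placeholders.

    from pyhive import hive  # PyHive (assumed to be installed)

    # Connect to a HiveServer2 instance (placeholder host and database).
    conn = hive.Connection(host="hive.example.internal", port=10000, database="weblogs")
    cur = conn.cursor()

    # HiveQL looks like SQL but is compiled into jobs that run on the cluster.
    cur.execute(
        "SELECT status_code, COUNT(*) AS hits "
        "FROM access_logs GROUP BY status_code ORDER BY hits DESC LIMIT 5"
    )
    for status_code, hits in cur.fetchall():
        print(status_code, hits)

The same query could be issued over JDBC from a Java application, which is the route the Hive description above refers to.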

What is the Hadoop ecosystem?


The term Hadoop is a general name that may refer to any of the following:

 The overall Hadoop ecosystem, which encompasses both the core modules
and related sub-modules.

 The core Hadoop modules, including Hadoop Distributed File System (HDFS),
Yet another Resource Negotiator (YARN), MapReduce, and Hadoop Common
(discussed below). These are the basic building blocks of a typical Hadoop
deployment.

 Hadoop-related sub-modules, including Apache Hive, Apache Impala, Apache Pig,
Apache ZooKeeper, and Apache Flume, among others. These
related pieces of software can be used to customize, improve upon, or extend
the functionality of core Hadoop.

What are the benefits of Hadoop?


 Scalability

Hadoop is important as one of the primary tools to store and process huge
amounts of data quickly. It does this by using a distributed computing model
which enables the fast processing of data that can be rapidly scaled by adding
computing nodes.

 Low cost

As an open source framework that can run on commodity hardware and has a
large ecosystem of tools, Hadoop is a low-cost option for the storage and
management of big data.

 Flexibility

Hadoop allows for flexibility in data storage, as data does not require
preprocessing before it is stored. This means that an organization can store as
much data as it likes and utilize it later.

 Resilience

As a distributed computing model, Hadoop allows for fault tolerance and


system resilience, meaning that if one of the hardware nodes fails, jobs are
redirected to other nodes. Data stored on one node of a Hadoop cluster is replicated
across other nodes within the cluster to fortify against the possibility of
hardware or software failure.

What are the challenges of Hadoop?


 MapReduce complexity and limitations

As a file-intensive system, MapReduce can be a difficult tool to utilize for


complex jobs, such as interactive analytical tasks. MapReduce functions also
need to be written in Java and can require a steep learning curve. The
MapReduce ecosystem is quite large, with many components for different
functions that can make it difficult to determine what tools to use.

 Security

Data sensitivity and protection can be issues as Hadoop handles such large
datasets. An ecosystem of tools for authentication, encryption, auditing, and
provisioning has emerged to help developers secure data in Hadoop.

 Governance and management

Hadoop does not have many robust tools for data management and
governance, nor for data quality and standardization.

 Talent gap

Like many areas of programming, Hadoop has an acknowledged talent gap.


Finding developers with the combined requisite skills in Java to program
MapReduce, operating systems, and hardware can be difficult. In addition,
MapReduce has a steep learning curve, making it hard to get new
programmers up to speed on its best practices and ecosystem.

Why is Hadoop important?


Research firm IDC estimated that 62.4 zettabytes of data were created or replicated
in 2020, driven by the Internet of Things, social media, edge computing, and data
created in the cloud. The firm forecasted that data growth from 2020 to 2025 was
expected at 23% per year. While not all that data is saved (it is either deleted after
consumption or overwritten), the data needs of the world continue to grow.

Hadoop tools
Hadoop has a large ecosystem of open source tools that can augment and extend
the capabilities of the core module. Some of the main software tools used with
Hadoop include:

 Apache Hive: A data warehouse that allows programmers to work with data
in HDFS using a query language called HiveQL, which is similar to SQL

 Apache HBase: An open source non-relational distributed database often


paired with Hadoop

 Apache Pig: A tool used as an abstraction layer over MapReduce to analyze


large sets of data and enables functions like filter, sort, load, and join

 Apache Impala: Open source, massively parallel processing SQL query engine
often used with Hadoop

 Apache Sqoop: A command-line interface application for efficiently
transferring bulk data between relational databases and Hadoop

 Apache ZooKeeper: An open source server that enables reliable distributed


coordination in Hadoop; a service for, "maintaining configuration information,
naming, providing distributed synchronization, and providing group services"

 Apache Oozie: A workflow scheduler for Hadoop jobs

What is Apache Hadoop used for?


Here are some common uses cases for Apache Hadoop:

Analytics and big data

A wide variety of companies and organizations use Hadoop for research, production
data processing, and analytics that require processing terabytes or petabytes of big
data, storing diverse datasets, and data parallel processing.

Data storage and archiving

As Hadoop enables mass storage on commodity hardware, it is useful as a low-cost


storage option for all kinds of data, such as transactions, click streams, or sensor and
machine data.

Data lakes

Since Hadoop can help store data without preprocessing, it can be used to
complement data lakes, where large amounts of unrefined data are stored.

Marketing analytics

Marketing departments often use Hadoop to store and analyze customer


relationship management (CRM) data.

Risk management

Banks, insurance companies, and other financial services companies use Hadoop to
build risk analysis and management models.

AI and machine learning

Hadoop ecosystems help with the processing of data and model training operations
for machine learning applications.

Related products and services


Companies often choose to run Hadoop clusters on public or hybrid cloud resources
versus on-premises hardware to gain flexibility, availability, and cost control. Many
cloud solution providers offer fully managed services for Hadoop. With this kind of
prepackaged service for cloud-first Hadoop, operations that used to take hours or
days can be completed in seconds or minutes, with companies paying only for the
resources used.

On Google Cloud, Dataproc is a fast, easy-to-use, and fully-managed cloud service


for running Apache Spark and Apache Hadoop clusters in a simpler, integrated, and more
cost-effective way. It fully integrates with other Google Cloud services that meet
critical security, governance, and support needs, allowing you to gain a complete and
powerful platform for data processing, analytics, and machine learning.

Big data analytics tools from Google Cloud—such as Dataproc, BigQuery, Vertex AI
Workbench, and Dataflow—can enable you to build context-rich applications, build
new analytics solutions, and turn data into actionable insights.

Dataproc

Dataproc makes open source data and analytics processing fast, easy, and more
secure in the cloud.

BigQuery

Serverless, highly scalable, and cost-effective cloud data warehouse designed for
business agility.

Notebooks

An enterprise notebook service to get your projects up and running in minutes.

Vertex AI Workbench

A single development environment for the entire data science workflow, from data to
training at scale, designed to help you build and train models up to 5X faster.

Dataflow

Unified stream and batch data processing that’s serverless, fast, and cost-effective

Data lake modernization

Google Cloud’s data lake powers any analysis on any type of data. This empowers
your teams to securely and cost-effectively ingest, store, and analyze large volumes
of diverse, full-fidelity data.

Smart analytics

Google Cloud’s fully managed serverless analytics platform empowers your business
while eliminating constraints of scale, performance, and cost.

Features of Hadoop:
1. It is fault tolerant.

2. It is highly available.

3. Its programming model is easy.

4. It has huge, flexible storage.

5. It is low cost.

Hadoop has several key features that make it well-suited for big
data processing:
 Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.

 Scalability: Hadoop can scale from a single server to thousands of machines,
making it easy to add more capacity as needed.

 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it
can continue to operate even in the presence of hardware failures.

 Data locality: Hadoop provides a data locality feature, where data is stored
on the same node where it will be processed; this helps to reduce
network traffic and improve performance.

 High Availability: Hadoop provides a High Availability feature, which helps to
make sure that the data is always available and is not lost.

 Flexible Data Processing: Hadoop's MapReduce programming model allows
for the processing of data in a distributed fashion, making it easy to
implement a wide variety of data processing tasks (a small word-count sketch
in this style follows this list).

 Data Integrity: Hadoop provides a built-in checksum feature, which helps to
ensure that the data stored is consistent and correct.

 Data Replication: Hadoop provides a data replication feature, which
replicates data across the cluster for fault tolerance.

 Data Compression: Hadoop provides a built-in data compression feature,
which helps to reduce storage space and improve performance.

 YARN: A resource management platform that allows multiple data processing
engines, such as real-time streaming, batch processing, and interactive SQL, to run
and process data stored in HDFS.
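
To make the MapReduce model concrete, here is a minimal word-count sketch written
for Hadoop Streaming, which lets plain Python scripts act as the map and reduce
phases. The file names are illustrative assumptions, not part of any specific
deployment.

    # mapper.py -- map phase: read raw text from stdin, emit "word<TAB>1" pairs.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- reduce phase: input arrives sorted by key, so counts for the
    # same word are adjacent and can be summed in a single pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

In an actual Hadoop Streaming job these would be two separate files passed with the
-mapper and -reducer options of the streaming JAR; Hadoop handles splitting the
input and shuffling and sorting the data between the two phases.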

Which companies use Hadoop?


Hadoop adoption is becoming the standard for successful multinational companies
and enterprises. The following is a list of companies that utilize Hadoop today:

 Adobe - the software and services provider uses Apache Hadoop and HBase
for data storage and other services.

 eBay - uses the framework for search engine optimization and research.

 A9 - a subsidiary of Amazon that is responsible for technologies related to
search engines and search-related advertising.

 LinkedIn - as one of the most popular social and professional networking
sites, the company uses many Apache modules including Hadoop, Hive, Kafka,
Avro, and DataFu.

 Spotify - the Swedish music streaming giant uses the Hadoop framework for
analytics and reporting as well as content generation and listening
recommendations.

 Facebook - the social media giant maintains the largest Hadoop cluster in the
world, with a dataset that reportedly grows by half a petabyte per day.

 InMobi - the mobile marketing platform utilizes HDFS and Apache Pig/MRUnit
for tasks involving analytics, data science, and machine learning.

How much does Hadoop cost?


The Hadoop framework itself is an open-source, Java-based application. This means,
unlike many other big data alternatives, it is free of charge. Of course, the cost of the
required commodity hardware depends on the scale of the deployment.

When it comes to services that implement Hadoop frameworks you will have several
pricing options:

1. Per Node- most common

2. Per TB

3. Freemium product with or without subscription-only tech support

4. All-in-one package deal including all hardware and software

5. Cloud-based service with its own broken down pricing options- can essentially
pay for what you need or pay as you go


Hadoop Distributed File System


Hadoop has a distributed file system known as HDFS, which splits files into blocks and
distributes them across the various nodes of large clusters. In case of a node
failure, the system keeps operating, and data transfer between the nodes is
facilitated by HDFS.

Advantages of HDFS: It is inexpensive, immutable in nature, stores data reliably, is
able to tolerate faults, is scalable and block structured, can process a large amount of
data simultaneously, and more.

Disadvantages of HDFS: Its biggest disadvantage is that it is not a good fit for small
quantities of data. It also has issues related to potential stability and can be restrictive
and rough in nature. Hadoop also supports a wide range of software packages such as
Apache Flume, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache
Storm, Apache Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
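
To illustrate how data actually lands in HDFS, here is a minimal sketch that copies a
local file into HDFS by calling the standard hdfs dfs command-line tool from Python.
It assumes a configured Hadoop client on the machine; all paths and file names are
placeholders.

    # Copy a local file into HDFS and list the target directory (sketch).
    # Assumes a configured Hadoop client; paths below are placeholders.
    import subprocess

    def run(cmd):
        # Print the command and raise an error if it fails.
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo/input"])
    run(["hdfs", "dfs", "-put", "local_data.csv", "/user/demo/input/"])
    run(["hdfs", "dfs", "-ls", "/user/demo/input"])

HDFS then splits the uploaded file into blocks and replicates them across DataNodes
automatically; no extra work is needed from the client.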

Some common frameworks of Hadoop


1. Hive- It uses HiveQL for data structuring and for writing complicated
MapReduce jobs over data in HDFS.

2. Drill- It supports user-defined functions and is used for data exploration.

3. Storm- It allows real-time processing and streaming of data.

4. Spark- It contains a Machine Learning Library (MLlib) for providing enhanced
machine learning and is widely used for data processing. It also supports Java,
Python, and Scala (a minimal PySpark sketch follows this list).

5. Pig- It has Pig Latin, a SQL-like language, and performs data transformation of
unstructured data.

6. Tez- It reduces the complexities of Hive and Pig and helps their code run
faster.
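
As a minimal illustration of the Spark item above, the sketch below reads a CSV file
into a distributed DataFrame and runs a simple aggregation. It assumes the pyspark
package is installed; the input path and column name are placeholders and could just
as well point to a file in HDFS.

    # Minimal PySpark sketch (pyspark assumed installed; path and column are placeholders).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()

    # Read a CSV file into a distributed DataFrame and aggregate it.
    df = spark.read.csv("hdfs:///user/demo/input/local_data.csv", header=True)
    df.groupBy("region").count().show()

    spark.stop()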

Hadoop framework is made up of the following modules:


1. Hadoop MapReduce- a MapReduce programming model for handling and
processing large data.

2. Hadoop Distributed File System- distributes files across the nodes of a cluster.

3. Hadoop YARN- a platform which manages computing resources.

4. Hadoop Common- it contains packages and libraries which are used by the
other modules.

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was
the Google File System paper published by Google.

Let's focus on the history of Hadoop in the following steps: -

 In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache
Nutch. It is an open source web crawler software project.

 While working on Apache Nutch, they were dealing with big data. Storing
that data was very costly, which became a major constraint for the project.
This problem became one of the important reasons for the emergence of Hadoop.

 In 2003, Google introduced a file system known as GFS (Google File System). It
is a proprietary distributed file system developed to provide efficient access to
data.

 In 2004, Google released a white paper on MapReduce. This technique
simplifies data processing on large clusters.

 In 2005, Doug Cutting and Mike Cafarella introduced a new file system known
as NDFS (Nutch Distributed File System). This file system also included Map
Reduce.

 In 2006, Doug Cutting joined Yahoo. On the basis of the
Nutch project, Doug Cutting introduced a new project, Hadoop, with a file
system known as HDFS (Hadoop Distributed File System). Hadoop's first version,
0.1.0, was released that year.

 Doug Cutting named his project Hadoop after his son's toy elephant.

 In 2007, Yahoo ran two clusters of 1,000 machines.

 In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a
900-node cluster, within 209 seconds.

 In 2013, Hadoop 2.2 was released.

 In 2017, Hadoop 3.0 was released.

Year Event

2003 Google released the paper on the Google File System (GFS).

2004 Google released a white paper on MapReduce.

2006 Hadoop introduced; Hadoop 0.1.0 released. Yahoo deploys 300 machines
and reaches 600 machines within the year.

2007 Yahoo runs 2 clusters of 1,000 machines. Hadoop includes HBase.

2008 YARN JIRA opened. Hadoop becomes the fastest system to sort 1 terabyte
of data on a 900-node cluster, within 209 seconds. Yahoo clusters are loaded
with 10 terabytes per day. Cloudera is founded as a Hadoop distributor.

2009 Yahoo runs 17 clusters of 24,000 machines. Hadoop becomes capable of
sorting a petabyte. MapReduce and HDFS become separate subprojects.

2010 Hadoop adds support for Kerberos. Hadoop operates 4,000 nodes with
40 petabytes. Apache Hive and Pig released.

2011 Apache ZooKeeper released. Yahoo has 42,000 Hadoop nodes and hundreds
of petabytes of storage.

2012 Apache Hadoop 1.0 version released.

2013 Apache Hadoop 2.2 version released.

2014 Apache Hadoop 2.6 version released.

2015 Apache Hadoop 2.7 version released.

2017 Apache Hadoop 3.0 version released.

2018 Apache Hadoop 3.1 version released.

Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master
node includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the
slave node includes DataNode and TaskTracker.

Advantages and Disadvantages of Hadoop
Advantages:
 Ability to store a large amount of data.

 High flexibility.

 Cost effective.

 High computational power.

 Tasks are independent.

 Linear scaling.

Hadoop has several advantages that make it a popular choice
for big data processing:
 Scalability: Hadoop can easily scale to handle large amounts of data by
adding more nodes to the cluster.

 Cost-effective: Hadoop is designed to work with commodity hardware, which
makes it a cost-effective option for storing and processing large amounts of
data.

 Fault-tolerance: Hadoop's distributed architecture provides built-in fault-
tolerance, which means that if one node in the cluster goes down, the data
can still be processed by the other nodes.

 Flexibility: Hadoop can process structured, semi-structured, and unstructured
data, which makes it a versatile option for a wide range of big data scenarios.

 Open-source: Hadoop is open-source software, which means that it is free to
use and modify. This also allows developers to access the source code and
make improvements or add new features.

 Large community: Hadoop has a large and active community of developers
and users who contribute to the development of the software, provide
support, and share best practices.

 Integration: Hadoop is designed to work with other big data technologies
such as Spark, Storm, and Flink, which allows for integration with a wide range
of data processing and analysis tools.

Disadvantages:
 Not very effective for small data.

 Hard cluster management.

 Has stability issues.

 Security concerns.

 Complexity: Hadoop can be complex to set up and maintain, especially for
organizations without a dedicated team of experts.

 Latency: Hadoop is not well-suited for low-latency workloads and may not be
the best choice for real-time data processing.

 Limited Support for Real-time Processing: Hadoop's batch-oriented nature
makes it less suited for real-time streaming or interactive data processing use
cases.

 Limited Support for Structured Data: Hadoop is designed to work with
unstructured and semi-structured data, so it is not well-suited for structured
data processing.

 Data Security: Hadoop does not provide built-in security features such as data
encryption or user authentication, which can make it difficult to secure
sensitive data.

 Limited Support for Ad-hoc Queries: Hadoop's MapReduce programming
model is not well-suited for ad-hoc queries, making it difficult to perform
exploratory data analysis.

 Limited Support for Graph and Machine Learning: Hadoop's core components,
HDFS and MapReduce, are not well-suited for graph and machine learning
workloads; specialized components like Apache Giraph and Mahout are
available but have some limitations.

 Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.

 Data Loss: In the event of a hardware failure, the data stored in a single node
may be lost permanently.

 Data Governance: Data governance is a critical aspect of data management,
and Hadoop does not provide built-in features to manage data lineage, data
quality, data cataloging, and data audit.

Open source technologies


Open-source technologies refer to software or technologies that are distributed with
a license granting anyone the right to use, modify, and distribute the software's
source code freely. These technologies are typically developed collaboratively by a
community of developers and made publicly available, often at no cost. Open-source
software promotes transparency, collaboration, and innovation, as developers can
access, modify, and improve the software code to meet their specific needs. Here are
some key categories and examples of open-source technologies:

What Are Open Source Technologies?


Open source is a kind of licensing agreement that allows users to freely inspect, copy,
modify and redistribute a software, and have full access to the underlying source
code. Simply put, open source refers to software created by a developer who pledges
to make the entire software source code available to users. On the other hand,
proprietary or “closed source” software has source code that only the creator can
legally copy, inspect, and modify. Top Open source technologies celebrate the
principle of open exchange, shared participation, transparency, and community-
oriented development.

However, open-source software is not necessarily free. The source code is freely
available to anyone, but the executable software is sometimes available upon
subscription. The best open source technologies allow users to download, modify
and distribute it without paying any license fees to its original creator.

Different kinds of software programs like operating systems, applications, databases,
games, and even programming languages can be open source. Open source
technology can be used by programmers as well as non-programmers because it
enables anyone with skills to become innovators and contribute code. When source
code is open, it offers advantages to businesses and individual programmers.
Businesses can customize open-source software to meet their specific needs or make
innovations that are not included in the original source code. People worldwide are
using open-source software like Linux or browsers like Firefox to develop websites
and applications.

Operating Systems:
Linux: Linux is an open-source operating system kernel that forms the basis of many
Linux distributions, such as Ubuntu, CentOS, and Debian. Linux is widely used in
servers, embedded systems, and cloud computing environments due to its stability,
security, and flexibility.

Programming Languages and Frameworks:


Python: Python is a high-level, interpreted programming language known for its
simplicity, readability, and versatility. Python is widely used for web development,
data analysis, artificial intelligence, and scientific computing.

Java: Java is a popular programming language used for developing enterprise
applications, web services, and mobile applications. The Java Development Kit (JDK)
is open-source, and many Java libraries and frameworks, such as Spring and
Hibernate, are also open-source.

JavaScript: JavaScript is a scripting language used for client-side web development.
It is supported by all major web browsers and is used extensively for building
interactive web applications.

Node.js: Node.js is an open-source, cross-platform JavaScript runtime environment
that allows developers to run JavaScript code outside of a web browser. Node.js is
commonly used for server-side web development and building scalable, real-time
applications.

Database Systems:

MySQL: MySQL is an open-source relational database management system (RDBMS)
that is widely used for web applications, content management systems, and other
data-driven applications. MySQL is known for its reliability, performance, and ease of
use.

PostgreSQL: PostgreSQL is an open-source object-relational database system known
for its advanced features, extensibility, and compliance with SQL standards.
PostgreSQL is often used for enterprise applications, data warehousing, and
geospatial applications.

Web Servers and Development Tools:


Apache HTTP Server: Apache HTTP Server is the most widely used open-source web
server software, powering millions of websites worldwide. It is known for its stability,
performance, and extensibility.

NGINX: NGINX is an open-source web server and reverse proxy server known for its
high performance, scalability, and flexibility. NGINX is commonly used as a load
balancer, web accelerator, and API gateway.

Content Management Systems (CMS):


WordPress: WordPress is an open-source CMS written in PHP and used by millions
of websites for blogging, e-commerce, and content management. It offers a wide
range of themes, plugins, and customization options.

Joomla: Joomla is an open-source CMS written in PHP and known for its flexibility,
extensibility, and user-friendly interface. Joomla is often used for building community
websites, e-commerce platforms, and corporate portals.

Development Tools and Libraries:


Git: Git is an open-source distributed version control system used for tracking
changes in source code during software development. Git is widely used by
developers for collaborative coding, version control, and code management.

TensorFlow: TensorFlow is an open-source machine learning framework developed
by Google for building and training machine learning models. TensorFlow is widely
used for various applications, including image recognition, natural language
processing, and predictive analytics.

These are just a few examples of open-source technologies that have had a
significant impact on the software development industry and are widely used by
developers and organizations around the world. Open-source software continues to
play a crucial role in driving innovation, collaboration, and accessibility in the tech
community.

The open source ecosystem in Hadoop


Apache Hadoop is an open source software platform for distributed storage and
distributed processing of very large data sets on computer clusters built from
commodity hardware. Hadoop services are foundational to data storage, data
processing, data access, data governance, security, and operations.

 Apache Accumulo: A sorted, distributed key-value store with cell-based
access control.

 Apache Atlas: Agile enterprise regulatory compliance through metadata.

 Apache Flink: A real-time stream processing framework for big data analytics
and applications.

 Apache Hadoop: A distributed storage and processing framework for large-
scale data processing tasks.

 Apache HBase: A non-relational (NoSQL) database that runs on top of HDFS.

 Apache Hive: The de facto standard for SQL queries in Hadoop.

 Apache Impala: The open source, analytic MPP database for Apache Hadoop
that provides the fastest time-to-insight.

 Apache Kafka: A fast, scalable, fault-tolerant messaging system (a small
producer sketch follows this list).

 Apache Knox Gateway: A secure entry point for Hadoop clusters.

 Apache Kudu: Storage for fast analytics on fast data.

 Apache NiFi: A real-time integrated data logistics and simple event
processing platform.

 Apache Oozie: A workflow scheduler system to manage Apache Hadoop jobs.

 Apache Phoenix: An open source, massively parallel, relational database
engine supporting OLTP for Hadoop using Apache HBase.

 Apache Ranger: Comprehensive security for Enterprise Hadoop.

 Apache Solr: Rapid indexing and search on Hadoop.

 Apache Spark: Spark adds in-memory compute for ETL, machine learning,
and data science workloads to Hadoop.

 Apache Sqoop: Efficiently transfers bulk data between Apache Hadoop and
structured datastores.

 Apache Tez: A framework for YARN-based data processing applications in
Hadoop.

 Apache YARN: The architectural center of Enterprise Hadoop.

 Apache Zeppelin: A completely open web-based notebook that enables
interactive data analytics.

 Apache ZooKeeper: An open source server that reliably coordinates
distributed processes.

 HDFS: A distributed file system designed for storing and managing vast data.

 Hue: An open source SQL Workbench for data warehouses.
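
As a small illustration of the Kafka entry above, the sketch below publishes a single
message to a topic using the optional kafka-python package. The broker address and
topic name are placeholders, and a running Kafka broker is assumed.

    # Send one message to a Kafka topic (sketch; kafka-python and a running broker assumed).
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder broker

    # Messages are plain bytes; downstream consumers decide how to decode them.
    producer.send("clickstream-events", b'{"user": "u123", "action": "page_view"}')
    producer.flush()   # block until the message has actually been delivered
    producer.close()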

How to Build a Career in Open Source Technologies?


The Open Source Services market includes services in consultation, implementation,
support, management, maintenance, and training. With open source making huge
inroads in networking, database, and security, numerous job opportunities have
opened up in the domain.

If you are a developer looking to master new skills through collaborative work, you
can pursue a career in Open source. Being part of an open-source community will
expand your network with fellow programmers and add more credentials to your
resume.

Kick-start a career in open source technologies with certification courses on free and
open-source software. Learn how to use PHP XML on the Linux platform to develop,
test and deploy open source applications.


Cloud and big data


Introduction
You’ve likely heard the terms “Big Data” and “Cloud Computing” before. If you’re
involved with cloud application development, you may even have experience with
them. The two go hand-in-hand, with many public cloud services performing big
data analytics.

With Software as a Service (SaaS) becoming increasingly popular, keeping up-to-date


with cloud infrastructure best practices and the types of data that can be stored in
large quantities is crucial. We’ll take a look at the differences between cloud
computing and big data, the relationship between them, and why the two are a
perfect match, bringing us lots of new, innovative technologies, such as artificial
intelligence.

What is big data in the cloud?


Big data and cloud computing are two distinctly different ideas, but the two concepts
have become so interwoven that they are almost inseparable. It's important to define
the two ideas and see how they relate.

Big Data and Cloud Computing


The cloud and big data analytics are often used together. This is because big data
requires huge computational power and storage. Cloud computing offers on-
demand storage, computation resources, and tools to store and analyze big data.
Hence, big data cloud computing and big data cloud analytics are becoming
increasingly popular.

The rise of big data on cloud computing has made the process of analyzing big data
more efficient. Businesses can choose from three types of cloud computing
services, IaaS, PaaS, and SaaS, for cloud-based big data analytics. These services are
available on a pay-per-use or subscription basis, which means users only pay for the
services they use.

Cloud analytics essentially means storing and analyzing data in a big data cloud
instead of on-premises systems of the organization. This includes any type of data
analytics that is performed on systems hosted in the cloud, including big data
analytics.

Big Data Analysis in Cloud Computing

For big data analytics in cloud computing, the data (both structured and
unstructured) is gathered from different sources, such as smart devices, websites,
social media, etc. The next step involves cleaning and storing this large amount of
data. Companies then use big data cloud tools by big data cloud providers to
process this data for analysis.

A typical big data cloud architecture helps illustrate cloud big data, cloud
computing big data, and how cloud computing and big data are used together.

One of the most common cloud computing models for big data processing and
analysis is AaaS. AaaS, or Analytics as a Service, refers to a big data cloud solution that
provides analytics software and procedures. It provides efficient business intelligence
(BI) solutions that help organize, analyze, and present big data so that it is easy to
interpret.

AaaS involves advanced data analytics technologies, such as machine learning
algorithms, AI, predictive analytics, and data mining, to analyze data and surface
trends.
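
As one concrete way to consume analytics as a cloud service, the sketch below runs a
SQL query on Google BigQuery from Python using the google-cloud-bigquery client
library. It assumes the library is installed and that application default credentials are
configured; the project, dataset, and table names are placeholders.

    # Run an analytical SQL query on a cloud service (Google BigQuery) from Python.
    # Sketch only: google-cloud-bigquery and valid credentials are assumed,
    # and the dataset/table names below are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT device_type, COUNT(*) AS events
        FROM `my_project.analytics.clickstream`
        GROUP BY device_type
        ORDER BY events DESC
    """

    # The query executes on the provider's infrastructure; only results come back.
    for row in client.query(query).result():
        print(row["device_type"], row["events"])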

The pros of big data in the cloud


The cloud brings a variety of important benefits to businesses of all sizes. Some of
the most immediate and substantial benefits of big data in the cloud include the
following.

 Scalability

A typical business data center faces limits in physical space, power, cooling
and the budget to purchase and deploy the sheer volume of hardware it
needs to build a big data infrastructure. By comparison, a public cloud
manages hundreds of thousands of servers spread across a fleet of global
data centers. The infrastructure and software services are already there, and
users can assemble the infrastructure for a big data project of almost any size.

 Agility

Not all big data projects are the same. One project may need 100 servers, and
another project might demand 2,000 servers. With cloud, users can employ as
many resources as needed to accomplish a task and then release those
resources when the task is complete.

 Cost

A business data center is an enormous capital expense. Beyond hardware,


businesses must also pay for facilities, power, ongoing maintenance and more.
The cloud works all those costs into a flexible rental model where resources
and services are available on demand and follow a pay-per-use model.

 Accessibility

Many clouds provide a global footprint, which enables resources and services
to deploy in most major global regions. This enables data and processing
activity to take place proximally to the region where the big data task is
located. For example, if a bulk of data is stored in a certain region of a cloud
provider, it's relatively simple to implement the resources and services for a
big data project in that specific cloud region -- rather than sustaining the cost
of moving that data to another region.

 Resilience

Data is the real value of big data projects, and the benefit of cloud resilience is
in data storage reliability. Clouds replicate data as a matter of standard
practice to maintain high availability in storage resources, and even more
durable storage options are available in the cloud.

The cons of big data in the cloud


Public clouds and many third-party big data services have proven their value in big
data use cases. Despite the benefits, businesses must also consider some of the
potential pitfalls. Some major disadvantages of big data in the cloud can include the
following.

 Network dependence

Cloud use depends on complete network connectivity from the LAN, across
the internet, to the cloud provider's network. Outages along that network path
can result in increased latency at best or complete cloud inaccessibility at
worst. While an outage might not impact a big data project in the same ways
that it would affect a mission-critical workload, the effect of outages should
still be considered in any big data use of the cloud.

 Storage costs

Data storage in the cloud can present a substantial long-term cost for big data
projects. The three principal issues are data storage, data migration and data
retention. It takes time to load large amounts of data into the cloud, and then
those storage instances incur a monthly fee. If the data is moved again, there
may be additional fees. Also, big data sets are often time-sensitive, meaning
that some data may have no value to a big data analysis even hours into the
future. Retaining unnecessary data costs money, so businesses must employ
comprehensive data retention and deletion policies to manage cloud storage
costs around big data.

 Security

The data involved in big data projects can involve proprietary or personally
identifiable data that is subject to data protection and other industry- or
government-driven regulations. Cloud users must take the steps needed to
maintain security in cloud storage and computing through adequate
authentication and authorization, encryption for data at rest and in flight, and
copious logging of how they access and use data.

 Lack of standardization

There is no single way to architect, implement or operate a big data


deployment in the cloud. This can lead to poor performance and expose the
business to possible security risks. Business users should document big data
architecture along with any policies and procedures related to its use. That
documentation can become a foundation for optimizations and improvements
for the future.

The Difference between Big Data & Cloud Computing


Before discussing how the two go together, it's important to form a clear distinction
between "Big Data" and "Cloud Computing". Although they are technically different
terms, they're often seen together in literature because they interact synergistically
with one another.

 Big Data: This simply refers to the very large sets of data that are output by a
variety of programs. It can refer to any of a large variety of types of data, and
the data sets are usually far too large to peruse or query on a regular
computer.

 Cloud Computing: This refers to the processing of anything, including Big
Data Analytics, on the "cloud". The "cloud" is just a set of high-powered
servers from one of many providers. They can often view and query large data
sets much more quickly than a standard computer could.

Essentially, “Big Data” refers to the large sets of data collected, while “Cloud
Computing” refers to the mechanism that remotely takes this data in and performs
any operations specified on that data.

The Roles & Relationship between Big Data & Cloud Computing
Cloud Computing providers often utilize a “software as a service” model to allow
customers to easily process data. Typically, a console that can take in specialized
commands and parameters is available, but everything can also be done from the
site’s user interface. Some products that are usually part of this package include
database management systems, cloud-based virtual machines and containers,
identity management systems, machine learning capabilities, and more.

In turn, Big Data is often generated by large, network-based systems. It can be in
either a standard or non-standard format. If the data is in a non-standard format,
artificial intelligence from the Cloud Computing provider may be used in addition to
machine learning to standardize the data.

From there, the data can be harnessed through the Cloud Computing platform and
utilized in a variety of ways. For example, it can be searched, edited, and used for
future insights.

This cloud infrastructure allows for real-time processing of Big Data. It can take huge
“blasts” of data from intensive systems and interpret it in real-time. Another common
relationship between Big Data and Cloud Computing is that the power of the cloud
allows Big Data analytics to occur in a fraction of the time it used to.

Big Data & Cloud Computing: A Perfect Match

As you can see, there are infinite possibilities when we combine Big Data and Cloud
Computing! If we simply had Big Data alone, we would have huge data sets that have
a huge amount of potential value just sitting there. Using our computers to analyze
them would be either impossible or impractical due to the amount of time it would
take.

However, Cloud Computing allows us to use state-of-the-art infrastructure and only
pay for the time and power that we use! Cloud application development is also
fueled by Big Data. Without Big Data, there would be far fewer cloud-based
applications, since there wouldn't be any real necessity for them. Remember, Big
Data is often collected by cloud-based applications, as well!

In short, Cloud Computing services largely exist because of Big Data. Likewise, the
only reason that we collect Big Data is because we have services that are capable of
taking it in and deciphering it, often in a matter of seconds. The two are a perfect
match, since neither would exist without the other!

Choose the right cloud deployment model


So, which cloud model is ideal for a big data deployment? Organizations typically
have four different cloud models to choose from: public, private, hybrid and multi-
cloud. It's important to understand the nature and tradeoffs of each model.

Which deployment model is right for you?

 Private cloud

Private clouds give businesses control over their cloud environment, often to
accommodate specific regulatory, security or availability requirements.
However, it is more costly because a business must own and operate the
entire infrastructure. Thus, a private cloud might only be used for sensitive
small-scale big data projects.

 Public cloud

The combination of on-demand resources and scalability makes public cloud
ideal for almost any size of big data deployment. However, public cloud users
must manage the cloud resources and services they use. In a shared
responsibility model, the public cloud provider handles the security of the
cloud, while users must configure and manage security in the cloud.

 Hybrid cloud

A hybrid cloud is useful when sharing specific resources. For example, a hybrid
cloud might enable big data storage in the local private cloud -- effectively
keeping data sets local and secure -- and use the public cloud for compute
resources and big data analytical services. However, hybrid clouds can be
more complex to build and manage, and users must deal with all of the issues
and concerns of both public and private clouds.

 Multi-cloud

With multiple clouds, users can maintain availability and use cost benefits.
However, resources and services are rarely identical between clouds, so
multiple clouds are more complex to manage. This cloud model also has more
risks of security oversights and compliance breaches than single public cloud
use. Considering the scope of big data projects, the added complexity of
multi-cloud deployments can add unnecessary challenges to the effort.

Review big data services in the cloud


While the underlying hardware gets the most attention and budget for big data
initiatives, it's the services -- the analytical tools -- that make big data analytics
possible. The good news is that organizations seeking to implement big data
initiatives don't need to start from scratch.

Providers not only offer services and documentation, but can also arrange for
support and consulting to help businesses optimize their big data projects. A
sampling of available big data services from the top three providers include the
following.

AWS

 Amazon Elastic MapReduce


 AWS Deep Learning AMIs

 Amazon SageMaker

Microsoft Azure

 Azure HDInsight

 Azure Analysis Services

 Azure Databricks

Google Cloud

 Google BigQuery

 Google Cloud Dataproc

 Google Cloud AutoML

Keep in mind that there are numerous capable services available from third-party
providers. Typically, these providers offer more niche services, whereas major
providers follow a one-size-fits-all strategy for their services. Some third-party
options include the following:

 Cloudera

 Hortonworks Data Platform

 Oracle Big Data Service

 Snowflake Data Cloud

Advantages of Big Data in Cloud Computing


There are several benefits of big data in the cloud and big data analytics cloud:

 Scalability

Cloud computing for big data offers flexible, on-demand capabilities. With big
data cloud technology, organizations can scale up or scale down as per their
needs. For example, organizations can ask cloud-based big data solutions
providers to increase cloud storage as the volume of their data increases.
Businesses can also add data analysis capacity as needed. Big data cloud
servers help businesses respond to customer demands more efficiently.

 Higher Efficiency

Cloud computing for big data analytics provides incredible processing power.
This makes big data processing in cloud computing environments more
efficient compared to on-premise systems.

 Cost Reductions

When it comes to big data on-premise vs. cloud, another major difference is
cost. In comparing big data cloud vs. on-premise, on-premises systems
involve different costs, such as power consumption costs, purchasing and
maintaining hardware and servers, replacing the hardware, etc.

However, with cloud and big data cloud technologies, there are no such costs
because the cloud service providers are responsible for everything.
Additionally, cloud services are based on a pay-per-use model, which further
reduces the cost.

 Disaster Recovery

Data of any size is a valuable asset for organizations, so it’s important not to
lose it. However, cyber-attacks, equipment failure, and power outages can
result in data loss, especially if you’re using an on-premise system. On the
other hand, a big data cloud service replicates data to ensure high availability
and security. Hence, cloud computing for big data helps organizations recover
from disasters faster.


Challenges of Big Data in Cloud Computing


Some of the key challenges of big data in cloud computing include:

 Big Data Cloud Security

Security issues associated with big data in cloud computing are usually a
major concern for businesses. Big data consists of different types of data,
including the personal data of customers, which are subject to data privacy
regulations. As cyber-attacks are increasing, hackers can steal data on poorly
secured clouds.

 Requires Internet

You need an internet connection to access data in the cloud and perform
analytics.

Difference between Big Data and Cloud Computing

S.No. Big Data / Cloud Computing

01. Big Data: refers to data which is huge in size and also increasing rapidly with
respect to time.
Cloud Computing: refers to the on-demand availability of computing resources
over the internet.

02. Big Data: includes structured data, unstructured data, as well as semi-structured
data.
Cloud Computing: services include Infrastructure as a Service (IaaS), Platform as
a Service (PaaS), and Software as a Service (SaaS).

03. Big Data: Volume, Velocity, Variety, Veracity, and Value of data are considered
the 5 most important characteristics of big data.
Cloud Computing: On-demand availability of IT resources, broad network access,
resource pooling, elasticity, and measured service are considered the main
characteristics of cloud computing.

04. Big Data: the purpose is to organize the large volume of data, extract the useful
information from it, and use that information for the improvement of business.
Cloud Computing: the purpose is to store and process data in the cloud, or to
avail remote IT services, without physically installing any IT resources.

05. Big Data: distributed computing is used for analyzing the data and extracting
the useful information.
Cloud Computing: the internet is used to get cloud-based services from different
cloud vendors.

06. Big Data: big data management provides a centralized platform, provision for
backup and recovery, and low maintenance cost.
Cloud Computing: cloud computing services are cost-effective, scalable, and
robust.

07. Big Data: some of the challenges are variety of data, data storage and
integration, data processing, and resource management.
Cloud Computing: some of the challenges are availability, transformation,
security concerns, and the charging model.

08. Big Data: refers to a huge volume of data, its management, and useful
information extraction.
Cloud Computing: refers to remote IT resources and different internet service
models.

09. Big Data: is used to describe huge volumes of data and information.
Cloud Computing: is used to store data and information on remote servers and
to process the data using remote infrastructure.

10. Big Data: some of the sources where big data is generated include social media
data, e-commerce data, weather station data, IoT sensor data, etc.
Cloud Computing: some of the vendors who provide cloud computing services
are Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, IBM
Cloud Services, etc.
Mobile business intelligence


BI delivers relevant and trustworthy information to the right person at the right time.
Mobile business intelligence is the transfer of business intelligence from the desktop
to mobile devices such as the BlackBerry, iPad, and iPhone.

The ability to access analytics and data on mobile devices or tablets rather than
desktop computers is referred to as mobile business intelligence. The business metric
dashboard and key performance indicators (KPIs) are more clearly displayed.

With the rising use of mobile devices, the technologies we all use in our daily lives,
including in business, have evolved to make our lives easier. Many businesses have
benefited from mobile business intelligence. Essentially, this section is a guide for
business owners and others to educate them on the benefits and pitfalls of Mobile
BI.

What is Mobile Business Intelligence?


Mobile Business Intelligence (BI) is an evolution of traditional BI technologies,
enabling the delivery and synthesis of business data through mobile devices like
smartphones and tablets. Unlike traditional BI, which is often confined to desktops
and laptops, Mobile BI emphasizes agility, real-time access, and flexible user
experiences. It allows users to retrieve, interact with, and analyze business data on
the move, breaking the shackles of stationary data interaction.

Why is Mobile BI Needed?


In the current fast-paced business world, decision-makers are often on the move and
require immediate access to data and analytics. With the increasing capabilities of
mobile devices, including enhanced data storage, processing power, and
connectivity, Mobile BI has become a critical tool for timely and effective decision-
making. It allows for the constant flow of information, keeping business leaders
connected with their operations, sales, and customer interactions in real-time,
regardless of their physical location.

Advantages of Mobile BI Compared with Traditional BI


1. Simple Access

 Mobile BI: Offers the ability to access BI tools from any location at any time,
using a mobile device.

- Example: A marketing manager at an airport can quickly check the
latest campaign performance on their smartphone, enabling instant
strategic adjustments.

 Traditional BI: Access typically requires being at a desk, often limiting
immediate response or action.

2. Competitive Advantage

 Mobile BI: Provides real-time data, facilitating agile responses to market
changes, often leading to a competitive edge.

- Example: Stock traders using Mobile BI can make instant trade
decisions based on the latest market data, directly from their devices.

 Traditional BI: Responses to market conditions might be delayed due to the
need to access data from fixed locations.

3. Simplified Decision-Making

 Mobile BI: Streamlines decision-making processes with instant data access
and analysis capabilities.

- Example: A sales manager in a client meeting can use a tablet to
demonstrate recent sales trends and make informed decisions without
delay.

 Traditional BI: Requires access to a stationary computer, potentially
postponing important decisions.

4. Increased Productivity

 Mobile BI: By providing critical information at users' fingertips, it maximizes
the use of time and minimizes delays in operations.

- Example: Field service engineers can access maintenance data and
customer history on-site through mobile devices, leading to faster and
more effective service delivery.

 Traditional BI: Might restrict information access to office environments,
affecting timely task completion.

Disadvantages of Mobile BI
1. Stack of data

The primary function of mobile BI is to store data in a systematic manner and
then present it to the user as required. As a result, Mobile BI stores all of the
information and ends up with heaps of older data. The corporation only
needs a small portion of the previous data, but it has to store the entire body of
information, which piles up in the stack.

2. Expensive

Mobile BI can be quite costly at times. Large corporations can continue to pay for
their expensive services, but small businesses cannot. As the cost of mobile BI is
not sufficient, we must additionally consider the rates of IT workers for the
smooth operation of BI, as well as the hardware costs involved.

However, larger corporations do not settle for just one Mobile BI provider for
their organisations; they require multiple. Even when doing basic commercial
transactions, mobile BI is costly.

3. Time consuming

Businesses prefer Mobile BI since it is a quick procedure. Companies are not


patient enough to wait for data before implementing it. In today's fast-paced
environment, anything that can produce results quickly is valuable. The data from
the warehouse is used to create the system, hence the implementation of BI in an
enterprise takes more than 18 months.
4. Data breach

The biggest issue of the user when providing data to Mobile BI is data leakage. If
you handle sensitive data through Mobile BI, a single error can destroy your data
as well as make it public, which can be detrimental to your business.

Many Mobile BI providers are working to make it 100 percent secure to protect
their potential users' data. It is not only something that mobile BI carriers must
consider, but it is also something that we, as users, must consider when granting
data access authorization.

5. Poor quality data

Because we work online in every aspect, we have a lot of data stored in Mobile BI,
which can be a significant problem. A large portion of the data
analysed by Mobile BI is irrelevant or completely useless. This can slow down
the entire procedure, so you need to select the data that is important and
may be required in the future.

Recommended Tools
The use of Mobile BI transforms not only how and where business intelligence is
consumed but also significantly speeds up decision-making and enhances overall
productivity. For businesses looking to implement Mobile BI, choosing the right tool
is crucial. Some top recommendations include:

1. Sisense

 Sisense is a flexible business intelligence (BI) solution that includes powerful
analytics, visualisations, and reporting capabilities for managing and
supporting corporate data.

 Businesses can use the solution to evaluate large, diverse databases and
generate relevant business insights. You may easily view enormous volumes of
complex data with Sisense's code-first, low-code, and even no-code
technologies. Sisense was established in 2004 with its headquarters in New
York.

 Since then, the team has continued to invest in research. Once the company
received $4 million in funding from investors, it picked up the pace of its
research.

2. SAP Roambi analytics

 Roambi Analytics is a BI tool that offers a solution that allows you to
fundamentally rethink your data analysis, making it easier and faster while also
increasing your data interaction.

 You can consolidate all of your company's data in a single tool using SAP
Roambi Analytics, which integrates all ongoing systems and data. Using SAP
Roambi Analytics is a simple three-step process. Upload your HTML or
spreadsheet files first. The information is subsequently transformed into
informative data or graphs, as well as data that may be visualised.

 After the data is collected, you may easily share it with your preferred device.
Roambi Analytics was founded in 2008 by a team based in California.

3. IBM Cognos Analytics

 Cognos Analytics is a web-based business intelligence tool from IBM.
Cognos Analytics now integrates with Watson, and the benefits for users are
extremely exciting. Watson in Cognos Analytics assists in connecting and
cleaning users' data, resulting in properly visualised data.

 That way, the business owner will know where they stand in comparison to
their competitors and where they can grow in the future. It combines
reporting, modelling, analysis, and dashboards to help you understand your
organization's data and make sound business decisions.

4. Amazon QuickSight

 Amazon QuickSight assists in the creation and distribution of interactive BI
dashboards to users, as well as the retrieval of answers to natural
language queries in seconds. QuickSight can be accessed through any device,
embedded in any website, portal, or app.

 Amazon QuickSight allows you to quickly and easily create interactive
dashboards and reports for your users. Anyone in your organisation can
securely access those dashboards via browsers or mobile devices.

 QuickSight's eye-catching feature is its pay-per-session model, which allows
users to use a dashboard created by someone else without paying much.
The user pays according to the length of the session, with prices ranging from
$0.30 for a 30-minute session to $5 for unlimited use per month per user.

5. Power BI Mobile (Microsoft):

 Strengths: Microsoft Power BI is widely used and integrates seamlessly with


other Microsoft tools. Power BI Mobile enables users to view and interact with
Power BI reports and dashboards on mobile devices.

 Considerations: The functionality and capabilities may be tied to your Power


BI subscription, and some advanced features might not be available in the
mobile app. Authoring is not available on mobile.

6. Tableau Mobile (Salesforce):

 Strengths: Tableau is known for its powerful data visualization capabilities
and ease of use. Tableau Mobile allows users to access and interact with
Tableau dashboards on mobile devices.

 Considerations: It may require a learning curve for new users, and the full
functionality might depend on your Tableau licensing level. Authoring is not
available on mobile.

7. Qlik Sense Mobile (Qlik):

 Strengths: Qlik offers robust data analytics and visualization capabilities. Qlik
Mobile allows users to access and explore Qlik apps on mobile devices. Stands
out for its associative data model, facilitating deep, intuitive data exploration
and insights.

 Considerations: Like the others, the full range of features may vary
depending on your Qlik licensing, and users might need some training to fully
utilize its capabilities. Authoring is not available on mobile.

8. Oneboard (Sweeft):

 Strengths: Oneboard offers a unified user interface, reliable native mobile


application. It stands out by allowing users to connect to any system with an
API, create dashboards directly on mobile, and combine fragmented data on
their mobile devices without external help.

 Considerations: Oneboard's emphasis on user empowerment, ease of use,


and integration capabilities can make it a strong choice for organizations
looking for a mobile BI tool that prioritizes flexibility and self-service analytics.

Mobile BI Technology
Business data and analytics are accessed more comfortably on tablets than on mobile
phones. This is because the difference between the screens of notebooks and tablets
is relatively small, whereas the varied, smaller screens of mobile phones make a
significant difference when accessing analytical reports.

Consequently, this affects Mobile BI overall: due to the small display of mobile
phones, there is less space to showcase content like KPIs, dashboards,
business information, and reports as strategically planned by analytics
teams. For instance, to optimize collaborative data visualization and BI
dashboards, you require a larger screen to view the details effectively at a
glance.

However, modern BI analytics solutions provide robust user interfaces and desktop-
like visualization that take Mobile BI to a whole new level.

Modern BI platforms typically extend the delivery of their desktop BI capabilities to
smartphones and tablet devices so that users can access, consume, and share data
easily on portable devices. In addition, mobile BI especially formats and optimizes
dashboards and reports so that the user experience is taken into account and
leverages smaller, touchscreen-based mobile screens and interfaces.

There are a few methods by which Mobile BI provides Big Data and ETL solutions.
They are categorized into three types:

1. Web-based solutions: Generally, the content to be accessed and viewed on


mobile devices is implemented using HTML5 by software developers on web
browsers. HTML5 helps to optimize the dashboards and data visualization for
consumption on compact mobile phones, touch-screen tablets, and other
portable devices very efficiently.

2. Native applications: This is one of the most expensive custom-built software


applications that support Mobile BI. It supports mobile operating systems
(OS), like iOS and Android-based devices.

3. Hybrid solutions: Hybrid solutions are the advanced form of native analytics
applications that render content by using HTML5. In hybrid applications,
features of native applications are merged with HTML5 and perform similarly
to Web-based solutions.

The methods sound all helpful and easy, but how does Mobile BI actually operate?
Let’s find out.

How Does Mobile BI Work?


Mobile BI is similar in performance to standard BI software, but has been
designed specifically for users on the move. Clients need to install their
organization's BI user application on their mobile phones. Users can find these
apps, such as Kube, on the Google Play Store or the iPhone App Store. You can receive
insights and resourceful data and perform queries via a wireless or remote Internet
connection.

Mobile-based Analytical Solutions

Mobile-based Business Intelligence solutions are an interesting innovation in Data
Analytics. However, it is also quite challenging because many aspects have to be
taken so that the data visualization presented is top-notch.

This is where Kube comes in handy. Kube is an integral Mobile BI tool of Kockpit
Analytics. It wraps up all the analytical business points to cash cycles and provides
you with various exciting features and advantages.

Kube has several fantastic features:


 Kube offers personalized and reliable dashboards related to your business’s
sales, similar to your desktop. Be it managing vendors, an in-depth sales
analysis, or tracking your collection, Kube does it all!

 Kube features integrated chat functionality that makes you truly hyper-
connected with your business so that you can conveniently alter your last-
minute decisions. This makes communication simpler and more efficient.

 Kube provides a bird’s eye view of your sales and operations. It also allows its
users to monitor anything and everything about your sales process and access
more areas.

 Users also can easily assign and track their goals. Whether it is accessing the
goals tab, viewing all the assigned goals & tasks, tracking their details and
progress, or self-assigning goals for better task management, you have it all!

Kube also provides advantageous assistance:
 Share the data with your team and other members straight from the
application. Just a few taps and Kube will share your data through various
platforms.

 The big data-enabled engine ensures real-time information for users, allowing
you to make data-driven decisions on the go.

 With the Kube mobile app, you'll see the same sophisticated and powerful
visualizations that you see on your desktop.

 Sort your data by name, target, actual, and many more with Kockpit Kube and
view your data more conveniently.

 Favorite the cards that are most important to you and view them anytime you
like.

Moreover, it has modules that serve various departments. For example, with a data refresh rate of 15 minutes, the app keeps you updated with near real-time information regardless of your location.

Benefits of Using Mobile BI


MBI has various benefits. They have been discussed below:

 Availability: One of the biggest advantages of using Mobile BI is that this technology allows its end-users to access, consume, and monitor big data business analytics from any part of the world, simply by logging into the utility app over an appropriate wireless connection. It also assists in growing your network as you meet new people in new locations. Moreover, it allows organizations to hire new workers who are comfortable working on BI and analytics from remote places, without needing dedicated desktop systems.

 Usability: MBI solutions take advantage of touch displays, letting business users access and monitor dashboards, reports, and KPIs through a drag-and-drop UI at their fingertips.

 Reliability: After being connected to a wireless or remote Internet connection


and successfully logging into the end-user utility app, MBI enables its clients
to monitor real-time insights and KPIs effectively. In addition, it lets them
update on current trends and changes, receive BI dashboards and reports, and
implement queries.

 Collaboration: Since business data is accessed, monitored, and shared via


individual mobile phones, MBI encourages collaborative opportunities among

workers to keep them on the same page by providing real-time insights and
functional strategies.

 Real-Time Insights: Mobile BI platforms enable users to access real-time data


and analytics, allowing them to monitor key performance indicators (KPIs),
track business metrics, and respond promptly to changing business
conditions. Real-time insights empower decision-makers to make data-driven
decisions on the fly, improving agility and responsiveness.

 Interactive Dashboards: Mobile BI applications offer interactive dashboards


and data visualizations optimized for mobile devices, providing users with
intuitive tools for exploring and analyzing data on the go. Users can interact
with charts, graphs, and maps, drill down into detailed data, and customize
views to gain deeper insights into business performance and trends.

 Offline Access: Many mobile BI applications offer offline access capabilities,


allowing users to access and view cached data and reports even when they are
offline or have limited connectivity. Offline access ensures that users can
access critical information even in remote locations or areas with poor
network coverage, improving productivity and decision-making continuity.

 Security: Security is a critical consideration in mobile BI deployments to


ensure that sensitive business data remains protected on mobile devices.
Mobile BI platforms employ various security measures, such as data
encryption, multi-factor authentication, and device management policies, to
safeguard data integrity and confidentiality.

 Integration with Existing Systems: Mobile BI solutions integrate with


existing BI platforms, data warehouses, and enterprise systems to access and
analyze data from multiple sources. Integration capabilities enable seamless
data connectivity and interoperability, allowing users to leverage existing data
assets and infrastructure for mobile analytics.

Crowdsourcing analytics


What is Crowdsourcing?
Crowdsourcing is a sourcing model in which an individual or an organization gets support from a large, relatively open, and rapidly evolving group of people in the form of ideas, micro-tasks, finances, etc. Crowdsourcing typically involves the use of the internet to attract a large group of people to divide tasks or to achieve a target.
The term was coined in 2005 by Jeff Howe and Mark Robinson. Crowdsourcing can
help different types of organizations get new ideas and solutions, deeper consumer
engagement, optimization of tasks, and several other things.

Let us understand this term more deeply with the help of an example. GeeksforGeeks gives young minds an opportunity to share their knowledge with the world by contributing articles and videos in their respective domains. Here GeeksforGeeks uses the crowd as a source not only to expand its community but also to include the ideas of several young minds, improving the quality of the content.

Where Can We Use Crowdsourcing?


Crowdsourcing touches almost all sectors, from education to health. It is not only accelerating innovation but also democratizing problem-solving methods. Some fields where crowdsourcing can be used are:

1. Enterprise

2. IT

3. Marketing

4. Education

5. Finance

6. Science and Health

Understanding Crowdsourcing
Crowdsourcing allows companies to farm out work to people anywhere in the
country or around the world; as a result, crowdsourcing lets businesses tap into a
vast array of skills and expertise without incurring the normal overhead costs of in-
house employees.

Crowdsourcing is becoming a popular method to raise capital for special projects. As


an alternative to traditional financing options, crowdsourcing taps into the shared
interest of a group, bypassing the conventional gatekeepers and intermediaries
required to raise capital.

Crowdsourcing vs. Crowdfunding


While crowdsourcing seeks information or workers' labor, crowdfunding instead
solicits money or resources to help support individuals, charities, or startups. People
can contribute to crowdfunding requests with no expectation of repayment, or
companies can offer shares of the business to contributors.

What Are the Main Types of Crowdsourcing?
Crowdsourcing involves obtaining information or resources from a wide swath of
people. In general, we can break this up into four main categories:

 Wisdom - Wisdom of crowds is the idea that large groups of people are
collectively smarter than individual experts when it comes to problem-solving
or identifying values (like the weight of a cow or number of jelly beans in a
jar).

 Creation - Crowd creation is a collaborative effort to design or build


something. Wikipedia and other wikis are examples of this. Open-
source software is another good example.

 Voting - Crowd voting uses the democratic principle to choose a particular


policy or course of action by "polling the audience."

 Funding - Crowdfunding involves raising money for various purposes by


soliciting relatively small amounts from a large number of funders.

How does data crowdsourcing work?


Data crowdsourcing platforms typically allow users to sign up and complete simple
tasks in exchange for compensation. These tasks might involve answering questions,
providing feedback, or rating products. The data collected is then used by the
company or organization running the platform to improve their understanding of a
specific topic or issue.

How crowdsourcing analytics works and its key aspects:


1. Problem Definition: Organizations define the analytics problem or task they
need to solve and articulate the desired outcomes or insights they aim to
achieve through crowdsourcing. This could range from data labeling and
annotation tasks to more complex data analysis or predictive modeling
challenges.

2. Platform Selection: Organizations choose a suitable crowdsourcing platform


or marketplace to host their analytics tasks and engage with a community of
contributors. Platforms like Kaggle, CrowdFlower (now Figure Eight),
Topcoder, and Amazon Mechanical Turk provide access to a global pool of
contributors with diverse skills and expertise.

3. Task Design and Setup: Organizations design and set up the analytics tasks
or challenges on the chosen crowdsourcing platform, specifying the
requirements, guidelines, and evaluation criteria for the contributors. Tasks
may involve data cleaning, data labeling, image classification, sentiment
analysis, or even more advanced analytics tasks such as predictive modeling or
algorithm development.

4. Incentive Structure: Organizations define the incentive structure for


contributors, including monetary rewards, prizes, recognition, or other forms
of compensation for participating in and successfully completing the analytics
tasks. Incentives play a crucial role in motivating contributors to participate
and deliver high-quality results.

5. Crowdsourcing Execution: Contributors from the crowd participate in the


analytics tasks by providing their inputs, annotations, or solutions based on
the task requirements and guidelines. Organizations may monitor the
progress of the tasks, provide feedback to contributors, and ensure the quality
and accuracy of the results.

6. Aggregation and Analysis: Organizations aggregate and analyze the contributions from the crowd to derive insights, make decisions, or generate actionable outputs based on the analytics tasks' outcomes. This may involve aggregating individual contributions, validating the results, and synthesizing the findings to extract meaningful insights or patterns (a minimal aggregation sketch follows this list).

7. Validation and Quality Assurance: Organizations validate the accuracy and


quality of the crowdsourced analytics results through various validation
techniques, such as cross-validation, consensus-based approaches, or expert
review. Quality assurance measures are essential to ensure the reliability and
trustworthiness of the crowdsourced insights.

8. Integration with Internal Processes: Organizations integrate the


crowdsourced analytics results with their internal processes, systems, or
decision-making workflows to leverage the insights and drive business
outcomes. Crowdsourced insights may inform strategic decisions, enhance
product development, optimize operations, or improve customer experiences.
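
To make the aggregation and validation steps above (steps 6 and 7) concrete, here is a minimal Python sketch of one common approach: majority voting over redundant labels collected from several contributors, with low-agreement items flagged for expert review. The task IDs, labels, and the agreement threshold are hypothetical placeholders rather than the interface of any particular crowdsourcing platform.

from collections import Counter

# Hypothetical crowdsourced labels: task_id -> list of labels from different contributors
crowd_labels = {
    "img_001": ["cat", "cat", "dog", "cat"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog"],  # low agreement, should be flagged for review
}

def aggregate_by_majority(labels, min_agreement=0.6):
    """Return (winning_label, agreement_ratio, accepted_flag) for one task."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)
    return label, agreement, agreement >= min_agreement

for task_id, labels in crowd_labels.items():
    label, agreement, accepted = aggregate_by_majority(labels)
    status = "accepted" if accepted else "needs expert review"
    print(f"{task_id}: {label} (agreement={agreement:.0%}, {status})")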

How to Crowdsource?
For scientific problem solving, a broadcast search is used where an organization
mobilizes a crowd to come up with a solution to a problem.

For information management problems, knowledge discovery and management is


used to find and assemble information.

For processing large datasets, distributed human intelligence is used. The
organization mobilizes a crowd to process and analyze the information.

Examples of Crowdsourcing
1. Doritos: Doritos is one of the companies that has taken advantage of crowdsourcing for a long time in its advertising initiatives. It uses consumer-created ads for one of its 30-second Super Bowl spots (the championship game of American football).

2. Starbucks: Another big venture which used crowdsourcing as a medium for


idea generation. Their white cup contest is a famous contest in which
customers need to decorate their Starbucks cup with an original design and
then take a photo and submit it on social media.

3. Lays: The “Do Us a Flavor” contest by Lays used crowdsourcing as an idea-generating medium. Lays asked customers to submit their opinions about the next chip flavor they would like to see.

4. Airbnb: A very famous travel website that lets people rent out their houses or apartments by listing them on the platform. All the listings are crowdsourced from people.

There are several examples of businesses being set up with the help of
crowdsourcing.

Crowdsourced Marketing
As discussed already, crowdsourcing helps businesses grow a lot. Be it a business idea or just a logo design, crowdsourcing engages people directly and, in turn, saves money and energy. In the upcoming years, crowdsourced marketing will surely get a boost as the world adopts technology faster.


Crowdsourcing Sites
Here is the list of some famous crowdsourcing and crowdfunding sites.

1. Kickstarter

2. GoFundMe

3. Patreon

4. RocketHub
What are the benefits of crowdsourcing?
The rapid growth in the popularity of crowdsourcing is due to its numerous
advantages.

1. Remarkable solutions to challenging problems

An enterprise can access hundreds or even thousands of unique approaches to problem solving by including a larger group of individuals from all walks of life, with different experiences and perspectives.

2. Accelerated tasks

Companies can obtain excellent ideas in a lot less time by engaging a larger
group of people to participate in the process. This could be crucial to the
success of time-sensitive undertakings like urgent software fixes or medical
research.
Microtasking proves to be advantageous here. It is a form of crowdsourcing in which small, specific tasks are handed out to contributors. A microtask can be completed by one person or by a group of people who share the workload. Writing a blog post or conducting research are examples of jobs that are frequently carried out in small, sequential chunks, or microtasks.

3. Greater accuracy of data

Crowdsourcing data can provide greater accuracy because it is based on a


large number of inputted data points. This allows for knowledge to be spread
more widely and makes mistakes easier to identify.

What are the benefits of using data crowdsourcing?


1. Provides accurate and timely data

Data crowdsourcing can provide accurate and timely data for businesses. The
data is flexible and can be modified to fit the business’s needs. The business
can pay-per-use for the data or receive real-time alerts when traffic is
congested.

2. Greater speed

Data crowdsourcing can help speed up the process of finding the right data
by allowing a large number of people to quickly and cheaply contribute data.
This ensures that data tasks are completed quickly and with high quality
standards.

3. Allows for diverse input

Data crowdsourcing can benefit your business by providing access to large


amounts of data from diverse sources quickly and cheaply. A high-quality
dataset is important for the success of an AI model, and data can be collected
easily and cheaply through crowdsourcing. Crowdsourcing enables businesses
to access a large number of skilled data collectors from around the world.

4. Greater accuracy

Data crowdsourcing can help with accuracy and other advantages. It can be
more reliable than traditional methods when the dataset is large, and it can
help reduce the number of pairwise comparisons required to rank. This
reduces annotation burden, making data more accurate and easier to use.

5. Allows for feedback and improvement

Data crowdsourcing can help with improving content, getting feedback, and
more. By being transparent and honest with data crowdsourcing participants,
you can ensure a successful project.

6. Allows for faster decision-making

Crowdsourcing data can help with faster decision-making by providing a


flexible and real-time way to collect data. This can be used to identify mistakes
more easily and in a time-sensitive manner. For example, traffic alerts can be
sent out in real-time based on pre-selected thresholds or historical trends.

7. Allows for cost-effective solutions

Data crowdsourcing can be used to get new ideas for cost-effective solutions.
It is a cheaper and more accessible way to get solutions to complex problems
than traditional methods. Crowdsourcing is not limited to highly technical and
complex problems – it can also be used for research and development (R&D).
Data crowdsourcing can be used to improve productivity and creativity in a
company.

8. Allows for quicker product development

Data crowdsourcing can help with quicker product development by allowing


for faster feedback and better understanding of user needs. By crowdsourcing
data, businesses can receive feedback and input from a large number of users
in a short amount of time. This can be used to improve products and make
them more user-friendly. Additionally, data crowdsourcing can be used to
understand customer sentiment and track product performance.

9. Allows for better understanding

Data crowdsourcing can help with better understanding by gathering data
from a large number of sources. This can be used to improve customer
service, product development, and more. For example, data crowdsourcing
can be used to gather data about customer sentiment or trends.

10. Allows for better customer service

Data crowdsourcing can help with better customer service by gathering


feedback from customers about their experiences. Data crowdsourcing can
also help identify patterns and trends in customer service interactions, which
can help improve the quality of customer service.

11. Allows for better understanding of customer needs

Data crowdsourcing can help with understanding customer needs and can be
used to improve customer service. It can also help identify trends and patterns
in customer data which can help businesses improve their services and
products.

12. Allows for improved product quality

Data crowdsourcing can improve product quality by identifying duplicate


products, business location data and other product information. It can also be
used to identify problems with products early on, before they cause major
issues. For example, if a company is considering adding a new feature to its
product, it can use data crowdsourcing to gauge customer reaction and get
feedback on whether the feature is actually useful.

Advantages of Crowdsourcing
1. Evolving Innovation: Innovation is required everywhere and in this advancing
world innovation has a big role to play. Crowdsourcing helps in getting
innovative ideas from people belonging to different fields and thus helping
businesses grow in every field.

2. Save costs: There is the elimination of wastage of time of meeting people and
convincing them. Only the business idea is to be proposed on the internet and
you will be flooded with suggestions from the crowd.

3. Increased Efficiency: Crowdsourcing has increased the efficiency of business


models as several expertise ideas are also funded.

Disadvantages of Crowdsourcing
1. Lack of confidentiality: Asking for suggestions from a large group of people
can bring the threat of idea stealing by other organizations.
2. Repeated ideas: Often contestants in crowdsourcing competitions submit
repeated, plagiarized ideas which leads to time wastage as reviewing the same
ideas is not worthy.

Data quality in data crowdsourcing


Data quality refers to the accuracy and completeness of data, as well as to preventing errors from occurring. Reduced accuracy can be introduced when participants transcribe or expand abbreviations inconsistently, while reduced completeness can arise when data is missing or incorrect. To overcome these issues, crowdsourcing can enlist the help of a large number of individuals whose contributions cross-check one another, which helps projects catch errors made by any single participant.

Quality control methods


Crowdsourcing is a method of obtaining input from a large group of people. Quality
control methods, such as proofreading and validation, are essential in order to
ensure that the data collected through crowdsourcing is accurate and meets
customer expectations. By harnessing the power of a large group of people,
crowdsourcing can be extremely effective in gathering data. However, like any form
of collaboration, there are certain risks associated with using this method. One such
risk is bias; because crowdsourced data is typically gathered by individuals who have
an interest in the subject matter at hand, it can be susceptible to bias.

Additionally, due to the way this type of data is typically collected (i.e., through individual submissions), it often suffers from the founder effect: because early contributions tend to come from those who initiated or own the project itself, projects that begin as popular or well-known attract far more contributions than projects that start off relatively unknown or less popular.

Finally, due to its open-ended nature, crowdsourcing can also be prone to errors
caused by hypercorrection – normalizing words that look misspelled in the original
submission – as well as reviewer fatigue: when reviewers see submissions from many
different users all at once rather than one after another over time, it can be harder
for them to spot mistakes that “look” correct. Despite these risks, crowdsourcing can
be an extremely effective way to gather data if used in conjunction with quality
control methods.
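
As a concrete illustration of such quality control, the short Python sketch below scores each contributor against a small set of gold (known-answer) control tasks and flags contributors whose accuracy falls below a chosen threshold. The contributor names, gold answers, and the 0.8 threshold are assumptions made for the example, not part of any specific platform.

# Hypothetical gold-standard answers for a few control tasks
gold = {"q1": "yes", "q2": "no", "q3": "yes"}

# Hypothetical submissions: contributor -> {task_id: answer}
submissions = {
    "worker_a": {"q1": "yes", "q2": "no", "q3": "yes"},
    "worker_b": {"q1": "no", "q2": "no", "q3": "no"},
}

MIN_ACCURACY = 0.8  # assumed quality bar for this example

for worker, answers in submissions.items():
    scored = [task for task in gold if task in answers]
    correct = sum(answers[task] == gold[task] for task in scored)
    accuracy = correct / len(scored) if scored else 0.0
    verdict = "kept" if accuracy >= MIN_ACCURACY else "excluded from the final dataset"
    print(f"{worker}: accuracy={accuracy:.0%} -> {verdict}")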

Processing and accessing results


Data quality is an important consideration when processing and accessing results
from data sets. Improving data quality can reduce costs associated with inaccurate or

outdated information, as well as prevent disasters from happening in the first place. It
is important to use results from crowdsourced data sets to improve data quality.

How to crowdsource data – Best practices for successful data


crowdsourcing
Crowdsourcing can be one of the best ways to generate a large amount of diverse
data. However, there are a few points to be kept in mind while executing this
process.

1. Establish clear goals

When planning a data crowdsourcing project, it is important to have clear goals in


mind. These goals will help determine the target audience and platform for the
project. Once these factors are considered, the project can be successfully
implemented.

2. Choose your target participants

To successfully crowdsource data, you must first determine the type of data to be
collected and the participants who will be collecting it. The platform you use should
be easy to use and allow participants to easily share their data. The compensation
method for participants should be fair and incentive-based.

3. Decide on the type of data you need

To crowdsource data successfully, first determine what type of data needs to be


collected and who will be collecting it. Then, create a platform for registering
participants, sharing data, and managing the crowd. Once the platform is set up,
provide instructions for gathering the data and create a compensation system. After
that, choose a data labelling team that uses appropriate tools for the task at hand.
Finally, you need to evaluate a data labelling platform before you commit to it by
looking at client logos, testimonials, and case studies to get a good idea of the
quality of the service. Make sure to understand the security protocols and measures
in place to prevent data theft and leaks.

4. Encourage participation from a diverse range of participants

To successfully crowdsource data, it is important to be sensitive to the diversity of


participants and encourage them to contribute their voices. For example,
encouraging participation from a diverse range of people by being aware of their
language and cultural preferences when writing projects or communicating with
them directly will make sure all messages are easily understood by all participants,
regardless of their language proficiency or cultural background.

5. Reward participants for their contributions

Rewards can play an important role in motivating participants to contribute quality
work, even when working remotely. Rewards can be given to participants for their
contributions in a variety of ways, depending on the project. Rewards can help
motivate participants to produce high-quality work, even when working remotely.
Rewards should be aligned with the project’s values and participant motivations in
order to respect and reward participants.

6. Disclose any financial compensation that participants may receive

When conducting data crowdsourcing, it is important to disclose any financial


compensation that participants may receive. This allows them to feel comfortable
participating in the process and ensures that the data is collected ethically.

7. Take care to protect participants’ data

To protect participants’ data and avoid common mistakes, follow these tips:

 Keep your data safe. Use secure methods to store your data, such as
encrypting it with a strong password. Make sure to keep updated backups of
your data in case of accidents or malicious attacks.

 Make sure your dataset is properly licensed. If you are using public or open
datasets, make sure that the license agreement allows others to use the data
without limitations.

 Be clear about who can access the dataset and what rights they have. Clearly
label any datasets that you make available so that others know how to use
them safely and legally.

 Follow standard privacy policies and practices when sharing your data with
other researchers or users. Make sure that all users understand the terms of
use before using them, and take appropriate measures to protect their privacy
if required by law or regulations

8. Monitor and track participant participation

To ensure the quality of data crowdsourced from participants, a variety of quality


control methods must be in place.

9. Terminate participation when goals have been reached

When goals have been reached, it is important to terminate participation for ethical reasons. This preserves the agreed-upon use of the data and maintains a humanized, respectful view of the participants whose contributions are assembled in the dataset.

What Is Real Estate Crowdsourcing?
Real estate crowdfunding allows everyday individuals the opportunity to invest in commercial real estate by purchasing just a portion of a development. It is a relatively new way to invest in commercial real estate and relieves investors of the hassle of owning, financing, and managing properties.

Does Netflix Use Crowdsourcing?

Yes. Netflix uses crowdsourcing to help improve its entertainment platform. Most
notably, in 2006, it launched the Netflix Prize competition to see who could improve
Netflix's algorithm to predict user viewing recommendations and offered the winner
$1 million.

How Does Amazon Mechanical Turk Use Crowdsourcing?

Amazon's Mechanical Turk (MTurk or AMT) is a crowdsourcing marketplace that


businesses or researchers can use to outsource parts of their jobs, everything from
data validation to finding survey respondents to content moderation. Anyone can
sign up through their Amazon account to be a Mechanical Turk Worker.

The Bottom Line

Especially as the nature of work shifts more towards an online, virtual environment,
crowdsourcing provides many benefits for companies that are seeking innovative
ideas from a large group of individuals, hoping to better their products or services. In
addition, crowdsourcing niches from real estate to philanthropy are beginning to
proliferate and bring together communities to achieve a common goal.

Inter and trans-firewall analytics.


In the realm of network security, “inter-firewall” and “trans-firewall” analytics refer to
two distinct approaches to analyzing network traffic and identifying threats.

While both involve analyzing data, they differ in their scope and methodology:

Inter-firewall analytics
 Focus: Analyzes traffic flows between different firewalls within a network.

 Methodology: Utilizes data collected from multiple firewalls to identify


anomalies and potential breaches.

 Benefits: Provides a comprehensive view of network traffic flow and helps


identify lateral movement across different security zones.

 Limitations: Requires deployment of multiple firewalls within the network and
efficient data exchange mechanisms between them.

Trans-firewall analytics
 Focus: Analyzes encrypted traffic that traverses firewalls, which traditional
security solutions may not be able to decrypt and inspect.

 Methodology: Uses deep packet inspection (DPI) and other advanced


techniques to analyze the content of encrypted traffic without compromising
its security.

 Benefits: Provides insight into previously hidden threats within encrypted


traffic and helps detect sophisticated attacks.

 Limitations: Requires specialized hardware and software solutions for DPI,


and raises concerns regarding potential data privacy violations.

Difference between inter- and trans-firewall analytics

Feature        Inter-Firewall Analytics                                   Trans-Firewall Analytics

Focus          Network traffic flow between firewalls                     Content of encrypted traffic

Methodology    Analyzes data from multiple firewalls                      Uses DPI and other techniques to analyze encrypted traffic

Benefits       Comprehensive view of network traffic,                     Detects threats within encrypted traffic
               identifies lateral movement

Limitations    Requires multiple firewalls and efficient data exchange    Requires specialized hardware and software, raises privacy concerns

Choosing the right approach

The choice between inter-firewall and trans-firewall analytics depends on several
factors, including:

 Network size and complexity: Larger and more complex networks benefit
more from inter-firewall analytics for comprehensive monitoring.

 Security needs and threats: Trans-firewall analytics is crucial for networks


handling sensitive data and facing advanced threats.

 Budget and resources: Implementing trans-firewall analytics requires


additional investment in specialized hardware and software.

Here's a more detailed look at inter and trans-firewall analytics:


Inter-Firewall Analytics:

 Definition: Inter-firewall analytics involve analyzing data that passes through


multiple firewalls within a single network or domain.

 Use Cases: Inter-firewall analytics are commonly used in large enterprises with
complex network architectures, where data must traverse multiple firewalls to
reach different segments of the network. For example, organizations may
analyze network traffic logs from various firewall appliances to detect and
investigate security incidents, monitor user activities, or optimize network
performance.
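
As a deliberately simplified illustration of inter-firewall analytics, the Python sketch below correlates connection logs gathered from two hypothetical firewalls and flags internal hosts that appear as a destination on one firewall and shortly afterwards as a source on another, a rough proxy for lateral movement. The log format, firewall names, and time window are assumptions for the sketch and do not reflect any particular vendor's log schema.

# Hypothetical, already-parsed connection logs from two firewalls in the same network.
# Each entry: (timestamp, source_ip, destination_ip)
fw_dmz_log = [
    (100, "203.0.113.7", "10.0.1.15"),   # external host reaches a DMZ server
]
fw_internal_log = [
    (160, "10.0.1.15", "10.0.2.42"),     # the same DMZ server then reaches an internal host
    (200, "10.0.2.9", "10.0.2.10"),
]

def flag_lateral_movement(upstream_log, downstream_log, window=300):
    """Flag hosts that were a destination upstream and become a source downstream shortly after."""
    alerts = []
    for t1, _, dst in upstream_log:
        for t2, src, internal_dst in downstream_log:
            if src == dst and 0 <= t2 - t1 <= window:
                alerts.append((src, internal_dst, t2))
    return alerts

for host, target, ts in flag_lateral_movement(fw_dmz_log, fw_internal_log):
    print(f"possible lateral movement: {host} -> {target} at t={ts}")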

Trans-Firewall Analytics:

 Definition: Trans-firewall analytics involve analyzing data that crosses


network boundaries between different networks or domains separated by
firewalls.

 Use Cases: Trans-firewall analytics are relevant in scenarios where data needs
to be exchanged securely between different organizations, cloud
environments, or remote locations. For instance, organizations may analyze
data exchanged between on-premises systems and cloud-based applications,
conduct threat intelligence sharing with external partners, or monitor traffic
between different branch offices connected via virtual private networks
(VPNs).

Key Considerations for Inter and Trans-Firewall Analytics:


 Security: Security is paramount when analyzing data across firewall
boundaries. Organizations must implement encryption, access controls, and

secure communication protocols to protect data in transit and prevent
unauthorized access or interception.

 Compliance: Organizations must ensure that inter and trans-firewall analytics


solutions comply with relevant data protection regulations, industry standards,
and internal security policies. This may include requirements related to data
privacy, confidentiality, and integrity.

 Performance: Analyzing data across firewall boundaries may introduce


latency and impact performance. Organizations should optimize network
configurations, use efficient data transfer protocols, and leverage distributed
computing architectures to minimize latency and improve processing speed.

 Scalability: Inter and trans-firewall analytics solutions should be scalable to


accommodate growing data volumes, increasing network traffic, and evolving
business requirements. Scalable architectures, such as distributed computing
platforms or cloud-based analytics services, can help organizations handle
large-scale analytics workloads effectively.

 Visibility and Monitoring: Organizations must have visibility into network


traffic, data flows, and security events across firewall boundaries. Real-time
monitoring, logging, and analysis of network traffic logs, firewall logs, and
security events enable organizations to detect anomalies, respond to security
incidents, and ensure compliance with security policies.

Overall, inter and trans-firewall analytics are essential for organizations operating in
distributed environments to securely analyze data across network boundaries and
derive actionable insights while maintaining data security, privacy, and compliance.
By implementing robust security measures, optimizing network performance, and
leveraging scalable analytics solutions, organizations can effectively harness the
power of data analytics across firewall boundaries to drive business value and
innovation.

Introduction to NoSQL

NoSQL is a type of database management system (DBMS) that is designed to handle
and store large volumes of unstructured and semi-structured data. Unlike traditional
relational databases that use tables with pre-defined schemas to store data, NoSQL
databases use flexible data models that can adapt to changes in data structures and
are capable of scaling horizontally to handle growing amounts of data.

The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but


the term has since evolved to mean “not only SQL,” as NoSQL databases have
expanded to include a wide range of different database architectures and data
models.

NoSQL stands for “Not Only SQL” or “Not SQL.” Though a better term would be “NoREL”, NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.

Traditional RDBMSs use SQL syntax to store and retrieve data for further insights. A NoSQL database system, by contrast, encompasses a wide range of database technologies that can store structured, semi-structured, unstructured, and polymorphic data.

What is a NoSQL database?


When people use the term “NoSQL database”, they typically use it to refer to any
non-relational database. Some say the term “NoSQL” stands for “non SQL” while

others say it stands for “not only SQL”. Either way, most agree that NoSQL databases
are databases that store data in a format other than relational tables.

Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response
time becomes slow when you use RDBMS for massive volumes of data.

To resolve this problem, we could “scale up” our systems by upgrading our existing
hardware. This process is expensive.

The alternative for this issue is to distribute database load on multiple hosts
whenever the load increases. This method is known as “scaling out.”

NoSQL database is non-relational, so it scales out better than relational databases as


they are designed with web applications in mind.

Brief history of NoSQL databases


NoSQL databases emerged in the late 2000s as the cost of storage dramatically
decreased. Gone were the days of needing to create a complex, difficult-to-manage
data model in order to avoid data duplication. Developers (rather than storage) were
becoming the primary cost of software development, so NoSQL databases optimized
for developer productivity.

As storage costs rapidly decreased, the amount of data that applications needed to
store and query increased. This data came in all shapes and sizes — structured, semi-
structured, and polymorphic — and defining the schema in advance became nearly
impossible. NoSQL databases allow developers to store huge amounts of
unstructured data, giving them a lot of flexibility.

Additionally, the Agile Manifesto was rising in popularity, and software engineers
were rethinking the way they developed software. They were recognizing the need to
rapidly adapt to changing requirements. They needed the ability to iterate quickly
and make changes throughout their software stack — all the way down to the
database. NoSQL databases gave them this flexibility.

Cloud computing also rose in popularity, and developers began using public clouds
to host their applications and data. They wanted the ability to distribute data across
multiple servers and regions to make their applications resilient, to scale out instead
of scale up, and to intelligently geo-place their data. Some NoSQL databases like
MongoDB provide these capabilities.

 1998- Carlo Strozzi used the term NoSQL for his lightweight, open-source relational database

 2000- Graph database Neo4j is launched

 2004- Google BigTable is launched

 2005- CouchDB is launched

 2007- The research paper on Amazon Dynamo is released

 2008- Facebook open-sources the Cassandra project

 2009- The term NoSQL was reintroduced

Key Features of NoSQL:


1. Dynamic schema: NoSQL databases do not have a fixed schema and can
accommodate changing data structures without the need for migrations or
schema alterations.

2. Horizontal scalability: NoSQL databases are designed to scale out by adding


more nodes to a database cluster, making them well-suited for handling large
amounts of data and high levels of traffic.

3. Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model, where data is stored in a semi-structured format, such as JSON or BSON.

4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value
data model, where data is stored as a collection of key-value pairs.

5. Column-based: Some NoSQL databases, such as Cassandra, use a column-


based data model, where data is organized into columns instead of rows.

6. Distributed and high availability: NoSQL databases are often designed to be


highly available and to automatically handle node failures and data replication
across multiple nodes in a database cluster.

7. Flexibility: NoSQL databases allow developers to store and retrieve data in a


flexible and dynamic manner, with support for multiple data types and
changing data structures.

8. Performance: NoSQL databases are optimized for high performance and can
handle a high volume of reads and writes, making them suitable for big data
and real-time applications.

Types of NoSQL Databases


A database is a collection of structured data or information which is stored in a
computer system and can be accessed easily. A database is usually managed by a
Database Management System (DBMS).

NoSQL is a non-relational database that is used to store the data in the nontabular
form. NoSQL stands for not only SQL. The main types are documents, key-value,
wide-column, and graphs.

Types of NoSQL Database:

1. Document-based databases: Examples – MongoDB, CouchDB, Cloudant

2. Key-value stores: Examples – Memcached, Redis, Coherence

3. Column-oriented databases: Examples – Hbase, Big Table, Accumulo

4. Graph-based databases: Examples – Amazon Neptune, Neo4j

1. Document-Based Database:

The document-based database is a nonrelational database. Instead of storing the


data in rows and columns (tables), it uses the documents to store the data in the
database. A document database stores data in JSON, BSON, or XML documents.

Documents can be stored and retrieved in a form that is much closer to the data
objects used in applications which means less translation is required to use these
data in the applications. In the Document database, the particular elements can be
accessed by using the index value that is assigned for faster querying.

Collections are groups of documents that store documents with similar contents. Documents within a collection are not required to share an identical schema, because document databases have a flexible schema.

Key features of documents database:

 Flexible schema: Documents in the database have a flexible schema, which means the documents in a collection need not all follow the same schema.

 Faster creation and maintenance: the creation of documents is easy and


minimal maintenance is required once we create the document.

 No foreign keys: There is no dynamic relationship between two documents
so documents can be independent of one another. So, there is no requirement
for a foreign key in a document database.

 Open formats: To build a document we use XML, JSON, and others.

[
  {
    "product_title": "Codecademy SQL T-shirt",
    "description": "SQL > NoSQL",
    "link": "https://shop.codecademy.com/collections/student-swag/products/sql-tshirt",
    "shipping_details": {
      "weight": 350,
      "width": 10,
      "height": 10,
      "depth": 1
    },
    "sizes": ["S", "M", "L", "XL"],
    "quantity": 101010101010,
    "pricing": {
      "price": 14.99
    }
  }
]
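
To show how such a document might be stored and queried in practice, here is a brief sketch using the pymongo driver for MongoDB. It assumes a MongoDB server is running locally on the default port and uses a hypothetical database and collection name; it is an illustration of the document model rather than part of the Codecademy example above.

from pymongo import MongoClient

# Assumes a local MongoDB instance on the default port (hypothetical setup).
client = MongoClient("mongodb://localhost:27017")
products = client["shop_demo"]["products"]  # hypothetical database/collection names

# Store a whole product document, including nested fields, in one write.
products.insert_one({
    "product_title": "Codecademy SQL T-shirt",
    "sizes": ["S", "M", "L", "XL"],
    "pricing": {"price": 14.99},
})

# Query by a nested field and by an array element; no joins or fixed schema required.
print(products.find_one({"pricing.price": {"$lt": 20}}))
print(products.find_one({"sizes": "M"}))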

2. Key-Value Stores:

A key-value store is a nonrelational database. The simplest form of a NoSQL


database is a key-value store. Every data element in the database is stored in key-
value pairs. The data can be retrieved by using a unique key allotted to each element

in the database. The values can be simple data types like strings and numbers or
complex objects.

A key-value store is like a relational database with only two columns: the key and the value.

Key features of the key-value store:

 Simplicity.

 Scalability.

 Speed.

Key Value

customer-123 {“address”: “…”, name: “…”, “preferences”: {…}}

customer-456 {“address”: “…”, name: “…”, “preferences”: {…}}
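
The following minimal Python sketch mimics the table above with a toy in-memory key-value store: values are opaque blobs that can only be stored and retrieved by their key. In practice a system such as Redis or Memcached would play this role; the class, keys, and values here are purely illustrative.

import json

class KeyValueStore:
    """A toy in-memory key-value store: opaque values addressed only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store does not interpret the value; it is serialized as an opaque blob.
        self._data[key] = json.dumps(value)

    def get(self, key):
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

store = KeyValueStore()
store.put("customer-123", {"address": "Chicago", "name": "Martin", "preferences": {"theme": "dark"}})
print(store.get("customer-123"))   # retrieval is always by the full key
print(store.get("customer-999"))   # missing key -> None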

3. Column Oriented Databases:

A column-oriented database is a non-relational database that stores the data in columns instead of rows. This means that when you want to run analytics on a small number of columns, you can read those columns directly without loading unwanted data into memory.

Columnar databases are designed to read data more efficiently and retrieve the data
with greater speed. A columnar database is used to store a large amount of data.

Key features of columnar oriented database:

 Scalability.

 Compression.

 Very responsive.
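
A rough way to picture column orientation in plain Python is to store each column as its own array, so that an aggregate over one column never touches the others. This is only a conceptual sketch of the idea, not how Cassandra or HBase are actually implemented.

# Row-oriented layout: each record stored together.
rows = [
    {"order_id": 1, "region": "North", "amount": 120.0},
    {"order_id": 2, "region": "South", "amount": 75.5},
    {"order_id": 3, "region": "North", "amount": 210.0},
]

# Column-oriented layout: one array per column.
columns = {
    "order_id": [1, 2, 3],
    "region": ["North", "South", "North"],
    "amount": [120.0, 75.5, 210.0],
}

# An analytic query over a single column reads only that column's array.
total_sales = sum(columns["amount"])
print("total sales:", total_sales)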

4. Graph-Based databases:

Graph-based databases focus on the relationship between the elements. It stores the
data in the form of nodes in the database. The connections between the nodes are
called links or relationships.

Key features of graph database:

 In a graph-based database, it is easy to identify the relationship between the
data by using the links.

 The Query’s output is real-time results.

 The speed depends upon the number of relationships among the database
elements.

 Updating data is also easy, as adding a new node or edge to a graph database
is a straightforward task that does not require significant schema changes.
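
As a conceptual sketch, the plain-Python adjacency structure below models nodes and named relationships and answers a simple friends-of-friends style question. Real graph databases such as Neo4j do this with dedicated storage and query languages (for example, Cypher); the node and relationship names here are illustrative.

# Nodes and directed, labeled edges: node -> list of (relationship, target) pairs.
graph = {
    "Alice": [("FRIENDS_WITH", "Bob")],
    "Bob":   [("FRIENDS_WITH", "Carol"), ("WORKS_AT", "Acme")],
    "Carol": [("WORKS_AT", "Acme")],
}

def neighbors(node, relationship):
    """Follow edges of one relationship type from a node."""
    return [target for rel, target in graph.get(node, []) if rel == relationship]

# Friends of Alice's friends (a typical graph traversal).
friends_of_friends = {
    fof
    for friend in neighbors("Alice", "FRIENDS_WITH")
    for fof in neighbors(friend, "FRIENDS_WITH")
}
print(friends_of_friends)  # {'Carol'}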

Advantages of NoSQL
There are many advantages of working with NoSQL databases such as MongoDB and
Cassandra. The main advantages are high scalability and high availability.

1. High scalability: NoSQL databases use sharding for horizontal scaling. Sharding means partitioning the data and placing it on multiple machines in such a way that the order of the data is preserved. Vertical scaling means adding more resources to the existing machine, whereas horizontal scaling means adding more machines to handle the data. Vertical scaling is not easy to implement, but horizontal scaling is; examples of horizontally scaling databases are MongoDB, Cassandra, etc. Because of this scalability, NoSQL can handle a huge amount of data: as the data grows, NoSQL scales itself to handle it efficiently (a minimal sharding sketch follows this list).

2. Flexibility: NoSQL databases are designed to handle unstructured or semi-


structured data, which means that they can accommodate dynamic changes
to the data model. This makes NoSQL databases a good fit for applications
that need to handle changing data requirements.

3. High availability: The auto-replication feature in NoSQL databases makes them highly available, because in case of any failure the data replicates itself back to the previous consistent state.

4. Scalability: NoSQL databases are highly scalable, which means that they can
handle large amounts of data and traffic with ease. This makes them a good fit
for applications that need to handle large amounts of data or traffic

5. Performance: NoSQL databases are designed to handle large amounts of


data and traffic, which means that they can offer improved performance
compared to traditional relational databases.

6. Cost-effectiveness: NoSQL databases are often more cost-effective than


traditional relational databases, as they are typically less complex and do not
require expensive hardware or software.

7. Agility: Ideal for agile development.
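
Referring back to advantage 1, here is a minimal Python sketch of hash-based sharding: each key is mapped to one of several shards (machines) with a stable hash, so data and load spread horizontally as machines are added. The shard list and keys are hypothetical, and production systems typically use more sophisticated schemes such as consistent hashing or range-based sharding.

import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical machines in the cluster

def shard_for(key: str) -> str:
    """Map a key to a shard using a stable hash (simple modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for customer_id in ["customer-123", "customer-456", "customer-789"]:
    print(customer_id, "->", shard_for(customer_id))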

Disadvantages of NoSQL
NoSQL has the following disadvantages.

1. Lack of standardization: There are many different types of NoSQL


databases, each with its own unique strengths and weaknesses. This lack of

standardization can make it difficult to choose the right database for a specific
application

2. Lack of ACID compliance: NoSQL databases are not fully ACID-compliant,


which means that they do not guarantee the consistency, integrity, and
durability of data. This can be a drawback for applications that require strong
data consistency guarantees.

3. Narrow focus: NoSQL databases have a very narrow focus as it is mainly


designed for storage but it provides very little functionality. Relational
databases are a better choice in the field of Transaction Management than
NoSQL.

4. Open-source: NoSQL is largely an open-source ecosystem, and there is no reliable standard for NoSQL yet. In other words, two database systems are likely to be quite different from one another.

5. Lack of support for complex queries: NoSQL databases are not designed to
handle complex queries, which means that they are not a good fit for
applications that require complex data analysis or reporting.

6. Lack of maturity: NoSQL databases are relatively new and lack the maturity
of traditional relational databases. This can make them less reliable and less
secure than traditional databases.

7. Management challenge: The purpose of big data tools is to make the


management of a large amount of data as simple as possible. But it is not so
easy. Data management in NoSQL is much more complex than in a relational
database. NoSQL, in particular, has a reputation for being challenging to
install and even more hectic to manage on a daily basis.

8. GUI is not available: GUI mode tools to access the database are not flexibly
available in the market.

9. Backup: Backup is a great weak point for some NoSQL databases like
MongoDB. MongoDB has no approach for the backup of data in a consistent
manner.

10. Large document size: Some database systems like MongoDB and CouchDB
store data in JSON format. This means that documents are quite large
(BigData, network bandwidth, speed), and having descriptive key names
actually hurts since they increase the document size.

When should NoSQL be used?

When deciding which database to use, decision-makers typically find one or more of
the following factors lead them to selecting a NoSQL database:

 Fast-paced Agile development

 Storage of structured and semi-structured data

 Huge volumes of data

 Requirements for scale-out architecture

 Modern application paradigms like microservices and real-time streaming

See When to Use NoSQL Databases and Exploring NoSQL Database Examples for
more detailed information on the reasons listed above.

NoSQL database misconceptions


Over the years, many misconceptions about NoSQL databases have spread
throughout the developer community. In this section, we'll discuss two of the most
common misconceptions:

 Relationship data is best suited for relational databases.

 NoSQL databases don't support ACID transactions.

Aggregate data models


We know that NoSQL databases store data in a format other than relational tables, and NoSQL is used in nearly every industry nowadays. For people who interact with data in databases, the aggregate data model helps with that interaction.

What are Aggregate Data Models in NoSQL?


Aggregate means a collection of objects that are treated as a unit. In NoSQL
Databases, an aggregate is a collection of data that interact as a unit. Moreover,
these units of data or aggregates of data form the boundaries for the ACID
operations.

Aggregate Data Models in NoSQL make it easier for the Databases to manage data
storage over the clusters as the aggregate data or unit can now reside on any of the

machines. Whenever data is retrieved from the Database all the data comes along
with the Aggregate Data Models in NoSQL.

Aggregate Data Models in NoSQL don’t support ACID transactions and sacrifice one
of the ACID properties. With the help of Aggregate Data Models in NoSQL, you can
easily perform OLAP operations on the Database.

You can achieve high efficiency with aggregate data models in a NoSQL database if the data transactions and interactions take place within the same aggregate.

Aggregate Data Models:


The term aggregate means a collection of objects that we use to treat as a unit. An
aggregate is a collection of data that we interact with as a unit. These units of data or
aggregates form the boundaries for ACID operation.

Types of Aggregate Data Models in NoSQL Databases


The Aggregate Data Models in NoSQL are majorly classified into 4 Data Models
listed below:

1. Key-Value Model

The Key-Value Data Model uses a key or an ID to access or fetch the data of the aggregate corresponding to that key. In this aggregate data model, the value of the aggregate is opaque to the database and is stored and retrieved as a whole using its key.

Use Cases:

 These Aggregate Data Models in NoSQL Database are used for storing the
user session data.

 Key Value-based Data Models are used for maintaining schema-less user
profiles.

 It is used for storing user preferences and shopping cart data.

2. Document Model

The Document Data Model allows access to the parts of an aggregate. The database stores and retrieves documents, which can be XML, JSON, BSON, etc. Unlike the key-value model, the contents of a document can be queried, although there are some restrictions on the data structure and data types of the aggregates used in this model.

Use Cases:

 Document Data Models are widely used in E-Commerce platforms

 It is used for storing data from content management systems.

 Document Data Models are well suited for Blogging and Analytics platforms.

3. Column Family Model

The column-family model is an aggregate data model with a BigTable-style structure; such databases are often referred to as column stores. It is also called a two-level map, as it offers a two-level aggregate structure: the first level contains keys that act as row identifiers used to select the aggregate data, while the second-level values are referred to as columns.

Use Cases:

 Column Family Data Models are used in systems that maintain counters.

 These Aggregate Data Models in NoSQL are used for services that have
expiring usage.

 It is used in systems that have heavy write requests.

4. Graph-Based Model

Graph-based data models store data in nodes that are connected by edges. These
Aggregate Data Models in NoSQL are widely used for storing the huge volumes
of complex aggregates and multidimensional data having many interconnections
between them.

Use Cases:

 Graph-based Data Models are used in social networking sites to store


interconnections.

 It is used in fraud detection systems.

 This Data Model is also widely used in Networks and IT operations.

Example of Aggregate Data Model:


Steps to Build Aggregate Data Models in NoSQL Databases
Now that you have a brief understanding of aggregate data models in NoSQL databases, this section walks through an example of how to design one. For this, the data model of an E-Commerce website will be used to explain aggregate data models in NoSQL.

This example of the E-Commerce Data Model has two main aggregates – customer
and order. The customer contains data related to billing addresses while the order
aggregate consists of ordered items, shipping addresses, and payments. The
payment also contains the billing address.


If you notice, a single logical address record appears three times in the data, and its value is copied wherever it is used. The whole address can be copied into an aggregate as needed. There is no pre-defined rule for drawing the aggregate boundaries; it depends solely on how you want to manipulate the data.

The Data Model for customer and order would look like this.

// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}

In this aggregate data model, if you want to access a customer along with all of the customer's orders at once, then designing a single aggregate is preferable. But if you want to access a single order at a time, then you should have separate aggregates for each order. It is very context-specific.

Here the diagram has two aggregates:

 Customer and Order, with the link between them representing an aggregate relationship.

 The diamond shows how data fits into the aggregate structure.

 Customer contains a list of billing addresses.

 Payment also contains the billing address.

 The address appears three times and is copied each time.

 The model fits a domain where we don't expect the shipping and billing addresses to change.

Consequences of Aggregate Orientation:


 Aggregation is not a logical data property; it is all about how the data is being used by applications.

 An aggregate structure may be an obstacle for others but help with some data
interactions.

 It has an important consequence for transactions.

 NoSQL databases don’t support ACID transactions thus sacrificing consistency.

 Aggregate-oriented databases support the atomic manipulation of a single
aggregate at a time.

Here are some key aspects of aggregate data models:


1. Aggregation Levels:

Aggregate data models define different levels of aggregation based on the granularity of the data being analyzed. This could include aggregating data at various levels, such as daily, weekly, monthly, or yearly, or aggregating data by different dimensions, such as region, product category, or customer segment (a small roll-up sketch in Python follows this list).

2. Summary Statistics:

Aggregate data models often include summary statistics or metrics calculated


from the underlying data, such as counts, sums, averages, min/max values, or
other aggregate functions. These summary statistics provide insights into the
overall trends, patterns, and characteristics of the data across different
aggregation levels or dimensions.

3. Dimensional Modeling:

Aggregate data models often use dimensional modeling techniques to organize


data into facts and dimensions. Facts represent the measurable data points or
metrics being analyzed, while dimensions represent the contextual attributes or
categories by which data is analyzed and aggregated. Dimensional modeling
enables efficient querying and analysis of aggregated data by providing a
structured and intuitive way to navigate and slice data across different
dimensions.

4. Data Cubes:

Aggregate data models can be represented using data cube structures, where
data is organized into multi-dimensional arrays or matrices. Each dimension of
the cube represents a different attribute or category, and cells within the cube
contain aggregated data values. Data cubes enable efficient multidimensional
analysis and slicing-and-dicing of data along multiple dimensions.

5. Pre-Aggregated Views:

In some cases, aggregate data models may include pre-aggregated views or


materialized views that store pre-computed aggregate values to improve query
performance. These pre-aggregated views can be precomputed and stored ahead
of time to avoid the need for costly aggregation calculations during query
execution, resulting in faster query response times.

6. OLAP (Online Analytical Processing):


Aggregate data models are commonly used in OLAP systems, which are designed
for interactive analysis of aggregated and summarized data. OLAP systems
support multidimensional analysis, ad-hoc querying, and drill-down/drill-up
capabilities, enabling users to explore and analyze data from different
perspectives and levels of granularity.

7. Performance Optimization:

Aggregate data models are optimized for query performance, as they allow
organizations to pre-compute and store aggregated data values, use efficient
indexing strategies, and leverage query optimization techniques to accelerate
query processing and analysis. By summarizing data at higher levels of
granularity, aggregate data models can reduce the amount of data that needs to
be processed during query execution, resulting in faster query response times.
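
As a small illustration of aggregation levels and roll-up (see item 1 above), the Python sketch below rolls hypothetical daily sales records up to per-region monthly totals, and then further up to monthly totals, using only the standard library. The records and dimension names are invented for the example.

from collections import defaultdict

# Hypothetical raw fact records: (date, region, sales_amount)
daily_sales = [
    ("2024-01-03", "North", 120.0),
    ("2024-01-17", "North", 80.0),
    ("2024-01-09", "South", 200.0),
    ("2024-02-02", "North", 150.0),
]

# Roll up from daily grain to (month, region) grain.
monthly_by_region = defaultdict(float)
for date, region, amount in daily_sales:
    month = date[:7]  # "YYYY-MM"
    monthly_by_region[(month, region)] += amount

for (month, region), total in sorted(monthly_by_region.items()):
    print(f"{month} {region}: {total}")

# Rolling up further (dropping the region dimension) gives monthly totals.
monthly_total = defaultdict(float)
for (month, _), total in monthly_by_region.items():
    monthly_total[month] += total
print(dict(monthly_total))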

Advantages:
 It can be used as a primary data source for online applications.

 Easy replication.

 No single point of failure.

 It provides fast performance and horizontal scalability.

 It can handle structured, semi-structured, and unstructured data with equal
effort.

Disadvantages:
 No standard rules.

 Limited query capabilities.

 Doesn’t work well with highly relational data.

 Not so popular in the enterprise.

 As the amount of data grows, it becomes difficult to maintain unique keys.

Aggregates
In the context of databases and data modeling, aggregates refer to summarized or
aggregated data values derived from underlying raw data. Aggregates are calculated
using aggregate functions, which perform mathematical operations on sets of data
to produce single values or summaries. These aggregated values provide insights
into the overall trends, patterns, and characteristics of the underlying data. Here are
some key points about aggregates:

1. Types of Aggregates:

Aggregates can take various forms, including counts, sums, averages,
minimum and maximum values, standard deviations, and other statistical
measures. These aggregates can be calculated across different dimensions or
subsets of data, depending on the analysis requirements.

2. Aggregation Functions:

Aggregate functions, such as COUNT, SUM, AVG, MIN, MAX, and STDDEV, are
used to calculate aggregates from sets of data. These functions operate on
columns or expressions within a database query and produce single values
representing the aggregated result.

3. Grouping and Aggregation:

Aggregates are often calculated in conjunction with grouping operations,
where data is grouped into subsets based on one or more dimensions or
attributes. Aggregates are then calculated separately for each group, allowing
for the analysis of data at different levels of granularity.

4. Aggregate Queries:

Aggregate queries are SQL queries that include aggregate functions to
calculate summarized data values. These queries typically involve SELECT
statements with aggregate functions, optional GROUP BY clauses for grouping
data, and optional HAVING clauses for filtering aggregated results.

5. Roll-Up and Drill-Down:

Aggregates support roll-up and drill-down operations, which allow users to
navigate hierarchies or levels of aggregation in their data. Roll-up involves
summarizing data at higher levels of aggregation, while drill-down involves
breaking down aggregated data into finer levels of detail.

6. Performance Optimization:

Aggregates can improve query performance by reducing the amount of data
that needs to be processed during query execution. Precomputed aggregates,
materialized views, and indexing strategies can be used to optimize the
performance of aggregate queries and accelerate data analysis.

7. Business Intelligence and Reporting:

Aggregates are commonly used in business intelligence (BI) and reporting
applications to provide summarized views of data for decision-making
purposes. Aggregated data values are often visualized using charts, graphs,
dashboards, and reports to communicate key insights and trends to
stakeholders.

Overall, aggregates play a crucial role in data analysis, reporting, and decision-
making by summarizing raw data into meaningful and actionable insights. Whether
it's calculating total sales revenue, average customer satisfaction scores, or monthly
website traffic, aggregates help organizations derive value from their data and make
informed business decisions based on aggregated data analysis.
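
As a small illustration of the aggregate functions, grouping, HAVING-style filtering,
and roll-up described above, the sketch below runs aggregate queries with Python's
built-in sqlite3 module; the orders table and its columns are made up for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 40.0), ("alice", 60.0), ("bob", 15.0)])

# Group rows per customer, compute aggregates, and keep only groups whose
# total exceeds 50 (HAVING filters the aggregated results, not the raw rows).
rows = conn.execute(
    "SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total, AVG(amount) AS avg_amt "
    "FROM orders GROUP BY customer HAVING SUM(amount) > 50"
).fetchall()
print(rows)  # [('alice', 2, 100.0, 50.0)]

# Roll-up: the same data summarized at the coarsest level (all customers together).
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())  # (3, 115.0)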

Key-value and document data models


What is Key Value Database?
A key-value database, also known as a key-value store, is a type of NoSQL database
that stores data as a collection of key-value pairs. Each key is a unique identifier that
is used to retrieve the corresponding value. The value can be any type of data, such
as a string, number, or object.

Key-value databases are designed for high performance and scalability, and are often
used in situations where the data does not require complex relationships or joins.
They are well suited for storing data that can be easily partitioned, such as caching
data or session data. Key-value databases are simple and easy to use, but they may
not be as suitable for complex queries or data relationships as other types of
databases such as document or relational databases.

What is Document Database?


A document database is a type of NoSQL database that stores data in the form of
documents, rather than in tables with rows and columns like a traditional relational
database. These documents can be in a variety of formats, such as JSON, BSON, or
XML. They often include nested data structures, which can make it easier to store and
query complex data. Document databases are well suited for storing semi-structured
data and are often used in web and mobile applications. They offer a flexible schema
and high performance, but may have limited querying capabilities compared to
relational databases and difficulties with data consistency and validation.

Key-Value Data Model in NoSQL
A key-value data model, or key-value store, is a non-relational type of database. It
uses an associative array as its basic structure, in which each individual key is linked
to exactly one value in a collection. Keys serve as unique identifiers for the values,
and a value can be any kind of entity. The collection of key-value pairs stored as
separate records forms a key-value database, and it does not have a predefined
structure.

How do key-value databases work?


A key-value database associates a key with a value, which may be a simple string or
a more complex entity; the key is what the application uses to look up and track that
entity. As in many programming paradigms, a key-value database resembles a map,
array, or dictionary object, except that it is stored persistently and managed by a
DBMS.

An efficient and compact index structure is used by the key-value store so that it can
quickly and reliably find a value using its key. For example, Redis is a key-value store
that keeps lists, hashes (maps), sets, and primitive types (which are simple data
structures) in a persistent database. By supporting only a predetermined number of
value types, Redis can expose a very simple interface for querying and manipulating
them and, when configured appropriately, is capable of high throughput.
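
The interface itself is tiny. The toy, in-memory Python class below (not any specific
product's API) illustrates the put/get/delete operations that real stores such as Redis
or DynamoDB expose, on top of their persistence, replication, and partitioning.

class KeyValueStore:
    """A toy key-value store: the value is opaque and reachable only via its key."""

    def __init__(self):
        self._data = {}                        # key -> value

    def put(self, key, value):
        self._data[key] = value                # insert or overwrite

    def get(self, key, default=None):
        return self._data.get(key, default)   # lookup is by key only

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["sku-1", "sku-2"]})
print(store.get("session:42"))                 # any structure can be the value
store.delete("session:42")
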
When to use a key-value database:
Here are a few situations in which you can use a key-value database:-

 User session attributes in an online app such as finance or gaming, which
require real-time random data access.

 A caching mechanism for repeatedly accessed data, or any key-based design.

 Applications built on queries that are based on keys.

Features:
 One of the simplest kinds of NoSQL data models.

 Key-value databases use simple functions for storing, retrieving, and removing
data.

 Key-value databases do not provide a dedicated query language.

 Built-in redundancy makes this database more reliable.

Advantages:
 It is very easy to use. Due to the simplicity of the database, the value can
accept any kind of data, or even different kinds when required.

 Its response time is fast due to its simplicity, provided that the surrounding
environment is well built and optimized.

 Key-value store databases are scalable vertically as well as horizontally.

 Built-in redundancy makes this database more reliable.

Disadvantages:
 As there is no standard query language for key-value databases, queries
cannot be ported from one database to a different database.

 The key-value store offers no refined querying. You cannot query the database
without a key.

Some examples of key-value databases:
Here are some popular key-value databases which are widely used:
 Couchbase: It permits SQL-style querying and full-text search.

 Amazon DynamoDB: One of the most widely used key-value databases, trusted
by a large number of users. It can easily handle a large number of requests
every day and also provides various security options.

 Riak: A distributed key-value database used to build highly available
applications.

 Aerospike: An open-source, real-time database that handles billions of
transactions.

 Berkeley DB: A high-performance, open-source database library that provides
scalability.

Document Databases in NoSQL


This section covers the Document Data Model of NoSQL, along with examples,
advantages, disadvantages, and applications of the document data model.

Document Data Model


A document data model differs considerably from other data models because it
stores data in JSON, BSON, or XML documents. In this data model, documents can be
nested inside one another, and particular elements can be indexed so that queries
run faster. Documents are often stored and retrieved in a form that is very close to
the data objects used in applications, which means very little translation is required
to use the data in an application. JSON is the native format most often used to store
and query this data.

In the document data model, each document consists of key-value pairs; below is an
example.

{
  "Name": "Yashodhra",
  "Address": "Near Patel Nagar",
  "Email": "[email protected]",
  "Contact": "12345"
}
Working of Document Data Model:
This is a semi-structured data model: a record and the data associated with it are
stored together in a single document, which means the model is not completely
unstructured. The key point is that data here is stored as documents.
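
As a brief sketch, the snippet below stores and retrieves such a document, assuming
a locally running MongoDB server and the pymongo driver; the database and
collection names ("demo", "customers") are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo"]

# Insert a document; no table schema has to be declared beforehand.
db.customers.insert_one({
    "Name": "Yashodhra",
    "Address": "Near Patel Nagar",
    "Contact": "12345",
    "Orders": [{"item": "notebook", "qty": 2}],   # nested data lives in the same document
})

# Index a particular element so queries on it run faster.
db.customers.create_index("Name")

# The record comes back as one document, close to the object the application uses.
print(db.customers.find_one({"Name": "Yashodhra"}))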

Features:
 Document Type Model: As we all know data is stored in documents rather
than tables or graphs, so it becomes easy to map things in many
programming languages.

 Flexible Schema: The overall schema is very flexible; not all documents in a
collection need to have the same fields.

 Distributed and Resilient: Document databases are designed to be distributed,
which enables horizontal scaling and distribution of data.

 Manageable Query Language: These data models provide a query language
that allows developers to perform CRUD (Create, Read, Update, Delete)
operations on the data model.

Examples of Document Data Models


 Amazon DocumentDB

 MongoDB

 Cosmos DB

 ArangoDB

 Couchbase Server

 CouchDB

Advantages:

 Schema-less: These databases are very good at retaining existing data at
massive volumes because there are absolutely no restrictions on the format
and structure of data storage.

 Faster creation of documents and easier maintenance: It is very simple to
create a document, and maintenance requires almost no effort.

 Open formats: The build process is very simple and uses open formats such as
XML, JSON, and related forms.

 Built-in versioning: As documents grow in size, there is a chance they will also
grow in complexity; built-in versioning helps decrease conflicts.

Disadvantages:
 Weak Atomicity: It lacks support for multi-document ACID transactions. A
change in the document data model involving two collections will require us
to run two separate queries, i.e., one for each collection. This is where it breaks
atomicity requirements.

 Consistency Check Limitations: One can search collections and documents
that are not connected to an author collection, but doing so can hurt database
performance.

 Security: Nowadays many web applications lack security, which in turn results
in the leakage of sensitive data. So it becomes a point of concern, and one
must pay attention to web app vulnerabilities.

Applications of Document Data Model


 Content Management: These data models are widely used for building video
streaming platforms, blogs, and similar services, because each piece of content
is stored as a single document and the database is much easier to maintain as
the service evolves over time.

 Book Database: They are very useful for building book databases because this
data model lets us nest related data (for example, chapters or author details)
inside a single document.

 Catalog: These data models are widely used for storing and reading catalog
data because reads remain fast even when catalog items have thousands of
attributes.

 Analytics Platforms: These data models are also widely used in analytics
platforms.
Document Database VS Key Value
Document databases and key-value databases are both types of NoSQL databases,
but they have some key differences:

1. Data Storage:

 A document database stores data in the form of documents, which can
include nested data structures. Each document can have a unique structure
and can contain different fields.

 A key-value database stores data as a collection of key-value pairs, where
each key is a unique identifier and the value can be any type of data.

2. Querying:

 Document databases support more advanced querying capabilities, and often
have built-in support for indexing and searching.

 Key-value databases typically have more limited querying capabilities and may
not support advanced search or indexing features.

3. Data Modeling:

 Document databases are more flexible in terms of data modeling, and allow
for more complex data structures and relationships.

 Key-value databases have a simple data model that is based on key-value
pairs and may not support complex data structures or relationships.

4. Use cases:

 Document databases are well suited for storing semi-structured or
unstructured data, and nested data structures, such as JSON or XML
documents. They are also well suited for complex queries and data
relationships.

 Key-value databases are well suited for storing data that can be easily
partitioned, such as caching data or session data. They are simple and easy to
use, but they may not be as suitable for complex queries or data relationships
as other types of databases.


Relationships
Defining Relationships for NoSQL Databases

Relationships are associations between different collections in a database. You can
create relationships and define their object properties for NoSQL databases using
either of the following methods:

 Embedding

Embeds the related data from the collections into a single structured
collection or into multiple structured collections.

 Referencing

Relates the data in multiple collections through identifying or non-identifying
relationships (references).

Relations are the crux of any database, and relations in NoSQL databases are handled
in a completely different way compared to an SQL database. There is one very
important difference to keep in mind while building a NoSQL database: NoSQL
databases usually have a JSON-like schema. Once you’re familiar with that, handling
relations becomes a lot easier.

General recommendation for schema modelling and relationships:

 One to one relationship: embedding model preferred

 One to few relationships: embedding model preferred

 One to many relationships: referencing model preferred

 Many to many relationships: referencing model preferred

 Favour embedding unless there is a compelling reason not to

 Needing to access an object on its own is a compelling reason not to embed it


 Avoid joins and populate/lookup operations if possible, but don’t be afraid to
use them if they provide a better schema design.

 Arrays should not grow without bound. If there are more than a couple of
hundred documents on the many side, don’t embed them; if there are more
than a few thousand documents on the many side, don’t use an array of
ObjectID references.

1. One-to-One (1:1) Relationship:


In a one-to-one relationship, each record in one entity (table) is associated with
exactly one record in another entity, and vice versa. This type of relationship is
relatively rare in database design but may be used to represent scenarios where two
entities are closely related and share a one-to-one correspondence.

One to one relation, as the name suggests requires one entity to have an exclusive
relationship with another entity and vice versa. Let’s consider a simple example to
understand this relationship better…

The relationship between a user and his account. One user can have one account
associated with him and one account can have only one user associated with it.

One to one relationships can be handled in two ways…

First and the easiest one is to have just one collection, the ‘user’ collection and the
account of that particular user will be stored as an object in the user document itself.

The second way is to create another collection named account and store a reference
key (ideally the ID of the account) in the user document.

You might think why in the world would I ever need to do this?

This second (referencing) way is usually used when one of the following three
scenarios occurs (a short sketch contrasting both approaches follows the list):

 The main document is too large (MongoDB documents have a size limit of
16mb)

 When some sensitive information needs to be stored (you might not want to
return account information on every user GET request).

 When there’s an exclusive need for getting the account data without the user
data (when ‘account’ is requested you don’t want to send ‘user’ information
with it and/or when a ‘user’ is requested you don’t want to send ‘account’
information with it, even though both of them are connected)
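
Here is a minimal sketch of the two options, using plain Python dictionaries to stand
in for JSON documents (all names and values are made up):

# Option 1 - Embedding: the account lives inside the user document.
user_embedded = {
    "_id": "u1",
    "name": "Asha",
    "account": {"balance": 1200, "currency": "INR"},
}

# Option 2 - Referencing: the account is its own document; the user keeps only its ID.
account = {"_id": "a1", "balance": 1200, "currency": "INR"}
user_referenced = {"_id": "u1", "name": "Asha", "account_id": "a1"}

# With referencing, the account can be fetched (or secured) on its own,
# without dragging the user document along - and vice versa.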

2. One-to-Many (1: N) Relationship


In a one-to-many relationship, each record in one entity can be associated with one
or more records in another entity, but each record in the related entity is associated
with at most one record in the first entity. This is the most common type of
relationship in database design and is used to represent hierarchical or parent-child
relationships.

One to many relation, requires one entity to have an exclusive relationship with
another entity but the other entity can have relations with multiple other entities.
Let’s consider a simple example to understand this relationship better…

Consider, a user has multiple accounts, but each account can have a single user
associated with it (think about these accounts as bank accounts, it’ll let you
understand the example better). In this case, again there are two ways to handle it.

The first is to store an array of accounts in the user collection itself. This will let you
GET all the accounts associated with a user in a single call. MongoDB also has
features to push and pull data from an array in a document, which makes it quite
easy to add or remove accounts from the user if need be.

The second way is to create another collection named ‘account’ and store a reference
key (ideally the ID of the account) in the ‘user’ document. The reasons to do this are
the same as in the case of one to one relations.

One issue with this approach is that when a new account needs to be created for a
particular user, we need to create a new account and also update the existing user
document with the id of this new account (basically requires 2 database calls).
Obviously you can store the user ID in Account collection as well, in that way, you’ll
only need one call to create a new account but it depends on the system you’re
planning to build.

Before building the schema, it’s important that you plan out what kind of calls will be
used more in your system and plan your schema accordingly.

For example, in this case, since this is a bank application (assumption), you know
that most of the calls you’ll make would be getting a single user (while logging in
maybe) and another call to get the accounts associated with that user (when he goes
to the accounts tab maybe) and hence the above schema seems a pretty good one

for this use case. In fact, storing user_id in the accounts’ collection would be an even
better approach in this case.

Now consider another scenario, this time it’s a public forum, users can create posts
and these posts can be viewed by the public. In this case, it’s better to store user_id
in posts collection, instead of storing post_ids in users collection, since you know that
your selling point is the posts list that the users can view and hence the calls you
mostly make would be to get the posts list, with the user data associated with it
(maybe in the homepage itself, like Facebook’s timeline). This way, while updating
you wouldn’t need to update two collections.

Another scenario would be that you need both of them, that is, you need posts in
users’ data as well and users in posts data as well. This will make creating new posts a
bit slow (since you need to add IDs to the users’ collection as well), but getting data
in both cases would be fast.
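
A short sketch of the forum layout just described, assuming a locally running
MongoDB server and the pymongo driver (collection and field names are illustrative):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["forum"]

user_id = db.users.insert_one({"name": "Ravi"}).inserted_id

# Each post stores a reference to its author, so creating a post
# touches only the posts collection.
db.posts.insert_one({"user_id": user_id, "title": "Hello", "body": "First post"})
db.posts.insert_one({"user_id": user_id, "title": "Again", "body": "Second post"})

# All posts for a user come back with a single query on the reference field.
for post in db.posts.find({"user_id": user_id}):
    print(post["title"])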

3. Many-to-Many (N: M) Relationship:


In a many-to-many relationship, multiple records in one entity can be associated with
multiple records in another entity, and vice versa. This type of relationship requires
the use of a junction table or associative entity to represent the intermediate
relationship between the two entities. Many-to-many relationships are common in
scenarios where entities have a many-to-many association, such as students and
courses in a university system.

Many to many relation, doesn’t require any entity to have exclusive relations. Both
entities can have multiple relations. Let’s consider a simple example to understand
this relationship better…

Consider the relationship between users and products in an eCommerce
environment. There is a list of users and there is a list of products. Any user can buy
any product, meaning a user can buy multiple products and a product can be bought
by multiple users. In this case, there is just one ideal way to handle it.

There’ll be two collections, one a collection for users and the other a collection from
products. Whenever a user buys a product, add the ID of the product as a reference
in the user’s collection, and since the user can buy multiple products, these IDs need
to be stored as an array.

When a product needs to be updated, only that product in the product collection
needs to be updated and every user who has bought the product will automatically
get the updated product.

Obviously, when a user buys one of the products, a copy should go to another
collection (maybe a bought_items collection) so that the purchased copy doesn’t get
updated when the product in the products collection gets updated (since, ideally, you
shouldn’t make changes to already bought products). These decisions are really
related to the architecture of the application you’re building.
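
A sketch of the users/products layout just described, again assuming a local MongoDB
server and pymongo (names are illustrative): each user document keeps an array of
product ID references.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

product_id = db.products.insert_one({"name": "Keyboard", "price": 999}).inserted_id
user_id = db.users.insert_one({"name": "Meera", "purchased_product_ids": []}).inserted_id

# When the user buys the product, append its ID to the array of references.
db.users.update_one({"_id": user_id},
                    {"$addToSet": {"purchased_product_ids": product_id}})

# Resolving the references later: fetch every product whose ID is in the array.
user = db.users.find_one({"_id": user_id})
purchases = list(db.products.find({"_id": {"$in": user["purchased_product_ids"]}}))
print(len(purchases))   # 1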

4. Many-to-One (N: 1) Relationship:


In a many-to-one relationship, multiple records in one entity can be associated with a
single record in another entity. This type of relationship is essentially the inverse of a
one-to-many relationship and is used to represent scenarios where multiple
instances of one entity are related to a single instance of another entity.

5. Self-Referencing Relationship:
A self-referencing relationship occurs when an entity has a relationship with itself.
This type of relationship is used to represent hierarchical or recursive relationships
within a single entity. For example, in an organizational chart, employees may have
relationships with other employees who are their managers or subordinates.

Graph databases
What is a graph
The term “graph” comes from the field of mathematics. A graph contains a collection
of nodes and edges.

 Nodes

Nodes are vertices that store the data objects. Each node can have an
unlimited number and types of relationships.

 Edges

Edges represent relationships between nodes. For example, edges can
describe parent-child relationships, actions, or ownership. They can represent
both one-to-many and many-to-many relationships. An edge always has a
start node, end node, type, and direction.

 Properties

Each node has properties or attributes that describe it. In some cases, edges
have properties as well. Graphs with properties are also called property
graphs.

 Graph example

The following property graph shows an example of a social network graph.
Given the people (nodes) and their relationships (edges), you can find out who
the "friends of friends" of a particular person are—for example, the friends of
Howard's friends.
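
A tiny pure-Python sketch of that "friends of friends" question, with the graph stored
as adjacency sets (names are illustrative; a real graph database would answer this
with a traversal query):

friends = {
    "Howard": {"Alice", "Bob"},
    "Alice": {"Howard", "Carol"},
    "Bob": {"Howard", "Dan"},
    "Carol": {"Alice"},
    "Dan": {"Bob"},
}

def friends_of_friends(graph, person):
    """People exactly two hops away: the friends of the person's friends."""
    direct = graph.get(person, set())
    result = set()
    for friend in direct:
        result |= graph.get(friend, set())     # follow one more edge
    return result - direct - {person}          # drop the person and direct friends

print(friends_of_friends(friends, "Howard"))   # {'Carol', 'Dan'}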

Introduction to Graph Database on NoSQL

A graph database is a type of NoSQL database that is designed to handle data with
complex relationships and interconnections. In a graph database, data is stored as
nodes and edges, where nodes represent entities and edges represent the
relationships between those entities.

1. Graph databases are particularly well-suited for applications that require deep
and complex queries, such as social networks, recommendation engines, and
fraud detection systems. They can also be used for other types of applications,
such as supply chain management, network and infrastructure management,
and bioinformatics.

2. One of the main advantages of graph databases is their ability to handle and
represent relationships between entities. This is because the relationships
between entities are as important as the entities themselves, and often cannot
be easily represented in a traditional relational database.

3. Another advantage of graph databases is their flexibility. Graph databases can
handle data with changing structures and can be adapted to new use cases
without requiring significant changes to the database schema. This makes
them particularly useful for applications with rapidly changing data structures
or complex data requirements.

4. However, graph databases may not be suitable for all applications. For
example, they may not be the best choice for applications that require simple
queries or that deal primarily with data that can be easily represented in a
traditional relational database. Additionally, graph databases may require
more specialized knowledge and expertise to use effectively.

Some popular graph databases include Neo4j, OrientDB, and ArangoDB. These
databases provide a range of features, including support for different data models,
scalability, and high availability, and can be used for a wide variety of applications.

As we all know the graph is a pictorial representation of data in the form of nodes
and relationships which are represented by edges. A graph database is a type of
database used to represent the data in the form of a graph. It has three components:
nodes, relationships, and properties. These components are used to model the data.
The concept of a graph database is based on the theory of graphs and was introduced
around the year 2000. Graph databases are commonly classified as NoSQL databases
because data is stored using nodes, relationships, and properties instead of
traditional tables. A graph database is very useful for heavily interconnected data.
Here relationships between
data are given priority and therefore the relationships can be easily visualized. They
are flexible as new data can be added without hampering the old ones. They are
useful in the fields of social networking, fraud detection, AI Knowledge graphs etc.

The components are described as follows:

 Nodes: represent the objects or instances. They are equivalent to a row in a
database table. A node basically acts as a vertex in a graph. Nodes are grouped
by applying a label to each member.

 Relationships: They are basically the edges in the graph. They have a specific
direction, type and form patterns of the data. They basically establish
relationship between nodes.

 Properties: They are the information associated with the nodes.

Some examples of graph database software are Neo4j, Oracle NoSQL Database, and
GraphBase, of which Neo4j is the most popular.

In traditional databases, relationships between data are not stored explicitly. In a
graph database, relationships between data are prioritized. Nowadays most data is
interconnected, with one piece of data connected directly or indirectly to another.
Since the concept of this database is based on graph theory, it is flexible and works
very fast for associative data. Interconnected data also helps to establish further
relationships. Querying is fast as well, because with the help of relationships we can
quickly find the desired nodes. Join operations are not required in this database,
which reduces cost. Relationships and properties are stored as first-class entities in a
graph database.

Graph databases also allow organizations to connect their data with external sources.
Organizations accumulate huge amounts of data, and it often becomes cumbersome
to store it in the form of tables. For instance, if an organization wants to find a piece
of data that is connected to another piece of data in a different table, a join operation
must first be performed between the tables, and then the search is done row by row.
A graph database solves this problem: it stores the relationships and properties along
with the data, so if the organization needs to find a particular item, the nodes can be
located through their relationships and properties without joins or row-by-row
traversal. The search for nodes therefore does not depend on the total amount of
data.

Types of Graph Databases


 Property Graphs: These graphs are used for querying and analyzing data by
modelling the relationships among the data. They comprise vertices that hold
information about a particular subject and edges that denote the relationships
between them. The vertices and edges have additional attributes called
properties.

 RDF Graphs: RDF stands for Resource Description Framework. These graphs
focus more on data integration and are used to represent complex data with
well-defined semantics. Each statement is represented by three elements: two
vertices and an edge, which reflect the subject, predicate, and object of a
sentence. Every vertex and edge is identified by a URI (Uniform Resource
Identifier).
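
To make the property-graph structure concrete, here is a minimal Python
representation (the field names and data are illustrative, not any particular product's
storage format): both nodes and edges carry properties, and every edge has a start
node, an end node, a type, and a direction.

nodes = {
    1: {"label": "Person", "properties": {"name": "Asha"}},
    2: {"label": "Person", "properties": {"name": "Ravi"}},
}
edges = [
    # directed edge 1 -> 2 with its own properties
    {"start": 1, "end": 2, "type": "FRIENDS_WITH", "properties": {"since": 2020}},
]

def neighbours(node_id, edge_type):
    """Follow outgoing edges of a given type instead of joining tables."""
    return [e["end"] for e in edges if e["start"] == node_id and e["type"] == edge_type]

print(neighbours(1, "FRIENDS_WITH"))   # [2]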

When to Use Graph Database?


 Graph databases should be used for heavily interconnected data.

 It should be used when the amount of data is large and relationships are present.

 It can be used to represent the cohesive picture of the data.

What are the use cases of graph databases


Graph databases have advantages for use cases such as social networking,
recommendation engines, and fraud detection when used to create relationships
between data and quickly query these relationships.

 Fraud detection
Graph databases are capable of sophisticated fraud prevention. For example,
you can use relationships in graph databases to process financial transactions
in near-real time. With fast graph queries, you can detect that a potential
purchaser is using the same email address and credit card included in a known
fraud case. Graph databases can also help you detect fraud through
relationship patterns, such as multiple people associated with a personal email
address or multiple people sharing the same IP address but residing in
different physical locations.

 Recommendation engines
The graph model is a good choice for applications that provide
recommendations. You can store graph relationships between information
categories such as customer interests, friends, and purchase history. You can
use a highly available graph database to make product recommendations to a
user based on which products are purchased by others who have similar
interests and purchase histories. You can also identify people who have a
mutual friend but don’t yet know each other and then make a friendship
recommendation.

 Route optimization
Route optimization problems involve analyzing a dataset and finding values
that best suit a particular scenario. For example, you can use a graph database
to find the following:

 The shortest route from point A to B on a map by considering various
paths.

 The right employee for a particular shift by analyzing varied
availabilities, locations, and skills.

 The optimum machinery for operations by considering parameters like
cost and life of the equipment.

Graph queries can analyze these situations much faster because they can
count and compare the number of links between two nodes.

 Pattern discovery
Graph databases are well suited for discovering complex relationships and
hidden patterns in data. For instance, a social media company uses a graph
database to distinguish between bot accounts and real accounts. It analyzes
account activity to discover connections between account interactions and bot
activity.

 Knowledge management
Graph databases offer techniques for data integration, linked data, and
information sharing. They represent complex metadata or domain concepts in
a standardized format and provide rich semantics for natural language
processing. You can also use these databases for knowledge graphs and
master data management. For example, machine learning algorithms
distinguish between the Amazon rainforest and the Amazon brand using
graph models.

How Graph and Graph Databases Work?


Graph databases provide graph models that allow users to perform traversal queries,
since the data is connected. Graph algorithms are also applied to find patterns, paths,
and other relationships, enabling deeper analysis of the data. These algorithms help
to explore neighboring nodes, cluster vertices, and analyze relationships and
patterns. Large numbers of joins are not required in this kind of database.

How do graph analytics and graph databases work


Graph databases work using a standardized query language and graph algorithms.

 Graph query languages

Graph query languages are used to interact with a graph database. Similar
to SQL, the language has features to add, edit, and query data. However, these
languages take advantage of the underlying graph structures to process
complex queries efficiently. They provide an interface so you can ask
questions like:

 Number of hops between nodes


 Longest path/shortest path/optimal paths

 Value of nodes

Apache TinkerPop Gremlin, SPARQL, and openCypher are popular graph
query languages.

 Graph algorithms

Graph algorithms are operations that analyze relationships and behaviors in
interconnected data. For instance, they explore the distance and paths
between nodes or analyze incoming edges and neighbor nodes to generate
reports. The algorithms can identify common patterns, anomalies,
communities, and paths that connect the data elements. Some examples of
graph algorithms include:

 Clustering

Applications like image processing, statistics, and data mining use
clustering to group nodes based on common characteristics. Clustering
can be done on both inter-cluster differences and intra-cluster
similarities.

 Partitioning

You can partition or cut graphs at the node with the fewest edges.
Applications such as network testing use partitioning to find weak
spots in the network.

 Search

Graph searches or traversals can be one of two types—breadth-first or
depth-first. Breadth-first search moves from one node to the other
across the graph. It is useful in optimal path discovery. Depth-first
search moves along a single branch to find all relations of a particular
node.
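
The sketch below shows a breadth-first search in plain Python, counting the hops on
the shortest path between two nodes of a small adjacency-list graph (the graph data
is made up for illustration):

from collections import deque

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}

def shortest_hops(graph, start, goal):
    """Breadth-first search: returns the number of edges on a shortest path."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None   # goal is not reachable from start

print(shortest_hops(graph, "A", "E"))   # 3 (A -> B -> D -> E, or via C)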

Example of Graph Database


 Recommendation engines in e-commerce use graph databases to provide
customers with accurate recommendations and updates about new products,
thus increasing sales and satisfying customers’ desires.

 Social media companies use graph databases to find the “friends of friends” or
products that the user’s friends like and send suggestions accordingly to user.

 Graph databases play a major role in fraud detection. Users can create a graph
from the transactions between entities and store other important information.
Once created, running a simple query will help to identify the fraud.

Graph Database Use Case Examples


There are many notable examples where graph databases outperform other database
modeling techniques, some of which include:

 Real-Time Recommendation Engines. Real-time product and ecommerce
recommendations provide a better user experience while maximizing
profitability. Notable cases include Netflix, eBay, and Walmart.

 Master Data Management. Linking all company data to one location for a
single point of reference provides data consistency and accuracy. Master data
management is crucial for large-scale global companies.

 GDPR and regulation compliances. Graphs make tracking of data movement
and security easier to manage. The databases reduce the potential of data
breaches and provide better consistency when removing data, improving the
overall trust with sensitive information.

 Digital asset management. The amount of digital content is massive and
constantly increasing. Graph databases provide a scalable and straightforward
database model to keep track of digital assets, such as documents,
evaluations, contracts, etc.

 Context-aware services. Graphs help provide services related to real-world
characteristics. Whether it is natural disaster warnings, traffic updates, or
product recommendations for a given location, graph databases offer a logical
solution to real-life circumstances.

 Fraud detection. Finding suspicious patterns and uncovering fraudulent
payment transactions is done in real time using graph databases. Targeting
and isolating parts of graphs provides quicker detection of deceptive behavior.

 Semantic search. Natural language processing is ambiguous. Semantic
searches help provide meaning behind keywords for more relevant results,
which is easier to map using graph databases.

 Network management. Networks are linked graphs in their essence. Graphs
reduce the time needed to alert a network administrator about problems in a
network.

 Routing. Information travels through a network along optimal paths, which
makes graph databases the perfect choice for routing.
Advantages of Graph Database
 A key advantage of graph databases is the ability to establish relationships
with external sources as well.

 No joins are required since relationships are already specified.

 Queries depend on the concrete relationships traversed, not on the amount of
data.

 It is flexible and agile.

 It is easy to manage the data in the form of a graph.

 Efficient data modeling: Graph databases allow for efficient data modeling by
representing data as nodes and edges. This allows for more flexible and
scalable data modeling than traditional relational databases.

 Flexible relationships: Graph databases are designed to handle complex
relationships and interconnections between data elements. This makes them
well-suited for applications that require deep and complex queries, such as
social networks, recommendation engines, and fraud detection systems.

 High performance: Graph databases are optimized for handling large and
complex datasets, making them well-suited for applications that require high
levels of performance and scalability.

 Scalability: Graph databases can be easily scaled horizontally, allowing
additional servers to be added to the cluster to handle increased data volume
or traffic.

 Easy to use: Graph databases are typically easier to use than traditional
relational databases. They often have a simpler data model and query
language, and can be easier to maintain and scale.

Disadvantages of Graph Database


 For very complex relationships, search speed can often become slower.

 The query language is platform dependent.

 They are inappropriate for transactional data.

 They have a smaller user base.

 Limited use cases: Graph databases are not suitable for all applications. They
may not be the best choice for applications that require simple queries or that
deal primarily with data that can be easily represented in a traditional
relational database.

 Specialized knowledge: Graph databases may require specialized knowledge
and expertise to use effectively, including knowledge of graph theory and
algorithms.

 Immature technology: The technology for graph databases is relatively new
and still evolving, which means that it may not be as stable or well-supported
as traditional relational databases.

 Integration with other tools: Graph databases may not be as well-integrated
with other tools and systems as traditional relational databases, which can
make it more difficult to use them in conjunction with other technologies.

Overall, graph databases on NoSQL offer many advantages for applications that
require complex and deep relationships between data elements. They are highly
flexible, scalable, and performant, and can handle large and complex datasets.
However, they may not be suitable for all applications, and may require specialized
knowledge and expertise to use effectively.

Future of Graph Database


A graph database is an excellent tool for storing data, but it cannot be used to
completely replace the traditional database; it deals with a particular class of
interconnected data. Although graph databases are still in a developmental phase,
they are becoming increasingly important as businesses and organizations work with
big data, where graph databases help with complex analysis. These databases have
thus become a must for today’s needs and tomorrow’s success.

Schema less databases


Traditional relational databases are well-defined, using a schema to describe every
functional element, including tables, rows, views, indexes, and relationships. By
exerting a high degree of control, the database administrator can improve
performance and prevent capture of low-quality, incomplete, or malformed data. In a
SQL database, the schema is enforced by the Relational Database Management
System (RDBMS) whenever data is written to disk.

But in order to work, data needs to be heavily formatted and shaped to fit into the
table structure. This means sacrificing any undefined details during the save, or
storing valuable information outside the database entirely.

A schemaless database, like MongoDB, does not have these up-front constraints,
mapping to a more ‘natural’ database. Even when sitting on top of a data lake, each
document is created with a partial schema to aid retrieval. Any formal schema is
applied in the code of your applications; this layer of abstraction protects the raw
data in the NoSQL database and allows for rapid transformation as your needs
change.

Any data, formatted or not, can be stored in a non-tabular NoSQL type of database.
At the same time, using the right tools in the form of a schemaless database can
unlock the value of all of your structured and unstructured data types.

What is a schemaless database?


A schemaless database manages information without the need for a blueprint. The
onset of building a schemaless database doesn’t rely on conforming to certain fields,
tables, or data model structures. There is no Relational Database Management
System (RDBMS) to enforce any specific kind of structure. In other words, it’s a non-
relational database that can handle any database type, whether that be a key-value
store, document store, in-memory, column-oriented, or graph data
model. NoSQL databases’ flexibility is responsible for the rising popularity of a
schemaless approach and is often considered more user-friendly than scaling a
schema or SQL database.

How does a schemaless database work?


In schemaless databases, information is stored in JSON-style documents which can
have varying sets of fields with different data types for each field. So, a collection
could look like this:

{ "name": "Joe", "age": 30, "interests": "football" }
{ "name": "Kate", "age": 25 }
As you can see, the data itself normally has a fairly consistent structure. With the
schemaless MongoDB database, there is some additional structure — the system
namespace contains an explicit list of collections and indexes. Collections may be
implicitly or explicitly created — indexes must be explicitly declared.

Schemaless vs. schema databases pros and cons


How much information do you know about your new database setup? Can you see
its structure well ahead of time and know for certain it will never change? If so, you
may be dealing with a situation that best suits a schema database. Its strictness is the
basis of its appeal. Let’s get granular and weigh the pros and cons of going one way
or the other.

Schema Database Pros:

 Rigorous testing

 Rules are inflexible

 Code is more intelligible

 Streamlines the process of migrating data between systems

Schema Database Cons:

 Data modeling and planning must be flexible and predefined

 Difficult to expedite the launch of the database

 The rigidity makes altering the schema at a later date a laborious process

 Experimenting with fields is very difficult

Schemaless Database Pros:

 All data (and metadata) remains unaltered and accessible

 There is no existing “schema” for the data to be structured around

 Can add additional fields that SQL databases can’t accommodate

 Accommodates key-value store, document store, in-memory, column-oriented,
or graph data models

Schemaless Database Cons:

 No universal language available to query data in a non-relational database

 Though the NoSQL community is still growing at a tremendous rate, not all
troubleshooting issues have been properly documented

 Lack of compatibility with SQL instructions

 No ACID-level compliance, as data retrievals can have inconsistencies given
their distributed approach

What are the benefits of using a schemaless database?


 Greater flexibility over data types

By operating without a schema, schemaless databases can store, retrieve, and
query any data type — perfect for big data analytics and similar operations
that are powered by unstructured data. Relational databases apply rigid
schema rules to data, limiting what can be stored.

 No pre-defined database schemas

The lack of schema means that your NoSQL database can accept any data
type — including those that you do not yet use. This future-proofs your
database, allowing it to grow and change as your data-driven operations
change and mature.

 No data truncation

A schemaless database makes almost no changes to your data; each item is
saved in its own document with a partial schema, leaving the raw information
untouched. This means that every detail is always available and nothing is
stripped to match the current schema. This is particularly valuable if your
analytics needs to change at some point in the future.

 Suitable for real-time analytics functions

With the ability to process unstructured data, applications built on NoSQL
databases are better able to process real-time data, such as readings and
measurements from IoT sensors. Schemaless databases are also ideal for use
with machine learning and artificial intelligence operations, helping to
accelerate automated actions in your business.

 Enhanced scalability and flexibility

With NoSQL, you can use whichever data model is best suited to the job.
Graph databases allow you to view relationships between data points, or you
can use traditional wide table views with an exceptionally large number of
columns. You can query, report, and model information however you choose.
And as your requirements grow, you can keep adding nodes to increase
capacity and power.

When a record is saved to a relational database, anything (particularly
metadata) that does not match the schema is truncated or removed. Deleted
at write, these details cannot be recovered at a later point in time.

What does this look like?


A lack of rigid schema allows for increased transparency and automation when
making changes to the database or performing a data migration. Say you want to
add GPA attributes to student objects held in your database. You simply add the
attribute, resave, and the GPA value has been added to the NoSQL document. If you
look up an existing student and reference GPA, it will return null. If you roll back your
code, the new GPA fields in the existing objects are unlikely to cause problems and
do not need to be removed if your code is well written.
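
A brief sketch of that GPA example, assuming a locally running MongoDB server and
the pymongo driver (collection and field names are illustrative):

from pymongo import MongoClient

students = MongoClient("mongodb://localhost:27017")["school"]["students"]

students.insert_one({"name": "Kiran"})          # saved without any GPA field

# Adding the new attribute needs no schema migration: just set it on the document.
students.update_one({"name": "Kiran"}, {"$set": {"gpa": 3.7}})

# Documents written earlier simply lack the field; reading it yields None (null).
older = students.find_one({"gpa": {"$exists": False}})
print(older.get("gpa") if older else "no documents without a GPA field")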

Here are some key characteristics and concepts of schemaless databases:
1. Flexible Data Modeling:

Schemaless databases allow developers to store data without defining a fixed
schema upfront. Data can be stored in a flexible, schema-less format, such as
JSON (JavaScript Object Notation), BSON (Binary JSON), or other semi-
structured formats. This flexibility enables developers to adapt the data model
to changing application requirements and iterate quickly without needing to
modify the database schema.

2. Dynamic Schema Evolution:

Schemaless databases support dynamic schema evolution, allowing data
schemas to evolve over time as new data is ingested or existing data is
updated. Developers can add, modify, or remove fields from data documents
without disrupting existing data, providing agility and flexibility in data
management.

3. NoSQL Data Models:

Schemaless databases are often associated with NoSQL (Not Only SQL)
database systems, which offer various data models, such as document, key-
value, columnar, or graph databases, that are well-suited for schema-less data
storage. These NoSQL data models provide flexible structures for storing and
querying semi-structured or unstructured data without predefined schemas.

4. Scalability and Performance:

Schemaless databases are designed for scalability and performance,
particularly in distributed and cloud-native environments. Many schemaless
databases employ distributed architectures, sharding strategies, and
horizontal scaling techniques to handle large volumes of data and high
transaction rates effectively.

5. Query Flexibility:

Schemaless databases typically provide flexible querying capabilities, allowing
developers to query data using dynamic and ad-hoc queries without relying
on a fixed schema. Query languages, such as MongoDB's query language or
Elasticsearch's query DSL, support querying and filtering data based on its
content and structure.

6. Use Cases:

Schemaless databases are well-suited for use cases involving semi-structured
or unstructured data, such as content management systems, e-commerce
platforms, social media analytics, IoT data management, and real-time
analytics. They excel in scenarios where data schemas are fluid, data volumes
are large, and agility is essential.

7. Examples:

Some examples of schemaless databases include MongoDB, Couchbase,
Amazon DynamoDB (with the Document data model option), Elasticsearch,
Cassandra, and Firebase Firestore. These databases offer schema-less
capabilities and flexible data modeling features for various use cases and
application scenarios.

What are the challenges of schema-less databases?


Despite their advantages, schema-less databases also pose some challenges for
information management. One of the challenges is the lack of data validation and
integrity. Schema-less databases do not enforce data types, constraints, or
relationships, which means that data can be inconsistent, incomplete, or duplicated
across the database. This can lead to data quality issues and errors in data analysis
and reporting. Another challenge is the complexity of data querying and
manipulation. Schema-less databases do not support the standard SQL language,
which means that developers and users need to learn different query languages and
tools for each database. Moreover, some schema-less databases do not support
transactions, joins, or aggregations, which can limit the functionality and efficiency of
data operations.

Materialized views
What is a Materialized View?
A materialized view is a duplicate data table created by combining data from
multiple existing tables for faster data retrieval. For example, consider a retail
application with two base tables for customer and product data. The customer table
contains information like the customer’s name and contact details, while the product
table contains information about product details and cost. The customer table only
stores the product IDs of the items an individual customer purchases. You have to
cross-reference both tables to obtain product details of items purchased by specific
customers. Instead, you can create a materialized view that stores customer names
and the associated product details in a single temporary table. You can build index
structures on the materialized view for improved data read performance.

Let's understand the syntax of the materialized view.

CREATE MATERIALIZED VIEW view_name
BUILD [clause] REFRESH [type]
ON [trigger]
AS <query expression>

In the above syntax, the BUILD clause decides when to populate the materialized
view. It has two options -

 IMMEDIATE - Populates the materialized view immediately.

 DEFERRED - The materialized view must be refreshed manually at least once.

The refresh type defines how the materialized view is updated. There are three
options -

 FAST - A materialized view log is required on the source table in advance;
without the log, the creation fails. A fast (incremental) refresh is attempted.

 COMPLETE - The table segment supporting the materialized view is truncated
and repopulated completely using the associated query.

 FORCE - A materialized view log is not required. A fast refresh is attempted if
possible; otherwise a complete refresh is performed.

ON [trigger] defines when to refresh the materialized view. The refresh can be
triggered in two ways -

 ON COMMIT - The refresh is triggered when a data change is committed in
one of the dependent tables.

 ON DEMAND - A refresh happens when we schedule a task or issue a manual
request.

We have discussed the basic concept of the normal view and materialized view. Now,
let's see the difference between normal view and materialized view.

Materialized view example


Here is the user_purchase_summary view from before, turned into a materialized
view:

CREATE MATERIALIZED VIEW user_purchase_summary AS
SELECT
    u.id AS user_id,
    COUNT(*) AS total_purchases,
    SUM(CASE WHEN p.status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled_purchases
FROM users u
JOIN purchases p ON p.user_id = u.id
GROUP BY u.id;

In terms of SQL, all that has changed is the addition of the MATERIALIZED keyword.
But when executed, this statement instructs the database to:

1. Execute the SELECT query within the materialized view definition.

2. Cache the results in a new “virtual” table named user_purchase_summary

3. Save the original query so it knows how to update the materialized view in the
future.

How are materialized views useful?


The limitations of materialized views in popular databases discussed above have
historically made them a relatively niche feature, used primarily as a way to cache the
results of complex queries that would bring down the database if run frequently as
regular views. But if we set aside historic limitations and think about the idea of
materialized views: They give us the ability to define (using SQL) any complex
transformation of our data, and leave it to the database to maintain the results in a
“virtual” table.

This makes materialized views great for use cases where:

1. The SQL query is known ahead of time and needs to be repeatedly
recalculated.

2. It’s valuable to have low end-to-end latency from when data originates to
when it is reflected in a query.

3. It’s valuable to have low-latency query response times, high concurrency, or
high volume of queries.

We see these requirements most often in areas of analytics and data-intensive
applications.

Materialized views for analytics


The extract-load-transform (ELT) pattern where raw data is loaded in bulk into a
warehouse and then transformed via SQL typically relies on alternatives to
materialized views for the transform step. In dbt, these are referred to
as materializations. A materialization can use a regular view (where nothing is
cached) or cached tables built from the results of a SELECT query, or an incrementally
updated table where the user is responsible for writing the update
strategy. Historically, support for materialized views in data warehouses has been so
bad that SQL modelling services like dbt don’t even have the syntax to allow users to
create them. However, the dbt-materialize adapter allows dbt users building on
Materialize to use materialized views. Here’s the standard advice given to dbt users
on when to use the different types of materializations available to them:

1. If using a view isn’t too slow for your end-users, use a view.

2. If a view gets too slow for your end-users, use a table.

3. If building a table with dbt gets too slow, use incremental models in dbt.

What are the benefits of materialized views?


Materialized views are a fast and efficient method of accessing relevant data. They
help with query optimization in data-intensive applications. We go through some of
the major benefits next.

 Speed

Read queries scan through different tables and rows of data to gather the
necessary information. With materialized views, you can query data directly
from your new view instead of having to compute new information every time.
The more complex your query is, the more time you will save using a
materialized view.

 Data storage simplicity

Materialized views allow you to consolidate complex query logic in one table.
This makes data transformations and code maintenance easier for developers.
It can also help make complex queries more manageable. You can also use
data subsetting to decrease the amount of data you need to replicate in the
view.

 Consistency

Materialized views provide a consistent view of data captured at a specific moment. You can configure read consistency in materialized views and make data accessible even in multi-user environments where concurrency control is essential.

Materialized views also provide data access even if the source data changes or is deleted. Over time, this means that you can use materialized views to report on time-based data snapshots. The level of isolation from source tables ensures that you have a greater degree of consistency across your data.

 Improved access control

You can use a materialized view to control who has access to specific data.
You can filter information for users without giving them access to the source
tables. This approach is practical if you want to control who has access to what
data and how much of it they can see and interact with.
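As a minimal sketch of this idea (the reporting_user role is hypothetical), read access can be granted on the materialized view alone, while no grant is issued on the underlying users and purchases tables:

-- Grant access to the precomputed summary only, not to the source tables.
GRANT SELECT ON user_purchase_summary TO reporting_user;
Sql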

What are the use cases of materialized views?


You can benefit from materialized views in many different scenarios.

 Distribute filtered data

If you need to distribute recent data across many locations, like for a remote
workforce, materialized views help. You replicate and distribute data to many
sites using materialized views. The people needing access to data interact with
the replicated data store closest to them geographically.

This system allows for concurrency and decreases network load. It’s an
effective approach with read-only databases.

 Analyze time series data

Materialized views provide timestamped snapshots of datasets, so you can model information changes over time. You can store precomputed aggregations of data, like monthly or weekly summaries. These uses are helpful for business intelligence and reporting platforms.

 Remote data interaction

In distributed database systems, you can use materialized views to optimize queries involving data from remote servers. Rather than repeatedly fetching data from a remote source, you can fetch and store it in a local materialized view. This reduces the need for network communication and improves performance.

For example, if you receive data from an external database or through an API,
a materialized view consolidates and helps process it.

 Periodic batch processing

Materialized views are helpful for situations where periodic batch processing is
required. For instance, a financial institution might use materialized views to
store end-of-day balances and interest calculations. Or they might store
portfolio performance summaries, which can be refreshed at the end of each
business day.

How do materialized views work?


Materialized views work by precomputing and storing the results of a specific query
as a physical table in the database. The database performs the precomputation at
regular intervals, or users can trigger it by specific events. Administrators monitor the
performance and resource utilization of materialized views to ensure they continue
to meet their intended purpose.

Here's a general overview of how materialized views work.

 Create materialized view

You define a query that retrieves the desired data from one or more source
tables for creating materialized views. This query may include filtering,
aggregations, joins, and other operations as needed.

The database initially populates the materialized view by running the defined
query against the source data. The result of the query is stored as a physical
table in the database, and this table represents the materialized view.

 Update materialized view

The data in a materialized view needs to be periodically updated to reflect changes in the underlying data in the source tables. The data refresh frequency depends on the use case and requirements.

Next, we explain a few common approaches for data refresh.

 Full refresh

The materialized view is completely recomputed and overwritten with the latest query results. It is the simplest approach but can be resource-intensive, especially for large materialized views.

 Incremental refresh

Only the changes in the underlying data are applied to the materialized
view. It can be more efficient than a full refresh when dealing with large
datasets and frequent updates.

 On-demand refresh
Some systems allow materialized views to be refreshed on demand,
triggered by specific events or user requests. This gives more control
over when the data is updated, but it requires careful management to
ensure the materialized view remains up-to-date.
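As a concrete sketch, PostgreSQL exposes the on-demand approach through an explicit command; the CONCURRENTLY variant shown here assumes a unique index exists on the view (the index name is illustrative).

-- Full, on-demand refresh of the earlier example view (PostgreSQL syntax).
REFRESH MATERIALIZED VIEW user_purchase_summary;

-- CONCURRENTLY keeps the view readable during the refresh,
-- but requires a unique index on the materialized view.
CREATE UNIQUE INDEX user_purchase_summary_uidx ON user_purchase_summary (user_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY user_purchase_summary;
Sql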

Technical variations in different systems

Each database management system has distinct methods for creating a materialized
view.

PostgreSQL: With PostgreSQL, you have to manually refresh the materialized view, recomputing the entire view. You populate the materialized view with data at the exact moment you create it.

MySQL: MySQL doesn’t support materialized views.

Oracle: Oracle automatically refreshes materialized views, but you also have the option to refresh on demand. You can also write a SQL statement that prompts the views to refresh before delivering results.

SQL Server: SQL Server uses the name “indexed views,” as materialization is a step of creating an index on a regular view. You can only perform basic SQL queries with indexed views. They update automatically for the user.

MongoDB: MongoDB uses aggregation pipelines to deliver a similar capability to materialized views, but for a NoSQL environment.
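To make the SQL Server row concrete, here is a hedged sketch of an indexed view; all object names (dbo.active_users, dbo.users, is_active) are hypothetical, and the view is materialized at the moment the unique clustered index is created.

-- SQL Server-style indexed view (illustrative names).
CREATE VIEW dbo.active_users
WITH SCHEMABINDING
AS
SELECT id, email
FROM dbo.users
WHERE is_active = 1;
GO  -- batch separator used by client tools such as sqlcmd/SSMS

-- Creating the unique clustered index materializes the view.
CREATE UNIQUE CLUSTERED INDEX ix_active_users ON dbo.active_users (id);
Sql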

Advantages of Materialized View


The following are some important advantages of the materialized view.

 A materialized view optimizes query performance by reusing the same sub-query results every time.

 The data in a materialized view is not updated frequently; the user refreshes it manually or with a trigger clause. This reduces the chance of errors and returns efficient results.

 Materialized views can be transparent and automatically maintained with the help of background services, as in Snowflake.

Difference between View and Materialized View


The following are the important differences between a view and a materialized view.

1. View: Views are a virtual projection of the base table; the query expressions are stored in the database, but not the resulting data of the query expression.
   Materialized view: Both the resulting data and the query expression are saved in physical storage (the database system).

2. View: Views are generally created by joining one or more tables.
   Materialized view: Materialized views are primarily used for the warehousing of data.

3. View: A view is a virtual table that is based on a select query.
   Materialized view: It is also known as a snapshot view of the data, which provides access to a duplicate, physical copy of the data in a separate table.

4. View: DML commands cannot be used if the view is created using multiple tables.
   Materialized view: DML commands can be used on a materialized view no matter how it is created.

5. View: There is no update cost involved with a normal view.
   Materialized view: It does have an update (refresh) cost associated with it.

6. View: Views respond more slowly, which affects query performance.
   Materialized view: A materialized view responds faster because it stores the data in the database.

7. View: Views are defined according to a fixed design approach based on the SQL standard for defining a view.
   Materialized view: There is no predefined SQL standard for defining it; the database provides the functionality in the form of an extension.

8. View: Views are more effective when the data is accessed infrequently and the data in the table is updated frequently.
   Materialized view: Materialized views are mostly used when data is accessed frequently and is not updated frequently.

What are the challenges with materialized views?


As materialized views are another database component to consider, you add another
layer of complexity in terms of maintenance. You must balance the query and
efficiency benefits with potential storage costs and data consistency issues.

You have to create effective rules that trigger updates to ensure your materialized
views remain beneficial. Frequently updating your materialized views may impact
system performance, especially if you are already in a peak period. Additionally,
materialized views also take up a significant amount of space as they replicate data. If
you have a large database that constantly updates, the storage demands of
materialized views will likely be significant.

If you are going to use a materialized view, you need to set clear refresh rules and
schedules. You must also understand how to deal with data inconsistencies, refresh
failures, and the added storage strain.

How can AWS help with your materialized view requirements?


Materialized views are a powerful tool to improve query performance in Amazon
Redshift.

Amazon Redshift continually monitors the workload using machine learning and
creates new materialized views when they are beneficial. This Automated
Materialized Views (AutoMV) feature in Redshift provides the same performance
benefits of user-created materialized views.

The AutoMV feature can benefit you in many ways:

 Balance the costs of creating and keeping materialized views up-to-date against the expected benefits to query latency

 Monitor previously created AutoMVs and drop them when they are no longer beneficial

 Refresh automatically and incrementally, using the same criteria and restrictions as user-created materialized views
Additionally, developers don't need to revise queries to take advantage of
AutoMV. Automatic query rewriting to use materialized views identifies queries that
can benefit from system-created AutoMVs. It automatically rewrites those queries to
use the AutoMVs, improving query efficiency.

Here are some key aspects of materialized views


1. Precomputed Query Results:

Materialized views contain the results of a specific query that has been
precomputed and stored in the database. These results are typically derived
from one or more base tables or views using aggregation, filtering, or other
data transformation operations.

2. Improved Query Performance:

Materialized views improve query performance by reducing the need for repeated and expensive computations of query results. Instead of executing the query against the base tables every time it is invoked, the database can retrieve the precomputed results directly from the materialized view, which is often faster.

3. Incremental Maintenance:

Materialized views can be refreshed or updated periodically to reflect changes in the underlying data. Incremental maintenance techniques are used to efficiently update the materialized view with the latest changes from the base tables, minimizing the overhead of refreshing the view.

4. Query Rewrite:

Many database systems support query rewrite capabilities, which automatically rewrite a query to use a materialized view if its structure matches the query's requirements. This allows the database to transparently substitute the materialized view for the original query, further improving query performance.

5. Aggregation and Summarization:

Materialized views are often used to precompute aggregated or summarized data, such as totals, averages, counts, or other statistical measures. By storing precomputed aggregates, materialized views can significantly speed up queries that require aggregation operations.

6. Query Optimization:

Materialized views can be used as part of query optimization strategies to improve the performance of complex queries and analytical workloads. By identifying frequently executed queries and creating materialized views to support them, database administrators can optimize query execution plans and reduce response times.

7. Storage Overhead:

Materialized views consume storage space in the database to store the precomputed query results. Depending on the size of the result set and the frequency of updates, materialized views can impose additional storage overhead on the database.

8. Use Cases:

Materialized views are commonly used in data warehousing, business intelligence, and reporting applications where query performance is critical. They are particularly useful for analytical queries, ad-hoc reporting, and decision support systems that require fast access to aggregated or summarized data.

Distribution models
The primary driver of interest in NoSQL has been its ability to run databases on a
large cluster. As data volumes increase, it becomes more difficult and expensive to
scale up—buy a bigger server to run the database on. A more appealing option is to
scale out—run the database on a cluster of servers. Aggregate orientation fits well
with scaling out because the aggregate is a natural unit to use for distribution.

Depending on your distribution model, you can get a data store that will give you
the ability to handle larger quantities of data, the ability to process greater read or
write traffic, or more availability in the face of network slowdowns or breakages.
These are often important benefits, but they come at a cost. Running over a cluster
introduces complexity—so it’s not something to do unless the benefits are
compelling.

Broadly, there are two paths to data distribution: replication and sharding.
Replication takes the same data and copies it over multiple nodes. Sharding puts
different data on different nodes. Replication and sharding are orthogonal
techniques: You can use either or both of them. Replication comes in two forms:
master-slave and peer-to-peer. We will now discuss these techniques starting at the
simplest and working up to the more complex: first single-server, then master-slave
replication, then sharding, and finally peer-to-peer replication.

1. Single Server

The first and the simplest distribution option is the one we would most often
recommend—no distribution at all. Run the database on a single machine that
handles all the reads and writes to the data store. We prefer this option because it
eliminates all the complexities that the other options introduce; it’s easy for
operations people to manage and easy for application developers to reason about.

Although a lot of NoSQL databases are designed around the idea of running on a
cluster, it can make sense to use NoSQL with a single-server distribution model if the
data model of the NoSQL store is more suited to the application. Graph databases
are the obvious category here—these work best in a single-server configuration. If
your data usage is mostly about processing aggregates, then a single-server
document or key-value store may well be worthwhile because it’s easier on
application developers.

For the rest of this chapter we’ll be wading through the advantages and
complications of more sophisticated distribution schemes. Don’t let the volume of
words fool you into thinking that we would prefer these options. If we can get away
without distributing our data, we will always choose a single-server approach.

2. Sharding
Often, a busy data store is busy because different people are accessing different
parts of the dataset. In these circumstances we can support horizontal scalability by
putting different parts of the data onto different servers—a technique that’s called
sharding (see Figure 4.1).

In the ideal case, we have different users all talking to different server nodes. Each
user only has to talk to one server, so gets rapid responses from that server. The load
is balanced out nicely between servers—for example, if we have ten servers, each one
only has to handle 10% of the load.

Of course the ideal case is a pretty rare beast. In order to get close to it we have to
ensure that data that’s accessed together is clumped together on the same node and
that these clumps are arranged on the nodes to provide the best data access.

The first part of this question is how to clump the data up so that one user mostly
gets her data from a single server. This is where aggregate orientation comes in
really handy. The whole point of aggregates is that we design them to combine data
that’s commonly accessed together—so aggregates leap out as an obvious unit of
distribution.

When it comes to arranging the data on the nodes, there are several factors that can
help improve performance. If you know that most accesses of certain aggregates are
based on a physical location, you can place the data close to where it’s being
accessed. If you have orders for someone who lives in Boston, you can place that
data in your eastern US data center.

Another factor is trying to keep the load even. This means that you should try to
arrange aggregates so they are evenly distributed across the nodes which all get
equal amounts of the load. This may vary over time, for example if some data tends
to be accessed on certain days of the week—so there may be domain-specific rules
you’d like to use.

3. Master-Slave Replication
With master-slave distribution, you replicate data across multiple nodes. One node is
designated as the master, or primary. This master is the authoritative source for the
data and is usually responsible for processing any updates to that data. The other
nodes are slaves, or secondaries. A replication process synchronizes the slaves with
the master (see Figure 4.2).

4. Peer-to-Peer Replication
Master-slave replication helps with read scalability but doesn’t help with scalability of
writes. It provides resilience against failure of a slave, but not of a master. Essentially,
the master is still a bottleneck and a single point of failure. Peer-to-peer replication
(see Figure 4.3) attacks these problems by not having a master. All the replicas have
equal weight, they can all accept writes, and the loss of any of them doesn’t prevent
access to the data store.

5. Combining Sharding and Replication
Replication and sharding are strategies that can be combined. If we use both master-
slave replication and sharding (see Figure 4.4), this means that we have multiple
masters, but each data item only has a single master. Depending on your
configuration, you may choose a node to be a master for some data and slaves for
others, or you may dedicate nodes for master or slave duties.

Using peer-to-peer replication and sharding is a common strategy for column-family
databases. In a scenario like this you might have tens or hundreds of nodes in a
cluster with data sharded over them. A good starting point for peer-to-peer
replication is to have a replication factor of 3, so each shard is present on three
nodes. Should a node fail, then the shards on that node will be built on the other
nodes (see Figure 4.5).

Aggregate oriented databases make distribution of data easier, since the distribution
mechanism has to move only the aggregate and does not have to worry about related data, as
all the related data is contained in the aggregate.

There are two styles of distributing data:

 Sharding: Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of data.

 Replication: Replication copies data across multiple servers, so each bit of data can be found in multiple places. Replication comes in two forms:

 Master-slave replication makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads.

 Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.

Master-slave replication reduces the chance of update conflicts, but peer-to-peer replication avoids loading all writes onto a single server, which would create a single point of failure. A system may use either or both techniques. For example, the Riak database shards the data and also replicates it based on the replication factor.

Here are some common distribution models:
1. Direct Distribution:

In a direct distribution model, producers sell products directly to consumers
without involving intermediaries. This can be done through company-owned
retail stores, e-commerce websites, catalogs, or direct sales representatives.
Direct distribution gives producers greater control over the customer
experience and allows them to capture more of the profit margin.

2. Indirect Distribution:

In an indirect distribution model, producers rely on intermediaries or third-party
channels to distribute their products to consumers. Intermediaries may
include wholesalers, distributors, retailers, agents, or brokers who buy
products from producers and resell them to end-users. Indirect distribution
can expand the reach of products to a broader market but may result in lower
profit margins due to the involvement of intermediaries.

3. Wholesale Distribution:

Wholesale distribution involves selling products in bulk quantities to retailers,
businesses, or other intermediaries who then resell them to consumers.
Wholesalers act as middlemen between producers and retailers, providing
storage, logistics, and distribution services. Wholesale distribution is common
in industries such as consumer goods, electronics, and food products.

4. Retail Distribution:

Retail distribution involves selling products directly to consumers through
brick-and-mortar stores, online retail platforms, or other retail channels.
Retailers purchase products from wholesalers or distributors and sell them to
individual consumers at a markup. Retail distribution allows producers to
reach consumers through various touchpoints and offer personalized
shopping experiences.

5. Franchise Distribution:

Franchise distribution involves licensing the rights to operate a business under
a specific brand or trademark to independent entrepreneurs or franchisees.
Franchisees pay an initial fee and ongoing royalties to the franchisor in
exchange for access to the brand, products, and business model. Franchise
distribution enables rapid expansion into new markets while leveraging local
expertise and resources.

6. Agency Distribution:
Agency distribution involves appointing agents or representatives to sell
products on behalf of the producer. Agents act as intermediaries who
negotiate sales contracts, handle customer inquiries, and facilitate transactions
on behalf of the producer. Agency distribution is common in industries such
as insurance, real estate, and pharmaceuticals.

7. Online Distribution:

Online distribution involves selling products or services through digital
channels such as e-commerce websites, online marketplaces, mobile apps, or
social media platforms. Online distribution offers convenience, accessibility,
and global reach, allowing producers to reach a broader audience and bypass
traditional distribution channels.

8. Hybrid Distribution:

Hybrid distribution models combine elements of direct and indirect
distribution, as well as online and offline channels, to reach consumers
through multiple touchpoints. Hybrid distribution strategies leverage the
strengths of different distribution channels to optimize reach, efficiency, and
customer satisfaction.

Overall, distribution models play a crucial role in the success of businesses by
determining how products or services are delivered to consumers. By selecting the
most appropriate distribution model based on their goals, resources, and target
market, producers can effectively reach customers, maximize sales, and build
sustainable competitive advantage in the marketplace.

Sharding
What is database sharding?
Sharding is a method for distributing a single dataset across multiple databases,
which can then be stored on multiple machines. This allows for larger datasets to be
split into smaller chunks and stored in multiple data nodes, increasing the total
storage capacity of the system.

Similarly, by distributing the data across multiple machines, a sharded database can
handle more requests than a single machine can.

Sharding is a form of scaling known as horizontal scaling or scale-out, as additional
nodes are brought on to share the load. Horizontal scaling allows for near-limitless
scalability to handle big data and intense workloads. In contrast, vertical
scaling refers to increasing the power of a single machine or single server through a
more powerful CPU, increased RAM, or increased storage capacity.

Do you need database sharding?
Database sharding, as with any distributed architecture, does not come for free.
There is overhead and complexity in setting up shards, maintaining the data on each
shard, and properly routing requests across those shards. Before you begin sharding,
consider if one of the following alternative solutions will work for you.

How to optimize database sharding for even data distribution


When a data overload occurs on specific physical shards while others remain
underloaded, it results in database hotspots. Hotspots slow down the retrieval
process on the database, defeating the purpose of data sharding.

Good shard-key selection can evenly distribute data across multiple shards. When
choosing a shard key, database designers should consider the following factors.

 Cardinality

Cardinality describes the possible values of the shard key. It determines the
maximum number of possible shards on separate column-oriented databases.
For example, if the database designer chooses a yes/no data field as a shard
key, the number of shards is restricted to two.

 Frequency

Frequency is the probability of storing specific information in a particular
shard. For example, a database designer chooses age as a shard key for a
fitness website. Most of the records might go into nodes for subscribers aged
30–45 and result in database hotspots.

 Monotonic change

Monotonic change is the rate of change of the shard key. A monotonically
increasing or decreasing shard key results in unbalanced shards. For example,
a feedback database is split into three different physical shards as follows:

 Shard A stores feedback from customers who have made 0–10 purchases.

 Shard B stores feedback from customers who have made 11–20 purchases.

 Shard C stores feedback from customers who have made 21 or more purchases.

As the business grows, more customers will make 21 or more purchases.
The application stores their feedback in Shard C. This results in an unbalanced
shard because Shard C contains more feedback records than other shards.

What are the alternatives to database sharding?


Database sharding is a horizontal scaling strategy that allocates additional nodes or
computers to share the workload of an application. Organizations benefit from
horizontal scaling because of its fault-tolerant architecture. When one computer fails,
the others continue to operate without disruption. Database designers reduce
downtime by spreading logical shards across multiple servers.

However, sharding is one among several other database scaling strategies. Explore
some other techniques and understand how they compare.

1. Vertical scaling

Vertical scaling increases the computing power of a single machine. For example, the
IT team adds a CPU, RAM, and a hard disk to a database server to handle increasing
traffic.

Comparison of database sharding and vertical scaling

Vertical scaling is less costly, but there is a limit to the computing resources you can
scale vertically. Meanwhile, sharding, a horizontal scaling strategy, is easier to
implement. For example, the IT team installs multiple computers instead of
upgrading old computer hardware.

2. Replication

Replication is a technique that makes exact copies of the database and stores them
across different computers. Database designers use replication to design a fault-
tolerant relational database management system. When one of the computers
hosting the database fails, other replicas remain operational. Replication is a
common practice in distributed computing systems.

Comparison of database sharding and replication

Database sharding does not create copies of the same information. Instead, it splits
one database into multiple parts and stores them on different computers. Unlike
replication, database sharding does not result in high availability. Sharding can be
used in combination with replication to achieve both scale and high availability.

In some cases, database sharding might consist of replications of specific datasets.


For example, a retail store that sells products to both US and European customers
might store replicas of size conversion tables on different shards for both regions.
The application can use the duplicate copies of the conversion table to convert the
measurement size without accessing other database servers.

3. Partitioning

Partitioning is the process of splitting a database table into multiple groups.


Partitioning is classified into two types:

 Horizontal partitioning splits the database by rows.

 Vertical partitioning creates different partitions of the database columns.

Comparison of database sharding and partitioning

Database sharding is like horizontal partitioning. Both processes split the database
into multiple groups of unique rows. Partitioning stores all data groups in the same
computer, but database sharding spreads them across different computers.
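As a hedged sketch of partitioning within one database instance, PostgreSQL-style declarative partitioning can split the earlier feedback example by range; the table and partition names are hypothetical.

-- Range partitioning of the feedback table described earlier (illustrative names).
CREATE TABLE feedback (
    customer_id    INT,
    purchase_count INT,
    comment        TEXT
) PARTITION BY RANGE (purchase_count);

CREATE TABLE feedback_low  PARTITION OF feedback FOR VALUES FROM (0)  TO (11);
CREATE TABLE feedback_mid  PARTITION OF feedback FOR VALUES FROM (11) TO (21);
CREATE TABLE feedback_high PARTITION OF feedback FOR VALUES FROM (21) TO (MAXVALUE);
Sql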

4. Specialized services or databases

Depending on your use case, it may make more sense to simply shift a subset of the
burden onto other providers or even a separate database. For example, blob or file
storage can be moved directly to a cloud provider such as Amazon S3. Analytics or
full-text search can be handled by specialized services or a data warehouse.
Offloading this particular functionality can make more sense than trying to shard
your entire database.

Why is database sharding important?


As an application grows, the number of application users and the amount of data it
stores increase over time. The database becomes a bottleneck if the data volume
becomes too large and too many users attempt to use the application to read or
save information simultaneously. The application slows down and affects customer
experience. Database sharding is one of the methods to solve this problem because
it enables parallel processing of smaller datasets across shards.

Horizontal and vertical sharding


Sharding involves splitting and distributing one logical data set across multiple
databases that share nothing and can be deployed across multiple servers. To
achieve sharding, the rows or columns of a larger database table are split into
multiple smaller tables.

Once a logical shard is stored on another node, it is known as a physical shard. One
physical shard can hold multiple logical shards. The shards are autonomous and
don't share the same data or computing resources. That's why they exemplify a
shared-nothing architecture. At the same time, the data in all the shards represents a
logical data set.

Sharding can either be horizontal or vertical:

 Horizontal sharding. When each new table has the same schema but unique
rows, it is known as horizontal sharding. In this type of sharding, more
machines are added to an existing stack to spread out the load, increase
processing speed and support more traffic. This method is most effective
when queries return a subset of rows that are often grouped together.

 Vertical sharding. When each new table has a schema that is a faithful subset
of the original table's schema, it is known as vertical sharding. It is effective
when queries usually return only a subset of columns of the data.

The following illustrates how new tables look when both horizontal and vertical
sharding are performed on the same original data set.

Original data set

Student ID   Name      Age   Major               Hometown

1            Amy       21    Economics           Austin
2            Jack      20    History             San Francisco
3            Matthew   22    Political Science   New York City
4            Priya     19    Biology             Gary
5            Ahmed     19    Philosophy          Boston

Horizontal shards

Shard 1

Student ID   Name   Age   Major       Hometown

1            Amy    21    Economics   Austin
2            Jack   20    History     San Francisco

Shard 2

Student ID   Name      Age   Major               Hometown

3            Matthew   22    Political Science   New York City
4            Priya     19    Biology             Gary
5            Ahmed     19    Philosophy          Boston

Vertical Shards

Shard 1

Student ID   Name   Age

1            Amy    21
2            Jack   20

Shard 2

Student ID   Major

1            Economics
2            History

Shard 3

Student ID   Hometown

1            Austin
2            San Francisco
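To connect the tables above to SQL, here is a minimal, hedged sketch of how one horizontal shard and one vertical shard could be materialized as separate tables; the students source table and the shard table names are hypothetical.

-- Horizontal shard: same columns, only a subset of rows.
CREATE TABLE students_shard_1 AS
SELECT student_id, name, age, major, hometown
FROM   students
WHERE  student_id <= 2;

-- Vertical shard: same rows, only a subset of columns (the key is repeated).
CREATE TABLE students_majors AS
SELECT student_id, major
FROM   students;
Sql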

What are the benefits of database sharding?


Organizations use database sharding to gain the following benefits:

 Improve response time

Data retrieval takes longer on a single large database. The database
management system needs to search through many rows to retrieve the
correct data. By contrast, data shards have fewer rows than the entire
database. Therefore, it takes less time to retrieve specific information, or run a
query, from a sharded database.

 Avoid total service outage

If the computer hosting the database fails, the application that depends on
the database fails too. Database sharding prevents this by distributing parts of
the database into different computers. Failure of one of the computers does
not shut down the application because it can operate with other functional
shards. Sharding is also often done in combination with data replication
across shards. So, if one shard becomes unavailable, the data can be accessed
and restored from an alternate shard.

 Scale efficiently

A growing database consumes more computing resources and eventually
reaches storage capacity. Organizations can use database sharding to add
more computing resources to support database scaling. They can add new
shards at runtime without shutting down the application for maintenance.

Advantages of sharding
Sharding allows you to scale your database to handle increased load to a nearly
unlimited degree by providing increased read/write throughput, storage capacity,
and high availability. Let’s look at each of those in a little more detail.

 Increased read/write throughput — By distributing the dataset across
multiple shards, both read and write operation capacity is increased as long as
read and write operations are confined to a single shard.

 Increased storage capacity — similarly, by increasing the number of shards,
you can also increase overall total storage capacity, allowing near-infinite
scalability.

 High availability — finally, shards provide high availability in two ways. First,
since each shard is a replica set, every piece of data is replicated. Second, even
if an entire shard becomes unavailable since the data is distributed, the
database as a whole still remains partially functional, with part of the schema
on different shards.

Disadvantages of sharding
Sharding does come with several drawbacks, namely overhead in query result
compilation, complexity of administration, and increased infrastructure costs.

 Query overhead — each sharded database must have a separate machine or
service which understands how to route a querying operation to the
appropriate shard. This introduces additional latency on every operation.
Furthermore, if the data required for the query is horizontally partitioned
across multiple shards, the router must then query each shard and merge the
result together. This can make an otherwise simple operation quite expensive
and slow down response times.

 Complexity of administration — with a single unsharded database, only the
database server itself requires upkeep and maintenance. With every sharded
database, on top of managing the shards themselves, there are additional
service nodes to maintain. Plus, in cases where replication is being used, any
data updates must be mirrored across each replicated node. Overall, a
sharded database is a more complex system which requires more
administration.

 Increased infrastructure costs — Sharding by its nature requires additional
machines and compute power over a single database server. While this allows
your database to grow beyond the limits of a single machine, each additional
shard comes with higher costs. The cost of a distributed database system,
especially if it is missing the proper optimization, can be significant.

How does sharding work?


In order to shard a database, we must answer several fundamental questions. The
answers will determine your implementation.

First, how will the data be distributed across shards? This is the fundamental question
behind any sharded database. The answer to this question will have effects on both
performance and maintenance. More detail on this can be found in the “Sharding
Architectures and Types” section.

Second, what types of queries will be routed across shards? If the workload is
primarily read operations, replicating data will be highly effective at increasing
performance, and you may not need sharding at all. In contrast, a mixed read-write
workload or even a primarily write-based workload will require a different
architecture.

Finally, how will these shards be maintained? Once you have sharded a database,
over time, data will need to be redistributed among the various shards, and new
shards may need to be created. Depending on the distribution of data, this can be an
expensive process and should be considered ahead of time.

With these questions in mind, let’s consider some sharding architectures.

Difference between sharding and partitioning


Although sharding and partitioning both break up a large database into smaller
databases, there is a difference between the two methods.

After a database is sharded, the data in the new tables is spread across multiple
systems, but with partitioning, that is not the case. Partitioning groups data subsets
within a single database instance.

Sharding architectures and types


What are the methods of database sharding?
While there are many different sharding methods, we will consider four main kinds:
ranged/dynamic sharding, algorithmic/hashed sharding, entity/relationship-based
sharding, and geography-based sharding.

1. Ranged/dynamic sharding

Ranged sharding, or dynamic sharding, takes a field on the record as an input and,
based on a predefined range, allocates that record to the appropriate shard. Ranged
sharding requires there to be a lookup table or service available for all queries or
writes. For example, consider a set of data with IDs that range from 0-50. A simple
lookup table might look like the following:

Range      Shard ID

[0, 20)    A
[20, 40)   B
[40, 50]   C

The field on which the range is based is also known as the shard key. Naturally, the
choice of shard key, as well as the ranges, are critical in making range-based
sharding effective. A poor choice of shard key will lead to unbalanced shards, which
leads to decreased performance. An effective shard key will allow for queries to be
targeted to a minimum number of shards. In our example above, if we query for all
records with IDs 10-30, then only shards A and B will need to be queried.

Two key attributes of an effective shard key are high cardinality and well-
distributed frequency. Cardinality refers to the number of possible values of that key.
If a shard key only has three possible values, then there can only be a maximum of
three shards. Frequency refers to the distribution of the data along the possible
values. If 95% of records occur with a single shard key value then, due to this
hotspot, 95% of the records will be allocated to a single shard. Consider both of
these attributes when selecting a shard key.

Range-based sharding is an easy-to-understand method of horizontal partitioning,
but the effectiveness of it will depend heavily on the availability of a suitable shard
key and the selection of appropriate ranges. Additionally, the lookup service can
become a bottleneck, although the amount of data is small enough that this typically
is not an issue.
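A minimal, hedged sketch of such a lookup table and the routing query an application or router might run; the shard_map name and columns are hypothetical, and the upper bound is treated as exclusive so the [40, 50] range becomes (40, 51).

-- Lookup table mapping ID ranges to shards.
CREATE TABLE shard_map (
    range_start INT,
    range_end   INT,   -- exclusive upper bound
    shard_id    CHAR(1)
);

INSERT INTO shard_map VALUES (0, 20, 'A'), (20, 40, 'B'), (40, 51, 'C');

-- Route the record with ID 25 to its shard.
SELECT shard_id
FROM   shard_map
WHERE  25 >= range_start AND 25 < range_end;
Sql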

2. Algorithmic/hashed sharding

Algorithmic sharding, or hashed sharding, takes a record as an input and applies a
hash function or algorithm to it which generates an output or hash value. This output
is then used to allocate each record to the appropriate shard.

The function can take any subset of values on the record as inputs. Perhaps the
simplest example of a hash function is to use the modulus operator with the number
of shards, as follows:

Hash Value = ID % Number of Shards

Name Hash value

John 1

Jane 2

Paulo 1

Wang 2

This is similar to range-based sharding — a set of fields determines the allocation of
the record to a given shard. Hashing the inputs allows more even distribution across
shards even when there is not a suitable shard key, and no lookup table needs to be
maintained. However, there are a few drawbacks.

First, query operations for multiple records are more likely to get distributed across
multiple shards. Whereas ranged sharding reflects the natural structure of the data
across shards, hashed sharding typically disregards the meaning of the data. This is
reflected in increased broadcast operation occurrence.

Second, resharding can be expensive. Any update to the number of shards likely
requires rebalancing all shards and moving records around. It will be difficult to do this
while avoiding a system outage.

3. Entity-/relationship-based sharding

Entity-based sharding keeps related data together on a single physical shard. In a
relational database (such as PostgreSQL, MySQL, or SQL Server), related data is often
spread across several different tables.

For instance, consider the case of a shopping database with users and payment
methods. Each user has a set of payment methods that is tied tightly with that user.
As such, keeping related data together on the same shard can reduce the need for
broadcast operations, increasing performance.

4. Geography-based sharding

Geography-based sharding, or geosharding, also keeps related data together on a
single shard, but in this case, the data is related by geography. This is essentially
ranged sharding where the shard key contains geographic information and the
shards themselves are geo-located.

For example, consider a dataset where each record contains a “country” field. In this
case, we can both increase overall performance and decrease system latency by
creating a shard for each country or region, and storing the appropriate data on that
shard. This is a simple example, and there are many other ways to allocate your
geoshards which are beyond the scope of this article.

Name Shard key

John California

Jane Washington

Paulo Arizona

5. Directory sharding

Directory sharding uses a lookup table to match database information to the
corresponding physical shard. A lookup table is like a table on a spreadsheet that
links a database column to a shard key. For example, the following diagram shows a
lookup table for clothing colors.

Color Shard key

Blue A

Red B

Yellow C

Black D

When an application stores clothing information in the database, it refers to the
lookup table. If a dress is blue, the application stores the information in the
corresponding shard.
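A hedged sketch of that directory lookup (table and column names are hypothetical): the application queries the directory for an exact match first and then routes the write to the returned shard.

-- Directory table matching each color to its physical shard.
CREATE TABLE color_shard_map (
    color    TEXT PRIMARY KEY,
    shard_id CHAR(1)
);

INSERT INTO color_shard_map VALUES ('Blue', 'A'), ('Red', 'B'), ('Yellow', 'C'), ('Black', 'D');

-- Look up the shard for a blue dress before storing the record there.
SELECT shard_id FROM color_shard_map WHERE color = 'Blue';
Sql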

Pros and cons

Software developers use directory sharding because it is flexible. Each shard is a
meaningful representation of the database and is not limited by ranges. However,
directory sharding fails if the lookup table contains the wrong information.

Sharding Architectures
1. Key Based Sharding

 This technique is also known as hash-based sharding.

 Here, we take the value of an entity such as customer ID, customer email, IP
address of a client, zip code, etc and we use this value as an input of the hash
function.

 This process generates a hash value which is used to determine which shard
we need to use to store the data.

 We need to keep in mind that the values entered into the hash function
should all come from the same column (shard key) just to ensure that data is
placed in the correct order and in a consistent manner.

 Basically, shard keys act like a primary key or a unique identifier for individual
rows.

Let’s understand this with the help of an example:

You have 3 database servers and each request has an application id which is
incremented by 1 every time a new application is registered.

To determine which server the data should be placed on, we perform a modulo
operation on the application id with the number 3. Then the remainder is used to
identify the server to store our data.
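A minimal sketch of this routing rule (the applications table and application_id column are hypothetical); it should work in any SQL dialect that provides MOD().

-- With 3 servers, the remainder of application_id / 3 picks the shard (0, 1, or 2).
SELECT
    application_id,
    MOD(application_id, 3) AS shard_number
FROM applications;
Sql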

 The downside of this method is that elastic load balancing becomes hard: if you
try to add or remove database servers dynamically, it will be a difficult and
expensive process.

 A shard key shouldn’t contain values that might change over time. It should
always be static; otherwise it will slow down performance.

Advantages of Key Based Sharding:

 Predictable Data Distribution:

 Key-based sharding provides a predictable and deterministic way to distribute data across shards.

 Each unique key value corresponds to a specific shard, ensuring even and predictable distribution of data.

 Optimized Range Queries:

 If queries involve ranges of key values, key-based sharding can be optimized to handle these range queries efficiently.

 This is especially beneficial when dealing with operations that span a range of consecutive key values.

Disadvantages of Key Based Sharding:

 Uneven Data Distribution:

 If the sharding key is not well-distributed or if certain key values are more frequently accessed than others, it may result in uneven data distribution across shards, leading to potential performance bottlenecks on specific shards.

 Limited Scalability with Specific Keys:

 The scalability of key-based sharding may be limited if certain keys experience high traffic or if the dataset is heavily skewed toward specific key ranges.

 Scaling may become challenging for specific subsets of data.

 Complex Key Selection:

 Selecting an appropriate sharding key is crucial for effective key-based sharding.

 Choosing the right key may require a deep understanding of the data and query patterns, and poor choices may lead to suboptimal performance.

2. Horizontal or Range Based Sharding

 In this method, we split the data based on the ranges of a given value
inherent in each entity.

 Let’s say you have a database of your online customers’ names and email
information.

 You can split this information into two shards. In one shard you can keep the
info of customers whose first name starts with A-P and in another shard, keep
the information of the rest of the customers.

Advantages of Range Based Sharding:

 Scalability:

 Horizontal or range-based sharding allows for seamless scalability by distributing data across multiple shards, accommodating growing datasets.

 Improved Performance:

 Data distribution among shards enhances query performance through parallelization, ensuring faster operations with smaller subsets of data handled by each shard.

Disadvantages of Range Based Sharding:

 Complex Querying Across Shards:

 Coordinating queries involving multiple shards can be challenging.

 Uneven Data Distribution:

 Poorly managed data distribution may lead to uneven workloads among shards.

3. Vertical Sharding

 In this method, we split the entire column from the table and we put those
columns into new distinct tables.

 Data is totally independent of one partition to the other ones.

 Also, each partition holds both distinct rows and columns.

 We can split different features of an entity into different shards on different machines.

Let’s understand this with the help of an example:

On Twitter, a user might have a profile, a number of followers, and tweets they have
posted. We can place the user profiles on one shard, followers on a second shard, and
tweets on a third shard.

Advantages of Vertical Sharding:

 Query Performance:

 Vertical sharding can improve query performance by allowing each shard to focus on a specific subset of columns.

 This specialization enhances the efficiency of queries that involve only a subset of the available columns.

 Simplified Queries:

 Queries that require a specific set of columns can be simplified, as they only need to interact with the shard containing the relevant columns.

 This can result in more straightforward and efficient query execution.

Disadvantages of Vertical Sharding:

 Limited Horizontal Scalability:

 Vertical sharding may have limitations in terms of horizontal scalability compared to horizontal sharding.

 Scaling vertically involves upgrading the capacity of individual servers, which may have practical limitations.

 Potential for Hotspots:

 Certain shards may become hotspots if they contain highly accessed columns, leading to uneven distribution of workloads.

 This can result in performance bottlenecks and reduced overall system efficiency.

 Challenges in Schema Changes:

 Making changes to the schema, such as adding or removing columns, may be more challenging in a vertically sharded system.

 Changes can impact multiple shards and require careful coordination.

4. Directory-Based Sharding

 In this method, we create and maintain a lookup service or lookup table for
the original database.

 Basically we use a shard key for lookup table and we do mapping for each
entity that exists in the database.

 This way we keep track of which database shards hold which data.

The lookup table holds a static set of information about where specific data can be
found. In this example, the delivery zone is used as the shard key:

 Firstly the client application queries the lookup service to find out the shard
(database partition) on which the data is placed.

 When the lookup service returns the shard it queries/updates that shard.

Advantages of Directory-Based Sharding:

 Flexible Data Distribution:

 Directory-based sharding allows for flexible data distribution, where the central directory can dynamically manage and update the mapping of data to shard locations.

 This flexibility facilitates efficient load balancing and adaptation to changing data patterns.

 Efficient Query Routing:

 Queries can be efficiently routed to the appropriate shard using the information stored in the directory.

 This results in improved query performance, as the central directory optimizes the process of directing queries to the specific shard that contains the relevant data.

 Dynamic Scalability:

 The system can dynamically scale by adding or removing shards without requiring changes to the application logic.

 The central directory handles the mapping and distribution of data, making it easier to adapt the system to changing requirements and workloads.

Disadvantages of Directory-Based Sharding:

 Centralized Point of Failure:

 The central directory represents a single point of failure.

 If the directory becomes unavailable or experiences issues, it can disrupt the entire system, impacting data access and query routing.

 Increased Latency:

 Query routing through a central directory introduces an additional layer, potentially leading to increased latency compared to other sharding strategies.

 This additional step in the process can affect response times.

Advantages of Sharding in System Design


 Solve Scalability Issue:

o With a single database server architecture, any application experiences performance degradation when the number of users starts growing.

o Read and write queries become slower and the network bandwidth starts to saturate. Database sharding fixes all these issues by partitioning the data across multiple machines.

 High Availability:

o A problem with a single server architecture is that if an outage happens, the entire application becomes unavailable, which is not good for a website.

o In a sharded architecture, by contrast, an outage takes down only some specific shards.

o All the other shards continue to operate, and the entire application is not unavailable to the users.

 Speed Up Query Response Time:

o When you submit a query in an application with a large monolithic database and no sharded architecture, it takes more time to find the result.

o It has to search every row in the table, and that slows down the response time for the query.

o In a sharded database a query has to go through fewer rows, and you receive the response in less time.

 More Write Bandwidth:

o For many applications, writing is a major bottleneck.

o With no master database serializing writes, a sharded architecture allows you to write in parallel and increase your write throughput.

 Scaling Out:

o Sharding a database facilitates horizontal scaling, known as scaling out. In horizontal scaling, you add more machines to the network and distribute the load across these machines for faster processing and response.

Disadvantages of Sharding in System Design


 Adds Complexity in the System:

o You need to be careful while implementing a proper sharded database architecture in an application.

o It is a complicated task, and if it is not implemented properly, you may lose data or end up with corrupted tables in your database.

o You also need to manage the data from multiple shard locations, which may affect the workflow of your team.

 Rebalancing Data:

o Sometimes shards become unbalanced (when a shard outgrows other shards).

o Consider an example in which you have two shards of a database:

 One shard stores the names of customers beginning with the letters A through M. Another shard stores the names of customers beginning with the letters N through Z.

 If there are many users whose names start with the letter L, then shard one will have more data than shard two. This will slow down the application, and it will stall out for a significant portion of your users.

 The A-M shard will become unbalanced, and it will be known as a database hotspot.

o To overcome this problem and to rebalance the data, you need to do re-sharding for even data distribution.

 Joining Data From Multiple Shards is Expensive:

o In a single database, joins can be performed easily to implement any functionality.

o But in a sharded architecture, you need to pull the data from different shards and perform joins across multiple networked servers; you can’t submit a single query to get the data from various shards.

o You need to submit multiple queries, one for each of the shards, which adds latency to your system.

 No Native Support:

o Sharding is not natively supported by every database engine. For example, PostgreSQL doesn’t include automatic sharding features, so you have to do manual sharding and follow the “roll-your-own” approach.

o It will be difficult to find tips or documentation for sharding and to troubleshoot problems during the implementation of sharding.

What are the challenges of database sharding?


Organizations might face these challenges when implementing database sharding.

1. Data hotspots

Some of the shards become unbalanced due to the uneven distribution of data.
For example, a single physical shard that contains customer names starting with A
receives more data than others. This physical shard will use more computing
resources than others.

Solution

You can distribute data evenly by using optimal shard keys. Some datasets are
better suited for sharding than others.

2. Operational complexity

Database sharding creates operational complexity. Instead of managing a single
database, developers have to manage multiple database nodes. When they are
retrieving information, developers must query several shards and combine the
pieces of information together. These retrieval operations can complicate
analytics.

Solution

In the AWS database portfolio, database setup and operations have been
automated to a large extent. This makes working with a sharded database
architecture a more streamlined task.

3. Infrastructure costs

Organizations pay more for infrastructure costs when they add more computers
as physical shards. Maintenance costs can add up if you increase the number of
machines in your on-premises data center.

Solution

Developers use Amazon Elastic Compute Cloud (Amazon EC2) to host and scale
shards in the cloud. You can save money by using virtual infrastructure that AWS
fully manages.

4. Application complexity

Most database management systems do not have built-in sharding features. This
means that database designers and software developers must manually split,
distribute, and manage the database.

Solution

You can migrate your data to the appropriate AWS purpose-built databases,
which have several built-in features that support horizontal scaling.

Version
Here are some key aspects of versioning in big data:
1. Data Versioning:

Data versioning involves tracking different versions of data over time. Each
version represents a snapshot of the data at a specific point in time, capturing
changes, updates, or modifications made to the data set. Data versioning
enables users to track the history of changes, revert to previous versions if
needed, and maintain data lineage for auditability and compliance purposes.

2. Schema Evolution:

In big data systems, data schemas may evolve over time due to changes in
business requirements, data sources, or application logic. Versioning enables
the management of schema changes and evolution by tracking different
versions of data schemas and ensuring compatibility with older versions of the
data.

3. Version Control:

Version control systems (VCS) or versioning repositories are commonly used


to manage data versioning in big data environments. These systems provide
mechanisms for storing, organizing, and tracking different versions of data
files, scripts, configurations, and other artifacts used in big data processing
workflows.

4. Immutable Data:

In some big data systems, data may be stored in immutable or append-only


formats, where existing data cannot be modified or deleted once it is written.
Immutable data storage facilitates versioning by ensuring that each version of
the data remains unchanged and immutable, preserving data integrity and
traceability.

5. Metadata Management:

Metadata plays a crucial role in data versioning by providing information


about the provenance, lineage, and dependencies of different data versions.
Metadata management systems track metadata associated with data versions,
including timestamps, authorship, version identifiers, and dependencies on
other data artifacts.

6. Data Lineage:

Data lineage refers to the end-to-end tracking of data from its source to its
destination and through various processing steps and transformations.

Versioning helps maintain data lineage by tracking changes and
transformations applied to the data over time, allowing users to trace back to
the original source and understand how the data has been modified or
transformed.

7. Data Reproducibility:

Versioning facilitates data reproducibility by enabling users to recreate and


reproduce data processing workflows and analyses using specific versions of
input data, code, and configurations. By ensuring consistency and
reproducibility of data versions, versioning enhances the reliability and
trustworthiness of analytical results and insights generated from big data.

Overall, versioning is a critical aspect of big data management and processing,


enabling organizations to maintain data integrity, traceability, and reproducibility in
distributed and scalable data environments. By tracking different versions of data
and data artifacts, versioning systems help ensure data quality, consistency, and
reliability in big data systems and workflows.
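As a small illustration of several of the aspects above (versioned data, immutable append-only storage, and version metadata), the following Python sketch keeps every write as a new immutable version with a timestamp. The class and field names are invented for this example and do not correspond to any particular versioning product.

from datetime import datetime, timezone

class VersionedStore:
    """Append-only store: updates never overwrite data, they add a new version."""

    def __init__(self):
        self._versions = {}  # key -> list of version records

    def put(self, key, value):
        history = self._versions.setdefault(key, [])
        history.append({
            "version": len(history) + 1,                           # version identifier
            "value": value,                                        # immutable snapshot
            "written_at": datetime.now(timezone.utc).isoformat(),  # lineage metadata
        })

    def get(self, key, version=None):
        history = self._versions[key]
        return history[-1] if version is None else history[version - 1]

store = VersionedStore()
store.put("sales_2023.csv", {"rows": 1000})
store.put("sales_2023.csv", {"rows": 1200})             # a correction becomes version 2
print(store.get("sales_2023.csv", version=1)["value"])  # {'rows': 1000}
print(store.get("sales_2023.csv")["version"])           # 2 (latest)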

Hadoop Versions and Distributions


Hadoop is an ecosystem of software providing various services related to distributed
processing of data. One of these core services is HDFS, which is a scalable and
fault-tolerant distributed file system. If you want to run DataFlow in a distributed
cluster, then we recommend that you use a distributed file system such as HDFS.

The following distributions and versions of Hadoop are supported for use with
DataFlow:

 Apache Hadoop 2.2 and 3.1

 HortonWorks distribution (HDP) version 2.3 to 2.6.5

 Cloudera’s distribution (CDH) version 4.2 up to version 5.15

 MapR distribution version 5.2 or 6.1

HBase Version
DataFlow provides both a reader and writer for accessing HBase, a scalable database
built using Hadoop. The HBase support in DataFlow works with:

 Apache HBase distributed with CDH version 4.2 and later

 Hortonworks HBase distributed with HDP version 2.0 and later

Note: For more information about supported Hadoop and HBase distributions,
see Hadoop Module Configurations.

Hive Version
DataFlow provides the following readers and writers for Hive: the ORCReader,
ORCWriter, and ParquetReader operators.

What Are Version Control and Data Versioning?


Version control is a mechanism that allows you to follow and monitor every stage of
a project. It enables teams to check modifications on the source code and ensure
that changes are not overlooked in the code.

Data versioning is the process of storing corresponding versions of data that were
created or modified at different time intervals.

There are many valid reasons for making changes to a dataset. Data specialists can
test the machine learning (ML) models to increase the success rate of the project. For
this, they need to make important manipulations on the data. Datasets may also

update over time due to the continuous inflow of data from different resources. In
the end, keeping older versions of data can help organizations replicate a previous
environment.

Why Is Data Versioning Important?


Usually, in the software development lifecycle, the development process is spread
over a large period, and tracking changes in the project can be difficult. Therefore,
versioning software projects makes this process easy and saves teams from having to
regularly name every version of the script.

Additionally, with the help of data versioning, historical datasets are saved and kept
in the databases. This aspect provides some advantages as follows.

Benefits of Data Versioning


 Keeping the Best Model while Training

The main objective of data science projects is to contribute to the business


demands of the company. Therefore, data scientists need to develop many ML
models as per customer or product requests. This situation requires inserting
new datasets into the ML pipeline for every attempted modeling. However,
data specialists need to be careful not to lose the dataset that gives them the
best score in modeling. This is achieved with the help of a data versioning
system.

 Providing A New Business Metric in Growth

In today’s digital world, it is a fact that companies that make decisions and
develop strategies using data will survive. Therefore, it is important not to lose
historical data.

Consider an e-commerce company that serves daily necessities. Every


transaction in the application changes the sales data. People’s needs and
demands may change over time. Therefore, keeping all the sales data can be
beneficial for gaining insights into new trends about customer demands and
determining the right strategies and campaigns.

In the end, it gives companies a new business metric to measure their success
or performance.

 Protection Regarding Data Privacy Issues

As digital transformation is accelerating in today’s world, the amount of data


produced is increasing rapidly. However, this situation has brought with it
concerns about protecting personal data. Consequent data protection
regulations set by governments force companies to store a certain amount of
data.

Data versioning can help in such situations by ensuring that data is stored at
specific times. It can also assist organizations in meeting the requirements of
such regulations.

A Quick Demo with an Open Source Data Versioning Tool


There are many data versioning tools in the market. They offer similar features for
data storage, but some of them have important advantages over others.

LakeFS is an open-source platform that enables data analytics teams to manage


databases or data lakes like they manage the source code. It runs parallel ML
pipelines for testing and CI/CD operations for the whole data lifecycle. This provides
flexibility and ease of control in the form of object storage in data lakes.

With LakeFS, every process — from complex ETL processes to data analytics and
machine learning steps — can be transformed into automatic and easy-to-track data
science projects. Some prominent features of lakeFS are:

 Supports cloud solutions like AWS S3, Google Cloud Storage, and Microsoft
Azure Blob Storage

 Works easily with most modern big data frameworks and technologies such as
Hadoop, Spark, Kafka, etc.

 Provides Git-like operations like a branch, commit, and merge, which enables
scaling of petabytes of data with the power of cloud solutions

 Gives options for deployment in the cloud or on-prem and using any API
compatible with S3 storage

Installing and Running the LakeFS Environment

To run a LakeFS session on your local computer, please make sure you install Docker
and Docker Compose with a version of 1.25.04 or higher. To run LakeFS with Docker,
type the following command:

$ curl https://compose.lakefs.io | docker-compose -f - up

After that, check your installation and the running session at

http://127.0.0.1:8000/setup

User Registration and Creating Repository

 To create a new repository, register as an admin user from the following
link: http://127.0.0.1:8000/setup.

 For this step, determine the Username and save your credentials, which are
Key ID and Secret Key. Log in to your admin user profile with this information.

 Click the Create Repository button in the admin panel and enter the
Repository ID, Storage Namespace, and Default Branch values. After that,
press Create Repository. Your initial repository has been created.

Adding Data to a New Repository

LakeFS sessions can be used with AWS CLI because it has an S3 compatible API. But
please make sure the AWS CLI is installed on your local computer.

 To be able to configure a new connection using the LakeFS credentials with


AWS CLI, type the following command in the terminal. After that, please enter
your Key ID and Secret Key values into the terminal.

$ aws configure --profile local

 To see whether the connection works and to list all the repositories in the
workspace, type the following command in the terminal:

$ aws --endpoint-url=http://localhost:8000 --profile local s3 ls
# output:
# 2022-01-30 22:57:02 demo-repo
 Finally, to add new data to the repository by writing it to the main branch,
type the following command in the terminal:

$ aws --endpoint-url=http://localhost:8000 --profile local s3 cp ./tweets.txt s3://demo-repo/main/
# output:
# upload: ./tweets.txt to s3://demo-repo/main/tweets.txt

 Now, the tweets.txt file has been written to the main branch of the demo-repo
repository. Please check it on the LakeFS UI.

Committing Changes on Added Data

Thanks to LakeFS, the changes made to data can be committed using LakeFS’s
default CLI client lakectl. Please make sure you have installed the latest version of the
CLI binary on your local computer.

 To configure the CLI binary settings, type the following command in the
terminal:

$ lakectl config

 To verify the configuration of lakectl, you can list all the branches in the
repository with the following command:

$ lakectl branch list lakefs://demo-repo


 To commit the added data to the repository, type the following
command in the terminal:

$ lakectl commit lakefs://demo-repo/main -m 'added my first tweets data to repo!'

 Finally, to check the committed message, type the following command in the
terminal:

$ lakectl log lakefs://demo-repo/main

The Challenges of Data Versioning


 Storage Space

Training data may take up a lot of space in Git repositories. This is because Git
was designed to track changes in text files rather than big binary files. If a
team’s training data sets include big audio or video files, this might lead to a
slew of issues down the road. Each modification to the training data set will
frequently result in a duplicated data set in the repository’s history. Not only
does this result in a bloated repository, but it also makes cloning and rebasing
extremely sluggish.

 Data Versioning Management

When it comes to managing versions, whether of code or user interfaces,
there is a general tendency, even among techies, to “manage versions” by
appending a version number or word to the end of a file name. In the context
of data, this may mean that a project has data.csv, data v1.csv, data v2.csv,
data v3 finalversion.csv, and so forth. This terrible habit is more than a cliche;
in reality, most engineers, data scientists, and UI specialists begin with bad
versioning habits.

 Multiple Users

One of the most challenging aspects of working in a production setting is


interacting with other data scientists. If you don’t use version control in a
collaborative workplace, files will be destroyed, changed, and relocated, and
you’ll have no idea who did what. Furthermore, restoring your data to its
original form will be tough. This is one of the most difficult challenges in
managing models and datasets.

Options for versioning the data


File Versioning

One method for data versioning is to save versions to your PC manually. File
versioning is helpful for:

 Small businesses: Businesses with only a few data engineers or scientists
working in the same location.

 Protecting sensitive information: If the data contains sensitive information,


it should only be examined and analyzed by a small group of executives and
data engineers.

 Individual work: When a task is not appropriate for cooperation and several
persons cannot work together to reach a common goal.

Using a data versioning tool

Aside from file versioning, specialized tools are available. You have the option of
developing your own software or adopting an existing tool. DVC, Delta Lake, and
Pachyderm are among the tools that provide such services.

Data versioning systems are better suited for businesses that require:

 Real-time editing: When more than one person is working on a dataset, it is


more efficient to use a dedicated tool. This is because file versioning does not
allow for real-time editing with a group of individuals.

 Collaboration from multiple places: When individuals need to work from


separate locations, employing software rather than file versioning is more
efficient.

 Accountability: Data versioning software allows you to discover where errors


occur and who produces them. As a result, the team’s responsibility is
increased.

Best Data Version Control Alternatives


Data versioning is one of the cornerstones of automating a team’s machine learning
model development. While developing your own system to handle the process can be
quite difficult, it does not have to be.

 DVC

DVC, or Data Version Control, is one of several open-source technologies


available to aid data science and machine learning projects. The programme is
similar to Git in that it provides a simple command line that can be configured
in a few easy steps. Despite its name, DVC is not concerned only with data
versioning; it also assists teams in managing pipelines and machine learning
models. Finally, DVC will aid your team’s consistency and the repeatability of
your models.

 Delta Lake

Delta Lake is an open-source storage layer designed to aid in the


improvement of data lakes. It enables ACID transactions, data versioning, and
metadata management.

The technology is more akin to a data lake abstraction layer, filling in the gaps
left by typical data lakes.

 Git LFS

Git LFS is a Git extension created by a group of open-source volunteers. By
employing pointers instead of files, the programme keeps big files that would
otherwise be uploaded to your repository (e.g., images and data sets) out of
Git’s history.

The pointers are lightweight and point to the actual files in a separate large-file
store. As a result, when you push your repo to the central repository, the push
completes quickly and the repository takes up less space.

When it comes to data management, this is a pretty lightweight solution.

 Pachyderm

Pachyderm is one of the list’s few data science platforms. The goal of
Pachyderm is to provide a platform that makes it simple to replicate the
outcomes of machine learning models by controlling the complete data
process. Pachyderm is known as “the Docker of data” in this context.

Pachyderm packages your execution environment using Docker containers.


This makes it simple to replicate the same output. The combination of
versioned data with Docker makes it simple for data scientists and DevOps
teams to deploy and maintain the consistency of models.

Pachyderm has agreed to its Data Science Bill of Rights, which describes the
product’s core goals: reproducibility, data provenance, collaboration,
incrementality, and autonomy, as well as infrastructure abstraction.

These pillars drive many of its features, allowing teams to utilize the platform
entirely.

 Dolt

Dolt is a one-of-a-kind data versioning system: unlike some of the other solutions
described here, which sit on top of your data and just version it, Dolt is itself a
database.

Dolt is a SQL database that supports Git-style versioning. Unlike Git, which
allows you to version files, Dolt will enable you to version tables. This means
you may update and modify data without fear of losing the changes.

While the programme is currently in its early stages, there are hopes to make
it fully Git and MySQL compatible shortly.

 LakeFS

LakeFS enables teams to create data lake activities that are repeatable, atomic,
and versioned. It’s a newbie on the scene, but it delivers a powerful punch. It
offers a Git-like branching and version management methodology designed
to operate with your data lake and scale to Petabytes of data.

It delivers ACID compliance to your data lake the same way as Delta Lake.
However, LakeFS supports both AWS S3 and Google Cloud Storage as
backends, so you don’t have to use Spark to get the benefits.

You don’t necessarily have to put in a lot of work to manage your data to reap
the benefits of data versioning. For example, much of data versioning is
intended to aid in the tracking of data sets that change significantly over time.

Some data, such as web traffic, is simply appended to. That is, data is added
but seldom, if ever, updated. This implies that the only data versioning needed
to get reproducible results is the start and finish dates. This is significant
because, in such circumstances, you may be able to bypass all of the tools
mentioned above.

Map Reduce
MapReduce is a big data analysis model that processes data sets using a parallel
algorithm on computer clusters, typically Apache Hadoop clusters or cloud systems
like Amazon Elastic MapReduce (EMR) clusters.

What is MapReduce?
MapReduce is a Java-based, distributed execution framework within the Apache
Hadoop Ecosystem. It takes away the complexity of distributed programming by
exposing two processing steps that developers implement: 1) Map and 2) Reduce. In
the Mapping step, data is split between parallel processing tasks. Transformation
logic can be applied to each chunk of data. Once completed, the Reduce phase takes
over to handle aggregating data from the Map set. In general, MapReduce
uses Hadoop Distributed File System (HDFS) for both input and output. However,
some technologies built on top of it, such as Sqoop, allow access to relational
systems.

History of MapReduce
MapReduce was developed at Google back in 2004 by Jeffrey Dean and
Sanjay Ghemawat (Dean & Ghemawat, 2004).

Their paper, “MapReduce: Simplified Data Processing on Large Clusters,” was
inspired by the map and reduce functions commonly used in functional
programming. At that time, Google’s proprietary MapReduce system ran on the
Google File System (GFS). By 2014, Google was no longer using MapReduce as their
primary big data processing model. MapReduce was once the only method through
which the data stored in the HDFS could be retrieved, but that is no longer the case.
Today, there are other query-based systems such as Hive and Pig that are used to
retrieve data from the HDFS using SQL-like statements that run along with jobs
written using the MapReduce model.

Why MapReduce?
Traditional Enterprise Systems normally have a centralized server to store and
process data. The following illustration depicts a schematic view of a traditional
enterprise system. Traditional model is certainly not suitable to process huge
volumes of scalable data and cannot be accommodated by standard database
servers. Moreover, the centralized system creates too much of a bottleneck while
processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce.


MapReduce divides a task into small parts and assigns them to many computers.
Later, the results are collected at one place and integrated to form the result dataset.

How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The Map task takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key-value pairs).

 The Reduce task takes the output from the Map as an input and combines
those data tuples (key-value pairs) into a smaller set of tuples.

The reduce task is always performed after the map job.

Let us now take a close look at each of the phases and try to understand their
significance.

 Input Phase − Here we have a Record Reader that translates each record in
an input file and sends the parsed data to the mapper in the form of key-value
pairs.

 Map − Map is a user-defined function, which takes a series of key-value pairs


and processes each one of them to generate zero or more key-value pairs.

 Intermediate Keys − The key-value pairs generated by the mapper are
known as intermediate keys.

 Combiner − A combiner is a type of local Reducer that groups similar data


from the map phase into identifiable sets. It takes the intermediate keys from
the mapper as input and applies a user-defined code to aggregate the values
in a small scope of one mapper. It is not a part of the main MapReduce
algorithm; it is optional.

 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the
Reducer is running. The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys together so that their
values can be iterated easily in the Reducer task.

 Reducer − The Reducer takes the grouped key-value paired data as input and
runs a Reducer function on each one of them. Here, the data can be
aggregated, filtered, and combined in a number of ways, and it requires a
wide range of processing. Once the execution is over, it gives zero or more
key-value pairs to the final step.

 Output Phase − In the output phase, we have an output formatter that


translates the final key-value pairs from the Reducer function and writes them
onto a file using a record writer.

Let us try to understand the two tasks, Map and Reduce, with the help of a small
example.
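As a concrete illustration of these phases, the following Python sketch simulates the Map, Shuffle-and-Sort, and Reduce steps of the classic word-count job in memory. A real Hadoop job would express the same logic as Mapper and Reducer classes (typically in Java) running over HDFS input splits; the sample input lines here are made up.

from collections import defaultdict

lines = ["big data needs big clusters", "map reduce splits big jobs"]

# Map: emit a (word, 1) pair for every word in every input record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and Sort: group all emitted values by key so that each key's
# values can be iterated together (a Combiner would pre-aggregate these
# per mapper before they cross the network).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate the grouped values for each key.
word_counts = {word: sum(counts) for word, counts in sorted(grouped.items())}
print(word_counts)   # {'big': 3, 'clusters': 1, 'data': 1, 'jobs': 1, 'map': 1, ...}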

MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter
receives around 500 million tweets per day, which is nearly 6,000 tweets per second.
The following illustration shows how Twitter manages its tweets with the help of
MapReduce.

As shown in the illustration, the MapReduce algorithm performs the following
actions −

 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-
value pairs.

 Filter − Filters unwanted words from the maps of tokens and writes the
filtered maps as key-value pairs.

 Count − Generates a token counter per word.

 Aggregate Counters − Prepares an aggregate of similar counter values into


small manageable units.

MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop which makes it so
powerful and efficient to use. MapReduce is a programming model used for efficient
processing in parallel over large data-sets in a distributed manner. The data is first
split and then combined to produce the final result. Libraries for MapReduce have
been written in many programming languages, with various optimizations. The
purpose of MapReduce in Hadoop is to map each job and then reduce it to
equivalent tasks, providing less overhead over the cluster network and reducing the
required processing power. The MapReduce task is mainly divided into two phases:
the Map phase and the Reduce phase.

MapReduce Architecture:

Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the Job to the
MapReduce for processing. There can be multiple clients available that
continuously send jobs for processing to the Hadoop MapReduce Manager.

2. Job: The MapReduce Job is the actual work that the client wanted to do which
is comprised of so many smaller tasks that the client wants to process or
execute.

3. Hadoop MapReduce Master: It divides the particular job into subsequent


job-parts.

4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job.
The results of all the job-parts are combined to produce the final output.

5. Input Data: The data set that is fed to the MapReduce for processing.

6. Output Data: The final result is obtained after the processing.

In MapReduce, we have a client. The client will submit the job of a particular size to
the Hadoop MapReduce Master. Now, the MapReduce master will divide this job into
further equivalent job-parts. These job-parts are then made available for the Map
and Reduce Task. This Map and Reduce task will contain the program as per the
requirement of the use-case that the particular company is solving. The developer
writes their logic to fulfill the requirement that the industry requires. The input data
which we are using is then fed to the Map Task and the Map will generate
intermediate key-value pair as its output. The output of Map i.e. these key-value
pairs are then fed to the Reducer and the final output is stored on the HDFS. There
can be n number of Map and Reduce tasks made available for processing the data as
per the requirement. The Map and Reduce algorithms are written in a highly
optimized way so that time and space complexity are kept to a minimum.

Let’s discuss the MapReduce phases to get a better understanding of its architecture:

The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce
phase.

1. Map: As the name suggests its main use is to map the input data in key-value
pairs. The input to the map may be a key-value pair where the key can be the
id of some kind of address and value is the actual value that it keeps. The Map
() function will be executed in its memory repository on each of these input
key-value pairs and generates the intermediate key-value pair which works as
input for the Reducer or Reduce () function.

2. Reduce: The intermediate key-value pairs that work as input for the Reducer are
shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or
groups the data based on its key-value pairs as per the reducer algorithm
written by the developer.

How Job tracker and the task tracker deal with MapReduce:

1. Job Tracker: The work of Job tracker is to manage all the resources and all
the jobs across the cluster and also to schedule each map on the Task Tracker
running on the same data node since there can be hundreds of data nodes
available in the cluster.

2. Task Tracker: The Task Tracker can be considered as the actual slaves that are
working on the instruction given by the Job Tracker. This Task Tracker is
deployed on each of the nodes available in the cluster that executes the Map
and Reduce task as instructed by Job Tracker.

MapReduce provides several key benefits for processing big data

 Parallelization: MapReduce enables parallel processing of large datasets
across distributed clusters, allowing computations to be performed in parallel
on multiple nodes.

 Fault Tolerance: MapReduce provides built-in fault tolerance mechanisms,


such as task replication and automatic task recovery, to handle node failures
and ensure reliable data processing.

 Scalability: MapReduce is highly scalable and can efficiently process datasets


of any size by adding more nodes to the cluster as needed.

 Abstraction: MapReduce abstracts away the complexities of distributed data


processing, allowing developers to focus on writing simple Map and Reduce
functions without worrying about low-level details of parallelization and fault
tolerance.

Key Features of MapReduce


The following advanced features characterize MapReduce:

1. Highly scalable

A framework with excellent scalability is Apache Hadoop MapReduce. This is


because of its capacity for distributing and storing large amounts of data
across numerous servers. These servers can all run simultaneously and are all
reasonably priced.

By adding servers to the cluster, we can simply grow the amount of storage
and computing power. We may improve the capacity of nodes or add any
number of nodes (horizontal scalability) to attain high computing power.
Organizations may execute applications from massive sets of nodes,
potentially using thousands of terabytes of data, thanks to Hadoop
MapReduce programming.

2. Versatile

Businesses can use MapReduce programming to access new data sources. It


makes it possible for companies to work with many forms of data. Enterprises
can access both organized and unstructured data with this method and
acquire valuable insights from the various data sources.

Since Hadoop is an open-source project, its source code is freely accessible


for review, alterations, and analyses. This enables businesses to alter the code
to meet their specific needs. The MapReduce framework supports data from
sources including email, social media, and clickstreams in different languages.

3. Secure
The MapReduce programming model uses the HBase and HDFS security
approaches, and only authenticated users are permitted to view and
manipulate the data. HDFS uses a replication technique in Hadoop 2 to
provide fault tolerance. Depending on the replication factor, it makes a clone
of each block on the various machines. One can therefore access data from
the other devices that house a replica of the same data if any machine in a
cluster goes down. Erasure coding has taken the role of this replication
technique in Hadoop 3. Erasure coding delivers the same level of fault
tolerance with less area. The storage overhead with erasure coding is less than
50%.

4. Affordability

With the help of the MapReduce programming framework and Hadoop’s


scalable design, big data volumes may be stored and processed very
affordably. Such a system is particularly cost-effective and highly scalable,
making it ideal for business models that must store data that is constantly
expanding to meet the demands of the present.

In terms of scalability, processing data with older, conventional relational


database management systems was not as simple as it is with the Hadoop
system. In these situations, the company had to minimize the data and
execute classification based on presumptions about how specific data could
be relevant to the organization, hence deleting the raw data. The MapReduce
programming model in the Hadoop scale-out architecture helps in this
situation.

5. Fast-paced

The Hadoop Distributed File System, a distributed storage technique used by


MapReduce, is a mapping system for finding data in a cluster. The data
processing technologies, such as MapReduce programming, are typically
placed on the same servers that enable quicker data processing.

Thanks to Hadoop’s distributed data storage, users may process data in a


distributed manner across a cluster of nodes. As a result, it gives the Hadoop
architecture the capacity to process data exceptionally quickly. Hadoop
MapReduce can process unstructured or semi-structured data in high
numbers in a shorter time.

6. Based on a simple programming model

Hadoop MapReduce is built on a straightforward programming model and is


one of the technology’s many noteworthy features. This enables programmers
to create MapReduce applications that can handle tasks quickly and
effectively. Java is a very well-liked and simple-to-learn programming
language used to develop the MapReduce programming model.

Java programming is simple to learn, and anyone can create a data processing
model that works for their company. Hadoop is straightforward to utilize
because customers don’t need to worry about computing distribution. The
framework itself does the processing.

7. Parallel processing-compatible

The parallel processing involved in MapReduce programming is one of its key


components. The tasks are divided in the programming paradigm to enable
the simultaneous execution of independent activities. As a result, the program
runs faster because of the parallel processing, which makes it simpler for the
processes to handle each job. Multiple processors can carry out these broken-
down tasks thanks to parallel processing. Consequently, the entire software
runs faster.

8. Reliable

The same set of data is transferred to some other nodes in a cluster each time
a collection of information is sent to a single node. Therefore, even if one
node fails, backup copies are always available on other nodes that may still be
retrieved whenever necessary. This ensures high data availability.

The framework offers a way to guarantee data trustworthiness through the


use of Block Scanner, Volume Scanner, Disk Checker, and Directory Scanner
modules. Your data is safely saved in the cluster and is accessible from
another machine that has a copy of the data if your device fails or the data
becomes corrupt.

9. Highly available

Hadoop’s fault tolerance feature ensures that even if one of the DataNodes
fails, the user may still access the data from other DataNodes that have copies
of it. Moreover, the high accessibility Hadoop cluster comprises two or more
active and passive NameNodes running on hot standby. The active
NameNode is the active node. A passive node is a backup node that applies
changes made in active NameNode’s edit logs to its namespace.

Top 5 Uses of MapReduce


By spreading out processing across numerous nodes and merging or decreasing the
results of those nodes, MapReduce has the potential to handle large data volumes.
This makes it suitable for the following use cases:


1. Entertainment

Hadoop MapReduce assists end users in finding the most popular movies based on
their preferences and previous viewing history. It primarily concentrates on their
clicks and logs.

Various OTT services, including Netflix, regularly release many web series and movies.
It may have happened to you that you couldn’t pick which movie to watch, so you
looked at Netflix’s recommendations and decided to watch one of the suggested
series or films. Netflix uses Hadoop and MapReduce to indicate to the user some
well-known movies based on what they have watched and which movies they enjoy.
MapReduce can examine user clicks and logs to learn how they watch movies.

2. E-commerce

Several e-commerce companies, including Flipkart, Amazon, and eBay, employ


MapReduce to evaluate consumer buying patterns based on customers’ interests or
historical purchasing patterns. For various e-commerce businesses, it provides
product suggestion methods by analyzing data, purchase history, and user
interaction logs.

Many e-commerce vendors use the MapReduce programming model to identify


popular products based on customer preferences or purchasing behavior. Making

item proposals for e-commerce inventory is part of it, as is looking at website
records, purchase histories, user interaction logs, etc., for product recommendations.

3. Social media

Nearly 500 million tweets, or about 6,000 per second, are sent daily on the
microblogging platform Twitter. MapReduce processes Twitter data, performing
operations such as tokenization, filtering, counting, and aggregating counters.

 Tokenization: It creates key-value pairs from the tokenized tweets by


mapping the tweets as maps of tokens.

 Filtering: The terms that are not wanted are removed from the token maps.

 Counting: It creates a token counter for each word in the count.

 Aggregate counters: A grouping of comparable counter values is prepared


into small, manageable pieces using aggregate counters.

4. Data warehouse

Systems that handle enormous volumes of information are known as data warehouse
systems. The star schema, which consists of a fact table and several dimension tables,
is the most popular data warehouse model. In a shared-nothing architecture, storing
all the necessary data on a single node is impossible, so retrieving data from other
nodes is essential.

This results in network congestion and slow query execution speeds. If the
dimensions are not too big, users can replicate them over nodes to get around this
issue and maximize parallelism. Using MapReduce, we may build specialized business
logic for data insights while analyzing enormous data volumes in data warehouses.

5. Fraud detection

Conventional methods of preventing fraud are not always very effective. For instance,
data analysts typically manage inaccurate payments by auditing a tiny sample of
claims and requesting medical records from specific submitters. Hadoop is a system
well suited for handling large volumes of data needed to create fraud
detection algorithms. Financial businesses, including banks, insurance companies,
and payment locations, use Hadoop and MapReduce for fraud detection, pattern
recognition evidence, and business analytics through transaction analysis.

6. Takeaway

For years, MapReduce was a prevalent (and the de facto standard) model for
processing high-volume datasets. In recent years, it has given way to new systems
like Google’s new Cloud Dataflow. However, MapReduce continues to be used across
cloud environments, and in June 2022, Amazon Web Services (AWS) made its
Amazon Elastic MapReduce (EMR) Serverless offering generally available. As
enterprises pursue new business opportunities from big data, knowing how to use
MapReduce will be an invaluable skill in building data analysis applications.

How do companies use MapReduce?


As the data processing market has matured, MapReduce’s market share has declined
to less than one per cent. Nevertheless, it is still used by nearly 1,500 companies in
the United States, with some uptake in other countries.

By and large, MapReduce is used by the computer software and IT services industry.
Other industries include financial services, hospitals and healthcare, higher education,
retail, insurance, telecommunications and banking. The following are a few example
use cases:

 Financial services: Retail banks use a Hadoop system to validate data


accuracy and quality to comply with federal regulations.

 Healthcare: A health IT Company uses a Hadoop system to archive years of


claims and remit data, which amounts to processing terabytes of data every
day and storing them for further analytical purposes. Another hospital system
monitors patient vitals by collecting billions of constantly streaming data
points from sensors attached to patients.

 IT services: A major IT services provider collects diagnostic data from its


storage systems deployed at its customers’ sites. It uses a Hadoop system that
runs MapReduce on unstructured logs and system diagnostic information.

 Research: Ongoing research on the human genome project uses Hadoop


MapReduce to process massive amounts of data. And a popular family
genetics research provider runs an increasing flood of gene-sequencing data,
including structured and unstructured data on births, deaths, census results,
and military and immigration records, which amounts to many petabytes and
continues to grow.

 Retail: A leading online marketplace uses MapReduce to analyse huge


volumes of log data to determine customer behaviour, search
recommendations and more. And a major department store runs marketing
campaign data through a Hadoop system to gain insights for making more
targeted campaigns, down to the individual customer.

 Telecommunications: A major telecom vendor stores billions of call records
with real-time access for customers, which amounts to hundreds of terabytes
of data to be processed.

How does HPE help with MapReduce?
HPE offers several solutions that can help you save time, money and workforce
resources on managing Hadoop systems running MapReduce.

For example, HPE Pointnext Services offers advice and technical assistance in
planning, design and integrating your Big Data analytics environment. They simplify
designing and implementing Hadoop – and MapReduce – so that you can truly focus
on finding analytical insights to make informed business decisions.

In addition, HPE GreenLake offers a scalable solution that radically simplifies the
whole Hadoop lifecycle. It is an end-to-end solution that includes the required
hardware, software and support for both symmetrical and asymmetrical
environments. The unique HPE pricing and billing method makes it easier to
understand your existing Hadoop costs and to more accurately predict future costs
associated with your solution.

Following many years of customer engagement experiences in which HPE helped


with Hadoop environments, HPE created two editions of an enterprise-grade Hadoop
solution that are tested and ready to implement. They are complemented by the HPE
Insight Cluster Management Utility, which enables IT I&O leaders to quickly
provision, manage and monitor their infrastructure and choice of Hadoop
implementations. The HPE enterprise-grade Hadoop standard edition solution can be
supported in the HPE GreenLake solution.

Advantages of MapReduce
1. Scalability

2. Flexibility

3. Security and authentication

4. Faster processing of data

5. Very simple programming model

6. Availability and resilient nature

Simple tips on how to improve MapReduce performance


1. Enabling uber mode

2. Use native library

3. Increase the block size

4. Monitor time taken by map tasks

5. Identify if data compression is splittable or not

6. Set number of reduced tasks

7. Analyze the partition of data

8. Shuffle phase performance movements

9. Optimize MapReduce code

Five best alternatives to MapReduce


1. Apache Spark

2. Apache Storm

3. Ceph

4. Hydra

5. Google BigQuery

Partitioning and combining


What Is Database Partitioning?
Database partitioning (or data partitioning) is a technique used to split data in a
large database into smaller chunks called partitions. Each partition is then stored and
accessed separately to improve the performance and scalability of the database
system.

Database partitioning strategies apply to different types of databases such as SQL


databases (e.g., MySQL, PostgreSQL), NoSQL databases (e.g., MongoDB, Cassandra),
or time series databases like QuestDB.

Partitioning in NoSQL

Partitioning, also called Sharding, is a fundamental consideration in NoSQL databases.
If you get this right, the database works beautifully. If not, there will be big changes
down the line until you get it right.

Entities in a NoSQL Database live within Partitions. Every entity has a partition key,
indicating which Partition (or Logical Partition or Application Partition) it belongs to.
The Database service maps these Logical Partitions into Physical Partitions or Shards
and places them on Storage nodes in its backend. Note that each Partition is fully
served by one Storage node running its Query engine.

Given that a Partition is fully served by one Storage node and query engine, this is
the scope at which most functionality works.

1. Queries need to be scoped to a partition (i.e. Query requires a Partition key


== value clause among others). Queries across partitions are possible, but
they fan-out across Storage nodes and are expensive — see Query containers
in Azure Cosmos DB | Microsoft Docs.

2. ACID Transactions are possible only within a Partition. Transactions are not
possible across partitions. See Database transactions and optimistic
concurrency control in Azure Cosmos DB | Microsoft Docs

3. Indexes are typically local within a Partition. i.e. use of the index requires a
partition key and value to be specified. E.g. see Local Secondary Indexes —
Amazon DynamoDB. Cross-partition index, or Global index, is possible in some
offerings, but these are essentially a fully copy of the data that is stored with a
different Partition key — and the Partitioning rules still apply there. See Using
Global Secondary Indexes in DynamoDB — Amazon DynamoDB.

4. Range queries (e.g. fetch all Products with price between $50 and $100) work
only within a Partition.

Given the above, we would ideally want Partitions to be as big as possible, so that we
get the benefits of Transactions, Indexing, etc. at as big a scope as possible. These
forces drive the partition sizes up.

However, since each Partition lives within one Storage node and has to fit there,
there are restrictions:

 Partition sizes are very limited. DynamoDB allows for 10GB, CosmosDB for 20
GB. If your partition grows to this size, you will not be able to write any data
— your App is broken!

 Partition IOPS (IO Operations per second) are bounded. See Azure Cosmos DB
service quotas | Microsoft Docs. If a partition becomes very “hot”, i.e. there is
too much IO in any one partition, you will start seeing storage throttling and
App failures.

These considerations drive the partition sizes down.

The main design challenge with NoSQL databases then is then in to pick a strategy
that hits the optimal middle:

1. Keep Partitions large enough so that all entities that require transactional
updates across themselves, or fall within a range query, come within one
partition.

2. Keep Partitions small enough so that size stays below 10 to 20 GB and IOPS
are roughly uniformly distributed and stay below threshold

Advantages of database partitioning


The primary motivation for database partitioning is to improve the performance and
scalability of large databases by distributing the data that can be accessed
independently.

By dividing the data into partitions, databases can avoid reading from partitions that
are not needed for queries that only need a subset of the data collocated in a
partition. This allows the database to reduce expensive disk I/O calls and return the
data much quicker.

Types of database partitioning


There are two major types of database partitioning approaches:

Vertical partitioning


In vertical partitioning, columns of a table are divided into partitions with each
partition containing one or more columns from the table. This approach is useful
when some columns are accessed more frequently than others.

Data partitioning is often combined with sharding: frequently accessed columns may
be split into different partitions and sharded to run on discrete servers. Alternatively,
columns that are rarely used may be partitioned to a cheaper and slower storage
solution to reduce the I/O overhead.

One of the downsides to vertical partitioning is that when a query needs to span
multiple partitions, combining the results from those partitions may be slow or
complicated. Also, as the database scales, partitions may need to be split even
further to meet the demand.
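As a rough sketch of the idea in Python (the table, column groupings, and sample rows below are invented for illustration), a vertical split keeps frequently read columns in one partition and pushes rarely used, bulky columns to another; reassembling a full row then requires touching both partitions.

users = [
    {"id": 1, "name": "Asha", "email": "asha@example.com", "bio": "...", "avatar": b"..."},
    {"id": 2, "name": "Ben",  "email": "ben@example.com",  "bio": "...", "avatar": b"..."},
]

HOT_COLUMNS = {"id", "name", "email"}    # read on almost every request
COLD_COLUMNS = {"id", "bio", "avatar"}   # rarely read, larger values

# Split each row into two vertical partitions, both keyed by the primary key "id".
hot_partition = {u["id"]: {c: u[c] for c in HOT_COLUMNS} for u in users}
cold_partition = {u["id"]: {c: u[c] for c in COLD_COLUMNS} for u in users}

def full_row(user_id):
    # A query that needs every column must combine results from both partitions.
    return {**hot_partition[user_id], **cold_partition[user_id]}

print(hot_partition[1]["email"])   # served from the hot partition alone
print(full_row(2)["bio"])          # requires data from both partitions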

Horizontal partitioning


On the other hand, horizontal partitioning works by splitting the table by rows based
on the partition key. In this approach, each row of the table is assigned to a partition
based on some criteria, which include:

 Range-based partitioning: data is split based on a range that does not


overlap. The most common example is partitioning by time on time series
workloads. Data can be partitioned by some time interval (e.g., daily, weekly,
monthly) to aid range-based search. Old partitions can easily be archived to
serve queries for newer ranges more efficiently.

 List-based partitioning: data is split based on discrete sets of values, usually


from a particular column.

 For example, a table containing sales data may be partitioned by geo-regions


such as North America or Asia-Pacific regions. Partitions may be further split
into subsections.

 Hash-based partitioning: data is split based on some hashing algorithm.


Hash-based partitioning applies a hash function to one or more columns to
determine which partition to send the request to.

For example, we may use a simple modulo function on the employee id field or use a
complicated cryptographic hashing function on an IP address to divide the data.
When a non-trivial hash function is used, hash-based partitioning tends to distribute
the data evenly across partitions. However, depending on the function, adding or
removing a partition may require an expensive migration process (a small sketch of
hash-based partitioning follows this list).

 Composite partitioning: any of the aforementioned methods can be
combined. For example, a time series workload may first be partitioned by
time and further split based on another column field.
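A minimal Python sketch of the modulo example mentioned above; the employee ids and the partition count are made up for illustration.

NUM_PARTITIONS = 3

def partition_for(employee_id: int) -> int:
    # Simple modulo hash: sequential ids are spread evenly across partitions.
    return employee_id % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for employee_id in range(1, 11):                 # ten sample employee ids
    partitions[partition_for(employee_id)].append(employee_id)

print(partitions)
# {0: [3, 6, 9], 1: [1, 4, 7, 10], 2: [2, 5, 8]}
# Note that changing NUM_PARTITIONS remaps most ids, which is why adding or
# removing partitions under plain modulo hashing can force an expensive migration.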

One thing to note with horizontal partitioning is that the performance depends
heavily on how evenly distributed the data is across the partitions. If the data
distribution is skewed, the partition with the most records will become the
bottleneck.

Also, most analytical databases employ horizontal partitioning strategies over vertical
partitioning. Some popular file formats such as Apache Parquet support partitioning
natively, making it ideal for big data processing.

Data Partitioning Techniques in System Design


Using data partitioning techniques, a huge dataset can be divided into
smaller, simpler sections. A few applications for these techniques include parallel
computing, distributed systems, and database administration. Data partitioning aims
to improve data processing performance, scalability, and efficiency.

The list of popular data partitioning techniques is as follows:

1. Horizontal Partitioning

2. Vertical Partitioning

3. Key-based Partitioning

4. Range-based Partitioning

5. Hash-based Partitioning

6. Round-robin Partitioning

Now let us discuss each partitioning in detail that is as follows:

1. Horizontal Partitioning/Sharding

In this technique, the dataset is divided based on rows or records. Each partition
contains a subset of rows, and the partitions are typically distributed across multiple
servers or storage devices. Horizontal partitioning is often used in distributed
databases or systems to improve parallelism and enable load balancing.

Advantages:

1. Greater scalability: By distributing data among several servers or storage
devices, horizontal partitioning makes it possible to process large datasets in
parallel.

2. Load balancing: By partitioning data, the workload can be distributed equally


among several nodes, avoiding bottlenecks and enhancing system
performance.

3. Data separation: Since each partition can be managed independently, data


isolation and fault tolerance are improved. The other partitions can carry on
operating even if one fails.

Disadvantages:

1. Join operations: Horizontal partitioning can make join operations across


multiple partitions more complex and potentially slower, as data needs to be
fetched from different nodes.

2. Data skew: If the distribution of data is uneven or if some partitions receive


more queries or updates than others, it can result in data skew, impacting
performance and load balancing.

3. Distributed transaction management: Ensuring transactional consistency


across multiple partitions can be challenging, requiring additional
coordination mechanisms.

2. Vertical Partitioning

Unlike horizontal partitioning, vertical partitioning divides the dataset based on


columns or attributes. In this technique, each partition contains a subset of columns
for each row. Vertical partitioning is useful when different columns have varying
access patterns or when some columns are more frequently accessed than others.

Advantages:

1. Improved query performance: By placing frequently accessed columns in a


separate partition, vertical partitioning can enhance query performance by
reducing the amount of data read from storage.

2. Efficient data retrieval: When a query only requires a subset of columns,


vertical partitioning allows retrieving only the necessary data, saving storage
and I/O resources.

3. Simplified schema management: With vertical partitioning, adding or


removing columns becomes easier, as the changes only affect the respective
partitions.

Disadvantages:

1. Increased complexity: Vertical partitioning can lead to more complex query
execution plans, as queries may need to access multiple partitions to gather all
the required data.

2. Joins across partitions: Joining data from different partitions can be more
complex and potentially slower, as it involves retrieving data from different
partitions and combining them.

3. Limited scalability: Vertical partitioning may not be as effective for datasets


that continuously grow in terms of the number of columns, as adding new
columns may require restructuring the partitions.

3. Key-based Partitioning

Using this method, the data is divided based on a particular key or attribute value.
The dataset is partitioned such that each partition contains all the data related to a
specific key value. Key-based partitioning is commonly used in distributed databases
or systems to distribute the data evenly and allow efficient data retrieval based on
key lookups.
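A minimal Python sketch of the idea, using invented order records keyed by customer id: every record with the same key lands in the same partition, so a key lookup only touches one partition.

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Records sharing a key always map to the same partition
    # (Python's built-in hash() is used purely for illustration).
    return hash(key) % NUM_PARTITIONS

orders = [("cust-7", "order-1"), ("cust-7", "order-2"), ("cust-9", "order-3")]

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for customer_id, order_id in orders:
    partitions[partition_for(customer_id)].append((customer_id, order_id))

def orders_for(customer_id):
    # Key lookup: only the single partition holding this key is scanned.
    shard = partitions[partition_for(customer_id)]
    return [order for cust, order in shard if cust == customer_id]

print(orders_for("cust-7"))   # ['order-1', 'order-2']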

Advantages:

1. Even data distribution: Key-based partitioning ensures that data with the
same key value is stored in the same partition, enabling efficient data retrieval
by key lookups.

2. Scalability: Key-based partitioning can distribute data evenly across


partitions, allowing for better parallelism and improved scalability.

3. Load balancing: By distributing data based on key values, the workload is


balanced across multiple partitions, preventing hotspots and optimizing
performance.

Disadvantages:

1. Skew and hotspots: If the key distribution is uneven or if certain key values
are more frequently accessed than others, it can lead to data skew or
hotspots, impacting performance and load balancing.

2. Limited query flexibility: Key-based partitioning is most efficient for queries


that primarily involve key lookups. Queries that span multiple keys or require
range queries may suffer from increased complexity and potentially slower
performance.

3. Partition management: Managing partitions based on key values requires


careful planning and maintenance, especially when the dataset grows or the
key distribution changes.

4. Range Partitioning
Range partitioning divides the dataset according to a predetermined range of values.
You can divide data based on a particular time range, for instance, if your dataset
contains timestamps. When you want to distribute data evenly based on the range of
values and have data with natural ordering, range partitioning can be helpful.
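A minimal Python sketch of time-based range partitioning, assuming monthly partitions and made-up order timestamps; a query for a single month then only needs to read one partition.

from datetime import datetime

orders = [
    ("order-1", datetime(2024, 1, 5)),
    ("order-2", datetime(2024, 1, 28)),
    ("order-3", datetime(2024, 2, 14)),
    ("order-4", datetime(2024, 3, 2)),
]

partitions = {}
for order_id, ts in orders:
    # Range key: all rows from the same calendar month land in the same partition.
    partition_key = ts.strftime("%Y-%m")
    partitions.setdefault(partition_key, []).append(order_id)

print(partitions)
# {'2024-01': ['order-1', 'order-2'], '2024-02': ['order-3'], '2024-03': ['order-4']}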

Advantages:

1. Natural ordering: Range partitioning is suitable for datasets with a natural


ordering based on a specific attribute. It allows for efficient data retrieval
based on ranges of values.

2. Even data distribution: By dividing the dataset based on ranges, range


partitioning can distribute the data evenly across partitions, ensuring load
balancing and optimal performance.

3. Simplified query planning: Range partitioning simplifies query planning


when queries primarily involve range-based conditions, as the system knows
which partition(s) to access based on the range specified.

Disadvantages:

1. Uneven data distribution: If the data distribution is not evenly distributed


across ranges, it can lead to data skew and impact load balancing and query
performance.

2. Data growth challenges: As the dataset grows, the ranges may need to be
adjusted or new partitions added, requiring careful management and
potentially affecting existing queries and data distribution.

3. Joins and range queries: Range partitioning can introduce complexity when
performing joins across partitions or when queries involve multiple non-
contiguous ranges, potentially leading to performance challenges.

5. Hash-based Partitioning

Hash partitioning applies a hash function to the data to decide which partition it belongs to. The data (typically a chosen attribute) is fed into the hash function, which produces a hash value used to assign the record to a particular partition. By spreading data pseudo-randomly among partitions, hash-based partitioning helps with load balancing and quick data retrieval.
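A minimal sketch of the idea, assuming four partitions and using MD5 only because it is deterministic across runs (unlike Python's built-in string hash):

```python
import hashlib

NUM_PARTITIONS = 4  # assumed partition count for illustration

def hash_partition(value: str) -> int:
    # Hash the value and map it onto one of the partitions.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Records spread roughly evenly across partitions regardless of value ordering.
counts = {i: 0 for i in range(NUM_PARTITIONS)}
for i in range(10_000):
    counts[hash_partition(f"user-{i}")] += 1
print(counts)  # approximately equal counts per partition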

Advantages:

1. Even data distribution: Hash-based partitioning provides a random distribution of data across partitions, ensuring even data distribution and load balancing.

2. Scalability: Hash-based partitioning enables scalable parallel processing by evenly distributing data across multiple nodes.
3. Simplicity: Hash-based partitioning does not depend on any particular data properties or ordering, and it is relatively easy to implement.

Disadvantages:

1. Key-based queries: Hash-based partitioning is not suitable for efficient key-based lookups, as the data is distributed randomly across partitions. Key-based queries may require searching across multiple partitions.

2. Load balancing challenges: In some cases, the distribution of data may not
be perfectly balanced, resulting in load imbalances and potential performance
issues.

3. Partition management: Hash-based partitioning may require adjustments to the number of partitions or hash functions as the dataset grows or the system requirements change, necessitating careful management and potential data redistribution.

6. Round-robin Partitioning

In round-robin partitioning, data is evenly distributed across partitions in a cyclic manner. Each partition is assigned the next available data item sequentially, regardless of the data's characteristics. Round-robin partitioning is straightforward to implement and can provide a basic level of load balancing.
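A minimal sketch of the assignment step (the partition count and items are assumed for illustration):

```python
from itertools import cycle

NUM_PARTITIONS = 3  # assumed for illustration
partitions = {i: [] for i in range(NUM_PARTITIONS)}

# Assign each incoming item to the next partition in a fixed cycle,
# ignoring the item's content entirely.
assigner = cycle(range(NUM_PARTITIONS))
for item in ["a", "b", "c", "d", "e", "f", "g"]:
    partitions[next(assigner)].append(item)

print(partitions)  # {0: ['a', 'd', 'g'], 1: ['b', 'e'], 2: ['c', 'f']}
```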

Advantages:

1. Simple implementation: Round-robin partitioning is straightforward to implement, as it assigns data items to partitions in a cyclic manner without relying on any specific data characteristics.

2. Basic load balancing: Round-robin partitioning can provide a basic level of load balancing, ensuring that data is distributed across partitions evenly.

3. Scalability: Round-robin partitioning enables scalability by dividing the data into several partitions and permitting parallel processing.

Disadvantages:

1. Uneven partition sizes: If the total number of data items is not a multiple of the number of partitions, or if the data items themselves are uneven in size, round-robin partitioning may produce unequal partition sizes.

2. Inefficient data retrieval: Round-robin partitioning does not consider any data characteristics or access patterns, which may result in inefficient data retrieval for certain queries.

3. Limited query optimization: Round-robin partitioning does not optimize for
specific query patterns or access patterns, potentially leading to suboptimal
query performance.

Partitioning Technique | Description | Suitable Data | Query Performance | Data Distribution | Complexity
Horizontal Partitioning | Divides dataset based on rows/records | Large datasets | Complex joins | Uneven distribution | Distributed transaction management
Vertical Partitioning | Divides dataset based on columns/attributes | Wide tables | Improved retrieval | Efficient storage | Increased query complexity
Key-based Partitioning | Divides dataset based on a specific key | Key-value datasets | Efficient key lookups | Even distribution by key | Limited query flexibility
Range Partitioning | Divides dataset based on a specific range | Ordered datasets | Efficient range queries | Even distribution by range | Joins and range queries
Hash-based Partitioning | Divides dataset based on a hash function | Unordered datasets | Even distribution | Random distribution | Inefficient key-based queries
Round-robin Partitioning | Divides dataset in a cyclic manner | Equal-sized datasets | Basic load balancing | Even distribution | Limited query optimization

These are a few examples of data partitioning strategies. The dataset’s properties,
access patterns, and the needs of the particular application or system all play a role
in the choice of partitioning strategy.

Benefits of Partitioning

 It improves query performance, because queries can be resolved quickly against a small set of relevant partitions rather than against one giant database. As a result, both functionality and performance improve.

 Planned downtime is reduced.

 It facilitates data administration procedures such as data loading, index creation and rebuilding, and backup and recovery at the partition level. As a result, these operations become faster.

 Parallel execution offers clear benefits: it optimizes resource utilization and reduces execution time. Parallel execution against partitioned objects is a key enabler of scalability in a clustered environment.

Partitioning techniques not only improve the operation and management of very large databases but also allow medium-sized and smaller databases to enjoy the same benefits. Although partitioning can be implemented in databases of all sizes, it matters most for databases that handle big data. The scalability of partitioning techniques means that the advantages available to smaller data stores carry over unchanged to much larger ones.

What is combining
Combining in NoSQL databases refers to the practice of integrating and utilizing
various features, functionalities, and methodologies within the NoSQL database
environment to achieve specific objectives or to address particular challenges
effectively. This involves leveraging a combination of techniques, such as data
modeling, indexing, sharding, replication, caching, and optimization strategies, to
optimize performance, scalability, reliability, and other aspects of data management
and processing.

In essence, combining in NoSQL encompasses the orchestrated utilization of different tools, features, and techniques available within the NoSQL ecosystem to build robust, efficient, and scalable data storage and retrieval solutions tailored to the requirements of modern applications.

Combining in NoSQL refers to the integration and utilization of various features, techniques, and methodologies within a NoSQL database system to achieve specific goals such as performance optimization, scalability, data modeling, and reliability.

In a NoSQL context, combining typically involves:

 Utilizing Multiple Features: NoSQL databases often offer a wide array of features such as sharding, replication, indexing, caching, and data modeling capabilities. Combining involves leveraging these features together to achieve desired outcomes. For example, combining sharding with replication for horizontal scalability and fault tolerance.

 Applying Techniques Synergistically: Combining different techniques such as denormalization, data partitioning, and compression to optimize data storage, retrieval, and performance. For instance, denormalizing data to reduce join operations, partitioning data to distribute workload evenly, and compressing data to minimize storage requirements.

 Balancing Trade-offs: NoSQL databases often require trade-offs between consistency, availability, and partition tolerance (the CAP theorem). Combining involves striking a balance between these trade-offs by choosing appropriate consistency models, replication strategies, and partitioning techniques based on application requirements.

 Integration with Ecosystem Tools: NoSQL databases are often part of a larger ecosystem of tools and technologies. Combining involves integrating NoSQL databases with other tools such as caching systems, message queues, data processing frameworks, and monitoring tools to build robust and scalable architectures.

 Customization and Optimization: Combining also involves customizing and optimizing various aspects of the NoSQL database environment such as configurations, indexing strategies, query optimization, and infrastructure scaling to meet specific application needs and performance goals.

Here's how you can combine different techniques effectively:


1. Data Modeling with Sharding and Replication:

 Design an efficient data model by considering denormalization and embedding to reduce query complexity.

 Implement sharding to horizontally partition data across multiple nodes for scalability.

 Utilize replication for high availability and fault tolerance, ensuring data
redundancy across nodes.

2. Indexing and Caching:

 Create appropriate indexes to accelerate query performance based on common access patterns.

 Integrate caching mechanisms like Redis or Memcached to cache frequently accessed data and reduce database load.

 Combine indexing with caching to further optimize data retrieval speed, especially for read-heavy workloads (see the cache-aside sketch after this list).

3. Concurrency Control with Consistency Models:

 Implement concurrency control mechanisms like optimistic concurrency control or distributed locking to manage concurrent read and write operations efficiently.

 Choose the appropriate consistency model (e.g., eventual consistency, strong consistency) based on application requirements and consistency trade-offs.

 Use techniques like quorum reads or writes to balance consistency and performance in distributed environments.

4. Data Partitioning Strategies with Data Compression:

 Employ data partitioning strategies such as key-based partitioning or document-based partitioning to distribute data effectively.

 Implement data compression techniques to reduce storage requirements and optimize I/O performance, especially for large datasets.

 Combine partitioning with compression to minimize storage overhead and
enhance data access efficiency.

5. Monitoring and Optimization:

 Continuously monitor database performance metrics using monitoring dashboards and performance profiling tools.

 Analyze query execution plans and optimize query performance by tuning indexes, data models, and database configurations.

 Optimize resource utilization by scaling infrastructure dynamically based on workload patterns.

6. Backup and Disaster Recovery:

 Implement robust backup and disaster recovery strategies to ensure data durability and business continuity.

 Use techniques like incremental backups, snapshots, and geo-replication to protect against data loss and mitigate downtime risks.

7. Security Measures:

 Implement security measures such as access control, encryption, and auditing to safeguard data privacy and integrity.

 Utilize techniques like role-based access control (RBAC) and encryption-at-rest to protect sensitive data from unauthorized access and breaches.
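Following up on the indexing-and-caching point above, here is a minimal sketch of the cache-aside pattern that combines a cache with a primary key-value/document store. The dictionaries stand in for a NoSQL database and for Redis/Memcached; the function and key names are illustrative assumptions, not any product's API:

```python
# Cache-aside read: check the cache first, fall back to the primary store on a
# miss, then populate the cache so subsequent reads are served from memory.
primary_store = {"user:42": {"name": "Asha", "plan": "pro"}}  # stands in for a NoSQL DB
cache = {}  # stands in for Redis/Memcached

def get_document(key: str):
    if key in cache:                 # cache hit: no database round trip
        return cache[key]
    doc = primary_store.get(key)     # cache miss: read from the database
    if doc is not None:
        cache[key] = doc             # populate the cache for later reads
    return doc

print(get_document("user:42"))  # miss -> reads primary store, fills cache
print(get_document("user:42"))  # hit  -> served from cache
```

In a real deployment the cache entry would typically carry a time-to-live and be invalidated or updated on writes so that it does not serve stale data indefinitely.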

Composing map-reduce calculations


Composing MapReduce calculations in NoSQL databases involves designing and
implementing Map and Reduce functions to process and analyze large datasets
distributed across multiple nodes in a distributed database cluster. While the
concepts of Map and Reduce originated from the MapReduce programming model,
they are also applicable in NoSQL databases for parallel data processing and analysis.
Here's how you can compose MapReduce calculations in NoSQL databases:

1. Map Function:

 The Map function is responsible for processing individual data elements and
emitting intermediate key-value pairs. In NoSQL databases, the Map function
can be designed to operate on data stored in distributed partitions or shards,
processing data in parallel across multiple nodes.

 When composing MapReduce calculations in NoSQL databases, define the
Map function to extract relevant information from each data record and emit
intermediate key-value pairs based on the analysis or transformation
requirements.

 For example, the Map function may parse a JSON document, extract specific fields or attributes, perform filtering or aggregation operations, and emit key-value pairs representing the results of the analysis.

2. Shuffle and Sort:

 After the Map function has processed the data and emitted intermediate key-
value pairs, the database system performs shuffle and sort operations to
group and sort the intermediate pairs based on their keys. This step ensures
that all key-value pairs with the same key are grouped together for
subsequent processing.

 In NoSQL databases, shuffle and sort operations are typically handled internally by the database system, which redistributes and sorts the intermediate key-value pairs across the distributed nodes based on their keys.

3. Reduce Function:

 The Reduce function processes the sorted intermediate key-value pairs generated by the shuffle and sort phase. Each Reduce function receives a subset of key-value pairs with the same key and is responsible for aggregating, summarizing, or analyzing the data.

 When composing MapReduce calculations in NoSQL databases, define the Reduce function to operate on the grouped key-value pairs and perform the desired aggregation or analysis tasks.

 For example, the Reduce function may calculate the sum, average, count, or
other aggregate statistics for each group of key-value pairs with the same key.

4. Output:

 The final output of the MapReduce calculation consists of the results generated by the Reduce function. These results represent the aggregated or analyzed data produced by processing the input dataset using the Map and Reduce functions.

 Depending on the application requirements, the output of the MapReduce calculation can be written to a distributed file system, stored in a NoSQL database, or used as input for further analysis or processing tasks.

When composing MapReduce calculations in NoSQL databases, it's essential to consider factors such as data distribution, partitioning strategies, fault tolerance, and scalability. By designing efficient Map and Reduce functions and leveraging the parallel processing capabilities of the NoSQL database system, you can perform complex data processing and analysis tasks on large datasets efficiently and effectively.
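To tie the steps together, here is a minimal, self-contained Python sketch of the Map, shuffle/sort, and Reduce phases. It runs in a single process purely for illustration; the document fields and the per-city aggregation are assumptions, and a real NoSQL/MapReduce engine would distribute this work across partitions for you:

```python
from collections import defaultdict

# Input: JSON-like documents as they might sit in different partitions (assumed schema).
documents = [
    {"city": "Ajmer", "amount": 250},
    {"city": "Jaipur", "amount": 100},
    {"city": "Ajmer", "amount": 40},
]

# 1. Map: emit intermediate (key, value) pairs from each document.
def map_fn(doc):
    yield (doc["city"], doc["amount"])

# 2. Shuffle and sort: group intermediate pairs by key (handled internally by the
#    engine in a real system; simulated here with a dictionary).
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        grouped[key].append(value)

# 3. Reduce: aggregate the values for each key (total amount per city here).
def reduce_fn(key, values):
    return (key, sum(values))

# 4. Output: the final aggregated results.
results = [reduce_fn(k, v) for k, v in sorted(grouped.items())]
print(results)  # [('Ajmer', 290), ('Jaipur', 100)]
```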
