Big Data Analysis by deshbandhu
APRIL 1, 2024
DESH BANDHU BHATT
MDSU, AJMER
What is Big Data
What exactly is big data?
The definition of big data is data that contains greater variety, arriving in increasing
volumes and with more velocity. This is also known as the three "Vs":
1. Volume: The amount of data matters. With big data, you have to process high
volumes of low-density, unstructured data, which can run from terabytes to
petabytes for some organizations.
2. Velocity: Velocity is the fast rate at which data is received and (perhaps) acted
on. Normally, the highest velocity of data streams directly into memory versus
being written to disk. Some internet-enabled smart products operate in real
time or near real time and will require real-time evaluation and action.
3. Variety: Variety refers to the many types of data that are available. Traditional
data types were structured and fit neatly in a relational database. With the rise
of big data, data comes in new unstructured data types. Unstructured and
semistructured data types, such as text, audio, and video, require additional
preprocessing to derive meaning and support metadata.
In addition to the 3Vs, some have expanded this framework to include other
characteristics such as:
4. Veracity: This refers to the quality of the data. Big data may include
inaccurate, incomplete, or inconsistent data, and dealing with such data
quality issues is a significant challenge in big data analytics.
5. Value: Ultimately, the goal of analyzing big data is to derive value from it. This
could involve gaining insights, making predictions, optimizing processes, or
creating new products and services.
6. Variability: Refers to the inconsistency of the data flow. This can mean a
change in the data's velocity or volume, or it can mean the nature of the data
itself is changing.
Big data technologies and analytics techniques, such as Hadoop, Spark, NoSQL
databases, machine learning, and data mining, are employed to extract insights,
patterns, and trends from these massive datasets, enabling organizations to make
data-driven decisions and gain competitive advantages.
Around 2005, people began to realize just how much data users generated through
Facebook, YouTube, and other online services. Hadoop (an open source framework
created specifically to store and analyze big data sets) was developed that same year.
NoSQL also began to gain popularity during this time.
The development of open source frameworks, such as Hadoop (and more recently,
Spark) was essential for the growth of big data because they make big data easier to
work with and cheaper to store. In the years since then, the volume of big data has
skyrocketed. Users are still generating huge amounts of data—but it’s not just
humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are
connected to the internet, gathering data on customer usage patterns and product
performance. The emergence of machine learning has produced still more data.
While big data has come far, its usefulness is only just beginning. Cloud computing
has expanded big data possibilities even further. The cloud offers truly elastic
scalability, where developers can simply spin up ad hoc clusters to test a subset of
data. And graph databases are becoming increasingly important as well, with their
ability to display massive amounts of data in a way that makes analytics fast and
comprehensive.
Sources of Big Data
Social networking sites: Facebook, Google, and LinkedIn all generate huge
amounts of data on a day-to-day basis, as they have billions of users worldwide.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts
of logs from which users' buying trends can be traced.
Weather stations: Weather stations and satellites produce very large volumes of
data, which are stored and manipulated to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly, and for this they store the data of their millions of
users.
Share market: Stock exchanges across the world generate huge amounts of data
through their daily transactions.
Big Data Use Cases
Big data can help you address a range of business activities, including customer
experience and analytics. Here are just a few.
Product development: Companies like Netflix and Procter & Gamble use big
data to anticipate customer demand. They build predictive models for new
products and services by classifying key attributes of past and current
products or services and modeling the relationship between those attributes
and the commercial success of the offerings. In addition, P&G uses data and
analytics from focus groups, social media, test markets, and early store
rollouts to plan, produce, and launch new products.
Fraud and compliance: When it comes to security, it’s not just a few rogue
hackers—you’re up against entire expert teams. Security landscapes and
compliance requirements are constantly evolving. Big data helps you identify
patterns in data that indicate fraud and aggregate large volumes of
information to make regulatory reporting much faster.
Machine learning: Machine learning is a hot topic right now. And data—
specifically big data—is one of the reasons why. We are now able to teach
machines instead of program them. The availability of big data to train
machine learning models makes that possible.
Big data can also be used to improve decision-making in line with current market
demand.
1. Integrate: Big data brings together data from many disparate sources and
applications. Traditional data integration mechanisms, such as extract, transform,
and load (ETL) generally aren’t up to the task. It requires new strategies and
technologies to analyze big data sets at terabyte, or even petabyte, scale.
During integration, you need to bring in the data, process it, and make sure it’s
formatted and available in a form that your business analysts can get started with.
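As a rough sketch of this integrate step, the Python example below extracts records from a CSV file, applies a small transformation, and loads the result into a SQLite table that analysts could query. The file name, column names, and table name are invented for the illustration and are not part of any particular product.

    # Minimal ETL sketch: extract from CSV, transform, load into SQLite.
    # File, column, and table names are hypothetical.
    import csv
    import sqlite3

    def run_etl(csv_path="sales.csv", db_path="analytics.db"):
        # Extract: read raw rows from the source file.
        with open(csv_path, newline="") as f:
            rows = list(csv.DictReader(f))

        # Transform: normalize text fields and cast amounts to numbers.
        cleaned = [
            {"region": r["region"].strip().title(),
             "amount": float(r["amount"])}
            for r in rows
            if r.get("amount")            # drop rows with no amount
        ]

        # Load: write the cleaned rows into a table analysts can query.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (:region, :amount)", cleaned)
        con.commit()
        con.close()

    if __name__ == "__main__":
        run_etl()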
2. Manage: Big data requires storage. Your storage solution can be in the cloud, on
premises, or both. You can store your data in any form you want and bring your
desired processing requirements and necessary process engines to those data
sets on an on-demand basis. Many people choose their storage solution
according to where their data is currently residing.
3. Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed
File System), which uses commodity hardware to form clusters and store data in a
distributed fashion. It works on the write once, read many times principle.
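As a sketch of how such write-once data is typically read back, the snippet below uses PySpark to read a file from an HDFS path and count its lines. It assumes PySpark is installed and a Hadoop cluster is reachable; the namenode address and file path are hypothetical.

    # Sketch: reading a file that was written once to HDFS and is now read many times.
    # The HDFS URI and path are hypothetical; assumes PySpark and a reachable cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read-example").getOrCreate()

    # Each read goes against the same immutable file blocks spread across the cluster.
    logs = spark.read.text("hdfs://namenode:8020/data/web_logs/2024-04-01.log")
    print("line count:", logs.count())

    spark.stop()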
Why big data
Big data has become increasingly important for several reasons:
3. Innovation and new business opportunities: Big data can uncover new
business opportunities and drive innovation. By analyzing market trends,
consumer behaviors, and emerging technologies, organizations can identify
new product and service offerings, enter new markets, and stay ahead of
competitors.
5. Risk management and fraud detection: Big data analytics can help
organizations mitigate risks and detect fraudulent activities. By analyzing
patterns and anomalies in large datasets, companies can identify potential
risks, such as credit card fraud, cybersecurity threats, or supply chain
disruptions, and take proactive measures to mitigate them.
Overall, big data has the potential to transform industries, drive innovation, and
create significant value for organizations across various sectors. However, effectively
harnessing the power of big data requires advanced analytics capabilities, robust
data management practices, and a strategic approach to data-driven decision
making.
The problem was that traditional storage methods couldn't handle storing all this
data, so companies had to look for new ways to keep it. That's when Big Data
Storage came into being. It's a way for companies to store large amounts of data
without worrying about running out of space.
The first challenge is how much storage you'll need for your extensive data system. If
you're going to store large amounts of information about your customers and their
behavior, you'll need a lot of space for that data to live.
It's not uncommon for large companies like Google or Facebook to have petabytes
(1 million gigabytes) of storage explicitly dedicated to their big data needs, and that's
only one company!
Another challenge with big data is how quickly it grows. Companies are constantly
gathering new types of information about their customer's habits and preferences,
and they're looking at ways they can use this information to improve their products
or services.
As a result, big data systems will continue growing exponentially until something
stops them. This means it's essential for companies that want to use this technology
effectively to plan how they'll handle their data down the road, before it becomes
too much to manage.
Data velocity: Your data must be able to move quickly between processing
centers and databases for it to be helpful in real-time applications.
Scalability: The system should be able to expand as your business does and
accommodate new projects as needed without disrupting existing workflows
or causing any downtime.
Finally, consider how long you want your stored data to remain accessible. If you're
planning on keeping it for years (or even decades), you may need more than one
storage solution.
Here are some critical insights for big data storage:
Have a plan for how you'll organize your data before you start collecting it. This
will ensure you can find what you need when you need it.
Ensure your team understands security's essential when dealing with sensitive
information. Everyone in the company needs to be trained on best practices
for protecting data and preventing hacks.
Remember backup plans! You never want to be stuck, unable to access
your information because something went wrong with the server or hardware
it's stored on.
Warehouse Storage
Warehouse storage is one of the more common ways to store large amounts
of data, but it has drawbacks. For example, if you need immediate access to
your data and want to avoid delays or problems accessing it over the internet,
there might be better options than this. Also, warehouse storage can be
expensive if you're looking for long-term contracts or need extra personnel to
manage your warehouse space.
Cloud Storage
Cloud storage is an increasingly popular option since it's easier than ever to
use this method, thanks to advancements in technology such as Amazon Web
Services (AWS). With AWS, you can store unlimited data without worrying
about how much space each file takes up on their servers. They'll
automatically compress them before sending them over, so they take up less
space overall!
Hadoop
Hadoop has gained considerable attention as it is one of the most common
frameworks to support big data analytics.
HBase
With HBase, you can use a NoSQL database or complement Hadoop with a
column-oriented store. This database is designed to efficiently manage large
tables with billions of rows and millions of columns. The performance can be
tuned by adjusting memory usage, the number of servers, block size, and
other settings.
Snowflake
Snowflake is an enterprise-grade cloud data platform that supports data lake
analytics and other advanced analytics applications. It offers real-time access to
historical and streaming data from any source and format at any scale without
requiring changes to existing applications or workflows. It also enables users to
quickly scale up their processing power as needed without having to worry about
infrastructure management tasks such as provisioning and maintenance.
Data storage
Data storage in big data involves the following:
3. NoSQL Databases: NoSQL (Not Only SQL) databases are designed to handle
large volumes of unstructured or semi-structured data and are a key
component of big data ecosystems. NoSQL databases offer flexible data
models and horizontal scalability, making them well-suited for applications
such as web and mobile apps, real-time analytics, and content management
systems. Examples include MongoDB, Cassandra, and Apache CouchDB.
4. Data Lakes: Data lakes are storage repositories that can store vast amounts of
raw data in its native format until it's needed for analysis. Unlike traditional
data warehouses, which require structured data, data lakes can accommodate
structured, semi-structured, and unstructured data. Data lakes provide
flexibility and scalability for storing and analyzing diverse datasets. Popular
data lake solutions include Apache Hadoop, Apache Spark, and Amazon S3.
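As a small, hedged illustration of the flexible data model described under NoSQL databases above, the snippet below stores two differently shaped documents in one MongoDB collection using pymongo and queries them by a shared field. It assumes the pymongo package and a local MongoDB instance; the database and collection names are invented.

    # Sketch of NoSQL's flexible schema with MongoDB (names and URI are hypothetical).
    # Assumes the pymongo package and a local MongoDB instance.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client["demo_app"]["events"]

    # Two documents with different shapes can live in the same collection.
    events.insert_one({"type": "page_view", "url": "/home", "ms": 42})
    events.insert_one({"type": "purchase", "items": ["book", "pen"], "total": 12.50})

    # Query by a shared field even though the documents differ in structure.
    for doc in events.find({"type": "purchase"}):
        print(doc)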
Big data analytics can also be used with emerging technologies, like machine learning,
to discover and scale more complex insights.
1. Collect Data
Data collection looks different for every organization. With today’s
technology, organizations can gather both structured and unstructured data
from a variety of sources — from cloud storage to mobile applications to in-
store IoT sensors and beyond. Some data will be stored in data
warehouses where business intelligence tools and solutions can access it
easily. Raw or unstructured data that is too diverse or complex for a
warehouse may be assigned metadata and stored in a data lake.
2. Process Data
Once data is collected and stored, it must be organized properly to get
accurate results on analytical queries, especially when it’s large and
unstructured. Available data is growing exponentially, making data processing
a challenge for organizations. One processing option is batch processing,
which looks at large data blocks over time. Batch processing is useful when
there is a longer turnaround time between collecting and analyzing
data. Stream processing looks at small batches of data at once, shortening
the delay time between collection and analysis for quicker decision-making.
Stream processing is more complex and often more expensive.
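The toy sketch below contrasts the two options in plain Python: a batch function that totals a whole block of collected records after the fact, and a streaming function that updates a running total as each record arrives. The record format is invented for the example.

    # Toy contrast between batch and stream processing (record format is made up).

    def batch_total(records):
        # Batch: look at one large block of collected data after the fact.
        return sum(r["amount"] for r in records)

    def stream_totals(record_stream):
        # Stream: update results as each record arrives, shortening the delay
        # between collection and analysis.
        running = 0.0
        for r in record_stream:
            running += r["amount"]
            yield running            # an up-to-date answer after every record

    day = [{"amount": 10.0}, {"amount": 2.5}, {"amount": 7.5}]
    print(batch_total(day))                    # 20.0, available only after the batch
    print(list(stream_totals(iter(day))))      # [10.0, 12.5, 20.0], available as data flows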
3. Clean Data
Data big or small requires scrubbing to improve data quality and get stronger
results; all data must be formatted correctly, and any duplicative or
irrelevant data must be eliminated or accounted for. Dirty data can obscure
and mislead, creating flawed insights.
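A minimal sketch of such scrubbing with pandas (assuming pandas is available; the column names and values are invented): duplicates are removed, rows missing a key field are dropped, and a text column is cast to a consistent numeric format.

    # Sketch of common cleaning steps with pandas (column names are hypothetical).
    import pandas as pd

    df = pd.DataFrame({
        "customer": ["Ann", "Ann", "Bob", None],
        "amount": ["10.5", "10.5", "7", "3"],
    })

    df = df.drop_duplicates()                      # remove duplicative rows
    df = df.dropna(subset=["customer"])            # drop rows missing a key field
    df["amount"] = df["amount"].astype(float)      # format values consistently
    print(df)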
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced
analytics processes can turn big data into big insights. Some of these big data
analysis methods include:
o Predictive analytics uses an organization’s historical data to make
predictions about the future, identifying upcoming risks and
opportunities.
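A minimal predictive-analytics sketch, assuming scikit-learn is available and using invented monthly sales figures: a simple model is fitted to historical data and then used to predict the next period.

    # Minimal predictive-analytics sketch: fit on historical data, predict the future.
    # Assumes scikit-learn; the monthly sales figures are invented for the example.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    months = np.arange(1, 13).reshape(-1, 1)               # historical time index
    sales = np.array([10, 12, 13, 15, 16, 18, 20, 21, 23, 25, 26, 28])

    model = LinearRegression().fit(months, sales)
    next_month = model.predict(np.array([[13]]))
    print("forecast for month 13:", next_month[0])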
Data Governance and Security: Data governance and security are critical
aspects of big data storage and analysis. Organizations must implement
robust data governance policies, access controls, encryption, and compliance
measures to protect sensitive data and ensure regulatory compliance.
MapReduce is an essential component of the Hadoop framework, serving two
functions. The first is mapping, which filters data to various nodes within the
cluster. The second is reducing, which organizes and reduces the results from
each node to answer a query.
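A toy word-count example of the two phases, run locally in Python for illustration; on a real cluster, Hadoop would distribute the map work across nodes and shuffle the intermediate pairs to the reducers.

    # Word-count sketch of the map and reduce phases (run locally for illustration;
    # on a real cluster Hadoop distributes the map work across nodes).
    from collections import defaultdict

    def mapper(line):
        # Map: emit intermediate (key, value) pairs for one input record.
        for word in line.lower().split():
            yield word, 1

    def reducer(pairs):
        # Reduce: organize the mapped pairs and combine the values per key.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    lines = ["big data needs big tools", "big data is big"]
    intermediate = [pair for line in lines for pair in mapper(line)]
    print(reducer(intermediate))     # {'big': 4, 'data': 2, ...}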
Spark is an open source cluster computing framework that uses implicit data
parallelism and fault tolerance to provide an interface for programming entire
clusters. Spark can handle both batch and stream processing for fast
computation.
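For comparison, here is a minimal PySpark sketch of the same word count. It assumes PySpark is installed and a cluster (or local mode) is available; the input path is hypothetical.

    # Minimal PySpark word count (assumes PySpark; the input path is hypothetical).
    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-example").getOrCreate()

    counts = (spark.sparkContext.textFile("hdfs://namenode:8020/data/books/*.txt")
              .flatMap(lambda line: line.lower().split())
              .map(lambda word: (word, 1))
              .reduceByKey(add))

    print(counts.take(10))
    spark.stop()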
To capitalize on incoming data and make it work for your business needs,
organizations will have to address the following:
Making big data accessible. Collecting and processing data becomes more
difficult as the amount of data grows. Organizations must make data easy and
convenient for data owners of all skill levels to use.
Finding the right tools and platforms. New technologies for processing and
analyzing big data are developed all the time. Organizations must find the
right technology to work within their established ecosystems and address
their particular needs. Often, the right solution is also a flexible solution that
can accommodate future infrastructure changes.
Big data: We can consider big data an upgraded version of traditional data. Big data
deals with data sets that are too large or complex to manage in traditional
data-processing application software. It deals with large volumes of structured,
semi-structured, and unstructured data. Volume, Velocity, Variety, Veracity, and
Value are referred to as the five V characteristics of big data. Big data does not only
refer to a large amount of data; it refers to extracting meaningful insight by
analyzing huge and complex data sets.
The main differences between traditional data and big data are as follows:
Variety: Traditional data is typically structured, meaning it is organized in a
predefined manner such as tables, columns, and rows. Big data, on the other
hand, can be structured, unstructured, or semi-structured, meaning it may
contain text, images, videos, or other types of data.
Complexity: Traditional data is relatively simple to manage and process with
conventional tools. Big data, on the other hand, is complex and requires specialized
tools and techniques to manage, process, and analyze.
Value: Traditional data typically has a lower potential value than big data
because it is limited in scope and size. Big data, on the other hand, can
provide valuable insights into customer behavior, market trends, and other
business-critical information.
Data Quality: The quality of data is essential in both traditional and big data
environments. Accurate and reliable data is necessary for making informed
business decisions.
Data Analysis: Both traditional and big data require some form of analysis to
derive insights and knowledge from the data. Traditional data analysis
methods typically involve statistical techniques and visualizations, while big
data analysis may require machine learning and other advanced techniques.
Data Storage: In both traditional and big data environments, data needs to
be stored and managed effectively. Traditional data is typically stored in
relational databases, while big data may require specialized technologies such
as Hadoop, NoSQL, or cloud-based storage systems.
Business Value: Both traditional and big data can provide significant value to
organizations. Traditional data can provide insights into historical trends and
patterns, while big data can uncover new opportunities and help organizations
make more informed decisions.
The main differences between traditional data and big data are as follows:
Traditional data is usually a small amount of data that can be collected and
analyzed easily using traditional methods. Big data is usually a large amount of
data that cannot be processed and analyzed easily using traditional methods.
Traditional data usually comes from internal systems. Big data comes from various
sources such as mobile devices, social media, etc.
Analysis of traditional data can be done with basic statistical methods. Analysis of
big data needs advanced analytics methods such as machine learning, data
mining, etc.
Traditional methods to analyze data are slow and gradual. Methods to analyze big
data are fast and instant.
Traditional data is used for simple and small business processes. Big data is used
for complex and big business processes.
Traditional data is easier to secure and protect than big data because of its small
size and simplicity. Big data is harder to secure and protect than traditional data
because of its size and complexity.
It requires less time and money to store traditional data. It requires more time and
money to store big data.
Traditional data is less efficient than big data. Big data is more efficient than
traditional data.
1. Traditional Databases:
Use Cases: Traditional databases are well-suited for transactional processing,
OLTP (Online Transaction Processing), and structured analytics. Big data
systems are better suited for handling large-scale analytics, real-time
processing, and unstructured data analysis.
2. Data Warehouses:
Data Model: Data warehouses are optimized for structured data and typically
use a star or snowflake schema. Big data systems can handle various data
types, including structured, semi-structured, and unstructured data.
3. In-Memory Databases:
Use Cases: In-memory databases are suitable for applications requiring high-
speed transactions, real-time analytics, and low-latency processing. Big data
systems are ideal for handling large-scale analytics, big data processing, and
complex data analysis tasks.
4. Stream Processing Systems:
Use Cases: Stream processing systems are used for real-time analytics, event
processing, and IoT applications. Big data systems are suitable for batch
processing, large-scale analytics, and handling diverse data types.
In summary, big data systems offer greater flexibility, scalability, and support for
diverse data types compared to traditional databases, data warehouses, in-memory
databases, and stream processing systems. However, the choice between these
systems depends on the specific requirements, workload characteristics, and use
cases of the application.
A relational database is a type of database that stores and provides access to data
points that are related to one another. Relational databases are based on the
relational model, an intuitive, straightforward way of representing data in tables. In a
relational database, each row in the table is a record with a unique ID called the key.
The columns of the table hold attributes of the data, and each record usually has a
value for each attribute, making it easy to establish the relationships among data
points.
Referential integrity: Rows of a table can be deleted only if they are not referenced
by other tables. Otherwise, deletion may lead to data inconsistency.
Domain integrity: The columns of the database tables are enclosed within
some structured limits, based on default values, type of data or ranges.
Attributes (columns) specify a data type, and each record (or row) contains the value
of that specific data type. All tables in a relational database have an attribute known
as the primary key, which is a unique identifier of a row, and each row can be used to
create a relationship between different tables using a foreign key—a reference to a
primary key of another existing table.
Let’s take a look at how the relational database model works in practice:
The Customer table contains data about the customer:
Customer name
Billing address
Shipping address
In the Customer table, the customer ID is a primary key that uniquely identifies who
the customer is in the relational database. No other customer would have the same
Customer ID.
The Order table contains data about each order:
Order ID (primary key)
Order date
Shipping date
Order status
Here, the primary key to identify a specific order is the Order ID. You can connect a
customer with an order by using a foreign key to link the customer ID from
the Customer table.
The two tables are now related based on the shared customer ID, which means you
can query both tables to create formal reports or use the data for other applications.
For instance, a retail branch manager could generate a report about all customers
who made a purchase on a specific date or figure out which customers had orders
that had a delayed delivery date in the last month.
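A compact sketch of this Customer/Order relationship using Python's built-in sqlite3 module: two tables linked by a customer ID, followed by a declarative join query similar to the report described above. The sample values are invented.

    # Sketch of the Customer/Order relationship with sqlite3 (sample values invented).
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name TEXT, billing_address TEXT, shipping_address TEXT);
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            order_date TEXT, order_status TEXT,
            customer_id INTEGER REFERENCES customer(customer_id));  -- foreign key
    """)
    con.execute("INSERT INTO customer VALUES (1, 'Asha', '12 Main St', '12 Main St')")
    con.execute("INSERT INTO orders VALUES (100, '2024-04-01', 'shipped', 1)")

    # Declarative query: say WHAT you want; the engine decides how to compute it.
    for row in con.execute("""
            SELECT c.name, o.order_id, o.order_status
            FROM customer c JOIN orders o ON o.customer_id = c.customer_id
            WHERE o.order_date = '2024-04-01'"""):
        print(row)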
The above explanation is meant to be simple. But relational databases also excel at
showing very complex relationships between data, allowing you to reference data in
more tables as long as the data conforms to the predefined relational schema of
your database.
As the data is organized as pre-defined relationships, you can query the data
declaratively. A declarative query is a way to define what you want to extract from
the system without expressing how the system should compute the result. This is at
the heart of a relational system as opposed to other systems.
The distinction between logical and physical also applies to database operations,
which are clearly defined actions that enable applications to manipulate the data and
structures of the database. Logical operations allow an application to specify the
content it needs, and physical operations determine how that data should be
accessed and then carries out the task.
Examples of relational databases
Now that you understand how relational databases work, you can begin to learn
about the many relational database management systems that use the relational
database model. A relational database management system (RDBMS) is a program
used to create, update, and manage relational databases. Some of the most well-
known RDBMSs include MySQL, PostgreSQL, MariaDB, Microsoft SQL Server, and
Oracle Database.
Cloud-based relational databases like Cloud SQL, Cloud Spanner and AlloyDB have
become increasingly popular as they offer managed services for database
maintenance, patching, capacity management, provisioning and infrastructure
support.
Characteristics of RDBMS
Data must be stored in tabular form in DB file, that is, it should be organized
in the form of rows and columns.
Tables are related to each other with the help of foreign keys.
Database tables also allow NULL values; that is, if the value of any element of the
table is not filled in or is missing, it becomes a NULL value, which is not equivalent
to zero. (Note: a primary key cannot have a NULL value.)
There are many other advantages to using relational databases to manage and store
your data, including:
1. Flexibility: It’s easy to add, update, or delete tables, relationships, and make
other changes to data whenever you need without changing the overall database
structure or impacting existing applications.
2. ACID compliance: Consistency, one of the ACID properties, defines the rules for
maintaining data points in a correct state after a transaction.
3. Ease of use: It’s easy to run complex queries using SQL, which enables even non-
technical users to learn how to interact with the database.
9. SQL (Structured Query Language): RDBMSs use SQL as the standard language
for querying and manipulating data. SQL provides a rich set of commands for
creating, querying, updating, and deleting data in relational databases. Common
SQL operations include SELECT, INSERT, UPDATE, DELETE, JOIN, and GROUP BY.
13. Data Security: RDBMSs offer features for data security, including authentication,
authorization, and access control mechanisms. Administrators can define user
roles, privileges, and permissions to restrict access to sensitive data and database
operations.
14. Scalability: RDBMSs can scale vertically by upgrading hardware resources such as
CPU, memory, and storage capacity. Some RDBMSs also support horizontal
scalability through features like sharding, partitioning, and replication.
15. Commercial and Open Source Options: There are both commercial and open-
source RDBMS solutions available in the market. Examples of commercial
RDBMSs include Oracle Database, Microsoft SQL Server, and IBM Db2, while
popular open-source RDBMSs include MySQL, PostgreSQL, and SQLite.
Unlike relational databases, NoSQL databases follow a flexible data model, making
them ideal for storing data that changes frequently or for applications that handle
diverse types of data.
Disadvantages of RDBMS
High Cost and Extensive Hardware and Software Support: Huge costs and
setups are required to make these systems functional.
Grid computing
What Is Grid Computing?
Grid computing is a distributed architecture of multiple computers connected by
networks to accomplish a joint task. These tasks are compute-intensive and difficult
for a single machine to handle. Several machines on a network collaborate under a
common protocol and work as a single virtual supercomputer to get complex tasks
done. This offers powerful virtualization by creating a single system image that
grants users and applications seamless access to IT capabilities.
How Grid Computing Works
A typical grid computing network consists of three machine types: the control node (a
server that administers the grid and manages its resources), the provider node (a
computer that contributes its resources to the grid), and the user node (a computer
that consumes the resources of the grid).
1. User interface
Today, users are well-versed with web portals. They provide a single interface that
allows users to view a wide variety of information. Similarly, a grid portal offers an
interface that enables users to launch applications with resources provided by the
grid.
The interface has a portal style to help users query and execute various functions on
the grid effectively. A grid user views a single, large virtual computer offering
computing resources, similar to an internet user who views a unified instance of
content on the web.
2. Security
Security is one of the major concerns for grid computing environments. Security
mechanisms can include authentication, authorization, data encryption, and others.
Grid security infrastructure (GSI) is an important ingredient here. It outlines
specifications that establish secret and tamper-proof communication between
software entities operating in a grid network.
3. Scheduler
On identifying the resources, the next step is to schedule the tasks to run on them. A
scheduler may not be needed if standalone tasks are to be executed that do not have
interdependencies. However, if you want to run specific tasks concurrently that
require inter-process communication, a job scheduler is required to coordinate the
execution of the different subtasks.
4. Data management
Data management is crucial for grid environments. A secure and reliable mechanism
to move or make any data or application module accessible to various nodes within
the grid is necessary. Consider the Globus toolkit — an open-source toolkit for grid
computing.
5. Workload and resource management
The workload & resource component enables the actual launch of a job on a
particular resource, checks its status, and retrieves the results when the job is
complete. Say a user wants to execute an application on the grid. In that case, the
application should be aware of the available resources on the grid to take up the
workload.
Types of Grid Computing With Examples
Grid computing is divided into several types based on its uses and the task at hand.
Let’s understand the types of grid computing with some examples.
Computational grids account for the largest share of grid computing usage across
industries today, and the trend is expected to stay the same over the years to come.
A computational grid comes into the picture when you have a task taking longer to
execute than expected. In this case, the main task is split into multiple subtasks, and
each subtask is executed in parallel on a separate node. Upon completion, the results
of the subtasks are combined to get the main task's result. By splitting the task, the
end result can be achieved up to n times faster (where 'n' denotes the number of
subtasks) than when a single machine executes the whole task.
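A single-machine analogy of this split-and-combine pattern, using Python's multiprocessing pool in place of grid nodes: the main task is divided into subtasks, each worker process handles one in parallel, and the partial results are combined at the end.

    # Split-and-combine sketch: worker processes stand in for grid nodes.
    from multiprocessing import Pool

    def subtask(chunk):
        # Each "node" computes a partial result for its share of the work.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        n = 4                                    # number of subtasks / workers
        chunks = [data[i::n] for i in range(n)]  # split the main task

        with Pool(processes=n) as pool:
            partials = pool.map(subtask, chunks) # run subtasks in parallel

        print(sum(partials))                     # combine into the main task's result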
Data grids refer to grids that split data onto multiple computers. Like computational
grids where computations are split, data grids enable placing data onto a network of
computers or storage. However, the grid virtually treats them as one despite the
splitting. Data grid computing allows several users to simultaneously access, change,
or transfer distributed data.
For instance, a data grid can be used as a large data store where each website stores
its own data on the grid. Here, the grid enables coordinated data sharing across all
grid users. Such a grid allows collaboration along with increased knowledge transfer
between grid users.
Manuscript grid computing comes in handy when managing large volumes of image
and text blocks. This grid type allows the continuous accumulation of image and text
blocks while it processes and performs operations on previous block batches. It is a
simple grid computing framework where vast volumes of text or manuscripts and
images are processed in parallel.
In a modular grid, teams can then combine the required assets and computing
resources to support specific apps or services.
When applications are created, a set of computing resources and services are defined
to support them. Subsequently, when the applications expire, computing support is
withdrawn, and resources are set free, making them available for other apps.
Practically, original equipment manufacturers (OEMs) play a key role in modular grid
computing as their cooperation is critical in creating modular grids that are
application-specific.
Grid Computing Applications
1. Life science
For example, the MCell project explores cellular microphysiology using sophisticated
‘Monte Carlo’ diffusion and chemical reaction algorithms to simulate and study
molecular interactions inside and outside cells. Grid technologies have enabled the
large-scale deployment of various MCell modules, as MCell now runs on a large pool
of resources, including clusters and supercomputers, to perform biochemical
simulations.
2. Engineering-oriented applications
3. Data-oriented applications
Today, data is emerging from every corner — from sensors, smart gadgets, and
scientific instruments to many new IoT devices. With the explosion of data, grids
have a crucial role to play. Grids are being used to collect, store, and analyze data,
and at the same time, derive patterns to synthesize knowledge from that same data.
5. Commercial applications
Grid computing supports various commercial applications such as the online gaming
and entertainment industry, where computation-intensive resources, such as
computers and storage networks, are essential. The resources are selected based on
computing requirements in a gaming grid environment. It considers aspects such as
the volume of traffic and the number of participating players.
Such grids promote collaborative gaming and reduce the upfront cost of hardware
and software resources in on-demand-driven games. Moreover, in the media
industry, grid computing enhances the visual appearance of the motion picture by
adding special effects. The grid also helps theater film production as different
portions are processed concurrently, requiring less production time.
Efficiency: With grid computing, you can break down an enormous, complex
task into multiple subtasks. Multiple computers can work on the subtasks
concurrently, making grid computing an efficient computational solution.
Cost: Grid computing works with existing hardware, which means you can
reuse existing computers. You can save costs while accessing your excess
computational resources. You can also cost-effectively access resources from
the cloud.
Financial services
Financial institutions use grid computing primarily to solve problems involving
risk management. By harnessing the combined computing powers in the grid,
they can shorten the duration of forecasting portfolio changes in volatile
markets.
Gaming
The gaming industry uses grid computing to provide additional computational
resources for game developers. The grid computing system splits large tasks,
such as creating in-game designs, and allocates them to multiple machines.
This results in a faster turnaround for the game developers.
Entertainment
Some movies have complex special effects that require a powerful computer
to create. The special effects designers use grid computing to speed up the
production timeline. They have grid-supported software that shares
computational resources to render the special-effect graphics.
Engineering
Engineers use grid computing to perform simulations, create models, and
analyze designs. They run specialized applications concurrently on multiple
machines to process massive amounts of data. For example, engineers use
grid computing to reduce the duration of a Monte Carlo simulation, a
software process that uses past data to make future predictions.
Volunteer computing
What is volunteer computing?
“Volunteer computing” is a type of distributed computing in which computer
owners can donate their spare computing resources (processing power, storage and
Internet connection) to one or more research projects.
Because of the huge number (> 1 billion) of PCs in the world, volunteer
computing can supply more computing power to science than does any other
type of computing. This computing power enables scientific research that
could not be done otherwise. This advantage will increase over time, because
the laws of economics dictate that consumer products such as PCs and game
consoles will advance faster than more specialized products, and that there
will be more of them.
Volunteer computing is not the same as 'desktop grid' computing. Desktop grid
computing - which uses desktop PCs within an organization - is superficially similar to
volunteer computing, but because it has accountability and lacks anonymity, it is
significantly different.
If your definition of 'Grid computing' encompasses all distributed computing (which
is silly - there's already a perfectly good term for that) then volunteer computing is a
type of Grid computing.
Even before the work unit is finalized, information from the work unit is
trickled to the server.
Both the client and the server require job scheduling to ensure that
different tasks meet their deadlines.
With these useful features, volunteer computing has become one of the major areas
of research and technology for everyday applications. Let us now look at the
attributes of volunteer networks in brief.
Assessing the volunteer opportunities
Several distributed computing technologies are closely related to volunteer
computing, including the following:
Mist computing
Fog computing
o The location for storage can be both cloud and local data center
Centralised computing
Utility computing
Cloud computing
o Platform as a service
o Infrastructure as a service
Grid computing
Let us now look at the issues and concerns associated with volunteer computing in
big data:
In cold climates, consumer devices contribute ambient heating, which can bring the
net energy cost of the computation close to zero. For this reason, global deployment
of volunteer computing is viewed as potentially more efficient than data center
computing.
Let us now see how volunteer computing is suited to high-throughput applications.
Reducing the turnaround time of an individual task is not the primary goal of
volunteer computing.
For workloads with huge memory requirements, large storage demands, or a high
ratio of network communication to computation, volunteer computing in big data
cannot be used effectively.
Within these limits, volunteer computing is well accepted and well suited for high-
throughput computing applications.
3. Diverse Applications: Volunteer computing projects cover a wide range of
scientific, research, and humanitarian domains, including astronomy, physics,
biology, climate modeling, drug discovery, cryptography, and social sciences.
These projects rely on volunteer computing to perform complex calculations,
simulations, and data analysis tasks that would otherwise require significant
computational resources.
Let us now look into the prominent volunteer computing-based big data techniques
below.
Deploying deep learning and machine learning methods, and assessing their
suitability for volunteer computing systems, is one of the important research
directions in the field today. Let us now look at the parameters used in analyzing
volunteer computing networks:
Loop control
K means clustering
o Checkpointing fault tolerance is followed in this mechanism
You can use volunteer computing for tasks that involve heavy resource utilization,
such as big data analytics and scientific simulations, by aggregating idle computing
devices such as desktops, routers, and smart devices.
Resource allocation is considered one of the important issues in resource
management under volunteer computing. Let us now discuss the research issues in
the task scheduling aspects of volunteer computing:
o Output file size denotes the data amount sent by a client to the server
after executing a task
o Input file size stands for the data that is uploaded for processing into
the volunteer nodes
Duration of tasks
Many such factors have to be considered when evaluating a volunteer computing
network and choosing the best simulation tool for it. Let us now talk about the
parameters used for analysing the performance of volunteer computing networks.
Performance Analysis of Volunteer Computing
Maximum error results
o When the number of client error results exceeds the maximum error result
value, the work unit is established to contain errors
Target results
Optimizing these performance metrics is the key to achieving the best possible
outcomes from a volunteer computing project.
1. Artificial Intelligence (AI) and Big Data: The combination of AI and big data
has led to groundbreaking advancements in data analytics, machine learning,
and predictive modeling. AI algorithms are increasingly used to analyze vast
amounts of data, extract meaningful insights, and drive decision-making in
diverse fields such as healthcare, finance, marketing, and autonomous
systems.
Big data also supports supply chain optimization. This convergence enables faster
delivery times, improved customer experiences, and greater convenience for
consumers.
Overall, the convergence of key trends has profound implications for society,
economy, and technology, shaping the way we live, work, and interact in the digital
age. By recognizing and harnessing the synergies between these trends,
organizations and policymakers can unlock new opportunities and address complex
challenges in a rapidly evolving world.
The big data market is expected to grow to around USD 200 billion by 2025. So, let's
check out the top 10 big data trends for 2022.
1. TinyML
2. AutoML
The motive of AutoML is to offer machine learning techniques and models to non-
experts in ML. And although AutoML reduces the need for human interaction, that
does not mean it is going to replace humans completely.
3. Data Fabric
Data Fabric has been in trend for a while now and will continue its dominance in the
coming years. It is an architecture and a group of data services that span cloud
environments. Data fabric has also been listed by Gartner as a top analytics trend,
and it continues to spread across the enterprise. It consists of key data management
technologies, including data pipelining, data integration, and data governance.
Enterprises have adopted it readily because it reduces the time needed to extract
business insights, which helps in making impactful business decisions.
4. Cloud Migration
5. Data Regulation
6. IoT
7. NLP
Natural Language Processing (NLP) is a kind of AI that helps in assessing text or voice
input provided by humans. In short, it is used to understand what is being said, and
it works like a charm. It is a next-level achievement in technology; you can already
find examples where you can ask a machine to read something aloud for you. NLP
uses a range of methodologies to resolve ambiguity in speech and give it a natural
touch. The best-known examples are Apple's Siri and Google Assistant, where you
speak to the AI and it provides you with useful information as per your need.
8. Data Quality
Data quality became one of the most pressing concerns for companies in 2021, even
though relatively few companies have acknowledged that data quality is becoming
an issue for them. To date, many companies have not focused on the quality of the
data coming from their various mining tools, which has resulted in poor data
management. If data is their decision-maker and plays a crucial role, poor-quality
data means they might be setting the wrong targets for their business or targeting
the wrong group. That is where filtration is required to achieve real milestones.
9. Cyber Security
With the rise of the COVID-19 pandemic, when the world was forced to shut down
and companies were left with no option other than work from home, things began
changing. Even after many months, people continue to prefer remote work.
Everything has its pros and cons, and remote work also comes with challenges,
including cyber-attacks. Working remotely demands extra safety measures and
responsibilities: since employees are outside the corporate security perimeter, it
becomes a concern for companies, and cyber attackers have become more active in
finding different ways to breach systems.
Taking this into consideration, XDR (Extended Detection and Response) and SOAR
(Security Orchestration, Automation, and Response) have been introduced, which
help in detecting cyber-attacks by applying advanced security analytics across the
network. Therefore, this is and will remain one of the major trends for 2022 in big
data and analytics.
10. Predictive Analytics
Predictive analytics helps in identifying future trends and making forecasts with the
help of statistical tools. It analyses patterns in a meaningful way and is widely used
for weather forecasting. However, its abilities and techniques are not limited to this;
it can be used to sort almost any data and analyse the statistics based on the
patterns found.
Some examples are the share market and product research. Based on the provided
data, it can report in advance whether a market share is dipping; or, if you want to
launch a product, it can collect data from different regions and, based on customer
interests in those regions, help you analyse the business decision. In a world of
heavy competition, predictive analytics is becoming even more in demand and will
stay in trend for the upcoming years.
Unstructured data
What is Unstructured Data?
Unstructured data is data that does not conform to a data model and has no easily
identifiable structure, such that it cannot be used easily by a computer program.
Unstructured data is not organised in a pre-defined manner and does not have a
pre-defined data model, so it is not a good fit for a mainstream relational database.
From 80% to 90% of data generated and collected by organizations is unstructured,
and its volumes are growing rapidly — many times faster than the rate of growth for
structured databases.
Unstructured data stores contain a wealth of information that can be used to guide
business decisions. However, unstructured data has historically been very difficult to
analyze. With the help of AI and machine learning, new software tools are emerging
that can search through vast quantities of it to uncover beneficial and actionable
business intelligence.
Unstructured data doesn’t have a predefined structure and is common in sources
like:
Emails
PDFs
Images
Audio files
Video files
While unstructured data doesn't have the same organization as structured data, you
can still analyze it to find trends and insights. To do this, businesses need to invest in
big data technologies like OpenText™ IDOL Unstructured Data Analytics to easily
process large amounts of unstructured data.
Combining data from these diverse sources can provide valuable insights but also
presents challenges related to data integration, quality, and governance.
Unstructured data, in contrast, doesn’t fit into these sorts of pre-defined data
models. It can’t be stored in an RDBMS. And because it comes in so many formats,
it’s a real challenge for conventional software to ingest, process, and analyze. Simple
content searches can be undertaken across textual unstructured data with the right
tools.
Beyond that, the lack of consistent internal structure doesn’t conform to what typical
data mining systems can work with. As a result, companies have largely been unable
to tap into value-laden data like customer interactions, rich media, and social
network conversations. Robust tools for doing so are only now being developed and
commercialized.
Here are some examples of the human-generated variety:
Email: Email message fields are unstructured and cannot be parsed by traditional
analytics tools. That said, email metadata affords it some structure, and explains
why email is sometimes considered semi-structured data.
Social media and websites: data from social networks like Twitter, LinkedIn, and
Facebook, and websites such as Instagram, photo-sharing sites, and YouTube.
Mobile and communications data: For this category, look no further than text
messages, phone recordings, collaboration software, chat, and instant messaging.
Media: This data includes digital photos, audio, and video files.
Text Documents: Word documents, PDFs, emails, web pages, and text files
containing unstructured textual content.
Social Media Data: Posts, comments, tweets, photos, videos, and other content
shared on social media platforms such as Facebook, Twitter, Instagram, and
LinkedIn.
Sensor Data: Data collected from sensors embedded in IoT devices, industrial
equipment, vehicles, environmental monitoring systems, and wearable devices.
Scientific data: This includes oil and gas surveys, space exploration, seismic
imagery, and atmospheric data.
Digital surveillance: This category features data like reconnaissance photos and
videos.
Satellite imagery: This data includes weather data, land forms, and military
movements.
How is Unstructured Data stored?
Unstructured data is usually stored in non-relational systems such as Hadoop (HDFS)
or NoSQL databases and processed by unstructured data analytics programs like
OpenText IDOL. These systems can store and process large amounts of unstructured
data.
Targeted marketing campaigns: Marketing teams can use unstructured data to
identify customer needs and wants. This information can then help them create
targeted marketing campaigns.
Better business decisions: Unstructured data can help businesses find trends and
insights that would otherwise be difficult to identify. This information ultimately
helps stakeholders make accurate judgments and improve their companies.
Limitless use: Unstructured data isn’t predefined, meaning owners can use it in
unlimited ways.
Affordable storage cost: Enterprises have more raw, unstructured data than
structured information. Storing unstructured data is both convenient and cost-
effective.
File extraction: Gain more insight from your data with support for over 1,500 file
formats, including a document file reader and file extraction with standalone file-
format detection, content decryption, text extraction, subfile processing, non-
native rendering, and structured export.
Law Enforcement Analytics & Media Analysis: Identify and extract facts from
video and image evidence during investigations. Collect, organize, classify, and
secure these assets faster while reducing costs and the strain on labor.
Natural Language Q&A and Chatbot: Accesses a variety of sources for highly
matched answers and responds in a natural language format. Create a human
dialog chat experience for customers through AI and ML.
What are the challenges of Unstructured Data?
Working with unstructured data can be challenging. Since this type of information is
not organized in a predefined manner, it's more challenging to analyze.
Security risks: Securing unstructured data can be complex since users can spread
this information across many storage formats and locations.
Need for data scientists: Unstructured data usually requires data scientists to
parse through it and make interpretations.
Machine learning: This technique finds patterns and insights in data. For
example, tools that feature machine learning can inspect customer behavior to
identify trends.
How can OpenText IDOL Unstructured Data Analytics help?
OpenText unstructured data analytics platform helps organizations analyze this type
of information. OpenText IDOL includes tools and technologies that collect, process,
and analyze unstructured data.
Audio analytics: This feature enables businesses to extract meaning from audio
files. For example, audio analytics can identify keywords in a conversation or
detect emotions in a voice.
Repository Data access and connectors: Users can easily connect to various
data sources. This includes social media, enterprise applications, and databases.
Unstructured Data Analytics Software for OEM & SDKs: Use our software
development kit to build the apps and APIs you need to take advantage of your
unstructured data.
Advantages of unstructured data:
Data is portable
It is very scalable
Disadvantages of unstructured data:
It is difficult to store and manage unstructured data due to the lack of schema and
structure
Indexing the data is difficult and error-prone due to the unclear structure and the
absence of pre-defined attributes, so search results are not very accurate
1. Transportation
Big Data powers the GPS smartphone applications most of us depend on to get from
place to place in the least amount of time. GPS data sources include satellite images
and government agencies.
Airplanes generate enormous volumes of data, on the order of 1,000 gigabytes for
transatlantic flights. Aviation analytics systems ingest all of this to analyze fuel
efficiency, passenger and cargo weights, and weather conditions, with a view toward
optimizing safety and energy consumption.
Thanks to Big Data analytics, Google Maps can now tell you the least traffic-
prone route to any destination.
Route planning
Different itineraries can be compared in terms of user needs, fuel
consumption, and other factors to plan for maximum efficiency.
Traffic safety
Real-time processing and predictive analytics are used to pinpoint accident-
prone areas.
2. Advertising and Marketing
Ads have always been targeted towards specific consumer segments. In the past,
marketers have employed TV and radio preferences, survey responses, and focus
groups to try to ascertain people’s likely responses to campaigns. At best, these
methods amounted to educated guesswork.
Today, advertisers buy or gather huge quantities of data to identify what consumers
actually click on, search for, and “like.” Marketing campaigns are also monitored for
effectiveness using click-through rates, views, and other precise metrics.
For example, Amazon accumulates massive data stores on the purchases, delivery
methods, and payment preferences of its millions of customers. The company then
sells ad placements that can be highly targeted to very specific segments and
subgroups.
3. Banking and Financial Services
The financial industry puts Big Data and analytics to highly productive use, for:
Fraud detection
Banks monitor credit cardholders’ purchasing patterns and other activity to
flag atypical movements and anomalies that may signal fraudulent
transactions.
Risk management
Big Data analytics enable banks to monitor and report on operational
processes, KPIs, and employee activities.
Personalized marketing
Banks use Big Data to construct rich profiles of individual customer lifestyles,
preferences, and goals, which are then utilized for micro-targeted marketing
initiatives.
4. Government
Many government agencies use Big Data. Examples include the IRS and the Social
Security Administration,
which use data analysis to identify tax fraud and fraudulent disability claims. The FBI
and SEC apply Big Data strategies to monitor markets in their quest to detect
criminal business activities. For years now, the Federal Housing Authority has been
using Big Data analytics to forecast mortgage default and repayment rates.
The Centers for Disease Control tracks the spread of infectious illnesses using data
from social media, and the FDA deploys Big Data techniques across testing labs to
investigate patterns of foodborne illness. The U.S. Department of Agriculture
supports agribusiness and ranching by developing Big Data-driven technologies.
5. Entertainment
The entertainment industry harnesses Big Data to glean insights from customer
reviews, predict audience interests and preferences, optimize programming
schedules, and target marketing campaigns.
Two conspicuous examples are Amazon Prime, which uses Big Data analytics to
recommend programming for individual users, and Spotify, which does the same to
offer personalized music suggestions.
6. Meteorology
Weather satellites and sensors all over the world collect large amounts of data for
tracking environmental conditions. Meteorologists use Big Data to:
Prepare weather forecasts
7. Healthcare
Big Data is slowly but surely making a major impact on the huge healthcare industry.
Wearable devices and sensors collect patient data which is then fed in real-time to
individuals’ electronic health records. Providers and practice organizations are now
using Big Data for a number of purposes, including these:
Real-time alerting
Strategic planning
Research acceleration
Telemedicine
8. Cybersecurity
While Big Data can expose businesses to a greater risk of cyberattacks, the same
datastores can be used to prevent and counteract online crime through the power of
machine learning and analytics. Historical data analysis can yield intelligence to
create more effective threat controls. And machine learning can warn businesses
when deviations from normal patterns and sequences occur, so that effective
countermeasures can be taken against threats such as ransomware attacks, malicious
insider programs, and attempts at unauthorized access.
After a company has suffered an intrusion or data theft, post-attack analysis can
uncover the methods used, and machine learning can then be deployed to devise
safeguards that will foil similar attempts in the future.
9. Education
Administrators, faculty, and stakeholders are embracing Big Data to help improve
their curricula, attract the best talent, and optimize the student experience. Examples
include:
Customizing curricula
Big Data enables academic programs to be tailored to the needs of individual
students, often drawing on a combination of online learning, traditional on-
site classes, and independent study.
10. Telecommunications:
Customer Churn Prediction: Big data analytics help telecom operators predict
customer churn, identify at-risk customers, and implement targeted retention
strategies. By analyzing customer behavior, usage patterns, and billing data, telecom
companies can personalize offers, improve customer satisfaction, and reduce churn
rates.
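As a rough illustration of the churn prediction described above (not a prescribed method from this text), the sketch below trains a simple model with scikit-learn. The column names such as monthly_usage_minutes, support_calls, and churned are hypothetical placeholders for whatever usage and billing fields an operator actually stores.

```python
# Minimal churn-prediction sketch using scikit-learn; column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("telecom_customers.csv")           # assumed historical customer extract
features = ["monthly_usage_minutes", "bill_amount", "support_calls", "tenure_months"]
X, y = df[features], df["churned"]                   # churned: 1 = left, 0 = stayed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))

# Customers with a high predicted churn probability can be routed to retention offers.
at_risk = X_test[model.predict_proba(X_test)[:, 1] > 0.7]
```

In practice the at-risk list would feed the targeted retention campaigns mentioned above.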
Customer Analytics: Retailers analyze big data from various sources such as
transaction records, website visits, and social media interactions to understand
customer preferences, behavior patterns, and purchasing trends. This information is
used to personalize marketing campaigns, optimize product assortments, and
improve customer experiences.
Supply Chain Optimization: Big data analytics help retailers optimize inventory
management, supply chain logistics, and demand forecasting. By analyzing historical
sales data, weather patterns, and market trends, retailers can anticipate demand
fluctuations, reduce stockouts, and optimize inventory levels across their distribution
networks.
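To make the demand-forecasting idea concrete, here is a deliberately simple sketch in which a moving average stands in for the more sophisticated models retailers use; the file name and the date and units_sold columns are assumptions.

```python
# Illustrative demand-forecasting sketch with pandas (column names are placeholders).
import pandas as pd

sales = pd.read_csv("daily_sales.csv", parse_dates=["date"]).set_index("date")
weekly = sales["units_sold"].resample("W").sum()

# Forecast next week's demand as the average of the last four weeks.
forecast_next_week = weekly.rolling(window=4).mean().iloc[-1]
reorder_point = forecast_next_week * 1.2             # 20% safety-stock buffer (assumption)
print(f"Forecast: {forecast_next_week:.0f} units, reorder at {reorder_point:.0f}")
```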
Web analytics
What is web analytics?
Web analytics is the process of analyzing the behavior of visitors to a website. This
involves tracking, reviewing and reporting data to measure web activity, including
the use of a website and its components, such as webpages, images and videos.
Data collected through web analytics may include traffic sources, referring sites, page
views, paths taken and conversion rates. The compiled data often forms a part of
customer relationship management analytics (CRM analytics) to facilitate and
streamline better business decisions.
Web analytics enables a business to retain customers, attract more visitors and
increase the dollar volume each customer spends.
Observe the geographic regions from which the most and the fewest
customers visit the site and purchase specific products.
Predict which products customers are most and least likely to buy in the
future.
The objective of web analytics is to serve as a business metric for promoting specific
products to the customers who are most likely to buy them and to determine which
products a specific customer is most likely to purchase. This can help improve the
ratio of revenue to marketing costs.
In addition to these features, web analytics may track the clickthrough and drilldown
behavior of customers within a website, determine the sites from which customers
most often arrive, and communicate with browsers to track and analyze online
behavior. The results of web analytics are provided in the form of tables, charts and
graphs.
1. Setting goals. The first step in the web analytics process is for businesses to
determine goals and the end results they are trying to achieve. These goals
can include increased sales, customer satisfaction and brand awareness.
Business goals can be both quantitative and qualitative.
2. Collecting data. The second step in web analytics is the collection and
storage of data. Businesses can collect data directly from a website or web
analytics tool, such as Google Analytics. The data mainly comes
from Hypertext Transfer Protocol requests -- including data at the network
and application levels -- and can be combined with external data to interpret
web usage. For example, a user's Internet Protocol address is typically
associated with many factors, including geographic location and clickthrough
rates.
3. Processing data. The next stage of the web analytics funnel involves
businesses processing the collected data into actionable information.
The process of web analytics involves:
Setting business goals: Defining the key metrics that will determine the
success of your business and website
Processing data: Converting the raw data you’ve gathered into meaningful
ratios, KPIs, and other information that tell a story
You can use this information to optimize underperforming pages and further
promote higher-performing ones across your website. For example, French news
publisher Le Monde used analytics to inform a website redesign that increased
subscriber conversions by 46 percent and grew digital subscriptions by over 20
percent. Le Monde was able to identify which paid content users engaged with the
most, then use that information to highlight top-performing content on the
homepage.
Web analytics tools reveal key details about your site visitors—including their
average time spent on page and whether they’re a new or returning user—
and which content draws in the most traffic. With this information, you’ll learn
more about what parts of your website and product interest users and
potential customers the most.
For instance, an analytics tool might show you that a majority of your website
visitors are landing on your German site. You could use this information to
ensure you have a German version of your product that’s well translated to
meet the needs of these users.
By looking at the above data, you can do conversion rate optimization (CRO).
CRO will help you design your website to achieve the optimum quantity and
quality of conversions.
Web analytics tools can also show you important metrics that help you boost
purchases on your site. Some tools offer an enhanced ecommerce tracking
feature to help you figure out which are the top-selling products on your
website. Once you know this, you can refine your focus on your top-sellers
and boost your product sales.
By connecting your web analytics tool with Google Search Console, it’s possible
to track which search queries are generating the most traffic for your site.
With this data, you’ll know what type of content to create to answer those
queries and boost your site’s search rankings.
It’s also possible to set up onsite search tracking to know what users are
searching for on your site. This search data can further help you generate
content ideas for your site, especially if you have a blog.
Web analytics tools will also help you learn which content is performing the
best on your site, so you can focus on the types of content that work and also
use that information to make product improvements. For instance, you may
notice blog articles that talk about design are the most popular on your
website. This might signal that your users care about the design feature of
your product (if you offer design as a product feature), so you can invest more
resources into the design feature. The popular content pieces on your website
could spark ideas for new product features, too.
Web analytics will tell you who your top referral sources are, so you know
which channels to focus on. If you’re getting 80% of your traffic from
Instagram, your company’s marketers will know that they should invest in ads
on that platform.
Web analytics also shows you which outbound links on your site people are
clicking on. Your company’s marketing team might discover a mutually
beneficial relationship with these external websites, so you can reach out to
them to explore partnership or cross-referral opportunities.
Page visits and sessions refer to the traffic to a webpage over a specific period
of time. The more visits, the more your website is getting noticed.
Keep in mind traffic is a relative success metric. If you’re seeing 200 visits a
month to a blog post, that might not seem like great traffic. But if those 200
visits represent high-intent views—views from prospects considering
purchasing your product—that traffic could make the blog post much more
valuable than a high-volume, low-intent piece.
Source of traffic
Web analytics tools allow you to easily monitor your traffic sources and adjust
your marketing strategy accordingly. For example, if you’re seeing lots of
traffic from email campaigns, you can send out more email campaigns to
boost traffic.
Bounce rate
Bounce rate refers to how many people visit just one page on your website
and then leave your site.
Interpreting bounce rates is an art. A high bounce rate could be both negative
and positive for your business. It’s a negative sign since it shows people are
not interacting with other pages on your site, which might signal low
engagement among your site visitors. On the other hand, if they spend quality
time on a single page, it might indicate that users are getting all the
information they need, which could be a positive sign. That’s why you need to
investigate bounce rates further to understand what they might mean.
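A bounce rate can be computed directly from raw pageview events. The sketch below assumes a hypothetical export with session_id and page columns; a session with a single pageview counts as a bounce.

```python
# Rough bounce-rate calculation from pageview events (schema is an assumption).
import pandas as pd

events = pd.read_csv("pageviews.csv")                 # one row per pageview
pages_per_session = events.groupby("session_id")["page"].count()
bounce_rate = (pages_per_session == 1).mean() * 100   # share of one-page sessions
print(f"Bounce rate: {bounce_rate:.1f}%")
```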
Repeat visit rate tells you how many people are visiting your website regularly
or repeatedly. This is your core audience since it consists of the website
visitors you’ve managed to retain. Usually, a repeat visit rate of 30% is good.
Anything below 20% shows your website is not engaging enough.
Monthly unique visitors refers to the number of distinct visitors who visit your site each month.
Tracked alongside new visitors, this metric shows how effective your site is at attracting fresh traffic each month, which is important for your growth. Ideally, a healthy website will show a steady flow of new visitors.
Along with tracking these basic metrics, an ecommerce company’s team might also track additional KPIs (for example, conversion rate and average order value) to understand how to boost sales.
The term off-site web analytics refers to the practice of monitoring visitor activity
outside of an organization's website to measure potential audience. Off-site web
analytics provides an industrywide analysis that gives insight into how a business is
performing in comparison to competitors. It refers to the type of analytics that
focuses on data collected from across the web, such as social media, search
engines and forums.
On-site web analytics refers to a narrower focus that uses analytics to track the
activity of visitors to a specific site to see how the site is performing. The data
gathered is usually more relevant to a site's owner and can include details on site
engagement, such as what content is most popular. Two technological approaches to
on-site web analytics include log file analysis and page tagging.
Log file analysis, also known as Log Management, is the process of analyzing data
gathered from log files to monitor, troubleshoot and report on the performance of a
website. Log files hold records of virtually every action taken on a network server,
such as a web server, email server, database server or file server.
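As a small illustration of log file analysis (not tied to any particular log management product), the sketch below parses an Apache/Nginx-style access log and counts the most requested pages; the log path and format are assumptions.

```python
# Minimal log-file analysis sketch: count top pages from a "combined"-format access log.
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[.*?\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3})'
)

page_views = Counter()
with open("access.log") as log:
    for line in log:
        match = LOG_PATTERN.match(line)
        if match and match.group("status") == "200" and match.group("method") == "GET":
            page_views[match.group("path")] += 1      # successful GETs only

for path, hits in page_views.most_common(10):
    print(f"{hits:6d}  {path}")
```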
Page tagging is the process of adding snippets of code into a website's HyperText
Markup Language code using a tag management system to track website visitors and
their interactions across the website. These snippets of code are called tags. When
businesses add these tags to a website, they can be used to track any number of
metrics, such as the number of pages viewed, the number of unique visitors and the
number of specific products viewed.
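For context, the server side of page tagging can be imagined as a tiny collection endpoint that the JavaScript tag calls on every pageview. The sketch below uses Flask purely for illustration, and the query parameter names (page, ref, vid) are invented; real tag management and analytics vendors handle this collection for you.

```python
# Conceptual sketch of a pageview "beacon" collector; parameter names are assumptions.
from datetime import datetime, timezone
from flask import Flask, request

app = Flask(__name__)

@app.route("/collect")
def collect():
    # A page tag would append these query parameters when the page loads.
    hit = {
        "time": datetime.now(timezone.utc).isoformat(),
        "page": request.args.get("page"),
        "referrer": request.args.get("ref"),
        "visitor_id": request.args.get("vid"),
        "ip": request.remote_addr,
    }
    with open("hits.log", "a") as out:                 # append one line per pageview
        out.write(f"{hit}\n")
    return "", 204                                     # empty response; nothing to render

if __name__ == "__main__":
    app.run(port=8000)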
As a general rule, only measure the metrics that are important to your
business goals, and ignore the rest. For example, if your primary goal is to
increase sales in a certain location, you don’t need metrics about anything
outside of that location.
The data collected by analytics tools is not always accurate. Many users may
opt-out of analytics services, preventing web analytics tools from collecting
information on them. They may also block cookies, further preventing the
collection of their data and leading to a lot of missing information in the data
reported by analytics tools. As we move towards a cookieless world, you’ll
need to consider analytics solutions that track first-party data, rather than
relying on third-party data.
Your web analytics tool may also be using incorrect data filters, which may
skew the information it collects, making the data inaccurate and unreliable.
And there’s not much you can do with unreliable data.
Website data is particularly sensitive. Make sure your web analytics tools have
proper monitoring procedures and security testing in place. Take steps to
protect your website against any potential threats.
While web analytics are useful to learn how users are interacting with your
website, they only scratch the surface when it comes to understanding user
behavior. Web analytics can tell you what users are doing, but not why they
do it. To understand behaviors, you need to go beyond web analytics and
leverage a behavioral analytics solution like Amplitude Analytics. By looking at
behavioral product data, you’ll see which actions drive higher engagement,
retention, and lifetime value.
Web analytics tools, like Google Analytics, report important website statistics to
analyze the behavior of visitors as part of CRM analytics to facilitate and streamline
business decisions.
Analytics Tools offer an insight into the performance of your website, visitors’
behavior, and data flow. These tools are inexpensive and easy to use. Sometimes,
they are even free.
Google Analytics
Google Analytics is a freemium analytics tool that provides detailed statistics of web traffic. It is used by more than 60% of website owners.
Google Analytics helps you track and measure visitors, traffic sources, goals, conversions, and other metrics. It generates reports on −
Audience Analysis
Acquisition Analysis
Behavior Analysis
Conversion Analysis
Audience Analysis
As the name suggests, audience analysis gives you an overview of the audience who
visit your site along with their session history, page-views, bounce rate, etc. You can
trace the new as well as the returning users along with their geographical locations.
You can track −
The affinity reach and market segmentation under Interests.
New and returning visitors, their frequency, and engagement under Behavior.
Custom variable report under Custom. This report shows the activity by
custom modules that you created to capture the selections.
Flow of user activity under Users flow to see the path they took on your
website.
Acquisition Analysis
Acquisition means ‘to acquire.’ Acquisition analysis is carried out to find out the
sources from where your web traffic originates. Using acquisition analysis, you can −
Capture traffic from all channels, particular source/medium, and from referrals.
See traffic from search engines. Here, you can see queries, the landing pages they triggered, and a geographical summary.
Track social media traffic. It helps you identify the networks where your users are engaged and see the referrals from which your traffic originates. You can also view your hub activity, follow-ups from bookmarking sites, etc. In the same tab, you can look at your endorsements in detail. This helps you measure the impact of social media on your website.
Look at all the campaigns you built across your website, with detailed statistics on paid/organic keywords and the cost incurred on them.
Behavior Analysis
Behavior analysis monitors users’ activities on a website. You can find behavioral data
under the following four segments −
Site Content − It shows how many pages were viewed. You can see the
detailed interaction of data across all pages or in segments like content drill-
down, landing pages, and exit pages. Content drill-down is breaking up of
data into sub-folders. Landing page is the page where the user lands, and exit
page is where the user exits your site. You can measure the behavioral flow in
terms of content.
Site Speed − Here, you can capture page load time, execution speed, and performance data. You can see how quickly the browser can parse the page. Further, you can measure page timings and user timings, and get speed suggestions. This helps you see where your site is lagging.
Site Search − It gives you a full picture of how the users search across your
site, what they normally look for, and how they arrive at a particular landing
page. You can analyze what they search for before landing on your website.
Events − Events are visitors’ actions involving content that can be traced independently, such as downloads, sign-ups, and log-ins.
Conversion Analysis
Goals − Metrics that measure a profitable activity that you want the user to
complete. You can set them to track the actions. Each time a goal is achieved,
a conversion is added to your data. You can observe goal completion, value,
reverse path, and goal flow.
Ecommerce − You can set up ecommerce tracking to know what users buy from your website. It helps you assess product performance, sales performance, transactions, and purchase time. Based on this data, you can analyze what is profitable and what incurs a loss.
Optimizely is an optimization platform for testing and validating changes against the present look of your webpage and for deciding which layout to finally go with. It uses A/B testing, multipage testing, and multivariate testing to improve and analyze your website.
You can run tests and use custom integrations through the Optimizely interface. The steps are −
Once the setup is done, select your test pages, i.e., the pages and elements you want to run tests on.
Set goals. To set goals, click the flag icon at the top right of the page and follow the instructions. Check the metrics you are looking for. Click Save.
You can create variations with the usual editor like changing text and images.
Next step is monitoring your tests. You need to test which landing pages are
performing well. What is attracting the visitors? What is the bounce rate?
Understand the statistics, filter the non-performing areas, and conclude the
test.
KISSmetrics is a powerful web analytics tool that delivers key insights into user interaction on your website. It gives a clear picture of users’ activities on your website and collects acquisition data for every visitor.
You can use the service free for a month; after that, you can switch to a paid plan that suits you. KISSmetrics helps improve sales by identifying products abandoned in carts, and it helps you know exactly when to follow up with customers by tracking repeat buyers’ activity.
Cart size
Summarizing KISSmetrics
It gets you more customers by not letting you lose potential customers and by maintaining brand loyalty.
It lets you judge which of your decisions are working.
It helps you identify data and trends that contribute to customer acquisition.
A convenient dashboard − you do not need to run around searching for figures.
Installation
Tracking
Add a JavaScript snippet under the <head> tag of your website’s source code.
Event Setting
By default, KISSmetrics sets two events for you − visited site and search engine hit.
To add more events, click on new event, add an attribute and record an event name.
Setting up Metrics
Click on create a new metric. Select your metric type from the list. Give metric name,
description, and event. Save metric.
Define Conversions
Define your conversion and track them. Select number of times event happened.
Give metric name and description and select event. Save metric again.
KISSmetrics can track web pages, mobile apps, the mobile web, and Facebook apps, and can blend all the data into one, so you don’t need multiple analytics platforms.
Crazy Egg is an online analytics application that provides eye-tracking-style tools: it generates heatmaps based on where people clicked on your website, giving you an idea of where to focus. It lets you filter data by the top 15 referrers, search terms, operating systems, etc.
To use Crazy Egg, a small piece of JavaScript code needs to be placed on your site
pages.
Once the code is on your site, Crazy Egg will track user behavior. Its servers create a report that shows you the clicks on the pages you are tracking. You can review the reports in the dashboard within the members’ area of the Crazy Egg site.
Setting up Crazy Egg is a quick and easy task.
Overlay Tool − It gives you an overlay report of the number of clicks occurring on your website, so you can see which elements attract the most clicks.
Installation
Insert the JavaScript code into the source code of your website. Crazy Egg will track user behavior by default. Its servers generate reports for you to view; set up the dashboard to review them.
What to Measure
1. Audience
Bounce rate − Bounce rate reflects the percentage of visitors leaving after viewing only one page of your website. It helps you know how many visitors do so. If the bounce rate of a website increases, its webmaster should be worried.
Pages per session − Pages/session is the number of pages surfed in a single
session. For example, a user landed on your website and surfed 3 pages, then
the website pages/session is 3.
Demographic info − Demographic data shows age and gender. With the help of demographic info, you can find the percentage of male and female visitors coming to your website and shape your strategy accordingly. Age-group data helps you find what percentage of each age group visits your website, so you can build a strategy around the largest age groups.
Devices − This data shows device information. From it, you can easily find what percentage of visitors come from mobile, desktop, tablets, etc. If mobile traffic is high, you need to make your website responsive.
2. Acquisition
Traffic sources − In the acquisition, you have to check all your sources of the traffic.
Major sources of the traffic are −
Organic traffic is the traffic coming through all search engines (Google,
Yahoo, Bing....)
Social traffic is the traffic coming through all social media platforms (like −
Facebook, Twitter, Google+, ...)
Referral traffic is the traffic coming from other sites where your website is linked.
Direct traffic is the traffic coming directly to your website. For example,
typing the url of your website, clicking on the link of your website given in
emails, etc.
Source/Medium − This metric gives you an idea of the sources from where you are getting traffic (Google, Yahoo, Bing, Direct, Facebook...).
3. Site Content
Landing pages − Landing pages are the pages where visitors land first (normally, the home page of a website is the main landing page). With the help of this metric, you can find the top pages of the website and analyze how many pages are getting 50% or more of the website’s traffic. This makes it easy to see which type of content is working for you and, based on this analysis, to plan the next content strategy.
Site speed − Site speed is the metric used for checking page timing (average page load time). Using this metric, you can find which pages take more time to load, how many pages have high load times, etc.
Server Logs
Log files list the actions that take place on a server. They maintain records for every request invoked, for example, the source of a visitor, their next action, etc.
A server log is a simple text file that records activity on the server. It is created and maintained automatically by the server. With the help of a server log file, you can find the activity details of the website and its pages, including IP address, time/date, and pages visited. It gives you insight into the type of browser, country, and origin. These files are meant for webmasters, not for website users. The statistics provided by server logs are used to examine traffic patterns segmented by day, week, or referrer.
Visitors' Data
Visitors’ data shows the total traffic of the website. It can be calculated by any web analytics tool. With the help of visitors’ data, you can analyze your website’s improvement and update your servers accordingly. It may comprise −
Technology they are using, e.g., browsers and operating systems
User Flow
Search engine statistics show the data acquired through organic traffic. If the search engine traffic of a website has improved, it means the website’s search ranking for its main keywords has improved. This data also helps you to −
Find the revenue-generating keywords and the keywords that visitors type into search engines.
Conversion Funnels
A conversion funnel is the path by which a goal (product purchase, lead form completion, service contact form submission, etc.) is completed. It is the series of steps visitors cover to become customers. If many visitors are leaving the website without making a purchase, you can use conversion funnels to analyze the following −
Why are they leaving the website?
Is there any broken link in the conversion path or any other feature that is not
working in the conversion path?
Conversion funnels help you visualize these aspects in the form of graphics.
You can also build user segments, for example, visitors who purchased your products versus visitors who only browsed your website. During remarketing, you can target those audiences with the help of these segments.
Data Segmentation
Data segmentation is very useful for analyzing website traffic. In analytics, you can analyze traffic insights with the help of segmentation, which you configure by adding segments in Google Analytics.
For a website, you can segment total traffic according to Acquisition, Goals, and
Channels. Following are the types of acquisition segmentation −
Organic Traffic − It shows only the organic traffic of the website. You can find
which search engine (Google, Yahoo, Bing, Baidu, Aol, etc.) is working for you.
With the help of organic traffic, you can also find the top keywords that send
traffic to your website.
Referrals Traffic − This segment shows the total referral traffic of the website. With the help of this segment, you can find the top referral websites that send traffic to your website.
Direct Traffic − This segment helps you find the traffic that visits your website directly.
Social Traffic − With the help of the social segment, you can analyze social traffic: how much traffic you are getting from social media, and which platform (Facebook, G+, Twitter, Pinterest, StumbleUpon, Reddit, etc.) is sending traffic to your website. With the help of this segment, you can shape your future social media strategy. For example, if Facebook is sending the highest traffic to your website, you can increase your Facebook post frequency.
Paid Traffic − Paid traffic segment captures traffic through paid channels
(Google AdWords, Twitter ads...).
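One hedged way to reproduce this acquisition segmentation outside of Google Analytics is to bucket raw sessions by their source/medium fields, as in the sketch below; the file name and column names are assumptions.

```python
# Hypothetical sketch: bucket sessions into the channel groups described above.
import pandas as pd

sessions = pd.read_csv("sessions.csv")   # assumed columns: source, medium, pageviews, ...

def channel(row):
    if row["medium"] in ("cpc", "ppc", "paid"):
        return "Paid"
    if row["medium"] == "organic":
        return "Organic"
    if row["source"] in ("facebook.com", "twitter.com", "pinterest.com", "reddit.com"):
        return "Social"
    if row["medium"] == "referral":
        return "Referral"
    return "Direct"

sessions["channel"] = sessions.apply(channel, axis=1)
print(sessions["channel"].value_counts())   # traffic volume per segment
```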
When you are done with your segments (collected the data from segments), then the
next step is analysis. Analysis is all about finding the actionable item from the data.
Example
Month      Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep
Organic    40K   42K   40K   43K   45K   47K   57K   54K   60K
Referrals   5K    4K    5K    4K    6K    5K    4K    3K    4K
Social      1K    1K    2K    4K    2K    3K    5K    5K    4K
Analysis
From the above table, you can see that your organic traffic is growing
(improved 20k in 9 months). Referrals traffic is going down. Social traffic has
also improved (1k to 4k).
Find out which pages are bringing in the organic traffic and analyze them.
Actionable
Focus on the social media platform that is sending the highest traffic.
Find why your referrals traffic is going down. Is any link removed from the
website, which was sending traffic earlier?
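The example analysis above can be reproduced with a few lines of pandas; the figures below are simply the example table restated in code.

```python
# Trend analysis over the example traffic table (values in thousands of visits).
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"]
traffic = pd.DataFrame({
    "Organic":   [40, 42, 40, 43, 45, 47, 57, 54, 60],
    "Referrals": [5, 4, 5, 4, 6, 5, 4, 3, 4],
    "Social":    [1, 1, 2, 4, 2, 3, 5, 5, 4],
}, index=months)

change = traffic.iloc[-1] - traffic.iloc[0]            # Sep minus Jan, per channel
print(change)                                          # Organic +20K, Referrals -1K, Social +3K
print(traffic.pct_change().mean())                     # average month-over-month growth rate
```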
Dashboard Implementation
Types of Dashboards
You can create dashboards according to your requirements. Following are the main
types of dashboards −
Content dashboard
Ecommerce dashboard
Social Media dashboard
PPC dashboard
In every dashboard, you have to create widgets. Widgets present data in graphical or numerical form.
For example, if you want to create a dashboard for SEO, you have to create a widget
for the total traffic, for the organic traffic, for the keywords, etc. You can analyze
these metrics with the help of SEO dashboard.
If you want to create a dashboard for website performance, then you have to create a
widget for website avg. page load time, Website server response time, Page load
time for mobile, and Check page load time by browser. With the help of these
widgets, you can easily analyze the website performance.
Content − In content dashboard, you have to monitor traffic for blog section,
Conversion by blog post, and Top landing page by exit.
Website Performance Dashboard − Avg. page load time, Mobile page load
time, Page load time by browser, and Website server response time.
Real Time Overview Dashboard − In the real-time overview, you can set widgets for real-time traffic, real-time traffic sources, and real-time traffic landing pages.
PPC dashboard − In pay per click (PPC) dashboard, you need to include
clicks, impressions, CTR, converted clicks, etc.
Goals
Goals are used in analytics for tracking completions of specific actions. With the help
of goals, you can measure the rate of success. Goals are measured differently in
different industries. For example, in an e-commerce website you can measure the
goal when a product gets sold. In a software company, you can measure the goal
when a software product is sold. In a marketing company, goals are measured when
a contact form is filled.
Types of Goals
Duration Goal − You can measure the user engagement with the help of
duration goal. You can specify hours, minutes, and second field to quantify the
goals. If a user spends more than that much of time on the page, then the
goal is completed.
Event Goals − You can measure user interaction with your events on the site. These are called event goals. You must have at least one event to compose this goal.
Funnels
Funnels are the steps taken to complete your goals. With the help of funnels, you can review the steps toward goal completion. Suppose that for an ecommerce company, a product sale is goal completion; the funnel is then the series of steps to purchase that product. If most visitors leave the website after adding products to the cart, you have to check why users are leaving. Is there a problem with the cart section? This analysis can help you improve your product performance or the steps to sell the products.
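As a small illustration of funnel analysis, the sketch below computes drop-off between steps; the step names and counts are invented for the example.

```python
# Simple funnel drop-off sketch for an ecommerce goal (figures are illustrative).
funnel = [
    ("Product page", 10_000),
    ("Add to cart",   2_500),
    ("Checkout",        900),
    ("Purchase",        600),
]

for (step, users), (next_step, next_users) in zip(funnel, funnel[1:]):
    drop = 100 * (1 - next_users / users)
    print(f"{step:>12} -> {next_step:<12} drop-off: {drop:.0f}%")
# A large drop between "Add to cart" and "Checkout" points at the cart section.
```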
Multi-Channel Funnels
The Multi-Channel Funnel (MCF) report shows how your marketing channels work together: how many conversions occurred and through which channel combinations.
For example, an MCF report might show that the path Organic Search > Direct has 11 conversions. This means users first interacted with your product via organic search and later came to the website directly to make a purchase. With the help of this report, you can easily analyze your top conversion paths and improve your funnels.
Social Media Analytics comprises gathering data from social media platforms and analyzing it to derive information for business decisions. It provides powerful customer insight by uncovering sentiment across online sources. Social media analytics helps you predict customers’ behavior, discover patterns and trends, and make quick decisions to improve your online reputation. It also lets you identify the primary influencers within specific network channels. Some popular social media analytics tools are discussed below.
It is a free tool that lets you add social media results to your analysis report. You get to know what is being said about your business: how many people interacted with your website through social media, and how many liked and shared your content.
SumAll
It combines Twitter, Facebook, and Google Plus into one dashboard to give you an overall view of what people are saying about you on social media.
Facebook Insights
Twitter Analytics
Twitter Analytics shows how many impressions each tweet received, what your engagement status is, and when you were at your peak.
E-commerce Analytics
Business owners need to survive and thrive among tough competition. They
have to make big decisions in order to survive in the market. This is where web analytics plays a critical role: it helps you understand where your business stands and lets you boost e-commerce sales, generate leads, and enhance brand awareness.
Mobile Analytics
Mobiles have emerged as one of the most significant tools of the past two decades. They have changed the way people communicate and innovate, and this has led to marketing driven by mobile apps.
Mobile apps have proved easy to access and engaging. Webmasters and online businesses need to leverage mobile apps to reach users effectively. Once you have built a mobile app, you’ll need to acquire new users, engage with them, and earn revenue. For this, you need mobile analytics, which helps marketers measure their apps better, for example −
How to prioritize
A/B Testing
For example, e-commerce websites use A/B testing on products to discover which product has the potential to earn more revenue. A second example is an AdWords campaign manager running two ads for the same campaign in order to learn which of them works better.
A/B testing allows you to extract more out of your existing traffic. You can run A/B tests on headlines, ads, calls to action, links, images, landing pages, and more.
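Evaluating an A/B test usually comes down to comparing conversion rates and checking whether the difference is statistically significant. The hedged sketch below uses a two-proportion z-test from statsmodels; the counts are made-up.

```python
# Sketch of A/B test evaluation with a two-proportion z-test (numbers are invented).
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]        # variant A, variant B
recipients  = [10_000, 10_000]  # sample size per variant

stat, p_value = proportions_ztest(conversions, recipients)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant; roll out the winning variant.")
else:
    print("No significant difference yet; keep collecting data.")
```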
Annotation
With the help of annotations, we can record which tasks were done on which date directly in Google Analytics. Suppose a Google search update arrived on 21 March; we can then annotate 21 March as "Google update." Annotations help us find the impact of such changes.
Let’s assume we have the following data available for an ecommerce company −
[Table: budget spent and revenue by country − USA, UK, Canada, Australia, China, and India.]
Actionable Points
The USA is the highest revenue-generating country; increase the budget for the USA.
India has high potential. If we double the budget, we can generate good revenue from India.
China is doing well. We can increase the budget for China too.
Canada and Australia need improvement. Try the next segment; if you find the same results there, stop spending money in those markets.
Direct Traffic − Traffic coming directly on your website by clicking on your
website’s link or typing the URL of your website in the address bar.
Goal − A metric that defines the success rate, e.g., sale or sign-up.
Landing Page − The first page from where a visitor enters your website.
New Visitor − The visitor who is coming to your website for the first time.
Organic Traffic − Traffic for which you need not pay. It comes naturally, e.g.,
traffic from search engines.
Paid Traffic − Traffic for which you need to pay, e.g., Google AdWords.
Returning Visitor − The visitors who have already visited your page earlier.
Returning visitors are an asset for any website.
Time on Site − The average time a visitor spends on your site in a single session.
Tracking Code − A small snippet of code inserted into the HTML of a page. This code captures information about visits to the page.
It provides web traffic information, such as your website’s visitors or users at any
given time, the time they spend on the site, where the traffic comes from, and how
visitors interact. Analytics provides easy-to-understand data, helping your company
know what produces the best results and how to invest resources more effectively.
The bounce rate is the percentage of visitors leaving your site without interacting with it. A high bounce rate may show that your content is not engaging or does not match search intent. It is an important parameter that can help you improve user experience and, consequently, increase conversion rate.
4. Data-Driven Decisions
According to a recent survey, highly data-driven companies are three times more
likely to report massive decision-making improvements than organizations that rely
less on data.
Great businesses thrive on high-quality decisions. And the only way to make
decisions that drive business success is to base decisions on data. For example,
it is difficult (if not impossible) to do what your target audience wants if you know
little or nothing about their needs. Off-site and on-site analytics help you discover
what your ideal potential customer is searching for. This enables you to position your
business to attract them.
5. Competitive Edge
Web analytics provides information on your business’s performance and lets you
peep into your competitors' actions. Such intelligence makes outsmarting your
competitors easier. For instance, analytics helps uncover web content gaps, which
opens your company up to opportunities your competitors are missing out on.
Before recent advances in analytics technology, market research was highly costly and time-consuming, and brands rarely had access to detailed, personalized insights. Web analytics solutions have streamlined market research. Today, you can do thorough market research with minimal investment.
At first glance, marketing and big data might seem like an odd pair, but they’re
actually very complementary. Marketing is all about effectively reaching different
audiences, and data tells us what’s working and what’s not.
These days, there’s more available data than most businesses and marketing teams
know what to do with, which can lead to new opportunities if that data is accurately
interpreted and effectively deployed.
Customer engagement. Big data can deliver insight into not just who your
customers are, but where they are, what they want, how they want to be
contacted and when.
Customer retention and loyalty. Big data can help you discover what influences customer loyalty and what keeps customers coming back again and again.
Three types of big data that are a big deal for marketing
Customer: The big data category most familiar to marketing may include
behavioural, attitudinal and transactional metrics from such sources as marketing
campaigns, points of sale, websites, customer surveys, social media, online
communities and loyalty programs.
Operational: This big data category typically includes objective metrics that measure
the quality of marketing processes relating to marketing operations, resource
allocation, asset management, budgetary controls, etc.
Five benefits of using Big Data in Marketing
1. Effective predictive modeling
By analyzing customer data, analysts can predict which customers are most likely to
purchase in the future.
2. Better personalization
By analyzing customer data, marketing teams can personalize messages and offers
based on each customer’s preferences and behavior.
The benefit: This personalization can lead to increased customer engagement and
loyalty.
The benefit: Targeting only valuable customers allows companies to get the most
out of their marketing efforts.
Analysts can identify which customers are at risk of leaving by analyzing customer
behavior and preferences.
The benefit: Marketers can reduce customer churn by targeting these customers
with personalized offers and messages.
The benefit: Companies can improve customer experience and loyalty by making
changes based on this information.
CUSTOMER ENGAGEMENT AND RETENTION
Big data regarding customers provides marketers details about user demographics,
locations, and interests, which can be used to personalize the product experience
and increase customer loyalty over time.
Big data solutions can help organize data and pinpoint which marketing campaigns,
strategies or social channels are getting the most traction. This lets marketers
allocate marketing resources and reduce costs for projects that aren’t yielding as
much revenue or meeting desired audience goals.
Big data can also compare prices and marketing trends among competitors
to see what consumers prefer. Based on average industry standards,
marketers can then adjust product prices, logistics and other operations to
appeal to customers and remain competitive.
Challenges
The challenges related to the effective use of big data can be especially daunting for
marketing. That's because most analytics systems are not aligned to the marketing
organisation’s data, processes and decisions. For marketing, three of the biggest
challenges are:
Knowing what data to gather. Data, data everywhere. You have enormous
volumes of customer, operational and financial data to contend with. But
more is not necessarily better – it has to be the right data.
Knowing which analytical tools to use. As the volume of big data grows, the
time available for making decisions and acting on them is shrinking. Analytical
tools can help you aggregate and analyse data, as well as allocate relevant
insights and decisions appropriately throughout the organisation – but which
ones?
Knowing how to go from data to insight to impact. Once you have the
data, how do you turn it into insight? And how do you use that insight to
make a positive impact on your marketing programs?
1. Use big data to dig for deeper insight. Big data affords you the opportunity
to dig deeper and deeper into the data, peeling back layers to reveal richer
insights. The insights you gain from your initial analysis can be explored
further, with richer, deeper insights emerging each time. This level of insight
can help you develop specific strategies and actions to drive growth.
2. Get insights from big data to those who can use it. There’s no debating it –
CMOs need the meaningful insights that big data can provide; but so do
front-line store managers, and call centre phone staff, and sales associates,
and so on and so on. What good is insight if it stays within the confines of the
board room? Get it into the hands of those who can act on it.
3. Don’t try to save the world – at least not at first. Taking on big data can at
times seem overwhelming, so start out by focusing on a few key objectives.
What outcomes would you like to improve? Once you decide that, you can
identify what data you would need to support the related analysis. When
you’ve completed that exercise, move on to your next objective. And the next.
5. Customer Journey Analysis: Big data enables marketers to map and analyze
the entire customer journey across multiple touchpoints and channels. By
understanding the customer journey, marketers can identify pain points,
optimize user experiences, and implement strategies to guide consumers
through the sales funnel more effectively.
6. Social Media Listening: Big data tools allow marketers to monitor social
media conversations, sentiment, and trends in real-time. Social media listening
provides valuable insights into consumer opinions, brand perception, and
emerging topics, allowing marketers to engage with their audience proactively
and respond to feedback promptly.
Customers leave a footprint every time they interact with our digital marketing
content and the apps we build.
Use these breadcrumbs to dig into the reasons why certain leads turned into
customers.
With big data analytics tools, you can sift through vast amounts of data to discover which lever was the crucial differentiator between a lead and a customer. Was it your digital marketing target audience, a novel communication or referral channel, the copywriting used, the choice of visuals, specific demographics, or something else? Knowing the answer helps you focus your energies on what actually works, turning eyeballs into paying customers.
This segmentation allows for highly personalized marketing campaigns that deliver
relevant content to the right audience, increasing engagement and conversions.
1. Collect and unify your customer data, including web/app interactions, past marketing campaigns, and CRM data with Keboola. Just a few clicks and you’re ready for the next step.
2. Build customer segments from the unified data (for example, by behavior, demographics, or purchase history), as sketched below.
3. Feed the segments back into the communication app of your choice. Keboola can do this with a couple of clicks.
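For illustration only, the segmentation step could look like the k-means sketch below; the feature names are assumptions, and plain pandas/scikit-learn stands in here for the Keboola-specific tooling.

```python
# Illustrative customer segmentation with k-means (feature names are placeholders).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

customers = pd.read_csv("unified_customers.csv")
features = customers[["sessions_last_90d", "orders_last_90d", "avg_order_value"]]

scaled = StandardScaler().fit_transform(features)     # put features on comparable scales
customers["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)

# Each segment can now be synced back to the email/ads tool for tailored campaigns.
print(customers.groupby("segment")[features.columns].mean())
```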
For example, by combining your stock levels with a demand forecast, you can
automate the promotion of in-stock items that are predicted to surge in demand to
speed up product movement out of your warehouse.
Big data helps you identify factors contributing to customer churn and pinpoint the
specific customers who are most likely to jump ship.
Armed with this knowledge, marketers can design targeted customer loyalty
programs and retention strategies to retain valuable customers.
Analyze market trends and competitor pricing with big data to optimize your pricing
decisions for enhanced competitiveness and profitability.
Here’s an example. Olfin Car is a leading car dealer in the Czech Republic with
additional services in the field of financing, authorized car service, and insurance.
With Keboola, Olfin Car was able to automate data collection of all the product
offerings and pricing points across its competitors. By using advanced pricing algorithms, Olfin Car optimized the pricing of its products and services, which led to a 760% increase in revenues in a single quarter.
Big data tools can analyze customer sentiment and feedback from various sources,
such as social media and reviews.
This sentiment analysis helps businesses understand how customers perceive their
brand and products, enabling them to address concerns and reinforce positive
experiences at scale.
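As a small, hedged example of how such sentiment scoring might be done in code, the sketch below uses NLTK's VADER model; the review strings are invented.

```python
# Sentiment-analysis sketch over customer reviews using NLTK's VADER lexicon.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

reviews = [
    "Delivery was fast and the product works great!",
    "Terrible support, I waited a week for a reply.",
]
for review in reviews:
    score = analyzer.polarity_scores(review)["compound"]   # -1 (negative) to +1 (positive)
    label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
    print(f"{label:>8}  {score:+.2f}  {review}")
```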
Leveraging big data for A/B testing allows marketers to compare the performance of
different marketing strategies and identify the most effective approaches in a much
shorter time.
A/B testing showcase: A subject line worth its weight in gold - or 2 million dollars, to
be precise.
Obama’s presidential campaign will go down as one of the most successful A/B tests
in history.
What did the marketing team do? They rolled out different emails to a smaller batch
of their email list first, testing the impact of different subject lines. The champion was
then sent to the rest of the list.
The result? The top-performing email left the underperformer in the dust, bringing in an extra 2 million dollars in donations.
Understanding the customer journey through big data analysis enables marketers to
streamline touchpoints and improve overall customer experience.
In one agency example, careful analysis of how different channels and pathways interact with each other and which customer segments tend to convert helped a client save 30% on marketing costs while increasing acquisition, by showing which conversion paths work best.
Use all of the aforementioned big data practices together to build products that address your customers with delightful messages.
Rohlik, the e-commerce unicorn, uses Keboola and real-time machine learning
algorithms, to identify the food items that will expire soon, discount them, and
automatically advertise them to price-conscious consumers (discovered through
customer segmentation).
This end-to-end automated marketing initiative helps Rohlik reduce food waste while
addressing the needs of a targeted customer profile.
Big data analytics is changing the way companies prevent fraud. AI, machine learning,
and data mining technologies are being used in tandem to counteract the hydra of
fraud attempts impacting more than 3 billion identities each year.
In summary, big data analytics techniques can help identify patterns of fraudulent
activity and provide actionable reports used to monitor and prevent fraud—for
businesses of all sizes. Here’s how.
What is Fraud Detection and Prevention?
Fraudulent activities can encompass a wide range of cases, including money
laundering, cybersecurity threats, tax evasion, fraudulent insurance claims, forged
bank checks, identity theft, and terrorist financing, and is prevalent throughout financial institutions, government, healthcare, the public sector, and insurance.
Detecting fraud with data analytics, fraud detection software and tools, and a fraud
detection and prevention program enables organizations to predict conventional
fraud tactics, cross-reference data through automation, manually and continually
monitor transactions and crimes in real time, and decipher new and sophisticated
schemes.
Fraud detection and prevention software is available in both proprietary and open
source versions. Common features in fraud analytics software include: a dashboard,
data import and export, data visualization, customer relationship management
integration, calendar management, budgeting, scheduling, multi-user capabilities,
password and access management, Application Programming Interfaces (API), two-
factor authentication, billing, and customer database management.
Time-series analysis
AI techniques include:
Data mining - data mining for fraud detection and prevention classifies and segments data so that millions of transactions can be searched to find patterns and detect fraud
Neural networks - suspicious patterns are learned and used to detect further
repeats
The four most crucial steps in the fraud prevention and detection process include:
Capture and unify all manner of data types from every channel and
incorporate them into the analytical process.
Incorporate analytics culture into every facet of the enterprise through data
visualization.
Use analytical techniques to detect anomalies that may indicate credit card fraud, identity theft, insurance fraud, and other possible crimes.
Similarly, risk analytics uses data analytics to identify, assess, and manage risks. The
process includes collecting and analyzing large amounts of data to identify potential
risks, assessing the likelihood and impact of those risks, and developing a strategy to
mitigate the highest-priority risks.
The biggest advantage of big data analytics, and fraud and risk analytics accordingly,
is that it facilitates the use of large and complex data. Faster decisions can be made
in real time using data analytics techniques. Ultimately, thanks to big data analytics, a
company can better understand customer requests and flag those it deems
suspicious.
Big data analytics can help prevent fraud using a variety of techniques, such as data
mining, machine learning, and anomaly detection. For example, data mining can be
used to identify patterns of fraudulent activity, such as using stolen credit card
numbers or making multiple small payments in a short period of time. Machine
learning can be used to build models that can automatically detect fraudulent activity.
Anomaly detection, such as device intelligence, can identify when a malicious bot,
fraudster, or other bad actor is present on your site.
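One common way to implement the anomaly-detection piece, sketched here purely for illustration, is an Isolation Forest over transaction features; the column names and contamination rate are assumptions that would be tuned on real data.

```python
# Sketch of transaction anomaly detection with an Isolation Forest.
import pandas as pd
from sklearn.ensemble import IsolationForest

transactions = pd.read_csv("transactions.csv")
features = transactions[["amount", "hour_of_day", "merchant_risk_score", "tx_per_hour"]]

model = IsolationForest(contamination=0.01, random_state=0)   # expect roughly 1% anomalies
transactions["anomaly"] = model.fit_predict(features)          # -1 = anomalous, 1 = normal

suspicious = transactions[transactions["anomaly"] == -1]
print(f"{len(suspicious)} transactions flagged for manual review")
```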
More specifically, big data analytics can help prevent fraud in several concrete ways.
Capturing these benefits takes the right tools and implementation. Fraud.net offers
an award-winning fraud prevention platform to help digital businesses quickly detect
transactional anomalies and pinpoint fraud using artificial intelligence, big data, and
live-streaming visualizations.
Unrelated or Insufficient Data: The data from the transactions may come
from many different sources. In some cases, false results can be obtained in
fraud detection due to insufficient or irrelevant data. Detection can be based
on the inappropriate rules used in the algorithm. Because of this risk of failure,
companies may be hesitant to use big data analytics and machine learning.
High Costs: Big data analytics and fraud detection systems involve costs such as software and hardware, the components used to keep these systems running, and the time spent.
Data Security: While processing data and making decisions with an analytics system, the security of the data itself is a problem to be considered; data security should be checked throughout the process.
6. Data Integration and Fusion: Big data analytics integrates data from multiple
sources, including internal transactional data, external data feeds, social
media, and third-party data sources. By correlating and fusing diverse data
sources, organizations gain a more comprehensive view of fraud risks and can
identify potential red flags more effectively.
3. Manufacturing
4. Healthcare
5. Retail
6. Energy
7. Insurance
9. Technology
10. Construction
Risk and big data
What are the risks of big data?
While it’s easy to get caught up in the opportunities big data offers, it’s not
necessarily a cornucopia of progress. If gathered, stored, or used wrongly, big data
poses some serious dangers. However, the key to overcoming these is to understand
them. So let’s get ahead of the curve.
Broadly speaking, the risks of big data can be divided into four main categories:
security issues, ethical issues, the deliberate abuse of big data by malevolent players
(e.g. organized crime), and unintentional misuse.
Risks Associated With Big Data
While big data can provide valuable insights, it also presents significant risks, particularly for startups that lack the resources to invest in robust cybersecurity measures. These risks include:
Legal liabilities: A startup failing to protect customers' data may be liable for
legal damages. In addition, data breaches sometimes result in class-action
lawsuits, which can be costly and time-consuming to defend.
1. Data Privacy
When companies collect big data, the first risk that comes with it is data privacy. This sensitive data is the backbone of many big companies, and if it leaks into the wrong hands, such as cybercriminals or hackers, it can badly affect the business and its reputation. In 2019, 4.1 billion records were exposed through data breaches, according to the Risk-Based Security Mid-Year Data Breach Report.
So businesses should mainly focus on protecting their data’s privacy and security
from malicious attacks.
Big data cannot be stored casually; companies need to manage big servers to hold this crucial information and protect it from the outside world. It is a challenging and risky process, but businesses need it to keep their big data protected.
Various companies are adapting new privacy regulations to protect their database.
Recently, many hackers have been attacking giant companies to steal their data for
monetary benefits.
This clearly shows that the more data a company holds, the greater the chances of malicious attacks. Companies must therefore ensure the security of the data with high-level encryption.
2. Cost Management
Big data requires big costs for its maintenance, and companies should calculate the costs of collecting, storing, analyzing, and reporting big data. All companies need to budget and plan well for maintaining big data.
If companies don’t plan for the management, they may face unpredictable costs,
which can affect the finances. The best way to manage big data costs is by
eliminating irrelevant data and analyzing the big data to find meaningful insights and
solutions to achieve their goals.
3. Unorganized Data
Big Data is not just information that can be stored on one computer; it is a collection of structured, semi-structured, and unstructured data from different sources that can run to zettabytes in size. To store big data, companies need large server capacity where all the data is stored, processed, and analyzed.
Companies therefore need to plan for the storage space big data requires; otherwise, it can become a complex issue. Nowadays, companies leverage the power of cloud-based services to store data and make access easy and secure.
It is estimated that the amount of data generated by users each day will reach 463 exabytes worldwide, according to the World Economic Forum. The main aim of big data is to analyze it and find meaningful information that helps businesses make the right decisions and innovations. If an organization does not have a proper analysis process, big data is just useless clutter.
Analysis is what makes big data important, so companies should hire skilled data analysts and use software that helps analyze big data and surface meaningful insights.
Thus, before planning to work on big data, each business, from small to enterprise-
level, should hire professional analysts and use powerful technologies to analyze big
data.
One key risk of acquiring big data is that organizations may end up with poor-quality, irrelevant, or out-of-date databases that will not help their business find anything meaningful.
Many challenges arise while analyzing big data, and organizations must be prepared for them: eliminate irrelevant data and focus on analyzing relevant data to obtain meaningful insights.
7. Deployment Process
Deployment is a core organizational process: collecting and analyzing big data and delivering meaningful insights within a set time period. Companies have two options for data deployment; the first is an in-house deployment process, in which big data is collected and analyzed internally to find meaningful insights, but this approach takes a good amount of time.
In the fintech industry, big data helps identify opportunities to provide efficient and sustainable financial services. Overall, risk management is the process of identifying and controlling threats to a company's well-being and finding ways to minimize those threats.
Fraud Identification
Credit Management
Anti-Money Laundering
On the other hand, big data analytics helps improve existing processes by applying advanced statistical analysis to structured data and statistical text mining to unstructured data. It generates real-time, actionable insights and stops money laundering in its tracks.
organizations can identify emerging risks and take proactive measures to
mitigate them before they escalate.
4. Credit Risk Assessment: In the financial industry, big data analytics is used
for credit risk assessment and scoring. By analyzing borrower data, credit
history, payment behavior, and other relevant factors, lenders can assess the
creditworthiness of applicants and make informed decisions about lending
risks.
Here are some common issues and dangers of big data, along with their solutions.
1. Data Storage
Problem: When businesses plan to store big data, the first problem they face is storage space. Many companies leverage cloud storage, but because the data is accessible online, there is a risk of security issues, so some companies prefer to own physical servers to host their databases.
One major data storage issue was faced by Amazon in 2017, when AWS cloud storage filled up and lacked the space to run even basic operations; Amazon later resolved the issue and expanded its storage to prevent the problem from recurring.
Solution: Companies should store their sensitive data in an on-premises database, while less sensitive data can be kept in cloud storage. Remaining security issues can be addressed by hiring cybersecurity experts. This may increase an organization's costs, but the value of the data makes it worthwhile.
2. Fake Data
Problem: Another big data issue many organizations face is fake databases. When collecting data, companies need a relevant database that can be analyzed to generate meaningful insights. Irrelevant or fake data wastes the effort and cost any organization puts into analyzing it.
In 2016, Facebook faced a fake data problem because its algorithms could not distinguish real news from fake news, which fueled political controversy, according to Vox.
3. Data Access Control
Problem: When users gain the ability to view, edit, or remove data, it can affect business operations and privacy.
Here's an example:
Netflix reported the loss of 200,000 subscribers in Q1 2022 because users were sharing their login details with friends and family to sign in on the same account. Netflix later took charge and limited data access to a restricted number of users per account.
Solution: The solution is to use Identity and Access Management (IAM), which simplifies control over data through identification, authentication, and authorization. By following ISO standards, organizations can govern access through IAM.
4. Data Poisoning
Problem: Nowadays, almost every website hosts a chatbot, and hackers target the machine learning models behind them. This leads to data poisoning, in which an organization's training data is manipulated or injected with malicious records.
Solution: The best way to address this issue is outlier detection, which helps separate injected records from the existing data distribution.
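To make the outlier-detection idea concrete, here is a minimal sketch in Python using NumPy; the message-length values, the robust (median/MAD) z-score approach, and the 3.5 threshold are illustrative assumptions, not a description of any particular product.

    import numpy as np

    def flag_outliers(values, threshold=3.5):
        """Flag points whose robust (median/MAD-based) z-score exceeds the threshold."""
        values = np.asarray(values, dtype=float)
        median = np.median(values)
        mad = np.median(np.abs(values - median))
        if mad == 0:
            return np.zeros(len(values), dtype=bool)   # no spread at all: flag nothing
        robust_z = 0.6745 * np.abs(values - median) / mad
        return robust_z > threshold

    # Example: a feed of chatbot message lengths with two injected extremes
    feed = np.array([102, 98, 110, 95, 105, 99, 5000, 101, 97, 4800])
    mask = flag_outliers(feed)
    print("kept:", feed[~mask])      # records consistent with the existing distribution
    print("flagged:", feed[mask])    # likely poisoned records

More robust techniques (for example, isolation forests) are used in practice, but the principle is the same: records that fall far outside the learned distribution are quarantined before they can poison the model.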
Key Takeaways
Credit risk management refers to managing the probability of a company’s
losses if its borrowers default in repayment.
It is one of the important tools any lending company needs to survive in the long term, since without proper mitigation strategies it is very difficult to stay in the lending business as non-performing assets (NPAs) and defaults rise.
Every bank/NBFC has a separate department that takes care of the quality of its portfolios and customers by framing appropriate risk-mitigation techniques.
What is Credit Risk Management?
Credit risk management refers to the process of assessing and mitigating the
potential risks associated with lending money or extending credit to individuals or
businesses. At its core, it’s about ensuring that borrowers are reliable and will fulfill
their repayment obligations.
Credit risk management also involves setting appropriate interest rates and credit
limits, as well as monitoring and managing the loan portfolio to identify and address
potential risks. Effective credit risk management helps businesses protect themselves
against financial losses and ensure the overall stability and profitability of their
business.
2. Credit Risk Assessment: Credit risk assessment involves evaluating the
creditworthiness of borrowers or counterparties to determine the likelihood of
default or non-payment. This process typically includes analyzing financial
statements, credit reports, credit scores, payment histories, collateral, and
other relevant factors. Credit risk assessment helps organizations make
informed decisions about whether to extend credit, set credit limits, or
approve loan applications.
3. Credit Scoring and Modeling: Credit scoring and modeling techniques are
used to quantify and predict credit risk based on historical data and statistical
analysis. Credit scoring models assign numerical scores to borrowers based on
their creditworthiness, probability of default, and risk profile. These models
help organizations automate credit decisions, streamline underwriting
processes, and assess risk consistently across different applicants.
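As a hedged illustration of credit scoring and modeling, here is a minimal sketch in Python using scikit-learn; the borrower features, the toy training data, and the 300-900 score mapping are assumptions made for the example, not an actual scorecard.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy historical data: [income (lakhs), existing debt (lakhs), late payments last year]
    X = np.array([[12, 1, 0], [6, 4, 3], [20, 2, 0], [4, 3, 5],
                  [15, 6, 1], [8, 1, 0], [5, 5, 4], [18, 3, 0]])
    y = np.array([0, 1, 0, 1, 0, 0, 1, 0])   # 1 = defaulted, 0 = repaid

    model = LogisticRegression().fit(X, y)

    def credit_score(applicant):
        """Map the predicted probability of default onto an illustrative 300-900 scale."""
        p_default = model.predict_proba([applicant])[0, 1]
        return int(900 - 600 * p_default)

    print(credit_score([10, 2, 1]))   # higher score = lower predicted default risk

A real scoring model would be trained on far more history and variables, validated carefully, and calibrated to the lender's risk appetite; the sketch only shows the basic mechanics of turning historical repayment data into a score.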
Credit risk management is really important for keeping the financial system stable.
But, it’s not easy and there are some major challenges that come with it. These
challenges are a big part of financial operations and need to be watched carefully.
In this section, we will talk about the major challenges finance professionals face
when they try to manage credit risk.
1. Data quality and accessibility: Data quality plays a crucial role in credit risk
evaluation. But most of the time the data available is not very reliable or easy
to get. Incomplete or inaccurate data can compromise the decision-making
process, necessitating robust strategies to ensure data integrity.
6. Human factors: The human element introduces its own set of challenges.
Misjudgments, communication breakdowns, or ethical lapses can inject
unpredictability into credit risk management, underlining the importance of
strong internal controls.
However, there is more to credit risk management in banks than deciding whether to lend money to an applicant. To manage credit risk well, banks and other lending institutions can check the data sources they take information from and validate their reliability. In addition, institutions can involve a third-party entity to assess whether the models and measures adopted for credit management are appropriate. Such a review helps identify weaknesses, leading to improvements in the framework.
A third-party unit is best placed to assess the entire system without bias. It monitors the active models and suggests changes based on its findings. These entities use the most dynamic datasets to conduct their studies and reach valid conclusions. In addition, they help deploy advanced technology, such as artificial intelligence and machine learning, to make risk management more efficient and accurate. As a result, institutions can manage credit risk effectively and stay prepared for emerging financial crime.
For individual borrowers, character refers to their personal traits and credit history. It
encompasses factors such as their reliability, integrity, and creditworthiness.
2. Capacity
Capacity refers to the borrower’s capability to assume and fulfill their debt
obligations. It encompasses the ability of both retail and commercial borrowers to
handle their debt.
3. Capital
In the case of personal borrowers who may lack an extensive credit history, it is
important to consider if a parent or family member could provide a guarantee to
support their loan application.
4. Collateral security
In structuring loans to mitigate credit risk, collateral security plays an important role. It is of utmost importance to thoroughly assess the value of the assets, their physical location, and the ease of transferring ownership, and to determine appropriate loan-to-value ratios (LTVs), among other factors.
5. Conditions
Conditions encompass the purpose of the credit, external circumstances, and various
factors in the surrounding environment that can introduce risks or opportunities for a
borrower. These factors may involve political or macroeconomic conditions, as well
as the current stage of the economic cycle.
ABC Bank is dedicated to assisting individuals in obtaining the necessary finances for
their specific needs. To ensure ease of repayment, the bank maintains low-interest
rates. As a result, loans are accessible to individuals from all segments of society,
provided they meet certain minimum criteria.
In the interest of fairness, the lenders employ an automated system that only accepts
loan applications that meet the necessary requirements. This enables effective credit
risk management by limiting loan options to individuals with a specified income
level.
Principles
The strategies can be many, but the basic ones must be incorporated to make the
credit risk management tools and framework effective. The first and foremost thing is
to have a proper setup to ensure a feasible environment for credit risk assessment.
There should be a proper protocol to follow, from assessing the measures to
approving them to reviewing them from time to time.
Effective Business Credit Management Best Practices
Most businesses extend credit without properly assessing the creditworthiness of the
customer even though they know it's very risky. If you are wondering why this
happens - the answer is very simple - salespersons are often in a hurry to onboard
customers faster to achieve their targets. They often pressure the finance teams to
extend credit without sufficient due diligence.
Keep in mind that effective credit risk management practices should be tailored to
the unique characteristics of each business. This includes identifying customers with a
history of frequent payment defaults and crafting a dynamic strategy to mitigate
credit risk. With that in mind, here are six of the most effective credit risk management best practices you need to know:
Introduce online credit application forms to make customer onboarding smoother and faster. Make all essential sections mandatory to avoid missing any critical information.
An online application makes it easier to gather and store data. Accurate and
complete customer information makes your credit risk analysis process more robust.
Your credit application must collect the following data:
Company information
Bank information
Terms of payment
Description of how disputes would be resolved
Data verification
You must consider two factors before you extend credit to your customer. First is the
creditworthiness of the customer, and the second is the impact on your cash flow if
the customer goes delinquent.
Before customer onboarding, review their payment history from financial institutions
and sources such as:
Banks
Current and historical data available on these sources help improve your credit
scoring accuracy. It also lets you identify the creditworthiness and the potential risk
posed by any new customers. This approach helps create a strong functional
structure for credit risk management and decision-making.
For example, if an existing customer is growing and they have strong financials, you
might want to consider increasing their credit limit to expand trade with them. But, if
an existing customer makes late payments to other vendors and shows signs of
delinquency, you might want to reach out to that customer and collect your payment
or modify payment terms at the earliest.
A decline in credit score
Bankruptcy
Relocation of business
A credit policy protects your business from financial risks and defaulting customers.
A well-defined credit policy allows you to make credit decisions quickly and set
payment terms. You must periodically review and update your credit policy to ensure
it meets changing market conditions and standards.
Collection process
Terms of sale
By 2009, high frequency trading firms were estimated to account for as much as 73%
of US equity trading volume.
Investment banks use algorithmic trading, which houses a complex mechanism for deriving investment decisions from insightful data. Algorithmic trading involves using complex mathematics to derive buy and sell orders for derivatives, equities, foreign exchange, and commodities at very high speed.
The core task in an algorithmic trading system is to estimate the risk-reward ratio of a potential trade and then trigger a buy or sell action. Risk analysts help banks define trading and implementation rules, and they estimate market risk from the variation in the value of the assets in a portfolio. Estimating the risk factors for a portfolio can involve billions of calculations. Algorithmic trading uses computer programs to automate trading actions with little human intervention.
The soul of algorithmic trading is its trading strategies, which are built on technical analysis rules, statistical methods, and machine learning techniques. The big data era has arrived, and although making use of big data in algorithmic trading is a challenging task, once the treasures buried in the data are dug out and put to use, there is huge potential to take the lead and make a great profit.
For example, even if the reaction time for an order is 1 millisecond (which is a lot
compared to the latencies we see today), the system is still capable of making 1000
trading decisions in a single second. Thus, each of these 1000 trading decisions
needs to go through the Risk management within the same second to reach the
exchange. You could say that when it comes to automated trading systems, this is
just a problem of complexity.
Another point which emerged is that since the architecture now involves automated
logic, 100 traders can now be replaced by a single automated trading system. This
adds scale to the problem. So each of the logical units generates 1000 orders and
100 such units mean 100,000 orders every second. This means that the decision-
making and order sending part needs to be much faster than the market data
receiver in order to match the rate of data.
Market Adapter (Data Feed)
Use of Statistics
Algorithmic trading is the current trend in the financial world and machine learning
helps computers to analyze at rapid speed. The real-time picture that big data
analytics provides gives the potential to improve investment opportunities for
individuals and trading firms.
Access to big data helps mitigate the probable risks of online trading and make more precise predictions. Financial analytics helps tie together the principles that affect trends, pricing, and price behavior.
Big data can be combined with machine learning, which helps in making decisions based on logic rather than estimates and guesses. The data can be reviewed, and applications can be developed that update information regularly to make accurate predictions.
Backtesting Strategy
All trading algorithms are designed to act on real-time market data and price
quotes. A few programs are also customized to account for company
fundamentals data like EPS and P/E ratios. Any algorithmic trading software
should have a real-time market data feed, as well as a company data feed. It should either be built into the system or provide an option to integrate easily with alternate sources.
Traders looking to work across multiple markets should note that each exchange might provide its data feed in a different format or protocol, such as TCP/IP, Multicast, or FIX.
Your software should be able to accept feeds of different formats. Another option
is to go with third-party data vendors like Bloomberg and Reuters, which
aggregate market data from different exchanges and provide it in a uniform
format to end clients. The algorithmic trading software should be able to process
these aggregated feeds as needed.
Latency.
This is the most important factor for algorithm trading. Latency is the time-delay
introduced in the movement of data points from one application to the other.
Consider the following sequence of events. It takes 0.2 seconds for a price quote
to come from the exchange to your software vendor’s data center (DC), 0.3
seconds from the data center to reach your trading screen, 0.1 seconds for your
trading software to process this received quote, 0.3 seconds for it to analyze and
place a trade, 0.2 seconds for your trade order to reach your broker, 0.3 seconds
for your broker to route your order to the exchange.
Total time elapsed = 0.2 + 0.3 + 0.1 + 0.3 + 0.2 + 0.3 = 1.4 seconds.
In today’s dynamic trading world, the original price quote would have changed
multiple times within this 1.4 second period. This delay could make or break your
algorithmic trading venture. One needs to keep this latency to the lowest possible
level to ensure that you get the most up-to-date and accurate information
without a time gap.
Latency has been reduced to microseconds, and every attempt should be made
to keep it as low as possible in the trading system. A few measures include having
direct connectivity to the exchange to get data faster by eliminating the vendor in
between; by improving your trading algorithm so that it takes less than 0.1+0.3 =
0.4 seconds for analysis and decision making; or by eliminating the broker and
directly sending trades to the exchange to save 0.2 seconds.
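As a back-of-the-envelope companion to the example above, the short Python sketch below adds up the per-hop delays from the 1.4-second chain and shows the effect of dropping the broker hop (the 0.2-second saving mentioned above); the hop labels are just descriptive names for this illustration.

    # Per-hop delays from the example above, in seconds
    hops = [
        ("exchange -> vendor data center", 0.2),
        ("data center -> trading screen",  0.3),
        ("software processes quote",       0.1),
        ("analysis and trade decision",    0.3),
        ("order -> broker",                0.2),
        ("broker -> exchange",             0.3),
    ]

    total = sum(delay for _, delay in hops)
    print(f"full round trip: {total:.1f} s")   # 1.4 s, as in the text

    # Dropping the broker hop (direct market access) saves 0.2 s, as noted above
    without_broker = total - dict(hops)["order -> broker"]
    print(f"without the broker hop: {without_broker:.1f} s")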
Most algorithmic trading software offers standard built-in trade algorithms, such
as those based on a crossover of the 50-day moving average (MA) with the 200-
day MA. A trader may like to experiment by switching to the 20-day MA with the
100-day MA. Unless the software offers such customization of parameters, the trader may be constrained by the fixed functionality of the built-ins. Whether buying or
building, the trading software should have a high degree of customization and
configurability.
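As a hedged illustration of the parameter customization discussed above, here is a minimal moving-average crossover sketch in Python using pandas and NumPy; the synthetic price series, the function name, and the default 50/200 windows (switched to 20/100 in the last line) are assumptions for the example, not a recommended strategy.

    import numpy as np
    import pandas as pd

    def crossover_signals(prices: pd.Series, fast: int = 50, slow: int = 200) -> pd.Series:
        """Return +1 while the fast MA is above the slow MA, -1 otherwise, 0 before warm-up."""
        fast_ma = prices.rolling(fast).mean()
        slow_ma = prices.rolling(slow).mean()
        signal = np.where(fast_ma > slow_ma, 1, -1)
        signal[slow_ma.isna()] = 0          # not enough history yet
        return pd.Series(signal, index=prices.index)

    # Synthetic random-walk price series just to exercise the function
    rng = np.random.default_rng(0)
    prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 600)))

    # Default 50/200 crossover, and a customized 20/100 variant as mentioned above
    print(crossover_signals(prices).tail())
    print(crossover_signals(prices, fast=20, slow=100).tail())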
MATLAB, Python, C++, JAVA, and Perl are the common programming languages
used to write trading software. Most trading software sold by the third-party
vendors offers the ability to write your own custom programs within it. This allows
a trader to experiment and try any trading concept he or she develops. Software
that offers coding in the programming language of your choice is obviously
preferred.
Plug-n-Play Integration.
Imagine if you’re a huge sovereign wealth fund placing a 100 million order on
Apple shares. Do you think there will be enough sellers at the price you chose?
And what do you think will happen to the share price before the order gets filled?
This is where an algorithm can be used to break up orders and strategically place
them over the course of the trading day. In this case, the trader isn’t exactly
profiting from this strategy, but he’s more likely able to get a better price for his
entry.
Arbitrage
There are tons of investment gurus claiming to have the best strategies based on
technical analysis, relying on indicators like moving averages, momentum,
stochastics and many more. Some automated trading systems make use of these
indicators to trigger a buy and sell order. Trades are initiated based on the
occurrence of desirable trends, which are easy and straightforward to implement
through algorithms without getting into the complexity of predictive analysis.
Using 50- and 200-day moving averages is a popular trend-following strategy.
Index funds have defined periods of rebalancing to bring their holdings to par
with their respective benchmark indices. This creates profitable opportunities for
algorithmic traders, who capitalize on expected trades that offer 20 to 80 basis
points profits depending on the number of stocks in the index fund just before
index fund rebalancing. Such trades are initiated via algorithmic trading systems
for timely execution and the best prices.
Proven mathematical models, like the delta-neutral trading strategy, allow trading
on a combination of options and the underlying security. (Delta neutral is a
portfolio strategy consisting of multiple positions with offsetting positive and
negative deltas — a ratio comparing the change in the price of an asset, usually a
marketable security, to the corresponding change in the price of its derivative —
so that the overall delta of the assets in question totals zero.)
A mean reversion strategy is based on the concept that the highs and lows of an asset's price are temporary and that the price periodically reverts to its mean (average) value. Identifying and defining a price range, and implementing an algorithm based on it, allows trades to be placed automatically when the price of an asset breaks in and out of its defined range.
close to the average price between the start and end times thereby minimizing
market impact.
Until the trade order is fully filled, this algorithm continues sending partial orders
according to the defined participation ratio and according to the volume traded
in the markets. The related “steps strategy” sends orders at a user-defined
percentage of market volumes and increases or decreases this participation rate
when the stock price reaches user-defined levels.
Implementation Shortfall
Best Execution: Trades are often executed at the best possible prices.
Low Latency: Trade order placement is instant and accurate (there is a high
chance of execution at the desired levels). Trades are timed correctly and instantly
to avoid significant price changes.
No Human Error: Reduced risk of manual errors or mistakes when placing trades. It also negates human traders' tendency to be swayed by emotional and psychological factors.
Disadvantages
There are also several drawbacks or disadvantages of algorithmic trading to consider:
Latency: Algorithmic trading relies on fast execution speeds and low latency,
which is the delay in the execution of a trade. If a trade is not executed quickly
enough, it may result in missed opportunities or losses.
Computer-programming knowledge to program the required trading strategy,
hired programmers, or pre-made trading software.
Access to market data feeds that will be monitored by the algorithm for
opportunities to place orders.
The ability and infrastructure to backtest the system once it is built before it
goes live on real markets.
Due to the one-hour time difference, the AEX opens an hour earlier than the LSE; both exchanges then trade simultaneously for the next few hours, and trading continues only on the LSE during the final hour as the AEX closes.
Can we explore the possibility of arbitrage trading on the Royal Dutch Shell stock
listed on these two markets in two different currencies?
Requirements:
Order-placing capability that can route the order to the correct exchange.
Read the incoming price feed of RDS stock from both exchanges.
Using the available foreign exchange rates, convert the price of one currency
to the other.
If there is a large enough price discrepancy (after discounting the brokerage costs) to create a profitable opportunity, the program should place the buy order on the lower-priced exchange and the sell order on the higher-priced exchange.
If the orders are executed as desired, the arbitrage profit will follow.
Simple and easy! However, the practice of algorithmic trading is not that simple to
maintain and execute. Remember, if one investor can place an algo-generated trade,
so can other market participants. Consequently, prices fluctuate in milli- and even
microseconds. In the above example, what happens if a buy trade is executed but the
sell trade does not because the sell prices change by the time the order hits the
market? The trader will be left with an open position making the arbitrage strategy
worthless.
There are additional risks and challenges such as system failure risks, network
connectivity errors, time-lags between trade orders and execution and, most
important of all, imperfect algorithms. The more complex an algorithm, the more
stringent backtesting is needed before it is put into action.
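To make the cross-listing arbitrage check described above concrete, here is a minimal sketch in Python; the prices, the exchange-rate figure, the cost allowance, and the function name are illustrative assumptions rather than real market data or a broker API.

    def check_arbitrage(price_lse_gbp, price_aex_eur, gbp_per_eur, cost_gbp):
        """Compare the same stock's price on two exchanges in a common currency (GBP)."""
        aex_in_gbp = price_aex_eur * gbp_per_eur
        spread = aex_in_gbp - price_lse_gbp
        if spread > cost_gbp:
            return ("BUY on LSE", "SELL on AEX", spread - cost_gbp)
        if -spread > cost_gbp:
            return ("BUY on AEX", "SELL on LSE", -spread - cost_gbp)
        return None   # discrepancy too small once costs are discounted

    # Illustrative quotes only (not real data)
    decision = check_arbitrage(price_lse_gbp=25.40, price_aex_eur=29.80,
                               gbp_per_eur=0.86, cost_gbp=0.05)
    print(decision)

As the passage notes, both legs must actually fill at the quoted prices for the profit to be realized; in practice the quotes can move before the orders reach the exchanges.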
In health care, big data is generated by various sources and analyzed to guide
decision-making, improve patient outcomes, and decrease health care costs, among
other things. Some of the most common sources of big data in health care include
electronic health records (EHR), electronic medical records (EMRs), personal health
records (PHRs), and data produced by widespread digital health tools like wearable
medical devices and health apps on mobile devices.
Big data in healthcare is a term used to describe the massive volumes of information created by the adoption of digital technologies that collect patient records and help manage hospital performance; this information is otherwise too large and complex for traditional technologies.
The application of big data analytics in healthcare has many positive, and even life-saving, outcomes. In essence, big data refers to the vast quantities of information created by the digitization of everything, which is consolidated and analyzed by specific technologies. Applied to healthcare, it uses the specific health data of a population (or of a particular individual) to potentially help prevent epidemics, cure diseases, cut costs, and so on.
Now that we live longer, treatment models have changed, and many of these changes are driven largely by data. Doctors want to understand as much as they can about a person, and as early in their life as possible, so they can pick up warning signs of serious illness as they arise; treating any disease at an early stage is far simpler and less expensive. By utilizing key performance indicators in healthcare and healthcare data analytics, prevention becomes better than cure, and drawing a comprehensive picture of a patient lets insurers provide a tailored package. This is the industry's attempt to tackle the data-silo problem: bits and pieces of a patient's data are collected and archived in hospitals, clinics, surgeries, and so on, without any proper way to communicate between them.
That said, the number of sources from which health professionals can gain insights about their patients keeps growing. This data normally arrives in different formats and sizes, which presents a challenge to the user. However, the current focus is no longer on how "big" the data is but on how smartly it is managed. With the right technology, data can be extracted from the following sources in the healthcare industry in a smart and fast way:
Patient portals
Research studies
EHRs
Wearable devices
Search engines
Generic databases
Government agencies
Payer records
Staffing schedules
Indeed, for years gathering huge amounts of data for medical use has been costly
and time-consuming. With today’s always-improving technologies, it becomes easier
not only to collect such data but also to create comprehensive healthcare
reports and convert them into relevant critical insights that can then be used to
provide better care. This is the purpose of healthcare data analysis: using data-driven
findings to predict and solve a problem before it is too late, but also assess methods
and treatments faster, keep better track of inventory, involve patients more in their
own health, and empower them with the tools to do so.
By pairing the big data produced by EHRs with advanced analytics techniques like machine learning, medical researchers can create predictive models with various applications, such as predicting post-surgical complications, heart failure, or substance abuse.
Practice telemedicine
The impact of big data in health care is huge, and the market has grown to match it.
According to research conducted by Allied Market Research in 2019, for example, the
North American market value for big data analytics in health care is projected to
reach $34.16 billion by 2025, several times higher than its $9.36 billion valuation in
2017 [4]. Just as big data lays the foundation for big advances in health care, it has
also drawn investment for further growth.
Professionals in health care use big data for a wide range of purposes – from
developing insights in biomedical research to providing patients with personalized
medicine. Here are just some of the ways that big data is used in health care today:
Enhancing security surrounding the processing of sensitive medical data, such
as insurance claims and medical records.
Smarter treatment plans: Analyzing the treatment plans that helped patients
(and those that didn’t) can help researchers create even better treatment plans
for future patients.
Reduced health care costs for patients and health providers: Health care
can cost a lot. Big data offers the possibility of reducing the cost of obtaining
and providing health care by identifying appropriate treatment plans,
allocating resources intelligently, and identifying potential health issues before
they occur.
treatment and satisfaction levels; the overall health of the population can also be
enhanced on a sustainable basis, and operational costs can be reduced significantly.
The healthcare dashboard below provides the overview needed by a hospital director or facility manager. By gathering in one central place all the data on every division of the hospital (attendance, its nature, the costs incurred, and so on), you get the big picture of your facility, which is a great help in running it smoothly.
You can see here the most important metrics concerning various aspects: the number
of patients that were welcomed in your facility, how long they stayed and where, how
much it cost to treat them, and the average waiting time in emergency rooms. Such a
holistic view helps top administrators to identify potential bottlenecks, spot trends
and patterns over time, and in general, assess the situation. This is key in order to
make better-informed decisions that will improve the overall operations
performance, with the goal of treating patients better and having the right staffing
resources.
Another real-world application of healthcare big data analytics, our dynamic
patient KPI dashboard, is a visually-balanced tool designed to enhance service levels
as well as treatment accuracy across departments.
Here, you will find everything you need to enhance your level of patient care both in
real-time and in the long term. This is a visual innovation that has the power to
improve every type of medical institution, big or small.
As mentioned, there’s a huge need for big data in healthcare, especially due to rising
costs in nations like the United States. As a McKinsey report states: “After more than
20 years of steady increases, healthcare expenses now represent 17.6 percent of GDP
— nearly $600 billion more than the expected benchmark for a nation of the United
States’s size and wealth.” This quote leads us to our first benefit.
Reducing costs
As stated above, costs are much higher than they should be, and they have been
rising for the past 20 years. Clearly, we are in need of some smart, data-driven
thinking in this area. And current incentives are changing as well: many insurance
companies are switching from fee-for-service plans (which reward using expensive and sometimes unnecessary treatments and treating large numbers of patients quickly) to plans that prioritize patient outcomes.
As the authors of the popular Freakonomics books have argued, financial incentives matter, and incentives that prioritize patients' health over treating large numbers of patients are a good thing. Why does this matter?
Well, in the previous scheme, healthcare providers had no direct incentive to share
patient information with one another, which made it harder to utilize the power of
analytics. Now that more of them are getting paid based on patient outcomes, they
have a financial incentive to share data that can be used to improve the lives of
patients while cutting costs for insurance companies.
Physician decisions are becoming more and more evidence-based, meaning that
they rely on large swathes of research and clinical data as opposed to solely their
schooling and professional opinion. That said, the risk of human error is always a
latent threat. Even though doctors are highly trained professionals, they are still human, and selecting the wrong medication or treatment can potentially endanger a person's life. With the use of big data and the tools mentioned throughout this post, professionals can easily be alerted when the wrong medication, test, treatment, or other intervention has been provided, and remediate it immediately. In time,
this can significantly reduce the rates of medical errors and improve the facility’s
reputation.
As in many other industries, data gathering and management are getting bigger, and
professionals need help in the matter. This new treatment attitude means there is a
greater demand for big data analytics in healthcare facilities than ever before, and
the rise of SaaS BI tools is also answering that need.
While using data to ensure you are providing the best care to patients is
fundamental, there are also other operational areas in which it can assist the health
industry. Part of providing quality care is ensuring the facility works optimally, and
this can also be achieved with the help of big data.
By using the right BI software, professionals can gather and analyze real-time data
about the performance of their organization in areas such as operations and
finances, as well as personnel management. For instance, predictive analytics
technologies can provide relevant information regarding admission rates. These
insights can help define staffing schedules to cover demand as well as inventory for
medical supplies. This way, care facilities can stay one step ahead and ensure that
patients are getting the best experience possible.
Getting this level of insight in such an intuitive way allows managers to redirect
resources where they are most needed and optimize areas that are not performing
well to ensure the best return on investment possible.
Our last benefit is one that should be the clearest from the list of applications we
provided earlier. The use of big data in the care industry enables professionals to test
new technologies, drugs, and treatments to improve the quality of care given to
patients and battle diseases that were once thought of as unbeatable.
Thanks to wearable devices that can tell your heart rate, Bluetooth asthma inhalers
that gather insights to prevent attacks, and much more, doctors are able to use data
to understand how common diseases work and how certain external factors might be
affecting entire communities. Through that, they are able to provide personalized
quality care to each and every person that goes into a hospital.
There is no denying that the power of big data analytics is saving lives. That being
said, the process of managing data requires a lot of effort, and with that comes
challenges, which we will discuss below.
Data integration and storage: One of the biggest hurdles standing in the
way of using big data in medicine is how medical data is spread across many
sources governed by different states, hospitals, and administrative
departments. The integration of these data sources would require developing
a new infrastructure where all data providers collaborate with each other.
Data sharing: Equally important is implementing new online reporting
software and business intelligence strategy that will allow all relevant users to
be connected with the data. Healthcare needs to catch up with other
industries that have already moved from standard regression-based methods
to more future-oriented ones like predictive analytics, machine learning, and
graph analytics. This is done with the help of modern reporting tools, such as a dashboard creator, that allow anyone to perform advanced analytics with just a few clicks.
Security and privacy: Security and privacy are constant concerns and one of the biggest challenges of big data in healthcare. Daily, hospitals and care centers deal with sensitive patient data that needs to be carefully protected. Considering that this data also comes from many different sources, security can present a challenge for these organizations. To address this, it is critical to follow legal regulations, conduct regular audits to ensure everything is in order, and train employees on data protection best practices.
Data literacy: Using big data and analytics in healthcare involves many
processes and tools to collect, clean, process, manage, and analyze the huge
amounts of data available. This requires a level of knowledge and skills that
can present a limitation for average users that are not acquainted with these
processes. However, while data literacy might have been one of the big
disadvantages of big data in healthcare, it is no longer the case.
So, even if these analytical services are not your cup of tea, you are a potential
patient, and so you should care about new healthcare analytics applications. Besides,
it’s good to take a look around sometimes and see how other industries cope with it.
They can inspire you to adapt and adopt some good ideas.
3. STORAGE
4. SECURITY
5. STEWARDSHIP
6. QUERYING
7. REPORTING
8. VISUALIZATION
1. Clinical Decision Support: Big data analytics provides clinicians with access
to vast amounts of patient data, including electronic health records (EHRs),
medical imaging, genomics, and real-time monitoring data. Advanced
analytics techniques, such as machine learning and predictive modeling, help
clinicians make more informed decisions by identifying patterns, predicting
outcomes, and recommending personalized treatment plans.
data, billing patterns, and provider behaviors. Advanced analytics techniques,
such as anomaly detection and network analysis, help identify suspicious
activities, fraudulent claims, and improper billing practices, enabling payers
and healthcare organizations to mitigate financial losses and protect against
fraud risks.
Drug Discovery and Development: Big data analytics accelerates drug discovery
and development processes by analyzing genomic data, molecular interactions,
and clinical trial data. Predictive modeling and simulation techniques help identify
potential drug candidates, predict drug efficacy, and optimize dosage regimens.
A company’s ability to forecast growth accurately and devise a viable marketing plan
now relies heavily on the availability and analysis of information. By using big data
analytics in your planning and decision-making, your company will be well-equipped
to solve today’s advertising problems and anticipate tomorrow’s challenges.
The Role of Big Data Analytics in Digital Marketing Strategy
Big data is crucial in digital marketing because it provides companies with deep
insights about consumer behavior.
Google is an excellent example of big data analytics in action. Google leverages big data to deduce what consumers want based on a variety of characteristics, such as search history, geography, and trending topics. Big data mining has become Google's secret sauce for proactive, or predictive, marketing: determining what consumers desire and how to incorporate that knowledge into the company's ad and product experiences.
But your company doesn’t have to be a tech giant to use big data analytics
successfully. Here are four key ways companies of all sizes can benefit from big data:
4. Creating Relevant Content: Big data helps you deliver tailored content that
aligns with your customers’ interests and needs. It provides the information
you need to create the right content for the right consumers on the right
channel at the right time.
crucial to keep customers informed about how you store their information
and what actions you're taking to adhere to privacy and data protection rules.
Build a big data analytical pipeline. Big data provides additional ears and
eyes for your marketing and advertising campaigns, empowering you to
respond to audience activity and influence consumer behavior in real time.
You now have the tools and know-how to develop effective big data advertising campaigns, thanks to cloud technologies like Amazon Web Services, Microsoft Azure, and Google Cloud.
The growth of big data analytics offers advertisers new opportunities for forecasting
trends and solving ongoing challenges. Embrace the power of big data to analyze
real-time data and customer insights and create targeted advertising and content
that hit the mark with your audience.
For comparison, this text consists of approximately 130 lines, or 1,100 words. Big data in a single spreadsheet file could mean billions of lines of data in different formats. For example, it could be the data covering all programmatic advertising deals inside the BidsCube AdExchange, and the details about them, for just one month.
Let's look at a simple example to make it easier to understand the essence of Big
Data. Imagine a market where all products are arranged chaotically: bread near the
vegetables, fruit in the beverage department, vegetable oil next to the bathtub and
toiletries, and so on. With Big Data, it became possible to distribute all the goods
strictly in their places. But that's not all. You can easily find the product you want, see
expiration dates, learn about the benefits of that brand or variety of products, and
compare it with similar products.
Big Data is also a tool for the practical application of received information. It is
presented in a clear and convenient form, making it easy to solve everyday tasks and
make decisions. For example, you need to learn how to find your potential client and
offer a particular product at the right time for advertising campaigns. You can only
do this with a specific database.
Customer Insights: Big data analytics provides advertisers with valuable insights
into consumer behavior, preferences, and purchasing patterns. By analyzing
customer data from multiple sources, including social media, website interactions,
and transaction history, advertisers can understand their target audience better,
identify market trends, and tailor marketing strategies to meet consumer needs.
In this article, we are discussing the leading technologies that have expanded their
branches to help Big Data reach greater heights. Before we discuss big data
technologies, let us first understand briefly about Big Data Technology.
Among the technology concepts currently generating the most excitement, big data technologies are widely associated with many other technologies, such as deep learning, machine learning, artificial intelligence (AI), and the Internet of Things (IoT), which they massively augment. In combination with these technologies, big data technologies focus on analyzing and handling large amounts of real-time data and batch data.
Operational Big Data Technologies
This type of big data technology mainly covers the basic day-to-day data that people routinely process. Typically, operational big data includes daily data such as online transactions, social media activity, and the data of a particular organization or firm, which is usually needed for analysis by software based on big data technologies. This data can also be considered raw data that serves as input for several analytical big data technologies.
Some specific examples that include the Operational Big Data Technologies can be
listed as below:
Online ticket booking system, e.g., buses, trains, flights, and movies, etc.
Analytical Big Data Technologies
Some common examples that involve the Analytical Big Data Technologies can be
listed as below:
Medical health records where doctors can personally monitor the health status
of an individual
Top Big Data Technologies
We can categorize the leading big data technologies into the following four sections:
Data Storage
Data Mining
Data Analytics
Data Visualization
Let’s now examine the technologies falling under each of these categories with facts
and features, along with the companies that use them.
Data Storage
Typically, this type of big data technology includes infrastructure that allows data to
be fetched, stored, and managed, and is designed to handle massive amounts of
data. Various software programs are able to access, use, and process the collected
data easily and quickly. Among the most widely used big data technologies for this
purpose are:
1. Apache Hadoop
Key features:
MapReduce is a built-in batch processing engine in Hadoop that splits large
computations across multiple nodes to ensure optimum performance and
load balancing.
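To give a feel for the MapReduce model described above, here is a minimal word-count sketch in Python written in the spirit of a Hadoop Streaming mapper and reducer; the sample text and the local simulation of the shuffle step are assumptions for illustration, since real jobs run in parallel across a cluster.

    from itertools import groupby

    def mapper(lines):
        """Map phase: emit (word, 1) for every word in the input split."""
        for line in lines:
            for word in line.strip().split():
                yield word.lower(), 1

    def reducer(pairs):
        """Reduce phase: sum the counts for each word (input sorted by key)."""
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # Local stand-in for what Hadoop does across many nodes:
        # map each split, shuffle/sort by key, then reduce.
        text = ["big data needs big storage", "big data needs parallel processing"]
        for word, total in reducer(mapper(text)):
            print(word, total)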
2. MongoDB
Key features:
It integrates seamlessly with languages like Ruby, Python, and JavaScript; this tight integration facilitates high coding velocity.
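As an illustrative sketch of how an application typically talks to MongoDB, the snippet below uses the PyMongo driver in Python; the connection URI, database name, collection name, and document fields are assumptions for the example.

    from pymongo import MongoClient

    # Connect to a local MongoDB instance (assumed URI for the example)
    client = MongoClient("mongodb://localhost:27017/")
    db = client["analytics_demo"]
    events = db["click_events"]

    # MongoDB stores schema-flexible, JSON-like documents
    events.insert_one({"user": "u123", "page": "/pricing", "duration_sec": 42})

    # Query the collection without a fixed relational schema
    for doc in events.find({"page": "/pricing"}).limit(5):
        print(doc["user"], doc["duration_sec"])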
3. RainStor
RainStor is a database management system, developed by the company of the same name, that manages and analyzes big data. It uses a de-duplication technique to streamline the storage of large amounts of reference data, eliminating duplicate files while sorting and storing large volumes of information. Additionally, it supports cloud storage and multi-tenancy. The RainStor database product is available in two editions, Big Data Retention and Big Data Analytics on Hadoop, which enable highly efficient data management and accelerate data analysis and queries.
Key features:
With RainStor, large enterprises can manage and analyze Big Data at the
lowest total cost.
It allows you to run faster queries and analyses using both SQL queries and
MapReduce, leading to 10-100x faster results.
4. Cassandra
data processing. As a major Big Data tool, it accommodates all types of data formats,
including structured, semi-structured, and unstructured.
Key Features:
It allows Hadoop integration with MapReduce. It also supports Apache Hive &
Apache Pig.
5. Hunk
Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters using virtual indexes. This lets us use the Splunk Search Processing Language (SPL) to analyze the data. Hunk also allows us to report on and visualize vast amounts of data from Hadoop and NoSQL data sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming
language.
Data Mining
Data mining is the process of extracting useful information from raw data and
analyzing it. In many cases, raw data is very large, highly variable, and constantly
streaming at speeds that make data extraction nearly impossible without a special
technique. Among the most widely used big data technologies for data mining are:
6. Presto
Key Features:
With Presto, you can query data wherever it resides, whether it is in Cassandra,
Hive, Relational databases, or even proprietary data stores.
With Presto, multiple data sources can be queried at once. This allows you to
reference data from multiple databases in one query.
Presto supports standard ANSI SQL, making it easy to use. The ability to query
your data without learning a dedicated language is always a big plus, whether
you’re a developer or a data analyst. Additionally, it connects easily to the
most common BI (Business Intelligence) tools with JDBC (Java Database
Connectivity) connectors.
7. RapidMiner
RapidMiner is an advanced open-source data mining tool for predictive analytics. It’s
a powerful data science platform that lets data scientists and big data analysts
analyze their data quickly. In addition to data mining, it enables model deployment
and model operation. With this solution, you will have access to all the machine
learning and data preparation capabilities you need to make an impact on your
business operations. By providing a unified environment for data preparation,
machine learning, deep learning, text mining, and predictive analytics, it aims to
enhance productivity for enterprise users of every skill level.
Key Features:
RapidMiner Studio provides access, loading, and analysis of any type of data,
whether it is structured data or unstructured data such as text, images, and
media.
8. ElasticSearch
Key Features:
Using ElasticSearch, you can store and analyze structured and unstructured
data up to petabytes.
As a language-agnostic open-source application, Elasticsearch makes it easy
to extend its functionality with plugins and integrations.
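As a hedged illustration of how Elasticsearch is commonly used from code, here is a minimal Python sketch with the official elasticsearch client (the keyword arguments follow recent 8.x versions); the node URL, index name, and document fields are assumptions for the example.

    from elasticsearch import Elasticsearch

    # Connect to a local node (assumed URL for the example)
    es = Elasticsearch("http://localhost:9200")

    # Index a JSON document; Elasticsearch builds a full-text index automatically
    es.index(index="support-tickets", id="1",
             document={"subject": "payment failed", "body": "card declined at checkout"})

    # Full-text search across the indexed documents
    result = es.search(index="support-tickets",
                       query={"match": {"body": "declined"}})
    for hit in result["hits"]["hits"]:
        print(hit["_source"]["subject"])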
Data Analytics
Big data analytics involves cleaning, transforming, and modeling data in order to
extract essential information that will aid in the decision-making process. You can
extract valuable insights from raw data by using data analytic techniques. Among the
information that big data analytics tools can provide are hidden patterns,
correlations, customer preferences, and statistical information about the market.
Listed below are a few types of data analysis technologies you should be familiar
with.
9. Apache Kafka
Companies using Kafka: Netflix, Goldman Sachs, Shopify, Target, Cisco, Spotify,
Intuit, Uber, etc.
Key Features:
With Apache Kafka, scalability can be achieved in four dimensions: event
processors, event producers, event consumers, and event connectors. This
means that Kafka scales effortlessly without any downtime.
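To illustrate the event-streaming model, here is a minimal producer/consumer sketch using the kafka-python client; the broker address, topic name, and message payload are assumptions for the example, and a Kafka broker must already be running for it to work.

    from kafka import KafkaProducer, KafkaConsumer

    # Publish an event to a topic (assumed local broker)
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("page-views", b'{"user": "u123", "page": "/pricing"}')
    producer.flush()

    # Read events back from the same topic
    consumer = KafkaConsumer("page-views",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for message in consumer:
        print(message.value)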
10. Splunk
Key Features:
In addition to structured data formats like JSON and XML, Splunk can ingest
unstructured machine data like web and application logs.
Splunk indexes the ingested data to enable faster search and querying based
on different conditions.
11. KNIME
Companies using KNIME: Fiserv, Opplane, Procter & Gamble, Eaton Corporation,
etc.
Key Features:
Additional Plugins are added via its Extension mechanism in order to extend
functionality.
The KNIME workflows can serve as data sets for creating report templates that
can be exported to a variety of file formats, including doc, pdf, ppt, xls, etc.
Additionally, KNIME integrates a variety of open-source projects such as
machine learning algorithms from Spark, Weka, Keras, LIBSVM, and R projects;
as well as ImageJ, JFreeChart, and the Chemistry Development Kit.
12. Apache Spark
The most important and most awaited technology is now in sight: Apache Spark. It is an open-source analytics engine that supports big data processing. The platform features In-Memory Computing (IMC) for performing fast queries against data of any size, a generalized Execution Model (GEM) that supports a wide range of applications, and Java, Python, and Scala APIs for ease of development. These APIs make it possible to hide the complexity of distributed processing behind simple, high-level operators. Spark was introduced by the Apache Software Foundation to speed up Hadoop computation.
Companies using Spark: Amazon, Oracle, Cisco, Netflix, Yahoo, eBay, Hortonworks, etc.
Key Features:
The Spark platform enables programs to run up to 100 times faster in memory than Hadoop MapReduce, or 10 times faster on disk.
With Apache Spark, you can run an array of workloads including machine
learning, real-time analytics, interactive queries, and graph processing.
A number of higher-level libraries are included with Spark, such as support for
SQL queries, machine learning, streaming data, and graph processing.
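Here is a minimal PySpark sketch of the kind of high-level, distributed computation described above; the local-mode session, the CSV file name, and the column names are assumptions for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Local Spark session for the example; on a cluster the same code scales out
    spark = SparkSession.builder.appName("sales-demo").master("local[*]").getOrCreate()

    # Assumed CSV with columns: region, product, amount
    sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

    # Distributed aggregation expressed with simple, high-level operators
    totals = (sales.groupBy("region")
                   .agg(F.sum("amount").alias("total_amount"))
                   .orderBy(F.desc("total_amount")))
    totals.show()

    spark.stop()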
13. R-Language:
R is a programming language mainly used for statistical computing and graphics. It is a free software environment used by leading data miners, practitioners, and statisticians. The language is primarily useful for developing statistical software and for data analytics.
14. Blockchain:
Data Visualization
15. Tableau
In the business intelligence and analytics industry, Tableau is the fastest growing tool
for Data Visualization. It makes it easy for users to create graphs, charts, maps, and
dashboards, for visualizing and analyzing data, thus aiding them in driving the
business forward. Using this platform, data is rapidly analyzed, resulting in interactive
dashboards and worksheets that display the results. With Tableau, users are able to
work on live datasets, obtaining valuable insights and enhancing decision-making.
You don’t need any programming knowledge to get started; even those without
relevant experience can create visualizations with Tableau right away.
Companies using Tableau: Accenture, Myntra, Nike, Skype, Coca-Cola, Wells Fargo,
Citigroup, Qlik, etc
Key Features:
In Tableau, a user can easily create visualizations in the form of Bar charts, Pie
charts, Histograms, Treemaps, Box plots, Gantt charts, Bullet charts, and other
tools.
Tableau supports a wide array of data sources, including on-premise files, CSV,
Text files, Excel, spreadsheets, relational and non-relational databases, cloud
data, and big data.
16. Plotly
Plotly is a Python library that facilitates interactive visualizations of big data. This tool
makes it possible to create superior graphs more quickly and efficiently. Plotly has
many advantages, including user-friendliness, scalability, reduced costs, cutting-edge
analytics, and flexibility. It offers a much richer set of libraries and APIs, including
Python, R, MATLAB, Arduino, Julia, etc. It can be used interactively within Jupyter
notebooks and Pycharm in order to create interactive graphs. With Plotly, we can
include interactive features such as buttons, sliders, and dropdowns to display
different perspectives on a graph.
Key Features:
A unique feature of Plotly is its interactivity. Users can interact with graphs on
display, providing an enhanced storytelling experience.
It's like drawing on paper: you can draw anything you want. Compared with other visualization tools like Tableau, Plotly gives full control over what is being plotted.
In addition to Seaborn- and Matplotlib-style charts, Plotly offers a wide range of graphs and charts, such as statistical charts, scientific charts, financial charts, geographical maps, and so forth.
Furthermore, Plotly offers a broad range of AI and ML charts, which allow you
to step up your machine learning game.
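As a small, hedged example of the interactivity described above, here is a minimal Plotly Express sketch in Python; it uses the gapminder sample dataset bundled with Plotly, and the chosen columns and chart options are just one illustrative configuration.

    import plotly.express as px

    # Bundled sample dataset shipped with Plotly Express
    df = px.data.gapminder().query("year == 2007")

    # Interactive scatter plot: hover for details, zoom and pan in the browser
    fig = px.scatter(df, x="gdpPercap", y="lifeExp",
                     size="pop", color="continent",
                     hover_name="country", log_x=True,
                     title="Life expectancy vs. GDP per capita (2007)")
    fig.show()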
TensorFlow: TensorFlow combines multiple comprehensive libraries, flexible ecosystem tools, and community resources that help researchers implement the state of the art in machine learning. It ultimately allows developers to build and deploy machine learning-powered applications in specific environments (a minimal sketch follows after this list).
TensorFlow was introduced by the Google Brain team in 2015 (with TensorFlow 2.0 released in 2019). It is mainly based on C++, CUDA, and Python. Companies like Google, eBay, Intel, and Airbnb are using this technology for their business requirements.
Beam: Apache Beam consists of a portable API layer that helps build and
maintain sophisticated parallel-data processing pipelines. Apart from this, it
also allows the execution of built pipelines across a diversity of execution
engines or runners.
Apache Beam was first released in June 2016 under the Apache Software
Foundation. Its SDKs are available in Java, Python, and Go. Leading companies such as
Amazon, Oracle, Cisco, and Verizon Wireless use this technology.
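The sketch below illustrates the pipeline idea described above, using the Beam Python SDK with its default local runner; the input strings are invented for illustration:

# Minimal sketch: a Beam word-count pipeline run locally with the DirectRunner.
# Assumes the apache-beam package is installed; swap runners to execute at scale.
import apache_beam as beam

lines = ["big data needs pipelines", "pipelines move big data"]

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(lines)
        | "Split" >> beam.FlatMap(str.split)
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )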
Introduction to HADOOP
Introduction
Hadoop is an open-source software framework that is used for storing and
processing large amounts of data in a distributed computing environment. It is
designed to handle big data and is based on the MapReduce programming model,
which allows for the parallel processing of large datasets.
What is Hadoop?
Hadoop is an open source software programming framework for storing a large
amount of data and performing the computation. Its framework is based on Java
programming with some native code in C and shell scripts.
Hadoop defined
Hadoop is an open source framework based on Java that manages the storage and
processing of large amounts of data for applications. Hadoop uses distributed
storage and parallel processing to handle big data and analytics jobs, breaking
workloads down into smaller workloads that can be run at the same time.
Four modules comprise the primary Hadoop framework and work collectively to form
the Hadoop ecosystem:
HDFS (Hadoop Distributed File System), the storage layer that distributes data across the cluster.
YARN (Yet Another Resource Negotiator), which schedules jobs and manages cluster resources.
MapReduce, the programming model for parallel processing of the data.
Hadoop Common, the shared libraries and utilities used by the other modules.
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem
continues to grow and includes many tools and applications to help collect, store,
process, analyze, and manage big data. These include Apache Pig, Apache Hive,
Apache HBase, Apache Spark, Presto, and Apache Zeppelin.
Data is stored in the HDFS, however, this is considered unstructured and does not
qualify as a relational database. In fact, with Hadoop, data can be stored in an
unstructured, semi-structured, or structured form. This allows for greater flexibility for
companies to process big data in ways that meet their business needs and beyond.
Data stored in HDFS can be dispersed amongst DataNode clusters contained on hundreds or thousands of commodity servers.
With the introduction of Hadoop, organizations quickly had access to the ability to
store and process huge amounts of data, increased computing power, fault
tolerance, flexibility in data management, lower costs compared to DWs, and greater
scalability. Ultimately, Hadoop paved the way for future developments in big data
analytics, like the introduction of Apache Spark.
1. Retail
Large organizations have more customer data available on hand than ever
before. But often, it's difficult to make connections between large amounts of
seemingly unrelated data. When British retailer M&S deployed the Hadoop-
powered Cloudera Enterprise, they were more than impressed with the results.
Cloudera uses Hadoop-based support and services for the managing and
processing of data. Shortly after implementing the cloud-based platform,
M&S found they were able to successfully leverage their data for much
improved predictive analytics.
This led to more efficient warehouse use, prevented stock-outs during
"unexpected" peaks in demand, and gave M&S a huge advantage over the
competition.
2. Finance
Hadoop is perhaps more suited to the finance sector than any other. Early on,
the software framework was quickly pegged for primary use in handling the
advanced algorithms involved with risk modeling. It's exactly the type of risk
management that could help avoid the credit default swap disaster that led to the
2008 recession.
Banks have also realized this same logic also applies to managing risk for
customer portfolios. Today, it's common for financial institutions to implement
Hadoop to better manage the financial security and performance of their
clients' assets. JPMorgan Chase is just one of many industry giants that use
Hadoop to manage exponentially increasing amounts of customer data from
across the globe.
3. Security and law enforcement
Hadoop can help improve the effectiveness of national and local security, too.
When it comes to solving related crimes spread across multiple regions, a
Hadoop framework can streamline the process for law enforcement by
connecting two seemingly isolated events. By cutting down on the time to
make case connections, agencies will be able to put out alerts to other
agencies and the public as quickly as possible.
In 2013, The National Security Agency (NSA) concluded that the open-source
Hadoop software was superior to the expensive alternatives they'd been
implementing. They now use the framework to aid in the detection of
terrorism, cybercrime and other threats.
Software clients input data into Hadoop. HDFS handles metadata and the distributed
file system. MapReduce then processes and converts the data. Finally, YARN divides
the jobs across the computing cluster.
All Hadoop modules are designed with a fundamental assumption that hardware
failures of individual machines or racks of machines are common and should be
automatically handled in software by the framework.
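To make the MapReduce flow described above concrete, here is a small, self-contained Python sketch that simulates the map, shuffle, and reduce phases on a few in-memory lines; a real Hadoop job would distribute these steps across the cluster:

# Minimal sketch: simulating MapReduce word count in plain Python.
# In real Hadoop, the map and reduce functions run in parallel on many nodes,
# and the shuffle phase moves intermediate pairs between them.
from collections import defaultdict

lines = ["hadoop stores big data", "hadoop processes big data"]

# Map phase: emit (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}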
Apache Hive
Apache Hive was the early go-to solution for how to query SQL with Hadoop.
This module emulates the behavior, syntax and interface of MySQL for
programming simplicity. It's a great option if you already heavily use Java
applications as it comes with a built-in Java API and JDBC drivers. Hive offers a
quick and straightforward solution for developers, but it is also quite limited, as
the software is rather slow and largely read-only.
IBM Big SQL
IBM Big SQL is IBM's SQL-on-Hadoop engine, which runs standard SQL queries against data stored in HDFS and other sources.
The term Hadoop is often used to refer to both:
The overall Hadoop ecosystem, which encompasses both the core modules
and related sub-modules.
The core Hadoop modules, including Hadoop Distributed File System (HDFS),
Yet another Resource Negotiator (YARN), MapReduce, and Hadoop Common
(discussed below). These are the basic building blocks of a typical Hadoop
deployment.
Hadoop is important as one of the primary tools to store and process huge
amounts of data quickly. It does this by using a distributed computing model
which enables the fast processing of data that can be rapidly scaled by adding
computing nodes.
Low cost
As an open source framework that can run on commodity hardware and has a
large ecosystem of tools, Hadoop is a low-cost option for the storage and
management of big data.
Flexibility
Hadoop allows for flexibility in data storage as data does not require
preprocessing before storing it which means that an organization can store as
much data as they like and then utilize it later.
Resilience
Data stored on any node of a Hadoop cluster is also replicated to other nodes, so hardware or software failures do not automatically result in data loss.
Security
Data sensitivity and protection can be issues as Hadoop handles such large
datasets. An ecosystem of tools for authentication, encryption, auditing, and
provisioning has emerged to help developers secure data in Hadoop.
Data management and governance
Hadoop does not have many robust tools for data management and governance, nor for data quality and standardization.
Talent gap
Finding developers and data engineers who combine Java, MapReduce, and distributed-systems skills remains difficult, which can slow adoption.
Hadoop tools
Hadoop has a large ecosystem of open source tools that can augment and extend
the capabilities of the core module. Some of the main software tools used with
Hadoop include:
Apache Hive: A data warehouse that allows programmers to work with data
in HDFS using a query language called HiveQL, which is similar to SQL
Apache Impala: Open source, massively parallel processing SQL query engine
often used with Hadoop
Apache Sqoop: A command-line interface application for efficiently
transferring bulk data between relational databases and Hadoop
A wide variety of companies and organizations use Hadoop for research, production
data processing, and analytics that require processing terabytes or petabytes of big
data, storing diverse datasets, and data parallel processing.
Data lakes
Since Hadoop can help store data without preprocessing, it can be used to
complement data lakes, where large amounts of unrefined data are stored.
Marketing analytics
Marketing teams often use Hadoop to store and analyze customer relationship management (CRM) and campaign data.
Risk management
Banks, insurance companies, and other financial services companies use Hadoop to
build risk analysis and management models.
Machine learning and AI
Hadoop ecosystems help with the processing of data and model training operations for machine learning applications.
Big data analytics tools from Google Cloud—such as Dataproc, BigQuery, Vertex AI
Workbench, and Dataflow—can enable you to build context-rich applications, build
new analytics solutions, and turn data into actionable insights.
Dataproc
Dataproc makes open source data and analytics processing fast, easy, and more
secure in the cloud.
BigQuery
Serverless, highly scalable, and cost-effective cloud data warehouse designed for
business agility.
Vertex AI Workbench
A single, notebook-based development environment for the entire data science workflow, from data exploration to training at scale, designed to help teams build and train models faster.
Dataflow
Unified stream and batch data processing that's serverless, fast, and cost-effective.
Google Cloud’s data lake powers any analysis on any type of data. This empowers
your teams to securely and cost-effectively ingest, store, and analyze large volumes
of diverse, full-fidelity data.
Smart analytics
Google Cloud’s fully managed serverless analytics platform empowers your business
while eliminating constraints of scale, performance, and cost.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. It is low cost.
Hadoop has several key features that make it well-suited for big
data processing:
Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
Data locality: Hadoop provides a data locality feature, where data is stored
on the same node where it will be processed. This helps to reduce
network traffic and improve performance.
Flexible Data Processing: Hadoop’s MapReduce programming model allows
for the processing of data in a distributed fashion, making it easy to
implement a wide variety of data processing tasks.
Adobe - the software and service providers use Apache Hadoop and HBase
for data storage and other services.
EBay - uses the framework for search engine optimization and research.
Spotify - the Swedish music streaming giant used the Hadoop framework for
analytics and reporting, as well as content generation and listening
recommendations.
Facebook - the social media giant maintains the largest Hadoop cluster in the
world, with a dataset that grows a reported half of a PB per day.
When it comes to services that implement Hadoop frameworks you will have several
pricing options:
1. Per TB
2. Cloud-based services with their own broken-down pricing options, where you
essentially pay for what you need or pay as you go
HDFS
HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It splits files into large blocks, distributes them across DataNodes in the cluster, and keeps the file system metadata on a NameNode.
Advantages of HDFS: It is inexpensive, immutable in nature, stores data reliably,
ability to tolerate faults, scalable, block structured, can process a large amount of
data simultaneously and many more.
Disadvantages of HDFS: Its biggest disadvantage is that it is not suited to small
quantities of data. It also has potential stability issues and can be restrictive and
rough in nature.
Hadoop also supports a wide range of software packages such as Apache Flume,
Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache
Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
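As a rough illustration of the block-structured storage mentioned above (assuming the common defaults of a 128 MB block size and a replication factor of 3; both are configurable), the following sketch shows how a file is split and how much raw storage it consumes:

# Minimal sketch: HDFS block splitting and replication arithmetic.
# 128 MB blocks and replication factor 3 are common defaults, not fixed values.
import math

file_size_mb = 1024          # a hypothetical 1 GB file
block_size_mb = 128
replication_factor = 3

num_blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication_factor

print(f"Blocks: {num_blocks}")                   # 8 blocks
print(f"Raw storage used: {raw_storage_mb} MB")  # 3072 MB across the cluster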
4. Hadoop Common- it contains the packages and libraries used by the other
modules.
5. Pig- it provides Pig Latin, a SQL-like language, and performs data transformation
on unstructured data.
6. Tez- it reduces the complexities of Hive and Pig and helps their code run
faster.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was
the Google File System paper published by Google.
In 2002, Doug Cutting and Mike Cafarella started to work on Apache
Nutch, an open-source web crawler software project.
While working on Apache Nutch, they were dealing with big data. Storing
that data was very expensive, which became a major problem for the project.
This problem became one of the important reasons for the
emergence of Hadoop.
In 2003, Google introduced a file system known as GFS (Google file system). It
is a proprietary distributed file system developed to provide efficient access to
data.
In 2005, Doug Cutting and Mike Cafarella introduced a new file system known
as NDFS (Nutch Distributed File System). This file system also included
MapReduce.
In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he
introduced a new project, Hadoop, with a file system known as HDFS (Hadoop
Distributed File System). Hadoop's first version, 0.1.0, was released that year.
Doug Cutting named his project Hadoop after his son's toy elephant.
In 2017, Hadoop 3.0 was released.
Notable milestones from Hadoop's history include:
Yahoo deploys 300 machines and within the same year reaches 600 machines.
Yahoo runs 42,000 Hadoop nodes and hundreds of petabytes of storage.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master
node runs the JobTracker and NameNode, whereas each slave node runs a
TaskTracker and DataNode.
Advantages and Disadvantages of Hadoop
Advantages:
Ability to store a large amount of data.
High flexibility.
Cost effective.
Linear scaling.
Cost-effective: Hadoop is designed to work with commodity hardware, which
makes it a cost-effective option for storing and processing large amounts of
data.
Disadvantages:
Not very effective for small data.
Security concerns.
Latency: Hadoop is not well-suited for low-latency workloads and may not be
the best choice for real-time data processing.
Data Security: Hadoop's built-in security features, such as data encryption and
user authentication, are limited and complex to configure, which can make it
difficult to secure sensitive data.
Limited Support for Graph and Machine Learning: Hadoop's core components,
HDFS and MapReduce, are not well-suited for graph and machine learning
workloads; specialized components like Apache Giraph and Apache Mahout are
available but have some limitations.
Data Loss: In the event of a hardware failure, data stored on a single node
may be lost permanently if it has not been replicated to other nodes.
Open Source Technologies
Open-source software is not necessarily free, however. The source code is freely
available to anyone, but the executable software is sometimes available only by
subscription. The best open source technologies allow users to download, modify,
and distribute them without paying any license fees to the original creator.
Operating Systems:
Linux: Linux is an open-source operating system kernel that forms the basis of many
Linux distributions, such as Ubuntu, CentOS, and Debian. Linux is widely used in
servers, embedded systems, and cloud computing environments due to its stability,
security, and flexibility.
Database Systems:
MySQL: MySQL is an open-source relational database management system (RDBMS)
that is widely used for web applications, content management systems, and other
data-driven applications. MySQL is known for its reliability, performance, and ease of
use.
Web Servers:
NGINX: NGINX is an open-source web server and reverse proxy server known for its
high performance, scalability, and flexibility. NGINX is commonly used as a load
balancer, web accelerator, and API gateway.
Content Management Systems:
Joomla: Joomla is an open-source CMS written in PHP and known for its flexibility,
extensibility, and user-friendly interface. Joomla is often used for building community
websites, e-commerce platforms, and corporate portals.
These are just a few examples of open-source technologies that have had a
significant impact on the software development industry and are widely used by
developers and organizations around the world. Open-source software continues to
play a crucial role in driving innovation, collaboration, and accessibility in the tech
community.
Apache Flink: A real-time stream processing framework for big data analytics
and applications.
Apache Impala: The open source, analytic MPP database for Apache Hadoop
that provides the fastest time-to-insight.
Apache Spark: Spark adds in-Memory Compute for ETL, Machine Learning
and Data Science Workloads to Hadoop.
Apache Sqoop: Efficiently transfers bulk data between Apache Hadoop and
structured datastores.
HDFS: A distributed file system designed for storing and managing vast amounts of data.
If you are a developer looking to master new skills through collaborative work, you
can pursue a career in Open source. Being part of an open-source community will
expand your network with fellow programmers and add more credentials to your
resume.
Big Data in Cloud Computing
The rise of big data on cloud computing has made the process of analyzing big data
more efficient. Businesses can choose from three types of cloud computing
services, IaaS, PaaS, and SaaS, for cloud-based big data analytics. These services are
available on a pay-per-use or subscription basis, which means users only pay for the
services they use.
Cloud analytics essentially means storing and analyzing data in a big data cloud
instead of on-premises systems of the organization. This includes any type of data
analytics that is performed on systems hosted in the cloud, including big data
analytics.
For big data analytics in cloud computing, the data (both structured and
unstructured) is gathered from different sources, such as smart devices, websites,
social media, etc. The next step involves cleaning and storing this large amount of
data. Companies then use big data cloud tools by big data cloud providers to
process this data for analysis.
A typical big data cloud architecture shows how cloud computing and big data are used together.
One of the most common cloud computing platforms for big data processing and
analysis is AaaS. AaaS or Analytics as a service refers to a big data cloud solution that
provides analytics software and procedures. It provides efficient business intelligence
(BI) solutions that help organize, analyze, and present big data so that it is easy to
interpret.
Scalability
A typical business data center faces limits in physical space, power, cooling
and the budget to purchase and deploy the sheer volume of hardware it
needs to build a big data infrastructure. By comparison, a public cloud
manages hundreds of thousands of servers spread across a fleet of global
data centers. The infrastructure and software services are already there, and
users can assemble the infrastructure for a big data project of almost any size.
Agility
Not all big data projects are the same. One project may need 100 servers, and
another project might demand 2,000 servers. With cloud, users can employ as
many resources as needed to accomplish a task and then release those
resources when the task is complete.
Cost
Public clouds bill on a pay-as-you-go basis, so organizations pay only for the infrastructure a big data project actually consumes instead of purchasing hardware up front.
Accessibility
Many clouds provide a global footprint, which enables resources and services
to deploy in most major global regions. This enables data and processing
activity to take place proximally to the region where the big data task is
located. For example, if a bulk of data is stored in a certain region of a cloud
provider, it's relatively simple to implement the resources and services for a
big data project in that specific cloud region -- rather than sustaining the cost
of moving that data to another region.
Resilience
Data is the real value of big data projects, and the benefit of cloud resilience is
in data storage reliability. Clouds replicate data as a matter of standard
practice to maintain high availability in storage resources, and even more
durable storage options are available in the cloud.
Network dependence
Cloud use depends on complete network connectivity from the LAN, across
the internet, to the cloud provider's network. Outages along that network path
can result in increased latency at best or complete cloud inaccessibility at
worst. While an outage might not impact a big data project in the same ways
that it would affect a mission-critical workload, the effect of outages should
still be considered in any big data use of the cloud.
Storage costs
Data storage in the cloud can present a substantial long-term cost for big data
projects. The three principal issues are data storage, data migration and data
retention. It takes time to load large amounts of data into the cloud, and then
those storage instances incur a monthly fee. If the data is moved again, there
may be additional fees. Also, big data sets are often time-sensitive, meaning
that some data may have no value to a big data analysis even hours into the
future. Retaining unnecessary data costs money, so businesses must employ
comprehensive data retention and deletion policies to manage cloud storage
costs around big data.
Security
The data involved in big data projects can involve proprietary or personally
identifiable data that is subject to data protection and other industry- or
government-driven regulations. Cloud users must take the steps needed to
maintain security in cloud storage and computing through adequate
authentication and authorization, encryption for data at rest and in flight, and
copious logging of how they access and use data.
Lack of standardization
Clouds differ in their services, APIs, and tools, so a big data workflow built for one provider may not transfer easily to another.
Although Big Data and Cloud Computing are two different terms, they're often seen
together in literature because they interact synergistically with one another.
Big Data: This simply refers to the very large sets of data that are output by a
variety of programs. It can refer to any of a large variety of types of data, and
the data sets are usually far too large to peruse or query on a regular
computer.
Cloud Computing: This refers to the on-demand delivery of computing resources, such as storage and processing power, over the internet from remote servers rather than from a local machine.
Essentially, “Big Data” refers to the large sets of data collected, while “Cloud
Computing” refers to the mechanism that remotely takes this data in and performs
any operations specified on that data.
From there, the data can be harnessed through the Cloud Computing platform and
utilized in a variety of ways. For example, it can be searched, edited, and used for
future insights.
This cloud infrastructure allows for real-time processing of Big Data. It can take huge
“blasts” of data from intensive systems and interpret it in real-time. Another common
relationship between Big Data and Cloud Computing is that the power of the cloud
allows Big Data analytics to occur in a fraction of the time it used to.
As you can see, there are infinite possibilities when we combine Big Data and Cloud
Computing! If we simply had Big Data alone, we would have huge data sets that have
a huge amount of potential value just sitting there. Using our computers to analyze
them would be either impossible or impractical due to the amount of time it would
take.
In short, Cloud Computing services largely exist because of Big Data. Likewise, the
only reason that we collect Big Data is because we have services that are capable of
taking it in and deciphering it, often in a matter of seconds. The two are a perfect
match, since neither would exist without the other!
Private cloud
Private clouds give businesses control over their cloud environment, often to
accommodate specific regulatory, security or availability requirements.
However, it is more costly because a business must own and operate the
entire infrastructure. Thus, a private cloud might only be used for sensitive
small-scale big data projects.
Public cloud
A public cloud offers on-demand scale and a pay-as-you-go cost model, which makes it the most common choice for big data projects, although the business gives up direct control over the underlying infrastructure.
Hybrid cloud
A hybrid cloud is useful when sharing specific resources. For example, a hybrid
cloud might enable big data storage in the local private cloud -- effectively
keeping data sets local and secure -- and use the public cloud for compute
resources and big data analytical services. However, hybrid clouds can be
more complex to build and manage, and users must deal with all of the issues
and concerns of both public and private clouds.
Multi-cloud
With multiple clouds, users can maintain availability and use cost benefits.
However, resources and services are rarely identical between clouds, so
multiple clouds are more complex to manage. This cloud model also has more
risks of security oversights and compliance breaches than single public cloud
use. Considering the scope of big data projects, the added complexity of
multi-cloud deployments can add unnecessary challenges to the effort.
Providers not only offer services and documentation, but can also arrange for
support and consulting to help businesses optimize their big data projects. A
sampling of available big data services from the top three providers includes the
following.
AWS
Amazon SageMaker
Microsoft Azure
Azure HDInsight
Azure Databricks
Google Cloud
Google BigQuery
Keep in mind that there are numerous capable services available from third-party
providers. Typically, these providers offer more niche services, whereas major
providers follow a one-size-fits-all strategy for their services. Some third-party
options include the following:
Cloudera
Scalability
Cloud computing for big data offers flexible, on-demand capabilities. With big
data cloud technology, organizations can scale up or scale down as per their
needs. For example, organizations can ask cloud-based big data solutions
providers to increase cloud storage as the volume of their data increases.
Businesses can also add data analysis capacity as needed. Big data cloud
servers help businesses respond to customer demands more efficiently.
Higher Efficiency
Cloud computing for big data analytics provides incredible processing power.
This makes big data processing in cloud computing environments more
efficient compared to on-premise systems.
Cost Reductions
When it comes to big data on-premise vs. cloud, another major difference is
cost. In comparing big data cloud vs. on-premise, on-premises systems
involve different costs, such as power consumption costs, purchasing and
maintaining hardware and servers, replacing the hardware, etc.
However, with cloud and big data cloud technologies, there are no such costs
because the cloud service providers are responsible for everything.
Additionally, cloud services are based on a pay-per-use model, which further
reduces the cost.
Disaster Recovery
Data of any size is a valuable asset for organizations, so it’s important not to
lose it. However, cyber-attacks, equipment failure, and power outages can
result in data loss, especially if you’re using an on-premise system. On the
other hand, a big data cloud service replicates data to ensure high availability
and security. Hence, cloud computing for big data helps organizations recover
from disasters faster.
Security Concerns
Security issues associated with big data in cloud computing are usually a
major concern for businesses. Big data consists of different types of data,
including the personal data of customers, which are subject to data privacy
regulations. As cyber-attacks are increasing, hackers can steal data on poorly
secured clouds.
Requires Internet
You need an internet connection to access data in the cloud and perform
analytics.
Difference between Big Data and Cloud Computing
01. Big Data: Big data refers to data that is huge in size and increasing rapidly with respect to time. Cloud Computing: Cloud computing refers to the on-demand availability of computing resources over the internet.
05. Big Data: Distributed computing is used for analyzing the data and extracting useful information. Cloud Computing: The internet is used to get cloud-based services from different cloud vendors.
06. Big Data: Big data management allows a centralized platform, provision for backup and recovery, and low maintenance cost. Cloud Computing: Cloud computing services are cost-effective, scalable, and robust.
07. Big Data: Some of the challenges of big data are variety of data, data storage and integration, data processing, and resource management. Cloud Computing: Some of the challenges of cloud computing are availability, transformation, security concerns, and the charging model.
08. Big Data: Big data refers to huge volumes of data, its management, and useful information extraction. Cloud Computing: Cloud computing refers to remote IT resources and different internet service models.
09. Big Data: Big data is used to describe huge volumes of data and information. Cloud Computing: Cloud computing is used to store data and information on remote servers and to process the data using remote infrastructure.
10. Big Data: Some of the sources where big data is generated include social media data, e-commerce data, weather station data, IoT sensor data, etc. Cloud Computing: Some of the vendors who provide cloud computing services are Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, IBM Cloud, etc.
Mobile Business Intelligence (Mobile BI)
The ability to access analytics and data on mobile devices or tablets rather than desktop computers is referred to as mobile business intelligence. Business metric dashboards and key performance indicators (KPIs) are displayed more clearly.
With the rising use of mobile devices, the technology we use in our daily lives, including in business, has also evolved to make life easier. Many businesses have benefited from mobile business intelligence. Essentially, this section is a guide for business owners and others on the benefits and pitfalls of Mobile BI.
Mobile BI extends business intelligence to mobile experiences. It allows users to retrieve, interact with, and analyze business data on the move, breaking the shackles of stationary data interaction.
Mobile BI: Offers the ability to access BI tools from any location at any time,
using a mobile device.
2. Competitive Advantage
3. Simplified Decision-Making
- Example: A sales manager in a client meeting can use a tablet to
demonstrate recent sales trends and make informed decisions without
delay.
4. Increased Productivity
Disadvantages of Mobile BI
1. Stack of data
2. Expensive
Mobile BI can be quite costly at times. Large corporations can continue to pay for
their expensive services, but small businesses cannot. Beyond the cost of the
Mobile BI software itself, we must also consider the rates of the IT workers needed
for its smooth operation, as well as the hardware costs involved.
However, larger corporations do not settle for just one Mobile BI provider for
their organisations; they require multiple. Even when doing basic commercial
transactions, mobile BI is costly.
3. Time consuming
4. Data security
The biggest issue for the user when providing data to Mobile BI is data leakage. If
you handle sensitive data through Mobile BI, a single error can destroy your data
as well as make it public, which can be detrimental to your business.
Many Mobile BI providers are working to make it 100 percent secure to protect
their users' data. It is not only something that Mobile BI carriers must
consider, but also something that we, as users, must consider when granting
data access authorization.
Because we work online in every aspect, we have a lot of data stored in Mobile BI,
which might be a significant problem. This means that a large portion of the data
analysed by Mobile BI is irrelevant or completely useless. This can slow down
the entire procedure. This requires you to select the data that is important and
may be required in the future.
Recommended Tools
The use of Mobile BI transforms not only how and where business intelligence is
consumed but also significantly speeds up decision-making and enhances overall
productivity. For businesses looking to implement Mobile BI, choosing the right tool
is crucial. Some top recommendations include:
1. Sisense
Businesses can use the solution to evaluate large, diverse databases and
generate relevant business insights. You may easily view enormous volumes of
complex data with Sisense's code-first, low-code, and even no-code
technologies. Sisense was established in 2004, with its headquarters in New
York.
Once the company received $4 million in funding from investors, it began to
pick up the pace of its research.
2. Roambi Analytics
After the data is collected, you may easily share it with your preferred device.
Roambi Analytics was founded in 2008 by a team based in California.
That way, the business owner will know where they stand in comparison to
their competitors and where they can grow in the future. It combines
reporting, modelling, analysis, dashboards to help you understand your
organization's data and make sound business decisions.
Amazon QuickSight allows you to quickly and easily create interactive
dashboards and reports for your users. Anyone in your organisation can
securely access those dashboards via browsers or mobile devices.
Strengths: Tableau is known for its powerful data visualization capabilities
and ease of use. Tableau Mobile allows users to access and interact with
Tableau dashboards on mobile devices.
Considerations: It may require a learning curve for new users, and the full
functionality might depend on your Tableau licensing level. Authoring is not
available on mobile.
Strengths: Qlik offers robust data analytics and visualization capabilities. Qlik
Mobile allows users to access and explore Qlik apps on mobile devices. Stands
out for its associative data model, facilitating deep, intuitive data exploration
and insights.
Considerations: Like the others, the full range of features may vary
depending on your Qlik licensing, and users might need some training to fully
utilize its capabilities. Authoring is not available on mobile.
8. Oneboard (Sweeft):
Mobile BI Technology
Business data and analytics are accessed better on tablets than on mobile phones.
This is because the difference between notebook and tablet screens is relatively
small, whereas differently sized mobile phone screens create a significant
difference in how these analytical reports are accessed.
However, modern BI analytics solutions provide robust user interfaces and desktop-
like visualization that take Mobile BI to a whole new level.
Modern BI platforms typically extend the delivery of their desktop BI capabilities to
smartphones and tablet devices so that users can access, consume, and share data
easily on portable devices. In addition, mobile BI especially formats and optimizes
dashboards and reports so that the user experience is taken into account and
leverages smaller, touchscreen-based mobile screens and interfaces.
There are a few methods by which Mobile BI provides Big Data and ETL solutions.
They are categorized into three types:
1. Web-based solutions: Browser-based dashboards and reports, typically rendered
with HTML5, that can be accessed from any device without installing an app.
2. Native applications: Apps built specifically for a mobile platform (such as
Android or iOS) that take full advantage of the device's features.
3. Hybrid solutions: Hybrid solutions are the advanced form of native analytics
applications that render content by using HTML5. In hybrid applications,
features of native applications are merged with HTML5 and perform similarly
to Web-based solutions.
The methods sound all helpful and easy, but how does Mobile BI actually operate?
Let’s find out.
Mobile-based Business Intelligence solutions are an interesting innovation in Data
Analytics. However, it is also quite challenging because many aspects have to be
taken so that the data visualization presented is top-notch.
This is where Kube comes in handy. Kube is an integral Mobile BI tool of Kockpit
Analytics. It wraps up all the analytical business points to cash cycles and provides
you with various exciting features and advantages.
Kube features integrated chat functionality that makes you truly hyper-
connected with your business so that you can conveniently alter your last-
minute decisions. This makes communication more simple and more efficient.
Kube provides a bird’s eye view of your sales and operations. It also allows its
users to monitor anything and everything about your sales process and access
more areas.
Users also can easily assign and track their goals. Whether it is accessing the
goals tab, viewing all the assigned goals & tasks, tracking their details and
progress, or self-assigning goals for better task management, you have it all!
Kube also provides advantageous assistance:
Share the data with your team and other members straight from the
application. Just a few taps and Kube will share your data through various
platforms.
The big data-enabled engine ensures real-time information for users, allowing
you to make data-driven decisions on the go.
With the Kube mobile app, you'll see the same sophisticated and powerful
visualizations that you see on your desktop.
Sort your data by name, target, actual, and many more with Kockpit Kube and
view your data more conveniently.
Favorite the cards that are most important to you and view them anytime you
like.
Moreover, it has quite exciting modules which operate for various departments. For
example, with a data refresh rate of 15 minutes, you get up-to-date information
in your app regardless of your location.
Usability: Modern Mobile BI solutions take advantage of touch displays, letting
business users access and monitor dashboards, reports, and KPIs through
drag-and-drop interfaces at their fingertips.
Collaboration: Mobile BI connects office and field workers to keep them on the same page by providing real-time insights and functional strategies.
Crowdsourcing
Crowdsourcing is the practice of obtaining ideas, content, services, or data from a large group of people, typically via the internet. Let us understand this term with the help of an example. GeeksforGeeks gives young minds an opportunity to share their knowledge with the world by contributing articles and videos in their respective domains. Here, GeeksforGeeks uses the crowd as a source not only to expand its community but also to include the ideas of many young minds, improving the quality of the content.
Crowdsourcing is used across many sectors, including:
1. Enterprise
2. IT
3. Marketing
4. Education
5. Finance
Understanding Crowdsourcing
Crowdsourcing allows companies to farm out work to people anywhere in the
country or around the world; as a result, crowdsourcing lets businesses tap into a
vast array of skills and expertise without incurring the normal overhead costs of in-
house employees.
What Are the Main Types of Crowdsourcing?
Crowdsourcing involves obtaining information or resources from a wide swath of
people. In general, we can break this up into four main categories:
Wisdom - Wisdom of crowds is the idea that large groups of people are
collectively smarter than individual experts when it comes to problem-solving
or identifying values (like the weight of a cow or number of jelly beans in a
jar).
3. Task Design and Setup: Organizations design and set up the analytics tasks
or challenges on the chosen crowdsourcing platform, specifying the
requirements, guidelines, and evaluation criteria for the contributors. Tasks
may involve data cleaning, data labeling, image classification, sentiment
analysis, or even more advanced analytics tasks such as predictive modeling or
algorithm development.
How to Crowdsource?
For scientific problem solving, a broadcast search is used where an organization
mobilizes a crowd to come up with a solution to a problem.
For processing large datasets, distributed human intelligence is used. The
organization mobilizes a crowd to process and analyze the information.
Examples of Crowdsourcing
1. Doritos: Doritos is one of the companies that has taken advantage of
crowdsourcing for a long time for its advertising initiatives. It uses
consumer-created ads for its 30-second Super Bowl spots (the
championship game of American football).
4. Airbnb: A very famous travel website that offers people to rent their houses or
apartments by listing them on the website. All the listings are crowdsourced
by people.
There are several examples of businesses being set up with the help of
crowdsourcing.
Crowdsourced Marketing
As discussed already, crowdsourcing helps businesses grow a lot. Whether it is a
business idea or just a logo design, crowdsourcing engages people directly and, in
turn, saves money and energy. In the coming years, crowdsourced marketing will
surely get a boost as the world adopts technology faster.
Crowdsourcing Sites
Here is the list of some famous crowdsourcing and crowdfunding sites.
1. Kickstarter
2. GoFundMe
3. Patreon
4. RocketHub
What are the benefits of crowdsourcing?
The rapid growth in the popularity of crowdsourcing is due to its numerous
advantages.
2. Accelerated tasks
Companies can obtain excellent ideas in a lot less time by engaging a larger
group of people to participate in the process. This could be crucial to the
success of time-sensitive undertakings like urgent software fixes or medical
research.
Microtasking proves to be advantageous here. It is a form of crowdsourcing in
which little groups are given specific tasks. Microtaskers can either be one
person or a group of people who share the workload. Writing a blog post or
conducting research are examples of jobs that are frequently carried out in
small, sequential chunks, or microtasks.
Data crowdsourcing can provide accurate and timely data for businesses. The
data is flexible and can be modified to fit the business’s needs. The business
can pay-per-use for the data or receive real-time alerts when traffic is
congested.
2. Greater speed
Data crowdsourcing can help speed up the process of finding the right data
by allowing a large number of people to quickly and cheaply contribute data.
This ensures that data tasks are completed quickly and with high quality
standards.
3. Allows for diverse input
4. Greater accuracy
Data crowdsourcing can help with accuracy and other advantages. It can be
more reliable than traditional methods when the dataset is large, and it can
help reduce the number of pairwise comparisons required to rank. This
reduces annotation burden, making data more accurate and easier to use.
Data crowdsourcing can help with improving content, getting feedback, and
more. By being transparent and honest with data crowdsourcing participants,
you can ensure a successful project.
Data crowdsourcing can be used to get new ideas for cost-effective solutions.
It is a cheaper and more accessible way to get solutions to complex problems
than traditional methods. Crowdsourcing is not limited to highly technical and
complex problems – it can also be used for research and development (R&D).
Data crowdsourcing can be used to improve productivity and creativity in a
company.
Data crowdsourcing can help with better understanding by gathering data
from a large number of sources. This can be used to improve customer
service, product development, and more. For example, data crowdsourcing
can be used to gather data about customer sentiment or trends.
Data crowdsourcing can help with understanding customer needs and can be
used to improve customer service. It can also help identify trends and patterns
in customer data which can help businesses improve their services and
products.
Advantages of Crowdsourcing
1. Evolving Innovation: Innovation is required everywhere and in this advancing
world innovation has a big role to play. Crowdsourcing helps in getting
innovative ideas from people belonging to different fields and thus helping
businesses grow in every field.
2. Save costs: There is the elimination of wastage of time of meeting people and
convincing them. Only the business idea is to be proposed on the internet and
you will be flooded with suggestions from the crowd.
Disadvantages of Crowdsourcing
1. Lack of confidentiality: Asking for suggestions from a large group of people
can bring the threat of idea stealing by other organizations.
2. Repeated ideas: Contestants in crowdsourcing competitions often submit repeated or plagiarized ideas, which wastes time because reviewing the same ideas is not worthwhile.
Additionally, due to the way this type of data is typically collected (i.e., through
individual submissions), it often suffers from the founder effect: because
contributions are often made by those who initiated or own the project itself,
projects that begin as popular or well-known tend to have more
contributions than projects that start off relatively unknown or less popular.
Finally, due to its open-ended nature, crowdsourcing can also be prone to errors
caused by hypercorrection – normalizing words that look misspelled in the original
submission – as well as reviewer fatigue: when reviewers see submissions from many
different users all at once rather than one after another over time, it can be harder
for them to spot mistakes that “look” correct. Despite these risks, crowdsourcing can
be an extremely effective way to gather data if used in conjunction with quality
control methods.
Crowdsourced data can also help organizations correct outdated information, as well as prevent disasters from happening in the first place. It is important to use results from crowdsourced data sets to improve data quality.
To successfully crowdsource data, you must first determine the type of data to be
collected and the participants who will be collecting it. The platform you use should
be easy to use and allow participants to easily share their data. The compensation
method for participants should be fair and incentive-based.
Rewards can play an important role in motivating participants to contribute quality
work, even when working remotely. They can be given in a variety of ways,
depending on the project, and should be aligned with the project's values and
participant motivations in order to respect and reward participants.
To protect participants’ data and avoid common mistakes, follow these tips:
Keep your data safe. Use secure methods to store your data, such as
encrypting it with a strong password. Make sure to keep updated backups of
your data in case of accidents or malicious attacks.
Make sure your dataset is properly licensed. If you are using public or open
datasets, make sure that the license agreement allows others to use the data
without limitations.
Be clear about who can access the dataset and what rights they have. Clearly
label any datasets that you make available so that others know how to use
them safely and legally.
Follow standard privacy policies and practices when sharing your data with
other researchers or users. Make sure that all users understand the terms of
use before using them, and take appropriate measures to protect their privacy
if required by law or regulations.
When goals have been reached, it is important to end participation for ethical
reasons. This preserves the agreed use of the data and maintains a humanized and
respectful view of the people whose information has been assembled.
What Is Real Estate Crowdsourcing?
Real estate crowdfunding allows everyday individuals the opportunity to invest in
commercial real estate, purchasing just a portion of a piece of development. It's a
relatively new way to invest in commercial real estate and relieves investors of the
hassle of owning, financing, and managing properties.
Does Netflix Use Crowdsourcing?
Yes. Netflix uses crowdsourcing to help improve its entertainment platform. Most
notably, in 2006, it launched the Netflix Prize competition to see who could improve
Netflix's algorithm for predicting user viewing recommendations and offered the
winner $1 million.
Especially as the nature of work shifts more towards an online, virtual environment,
crowdsourcing provides many benefits for companies that are seeking innovative
ideas from a large group of individuals, hoping to better their products or services. In
addition, crowdsourcing niches from real estate to philanthropy are beginning to
proliferate and bring together communities to achieve a common goal.
Inter- and Trans-Firewall Analytics
Inter-firewall and trans-firewall analytics both involve analyzing network data, but they differ in their scope and methodology:
Inter-firewall analytics
Focus: Analyzes traffic flows between different firewalls within a network.
Limitations: Requires deployment of multiple firewalls within the network and
efficient data exchange mechanisms between them.
Trans-firewall analytics
Focus: Analyzes encrypted traffic that traverses firewalls, which traditional
security solutions may not be able to decrypt and inspect.
Methodology: Inter-firewall analytics analyzes data collected from multiple firewalls, whereas trans-firewall analytics uses deep packet inspection (DPI) and other techniques to analyze encrypted traffic.
The choice between inter-firewall and trans-firewall analytics depends on several
factors, including:
Network size and complexity: Larger and more complex networks benefit
more from inter-firewall analytics for comprehensive monitoring.
Inter-Firewall Analytics:
Use Cases: Inter-firewall analytics are commonly used in large enterprises with
complex network architectures, where data must traverse multiple firewalls to
reach different segments of the network. For example, organizations may
analyze network traffic logs from various firewall appliances to detect and
investigate security incidents, monitor user activities, or optimize network
performance.
Trans-Firewall Analytics:
Use Cases: Trans-firewall analytics are relevant in scenarios where data needs
to be exchanged securely between different organizations, cloud
environments, or remote locations. For instance, organizations may analyze
data exchanged between on-premises systems and cloud-based applications,
conduct threat intelligence sharing with external partners, or monitor traffic
between different branch offices connected via virtual private networks
(VPNs).
Security Considerations: Organizations should apply encryption and secure communication protocols to protect data in transit and prevent unauthorized access or interception.
Overall, inter and trans-firewall analytics are essential for organizations operating in
distributed environments to securely analyze data across network boundaries and
derive actionable insights while maintaining data security, privacy, and compliance.
By implementing robust security measures, optimizing network performance, and
leveraging scalable analytics solutions, organizations can effectively harness the
power of data analytics across firewall boundaries to drive business value and
innovation.
Introduction to NoSQL
NoSQL is a type of database management system (DBMS) that is designed to handle
and store large volumes of unstructured and semi-structured data. Unlike traditional
relational databases that use tables with pre-defined schemas to store data, NoSQL
databases use flexible data models that can adapt to changes in data structures and
are capable of scaling horizontally to handle growing amounts of data.
NoSQL database stands for “Not Only SQL” or “Not SQL.” Though a better term
would be "NoREL," NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in
1998.
Traditional RDBMS uses SQL syntax to store and retrieve data for further insights.
Instead, a NoSQL database system encompasses a wide range of database
technologies that can store structured, semi-structured, unstructured and
polymorphic data.
Some say the term NoSQL stands for "non-SQL," while others say it stands for "not only SQL." Either way, most agree that NoSQL databases are databases that store data in a format other than relational tables.
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response
time becomes slow when you use RDBMS for massive volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading our existing
hardware. This process is expensive.
The alternative for this issue is to distribute database load on multiple hosts
whenever the load increases. This method is known as “scaling out.”
As storage costs rapidly decreased, the amount of data that applications needed to
store and query increased. This data came in all shapes and sizes — structured, semi-
structured, and polymorphic — and defining the schema in advance became nearly
impossible. NoSQL databases allow developers to store huge amounts of
unstructured data, giving them a lot of flexibility.
Additionally, the Agile Manifesto was rising in popularity, and software engineers
were rethinking the way they developed software. They were recognizing the need to
rapidly adapt to changing requirements. They needed the ability to iterate quickly
and make changes throughout their software stack — all the way down to the
database. NoSQL databases gave them this flexibility.
Cloud computing also rose in popularity, and developers began using public clouds
to host their applications and data. They wanted the ability to distribute data across
multiple servers and regions to make their applications resilient, to scale out instead
of scale up, and to intelligently geo-place their data. Some NoSQL databases like
MongoDB provide these capabilities.
1998 - Carlo Strozzi used the term NoSQL for his lightweight, open-source
relational database.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value
data model, where data is stored as a collection of key-value pairs.
8. Performance: NoSQL databases are optimized for high performance and can
handle a high volume of reads and writes, making them suitable for big data
and real-time applications.
NoSQL is a non-relational database that is used to store the data in the nontabular
form. NoSQL stands for not only SQL. The main types are documents, key-value,
wide-column, and graphs.
1. Document-Based Database:
Documents can be stored and retrieved in a form that is much closer to the data
objects used in applications which means less translation is required to use these
data in the applications. In the Document database, the particular elements can be
accessed by using the index value that is assigned for faster querying.
Collections are groups of documents with similar contents. Documents are not
required to belong to a particular collection or to share the same schema,
because document databases have a flexible schema.
No foreign keys: There is no dynamic relationship between two documents
so documents can be independent of one another. So, there is no requirement
for a foreign key in a document database.
[
{
"product_title": "Codecademy SQL T-shirt",
"description": "SQL > NoSQL",
"link": "https://shop.codecademy.com/collections/student-
swag/products/sql-tshirt",
"shipping_details": {
"weight": 350,
"width": 10,
"height": 10,
"depth": 1
},
"sizes": ["S", "M", "L", "XL"],
"quantity": 101010101010,
"pricing": {
"price": 14.99
}
}
]
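As a hedged sketch of how such a document might be stored and queried in a document database (using MongoDB's Python driver; the connection string, database name, and collection name are assumptions), consider:

# Minimal sketch: storing and querying a product document with pymongo.
# Assumes a local MongoDB instance; "shop" and "products" are illustrative names.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

product = {
    "product_title": "Codecademy SQL T-shirt",
    "sizes": ["S", "M", "L", "XL"],
    "pricing": {"price": 14.99},
}
products.insert_one(product)

# Query on a nested field without any predefined schema
for doc in products.find({"pricing.price": {"$lt": 20}}):
    print(doc["product_title"])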
2. Key-Value Stores:
A key-value store holds data as a collection of key-value pairs, where a key serves as the unique identifier used to retrieve the associated value from the database. The values can be simple data types like strings and numbers or complex objects.
A key-value store is like a relational database with only two columns which is the key
and the value.
Simplicity.
Scalability.
Speed.
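A brief sketch of the key-value idea using Redis from Python (assuming a local Redis server; the key and value are illustrative):

# Minimal sketch: key-value access with Redis.
# The value is opaque to the store; here it is a small JSON string.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("session:42", '{"user": "alice", "cart_items": 3}')
print(r.get("session:42"))   # fetched directly by key
r.delete("session:42")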
3. Column-Oriented Databases:
Columnar databases are designed to read data more efficiently and retrieve the data
with greater speed. A columnar database is used to store a large amount of data.
Scalability.
Compression.
Very responsive.
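A hedged sketch of working with a wide-column store using Apache Cassandra's Python driver (the keyspace, table, and connection details are assumptions):

# Minimal sketch: creating and querying a table in Cassandra (a wide-column store).
# Assumes a local Cassandra node and the cassandra-driver package; names are illustrative.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS analytics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS analytics.page_views (
        page text, view_date text, views int,
        PRIMARY KEY (page, view_date)
    )
""")
session.execute(
    "INSERT INTO analytics.page_views (page, view_date, views) VALUES (%s, %s, %s)",
    ("/home", "2024-04-01", 1200),
)
for row in session.execute("SELECT * FROM analytics.page_views WHERE page = '/home'"):
    print(row.page, row.view_date, row.views)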
4. Graph-Based databases:
Graph-based databases focus on the relationship between the elements. It stores the
data in the form of nodes in the database. The connections between the nodes are
called links or relationships.
In a graph-based database, it is easy to identify the relationship between the
data by using the links.
The speed depends upon the number of relationships among the database
elements.
Updating data is also easy, as adding a new node or edge to a graph database
is a straightforward task that does not require significant schema changes.
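A hedged sketch of nodes and relationships using the Neo4j Python driver (the connection URI, credentials, and labels are assumptions):

# Minimal sketch: creating two nodes and a FOLLOWS relationship in Neo4j.
# Assumes a local Neo4j instance and the neo4j package; credentials are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes are the entities; the FOLLOWS edge is the relationship (link) between them.
    session.run(
        "MERGE (a:User {name: $a}) "
        "MERGE (b:User {name: $b}) "
        "MERGE (a)-[:FOLLOWS]->(b)",
        a="alice", b="bob",
    )
    result = session.run("MATCH (a:User)-[:FOLLOWS]->(b:User) RETURN a.name, b.name")
    for record in result:
        print(record["a.name"], "follows", record["b.name"])

driver.close()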
Advantages of NoSQL
There are many advantages of working with NoSQL databases such as MongoDB and
Cassandra. The main advantages are high scalability and high availability.
4. Scalability: NoSQL databases are highly scalable, which means that they can
handle large amounts of data and traffic with ease. This makes them a good fit
for applications that need to handle large amounts of data or traffic.
Disadvantages of NoSQL
NoSQL has the following disadvantages.
Lack of standardization: There are many different types of NoSQL databases, each with its own strengths and weaknesses. This lack of standardization can make it difficult to choose the right database for a specific application.
5. Lack of support for complex queries: NoSQL databases are not designed to
handle complex queries, which means that they are not a good fit for
applications that require complex data analysis or reporting.
6. Lack of maturity: NoSQL databases are relatively new and lack the maturity
of traditional relational databases. This can make them less reliable and less
secure than traditional databases.
8. GUI is not available: Flexible GUI tools for accessing the database are not widely available in the market.
9. Backup: Backup is a weak point for some NoSQL databases like MongoDB, which has no built-in approach for backing up data in a consistent manner.
10. Large document size: Some database systems like MongoDB and CouchDB store data in JSON format. This means that documents are quite large (big data, network bandwidth, speed), and having descriptive key names actually hurts because they increase the document size.
When deciding which database to use, decision-makers typically find that one or more of the factors discussed above, such as flexible schemas, high scalability, and high availability, lead them to select a NoSQL database.
Aggregate data models in NoSQL make it easier for databases to manage data storage over clusters, since the aggregate data or unit can now reside on any of the machines. Whenever data is retrieved from the database, all the data in the aggregate comes along with it.
Aggregate data models in NoSQL don't support ACID transactions spanning multiple aggregates, sacrificing one of the ACID properties; instead, they support atomic manipulation of a single aggregate at a time. With the help of aggregate data models in NoSQL, you can easily perform OLAP operations on the database.
You can achieve high efficiency with aggregate data models in a NoSQL database if the data transactions and interactions take place within the same aggregate.
1. Key-Value Model
The key-value data model uses a key or ID to access or fetch the data of the aggregate corresponding to that key. In this aggregate data model, the aggregate is treated as an opaque value that the database simply looks up and returns by its key.
Use Cases:
These Aggregate Data Models in NoSQL Database are used for storing the
user session data.
Key Value-based Data Models are used for maintaining schema-less user
profiles.
2. Document Model
The document data model allows access to the parts of an aggregate. In this aggregate data model, the data can be accessed and queried in a flexible way. The database stores and retrieves documents, which can be XML, JSON, BSON, etc. There are some restrictions on the data structure and data types of the aggregates used in this model.
Use Cases:
Document Data Models are well suited for Blogging and Analytics platforms.
3. Column-Family Model
The column-family model is an aggregate data model in NoSQL, usually with a big-table-style data model, also referred to as a column store. It is called a two-level map because it offers a two-level aggregate structure: the first level of the column family contains keys that act as row identifiers used to select the aggregate data, whereas the second-level values are referred to as columns.
Use Cases:
Column Family Data Models are used in systems that maintain counters.
These Aggregate Data Models in NoSQL are used for services that have
expiring usage.
4. Graph-Based Model
Graph-based data models store data in nodes that are connected by edges. These
Aggregate Data Models in NoSQL are widely used for storing the huge volumes
of complex aggregates and multidimensional data having many interconnections
between them.
Use Cases:
Graph-based data models are widely used in social networks, recommendation engines, and fraud detection systems, where the relationships between entities matter as much as the entities themselves.
Example: E-Commerce Data Model
This example of an e-commerce data model has two main aggregates: customer and order. The customer aggregate contains data related to billing addresses, while the order aggregate consists of ordered items, shipping addresses, and payments. The payment also contains the billing address.
If you notice, a single logical address record appears three times in the data, but its value is copied each time it is used. The whole address can be copied into an aggregate as needed. There is no pre-defined format for drawing the aggregate boundaries; it depends solely on how you want to manipulate the data as per your requirements.
The data model for customer and order would look like this:
// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
With these aggregate data models in NoSQL, if you want to access a customer along with all of the customer's orders at once, then designing a single aggregate is preferable. But if you want to access a single order at a time, then you should have separate aggregates for each order. It is very context-specific.
In the diagram, the diamond shows how the data fits into the aggregate structure. The aggregate fits the domain where we don't want the shipping and billing addresses to change independently. An aggregate structure may be an obstacle for some data interactions but helps with others.
Aggregate-oriented databases support the atomic manipulation of a single
aggregate at a time.
2. Summary Statistics:
3. Dimensional Modeling:
4. Data Cubes:
Aggregate data models can be represented using data cube structures, where
data is organized into multi-dimensional arrays or matrices. Each dimension of
the cube represents a different attribute or category, and cells within the cube
contain aggregated data values. Data cubes enable efficient multidimensional
analysis and slicing-and-dicing of data along multiple dimensions.
5. Pre-Aggregated Views:
7. Performance Optimization:
Aggregate data models are optimized for query performance, as they allow
organizations to pre-compute and store aggregated data values, use efficient
indexing strategies, and leverage query optimization techniques to accelerate
query processing and analysis. By summarizing data at higher levels of
granularity, aggregate data models can reduce the amount of data that needs to
be processed during query execution, resulting in faster query response times.
Advantage:
It can be used as a primary data source for online applications.
Easy Replication.
Disadvantage:
No standard rules.
Aggregates
In the context of databases and data modeling, aggregates refer to summarized or
aggregated data values derived from underlying raw data. Aggregates are calculated
using aggregate functions, which perform mathematical operations on sets of data
to produce single values or summaries. These aggregated values provide insights
into the overall trends, patterns, and characteristics of the underlying data. Here are
some key points about aggregates:
1. Types of Aggregates:
2. Aggregation Functions:
Aggregate functions, such as COUNT, SUM, AVG, MIN, MAX, and STDDEV, are
used to calculate aggregates from sets of data. These functions operate on
columns or expressions within a database query and produce single values
representing the aggregated result.
4. Aggregate Queries:
6. Performance Optimization:
Techniques such as pre-aggregation, materialized views, and indexing strategies can be used to optimize the performance of aggregate queries and accelerate data analysis.
Overall, aggregates play a crucial role in data analysis, reporting, and decision-
making by summarizing raw data into meaningful and actionable insights. Whether
it's calculating total sales revenue, average customer satisfaction scores, or monthly
website traffic, aggregates help organizations derive value from their data and make
informed business decisions based on aggregated data analysis.
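To make the aggregate functions above concrete, here is a minimal, hedged sketch in Python using the standard-library sqlite3 module; the sales table, its columns, and the sample rows are illustrative assumptions, not part of the original text.

import sqlite3

# In-memory database with a small, hypothetical "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 200.0)],
)

# Aggregate functions collapse many rows into single summary values.
row = conn.execute(
    "SELECT COUNT(*), SUM(amount), AVG(amount), MIN(amount), MAX(amount) FROM sales"
).fetchone()
print(row)  # e.g. (3, 400.0, 133.33..., 80.0, 200.0)

# Grouped aggregates produce one summary row per region.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
):
    print(region, total)

The same COUNT, SUM, AVG, MIN, and MAX functions appear in most SQL and NoSQL aggregation frameworks, differing mainly in syntax.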
Key-value databases are designed for high performance and scalability, and are often
used in situations where the data does not require complex relationships or joins.
They are well suited for storing data that can be easily partitioned, such as caching
data or session data. Key-value databases are simple and easy to use, but they may
not be as suitable for complex queries or data relationships as other types of
databases such as document or relational databases.
Key-Value Data Model in NoSQL
A key-value data model or database is also referred to as a key-value store. It is a non-relational type of database. In this model, an associative array is used as the basic data structure, in which each individual key is linked with just one value in a collection. Keys are unique identifiers for their values, and a value can be any kind of entity. The collection of key-value pairs stored as separate records is called a key-value database, and it does not have a predefined structure.
A key-value store uses an efficient and compact index structure so that it can quickly and reliably find a value using its key. For example, Redis is a key-value store that keeps lists, maps, sets, and primitive types (simple data structures) in a persistent database. Redis exposes a very simple interface for querying and manipulating these value types; by supporting only a predetermined number of value types, and when properly configured, it is capable of high throughput.
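As a hedged sketch of how an application might use a key-value store such as Redis, the following Python snippet uses the redis-py client; the key name, the session payload, and the one-hour expiry are illustrative assumptions and assume a Redis server running locally.

import json
import redis  # redis-py client

# Connect to a local Redis server on the default port.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a user-session aggregate as an opaque value under a single key,
# with an expiry of one hour (common for session data).
session = {"user_id": 1, "cart": ["sql-tshirt"], "theme": "dark"}
r.set("session:1", json.dumps(session), ex=3600)

# Fetch it back by key; the store neither knows nor cares what the value contains.
raw = r.get("session:1")
print(json.loads(raw) if raw else "session expired")

Because the database only ever sees keys and opaque values, lookups stay simple and fast, which is exactly the trade-off described above.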
When to use a key-value database:
A key-value database fits situations such as storing user session data, caching frequently accessed data, and holding other data that can be easily partitioned.
Features:
One of the most un-complex kinds of NoSQL data models.
For storing, getting, and removing data, key-value databases utilize simple
functions.
Advantages:
It is very easy to use. Due to the simplicity of the database, a value can be of any kind, or even of different kinds when required.
Its response time is fast due to its simplicity, provided that the surrounding environment is well built and optimized.
Disadvantages:
Since no standard query language is present in key-value databases, queries cannot be ported from one database to a different database.
The key-value store is not suited to refined queries: you cannot query the database without a key.
Some examples of key-value databases:
Here are some popular key-value databases which are widely used:
Redis: the in-memory key-value store described above.
Couchbase: It permits SQL-style querying and searching for text.
Document Data Model in NoSQL
In the document data model, each document holds key-value pairs; below is an example of such a document.
{
  "Name": "Yashodhra",
  "Address": "Near Patel Nagar",
  "Email": "[email protected]",
  "Contact": "12345"
}
Working of Document Data Model:
This is a data model which works as a semi-structured data model in which the
records and data associated with them are stored in a single document which means
this data model is not completely unstructured. The main thing is that data here is
stored in a document.
Features:
Document Type Model: As we all know data is stored in documents rather
than tables or graphs, so it becomes easy to map things in many
programming languages.
Distributed and Resilient: Document data models are highly distributed, which is what enables horizontal scaling and distribution of data and makes them resilient to failures.
Manageable Query Language: These data models provide a query language that allows developers to perform CRUD (Create, Read, Update, Delete) operations on the data model.
Popular document databases include:
MongoDB
Cosmos DB
ArangoDB
Couchbase Server
CouchDB
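As a hedged sketch of these CRUD operations against one of the document databases listed above (MongoDB), the following Python snippet uses the official pymongo driver; the database name, collection name, and field values are illustrative assumptions.

from pymongo import MongoClient

# Connect to a local MongoDB instance on the default port.
client = MongoClient("mongodb://localhost:27017")
customers = client["shop"]["customers"]

# Create: insert a document; no schema has to be declared up front.
customers.insert_one({"Name": "Yashodhra", "Address": "Near Patel Nagar", "Contact": "12345"})

# Read: query by any field, not just a key.
doc = customers.find_one({"Name": "Yashodhra"})

# Update: add a field that other documents in the collection may not have.
customers.update_one({"Name": "Yashodhra"}, {"$set": {"loyalty_points": 120}})

# Delete: remove the document when it is no longer needed.
customers.delete_one({"Name": "Yashodhra"})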
Advantages:
Schema-less: These are very good at retaining existing data at massive volumes because there are absolutely no restrictions on the format and structure of data storage.
Open formats: Documents are built from simple, open formats such as XML, JSON, and their variants.
Disadvantages:
Weak Atomicity: It lacks support for multi-document ACID transactions. A change in the document data model involving two collections requires us to run two separate queries, one for each collection. This is where it breaks the atomicity requirement.
Security: Nowadays many web applications lack security which in turn results
in the leakage of sensitive data. So it becomes a point of concern, one must
pay attention to web app vulnerabilities.
Applications of the Document Data Model:
Book databases: These are very useful for building book databases because this data model lets us nest related data inside a document.
Catalogs: These data models are widely used for storing and reading catalog files because they read quickly even when a catalog has thousands of attributes stored.
Analytics platforms: These data models are heavily used in analytics platforms.
Document Database VS Key Value
Document databases and key-value databases are both types of NoSQL databases,
but they have some key differences:
1. Data Storage:
2. Querying:
Key-value databases typically have more limited querying capabilities and may
not support advanced search or indexing features.
3. Data Modeling:
Document databases are more flexible in terms of data modeling, and allow
for more complex data structures and relationships.
4. Use cases:
Key-value databases are well suited for storing data that can be easily
partitioned, such as caching data or session data. They are simple and easy to
use, but they may not be as suitable for complex queries or data relationships
as other types of databases.
Difference between Document Databases and Key-Value Databases
Relationships
Defining Relationships for NoSQL Databases
There are two main ways to model relationships in a NoSQL database: embedding and referencing.
Relations are the crux of any database, and relations in NoSQL databases are handled in a completely different way compared to an SQL database. There is one very important difference that you need to keep in mind while building a NoSQL database: NoSQL databases usually have a JSON-like schema. Once you're familiar with that, handling relations will be a lot easier.
Arrays should not grow without bound. If there are more than a couple of hundred documents on the many side, don't embed them; if there are more than a few thousand documents on the many side, don't use an array of ObjectID references.
One to one relation, as the name suggests requires one entity to have an exclusive
relationship with another entity and vice versa. Let’s consider a simple example to
understand this relationship better…
The relationship between a user and his account. One user can have one account
associated with him and one account can have only one user associated with it.
The first and easiest way is to have just one collection, the 'user' collection, where the account of that particular user is stored as an embedded object in the user document itself.
The second way is to create another collection named 'account' and store a reference key (ideally the ID of the account) in the user document.
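The two approaches can be sketched as document shapes. The following Python dictionaries are a minimal, hedged illustration; the field names and values are assumptions rather than anything prescribed by the text.

# Option 1: embedding - the account lives inside the user document.
user_embedded = {
    "_id": 1,
    "name": "Martin",
    "account": {"account_no": "ACC-42", "balance": 250.0},
}

# Option 2: referencing - the account is a separate document in its own
# collection, and the user document only stores its ID.
account = {"_id": "ACC-42", "balance": 250.0}
user_referenced = {"_id": 1, "name": "Martin", "account_id": "ACC-42"}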
You might think why in the world would I ever need to do this?
This way is usually used when one of the following three scenarios occur-
The main document is too large (MongoDB documents have a size limit of 16 MB).
When some sensitive information needs to be stored (you might not want to
return account information on every user GET request).
When there’s an exclusive need for getting the account data without the user
data (when ‘account’ is requested you don’t want to send ‘user’ information
with it and/or when a ‘user’ is requested you don’t want to send ‘account’
information with it, even though both of them are connected)
A one-to-many relation requires one entity to have an exclusive relationship with another entity, but the other entity can have relations with multiple other entities.
Let’s consider a simple example to understand this relationship better…
Consider, a user has multiple accounts, but each account can have a single user
associated with it (think about these accounts as bank accounts, it’ll let you
understand the example better). In this case, again there are two ways to handle it.
The first is to store an array of accounts in the user collection itself. This will let you
GET all the accounts associated with a user in a single call. MongoDB also has
features to push and pull data from an array in a document, which makes it quite
easy to add or remove accounts from the user if need be.
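As a hedged sketch of that push/pull behaviour with the pymongo driver, the snippet below appends and removes an account ID on a user document; the database, collection, user ID, and account ID are assumptions.

from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017")["bank"]["users"]

# Append a new account ID to the user's embedded array of accounts.
users.update_one({"_id": 1}, {"$push": {"accounts": "ACC-43"}})

# Remove the account ID again if the account is closed.
users.update_one({"_id": 1}, {"$pull": {"accounts": "ACC-43"}})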
The second way is to create another collection named ‘account’ and store a reference
key (ideally the ID of the account) in the ‘user’ document. The reasons to do this are
the same as in the case of one to one relations.
One issue with this approach is that when a new account needs to be created for a
particular user, we need to create a new account and also update the existing user
document with the id of this new account (basically requires 2 database calls).
Obviously you can store the user ID in Account collection as well, in that way, you’ll
only need one call to create a new account but it depends on the system you’re
planning to build.
Before building the schema, it’s important that you plan out what kind of calls will be
used more in your system and plan your schema accordingly.
For example, in this case, since this is a bank application (assumption), you know
that most of the calls you’ll make would be getting a single user (while logging in
maybe) and another call to get the accounts associated with that user (when he goes
to the accounts tab maybe) and hence the above schema seems a pretty good one
for this use case. In fact, storing user_id in the accounts’ collection would be an even
better approach in this case.
Now consider another scenario, this time it’s a public forum, users can create posts
and these posts can be viewed by the public. In this case, it’s better to store user_id
in posts collection, instead of storing post_ids in users collection, since you know that
your selling point is the posts list that the users can view and hence the calls you
mostly make would be to get the posts list, with the user data associated with it
(maybe in the homepage itself, like Facebook’s timeline). This way, while updating
you wouldn’t need to update two collections.
Another scenario would be that you need both of them, that is, you need posts in
users’ data as well and users in posts data as well. This will make creating new posts a
bit slow (since you need to add IDs to the users’ collection as well), but getting data
in both cases would be fast.
A many-to-many relation doesn't require either entity to have exclusive relations; both entities can have multiple relations with each other. Let's consider a simple example to understand this relationship better: users and the products they buy.
There'll be two collections, one for users and the other for products. Whenever a user buys a product, add the ID of the product as a reference in the user's document, and since a user can buy multiple products, these IDs need to be stored as an array.
When a product needs to be updated, only that product in the product collection
needs to be updated and every user who has bought the product will automatically
get the updated product.
Obviously, when a user buys a product, a copy should go to another collection (maybe a 'bought_items' collection) so that the purchased record doesn't get updated when the product in the products collection gets updated (since, ideally, you shouldn't make changes to already bought products). These decisions are really about the architecture of the application you're building.
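A minimal sketch of this many-to-many shape, again as plain Python dictionaries with assumed field names:

# Users reference the products they bought by ID; a product can be
# referenced by many users, and a user can reference many products.
product = {"_id": 27, "name": "NoSQL Distilled", "price": 32.45}
user = {"_id": 1, "name": "Martin", "purchased_product_ids": [27]}

# A separate "bought_items" record freezes the product details at purchase
# time, so later edits to the product don't rewrite purchase history.
bought_item = {"user_id": 1, "product_id": 27, "price_paid": 32.45}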
5. Self-Referencing Relationship:
A self-referencing relationship occurs when an entity has a relationship with itself.
This type of relationship is used to represent hierarchical or recursive relationships
within a single entity. For example, in an organizational chart, employees may have
relationships with other employees who are their managers or subordinates.
Graph databases
What is a graph
The term “graph” comes from the field of mathematics. A graph contains a collection
of nodes and edges.
Nodes
Nodes are vertices that store the data objects. Each node can have an
unlimited number and types of relationships.
Edges
Edges represent the relationships between nodes. An edge connects two nodes and has a direction, a start node, and an end node.
Properties
Each node has properties or attributes that describe it. In some cases, edges
have properties as well. Graphs with properties are also called property
graphs.
Graph example
A graph database is a type of NoSQL database that is designed to handle data with
complex relationships and interconnections. In a graph database, data is stored as
nodes and edges, where nodes represent entities and edges represent the
relationships between those entities.
1. Graph databases are particularly well-suited for applications that require deep
and complex queries, such as social networks, recommendation engines, and
fraud detection systems. They can also be used for other types of applications,
such as supply chain management, network and infrastructure management,
and bioinformatics.
2. One of the main advantages of graph databases is their ability to handle and
represent relationships between entities. This is because the relationships
between entities are as important as the entities themselves, and often cannot
be easily represented in a traditional relational database.
4. However, graph databases may not be suitable for all applications. For
example, they may not be the best choice for applications that require simple
queries or that deal primarily with data that can be easily represented in a
traditional relational database. Additionally, graph databases may require
more specialized knowledge and expertise to use effectively.
Some popular graph databases include Neo4j, OrientDB, and ArangoDB. These
databases provide a range of features, including support for different data models,
scalability, and high availability, and can be used for a wide variety of applications.
As we all know the graph is a pictorial representation of data in the form of nodes
and relationships which are represented by edges. A graph database is a type of
database used to represent the data in the form of a graph. It has three components:
nodes, relationships, and properties. These components are used to model the data.
The concept of a graph database is based on the theory of graphs. It was introduced in the year 2000. Graph databases are commonly referred to as NoSQL databases because data is stored using nodes, relationships, and properties instead of traditional tables. A graph
database is very useful for heavily interconnected data. Here relationships between
data are given priority and therefore the relationships can be easily visualized. They
are flexible as new data can be added without hampering the old ones. They are
useful in the fields of social networking, fraud detection, AI Knowledge graphs etc.
Nodes: represent the objects or instances. They are equivalent to a row in a relational database. A node basically acts as a vertex in a graph. Nodes are grouped by applying a label to each member.
Relationships: They are basically the edges in the graph. They have a specific
direction, type and form patterns of the data. They basically establish
relationship between nodes.
Some examples of graph database software are Neo4j, Oracle NoSQL DB, GraphBase, etc., of which Neo4j is the most popular.
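As a hedged sketch of how an application might work with Neo4j from Python, the snippet below uses the official neo4j driver and a small Cypher query; the connection URI, credentials, node labels, and property names are all assumptions.

from neo4j import GraphDatabase

# Connect to a local Neo4j instance (assumed credentials).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two nodes and a directed relationship between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIEND_OF]->(b)",
        a="Alice", b="Bob",
    )
    # Traverse the relationship to find Alice's friends.
    result = session.run(
        "MATCH (:Person {name: $a})-[:FRIEND_OF]->(f) RETURN f.name AS friend",
        a="Alice",
    )
    print([record["friend"] for record in result])

driver.close()

The query walks relationships directly instead of joining tables, which is the property that makes graph databases fast for heavily interconnected data.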
In traditional databases, the relationships between data is not established. But in the
case of Graph Database, the relationships between data are prioritized. Nowadays
mostly interconnected data is used where one data is connected directly or indirectly.
Since the concept of this database is based on graph theory, it is flexible and works
very fast for associative data. Often data are interconnected to one another which
also helps to establish further relationships. It works fast in the querying part as well
because with the help of relationships we can quickly find the desired nodes. join
operations are not required in this database which reduces the cost. The
relationships and properties are stored as first-class entities in Graph Database.
Graph databases allow organizations to connect the data with external sources as
well. Since organizations require a huge amount of data, often it becomes
cumbersome to store data in the form of tables. For instance, if the organization
wants to find a particular data that is connected with another data in another table,
so first join operation is performed between the tables, and then search for the data
is done row by row. But Graph database solves this big problem. They store the
relationships and properties along with the data. So if the organization needs to
search for a particular data, then with the help of relationships and properties the
nodes can be found without joining or without traversing row by row. Thus the
searching of nodes is not dependent on the amount of data.
RDF graph databases store statements as triples that reflect the subject, predicate, and object of a sentence. Every vertex and edge is represented by a URI (Uniform Resource Identifier).
A graph database should be used when the amount of data is large and relationships are present in the data.
Fraud detection
Graph databases are capable of sophisticated fraud prevention. For example,
you can use relationships in graph databases to process financial transactions
in near-real time. With fast graph queries, you can detect that a potential
purchaser is using the same email address and credit card included in a known
fraud case. Graph databases can also help you detect fraud through
relationship patterns, such as multiple people associated with a personal email
address or multiple people sharing the same IP address but residing in
different physical locations.
Recommendation engines
The graph model is a good choice for applications that provide
recommendations. You can store graph relationships between information
categories such as customer interests, friends, and purchase history. You can
use a highly available graph database to make product recommendations to a
user based on which products are purchased by others who have similar
interests and purchase histories. You can also identify people who have a
mutual friend but don’t yet know each other and then make a friendship
recommendation.
Route optimization
Route optimization problems involve analyzing a dataset and finding values
that best suit a particular scenario. For example, you can use a graph database
to find the following:
The right employee for a particular shift by analyzing varied
availabilities, locations, and skills.
Graph queries can analyze these situations much faster because they can
count and compare the number of links between two nodes.
Pattern discovery
Graph databases are well suited for discovering complex relationships and
hidden patterns in data. For instance, a social media company uses a graph
database to distinguish between bot accounts and real accounts. It analyzes
account activity to discover connections between account interactions and bot
activity.
Knowledge management
Graph databases offer techniques for data integration, linked data, and
information sharing. They represent complex metadata or domain concepts in
a standardized format and provide rich semantics for natural language
processing. You can also use these databases for knowledge graphs and
master data management. For example, machine learning algorithms
distinguish between the Amazon rainforest and the Amazon brand using
graph models.
Graph query languages are used to interact with a graph database. Similar
to SQL, the language has features to add, edit, and query data. However, these
languages take advantage of the underlying graph structures to process
complex queries efficiently. They provide an interface so you can ask
questions like:
Value of nodes
Graph algorithms
Clustering
Partitioning
You can partition or cut graphs at the node with the fewest edges.
Applications such as network testing use partitioning to find weak
spots in the network.
Search
Social media companies use graph databases to find the "friends of friends" or products that a user's friends like, and send suggestions to the user accordingly.
Graph databases play a major role in fraud detection. Users can create a graph from the transactions between entities and store other important information. Once created, running a simple query will help identify the fraud.
Master Data Management. Linking all company data to one location for a
single point of reference provides data consistency and accuracy. Master data
management is crucial for large-scale global companies.
Efficient data modeling: Graph databases allow for efficient data modeling by
representing data as nodes and edges. This allows for more flexible and
scalable data modeling than traditional relational databases.
High performance: Graph databases are optimized for handling large and
complex datasets, making them well-suited for applications that require high
levels of performance and scalability.
Easy to use: Graph databases are typically easier to use than traditional
relational databases. They often have a simpler data model and query
language, and can be easier to maintain and scale.
Limited use cases: Graph databases are not suitable for all applications. They
may not be the best choice for applications that require simple queries or that
deal primarily with data that can be easily represented in a traditional
relational database.
Schemaless databases
In a traditional relational database, in order to work, data needs to be heavily formatted and shaped to fit into the table structure. This means sacrificing any undefined details during the save, or storing valuable information outside the database entirely.
A schemaless database, like MongoDB, does not have these up-front constraints,
mapping to a more ‘natural’ database. Even when sitting on top of a data lake, each
document is created with a partial schema to aid retrieval. Any formal schema is
applied in the code of your applications; this layer of abstraction protects the raw
data in the NoSQL database and allows for rapid transformation as your needs
change.
Any data, formatted or not, can be stored in a non-tabular NoSQL type of database.
At the same time, using the right tools in the form of a schemaless database can
unlock the value of all of your structured and unstructured data types.
{ "name": "Joe", "age": 30, "interests": "football" }
{ "name": "Kate", "age": 25 }
As you can see, the data itself normally has a fairly consistent structure. With the
schemaless MongoDB database, there is some additional structure — the system
namespace contains an explicit list of collections and indexes. Collections may be
implicitly or explicitly created — indexes must be explicitly declared.
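A hedged sketch of what that flexibility looks like in practice with the pymongo driver; the collection name, fields, and index are illustrative assumptions.

from pymongo import MongoClient

people = MongoClient("mongodb://localhost:27017")["demo"]["people"]

# Two documents with different shapes can live in the same collection;
# the system namespace records the collection, but no schema is enforced.
people.insert_many([
    {"name": "Joe", "age": 30, "interests": ["football"]},
    {"name": "Kate", "age": 25},
])

# Indexes must be declared explicitly if we want fast lookups by name.
people.create_index("name")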
This flexibility is the basis of the schemaless database's appeal. Let's get granular and weigh the pros and cons of going one way or the other.
With a predefined schema, code is more intelligible, but that rigidity makes altering the schema at a later date a laborious process.
The lack of schema means that your NoSQL database can accept any data
type — including those that you do not yet use. This future-proofs your
database, allowing it to grow and change as your data-driven operations
change and mature.
No data truncation
With NoSQL, you can use whichever data model is best suited to the job.
Graph databases allow you to view relationships between data points, or you
can use traditional wide table views with an exceptionally large number of
columns. You can query, report, and model information however you choose.
And as your requirements grow, you can keep adding nodes to increase
capacity and power.
Schemaless databases are often associated with NoSQL (Not Only SQL)
database systems, which offer various data models, such as document, key-
value, columnar, or graph databases, that are well-suited for schema-less data
storage. These NoSQL data models provide flexible structures for storing and
querying semi-structured or unstructured data without predefined schemas.
5. Query Flexibility:
Schemaless databases support flexible querying that does not depend on a fixed schema. Query languages, such as MongoDB's query language or Elasticsearch's query DSL, support querying and filtering data based on its content and structure.
6. Use Cases:
7. Examples:
Materialized views
What is a Materialized View?
A materialized view is a duplicate data table created by combining data from
multiple existing tables for faster data retrieval. For example, consider a retail
application with two base tables for customer and product data. The customer table
contains information like the customer’s name and contact details, while the product
table contains information about product details and cost. The customer table only
stores the product IDs of the items an individual customer purchases. You have to
cross-reference both tables to obtain product details of items purchased by specific
customers. Instead, you can create a materialized view that stores customer names
and the associated product details in a single temporary table. You can build index
structures on the materialized view for improved data read performance.
The general (Oracle-style) syntax is:
CREATE MATERIALIZED VIEW view_name
BUILD [IMMEDIATE | DEFERRED]
REFRESH [FAST | COMPLETE | FORCE]
ON [COMMIT | DEMAND]
AS <query expression>;
In the above syntax, the BUILD clause decides when to populate the materialized view. It contains two options: IMMEDIATE (populate the view as soon as it is created) and DEFERRED (populate it later, on the first refresh).
The refresh type defines how to update the materialized view. There are three options:
FAST - Materialized view logs are required on the source table in advance; without logs, the creation fails. Only the changes since the last refresh are applied.
COMPLETE - The view is recomputed from scratch by re-running the defining query.
FORCE - A fast refresh is attempted; if it is not possible, a complete refresh is performed.
The ON trigger defines when to update the materialized view. The refresh can be triggered in two ways: ON COMMIT (refresh whenever a transaction on a source table commits) and ON DEMAND (refresh only when explicitly requested).
We have discussed the basic concept of the normal view and materialized view. Now,
let's see the difference between normal view and materialized view.
CREATE MATERIALIZED VIEW user_purchase_summary AS
SELECT
  u.id AS user_id,
  COUNT(*) AS total_purchases,
  SUM(CASE WHEN p.status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled_purchases
FROM users u
JOIN purchases p ON p.user_id = u.id
GROUP BY u.id;
In terms of SQL, all that has changed is the addition of the MATERIALIZED keyword. But when executed, this statement instructs the database to:
1. Run the underlying query against the current data.
2. Store the query results physically, like a table.
3. Save the original query so it knows how to update the materialized view in the future.
Materialized views are particularly valuable when it's important to have low end-to-end latency from when data originates to when it is reflected in a query. As a rough rule of thumb: if using an ordinary view isn't too slow for your end-users, use a view; if building a table with dbt gets too slow, use incremental models in dbt; beyond that, a materialized view is worth considering.
Speed
Read queries scan through different tables and rows of data to gather the
necessary information. With materialized views, you can query data directly
from your new view instead of having to compute new information every time.
The more complex your query is, the more time you will save using a
materialized view.
Materialized views allow you to consolidate complex query logic in one table.
This makes data transformations and code maintenance easier for developers.
It can also help make complex queries more manageable. You can also use
data subsetting to decrease the amount of data you need to replicate in the
view.
Consistency
You can use a materialized view to control who has access to specific data.
You can filter information for users without giving them access to the source
tables. This approach is practical if you want to control who has access to what
data and how much of it they can see and interact with.
If you need to distribute recent data across many locations, like for a remote
workforce, materialized views help. You replicate and distribute data to many
sites using materialized views. The people needing access to data interact with
the replicated data store closest to them geographically.
This system allows for concurrency and decreases network load. It’s an
effective approach with read-only databases.
For example, if you receive data from an external database or through an API,
a materialized view consolidates and helps process it.
Materialized views are helpful for situations where periodic batch processing is
required. For instance, a financial institution might use materialized views to
store end-of-day balances and interest calculations. Or they might store
portfolio performance summaries, which can be refreshed at the end of each
business day.
You define a query that retrieves the desired data from one or more source
tables for creating materialized views. This query may include filtering,
aggregations, joins, and other operations as needed.
The database initially populates the materialized view by running the defined
query against the source data. The result of the query is stored as a physical
table in the database, and this table represents the materialized view.
Full refresh
The entire materialized view is recomputed from scratch by re-running the defining query against the source data.
Incremental refresh
Only the changes in the underlying data are applied to the materialized
view. It can be more efficient than a full refresh when dealing with large
datasets and frequent updates.
On-demand refresh
Some systems allow materialized views to be refreshed on demand,
triggered by specific events or user requests. This gives more control
over when the data is updated, but it requires careful management to
ensure the materialized view remains up-to-date.
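As a hedged sketch of an on-demand refresh, the following Python snippet uses the psycopg2 driver against PostgreSQL, which supports a REFRESH MATERIALIZED VIEW statement; the connection parameters are assumptions, and the view name reuses the user_purchase_summary example above.

import psycopg2

# Connection parameters for a local PostgreSQL instance (assumed).
conn = psycopg2.connect("dbname=shop user=postgres password=secret host=localhost")
with conn, conn.cursor() as cur:
    # Recompute and store the results of the view defined earlier.
    cur.execute("REFRESH MATERIALIZED VIEW user_purchase_summary;")
# The 'with conn' block commits on success; close the connection afterwards.
conn.close()

In practice this statement would be triggered by a scheduler or by application code after a batch of updates, which is what "on-demand refresh" means here.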
Each database management system has distinct methods for creating a materialized
view.
SQL Server: SQL Server uses the name "indexed views," as materialization is a step of creating an index on a regular view. You can only perform basic SQL queries with indexed views, and they update automatically for the user.
A materialized view optimizes query performance by reusing the same sub-query results every time.
The data in a materialized view is not updated frequently; the user needs to update the data manually or by using a trigger clause. This reduces the chance of errors and returns efficient results.
In Snowflake, materialized views are transparent and automatically maintained by a background service.
1. Views are a virtual projection of the base table: the query expressions are stored in the database, but not the resulting data of the query expression. In a materialized view, both the resulting data and the query are saved in physical memory (the database system).
8. Views are more effective when the data is accessed infrequently and the data in the table gets updated on a frequent basis. Materialized views are mostly used when data is accessed frequently and the data is not updated frequently.
You have to create effective rules that trigger updates to ensure your materialized
views remain beneficial. Frequently updating your materialized views may impact
system performance, especially if you are already in a peak period. Additionally,
materialized views also take up a significant amount of space as they replicate data. If
you have a large database that constantly updates, the storage demands of
materialized views will likely be significant.
If you are going to use a materialized view, you need to set clear refresh rules and
schedules. You must also understand how to deal with data inconsistencies, refresh
failures, and the added storage strain.
Amazon Redshift continually monitors the workload using machine learning and
creates new materialized views when they are beneficial. This Automated
Materialized Views (AutoMV) feature in Redshift provides the same performance
benefits of user-created materialized views.
Redshift also monitors previously created AutoMVs and drops them when they are no longer beneficial.
Materialized views contain the results of a specific query that has been
precomputed and stored in the database. These results are typically derived
from one or more base tables or views using aggregation, filtering, or other
data transformation operations.
3. Incremental Maintenance:
4. Query Rewrite:
6. Query Optimization:
7. Storage Overhead:
8. Use Cases:
Distribution models
The primary driver of interest in NoSQL has been its ability to run databases on a
large cluster. As data volumes increase, it becomes more difficult and expensive to
scale up—buy a bigger server to run the database on. A more appealing option is to
scale out—run the database on a cluster of servers. Aggregate orientation fits well
with scaling out because the aggregate is a natural unit to use for distribution.
Depending on your distribution model, you can get a data store that will give you
the ability to handle larger quantities of data, the ability to process a greater read or
write traffic, or more availability in the face of network slowdowns or breakages.
These are often important benefits, but they come at a cost. Running over a cluster
introduces complexity—so it’s not something to do unless the benefits are
compelling.
Broadly, there are two paths to data distribution: replication and sharding.
Replication takes the same data and copies it over multiple nodes. Sharding puts
different data on different nodes. Replication and sharding are orthogonal
techniques: You can use either or both of them. Replication comes in two forms:
master-slave and peer-to-peer. We will now discuss these techniques starting at the
simplest and working up to the more complex: first single-server, then master-slave
replication, then sharding, and finally peer-to-peer replication.
1. Single Server
The first and the simplest distribution option is the one we would most often
recommend—no distribution at all. Run the database on a single machine that
handles all the reads and writes to the data store. We prefer this option because it
eliminates all the complexities that the other options introduce; it’s easy for
operations people to manage and easy for application developers to reason about.
For the rest of this chapter we’ll be wading through the advantages and
complications of more sophisticated distribution schemes. Don’t let the volume of
words fool you into thinking that we would prefer these options. If we can get away
without distributing our data, we will always choose a single-server approach.
2. Sharding
Often, a busy data store is busy because different people are accessing different
parts of the dataset. In these circumstances we can support horizontal scalability by
putting different parts of the data onto different servers—a technique that’s called
sharding (see Figure 4.1).
In the ideal case, we have different users all talking to different server nodes. Each
user only has to talk to one server, so gets rapid responses from that server. The load
is balanced out nicely between servers—for example, if we have ten servers, each one
only has to handle 10% of the load.
Of course the ideal case is a pretty rare beast. In order to get close to it we have to
ensure that data that’s accessed together is clumped together on the same node and
that these clumps are arranged on the nodes to provide the best data access.
The first part of this question is how to clump the data up so that one user mostly
gets her data from a single server. This is where aggregate orientation comes in
really handy. The whole point of aggregates is that we design them to combine data
that’s commonly accessed together—so aggregates leap out as an obvious unit of
distribution.
When it comes to arranging the data on the nodes, there are several factors that can
help improve performance. If you know that most accesses of certain aggregates are
based on a physical location, you can place the data close to where it’s being
accessed. If you have orders for someone who lives in Boston, you can place that
data in your eastern US data center.
Another factor is trying to keep the load even. This means that you should try to
arrange aggregates so they are evenly distributed across the nodes which all get
equal amounts of the load. This may vary over time, for example if some data tends
to be accessed on certain days of the week—so there may be domain-specific rules
you’d like to use.
3. Master-Slave Replication
With master-slave distribution, you replicate data across multiple nodes. One node is
designated as the master, or primary. This master is the authoritative source for the
data and is usually responsible for processing any updates to that data. The other
nodes are slaves, or secondaries. A replication process synchronizes the slaves with
the master (see Figure 4.2).
4. Peer-to-Peer Replication
Master-slave replication helps with read scalability but doesn’t help with scalability of
writes. It provides resilience against failure of a slave, but not of a master. Essentially,
the master is still a bottleneck and a single point of failure. Peer-to-peer replication
(see Figure 4.3) attacks these problems by not having a master. All the replicas have
equal weight, they can all accept writes, and the loss of any of them doesn’t prevent
access to the data store.
5. Combining Sharding and Replication
Replication and sharding are strategies that can be combined. If we use both master-
slave replication and sharding (see Figure 4.4), this means that we have multiple
masters, but each data item only has a single master. Depending on your
configuration, you may choose a node to be a master for some data and slaves for
others, or you may dedicate nodes for master or slave duties.
Using peer-to-peer replication and sharding is a common strategy for column-family
databases. In a scenario like this you might have tens or hundreds of nodes in a
cluster with data sharded over them. A good starting point for peer-to-peer
replication is to have a replication factor of 3, so each shard is present on three
nodes. Should a node fail, then the shards on that node will be built on the other
nodes (see Figure 4.5).
Aggregate oriented databases make distribution of data easier, since the distribution
mechanism has to move the aggregate and not have to worry about related data, as
all the related data is contained in the aggregate.
Replication: Replication copies data across multiple servers, so each bit of data can be found in multiple places. Replication comes in two forms: master-slave and peer-to-peer.
Here are some common distribution models:
1. Direct Distribution:
2. Indirect Distribution:
3. Wholesale Distribution:
4. Retail Distribution:
5. Franchise Distribution:
6. Agency Distribution:
Agency distribution involves appointing agents or representatives to sell
products on behalf of the producer. Agents act as intermediaries who
negotiate sales contracts, handle customer inquiries, and facilitate transactions
on behalf of the producer. Agency distribution is common in industries such
as insurance, real estate, and pharmaceuticals.
7. Online Distribution:
8. Hybrid Distribution:
Sharding
What is database sharding?
Sharding is a method for distributing a single dataset across multiple databases,
which can then be stored on multiple machines. This allows for larger datasets to be
split into smaller chunks and stored in multiple data nodes, increasing the total
storage capacity of the system.
Similarly, by distributing the data across multiple machines, a sharded database can
handle more requests than a single machine can.
Do you need database sharding?
Database sharding, as with any distributed architecture, does not come for free.
There is overhead and complexity in setting up shards, maintaining the data on each
shard, and properly routing requests across those shards. Before you begin sharding,
consider if one of the following alternative solutions will work for you.
Good shard-key selection can evenly distribute data across multiple shards. When
choosing a shard key, database designers should consider the following factors.
Cardinality
Cardinality describes the number of possible values of the shard key. It determines the maximum number of possible shards. For example, if the database designer chooses a yes/no data field as a shard key, the number of shards is restricted to two.
Frequency
Frequency describes how the data is distributed across the possible shard key values. If most records share a few key values, those shards become hotspots and hold far more data than the rest.
Monotonic change
Monotonic change refers to a shard key whose value only ever increases (or decreases) over time. For example, suppose customer feedback is sharded by the number of purchases a customer has made:
Shard A stores feedback from customers who have made 0-10 purchases.
Shard B stores feedback from customers who have made 11-20 purchases.
Shard C stores feedback from customers who have made 21 or more purchases.
As the business grows, customers will make 21 or more purchases. The application stores their feedback in Shard C. This results in an unbalanced shard because Shard C contains more feedback records than the other shards.
However, sharding is one among several other database scaling strategies. Explore
some other techniques and understand how they compare.
1. Vertical scaling
Vertical scaling increases the computing power of a single machine. For example, the
IT team adds a CPU, RAM, and a hard disk to a database server to handle increasing
traffic.
Vertical scaling is less costly, but there is a limit to the computing resources you can
scale vertically. Meanwhile, sharding, a horizontal scaling strategy, is easier to
implement. For example, the IT team installs multiple computers instead of
upgrading old computer hardware.
2. Replication
Replication is a technique that makes exact copies of the database and stores them
across different computers. Database designers use replication to design a fault-
tolerant relational database management system. When one of the computers
hosting the database fails, other replicas remain operational. Replication is a
common practice in distributed computing systems.
Database sharding does not create copies of the same information. Instead, it splits
one database into multiple parts and stores them on different computers. Unlike
replication, database sharding does not result in high availability. Sharding can be
used in combination with replication to achieve both scale and high availability.
3. Partitioning
Database sharding is like horizontal partitioning. Both processes split the database
into multiple groups of unique rows. Partitioning stores all data groups in the same
computer, but database sharding spreads them across different computers.
Depending on your use case, it may make more sense to simply shift a subset of the
burden onto other providers or even a separate database. For example, blob or file
storage can be moved directly to a cloud provider such as Amazon S3. Analytics or
full-text search can be handled by specialized services or a data warehouse.
Offloading this particular functionality can make more sense than trying to shard
your entire database.
Once a logical shard is stored on another node, it is known as a physical shard. One
physical shard can hold multiple logical shards. The shards are autonomous and
don't share the same data or computing resources. That's why they exemplify a
shared-nothing architecture. At the same time, the data in all the shards represents a
logical data set.
Horizontal sharding. When each new table has the same schema but unique
rows, it is known as horizontal sharding. In this type of sharding, more
machines are added to an existing stack to spread out the load, increase
processing speed and support more traffic. This method is most effective
when queries return a subset of rows that are often grouped together.
Vertical sharding. When each new table has a schema that is a faithful subset
of the original table's schema, it is known as vertical sharding. It is effective
when queries usually return only a subset of columns of the data.
The following illustrates how new tables look when both horizontal and vertical
sharding are performed on the same original data set.
Horizontal shards

Shard 1
Student ID  Name  Age  Major      Hometown
1           Amy   21   Economics  Austin

Shard 2
Student ID  Name  Age  Major    Hometown
2           Jack  20   History  San Francisco

Vertical shards

Shard 1
Student ID  Name  Age
1           Amy   21
2           Jack  20

Shard 2
Student ID  Major
1           Economics
2           History

Shard 3
Student ID  Hometown
1           Austin
2           San Francisco
If the computer hosting the database fails, the application that depends on
the database fails too. Database sharding prevents this by distributing parts of
the database into different computers. Failure of one of the computers does
not shut down the application because it can operate with other functional
shards. Sharding is also often done in combination with data replication
across shards. So, if one shard becomes unavailable, the data can be accessed
and restored from an alternate shard.
Scale efficiently
Advantages of sharding
Sharding allows you to scale your database to handle increased load to a nearly
unlimited degree by providing increased read/write throughput, storage capacity,
and high availability. Let’s look at each of those in a little more detail.
Increased storage capacity — similarly, by increasing the number of shards,
you can also increase overall total storage capacity, allowing near-infinite
scalability.
High availability — finally, shards provide high availability in two ways. First,
since each shard is a replica set, every piece of data is replicated. Second, even
if an entire shard becomes unavailable since the data is distributed, the
database as a whole still remains partially functional, with part of the schema
on different shards.
Disadvantages of sharding
Sharding does come with several drawbacks, namely overhead in query result
compilation, complexity of administration, and increased infrastructure costs.
First, how will the data be distributed across shards? This is the fundamental question
behind any sharded database. The answer to this question will have effects on both
performance and maintenance. More detail on this can be found in the “Sharding
Architectures and Types” section.
Second, what types of queries will be routed across shards? If the workload is
primarily read operations, replicating data will be highly effective at increasing
performance, and you may not need sharding at all. In contrast, a mixed read-write
workload or even a primarily write-based workload will require a different
architecture.
Finally, how will these shards be maintained? Once you have sharded a database,
over time, data will need to be redistributed among the various shards, and new
shards may need to be created. Depending on the distribution of data, this can be an
expensive process and should be considered ahead of time.
After a database is sharded, the data in the new tables is spread across multiple
systems, but with partitioning, that is not the case. Partitioning groups data subsets
within a single database instance.
1. Ranged/dynamic sharding
Ranged sharding, or dynamic sharding, takes a field on the record as an input and,
based on a predefined range, allocates that record to the appropriate shard. Ranged
sharding requires there to be a lookup table or service available for all queries or
writes. For example, consider a set of data with IDs that range from 0-50. A simple
lookup table might look like the following:
Range Shard ID
[0, 20) A
[20, 40) B
[40, 50] C
The field on which the range is based is also known as the shard key. Naturally, the
choice of shard key, as well as the ranges, are critical in making range-based
sharding effective. A poor choice of shard key will lead to unbalanced shards, which
leads to decreased performance. An effective shard key will allow for queries to be
targeted to a minimum number of shards. In our example above, if we query for all
records with IDs 10-30, then only shards A and B will need to be queried.
Two key attributes of an effective shard key are high cardinality and well-
distributed frequency. Cardinality refers to the number of possible values of that key.
If a shard key only has three possible values, then there can only be a maximum of
three shards. Frequency refers to the distribution of the data along the possible
values. If 95% of records occur with a single shard key value then, due to this
hotspot, 95% of the records will be allocated to a single shard. Consider both of
these attributes when selecting a shard key.
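A minimal Python sketch of range-based routing using the lookup table above; the shard labels mirror the example, and the half-open ranges are an implementation assumption.

# Ranges match the lookup table: [0, 20) -> A, [20, 40) -> B, [40, 50] -> C.
RANGES = [((0, 20), "A"), ((20, 40), "B"), ((40, 51), "C")]

def shard_for_id(record_id: int) -> str:
    # Return the shard that owns a record, based on its ID range.
    for (low, high), shard in RANGES:
        if low <= record_id < high:
            return shard
    raise ValueError(f"ID {record_id} is outside all configured ranges")

print(shard_for_id(10))  # A
print(shard_for_id(35))  # B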
2. Algorithmic/hashed sharding
The function can take any subset of values on the record as inputs. Perhaps the
simplest example of a hash function is to use the modulus operator with the number
of shards, as follows:
Name   Shard
John   1
Jane   2
Paulo  1
Wang   2
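A hedged Python sketch of modulus-based hashed sharding; using MD5 as the stable hash and two shards are assumptions for illustration (the built-in hash() is avoided because it is not stable across processes).

import hashlib

NUM_SHARDS = 2

def hashed_shard(shard_key: str) -> int:
    # Map a shard key to a shard number using a stable hash and the modulus operator.
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS + 1  # shards numbered from 1

for name in ["John", "Jane", "Paulo", "Wang"]:
    print(name, hashed_shard(name))

The exact shard assignments depend on the hash function, so they may differ from the table above; what matters is that the same key always maps to the same shard.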
First, query operations for multiple records are more likely to get distributed across
multiple shards. Whereas ranged sharding reflects the natural structure of the data
across shards, hashed sharding typically disregards the meaning of the data. This is
reflected in increased broadcast operation occurrence.
Second, resharding can be expensive. Any update to the number of shards likely requires rebalancing all shards by moving records around. It will be difficult to do this while avoiding a system outage.
3. Entity-/relationship-based sharding
Entity- or relationship-based sharding keeps related data together on a single shard. For instance, consider the case of a shopping database with users and payment methods. Each user has a set of payment methods that is tied tightly to that user. As such, keeping related data together on the same shard can reduce the need for broadcast operations, increasing performance.
4. Geography-based sharding
Geography-based sharding, or geosharding, also keeps related data together on a
single shard, but in this case, the data is related by geography. This is essentially
ranged sharding where the shard key contains geographic information and the
shards themselves are geo-located.
For example, consider a dataset where each record contains a “country” field. In this
case, we can both increase overall performance and decrease system latency by
creating a shard for each country or region, and storing the appropriate data on that
shard. This is a simple example, and there are many other ways to allocate your
geoshards which are beyond the scope of this article.
Name   Location
John   California
Jane   Washington
Paulo  Arizona
5. Directory sharding
Directory sharding uses a lookup table that maps each value of the shard key to a specific shard. For example, a clothing application might shard by the color of an item:
Color    Shard ID
Blue     A
Red      B
Yellow   C
Black    D
When an application stores clothing information in the database, it refers to the
lookup table. If a dress is blue, the application stores the information in the
corresponding shard.
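A small Python sketch of directory-based routing using the color lookup table above; the error handling is an assumption.

# Directory (lookup table) mapping shard-key values to shards.
DIRECTORY = {"Blue": "A", "Red": "B", "Yellow": "C", "Black": "D"}

def shard_for_color(color: str) -> str:
    # Look up which shard stores items of a given color.
    try:
        return DIRECTORY[color]
    except KeyError:
        # Unknown colors must be registered in the directory before routing.
        raise ValueError(f"No shard registered for color {color!r}")

print(shard_for_color("Blue"))  # A

Because the mapping lives in one place, shards can be reassigned by editing the directory, at the cost of the lookup service becoming a potential bottleneck or single point of failure.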
Sharding Architectures
1. Key Based Sharding
Here, we take the value of an entity such as a customer ID, customer email, client IP address, zip code, etc., and use this value as the input to a hash function.
This process generates a hash value which is used to determine the shard on which the data should be stored.
Keep in mind that the values fed into the hash function should all come from the same column (the shard key), so that data is placed consistently and predictably.
Basically, the shard key acts like a primary key or a unique identifier for individual rows.
Suppose you have 3 database servers and each request carries an application id that is incremented by 1 every time a new application is registered. The hash function here can be as simple as the application id modulo 3, which yields the server on which the data should be placed (see the sketch below).
The downside of this method is elastic load balancing: if you try to add or remove database servers dynamically, it becomes a difficult and expensive process, because most keys have to be remapped.
A shard key shouldn't contain values that might change over time. It should always be static; otherwise, remapping the affected rows will slow down performance.
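The sketch below (invented application ids, three servers) makes the rebalancing downside concrete: when the modulus-based hash is recomputed for four servers instead of three, most records map to a different shard and would have to be moved.

def shard_for(application_id: int, num_servers: int) -> int:
    """Key-based sharding with a simple modulus hash function."""
    return application_id % num_servers

ids = range(1, 11)
before = {app_id: shard_for(app_id, 3) for app_id in ids}   # 3 servers
after = {app_id: shard_for(app_id, 4) for app_id in ids}    # one server added

moved = [app_id for app_id in ids if before[app_id] != after[app_id]]
print(f"{len(moved)} of {len(ids)} records change shards after adding a server")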
Disadvantages of Key Based Sharding:
Choosing the right key may require a deep understanding of the data and query patterns, and poor choices may lead to suboptimal performance.
2. Range Based Sharding

In this method, we split the data based on the ranges of a given value inherent in each entity.
Let's say you have a database of your online customers' names and email information.
You can split this information into two shards: in one shard, keep the info of customers whose first names start with A through P, and in the other shard, keep the information of the rest of the customers.
Scalability:
Improved Performance:
3. Vertical Sharding
In this method, we split the entire column from the table and we put those
columns into new distinct tables.
On Twitter, users have a profile, a number of followers, and the tweets they have posted. We can place the user profiles on one shard, the followers on a second shard, and the tweets on a third shard.
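A minimal sketch of this column split, with plain dictionaries standing in for the three separate shards (the field names and sample user are invented):

# Vertical sharding: each group of columns lives in its own "shard".
profile_shard = {}    # user_id -> profile columns
follower_shard = {}   # user_id -> follower count
tweet_shard = {}      # user_id -> list of tweets

def save_user(user_id, name, bio, followers, tweets):
    """Write each group of columns to its own shard."""
    profile_shard[user_id] = {"name": name, "bio": bio}
    follower_shard[user_id] = followers
    tweet_shard[user_id] = list(tweets)

save_user("u1", "Jane", "data engineer", 1520, ["hello", "big data!"])

# Reading just the profile touches one shard and skips follower/tweet data.
print(profile_shard["u1"])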
Advantages of Vertical Sharding:
Query Performance:
Simplified Queries:
4. Directory-Based Sharding
In this method, we create and maintain a lookup service or lookup table for the original database.
Basically, we use a shard key in the lookup table and map each entity in the database to the shard that holds it.
This way we keep track of which database shards hold which data.
The lookup table holds a static set of information about where specific data can be found. For example, the delivery zone can be used as the shard key:
First, the client application queries the lookup service to find out the shard (database partition) on which the data is placed.
Once the lookup service returns the shard, the application queries or updates that shard.
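Here is a hedged sketch of that two-step flow in Python. The zone names, shard ids, and the in-memory dictionaries standing in for the lookup service and the shards are all invented for illustration.

# Directory-based sharding: a lookup table maps the delivery-zone shard key
# to a shard; the client consults it before touching the data.
LOOKUP_TABLE = {
    "north": "shard-1",
    "south": "shard-2",
    "east": "shard-3",
    "west": "shard-4",
}

SHARDS = {shard_id: {} for shard_id in LOOKUP_TABLE.values()}

def write_order(order_id: str, delivery_zone: str, payload: dict) -> None:
    shard_id = LOOKUP_TABLE[delivery_zone]   # step 1: ask the lookup service
    SHARDS[shard_id][order_id] = payload     # step 2: update that shard

write_order("o-100", "east", {"item": "keyboard"})
print(SHARDS["shard-3"])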
This flexibility facilitates efficient load balancing and adaptation to
changing data patterns.
Dynamic Scalability:
Increased Latency:
o In an unsharded database, read and write queries become slower and the network bandwidth starts to saturate. Database sharding addresses these issues by partitioning the data across multiple machines.
High Availability:
o If one shard goes down, all the other shards continue to operate, so the entire application doesn't become unavailable to users.
o In an unsharded database, a query may have to search every row in the table, which slows down the response time.
o In a sharded database, a query has to go through fewer rows, so you receive the response in less time.
Scaling Out:
o Sharding is a complicated task, and if it's not implemented properly, you may lose data or end up with corrupted tables in your database.
o You also need to manage data across multiple shard locations, which may affect your team's workflow.
Rebalancing Data:
One shard stores the names of customers beginning with the letters A through M; another shard stores the names beginning with N through Z.
If there are many users whose names begin with the letter L, then shard one will have more data than shard two. This slows down the application, and it will stall out for a significant portion of your users.
o To overcome this problem and rebalance the data, you need to re-shard so that data is distributed evenly.
o In a sharded architecture, you need to pull data from different shards and perform joins across multiple networked servers; you can't submit a single query to get data from various shards.
o You need to submit a separate query for each shard, which adds latency to your system.
No Native Support:
o Because many database engines do not support sharding natively, it can be difficult to find tips or documentation and to troubleshoot problems while implementing sharding.
1. Data hotspots
Some of the shards become unbalanced due to the uneven distribution of data.
For example, a single physical shard that contains customer names starting with A
receives more data than others. This physical shard will use more computing
resources than others.
Solution
You can distribute data evenly by using optimal shard keys. Some datasets are
better suited for sharding than others.
2. Operational complexity
Solution
In the AWS database portfolio, database setup and operations have been
automated to a large extent. This makes working with a sharded database
architecture a more streamlined task.
3. Infrastructure costs
Organizations pay more for infrastructure costs when they add more computers
as physical shards. Maintenance costs can add up if you increase the number of
machines in your on-premises data center.
Solution
Developers use Amazon Elastic Compute Cloud (Amazon EC2) to host and scale
shards in the cloud. You can save money by using virtual infrastructure that AWS
fully manages.
4. Application complexity
Most database management systems do not have built-in sharding features. This
means that database designers and software developers must manually split,
distribute, and manage the database.
Solution
You can migrate your data to the appropriate AWS purpose-built databases,
which have several built-in features that support horizontal scaling.
Version
Here are some key aspects of versioning in big data:
1. Data Versioning:
Data versioning involves tracking different versions of data over time. Each
version represents a snapshot of the data at a specific point in time, capturing
changes, updates, or modifications made to the data set. Data versioning
enables users to track the history of changes, revert to previous versions if
needed, and maintain data lineage for auditability and compliance purposes.
2. Schema Evolution:
In big data systems, data schemas may evolve over time due to changes in
business requirements, data sources, or application logic. Versioning enables
the management of schema changes and evolution by tracking different
versions of data schemas and ensuring compatibility with older versions of the
data.
3. Version Control:
4. Immutable Data:
5. Metadata Management:
6. Data Lineage:
Data lineage refers to the end-to-end tracking of data from its source to its
destination and through various processing steps and transformations.
Versioning helps maintain data lineage by tracking changes and
transformations applied to the data over time, allowing users to trace back to
the original source and understand how the data has been modified or
transformed.
7. Data Reproducibility:
Hadoop Version
DataFlow supports HDFS, Hadoop's fault-tolerant distributed file system. If you want to run DataFlow in a distributed cluster, then we recommend that you use a distributed file system such as HDFS.
The following distributions and versions of Hadoop are supported for use with DataFlow:
HBase Version
DataFlow provides both a reader and writer for accessing HBase, a scalable database
built using Hadoop. The HBase support in DataFlow works with:
Note: For more information about supported Hadoop and HBase distributions,
see Hadoop Module Configurations.
Hive Version
DataFlow provides the following readers and writers for Hive: the ORCReader, ORCWriter, and ParquetReader operators.
Data versioning is the process of storing corresponding versions of data that were
created or modified at different time intervals.
There are many valid reasons for making changes to a dataset. Data specialists can
test the machine learning (ML) models to increase the success rate of the project. For
this, they need to make important manipulations on the data. Datasets may also be updated over time due to the continuous inflow of data from different sources. In the end, keeping older versions of data can help organizations replicate a previous environment.
Additionally, with the help of data versioning, historical datasets are saved and kept
in the databases. This aspect provides some advantages as follows.
In today's digital world, the companies that make decisions and develop strategies using data are the ones that survive. Therefore, it is important not to lose historical data.
In the end, it gives companies a new business metric to measure their success
or performance.
Data versioning can help in such situations by ensuring that snapshots of the data are stored at specific points in time. It can also assist organizations in meeting regulatory and compliance requirements.
With LakeFS, every process — from complex ETL processes to data analytics and
machine learning steps — can be transformed into automatic and easy-to-track data
science projects. Some prominent features of lakeFS are:
Supports cloud solutions like AWS S3, Google Cloud Storage, and Microsoft
Azure Blob Storage
Works easily with most modern big data frameworks and technologies such as
Hadoop, Spark, Kafka, etc.
Provides Git-like operations such as branch, commit, and merge, and scales to petabytes of data with the power of cloud solutions
Can be deployed in the cloud or on-premises, using any API compatible with S3 storage
To run a LakeFS session on your local computer, please make sure you install Docker and Docker Compose version 1.25.04 or higher. After starting LakeFS with Docker, open the setup page in your browser:
http://127.0.0.1:8000/setup
For this step, choose a Username and save your credentials, the Key ID and Secret Key. Log in to your admin user profile with this information.
Click the Create Repository button in the admin panel and enter the
Repository ID, Storage Namespace, and Default Branch values. After that,
press Create Repository. Your initial repository has been created.
LakeFS sessions can be used with the AWS CLI because LakeFS has an S3-compatible API. But please make sure the AWS CLI is installed on your local computer.
To see whether the connection works and to list all the repositories in the
workspace, type the following command in the terminal:
Now, the tweets.txt file has been written to the main branch of the demo-repo
repository. Please check it on the LakeFS UI.
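Because the gateway speaks the S3 protocol, the same upload can also be sketched from Python with boto3. In the hedged example below, the endpoint URL, Key ID, and Secret Key are placeholders for the values created during setup, and the object key assumes lakeFS's convention of addressing objects as repository/branch/path.

import boto3

# Point a standard S3 client at the local LakeFS gateway (placeholders below).
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:8000",
    aws_access_key_id="<KEY_ID>",
    aws_secret_access_key="<SECRET_KEY>",
)

# The repository acts as the bucket; the key starts with the branch name.
s3.upload_file("tweets.txt", "demo-repo", "main/tweets.txt")

# List what is currently on the main branch of demo-repo.
response = s3.list_objects_v2(Bucket="demo-repo", Prefix="main/")
for obj in response.get("Contents", []):
    print(obj["Key"])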
Thanks to LakeFS, the changes made to data can be committed using LakeFS’s
default CLI client lakectl. Please make sure you have installed the latest version of the
CLI binary on your local computer.
To configure the CLI binary settings, type the following command in the
terminal:
$ lakectl config
To verify the configuration of lakectl, you can list all the branches in the
repository with the following command:
Finally, to check the committed message, type the following command in the
terminal:
Training data may take up a lot of space in Git repositories. This is because Git was designed to track changes in text files rather than big binary files. If a team's training data sets include big audio or video files, this might lead to a slew of issues down the road. Each modification to the training data set will frequently result in a duplicated data set in the repository's history. Not only does this bloat the repository, but it also makes cloning and rebasing extremely sluggish.
Multiple Users
One method for data versioning is to save versions to your PC manually. File
versioning is helpful for:
Small businesses: Businesses with only a few data engineers or scientists operating in the same area.
Individual work: When a task is not suited to collaboration and several people are not working together toward a common goal.
Aside from file versioning, specialized tools are available. You have the option of developing your own software or adopting an existing tool. DVC, Delta Lake, and Pachyderm are among the tools that provide such capabilities.
Data versioning systems are better suited for businesses that require:
DVC
Delta Lake
The technology is more akin to a data lake abstraction layer, filling in the gaps
left by typical data lakes.
Git LFS
Git LFS is a Git extension created by a group of open-source volunteers. By employing pointers instead of files, the programme avoids storing big files (e.g., images and data sets) directly in your repository.
The pointers are lightweight and reference the actual files, which are kept in a separate large-file store. As a result, when you push your repo to the central repository, it updates quickly and takes up less space.
Pachyderm
Pachyderm is one of the list’s few data science platforms. The goal of
Pachyderm is to provide a platform that makes it simple to replicate the
outcomes of machine learning models by controlling the complete data
process. Pachyderm is known as “the Docker of data” in this context.
Pachyderm has agreed to its Data Science Bill of Rights, which describes the
product’s core goals: reproducibility, data provenance, collaboration,
incrementality, and autonomy, as well as infrastructure abstraction.
These pillars drive many of its features, allowing teams to utilize the platform
entirely.
Dolt
Dolt is a SQL database that supports Git-style versioning. Unlike Git, which
allows you to version files, Dolt will enable you to version tables. This means
you may update and modify data without fear of losing the changes.
While the programme is currently in its early stages, there are hopes to make
it fully Git and MySQL compatible shortly.
LakeFS
LakeFS enables teams to create data lake activities that are repeatable, atomic,
and versioned. It’s a newbie on the scene, but it delivers a powerful punch. It
offers a Git-like branching and version management methodology designed
to operate with your data lake and scale to Petabytes of data.
It delivers ACID compliance to your data lake the same way as Delta Lake.
However, LakeFS supports both AWS S3 and Google Cloud Storage as
backends, so you don’t have to use Spark to get the benefits.
You don’t necessarily have to put in a lot of work to manage your data to reap
the benefits of data versioning. For example, much of data versioning is
intended to aid in the tracking of data sets that change significantly over time.
Some data, such as web traffic, is simply appended to. That is, data is added
but seldom, if ever, updated. This implies that the only data versioning needed
to get reproducible results is the start and finish dates. This is significant
because, in such circumstances, you may be able to bypass all of the tools
mentioned above.
Map Reduce
MapReduce is a big data analysis model that processes data sets using a parallel
algorithm on computer clusters, typically Apache Hadoop clusters or cloud systems
like Amazon Elastic MapReduce (EMR) clusters.
What is MapReduce?
MapReduce is a Java-based, distributed execution framework within the Apache
Hadoop Ecosystem. It takes away the complexity of distributed programming by
exposing two processing steps that developers implement: 1) Map and 2) Reduce. In
the Mapping step, data is split between parallel processing tasks, and transformation logic can be applied to each chunk of data. Once that completes, the Reduce phase takes over to aggregate the data produced by the Map step. In general, MapReduce uses the Hadoop Distributed File System (HDFS) for both input and output. However, some technologies built on top of it, such as Sqoop, allow access to relational systems.
History of MapReduce
MapReduce was developed in the walls of Google back in 2004 by Jeffery Dean and
Sanjay Ghemawat of Google (Dean & Ghemawat, 2004).
The model was inspired by the map and reduce functions commonly used in functional programming. At that time, Google's proprietary MapReduce system ran on the
Google File System (GFS). By 2014, Google was no longer using MapReduce as their
primary big data processing model. MapReduce was once the only method through
which the data stored in the HDFS could be retrieved, but that is no longer the case.
Today, there are other query-based systems such as Hive and Pig that are used to
retrieve data from the HDFS using SQL-like statements that run along with jobs
written using the MapReduce model.
Why MapReduce?
Traditional Enterprise Systems normally have a centralized server to store and
process data. The following illustration depicts a schematic view of a traditional
enterprise system. The traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines
those data tuples (key-value pairs) into a smaller set of tuples.
Let us now take a close look at each of the phases and try to understand their
significance.
Input Phase − Here we have a Record Reader that translates each record in
an input file and sends the parsed data to the mapper in the form of key-value
pairs.
Shuffle and Sort − the Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the
Reducer is running. The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys together so that their
values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and
runs a Reducer function on each one of them. Here, the data can be
aggregated, filtered, and combined in a number of ways, and it requires a
wide range of processing. Once the execution is over, it gives zero or more
key-value pairs to the final step.
Let us try to understand the two tasks, Map and Reduce, with the help of a small diagram −
MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter
receives around 500 million tweets per day, which is nearly 3000 tweets per second.
The following illustration shows how Twitter manages its tweets with the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following
actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-
value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the
filtered maps as key-value pairs.
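The whole tokenize, filter, count, and aggregate pipeline can be simulated in a few lines of Python. In this hedged sketch, the tweets, the stop-word list, and the function names are invented; it only mirrors the phases described above.

from collections import defaultdict

STOP_WORDS = {"a", "the", "is", "to"}  # stand-in for the "unwanted words" filter

def map_phase(tweet: str):
    """Tokenize a tweet, filter unwanted words, and emit (word, 1) pairs."""
    for token in tweet.lower().split():
        if token not in STOP_WORDS:
            yield (token, 1)

def shuffle_and_sort(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Aggregate the counters for one key."""
    return (key, sum(values))

tweets = ["big data is big", "mapreduce is a model to process big data"]
intermediate = [pair for tweet in tweets for pair in map_phase(tweet)]
grouped = shuffle_and_sort(intermediate)
print([reduce_phase(key, values) for key, values in grouped.items()])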
MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing over large data sets in a distributed manner. The data is first split and then combined to produce the final result. Libraries for MapReduce have been written in many programming languages, with a variety of different optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, which lowers the overhead on the cluster network and reduces the processing power required. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to MapReduce for processing. There can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wants to perform, which is made up of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
In MapReduce, we have a client. The client will submit the job of a particular size to
the Hadoop MapReduce Master. Now, the MapReduce master will divide this job into
further equivalent job-parts. These job-parts are then made available for the Map
and Reduce Task. This Map and Reduce task will contain the program as per the
requirement of the use-case that the particular company is solving. The developer
writes their logic to fulfill the requirement that the industry requires. The input data
which we are using is then fed to the Map Task and the Map will generate
intermediate key-value pair as its output. The output of Map i.e. these key-value
pairs are then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as per the requirement. The Map and Reduce algorithms are written in a highly optimized way so that time and space complexity are kept to a minimum.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce
phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the Map may itself be a key-value pair, where the key can be an identifier such as an address id and the value is the actual data it holds. The Map() function is executed in its memory repository on each of these input key-value pairs and generates intermediate key-value pairs, which work as the input for the Reducer, i.e. the Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs, as per the reducer algorithm written by the developer.
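One common way to run exactly this Map and Reduce split on a Hadoop cluster is the Hadoop Streaming utility, which pipes records through any executable via standard input and output. The two Python scripts below are a hedged word-count sketch (the file names mapper.py and reducer.py are illustrative); the reducer relies on the framework delivering the intermediate pairs already sorted by key.

# mapper.py - Map phase: read raw lines from stdin, emit tab-separated pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - Reduce phase: equal keys arrive as a contiguous, sorted run.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0
    total += int(value)

if current_key is not None:
    print(f"{current_key}\t{total}")

Locally, the same pipeline can be approximated by piping the mapper's output through a sort step into the reducer, which mimics the framework's shuffle-and-sort.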
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of Job tracker is to manage all the resources and all
the jobs across the cluster and also to schedule each map on the Task Tracker
running on the same data node since there can be hundreds of data nodes
available in the cluster.
2. Task Tracker: The Task Trackers can be considered the worker processes that act on the instructions given by the Job Tracker. A Task Tracker is deployed on each node in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
Parallelization: MapReduce enables parallel processing of large datasets
across distributed clusters, allowing computations to be performed in parallel
on multiple nodes.
1. Highly scalable
By adding servers to the cluster, we can simply grow the amount of storage
and computing power. We may improve the capacity of nodes or add any
number of nodes (horizontal scalability) to attain high computing power.
Organizations may execute applications from massive sets of nodes,
potentially using thousands of terabytes of data, thanks to Hadoop
MapReduce programming.
2. Versatile
3. Secure
The MapReduce programming model uses the HBase and HDFS security
approaches, and only authenticated users are permitted to view and
manipulate the data. HDFS uses a replication technique in Hadoop 2 to
provide fault tolerance. Depending on the replication factor, it makes a clone
of each block on the various machines. One can therefore access data from
the other devices that house a replica of the same data if any machine in a
cluster goes down. Erasure coding has taken the place of this replication technique in Hadoop 3. Erasure coding delivers the same level of fault tolerance with less storage space; the storage overhead with erasure coding is under 50%.
4. Affordability
5. Fast-paced
Java programming is simple to learn, and anyone can create a data processing
model that works for their company. Hadoop is straightforward to utilize
because customers don’t need to worry about computing distribution. The
framework itself does the processing.
7. Parallel processing-compatible
8. Reliable
The same set of data is transferred to some other nodes in a cluster each time
a collection of information is sent to a single node. Therefore, even if one
node fails, backup copies are always available on other nodes that may still be
retrieved whenever necessary. This ensures high data availability.
9. Highly available
Hadoop’s fault tolerance feature ensures that even if one of the DataNodes
fails, the user may still access the data from other DataNodes that have copies
of it. Moreover, a high-availability Hadoop cluster comprises two or more NameNodes, active and passive, running in hot standby. The active NameNode serves requests, while a passive NameNode is a backup node that applies the changes recorded in the active NameNode's edit logs to its own namespace.
Uses of MapReduce
1. Entertainment
Hadoop MapReduce assists end users in finding the most popular movies based on
their preferences and previous viewing history. It primarily concentrates on their
clicks and logs.
Various OTT services, including Netflix, regularly release many web series and movies.
It may have happened to you that you couldn’t pick which movie to watch, so you
looked at Netflix’s recommendations and decided to watch one of the suggested
series or films. Netflix uses Hadoop and MapReduce to indicate to the user some
well-known movies based on what they have watched and which movies they enjoy.
MapReduce can examine user clicks and logs to learn how they watch movies.
2. E-commerce
E-commerce platforms use MapReduce to generate item proposals for their inventory and to analyze website records, purchase histories, and user interaction logs for product recommendations.
3. Social media
Nearly 500 million tweets, or about 3000 per second, are sent daily on the
microblogging platform Twitter. MapReduce processes Twitter data, performing
operations such as tokenization, filtering, counting, and aggregating counters.
Filtering: The terms that are not wanted are removed from the token maps.
4. Data warehouse
Systems that handle enormous volumes of information are known as data warehouse
systems. The star schema, which consists of a fact table and several dimension tables,
is the most popular data warehouse model. In a shared-nothing architecture, storing
all the necessary data on a single node is impossible, so retrieving data from other
nodes is essential.
This results in network congestion and slow query execution speeds. If the
dimensions are not too big, users can replicate them over nodes to get around this
issue and maximize parallelism. Using MapReduce, we may build specialized business
logic for data insights while analyzing enormous data volumes in data warehouses.
5. Fraud detection
Conventional methods of preventing fraud are not always very effective. For instance,
data analysts typically manage inaccurate payments by auditing a tiny sample of
claims and requesting medical records from specific submitters. Hadoop is a system
well suited for handling large volumes of data needed to create fraud
detection algorithms. Financial businesses, including banks, insurance companies, and payment providers, use Hadoop and MapReduce for fraud detection, pattern recognition, and business analytics through transaction analysis.
6. Takeaway
For years, MapReduce was a prevalent (and the de facto standard) model for
processing high-volume datasets. In recent years, it has given way to new systems
like Google’s new Cloud Dataflow. However, MapReduce continues to be used across
cloud environments, and in June 2022, Amazon Web Services (AWS) made its
Amazon Elastic MapReduce (EMR) Serverless offering generally available. As
enterprises pursue new business opportunities from big data, knowing how to use
MapReduce will be an invaluable skill in building data analysis applications.
By and large, MapReduce is used by the computer software and IT services industry.
Other industries include financial services, hospitals and healthcare, higher education,
retail, insurance, telecommunications and banking. The following are a few example
use cases:
How does HPE help with MapReduce?
HPE offers several solutions that can help you save time, money and workforce
resources on managing Hadoop systems running MapReduce.
For example, HPE Pointnext Services offers advice and technical assistance in
planning, design and integrating your Big Data analytics environment. They simplify
designing and implementing Hadoop – and MapReduce – so that you can truly focus
on finding analytical insights to make informed business decisions.
In addition, HPE GreenLake offers a scalable solution that radically simplifies the
whole Hadoop lifecycle. It is an end-to-end solution that includes the required
hardware, software and support for both symmetrical and asymmetrical
environments. The unique HPE pricing and billing method makes it easier to
understand your existing Hadoop costs and to more accurately predict future costs
associated with your solution.
Advantages of MapReduce
1. Scalability
2. Flexibility
3. Increase the block size
2. Apache Storm
3. Ceph
4. Hydra
5. Google BigQuery
Partitioning in NoSQL
Partitioning, also called sharding, is a fundamental consideration in NoSQL databases. If you get it right, the database works beautifully. If not, there will be big changes down the line until it is gotten right.
Entities in a NoSQL Database live within Partitions. Every entity has a partition key,
indicating which Partition (or Logical Partition or Application Partition) it belongs to.
The Database service maps these Logical Partitions into Physical Partitions or Shards
and places them on Storage nodes in its backend. Note that each Partition is fully
served by one Storage node running its Query engine.
Given that a Partition is fully served by one Storage node and query engine, this is
the scope at which most functionality works.
2. ACID Transactions are possible only within a Partition. Transactions are not
possible across partitions. See Database transactions and optimistic
concurrency control in Azure Cosmos DB | Microsoft Docs
3. Indexes are typically local within a Partition, i.e. using the index requires a partition key and value to be specified. See, for example, Local Secondary Indexes — Amazon DynamoDB. A cross-partition index, or Global index, is possible in some offerings, but it is essentially a full copy of the data stored with a different Partition key, and the Partitioning rules still apply there. See Using Global Secondary Indexes in DynamoDB — Amazon DynamoDB.
4. Range queries (e.g. fetch all Products with price between $50 and $100) work
only within a Partition.
However, since each Partition lives within one Storage node and has to fit there,
there are restrictions:
Partition sizes are very limited. DynamoDB allows for 10GB, CosmosDB for 20
GB. If your partition grows to this size, you will not be able to write any data
— your App is broken!
Partition IOPS (IO Operations per second) are bounded. See Azure Cosmos DB
service quotas | Microsoft Docs. If a partition becomes very “hot”, i.e. there is
too much IO in any one partition, you will start seeing storage throttling and
App failures.
The main design challenge with NoSQL databases, then, is to pick a partitioning strategy that hits the optimal middle ground:
1. Keep Partitions large enough so that all entities that require transactional updates across themselves, or that fall within a range query, sit within one partition.
2. Keep Partitions small enough so that their size stays below the 10 to 20 GB limit and IOPS are roughly uniformly distributed and stay below the threshold.
By dividing the data into partitions, databases can avoid reading from partitions that
are not needed for queries that only need a subset of the data collocated in a
partition. This allows the database to reduce expensive disk I/O calls and return the
data much quicker.
Vertical partitioning
In vertical partitioning, columns of a table are divided into partitions with each
partition containing one or more columns from the table. This approach is useful
when some columns are accessed more frequently than others.
Data partitioning is often combined with sharding: frequently accessed columns may
be split into different partitions and sharded to run on discrete servers. Alternatively,
columns that are rarely used may be partitioned to a cheaper and slower storage
solution to reduce the I/O overhead.
One of the downsides to vertical partitioning is that when a query needs to span
multiple partitions, combining the results from those partitions may be slow or
complicated. Also, as the database scales, partitions may need to be split even
further to meet the demand.
Horizontal partitioning
On the other hand, horizontal partitioning works by splitting the table into rows based on the partition key. In this approach, each row of the table is assigned to a partition based on some criterion, such as a hash or a range of the partition key. For example, we may use a simple modulo function on the employee id field, or a complicated cryptographic hashing function on an IP address, to divide the data.
When a non-trivial hash function is used, hash-based partitioning tends to distribute
the data evenly across partitions. However, depending on the function, adding or
removing a new partition may require an expensive migration process.
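The two rules mentioned above can be sketched in Python as follows; the number of partitions, the sample employee id, and the sample IP address are invented for illustration.

import hashlib

NUM_PARTITIONS = 8  # illustrative

def partition_by_employee_id(employee_id: int) -> int:
    """Simple modulo rule on a numeric key."""
    return employee_id % NUM_PARTITIONS

def partition_by_ip(ip_address: str) -> int:
    """Cryptographic hash (SHA-256) of a string key, reduced to a partition."""
    digest = hashlib.sha256(ip_address.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

print(partition_by_employee_id(1042), partition_by_ip("203.0.113.7"))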
Composite partitioning: any of the aforementioned methods can be
combined. For example, a time series workload may first be partitioned by
time and further split based on another column field.
One thing to note with horizontal partitioning is that the performance depends
heavily on how evenly distributed the data is across the partitions. If the data
distribution is skewed, the partition with the most records will become the
bottleneck.
Also, most analytical databases employ horizontal partitioning strategies over vertical
partitioning. Some popular file formats such as Apache Parquet support partitioning
natively, making it ideal for big data processing.
1. Horizontal Partitioning
2. Vertical Partitioning
3. Key-based Partitioning
4. Range-based Partitioning
5. Hash-based Partitioning
6. Round-robin Partitioning
1. Horizontal Partitioning/Sharding
In this technique, the dataset is divided based on rows or records. Each partition
contains a subset of rows, and the partitions are typically distributed across multiple
servers or storage devices. Horizontal partitioning is often used in distributed
databases or systems to improve parallelism and enable load balancing.
Advantages:
1. Greater scalability: By distributing data among several servers or storage
devices, horizontal partitioning makes it possible to process large datasets in
parallel.
Disadvantages:
2. Vertical Partitioning
Advantages:
Disadvantages:
1. Increased complexity: Vertical partitioning can lead to more complex query
execution plans, as queries may need to access multiple partitions to gather all
the required data.
2. Joins across partitions: Joining data from different partitions can be more
complex and potentially slower, as it involves retrieving data from different
partitions and combining them.
3. Key-based Partitioning
Using this method, the data is divided based on a particular key or attribute value.
The dataset has been partitioned, with each containing all the data related to a
specific key value. Key-based partitioning is commonly used in distributed databases
or systems to distribute the data evenly and allow efficient data retrieval based on
key lookups.
Advantages:
1. Even data distribution: Key-based partitioning ensures that data with the
same key value is stored in the same partition, enabling efficient data retrieval
by key lookups.
Disadvantages:
1. Skew and hotspots: If the key distribution is uneven or if certain key values
are more frequently accessed than others, it can lead to data skew or
hotspots, impacting performance and load balancing.
4. Range Partitioning
Range partitioning divides the dataset according to a predetermined range of values.
You can divide data based on a particular time range, for instance, if your dataset
contains timestamps. When you want to distribute data evenly based on the range of
values and have data with natural ordering, range partitioning can be helpful.
Advantages:
Disadvantages:
2. Data growth challenges: As the dataset grows, the ranges may need to be
adjusted or new partitions added, requiring careful management and
potentially affecting existing queries and data distribution.
3. Joins and range queries: Range partitioning can introduce complexity when
performing joins across partitions or when queries involve multiple non-
contiguous ranges, potentially leading to performance challenges.
5. Hash-based Partitioning
Hash partitioning applies a hash function to the data to decide which partition it belongs to. The data is fed into the hash function, which produces a hash value used to assign the record to a particular partition. By distributing data pseudo-randomly among partitions, hash-based partitioning can help with load balancing and quick data retrieval.
Advantages:
Disadvantages:
2. Load balancing challenges: In some cases, the distribution of data may not
be perfectly balanced, resulting in load imbalances and potential performance
issues.
6. Round-robin Partitioning
Advantages:
Disadvantages:
3. Limited query optimization: Round-robin partitioning does not optimize for
specific query patterns or access patterns, potentially leading to suboptimal
query performance.
These are a few examples of data partitioning strategies. The dataset’s properties,
access patterns, and the needs of the particular application or system all play a role
in the choice of partitioning strategy.
Benefits of Partitioning
Partitioning techniques not only improve the running and management of very large data centers but also allow medium-sized and smaller databases to enjoy the same benefits. Although partitioning can be implemented in databases of all sizes, it is most important for databases that handle big data. The scalability of partitioning techniques means that the advantages enjoyed by smaller data centers are not lost when moving to bigger data centers.
What is combining
Combining in NoSQL databases refers to the practice of integrating and utilizing
various features, functionalities, and methodologies within the NoSQL database
environment to achieve specific objectives or to address particular challenges
effectively. This involves leveraging a combination of techniques, such as data
modeling, indexing, sharding, replication, caching, and optimization strategies, to
optimize performance, scalability, reliability, and other aspects of data management
and processing.
Combine tuning of configurations, indexing strategies, query optimization, and infrastructure scaling to meet specific application needs and performance goals.
Utilize replication for high availability and fault tolerance, ensuring data
redundancy across nodes.
Combine partitioning with compression to minimize storage overhead and
enhance data access efficiency.
7. Security Measures:
1. Map Function:
The Map function is responsible for processing individual data elements and
emitting intermediate key-value pairs. In NoSQL databases, the Map function
can be designed to operate on data stored in distributed partitions or shards,
processing data in parallel across multiple nodes.
When composing MapReduce calculations in NoSQL databases, define the
Map function to extract relevant information from each data record and emit
intermediate key-value pairs based on the analysis or transformation
requirements.
For example, the Map function may parse a JSON document, extract specific
fields or attributes, perform filtering or aggregation operations, and emit key-
value pairs representing the results of the analysis.
2. Shuffle and Sort:
After the Map function has processed the data and emitted intermediate key-value pairs, the database system performs shuffle and sort operations to group and sort the intermediate pairs based on their keys. This step ensures that all key-value pairs with the same key are grouped together for subsequent processing.
3. Reduce Function:
For example, the Reduce function may calculate the sum, average, count, or
other aggregate statistics for each group of key-value pairs with the same key.
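Putting the Map, shuffle-and-sort, and Reduce steps together, here is a hedged Python sketch of such a calculation over JSON documents. The documents and the field names (customer, amount) are invented; the grouping dictionary stands in for the database's shuffle step.

import json
from collections import defaultdict

documents = [
    '{"customer": "u1", "amount": 30.0}',
    '{"customer": "u2", "amount": 12.5}',
    '{"customer": "u1", "amount": 7.5}',
]

def map_doc(raw: str):
    """Parse one JSON document and emit a (customer, amount) pair."""
    doc = json.loads(raw)
    yield (doc["customer"], doc["amount"])

def reduce_group(key, values):
    """Aggregate one customer's amounts: count, sum, and average."""
    return {"customer": key, "orders": len(values),
            "total": sum(values), "average": sum(values) / len(values)}

grouped = defaultdict(list)          # shuffle/sort stand-in: group by key
for raw in documents:
    for key, value in map_doc(raw):
        grouped[key].append(value)

print([reduce_group(key, values) for key, values in grouped.items()])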
4. Output: