Unit 2 DS
PREDICTIVE ANALYTICS:
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques used for predictive analytics include the following (a short regression sketch follows the lists below):
Linear Regression
Time Series Analysis and Forecasting
Data Mining
Basic Cornerstones of Predictive Analytics
Predictive modeling
Decision Analysis and optimization
Transaction profiling
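As a small illustration of the linear regression technique listed above, here is a minimal sketch using scikit-learn; the advertising-spend and sales figures are invented purely for illustration.

```python
# Minimal sketch: predicting an outcome with linear regression (scikit-learn).
# The advertising-spend and sales figures below are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical facts: monthly ad spend (in $1000s) and the sales that followed.
ad_spend = np.array([[10], [15], [20], [25], [30]])
sales = np.array([110, 135, 160, 190, 210])

model = LinearRegression().fit(ad_spend, sales)

# Predict the probable outcome for a planned spend of $35k.
predicted = model.predict(np.array([[35]]))
print(f"Predicted sales for $35k spend: {predicted[0]:.1f}")
```

The same pattern (fit on historical facts, predict on a new input) carries over to time series forecasting and other predictive models.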
DESCRIPTIVE ANALYTICS:
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and understands it by mining historical data to determine the causes of past success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model that focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historic reviews, such as the following (a short statistics sketch follows the list):
Data Queries
Reports
Descriptive Statistics
Data dashboard
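To make the "Descriptive Statistics" item above concrete, here is a minimal pandas sketch; the sales records are invented for illustration.

```python
# Minimal sketch: descriptive statistics over historical sales records (pandas).
# The records are invented for illustration.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "West", "South", "West"],
    "revenue": [1200, 950, 1340, 800, 1010, 880],
})

# Summary statistics answer "what happened?" questions about past performance.
print(sales["revenue"].describe())               # count, mean, std, min, quartiles, max
print(sales.groupby("region")["revenue"].sum())  # historic review by region
```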
PRESCRIPTIVE ANALYTICS:
Prescriptive analytics automatically synthesizes big data, mathematical sciences, business rules, and machine learning to make a prediction and then suggests decision options to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics can suggest decision options for taking advantage of a future opportunity or mitigating a future risk, and it can illustrate the implications of each decision option.
DIAGNOSTIC ANALYTICS
In this analysis, we generally rely on historical data to answer questions or solve problems, trying to find dependencies and patterns in the historical data of the particular problem.
For example, companies favour this analysis because it gives great insight into a problem, and it encourages them to keep detailed information at their disposal; otherwise, data collection might have to be repeated for every individual problem, which would be very time-consuming. Common techniques used for diagnostic analytics are listed below (a short correlation sketch follows the list):
Data discovery
Data mining
Correlations
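As a small illustration of the "Correlations" technique above, the following sketch computes a correlation matrix over invented historical data with pandas.

```python
# Minimal sketch: using correlations to diagnose why a metric changed (pandas).
# The columns and values are invented for illustration.
import pandas as pd

history = pd.DataFrame({
    "page_load_time": [1.2, 2.5, 3.1, 0.9, 2.8, 3.4],
    "discount_pct":   [10, 0, 0, 15, 5, 0],
    "daily_sales":    [540, 310, 280, 600, 350, 260],
})

# A strong negative correlation with load time and a positive one with discounts
# points to likely drivers of the sales drop.
print(history.corr())
```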
BENEFITS OF DATA ANALYTICS:
Improved Decision-Making – If we have supporting data in favour of a decision, we can implement it with a higher probability of success. For example, if a certain decision or plan has led to better outcomes, there will be no doubt about implementing it again.
Better Customer Service – Churn modeling is the best example of this: we try to predict or identify what leads to customer churn and change those things accordingly, so that customer attrition stays as low as possible, which is one of the most important factors for any organization.
Efficient Operations – Data analytics can help us understand what the situation demands and what should be done to get better results, so that we can streamline our processes, which in turn leads to more efficient operations.
Effective Marketing – Market segmentation techniques are used to identify the marketing approaches that will increase sales for each customer group, leading to more effective marketing strategies (a short clustering sketch follows this list).
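The clustering sketch referenced above is shown here: a minimal k-means market segmentation with scikit-learn, using invented customer features (annual spend and monthly visits).

```python
# Minimal sketch: market segmentation with k-means clustering (scikit-learn).
# Customer features (annual spend, visits per month) are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [200, 1], [220, 2], [250, 1],      # low spend, infrequent visitors
    [900, 8], [950, 10], [1000, 9],    # high spend, frequent visitors
])

segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(segments)  # segment label per customer, used to target marketing differently
```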
A BI environment typically brings together BI software and users with the appropriate analytical skills.
FEATURES OF BI SYSTEMS:
DATA WAREHOUSE:
The data warehouse covers the process of collecting all the data needed. In addition, a data warehouse provides a data storage environment where data from multiple data sources is ETLed (Extracted, Transformed, Loaded), cleaned up, and stored by subject area, reflecting the powerful data integration and maintenance capabilities of BI.
DATA ANALYSIS
Data analysis capability is needed to aid enterprise modelling. OLAP is a data analysis tool based on the data warehouse environment. It overcomes the low efficiency of running multi-dimensional analysis directly on OLTP systems.
Business intelligence (BI) uses data analysis to derive actionable insights for strategic and
tactical decision-making.
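OLAP tools themselves run against a data warehouse, but the "slice and dice" style of multi-dimensional aggregation they provide can be sketched with a pandas pivot table; the fact records below are invented for illustration.

```python
# Minimal sketch: an OLAP-style aggregation over several dimensions using a
# pandas pivot table. A real OLAP tool works on a data warehouse; data here
# is invented for illustration.
import pandas as pd

facts = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024, 2024],
    "region":  ["East", "West", "East", "West", "West"],
    "product": ["A", "A", "B", "A", "B"],
    "revenue": [100, 150, 120, 170, 90],
})

cube = pd.pivot_table(facts, values="revenue",
                      index=["year", "region"], columns="product",
                      aggfunc="sum", fill_value=0)
print(cube)  # revenue broken down by year, region, and product dimensions
```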
DATA MINING
In practical applications, data mining delves into the past to predict the future, involving
active design and analysis of business data. It utilizes knowledge discovery tools for
uncovering previously unknown and potentially valuable information, representing an
automated and active discovery process.
DATA VISUALIZATION
Data visualization reflects business operations intuitively. Enterprises can conduct purposeful analysis of abnormal data and explore the possible causes, and then make business decisions and adjust strategies based on what they find.
Business Intelligence (BI) systems are critical tools for organizations aiming to transform raw
data into actionable insights. These systems integrate data from various sources, analyze it,
and present the information in a way that supports decision-making processes. Below is an
overview of the application and development of BI systems:
APPLICATIONS OF BI SYSTEMS
1. DATA INTEGRATION:
BI systems can aggregate data from multiple sources (e.g., databases, spreadsheets, CRM
systems, ERP systems) into a unified view, facilitating comprehensive analysis.
2. REPORTING AND DASHBOARDS:
BI systems provide tools for creating reports and interactive dashboards that visualize key performance indicators (KPIs), trends, and patterns. These tools help stakeholders monitor business performance in real time.
3. DATA ANALYSIS AND MINING:
BI systems support advanced data analysis techniques, such as predictive analytics, data
mining, and statistical analysis, enabling organizations to forecast future trends, identify
potential risks, and discover hidden opportunities.
4. PERFORMANCE MANAGEMENT:
BI systems enable organizations to set, track, and manage business goals. By aligning data
with strategic objectives, companies can measure performance and adjust strategies as
needed.
5. DECISION SUPPORT:
6. CUSTOMER INSIGHTS:
BI tools can analyze customer data to identify purchasing patterns, preferences, and
behaviors. This information helps in personalizing marketing efforts, improving customer
service, and increasing customer retention.
7. COST AND RISK MANAGEMENT:
BI systems can help identify cost-saving opportunities and manage risks by analyzing operational data. This might include optimizing supply chains, reducing waste, or improving resource allocation.
DEVELOPMENT OF BI SYSTEMS
1. REQUIREMENT ANALYSIS:
The first step in developing a BI system is to understand the business needs and objectives.
This involves identifying key stakeholders, understanding their data needs, and defining the
metrics and KPIs that the BI system will track.
2. DATA COLLECTION AND INTEGRATION:
Data from various sources needs to be gathered, cleansed, and integrated into a central data warehouse or a similar repository. This process may involve ETL (Extract, Transform, Load) tools that automate the extraction, transformation, and loading of data.
3. DATA MODELING:
A logical data model is created to organize and structure the data in a way that supports easy
access and analysis. This might involve creating schemas, tables, relationships, and indexing
strategies.
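As a rough sketch of such a logical data model, the following example creates a tiny star schema (one fact table, one dimension table, a relationship, and an index) in SQLite; the table and column names are hypothetical.

```python
# Minimal sketch: a logical data model (star schema) with tables, a relationship,
# and an index, using SQLite. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    region      TEXT
);
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount      REAL,
    sale_date   TEXT
);
-- Index to support common access paths for analysis.
CREATE INDEX idx_sales_customer ON fact_sales(customer_id);
""")
print("schema created")
```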
4. DEVELOPMENT OF BI TOOLS:
5. IMPLEMENTATION OF ANALYTICS:
Advanced analytics, such as machine learning models, predictive analytics, and data mining,
can be integrated into the BI system to enhance the depth of insights provided.
6. TESTING AND VALIDATION:
Once developed, the BI system must be rigorously tested to ensure data accuracy, performance, and security. Validation involves checking the system against the original business requirements to ensure it meets the needs of the users.
7. DEPLOYMENT AND TRAINING:
After testing, the BI system is deployed across the organization. Training sessions are conducted to help users understand how to use the system effectively.
8. MAINTENANCE AND MONITORING:
Continuous monitoring and maintenance are essential to ensure the BI system remains up-to-date and performs optimally. Regular updates may be required to incorporate new data sources, technologies, or business requirements.
BIG DATA:
Big Data, a popular term recently, has come to be defined as a large amount of data that can't
be stored or processed by conventional data storage or processing equipment. Due to the
massive amounts of data produced by human and machine activities, the data are so complex
and expansive that they cannot be interpreted by humans nor fit into a relational database for
analysis. However, when suitably evaluated using modern tools, these massive volumes of
data provide organizations with useful insights that help them improve their business by
making informed decisions.
As the Internet age continues to grow, we generate an incomprehensible amount of data every second. So much so that the amount of data floating around the internet is estimated to reach 163 zettabytes by 2025. That's a lot of tweets, selfies, purchases, emails, blog posts, and any other piece of digital information that we can think of. These data can be classified into the following types:
STRUCTURED DATA
Structured data has certain predefined organizational properties and is present in structured or
tabular schema, making it easier to analyze and sort. In addition, thanks to its predefined
nature, each field is discrete and can be accessed separately or jointly along with data from
other fields. This makes structured data extremely valuable, making it possible to collect data
from various locations in the database quickly.
UNSTRUCTURED DATA
Unstructured data entails information with no predefined conceptual definitions and is not
easily interpreted or analyzed by standard databases or data models. Unstructured data
accounts for the majority of big data and comprises information such as dates, numbers, and
facts. Big data examples of this type include video and audio files, mobile activity, satellite
imagery, and No-SQL databases, to name a few. Photos we upload on Facebook or Instagram
and videos that we watch on YouTube or any other platform contribute to the growing pile of
unstructured data.
SEMI-STRUCTURED DATA
Semi-structured data is a hybrid of structured and unstructured data. This means that it
inherits a few characteristics of structured data but nonetheless contains information that fails
to have a definite structure and does not conform with relational databases or formal
structures of data models. For instance, JSON and XML are typical examples of semi-
structured data.
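A minimal sketch of why JSON counts as semi-structured is shown below: each record carries its own field names, but records are not forced to share a fixed relational schema. The records are invented for illustration.

```python
# Minimal sketch: semi-structured data. The JSON below carries its own field
# names (some structure), but records need not share the same fields.
import json

records = [
    '{"user": "ana", "likes": 12, "tags": ["travel", "food"]}',
    '{"user": "raj", "comment": "great post"}',   # different fields, still valid
]

for raw in records:
    doc = json.loads(raw)
    # Fields are accessed by name when present; no fixed relational schema applies.
    print(doc.get("user"), doc.get("likes", "n/a"))
```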
There are five V's of Big Data that explain its characteristics: Volume, Velocity, Variety, Veracity, and Value.
VOLUME
The name Big Data itself relates to enormous size. Big Data refers to the vast volumes of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
Facebook, for example, can generate approximately a billion messages a day, record around 4.5 billion clicks of the "Like" button, and receive more than 350 million new posts each day. Big data technologies can handle such large amounts of data.
VARIETY
Big Data can be structured, unstructured, or semi-structured, and it is collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in many forms: PDFs, emails, audio, social media posts, photos, videos, etc.
a. Structured data: Structured data follows a predefined schema with all the required columns and is in tabular form. It is stored in a relational database management system.
b. Semi-structured data: In semi-structured data, the schema is not strictly defined, e.g., JSON, XML, CSV, TSV, and email. OLTP (Online Transaction Processing) systems, by contrast, are built to work with structured data stored in relations, i.e., tables.
c. Unstructured data: All the unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a lot of data available, but they do not know how to derive value from it because the data is raw.
d. Quasi-structured data: This format contains textual data with inconsistent formats that can be structured with some effort, time, and tools.
VERACITY
Veracity refers to how reliable the data is. There are many ways to filter or translate data, and veracity covers being able to handle and manage data effectively, which is essential when using Big Data for business development.
VALUE
Value is an essential characteristic of big data. What matters is not simply the data we process or store, but the valuable and reliable data that we store, process, and analyze.
VELOCITY
Velocity refers to the speed at which data is created, often in real time. It covers the speed of incoming data sets, their rate of change, and bursts of activity. A primary aspect of Big Data is providing the data that is demanded rapidly.
Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
BIG DATA ARCHITECTURE:
A big data architecture is designed to handle the ingestion, processing, and analysis of data
that is too large or complex for traditional database systems.
ARCHITECTURE DIAGRAM
Most big data architectures include some or all of the following components:
Data sources: All big data solutions start with one or more data sources. Examples
include:
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.
Data storage: Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats. This kind of store
is often called a data lake. Options for implementing this storage include Azure Data
Lake Store or blob containers in Azure Storage.
Batch processing: Because the data sets are so large, often a big data solution must
process data files using long-running batch jobs to filter, aggregate, and otherwise
prepare the data for analysis. Usually these jobs involve reading source files, processing
them, and writing the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster. (A small PySpark sketch of such a batch job appears after this list.)
Real-time message ingestion: If the solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for stream
processing. This might be a simple data store, where incoming messages are dropped
into a folder for processing. However, many solutions need a message ingestion store to
act as a buffer for messages, and to support scale-out processing, reliable delivery, and
other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hubs,
and Kafka.
Stream processing: After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data for analysis. The
processed stream data is then written to an output sink. Azure Stream Analytics
provides a managed stream processing service based on perpetually running SQL
queries that operate on unbounded streams. You can also use open source Apache
streaming technologies like Spark Streaming in an HDInsight cluster.
Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using analytical
tools. The analytical data store used to serve these queries can be a Kimball-style
relational data warehouse, as seen in most traditional business intelligence (BI)
solutions. Alternatively, the data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a metadata
abstraction over data files in the distributed data store. Azure Synapse Analytics
provides a managed service for large-scale, cloud-based data warehousing. HDInsight
supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve
data for analysis.
Analysis and reporting: The goal of most big data solutions is to provide insights
into the data through analysis and reporting. To empower users to analyze the data,
the architecture may include a data modeling layer, such as a multidimensional OLAP
cube or tabular data model in Azure Analysis Services. It might also support self-
service BI, using the modeling and visualization technologies in Microsoft Power BI
or Microsoft Excel. Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts. For these scenarios, many Azure
services support analytical notebooks, such as Jupyter, enabling these users to
leverage their existing skills with Python or R. For large-scale data exploration, you
can use Microsoft R Server, either standalone or with Spark.
Orchestration: Most big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data, move data between
multiple sources and sinks, load the processed data into an analytical data store, or
push the results straight to a report or dashboard. To automate these workflows, you
can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
Big data solutions typically involve one or more of the following types of workload:
Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or with low latency.
Use Azure Machine Learning or Azure Cognitive Services.
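The PySpark sketch referenced in the batch processing component above is shown here; it is only an illustrative outline, and the input/output paths, column names, and application name are hypothetical.

```python
# Minimal sketch: a long-running batch job in PySpark that reads source files,
# filters/aggregates them, and writes the prepared output. Paths and column
# names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-batch-prep").getOrCreate()

events = spark.read.json("/data/raw/events/")             # read source files
daily = (events
         .filter(F.col("status") == "completed")           # filter
         .groupBy("event_date")
         .agg(F.count("*").alias("completed_events")))     # aggregate

daily.write.mode("overwrite").parquet("/data/curated/daily_events/")  # new output files
spark.stop()
```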
BENEFITS:
Technology choices. You can mix and match Azure managed services and Apache
technologies in HDInsight clusters, to capitalize on existing skills or technology
investments.
Performance through parallelism. Big data solutions take advantage of parallelism,
enabling high-performance solutions that scale to large volumes of data.
Elastic scale. All of the components in the big data architecture support scale-out
provisioning, so that you can adjust your solution to small or large workloads, and pay
only for the resources that you use.
Interoperability with existing solutions. The components of the big data architecture
are also used for IoT processing and enterprise BI solutions, enabling you to create an
integrated solution across data workloads.
CHALLENGES:
Complexity. Big data solutions can be extremely complex, with numerous components
to handle data ingestion from multiple data sources. It can be challenging to build, test,
and troubleshoot big data processes. Moreover, there may be a large number of
configuration settings across multiple systems that must be used in order to optimize
performance.
Skillset. Many big data technologies are highly specialized, and use frameworks and
languages that are not typical of more general application architectures. On the other
hand, big data technologies are evolving new APIs that build on more established
languages. For example, the U-SQL language in Azure Data Lake Analytics is based on a combination of Transact-SQL and C#.
DATA ANALYTICS LIFE CYCLE:
PHASE 1: DISCOVERY:
The data science team learns and investigates the problem.
Develop context and understanding.
Come to know about data sources needed and available for the project.
The team formulates the initial hypothesis that can be later tested with data.
PHASE 2: DATA PREPARATION:
Steps to explore, preprocess, and condition data before modeling and analysis.
It requires the presence of an analytic sandbox; the team extracts, loads, and transforms data to get it into the sandbox.
Data preparation tasks are likely to be performed multiple times and not in predefined
order.
Several tools commonly used for this phase are – Hadoop, Alpine Miner, OpenRefine,
etc.
PHASE 3: MODEL PLANNING:
The team explores data to learn about relationships between variables and subsequently,
selects key variables and the most suitable models.
In this phase, the data science team develops data sets for training, testing, and
production purposes.
Team builds and executes models based on the work done in the model planning phase.
Several tools commonly used for this phase are – Matlab and STATISTICA.
PHASE 4: MODEL BUILDING:
Team develops datasets for testing, training, and production purposes.
Team also considers whether its existing tools will suffice for running the models or if
they need more robust environment for executing models.
Free or open-source tools – R and PL/R, Octave, WEKA.
Commercial tools – Matlab and STATISTICA.
PHASE 5: COMMUNICATE RESULTS:
After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking into account warnings and assumptions.
The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
PHASE 6: OPERATIONALIZE:
The team communicates benefits of project more broadly and sets up pilot project to
deploy work in controlled way before broadening the work to full enterprise of users.
This approach enables team to learn about performance and related constraints of the
model in production environment on small scale which make adjustments before full
deployment.
The team delivers final reports, briefings, codes.
Free or open source tools – Octave, WEKA, SQL, MADlib.
more extensive datasets with the help of newer tools. Big data has been a buzz word since the
early 2000s, when software and hardware capabilities made it possible for organizations to
handle large amounts of unstructured data. Since then, new technologies—from Amazon to
smartphones—have contributed even more to the substantial amounts of data available to
organizations. With the explosion of data, early innovation projects like Hadoop, Spark, and
NoSQL databases were created for the storage and processing of big data. This field
continues to evolve as data engineers look for ways to integrate the vast amounts of complex
information created by sensors, networks, transactions, smart devices, web usage, and more.
Even now, big data analytics methods are being used with emerging technologies, like
machine learning, to discover and scale more complex insights.
HOW BIG DATA ANALYTICS WORKS?
Big data analytics refers to collecting, processing, cleaning, and analyzing large
datasets to help organizations operationalize their big data.
1. COLLECT DATA
Data collection looks different for every organization. With today’s technology,
organizations can gather both structured and unstructured data from a variety of
sources — from cloud storage to mobile applications to in-store IoT sensors and
beyond. Some data will be stored in data warehouses where business intelligence
tools and solutions can access it easily. Raw or unstructured data that is too diverse
or complex for a warehouse may be assigned metadata and stored in a data lake.
2. PROCESS DATA
Once data is collected and stored, it must be organized properly to get accurate
results on analytical queries, especially when it’s large and unstructured. Available
data is growing exponentially, making data processing a challenge for organizations.
One processing option is batch processing, which looks at large data blocks over
time. Batch processing is useful when there is a longer turnaround time between
collecting and analyzing data. Stream processing looks at small batches of data at
once, shortening the delay time between collection and analysis for quicker decision-
making. Stream processing is more complex and often more expensive.
3. CLEAN DATA
Data big or small requires scrubbing to improve data quality and get stronger results;
all data must be formatted correctly, and any duplicative or irrelevant data must be
eliminated or accounted for. Dirty data can obscure and mislead, creating flawed
insights.
4. ANALYZE DATA
Getting big data into a usable state takes time. Once it’s ready, advanced analytics
processes can turn big data into big insights. Some of these big data analysis
methods include:
Data mining sorts through large datasets to identify patterns and relationships by
identifying anomalies and creating data clusters.
Deep learning imitates human learning patterns by using artificial intelligence and
machine learning to layer algorithms and find patterns in the most complex and
abstract data.
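As a small illustration of the data mining step above (identifying anomalies in a dataset), here is a hedged scikit-learn sketch using an isolation forest; the transaction amounts are invented.

```python
# Minimal sketch: identifying anomalies in a dataset with an isolation forest
# (scikit-learn), in the spirit of the data mining step above. Values are invented.
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly normal transaction amounts, plus a few outliers.
amounts = np.array([[25], [30], [27], [29], [26], [500], [31], [28], [480]])

detector = IsolationForest(contamination=0.25, random_state=0).fit(amounts)
labels = detector.predict(amounts)   # -1 marks anomalies, 1 marks normal points
print(labels)
```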
Big data analytics cannot be narrowed down to a single tool or technology. Instead,
several types of tools work together to help you collect, process, cleanse, and analyze
big data. Some of the major players in big data ecosystems are listed below.
Spark is an open source cluster computing framework that uses implicit data
parallelism and fault tolerance to provide an interface for programming entire
clusters. Spark can handle both batch and stream processing for fast
computation.
service visual analysis, allowing people to ask new questions of governed big
data and easily share those insights across the organization.
The ability to analyze more data at a faster rate can provide big benefits to an
organization, allowing it to more efficiently use data to answer important questions.
Big data analytics is important because it lets organizations use colossal amounts of
data in multiple formats from multiple sources to identify opportunities and risks,
helping organizations move quickly and improve their bottom lines. Some benefits of
big data analytics include:
Big data brings big benefits, but it also brings big challenges, such as new privacy and security concerns, accessibility for business users, and choosing the right solutions for your business needs. To capitalize on incoming data, organizations will have to address the following:
Making big data accessible. Collecting and processing data becomes more
difficult as the amount of data grows. Organizations must make data easy and
convenient for data owners of all skill levels to use.
Finding the right tools and platforms. New technologies for processing and
analyzing big data are developed all the time. Organizations must find the
right technology to work within their established ecosystems and address
their particular needs. Often, the right solution is also a flexible solution that
can accommodate future infrastructure changes.
METHODOLOGY:
In terms of methodology, big data analytics differs significantly from the traditional statistical approach of experimental design. Analytics starts with data. Normally, we model the data in a way that is able to answer the questions that business professionals have. The objectives of this approach are to predict response behavior or to understand how the input variables relate to a response.
Typically, statistical experimental designs develop an experiment and then retrieve the
resulting data. This enables the generation of data suitable for a statistical model, under the
assumption of independence, normality, and randomization. Big data analytics methodology
begins with problem identification, and once the business problem is defined, a research stage
is required to design the methodology. However, some general guidelines are worth mentioning and apply to almost all problems.
The following stages outline the methodology often followed in Big Data Analytics –
DEFINE OBJECTIVES
Clearly outline the analysis's goals and objectives. What insights do you seek? What business
difficulties are you attempting to solve? This stage is critical to steering the entire process.
DATA COLLECTION
Gather relevant data from a variety of sources. This includes structured data from databases,
semi-structured data from logs or JSON files, and unstructured data from social media,
emails, and papers.
DATA PRE-PROCESSING
This step involves cleaning and pre-processing the data to ensure its quality and consistency.
This includes addressing missing values, deleting duplicates, resolving inconsistencies, and
transforming data into a useful format.
DATA STORAGE
Store the data in an appropriate storage system. This could include a typical relational database, a NoSQL database, a data lake, or a distributed file system such as the Hadoop Distributed File System (HDFS).
EXPLORATORY DATA ANALYSIS
This phase includes the identification of data features, finding patterns, and detecting outliers. We often use visualization tools like histograms, scatter plots, and box plots.
FEATURE ENGINEERING
Create new features or modify existing ones to improve the performance of machine learning
models. This could include feature scaling, dimensionality reduction, or constructing
composite features.
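A minimal sketch of these feature engineering ideas (a composite feature, feature scaling, and dimensionality reduction) with scikit-learn is shown below; the feature values are invented.

```python
# Minimal sketch: feature engineering with a composite feature, feature scaling,
# and dimensionality reduction (scikit-learn). The feature values are invented.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.array([[1200.0, 3, 2], [800.0, 2, 1], [1500.0, 4, 3], [950.0, 3, 2]])

# Composite feature: column 0 divided by column 1 (e.g. spend per visit).
X = np.hstack([X, X[:, [0]] / X[:, [1]]])

X_scaled = StandardScaler().fit_transform(X)              # feature scaling
X_reduced = PCA(n_components=2).fit_transform(X_scaled)   # dimensionality reduction
print(X_reduced.shape)
```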
MODEL SELECTION AND TRAINING
Choose relevant machine learning algorithms based on the nature of the problem and the properties of the data. If labeled data is available, train the models.
MODEL EVALUATION
Measure the trained models' performance using accuracy, precision, recall, F1-score, and
ROC curves. This helps to determine the best-performing model for deployment.
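As a small illustration, the following sketch computes accuracy, precision, recall, and F1-score with scikit-learn on invented labels.

```python
# Minimal sketch: evaluating a trained classifier with accuracy, precision,
# recall, and F1-score (scikit-learn). Labels below are invented.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # labels predicted by a trained model

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
```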
DEPLOYMENT
In a production environment, deploy the model for real-world use. This could include
integrating the model with existing systems, creating APIs for model inference, and
establishing monitoring tools.
Also, change the analytics pipeline as needed to reflect changing business requirements or
data characteristics.
ITERATE
Big Data analytics is an iterative process. Analyze the data, gather feedback, and update the models or procedures as needed to increase accuracy and effectiveness over time.
BIG DATA TECHNOLOGIES:
In this section, we discuss the leading technologies that have expanded their branches to help Big Data reach greater heights. Before we list them, let us first understand briefly what big data technology is.
Among the most talked-about concepts in technology, big data technologies are widely associated with many other technologies, such as deep learning, machine learning, artificial intelligence (AI), and the Internet of Things (IoT), which they massively augment. In combination with these technologies, big data technologies focus on analyzing and handling large amounts of real-time data and batch data.
Before we start with the list of big data technologies, let us first discuss their broad classification. Big data technology is primarily classified into the following two types:
Operational Big Data Technologies
This type of big data technology mainly covers the basic day-to-day data that people process. Typically, operational big data includes daily data such as online transactions, social media activity, and data from a particular organization or firm, which is usually needed for analysis by software based on big data technologies. The data can also be referred to as raw data, which serves as the input for analytical big data technologies. Some common examples of operational big data are:
o Online ticket booking system, e.g., buses, trains, flights, and movies, etc.
o Online trading or shopping from e-commerce websites like Amazon, Flipkart,
Walmart, etc.
o Online data on social media sites, such as Facebook, Instagram, Whatsapp, etc.
o The employees' data or executives' particulars in multinational companies.
Analytical Big Data Technologies
This second type refers to the more advanced analysis of big data. Some common examples that involve the Analytical Big Data Technologies can be listed as below:
We can categorize the leading big data technologies into the following four sections:
o Data Storage
o Data Mining
o Data Analytics
o Data Visualization
DATA STORAGE
Let us first discuss leading Big Data Technologies that come under Data Storage:
o HADOOP: When it comes to handling big data, Hadoop is one of the leading technologies that come into play. This technology is based on the map-reduce architecture and is mainly used to process data in batches. The Hadoop framework was introduced to store and process data in a distributed processing environment across commodity hardware, using a simple programming execution model.
Apart from this, Hadoop is also well suited to storing and analyzing data from various machines at high speed and low cost. That is why Hadoop is known as one of the core components of big data technologies. The Apache Software Foundation released Hadoop 1.0 in December 2011 (the project itself dates back to 2006). Hadoop is written in the Java programming language.
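To make the map-reduce idea behind Hadoop concrete, here is a minimal word-count sketch with explicit map, shuffle, and reduce steps; it is simulated in plain Python, whereas on a real cluster these steps would run as distributed Hadoop tasks.

```python
# Minimal sketch of the map-reduce model Hadoop is built on: a word count with
# separate map, shuffle, and reduce steps, simulated here in plain Python.
from collections import defaultdict

lines = ["big data needs big storage", "hadoop processes big data in batches"]

# Map: emit (word, 1) pairs from each input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group intermediate pairs by key (word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
for word, counts in sorted(groups.items()):
    print(word, sum(counts))
```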
o MONGODB: MongoDB is another important component of big data technologies in terms of storage. Relational and RDBMS properties do not apply to MongoDB because it is a NoSQL database. This is not the same as traditional RDBMS databases that use structured query languages; instead, MongoDB stores schema-flexible documents.
The structure of data storage in MongoDB is also different from traditional RDBMS databases, which enables MongoDB to hold massive amounts of data. It is based on a simple, cross-platform, document-oriented design and stores JSON-like documents with optional schemas. This makes it a practical operational data store, as seen in many financial organizations. As a result, MongoDB is replacing traditional mainframes and offering the flexibility to handle a wide range of high-volume data types in distributed architectures. MongoDB Inc. introduced MongoDB in Feb 2009. It is written in a combination of C++, Python, JavaScript, and Go.
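A minimal sketch of storing and querying JSON-like documents with the pymongo driver is shown below; it assumes a MongoDB server is reachable on localhost, and the database, collection, and field names are hypothetical.

```python
# Minimal sketch: storing and querying JSON-like documents with pymongo.
# Assumes a MongoDB server is reachable on localhost; names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
orders = client["shop"]["orders"]          # database "shop", collection "orders"

# Documents need not share a fixed schema.
orders.insert_one({"customer": "ana", "items": ["book"], "total": 12.5})
orders.insert_one({"customer": "raj", "total": 40.0, "coupon": "SPRING"})

for doc in orders.find({"total": {"$gt": 20}}):   # query by field condition
    print(doc["customer"], doc["total"])
```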
o CASSANDRA: Cassandra is one of the leading big data technologies among the list
of top NoSQL databases. It is open-source, distributed and has extensive column
storage options. It is freely available and provides high availability without fail. This
ultimately helps in the process of handling data efficiently on large commodity
groups. Cassandra's essential features include fault-tolerant mechanisms, scalability,
MapReduce support, distributed nature, eventual consistency, query language
property, tunable consistency, and multi-datacenter replication, etc.
Cassandra was originally developed at Facebook for its inbox search feature and open-sourced in 2008; it is now maintained by the Apache Software Foundation. It is written in the Java programming language.
DATA MINING
Let us now discuss leading Big Data Technologies that come under Data Mining:
DATA ANALYTICS
Now, let us discuss leading Big Data Technologies that come under Data Analytics:
o APACHE KAFKA: Apache Kafka is a popular streaming platform. It is primarily known for three core capabilities: publishing and subscribing to streams of records, storing those streams durably, and processing them as they occur. It is referred to as a distributed streaming platform and can also be viewed as an asynchronous message broker system that can ingest and process real-time streaming data. The platform is broadly similar to an enterprise messaging system or message queue.
Besides, Kafka also provides a retention period, and data is transmitted through a producer-consumer mechanism. Kafka has received many enhancements to date and includes additional features such as the schema registry, KTables, and KSQL. It is written in Java and Scala, was originally developed at LinkedIn, and was open-sourced through the Apache Software Foundation in 2011. Some top companies using the Apache Kafka platform include Twitter, Spotify, Netflix, Yahoo, and LinkedIn.
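A minimal producer/consumer sketch with the kafka-python client is shown below; it assumes a broker at localhost:9092, and the topic name and message contents are hypothetical.

```python
# Minimal sketch: publishing and reading messages with the kafka-python client.
# Assumes a Kafka broker at localhost:9092; the topic name is hypothetical.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": "ana", "page": "/home"}')  # publish
producer.flush()

consumer = KafkaConsumer("clickstream",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:                                            # subscribe
    print(message.value)
```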
o SPLUNK: Splunk is known as one of the popular software platforms for capturing,
correlating, and indexing real-time streaming data in searchable repositories. Splunk
can also produce graphs, alerts, summarized reports, data visualizations, and
dashboards, etc., using related data. It is mainly beneficial for generating business
insights and web analytics. Besides, Splunk is also used for security purposes,
compliance, application management and control.
Splunk Inc. was founded in 2003, and the first versions of Splunk followed soon after. It is written in a combination of AJAX, Python, C++, and XML. Companies such as Trustwave, QRadar, and 1Labs make good use of Splunk for their analytical and security needs.
o KNIME: KNIME is used to draw visual data flows, execute specific steps and
analyze the obtained models, results, and interactive views. It also allows us to
execute all the analysis steps altogether. It consists of an extension mechanism that
can add more plugins, giving additional features and functionalities.
KNIME is based on Eclipse and written in the Java programming language. It was developed in 2008 by the company KNIME. Companies making use of KNIME include Harnham, Tyler, and Palo Alto.
o SPARK: Apache Spark is one of the core technologies in the list of big data technologies. It is one of those essential technologies widely used by top companies. Spark is known for offering in-memory computing capabilities that enhance the overall speed of data processing. It also provides a generalized execution model to support more applications, and it includes top-level APIs (e.g., in Java, Scala, and Python) to ease the development process.
Also, Spark allows users to process and handle real-time streaming data using batching and windowing techniques. Datasets and data frames are built on top of RDDs in Spark Core, and components such as Spark MLlib, GraphX, Spark SQL, and SparkR support machine learning and data science workloads. Spark is written in Java, Scala, Python, and R. It was originally developed at UC Berkeley's AMPLab in 2009 and is now maintained by the Apache Software Foundation. Companies like Amazon, ORACLE, CISCO, Verizon Wireless, and Hortonworks are using this big data technology and making good use of it.
o R-LANGUAGE: R is a programming language mainly used for statistical computing and graphics. It is a free software environment used by leading data miners, practitioners, and statisticians. The language is primarily useful in the development of statistical software and data analytics.
R 1.0.0 was released in Feb 2000 by the R Foundation. R is written primarily in C, Fortran, and R itself. Companies like Barclays, American Express, and Bank of America use the R language for their data analytics needs.
DATA VISUALIZATION
Let us discuss leading Big Data Technologies that come under Data Visualization:
o Tableau: Tableau is one of the fastest and most powerful data visualization tools used by leading business intelligence industries. It helps analyze data at a very fast speed and creates visualizations and insights in the form of dashboards and worksheets.
Tableau is developed and maintained by Tableau Software (founded in 2003 and now part of Salesforce). It is written using multiple languages, such as Python, C, C++, and Java. Comparable products in this space include Cognos, Qlik, and Oracle Hyperion.
o Plotly: As the name suggests, Plotly is best suited for plotting or creating graphs and related components quickly and efficiently. It provides rich libraries and APIs for Python, R, Julia, MATLAB, Node.js, Arduino, and a REST API, and it supports interactive, styled graphs in tools such as Jupyter Notebook and PyCharm.
Plotly was founded in 2012, and its core graphing library is based on JavaScript. Paladins and Bitbank are some of the companies making good use of Plotly.
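A minimal Plotly Express sketch is shown below; the quarterly revenue values are invented for illustration.

```python
# Minimal sketch: an interactive chart with Plotly Express. Values are invented.
import plotly.express as px

fig = px.bar(x=["Q1", "Q2", "Q3", "Q4"],
             y=[120, 150, 90, 180],
             labels={"x": "quarter", "y": "revenue"},
             title="Quarterly revenue")
fig.show()   # opens an interactive figure (e.g. in a notebook or browser)
```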
o TensorFlow: TensorFlow combines multiple comprehensive libraries, flexible ecosystem tools, and community resources that help researchers implement the state of the art in machine learning. This ultimately allows developers to build and deploy machine-learning-powered applications in a range of environments.
TensorFlow was first released in 2015 by the Google Brain team (TensorFlow 2.0 followed in 2019). It is mainly written in C++, CUDA, and Python. Companies like Google, eBay, Intel, and Airbnb are using this technology for their business requirements.
o Beam: Apache Beam consists of a portable API layer that helps build and maintain
sophisticated parallel-data processing pipelines. Apart from this, it also allows the
execution of built pipelines across a diversity of execution engines or runners.
Apache Beam was introduced in June 2016 by the Apache Software Foundation. It
is written in Python and Java. Some leading companies like Amazon, ORACLE,
Cisco, and VerizonWireless are using this technology.
o Docker: Docker is a tool purpose-built to make it easier to create, deploy, and run applications by using containers. Containers help developers package an application properly, including all the required components such as libraries and dependencies. Typically, containers bind all components and ship them together as a package.
Docker was introduced in March 2013 by Docker, Inc. It is based on the Go language. Companies like Business Insider, Quora, PayPal, and Splunk are using this technology.
o Airflow: Airflow is a workflow automation and scheduling system. This technology is mainly used to author, schedule, and maintain data pipelines. Workflows are designed as DAGs (Directed Acyclic Graphs) consisting of different tasks, and developers define workflows in code, which helps with testing, maintenance, and versioning.
Airflow was created at Airbnb in 2014, was open-sourced in 2015, and became a top-level Apache Software Foundation project in 2019. It is based on the Python language. Companies like Checkr and Airbnb are using this leading technology.
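A minimal sketch of an Airflow DAG with two dependent tasks is shown below; Airflow 2.x style is assumed, and the DAG id, task names, and schedule are hypothetical.

```python
# Minimal sketch: an Airflow workflow defined as a DAG of two dependent tasks.
# Task names and the schedule are hypothetical; Airflow 2.x style is assumed.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")

def load():
    print("loading curated data")

with DAG(dag_id="daily_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task   # load runs only after extract succeeds
```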
o Kubernetes: Kubernetes is defined as a vendor-agnostic cluster and container
management tool made open-source in 2014 by Google. It provides a platform for
automation, deployment, scaling, and application container operations in the host
clusters.
Kubernetes was introduced in July 2015 by the Cloud Native Computing
Foundation. It is written in the Go language. Companies like American Express, Pear
Deck, PeopleSource, and Northwestern Mutual are making good use of this
technology.
ADVANTAGES OF BIG DATA:
1. IMPROVES DECISION-MAKING
Organizations can gather valuable insights about their performance with the help of big data
solutions. For example, big data analytics can help HR departments with recruitment and
hiring processes.
2. REDUCES COSTS
Any professional knows how important it is for a company to keep costs down wherever and
whenever possible. Here are a few ways companies are using big data to reduce costs:
Leveraging big data can provide more cost-cutting opportunities than just the above
examples. Netflix, for example, uses big data to save around $1 billion annually on customer
retention alone.
3. INCREASES PRODUCTIVITY
IT specialists can increase their productivity levels by using big data solutions. Instead of
manually sorting through all types of data from disparate sources, big data tools can automate
the process and allow employees to focus on other meaningful tasks.
When companies can analyze more data more quickly, it can speed up other business
processes and increase productivity more broadly throughout the organization.
DISADVANTAGES OF BIG DATA:
Below are some of the disadvantages of big data companies should understand.
1. CYBERSECURITY RISKS
o As new technologies emerge in business, it’s understandable that there are some risks
involved with adoption. Big data solutions are major targets for cybercriminals,
meaning companies using these advanced analytics tools are exposed to more
potential cybersecurity threats.
o Storing big data, particularly data of a sensitive nature, comes with inherent risks.
Still, companies can implement various cybersecurity measures to protect their data.
2. TALENT GAPS
o Data scientists and experts are in high demand as big data becomes more prevalent in
business. These IT professionals are often paid very well and can significantly impact
a company.
o However, there’s a lack of IT workers in the field capable of handling big data
responsibilities. While having access to big data can benefit a company, it’s only
useful if someone with a strong background in big data works with it.
3. COMPLIANCE CONSIDERATIONS
o Another disadvantage of big data is the compliance issues companies might deal with.
Companies must ensure they’re meeting industry and federal regulatory requirements
for the information they work with, including sensitive or personal customer data.
o Without a compliance officer, organizations would find it challenging to handle, store
and leverage big data. For example, companies operating in the European Union must
be aware of the General Data Protection Regulation (GDPR), a major data privacy
regulation focused on protecting consumers.
o Big data already plays a critical role in modern business. As time goes on, it is expected that big data will continue on its growth journey and become a staple for just about every type of business, regardless of size or industry.