
UNIT 2

INTRODUCTION TO DATA ANALYTICS:


Data analytics is an important field that involves collecting, processing, and interpreting data to uncover insights and support decision-making. It is the practice of examining raw data to identify trends, draw conclusions, and extract meaningful information, using various techniques and tools to transform data into valuable insights.
In this unit, we will learn what data analytics is and how it helps businesses and individuals solve complex problems, along with the types of data analytics, its techniques and tools, and the importance of data analytics.


TYPES OF DATA ANALYTICS:

There are four major types of data analytics:


1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics

PREDICTIVE ANALYTICS:
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques used for predictive analytics include:
 Linear Regression
 Time Series Analysis and Forecasting
 Data Mining
Basic Cornerstones of Predictive Analytics
 Predictive modeling
 Decision Analysis and optimization
 Transaction profiling
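
As a simple illustration of the techniques listed above, here is a minimal predictive sketch using linear regression. It assumes scikit-learn and NumPy are installed; the monthly sales figures are made-up illustration data:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical monthly sales for months 1-6 (illustration only)
    months = np.array([[1], [2], [3], [4], [5], [6]])
    sales = np.array([100, 110, 125, 130, 142, 155])

    # Fit a linear model to the historical data
    model = LinearRegression().fit(months, sales)

    # Predict the probable outcome for month 7
    print(model.predict(np.array([[7]])))  # about 165

In practice, the same fit-then-predict pattern extends to time series forecasting and data mining models.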

DESCRIPTIVE ANALYTICS:
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance, mining historical data to understand the causes of past success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model that focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historic reviews, such as:
 Data Queries
 Reports
 Descriptive Statistics
 Data Dashboards
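
A minimal sketch of descriptive statistics with pandas (assumed installed); the sales records are made-up:

    import pandas as pd

    # Hypothetical sales records
    df = pd.DataFrame({
        "region": ["North", "South", "North", "South"],
        "sales":  [250, 180, 300, 210],
    })

    print(df["sales"].describe())               # count, mean, std, min, quartiles, max
    print(df.groupby("region")["sales"].sum())  # historic totals per region

Reports and dashboards are typically built on exactly these kinds of aggregates.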

PRESCRIPTIVE ANALYTICS:
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests decision options to take advantage of that prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implications of each decision option. It not only anticipates what will happen and when, but also why it will happen. Further, prescriptive analytics can suggest decision options for taking advantage of a future opportunity or mitigating a future risk, and can illustrate the implications of each option.

DIAGNOSTIC ANALYTICS
In this analysis, we generally use historical data to answer a question or solve a particular problem, trying to find dependencies and patterns in the historical data related to that problem.

For example, companies go for this analysis because it gives great insight into a problem, provided they keep detailed information at their disposal; otherwise, data collection would have to be repeated for every individual problem, which would be very time-consuming. Common techniques used for Diagnostic Analytics are:
 Data discovery
 Data mining
 Correlations

THE ROLE OF DATA ANALYTICS:


Data analytics plays a pivotal role in enhancing operations, efficiency, and performance
across various industries by uncovering valuable patterns and insights. Implementing data
analytics techniques can provide companies with a competitive advantage. The process
typically involves four fundamental steps:
 Data Mining: This step involves gathering data and information from diverse sources
and transforming them into a standardized format for subsequent analysis. Data mining
can be a time-intensive process compared to other steps but is crucial for obtaining a
comprehensive dataset.
 Data Management: Once collected, data needs to be stored, managed, and made accessible. Creating a database is essential for managing the vast amounts of information collected during the mining process. SQL (Structured Query Language) remains a widely used tool for database management, facilitating efficient querying and analysis of relational databases (a minimal sketch follows this list).
 Statistical Analysis: In this step, the gathered data is subjected to statistical analysis to
identify trends and patterns. Statistical modeling is used to interpret the data and make
predictions about future trends. Open-source programming languages like Python, as
well as specialized tools like R, are commonly used for statistical analysis and graphical
modeling.
 Data Presentation: The insights derived from data analytics need to be effectively communicated to stakeholders. This final step involves formatting the results in a manner that is accessible and understandable to various stakeholders, including decision-makers, analysts, and shareholders. Clear and concise data presentation is essential for driving informed decision-making and supporting business growth.
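
As referenced in the Data Management step, here is a minimal sketch of storing and querying collected data with SQL. It uses Python's built-in sqlite3 module and made-up sales rows; production systems would use a full database server instead:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = conn.cursor()
    cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    cur.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("North", 250.0), ("South", 180.0), ("North", 300.0)])

    # Query the stored data for analysis
    for row in cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
        print(row)
    conn.close()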

STEPS IN DATA ANALYSIS:


 Define Data Requirements: This involves determining how the data will be grouped or
categorized. Data can be segmented based on various factors such as age, demographic,
income, or gender, and can consist of numerical values or categorical data.
 Data Collection: Data is gathered from different sources, including computers, online
platforms, cameras, environmental sensors, or through human personnel.
 Data Organization: Once collected, the data needs to be organized in a structured
format to facilitate analysis. This could involve using spreadsheets or specialized
software designed for managing and analyzing statistical data.
 Data Cleaning: Before analysis, the data undergoes a cleaning process to ensure accuracy and reliability. This involves identifying and removing any duplicate or erroneous entries, as well as addressing any missing or incomplete data. Cleaning the data helps to mitigate potential biases and errors that could affect the analysis results, as sketched below.
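
A minimal data-cleaning sketch with pandas (assumed installed); the table and its defects are made-up:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "age":    [25, 25, np.nan, 40],
        "income": [50000, 50000, 62000, None],
    })

    df = df.drop_duplicates()                        # remove duplicate entries
    df["age"] = df["age"].fillna(df["age"].median()) # fill missing ages
    df = df.dropna(subset=["income"])                # drop rows still missing income
    print(df)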

USAGE OF DATA ANALYTICS:


There are some key domains and strategic planning techniques in which Data Analytics has
played a vital role:

 Improved Decision-Making – If we have supporting data in favour of a decision, we can implement it with a higher probability of success. For example, if a certain decision or plan has led to better outcomes before, there will be little doubt about implementing it again.
 Better Customer Service – Churn modeling is the best example of this: we try to predict or identify what leads to customer churn and change those things accordingly, so that customer attrition is as low as possible, which is a most important factor in any organization.
 Efficient Operations – Data analytics can help us understand the demands of a situation and what should be done to get better results, so that we can streamline our processes, which in turn leads to efficient operations.
 Effective Marketing – Market segmentation techniques are implemented to find the marketing techniques that will help increase sales and lead to effective marketing strategies.

FUTURE SCOPE OF DATA ANALYTICS:


 Retail: To study sales patterns, consumer behavior, and inventory management, data
analytics can be applied in the retail sector. Data analytics can be used by retailers to
make data-driven decisions regarding what products to stock, how to price them, and
how to best organize their stores.
 Healthcare: Data analytics can be used to evaluate patient data, spot trends in patient
health, and create individualized treatment regimens. Data analytics can be used by
healthcare companies to enhance patient outcomes and lower healthcare expenditures.
 Finance: In the field of finance, data analytics can be used to evaluate investment data,
spot trends in the financial markets, and make wise investment decisions. Data analytics
can be used by financial institutions to lower risk and boost the performance of
investment portfolios.
 Marketing: By analyzing customer data, spotting trends in consumer behavior, and
creating customized marketing strategies, data analytics can be used in marketing. Data
analytics can be used by marketers to boost the efficiency of their campaigns and their
overall impact.
 Manufacturing: Data analytics can be used to examine production data, spot trends in
production methods, and boost production efficiency in the manufacturing sector. Data
analytics can be used by manufacturers to cut costs and enhance product quality.

BUSINESS INTELLIGENCE SYSTEM APPLICATION AND DEVELOPMENT:

WHAT IS BUSINESS INTELLIGENCE SYSTEM?


A Business Intelligence system comprises technology, processes, and applications, offering
comprehensive solutions. BI systems extract valuable data from various enterprise systems,
facilitating storage, analysis, and internal data management. Generally, BI systems consist
of three main components: data collection, organized presentation, and efficient delivery to
those in need. These systems enable companies to provide timely and relevant information to
the right individuals in the right format, supporting the effective development of strategic and
operational insights for decision-makers.
 Data warehouse
 BI software
 Users with appropriate analytical skills

FEATURES OF BI SYSTEMS:

DATA WAREHOUSE:

The data warehouse covers the process of collecting all the data needed. In addition, a data warehouse provides a data storage environment where data from multiple data sources is ETLed (Extracted, Transformed, Loaded), cleaned up, and stored by specific topic, reflecting the powerful data integration and maintenance capabilities of BI.
DATA ANALYSIS
Data analysis capability is needed to aid enterprise modelling. OLAP is a data analysis tool based on the data warehouse environment. Moreover, it overcomes the low efficiency of performing multi-dimensional analysis directly on OLTP systems.

Business intelligence (BI) uses data analysis to derive actionable insights for strategic and
tactical decision-making.
DATA MINING

In practical applications, data mining delves into the past to predict the future, involving
active design and analysis of business data. It utilizes knowledge discovery tools for
uncovering previously unknown and potentially valuable information, representing an
automated and active discovery process.
DATA VISUALIZATION
Data visualization can reflect business operations intuitively. Enterprises can conduct purposeful analysis of abnormal data and explore the possible causes, and then make business decisions and strategies based on those findings.

BUSINESS INTELLIGENCE (BI) APPLICATION AND DEVELOPMENT:

Business Intelligence (BI) systems are critical tools for organizations aiming to transform raw
data into actionable insights. These systems integrate data from various sources, analyze it,
and present the information in a way that supports decision-making processes. Below is an
overview of the application and development of BI systems:

APPLICATIONS OF BI SYSTEMS

1. DATA INTEGRATION:

BI systems can aggregate data from multiple sources (e.g., databases, spreadsheets, CRM
systems, ERP systems) into a unified view, facilitating comprehensive analysis.

2. REPORTING AND DASHBOARDS:

BI systems provide tools for creating reports and interactive dashboards that visualize key
performance indicators (KPIs), trends, and patterns. These tools help stakeholders monitor
business performance in real-time.

3. DATA ANALYSIS AND MINING:

BI systems support advanced data analysis techniques, such as predictive analytics, data
mining, and statistical analysis, enabling organizations to forecast future trends, identify
potential risks, and discover hidden opportunities.

4. PERFORMANCE MANAGEMENT:

BI systems enable organizations to set, track, and manage business goals. By aligning data
with strategic objectives, companies can measure performance and adjust strategies as
needed.

5. DECISION SUPPORT:

By providing insights and evidence-based analysis, BI systems assist executives and managers in making informed decisions, whether for operational efficiency, market expansion, or customer satisfaction.

6. CUSTOMER INSIGHTS:

BI tools can analyze customer data to identify purchasing patterns, preferences, and
behaviors. This information helps in personalizing marketing efforts, improving customer
service, and increasing customer retention.

7. COST AND RISK MANAGEMENT:

BI systems can help identify cost-saving opportunities and manage risks by analyzing
operational data. This might include optimizing supply chains, reducing waste, or improving
resource allocation.

DEVELOPMENT OF BI SYSTEMS

1. REQUIREMENT ANALYSIS:

The first step in developing a BI system is to understand the business needs and objectives.
This involves identifying key stakeholders, understanding their data needs, and defining the
metrics and KPIs that the BI system will track.

2. DATA SOURCING AND INTEGRATION:

Data from various sources needs to be gathered, cleansed, and integrated into a central data
warehouse or a similar repository. This process may involve ETL (Extract, Transform, Load)
tools that automate the extraction, transformation, and loading of data.

3. DATA MODELING:

A logical data model is created to organize and structure the data in a way that supports easy
access and analysis. This might involve creating schemas, tables, relationships, and indexing
strategies.

4. DEVELOPMENT OF BI TOOLS:

The development of reporting tools, dashboards, and visualization components is crucial. These tools should be intuitive and user-friendly, providing users with the ability to drill down into data, perform ad-hoc queries, and generate reports.

5. IMPLEMENTATION OF ANALYTICS:

Advanced analytics, such as machine learning models, predictive analytics, and data mining,
can be integrated into the BI system to enhance the depth of insights provided.

6. TESTING AND VALIDATION:

Once developed, the BI system must be rigorously tested to ensure data accuracy,
performance, and security. Validation involves checking the system against the original
business requirements to ensure it meets the needs of the users.

7. DEPLOYMENT AND TRAINING:

After testing, the BI system is deployed across the organization. Training sessions are
conducted to help users understand how to use the system effectively.

8. MAINTENANCE AND UPDATES:

Continuous monitoring and maintenance are essential to ensure the BI system remains up-to-date and performs optimally. Regular updates may be required to incorporate new data sources, technologies, or business requirements.

BIG DATA OVERVIEW:


WHAT IS BIG DATA?

Big Data, a popular term recently, has come to be defined as a large amount of data that can’t
be stored or processed by conventional data storage or processing equipment. Due to the
massive amounts of data produced by human and machine activities, the data are so complex
and expansive that they cannot be interpreted by humans nor fit into a relational database for
analysis. However, when suitably evaluated using modern tools, these massive volumes of
data provide organizations with useful insights that help them improve their business by
making informed decisions.

TYPES OF BIG DATA

As the Internet age continues to grow, we generate an incomprehensible amount of data every second; the amount of data floating around the internet is estimated to reach 163 zettabytes by 2025. That's a lot of tweets, selfies, purchases, emails, blog posts, and any other piece of digital information that we can think of. These data can be classified according to the following types:

STRUCTURED DATA

Structured data has certain predefined organizational properties and is present in structured or
tabular schema, making it easier to analyze and sort. In addition, thanks to its predefined
nature, each field is discrete and can be accessed separately or jointly along with data from
other fields. This makes structured data extremely valuable, making it possible to collect data
from various locations in the database quickly.

UNSTRUCTURED DATA

Unstructured data entails information with no predefined conceptual definitions and is not
easily interpreted or analyzed by standard databases or data models. Unstructured data
accounts for the majority of big data and comprises information such as dates, numbers, and
facts. Big data examples of this type include video and audio files, mobile activity, satellite
imagery, and No-SQL databases, to name a few. Photos we upload on Facebook or Instagram
and videos that we watch on YouTube or any other platform contribute to the growing pile of
unstructured data.

SEMI-STRUCTURED DATA

Semi-structured data is a hybrid of structured and unstructured data. This means that it
inherits a few characteristics of structured data but nonetheless contains information that fails
to have a definite structure and does not conform with relational databases or formal
structures of data models. For instance, JSON and XML are typical examples of semi-structured data.
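
A minimal sketch of reading semi-structured records with Python's built-in json and xml modules; the records themselves are made-up:

    import json
    import xml.etree.ElementTree as ET

    # JSON: keys give partial structure, but fields can vary per record
    record = json.loads('{"id": 1, "name": "Asha", "tags": ["retail", "emea"]}')
    print(record["name"])

    # XML: nested tags carry structure without a fixed relational schema
    root = ET.fromstring("<order><id>1</id><item>book</item></order>")
    print(root.find("item").text)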

BIG DATA CHARACTERISTICS


Big Data involves a large amount of data that cannot be processed by traditional data storage or processing units. It is used by many multinational companies to process the data and run the business of many organizations. The data flow would exceed 150 exabytes per day before replication.

There are five V's of Big Data that explain its characteristics:

o VOLUME
o VERACITY
o VARIETY
o VALUE
o VELOCITY

CHARACTERISTICS OF BIG DATA

VOLUME

The name Big Data itself relates to enormous size. Big Data refers to the vast volumes of data generated from many sources daily, such as business processes, machines, social media platforms, networks, human interactions, and many more.

For example, Facebook generates approximately a billion messages, records around 4.5 billion "Like" button clicks, and receives more than 350 million new posts each day. Big data technologies are built to handle such large amounts of data.

VARIETY

Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.

The data is categorized as below:

a. Structured data: Structured data has a defined schema along with all the required columns. It is in tabular form, stored in a relational database management system, i.e., in relations (tables). OLTP (Online Transaction Processing) systems are built to work with this kind of data.
b. Semi-structured: In semi-structured data, the schema is not appropriately defined, e.g., JSON, XML, CSV, TSV, and email.
c. Unstructured Data: All unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have much data available, but they do not know how to derive value from it since the data is raw.
d. Quasi-structured Data: The data format contains textual data with inconsistent formats that can be structured with effort, time, and some tools.

VERACITY
Veracity means how reliable the data is. It involves the many ways to filter or translate the data, and the process of being able to handle and manage data efficiently, which is essential in business development.

For example, Facebook posts with hashtags.

VALUE
Value is an essential characteristic of big data. What matters is not merely the data that we process or store, but the valuable and reliable data that we store, process, and analyze.

VELOCITY
Velocity plays an important role compared to the other characteristics. Velocity is the speed at which data is created in real time. It covers the speed of incoming data sets, the rate of change, and bursts of activity. A primary aspect of Big Data is providing in-demand data rapidly.

Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

BIG DATA ARCHITECTURE:

A big data architecture is designed to handle the ingestion, processing, and analysis of data
that is too large or complex for traditional database systems.

ARCHITECTURE DIAGRAM

Big data solutions typically involve one or more of the following types of workload:

 Batch processing of big data sources at rest.


 Real-time processing of big data in motion.
 Interactive exploration of big data.
 Predictive analytics and machine learning.

Most big data architectures include some or all of the following components:

 Data sources: All big data solutions start with one or more data sources. Examples
include:
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.
 Data storage: Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats. This kind of store
is often called a data lake. Options for implementing this storage include Azure Data
Lake Store or blob containers in Azure Storage.
 Batch processing: Because the data sets are so large, often a big data solution must
process data files using long-running batch jobs to filter, aggregate, and otherwise
prepare the data for analysis. Usually these jobs involve reading source files, processing
them, and writing the output to new files. Options include running U-SQL jobs in
Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an
HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight
Spark cluster.
 Real-time message ingestion: If the solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for stream

processing. This might be a simple data store, where incoming messages are dropped
into a folder for processing. However, many solutions need a message ingestion store to
act as a buffer for messages, and to support scale-out processing, reliable delivery, and
other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hubs,
and Kafka.
 Stream processing: After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data for analysis. The
processed stream data is then written to an output sink. Azure Stream Analytics
provides a managed stream processing service based on perpetually running SQL
queries that operate on unbounded streams. You can also use open source Apache
streaming technologies like Spark Streaming in an HDInsight cluster.
 Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using analytical
tools. The analytical data store used to serve these queries can be a Kimball-style
relational data warehouse, as seen in most traditional business intelligence (BI)
solutions. Alternatively, the data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a metadata
abstraction over data files in the distributed data store. Azure Synapse Analytics
provides a managed service for large-scale, cloud-based data warehousing. HDInsight
supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve
data for analysis.
 Analysis and reporting: The goal of most big data solutions is to provide insights
into the data through analysis and reporting. To empower users to analyze the data,
the architecture may include a data modeling layer, such as a multidimensional OLAP
cube or tabular data model in Azure Analysis Services. It might also support self-
service BI, using the modeling and visualization technologies in Microsoft Power BI
or Microsoft Excel. Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts. For these scenarios, many Azure
services support analytical notebooks, such as Jupyter, enabling these users to
leverage their existing skills with Python or R. For large-scale data exploration, you
can use Microsoft R Server, either standalone or with Spark.
 Orchestration: Most big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data, move data between
multiple sources and sinks, load the processed data into an analytical data store, or
push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.

When to use this architecture?

Consider this architecture style when you need to:

 Store and process data in volumes too large for a traditional database.
 Transform unstructured data for analysis and reporting.
 Capture, process, and analyze unbounded streams of data in real time, or with low
latency.
 Use Azure Machine Learning or Azure Cognitive Services.

BENEFITS:
 Technology choices. You can mix and match Azure managed services and Apache
technologies in HDInsight clusters, to capitalize on existing skills or technology
investments.
 Performance through parallelism. Big data solutions take advantage of parallelism,
enabling high-performance solutions that scale to large volumes of data.
 Elastic scale. All of the components in the big data architecture support scale-out
provisioning, so that you can adjust your solution to small or large workloads, and pay
only for the resources that you use.
 Interoperability with existing solutions. The components of the big data architecture
are also used for IoT processing and enterprise BI solutions, enabling you to create an
integrated solution across data workloads.

CHALLENGES:
 Complexity. Big data solutions can be extremely complex, with numerous components
to handle data ingestion from multiple data sources. It can be challenging to build, test,
and troubleshoot big data processes. Moreover, there may be a large number of
configuration settings across multiple systems that must be used in order to optimize
performance.
 Skillset. Many big data technologies are highly specialized, and use frameworks and
languages that are not typical of more general application architectures. On the other
hand, big data technologies are evolving new APIs that build on more established
languages. For example, the U-SQL language in Azure Data Lake Analytics is based on a combination of Transact-SQL and C#.

DIFFERENCE BETWEEN DATA SCIENCE AND BUSINESS INTELLIGENCE:
DATA SCIENCE: Data science is basically a field in which information and knowledge
are extracted from the data by using various scientific methods, algorithms, and processes.
It can thus be defined as a combination of various mathematical tools, algorithms, statistics,
and machine learning techniques which are thus used to find the hidden patterns and
insights from the data which help in the decision-making process. Data science deals with
both structured as well as unstructured data. It is related to both data mining and big data.
Data science involves studying historic trends and using its conclusions to redefine present trends and predict future trends.
BUSINESS INTELLIGENCE: Business intelligence (BI) is a set of technologies,
applications, and processes that are used by enterprises for business data analysis. It is used
for the conversion of raw data into meaningful information which is thus used for business
decision-making and profitable actions. It deals with the analysis of structured and
sometimes unstructured data which paves the way for new and profitable business
opportunities. It supports decision-making based on facts rather than assumption-based
decision-making. Thus it has a direct impact on the business decisions of an enterprise.
Business intelligence tools enhance the chances of an enterprise to enter a new market as
well as help in studying the impact of marketing efforts.

DATA ANALYTICS LIFE CYCLE:
PHASE 1: DISCOVERY:
 The data science team learns and investigates the problem.
 Develop context and understanding.
 Come to know about data sources needed and available for the project.
 The team formulates the initial hypothesis that can be later tested with data.
PHASE 2: DATA PREPARATION:
 Steps to explore, preprocess, and condition data before modeling and analysis.
 It requires the presence of an analytic sandbox; the team extracts, loads, and transforms data to get it into the sandbox.
 Data preparation tasks are likely to be performed multiple times and not in a predefined order.
 Several tools commonly used for this phase are – Hadoop, Alpine Miner, Open Refine,
etc.
PHASE 3: MODEL PLANNING:
 The team explores data to learn about relationships between variables and subsequently,
selects key variables and the most suitable models.
 In this phase, the data science team develops data sets for training, testing, and
production purposes.
 Team builds and executes models based on the work done in the model planning phase.
 Several tools commonly used for this phase are – Matlab and STATISTICA.

PHASE 4: MODEL BUILDING:
 Team develops datasets for testing, training, and production purposes.
 The team also considers whether its existing tools will suffice for running the models or if it needs a more robust environment for executing them.
 Free or open-source tools – R and PL/R, Octave, WEKA.
 Commercial tools – Matlab and STATISTICA.
PHASE 5: COMMUNICATE RESULTS:
 After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.
 The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking warnings and assumptions into account.
 The team should identify key findings, quantify business value, and develop a narrative to summarize and convey findings to stakeholders.
PHASE 6: OPERATIONALIZE:
 The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
 This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and to make adjustments before full deployment.
 The team delivers final reports, briefings, and code.
 Free or open source tools – Octave, WEKA, SQL, MADlib.


BIG DATA ANALYTICS:


What is big data analytics?
Big data analytics describes the process of uncovering trends, patterns, and correlations in
large amounts of raw data to help make data-informed decisions. These processes use
familiar statistical analysis techniques—like clustering and regression—and apply them to

more extensive datasets with the help of newer tools. Big data has been a buzzword since the
early 2000s, when software and hardware capabilities made it possible for organizations to
handle large amounts of unstructured data. Since then, new technologies—from Amazon to
smartphones—have contributed even more to the substantial amounts of data available to
organizations. With the explosion of data, early innovation projects like Hadoop, Spark, and
NoSQL databases were created for the storage and processing of big data. This field
continues to evolve as data engineers look for ways to integrate the vast amounts of complex
information created by sensors, networks, transactions, smart devices, web usage, and more.
Even now, big data analytics methods are being used with emerging technologies, like
machine learning, to discover and scale more complex insights.
HOW BIG DATA ANALYTICS WORKS?

Big data analytics refers to collecting, processing, cleaning, and analyzing large
datasets to help organizations operationalize their big data.

1. COLLECT DATA

Data collection looks different for every organization. With today’s technology,
organizations can gather both structured and unstructured data from a variety of
sources — from cloud storage to mobile applications to in-store IoT sensors and
beyond. Some data will be stored in data warehouses where business intelligence
tools and solutions can access it easily. Raw or unstructured data that is too diverse
or complex for a warehouse may be assigned metadata and stored in a data lake.

2. PROCESS DATA

Once data is collected and stored, it must be organized properly to get accurate
results on analytical queries, especially when it’s large and unstructured. Available
data is growing exponentially, making data processing a challenge for organizations.
One processing option is batch processing, which looks at large data blocks over
time. Batch processing is useful when there is a longer turnaround time between
collecting and analyzing data. Stream processing looks at small batches of data at
once, shortening the delay time between collection and analysis for quicker decision-
making. Stream processing is more complex and often more expensive.

3. CLEAN DATA

Data big or small requires scrubbing to improve data quality and get stronger results;
all data must be formatted correctly, and any duplicative or irrelevant data must be
eliminated or accounted for. Dirty data can obscure and mislead, creating flawed
insights.

4. ANALYZE DATA

Getting big data into a usable state takes time. Once it’s ready, advanced analytics
processes can turn big data into big insights. Some of these big data analysis
methods include:

 Data mining sorts through large datasets to identify patterns and relationships by identifying anomalies and creating data clusters (a small clustering sketch follows this list).

 Predictive analytics uses an organization’s historical data to make predictions about the future, identifying upcoming risks and opportunities.

 Deep learning imitates human learning patterns by using artificial intelligence and
machine learning to layer algorithms and find patterns in the most complex and
abstract data.
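
As referenced in the data mining item above, a minimal clustering sketch (scikit-learn assumed installed; the customer features are made-up):

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical 2-D customer features, e.g. monthly spend and visit count
    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

    # Group the rows into two data clusters
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)  # cluster assignment per row, e.g. [1 1 1 0 0 0]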

Big data analytics tools and technology

Big data analytics cannot be narrowed down to a single tool or technology. Instead,
several types of tools work together to help you collect, process, cleanse, and analyze
big data. Some of the major players in big data ecosystems are listed below.

 Hadoop is an open-source framework that efficiently stores and processes big datasets on clusters of commodity hardware. This framework is free and can handle large amounts of structured and unstructured data, making it a valuable mainstay for any big data operation.

 NoSQL databases are non-relational data management systems that do not require a fixed schema, making them a great option for big, raw, unstructured data. NoSQL stands for “not only SQL,” and these databases can handle a variety of data models.

 MapReduce is an essential component of the Hadoop framework, serving two functions. The first is mapping, which filters data to various nodes within the cluster. The second is reducing, which organizes and reduces the results from each node to answer a query. (A word-count sketch in this style appears after this list.)

 YARN stands for “Yet Another Resource Negotiator.” It is another component of second-generation Hadoop. This cluster management technology helps with job scheduling and resource management in the cluster.

 Spark is an open source cluster computing framework that uses implicit data parallelism and fault tolerance to provide an interface for programming entire clusters. Spark can handle both batch and stream processing for fast computation.

 Tableau is an end-to-end data analytics platform that allows you to prep, analyze, collaborate, and share your big data insights. Tableau excels in self-service visual analysis, allowing people to ask new questions of governed big data and easily share those insights across the organization.
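
As referenced in the MapReduce item above, here is a minimal word-count sketch in the map/shuffle/reduce style, written in plain Python rather than on a real Hadoop cluster:

    from itertools import groupby
    from operator import itemgetter

    lines = ["big data tools", "big data analytics"]

    # Map: emit a (word, 1) pair for every word, as each mapper node would
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle: group the pairs by key (the word)
    mapped.sort(key=itemgetter(0))

    # Reduce: sum the counts for each word
    for word, pairs in groupby(mapped, key=itemgetter(0)):
        print(word, sum(count for _, count in pairs))
    # -> analytics 1, big 2, data 2, tools 1

On a real cluster, the map and reduce functions run in parallel across nodes and the framework performs the shuffle.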

THE BIG BENEFITS OF BIG DATA ANALYTICS

The ability to analyze more data at a faster rate can provide big benefits to an
organization, allowing it to more efficiently use data to answer important questions.
Big data analytics is important because it lets organizations use colossal amounts of
data in multiple formats from multiple sources to identify opportunities and risks,
helping organizations move quickly and improve their bottom lines. Some benefits of
big data analytics include:

 Cost savings. Helping organizations identify ways to do business more efficiently

 Product development. Providing a better understanding of customer needs

 Market insights. Tracking purchase behavior and market trends


THE BIG CHALLENGES OF BIG DATA

Big data brings big benefits, but it also brings big challenges, such as new privacy and security concerns, accessibility for business users, and choosing the right solutions for your business needs. To capitalize on incoming data, organizations will have to address the following:

 Making big data accessible. Collecting and processing data becomes more
difficult as the amount of data grows. Organizations must make data easy and
convenient for data owners of all skill levels to use.

 Maintaining quality data. With so much data to maintain, organizations are spending more time than ever before scrubbing for duplicates, errors, absences, conflicts, and inconsistencies.

 Keeping data secure. As the amount of data grows, so do privacy and security concerns. Organizations will need to strive for compliance and put tight data processes in place before they take advantage of big data.

 Finding the right tools and platforms. New technologies for processing and
analyzing big data are developed all the time. Organizations must find the
right technology to work within their established ecosystems and address
their particular needs. Often, the right solution is also a flexible solution that
can accommodate future infrastructure changes.

METHODOLOGY:
In terms of methodology, big data analytics differs significantly from the traditional statistical approach of experimental design. Analytics starts with data. Normally, we model the data in a way that is able to answer the questions that business professionals have. The objectives of this approach are to predict response behavior or to understand how the input variables relate to a response.

Typically, statistical experimental designs develop an experiment and then retrieve the
resulting data. This enables the generation of data suitable for a statistical model, under the
assumption of independence, normality, and randomization. Big data analytics methodology
begins with problem identification, and once the business problem is defined, a research stage
is required to design the methodology. However, general guidelines are relevant to mention
and apply to almost all problems.

The following figure demonstrates the methodology often followed in Big Data Analytics –

Big Data Analytics Methodology

The following are the stages of the big data analytics methodology:

DEFINE OBJECTIVES

Clearly outline the analysis's goals and objectives. What insights do you seek? What business
difficulties are you attempting to solve? This stage is critical to steering the entire process.

DATA COLLECTION

Gather relevant data from a variety of sources. This includes structured data from databases,
semi-structured data from logs or JSON files, and unstructured data from social media,
emails, and papers.

DATA PRE-PROCESSING

This step involves cleaning and pre-processing the data to ensure its quality and consistency.
This includes addressing missing values, deleting duplicates, resolving inconsistencies, and
transforming data into a useful format.

DATA STORAGE AND MANAGEMENT

Store the data in an appropriate storage system. This could include a typical relational
database, a NoSQL database, a data lake, or a distributed file system such as Hadoop
Distributed File System (HDFS).

EXPLORATORY DATA ANALYSIS (EDA)

This phase includes the identification of data features, finding patterns, and detecting outliers.
We often use visualization tools like histograms, scatter plots, and box plots.
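
A minimal EDA sketch with matplotlib and NumPy (both assumed installed); the values are randomly generated stand-in data:

    import matplotlib.pyplot as plt
    import numpy as np

    values = np.random.normal(loc=50, scale=10, size=500)  # stand-in dataset

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(values, bins=30)  # distribution shape
    ax2.boxplot(values)        # spread and outliers
    plt.show()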

FEATURE ENGINEERING

Create new features or modify existing ones to improve the performance of machine learning
models. This could include feature scaling, dimensionality reduction, or constructing
composite features.
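
A minimal feature-engineering sketch with scikit-learn (assumed installed), showing feature scaling followed by dimensionality reduction on made-up values:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    # Hypothetical raw features on very different scales
    X = np.array([[1.0, 200.0], [2.0, 240.0], [3.0, 310.0], [4.0, 330.0]])

    X_scaled = StandardScaler().fit_transform(X)             # feature scaling
    X_reduced = PCA(n_components=1).fit_transform(X_scaled)  # dimensionality reduction
    print(X_reduced.shape)  # (4, 1)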

MODEL SELECTION AND TRAINING

Choose relevant machine learning algorithms based on the nature of the problem and the
properties of the data. If labeled data is available, train the models.

MODEL EVALUATION

Measure the trained models' performance using accuracy, precision, recall, F1-score, and
ROC curves. This helps to determine the best-performing model for deployment.
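
A minimal evaluation sketch using scikit-learn's metric functions (assumed installed); the labels and predictions are made-up:

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score)

    y_true = [1, 0, 1, 1, 0, 1]  # actual labels
    y_pred = [1, 0, 0, 1, 0, 1]  # model predictions

    print("accuracy:",  accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall:",    recall_score(y_true, y_pred))
    print("f1:",        f1_score(y_true, y_pred))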

DEPLOYMENT

In a production environment, deploy the model for real-world use. This could include
integrating the model with existing systems, creating APIs for model inference, and
establishing monitoring tools.

MONITORING AND MAINTENANCE

Continuously monitor the performance of deployed models and maintain them, and change the analytics pipeline as needed to reflect changing business requirements or data characteristics.

ITERATE

Big Data analytics is an iterative process. Analyze the data, collect feedback, and update the models or procedures as needed to increase accuracy and effectiveness over time.

BIG DATA TECHNOLOGIES


Before big data technologies were introduced, data was managed by general programming languages and basic structured query languages. However, these languages were not efficient enough to handle the data, because each organization's information, data, and domain kept growing continuously. That is why it became very important to introduce an efficient and stable technology for handling such huge data and taking care of all the requirements and needs of clients and large organizations responsible for data production and control. Big data technologies are the buzzword we hear a lot in recent times for all such needs.

In this section, we discuss the leading technologies that have expanded their branches to help Big Data reach greater heights. Before we discuss them, let us first briefly understand what big data technology is.

What is Big Data Technology?

Big data technology is defined as a software utility primarily designed to analyze, process, and extract information from large data sets with extremely complex structures, which is very difficult for traditional data processing software to deal with.

Among the most talked-about concepts in technology, big data technologies are widely associated with many other technologies, such as deep learning, machine learning, artificial intelligence (AI), and the Internet of Things (IoT), which they massively augment. In combination with these technologies, big data technologies are focused on analyzing and handling large amounts of real-time data and batch-related data.

Types of Big Data Technology

Before we start with the list of big data technologies, let us first discuss this technology's broad classification. Big Data technology is primarily classified into the following two types:

Operational Big Data Technologies
This type of big data technology mainly includes the basic, day-to-day data that people process. Typically, operational big data includes daily data such as online transactions, social media activity, and the data of any particular organization or firm, which is usually needed for analysis using software based on big data technologies. This data can also be referred to as raw data, serving as input for analytical big data technologies. Examples include:

o Online ticket booking system, e.g., buses, trains, flights, and movies, etc.
o Online trading or shopping from e-commerce websites like Amazon, Flipkart,
Walmart, etc.
o Online data on social media sites, such as Facebook, Instagram, Whatsapp, etc.
o The employees' data or executives' particulars in multinational companies.

Analytical Big Data Technologies

Analytical Big Data is commonly referred to as an improved version of Big Data Technologies. This type of big data technology is a bit more complicated than operational big data. Analytical big data is mainly used when performance criteria are in play and important real-time business decisions are made based on reports created by analyzing operational real data. This means that the actual investigation of big data that is important for business decisions falls under this type of big data technology.

Some common examples that involve the Analytical Big Data Technologies can be listed as
below:

o Stock market data
o Weather forecasting data and time series analysis
o Medical health records where doctors can personally monitor the health status of an
individual
o Carrying out the space mission databases where every information of a mission is
very important

Top Big Data Technologies

We can categorize the leading big data technologies into the following four sections:

o Data Storage
o Data Mining
o Data Analytics
o Data Visualization

DATA STORAGE

Let us first discuss leading Big Data Technologies that come under Data Storage:

o HADOOP: When it comes to handling big data, Hadoop is one of the leading technologies that come into play. This technology is based entirely on the map-reduce architecture and is mainly used to process batch information; it is capable of processing tasks in batches. The Hadoop framework was mainly introduced to store and process data in a distributed data processing environment built on commodity hardware and a basic programming execution model.
Apart from this, Hadoop is also well suited for storing and analyzing data from various machines at high speed and low cost. That is why Hadoop is known as one of the core components of big data technologies. The Apache Software Foundation released Hadoop 1.0 in Dec 2011. Hadoop is written in the Java programming language.
o MONGODB: MongoDB is another important component of big data technologies in terms of storage. Relational and RDBMS properties do not apply to MongoDB because it is a NoSQL database; it is not the same as traditional RDBMS databases that use structured query languages. Instead, MongoDB uses schema-free documents.
The structure of data storage in MongoDB is also different from traditional RDBMS databases, which enables MongoDB to hold massive amounts of data. It is based on a simple cross-platform document-oriented design and stores JSON-like documents with optional schemas. This ultimately supports the operational data storage options seen in most financial organizations. As a result, MongoDB is replacing traditional mainframes and offering the flexibility to handle a wide range of high-volume data types in distributed architectures (a minimal client sketch follows this list). MongoDB Inc. introduced MongoDB in Feb 2009. It is written in a combination of C++, Python, JavaScript, and Go.
o CASSANDRA: Cassandra is one of the leading big data technologies among the list
of top NoSQL databases. It is open-source, distributed and has extensive column
storage options. It is freely available and provides high availability without fail. This
ultimately helps in the process of handling data efficiently on large commodity
groups. Cassandra's essential features include fault-tolerant mechanisms, scalability, MapReduce support, a distributed nature, eventual consistency, a query language, tunable consistency, and multi-datacenter replication.
Cassandra was developed at Facebook in 2008 for its inbox search feature and later became an Apache Software Foundation project. It is based on the Java programming language.
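
As referenced in the MongoDB item, a minimal client sketch using the pymongo driver. It assumes pymongo is installed and a MongoDB server is running locally; the database, collection, and document are made-up:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumes a local server
    orders = client["shop"]["orders"]                  # database and collection

    # Documents are JSON-like and need no fixed schema
    orders.insert_one({"order_id": 1, "items": ["book", "pen"], "total": 12.5})
    print(orders.find_one({"order_id": 1}))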

DATA MINING
Let us now discuss leading Big Data Technologies that come under Data Mining:

o PRESTO: Presto is an open-source, distributed SQL query engine developed to run interactive analytical queries against huge data sources, whose size can vary from gigabytes to petabytes. Presto helps in querying data in Cassandra, Hive, relational databases, and proprietary data storage systems.
Presto is a Java-based query engine that was open-sourced by Facebook in 2013. Companies like Repro, Netflix, Airbnb, Facebook, and Checkr are using this big data technology and making good use of it.
o RAPIDMINER: RapidMiner is defined as the data science software that offers us a
very robust and powerful graphical user interface to create, deliver, manage, and
maintain predictive analytics. Using RapidMiner, we can create advanced workflows
and scripting support in a variety of programming languages.
RapidMiner is a Java-based centralized solution developed in 2001 by Ralf
Klinkenberg, Ingo Mierswa, and Simon Fischer at the Technical University of
Dortmund's AI unit. It was initially named YALE (Yet Another Learning
Environment). A few sets of companies that are making good use of the RapidMiner
tool are Boston Consulting Group, InFocus, Domino's, Slalom, and
Vivint.SmartHome.

DATA ANALYTICS

Now, let us discuss leading Big Data Technologies that come under Data Analytics:

o APACHE KAFKA: Apache Kafka is a popular streaming platform. This streaming
platform is primarily known for its three core capabilities: publisher, subscriber and
consumer. It is referred to as a distributed streaming platform. It is also defined as a
direct messaging, asynchronous messaging broker system that can ingest and perform
data processing on real-time streaming data. This platform is almost similar to an
enterprise messaging system or messaging queue.
Besides, Kafka also provides a retention period, and data can be transmitted through a
producer-consumer mechanism. Kafka has received many enhancements to date and
includes some additional levels or properties, such as schema, Ktables, KSql, registry,
etc. It is written in Java language and was developed by the Apache software
community in 2011. Some top companies using the Apache Kafka platform include
Twitter, Spotify, Netflix, Yahoo, LinkedIn etc.
o SPLUNK: Splunk is known as one of the popular software platforms for capturing, correlating, and indexing real-time streaming data in searchable repositories. Splunk can also produce graphs, alerts, summarized reports, data visualizations, dashboards, etc., using related data. It is mainly beneficial for generating business insights and web analytics. Besides this, Splunk is also used for security purposes, compliance, and application management and control.
Splunk Inc. first released Splunk in 2004. It is written in a combination of AJAX, Python, C++, and XML. Companies such as Trustwave, QRadar, and 1Labs are making good use of Splunk for their analytical and security needs.
o KNIME: KNIME is used to draw visual data flows, execute specific steps and
analyze the obtained models, results, and interactive views. It also allows us to
execute all the analysis steps altogether. It consists of an extension mechanism that
can add more plugins, giving additional features and functionalities.
KNIME is based on Eclipse and written in a Java programming language. It was
developed in 2008 by KNIME Company. A list of companies that are making use of
KNIME includes Harnham, Tyler, and Paloalto.
o SPARK: Apache Spark is one of the core technologies in the list of big data technologies. It is one of those essential technologies that are widely used by top companies. Spark is known for offering in-memory computing capabilities that help enhance the overall speed of the operational process. It also provides a generalized execution model to support more applications, and includes top-level APIs (e.g., for Java, Scala, and Python) to ease the development process.
Also, Spark allows users to process and handle real-time streaming data using batching and windowing techniques, generating datasets and data frames on top of RDDs, the core abstraction of Spark Core. Components such as Spark MLlib, GraphX, and SparkR help analyze and process machine learning and data science workloads (a minimal PySpark sketch follows this list). Spark is written using Java, Scala, Python, and R. It was originally developed at UC Berkeley's AMPLab in 2009 and later became an Apache Software Foundation project. Companies like Amazon, ORACLE, CISCO, VerizonWireless, and Hortonworks are using this big data technology and making good use of it.
o R-LANGUAGE: R is a programming language mainly used for statistical computing and graphics. It is a free software environment used by leading data miners, practitioners, and statisticians, and it is primarily beneficial in the development of statistical software and data analytics.
R 1.0.0 was released in Feb 2000. The language is implemented mainly in C, Fortran, and R itself. Companies like Barclays, American Express, and Bank of America use the R language for their data analytics needs.
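
As referenced in the Spark item, a minimal PySpark sketch (pyspark assumed installed; the DataFrame contents are made-up, and real jobs would read from HDFS, S3, or similar):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()

    df = spark.createDataFrame(
        [("North", 250), ("South", 180), ("North", 300)],
        ["region", "amount"],
    )
    df.groupBy("region").sum("amount").show()  # aggregation runs across the cluster
    spark.stop()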

DATA VISUALIZATION
Let us discuss leading Big Data Technologies that come under Data Visualization:

o Tableau: Tableau is one of the fastest and most powerful data visualization tools used by leading business intelligence industries. It helps in analyzing data at very high speed, and it creates visualizations and insights in the form of dashboards and worksheets.
Tableau is developed and maintained by Tableau Software, which went public in May 2013. It is written using multiple languages, such as Python, C, C++, and Java. Some of the top companies using this tool are Cognos, QlikQ, and ORACLE Hyperion.
o Plotly: As the name suggests, Plotly is best suited for plotting or creating graphs and relevant components quickly and efficiently. It consists of several rich libraries and APIs, for MATLAB, Python, Julia, REST, Arduino, R, Node.js, etc., which help in styling interactive graphs in Jupyter notebooks and PyCharm (a minimal sketch follows this list).
Plotly was introduced in 2012 by the Plotly company. It is based on JavaScript. Paladins and Bitbank are some of the companies making good use of Plotly.
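
As referenced in the Plotly item, a minimal sketch using the plotly.express API (plotly assumed installed; the quarterly figures are made-up):

    import plotly.express as px

    fig = px.bar(x=["Q1", "Q2", "Q3"], y=[120, 150, 170],
                 labels={"x": "quarter", "y": "sales"})
    fig.show()  # opens an interactive chart in the browser or notebook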

EMERGING BIG DATA TECHNOLOGIES


Apart from the above mentioned big data technologies, there are several other emerging big
data technologies. The following are some essential technologies among them:

o TensorFlow: TensorFlow combines multiple comprehensive libraries, flexible ecosystem tools, and community resources that help researchers implement the state of the art in machine learning. It ultimately allows developers to build and deploy machine-learning-powered applications in specific environments.
TensorFlow was introduced in 2015 by the Google Brain team. It is mainly based on C++, CUDA, and Python. Companies like Google, eBay, Intel, and Airbnb are using this technology for their business requirements.
o Beam: Apache Beam consists of a portable API layer that helps build and maintain
sophisticated parallel-data processing pipelines. Apart from this, it also allows the
execution of built pipelines across a diversity of execution engines or runners.
Apache Beam was introduced in June 2016 by the Apache Software Foundation. It
is written in Python and Java. Some leading companies like Amazon, ORACLE,
Cisco, and VerizonWireless are using this technology.
o Docker: Docker is a tool purposely developed to make creating, deploying, and running applications easier by using containers. Containers help developers pack up applications properly, including all the required components such as libraries and dependencies. Typically, containers bind all components and ship them together as a package.
Docker was introduced in March 2013 by Docker Inc. It is based on the Go language. Companies like Business Insider, Quora, PayPal, and Splunk are using this technology.
o Airflow: Airflow is a workflow automation and scheduling system, mainly used to control and maintain data pipelines. Its workflows are designed using the DAG (Directed Acyclic Graph) mechanism and consist of different tasks. Developers can also define workflows in code, which helps in easy testing, maintenance, and versioning.
Airflow was created at Airbnb and became a top-level Apache Software Foundation project in 2019. It is based on the Python language. Companies like Checkr and Airbnb are using this leading technology.
o Kubernetes: Kubernetes is defined as a vendor-agnostic cluster and container
management tool made open-source in 2014 by Google. It provides a platform for
automation, deployment, scaling, and application container operations in the host
clusters.
Kubernetes was introduced in July 2015 by the Cloud Native Computing
Foundation. It is written in the Go language. Companies like American Express, Pear

Deck, PeopleSource, and Northwestern Mutual are making good use of this
technology.

ADVANTAGES OF BIG DATA ANALYTICS:

1. IMPROVES DECISION-MAKING

Organizations can gather valuable insights about their performance with the help of big data
solutions. For example, big data analytics can help HR departments with recruitment and
hiring processes.

Poor hiring practices negatively impact a company because finding high-performing candidates takes time and resources. When a company uses big data in HR, it can make these processes more efficient and effective.

2. REDUCES COSTS

Any professional knows how important it is for a company to keep costs down wherever and
whenever possible. Here are a few ways companies are using big data to reduce costs:

o Facilitates targeted marketing campaigns to reach customers effectively
o Digitizes supply chains to improve efficiency and minimize costly disruptions
o Identifies instances of fraud to prevent loss

Leveraging big data can provide more cost-cutting opportunities than just the above
examples. Netflix, for example, uses big data to save around $1 billion annually on customer
retention alone.

3. INCREASES PRODUCTIVITY

IT specialists can increase their productivity levels by using big data solutions. Instead of
manually sorting through all types of data from disparate sources, big data tools can automate
the process and allow employees to focus on other meaningful tasks.

When companies can analyze more data more quickly, it can speed up other business
processes and increase productivity more broadly throughout the organization.

4. ENHANCES CUSTOMER SERVICE

Customer service plays an important role in determining a company’s reputation, customer loyalty, and overall position in the marketplace. Big data analytics provides customer service departments with myriad data-driven insights, allowing managers to measure employee performance and overcome shortcomings.

DISADVANTAGES OF BIG DATA:

Below are some of the disadvantages of big data companies should understand.

1. CYBERSECURITY RISKS

o As new technologies emerge in business, it’s understandable that there are some risks
involved with adoption. Big data solutions are major targets for cybercriminals,
meaning companies using these advanced analytics tools are exposed to more
potential cybersecurity threats.
o Storing big data, particularly data of a sensitive nature, comes with inherent risks.
Still, companies can implement various cybersecurity measures to protect their data.

2. TALENT GAPS

o Data scientists and experts are in high demand as big data becomes more prevalent in
business. These IT professionals are often paid very well and can significantly impact
a company.
o However, there’s a lack of IT workers in the field capable of handling big data
responsibilities. While having access to big data can benefit a company, it’s only
useful if someone with a strong background in big data works with it.

3. COMPLIANCE CONSIDERATIONS

o Another disadvantage of big data is the compliance issues companies might deal with.
Companies must ensure they’re meeting industry and federal regulatory requirements
for the information they work with, including sensitive or personal customer data.
o Without a compliance officer, organizations would find it challenging to handle, store
and leverage big data. For example, companies operating in the European Union must
be aware of the General Data Protection Regulation (GDPR), a major data privacy
regulation focused on protecting consumers.
o Big data already plays a critical role in modern business. As time goes on, it is expected that big data will continue on its growth journey and become a staple for just about every type of business, regardless of size or industry.
