
Big Data Analytics Algorithms and Tools: A Systematic Review

M. Tharani Devi, Asst. Prof.,
Nadar Saraswathi College of Arts and Science, Theni.
[email protected]

Abstract:
Big data analytics describes the process of uncovering trends, patterns, and correlations in large amounts of raw data to help make data-informed decisions. These processes apply familiar statistical analysis techniques—like clustering and regression—to more extensive datasets with the help of newer tools. New technologies—from Amazon to smartphones—have contributed even more to the substantial amounts of data available to organizations. Five fundamental algorithms are reviewed in this paper. There are hundreds of data analytics tools on the market today, but selecting the right tool depends on your business needs, goals, and data variety, so that it takes the business in the right direction.

Keywords: big data analytics, fundamentals of algorithms, tools

INTRODUCTION

Big data analytics is the often complex process of examining big data to uncover information, such as hidden patterns, correlations, market trends, and customer preferences, that can help organizations make informed business decisions. It is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms, and what-if analysis powered by analytics systems. This kind of analytics relies on some of the most powerful algorithms available.

FUNDAMENTALS OF ALGORITHMS

Algorithm: An algorithm is a procedure used for solving a problem or performing a computation. This paper explains four types of algorithms:
1. Linear Regression
2. Logistic Regression
3. K-Nearest Neighbour
4. K-Means Clustering

1. Linear Regression:

Linear regression is one of the most basic algorithms of advanced analytics. This also makes it one of the most widely used. People can easily visualize how it works and how the input data is related to the output data.

Linear regression uses the relationship between two sets of continuous quantitative measures. The first set is called the predictor or independent variable. The other is the response or dependent variable. The goal of linear regression is to identify the relationship in the form of a formula that describes the dependent variable in terms of the independent variable. Once this relationship is quantified, the dependent variable can be predicted for any instance of an independent variable.

One of the most common independent variables used is time. Whether your dependent variable is revenue, costs, customers, usage, or productivity, if you can define the relationship it has with time, you can forecast a value with linear regression.
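As a concrete illustration, here is a minimal sketch of forecasting with time as the independent variable, using scikit-learn's LinearRegression; the monthly revenue figures and variable names are invented for this example and do not come from the paper.

```python
# Minimal linear-regression sketch (illustrative toy data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Independent variable: time (month index); dependent variable: revenue.
months = np.arange(1, 13).reshape(-1, 1)   # 12 monthly observations
revenue = np.array([10.2, 11.1, 11.8, 12.9, 13.5, 14.8,
                    15.1, 16.3, 17.0, 18.2, 18.9, 20.1])

model = LinearRegression().fit(months, revenue)

# The quantified relationship: revenue ~ intercept + slope * month.
print(f"revenue ~ {model.intercept_:.2f} + {model.coef_[0]:.2f} * month")

# Predict the dependent variable for a new instance of the independent one.
print("forecast for month 15:", model.predict([[15]])[0])
```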
2. Logistic Regression:

Logistic regression sounds similar to linear regression but is actually focused on problems involving categorization instead of quantitative forecasting. Here the output variable values are discrete and finite, rather than continuous with infinite values as in linear regression.

The goal of logistic regression is to categorize whether an instance of an input variable fits within a category or not. The output of logistic regression is a value between 0 and 1. Results closer to 1 indicate that the input variable more clearly fits within the category. Results closer to 0 indicate that the input variable likely does not fit within the category.

Logistic regression is often used to answer clearly defined yes-or-no questions. Will a customer buy again? Is a buyer creditworthy? Will the prospect become a customer? Predicting the answers to these questions can spawn a series of actions within the business process which can help drive future revenue.
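As an illustration, the following minimal sketch answers a yes-or-no question of the kind described above with scikit-learn's LogisticRegression; the "will the customer buy again?" features and labels are toy values invented for this example.

```python
# Minimal logistic-regression sketch for a yes/no question (toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [number of past purchases, days since last visit].
X = np.array([[5, 3], [1, 60], [8, 2], [0, 90], [4, 10], [2, 45]])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = bought again, 0 = did not

clf = LogisticRegression().fit(X, y)

# predict_proba returns a value between 0 and 1: results closer to 1 mean
# the instance more clearly fits the "will buy again" category.
new_customer = [[3, 20]]
print("P(buy again):", clf.predict_proba(new_customer)[0, 1])
print("predicted category:", clf.predict(new_customer)[0])
```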

Classification and Regression Trees

Classification and regression trees use a series of decisions to categorize data. Each decision is based on a question related to one of the input variables. With each question and corresponding response, the instance of data gets moved closer to being categorized in a specific way. This set of questions and responses, and the subsequent divisions of the data, create a tree-like structure. At the end of each line of questions is a category, called the leaf node of the classification tree.

These classification trees can become quite large and complex. One method of controlling the complexity is pruning the tree, or intentionally removing levels of questioning, to balance between exact fit and abstraction. A model that works well with all instances of input values, both those that are known in training and those that are not, is paramount. Preventing overfitting of this model requires a delicate balance between exact fit and abstraction.

A variant of classification and regression trees is called random forests. Instead of constructing a single tree with many branches of logic, a random forest is a culmination of many small and simple trees that each evaluate the instances of data and determine a categorization. Once all of these simple trees complete their data evaluation, the process merges the individual results to create a final prediction of the category based on the composite of the smaller categorizations. This is commonly referred to as an ensemble method. These random forests often do well at balancing exact fit and abstraction and have been implemented successfully in many business cases.

In contrast to logistic regression, which focuses on a yes-or-no categorization, classification and regression trees can be used to predict multivalue categorizations. They also make it easy to visualize and see the definitive path that guided the algorithm to a specific categorization.
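The sketch below contrasts a single depth-limited tree with a random-forest ensemble using scikit-learn; the dataset (the bundled Iris sample) and the parameter values are illustrative assumptions, not choices made in the paper.

```python
# Decision tree vs. random forest on a small multiclass dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A single tree; max_depth acts like pruning, trading exact fit for abstraction.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# An ensemble of many small trees whose votes are merged into one prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree accuracy:", tree.score(X_te, y_te))
print("random forest accuracy:", forest.score(X_te, y_te))
```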
3. K-Nearest Neighbour:

K-nearest neighbour is also a classification algorithm. It is known as a "lazy learner" because the training phase of the process is very limited: the learning process consists of storing the training set of data. As new instances are evaluated, the distance to each data point in the training set is computed, and there is a consensus decision as to which category the new instance of data falls into, based on its proximity to the training instances.

This algorithm can be computationally expensive depending on the size and scope of the training set. As each new instance has to be compared to all instances of the training data set and a distance derived, this process can use many computing resources each time it runs.

This categorization algorithm allows for multivalued categorizations of the data. On the other hand, noisy training data tends to skew classifications.

K-nearest neighbour is often chosen because it is easy to use, easy to train, and its results are easy to interpret. It is often used in search applications when you are trying to find similar items.
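Here is a minimal k-nearest-neighbour sketch with scikit-learn; the toy points are invented for this example. Note how "training" amounts to storing the labelled data, and each prediction measures the distance from the new instance to every stored point.

```python
# Minimal k-nearest-neighbour sketch: lazy learning on toy 2-D points.
from sklearn.neighbors import KNeighborsClassifier

# "Training" is little more than storing these labelled points.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ["small", "small", "small", "large", "large", "large"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# At prediction time, the distance to every stored point is computed and
# the 3 nearest neighbours vote on the category.
print(knn.predict([[2, 2], [7, 9]]))   # expected: ['small', 'large']
```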
4. K-Means Clustering:

K-means clustering focuses on creating groups of related attributes, referred to as clusters. Once these clusters are created, other instances can be evaluated against them to see where they best fit.

This technique is often used as part of data exploration. To start, the analyst specifies the number of clusters. The k-means process then breaks the data into that number of clusters based on finding data points with similarities around a common hub, called the centroid. These clusters are not the same as categories, because initially they do not have business meaning; they are just closely related instances of input variables. Once these clusters are identified and analyzed, they can be converted to categories and given a name that has business meaning.

K-means clustering is often used because it is simple to use and explain, and because it is fast. One area to note is that k-means clustering is extremely sensitive to outliers. These outliers can significantly shift the nature and definition of the clusters and ultimately the results of the analysis.
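The following minimal sketch runs k-means with scikit-learn; the toy points and the choice of two clusters are illustrative assumptions. The analyst picks k, the algorithm finds the centroids, and new instances can then be assigned to the nearest cluster.

```python
# Minimal k-means sketch: the analyst chooses k, the algorithm finds centroids.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.2], [0.8, 1.1], [1.1, 0.9],   # one natural group
                   [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])  # another group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", km.labels_)
print("centroids:\n", km.cluster_centers_)

# New instances can be evaluated against the clusters to see where they fit.
print("assignment for [4.5, 4.5]:", km.predict([[4.5, 4.5]])[0])
```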


These are some of the most popular algorithms in use in advanced analytics initiatives. Each has pros and cons and different ways in which it can be effectively utilized to generate business value. The end target with the implementation of these algorithms is to further refine the data to a point where the resulting information can be applied to business decisions. It is this process of informing downstream processes with more refined and higher-value data that is fundamental to companies truly harnessing the value of their data and achieving the results they desire.
BIG DATA ANALYTICS TOOLS

Many tools are available; this paper explains some of them:
1. Apache Hadoop
2. Cassandra
3. Qubole
4. Xplenty
5. Spark
6. MongoDB
7. Apache Storm
8. SAS
9. Datapine
10. RapidMiner
These are explained below.

1. Apache Hadoop

Hadoop is a Java-based open-source platform that is used to store and process big data. It is built on a cluster system that allows the system to process data efficiently and lets the data run in parallel, spreading both structured and unstructured data from one server across multiple computers. Hadoop also offers cross-platform support for its users. Today, it is one of the most widely used big data analytics tools, popular with many tech giants such as Amazon, Microsoft, IBM, etc.

Features of Apache Hadoop:
 Free to use, and offers an efficient storage solution for businesses.
 Offers quick access via HDFS (the Hadoop Distributed File System).
 Highly flexible and can easily be integrated with MySQL and JSON.
 Highly scalable, as it can distribute a large amount of data into small segments.
 Works on small commodity hardware like JBOD (just a bunch of disks).
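To illustrate the parallel, cluster-based processing described above, here is a hedged sketch of the classic word-count job written for Hadoop Streaming in Python; the input and output paths, the jar location, and the file name are assumptions for illustration.

```python
#!/usr/bin/env python3
# wordcount.py: a single-file Hadoop Streaming sketch. Run with "map" or
# "reduce" as the first argument. A typical (assumed) invocation:
#
#   hadoop jar hadoop-streaming.jar \
#       -input /data/logs -output /data/wordcounts \
#       -mapper "python3 wordcount.py map" \
#       -reducer "python3 wordcount.py reduce" \
#       -file wordcount.py
import sys

def mapper():
    # Emit "word<TAB>1" for each word; Hadoop shuffles and sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so equal words are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```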
2. Cassandra

Apache Cassandra is an open-source NoSQL distributed database used to handle large amounts of data. It is one of the most popular tools for data analytics and has been praised by many tech companies for its high scalability and availability without compromising speed and performance. It is capable of delivering thousands of operations every second and can handle petabytes of data with almost zero downtime. It was created by Facebook in 2008 and later released publicly.

Features of Apache Cassandra:
 Data storage flexibility: it supports all forms of data, i.e. structured, unstructured, and semi-structured, and allows users to change them as per their needs.
 Data distribution system: data is easy to distribute, with the help of replication across multiple data centers.
 Fast processing: Cassandra has been designed to run on efficient commodity hardware and also offers fast storage and data processing.
 Fault tolerance: the moment any node fails, it is replaced without delay.
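As an illustration of Cassandra's write and read path, here is a minimal sketch using the DataStax cassandra-driver package for Python; the contact point, keyspace, and table are invented for this example.

```python
# Minimal Cassandra sketch with the DataStax cassandra-driver package;
# host, keyspace, and table names are assumptions for illustration.
from uuid import uuid4
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point(s) of the cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")
session.execute(
    "CREATE TABLE IF NOT EXISTS events (id uuid PRIMARY KEY, payload text)"
)

# Writes are distributed and replicated across the cluster's nodes.
session.execute(
    "INSERT INTO events (id, payload) VALUES (%s, %s)", (uuid4(), "hello")
)
for row in session.execute("SELECT id, payload FROM events LIMIT 5"):
    print(row.id, row.payload)

cluster.shutdown()
```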
3. Qubole

Qubole is an open-source big data tool that helps in fetching data along the value chain using ad-hoc analysis and machine learning. Qubole is a data lake platform that offers end-to-end service, reducing the time and effort required to move data pipelines. It is capable of configuring multi-cloud services such as AWS, Azure, and Google Cloud. Besides, it also helps in lowering the cost of cloud computing by up to 50%.

Features of Qubole:
 Supports the ETL process: it allows companies to migrate data from multiple sources into one place.
 Real-time insight: it monitors users' systems and allows them to view real-time insights.
 Predictive analysis: Qubole offers predictive analysis so that companies can take action accordingly to target more acquisitions.
 Advanced security system: to protect users' data in the cloud, Qubole uses an advanced security system and also works to prevent future breaches. Besides, it also allows encrypting cloud data against any potential threat.

4. Xplenty

Xplenty is a data analytics tool for building data pipelines with minimal code. It offers a wide range of solutions for sales, marketing, and support. With the help of its interactive graphical interface, it provides solutions for ETL, ELT, etc. The best part of using Xplenty is its low investment in hardware and software, and it offers support via email, chat, telephone, and virtual meetings. Xplenty is a platform to process data for analytics over the cloud and brings all the data together.

Features of Xplenty:
 REST API: a user can do almost anything by implementing the REST API.
 Flexibility: data can be sent and pulled to databases, warehouses, and Salesforce.
 Data security: it offers SSL/TLS encryption, and the platform is capable of verifying algorithms and certificates regularly.
 Deployment: it offers integration apps for both cloud and in-house use and supports deploying apps over the cloud.

5. Spark

Apache Spark is another framework that is used to process data and perform numerous tasks on a large scale. It is also used to process data across multiple computers with the help of distributed tooling. It is widely used among data analysts as it offers easy-to-use APIs that provide easy data-pulling methods, and it is capable of handling multiple petabytes of data as well. Spark made a record by processing 100 terabytes of data in just 23 minutes, breaking the previous world record held by Hadoop (71 minutes). This is why big tech giants are moving towards Spark now, and it is highly suitable for ML and AI today.

Features of Apache Spark:
 Ease of use: it allows users to work in their preferred language (Java, Python, etc.).
 Real-time processing: Spark can handle real-time streaming via Spark Streaming.
 Flexible: it can run on Mesos, Kubernetes, or in the cloud.
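To show the easy-to-use APIs mentioned above, here is a minimal PySpark sketch that reads a CSV file and aggregates it in parallel across the cluster; the HDFS path and column names are assumptions for illustration.

```python
# Minimal PySpark sketch: read a CSV and aggregate it across the cluster.
# The file path and column names are invented for this example.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Spark distributes the read and the aggregation over its executors.
df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

summary = (
    df.groupBy("region")
      .agg(F.sum("revenue").alias("total_revenue"),
           F.count("*").alias("orders"))
      .orderBy(F.desc("total_revenue"))
)
summary.show()
spark.stop()
```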
6. MongoDB

MongoDB, which came into the limelight in 2010, is a free, open-source platform and a document-oriented (NoSQL) database that is used to store high volumes of data. It uses collections and documents for storage, and its documents consist of key-value pairs, which are considered the basic unit of MongoDB. It is popular among developers due to its availability for multiple programming languages such as Python, JavaScript, and Ruby.

Features of MongoDB:
 Written in C++: it is a schema-less DB and can hold a variety of documents inside.
 Simplifies the stack: with the help of MongoDB, a user can easily store files without any disturbance in the stack.
 Master-slave replication: it can write/read data from the master, which can be called back for backup.
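The following minimal pymongo sketch shows documents as collections of key-value pairs, as described above; the connection string, database, and collection names are invented for this example.

```python
# Minimal pymongo sketch: schema-less key-value documents in a collection.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["orders"]   # database "shop", collection "orders"

# Documents are schema-less: each is a set of key-value pairs.
collection.insert_one({"customer": "alice", "total": 42.5, "items": 3})
collection.insert_one({"customer": "bob", "total": 17.0})  # different shape is fine

# Query by field value, filtering on a key.
for doc in collection.find({"total": {"$gt": 20}}):
    print(doc["customer"], doc["total"])

client.close()
```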
7. Apache Storm

Storm is a robust, user-friendly tool used for data analytics, especially in small companies. The best part about Storm is that it has no programming-language barrier and can support any language. It was designed to handle pools of large data with fault-tolerant and horizontally scalable methods. When it comes to real-time data processing, Storm leads the chart because of its distributed real-time big data processing system, which is why many tech giants use Apache Storm in their systems today. Some of the most notable names are Twitter, Zendesk, NaviSite, etc.

Features of Storm:
 Data processing: Storm processes the data even if a node gets disconnected.
 Highly scalable: it keeps up its performance even as the load increases.
 Fast: the speed of Apache Storm is impeccable; it can process up to 1 million messages of 100 bytes on a single node.

8. SAS

Today SAS is one of the best tools for statistical modeling used by data analysts. Using SAS, a data scientist can mine, manage, extract, or update data in different variants from different sources. Statistical Analysis System (SAS) allows a user to access data in any format (SAS tables or Excel worksheets). Besides that, it also offers a cloud platform for business analytics called SAS Viya, and to get a strong grip on AI and ML, they have introduced new tools and products.

Features of SAS:
 Flexible programming language: it offers easy-to-learn syntax and has vast libraries, which make it suitable for non-programmers.
 Vast data format support: it provides support for many programming languages, including SQL, and carries the ability to read data from any format.
 Encryption: it provides end-to-end security with a feature called SAS/SECURE.

9. Datapine

Datapine is an analytics tool used for BI and was founded back in 2012 in Berlin, Germany. In a short period of time, it has gained much popularity in a number of countries, and it is mainly used for data extraction (for small to medium companies fetching data for close monitoring). With the help of its enhanced UI design, anyone can visit and check the data as per their requirements. It is offered in four different price brackets, starting from $249 per month, and provides dashboards by function, industry, and platform.

Features of Datapine:
 Automation: to cut down on manual chasing, Datapine offers a wide array of AI assistants and BI tools.
 Predictive tool: Datapine provides forecasting/predictive analytics; using historical and current data, it derives future outcomes.
 Add-ons: it also offers intuitive widgets, visual analytics and discovery, ad hoc reporting, etc.

10. RapidMiner

RapidMiner is a fully automated visual workflow design tool used for data analytics. It is a no-code platform, and users are not required to write code to segregate data. Today, it is heavily used in many industries such as ed-tech, training, research, etc. Though it is an open-source platform, it has a limitation of 10,000 data rows and a single logical processor. With the help of RapidMiner, one can easily deploy ML models to the web or mobile (only when the user interface is ready to collect real-time figures).

Features of RapidMiner:
 Accessibility: it allows users to access 40+ types of files (SAS, ARFF, etc.) via URL.
 Storage: users can access cloud storage facilities such as AWS and Dropbox.
 Data validation: RapidMiner enables the visual display of multiple results in history for better evaluation.
Conclusion:

Big data refers to data that exceeds the capacity of conventional databases, so technology for extensive and fast data processing is necessary to help companies deal with it. The analyses described here show that big data has many benefits: it allows the potential of a dataset to be explored and reused, it helps address problems and increases the potential for them to be resolved, and it makes decision-making easier.
