Big Data Mining - Classification, Techniques and Issues

Karan Deep Singh (ka_ingh@encs.concordia.ca), Yeghia Koronian (y_koroni@encs.concordia.ca), Gelareh Tavako Saberi (g_tavako@encs.concordia.ca)
Masters in Computer Science, Concordia University
Abstract—At this moment, the data deluge is continuously producing large amounts of data in various sectors of modern society. Such data are called big data. Big data comprise datasets originating both in our physical real world and in social media, and they are difficult to manage with current methodologies or data mining software tools because of their size and complexity. Big data mining is the capability of extracting useful information from these large datasets or streams of data. Big data technologies provide robust solutions for overcoming the issues caused by volume, variety, and velocity. We present a broad overview of the topic, its current status, and techniques such as NoSQL, MapReduce, and Hadoop.
Keywords - Big Data Mining, Mining Techniques, NoSQL,
Hadoop, MapReduce.
1. INTRODUCTION
In the present age, large amounts of data are produced every moment in various fields, such as science, the Internet, and physical systems. Such phenomena are collectively called the data deluge [Mcfedries 2011]. According to research carried out by IDC [IDC 2008, IDC 2012], the amount of data generated and reproduced all over the world every year is estimated at 161 exabytes, and data is predicted to grow at a rate of 10x every five years [1]. Meanwhile, the computing capacity of general-purpose computers grows by about 58% annually [2]. Consider Internet data: the web pages indexed by Google numbered around one million in 1998, quickly reached one billion in 2000, and exceeded one trillion in 2008. This rapid expansion is accelerated by the dramatic rise of social networking applications, such as Facebook, Twitter, and Weibo, that allow users to create content freely and amplify the already huge Web volume.
Thus, "Big Data" has become a critical issue that demands serious attention [3,4]. The coining of the term is generally credited to two people: John Mashey, chief scientist at Silicon Graphics in the 1990s, who gave the talk "Big Data and the Next Wave of InfraStress" in 1998, and Francis X. Diebold, an economist at the University of Pennsylvania, for his 2000 paper "Big Data Dynamic Factor Models for Macroeconomic Measurement and Forecasting" [5].
We introduce big data mining and its applications in Section 2, discuss data mining techniques in Section 3, and then discuss issues and challenges in Section 4.
2. BIG DATA MINING
The term 'Big Data' owes its origin to the fact that we create a huge amount of data every day. Usama Fayyad [11], in his invited talk at the KDD BigMine'12 Workshop, presented striking numbers about Internet usage, among them the following: Google receives more than 1 billion queries per day, Twitter more than 250 million tweets per day, Facebook more than 800 million updates per day, and YouTube more than 4 billion views per day. The data produced nowadays is estimated to be in the order of zettabytes and is growing by around 40% every year.
There are mainly three categories of big data: structured, semi-structured, and unstructured data. In today's world, structured data represents only 5 to 10% of all informatics data. Structured data is data that can be stored in an SQL database, in tables with specific rows and columns [7]. Semi-structured data likewise represents a small share of all data (approximately 5 to 10%). This type of data lacks the precise organization of structured data, which fits into tables; instead, semi-structured data is associated with metadata, the term we use to describe the content and context of data files, e.g., the means of creation, purpose, time and date of creation, and author [9]. XML documents are typical semi-structured documents, and NoSQL databases are also considered semi-structured [7].
The eminent challenge is to find ways to cope with unstructured data, which is everywhere and by far the dominant category, streaming in as text, images, audio, and video; it represents about 80% of all data [7].
2.1 Big Data Definition - 3 V’s:
In today's world, organizations have been bombarded with bulk information, yet the percentage of data that can be analyzed is declining. The reason is that 80% of the data is in semi-structured or unstructured format, so we need new algorithms and new toolsets to deal with all this data.
The features of big data can be summarized as follows:
• Volume: the quantity of data is extraordinary, but not the percentage of data that our tools can process.
• Variety: the kinds of data have expanded into unstructured text, audio, video, graphs, and XML.
• Velocity: data arrives continuously as streams, and the speed at which data is generated is very high.

Therefore, big data are often characterized as "V3" by taking the initial letters of these three terms: Volume, Variety, and Velocity. Apart from these, there is another factor, Variability, which corresponds to changes in the structure of the data and in how users want to interpret that data.
Gartner [15] summarized this in its 2012 definition of Big Data as "high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
2.2 Data Mining
Data mining is, in a nutshell, the discovery of frequent patterns and meaningful structures in the large amounts of data used by applications.
Association Analysis: Association analysis discovers frequent co-occurrences in the structured data used by business applications, which is usually managed by a DBMS. An algorithm called Apriori is used in many cases for this purpose. For example, it discovers combinations of items that frequently co-occur in groups of items (i.e., the contents of shopping carts) purchased together in retail stores. Based on the resulting association rules, many application systems recommend sets of items or revise how items are arranged. Association rule mining has also been extended to histories of product purchases and of click streams on Web pages, in order to discover frequent patterns in series data.
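To make the mechanics concrete, here is a minimal Python sketch of frequent-itemset mining in the spirit of Apriori; the basket data and support threshold are invented for illustration, and the candidate-generation step is simplified relative to the full algorithm.

```python
def apriori(transactions, min_support):
    """Return all itemsets whose support (fraction of baskets) >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent, k = list(current), 2
    while current:
        # Candidate k-itemsets are unions of frequent (k-1)-itemsets.
        # (Full Apriori additionally prunes candidates with any infrequent subset.)
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        frequent += current
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "diapers"}, {"milk", "bread", "beer"}]
print(apriori(baskets, min_support=0.6))
# -> frequent sets such as {milk}, {bread}, and {milk, bread}
```

Association rules such as {milk} → {bread} can then be derived from the frequent itemsets by comparing their supports.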
Classification: In contrast, a classifier is learned from data whose classes (i.e., categories) are known in advance; the classes of new data are then determined using the learned classifier. This task, called classification, is one of the basic data mining techniques. Naïve Bayes and decision trees are typical classifiers. Classification is used in applications as varied as identifying promising customers, detecting spam e-mails, and determining the categories of new specimens in science or medicine. Estimating continuous values such as temperatures and stock prices is called prediction of future values.
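As a concrete sketch of this task, the toy example below trains a Naïve Bayes spam classifier with scikit-learn (an assumed dependency chosen for brevity, not something the paper prescribes) on a handful of labeled e-mails and applies it to new data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data whose classes are known in advance.
mails = ["win money now", "meeting at noon", "cheap money offer", "project meeting agenda"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mails)          # bag-of-words features
classifier = MultinomialNB().fit(X, labels)  # learn the classifier

# Classes of new data are determined by the learned classifier.
new = vectorizer.transform(["cheap money now"])
print(classifier.predict(new))               # -> ['spam']
```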
Clustering: It may be possible to define degrees of similarity between data items even when their categories are not known in advance. The opposite of similarity is dissimilarity, or distance. Grouping mutually similar items in a collection of data into the same group, based on the defined similarity, is called cluster analysis or clustering, which is also one of the basic technologies of data mining. Unlike classification, clustering does not demand that the names and characteristics of clusters be known in advance. Techniques such as hierarchical agglomerative methods and the non-hierarchical k-means method are often used for clustering. Promising applications include the discovery of groups of similar customers for marketing.
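A minimal sketch of the non-hierarchical k-means method mentioned above, again using scikit-learn and fabricated customer features, shows how similar customers are grouped without any cluster labels known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Customers described by (visits per month, average basket value);
# no cluster names or characteristics are known in advance.
customers = np.array([[2, 10], [3, 12], [1, 9], [40, 55], [42, 60], [41, 58]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # centroid characterizing each discovered group
```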
Outlier Detection: This data mining task detects exceptional values, i.e., values that deviate from standard ones. There are outlier detection methods based on statistical models, data distances, and data densities; alternatively, outliers can be found using clustering and classification. Outlier detection has been applied to problems such as detecting credit card fraud and network intrusions.
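The following few lines sketch the statistical-model flavor of outlier detection; the transaction amounts and the z-score threshold are invented for illustration.

```python
import numpy as np

def zscore_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean,
    a simple statistical-model approach to outlier detection."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

amounts = [12.5, 9.9, 11.2, 10.4, 950.0, 10.8]  # one suspicious card transaction
print(zscore_outliers(amounts))  # -> [950.]
```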
2.3 Big data vs traditional DBMS
Big data offers compelling opportunities for data manipulation. It lets us work with huge volumes of semi-structured and unstructured data that a traditional database cannot store, and it gives us a chance to uncover hidden insights in large sets of data [10]. Enterprises tend to track their customers and monitor their transactions in order to obtain the statistics they need. Evaluating customer behavior thus provides a vantage point over the whole system and supports advanced research toward long-term goals [6]. For example, the loyalty program of Tesco, a British multinational grocery and general merchandise retailer, generates a tremendous amount of customer data that the company mines to inform decisions from promotions to strategic segmentation of customers. Amazon uses customer data to power its "you may also like ..." recommendation engine, based on a type of predictive modelling technique called collaborative filtering [6]. In this method, "the system observes what the user has done together with what all users have done (what items they have bought, what music they have listened to) and predicts how the user might behave in the future" [11].
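A toy user-based collaborative-filtering sketch makes the quoted idea concrete; the interaction matrix is fabricated, and production recommenders are far more elaborate.

```python
import numpy as np

# Rows = users, columns = items; entries are observed interactions
# (purchases, listens); 0 means the user has not consumed the item.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
])

def recommend(user, ratings):
    # Weight every user by cosine similarity to the target user,
    # score items by similarity-weighted interactions, and suggest
    # the highest-scoring item the target has not consumed yet.
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[user] / (norms * norms[user])
    scores = sims @ ratings
    scores[ratings[user] > 0] = -np.inf  # hide items already consumed
    return int(np.argmax(scores))

print(recommend(0, ratings))  # index of the "you may also like ..." item
```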
2.4 Limitations of the traditional DBMS
A relational database can cope with structured and sometimes semi-structured data: the data is neatly formatted and fits the schema. Data that does not fit into the tables forces the design of a database that is more complex and more difficult to handle, and this approach might lose some hidden information. In addition, the schema of a traditional relational database is not suitable for certain dynamic information, such as weather patterns, that changes frequently. There are "some more flexible mechanisms, such as the ability to store XML documents and binary data, but the capabilities for handling these types of data are usually quite limited [10]". Furthermore, a traditional database processes data at a central node; as the data grows, this central node has to be extended, and limitations arise that depend on the chosen hardware platform, such as memory size [12].
"It's important to understand that conventional database technologies are an important, and relevant, part of an overall analytic solution. In fact, they become even more vital when used in conjunction with your Big Data platform [14]."
With big data platforms, there is no such limitation on storing the data. We can hold all sorts of data, structured, semi-structured and, in particular, unstructured, and query it easily. Big data solutions store the data in its raw format and apply a schema only when the data is read, which preserves all of the information within the data [10].
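A small Python sketch illustrates this schema-on-read behavior; the log records and field names are hypothetical.

```python
import json

# Raw events kept in their original format; nothing is discarded at load time.
raw_log = [
    '{"user": "u1", "action": "click", "page": "/home"}',
    '{"user": "u2", "action": "buy", "item": "book", "price": 12.9}',
]

def read_with_schema(lines, fields):
    # The schema is applied only when the data is read; each reader
    # projects out just the fields it needs while the raw record survives.
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

for row in read_with_schema(raw_log, ["user", "action"]):
    print(row)  # {'user': 'u1', 'action': 'click'} ...
```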
3. DATA MINING TECHNIQUES
Traditionally, data mining handles transactions, which are recorded in databases when customers actually purchase products or services. Analyzing transactional data leads to the discovery of frequently purchased products or services, especially by repeat customers. But transaction mining cannot obtain information about customers who are likely to be interested in products or services yet have not purchased any. In other words, it cannot discover prospective customers who are likely to become new customers in the future.
In the physical real world, however, customers look at or touch interesting items displayed on racks. They trial-listen to interesting videos or audio if they can. They may even smell or taste interesting items if possible, and even when interesting items are unavailable for some reason, customers talk about them or collect information about them.
These behaviors can be considered parts of the interactions between customers and systems. Such interactions indicate the interests of latent customers, who in the end either purchase the interesting items or, for some reason, do not. Analyzing interactions in the physical real world reveals which items customers are interested in. Such analysis alone, however, leaves unknown which aspects of the items the customers care about and why they did or did not buy them. Therefore, if the interests of users are extracted from heterogeneous data sources and the reasons for purchasing or not purchasing are uncovered, it becomes possible to obtain valuable information about latent customers. Traditional mining of transactional data and the newer mining of interactional data are distinguished as transaction mining and interaction mining, respectively.
3.1 NoSQL as a Database
It has been reported that 65% of the queries processed by Amazon depend on primary keys [Vogels 2007]. Key-value stores, which access data by key, are therefore used by Internet giants such as Google and Amazon. Concrete key-value stores include Amazon's DynamoDB [DynamoDB 2014], Google's BigTable [Chang et al. 2006], HBase [HBase 2014] of the Hadoop project, and Cassandra [Cassandra 2014], originally developed by Facebook.
Generally, key-value stores are suitable for looking up the non-key data (attribute values) associated with a given key. First, a hash function is applied to each node that stores data, mapping the node to a point (i.e., a logical place) on a ring-type network. To store a data item, the same hash function is applied to its key, and the item is likewise mapped to a point on the ring; each item is then stored at the nearest node in the clockwise direction around the ring. To access data, one searches for the nearest node located by applying the hash function to the key value. This access structure is called consistent hashing, and it is also adopted by P2P systems used for various purposes such as file sharing.
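A minimal Python sketch of consistent hashing as described above; the node names are hypothetical, and real systems add virtual nodes to balance load.

```python
import bisect
import hashlib

def ring_point(key: str) -> int:
    # Hash a string to a point on the ring [0, 2**32).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32

class ConsistentHashRing:
    def __init__(self, nodes):
        # The same hash function maps each storage node onto the ring.
        self.points = sorted(ring_point(n) for n in nodes)
        self.nodes = {ring_point(n): n for n in nodes}

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's point to the nearest node.
        idx = bisect.bisect_right(self.points, ring_point(key))
        return self.nodes[self.points[idx % len(self.points)]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("customer:42"))  # the node that stores this key
```

Because adding or removing a server relocates only the keys between neighboring points on the ring, only a small fraction of the data moves, which is what makes the scheme attractive for large key-value stores.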
3.2 MapReduce
MapReduce is a design pattern that processes tasks efficiently by scaling out in a straightforward manner. For example, human users browsing web sites and robots crawling for search engines leave access log data on Web servers when they access the sites. It is then necessary to extract each user's sessions (i.e., coherent series of page accesses) from the recorded access logs and store them in databases for further analysis. Such a task is generally called extraction, transformation, and loading (ETL).
MapReduce is suitable for applications that perform such ETL tasks. It divides a task into subtasks and processes them in a parallel, distributed manner. MapReduce fits cases where only the data or parameters of each subtask differ while the method of processing is exactly the same. First, the Map phase is carried out, and its outputs are rearranged so that they are suitable as input to the Reduce phase. For applications with this inherent similarity (i.e., identity of processing) and diversity (i.e., difference of data and parameters), MapReduce exploits both characteristics to improve the efficiency of processing. Parallelization and distribution of large-scale computations are the two factors that motivated this kind of model.
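The canonical word-count example below mimics the Map, shuffle, and Reduce phases in a single Python process; the input splits are invented, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(lines):
    # Each subtask runs the same map function on its own slice of input.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    # Rearrange map outputs so all values for a key reach one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate the grouped values per key.
    return {key: sum(values) for key, values in groups.items()}

splits = [["big data mining"], ["mining big data", "big data"]]
mapped = chain.from_iterable(map_phase(s) for s in splits)
print(reduce_phase(shuffle(mapped)))
# -> {'big': 3, 'data': 3, 'mining': 2}
```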
3.3 Hadoop
Hadoop [Hadoop 2014] is open-source software for distributed processing on a computer cluster, which consists of two or more servers. Hadoop comprises a distributed file system called HDFS (Hadoop Distributed File System), a MapReduce implementation, and common libraries collected in Hadoop Common. Data is divided into blocks. While the block holding the original data is stored on a server determined by Hadoop, copies of the original data are simultaneously stored on two other servers (by default) inside racks other than the rack holding the server with the original data. Although this data arrangement aims to improve availability, it has the further objective of improving parallelism.
A special server called the NameNode manages the data arrangement in HDFS. The NameNode carries out the bookkeeping of all the metadata of data files, and these metadata are kept resident in main memory for high-speed access. Therefore, the NameNode server should be more reliable than the other servers.
When copies of the same data exist on two or more servers, the number of candidate execution plans increases for problems that are processed in parallel by dividing them into multiple subtasks. If Hadoop is fed a task, it looks up the location of the relevant data by consulting the NameNode and sends the program for execution to the server that stores the data, because the communication cost of sending programs is generally lower than that of sending data.
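As a rough sketch of the replica placement just described, the toy function below places one copy on the writer's server and the remaining copies on a different rack; the rack and node names are hypothetical, and real HDFS enforces additional constraints.

```python
import random

# Hypothetical cluster topology: rack name -> server names.
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}

def place_replicas(writer_node, racks, replication=3):
    # One replica stays on the writer's server; the remaining copies
    # go to servers inside a rack other than the writer's rack.
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    remote_rack = random.choice([r for r in racks if r != local_rack])
    remotes = random.sample(racks[remote_rack], replication - 1)
    return [writer_node] + remotes

print(place_replicas("n1", racks))  # e.g. ['n1', 'n3', 'n4']
```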
4. ISSUES AND CHALLENGES
Variety and heterogeneity: In the past, the datasets we dealt with were quite simple and homogeneous. Now we must handle structured, semi-structured, and unstructured data together. Structured data is compatible with conventional DBMSs, while semi-structured and unstructured datasets must be accommodated by adequate, state-of-the-art platforms.
Volume/Scalability: Data now arrives at a tremendous scale, which gives us the opportunity to discover hidden knowledge and to serve and understand people better. Two approaches, if exploited properly, may provide the scalability that future data mining systems require. The first is advanced user interaction [5,6]: straightforward data mining implies an extremely time-consuming search over a large space, but user interaction can narrow the search to more promising subspaces. The second is cloud computing, which has shown admirable elasticity and, combined with massively parallel computing architectures, can make our systems scalable.
Velocity/Speed: We must finish processing and mining within the desired time, or the information becomes useless. Speed depends on (a) data access time and (b) the efficiency of the mining algorithms. Exploiting advanced indexing schemes is key to the speed issue: multidimensional indexing structures such as the R-tree are useful for big data access. An additional way to boost the speed of big data access and mining is to maximally identify and exploit the potential parallelism in the access and mining algorithms.
Accuracy, trust and provenance: In the past, we dealt with datasets and techniques that were reliable. The evolution of big data forces us to deal with the rigors of a considerable amount of unstructured and unreliable data. How, then, can we trust such data? Learning algorithms are an appropriate way to determine the credibility of a data source, and they should be able to update that credibility in a timely manner.
Privacy crisis: Because data is interconnected, every piece of information about someone can be mined from the Internet, and once this information is put together, privacy disappears. Mining systems that can mine huge portions of the Web are being developed, and these same tools can be used to retrieve personal and confidential information about you.
Interactiveness: Interactiveness is the capability of a data mining system to accept user interaction such as feedback and guidance. It helps narrow the search space, accelerating mining and increasing system scalability, and heterogeneity can be overcome by allowing users to interpret intermediate and final results interactively. Interactiveness boosts the value of data mining results; even a professionally designed data mining system will see its results discounted, or simply rejected, without it.
Garbage mining: On the Web, data is generated very fast and becomes outdated very fast, so cyberspace requires cleaning. This is not easy, for foreseeable reasons: garbage is hidden, and there is an ownership issue: are you allowed to dispose of or collect garbage that does not belong to you? We propose applying data mining approaches to mine garbage and recycle it. We believe garbage mining is a serious research topic: mining for garbage is mining for knowledge.
REFERENCES
1. S. Hendrickson, Getting Started with Hadoop with Amazon's Elastic MapReduce, EMR, 2010.
2. M. Hilbert and P. López, "The world's technological capacity to store, communicate, and compute information," Science, vol. 332, no. 6025, pp. 60–65, 2011.
3. J. M. Wing, "Computational thinking and thinking about computing," Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 366, no. 1881, pp. 3717–3725, 2008.
4. J. Mervis, "Agencies rally to tackle big data," Science, vol. 336, no. 6077, p. 22, 2012.
5. http://www.marklogic.com/blog/birth-of-big-data/
6. D. Che, M. Safran, and Z. Peng, "From big data to big data mining: challenges, issues, and opportunities," in DASFAA Workshops 2013, LNCS 7827, pp. 1–15, 2013.
7. https://jeremyronk.wordpress.com/2014/09/01/structured-semi-structured-and-unstructured-data/
8. http://whatis.techtarget.com/definition/semi-structured-data
9. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute, p. 33, June 2011.
10. https://msdn.microsoft.com/en-us/library/dn749785.aspx
11. https://en.wikipedia.org/wiki/Collaborative_filtering
12. A. Salehinia, Comparisons of Relational Databases with Big Data: A Teaching Approach, South Dakota State University, Brookings, SD 57007.
13. P. C. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data, p. 5, 2012.
14. P. C. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis, Understanding Big Data, p. 16, 2012.
15. H. Ishikawa, Social Big Data Mining, CRC Press, 2015.
