Big Data Analytics For Satellite Image Processing and Remote Sensing
Prabu Sevugan
VIT University, India
Copyright © 2018 by IGI Global. All rights reserved. No part of this publication
may be reproduced, stored or distributed in any form or by any means, electronic
or mechanical, including photocopying, without written permission from the
publisher.
Product or company names used in this set are for identification purposes only.
Inclusion of the names of the products or companies does not indicate a claim of
ownership by IGI Global of the trademark or registered trademark.
This book is published under the IGI Global book series Advances in Computer
and Electrical Engineering (ACEE) (ISSN: 2327-039X eISSN: 2327-0403)
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British
Library.
When I was invited to write a foreword for the book Big Data Analytics for
Satellite Image Processing and Remote Sensing, I was glad to note the variety of
tools, challenges, and methods in big data for satellite image processing. This book
is a significant collection of 10 chapters covering image processing, satellite
image processing, big data and cloud-based processing, as well as applications
that have emerged in recent decades. This book provides an excellent
platform to review various areas of satellite image processing and serves the
needs of both beginners to the field and seasoned researchers and practitioners.
The tremendous growth of satellite image processing and big data is
documented in this book through topics such as big data, satellite image processing,
remote sensing, computational methods, 3D asset and product development,
landslide susceptibility, hierarchical clustering, the modified support
vector machine, Big Data as a Service, and cloud-based workflow scheduling
techniques, each examined in the context of various applications.
To the best of my knowledge, this is the first attempt of its kind to provide
coverage of the key subjects in the fields of big data, satellite image processing,
and cloud computing, together with their applications. This book is an invaluable,
topical, and timely source of knowledge in the field, and it serves nicely as a major
textbook for courses at both the undergraduate and postgraduate levels as well as
for research scholars. It is also a key reference for scientists, professionals, and
academicians who are interested in new challenges, theories, and practice in the
specific areas mentioned above.
V. Susheela Devi
Indian Institute of Science Bangalore, India
Ravee Sundararajan
London South Bank University (LSBU), UK
Big Data Analytics for Satellite Image Processing and Remote Sensing is a
critical scholarly resource that examines the challenges and difficulties of
implementing big data in image processing for remote sensing and related areas.
Featuring coverage on a broad range of topics, such as distributed computing,
parallel processing, and spatial data, this book is geared towards scientists,
professionals, researchers, and academicians seeking current research on the use
of big data analytics in satellite image processing and remote sensing.
We would like to express our sincere gratitude to all the contributors, who
submitted their high-quality chapters, and to the experts for their support in
providing insightful review comments and suggestions on time.
CHAPTER 1
New Computational Models for Image Remote Sensing and Big
Data
Dhanasekaran K. Pillai
Jain College of Engineering, India
ABSTRACT
This chapter focuses on the development of new computational models for
remote sensing applications with big data handling method using image data.
Furthermore, this chapter presents an overview of the process of developing
systems for remote sensing and monitoring. The issues and challenges are
presented to discuss various problems related to the handling of image big data
in wireless sensor networks that have various real-world applications. Moreover,
the possible solutions and future recommendations to address the challenges
are presented, and the chapter closes with a discussion of emerging trends
and a conclusion.
INTRODUCTION
The goal of developing new computational models is to enable creation of new
big data based remote sensing infrastructure for analysing and mining image
data. The system must include a data collection component that aggregates and
integrates data and validates the image data. Then, the central component
of the system performs tasks like filtering, analysis and extraction of relevant
patterns from image data. The result of extraction and prediction can be used for
agricultural monitoring, crop monitoring or for forecasting of weather and
market values.
Most big data frameworks that use image remote sensing involve the
following steps:
The system architectural model in Figure 1 involves scenario based models for
analysing and mining images, weather data, and pollution data.
For data storage and management, DSpace can be used to store and maintain a
large amount of heterogeneous data. DSpace is an open-source, dynamic
digital repository that can be used for image analysis in big data settings, and it
enables free access to the data.
This chapter enables users to understand major issues and problems related to
remote sensing in combination with big data handling for image data. After
analysing solutions recommended for addressing the problems, users will be able
to understand the process of developing a new framework, tools, or software
systems to meet the current needs.
BACKGROUND
Mostly, remote sensing data is collected to analyse disease conditions, plant
growth, pollution, land use, road traffic congestion, the effects of disasters, and so on.
One solution to address these problems is to develop possible computational
models that comprise several modules for the data analysis. The creation of a
thematic map for certain problems requires meaningful analysis that aims to
show satisfactory results.
The image data collected through multispectral image sensing can provide
information at the element level. It can also provide information at the composite
level via inter-pixel relationships. In some applications, the output
information is used by the analyst to assess user beliefs or expert suggestions.
The software program validates the hypotheses developed by users. In most
cases, the analysis fails because of incompatibility between the user-defined
performance measure used for optimization and an objective that is
unlikely to produce the expected results. So, every analysis needs to be
applied iteratively. The ordering and optimal selection of the objects involved in
the analysis may not be known. Hence, an effective approach must use a suitable
selection and ordering technique for objects in image remote sensing and
analysis applications.
Crop-related mapping of soybean and corn has been conducted at regional scale,
focusing on the tropical and temperate plains (Arvor, Jonathan, Meirelles,
Dubreuil, & Durieux, 2011). Most of the methods have used spectral features of
land cover classes for classification either based on supervised learning methods
or based on unsupervised learning methods.
Another approach based on machine learning has been developed to select and
combine feature groups. It allows users to give positive and negative examples.
This method improves the user interaction and the quality of queries (Minka, &
Picard, 1997). The two methods discussed in the following paragraphs are based
on the concepts of information mining.
Some of the existing approaches do not adapt to different situations to satisfy
user needs. So, image retrieval has been developed based on relevance
feedback functions (Rui, Huang, Ortega, & Mehrota, 1998). Also, the system is
designed to search image according to the suggestions of user, taking the
feedback into account.
During the last decades, traditional database systems have been used to store data
described by characteristics such as color, texture, and shape. Further
development has focused on region-based image retrieval since the content-
based image retrieval was not satisfactory due to the growing size of image and
information content (Veltkamp, Burkhardt, & Kriegel, 2001). This method has
been found to be a viable solution to deal with the varying nature of image
content. In this method, each image is segmented, and object characteristics are
used to index individual object.
Due to the atmospheric effects such as Rayleigh scattering that occurs because of
atmospheric molecules, ozone, absorption by water vapour, and other gases,
absorption due to atmospheric aerosols, changes will frequently occur in the data
collection environment. So, data correction becomes a computationally
intensive task which requires innovative error correction approaches, for
example, standard radiative transfer algorithms such as 6S (Vermote, Tanre, Deuze,
Herman, & Morcrette, 1997), for processing high resolution image datasets.
Otherwise, the data processing will not be practically feasible.
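A full radiative-transfer correction such as 6S is beyond a short example, but the sketch below (a hypothetical NumPy routine, not the 6S algorithm) illustrates the simplest kind of per-band correction, dark-object subtraction, that often precedes the heavier models:

import numpy as np

def dark_object_subtraction(image, percentile=0.1):
    """Crude haze removal: subtract the darkest observed value in each band,
    assuming a truly dark object should reflect approximately zero.

    image: ndarray of shape (bands, rows, cols) with at-sensor values.
    """
    corrected = np.empty_like(image, dtype=np.float32)
    for b in range(image.shape[0]):
        band = image[b].astype(np.float32)
        # estimate the haze offset from the darkest pixels of the band
        dark_value = np.percentile(band, percentile)
        corrected[b] = np.clip(band - dark_value, 0, None)
    return corrected

# Example: a synthetic 4-band scene
scene = np.random.randint(50, 255, size=(4, 512, 512))
print(dark_object_subtraction(scene).min(axis=(1, 2)))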
The way the earth system changes affects the accuracy and currency of
current global earth data. To handle the increasing complexity in terms of size
and dynamism, multi-sensor remote sensing data and temporal big data mining
are useful for big data processing. The major data-intensive computing issues
and challenges arise because of rapid growth of remote sensing data. So, the
methods to deal with the computational complexity associated with big data are
necessarily moving towards the high-performance computing paradigm.
Because of the data availability requirement and the huge computing power
required for processing massive amounts of data, cluster-based high-performance
computing still remains a big challenge in remote sensing applications.
Moreover, the intensive, irregular data access patterns of remote sensing data
increase the I/O burden, making common parallel file systems inapplicable.
Further, current task scheduling seeks load balancing among computational
resources, with data availability as a major concern. To reduce the
complexity, large tasks can be divided into smaller data-dependent tasks
with ordering constraints, as sketched below. Another way of handling the critical
issue is to introduce an optimization technique for scheduling of tasks while trying
to achieve higher performance.
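As a sketch of this decomposition idea (the task names and dependency graph are hypothetical, and no real scheduler API is implied), smaller data-dependent tasks can be ordered with a topological sort before being dispatched to workers:

from graphlib import TopologicalSorter  # Python 3.9+

# hypothetical decomposition of one large processing job
dependencies = {
    "calibrate": [],
    "correct":   ["calibrate"],
    "segment":   ["correct"],
    "extract":   ["segment"],
    "classify":  ["extract"],
    "mosaic":    ["correct"],        # independent of segmentation
}

ts = TopologicalSorter(dependencies)
ts.prepare()
while ts.is_active():
    ready = list(ts.get_ready())     # tasks whose inputs are available
    print("run in parallel:", ready)  # dispatch this batch to worker nodes
    ts.done(*ready)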
Normally, the processing of remote sensing big data involves the following
stages on the process flow: satellite observational network, data acquisition and
recording, remote sensing data processing (pre-processing, central processing,
information abstraction and representation), and operating remote sensing
applications.
Naturally, the remote sensing data captured from different data centres are
distributed and these data centres are normally far away and connected by the
Internet. So, the data management component has to critically manage these
distributed, huge amounts of data for improving interoperability and global data
sharing.
This means that there is a demand for introducing more storage devices and for
improving ease of access. In order to tackle these challenges, new technologies and
techniques are required. Further, the high dimensionality of remotely sensed
data makes distributed data sharing and access more complicated. The main issue
arises while trying to organize and map multi-dimensional remote sensing imagery
onto a one-dimensional array.
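A minimal NumPy sketch of that mapping: a (band, row, column) cube is flattened into a one-dimensional array in row-major order, and the index arithmetic shows how a pixel is recovered from the 1-D layout (the array sizes are assumed for illustration):

import numpy as np

bands, rows, cols = 7, 1024, 1024
cube = np.arange(bands * rows * cols, dtype=np.uint32).reshape(bands, rows, cols)

flat = cube.ravel(order="C")          # row-major, one-dimensional view

def flat_index(b, r, c):
    """Offset of pixel (band b, row r, column c) in the 1-D layout."""
    return (b * rows + r) * cols + c

assert flat[flat_index(3, 10, 20)] == cube[3, 10, 20]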
The critical data processing and data sharing component requires high data
availability. Most remote sensing applications perform processing with
irregular data access patterns. Example issues include irregular I/O patterns and
increased CPU load caused by the varying degree of dependency between the
algorithm's computation and the remote sensing data.
Hence, large, efficient data structures are required to store these massive amounts
of remotely sensed big data in local memory. Further, data transmission among
processing nodes requires high bandwidth to transmit image big data in the form
of data blocks, so it becomes a time-consuming process when the volume of
data to be communicated is large.
The large-scale design modeling of water management and remote monitoring
would involve a large number of smaller data-dependent tasks. Therefore, the
main processing module becomes extremely complex and may require ordering
constraints to deal with the data-dependent tasks. To achieve good performance, an
optimized scheduling algorithm is required. Sometimes, decoupling of data
dependencies may help to achieve a better selection of the execution path.
Remote sensing and monitoring that require complex models to work with
large amounts of multi-sensor and temporal data sets may have to apply widely
used pre-processing algorithms. The computational and storage requirements for
problems that use a large number of earth observations would normally exceed the
available computing power of a single computing platform.
This computational model deals with error correction on remotely sensed image
data and includes major components such as image segmentation using
hierarchically connected components, followed by retrieval of the data distribution
using a computing function, and hierarchy-based data organization that allows on-
demand processing and retrieval of information. This scheme, with the addition or
removal of some software modules, can be adopted to improve computational
time in remote sensing based applications.
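A minimal sketch of the connected-component segmentation step, assuming a single-band image and a simple intensity threshold and using SciPy's labelling routine; the hierarchical organization described above would be built on top of such labels:

import numpy as np
from scipy import ndimage

def segment_connected_components(band, threshold):
    """Label connected regions of pixels brighter than `threshold`."""
    mask = band > threshold
    labels, num_regions = ndimage.label(mask)   # default 4-connectivity structure
    sizes = ndimage.sum(mask, labels, index=range(1, num_regions + 1))
    return labels, sizes

band = np.random.rand(256, 256)
labels, sizes = segment_connected_components(band, threshold=0.8)
print(f"{len(sizes)} regions, largest has {int(sizes.max())} pixels")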
Most information analytical systems require data fusion from various
sources and instruments. When these data sets are used in modeling ecosystem
response, the various types of data pose challenges of huge data storage
and high computational complexity. Therefore, the data acquisition, processing,
mapping and conversion of remote sensing data involve complicated
modeling and increased computational complexity. The computational tasks that
involve a variety of pre-processing of satellite data (e.g. land use data) include
complex neighbourhood operations.
For example, the total storage requirements for a global land cover data set
may reach terabytes, and its processing may demand giga or even peta
floating point operations (GFLOPs or PFLOPs). So, high-performance computing
techniques are required to acquire the information needed from
earth observation systems.
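A rough back-of-envelope calculation (the resolution, band count, and sample size are assumed here, not taken from the chapter) shows how quickly a global land cover product reaches the terabyte range:

land_area_km2 = 149e6            # approximate global land area
pixel_size_m = 30                # Landsat-class resolution (assumed)
bands, bytes_per_sample = 7, 2   # assumed band count and 16-bit samples

pixels = land_area_km2 * (1000 / pixel_size_m) ** 2
total_bytes = pixels * bands * bytes_per_sample
print(f"{pixels:.2e} pixels -> {total_bytes / 1e12:.1f} TB per global composite")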
Figure 3. Cloud based remotely sensed image big data processing model
Cloud data processing has the capability to handle the scalability issue when dealing
with remotely sensed big data. A system that uses an earth observation system
is required to perform local processing to clean raw data, which usually contains
inconsistencies. Because of the size and complexity of big data, traditional
database management faces difficulties in handling it.
The major challenges in big data processing include: 1) volume, which denotes the
large amount of data generated from remote areas; 2) velocity, which denotes the
frequency and speed at which data are generated and shared; 3) variety, which
denotes the diversity of data types collected from various sources. A system
that deals with any of these challenges may use smaller subsets to create a result
set through correlation analysis. The major disadvantages of conventional
systems include: 1) difficult transformation of the remotely sensed continuous
stream of data; 2) data collected from remote areas are not in a valid format
ready for further analysis; 3) the remote sensor network may generate
vast amounts of raw data.
To deal with various challenges in remotely sensed big data analysis, a system
that incorporates offline data storage and filtering with load balancing sub-
systems can be developed for extracting useful information. The input to the big
data system comes from social networks, satellite imagery, sensor devices, Web
servers, finance data store, and banking data store etc.
In this computational model, the load balancer balances the processing power by
distributing the real-time data to the servers where the base station is processing
data. It can also enhance the efficiency of the system. Data extraction finds
insights into the data model and discovers information to create a structured
view of the data. Here, machine learning techniques are applied to process and
interpret image data for generating maps, and summary results. The system that
deals with huge amounts of big data processing can be implemented in
development platforms like Python, R Analytics platform, and Hadoop using
MapReduce.
The atmospheric effects vary with the spatial and temporal context, and also
depend on the wavelength and geometry of the observations. Sometimes,
decoupling the effects of individual components is useful in remote sensing
applications. Feature-selection-based error removal will
also be helpful in dealing with erroneous data.
The error correction method involves two steps. The first step is to estimate
atmospheric properties from the imagery. The second is to retrieve surface
reflectance. The following steps are involved in this method:
The process steps involved in the parallel processing model for image remote
sensing are shown in Figure 4.
During the past few decades, satellite image sensors such as optical sensors,
synthetic aperture radar, and other satellite sensors have acquired huge amounts
of image scenes. In the future, the quantities of real-time image sensing data will
further increase due to data collection by high resolution satellite sensors.
State-of-the-art systems access these data and images through queries based on
geographical coordinates, time of acquisition, and sensor type. The
information collected using such traditional systems is often less relevant,
so only a few images can be used, because relevancy is determined by the
content of the image considering its structures, patterns, objects, or scattering
properties.
In this section, an automated approach to map hybrid tomato and original tomato
is discussed for analyzing and mining image patterns. Here, a decision tree
classifier is constructed by using rules that are manually written based on expert
opinions. The automated approach is more advantageous when mapping is
to be done for multiple years, because the mapping can be performed without
re-training or repeated calibration. To identify vegetation, a moderate-
resolution, time-series-based imaging spectroradiometer reflectance product
can be used, which distinguishes the hybrid variety of tomato from the original
variety for each year.
In this model, the following variables are used for measurement and
classification:
Vb: This denotes the Enhanced Vegetation index value of background in the
non-growing season.
Va: This denotes the amplitude of Enhanced Vegetation index variation
within the growing cycle. The Va value of field crops is higher than that of
natural vegetation. The average Va of hybrid tomato is greater than that of the
original crop.
p, q: changing-rate parameters that correspond to the increasing and
decreasing segments of the cycle. In this case, the field crop cycles show a fast
increase or decrease in the enhanced vegetation index value.
Di, Dd: These variables denote the middle dates of the segments where the
increasing or decreasing rates are highest. These variables are used as
indicators of the dates on which rapid growth occurs and harvesting is
necessary.
D1, D2, D3, D4: These variables denote the dates when the second derivative of
the curve reaches a local maximum or minimum. D1 denotes the starting date
and D4 the end date of the growing season.
L: This variable denotes the difference between D4 and D1. It represents
the length of the growing season, which should remain consistent.
Hybrid tomato has a shorter growing time than original tomato.
R: This variable denotes reflectance at Di. The reflectance of the hybrid variety is
slightly higher than that of the normal variety. Here, pixels with high reflectance
are hybrid, and pixels with low reflectance are the normal variety, which provides
confident pixels for training.
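A minimal sketch of how such manually written expert rules might be expressed as a decision procedure over the variables above; the threshold values are illustrative placeholders, not values from the chapter:

def classify_tomato_pixel(Va, L, R, va_min=0.3, season_max_days=110, r_split=0.25):
    """Toy rule-based classifier over the phenology variables described above.

    All thresholds are illustrative placeholders, not calibrated values.
    """
    if Va < va_min:
        return "natural vegetation"      # field crops show a larger EVI amplitude
    if L > season_max_days:
        return "original tomato"         # the hybrid variety has a shorter season
    # shorter season: separate hybrid from original by reflectance at Di
    return "hybrid tomato" if R > r_split else "original tomato"

print(classify_tomato_pixel(Va=0.55, L=95, R=0.31))   # -> hybrid tomato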
In the past few decades, machine learning has been a widely used technique in
earth science and observation systems for tasks such as land use detection, ocean
change detection, road extraction, atmospheric analysis, disaster prediction, crop
disease prediction, activity detection, etc. Herein, a number of relevant applications of
machine learning are summarized for understanding its applicability in
geosciences and remote sensing. The main focus may fall under two categories,
one is on how to apply multivariate nonlinear nonparametric regression, and the
other is on how to use multivariate nonlinear unsupervised classification.
Some of the popular machine learning algorithms that can be used in image
classification include the Support Vector Machine, the k-nearest neighbour
algorithm, Artificial Neural Networks, Genetic Algorithms, the Expectation
Maximization algorithm, the C4.5 decision tree algorithm, AdaBoost, CART, the
k-means algorithm, etc. Mostly, machine learning algorithms are used to predict
future trends or to support quality decisions. A data mining task that involves machine learning aims to
discover information and generate a large number of rules. In big data mining,
image classification mainly focuses on classifying new objects or unknown
vectors under a predefined category or target label.
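As a hedged illustration of one of these algorithms applied to pixel classification (synthetic spectra, scikit-learn assumed to be available), a support vector machine can be trained on labelled pixel vectors and used to predict the class of unseen pixels:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# synthetic "pixels": 4 spectral bands, two land cover classes
rng = np.random.default_rng(0)
water = rng.normal(loc=[0.10, 0.10, 0.05, 0.02], scale=0.02, size=(500, 4))
crops = rng.normal(loc=[0.10, 0.20, 0.15, 0.45], scale=0.05, size=(500, 4))
X = np.vstack([water, crops])
y = np.array([0] * 500 + [1] * 500)          # 0 = water, 1 = crops

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))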
Image mining is one of the domains that can be used to extract meaningful
image content from a large image dataset. Image classification automatically
assigns image pixels to an appropriate target class based on the natural structure
of and relationships among the image data. It has two categories of classification,
namely, supervised classification and unsupervised classification.
Remote sensing provides a method to quickly and directly acquire data from
the earth's surface. Emerging trends in remote sensing of big data in information
science and environmental engineering have led to the application of remote
sensing and monitoring techniques in various fields, which include ecology,
earthquake prediction and analysis, soil contamination, air pollution analysis, water
pollution analysis, environmental geology, solid waste detection and monitoring,
street light monitoring, crop disease analysis and loss prediction, industrial fraud
detection and monitoring, weather forecasting, customer behaviour prediction,
patient monitoring, home appliances monitoring, gas leakage detection and
monitoring in industry, energy consumption and sustainable development etc.
The computational model discussed in this chapter shows the importance of a
particular model in a problem domain.
In recent years, major countries, including India, the USA, and Russia, have
launched remote sensing satellites. The remotely sensed features and data may
differ based on image resolution, spectrum, mode of imaging, revisit cycle,
amplitude, and time. Nowadays, there are different remote sensing systems.
Examples of low-resolution satellite imaging include the meteorological sensor
MODIS and the microwave satellite Envisat. Examples of mid-resolution satellite
imaging include terrestrial satellites (e.g. Landsat), satellites with long revisit
periods (e.g. EO-1), and microwave satellites (e.g. Terra and RADARSAT). Examples
of high-resolution satellite imaging include QuickBird, IKONOS, and WorldView.
It requires efficient investigations and techniques to deal with increasing
diversity of data.
Another challenge in handling remote sensing data is to deal with increasing size
or large volume of data. In image remote sensing, for a single scene, the volume
of data may be at the gigabyte level or at the terabyte level. The satellite remote
sensing data collected for a particular period in one country is maintained as
historical data. The volume of these data may be at the petabyte level, and a large
archive maintained at the global level may reach the exabyte level. Therefore,
remote sensing data is termed “big data” and requires efficient big data
handling techniques.
CONCLUSION
This chapter presented different computational models for image remote sensing
and big data handling. The representation of remotely sensed image information
on a hierarchical form or in suitable form with different semantic abstraction is
based on the levels involved in computational models. For example, a Bayesian
model may consist of the following levels: 1) extracting image features and
meta-features using signal models; 2) obtaining a vocabulary of signal classes
for each model by applying unsupervised machine learning (or clustering) to the
pre-extracted image parameters; 3) finally, user interests, i.e., semantic labels, are
linked to combinations of these vocabularies through Bayesian networks. In
order to infer information from the image data that covers the class label or
target label, the system has to learn the probabilistic link based on user-given
input samples.
To overcome this drawback, a multi-resolution image data cube that has the
original image at the lowest layer and reduced resolution representations of the
image at the higher layers of the data cube may be generated. By applying a
texture model that uses a Gibbs random field to layers of limited neighbourhood
size, information about different structures can be extracted. This provides a way
to characterize a large set of spatial information. The output of feature extraction
may have large volumes of data, which may be difficult to store and manage in
practical applications. Moreover, clustering may reduce accuracy because of the
large data reduction. To remove unnecessary structures and to avoid the time-
consuming process of similarity checking, clustering is performed across all
images. Even if a model generates a large number of clusters, good efficiency can
still be achieved by applying those computational models that focus on parallel
processing to improve the efficiency of the system and produce promising results.
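A minimal sketch of such a multi-resolution data cube, built here by repeated 2x2 block averaging in plain NumPy (no pyramid library is assumed):

import numpy as np

def build_resolution_cube(image, levels=4):
    """Return [original, half-res, quarter-res, ...] by 2x2 block averaging."""
    cube = [image.astype(np.float32)]
    for _ in range(1, levels):
        prev = cube[-1]
        h, w = (prev.shape[0] // 2) * 2, (prev.shape[1] // 2) * 2
        blocks = prev[:h, :w].reshape(h // 2, 2, w // 2, 2)
        cube.append(blocks.mean(axis=(1, 3)))
    return cube

cube = build_resolution_cube(np.random.rand(512, 512))
print([layer.shape for layer in cube])   # (512,512), (256,256), (128,128), (64,64)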
REFERENCES
Alavi, A. H., & Gandomi, A. H. (2011). A robust data mining
approach for formulation of geotechnical engineering systems.
Engineering Computations , 28(3), 242–274.
doi:10.1108/02644401111118132
Arvor, D., Jonathan, M., Meirelles, M. S. P., Dubreuil, V., & Durieux,
L. (2011). Classification of MODIS EVI time series for crop mapping
in the state of Mato Grosso, Brazil. International Journal of Remote
Sensing, 32(22), 7847–7871. doi:10.1080/01431161.2010.531783
Dong, J., Xiao, X., Kou, W., Qin, Y., Zhang, G., Li, L., & Moore, B.
III. (2015). Tracking the dynamics of paddy rice planting area in
1986-2010 through time series Landsat images and phenology-based
algorithms. Remote Sensing of Environment , 160, 99–113.
doi:10.1016/j.rse.2015.01.004
Lobell, D. B., Thau, D., Seifert, C., Engle, E., & Little, B. (2015). A
scalable satellite-based crop yield mapper. Remote Sensing of
Environment , 164, 324–333. doi:10.1016/j.rse.2015.04.021
Vermote, E. F., Tanre, D., Deuze, J. L., Herman, M., & Morcrette, J.
(1997). Second simulation of the satellite signal in the solar spectrum,
6S: An overview. IEEE Transactions on Geoscience and Remote
Sensing , 35(3), 675–686. doi:10.1109/36.581987
Venkatesan M.
National Institute of Technology Karnataka, India
Prabhavathy P.
VIT University, India
ABSTRACT
Effective and efficient strategies to acquire, manage, and analyze data lead to
better decision making and competitive advantage. The development of cloud
computing and the big data era brings challenges to traditional data mining
algorithms. The processing capacity, architecture, and algorithms of traditional
database systems cannot cope with big data analysis. Big data are now rapidly
growing in all science and engineering domains, including the biological and
biomedical sciences and disaster management. These characteristics of complexity
pose an extreme challenge for discovering useful knowledge from big data.
Spatial data is complex big data. The aim of this chapter is to propose a
multi-ranking decision tree big data approach to handle complex spatial
landslide data. The proposed classifier's performance is validated with a massive
real-time dataset. The results indicate that the classifier exhibits both time
efficiency and scalability.
INTRODUCTION
A very large amount of geo-spatial data leads to the definition of complex
relationships, which creates challenges in today's data mining research. Recent
scientific advancement has led to a flood of data from distinct domains such
as healthcare, scientific sensors, user-generated data, the Internet, and disaster
management. Big data is data that exceeds the processing capacity of
conventional database systems. The data is too big, moves too fast, or doesn't fit
the strictures of your database architectures. For instance, big data is commonly
unstructured and requires more real-time analysis. This development calls for new
system architectures for data acquisition, transmission, storage, and large-scale
data processing mechanisms. Hadoop is a platform for distributing computing
problems across a number of servers. First developed and released as open
source by Yahoo, it implements the MapReduce approach pioneered by Google
in compiling its search indexes. Hadoop's MapReduce involves distributing a
dataset among multiple servers and operating on the data: the “map” stage. The
partial results are then recombined: the “reduce” stage. To store data, Hadoop
uses its own distributed file system, HDFS, which makes data available to
multiple computing nodes.
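A minimal sketch of the map and reduce stages as Hadoop Streaming scripts in Python; the CSV input format with the class label as the last field is an assumption for illustration:

# mapper.py -- emit (class_label, 1) for every input record
import sys

for line in sys.stdin:
    label = line.strip().split(",")[-1]    # assume the label is the last CSV field
    if label:
        print(f"{label}\t1")

# reducer.py -- sum the counts for each class label (input is sorted by key)
import sys

current, count = None, 0
for line in sys.stdin:
    label, value = line.rstrip("\n").split("\t")
    if label != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = label, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")

These scripts would be submitted through the standard Hadoop Streaming jar, passed as the -mapper and -reducer options together with HDFS input and output paths.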
Related Work
Decision trees are one of the most accepted methods for classification in diverse
data mining applications (H. I. Witten & E. Frank, 2005; M. J. Berry & G. S.
Linoff, 1997) and help the development of decision making(J. R. Quinlan,
1990). One of the well known decision tree algorithms is C4.5 (J. R. Quinlan,
1993; J. R. Quinlan, 1996), an expansion of basic ID3 algorithm(J. R.
Quinlan,1986). However, with the growing improvement of cloud computing
(M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee,
D. Patterson, A. Rabkin, I.Stoica and M. Zaharia, 2010) as well as the big data
challenge (D. Howe, M. Costanzo, P. Fey, T. Gojobori, L. Hannick, W. Hide,
D.P. Hill, R. Kania, M. Schaeffer, S.S., 2008), traditional decision tree
algorithms reveal numerous restrictions. First and foremost, building a decision
tree can be very time consuming when the volume of the dataset is extremely big,
and a new computing paradigm should be applied on clusters. Second, although
parallel computing(V. Kumar, A. Grama, A. Gupta & G. Karypis, 1994) in
clusters can be leveraged in decision tree based classification algorithms (K. W.
Bowyer, L. O. Hall, T. Moore, N. Chawla & W. P. Kegelmeyer, 2000; J. Shafer,
R. Agrawal & M. Mehta, 1996), the strategy of data distribution should be
optimized so that the data required for building one node is localized and,
meanwhile, the communication cost is minimized. Weighted classification is well-
suited for many real-world binary classification problems. Weighted
classification (J.L.Polo, F.Berzal, & J.C.Cubero, 2007) assigns different
importance degrees to different attributes. Many different splitting criteria for
attribute selection have been proposed in the literature and they all tend to
provide similar results (F.Berzal, J.C.Cubero, F.Cuenca, & M.J.Martín-Bautista,
2003).
An integration of remote sensing, GIS, and data mining techniques has been
used to predict landslide risk. Probabilistic and statistical approaches
were applied to estimate the landslide susceptibility area. A landslide
susceptibility map reduces the landslide hazard and is used for land cover
planning. The frequency ratio model performs better than the logistic regression
model. Fuzzy membership functions and factor analysis were used to assess
landslide susceptibility using various factors. The spatial data were collected and
processed to create a spatial database using GIS and image processing
techniques. The landslide occurrence factors were identified and processed. Each
factor's weight was determined and the training was computed using back-
propagation. An improvised Bayesian classification approach (Venkatesan M,
Rajawat A S, Arunkumar T, Anbarasi M, & Malarvizhi K, 2014) and a decision
tree approach (Venkatesan M, Arunkumar Thangavelu, & Prabhavathy P, 2013)
have been applied to predict landslide susceptibility in the Nilgiris district.
Classification is the process of predicting the unknown class label using a training
data set. Classification approaches are categorized into decision tree, back-
propagation neural network, support vector machine (SVM), rule-based
classification, and Bayesian classification. In the present scenario, landslide
analysis has been done using neural networks and Bayesian methods, but these
approaches are difficult to understand and tricky to use for prediction. In this
chapter, a Multi-Ranking Decision Tree Classifier is proposed for landslide risk
analysis. The performance of the proposed approach is measured with various
parameters.
The Decision Tree (DT) approach is used to analyze the data in the form of a tree.
The tree is constructed using a top-down, recursive splitting technique. A tree
structure consists of a root node, internal nodes, and leaf nodes. Ranking
classification techniques give simpler models for the important classes. Ranking
classification assigns different importance degrees to different landslide factors.
In this chapter, rankings are assigned to the different landslide factors in order to
represent the relative importance of each landslide factor. In a distributed
computing environment, the large data sets are handled by an open source
framework called Hadoop. It consists of MapReduce, the Hadoop Distributed File
System (HDFS), and a number of related projects such as Apache Hive, HBase,
and ZooKeeper.
In general, the input and output are both in the form of key/value pairs. Figure 2
shows the MapReduce programming model architecture. The input data is divided
into blocks of 64 MB or 128 MB. The mapper input is supplied as key/value pairs,
and it produces the corresponding output in the form of key/value pairs. The
partitioner and combiner are used between the mapper and reducer to perform
sorting and shuffling. The reducer iterates through the values that are associated
with a specific key and produces zero or more outputs.
Figure 2. MapReduce Architecture
The traditional data is converted into the above three data structures for
MapReduce processing. In Algorithm I, the data conversion procedure transforms
each instance record into an attribute table with attribute Aj as the key, and the
row id and class label c as the values. Then, REDUCEATTRIBUTE computes the
number of instances with specific class labels if split by attribute Aj, which forms
the count table. Note that the hash table is set to null at the beginning of the process.
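The following is an illustrative sketch of that data-conversion idea, not the chapter's Algorithm I itself: instance records are mapped into an attribute table keyed by attribute, and class-label counts per attribute value are then reduced into a count table.

from collections import defaultdict

def map_to_attribute_table(records, attribute_names):
    """Emit (attribute, (row_id, value, class_label)) tuples for each record."""
    for row_id, (*values, label) in enumerate(records):
        for attr, value in zip(attribute_names, values):
            yield attr, (row_id, value, label)

def reduce_attribute_counts(attribute_table):
    """Count instances per (attribute, value, class_label) -- the 'count table'."""
    counts = defaultdict(int)
    for attr, (_, value, label) in attribute_table:
        counts[(attr, value, label)] += 1
    return counts

records = [("steep", "clay", "high"), ("flat", "sand", "low"), ("steep", "sand", "high")]
table = list(map_to_attribute_table(records, ["slope", "soil"]))
print(reduce_attribute_counts(table))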
As shown in Algorithm III, the records are read from the attribute table with key
value equal to abest, and the counts of class labels are emitted.
Algorithm IV shows the procedure to grow the decision tree by building
linkages between nodes.
Ranking classification techniques give simpler models for the important classes.
Ranking classification assigns different importance degrees to different landslide
factors. In this chapter, rankings are assigned to the different landslide factors in
order to represent the relative importance of each landslide factor. The weighted
decision tree classification algorithm is improved into a multi-ranking decision
tree classifier using the MapReduce programming model, as shown in the above
algorithms. The developed classifier is used to analyze landslide risk in the
Ooty region of the Nilgiris district. The proposed classifier's scalability is
improved, and its performance is compared with existing classification methods,
with particular concern for the time efficiency of the parallel version of the
weighted decision tree classification algorithm in a big data environment. This
chapter focuses on landslide risk analysis using big data computational techniques.
The needed toposheets and required maps are collected from the Geological
Survey of India. Many factors cause landslides in the hill region, but four factors
are very significant.
The performance of the proposed classifier is compared with the weighted decision
tree classifier and the decision tree classifier on a single node. Figure 4 illustrates
the following observations.
The scalability of the proposed ranking decision tree classification is also tested
in a distributed parallel environment. The scalability evaluation includes two
aspects: (1) performance with different numbers of nodes, and (2) performance
with different sizes of training datasets.
CONCLUSION
Predicting and analyzing disasters is a complex task. In this chapter, landslide risk
is analyzed using a multi-ranking decision tree classifier approach. The disaster
management domain generates huge amounts of data. Traditional sequential
decision tree algorithms cannot handle such huge data sets. For example, as
the size of the training data grows, the process of building decision trees can
become very time consuming. To address these challenges, a parallel weighted
decision tree classifier approach is proposed to improve the scalability of the
model. We have compared the performance of the proposed approach with the
existing approach with respect to the number of nodes and the number of records.
The empirical results show that the proposed algorithm exhibits both time
efficiency and scalability. In future work, rainfall-induced landslide risk analysis
will be studied using big data computational approaches.
REFERENCES
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A.,
… Zaharia, M. (2010). A view of cloud computing. Communications of the
ACM, 53(4), 50–58.
Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W.,
… Rhee, S. Y. (2008). Big data: The future of biocuration. Nature,
455(7209), 47–50. doi:10.1038/455047a
Kumar, V., Grama, A., Gupta, A., & Karypis, G. (1994). Introduction
to parallel computing (Vol. 110). Redwood City:
Benjamin/Cummings.
Wu, X., Zhu, X., Wu, G.-Q., & Ding, W. (2014). Data mining with big data.
IEEE Transactions on Knowledge and Data Engineering, 26(1).
Prabu Sevugan
VIT University, India
P. Swarnalatha
VIT University, India
ABSTRACT
A number of methodologies are available in the fields of data mining, machine
learning, and pattern recognition for solving classification problems. In the past
few years, retrieval and extraction of information from large amounts of data has
been growing rapidly. Classification is a stepwise process of predicting responses
using existing data. Some of the existing prediction algorithms are the support
vector machine and k-nearest neighbor. However, each algorithm has drawbacks
depending on the type of data. To reduce misclassification, a new support vector
machine methodology is introduced. Instead of placing the hyperplane exactly in
the middle, the position of the hyperplane is changed according to the number of
data points of each class available near the hyperplane. To optimize the time
consumed in computing the classification algorithm, a multi-core architecture is
used to compute more than one independent module simultaneously. All this
results in reduced misclassification and faster computation of the class for a data
point.
INTRODUCTION
These days, numerous organizations are using “big data” and “machine learning”
technologies for data analysis. These terms describe the situation in which the
available data is so complex and large that it becomes distinctly clumsy to work
with existing statistical algorithms, which restrict the size and type of data. The
existing data mining algorithms can usually be divided into subtypes such as
“association rule mining”, “classification”, and “clustering” (A Survey on Feature
Selection Techniques and Classification Algorithms for Efficient Text
Classification). The classification technique works by associating unstructured
data with well-structured data. Numerous classification techniques have been
introduced in the field of big data, and every algorithm has its own pros and cons
depending on the type of data to be classified. The performance of these
techniques is generally measured in terms of cost, and cost is nothing but the
required computation time and misclassification.
Machine learning has developed to imitate the pattern matching that
human brains perform. Today, machine learning algorithms teach computers
to recognize features of an object. In these models, for instance, a computer is
shown an apple and told that it is an apple. The computer then uses that
data to characterize the different attributes of an apple, building upon new data
each time. At first, a computer may classify an apple as round, and build a
model that states that if something is round, it is an apple. Later,
when an orange is presented, the computer learns that if something is round
AND red, it is an apple. Then a tomato is presented, and so on. The computer
must continually adjust its model in light of new data and assign a predictive
value to each model, indicating the degree of confidence that an object is one
thing rather than another. For instance, yellow is a more predictive value for a
banana than red is for an apple.
Machine learning has a few very practical applications that drive the
kind of real business results – for example, time and cost savings – that
can potentially have a dramatic effect on the future of an organization. In the
customer care industry in particular, machine learning is allowing people to
accomplish things more quickly and efficiently. Through virtual assistant
solutions, machine learning automates tasks that would otherwise
need to be performed by a live agent – for example, changing a password
or checking an account balance. This frees up valuable agent time that
can be used to focus on the kind of customer care that humans perform best:
high-touch, complicated decision making that is not as easily handled by a
machine. The process can also be enhanced by eliminating the decision of
whether a request should be sent to a human or a machine: with adaptive
understanding technology, the machine learns to recognize its limitations and
hands over to humans when it has low confidence in providing the right solution.
BACKGROUND
The classification technique works by associating unstructured data with well-
structured data. Numerous classification techniques have been introduced in
the field of big data, and every algorithm has its own pros and cons depending
on the type of data to be classified. The performance of these techniques is
generally measured in terms of cost, and cost is nothing but the required
computation time and misclassification.
Classification
Many existing methods propose abstracting the test data before classifying
it into different classes. There are a few options for performing abstraction before
classification: a data set can be generalized to either a minimally generalized
abstraction level, an intermediate abstraction level, or a high abstraction level.
Too low an abstraction level may result in scattered classes, bushy classification
trees, and difficulty with concise semantic interpretation, whereas too high a level
may result in the loss of classification accuracy.
1. Genetic Algorithm
2. Decision Trees
The accuracy of the classifier is determined by the percentage of the test cases
that are correctly classified. The attributes of the records are partitioned into two
types as follows:
There is one distinguished attribute called the class label. The objective of
classification is to build a concise model that can be used to predict the
class of the records whose class label is not known.
A decision tree is a tree in which an internal node is a test on an attribute, a
branch is an outcome of the test, and a leaf node is a class label or class
distribution.
Tree Construction
Tree pruning
There are different techniques for building decision trees from a given training
data set. Some essential concepts involved in the construction of decision trees
are discussed below.
Splitting Attribute
With each node of the decision tree, there is an associated attribute whose values
determine the partitioning of the data set when the node is expanded.
Splitting Criterion
The qualifying condition on the splitting attribute for partitioning the data set at
a node is known as the splitting criterion at that node. For a numeric attribute,
the criterion can be an equation or an inequality. For a categorical attribute, it is a
membership condition on a subset of values.
Various algorithms for inducing decision trees have been proposed over
the years. They differ among themselves in the techniques used for choosing
splitting attributes and splitting conditions. These algorithms can be grouped into
two types. The first kind consists of the classical algorithms, which
handle only memory-resident data. The second class can deal with the
efficiency and scalability issues. These algorithms remove the memory
limitations and are fast and scalable.
One of the most prominent heuristics for solving the k-means
problem is based on a simple iterative scheme for finding a locally optimal solution.
This algorithm is commonly called the k-means algorithm. There are numerous
variants of this algorithm, so to clarify which version we are using, we
will refer to it as the naive k-means algorithm, as it is substantially less
complex compared with the alternative algorithms described here.
The naive k-means algorithm partitions the dataset into “k” subsets such that all
records, from now on referred to as points, in each subset “belong” to the same
center. Likewise, the points in each subset are closer to that center than to any
other center. The partitioning of the space can be compared to Voronoi
partitioning, except that in Voronoi partitioning one partitions the space based on
distance, whereas here we partition the points based on distance. The algorithm
keeps track of the centroids of the subsets and proceeds in simple iterations. The
initial partitioning is randomly generated; that is, we randomly initialize the
centroids to some points in the region of the space. In every iteration step, a new
set of centroids is generated using the existing set of centroids, following two
very simple steps. Let us denote the set of centroids after the ith iteration by C(i):
• Partition the points based on the centroids C(i), that is, find the
centroid to which each of the points in the dataset belongs. The
points are partitioned based on their Euclidean distance from the
centroids.
• Set a new centroid c(i+1) ∈ C(i+1) to be the mean of all
the points that are closest to c(i) ∈ C(i). The new location of the
centroid in a particular partition is referred to as the new location of the old
centroid.
The algorithm is said to have converged when re-computing the partitions does
not result in a change in the partitioning. In the terminology that we are
using, the algorithm has converged completely when C(i) and C(i – 1) are
identical. For configurations where no point is equidistant to more than one
center, the above convergence condition can always be reached. This convergence
property, along with its simplicity, adds to the attractiveness of the k-means
algorithm.
The naive k-means needs to perform a large number of “nearest neighbour”
queries for the points in the dataset. If the data is “d” dimensional and there are “N”
points in the dataset, the cost of a single iteration is O(kdN). As one would
need to run several iterations, it is generally not feasible to run the naive
k-means algorithm for a large number of points.
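A minimal NumPy sketch of the naive k-means iteration described above, using the change in distortion as the stopping rule (synthetic data; not an optimized implementation):

import numpy as np

def naive_kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    prev_distortion = np.inf
    for _ in range(max_iter):
        # assign every point to its nearest centroid (the O(kdN) step)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        distortion = (dists[np.arange(len(points)), assign] ** 2).sum()
        # move each centroid to the mean of its assigned points
        for j in range(k):
            members = points[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
        if prev_distortion - distortion < tol:   # distortion-based convergence
            break
        prev_distortion = distortion
    return centroids, assign, distortion

pts = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 5.0])
centroids, labels, distortion = naive_kmeans(pts, k=2)
print(centroids, distortion)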
Sometimes the convergence of the centroids (i.e. C(i) and C(i+1) being
identical) takes several iterations. Likewise, in the last few iterations, the
centroids move very little. As running the costly iterations so many
more times might not be efficient, we require a measure of convergence of
the centroids so that we can stop the iterations when the convergence criterion
is met. Distortion is the most widely accepted measure.
Clustering error measures the same criterion and is occasionally used instead of
distortion. In fact, the k-means algorithm is designed to optimize distortion.
Placing the cluster center at the mean of all the points minimizes
the distortion for the points in the cluster. Likewise, when another cluster center
is closer to a point than its present cluster center, moving the point from its
present cluster to the other can reduce the distortion further. The above two steps
are precisely the steps performed by the k-means algorithm. In this way k-means
reduces distortion in each step locally. The k-means algorithm terminates at a
solution that is locally optimal for the distortion function. Thus, a natural
choice as a convergence criterion is distortion. Among other measures of
convergence used by other researchers, we can measure the sum of the Euclidean
distances of the new centroids from the old centroids. In this work, we
always use clustering error/distortion as the convergence criterion for all variants
of the k-means algorithm.
The local convergence properties of k-means have been improved in this algorithm.
Likewise, it does not require the initial set of centroids to be
chosen. The idea is that the global minimum can be reached through a
series of local searches based on the global clustering with
one cluster less.
Let us assume that the problem is to find K clusters and K' ≤ K. Under
the above assumption, the global optimum for k = K' clusters is computed as
a series of local searches. Assuming that we have solved the k-
means clustering problem for K' – 1 clusters, we need to place a new cluster at a
suitable location. To find the appropriate insertion location, which is not known,
we run the k-means algorithm until convergence with each of the points in the
entire set of points in the dataset being added as the candidate new cluster, one at
a time, to the K' – 1 clusters. The converged K clusters that have the minimum
distortion after the convergence of k-means in the above local searches are
the clusters of the global k-means. We know that for k = 1, the optimal
clustering solution is the mean of all the points in the
dataset. Using the above technique, we can compute the optimal positions for the
k = 2, 3, 4, ..., K clusters. Consequently, the procedure involves computing the
optimal k-means centers for each of the k = 1, 2, 3, ..., K clusters. The algorithm is
completely deterministic.
Although the appeal of the global k-means lies in its finding the
global solution, the technique involves a substantial cost. K-means is
run N times, where N is the number of points in the dataset, for each cluster to
be inserted. The complexity can be reduced significantly by not running
k-means with the new cluster being inserted at each of the dataset points,
but by finding another set of points that could act as an
appropriate set of candidate insertion locations for the new cluster.
The variant of the kd-tree splits the points in a node using the plane that
passes through the mean of the points in the node and is perpendicular to the
principal component of the points in the node. A node is not split if it has fewer
than a pre-specified number of points or if an upper bound on the number of leaf
nodes is reached. The idea is that even if the kd-tree were
not used for nearest neighbour queries, merely the construction of the kd-tree
based on this strategy would give a good preliminary clustering of the
data. We can accordingly use the kd-tree node centers as the
candidate/initial insertion positions for the new clusters. The time
complexity of the algorithm can also be improved by adopting a
greedy strategy. In this approach, running k-means for every possible
insertion position is avoided. Instead, the reduction in the distortion
when the new cluster is added is considered without running k-means. The
point that gives the maximum reduction in the distortion when added as a
cluster center is taken to be the new insertion position.
K-means is then run until convergence on the new list of clusters with this
added point as the new cluster. The assumption is that the point that gives
the maximum decrease in distortion is also the point for which the
converged clusters would have the minimum distortion. This results in a
considerable improvement in the running time of the algorithm, as it is unnecessary
to run k-means for all the possible insertion positions. However, the
solution may not be globally optimal but rather an approximate global
solution.
6. Self-Organizing Map
Neurons are typically organized in a 2D grid, and the SOM tries to find
clusters such that any two clusters that are near each other in the
grid space have codebook vectors that are near each other in the input
space.
Supervised Learning
Unsupervised Learning
Unsupervised learning considers how systems can learn to represent
particular input patterns in a way that reflects the statistical structure of
the overall collection of input patterns (Laskov, P., & Lippmann, R., 2010). In
contrast to “SUPERVISED LEARNING or
REINFORCEMENT LEARNING”, there are no explicit target outputs or environmental
evaluations associated with each input; rather, the unsupervised learner brings to
bear prior biases concerning what aspects of the structure of
the input should be captured in the output (Laskov, P., & Lippmann, R.,
2010).
The only things that unsupervised learning methods have to work with are the
observed input patterns, which are often assumed to be independent samples from
an underlying unknown probability distribution, and some explicit or implicit a
priori information as to what is important. One key idea is that input, for
example the image of a scene, has distal independent causes, for example
objects at given locations illuminated by particular lighting (Laskov, P., & Lippmann, R.,
2010). Since it is on those independent causes that we normally must act, the best
representation for an input is in their terms. Two classes of technique have
been recommended for unsupervised learning. Density estimation methods
explicitly build statistical models (for example, “BAYESIAN NETWORKS”)
of how underlying causes could create the input. Feature extraction techniques
attempt to extract statistical regularities (or occasionally irregularities)
directly from the inputs (Laskov, P., & Lippmann, R.,
2010).
Unsupervised learning has a long and distinguished history. Some early influences were Horace Barlow (see Barlow, 1992), who sought ways of characterizing neural codes; Donald MacKay (1956), who adopted a cybernetic-theoretic approach; and David Marr (1970), who made an early unsupervised learning postulate about the goal of learning in his model of the neocortex. The Hebb rule (Hebb, 1949), which links statistical methods to neurophysiological experiments on plasticity, has also cast a long shadow. Geoffrey Hinton and Terrence Sejnowski, in devising a model of learning called the Boltzmann machine (1986), imported many of the ideas from statistics that now dominate the density estimation techniques (Grenander, 1976-1981). Feature extraction methods have been less widely investigated.
Clustering provides a convenient example. Consider the case in which the inputs are the photoreceptor activities created by various images of an apple or an orange. In the space of all possible activities, these particular inputs form two clusters, with many fewer degrees of variation than the full space, for example along a redder-yellower dimension. One common task for unsupervised learning is to discover and characterize these separate, low-dimensional clusters.
The smaller class of unsupervised learning methods seeks to discover how to represent the inputs by defining some quality that good features possess, and then searching for those features in the inputs. For example, consider the case in which the output is a linear projection of the input onto a weight vector. The central limit theorem implies that most such linear projections will have Gaussian statistics. Consequently, if one can find weights such that the projection has a highly non-Gaussian (for example, multi-modal) distribution, then the output is likely to reflect some interesting aspect of the input. This is the intuition behind a statistical technique called projection pursuit. It has been shown that projection pursuit can be implemented using a modified form of Hebbian learning (Intrator & Cooper, 1992; Laskov & Lippmann, 2010). Arranging that different outputs should represent different aspects of the input turns out to be surprisingly tricky.
Projection pursuit can also implement a form of clustering in this example. Consider projecting the photoreceptor activities onto the line joining the centres of the two clusters. The distribution of all activities will be bimodal, one mode for each cluster, and therefore highly non-Gaussian. Note, however, that this single projection does not characterize well the nature or shape of the clusters.
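This clustering example can be illustrated with a short sketch: two synthetic clusters are projected onto the line joining their centres, and the excess kurtosis of the projection comes out strongly negative, a simple indicator of a bimodal (non-Gaussian) distribution. The synthetic data and the kurtosis check are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
apples = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(500, 2))   # cluster 1
oranges = rng.normal(loc=[3.0, 1.0], scale=0.3, size=(500, 2))  # cluster 2
data = np.vstack([apples, oranges])

# Project onto the (unit) line joining the two cluster centres.
w = oranges.mean(axis=0) - apples.mean(axis=0)
w /= np.linalg.norm(w)
proj = data @ w

# A Gaussian projection has excess kurtosis near 0; a bimodal one is clearly negative.
print("excess kurtosis of projection:", kurtosis(proj))
```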
1. Support Vector Machine
Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In "Support vector machines", a data point is a P-dimensional vector, and we want to know whether we can separate such data points with a (P-1)-dimensional hyperplane (Ali AlShaari, M., 2014). This is known as a linear classifier. There are numerous hyperplanes that could classify the data. The best choice of hyperplane is the one with the largest margin from the data points of the two classes; that is, we pick the hyperplane so that the distance from it to the nearest data point on each side is maximized. Such a hyperplane is known as the maximum-margin hyperplane.
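As a hedged illustration, the following sketch fits a linear maximum-margin classifier with scikit-learn and reports the margin width 2/||w||; the toy data, the large C value used to approximate a hard margin, and all names are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2, -2], 0.5, (50, 2)),
               rng.normal([2, 2], 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
w = clf.coef_[0]
print("separating hyperplane: w =", w, "b =", clf.intercept_[0])
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```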
2. K Nearest Neighbour
The existing data instances are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the existing data points.
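A minimal sketch of this training-by-storing behaviour, with a brute-force prediction step for a new point, is shown below; the Euclidean metric and the value of k are illustrative assumptions.

```python
import numpy as np
from collections import Counter

class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" is just storing the feature vectors and class labels.
        self.X, self.y = np.asarray(X, float), np.asarray(y)
        return self

    def predict_one(self, x):
        d = np.linalg.norm(self.X - x, axis=1)        # distances to all stored points
        nearest = self.y[np.argsort(d)[:self.k]]      # labels of the k nearest points
        return Counter(nearest).most_common(1)[0][0]  # majority vote
```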
CONCLUSION
This newly proposed methodology reduces misclassification, as it adjusts the hyperplane depending on the data points around the border of each class. The new algorithm is simply the combination of two algorithms, the "Support Vector Machine" and the "K-Nearest Neighbour". Data points that lie near the boundary of a class have a high chance of being misclassified, so to reduce misclassification, a double verification of the data points in these sensitive areas is carried out by the new algorithm. Since the independent modules are computed in parallel, the execution time can be dramatically decreased. The newly proposed algorithm thus reduces misclassification without increasing the actual execution time.
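The chapter does not give pseudocode for the combined method, so the following is only one possible interpretation, sketched under the assumption that points whose SVM decision value falls inside a margin band are re-checked with kNN; the band threshold and parameter names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def hybrid_predict(X_train, y_train, X_test, band=1.0, k=5):
    """SVM classifies everything; points close to the hyperplane (|decision| < band)
    are double-checked by kNN, which looks at the local neighbourhood instead."""
    svm = SVC(kernel="linear").fit(X_train, y_train)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

    pred = svm.predict(X_test)
    near_boundary = np.abs(svm.decision_function(X_test)) < band
    if near_boundary.any():
        pred[near_boundary] = knn.predict(X_test[near_boundary])  # second verification
    return pred
```

The two classifiers are independent of each other, so in principle they can be trained in parallel, which matches the parallel-computation remark above.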
REFERENCES
Sameera K.
VIT University, India
P. Swarnalatha
VIT University, India
ABSTRACT
With the prevalence of services computing and cloud computing, an ever-increasing number of services are emerging on the internet, generating a tremendous volume of data. The overwhelming service-generated data becomes too large and complex to be effectively processed by traditional approaches. How to store, manage, and create value from service-oriented big data has become an important research issue. With the increasingly large amount of data, a single framework that provides common functionality for managing and analysing different types of service-generated big data is urgently required. To address this challenge, this chapter gives a review of service-generated big data and big data-as-a-service. First, three types of service-generated big data are exploited to enhance system performance. Then, big data-as-a-service, including big data infrastructure-as-a-service, big data platform-as-a-service, and big data analytics software-as-a-service, is employed to provide common big data-related services (e.g., accessing service-generated big data and data analysis results) to users, in order to improve efficiency and reduce cost.
INTRODUCTION
Since the beginning of the 21st century, the worldwide economic structure has been shifting from an "industrial economy" to a "service economy". According to statistics from the World Bank, the output of the modern service industry accounts for more than 60 percent of world output, while the rate in developed countries exceeds 70%. Competition in the area of the modern service industry is becoming a focal point of the world's economic development. Services computing, which provides flexible computing architectures to support the modern service industry, has emerged as a promising research area. With the prevalence of cloud computing, more and more modern services are deployed in cloud infrastructures to provide rich functionalities. The number of services and service users is increasing rapidly. There has been an enormous explosion in data generation by these services, with the prevalence of mobile devices, user social networks, and large-scale service-oriented systems. The overwhelming service-generated data becomes too large and complex to be effectively processed by conventional approaches.
ANALYSIS
Figure 1 gives an overview of service-generated big data and Big Data-as-a-Service. As shown in the figure, on the one hand, we present some typical applications which exploit the three types of service-generated big data, respectively, for system performance enhancement. First, log visualization and performance problem diagnosis are examined by mining service request trace logs. Second, QoS-aware fault tolerance and service QoS prediction are studied on the basis of service QoS data. Finally, significant service identification and service migration are achieved by analysing service relationships. These concrete applications shed some light on the problem of big data analytics by mining service-generated big data.
With the popularity of large-scale service-oriented systems and the increasing number of service clients (e.g., PCs, mobile phones, and so on), a gigantic volume of trace logs is generated by service-oriented systems every day. There are billions of daily logs, log files, and structured/unstructured data items from a wide variety of service systems. For example, an email service provided by Alibaba (one of the biggest e-commerce companies in the world) produces around 30-50 gigabytes (around 120-200 million lines) of trace logs every hour (Mi, Wang, Zhou, Lyu, & Cai). These logs can be used both in the development stages and in normal operations for understanding and troubleshooting the behaviour of the complex system.
This section discusses how to investigate the trace logs to discover the value hidden in them, including trace log visualization and performance problem diagnosis.
To address this issue, various approaches have been proposed. Stardust (Thereska, Salmon, Strunk, Wachs, Abd-El-Malek, Lopez, & Ganger, 2006) uses relational databases as its store, which suffers from poor query efficiency in environments with huge data volumes. Ptracer (Mi, Wang, Cai, Zhou, Lyu, & Chen) is an online performance profiling tool that visualizes multi-dimensional statistical information to help administrators understand system performance behaviour in depth. DTrace (Cantrill, Shapiro, & Leventhal) and gprof (Graham, Kessler, & McKusick) visualize the execution of systems as call graphs to indicate where requests spend time.
Although a number of previous research studies have addressed service log visualization, this research problem becomes much more challenging in the big data scenario, created by the rapid growth of log files, the unstructured nature of the log data, and the requirement for real-time query and display. More research is needed to enable real-time processing and visualization of the enormous volume of trace logs.
2. Detection of Difficulties in Functioning: In today's distributed systems, especially cloud systems, a service request will traverse different hosts, invoking various software modules. When the service cannot satisfy the promised service level agreement (SLA) for its users, it is essential to identify which module (e.g., an invoked method) is the root cause of the performance problem in a timely manner. Trace logs provide valuable information for finding the cause of performance problems. How to exploit the enormous trace logs effectively and efficiently to help designers understand system performance and locate performance problems (see the sketch below) becomes a pressing and challenging research issue.
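As a hedged illustration of how trace logs can point at a slow module, the sketch below parses a hypothetical log format (request id, module name, latency in milliseconds) and reports the modules with the highest average latency; the format, field order, and example values are assumptions, not the chapter's.

```python
from collections import defaultdict

def slowest_modules(log_lines, top_n=3):
    """log_lines: iterable of 'request_id,module,latency_ms' strings (assumed format)."""
    total, count = defaultdict(float), defaultdict(int)
    for line in log_lines:
        _, module, latency = line.strip().split(",")
        total[module] += float(latency)
        count[module] += 1
    avg = {m: total[m] / count[m] for m in total}
    return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Example usage with a few synthetic log lines:
logs = ["r1,auth,12.5", "r1,db,230.0", "r2,auth,11.0", "r2,db,250.4", "r2,cache,3.1"]
print(slowest_modules(logs))   # db shows up as the likely performance bottleneck
```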
However, the large volume and high velocity of service-generated trace logs make it extremely hard to perform real-time analysis. Most of the previous solutions suffer from low efficiency in handling large volumes of data. More efficient storage, management, and analysis approaches for service-generated trace logs are required.
Today's large-scale distributed platforms (e.g., various cloud platforms) provide numerous services to heterogeneous and diverse users. Large volumes of QoS data for these services are recorded, on both the server side and the client side. Since different users may observe quite different QoS performance (e.g., response time) for the same service, the volume of user-side QoS data is considerably larger than that of server-side QoS data. In addition, the QoS values of service components change dynamically from time to time, leading to an explosive increase in client-side service QoS data.
In our previous work (Zheng & Lyu), a preliminary middleware was designed for fault-tolerant Web services. However, this middleware did not provide a personalized fault-tolerance strategy for different users. In the dynamic Internet environment, server-side fault tolerance is insufficient, since the communication connections can fail easily. Personalized user-side fault tolerance therefore needs to be considered. In addition, to speed up the analysis and computation over the huge volume of service QoS data, online learning algorithms (Yang, Xu, King, & Lyu) will need to be investigated for incremental updating of the fault-tolerance strategy when new QoS values become available.
2. QoS Prognosis: Web service QoS prediction aims at providing personalized QoS value prediction for service users, by using the historical QoS values of different users. Web service QoS prediction usually involves a user-service matrix, where each entry in the matrix represents the value of a certain QoS property (e.g., response time) of a Web service observed by a service user. The user-service matrix is typically very sparse, with many missing entries, since a service user has usually invoked only a few Web services in the past. The problem is how to accurately predict the missing QoS values in the user-service matrix by using the available QoS values (see the sketch below). After the missing Web service QoS values in the user-service matrix have been predicted, each service user has a QoS evaluation for every service, even the unused ones. In this way, optimal services can be selected for users to achieve good performance.
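One common way to fill in the missing entries of such a sparse user-service matrix is low-rank matrix factorization. The following NumPy sketch (latent dimension, learning rate, regularization, and the toy matrix are illustrative assumptions) learns user and service factors from the observed response times only and predicts the missing ones as dot products.

```python
import numpy as np

def factorize_qos(R, rank=5, lr=0.01, reg=0.1, epochs=200, seed=0):
    """R: user-service matrix with np.nan for missing QoS values."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, rank))
    V = 0.1 * rng.standard_normal((n_items, rank))
    users, items = np.where(~np.isnan(R))            # indices of observed entries
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = R[u, i] - U[u] @ V[i]
            U[u] += lr * (err * V[i] - reg * U[u])   # SGD update on observed entries only
            V[i] += lr * (err * U[u] - reg * V[i])
    return U @ V.T                                    # predicted full QoS matrix

# Example: 4 users x 5 services, response times in seconds, nan = never invoked.
R = np.array([[0.3, np.nan, 1.2, np.nan, 0.4],
              [0.2, 0.9, np.nan, 2.1, np.nan],
              [np.nan, 1.0, 1.1, np.nan, 0.5],
              [0.4, np.nan, np.nan, 2.0, 0.6]])
print(np.round(factorize_qos(R), 2))
```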
C. Usage Alliance
In the future, besides the service trace logs, QoS data, and service relationships, more types of service-generated big data will be investigated. More complete studies of various service-generated big data analysis approaches will be conducted. A detailed technology roadmap will be given, and security issues beyond the scope of this paper will also be investigated.
REFERENCES
ChenM. Y.AccardiA.KicimanE.LloydJ.PattersonD.FoxA.BrewerE.
(2004). Path-based failure and evolution management. In Proceedings
of the 1st conference on Symposium on Networked Systems Design
and Implementation. USENIX Association.
Chen, X., Zheng, Z., Liu, X., Huang, Z., & Sun, H. (2011).
Personalized QoS-aware Web service recommendation and
visualization. IEEE Transactions on Services Computing .
IBM. (2013). What is big data? Bringing big data to the enterprise.
Retrieved from http://www-01.ibm.com/software/data/bigdata
Liu, Li, Huang, & Wen. (2012). Shapley value based impression
propagation for reputation management in web service composition.
Proc. IEEE 19th Int’l Conf. on Web Services (ICWS’12), 58–65.
Lo, W., Yin, J., Deng, S., Li, Y., & Wu, Z. (2012). Collaborative web
service QoS prediction with location-based regularization. Proc. IEEE
19th Int’l Conf. on Web Services (ICWS’12), 464–471.
10.1109/ICWS.2012.49
Mi, H., Wang, H., Zhou, Y., Lyu, M. R., & Cai, H. (2013). Towards fine-grained,
unsupervised, scalable performance diagnosis for production cloud
computing systems. IEEE Transactions on Parallel and Distributed
Systems .
Thereska, G., Salmon, B., Strunk, J., Wachs, M., Abd-El-Malek, M.,
Lopez, J., & Ganger, G. R. (2006). Stardust: tracking activity in a
distributed storage system. ACM SIGMETRICS Performance
Evaluation Review, 34(1), 3–14. doi:10.1145/1140277.1140280
Prabu Sevugan
VIT University, India
ABSTRACT
This chapter improves the SE (searchable encryption) scheme to address these challenging issues. In the developed prototype, a hierarchical clustering technique is designed to support richer search semantics, with the additional aim of making the scheme able to meet the demand for fast ciphertext search in large-scale environments, that is, situations where there is a huge amount of data. A minimum relevance threshold is used for clustering the cloud documents with a hierarchical approach, which divides the clusters into sub-clusters until the last cluster is reached. This method can keep the computational complexity linear in the face of the exponential growth of the document collection. To authenticate the validity of the search results, a minimum hash sub-tree is also implemented. This chapter focuses on retrieval of outsourced encrypted cloud data without loss of meaning, security, or privacy, by transmitting an attribute key for the data. At the next level, the model is improved with a multilevel trust privacy-preserving scheme.
INTRODUCTION
Individuals benefit from cloud computing, as it reduces their workload and makes computing and storage simpler (Liang, Cai, Huang, Shen & Peng, 2012; Mahmoud & Shen, 2012; Shen, Liang, Shen, Lin & Lou, 2012). Data can be stored remotely in the cloud server through data outsourcing and accessed publicly. This represents a scalable, reliable, and low-cost method for public access to data, owing to the high productivity and scalability of cloud servers, and so it is preferred.
Nevertheless, if the user clicks through from the search-result page to another site, that site can identify the search terms the user has used.
To deal with the above issues, searchable encryption (e.g., Song, Wagner & Perrig, 2000; Li, Xu, Kang, Yow & Xu, 2014; Li, Liu, Dai, Luan & Shen, 2014) has been developed as a fundamental approach to allow searching over encrypted cloud data; it proceeds as follows. First, the data owner generates several keywords according to the outsourced data. These encrypted keywords are then stored on the cloud server. When the outsourced data need to be accessed, the user chooses some appropriate keywords and sends the ciphertext of the selected keywords to the cloud server. The cloud server then uses the ciphertext to match the encrypted outsourced keywords, and finally returns the matching results to the search user. To achieve search effectiveness and accuracy over encrypted data comparable to plaintext keyword search, an extensive body of research has been developed in the literature. Wang et al. (2014) recommended a ranked keyword search scheme which considers the relevance scores of keywords. Unfortunately, because it uses order-preserving encryption (OPE) (Boldyreva, Chenette, Lee & Oneill, 2009) to achieve the ranking property, the proposed scheme cannot achieve unlinkability of trapdoors.
Cao et al. proposed the coordinate matching search scheme (MRSE), which can be viewed as a searchable encryption scheme with an "OR" operation. Shen, Liang, Shen, Lin, and Luo (2014) recommended a conjunctive keyword search scheme, which can be viewed as a searchable encryption scheme with an "AND" operation, where the returned documents match all keywords. However, most current proposals can only support search with a single logic operation, rather than a mixture of multiple logic operations on keywords, which motivates this work.
Here, the authors address the above two issues by developing two Fine-grained Multi-Keyword Search (FMS) schemes over encrypted cloud data. The original contributions can be summarized in three aspects, as follows:
A variety of research works have recently been developed on multi-keyword search over encrypted data. Cash et al. (Cash, Jarecki, Jutla, Krawczyk, Roşu, & Steiner, 2013) offer a symmetric searchable encryption scheme which achieves high efficiency for large databases, at some cost in the security guarantees. Cao et al. (Cao, Wang, Li, Ren, & Lou, 2014) suggest a multi-keyword search scheme supporting result ranking by adopting the k-nearest neighbours (kNN) technique (Wong, Cheung, Kao, & Mamoulis, 2009). Naveed et al. (2014) propose a dynamic searchable encryption scheme built on blind storage to hide the access pattern of the search user. In order to meet practical search requirements, search over encrypted data should support the following three functions. First, the searchable encryption schemes should support multi-keyword search, and provide the same experience for the user as searching in Google with different keywords; single-keyword search is far from satisfactory, as it returns only very limited and imprecise results. Second, to quickly identify the most relevant results, the search user would typically prefer cloud servers to sort the returned search results in a relevance-based order (Pang, Shen, & Krishnan, 2010), ranked by the relevance of the documents to the search request. In addition, returning the ranked search results to users can also eliminate unnecessary network traffic by sending back only the most relevant results from the cloud to the search users. Third, as for search efficiency, since the number of documents contained in a database can be extremely large, searchable encryption constructions should be efficient enough to respond to search requests with minimal delay. In contrast to these desired properties, most of the existing proposals, however, fail to offer sufficient insight into the construction of fully functional searchable encryption. As a step towards this goal, the authors propose an efficient multi-keyword ranked search (EMRS) scheme over encrypted mobile cloud data through blind storage. The main contributions can be summarized as follows:
Clouds are huge pools of easily usable and accessible virtualized resources. The data and software applications needed by the users are not stored on their own computers; instead, they are hosted on remote servers that are outside the control of the users. It is a pay-per-use model in which the infrastructure provider guarantees resources by means of service level agreements (SLAs) (Vaquero, Rodero-Merino, Caceres, & Lindner, 2009). As cloud computing becomes prevalent, more and more sensitive information is being centralized into the cloud, such as emails, photo albums, personal health records, financial transactions, tax documents, government documents, etc.
The fact that data owners and the cloud server are no longer in the same trusted domain may put the data at risk if it is outsourced unencrypted: the cloud server may leak data to unauthorized entities or may be hacked. To provide data privacy, sensitive data need to be encrypted before outsourcing to the commercial public cloud (Kamara & Lauter, 2010). The trivial solution of downloading all the data and decrypting it locally is clearly unreasonable, due to the huge bandwidth cost in cloud-scale systems.
Enabling privacy-preserving and effective search over encrypted cloud data is of paramount importance. Considering the potentially large number of on-demand data users and the enormous quantity of outsourced data documents in the cloud, this problem is particularly challenging, as it is extremely difficult to meet the requirements of performance, system usability, and scalability at the same time. Encryption also makes effective utilization of the data a very challenging task, given that there could be a large number of outsourced data files. Moreover, in cloud computing, data owners may share their outsourced data with numerous users, who might want to retrieve only the specific data files they are interested in during a given session. One of the most popular ways to do so is keyword search, which allows users to selectively retrieve the files of interest.
Information retrieval is the most commonly occurring task requested of the cloud server by the user. Usually, cloud servers perform relevance-based result ranking in order to make the search faster. Such a ranked search scheme allows data users to find the most relevant information quickly, instead of returning undifferentiated results. Ranked search can also elegantly eliminate unnecessary network traffic by sending back only the most relevant data, which is highly desirable in the "pay-as-you-use" cloud paradigm. For privacy protection, however, such a ranking process must not leak any keyword-related information. On the other hand, to improve the accuracy of the search results as well as the search user experience, it is also necessary for such a ranking system to support multiple-keyword search, as single-keyword search often yields far too coarse results. As a common practice indicated by today's web search engines (e.g., Google search), data users may tend to provide a set of keywords instead of only one as the indicator of their search interest to retrieve the most relevant data, and each keyword in the search request can help narrow down the search result further.
"Coordinate matching" (Witten, Moffat & Bell, 1999), i.e., matching as many keywords as possible, is an efficient similarity measure for such multi-keyword semantics to refine result relevance, and it has been widely used in the plaintext information retrieval (IR) community. However, how to apply it in an encrypted cloud data search scheme remains a very challenging task, because of the inherent privacy and security problems, including various strict requirements such as data privacy, index privacy, keyword privacy, and many more.
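On plaintext data, coordinate matching simply counts how many query keywords each document contains and ranks documents by that count; the short sketch below (document contents and identifiers are illustrative) shows the idea before any encryption is layered on top.

```python
def coordinate_matching(query_keywords, documents):
    """documents: dict mapping doc_id -> set of keywords extracted from the document."""
    q = set(query_keywords)
    scores = {doc_id: len(q & kw) for doc_id, kw in documents.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {"d1": {"cloud", "security", "encryption"},
        "d2": {"cloud", "storage"},
        "d3": {"image", "processing"}}
print(coordinate_matching(["cloud", "encryption"], docs))  # d1 matches both keywords
```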
Although there are many benefits to adopting cloud computing, there are also some significant barriers to its acceptance (Seung Hwan, Gelogo & Park, 2012). One of the most significant barriers to adoption is security, followed by issues regarding compliance, privacy, and legal matters. Since cloud computing represents a relatively new computing model, there is a great deal of uncertainty about how security at every level (network, host, application, data levels, etc.) can be achieved and in what way application security is moved to cloud computing. That uncertainty has consistently led information managers to state that security is their number one concern with cloud computing. Security concerns relate to risk areas such as external data storage, dependence on the public internet, lack of control, and integration and multi-tenancy with internal security. Compared with conventional technologies, the cloud has many specific features, such as its enormous scale and the fact that resources belonging to cloud providers are completely distributed, heterogeneous, and totally virtualized. Conventional security mechanisms such as identity, authentication, and authorization are no longer sufficient for clouds in their current form. Because of the cloud service models employed, the operational models, and the practices used to enable cloud services, cloud computing may present different risks to an organization than traditional IT solutions. Unfortunately, integrating security into these solutions is often perceived as making them more rigid.
As cloud computing becomes widespread, more and more sensitive data are being moved into the cloud, such as personal health records, emails, private videos and images, business finance data, government documents, etc. By storing their data in the cloud, the data owners can be relieved of the burden of data storage and maintenance and can enjoy the on-demand, high-quality data storage service (Reddy, 2013). However, the fact that data providers and cloud servers are not in the same trusted domain may put the outsourced data at risk. The cloud server can no longer be fully trusted in such a cloud environment, for a variety of reasons: the cloud server may leak data to unauthorized parties or it may be hacked. It follows that sensitive data typically have to be encrypted before outsourcing, for data confidentiality and to resist unwanted access. However, data encryption makes effective and efficient data utilization a very challenging task, given that there could be a large amount of outsourced data files. Furthermore, in cloud computing, data owners/providers may share their outsourced data with many users. Individual users may wish to retrieve only certain specific data files they are interested in during a given session. One of the most common ways to do so is to selectively retrieve files through keyword-based search, as a substitute for retrieving all the encrypted files, which would be completely impractical in cloud computing circumstances (Khan, Wang, Kulsoom & Ullah, 2013). Such a keyword-based search technique allows users to meaningfully retrieve files of interest and has proved valuable in plaintext search situations, such as Google search. Unfortunately, data encryption limits the user's ability to perform keyword search and consequently makes the traditional plaintext search methods unsuitable for cloud computing.
Lately, the cloud computing paradigm (Mell & Grance, 2011) has been transforming the way organizations handle their data, mainly in the way they store, access, and process it (Paillier, 1999). As an emerging computing paradigm, cloud computing attracts many organizations because of its potential in terms of flexibility, cost efficiency, and freedom from managerial overhead. However, the cloud can derive valuable and sensitive information about the real data items by observing the varying data access patterns, even if the data are encrypted (Capitani, Vimercati, Foresti, & Samarati, 2012; Williams, Sion, & Carbunar, 2008). Most often, organizations entrust their computational procedures, in addition to their data, to the cloud. The privacy and security concerns in the cloud, however, prevent organizations from fully using those benefits. The data can be encrypted before outsourcing to the cloud when the data are highly sensitive. Once the data are encrypted, regardless of the underlying encryption scheme, it is very challenging to accomplish any data mining tasks without ever decrypting the data (Samanthula, Elmehdwi, & Jiang, 2014).
1. For the first time, the authors explore the problem of multi-keyword ranked search over encrypted cloud data, and establish a set of strict privacy requirements for such a secure cloud data utilization system.
2. The authors propose two MRSE schemes based on the similarity measure of "coordinate matching" while meeting different privacy requirements in two different threat models.
3. A thorough analysis investigating the privacy and efficiency guarantees of the proposed schemes is given, and experiments on the real-world dataset further show that the proposed schemes indeed introduce low overhead on computation and communication.
Cloud computing is one paradigm of computing in which the computing resources are shared by many users. The benefits of the cloud can be extended from individual users to organizations; data storage in the cloud is one of them. The virtualization of hardware and software resources in the cloud removes the financial investment of owning and maintaining a data warehouse. Many cloud platforms such as Google Drive, iCloud, SkyDrive, Amazon S3, Dropbox, and Microsoft Azure provide storage services.
Security and privacy concerns have been the major challenges in cloud computing. Hardware and software security mechanisms such as firewalls have been used by cloud providers. These solutions are not sufficient to secure data in the cloud from unauthorized users, because of the low level of transparency (Cloud Security Alliance, 2009). Since the cloud user and the cloud provider are in different trust domains, the outsourced data may be exposed to vulnerabilities (Cloud Security Alliance, 2009; Ren, Wang, & Wang, 2012; Brinkman, 2007). Therefore, before storing valuable data in the cloud, the data should be encrypted (Kamara & Lauter, 2010). Data encryption ensures data confidentiality and integrity. To preserve data privacy, one must design a searchable algorithm that works on encrypted data (Wong, Cheung, Kao, & Mamoulis, 2009). Many researchers have been contributing to searching over encrypted data. The search techniques may be single-keyword search or multi-keyword search (Wang, Cao, Li, Ren, & Lou, 2010). In a huge database, a query may result in many documents being matched with the keywords. This makes it difficult for a cloud user to go through all the documents and find the most relevant ones. Ranking-based search is another solution, wherein the documents are ranked based on their relevance to the keywords (Singhal, 2001). Practical searchable encryption systems help cloud users especially in the pay-as-you-use model. Researchers have combined document ranking with multiple-keyword search to come up with efficient, economically viable searchable encryption schemes. In the searchable encryption literature, computation time and computation overhead are the two parameters most frequently used by researchers in this domain for analysing the performance of their schemes. Computation time (also called "running time") is the time required to perform a computational procedure, for instance searching for a keyword, generating a trapdoor, and so on. Computation overhead is related to CPU usage with respect to resource allocation, measured in time. In this work, the authors analyse the security issues in cloud storage and propose a scheme for the same. Our contribution can be summarized as follows:
Cloud computing has changed the way businesses approach IT, enabling them to become noticeably more agile, introduce new business models, offer more services, and trim down IT costs. Cloud computing technologies can be implemented in a wide variety of configurations, under different service and deployment models, and can coexist with many other technologies and software design approaches. The cloud computing infrastructure continues to experience explosive growth. However, for security professionals, the cloud presents a huge dilemma: how do you embrace the benefits of the cloud while maintaining security controls over your organization's assets? It becomes a matter of balance to decide whether the increased risks are truly worth the agility and economic benefits. Maintaining control over the data is paramount to cloud success. Ten years ago, enterprise data typically lived in the organization's physical infrastructure, on its own servers in the organization's data centre, where one could segregate sensitive data on individual physical servers. Today, with virtualization and the cloud, data may be under the organization's logical control yet physically stored in infrastructure owned and managed by a different entity. This shift in control is the main reason new approaches and techniques are required to ensure that organizations can maintain data security. When an external party owns, controls, and manages infrastructure and computational resources, how can you be assured that business or regulatory data remain private and secure, and that your organization is protected from damaging data breaches? This makes cloud data security essential. The importance of cloud computing is increasing, and it is receiving growing attention in the scientific and industrial communities. The NIST (National Institute of Standards and Technology) proposed the following definition of cloud computing:
SECURITY IN CLOUD
Although there are considerable advantages to adopting cloud computing, there are also some significant obstacles to acceptance (Seung Hwan, Gelogo & Park, 2012). One of the most significant obstacles to adoption is security, followed by issues regarding compliance, privacy, and legal matters. Since cloud computing represents a relatively new computing model, there is an enormous amount of uncertainty about how security at all levels (network, host, application, data levels, and so on) can be achieved and how application security is moved to cloud computing. That uncertainty has consistently driven data administrators to state that security is their number one concern with cloud computing. Security concerns relate to risk areas such as external data storage, dependence on the public internet, lack of control, multi-tenancy, and integration with internal security. Compared with traditional technologies, the cloud has many particular characteristics, for example its extraordinary scale and the fact that resources belonging to cloud providers are completely distributed, heterogeneous, and totally virtualized. Conventional security mechanisms, for example identity, authentication, and authorization, are no longer sufficient for clouds in their present shape. Considering the cloud service models used, the operational models, and the methodologies used to deliver cloud services, cloud computing may present different risks to an organization than conventional IT solutions. Unfortunately, incorporating security into these solutions is often seen as making them more rigid.
BACKGROUND
Firstly, they use "Latent Semantic Analysis" to uncover the relationships between terms and documents. Latent semantic analysis exploits the implicit higher-order structure in the association of terms with documents ("semantic structure") and adopts a reduced-dimension vector space to represent words and documents. In this way, the relationship between terms is automatically captured. Furthermore, their scheme uses secure "k-nearest neighbour (k-NN)" to achieve secure search functionality. The proposed scheme can return not only the exactly matching documents, but also the documents containing terms latently semantically related to the query keyword. Finally, the experimental results demonstrate that their method is better than the original MRSE scheme (Song & Wagner, 2000).
Their techniques have several crucial advantages. They are provably secure: they provide provable secrecy for encryption, in the sense that the untrusted server cannot learn anything about the plaintext when given only the ciphertext; they provide query isolation for searches, meaning that the untrusted server cannot learn anything more about the plaintext than the search result; they provide controlled searching, so that the untrusted server cannot search for an arbitrary word without the client's authorization; and they also support hidden queries, so that the client may ask the untrusted server to search for a secret word without revealing the word to the server (Sun, Wang, Cao, Li, Lou, Hou, & Li, 2013).
They propose a tree-based index structure and various adaptation methods for multi-dimensional (MD) algorithms so that the practical search efficiency is much better than that of linear search. To further enhance search privacy, they propose two secure index schemes to meet the stringent privacy requirements under strong threat models, i.e., the known ciphertext model and the known background model. In addition, they devise a scheme upon the proposed index tree structure to enable authenticity checking over the returned search results. Finally, they demonstrate the effectiveness and efficiency of the proposed schemes through extensive experimental evaluation.
(Yu, Lu, Zhu, Xue, & Li, 2013) Cloud computing has emerged as a promising paradigm for data outsourcing and high-quality data services. Concerns about sensitive information on the cloud, however, potentially cause privacy problems. Data encryption protects data security to some degree, but at the cost of compromised efficiency. Searchable symmetric encryption (SSE) permits retrieval of encrypted data from the cloud. In this chapter, the focus is on addressing data security issues using SSE. Notably, the authors formulate the privacy issue from the aspects of similarity relevance and scheme robustness. They observe that server-side ranking based on order-preserving encryption (OPE) leaks data privacy. To eliminate this leakage, the authors propose a two-round searchable encryption (TRSE) scheme that supports top-k multi-keyword retrieval.
(Wong, Cheung, Kao, & Mamoulis, 2009) In this chapter they discuss the general problem of secure computation on an encrypted database and propose a SCONEDB (Secure Computation ON an Encrypted DataBase) model, which captures the execution and security requirements. As a case study, the authors concentrate on the problem of k-nearest neighbour (kNN) computation on an encrypted database. The authors develop a new asymmetric scalar-product-preserving encryption (ASPE) that preserves a special type of scalar product. They use ASPE to construct two secure schemes that support kNN computation on encrypted data; each of these schemes is shown to resist practical attacks of a different background-knowledge level, at a different overhead cost. Extensive performance studies are carried out to evaluate the overhead and the efficiency of the schemes (Zhang & Zhang, 2011).
Since Boneh et al. proposed the concept and construction of the Public Key Encryption with Keyword Search (PEKS) scheme, many refinements and extensions have been given. Conjunctive keyword search is one of these extensions. Most of the constructed schemes cannot realize conjunctive search with subset keywords. Subset keyword search means that the recipient can query a subset of all the keywords embedded in the ciphertext. The authors study the problem of conjunctive search with subset keywords, discuss the drawbacks of the existing schemes, and then give a more efficient construction of a Public Key Encryption with Conjunctive-Subset Keywords Search (PECSK) scheme. A comparison with other schemes in terms of efficiency is presented. They also list the security requirements of their scheme, and then give the security analysis (Song, Wagner, & Perrig, 2000).
MAIN FOCUS OF THE AUTHORS
Data Owner
The data owner outsources her data to the cloud for convenient and reliable data access by the corresponding search users. To protect the data confidentiality, the data owner encrypts the original data with symmetric encryption. To improve the search efficiency, the data owner generates some keywords for each outsourced document. The corresponding index is then created according to the keywords and a secret key. Afterwards, the data owner sends the encrypted documents and the corresponding indexes to the cloud, and sends the symmetric key and the secret key to the search users.
Cloud Server
The cloud server is an intermediate entity which stores the encrypted documents and corresponding indexes received from the data owner, and provides data access and search services to the search users. When a search user sends a keyword trapdoor to the cloud server, the server returns a collection of matching documents based on certain operations.
Search User
A search user queries the outsourced documents from the cloud server in the following three steps. First, the search user receives both the symmetric key and the secret key from the data owner. Second, according to the search keywords, the search user uses the secret key to generate a trapdoor and sends it to the cloud server. Last, the user receives the matching document collection from the cloud server and decrypts it with the symmetric key.
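The interaction between the three parties above can be sketched as follows. This is only a toy model under strong simplifying assumptions: AES-GCM via the cryptography package stands in for the document encryption, and HMAC-SHA256 keyword tokens stand in for the trapdoors; it ignores index privacy, ranking, and the leakage issues this chapter is actually concerned with.

```python
import os, hmac, hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def token(secret_key, keyword):
    # Trapdoor: deterministic keyword token derived from the secret key.
    return hmac.new(secret_key, keyword.lower().encode(), hashlib.sha256).hexdigest()

# --- Data owner: encrypt documents and build an encrypted keyword index ---
sym_key, secret_key = AESGCM.generate_key(bit_length=128), os.urandom(32)
aes = AESGCM(sym_key)
plaintexts = {"doc1": b"cloud security report", "doc2": b"satellite image archive"}
store, index = {}, {}
for doc_id, data in plaintexts.items():
    nonce = os.urandom(12)
    store[doc_id] = (nonce, aes.encrypt(nonce, data, None))
    for kw in data.decode().split():
        index.setdefault(token(secret_key, kw), set()).add(doc_id)

# --- Cloud server: match a trapdoor against the index, return ciphertexts ---
def cloud_search(trapdoor):
    return {d: store[d] for d in index.get(trapdoor, set())}

# --- Search user: build a trapdoor, query the cloud, decrypt the results ---
results = cloud_search(token(secret_key, "cloud"))
for doc_id, (nonce, ct) in results.items():
    print(doc_id, AESGCM(sym_key).decrypt(nonce, ct, None))
```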
RSA Algorithm
RSA is the algorithm used by modern computers to encrypt and decrypt messages. It is an asymmetric cryptographic technique; asymmetric means that there are two different keys. It is also called public-key cryptography, because one of the keys can be given to everyone, while the other key must be kept private. RSA relies on the fact that factoring a large integer is hard (the factoring problem). A user of RSA creates and then publishes the product of two large prime numbers, together with an auxiliary value, as their public key. The prime factors are kept secret. Anybody can use the public key to encrypt a message, but with currently published methods, if the public key is large enough, only somebody with knowledge of the prime factors can feasibly decode the message.
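A toy numeric walk-through with deliberately tiny primes (never use such sizes in practice) can make the key generation and the encrypt/decrypt steps concrete; the numbers below are purely illustrative.

```python
from math import gcd

# Toy key generation with tiny primes (insecure, for illustration only).
p, q = 61, 53
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)        # Euler's totient of n
e = 17                         # public exponent, must be coprime with phi
assert gcd(e, phi) == 1
d = pow(e, -1, phi)            # private exponent: modular inverse of e mod phi

def encrypt(m):                # anyone with (n, e) can encrypt
    return pow(m, e, n)

def decrypt(c):                # only the holder of d (i.e., of p and q) can decrypt
    return pow(c, d, n)

message = 42
cipher = encrypt(message)
print(cipher, decrypt(cipher))  # decrypt(encrypt(m)) == m
```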
Hierarchical Clustering
Two approaches:
1. Agglomerative: a bottom-up approach in which each document starts in its own cluster and the closest clusters are merged step by step.
2. Divisive: a top-down approach in which all documents start in one cluster, which is then split recursively into sub-clusters.
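A short sketch of both directions, under stated assumptions: agglomerative clustering via SciPy's linkage machinery, and a simple recursive divisive split using 2-means that stops once a cluster falls below a minimum size, loosely mirroring the "split into sub-clusters until the last cluster is reached" idea above. The feature vectors, thresholds, and stopping rule are illustrative, not the chapter's scheme.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.random((40, 8))                      # toy document feature vectors

# Agglomerative (bottom-up): merge closest clusters until a distance threshold.
Z = linkage(docs, method="average")
labels = fcluster(Z, t=1.0, criterion="distance")

# Divisive (top-down): recursively split with 2-means until clusters are small.
def divisive(points, min_size=5, clusters=None):
    if clusters is None:
        clusters = []
    if len(points) <= min_size:
        clusters.append(points)
        return clusters
    split = KMeans(n_clusters=2, n_init=10).fit_predict(points)
    for part in (points[split == 0], points[split == 1]):
        divisive(part, min_size, clusters)
    return clusters

print(len(set(labels)), "agglomerative clusters;", len(divisive(docs)), "divisive leaves")
```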
1. Asymmetric-key cryptography uses one key to encrypt and a different key to decrypt
2. RSA is one of the asymmetric-key algorithms
3. It is slower than symmetric-key algorithms
4. There is no need to configure shared secret keys on all hosts
Knn Algorithm
1. Partitional clustering
2. Partitions are independent of each other
3. Sensitive to cluster centre initialization
4. Poor convergence speed and bad overall clustering can result from poor initialization
5. Works only for roughly round (convex) cluster shapes
6. Does not work well for non-convex shapes
1. Hierarchical clustering
2. Can be visualized using a tree structure (dendrogram)
3. Can give different partitionings depending on where the tree is cut
4. Does not need the number of clusters to be specified in advance
5. It can be slow
6. Two types: agglomerative and divisive
REFERENCES
Abdalla, M., Bellare, M., Catalano, D., Kiltz, E., Kohno, T., Lange,
T., & Shi, H. (2008). Searchable encryption revisited: Consistency
properties, relation to anonymous ibe, and extensions . Journal of
Cryptology , 21(3), 350–391. doi:10.1007/s00145-007-9006-6
Boldyreva, A., Chenette, N., Lee, Y., & Oneill, A. (2009). Order-
preserving symmetric encryption . In Advances in Cryptology-
EUROCRYPT (pp. 224–241). Springer.
Boneh, D., Kushilevitz, E., Ostrovsky, R., & Skeith, W. E., III. (2007). Public key
encryption that allows PIR queries. Proc. of CRYPTO.
Cao, N., Wang, C., Li, M., Ren, K., & Lou, W. (2014). Privacy-
preserving multikeyword ranked search over encrypted cloud data .
IEEE Transactions on Parallel and Distributed Systems , 25(1), 222–
233. doi:10.1109/TPDS.2013.45
Cash, D., Jarecki, S., Jutla, C., Krawczyk, H., Roşu, M.-C., & Steiner, M. (2013).
Highly-scalable searchable symmetric encryption with support for Boolean queries.
Proc. CRYPTO, 353-373.
Hwang, Y., & Lee, P. (2007). Public key encryption with conjunctive
keyword search and its extension to a multiuser system . Pairing.
doi:10.1007/978-3-540-73489-5_2
Li, H., Dai, Y., Tian, L., & Yang, H. (2009). Identity-based
authentication for cloud computing. In Cloud Computing. Berlin,
Germany: Springer-Verlag. doi:10.1007/978-3-642-10665-1_14
Li, H., Liu, D., Dai, Y., Luan, T. H., & Shen, X. (2014). Enabling
efficient multikeyword ranked search over encrypted cloud data
through blind storage . IEEE Transactions on Emerging Topics in
Computing . doi:doi:10.1109/TETC.2014.2371239
Li, J., Wang, Q., Wang, C., Cao, N., Ren, K., & Lou, W. (2010).
Fuzzy keyword search over encrypted data in cloud computing. Proc.
of IEEE INFOCOM’10 Mini-Conference.
doi:10.1109/INFCOM.2010.5462196
Li, R., Xu, Z., Kang, W., Yow, K. C., & Xu, C.-Z. (2014). Efficient
multikeyword ranked query over encrypted data in cloud computing .
Future Generation Computer Systems , 30, 179–190.
doi:10.1016/j.future.2013.06.029
Liang, H., Cai, L. X., Huang, D., Shen, X., & Peng, D. (2012). An
smdpbased service model for interdomain resource allocation in
mobile cloud networks . IEEE Transactions on Vehicular Technology
, 61(5), 2222–2232. doi:10.1109/TVT.2012.2194748
Mell & Grance. (2011). The nist definition of cloud computing (draft).
NIST Special Publication, 800, 145.
Naveed, Prabhakaran, & Gunter. (2014). Dynamic searchable
encryption via blind storage. Proceedings - IEEE Symposium on
Security and Privacy , 639–654.
Ren, K., Wang, C., & Wang, Q. (2012). Security Challenges for the
Public Cloud . IEEE Internet Computing , 16(1), 69–73.
doi:10.1109/MIC.2012.14
Shen, Q., Liang, X., Shen, X., Lin, X., & Luo, H. (2014). Exploiting
geodistributed clouds for e-health monitoring system with minimum
service delay and privacy preservation . IEEE Journal of Biomedical
and Health Informatics , 18(2), 430–439.
doi:10.1109/JBHI.2013.2292829
Song, D. X., Wagner, D., & Perrig, A. (2000). Practical techniques for
searches on encrypted data. In Proceedings of S&P. IEEE.
Sun, Wang, Cao, Li, Lou, Hou, & Li. (2013). Verifiable privacy-
preserving multikeyword text search in the cloud supporting
similarity-based ranking. IEEE Transactions on Parallel and
Distributed Systems. DOI: 10.1109/TPDS.2013.282
Williams, P., Sion, R., & Carbunar, B. (2008). Building castles out of
mud: practical access pattern privacy and correctness on untrusted
storage. ACM CCS, 139–148. doi:10.1145/1455770.1455790
Yang, L., Liu, & Yang. (2014). Secure dynamic searchable symmetric
encryption with constant document update cost. Proc.GLOBECOM.
Yu, J., Lu, P., Zhu, Y., Xue, G., & Li, M. (2013). Towards secure
multikeyword top-k retrieval over encrypted cloud data . IEEE
Transactions on Dependable and Secure Computing , 10(4), 239–250.
doi:10.1109/TDSC.2013.9
Minimum Hash Subtree: A tree in which the root holds the minimum value of its subtree.
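Read literally, this definition can be checked with a small recursive routine: every node's value must be the minimum over the subtree it roots. The node structure below is a hypothetical illustration, not the chapter's actual data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    value: int                      # hash value stored at this node
    children: List["Node"] = field(default_factory=list)

def subtree_min(node: Node) -> int:
    return min([node.value] + [subtree_min(c) for c in node.children])

def is_min_hash_subtree(node: Optional[Node]) -> bool:
    """True if every node's value is the minimum of the subtree rooted at it."""
    if node is None:
        return True
    return node.value == subtree_min(node) and all(is_min_hash_subtree(c) for c in node.children)

root = Node(1, [Node(4, [Node(7)]), Node(2, [Node(3), Node(9)])])
print(is_min_hash_subtree(root))    # True: each node is the minimum of its subtree
```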
Pronay Peddiraju
VIT University, India
P. Swarnalatha
VIT University, India
ABSTRACT
The purpose of this chapter is to observe the 3D asset development and product
development process for creating real-world solutions using augmented and
virtual reality technologies. To do this, the authors create simulative software
solutions that can be used in assisting corporations with training activities. The
method involves using augmented reality (AR) and virtual reality (VR) training
tools to cut costs. By applying AR and VR technologies for training purposes, a
cost reduction can be observed. The application of AR and VR technologies can
help in using smartphones, high performance computers, head mounted displays
(HMDs), and other such technologies to provide solutions via simulative
environments. By implementing a good UX (user experience), the solutions can improve training, reduce on-site training risks, and cut costs rapidly. By creating 3D simulations driven by engine mechanics, the applications for AR and VR technologies are vast, ranging from purely computer-science-oriented applications such as data and process simulations to mechanical equipment and environmental simulations. This can help users further familiarize themselves with potential scenarios.
INTRODUCTION
Background of Study
The scope for AR and VR technologies can be seen in numerous fields, ranging from game development to creating training simulations that can be used in industry to cut operating costs as well as visualize final products in real time. Industries that can benefit from these technologies will, however, see the requirement of a heavy initial investment, depending on the complexity and the demands of the product being created. Typically, AR applications are less resource-hungry, can be used on smartphones, and have a lower initial investment, while most VR applications are bulky, resource-hungry, and need powerful computing environments paired with precise HMDs, making them a more expensive yet smoother experience.
PROBLEM STATEMENT
Augmented and virtual reality have great scope when applied to mobile device hardware, as they can reach a larger user base (Jason Jong Kyu Park, 2013). But due to the complex nature of the development process and the technology itself, it can be difficult to identify a solid model that can be used to develop, maintain, and re-iterate if required on the software. Engine optimizations continue to be made to allow more precise colours, details, tracking, and operation (Jacob B. Madsen, 2014). This allows the development process to keep getting simpler with respect to using a game engine, but there is no method that correlates the workflow between each of the pipelines. The evolution of VR and AR is extremely fast paced and is witnessing the extinction of numerous software add-ons (Dey, 2016). There is a requirement to be able to modify and re-iterate on existing products in order to ensure the scalability and longevity of the said product. There are numerous restrictions on the polygon count of the 3D models that can be successfully rendered in a VR environment, due to restrictions on the hardware capabilities of hand-held devices (Ahmad Hoirul Basori, 2015). This has to be identified properly and accounted for before the start of the development process, to ensure that no issues with respect to support on the platform are encountered. It is important to create a smart UI (user interface) that is aware of both the position and orientation of the user, in order to provide a better user experience within the application (Hsin-Kai Wu, 2013). A common problem faced in the making of VR and AR applications is providing a good UI and UX (user experience).
MOTIVATION
The motivation behind this study is to understand the development process
revolving around VR and AR technologies and leveraging the tools in an
efficient manner. By creating VR and AR based simulations, we are able to
create a more realistic 3D perspective that the users can find a lot more
comfortable to orient themselves to. By virtue of this study, we will try to compare and contrast various methodologies for the application development process implementing VR and AR, and eventually try to conceptualize a model for the same.
From a software engineering point of view, it is difficult to categorise VR and
AR applications and implement a specific model for development due to its
complex and flexible nature. To be able to visualize the development process
and come to conclusions on the time line and nature of the project, it is important
that research on the same is performed to implement a more efficient
methodology or follow a model designed specifically for these applications.
DEVELOPMENT PIPELINES
Although there are no specific rules for developing VR and AR applications, the most common approach is to use a game engine paired with 3D modelling software. The common approach revolves around the following processes:
In this chapter, we will observe the steps involved in each process, study which tasks may be performed in parallel with others so as to improve efficiency, examine methods that can optimize the mentioned pipeline, and devise a model that can be implemented for the same. Since all the integration and use of assets is
performed in the game engine chosen for development, the focus will primarily be on how to improve the process with the game engine in perspective.
The model shall provide an insight into the application flow and how to create a progress methodology that reduces risk, cost and other potential elements that can cause drawbacks in the development process. At the same time, the model will also provide a framework that can be implemented to improve efficiency and performance on the target platform by performing the required software planning.
To come to conclusions about the above-mentioned process and derive a model that can be used, we will implement a VR application. The proposed system applies VR on a smartphone and implements the use of 3D objects as well as audio within the application.
VR Reticle Pointer
The VR reticle pointer provides the user with a point at the centre of the screen so that the user can identify where their point of view is and which objects within their view are interactable. As can be observed from Figure 1, the user reticle (the white circle at the centre of the screen) will scale up and form a circle when there is an interactable object in the scene before the user. In this case we see the teleportation cylinder being highlighted in green when the user looks towards it.
VR View Port
The view port for VR implements the use of the two cameras as discussed. The two cameras provide two views on the view port that create a single image when used in a VR headset (owing to human binocular perception). The view port does not only account for the generated views but also provides depth, contrast and realistic imagery based on the scene lighting.
Lighting
Lighting is among the most important elements within the designed level. If the lighting within the scene is not done properly, it could affect the look of the level and hence the way the objects in the scene are perceived by the user. With low-quality lighting, the objects in the scene will not provide the same level of realism, which in turn creates a poor user experience within the application.
User Interface
LITERATURE REVIEW
From the mentioned reference documents, we are able to obtain an evolutionary view of the VR and AR development processes over the years. Some of the gaps identified from these documents are listed as follows:
ARCHITECTURE IMPLEMENTED
Figure 2 is the block diagram describing the architecture of the working model
used to perform analysis on various aspects of the pipeline.
• Application Subsystem: The chunk of code that defines the application and its working methodology, combined with a user interface for operation.
• Context Subsystem: The subsystem that collects context data from
various subsystems. Examples include preferences and progress data.
• World Model Subsystem: The subsystem that controls the use of real
world elements in the 3D world space within the application.
Implementation of this subsystem is the major difference between
Augmented Reality and Virtual Reality based solutions.
• Tracking Subsystem: This subsystem uses the sensors present on the
device to track the user’s position with respect to the 3D space. Tracking is
the module that provides smooth video feed and movement flow in the
application during run time.
• Interaction Subsystem: The subsystem that controls the user input and
output. All forms of interaction between the user and the system are controlled by this subsystem.
• Presentation Subsystem: This subsystem is responsible for how the
application is generated and presented to the user. It includes modules such
as the render engine and texture repositories.
IN-DEPTH ANALYSIS
To understand the six subsystems in the proposed model, we shall go through their respective functional diagrams. The functioning of, and interaction between, the subsystems is integral to ensuring the smooth flow of the application and to ensuring that nothing within the application falls out of sync at any given point.
Application Subsystem
The code will interact with both the engine editor and the engine headers in order to provide functionality within the scene. Additionally, the code will also work alongside the SDKs and plugins used, such as iTween (a commonly used tool for providing game-object-based transitions and animations). The references to the SDKs provide the required SDK features within the application, such as the VR reticle and the dual scene camera setup.
Context Subsystem
The Context Subsystem communicates with all the other subsystems to put together context settings and details for the application, including the preferences, progress data and the in-app settings that have been set up by the user. This subsystem performs a rough system analysis to understand the working principles and regulates them to a recommended setting to ensure the smooth flow of the application.
The context files interact with internal system details and read the presets that exist within the hardware in order to optimize the working of the application for the specific environment at hand.
World Model Subsystem
The World Model Subsystem is primarily used as a manager for the level designed in the application. Ensuring that the level works correctly and maintaining the proper perspective within the application are handled by
the World Model Subsystem. Within this subsystem, we account for real time
motion tracking accounting for the world model view with respect to the
rotation, orientation and movement of the user by using the smartphone platform
on a head mounted device (HMD).
By implementing the link to the tracker module on the device, the World Model Subsystem can pick up readings from the tracker to provide accurate updates to the world view and ensure stable performance in real time, as any movement performed in the real world is immediately reflected in the virtual world.
Tracking Subsystem
The Tracking subsystem is responsible for the tracking of the user by making
use of the available sensors on the HMD or device used with the HMD. For
example, in the case of VR headsets such as the Oculus Rift or HTC Vive, the tracking is performed using specialized sensors that are part of the HMD to ensure smoothness and maintain a real-time 110-degree field of view. In the case of a smartphone (the case in our application for the experiment), smartphone sensors such as the gyroscope and, where available, special hardware such as Tango by Google can be used for spatial tracking within the application.
This subsystem mainly interacts with the sensors and ensures that the feedback received from them is accurately accounted for by the remaining subsystems.
Interaction Subsystem
This subsystem uses the available input devices to provide interaction between the user and the application. In the case of VR and AR, the interactive elements in a scene may be numerous, but the hardware through which the user can interact with the application is limited. This is why the interaction subsystem is used to control the variety of permutations performed to maximize the utility of the available inputs.
This subsystem can only interact with a limited set of inputs, but these inputs are manipulated by conversion into individual events. The event system hence generated is used to provide the required functionalities and interactions for the user.
Presentation Subsystem
Figure 3 shows the render of a scene within the engine taken from the perspectives of both clients in the scene. Each client requires a generated view of its own, and both views must be generated in real time. The characters and the weapons in the scene are rendered by this subsystem, which provides a real-time output that, in this case, is synchronized via a network.
PROPOSED DESIGN
Overview
For the purpose of this project, the methodology incorporates the use of a level created using the Unity 3D engine. By using 3D models, audio and other such assets, the scene is created to provide a realistic user experience. To incorporate the integration of various smartphone features that improve the experience, the project is exported to support the format of the target device (Windows OS for PC, Mac OS for Apple platforms, Android for Android-based smartphones, iOS for iPhones, etc.). The final process is creating a build that runs as an application on the target platform and provides an interface to share data between the operating system and the VR platform in real time to deliver a rich user experience.
EXPECTED RESULT
The expected result is a Virtual Reality application that runs on an Android device and provides the user with interaction between the real world and the generated level. The application will highlight the pipelines for developing Virtual Reality applications and provide a rich and immersive user experience to solve a real-world problem.
IMPLEMENTATION
Application Design
VR Level Design
Game View VR
Figure 5. Game View and reticle pointer for the generated VR level
The game view (Figure 5) shows the use of the reticle and the interactive elements, whose light turns green on hover while they remain red when they are not interacted with by the reticle. This is an example of programmed logic that is applied to each of the interactive elements and that defines its working principles within the application. The designed level applies materials to the objects depending on the interaction from the user. When no selection is performed, the object is assigned a material that provides the red glow. Once the object has been hovered over, it is assigned a material that provides the green glow while the reticle is still over the interactable object. After the user has hovered over the interactable element and the reticle is no longer on the object, a blue material is assigned, providing a blue glow that signals that the user has seen the object and the reticle has hovered over it at some point in the application.
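In the actual project this behaviour would live in a Unity C# script attached to each interactable object; the short Python sketch below (all names hypothetical, not taken from the chapter) only illustrates the hover-state to material mapping described above.

# Hover-state to material mapping for an interactable object, as described
# above: red = never hovered, green = currently hovered, blue = hovered earlier.
class Interactable:
    def __init__(self):
        self.was_hovered = False

    def material_for(self, reticle_on_object):
        if reticle_on_object:
            self.was_hovered = True
            return "green_glow"      # reticle is currently over the object
        return "blue_glow" if self.was_hovered else "red_glow"

cylinder = Interactable()
print(cylinder.material_for(False))  # red_glow   (never hovered yet)
print(cylinder.material_for(True))   # green_glow (reticle hovering)
print(cylinder.material_for(False))  # blue_glow  (hovered earlier)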
Hardware
Software
ACKNOWLEDGMENT
The author would like to thank the School of Computer Science and
Engineering, VIT University and special thanks to Dean SCOPE, for his kind
guidance and support along with our guide Dr. Prabu S and Dr. Swarnalatha P,
without whom it would not have been possible to complete this mammoth task.
Also, I would like to thank the anonymous reviewers and the editor-in-chief at IGI Global for their valuable guidance, which has improved the quality and presentation of the chapter.
REFERENCES
Basori, Afif, Almazyad, Abujabal, Rehman, & Alkawaz. (2015). Fast
Markerless Tracking for Augmented Reality in Planar Environment.
3D Research, 6(4), 1-11.
Madsen, J. B., & Stenholt, R. (2014). How wrong can you be:
Perception of static orientation errors in mixed reality. 3D User
Interfaces (3DUI) 2014 IEEE Symposium on, 83-90.
10.1109/3DUI.2014.6798847
Wu, H.-K., Lee, S. W.-Y., Chang, H.-Y., & Liang, J.-C. (2013, March). Current status, opportunities and challenges of augmented reality in education. Computers & Education, 62(C), 41–49. doi:10.1016/j.compedu.2012.10.024
CHAPTER 7
Iterative MapReduce:
i-MapReduce on Medical Dataset Using Hadoop
Utkarsh Srivastava
VIT University, India
Ramanathan L.
VIT University, India
ABSTRACT
Diabetes mellitus has turned into a major public health issue in India. Recent statistics on diabetes reveal that 63 million individuals in India are suffering from diabetes, and this figure is likely to go up to 80 million by 2025. Given the rise of big data as a socio-technical phenomenon, there are various complications in analyzing big data and its related data handling issues. This chapter examines Hadoop, an open source framework that permits the distributed processing of huge datasets on clusters of computers and thus produces better results with the deployment of Iterative MapReduce. The goal of this chapter is to analyze and extract the enhanced performance of data analysis in a distributed environment. Iterative MapReduce (i-MapReduce) plays a major role in optimizing the analytics performance. Implementation is done on Cloudera Hadoop installed on top of the Hortonworks Data Platform (HDP) Sandbox.
INTRODUCTION
We live in the age of data. It is not easy to measure the total volume of data: an IDC estimate puts the size of the “digital universe” at 4.4 zettabytes in 2013 and forecasts a tenfold growth to 44 zettabytes by 2020. Such enormous volumes of data suffer from various issues like storage capacity and synchronization problems, since they are stored at different places depending upon the vicinity of data servers. Given the rise of Big Data as a socio-technical phenomenon, there are various complications in analyzing Big Data and its related data handling issues. In such cases Iterative MapReduce comes in really handy. The term “Big Data” refers to the massive volumes of both structured and unstructured information which cannot be directly handled with conventional database management systems.
With the rapid increase in diabetic patients in India and in the number of determinants for diabetes, the information grows tremendously and turns into Big Data which cannot be handled by a traditional DBMS. Here we discuss Hadoop, an open source framework that permits the distributed processing of huge datasets on clusters of computers and thus produces better results with the deployment of Iterative MapReduce. The main goal is to analyze and extract the enhanced performance of data analysis in a distributed environment. Iterative MapReduce (i-MapReduce) plays a major role in optimizing the analytics performance. Implementation is done on Cloudera Hadoop installed on top of the Hortonworks Data Platform (HDP) Sandbox. Hortonworks Hadoop is used for the extraction of useful data patterns based on queries related to the different determinants of the diabetes dataset obtained from IBM Quest. The Iterative MapReduce algorithm for sequential pattern mining utilizes a distributed computing environment. It consists of two processes, a mapping process and a reducing process, which are further utilized in two separate phases, namely a Scanning phase and a Mining phase. During the scanning phase, high performance is gained by distributing the task of finding elements over different mapper tasks, which can be run in parallel on multiple machines with a distributed database or file system. In the mining phase, the mapper task creates a lexical sequence tree for finding patterns; to improve efficiency, a depth-limited DFS is run on the tree, and the reducer task finds the support value for the patterns and thus, in turn, the useful patterns. So, to avoid the problem of serialized processing, we opt for parallel processing in a Big Data environment. This parallel processing not only reduces the computation time but also optimizes the resource utilization.
This is a brief idea about the chapter and we will see its implementation and
impacts in further sections.
BACKGROUND
Big Data is like normal data but with an enormous size. The term is generally used to describe a collection of data that is very huge in size and still growing exponentially with time. In short, such a large collection of data, which is difficult to handle via traditional databases and other management tools, is called ‘Big Data’. Generic features of Big Data are:
• Volume
• Velocity
• Variability
• Veracity
The main objective of this analysis is to find interesting patterns on the basis of
conditional dependence on given attributes of a dataset. With such an increased
rate of data generation it becomes very difficult to analyze the patterns in the
dataset. Also the situation becomes more critical when we have sequential
patterns in the dataset i.e. the order of dependency matters. Such behavior of
data is very usual in day to day actions such as customer shopping behavior,
medical symptoms leading to a future patient disease, financial stock market data
predictions etc. Pattern mining of BigData using Hadoop faces a lot of issues in
terms of data storage, data shuffling, data scanning, data processing units etc.
Sequential pattern mining is one of the most important data mining techniques used in various application domains in the modern world. Some examples are gene analysis, intrusion detection of system attacks and customer behaviour prediction. The central logic behind sequential pattern mining is to find frequent sequences within a transactional or operational database. The formal definition can be detailed as follows. Definition 1: Let D be a sequence database and I = {y1, …, ym} be a set of m different items. A sequence S = ⟨d1, …, dk⟩ is an ordered list of itemsets, where each itemset di is a subset of I. A sequence Dr = ⟨r1, …, rn⟩ is a subsequence of a sequence Dt = ⟨t1, …, tl⟩ if there exist indices 1 ≤ i1 < … < in ≤ l such that r1 ⊆ ti1, r2 ⊆ ti2, …, rn ⊆ tin. Sequential pattern mining is generally used to find all sequential patterns whose occurrence frequency ≥ minimum support × |D|, where minimum support is the support threshold value. Some of the algorithm families that are of great help in this regard are Apriori-based, pattern-growth-based and projection-based algorithms.
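A minimal, self-contained sketch of the subsequence test and support computation in Definition 1, written in plain Python with a toy, purely illustrative sequence database (the dataset and threshold are assumptions, not from the chapter):

def is_subsequence(sub, seq):
    # True if every itemset of `sub` is contained, in order, in a later itemset of `seq`.
    pos = 0
    for r in sub:
        while pos < len(seq) and not set(r) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(pattern, database):
    # Number of sequences in the database that contain the pattern.
    return sum(is_subsequence(pattern, s) for s in database)

# Toy sequence database: each sequence is an ordered list of itemsets.
D = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"c"}, {"d"}],
    [{"b"}, {"d"}],
]
min_support = 0.5                       # threshold as a fraction of |D|
pattern = [{"a"}, {"d"}]
print(support(pattern, D) >= min_support * len(D))  # True: the pattern occurs in 2 of 3 sequences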
Tree-based algorithms include DHP (Direct Hashing and Pruning), which was proposed as an improvement on the Apriori algorithm. It is built on the traditional logic, but with two major improvements. The first improvement is to prune the itemsets in each iteration, and the second is to trim the transactions so that support counting becomes more efficient. After the first scan of the dataset it maintains a DHT (distributed hash table) and uses it for all further calculations. It is this DHT which is used for building a bitmap representation table that determines the presence or absence of an itemset in the dataset. For the presence of an item in the dataset, its normalized frequency count is taken into consideration together with all conditional attributes.
IBM’s Watson Analytics, introduced initially for the finance industry, attempts to provide a simple analysis system based on IBM’s sophisticated Cognos processing capabilities. It uses NLP to give predictive decisions based on the input data. It was initiated with the objective of helping business processes access data remotely for coming up with business-related decisions. All types of businesses generate data these days, by means of websites, social media, shopping patterns of customers, user experience, etc. This reflects the need for companies to strategize on various aspects of storing and mining useful information from the available data. This is considerably more challenging than just locating, identifying, understanding, and citing data. In order to take decisions based on the available data, the systems need to be fully automated, able to store data and schedule tasks at regular time intervals to produce results from time to time.
Most of the time we need to filter the patterns according to user requirements and make an appropriate choice of the distinguishing parameters. For example, a client may want the presence of one specific itemset in the mining dataset but define some differentiating factors which determine the conditional rule base for that particular mining pattern. We can in fact directly push all the constraints into the mining process. This has several advantages, as mining can be performed at much lower support levels compared to other methods. Thus we can say that constrained pattern mining involves the deployment of conditional patterns with normalized frequency values.
One of the major issues in pattern mining is the volume and redundancy of patterns in a given dataset. All traditional algorithms need to make multiple data scans to find interesting patterns, which increases the operational and system requirements. Most of the time, a subset of a frequent pattern set is also frequent. To make the process easier, most algorithms make use of bitmap representations of the useful patterns while simultaneously pruning the remaining patterns. This solves the problem of multiple data scans and also reduces the computational requirements.
The main objective of the proposed system is to make this process hassle free for the users, i.e., they will not have to maintain clusters and spend huge amounts on hardware to mine information or to schedule the tasks responsible for Big Data analysis. This system is built upon the Hadoop ecosystem. It will be a mix of standalone and hosted services for the clients and will greatly reduce their workload. It aims at finding all interesting patterns in the dataset based on the user input and thus helps in separating the useful patterns from the others. It also aims at reducing the hardware requirements for the computation and at producing resource-optimized results. It will greatly boost the development of any product or service in the market with reference to:
• Investors
• Business Analyst
• E-commerce companies
• Advertising agencies
• Wholesaler
In the mining phase, the main aim is to find all possible patterns in the dataset and to prune away non-sequential and non-useful patterns so as to filter out the important information. We try to achieve this using a lexical sequence tree, which can generate all possible patterns existing in the dataset and can then be pruned to find the useful or sequential patterns.
So now comes a very obvious question: how is Iterative MapReduce better than traditional algorithms in a distributed environment? Here are some of the comparison parameters:
The sampling layer is responsible for collecting random pieces of data in order to perform a few initial preprocessing tasks. It performs the task of selecting a subset from the original set of all measurements. Since unstructured data is fed into this layer, the task of structuring it is also performed by the sampling layer. This is done by finding the delimiter of the input data and computing useful summarizations of the sampled data to get an overview of the data, so that optimized algorithms can be run on it. The types of sampling range from simple random sampling, in which any particular record has an equal probability of being sampled, to custom sampling, where the user specifies what percentage of the data is to be sampled from the input.
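A minimal sketch of the sampling layer's simple random sampling mode described above, written as a Hadoop-independent Python helper; the file path, delimiter and fraction are assumptions, not values from the chapter.

import csv
import random

def sample_rows(path, fraction=0.10, delimiter=",", seed=42):
    # Yield roughly `fraction` of the rows of a delimited text file;
    # every row has the same probability of being kept.
    random.seed(seed)
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if random.random() < fraction:
                yield row

# Usage (hypothetical file name): summarize a 10% sample of the input.
# for row in sample_rows("determinants.csv", fraction=0.10):
#     print(row)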
The data parsing layer supports extraction of the various attributes of any type of data. This layer supports various formats such as CSV, XML and JSON, since most of the data meant for analysis is available in one of these formats. The layer also supports extraction of specific attributes of the data, again based on custom queries provided by the user for any of the three input types.
The core analysis layer is subdivided into various sublayers, named the aggregation, filtering, classification and data organisation layers respectively. The data enters this layer after being sampled and parsed and hence is ready for the type of analysis specified by the user. The objective of the filtering layer is to extract certain chunks of data on the basis of some applied function, or to randomly extract chunks which vary in size and dimension. A major functionality in this layer is finding all the distinct values present in a string of parameters so that all repetitions are filtered out, thus reducing the redundancy of patterns.
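A small sketch of that de-duplication step: keeping the first occurrence of each value in a delimited parameter string (the sample values are illustrative, not taken from the chapter's dataset).

def distinct_values(parameter_string, delimiter=","):
    # Keep the first occurrence of each value, preserving order.
    seen, result = set(), []
    for value in parameter_string.split(delimiter):
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result

print(distinct_values("glucose,bmi,age,glucose,bmi"))  # ['glucose', 'bmi', 'age']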
1. Scanning Phase
During the scanning phase, high performance is gained by distributing the task of finding elements over different mapper tasks, which can be run in parallel on multiple machines with a distributed database or file system. Reducer tasks are then run to find the frequency of the different patterns, and infrequent patterns are subsequently removed from the records. The result can be stored to be passed on to the mining phase. The major task in this phase is to create the lexical sequence tree (LST). Each node, together with its associated subtree, represents a sequence in the data set. We gain scalability in this phase by splitting the data set into parts and running a mapper on each part independently, which offers better execution times on distributed systems. The mapper runs a depth-first search on the tree to mine the patterns, but the search is depth limited to reduce the load on the mapper program. This removes a bottleneck in the LST construction and mining and gives better balance to each mapper phase. The mapper phase creates an intermediate output that can be processed by the reducer to prune away non-sequential patterns. Each pattern is stored in the format <pattern, bit-AND result> for the reducer side to process. The mapper phase outputs the patterns along with the node depth; a threshold value is assumed for the node depth, and any node with a support count lower than the threshold is not processed, nor are its extensions.
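The following is a local, in-memory sketch of the scanning phase's map-shuffle-reduce flow: mappers emit (item, transaction ID) pairs, the shuffle groups them by item, and the reducer keeps only items whose support meets a threshold. The transactions and the threshold are toy assumptions, not the chapter's diabetes data.

from collections import defaultdict

transactions = {
    "t1": ["glucose_high", "bmi_high", "age_40_50"],
    "t2": ["glucose_high", "bmi_normal"],
    "t3": ["glucose_high", "bmi_high"],
}
MIN_SUPPORT = 2

def mapper(tid, items):
    for item in items:
        yield item, tid                  # emit <item, transaction id>

def reducer(item, tids):
    if len(tids) >= MIN_SUPPORT:         # prune infrequent items
        return item, sorted(tids)
    return None

grouped = defaultdict(list)              # shuffle/sort: group mapper output by key
for tid, items in transactions.items():
    for key, value in mapper(tid, items):
        grouped[key].append(value)

frequent = dict(filter(None, (reducer(k, v) for k, v in grouped.items())))
print(frequent)  # {'glucose_high': ['t1', 't2', 't3'], 'bmi_high': ['t1', 't3']}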
2. Mining Phase
In the mining phase, the mapper task creates a lexical sequence tree for finding patterns; to improve efficiency, a depth-limited DFS is run on the tree, and the reducer task finds the support value for the patterns and thus the useful patterns. These phases are further elaborated below. In the scanning phase, the data is loaded onto the machine for scanning and mining. To avoid loss of data during loading, each mapper task and reducer task reads a pre-partitioned part of the data based on a memory chunk or a number of lines. The mapper task converts the data tuples into key-value pairs for faster processing. The tuples are grouped based on item ID, and the transaction IDs are stored in the value part, for example item,(tid,tid). The reducer task counts the support value for the items and patterns, which can be used to remove infrequent patterns. For handling data between mappers and reducers, the mapper and reducer with identical keys handle the same set of data. A threshold is identified to remove the infrequent patterns: any pattern having a confidence value less than the threshold is removed, and the reducer stores the data in a distributed hash table or a distributed file system that is accessible to the next MapReduce job in the mining phase.
The main concerns addressed by this staged design are:
• Memory scalability
• Work partitioning
• Load balancing.
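A toy, in-memory sketch of the mining phase's depth-limited pattern enumeration: candidate patterns are grown depth-first up to a fixed depth, and candidates whose support falls below the threshold are pruned together with all of their extensions. Itemsets are simplified to single items, and the data, items and limits are illustrative assumptions.

def support(pattern, sequences):
    # Count sequences that contain `pattern` as an ordered subsequence of items.
    def contains(seq):
        pos = 0
        for item in pattern:
            while pos < len(seq) and seq[pos] != item:
                pos += 1
            if pos == len(seq):
                return False
            pos += 1
        return True
    return sum(contains(s) for s in sequences)

def mine(sequences, items, min_support, max_depth, prefix=()):
    # Depth-limited DFS over the lexical sequence tree of item prefixes.
    results = {}
    if len(prefix) == max_depth:
        return results
    for item in items:
        candidate = prefix + (item,)
        sup = support(candidate, sequences)
        if sup >= min_support:           # prune infrequent branches and their extensions
            results[candidate] = sup
            results.update(mine(sequences, items, min_support, max_depth, candidate))
    return results

db = [["a", "b", "d"], ["a", "c", "d"], ["b", "d"]]   # toy sequence database
print(mine(db, items=["a", "b", "c", "d"], min_support=2, max_depth=2))
# {('a',): 2, ('a', 'd'): 2, ('b',): 2, ('b', 'd'): 2, ('d',): 3}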
CONCLUSION
In this chapter we have discussed an optimized version of a traditional pattern mining algorithm on a distributed setup using the Hadoop framework and Iterative MapReduce algorithms. A stepwise system to find and filter relevant patterns from the data set using Iterative MapReduce is the major point of discussion in the chapter. The capacity to handle large data sets varies depending upon the server’s capacity as well as the number of commodity servers available. By using commodity servers, the data processing has been scaled out. Thus we can say that MapReduce algorithms play a very pivotal role in pattern mining. The distributed environment of the node clusters helps in the smooth flow of data, accompanied by efficient computation and optimized resource utilization.
REFERENCES
Ramasubbareddy Somula
VIT University, India
Sravani Nalluri
VIT University, India
Vaishali R.
VIT University, India
Sasikala R.
VIT University, India
ABSTRACT
This chapter presents an introduction to the basics of big data, including architecture, modeling, and the tools used. Big data is a term used for handling high volumes of data and can serve as an alternative to RDBMS and other analytical technologies such as OLAP. For every application there exist databases that contain the essential information, but the sizes of the databases vary across applications, and we need to store, extract, and modify these databases. In order to make the data useful, we have to deal with it efficiently. This is where big data plays an important role: big data exceeds the processing and overall capacity of traditional databases. In this chapter, the basic architecture, tools, modeling, and challenges are presented in each section.
1. INTRODUCTION
Day by day, we see data rapidly increasing in many forms. We have traditional data processing software to process small quantities of data, but as trillions of bytes of information are being processed per second, traditional software techniques fail to process this data. We need to rethink a solution which can process this data, and Big Data gives us that solution. Big Data is a term used for creating, capturing, communicating, aggregating, storing and analyzing large amounts of data. The many attempts to quantify the growth rate in the volume of data are collectively referred to as the Information Explosion.
Major milestones have taken place in the history of sizing data volumes and the evolution of the term Big Data. The following are some of them:
The term Big Data was coined in 1998 by Mr. John Mashey, Chief Scientist at SGI. Even though Michael Cox and David Ellsworth seem to have been the first to use the term ‘Big Data’ in print, Mr. Mashey supposedly used the term in his various speeches, and that is why he is credited with coming up with Big Data. However, various sources say that the first use of the term Big Data was in an academic paper, Visually Exploring Gigabyte Datasets in Realtime (ACM) (OECD, 2015; Mark A. Beyer & Douglas Laney, 2012).
The following are the differentiators of Big Data over Traditional Business
Intelligence solutions:
The Big Data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information policy. Organizations have to compromise and balance against the confidentiality requirements of the data, and they must determine how long the data has to be retained. With the advent of new tools and technologies for building big data solutions, the availability of skills is a big challenge for CIOs. A higher level of proficiency in data science is required to implement big data solutions today because the tools are not user-friendly yet (Bill Franks, 2012).
Analogous to the Cloud Computing architecture, the Big Data landscape can be
divided into four layers.
Here are some of the big data providers that are offering solutions in the specific
industries:
• Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data doubles every year.
• Variety: Nowadays data is not stored only in rows and columns, i.e., in a structured format. We see data being stored in the form of log files, i.e., unstructured.
• Volume: The amount of data we deal with is very large, of the order of petabytes.
• Veracity: Explains the reliability of the data (Foster Provost & Tom Fawcett, 2013).
Big data is differentiated from traditional data based on these four components:
2.1.1 Volume
We live in the data age: in 2013 the total volume of stored data was analyzed to be 4.4 zettabytes, and by 2020 it will become 44 zettabytes (1 zettabyte is 10²¹ bytes). This enormous amount of data needs to be stored and analyzed. These data come from different sources and also need to be combined for better results.
2.1.2. Velocity
The data comes from multiple sources, and these sources have to be run in parallel. Hence, the next important issue is how to run these sources in parallel at the very high speeds of data generation and transmission. For example, consider a weather sensor which collects weather data from multiple sources every hour. These data need to be moved to a particular storage location, and the resulting data log is very large. Traditional systems are not capable of handling this storage and these frequent movements.
2.1.3. Variety
The data sources can be of different types: data can be collected from weather sensors, social networks, stock exchanges and smartphones. The data includes text, images, audio, video or any other data logs. The data can be classified mainly into three categories: structured data, semi-structured data and unstructured data. Traditional distributed systems such as RDBMS, volunteer computing and grid computing can handle only structured data; big data differs from these systems in that it can also handle semi-structured and unstructured data.
2.1.4. Veracity
Big data can handle huge amounts of data, and this data also needs to be correct. Hence, veracity refers to how we can clean the data in the data preprocessing stage. The data needs to be relevant and valuable (X. L. Dong & D. Srivastava, 2013).
3. ARCHITECTURE OF BIGDATA
Big data is treated as a set of tools for developing and analyzing scalable, reliable and portable data. It serves as a key for designing infrastructure and solutions. It interconnects and organizes the existing resources and consists of different layers, such as:
• Data Sources: Real-time data sources and application data sources such as relational databases, as well as static files produced by web servers and other applications.
• Data Storage: Big data is not the first distributed processing system, but it can store higher volumes of data, known as a data lake, than other traditional systems. Big data can prepare the data for analysis, and the resulting analytical data store can be used to serve different types of queries. The analytical data store can also provide metadata abstraction and low-latency NoSQL technologies.
• Processing of Data: Bigdata provides interactive and batch processing
including real time applications. Bigdata solutions process the data files
using batch systems to filter and aggregate. If the solutions include real
time sources, the bigdata architect can include stream processing also. The
processed stream data is then written into an output sink (B. Ramesh,
2015).
• Service Orchestration: Most big data solutions consist of repeated data processing operations that transform the encapsulated source data. Moving data between multiple sources and destinations and loading the processed data into an analytical store is an important issue in traditional systems. In big data, this can be automated by using service orchestration with tools such as Sqoop and Oozie.
A number of open source tools, frameworks and query languages have been introduced to analyse big data. MongoDB is a well-known NoSQL-based data analytics tool that provides options to visualize, analyse and explore datasets. In this section, let us explore a GIS dataset in MongoDB Compass.
The Schema view lists the attribute information and visualizes the data with its data types.
To get the Schema view, click on the Schema tab at the top and then click on the green ‘Analyse’ button.
As shown in Figure 6, the query made to analyse the 100Y weather dataset has returned 250,000 JSON documents.
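The same exploration can also be scripted. A minimal sketch using PyMongo is shown below; the connection string and the database and collection names are assumptions, not details from the chapter.

from pymongo import MongoClient

# Hypothetical connection details; adjust to the actual MongoDB deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["weather"]["observations"]

query = {}                                   # an empty filter matches every document
print(collection.count_documents(query))     # how many documents the query returns
sample = collection.find_one(query)          # peek at one document's fields and types
if sample is not None:
    for field, value in sample.items():
        print(field, type(value).__name__)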
Big data collects large-scale data from different sources and solves computational problems. The most significant characteristics of big data are the types of data patterns, the different structures and the complicated relationships between data samples. Big data refers to data (or information) which can be processed with advanced technology such as analytics and visualization methods to find hidden patterns and make accurate decisions, rather than simply adding more compute systems (“Global Data Center Traffic”). Many business organizations have large amounts of data, but that data is not used properly due to the lack of efficient systems, so the percentage of data utilization keeps decreasing (Oracle Big Data Strategy Guide). As technology grows day by day, mobile devices and sensors play an important role in generating data, which is then stored. For example, organizations can monitor employees who work far away from the office (Cloudera's 100% Open Source Distribution of Hadoop). For instance, railway companies have decided to install sensors every few feet in order to monitor the internal and external events which cause trains to meet with accidents. By monitoring every event, railway staff come to know which parts need to be replaced and which ones repaired (James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, & Angela Hung Byers, 2011). Roads are equipped with sensors to read information, reducing the chances of natural disasters by analyzing the generated data and making predictions (“Big Data, Big Impact: New Possibilities for International Development”). Every day, a few terabytes of data are generated by every business organization; traditional methods such as information retrieval, discovery, analysis and sentiment analysis are used to deal with it, but they are not fit for huge data. Currently, we do not have good knowledge of the data distribution laws and association relations of big data, and there is a lack of deep understanding of the association between data complexity and computational complexity. The lack of processing methods in big data, among other aspects, confines our ability to implement new methods and models to solve these problems. The basic problem is to understand the essential characteristics of the complexity of big data. Basic study of complexity theory will give a clear view of complexity, how complex patterns are formed and how they are associated with other patterns. By getting a clear understanding of complexity theory, we can design and implement novel models to resolve the problems of big data complexity.
Big data has three main features: it is fast changing, comes from multiple sources and has a huge volume. These features are difficult for traditional processing methods such as machine learning, data retrieval and data analysis. New big data computing tools are needed to overcome problems arising from the assumption of independent and identically distributed data when generating accurate statistics. To address the problems of big data, we need to examine the computational complexity and the algorithms used. The storage available for the data generated through social websites is not enough, because of the demands big data places on storage and networks. Servers offload resource-intensive data into the cloud and send more and more data there for processing. Offloading data into the cloud does not, by itself, solve the problem, because big data requires collecting and retrieving all the data, from terabytes to petabytes. Offloading the entire data set into the cloud takes a long time, and the data changes every second, which makes it hard to keep updated in real time. Moving data from the storage location to the processing location can be avoided in two ways: one is to process the data at the storage location, and the other is to offload only the required data, which takes more time to process. Building indexes over the stored data makes the retrieval process easier and, moreover, reduces processing time. In order to address computing complexity in big data, we need to understand the life cycle of big data applications and study centralized processing mechanisms, which depend upon the behavior of big data. We need to move away from the traditional computing-centric approach towards distributed computing paradigms. We need to focus on new methods and data analysis mechanisms for distributed streaming, and we are also required to focus more on bootstrapping, local computation and new algorithms for handling large amounts of data.
5.3 System Complexity
Big data can process high volumes of heterogeneous data types and applications to support research on big data. As existing big data processing is not enough to handle the high volume of generated data and real-time requirements, these constraints call for new system architectures, processing systems and energy optimization models. Addressing the complexity problems will lead to designing novel hardware and software frameworks for optimizing energy consumption in big data. Systems can process data that are all similar in size and structure, but find it difficult to process data presented in different patterns and sizes. We need to conduct small-scale research on all big data tools, different workload conditions, different data types, various data patterns, performance evaluation in distributed and centralized environments, machine learning algorithms for performance prediction, energy-optimized algorithms, energy consumption per unit and recursive workloads. We should focus on novel data processing systems that are able to process all kinds of data in different situations.
6. APPLICATIONS OF BIGDATA
• Big Data Analytics (BDA): This kind of application analyzes massive data using parallel processing frameworks. BDA applications use sample data in a pseudo-cloud environment; after that, they are built in a real cloud environment with more processing power and larger input data. These applications utilize data so large that it cannot fit on a single hard drive. The data is generated from different sources like traffic, social websites, online game information, the stock market and international games.
• Clustering: Users can easily identify groups of people by using algorithms such as k-means through point-and-click dialogs and based on specific data dimensions. Clustering plays an important role in big data in grouping people by customer type, patient documents, purchasing patterns and product behavior.
• Data Mining: A decision tree helps the user understand the outcome and the relation between attributes and the expected outcome. The decision tree reflects the structure of the probability hidden in the data. Decision trees help us predict fraud risk, online registrations, online shopping behavior and disease risk.
• Banking: In the banking sector, the use of sensitive data leads to privacy issues. Research shows that more than 62% of bank employees are cautious about their customers’ information due to privacy issues. Distribution of customers’ data to different branches also leads to security issues. Investigations have found that bank data containing users’ sensitive information, such as earnings, savings and insurance policies, has ended up in the wrong hands. This discourages customers from sharing personal details in bank transactions.
• Stock: Data analytics can be used to detect fraud by establishing a comprehensive in-database system for private stock exchanges.
• Credit Cards: Credit card companies depend on in-database analytics to identify fraudulent transactions with accuracy and speed. This fraud transaction detection follows up on users’ sensitive data, such as amount and location, before authenticating suspicious activity.
Enterprise
It helps industry people around the world. Data does not have to move back and forth in order to be worked on. It provides insight that enables business people to make accurate decisions at a lower expense than traditional tools.
7. CONCLUSION
Big data is an important technology in our data era which can handle structured, semi-structured and unstructured data. It provides a viable solution for large and complex data, which has become a real challenge nowadays. Each big data system provides massive power, and different tools are used to provide it. This chapter presented the important aspects of big data, its architecture and its applications, along with the associated challenges. It is clear that we are now at the start of the big data era and that we still have to discover many things about big data in order to compete in this data world.
REFERENCES
Bakshi, K. (2012). Considerations for big data: Architecture and approach. Aerospace Conference, 1–7. 10.1109/AERO.2012.6187357
Franks, B. (2012). Taming the big data tidal wave. Wiley.
Hitzler, P., & Janowicz, K. (2013). Linked Data, Big Data, and the 4th Paradigm. Semantic Web, 4(3), 233–235.
OECD. (2015). Data-driven innovation: Big data for growth and well-being. Paris, France: OECD Publishing.
Yi, X., Liu, F., Liu, J., & Jin, H. (2014). Building a network highway for big data: Architecture and challenges. IEEE Network, 28(4), 5–13. doi:10.1109/MNET.2014.6863125
CHAPTER 9
Chetan Kumar
VIT University, India
Leonid Datta
VIT University, India
ABSTRACT
This chapter is a description of MapReduce, a programming model for distributed, parallel computing on huge chunks of data that can easily execute on commodity servers, thus reducing the costs of server maintenance and removing the requirement of having dedicated servers for running these processes. The chapter covers the various approaches to the MapReduce programming model and how to use it in an efficient manner for scalable text-based analysis in various domains like machine learning, data analytics, and data science. Hence, it deals with the various approaches to using MapReduce in these fields, how to apply the various techniques of MapReduce effectively, and how to fit the MapReduce programming model into any text mining application.
INTRODUCTION
In the steadily changing landscape of information technology, the data being gathered and searched through for business intelligence purposes has reached excessive levels. Welcome to the big data revolution. Big data is the term used for data sets so large or complex that they cannot be modified or processed in conventional programming environments or software to generate specific predictions or outputs; that is, day-to-day data processing applications are not able to deal with them. This happens because day-to-day software systems have processing limitations, and these huge chunks of data contain so much outlier and exaggerated information that they must be cleaned and made fit for processing before they are dealt with. For processing, curing, storing, analyzing, decision making, transferring, visualizing, query processing and updating this huge hunk of data, the Big Data environment is used.
We now live in a generation in which huge chunks of data are generated every moment and used by business people for analysis. If these data are to be processed, a serialized way of processing will not give results efficiently, because it will consume more time than required. So, to avoid this problem of serialized processing, we go for parallel processing in a Big Data environment. This parallel processing helps us reduce processing time because the data sets are dealt with in relatively small parts which are merged afterwards.
This is the brief idea of the chapter, and we will see more details in the following sections.
BACKGROUND
While defining Big Data, it is very difficult to differentiate exactly between data and Big Data. Ordinary data can also be processed in this environment, although that is not the most suitable way of handling it; Big Data handling, on the other hand, is not at all suitable for day-to-day software systems. In processing Big Data, the input data has the following qualities:
• Volume: The quantity of the data determines the weightage and potential insight.
• Variety: The nature and variety of the data determines how eclectic the data is.
• Velocity: The speed at which the data is generated and processed determines how useful the data is.
• Variability: Inconsistency of the data set can hinder its handling.
• Veracity: The quality of captured data can vary greatly, affecting accurate analysis.
These qualities can be considered a formal definition of Big Data.
Regarding parallel processing, parallel computing (Kumar, V., Grama, A., Gupta, A., & Karypis, G., 1994) is the type of computation in which many calculations or process executions are carried out simultaneously, i.e., large problems are divided into smaller problems for processing and the results are then merged at the end to produce the output. Parallel processing reduces the processing time drastically because more than one portion of the data is being processed at the same time.
MPP (massively parallel processing) is by no means the only technology available to facilitate the handling of substantial volumes of data.
Is MPP better than MapReduce, or vice versa? This depends on the goals of each organization; they are different tools that suit different situations. As a matter of fact, some organizations use both MPP and MapReduce, affording them the advantage of the best of both worlds.
Mean: The arithmetic mean is a measure of the central location of data. In some cases a robust measure of central tendency will provide a better estimate of the centre of the data (Tukey, J. W., 1977). The mean is computed by summing all the values of a given set and dividing the sum by the number of values. Hence we need to fit our calculation to the MapReduce model so that we can form tuples meant for mean calculation from a set of data. Here we deal with numeric data only, as summation of string values is not possible. Mean calculation needs to assume a key for which all the values will be accumulated and divided by the number of terms. We will assume a dataset in which each line contains numbers separated by ‘,’, and the aim is to find the average value of each column. Hence we will initially assume two phases (map, reduce):
1. Map Phase: As stated above, the map phase reads the input file line by line; for the map-phase input, the byte offset of each line serves as the key and the line as a whole serves as the value associated with that key. Byte offset address translation may be completed in as little as a single clock cycle (Senthil, G., 2004), so this serves as a convenient basis for iterating through the line and performing the required operations. While writing the code, we can assume a variable which is retained in every call of the iterator function and can hence be programmed as a counter that is incremented as we keep reading the values of any particular column. The map output will therefore have many keys, each associated with as many values as there are lines in the input file. During the sort and shuffle phase, all the values (i.e., the numbers in the same column) are grouped according to the key, and the output of this phase is each key associated with a tuple (list) of values, meaning that each column number (key) is associated with all the values in its respective column.
2. Reduce Phase: As stated above, the input to the reduce phase is the column number as the key and a list of values associated with that key. The reducer phase therefore has access to all the values in a particular column as a list. A running sum is maintained that retains its value for as long as the iteration over the list of values takes place, adding up the numbers so that at the end of the iteration the final sum is held in the variable. The sum obtained is then divided by the number of values in the list and is finally associated with the input key, i.e., the column number. The value of the mean is obtained in this way for each of the columns.
This method may sound perfect, but it has a significant flaw. Consider a huge dataset with a large number of lines. When the reducer is fed the list of values, each key will have as many values in its tuple as there are lines in the data. This increases the network traffic: a large number of values need to be serialized at the end of sort and shuffle and sent to the reduce phase. This can be optimized by dividing the file on the basis of size or number of lines and by using a combiner phase, which serves as a localized reducer here. To make it efficient, two things are stored in each value: the average for the key and the number of terms over which that average was calculated. Thus, instead of sending all the values associated with the columns, one can find the average of a subset of a column and send that average accompanied by the number of terms involved in calculating it. At the reducer end, a list of values (depending on the number of pieces the file was divided into) is obtained. To calculate the overall average, each value obtained for a key is multiplied by the number of occurrences present in its tuple; summing all such products and dividing by the sum of the numbers of occurrences produces the mean of the data, and hence optimization is achieved.
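A minimal, in-memory sketch of this combiner-style per-column mean (plain Python rather than the Hadoop Java API; the toy CSV chunks are illustrative): each split emits (column, (partial average, count)) and the reducer recombines the weighted partial averages.

from collections import defaultdict

splits = [                       # toy CSV "file", pre-split into two chunks
    ["1,10", "3,30"],
    ["5,50", "7,70", "9,90"],
]

def map_and_combine(lines):
    # Combiner output: column index -> (partial average, number of values).
    sums, counts = defaultdict(float), defaultdict(int)
    for line in lines:
        for col, field in enumerate(line.split(",")):
            sums[col] += float(field)
            counts[col] += 1
    return {col: (sums[col] / counts[col], counts[col]) for col in sums}

def reduce_means(partials_per_split):
    # Recombine the (average, count) pairs into the exact per-column mean.
    totals, counts = defaultdict(float), defaultdict(int)
    for partials in partials_per_split:
        for col, (avg, n) in partials.items():
            totals[col] += avg * n
            counts[col] += n
    return {col: totals[col] / counts[col] for col in totals}

print(reduce_means([map_and_combine(chunk) for chunk in splits]))
# {0: 5.0, 1: 50.0}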
1. Map Phase 1: In this phase the data is read line by line, and all the words are collected in the form of tuples where each word is linked with a ‘1’, which represents one occurrence of that word in the file. This phase mainly identifies all the words in the input file, irrespective of their frequency of occurrence, and constitutes the first phase of the job.
2. Reduce Phase 1: This phase takes the output of the map, shuffle and sort phases. The shuffle and sort phases perform a group-by operation on the key, i.e., the words in the file obtained from the map phase. The sort phase collects all keys (words) of the same value together by means of string comparison, so all the keys are sorted. In the shuffle phase, all the values of the same key are collected together to form a tuple of values (1’s), so that the number of 1’s associated with each key equals the number of times the word occurs in the file. Finally, in the reduce phase, all the 1’s of each key are added together, giving the frequency of each unique key.
At the end of these two phases we have the frequency of each word occurring in the input file: the output of this job has the words as keys and the frequency of occurrence of each word as the value, which makes the keys unique. For the final output, the words (i.e., the keys) need to be sorted on the basis of the frequencies obtained.
3. Map Phase 2: In the next step, advantage is taken of the sort phase: the values (i.e., the frequencies) are now used as the keys and the word, which was the key earlier, is now used as the value. The task of this mapper is therefore simply to swap the frequency into the key position and the word into the value position.
4. Reduce Phase 2: During the sort and shuffle phases the words with the same frequency are collected together into tuples, so the output of these phases is the sorted frequencies as keys, each accompanied by all the words sharing that frequency. Since the reducer must output single key-value pairs rather than a key with a list of values (which arises when many words have the same frequency), the reducer iterates across the list of values for each key, starting from an empty string and concatenating each new word obtained from the list. This produces the frequency as the key and the words having that frequency as the value. A special character can be used as a delimiter between the words so that the words of the same frequency remain separable. These phases can be optimized by using a combiner between map phase 1 and reduce phase 1 so that less data is transferred across the network, reducing network traffic. The combiner serves as a localized reducer by counting the frequency of all words within each piece of data; because counting is commutative and associative, the partial counts can later be combined easily. A sketch of both jobs follows this list.
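The following minimal Python sketch simulates the two chained jobs described above; it is illustrative only (the sample splits and the '|' delimiter are assumptions, and a real Hadoop job would implement these phases as separate mapper, combiner, and reducer classes).

# Minimal sketch of the two jobs: job 1 counts word frequencies with a
# combiner-style local count per split; job 2 swaps (word, count) into
# (count, word) and concatenates the words that share a frequency.
from collections import Counter, defaultdict

def job1_wordcount(splits):
    # Map phase 1 + combiner + reduce phase 1: word -> total frequency.
    totals = Counter()
    for split in splits:                     # each split is a list of lines
        local = Counter()                    # combiner: localized reducer
        for line in split:
            for word in line.split():        # map: emit (word, 1)
                local[word] += 1
        totals.update(local)                 # reduce: sum the partial counts
    return totals

def job2_sort_by_frequency(word_counts):
    # Map phase 2 swaps key and value; reduce phase 2 joins words per count.
    by_freq = defaultdict(list)
    for word, count in word_counts.items():  # map: emit (count, word)
        by_freq[count].append(word)
    # The framework's sort groups by count; the reducer concatenates the words.
    return {count: "|".join(sorted(words)) for count, words in sorted(by_freq.items())}

splits = [["big data big"], ["data big analytics"]]
counts = job1_wordcount(splits)              # {'big': 3, 'data': 2, 'analytics': 1}
print(job2_sort_by_frequency(counts))        # {1: 'analytics', 2: 'data', 3: 'big'}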
Inverted Index: In recent years, large-scale image retrieval has shown remarkable potential in real-life applications (Nguyen, Pham, Ngo, Le, & Duong, 2014). This type of job is basically used to list the important terms in a dataset and is therefore useful for tagging important words in a document so that specific values can be traced. Most search engines carry out this task to make web searches efficient. The mapper outputs the desired unique keys attached to values. The partitioner phase determines which reducer the data is sent to after processing, which makes the final processing task of the reducer more distributed. The reducer therefore receives a collection of distinct row identifiers that link back to the input keys. Finally, the reducers can use unique delimiters to separate the key and value pairs. The final output has the values associated with unique IDs from the input file, so they can be associated with data elements from the file. Performance therefore depends on the number of unique index keys and on the cardinality of the keys, i.e. the number of values each key is associated with.
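The following is a minimal Python sketch of such an inverted-index job; it is illustrative only (the sample documents, the choice of two reducers, and the ';' delimiter are assumptions, not part of the chapter).

# Minimal sketch of an inverted index: the mapper emits (term, document_id)
# pairs, a hash partitioner routes each term to a reducer, and the reducer
# joins the document identifiers for a term with a delimiter.
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit (term, document_id) for every unique term in a document.
    for term in set(text.lower().split()):
        yield term, doc_id

def partition(term, num_reducers):
    # Hash partitioner: decides which reducer receives a given term.
    return hash(term) % num_reducers

def reduce_phase(pairs, delimiter=";"):
    # Join all document identifiers seen for each term with the delimiter.
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)
    return {term: delimiter.join(sorted(ids)) for term, ids in index.items()}

docs = {"doc1": "big data analytics", "doc2": "satellite data processing"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]

# Route the pairs to two reducers via the partitioner, then merge their outputs.
per_reducer = defaultdict(list)
for term, doc_id in pairs:
    per_reducer[partition(term, 2)].append((term, doc_id))

inverted_index = {}
for reducer_input in per_reducer.values():
    inverted_index.update(reduce_phase(reducer_input))
print(inverted_index)   # e.g. {'data': 'doc1;doc2', 'big': 'doc1', ...}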
CONCLUSION
The golden age of data has arrived. The capacity to handle large data set problems varies with server capacity as well as with the number of commodity servers available. By using commodity servers, data processing has been scaled out. This was made possible by innovations such as MapReduce and MPP on the software side, which mainly avoid system-level integrations for the processing tasks and hence make development easier.
REFERENCES
Bijalwan, V., Kumar, V., Kumari, P., & Pascual, J. (2014). KNN
based machine learning approach for text and document mining.
International Journal of Database Theory and Application , 7(1), 61–
70. doi:10.14257/ijdta.2014.7.1.06
Coyle, D. J., Jr., Chang, A., Malkemus, T. R., & Wilson, W. G.
(1997). U.S. Patent No. 5,630,124. Washington, DC: U.S. Patent and
Trademark Office.
Eswaran, K. P., Gray, J. N., Lorie, R. A., & Traiger, I. L. (1976). The
notions of consistency and predicate locks in a database system.
Communications of the ACM , 19(11), 624–633.
doi:10.1145/360363.360369
Gupta, A., Agarwal, D., Tan, D., Kulesza, J., Pathak, R., Stefani, S., & Srinivasan, V. (2015, May). Amazon Redshift and the case for simpler data warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 1917-1923). ACM. doi:10.1145/2723372.2742795
Kumar, V., Grama, A., Gupta, A., & Karypis, G. (1994). Introduction
to parallel computing: design and analysis of algorithms (Vol. 400).
Redwood City, CA: Benjamin/Cummings.
Nguyen, B. V., Pham, D., Ngo, T. D., Le, D. D., & Duong, D. A.
(2014, December). Integrating spatial information into inverted index
for large-scale image retrieval. In Multimedia (ISM), 2014 IEEE
International Symposium on (pp. 102-105). IEEE. doi:10.1007/978-3-
319-12024-9_19
Shreya Tuli
VIT University, India
Gaurav Sharma
VIT University, India
Nayan Mishr
VIT University, India
ABSTRACT
Big data is a term for data sets that are so large or complex that traditional data
processing application software is inadequate to deal with them. Its challenges
include capturing data, data storage, data analysis, search, sharing, transfer,
visualization, querying, updating, and information privacy. Lately, the term big
data tends to refer to the use of predictive analytics, user behavior analytics, or
certain other advanced data analytics methods that extract value from data, and
seldom to a particular size of data set. In this chapter, the authors distinguish between a fake note and a real note and would like to take the approach to a level where it can be used everywhere. After detection, the data on fake and real notes can be stored in a database. The amount of data to store will be huge; to overcome this problem, big data can be used, which helps to store large amounts of data in very little time. The difference between a real note and a fake note is that the thin strip of a real note is more or less continuous, whereas the strip of a fake note contains fragmented thin lines. One could say that the fake note has more than one line in the thin strip while the real note has only one line. Therefore, if just one line is seen the note is real, but if more than one line is seen it is fake. In this chapter, the authors use foreign currency.
INTRODUCTION
There are six different steps involved in distinguishing between fake and original currency. In this project, we use one of the well-known segmentation techniques, thresholding; we also apply morphological operations such as opening and closing. We use one real note and two fake notes and differentiate between them based on the black strips they contain.
The approach consists of three main stages: thresholding, opening, and closing.
Figure 1. Architecture
PROCEDURE
Let us now see the different steps involved:
Step 1: Reading the image.
In this step, we read the particular image from a given location and resize it if it is not of the desired size. The images are read directly from the desktop using the command "imread". The first image is the real note and the other two are fake ones (Figure 2).
Figure 2.
Step 2: Extracting black strips.
As discussed, the project works on the principle of the number of black strips. If there is exactly one black strip line, the note is original; if the count is not equal to 1, the image is fake.
Figure 3.
Step 3: Converting RGB image into gray level and then thresholding
Here we convert the RGB image into a gray-level image using the command "rgb2gray"; again, subplot serves the same purpose as discussed above. After the image is converted to grayscale, thresholding is performed by converting it to binary with the command "im2bw", using a threshold value of 30 (out of 255). The value 30 was chosen heuristically, as this was predominantly the intensity of the black strip.
blackStripReal = rgb2gray(blackStripReal);
blackStripFake = rgb2gray(blackStripFake);
blackStripFake2 = rgb2gray(blackStripFake2);
figure(2);
subplot(1,3,1); imshow(blackStripReal); title('Real');
subplot(1,3,2); imshow(blackStripFake); title('Fake');
subplot(1,3,3); imshow(blackStripFake2); title('Fake #2');
Figure 4.
For thresholding:
Figure 5.
Step 4: Opening.
In this step we perform an area opening of the image, specifying a relatively large area of about 100 pixels to get rid of spurious, noisy, isolated pixels; you will notice that each of the images has some noisy pixels on the edges. The opening operation is performed with the command "bwareaopen", which removes connected regions in a black-and-white image that are smaller than the specified area. Here again we use subplot for the same reason discussed above.
Figure 6.
Step 5: Closing.
This is the closing process. Here we use a 5 x 5 square structuring element to ensure that any disconnected regions that are near larger regions get connected to each other. This is done with the command "imclose". Here again we use subplot for the same reason as discussed above.
Figure 7.
Step 6: Counting the black lines.
[~,countReal] = bwlabel(BWImageCloseReal);
[~,countFake] = bwlabel(BWImageCloseFake);
[~,countFake2] = bwlabel(BWImageCloseFake2);
disp(['The total number of black lines for the real note is: ' num2str(countReal)]);
disp(['The total number of black lines for the fake note is: ' num2str(countFake)]);
disp(['The total number of black lines for the second fake note is: ' num2str(countFake2)]);
Finally, we use “disp” to display the output:
The total number of black lines for the real note is: 1
The total number of black lines for the fake note is: 2
The total number of black lines for the second fake note is: 0
FULL MATLAB CODE
clear all;
close all;

%% //Read the images
Ireal = imread('C:\Users\Bunny\Desktop\SqbnIm.jpg');   % Real
Ifake = imread('C:\Users\Bunny\Desktop\2U3DEm.jpg');   % Fake
Ifake2 = imread('C:\Users\Bunny\Desktop\SVJrwaV.jpg'); % Fake #2

% //Resize so that we have the same dimensions as the other images
Ifake2 = imresize(Ifake2, [160 320], 'bilinear');

%% //Extract the black strips for each image
blackStripReal = Ireal(:,195:215,:);
blackStripFake = Ifake(:,195:215,:);
blackStripFake2 = Ifake2(:,195:215,:);

figure(1);
subplot(1,3,1); imshow(blackStripReal); title('Real');
subplot(1,3,2); imshow(blackStripFake); title('Fake');
subplot(1,3,3); imshow(blackStripFake2); title('Fake #2');

%% //Convert into grayscale then threshold
blackStripReal = rgb2gray(blackStripReal);
blackStripFake = rgb2gray(blackStripFake);
blackStripFake2 = rgb2gray(blackStripFake2);

figure(2);
subplot(1,3,1); imshow(blackStripReal); title('Real');
subplot(1,3,2); imshow(blackStripFake); title('Fake');
subplot(1,3,3); imshow(blackStripFake2); title('Fake #2');

%% //Threshold using about intensity 30
blackStripRealBW = ~im2bw(blackStripReal, 30/255);
blackStripFakeBW = ~im2bw(blackStripFake, 30/255);
blackStripFake2BW = ~im2bw(blackStripFake2, 30/255);

figure(3);
subplot(1,3,1); imshow(blackStripRealBW); title('Real');
subplot(1,3,2); imshow(blackStripFakeBW); title('Fake');
subplot(1,3,3); imshow(blackStripFake2BW); title('Fake #2');

%% //Area open the image to remove small noisy regions
figure(4);
areaopenReal = bwareaopen(blackStripRealBW, 100);
subplot(1,3,1); imshow(areaopenReal); title('Real');
areaopenFake = bwareaopen(blackStripFakeBW, 100);
subplot(1,3,2); imshow(areaopenFake); title('Fake');
areaopenFake2 = bwareaopen(blackStripFake2BW, 100);
subplot(1,3,3); imshow(areaopenFake2); title('Fake #2');

%% //Post-process: close with a 5 x 5 square structuring element
se = strel('square', 5);
BWImageCloseReal = imclose(areaopenReal, se);
BWImageCloseFake = imclose(areaopenFake, se);
BWImageCloseFake2 = imclose(areaopenFake2, se);

figure(5);
subplot(1,3,1); imshow(BWImageCloseReal); title('Real');
subplot(1,3,2); imshow(BWImageCloseFake); title('Fake');
subplot(1,3,3); imshow(BWImageCloseFake2); title('Fake #2');

%% //Count the total number of objects in this strip
[~, countReal] = bwlabel(BWImageCloseReal);
[~, countFake] = bwlabel(BWImageCloseFake);
[~, countFake2] = bwlabel(BWImageCloseFake2);

disp(['The total number of black lines for the real note is: ' num2str(countReal)]);
disp(['The total number of black lines for the fake note is: ' num2str(countFake)]);
disp(['The total number of black lines for the second fake note is: ' num2str(countFake2)]);
RESULTS
Figure 8.
As seen above, these are the results obtained when this process is run.
CONCLUSION
In conclusion, various image processing techniques are used to determine whether a note is fake or real, based on the number of black strips present in the image. Processes such as thresholding, opening, and closing have been used. Finally, big data helps us overcome the problem of storing a large amount of data instantly and with great efficiency.
• Volume: Big data doesn't sample; it just observes and tracks what happens.
• Velocity: Big data is often available in real time.
• Variety: Big data draws from text, images, audio, and video, and it completes missing pieces through data fusion.
FUTURE SCOPE
This project has good future scope. Nowadays many people print fake notes, which is illegal, so this process can be used by banks to detect which currency is real and which is fake, which is very helpful. Big data will play a major role in storing large amounts of data, for example the information on how many fake and how many real notes were detected in a set.
Figure 9.
REFERENCES
Gunaratna, Kodikara, & Premaratne. (2008). ANN based currency
recognition system using compressed gray scale and application for
Sri Lankan currency notes-SLCRec. Proceedings of World Academy
of Science, Engineering and Technology, 35, 235-240.
Guo, Zhao, & Cai. (2010). A reliable method for paper currency
recognition based on LBP. In Network Infrastructure and Digital
Content, 2010 2nd IEEE International Conference on. IEEE.
Aiken, P., Gillenson, M., Zhang, X., & Rafner, D. (2011). Data
management and data administration: Assessing 25 years of practice.
Journal of Database Management , 22(3), 24–45.
doi:10.4018/jdm.2011070102
Anselma, L., Bottrighi, A., Molino, G., Montani, S., Terenziani, P., &
Torchio, M. (2013). Supporting knowledge-based decision making in
the medical context: The GLARE approach . In Wang, J. (Ed.),
Intelligence methods and systems advancements for knowledge-based
business (pp. 24–42). Hershey, PA: IGI Global. doi:10.4018/978-1-
4666-1873-2.ch002
Arh, T., Dimovski, V., & Blažic, B. J. (2011). ICT and web 2.0
technologies as a determinant of business performance . In Al-
Mutairi, M., & Mohammed, L. (Eds.), Cases on ICT utilization,
practice and solutions: Tools for managing day-to-day issues (pp. 59–
77). Hershey, PA: IGI Global. doi:10.4018/978-1-60960-015-0.ch005
Assefa, T., Garfield, M., & Meshesha, M. (2014). Enabling factors for
knowledge sharing among employees in the workplace . In Al-
Bastaki, Y., & Shajera, A. (Eds.), Building a competitive public sector
with knowledge management strategy (pp. 246–271). Hershey, PA:
IGI Global. doi:10.4018/978-1-4666-4434-2.ch011
Barioni, M. C., Kaster, D. D., Razente, H. L., Traina, A. J., & Júnior,
C. T. (2011). Querying multimedia data by similarity in relational
DBMS . In Yan, L., & Ma, Z. (Eds.), Advanced database query
systems: Techniques, applications and technologies (pp. 323–359).
Hershey, PA: IGI Global. doi:10.4018/978-1-60960-475-2.ch014
Barroso, A. C., Ricciardi, R. I., & Junior, J. A. (2012). Web 2.0 and
project management: Reviewing the change path and discussing a few
cases . In Boughzala, I., & Dudezert, A. (Eds.), Knowledge
management 2.0: Organizational models and enterprise strategies (pp.
164–189). Hershey, PA: IGI Global. doi:10.4018/978-1-61350-195-
5.ch009
Baskaran, V., Naguib, R., Guergachi, A., Bali, R., & Arochen, H.
(2011). Does knowledge management really work? A case study in
the breast cancer screening domain . In Eardley, A., & Uden, L.
(Eds.), Innovative knowledge management: Concepts for
organizational creativity and collaborative design (pp. 177–189).
Hershey, PA: IGI Global. doi:10.4018/978-1-60566-701-0.ch010
Bebensee, T., Helms, R., & Spruit, M. (2012). Exploring the impact of
web 2.0 on knowledge management . In Boughzala, I., & Dudezert, A.
(Eds.), Knowledge management 2.0: Organizational models and
enterprise strategies (pp. 17–43). Hershey, PA: IGI Global.
doi:10.4018/978-1-61350-195-5.ch002
Berends, H., van der Bij, H., & Weggeman, M. (2011). Knowledge
integration . In Schwartz, D., & Te’eni, D. (Eds.), Encyclopedia of
knowledge management (2nd ed.; pp. 581–590). Hershey, PA: IGI
Global. doi:10.4018/978-1-59904-931-1.ch056
Breu, K., Ward, J., & Murray, P. (2000). Success factors in leveraging
the corporate information and knowledge resource through intranets .
In Malhotra, Y. (Ed.), Knowledge management and virtual
organizations (pp. 306–320). Hershey, PA: IGI Global.
doi:10.4018/978-1-930708-65-5.ch016
Colucci, S., Di Noia, T., Di Sciascio, E., Donini, F. M., & Mongiello,
M. (2011). Description logic-based resource retrieval . In Schwartz,
D., & Te’eni, D. (Eds.), Encyclopedia of knowledge management
(2nd ed.; pp. 185–197). Hershey, PA: IGI Global. doi:10.4018/978-1-
59904-931-1.ch018
De Maggio, M., Del Vecchio, P., Elia, G., & Grippa, F. (2011). An
ICT-based network of competence centres for developing intellectual
capital in the Mediterranean area . In Al Ajeeli, A., & Al-Bastaki, Y.
(Eds.), Handbook of research on e-services in the public sector: E-
government strategies and advancements (pp. 164–181). Hershey, PA:
IGI Global. doi:10.4018/978-1-61520-789-3.ch014
Eri, Z. D., Abdullah, R., Jabar, M. A., Murad, M. A., & Talib, A. M.
(2013). Ontology-based virtual communities model for the knowledge
management system environment: Ontology design . In Nazir Ahmad,
M., Colomb, R., & Abdullah, M. (Eds.), Ontology-based applications
for enterprise systems and knowledge management (pp. 343–360).
Hershey, PA: IGI Global. doi:10.4018/978-1-4666-1993-7.ch019
Flynn, R., & Marshall, V. (2014). The four levers for change in
knowledge management implementation . In Al-Bastaki, Y., &
Shajera, A. (Eds.), Building a competitive public sector with
knowledge management strategy (pp. 227–245). Hershey, PA: IGI
Global. doi:10.4018/978-1-4666-4434-2.ch010
Fortier, J., & Kassel, G. (2011). Organizational semantic webs . In
Schwartz, D., & Te’eni, D. (Eds.), Encyclopedia of knowledge
management (2nd ed.; pp. 1298–1307). Hershey, PA: IGI Global.
doi:10.4018/978-1-59904-931-1.ch124
Freivalds, D., & Lush, B. (2012). Thinking inside the grid: Selecting a
discovery system through the RFP process . In Popp, M., & Dallis, D.
(Eds.), Planning and implementing resource discovery tools in
academic libraries (pp. 104–121). Hershey, PA: IGI Global.
doi:10.4018/978-1-4666-1821-3.ch007
Frieß, M. R., Groh, G., Reinhardt, M., Forster, F., & Schlichter, J.
(2012). Context-aware creativity support for corporate open
innovation. International Journal of Knowledge-Based Organizations ,
2(1), 38–55. doi:10.4018/ijkbo.2012010103
Gaál, Z., Szabó, L., Obermayer-Kovács, N., Kovács, Z., & Csepregi,
A. (2011). Knowledge management profile: An innovative approach
to map knowledge management practice . In Eardley, A., & Uden, L.
(Eds.), Innovative knowledge management: Concepts for
organizational creativity and collaborative design (pp. 253–263).
Hershey, PA: IGI Global. doi:10.4018/978-1-60566-701-0.ch016
Gunjal, B., Gaitanou, P., & Yasin, S. (2012). Social networks and
knowledge management: An explorative study in library systems . In
Boughzala, I., & Dudezert, A. (Eds.), Knowledge management 2.0:
Organizational models and enterprise strategies (pp. 64–83). Hershey,
PA: IGI Global. doi:10.4018/978-1-61350-195-5.ch004
He, G., Xue, G., Yu, K., & Yao, S. (2013). Business process
modeling: Analysis and evaluation . In Lu, Z. (Ed.), Design,
performance, and analysis of innovative information retrieval (pp.
382–393). Hershey, PA: IGI Global. doi:10.4018/978-1-4666-1975-
3.ch027
Huang, A., Xiao, J., & Wang, S. (2013). A combined forecast method
integrating contextual knowledge . In Yang, G. (Ed.),
Multidisciplinary studies in knowledge and systems science (pp. 274–
290). Hershey, PA: IGI Global. doi:10.4018/978-1-4666-3998-
0.ch019
Iyer, S. R., Sharda, R., Biros, D., Lucca, J., & Shimp, U. (2011).
Organization of lessons learned knowledge: A taxonomy and
implementation . In Jennex, M. (Ed.), Global aspects and cultural
perspectives on knowledge management: Emerging dimensions (pp.
190–209). Hershey, PA: IGI Global. doi:10.4018/978-1-60960-555-
1.ch013
Jolly, R., & Wakeland, W. (2011). Using agent based simulation and
game theory analysis to study knowledge flow in organizations: The
KMscape . In Jennex, M. (Ed.), Global aspects and cultural
perspectives on knowledge management: Emerging dimensions (pp.
19–29). Hershey, PA: IGI Global. doi:10.4018/978-1-60960-555-
1.ch002
Joshi, S. (2014). Web 2.0 and its implications on globally competitive
business model . In Pańkowska, M. (Ed.), Frameworks of IT
prosumption for business development (pp. 86–101). Hershey, PA:
IGI Global. doi:10.4018/978-1-4666-4313-0.ch007
Kim, J. (2014). Big data sharing among academics . In Hu, W., &
Kaabouch, N. (Eds.), Big data management, technologies, and
applications (pp. 177–194). Hershey, PA: IGI Global.
doi:10.4018/978-1-4666-4699-5.ch008
Lee, H., Chan, K., & Tsui, E. (2013). Knowledge mining Wikipedia:
An ontological approach . In Yang, G. (Ed.), Multidisciplinary studies
in knowledge and systems science (pp. 52–62). Hershey, PA: IGI
Global. doi:10.4018/978-1-4666-3998-0.ch005
Leung, N. K. (2011). A re-distributed knowledge management
framework in help desk . In Schwartz, D., & Te’eni, D. (Eds.),
Encyclopedia of knowledge management (2nd ed.; pp. 1374–1381).
Hershey, PA: IGI Global. doi:10.4018/978-1-59904-931-1.ch131
Li, Y., Guo, H., & Wang, S. (2010). A multiple-bits watermark for
relational data . In Siau, K., & Erickson, J. (Eds.), Principle
advancements in database management technologies: New
applications and frameworks (pp. 1–22). Hershey, PA: IGI Global.
doi:10.4018/978-1-60566-904-5.ch001
Lukovic, I., Ivancevic, V., Celikovic, M., & Aleksic, S. (2014). DSLs
in action with model based approaches to information system
development . In Software design and development: Concepts,
methodologies, tools, and applications (pp. 596–626). Hershey, PA:
IGI Global. doi:10.4018/978-1-4666-4301-7.ch029
Mattmann, C. A., Hart, A., Cinquini, L., Lazio, J., Khudikyan, S., &
Jones, D. … Robnett, J. (2014). Scalable data mining, archiving, and
big data management for the next generation astronomical telescopes.
In W. Hu, & N. Kaabouch (Eds.), Big data management,
technologies, and applications (pp. 196-221). Hershey, PA: IGI
Global. doi:10.4018/978-1-4666-4699-5.ch009
Meloche, J. A., Hasan, H., Willis, D., Pfaff, C. C., & Qi, Y. (2011).
Cocreating corporate knowledge with a wiki . In Jennex, M. (Ed.),
Global aspects and cultural perspectives on knowledge management:
Emerging dimensions (pp. 126–143). Hershey, PA: IGI Global.
doi:10.4018/978-1-60960-555-1.ch009
Moffett, S., Walker, T., & McAdam, R. (2014). Best value and
performance management inspired change within UK councils: A
knowledge management perspective . In Al-Bastaki, Y., & Shajera, A.
(Eds.), Building a competitive public sector with knowledge
management strategy (pp. 199–226). Hershey, PA: IGI Global.
doi:10.4018/978-1-4666-4434-2.ch009
Nah, F. F., Hong, W., Chen, L., & Lee, H. (2010). Information search
patterns in e-commerce product comparison services. Journal of
Database Management , 21(2), 26–40. doi:10.4018/jdm.2010040102
Nah, F. F., Hong, W., Chen, L., & Lee, H. (2012). Information search
patterns in e-commerce product comparison services . In Siau, K.
(Ed.), Cross-disciplinary models and applications of database
management: Advancing approaches (pp. 131–145). Hershey, PA: IGI
Global. doi:10.4018/978-1-61350-471-0.ch006
Palte, R., Hertlein, M., Smolnik, S., & Riempp, G. (2013). The effects
of a KM strategy on KM performance in professional services firms .
In Jennex, M. (Ed.), Dynamic models for knowledge-driven
organizations (pp. 16–35). Hershey, PA: IGI Global. doi:10.4018/978-
1-4666-2485-6.ch002
Reychav, I., Stein, E. W., Weisberg, J., & Glezer, C. (2012). The role
of knowledge sharing in raising the task innovativeness of systems
analysts. International Journal of Knowledge Management , 8(2), 1–
22. doi:10.4018/jkm.2012040101
Scarso, E., Bolisani, E., & Padova, A. (2011). The complex issue of
measuring KM performance: Lessons from the practice . In Vallejo-
Alonso, B., Rodriguez-Castellanos, A., & Arregui-Ayastuy, G. (Eds.),
Identifying, measuring, and valuing knowledge-based intangible
assets: New perspectives (pp. 208–230). Hershey, PA: IGI Global.
doi:10.4018/978-1-60960-054-9.ch010
Siau, K., Long, Y., & Ling, M. (2010). Toward a unified model of
information systems development success. Journal of Database
Management , 21(1), 80–101. doi:10.4018/jdm.2010112304
Siau, K., Long, Y., & Ling, M. (2012). Toward a unified model of
information systems development success . In Siau, K. (Ed.), Cross-
disciplinary models and applications of database management:
Advancing approaches (pp. 80–102). Hershey, PA: IGI Global.
doi:10.4018/978-1-61350-471-0.ch004
Smuts, H., van der Merwe, A., & Loock, M. (2011). Key
characteristics relevant for selecting knowledge management software
tools . In Eardley, A., & Uden, L. (Eds.), Innovative knowledge
management: Concepts for organizational creativity and collaborative
design (pp. 18–39). Hershey, PA: IGI Global. doi:10.4018/978-1-
60566-701-0.ch002
Talet, A. N., Alhawari, S., Mansour, E., & Alryalat, H. (2011). The
practice of Jordanian business to attain customer knowledge
acquisition. International Journal of Knowledge Management , 7(2),
49–67. doi:10.4018/jkm.2011040103
Tanner, K. (2011). The role of emotional capital in organisational KM
. In Schwartz, D., & Te’eni, D. (Eds.), Encyclopedia of knowledge
management (2nd ed.; pp. 1396–1409). Hershey, PA: IGI Global.
doi:10.4018/978-1-59904-931-1.ch133
Tull, J. (2013). Slow knowledge: The case for savouring learning and
innovation . In Buckley, S., & Jakovljevic, M. (Eds.), Knowledge
management innovations for interdisciplinary education:
Organizational applications (pp. 274–297). Hershey, PA: IGI Global.
doi:10.4018/978-1-4666-1969-2.ch014
Van Canh, T., & Zyngier, S. (2014). Using ERG theory as a lens to
understand the sharing of academic tacit knowledge: Problems and
issues in developing countries – Perspectives from Vietnam . In
Chilton, M., & Bloodgood, J. (Eds.), Knowledge management and
competitive advantage: Issues and potential solutions (pp. 174–201).
Hershey, PA: IGI Global. doi:10.4018/978-1-4666-4679-7.ch010
Wagner, L., & Van Belle, J. (2011). Web mining for strategic
competitive intelligence: South African experiences and a practical
methodology . In Al-Shammari, M. (Ed.), Knowledge management in
emerging economies: Social, organizational and cultural
implementation (pp. 1–19). Hershey, PA: IGI Global.
doi:10.4018/978-1-61692-886-5.ch001
Weiß, S., Makolm, J., Ipsmiller, D., & Egger, N. (2011). DYONIPOS:
Proactive knowledge supply . In Jennex, M., & Smolnik, S. (Eds.),
Strategies for knowledge management success: Exploring
organizational efficacy (pp. 277–287). Hershey, PA: IGI Global.
doi:10.4018/978-1-60566-709-6.ch015
Woods, S., Poteet, S. R., Kao, A., & Quach, L. (2011). Knowledge
dissemination in portals . In Schwartz, D., & Te’eni, D. (Eds.),
Encyclopedia of knowledge management (2nd ed.; pp. 539–548).
Hershey, PA: IGI Global. doi:10.4018/978-1-59904-931-1.ch052
Wu, J., Du, H., Li, X., & Li, P. (2010). Creating and delivering a
successful knowledge management strategy . In Russ, M. (Ed.),
Knowledge management strategies for business development (pp.
261–276). Hershey, PA: IGI Global. doi:10.4018/978-1-60566-348-
7.ch012
Wu, J., Liu, N., & Xuan, Z. (2013). Simulation on knowledge transfer
processes from the perspectives of individual’s mentality and behavior
. In Yang, G. (Ed.), Multidisciplinary studies in knowledge and
systems science (pp. 233–246). Hershey, PA: IGI Global.
doi:10.4018/978-1-4666-3998-0.ch016
Xiao, L., & Pei, Y. (2013). A task context aware physical distribution
knowledge service system . In Yang, G. (Ed.), Multidisciplinary
studies in knowledge and systems science (pp. 18–33). Hershey, PA:
IGI Global. doi:10.4018/978-1-4666-3998-0.ch002
Zhang, Y., Wang, Y., Colucci, W., & Wang, Z. (2013). The paradigm
shift in organizational research . In Wang, J. (Ed.), Intelligence
methods and systems advancements for knowledge-based business
(pp. 60–74). Hershey, PA: IGI Global. doi:10.4018/978-1-4666-1873-
2.ch004
Abdalla, M., Bellare, M., Catalano, D., Kiltz, E., Kohno, T., Lange,
T., & Shi, H. (2008). Searchable encryption revisited: Consistency
properties, relation to anonymous ibe, and extensions . Journal of
Cryptology , 21(3), 350–391. doi:10.1007/s00145-007-9006-6
Arvor, D., Jonathan, M., Meirelles, M. S. P., Dubreuil, V., & Durieux,
L. (2011). Classification of MODIS EVI time series for crop mapping
in the state of MatoGrosso. Brazil International Journal of Remote
Sensing , 32(22), 7847–7871. doi:10.1080/01431161.2010.531783
Bijalwan, V., Kumar, V., Kumari, P., & Pascual, J. (2014). KNN
based machine learning approach for text and document mining.
International Journal of Database Theory and Application , 7(1), 61–
70. doi:10.14257/ijdta.2014.7.1.06
Franks, B. (2012). Taming the big data tidal wave. Wiley.
Boldyreva, A., Chenette, N., Lee, Y., & Oneill, A. (2009). Order-
preserving symmetric encryption . In Advances in Cryptology-
EUROCRYPT (pp. 224–241). Springer.
Cao, N., Wang, C., Li, M., Ren, K., & Lou, W. (2014). Privacy-
preserving multikeyword ranked search over encrypted cloud data .
IEEE Transactions on Parallel and Distributed Systems , 25(1), 222–
233. doi:10.1109/TPDS.2013.45
Cash, D., Jarecki, S., Jutla, C., Krawczyk, H., Roşu, M., & Steiner, M. (2013). Highly-scalable searchable symmetric encryption with support for Boolean queries. Proc. CRYPTO, 353-373.
Dong, J., Xiao, X., Kou, W., Qin, Y., Zhang, G., Li, L., & Moore, B.
III. (2015). Tracking the dynamics of paddy rice planting area in
1986-2010 through time series Landsat images and phenology-based
algorithms. Remote Sensing of Environment , 160, 99–113.
doi:10.1016/j.rse.2015.01.004
Eswaran, K. P., Gray, J. N., Lorie, R. A., & Traiger, I. L. (1976). The
notions of consistency and predicate locks in a database system.
Communications of the ACM , 19(11), 624–633.
doi:10.1145/360363.360369
Guo, Zhao, & Cai. (2010). A reliable method for paper currency
recognition based on LBP. In Network Infrastructure and Digital
Content, 2010 2nd IEEE International Conference on. IEEE.
Hitzler, P., & Janowicz, K. (2013). Linked Data, Big Data, and the 4th
Paradigm. Semantic Web , 4(3), 233–235.
Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W.,
& Rhee, S. Y. (2008). Big data: The future of biocuration . Nature ,
455(7209), 47–50. doi:10.1038/455047a
Hwang, Y., & Lee, P. (2007). Public key encryption with conjunctive
keyword search and its extension to a multi-user system . Pairing.
doi:10.1007/978-3-540-73489-5_2
IBM. (2013). What is big data? Bringing big data to the enterprise. Retrieved from http://www-01.ibm.com/software/data/bigdata
Kumar, V., Grama, A., Gupta, A., & Karypis, G. (1994). Introduction
to parallel computing (Vol. 110). Redwood City:
Benjamin/Cummings.
Kumar, V., Grama, A., Gupta, A., & Karypis, G. (1994). Introduction
to parallel computing: design and analysis of algorithms (Vol. 400).
Redwood City, CA: Benjamin/Cummings.
Li, H., Dai, Y., Tian, L., & Yang, H. (2009). Identity-based
authentication for cloud computing. In Cloud Computing. Berlin,
Germany: Springer-Verlag. doi:10.1007/978-3-642-10665-1_14
Li, J., Wang, Q., Wang, C., Cao, N., Ren, K., & Lou, W. (2010).
Fuzzy keyword search over encrypted data in cloud computing. Proc.
of IEEE INFOCOM’10 Mini-Conference.
doi:10.1109/INFCOM.2010.5462196
Liang, H., Cai, L. X., Huang, D., Shen, X., & Peng, D. (2012). An
smdpbased service model for interdomain resource allocation in
mobile cloud networks . IEEE Transactions on Vehicular Technology
, 61(5), 2222–2232. doi:10.1109/TVT.2012.2194748
Li, H., Liu, D., Dai, Y., Luan, T. H., & Shen, X. (2014). Enabling
efficient multikeyword ranked search over encrypted cloud data
through blind storage . IEEE Transactions on Emerging Topics in
Computing. doi:10.1109/TETC.2014.2371239
Li, R., Xu, Z., Kang, W., Yow, K. C., & Xu, C.-Z. (2014). Efficient
multikeyword ranked query over encrypted data in cloud computing .
Future Generation Computer Systems , 30, 179–190.
doi:10.1016/j.future.2013.06.029
Liu, Li, Huang, & Wen. (2012). Shapley value based impression
propagation for reputation management in web service composition.
Pro. IEEE 19th Int’l Conf’ on Web Services (ICWS’12), 58–65.
Lo, W., Yin, J., Deng, S., Li, Y., & Wu, Z. (2012). Collaborative web
service qos prediction with location-based regularization. Pro. IEEE
19th Int’l Conf’ on Web Services (ICWS’12), 464–471.
10.1109/ICWS.2012.49
Lobell, D. B., Thau, D., Seifert, C., Engle, E., & Little, B. (2015). A
scalable satellite-based crop yield mapper. Remote Sensing of
Environment , 164, 324–333. doi:10.1016/j.rse.2015.04.021
Madsen, J. B., & Stenholt, R. (2014). How wrong can you be:
Perception of static orientation errors in mixed reality. 3D User
Interfaces (3DUI) 2014 IEEE Symposium on, 83-90.
10.1109/3DUI.2014.6798847
Mell & Grance. (2011). The nist definition of cloud computing (draft).
NIST Special Publication, 800, 145.
Mi, Wang, Zhou, Lyu, & Cai. (2013). Towards fine-grained,
unsupervised, scalable performance diagnosis for production cloud
computing systems. IEEE Transactions on Parallel and Distributed
Systems .
Nguyen, B. V., Pham, D., Ngo, T. D., Le, D. D., & Duong, D. A.
(2014, December). Integrating spatial information into inverted index
for large-scale image retrieval. In Multimedia (ISM), 2014 IEEE
International Symposium on (pp. 102-105). IEEE. doi:10.1007/978-3-
319-12024-9_19
OECD. (2015). Data-driven innovation: big data for growth and well-
being . Paris, France: OECD Publishing.
Ren, K., Wang, C., & Wang, Q. (2012). Security Challenges for the
Public Cloud . IEEE Internet Computing , 16(1), 69–73.
doi:10.1109/MIC.2012.14
Rui, Y., Huang, T. S., Ortega, M., & Mehrota, S. (1998). Relevance
feedback: A power tool for interactive content-based image retrieval.
IEEE Transaction on Circuits System and Video Technnology , 8(5),
644–655. doi:10.1109/76.718510
Shen, Q., Liang, X., Shen, X., Lin, X., & Luo, H. (2014). Exploiting
geodistributed clouds for e-health monitoring system with minimum
service delay and privacy preservation . IEEE Journal of Biomedical
and Health Informatics , 18(2), 430–439.
doi:10.1109/JBHI.2013.2292829
Song, D. X., Wagner, D., & Perrig, A. (2000). Practical techniques for
searches on encrypted data. In Proceedings of S&P. IEEE.
Song, D. X., Wagner, D., & Perrig, A. (2000). Practical techniques for
searches on encrypted data. Proceedings of S&P, 44–55.
Sun, Wang, Cao, Li, Lou, Hou, & Li. (2013). Verifiable privacy-
preserving multikeyword text search in the cloud supporting
similarity-based ranking. IEEE Transactions on Parallel and
Distributed Systems. DOI: 10.1109/TPDS.2013.282
Tang, M., Jiang, Y., Liu, J., & Liu, X. F. (2012). Location-aware
collaborative filtering for qos-based service recommendation. Pro.
IEEE 19th Int’l Conf’ on Web Services (ICWS’12), 202–209.
Thereska, G., Salmon, B., Strunk, J., Wachs, M., Abd-El-Malek, M.,
Lopez, J., & Ganger, G. R. (2006). Stardust: tracking activity in a
distributed storage system. ACM SIGMETRICS Performance
Evaluation Review, 34(1), 3–14. doi:10.1145/1140277.1140280
Vermote, E. F., Tanre, D., Deuze, J. L., Herman, M., & Morcrette, J.
(1997). Second simulation of the satellite signal in the solar spectrum,
6S: An overview. IEEE Transactions on Geoscience and Remote
Sensing , 35(3), 675–686. doi:10.1109/36.581987
Williams, P., Sion, R., & Carbunar, B. (2008). Building castles out of
mud: practical access pattern privacy and correctness on untrusted
storage. ACM CCS, 139–148. doi:10.1145/1455770.1455790
Wu, H.-K., Lee, S. W.-Y., Chang, H.-Y., & Liang, J.-C. (2013,
March). Current status, opportunities and challenges of augmented
reality in education . Computers & Education , 62(C), 41–49.
doi:10.1016/j.compedu.2012.10.024
Wu, X., Zhu, X., Wu, G., & Ding, W. (2014). Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, 26(1).
Yang, L., Liu, & Yang. (2014). Secure dynamic searchable symmetric
encryption with constant document update cost. Proc.GLOBECOM.
Yi, X., Liu, F., Liu, J., & Jin, H. (2014). Building a network highway
for big data: Architecture and challenges . IEEE Network , 28(4), 5–
13. doi:10.1109/MNET.2014.6863125
Yu, J., Lu, P., Zhu, Y., Xue, G., & Li, M. (2013). Towards secure
multikeyword top-k retrieval over encrypted cloud data . IEEE
Transactions on Dependable and Secure Computing , 10(4), 239–250.
doi:10.1109/TDSC.2013.9