With the rapid development of Geographic Information Systems (GIS) and their applications, more and more geographical databases have been developed by different vendors. However, data integration and access remain a major problem for the development of GIS applications, as no interoperability exists among different spatial databases. In this paper we propose a unified approach to spatial data querying. The paper describes a framework for integrating information from repositories containing different vector data formats and repositories containing raster datasets. The presented approach converts the various vector data formats into a single unified format (File Geodatabase, "GDB"). In addition, we employ metadata to support a wide range of user queries that retrieve relevant geographic information from heterogeneous and distributed repositories, which improves both query processing and overall performance.
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY - IJDKP
This document summarizes an approach to improve source code retrieval using structural information from source code. A lexical parser is developed to extract control statements and method identifiers from Java programs. A similarity measure is proposed that calculates the ratio of fully matching statements to partially matching statements in a sequence. Experiments show the retrieval model using this measure improves retrieval performance over other models by up to 90.9% relative to the number of retrieved methods.
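One literal reading of the proposed measure, the ratio of fully matching to partially matching statements between two extracted statement sequences, can be sketched as follows (a simplified interpretation, not the paper's exact formula; a "partial" match is approximated here by sharing at least one token):

```python
def full_to_partial_ratio(query_stmts, candidate_stmts):
    """Ratio of fully matching to partially matching statements.

    Full match   : identical statement strings.
    Partial match: statements sharing at least one token (crude proxy).
    """
    full = sum(1 for s in query_stmts if s in candidate_stmts)
    partial = sum(
        1 for s in query_stmts
        if s not in candidate_stmts
        and any(set(s.split()) & set(c.split()) for c in candidate_stmts)
    )
    return full / partial if partial else float(full)

# Comparing control-statement sequences extracted from two Java methods
q = ["for (i=0; i<n; i++)", "if (a[i] > max)", "return max;"]
c = ["if (a[i] > max)", "while (i < n)", "return max;"]
print(full_to_partial_ratio(q, c))   # 2 full matches, 0 partial -> 2.0
```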
Clustering the results of a search helps the user to get an overview of the information returned. In this paper, we treat the clustering task as cataloguing the search results. By catalogue we mean a structured label list that helps the user make sense of the labels and the search results. Cluster labelling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for their query and wasting time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To label clusters effectively, a new cluster labelling method is introduced. Emphasis is given to producing comprehensible and accurate cluster labels in addition to discovering the document clusters. We also present a new metric that is employed to assess the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods, Suffix Tree Clustering and Lingo. We perform the experiments using the publicly available Ambient and ODP-239 datasets.
New proximity estimate for incremental update of non uniformly distributed cl... - IJDKP
The conventional clustering algorithms mine static databases and generate a set of patterns in the form of
clusters. Many real life databases keep growing incrementally. For such dynamic databases, the patterns
extracted from the original database become obsolete. Thus conventional clustering algorithms are not suitable for incremental databases, as they lack the capability to modify the clustering results in accordance with recent updates. In this paper, the author proposes a new incremental clustering algorithm called CFICA (Cluster Feature-based Incremental Clustering Approach for numerical data) to handle numerical data, and suggests a new proximity metric called the Inverse Proximity Estimate (IPE), which considers the proximity of a data point to a cluster representative as well as its proximity to the farthest point in its vicinity. CFICA uses the proposed proximity metric to determine the membership of a data point in a cluster.
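The description above suggests a score that mixes the distance to a cluster representative with the distance to the farthest point in the data point's neighbourhood. A hedged sketch of such a score (my reading of IPE, not the author's exact formula; the additive combination is an assumption):

```python
import numpy as np

def inverse_proximity_estimate(point, centroid, neighbours):
    """Combine distance to the cluster representative with the distance to
    the farthest point in the point's own vicinity (interpretation of IPE)."""
    d_centroid = np.linalg.norm(point - centroid)
    d_farthest = max(np.linalg.norm(point - nb) for nb in neighbours)
    return d_centroid + d_farthest   # smaller value => stronger membership

def assign(point, centroids, neighbourhoods):
    """Assign the point to the cluster that minimises the score."""
    scores = [inverse_proximity_estimate(point, c, nb)
              for c, nb in zip(centroids, neighbourhoods)]
    return int(np.argmin(scores))

p = np.array([2.0, 1.0])
cents = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
neighs = [[np.array([1.0, 1.0])], [np.array([4.0, 6.0])]]
print(assign(p, cents, neighs))   # -> 0 (the closer cluster)
```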
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... - IJDKP
Many applications of automatic document classification require learning accurately with little training
data. The semi-supervised classification technique uses labeled and unlabeled data for training. This
technique has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial.
On the other hand, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose the use of ontologies to improve the accuracy and efficiency of semi-supervised document classification.
We use support vector machines, one of the most effective algorithms studied for text classification. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an accuracy improvement of 4% on average, and up to 20%, in comparison with the traditional semi-supervised model.
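scikit-learn ships no transductive SVM, but the semi-supervised setup described here can be sketched with its self-training wrapper around an SVM as a stand-in (a substitute technique, not the paper's TSVM, and the ontology-based enhancement is omitted; documents and labels are toy data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

docs = ["stocks fell sharply", "team wins final",
        "market rally continues", "coach praises players"]
# -1 marks unlabeled documents, as expected by SelfTrainingClassifier
labels = np.array([0, 1, -1, -1])

X = TfidfVectorizer().fit_transform(docs)
clf = SelfTrainingClassifier(SVC(probability=True))
clf.fit(X, labels)                 # labeled + unlabeled training
print(clf.predict(X[2:]))          # predictions for the unlabeled documents
```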
This document discusses GCUBE indexing, which is a method for indexing and aggregating spatial/continuous values in a data warehouse. The key challenges addressed are defining and aggregating spatial/continuous values, and efficiently representing, indexing, updating and querying data that includes both categorical and continuous dimensions. The proposed GCUBE approach maps multi-dimensional data to a linear ordering using the Hilbert curve, and then constructs an index structure on the ordered data to enable efficient query processing. Empirical results show the GCUBE indexing offers significant performance advantages over alternative approaches.
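For the Hilbert-curve step described above, the classic 2-D xy2d routine maps a grid cell to its position along the curve; it is shown here only to illustrate how multi-dimensional cells get a linear order (GCUBE's actual index structure is not reproduced):

```python
def hilbert_index(order, x, y):
    """Distance of grid cell (x, y) along a Hilbert curve of the given order
    (the grid has 2**order cells per side)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the curve keeps its orientation
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Cells that are close in 2-D space tend to get nearby Hilbert indices,
# which is what makes an ordered index on the mapped values effective.
print(sorted((hilbert_index(3, x, y), (x, y)) for x in range(4) for y in range(4))[:4])
```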
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE - IJDKP
Metadata represents the information about the data to be stored in a Data Warehouse, and it is a mandatory element in building an efficient Data Warehouse. Metadata helps in data integration, lineage, data quality and populating transformed data into the data warehouse. Spatial data warehouses are based on spatial data mostly collected from Geographical Information Systems (GIS) and the transactional systems that are specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where it is essential to bring spatial information and data modeling together. In this paper, we present a holistic metadata framework that drives metadata creation for a spatial data warehouse. Theoretically, the proposed metadata framework improves the efficiency of data access in response to frequent queries on SDWs. In other words, the proposed framework decreases query response time and ensures that accurate information, including the spatial information, is fetched from the Data Warehouse.
With the development of databases, the volume of stored data increases rapidly, and much important information is hidden within these large amounts of data. If this information can be extracted from the database, it can create considerable value for the organization. The question organizations are asking is how to extract this value. The answer is data mining. There are many technologies available to data mining practitioners, including Artificial Neural Networks, Genetic Algorithms, Fuzzy Logic and Decision Trees. Many practitioners are wary of Neural Networks due to their black-box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool for data mining practitioners.
Enhancement techniques for data warehouse staging area - IJDKP
This document discusses techniques for enhancing the performance of data warehouse staging areas. It proposes two algorithms: 1) A semantics-based extraction algorithm that reduces extraction time by pruning useless data using semantic information. 2) A semantics-based transformation algorithm that similarly aims to reduce transformation time. It also explores three scheduling techniques (FIFO, minimum cost, round robin) for loading data into the data warehouse and experimentally evaluates their performance. The goal is to enhance each stage of the ETL process to maximize overall performance.
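To make the three loading-schedule options concrete, here is a toy comparison of FIFO, minimum-cost and round-robin orderings over a set of load batches (illustrative only; the queue names and batch costs are invented, not taken from the paper's experiments):

```python
from collections import deque
from itertools import chain

# (batch name, estimated load cost) -- hypothetical figures
batches = {"sales": [("s1", 5), ("s2", 2)],
           "stock": [("k1", 1), ("k2", 7)]}

def fifo(queues):
    return list(chain.from_iterable(queues.values()))

def minimum_cost(queues):
    return sorted(chain.from_iterable(queues.values()), key=lambda b: b[1])

def round_robin(queues):
    order, qs = [], [deque(q) for q in queues.values()]
    while any(qs):
        for q in qs:
            if q:
                order.append(q.popleft())
    return order

for name, policy in [("FIFO", fifo), ("min-cost", minimum_cost), ("round-robin", round_robin)]:
    print(name, [b for b, _ in policy(batches)])
```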
A statistical data fusion technique in virtual data integration environment - IJDKP
Data fusion in the virtual data integration environment starts after detecting and clustering duplicated
records from the different integrated data sources. It refers to the process of selecting or fusing attribute
values from the clustered duplicates into a single record representing the real-world object. In this paper, a statistical technique for data fusion is introduced based on probabilistic scores derived from both the data sources and the clustered duplicates.
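A minimal sketch of the selection step, assuming each candidate value carries a source-reliability score and a within-cluster support score (the field names and the multiplicative combination are assumptions, not the paper's exact formulation):

```python
def fuse_attribute(candidates):
    """candidates: list of dicts with 'value', 'source_score' (reliability of
    the originating source) and 'dup_score' (support among the clustered
    duplicates). Returns the value with the highest combined score."""
    return max(candidates, key=lambda c: c["source_score"] * c["dup_score"])["value"]

cluster = [
    {"value": "New York",      "source_score": 0.9, "dup_score": 0.6},
    {"value": "New York City", "source_score": 0.7, "dup_score": 0.8},
]
print(fuse_attribute(cluster))
```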
Recommendation system using Bloom filter in MapReduce - IJDKP
Many clients like to use the Web to discover product details in the form of online reviews, which are provided by other clients and specialists. Recommender systems provide an important response to the information overload problem, as they present users with more practical and personalized information. Collaborative filtering methods are a vital component of recommender systems, since they generate high-quality recommendations by leveraging the preferences of a community of similar users. Collaborative filtering assumes that people with similar tastes choose the same items. The conventional collaborative filtering system has drawbacks such as the sparse data problem and a lack of scalability, so a new recommender system is required to deal with sparse data and produce high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model widely used for large-scale data analysis. The recommendation mechanism described for mobile commerce is user-based collaborative filtering implemented with MapReduce, which reduces the scalability problem of the conventional CF system. One of the essential operations for data analysis is the join. However, MapReduce is not very efficient at executing joins, because it always processes all records in the datasets even when only a small fraction of them is relevant to the join. This problem can be reduced by applying the bloom join algorithm: Bloom filters are constructed and used to filter out redundant intermediate records. The proposed Bloom-filter-based algorithm reduces the number of intermediate results and improves join performance.
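The filtering idea can be sketched with a tiny Bloom filter: one dataset's join keys are registered, and records from the other dataset whose keys are definitely absent are dropped before the shuffle. This is a plain-Python illustration, not Hadoop code; the filter size, hash choice and sample data are arbitrary.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

# Build the filter from the smaller dataset's join keys ...
users = {"u1": "Alice", "u7": "Bob"}
bf = BloomFilter()
for uid in users:
    bf.add(uid)

# ... and drop ratings whose user id is definitely not present.
ratings = [("u1", 5), ("u9", 3), ("u7", 4)]
candidates = [r for r in ratings if bf.might_contain(r[0])]
print(candidates)   # false positives are possible, false negatives are not
```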
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US... - IJDKP
The document summarizes a proposed methodology that integrates associative classification and neural networks for improved classification accuracy. It begins by introducing association rule mining and associative classification. It then describes using chi-squared analysis and the Gini index for attribute selection and rule pruning to generate a reduced set of rules. These rules are used to train a backpropagation neural network classifier. The methodology is tested on datasets from a public repository, demonstrating improved accuracy over traditional associative classification alone. Future work to integrate optical neural networks is also proposed.
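For reference, the Gini index used for attribute selection and rule pruning is straightforward to compute from class counts (the generic formula, not tied to the paper's specific pruning thresholds):

```python
def gini_index(class_counts):
    """Gini impurity of a partition: 1 - sum(p_i^2)."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini_index([10, 0]))   # pure partition  -> 0.0
print(gini_index([5, 5]))    # evenly mixed    -> 0.5
```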
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... - acijjournal
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items, and it serves as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in the scientific domain is voluminous, and processing such data requires state-of-the-art computing machines. Setting up such an infrastructure is expensive, so a distributed environment such as a clustered setup is employed for tackling such scenarios. The Apache Hadoop distribution is one such cluster framework for distributed environments; it helps by distributing voluminous data across a number of nodes in the cluster. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
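A minimal sketch of one Apriori pass expressed as map and reduce functions (plain Python standing in for Hadoop; the transactions and support threshold are invented):

```python
from collections import defaultdict
from itertools import combinations

def map_phase(transaction, k):
    """Emit (candidate k-itemset, 1) pairs for one transaction."""
    return [(itemset, 1) for itemset in combinations(sorted(transaction), k)]

def reduce_phase(pairs, min_support):
    """Sum counts per itemset and keep only the frequent ones."""
    counts = defaultdict(int)
    for itemset, n in pairs:
        counts[itemset] += n
    return {i: c for i, c in counts.items() if c >= min_support}

transactions = [{"milk", "bread"}, {"milk", "butter"}, {"milk", "bread", "butter"}]
pairs = [p for t in transactions for p in map_phase(t, 2)]
print(reduce_phase(pairs, min_support=2))
# {('bread', 'milk'): 2, ('butter', 'milk'): 2}
```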
One of the most important problems in modern finance is finding efficient ways to summarize and visualize stock market data, giving individuals and institutions useful information about market behavior for investment decisions. Investment can be considered one of the fundamental pillars of a national economy, so at present many investors look for criteria to compare stocks and select the best, and they choose strategies that maximize the earnings of the investment process. The enormous amount of valuable data generated by the stock market has attracted researchers to explore this problem domain using different methodologies, and research in data mining has gained attention due to the importance of its applications and the increasing volume of generated information. Data mining tools such as association rules, rule induction methods and the Apriori algorithm are used to find associations between different stock market scripts, and much research and development has addressed the reasons for fluctuations in the Indian stock exchange. Nowadays, two factors, gold prices and US dollar prices, strongly influence the Indian stock market. Statistical correlation is used to find the correlation between gold prices, dollar prices and the BSE index, which helps the activities of stock operators, brokers, investors and jobbers, who rely on forecasting the fluctuation of index share prices, gold prices, dollar prices and customer transactions. Hence the researcher has considered these problems as a topic for research.
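The statistical correlation mentioned here is a plain Pearson correlation between the three series; a short sketch with made-up numbers (not real market data):

```python
import numpy as np

# Hypothetical daily closing values
gold   = np.array([1800, 1820, 1815, 1840, 1835], dtype=float)
dollar = np.array([74.1, 74.5, 74.3, 74.9, 74.8])
bse    = np.array([58000, 57800, 57900, 57500, 57600], dtype=float)

corr = np.corrcoef(np.vstack([gold, dollar, bse]))
print("gold vs BSE:  ", round(corr[0, 2], 3))
print("dollar vs BSE:", round(corr[1, 2], 3))
```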
1) The document discusses a review of semantic approaches for nearest neighbor search. It describes using an ontology to add a semantic layer to an information retrieval system to relate concepts using query words.
2) A technique called spatial inverted index is proposed to locate multidimensional information and handle nearest neighbor queries by finding the hospitals closest to a given address.
3) Several semantic approaches are described including using clustering measures, specificity measures, link analysis, and relation-based page ranking to improve search and interpret hidden concepts behind keywords.
IRJET - Diverse Approaches for Document Clustering in Product Development Anal... - IRJET Journal
This document discusses several approaches for clustering textual documents, including:
1. TF-IDF, word embedding, and K-means clustering are proposed to automatically classify and organize documents (a brief sketch of this pipeline follows the list).
2. Previous work on document clustering is reviewed, including partition-based techniques like K-means and K-medoids, hierarchical clustering, and approaches using semantic features, PSO optimization, and multi-view clustering.
3. Challenges of clustering large document collections at scale are discussed, along with potential solutions using frameworks like Hadoop.
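A compact sketch of the TF-IDF plus K-means pipeline from item 1 above, using scikit-learn (the documents and cluster count are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "battery life of the new phone is excellent",
    "screen resolution and battery improvements",
    "shipping was delayed by two weeks",
    "late delivery and poor packaging",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # e.g. product-feature documents vs logistics documents
```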
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion that would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describing their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes), or in finding unusual data objects (outlier detection). Of late, clustering techniques have been applied in areas that involve browsing the gathered data or categorizing the results returned by search engines in response to user queries. In this paper, we provide a comprehensive survey of document clustering.
IRJET - Efficient Data Linkage Technique using one Class Clustering Tree for Da... - IRJET Journal
This document proposes a new one-to-many data linkage technique using a One-Class Clustering Tree (OCCT) to link records from different datasets. The technique constructs a decision tree where internal nodes represent attributes from the first dataset and leaves represent attributes from the second dataset that match. It uses maximum likelihood estimation for splitting criteria and pre-pruning to reduce complexity. The method is applied to the database misuse domain to identify common and malicious users by analyzing access request contexts and accessible data. Evaluation shows the technique achieves better precision and recall than existing methods.
Iaetsd a survey on one class clustering - Iaetsd Iaetsd
This document presents a new method for performing one-to-many data linkage called the One Class Clustering Tree (OCCT). The OCCT builds a tree structure with inner nodes representing features of the first dataset and leaves representing similar features of the second dataset. It uses splitting criteria and pruning methods to perform the data linkage more accurately than existing indexing techniques. The OCCT approach induces a decision tree using a splitting criterion and performs pre-pruning to determine which branches to trim. It then compares entities to match them between the two datasets and produces a final result.
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING - IJDKP
This document discusses a hybrid data mining approach called combined mining that can generate informative patterns from complex data sources. It proposes applying three techniques: 1) Using the Lossy-counting algorithm on individual data sources to obtain frequent itemsets, 2) Generating incremental pair and cluster patterns using a multi-feature approach, 3) Combining FP-growth and Bayesian Belief Network using a multi-method approach to generate classifiers. The approach is tested on two datasets to obtain more useful knowledge and the results are compared.
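The Lossy Counting step mentioned in point 1 can be sketched as the standard Manku-Motwani algorithm over a stream of single items (a generic implementation, not the paper's multi-source variant):

```python
import math

def lossy_count(stream, epsilon=0.1):
    """Approximate frequency counts: items whose true frequency exceeds
    epsilon * N are retained, with counts underestimated by at most epsilon * N."""
    entries = {}                      # item -> (count, max undercount delta)
    width = math.ceil(1 / epsilon)    # bucket width
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in entries:
            count, delta = entries[item]
            entries[item] = (count + 1, delta)
        else:
            entries[item] = (1, bucket - 1)
        if n % width == 0:            # prune at each bucket boundary
            entries = {i: (c, d) for i, (c, d) in entries.items() if c + d > bucket}
            bucket += 1
    return entries

print(lossy_count("abacabdaaceaab", epsilon=0.2))
```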
Elimination of data redundancy before persisting into dbms using svm classifi... - nalini manogaran
Elimination of data redundancy before persisting into DBMS using SVM classification.
The Database Management System (DBMS) is one of the growing fields in the computing world. Grid computing, internet sharing, distributed computing, parallel processing and the cloud are areas that store huge amounts of data in a DBMS to maintain the structure of the data. Memory management is one of the major concerns in a DBMS because of the edit, delete, recover and commit operations performed on records. To use memory efficiently, redundant data should be eliminated accurately. In this paper, redundant data is detected by the Quick Search Bad Character (QSBC) function and reported to the DB admin so that the redundancy can be removed. The QSBC function compares the entire data set against patterns taken from an index table created for all the data persisted in the DBMS, making it easy to identify redundant (duplicate) data in the database. The experiment is carried out in SQL Server on a university student database, and performance is evaluated in terms of time and accuracy. The database contains the data of 15,000 students involved in various activities.
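The Quick Search Bad Character idea referenced above is, in its classical form, the shift table of Sunday's quick-search string matching algorithm; a generic sketch of that algorithm follows (not the authors' SQL Server integration, and the sample record string is hypothetical):

```python
def qsbc_shift_table(pattern):
    """Bad-character shifts: distance from each pattern character to the end,
    so that the character just right of the window decides the jump."""
    m = len(pattern)
    return {ch: m - i for i, ch in enumerate(pattern)}

def quick_search(text, pattern):
    """Return all start positions of pattern in text (Sunday's algorithm)."""
    n, m = len(text), len(pattern)
    shift = qsbc_shift_table(pattern)
    hits, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            hits.append(i)
        if i + m >= n:
            break
        i += shift.get(text[i + m], m + 1)
    return hits

# A record key occurring more than once signals a duplicate candidate
print(quick_search("CS101|Alice|CS101|Bob", "CS101"))   # [0, 12]
```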
Keywords—Data redundancy, Data Base Management System,
Support Vector Machine, Data Duplicate.
I. INTRODUCTION
The growing mass of information present in digital media has become a pressing problem for data administrators. Data repositories such as those used by digital libraries and e-commerce agents are usually built from data gathered from distinct sources, with disparate schemata and structures. Problems regarding low response time, availability, security and quality assurance also become more troublesome to manage as the amount of data grows larger. It is reasonable to state that the quality of the data an organization uses in its systems is related to its efficiency in offering beneficial services to its users. In this environment, the consequences of maintaining repositories with "dirty" data (i.e., with replicas, identification errors, duplicate patterns, etc.) go far beyond technical concerns such as the overall speed or performance of data administration systems.
The document summarizes research on multi-document summarization using EM clustering. It begins with an introduction to the topic and issues with existing techniques. It then proposes using Expectation-Maximization (EM) clustering to identify clusters, which improves over other methods by identifying latent semantic variables between sentences. The architecture involves preprocessing, EM clustering, mutual reinforcement ranking algorithms RARP and RDRP, summarization, and post-processing. Experimental results on DUC2007 data show EM clustering identifies more clusters and sentences than affinity propagation clustering. The technique aims to improve summarization accuracy by better capturing semantic relationships between sentences.
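The EM clustering stage can be sketched with a Gaussian mixture model fitted over reduced TF-IDF sentence vectors, since fitting a Gaussian mixture is exactly an EM procedure. This is a generic scikit-learn stand-in, not the paper's full RARP/RDRP pipeline, and the sentences are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

sentences = [
    "The flood damaged hundreds of homes.",
    "Rescue teams evacuated residents from flooded areas.",
    "The central bank raised interest rates again.",
    "Higher rates are meant to curb inflation.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
X_dense = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# GaussianMixture.fit runs expectation-maximization under the hood
gm = GaussianMixture(n_components=2, random_state=0).fit(X_dense)
print(gm.predict(X_dense))   # e.g. flood sentences vs monetary-policy sentences
```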
The International Journal of Engineering Research and Applications (IJERA) is a team of researchers, not a publication service or private publication running journals for monetary benefit; we are an association of scientists and academics who focus only on supporting authors who want to publish their work. The articles published in our journal can be accessed online, and all articles are archived for real-time access.
Our journal system primarily aims to bring out the research talent and the work done by scientists, academics, engineers, practitioners, scholars, and postgraduate students of engineering and science. The journal aims to cover scientific research in a broad sense rather than publishing in a niche area, enabling researchers from various verticals to publish their papers. It also aims to provide a platform for researchers to publish in a shorter time, enabling them to continue their work. All published articles are freely available to scientific researchers in government agencies, educators and the general public. We are taking serious efforts to promote our journal across the globe in various ways, and we are confident that our journal will act as a scientific platform for all researchers to publish their work online.
A Novel Multi-Viewpoint based Similarity Measure for Document Clustering - IJMER
The International Journal of Modern Engineering Research (IJMER) is a peer-reviewed online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
The International Journal of Modern Engineering Research (IJMER) covers all fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment, among many others.
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES - csandit
This document summarizes a research paper that proposes a system to enhance keyword search over relational databases using ontologies. The system builds structures during pre-processing like a reachability index to store connectivity information and an ontology concept graph. During querying, it maps keywords to concepts, uses the ontology to find related concepts and tuples, and generates top-k answer trees combining syntactic and semantic matches while limiting redundant results. The system is expected to perform better than existing approaches by reducing storage requirements through its approach to materializing neighborhood information in the reachability index.
1) This document discusses different techniques for cross-domain data fusion, including stage-based, feature-level, probabilistic, and multi-view learning methods.
2) It reviews literature on data fusion definitions, implementations, and techniques for handling data conflicts. Common steps in data fusion are data transformation, schema mapping, and duplicate detection.
3) The proposed system architecture performs data cleaning, then applies stage-based, feature-level, probabilistic, and multi-view learning fusion methods before analyzing dataset, hardware, and software requirements.
This document discusses techniques for detecting duplicate records from multiple web databases. It begins with an abstract describing an unsupervised approach that uses classifiers like the weighted component similarity summing classifier and support vector machine along with a Gaussian mixture model to iteratively identify duplicate records. The document then provides details on related work, including probabilistic matching models, supervised and unsupervised learning techniques, distance-based techniques, rule-based approaches, and methods for improving efficiency like blocking and the sorted neighborhood approach.
A Novel Approach for Clustering Big Data based on MapReduce - IJECEIAES
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications such as information retrieval, image processing and social network analytics, and it helps the user to understand the similarity and dissimilarity between objects; cluster analysis lets users understand complex and large data sets more clearly. Different types of clustering algorithms have been analyzed by various researchers. K-means is the most popular partitioning-based algorithm, as it provides good results thanks to accurate calculation on numerical data, but K-means gives good results for numerical data only. Big data is a combination of numerical and categorical data, and the K-prototype algorithm is used to deal with both by combining the distances calculated from the numeric and categorical parts. With the growth of data due to social networking websites, business transactions, scientific calculation etc., there is a vast collection of structured, semi-structured and unstructured data, so K-prototype needs to be optimized to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments have shown that K-prototype implemented on MapReduce gives a better performance gain on multiple nodes compared with a single node; CPU execution time and speedup are used as evaluation metrics for the comparison. An intelligent splitter is also proposed, which splits mixed big data into numerical and categorical data. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
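The core of K-prototype is its mixed distance: squared Euclidean distance on the numeric part plus a weighted mismatch count on the categorical part. A small sketch of that distance and of the splitter idea (the record layout and the gamma weight are assumptions):

```python
def kprototype_distance(record, prototype, gamma=1.0):
    """record/prototype: (numeric_values, categorical_values)."""
    num_r, cat_r = record
    num_p, cat_p = prototype
    numeric = sum((a - b) ** 2 for a, b in zip(num_r, num_p))
    categorical = sum(1 for a, b in zip(cat_r, cat_p) if a != b)
    return numeric + gamma * categorical

def split_mixed(row, numeric_cols):
    """Splitter idea: separate a mixed record into numeric and categorical parts."""
    numeric = [float(row[c]) for c in numeric_cols]
    categorical = [v for c, v in row.items() if c not in numeric_cols]
    return numeric, categorical

row = {"age": "34", "income": "52000", "city": "Pune", "segment": "retail"}
rec = split_mixed(row, numeric_cols=["age", "income"])
proto = ([30.0, 50000.0], ["Pune", "corporate"])
print(kprototype_distance(rec, proto))
```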
This document summarizes a study investigating horizontal gene transfer (HGT) of antibiotic resistance genes between gut bacteria under conditions mimicking the human gut. Through conjugation experiments, E. coli gained chloramphenicol resistance after co-culture with B. uniformis or E. fergusonii. A transformation experiment found that A. baylyi gained chloramphenicol resistance after exposure to Bacteroidetes DNA. The results suggest HGT of antibiotic resistance genes can readily occur between gut bacteria, and that chloramphenicol resistance genes may spread most easily. Further experiments are needed to confirm the transferred genes and understand factors influencing HGT frequency in the gut.
A comparative study on term weighting methods for automated telugu text categ... - IJDKP
Automatic text categorization refers to the process of automatically assigning one or more categories from a predefined set. Text categorization is challenging in Indian languages, which are rich in morphology and have a large number of word forms and large feature spaces. This paper investigates the performance of different classification approaches using different term weighting approaches, in order to decide which is most applicable to the Telugu text classification problem. We have investigated different term weighting methods for a Telugu corpus in combination with Naive Bayes (NB), Support Vector Machine (SVM) and k-Nearest Neighbor (kNN) classifiers.
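The comparison described above, different term weighting schemes crossed with NB, SVM and kNN, can be sketched generically with scikit-learn (toy English documents stand in for the Telugu corpus; the weighting and classifier choices mirror the abstract, not the paper's exact settings):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

docs = ["market prices rose", "shares fell today", "team won the match",
        "player scored twice", "stocks closed higher", "coach changed tactics"]
labels = [0, 0, 1, 1, 0, 1]

weightings = {"tf": CountVectorizer(), "tfidf": TfidfVectorizer()}
classifiers = {"NB": MultinomialNB(), "SVM": LinearSVC(),
               "kNN": KNeighborsClassifier(n_neighbors=3)}

for w_name, vec in weightings.items():
    for c_name, clf in classifiers.items():
        score = cross_val_score(make_pipeline(vec, clf), docs, labels, cv=2).mean()
        print(f"{w_name:6s} + {c_name}: {score:.2f}")
```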
An apriori based algorithm to mine association rules with inter itemset distance - IJDKP
Association rules discovered from transaction databases can be large in number, and reducing the number of association rules has been an issue in recent times. Conventionally, the number of rules can be increased or decreased by varying support and confidence. By combining an additional constraint with support, the number of frequent itemsets can be reduced, which leads to the generation of fewer rules. Average inter-itemset distance (IID), or Spread, which is the intervening separation of itemsets in the transactions, has been used as a measure of interestingness for association rules with a view to reducing their number. In this paper, a complete Apriori-based algorithm using average inter-itemset distance is designed and implemented with a view to reducing the number of frequent itemsets and association rules, and also to finding the distribution pattern of the association rules in terms of the number of transactions in which the frequent itemsets do not occur. Furthermore, the standard Apriori algorithm is also implemented and the results are compared. The theoretical concepts related to inter-itemset distance are also put forward.
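One way to read the average inter-itemset distance (Spread) is as the mean number of intervening transactions between consecutive occurrences of an itemset; a hedged sketch of that reading (not necessarily the paper's exact definition):

```python
def average_iid(transactions, itemset):
    """Mean number of intervening transactions between consecutive
    occurrences of `itemset` (one interpretation of inter-itemset distance)."""
    itemset = set(itemset)
    positions = [i for i, t in enumerate(transactions) if itemset <= set(t)]
    if len(positions) < 2:
        return 0.0
    gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)

transactions = [{"a", "b"}, {"c"}, {"a", "b", "d"}, {"b"}, {"a", "b"}]
print(average_iid(transactions, {"a", "b"}))   # occurrences at 0, 2, 4 -> 1.0
```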
Robert Maule is a versatile composer seeking work producing creative, high-quality music. He has a B.A. in Cognitive Science from UC Berkeley with a music minor. His experience includes composing original music for plays, musicals, films, and video games. He has over ten years of experience transcribing sheet music by ear for various instruments. His musical influences include classical composers, jazz musicians, video game soundtracks, and rock/pop artists. He provides three professional references.
TV requirements vary from person to person. To buy the best TV, we need to analyze our requirements and also explore the latest television technology. To fulfill entertainment requirements, Panasonic manufactures a wide range of TVs such as LED, LCD, 3D and Plasma TVs.
Jayson Bailey technical communication capstone presentation - JaysonBailey
Jayson Bailey presents his technical communication capstone portfolio. The presentation demonstrates his understanding of the program outcomes related to rhetorical knowledge, critical thinking/reading/writing, processes, and knowledge of conventions. Bailey completed coursework and projects in these areas, including an Adobe Muse help file, feasibility report, field report, formal proposal, and logo design. He summarizes how the portfolio shows his achievement of recognizing roles/conventions, considering context/language, understanding processes, and valuing research/information in technical communication.
This document describes the main functions of an operating system and the tools it provides. It explains that an operating system manages hardware resources, coordinates the execution of programs, and provides services to application programs. It also describes tools such as the disk defragmenter, disk cleanup, and scheduled tasks. Finally, it discusses the importance of reporting damage, loss, or failure of equipment and supplies.
A freelance multi-instrumentalist and music producer is seeking new transcription opportunities. He has been transcribing music by ear since age 13 and has experience transcribing for clients. His portfolio includes lead sheets and scores for recent radio hits and classic songs, showcasing his ability to transcribe various instruments, styles, and arrangements.
a. A DBMS allows users to manage and access information in a database
b. Primary and foreign keys enable relationships between tables and identify unique records
c. Database objects such as tables, queries, forms and reports facilitate the manipulation and presentation of information
Integrating Web Services With Geospatial Data Mining Disaster Management for ... - Waqas Tariq
Data Mining (DM) and Geographical Information Systems (GIS) are complementary techniques for describing, transforming, analyzing and modeling data about real-world systems. GIS and DM are naturally synergistic technologies that can be joined to produce powerful market insight from a sea of disparate data. Web services can greatly simplify the development of many kinds of data integration and knowledge management applications. This research aims to develop a spatial DM web service that integrates state-of-the-art GIS and DM functionality in an open, highly extensible, web-based architecture. Interoperability of geospatial data previously focused only on data formats and standards; the recent popularity and adoption of web services has provided new means of interoperability for geospatial information, not just for exchanging data but for analyzing it during exchange as well. An integrated, user-friendly spatial DM system available on the internet via a web service offers a wide range of potential users exciting new possibilities for geospatial analysis in support of decision making and geographical research.
SUITABILITY OF SERVICE ORIENTED ARCHITECTURE FOR SOLVING GIS PROBLEMS - ijait
Nowadays spatial data is becoming a key element for effective planning and decision making in all aspects of society. Spatial data are data related to features on the ground, and a Geographic Information System (GIS) is a system that captures, analyzes, and manages any spatially referenced data. This paper analyzes the architecture and main features of Geographic Information Systems and discusses some important problems that have emerged in research on applying GIS in organizations. It focuses on some of them, such as lack of interoperability, agility and business alignment. We explain that SOA, as a service-oriented software architecture model, can support the transformation of geographic information software from "system and function" to "service and application" and, as the best practice of the architectural concepts, can increase business alignment in enterprise applications.
This document discusses a system for integrating structured and unstructured data from heterogeneous sources and allowing querying of both data types. The system uses Open Grid Services Architecture Data Access and Integration (OGSA-DAI) services supported by the Globus Toolkit to provide a data abstraction layer. This layer generates metadata from unstructured files to allow database operations on both structured and unstructured data. The system provides a unified interface for users to search and retrieve data from different sources in various formats.
The document discusses a system for integrating structured and unstructured data from heterogeneous environments. The system uses OGSA-DAI services and the Globus Toolkit to provide an abstraction layer that allows database operations on both structured data from databases and unstructured file-based data. It generates metadata from unstructured data and configures the abstraction layer to query across the different data sources. This provides users an integrated view of both structured and unstructured data through a single interface.
This document proposes a data model for managing large point cloud data while integrating semantics. It presents a conceptual model composed of three interconnected meta-models to efficiently store and manage point cloud data, and allow the injection of semantics. A prototype is implemented using Python and PostgreSQL to combine semantic and spatial concepts for queries on indoor point cloud data captured with a terrestrial laser scanner.
This paper discusses emerging tools and technologies for big data analytics. It begins by defining big data and explaining why traditional technologies are inadequate for processing large and complex data sets. The paper then outlines the typical steps in big data analysis: data collection, partitioning, coordination, transformation, storage, processing, extraction, analysis, and visualization. It describes useful tools at each step like Hadoop, MapReduce, Hive, and Mahout. Finally, the paper discusses programming languages commonly used for big data like Python, R, Java, and Storm.
A WebML-Based Approach For The Development Of Web GIS ApplicationsMary Montoya
This document proposes an extension of the Web Modeling Language (WebML) for modeling web-based geographic information systems (Web GIS) applications. The extension includes new conceptual models and notations for modeling geospatial data and interactions specific to Web GIS, such as map visualization, navigation, and spatial querying. The key aspects of the extension are a geospatial entity-relationship model for conceptual data modeling, and new units for the WebML hypertext model including a MultiMap unit for map display, geometry entry units for spatial selection, and units for pan, zoom, and creating overlays. An example application for monitoring farm houses is modeled using the extended WebML notation. The extended WebML approach aims to provide a visual
This document discusses scheduling algorithms for processing big data using Hadoop. It provides background on big data and Hadoop, including that big data is characterized by volume, velocity, and variety. Hadoop uses MapReduce and HDFS to process and store large datasets across clusters. The default scheduling algorithm in Hadoop is FIFO, but performance can be improved using alternative scheduling algorithms. The objective is to study and analyze various scheduling algorithms that could increase performance for big data processing in Hadoop.
This document provides a survey of distributed heterogeneous big data mining adaptation in the cloud. It discusses how big data is large, heterogeneous, and distributed, making it difficult to analyze with traditional tools. The cloud helps overcome these issues by providing scalable infrastructure on demand. However, directly applying Hadoop MapReduce in the cloud is inefficient due to its assumption of homogeneous nodes. The document surveys different approaches for improving MapReduce performance in heterogeneous cloud environments through techniques like optimized task scheduling and resource allocation.
Query Optimization Techniques in Graph DatabasesIJDMS
Graph databases (GDB) have recently arisen to overcome the limits of traditional databases for storing and managing data with a graph-like structure. Today, they represent a requirement for many applications that manage graph-like data, such as social networks. Most of the techniques applied to optimize queries in graph databases have been used in traditional databases and distributed systems, or they are inspired by graph theory. However, their reuse in graph databases should take into account the main characteristics of graph databases, such as dynamic structure, highly interconnected data, and the ability to efficiently access data relationships. In this paper, we survey the query optimization techniques in graph databases. In particular, we focus on the features they have in common.
Mumbai University, T.Y.B.Sc.(I.T.), Semester VI, Principles of Geographic Information System, USIT604, Discipline Specific Elective Unit 2: Data Management and Processing System
This document provides a survey of techniques for transferring big data. It discusses using grids and parallel transfers to distribute large datasets. Grid computing allows for coordinated sharing of computational and storage resources across distributed systems. Parallel transfer techniques divide files into segments and transfer portions simultaneously from multiple servers to improve download speeds. However, these techniques require significant user involvement. The document then introduces a new NICE model for big data transfers. This store-and-forward approach transfers data to staging servers during periods of low network traffic to avoid impacting other users. It can accommodate different time zones and bandwidth variations between senders and receivers.
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"Guy K. Kloss
This document discusses the development of a grid data infrastructure called MataNui to manage large amounts of observational astronomical data and metadata from a collaboration between researchers in New Zealand and Japan. The infrastructure uses existing open-source tools like MongoDB, GridFTP, and the DataFinder GUI client to allow distributed storage and access of data while meeting requirements like handling large data volumes, metadata, and remote access. This approach provides a robust, reusable, and user-friendly system to address common data management challenges in scientific collaborations.
A Survey of Agent Based Pre-Processing and Knowledge RetrievalIOSR Journals
Abstract: Information retrieval is a major task in the present scenario, as the quantum of data is increasing at a tremendous speed. So, managing and mining knowledge for different users as per their interests is the goal of every organization, whether it is related to grid computing, business intelligence, distributed databases, or any other field. To achieve this goal of extracting quality information from large databases, software agents have proved to be a strong pillar. Over the decades, researchers have implemented the concept of multi-agents to carry out the data mining process by focusing on its various steps, among which data pre-processing is found to be the most sensitive and crucial step, as the quality of the knowledge to be retrieved is totally dependent on the quality of the raw data. Many methods and tools are available to pre-process the data in an automated fashion using intelligent (self-learning) mobile agents effectively in distributed as well as centralized databases, but various quality factors still need attention to improve the quality of the retrieved knowledge. This article provides a review of the integration of these two emerging fields, software agents and the knowledge retrieval process, with a focus on the data pre-processing step.
Keywords: Data Mining, Multi Agents, Mobile Agents, Preprocessing, Software Agents
On the-design-of-geographic-information-system-proceduresArmando Guevara
This document discusses the design of geographic information systems (GIS) and proposes an Adaptable Spatial Processing Architecture (ASPA) to improve upon existing GIS design. It identifies six concepts for continuity in GIS design: functional, data base, data structure, knowledge, human interface, and data transfer continuity. It also discusses using a generic functional model and specific derived spatial data models. The proposed ASPA architecture is based on these concepts of continuity and levels of abstraction, and aims to allow GIS to integrate diverse data sources and support multidisciplinary applications in a flexible, adaptable manner.
An elastic, efficient, intelligent, and graceful networking architecture is needed to process massive data. In addition, existing network architectures are largely incapable of handling such data: big data pushes network resources to their limits, resulting in network congestion, poor performance, and a degraded user experience. This work presents the current state-of-the-art research challenges and potential solutions in big data networking. More specifically, it presents the state of networking problems related to big data requirements, capacity, operation, and data handling, introduces the MapReduce and Hadoop paradigms along with fabric networks and software-defined networks used in today's rapidly growing digital world, and compares and contrasts them to identify relevant drawbacks and solutions.
A spatial data model for moving object databasesIJDMS
Moving Object Databases will have significant role in Geospatial Information Systems as they allow users
to model continuous movements of entities in the databases and perform spatio-temporal analysis. For
representing and querying moving objects, an algebra with a comprehensive framework of User Defined
Types together with a set of functions on those types is needed. Moreover, concerning real world
applications, moving objects move along constrained environments like transportation networks so that an
extra algebra for modeling networks is demanded, too. These algebras can be inserted in any data model if
their designs are based on available standards such as Open Geospatial Consortium that provides a
common model for existing DBMS’s. In this paper, we focus on extending a spatial data model for
constrained moving objects. Static and moving geometries in our model are based on Open Geospatial
Consortium standards. We also extend Structured Query Language for retrieving, querying, and
manipulating spatio-temporal data related to moving objects as a simple and expressive query language.
Finally as a proof-of-concept, we implement a generator to generate data for moving objects constrained
by a transportation network. Such a generator primarily aims at traffic planning applications.
A unified approach for spatial data query
International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 3, No. 6, November 2013. DOI: 10.5121/ijdkp.2013.3604
A UNIFIED APPROACH FOR SPATIAL DATA QUERY
Mohammed Abdalla 1, Hoda M. O. Mokhtar 2, Mohamed Noureldin 3
1,2,3 Faculty of Computers and Information, Cairo University, Giza, Egypt
ABSTRACT
With the rapid development in Geographic Information Systems (GISs) and their applications, more and
more geo-graphical databases have been developed by different vendors. However, data integration and
accessing is still a big problem for the development of GIS applications as no interoperability exists among
different spatial databases. In this paper we propose a unified approach for spatial data query. The paper
describes a framework for integrating information from repositories containing different vector data sets
formats and repositories containing raster datasets. The presented approach converts different vector data
formats into a single unified format (File Geo-Database “GDB”). In addition, we employ “metadata” to
support a wide range of users’ queries to retrieve relevant geographic information from heterogeneous and
distributed repositories. Such an employment enhances both query processing and performance.
KEYWORDS
Spatial data interoperability; GIS; Geo-Spatial Metadata; Spatial Data Infrastructure; Geo-database.
1. INTRODUCTION
The need to store and process large amounts of diverse data, which is often geographically
distributed, is evident in a wide range of applications. Most GISs use specific data models and
databases for this purpose. This implies that making new data available to the system requires the
data to be transferred into the system’s specific data format and structure. However, this is a very
time consuming and tedious process. Data accessing, automatically or semi-automatically, often
makes large-scale investment in technical infrastructure and/or manpower inevitable. These
obstacles are some of the motivations behind the concept of information integration. With the
increase of location based services and geographically inspired applications, the integration of
raster and vector data becomes more and more important [24]. In general, a geo-database is a
database that is in some way referenced to locations on Earth [27]. Coupled with this data is
usually data known as attribute data. Attribute data are generally defined as additional
information, which can then be tied to spatial data. GIS data can be separated into two categories:
spatially referenced data, which is represented by vector and raster forms (including imagery);
and attribute tables, which are represented in tabular format. Within the spatial referenced data
group, the GIS data can be further classified into two different types: vector and raster. Most GIS
applications mainly focus on the usage and manipulation of vector geo-databases with added
components to work with raster-based geo-databases. Basically, vector and raster models differ in
how they conceptualize, store, and represent the spatial locations of objects. The choice of vector,
raster, or combined forms for the spatial database is usually governed by the GIS system in use
and its ability to manipulate certain types of data. Nevertheless, integrated raster and vector
processing capabilities are most desirable and provide the greatest flexibility for data
manipulation and analysis. Many research papers discussed raster-vector integration as presented
in [24, 25, and 26]. In real world applications, the effective management and integration of
information across agency boundaries results in information being used more efficiently and
effectively [14]. Hence, developing interoperable platforms is a must. Several research works have
been directed towards establishing protocols and interface specifications offering support for the
discovery and retrieval of information that meets the user’s needs [3]. In [1], the authors refer to
spatial interoperability as the ability to communicate, run programs, or transfer spatial data
between diverse data without having prior knowledge about data sources characteristics.
Motivated by the importance of designing interoperable environments, spatial data infrastructures
(SDI) were developed. A spatial data infrastructure (SDI) is a data infrastructure implementing a
framework of geographic data, metadata, users, and tools that interact to use spatial data in an
efficient way [3]. Another definition of SDI was presented in [7]; in that paper the authors define an SDI as the technology, policies, standards, human resources, and related activities necessary to acquire, process, distribute, use, maintain, and preserve spatial data. In general, an SDI is required to
discover and deliver spatial data from a data repository, via a spatial service provider, to a user.
The authors in [2] defined the basic software components of an SDI as (1) a software client: to
display, query, and analyze spatial data (this could be a browser or a Desktop GIS), (2) a
catalogue service: to discover, browse, and query metadata or spatial services, spatial datasets,
and other resources, (3) a spatial data service: to allow the delivery of the data via the Internet, (4)
processing services: such as datum and projection transformations, (5) a (spatial) data repository:
to store data, e.g. a spatial database, and (6) GIS software (client or desktop) to create and
update spatial data. Beside these software components, a range of (international) technical
standards are necessary that enable the interaction between the different software components.
Another vital component of an SDI is the “metadata” which can be viewed as a summarized
document providing content, quality, type, creation, and spatial information about a data set [8].
The importance of metadata in spatial data accessing, integration and management of distributed
GIS resources was explored in several works including [18, 19, 20, 21, 22]. Metadata can be
stored in any format including text file, Extensible Markup Language (XML), or database record.
The summarized view of the metadata enhances data sharing, availability, and reduces data
duplication. Inspired by the importance of developing an interoperable framework for spatial
queries, in this paper we present an interoperable architecture for spatial queries that utilizes
metadata to enhance the query performance. The proposed approach provides usage of modern
and open data access standards. It also helps to develop efficient ways to achieve interoperability, including consolidation of the links between data interoperability extensions and geographic metadata.
The main contributions of the paper are summarized as follows:
• Developing an interoperable framework that converts the basic vector data formats (AutoCAD DWG, File Geo-database, Personal Geo-database, Shape file, Coverage, and Geography Markup Language) into a single unified "gdb" format.
• Presenting an easy-to-use tool for searching at the feature data level of spatial vector data using metadata criteria.
• Using an XML-metadata style for expressing the feature metadata, so that the representation is not restricted to a particular standard or profile.
• Improving the quality and performance of spatial queries by filtering the number of candidate results based on the features expressed in the metadata.
• Helping GIS users, who face both an opportunity and a challenge in manipulating and accessing the huge volume of data available from various GIS systems, to find, access, and use other data sets more easily, and to advertise, distribute, reuse, and combine their data with other data sets.
• Providing effective and efficient data management for processing heterogeneous data. The power of the proposed model comes from integrating sources and displaying to the human eye the proximity-based relationships between objects of interest. Proximity can't be "seen" in the data, but it can be seen on a map.
The rest of the paper is structured as follows: Section 2 presents an overview of related work. Section
3 defines the problem. Section 4 presents our proposed solution and architecture. In Section 5 we discuss the proposed system and the results achieved. In Section 6 we discuss the analysis and testing of our implemented system. Finally, Section 7 concludes and presents directions for future work.
2. RELATED WORK
The need for geo-data from distributed GIS sources is seen in many applications including
decision making, location based services, and navigation applications. Integration of different
data models, types, and structures facilitates cross-data set analysis from both spatial and non-spatial perspectives. This need motivated several prior works on spatial data interoperability. In [4], a fuzzy geospatial data modelling technique for generation of fuzzy application schemas is introduced. This approach aims to formalize the fuzzy model using description logic. The formalization facilitates the automated schema mapping required for the integration process. In [5], a service-based methodology is discussed for integrating distributed geospatial data repositories in adherence to OGC-specified open standards. The paper also describes the central role of a geographic ontology in the development of an integrated, semantically interoperable information system, and its use for service description and the subsequent discovery of services. In [6], an important initiative to achieve GIS interoperability is presented: the OpenGIS Consortium, an association looking to define a set of requirements, standards, and specifications that will support GIS interoperability. An approach for designing an integrated interoperability model based on the definition of a common template that integrates seven interoperability levels is proposed in [7]. In addition, several works targeted SDI and geographic metadata. Spatial data infrastructures (SDIs) are used to support the discovery and retrieval of distributed geographic information (GI) services by providing catalogue services through interoperability standards. A methodology proposed for improving
current GI service discovery is discussed in [8]. When searching spatial data, traditional queries
are no longer sufficient, because of the intrinsic complexity of data. As a matter of fact,
parameters such as filename and date allow users to pose queries which discriminate among data
solely on the basis of their organizational properties. In [9], a methodology for searching
geographic data is introduced which takes into account the various aspects previously discussed.
In [10], an approach to analyze geographic metadata for information search is introduced. In [11],
the shortcomings of conventional approaches to semantic data integration and of existing
metadata frameworks are discussed. On the other hand, the problem of vector and raster data
integration was also investigated. Traditional techniques for vector-to-raster conversion result in a loss of information, since the entities' shapes must follow the shape of the pixels. Thus, the information about the position of the entities in the vector data structure is lost with the conversion. In [12], an algorithm was developed to reconstruct the boundaries of the vector geographical entities using the information stored in the raster Fuzzy Geographical Entities. The authors utilize the fact that the grades of membership represent partial membership of the pixels to the entities; this information is thus valuable for reconstructing the entities' boundaries in the vector data structure, generating boundaries of the obtained vector entities that are as close as possible to their original position. In [15], a new data model named the Triangular Pyramid framework, an enhanced object-relational dynamic vector data model, is proposed for representing the complete information required by GIS-based applications. A spatial data warehouse based technique for data exchange from the spatial data warehouse is proposed in [13]. However, the data warehouse based approach has several disadvantages, given the huge volume of data that must be updated regularly. Many of the problems associated with raster-to-vector and
vector-to-raster conversion are discussed in [27]. In [23], the authors examine the common
methods for converting spatial data sets between vector and raster formats and present the results
of extensive benchmark testing of the proposed procedures. Also, in [16], many of the problems
associated with raster-to-vector and vector-to-raster conversion are discussed. Raster maps are
considered an important source of information. Extracting vector data from raster maps usually
requires significant user input to achieve accurate results. In [17], an accurate road vectorization
technique that minimizes user input is discussed; it aims to extract accurate road vector data from
raster maps.
In this work we continue to explore possible approaches for vector and raster data integration to
develop an efficient spatial data query tool.
3. PROBLEM DEFINITION
The quality of any geo-spatial information system is the main feature that allows system clients to
fine-tune their search according to their specific needs and criteria. Nevertheless, disparate data
sets exist in different geo-spatial databases with different data formats and models. Accessing and integrating this heterogeneous data remains a challenge for efficiently answering user queries. In addition, with the increase in GIS applications that are based on geographic information, developing a unified approach for spatial query is a crucial requirement. Today, several formats
exist for vector data including: AutoCAD DWG, File Geo-database, Personal Geo-database,
Shape file, Coverage, and Geography Markup Language. Such diversity in data formats generates
a problem in communication and data transfer between different data sources. In addition,
geographical information may be stored using the vector or the raster data structure. The use of
either structure depends on the methods used to collect the data and on the application that will
use the information [12]. Also, such diversity in data models generates a problem in integration
and data access operations between different data repositories.
Example 1: Consider three different data sources (DS1, DS2, and DS3) where each source stores its vector data in a different format, as shown in Figure 1.
Figure 1. Querying different data sources (DS1: GML, DS2: CAD, DS3: Shape File; unified into GDB format)
Assume a user query that requires data from all three sources. Such a query will require the user
to physically pose three different queries to access the different formats. In addition, the user’s
query will eventually return different results in different formats. Motivated by the problem
presented in Example 1, developing an interoperable platform is an optimal solution that unifies
both the issued query and the query results. To achieve such an operation, we need to convert the
different spatial data formats (AutoCAD DWG, File Geo-database, Personal Geo-database, Shape
file, Coverage, and Geography Markup Language, etc.) into a unified format. In this paper we
select the File Geo-database format to be the final unified format.
Example 2: Consider two different data repositories with different data models (R1, R2). Assume
that R1 has raster datasets and R2 has vector datasets as shown in Figure 2.
Assume a user query that requires data from both repositories regardless of data model
representation.
Figure 2. Querying different data models (R1: raster data model, R2: vector data model, accessed through a metadata query)
Using the same sources presented in Example 1, and issuing the same user query but assuming
the existence of the required unified model, we then need to obtain a single unified query in
“gdb” format. Again, motivated by the problem in Example 2, the query result still requires
access to all repositories that have data in different models to retrieve all relevant data. Such
access can be improved by understanding the query statement and filtering initial data to capture
only relevant data. Such an understanding and filtering process can be achieved using metadata.
4. PROPOSED SOLUTION
As discussed in Section 3, querying different spatial databases that store spatial data in various
formats and models has a number of problems. In this paper we propose a new approach for
spatial query processing and data accessing. The proposed architecture is composed of six main
layers as shown in Figure 3.
Figure 3. Proposed architecture
The first layer represents different data sources with different vector data formats (.shp, .mif, .cad,
.gml, .mdb) and raster data formats. The second layer contains the spatial data converter
component that is responsible for unifying the vector data formats. The third layer contains the
resulting converted data in a single unified format. The fourth layer is the metadata searcher component, which is responsible for finding and accessing the most suitable datasets regardless of the initial data models and structures. The fifth layer contains the items filtered by the metadata component.
And finally, the sixth layer contains the final user query results. The main characteristic of our
proposed model is that we build a layer in our architecture that supports “interoperability”
operations by developing a spatial data converter component that converts different spatial data
formats (AutoCAD DWG, File Geo-database, Personal Geo-database, Shape file, Coverage, and
Geography Markup Language) into a single format (File Geo-database “gdb” ).
The top reasons for choosing the File Geo-database as our final unified format are:
• The File Geo-database format is ideal for storing and managing geospatial data.
• It offers structural, performance, and data management advantages over personal geo-databases and shape files.
• Vector data can be stored in a file geo-database in a compressed, read-only format that reduces storage requirements.
• Storing rasters in geo-database format manages raster data by subdividing it into small, manageable areas called tiles, stored as large binary objects (BLOBs) in a database.
• It provides easy data migration.
• It is inclusive: one environment for feature classes, raster datasets, and tables.
• It is powerful: it enables modelling of spatial and attribute relationships.
• It is scalable: it can support organization-wide usage and workflows, and can be used with DBMSs such as Oracle, IBM DB2, and Microsoft SQL Server Express.
In addition, our model has a layer that adopts modern and open data access standards and helps to develop efficient ways to achieve interoperability, including consolidating the links between geographic data interoperability extensions and geographic metadata. This is done by developing a metadata searcher component that looks into repositories holding data in different spatial data models, structures, and formats and finds the most suitable datasets. In the following discussion we present our proposed spatial data conversion algorithm.
Algorithm 1: Spatial Data Converter
Input: A number of spatial databases with different vector data formats (GML, CAD, MIF, mdb, and shp).
Output: The same spatial databases in a unified vector data format (File Geo-database).
Begin
  Get the path of the input file;
  Create an empty output file with the same name as the input file and replace its extension with ".gdb";
  Define a new GeoProcessor object;
  If (data format is "gml" or "cad" or "mif") Then
    Define a QuickImport object;
    Set the input file as input to the QuickImport object;
    Set the created empty output file as output of the QuickImport object;
    Pass the QuickImport object to the GeoProcessor;
  ElseIf (data format is "mdb") Then
    Initialize a CopyTool;
    List all feature classes, datasets, and tables of the input file;
    Loop until no features, datasets, or tables remain
    Begin
      Set the feature, dataset, or table as input to the CopyTool;
      Create the output path of the item as the name of the created output file appended with the name of the item;
      Set the item path as output of the CopyTool;
      Pass the CopyTool object to the GeoProcessor;
    End loop
  ElseIf (data format is "shp") Then
    Define a new FeatureClass object with the path of the shape file;
    Define an Append object;
    Set the feature class created from the shape file as input to the Append object;
    Set the path of the created output gdb, appended with the feature class name, as output of the Append object;
  EndIf
  Execute the conversion using the GeoProcessor;
End
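The conversion step can also be sketched outside the GeoProcessor tooling. The following is a minimal sketch in Python using GDAL/OGR, assuming GDAL 3.6 or later (whose OpenFileGDB driver supports writing) and assuming the relevant source drivers (for example, for CAD or MDB input) are available in the GDAL build; the paths and the extension list are illustrative, not the ones used in the implemented system.

import os
from osgeo import gdal

gdal.UseExceptions()

# Vector formats handled by this sketch; actual support depends on the GDAL build.
SUPPORTED_EXTENSIONS = {".gml", ".dwg", ".mif", ".mdb", ".shp"}

def convert_to_file_gdb(input_path, output_dir):
    """Convert one vector source into a File Geodatabase with the same base name."""
    base, ext = os.path.splitext(os.path.basename(input_path))
    if ext.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError("Unsupported vector format: " + ext)
    out_gdb = os.path.join(output_dir, base + ".gdb")
    # VectorTranslate copies every layer of the source into the target geodatabase.
    gdal.VectorTranslate(out_gdb, input_path, format="OpenFileGDB")
    return out_gdb

if __name__ == "__main__":
    # Hypothetical sources in different formats (layer 1 of the architecture).
    for src in ("data/streets.shp", "data/parcels.gml", "data/utilities.mif"):
        print("created", convert_to_file_gdb(src, "unified"))

Unlike Algorithm 1, this sketch does not branch per input format: GDAL selects the appropriate driver from the input itself, which keeps the unification step to a single call.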
By applying Algorithm 1 on the different data sources with different data formats in layer 1, we
obtain in layer 3 a single unified data format and structure (File Geo-database “gdb”). The
motivation behind choosing these five formats for conversion is that they are very flexible in terms of the ability to mix all sorts of geometry types in a single dataset, are openly documented, support geo-referenced coordinate systems, and are considered stable exchange formats. A successful conversion between the AutoCAD DWG, MapInfo, Personal Geo-database, Shape file, Coverage, and Geography Markup Language formats and the File Geo-database format is achieved: given the same shape size, origin, and orientation, the same results are obtained, and the areas occupied by entities inside the original file and the converted one are always the same. Then, in layer 4, motivated by the problem presented in Example 2, we developed a "Metadata Searcher" component as shown in Figure 4. The metadata searcher component defines some properties (for example: number of features, creation date, geographic form, feature name, and reference system), and searches different data sources and repositories for items that match those properties. The metadata feature selection component proceeds as follows.
Algorithm 2: Metadata Feature Selection
Input: A number of spatial databases with the unified vector data format (File Geo-database ".gdb").
Output: A collection of features that match the metadata criteria.
Begin
  Define the metadata search properties and values;
  Define the path that contains the converted "gdb" data;
  List all the converted gdb files;
  Loop until no files remain
  Begin
    Loop for each feature and dataset in the gdb file
    Begin
      If the item matches the defined metadata properties and values Then
        Add the item to the filtered item list
      End If
    End loop
  End loop
End
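As a rough illustration of Algorithm 2, the sketch below walks a directory of converted data, parses one XML metadata document per feature class, and keeps the items whose properties match the user's criteria. The element names (featureName, referenceSystem) and the directory layout are assumptions made for illustration, not a fixed metadata profile.

import glob
import xml.etree.ElementTree as ET

def matches(meta_xml_path, criteria):
    """Return True when every requested metadata property is found in the XML document."""
    root = ET.parse(meta_xml_path).getroot()
    for prop, expected in criteria.items():
        node = root.find(prop)                 # e.g. root.find("featureName")
        if node is None or node.text is None:
            return False
        if str(expected) not in node.text:     # substring match, like "contains Streets"
            return False
    return True

def select_items(metadata_dir, criteria):
    """Collect the metadata documents (one per feature class) that satisfy all criteria."""
    return [path
            for path in glob.glob(metadata_dir + "/**/*.xml", recursive=True)
            if matches(path, criteria)]

if __name__ == "__main__":
    wanted = {"featureName": "Streets", "referenceSystem": "WGS_1984_UTM_Zone_36N"}
    print(select_items("unified", wanted))

Numeric criteria such as "Feature Count greater than 180" would need a typed comparison instead of the substring check, but the overall filtering structure stays the same.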
We apply Algorithm 2 in layer 4 of our proposed architecture on a number of spatial databases with the unified ".gdb" format and on raster datasets. Then, for every data source, the algorithm searches for the features and data elements that match the metadata search criteria, and saves the selected items in the list of filtered items that eventually contribute towards the user
query result.
Figure 4. Metadata Searcher Component flow chart (define the catalogue path and the search criteria and values, loop over all datasets in all repositories, and add each dataset that matches the criteria to the filtered list)
Algorithm 3: Raster Query
Input: Raster dataset.
Output: Raster result set.
Begin
  Create the RasterExtractionOp object;
  Declare the raster input object;
  Declare a RasterDescriptor object;
  Select the field used for extraction using the RasterDescriptor;
  Set the RasterDescriptor as an input to the RasterExtractionOp object;
  Execute the query using the RasterExtractionOp object;
  Save the results in a new Geodataset;
End
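A rough equivalent of this extraction, sketched with GDAL and NumPy instead of the RasterExtractionOp object: cells that fail the condition are replaced by a NoData value, which mimics extraction by attribute on a single band. The file name, band index, and threshold are illustrative.

import numpy as np
from osgeo import gdal

gdal.UseExceptions()

def extract_cells(raster_path, threshold):
    """Return band 1 with failing cells set to NoData, plus the count of matching cells."""
    dataset = gdal.Open(raster_path)
    band = dataset.GetRasterBand(1)
    values = band.ReadAsArray().astype(float)
    nodata = band.GetNoDataValue()
    if nodata is None:
        nodata = -9999.0
    mask = values >= threshold                 # the query condition on cell values
    return np.where(mask, values, nodata), int(mask.sum())

if __name__ == "__main__":
    result, count = extract_cells("repository/temperature.tif", 40.34)
    print("matching cells:", count)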
Next, layer 5 maintains the filtered items from the different data sources that match the specified metadata properties and is ready to receive the user query. The filtered raster datasets are queried by applying Algorithm 3, and the filtered vector datasets are queried either by spatial data query functions or by attribute data statements. Finally, layer 6 contains the actual combined user query results, composed of the raster and vector datasets obtained from the filtered items, which are then presented to the user.
5. RESULTS AND DISCUSSIONS
In this paper we present a holistic approach to unify spatial data query schemas. Various data
accessing and metadata management steps have been used and subsequently employed to
contribute towards designing a framework for efficiently answering spatial data queries. In our design we focused on the following features that the proposed system satisfies:
• Easy access to geospatial data repositories and transparent data retrieval. The File Geo-database "gdb" format was chosen in our model for the reasons discussed in Section 4.
• Developing an interoperable framework that links both semantic and syntactic interoperability, which is a promising scenario for deriving data from multiple sources with different data formats and models.
• Metadata descriptions adopted in the proposed system are not reliant upon a specific profile or standard. XML-based metadata was chosen to ensure flexibility for discovering resources and features.
Taking those constraints into consideration, we built an easy-to-use tool that unifies different vector formats into a single "gdb" format, accesses repositories with different spatial data models (raster and vector), and processes user queries using spatial metadata that helps to enhance the
query performance. Figure 5 and Figure 6 show the initial input to the system where data is
presented in different spatial formats and models. This initial format is then unified as shown in
Figure 7.
Figure 5. Vector data before applying the spatial data converter
Figure 6. Raster Data Repository
Figure 7. Unified "GDB" format
Once the data is unified, the system starts processing spatial queries. It accepts the criteria defined
by the user that constrain the required output. Those constraints along with the metadata help to
locate the candidate data in different files. For instance, some users are interested in files that have a specific number of features, a specific creation date, or a feature name that starts with or contains a specific pattern. Augmenting metadata in the system allows the user to select all the
criteria he needs, and search in the catalogue path to locate matching data sets and feature classes.
Example 3: Consider a MQ1 (Metadata Query) with the following selection criteria as shown in
Figure 8:
Data Representation equals vector digital data, Feature Name contains Streets, Feature Count
greater than 180, East bounding coordinate equals 31.219267, Data Form Value equals File Geodatabase Feature Class, Creation Date equals 20121118, and Reference System equals
WGS_1984_UTM_Zone_36N
Figure 8. Metadata Searcher Screen.
After the metadata query results are retrieved, the user can select features from the single or multiple vector feature classes and datasets retrieved. For a single feature class, the user poses a vector attribute data query based on specific values and selected criteria (VQ1); for multiple feature classes and datasets, the user poses a spatial vector data query based on a selected topological relation between features and the values used for buffering the selected features.
For VQ1 (vector attribute data query) the user can also specify the values associated with each feature, for example: "Ename ≠ 'NULL', Width > 15, Shape_length > 200 and METERS > 0".
For VQ2 (vector spatial data query) the user can also specify the topological relation and the values associated with buffering features, for example: "Select features from "Fuel_Stations" that are within a distance of "Buildings", with a buffer of 190.000000 meters applied to the features in Buildings".
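The two queries above could be evaluated against the unified geodatabase with GDAL/OGR as in the minimal sketch below (the implemented system uses its own query components): the layer and field names are taken from the example, the geodatabase path is assumed, and the 190 m buffer presumes a projected coordinate system in metres.

from osgeo import ogr

gdb = ogr.Open("unified/city.gdb")            # hypothetical converted geodatabase

# VQ1: attribute query on a single feature class.
streets = gdb.GetLayerByName("Streets")
streets.SetAttributeFilter(
    "Ename IS NOT NULL AND Width > 15 AND Shape_Length > 200 AND METERS > 0")
print("VQ1 matches:", streets.GetFeatureCount())

# VQ2: fuel stations within 190 m of any building (buffer-based spatial selection).
buildings = gdb.GetLayerByName("Buildings")
buffers = [feature.GetGeometryRef().Buffer(190.0) for feature in buildings]

stations = gdb.GetLayerByName("Fuel_Stations")
selected = [station.GetFID()
            for station in stations
            if any(station.GetGeometryRef().Intersects(zone) for zone in buffers)]
print("VQ2 matches:", len(selected))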
Using the sample dataset shown in Figure 7, the system will retrieve three feature classes that
match the user specified criteria and values as shown in Figure 8.
Figure 9. Vector Attribute table
Figure 10. Vector Query Result
Those matching classes are retrieved based on the metadata used in the query. The final results are then displayed or presented to the user as shown in Figure 9 and Figure 10. Motivated by Example 3, assume that the user is interested in finding all datasets in all repositories, regardless of the data representation model, that satisfy the criterion "East bounding coordinate equals 31.219267". The user can then AND the query shown in Example 3 with the one in Example 4 to find and access all required datasets.
Example 4: Consider a metadata query MQ2 where the user changes the query selection criteria to be:
“Data Representation equals raster digital data; Feature Name contains call, east bounding
coordinate equals 31.219267, Data Form Value equals Raster Dataset, Creation Date greater than
20121220, and Reference System equals “IMAGINE GeoTIFF ERDAS, Inc. Al”
After the metadata query results are retrieved, the user has the ability to query raster data using
the cell value. To query a grid, the user has to use a logical expression such as RQ1: [Count]
>700 AND [Temp_C]>=40.34. It is also possible to query multiple grids by cell value.
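The grid query RQ1 combines two conditions with a logical AND; on an in-memory raster attribute table this reduces to boolean masking, as in the small sketch below. The arrays are illustrative stand-ins for the [Count] and [Temp_C] columns.

import numpy as np

count = np.array([650, 720, 810, 900])          # hypothetical [Count] column
temp_c = np.array([39.0, 41.2, 40.34, 45.5])    # hypothetical [Temp_C] column

# RQ1: [Count] > 700 AND [Temp_C] >= 40.34
mask = (count > 700) & (temp_c >= 40.34)
print("rows satisfying RQ1:", np.flatnonzero(mask))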
Figure 11. Raster Attribute table
Figure 12. Raster Query result
Figure 13. Final Query results Integrated Map.
According to Example 4, with the sample dataset shown in Figure 11 and Figure 12, the system will retrieve one dataset that matches the user-specified criteria and values. This matched dataset is retrieved based on the values used in the Example 4 query. The final results are then displayed or
presented to the user as shown in Figure 13.
6. QUANTIFIABLE ANALYSIS AND TESTING
To clarify our justification for using a centralized file geo-database as a back-end geospatial data store, and for linking geographic metadata with data interoperability extensions, we proposed a platform connecting different data sources and formats to implement a unified approach for spatial data query. A framework example was also implemented and tested. In this section we investigate the design and features of the implemented system. Based on our previous discussion, in this framework we develop two main components, namely a spatial data converter and a metadata searcher. In addition, we also developed the basic operations performed by those two components, as discussed earlier. The main characteristic of those operations is that they hide implementation details from the user, providing transparent communication with the system.
Following the architecture proposed in [1], our proposed system architecture is composed of four layers: a presentation layer, a business logic layer, a data access layer, and a data management tier. The function of each layer is as defined in [1]. Flyweight and façade design patterns were used for implementing the four layers mentioned above [28][29]. The system starts with the user inputting a physical location path for the spatial dataset. Then, the spatial data, irrespective of its original format, is converted using the spatial data converter into the unified GDB format. Once
the unified data is ready, the user is requested to input the metadata search criteria and
parameters. Finally, based on user requests, the metadata searcher component retrieves the results
from the unified geo-database and returns the results to the user.
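As a rough sketch of the façade idea mentioned above, the presentation layer can talk to a single object that hides the converter and the metadata searcher behind one call; the class and method names here are illustrative, not the ones used in the implemented system.

class SpatialQueryFacade:
    """Single entry point hiding conversion, metadata filtering, and query execution."""

    def __init__(self, converter, searcher):
        self.converter = converter    # wraps the spatial data converter (Algorithm 1)
        self.searcher = searcher      # wraps the metadata searcher (Algorithm 2)

    def run(self, dataset_path, criteria, user_query):
        gdb = self.converter.to_file_gdb(dataset_path)      # unify the input format
        candidates = self.searcher.select(gdb, criteria)    # keep metadata matches only
        return [user_query(item) for item in candidates]    # evaluate the actual query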
Performance Test: The proposed framework was also tested using random sets of features of sizes 5000, 10000, and 50000. Those features were first inserted and integrated into the centralized geo-database along with their associated geographic views and attribute tables. Then, to evaluate the performance, two queries were designed and posed against the system. The first query (Q1) aims to retrieve raster datasets and performs a "raster query by attribute" against the result set. The other (Q2) aims to retrieve vector feature classes and then performs a "vector attribute query" against the result set.
For both queries, we measured the average run time and used it as a metric for evaluating the
performance. Tables 1 and 2 present the results obtained from both queries.
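Before the tables, a minimal sketch of how such an average run time can be measured, assuming each test query is available as a Python callable; the repeat count is illustrative.

import time

def average_runtime_ms(query_fn, repeats=10):
    """Run the query several times and return the mean wall-clock time in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        query_fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)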
Table 1. Performance test for retrieving features (Q1)

Number of features | Number of features retrieved | Time with implemented system (ms) | Time without implemented system (ms)
5000  | 178 | 195 | 230
10000 | 231 | 210 | 360
50000 | 343 | 350 | 500
Table 2. Performance test for retrieving features (Q2)

Number of features | Number of features retrieved | Time with implemented system (ms) | Time without implemented system (ms)
5000  | 103 | 60  | 105
10000 | 189 | 198 | 230
50000 | 243 | 220 | 380
In both experiments, retrieval through the implemented system was consistently faster than retrieval without it (for example, 350 ms versus 500 ms for 50000 features in Q1, and 220 ms versus 380 ms in Q2), showing that the proposed solution retrieves and manipulates spatial data efficiently.
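As an illustration of how such average run times could be collected, the sketch below times a query callable over repeated executions with Python's time.perf_counter; the repetition count and the run_q1 placeholder are assumptions rather than details of the reported experiment.

```python
import time


def average_runtime_ms(run_query, repetitions: int = 10) -> float:
    """Average wall-clock time of run_query() in milliseconds."""
    total = 0.0
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()                      # e.g. execute Q1 or Q2
        total += time.perf_counter() - start
    return total / repetitions * 1000.0


def run_q1():
    """Placeholder for the 'raster query by attribute' test query."""
    pass


print(f"Q1 average: {average_runtime_ms(run_q1):.1f} ms")
```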
7. CONCLUSIONS AND FUTURE WORK
An efficient planning system requires accessible, affordable, adequate, accurate, and timely spatial and non-spatial information. Integrating and sharing this information in turn requires an efficient route that gives access to those who need it, and such a route can be provided through a well-structured, interoperable approach to information management. This paper introduced the issues of data interoperability, the advantages of geographic metadata, and its role as a mechanism for data interoperability, and proposed an interoperable framework for spatial data query. The spatial data converter component enables the framework to accept vector data in various formats and unify it into a single GDB format that can be integrated with different raster datasets. The GDB format gives users the capability to easily and dynamically publish and exchange data in an open, non-proprietary, industry-standard format, thus maximizing the re-use of geospatial data, eliminating time-consuming data conversion, and reducing the associated costs. The resulting files are then input to a metadata selection component that uses the spatial features' metadata to answer user queries more efficiently. For future work, we plan to extend our work to also consider raster data, in order to present a complete interoperable platform for spatial data, and to test the system on a wider variety of queries. Based on the search results, we also still need to develop a "ranking component", built on data mining techniques and able to integrate with the proposed model, that sorts results by the importance of the information to the user. Finally, the current approach does not yet address semantic interoperability; investigating this issue is another promising direction for future work.
REFERENCES
[1] Amirian, Pouria, and Ali A. Alesheikh (2008). "Implementation of a Geospatial Web Service Using Web Services Technologies and Native XML Databases". Middle-East Journal of Scientific Research, Vol. 3, No. 1, pp. 36-48.
[2] D. D. Nebert, Developing Spatial Data Infrastructures: The SDI Cookbook. GSDI: Global Spatial Data Infrastructure Association, 2004.
[3] P. B. Shah and N. J. Thakkar, Geo-spatial metadata services - ISRO's initiative, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XXXVII, Part B4, Beijing, 2008.
[4] Mukherjee, Soumya K. Ghosh, Formalizing Fuzzy Spatial Data Model for Integrating Heterogeneous Spatial Data, 2nd International ACM Conference on Computing for Geospatial Research & Applications (COM.GEO 2011), ACM, 25:1-25:6, Washington, DC, 23-25 May 2011.
[5] Manoj Paul, S. K. Ghosh, A Framework for Semantic Interoperability for Distributed Geospatial Repositories, Journal of Computer and Informatics, Special Issue on Semantic e-Science, Vol. 26, pp. 73-92, 2008.
[6] Open Geospatial Consortium. http://www.opengis.org, accessed May 2013.
[7] Manso, M. A.; Wachowicz, M.; Bernabé, M. A.: "Towards an Integrated Model of Interoperability for Spatial Data Infrastructures". Transactions in GIS, Vol. 13, No. 1, 2009, pp. 43-67.
[8] Luca Paolino, Monica Sebillo, Genoveffa Tortora, Giuliana Vitiello: Searching geographic resources through metadata-based queries for expert user communities. GIR 2007, pp. 83-88.
[9] Michael Lutz, Ontology-based service discovery in spatial data infrastructures. Geographic Information Retrieval conference, 2005.
[10] R. Albertoni, A. Bertone, M. De Martino, Visualization and semantic analysis of geographic metadata. Proceedings of the 2005 Workshop on Geographic Information Retrieval, pp. 9-16.
[11] Nadine Schuurman and Agnieszka Leszczynski, Ontology-Based Metadata, Transactions in GIS, 10(5): 709-726, 2006.
[12] Caetano, M. and Painho, M. (eds). Conversion between the vector and raster data structures using Fuzzy Geographical Entities. Proceedings of the 7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences, 5-7 July 2006, Lisboa, Instituto Geográfico Português.
[13] Roth, M. T., Wrapper Architecture for Legacy Data Sources. Proceedings of the 23rd International Conference on Very Large Databases (VLDB), 1997.
[14] Abbas Rajabifard, Data Integration and Interoperability of Systems and Data. 2nd Preparatory Meeting of the Proposed UN Committee on Global Geographic Information Management, 2010.
[15] Barkha Bahl, Navin Rajpal and Vandana Sharma, Triangular Pyramid Framework for Enhanced Object Relational Dynamic Data Model for GIS, IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 1, January 2011.
[16] Russell G. Congalton, Exploring and Evaluating the Consequences of Vector-to-Raster and Raster-to-Vector Conversion, Photogrammetric Engineering & Remote Sensing, Vol. 63, No. 4, April 1997, pp. 425-434.
[17] Chiang, Y. and Knoblock, C. A., Extracting Road Vector Data from Raster Maps. In Graphics Recognition: Achievements, Challenges, and Evolution, Selected Papers of the 8th International Workshop on Graphics Recognition (GREC), Lecture Notes in Computer Science, Vol. 6020, pp. 93-105. Springer, New York, 2009.
[18] In-Hak Joo; Tae-Hyun Hwang; Kyung-Ho Choi, "Generation of video metadata supporting video-GIS integration," 2004 International Conference on Image Processing (ICIP '04), Vol. 3, pp. 1695-1698, 24-27 Oct. 2004.
[19] Tae-Hyun Hwang; Kyoung-Ho Choi; In-Hak Joo; Jong-Hyun Lee, "MPEG-7 metadata for video-based GIS applications," Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS '03), Vol. 6, pp. 3641-3643, 21-25 July 2003.
[20] Goncalves Soares Elias, V.; Salgado, A. C., "A metadata-based approach to define a standard to visual queries in GIS," Proceedings of the 11th International Workshop on Database and Expert Systems Applications, pp. 693-697, 2000.
[21] Yingwei Luo; Xiaolin Wang; Zhuoqun Xu, "Extension of spatial metadata for navigating distributed spatial data," Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS '03), Vol. 6, pp. 3721-3723, 21-25 July 2003.
[22] Spery, L.; Claramunt, C.; Libourel, T., "A lineage metadata model for the temporal management of a cadastre application," Proceedings of the 10th International Workshop on Database and Expert Systems Applications, pp. 466-474, 1999.
[23] Joseph M. Piwowar, Ellsworth F. LeDrew, Douglas J. Dudycha, Integration of spatial data in vector and raster formats in a geographic information system environment. International Journal of Geographical Information Science, Vol. 4, 1990, pp. 429-444.
[24] Feng Lin; Chaozhen Guo, "Raster-vector integration based on SVG on mobile GIS platform," 2011 6th International Conference on Pervasive Computing and Applications (ICPCA), pp. 378-383, 26-28 Oct. 2011.
[25] Changxiong Wang; Shanjun Mao; Mei Li; Huiyi Bao; Ying Zhang, "Integration of vector and raster-based coal mines surface and underground contrast map," Second International Workshop on Earth Observation and Remote Sensing Applications (EORSA), pp. 309-312, 8-11 June 2012.
[26] Xuefeng Cao; Gang Wan; Feng Li, "Notice of Violation of IEEE Publication Principles: 3D Vector-Raster Data Integration Model Based on View Dependent Quadtree and GPU Friendly Rendering Algorithm," International Joint Conference on Computational Sciences and Optimization (CSO 2009), Vol. 2, pp. 244-247, 24-26 April 2009.
[27] Russell G. Congalton, Exploring and Evaluating the Consequences of Vector-to-Raster and Raster-to-Vector Conversion, Photogrammetric Engineering & Remote Sensing, Vol. 63, No. 4, April 1997, pp. 425-434.
[28] Horner, M., 2006. Pro .NET 2.0 Code and Design Standards in C#. California, USA, Apress Publishing.
[29] O'Docherty, M., 2005. Object-Oriented Analysis and Design: Understanding System Development with UML 2.0. New Jersey, USA, John Wiley and Sons, Inc.
AUTHORS
Mohammed Abdalla is a software engineer at Hewlett-Packard Enterprise Services, Inc. He has more than five years of experience in software development and analysis, including the development of ERP, e-commerce, mobile payment, and e-payment applications. In June 2008 he earned a Bachelor of Computer Science degree from Cairo University with the grade "very good" and an "excellent" grade for his graduation project.
Dr. Hoda M. O. Mokhtar is currently an associate professor in the Information Systems Department, Faculty of Computers and Information, Cairo University. She received her PhD in Computer Science in 2005 from the University of California, Santa Barbara, and her MSc and BSc in 2000 and 1997, respectively, from the Computer Engineering Department, Faculty of Engineering, Cairo University. Her research interests include database systems, moving object databases, data warehousing, and data mining.