Experimental Evaluation of Open Source Data Mining
All content following this page was uploaded by Preeti Gulia on 12 November 2020.
Abstract: Nowadays, it is possible for every organisation to manage large datasets at minimum cost. But in order to collect fruitful information, it is mandatory to utilize the large volume of stored data. Data mining is an ongoing process of searching for patterns and collecting useful information from large datasets for future use. There is no doubt that data mining is very important in various areas like education, military, e-business, healthcare etc. The main objective of the data mining process is to supervise the data from various sources in different manners and then assemble it to collect useful information. This can be done with the help of various tools and techniques. There are a number of data mining tools available in the digital world that can help researchers evaluate data. These tools work as an interface to receive the data and to extract meaningful patterns out of large datasets. Selecting the best tool according to requirements is not an easy task. In order to find the best data mining tool for classification problems, a comparison of various tools on the basis of different parameters is necessary. In this paper, the data mining tools WEKA and Orange are analysed on the basis of implementation of parameters. The main objective of this comparison is to help researchers select the suitable tool from these two.

Keywords: Classification, Naïve Bayes, Random Forest tree, WEKA, Orange, Precision, Recall.

I. INTRODUCTION
In the present scenario, data is increasing day by day according to different parameters. It is very difficult for a person to analyse a large volume of data for perfect decision making. Hence, there is a need for data mining to extract valuable and useful data from the available data. Data mining is the process of finding the most useful knowledge from the large volume of data available in databases or data repositories. Classification is one of the most important problems in data mining; it consists of finding rules that divide the given data into different classes. These classes are predefined. There are trillions of data items of different types available in the digital world. Manually, it is a very time consuming task to process that data. So, there is a requirement for automated tools that can help the researcher convert that messy data into useful information. In the past few years, many data mining software tools have been developed to overcome this problem. Some of them are freely available as open-source tools. Embracing open-source tools that share implementations of different machine learning algorithms can be most beneficial for the complete field [10].

In this paper, a comparative study is conducted among various classification algorithms, namely Random Forest tree, K-Nearest Neighbour and Naïve Bayes, using the WEKA and Orange tools. The evaluation metrics Precision and Recall are used to analyze the performance of both tools with the help of these classification algorithms. The following classification algorithms have been used for the experimentation:

Naïve Bayes: The Naïve Bayes classifier is a group of simple probabilistic algorithms based on Bayes' theorem. The algorithm is applied with strong independence assumptions between the various features.

K-Nearest Neighbour: It is a simple classifier that stores all available cases and then classifies new cases based on a similarity measure, e.g., distance functions.

Random forest: It is almost the same as a Decision tree classifier, but it adds some randomness to the model at the time of building the trees. It can produce good results without hyper-parameter tuning. It builds different decision trees and then combines them to generate a more stable prediction.

To handle huge volumes of data, there are several tools available to the user. Moreover, it is not easy to include all the features in a single tool. That is why a number of different varieties of tools have been introduced [8][10]. In this paper, two data mining tools, i.e. WEKA and Orange, will be compared. These tools have different characteristics, functionality and capabilities. Researchers can use them according to the requirements of their research activities. These tools are continuously upgraded with new features as per the needs of users, which are changing day by day. It is very difficult to deal with the complexity of huge data.

The rest of the paper is organised as follows: Section II describes open source software, Section III describes the comparative study of the WEKA and Orange tools on the basis of parametric comparison and experimental analysis, and conclusions and future scope are discussed in Section IV.

... developed at the University of Waikato in New Zealand. WEKA is a data mining tool that supports data pre-processing. Attribute selection is a very interesting feature of WEKA; it enhances the effectiveness and accuracy of the selected data. WEKA comes with these functionalities: command-line interface (CLI), Explorer, Experimenter, Knowledge Flow and the WEKA workbench. Explorer is used to define the data source, data preparation, selection of algorithms, and visualization. The Experimenter is helpful for comparing different algorithms on the same dataset. WEKA is platform independent.
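As an illustration of the K-Nearest Neighbour classifier described in the introduction (store all available cases, then label a new case using a distance function), a minimal sketch in plain Python is given here. It is independent of WEKA and Orange and is not the implementation used by either tool; the function name `knn_predict`, the choice of Euclidean distance, the value k = 3 and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training cases."""
    # Euclidean distance from the query to every stored training case.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest cases and count their class labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two numeric features per case, two classes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["absent", "absent", "absent", "present", "present", "present"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (3.9, 4.1)))
```

The sketch keeps every training case in memory and does all the work at prediction time, which is the behaviour the description above refers to as storing all available cases.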
[Figure: Features of Orange: visualizations, classification, unsupervised learning, regression, evaluation, prototype implementations, visualization using Qt, association]
This figure depicts different features of Orange Data Mining. Visualization of data, classification, evaluation, unsupervised learning, association, visualization using Qt, and prototype implementations are some well-known features of Orange. Orange's cross-platform interface is built on Qt, a UI framework that developers can use for applications; such applications can be written in C++ or in a CSS- and JavaScript-like language. The working of the Orange tool is represented visually using different widgets, for example for reading a file, training an SVM classifier etc. Every widget is self-explanatory, i.e. a short description of it is available within the interface. To program, widgets are first placed on the canvas and then their inputs and outputs are connected. The number of widgets available in Orange is limited compared to other tools.

The activity is measured by the frequency of updates and the time of the latest update. Whenever two tools are compared, it becomes necessary to compare them both parametrically and experimentally. Only then can reliable results be achieved. So, in this manner, let us start with the parametric comparison and then analyse the experimental results.

A. Parametric Comparison: In the parametric comparison, all the characteristics of the tools are taken from previously available sources. These characteristics are listed in Table I. Some characteristics are common to both tools; for example, Graphical User Interface (GUI) functionalities and a command line are available in both tools [18], [19].
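For the experimental part of the comparison, the Precision and Recall metrics used in this paper can be computed from actual and predicted class labels. The following is a minimal sketch in plain Python rather than the code of WEKA or Orange; the function names, the class labels and the toy predictions are hypothetical. The paper does not state whether its figures are per-class values or averages over classes, so the support-weighted average shown here is only an assumption.

```python
from collections import Counter

def precision_recall(actual, predicted, positive):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def weighted_average(actual, predicted):
    """Average per-class precision and recall, weighted by class frequency."""
    support = Counter(actual)
    total = len(actual)
    avg_precision = avg_recall = 0.0
    for cls, count in support.items():
        p, r = precision_recall(actual, predicted, cls)
        avg_precision += p * count / total
        avg_recall += r * count / total
    return avg_precision, avg_recall

# Hypothetical toy labels: 4 "present" cases and 6 "absent" cases.
actual = ["present"] * 4 + ["absent"] * 6
predicted = ["present", "present", "present", "absent",
             "present", "absent", "absent", "absent", "absent", "absent"]

print(precision_recall(actual, predicted, "present"))
print(weighted_average(actual, predicted))
```

Precision is the share of cases predicted as a class that truly belong to it, while recall is the share of cases of a class that the classifier finds; a percentage such as those reported in the results corresponds to one of these ratios multiplied by 100.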
Table III: Comparative study of WEKA and Orange tool (Recall metric)

Classifier             WEKA (%)   Orange (%)
Naïve Bayes            83.7       80.6
Random Forest          81.9       73.4
k-nearest neighbour    75.2       54.7

When the dimension of the input data is high, the Naïve Bayes classifier is the most suitable algorithm. Naïve Bayes is particularly applicable in artificial intelligence. In the comparative study, precision and recall were analysed on the heart disease dataset. For Naïve Bayes, Orange achieves a precision of 82.4% and a recall of 80.6%, while WEKA achieves a precision of 83.7% and a recall of 83.7%. WEKA therefore gives better precision and recall than Orange for the Naïve Bayes classifier. The same holds for the Random Forest and k-nearest neighbour classifiers. For Random Forest, precision in Orange is 77.9% and recall is 73.4%; in WEKA, precision is 81.8% and recall is 81.9%. For the k-nearest neighbour algorithm, precision in Orange is 58% and recall is 54.7%; in WEKA, precision is 75.3% and recall is 75.2%.

IV. CONCLUSION AND FUTURE STUDY
This paper presents a study of two different open source data mining tools, WEKA and Orange, along with their features. Both tools have their own merits and demerits. This paper compares these tools by experimental analysis and by using their parameters. This comparative study is based on particular datasets and algorithms, so the results may vary with different datasets or algorithms. The comparative analysis is helpful in learning about and selecting data mining tools as per the application area. From the experimental study, it is concluded that the WEKA tool is better than Orange. It can be stated that WEKA has the most desired features for a fully functional and user-friendly platform for classification problems. So, WEKA can be recommended for classification problems in data mining. In future work, different datasets and different problems like clustering and association rule mining will be taken up and applied using these tools.

ACKNOWLEDGEMENTS
The authors are thankful to http://archive.ics.uci.edu/ml/datasets/heart+Disease for providing the dataset.

REFERENCES
[1] Jović, A., Brkić, K., & Bogunović, N. (2014). “An overview of free software tools for general data mining”. Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention, (May), 26–30. Retrieved from http://www.zemris.fer.hr/~ajovic/articles/MIPRO 2014_final.pdf
[2] Alcalá-Fdez, J., Sánchez, L., & García, S. (2009). “KEEL: a software tool to assess evolutionary algorithms for data mining problems”. Soft Computing. Retrieved from http://link.springer.com/article/10.1007/s00500-008-0323-y
[3] Collier, K., Carey, B., & Marjaniemi, C. (1999). “A Methodology for Evaluating and Selecting Data Mining Software”, 00(c), 1–11.
[4] Sonnenburg, S., Braun, M., & Ong, C. (2007). “The need for open source software in machine learning”, 8, 2443–2466. Retrieved from http://researchcommons.waikato.ac.nz/handle/10289/3928
[5] Chen, X., Ye, Y., Williams, G., & Xu, X. (2007). “A survey of open source data mining systems”. Emerging Technologies in Knowledge Discovery and Data Mining, (60603066), 3–14. Retrieved from http://link.springer.com/chapter/10.1007/978-3-540-770183_2
[6] Jović, A., Brkić, K., & Bogunović, N. (2014). “An overview of free software tools for general data mining”. Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention, (May), 26–30. Retrieved from http://www.zemris.fer.hr/~ajovic/articles/MIPRO 2014_final.pdf
[7] Kalpana Rangra, Dr. K. L. Bansal, “Comparative Study of Data Mining Tools”, presented at International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 6, 2014.
[8] Dr. Anil Sharma, Balrajpreet Kaur, “A Research Review on Comparative Analysis of Data Mining Tools, Techniques and Parameters”, ISSN No. 0976-5697, International Journal of Advanced Research in Computer Science, Volume 8, No. 7, July – August 2017.
[9] I. H. Witten, E. Frank, M. A. Hall, “Data Mining: Practical Machine Learning Tools and Techniques”, 3rd ed., Morgan Kaufmann, Elsevier: USA, 2011.
[10] Predictive Analytics [Online]. Available from: http://www.predictiveanalyticstoday.com/top-softwarefor-text-analysis-text-mining-text-analytics/
[11] Jović, A., Brkić, K., & Bogunović, N. “An overview of free software tools for general data mining”. Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention, (May), 26–30. Retrieved from http://www.zemris.fer.hr/~ajovic/articles/MIPRO 2014_final.pdf
[12] http://www.kdnuggets.com/2015/12/top-7-newfeatures-orange-3.html/2
[13] Orange Data Mining, “Orange Data Mining Library Documentation Release 3”.
[14] http://orange.biolab.si/
[15] http://Precision%20and%20recall%20-%20Wikipedia.PDF
[16] M. Hall, E. Frank, G. Holmes, B. Reutemann, I. H. Witten, “The WEKA Data Mining Software: An Update”, SIGKDD Explorations, 2009.
[17] A. Wahbeh, “A Comparison Study between Data Mining Tools over some Classification Methods”, International Journal of Artificial Intelligence, 2012.
[18] Swasti Singhal, Monika Jena, “A Study on WEKA Tool for Data Preprocessing, Classification and Clustering”, presented at International Journal of Innovative Technology and Exploring Engineering (IJITEE), Volume-2, Issue-6, 2013.
[19] http://www.ionos.com>digitalguide
[20] http://www.google.com
[21] Venkateswarlu Pynam, R Roje Spanadna, Kolli Srikanth, “An Extensive Study of Data Analysis Tools (Rapid Miner, Weka, R Tool, Knime, Orange)”, SSRG International Journal of Computer Science and Engineering (SSRG – IJCSE) – Volume 5 Issue 9 – September 2018, ISSN: 2348 – 8387, pp 4-11.
[22] http://opensourceforu.com/2017/03/top-10-open-source-datamining-tools/
[23] Nurdatillah Hasim, Norhaidah Abu Haris, “A Study of Open-Source Data Mining Tools for Forecasting”, IMCOM '15, January 08 - 10 2015, Bali, Indonesia.
[24] Witten, I. H., & Frank, E. (2005), “Data Mining: Practical Machine Learning Tools and Techniques”, (2nd ed., p. 525).
[25] Sonnenburg, S., Braun, M., & Ong, C., “The need for open source software in machine learning”, 8, 2443–2466. 2007. Retrieved from http://researchcommons.waikato.ac.nz/handle/10289/3928.
[26] 12 data mining tools and techniques [Online]. Available: https://www.invensis.net/blog/data-processing/12-datamining-tools-techniques.
[27] A. Kumar, et al., “Data mining: various issues and challenges for future”, IJETA, 2014.
[28] H. Nasereddin, “New Technique to Deal with Dynamic Data Mining in the Database”, IJRRAS, December 2012.
[29] J. Demšar and B. Zupan, “Orange: Data Mining Fruitful and Fun - A Historical Perspective”, 2012.
[30] C. Shah, A. Jivani, “Comparison of data mining classification algorithms for breast cancer prediction”, 4th ICCCNT, IEEE, 2013.
[31] P. Kakkar, A. Parashar, “Comparison of different clustering Algorithm using WEKA tool”, International Journal of Advanced Research in Technology, Engineering and Science, 2014.
[32] N. Chauhan and N. Gautam, “Parametric comparison of data mining tools”, IJATES, 2015.
[33] A. Gupta, N. Chetty, S. Shukla, “A classification method to classify High Dimensional data”, IEEE, 2015.