
Article in International Journal of Computer Applications, October 2015. DOI: 10.5120/ijca2015906476


International Journal of Computer Applications (0975 – 8887)
Volume 127 – No.9, October 2015

Search Engine using Apache Lucene

Mamatha Balipa
Department of MCA
NMAM.I.T, Nitte

Balasubramani R., PhD
Professor and Head
Department of IS and E
NMAM.I.T, Nitte

ABSTRACT
The World-Wide Web is a huge network of billions of workstations, and it contains billions of web pages with information on a wide variety of topics. There are a lot of topics discussed by people, and opinions and suggestions shared on various social networking sites, that users are interested in. Low precision and low recall still exist in current search engines. So a search engine that is effective and that applies Web mining technology has become very important. This paper discusses the various technologies used to implement a search engine and its techniques, such as indexing and searching, on the World Wide Web. The authors describe a method to create a search engine using JSoup and the Apache Lucene API.

General Terms
Search engine, web mining, text mining.

Keywords
Web, crawler, searching, indexing, JSoup, Apache Lucene.

1. INTRODUCTION
The World Wide Web is ever expanding. There are a lot of forums and social media sites like Twitter, e-commerce sites like Amazon, and health related discussion forums where opinions and knowledge are shared between users.

Users have increasingly searched for and used the information available on online health forums. According to a 2013 Pew Research Center survey, 72% of U.S. adult internet users have searched for health information online (V. V. Vydiswaran, Q. Mei, D. A. Hanauer, and K. Zheng, 2014).

The number of social networking sites and healthcare forums has increased, and patients voluntarily share their health information, the treatments they have undergone, the solutions they have found for their problems, and the drugs they have used.

Medications.com, SteadyHealth.com, MedHelp.org and HealthBoards.com are examples of such online forums (H. Sampathkumar, X.-w. Chen, and B. Luo, 2014). This information is also used by researchers who help health authorities in post-marketing drug surveillance and in predicting the outbreak of diseases.

Existing search engines like Google and Yahoo do a good job of searching and indexing information available on the web. But finding information is still difficult. So, to search and mine the information available on the web, the development of effective search engines becomes important.

Most search engines search based on keywords. Information on the web is usually unstructured. So approaches like data mining, web mining, text mining and Natural Language processing are increasingly being used in search engines, and Web mining has become a topic of research.

2. PREVIOUS WORK ON SEARCH ENGINES
2.1 Google Search Engine
The Google search engine instantiates many crawlers that visit links that appear in web pages and download the pages represented by those links. Each page is given a unique docId. The pages are compressed and stored. An indexer parses the contents of the pages and converts them into word occurrences called hits, sorts them, distributes them and creates forward indexes. It creates an anchor file where information about the links is stored (S. Brin and L. Page, 2012).

2.2 Page Rank
The Google search engine analyses the links in the web pages available on the internet to rank the pages, which helps in providing more relevant results. This ranking is called PageRank (S. Brin and L. Page, 2012).

2.3 Anchor Text
Anchors provide information about the web pages they point to. Anchors can be used to retrieve pages that have not been crawled. These pages may contain some images or other links (S. Brin and L. Page, 2012).

2.4 Google’s Data Structures
The data structures used by the Google search engine are designed in such a way that a large number of documents can be quickly and efficiently indexed and searched. But a disk seek still takes 10 ms (S. Brin and L. Page, 2012).

2.5 Big Files
Google’s BigFiles are virtual files distributed across many file systems. BigFiles are addressed using 64 bit integers (S. Brin and L. Page, 2012).

2.6 Repository
Google’s repository consists of downloaded and compressed web pages (S. Brin and L. Page, 2012).

2.7 Document Index
Each downloaded document is given a unique docId and is stored with details of the document, like its status, a pointer to the file in the repository and other details (S. Brin and L. Page, 2012).

2.8 Some of the other search engines
Alcides Calsavara and Schmidt describe a semantic search engine that can be used by users to get information about sellers and service providers and their products, which can be hierarchically organized.


Cohen, Mamou et al. developed a semantic search engine that uses XML (S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv, 2003). The search engine retrieves parts of the document related to the user’s query. The information retrieved is ranked using extended information retrieval methods and is presented in the order of ranking.

Bhagwat and Polyzotis presented a file system search engine, Eureka (D. Bhagwat and N. Polyzotis, 2005). It is a semantic based search engine. It creates links between files and a file ranking system to order the files according to their importance.

Wang et al. presented a semantic search technique to get information from regular tables (H.-L. Wang, S.-H. Wu, I. Wang, C.-L. Sung, W.-L. Hsu, and W.-K. Shih, 2000). The technique recognizes the relationship between table cells and stores the data in a database. It uses a query language to get information from the database.

Kandogan et al. presented a semantic search engine, Avatar, that uses text search along with ontology annotations.

Mädche et al. developed an ontology search engine that uses a unified method for ontology searching (A. Mädche, B. Motik, L. Stojanovic, R. Studer, and R. Volz, 2003). They use an ontology registry to store the ontology metadata and a server to store the ontologies.

George Gardarin et al. presented SEWISE, which maps text data present in web pages and creates an XML structure. It also makes the hidden semantics in the text available to programs.

2.9 Keyword and sentence extraction
Machine learning methods use keywords that are extracted to create a model, which is then used to retrieve keywords from new pages.

Kea is an algorithm to extract keywords using a Naïve Bayes classifier. It computes the normalized term frequency (TF), the inverted document frequency (IDF), and the position of the word in the document. The extracted words from every page are used to create a Naïve Bayes classifier. The resultant classifier is used to analyse whether a given phrase is a keyword or not (H. Kian and M. Zahedi, 2011).

An ontology can contain vocabulary for representing terms belonging to a subject area. It also contains logical statements that describe the relationships between the terms.

A content repository, thesaurus and ontology repository are used to extract the content of the documents gathered by the crawler. The pages are then indexed according to their context (P. Gupta and D. A. Sharma, 2010).

2.10 Meta search engine
A meta search engine combines the results returned by multiple search engines and presents them, if required, in a sorted order (P. Gupta and D. A. Sharma, 2010).

3. SEARCH ENGINE DEVELOPMENT
3.1 Web Document Parser using JSoup API
JSoup is an HTML parser. The various elements of an HTML page can be searched and their contents retrieved using JSoup. A crawler for crawling the web and downloading web pages has been developed using the JSoup parser.

The search engine developed parses the HTML, follows all the links and downloads each page pointed to by a link. The search engine is developed using Java, and threads are created to visit each link in different pages and download them. Memory is efficiently managed by using thread pools (P. Houston, 2013).

3.2 Indexing and searching using Apache Lucene
3.2.1 Apache Lucene
Apache Lucene provides an API to index documents as well as to query the index and fetch documents that match the query. Apache Lucene is an open source library developed by the Apache Software Foundation. It is usually used to implement text based search engines. It has an API to efficiently index text documents and search for text in them.

It can be used to search the web and databases, and has been used by sites like Wikipedia and LinkedIn. It is Java based but can also be used from programming languages like Perl, Python and .NET (A. Sonawane, 2009).

Lucene has efficient and precise search algorithms. It retrieves the documents queried based on their ranking. It provides different types of queries like PhraseQuery, WildcardQuery, RangeQuery, FuzzyQuery, BooleanQuery and more.

In the search engine developed, an indexer has been built to index the downloaded web pages using the Apache Lucene API (A. Sonawane, 2009).

Fig 1: Building application using Lucene

3.2.2 The indexing process
Indexing is the method of storing text data in index files, in a format that helps in fast and efficient text searching. Lucene stores the text data in a data structure called an inverted index, which can be stored in the file system or in memory. The text data is analysed prior to storing it in the index files (A. Sonawane, 2009).

3.2.3 Environment Used
Windows 7 was the operating system used for the work. A machine with 4 GB RAM and an Intel Core Duo processor with a speed of 2.93 GHz was used for running the search engine. With a download speed of 1 Mbps, the search engine took 3 hours to download 5000 pages. The Java programming language was used to implement the search engine. The out of memory error on running multiple threads was handled by creating thread pools.

3.2.4 Code snippet using JSoup API to visit all links in a Page

Fig 2: Code using JSoup API to crawl links in a Page
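The snippet in Fig 2 is reproduced only as an image. A minimal sketch of a JSoup-based link crawler along the lines described in section 3.1 might look like the following (the depth limit, visited set, and the `save` stub are illustrative additions, not taken from the paper):

```java
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupCrawler {
    // Remember visited URLs so the crawler does not loop.
    private final Set<String> visited = new HashSet<>();

    // Fetch a page, save it, then follow every <a href> it contains.
    public void crawl(String url, int depth) {
        if (depth <= 0 || !visited.add(url)) return;
        try {
            Document doc = Jsoup.connect(url).get();
            // doc.html() is the downloaded page; persist it for indexing.
            save(url, doc.html());
            for (Element link : doc.select("a[href]")) {
                // absUrl resolves relative links against the page URL.
                crawl(link.absUrl("href"), depth - 1);
            }
        } catch (Exception e) {
            // Skip pages that fail to download or parse.
        }
    }

    private void save(String url, String html) {
        // Write the HTML to disk for the indexer; omitted here.
    }
}
```

In a real crawler the recursive calls would be submitted to the thread pool described in 3.2.5 rather than run on one thread.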

3.2.5 Code Snippet for creating Thread Pool

Fig 3: Code to create threadpool
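The snippet in Fig 3 is reproduced only as an image. A minimal sketch of the thread-pool pattern it describes, using java.util.concurrent, might look like this (the pool size and the download stub are illustrative):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerPool {
    // Counter standing in for the real page-download work.
    static final AtomicInteger downloaded = new AtomicInteger();

    // Placeholder for the JSoup download step described in 3.1.
    static void download(String url) {
        downloaded.incrementAndGet();
    }

    public static int crawlAll(List<String> urls) throws InterruptedException {
        downloaded.set(0);
        // A fixed-size pool bounds memory use: at most 8 pages are being
        // fetched at any one time, which avoids the out-of-memory error
        // mentioned in 3.2.3.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (String url : urls) {
            pool.submit(() -> download(url));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return downloaded.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int n = crawlAll(List.of("u1", "u2", "u3"));
        System.out.println(n); // prints 3
    }
}
```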


3.2.6 Code Snippet for Indexing using Lucene API
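The paper’s indexing and searching snippets appear only as figures. A minimal, self-contained sketch of both steps with the Lucene API might look like this (the field names, paths and query string are illustrative; class names are from Lucene’s core and queryparser modules):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one downloaded page: the path is stored verbatim,
        // the text content is analysed into the inverted index.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("path", "pages/page1.html", Field.Store.YES));
            doc.add(new TextField("contents", "text extracted from the page", Field.Store.NO));
            writer.addDocument(doc);
        }

        // Search the "contents" field and print matching paths with scores.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("contents", analyzer).parse("extracted");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path") + " score=" + hit.score);
            }
        }
    }
}
```

Lucene computes the score of each hit from its ranking function, which is what section 3.2.7 refers to.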

Fig 4: Code for Indexing using Lucene

3.2.7 Code Snippet for calculating ranking/scores

Fig 5: Code for calculating rank/scores

3.2.8 Code Snippet for Searching

Fig 6: Code for searching text

4. CONCLUSION
In this paper, the various techniques used in search engines and the work already done in the area of search engines are discussed. The paper describes the JSoup parser and its use in developing a search engine. It also discusses the Apache Lucene API, whose indexing and searching facilities can be used to index the downloaded pages and perform text based search in the indexed documents, which is vital to the development of a search engine. In their future work the authors propose to use Natural Language processing to mine information available in the web pages and to optimize the search engine.

5. REFERENCES
[1] V. V. Vydiswaran, Q. Mei, D. A. Hanauer, and K. Zheng, 2014, “Mining consumer health vocabulary from community-generated text,” in AMIA Annual Symposium Proceedings, vol. 2014, p. 1150, American Medical Informatics Association.

[2] H. Sampathkumar, X.-w. Chen, and B. Luo, 2014, “Mining adverse drug reactions from online healthcare forums using hidden markov model,” BMC Medical Informatics and Decision Making, vol. 14, no. 1, p. 91.

[3] S. Brin and L. Page, 2012, “Reprint of: The anatomy of a large-scale hypertextual web search engine,” Computer Networks, vol. 56, no. 18, pp. 3825–3833.

[4] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv, 2003, “XSearch: A semantic search engine for XML,” in Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, pp. 45–56, VLDB Endowment.

[5] D. Bhagwat and N. Polyzotis, 2005, “Searching a file system using inferred semantic links,” in Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia, pp. 85–87, ACM.


[6] H.-L. Wang, S.-H. Wu, I. Wang, C.-L. Sung, W.-L. Hsu, and W.-K. Shih, 2000, “Semantic search on internet tabular information extraction for answering queries,” in Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 243–249, ACM.

[7] A. Mädche, B. Motik, L. Stojanovic, R. Studer, and R. Volz, 2003, “An infrastructure for searching, reusing and evolving distributed ontologies,” in Proceedings of the 12th International Conference on World Wide Web, pp. 439–448, ACM.

[8] H. Kian and M. Zahedi, 2011, “An efficient approach for keyword selection; improving accessibility of web.”

[9] P. Gupta and D. A. Sharma, 2010, “Context based indexing in search engines using ontology,” International Journal of Computer Applications (0975–8887), vol. 1, no. 14.

[10] P. Houston, 2013, Instant jsoup How-to. Packt Publishing Ltd.

[11] A. Sonawane, 2009, “Using apache lucene to search text,” online at http://www.ibm.com/developerworks/opensource/library/os-apachelucenesearch/ (as of 11 December 2013).

IJCATM : www.ijcaonline.org

