Resume Screening Using Natural Language Processing and Machine Learning: A Systematic Review
Arvind Kumar Sinha, Md. Amir Khusru Akhtar, and Ashwani Kumar
1 Introduction
Resume parsing is the process of extracting information from websites or unstructured
documents using complex pattern matching and language analysis techniques. It is a
means to automatically extract information from resumes and other unstructured documents
and to build a candidate database for recruiters. The process generally converts free-form
resumes, that is, PDF, DOC, DOCX, RTF or HTML files, into structured data such as XML
or JSON. Artificial intelligence technology and natural language processing (NLP)
engines are used to understand human language and automate the workflow. Resume parsers use
semantic search to parse data from available resumes and find suitable candidates.
Extracting information from human language is difficult because human language
is infinitely varied and ambiguous. The same information can be written and expressed in
several ways; thus a parsing tool needs to capture all of these ways of writing by using
complex rules and statistical algorithms. Ambiguity arises when the same word
means different things in different contexts. For example, a four-digit number may be
part of a telephone number, a street address, a year, a product number or the version of a
software application. Thus, the idea is to train the machine to analyze the context of
written documents the way a human being would.
Recruitment agencies use resume parsing tools to automate the process and to
save recruiters hours of work. A resume parser automatically separates the information
into various fields based on given criteria. The relevant information extracted
by a resume parser includes personal information (such as name, address, email),
experience details (such as start/end date, job title, company, location), education
details (such as degree, university, year of passing, location), hobbies (such as dancing,
singing, swimming) and so on. There are numerous choices of resume parsers, such as
Sovren, Textkernel, Rchilli, BurningGlass, Tobu, JoinVision CVlizer, Daxtra, Hire-
Ability, RapidParser and Trovix [1]. Most companies use an applicant tracking system
that bundles a resume parser as one of its features. The first resume parsers appeared
in the late 1990s as stand-alone packaged solutions for HR operations [2].
This paper presents a systematic review of resume screening and compares the
recognized works in the field. Several machine learning techniques and approaches
for evaluating and analyzing unstructured data are discussed.
Existing resume parsers use semantic search to understand the context of the language
in order to obtain reliable and comprehensive results, so a review of the use of semantic
search for context-based searching is also given. In addition, this paper
outlines the research challenges and future scope of resume parsing in terms of
writing style, word choice and the syntax of unstructured written language.
The rest of the paper is organized as follows. Section 2 discusses information
extraction methods. Section 3 presents a systematic review of resume parsers
and compares the recognized works. Section 4 discusses the use
of semantic search for context-based searching. Section 5 presents the research
challenges and future scope of resume parsing. Finally, Sect. 6 concludes the paper.
symbols [5]. The supervised machine learning approach requires a huge corpus for
training and needs specific knowledge of abbreviations [6]. Kiss and Strunk proposed an
unsupervised machine learning approach that uses type-based classification: a word is
analyzed across the whole text and annotated for sentence boundaries and
abbreviations [6].
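As a concrete illustration, the following minimal Python sketch uses NLTK's Punkt tokenizer, which implements the Kiss and Strunk type-based approach; the example text is ours, and the pre-trained English Punkt model is assumed to be downloaded.

import nltk  # assumes nltk.download('punkt') has been run

text = ("Dr. Smith worked at Acme Corp. from 2015 to 2019. "
        "He holds an M.Sc. in Computer Science.")

# The Punkt model learns abbreviations (Dr., Corp., M.Sc.) from corpus
# statistics, so the periods inside them are not treated as boundaries.
for sentence in nltk.sent_tokenize(text):
    print(sentence)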
After segmentation of sentence boundaries, the system divides each sentence into
tokens, a step called tokenization. Several tokenization approaches have been proposed in
the literature, such as rule-based and statistical approaches. A rule-based tokenizer
uses a list of rules to classify tokens, as in the Penn Tree Bank
(PTB) tokenizer [7]. A statistical approach uses a hidden Markov model (HMM) [8] to
identify word and sentence boundaries [9]; this method uses scanning and HMM
boundary detector modules for tokenization.
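As an illustration of the rule-based route, the sketch below applies NLTK's Penn Treebank style word tokenizer to a resume-like phrase; the input string is an assumption made for the example.

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Punctuation such as commas and parentheses is split off as separate tokens
# according to the tokenizer's built-in rule list.
tokens = tokenizer.tokenize("Java developer, 5+ years' experience (2016-2021).")
print(tokens)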
To identify the part of speech (POS) of each word, tagsets such
as the Penn Treebank Tagset (PTT) [7] and the CLAWS 5 (C5) Tagset [10] can be used. An
important task in information extraction is named entity recognition (NER), which
identifies names of entities such as groups, persons, places, currencies, ages and times
[11].
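The following sketch illustrates POS tagging and NER with the spaCy library; the sample sentence is ours, and the small English model (en_core_web_sm) is assumed to be installed.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm
doc = nlp("John Doe worked at Google in London from 2018 to 2021.")

for token in doc:
    print(token.text, token.pos_, token.tag_)   # coarse tag and PTB-style tag

for ent in doc.ents:
    print(ent.text, ent.label_)                 # e.g. PERSON, ORG, GPE, DATE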
3 Resume Parsers
Natural language processing and machine learning have the capability to understand
and parse unstructured written language and extract the desired information.
Existing resume parsers use semantic search to understand the context of the language
in order to obtain reliable and comprehensive results. A resume parser converts
unstructured data into a structured form, automatically separating the information
into various fields based on given criteria and parameters such as
name, address, email, start/end date, job title, company, location, degree, university,
year of passing, and hobbies such as dancing, singing and swimming [2]. Several open-source
and commercial resume parsers are available for information extraction.
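To make the idea concrete, the sketch below converts a toy resume string into structured JSON; the regular expressions and field names are simplified assumptions, not the method of any particular parser.

import json
import re

def extract_fields(text):
    email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    phone = re.search(r"\+?\d[\d\s()-]{7,}\d", text)
    # Assume the first non-empty line holds the candidate's name.
    name = next((line.strip() for line in text.splitlines() if line.strip()), None)
    return {"name": name,
            "email": email.group(0) if email else None,
            "phone": phone.group(0) if phone else None}

resume_text = "Jane Roe\nData Analyst\[email protected]\n+1 555 123 4567"
print(json.dumps(extract_fields(resume_text), indent=2))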
Open-source resume parsers are distributed with source code that is available for
modification. These open-source libraries parse free-form resumes,
that is, PDF, DOC, DOCX, RTF and HTML files, into structured data such as XML or JSON.
Given social media profile links, some of these parsers also parse the public web pages,
such as LinkedIn and GitHub profiles, and convert the data into structured JSON.
Table 1 shows the list of open-source resume parsers and their properties.
These open-source parsers are simple and easy to use, except Deepak's parser,
and follow the same approach for cleaning and parsing. However, their output still contains
HTML and Unicode characters, which has a negative effect on named entity recognition
[12].
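A cleaning pass of the kind these parsers lack might look like the following sketch, which strips HTML tags and normalizes Unicode before the text reaches a named entity recognizer; the exact rules are illustrative assumptions.

import html
import re
import unicodedata

def clean_resume_text(raw):
    no_tags = re.sub(r"<[^>]+>", " ", raw)                 # drop HTML tags
    unescaped = html.unescape(no_tags)                     # &amp; -> &, etc.
    normalized = unicodedata.normalize("NFKC", unescaped)  # fold Unicode variants
    return re.sub(r"\s+", " ", normalized).strip()         # collapse whitespace

print(clean_resume_text("<p>Se&ntilde;or Engineer &amp; Team Lead</p>"))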
Commercial resume parsers are designed and developed for sale to end users [12].
These resume parsers have more sophisticated attribute recognition algorithms than
open-source parsers, which allows them to identify these attributes correctly. The
strength of commercial parsers undoubtedly lies in the careful analysis of the resume to
identify its different sections, such as the skill, qualification and experience sections. Table
2 shows the list of commercial resume parsers and their properties.
Many parsers on the market provide CV automation solutions
and round-the-clock customer support. Their APIs are inexpensive
and easy to integrate.
ideas. Context-based search involves various parts of the search process, such as
understanding the query and the underlying knowledge. The literature shows that semantic
search for context-based searching is very effective in parsing resumes [18].
A semantic binary signature approach has been proposed in the literature [19]. It processes
a search query by determining relevant categories and generating a binary hashing
signature. The appropriate categories are examined, and Hamming distances are calculated
between the inventory's binary hashing signatures and the search query. The Hamming
distance reflects semantic significance and can be used to understand the searcher's
intent.
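As a small worked example of this ranking step, the sketch below orders candidate signatures by their Hamming distance to a query signature; the 8-bit signature values are made up for illustration.

query_sig = 0b10110010
inventory = {"resume_A": 0b10110110, "resume_B": 0b01001101, "resume_C": 0b10100010}

def hamming(a, b):
    return bin(a ^ b).count("1")   # number of differing bits

# A smaller distance indicates a closer semantic match to the query.
for name, sig in sorted(inventory.items(), key=lambda kv: hamming(query_sig, kv[1])):
    print(name, hamming(query_sig, sig))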
A sentence-level emotion detection technique using semantic rules has been published
in the literature [20]. The paper discusses an efficient emotion detection method
that matches emotional words against its emotional_keyword database. The technique
investigates the emotional words and provides better results and performance than
existing approaches.
An NLP-based keyword analysis method has been proposed in the literature [21].
This method uses three matrices: a document content matrix V, a word feature matrix W
and a document feature matrix H. A rank is then calculated for each word using the set
of coefficients, and finally a rank is generated for one or more queries using the ranks
of the individual words.
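One plausible reading of this factorization, sketched below, is non-negative matrix factorization of a word-by-document matrix V into word features W and document features H, with each word ranked by its strongest feature weight; the corpus, the number of components and the ranking rule are our assumptions, since the cited paper's exact coefficients are not reproduced here.

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = ["python machine learning engineer",
        "java backend developer",
        "machine learning and data science"]

vec = CountVectorizer()
V = vec.fit_transform(docs).T            # words x documents
model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(V)               # word feature matrix
H = model.components_                    # document feature matrix

ranks = {word: W[i].max() for i, word in enumerate(vec.get_feature_names_out())}
for word, score in sorted(ranks.items(), key=lambda kv: -kv[1])[:5]:
    print(word, round(score, 3))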
Thomas and Sangeetha [22] proposed an intelligent sense-enabled lexical search
on text documents. This method applies word sense
disambiguation (WSD) to each word and then performs a semantic search on the input text to
extract semantically related words. This form of extraction is useful in resume
screening, resume learning and document indexing.
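A minimal sketch of this idea with NLTK is shown below: the Lesk algorithm picks a context-appropriate sense, and WordNet supplies semantically related terms; the example sentence is ours, and the WordNet corpus is assumed to be downloaded.

from nltk.wsd import lesk   # requires the WordNet corpus (nltk.download('wordnet'))

sentence = "I deposited the cheque at the bank".split()
sense = lesk(sentence, "bank")            # disambiguate 'bank' in context
print(sense, "-", sense.definition())

# Collect synonyms and hypernyms of the chosen sense as related search terms.
related = set(sense.lemma_names())
for hyper in sense.hypernyms():
    related.update(hyper.lemma_names())
print(related)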
Alexandra et al. [23] proposed the design and implementation of a semantic-based
system for automating the staffing process. The proposed system uses a skills and
competencies lexicon for semantic processing of resumes and matches candidate
skills against the job requirements. The method eliminates repetitive activities,
minimizing the recruiter's processing time, and improves search efficiency using complex
semantic criteria.
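For illustration, the sketch below performs a lexicon-based skill match between a resume and a job posting; the skill lexicon and the scoring rule are simplified assumptions rather than the cited system's actual ontology.

SKILL_LEXICON = {"python", "java", "sql", "machine learning", "docker"}

def extract_skills(text):
    text = text.lower()
    return {skill for skill in SKILL_LEXICON if skill in text}

def match_score(resume, job):
    required = extract_skills(job)
    matched = extract_skills(resume) & required
    return len(matched) / len(required) if required else 0.0

job_ad = "Looking for Python and SQL skills, Docker a plus."
resume = "Built Python ETL pipelines; strong SQL and machine learning."
print(match_score(resume, job_ad))   # fraction of required skills covered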
Kumar et al. [24–27] proposed an object detection method that helps blind people
locate objects in a scene. They used machine-learning-based methods along with a
single SSMD detector algorithm to develop the model.
These studies show how semantic search is used to understand the context
of the language and obtain reliable and comprehensive results.
5 Research Challenges and Future Scope
The correctness of a resume parser depends on a number of factors [1], such as writing
style, choice of words and the syntax of the written text. A set of statistical algorithms and
complex rules is needed to correctly interpret and extract the relevant information from
resumes. Natural language processing and machine learning have the capability to
understand and parse unstructured written language and context-based information.
However, there are many ways to write the same information, such as a name, address or
date. Resume parsing is therefore still in its nascent stage, and some important challenges
and directions for future work are as follows [1, 12]:
• Understanding writing style of resume
• Understanding choice of words in a resume
• Understanding syntax of unstructured written language
• Context-based searching
• Understanding organization and formatting of resume
• Understanding headers and footers of resume
• Breaking resume into sections (a simple sketch follows this list)
• Understanding the structural and visual information from PDFs
• Speed of parsing.
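As a pointer toward the section-splitting challenge listed above, the sketch below segments a resume by matching a small set of common section headers; real resumes vary widely, so the header list is only an illustrative assumption.

import re

HEADERS = ["education", "experience", "skills", "projects", "hobbies"]
HEADER_RE = re.compile(rf"^\s*({'|'.join(HEADERS)})\s*:?\s*$", re.IGNORECASE)

def split_sections(text):
    sections, current = {}, "header"          # text before the first header
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            current = match.group(1).lower()
            sections[current] = []
        else:
            sections.setdefault(current, []).append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}

resume = "Jane Roe\nExperience\nAcme Corp, 2019-2023\nSkills\nPython, SQL"
print(split_sections(resume))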
6 Conclusions
Resume screening is the process of extracting information from unstructured documents
using complex pattern matching and language analysis techniques. Natural
language processing and machine learning have the capability to understand and
parse unstructured written language and context-based information. This paper
has presented a systematic review of resume screening, compared the recognized works
and investigated several open-source and commercial resume parsers for information
extraction. A review of the use of semantic search for context-based searching
has also been given. In addition, the paper has outlined the research challenges and
future scope of resume parsing in terms of writing style, word choice and the syntax of
unstructured written language.
References