
E-RECRUITING AND SHORTLISTING USING
CANDIDATE RESUME WITH NLP AND MACHINE LEARNING

A PROJECT REPORT

Submitted by

SNEHA T (1920102128)

in partial fulfillment for the award of the degree


of
BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

SONA COLLEGE OF TECHNOLOGY


(An Autonomous Institution)
SALEM-636005

ANNA UNIVERSITY : CHENNAI 600 025


MAY 2024

SONA COLLEGE OF TECHNOLOGY
(An Autonomous Institution; Affiliated to Anna University, Chennai -600 025)

ANNA UNIVERSITY, CHENNAI – 600 025


BONAFIDE CERTIFICATE

Certified that this project report “E-RECRUITING AND SHORTLISTING USING CANDIDATE RESUME WITH NLP AND MACHINE LEARNING” is the bonafide work of “SNEHA T (1920102128)”, who carried out the project work under my supervision.

SIGNATURE
Dr. B. SATHIYABHAMA, B.E., M.Tech., Ph.D.
HEAD OF THE DEPARTMENT
Professor
Department of Computer Science and Engineering
Sona College of Technology,
Salem – 636005

SIGNATURE
Ms. R. SOWNDHARYA, M.E.
SUPERVISOR
Assistant Professor
Department of Computer Science and Engineering
Sona College of Technology,
Salem – 636005

Submitted for the project viva-voce examination held on ………..

INTERNAL EXAMINER EXTERNAL EXAMINER

CONFERENCE CERTIFICATE

CONFERENCE PROCEEDINGS PAGE

ACKNOWLEDGMENT

First and foremost, I would like to express my gratitude to our honorable Chairman Sri. C. Valliappa, our Vice Chairmen Sri. Chocka Valliappa and Sri. Thyagu Valliappa, and the management of Sona College of Technology for their constant encouragement throughout this course.

My sincere thanks go to Dr. S. R. R. Senthil Kumar, Principal, Sona College of Technology, who has motivated me in all my endeavors. I wholeheartedly thank Dr. B. Sathiyabhama, Professor and Head, Department of Computer Science and Engineering, Sona College of Technology, Salem, for giving constant encouragement and rendering all kinds of support throughout the course.

I take this opportunity to express my deepest gratitude and special thanks to my project guide, Ms. R. Sowndharya, Assistant Professor, Department of Computer Science and Engineering, Sona College of Technology.

Special thanks go to my class counsellor, Dr. R. C. Narayanan, Assistant Professor, Department of Computer Science and Engineering, Sona College of Technology.

I would like to thank my parents and all my friends from the bottom of my heart; they were always present to help me and to make this project a success. Last but not least, I would like to express my heartiest thanks and gratitude to almighty God for His divine blessings, which helped me complete the final year project successfully. This project was made possible by invaluable inputs from everyone involved, directly or indirectly.

ABSTRACT

This project introduces an iterative smart sorting system aimed at addressing the challenges prevalent in today's business world, particularly within the realm of recruitment. By prioritizing data privacy, enhancing security measures, and fortifying system robustness, it offers a novel approach to candidate evaluation. Through semantic analysis of resumes and utilization of predefined ontologies, the system streamlines the recruitment process, empowering recruiters to efficiently identify suitable candidates. This research amalgamates expertise and technology to deliver effective solutions tailored to the dynamic nature of today's job market. This transformative solution not only improves the efficiency of recruitment processes but also upholds the values of privacy and security. By harnessing advanced semantic analysis techniques and leveraging domain-specific knowledge, it represents a significant advancement in the field of talent acquisition.

TABLE OF CONTENTS

ACKNOWLEDGMENT
ABSTRACT
LIST OF ABBREVIATIONS
LIST OF FIGURES

1 INTRODUCTION
1.1 GENERAL
1.2 OBJECTIVE
1.3 HISTORY
1.4 CHALLENGES
1.5 APPLICATIONS

2 LITERATURE SURVEY
2.1 RESEARCH WORKS

3 SYSTEM SPECIFICATION
3.1 HARDWARE SPECIFICATION
3.2 SOFTWARE SPECIFICATION
3.3 SOFTWARE REQUIREMENT SPECIFICATION

4 SYSTEM ANALYSIS
4.1 EXISTING SYSTEM
4.2 PROPOSED SYSTEM
4.3 ALGORITHM

5 DESIGN AND IMPLEMENTATION
5.1 METHODOLOGIES
5.1.1 DATA PRE-PROCESSING
5.1.2 FEATURE ENGINEERING
5.1.3 SIMILARITY MATCHING ALGORITHM
5.1.4 EVALUATION AND EXTRACTION
5.1.5 LIMITATIONS
5.2 SYSTEM ARCHITECTURE

6 CONCLUSION AND FUTURE WORK

APPENDICES
SAMPLE CODE
SCREENSHOTS
REFERENCES

LIST OF ABBREVIATIONS

NLP - Natural Language Processing
OCR - Optical Character Recognition
UI - User Interface
PDF - Portable Document Format
DOCX - Microsoft Word Document Format
JSON - JavaScript Object Notation
ER DIAGRAM - Entity Relationship Diagram
HR - Human Resources
NER - Named Entity Recognition
ML - Machine Learning
CV - Curriculum Vitae

LIST OF FIGURES

5.1 Flow chart
5.2 Shortlisting Criteria
5.3 Architecture Diagram

CHAPTER I

INTRODUCTION

1.1 GENERAL

The realm of Human Resources (HR) confronts an enduring challenge:
the time-intensive task of scrutinizing numerous resumes to pinpoint suitable
candidates for job openings. Manual resume screening is not only laborious but also
prone to errors, presenting a common obstacle for recruiters. In response, this project
proposes a web-based system aimed at automating the initial phase of recruitment
by shortlisting resumes based on their relevance to specific job descriptions.
Through the utilization of cutting-edge natural language processing (NLP) and
machine learning techniques, this endeavor seeks to streamline the resume screening
process. This report delineates the design and development of a web application
tailored to automate resume shortlisting, elucidating its functionalities, technological
underpinnings, and advantages for recruiters. The project aspires to revolutionize
recruitment practices by mitigating the inefficiencies inherent in traditional resume
screening methods, fostering enhanced accuracy, streamlined processing, and
swifter decision-making. By setting the stage for an examination of the system's
technical architecture and its potential to reshape HR workflows, this introduction
paves the way for a comprehensive analysis of its capability to enhance efficiency
to unprecedented heights.

1.2 OBJECTIVE

The objective of this project is to develop a web-based system that automates the
initial stage of the recruitment process by leveraging natural language processing
(NLP) and machine learning techniques. Specifically, the system aims to rank
resumes based on their relevance to a given job description, thereby facilitating the
shortlisting process for recruiters. By automating resume screening and ranking, the
project seeks to reduce the time and effort required for manual assessment, while
also enhancing the accuracy and efficiency of candidate selection. The ultimate goal
is to provide HR professionals with a powerful tool that streamlines the recruitment
process, expedites decision-making, and improves the overall quality of candidate
selection. Through the integration of advanced technology and tailored algorithms,
the project aims to address the challenges associated with traditional resume
screening methods and deliver a scalable, user-friendly solution for optimizing
recruitment outcomes.

1.3 HISTORY

The history of this project stems from the recognition of the persistent challenges
faced by recruiters in the time-intensive process of manually screening resumes.
Traditional methods of resume assessment have proven labor-intensive and error-
prone, prompting the exploration of automated solutions to streamline recruitment
workflows. As the demand for efficient recruitment processes continues to rise, there
has been a growing emphasis on leveraging technological advancements, such as
natural language processing (NLP) and machine learning, to improve candidate
selection efficiency. This project builds upon previous research and developments in
the field of automated resume screening, aiming to address existing limitations and
provide a scalable solution that meets the evolving needs of HR professionals.

1.4 CHALLENGES

The development of an automated resume ranking system encounters several challenges. Firstly, ensuring the system accurately determines the relevance of
resumes to specific job descriptions proves complex due to the variability in
terminology and formatting across resumes. Extracting pertinent information from
diverse resume formats, including PDFs and Word documents, requires robust
parsing techniques capable of handling different structures and layouts effectively.
Additionally, designing and implementing a robust ranking algorithm poses a
challenge, as it must accurately assess the suitability of candidates based on varying
job requirements and candidate profiles. Addressing scalability is crucial, as the
system must efficiently handle large volumes of resumes while maintaining
performance and responsiveness. Integration complexity arises when integrating the
system seamlessly into existing HR workflows and applicant tracking systems (ATS)
without disrupting operations. Quality assurance is essential to minimize false
positives and negatives, necessitating rigorous testing and validation processes.
Designing an intuitive and user-friendly interface for HR professionals to interact
with the system effectively requires careful consideration of usability and
accessibility principles. Safeguarding sensitive candidate information and ensuring
compliance with data protection regulations pose challenges in system design and
implementation. Adapting to diverse industries' needs and requirements requires the
system to be adaptable and customizable. Finally, implementing mechanisms for
ongoing monitoring, feedback collection, and system refinement ensures continuous
improvement and adaptation to changing recruitment trends and requirements.

1.5 APPLICATIONS

1. Recruitment Agencies: Utilize automated resume ranking systems to efficiently identify and shortlist top candidates from a large pool of applicants.
2. Corporate HR Departments: Streamline the recruitment process by using
such systems, enabling HR departments to focus resources on interviewing
and selecting the best-fit candidates.
3. Small and Medium Enterprises (SMEs): Compete effectively for top talent
with limited resources and manpower by leveraging automated resume
ranking systems.
4. Online Job Portals: Enhance the candidate experience by providing more
accurate matches between job seekers and job listings through integration
with automated resume ranking systems.
5. Government Agencies: Ensure transparency and fairness in candidate
selection by leveraging such systems in their recruitment processes.
6. Educational Institutions: Save time and resources by efficiently screening
and shortlisting candidates for academic and administrative positions using
automated resume ranking systems.
7. Non-profit Organizations: Maximize impact with limited resources by
identifying qualified candidates for volunteer and paid positions through
automated resume ranking systems.

CHAPTER 2

LITERATURE SURVEY

2.1 RESEARCH WORKS

The ever-growing volume of resumes received for a single job opening has created
a significant challenge for Human Resources (HR) professionals. Traditional
methods on job boards often require extensive hours dedicated to meticulous
candidate assessment and recruitment. To address this challenge, research efforts
have focused on automating the resume screening process using Natural Language
Processing (NLP) and Machine Learning (ML) techniques.

This review examines several relevant studies that explore the application of NLP
and ML for resume analysis and its impact on the recruitment process.

• Sinha et al. (2021) conducted a systematic review on resume screening using NLP and ML. Their study highlights the effectiveness of various techniques, including named entity recognition (NER) for extracting contact information and skills, and text classification for categorizing resumes based on job descriptions.
• Constantinescu et al. (2019) explored leveraging lexicon-based semantic
analysis to automate the recruitment process. Their research suggests that by
analyzing the semantics of keywords within resumes and job descriptions, a
system can identify relevant candidate qualifications and improve the
accuracy of candidate selection.
• Valdez-Almada et al. (2017) investigated the use of NLP and text mining to identify knowledge profiles for software engineering positions. Their work demonstrates the potential of NLP in extracting skills and experience from resumes, allowing for a more data-driven approach to candidate evaluation.

These studies collectively highlight the potential of NLP and ML for automating
resume screening and enhancing the efficiency of the recruitment process. However,
it is important to acknowledge that existing research also explores other techniques
beyond NLP and ML.

• Reynar and Ratnaparkhi (1997) presented a maximum entropy approach for identifying sentence boundaries within text data. This technique can be valuable for pre-processing resumes before applying NLP algorithms that rely on accurate sentence segmentation.
• Kumar (2019) explores secure image fusion techniques that can be used to
anonymize resumes before processing. This research, while not directly
related to NLP or ML for resume analysis, highlights the importance of data
privacy considerations when designing such systems.
• Kumaran and Sankar (2013) proposed an automated system for candidate
screening using ontology mapping. While their approach does not directly
employ NLP techniques, it demonstrates alternative methods for leveraging
structured knowledge to assess candidate suitability.
• Choudhary et al. (2020) investigated the role of sentiment analysis in resume
screening processes. Their study suggests that analyzing the sentiment
expressed in resumes can provide insights into candidates' attitudes,
personality traits, and communication skills, thereby aiding in the selection of
candidates who align well with the organizational culture and values.
• Gupta and Singh (2018) explored the use of deep learning techniques, specifically convolutional neural networks (CNNs), for resume parsing and feature extraction. Their research demonstrates the effectiveness of CNNs in automatically identifying and extracting relevant information such as education, work experience, and skills from unstructured resume data, leading to more accurate candidate profiling.
• Ramanathan et al. (2016) proposed a hybrid approach combining NLP with
network analysis techniques to enhance the assessment of candidate suitability
based on their professional connections and network affiliations. By analyzing
resumes alongside social network data, their approach aims to identify
candidates with strong industry connections and potential for collaboration,
thus enriching the recruitment process with social capital considerations.
• Li and Liu (2020) conducted a comparative analysis of different machine
learning algorithms for resume ranking and shortlisting. Their study evaluates
the performance of algorithms such as support vector machines (SVM),
random forests, and gradient boosting machines (GBM) in accurately
identifying top candidates based on predefined criteria, providing insights into
the relative strengths and limitations of each approach.
• Narayanan and Rajan (2019) explored the integration of NLP with
sentiment analysis and topic modeling techniques to develop a holistic
framework for candidate profiling and personality assessment. Their research
emphasizes the importance of considering not only candidates' qualifications
and skills but also their personal characteristics and behavioral attributes in
the recruitment decision-making process.

The reviewed literature demonstrates the promise of NLP and ML for automating
resume screening and streamlining the recruitment process. By leveraging
techniques like named entity recognition, text classification, and semantic analysis,
these systems can extract valuable information from resumes and facilitate data-driven candidate evaluation. However, it is crucial to consider complementary techniques like sentence boundary identification and data anonymization for a robust
and secure system. Future research can explore further integration of NLP and ML
with existing HR workflows to create a more comprehensive and efficient
recruitment experience.

CHAPTER 3

SYSTEM SPECIFICATION

3.1 HARDWARE SPECIFICATION

To run the resume ranking system effectively, a computer with the following capabilities is recommended:

Processor: A decent processor like an Intel Core i5 or equivalent is suitable for handling the application's workload.

Memory (RAM): At least 8GB of RAM is recommended to ensure smooth operation, especially if you anticipate processing a large volume of resumes concurrently.

Storage: Sufficient storage space is necessary to accommodate the application files and uploaded resumes. The amount of storage required will depend on the anticipated number of resumes you plan to process.

3.2 SOFTWARE SPECIFICATION

The resume ranking system relies on specific software components to function properly. Here's a breakdown of the essential software requirements:

• Operating System: The code should run on most major operating systems
that support Python 3.x, including Windows, macOS, and Linux.
• Programming Language: Python 3.x is the core language used for the
application. A Python 3.x interpreter is necessary to execute the code.

• Libraries: Several Python libraries are crucial for the application's
functionalities:
o Flask: This web framework provides the foundation for building the
web application.
o spaCy: This library is used for Natural Language Processing (NLP)
tasks, particularly named entity recognition, which helps identify
names and emails within resumes.
o PyPDF2: This library facilitates the extraction of text data from PDF
resumes.
o scikit-learn (sklearn): This machine learning library provides tools for
TF-IDF vectorization and calculating cosine similarity, used to assess
the degree of similarity between job descriptions and resumes.
o re: The re library offers regular expression functionalities used for basic
email and name extraction from the text data.
o csv: This library allows the system to work with data in CSV (comma-
separated values) format, which is useful for generating reports on the
ranked resumes.

3.3 SOFTWARE REQUIREMENT CONSIDERATIONS

The code should run on most major operating systems (Windows, macOS, and
Linux) that support Python 3.x. However, to ensure proper functionality, the
aforementioned Python libraries (Flask, spaCy, PyPDF2, sklearn, re, csv) need to be
installed. These libraries can be easily installed using a package manager like pip
with a command like pip install Flask spacy PyPDF2 sklearn.
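The setup described above can be sketched as follows. The commands are illustrative, and the spaCy model name (en_core_web_sm) is an assumption: the report does not specify which trained pipeline it uses.

```shell
# Install the third-party dependencies (re and csv are standard library).
# Note: the PyPI package name is "scikit-learn", imported as "sklearn".
pip install Flask spacy PyPDF2 scikit-learn

# spaCy's named entity recognition needs a trained pipeline; the small
# English model is a common choice (assumed -- not named in the report).
python -m spacy download en_core_web_sm
```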

It's important to note that the code utilizes basic regular expressions for email and
name extraction. While this might work for simple cases, it might not be entirely
reliable for production use. For a more robust system, implementing more
sophisticated techniques for entity recognition would be advisable.

Finally, the code defines file paths for storing uploaded resumes within an "uploads" directory. Make sure this directory exists on your system before running the application.
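Rather than creating the directory by hand, it can be created at startup; this is a small defensive sketch, not taken from the report's code:

```python
import os

# The application stores uploaded resumes in an "uploads" directory.
# Creating it at startup (exist_ok=True makes the call idempotent)
# avoids a FileNotFoundError on the first upload.
UPLOAD_FOLDER = "uploads"
os.makedirs(UPLOAD_FOLDER, exist_ok=True)

print(os.path.isdir(UPLOAD_FOLDER))  # True
```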

CHAPTER 4
SYSTEM ANALYSIS

4.1 EXISTING SYSTEM

While the automation of resume ranking undoubtedly offers significant efficiency gains, it's imperative to acknowledge and address its inherent limitations. One such
limitation lies in the system's heavy reliance on Named Entity Recognition (NER),
which, despite its advancements, can still lead to misinterpretations of resumes,
potentially resulting in the inadvertent exclusion of highly qualified candidates.
Furthermore, the exclusive reliance on algorithms may inadvertently sideline the
invaluable nuanced judgment of human recruiters, introducing the possibility of bias
into the selection process.

Another concern arises from the susceptibility of machine learning models to overfitting, a phenomenon where the models perform exceptionally well on specific
datasets but struggle when presented with unseen information, leading to unreliable
candidate ranking. Additionally, the storage and processing of sensitive resume data
raise legitimate concerns regarding data privacy and security, necessitating robust
measures to safeguard applicant information from unauthorized access or misuse.

As the volume of resumes continues to escalate, scalability emerges as a pressing challenge for the system. The exponential increase in data requires scalable
infrastructure and efficient processing capabilities to maintain performance
standards and avoid system bottlenecks. Without adequate scalability measures in
place, the system risks becoming overwhelmed and less responsive to the needs of
recruiters and applicants alike.

In light of these challenges, adopting a hybrid approach that seamlessly integrates
automation with human oversight is imperative. By combining the strengths of
automated algorithms with the nuanced judgment and decision-making capabilities
of human recruiters, organizations can mitigate the risks of bias and ensure a more
comprehensive and reliable recruitment process. Moreover, continuous
improvement efforts, including regular updates to algorithms and protocols, are
essential to adapt to evolving recruitment trends and maintain the system's
effectiveness and relevance in a dynamic job market landscape.

4.2 PROPOSED SYSTEM

This project proposes a novel resume ranking system that leverages a hybrid
approach. The system combines the strengths of automated processing with the
valuable insights of human reviewers to create a well-rounded and reliable
recruitment process.

Here's a breakdown of the key aspects of the proposed system:

Hybrid Approach:

The system integrates automated resume parsing and candidate screening with
human oversight. This allows the system to efficiently process large volumes of
resumes with the accuracy and nuanced judgment of human reviewers playing a vital
role in the final selection process.

Enhanced NLP and NER:

By investing in Natural Language Processing (NLP) enhancements, the system aims to improve the accuracy of Named Entity Recognition (NER). This will ensure more
accurate extraction of critical information like skills and experience from resumes,
leading to better candidate profiling. Additionally, exploring advanced NLP
techniques can further refine data extraction for a more comprehensive
understanding of candidate qualifications.

Advanced Machine Learning Models:

The system goes beyond basic models by exploring sophisticated machine learning
approaches, including neural networks. These models hold the potential to
significantly enhance prediction accuracy and generalization capabilities. This
translates to a more reliable ranking of candidates based on their suitability for the
job description.

Continuous Improvement:

The system incorporates a feedback loop for continuous evaluation and improvement. This loop allows the system to address issues like overfitting and bias
over time. The ability to learn and adapt ensures the system remains relevant and
effective in meeting evolving recruitment needs.

By combining these elements, the proposed system aims to create a more efficient,
accurate, and fair resume ranking process that leverages the strengths of both
automation and human expertise.

4.3 ALGORITHM

1. Text Extraction from PDFs:

The algorithm starts by parsing each provided resume PDF file using the PyPDF2 library. It iterates through each page of the PDF, extracting text content and concatenating it into a single string representing the entire resume.

2. Entity Extraction:

Regular expressions are employed to extract pertinent information like email addresses and full names from the parsed resume text. Email addresses are identified using a regex pattern that matches the typical structure of an email. Full names are extracted using a regex pattern that captures common name formats, ensuring accurate identification.
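The patterns below are illustrative only: the report does not give its exact regexes. The email pattern follows the usual user@domain.tld structure, while the name pattern is a naive "two capitalized words" heuristic that will miss many real-world name formats (a limitation the report itself acknowledges later).

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NAME_RE = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")  # e.g. "Jane Smith"

resume_text = "Jane Smith\njane.smith@example.com\nPython developer, 5 years"

emails = EMAIL_RE.findall(resume_text)
names = NAME_RE.findall(resume_text)
print(emails)  # ['jane.smith@example.com']
print(names)   # ['Jane Smith']
```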

3. Job Description and Resume Vectorization:

In this step, the algorithm utilizes scikit-learn's TfidfVectorizer to transform both the job description and each resume text into TF-IDF (Term Frequency-Inverse
Document Frequency) vectors. TF-IDF is a numerical statistic that reflects the
importance of a term in a document relative to a collection of documents,
considering both the frequency of occurrence of the term in the document (TF) and
the rarity of the term across the entire document collection (IDF). Firstly, the
vectorizer is fitted on the job description to learn the vocabulary and IDF weights.
This process involves tokenizing the text into individual terms (words or phrases),
removing stopwords (commonly occurring words like "the," "and," "is," etc.), and
computing the IDF weights for each term in the vocabulary based on its occurrence
in the job description and its rarity across all resumes.

Once the vectorizer is fitted, each resume text is then transformed into a TF-IDF
vector representation. During this transformation, the algorithm calculates the TF-
IDF values for each term in the resume text, considering both its frequency within
the resume and its rarity across all resumes. This results in a high-dimensional vector
where each component represents the importance of a specific term in the context of the job description and the resume. The TF-IDF vectorization process captures the
semantic relevance of terms in both the job description and the resumes, allowing
the algorithm to quantify the degree of similarity between them based on the overlap
and significance of shared terms. By representing textual data in a numerical format,
TF-IDF vectors enable the algorithm to perform mathematical operations, such as
cosine similarity calculation, for comparing and ranking resumes based on their
compatibility with the job requirements. The job description and resume vectorization step transforms textual data into a structured numerical representation,
facilitating meaningful comparison and analysis for effective resume ranking and
candidate selection.
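The vectorization step can be sketched as below, following the order the report describes: fit on the job description, then transform each resume. The sample texts are made up. Note that under this fitting scheme, resume terms absent from the job description are ignored, which is worth keeping in mind; many implementations instead fit on the combined corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

job_description = "Python developer with Flask and machine learning experience"
resumes = [
    "Experienced Python developer, built Flask web applications",
    "Accountant with payroll and auditing background",
]

# Fit on the job description to learn the vocabulary and IDF weights,
# then project each resume into that same term space. English stop
# words ("with", "and", ...) are removed during tokenization.
vectorizer = TfidfVectorizer(stop_words="english")
job_vec = vectorizer.fit_transform([job_description])
resume_vecs = vectorizer.transform(resumes)

# Job description and resumes now share one vector space.
print(job_vec.shape[0], resume_vecs.shape[0])  # 1 2
```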

4. Cosine Similarity Calculation:

Cosine similarity is a mathematical measure used to determine the similarity between two vectors in a multi-dimensional space. In the context of natural language
processing and document analysis, cosine similarity is commonly employed to
assess the similarity between textual documents represented as vectors, such as TF-
IDF vectors.

Here's a concise explanation of cosine similarity:

Cosine similarity measures the cosine of the angle between two vectors and provides
a numerical value between -1 and 1, indicating the similarity between the vectors. In
the context of document comparison, cosine similarity is computed between the TF-
IDF vectors of the job description and each resume.

Interpretation of cosine similarity scores:

A score of 1 indicates perfect similarity, meaning the vectors point in the same
direction, implying identical content between the documents.

A score of -1 indicates perfect dissimilarity, implying opposite directions.

A score of 0 suggests orthogonality, indicating no similarity between the documents.

Higher cosine similarity scores between the job description and a resume denote
closer alignment or relevance of the resume to the job requirements. Conversely,
lower scores indicate less relevance or similarity. In summary, cosine similarity
calculation provides a robust and intuitive method for quantifying the similarity
between textual documents, enabling effective document comparison and ranking in
various natural language processing tasks, including resume screening and
recruitment.
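The formula behind this measure, cos(θ) = (a · b) / (|a| |b|), can be computed directly; the toy weight vectors below are illustrative, not real TF-IDF output. Because TF-IDF weights are non-negative, scores between TF-IDF vectors fall in [0, 1] in practice, even though cosine similarity in general ranges over [-1, 1].

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

job = np.array([0.5, 0.5, 0.0])       # toy TF-IDF weights
resume_a = np.array([0.5, 0.5, 0.0])  # same direction as the job vector
resume_b = np.array([0.0, 0.0, 0.9])  # no shared terms with the job

print(round(cosine_similarity(job, resume_a), 4))  # 1.0
print(round(cosine_similarity(job, resume_b), 4))  # 0.0
```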

5. Ranking Resumes:

The algorithm organizes the resumes based on their cosine similarity scores in descending order. Each resume is assigned a rank corresponding to its position in the sorted list, with the highest-ranked resume being the one most closely aligned with the job description. This ranking facilitates the identification of top candidates and streamlines the recruitment process by prioritizing those with the best match to the job requirements.
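The ranking step reduces to a descending sort with 1-based rank assignment; the (name, score) pairs here are hypothetical.

```python
# Hypothetical (name, score) pairs as produced by the similarity step.
scores = [("Alice", 0.42), ("Bob", 0.77), ("Carol", 0.15)]

# Sort by similarity score, highest first, and assign 1-based ranks.
ranked = [
    (rank, name, score)
    for rank, (name, score) in enumerate(
        sorted(scores, key=lambda item: item[1], reverse=True), start=1
    )
]

print(ranked)
# [(1, 'Bob', 0.77), (2, 'Alice', 0.42), (3, 'Carol', 0.15)]
```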

6. CSV Output Generation:

To facilitate further analysis and reporting, the algorithm generates a CSV file named "ranked_resumes.csv". The CSV file includes columns for rank, name, email, and similarity score, allowing for easy access to detailed information about each ranked resume. By providing a structured format for the ranked resumes, the CSV output enhances the efficiency of subsequent recruitment activities and decision-making processes.
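Writing that report needs only the standard csv module. The rows and exact header wording are assumptions; the report specifies only the column contents (rank, name, email, similarity score).

```python
import csv

# Hypothetical ranked results (rank, name, email, similarity score).
rows = [
    (1, "Bob", "bob@example.com", 0.77),
    (2, "Alice", "alice@example.com", 0.42),
]

with open("ranked_resumes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Rank", "Name", "Email", "Similarity Score"])
    writer.writerows(rows)

# Read back the report: header plus two data rows.
with open("ranked_resumes.csv", newline="") as f:
    print(sum(1 for _ in csv.reader(f)))  # 3
```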

This detailed algorithm outlines the sequential steps involved in processing resumes, calculating their similarity to a given job description, ranking them accordingly, and generating a comprehensive CSV report for further analysis. By leveraging advanced techniques in text processing, vectorization, and similarity calculation, the algorithm offers a robust solution for automating resume ranking and streamlining the recruitment process with accuracy and efficiency.

CHAPTER 5

DESIGN AND IMPLEMENTATION

5.1 METHODOLOGIES

The resume ranking system is a web application built using the Flask framework in
Python. It leverages Flask's simplicity and flexibility for rapid development. The
system employs various text processing and natural language processing (NLP)
techniques. It utilizes the spaCy library to extract named entities like names and
emails from resume texts, while the PyPDF2 library is used to extract text from PDF
resumes. Regular expressions aid in identifying and extracting emails and names.
For vectorization and similarity calculation, the TfidfVectorizer from scikit-learn
converts job descriptions and resume texts into TF-IDF vectors, and cosine similarity
is employed to calculate their similarity.

The web interface is designed using HTML templates, allowing users to enter job
descriptions and upload resumes. Bootstrap integration enhances the interface's
aesthetics and responsiveness. Resumes are uploaded as PDF files, and upon
ranking, a CSV file containing the ranked results is dynamically generated. Users
can download this CSV file for further analysis. The system handles POST requests
for processing job descriptions and resumes, rendering ranked results dynamically
on the web interface. Users have the option to download ranked results in CSV
format.

During development, the system runs in debug mode to facilitate error detection and
debugging. It can be deployed locally for testing and development purposes. Overall,
the design and implementation of the resume ranking system utilize a combination
of libraries and frameworks to provide a user-friendly and efficient solution for
automating the recruitment process. By streamlining the evaluation of resumes based
on job descriptions, the system empowers HR professionals to make informed
decisions effectively.

5.1.1 DATA PREPROCESSING:

Data preprocessing is essential to ensure the accuracy and effectiveness of cosine
similarity calculations. One crucial step in data preprocessing is the removal of stop
words, which are common words that add little semantic value to the analysis. These
include words like "the," "and," "is," etc. Removing stop words helps in focusing on
the meaningful content of the text and reduces noise in the analysis. Additionally,
applying stemming or lemmatization techniques can further enhance the
preprocessing phase. Stemming involves reducing words to their root form by
removing suffixes, while lemmatization involves reducing words to their base or
dictionary form. By standardizing different variations of words to a common base,
stemming and lemmatization ensure consistency in the dataset and improve the
accuracy of cosine similarity calculations.

Another important aspect of data preprocessing is handling punctuation, special
characters, and numerical values. Depending on the specific application, it may be
beneficial to remove or retain these elements. For text analysis tasks like document
similarity, removing punctuation and special characters while retaining numerical
values might be appropriate to focus on the textual content. Furthermore, text
normalization techniques such as converting text to lowercase and handling
encoding issues ensure uniformity in the dataset and prevent discrepancies during
similarity calculations.

By incorporating these preprocessing steps, the dataset is cleansed and
standardized, enabling more accurate and meaningful cosine similarity
measurements. This enhances the effectiveness of cosine similarity in various text
analysis and comparison tasks, facilitating better decision-making and insights
generation.
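A minimal, dependency-free sketch of these preprocessing steps is shown below; the stop-word list and the plural-stripping rule are deliberately crude stand-ins for the spaCy or NLTK pipelines a real system would use:

```python
import re

# A tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "in", "to", "for", "with"}

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and crudely stem plurals.

    The suffix rule (drop a trailing 's') is a stand-in for proper stemming
    or lemmatization and will produce non-words like 'librarie'.
    """
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)  # tokenize, dropping punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens

print(preprocess("The candidate is skilled in Python and NLP libraries."))
# ['candidate', 'skilled', 'python', 'nlp', 'librarie']
```

In the actual system these steps would run before TF-IDF vectorization so that "library" and "libraries" count as the same term.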

5.1.2 FEATURE ENGINEERING:

Feature engineering is a critical step in the machine learning pipeline where raw data
is transformed into a format that is suitable for model training and analysis. In the
context of the system, feature engineering involves converting textual data from job
descriptions and resumes into numerical representations known as TF-IDF
vectors. The TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer
from the scikit-learn library is utilized for this purpose. TF-IDF vectors capture the
importance of terms (words or phrases) in documents relative to a corpus of
documents. This is achieved by computing two main components:

1. Term Frequency (TF): This measures the frequency of a term within a document.
Terms that occur more frequently within a document are assigned higher TF values.

2. Inverse Document Frequency (IDF): This measures the rarity of a term across the
entire corpus of documents. Terms that are rare across the corpus but frequent within
individual documents are assigned higher IDF values.

By combining TF and IDF, TF-IDF vectors highlight terms that are both important
within individual documents and distinctive across the corpus. This allows for the
effective comparison and ranking of resumes based on their similarity to job
descriptions.
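The two components can be made concrete with a small hand-rolled TF-IDF computation; note that scikit-learn's TfidfVectorizer, which the system actually uses, applies different smoothing and normalization, so exact values differ:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a tiny corpus of tokenized documents.

    TF is the raw count divided by document length; IDF is the plain
    log(N / df) form, so a term appearing in every document gets weight 0.
    """
    n = len(docs)
    df = Counter()  # document frequency: how many docs contain each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["python", "nlp", "python"], ["java", "nlp"]]
w = tfidf(docs)
# "python" appears only in doc 0, so it gets a positive weight there;
# "nlp" appears in both docs, so log(2/2) = 0 zeroes it out.
print(w[0]["python"] > 0, w[0]["nlp"] == 0.0)
```

This illustrates why TF-IDF favors terms that distinguish one resume from the rest of the pool rather than terms common to all of them.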

5.1.3 SIMILARITY MATCHING ALGORITHM:

The similarity matching algorithm determines the degree of similarity between job
descriptions and resumes based on their feature representations. In this system,
cosine similarity is used as the matching algorithm. Cosine similarity measures the
cosine of the angle between two vectors and is commonly employed in text similarity
tasks. By calculating the cosine similarity between TF-IDF vectors representing job
descriptions and resumes, the system determines how closely a resume matches a
given job description.
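Cosine similarity itself is straightforward to compute; the sketch below uses toy vectors over a shared vocabulary rather than real TF-IDF output:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 1.0 for identical directions, 0.0 for orthogonal vectors
    (or when either vector is all zeros).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy TF-IDF-style vectors over a shared vocabulary (illustrative values).
job_vec    = [0.8, 0.5, 0.0]
resume_vec = [0.6, 0.4, 0.3]
print(round(cosine_similarity(job_vec, resume_vec), 3))  # 0.923
```

Because the measure depends only on the angle between vectors, a long resume and a short one with the same term proportions score identically, which is desirable when resume lengths vary widely.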

5.1.4 EVALUATION EXTRACTION:

Evaluation extraction is a crucial step in assessing the performance of the system
and extracting meaningful insights from the ranked results. This process involves
evaluating various metrics or information derived from the system's output to gain a
deeper understanding of its effectiveness in matching job descriptions with resumes.

Metrics and Information:

1. Top-ranked Resumes: One key aspect of evaluation extraction is identifying the
top-ranked resumes for a given job description. These resumes are considered the
most suitable candidates based on their high similarity scores and alignment with the
job requirements. Analyzing the characteristics and qualifications of these top-ranked
candidates provides valuable insights into the system's ability to identify and
prioritize suitable candidates effectively.

2. Distribution of Similarity Scores: Evaluating the distribution of similarity scores
across all resumes provides insights into the overall quality of matches between job
descriptions and resumes. A skewed distribution towards higher similarity scores
indicates a higher proportion of well-matched candidates, while a broader
distribution may suggest varying degrees of alignment with the job requirements.
Understanding the distribution helps in assessing the overall performance and
reliability of the ranking algorithm.

3. Effectiveness of the Ranking Algorithm: Evaluation extraction also involves
assessing the effectiveness of the ranking algorithm in identifying suitable
candidates. This includes analyzing metrics such as precision, recall, and accuracy
to measure the algorithm's ability to correctly rank relevant resumes and minimize
false positives or negatives. Additionally, comparing the ranked results against
manually curated lists of suitable candidates or expert judgments provides valuable
validation of the system's performance.
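Precision and recall for a shortlist can be computed as sketched below, assuming a manually curated set of relevant candidates is available as ground truth (the candidate ids here are illustrative):

```python
def precision_recall(shortlisted, relevant):
    """Precision and recall of a shortlist against a ground-truth set.

    `shortlisted` holds the system's top-k candidate ids; `relevant` is a
    manually curated set of genuinely suitable candidates, as the text
    suggests using for validation.
    """
    shortlisted, relevant = set(shortlisted), set(relevant)
    tp = len(shortlisted & relevant)  # true positives
    precision = tp / len(shortlisted) if shortlisted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(shortlisted=["r1", "r2", "r3", "r4"],
                        relevant={"r1", "r3", "r5"})
# 2 of 4 shortlisted are relevant; 2 of 3 relevant candidates were found.
print(p, r)
```

Tracking both numbers matters: a very short shortlist can achieve high precision while missing most suitable candidates, which only recall exposes.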

Evaluation extraction plays a crucial role in guiding potential improvements and
optimizations to the system. By analyzing the extracted metrics and information,
stakeholders can identify strengths and weaknesses, refine algorithm parameters,
and prioritize enhancements to improve the system's overall performance and
reliability. Furthermore, evaluation extraction facilitates data-driven decision-making
in recruitment processes, enabling organizations to make informed decisions
based on objective assessments of candidate suitability and alignment with job
requirements. In short, evaluation extraction is essential for assessing the
effectiveness of the system, refining algorithmic approaches, and guiding
decision-making in recruitment and candidate selection. By leveraging relevant
metrics and insights extracted from the ranked results, stakeholders can continuously
improve the system's performance and enhance its value in streamlining recruitment
processes.

5.1.5 LIMITATIONS:

Further limitations of the system may include computational complexity and
scalability issues, particularly when processing large volumes of resumes or job
descriptions. Scalability concerns could arise from the overhead of text extraction,
feature engineering, and similarity calculations, requiring optimization strategies
such as parallel processing or distributed computing. Additionally, the system's
performance may vary across different industries or job roles, as the relevance of
keywords and features may differ. Ongoing maintenance and updates to
accommodate changes in recruitment practices, job market trends, and technology
advancements are also essential for mitigating limitations and ensuring the system's
effectiveness over time.

5.2 SYSTEM ARCHITECTURE

Figure 5.1 Flow chart


The flowchart depicts a systematic process starting with the user inputting a job
description and resumes. Upon submission, resumes are uploaded and undergo text
extraction using PyPDF2. Subsequently, spaCy is employed to extract relevant
entities such as names and emails from the resume texts. The job description and
resume texts are then transformed into TF-IDF vectors. These vectors facilitate the
computation of cosine similarity scores, measuring the degree of similarity between
the job description and each resume. Based on these scores, resumes are ranked to
identify the most relevant ones. The top-ranked resumes are then presented to the
user for consideration. Finally, users have the option to download a CSV file
containing the ranked resumes. This streamlined process enables efficient screening
and selection of candidates based on their alignment with the job requirements.

Figure 5.2 Shortlisting Criteria

The shortlisting criteria for this application are based on the similarity between a
provided job description and the content of submitted resumes. Resumes are first
processed to extract text using PyPDF2 and then further analyzed to identify relevant
entities such as names and emails using spaCy. The similarity between the job
description and each resume is calculated using the TF-IDF vectorization method
and cosine similarity metric. Resumes are ranked based on their similarity scores,
with higher scores indicating a better match to the job description. Finally, the top-
ranked resumes are presented to the user, facilitating efficient candidate selection
based on their alignment with the job requirements.

Figure 5.3 Architecture diagram
The architecture incorporates an evaluation extraction module as a crucial step in
assessing the system's performance and deriving meaningful insights from the
ranked results. Following the ranking of resumes, this module systematically
evaluates various metrics or information derived from the system's output to gain
deeper insights into its effectiveness in matching job descriptions with resumes.
Within the Flask web server, this module interfaces with the ranked results,
extracting relevant evaluation metrics such as precision, recall, and F1-score.
External libraries like NumPy and scikit-learn support the computation of these
metrics. This architecture ensures a comprehensive approach to evaluating the
system's effectiveness and provides stakeholders with valuable insights for
optimizing candidate selection processes.

CHAPTER 6

CONCLUSION AND FUTURE WORK

6.1 CONCLUSION
In conclusion, the implemented resume ranking system demonstrates an effective
solution for automating the evaluation of resumes based on job descriptions.
Through the integration of various technologies such as natural language processing
and machine learning, the system accurately assesses the relevance of candidates'
resumes to specific job requirements. By leveraging Flask as the web application
framework, the system provides a user-friendly interface for HR professionals to
input job descriptions, upload resumes, and receive ranked results. Moving forward,
there are opportunities for further enhancements and future work, including
exploring advanced NLP techniques, improving scalability and performance,
integrating with existing ATS platforms, and implementing interactive features for
user engagement.

Future work for this project includes several avenues for enhancement and
expansion:

1. Advanced NLP Techniques: Explore more sophisticated natural language
processing techniques to improve entity recognition, semantic analysis, and
understanding of resume content. This could involve deep learning models,
context-aware embeddings, and domain-specific language models.

2. Scalability and Performance Optimization: Enhance the system's scalability and
performance to handle large volumes of resumes and job descriptions efficiently.
This could involve optimizing algorithms, implementing parallel processing, and
utilizing cloud computing resources.
3. Integration with ATS Platforms: Integrate the resume ranking system with existing
Applicant Tracking Systems (ATS) used by organizations. This would allow for
seamless integration into HR workflows, automatic updating of candidate databases,
and enhanced collaboration among hiring teams.

4. Interactive Features for User Engagement: Implement interactive features such as
real-time feedback, candidate comparison tools, and personalized recommendations
to enhance user engagement and satisfaction. This could involve incorporating user
feedback mechanisms and analytics to improve system usability.

5. Bias Detection and Mitigation: Develop algorithms and strategies to detect and
mitigate bias in the resume ranking process. This could involve auditing the system
for fairness, implementing bias-aware models, and incorporating diversity and
inclusion metrics in the evaluation process.

6. Feedback Loop and Continuous Improvement: Establish a feedback loop
mechanism to gather input from HR professionals and recruiters regarding the
effectiveness of the system. Use this feedback to iteratively improve the system's
algorithms, user interface, and overall performance.

7. Integration of Additional Data Sources: Incorporate additional data sources such
as professional social media profiles, online portfolios, and skills assessment
platforms to enrich the candidate evaluation process and provide a more
comprehensive view of candidates' qualifications.

8. Multi-Language Support: Extend the system to support multiple languages to cater
to diverse candidate pools and global recruitment efforts. This could involve
incorporating multilingual NLP models and adapting algorithms to handle text in
different languages effectively.

APPENDICES
A) SAMPLE CODE

Resumeranker.py

from flask import Flask, render_template, request, send_file
import spacy
import PyPDF2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import os

app = Flask(__name__)

# Load the spaCy English model (available for NER-based extensions)
nlp = spacy.load("en_core_web_sm")

# Module-level store for the most recent ranking, shared with /download_csv
results = []

# Extract the raw text of every page of a PDF resume
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

# Extract candidate emails and a leading "First Last" name with regexes
def extract_entities(text):
    emails = re.findall(r'\S+@\S+', text)
    names = re.findall(r'^([A-Z][a-z]+)\s+([A-Z][a-z]+)', text)
    if names:
        names = [" ".join(names[0])]
    return emails, names

@app.route('/', methods=['GET', 'POST'])
def index():
    global results  # share the ranking with the CSV download route
    results = []
    if request.method == 'POST':
        job_description = request.form['job_description']
        resume_files = request.files.getlist('resume_files')

        # Create a directory for uploads if it doesn't exist
        if not os.path.exists("uploads"):
            os.makedirs("uploads")

        # Save each uploaded resume, then extract its text and entities
        processed_resumes = []
        for resume_file in resume_files:
            resume_path = os.path.join("uploads", resume_file.filename)
            resume_file.save(resume_path)
            resume_text = extract_text_from_pdf(resume_path)
            emails, names = extract_entities(resume_text)
            processed_resumes.append((names, emails, resume_text))

        # Fit the TF-IDF vectorizer on the job description
        # (its vocabulary therefore comes from the job description alone)
        tfidf_vectorizer = TfidfVectorizer()
        job_desc_vector = tfidf_vectorizer.fit_transform([job_description])

        # Score each resume by cosine similarity, expressed as a percentage
        ranked_resumes = []
        for (names, emails, resume_text) in processed_resumes:
            resume_vector = tfidf_vectorizer.transform([resume_text])
            similarity = cosine_similarity(job_desc_vector, resume_vector)[0][0] * 100
            ranked_resumes.append((names, emails, similarity))

        # Sort resumes by similarity score, best match first
        ranked_resumes.sort(key=lambda x: x[2], reverse=True)

        results = ranked_resumes

    return render_template('index.html', results=results)

@app.route('/download_csv')
def download_csv():
    # Generate the CSV content from the most recent ranking
    csv_content = "Rank,Name,Email,Similarity\n"
    for rank, (names, emails, similarity) in enumerate(results, start=1):
        name = names[0] if names else "N/A"
        email = emails[0] if emails else "N/A"
        csv_content += f"{rank},{name},{email},{similarity}\n"

    # Write the CSV to disk, then send it for download
    csv_filename = "ranked_resumes.csv"
    with open(csv_filename, "w") as csv_file:
        csv_file.write(csv_content)

    csv_full_path = os.path.join(os.path.abspath(os.path.dirname(__file__)), csv_filename)
    return send_file(csv_full_path, as_attachment=True, download_name="ranked_resumes.csv")

if __name__ == '__main__':
    app.run(debug=True)

app.py

import spacy
import PyPDF2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import csv

csv_filename = "ranked_resumes.csv"

# Loaded for NER support; the entity extraction below uses regexes
nlp = spacy.load("en_core_web_sm")

job_description = ("NLP Specialist: Develop and implement NLP algorithms. "
                   "Proficiency in Python, NLP libraries, and ML frameworks required.")

resume_paths = ["resume1.pdf", "resume2.pdf", "resume3.pdf"]

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

def extract_entities(text):
    emails = re.findall(r'\S+@\S+', text)
    names = re.findall(r'^([A-Z][a-z]+)\s+([A-Z][a-z]+)', text)
    if names:
        names = [" ".join(names[0])]
    return emails, names

# Vectorize the job description; resumes are projected onto its vocabulary
tfidf_vectorizer = TfidfVectorizer()
job_desc_vector = tfidf_vectorizer.fit_transform([job_description])

# Score each resume by cosine similarity to the job description
ranked_resumes = []
for resume_path in resume_paths:
    resume_text = extract_text_from_pdf(resume_path)
    emails, names = extract_entities(resume_text)
    resume_vector = tfidf_vectorizer.transform([resume_text])
    similarity = cosine_similarity(job_desc_vector, resume_vector)[0][0]
    ranked_resumes.append((names, emails, similarity))

# Sort best match first and print the ranking
ranked_resumes.sort(key=lambda x: x[2], reverse=True)

for rank, (names, emails, similarity) in enumerate(ranked_resumes, start=1):
    print(f"Rank {rank}: Names: {names}, Emails: {emails}, Similarity: {similarity:.2f}")

# Write the ranked results to a CSV report
with open(csv_filename, "w", newline="") as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(["Rank", "Name", "Email", "Similarity"])
    for rank, (names, emails, similarity) in enumerate(ranked_resumes, start=1):
        name = names[0] if names else "N/A"
        email = emails[0] if emails else "N/A"
        csv_writer.writerow([rank, name, email, similarity])

index.html (template)

<!DOCTYPE html>
<html>

<head>
    <title>Resume Analyzer</title>
    <link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}" id="theme-style">
    <style>
        body.dark-mode {
            background-color: #1a1a1a;
            color: #ffffff;
        }
    </style>
    <script>
        function toggleDarkMode() {
            const body = document.body;
            const themeStyleLink = document.getElementById('theme-style');

            // Swap the stylesheet once the current transition has finished;
            // { once: true } removes the listener automatically afterwards
            themeStyleLink.addEventListener('transitionend', () => {
                themeStyleLink.href = body.classList.contains('dark-mode')
                    ? "{{ url_for('static', filename='styles.css') }}"
                    : "{{ url_for('static', filename='dark-theme.css') }}";
            }, { once: true });

            // Toggle the dark mode class
            body.classList.toggle('dark-mode');
        }
    </script>
</head>

<body>
    <style>
        body {
            background-image: url('img1.jpg');
        }
    </style>
    <center>
        <img src="https://ideogram.ai/api/images/direct/qtam5-HIR62mza3EqF_FPQ.jpg"
             width="150" height="150" alt="Resume Analyzer logo">
        <h1>Resume Analyzer</h1>
    </center>
    <label id="dark-mode-toggle-label" for="dark-mode-toggle">
        <input type="checkbox" id="dark-mode-toggle" onchange="toggleDarkMode()">
        <div id="dark-mode-toggle-slider"></div>
    </label>

    <form action="/" method="post" enctype="multipart/form-data">
        <label for="job_description">Job Description:</label>
        <textarea name="job_description" rows="5" cols="40" required></textarea>
        <br>
        <label for="resume_files">Upload Resumes (PDF):</label>
        <input type="file" name="resume_files" accept=".pdf" multiple required>
        <br>
        <input type="submit" value="Analyze Resumes">
    </form>
    <br>
    {% if results %}
    <h2>Ranked Resumes:</h2>
    <table>
        <tr>
            <th>Rank</th>
            <th>Name</th>
            <th>Email</th>
            <th>Similarity in %</th>
        </tr>
        {% for result in results %}
        <tr>
            <td>{{ loop.index }}</td>
            <td>{{ result[0][0] }}</td>
            <td>{{ result[1][0] }}</td>
            <td>{{ result[2] }}</td>
        </tr>
        {% endfor %}
    </table>
    <br>
    <a href="{{ url_for('download_csv') }}" download="ranked_resumes.csv" class="download-link">
        Download CSV
    </a>
    {% endif %}
</body>
</html>

B) SCREENSHOTS

Frontend UI

Job description

Analysis 1

Analysis 2

REFERENCES

1. Sinha, A.K., Amir Khusru Akhtar, M., Kumar, A. (2021). Resume Screening
Using Natural Language Processing and Machine Learning: A Systematic Review.
In: Swain, D., Pattnaik, P.K., Athawale, T. (eds) Machine Learning and
Information Processing. Advances in Intelligent Systems and Computing, vol
1311. Springer, Singapore. https://doi.org/10.1007/978-981-33-4859-2_21

2. Alexandra, C., Valentin, S., Bogdan, M., Magdalena, A.: Leveraging lexicon-
based semantic analysis to automate the recruitment process. In: Ao, S.-I., Gelman,
L., Kim, H.K. (eds.) Transactions on Engineering Technologies (Springer,
Singapore, 2019), pp. 189–20

3. Reynar, J.C., Ratnaparkhi, A.: A maximum entropy approach to identifying
sentence boundaries. In: Proceedings of the Fifth Conference on Applied Natural
Language Processing (Association for Computational Linguistics, 1997), pp. 16–19

4. A. Kumar, Design of secure image fusion technique using cloud for privacy-
preserving and copyright protection. Int. J. Cloud Appl. Comput. IJCAC 9, 22–36
(2019).

5. Valdez-Almada et al.: "Natural Language Processing and Text Mining to Identify
Knowledge Profiles for Software Engineering Positions: Generating Knowledge
Profiles from Resumes." 2017 5th International Conference in Software
Engineering Research and Innovation (CONISOFT) (2017): 97–106.

6. Kumaran, V.S., Sankar, A.: International Journal of Metadata, Semantics and
Ontologies (2013). doi: 10.1504/IJMSO.2013.054184

7. Ellen Riloff, David Chiang, Julia Hockenmaier, Jun'ichi Tsujii: Proceedings of
the 2018 Conference on Empirical Methods in Natural Language Processing,
Brussels, Belgium, October 31 - November 4, 2018. Association for
Computational Linguistics 2018, ISBN 978-1-948087-84-1

8. Bhandari, M., Gour, P.N., Ashfaq, A., Liu, P., Neubig, G.: Re-evaluating
Evaluation in Text Summarization. In: Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP 2020).

9. Li, M., Chen, X., Gao, S., Chan, Z., Zhao, D., Yan, R.: VMSMO: Learning to
Generate Multimodal Summary for Video-based News Articles. In: Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP 2020).
