Sneha_Report
A PROJECT REPORT
Submitted by
SNEHA T (1920102128)
IN
SONA COLLEGE OF TECHNOLOGY
(An Autonomous Institution; Affiliated to Anna University, Chennai -600 025)
SIGNATURE
Dr. B. SATHIYABHAMA, B.E., M.Tech., Ph.D.
HEAD OF THE DEPARTMENT
Professor
Department of Computer Science and Engineering
Sona College of Technology,
Salem – 636005

SIGNATURE
Ms. R. SOWNDHARYA, M.E.
SUPERVISOR
Assistant Professor
Department of Computer Science and Engineering
Sona College of Technology,
Salem – 636005
CONFERENCE CERTIFICATE
CONFERENCE PROCEEDINGS PAGE
ACKNOWLEDGMENT
I would like to thank my parents and all my friends from the bottom of my
heart; they were always there to help me and make this project a success. Last
but not least, I would like to express my heartfelt thanks and gratitude to
almighty God for His divine blessings, which helped me complete the final year
project successfully. This project was made possible by the inestimable inputs
of everyone involved, directly or indirectly.
ABSTRACT
This project introduces a smart resume sorting system aimed at addressing the
time-intensive task of manual resume screening and delivering effective
solutions tailored to the dynamic nature of today's job market. The system not
only streamlines candidate shortlisting but also upholds the values of privacy
and security. By harnessing advanced natural language processing and machine
learning techniques, it ranks submitted resumes according to their relevance to
a given job description.
vi
TABLE OF CONTENTS

ABSTRACT
LIST OF ABBREVIATIONS
LIST OF FIGURES
1 INTRODUCTION
1.1 GENERAL
1.2 OBJECTIVE
1.3 HISTORY
1.4 CHALLENGES
1.5 APPLICATIONS
2 LITERATURE SURVEY
3 SYSTEM SPECIFICATION
4 SYSTEM ANALYSIS
4.3 ALGORITHM
5 DESIGN AND IMPLEMENTATION
5.1 METHODOLOGIES
5.1.5 LIMITATIONS
APPENDICES
SAMPLE CODE
SCREENSHOTS
REFERENCES
LIST OF ABBREVIATIONS
UI User Interface
HR Human Resources
ML Machine Learning
NLP Natural Language Processing
CV Curriculum Vitae
TF-IDF Term Frequency-Inverse Document Frequency
ix
LIST OF FIGURES
Figure 5.2 Shortlisting Criteria
Figure 5.3 Architecture diagram
CHAPTER 1
INTRODUCTION
1.1 GENERAL
The realm of Human Resources (HR) confronts an enduring challenge:
the time-intensive task of scrutinizing numerous resumes to pinpoint suitable
candidates for job openings. Manual resume screening is not only laborious but also
prone to errors, presenting a common obstacle for recruiters. In response, this project
proposes a web-based system aimed at automating the initial phase of recruitment
by shortlisting resumes based on their relevance to specific job descriptions.
Through the utilization of cutting-edge natural language processing (NLP) and
machine learning techniques, this endeavor seeks to streamline the resume screening
process. This report delineates the design and development of a web application
tailored to automate resume shortlisting, elucidating its functionalities, technological
underpinnings, and advantages for recruiters. The project aspires to revolutionize
recruitment practices by mitigating the inefficiencies inherent in traditional resume
screening methods, fostering enhanced accuracy, streamlined processing, and
swifter decision-making. By setting the stage for an examination of the system's
technical architecture and its potential to reshape HR workflows, this introduction
paves the way for a comprehensive analysis of its capability to enhance efficiency
to unprecedented heights.
1.2 OBJECTIVE
The objective of this project is to develop a web-based system that automates the
initial stage of the recruitment process by leveraging natural language processing
(NLP) and machine learning techniques. Specifically, the system aims to rank
resumes based on their relevance to a given job description, thereby facilitating the
shortlisting process for recruiters. By automating resume screening and ranking, the
project seeks to reduce the time and effort required for manual assessment, while
also enhancing the accuracy and efficiency of candidate selection. The ultimate goal
is to provide HR professionals with a powerful tool that streamlines the recruitment
process, expedites decision-making, and improves the overall quality of candidate
selection. Through the integration of advanced technology and tailored algorithms,
the project aims to address the challenges associated with traditional resume
screening methods and deliver a scalable, user-friendly solution for optimizing
recruitment outcomes.
1.3 HISTORY
The history of this project stems from the recognition of the persistent challenges
faced by recruiters in the time-intensive process of manually screening resumes.
Traditional methods of resume assessment have proven labor-intensive and error-
prone, prompting the exploration of automated solutions to streamline recruitment
workflows. As the demand for efficient recruitment processes continues to rise, there
has been a growing emphasis on leveraging technological advancements, such as
natural language processing (NLP) and machine learning, to improve candidate
selection efficiency. This project builds upon previous research and developments in
the field of automated resume screening, aiming to address existing limitations and
provide a scalable solution that meets the evolving needs of HR professionals.
1.4 CHALLENGES
1.5 APPLICATIONS
CHAPTER 2
LITERATURE SURVEY
The ever-growing volume of resumes received for a single job opening has created
a significant challenge for Human Resources (HR) professionals. Traditional
methods on job boards often require extensive hours dedicated to meticulous
candidate assessment and recruitment. To address this challenge, research efforts
have focused on automating the resume screening process using Natural Language
Processing (NLP) and Machine Learning (ML) techniques.
This review examines several relevant studies that explore the application of NLP
and ML for resume analysis and its impact on the recruitment process.
These studies collectively highlight the potential of NLP and ML for automating
resume screening and enhancing the efficiency of the recruitment process. However,
it is important to acknowledge that existing research also explores other techniques
beyond NLP and ML.
The reviewed literature demonstrates the promise of NLP and ML for automating
resume screening and streamlining the recruitment process. By leveraging
techniques like named entity recognition, text classification, and semantic analysis,
these systems can extract valuable information from resumes and facilitate data-
driven candidate evaluation. However, it is crucial to consider complementary
techniques like sentence boundary identification and data anonymization for a robust
and secure system. Future research can explore further integration of NLP and ML
with existing HR workflows to create a more comprehensive and efficient
recruitment experience.
CHAPTER 3
SYSTEM SPECIFICATION
• Operating System: The code should run on most major operating systems
that support Python 3.x, including Windows, macOS, and Linux.
• Programming Language: Python 3.x is the core language used for the
application. A Python 3.x interpreter is necessary to execute the code.
• Libraries: Several Python libraries are crucial for the application's
functionalities:
o Flask: This web framework provides the foundation for building the
web application.
o spaCy: This library is used for Natural Language Processing (NLP)
tasks, particularly named entity recognition, which helps identify
names and emails within resumes.
o PyPDF2: This library facilitates the extraction of text data from PDF
resumes.
o scikit-learn (sklearn): This machine learning library provides tools for
TF-IDF vectorization and calculating cosine similarity, used to assess
the degree of similarity between job descriptions and resumes.
o re: The re library offers regular expression functionalities used for basic
email and name extraction from the text data.
o csv: This library allows the system to work with data in CSV (comma-
separated values) format, which is useful for generating reports on the
ranked resumes.
The code should run on most major operating systems (Windows, macOS, and
Linux) that support Python 3.x. However, to ensure proper functionality, the
third-party libraries above (Flask, spaCy, PyPDF2, scikit-learn) need to be
installed; re and csv ship with Python's standard library. The third-party
packages can be installed with pip, for example: pip install Flask spacy
PyPDF2 scikit-learn (note that the PyPI package for sklearn is named
scikit-learn). The English model that the code loads can then be fetched with
python -m spacy download en_core_web_sm.
It's important to note that the code utilizes basic regular expressions for email and
name extraction. While this might work for simple cases, it might not be entirely
reliable for production use. For a more robust system, implementing more
sophisticated techniques for entity recognition would be advisable.
Finally, the code defines file paths for storing uploaded resumes within an
"uploads" directory. Make sure this directory exists on your system before
running the application (the sample code creates it automatically if it is
missing).
CHAPTER 4
SYSTEM ANALYSIS
In light of these challenges, adopting a hybrid approach that seamlessly integrates
automation with human oversight is imperative. By combining the strengths of
automated algorithms with the nuanced judgment and decision-making capabilities
of human recruiters, organizations can mitigate the risks of bias and ensure a more
comprehensive and reliable recruitment process. Moreover, continuous
improvement efforts, including regular updates to algorithms and protocols, are
essential to adapt to evolving recruitment trends and maintain the system's
effectiveness and relevance in a dynamic job market landscape.
This project proposes a novel resume ranking system that leverages a hybrid
approach. The system combines the strengths of automated processing with the
valuable insights of human reviewers to create a well-rounded and reliable
recruitment process.
Hybrid Approach:
The system integrates automated resume parsing and candidate screening with
human oversight. This allows the system to efficiently process large volumes of
resumes with the accuracy and nuanced judgment of human reviewers playing a vital
role in the final selection process.
Enhanced NLP and NER:
The system goes beyond basic models by exploring sophisticated machine learning
approaches, including neural networks. These models hold the potential to
significantly enhance prediction accuracy and generalization capabilities. This
translates to a more reliable ranking of candidates based on their suitability for the
job description.
Continuous Improvement:
By combining these elements, the proposed system aims to create a more efficient,
accurate, and fair resume ranking process that leverages the strengths of both
automation and human expertise.
4.3 ALGORITHM
1. Text Extraction:
The algorithm starts by parsing each provided resume PDF file using the
PyPDF2 library. It iterates through each page of the PDF, extracting the text
content and concatenating it into a single string that represents the entire
resume.
2. Entity Extraction:
3. Job Description and Resume Vectorization:
Once the vectorizer is fitted, each resume text is transformed into a TF-IDF
vector representation. During this transformation, the algorithm calculates
the TF-IDF value for each term in the resume text, considering both its
frequency within the resume and its rarity across all resumes. This results
in a high-dimensional vector in which each component represents the importance
of a specific term in the context of the job description and the resume.

The TF-IDF vectorization process captures the semantic relevance of terms in
both the job description and the resumes, allowing the algorithm to quantify
the degree of similarity between them based on the overlap and significance
of shared terms. By representing textual data in a numerical format, TF-IDF
vectors enable the algorithm to perform mathematical operations, such as
cosine similarity calculation, for comparing and ranking resumes by their
compatibility with the job requirements. In short, the vectorization step
transforms textual data into a structured numerical representation,
facilitating meaningful comparison and analysis for effective resume ranking
and candidate selection.
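A minimal sketch of this vectorization step (the job-description and resume strings below are invented placeholders; fitting the vectorizer on the job description alone mirrors the sample code in the appendix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

job_description = "python developer with flask and machine learning experience"
resume_text = "experienced python developer, built flask web applications"

# Fit the vocabulary on the job description, then map the resume into the
# same term space so the two vectors are directly comparable.
tfidf_vectorizer = TfidfVectorizer()
job_desc_vector = tfidf_vectorizer.fit_transform([job_description])
resume_vector = tfidf_vectorizer.transform([resume_text])

# Both vectors share one dimension per vocabulary term.
print(job_desc_vector.shape == resume_vector.shape)  # True
```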
4. Cosine Similarity Calculation:
Cosine similarity measures the cosine of the angle between two vectors and
yields a value between -1 and 1 indicating how similar the vectors are; for
non-negative TF-IDF vectors the score falls between 0 and 1. In the context
of document comparison, cosine similarity is computed between the TF-IDF
vectors of the job description and each resume.

A score of 1 indicates perfect similarity: the vectors point in the same
direction, implying identical term content in the two documents. Higher
cosine similarity scores between the job description and a resume denote
closer alignment with the job requirements, while lower scores indicate less
relevance. In summary, cosine similarity provides a robust and intuitive
method for quantifying the similarity between textual documents, enabling
effective comparison and ranking in many natural language processing tasks,
including resume screening and recruitment.
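A small illustration with invented documents; because TF-IDF vectors are non-negative, the scores here fall between 0 and 1:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "python developer with flask experience",   # stand-in job description
    "python developer with flask experience",   # identical resume
    "accountant with payroll background",       # largely unrelated resume
]
vectors = TfidfVectorizer().fit_transform(docs)

identical = cosine_similarity(vectors[0], vectors[1])[0][0]
unrelated = cosine_similarity(vectors[0], vectors[2])[0][0]

print(round(identical, 2))    # 1.0 for identical documents
print(unrelated < identical)  # True: less term overlap gives a lower score
```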
5. Ranking Resumes:
The algorithm sorts the resumes by their cosine similarity scores in
descending order. Each resume is assigned a rank corresponding to its position
in the sorted list, with the highest-ranked resume being the one most closely
aligned with the job description. This ranking facilitates the identification
of top candidates and streamlines the recruitment process by prioritizing
those with the best match to the job requirements.

6. CSV Report Generation:
To facilitate further analysis and reporting, the algorithm generates a CSV
file named "ranked_resumes.csv". The file includes columns for rank, name,
email, and similarity score, allowing easy access to detailed information
about each ranked resume. By providing a structured format for the ranked
results, the CSV output supports subsequent recruitment activities and
decision-making.
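The sorting and report-writing steps can be sketched with invented candidate tuples (the names, emails, and similarity scores below are hypothetical):

```python
import csv
import io

# Hypothetical (names, emails, similarity) tuples as produced by the ranker.
ranked_resumes = [
    (["Asha"], ["asha@example.com"], 48.2),
    (["Ravi"], ["ravi@example.com"], 91.7),
    (["Mina"], [], 63.5),
]

# Sort by similarity score, highest first.
ranked_resumes.sort(key=lambda item: item[2], reverse=True)

# Write the report in the same column layout as ranked_resumes.csv.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Rank", "Name", "Email", "Similarity"])
for rank, (names, emails, similarity) in enumerate(ranked_resumes, start=1):
    writer.writerow([rank,
                     names[0] if names else "N/A",
                     emails[0] if emails else "N/A",
                     similarity])

csv_content = buffer.getvalue()
print(csv_content.splitlines()[1])  # the rank-1 row is the closest match
```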
This algorithm outlines the sequential steps involved in processing resumes,
calculating their similarity to a given job description, ranking them
accordingly, and generating a CSV report for further analysis. By leveraging
techniques in text processing, vectorization, and similarity calculation, it
offers a robust solution for automating resume ranking and streamlining the
recruitment process with accuracy and efficiency.
CHAPTER 5
DESIGN AND IMPLEMENTATION
5.1 METHODOLOGIES
The resume ranking system is a web application built using the Flask framework in
Python. It leverages Flask's simplicity and flexibility for rapid development. The
system employs various text processing and natural language processing (NLP)
techniques. It utilizes the spaCy library to extract named entities like names and
emails from resume texts, while the PyPDF2 library is used to extract text from PDF
resumes. Regular expressions aid in identifying and extracting emails and names.
For vectorization and similarity calculation, the TfidfVectorizer from scikit-learn
converts job descriptions and resume texts into TF-IDF vectors, and cosine similarity
is employed to calculate their similarity.
The web interface is designed using HTML templates, allowing users to enter job
descriptions and upload resumes. Bootstrap integration enhances the interface's
aesthetics and responsiveness. Resumes are uploaded as PDF files, and upon
ranking, a CSV file containing the ranked results is dynamically generated. Users
can download this CSV file for further analysis. The system handles POST requests
for processing job descriptions and resumes, rendering ranked results dynamically
on the web interface. Users have the option to download ranked results in CSV
format.
During development, the system runs in debug mode to facilitate error
detection and debugging; debug mode should be disabled in any production
deployment. The application can be deployed locally for testing and
development purposes. Overall,
the design and implementation of the resume ranking system utilize a combination
of libraries and frameworks to provide a user-friendly and efficient solution for
automating the recruitment process. By streamlining the evaluation of resumes based
on job descriptions, the system empowers HR professionals to make informed
decisions effectively.
Preprocessing steps such as converting text to lowercase and handling
encoding issues ensure uniformity in the dataset and prevent discrepancies
during similarity calculations.
Feature engineering is a critical step in the machine learning pipeline where raw data
is transformed into a format that is suitable for model training and analysis. In the
context of the system, feature engineering involves converting textual data from job
descriptions and resumes into numerical representations known as TF-IDF
vectors. The TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer
from the scikit-learn library is utilized for this purpose. TF-IDF vectors capture the
importance of terms (words or phrases) in documents relative to a corpus of
documents. This is achieved by computing two main components:
1. Term Frequency (TF): This measures the frequency of a term within a document.
Terms that occur more frequently within a document are assigned higher TF values.
2. Inverse Document Frequency (IDF): This measures the rarity of a term across the
entire corpus of documents. Terms that are rare across the corpus but frequent within
individual documents are assigned higher IDF values.
By combining TF and IDF, TF-IDF vectors highlight terms that are both important
within individual documents and distinctive across the corpus. This allows for the
effective comparison and ranking of resumes based on their similarity to job
descriptions.
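The effect described above can be made visible on a toy corpus (the documents are invented for illustration): with scikit-learn's TfidfVectorizer, a term that appears in every document receives a lower weight than an equally frequent term unique to one document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "python" appears in every document; "kubernetes" only in the first.
corpus = [
    "python kubernetes",
    "python testing",
    "python reporting",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)
vocab = vectorizer.vocabulary_

# Weights for the first document: the rare term outweighs the ubiquitous one.
weights = matrix[0].toarray()[0]
print(weights[vocab["kubernetes"]] > weights[vocab["python"]])  # True
```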
The similarity matching algorithm determines the degree of similarity between job
descriptions and resumes based on their feature representations. In this system,
cosine similarity is used as the matching algorithm. Cosine similarity measures the
cosine of the angle between two vectors and is commonly employed in text similarity
tasks. By calculating the cosine similarity between TF-IDF vectors representing job
descriptions and resumes, the system determines how closely a resume matches a
given job description.
Evaluation of the ranked candidates provides valuable insights into the
system's ability to identify and prioritize suitable candidates effectively.
5.1.5 LIMITATIONS:
5.2 SYSTEM ARCHITECTURE
Figure 5.2 Shortlisting Criteria
The shortlisting criteria for this application are based on the similarity between a
provided job description and the content of submitted resumes. Resumes are first
processed to extract text using PyPDF2 and then further analyzed to identify relevant
entities such as names and emails using spaCy. The similarity between the job
description and each resume is calculated using the TF-IDF vectorization method
and cosine similarity metric. Resumes are ranked based on their similarity scores,
with higher scores indicating a better match to the job description. Finally, the top-
ranked resumes are presented to the user, facilitating efficient candidate selection
based on their alignment with the job requirements.
Figure 5.3 Architecture diagram
The architecture incorporates an evaluation extraction module as a crucial step in
assessing the system's performance and deriving meaningful insights from the
ranked results. Following the ranking of resumes, this module systematically
evaluates various metrics or information derived from the system's output to gain
deeper insights into its effectiveness in matching job descriptions with resumes.
Within the Flask web server, this module interfaces with the ranked results,
extracting relevant evaluation metrics such as precision, recall, and F1-score.
External libraries like NumPy and scikit-learn support the computation of these
metrics. This architecture ensures a comprehensive approach to evaluating the
system's effectiveness and provides stakeholders with valuable insights for
optimizing candidate selection processes.
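Assuming ground-truth suitability labels are available for a set of ranked resumes (the labels below are hypothetical), the evaluation metrics named above can be computed with scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth (1 = genuinely suitable candidate) and the
# system's shortlisting decision for eight ranked resumes.
actual    = [1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

precision = precision_score(actual, predicted)  # of those shortlisted, how many were suitable
recall = recall_score(actual, predicted)        # of the suitable, how many were shortlisted
f1 = f1_score(actual, predicted)                # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))
```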
CHAPTER 6
6.1 CONCLUSION
In conclusion, the implemented resume ranking system demonstrates an effective
solution for automating the evaluation of resumes based on job descriptions.
Through the integration of various technologies such as natural language processing
and machine learning, the system accurately assesses the relevance of candidates'
resumes to specific job requirements. By leveraging Flask as the web application
framework, the system provides a user-friendly interface for HR professionals to
input job descriptions, upload resumes, and receive ranked results. Moving
forward, there are opportunities for further enhancements and future work,
including exploring advanced NLP techniques, improving scalability and
performance, integrating with existing applicant tracking system (ATS)
platforms, and implementing interactive features for user engagement.
Future work for this project includes several avenues for enhancement and
expansion:
5. Bias Detection and Mitigation: Develop algorithms and strategies to detect and
mitigate bias in the resume ranking process. This could involve auditing the system
for fairness, implementing bias-aware models, and incorporating diversity and
inclusion metrics in the evaluation process.
APPENDICES
A) SAMPLE CODE
Resumeranker.py
# Imports required by the excerpts below
import os

import PyPDF2
from flask import Flask, render_template, send_file
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

app = Flask(__name__)
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text
# Create a directory for uploads if it doesn't exist
if not os.path.exists("uploads"):
    os.makedirs("uploads")

# TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()
job_desc_vector = tfidf_vectorizer.fit_transform([job_description])
# Rank resumes based on similarity
ranked_resumes = []
for (names, emails, resume_text) in processed_resumes:
    resume_vector = tfidf_vectorizer.transform([resume_text])
    similarity = cosine_similarity(job_desc_vector, resume_vector)[0][0] * 100
    ranked_resumes.append((names, emails, similarity))

results = ranked_resumes
return render_template('index.html', results=results)
@app.route('/download_csv')
def download_csv():
    # Generate the CSV content
    csv_content = "Rank,Name,Email,Similarity\n"
    for rank, (names, emails, similarity) in enumerate(results, start=1):
        name = names[0] if names else "N/A"
        email = emails[0] if emails else "N/A"
        csv_content += f"{rank},{name},{email},{similarity}\n"
    csv_full_path = os.path.join(os.path.abspath(os.path.dirname(__file__)), csv_filename)
    return send_file(csv_full_path, as_attachment=True, download_name="ranked_resumes.csv")
if __name__ == '__main__':
    app.run(debug=True)
app.py
import re
import csv

import spacy
import PyPDF2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

csv_filename = "ranked_resumes.csv"
nlp = spacy.load("en_core_web_sm")
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text
def extract_entities(text):
    # Emails via a regular expression; names via spaCy PERSON entities
    # (basic extraction, as described in the system specification)
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
    doc = nlp(text)
    names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    return emails, names


tfidf_vectorizer = TfidfVectorizer()
job_desc_vector = tfidf_vectorizer.fit_transform([job_description])

ranked_resumes = []
for resume_path in resume_paths:
    resume_text = extract_text_from_pdf(resume_path)
    emails, names = extract_entities(resume_text)
    resume_vector = tfidf_vectorizer.transform([resume_text])
    similarity = cosine_similarity(job_desc_vector, resume_vector)[0][0]
    ranked_resumes.append((names, emails, similarity))
index.html (template)
<!DOCTYPE html>
<html>
<head>
<title>Resume Analyzer</title>
    <link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}" id="theme-style">
<style>
body.dark-mode {
background-color: #1a1a1a;
color: #ffffff;
}
</style>
    <script>
        function toggleDarkMode() {
            const body = document.body;
            const themeStyleLink = document.getElementById('theme-style');
            body.classList.toggle('dark-mode');
            themeStyleLink.addEventListener('transitionend', function handler() {
                // Update the theme link after the transition is complete
                themeStyleLink.removeEventListener('transitionend', handler);
                themeStyleLink.href = body.classList.contains('dark-mode')
                    ? "{{ url_for('static', filename='dark-theme.css') }}"
                    : "{{ url_for('static', filename='styles.css') }}";
            });
        }
    </script>
</head>
<body>
    <style>
        body {
            background-image: url('img1.jpg');
        }
    </style>
    <center>
        <img src="https://ideogram.ai/api/images/direct/qtam5-HIR62mza3EqF_FPQ.jpg"
             width="150" height="150" alt="Flowers in Chania">
        <h1>Resume Analyzer</h1>
    </center>
    <label id="dark-mode-toggle-label" for="dark-mode-toggle">
        <input type="checkbox" id="dark-mode-toggle" onchange="toggleDarkMode()">
        <div id="dark-mode-toggle-slider"></div>
    </label>
            Download CSV
        </a>
        {% endif %}
    {% endif %}
</body>
</html>
B) SCREENSHOTS
Frontend UI
Job description
Analysis 1
Analysis 2
REFERENCES
1. Sinha, A.K., Amir Khusru Akhtar, M., Kumar, A. (2021). Resume Screening
Using Natural Language Processing and Machine Learning: A Systematic Review.
In: Swain, D., Pattnaik, P.K., Athawale, T. (eds) Machine Learning and
Information Processing. Advances in Intelligent Systems and Computing, vol
1311. Springer, Singapore. https://doi.org/10.1007/978-981-33-4859-2_21
2. Alexandra, C., Valentin, S., Bogdan, M., Magdalena, A.: Leveraging lexicon-
based semantic analysis to automate the recruitment process. In: Ao, S.-I., Gelman,
L., Kim, H.K. (eds.) Transactions on Engineering Technologies (Springer,
Singapore, 2019), pp. 189–20
4. A. Kumar, Design of secure image fusion technique using cloud for privacy-
preserving and copyright protection. Int. J. Cloud Appl. Comput. IJCAC 9, 22–36
(2019).
7. Ellen Riloff, David Chiang, Julia Hockenmaier, Jun'ichi Tsujii: Proceedings of
the 2018 Conference on Empirical Methods in Natural Language Processing,
Brussels, Belgium, October 31 - November 4, 2018. Association for
Computational Linguistics 2018, ISBN 978-1-948087-84-1