Abstract - Automated resume classification has become increasingly important as recruitment processes rely on intensive resume analysis. Resumes arrive in DOC and PDF formats, with PDF the most common, so a reliable PDF extraction method is required to standardize the data for processing. Rule-based extraction and keyword-matching algorithms are difficult to generalize across resume formats, and hence they sometimes produce inferior results. The proposed approach uses a Large Language Model (LLM), with its ability to understand and process natural language, to address the challenge of obtaining organized resume information and connecting it with job descriptions. The biggest challenges are managing multiple formats; acquiring experience data, dates, and firm names; and guaranteeing Resume-Job Description (JD) relevance. We use transformer-based embeddings for Resume-JD alignment, customized Named Entity Recognition (NER) for structured data extraction, and pdfplumber for PDF extraction. Cosine similarity evaluates resumes for relevance, while the LLM DistilBERT contextualizes resumes and job descriptions. Remaining challenges include handling atypical resume formats, improving the NER model for entity recognition, and optimizing embeddings for computational performance. The proposed process involves preparing job descriptions and resumes, extracting relevant sections using Named Entity Recognition (NER) and regular expressions, tokenizing the text, embedding it with DistilBERT, and scoring matches using cosine similarity. Our resume-job description matching accuracy of 92.4% substantially exceeds earlier results. Our method automates candidate shortlisting, saving time in recruitment.

Index Terms - PDF extractor, CV-JD matching, Named Entity Recognition (NER), DistilBERT, cosine similarity

I. INTRODUCTION
The rapid growth of online hiring has flooded job postings with resumes. Because resumes vary in style and contain so much unstructured content, organizations struggle to evaluate, classify, and choose prospects. Traditional hiring methods such as manual screening or keyword filtering waste time and select unqualified candidates [1][2].

Recently developed NLP and ML techniques can automate resume classification and boost hiring productivity. Traditional models such as SVM and Naïve Bayes struggle with word semantics, resulting in inaccurate categorization [3]. Transformer-based models such as BERT and DistilBERT address these restrictions by embedding text with deep contextual representations [4]. Named Entity Recognition (NER) extracts names, education, skills, and experience from unstructured resumes; combining deep learning with NER improves hiring by classifying and ranking applicants [5]. Keyword-centric candidate-job compatibility methods are less sophisticated and accurate than cosine similarity [6].

This research classifies resumes using powerful NLP and transformer models. DistilBERT embeddings are created from structured resume data obtained through PDF processing, NER, and regular-expression extraction. Job seekers benefit from cosine-similarity comparisons of resumes and job descriptions, and our automated shortlisting streamlines hiring and applicant selection.

The rest of this work is organized as follows: Section II reviews current techniques and their shortcomings in the literature; Section III covers resume organization, the problem design, and the proposed method and solution; Section IV presents the results and analysis that support our approach, concludes the analysis, and discusses future research.

II. LITERATURE REVIEW
The increasing demand for efficient recruiting has resulted in extensive research on automated resume classification, addressing the arduous and biased manual screening process. Natural language processing (NLP) and machine learning (ML) have been used to improve the accuracy and efficiency of classification. Early systems depended on keyword matching and rule-based approaches and struggled with semantic understanding and varied formats [7]. Using manual features and statistical analysis, models such as SVM and Naïve Bayes enhanced resume categorization; nonetheless, domain-specific issues persisted [8]. Recent work applies transformer models such as BERT to text categorization [9][10], leveraging contextual knowledge to outperform conventional ML approaches. While cosine similarity helps in job-fit rating [12], combining deep learning with rule-based methods improves structured data extraction from resumes [11]. Notwithstanding this advancement, issues remain in format variation, entity recognition, and efficiency; future research on better training, transformer tuning, and data integration [13] will help to address them.
The suggested system seeks to accomplish the following objectives:
1. Enhanced Resume Matching Precision: attain superior categorization accuracy compared to conventional machine learning methodologies.
2. Diminished Manual Labor: automate the process of resume shortlisting, conserving time for HR personnel.
3. Scalability: efficiently manage substantial quantities of resumes.
4. Resilience to Format Variability: efficiently process resumes in various formats, guaranteeing significant adaptability.
III. PROPOSED WORK
The proposed method for classifying resumes and matching candidates with jobs comprises four essential stages: (1) preprocessing, (2) feature extraction, (3) embedding, and (4) cosine-similarity calculation. The flow of the execution is as follows:
Fig. 1 Proposed workflow model
A. Preprocessing
Pre-processing is a crucial step that transforms the raw data into a clean, structured format for efficient matching. The following steps are applied to resumes and job descriptions:
- Text Normalization: convert all text to lowercase and remove unnecessary whitespace.
- Special Character Removal: eliminate punctuation, URLs, emails, and other non-alphanumeric characters.
- Stop-word Removal: filter out frequently occurring but uninformative words (e.g., "the", "is", "and").
- Tokenization: convert textual data into meaningful units (tokens) for further processing.
- Short-form Expansion: convert abbreviations to their full forms to standardize terminology.
These pre-processing steps ensure consistency and uniformity, allowing for accurate semantic matching of resumes with job descriptions, as shown in Table I.
TABLE I
Sample Pre-processed Resume Data
Candidate Name | Skills | Experience (Years) | University | Degree
Chaitanya | Python, NLP, SQL | 5 | JNTUK | M.Tech.
Deepthi | Java, ML, TensorFlow | 3 | AU | B.Tech.
Karthik | SQL, C++ | 4 | ANU | Ph.D.
The suggested DistilBERT and Machine Learning pipeline for Automated Resume Screening includes Data Pre-processing, Feature Extraction, Model Training, and Performance Evaluation. The comprehensive solution method is as follows:

i) Data Collection
The dataset comprises structured and unstructured resumes gathered from many job recruitment sources and carefully categorized into pertinent job classifications. The collection comprises text-based resumes featuring attributes including Name, Skills, Experience, Education, and Certifications [14, 15]. The labeled data is divided into Training (80%) and Testing (20%) sets.
ii) Text Cleaning and Pre-processing
- Lowercasing: convert all text to lowercase for consistency.
- Stop-word Elimination: frequently used terms such as "the," "is," and "in" that do not aid classification are discarded.
- Lemmatization: words are transformed into their base form (e.g., "running" → "run").
- Elimination of Special Characters and Numbers: symbols, punctuation, and numerical digits are removed to minimize extraneous noise.
- Tokenization: resumes are transformed into discrete words or phrases for model analysis.
Table II presents an example of pre-processed resume text.
TABLE II
Sample Pre-processed Resume Text
Raw Resume Text | Pre-processed Resume Text
"Experienced Software Engineer skilled in Python, Java, and AI." | "experienced software engineer skill python java ai"
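As an illustration of the cleaning steps above, the following is a minimal sketch using NLTK; the paper does not name its cleaning library, so the tool choice is an assumption (note that the WordNet lemmatizer keeps "skilled" rather than reducing it to "skill" as in Table II, so the exact output depends on the lemmatizer used).

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, then strip special characters and digits.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenize on whitespace, drop stop-words, reduce words to a base form.
    return [LEMMATIZER.lemmatize(tok) for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("Experienced Software Engineer skilled in Python, Java, and AI."))
# ['experienced', 'software', 'engineer', 'skilled', 'python', 'java', 'ai']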
B. Feature Extraction
Feature extraction converts unstructured resumes into organized data with crucial attributes such as skills, work experience, and education. The following steps are taken (a sketch follows the list):
1. PDF Extraction: use pdfplumber to extract raw text from resumes while maintaining structural integrity.
2. Regular Expressions (Regex): select and extract work experience, education, and contact information.
3. Named Entity Recognition (NER): extract structured items such as Name, Education (degree, university, year of graduation), Work Experience (company, duration, role), and Skills (technical and soft) using a custom-trained NER model.
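The sketch below combines the three steps. pdfplumber is named by the paper; the regex patterns, the use of spaCy, and the model path "resume_ner_model" are illustrative assumptions, since the custom NER model itself is not published.

import re
import pdfplumber
import spacy

def extract_resume(path):
    # Step 1: raw text extraction with pdfplumber, page by page.
    with pdfplumber.open(path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    # Step 2: regex for contact details and candidate years (illustrative patterns).
    email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    years = re.findall(r"\b(?:19|20)\d{2}\b", text)
    # Step 3: custom-trained NER model ("resume_ner_model" is a hypothetical path).
    nlp = spacy.load("resume_ner_model")
    entities = {}
    for ent in nlp(text).ents:
        entities.setdefault(ent.label_, []).append(ent.text)
    return {"email": email.group(0) if email else None,
            "years": years,
            "entities": entities}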
We use DistilBERT embeddings to extract meaningful representations from resumes:

i) DistilBERT Tokenization
- Each resume is segmented into word pieces and processed by the DistilBERT model.
- The resulting [CLS] token representation serves as the feature vector.

ii) Feature Vector Representation
- DistilBERT generates a 768-dimensional feature vector for each resume.
- These vectors are used as input to the machine learning classifiers.
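A minimal sketch of the [CLS]-vector extraction, assuming the Hugging Face transformers implementation of DistilBERT (the paper does not name its implementation):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text):
    # WordPiece-tokenize, truncating to DistilBERT's 512-token limit.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Position 0 of the last hidden state is the [CLS] token: a 768-d feature vector.
    return out.last_hidden_state[:, 0, :].squeeze(0)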
Three machine learning models are trained and compared [16, 17]: Logistic Regression, Random Forest, and Support Vector Machine (SVM). The extracted DistilBERT embeddings are fed into these classifiers for resume classification. The model training method is as follows:
i) Model Hyper-parameter Tuning
Hyper-parameter tuning is used alongside DistilBERT training to optimize performance by finding the best combination of parameters (e.g., learning rate, batch size) that minimizes loss and improves accuracy, ensuring the model generalizes well to unseen data while maintaining efficiency. Grid Search is used to optimize the parameters of each model; for SVM, for example, we tune the C parameter and the kernel type (a sketch follows). Table III gives the model hyper-parameter tuning.
TABLE III
Model Hyper-parameter Tuning
Model | Tuned Parameters | Optimal Value
SVM | Kernel, C | RBF, C=1.0
Random Forest | Number of Trees (n_estimators) | 100
Logistic Regression | Regularization (C) | 0.5
C. Embedding
To perform accurate resume-job description matching, we employ Transformer-based embeddings to generate numerical representations of textual content. The steps include:
1. Tokenization: convert resumes and job descriptions into word tokens.
2. Embedding Generation: utilize DistilBERT, a lightweight Transformer model, to encode tokens into dense vector embeddings.
3. Contextual Understanding: DistilBERT captures deep semantic relationships between words, allowing a better representation of candidate qualifications.
4. Similarity Computation: compute the cosine similarity between resume embeddings and job-description embeddings to determine relevance scores.
The embeddings effectively map resumes and job descriptions into a high-dimensional space in which semantically similar documents (i.e., relevant resumes for a job role) have higher similarity scores.
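A minimal sketch of step 4 over the 768-dimensional vectors produced above, assuming NumPy:

import numpy as np

def cosine_similarity(resume_vec, jd_vec):
    # cos(r, j) = (r . j) / (||r|| * ||j||): compares direction, not magnitude.
    return float(np.dot(resume_vec, jd_vec) /
                 (np.linalg.norm(resume_vec) * np.linalg.norm(jd_vec)))

# Usage with the embed() sketch above:
# score = cosine_similarity(embed(resume_text).numpy(), embed(jd_text).numpy())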
D. Cosine Similarity Calculation
The cosine similarity calculation assesses the degree of similarity between the vector of a CV and that of a job description, thereby ascertaining their fit for the post. Other similarity measures applicable to resume classification include Jaccard similarity, Euclidean distance, Manhattan distance, and Pearson correlation. Cosine similarity is often preferred because it focuses on the orientation (angle) of vectors rather than their magnitude, making it ideal for text data, where word-frequency vectors are sparse and high-dimensional. However, alternatives such as Jaccard similarity can be useful for binary data, while Euclidean distance may work better for dense, low-dimensional feature spaces. To assess the effectiveness of the proposed system, we define threshold-based classification (a sketch follows the list):
- Resume Relevance Score: a resume is considered a match if its cosine similarity with the job description exceeds a predefined threshold (e.g., 0.85).
- Top-N Candidate Selection: the system ranks resumes based on similarity scores and selects the top N candidates for further evaluation.
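A sketch of both rules, reusing cosine_similarity from Section III-C; the variable names and defaults are illustrative:

import numpy as np

def shortlist(resume_vecs, jd_vec, threshold=0.85, top_n=10):
    # Score every resume against the job description.
    scores = np.array([cosine_similarity(v, jd_vec) for v in resume_vecs])
    # Rank best-first, keep the top N, and drop anything below the threshold.
    order = np.argsort(scores)[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]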
IV. EXPERIMENTAL RESULTS AND DISCUSSIONS
This section presents the experimental results and discussion based on a dataset of 10,000 resumes [23][24] collected from Kaggle, with a total size of 56.27 MB. Table IV illustrates the distribution of resumes across five distinct job classes.
TABLE IV
Resume Data
Category | Number of Resumes
Data Scientist | 2,000
Software Engineer | 3,000
Business Analyst | 1,500
Project Manager | 1,500
Network Engineer | 2,000
Total | 10,000
Performance is measured using key metrics (a computation sketch follows the list):
- Accuracy: measures the proportion of correctly classified resumes, calculated as (TP + TN) / (TP + TN + FP + FN).
- Precision: assesses the proportion of relevant resumes among those selected, calculated as TP / (TP + FP).
- Recall: evaluates the system's ability to retrieve all relevant resumes, calculated as TP / (TP + FN).
- F1-Score: provides a balanced measure of precision and recall, calculated as 2 * (Precision * Recall) / (Precision + Recall).
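A computation sketch for these metrics, assuming scikit-learn and macro-averaging over the five job classes (the paper does not state its averaging scheme); y_test and y_pred are assumed to hold the true and predicted classes for the test split:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="macro")
print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} Recall={recall:.3f} F1={f1:.3f}")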
To evaluate model performance, we compute the confusion matrix for each classifier [18, 19]. The confusion matrix, illustrating true positives, false positives, etc., for the classification task is given in Fig. 2.
We calculate the following metrics for model comparison [20, 21]; Table VI presents the model performance comparison.

TABLE VI
Model Performance Comparison
Model | Accuracy | Precision | Recall | F1 Score
Logistic Regression | 87.4% | 85.2% | 83.1% | 84.1%
Random Forest | 89.1% | 88.3% | 85.7% | 87.0%
SVM (Best Model) | 91.2% | 90.5% | 89.4% | 89.9%
To interpret the model predictions, we use SHAP (SHapley Additive exPlanations) values [22]. SHAP values emphasize the most significant features influencing resume classification; for instance, the keyword "Machine Learning" adds +0.35 to the probability of the Data Scientist category. Fig. 4 shows a bar graph of SHAP values for the top resume features.
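A sketch of how such attributions can be produced with the shap library; clf, X_train, and X_test are assumed from the training step, the classifier must expose predict_proba (for SVC this requires probability=True), and note that over raw DistilBERT features the attributions apply to embedding dimensions rather than keywords, so mapping back to terms like "Machine Learning" is a further step the paper does not detail:

import shap

# Model-agnostic explainer over a background sample of the training features.
explainer = shap.KernelExplainer(clf.predict_proba, shap.sample(X_train, 100))
shap_values = explainer.shap_values(X_test[:50])
# Bar chart of the features with the largest mean absolute SHAP values.
shap.summary_plot(shap_values, X_test[:50], plot_type="bar")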