
Article · April 2020 · DOI: 10.14445/22315381/IJETT-V68I4P209S


International Journal of Engineering Trends and Technology (IJETT) – Volume 68 Issue 4 – April 2020 – ISSN: 2231-5381 – http://www.ijettjournal.org

Fake Job Recruitment Detection Using Machine Learning Approach

Shawni Dutta#1 and Prof. Samir Kumar Bandyopadhyay*2
#1 Lecturer, Department of Computer Science
*2 Academic Advisor, The Bhawanipur Education Society College
The Bhawanipur Education Society College, 5, Elgin Rd, Sreepally, Bhowanipore, Kolkata, West Bengal 700020, India

Abstract — To avoid fraudulent job posts on the internet, an automated tool based on machine learning classification techniques is proposed in this paper. Different classifiers are used for checking fraudulent posts on the web, and the results of those classifiers are compared in order to identify the best employment scam detection model. This helps in detecting fake job posts among an enormous number of posts. Two major types of classifiers, single classifiers and ensemble classifiers, are considered for fraudulent job post detection. Experimental results indicate that ensemble classifiers detect scams better than single classifiers.

Keywords — Fake Job, Online Recruitment, Machine Learning, Ensemble Approach.

I. INTRODUCTION

Employment scam is one of the serious issues addressed in recent times in the domain of Online Recruitment Fraud (ORF) [1]. Nowadays, many companies prefer to post their vacancies online so that job-seekers can access them easily and in a timely manner. However, this channel can be exploited by fraudsters who offer employment to job-seekers in exchange for money. Fraudulent job advertisements can also be posted against a reputed company to damage its credibility. Detecting these fraudulent job posts has therefore drawn considerable attention, with the aim of obtaining an automated tool that identifies fake jobs and reports them to people so that they can avoid applying for such jobs.

For this purpose, a machine learning approach is applied which employs several classification algorithms for recognizing fake posts. In this case, a classification tool isolates fake job posts from a larger set of job advertisements and alerts the user. To address the problem of identifying scams in job postings, supervised learning algorithms are initially considered as classification techniques. A classifier maps input variables to target classes by considering training data. The classifiers addressed in this paper for separating fake job posts from the rest are described briefly below. Classifier-based prediction may be broadly categorized into single classifier based prediction and ensemble classifier based prediction.

A. Single Classifier based Prediction
Classifiers are trained for predicting unknown test cases. The following classifiers are used while detecting fake job posts.

a) Naive Bayes Classifier
The Naive Bayes classifier [2] is a supervised classification tool that exploits the concept of the Bayes Theorem [3] of conditional probability. The decisions made by this classifier are quite effective in practice even if its probability estimates are inaccurate. The classifier obtains very promising results when the features are either completely independent or completely functionally dependent. Its accuracy is not governed by feature dependencies as such, but rather by the amount of information about the class that is lost due to the independence assumption [2].

b) Multi-Layer Perceptron Classifier
A multi-layer perceptron [4] can be used as a supervised classification tool by incorporating optimized training parameters. For a given problem, the number of hidden layers in a multilayer perceptron and the number of nodes in each layer can differ. The choice of these parameters depends on the training data and the network architecture [4].

c) K-nearest Neighbor Classifier
K-Nearest Neighbour classifiers [5], often known as lazy learners, identify objects based on the closest proximity of training examples in the feature space. The classifier considers the k nearest objects while determining the class. The main challenge of this classification technique lies in choosing an appropriate value of k [5].

d) Decision Tree Classifier
A Decision Tree (DT) [6] is a classifier that exemplifies the use of a tree-like structure for gaining classification knowledge. Each target class is denoted as a leaf node of the DT, and non-leaf nodes of


the DT are used as decision nodes that indicate certain tests. The outcomes of those tests are identified by the branches of the corresponding decision node. Classification starts at the root of the tree and proceeds through it until a leaf node is reached; this is the way a classification result is obtained from a decision tree [6]. Decision tree learning has been applied to spam filtering, and it can be useful for forecasting the goal based on some criterion by implementing and training this model [7].

B. Ensemble Approach based Classifiers
The ensemble approach facilitates several machine learning algorithms working together to obtain higher accuracy for the entire system. Random forest (RF) [8] exploits the concept of the ensemble learning approach and is applicable to classification as well as regression problems. This classifier assimilates several tree-like classifiers which are applied to various sub-samples of the dataset, and each tree casts its vote for the most appropriate class for the input.

Boosting is an efficient technique in which several unstable (weak) learners are assimilated into a single learner in order to improve classification accuracy [9]. The boosting technique applies a classification algorithm to reweighted versions of the training data and chooses the weighted majority vote of the resulting sequence of classifiers. AdaBoost [9] is a good example of a boosting technique that produces improved output even when the performance of the weak learners is inadequate. Boosting algorithms are quite efficient in solving spam filtering problems. The gradient boosting [10] algorithm is another boosting-based classifier that exploits the concept of decision trees and minimizes the prediction loss.

II. RELATED WORK

According to several studies, review spam detection, email spam detection and fake news detection have drawn special attention in the domain of online fraud detection.

A. Review Spam Detection
People often post reviews on online forums regarding the products they purchase. These reviews may guide other purchasers when choosing products. In this context, spammers can manipulate reviews for profit, and hence it is necessary to develop techniques that detect such spam reviews. This can be implemented by extracting features from the reviews using Natural Language Processing (NLP) and then applying machine learning techniques to those features. Lexicon based approaches, which use a dictionary or corpus to eliminate spam reviews, may be one alternative to machine learning techniques [11].

B. Email Spam Detection
Unwanted bulk mails, belonging to the category of spam emails, often arrive in user mailboxes. This may lead to an unavoidable storage crisis as well as bandwidth consumption. To eradicate this problem, service providers such as Gmail, Yahoo Mail and Outlook incorporate spam filters using neural networks. While addressing the problem of email spam detection, content based filtering, case based filtering, heuristic based filtering, memory or instance based filtering, and adaptive spam filtering approaches are taken into consideration [7].

C. Fake News Detection
Fake news in social media is characterized by malicious user accounts and echo chamber effects. The fundamental study of fake news detection relies on three perspectives: how fake news is written, how fake news spreads, and how a user is related to fake news. Features related to news content and social context are extracted, and machine learning models are imposed on them to recognize fake news [12].

III. PROPOSED METHODOLOGY

The target of this study is to detect whether a job post is fraudulent or not. Identifying and eliminating these fake job advertisements will help job-seekers concentrate on legitimate job posts only. In this context, a dataset from Kaggle [13] is employed that provides information regarding a job post that may or may not be suspicious. The dataset has the schema shown in Fig. 1.

Fig. 1. Schema structure of the dataset

This dataset contains 17,880 job posts. It is used in the proposed methods for testing the overall performance of the approach. For a better understanding of the target as a baseline, a multistep procedure is followed for obtaining a balanced dataset. Before fitting this data to any classifier, some pre-processing techniques are applied: missing value removal, stop-word elimination, irrelevant attribute elimination and extra space removal. This prepares the dataset to be transformed into a categorical encoding in order to obtain feature vectors. These feature vectors are fitted to several classifiers. Fig. 2 depicts the working paradigm of a classifier for prediction.

Fig. 2. Detailed description of the working of classifiers

Fig. 3. Classification models used in this framework

As depicted in Fig. 3, several classifiers are employed, namely the Naive Bayes, Decision Tree, Multi-Layer Perceptron, K-nearest Neighbor, AdaBoost, Gradient Boost and Random Forest classifiers, for classifying a job post as fake. It is to be noted that the attribute 'fraudulent' of the dataset is kept as the target class for classification purposes. At first, the classifiers are trained using 80% of the entire dataset, and the remaining 20% is used for prediction. Performance metrics such as accuracy, F-measure and Cohen-Kappa score are used for evaluating the prediction of each of these classifiers. Finally, the classifier that has the best performance with respect to all the metrics is chosen as the best candidate model.

A. Implementation of Classifiers
In this framework, classifiers are trained using appropriate parameters. For maximizing the performance of these models, the default parameters may not be sufficient; adjustment of these parameters enhances the reliability of a model, which may then be regarded as the optimised one for identifying as well as isolating fake job posts for job seekers.

This framework utilised an MLP classifier with a collection of 5 hidden layers of sizes 128, 64, 32, 16 and 8 respectively. The K-NN classifier gives a promising result for the value k = 5 considering all the evaluation metrics. On the other hand, the ensemble classifiers, i.e. the Random Forest, AdaBoost and Gradient Boost classifiers, are built with 500 estimators, at which the boosting is terminated. After constructing these classification models, the training data are fitted to them. Later, the testing dataset is used for prediction. Once prediction is done, the performance of the classifiers is evaluated based on the predicted and actual values.

B. Performance Evaluation Metrics
While evaluating the performance of a model, it is necessary to employ some metrics to justify the evaluation. For this purpose, the following metrics are taken into consideration in order to identify the most relevant problem-solving approach. Accuracy [14] is a metric that identifies the ratio of true predictions over the total number of instances considered. However, accuracy alone may not be a sufficient metric for evaluating a model's performance, since it does not account for wrongly predicted cases. If a fake post is treated as a true one, it creates a significant problem. Hence, it is necessary to consider the false positive and false negative cases that constitute misclassification. For measuring this, precision and recall are quite necessary to consider [7].

Precision [14] identifies the ratio of correct positive results over the number of positive results predicted by the classifier. Recall [14] denotes the number of correct positive results divided by the number of all relevant samples. The F1-Score, or F-measure [14], is concerned with both recall and precision and is calculated as the harmonic mean of precision and recall [14]. Apart from all these measures, the Cohen-Kappa score [15] is also considered as an evaluation metric in this paper. This metric is a statistical measure of inter-rater agreement for qualitative items in a classification problem. Mean Squared Error (MSE) [14] is another evaluation metric that measures the squared differences between the predictions and actual observations of the test samples. A lower value of MSE and higher values of accuracy, F1-Score and Cohen-Kappa score signify a better performing model.

IV. EXPERIMENTAL RESULTS

All the above-mentioned classifiers are trained and tested for detecting fake job posts over a given dataset that contains both fake and legitimate posts. Table 1 shows the comparative study of the single classifiers with respect to the evaluation metrics, and Table 2 provides the results for the classifiers that are based on ensemble techniques. Fig. 4 to Fig. 7 depict the overall performance of all the classifiers in terms of accuracy, F1-score, Cohen-Kappa score and MSE respectively.

TABLE I
PERFORMANCE COMPARISON CHART FOR SINGLE CLASSIFIER BASED PREDICTION

Metric              Naive Bayes   Multi-Layer Perceptron   K-Nearest Neighbor   Decision Tree
Accuracy            72.06%        96.14%                   95.95%               97.2%
F1-Score            0.72          0.96                     0.96                 0.97
Cohen-Kappa Score   0.12          0.3                      0.38                 0.67
MSE                 0.52          0.05                     0.04                 0.03

TABLE II
PERFORMANCE COMPARISON CHART FOR ENSEMBLE CLASSIFIER BASED PREDICTION

Metric              Random Forest   AdaBoost   Gradient Boosting
Accuracy            98.27%          97.46%     97.65%
F1-Score            0.97            0.98       0.98
Cohen-Kappa Score   0.74            0.63       0.65
MSE                 0.02            0.03       0.03

Fig. 4. Comparison of accuracy for all specified supervised machine learning models

Fig. 5. Comparison of F1-score for all specified supervised machine learning models

Fig. 6. Comparison of Cohen-Kappa score for all specified supervised machine learning models

Fig. 7. Comparison of MSE for all specified supervised machine learning models

From Table 1, it is quite clear that the Decision Tree classifier gives promising results compared with the Naive Bayes, Multi-Layer Perceptron and K-Nearest Neighbor classifiers. Hence, the Decision Tree classifier can be a fruitful predictor as a single classifier. Next, it is checked whether the use of the ensemble approach enhances the performance of the model or not. For that reason, the Random Forest, AdaBoost and Gradient Boost classifiers are implemented and compared with respect to the same metrics. The experimental results show that the ensemble based classifiers provide an improved result over the models in Table 1. Moreover, Table 2 indicates that the Random Forest classifier outperforms its peers because it incorporates multiple Decision Tree classifiers: since the Decision Tree classifier is the most competitive single classifier, the Random Forest classifier also works well. This classifier achieved an accuracy of 98.27%, a Cohen-Kappa score of 0.74, an F1-score of 0.97 and an MSE of 0.02. Though its F1-score is almost similar to that of its competitors, the Random Forest classifier shows significantly better performance with respect to the other metrics. Hence, it can be regarded as the best model for this fake job detection scheme.

V. CONCLUSIONS

Employment scam detection will guide job-seekers to get only legitimate offers from companies. For tackling employment scam detection, several machine learning algorithms are proposed as countermeasures in this paper. A supervised mechanism is used to exemplify the use of several classifiers for employment scam detection. Experimental results indicate that the Random Forest classifier outperforms its peer classification tools: the proposed approach achieved an accuracy of 98.27%, which is much higher than that of the existing methods.

REFERENCES

[1] B. Alghamdi and F. Alharby, "An Intelligent Model for Online Recruitment Fraud Detection," J. Inf. Secur., vol. 10, no. 3, pp. 155–176, 2019, doi: 10.4236/jis.2019.103009.
[2] I. Rish, "An empirical study of the naive Bayes classifier," pp. 41–46, 2001.
[3] D. E. Walters, "Bayes's Theorem and the Analysis of Binomial Random Variables," Biometrical J., vol. 30, no. 7, pp. 817–825, 1988, doi: 10.1002/bimj.4710300710.
[4] F. Murtagh, "Multilayer perceptrons for classification and regression," Neurocomputing, vol. 2, no. 5–6, pp. 183–197, 1991, doi: 10.1016/0925-2312(91)90023-5.
[5] P. Cunningham and S. J. Delany, "k-Nearest Neighbour Classifiers," Mult. Classif. Syst., pp. 1–17, 2007, doi: 10.1016/S0031-3203(00)00099-6.
[6] H. Sharma and S. Kumar, "A Survey on Decision Tree Algorithms of Classification in Data Mining," Int. J. Sci. Res., vol. 5, no. 4, pp. 2094–2097, 2016, doi: 10.21275/v5i4.nov162954.
[7] E. G. Dada, J. S. Bassi, H. Chiroma, S. M. Abdulhamid, A. O. Adetunmbi, and O. E. Ajibuwa, "Machine learning for email spam filtering: review, approaches and open research problems," Heliyon, vol. 5, no. 6, 2019, doi: 10.1016/j.heliyon.2019.e01802.
[8] L. Breiman, "Random Forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.
[9] B. Biggio, I. Corona, G. Fumera, G. Giacinto, and F. Roli, "Bagging classifiers for fighting poisoning attacks in adversarial classification tasks," Lect. Notes Comput. Sci., vol. 6713, pp. 350–359, 2011, doi: 10.1007/978-3-642-21557-5_37.


[10] A. Natekin and A. Knoll, "Gradient boosting machines, a tutorial," Front. Neurorobot., vol. 7, 2013, doi: 10.3389/fnbot.2013.00021.
[11] N. Hussain, H. T. Mirza, G. Rasool, I. Hussain, and M. Kaleem, "Spam review detection techniques: A systematic literature review," Appl. Sci., vol. 9, no. 5, pp. 1–26, 2019, doi: 10.3390/app9050987.
[12] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, "Fake News Detection on Social Media," ACM SIGKDD Explor. Newsl., vol. 19, no. 1, pp. 22–36, 2017, doi: 10.1145/3137597.3137600.
[13] S. Bansal, "[Real or Fake] Fake Job Posting Prediction," Version 1, February 2020. Retrieved March 29, 2020, from https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
[14] M. Hossin and M. N. Sulaiman, "A Review on Evaluation Metrics for Data Classification Evaluations," Int. J. Data Min. Knowl. Manag. Process, vol. 5, no. 2, pp. 1–11, 2015, doi: 10.5121/ijdkp.2015.5201.
[15] S. M. Vieira, U. Kaymak, and J. M. C. Sousa, "Cohen's kappa coefficient as a performance measure for feature selection," 2010 IEEE World Congress on Computational Intelligence (WCCI 2010), 2010, doi: 10.1109/FUZZY.2010.5584447.
