Article
Statistical Analysis of Imbalanced Classification with Training
Size Variation and Subsampling on Datasets of Research Papers
in Biomedical Literature
Jose Dixon and Md Rahman *
Computer Science Department, Morgan State University, Baltimore, MD 21251, USA; [email protected]
* Correspondence: [email protected]
Abstract: The overall purpose of this paper is to demonstrate how data preprocessing, training size variation, and subsampling can dynamically change the performance metrics of imbalanced text classification. The methodology encompasses two different supervised learning classification approaches to feature engineering and data preprocessing, using five machine learning classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, and statistical analysis with R and tidyverse, on a dataset of 1000 portable document format files divided into five labels, drawn from the World Health Organization Coronavirus Research Downloadable Articles database of COVID-19 papers and the PubMed Central database of non-COVID-19 papers for binary classification, and examines how these choices affect the performance metrics of precision, recall, receiver operating characteristic area under the curve, and accuracy. One approach, which labels rows of sentences based on regular expressions, significantly improved the performance of the imbalanced sampling techniques, as verified by statistical analysis using a t-test on the performance metrics of the iterations, compared with another approach that labels the sentences based on how the documents are organized into positive and negative classes. The study demonstrates the effectiveness of ML classifiers and sampling techniques in text classification datasets, with different performance levels and class imbalance issues observed in the manual and automatic methods of data processing.
Keywords: text retrieval; text classification; imbalanced sampling; feature engineering; statistical analysis; data preprocessing; subsampling; training size variation
1. Introduction
Text or document retrieval involves collecting valuable information from vast quantities of unstructured text in the most common file formats for human-language data [1]. Information retrieval systems often use unstructured raw text as their primary dataset [1]. An information retrieval (IR) system filters through unstructured data to find anything that may fulfill a user's information need [2]. It uses classification and filtering to find documents, while search decides whether documents fit a specific information need [1]. Two key aspects to consider when assessing the performance of an information retrieval system are efficiency and effectiveness [2].
Document classification is usually a binary classification and supervised learning problem. Researchers usually classify text using machine and deep learning algorithms [3]. However, unstructured raw text can create an imbalanced sampling problem. Research has shown that using cost-sensitive learning or class weights, ensemble learning, or specific learning algorithms can help experiments address the issue of class imbalance [4].
Researchers can use data preprocessing techniques as an effective method to help the classifier improve its performance metrics and to ensure the classifier can process unstructured data. Past studies have shown that subsampling can supply effective results, even when an experiment does not use all of the data, and that it is useful in both machine learning and deep learning.
Analysts can gain deeper insights into the data by performing exploratory data analysis, enabling them to find unique trends, patterns, and relationships. More observations make it easier to interpret differences and supply a richer, more complete view of the data, whereas fewer observations tend to reveal only coarser, periodic trends and patterns. Exploratory data analysis is essential when the data become too large to understand and to interpret a conclusion based on an experiment's results.
This paper will first discuss the literature review of the research problem, then present
the method of machine learning models, followed by the study results, and finally discuss
the conclusions of the findings. The essential goals of this research are:
• To set up an effective system of preprocessing data for document classification that
would help the classifier supply reasonable performance metrics based on unstruc-
tured data for statistical analysis.
• Allow statistical analysis to decide the significance of performance metrics for precision
and recall scores from classifiers, sampling techniques, and labels.
• Use five supervised machine learning classifiers with imbalanced sampling techniques
to show the difference in performance.
This study used a set of supervised binary classification algorithms to classify five
labels (Immune, Problems in China, Risk Factors, Testing, and Transmission) on a dataset
of approximately 1000 portable document format files. The machine learning models
use training size variation (of five different training sizes) and subsampling (by intervals
of 5% up to 100%) to supply various unique scores of the dataset. The results will be
in a comma-separated values (CSV) file, depending on the classifier, sampling method,
sampling technique, test split size, train split size, and subsampling size. The file will have a
full array of scores based on precision, recall, area under the curve (AUROC), and accuracy.
In addition, researchers can perform exploratory data analysis to show performance metrics
of precision and recall using histograms, bar graphs, line graphs, and box plots.
In the first approach, a user categorizes the PDF documents into five labels, with approximately 25% being positive and 75% negative. In the second approach, the annotated positive documents are split so that 60% form the training subset, 20% the development subset, and 20% the testing subset. These two methods implement binary classification with separate approaches. The most important part of this research is the training size variation and subsampling; these produce the individual performance metric scores that emphasize the importance of statistical analysis.
Text classification algorithms may run at four levels: document, paragraph, sentence,
and sub-sentence [5]. In addition, two methods for classifying documents are manual and
automatic [6]. Therefore, the method relies on an automated, rule-based classifier and
human categorization of documents.
In supervised learning, a computer algorithm trains input data labeled for a specific
output. As shown in Figure 1, the model is trained until it can detect the underlying
patterns and relationships between the input data and the output labels, enabling it to yield
accurate labeling results when presented with never-before-seen data [7]. A user can tag
new and old data if the model correctly categorizes it [8]. Classification and regression
problems, especially those involving binary classification, are well-suited to supervised
learning [8].
The typical approach to evaluating an information retrieval system is distinguishing
pertinent and irrelevant documents. Then, document collection and gathering of relevancy
scores decide the system’s effectiveness. Although, for example, a researcher may assign
zero to irrelevant and one to pertinent documents, the collection and feature extraction
process of the documents can affect the system’s effectiveness.
Figure 1. Supervised Learning Process.
2. Literature Review
Several papers have proved that data preprocessing can help change performance metrics that affect imbalanced classification. Goudjil et al. introduce an innovative active learning method for text categorization, aiming to minimize labeling effort while maintaining classification accuracy by intelligently selecting appropriate samples [9]. Kadhim's paper evaluates various text preprocessing tools for English text classification, including raw text, tokenization, stop words removal, and stemming; results show that preprocessing enhances feature extraction methods, especially for small threshold values [10]. Mali et al.'s paper explores the impact of preprocessing steps on text classification, revealing improvements in accuracy for various classifiers, mainly when applied to unstructured data, despite the vast amount of digital information available [11].
Likewise, similar works focused on different subsampling strategies and training size variations to change performance measures. Imberg et al. propose an active sampling strategy that iterates between estimation and data collection with optimal subsamples, guided by machine learning predictions on unseen data [12]. Kumar et al.'s study addresses challenges in mental health NLP by using the Anno-MI dataset for counseling quality classification. It employs data augmentation to improve reliability, reduce bias, and address data scarcity and imbalance [13]. In transmitter classification applications, Oyedare and Park explored the relationship between training dataset size and classification accuracy, suggesting that users should choose how much training data is required to offer the optimum performance metrics [14].
Three papers focusing on various text classification methodologies can address various classification challenges. First, Li et al. propose a paper that reviews text classification approaches from 1961 to 2021, focusing on traditional models and deep learning. It provides a taxonomy, technical developments, benchmark datasets, comparisons, evaluation metrics, and critical implications [15]. Mujtaba et al.'s review measures the performance of SML and rule-based approaches, presenting open research issues and challenges from nine types of clinical reports, four data sets, two sampling techniques, and nine pre-processing techniques [16]. Finally, Kamath et al. compared the accuracy of four machine learning classifiers and one convolutional neural network on raw and cleaned datasets [17].
Some research articles have examined both efficient and inefficient approaches to analyzing imbalanced sample difficulties. Kim and Hwang evaluated combinations of seven sampling strategies and eight machine learning classifiers on 31 datasets with varying degrees of imbalance and discovered that some sampling procedures could have been more efficient and helped classifier performance, while others were more successful [18]. Agarwal et al. developed a novel sampling strategy to increase classification task performance and a custom-based sampling method to determine which methods affect the performance [19]. Gaudreault et al.'s research investigates performance assessment and predictive modeling in machine learning applications dealing with imbalanced domains. It
3. Methodology
The dataset is a total of 1000 PDF documents. Five machine learning labels exist
(Immune, Problems in China, Risk Factors, Transmission, and Testing). Furthermore, 25%
of the PDF documents are open-access COVID-19 research papers from the World Health
Organization COVID-19 Downloadable Articles Database to have a positive class [25].
The WHO COVID-19 Downloadable Articles Database supplies free access to open-access
documents and includes a bibliography of documents in CSV format.
A user can manually download a specific kind of PDF document from the COVID-19
research database to help serve as a dataset for the positive class. The other 75% of PDF
documents are research papers unrelated to COVID-19, taken from the PubMed Central database
to form the negative class [26]. Both the positive and negative classes have papers relevant
to biomedical literature. The documents in the positive class are particularly related to
COVID-19; however, the negative class includes papers relevant to any medical subject,
excluding COVID-19. A user can obtain the PDF documents from PubMed Central using
the PubMed Central Open Access Subset, which maintains a repository of open-access
paper archives suitable for reproducing research [27]. One user creates the dataset used for
the experiment.
Both approaches require using the Python libraries Scikit-learn, NumPy, and Pandas
to perform feature engineering, imbalanced classification, and text classification on raw
text data [28]. R script and Xpdf command line tools convert all portable document format
(PDF) documents into text files in a directory. The goal of combining the text files is to make
the classifier’s processing easier. The first approach uses the text files in the MEDFULL
folder as the first dataset. The combining process creates two combined text files (such as
Immune-pos.txt and Immune-neg.txt) for both positive and negative classes based on a
single label. Even though each label has the same number of document or text files, the classes have different numbers of samples and file sizes.
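The paper drives Xpdf from an R script; purely as an illustration of the same two steps in Python, a minimal sketch is shown below. It assumes Xpdf's pdftotext is on the PATH, and the directory and file names are placeholders rather than the study's actual layout.

```python
import subprocess
from pathlib import Path

def pdfs_to_text(pdf_dir: str, txt_dir: str) -> None:
    """Convert every PDF in pdf_dir to a plain-text file in txt_dir using Xpdf's pdftotext."""
    out = Path(txt_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in Path(pdf_dir).glob("*.pdf"):
        subprocess.run(["pdftotext", str(pdf), str(out / (pdf.stem + ".txt"))], check=True)

def combine_class_files(txt_dir: str, out_file: str) -> None:
    """Concatenate the individual converted text files of one label and class into one file."""
    with open(out_file, "w", encoding="utf-8") as combined:
        for txt in sorted(Path(txt_dir).glob("*.txt")):
            combined.write(txt.read_text(encoding="utf-8", errors="ignore"))
            combined.write("\n")

# Illustrative usage for one label's positive class:
# pdfs_to_text("MEDFULL/Immune/pos_pdfs", "MEDFULL/Immune/pos_txt")   # hypothetical paths
# combine_class_files("MEDFULL/Immune/pos_txt", "Immune-pos.txt")
```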
The methodology's second approach requires a user to convert the individual raw text files for each label in the MedText folder to a CSV file so that the source code can automatically annotate the sentences with keywords based on regular expressions. The keywords are terms based on the data origin close to COVID-19 (such as clinical, isolation, respiratory, disease, spread, PMC, symptomatic, epidemic, endemic, outbreak, quarantine, and others) [29].
The metacharacters and characters associated with regular expressions are best illustrated with an example such as "(?i)(?i:C|c)lincial|(?i:I|i)nfectious|(?i:C|c)ornavirus|(?i:D|d)isease|(?i:C|c|o|O|v|V|i|I|D)|(?i:p|P)ositive|(?i:C|c)ommunity|(?i:C|c)ase(?i:N|n)egative|(?i:A|a)rea|(?i:e|E)pidemic|(?i:E|e)ndemic|(?i:O|o)utbreak|(?i:I|i)solation|(?i:R|r)espiratory|(?i:S|s)pread|PMC|(?i:S|s)ymptomatic". This regular expression matches words in all lowercase, all uppercase, or title case format; for example, it matches terms like disease, Disease, Respiratory, respiratory, and COVID-19. The matched terms supplied the label. Next, a script transfers the converted CSV files of each label and positive class to a COVID_ANN folder following the illustration in Figure 2. In the second approach, which entails automatically labeling documents based on regular expressions, we exclude the negative text files depicted in Figure 2 for each label. The No symbol in Figure 2 indicates that files from the negative class are not processed further. This implies that only positive documents undergo consideration for automatic labeling. Below, references to the COVID_ANN dataset refer to the complete collection of the annotated CSV files of the converted raw text documents. First, combine and merge all annotated CSV files from the Immune, Problems in China, and Risk Factors labels to form 'COVID_Train_Set.csv,' the Train Subset. Next, combine and merge all annotated CSV files from the Testing label to create a file called 'COVID_Dev_Set.csv,' the Dev Subset. Finally, combine and merge all annotated CSV files from the Transmission label to make the Test Subset file 'COVID_Test_Set.csv.' These three subsets or CSV files form 'Subset Data,' the second dataset.
Figure 2. Converting Text to Annotated CSV. Only the positive class is considered; the No symbol shows that the negative text is ignored.
3.1. Preprocessing and Labeling
In the first approach of the experiment, a Python script conducts data preprocessing in two phases: noise removal and normalization. Noise removal removes unwanted content from the unstructured text by using various functions from the NLTK library, such as stopwords, wordnet, WordNetLemmatizer, and word_tokenize. After removing the added noise, normalization helps process the data. A script can concatenate the negative and positive text files of each label as sentences in a Pandas data frame and then combine them. The Pandas data frame has six columns, excluding the index column, documenting the data preprocessing steps. The first step extracts the raw sentence data. In the second step, the script labels documents with 1s and 0s. The third step removes any punctuation. The fourth step tokenizes the sentences. The fifth step removes stopwords. The last step lemmatizes the words accordingly. All the sentences undergo this noise removal process for better-improved data reading [26].
The string.punctuation module removes all punctuation before tokenization to prevent tokenizing unwanted elements. The word_tokenize function from the NLTK library tokenizes the sentences as data points. The stopwords module removes stopwords from the tokenized text. Finally, for the machine learning models to preprocess unique data, lemmatization (using the WordNetLemmatizer and WordNet modules) must decide the base form of all dataset words and apply it in the machine learning model before the feature extraction process.
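A minimal sketch of this preprocessing chain is given below, assuming the NLTK stopword, punkt, and WordNet corpora have been downloaded; the column names are illustrative rather than the exact ones used in the study.

```python
import string
import pandas as pd
from nltk.corpus import stopwords              # requires nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer        # requires nltk.download("wordnet")
from nltk.tokenize import word_tokenize        # requires nltk.download("punkt")

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_sentence(sentence: str) -> list:
    """Remove punctuation, tokenize, drop stopwords, and lemmatize one sentence."""
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(no_punct.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]

def build_frame(pos_sentences: list, neg_sentences: list) -> pd.DataFrame:
    """Assemble a per-label data frame of raw sentences, 1/0 labels, and lemmatized tokens."""
    raw = list(pos_sentences) + list(neg_sentences)
    labels = [1] * len(pos_sentences) + [0] * len(neg_sentences)
    frame = pd.DataFrame({"sentence": raw, "label": labels})
    frame["lemmas"] = frame["sentence"].apply(preprocess_sentence)
    return frame
```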
Table 1 shows the number of samples relevant to each label. The number of rows
corresponds to the number of sentences or samples after converting the text files from
MedText data to a Pandas data frame and CSV file for each label. For example, combining
[label]-neg.csv and [label]-pos.csv results in a [label].csv. The machine learning model pro-
cesses each label with a unique CSV file with different sentences as rows. The experiment's first approach may therefore have pushed each classifier and imbalanced sampling technique away from its intended behavior.
After processing text files for each specific label from the MEDFULL data, the
method displays three different subsets in separate CSV files named COVID_Train_Set.csv,
COVID_Dev_Set.csv, and COVID_Test_Set.csv. Table 2 displays the number of samples or
sentences from rows of CSV files relevant to each subset. After converting the text files to
a Pandas data frame or a CSV file, the method considers the sentences or samples as the
number of remaining rows. Each subset has a different CSV file with different sentences
and rows.
The second approach of the experiment does not perform data preprocessing with the NLTK library on the sentences. The CSV file has a 'sentence' column with the raw text of each sentence line, as shown in Table 3. A user annotates the positive documents from each of the five labels, and the CSV files are then annotated automatically using regular expressions. Suppose a sentence in the sentence column has data that matches the regular expression pattern. In that case, the number one is assigned in the 'label' column, showing it is positive, as shown in Table 4. If the sentence in the sentence column does not match the regular expression used to automatically label the sentences, as shown in Table 3, it is assigned zero. Regular expressions extract specific keywords from each sentence's 'Data' column. The 'Regex' column shows a Boolean value indicating whether the regular expression matches the sentence column of a particular row, as shown in Table 4 [26]. The second approach of the experiment supplies a more authentic performance evaluation, showing that the imbalanced sampling techniques and classifiers work effectively.
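A condensed sketch of this automatic labeling step is shown below. It uses a shortened, regularized version of the keyword pattern quoted earlier, and the column names follow the description around Tables 3 and 4; the file paths are illustrative.

```python
import re
import pandas as pd

# Shortened, case-insensitive version of the keyword pattern described above.
KEYWORDS = re.compile(
    r"(?i)clinical|infectious|coronavirus|disease|covid|positive|community|case|"
    r"negative|area|epidemic|endemic|outbreak|isolation|respiratory|spread|PMC|symptomatic"
)

def annotate(csv_in: str, csv_out: str) -> pd.DataFrame:
    """Label each sentence 1 if the keyword pattern matches it, otherwise 0."""
    frame = pd.read_csv(csv_in)
    frame["Regex"] = frame["sentence"].astype(str).apply(lambda s: bool(KEYWORDS.search(s)))
    frame["label"] = frame["Regex"].astype(int)
    frame.to_csv(csv_out, index=False)
    return frame

# Illustrative usage:
# annotate("MedText/Immune.csv", "COVID_ANN/Immune.csv")   # hypothetical paths
```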
Each classifier has specific parameters to execute during each iteration and subsam-
pling size. The machine learning model sequentially adjusts the train split size to 17%, 33%,
50%, 67%, and 83% for each iteration, with the test split size being the remaining 100%
minus the train split size for the first approach. The second approach uses a constant 50%
training and testing size relevant to every iteration.
When subsampling occurs on the data from the first approach, iterations 1 to 13, corresponding to subsample sizes of 5% to 65%, increase or decrease the performance variation for all classifiers and imbalanced sampling techniques. Once iteration 13 is
reached, the machine learning models provide the best or worst performance depending on
each imbalanced sampling technique and classifiers. When the second approach involves
subsampling, increasing the subsampling size from 10% to approximately 90% or 100%
significantly improves the performance of all imbalanced sampling techniques and classi-
fiers. The subsampling size of 50% did supply the worst performance for all imbalanced
sampling techniques and classifiers.
The five classifiers span a linear model (Logistic Regression), ensemble models (XGBoost and Random Forest), a tree-based model (Decision Tree), and a probabilistic model (Naïve Bayes). Logistic Regression is a classifier whose main objective is to decide the likelihood that an instance belongs to a certain class.
Logistic regression models the likelihood of an output in terms of input characteristics,
as opposed to linear regression, which predicts continuous values [30]. While it does not
perform direct statistical classification, selecting a cutoff value can be employed to construct
a classifier. This method commonly assigns inputs with probabilities above the cutoff to
one class and those below the cutoff to the other in creating binary classifiers [30].
Predicting the category or class of a given instance is the aim of the probabilistic
machine learning method Multinomial Naïve Bayes classifier based on Bayes’ theorem. It
works well with data with characteristics that indicate discrete frequencies or counts of
occurrences in various natural language processing (NLP) applications because it can com-
pute the probability distribution of text data [31]. XGBoost is a machine learning algorithm
that harnesses the predictions of weak models, typically decision trees, to construct a robust
predictive model. Regression, classification, and ranking issues are among its common
applications. Constructing a powerful classifier from a set of feeble classifiers is the aim.
GPU support, a specialized data structure, goals, loss functions, cross-validation support,
and APIs are all features of XGBoost [32]. The Random Forest classifier is a meta-estimating
classifier that uses averaging to enhance prediction accuracy and address over-fitting. To
accomplish this, the algorithm fits several decision tree classifiers on various subsamples
of the dataset. It constructs these trees by randomly training them on different subsets.
Subsequently, these trees participate in a “voting” process to determine the final prediction,
with most trees determining the outcome. Random Forests mitigate overfitting and en-
hance forecast accuracy by aggregating the information of several trees [33]. Decision Tree
classifier builds a flowchart-like tree structure where each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node (terminal node)
holds a class label [34]. Decision trees have a hierarchical structure of root nodes, branches,
leaf nodes, and internal nodes. Decision trees are considered non-parametric; they make
no spatial distribution or classifier structure assumptions and can handle numerical and
categorical variables [35].
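For concreteness, the five classifiers can be instantiated as in the sketch below. This is illustrative rather than the study's exact configuration: the tuned parameter values for Random Forest and Logistic Regression mentioned later in this section are not reproduced here, so the settings shown are placeholders.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def make_classifiers(random_state: int = 42) -> dict:
    """Instantiate the five classifiers compared in this study.

    XGBoost, Naive Bayes, and Decision Tree use library defaults; the Random Forest
    and Logistic Regression settings are placeholders for the study's tuned values.
    """
    return {
        "LR": LogisticRegression(max_iter=1000, random_state=random_state),
        "NB": MultinomialNB(),
        "XG": XGBClassifier(eval_metric="logloss", random_state=random_state),
        "RF": RandomForestClassifier(n_estimators=100, random_state=random_state),
        "DT": DecisionTreeClassifier(random_state=random_state),
    }
```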
The features encompass sentences that undergo preprocessed or non-preprocessed
treatment to eliminate and filter out undesired elements using conventional preprocessing
techniques. We alternated these features to assess their impact on the performance of each
classifier and imbalanced sampling technique. Additionally, it is necessary to vectorize
these features for use by any machine learning classifier in the methodology. The labels
consist of positives (1 s) and negatives (0 s) for each respective class of the dataset. The clas-
sifiers and imbalanced sampling techniques are all implemented similarly. The only factors
that change the performance of each classifier are vectorization, training or testing size, and
subsampling size. Before running the machine learning model, the class distribution is balanced using the imbalanced sampling techniques. When employing an imbalanced sampling
technique, the features undergo resampling based on the majority class (negative) and the
minority class (positive). The machine learning models execute XGBoost, Naïve Bayes,
and Decision Tree classifiers with default parameters. However, the machine learning
models have changed parameters for Random Forest and Logistic Regression to optimize
performance, considering the small size of the dataset. Logistic Regression is potentially
biased to undersampling techniques used in this experiment that could have contributed
to improved performance than other classifiers [36].
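The section above does not name the vectorizer, so the TF-IDF vectorizer in the sketch below is an assumption; the five sampling techniques map directly onto their imbalanced-learn implementations, with "Imbalanced" meaning no resampling.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks

SAMPLERS = {
    "Imbalanced": None,  # no resampling: run the classifier on the data as-is
    "ROS": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "RUS": RandomUnderSampler(random_state=42),
    "NearMiss": NearMiss(),
    "TomekLinks": TomekLinks(),
}

def vectorize_and_resample(sentences, labels, technique: str):
    """Vectorize the sentences and rebalance the classes with the chosen technique."""
    X = TfidfVectorizer().fit_transform(sentences)
    sampler = SAMPLERS[technique]
    if sampler is None:
        return X, labels
    return sampler.fit_resample(X, labels)
```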
XGBoost has the fastest execution as an ensemble learning classifier, enabling it to supply the highest performance shown in this study. The Random Forest classifier also uses ensemble learning but takes the longest to execute and supplies a different performance result from the classifiers that execute faster. For detecting discrete features
in text classification, it is suitable to use Multinomial Naïve Bayes. Logistic Regression
classifier depends on binary variables that make it suitable for any binary classification
problem. This study uses binary classification instead of multiclass classification, making
the Logistic Regression classifier suitable for the hypothesis associated with this experiment.
The Decision Tree Classifier promotes swift execution, considering its efficacy for binary
and multiclass tasks. The Decision Tree classifier is the primary classifier evaluated in the
machine learning model.
The results file categorizes scores obtained without a sampling technique and method as 'Imbalanced' in the Sampling and
Technique columns. The Classifier column can specify five classifiers: Logistic Regression,
XGBoost, Decision Tree, Naïve Bayes, and Random Forest. The Test Split Size column is the remaining 100% minus the Train Split Size column. The Iteration and Subsample
columns can range from 1 to 20 intervals by 5% up to 100% for the first approach or 1 to
10 intervals by 10% up to 100% for the second approach. Precision, Recall, AUROC, and
Accuracy columns can range from 0% to 100% in decimal values [26].
The script stores the scores in NumPy arrays each time a classifier executes, depending
on the sampling method, sampling technique, test split size, train split size, subsampling
iteration, and subsample size. After running the classifiers based on one set of train and test
split sizes, the Python script appends the scores to the earlier score in each NumPy array.
Similarly, the COVID_Subset_Iterations.csv file of the second approach defines columns
such as “Subset”, “Other Subset”, “Sampling”, “Technique”, “Classifier”, “Iteration”, “Pre-
cision”, “Recall”, “AUROC”, and “Accuracy”. These files record the evaluation metrics
for each model, sampling method, and iteration, enabling analysis and comparison of
the results.
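A compact sketch of this bookkeeping is shown below, with column names mirroring those listed above; the output file name for the first approach is illustrative, since only the second approach's file is named in this section.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score, roc_auc_score, accuracy_score

COLUMNS = ["Sampling", "Technique", "Classifier", "Train Split Size", "Test Split Size",
           "Iteration", "Subsample", "Precision", "Recall", "AUROC", "Accuracy"]

def score_row(meta: dict, y_true, y_pred, y_prob) -> dict:
    """Build one results row from a run's metadata and its predictions."""
    return {**meta,
            "Precision": precision_score(y_true, y_pred),
            "Recall": recall_score(y_true, y_pred),
            "AUROC": roc_auc_score(y_true, y_prob),
            "Accuracy": accuracy_score(y_true, y_pred)}

# rows is a list of dicts appended after every classifier run and written once at the end:
# pd.DataFrame(rows, columns=COLUMNS).to_csv("MEDFULL_Iterations.csv", index=False)  # name illustrative
```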
Each classifier has specific parameters to execute during each iteration and subsam-
pling size. The machine learning model sequentially adjusts the train split size to 17%, 33%,
50%, 67%, and 83% for each iteration, with the test split size being the remaining 100%
minus the train split size. The first approach employs five different training and testing
split sizes, while the second uses a fixed training and testing size of 50%.
The train_test_split function from the Scikit-learn library performs the training size variation. The Scikit-learn model_selection module is a set of tools for splitting data and evaluating models. The train_test_split function, included in the Scikit-learn package, is valuable for splitting a dataset into two or more pieces for training and testing machine learning models. When running train_test_split during each subsampling iteration of the classifier, a user can specify the number of samples from each label to decide how closely the positive and negative classes match.
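A sketch of how the training size variation and subsampling loop could be wired together with train_test_split is given below; the split fractions follow the values stated above, while the stratified split and the way the subsample is taken (the first fraction of the training rows) are assumptions about the exact mechanism.

```python
from sklearn.model_selection import train_test_split

TRAIN_SIZES = [0.17, 0.33, 0.50, 0.67, 0.83]          # first approach; second approach uses 0.50
SUBSAMPLE_FRACTIONS = [i / 20 for i in range(1, 21)]  # 5%, 10%, ..., 100%

def iterate_splits(X, y, random_state: int = 42):
    """Yield (X_train, X_test, y_train, y_test, fraction) for every split and subsample size."""
    for train_size in TRAIN_SIZES:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_size, stratify=y, random_state=random_state)
        n = X_tr.shape[0]
        for frac in SUBSAMPLE_FRACTIONS:
            k = max(1, int(n * frac))
            yield X_tr[:k], X_te, y_tr[:k], y_te, frac
```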
Table 7 shows the lowest precision and recall scores found from an iteration of 13 or
65% subsampling size when the test and train split sizes are 50%. The results demonstrate
that iteration 13 has sporadic results.
Table 8 summarizes the label’s average scores regardless of sampling method and
technique, classifier, train split size, test split size, iteration, and subsampling size. Again,
Risk Factors supply the best scores, while the Testing label provides the worst scores.
Table 9 summarizes each classifier’s average precision, recall, AUROC, and accuracy
scores regardless of sampling method, sampling technique, label, train split size, test split
size, iteration, and subsampling size. Again, the XGBoost classifier supplies the best scores,
while the Logistic Regression classifier supplies the worst.
Table 10 summarizes the sampling method’s average precision, recall, AUROC, and
accuracy scores regardless of classifier, sampling technique, label, train split size, test split
size, iteration, and subsampling size. Again, the scores from imbalanced data outperform
the other types of sampling methods (Oversampling and Undersampling).
Table 11 summarizes the average sampling technique’s precision, recall, AUROC, and
accuracy scores regardless of sampling method, label, classifier, train split size, test split
size, iteration, and subsampling size. When the machine learning models use imbalanced sampling techniques, they have shown only slightly more effectiveness than simply running the machine learning models on the imbalanced dataset. TomekLinks has the best scores in comparison to the other imbalanced sampling techniques. NearMiss has the worst scores.
Table 10. Average Performance Metrics Based on Sampling for MEDFULL Data.
Table 11. Average Performance Metrics Based on Technique for MEDFULL Data.
Table 12 displays the ANOVA t-test results for classifiers based on precision. The p-values represent the statistical significance of the variations in precision between algorithm pairings. Significant variations (****) in precision between Decision Tree and Naive Bayes, Decision Tree and Random Forest, and Logistic Regression and XGBoost are among the noteworthy results. The adjusted p-values show that these significant differences hold even after correcting for multiple comparisons. On the other hand, several comparisons, such as Random Forest with XGBoost and Naive Bayes with Random Forest, have p-values over the standard cutoff of 0.05 and do not demonstrate any discernible differences.
Table 12. t-test Statistics of Precision Based on Classifiers for MEDFULL Data.
The ANOVA t-test of classifiers based on recall is shown in Table 13. The p-values reflect the statistical significance of differences in recall between the algorithm pairs. Several noteworthy findings emerge: vital statistical significance (****) is observed in the comparisons between Decision Tree and Logistic Regression, Decision Tree and Naive Bayes, Logistic Regression and Random Forest, Decision Tree and Random Forest, and Logistic Regression and XGBoost. These results remain significant even after adjusting for multiple comparisons, indicated by the adjusted p-values. Conversely, some comparisons, such as Naive Bayes with Random Forest and Naive Bayes with XGBoost, show no significant differences, as evidenced by p-values above the conventional threshold of 0.05.
Table 13. t-test Statistics of Recall Based on Classifiers for MEDFULL Data.
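The paper performs these pairwise comparisons in R with tidyverse-based tooling. Purely as an illustration of the same idea, a Python sketch of pairwise t-tests with a multiple-comparison adjustment is shown below; the data frame layout is assumed, and the Holm adjustment is used only as an example since the paper's exact adjustment method is not restated here.

```python
from itertools import combinations
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def pairwise_ttests(results, metric="Precision", group="Classifier"):
    """Run two-sample t-tests on every pair of classifiers and adjust the p-values."""
    pairs = list(combinations(sorted(results[group].unique()), 2))
    pvals = [ttest_ind(results.loc[results[group] == a, metric],
                       results.loc[results[group] == b, metric],
                       equal_var=False).pvalue
             for a, b in pairs]
    adjusted = multipletests(pvals, method="holm")[1]  # adjusted p-values
    return [(a, b, p, p_adj) for (a, b), p, p_adj in zip(pairs, pvals, adjusted)]
```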
Figure 3. ANOVA t-tests Plots for MEDFULL Data.
The 'MEDFULL—Precision & Technique' graph shows an F-value of 7.19 and an overall p-value less than 0.0001, meaning the sampling techniques and the Imbalanced data partially differ in performance metrics compared to other groups. However, every sampling technique supplies both non-significant and significant results compared to the other groups.
The 'MEDFULL—Recall & Technique' graph shows an F-value of 1.25 and a p-value of 0.28, meaning there is little or no variation among the sample means for recall scores. The only significant sampling technique comparison is Imbalanced versus SMOTE, with a p-value of 0.0245. The other comparisons are insignificant.
The ROC curves shown in Figure 4 depict the performance of the different classifiers depending on the training size. Naïve Bayes has the highest AUC values in all the graphs, from 0.74 to 0.78. Logistic Regression and Random Forest have the second highest values, between 0.72 and 0.77. Decision Tree and XGBoost performed similarly, at 0.66 and 0.70, when the training size was 16% and 33%. Decision Tree has the second lowest performance of 0.72, and XGBoost has the most inferior performance of 0.71.
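The ROC curves themselves can be generated directly from the fitted classifiers; a brief sketch is shown below, assuming each classifier exposes predict_proba on the vectorized test split.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc(fitted_classifiers: dict, X_test, y_test, title: str):
    """Plot one ROC curve per fitted classifier on the shared test split."""
    for name, clf in fitted_classifiers.items():
        probs = clf.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, probs)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(title)
    plt.legend()
    plt.show()
```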
Figure 4. ROC Curves Based on NearMiss Iteration 13 for MEDFULL Data.
Figure 5 shows precision and recall in different facets that compare each sampling technique for MEDFULL data. These facets show that all imbalanced sampling techniques and the imbalanced data have a similar performance, which does not address the imbalanced problem.
Figure 5. Facets of Recall and Precision Based on Technique for MEDFULL Data.
Figure 6 shows a scatter plot of precision and recall in a two-dimensional graph comparing the sampling techniques for the MEDFULL data. The MEDFULL data has more observations for analysis of the performance metrics from the CSV than the Subset data. However, the scatter plot shows that the Imbalanced data performs similarly to the other imbalanced sampling techniques.
Figure 6. Scatter Plot of Precision and Recall Based on Technique for MEDFULL Data.
Figure 7 shows a heatmap of precision and recall for MEDFULL data. Again, the MEDFULL data shows much variability due to more scores and observations. In this heatmap, the highest values occur when the recall and precision are around 60%. Even with imbalanced sampling techniques, the classifiers supply the lowest recall and precision scores compared to the Subset data.
Figure 7. Heatmap of Precision and Recall for MEDFULL Data.
4.2. Results from Subset Data
The Subset data collection involves three subsets (Train Set, Dev Set, and Test Set). This second approach uses regular expressions to decide whether a sentence matches a keyword and to label it positive or negative. The Subset data collection has addressed the imbalanced issue [26].
Table 14 shows that the highest scores are possible when the iteration is 9 or 10, or the subsample size is 90% or 100%. Therefore, the machine learning models obtain the best scores from the Subset data by utilizing 90% and 100% of the dataset.
Subset Other Set Technique 1 Classifier 1 Iteration Precision Recall AUROC Accuracy
Train_Set Test_Set ROS XG 9 0.991202 0.936199 0.961843 0.961836
Train_Set Test_Set ROS RF 9 0.991104 0.939023 0.963742 0.963584
Train_Set Test_Set ROS LR 10 0.99074 0.935225 0.96111 0.961088
Train_Set Test_Set ROS NB 10 0.990414 0.93441 0.960697 0.96059
Train Set Test Set ROS DT 10 0.990303 0.9352 0.96712 0.96181
1 This table only shows statistics related to Random Over Sampling. Abbreviations for Full terms:
NM—NearMiss, XG—XGBoost, RF—Random Forest, LR—Logistic Regression, NB—Naïve Bayes, DT—Decision
Tree, ROS—Random Over Sampling.
Table 15 shows that the lowest scores occur when the dataset has a 50% or 60% subsampling size, or iterations of 5 and 6. The results show that using only half of the samples from the Train Set, Dev Set, and Test Set yields the lowest scores on the dataset.
Subset Other Set Technique 1 Classifier 1 Iteration Precision Recall AUROC Accuracy
Train_Set Test_Set NM LR 5 0.541509 0.988519 0.597201 0.58692
Train_Set Test_Set NM DT 5 0.578896 0.989391 0.647256 0.641251
Train_Set Test_Set NM XG 5 0.594941 0.993838 0.665787 0.662264
Train Set Test Set NM RF 5 0.598006 0.982636 0.66621 0.663715
Train Set Test Set RUS NB 6 0.676887 0.989792 0.762975 0.760883
1 This table only shows statistics related to undersampling techniques. Abbreviations for Full terms:
NM—NearMiss, LR—Logistic Regression, DT—Decision Tree, XG—XGBoost. RF—Random Forest, and
NB—Naïve Bayes, RUS—Random Under Sampling.
Table 16 shows that the average scores from the Test Set are better in the precision, recall, and AUROC columns. However, the Dev Set was superior only in the accuracy column.
Table 16. Average Performance Metrics Based on Label for Subset Data.
Table 17 summarizes each classifier’s average precision, recall, AUROC, and accuracy
scores regardless of other categories. Again, Logistic Regression has the worst performance
as a classifier, while Random Forest has the best performance.
Table 17. Average Performance Metrics Based on Classifier for Subset Data.
Table 18 summarizes the average precision, recall, AUROC, and accuracy scores for
each type of sampling regardless of other categories. Again, the Oversampling methods
outperform the imbalanced data and Undersampling methods.
Table 18. Average Performance Metrics Based on Sampling for Subset Data.
All imbalanced sampling techniques perform more effectively than imbalanced data
in three types of metrics (precision, recall, and AUROC), excluding accuracy, as shown
in Table 19. The second approach, which uses regular expressions that match keywords to label documents with values of one and zero, has helped address the imbalanced class problem, as the average performance metrics provided show.
Table 19. Average Performance Metrics Based on Technique for Subset Data.
Table 20 shows the Logistic Regression and Random Forest comparison in Group 1 and
Group 2 yields a p-value of 0.0414, marked with an asterisk (*) showing significance at the
0.05 level. However, this significance disappears after adjusting for multiple comparisons
(p.adj), suggesting that the observed difference may be due to chance. The results show no
statistically significant differences in precision between Group 1 and Group 2.
Table 20. t-test Statistics of Precision Based on Classifiers for Subset Data.
Table 21 shows that all the p-values are higher than the usual significance level of 0.05. In this case, the recall measure does not show statistically significant differences between Group 1 and Group 2 for any of the classifiers provided, such as Decision Tree, Logistic Regression, Naive Bayes, Random Forest, and XGBoost. Even after considering multiple comparisons, the p.adj values indicate no significant changes. Based on the recall measure, the study shows no clear difference between how well the machine learning methods in Group 1 and Group 2 performed on the given dataset.
Table 21. t-test Statistics of Recall Based on Classifiers for Subset Data.
The graph titled 'Subset—Precision & Classifier' from Figure 8 shows an F-value of 1.25 and a p-value of 0.29, indicating similar scores regardless of the classifier; all classifier comparisons from the Subset data are insignificant. There is barely any variation, and all the classifiers perform similarly.
The graph titled 'Subset—Recall & Classifier' has an F-value of 0.96 and a p-value of 0.43, showing that all classifier comparisons from the Subset data for recall are insignificant, even more so than for precision. There is variation, but it is nearly non-existent. All classifiers have similar scores for recall and precision.
The graph titled 'Subset—Precision & Sampling' shows that all sampling methods with
precision scores have shown to be significant, with an F-value of 27.64, and all comparisons
are less than the p-value of 0.0001. Oversampling offers the best performance among all the
sampling methods, while Imbalanced and Undersampling perform similarly. There is a
noticeable difference in scores among all the sampling methods.
The graph titled ‘Subset—Recall & Sampling’ shows that all sampling methods with
recall scores have shown to be significant, with an F-value of 111.1, and all comparisons
have a p-value of less than 0.0001. However, the recall scores have a more noticeable
difference than the precision scores.
For the graph titled 'Subset—Precision & Technique', an F-value of 14.11 and a p-value less than 0.0001 show slight performance variation across all
sampling techniques and Imbalanced data. Imbalanced data is significant compared to
other techniques. Imbalanced data has a difference in scores compared to NearMiss, ROS,
RUS, and SMOTE. The four sampling techniques have similar scores to TomekLinks and
Imbalanced data. This similarity shows that four sampling techniques (NearMiss, ROS,
RUS, and SMOTE) can address the imbalanced dataset problem.
Figure 8. ANOVA t-tests Plots for Subset Data.
The graph titled 'Subset—Recall & Technique' shows an F-value of 77.72 and a p-value less than 0.0001 for the sampling techniques. The Imbalanced label (being imbalanced data) and TomekLinks perform similarly to the other sampling techniques (NearMiss, ROS, RUS, and SMOTE).
The ROC curves in Figure 9 are the highest and lowest performances possible for Subset data. The performance is closely similar for all classifiers from Iterations 5 and 10. Iteration 10 shows a slight decrease in performance from Iteration 5. The highest performance was from Iteration 5 of Random Over Sampling, and Iteration 10 of NearMiss records the lowest performance. The Random Forest and XGBoost classifiers have the highest performance, averaging 0.99, and Naive Bayes has the most inferior performance of all classifiers.
Figure 9. ROC Curves Based on NearMiss and ROS Iterations 5 and 10 for Subset Data.
Figure 10 presents precision and recall scores for different facets, comparing each sampling technique for the Subset data. Based on these facets, ROS, RUS, SMOTE, and NearMiss have data points in similar locations compared to the Imbalanced data and the TomekLinks technique. By visualizing the different facets, it becomes clear that these four imbalanced sampling techniques were more effective in addressing our class imbalance problem than the other techniques.
Figure 11 shows a scatter plot of precision and recall in a two-dimensional graph that compares the imbalanced sampling techniques for the Subset data. The Subset data has fewer observations and therefore supplies fewer points than the MEDFULL data. TomekLinks and Imbalanced have similar patterns, where precision and recall fall between 69% and 98%. ROS and SMOTE show similar patterns of high recall and precision compared to the other techniques, including the Imbalanced data.
Figure 12 shows a heatmap of precision and recall in a two-dimensional graph for
Subset data. There is a low amount of variability due to the number of observations. The
lighter the color, the higher the number of values occupying a particular graph area. The
highest scores for precision and recall are easier to find based on the heatmap.
Figure 10. Facets of Precision and Recall Based on Technique for Subset Data.
Figure 11. Scatter Plot of Precision and Recall Based on Technique for Subset Data.
Figure 12. Heatmap of Precision and Recall for Subset Data.
Author Contributions: Conceptualization, J.D.; Methodology, J.D.; Software, J.D. and M.R.; Valida-
tion, J.D. and M.R.; Formal Analysis, J.D.; Investigation, M.R.; Resources, J.D.; Data Curation, J.D.;
Writing—Original Draft Preparation, J.D.; Writing—Review & Editing, J.D. and M.R.; Visualization,
J.D.; Supervision, M.R.; Project Administration, M.R.; Funding Acquisition, M.R. All authors have
read and agreed to the published version of the manuscript.
Funding: This work is supported by the National Science Foundation (NSF) grant (ID#2131307)
under the CISE-MSI program.
Data Availability Statement: The code and datasets can be downloaded from https://github.com
/JDixonCS/Document-Classification (accessed on 30 November 2023). Dataset is accessible from
https://github.com/JDixonCS/Document-Classification/tree/main/classifier/MEDTEXT (accessed
on 30 November 2023). Results are accessible from https://github.com/JDixonCS/Document-Class
ification/tree/main/classifier/Results (accessed on 30 November 2023).
Acknowledgments: The authors want to thank Ian Soboroff for his expertise, wisdom, phenomenal
knowledge, technical guidance, and mentorship, and for providing the opportunity
to intern for the Retrieval Group at the National Institute of Standards and Technology, all of
which contributed to the completion of this research project. Ian Soboroff has consented to the
acknowledgement of this paper.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Büttcher, S.; Clarke, C.; Cormack, G.V. Information Retrieval: Implementing and Evaluating Search Engines; The MIT Press: Cambridge,
MA, USA, 2010.
2. Belkin, N.J.; Croft, W.B. Information Filtering and Information Retrieval: Two Sides of the Same Coin? Commun. ACM 1992, 35,
29–38. [CrossRef]
3. Kowsari, K.; Meimandi, K.J.; Heidarysafa, M.; Mendu, S.; Barnes, L.E.; Brown, D.E. Text Classification Algorithms: A Survey.
Information 2019, 10, 150. [CrossRef]
4. Zhou, Z.-H.; Liu, X.-Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans.
Knowl. Data Eng. 2006, 18, 63–77. [CrossRef]
5. Zhang, Z.; Jasaitis, T.; Freeman, R.; Alfrjani, R.; Funk, A. Mining Healthcare Procurement Data Using Text Mining and Natural
Language Processing—Reflection from an Industrial Project. arXiv 2023, arXiv:2301.03458. [CrossRef]
6. Borko, H.; Bernick, M. Automatic Document Classification Part II. Additional Experiments. J. ACM 1964, 11, 138–151. [CrossRef]
7. Shakarami, A.; Ghobaei-Arani, M.; Shahidinejad, A. A Survey on the Computation Offloading Approaches in Mobile Edge
Computing: A Machine Learning-based Perspective. Comput. Netw. 2020, 182, 107496. [CrossRef]
8. Akritidis, L.; Bozanis, P. A Supervised Machine Learning Classification Algorithm for Research Articles. In Proceedings of
the 28th Annual ACM Symposium on Applied Computing (SAC’ 13), Coimbra, Portugal, 18–22 March 2013; Association for
Computing Machinery: New York, NY, USA, 2013; pp. 115–120. [CrossRef]
9. Goudjil, M.; Koudil, M.; Bedda, M.; Ghoggali, N. A Novel Active Learning Method Using SVM for Text Classification. Int. J.
Autom. Comput. 2018, 15, 290–298. [CrossRef]
10. Kadhim, A.I. An evaluation of preprocessing techniques for text classification. Int. J. Comput. Sci. Inf. Secur. 2018, 16, 22–32.
11. Mali, M.; Atique, M. The Relevance of Preprocessing in Text Classification. In Integrated Intelligence Enable Networks and Computing;
Mer, K.K.S., Semwal, V.B., Bijalwan, V., Crespo, R.G., Eds.; Springer: Singapore, 2021; pp. 553–559.
12. Imberg, H.; Yang, X.; Flannagan, C.; Bärgman, J. Active sampling: A machine-learning-assisted framework for finite population
inference with optimal subsamples. arXiv 2022, arXiv:2212.10024.
13. Kumar, V.; Balloccu, S.; Wu, Z.; Reiter, E.; Helaoui, R.; Recupero, D.R.; Riboni, D. Data Augmentation for Reliability and
Fairness in Counselling Quality Classification. In Proceedings of the 1st Workshop on Scarce Data in Artificial Intelligence for
Healthcare-SDAIH, INSTICC, Vienna, Austria, 23 July 2022; SciTePress: Setúbal, Portugal, 2023; pp. 23–28.
14. Oyedare, T.; Park, M.J. Estimating the Required Training Dataset Size for Transmitter Classification Using Deep Learning. In
Proceedings of the 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), Newark, NJ, USA,
11–14 November 2019; pp. 1–10. [CrossRef]
15. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A Survey on Text Classification: From Traditional to Deep Learning.
ACM Trans. Intell. Syst. Technol. 2022, 13, 1–41. [CrossRef]
16. Mujtaba, G.; Shuib, L.; Idris, N.; Hoo, W.L.; Raj, R.G.; Khowaja, K.; Shaikh, K.; Nweke, H.F. Clinical text classification research
trends: Systematic literature review and open issues. Expert Syst. Appl. 2019, 116, 494–520. [CrossRef]
17. Kamath, C.N.; Bukhari, S.S.; Dengel, A. Comparative Study between Traditional Machine Learning and Deep Learning Ap-
proaches for Text Classification. In Proceedings of the ACM Symposium on Document Engineering 2018 (DocEng’18), Halifax,
NS, Canada, 28–31 August 2018; Association for Computing Machinery: New York, NY, USA, 2018. [CrossRef]
18. Kim, M.; Hwang, K.-B. An empirical evaluation of sampling methods for the classification of imbalanced data. PLoS ONE 2022,
17, e0271260. [CrossRef] [PubMed]
19. Agarwal, B.; Mittal, N. Text Classification Using Machine Learning Methods—A Survey. In Proceedings of the Second Interna-
tional Conference on Soft Computing for Problem Solving (SocProS 2012), Jaipur, India, 28–30 December 2012; Babu, B.V., Nagar,
A., Deep, K., Pant, M., Bansal, J.C., Ray, K., Gupta, U., Eds.; Springer: New Delhi, India, 2014; pp. 701–709. [CrossRef]
20. Gaudreault, J.-G.; Branco, P.; Gama, J. An Analysis of Performance Metrics for Imbalanced Classification. In Discovery Science;
Soares, C., Torgo, L., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 67–77.
21. Mishra, P.; Biancolillo, A.; Roger, J.M.; Marini, F.; Rutledge, D.N. New data preprocessing trends based on ensemble of multiple
preprocessing techniques. TrAC Trends Anal. Chem. 2020, 132, 116045. [CrossRef]
22. Nordmann, E.; McAleer, P.; Toivo, W.; Paterson, H.; DeBruine, L.M. Data Visualization Using R for Researchers Who Do Not Use
R. Adv. Methods Pract. Psychol. Sci. 2022, 5, 25152459221074656. [CrossRef]
23. Aust, F.; van Doorn, J.; Haaf, J.M. Translating default priors from linear mixed models to repeated-measures ANOVA and paired
t-tests. Transl. Priors 2022. [CrossRef]
24. Moscarelli, M. Exploratory Data Analysis in ‘R’. In Biostatistics with “R”: A Guide for Medical Doctors; Springer International
Publishing: Cham, Switzerland, 2023; pp. 23–40. [CrossRef]
25. COVID-19 Research Articles Downloadable Database; Centers for Disease Control and Prevention: Atlanta, GA, USA, 2020. Available online: https://www.cdc.gov/library/researchguides/2019novelcoronavirus/researcharticles.html (accessed on 9 October 2020).
26. Rahman, M.M.; Dixon, J. Machine Learning for Detecting Trends and Topics from Research Papers and Proceedings in Biomedical
Literature. Research Square. Available online: https://www.researchsquare.com (accessed on 3 November 2023).
27. PMC Open Access Subset-PMC. PubMed Central. 2003. Available online: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
(accessed on 1 December 2023).
28. Paper, D. TensorFlow 2.x in the Colaboratory Cloud; Apress: Berkeley, CA, USA, 2021. Available online: https://link.springer.com/book/10.1007/978-1-4842-6649-6 (accessed on 1 December 2023).
29. COVID-19: A Glossary of Key Terms; Henry Ford Hospital: Detroit, MI, USA, 2020. Available online: https://www.henryford.com/blog/2020/04/covid19-key-terms-to-know (accessed on 22 April 2020).
30. Subasi, C. Logistic Regression Classifier. Available online: https://towardsdatascience.com/logistic-regression-classifier-8583e0c3cf9 (accessed on 2 April 2019).
31. Roy, R. The Naive Bayes Classifier. Available online: https://towardsdatascience.com/the-naive-bayes-classifier-how-it-works-e229e7970b84 (accessed on 28 April 2022).
32. What is XGBoost? Available online: https://www.nvidia.com/en-us/glossary/data-science/xgboost/ (accessed on 1 December 2023).
33. sklearn.ensemble.RandomForestClassifier. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (accessed on 1 December 2023).
34. sklearn.tree.DecisionTreeClassifier. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html (accessed on 1 December 2023).
35. Bento, C. Decision Tree Classifier Explained in Real-Life: Picking a Vacation Destination. Available online: https://towardsdatascience.com/decision-tree-classifier-explained-in-real-life-picking-a-vacation-destination-6226b2b60575 (accessed on 18 July 2021).
36. Cartus, A.R.; Bodnar, L.M.; Naimi, A.I. The Impact of Undersampling on the Predictive Performance of Logistic Regression and Machine Learning Algorithms: A Simulation Study. Available online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7871213/ (accessed on 1 December 2023).
37. Park, H.M. Comparing Group Means: T-Tests and One-Way ANOVA Using Stata, SAS, R, and SPSS. Available online: https://scholarworks.iu.edu/dspace/handle/2022/19735 (accessed on 1 January 2009).
38. Çetinkaya-Rundel, M.; Grolemund, G.; Wickham, H. R for Data Science (2e). Hadley Wickham, December 2016. Available online: https://r4ds.hadley.nz/ (accessed on 1 December 2023).
39. Agarwal, I.; Rana, D.; Jariwala, A.; Bondre, S. A Novel Stance based Sampling for Imbalanced Data. Int. J. Adv. Comput. Sci. Appl.
2022, 13, 461–467. [CrossRef]
40. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell.
Res. 2002, 16, 321–357. [CrossRef]
41. Brownlee, J. Random Oversampling and Undersampling. Machine Learning Mastery. Available online: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/ (accessed on 1 December 2023).
42. Tanimoto, A.; Yamada, S.; Takenouchi, T.; Sugiyama, M.; Kashima, H. Improving imbalanced classification using near-miss
instances. Expert Syst. Appl. 2022, 201, 117130. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.