
Mao and Lu Journal of Biomedical Semantics (2017) 8:15

DOI 10.1186/s13326-017-0123-3

RESEARCH Open Access

MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank

Yuqing Mao1,2 and Zhiyong Lu2*

Abstract

Background: MeSH indexing is the task of assigning relevant MeSH terms based on a manual reading of scholarly publications by human indexers. The task is highly important for improving literature retrieval and many other scientific investigations in biomedical research. Unfortunately, given its manual nature, the process of MeSH indexing is both time-consuming (new articles are typically not indexed until 2 or 3 months after they first enter PubMed) and costly (approximately ten dollars per article). In response, automatic indexing by computers has been previously proposed and attempted, but it remains challenging. In order to advance the state of the art in automatic MeSH indexing, a community-wide shared task called BioASQ was recently organized.

Methods: We propose MeSH Now, an integrated approach that first uses multiple strategies to generate a combined list of candidate MeSH terms for a target article. Through a novel learning-to-rank framework, MeSH Now then ranks the list of candidate terms based on their relevance to the target article. Finally, MeSH Now selects the highest-ranked MeSH terms via a post-processing module.

Results: We assessed MeSH Now on two separate benchmarking datasets using traditional precision, recall and F1-score metrics. In both evaluations, MeSH Now consistently achieved over 0.60 in F-score, ranging from 0.610 to 0.612. Furthermore, additional experiments show that MeSH Now can be optimized by parallel computing in order to process MEDLINE documents on a large scale.

Conclusions: We conclude that MeSH Now is a robust approach with state-of-the-art performance for automatic MeSH indexing, and that MeSH Now is capable of processing PubMed-scale document collections within a reasonable time frame.

Availability: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/MeSHNow/

* Correspondence: [email protected]
2National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
Full list of author information is available at the end of the article

Background
The rapid growth of scholarly publications in biomedicine makes the search for relevant information in the literature increasingly difficult, even for specialists [1, 2]. To date, PubMed—the U.S. National Library of Medicine (NLM) premier bibliographic database—contains over 24 million articles from over 5,600 biomedical journals, with more than a million records added each year. To facilitate searching these articles in PubMed, a controlled vocabulary called Medical Subject Headings (MeSH)¹ was created by the NLM and has been updated annually since the 1960s. Currently, MeSH 2015 consists of over 27,000 terms representing a wide spectrum of key biomedical concepts (e.g. Humans, Parkinson Disease) in a hierarchical structure. MeSH terms are primarily used to index articles in PubMed for improving literature retrieval: the practice of manually assigning relevant MeSH terms to new publications in PubMed by the NLM human indexers is known as MeSH indexing [3]. Assigned MeSH terms can then be used implicitly (e.g., automatic query expansion using MeSH) or explicitly in PubMed searches [4]. Compared with the commonly used keyword-based PubMed searches, MeSH indexing allows for semantic searching (using the relationships between the subject headings) and searching against concepts not necessarily present in the PubMed abstract.
In addition to its use in PubMed, MeSH indexing results have also been used creatively in many other scientific investigation areas, including information retrieval, text mining, citation analysis, education, and traditional bioinformatics research (see Fig. 1).

Fig. 1 Applications of MeSH

When applied to information retrieval, MeSH and its indexing results have been used to build “tag clouds” for improving the visualization of search results [5, 6] and to help distinguish between publication authors with identical names [7, 8]. Another major use of MeSH indexing is in biomedical text mining, where it has been applied to problems such as document summarization [9], document clustering [10], and word sense disambiguation [11]. MeSH indexing also serves several key roles in citation analysis, from identifying emerging research trends [12, 13] to measuring journal similarity [14] and characterizing research profiles for an individual researcher, institute or journal [15]. In the era of evidence-based practice, MeSH has become increasingly important in assessing and training the literature search skills of healthcare professionals [16, 17], as well as in assisting undergraduate education in the biological sciences [18]. Finally, much bioinformatics research, such as gene expression data analysis [19, 20], greatly benefits from MeSH indexing [21–25].
Like many manual annotation projects [26–30], MeSH indexing is a labour-intensive process. As shown in [3, 31], it can take an average of 2 to 3 months for an article to be manually indexed with relevant MeSH terms after it first enters PubMed. In response, many automated systems for assisting MeSH indexing have been previously proposed. In general, most existing methods are based on the following techniques: i) pattern matching, ii) text classification, iii) k-Nearest Neighbours, iv) learning-to-rank, or v) a combination of multiple techniques. Pattern-matching methods [32] search for exact or approximate matches of MeSH terms in free text. Automatic MeSH indexing can also be regarded as a multi-class text classification problem where each MeSH term represents a distinct class label. Thus many multi-label text classification methods have been proposed, such as neural networks [33], Support Vector Machines (SVM) [34, 35], Inductive Logic Programming [36], naïve Bayes with an optimal training set [37], Stochastic Gradient Descent [38], and meta-learning [39]. While the pattern matching and text classification methods use only the information in the MeSH thesaurus and the document itself, the k-Nearest Neighbours (k-NN) approach takes advantage of the manual annotations of documents similar to the target document, e.g. [40, 41]. Additional information, such as citations, can also be utilized for automatic MeSH indexing. For example, Delbecque and Zweigenbaum [42] investigated computing neighbour documents based on the cited articles and cited authors. More recently, Huang et al. [3] reported a novel approach based on learning-to-rank algorithms [43]. This approach has been shown to be highly successful in the recent BioASQ² challenge evaluations [44–46] and has also been adopted by many others [47, 48]. Finally, many methods attempt to combine the results of different approaches [49, 50]. For instance, the current production system for MeSH indexing at the NLM is called the Medical Text Indexer (MTI), a hybrid system that combines both pattern matching and k-NN results [51] via manually developed rules and continues to be improved over the years [52, 53]. The proposed method in this work is also a hybrid system, but unlike MTI, which only uses machine learning to predict a small set of MeSH terms, it combines the individual results and ranks the entire set of recommendations through machine learning instead of heuristic rules.
Despite these efforts, automatic MeSH indexing remains a challenging task: the current state-of-the-art performance remains at about 0.6 in F-measure [54]. Several factors contribute to this performance bottleneck. First, since each PubMed article can be assigned multiple MeSH terms, i.e. class labels, the task of automatic MeSH indexing can be seen as a multi-class

classification problem. In this regard, the size of the MeSH vocabulary makes automatic classification challenging: 2014 MeSH includes more than 27,000 main subject headings, and they are not equally used in indexing [31]. Second, MeSH indexing is a highly complex cognitive task. It has been reported that the consistency between human indexers is only 48.2% for main heading assignment [55]. Lastly, both the MeSH vocabulary and the indexing principles keep evolving over time. For instance, in response to emerging new concepts in biomedical research, MeSH 2014 includes almost five times more concepts than the 1963 edition of MeSH, which contained only 5,700 descriptors. On the other hand, the articles in PubMed are not re-indexed when MeSH gets updated. Thus, it is not always obvious how to select benchmarking data sets for system development and comparison.
In this paper, we propose a new method, MeSH Now, for the automatic MeSH indexing task. MeSH Now is built on our previous research [3] but has a number of significant advancements: First, MeSH Now combines different methods through machine learning. Second, new post-processing and list-pruning steps are now added in MeSH Now for improved performance. Third, from a technical perspective, MeSH Now is optimized using the latest MeSH lexicon and recently indexed articles for system training and development. Finally, MeSH Now is implemented to operate in a parallel computing environment, making it suitable for large-scale processing needs (e.g., providing computer-generated results for new PubMed articles to assist human indexing). For evaluation, we first test MeSH Now on a previous dataset that was widely used in benchmarking. Furthermore, we created a new benchmarking dataset based on the recent BioASQ 2014 challenge task data. Our experimental results show that MeSH Now achieves state-of-the-art performance on both data sets.

Methods
Approach overview
Our approach reformulates the MeSH indexing task as a ranking problem. Figure 2 shows the three main steps: First, given a target article, we obtain an initial list of candidate MeSH terms from three unique sources. Next, we apply a learning-to-rank algorithm to sort the candidate MeSH terms based on the learned associations between the document text and each candidate MeSH term. Finally, we prune the ranked list and return a number of top candidates as the final system output. Prior to these steps, some standard text processing was performed, such as removing stop words and applying a word-stemming algorithm.

Fig. 2 System overview
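For concreteness, the end-to-end flow can be sketched as follows (illustrative Python; the helper functions are ours and merely stand in for the three input sources and the ranking and pruning components described in the sections below):

    def mesh_now(article, ranker):
        # Step 1: pool candidate MeSH terms from the three input sources
        # (k-NN neighbours, binary text classifiers, and MTI).
        candidates = set()
        candidates |= terms_from_knn_neighbours(article)    # Input source I
        candidates |= terms_from_text_classifiers(article)  # Input source #2
        candidates |= terms_from_mti(article)                # Input source #3

        # Step 2: score each candidate with the learning-to-rank model,
        # using features computed for the (article, term) pair.
        scored = [(term, ranker.score(make_features(article, term)))
                  for term in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)

        # Step 3: post-process and prune the ranked list, then return the
        # surviving MeSH terms as the final output.
        return post_process_and_prune(scored)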



Input source I: K-nearest neighbours
We first adapt the PubMed Related Articles algorithm [56] to retrieve k-nearest neighbours for each new PubMed article. The assumption is that documents similar in content would share similar MeSH term annotations. Previous work [3] has supported this assumption by showing that over 85% of the gold-standard MeSH annotations for a target document are present in its nearest 20 neighbours.
Furthermore, we found that retrieving neighbours from the whole MEDLINE database performed worse than retrieving neighbours only from a subset of the database (e.g., articles in the BioASQ Journal List, or newly published articles). In particular, the results of our approach are best when limiting the neighbour documents to articles indexed in the last 5 years (i.e. articles that were assigned MeSH terms after 2009). As mentioned before, MeSH terms evolve every year but the articles already indexed will never be re-indexed. The same article would likely be assigned different MeSH terms in 2014 versus 20 years ago. Thus there are many outdated MeSH terms in those neighbour documents, which can be harmful to the accuracy of our approach. Moreover, the word frequencies are also different in the older and more recent articles, and these are closely related to the similarity score for two articles. Therefore, we built our index with only articles that were assigned MeSH terms after 2009, and retrieved the neighbour documents using this new index instead of retrieving similar documents from the whole of PubMed. When building our document index for the PubMed Related Articles algorithm³, we also make sure that all annotated MeSH terms are removed, such that they are not used in the computation of the neighbour documents. In other words, the similarity between two documents is based solely on the words they have in common.
The parameter k was fixed (k = 20) in [3], which means the same number of neighbours is included for all target articles. However, we observed that some articles may only have a few very similar documents. We therefore adjust the parameter k dynamically between 10 and 40 in this work according to the similarity scores of the neighbours: the smaller the average similarity score of the neighbours, the fewer neighbours are used. Once those k-nearest neighbour documents are retrieved, we collect all of the unique MeSH terms associated with those neighbour documents. Note that we only considered the main headings and removed subheadings attached to the main headings.
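A minimal sketch of this dynamic neighbour selection follows (the text only fixes the range 10–40, so the scaling from average similarity to k below is an illustrative assumption):

    def choose_k(similarities, k_min=10, k_max=40):
        # similarities: similarity scores of the retrieved neighbours to the
        # target article, assumed here to be normalized to [0, 1].
        avg = sum(similarities[:k_max]) / min(len(similarities), k_max)
        # The lower the average similarity, the fewer neighbours are kept.
        k = int(round(k_min + (k_max - k_min) * avg))
        return max(k_min, min(k, k_max))

    def candidate_terms_from_neighbours(neighbours):
        # neighbours: list of (similarity, main_headings) pairs sorted by
        # decreasing similarity; subheadings have already been stripped.
        k = choose_k([sim for sim, _ in neighbours])
        terms = set()
        for _, headings in neighbours[:k]:
            terms.update(headings)  # collect unique main headings only
        return terms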
Input source #2: multi-label text classification
Motivated by [57], we implemented a multi-label text classification approach in which we treat each MeSH concept as a label and build a binary classifier accordingly. More specifically, we first train individual classification models for each of the 20,000 most frequently indexed MeSH terms, as the remaining ones are rarely used in indexing. Then we apply these models to the new article and add the positively classified MeSH concepts as candidates to the initial list. We also keep the associated numerical prediction scores and use them as features in the next step.
Our implementation is based on cost-sensitive SVM classifiers [58] with the Huber loss function [59]. Cost-sensitive SVMs have been shown to be a good solution for dealing with imbalanced and noisy data in biomedical documents [60]. Let C+ denote the higher misclassification cost of the positive class and C− the lower misclassification cost of the negative class; the cost function is then formulated as:

    (λ/2)·||w||² + C+ Σ_{i: y_i = 1} h(y_i(θ + w·x_i)) + C− Σ_{i: y_i = −1} h(y_i(θ + w·x_i))

where MeSH terms are treated as the class labels C in the classification, x_i is a document of a given class (i.e. assigned a specific MeSH term), λ is a regularization parameter, w is a vector of feature weights, and θ is a threshold. The function h is the modified Huber loss function and has the form:

    h(z) = −4z         if z ≤ −1
           (1 − z)²    if −1 < z < 1
           0           if z ≥ 1

We can choose C+ to be greater than C− to overcome the dominance of negative points in the decision process (here we set C+ = rC− with the ratio r set to 1.5). To train these 20,000 classifiers, we used the MEDLINE articles that were indexed with MeSH terms between January 2009 and March 2014.
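Both formulas translate directly into code; the sketch below mirrors them (only the objective is shown, and the optimization over w and θ, e.g. by stochastic gradient descent, is omitted):

    def modified_huber(z):
        # Modified Huber loss h(z) as defined above.
        if z <= -1:
            return -4.0 * z
        if z < 1:
            return (1.0 - z) ** 2
        return 0.0

    def cost_sensitive_objective(w, theta, docs, labels, lam, c_neg, r=1.5):
        # docs: feature vectors x_i; labels: y_i in {+1, -1}.
        # C+ = r * C- with r = 1.5, as described in the text.
        c_pos = r * c_neg
        penalty = 0.5 * lam * sum(wj * wj for wj in w)  # (lambda/2) * ||w||^2
        loss = 0.0
        for x, y in zip(docs, labels):
            margin = y * (theta + sum(wj * xj for wj, xj in zip(w, x)))
            loss += (c_pos if y == 1 else c_neg) * modified_huber(margin)
        return penalty + loss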
Input source #3: MTI results
MTI, which is used as one of the baselines in the BioASQ task, primarily uses MetaMap to map the phrases in the text to UMLS (Unified Medical Language System) concepts [61]. We thus add all MeSH terms predicted by MTI as candidates and obtain the feature vectors for those MeSH terms. This is useful since MTI can return correct MeSH terms not found by the other two methods.

Learning to rank
Once an initial list of candidate MeSH terms from all three sources is obtained, we approach the task of MeSH indexing as a ranking problem. In our previous work, we trained the ranking function with ListNet [62], which sorts the results based on a list of scores. In this work we evaluated several other learning-to-rank algorithms [43] on the BioASQ test dataset, including MART [63], RankNet [64], Coordinate Ascent [65], AdaRank [66], and LambdaMART, all of which are available in RankLib v2.2⁴, and found that LambdaMART achieved the best performance. LambdaMART [67] is a combination of MART and LambdaRank, where the MART algorithm can be viewed as a generalization of logistic regression [63] and LambdaRank is a method for learning arbitrary information retrieval measures [68]. To train such a model, LambdaMART uses gradient boosting to optimize a ranking cost function where the base learners are limited-depth regression trees. New trees are added to the ensemble sequentially so that each best accounts for the remaining regression error of the training samples, i.e., each new tree greedily minimizes the cost function. LambdaMART uses MART with specified gradients and Newton's approximation. LambdaMART is briefly presented as follows [67]:
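In condensed form, and with the helper routines left abstract, the training loop looks roughly as follows (an illustrative sketch after [67], not the exact formulation used in RankLib):

    def train_lambdamart(queries, num_trees, learning_rate, max_leaves):
        # queries: training articles, each with its list of candidate MeSH
        # terms (the "documents" to be ranked) and their gold-standard labels.
        ensemble = []
        scores = {doc: 0.0 for q in queries for doc in q.docs}
        for _ in range(num_trees):
            # 1. Compute, for every candidate, the lambda gradient and its
            #    derivative from the current scores; lambdas are accumulated
            #    over pairs with different labels and weighted by the change
            #    in the target IR measure.
            lambdas, weights = compute_lambda_gradients(queries, scores)
            # 2. Fit a limited-depth regression tree to the lambda gradients.
            tree = fit_regression_tree(lambdas, max_leaves)
            # 3. Set each leaf value by a Newton step (sum of lambdas divided
            #    by sum of weights for the candidates falling into that leaf).
            assign_newton_leaf_values(tree, lambdas, weights)
            # 4. Add the new tree and update the model scores.
            ensemble.append(tree)
            for doc in scores:
                scores[doc] += learning_rate * tree.predict(doc)
        return ensemble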

First, we obtained a training set consisting of biomedical articles with human-assigned MeSH terms from MEDLINE. For each article, we obtain an initial list of MeSH terms from its neighbour documents. Each MeSH term is then represented as a feature vector. For the list of MeSH terms from the neighbour documents, denoted by {M_1, M_2, …, M_N}, where N is the number of feature vectors and M_i is the ith feature vector, we obtain a corresponding list {y_1, y_2, …, y_N}, where y_i ∈ {0, 1} is the ith class label: y_i = 1 if the MeSH term was manually assigned to the target article by expert indexers of the NLM, and y_i = 0 otherwise.
BioASQ provided approximately 12.6 million PubMed documents for system development. Since all PubMed documents can be used as training data, we randomly selected a set of 5,000 MEDLINE documents from the list of journals provided by BioASQ for training and optimizing our learning-to-rank algorithm.

Features
We reused many features developed previously: neighbourhood features, word unigram/bigram overlap features, translation probability features [69], query-likelihood features [70, 71], and synonym features.
For neighbourhood features, we calculate both the neighbourhood frequency – the number of times the MeSH term appears in the neighbours – and the neighbourhood similarity – the sum of the similarity scores of these neighbours.
For translation probability features, we use the IBM translation model [69], which takes the title and abstract as the source language and MeSH terms as the target language. We then utilize an EM-based algorithm to train the translation probabilities.
For query-likelihood features, we treat each MeSH term as a query (Q) and the title and abstract as the document, and use two genres of query models, the classic BM25 model [70] and a translation-based query model [71], to calculate the probability that a MeSH term should be assigned to the article.
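As an illustration, the two neighbourhood features above reduce to a few lines; here each neighbour is assumed to carry its similarity score and its set of indexed main headings:

    def neighbourhood_features(term, neighbours):
        # neighbours: list of (similarity, main_headings) pairs for the
        # k nearest neighbours of the target article.
        frequency = sum(1 for _, headings in neighbours if term in headings)
        similarity = sum(sim for sim, headings in neighbours if term in headings)
        return {"neighbourhood_frequency": frequency,
                "neighbourhood_similarity": similarity}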

In this work, we added a new domain-specific knowledge feature: a binary feature indicating whether a candidate term is proposed by MTI, which relies heavily on the domain-specific UMLS Metathesaurus [72] for generating its results.
To compute the average length of documents and the document frequency of each word, a set of approximately 60,000 PubMed documents was assembled. These documents were sampled from recent publications in the BioASQ Select Journal List. The translation model and the background language model were built by training on this data set accordingly.

Post-processing and list pruning
We further improve our results with some post-processing steps.
First, we observed that the Check tags (a special set of MeSH Headings that are mentioned in almost every article, such as human, animal, male, female, child, etc.⁵), and especially the tags for the age factor, are the most difficult for our approach. The reason is that the Check tags are frequently present in the neighbour documents; e.g., an article describing a disease in children might have many similar documents discussing the same disease in adults, which will result in assigning the undesirable Check tag “Adult” to the new article. On the other hand, it is improper to simply exclude the tag “Adult” if “Child” already exists, because many articles in PubMed indeed include both “Adult” and “Child” as MeSH terms. More importantly, many Check tags related to age information are added according to the full-text article. In BioASQ, we add the age check tags identified from the abstract text: we first find the numbers near the explicit word “age” in the abstract, then predict the correct Age Check Tag according to those numbers and the rules for age check tags.
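A rough sketch of that heuristic (the regular expression and the age ranges are illustrative; the actual mapping follows the NLM check-tag rules⁵):

    import re

    # Illustrative age ranges for some of the NLM age check tags.
    AGE_CHECK_TAGS = [
        ("Infant", 0, 1),
        ("Child, Preschool", 2, 5),
        ("Child", 6, 12),
        ("Adolescent", 13, 18),
        ("Adult", 19, 44),
        ("Middle Aged", 45, 64),
        ("Aged", 65, 150),
    ]

    def age_check_tags_from_abstract(abstract):
        # Find numbers that appear shortly after an explicit mention of "age"
        # (e.g. "aged 35-60 years", "mean age of 7") and map them to tags.
        tags = set()
        for match in re.finditer(r"\bage[sd]?\b\D{0,15}?(\d{1,3})", abstract,
                                 re.IGNORECASE):
            years = int(match.group(1))
            for tag, low, high in AGE_CHECK_TAGS:
                if low <= years <= high:
                    tags.add(tag)
        return tags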
Second, to improve the precision, we remove parental MeSH terms when a more specific term is also predicted. This heuristic is based on the principle that indexers should prefer the most specific applicable term instead of more general terms. Therefore, in the candidate list, if a child term is ranked higher than its parent term, we remove the latter accordingly.
Finally, after each MeSH term in the initial list is assigned a score by the ranking algorithm described above, the top N ranked MeSH terms are considered relevant to the target article. N was previously set to a fixed number (N = 25). We found, however, that the average number of MeSH terms per article in the BioASQ training data was only 12.7. Thus, we used an automatic cut-off method to further prune the results from the top-ranked MeSH terms as follows:

    S_{i+1} < S_i · log(i) · λ     (1)

where S_i is the score of the predicted MeSH term at position i in the top ranking list. The rationale for Formula (1) is that if the (i + 1)th MeSH term was assigned a score much smaller than the ith MeSH term, the MeSH terms ranked lower than i would not be considered relevant to the target article. Formula (1) also accounts for the fact that the difference between lower-ranked MeSH terms is subtler than the difference between higher-ranked MeSH terms. The parameter λ was empirically set to 0.3 in this research, and it can be tuned to generate predictions favouring either recall or precision.
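Putting the specificity rule and Formula (1) together, the pruning stage can be sketched as follows (mesh_parents is a hypothetical lookup from a heading to its parent headings in the MeSH tree; the base of the logarithm is not specified in the text, so the natural logarithm is assumed):

    import math

    def prune_ranked_terms(ranked, mesh_parents, lam=0.3):
        # ranked: list of (term, score) pairs sorted by decreasing score.
        # Rule 1: drop a parent heading when a more specific (child) heading
        # is ranked above it.
        kept = []
        higher_ranked = set()
        for term, score in ranked:
            is_parent_of_higher = any(term in mesh_parents.get(child, set())
                                      for child in higher_ranked)
            if not is_parent_of_higher:
                kept.append((term, score))
            higher_ranked.add(term)

        # Rule 2: cut the list at the first position i where
        # S_{i+1} < S_i * log(i) * lambda (Formula (1)).
        pruned = kept[:1]
        for i in range(1, len(kept)):
            if kept[i][1] < kept[i - 1][1] * math.log(i) * lam:
                break
            pruned.append(kept[i])
        return [term for term, _ in pruned]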
Results
Benchmarking datasets
To demonstrate the progress of our development over time and to compare with other systems, we report our system performance on two separate data sets. One of them was widely used in previous studies: NLM2007 [3]. The NLM2007 dataset contains 200 PubMed documents obtained from the NLM Indexing Initiative⁶. The other is created from the BioASQ 2014 test datasets: BioASQ5000.
In 2014, the BioASQ challenge task [45] ran for six consecutive periods (batches) of 5 weeks each. Each week, the BioASQ organizers distributed new unclassified PubMed documents, and participants had a limited response time (less than 1 day) to return their predicted MeSH terms. As new manual annotations became available, they were used to evaluate the classification performance of the participating systems. To be more general (each BioASQ test set contains consecutive PMIDs, which may belong to a limited set of journals), we randomly selected 5,000 PubMed documents from the latest 9 BioASQ test sets (starting from Batch 2 Week 2 in order to avoid overlap with our system training data) to create BioASQ5000, with their corresponding MeSH terms already assigned by December 6, 2014. Compared to NLM2007, BioASQ5000 is much larger in size and contains more recent articles from 2014.

Comparison of different methods
Here we present our results when evaluated on the two datasets. We first show results on the previously reported benchmarking dataset, NLM2007 [3], in Table 1.

Table 1 Evaluation results on the NLM2007 test set
Methods                   Precision   Recall   F1
MTI – 2011                0.318       0.574    0.409
Huang et al. 2011 [3]     0.390       0.712    0.504
Text Classification       0.655       0.355    0.461
MTI – 2014                0.568       0.525    0.545
MeSH Now                  0.622       0.602    0.612
Bold data are the best value

For comparison, we show the results of our previous work as “Huang et al. 2011 [3]”, and the results of the

previous and current versions of MTI (“MTI 2011” and “MTI 2014”). It should be noted that here we used MeSH 2010 and retrieved neighbour documents published before the articles in NLM2007, and our learning-to-rank model was trained with documents published before the articles in NLM2007, because the newly published articles are assigned new MeSH terms which are not available in NLM2007. We can see that MeSH Now makes a significant improvement over our previous method. We also notice that the results of MTI-2014 are much better than those of its previous version. Both the MTI-2014 and text classification results (the results of input source #2) contribute to the MeSH Now performance, with better results generated by MTI than by text classification.
Table 2 shows the results on the BioASQ5000 dataset. For comparison, we added the results of MTI First Line (MTIFL_2014) and MTI Default (MTIDEF_2014), both of which were used as baselines of the BioASQ challenge. This further verifies that our new approach outperforms existing methods.

Table 2 Evaluation results on the BioASQ5000 test set
Methods                   Precision   Recall   F1
Huang et al. 2011 [3]     0.357       0.701    0.473
Text Classification       0.689       0.400    0.506
MTIFL – 2014              0.621       0.517    0.564
MTI – 2014                0.587       0.559    0.573
MeSH Now                  0.612       0.608    0.610
Bold data are the best value

System throughput
The time complexity of large-scale automatic indexing is crucial to real-world systems but has rarely been discussed in the past. In Table 3, we present the average processing time of each step of our method based on BioASQ5000 on a single computer.

Table 3 Processing time analysis for different steps
Key steps in MeSH Now                                               Average time per document (ms)
Obtaining candidate terms via k-NN                                  1890.82
Obtaining candidate terms via MTI                                   570.33
Obtaining classification results from each binary text classifier   25.63
Learning to rank                                                    103.86
Post-processing and list pruning                                    1.85

We can see that text classification appears to be a bottleneck given the large number of classifiers (20,000). However, this step can be performed in parallel so that the overall time can be greatly reduced. For example, our current system takes approximately 9 h to process 700,000 articles via a computer cluster where 500 jobs can run concurrently.
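The same idea applies within a single machine: the 20,000 binary classifiers are independent of one another, so they can be applied in worker processes and their positive predictions merged, along the lines of the sketch below (multiprocessing here only stands in for the job cluster actually used; the classifier objects and their decision_value method are assumptions for illustration):

    from multiprocessing import Pool

    def classify_chunk(args):
        # Apply one chunk of binary MeSH classifiers to the article's features;
        # keep positively classified terms together with their scores, which
        # later serve as learning-to-rank features.
        article_features, classifiers = args
        results = []
        for clf in classifiers:
            score = clf.decision_value(article_features)
            if score > 0:
                results.append((clf.mesh_term, score))
        return results

    def classify_in_parallel(article_features, all_classifiers,
                             workers=8, chunk_size=1000):
        chunks = [(article_features, all_classifiers[i:i + chunk_size])
                  for i in range(0, len(all_classifiers), chunk_size)]
        with Pool(workers) as pool:
            parts = pool.map(classify_chunk, chunks)
        return [pred for part in parts for pred in part]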
Discussion and conclusions
To better understand the differences between the computer-predicted and human-indexed results, we conducted an error analysis based on the results of MeSH Now on the BioASQ5000 dataset. First, we found that the predicted MeSH terms with the lowest performance belong to MeSH Category E, “Analytical, Diagnostic and Therapeutic Techniques and Equipment”, and especially the “Statistics as Topic” subcategory, such as “Chi-Square Distribution”, “Survival Analysis”, etc. This is most likely due to the lack of sufficient positive instances in the training set (i.e. the numbers of these indexed terms in the gold standard are relatively small). On the other hand, the most incorrectly predicted MeSH terms are Check Tags (e.g. “Male”, “Female”, “Adult”, “Young Adult”, etc.), despite the fact that the F1 scores of these individual Check Tags are reasonably high (most are above the average). Because of their prevalence in the indexing results, however, improving their prediction is critical for increasing the overall performance.
As mentioned before, MeSH Now was developed in 2014 based on the learning-to-rank framework we first proposed in 2010 [3] for automatic MeSH indexing. At the same time, our ranking framework was adopted by several other state-of-the-art systems such as MeSHLabeler [73] and DeepMeSH [74]. MeSHLabeler is very similar to MeSH Now, with the major difference of using a machine learning model to predict the number of MeSH terms instead of heuristics. DeepMeSH further incorporates deep semantic representation into MeSHLabeler for improved performance (0.63 in the latest BioASQ challenge in 2016).
There are some limitations and remaining challenges in this work for the automatic MeSH indexing task. First, our previous work revealed that 85% of the gold-standard MeSH annotations should be present in the candidate list based on the nearest 20 neighbours. However, our current best recall is below 65%, suggesting there is still room for improving the learning-to-rank algorithm to promote the relevant MeSH terms higher in the ranked list. Second, our current binary text classification results are lower than previously reported [35], partly because for all classifiers we simply used the same training data, which is quite imbalanced. We believe that the performance of MeSH Now could be further improved if better text classification results are available to be integrated. Finally, we are interested in exploring the opportunities of using MeSH Now in practical applications.

Endnotes
1. http://www.ncbi.nlm.nih.gov/mesh/
2. http://www.bioasq.org/
3. http://www.ncbi.nlm.nih.gov/books/NBK3827/
4. http://sourceforge.net/p/lemur/wiki/RankLib/
5. http://www.nlm.nih.gov/bsd/indexing/training/CHK_010.html
6. http://ii.nlm.nih.gov/DataSets/

Acknowledgements
We would like to thank the MTI authors and the BioASQ organizers. We also thank Dr. Robert Leaman for his proofreading of this manuscript. This research is supported by the NIH Intramural Research Program, National Library of Medicine, the National Natural Science Foundation of China (81674099, 81603498), the Six Talent Peaks Project of Jiangsu Province, China (XYDXXJS-047), the Qing Lan Project of Jiangsu Province, China (2016), and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).

Availability of data and materials
The datasets supporting the conclusions of this article are available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/MeSHNow/.

Authors' contributions
ZL conceived the study. YM and ZL participated in its design, analyzed the results and wrote the manuscript. YM collected the data, implemented the methods and performed the experiments. Both authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1Nanjing University of Chinese Medicine, 138 Xianlin Avenue, Nanjing, Jiangsu 210023, China. 2National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA.

Received: 30 June 2016 Accepted: 16 March 2017

References
1. Névéol A, Islamaj Doğan R, Lu Z. Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction. J Biomed Inform. 2011;44(2):310–318.
2. Islamaj Dogan R, Murray GC, Neveol A, Lu Z. Understanding PubMed user search behavior through log analysis. Database. 2009;2009:bap018.
3. Huang M, Névéol A, Lu Z. Recommending MeSH terms for annotating biomedical articles. J Am Med Inform Assoc. 2011;18(5):660–667.
4. Lu Z, Kim W, Wilbur WJ. Evaluation of query expansion using MeSH in PubMed. Inf Retr. 2009;12(1):69–80.
5. Sarkar IN, Schenk R, Miller H, Norton CN. LigerCat: using “MeSH clouds” from journal, article, or gene citations to facilitate the identification of relevant biomedical literature. AMIA Annu Symp Proc. 2009;2009:563–567.
6. Smalheiser NR, Zhou W, Torvik VI. Anne O'Tate: a tool to support user-driven summarization, drill-down and browsing of PubMed search results. J Biomed Discov Collab. 2008;3(1):2.
7. Torvik VI, Smalheiser NR. Author name disambiguation in MEDLINE. ACM Trans Knowl Discov Data. 2009;3(3):11.
8. Liu W, Islamaj Doğan R, Kim S, Comeau DC, Kim W, Yeganova L, Lu Z, Wilbur WJ. Author name disambiguation for PubMed. J Assoc Inf Sci Technol. 2014;65(4):765–81.
9. Bhattacharya S, Ha V, Srinivasan P. MeSH: a window into full text for document summarization. Bioinformatics. 2011;27(13):i120–8.
10. Zhu S, Zeng J, Mamitsuka H. Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity. Bioinformatics. 2009;25(15):1944–51.
11. Jimeno-Yepes AJ, McInnes BT, Aronson AR. Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation. BMC Bioinformatics. 2011;12(1):223.
12. Perez-Iratxeta C, Andrade-Navarro MA, Wren JD. Evolving research trends in bioinformatics. Brief Bioinform. 2007;8(2):88–95.
13. DeShazo JP, LaVallie DL, Wolf FM. Publication trends in the medical informatics literature: 20 years of “Medical Informatics” in MeSH. BMC Med Inform Decis Mak. 2009;9(1):7.
14. D'Souza JL, Smalheiser NR. Three journal similarity metrics and their application to biomedical journals. PLoS One. 2014;9:e115681.
15. Boyack KW. Mapping knowledge domains: characterizing PNAS. Proc Natl Acad Sci. 2004;101 suppl 1:5192–9.
16. Burrows SC, Tylman V. Evaluating medical student searches of MEDLINE for evidence-based information: process and application of results. Bull Med Libr Assoc. 1999;87(4):471.
17. Gruppen LD, Rana GK, Arndt TS. A controlled comparison study of the efficacy of training medical students in evidence-based medicine literature searching skills. Acad Med. 2005;80(10):940–4.
18. Tennant MR, Miyamoto MM. The role of medical libraries in undergraduate education: a case study in genetics. J Med Libr Assoc. 2002;90(2):181.
19. Jani SD, Argraves GL, Barth JL, Argraves WS. GeneMesh: a web-based microarray analysis tool for relating differentially expressed genes to MeSH terms. BMC Bioinformatics. 2010;11(1):166.
20. Masys DR, Welsh JB, Fink JL, Gribskov M, Klacansky I, Corbeil J. Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics. 2001;17(4):319–26.
21. Mottaz A, Yip YL, Ruch P, Veuthey A-L. Mapping proteins to disease terminologies: from UniProt to MeSH. BMC Bioinformatics. 2008;9 Suppl 5:S3.
22. Sartor MA, Ade A, Wright Z, Omenn GS, Athey B, Karnovsky A. Metab2MeSH: annotating compounds with medical subject headings. Bioinformatics. 2012;28(10):1408–10.
23. Cheung WA, Ouellette BF, Wasserman WW. Inferring novel gene-disease associations using medical subject heading over-representation profiles. Genome Med. 2012;4(9):75.
24. Ono T, Kuhara S. A novel method for gathering and prioritizing disease candidate genes based on construction of a set of disease-related MeSH® terms. BMC Bioinformatics. 2014;15(1):179.
25. Nakazato T, Takinaka T, Mizuguchi H, Matsuda H, Bono H, Asogawa M. BioCompass: a novel functional inference tool that utilizes MeSH hierarchy to analyze groups of genes. In Silico Biol. 2008;8(1):53–61.
26. Khare R, Li J, Lu Z. LabeledIn: cataloging labeled indications for human drugs. J Biomed Inform. 2014;52:448–456.
27. Lu Z, Hirschman L. Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II. Database. 2012;2012:bas043.
28. Mao Y, Van Auken K, Li D, Arighi CN, McQuilton P, Hayman GT, Tweedie S, Schaeffer ML, Laulederkind SJ, Wang S-J. Overview of the gene ontology task at BioCreative IV. Database. 2014;2014:bau086.
29. Van Auken K, Schaeffer ML, McQuilton P, Laulederkind SJ, Li D, Wang S-J, Hayman GT, Tweedie S, Arighi CN, Done J. BC4GO: a full-text corpus for the BioCreative IV GO task. Database. 2014;2014:bau074.
30. Lu Z, Cohen KB, Hunter L. GeneRIF quality assurance as summary revision. Pac Symp Biocomput. 2007:269–280.
31. Huang M, Lu Z. Learning to annotate scientific publications. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Stroudsburg: Association for Computational Linguistics; 2010. pp. 463–471.
32. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium. Washington DC; 2001. pp. 17–21.
33. Ruiz ME, Srinivasan P. Hierarchical text categorization using neural networks. Inf Retr. 2002;5(1):87–118.
34. Yetisgen-Yildiz M, Pratt W. The effect of feature representation on MEDLINE document classification. In: AMIA Annual Symposium Proceedings. Washington DC: American Medical Informatics Association; 2005. pp. 849–853.
35. Tsoumakas G, Laliotis M, Markantonatos N, Vlahavas IP. Large-scale semantic indexing of biomedical publications. In: BioASQ@CLEF. 2013.
36. Névéol A, Shooshan SE, Claveau V. Automatic inference of indexing rules for MEDLINE. BMC Bioinformatics. 2008;9 Suppl 11:S11.
37. Sohn S, Kim W, Comeau DC, Wilbur WJ. Optimal training sets for Bayesian prediction of MeSH® assignment. J Am Med Inform Assoc. 2008;15(4):546–53.

38. Wilbur WJ, Kim W. Stochastic gradient descent and the prediction of MeSH for PubMed records. In: AMIA. 2014.
39. Jimeno-Yepes A, Mork JG, Demner-Fushman D, Aronson AR. A one-size-fits-all indexing method does not exist: automatic selection based on meta-learning. JCSE. 2012;6(2):151–60.
40. Yang Y, Chute CG. An application of Expert Network to clinical classification and MEDLINE indexing. In: The 18th Annual Symposium on Computer Applications in Medical Care. Bethesda: American Medical Informatics Association; 1994. pp. 157–161.
41. Trieschnigg D, Pezik P, Lee V, De Jong F, Kraaij W, Rebholz-Schuhmann D. MeSH Up: effective MeSH text classification for improved document retrieval. Bioinformatics. 2009;25(11):1412–8.
42. Delbecque T, Zweigenbaum P. Using co-authoring and cross-referencing information for MEDLINE indexing. In: AMIA Annual Symposium Proceedings. Washington DC: American Medical Informatics Association; 2010. pp. 147–151.
43. Liu T-Y. Learning to rank for information retrieval. Found Trends Inf Retr. 2009;3(3):225–331.
44. Mao Y, Wei C-H, Lu Z. NCBI at the 2014 BioASQ challenge task: large-scale biomedical semantic indexing and question answering. In: Proceedings of Question Answering Lab at CLEF. 2014.
45. Balikas G, Partalas I, Ngomo A-CN, Krithara A, Gaussier E, Paliouras G. Results of the BioASQ Track of the Question Answering Lab at CLEF 2014. In: Proceedings of Question Answering Lab at CLEF. 2014. pp. 1181–1193.
46. Tsatsaronis G, Balikas G, Malakasiotis P, Partalas I, Zschunke M, Alvers MR, Weissenborn D, Krithara A, Petridis S, Polychronopoulos D. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics. 2015;16(1):138.
47. Liu K, Wu J, Peng S, Zhai C, Zhu S. The Fudan-UIUC participation in the BioASQ Challenge Task 2a: the Antinomyra system. Risk. 2014;129816:100.
48. Kavuluru R, Lu Y. Leveraging output term co-occurrence frequencies and latent associations in predicting medical subject headings. Data & Knowledge Engineering. 2014;94:189–201.
49. Mork JG, Jimeno-Yepes A, Aronson AR. The NLM Medical Text Indexer system for indexing biomedical literature. In: BioASQ@CLEF. 2013.
50. Ruch P. Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics. 2006;22(6):658–64.
51. Aronson AR, Mork JG, Gay CW, Humphrey SM, Rogers WJ. The NLM indexing initiative's medical text indexer. Medinfo. 2004;11(Pt 1):268–72.
52. Névéol A, Shooshan SE, Humphrey SM, Mork JG, Aronson AR. A recent advance in the automatic indexing of the biomedical literature. J Biomed Inform. 2009;42(5):814–23.
53. Mork JG, Demner-Fushman D, Schmidt SC, Aronson AR. Recent enhancements to the NLM medical text indexer. In: Working Notes for CLEF 2014 Conference, Sheffield, UK. 2014. pp. 1328–36.
54. Partalas I, Gaussier É, Ngomo A-CN. Results of the First BioASQ Workshop. In: BioASQ@CLEF. 2013. pp. 1–8.
55. Funk ME, Reid CA. Indexing consistency in MEDLINE. Bull Med Libr Assoc. 1983;71(2):176.
56. Lin J, Wilbur WJ. PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics. 2007;8(1):423.
57. Tang L, Rajan S, Narayanan VK. Large scale multi-label classification via metalabeler. In: Proceedings of the 18th International Conference on World Wide Web. New York: ACM; 2009. pp. 211–220.
58. Thai-Nghe N, Gantner Z, Schmidt-Thieme L. Cost-sensitive learning methods for imbalanced data. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2010), Barcelona, Spain. 2010. pp. 1–8.
59. Huber PJ. Robust estimation of a location parameter. Ann Math Stat. 1964;35(1):73–101.
60. Kim W, Yeganova L, Comeau DC, Wilbur WJ. Identifying well-formed biomedical phrases in MEDLINE® text. J Biomed Inform. 2012;45(6):1035–1041.
61. Yepes AJJ, Mork JG, Demner-Fushman D, Aronson AR. Comparison and combination of several MeSH indexing approaches. In: AMIA Annual Symposium Proceedings. Washington DC: American Medical Informatics Association; 2013. pp. 709–718.
62. Cao Z, Qin T, Liu T-Y, Tsai M-F, Li H. Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th International Conference on Machine Learning. New York: ACM; 2007. pp. 129–136.
63. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–1232.
64. Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, Hullender G. Learning to rank using gradient descent. In: Proceedings of the 22nd International Conference on Machine Learning. New York: ACM; 2005. pp. 89–96.
65. Metzler D, Croft WB. Linear feature-based models for information retrieval. Inf Retr. 2007;10(3):257–74.
66. Xu J, Li H. AdaRank: a boosting algorithm for information retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM; 2007. pp. 391–398.
67. Wu Q, Burges CJ, Svore KM, Gao J. Adapting boosting for information retrieval measures. Inf Retr. 2010;13(3):254–270.
68. Quoc C, Le V. Learning to rank with nonsmooth cost functions. In: NIPS'07, vol. 19. 2007. p. 193.
69. Brown PF, Pietra VJD, Pietra SAD, Mercer RL. The mathematics of statistical machine translation: parameter estimation. Comput Linguist. 1993;19(2):263–311.
70. Robertson SE, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M. Okapi at TREC-3. Gaithersburg: NIST Special Publication; 1995. pp. 109–126.
71. Berger A, Lafferty J. Information retrieval as statistical translation. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM; 1999. pp. 222–229.
72. Humphreys BL, Lindberg DA. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc. 1993;81(2):170–177.
73. Liu K, Peng S, Wu J, Zhai C, Mamitsuka H, Zhu S. MeSHLabeler: improving the accuracy of large-scale MeSH indexing by integrating diverse evidence. Bioinformatics. 2015;31(12):339–347.
74. Peng S, You R, Wang H, Zhai C, Mamitsuka H, Zhu S. DeepMeSH: deep semantic representation for improving large-scale MeSH indexing. Bioinformatics. 2016;32(12):70–79.
