Building Compact Entity Embeddings Using Wikidata
Article in International Journal on Advanced Science Engineering and Information Technology · September 2018
DOI: 10.18517/ijaseit.8.4-2.6831
Universiti Kebangsaan Malaysia
All content following this page was uploaded by Mohamed Lubani on 17 January 2019.
Abstract—Representing natural language sentences has always been a challenge in statistical language modeling. Atomic discrete
representations of words make it difficult to represent semantically related sentences. Other sentence components such as phrases and
named-entities should be recognized and given representations as units instead of individual words. Different entity senses should be
assigned different representations even though they share identical words. In this paper, we focus on building the vector
representations (embedding) of named-entities from their contexts to facilitate the task of ontology population where named-entities
need to be recognized and disambiguated in natural language text. Given a list of target named-entities, Wikidata is used to
compensate for the lack of a labeled corpus to build the contexts of all target named-entities as well as all their senses. Description text
and semantic relations with other named-entities are considered when building the contexts from Wikidata. To avoid noisy and
uninformative features in the embedding generated from artificially built contexts, we propose a method to build compact entity
representations to sharpen entity embedding by removing irrelevant features and emphasizing the most detailed ones. An extended
version of the Continuous Bag-of-Words model (CBOW) is used to build the joint vector representations of words and named-entities
using Wikidata contexts. Each entity context is then represented by a subset of elements that maximizes the chances of keeping the
most descriptive features about the target entity. The final entity representations are built by compressing the embeddings of the
chosen subset using a deep stacked autoencoder model. Cosine similarity and the t-SNE visualization technique are used to evaluate the
final entity vectors. Results show that semantically related entities are clustered near each other in the vector space. Entities that
appear in similar contexts are assigned similar compact vector representations based on their contexts.
In a large corpus, the embedding of a word is fine-tuned each time a new occurrence of the word is encountered. More information is added from frequent contexts, and less regard is given to rare isolated occurrences of the word. For named entities, contexts can be generated artificially from a knowledge base to build the entity embedding. Due to the limited number of contexts, rare entity contexts may be a source of noise, and special processing is required to keep only the most descriptive information in the embedding. In this paper, we utilize the Wikidata contexts of named entities in order to assign similar vector representations to semantically related entities. A method to build compact entity representations with the most descriptive information is proposed. Different senses of named entities are considered and are assigned different vector representations based on their contexts.

II. MATERIAL AND METHOD

Representing variables in a continuous space to enhance accuracy is an old idea. It was first used in the SMART information retrieval system in the 1960s [12], where documents and queries are represented as vectors. The concept was adopted by [13] to represent the input variables of neural networks as vectors of real numbers. Rumelhart et al. [14] show that these representations can be learned while training the neural network to perform the desired task using backpropagation and gradient descent. Bengio et al. [15] build on this idea and the distributional hypothesis to construct the vector representations of natural language words by maximizing the probability of the next word given the previous ones in a text corpus. They show that semantically similar words are assigned similar representation vectors. This comes at the expense of model complexity and slow training over large datasets.

To compute word vector representations efficiently using very large data sets, new models are required. As described in [1], the complexity of models that find vector representations of words comes from the non-linear hidden layers where heavy matrix multiplications are performed. As proposed in [1], the Continuous Bag-of-Words model (CBOW) removes the non-linear hidden layer and projects the N input words to the same position by taking the average of their vectors. It learns to predict the current word from a neighboring context, such as a window of words before and after the target word, using a log-linear classifier. This means that CBOW considers the whole context as one observation while training, which helps the model to train well using small training sets. The second proposed model is called the Continuous Skip-gram model, which is similar to the CBOW model except that the model here predicts context words from the current word. This division of the context into multiple observations suggests that the model has much more to learn than CBOW and thus needs a larger training data set to converge.

The distributed representations learned using the models in [1] are not only similar for semantically and syntactically related words but also capture multiple degrees of similarity between words; for example, nouns with similar endings (e.g., ing) are located near each other in the vector space [16]. In addition to learning good vector representations, [16] describes that relationships between vectors can be explored using specific vector offsets. It suggests that linguistic regularities are present between vector representations of words and can be obtained by applying algebraic operations such as V_a − V_b ≈ V_c − V_d, where (a, b) and (c, d) are word pairs that share the same relation.

Several enhancements on the Skip-gram model [1] are described in [17] to speed up the training and provide better representations. The first enhancement is the subsampling of the frequent words in the training data using a fixed subsampling rate to enhance the representations of rare words. The second enhancement comes from the fact that the cost of finding the probabilities in the Skip-gram model is proportional to the vocabulary size, which makes the model training significantly expensive. The proposed method is called Negative Sampling (NEG), and it simplifies the Noise Contrastive Estimation (NCE) [18] used to optimize the models in [1]. To avoid the heavy probability calculations, logistic regression is used to distinguish the real target words from randomly selected noise words by giving the real target words higher scores. This simplifies the NCE by considering only the samples and disregarding the probability calculations. [17] also shows that simple mathematical operations such as vector addition represent meaningful and non-obvious linguistic relationships such as V_a + V_b ≈ V_c. It also introduces a way to build vector representations of phrases and entities of multiple words. The representation of phrases as vectors is significantly more expressive than taking the individual word representations. The proposed solution identifies the phrases in the data set by locating the words that frequently appear together and representing them as one token while training the model.

Another attempt to map the embedding of words and entities to the same vector space is introduced in [19]. The model extends the Skip-gram model proposed in [1] by adding two more objectives: Obj2, predict neighboring entities from a target entity, and Obj3, predict neighboring words from a target entity using a knowledge base. The Wikipedia Link Based Measure (WLM) [20] is used to find related entities to a given entity in the KB. Obj3 is used to allow interactions between entity vectors generated using Obj2 and word vectors of the Skip-gram model. Wikipedia is also used to link context words to entities, where entities in Wikipedia pages represented as hyperlinks called "anchors" are unambiguously linked to specific KB entities. The model is then trained by maximizing the objective function that is simply the linear combination of all three objective functions using NEG [17] and Stochastic Gradient Descent (SGD). Linking the representations of entities and words using Wikipedia anchors is also used in [21]. Both [19] and [21] use the KB to unambiguously identify entities. Different senses of the same entity refer to different KB nodes and thus will be assigned different representations. However, a problem arises when attempting to link entity and word representations using the anchor text. Simply replacing the anchor with the entity in the text does not represent the specific sense of the entity in that context. Therefore, different entity senses will not be given different representations. To solve this, [22] proposes a method to learn multiple sense representations for each mention. Wikipedia anchors are used to map the hyperlink text, i.e. the mention, to an entity. Using the context words around the anchor and the entity it refers to, different representations are
learnt for different mention senses. Mentions in Wikipedia pages that refer to the same entity are represented using the same token. New mention tokens are used to represent a new sense if the mention is referring to a different entity. Similarly, the same tokens are used for mentions referring to the same entity. The objective function used to learn the representations of the different mention senses is to predict the entity linked to a mention given the mention token itself and the context words. Another objective function is used to predict the entities themselves from their neighbors (direct connections) in a KB. A third objective function is used to learn word representations by predicting the context words of a target in the text. It trains a model similar to the training in [19] to optimize the objective function that is the result of linearly combining all three objective functions.

In [19], [21] and [22], a KB is used to build entity representations. Then anchors are used to align them into the same space as word representations. Another method proposed in [23] learns entity representations using their example occurrences in a large text corpus (Wikipedia) instead of a KB. This allows for the utilization of distributional knowledge about entities in text. It introduces the concept of Extended Anchor Text (EAT), which extends the given corpus with more sentences that relate entities to their context words. This is done by substituting the anchor text in Wikipedia pages with the corresponding entities and adding the result to the corpus as new sentences. It then uses a training approach similar to [17]. The original models proposed in [1], i.e. the CBOW and the Skip-gram models, can also be used. This method has the advantage of using the original context of entities instead of KB-built contexts. This allows for building the entity representations using a large number of different contexts where an entity co-occurs. When using a large corpus to learn vector representations of entities, the components of these vectors are sharpened with more information each time new contexts are encountered. Vectors will be adapted to keep the most distinctive features about the entities they represent.

In conclusion, building the vector representations of named entities can be done using contexts built from a KB or the contexts in a large corpus. The first is useful to build the representations of a specific set of entities. The KB can be queried for each entity in the set to generate its contexts. This comes at the expense of limiting the number of contexts that can be used to learn high-quality features about the entities. On the other hand, using a large corpus allows for utilizing the many occurrences of named entities in order to enhance the representations and keep the most useful features. However, when using a corpus, only named entities mentioned in the corpus will be considered. This does not allow for learning the representations of a specific set of entities such as the entities of an ontology. In addition, many reviewed methods do not provide solutions to differentiate between the representations of different named entity senses.

In this paper, we present a method to learn named entity vector representations to be used for tasks related to ontology population, namely NED and relation instance extraction. In ontology population related tasks, the goal is to identify the correct sense of an entity in the natural language text using its context. This requires the use of a training set of named entities and a method to build expressive entity embedding for the different entity senses. We define the problem as follows: given a seed ontology with instances of concepts as disambiguated named entities, we aim to learn the vector representations of these entities in order to identify their occurrences in the text. High-quality entity embeddings facilitate the task of spotting the correct sense of entities in the text and thus extracting correct facts about them. Wikidata is used to extract the set of neighboring entities connected with semantic relations to a given entity. To jointly link entity and word representations, the description text of Wikidata entities is used. The collected knowledge from Wikidata is used to train a CBOW model to learn the joint embedding. Entities are assigned contexts that maximize the chances of keeping the most descriptive features using the learned embedding. To sharpen entity representations and remove any irrelevant information, entities are represented as compact, dense continuous vectors using a deep stacked autoencoder model.

The proposed method to build the compact entity representations consists of three main components. These components are explained in detail in the following sections.

A. The Crawler

We utilize Wikidata, a collaboratively built public knowledge base containing a large number of entities referring to real-world objects such as a person, location, or organization, or to abstract concepts such as "gravity" and "seasons", with all their semantic interpretations (i.e., senses). It contains the structured knowledge of other Wikimedia Foundation projects, mainly the knowledge of Wikipedia, which is the world's largest encyclopedia. As of December 2014, Wikidata also contains the resources of Freebase [24]. We build a crawler to find the neighboring entities of a given named entity, called the co-entities, and to extract the keywords associated with the entity found in its description, referred to as the co-words. Given a set of named entities E, the crawler extends the input set by adding all the senses of each entity to create E′. It also associates each entity e′ ∈ E′ with its unique Wikidata id in order to differentiate between different senses of an entity. The crawler's objective is to build the context of each entity e′ ∈ E′ using its co-entities S_E and co-words S_W. The co-entities set S_E contains only named entities that have a semantic relation with e′ in its Wiki page. The crawler heuristically identifies entities by checking the capitalization of each token in the entity label. This is to exclude non-named entity concepts in the Wiki page (e.g., the universe and space concepts). Entities of S_E are represented as underscore-separated lowercase tokens with their unique Wikidata id as the last token, such as united_states_of_america_q30. The co-words set S_W is built by taking the non-stop words from the description of e′. Entities are ignored if their description is empty. We define the context C(e′) of an entity e′ as the set of co-words and co-entities S_A = S_E ∪ S_W. While crawling, the same context size α is kept for all entities by using a subset of S_A if its size is more than α and padding with the first keyword from the description if the size is less than α. Entity contexts are then written as separate lines to a text file. Each line in the file contains α + 1 space-separated elements with the target entity e′ in the middle and
α/2 elements from C(e′) to the left and right. To keep the target entity in the middle, α is chosen as an even number. For example, considering the named entity

$$ J = \frac{1}{L} \sum_{l=1}^{L} \sum_{j=1}^{\alpha} \log P\left(e'_l \mid c_{l,j}\right) \qquad (1) $$
Where σ(x) = 1 / (1 + e^{−x}) is the sigmoid function, P is the number of negative samples (noise words) indicated as w_i and taken from the noise distribution P_n(w) defined as:

$$ P_n(w) = \frac{U(w)^{3/4}}{\sum_{j=0}^{n} U(w_j)^{3/4}} \qquad (3) $$
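As a rough illustration of the sigmoid and the noise distribution mentioned above, the following Python sketch builds a noise distribution in which each token is weighted by its count raised to the power 3/4 (the exponent used for negative sampling in [17]) and then draws noise tokens from it. The function names, the toy token list, and the use of raw counts in place of a normalized unigram distribution are assumptions made for this example (normalizing the counts first gives the same result); it is not the TensorFlow implementation used in the experiments.

```python
import math
import random
from collections import Counter


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def noise_distribution(tokens, power=0.75):
    """Return {token: P_n(token)} with P_n(w) proportional to count(w) ** power."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def draw_negatives(p_n, k, rng):
    """Draw k noise tokens (with replacement) from the noise distribution."""
    words = list(p_n)
    return rng.choices(words, weights=[p_n[w] for w in words], k=k)


# Hypothetical toy vocabulary; "asia" is deliberately the most frequent token.
tokens = ["syria_q858", "asia", "republic", "asia", "asia", "damascus_q3766"]
p_n = noise_distribution(tokens)
negatives = draw_negatives(p_n, k=5, rng=random.Random(0))
```

The 3/4 exponent flattens the distribution: in the toy list, "asia" accounts for half of the tokens but receives less than half of the noise probability, so rarer tokens are sampled as negatives somewhat more often than their raw frequency would suggest.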
$$ \mathrm{Obj} = \sum_{i=1}^{L} \left\| x_i - \mathrm{dec}(\mathrm{enc}(x_i)) \right\|^{2} \qquad (5) $$

After training the autoencoder, the vectors produced by the encoder part can be seen as compact representations of the input vectors. These representations contain knowledge good enough to rebuild the original vectors with high confidence. The encoder part can be used to map the training vectors of S to a new vector space with a smaller dimension, and the new set of vectors can be used to train another autoencoder. This structure is called a stacked autoencoder, which represents a deep neural network with many hidden layers.

We use a two-layer stacked undercomplete autoencoder to build the final entity representations. This allows for capturing high-level abstract features about the entities. Each autoencoder consists of two hidden layers in the encoding part and two hidden layers in the decoding part. The first autoencoder is trained using input vectors that represent the target entity contexts. These vectors are built by concatenating the embeddings of β chosen context elements, each of which can be a co-word, a co-entity or the target entity itself. The context elements are chosen in a way that keeps the most distinctive features about the context. These elements are then ordered alphabetically, and their embeddings are concatenated to construct the training vectors of the first autoencoder. We choose the β − 1 elements that have the highest average distances from the embeddings of the remaining context elements. These elements are the top β − 1 elements from the context that maximize the objective function in equation 6. To include features from the rest of the context, the element with the smallest average distance from all the other context elements, i.e. the element that minimizes the objective function in equation 6, is also chosen. We call this element the context agent. This way, the chosen context elements maximize the chances of keeping the most descriptive features about the target entities.

$$ D(Q) = \frac{1}{\alpha} \sum_{i=1,\; P_i \neq Q}^{\alpha+1} d(P_i, Q) \qquad (6) $$

Where Q and P_i are embeddings of context elements in the vector space ℝ^z, z is the embedding vector size, α is the context size and d(P_i, Q) is the Euclidean distance between P_i and Q.

Each autoencoder has two hidden layers in its encoding part, and each hidden layer is half the size of the previous layer. This gives the stacked autoencoder structure a vector-compression factor of 16. The size of the input vectors of the first autoencoder is z × β. We train the first autoencoder to represent the input at size (z × β)/4, which is the size of the input vectors used to train the second autoencoder. The final entity embedding has a size of (z × β)/16 and is obtained from the encoder part of the second autoencoder.

In our experiments, we use Wikidata as the source of the training set of named entities. We collected the top 500 named entities from Wikidata using a simple collection algorithm that automatically checks Wikidata entities starting from id = 1 and an empty set E. The algorithm heuristically checks the label in the corresponding Wikidata page looking for named entities. Detected entities are added to the entity set E. Crawler Algorithm 1 in Fig. 1 is then used to build the extended entity set E′ with all entity senses. The size of the built set E′ is 1559, with an average of about three senses per entity. The extended set is then used as input to the crawler's Algorithm 2 in Fig. 2 to build the contexts C(e′) for all e′ ∈ E′ with a fixed context size α = 14. The output is a text file containing 1559 lines representing entity contexts. For example, the named entity "Syria" in the set E has 6 senses in E′, where it can be a country, a female name, a journal name, a Roman province, an Italian singer or a family name. Related co-words and co-entities surround each of these six senses. For example, the country sense is written as Syria_q858 with neighbors such as Asia, republic, damascus_q3766, turkey_q43, etc. The singer sense is written as Syria_q3979196 with neighbors including Italian, singer, italy_q38, and rome_q220.

The second step is to train the model described in section II.B using the crawler's output text file. We use Google's TensorFlow Python library [26] to implement the two-layer neural network, where the input and output layers have the same size as the vocabulary. The vocabulary size is the number of unique words/entities in the input text file. The hidden layer's size equals the required size of the embedding vectors. We set the embedding size, i.e. the size of the hidden layer, to 128. We set the rest of the model's parameters as follows: the number of negative samples NEG = 64 and the gradient descent learning rate is 1.0. Training examples are generated from each line as pairs (w, w′), where w′ is the target entity of the line and w is a context element. Since we use a context size of 14, each line can produce 14 training pairs. The model is trained for 100,000 iterations using a batch of training pairs. We use a batch size of 280 to cover all pairs of 20 randomly selected contexts in the same training batch. Once the model training is completed, the hidden layer's weight matrix is saved as the embedding of the training vocabulary. It contains the embeddings of both words and entities mapped to the same vector space.

To build the final entity representations, we train the stacked autoencoder model to minimize the reconstruction error of L = 1559 vectors in the input set S as per equation 5. We choose β = 4, and thus the size of each training vector is 128 × 4 = 512, which is the size of the input layer of the first autoencoder. The first autoencoder is trained for 30,000 iterations with a learning rate of 0.05 using batches of size 256. It is trained to encode the input to a vector of size 128, which is the size of the input layer of the second autoencoder. The second autoencoder is then trained using similar parameters except with a learning rate of 0.01. After training the second autoencoder, the stacked model of two autoencoders is ready to be used to encode input vectors. The result is a set of L vectors of size 32 that includes the final vector representations of the target named entities.

III. RESULTS AND DISCUSSION

As described in section II.C, the β context elements that will be used to build the final entity representations can be either words or entities. Denoting the chosen set for an entity e′ as ∅_{e′} and ∝ = ⋃_{e′ ∈ E′} ∅_{e′}, TABLE I shows a
statistical analysis of the distribution of β elements considering all the 1559 test entities and their contexts.

TABLE I
STATISTICAL ANALYSIS OF THE DISTRIBUTION OF β ELEMENTS OF ALL CHOSEN SETS

             Total words in ∝   Total entities in ∝   Total words as context agents   Total entities as context agents
Numbers      2258               3978                  349                             1210
Percentage   36.2%              63.79%                5.59%                           19.4%

As TABLE I shows, entities are chosen at almost twice the rate of words to be in the β chosen elements of target entities. This is largely because the number of co-words in any context is less than the number of co-entities. The only source of the co-words is the Wikidata description text, which is usually not more than a couple of sentences. However, the Wikidata page of an entity has plenty of co-entities found in the binary relations of the target entity.

To evaluate the final vector representations of entities, we use cosine similarity as a measure of the semantic similarity between two vectors in the space. We find the closest entity to each of the 1559 test entities by finding the entity from the same set with the maximum cosine similarity. Since an entity with a maximum cosine similarity can always be found, a threshold has to be set to consider that the entities are semantically related. This threshold depends on the size and coverage of the training entity set. For small domain-specific training sets, this threshold has to be large and very close to 1. TABLE II shows some examples of the most similar entities found in the used training set with cosine similarities more than 0.95.

TABLE II
SOME EXAMPLES OF SIMILAR ENTITIES IN THE USED TRAINING SET WITH COSINE SIMILARITIES MORE THAN 0.95

Related entities            Wikidata descriptions

Bolivia_Q750                A country in South America
& Paraguay_Q733             & A country in South America

Moscow_Q2380475             City in Tennessee, USA
& Lisbon_Q2310637           & A town in New Hampshire, USA

Maine_Q3708887              A town in New York, United States
& Flanders_Q3459889         & Census-designated place in Suffolk County, New York

Versailles_Q2729504         A town in Indiana, United States
& Rhine_Q1886951            & A civil town in Sheboygan County, Wisconsin

Egypt_Q2083973              Town in Arkansas
& Versailles_Q2729504       & A town in Indiana, United States

Lebanon_Q1520670            City in Wilson County, Tennessee
& London_Q3061911           & A city in Kentucky, United States

Panama_Q2204538             Town in Oklahoma
& Lisbon_Q2384470           & A town in Maine, USA

August_Q1192731             2008 American drama film
& We_Live_In_Public_Q372    & 2009 documentary film by Ondi Timoner which profiles internet pioneer Josh Harris

Sunday_Q1286562             Song by British recording duo Hurts
& Dubai_Q5310496            & 2005 Filipino drama film

As TABLE II shows, the cosine similarity is close to 1 for entities that represent the same real-world concepts such as cities, countries, and films. Entities that are semantically related, such as city, town, and community, also have high cosine similarities, i.e. more than 0.95. The last row in TABLE II is another example of how semantically related entities are assigned similar representations, where both entities represent the artwork concept.

To show how the entities of our training set are distributed in the vector space, we use the non-linear dimensionality reduction tool t-SNE [27] to visualize entity embeddings in the 2D space. Fig. 3 shows the distribution of 150 randomly chosen entity embeddings.

As expected, Fig. 3 shows that semantically related entities are clustered relatively near each other in the vector space. TABLE III shows a few examples of related entities found in Fig. 3, where examples are numbered from 1 to 6.

TABLE III
EXAMPLES OF RELATED ENTITIES FOUND IN FIG. 3

Group #   Related entities          Wikidata descriptions

1         Grenada_Q985543           City in Mississippi
          & Saginaw_Q7399254        & An unincorporated community in Hot Spring County, Arkansas
          & Saginaw_Q970802         & City in Texas

2         Guatemala_Q11221957       Triceratops song
          & Ecuador_Q2347797        & 1997 song by Sash!

3         Groningen_Q1816384        An unincorporated community in Pine County, Minnesota
          & Jamaica_Q3450853        & A town in Vermont, United States

4         Dominica_Q784             A country in the Caribbean
          & Honduras_Q783           & Republic in Central America

5         Ecuador_Q736              A country in South America
          & Mongolia_Q711           & A country in East Asia, between China and Russia

6         Venezuela_Q593830         City in Cuba
          & Saginaw_Q7399257        & An unincorporated community in St. Louis County, Minnesota
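The nearest-neighbour evaluation described in this section (find, for each entity, the other entity with the maximum cosine similarity, and keep the pair only if the similarity exceeds a threshold) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the three-dimensional toy vectors are made-up values for illustration, not the 32-dimensional embeddings produced in the experiments.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_entities(vectors, threshold=0.95):
    """Map each entity to (nearest other entity, similarity) if above threshold."""
    result = {}
    for name, vec in vectors.items():
        best_name, best_sim = None, -1.0
        for other, other_vec in vectors.items():
            if other == name:
                continue  # an entity is never its own neighbour
            sim = cosine(vec, other_vec)
            if sim > best_sim:
                best_name, best_sim = other, sim
        # A nearest neighbour always exists, so the threshold decides relatedness.
        if best_name is not None and best_sim > threshold:
            result[name] = (best_name, best_sim)
    return result


# Hypothetical toy embeddings (illustrative values only).
toy = {
    "bolivia_q750": [0.90, 0.10, 0.00],
    "paraguay_q733": [0.88, 0.12, 0.01],
    "sunday_q1286562": [0.00, 0.20, 0.95],
}
pairs = closest_entities(toy)
```

In this toy setting the two country vectors point in nearly the same direction and are reported as each other's nearest neighbour, while the third vector still has a nearest neighbour but is dropped because its best similarity falls far below the threshold.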
Fig. 3 t-SNE visualization of 150 randomly selected entity embeddings.
Fig. 3 also shows that not all close entities are which help to identify the correct entity sense in natural
semantically related. As discussed earlier, a high similarity language text to facilitate ontology population tasks.
threshold is required for small training sets. This translates to Ontology population, which is the process of adding new
a short distance between related vectors especially after instances of concepts and relations into an ontology from a
reducing the dimension of embedding for visualization using corpus, will benefit from this process in order to minimize
t-SNE. the manual effort as exhibited in [28].
The goal of constructing entity embedding for ontology The conducted experiments show that entities are
population tasks requires that if two entities share similar assigned close vector representations if they have similar
contexts, then their embedding are expected to be similar contexts. In the future, we plan to demonstrate the use of the
and vice versa. To test this using our training set, we find all constructed entity vectors in the tasks NED and relation
unique entity pairs where the cosine similarity is more than extraction. It would also be interesting to investigate the
0.95. Then we check the corresponding contexts looking for effect of using Wikidata hierarchies while building entity
shared elements, i.e. shared co-words or co-entities. We contexts.
consider the pair as correct if there is at least one shared
element found in their contexts, which justifies the high REFERENCES
similarity. For example, the entities Bolivia_Q750 and
Paraguay_Q733 have a cosine similarity of more than 0.95. [1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation
For this pair to be considered as correct which means that of word representations in vector space,” arXiv preprint
they were rightfully assigned similar representations, they arXiv:1301.3781, 2013.
[2] M. A. Taiye, S. S. Kamaruddin, and F. K. Ahmad, “Representing
should have common elements in their contexts. By
Semantics of Text by Acquiring its Canonical Form,” International
checking their contexts, we find that four elements are Journal on Advanced Science, Engineering and Information
shared: Spanish_Q1321, South_america_Q18, south, and Technology, vol. 7, no. 3, pp. 808-814, 2017.
country. The first two are named entities, and last two are [3] S. A. M. Noah, N. Omar, and A. Y. Amruddin, “Evaluation of
lexical-based approaches to the semantic similarity of Malay
words. Based on this, the pair is considered as correct. sentences.,” Journal of Quantitative Linguistics, vol. 22, no. 2, pp.
Out of 963 unique pairs found, 951 pairs share context 135-156, 2015.
elements with an accuracy of 98.75%. This means that [4] M. Mohd and O. M. A. Bashaddadh, “Investigating the Combination
entities with similar contexts have a high change of being of Bag of Words and Named Entities Approach in Tracking and
Detection Tasks among Journalists.,” Journal of Information Science
assigned close embedding. It is also worth noting that the
Theory and Practice, vol. 2, no. 4, pp. 31-48, 2014.
remaining pairs are not necessarily false positives since we [5] N. I. Y. Saat and S. A. M. Noah, “Rule-based Approach for
look for the exact elements in both contexts. Contexts may Automatic Ontology Population of Agriculture Domain,”
share synonyms or other semantically related elements that Information Technology Journal, vol. 46, no. 51, pp. 46-51, 2016.
[6] Y. I. A. M. Khalid and S. A. M. Noah, “Semantic text-based image
caused the embedding to be similar which is the expected
retrieval with multi-modality ontology and DBpedia,” The Electronic
behavior of good embedding. We repeated the experiment Library, vol. 35, no. 6, pp. 1191-1214, 2017.
using different thresholds to test the effect on the accuracy. [7] W. Ammar, G. Mulcaire, Y. Tsvetkov, G. Lample, C. Dyer and N. A.
Table I shows the accuracies for different similarity Smith, “Massively Multilingual Word Embeddings,” arXiv preprint
arXiv:1602.01925, 2016.
thresholds. [8] R. E. Salah and L. Q. b. Zakaria, “Arabic Rule-Based Named Entity
TABLE IV
ACCURACIES FOR DIFFERENT SIMILARITY THRESHOLDS
Similarity   Unique entity pairs with   Pairs with at least one   Accuracy
threshold    similarity > threshold     shared context element
0.95         963                        951                       98.75%
0.93         1636                       1578                      96.34%
0.90         3199                       2783                      86.99%
0.85         11050                      6535                      59.14%
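The evaluation procedure behind these accuracies can be illustrated with a minimal Python sketch. It assumes cosine similarity between entity embeddings and a set of context elements per entity; the similarity measure, the function names, the entity names, and the toy vectors are all illustrative assumptions and are not taken from the experiments. A pair counts as correct when the two contexts share at least one exact element.

```python
from itertools import combinations
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def shared_context_accuracy(embeddings, contexts, threshold):
    """Count unique entity pairs whose similarity exceeds the threshold and
    how many of those pairs share at least one exact context element."""
    n_pairs = 0
    n_sharing = 0
    for a, b in combinations(sorted(embeddings), 2):
        if cosine(embeddings[a], embeddings[b]) > threshold:
            n_pairs += 1
            if contexts[a] & contexts[b]:  # non-empty set intersection
                n_sharing += 1
    accuracy = n_sharing / n_pairs if n_pairs else 0.0
    return n_pairs, n_sharing, accuracy


# Hypothetical toy data: entity names and vectors are made up for illustration.
embeddings = {
    "entity_a": [1.0, 0.0, 0.1],
    "entity_b": [0.9, 0.1, 0.1],
    "entity_c": [0.95, 0.05, 0.0],
    "entity_d": [0.0, 1.0, 0.0],
}
contexts = {
    "entity_a": {"Spanish_Q1321", "South_america_Q18", "south", "country"},
    "entity_b": {"Spanish_Q1321", "country", "capital"},
    "entity_c": {"river", "mountain"},
    "entity_d": {"island", "ocean"},
}

if __name__ == "__main__":
    for threshold in (0.95, 0.90):
        pairs, sharing, acc = shared_context_accuracy(embeddings, contexts, threshold)
        print(f"threshold={threshold}: {sharing}/{pairs} pairs share context ({acc:.2%})")
```

Because only exact matches are counted, a pair whose contexts contain synonyms or otherwise related elements is reported as incorrect, so the accuracy computed this way is a conservative estimate.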
As Table IV shows, the number of unique entity pairs
increases when using lower thresholds, along with a large number
of wrong entity pairs. This is mainly due to the small size of
the training set. Covering a larger number of entities increases
the chances of relatedness between close entities. Using a
larger set helps to discover new correct pairs of similar
entities and keeps the number of false positives very low
when using low thresholds.
IV. CONCLUSIONS
In this paper, we presented a method to build entity vector
representations using knowledge from Wikidata. These
representations hold distinctive features about the entities,
1444
[18] M. U. Gutmann and A. Hyvärinen, “Noise-contrastive estimation of embeddings for entity linking,” in European Semantic Web
unnormalized statistical models, with applications to natural image Conference, 2017.
statistics,” The Journal of Machine Learning Research, vol. 13, no. 1, [24] Freebase, 17 December 2014. [Online]. Available:
pp. 307-361, 2012. https://plus.google.com/109936836907132434202/posts/bu3z2wVqc
[19] I. Yamada, H. Shindo, H. Takeda, and Y. Takefuji, “Joint learning of Qc.
the embedding of words and entities for named entity [25] D. H. Ballard, “Modular learning in neural networks,” in AAAI'87
disambiguation,” arXiv preprint arXiv:1601.01343, 2016. Proceedings of the sixth National conference on Artificial
[20] D. Milne and I. H. Witten, “An effective, low-cost measure of intelligence, 1987.
semantic relatedness obtained from Wikipedia links,” in In [26] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.
Proceedings of the First AAAI Workshop on Wikipedia and S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I.
Artificial Intelligence (WIKIAI), 2008. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz and
[21] Z. Wang, J. Zhang, J. Feng, and Z. Chen, “Knowledge graph and text L, “Tensorflow: Large-scale machine learning on heterogeneous
jointly embedding,” in Proceedings of the 2014 conference on distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
empirical methods in natural language processing (EMNLP), 2014. [27] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,”
[22] Y. Cao, L. Huang, H. Ji, X. Chen and J. Li, “Bridging Text and Journal of machine learning research, pp. 2579-2605, 2008.
Knowledge by Learning Multi-Prototype Entity Mention Embedding,” [28] Z. Ibrahim, S. A. M. Noah and M. M. Noor, “Knowledge acquisition
in Proceedings of the 55th Annual Meeting of the Association for from textual documents for the construction of medicinal herbs
Computational Linguistics (Volume 1: Long Papers), 2017. domain ontology,” Journal of Applied Science, vol. 9, no. 4, pp. 794-
[23] J. G. Moreno, R. Besancon, R. Beaumont, E. D'hondt, A.-L. Ligozat, 798, 2009.
S. Rosset, X. Tannier and B. Grau, “Combining word and entity
1445