
Article in International Journal on Advanced Science Engineering and Information Technology · September 2018
DOI: 10.18517/ijaseit.8.4-2.6831


Vol.8 (2018) No. 4-2
ISSN: 2088-5334

Building Compact Entity Embeddings Using Wikidata

Mohamed Lubani# and Shahrul Azman Mohd Noah#
# Center for Artificial Intelligent Technology,
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia,
43600 Selangor, Malaysia.

Abstract—Representing natural language sentences has always been a challenge in statistical language modeling. Atomic discrete
representations of words make it difficult to represent semantically related sentences. Other sentence components such as phrases and
named-entities should be recognized and given representations as units instead of individual words. Different entity senses should be
assigned different representations even though they share identical words. In this paper, we focus on building the vector
representations (embedding) of named-entities from their contexts to facilitate the task of ontology population where named-entities
need to be recognized and disambiguated in natural language text. Given a list of target named-entities, Wikidata is used to
compensate for the lack of a labeled corpus to build the contexts of all target named-entities as well as all their senses. Description text
and semantic relations with other named-entities are considered when building the contexts from Wikidata. To avoid noisy and
uninformative features in the embedding generated from artificially built contexts, we propose a method to build compact entity
representations to sharpen entity embedding by removing irrelevant features and emphasizing the most detailed ones. An extended
version of the Continuous Bag-of-Words model (CBOW) is used to build the joint vector representations of words and named-entities
using Wikidata contexts. Each entity context is then represented by a subset of elements that maximizes the chances of keeping the
most descriptive features about the target entity. The final entity representations are built by compressing the embedding of the
chosen subset using a deep stacked autoencoder model. Cosine similarity and the t-SNE visualization technique are used to evaluate the
final entity vectors. Results show that semantically related entities are clustered near each other in the vector space. Entities that
appear in similar contexts are assigned similar compact vector representations based on their contexts.

Keywords— entity embeddings; entity vector representations; named entity disambiguation.

I. INTRODUCTION
The large number of possible words in natural language text means that a Natural Language Processing (NLP) model is always expected to encounter new word sequences that were never seen while the model was built. This makes it very difficult for the model to generalize to new cases, and much more data is required to train it. A model trained on data where “Paris” and “Madrid” are represented as different IDs has very little chance of treating both terms as instances of the concept “Capital.” Statistical models built using discrete atomic representations of words are simple and can achieve high accuracy when trained on large training sets that cover a huge number of input cases. However, there are cases where scaling up the training data set will not result in any improvement [1]. Generalization is always better achieved when considering continuous input variables. Despite the added complexity, models that use continuous input variables tend to show better performance. For example, vector representations of words significantly improve many NLP applications such as syntactic and semantic text analysis [2], [3], Named Entity Disambiguation (NED) [4], ontology population [5] and information retrieval [6]. These representations can be shared across languages [7] to overcome language-specific problems such as Arabic entity detection issues [8].
In NLP, representing (embedding) words as vectors in a continuous vector space means that words with similar semantic and syntactic properties are mapped (embedded) to nearby points in the space. Using the context to build word representations is the most widely used approach in NLP. In natural language text, words are represented by their contexts. The distributional hypothesis in linguistics is used as a key to unlock the semantic properties of languages. It states that words that share similar contexts tend to have similar meanings [9], [10]. This suggests a clear link between contexts and meaning similarity, and it opens the door to exploring distributional similarity in linguistics as a way of finding semantic similarity [11].
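As a minimal sketch of the contrast drawn above between discrete IDs and continuous vectors, the following Python snippet compares the cosine similarity of “Paris” and “Madrid” under both representations. The three-dimensional dense vectors are hypothetical values picked by hand for illustration and do not come from any trained model; the helper names `cosine` and `one_hot` are likewise assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_hot(index, size):
    """Discrete atomic representation: a single 1 at the word's ID."""
    return [1.0 if i == index else 0.0 for i in range(size)]


vocab = ["paris", "madrid", "banana"]
ids = {word: i for i, word in enumerate(vocab)}

# Hypothetical dense embeddings, hand-picked for illustration only.
dense = {
    "paris": [0.9, 0.8, 0.1],
    "madrid": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

# Distinct IDs share no component, so their similarity is exactly zero.
one_hot_sim = cosine(one_hot(ids["paris"], len(vocab)),
                     one_hot(ids["madrid"], len(vocab)))

# Dense vectors can place related words close together.
dense_sim_capitals = cosine(dense["paris"], dense["madrid"])
dense_sim_unrelated = cosine(dense["paris"], dense["banana"])

print(one_hot_sim, dense_sim_capitals, dense_sim_unrelated)
```

With one-hot IDs every pair of distinct words is equally dissimilar, so nothing learned about “Paris” transfers to “Madrid”. With dense vectors the two capitals can sit near each other while an unrelated word stays far away, which is the property the distributional approach relies on.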

In a large corpus, the embedding of a word is fine-tuned each time a new occurrence of the word is encountered. More information is added from frequent contexts, and less weight is given to rare, isolated occurrences of the word. For named entities, contexts can be generated artificially from a knowledge base (KB) to build the entity embedding. Because the number of such contexts is limited, rare entity contexts may be a source of noise, and special processing is required to keep only the most descriptive information in the embedding. In this paper, we utilize the Wikidata contexts of named entities in order to assign similar vector representations to semantically related entities. A method to build compact entity representations with the most descriptive information is proposed. Different senses of named entities are considered and are assigned different vector representations based on their contexts.

II. MATERIAL AND METHOD
Representing variables in a continuous space to enhance accuracy is an old idea. It was first used in the SMART information retrieval system in the 1960s [12], where documents and queries are represented as vectors. The concept was adopted by [13] to represent the input variables of neural networks as vectors of real numbers. Rumelhart et al. [14] show that these representations can be learned while training the neural network to perform the desired task using backpropagation and gradient descent. Bengio et al. [15] build on this idea and the distributional hypothesis to construct vector representations of natural language words by maximizing the probability of the next word given the previous ones in a text corpus. They show that semantically similar words are assigned similar representation vectors. This comes at the expense of model complexity and slow training over large data sets.
To compute word vector representations efficiently using very large data sets, new models are required. As described in [1], the complexity of models that find vector representations of words comes from the non-linear hidden layers, where heavy matrix multiplications are performed. The Continuous Bag-of-Words model (CBOW) proposed in [1] removes the non-linear hidden layer and projects the N input words to the same position by taking the average of their vectors. It learns to predict the current word from a neighboring context, such as a window of words before and after the target word, using a log-linear classifier. This means that CBOW considers the whole context as one observation while training, which helps the model train well on small training sets. The second proposed model is the Continuous Skip-gram model, which is similar to CBOW except that it predicts the context words from the current word. This division of the context into multiple observations suggests that the model has much more to learn than CBOW and thus needs a larger training data set to converge.
The distributed representations learned using the models in [1] are not only similar for semantically and syntactically related words but also capture multiple degrees of similarity between words; for example, nouns with similar endings (e.g., “ing”) are located near each other in the vector space [16]. In addition to learning good vector representations, [16] describes how relationships between vectors can be explored using specific vector offsets. It suggests that linguistic regularities are present between vector representations of words and can be obtained by applying algebraic operations such as $V_a - V_b \approx V_c - V_d$ for analogous word pairs $(a, b)$ and $(c, d)$.
Several enhancements to the Skip-gram model [1] are described in [17] to speed up training and provide better representations. The first enhancement is the subsampling of frequent words in the training data using a fixed subsampling rate to improve the representations of rare words. The second enhancement comes from the fact that the cost of computing the probabilities in the Skip-gram model is proportional to the vocabulary size, which makes model training very expensive. The proposed method is called Negative Sampling (NEG), and it simplifies the Noise Contrastive Estimation (NCE) [18] used to optimize the models in [1]. To avoid the heavy probability calculations, logistic regression is used to distinguish the real target words from randomly selected noise words by giving the real target words higher scores. This simplifies NCE by considering only the samples and disregarding the probability calculations. [17] also shows that simple mathematical operations such as vector addition represent meaningful and non-obvious linguistic relationships such as $V_a + V_b \approx V_c$, where the sum of two word vectors lies close to the vector of a related term. It also introduces a way to build vector representations of phrases and multi-word entities. Representing a phrase as a single vector is significantly more expressive than using the representations of its individual words. The proposed solution identifies the phrases in the data set by locating words that frequently appear together and representing them as one token while training the model.
Another attempt to map the embedding of words and entities to the same vector space is introduced in [19]. The model extends the Skip-gram model proposed in [1] by adding two more objectives: Obj2, which predicts neighboring entities from a target entity, and Obj3, which predicts neighboring words from a target entity using a knowledge base. The Wikipedia Link Based Measure (WLM) [20] is used to find entities related to a given entity in the KB. Obj3 is used to allow interactions between the entity vectors generated using Obj2 and the word vectors of the Skip-gram model. Wikipedia is also used to link context words to entities: entities in Wikipedia pages, represented as hyperlinks called “anchors”, are unambiguously linked to specific KB entities. The model is then trained by maximizing an objective function that is simply the linear combination of the three objective functions, using NEG [17] and Stochastic Gradient Descent (SGD). Linking the representations of entities and words using Wikipedia anchors is also used in [21]. Both [19] and [21] use the KB to unambiguously identify entities. Different senses of the same entity refer to different KB nodes and thus are assigned different representations. However, a problem arises when attempting to link entity and word representations using the anchor text. Simply replacing the anchor with the entity in the text does not represent the specific sense of the entity in that context. Therefore, different entity senses will not be given different representations. To solve this, [22] proposes a method to learn multiple sense representations for each mention. Wikipedia anchors are used to map the hyperlink text, i.e., the mention, to an entity. Using the context words around the anchor and the entity it refers to, different representations are

learnt for different mention senses. Mentions in Wikipedia pages that refer to the same entity are represented using the same token. A new mention token is used to represent a new sense if the mention refers to a different entity. The objective function used to learn the representations of the different mention senses is to predict the entity linked to a mention given the mention token itself and the context words. Another objective function is used to predict the entities themselves from their neighbors (direct connections) in a KB. A third objective function is used to learn word representations by predicting the context words of a target word in the text. The model is trained in a way similar to [19], optimizing an objective function that is the linear combination of all three objective functions.
In [19], [21] and [22], a KB is used to build entity representations, and anchors are then used to align them into the same space as the word representations. Another method, proposed in [23], learns entity representations using their example occurrences in a large text corpus (Wikipedia) instead of a KB. This allows for the utilization of distributional knowledge about entities in text. It introduces the concept of Extended Anchor Text (EAT), which extends the given corpus with more sentences that relate entities to their context words. This is done by substituting the anchor text in Wikipedia pages with the corresponding entities and adding the result to the corpus as new sentences. It then uses a training approach similar to [17]. The original models proposed in [1], i.e., the CBOW and Skip-gram models, can also be used. This method has the advantage of using the original contexts of entities instead of KB-built contexts. This allows the entity representations to be built from a large number of different contexts in which an entity occurs. When a large corpus is used to learn vector representations of entities, the components of these vectors are sharpened with more information each time new contexts are encountered. The vectors are adapted to keep the most distinctive features of the entities they represent.
In conclusion, the vector representations of named entities can be built using contexts generated from a KB or contexts found in a large corpus. The first option is useful for building the representations of a specific set of entities. The KB can be queried for each entity in the set to generate its contexts. This comes at the expense of limiting the number of contexts that can be used to learn high-quality features about the entities. On the other hand, using a large corpus allows the many occurrences of named entities to be used to enhance the representations and keep the most useful features. However, when using a corpus, only named entities mentioned in the corpus are considered. This does not allow for learning the representations of a specific set of entities, such as the entities of an ontology. In addition, many of the reviewed methods do not provide a way to differentiate between the representations of different named entity senses.
In this paper, we present a method to learn named entity vector representations to be used for tasks related to ontology population, namely NED and relation instance extraction. In ontology population tasks, the goal is to identify the correct sense of an entity in natural language text using its context. This requires a training set of named entities and a method to build expressive entity embeddings for the different entity senses. We define the problem as follows: given a seed ontology with instances of concepts as disambiguated named entities, we aim to learn the vector representations of these entities in order to identify their occurrences in text. High-quality entity embeddings facilitate the task of spotting the correct sense of entities in text and thus extracting correct facts about them. Wikidata is used to extract the set of neighboring entities connected by semantic relations to a given entity. To jointly link entity and word representations, the description text of Wikidata entities is used. The knowledge collected from Wikidata is used to train a CBOW model to learn the joint embedding. Entities are assigned contexts that maximize the chances of keeping the most descriptive features using the learned embedding. To sharpen entity representations and remove irrelevant information, entities are represented as compact, dense continuous vectors using a deep stacked autoencoder model.
The proposed method to build the compact entity representations consists of three main components. These components are explained in detail in the following sections.

A. The Crawler
We utilize Wikidata, a collaboratively built public knowledge base containing a large number of entities that refer to real-world objects such as a person, location or organization, or to abstract concepts such as “gravity” and “seasons”, with all their semantic interpretations (i.e., senses). It contains the structured knowledge of other Wikimedia Foundation projects, mainly the knowledge of Wikipedia, which is the world’s largest encyclopedia. As of December 2014, Wikidata also contains the resources of Freebase [24]. We build a crawler to find the neighboring entities of a given named entity, called the co-entities, and to extract the keywords associated with the entity found in its description, referred to as the co-words. Given a set of named entities E, the crawler extends the input set by adding all the senses of each entity to create E′. It also associates each entity e′ ∈ E′ with its unique Wikidata id in order to differentiate between different senses of an entity. The crawler’s objective is to build the context of each entity e′ ∈ E′ using its co-entities S_E and co-words S_W. The co-entities set S_E contains only named entities that have a semantic relation with e′ in its Wikidata page. The crawler heuristically identifies entities by checking the capitalization of each token in the entity label. This is to exclude non-named-entity concepts in the Wikidata page (e.g., the universe and space concepts). Entities of S_E are represented as underscore-separated lowercase tokens with their unique Wikidata id as the last token, such as united_states_of_america_q30. The co-words set S_W is built by taking the non-stop words from the description of e′. Entities are ignored if their description is empty. We define the context C(e′) of an entity e′ as the set of co-words and co-entities S_A = S_E ∪ S_W. While crawling, the same context size α is kept for all entities by using a subset of S_A if its size is more than α and by padding with the first keyword from the description if its size is less than α. Entity contexts are then written as separate lines to a text file. Each line in the file contains α + 1 space-separated elements with the target entity e′ in the middle and

α/2 elements from C(e′) to the left and right. To keep the target entity in the middle, α is chosen as an even number. For example, considering the named entity “united_states_of_america_q30” and a context size of 4, the corresponding line for this entity in the text file can be as follows: new_york_city_q60 federal united_states_of_america_q30 republic thirteen_colonies_q179997. The words “federal” and “republic” are elements of S_W, whereas the entities “new_york_city_q60” and “thirteen_colonies_q179997” are elements of S_E. Algorithms 1 and 2, shown in Fig. 1 and Fig. 2 respectively, explain the functionality of the crawler in detail.

Fig. 1 Steps of Algorithm 1 to build the extended entity set.

Fig. 2 Steps of Algorithm 2 to build the contexts of target entities.

B. Building Joint Vector Representations
To build the joint vector representations of words and entities in the same vector space, we extend the CBOW model proposed in [1] to cover entities as well as words. We use the output of the crawler as the training input. Each line in the output text file contains the context of a specific named entity sense. Pairs of training examples (w, w′) are generated only within the same line to predict the target entity w′ from a context element w. Therefore, training is achieved by maximizing the following objective function using stochastic gradient descent (SGD):

$$\frac{1}{L}\sum_{i=1}^{L}\sum_{j=1}^{\alpha}\log P\left(e'_i \mid c_j\right) \qquad (1)$$

where L is the total number of entities in the extended entity set E′, e′_i is an entity in E′, α is the entity context size and c_j ∈ C(e′_i) is a co-word or a co-entity from the context of e′_i. To avoid complexity, we use NEG proposed in [17], defined by the following objective function:

$$\log P\left(e'_i \mid c_j\right) = \log\sigma\left({v'_{e'_i}}^{\top} v_{c_j}\right) + \sum_{k=1}^{K}\mathbb{E}_{w_k \sim P_n(w)}\left[\log\sigma\left(-{v'_{w_k}}^{\top} v_{c_j}\right)\right] \qquad (2)$$

where $v$ and $v'$ denote input and output vector representations, $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function, and K is the number of negative samples (noise words), denoted $w_k$ and drawn from the noise distribution $P_n(w)$ defined as:

$$P_n(w) = \frac{f(w)^{3/4}}{\sum_{j=1}^{n} f(w_j)^{3/4}} \qquad (3)$$

where f(w) is the frequency of the word w in a vocabulary of size n.
We maximize the objective function in equation (1) using a two-layer neural network similar in structure to the word2vec model [17]. The vector representations (embeddings) of the words and entities are stored as the rows of the weight matrix of the model’s hidden layer. To maintain the vectors’ properties in Euclidean space for the following steps, the embeddings are normalized using the ℓ2 norm as per equation (4):

$$x_{norm} = \frac{x}{\sqrt{\sum_{i=1}^{c}\left|x_i\right|^{2}}} \qquad (4)$$

where x is an embedding vector of length c.

C. Building the Final Entity Vector Representations
After building the joint vector representations by training to predict the target entity e′ from the words and entities of its context, the final entity representations are built using the embeddings of prominent elements from the corresponding contexts. To remove redundant features and emphasize the distinctive ones, entity representations are compressed into high-level compact vector representations using autoencoders. The concept of autoencoders was first discussed in [14] and then used in [25] as unsupervised neural networks trained to produce their inputs as outputs. An autoencoder is composed of two parts: the encoder enc(x) and the decoder dec(x). Both can be seen as neural networks with multiple hidden layers. The encoder maps its input vector X ∈ ℝ^p to an output vector Y ∈ ℝ^d, where d < p if the autoencoder is undercomplete. The decoder then attempts to reconstruct the input vector X from the smaller vector Y. Training the autoencoder is done by minimizing the reconstruction error of all vectors v in the training set S of size L, i.e., minimizing the function:

$$E(S) = \sum_{i=1}^{L}\left\lVert v_i - \mathrm{dec}\left(\mathrm{enc}(v_i)\right)\right\rVert^{2} \qquad (5)$$

After training the autoencoder, the vectors produced by the encoder part can be seen as compact representations of the input vectors. These representations contain enough knowledge to rebuild the original vectors with high confidence. The encoder part can be used to map the training vectors of S to a new vector space with a smaller dimension, and the new set of vectors can be used to train another autoencoder. This structure is called a stacked autoencoder, which represents a deep neural network with many hidden layers.
We use a stack of two undercomplete autoencoders to build the final entity representations. This allows for capturing high-level abstract features about the entities. Each autoencoder consists of two hidden layers in the encoding part and two hidden layers in the decoding part. The first autoencoder is trained using input vectors that represent the target entity contexts. These vectors are built by concatenating the embeddings of β chosen context elements, each of which can be a co-word, a co-entity or the target entity itself. The context elements are chosen so as to keep the most distinctive features of the context. These elements are then ordered alphabetically, and their embeddings are concatenated to construct the training vectors of the first autoencoder. We choose the β − 1 elements that have the highest average distances from the embeddings of the remaining context elements. These are the top β − 1 elements from the context that maximize the objective function in equation (6). To include features from the rest of the context, the element with the smallest average distance from all the other context elements, i.e., the element that minimizes the objective function in equation (6), is also chosen. We call this element the context agent. This way, the chosen context elements maximize the chances of keeping the most descriptive features of the target entities.

$$F(Q) = \frac{1}{\alpha}\sum_{\substack{i=1 \\ P_i \neq Q}}^{\alpha+1} d\left(P_i, Q\right) \qquad (6)$$

where Q and P_i are embeddings of context elements in the vector space ℝ^z, z is the embedding vector size, α is the context size and d(P_i, Q) is the Euclidean distance between P_i and Q.
The encoding part of each autoencoder has two hidden layers, and each hidden layer is half the size of the previous layer, so each encoder compresses its input by a factor of 4. This gives the stacked autoencoder structure an overall vector compression factor of 16. The size of the input vectors of the first autoencoder is z × β. The first autoencoder is trained to encode its input to a vector of size (z × β)/4, which is the size of the input vectors used to train the second autoencoder. The final entity embedding has a size of (z × β)/16 and is obtained from the encoder part of the second autoencoder.
In our experiments, we use Wikidata as the source of the training set of named entities. We collected the top 500 named entities from Wikidata using a simple collection algorithm that automatically checks Wikidata entities starting from id = 1 and an empty set E. The algorithm heuristically checks the label in the corresponding Wikidata page looking for named entities. Detected entities are added to the entity set E. Crawler Algorithm 1 in Fig. 1 is then used to build the extended entity set E′ with all entity senses. The size of the built set E′ is 1559, with an average of about three senses per entity. The extended set is then used as input to the crawler’s Algorithm 2 in Fig. 2 to build the contexts C(e′) for all e′ ∈ E′ with a fixed context size α = 14. The output is a text file containing 1559 lines representing entity contexts. For example, the named entity “Syria” in the set E has 6 senses in E′, where it can be a country, a female name, a journal name, a Roman province, an Italian singer or a family name. Related co-words and co-entities surround each of these six senses. For example, the country sense is written as Syria_q858 with neighbors such as Asia, republic, damascus_q3766, turkey_q43, etc. The singer sense is written as Syria_q3979196 with neighbors including Italian, singer, italy_q38, and rome_q220.
The second step is to train the model described in Section II.B using the crawler’s output text file. We use Google’s TensorFlow Python library [26] to implement the two-layer neural network, where the input and output layers have the same size as the vocabulary. The vocabulary size is the number of unique words and entities in the input text file. The hidden layer’s size equals the required embedding vector size. We set the embedding size, i.e., the size of the hidden layer, to 128. We set the rest of the model’s parameters as follows: the number of negative samples NEG = 64 and the gradient descent learning rate is 1.0. Training examples are generated from each line as pairs (w, w′), where w′ is the target entity of the line and w is a context element. Since we use a context size of 14, each line produces 14 training pairs. The model is trained for 100,000 iterations, each using a batch of training pairs. We use a batch size of 280 to cover all pairs of 20 randomly selected contexts in the same training batch. Once model training is completed, the hidden layer’s weight matrix is saved as the embedding of the training vocabulary. It contains the embeddings of both words and entities mapped to the same vector space.
To build the final entity representations, we train the stacked autoencoder model to minimize the reconstruction error of L = 1559 vectors in the input set S as per equation (5). We choose β = 4, and thus the size of each training vector is 128 × 4 = 512, which is the size of the input layer of the first autoencoder. The first autoencoder is trained for 30,000 iterations with a learning rate of 0.05 using batches of size 256. It is trained to encode the input to a vector of size 128, which is the size of the input layer of the second autoencoder. The second autoencoder is then trained using similar parameters except with a learning rate of 0.01. After training the second autoencoder, the stacked model is ready to be used to encode input vectors. The result is a set of L vectors of size 32 that constitutes the final vector representations of the target named entities.

III. RESULTS AND DISCUSSION
As described in Section II.C, the β context elements that are used to build the final entity representations can be either words or entities. Denoting the chosen set for an entity e′ as φ_{e′} and Φ = ⋃_{e′ ∈ E′} φ_{e′}, TABLE I shows a

statistical analysis of the distribution of β elements considering all the 1559 test entities and their contexts.

TABLE I
STATISTICAL ANALYSIS OF THE DISTRIBUTION OF β ELEMENTS OF ALL CHOSEN SETS
 | Total words in Φ | Total entities in Φ | Total words as context agents | Total entities as context agents
Numbers | 2258 | 3978 | 349 | 1210
Percentage | 36.2% | 63.79% | 5.59% | 19.4%

As TABLE I shows, entities are chosen at almost twice the rate of words to be among the β chosen elements of target entities. This is largely because the number of co-words in any context is less than the number of co-entities. The only source of co-words is the Wikidata description text, which is usually no more than a couple of sentences. However, the Wikidata page of an entity has plenty of co-entities found in the binary relations of the target entity.
To evaluate the final vector representations of entities, we use cosine similarity as a measure of the semantic similarity between two vectors in the space. We find the closest entity to each of the 1559 test entities by finding the entity from the same set with the maximum cosine similarity. Since an entity with a maximum cosine similarity can always be found, a threshold has to be set to decide whether the entities are semantically related. This threshold depends on the size and coverage of the training entity set. For small domain-specific training sets, this threshold has to be large and very close to 1. TABLE II shows some examples of the most similar entities found in the training set with cosine similarities of more than 0.95.

TABLE II
SOME EXAMPLES OF SIMILAR ENTITIES IN THE USED TRAINING SET WITH COSINE SIMILARITIES MORE THAN 0.95.
Related entities | Wikidata descriptions
Bolivia_Q750 & Paraguay_Q733 | A country in South America & A country in South America
Moscow_Q2380475 & Lisbon_Q2310637 | City in Tennessee, USA & A town in New Hampshire, USA
Maine_Q3708887 & Flanders_Q3459889 | A town in New York, United States & Census-designated place in Suffolk County, New York
Versailles_Q2729504 & Rhine_Q1886951 | A town in Indiana, United States & A civil town in Sheboygan County, Wisconsin
Egypt_Q2083973 | Town in Arkansas
Panama_Q2204538 & Lisbon_Q2384470 | Town in Oklahoma & A town in Maine, USA
August_Q1192731 & We_Live_In_Public_Q372 | 2008 American drama film & 2009 documentary film by Ondi Timoner which profiles internet pioneer Josh Harris
Sunday_Q1286562 & Dubai_Q5310496 | Song by British recording duo Hurts & 2005 Filipino drama film

As TABLE II shows, the cosine similarity is close to 1 for entities that represent the same real-world concepts such as cities, countries, and films. Entities that are semantically related, such as city, town, and community, also have high cosine similarities, i.e., more than 0.95. The last row in TABLE II is another example of how semantically related entities are assigned similar representations, where both entities represent the artwork concept.
To show how the entities of our training set are distributed in the vector space, we use the non-linear dimensionality reduction tool t-SNE [27] to visualize entity embeddings in 2D space. Fig. 3 shows the distribution of 150 randomly chosen entity embeddings.
As expected, Fig. 3 shows that semantically related entities are clustered relatively near each other in the vector space. TABLE III shows a few examples of related entities found in Fig. 3, where examples are numbered from 1 to 6.

TABLE III
EXAMPLES OF RELATED ENTITIES FOUND IN FIG. 3
Group # | Related entities | Wikidata descriptions
1 | Grenada_Q985543 & Saginaw_Q7399254 & Saginaw_Q970802 | City in Mississippi & An unincorporated community in Hot Spring County, Arkansas & City in Texas
2 | Guatemala_Q11221957 & Ecuador_Q2347797 | Triceratops song & 1997 song by Sash!
3 | Groningen_Q1816384 & Jamaica_Q3450853 | An unincorporated community in Pine County, Minnesota & A town in Vermont, United States
4 | Dominica_Q784 & Honduras_Q783 | A country in the Caribbean & Republic in Central America
5 | Ecuador_Q736 & | A country in South America & A country in East Asia,
& & Mongolia_Q711
between China and Russia
Versailles_Q2729504 A town in Indiana, United States
City in Wilson County City in Cuba
Lebanon_Q1520670 Venezuela_Q593830 &
&
& 6 & An unincorporated
Tennessee, a city in Kentucky,
London_Q3061911 Saginaw_Q7399257 community in St. Louis
United States
County, Minnesota
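The nearest-entity search with a similarity threshold described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the three-dimensional vectors are made-up toy values rather than learned embeddings, and only the entity identifiers and the 0.95 threshold come from the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def nearest_related(target, embeddings, threshold=0.95):
    """Return (entity_id, similarity) of the closest other entity,
    or None when even the best match does not exceed the threshold."""
    best_id, best_sim = None, -1.0
    for entity_id, vector in embeddings.items():
        if entity_id == target:
            continue  # an entity is trivially identical to itself
        sim = cosine_similarity(embeddings[target], vector)
        if sim > best_sim:
            best_id, best_sim = entity_id, sim
    if best_id is None or best_sim <= threshold:
        return None
    return best_id, best_sim


# Toy vectors (hypothetical values, for illustration only).
toy = {
    "Bolivia_Q750": [0.90, 0.10, 0.00],
    "Paraguay_Q733": [0.88, 0.12, 0.01],
    "Sunday_Q1286562": [0.00, 0.20, 0.95],
}

print(nearest_related("Bolivia_Q750", toy))
print(nearest_related("Sunday_Q1286562", toy))
```

A maximum always exists, so without the threshold the third toy entity would still be paired with an unrelated country; returning None in that case mirrors the reason the text gives for requiring a threshold close to 1 on small training sets.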

Fig. 3 t-SNE visualization of 150 randomly selected entity embeddings.

Fig. 3 also shows that not all close entities are semantically related. As discussed earlier, a high similarity threshold is required for small training sets. This translates to a short distance between related vectors, especially after reducing the dimension of the embeddings for visualization using t-SNE.

The goal of constructing entity embeddings for ontology population tasks requires that if two entities share similar contexts, then their embeddings are expected to be similar, and vice versa. To test this using our training set, we find all unique entity pairs where the cosine similarity is more than 0.95. Then we check the corresponding contexts looking for shared elements, i.e. shared co-words or co-entities. We consider the pair as correct if there is at least one shared element found in their contexts, which justifies the high similarity. For example, the entities Bolivia_Q750 and Paraguay_Q733 have a cosine similarity of more than 0.95. For this pair to be considered as correct, which means that they were rightfully assigned similar representations, they should have common elements in their contexts. By checking their contexts, we find that four elements are shared: Spanish_Q1321, South_america_Q18, south, and country. The first two are named entities, and the last two are words. Based on this, the pair is considered as correct.

Out of 963 unique pairs found, 951 pairs share context elements, giving an accuracy of 98.75%. This means that entities with similar contexts have a high chance of being assigned close embeddings. It is also worth noting that the remaining pairs are not necessarily false positives, since we look for the exact elements in both contexts. Contexts may share synonyms or other semantically related elements that caused the embeddings to be similar, which is the expected behavior of good embeddings. We repeated the experiment using different thresholds to test the effect on the accuracy. TABLE IV shows the accuracies for different similarity thresholds.

TABLE IV
ACCURACIES FOR DIFFERENT SIMILARITY THRESHOLDS

Similarity    Unique entity pairs with    Pairs with at least one    Accuracy
threshold     similarity > threshold      shared context element
0.95          963                         951                        98.75%
0.93          1636                        1578                       96.34%
0.90          3199                        2783                       86.99%
0.85          11050                       6535                       59.14%

As TABLE IV shows, the number of unique entity pairs increases when using lower thresholds, along with a large number of wrong entity pairs. This is mainly due to the small size of the training set. Covering a large number of entities increases the chances of relatedness between close entities. Using a larger set helps to discover new correct pairs of similar entities and keeps the number of false positives very low when using low thresholds.

IV. CONCLUSIONS

In this paper, we presented a method to build entity vector representations using knowledge from Wikidata. These representations hold distinctive features about the entities, which help to identify the correct entity sense in natural language text to facilitate ontology population tasks. Ontology population, which is the process of adding new instances of concepts and relations into an ontology from a corpus, will benefit from this process in order to minimize the manual effort as exhibited in [28].

The conducted experiments show that entities are assigned close vector representations if they have similar contexts. In the future, we plan to demonstrate the use of the constructed entity vectors in the tasks of NED and relation extraction. It would also be interesting to investigate the effect of using Wikidata hierarchies while building entity contexts.

REFERENCES

[1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[2] M. A. Taiye, S. S. Kamaruddin, and F. K. Ahmad, “Representing Semantics of Text by Acquiring its Canonical Form,” International Journal on Advanced Science, Engineering and Information Technology, vol. 7, no. 3, pp. 808-814, 2017.
[3] S. A. M. Noah, N. Omar, and A. Y. Amruddin, “Evaluation of lexical-based approaches to the semantic similarity of Malay sentences,” Journal of Quantitative Linguistics, vol. 22, no. 2, pp. 135-156, 2015.
[4] M. Mohd and O. M. A. Bashaddadh, “Investigating the Combination of Bag of Words and Named Entities Approach in Tracking and Detection Tasks among Journalists,” Journal of Information Science Theory and Practice, vol. 2, no. 4, pp. 31-48, 2014.
[5] N. I. Y. Saat and S. A. M. Noah, “Rule-based Approach for Automatic Ontology Population of Agriculture Domain,” Information Technology Journal, vol. 46, no. 51, pp. 46-51, 2016.
[6] Y. I. A. M. Khalid and S. A. M. Noah, “Semantic text-based image retrieval with multi-modality ontology and DBpedia,” The Electronic Library, vol. 35, no. 6, pp. 1191-1214, 2017.
[7] W. Ammar, G. Mulcaire, Y. Tsvetkov, G. Lample, C. Dyer and N. A. Smith, “Massively Multilingual Word Embeddings,” arXiv preprint arXiv:1602.01925, 2016.
[8] R. E. Salah and L. Q. b. Zakaria, “Arabic Rule-Based Named Entity Recognition Systems: Progress and Challenges,” International Journal on Advanced Science, Engineering and Information Technology, vol. 7, no. 3, pp. 815-821, 2017.
[9] Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146-162, 1954.
[10] J. R. Firth, “A synopsis of linguistic theory 1930-55,” in Studies in Linguistic Analysis, Vols. 1952-59, The Philological Society, 1957, pp. 1-32.
[11] M. Sahlgren, “The distributional hypothesis,” Italian Journal of Disability Studies, vol. 20, pp. 33-53, 2008.
[12] G. Salton, The SMART Retrieval System - Experiments in Automatic Document Processing, Upper Saddle River, NJ: Prentice-Hall, Inc., 1971.
[13] D. E. Rumelhart and J. L. McClelland, Psychological and Biological Models, MIT Press, 1986.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel distributed processing: explorations in the microstructure of cognition, Cambridge, MA: MIT Press, 1986.
[15] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A Neural Probabilistic Language Model,” Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.
[16] T. Mikolov, W.-t. Yih, and G. Zweig, “Linguistic regularities in continuous space word representations,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013.
[17] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013.

[18] M. U. Gutmann and A. Hyvärinen, “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 307-361, 2012.
[19] I. Yamada, H. Shindo, H. Takeda, and Y. Takefuji, “Joint learning of the embedding of words and entities for named entity disambiguation,” arXiv preprint arXiv:1601.01343, 2016.
[20] D. Milne and I. H. Witten, “An effective, low-cost measure of semantic relatedness obtained from Wikipedia links,” in Proceedings of the First AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI), 2008.
[21] Z. Wang, J. Zhang, J. Feng, and Z. Chen, “Knowledge graph and text jointly embedding,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014.
[22] Y. Cao, L. Huang, H. Ji, X. Chen and J. Li, “Bridging Text and Knowledge by Learning Multi-Prototype Entity Mention Embedding,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017.
[23] J. G. Moreno, R. Besancon, R. Beaumont, E. D'hondt, A.-L. Ligozat, S. Rosset, X. Tannier and B. Grau, “Combining word and entity embeddings for entity linking,” in European Semantic Web Conference, 2017.
[24] Freebase, 17 December 2014. [Online]. Available: https://plus.google.com/109936836907132434202/posts/bu3z2wVqcQc.
[25] D. H. Ballard, “Modular learning in neural networks,” in AAAI'87 Proceedings of the sixth National conference on Artificial intelligence, 1987.
[26] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz and L, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
[27] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of machine learning research, pp. 2579-2605, 2008.
[28] Z. Ibrahim, S. A. M. Noah and M. M. Noor, “Knowledge acquisition from textual documents for the construction of medicinal herbs domain ontology,” Journal of Applied Science, vol. 9, no. 4, pp. 794-798, 2009.
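
As a supplementary illustration of the shared-context check used in the evaluation (a high-similarity pair counts as correct when the two contexts share at least one co-word or co-entity), the following minimal sketch may help. It is an assumption-laden example rather than the authors' code: the function name is hypothetical, the contexts of Bolivia_Q750 and Paraguay_Q733 contain only the four shared elements reported in the text, and Entity_A and Entity_B with their contexts are invented placeholders.

```python
def shared_context_accuracy(similar_pairs, contexts):
    """Count pairs whose context sets intersect and return
    (number_of_correct_pairs, accuracy as a fraction)."""
    correct = 0
    for first, second in similar_pairs:
        # Exact-match intersection of co-words and co-entities.
        if contexts[first] & contexts[second]:
            correct += 1
    accuracy = correct / len(similar_pairs) if similar_pairs else 0.0
    return correct, accuracy


# Contexts as sets of co-words and co-entities.
contexts = {
    # The four shared elements reported in the text.
    "Bolivia_Q750": {"Spanish_Q1321", "South_america_Q18", "south", "country"},
    "Paraguay_Q733": {"Spanish_Q1321", "South_america_Q18", "south", "country"},
    # Hypothetical placeholder entities with disjoint contexts.
    "Entity_A": {"river"},
    "Entity_B": {"film"},
}

# Pairs assumed to have already passed the similarity threshold.
pairs = [("Bolivia_Q750", "Paraguay_Q733"), ("Entity_A", "Entity_B")]

correct, accuracy = shared_context_accuracy(pairs, contexts)
print(correct, accuracy)
```

Because the intersection is an exact match, a pair whose contexts contain only synonyms is counted as wrong, which is why the text notes that the remaining pairs are not necessarily false positives.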
