Automatic Clustering of Construction Project Documents Based On Textual Similarity

M. Al Qady, A. Kandil

Automation in Construction 42 (2014) 36–49
journal homepage: www.elsevier.com/locate/autcon
http://dx.doi.org/10.1016/j.autcon.2014.02.006

Article history: Received 7 July 2013; Received in revised form 4 February 2014; Accepted 8 February 2014; Available online 15 March 2014.

Keywords: Document management; Single pass clustering; Supervised/unsupervised learning methods

Abstract

Text classifiers, as supervised learning methods, require a comprehensive training set that covers all classes in order to classify new instances. This limits the use of text classifiers for organizing construction project documents, since it is not guaranteed that sufficient samples are available for all possible document categories. To overcome the restriction imposed by this all-inclusive requirement, an unsupervised learning method was used to automatically cluster documents based on textual similarities. Repeated evaluations using different randomizations of the dataset revealed a region of threshold/dimensionality values with consistently high precision values and average recall values. Accordingly, a hybrid approach was proposed which initially uses an unsupervised method to develop core clusters and then trains a text classifier on the core clusters to classify outlier documents in a consequent refinement step. Evaluation of the hybrid approach demonstrated a significant improvement in recall values, resulting in an overall increase in F-measure scores.

© 2014 Elsevier B.V. All rights reserved.
2.2. Single pass clustering

Single pass clustering, also called incremental clustering, generates one cluster at a time using a predefined threshold value. The threshold represents the user's perception of acceptable proximity, e.g. the minimum acceptable cosine similarity between instances and the cluster centroid, or the maximum acceptable Euclidean distance between the instances and the cluster centroid. Starting with a random instance, the closest instance in the dataset that satisfies the threshold is identified and added to the cluster, and the cluster centroid is recalculated. The process is repeated with the new cluster centroid until no instances remain that satisfy the threshold, thereby finalizing the first cluster. An unclustered instance is then selected at random and the process is repeated with the remaining instances in the dataset to sequentially create new clusters until no unclustered instances remain (or until only unclustered instances remain that do not meet the threshold standard with any of the formed clusters, thereby forming single-instance clusters).

3. Methodology

Since the objective is to organize construction project documents into semantically related groups, a hierarchical clustering structure is not warranted, especially given the associated computational complexity of agglomerative clustering. For the current task, flat clustering is more suitable and economical. The use of K-means requires predefining the number of clusters (cardinality) before implementing the algorithm. It is up to the users to judge cardinality based on their knowledge of the domain topic of the dataset. In reality, the number of clusters has a significant impact on the results. Leaving this decision to the user's judgment detracts from the automated nature of the task and adds a high degree of subjectivity to the process. In addition to cardinality, the choice of the initial centroids greatly affects the clustering results. K-means essentially tries out various clustering outcomes looking for the optimal outcome. While it is highly unlikely that all possible outcomes will be tested, the fact remains that a misguided selection of the number and position of the initial centroids may unnecessarily prolong the process or, more critically, result in a locally optimal clustering outcome instead of the global optimum. On the other hand, single pass clustering does not require definition of cardinality by the user, but requires determination of a threshold defining the boundary of similarity between documents and cluster centroids. Single pass clustering has been criticized for producing varying clustering outcomes depending on which instances are selected to initiate the clusters and for its tendency to produce large clusters in the first pass. In this study, single pass clustering was used to automatically cluster project documents. In order to overcome the limitations imposed by single pass clustering, several factors were evaluated to assess their effect on clustering performance. The first factor evaluated was the effect of the threshold value on clustering accuracy. The predefined value of the threshold has a significant impact on the clustering result: a stricter threshold decreases cluster size (thereby increasing the total number of clusters) while a less strict threshold results in fewer clusters of larger sizes.

Predefining a threshold value must be viewed within the context of a specific dataset. Success of single pass clustering is understandably dependent on the extent to which a specific dataset satisfies the contiguity hypothesis. Relatively similar instances that are disjoint from other groups of instances make defining a threshold that highlights these groups possible. Overlapping groups of instances defy any attempt at accurate clustering, regardless of the value of the threshold. Accordingly, it is the ability to magnify the similarity between same-class instances and the dissimilarity between different-class instances that ultimately contributes to clustering performance. One way to achieve this is by using a term weighting method that best depicts this attribute in the dataset; accordingly, two different weighting methods were evaluated as explained below. Another way is to experiment with dimensionality reduction of the dataset's term–document matrix (t–d matrix), relying on the ability of latent semantic analysis (LSA) to reveal the hidden similarities among the dataset's instances.

In [15], a successive evaluation approach that implements LSA was used to automatically classify documents of a small dataset of 17 documents made up of two classes. The results showed that the difference between the average similarities of same-class documents and the average similarities of different-class documents significantly increased when dimensionality reduction was applied using the optimum dimensionality factor. This suggests a polarizing effect for LSA which can be used to improve clustering results. The use of LSA implies specifying a certain dimensionality factor for the reduction step, and as such the optimum dimensionality factor (l_opt) is defined in this study as the one that results in the highest clustering accuracy. A thorough discussion of LSA is found in [16] and a simple example demonstrating LSA's potential in text analysis is given in Appendix A.

The methodology follows four main steps: 1) collecting the dataset, 2) randomizing and pre-processing, 3) developing the t–d matrix, and 4) clustering and evaluation.

3.1. Collecting the dataset

Seventy-seven project documents related to eight construction claims make up the dataset for evaluating the developed technique. All eight claims originated from one project for the construction of an international airport with a total value of work exceeding $50 million. The majority of the documents are correspondence between the main contractor and the project engineer, detailing the factual events related to each claim. Collected and organized by the contract administrator of the project, the supporting documents for each claim represent a group of semantically similar documents, related together by their association to a specific searchable claim topic. The evaluation aims at quantitatively identifying the performance of the proposed technique in organizing the complete dataset into the correct document groups, or clusters, without implementation of the learning step that characterizes supervised learning techniques. Individual cluster size in the dataset varies from a maximum of 22 documents to a minimum of five, and each document belongs to only one cluster.

3.2. Randomizing and pre-processing

This step includes the tasks of tokenizing, removal of stop-words and frequency calculation. The outcome of this step is to represent the documents in the dataset as vectors of varying sizes corresponding to the features – or terms – in each document and the frequency of occurrence of each feature. The order of the documents in the dataset is randomized from the start to measure the consistency of the clustering outcomes.

3.3. Developing the t–d matrix

The term–document matrix (t–d matrix) is the input required by the clustering algorithm in the final step of the methodology. It is a compilation of all the document vectors into one matrix where the columns represent the documents in the dataset and the rows represent the vocabulary of the dataset. First, the vocabulary is compiled from the document vectors and the frequency of each term across all documents is recorded. Then the t–d matrix is developed based on the randomized document order, in which matrix elements are calculated according to the specified term weighting method. In this study, two popular term weighting methods were evaluated for their effect on clustering performance:

• Term frequency (tf): the elements of the matrix represent the frequency of occurrence of the term – identified by the matrix row – in the specific document – identified by the matrix column.
• Term frequency–inverse document frequency (tf–idf): modifies term frequency based on the assumption that terms occurring frequently across the dataset are poor indicators of clusters. Term frequency–inverse document frequency is calculated according to Eq. (1), where n is the number of documents in the dataset and d is the number of documents containing the specific term being evaluated:

tf–idf = tf × log(n / d).   (1)
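For illustration, the construction of the weighted t–d matrix can be sketched as follows. This is a minimal numpy sketch rather than the authors' implementation; the function and variable names are illustrative, and tokenization and stop-word removal are assumed to have been handled in the pre-processing step:

```python
from collections import Counter

import numpy as np

def build_td_matrix(documents, weighting="tf"):
    """Compile document vectors into a term-document matrix.

    documents: list of token lists (already tokenized and stripped of
    stop-words), in the randomized order. Returns (vocabulary, X) where
    X[i, j] is the weight of term i in document j.
    """
    vocabulary = sorted({term for doc in documents for term in doc})
    index = {term: i for i, term in enumerate(vocabulary)}
    n = len(documents)
    X = np.zeros((len(vocabulary), n))
    for j, doc in enumerate(documents):
        for term, tf in Counter(doc).items():
            X[index[term], j] = tf
    if weighting == "tf-idf":
        # Eq. (1): tf-idf = tf * log(n / d), with d the number of documents
        # containing the term (np.log is the natural logarithm; the choice
        # of base only rescales all weights uniformly).
        d = np.count_nonzero(X, axis=1)
        X *= np.log(n / d)[:, np.newaxis]
    return vocabulary, X
```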
3.4. Clustering and evaluation

To evaluate the effects of the dimensionality factor and the threshold value, single pass clustering is performed on the randomized dataset using varying combinations of both factors. Since the dataset contains 77 documents, the dimensionality factor (l) ranges from a minimum of three dimensions (to minimize computational cost) to a maximum of 77 (l_max is the special case constituting the original t–d matrix, i.e. dimensionality reduction is not applied). The threshold value (h) represents the minimum acceptable similarity limit between a document and a cluster centroid that makes the document a candidate for inclusion in the cluster. Similarity between a cluster centroid (C_c) and a document (d) is calculated by cosine similarity using Eq. (2); similarity theoretically ranges from a minimum of zero (signifying complete dissimilarity) to a maximum of one (signifying complete similarity):

sim(d, C_c) = (d · C_c) / (|d| |C_c|).   (2)

The threshold value was varied over the range [0.05, 0.95] with a step of 0.01. Maximum and minimum values for both factors were set based on experimentation, to minimize unnecessary computational cost without overlooking significant results.

For a certain dimensionality/threshold combination, clustering commences by considering the first document in the reconstructed t–d matrix as the centroid of the first cluster, identifying the closest document to the cluster that satisfies the threshold, recalculating the centroid and repeating the process. When no documents satisfying the condition remain, a new cluster is initiated using the first unclustered document in the dataset as the centroid of the new cluster, and the process is repeated until all documents are either assigned to a cluster or cannot be assigned to any cluster and consequently form a separate single-document cluster. Clustering is illustrated in Fig. 1. A t–d matrix (X̂) developed using a specific weighting method from a randomized dataset undergoes clustering 6825 times, corresponding to all possible dimensionality/threshold combinations, and clustering accuracy is calculated after each to determine the best performance.
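The reduction and similarity computations can be sketched in the same vein (truncated SVD is the standard LSA reduction; X and the dimensionality factor l are as defined above, and the names are illustrative):

```python
import numpy as np

def reduce_td_matrix(X, l):
    """LSA step: keep the l largest singular values of X and return the
    rank-l reconstruction X_hat used as the reduced t-d matrix."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]

def cosine_similarity(d, centroid):
    """Eq. (2): sim(d, C_c) = (d . C_c) / (|d| |C_c|)."""
    norm = np.linalg.norm(d) * np.linalg.norm(centroid)
    return float(d @ centroid / norm) if norm else 0.0
```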
Fig. 1. Flowchart of the single pass clustering algorithm.
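A sketch of the single pass procedure of Fig. 1 follows (the documents are the columns of the reduced matrix X_hat and T is the similarity threshold h; cosine_similarity is as defined earlier — again an illustrative sketch rather than the authors' code):

```python
def single_pass_clustering(X_hat, T):
    """Grow one cluster at a time: repeatedly add the closest unclustered
    document whose similarity to the current centroid satisfies the
    threshold T, recomputing the centroid after each addition."""
    n = X_hat.shape[1]
    unclustered = set(range(n))
    clusters = []
    while unclustered:
        seed = min(unclustered)  # first unclustered document starts a cluster
        members = [seed]
        unclustered.discard(seed)
        centroid = X_hat[:, seed].copy()
        while True:
            # Find the closest remaining document to the current centroid.
            best, best_sim = None, -1.0
            for j in unclustered:
                sim = cosine_similarity(X_hat[:, j], centroid)
                if sim > best_sim:
                    best, best_sim = j, sim
            if best is None or best_sim < T:
                break  # no candidate satisfies the threshold: cluster is final
            members.append(best)
            unclustered.discard(best)
            centroid = X_hat[:, members].mean(axis=1)
        clusters.append(members)
    return clusters
```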
3.5. Clustering measures

Several clustering measures (methods for evaluating clustering outcomes) are presented in [5]. A simple measure is purity, calculated by Eq. (3), where u_i represents a specific cluster from an outcome of i clusters, c_j represents a specific class from a number of j classes, N is the total number of instances in the dataset and count(u_i, c_j) is the number of instances belonging to class c_j in cluster u_i. Purity is the summation across all clusters of the number of instances of the class with the highest representation in each individual cluster, divided by the total number of instances in the dataset:

Purity = (1/N) Σ_i max_j [count(u_i, c_j)].   (3)

Precision, recall and the F-measure can also be applied to clustering by counting pairwise combinations of instances: pairs of same-class instances grouped in the same cluster count as true positives, pairs of different-class instances grouped together as false positives, and same-class pairs split across clusters as false negatives [5]. In the extreme case where all instances in the dataset are grouped in one cluster, 100% recall is achieved (since such a clustering contains all possible pairwise combinations); however, precision greatly deteriorates from an excess of false-positive combinations. At the other end of the spectrum, in the case of an overly fragmented clustering outcome, precision is boosted if the clusters contain same-class instances, but recall is negatively affected as a result of the large number of missed (false-negative) combinations. If the clusters are mainly composed of different-class instances, then both precision and recall values are low. In all these scenarios, the combined F-measure score represents a balanced evaluation of the clustering outcome.
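Purity (Eq. (3)) and the pairwise precision/recall/F-measure used throughout the evaluation can be computed as in the following sketch (pair counting follows the standard formulation in [5]; labels holds the true class of each document, and the names are illustrative):

```python
from collections import Counter
from itertools import combinations

def purity(clusters, labels):
    """Eq. (3): dominant-class counts summed over clusters, divided by N."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(labels[i] for i in c).most_common(1)[0][1]
               for c in clusters) / n

def pairwise_prf(clusters, labels, beta=1.0):
    """Precision, recall and F(beta) over all pairwise combinations:
    TP = same-class pairs in the same cluster, FP = different-class pairs
    in the same cluster, FN = same-class pairs split across clusters."""
    assignment = {i: k for k, c in enumerate(clusters) for i in c}
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = assignment[i] == assignment[j]
        same_class = labels[i] == labels[j]
        tp += same_cluster and same_class
        fp += same_cluster and not same_class
        fn += same_class and not same_cluster
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r) if p + r else 0.0
    return p, r, f
```

Setting beta = 1 gives the balanced F(1) score used in the evaluations below; beta = 0.5 gives the precision-weighted F(0.5) score used later when comparing factor regions.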
3.6. Evaluation tool
Fig. 3. F-measure scores for l_max and l_opt—average over ten trial runs (tf and tf–idf panels; F-measure vs. threshold value (h)).
The number of possible outcomes for grouping n instances into K clusters is a Stirling number of the second kind, S(n, K) [5], calculated using Eq. (4). The number of possible outcomes when the number of clusters is unknown is therefore the summation of the Stirling number over all possible values of K, Σ_{K=1}^{n} S(n, K), where K ranges from one (in case all instances are grouped into one group) to n (in case each instance is grouped alone in a single-instance cluster):

S(n, K) = (1/K!) Σ_{j=0}^{K} (−1)^j C(K, j) (K − j)^n.   (4)

For the current dataset, the total number of possible outcomes is extremely large, making the probability of a randomly correct clustering practically null. Even if the problem is simplified by assuming that the correct number of classes is known, the number of possible outcomes for organizing 77 objects into eight groups is 8.6 × 10^64, which is still very large. A random-assignment baseline accordingly defies the purpose of using a baseline. The baseline adopted for this task is the clustering result achieved using l_max, i.e. without any dimensionality reduction. A comparison between the clustering performance and this baseline highlights the improvement in performance resulting from applying LSA to single pass clustering. If results consistently fall below the adopted baseline, that does not necessarily indicate that they are meaningless, but that the proposed procedure does not offer a positive contribution to clustering.
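As a check on these magnitudes, Eq. (4) can be evaluated exactly with integer arithmetic (a sketch; the printed value should reproduce the 8.6 × 10^64 figure quoted above):

```python
from math import comb, factorial

def stirling2(n, k):
    """Eq. (4): S(n, k) = (1/k!) * sum_{j=0}^{k} (-1)^j C(k, j) (k - j)^n."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n
               for j in range(k + 1)) // factorial(k)

print(f"{stirling2(77, 8):.2e}")  # ~8.6e+64 outcomes for 8 known classes

# Unknown number of clusters: the Stirling numbers summed over all K.
total = sum(stirling2(77, k) for k in range(1, 78))
```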
Single pass clustering is prone to inconsistent outcomes depending on the order of documents used in the clustering step. To ensure a representative value for clustering performance, the document order was randomized using different seed values and the clustering performance was evaluated for the different document sequences. Over ten trial runs, the highest average F-measure score achieved using the tf weighting method was 0.68 at an optimum dimensionality factor of 13 and a threshold of 0.69, while the highest average F-measure score achieved using the tf–idf weighting method was 0.75 at l_opt = 56 and h = 0.24. Fig. 3 presents the variation of average F-measure scores across all threshold values for two specific dimensionality factors: l_max (the baseline condition) and l_opt (the dimensionality factor with the highest average F-measure score for the respective weighting method). For both weighting methods, the baseline's performance is better at the small threshold values but gradually declines after the peak and is eventually surpassed by the optimum's performance. For the tf weighting method, this shift occurs midway through the range of threshold values, while for tf–idf it occurs at the low threshold value of 0.2.

For the tf method, the optimum dimensionality factor records an average improvement over the baseline of 7.6%, and an 11.5% improvement in peak performance. For the tf–idf method, the average improvement is 7%, with a 5% improvement in peak performance. These results highlight the importance of identifying the optimum dimensionality factor in order to utilize LSA for improving clustering performance.

Fig. 4 presents intensity grids of the average F-measure scores for ten trials using both weighting methods. As can be seen from the figure, a stretch of high F-measure values (indicated by the dashed lines) is observed spanning from high l/low h values to low l/high h values (lower left corner of the grids to upper right corner). In the case of the tf weighting method, this high-performance front extends to the mid-threshold region, while for tf–idf it spans across the limits of both factors. These results reveal the inverse relationship between dimensionality factor and threshold values. At high dimensionality levels (where little or no reduction is applied), high clustering performance requires relatively relaxed threshold values. Reducing the number of dimensions results in improved class separation, allowing the use of a stricter similarity definition (i.e. higher threshold values), which attests to the contribution of dimensionality reduction in polarizing same-class instances in the dataset.

The regions of high F-measure scores were more prominent with the tf–idf weighting method, suggesting the superiority of this method over the tf weighting method. This observation is attributed to the method's more accurate identification of relative term weights, which better reflect similarities and therefore result in improved clustering performance. No specific region had an average F-measure score higher than 0.7 using the tf weighting method. For the tf–idf weighting method, two prominent regions of highest F-measure scores are apparent: one in the fifties range of dimensionality factors around a threshold of 0.25, and the other within the threshold range of [0.60, 0.70] at a dimensionality factor of 15. Table 1 identifies the maximum F-measure results achieved, and the corresponding l and h factors, for multiple trial runs of the proposed technique. While the absolute maximum of each individual trial run varies, in general the F-measure scores for the regions mentioned above were consistently high over the trial runs.

For the tf weighting method, while the combination of a dimensionality factor of 13 with a threshold of 0.69 was prevalent in most trials, the values of the F-measure score for such combinations varied significantly from a minimum of 0.69 to a maximum of 0.83. This suggests inconsistencies in the clustering results. Such inconsistencies are not apparent in the top prevalent factor combinations of the tf–idf weighting method. The most common combination is a dimensionality factor in the range [54, 57] with a threshold of 0.24, for which the F-measure scores were approximately 0.78. Another common (l, h) combination is (15, 0.69), which achieved a constant F-measure score close to 0.75.

In order to accurately check the consistency of the clustering results, the actual clusters created by the different trial runs were examined. Fig. 5 displays the clusters for the highest and lowest F-measure scores achieved in the trial runs using the tf weighting method. Noting that the true number of classes is eight and the smallest class contains five documents, the resulting clusters are considered fragmented. Discrepancies are observed between the two cases of the tf weighting method in terms of the number and composition of clusters. In addition, cluster impurity is evident, not only for the low F-measure case (clusters 1, 4, 5 and 10) but also for the high-score case (cluster 1). Fig. 6 displays the clusters for the highest and lowest F-measure scores achieved in the trial runs using the tf–idf weighting method. Both results are highly fragmented, with a different number of clusters in each case. However, although not completely identical, cluster composition is similar, impurity is limited and the clusters make a good representation of the true classes.
Table 1
Results of various clustering trials (columns, per weighting method: Dim factor (l), Thresh (h), Precision, Recall, F-measure).

Fig. 5. Clusters for the highest (F-measure = 0.83; (l, h) = (13, 0.69)) and lowest F-measure outcomes of the trial runs using the tf weighting method.

Fig. 6. Clusters for the highest (F-measure = 0.84; (l, h) = (69, 0.20)) and lowest (F-measure = 0.74; (l, h) = (15, 0.69)) F-measure outcomes of the trial runs using the tf–idf weighting method.

Fig. 7. Two clustering outcomes similar in F-measure scores and varying in precision and recall (tf: P = 0.744, R = 0.735, F-measure = 0.740; tf–idf: P = 0.905, R = 0.627, F-measure = 0.741).
Fig. 9. Comparison of F-measure results across trial runs (Baseline–Original, Baseline–Refined, Optimum–Original, Optimum–Refined).

Table 3
Matrix of average precision scores for baseline and optimum cases, before and after refinement (columns: Category, Baseline, Optimum, Difference).
Examination of the precision and recall values behind the F-measure results in Table 1 offers an explanation for this observation. Average precision over all trial runs was higher for the tf–idf method, while average recall was higher for the tf method. For the tf method, values of precision and recall for each separate trial run were comparatively close, while values of precision were significantly higher than recall for the tf–idf method. This discrepancy between the two methods explains the clustering outcomes illustrated in the above figures. As discussed above, a small number of clusters in an outcome produces a high recall result, but also increases cluster impurity, especially if the number of resulting clusters is less than the number of true classes. Conversely, a high precision value is generated if the outcome contains a large number of clusters, provided that the clusters are made up of same-class instances (i.e. the case of class fragmentation). These results suggest that the optimum outcome of a trial run using the tf–idf method tends to have a relatively high precision result and a moderately high recall result.

5. Clustering using a hybrid approach

Fig. 7 displays two different clustering outcomes with an almost identical F-measure score, one for each weighting method. The general characteristics of fragmentation and impurity discussed in the previous section apply to both cases. If the small clusters – the group of outliers – in the tf–idf outcome are ignored, the remaining large clusters with minimal impurity can still make an acceptable representation of every true class in the dataset. For example, class A is represented by cluster 10, class B by cluster 7, class C by cluster 4, etc. The same cannot be said for the tf outcome, due to the high impurity of cluster 5 (a combination of classes D and G) and cluster 6 (composed mainly of classes C and E, with a couple of instances from other classes). The tf–idf clustering outcome can therefore be reformulated as a classification problem, by splitting the outcome into a test set made up of the outlier cases and a training set consisting of the remaining clusters. Refinement of a high-precision, average-recall clustering outcome is possible through a secondary classification step in which each outlier is classified to one of the large clusters. This hybrid approach therefore combines an unsupervised learning method (single pass clustering) with a supervised learning method (text classification) with the objective of improving clustering performance by reducing fragmentation.

Fig. 8 outlines the process used for refining cluster outcomes and evaluating the technique. The process is preceded by performing single pass clustering on the dataset and defining a specific outcome which the process aims at improving. The first step in the refinement process is to define the training and testing sets for the classifier. Outlier instances are defined based on a minimum cluster size (s_min). Members of any cluster in the original outcome that fails to satisfy the minimum are considered outliers and included in the test set. Accordingly, the larger the minimum limit, the smaller the number of clusters in the final outcome. Selecting the minimum cluster size is judgmental, based primarily on knowledge of the dataset and whether or not large clusters are expected. Outliers are extracted, and the remaining clusters form the training set and are considered the classes used for classification. The more these clusters correlate with the true classes in the dataset (i.e. the lower their impurity and the better they represent each of the true classes), the better the chances of an improved clustering outcome after refinement.

Having identified both sets, each individual test document is added to the training set in order to be classified to one of the clusters. With each addition, the t–d matrix is developed and then reduced to a dimensionality level that exposes similarities between documents in the set to facilitate the classification step. Based on the results of the evaluation of different text classifiers at varying dimensionality factors in [17], a Rocchio classifier was implemented for the hybrid approach using a dimensionality level of approximately 67% of the available dimensions. Finally, each outlier is classified and grouped with the closest cluster, and the refined outcome is evaluated using the F-measure to enable comparison between the original clustering outcome and the refined outcome. The outcomes from the trial runs previously performed were used to evaluate the refinement process to obtain a representative estimate of the process's effect on clustering.

Fig. 9 displays the results of evaluating the hybrid clustering approach using a minimum cluster size (s_min) of four. The same baseline as before was used after considering the optimum threshold for each case (i.e. the highest result achieved using l_max across the range of threshold values versus the highest result achieved using l_opt). Table 2 is a matrix of the average F-measure scores of the trial runs for the different combinations of original/refined and baseline/optimum. Table 3 and Table 4 are the equivalent matrices for precision and recall.

In general, the optimum cases displayed better precision and F-measure scores than the baseline cases. This indicates LSA's contribution to improved clustering results, but also highlights the importance of identifying the appropriate dimensionality factor in order to achieve such improvements. The optimum cases demonstrated a slight deterioration in recall from the baseline, but not significant enough to prevent an improvement in F-measure scores, due to a high increase in the optimum's precision.
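A sketch of the refinement step under these assumptions follows. For brevity, the t–d matrix is reduced once rather than rebuilt with each added test document, and the Rocchio classification reduces to assigning each outlier to the nearest cluster centroid in the reduced space; the 67% dimensionality level and s_min follow the text, everything else is illustrative:

```python
import numpy as np

def refine_outcome(X, clusters, s_min=4):
    """Split an outcome into core clusters (size >= s_min) and outliers,
    then classify each outlier to the nearest core-cluster centroid
    (Rocchio-style) in an LSA-reduced space."""
    core = [list(c) for c in clusters if len(c) >= s_min]
    outliers = [i for c in clusters if len(c) < s_min for i in c]
    # Reduce to roughly 67% of the available dimensions, following [17].
    X_hat = reduce_td_matrix(X, max(1, int(0.67 * X.shape[1])))
    for i in outliers:
        centroids = [X_hat[:, c].mean(axis=1) for c in core]
        sims = [cosine_similarity(X_hat[:, i], m) for m in centroids]
        core[int(np.argmax(sims))].append(i)
    return core
```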
Table 2
Matrix of average F-measure scores for baseline and optimum cases, before and after refinement.

Table 4
Matrix of average recall scores for baseline and optimum cases, before and after refinement.

Fig. 10. Example of the clustering refinement process for a sample outcome ((l, h) = (57, 0.24)): the original clusters are split into a training set of core clusters and a test set of outliers (s_min = 4), and the refined outcome achieves F-measure = 0.893 (P = 0.879, R = 0.907).
A surge in recall and a drop in precision are observed between the original and refined states. The increase in recall is attributed to a reduction in the total number of clusters in the final outcome as a result of the classification of the outliers, which is expected since the objective of the refinement process is to improve recall by reducing fragmentation. The decrease in precision is a result of impurity, not only from the misclassification of the outliers, but also from the original pre-refined clusters. This tradeoff between the change in precision and recall before and after refinement resulted in an increase in F-measure scores for both the baseline and optimum cases of 1.9% and 6.2%, respectively. Overall, all three metrics experienced an increase from the original-baseline averages to the refined-optimum averages of 4.5%, 19.7% and 12.8% for precision, recall and F-measure, respectively.

A closer look at an actual refined outcome gives substance to the above results. Fig. 10 illustrates the clustering refinement process for a sample outcome. Only one cluster in the original outcome (Cluster 3) contains a misplaced document, which explains the very high precision value. The outcome is also highly fragmented, which explains the medium recall value: fragmentation increases the number of false-negative pairwise relationships and consequently reduces recall. Separation of the outliers using a minimum cluster size of four results in eight remaining clusters, used as the training set for the classification step, that are very low in impurity and make a good representation of the true classes of the dataset. Four of the 13 documents in the test set were classified incorrectly, reducing the precision value of the refined outcome. However, accurate classification of the majority of the outliers resulted in a large gain in recall (due to the decrease in the number of false-negative pairwise relationships), ultimately causing a 10.7% increase in the F-measure score.

The success of the proposed hybrid clustering technique therefore relies on:

• the generation of low-impurity clusters that match the true classes in the dataset and consequently ensure easier classification of the outlier documents in the refinement step; and
• an appropriate division between clusters and outliers (training set and test set) for the classification step.

Precision of the clustering outcome after the initial step of single pass clustering is a good indicator of the degree of impurity of the generated clusters; the higher the precision, the less impure. However, while precision is more important, completely neglecting recall will single out for the classification step an extreme result of a completely fragmented outcome that has a perfect precision value but is composed of a large number of single-instance or small clusters—a result that is unsuitable for the classification step. A moderate recall value is required to strike the necessary balance between low impurity and fragmentation.

As demonstrated above, the tf–idf method's optimum clustering outcomes produced fairly consistent results over many trial runs and were generally characterized by high precision values and moderately high recall values (a mean of 0.93 for precision with a standard deviation of 0.04, and a mean of 0.68 for recall with a standard deviation of 0.05). 'Good clusters' result from a good choice of factors for the single pass clustering step – dimensionality and threshold – that consistently generate high clustering performance. Fig. 4 above identifies two main regions of high F-measure scores for the tf–idf weighting method based on the average scores of multiple trial runs: a low-threshold/high-dimensionality region, and a high-threshold/low-dimensionality region. So far, the F-measure used in all calculations is the F(1) score, a balanced F-measure that gives equal weights to precision and recall (calculated using F-measure = (β² + 1)PR / (β²P + R) and setting β = 1).

Fig. 11. Intensity grid of average F(0.5) scores of ten trial runs.
Fig. 12. Consistency of maximum performance factor combinations over 100 trial runs (region 1: dimensionality factor = 16, threshold value = 0.67; region 2: dimensionality factor = 56, threshold value = 0.24).
Using an unbalanced F-measure, which gives a small advantage to precision over recall, gives a better picture of the range of factor values that are more likely to produce good clusters. Fig. 11 presents the intensity grid of the average F(0.5) score for the same trial runs. A prominent area of high clustering performance appears within the [10, 20] dimensionality range and the [0.7, 0.85] threshold range (region 1). The highest average F(0.5) was 0.85 at a dimensionality factor of 16 and a threshold of 0.67. A smaller region of high performance appears within the [0.20, 0.25] threshold range and the fifties dimensionality range (region 2). The highest score achieved in this region was 0.84 at a threshold of 0.24 and dimensionality factors of 56 and 57.

To test the consistency of results, the factor combination with the maximum result in each region was tested for 100 trials after randomizing the sequence of documents. Fig. 12 displays the variation of the evaluation metrics across the trial runs. The average F(0.5) score for both regions across the 100 trials was the same (0.85); however, the results from region 1 were more consistent. The standard deviation for all three evaluation metrics in region 1 was 0.01, while the standard deviations for precision, recall and F(0.5) in region 2 were 0.07, 0.01 and 0.03, respectively. Moreover, whereas 37 trial runs resulted in a precision value less than 0.9 for the region 2 factors, the lowest precision value for a trial run using the region 1 factors was 0.91.

A look at the actual clusters formed by the factors of each region gives a good indication of the consistency of results. Over the 100 trial runs, region 1 factors generated three unique outcomes, illustrated in Fig. 13.
Fig. 14. Change in average clustering performance (percentage change in precision, recall and F-measure) between original and refined outcomes at minimum cluster sizes s_min = 2 to 5.
Outcome R1A occurred 76 times, while outcomes R1B and R1C occurred 16 and 8 times, respectively. The three outcomes are identical in the number of clusters formed and the composition of each cluster, except for a disagreement over the clusters for documents F2 and E6. On the other hand, while the highest F(0.5) score in all 100 trials for both regions was based on an outcome using region 2 factors, such factors generated 10 unique outcomes ranging in F(0.5) scores from a minimum of 0.62 to a maximum of 0.89.

While factor combinations from region 2 have the potential of producing outcomes with higher F-measure scores, the results vary depending on the order of the documents used during single pass clustering. This can be attributed to the low threshold value and high dimensionality factor of region 2. With high dimensionality, optimal separation of similar instances is not achieved, and a lower threshold is required to achieve good clustering performance. Under these conditions, the number of candidates that satisfy the similarity limit for a forming cluster increases, thereby increasing the competition between clusters over the instances. Since clusters are formed one at a time based on the order of the instances, an early-forming cluster develops with a larger pool of candidate instances and is thus given priority over a late-forming cluster. The outcome is therefore susceptible to such order. Conversely, for region 1, the polarizing effect of a low dimensionality factor results in effective separation of same-class instances, thus allowing the use of a relatively high threshold value. With limited competition between clusters over the documents as a result of the stricter similarity threshold, the outcomes of clustering are fairly consistent regardless of the sequence of documents used in the process.

5.2. Choice of minimum cluster size

The choice of the minimum cluster size has an impact on the refined outcome's final F-measure score. In Fig. 10, if three is used as the minimum cluster size, then cluster 13 would be included in the training set, thereby splitting class A in the refined outcome. In this case the increase in recall would not be the same as for the case of using a minimum cluster size of four, and accordingly a lower final F-measure would be expected. Evaluating with a minimum cluster size of three for the above example yielded P = 0.86, R = 0.778 and F-measure = 0.817; only a 3.1% improvement over the original clustering outcome. To measure the effect of the minimum cluster size on the refined outcome, the above evaluation was repeated for different values of s_min, ranging between five (the size of the smallest class in the dataset) and two (where only single-instance clusters in the original outcomes are defined as outliers). Fig. 14 illustrates the average difference in precision, recall and F-measure scores between the original and refined outcomes at different minimum cluster size values. For all values of s_min, the gain in recall after the refinement process overcomes the loss in precision, resulting in a positive increase in F-measure scores, except at a minimum cluster size of five, for which a loss in performance occurs after refinement. The highest gain in F-measure scores was at a minimum cluster size of four.

A comparison of the final refined outcomes at both extremes of the s_min range reveals the consequences of selecting a specific minimum cluster size. Fig. 15 displays two outcomes of the same trial run based on minimum cluster sizes of five and two. At the high end, the number of clusters used for classification tends to be low compared to the number of true classes in the dataset and the number of outliers tends to be high. Since whole classes are missing from the training set, classification accuracy is expected to be very low. Accordingly, even if such clusters initially have low impurity, classification quickly erodes this advantage, and the gain in recall is not sufficient to make any positive impact on the final F-measure score. This case is impractical for information retrieval purposes, as the composition of the resulting clusters is too diverse to allow any reasonable assessment of the clusters' content.
Fig. 15. Refined outcomes of the same trial run using s_min = 2 (F-measure = 0.855; P = 0.953; R = 0.776) and s_min = 5 (F-measure = 0.700; P = 0.595; R = 0.851).
The other end of the spectrum (s_min = 2) is the case where outliers are only single-instance clusters. For this case, the number of clusters in the training set tends to be high in comparison with the number of true classes in the dataset, and the number of test documents tends to be low. Due to fragmentation of the original outcome, classes may be represented by more than one cluster in the training set. As such, there is a better chance of grouping outliers with similar-class documents, thereby increasing recall and limiting any decrease in precision. However, with multiple classes split over more than one cluster, the final result is still highly fragmented. This case could be considered a very conservative clustering approach and, for practical purposes, can be used as an initial step for simplifying a large dataset into smaller groups of very similar documents.

6. Summary and conclusion

When the project document corpus is complete and appropriately organized (e.g. for previously completed projects), the use of text classifiers for document retrieval is suitable. However, in many cases the document corpus is gradually and continuously developing (such as the case of an ongoing project) and the classes required for training in a supervised learning method are not readily available. Particularly when classes are not predetermined and do not cover the whole spectrum of possible categories, the application of text classification is not straightforward. In this study, an unsupervised learning method was adapted and evaluated for the task of clustering documents based on textual similarity into sets of documents that are semantically related. The single pass clustering algorithm was adopted instead of the popular K-means clustering algorithm to avoid the requirement for a predetermined, user-defined cardinality (number of resulting clusters) associated with the latter. However, single pass clustering requires definition of a minimum threshold similarity measure that indicates during the clustering process whether a specific instance belongs to a specific cluster. In addition, single pass clustering is prone to variable clustering outcomes depending on the sequence of the instances used in the clustering process. Single pass clustering was performed on the same dataset under varying threshold values and dimensionality factors to evaluate its ability to identify the correct clusters within the dataset. Results indicate the inverse relationship between threshold and dimensionality: low dimensionality factors require high threshold values to achieve good clustering results, and vice versa. For the current dataset, a low dimensionality factor and a high threshold value demonstrated the best performance in terms of precision and consistency, resulting in an average F-measure score of 0.782, a 6.6% increase over the baseline. To boost recall, single pass clustering was followed by a cluster refinement step in which the resulting clusters were used to train a text classifier for classifying outliers. The average F-measure score after refinement was 0.844, a 6.2% improvement over the unrefined result (a 12.8% improvement over the original baseline). The results were based on repeated trials of different randomizations of the dataset in order to obtain representative values of the performance. In general, it can be concluded that the resulting clusters improve document retrieval: a search of the organized corpus returns not only the documents that satisfy the search criteria (keywords and/or metadata), but also other related documents in the cluster, even if their similarity with the keywords is low or if they do not satisfy the metadata criteria [5]. This ensures high recall and guarantees access to the relevant information in the documents. While the proposed approach overcomes the all-inclusive class limitation of text classifiers, the assumption of mutually exclusive clusters remains a limitation of the approach. In practice, project documents may belong to discourses of multiple knowledge topics, and assigning a document to one and only one may cause knowledge gaps in others. Theoretically, the technique may be modified to adopt an any-of approach instead of the current one-of approach; however, evaluation will require a different dataset, since all classes in the current dataset are mutually exclusive.

Another limitation is dictated by the size of the dataset used for evaluating the proposed methodology. The impact of the size of the dataset on the results is arguable. On one hand, a large dataset produces a large t–d matrix, which complicates matrix operations and increases computational cost. A large dataset also increases the chance of noisy data, which adversely affects the performance of the text analysis techniques. On the other hand, a small dataset, while easier to manipulate, offers a smaller feature set. Scarcity of features – the evidence used to perform the required text analysis task – can undermine the performance of the evaluated classification or clustering technique. Accordingly, caution should be exercised in extrapolating the results to other datasets. The dataset and the resulting vocabulary are relatively small, making any generalization of the results unjustifiable absent further experimentation on other datasets.

Appendix A

The following example illustrates the application of LSA. The sample is made up of 12 documents organized into two classes, classes D and G. Only the documents' subject headers are used in the analysis (as opposed to the full document body) in order to limit the size of the t–d matrix. The original t–d matrix – based on term frequency – and the reduced t–d matrix – after applying a dimensionality factor of 4 – are presented in Tables A.1 and A.2, respectively. On the documents' side, the average pairwise similarities between document vectors of classes D and G increase after applying LSA from 0.28 and 0.44 to 0.39 and 0.72, respectively. On the terms' side, the similarity between the vectors of the terms 'remobilization' and 'relocation' – which were used interchangeably – increased from 0 to 0.95 after dimensionality reduction. Similarly, the similarity between the terms 'fence' and 'gate' increased from 0.38 to 0.87.

Table A.1
Original t–d matrix for example.

           D1  D2  D3  D4  D5  G1  G2  G3  G4  G5  G6  G7
Adjacent    0   0   0   0   0   0   0   1   0   0   0   0
Airport     0   0   0   0   0   0   0   1   0   0   0   0
Approval    0   0   0   0   0   0   1   0   0   0   0   0
…
Table A.2
Reduced t–d matrix for example.
D1 D2 D3 D4 D5 G1 G2 G3 G4 G5 G6 G7
Adjacent 0.044 0.000 –0.006 0.000 –0.006 0.028 0.110 0.980 0.028 –0.042 –0.032 –0.032
Airport 0.044 0.000 –0.006 0.000 –0.006 0.028 0.110 0.980 0.028 –0.042 –0.032 –0.032
Approval –0.067 0.000 –0.105 0.000 –0.105 0.086 0.127 0.110 0.086 0.171 0.119 0.119
Area 0.000 1.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Continuation –0.147 0.000 –0.056 0.000 –0.056 0.493 0.409 –0.107 0.493 0.661 0.625 0.625
East –0.107 0.000 –0.145 0.000 –0.145 0.134 0.171 –0.042 0.134 0.270 0.196 0.196
Extension 0.028 0.000 0.196 0.000 0.196 0.326 0.173 0.055 0.326 0.267 0.359 0.359
Fence –0.143 0.000 0.028 0.000 0.028 0.933 0.819 1.038 0.933 1.057 1.071 1.071
Gate –0.067 0.000 –0.105 0.000 –0.105 0.086 0.127 0.110 0.086 0.171 0.119 0.119
Mobilization 0.000 1.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
North –0.107 0.000 –0.145 0.000 –0.145 0.134 0.171 –0.042 0.134 0.270 0.196 0.196
Office 0.440 0.000 0.954 0.000 0.954 0.210 –0.278 0.031 0.210 –0.397 0.069 0.069
Old 0.044 0.000 –0.006 0.000 –0.006 0.028 0.110 0.980 0.028 –0.042 –0.032 –0.032
Re–mobilization 0.089 0.000 0.176 0.000 0.176 0.014 –0.067 0.044 0.014 –0.107 –0.020 –0.020
Relocation 0.351 0.000 0.779 0.000 0.779 0.196 –0.211 –0.013 0.196 –0.290 0.089 0.089
Site 0.339 0.000 1.063 0.000 1.063 0.881 0.200 –0.022 0.881 0.368 0.877 0.877
Stop 0.044 0.000 –0.006 0.000 –0.006 0.028 0.110 0.980 0.028 –0.042 –0.032 –0.032
Temporary –0.067 0.000 –0.105 0.000 –0.105 0.086 0.127 0.110 0.086 0.171 0.119 0.119
Work 0.044 0.000 –0.006 0.000 –0.006 0.028 0.110 0.980 0.028 –0.042 –0.032 –0.032
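The appendix computations can be reproduced along the following lines (a sketch assuming the full Table A.1 matrix is available as X with its row labels in terms — both illustrative names; the dimensionality factor of 4 and the term pairs follow the text, and reduce_td_matrix is as defined earlier):

```python
import numpy as np

# X: the 19 x 12 term-frequency matrix of Table A.1; terms: row labels.
X_hat = reduce_td_matrix(X, 4)  # rank-4 LSA approximation (Table A.2)

def term_similarity(M, terms, t1, t2):
    """Cosine similarity between two term (row) vectors of M."""
    a, b = M[terms.index(t1)], M[terms.index(t2)]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(term_similarity(X, terms, "Fence", "Gate"))      # ~0.38 before LSA
print(term_similarity(X_hat, terms, "Fence", "Gate"))  # ~0.87 after LSA
```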
References

[1] M. Al Qady, A. Kandil, Document management in construction—practices and opinions, Journal of Construction Engineering and Management 139 (10) (2013) 06013002-1–06013002-7.
[2] C.H. Caldas, L. Soibelman, J. Han, Automated classification of construction project documents, Journal of Computing in Civil Engineering 16 (4) (2002) 234–243.
[3] C.H. Caldas, L. Soibelman, Automating hierarchical document classification for construction management information systems, Automation in Construction 12 (4) (2003) 395–406.
[4] W.B. Frakes, R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
[5] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, New York, 2008.
[6] S. Saitta, P. Kripakaran, B. Raphael, I.F. Smith, Improving system identification using clustering, Journal of Computing in Civil Engineering 22 (5) (2008) 292–302.
[7] T. Cheng, J. Teizer, Modeling tower crane operator visibility to minimize the risk of limited situational awareness, Journal of Computing in Civil Engineering (Dec. 14 2012), http://dx.doi.org/10.1061/(ASCE)CP.1943-5487.0000282 (Epub).
[8] H.S. Ng, A. Toukourou, L. Soibelman, Knowledge discovery in a facility condition assessment database using text clustering, Journal of Infrastructure Systems 12 (1) (2006) 50–59.
[9] O. Raz, R. Buchheit, M. Shaw, P. Koopman, C. Faloutsos, Detecting semantic anomalies in truck weigh-in-motion traffic data using data mining, Journal of Computing in Civil Engineering 18 (4) (2004) 291–300.
[10] W. Guo, L. Soibelman, J.H. Garrett Jr., Visual pattern recognition supporting defect reporting and condition assessment of wastewater collection systems, Journal of Computing in Civil Engineering 23 (3) (2009) 160–169.
[11] S. Lee, L. Chang, Digital image processing methods for assessing bridge painting rust defects and their limitations, Proc. of the International Conference on Computing in Civil Engineering, American Society of Civil Engineers, Cancun, Mexico, 2005.
[12] I. Brilakis, L. Soibelman, Y. Shinagawa, Material-based construction site image retrieval, Journal of Computing in Civil Engineering 19 (4) (2005) 341–355.
[13] J. Gong, C.H. Caldas, Learning and classifying motions of construction workers and equipment using bag of video feature words and Bayesian learning methods, Proc. of the International Workshop on Computing in Civil Engineering, American Society of Civil Engineers, Miami, Florida, United States, 2011.
[14] V. Escorcia, M. Dávila, M. Golparvar-Fard, J. Niebles, Automated vision-based recognition of construction worker actions for building interior construction operations using RGBD cameras, Proc. of the Construction Research Congress 2012, American Society of Civil Engineers, West Lafayette, Indiana, United States, 2012.
[15] M. Al Qady, A. Kandil, Automatic document classification using a successively evolving dataset, Proc. of the 2011 3rd International/9th Construction Specialty Conference, Curran Associates, Inc., Ottawa, Ontario, Canada, 2011.
[16] T.K. Landauer, P.W. Foltz, D. Laham, Introduction to latent semantic analysis, Discourse Processes 25 (2&3) (1998) 259–284.
[17] M. Al Qady, A. Kandil, Automatic classification of project documents based on text content, Journal of Computing in Civil Engineering (June 20 2013), http://dx.doi.org/10.1061/(ASCE)CP.1943-5487.0000338 (Epub).