Text Mining Applied To Patent Mapping: A Practical Business Case
www.elsevier.com/locate/worpatin
Tetra Pak Carton Ambient SpA, via Delfini 1, 41100 Modena, Italy
CINECA, via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna), Italy
Abstract
Professional patent searchers are traditionally rather suspicious of the alleged "black box" effect inherently attached to intelligent
software engines relying upon linguistic technologies for patent analysis and mapping. In this article, the authors propose that such
prejudices can be overcome by setting a realistic business objective while experimenting with these new linguistic tools, as well as by
applying serious methodology for validating the results of the analysis. The strengths and weaknesses of a particular text mining tool
are assessed with reference to a practical business case in the field of packaging technology, and a comparison of the outcome of such
an analysis with a traditional one, carried out using conventional patent classifications, is also described.
© 2003 Elsevier Ltd. All rights reserved.
Keywords: Text mining; Data mining; Patent mapping; Patent analysis; Clustering techniques; Competitive intelligence; Intellectually assigned patent
classifications; Results validation; Linguistic technology; Packaging technology
1. Introduction
Various data and text mining tools applied to patent
analysis have been around for quite a while now [1,2].
Nevertheless, as pointed out by Krier and Zaccà [3],
within the professional patent information community
there is still a high degree of scepticism as regards the
use of these new linguistic technologies. At least in part,
this is due to the relative "black box" effect¹ inherently
attached to the nature of the said technology.
Not surprisingly then, professional patent searchers
are rather suspicious of tools that do not generally grant
the user complete control over their inner workings.
Indeed, an aim of this article is to try to shed some
light on the degree of control that a user can expect to
experience when carrying out some kinds of patent
analysis with the help of text mining techniques.
The adopted point of view is that of a patent information professional, who is not necessarily an expert in
linguistic algorithms. The idea was to extrapolate, if not
a precise set of rules, at least some useful guidelines that
² In the course of this article we shall refer indiscriminately to
PackMOLE™ as both the project and the tool.
into clusters that are dynamically generated by the algorithm, depending on a number of criteria which are
not predetermined on the basis of a user-provided taxonomy.
This particular process is called clustering, as opposed
to categorisation techniques that are dependent upon
some kinds of predictive model and often also need
adequate training [2].
Categorisation techniques are being investigated by
various patent offices for implementing systems for the
categorisation and classification of patent documents,
and have already been discussed in recent issues of this
journal [3–5].
Rather than just classifying documents, clustering
techniques can yield valuable insight into the relationships existing between different categories (or clusters) of
documents. A clustering approach to text mining is therefore
considered more effective in a business environment,
especially where patent information is regarded not only
as a support for legal issues, but also as an important
player within the competitive intelligence function.
In this context, the original assumption that formed
the basis of the PackMOLE™ project was that text
mining or, more precisely, the particular type of text
mining called clustering, could not properly be considered a search technique as far as traditional patent
searching was concerned. Rather, text mining (or clustering) should allow for the extraction of information
regarding patenting trends in a more efficient way when
compared to the capabilities of the standard Boolean
tools which, on the other hand, allow for more precise
retrieval of patent documents and other bibliographic
information regarding patents (such as legal status and
patent families).
The outcome of a text mining (or clustering) session
primarily consists of a set of clusters, i.e. groups of
documents that show a certain amount of similarity,
according to a threshold value.
Each cluster is labelled with one or more keywords
deemed to be representative of its content, so that it is
possible to form a preliminary idea of the said contents
without actually reading all the documents.
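To make the idea of threshold-based grouping and keyword labelling more concrete, the following is a minimal sketch in Python. It is not the PackMOLE™ algorithm, whose internals are not described here. It assumes a plain bag-of-words representation, cosine similarity, a greedy single-pass assignment and a hypothetical threshold of 0.5. The stop-word list, the sample abstracts and all function names are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not a linguistic resource.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}

def vectorise(text):
    """Bag-of-words term counts, lower-cased, stop words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, threshold):
    """Put each document into the most similar existing cluster whose
    similarity reaches the threshold; otherwise start a new cluster."""
    clusters = []
    for index, text in enumerate(docs):
        vec = vectorise(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["terms"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"terms": Counter(vec), "members": [index]})
        else:
            best["terms"].update(vec)  # accumulate the cluster's term counts
            best["members"].append(index)
    return clusters

def label(c, n=2):
    """Label a cluster with its n most frequent terms."""
    return [term for term, _ in c["terms"].most_common(n)]

# Hypothetical abstracts, invented for illustration only.
docs = [
    "carton package with folded sealing flap",
    "sealing flap for a folded carton package",
    "filling machine nozzle for liquid food",
    "nozzle of a filling machine for liquid products",
]
clusters = cluster(docs, threshold=0.5)
for c in clusters:
    print(label(c), c["members"])
```

With these four toy abstracts, the two carton documents fall into one cluster and the two filling-machine documents into another. Raising the threshold above 0.8 would split the second pair, while lowering it to 0 would merge everything into one cluster, which illustrates why the threshold has to be calibrated for each dataset.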
Of course, it is also possible to browse through the
documents contained in each cluster to review the
quality of the clustering process, as well as to display a
graphical representation of the said clusters. We call this
graphical representation the bubble map (or patent
map, since all the documents involved in our tests were
patent documents), with the clusters being represented
as bubbles [6].
If the mining algorithm finds that there is a certain
degree of similarity between different clusters, those
clusters are linked. In the patent map, such links are
displayed as coloured lines connecting two (or more)
bubbles. The colour of the link lines varies according to
the relative strength of the link. In fact, the presence of
3. Methodology
3.1. The tool
With the PackMOLE™ prototype, the user has the
option of customising a number of different parameters
that govern the clustering process:
the
the
the
the
Table 1
Statistical indexes
Index
Description
Frequency
Characteristic ratio
Global frequency
Global ratio
Discriminant ratio
⁴ The alpha parameter can assume continuous values within the
[0, 1] interval, while the weighting system parameter can only assume
three discrete values: large domains, specific domains, or medium
domains. To consistently compare these two parameters, as shown in
Tables 2 and 3, against the three built-in criteria (quality, between and
within), the use of a graduated, symbolic scale (shown in Table 4) was
found to be appropriate.
Table 2
Calibration of parameters: first time slice (1991–1995)
Alpha   Weighting system   Quality    Between    Within
0       0                  75.24883   20.22216   53.61754
25      0                  75.70516   20.4978    55.41937
50      0                  76.45833   21.42763   60.51506
75      0                  76.71793   21.5146    62.39135
100     0                  77.0136    22.28996   68.95846
150     0                  77.02991   22.56356   71.63733
200     0                  76.97636   22.88898   74.86402
250     0                  76.81571   23.1361    75.97598
0       50                 77.16235   18.40972   53.2596
25      50                 78.78848   18.04413   57.47902
50      50                 78.77113   19.8377    64.65002
75      50                 79.66386   18.8663    65.21858
100     50                 79.00095   20.26706   69.66037
150     50                 79.02204   20.47135   71.8409
200     50                 78.94551   20.72742   73.73237
250     50                 78.90068   20.82032   74.26834
0       100                79.53323   17.48709   57.26816
25      100                80.16075   17.96368   62.31783
50      100                80.37956   18.19908   65.07483
75      100                80.46864   18.42106   67.185
100     100                80.57274   18.60637   69.6718
150     100                80.58289   18.85936   72.1897
200     100                80.54706   19.0326    73.67558
250     100                80.43496   19.23806   74.61258
Table 3
Calibration of parameters: second time slice (1996–2000)
Alpha   Weighting system   Quality    Between    Within
0       0                  75.26623   21.37498   53.54976
25      0                  75.74541   21.71035   56.20657
50      0                  76.2831    22.12541   60.64929
75      0                  76.44939   22.28767   62.69138
100     0                  76.65265   22.77321   68.25748
150     0                  76.63417   23.0164    70.73466
200     0                  76.56703   23.21742   72.4791
250     0                  76.42722   23.3527    72.05145
0       50                 79.61112   18.40944   59.34375
25      50                 79.744     18.53449   60.78518
50      50                 80.06667   18.84157   65.34303
75      50                 80.11666   18.93831   66.59558
100     50                 80.18723   19.07035   68.59628
150     50                 80.18952   19.3242    71.39236
200     50                 80.19743   19.32966   71.59061
250     50                 80.12227   19.44516   71.85658
0       100                82.21957   16.05856   61.68178
25      100                82.48489   16.27234   65.33604
50      100                82.56562   16.39861   67.12593
75      100                82.61001   16.44113   68.0085
100     100                82.65516   16.48689   68.98714
150     100                82.65041   16.69681   71.03367
200     100                82.65862   16.74308   71.69827
250     100                82.49071   16.89322   70.78495
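As an illustration of how such calibration figures might be read, the sketch below takes the eight rows of the first time slice (1991–1995) with the weighting system at symbolic value 100 and looks for the alpha setting with the highest Quality. The selection rule and the tolerance of 0.05 are assumptions made for this example. The article does not state that the final parameters were chosen in this way, and the Between and Within criteria are ignored here.

```python
# (alpha on the symbolic scale, Quality, Between, Within)
# Values copied from the first time slice, weighting system = 100.
rows = [
    (0, 79.53323, 17.48709, 57.26816),
    (25, 80.16075, 17.96368, 62.31783),
    (50, 80.37956, 18.19908, 65.07483),
    (75, 80.46864, 18.42106, 67.185),
    (100, 80.57274, 18.60637, 69.6718),
    (150, 80.58289, 18.85936, 72.1897),
    (200, 80.54706, 19.0326, 73.67558),
    (250, 80.43496, 19.23806, 74.61258),
]

# Setting with the highest Quality value.
best = max(rows, key=lambda row: row[1])

# Assumption: settings within 0.05 Quality points of the best are treated
# as practically equivalent; the tolerance is hypothetical.
TOLERANCE = 0.05
near_best = [row[0] for row in rows if best[1] - row[1] <= TOLERANCE]

print("best alpha (symbolic):", best[0])
print("near-best alphas (symbolic):", near_best)
```

On these figures Quality peaks at symbolic value 150, but values 100 and 200 lie within a few hundredths of a point of it. The curve is flat around its maximum, so Quality alone hardly discriminates between neighbouring alpha settings.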
Table 4
Symbolic value   Alpha   Weighting system
0                0.35    Large domains
25               0.37
50               0.40    Medium domains
75               0.42
100              0.45    Specific domains
150              0.50
200              0.55
250              0.60

Table 5
Setting of parameters for final clustering sessions
Maximum number of clusters allowed   Weighting system   KDT Alpha
20                                   Specific domains   0.45
Table 6
Resulting values of built-in criteria
                                Quality    Between    Within
First time slice (1991–1995)    80.99285   17.57098   65.38984
Second time slice (1996–2000)   84.01455   14.33116   62.90182
[Figures: two plots of Quality (vertical axis) against alpha (horizontal axis, symbolic values 0 to 250). In the first plot the Quality axis runs from 79 to 80.8; in the second, from 82 to 82.7.]
Table 7
Validation of clusters: first time slice (1991–1995)
Valid clusters
Rejected clusters
Regular clusters
Borderline clusters
10
Table 8
Validation of clusters: second time slice (1996–2000)
Valid clusters
Rejected clusters
Regular clusters
Borderline clusters
12
Table 9
Classical analysis through patent classifications
Derwent Class   1991–1995   1996–2000   Description
Q31             36.0465     29.2453     Packaging, labelling
Q32             16.2791     22.6415     Containers
Q34             19.7674     15.0943     Packaging elements, types
Q35             15.1163     14.1509     Refuse collections, conveyors
A92             9.30233     12.2642     Packaging and containers
Q79             0           12.2642     Weapons, ammunition, blasting
X25             9.30233     8.49057     Industrial electric equipment
P72             9.30233     7.54717     Working paper
Q39             0           7.54717     Liquid, handling
P62             6.97674     5.66038     Hand tools, cutting
Q21             18.6047     0           Railways
Q11             0           6.97674     Wheels, tyres, connections
Note. Numeric values represent percentages. As one would expect, the total percentages are greater than 100, as patents are usually labelled with
more than just one Derwent Class.
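The reason why such percentages add up to more than 100 can be shown with a small worked example in Python. The four patents and their class assignments below are invented for illustration and are not taken from the dataset analysed in this article.

```python
from collections import Counter

# Hypothetical toy portfolio: each patent carries one or more Derwent Classes.
patents = {
    "P1": ["Q31", "Q32"],
    "P2": ["Q31"],
    "P3": ["Q34", "A92"],
    "P4": ["Q31", "Q34"],
}

# Count, for each class, the number of patents labelled with it.
counts = Counter(cls for classes in patents.values() for cls in set(classes))

# Share of the portfolio carrying each class, as a percentage of all patents.
share = {cls: 100.0 * n / len(patents) for cls, n in counts.items()}
total = sum(share.values())

for cls, pct in sorted(share.items(), key=lambda item: -item[1]):
    print(cls, pct)
print("sum of percentages:", total)
```

Each patent is counted once for every class it carries, so a portfolio in which most patents have two classes yields a total well above 100 (175 in this toy case).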
This subdivision scheme did not always prove effective: for instance, in some extreme cases a few patents
were classified with different IPCs, even if they clearly
referred to inventions sharing the same subject matter,⁵
and their respective IPCs were spaced so far apart from
each other that they were assigned different Derwent
Classes as well. By contrast, the PackMOLE™
prototype was able to correctly group these patents into
the same clusters.
In any event, the good performance exhibited by the
PackMOLE™ prototype in correctly grouping patent
documents was probably greatly enhanced by the high
quality of Derwent abstracting.
Another strong point of the PackMOLE™
prototype was that it enabled the analyst to
correctly identify, by comparing the two patent maps
shown in Figs. 3 and 4, the changes in patent activities in
the different business areas of our target company, as
well as the subtle dynamics related to technological
developments and spin-offs, which were not otherwise
immediately detectable through the classical analysis.
Indeed, by relying only on the results shown in Table 9,
one might have drawn quite opposite conclusions.
Finally, it is worth noting that the validated patent
maps shown in Figs. 3 and 4 do not exactly represent the
complete patent portfolio of our target company, as a
few clusters were deemed to be invalid and thus removed, as previously mentioned.
Rather, the information contained in the validated
patent maps was thought to constitute a fairly good
picture of the patenting trends of our target company.
5. Final considerations
The strengths and weaknesses of the PackMOLE™
prototype were evaluated.
The tool showed a series of interesting advantages
over classical patent portfolio analysis techniques, and
proved to be effective in a real business scenario.
In many respects, text mining technology lets the analyst overcome the limits of current patent classifications.
On the other hand, the calibration and validation
steps of the clustering process itself proved to be difficult, time-consuming and strongly dependent upon the
contents of each dataset.
Probably, text mining techniques and patent classifications should not be considered alternative tools for
patent mapping: rather, they should be used in synergy.
Patent classifications, for instance, certainly have the
potential to help the user during the validation step of a
clustering session more than any built-in criteria.
Therefore, the next generation of text mining tools for
patent analysis should integrate some kind of facility for
manipulating patent classifications and other descriptive
indexes or terms, to speed up the whole process, at the
same time guaranteeing a professional quality of results.
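One way in which patent classifications might support the validation step is sketched below in Python. This is a hypothetical illustration, not a feature of the PackMOLE™ prototype. For each cluster it computes the share of patents carrying the most common Derwent Class and uses two assumed cut-off values (0.75 and 0.5) to suggest which clusters deserve a closer look. The cluster names, class assignments, cut-offs and verdict labels are all invented for the example.

```python
from collections import Counter

def dominant_class_share(cluster_classes):
    """Return the most common class in a cluster and the fraction of
    the cluster's patents that carry it."""
    counts = Counter(cls for classes in cluster_classes for cls in set(classes))
    cls, n = counts.most_common(1)[0]
    return cls, n / len(cluster_classes)

def triage(clusters, regular=0.75, borderline=0.5):
    """Suggest a verdict per cluster; the cut-offs are assumptions."""
    verdicts = {}
    for name, cluster_classes in clusters.items():
        cls, share = dominant_class_share(cluster_classes)
        if share >= regular:
            verdict = "regular"
        elif share >= borderline:
            verdict = "borderline"
        else:
            verdict = "review"  # needs manual inspection by the analyst
        verdicts[name] = (verdict, cls)
    return verdicts

# Hypothetical clusters: one list of Derwent Classes per patent.
clusters = {
    "sealing": [["Q31"], ["Q31", "Q34"], ["Q31"], ["Q32"]],
    "filling": [["Q39"], ["Q31"], ["Q39"], ["X25"]],
    "misc": [["Q21"], ["Q11"], ["P62"]],
}
verdicts = triage(clusters)
for name, (verdict, cls) in verdicts.items():
    print(name, verdict, cls)
```

Such a score can only prioritise the analyst's attention. As noted above, patents on the same subject matter are sometimes assigned to different classes, so a low share does not by itself mean that a cluster is invalid.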
Acknowledgements
Tetra Pak Carton Ambient SpA and CINECA are
founding members of the CRIT (Centro di Ricerca e
Innovazione Tecnologica)⁶ consortium. Their cooperation in researching business-grade text mining applications for patent analysis was therefore conducted
under the auspices of CRIT.
⁵ Apart from the obvious issue regarding the different editions of the
IPC, there is also the well-known problem of different patent offices
applying the IPC in different and not always consistent ways.
⁶ CRIT is located in viale Mazzini 5/3, 41058 Vignola (Modena),
Italy.
References
[1] Cabena P, Hadjinian P, Stadler R, Verhees J, Zanasi A.
Discovering data mining: from concept to implementation.
Englewood Cliffs, NJ: Prentice Hall; 1997.
[2] Hehenberger M, Coupet P. Text mining applied to patent
analysis. Paper presented at the 1998 Annual Meeting of
American Intellectual Property Law Association (AIPLA), October 15–17, Arlington, VA.
[3] Krier M, Zaccà F. Automatic categorisation applications at the
European patent office. World Patent Inf 2002;24(3):187–96.
[4] Smith H. Automation of patent classification. World Patent Inf
2002;24(4):269–71.
[5] Hull D, Aït-Mokhtar S, Chuat M, Eisele A, Gaussier E,
Grefenstette G, et al. Language technologies and patent search
and classification. World Patent Inf 2001;23:265–8.
[6] Trippe A. A comparison of ideologies: intellectually assigned co-coding clustering vs ThemeScape automatic themematic mapping.
In: Proceedings of the 2001 Chemical Information Conference.
[7] Grabmeier J, Rudolph A. Techniques of cluster algorithms in data
mining. Version 2.0. IBM Informationssysteme GmbH; 1998.