Text Mining Applied To Patent Mapping: A Practical Business Case

The document describes a text mining tool called PackMOLE that was developed to analyze patents and group them into clusters. It discusses how PackMOLE uses a relational analysis algorithm and clustering techniques to mine patent information and generate a "patent map" showing the relationships between patent clusters. The authors assess PackMOLE's strengths and weaknesses using a case study of a company's patent portfolio in packaging technology, and compare the results to a traditional classification-based analysis.

World Patent Information 25 (2003) 335–342

www.elsevier.com/locate/worpatin

Text mining applied to patent mapping: a practical business case

Michele Fattori a,*, Giorgio Pedrazzi b, Roberta Turra b

a Tetra Pak Carton Ambient SpA, via Delfini 1, 41100 Modena, Italy
b CINECA, via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna), Italy

Abstract
Professional patent searchers are traditionally rather suspicious of the alleged "black box" effect inherently attached to intelligent software engines relying upon linguistic technologies for patent analysis and mapping. In this article, the authors propose that such prejudices can be overcome by setting a realistic business objective while experimenting with these new linguistic tools, as well as by applying serious methodology for validating the results of the analysis. The strengths and weaknesses of a particular text mining tool are assessed with reference to a practical business case in the field of packaging technology, and a comparison of the outcome of such an analysis with a traditional one, carried out using conventional patent classifications, is also described.
© 2003 Elsevier Ltd. All rights reserved.
Keywords: Text mining; Data mining; Patent mapping; Patent analysis; Clustering techniques; Competitive intelligence; Intellectually assigned patent classifications; Results validation; Linguistic technology; Packaging technology

1. Introduction
Various data and text mining tools applied to patent analysis have been around for quite a while now [1,2]. Nevertheless, as pointed out by Krier and Zaccà [3], within the professional patent information community there is still a high degree of scepticism regarding the use of these new linguistic technologies. At least in part, this is due to the relative "black box" effect (footnote 1) inherently attached to the nature of this technology.
Not surprisingly then, professional patent searchers are rather suspicious of tools that do not generally grant the user complete control over their inner workings.
Indeed, an aim of this article is to try to shed some light on the degree of control that a user can expect to experience when carrying out some kinds of patent analysis with the help of text mining techniques.
The adopted point of view is that of a patent information professional, who is not necessarily an expert in linguistic algorithms. The idea was to extrapolate, if not a precise set of rules, at least some useful guidelines that could be applied to a fairly large number of real business scenarios when patent analysis is combined with text mining techniques.

* Corresponding author. Tel.: +39-59-898-009; fax: +39-59-898-027.
E-mail addresses: [email protected] (M. Fattori), [email protected] (G. Pedrazzi), [email protected] (R. Turra).
1 According to Krier and Zaccà, it is not clear ". . . whether [that] box is really black or just looks black because of absence of illumination" (in this case knowledge of the linguistic algorithms used).
0172-2190/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/S0172-2190(03)00113-3

An experimental text mining tool prototype named PackMOLE™ (mining online expert on packaging patents) was jointly developed by the Data Mining Centre of the CINECA (Consorzio Interuniversitario per il Calcolo Automatico dell'Italia Nord Orientale) consortium and the Intellectual Assets Department of Tetra Pak Carton Ambient SpA. In fact, the PackMOLE™ prototype came into being thanks to the combined expertise acquired by CINECA in the field of text mining applications and that of Tetra Pak Carton Ambient SpA in the management of intellectual assets.
The original aim of the PackMOLE™ project (footnote 2) was to implement an application for mining patent information in the packaging field.

2. The PackMOLE™ prototype

The PackMOLE™ prototype is able to work under unsupervised conditions, which means that the documents (patents, in this case) are processed and grouped into clusters that are dynamically generated by the algorithm, depending on a number of criteria which are not predetermined on the basis of a user-provided taxonomy.

2 In the course of this article we shall refer indiscriminately to PackMOLE™ as both the project and the tool.

This particular process is called clustering, as opposed to categorisation techniques, which depend upon some kind of predictive model and often also need adequate training [2].
Categorisation techniques are being investigated by various patent offices for implementing systems for the categorisation and classification of patent documents, and have already been discussed in recent issues of this journal [3–5].
Rather than just classifying documents, clustering techniques can yield valuable insight into the relationships existing between different categories (or clusters) of documents. A clustering approach to text mining is therefore considered more effective in a business environment, especially where patent information is regarded not only as a support for legal issues, but also as an important player within the competitive intelligence function.
In this context, the original assumption that formed the basis of the PackMOLE™ project was that text mining or, more precisely, the particular type of text mining called clustering, could not properly be considered a search technique in the sense of traditional patent searching. Rather, text mining (or clustering) should allow for the extraction of information regarding patenting trends in a more efficient way than the standard Boolean tools which, on the other hand, allow for more precise retrieval of patent documents and other bibliographic information regarding patents (such as legal status and patent families).
The outcome of a text mining (or clustering) session
primarily consists of a set of clusters, i.e. groups of
documents that show a certain amount of similarity,
according to a threshold value.
Each cluster is labelled with one or more keywords deemed to be representative of its content, so that it is possible to get a preliminary idea of that content without actually reading all the documents.
Of course, it is also possible to browse through the documents contained in each cluster to review the quality of the clustering process, as well as to display a graphical representation of the said clusters. We call this graphical representation the "bubble map" (or "patent map", since all the documents involved in our tests were patent documents), with the clusters being represented as "bubbles" [6].
If the mining algorithm finds that there is a certain degree of similarity between different clusters, those clusters are linked. In the patent map, such links are displayed as coloured lines connecting two (or more) bubbles. The colour of the link lines varies according to the relative strength of the link. These links are created on the basis of a second threshold value, which is automatically determined by the software.
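The interplay of the two thresholds can be illustrated with a minimal sketch. It is not the relational analysis algorithm used by the prototype: it assumes a simple greedy grouping with Jaccard similarity on keyword sets, and every name in it (`cluster_documents`, `link_clusters`, the sample patents and the threshold values) is hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two keyword sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_documents(docs, alpha):
    """Greedy single-pass grouping: a document joins the first cluster whose
    seed document is at least `alpha`-similar, otherwise it starts a new one."""
    clusters = []  # each cluster is a list of document ids; the first id is the seed
    for doc_id, words in docs.items():
        for members in clusters:
            if jaccard(words, docs[members[0]]) >= alpha:
                members.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters


def cluster_vocabulary(members, docs):
    """Union of the keywords of all documents in a cluster."""
    return set().union(*(docs[m] for m in members))


def link_clusters(clusters, docs, link_threshold, alpha):
    """Link two clusters whose similarity lies between the two thresholds:
    related enough to be linked, but not similar enough to be merged."""
    links = []
    for i, j in combinations(range(len(clusters)), 2):
        s = jaccard(cluster_vocabulary(clusters[i], docs),
                    cluster_vocabulary(clusters[j], docs))
        if link_threshold <= s < alpha:
            links.append((i, j, round(s, 2)))
    return links


# Hypothetical keyword sets standing in for patent abstracts.
docs = {
    "P1": {"carton", "seal", "strip"},
    "P2": {"carton", "seal", "strip", "heat"},
    "P3": {"straw", "carton", "opening"},
    "P4": {"straw", "opening", "cap"},
    "P5": {"conveyor", "belt"},
}
clusters = cluster_documents(docs, alpha=0.45)
links = link_clusters(clusters, docs, link_threshold=0.1, alpha=0.45)
```

In this toy dataset the two carton-related clusters end up linked through the shared keyword, while the conveyor cluster stays isolated, which is the kind of relationship a bubble map is meant to expose.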
It is also possible to visualise the properties of the various clusters with histograms, as well as to export a wealth of non-textual, bibliographic cluster data (i.e. applicants' names, patent classifications, priority and filing dates, etc.) into external software programs for further processing. We shall refer to this kind of patent data with the term "metainformation".
Among the various clustering algorithms and techniques currently available, the one called relational analysis is known to be particularly efficient when processing textual information and was therefore selected as the algorithm of choice to be implemented into the PackMOLE™ prototype.
In our case, the textual information to be analysed consisted of the contents of selected fields of standard Derwent World Patents Index records retrieved from an in-house intranet database. The Derwent WPI dataset was selected in order to describe the complete patent portfolio of a particular company, for testing the PackMOLE™ prototype against a real business case.

3. Methodology
3.1. The tool
With the PackMOLE™ prototype, the user has the option of customising a number of different parameters that govern the clustering process:

• the maximum number of clusters allowed,
• the weighting system,
• the keyword drop threshold (KDT) and
• the minimum domain homogeneity (alpha).

A brief explanation of the meaning of these parameters is given below.
The maximum number of clusters allowed parameter has quite a self-explanatory name. Nevertheless, it is worth noting that, when necessary, the software automatically generates a number of clusters that is lower than this specified upper limit.
The weighting system can assume three different values: large domains, specific domains and medium domains. By selecting large domains, the algorithm tends to create large clusters based on frequent words; with specific domains, small clusters based on rare words are likely to be created, while medium domains is a compromise between the two.
The KDT parameter instructs the algorithm to ignore those words that appear in fewer documents than the specified value.
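The effect of the KDT parameter can be sketched in a few lines. This is only an illustration under the assumption that documents are already tokenised into word lists; the function name `apply_kdt` and the sample data are hypothetical and do not come from the prototype.

```python
from collections import Counter


def apply_kdt(docs, kdt):
    """Drop every word that appears in fewer than `kdt` documents."""
    # Document frequency: count each word once per document.
    doc_freq = Counter(word for words in docs for word in set(words))
    keep = {word for word, n in doc_freq.items() if n >= kdt}
    return [[word for word in words if word in keep] for words in docs]


# Hypothetical tokenised abstracts.
docs = [
    ["carton", "seal", "strip"],
    ["carton", "seal", "laminate"],
    ["carton", "straw"],
]
filtered = apply_kdt(docs, kdt=2)
```

With `kdt=2`, the words that occur in a single document ("strip", "laminate", "straw") are ignored, so only vocabulary shared by at least two documents can drive the clustering.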

Table 1
Statistical indexes

Index: Description
Frequency: The number of documents, in each cluster, having that particular feature.
Characteristic ratio: The percentage of documents, in each cluster, having that particular feature. The higher this measure is, the better that feature characterises the documents of the cluster.
Global frequency: The number of documents, considering the whole dataset, having that particular feature.
Global ratio: The percentage of documents, considering the whole dataset, having that particular feature.
Discriminant ratio: The ratio of frequency to global frequency. The discriminant ratio is the percentage of documents, in each cluster, that have that particular feature, with respect to the number of documents in the whole dataset that also have that particular feature. The higher this measure is, the better that feature discriminates the documents in the cluster from the remaining documents in the dataset.
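The five indexes of Table 1 follow directly from two counts, as the sketch below shows. It assumes that each document carries a set of non-textual features (for instance classification codes); the function name `feature_indexes` and the sample data are hypothetical.

```python
def feature_indexes(cluster_docs, all_docs, feature):
    """Compute the five Table 1 indexes for one feature and one cluster.

    `cluster_docs` and `all_docs` map document ids to sets of features
    (for example applicant names or classification codes)."""
    frequency = sum(feature in feats for feats in cluster_docs.values())
    global_frequency = sum(feature in feats for feats in all_docs.values())
    return {
        "frequency": frequency,
        "characteristic_ratio": 100.0 * frequency / len(cluster_docs),
        "global_frequency": global_frequency,
        "global_ratio": 100.0 * global_frequency / len(all_docs),
        # Share of all documents with this feature that fall in this cluster.
        "discriminant_ratio": (100.0 * frequency / global_frequency
                               if global_frequency else 0.0),
    }


# Hypothetical dataset of four documents; the cluster holds two of them.
all_docs = {
    "P1": {"Q31", "Q32"},
    "P2": {"Q31"},
    "P3": {"Q34"},
    "P4": {"Q31", "Q34"},
}
cluster = {doc_id: all_docs[doc_id] for doc_id in ("P1", "P2")}
idx = feature_indexes(cluster, all_docs, "Q31")
```

Here every document of the cluster carries the feature (characteristic ratio of 100%), but one further document outside the cluster carries it too, so the discriminant ratio is only about 67%.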

Finally, the alpha parameter sets the minimum degree of similarity that two documents shall possess in order for them to be grouped within the same cluster (footnote 3).
Other than setting values for the mentioned parameters, the user also has the option of selecting one or more stop words. This can be useful for the exclusion of meaningless words from the clustering process [6].
Obviously, some of these parameters can affect the behaviour of the algorithm in contrasting ways and, in some cases, the user has to pay attention to avoid any possible conflict. For instance, by raising the value of the alpha parameter, the algorithm is naturally geared to produce more clusters of a smaller size, which could be inconvenient if the maximum number of clusters allowed was set to a low value.
In any event, in extreme cases, the PackMOLE™ tool has the ability to automatically override incongruous user settings.
Other typical clustering parameters or functions, namely the similarity index and the number of iterations, are not user customisable within the PackMOLE™ environment and are therefore not discussed here.
In order to help the user to evaluate, in an objective way, a particular combination of the above mentioned clustering parameters and, therefore, the outcome of the corresponding clustering session, one can rely upon three built-in criteria called

• within,
• between,
• quality.

The within criterion (or intra-cluster homogeneity) is a measure of the internal homogeneity of each cluster. On the contrary, the between criterion (inter-cluster separability) measures the degree of similarity between different clusters [7].
The quality criterion in some way summarises the two preceding ones.
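The article does not give the formulas behind these criteria, so the following is only a sketch of one common way to measure them: average pairwise Jaccard similarity inside a cluster (within) and across two clusters (between). The functions and the sample keyword sets are hypothetical, and the quality criterion is deliberately left out because its definition is not stated.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Jaccard similarity between two keyword sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def within(cluster):
    """Average pairwise similarity of the documents inside one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0  # a single-document cluster is trivially homogeneous
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def between(cluster_a, cluster_b):
    """Average similarity between the documents of two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Two hypothetical clusters, each a list of keyword sets.
c1 = [{"carton", "seal"}, {"carton", "seal", "strip"}]
c2 = [{"straw", "cap"}, {"straw", "cap", "carton"}]
w1 = within(c1)
b12 = between(c1, c2)
```

Note that a cluster holding a single document scores the maximum within value under this definition, so a high within value on its own says little about how useful a cluster map is.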

3 There is actually a second threshold parameter other than alpha. This second parameter is responsible for the creation of links between different clusters having a similarity degree between the two thresholds.

The PackMOLE™ prototype is able to provide the user with within, between and quality values related to each single cluster as well as to each clustering (or cluster map) as a whole.
Generally speaking, a good clustering is expected to feature high values for its within and quality criteria and a low value for its between criterion. There are some caveats, for instance: a very high "within" value could mean a cluster map consisting of very small clusters (in theory, so small as to comprise just one document), and a very low "between" could well mean that all the clusters in the map are completely disconnected. In the first case, the cluster map is obviously of no value whatsoever, while in the second case the map completely lacks what is possibly the most remarkable value-added information that could be obtained through clustering analysis: the indication of relationships or links between clusters.
It is the opinion of the authors that these criteria, though valuable for deeper insight and fine-tuning of the clustering process, as well as for obtaining a first gross indication of the overall quality of a clustering session, should nonetheless be handled with extreme care and, most importantly, should not be regarded as a feasible shortcut for validating a cluster map a priori.
Finally, we note that the available metainformation is automatically processed by the PackMOLE™ tool and, for each of the non-textual features mentioned earlier, a number of statistical indexes are also calculated (see Table 1).
All these criteria and indexes help the user in selecting the most appropriate combination of customisable parameters and thus are an indispensable aid during the fine-tuning and calibration steps of the mining process. Nevertheless, as previously mentioned, during the step of validating the results, the user should rely greatly upon his/her knowledge of both the mining tool and the subject matter of the documents involved.
This validation step can be performed manually, i.e. by reading every document in each cluster, can rely upon some kind of calculated statistical index, can be facilitated through the use of a number of different graphical tools, or can use any combination of the above. In any event, this difficult but necessary step is likely to be rather time-consuming, especially in the case of large and complicated patent maps.
3.2. Analysis of a patent portfolio: the mining steps
For testing the PackMOLE™ prototype, it was decided to select a realistic application, i.e. one that could easily fit into the patent information process of Tetra Pak Carton Ambient SpA, so that adequate emphasis was placed on extracting new and actionable knowledge in a real business context. At the same time, the capacity and limits of the tool itself had to be taken into account, to make sure we obtained a valid set of results.
As a first step, we therefore selected all the patents filed by a particular industrial group, active in the packaging and other fields, during the 1991–2000 time range. To enable a better view of the patenting dynamics, we adopted the well-known technique [6] of further splitting the time range into smaller segments or "slices": in this case, two segments were created, corresponding to 1991–1995 and 1996–2000.
The first segment was populated by 86 distinct documents, while the second segment was slightly bigger, accounting for 106 documents. As a consequence of the nature of Derwent WPI records, all the documents retrieved were conveniently representative of unique patent families.
A number of experimental tests were conducted in order to find the best possible combination of clustering parameters for the chosen application. In particular, we wanted to obtain the highest possible value for the overall quality criterion, while at the same time keeping under control the values of the within and between criteria.
Of course, we also did not want to obtain a high number of small, meaningless clusters, nor did we want to lose important information about the relationships between different clusters. Therefore, we conducted several tests for evaluating the response of the algorithm with respect to a change in the clustering parameters when mining the two different datasets or time slices.
We set the maximum number of clusters allowed parameter to 30 and the KDT parameter to 5 and let the other parameters vary, as shown in Tables 2 and 3, which refer to the first and second time slices, respectively. Table 4 summarises the meaning of the symbolic values (footnote 4) shown in the preceding tables for the alpha and weighting system parameters.

4 The alpha parameter can assume continuous values within the [0, 1] interval, while the weighting system parameter can only assume three discrete values: large domains, specific domains, or medium domains. To consistently compare these two parameters, as shown in Tables 2 and 3, against the three built-in criteria (quality, between and within), the use of a graduated, symbolic scale (shown in Table 4) was found to be appropriate.

Table 2
Calibration of parameters: first time slice (1991–1995)

Alpha   Weighting system   Quality    Between    Within
0       0                  75.24883   20.22216   53.61754
25      0                  75.70516   20.4978    55.41937
50      0                  76.45833   21.42763   60.51506
75      0                  76.71793   21.5146    62.39135
100     0                  77.0136    22.28996   68.95846
150     0                  77.02991   22.56356   71.63733
200     0                  76.97636   22.88898   74.86402
250     0                  76.81571   23.1361    75.97598
0       50                 77.16235   18.40972   53.2596
25      50                 78.78848   18.04413   57.47902
50      50                 78.77113   19.8377    64.65002
75      50                 79.66386   18.8663    65.21858
100     50                 79.00095   20.26706   69.66037
150     50                 79.02204   20.47135   71.8409
200     50                 78.94551   20.72742   73.73237
250     50                 78.90068   20.82032   74.26834
0       100                79.53323   17.48709   57.26816
25      100                80.16075   17.96368   62.31783
50      100                80.37956   18.19908   65.07483
75      100                80.46864   18.42106   67.185
100     100                80.57274   18.60637   69.6718
150     100                80.58289   18.85936   72.1897
200     100                80.54706   19.0326    73.67558
250     100                80.43496   19.23806   74.61258

Table 3
Calibration of parameters: second time slice (1996–2000)

Alpha   Weighting system   Quality    Between    Within
0       0                  75.26623   21.37498   53.54976
25      0                  75.74541   21.71035   56.20657
50      0                  76.2831    22.12541   60.64929
75      0                  76.44939   22.28767   62.69138
100     0                  76.65265   22.77321   68.25748
150     0                  76.63417   23.0164    70.73466
200     0                  76.56703   23.21742   72.4791
250     0                  76.42722   23.3527    72.05145
0       50                 79.61112   18.40944   59.34375
25      50                 79.744     18.53449   60.78518
50      50                 80.06667   18.84157   65.34303
75      50                 80.11666   18.93831   66.59558
100     50                 80.18723   19.07035   68.59628
150     50                 80.18952   19.3242    71.39236
200     50                 80.19743   19.32966   71.59061
250     50                 80.12227   19.44516   71.85658
0       100                82.21957   16.05856   61.68178
25      100                82.48489   16.27234   65.33604
50      100                82.56562   16.39861   67.12593
75      100                82.61001   16.44113   68.0085
100     100                82.65516   16.48689   68.98714
150     100                82.65041   16.69681   71.03367
200     100                82.65862   16.74308   71.69827
250     100                82.49071   16.89322   70.78495

It can be easily seen that, at least in terms of the quality criterion and for both segments, the best results were obtained when the weighting system was set to specific domains.

Table 4
Explanation of symbolic values shown in Tables 2 and 3

Symbolic value   Alpha   Weighting system
0                0.35    Large domains
25               0.37
50               0.40    Medium domains
75               0.42
100              0.45    Specific domains
150              0.50
200              0.55
250              0.60

Fig. 1. First time slice (1991–1995): the quality criterion in relation to the alpha parameter (when the weighting system parameter is set to specific domains). [Plot of quality against the symbolic alpha values.]

Fig. 2. Second time slice (1996–2000): the quality criterion in relation to the alpha parameter (when the weighting system parameter is set to specific domains). [Plot of quality against the symbolic alpha values.]

Fig. 1 shows the quality criterion in relation to the alpha parameter for the first segment, while Fig. 2 shows the same for the second segment, in the case where the weighting system parameter is set to specific domains.
Both Figs. 1 and 2 seem to indicate the presence of a threshold value related to the alpha parameter. Exceeding this value, the quality criterion slowly starts to decrease, probably due to some kind of "overkill" effect.
On the grounds of these preliminary calibration studies, for our final clustering (and for both time slices) we decided to adopt the settings summarised in Table 5. In this respect, it is to be noted that, even if it made sense to allow the algorithm to enjoy more freedom during the calibration phase, for the final clustering sessions the maximum number of clusters allowed parameter was reduced to 20 in order to enhance the meaningfulness of the final patent maps, as our datasets were not very large.

Table 5
Setting of parameters for final clustering sessions

Maximum number of clusters allowed: 20
Weighting system: Specific domains
KDT: [value missing]
Alpha: 0.45

We also set a number of different stop words for both segments, which served to improve the overall clustering quality. The resulting values of the within, between and quality criteria for the final clustering sessions are shown in Table 6.

Table 6
Resulting values of built-in criteria

                                Quality    Between    Within
First time slice (1991–1995)    80.99285   17.57098   65.38984
Second time slice (1996–2000)   84.01455   14.33116   62.90182

Regarding the internal built-in criteria, we concluded that, even if they do show some common behavioural pattern, their connection to a given dataset is rather strong, and that they should be considered more as generic quality indicators than exact means for validating a clustering process a priori.

3.3. Analysis of a patent portfolio: the validation step
The next logical step was to try to find a suitable methodology for validating the patent maps obtained with the PackMOLE™ prototype.
When a preexisting classification of the documents is already present, a number of metrics, for example the entropy [8], the purity [9], or the gain ratio [10], are well known in the field of document clustering for helping the analyst to perform the validation step.
Unfortunately, from a general perspective, the problem with these kinds of metrics is that they do not provide a measure of the intrinsic quality of a document clustering; rather, they show the degree of alignment between the clustering and the preexisting classification.
When it comes to interpreting the results of a clustering, the effective usefulness of these metrics for competitive intelligence applications is therefore unclear. Moreover, to correctly assess, for example, the entropy of a clustering, it is necessary for each document to be associated with one class only, which is not usually the case with patents.
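For reference, purity and entropy can be sketched as follows. The definitions used here (majority-class share, and the size-weighted average of each cluster's class entropy in bits, without normalisation) are a common textbook form and are an assumption, since the article does not spell them out. The sample class labels are hypothetical, and the code presupposes exactly the condition noted above: one reference class per document and no empty clusters.

```python
import math
from collections import Counter


def purity(clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = sum(len(labels) for labels in clusters)
    majority = sum(Counter(labels).most_common(1)[0][1] for labels in clusters)
    return majority / total


def entropy(clusters):
    """Size-weighted average of the class entropy (in bits) of each cluster."""
    total = sum(len(labels) for labels in clusters)
    result = 0.0
    for labels in clusters:
        n = len(labels)
        h = -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
        result += (n / total) * h
    return result


# Each inner list holds the single reference class of every document in a cluster.
clusters = [["Q31", "Q31", "Q31", "Q32"], ["Q34", "Q34"]]
p = purity(clusters)
e = entropy(clusters)
```

Both numbers only express how well the clusters line up with the reference classes: a clustering that groups patents by a theme cutting across those classes would score poorly even if it were informative.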
In order to overcome these issues, we had to choose our own validation criteria on the grounds of our understanding of the subject matter involved, and then check the consistency of every cluster in the bubble maps against these criteria.

Therefore it was decided that, for a cluster to be considered valid, at least 50% of its documents had to be found to be homogeneous, where the said homogeneity had to be intellectually assessed. We also decided to discard a cluster if it consisted of fewer than three documents, or if it presented wrong links with other clusters.
The valid clusters were further grouped into "regular" clusters and "borderline" clusters, the latter group consisting of clusters having a percentage of precisely 50% homogeneous documents.
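These acceptance rules translate into a small decision function. The sketch assumes that the number of homogeneous documents and the presence of wrong links have already been judged by a human reader; the function name and argument names are hypothetical.

```python
def validate_cluster(n_docs, n_homogeneous, has_wrong_links=False):
    """Classify a cluster as 'regular', 'borderline' or 'rejected'.

    `n_homogeneous` is the number of documents judged (by a human reader)
    to share the cluster's subject matter."""
    if n_docs < 3 or has_wrong_links:
        return "rejected"
    # Compare integers (2 * k versus n) to avoid floating-point percentages.
    if 2 * n_homogeneous < n_docs:
        return "rejected"
    if 2 * n_homogeneous == n_docs:
        return "borderline"
    return "regular"
```

For example, a ten-document cluster with eight homogeneous documents is regular, one with exactly five is borderline, and a two-document cluster is rejected regardless of its content.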
The validation step was carried out by reading the
documents in each cluster, as well as by relying, at least
in part, on the available metainformation. The invalid
clusters were then removed from the bubble maps. Only
after the invalid clusters were removed, were we able to
correctly extract the information about the patenting
trends.
The final, validated (and graphically retouched) patent maps are shown in Figs. 3 and 4 for the first and second time slices, respectively.
The numbers appearing in each bubble correspond to
the numbers of Derwent records contained in each
cluster.
According to our criteria, 70% of all the clusters in the first segment were found to be valid, while in the second segment a slightly higher percentage of 75% valid clusters was observed. These results are summarised in Tables 7 and 8 for the first and second time slices, respectively.

Fig. 3. First time slice (1991–1995): patent map.

Fig. 4. Second time slice (1996–2000): patent map.

Table 7
Validation of clusters: first time slice (1991–1995)
Columns: Valid clusters (Regular clusters, Borderline clusters); Rejected clusters

Table 8
Validation of clusters: second time slice (1996–2000)
Columns: Valid clusters (Regular clusters, Borderline clusters); Rejected clusters


4. Discussion of results: comparison of clustering analysis and classification-based analysis
As mentioned above, the PackMOLE™ prototype is able to export data regarding the outcome of a clustering session to an external software program for further processing, namely a spreadsheet such as, for instance, Microsoft Excel.
Using this feature, the 10 most frequent Derwent Classes [11] were extracted from both time slices, as shown in Table 9, to study the patenting activities of our target company with a classical grouping technique based upon patent classifications, and then to compare the similarities and differences of this standard analysis with the patent maps previously obtained through the clustering sessions.
Clearly, even though the two different analyses more or less show the same overall trends, the patent maps obtained through text mining are easier to understand, in part because they are presented in graphical rather than textual or tabular form.
In fact, the patent maps generated by text clustering allow for a better overview of the relationships between the different areas of patent activity, at the same time avoiding the work involved in using different, more detailed patent classifications, such as for example the IPC. In particular, the IPC was felt to be either too broad (at the class/subclass level) or too detailed (at the group/subgroup level) to effectively carry out an optimal patent portfolio analysis.
Regarding the Derwent Classification, it is to be noted that the majority of retrieved Derwent Classes belonged to the Engineering sections P and Q where each Derwent Class automatically corresponds to an exact, predetermined range of IPCs [11].

Table 9
Classical analysis through patent classifications

Derwent Class   1991–1995   1996–2000   Description
Q31             36.0465     29.2453     Packaging, labelling
Q32             16.2791     22.6415     Containers
Q34             19.7674     15.0943     Packaging elements, types
Q35             15.1163     14.1509     Refuse collections, conveyors
A92             9.30233     12.2642     Packaging and containers
Q79             0           12.2642     Weapons, ammunition, blasting
X25             9.30233     8.49057     Industrial electric equipment
P72             9.30233     7.54717     Working paper
Q39             0           7.54717     Liquid, handling
P62             6.97674     5.66038     Hand tools, cutting
Q21             18.6047     0           Railways
Q11             0           6.97674     Wheels, tyres, connections

Note. Numeric values represent percentages. As one would expect, the total percentages are greater than 100, as patents are usually labelled with more than just one Derwent Class.
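A classification-based breakdown of this kind is simple to compute, and a short sketch also shows why the column totals exceed 100%. The function `class_shares` and the four-patent portfolio are hypothetical and are not taken from the dataset analysed in the article.

```python
from collections import Counter


def class_shares(patents):
    """Percentage of patents carrying each class; a patent may have several."""
    # Count each class once per patent, even if it is listed twice.
    counts = Counter(cls for classes in patents for cls in set(classes))
    return {cls: 100.0 * n / len(patents) for cls, n in counts.items()}


# Hypothetical mini-portfolio: each entry lists the Derwent Classes of one patent.
patents = [["Q31", "Q32"], ["Q31"], ["Q34", "A92"], ["Q31", "Q34"]]
shares = class_shares(patents)
total = sum(shares.values())
```

Because three of the four patents carry two classes each, the shares add up to 175% rather than 100%.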

This subdivision scheme did not always prove effective: for instance, in some extreme cases a few patents were classified with different IPCs, even if they clearly referred to inventions sharing the same subject matter (footnote 5), and their respective IPCs were spaced so far apart from each other that they were assigned different Derwent Classes as well. On the contrary, the PackMOLE™ prototype was able to correctly group these patents into the same clusters.
In any event, the good performance exhibited by the PackMOLE™ prototype in correctly grouping patent documents was probably greatly enhanced by the high quality of Derwent abstracting.
Another strong point shown by the PackMOLE™ prototype was that it gave the analyst the ability to correctly identify, by comparing the two patent maps shown in Figs. 3 and 4, the changes in patent activities in the different business areas of our target company, as well as the subtle dynamics related to technological developments and spin-offs, that were not otherwise immediately detectable through the classical analysis: indeed, by only relying on the results shown in Table 9, one might have deduced quite opposite conclusions.
Finally, it is worth noting that the validated patent
maps shown in Figs. 3 and 4 do not exactly represent the
complete patent portfolio of our target company, as a
few clusters were deemed to be invalid and thus removed, as previously mentioned.
Rather, the information contained in the validated
patent maps was thought to constitute a fairly good
picture of the patenting trends of our target company.

5. Final considerations
The strengths and weaknesses of the PackMOLE™ prototype were evaluated.
The tool showed a series of interesting advantages over classical patent portfolio analysis techniques, and proved to be effective in a real business scenario.
In many respects, text mining technology lets the analyst overcome the limits of current patent classifications.
On the other hand, the calibration and validation steps of the clustering process itself proved to be difficult, time-consuming and strongly dependent upon the contents of each dataset.
Text mining techniques and patent classifications should probably not be considered alternative tools for patent mapping: rather, they should be used in synergy.
Patent classifications, for instance, certainly have the potential to help the user during the validation step of a clustering session more than any built-in criteria.
Therefore, the next generation of text mining tools for patent analysis should integrate some kind of facility for manipulating patent classifications and other descriptive indexes or terms, to speed up the whole process, at the same time guaranteeing a professional quality of results.

Acknowledgements
Tetra Pak Carton Ambient SpA and CINECA are founding members of the CRIT (Centro di Ricerca e Innovazione Tecnologica) consortium (footnote 6). Their cooperation in researching business-grade text mining applications for patent analysis was therefore conducted under the auspices of CRIT.

5 Apart from the obvious issue regarding the different editions of the IPC, there is also the well-known problem of different patent offices applying the IPC in different and not always consistent ways.

6 CRIT is located in viale Mazzini 5/3, 41058 Vignola (Modena), Italy.


References
[1] Cabena P, Hadjinian P, Stadler R, Verhees J, Zanasi A. Discovering data mining: from concept to implementation. Englewood Cliffs, NJ: Prentice Hall; 1997.
[2] Hehenberger M, Coupet P. Text mining applied to patent analysis. Paper presented at the 1998 Annual Meeting of American Intellectual Property Law Association (AIPLA), October 15–17, Arlington, VA.
[3] Krier M, Zaccà F. Automatic categorisation applications at the European patent office. World Patent Inf 2002;24(3):187–96.
[4] Smith H. Automation of patent classification. World Patent Inf 2002;24(4):269–71.
[5] Hull D, Aït-Mokhtar S, Chuat M, Eisele A, Gaussier E, Grefenstette G, et al. Language technologies and patent search and classification. World Patent Inf 2001;23:265–8.
[6] Trippe A. A comparison of ideologies: intellectually assigned co-coding clustering vs ThemeScape automatic themematic mapping. In: Proceedings of the 2001 Chemical Information Conference.
[7] Grabmeier J, Rudolph A. Techniques of cluster algorithms in data mining. Version 2.0. IBM Informationssysteme GmbH; 1998.
[8] Shannon C. A mathematical theory of communication. Bell Syst Tech J 1948;27:379–423, and 623–56.
[9] Zhao Y, Karypis G. Criterion functions for document clustering: experiments and analysis. Technical Report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001.
[10] Quinlan R. C4.5: programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers; 1993.
[11] Derwent Information, Derwent World Patents Index, the Derwent Classification, Edition 2. May 2000.

Michele Fattori obtained a five-year degree
(M.Sc. equivalent) in Materials Engineering
from the University of Modena. Prior to
joining the Intellectual Assets Department of
Tetra Pak Carton Ambient in 2001, he
worked as a trainee patent and trade mark attorney for a leading Italian consultancy firm.

