Introduction to the tm Package
Text Mining in R
Ingo Feinerer
July 10, 2013

Introduction
This vignette gives a short introduction to text mining in R utilizing the text mining framework provided by
the tm package. We present methods for data import, corpus handling, preprocessing, meta data management,
and creation of term-document matrices. Our focus is on the main aspects of getting started with text mining
in R—an in-depth description of the text mining infrastructure offered by tm was published in the Journal of
Statistical Software (Feinerer et al., 2008). An introductory article on text mining in R was published in R
News (Feinerer, 2008).

Data Import
The main structure for managing documents in tm is a so-called Corpus, representing a collection of text
documents. A corpus is an abstract concept, and there can exist several implementations in parallel. The
default implementation is the so-called VCorpus (short for Volatile Corpus) which realizes a semantics as known
from most R objects: corpora are R objects held fully in memory. We denote this as volatile since once the
R object is destroyed, the whole corpus is gone. Such a volatile corpus can be created via the constructor
Corpus(x, readerControl). Another implementation is the PCorpus which implements a Permanent Corpus
semantics, i.e., the documents are physically stored outside of R (e.g., in a database), corresponding R objects
are basically only pointers to external structures, and changes to the underlying corpus are reflected to all R
objects associated with it. Compared to the volatile corpus the corpus encapsulated by a permanent corpus
object is not destroyed if the corresponding R object is released.
Within the corpus constructor, x must be a Source object which abstracts the input location. tm provides a
set of predefined sources, e.g., DirSource, VectorSource, or DataframeSource, which handle a directory, a vector
with each component interpreted as a document, and data-frame-like structures (such as CSV files), respectively.
Except for DirSource, which is designed solely for directories on a file system, and VectorSource, which only
accepts (character) vectors, most other implemented sources can take connections as input (a character string is
interpreted as a file path). getSources() lists the available sources, and users can create their own.
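For instance, a corpus can be built from a data frame; a minimal sketch, assuming (as in this version of tm) that
DataframeSource treats each row of the data frame as one document:
> df <- data.frame(txt = c("First document.", "Second document."),
+                  stringsAsFactors = FALSE)
> Corpus(DataframeSource(df))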
The second argument readerControl of the corpus constructor has to be a list with the named components reader and language. The first component reader constructs a text document from elements delivered by a source. The tm package ships with several readers (e.g., readPlain(), readGmane(), readRCV1(),
readReut21578XMLasPlain(), readPDF(), readDOC(), . . . ). See getReaders() for an up-to-date list of available readers. Each source has a default reader which can be overridden. E.g., for DirSource the default just
reads in the input files and interprets their content as text. Finally, the second component language sets the
texts’ language (preferably using ISO 639-2 codes).
In case of a permanent corpus, a third argument dbControl has to be a list with the named components
dbName giving the filename holding the sourced out objects (i.e., the database), and dbType holding a valid
database type as supported by the filehash package. Activated database support reduces the memory demand;
however, access gets slower since each operation is limited by the hard disk’s read and write speed.
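A permanent corpus could thus be created along the following lines; a minimal sketch, using the txt directory
from the example below and assuming the filehash package is installed (the database file name pcorpus.db and
the type "DB1" are arbitrary choices):
> pc <- PCorpus(DirSource(txt),
+               readerControl = list(language = "lat"),
+               dbControl = list(dbName = "pcorpus.db", dbType = "DB1"))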
So e.g., plain text files in the directory txt containing Latin (lat) texts by the Roman poet Ovid can be
read in with the following code:
> txt <- system.file("texts", "txt", package = "tm")
> (ovid <- Corpus(DirSource(txt, encoding = "UTF-8"),
+                 readerControl = list(language = "lat")))
A corpus with 5 text documents
For simple examples VectorSource is quite useful, as it can create a corpus from character vectors, e.g.:
> docs <- c("This is a text.", "This another one.")
> Corpus(VectorSource(docs))
A corpus with 2 text documents
Finally, we create a corpus for some Reuters documents as an example for later use:
> reut21578 <- system.file("texts", "crude", package = "tm")
> reuters <- Corpus(DirSource(reut21578),
+                   readerControl = list(reader = readReut21578XML))

Data Export
If you have created a corpus by manipulating other objects in R, and thus do not have the texts already stored
on a hard disk, you can save the text documents to disk with writeCorpus():
> writeCorpus(ovid)
which writes a plain text representation of a corpus to multiple files on disk corresponding to the individual
documents in the corpus.
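The output location can also be controlled; a sketch, assuming a path argument as in current versions of
writeCorpus() (the file names are made up):
> writeCorpus(ovid, path = ".", filenames = paste(seq_along(ovid), ".txt", sep = ""))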

Inspecting Corpora
Custom print() and summary() methods are available which hide the raw amount of information (consider that a
corpus could consist of several thousand documents, like a database). summary() gives more details on meta
data than print(), whereas the full content of text documents is displayed with inspect().
> inspect(ovid[1:2])
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$ovid_1.txt
Si quis in hoc artem populo non novit amandi,
hoc legat et lecto carmine doctus amet.
arte citae veloque rates remoque moventur,
arte leves currus: arte regendus amor.
curribus Automedon lentisque erat aptus habenis,
Tiphys in Haemonia puppe magister erat:
me Venus artificem tenero praefecit Amori;
Tiphys et Automedon dicar Amoris ego.
ille quidem ferus est et qui mihi saepe repugnet:
sed puer est, aetas mollis et apta regi.
Phillyrides puerum cithara perfecit Achillem,
atque animos placida contudit arte feros.
qui totiens socios, totiens exterruit hostes,
creditur annosum pertimuisse senem.
$ovid_2.txt
quas Hector sensurus erat, poscente magistro
verberibus iussas praebuit ille manus.
Aeacidae Chiron, ego sum praeceptor Amoris:
saevus uterque puer, natus uterque dea.
sed tamen et tauri cervix oneratur aratro,
frenaque magnanimi dente teruntur equi;
et mihi cedet Amor, quamvis mea vulneret arcu
pectora, iactatas excutiatque faces.
quo me fixit Amor, quo me violentius ussit,
hoc melior facti vulneris ultor ero:
non ego, Phoebe, datas a te mihi mentiar artes,
nec nos aëriae voce monemur avis,
nec mihi sunt visae Clio Cliusque sorores
servanti pecudes vallibus, Ascra, tuis:
usus opus movet hoc: vati parete perito;
Individual documents can be accessed via [[, either via the position in the corpus, or via their name.
> identical(ovid[[2]], ovid[["ovid_2.txt"]])
[1] TRUE

Transformations
Once we have a corpus we typically want to modify the documents in it, e.g., stemming, stopword removal,
et cetera. In tm, all this functionality is subsumed into the concept of a transformation. Transformations are
done via the tm_map() function which applies (maps) a function to all elements of the corpus. Basically, all
transformations work on single text documents and tm_map() just applies them to all documents in a corpus.
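For example, a custom transformation can be defined as an ordinary R function and handed to tm_map(); a
hypothetical sketch (removeDigits is not part of tm, and the sketch relies on the behavior, discussed below, that
base functions such as gsub keep the document class intact):
> removeDigits <- function(x) gsub("[[:digit:]]+", "", x)
> tm_map(ovid, removeDigits)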

Converting to Plain Text Documents
The corpus reuters contains documents in XML format. We have no further use for the XML internals and just
want to work with the text content, so we convert the documents to plain text documents via the generic
as.PlainTextDocument().
> reuters <- tm_map(reuters, as.PlainTextDocument)
Note that alternatively we could have read in the files with the readReut21578XMLasPlain reader which already
returns a plain text document in the first place.

Eliminating Extra Whitespace
Extra whitespace is eliminated by:
> reuters <- tm_map(reuters, stripWhitespace)

Convert to Lower Case
Conversion to lower case is done by:
> reuters <- tm_map(reuters, tolower)
As you can see, you can use arbitrary text processing functions as transformations as long as the function returns
a text document. Most text manipulation functions from base R just modify a character vector in place and, as
such, keep class information intact. This is especially true for tolower as used here, but also, e.g., for gsub, which
comes in quite handy for a broad range of text manipulation tasks.
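Since tm_map() passes additional arguments on to the transformation, gsub can be used directly; a purely
illustrative sketch (the pattern and replacement are made up):
> tm_map(reuters, gsub, pattern = "barrels", replacement = "barrel")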

Remove Stopwords
Stopwords are removed by:
> reuters <- tm_map(reuters, removeWords, stopwords("english"))

Stemming
Stemming is done by:
> tm_map(reuters, stemDocument)
A corpus with 20 text documents
Filters
Often it is of special interest to filter out documents satisfying given properties. The function tm_filter is
designed for this purpose. It is possible to write custom filter functions, but for most cases sFilter does the job:
it integrates a minimal query language for filtering on meta data. Statements in this query language are written
as they would be for subsetting data frames. E.g., the following statement filters out those documents having an ID equal
to 237 and the string “INDONESIA SEEN AT CROSSROADS OVER ECONOMIC CHANGE” as their heading (both are
meta data attributes of the text document).
> query <- "id == '237' & heading == 'INDONESIA SEEN AT CROSSROADS OVER ECONOMIC CHANGE'"
> tm_filter(reuters, FUN = sFilter, query)
A corpus with 1 text document
There is also a full text search filter available (the default when no explicit filter function FUN is specified)
which accepts regular expressions:
> tm_filter(reuters, pattern = "company")
A corpus with 5 text documents
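A custom filter function is simply a predicate applied to each document; a hypothetical sketch, assuming FUN
receives a single document and returns a logical:
> tm_filter(reuters, FUN = function(x) any(grep("opec", x)))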

Meta Data Management
Meta data is used to annotate text documents or whole corpora with additional information. The easiest way
to accomplish this with tm is to use the meta() function. A text document has a few predefined attributes
like Author, but can be extended with an arbitrary number of additional user-defined meta data tags. These
additional meta data tags are individually attached to a single text document. From a corpus perspective these
meta data attachments are locally stored together with each individual text document. As an alternative to
meta(), the function DublinCore() provides a full mapping between Simple Dublin Core meta data and tm meta
data structures and can similarly be used to get and set meta data information for text documents. The following
examples use the crude corpus, a collection of Reuters documents which ships with tm and can be loaded via
data("crude"):
> DublinCore(crude[[1]], "Creator") <- "Ano Nymous"
> meta(crude[[1]])
Available meta data pairs are:
  Author       : Ano Nymous
  DateTimeStamp: 1987-02-26 17:00:56
  Description  :
  Heading      : DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
  ID           : 127
  Language     : en
  Origin       : Reuters-21578 XML
User-defined local meta data pairs are:
$TOPICS
[1] "YES"
$LEWISSPLIT
[1] "TRAIN"
$CGISPLIT
[1] "TRAINING-SET"
$OLDID
[1] "5670"
$Topics
[1] "crude"
$Places
[1] "usa"
$People
character(0)
$Orgs
character(0)
$Exchanges
character(0)
For corpora the story is a bit more difficult. Corpora in tm have two types of meta data: one is the meta
data on the corpus level (corpus), the other is the meta data related to the individual documents (indexed) in
the form of a data frame. The latter is often used for performance reasons (hence the name indexed, for indexing)
or because the meta data constitutes its own entity but still relates directly to individual text documents, e.g., a
classification result; the classifications directly relate to the documents, but the set of classification levels forms
an entity of its own. Both cases can be handled with meta():
> meta(crude, tag = "test", type = "corpus") <- "test meta"
> meta(crude, type = "corpus")
$create_date
[1] "2010-06-17 07:32:26 GMT"
$creator
LOGNAME
"feinerer"
$test
[1] "test meta"
> meta(crude, "foo") <- letters[1:20]
> meta(crude)

   MetaID foo
1       0   a
2       0   b
3       0   c
4       0   d
5       0   e
6       0   f
7       0   g
8       0   h
9       0   i
10      0   j
11      0   k
12      0   l
13      0   m
14      0   n
15      0   o
16      0   p
17      0   q
18      0   r
19      0   s
20      0   t

Standard Operators and Functions
Many standard operators and functions ([, [<-, [[, [[<-, c(), lapply()) are available for corpora with
semantics similar to standard R routines. E.g., c() concatenates two (or more) corpora. Applied to several
text documents it returns a corpus. The meta data is automatically updated if corpora are concatenated (i.e.,
merged).
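E.g., a minimal sketch merging two subsets of the ovid corpus back into a single corpus:
> c(ovid[1:2], ovid[3:5])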

Creating Term-Document Matrices
A common approach in text mining is to create a term-document matrix from a corpus. In the tm package
the classes TermDocumentMatrix and DocumentTermMatrix (depending on whether you want terms as rows and
documents as columns, or vice versa) employ sparse matrices for corpora.
> dtm <- DocumentTermMatrix(reuters)
> inspect(dtm[1:5,100:105])
A document-term matrix (5 documents, 6 terms)

Non-/sparse entries: 0/30
Sparsity           : 100%
Maximal term length: 7
Weighting          : term frequency (tf)

    Terms
Docs able abroad, abu accept accord accord,
 127    0       0   0      0      0       0
 144    0       0   0      0      0       0
 191    0       0   0      0      0       0
 194    0       0   0      0      0       0
 211    0       0   0      0      0       0
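The transposed variant is constructed analogously; a minimal sketch:
> tdm <- TermDocumentMatrix(reuters)
> inspect(tdm[100:105, 1:5])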

Operations on Term-Document Matrices
Besides the fact that a huge number of R functions (for clustering, classification, etc.) can be applied to this
matrix, the package provides some shortcuts. Imagine we want to find those terms that occur at least five times;
then we can use the findFreqTerms() function:
> findFreqTerms(dtm, 5)
 [1] "15.8"          "abdul-aziz"    "ability"       "accord"
 [5] "agency"        "agreement"     "ali"           "also"
 [9] "analysts"      "april"         "arab"          "arabia"
[13] "barrel."       "barrels"       "billion"       "bpd"
[17] "budget"        "commitment"    "company"       "crude"
[21] "daily"         "demand"        "dlrs"          "economic"
[25] "emergency"     "energy"        "exchange"      "expected"
[29] "exports"       "feb"           "futures"       "government"
[33] "group"         "gulf"          "help"          "hold"
[37] "industry"      "international" "january"       "kuwait"
[41] "last"          "march"         "market"        "may"
[45] "meet"          "meeting"       "minister"      "mln"
[49] "month"         "nazer"         "new"           "now"
[53] "nymex"         "official"      "oil"           "one"
[57] "opec"          "output"        "pct"           "petroleum"
[61] "plans"         "posted"        "present"       "price"
[65] "prices"        "prices,"       "prices."       "production"
[69] "quota"         "quoted"        "recent"        "report"
[73] "research"      "reserve"       "reserves"      "reuter"
[77] "said"          "said."         "saudi"         "says"
[81] "sell"          "set"           "sheikh"        "sources"
[85] "study"         "traders"       "u.s."          "united"
[89] "west"          "will"          "world"         "york,"

If we want to find associations (i.e., terms which correlate) with at least 0.8 correlation for the term opec, we
use findAssocs():
> findAssocs(dtm, "opec", 0.8)
  meeting      15.8       oil emergency  analysts    buyers
     0.88      0.85      0.85      0.83      0.82      0.80
The function also accepts a matrix as its first argument (one which does not inherit from a term-document matrix). This
matrix is then interpreted as a correlation matrix and directly used. With this approach different correlation
measures can be employed.
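E.g., a sketch computing Spearman instead of Pearson correlations between the terms and passing the result to
findAssocs(); this assumes the densified matrix fits into memory:
> m <- cor(as.matrix(dtm), method = "spearman")
> findAssocs(m, "opec", 0.8)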
Term-document matrices tend to get very big already for normal-sized data sets. Therefore we provide a
method to remove sparse terms, i.e., terms occurring only in very few documents. Normally, this reduces the
matrix dramatically without losing significant relations inherent to the matrix:
> inspect(removeSparseTerms(dtm, 0.4))
A document-term matrix (20 documents, 4 terms)

Non-/sparse entries: 74/6
Sparsity           : 7%
Maximal term length: 6
Weighting          : term frequency (tf)

    Terms
Docs march oil reuter said
 127     0   5      1    1
 144     1  11      1    9
 191     0   2      1    1
 194     0   1      1    1
 211     0   2      1    3
 236     3   7      1    6
 237     1   3      1    0
 242     1   3      1    3
 246     1   4      1    4
 248     1   9      1    5
 273     1   5      1    5
 349     1   4      1    1
 352     1   5      1    1
 353     1   4      1    1
 368     1   3      1    2
 489     1   5      1    2
 502     1   5      1    2
 543     1   3      1    2
 704     1   3      1    3
 708     1   2      1    0
This function call removes those terms which have at least 40 percent sparse elements, i.e., terms which occur
zero times in at least 40 percent of the documents.

Dictionary
A dictionary is a (multi-)set of strings. It is often used to represent relevant terms in text mining. We provide
a class Dictionary implementing such a dictionary concept. It can be created via the Dictionary() constructor,
e.g.,
> (d <- Dictionary(c("prices", "crude", "oil")))
[1] "prices" "crude" "oil"
attr(,"class")
[1] "Dictionary" "character"
and may be passed to the DocumentTermMatrix() constructor. The created matrix is then tabulated against
the dictionary, i.e., only terms from the dictionary appear in the matrix. This allows one to restrict the dimension
of the matrix a priori and to focus on specific terms for distinct text mining contexts, e.g.:
> inspect(DocumentTermMatrix(reuters, list(dictionary = d)))
A document-term matrix (20 documents, 3 terms)

Non-/sparse entries: 41/19
Sparsity           : 32%
Maximal term length: 6
Weighting          : term frequency (tf)

    Terms
Docs crude oil prices
 127     3   5      4
 144     0  11      4
 191     3   2      0
 194     4   1      0
 211     0   2      0
 236     1   7      2
 237     0   3      0
 242     0   3      1
 246     0   4      0
 248     0   9      7
 273     6   5      4
 349     2   4      0
 352     0   5      4
 353     2   4      1
 368     0   3      0
 489     0   5      2
 502     0   5      2
 543     3   3      3
 704     0   3      2
 708     1   2      0

References
I. Feinerer. An introduction to text mining in R. R News, 8(2):19–22, Oct. 2008. URL http://CRAN.R-project.org/doc/Rnews/.
I. Feinerer, K. Hornik, and D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1–54, March 2008. ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05.
