Representing TF and TF-IDF transformations in PMML

Villu Ruusmann
Openscoring OÜ

TF
Local Term Frequency (TF) - The frequency of the term in a document.
<TextIndex textField="documentField">
<FieldRef field="termField"/>
</TextIndex>
sklearn.feature_extraction.text.CountVectorizer
org.apache.spark.ml.feature.CountVectorizer

TF-IDF
Global Term Frequency (TF-IDF) - TF, weighted by the "significance" of the term
in the corpus of training documents.
<Apply function="*">
<TextIndex textField="documentField">
<FieldRef field="termField"/>
</TextIndex>
<FieldRef field="termWeightField"/>
</Apply>
sklearn.feature_extraction.text.TfidfTransformer
org.apache.spark.ml.feature.IDF
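
To make the two definitions concrete, here is a minimal Python sketch. It assumes plain whitespace tokenization and a single-word term; the function names and the weight value are made up for illustration and are not part of PMML or of either library listed above.

```python
def term_frequency(document: str, term: str) -> int:
    """Local TF: how many whitespace-separated tokens equal the term."""
    return document.split().count(term)


def tf_idf(document: str, term: str, weight: float) -> float:
    """Global TF-IDF: local TF multiplied by a precomputed term weight."""
    return term_frequency(document, term) * weight


# Hypothetical document and weight, chosen only for illustration.
doc = "happy new year 2017 goodbye 2016"
tf = term_frequency(doc, "2017")
score = tf_idf(doc, "2017", weight=2.0)
```

The weight is learned once from the training corpus and then stays fixed, which is why the PMML expression only needs a multiplication at scoring time.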

PMML encoding (1/2)
The "centralized" TF-IDF function definition:
<DefineFunction name="tf-idf" dataType="double" optype="continuous">
<ParamField name="document"/>
<ParamField name="term"/>
<ParamField name="weight"/>
<Apply function="*">
<TextIndex textField="document">
<FieldRef field="term"/>
</TextIndex>
<FieldRef field="weight"/>
</Apply>
</DefineFunction>

PMML encoding (2/2)
Many "centralized" TF-IDF function invocations:
<DerivedField name="tf-idf(2017)" dataType="float" optype="continuous">
<Apply function="tf-idf">
<FieldRef field="tweetField"/>
<Constant dataType="string">2017</Constant>
<Constant dataType="double">5.4132</Constant>
</Apply>
</DerivedField>
Many "localized" TF-IDF usages:
<Node>
<SimplePredicate field="tf-idf(2017)" operator="lessThan" value="7.25"/>
</Node>
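
As a rough illustration of how the derived field feeds the tree node, the following Python sketch evaluates the same predicate by hand. The weight 5.4132 and the threshold 7.25 come from the snippets above; the function names and the whitespace tokenization are assumptions of this sketch.

```python
IDF_WEIGHT_2017 = 5.4132  # the <Constant dataType="double"> from the DerivedField
THRESHOLD = 7.25          # the SimplePredicate value


def tf_idf_2017(tweet: str) -> float:
    """Stand-in for the DerivedField "tf-idf(2017)" (whitespace tokens only)."""
    tf = tweet.split().count("2017")
    return tf * IDF_WEIGHT_2017


def node_predicate(tweet: str) -> bool:
    """Stand-in for the SimplePredicate: tf-idf(2017) lessThan 7.25."""
    return tf_idf_2017(tweet) < THRESHOLD
```

With these numbers, a tweet that mentions "2017" at most once satisfies the predicate, and a tweet that mentions it twice or more does not.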

PMML TF algorithm
1. Normalize the document.
2. Tokenize the term and the document. Trim tokens by removing leading and
trailing (but not continuation) punctuation characters.
3. Count the occurrences of term tokens in document tokens subject to the
following constraints:
3.1. Case-sensitivity
3.2. Max Levenshtein distance (as measured in the number of
single-character insertions, substitutions or deletions).
4. Transform the count to the final TF metric.
http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndex
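
The steps above can be sketched in Python as follows. This is a simplified illustration, not the reference algorithm: the normalization step is left out, Levenshtein matching is omitted (exact token matching only), whitespace is assumed as the word separator, and the names `tokenize` and `text_index` are hypothetical.

```python
import re
import string


def tokenize(text, separator_re=r"\s+"):
    """Step 2: split on the separator, then trim leading and trailing punctuation.

    Punctuation inside a token (as in "don't") is kept.
    """
    tokens = [t.strip(string.punctuation) for t in re.split(separator_re, text)]
    return [t for t in tokens if t]


def text_index(document, term, case_sensitive=False,
               local_term_weights="termFrequency"):
    """Steps 2 to 4 with exact matching (no Levenshtein tolerance)."""
    if not case_sensitive:
        document, term = document.lower(), term.lower()
    doc_tokens = tokenize(document)
    term_tokens = tokenize(term)
    n = len(term_tokens)
    if n == 0:
        return 0
    # Step 3: slide a window of n tokens over the document tokens.
    count = sum(
        1
        for i in range(len(doc_tokens) - n + 1)
        if doc_tokens[i:i + n] == term_tokens
    )
    # Step 4: transform the raw count to the final TF metric.
    if local_term_weights == "binary":
        return 1 if count > 0 else 0
    return count
```

For example, `text_index("Happy New Year! New year, new plans.", "new year")` counts two occurrences in the default case-insensitive mode.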

String normalization
Ensuring that the unlimited, free-form text input complies with the limited,
standardized vocabulary of the TextIndex element:
<TextIndexNormalization isCaseSensitive="false">
<InlineTable>
<Row>
<string>[\u00c0-\u00c5]</string><stem>a</stem> <regex>true</regex>
</Row>
<Row>
<string>is|are|was|were</string><stem>be</stem> <regex>true</regex>
</Row>
</InlineTable>
</TextIndexNormalization>
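
A minimal Python sketch of what such a normalization table does, assuming the rows are applied in order and each pattern is substituted on the raw string. The `ROWS` list and the `normalize` function are names invented for this sketch; the real TextIndexNormalization element has further attributes that are not modeled here.

```python
import re

# (string, stem, regex) rows mirroring the InlineTable above.
ROWS = [
    (r"[\u00c0-\u00c5]", "a", True),   # accented variants of "A" -> "a"
    (r"is|are|was|were", "be", True),  # forms of "to be" -> "be"
]


def normalize(text, rows=ROWS, case_sensitive=False):
    """Apply every row in order; non-regex rows are matched literally."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for pattern, stem, is_regex in rows:
        if not is_regex:
            pattern = re.escape(pattern)
        text = re.sub(pattern, stem, text, flags=flags)
    return text
```

Because this sketch substitutes on the raw string, a pattern such as `is` also matches inside longer words ("this" becomes "thbe"); adding word boundaries (`\b`) to the patterns would be one way to avoid that.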

String tokenization
Two approaches for string tokenization using regular expressions (REs):
1. Define word separator RE and execute
(Pattern.compile(wordSeparatorRE)).split(string)
2. Define word RE and execute
((Pattern.compile(wordRE)).matcher(string)).findAll()
Popular ML frameworks support both approaches.
PMML 4.2 and 4.3 only support the first approach. Hopefully, PMML 4.4 will
support the second approach as well.
http://mantis.dmg.org/view.php?id=173
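
The difference between the two approaches is easy to see in Python, where `re.split` corresponds to the first and `re.findall` to the second. The function names below are illustrative; the word pattern mirrors the default `token_pattern` of Scikit-Learn's text vectorizers (tokens of two or more word characters).

```python
import re


def split_on_separator(text, word_separator_re=r"\s+"):
    """Approach 1: everything between separators is a token."""
    return [t for t in re.split(word_separator_re, text) if t]


def find_words(text, word_re=r"\b\w\w+\b"):
    """Approach 2: only substrings that match the word pattern are tokens."""
    return re.findall(word_re, text)


text = "Hello, world! It's 2017."
by_separator = split_on_separator(text)  # punctuation stays attached
by_word = find_words(text)               # punctuation and "s" are dropped
```

The first approach leaves punctuation attached to the tokens, which is why the PMML TF algorithm needs a separate trimming step; the second approach never produces it in the first place.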

Counting terms in a document
A "match" is a situation where the edit distance between term tokens [0, length) and
document tokens [i, i + length) (where i is the match position) is less than or equal
to the match threshold.
Match threshold is a function of TextIndex@isCaseSensitive and
TextIndex@maxLevenshteinDistance attribute values. During
case-insensitive matching (the default), the edit distance between two characters
that only differ by case is considered to be 0, whereas during case-sensitive
matching it is considered to be 1.
The matches may overlap if the "length" of term tokens is greater than one.
http://mantis.dmg.org/view.php?id=172
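
A Python sketch of this matching rule. It assumes that, for a multi-token term, the edit distances of the individual token pairs are summed and compared against the threshold, and that case-insensitive matching can be modeled by lowercasing both sides first. Both are assumptions of this sketch, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance counting single-character insertions, substitutions, deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def count_matches(doc_tokens, term_tokens, max_distance=0, case_sensitive=False):
    """Count positions i where the window [i, i + length) is within the threshold."""
    if not term_tokens:
        return 0
    if not case_sensitive:
        # Characters that differ only by case then contribute distance 0.
        doc_tokens = [t.lower() for t in doc_tokens]
        term_tokens = [t.lower() for t in term_tokens]
    n = len(term_tokens)
    count = 0
    for i in range(len(doc_tokens) - n + 1):
        window = doc_tokens[i:i + n]
        distance = sum(levenshtein(t, d) for t, d in zip(term_tokens, window))
        if distance <= max_distance:
            count += 1
    return count
```

Since every window position is tested independently, a document `["a", "a", "a"]` contains two overlapping matches of the term `["a", "a"]`.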

Interoperability with Scikit-Learn (1/2)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(..,
strip_accents = .., # If not None, handle using text normalization
analyzer = "word", # Set to "word"
preprocessor = .., # If not None, handle using text normalization
tokenizer = .., # If not None, handle using text tokenization
token_pattern = None, # Set to None. Use the "tokenizer" attribute instead
lowercase = .., # If True, convert the document to lowercase and
# perform term matching in a case-insensitive manner
binary = .., # Determines the transformation from counts to the final TF
# metric ("binary" for True, and "termFrequency" for False)
sublinear_tf = .., # If True, apply scaling to final TF metric
norm = None # Set to None
)
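
The effect of the `binary` and `sublinear_tf` options on a single raw count can be sketched as below. This mirrors the Scikit-Learn side only (non-zero counts become 1 when `binary` is set, and `sublinear_tf` replaces tf with 1 + ln(tf)); the helper name `final_tf` is made up for this sketch and says nothing about how a converter encodes the scaling in PMML.

```python
import math


def final_tf(count, binary=False, sublinear_tf=False):
    """Turn a raw term count into the TF value that is later multiplied by IDF."""
    tf = (1 if count > 0 else 0) if binary else count
    if sublinear_tf and tf > 0:
        tf = 1 + math.log(tf)  # natural logarithm; zero counts stay zero
    return tf
```

Note that `norm` is set to None in the configuration above, so no per-document normalization follows this step.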

Interoperability with Scikit-Learn (2/2)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.feature_extraction.text import Splitter
pipeline = PMMLPipeline([
("tf-idf", TfidfVectorizer(analyzer = "word", preprocessor = None,
strip_accents = None, tokenizer = Splitter(), token_pattern = None,
stop_words = "english", ngram_range = (1, 2), binary = False,
use_idf = True, norm = None))
])
from sklearn2pmml import sklearn2pmml
sklearn2pmml(pipeline, "pipeline.pmml")

Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
