Natural Language Processing
Feature extraction refers to the process of transforming raw data into numerical features that
can be processed while preserving the information in the original data set. It yields better
results than applying machine learning directly to the raw data.
In supervised classification:
(a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets capture the basic information about each input that should be used to classify it. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model.
(b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted labels.
Gender Identification :
Male and female names have some distinctive characteristics: names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and t are likely to be male.
The first step in creating a classifier is deciding what features of the input are relevant, and how
to encode those features.
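For example, a minimal extractor in the spirit of the NLTK book's gender example might encode only the final letter of the name (the function name gender_features used here and below is illustrative):

def gender_features(word):
    # Encode just the last letter of the name as a single feature.
    return {'last_letter': word[-1].lower()}

For instance, gender_features('Shrek') returns {'last_letter': 'k'}.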
The dictionary that is returned by this function is called a feature set and maps from features’
names to their values. Feature names are case-sensitive strings that typically provide a short
human-readable description of the feature. Feature values are values with simple types, such as
Booleans, numbers, and strings.
Once we’ve defined a feature extractor, we need to prepare a list of examples and corresponding class labels:
>>> random.shuffle(names)
>>> classifier.classify(gender_features('vikram'))
>>> classifier.classify(gender_features('amrutha'))
Next, we use the feature extractor to process the names data and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new naive Bayes classifier.
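A sketch of the whole pipeline, assuming the NLTK names corpus has been downloaded (the variable names and the 500-name test split are illustrative):

import random
import nltk
from nltk.corpus import names

def gender_features(word):
    # Feature extractor from above: just the last letter of the name.
    return {'last_letter': word[-1].lower()}

# Build (name, label) pairs from the names corpus and shuffle them.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# Convert each name to a feature set, then split into training and test data.
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

# Train a naive Bayes classifier and try it on held-out data and unseen names.
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
print(classifier.classify(gender_features('vikram')))
print(classifier.classify(gender_features('amrutha')))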
Finally, we can examine the classifier to determine which features it found most effective for
distinguishing the names’ genders
>>> classifier.show_most_informative_features(5)
Output :
This listing shows that the names in the training set that end in a are female 38 times more often
than they are male, but names that end in k are male 31 times more often than they are female.
These ratios are known as likelihood ratios.
Selecting relevant features and deciding how to encode them for a learning method can have an enormous impact on the learning method’s ability to extract a good model. Much of the interesting work in building a classifier is deciding what features might be relevant, and how we can represent them. Although it’s often possible to get decent performance by using a fairly simple and obvious set of features, there are usually significant gains to be had by using carefully constructed features based on a thorough understanding of the task at hand.
Typically, feature extractors are built through a process of trial-and-error, guided by intuitions about what information is relevant to the problem.
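As an illustration of a more elaborate, hand-crafted feature set (and one that can easily overfit a small corpus), an extractor along the lines of the NLTK book's gender_features2 might look at several properties of the name at once; the exact features here are illustrative:

import string

def gender_features2(name):
    features = {}
    features['first_letter'] = name[0].lower()
    features['last_letter'] = name[-1].lower()
    for letter in string.ascii_lowercase:
        # How often each letter occurs, and whether it occurs at all.
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = (letter in name.lower())
    return features

With this many sparse features, a classifier trained on a small corpus is more likely to overfit, which is one motivation for the error-analysis workflow described next.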
Once an initial set of features has been chosen, a very productive method for refining the feature set is error analysis. First, we select a development set, containing the corpus data for creating the model. This development set is then subdivided into the training set and the dev-test set.
The training set is used to train the model, and the dev-test set is used to perform error analysis.
The test set serves in our final evaluation of the system. It is important that we employ a
separate dev-test set for error analysis, rather than just using the test set. The division of the corpus data into these different subsets is shown below.
Organization of corpus data for training supervised classifiers. The corpus data is divided into two
sets: the development set and the test set. The development set is often further subdivided into a
training set and a dev-test set.
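A sketch of this workflow for the gender example, reusing the labeled_names list, the gender_features extractor, and the imports from the pipeline sketch above (the split sizes are illustrative):

# Split the corpus into training, dev-test, and test subsets.
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

# Collect the dev-test examples that the classifier gets wrong; inspecting
# these errors suggests new features to add or better ways to encode them.
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))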
This error analysis procedure can then be repeated, checking for patterns in the errors that are made by the newly improved classifier. Each time the error analysis procedure is repeated, we
made by the newly improved classifier. Each time the error analysis procedure is repeated, we
should select a different dev-test/training split.
But once we’ve used the dev-test set to help us develop the model, we can no longer trust that it
will give us an accurate idea of how well the model would perform on new data. It is therefore
important to keep the test set separate, and unused, until our model development is complete. At
that point, we can use the test set to evaluate how well our model will perform on new input values.
Document Classification :
Earlier, we saw several examples of corpora in which documents have been labeled with categories. Using
these corpora, we can build classifiers that will automatically tag new documents with appropriate
category labels. First, we construct a list of documents, labeled with the appropriate categories. For
this example, we’ve chosen the Movie Reviews Corpus, which categorizes each review as positive or
negative.
>>> random.shuffle(documents)
Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to. For document topic identification, we can define a feature for each word, indicating whether the document contains that word. To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2,000 most frequent words in the overall corpus. We can then define a feature extractor that simply checks whether each of these words is present in a given document.
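A sketch of this document classifier, assuming the NLTK movie_reviews corpus has been downloaded (the 2,000-word cutoff matches the description above; the 100-review test split is illustrative):

import random
import nltk
from nltk.corpus import movie_reviews

# Build (document, category) pairs, where a document is its list of words.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Use the 2,000 most frequent words in the corpus as candidate features.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    # One Boolean feature per frequent word: does the document contain it?
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))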
Part-of-Speech Tagging :
Earlier, we built a regular expression tagger that chooses a part-of-speech tag for a word by looking at the internal makeup of the word. However, that regular expression tagger had to be handcrafted. Instead, we can train a classifier to work out which suffixes are most informative. Let’s begin by finding the most common suffixes:
...     suffix_fdist[word[-1:]] += 1
...     suffix_fdist[word[-2:]] += 1
...     suffix_fdist[word[-3:]] += 1
['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the',
'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l',
'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or',
're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', ...]
Next, we’ll define a feature extractor function that checks a given word for these suffixes:
... features = {}
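A fuller sketch of that extractor, together with the suffix counts it relies on and a classifier trained on the tagged news portion of the Brown corpus (the 100-suffix cutoff and the choice of a decision tree follow the NLTK book's example, but are otherwise illustrative):

import nltk
from nltk.corpus import brown

# Most common word-final character sequences, computed as above.
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

def pos_features(word):
    # One Boolean feature per common suffix.
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

# Train and evaluate on the tagged news text (training a decision tree on the
# full word list can be slow; this is only a sketch).
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(w), tag) for (w, tag) in tagged_words]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))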
Sequence Classification :
In order to capture the dependencies between related classification tasks, we can use joint classifier
models, which choose an appropriate labeling for a collection of related inputs. In the case of part-of-speech tagging, a variety of different sequence classifier models can be used to jointly choose
part-of-speech tags for all the words in a given sentence.
One such strategy, known as consecutive (or greedy sequence) classification, begins by choosing a part-of-speech tag for the first word in the sentence, and then chooses the tag for each subsequent word based on the word itself and the predicted tag for the previous word.
To demonstrate this strategy, we must augment our feature extractor function to take a history argument, which provides a list of the tags that we’ve predicted for the sentence so far. Each tag in history corresponds to a word in sentence. But note that history will only contain tags for words
we’ve already classified, that is, words to the left of the target word. Thus, although it is possible to
look at some features of words to the right of the target word, it is not possible to look at the tags
for those words.
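A sketch of such a history-aware extractor, and of a greedy left-to-right tagger built on top of it, loosely following the ConsecutivePosTagger example in the NLTK book (the class and feature names mirror that example):

import nltk
from nltk.corpus import brown

def pos_features(sentence, i, history):
    # Features of the target word plus the tag predicted for the previous word.
    features = {'suffix(1)': sentence[i][-1:],
                'suffix(2)': sentence[i][-2:],
                'suffix(3)': sentence[i][-3:]}
    if i == 0:
        features['prev-word'] = '<START>'
        features['prev-tag'] = '<START>'
    else:
        features['prev-word'] = sentence[i-1]
        features['prev-tag'] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        # Build training data by replaying the left-to-right tagging process,
        # using the gold-standard tags as the history.
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                train_set.append((pos_features(untagged_sent, i, history), tag))
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        # Tag greedily from left to right, feeding earlier predictions back in.
        history = []
        for i, word in enumerate(sentence):
            tag = self.classifier.classify(pos_features(sentence, i, history))
            history.append(tag)
        return list(zip(sentence, history))

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
print(tagger.evaluate(test_sents))  # renamed accuracy() in newer NLTK releases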
6.3 Evaluation :
In order to decide whether a classification model is accurately capturing a pattern, we must evaluate
that model. The result of this evaluation is important for deciding how trustworthy the model is, and
for what purposes we can use it. Evaluation can also be an effective tool for guiding us in making
future improvements to the model.
Most evaluation techniques calculate a score for a model by comparing the labels that it generates
for the inputs in a test set (or evaluation set) with the correct labels for those inputs. This test set
typically has the same format as the training set. However, it is very important that the test set be
distinct from the training corpus: if we simply reused the training set as the test set, then a model
that simply memorized its input, without learning how to generalize to new examples, would
receive misleadingly high scores.
Another consideration when choosing the test set is the degree of similarity between instances in the test set and those in the development set. The more similar these two datasets are, the less confident we can be that evaluation results will generalize to other datasets.
For example, consider the part-of-speech tagging task. At one extreme, we could create the training
set and test set by randomly assigning sentences from a data source that reflects a single genre, such
as news:
>>> random.shuffle(tagged_sents)
Since the training set and test set are taken from the same genre, we cannot be confident that evaluation results would generalize to other genres. Worse, because of the call to random.shuffle(), the test set contains sentences taken from the same documents that were used for training. If there is any consistent pattern within a document, that pattern will be reflected in both the development set and the test set. A somewhat better approach is to ensure that the training set and test set are taken from different documents.
If we want to perform a more stringent evaluation, we can draw the test set from documents that
are less closely related to those in the training set:
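Sketches of the three splits just described, using the tagged sentences of the Brown corpus (the news and fiction genres and the 10% test size are illustrative):

import random
from nltk.corpus import brown

# 1. Random split within a single genre: sentences from the same documents
#    can end up in both the training data and the test data.
tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

# 2. Split by document (fileid), so that training and test sentences never
#    come from the same file.
file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_sents = brown.tagged_sents(file_ids[size:])
test_sents = brown.tagged_sents(file_ids[:size])

# 3. The most stringent split: train on one genre, test on another.
train_sents = brown.tagged_sents(categories='news')
test_sents = brown.tagged_sents(categories='fiction')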
Another instance where accuracy scores can be misleading is in “search” tasks, such as information
retrieval, where we are attempting to find documents that are relevant to a particular task. Since the
number of irrelevant documents far outweighs the number of relevant documents, the accuracy
score for a model that labels every document as irrelevant would be very close to 100%.
It is therefore conventional to employ a different set of measures for search tasks, based on the
number of items in each of the four categories:
• True positives are relevant items that we correctly identified as relevant.
• True negatives are irrelevant items that we correctly identified as irrelevant.
• False positives (or Type I errors) are irrelevant items that we incorrectly identified as relevant.
• False negatives (or Type II errors) are relevant items that we incorrectly identified as irrelevant.
Given these four numbers, we can define the following metrics:
• Precision, which indicates how many of the items that we identified were relevant, is TP/(TP+FP).
• Recall, which indicates how many of the relevant items we identified, is TP/(TP+FN).
• The F-Measure (or F-Score), which combines the precision and recall to give a
single score, is defined to be the harmonic mean of the precision and recall
(2 × Precision × Recall)/(Precision+Recall).
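As a concrete rendering of these definitions, a small helper function (hypothetical, not part of NLTK) can compute all three measures from the counts:

def precision_recall_f(tp, fp, fn):
    # Precision: what fraction of the items we flagged were actually relevant?
    precision = tp / (tp + fp)
    # Recall: what fraction of the relevant items did we manage to flag?
    recall = tp / (tp + fn)
    # F-Measure: harmonic mean of precision and recall.
    f_measure = (2 * precision * recall) / (precision + recall)
    return precision, recall, f_measure

# Example with made-up counts: 40 true positives, 10 false positives,
# 20 false negatives.
print(precision_recall_f(tp=40, fp=10, fn=20))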
Naive Bayes Classifiers :
In naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a
given input value. To choose a label for an input value, the naive Bayes classifier begins by
calculating the prior probability of each label, which is determined by checking the frequency of each
label in the training set.
The contribution from each feature is then combined with this prior probability, to arrive at a
likelihood estimate for each label. The label whose likelihood estimate is the highest is then assigned
to the input value.
In the training corpus, most documents are automotive, so the classifier starts out at a point closer
to the “automotive” label. But it then considers the effect of each feature.
In this example, the input document contains the word dark, which is a weak indicator for murder
mysteries, but it also contains the word football, which is a strong indicator for sports documents.
After every feature has made its contribution, the classifier checks which label it is closest to, and
assigns that label to the input.
Individual features make their contribution to the overall decision by “voting against” labels that
don’t occur with that feature very often. In particular, the likelihood score for each label is reduced
by multiplying it by the probability that an input value with that label would have the feature.
Calculating label likelihoods with naive Bayes. Naive Bayes begins by calculating the prior probability
of each label, based on how frequently each label occurs in the training data. Every feature then
contributes to the likelihood estimate for each label, by multiplying it by the probability that input
values with that label will have that feature. The resulting likelihood score can be thought of as an
estimate of the probability that a randomly selected value from the training set would have both the
given label and the set of features, assuming that the feature probabilities are all independent.
A Bayesian Network Graph illustrating the generative process that is assumed by the naive Bayes
classifier. To generate a labeled input, the model first chooses a label for the input, and then it
generates each of the input’s features based on that label. Every feature is assumed to be entirely
independent of every other feature, given the label.
Another way of understanding the naive Bayes classifier is that it chooses the most likely label for an input, under the assumption that every input value is generated by first choosing a class label for that input value, and then generating each feature, entirely independent of every other feature. Of course, this assumption is unrealistic; features are often highly dependent on one another. This simplifying assumption, known as the naive Bayes assumption (or independence assumption), makes it much easier to combine the contributions of the different features, since we don’t need to worry about how they should interact with one another.
Based on this assumption, we can calculate an expression for P(label|features), the probability that
an input will have a particular label given that it has a particular set of features. To choose a label for
a new input, we can then simply pick the label l that maximizes P(l|features).
This probability is simply the probability that an input has a particular label and the specified set of features, divided by the probability that it has the specified set of features:

P(label|features) = P(features, label) / P(features)

Because of the independence assumption, the joint probability in the numerator factors into the prior times one term per feature:

P(features, label) = P(label) × P(f1|label) × P(f2|label) × ... × P(fn|label)

The numerator is exactly the quantity we discussed earlier as the label likelihood: P(label) is the prior probability for a given label, and each P(f|label) is the contribution of a single feature to the label likelihood. Since P(features) is the same for every label, choosing the label that maximizes P(label|features) amounts to choosing the label with the highest likelihood score.
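A minimal sketch of this calculation in code, with toy probability tables chosen purely for illustration (they are not estimated from any real corpus), mirroring the sports/murder-mystery/automotive example above:

# Toy prior probabilities and per-label feature probabilities.
priors = {'automotive': 0.6, 'sports': 0.3, 'murder mystery': 0.1}
feature_probs = {
    'automotive':     {'contains(football)': 0.02, 'contains(dark)': 0.02},
    'sports':         {'contains(football)': 0.30, 'contains(dark)': 0.01},
    'murder mystery': {'contains(football)': 0.01, 'contains(dark)': 0.10},
}

def naive_bayes_label(features):
    # Score each label as P(label) * product of P(f|label); pick the maximum.
    scores = {}
    for label, prior in priors.items():
        score = prior
        for f in features:
            score *= feature_probs[label].get(f, 0.01)  # crude default for unseen features
        scores[label] = score
    return max(scores, key=scores.get)

# The document starts out closest to 'automotive' (the largest prior), but the
# strong 'football' feature pulls the final decision to 'sports'.
print(naive_bayes_label(['contains(football)', 'contains(dark)']))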