They Are Basically A Set of Co-Occurring Words Within A Given Window
One of the most common ways of extracting features in text mining and NLP is the concept of n-grams of text. An n-gram is just a sequence of tokens, which are generally words or sequences of characters. They are basically a set of co-occurring words within a given window. For example, take the sentence "I like to study Indian art and culture". This sentence consists of 8 words, and moving one word forward generates the next n-gram: if n=1 the first n-gram is just "I", and if n=2 it is "I like". The unigrams are:
I
Like
To
Study
Indian
Art
And
Culture
So for n=2 you have 7 bigrams in this case. Notice that we moved from I->like to like->to to to->study, etc., essentially
moving one word forward to generate the next bigram.
If X = number of words in a given sentence K, the number of n-grams for sentence K would be:

Ngrams(K) = X - (N - 1)
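The sliding-window idea above can be sketched in a few lines of Python. This is a minimal illustration (the function name `ngrams` is just a placeholder, not a library API); it also checks the count formula Ngrams(K) = X - (N - 1):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list, one word at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I like to study Indian art and culture"
tokens = sentence.split()        # X = 8 words

unigrams = ngrams(tokens, 1)     # 8 unigrams
bigrams = ngrams(tokens, 2)      # 7 bigrams, since X - (N - 1) = 8 - 1

print(bigrams[:3])   # [('I', 'like'), ('like', 'to'), ('to', 'study')]
assert len(bigrams) == len(tokens) - (2 - 1)
```

Each step of the list comprehension advances the window by one token, exactly the "move one word forward" behavior described above.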
N-grams are used for a variety of different tasks. For example, when developing a language model, n-grams are used to
develop not just unigram models but also bigram and trigram models. Google and Microsoft have developed web-scale n-
gram models that can be used in a variety of tasks such as spelling correction, word breaking, and text summarization.
Here is a publicly available web-scale n-gram model by Microsoft: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx. Here is a paper that uses web n-gram models for text summarization: "Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions".
Another use of n-grams is for developing features for supervised Machine Learning models such as SVMs, MaxEnt models,
Naive Bayes, etc. The idea is to use tokens such as bigrams in the feature space instead of just unigrams. But be
warned: in my personal experience, and in various research papers that I have reviewed, adding bigrams and
trigrams to your feature space may not necessarily yield any significant improvement. The only way to know is to try
it!
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can
be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from
a text or speech corpus.