Text Mining Techniques
3. nltk.probability.FreqDist
Purpose:
The FreqDist class in the nltk.probability module is used to
compute the frequency distribution of a list of tokens. It helps in
understanding the frequency of each unique word in a given text.
Input Parameters:
tokens (list of str): The list of tokens obtained through
tokenization.
Output:
FreqDist object: The constructor returns a frequency distribution
object that maps each unique token in the input list to its count.
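As a minimal sketch of the description above, the snippet below builds a FreqDist from a small hand-written token list. The list is a hypothetical stand-in for real tokenizer output; N(), B(), and most_common() are existing FreqDist methods.

```python
from nltk.probability import FreqDist

# Hypothetical, already-tokenized input; in practice this comes from a tokenizer.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

# Count how often each unique token occurs.
freq_dist = FreqDist(tokens)

print(freq_dist["the"])          # count of a single token
print(freq_dist.N())             # total number of tokens counted
print(freq_dist.B())             # number of unique tokens
print(freq_dist.most_common(1))  # the most frequent token with its count
```

Because FreqDist behaves like a dictionary of counts, looking up a token that never occurred returns 0 rather than raising an error.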
4. freq_dist.plot
Purpose:
The plot method of the FreqDist class is used to visualize the
frequency distribution of tokens. It draws a line plot with the
tokens on the x-axis, ordered from most to least frequent, and
their counts on the y-axis.
Input Parameters:
num (int): The number of most common words to display in the
plot, passed as the first positional argument. If omitted, all
tokens are plotted.
cumulative (bool): If True, the plot shows cumulative frequencies;
if False, it shows individual frequencies.
Output:
Matplotlib Plot: The method draws a plot using the Matplotlib
library (which must be installed), displaying the frequency
distribution of words.
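The following sketch shows one way to call plot on a small example. It assumes Matplotlib is installed, selects the non-interactive "Agg" backend so it also runs without a display, and saves the figure to a temporary file; the sentence and the output file name are made up for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
from nltk.probability import FreqDist

# Hypothetical input text, split on whitespace for simplicity.
tokens = "to be or not to be that is the question".split()
freq_dist = FreqDist(tokens)

# First positional argument: how many of the most common tokens to draw.
freq_dist.plot(5, cumulative=False)

# Save the current figure instead of relying on an interactive window.
out_path = os.path.join(tempfile.gettempdir(), "freq_dist.png")
plt.savefig(out_path)
plt.close()
```

Passing cumulative=True instead draws the running total of the counts, which is useful for seeing how much of the text the top few tokens cover.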
Conclusion:
Text tokenization and frequency analysis are fundamental steps in text
mining. The nltk library provides convenient functions for these tasks,
allowing users to break down text into tokens and analyze the frequency
distribution of words. The combination of tokenization and frequency
analysis is essential for gaining insights into the content and structure of
textual data, making it a valuable tool in natural language processing and
information extraction.
2-Named Entity Recognition (NER):
1. spacy.load:
Purpose:
The spacy.load function is used to load a spaCy language
processing model. In the context of NER, a pre-trained model
capable of recognizing named entities is loaded.
Input Parameters:
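model_name (str): The name of the language processing model to
be loaded, passed as the first argument (spaCy itself calls this
parameter name). Common models include "en_core_web_sm" for
English. The model must be installed before it can be loaded.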
Output:
Language Processing Model: The function returns a spaCy
Language object (conventionally named nlp), which is called on
a string to produce a Doc.
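A minimal sketch of loading a model is shown below. It assumes "en_core_web_sm" has been installed separately; the fallback to a blank English pipeline is an assumption added here only so the snippet still runs when the model is missing, and that fallback does not perform NER.

```python
import spacy

MODEL_NAME = "en_core_web_sm"  # must be installed separately

try:
    # Load the pre-trained pipeline by name.
    nlp = spacy.load(MODEL_NAME)
except OSError:
    # Fallback (an assumption of this sketch): tokenizer-only pipeline, no NER.
    nlp = spacy.blank("en")

# Inspect which processing components the pipeline contains.
print(nlp.pipe_names)

# Calling the pipeline on a string returns a Doc object.
doc = nlp("Apple is opening a new office in Berlin.")
print(len(doc))  # number of tokens in the Doc
```

When the pre-trained model is available, pipe_names should include an "ner" component, which is what populates the named entities.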
2. doc.ents:
Purpose:
The ents attribute of a spaCy Doc object is used to access the
named entities recognized in the document.
Input Parameters:
doc (spaCy Doc): The document processed by the spaCy
language processing model.
Output:
Tuple of spaCy Span objects: The attribute holds a tuple of spaCy
Span objects, where each Span represents a named entity along
with its label and its start and end positions in the document.
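To illustrate what doc.ents holds without depending on a downloaded model, the sketch below uses a blank English pipeline and sets two entities by hand. The sentence and labels are hypothetical; with a pre-trained model, the NER component would fill doc.ents automatically.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, so no entities are predicted.
nlp = spacy.blank("en")
doc = nlp("Apple is opening a new office in Berlin.")

# Set entities by hand for illustration (token indices, end exclusive).
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 7, 8, label="GPE")]

# doc.ents is a tuple of Span objects.
print(type(doc.ents))

first = doc.ents[0]
print(first.text, first.label_)  # entity text and its label
print(first.start, first.end)    # token positions within the Doc
```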
3. Iterating over Named Entities:
Purpose:
The code iterates over the named entities in the document to
extract information such as the text, label, and start/end positions
of each entity.
Input Parameters:
named_entities (sequence of spaCy Span objects): The spaCy
Span objects representing named entities, as obtained from
doc.ents.
Output:
Print Statements: The loop prints information about each
named entity, including its text, label, and start/end
positions.
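The loop described above can be sketched as follows. To keep the example runnable and deterministic without a downloaded model, it swaps the pre-trained NER component for a rule-based entity_ruler on a blank pipeline (spaCy v3 style); the patterns and the sentence are hypothetical, and the positions reported here are character offsets, which is an assumption about what "start/end positions" refers to.

```python
import spacy

# Rule-based stand-in for a pre-trained NER model (spaCy v3 API).
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": "Berlin"},
])

doc = nlp("Apple is opening a new office in Berlin.")

# Collect and print text, label, and character offsets for each entity.
rows = []
for ent in doc.ents:
    rows.append((ent.text, ent.label_, ent.start_char, ent.end_char))
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```

The same loop works unchanged on a Doc produced by a pre-trained model such as "en_core_web_sm"; only the way the entities are detected differs.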
Conclusion:
Named Entity Recognition (NER) involves the recognition and classification
of entities such as persons, organizations, locations, etc., in a given text.
The spaCy library provides a powerful tool for this task, and the functions
described above contribute to the NER process. The spacy.load function
loads the language processing model, while the doc.ents attribute allows
access to recognized named entities. The code then demonstrates how to
iterate over these entities to extract relevant information. Overall, spaCy
simplifies the complex task of NER in natural language processing
applications.