WINSEM2022-23 - CSI3005 - ETH - VL2022230503219 - ReferenceMaterialI - FriFeb1700 00 00IST2023 - TextandDocumentVisualization

The document discusses text and document visualization. It describes text data as collections of documents including articles, books, emails, web pages, etc. It explains that text can be analyzed as data by looking at word meanings, relations, orderings, and hierarchies. It then outlines a common text processing pipeline involving tokenization, stemming/lemmatization, and removing stop words. Finally, it discusses several techniques for visualizing document content and structure at both the single document and collection level, including word clouds, word trees, text arcs, and arc diagrams.

Uploaded by

M Ramani Devi
Copyright
© All Rights Reserved

Text and Document Visualization

Text data?
• Huge resources of information, from libraries to e-mail archives
• Documents
• Articles, books and novels
• Computer programs
• E-mails, web pages, blogs
• Tags, comments
Text data
Collection of documents
• Messages (e-mail, blogs, tags, comments)
• Social networks (personal profiles)
• Academic collaborations (publications)
Text as Data

• Words have meanings and relations


– Correlations: Hong Kong, San Francisco, Bay Area
– Order: April, February, January, June, March, May
– Membership: Tennis, Running, Swimming, Hiking, Piano
– Hierarchy, antonyms & synonyms, entities
• Is text nominal or ordinal data?
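The "Order" example above lists months alphabetically, which is what happens when text is treated as purely nominal. A minimal sketch of the difference, assuming a hand-written calendar ranking (the names `CALENDAR`, `alphabetical`, and `chronological` are illustrative only):

```python
# Month names in calendar order; this list is the ordinal knowledge
# that the raw strings alone do not carry.
CALENDAR = ["January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November", "December"]

months = ["April", "February", "January", "June", "March", "May"]

# Nominal treatment: plain string sort gives alphabetical order.
alphabetical = sorted(months)

# Ordinal treatment: sort by position in the calendar.
chronological = sorted(months, key=CALENDAR.index)

print(alphabetical)
print(chronological)
```

Whether a text field is nominal or ordinal therefore depends on what the words mean, not on the strings themselves.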
Text Processing Pipeline
Tokenization: segment text into terms
• Special cases? e.g., “San Francisco”, “L’ensemble”, “U.S.A.”
• Remove stop words? e.g., “a”, “an”, “the”, “to”, “be”?
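A minimal sketch of the tokenization step, assuming a tiny stop-word list and a hand-maintained phrase list (`STOP_WORDS`, `PHRASES`, and `tokenize` are hypothetical names, not part of any library). It keeps abbreviations such as "U.S.A." and apostrophe forms such as "L’ensemble" as single tokens:

```python
import re

# Assumed, illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "to", "be"}

# Hypothetical multi-word terms that should survive as one token.
PHRASES = ["San Francisco"]

# A token is either a dotted abbreviation (U.S.A.) or a word that may
# contain internal apostrophes (L'ensemble) or protected underscores.
TOKEN_RE = re.compile(r"(?:[A-Za-z]\.){2,}|[A-Za-z_]+(?:['’][A-Za-z_]+)*")

def tokenize(text, remove_stop_words=True):
    # Protect known phrases by joining their words with an underscore.
    for phrase in PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    # Extract tokens, restore the spaces in phrases, and lowercase.
    tokens = [t.replace("_", " ").lower() for t in TOKEN_RE.findall(text)]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(tokenize("The plan is to be in San Francisco, U.S.A."))
```

Whether to drop stop words depends on the task: they add little to a word cloud, but phrase-based views such as a word tree may need them.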
Stemming: one means of normalizing terms
• Reduce terms to their “root”, e.g., with Porter’s algorithm for English
• e.g., automate(s), automatic, automation all map to automat
• For visualization, we want to reverse the stemming so that labels are readable words
• Simple solution: map each stem to its most frequent original word
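The "simple solution" above can be sketched as follows. The stemmer here is a toy suffix stripper written only for this illustration; it is not Porter's algorithm, and `toy_stem` and `stem_labels` are hypothetical names:

```python
from collections import Counter, defaultdict

# Toy suffix list for illustration only; this is NOT Porter's algorithm.
SUFFIXES = ("ion", "ic", "es", "e", "s")

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_labels(tokens):
    # Count how often each original word occurs under each stem.
    words_by_stem = defaultdict(Counter)
    for token in tokens:
        words_by_stem[toy_stem(token)][token] += 1
    # Label each stem with its most frequent original word.
    return {stem: counts.most_common(1)[0][0]
            for stem, counts in words_by_stem.items()}

tokens = ["automate", "automates", "automatic", "automation",
          "automation", "automation", "automatic"]
print(stem_labels(tokens))
```

Counting is done on stems, while the display label is taken from the word readers are most likely to have seen in the text.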
Stemming vs. Lemmatization
Stop words
Bag of Words Model
• A document ≈ vector of term weights
– Each dimension corresponds to a term (10,000+)
– Each value represents the relevance of that term to the document
– For example, simple term counts
• Aggregate into a document × term matrix
• Document vector space model
Document × Term matrix
• Each document is a vector of term weights
• Simplest weighting is to just count occurrences
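A minimal sketch of a count-based document × term matrix, assuming documents have already been tokenized (`document_term_matrix` is a hypothetical helper name):

```python
from collections import Counter

def document_term_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, matrix of counts)."""
    # One column per distinct term, in sorted order for reproducibility.
    vocabulary = sorted({term for doc in docs for term in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        # One row per document: the count of each vocabulary term.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = [["text", "data", "text"], ["data", "visualization"]]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
print(matrix)
```

Each row is one document's vector in the vector space model. With a realistic vocabulary of 10,000+ terms most entries are zero, so a sparse representation is usually preferable to nested lists.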
Computing Weights
• Let Tf(w) be the term frequency, i.e., the number of times that word w occurs in the document.
• Let Df(w) be the document frequency, i.e., the number of documents that contain the word w.
• Let N be the number of documents.
• We define TfIdf(w) as
$\mathrm{TfIdf}(w) = \mathrm{Tf}(w) \times \log\left(\frac{N}{\mathrm{Df}(w)}\right)$
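A minimal sketch of the weighting, assuming the common definition TfIdf(w) = Tf(w) × log(N / Df(w)) with the natural logarithm (the log base and the helper name `tf_idf` are assumptions of this sketch):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Df(w): number of documents that contain w at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        # Tf(w): number of occurrences of w in this document.
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

docs = [["text", "data", "text"], ["data", "visualization"]]
weights = tf_idf(docs)
print(weights)
```

Here "data" appears in every document, so log(N / Df) = log(1) = 0 and its weight vanishes, while "text" is frequent in one document and absent from the other, so it scores highest.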
Bag of Words Model
Example
Vector Space Representation
Visualizing Document Content
Single document visualization
Word Clouds
WordTree
TextArc
Arc Diagrams
