Retrieval and clustering of documents
Measuring similarity for retrieval
Given a set of documents, a similarity measure for retrieval is used to determine how many documents are relevant to a particular category.
Cosine similarity for retrieval
Cosine similarity is a measure of similarity between two vectors of n dimensions, obtained by finding the cosine of the angle between them; it is often used to compare documents in text mining. Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is expressed using a dot product and magnitudes as
$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$$
Cosine similarity for retrieval
For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison. The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating independence, and in-between values indicating intermediate similarity or dissimilarity.
Cosine similarity for retrieval
In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since term frequencies (and tf-idf weights) cannot be negative. Equivalently, the angle between two term frequency vectors cannot be greater than 90°.
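As a minimal sketch of the formula above, the following Python computes the cosine similarity of term frequency vectors. The whitespace tokenizer, the helper names, the sample sentences, and the choice to return 0 for an empty document are assumptions made for illustration, not part of the original slides.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Very rough tokenizer: lowercase the text and split on whitespace."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption of this sketch: empty documents match nothing
    return dot / (norm_a * norm_b)


doc1 = term_frequencies("data mining finds patterns in data")
doc2 = term_frequencies("text mining finds patterns in text")
doc3 = term_frequencies("the weather is sunny today")

sim_12 = cosine_similarity(doc1, doc2)  # four shared terms
sim_13 = cosine_similarity(doc1, doc3)  # no shared terms
```

Because every term count is non-negative, the scores stay between 0 and 1, in line with the range stated above.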
Web-based document search and link analysis
Link analysis has been used successfully for deciding which web pages to add to the collection of documents and how to order the documents matching a user query (i.e., how to rank pages). It has also been used to categorize web pages, to find pages that are related to given pages, to find duplicated web sites, and to address various other problems related to web information retrieval.
Link analysis
- A link from page A to page B is a recommendation of page B by the author of page A.
- If page A and page B are connected by a link, the probability that they are on the same topic is higher than if they are not connected.
Applications
- ranking query results (PageRank)
- crawling
- finding related pages
- computing web page reputations and geographic scope
- prediction
- categorizing web pages
- computing statistics of web pages and of search engines
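To make the ranking application concrete, here is a small sketch of PageRank computed by power iteration. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page example graph are all assumptions chosen for illustration; the slides do not specify any of them.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A link is a recommendation: pass rank on to the linked pages.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
best_page = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming recommendations ends up with the highest score, and the scores sum to 1.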
Document matching
Document matching is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.
Steps involved in document matching
A document matching system has two main tasks:
- Find the documents relevant to a user query.
- Evaluate the matching results and sort them according to relevance, using algorithms such as PageRank.
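The two tasks can be sketched as follows. This example ranks by cosine similarity over tf-idf weights rather than by PageRank, since free-text records carry no link structure; the smoothed idf formula, the whitespace tokenizer, the function names, and the sample records are assumptions of this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank_documents(query, documents):
    """Return (score, index) pairs for the matching documents, best first."""
    doc_tokens = [tokenize(d) for d in documents]
    n = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))

    def weights(tokens):
        tf = Counter(tokens)
        # Smoothed idf (an assumption); terms unseen in the collection are dropped.
        return {t: c * math.log(1 + n / df[t]) for t, c in tf.items() if df[t]}

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = weights(tokenize(query))
    scored = [(cosine(q, weights(tokens)), i) for i, tokens in enumerate(doc_tokens)]
    # Task 1: keep only documents that match. Task 2: sort them by relevance.
    return sorted((s for s in scored if s[0] > 0), reverse=True)


records = [
    "two bedroom house for sale near the park",
    "stock markets fell sharply on monday",
    "house prices rise as markets recover",
]
results = rank_documents("house for sale", records)
ranked_indices = [index for _, index in results]
```

The record sharing three query terms is ranked first, the record sharing one term second, and the unrelated record is filtered out.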
k-means clustering
Given a set of observations (x_1, x_2, …, x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k < n), S = {S_1, S_2, …, S_k}, so as to minimize the within-cluster sum of squares:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2$$
where μ_i is the mean of the points in S_i.
K-Means algorithm
0. Input: D ::= {d_1, d_2, …, d_n}; k ::= the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d from the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute that cluster's centroid
6. Until the centroids don't change
7. Output: k clusters of documents
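The steps above can be sketched in Python as follows. Note that this sketch uses the common batch variant (all documents are reassigned before the centroids are recomputed) instead of the one-document-at-a-time update in the listing, and it uses squared Euclidean distance as the closeness measure. Taking the first k documents as initial centroids, the iteration cap, and the sample points are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def k_means(docs, k, max_iterations=100):
    # Step 1: naive initialization with the first k document vectors.
    centroids = [tuple(d) for d in docs[:k]]
    for _ in range(max_iterations):
        # Steps 3-5 (batch form): put every document in its closest cluster.
        clusters = [[] for _ in range(k)]
        for d in docs:
            nearest = min(range(k), key=lambda j: squared_distance(d, centroids[j]))
            clusters[nearest].append(d)
        # Recompute the centroids; an empty cluster keeps its old centroid.
        new_centroids = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        # Step 6: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids


docs = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
clusters, centroids = k_means(docs, 2)
```

Running the sketch with a different ordering of `docs` changes the initial centroids, which is one way to observe the sensitivity to initialization.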
Pros and cons of k-means
Advantages:
- linear time complexity
- works relatively well in low-dimensional spaces
Drawbacks:
- distance computation in high-dimensional spaces
- the centroid vector may not summarize the cluster's documents well
- the initial k clusters affect the quality of the final clusters
Hierarchical clustering
0. Input: D ::= {d_1, d_2, …, d_n}
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, to form a new cluster KL
4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until there is a single cluster (or a specified number of clusters)
6. Output: dendrogram of clusters
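A compact sketch of this agglomerative procedure is shown below. It uses single-link similarity between clusters (the similarity of their closest members), which is only one of several possible choices, and for brevity it recomputes similarities on each pass instead of maintaining the SIM matrix. The function names, the negative-distance similarity, and the sample values are assumptions for illustration.

```python
from itertools import combinations


def agglomerative(items, similarity, num_clusters=1):
    """Merge clusters bottom-up; return the final clusters and the merge history."""
    clusters = [(i,) for i in range(len(items))]  # each item starts alone
    merges = []  # the sequence of merges describes the dendrogram

    def cluster_similarity(a, b):
        # Single link: similarity of the most similar pair of members.
        return max(similarity(items[i], items[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Step 3: find and merge the two most similar clusters.
        a, b = max(combinations(clusters, 2), key=lambda pair: cluster_similarity(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return clusters, merges


values = [1.0, 1.2, 5.0, 5.1, 9.0]
# Hypothetical similarity: the closer two numbers, the higher the score.
final_clusters, merges = agglomerative(values, lambda x, y: -abs(x - y), num_clusters=2)
```

Clusters are reported as tuples of item indices, so `merges` can be read from first to last as the dendrogram built from the leaves upward.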
Pros and cons of hierarchical clustering
Advantages:
- produces better-quality clusters
- works relatively well in low-dimensional spaces
Drawbacks:
- distance computation in high-dimensional spaces
- quadratic time complexity
The EM algorithm for clustering
Let the analyzed object be described by two random variables, x and y, which are assumed to have a probability distribution function p(x, y | θ).
The distribution p(x, y | θ) is known up to its parameter(s) θ. It is assumed that we are given a set of samples {x_1, …, x_n} independently drawn from the distribution.
The Expectation-Maximization (EM) algorithm is an optimization procedure that computes the maximum-likelihood (ML) estimate of the unknown parameter θ when only incomplete data are presented (y is unknown). In other words, the EM algorithm maximizes the likelihood function
$$L(\theta) = \prod_{i=1}^{n} \sum_{y} p(x_i, y \mid \theta)$$
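As an illustration only, the sketch below runs EM on a mixture of two one-dimensional Gaussians, where the hidden variable is the component each sample came from. The slides do not name a specific model, so the Gaussian mixture, the initial guesses, the variance floor, the iteration count, and the sample data are all assumptions of this example.

```python
import math


def gaussian(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(xs, iterations=50):
    # Crude initial guess for the parameters (weights, means, variances).
    weights = [0.5, 0.5]
    means = [min(xs), max(xs)]
    variances = [1.0, 1.0]
    for _ in range(iterations):
        # E-step: probability of each hidden component given each sample.
        resp = []
        for x in xs:
            p = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from the weighted samples.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(spread, 1e-6)  # floor avoids a collapsing variance
    return weights, means, variances


samples = [0.8, 1.0, 1.2, 0.9, 1.1, 4.8, 5.0, 5.2, 4.9, 5.1]
weights, means, variances = em_two_gaussians(samples)
```

Each sample can then be assigned to the component with the larger posterior probability, which turns the fitted mixture into a soft-to-hard clustering.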
Evaluation of clustering
What is a good clustering?
Internal criterion: a good clustering will produce high-quality clusters in which
- the intra-class (that is, intra-cluster) similarity is high
- the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used.
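One simple way to check the internal criterion is to compare the average pairwise similarity within clusters against the average across clusters. The sketch below does this with cosine similarity on dense vectors; the averaging scheme, the function names, and the sample vectors are assumptions for illustration, and other internal measures exist.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average(values):
    return sum(values) / len(values) if values else 0.0


def intra_inter_similarity(clusters):
    """Average pairwise similarity within clusters and between clusters."""
    labelled = [(v, label) for label, members in enumerate(clusters) for v in members]
    intra, inter = [], []
    for (a, label_a), (b, label_b) in combinations(labelled, 2):
        # Pairs from the same cluster count as intra, all others as inter.
        (intra if label_a == label_b else inter).append(cosine(a, b))
    return average(intra), average(inter)


# Hypothetical term-count vectors grouped into two clusters.
clustering = [[(3, 0, 1), (4, 0, 0)], [(0, 2, 0), (0, 5, 1)]]
intra_sim, inter_sim = intra_inter_similarity(clustering)
```

A clustering with a high first value and a low second value satisfies the criterion under this particular representation and similarity measure; changing either can change the verdict.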
Conclusion
In this presentation we learned about:
- Measuring similarity for retrieval
- Web-based document search and link analysis
- Document matching
- Clustering by similarity
- Hierarchical clustering
- The EM algorithm for clustering
- Evaluation of clustering
Visit more self help tutorials
Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net

Textmining Retrieval And Clustering
