Textmining Retrieval And Clustering

Retrieval and clustering of documents

Measuring similarity for retrieval
Given a set of documents, a similarity measure scores how closely each document matches a query or category. Retrieval uses these scores to determine which documents are relevant to that particular category.

Cosine similarity for retrieval
Cosine similarity is a measure of similarity between two vectors of n dimensions, obtained by finding the cosine of the angle between them. It is often used to compare documents in text mining. Given two vectors of attributes, A and B, with angle θ between them, the cosine similarity is expressed using a dot product and magnitudes as

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Cosine similarity for retrieval
For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison. For general vectors, the resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same direction, with 0 indicating orthogonal vectors and in-between values indicating intermediate similarity or dissimilarity.

Cosine similarity for retrieval
In the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1, since term frequencies (and tf-idf weights) cannot be negative. Equivalently, the angle between two term frequency vectors cannot be greater than 90°.
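The cosine similarity described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the whitespace tokenizer, the function names `term_frequencies` and `cosine_similarity`, and the convention of returning 0.0 for an empty document are all assumptions made for the example.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Build a term-frequency vector using naive whitespace tokenization (a simplification)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_a * norm_b)


doc1 = term_frequencies("text mining finds patterns in text")
doc2 = term_frequencies("text mining finds patterns in text " * 3)  # same content, three times longer
doc3 = term_frequencies("clustering groups similar documents")      # no terms shared with doc1

print(cosine_similarity(doc1, doc2))  # same direction despite different lengths
print(cosine_similarity(doc1, doc3))  # orthogonal vectors
```

Because `doc2` is just `doc1` repeated, the two vectors point in the same direction, which illustrates the length normalization. Since no count is negative, no pair of documents can score below 0.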

Web-based document search and link analysis
Link analysis has been used successfully for deciding which web pages to add to the collection of documents and how to order the documents matching a user query (i.e., how to rank pages). It has also been used to categorize web pages, to find pages that are related to given pages, to find duplicated web sites, and for various other problems related to web information retrieval.

Link analysis
A link from page A to page B is a recommendation of page B by the author of page A. If page A and page B are connected by a link, the probability that they are on the same topic is higher than if they are not connected.

Applications
Ranking query results (PageRank), crawling, finding related pages, computing web page reputations, geographic scope, prediction, categorizing web pages, and computing statistics of web pages and of search engines.
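To make the "link as recommendation" idea concrete, here is a minimal power-iteration sketch in the spirit of PageRank. It is a simplified illustration only: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the tiny four-page graph `web` are assumptions chosen for the example, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it recommends.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical graph: C is linked to by A, B and D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C collects recommendations from three pages and ends up ranked first, while D, which nobody links to, keeps only the baseline share.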

Document matching
Document matching is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

Steps involved in document matching
A document matching system has two main tasks: (1) find documents relevant to user queries; (2) evaluate the matching results and sort them according to relevance, using algorithms such as PageRank.
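The two tasks can be sketched as a small matching function. This sketch is only one possible reading: it ranks by cosine similarity between term-frequency vectors rather than by PageRank, and the helper names `tf`, `cosine`, `match` and the three sample records are hypothetical.

```python
import math
from collections import Counter


def tf(text):
    """Term-frequency vector from naive whitespace tokenization (a simplification)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def match(query, records):
    """Task 1: keep records that share terms with the query.
    Task 2: sort the kept records by decreasing similarity."""
    q = tf(query)
    scored = [(cosine(q, tf(r)), r) for r in records]
    hits = [(score, r) for score, r in scored if score > 0]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


records = [
    "two bedroom house for sale near the park",
    "newspaper article about the city park",
    "manual paragraph on printer setup",
]
results = match("house near park", records)
for score, record in results:
    print(round(score, 3), record)
```

The printer record shares no term with the query and is dropped; the real estate record matches three query terms and is listed before the newspaper record, which matches one.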

k-means clustering
Given a set of observations (x_1, x_2, …, x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k < n), S = {S_1, S_2, …, S_k}, so as to minimize the within-cluster sum of squares

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2$$

where μ_i is the mean of the points in S_i.

K-means algorithm
0. Input: D ::= {d_1, d_2, …, d_n}; k ::= the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d from the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute the centroid
6. Until the centroids don't change
7. Output: k clusters of documents
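The steps above can be sketched in Python. Note the swap: this sketch uses the common batch variant (assign all vectors, then recompute every centroid) instead of updating a centroid after each single vector, and it uses squared Euclidean distance on small 2-D points as a stand-in for document vectors. The function names, the fixed random seed, and the sample points are assumptions for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k vectors as initial centroids
    clusters = []
    for _ in range(max_iter):          # step 2: repeat
        clusters = [[] for _ in range(k)]
        for p in points:               # steps 3-5: put each vector in the closest cluster
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # step 6: stop when the centroids don't change
            break
        centroids = new_centroids
    return centroids, clusters          # step 7


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, clusters = k_means(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Changing `seed` changes the initial centroids, which is a simple way to observe how the starting choice can influence the result.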

Pros and cons
Advantages: linear time complexity (in the number of documents, per iteration); works relatively well in low-dimensional space.
Drawbacks: distance computation is problematic in high-dimensional space; the centroid vector may not summarize the cluster documents well; the initial k clusters affect the quality of the final clusters.

Hierarchical clustering
0. Input: D ::= {d_1, d_2, …, d_n}
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, to form a new cluster KL
4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until there is a single cluster (or a specified number of clusters)
6. Output: dendrogram of clusters
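A compact sketch of this bottom-up procedure follows. Several choices here are assumptions not stated in the slides: cluster similarity is taken as the average pairwise cosine similarity (average link), similarities are recomputed on each pass instead of being maintained in a SIM matrix, all vectors are assumed non-zero, and the merge history stands in for the dendrogram.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def agglomerate(vectors, target_clusters=1):
    """Merge clusters of vector indices until target_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # every document starts alone
    merges = []                                    # merge history, in order

    def similarity(c1, c2):
        # Average-link: mean cosine similarity over all cross-cluster pairs.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = similarity(clusters[a], clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((clusters[a], clusters[b], s))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters, merges


vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
clusters, merges = agglomerate(vectors, target_clusters=2)
print(clusters)
for left, right, s in merges:
    print(left, "+", right, "at similarity", round(s, 3))
```

The nested search over all cluster pairs on every pass is what makes this approach at least quadratic in the number of documents.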

Pros and cons
Advantages: tends to produce better-quality clusters; works relatively well in low-dimensional space.
Drawbacks: distance computation is problematic in high-dimensional space; at least quadratic time complexity.

The EM algorithm for clustering
Let the analyzed object be described by two random variables, x (observed) and y (hidden), which are assumed to have a probability distribution function $p(x, y \mid \theta)$.

The EM algorithm for clustering
The distribution is known up to its parameter(s) θ. It is assumed that we are given a set of samples $\{x_1, \ldots, x_n\}$ independently drawn from the distribution.

The EM algorithm for clustering
The Expectation-Maximization (EM) algorithm is an optimization procedure which computes the Maximum-Likelihood (ML) estimate of the unknown parameter θ when only incomplete data (y is unknown) are presented. In other words, the EM algorithm maximizes the likelihood function

$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta) = \prod_{i=1}^{n} \sum_{y} p(x_i, y \mid \theta)$$
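As a concrete instance, the sketch below runs EM on a one-dimensional mixture of Gaussians, where the unknown variable is the cluster that generated each sample. The Gaussian mixture model, the function name `em_gmm_1d`, the starting values, the variance floor, and the eight sample values are all assumptions chosen for illustration; the slides do not specify a model.

```python
import math


def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; returns the estimates and the log-likelihood trace."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    log_likelihoods = []
    for _ in range(iterations):
        # E-step: posterior probability of each component for each sample.
        responsibilities = []
        log_likelihood = 0.0
        for x in data:
            joint = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            log_likelihood += math.log(total)
            responsibilities.append([p / total for p in joint])
        log_likelihoods.append(log_likelihood)
        # M-step: re-estimate parameters from the responsibility-weighted samples.
        for j in range(k):
            nj = sum(r[j] for r in responsibilities)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(responsibilities, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(responsibilities, data)) / nj
            variances[j] = max(spread, 1e-6)  # assumed floor to avoid a collapsing variance
    return means, variances, weights, log_likelihoods


data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.7, 4.9]
means, variances, weights, trace = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(means, weights)
print(trace[0], trace[-1])
```

The recorded trace lets you check that the log-likelihood does not decrease from one iteration to the next, which is the property that makes EM a likelihood-maximizing procedure.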

Evaluation of clustering
What is a good clustering? Internal criterion: a good clustering produces high-quality clusters in which the intra-class (that is, intra-cluster) similarity is high and the inter-class similarity is low. The measured quality of a clustering depends on both the document representation and the similarity measure used.
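The internal criterion can be checked numerically by comparing average similarity within clusters against average similarity across clusters. The sketch below assumes cosine similarity on small non-zero 2-D vectors, clusterings with at least two clusters and at least one within-cluster pair; the function name and the two sample clusterings are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def intra_inter_similarity(clusters):
    """Return (average intra-cluster similarity, average inter-cluster similarity)."""
    # Pairs of vectors that sit in the same cluster.
    intra = [cosine(a, b) for c in clusters for a, b in combinations(c, 2)]
    # Pairs of vectors that sit in different clusters.
    inter = [cosine(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1 for b in c2]
    return sum(intra) / len(intra), sum(inter) / len(inter)


# The same four vectors grouped two different ways.
good = [[(1.0, 0.0), (0.9, 0.1)], [(0.0, 1.0), (0.1, 0.9)]]
bad = [[(1.0, 0.0), (0.0, 1.0)], [(0.9, 0.1), (0.1, 0.9)]]

good_intra, good_inter = intra_inter_similarity(good)
bad_intra, bad_inter = intra_inter_similarity(bad)
print("good:", round(good_intra, 3), round(good_inter, 3))
print("bad: ", round(bad_intra, 3), round(bad_inter, 3))
```

Running the same check with a different similarity function or a different vector representation can change the scores, which is why the measured quality depends on both.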

Conclusion
In this presentation we learned about: measuring similarity for retrieval; web-based document search and link analysis; document matching; clustering by similarity; hierarchical clustering; the EM algorithm for clustering; evaluation of clustering.

Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net
