AUTOMATIC
CLASSIFICATION
Outline
Introduction
Classification for Information Retrieval
Classification methods
Measures of association
Clustering
The use of clustering in information retrieval
graph theoretic
 'Single-Pass Algorithm'
Single-link
Introduction
What is classification?
A formal definition of classification will not be attempted.
The word 'classification' is used to describe both the process of
grouping objects and the result of such a process.
Classification is one of the core areas in machine learning.
Classification for Information
Retrieval
In the context of information retrieval, a classification is
required for a purpose.
The purpose may be to group the documents in such a
way that retrieval will be faster or alternatively it may be to
construct a thesaurus automatically.
Classification is used in various subtasks such as preprocessing,
content filtering, sorting, ranking, etc.
There are two main areas of application of classification
methods in IR:
• keyword clustering;
• document clustering.
In the main, people have achieved the 'logical
organization' in two different ways.
1. Direct classification of the documents.
2. The intermediate calculation of a measure of closeness
between documents.
- The first approach has proved theoretically to be intractable,
so any experimental test results cannot be considered
reliable.
- The second approach to classification is fairly well
documented now.
Classification methods
The data consists of objects and their corresponding descriptions.
The objects may be documents, keywords, handwritten characters,
or species (in the last case the objects themselves are classes as
opposed to individuals).
The descriptors come under various names depending on their
structure:
• (1) multi-state attributes (e.g. color)
• (2) binary-state (e.g. keywords)
• (3) numerical (e.g. hardness scale, or weighted keywords)
• (4) probability distributions.
• The fourth category of descriptors is applicable when the objects are
classes.
Measures of association
Some classification methods are based on a binary relationship
between objects. On the basis of this relationship a classification method can
construct a system of clusters.
The relationship is described variously as 'similarity', 'association' and
'dissimilarity'.
There are five commonly used measures of association in
information retrieval. Since documents and requests in information
retrieval are most commonly represented by term or keyword lists, we
assume that an object is represented by a set of keywords and that the
counting measure | . | gives the size of the set.
• Simple matching coefficient: |X ∩ Y|, which is the number of shared index terms.
• Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)
• Jaccard's coefficient: |X ∩ Y| / |X ∪ Y|
• Cosine coefficient: |X ∩ Y| / (|X|^(1/2) · |Y|^(1/2))
• Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)
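The five measures can be sketched in Python using their standard set-based definitions. This is only an illustration: the function names and the two example keyword sets are made up for this sketch, and every function assumes both sets are non-empty.

```python
import math

# Each object is a set of keywords; len() plays the role of | . |.
# All functions assume non-empty sets (otherwise the ratios are undefined).

def simple_matching(x: set, y: set) -> int:
    """Number of shared index terms, |X ∩ Y|."""
    return len(x & y)

def dice(x: set, y: set) -> float:
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y)

def cosine(x: set, y: set) -> float:
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x: set, y: set) -> float:
    return len(x & y) / min(len(x), len(y))

# Hypothetical keyword sets for two documents.
doc_a = {"cluster", "retrieval", "document", "index"}
doc_b = {"cluster", "retrieval", "thesaurus"}

for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(f"{measure.__name__}: {measure(doc_a, doc_b):.4f}")
```

The two example documents share two terms, so the simple matching coefficient is 2, while the other four measures normalize that count by the sizes of the sets in different ways.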
Clustering in information retrieval
 Clustering is used in information retrieval systems to 
enhance the efficiency and effectiveness of the retrieval 
process.
Clustering is achieved by partitioning the documents in a 
collection into classes such that documents that are 
associated with each other are assigned to the same 
cluster.
In order to cluster the items in a data set, some means of 
quantifying the degree of association between them is 
required.
A cluster method depending only on the rank-ordering of
the association values would give identical clusterings for
measures that are monotone with respect to each other.
Cluster hypothesis
In information retrieval, it states that documents that are 
clustered together "behave similarly with respect to 
relevance to information needs".
In terms of classification, it states that if points are in the
same cluster, they are likely to be of the same class.
This hypothesis may be simply stated as follows: closely 
associated documents tend to be relevant to the same 
requests.
The use of clustering in information retrieval 
“Theoretical soundness of the method”
• The method should satisfy certain criteria of adequacy. To list some
of the more important of these:
1. The method produces a clustering which is unlikely to be altered drastically
when further objects are incorporated, i.e. it is stable under growth.
2. The method is stable in the sense that small errors in the description of the
objects lead to small changes in the clustering.
3. The method is independent of the initial ordering of the objects.
“The efficiency”
• We know much about the behavior of clustered files in terms of the effectiveness of
retrieval (i.e. the ability to retrieve wanted and hold back unwanted documents).
• Efficiency is really a property of the algorithm implementing the cluster method.
• It is sometimes useful to distinguish the cluster method from its algorithm.
• In the context of IR this distinction becomes slightly less than useful, since many cluster
methods are defined by their algorithm.
• Two distinct approaches to clustering can be identified:
1. The clustering is based on a measure of similarity between the objects to be clustered;
2. The cluster method proceeds directly from the object descriptions.
Graph theoretic
• Graph-theoretic methods define clusters in terms of a graph derived from the measure of similarity.
• A string is a connected sequence of objects from some starting point.
• A connected component is a set of objects such that each object is
connected to at least one other member of the set and the set is maximal
with respect to this property.
• A maximal complete subgraph is a subgraph such that each node is
connected to every other node in the subgraph and the set is maximal
with respect to this property.
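The graph-theoretic definitions above can be illustrated with a small Python sketch. It rests on assumptions that are not in the slides: the graph is built by linking two objects whenever their Jaccard similarity reaches a chosen threshold, the five example documents are invented, and maximal complete subgraphs are enumerated with the basic Bron-Kerbosch procedure. Isolated objects come out as singleton sets.

```python
from itertools import combinations

def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y)

def threshold_graph(objects: dict, threshold: float) -> dict:
    """Link two objects when their similarity reaches the threshold."""
    graph = {name: set() for name in objects}
    for a, b in combinations(objects, 2):
        if jaccard(objects[a], objects[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connected_components(graph: dict) -> list:
    """Maximal sets of objects joined by chains of links (depth-first search)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

def maximal_cliques(graph, r=frozenset(), p=None, x=frozenset()):
    """Basic Bron-Kerbosch enumeration of maximal complete subgraphs."""
    if p is None:
        p = frozenset(graph)
    if not p and not x:
        yield set(r)
    for v in list(p):
        yield from maximal_cliques(graph, r | {v}, p & graph[v], x & graph[v])
        p = p - {v}
        x = x | {v}

# Hypothetical keyword sets.
docs = {
    "d1": {"cluster", "graph", "tree"},
    "d2": {"cluster", "graph", "link"},
    "d3": {"graph", "tree", "link"},
    "d4": {"link", "thesaurus"},
    "d5": {"query", "ranking"},
}
graph = threshold_graph(docs, threshold=0.25)
print(sorted(sorted(c) for c in connected_components(graph)))
print(sorted(sorted(c) for c in maximal_cliques(graph)))
```

In this example d1 to d4 form a single connected component, but they split into two overlapping maximal complete subgraphs, {d1, d2, d3} and {d2, d3, d4}, which shows how much stricter the second definition is.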
automatic classification in information retrieval
• A large class of hierarchic cluster methods is based on the initial
measurement of similarity. The most important of these is single-link, which is
the only one to have been extensively used in document
retrieval. It satisfies all the criteria of adequacy
mentioned.
• A further class of cluster methods based on
measurement of similarity is the class of so-called
'clump' methods.
• The algorithms also use a number of empirically determined
parameters such as:
1. The number of clusters desired;
2. A minimum and maximum size for each cluster;
3. A threshold value on the matching function, below which an
object will not be included in a cluster;
4. The control of overlap between clusters;
5. An arbitrarily chosen objective function which is optimized.
'Single-Pass Algorithm'
(1) The object descriptions are processed serially;
(2) The first object becomes the cluster representative of the first
cluster;
(3) Each subsequent object is matched against all cluster representatives
existing at its processing time;
(4) A given object is assigned to one cluster (or more if overlap is
allowed) according to some condition on the matching function;
(5) When an object is assigned to a cluster the representative for that
cluster is recomputed;
(6) If an object fails a certain test it becomes the cluster representative
of a new cluster.
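The six steps above can be sketched in Python as follows. The slides leave the matching function, the assignment condition, and the form of the cluster representative open, so this sketch makes its own assumptions: objects are term-count vectors, the matching function is the cosine between an object and a representative, an object joins the single best-matching cluster if the match reaches a threshold, and the representative is the sum of its members' vectors. The function names and example documents are hypothetical.

```python
import math
from collections import Counter

def cosine(u: Counter, v: Counter) -> float:
    """Cosine between two term-count vectors (0.0 if either is empty)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def single_pass(descriptions, threshold: float) -> list:
    clusters = []  # each cluster: {"rep": Counter, "members": [ids]}
    for obj_id, terms in descriptions:            # (1) serial processing
        vec = Counter(terms)
        best, best_sim = None, 0.0
        for cluster in clusters:                  # (3) match all representatives
            sim = cosine(vec, cluster["rep"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:   # (4) assignment condition
            best["members"].append(obj_id)
            best["rep"].update(vec)               # (5) recompute representative
        else:                                     # (2) first object, (6) failed test
            clusters.append({"rep": vec, "members": [obj_id]})
    return clusters

# Hypothetical object descriptions, processed in this order.
docs = [
    ("d1", ["cluster", "retrieval", "document"]),
    ("d2", ["cluster", "retrieval", "index"]),
    ("d3", ["thesaurus", "keyword"]),
    ("d4", ["keyword", "thesaurus", "term"]),
    ("d5", ["cluster", "document"]),
]
result = single_pass(docs, threshold=0.5)
print([c["members"] for c in result])
```

Because each object is compared only with the representatives that exist when it is processed, the outcome of a procedure like this can depend on the order of the input, which is worth keeping in mind alongside the criteria of adequacy listed earlier.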
Single-link
• The output is a hierarchy with associated numerical levels called a dendrogram.
• Frequently the hierarchy is represented by a tree structure such that each node
represents a cluster.
The appropriateness of stratified hierarchic cluster methods
Single-link and the minimum spanning tree
• The single-link tree is closely related to another kind of tree: the minimum
spanning tree, or MST, also derived from a dissimilarity coefficient.
• This second tree is quite different from the first, the nodes instead of
representing clusters represent the individual objects to be clustered.
• The MST is the tree of minimum length connecting the objects, where by 'length'
I mean the sum of the weights of the connecting links in the tree.
• We can similarly define a maximum spanning tree as one of maximum length, for
example a maximum spanning tree based on the expected mutual information measure.
• Given the minimum spanning tree then the single-link clusters are obtained by
deleting links from the MST in order of decreasing length;
• The connected sets after each deletion are the single-link clusters.
• The order of deletion and the structure of the MST ensure that the clusters will
be nested into a hierarchy.
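The link-deletion procedure can be sketched in Python as below. The sketch assumes a complete graph over the objects, builds the MST with Prim's algorithm, and then removes links from longest to shortest, recording the connected sets after each deletion. The example objects are points on a line with absolute difference as the dissimilarity, chosen only for illustration. Links of equal length are removed one at a time here, which a careful implementation would treat as a single level.

```python
def minimum_spanning_tree(nodes, dissim):
    """Prim's algorithm on a complete graph; returns (weight, a, b) links."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    links = []
    while len(in_tree) < len(nodes):
        # Shortest link from the tree built so far to an object outside it.
        w, a, b = min(
            (dissim(a, b), a, b)
            for a in in_tree for b in nodes if b not in in_tree
        )
        in_tree.add(b)
        links.append((w, a, b))
    return links

def connected_sets(nodes, links):
    """Connected sets of nodes under the given links (simple union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for _, a, b in links:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())

def single_link_levels(nodes, dissim):
    """Delete MST links in order of decreasing length; report clusters each time."""
    mst = sorted(minimum_spanning_tree(nodes, dissim), reverse=True)
    levels = []
    for k in range(1, len(mst) + 1):
        removed_length = mst[k - 1][0]
        levels.append((removed_length, connected_sets(nodes, mst[k:])))
    return levels

# Hypothetical objects: points on a line.
position = {"a": 0, "b": 1, "c": 3, "d": 8, "e": 11}
levels = single_link_levels(position, lambda x, y: abs(position[x] - position[y]))
for length, clusters in levels:
    print(length, clusters)
```

Each successive deletion only splits an existing cluster in two, so the clusterings printed from top to bottom are nested, as the last bullet above states.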
Implication of classification methods
• The classification process can usually be speeded up by using extra
storage.
• In experiments the classification structure is kept in fast store, but
this is impossible in an operational system, where the document
collections are so much bigger.
• In experiments we want to vary the cluster representatives at search
time, but in an operational classification the cluster representatives
would be constructed once and for all at cluster time.
• In IR, the classification file structure should be:
– easily updated
– easily searched
– reasonably compact
References
Classification for Information Retrieval
http://bit.ly/2zff6cA
Conceptual clustering in information retrieval
https://ieeexplore.ieee.org/document/678640
Clustering Algorithms
http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap16.htm
Ad

More Related Content

What's hot (20)

Object oriented database concepts
Object oriented database conceptsObject oriented database concepts
Object oriented database concepts
Temesgenthanks
 
Information retrieval 14 fuzzy set models of ir
Information retrieval 14 fuzzy set models of irInformation retrieval 14 fuzzy set models of ir
Information retrieval 14 fuzzy set models of ir
Vaibhav Khanna
 
Eucalyptus, Nimbus & OpenNebula
Eucalyptus, Nimbus & OpenNebulaEucalyptus, Nimbus & OpenNebula
Eucalyptus, Nimbus & OpenNebula
Amar Myana
 
Web content mining
Web content miningWeb content mining
Web content mining
Daminda Herath
 
Distributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query ProcessingDistributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query Processing
Gyanmanjari Institute Of Technology
 
Information retrieval 13 alternative set theoretic models
Information retrieval 13 alternative set theoretic modelsInformation retrieval 13 alternative set theoretic models
Information retrieval 13 alternative set theoretic models
Vaibhav Khanna
 
Data clustring
Data clustring Data clustring
Data clustring
Salman Memon
 
Open source search engine
Open source search engineOpen source search engine
Open source search engine
Primya Tamil
 
The vector space model
The vector space modelThe vector space model
The vector space model
pkgosh
 
Parallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval SystemParallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval System
vimalsura
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3
alaa223
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
silambu111
 
Web search vs ir
Web search vs irWeb search vs ir
Web search vs ir
Primya Tamil
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
Roi Blanco
 
CS6010 Social Network Analysis Unit I
CS6010 Social Network Analysis Unit ICS6010 Social Network Analysis Unit I
CS6010 Social Network Analysis Unit I
pkaviya
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
Dishant Ailawadi
 
Signature files
Signature filesSignature files
Signature files
Deepali Raikar
 
Grid protocol architecture
Grid protocol architectureGrid protocol architecture
Grid protocol architecture
Pooja Dixit
 
Data partitioning
Data partitioningData partitioning
Data partitioning
Vinod Wilson
 
CS6010 Social Network Analysis Unit IV
CS6010 Social Network Analysis Unit IVCS6010 Social Network Analysis Unit IV
CS6010 Social Network Analysis Unit IV
pkaviya
 
Object oriented database concepts
Object oriented database conceptsObject oriented database concepts
Object oriented database concepts
Temesgenthanks
 
Information retrieval 14 fuzzy set models of ir
Information retrieval 14 fuzzy set models of irInformation retrieval 14 fuzzy set models of ir
Information retrieval 14 fuzzy set models of ir
Vaibhav Khanna
 
Eucalyptus, Nimbus & OpenNebula
Eucalyptus, Nimbus & OpenNebulaEucalyptus, Nimbus & OpenNebula
Eucalyptus, Nimbus & OpenNebula
Amar Myana
 
Information retrieval 13 alternative set theoretic models
Information retrieval 13 alternative set theoretic modelsInformation retrieval 13 alternative set theoretic models
Information retrieval 13 alternative set theoretic models
Vaibhav Khanna
 
Open source search engine
Open source search engineOpen source search engine
Open source search engine
Primya Tamil
 
The vector space model
The vector space modelThe vector space model
The vector space model
pkgosh
 
Parallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval SystemParallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval System
vimalsura
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3
alaa223
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
silambu111
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
Roi Blanco
 
CS6010 Social Network Analysis Unit I
CS6010 Social Network Analysis Unit ICS6010 Social Network Analysis Unit I
CS6010 Social Network Analysis Unit I
pkaviya
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
Dishant Ailawadi
 
Grid protocol architecture
Grid protocol architectureGrid protocol architecture
Grid protocol architecture
Pooja Dixit
 
CS6010 Social Network Analysis Unit IV
CS6010 Social Network Analysis Unit IVCS6010 Social Network Analysis Unit IV
CS6010 Social Network Analysis Unit IV
pkaviya
 

Similar to automatic classification in information retrieval (20)

Literature Survey On Clustering Techniques
Literature Survey On Clustering TechniquesLiterature Survey On Clustering Techniques
Literature Survey On Clustering Techniques
IOSR Journals
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 
UNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data MiningUNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data Mining
Nandakumar P
 
A0310112
A0310112A0310112
A0310112
iosrjournals
 
A Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed ClusteringA Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed Clustering
IRJET Journal
 
Du35687693
Du35687693Du35687693
Du35687693
IJERA Editor
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
Editor IJARCET
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
Editor IJARCET
 
Ir3116271633
Ir3116271633Ir3116271633
Ir3116271633
IJERA Editor
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
IRJET Journal
 
Literature Survey: Clustering Technique
Literature Survey: Clustering TechniqueLiterature Survey: Clustering Technique
Literature Survey: Clustering Technique
Editor IJCATR
 
F04463437
F04463437F04463437
F04463437
IOSR-JEN
 
Bs31267274
Bs31267274Bs31267274
Bs31267274
IJMER
 
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET Journal
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478
IJRAT
 
Cluster analysis foundations.docx
Cluster analysis foundations.docxCluster analysis foundations.docx
Cluster analysis foundations.docx
YaseenRashid4
 
A Novel Clustering Method for Similarity Measuring in Text Documents
A Novel Clustering Method for Similarity Measuring in Text DocumentsA Novel Clustering Method for Similarity Measuring in Text Documents
A Novel Clustering Method for Similarity Measuring in Text Documents
IJMER
 
Dp33701704
Dp33701704Dp33701704
Dp33701704
IJERA Editor
 
Dp33701704
Dp33701704Dp33701704
Dp33701704
IJERA Editor
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...
IJERA Editor
 
Literature Survey On Clustering Techniques
Literature Survey On Clustering TechniquesLiterature Survey On Clustering Techniques
Literature Survey On Clustering Techniques
IOSR Journals
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 
UNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data MiningUNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data Mining
Nandakumar P
 
A Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed ClusteringA Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed Clustering
IRJET Journal
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
Editor IJARCET
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
Editor IJARCET
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
IRJET Journal
 
Literature Survey: Clustering Technique
Literature Survey: Clustering TechniqueLiterature Survey: Clustering Technique
Literature Survey: Clustering Technique
Editor IJCATR
 
Bs31267274
Bs31267274Bs31267274
Bs31267274
IJMER
 
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET- Cluster Analysis for Effective Information Retrieval through Cohesive ...
IRJET Journal
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478
IJRAT
 
Cluster analysis foundations.docx
Cluster analysis foundations.docxCluster analysis foundations.docx
Cluster analysis foundations.docx
YaseenRashid4
 
A Novel Clustering Method for Similarity Measuring in Text Documents
A Novel Clustering Method for Similarity Measuring in Text DocumentsA Novel Clustering Method for Similarity Measuring in Text Documents
A Novel Clustering Method for Similarity Measuring in Text Documents
IJMER
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...
IJERA Editor
 
Ad

Recently uploaded (20)

Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from AnywhereAutomation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Lynda Kane
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
"PHP and MySQL CRUD Operations for Student Management System"
"PHP and MySQL CRUD Operations for Student Management System""PHP and MySQL CRUD Operations for Student Management System"
"PHP and MySQL CRUD Operations for Student Management System"
Jainul Musani
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Hands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordDataHands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordData
Lynda Kane
 
Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.
gregtap1
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Rock, Paper, Scissors: An Apex Map Learning Journey
Rock, Paper, Scissors: An Apex Map Learning JourneyRock, Paper, Scissors: An Apex Map Learning Journey
Rock, Paper, Scissors: An Apex Map Learning Journey
Lynda Kane
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Buckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug LogsBuckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug Logs
Lynda Kane
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from AnywhereAutomation Hour 1/28/2022: Capture User Feedback from Anywhere
Automation Hour 1/28/2022: Capture User Feedback from Anywhere
Lynda Kane
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
"PHP and MySQL CRUD Operations for Student Management System"
"PHP and MySQL CRUD Operations for Student Management System""PHP and MySQL CRUD Operations for Student Management System"
"PHP and MySQL CRUD Operations for Student Management System"
Jainul Musani
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Hands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordDataHands On: Create a Lightning Aura Component with force:RecordData
Hands On: Create a Lightning Aura Component with force:RecordData
Lynda Kane
 
Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.Network Security. Different aspects of Network Security.
Network Security. Different aspects of Network Security.
gregtap1
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Rock, Paper, Scissors: An Apex Map Learning Journey
Rock, Paper, Scissors: An Apex Map Learning JourneyRock, Paper, Scissors: An Apex Map Learning Journey
Rock, Paper, Scissors: An Apex Map Learning Journey
Lynda Kane
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Buckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug LogsBuckeye Dreamin' 2023: De-fogging Debug Logs
Buckeye Dreamin' 2023: De-fogging Debug Logs
Lynda Kane
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Ad

automatic classification in information retrieval

  • 2. Outline Introduction Classification for Information Retrieval Classification methods Measures of association Clustering The use of clustering in information retrieval graph theoretic  'Single-Pass Algorithm' Single-link
  • 3. Introduction What is classification? A formal definition of classification will not be attempted. The word 'classification' is used to describe the result of such a process. Classification is one of the core areas in machine learning.
  • 4. Classification for Information Retrieval In the context of information retrieval, a classification is required for a purpose. The purpose may be to group the documents in such a way that retrieval will be faster or alternatively it may be to construct a thesaurus automatically. classification is used in various subtasks of Preprocessing, content filtering, sorting, ranking, et. There are two main areas of application of classification methods in IR: • keyword clustering; • document clustering.
  • 5. In the main, people have achieved the 'logical organization' in two different ways. 1.direct classification of the documents 2.The intermediate calculation of a measure of closeness between documents. -The first approach has proved theoretically to be intractable so that any experimental test results cannot be considered to be reliable. -The second approach to classification is fairly well documented now.
  • 6. Classification methods The data consists of objects and their corresponding descriptions. The objects may be documents, keywords, hand written characters, or species (in the last case the objects themselves are classes as opposed to individuals) The descriptors come under various names depending on their structure: •(1) multi-state attributes (e.g. color) •(2) binary-state (e.g. keywords) •(3) numerical (e.g. hardness scale, or weighted keywords) •(4) probability distributions. •The fourth category of descriptors is applicable when the objects are classes.
  • 7. Measures of association Some classification methods are based on a binary relationship between objects, basis of this relationship a classification method can construct a system of clusters. The relationship is described variously as 'similarity', 'association' and 'dissimilarity'. There are five commonly used measures of association in information retrieval. Since in information retrieval documents and requests are most commonly represented by term or keyword lists, an object is represented by a set of keywords and that the counting measure | . | gives the size of the set. |X υ Y | Simple matching coefficient Which is the number of shared index terms.
  • 9. Clustering in information retrieval: Clustering is used in information retrieval systems to enhance the efficiency and effectiveness of the retrieval process. Clustering is achieved by partitioning the documents in a collection into classes such that documents that are associated with each other are assigned to the same cluster. In order to cluster the items in a data set, some means of quantifying the degree of association between them is required. A cluster method depending only on the rank-ordering of the association values would give identical clusterings for
  • 11. The use of clustering in information retrieval: "theoretical soundness of the method" • The method should satisfy certain criteria of adequacy. To list some of the more important of these: 1. the method produces a clustering which is unlikely to be altered drastically when further objects are incorporated, i.e. it is stable under growth; 2. the method is stable in the sense that small errors in the description of the objects lead to small changes in the clustering; 3. the method is independent of the initial ordering of the objects.
  • 12. "the efficiency" • We know much about the behavior of clustered files in terms of the effectiveness of retrieval (i.e. the ability to retrieve wanted and hold back unwanted documents). • Efficiency is really a property of the algorithm implementing the cluster method. • It is sometimes useful to distinguish the cluster method from its algorithm. • In the context of IR this distinction becomes slightly less than useful, since many cluster methods are defined by their algorithm. • Two distinct approaches to clustering can be identified: 1. the clustering is based on a measure of similarity between the objects to be clustered; 2. the cluster method proceeds directly from the object descriptions.
  • 13. Graph theoretic • Graph theoretic methods define clusters in terms of a graph derived from the measure of similarity. • A string is a connected sequence of objects from some starting point. • A connected component is a set of objects such that each object is connected to at least one other member of the set and the set is maximal with respect to this property. • A maximal complete subgraph is a subgraph such that each node is connected to every other node in the subgraph and the set is maximal with respect to this property.
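The two graph definitions on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the deck's own procedure: the graph is assumed to link two objects whenever their similarity reaches a chosen threshold, the similarity values are invented, and maximal complete subgraphs are enumerated with the basic Bron-Kerbosch recursion, a technique swapped in here because the slide names none. All function names are hypothetical.

```python
def threshold_graph(sim: dict, threshold: float) -> dict:
    """Build an adjacency map linking objects whose similarity is >= threshold.

    `sim` maps (object, object) pairs to a similarity value (assumed format).
    """
    adj = {node: set() for pair in sim for node in pair}
    for (a, b), value in sim.items():
        if value >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def connected_components(adj: dict) -> list:
    """Maximal sets of mutually reachable objects (isolated objects form singletons)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


def maximal_cliques(adj: dict) -> list:
    """Maximal complete subgraphs, via the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(clique)  # nothing can extend it, so it is maximal
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Invented similarity values for five objects.
SIM = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7,
       ("c", "d"): 0.6, ("d", "e"): 0.2}
graph = threshold_graph(SIM, threshold=0.5)
print(connected_components(graph))
print(maximal_cliques(graph))
```

With these values, d joins the component of a, b and c through its single link to c, but it does not belong to their complete subgraph, which shows how much stricter the second definition is.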
  • 15. A large class of hierarchic cluster methods is based on the initial measurement of similarity. • The most important of these is single-link, which is the only one to have been extensively used in document retrieval. It satisfies all the criteria of adequacy mentioned. • A further class of cluster methods based on measurement of similarity is the class of so-called 'clump' methods.
  • 18. The algorithms also use a number of empirically determined parameters such as: 1. the number of clusters desired; 2. a minimum and maximum size for each cluster; 3. a threshold value on the matching function, below which an object will not be included in a cluster; 4. the control of overlap between clusters; 5. an arbitrarily chosen objective function which is optimized.
  • 19. 'Single-Pass Algorithm' (1) The object descriptions are processed serially; (2) The first object becomes the cluster representative of the first cluster; (3) Each subsequent object is matched against all cluster representatives existing at its processing time; (4) A given object is assigned to one cluster (or more if overlap is allowed) according to some condition on the matching function; (5) When an object is assigned to a cluster the representative for that cluster is recomputed; (6) If an object fails a certain test it becomes the cluster representative of a new cluster.
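The six steps above can be sketched in Python as follows. The slide leaves the matching function, the cluster representative and the assignment test open, so the choices here are assumptions made only for illustration: objects are keyword sets, the representative is a term-count vector over the cluster's members, the matching function is the cosine between the object and the representative, and an object joins the best-matching cluster when the score reaches a threshold (no overlap). All names are hypothetical.

```python
import math
from collections import Counter


def cosine(terms: set, representative: Counter) -> float:
    """Cosine between a binary keyword set and a term-count representative."""
    dot = sum(representative[t] for t in terms)
    norm = math.sqrt(len(terms)) * math.sqrt(sum(c * c for c in representative.values()))
    return dot / norm if norm else 0.0


def single_pass(objects: list, threshold: float) -> list:
    """Cluster keyword sets in one serial pass; returns dicts with members and representative."""
    clusters = []
    for index, terms in enumerate(objects):            # (1) process serially
        best, best_score = None, 0.0
        for cluster in clusters:                       # (3) match against every representative
            score = cosine(terms, cluster["rep"])
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best["members"].append(index)              # (4) assign to the best cluster
            best["rep"].update(terms)                  # (5) recompute its representative
        else:
            # (2) and (6): the first object, or one that fails the test, starts a new cluster
            clusters.append({"members": [index], "rep": Counter(terms)})
    return clusters


# Invented keyword sets, for illustration only.
DOCS = [
    {"cluster", "graph"},
    {"cluster", "graph", "tree"},
    {"query", "rank"},
    {"rank", "query", "index"},
]
for cluster in single_pass(DOCS, threshold=0.5):
    print(cluster["members"], dict(cluster["rep"]))
```

Because representatives change as objects arrive, a different input order may produce a different clustering, so a sketch like this should not be expected to meet the order-independence criterion.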
  • 20. Single-link • The output is a hierarchy with associated numerical levels called a dendrogram. • Frequently the hierarchy is represented by a tree structure such that each node represents a cluster.
  • 21. The appropriateness of stratified hierarchic cluster methods
  • 23. Single-link and the minimum spanning tree • The single-link tree is closely related to another kind of tree: the minimum spanning tree, or MST, also derived from a dissimilarity coefficient. • This second tree is quite different from the first: its nodes represent the individual objects to be clustered instead of clusters. • The MST is the tree of minimum length connecting the objects, where by 'length' I mean the sum of the weights of the connecting links in the tree. • Similarly, we can define a maximum spanning tree as one of maximum length, for example a maximum spanning tree based on the expected mutual information measure.
  • 24. • Given the minimum spanning tree then the single-link clusters are obtained by deleting links from the MST in order of decreasing length; • The connected sets after each deletion are the single-link clusters. • The order of deletion and the structure of the MST ensure that the clusters will be nested into a hierarchy.
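The deletion procedure on this slide can be sketched in Python. The sketch makes several assumptions: the dissimilarity coefficient is given as a dictionary over object pairs, the MST is built with Kruskal's algorithm (the slide does not say how the tree is obtained), and link lengths are distinct, so ties, which would need to be deleted together, are not handled. The data and all names are hypothetical.

```python
def minimum_spanning_tree(nodes: list, dissim: dict) -> list:
    """Kruskal's algorithm with a small union-find; returns (a, b, length) links."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    for (a, b), length in sorted(dissim.items(), key=lambda item: item[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:               # the link joins two separate subtrees
            parent[root_a] = root_b
            tree.append((a, b, length))
    return tree


def connected_sets(nodes: list, links: list) -> list:
    """Connected sets of objects under the given links."""
    adj = {n: set() for n in nodes}
    for a, b, _ in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            n = stack.pop()
            if n not in component:
                component.add(n)
                stack.extend(adj[n] - component)
        seen |= component
        result.append(frozenset(component))
    return result


def single_link_levels(nodes: list, dissim: dict) -> list:
    """Delete MST links in order of decreasing length; record the clusters after each deletion."""
    tree = sorted(minimum_spanning_tree(nodes, dissim), key=lambda link: link[2], reverse=True)
    levels = [(None, connected_sets(nodes, tree))]      # before any deletion
    for i, (_, _, length) in enumerate(tree):
        levels.append((length, connected_sets(nodes, tree[i + 1:])))
    return levels


# Invented dissimilarities between four objects.
NODES = ["a", "b", "c", "d"]
DISSIM = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 3,
          ("c", "d"): 4, ("a", "d"): 5, ("b", "d"): 6}
for deleted_length, clusters in single_link_levels(NODES, DISSIM):
    print(deleted_length, [sorted(c) for c in clusters])
```

Each level only splits one set from the level before it, which is the nesting property the slide refers to; read from the last level back to the first, the output gives the merges of the dendrogram.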
  • 25. Implications of classification methods • The classification process can usually be speeded up by using extra storage. • In experiments the classification structure is kept in fast store, but this is impossible in an operational system, where the document collections are so much bigger. • In experiments we want to vary the cluster representatives at search time, but in operational classification the cluster representatives would be constructed once and for all at cluster time. • In IR the classification file structure should be: easily updated; easily searched; reasonably compact.
  • 26. References: Classification for Information Retrieval, http://bit.ly/2zff6cA; Conceptual clustering in information retrieval, https://ieeexplore.ieee.org/document/678640; Clustering Algorithms, http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap16.htm

Editor's Notes

  • #4: core: essence, heart
  • #5: thesaurus: memory, dictionary
  • #6: fairly: acceptably, decently; intractable: unacceptable, difficult; closeness: nearness; reliable: effective, serious, trustworthy
  • #8: coefficient: factor, degree
  • #10: quantify: to determine the amount or quantity; association: interconnection
  • #11: hypothesis: supposition; assumption: presumption, claim; behave: to conduct oneself, to act; state: to present, to make clear; respect: conduct, regard
  • #20: empirically: experimentally; threshold: beginning; below: under; overlap: intersection, superposition; arbitrarily: exclusion, control
  • #21: serially: in sequence; subsequent: following, succeeding; against: opposed to; fails: falls short of its function
  • #23: appropriateness: fitting, suitable; stratified: divided into layers, stacked