Search Engines


       Google & Co. vs Internet
An Introduction to Information Retrieval
Contents
 Overview
 History
 Introduction to Information Retrieval
 Page Rank in Example
 Google & Co.
Search Engines Overview

 deep impact (not only on search)
 a big challenge for developers
 search engines are getting larger
 the problems are not new
History
 The web happened (1992)
 Mosaic/Netscape happened (1993-95)
 Crawler happened (1994): M. Mauldin
 SEs happened 1994-1996
    – InfoSeek, Lycos, Altavista, Excite, Inktomi, …

 Yahoo decided to go with a directory
 Google happened 1996-98
    Tried selling technology to other engines

 SEs thought search was a commodity, portals were in
 Microsoft said: whatever …
Present
 Most search engines have vanished
 Google is a big player
 Yahoo decided to de-emphasize directories
      Buys three search engines
 Microsoft realized Internet is here to stay
      Dominates the browser market
      Realizes search is critical
Share Of Searches: July 2005
Google
 first launched Sep. 1999
 Over 4 billion pages by beginning of 2004
 strengths
     size and scope
     relevance based
     cached archive
 weaknesses
     limited search features
     only indexes first 101KB of sites and PDFs
Yahoo!
 David Filo, Jerry Yang => 1995
 originally just a subject directory
 strengths
    large, new (Feb. 2004) database
    cached copies
    support of full boolean searching
 weaknesses
    lack of some advanced search features
    indexes only the first 500KB
    tricky wildcard
MSN Search
 used to use third-party databases
 Feb. 2005: began using its own database
 strengths
     large, unique database
     cached copies, including date cached
 weaknesses
     limited advanced features
     no title search, truncation, stemming
How Search Engines Work
 Crawler-Based Search Engines
     listing created automatically

 Human-Powered Directories
     contents filled by hand

 "Hybrid Search Engines" Or Mixed Results
     best of both worlds
Ranking Of Sites
 location and frequency of keywords
 keywords near top of page
 spamming filter
 “off the page” ranking
     link structure
     filtering fake links
     clickthrough measurement
Search Engine Placement Tips (1)
 pick your target keywords
 position your keywords
 have relevant content
 avoid search engine stumbling blocks
     have html links
     frames can kill
     dynamic doorblocks
Search Engine Placement Tips (2)
 build links
 just say no to search engine spamming
 submit your key pages
 verify & maintain your listing


 beyond search engines
Features for webmasters
 Crawling           Yes                           No                            Notes
 Deep Crawl         AllTheWeb, Google, Inktomi    AltaVista, Teoma
 Frames Support     All                           n/a
 Robots.txt         All                           n/a
 Meta Robots Tag    All                           n/a
 Paid Inclusion     All but…                      Google
 Full Body Text     All                           n/a                           Some stop words may not be indexed
 Stop Words         AltaVista, Inktomi, Google    FAST                          Teoma unknown
 Meta Description   All provide some support, but AltaVista, AllTheWeb and Teoma make most use of the tag
 Meta Keywords      Inktomi, Teoma                AllTheWeb, AltaVista, Google  Teoma support is “unofficial”
 ALT text           AltaVista, Google, Teoma      AllTheWeb, Inktomi
 Comments           Inktomi                       Others
What is Information Retrieval?
 Information gets lost in the mass of
  documents, but has to be found again

 Definition:
      IR is the field that deals with retrieving
       information/knowledge from large document
       databases.
Quality of an IR-System (1)
 Precision:
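     The ratio of the number of relevant documents retrieved
      to the total number of documents retrieved.

      \[ \text{Precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|} \in [0, 1] \]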


     Precision = 1: all retrieved documents are
      relevant
Quality of an IR-System (2)
 Recall:
     The ratio of the number of relevant
      documents retrieved to the total number of
      relevant documents (retrieved or not).

      \[ \text{Recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|} \in [0, 1] \]

     Recall = 1: all relevant documents were found
Quality of an IR-System (3)
 Aim of a good IR-System:
    increasing Precision and Recall!



 Problem:
    increasing Precision tends to decrease Recall
        e.g. the search returns only 1 (relevant) document:
         Recall -> 0, Precision = 1

    increasing Recall tends to decrease Precision
        e.g. the search returns all available documents:
         Recall = 1, Precision -> 0
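
The two extreme cases above can be checked with a small sketch. The collection of 100 documents with 10 relevant ones is made up for illustration, and `precision_recall` is a hypothetical helper name, not part of any library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed toy collection: 100 documents, of which 10 are relevant.
all_docs = set(range(100))
relevant = set(range(10))

# Returning a single relevant document: Precision = 1, Recall is small.
print(precision_recall({0}, relevant))       # (1.0, 0.1)

# Returning every document: Recall = 1, Precision is small.
print(precision_recall(all_docs, relevant))  # (0.1, 1.0)
```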
Mathematical models
 Boolean Model


 Vector Space Model
Boolean model
 checks whether the document contains the search
  term (true) or not (false); true means the
  document is relevant

 Problem:
     result size varies strongly, depending on
      the search term
     no ranking of the result set -> no sorting possible
     the “relevance” criterion is too strict (e.g. AND, OR)
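
A minimal sketch of Boolean retrieval, assuming whitespace tokenization and an in-memory inverted index. The three documents and the function names are invented for illustration. Note that the result is an unordered set, which is exactly the "no ranking" problem listed above.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Ids of documents that contain every term (Boolean AND)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Ids of documents that contain at least one term (Boolean OR)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.union(*sets) if sets else set()


# Assumed toy documents.
docs = {
    1: "google search engine",
    2: "yahoo directory",
    3: "google page rank",
}
index = build_index(docs)

print(search_and(index, "google", "search"))    # {1}
print(search_or(index, "google", "directory"))  # {1, 2, 3}
print(search_and(index, "google", "yahoo"))     # set()
```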
Vector space model (1)
 weighted index vector (document)

   \[ \vec{d_j} = (w_{1,j}, w_{2,j}, w_{3,j}, \dots, w_{n,j}) \]

 weighted search vector (query)

   \[ \vec{q} = (w_{1,q}, w_{2,q}, w_{3,q}, \dots, w_{n,q}) \]

 analyze the angle between search vector and
  document vector by using the cosine function
 the smaller the angle, the more relevant is the
  document -> use it for ranking
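
As a rough sketch of this ranking step, the cosine of the angle between the query vector and each document vector can be computed directly. The weights below are made-up numbers, and `cosine` and `rank` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equally long weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat it as not similar
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Document names sorted by decreasing cosine similarity to the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


# Assumed toy weights over a vocabulary of three index terms.
query = [1, 1, 0]
docs = {
    "d1": [2, 2, 0],  # same direction as the query: angle 0
    "d2": [1, 0, 1],  # shares one term with the query
    "d3": [0, 0, 3],  # shares no term with the query
}

print(rank(query, docs))  # ['d1', 'd2', 'd3']
```

The length of a vector does not matter here, only its direction: `d1` is twice as long as the query but still gets the highest possible similarity.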
Vector space model (2)
 the “relevance” criterion is more tolerant
 no use of boolean operators
 uses weighting
 creates a ranking -> sort is possible


 Problem:
      automatic weighting of index terms in queries
       and documents
Weighting Methods (1)
 Zipf's law
 global weighting (IDF, “inverse document
  frequency”)
      considers the distribution of words in a
       language
      gives a low weight to very frequent words such as
       “or” and “and”, which effectively filters them out

      \[ \mathrm{idf}_i = \log(N / n_i) \]

      N = number of documents in the system
      n_i = number of documents containing index term t_i
Weighting Methods (2)
 local weighting
     considers the term frequency within a document
     the weight corresponds to the frequency
     accounts for differing document lengths by
      normalizing the term frequency

      \[ \mathrm{ntf}_{i,j} = \frac{\mathrm{tf}_{i,j}}{\max_{l \in \{1, \dots, n\}} \mathrm{tf}_{l,j}} \]

      tf_{i,j} = absolute frequency of term t_i in document d_j
Weighting Methods (3)
 tf-idf weighting
      combination of global (inverse document
       frequency) and local (normalized term
       frequency) weighting



            \[ w_{i,j} = \mathrm{ntf}_{i,j} \cdot \mathrm{idf}_i \]
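
A small sketch of the combined weighting, assuming pre-tokenized documents and the natural logarithm (the slides do not state a base). The three token lists are invented, and `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents that contain term i (each document counts once).
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    weights = []
    for doc in docs:
        tf = Counter(doc)                          # absolute term frequencies
        max_tf = max(tf.values()) if tf else 1     # most frequent term in this document
        weights.append({term: (count / max_tf) * idf[term]
                        for term, count in tf.items()})
    return weights


# Assumed toy collection.
docs = [
    ["search", "engine", "and", "search"],
    ["page", "rank", "and", "link"],
    ["search", "and", "rank"],
]
weights = tf_idf(docs)

# "and" occurs in every document, so idf = log(3/3) = 0 and its weight vanishes.
print(weights[0]["and"])
# "engine" is rarer than "search" but occurs only half as often in document 0.
print(weights[0]["search"], weights[0]["engine"])
```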
Web-Mining
 Web mining ≈ data mining, but with different problems
 Mining of: content, structure or users
 Content mining: vector space model (VSM), Boolean model (BM)
 Structure mining: analysis of the structure
 User mining: information about the users of a page



Let's take a closer look at web structure mining!
History
 IR necessary but not sufficient for web search
 Doesn’t address web navigation
    Query ibm seeks www.ibm.com
    To IR www.ibm.com may look less topical than a
     quarterly report
 Link analysis
    Hubs and authority (Jon Kleinberg)
    PageRank (Brin and Page)
       Computed on the entire graph

       Query independent

       Faster if serving lots of queries

    Others…
Analysis of Hyperlinks
 Links
    Long history in citation analysis
    Navigational tools on the web
    Also a sign of popularity
    Can be thought of as recommendations
     (source recommends destination)
    Also describe the destination: anchor text
 Idea: the existence of a hyperlink between two
  pages also carries information
 Hyperlinks can be used to:
      create a weighting of web pages
      find pages with similar topics
      group pages by context of meaning
Hubs and Authorities
 Describe the quality of a
  website
 Authorities: pages that
  are linked to very often
 Hubs: pages that link
  to many other pages
 Example:
      Authority: Heise.de
      Hub: Peter's Linklist
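
One common way to compute such scores is the iterative scheme usually associated with Kleinberg's hubs-and-authorities idea: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The following is only a sketch under that assumption. The link graph, the page names and the fixed iteration count are invented for illustration.

```python
import math


def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of all pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of all pages the page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Assumed toy graph: a link list pointing to two sites, and a blog pointing to one.
links = {
    "linklist": ["news", "shop"],
    "blog": ["news"],
    "news": [],
}
hub, auth = hits(links)

print(max(auth, key=auth.get))  # news
print(max(hub, key=hub.get))    # linklist
```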
Page Rank


   Invented by Lawrence Page and Sergey Brin
   The algorithm itself is well described
   Implementations are not (Google)
   Main idea:
       considers the relationship of all links on the WWW
       the more often a document is linked to, the more important it is
       not every link counts the same: a link from an
        important page is worth more
Page Rank Algorithm



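 \[ PR(p_0) = (1 - q) + q \sum_{i=1}^{n} \frac{PR(p_i)}{|\mathrm{outlink}(p_i)|} \]

 PR(p_0): Page Rank of a page p_0
 PR(p_i): Page Rank of the pages p_1, ..., p_n linking to p_0
 outlink(p_i): all outgoing links of p_i
 q: damping factor of the random walk (normally q = 0.85)
 Attention: recursive function!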
Page Rank Example


 with q=0.5



 PR(A) = 0.5 + 0.5 PR(C)
  PR(B) = 0.5 + 0.5 (PR(A) / 2)
  PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

 PR(A) = 14/13 = 1.07692308
  PR(B) = 10/13 = 0.76923077
  PR(C) = 15/13 = 1.15384615
Page Rank Calculation
 Solving the system of equations directly is not feasible for a graph as large as the web
 Iterative calculation of Page Rank is necessary
 Each page starts with a Page Rank of 1
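
The iteration can be sketched as follows, using the update PR = (1 - q) + q * (sum of PR(p_i) / number of outgoing links of p_i) that the example equations follow. The graph is read off the three-page example with q = 0.5 (A links to B and C, B links to C, C links to A). The function name and the fixed iteration count are assumptions, and the sketch assumes every page has at least one outgoing link.

```python
def pagerank(links, q=0.85, iterations=100):
    """links: {page: [pages it links to]}. Returns {page: Page Rank}."""
    ranks = {page: 1.0 for page in links}  # each page starts with 1
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Each linking page passes on its rank divided by its number of outgoing links.
            incoming = sum(ranks[src] / len(targets)
                           for src, targets in links.items() if page in targets)
            new_ranks[page] = (1 - q) + q * incoming
        ranks = new_ranks
    return ranks


# Graph read off the three-page example equations.
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(example, q=0.5)

for page in sorted(ranks):
    print(page, round(ranks[page], 8))
# Converges towards PR(A) = 14/13, PR(B) = 10/13, PR(C) = 15/13.
```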
Page Rank Incoming Links

 Given
       PR(A) = PR(B) = PR(C) = PR(D) = 1
       PR(X) = 10


  PR(A) = 0.5 + 0.5 (PR(X) + PR(D)) = 5.5 + 0.5 PR(D)
   PR(B) = 0.5 + 0.5 PR(A)
   PR(C) = 0.5 + 0.5 PR(B)
   PR(D) = 0.5 + 0.5 PR(C)
  PR(A) = 19/3 = 6.33
   PR(B) = 11/3 = 3.67
   PR(C) = 7/3 = 2.33
   PR(D) = 5/3 = 1.67
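
To see the effect of the extra incoming link, the four equations can simply be iterated from the given starting values. This is a sketch only: PR(X) is held fixed as in the example, q = 0.5 is taken from the equations, and the function name is hypothetical. Setting PR(X) to 0 stands in for the case without the external link.

```python
def cycle_with_external_link(pr_x, q=0.5, iterations=100):
    """Iterate the equations for the cycle A -> B -> C -> D -> A, with X linking to A."""
    a = b = c = d = 1.0  # starting values PR(A) = PR(B) = PR(C) = PR(D) = 1
    for _ in range(iterations):
        # All four values are updated at once from the previous round.
        a, b, c, d = (
            (1 - q) + q * (pr_x + d),
            (1 - q) + q * a,
            (1 - q) + q * b,
            (1 - q) + q * c,
        )
    return a, b, c, d


# Without the external page every page in the cycle keeps a Page Rank of 1.
print(cycle_with_external_link(pr_x=0))

# With PR(X) = 10 the values approach 19/3, 11/3, 7/3 and 5/3.
print(cycle_with_external_link(pr_x=10))
```

The gain from X is largest at A and is halved with every further step along the cycle, because q = 0.5 damps it each time it is passed on.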
Page Rank Outgoing Links

 PR(A) = 0.25 + 0.75 PR(B)
  PR(B) = 0.25 + 0.375 PR(A)
  PR(C) = 0.25 + 0.75 PR(D) + 0.375 PR(A)
  PR(D) = 0.25 + 0.75 PR(C)

 PR(A) = 14/23
  PR(B) = 11/23
  PR(C) = 35/23
  PR(D) = 32/23
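
For a graph this small, the four equations can also be solved exactly. The sketch below rewrites them as a linear system and solves it with exact fractions. q = 3/4 is taken from the coefficients above, and the `solve` helper is a plain Gauss-Jordan elimination written for this example, not a library routine.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve matrix * x = rhs exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    # Augmented matrix with exact rational entries.
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Bring a row with a non-zero entry in this column into pivot position
        # (raises StopIteration if the system is singular).
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate the column from all other rows.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


q = Fraction(3, 4)
# Each row is one equation moved to the form PR(page) - q * (incoming terms) = 1 - q.
# Column order: A, B, C, D.
matrix = [
    [1,      -q, 0,  0],   # PR(A) = 0.25 + 0.75 PR(B)
    [-q / 2,  1, 0,  0],   # PR(B) = 0.25 + 0.375 PR(A)
    [-q / 2,  0, 1, -q],   # PR(C) = 0.25 + 0.75 PR(D) + 0.375 PR(A)
    [0,       0, -q, 1],   # PR(D) = 0.25 + 0.75 PR(C)
]
rhs = [1 - q] * 4

solution = solve(matrix, rhs)
print([str(value) for value in solution])  # ['14/23', '11/23', '35/23', '32/23']
```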
Page Rank other Examples

 Dangling Links




 Different hierarchies
Page Rank Implementation
 Normally implemented as a weighting system
 An additional content search is needed to
  retrieve the document set
 Other factors that also feed into Page Rank:
     the markup of a link
     the position of a link in the document
     the distance between the pages (e.g. a different
      domain)
     the context of the linking page
     how up to date the page is
Google Past
 1995 research project at Stanford University
Google Past
 One of the earliest storage systems
Google – How it began
 Peak of google.stanford.edu
Google
 Servers 1999
Google
Google by Numbers
 Index: 40 TB (4 billion pages with an estimated size of 10 KB each)
 Up to 2,000 servers in one cluster
 Over 30 clusters
 One petabyte of data per cluster: so much that a
  hard disk error rate of 1 in 10^15 bits
  becomes a real problem
 In each larger cluster, normally two
  servers break down every day
 System has been running stably (without any outage)
  since February 2000 (yes, they don't use Windows
  servers…)
Look-out: Semantic Web
 Information should be readable by humans and
  machines
 Unified description of data & knowledge
 First approaches: Meta-Data, e.g. Dublin
  Core



 Current: RDF
Look-out: Personalized Search Engine
 A new approach: personalized search
  engines
 Advantage: you only get what you are personally
  interested in
 Disadvantage: a lot of data has to be
  collected
 Example:
     www.fooxx.com
Links
 www.searchenginewatch.com (general
  information about search engines)

 http://pr.efactory.de (Page Rank algorithm)


 http://zdnet.de/itmanager/unternehmen/0,3902344
  (article in German: “Google’s Technologien: Von
  Zauberei kaum zu unterscheiden”, i.e. “Google's
  technologies: hardly distinguishable from magic”)
The End
 Thank you for your attention
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 

Introduction into Search Engines and Information Retrieval

  • 1. Search Engines. Google & Co. vs Internet: An Introduction to Information Retrieval
  • 2. Contents: Overview; History; Introduction to Information Retrieval; Page Rank in Example; Google & Co.
  • 3. Search Engines Overview: deep impact (not only for search); a big challenge for developers; search engines getting larger; problems not new
  • 4. History: The web happened (1992); Mosaic/Netscape happened (1993-95); Crawler happened (1994): M. Mauldin; SEs happened 1994-1996 (InfoSeek, Lycos, Altavista, Excite, Inktomi, …); Yahoo decided to go with a directory; Google happened 1996-98 (tried selling technology to other engines); SEs thought search was a commodity, portals were in; Microsoft said: whatever …
  • 5. Present: Most search engines have vanished; Google is a big player; Yahoo decided to de-emphasize directories (buys three search engines); Microsoft realized the Internet is here to stay (dominates the browser market, realizes search is critical)
  • 6. Share Of Searches: July 2005
  • 7. Google: first launched Sep. 1999; over 4 billion pages by beginning of 2004; strengths: size and scope, relevance based, cached archive; weaknesses: limited search features, only indexes first 101KB of sites and PDFs
  • 8. Yahoo!: David Filo, Jerry Yang (1995); originally just a subject directory; strengths: large, new (Feb. 2004) database, cached copies, support of full boolean searching; weaknesses: lack of some advanced search features, indexes only the first 500KB, tricky wildcard
  • 9. MSN Search: used to use third-party databases; began using its own database in Feb. 2005; strengths: large, unique database, cached copies including date cached; weaknesses: limited advanced features, no title search, truncation, stemming
  • 10. How Search Engines Work: Crawler-Based Search Engines (listing created automatically); Human-Powered Directories (contents filled by hand); "Hybrid Search Engines" or Mixed Results (best of both worlds)
  • 11. Ranking Of Sites: location and frequency of keywords (keywords near top of page, spamming filter); "off the page" ranking (link structure, filtering fake links, clickthrough measurement)
  • 12. Search Engine Placement Tips (1): pick your target keywords; position your keywords; have relevant content; avoid search engine stumbling blocks (have HTML links, frames can kill, dynamic doorblocks)
  • 13. Search Engine Placement Tips (2): build links; just say no to search engine spamming; submit your key pages; verify & maintain your listing; beyond search engines
  • 14. Features for webmasters Crawling Yes No Notes AllTheWeb, Google, Deep Crawl AltaVista, Teoma Inktomi Frames Support All n/a Robots.txt All n/a Meta Robots Tag All n/a Paid Inclusion All but… Google Some stop words may Full Body Text All n/a not be indexed AltaVista, Inktomi, Stop Words FAST Teoma unknown Google All provide some support, but AltaVista, AllTheWeb and Teoma make most Meta Description use of the tag AllTheWeb, Altavista, Teoma support is Meta Keywords Inktomi, Teoma Google "unofficial" AltaVista, Google, ALT text AllTheWeb, Inktomi Teoma Comments Inktomi Others
  • 15. What is Information Retrieval?: Information gets lost in the sheer number of documents but has to be found again; Definition: IR is the field that deals with retrieving information/knowledge from large document databases.
  • 16. Quality of an IR-System (1): Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved, a value in [0;1]; Precision = 1: all retrieved documents are relevant
  • 17. Quality of an IR-System (2): Recall: the ratio of the number of relevant documents retrieved to the total number of relevant documents (retrieved or not), a value in [0;1]; Recall = 1: all relevant documents were found
  • 18. Quality of an IR-System (3): Aim of a good IR-System: increasing both Precision and Recall! Problem: increasing Precision tends to decrease Recall (e.g. the search returns a single relevant document: Recall->0, Precision=1); increasing Recall tends to decrease Precision (e.g. the search returns all available documents: Recall=1, Precision->0)
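
The two measures and their trade-off can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the document IDs and the function name `precision_recall` are assumptions made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the system returned
    relevant:  IDs of all documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: 10 documents, 4 of them relevant.
all_docs = [f"d{i}" for i in range(1, 11)]
relevant = {"d1", "d2", "d3", "d4"}

# Returning a single relevant document: perfect precision, low recall.
one_doc = precision_recall(["d1"], relevant)

# Returning everything: perfect recall, low precision.
everything = precision_recall(all_docs, relevant)
```

With these made-up numbers, returning one relevant document gives precision 1 and recall 0.25, while returning the whole collection gives recall 1 and precision 0.4.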
  • 19. Mathematical models: Boolean Model; Vector Space Model
  • 20. Boolean model: checks whether the document includes the search term (true) or not (false); true means the document is relevant; Problem: the result size varies widely depending on the search term; no ranking of the result set -> no sorting possible; the "relevance" criterion is too strict (e.g. AND, OR)
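
A minimal sketch of Boolean retrieval, assuming a tiny made-up document collection and a simple whitespace tokenizer (both are illustrations, not taken from the slides). It shows why the result is an unordered set: a document either matches or it does not.

```python
def build_index(docs):
    """Map every term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical example documents.
docs = {
    1: "google search engine",
    2: "yahoo directory search",
    3: "page rank google",
}
index = build_index(docs)

# AND is a set intersection, OR is a set union.
google_and_search = index.get("google", set()) & index.get("search", set())
google_or_search = index.get("google", set()) | index.get("search", set())
```

The AND query matches only document 1, the OR query matches all three, and neither result carries any ordering information.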
  • 21. Vector space model (1): weighted index (document) vector $d_j = (w_{1,j}, w_{2,j}, w_{3,j}, \dots, w_{n,j})$; weighted search (query) vector $q = (w_{1,q}, w_{2,q}, w_{3,q}, \dots, w_{n,q})$; analyze the angle between search vector and document vector by using the cosine function; the smaller the angle, the more relevant the document is -> use it for ranking
  • 22. Vector space model (2): the "relevance" criterion is more tolerant; no use of boolean operators; uses weighting; creates a ranking -> sorting is possible; Problem: automatic weighting of index terms in queries and documents
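
The cosine-based ranking can be sketched as follows. The vectors and document names are hypothetical and hand-weighted for illustration; a real system would derive the weights automatically, which is the problem named on the slide.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equally long weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as not similar to anything
    return dot / (norm_a * norm_b)


# Hypothetical query and document vectors over three index terms.
query = [1, 1, 0]
documents = {
    "d1": [2, 2, 0],  # same direction as the query
    "d2": [0, 0, 3],  # shares no term with the query
    "d3": [1, 0, 1],  # shares one term with the query
}

# Rank documents by descending cosine (smaller angle means more relevant).
ranking = sorted(documents, key=lambda d: cosine(query, documents[d]), reverse=True)
```

Because only the angle matters, `d1` ranks first even though its weights are twice as large as the query's.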
  • 23. Weighting Methods (1): law of Zipf; global weighting (IDF, "inverse document frequency"); considers the distribution of words in a language; very frequent words such as "or" and "and" are weighted weakly, which effectively filters them out; $\mathrm{IDF} = \log(N / n)$, where $N$ = number of documents in the system and $n$ = number of documents containing the index term
  • 24. Weighting Methods (2): local weighting; considers the term frequency within a document; weighting corresponds to the frequency; accounts for different document lengths by normalizing the term frequency: $ntf_{i,j} = \frac{tf_{i,j}}{\max_{l \in \{1 \dots n\}} tf_{l,j}}$, where $tf_{i,j}$ = absolute frequency of term $t_i$ in document $d_j$
  • 25. Weighting Methods (3): tf-idf weighting; combination of global (inverse document frequency) and local (normalized term frequency) weighting: $w_{i,j} = ntf_{i,j} \cdot idf_i$
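
The three weighting steps can be combined in a short sketch. The slides do not state the base of the logarithm, so the natural logarithm is an assumption here, as are the whitespace tokenizer, the example documents and the function name `tf_idf`.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, with w = ntf * idf."""
    n_docs = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        max_tf = max(c.values(), default=1)  # most frequent term in this document
        weights.append({
            term: (tf / max_tf) * math.log(n_docs / df[term])  # natural log assumed
            for term, tf in c.items()
        })
    return weights


# Hypothetical mini-collection; "the" occurs in every document.
docs = ["the search engine search", "the search index", "the page rank"]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "the" in this collection, gets idf = log(1) = 0 and therefore weight 0, which is the filtering effect described for the global weighting.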
  • 26. Web-Mining: Web-Mining ≈ Data-Mining, different problems; Mining of: Content, Structure or User; Content-Mining: VSM, BM; Structure-Mining: analysis of structure; User-Mining: information about the users of a page; Let's have a deeper look at Web-Structure-Mining!
  • 27. History: IR necessary but not sufficient for web search; doesn't address web navigation (the query "ibm" seeks www.ibm.com; to IR, www.ibm.com may look less topical than a quarterly report); Link analysis: Hubs and authorities (Jon Kleinberg); PageRank (Brin and Page): computed on the entire graph, query independent, faster if serving lots of queries; Others…
  • 28. Analysis of Hyperlinks: Links: long history in citation analysis; navigational tools on the web; also a sign of popularity; can be thought of as recommendations (source recommends destination); also describe the destination: anchor text; Idea: the existence of a hyperlink between two pages also carries information; Hyperlinks can be used to: create a weighting of web pages, find pages with similar topics, group pages by different contexts of meaning
  • 29. Hubs and Authorities: describe the quality of a website; Authorities: pages that are linked to very often; Hubs: pages that link to other pages very often; Example: Authority: Heise.de; Hub: Peter's Linklist
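
The slide only gives the informal definitions. As a rough sketch, the iterative scheme usually associated with Kleinberg's hubs and authorities lets a page's authority score be the sum of the hub scores of pages linking to it, and its hub score the sum of the authority scores of the pages it links to. The link graph below is hypothetical ("blog" and "other-site" are invented for the example), and the fixed iteration count is an assumption.

```python
import math


def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of all pages linking to p.
        auth = {p: sum(hub[s] for s, targets in links.items() if p in targets)
                for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of all pages p links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph built around the slide's example names.
links = {
    "peters-linklist": ["heise.de", "other-site"],
    "blog": ["heise.de"],
}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph the page with the most incoming links ends up as the top authority and the link list as the top hub, matching the slide's example.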
  • 30. Page Rank: invented by Lawrence Page and Sergey Brin; the algorithm itself is well described; implementations are not (Google); Main idea: considers the relationship of all links in the WWW; the more often a document is linked to, the more important it is; not every link counts the same: a link from an important page is worth more
  • 31. Page Rank Algorithm: PR(p0): Page Rank of a page; PR(pi): Page Rank of the pages linking to p0; outlink(pi): number of outgoing links of pi; q = damping factor of the random walk (normally q=0.85); Attention: recursive function!
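
The formula itself was an image on the slide and is not part of this transcript. A form that is consistent with the symbols listed above and with the worked examples on the later slides (for instance PR(A) = 0.5 + 0.5 PR(C) with q = 0.5) would be the following, where n, the number of pages linking to p0, is a symbol introduced here for illustration:

```latex
PR(p_0) = (1 - q) + q \sum_{i=1}^{n} \frac{PR(p_i)}{\mathrm{outlink}(p_i)}
```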
  • 32. Page Rank Example: with q=0.5; PR(A) = 0.5 + 0.5 PR(C); PR(B) = 0.5 + 0.5 (PR(A) / 2); PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B)); Solution: PR(A) = 14/13 = 1.07692308; PR(B) = 10/13 = 0.76923077; PR(C) = 15/13 = 1.15384615
  • 33. Page Rank Calculation: solving the system of equations directly is not feasible for the whole web; iterative calculation of Page Rank is necessary; each page starts with 1
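
The iterative calculation can be sketched as below, using the three-page example from slide 32. The link graph (A links to B and C, B links to C, C links to A) is inferred from the slide's equations, and the function name, the fixed iteration count and the dictionary representation are assumptions for illustration. This is not Google's implementation, which the slides note is not public.

```python
def pagerank(links, q=0.85, iterations=50):
    """Iterate PR(p) = (1 - q) + q * sum(PR(s) / outlinks(s)) over all s linking to p.

    links maps each page to the list of pages it links to.
    """
    pr = {page: 1.0 for page in links}  # each page starts with 1
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items()
                           if page in targets)
            new_pr[page] = (1 - q) + q * incoming
        pr = new_pr
    return pr


# Link graph inferred from the equations on slide 32.
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(example, q=0.5)
```

After 50 iterations the values agree with the slide's solution 14/13, 10/13 and 15/13 to well within floating-point precision, and they sum to 3, the number of pages.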
  • 34. Page Rank Incoming Links: Given: PR(A) = PR(B) = PR(C) = PR(D) = 1 and PR(X) = 10; PR(A) = 0.5 + 0.5 (PR(X) + PR(D)) = 5.5 + 0.5 PR(D); PR(B) = 0.5 + 0.5 PR(A); PR(C) = 0.5 + 0.5 PR(B); PR(D) = 0.5 + 0.5 PR(C); Solution: PR(A) = 19/3 = 6.33; PR(B) = 11/3 = 3.67; PR(C) = 7/3 = 2.33; PR(D) = 5/3 = 1.67
  • 35. Page Rank Outgoing Links: PR(A) = 0.25 + 0.75 PR(B); PR(B) = 0.25 + 0.375 PR(A); PR(C) = 0.25 + 0.75 PR(D) + 0.375 PR(A); PR(D) = 0.25 + 0.75 PR(C); Solution: PR(A) = 14/23; PR(B) = 11/23; PR(C) = 35/23; PR(D) = 32/23
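
For a graph this small the equations can also be solved exactly, which makes it easy to check the fractions on slide 35. The sketch below rewrites the four equations as a linear system and solves it with Gauss-Jordan elimination over exact fractions; the helper `solve` and the matrix layout are assumptions for illustration. It assumes the system has a unique solution, which holds for this example.

```python
from fractions import Fraction as F


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with exact fractions."""
    n = len(rhs)
    rows = [list(row) + [b] for row, b in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        # Eliminate the column from every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


q = F(3, 4)  # the slide's equations use 0.25 = 1 - q and 0.75 = q
# Slide 35 equations moved to the left-hand side, unknowns ordered A, B, C, D.
matrix = [
    [F(1), -q, F(0), F(0)],      # PR(A) - 0.75 PR(B) = 0.25
    [-q / 2, F(1), F(0), F(0)],  # PR(B) - 0.375 PR(A) = 0.25
    [-q / 2, F(0), F(1), -q],    # PR(C) - 0.375 PR(A) - 0.75 PR(D) = 0.25
    [F(0), F(0), -q, F(1)],      # PR(D) - 0.75 PR(C) = 0.25
]
rhs = [1 - q] * 4
pr_a, pr_b, pr_c, pr_d = solve(matrix, rhs)
```

The exact solution reproduces the slide's values 14/23, 11/23, 35/23 and 32/23, which sum to 4, the number of pages.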
  • 36. Page Rank other Examples: Dangling Links; Different hierarchies
  • 37. Page Rank Implementation: normally implemented as a weighting system; an additional content search is needed to retrieve the document set; also taken into account in Page Rank: the markup of a link, the position of a link in the document, the distance between the pages (e.g. other domain), the context of the linking page, how up to date the page is
  • 38. Google Past: 1995 research project at Stanford University
  • 39. Google Past: One of the earliest storage systems
  • 40. Google – How it began: Peak of google.stanford.edu
  • 43. Google by Numbers: Index: 40 TB (4 billion pages with an estimated size of 10 KB each); up to 2000 servers in one cluster; over 30 clusters; one petabyte of data per cluster, so much that a hard disk error rate of 1 in 10^15 bits becomes a real problem; each day, two servers typically fail in each larger cluster; system running stably (without any outages) since February 2000 (yes, they don't use Windows servers…)
  • 44. Outlook: Semantic Web: information should be readable by humans & machines; unified description of data & knowledge; first approaches: Meta-Data, e.g. Dublin Core; current: RDF
  • 45. Outlook: Personalized Search Engine: a new approach: personalized search engines; Advantage: you only get what you are personally interested in; Disadvantage: a lot of data has to be collected; Example: www.fooxx.com
  • 46. Links: www.searchenginewatch.com (general information about search engines); http://pr.efactory.de (page rank algorithm); http://zdnet.de/itmanager/unternehmen/0,3902344 (article: "Google's Technologien: Von Zauberei kaum zu unterscheiden")
  • 47. The End: Thank you for your attention