Dirk Lewandowski
Understanding
Search Engines
Dirk Lewandowski
Department of Information
Hamburg University of Applied Sciences
Hamburg, Germany
Translation from the German language edition: “Suchmaschinen verstehen” by Dirk Lewandowski,
© Springer 2021. Published by Springer Vieweg, Berlin, Heidelberg. All Rights Reserved.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Contents
1 Introduction
1.1 The Importance of Search Engines
1.2 A Book About Google?
1.3 Objective of This Book
1.4 Talking About Search Engines
1.5 Structure of This Book
1.6 Structure of the Chapters and Markings in the Text
1.7 Summary
References
2 Ways of Searching the Web
2.1 Searching for a Website vs. Searching for Information on a Topic
2.2 What Is a Document?
2.3 Where Do People Search?
2.4 Different Pathways to Information on the World Wide Web
2.4.1 Search Engines
2.4.2 Vertical Search Engines
2.4.3 Metasearch Engines
2.4.4 Web Directories
2.4.5 Social Bookmarking Sites
2.4.6 Question-Answering Sites
2.4.7 Social Networking Sites
2.5 Summary
References
3 How Search Engines Capture and Process Content from the Web
3.1 The World Wide Web and How Search Engines Acquire Its Contents
3.2 Content Acquisition
3.3 Web Crawling: Finding Documents on the Web
3.3.1 Guiding and Excluding Search Engines
3.3.2 Content Exclusion by Search Engine Providers
Glossary
Index
1 Introduction
This book is about better understanding the search tools we use daily. Only when we
have a basic understanding of how search engines are constructed and how they
work can we use them effectively in our research.
However, what matters here is not only how we use existing search engines but also
what we can learn from well-known search engines like Google when we want to
build our own search systems. The starting point is that Web search engines are
currently the leading systems technologically, setting the standards for both the
search process and user behavior. Therefore, if we want to build our own search
systems, we must accommodate the habits shaped by Web search engines,
whether we like it or not.
This book is an attempt to deal with the subject of search engines
comprehensively in the sense of looking at it from different angles:
1. Technology: First of all, search engines are technical systems. This involves the
gathering of the Web’s content as well as ranking and presenting the search
results.
2. Use: Search engines are shaped not only by their developers but also by their
users. Since the data generated during use is incorporated into the ranking of the
search results and the design of the user interface, usage significantly influences
how search engines are designed.
3. Web-based research: Although, in most cases, search engines are used in a
relatively simple way – and often not much more is needed for a successful
search – search engines are also tools for professional information research. The
fact that search engines are easy to use for everyone does not mean that every
search task can be easily solved using them.
4. Economy: Search engines are of great importance for content producers who
want to get their content on the market. Because they are central nodes in the
Web, they also play an important economic role. Here, the main focus is on search
engine visibility, which can be achieved through various online marketing
measures (such as search engine optimization and placing advertisements).
5. Society: Since search engines are the preferred means of searching for
information and are used massively every day, they also have an enormous significance
for knowledge acquisition in society. Among other things, this raises the question
of whether search results are credible and whether search engines play a role in
spreading misinformation and disinformation, often treated under the label of
“fake news.”
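To make the "Technology" angle above a little more concrete, here is a minimal sketch of the three steps it names: gathering content, ranking, and presenting results. It is an illustration under strong assumptions, not a description of how Google or any real search engine works. The document collection is a hard-coded, hypothetical stand-in for crawled Web content, the names `DOCUMENTS`, `build_index`, and `search` are invented for this example, and the ranking is a plain count of query-term occurrences.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for crawled Web content (document ID -> text).
DOCUMENTS = {
    "doc1": "search engines rank web documents",
    "doc2": "web directories list web sites by topic",
    "doc3": "users enter a query into search engines",
}


def build_index(documents):
    """Build an inverted index: term -> Counter of {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term][doc_id] += 1
    return index


def search(index, query, limit=10):
    """Score documents by the summed frequency of the query terms, best first."""
    scores = Counter()
    for term in query.lower().split():
        # .get avoids creating empty entries for unknown terms.
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    # Sort by score (descending), then by document ID for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


index = build_index(DOCUMENTS)
results = search(index, "web search")
for position, (doc_id, score) in enumerate(results, start=1):
    print(position, doc_id, score)
```

Even in this toy version, the order of the result list is entirely a product of the scoring rule chosen by whoever builds the system, which is the point the "Society" angle picks up.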
1.1 The Importance of Search Engines
In this book, I argue that search engines have an enormous social significance. This
can be explained, on the one hand, by their mass use and, on the other hand, by the
ranking and presentation of search results.
Search engines (like other services on the Internet) are used en masse. Their
importance lies in the fact that we use them to search for information actively. Every
time we enter a query, we reveal our interests. With every search engine result page
(SERP) that a search engine returns to us, there is a (technically mediated)
interpretation of both the query and the potentially relevant results. By performing these
interpretations in a particular way, a search engine conveys a specific impression of
the world of information found on the Web.
For every query, there is a result page that displays the results in a specific order.
Although, in theory, we can select from all these results, we rely heavily on the order
given by the search engine. De facto, we do not select from the possibly millions of
results found but only from the few displayed first.
If we consider this, societal questions arise, such as how diverse the search engine
market is: Is it okay to use only one search engine and have only one of many
possible views of the information universe for each query?
The importance of search engines has already been put into punchy titles such as
“Search Engine Society” (Halavais, 2018), “Society of the Query” (the title of a
conference series and a book; König & Rasch, 2014), and “The Googlization of
Everything” (Vaidhyanathan, 2011). Perhaps it is not necessary to go so far as to
proclaim Google, search engines, or queries as the determining factor of our society;
however, the enormous importance of search engines for our knowledge acquisition
can no longer be denied.
If we look at the hard numbers, we see that search engines are the most popular
service on the Internet. We regard the Internet as a collection of protocols and
services, including e-mail, chat, and the File Transfer Protocol (FTP). It may seem
surprising that the use of search engines is at the top of the list when users are asked
about their activities on the Internet. Search engines are even more popular than
writing and reading e-mails. For instance, 76% of all Germans use a search engine at
least once a week, but “only” 65% read or write at least one e-mail during this time.
This data comes from the ARD/ZDF-Onlinestudie (Beisch & Schäfer, 2020), which
surveys the use of the Internet among the German population every year.
Comparable studies confirm the high frequency of search engine use: the Eurobarometer
study (European Commission, 2016) shows that 85% of all Internet users in
Germany use a search engine at least once a week; the figure for daily use is still
48%. Germany is below the averages of the EU countries (88% and 57%,
respectively).
Let’s look at the ARD/ZDF-Onlinestudie to see which other Internet services are
used particularly often. We find that, in addition to e-mail and search engines,
messengers (probably WhatsApp in particular) are the most popular. On the other
hand, social media services only reach 36%.
A second way of looking at this is to consider the most popular websites
(Alexa.com, 2021). Google holds first and third place (google.com and google.de),
with YouTube in second place, followed by Amazon and eBay. It is striking not only
that Google takes first place but also that eBay and Amazon, two major e-commerce
companies, rank so highly: both offer numerous opportunities for browsing and
also play a major role in (product) searches.
The fact that search engines are a mass phenomenon can also be seen in the
number of queries. Market research companies estimate the number of queries
sent to Google alone at around 3.3 trillion in 2016 (Internet Live Stats & Statistic
Brain Research Institute, 2017). That is more than 100,000 queries per second on average!
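As a rough plausibility check, the annual estimate can be converted into an average per-second rate. The sketch below assumes only the 3.3 trillion figure cited above and a 366-day year (2016 was a leap year). The function name is invented for this example, and the result is an average that says nothing about peak load.

```python
# Convert an annual query estimate into an average per-second rate.
QUERIES_PER_YEAR = 3.3e12        # estimate for 2016 cited in the text
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
DAYS_IN_2016 = 366               # 2016 was a leap year


def average_queries_per_second(queries_per_year, days=DAYS_IN_2016):
    """Return the average number of queries per second over the given year."""
    return queries_per_year / (days * SECONDS_PER_DAY)


rate = average_queries_per_second(QUERIES_PER_YEAR)
print(f"about {rate:,.0f} queries per second on average")
```

The result is on the order of one hundred thousand queries per second; a rate of a million per second would correspond to roughly 30 trillion queries per year.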
An additional level of consideration arises when we look at how users access
information on the World Wide Web. While there are theoretically many access
points to information on the Web, search engines are the most prevalent. On the one
hand, Web pages can, of course, be accessed directly by typing the address (Uniform
Resource Locator; URL) into the browser bar. Then there are other services, such as
social media services, which also lead us to websites. But none of these services has
achieved a level of importance comparable to that of search engines for accessing
information on the Web, nor is this situation likely to change in the foreseeable
future.
Last but not least, search engines are also significant because of the online
advertising market. The sale of ads in search engines (ads in response to a query)
accounts for 40% of the market (Zenith, 2021); in Germany alone, search engine
advertising generated sales of 4.1 billion euros in 2019 (Statista, 2021).
This form of advertising is particularly attractive because, with each search query,
users reveal what they want to find and thus also whether and what they might want
to buy. This makes it easy for advertisers to decide when they want to offer their
product to a user. Scatter losses, i.e., the proportion of users who see an
1. Search engine providers: On the one hand, search engine providers are interested
in satisfying their users. This involves both the quality of the search results and
the user experience. On the other hand, search engine providers’ second major
(or even more significant?) interest is to offer their advertisers an attractive
environment and earn as much money as possible from advertising.
2. Users: The users’ interest is to obtain satisfactory search results with little effort
and not to be disturbed too much in their search process, for example, by intrusive
advertising.
3. Content producers: Anyone who offers content on the Web also wants to be found
by (potential) users. However, another interest of many content producers is to
earn money with their content. This, in turn, means that it is not necessarily in
their interest to make their content fully available to search engines.
4. Search engine optimizers: Search engine optimizers work on behalf of content
producers to ensure that their offerings can be found on the Web, primarily in
search engines. Their knowledge of the search engines’ ranking procedures and
their exploitation of these procedures to place “their” websites influence the
search engine providers, who attempt to protect themselves against manipulation.
This brief explanation of the stakeholders already shows that this interplay can
lead to conflicts. Search engine providers have to balance the interests of their users
and their advertisers; search engine optimizers have to ensure the maximum
visibility of their clients’ offerings but must not exploit their knowledge of how search
engines work to such an extent that they are penalized by search engine providers for
manipulation.
Clearly, we are dealing with complex interactions in the search engine market.
Only if we look at search engines from different perspectives are we able to classify
these interactions and understand why search engines are designed the way they are.
Search engines have to meet the needs of different user groups; it is not enough for
them to restrict their services to one of these groups.
When we talk about search engines and their importance for information access,
we usually only consider the content initially produced for the Web. However,
search engines have been trying to include content from the “real,” i.e., the physical
world, in their search systems for years. Vaidhyanathan (2011) distinguishes three
types of content that search engines like Google capture:
1. Scan and link: External content is captured, aggregated, and made available for
search (e.g., Web search).
2. Host and serve: Users’ content is collected and hosted on their own platform (e.g.,
YouTube).
3. Scan and serve: Things from the real world are transferred into the digital world
by the search engine provider (e.g., Google Books, Google Street View).
1.2 A Book About Google?
When we think of search engines, we primarily think of Google. We all use this
search engine almost daily, usually for all kinds of search purposes. Here, again, the
figures speak for themselves: in Germany, well over 90% of all queries to general
search engines are directed to Google, while other search engines play only a minor
role (Statcounter, 2021).
Therefore, this book is based on everyday experience with Google and tries to
explain the structure and use of search engines using this well-known example.
Nevertheless, this book aims to go further: to show which alternatives to Google
there are and when it is worthwhile to use them. But this book will not describe all
possible search engines; it is rather about introducing other search engines, utilizing
roach. U, uterus; s, spermatheca. The nerve-cord is introduced into both
figures.
membrane, which are pressed together when the parts are at rest, are
stiffened by chitinous thickenings.
If the succeeding sterna retained their proper place, as they do in some
Orthoptera (e.g., the Mole Cricket), the 8th and 9th sterna would project
beyond the 7th, while the rectum would open beneath the last tergum, and
the uterus between the 8th and 9th sterna. In the adult female Cockroach,
however, the 8th and 9th somites are telescoped into the 7th, and
completely hidden by it. Their terga are reduced to narrow bands. The 8th
sternum forms a semi-transparent plate which slopes downwards and
backwards, and is pierced by a vertical slit, the outlet of the uterus. The
upper edge of this sternum is hinged upon the projecting basis of the
anterior gonapophyses (to be described immediately), and the parts form a
kind of spring joint, ordinarily closed, but capable of being opened wide
upon occasion. The 9th sternum is a small median crescentic plate, distinct
from the 8th; it supports the spermatheca, whose duct traverses an oval
plate which projects from the fore-edge of the sternum.
Fig. 97.—Hinder end of abdomen of female Cockroach. In the upper figure
the halves of the 7th sternum are closed; in the lower figure they are open.
By the telescoping of the 8th and 9th somites the sterna take the
position shown in fig. 96B, and a new cavity, the genital pouch, is
formed by invagination. This receives the extremity of the body of
the male during copulation, while it
serves as a mould in which the egg-
capsule is cast during oviposition. Its
chitinous lining resembles that of the
outer integument. The uterus opens
into its anterior end, which is
bounded by the 8th sternum; the
spermatheca opens into its roof,
which is supported by the 9th
sternum and the gonapophyses;
while its floor is completed by the
7th sternum and the infolded
chitinous membrane.
Fig. 98.—External Reproductive Organs of Female. T8, &c., terga; S7, &c.,
sterna; G, anterior gonapophysis; G′, its base; g, posterior gonapophyses;
Od, oviduct; sp, spermatheca; R, rectum. The upper figure shows the parts
in oblique profile; the left lower figure is an oblique view from before of
the outlet of the uterus, the anterior gonapophyses being cut short; the
right lower figure shows the gonapophyses. Arrows indicate the outlet of
the oviduct and uterus.
A pair of appendages (anterior gonapophyses) are shown by the
development of the parts to belong to the 8th somite. They are slender,
irregularly bent, and curved inwards at the tips. A small, forked, chitinous
slip connects them with both the 8th and 9th terga, but their principal
attachment is to the upper (properly, posterior) edge of the 8th sternum.
The anterior gonapophyses expand at their bases into broad horizontal
plates, which form part of the roof of the genital pouch.
Two pairs of appendages, belonging to the 9th somite, form the posterior
gonapophyses. The outer pair are relatively large, soft, and curved: the
inner narrow, hard, and straight. [167]
The anterior gonapophyses form the lower, and the posterior the upper
jaw of a forceps, which in many
Insects can be protruded beyond the body. Some of the parts are
often armed with teeth, and the primary use of the apparatus is to
bore holes in earth or wood for the reception of the eggs. Hence the
apparatus is often called the ovipositor. It forms a prominent
appendage of the abdomen in such Insects as Crickets, Saw-flies,
Sirex, and Ichneumons. The sting of the Bee is a peculiar adaptation
of the same organ to a very different purpose. In the Cockroach the
ovipositor is used to grasp the egg-capsule, while it is being formed,
filled with eggs, and hardened; and the notched edge (fig. 5, p. 23)
is the imprint of the inner posterior gonapophyses, made while the
capsule is still soft. The shape of the parts in the male and female
indicates that the ovipositor is passive in copulation, and is then
raised to allow access to the spermatheca.
SPECIAL REFERENCES.
Rathke. Zur Entwickelungsgesch. der Blatta germanica. Meckel’s Arch. of
Anat. u. Phys., Bd. VI. (1832).
Balfour. Comparative Embryology, 2 vols. (1880–1).
Graber. Insekten, Vol. II. (1879).
Lubbock. Origin and Metamorphoses of Insects (1874).
Kowalewsky. Embryol. Studien an Würmern u. Arthropoden. Mém. Ac.
Petersb. Sér. VII., Vol. XVI. (1871).
Weismann. Entw. der Dipteren. Zeits. f. wiss. Zool., Bde. XIII., XIV. (1863–
4).
Metschnikoff. Embryol. Studien. an Insecten. Ib., Bd. XVI. (1866).
Bütschli. Entwicklungsgeschichte der Biene. Ib., Bd. XX. (1870).
Bobretzky. Bildung d. Blastoderms u. d. Keimblätter bei den Insecten. Ib.,
Bd. XXXI. (1878).
Nusbaum. Rozwój przewodów organów płciowych u owadów (Polish).
Kosmos. (1884). [Development of Sexual Outlets in Insects.]
---- Struna i struna Leydig’a u owadów (Polish). Kosmos (1886). [Chorda
and Leydig’s chorda in Insects.]
As to the origin of the mesoblast most observers have found [178] that
a long groove (the germinal groove) appears in the middle line of
the ventral plate (fig. 108), which bulges into the yolk, gradually
detaches itself from the epiblast, and completes itself into a tube.
The lumen of this tube soon becomes filled with cells, and the solid
cellular mass thus formed divides into two longitudinal tracts, which
lie right and left of the middle line of the ventral plate beneath the
epiblast, and are known as the mesoblastic bands. In the Cockroach
I was able to satisfy myself that in this Insect also, the mesoblast, in
all probability, arises by the formation and closure of a similar groove
of the epiblast. M (fig. 108) represents the stage in which the lumen
of the groove has disappeared, and the mesoblast forms a solid
cellular mass.
The origin of the hypoblast in Insects has not as yet been clearly
determined. Two quite different views on this subject have found
support. Some observers (Bobretsky, Graber, and others) maintain
that the hypoblast originates in the yolk-cells, which form a
superficial layer investing the rest of the yolk. Others (especially
Kowalewsky [179]) believe that the process is altogether different.
According to the latest observations of the eminent embryologist just
named, upon the development of the Muscidæ, the germinal groove
gives rise, not only to the two mesoblastic bands, but also, in its
central region, to the hypoblast. This makes its appearance,
however, not as a continuous layer, but as two hourglass-shaped
rudiments, one at the anterior, the other at the posterior end of the
ventral plate. These rudiments have their convex ends directed away
from each other, while their edges are approximated and gradually
meet so as to form a continuous hypoblast beneath the mesoblast.
Although I have not been able completely to satisfy myself as to the
mode of formation of the hypoblast in the Cockroach, I have
observed stages of development which lead me to suppose that it
proceeds in this Insect in a manner similar to that observed by
Kowalewsky in Muscidæ. The hourglass-shaped rudiments of the
hypoblast become pushed upwards by those foldings-in of the
epiblast which form towards the anterior and posterior ends of the
embryo, and give rise to the stomodæum and proctodæum. [180]
The stage of development in which the germinal groove appears, by
the folding inwards of the epiblast, has been observed in many other
animals, and is known as the Gastræa-stage. In all higher types
(Vertebrates, the higher Worms, Arthropoda, Echinodermata) the
mesoblast and hypoblast are formed in the folded-in part of the
Gastræa in a manner similar to that observed in Insects.
The yolk-cells, which some observers have supposed to form the
hypoblast, are believed by Kowalewsky to have no other function
except that of the disintegration and solution of the yolk. I can,
however, with confidence affirm that in the Cockroach these cells
take part in the formation of permanent tissues (see below).
Each of the two mesoblastic bands which lie right and left of the
germinal groove divides into many successive somites, and each of
these becomes hollow. Every such somite consists of an inner
(dorsal) one-layered and an outer (ventral) many-layered wall, the
latter being in contact with the epiblast. The cavities of all the
somites unite to form a common cavity, the cœlom or perivisceral
space of the Cockroach. The cœlom, like the cavities in which it
originates, is bounded by two layers of mesoblast—an inner, the so-
called splanchnic or visceral layer, which lies on the outer side of the
hypoblast, and an outer somatic or parietal layer, beneath the
epiblast. There are accordingly four layers in the Cockroach-embryo
—viz., (1) epiblast, from which the integument and nervous system
are developed; (2) somatic layer of mesoblast, mainly converted into
the muscles of the body-wall; (3) splanchnic layer of mesoblast,
yielding the muscular coat of the alimentary canal; and (4)
hypoblast, yielding the epithelium of the mesenteron.
Fig. 109.—Transverse sections of Embryo of B.
germanica, with rudimentary nervous system
(Oc. 4, Obj. D.D. Zeiss). N, nervous system; M,
mesoblastic somites.