Lecture1a DistSyst
GFS distributed filesystem
• Replicated
• Consistent
• Fast
How do you index the web?
1. Get a copy of the web.
2. Build an index.
3. Profit.
There are:
• over 1 trillion unique URLs
• Billions of unique web pages
• Hundreds of millions of websites
• 30?? terabytes of text
• Crawling -- download those web pages
• Indexing -- harness 10s of thousands of machines to do it
• Profiting -- we leave that to you.
• “Data-Intensive Computing”
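To make the indexing step concrete, here is a minimal single-machine sketch of an inverted index, the word-to-pages mapping a search engine looks up at query time. Everything in it is an assumption for illustration: the helper names (`tokenize`, `build_index`, `search`), the example URLs, and the page text are hypothetical, and a real web-scale index would be partitioned across many machines rather than held in one dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# Hypothetical crawl result: URL -> page text.
PAGES = {
    "http://a.example/": "Distributed systems are fun",
    "http://b.example/": "File systems store data",
    "http://c.example/": "Distributed file systems replicate data",
}

INDEX = build_index(PAGES)
```

Calling `search(INDEX, "distributed systems")` intersects the URL sets for the two words, so only pages containing both are returned.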
MapReduce / Hadoop
Why? Hiding details of programming 10,000 machines!
[Diagram: Data, Chunks, Computers]
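The programming model can be sketched in a few lines. The example below is a single-process toy, not Hadoop's API: the names `map_fn`, `reduce_fn`, and `map_reduce` and the two text chunks are assumptions made for illustration. The point is that the programmer writes only the map and reduce functions, while the framework (here, one small loop) handles splitting the work and grouping values by key.

```python
from collections import defaultdict


def map_fn(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    for word in chunk.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all the values emitted for one key."""
    return word, sum(counts)


def map_reduce(chunks, mapper, reducer):
    """Run map, shuffle, and reduce sequentially in one process."""
    groups = defaultdict(list)
    # On a cluster, each chunk would be handed to a different machine.
    for chunk in chunks:
        for key, value in mapper(chunk):
            # Shuffle: collect all values that share a key.
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical input, already split into chunks.
CHUNKS = ["the web is big", "the index is big too"]
COUNTS = map_reduce(CHUNKS, map_fn, reduce_fn)
```

Because `map_fn` looks at one chunk at a time and `reduce_fn` at one key at a time, both steps could run on many machines in parallel without any change to the user-written functions.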
• Protocols on protocols on protocols
• Hundreds of thousands of servers
• ... to find out what’s the deal with Trump’s hair!
• Distributed network of Internet routers to