Frontera: open source, large-scale web crawling framework
Alexander Sibiryakov,
Scrapinghub Ltd.
sibiryakov@scrapinghub.com
Hello, participants!
• Software Engineer @ Scrapinghub
• Born in Yekaterinburg, RU
• 5 years at Yandex: social & QA search, snippets.
• 2 years at Avast! antivirus: false positives, malicious downloads
We help turn web content into useful data
• Over 2 billion requests per month (~800/sec.)
• Focused crawls & broad crawls
{
  "content": [
    {
      "title": {
        "text": "'Extreme poverty' to fall below 10% of world population for first time",
        "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time"
      },
      "points": "9 points",
      "time_ago": {
        "text": "2 hours ago",
        "href": "https://news.ycombinator.com/item?id=10352189"
      },
      "username": {
        "text": "hliyan",
        "href": "https://news.ycombinator.com/user?id=hliyan"
      }
    }
  ]
}
Broad crawl use cases
• News analysis
• Topical crawling
• Plagiarism detection
• Sentiment analysis (popularity, likability)
• Due diligence (profile/business data)
• Lead generation (extracting contact information)
• Tracking criminal activity & finding missing persons (DARPA)
Saatchi Global Gallery Guide
• www.globalgalleryguide.com
• Discover 11K online galleries.
• Extract general information, art samples, descriptions.
• NLP-based extraction.
• Find more galleries on the web.
Task
• Spanish web: collect statistics on hosts and their sizes.
• Only the .es ccTLD.
• Breadth-first strategy:
– first the 1-click neighborhood,
– then 2 clicks,
– then 3,
– …
• Stop condition: at most 100 docs per host, across all hosts.
• Low cost.
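The breadth-first strategy with a per-host limit can be sketched in a few lines. This is only an illustration of the idea on a toy in-memory graph, not Frontera's code: the function name `breadth_first_crawl` and the `get_links` callback are hypothetical, and a real crawl would fetch pages over the network instead of looking links up in a dictionary.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

MAX_DOCS_PER_HOST = 100  # stop condition taken from the slide


def breadth_first_crawl(seeds, get_links, max_per_host=MAX_DOCS_PER_HOST):
    """Visit URLs level by level (1 click, 2 clicks, ...) within .es only.

    get_links is a hypothetical callback returning the outgoing links of a URL.
    Returns the visit order as (url, depth) pairs and the page count per host.
    """
    per_host = defaultdict(int)
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        host = urlparse(url).hostname or ""
        # Skip hosts outside the .es ccTLD and hosts that reached the limit.
        if not host.endswith(".es") or per_host[host] >= max_per_host:
            continue
        per_host[host] += 1
        order.append((url, depth))
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order, dict(per_host)
```

The per-host counters double as the "host size" statistics, capped at the limit.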
Spanish, Russian and world Web, 2012
Sources: OECD Communications Outlook 2013, statdom.ru
* - current period (October 2015)

                          Domains   Web servers   Hosts   DMOZ*
Spanish (.es)             1.5M      280K          4.2M    122K
Russian (.ru, .рф, .su)   4.8M      2.6M          ?       105K
World                     233M      62M           890M    1.7
Solution
• Scrapy (based on Twisted): asynchronous network operations.
• Apache Kafka: data bus (offsets, partitioning).
• Apache HBase: storage (random access, linear scanning, scalability).
• Snappy: efficient compression algorithm for I/O-bound applications.
Architecture
[Diagram: Kafka topic, crawling strategy workers (SW), storage workers (DB)]
1. Big and small hosts problem
• The queue is flooded with URLs from the same host.
• → Spider resources are underused.
• Fix: an additional per-host (per-IP) queue and a metering algorithm.
• URLs from big hosts are cached in memory.
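One way to picture a per-host queue with metering is sketched below. The slides do not describe the actual algorithm, so the class `PerHostQueue` and its `max_per_host_per_batch` parameter are assumptions made for illustration: each batch takes only a bounded number of URLs from any single host, and the rest stay queued in memory.

```python
from collections import defaultdict, deque


class PerHostQueue:
    """Hypothetical per-host queue: a big host cannot fill a whole batch."""

    def __init__(self, max_per_host_per_batch=2):
        self.queues = defaultdict(deque)  # host -> pending URLs (kept in memory)
        self.max_per_host = max_per_host_per_batch

    def push(self, host, url):
        self.queues[host].append(url)

    def next_batch(self, batch_size):
        """Take at most max_per_host URLs from each host, up to batch_size."""
        batch = []
        for host in list(self.queues):
            pending = self.queues[host]
            take = min(self.max_per_host, len(pending), batch_size - len(batch))
            for _ in range(take):
                batch.append(pending.popleft())
            if not pending:
                del self.queues[host]  # drop exhausted hosts
            if len(batch) >= batch_size:
                break
        return batch
```

With five URLs queued for a big host and one for a small host, a batch of four contains two URLs from the big host and the single URL from the small one; the remaining big-host URLs wait for later batches.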
2. DDoS of the Amazon AWS DNS service
Breadth-first strategy → first visits to unknown hosts → a huge number of DNS requests.
Fix: a recursive DNS server
• on every spider node,
• with upstream to Verizon & OpenDNS.
We used dnsmasq.
3. Tuning the Scrapy thread pool for efficient DNS resolution
• Scrapy relies on the OS DNS resolver,
• which makes blocking calls,
• so a thread pool is used to resolve DNS names to IPs.
• Result: numerous errors and timeouts 🆘
• Fix: a patch to adjust the thread pool size and timeout.
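Current Scrapy releases expose settings for the thread pool size and the DNS timeout, so a similar adjustment can be expressed as a settings fragment. The setting names below exist in Scrapy; the values are illustrative assumptions, not the ones used in this crawl.

```python
# Fragment of a Scrapy settings.py; values are illustrative assumptions.

# Size of the Twisted reactor thread pool, which also serves DNS lookups.
REACTOR_THREADPOOL_MAXSIZE = 30

# Seconds to wait for a DNS query before giving up.
DNS_TIMEOUT = 120

# In-memory DNS cache, useful when many URLs share the same hosts.
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 100000
```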
4. Overloaded HBase region servers during state check
3 TB of metadata (URLs, timestamps, …), 275 bytes/doc.
• ~10^3 links per doc,
• each link needs a state check: CRAWLED / NOT CRAWLED / ERROR,
• storage on HDDs.
• Small volume: fine 🆗
• As the table grew, response times grew too 🆘
• The disk queue grew as well.
• Fix: a host-local fingerprint function for keys in HBase.
• Fix: tuning the HBase block cache to fit an average host's states into one block.
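HBase stores rows sorted by key, so a key that starts with a hash of the hostname keeps all URLs of one host next to each other. The sketch below shows one possible host-local fingerprint; the function name and the choice of CRC32 plus SHA-1 are assumptions for illustration, not necessarily the exact function used in Frontera.

```python
import hashlib
import zlib
from urllib.parse import urlparse


def host_local_fingerprint(url: str) -> str:
    """Return a row key: 8 hex chars of CRC32(hostname) + SHA-1 of the URL.

    Keys of URLs from the same host share a prefix, so they sort together
    and a state check for one host touches few storage blocks.
    """
    host = urlparse(url).hostname or ""
    host_part = zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF
    url_part = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{host_part:08x}{url_part}"
```

Since most links on a page point to the same host, checking the state of those links then reads a narrow key range instead of random keys spread over the whole table.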
5. Intensive network traffic from workers to services
• Throughput between workers and Kafka/HBase was ~1 Gbit/s.
• Fix: Thrift compact protocol for HBase.
• Fix: message compression in Kafka with Snappy.
6. Further query and traffic optimizations to HBase
• State check: lots of requests and network traffic.
• Consistency has to be preserved.
• Fix: a local state cache in the strategy worker.
• For consistency, the spider log was partitioned by host.
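Partitioning by host means that every message about a given host lands in the same partition and is therefore handled by the same strategy worker, whose local cache then holds the only live copy of that host's states. A minimal sketch, assuming a CRC32-based partitioner (the function name `partition_for` is hypothetical):

```python
import zlib
from urllib.parse import urlparse


def partition_for(url: str, num_partitions: int) -> int:
    """Map a URL to a partition using only its hostname."""
    host = urlparse(url).hostname or ""
    return (zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF) % num_partitions
```

Two URLs from the same host always map to the same partition, regardless of their paths.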
State cache
• All operations are batched:
– if a key is not in the cache → read it from HBase,
– every ~4K docs → flush.
• When the cache approaches 3M elements (~1 GB) → flush & clean up.
• Least-Recently-Used (LRU) 👍
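The behaviour described above can be sketched as a small LRU cache with batched flushes. This is an illustration under assumptions, not the Frontera implementation: a plain dict stands in for HBase, the class and parameter names are hypothetical, and the flush counter tracks state updates rather than processed documents.

```python
from collections import OrderedDict


class StateCache:
    """Hypothetical LRU state cache with batched write-back."""

    def __init__(self, backend, max_size=3_000_000, flush_every=4_000):
        self.backend = backend      # dict standing in for the HBase table
        self.cache = OrderedDict()  # key -> state, least recently used first
        self.dirty = set()          # keys changed since the last flush
        self.max_size = max_size
        self.flush_every = flush_every
        self.updates = 0

    def get_many(self, keys):
        # Cache misses are collected and read from the backend together.
        for key in [k for k in keys if k not in self.cache]:
            self.cache[key] = self.backend.get(key, "NOT CRAWLED")
        result = {}
        for key in keys:
            self.cache.move_to_end(key)  # mark as recently used
            result[key] = self.cache[key]
        self._evict()
        return result

    def set(self, key, state):
        self.cache[key] = state
        self.cache.move_to_end(key)
        self.dirty.add(key)
        self.updates += 1
        if self.updates >= self.flush_every:
            self.flush()
        self._evict()

    def flush(self):
        for key in self.dirty:
            self.backend[key] = self.cache[key]
        self.dirty.clear()
        self.updates = 0

    def _evict(self):
        if len(self.cache) > self.max_size:
            self.flush()  # never drop unsaved states
            while len(self.cache) > self.max_size:
                self.cache.popitem(last=False)  # least recently used
```

Flushing before eviction keeps the backend consistent: an entry is only dropped from memory after its latest state has been written back.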
Spider priority queue (slot)
• Cell: array of
– fingerprint,
– Crc32(hostname),
– URL,
– score.
• Dequeueing the top N.
• Prone to flooding by huge hosts.
• Scoring model: document count per host.
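The cell layout and the top-N dequeue can be illustrated with a named tuple and a sort. The names `QueueItem`, `make_item` and `dequeue_top_n` are hypothetical, and how the score is derived from the per-host document count is not specified in the slides, so scores are simply passed in.

```python
import zlib
from typing import NamedTuple


class QueueItem(NamedTuple):
    fingerprint: str
    host_crc32: int
    url: str
    score: float


def make_item(fingerprint, hostname, url, score):
    """Build one queue record; the host is stored as CRC32(hostname)."""
    crc = zlib.crc32(hostname.encode("utf-8")) & 0xFFFFFFFF
    return QueueItem(fingerprint, crc, url, score)


def dequeue_top_n(cell, n):
    """Return the n highest-scored items and the remaining items."""
    ranked = sorted(cell, key=lambda item: item.score, reverse=True)
    return ranked[:n], ranked[n:]
```

The sketch also shows why such a queue is prone to huge hosts: if one host owns the highest-scored URLs, the top N may come entirely from it.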
7. Problem of big and small hosts (strikes back!)
• We discovered a few very huge hosts (>20M docs).
• All queue partitions were flooded with URLs from these huge hosts.
• Fix: two MapReduce jobs:
– queue shuffling,
– limiting every host to 100 docs max.
Hardware requirements
• A single-threaded Scrapy spider → 1200 pages/min. from ~100 websites in parallel.
• The spiders-to-workers ratio is 4:1 (without content).
• 1 GB of RAM for every SW (state cache, tunable).
• Example:
– 12 spiders ≈ 14.4K pages/min.,
– 3 SW and 3 DB workers,
– 18 cores in total.
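The example numbers follow from simple arithmetic, reproduced below. Two assumptions are made beyond the slide: each process occupies one core (which matches 12 + 3 + 3 = 18), and the number of DB workers equals the number of strategy workers, as in the example. The function name `size_cluster` is hypothetical.

```python
import math

PAGES_PER_SPIDER_PER_MIN = 1200  # from the slide
SPIDERS_PER_STRATEGY_WORKER = 4  # 4:1 ratio from the slide
RAM_GB_PER_STRATEGY_WORKER = 1   # from the slide


def size_cluster(spiders):
    """Estimate throughput and resources for a given number of spiders."""
    strategy_workers = math.ceil(spiders / SPIDERS_PER_STRATEGY_WORKER)
    db_workers = strategy_workers  # assumption: same count as in the example
    return {
        "pages_per_min": spiders * PAGES_PER_SPIDER_PER_MIN,
        "strategy_workers": strategy_workers,
        "db_workers": db_workers,
        "cores": spiders + strategy_workers + db_workers,  # one core each
        "state_cache_ram_gb": strategy_workers * RAM_GB_PER_STRATEGY_WORKER,
    }
```

For 12 spiders this gives 14,400 pages/min., 3 strategy workers, 3 DB workers and 18 cores, matching the example.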
Software requirements
• Apache HBase,
• Apache Kafka,
• Python 2.7+,
• Scrapy 0.24+,
• DNS service.
CDH (100% open source Hadoop package)
Maintaining Cloudera Hadoop on Amazon EC2
• CDH is very sensitive to free space on the root partition, for parcels, and for Cloudera Manager storage.
• We moved them to a separate EBS partition using symbolic links.
• The EBS volume should be at least 30 GB; base IOPS should be enough.
• Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB RAM, 2x40 GB SSD).
• After one week of crawling we ran out of space and started to move DataNodes to d2.xlarge (4 CPU, 30.5 GB RAM, 3x2 TB HDD).
Spanish (.es) internet crawl results
• The biggest websites are fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es and docentesconeducacion.es.
• 68.7K domains found (~600K expected),
• 46.5M crawled pages overall,
• 1.5 months,
• 22 websites with more than 50M pages.
Where are the rest of the web servers?!
Bow-tie model
A. Broder et al. / Computer Networks 33 (2000) 309-320
Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005
12 years dynamics
Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014
Main features
• Online operation: scheduling of new batches, updating of DB state.
• Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included).
• Canonical URL resolution abstraction: each document has many URLs, which one to use?
• Scrapy ecosystem: good documentation, big community, ease of customization.
Distributed Frontera features
• The communication layer is Apache Kafka: topic partitioning, offsets mechanism.
• Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module.
• Polite by design: each website is downloaded by at most one spider.
• Python: workers, spiders.
References
• Frontera. https://github.com/scrapinghub/frontera
• Distributed extension. https://github.com/scrapinghub/distributed-frontera
• Documentation:
– http://frontera.readthedocs.org/
– http://distributed-frontera.readthedocs.org/
• Google groups: Frontera (https://goo.gl/ak9546)
Future plans
• Lighter version, without HBase and Kafka, communicating over sockets.
• Revisiting strategy out of the box.
• Watchdog solution: tracking website content changes.
• PageRank or HITS strategy.
• Own HTML and URL parsers.
• Integration into Scrapinghub services.
• Testing on larger volumes.
Run your business using Frontera
• SCALABLE
• OPEN
• CUSTOMIZABLE
Made at Scrapinghub (authors of Scrapy)
YOUR code could be here!
• Web-scale crawler,
• Historically the first attempt in Python,
• Truly resource-intensive task: CPU, network, disks.
We’re hiring!
http://scrapinghub.com/jobs/
Thank you!
Alexander Sibiryakov,
sibiryakov@scrapinghub.com
Ad

More Related Content

What's hot (20)

Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Ontico
 
Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)
Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)
Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)
Ontico
 
Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...
Fwdays
 
Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012
Jeremy Zawodny
 
MySQL Replication — Advanced Features / Петр Зайцев (Percona)
MySQL Replication — Advanced Features / Петр Зайцев (Percona)MySQL Replication — Advanced Features / Петр Зайцев (Percona)
MySQL Replication — Advanced Features / Петр Зайцев (Percona)
Ontico
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Jeremy Zawodny
 
Caching solutions with Redis
Caching solutions   with RedisCaching solutions   with Redis
Caching solutions with Redis
George Platon
 
Linux Kernel Extension for Databases / Александр Крижановский (Tempesta Techn...
Linux Kernel Extension for Databases / Александр Крижановский (Tempesta Techn...Linux Kernel Extension for Databases / Александр Крижановский (Tempesta Techn...
Linux Kernel Extension for Databases / Александр Крижановский (Tempesta Techn...
Ontico
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
Jeremy Zawodny
 
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
confluent
 
Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...
Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...
Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...
Ontico
 
Redis overview for Software Architecture Forum
Redis overview for Software Architecture ForumRedis overview for Software Architecture Forum
Redis overview for Software Architecture Forum
Christopher Spring
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at Craigslist
Jeremy Zawodny
 
MongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log CollectorMongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log Collector
Pierre Baillet
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Arnab Mitra
 
A simple introduction to redis
A simple introduction to redisA simple introduction to redis
A simple introduction to redis
Zhichao Liang
 
Redis acc 2015_eng
Redis acc 2015_engRedis acc 2015_eng
Redis acc 2015_eng
DaeMyung Kang
 
Troubleshooting redis
Troubleshooting redisTroubleshooting redis
Troubleshooting redis
DaeMyung Kang
 
Redis ndc2013
Redis ndc2013Redis ndc2013
Redis ndc2013
DaeMyung Kang
 
Bringing code to the data: from MySQL to RocksDB for high volume searches
Bringing code to the data: from MySQL to RocksDB for high volume searchesBringing code to the data: from MySQL to RocksDB for high volume searches
Bringing code to the data: from MySQL to RocksDB for high volume searches
Ivan Kruglov
 
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Ontico
 
Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)
Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)
Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)
Ontico
 
Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...
Fwdays
 
Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012
Jeremy Zawodny
 
MySQL Replication — Advanced Features / Петр Зайцев (Percona)
MySQL Replication — Advanced Features / Петр Зайцев (Percona)MySQL Replication — Advanced Features / Петр Зайцев (Percona)
MySQL Replication — Advanced Features / Петр Зайцев (Percona)
Ontico
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Jeremy Zawodny
 
Caching solutions with Redis
Caching solutions   with RedisCaching solutions   with Redis
Caching solutions with Redis
George Platon
 
Linux Kernel Extension for Databases / Александр Крижановский (Tempesta Techn...
Linux Kernel Extension for Databases / Александр Крижановский (Tempesta Techn...Linux Kernel Extension for Databases / Александр Крижановский (Tempesta Techn...
Linux Kernel Extension for Databases / Александр Крижановский (Tempesta Techn...
Ontico
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
Jeremy Zawodny
 
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
confluent
 
Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...
Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...
Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...
Ontico
 
Redis overview for Software Architecture Forum
Redis overview for Software Architecture ForumRedis overview for Software Architecture Forum
Redis overview for Software Architecture Forum
Christopher Spring
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at Craigslist
Jeremy Zawodny
 
MongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log CollectorMongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log Collector
Pierre Baillet
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Arnab Mitra
 
A simple introduction to redis
A simple introduction to redisA simple introduction to redis
A simple introduction to redis
Zhichao Liang
 
Troubleshooting redis
Troubleshooting redisTroubleshooting redis
Troubleshooting redis
DaeMyung Kang
 
Bringing code to the data: from MySQL to RocksDB for high volume searches
Bringing code to the data: from MySQL to RocksDB for high volume searchesBringing code to the data: from MySQL to RocksDB for high volume searches
Bringing code to the data: from MySQL to RocksDB for high volume searches
Ivan Kruglov
 

Frontera: a distributed robot for large-scale web crawling / Alexander Sibiryakov (Scrapinghub)

  • 1. Frontera: open source, large scale web crawling framework Alexander Sibiryakov, Scrapinghub Ltd. sibiryakov@scrapinghub.com
  • 6. Hello, participants! • Software Engineer @ Scrapinghub • Born in Yekaterinburg, RU • 5 years at Yandex: social & QA search, snippets. • 2 years at Avast! antivirus: false positives, malicious downloads
  • 7. We help turn web content into useful data { "content": [ { "title": { "text": "'Extreme poverty' to fall below 10% of world population for first time", "href": "http://www.theguardian.com/society/ 2015/oct/05/world-bank-extreme-poverty-to-fall- below-10-of-world-population-for-first-time" }, "points": "9 points", "time_ago": { "text": "2 hours ago", "href": "https://news.ycombinator.com/item? id=10352189" }, "username": { "text": "hliyan", "href": "https://news.ycombinator.com/user? id=hliyan" } },
  • 8. We help turn web content into useful data • Over 2 billion requests per month (~800/sec.) { "content": [ { "title": { "text": "'Extreme poverty' to fall below 10% of world population for first time", "href": "http://www.theguardian.com/society/ 2015/oct/05/world-bank-extreme-poverty-to-fall- below-10-of-world-population-for-first-time" }, "points": "9 points", "time_ago": { "text": "2 hours ago", "href": "https://news.ycombinator.com/item? id=10352189" }, "username": { "text": "hliyan", "href": "https://news.ycombinator.com/user? id=hliyan" } },
  • 9. We help turn web content into useful data • Over 2 billion requests per month (~800/sec.) • Focused crawls & Broad crawls { "content": [ { "title": { "text": "'Extreme poverty' to fall below 10% of world population for first time", "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time" }, "points": "9 points", "time_ago": { "text": "2 hours ago", "href": "https://news.ycombinator.com/item?id=10352189" }, "username": { "text": "hliyan", "href": "https://news.ycombinator.com/user?id=hliyan" } },
  • 17. Broad crawl usages • News analysis • Topical crawling • Plagiarism detection • Sentiment analysis (popularity, likability) • Due diligence (profile/business data) • Lead generation (extracting contact information) • Track criminal activity & find lost persons (DARPA)
  • 24. Saatchi Global Gallery Guide • www.globalgalleryguide.com • Discover 11K online galleries. • Extract general information, art samples, descriptions. • NLP-based extraction. • Find more galleries on the web.
  • 34. Task • Spanish web: statistics on hosts and their sizes. • Only the .es ccTLD. • Breadth-first strategy: • first the 1-click neighborhood, • then 2, • 3, • … • Finishing condition: at most 100 docs per host, across all hosts. • Low costs.
  • 35. Spanish, Russian and world Web, 2012. Sources: OECD Communications Outlook 2013, statdom.ru. (* current period, October 2015)
      Region | Domains | Web servers | Hosts | DMOZ*
      Spanish (.es) | 1,5M | 280K | 4,2M | 122K
      Russian (.ru, .рф, .su) | 4,8M | 2,6M | ? | 105K
      World | 233M | 62M | 890M | 1,7
  • 40. Solution • Scrapy (based on Twisted) - async network operations. • Apache Kafka - data bus (offsets, partitioning). • Apache HBase - storage (random access, linear scanning, scalability). • Snappy - efficient compression algorithm for IO-bound applications.
  • 46. 1. Big and small hosts problem • Queue is flooded with URLs from the same host. • → underuse of spider resources. • additional per-host (per-IP) queue and metering algorithm. • URLs from big hosts are cached in memory.
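The per-host queue and metering idea on this slide could be sketched roughly as follows. This is not Frontera's actual code: the class name `PerHostQueue`, the `max_per_host` cap and the round-robin order are assumptions made for illustration.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PerHostQueue:
    """Hypothetical sketch: one FIFO per host, drained round-robin with a per-batch cap."""

    def __init__(self, max_per_host=2):
        self.max_per_host = max_per_host   # assumed metering limit per host per batch
        self.queues = defaultdict(deque)   # hostname -> pending URLs (kept in memory)

    def push(self, url):
        host = urlsplit(url).hostname or ""
        self.queues[host].append(url)

    def next_batch(self, size):
        batch = []
        taken = defaultdict(int)           # URLs already taken from each host in this batch
        progress = True
        while len(batch) < size and progress:
            progress = False
            for host in list(self.queues):
                queue = self.queues[host]
                if not queue:
                    del self.queues[host]  # drop exhausted hosts
                    continue
                if taken[host] >= self.max_per_host:
                    continue               # this host has used up its share of the batch
                batch.append(queue.popleft())
                taken[host] += 1
                progress = True
                if len(batch) == size:
                    break
        return batch
```

With such a cap, a host with millions of queued URLs cannot fill a whole batch, so small hosts still get downloaded and spiders are not left waiting on a single site's politeness delay.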
  • 56. 2. DDoS of the Amazon AWS DNS service. Breadth-first strategy → first visits to unknown hosts → a huge number of DNS requests. Recursive DNS server • on every spider node, • upstream to Verizon & OpenDNS. We used dnsmasq.
  • 63. 3. Tuning Scrapy thread pool for efficient DNS resolution • OS DNS resolver, • blocking calls, • thread pool to resolve DNS name to IP. • numerous errors and timeouts 🆘 • A patch for thread pool size and timeout adjustment.
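Current Scrapy releases expose the thread pool size and the DNS timeout as ordinary settings. Whether these correspond exactly to the patch mentioned on the slide is an assumption, and the values below are illustrative rather than the ones used in this crawl.

```python
# settings.py fragment for a Scrapy project (illustrative values, not from the talk)
REACTOR_THREADPOOL_MAXSIZE = 32  # threads available for blocking DNS lookups (Scrapy default: 10)
DNS_TIMEOUT = 180                # seconds to wait for a DNS answer (Scrapy default: 60)
DNSCACHE_ENABLED = True          # keep resolved names in Scrapy's in-memory DNS cache
```

A larger pool lets more blocking resolver calls run in parallel, which matters when a breadth-first crawl keeps meeting hosts it has never resolved before.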
  • 72. 4. Overloaded HBase region servers during state check • 10^3 links per doc, • state check: CRAWLED/NOT CRAWLED/ERROR, • HDDs. • Small volume 🆗 • With ⬆table size, response times ⬆🆘 • Disk queue ⬆ • Host-local fingerprint function for keys in HBase. • Tuning HBase block cache to fit average host states into one block. (Sidebar: 3 TB of metadata: URLs, timestamps, …; 275 bytes/doc.)
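A host-local fingerprint can be sketched as a key whose leading bytes depend only on the hostname, so that all URLs of one host sort next to each other in HBase and a state check for one page's links touches few blocks. The exact layout below (CRC32 of the hostname followed by SHA-1 of the URL) is an assumption for illustration, not necessarily the function used in Frontera.

```python
import hashlib
import zlib
from urllib.parse import urlsplit


def host_local_fingerprint(url):
    """Return a hex key whose first 8 characters depend only on the hostname."""
    host = urlsplit(url).hostname or ""
    # 32-bit host prefix: identical for every URL on the same host
    host_prefix = zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF
    # per-URL part: distinguishes documents within the host
    url_digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "%08x%s" % (host_prefix, url_digest)
```

Because HBase stores rows sorted by key, keys sharing a prefix land in the same region and often in the same block, which is what makes the block cache tuning on this slide effective.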
  • 76. 5. Intensive network traffic from workers to services • Throughput between workers and Kafka/HBase ~ 1Gbit/s. • Thrift compact protocol for HBase • Message compression in Kafka with Snappy
  • 81. 6. Further query and traffic optimizations to HBase • State check: lots of reqs and network • Consistency • Local state cache in strategy worker. • For consistency, spider log was partitioned by host.
  • 87. State cache • All ops are batched: – If no key in cache → read HBase – every ~4K docs → flush • Close to 3M (~1 GB) elements → flush & cleanup • Least-Recently-Used (LRU) 👍
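The behaviour listed on this slide can be sketched with an `OrderedDict` acting as the LRU structure. The class below is a simplified illustration, not Frontera's implementation: `backend` is a plain dict standing in for the HBase table, updates are counted instead of documents, and evicting down to half of `max_size` during cleanup is an assumption.

```python
from collections import OrderedDict


class StateCache:
    """Hypothetical write-back LRU cache in front of a dict-like state store."""

    def __init__(self, backend, flush_every=4096, max_size=3_000_000):
        self.backend = backend          # stand-in for the HBase states table
        self.flush_every = flush_every  # flush after this many updates (~4K on the slide)
        self.max_size = max_size        # flush and clean up near this many entries
        self.cache = OrderedDict()      # oldest entry first, newest last
        self.dirty = set()              # keys changed since the last flush
        self.updates = 0

    def fetch(self, fingerprints):
        """Batch step: load keys that are missing from the cache from the backend."""
        for fp in fingerprints:
            if fp not in self.cache and fp in self.backend:
                self.cache[fp] = self.backend[fp]

    def get(self, fp, default=None):
        if fp in self.cache:
            self.cache.move_to_end(fp)  # mark as recently used
            return self.cache[fp]
        return default

    def set(self, fp, state):
        self.cache[fp] = state
        self.cache.move_to_end(fp)
        self.dirty.add(fp)
        self.updates += 1
        if self.updates >= self.flush_every or len(self.cache) >= self.max_size:
            self.flush()

    def flush(self):
        for fp in self.dirty:           # write changed states back in one pass
            self.backend[fp] = self.cache[fp]
        self.dirty.clear()
        self.updates = 0
        while len(self.cache) > self.max_size // 2:
            self.cache.popitem(last=False)  # evict least recently used entries
```

Entries are only evicted after dirty keys have been written back, so cleanup never loses a state update.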
  • 93. Spider priority queue (slot) • Cell: Array of:
 - fingerprint, 
 - Crc32(hostname), 
 - URL, 
 - score • Dequeueing top N. • Prone to huge hosts • Scoring model: document count per host.
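A queue cell and the "dequeue top N" step might look like the sketch below. The tuple fields follow the slide; the scoring formula `1 / (1 + docs_from_host)` is only one possible reading of "document count per host" and is an assumption, as are the function names.

```python
import heapq
import zlib
from collections import namedtuple

# One queue entry, with the fields listed on the slide
Item = namedtuple("Item", "fingerprint host_crc32 url score")


def make_item(fingerprint, hostname, url, docs_from_host):
    # Assumed scoring model: the more documents a host already has, the lower the score
    score = 1.0 / (1 + docs_from_host)
    host_crc32 = zlib.crc32(hostname.encode("utf-8")) & 0xFFFFFFFF
    return Item(fingerprint, host_crc32, url, score)


def dequeue_top(cell, n):
    """Remove and return the n highest-scored items from a cell (a list of Items)."""
    top = heapq.nlargest(n, cell, key=lambda item: item.score)
    for item in top:
        cell.remove(item)
    return top
```

Without a host-aware score, the top N of a cell can consist entirely of URLs from one huge host, which is the weakness the slide points out.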
  • 99. 7. Problem of big and small hosts (strikes back!) • Discovered a few huge hosts (>20M docs). • All queue partitions were flooded with huge hosts. • Two MapReduce jobs: – queue shuffling, – limit all hosts to 100 docs MAX.
  • 107. Hardware requirements • Single-thread Scrapy spider → 1200 pages/min. from ~100 websites in parallel. • Spiders to workers ratio is 4:1 (without content) • 1 GB of RAM for every SW (state cache, tunable). • Example: – 12 spiders ~ 14.4K pages/min., – 3 SW and 3 DB workers, – Total 18 cores.
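The sizing example can be reproduced with a few lines of arithmetic. The sketch assumes that the 4:1 ratio applies separately to strategy workers (SW) and DB workers, and that every process occupies one core; both assumptions are inferred from the example's totals rather than stated on the slide.

```python
import math

PAGES_PER_SPIDER_PER_MIN = 1200  # single-thread Scrapy spider, from the slide
SPIDERS_PER_WORKER = 4           # 4:1 ratio, assumed to hold per worker type
RAM_GB_PER_STRATEGY_WORKER = 1   # state cache, tunable


def estimate_cluster(spiders):
    strategy_workers = math.ceil(spiders / SPIDERS_PER_WORKER)
    db_workers = math.ceil(spiders / SPIDERS_PER_WORKER)
    return {
        "pages_per_min": spiders * PAGES_PER_SPIDER_PER_MIN,
        "strategy_workers": strategy_workers,
        "db_workers": db_workers,
        "cores": spiders + strategy_workers + db_workers,  # one core per process
        "strategy_worker_ram_gb": strategy_workers * RAM_GB_PER_STRATEGY_WORKER,
    }


print(estimate_cluster(12))
```

For 12 spiders this gives 14,400 pages/min., 3 SW, 3 DB workers and 18 cores, matching the figures in the example.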
  • 113. Software requirements • Apache HBase, • Apache Kafka, • Python 2.7+, • Scrapy 0.24+, • DNS Service. CDH (100% Open source Hadoop package)
  • 119. Maintaining Cloudera Hadoop on Amazon EC2 • CDH is very sensitive to free space on the root partition, parcels, and storage of Cloudera Manager. • We moved them to a separate EBS partition using symbolic links. • EBS should be at least 30 GB; base IOPS should be enough. • Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB, 2x40 GB SSD). • After one week of crawling, we ran out of space and started to move DataNodes to d2.xlarge (4 CPU, 30.5 GB, 3x2 TB HDD).
  • 125. Spanish (.es) internet crawl results • fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es - are the biggest websites • 68.7K domains found (~600K expected), • 46.5M crawled pages overall, • 1.5 months, • 22 websites with more than 50M pages
  • 126. Where are the rest of the web servers?!
  • 127. Bow-tie model A. Broder et al. / Computer Networks 33 (2000) 309-320
  • 128. Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005
  • 129. 12 years dynamics Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014
  • 134. Main features • Online operation: scheduling of new batch, updating of DB state. • Storage abstraction: write your own backend (sqlalchemy, HBase is included). • Canonical URLs resolution abstraction: each document has many URLs, which to use? • Scrapy ecosystem: good documentation, big community, ease of customization.
  • 139. Distributed Frontera features • Communication layer is Apache Kafka: topic partitioning, offsets mechanism. • Crawling strategy abstraction: crawling goal, URL ordering, and scoring model are coded in a separate module. • Polite by design: each website is downloaded by at most one spider. • Python: workers, spiders.
  • 140. References • Frontera. https://github.com/scrapinghub/frontera • Distributed extension. https://github.com/scrapinghub/distributed-frontera • Documentation: – http://frontera.readthedocs.org/ – http://distributed-frontera.readthedocs.org/ • Google groups: Frontera (https://goo.gl/ak9546)
  • 148. Future plans • Lighter version, without HBase and Kafka, communicating over sockets. • Revisiting strategy out of the box. • Watchdog solution: tracking website content changes. • PageRank or HITS strategy. • Own HTML and URL parsers. • Integration into Scrapinghub services. • Testing on larger volumes.
  • 153. Run your business using Frontera • SCALABLE • OPEN • CUSTOMIZABLE • Made in Scrapinghub (authors of Scrapy)
  • 157. Your code could be here! • Web scale crawler, • Historically first attempt in Python, • Truly resource-intensive task: CPU, network, disks.