FastWARC Optimizing Large-Scale Web Archive Analyt
ming languages, such as WARCIO,2 the de-facto standard
for Python. When processing web archives at the terabyte or
petabyte scale, however, even small inefficiencies in these
tools add up quickly, resulting in hours, days, or even weeks
of wasted compute time. Reviewing the basic components
of WARCIO and analyzing its bottlenecks, we proceed to
build FastWARC, a new high-performance WARC processing
library for Python, written in C++ / Cython, which yields
performance improvements by factors of 1.6–8x.

GZip  WARCIO+HTTP                  5 435.6  –
GZip  FastWARC+HTTP               10 101.5  1.9
GZip  WARCIO+HTTP+Checksum         4 121.6  –
GZip  FastWARC+HTTP+Checksum       7 433.0  1.8
LZ4   FastWARC                    49 825.4  7.7∗
LZ4   FastWARC+HTTP               42 394.5  7.8∗
LZ4   FastWARC+HTTP+Checksum      16 992.2  4.1∗

Intel(R) Xeon(R) CPU E5-2620 v2 (remote Ceph storage)
None  WARCIO                       7 969.1  –
None  FastWARC                    49 396.5  6.2
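
The speedup figures in the benchmark rows can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that the first numeric column is a throughput (e.g., records per second) and that the last column is the ratio of FastWARC throughput to the WARCIO baseline in the same configuration; the function name `speedup` and the dictionary layout are illustrative and not part of either library.

```python
# Sketch: recompute speedup factors from the throughput values in the table.
# Assumption: speedup = FastWARC throughput / WARCIO throughput for the same
# compression and parsing configuration, rounded to one decimal place.

def speedup(fastwarc_throughput: float, warcio_throughput: float) -> float:
    """Return the speedup of FastWARC over WARCIO, rounded to one decimal."""
    return round(fastwarc_throughput / warcio_throughput, 1)

# (FastWARC, WARCIO) throughput pairs taken from the rows above.
pairs = {
    "GZip +HTTP": (10101.5, 5435.6),
    "GZip +HTTP+Checksum": (7433.0, 4121.6),
    "None (Xeon, remote Ceph)": (49396.5, 7969.1),
}

results = {name: speedup(fast, base) for name, (fast, base) in pairs.items()}

for name, factor in results.items():
    print(f"{name}: {factor}x")
```

Under these assumptions, the recomputed factors match the values listed in the rows for which a WARCIO baseline is shown (1.9, 1.8, and 6.2). The LZ4 rows are omitted because their baseline is not given among these rows.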