Analyzing Web Archives

Analysis of Web Archives
Vinay Goel
Senior Data Engineer

Internet Archive
• Established in 1996
• 501(c)(3) nonprofit organization
• 20+ PB (compressed) of publicly accessible archival
material
• Technology partner to libraries, museums, universities,
research and memory institutions
• Currently archiving books, text, film, video, audio,
images, software, educational content and the Internet

IA Web Archive
• Began in 1996
• 426+ Billion publicly accessible web instances
• Operate web-wide, survey, end-of-life, selective and
resource-specific web harvests
• Develop freely available, open source, web archiving and
access tools

Analysis Tools
• Arbitrary analysis of archived data
• Scales up and down
• Tools
• Apache Hadoop (distributed storage and processing)
• Apache Pig (batch processing with a data flow language)
• Apache Hive (batch processing with a SQL-like language)
• Apache Giraph (batch graph processing)
• Apache Mahout (scalable machine learning)

Data
• Crawler logs
• Crawled data
• Crawled data derivatives
• Wayback Index
• Text Search Index
• WAT

Crawled data (ARC / WARC)
• Data written by web crawlers
• Before 2008, written into ARC files
• From 2008, IA began writing into WARC files
• data donations from 3rd parties still include ARC files
• WARC file format (ISO standard) is a revision of the ARC file format
• Each (W)ARC file contains a series of concatenated records
• Full HTTP request/response records
• WARC files also contain metadata records, and records to store
duplication events, and to support segmentation and conversion
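
The "series of concatenated records" layout can be illustrated with a minimal Python sketch. It assumes an uncompressed WARC stream in which each record is a version line, named header fields, a blank line, and a block of Content-Length bytes. The helper make_record and the sample records are invented for the demo; real files are normally gzip-compressed and are better read with a maintained WARC library.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        headers = {"version": line.strip().decode("utf-8")}  # e.g. "WARC/1.0"
        # Named header fields run until the first empty line.
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the exact size of the record block in bytes.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(record_type, uri, body):
    """Build one toy record (hypothetical helper, used only for this demo)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    make_record("request", "http://example.com/", b"GET / HTTP/1.1\r\n\r\n")
    + make_record("response", "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\nhello")
)
records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Reading by Content-Length rather than scanning for delimiters is what lets a reader jump straight to a record when it is given a file offset.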

Wayback Index (CDX)
• Index for the Wayback Machine
• Generated by parsing crawled (W)ARC data
• Plain text file with one line per captured resource
• Each line contains only essential metadata required by the Wayback
software
• URL, Timestamp, Content Digest
• MIME Type, HTTP Status Code, Size
• Meta tags, Redirect URL (when applicable)
• (W)ARC filename and file offset of record
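
As a rough illustration of the one-line-per-capture layout, the sketch below parses a single CDX line in Python. It assumes the common space-delimited 11-field layout (canonicalized URL key, timestamp, original URL, MIME type, status code, digest, redirect, meta tags, size, offset, filename); a given CDX file may use a different field order, normally declared in its header line. The sample line and its values are invented.

```python
from collections import namedtuple

# Assumed field order; check the header line of the actual CDX file.
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "length", "offset", "filename",
]
CdxLine = namedtuple("CdxLine", CDX_FIELDS)


def parse_cdx_line(line):
    """Split one space-delimited CDX line into named fields."""
    parts = line.strip().split(" ")
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(parts)}")
    return CdxLine(*parts)


# Invented sample capture; "-" marks a field that does not apply.
sample_line = (
    "com,example)/ 20130101120000 http://example.com/ text/html 200 "
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1043 333 EXAMPLE-20130101-00000.warc.gz"
)
rec = parse_cdx_line(sample_line)
print(rec.original, rec.timestamp[:4], rec.statuscode, rec.filename, rec.offset)
```

The filename and offset fields are what allow playback software to seek directly to the stored record.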

CDX Analysis
• Store generated CDX data in Hadoop (HDFS)
• Create Hive table
• Partition the data by partner, collection, crawl instance
• reduce I/O and query times
• Run queries using HiveQL (a SQL-like language)
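
The kind of aggregate such queries produce can be sketched in plain Python on a handful of rows. This is not the HiveQL used in the warehouse; it is a small in-memory stand-in. The sample captures are invented, and a capture is counted as a duplicate when the same URL was seen earlier with the same content digest, which is an assumed definition.

```python
from collections import Counter

# Invented (url, timestamp, digest) rows standing in for CDX data.
captures = [
    ("http://example.com/", "20100101000000", "AAA"),
    ("http://example.com/", "20110101000000", "AAA"),  # unchanged content
    ("http://example.com/", "20120101000000", "BBB"),
    ("http://example.org/", "20110601000000", "CCC"),
    ("http://example.org/", "20120601000000", "CCC"),  # unchanged content
]


def summarize(rows):
    """Return captures per year, duplication rate per year, URLs by first-crawl year."""
    captures_per_year = Counter()
    duplicates_per_year = Counter()
    first_year = {}
    seen = set()
    # Process captures in time order so "seen earlier" is well defined.
    for url, timestamp, digest in sorted(rows, key=lambda r: r[1]):
        year = timestamp[:4]
        captures_per_year[year] += 1
        first_year.setdefault(url, year)
        if (url, digest) in seen:
            duplicates_per_year[year] += 1
        seen.add((url, digest))
    duplication_rate = {
        year: duplicates_per_year[year] / total
        for year, total in captures_per_year.items()
    }
    first_crawled = Counter(first_year.values())
    return captures_per_year, duplication_rate, first_crawled


per_year, dup_rate, first_crawled = summarize(captures)
print(dict(per_year), dup_rate, dict(first_crawled))
```

In Hive the same grouping would run per partition, which is why partitioning by partner, collection and crawl instance reduces the data each query has to scan.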

CDX Analysis: Growth of Content

CDX Analysis: Rate of Duplication

CDX Analysis: Breakdown by Year First Crawled

Log Warehouse
• Similar Hive setup for crawler logs
• Distribution of Domains, HTTP Status codes, MIME
types
• Enables crawler engineers to find timeout errors,
duplicate content, crawler traps, robots exclusions,
etc.

Text Search Index
• Parsed Text files: input used to build text indexes for search
• Generated by running a Hadoop MapReduce job that parses (W)ARC files
• HTML boilerplate is stripped out
• Also contains metadata
• URL, Timestamp, Content Digest, Record Length
• MIME Type, HTTP Status Code
• Title, description and meta keywords
• Links with anchor text
• Stored in Hadoop Sequence Files

WAT
• Extensible metadata format
• Essential metadata for many types of analyses
• Avoids barriers to data exchange: copyright, privacy
• Less data than (W)ARC, more than CDX
• WAT records are WARC metadata records
• Contains, for every HTML page in the (W)ARC:
• Title, description and meta keywords
• Embeds and outgoing links with alt/anchor text
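
A hedged Python sketch of pulling the title and links out of one WAT record payload follows. The JSON key names (Envelope, WARC-Header-Metadata, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata, Head, Links) are an assumption about the WAT layout and should be checked against real records; the sample payload is invented.

```python
import json
from urllib.parse import urljoin

# Invented payload following the assumed WAT JSON layout.
wat_payload = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.com/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Head": {"Title": "Example"},
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/about", "text": "About"},
                        {"path": "IMG@/src", "url": "/logo.png", "alt": "Logo"},
                    ],
                }
            }
        },
    }
})


def extract_links(payload):
    """Return (source URL, title, [(absolute link URL, alt or anchor text)])."""
    envelope = json.loads(payload)["Envelope"]
    source = envelope["WARC-Header-Metadata"]["WARC-Target-URI"]
    html = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    title = html.get("Head", {}).get("Title")
    links = []
    for link in html.get("Links", []):
        label = link.get("text") or link.get("alt") or ""
        # Resolve relative URLs against the page that contained them.
        links.append((urljoin(source, link["url"]), label))
    return source, title, links


source, title, links = extract_links(wat_payload)
print(source, title, links)
```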

Text Analysis
• Text extracted from (W)ARC / Parsed Text / WAT
• Use Pig
• extract text from records of interest
• tokenize, remove stop words, stemming
• generate top terms by TF-IDF
• prepare text for input to Mahout to generate vectorized
documents (Topic Modeling, Classification, Clustering
etc.)
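
The tokenize, stop-word, stemming and TF-IDF steps can be sketched in Python on a toy corpus. This is an in-memory stand-in for the Pig pipeline, not the pipeline itself. The stop-word list, the crude suffix-stripping stem function and the sample documents are all assumptions for illustration; a real job would use a proper stemmer and a full stop-word list.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}  # tiny illustrative list


def stem(token):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    return [stem(t) for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def top_terms(docs, k=3):
    """Rank each document's terms by TF-IDF (tf = count/len, idf = ln(N/df))."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    result = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        result[doc_id] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return result


docs = {
    "d1": "Archiving the web: crawlers and archives",
    "d2": "Crawlers fetch the web",
    "d3": "Libraries and museums",
}
terms = top_terms(docs)
print(terms)
```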

Link Analysis
• Links extracted from crawl logs / WARC metadata records /
Parsed Text / WAT
• Use Pig
• extract links from records of interest
• generate host & domain graphs for a given period
• find links in common between a pair of hosts/domains
• extract embedded links and compare with CDX to find
resources yet to be crawled
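
The three link operations above can be sketched in Python on a few invented (timestamp, source, destination) tuples. This is a small in-memory stand-in for the Pig scripts. It works at host level only, since mapping hosts to registered domains would need a public suffix list, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host part of a URL, or "" if there is none."""
    return urlsplit(url).hostname or ""


def host_graph(links, start, end):
    """Count host-to-host links with start <= timestamp < end (string comparison)."""
    edges = Counter()
    for timestamp, src, dst in links:
        if start <= timestamp < end:
            edges[(host_of(src), host_of(dst))] += 1
    return edges


def common_targets(links, host_a, host_b):
    """Return the destination URLs linked from both hosts."""
    targets = {}
    for _, src, dst in links:
        targets.setdefault(host_of(src), set()).add(dst)
    return targets.get(host_a, set()) & targets.get(host_b, set())


def not_yet_crawled(embeds, crawled_urls):
    """Return embedded URLs that have no capture in the index."""
    return sorted(set(embeds) - set(crawled_urls))


links = [
    ("20120101000000", "http://a.example/x", "http://c.example/page"),
    ("20120201000000", "http://b.example/y", "http://c.example/page"),
    ("20120301000000", "http://a.example/z", "http://c.example/other"),
    ("20130101000000", "http://a.example/x", "http://b.example/y"),
]
graph_2012 = host_graph(links, "2012", "2013")
shared = common_targets(links, "a.example", "b.example")
missing = not_yet_crawled(
    ["http://c.example/logo.png", "http://c.example/page"],
    {"http://c.example/page"},
)
print(dict(graph_2012), shared, missing)
```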

Archival Web Graph
• Use Pig to generate an Archival Web Graph (ID-Map
and ID-Graph)
• ID-Map: Mapping of integer (or fingerprint) ID to
source and destination URLs
• ID-Graph: An adjacency list using the assigned IDs
and timestamp info
• Compact representation of graph data
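
A minimal Python sketch of the two structures, assuming sequential integer IDs and one adjacency row per (source, timestamp) pair. The exact row layout used by the Pig scripts is not given here, so this shape is an assumption, and the sample links are invented.

```python
def build_archival_graph(links):
    """Build (id_map, id_graph) from (timestamp, source URL, destination URL) tuples."""
    id_map = {}  # URL -> integer ID; invert it to look up a URL by ID

    def url_id(url):
        # Assign the next free integer the first time a URL is seen.
        return id_map.setdefault(url, len(id_map))

    adjacency = {}
    for timestamp, src, dst in links:
        src_id, dst_id = url_id(src), url_id(dst)
        adjacency.setdefault((src_id, timestamp), set()).add(dst_id)

    # One row per (source ID, timestamp) with its sorted destination IDs.
    id_graph = [
        (src_id, timestamp, sorted(dst_ids))
        for (src_id, timestamp), dst_ids in sorted(adjacency.items())
    ]
    return id_map, id_graph


links = [
    ("20120101000000", "http://a.example/", "http://b.example/"),
    ("20120101000000", "http://a.example/", "http://c.example/"),
    ("20130101000000", "http://a.example/", "http://b.example/"),
    ("20120101000000", "http://b.example/", "http://a.example/"),
]
id_map, id_graph = build_archival_graph(links)
print(id_map)
print(id_graph)
```

Storing each URL string once in the ID-Map and referring to it by ID everywhere else is what keeps the graph compact.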

Link Analysis using Giraph
• Hadoop MapReduce not the best fit for iterative algorithms
• each iteration is a MapReduce Job with the graph structure
being read from and written to HDFS
• Use Giraph: open-source implementation of Google’s Pregel
• Vertex centric Bulk Synchronous Parallel (BSP) execution
model
• runs on Hadoop
• computation executed in memory and proceeds as
sequence of iterations called supersteps
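
The superstep idea can be illustrated with a single-process Python simulation of vertex-centric PageRank. This is not Giraph code and uses none of its API. It only mimics the pattern: in each superstep every vertex sends its rank share along its out-edges, then, after a barrier, every vertex updates from the messages it received. The graph, the damping factor 0.85 and the superstep count are assumptions.

```python
def pagerank_bsp(graph, supersteps=30, damping=0.85):
    """Simulate Pregel-style PageRank; graph maps each vertex to its out-neighbors.

    Every neighbor must also appear as a key of graph.
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # Message phase: each vertex sends rank/outdegree to each out-neighbor.
        inbox = {v: 0.0 for v in graph}
        dangling = 0.0
        for vertex, neighbors in graph.items():
            if neighbors:
                share = rank[vertex] / len(neighbors)
                for neighbor in neighbors:
                    inbox[neighbor] += share
            else:
                dangling += rank[vertex]  # no out-edges: spread evenly below
        # Barrier, then update phase: each vertex combines its messages.
        rank = {
            v: (1.0 - damping) / n + damping * (inbox[v] + dangling / n)
            for v in graph
        }
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank_bsp(graph)
for vertex in sorted(rank, key=rank.get, reverse=True):
    print(vertex, round(rank[vertex], 4))
```

Here the whole graph stays in memory across supersteps, in contrast to writing it to HDFS and reading it back once per MapReduce iteration.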

Link Analysis
• Indegree and Outdegree distributions
• Inter-host and Intra-host link information
• Rank resources by PageRank
• Identify important resources
• Prioritize crawling of missing resources
• Find possible spam pages by running biased PageRank
• Trace path of crawler using graph generated from crawl logs
• Determine Crawl and Page Completeness
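
Two of the simpler measurements, degree distributions and the inter-host versus intra-host split, can be sketched in Python. The edge list is invented and assumed to be deduplicated already; the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def degree_distributions(edges):
    """Return (indegree distribution, outdegree distribution) as degree -> node count."""
    outdegree = Counter(src for src, _ in edges)
    indegree = Counter(dst for _, dst in edges)
    nodes = {node for edge in edges for node in edge}
    # Nodes missing from a Counter have degree 0.
    in_dist = Counter(indegree[node] for node in nodes)
    out_dist = Counter(outdegree[node] for node in nodes)
    return in_dist, out_dist


def host_link_counts(edges):
    """Count links that stay on one host versus links that cross hosts."""
    counts = Counter()
    for src, dst in edges:
        same_host = urlsplit(src).hostname == urlsplit(dst).hostname
        counts["intra-host" if same_host else "inter-host"] += 1
    return counts


edges = [
    ("http://a.example/1", "http://a.example/2"),
    ("http://a.example/1", "http://b.example/1"),
    ("http://b.example/1", "http://a.example/2"),
]
in_dist, out_dist = degree_distributions(edges)
host_counts = host_link_counts(edges)
print(dict(in_dist), dict(out_dist), dict(host_counts))
```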

Link Analysis: PageRank over Time

Web Archive Analysis Workshop
• Self guided workshop
• Generate derivatives: CDX, Parsed Text, WAT
• Set up CDX Warehouse using Hive
• Extract text from WARCs / WAT / Parsed Text
• Extract links from WARCs / WAT / Parsed Text
• Generate Archival web graphs, host and domain graphs
• Text and Link Analysis Examples
• Data extraction tools to repackage subsets of data into new (W)ARC files
• https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Analysis+Workshop
