Web Archiving
A Brief Introduction
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)
About Me
Sawood Alam
Lexical Signature
Web, Digital Library, Web Archiving, Ruby on Rails, PHP,
XHTML, CSS, JavaScript, ExtJS, Urdu, RTL and Linux.
● BTech, Jamia Millia Islamia, India, 2008
● MSc, Old Dominion University, USA, 2013
● PhD, Old Dominion University, USA, Current
She Calls Me Dad!
Agenda
● Archiving and Web archiving
● Purpose and importance
● Scope of Web archiving
● Issues and challenges
● Tools and techniques
● Memento: Time Travel for the Web
● Archive X-Ray
● Research opportunities in Web archiving
● Our WSDL Research Group
What is an Archive?
● Accumulation of historical records
● Long term storage and preservation
● Less frequently used
● Physical or digital
What is Web Archiving?
● Periodic snapshots of web pages
● Preserving important events on the Web
● Making archived content accessible
Why Do We Care About Archiving?
Web content decays rapidly!
● To preserve the history
● To tell a story
● For evidence
● For backup
● For personal satisfaction
Issues and Challenges
● Crawling
● Storage
● Retrieval
● Replay
● Accessibility
● Completeness
● Accuracy
● Credibility
Web Archiving Efforts
● Internet Archive
● Archive-It
● Wikipedia
● UK Web Archive
● Various national and non-profit archives
● Film, music and other multimedia archives
● Scholarly archives
● Personal archiving
Tools and Techniques
● Heritrix, PhantomJS, Wget, cURL
● OpenWayback, PyWB
● TimeTravel, MemGator
● CarbonDate, Warrick, Synchronicity
● Preserve Me!
● WARCreate, WAIL, Mink
● Browsertrix
● And many more...
Memento
<http://example.com>; rel="original",
<http://web.archive.org/web/20020120142510/http://example.com/>;
rel="memento";
datetime="Sun, 20 Jan 2002 14:25:10 GMT",
<http://web.archive.org/web/20020328012821/http://www.example.com/>;
rel="memento";
datetime="Thu, 28 Mar 2002 01:28:21 GMT",
<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>;
rel="memento";
datetime="Sat, 03 Aug 2002 08:05:44 GMT",
<http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>;
rel="memento";
datetime="Sun, 13 Dec 2009 01:50:14 GMT",
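The listing above is a Memento TimeMap in link format: each entry is a URI in angle brackets followed by `rel` and, for mementos, `datetime` parameters. A minimal sketch of reading such a listing might look like the following. The function name `parse_timemap` and the regular expressions are illustrative assumptions, not part of any Memento tool, and the sketch only handles quoted parameter values like the ones shown on the slide.

```python
import re
from email.utils import parsedate_to_datetime

# One link-value: <URI> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Parse link-format text into dicts with 'uri', 'rel' and optional 'datetime'."""
    links = []
    for uri, params in LINK_RE.findall(text):
        attrs = dict(PARAM_RE.findall(params))
        # rel may hold several space-separated relation types.
        entry = {"uri": uri, "rel": attrs.get("rel", "").split()}
        if "datetime" in attrs:
            # The datetime parameter uses the HTTP date format (always GMT).
            entry["datetime"] = parsedate_to_datetime(attrs["datetime"])
        links.append(entry)
    return links


# A shortened copy of the listing above.
SAMPLE = '''
<http://example.com>; rel="original",
<http://web.archive.org/web/20020120142510/http://example.com/>;
 rel="memento";
 datetime="Sun, 20 Jan 2002 14:25:10 GMT",
<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>;
 rel="memento";
 datetime="Sat, 03 Aug 2002 08:05:44 GMT",
'''

mementos = [link for link in parse_timemap(SAMPLE) if "memento" in link["rel"]]
for m in mementos:
    print(m["datetime"].isoformat(), m["uri"])
```

Keeping `rel` as a list leaves room for entries that carry more than one relation type, which a real TimeMap may contain.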
Archive X-Ray!
● How much of the Web is archived?
● Profiling various archive services
● Predicting what they contain
● Routing Memento aggregator queries
Memento Aggregator
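As a rough illustration of what an aggregator does, the sketch below merges per-archive memento lists into one chronological list and picks the capture closest to a requested time. The capture timestamps are taken from the Memento listing earlier in the deck; the functions `aggregate` and `closest` are hypothetical names, and a real aggregator would fetch each archive's TimeMap over HTTP rather than use hard-coded lists.

```python
import heapq
from datetime import datetime, timezone


def ts(stamp):
    """Convert a 14-digit archive timestamp (YYYYMMDDhhmmss) to an aware datetime."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def aggregate(timemaps):
    """Merge per-archive lists of (datetime, uri), each already sorted, into one sorted list."""
    return list(heapq.merge(*timemaps, key=lambda m: m[0]))


def closest(mementos, target):
    """Return the memento whose capture time is nearest to the target datetime."""
    return min(mementos, key=lambda m: abs(m[0] - target))


# Hard-coded stand-ins for the responses of three archives.
internet_archive = [
    (ts("20020120142510"), "http://web.archive.org/web/20020120142510/http://example.com/"),
    (ts("20020328012821"), "http://web.archive.org/web/20020328012821/http://www.example.com/"),
]
library_of_congress = [
    (ts("20020803080544"), "http://webarchive.loc.gov/all/20020803080544/http://www.example.com/"),
]
archive_it = [
    (ts("20091213015014"), "http://wayback.archive-it.org/all/20091213015014/http://www.example.com/"),
]

merged = aggregate([internet_archive, library_of_congress, archive_it])
best = closest(merged, datetime(2002, 4, 1, tzinfo=timezone.utc))
print(best[1])
```

`heapq.merge` is used here because each archive's list is assumed to be sorted already, so the merge does not need to re-sort everything.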
Long Tail of Archives
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos
● Provides statistics about the holdings
● Small in size and publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things
com,cnn)/ {"frequency": 40, "spread": 2}
uk,co,bbc)/ {"frequency": 20, "spread": 1}
com,usatoday)/ {"frequency": 5, "spread": 1}
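To make the routing idea concrete, here is a minimal sketch of how an aggregator could consult such profiles before querying archives. It assumes host-level keys like the ones above (reversed host labels followed by `)/`), a simplified key function that only lowercases the host and drops a leading `www.`, and that a higher `frequency` suggests a more promising archive. The names `host_key`, `route`, `archive_a` and `archive_b` are made up for illustration, and the second profile is invented data.

```python
from urllib.parse import urlsplit


def host_key(uri):
    """Build a simplified SURT-style host key, e.g. 'com,cnn)/' for http://www.cnn.com/."""
    host = (urlsplit(uri).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return ",".join(reversed(host.split("."))) + ")/"


# archive_a reuses the three entries shown above; archive_b is invented.
PROFILES = {
    "archive_a": {
        "com,cnn)/": {"frequency": 40, "spread": 2},
        "uk,co,bbc)/": {"frequency": 20, "spread": 1},
        "com,usatoday)/": {"frequency": 5, "spread": 1},
    },
    "archive_b": {
        "com,cnn)/": {"frequency": 3, "spread": 1},
    },
}


def route(uri, profiles):
    """Return the archives whose profile contains the URI's key, most frequent first."""
    key = host_key(uri)
    hits = [(name, p[key]["frequency"]) for name, p in profiles.items() if key in p]
    return [name for name, _ in sorted(hits, key=lambda hit: -hit[1])]


print(route("http://www.cnn.com/world", PROFILES))
print(route("http://example.org/", PROFILES))
```

An empty result means no profiled archive is expected to hold the page, so the aggregator could skip those archives entirely instead of broadcasting the query to all of them.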
Research Opportunities
● Information retrieval
● Information visualization
● Client and server side archiving
● Archiving dynamic content
● Distributed archiving
● Discovering alternative long-term archiving techniques
● Predicting "important" events on the Web and archiving them in a timely manner
Web Science and Digital
Libraries Research Group
ws-dl.cs.odu.edu
ws-dl.blogspot.com
@WebSciDL
github.com/oduwsdl
flickr.com/photos/124419986@N07
WSDL Research Group
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)
salam@cs.odu.edu
ibnesayeed@gmail.com
@ibnesayeed
www.cs.odu.edu/~salam
