
Digital Tools - Week 03 - Milligan - Historiography and The Web

This document summarizes a chapter from The SAGE Handbook of Web History titled "Historiography and the Web". It discusses how web archives will become increasingly important sources for historians as more information from the 1990s onward is preserved online. The scale of digital data now far surpasses what was previously available in archives. Web archives also provide more diverse sources from people not typically represented, like children. Historians will need to adapt to working with this new abundance of sources and using computational methods to analyze large datasets.


SAGE Reference

The SAGE Handbook of Web History


Historiography and the Web

By:Ian Milligan
Edited by: Niels Brügger & Ian Milligan
Book Title: The SAGE Handbook of Web History
Chapter Title: "Historiography and the Web"
Pub. Date: 2019
Access Date: August 29, 2022
Publishing Company: SAGE Publications Ltd
City: 55 City Road
Print ISBN: 9781473980051
Online ISBN: 9781526470546
DOI: https://dx.doi.org/
Print pages: 3-15
© 2019 SAGE Publications Ltd All Rights Reserved.
This PDF has been generated from SAGE Knowledge. Please note that the pagination of the online
version will vary from the pagination of the print book.
SAGE Reference
© Niels Brügger & Ian Milligan, 2019

Historiography and the Web

Ian Milligan

Introduction
The advent of the Web as a primary source will dramatically affect the practice of researching, writing, and
thinking about history. Historians are entering into an era where we will have more information than ever
before, left behind by people who rarely before entered the historical record. Web archives will fundamentally
transform much of what a historian does, requiring a move towards computational methodologies and the
digital humanities.

Web archives matter. One cannot write most histories of the 1990s or later without reference to web archives; at the very least, to do so would be to neglect a major medium of the period. Web archivists and other
institutions are today engaged in the collaborative effort to ensure that people in the future know what
happened in 1996, or 2001, or 2006, or today. This ensures that we as a society will have the information that
we need to make arguments for justice, for equality, for policy, for a better understanding of ourselves, and
beyond. Web archives will be a foundation for history.

Crucially, historians need to be ready for this shift: they will soon be writing histories of the 1990s that require web archives to do their topics justice. While there is no exact metric for when past events become fodder for historical interpretation, it is worth noting that the first historical narratives of the 1960s in the United States and Canada, for example, began to appear in the 1980s; by the 1990s, established monographs and doctoral studies could be undertaken (Gitlin, 1987; Isserman, 1987; Kostash, 1980; Levitt, 1984; Owram, 1997). The Web is now well over 25 years old, with widespread web archiving beginning with the Internet Archive in 1996, so we are roughly at the moment when the first serious historical studies will begin to be undertaken, and many trailblazing doctoral students in the field are likely now contemplating their first degrees. Not to use web archives would run the very real risk of fundamentally misrepresenting such topics, and this will happen sooner than we think.

This chapter explores what the changing nature of historical scale will mean for historians. It begins by
discussing how web archives will become increasingly central to the historical profession. Following this,
drawing upon Franco Moretti's concepts of close reading versus distant reading, it advances a typology of
research projects carried out to date. The chapter then discusses the next directions for the field, especially
the growing importance of analyzing metadata rather than exploring content itself. It concludes by situating our contemporary moment within a ‘third wave of computational history', suggesting how historians could profit by understanding themselves in their own historical context.

The Growing Centrality of Web Archives to the Historical Profession


To reinforce the importance of web archives, consider all of the things that one could not write a history of
without using web archives. Without web archives, one could not write histories of the late 1990s Tamagotchi
trend, figuring out what that meant about our relationship to animals, each other, and technology; or political
histories of the late 1990s on early Internet censorship, from the V-Chip to the Communications Decency Act
in the United States, critical moments in our early Web history that might have fundamentally changed how we
interacted with the medium; or economic and business histories of the 1990s dot-com bubble; or even events of pivotal significance like the attacks of September 11th, 2001. Each of the above would be immeasurably enhanced by the use of web archives, and considerably diminished by neglecting these sources. This is not a niche area. Crucially, these examples underscore that Web histories will not just be histories of the World Wide Web (although those are important and well represented in this Handbook), but histories that happen to use the Web as a primary source because of its significant role in knowledge production and communication.

The SAGE Handbook of Web History


Page 2 of 13

The novelty of these web archives can be seen in two respects: that of scale, in that we have more data than
ever before, and that of scope, where different kinds of sources that were rarely preserved before are now
being so. In this section, I will explore scale and scope in turn.

We are now working with sources that are being preserved on a different scale than historians are accustomed to. In this we are seeing the insights of the late American historian Roy Rosenzweig borne out: he foresaw, in a 2003 American Historical Review article, that historians were shifting from an environment of scarcity to one of abundance (Rosenzweig, 2003). In other words, historians have traditionally wished they had more information about the past; now, when working with web archives, they are threatened by having too many sources to parse and explore.

Some examples can bear out the sheer scale of born-digital content being generated every day. A constantly
updated page, ‘My Data is Bigger than Your Data', published by University of Waterloo computer scientist
Jimmy Lin, gives a tally of the ever-changing boasts of just how big datasets are. A few examples help to bring this into focus. In January 2017, Twitter announced that it was storing over 500 petabytes of information (one petabyte is 1,000 terabytes). The Internet Archive has over 30 petabytes of archives, with approximately 13 to 15 terabytes per day being added to its collections. Spotify, a music streaming service, collects over a terabyte of user data every day from its over 75 million users and one and a half billion playlists. YouTube sees a petabyte of data uploaded every single day (Lin, n.d.). Not all of this will be kept. Content on Snapchat, for example, is largely intended to be transient; similarly, Facebook and many corporate databases will likely not be archived for historical consumption. Given ethical concerns and rights to privacy, this is not necessarily a bad thing! But even if a fraction of the above is kept, historians will be challenged to no
end. In particular, the Internet Archive's activities are of interest, as they are collecting with an eye to future
research access. In short, the amount of information generated on the Web means that our historical record
is dramatically changing.

The expansive scope of web archives, too, has the prospect of bringing more people into the historical record.
Much of what can be found in the Internet Archive is primary-source material authored by people who never before would have been part of the historical record. It is not simply that instead of learning about Tamagotchis from The New York Times or The Guardian we can learn about them from Web-based sources, but that we can begin to work with the pages of people who actually used Tamagotchis. Young kids and their parents created sites in the GeoCities child-focused section, for example, allowing us to work with this innovative primary source (see my chapter on GeoCities in this Handbook). It is emblematic of a broader shift in the categories of people who we can learn about (Milligan, 2016). Think of the long list of sources that we will now have thanks to the widespread advent of web archiving: pages by children; blogs by everyday teenagers, students, and adults; or even individuals who can give a unique perspective on unfolding world events (such as the Russian soldier who posted about the MH17 missile launch on Facebook) (Taylor, 2014).

Web archives do not provide a perfect representation of the past – they offer facsimiles of the original pages, and much is missing or of reduced functionality (Brügger, 2013a) – but neither do traditional archives, which have had to be very selective in what they acquire, appraise, and preserve. We need to mind gaps in coverage and inclusion. This can be done by looking at who uses the Web and how, often drawing on contemporary or historical studies of use (Blank and Dutton, 2014; Duggan, 2015). Additionally, we need to understand gaps in how web archives find content on the Web, and what gets preserved and what does not (Thelwall and Vaughan, 2004). But while we need to understand the context in which web archives are created, we also need to understand how these collections will in many cases be broader than what has been left behind in the past. Historical practice has been dominated since the nineteenth century by archival study, largely within government- and institution-dominated repositories of information about what has come before. While web archives certainly have an institutional perspective – collections from governments or of university calendars, for example – the new forms of citizen histories present a more democratic record of the past.

To reinforce this shift, I often think to my own first book on Canada's 1960s, where I studied how young
workers, students, and activists engaged with each other (Milligan, 2014). These were difficult questions to
explore, even though the events had only taken place 40 years before. Student activists do not keep minutes
of meetings that took place late in pubs, coffee houses, or communal homes; nor do young workers who
are engaged in illegal wildcat strikes. Union newsletters tended to stress official lines, leaving little room for alternative visions. My sources, then, were crumbs of the past: memories from a few, garnered through around 70 oral interviews, though it was hard to find names or contemporary contact information; police reports, including of graffiti on the side of an industrial building; informants; bewildered newspaper articles in the mainstream media; occasional write-ups in a student newspaper or a manifesto saved in a university archive.
Tackling a similar question today would be different. Think of the resources collected around Occupy Wall
Street, or the Canadian First Nations #idlenomore movement, or just the regular activities of youth culture.

Indeed, this animated my own first foray into studying web archives, when I worked with an old collection of
websites that were created by Canadian youth in the 1990s under the auspices of a Canadian federal training
project. When I found these archived websites over a decade later, I began to realize that these sources were
different from what I had seen before (Milligan, 2012). In digging into youth footprints, I realized that I could
even find my own first trace of a digital past, from when I was an 11-year-old boy in 1995 asking for help about
a board game online. Eleven-year-old boys did not, as a rule, leave sources – yet now they do. This does not
make things easier, of course – one still needs to find these sources, identify who is who if they want to, and
begin to think about what is being crawled and what is not. It is an incomplete record, affected by profound
issues of access, but is certainly bigger and more expansive than what we had before.

The Infinite Archive


Historians work with a necessarily incomplete record of the past. Indeed, the two terms – history and the past – are not synonyms. The past happened, whereas history is created from the traces of the past that persist (from memories, to physical fragments, to archival documents, to tweets). The vast majority of things that happen are never recorded, and philosophically that remains the case. But right now a web crawler is following links, downloading pages, following more links, working in the potentially infinite process that creates the archives of our lives. In this section, I would like to explain why these traces of the past – traces that would never have existed before – matter and have the potential to reshape our historiography.

The advent of web archives has the potential to expand scholarship in several respects. Think of the blogs
and social media that document what people are eating – indeed, one of the canards against social media is
often ‘I don't care what you had for dinner’. But a sense of what everyday people eat, consume, and idealize
when it comes to food is going to be valuable; historians have long struggled with really understanding what
we eat in our private homes, relying largely on memory, raw material information, government reports, how
people responded to rationing, and the like (Mosby, 2014). Alternatively, we can have new understandings
of social movements like Occupy Wall Street. If you relied on The New York Times or the Washington Post
to understand what happened in Zuccotti Park in New York City, you would have a more skewed vision than if you were able to use the Occupy websites themselves (used for coordination, publicity, and bringing the organization together).

Occupy also underlines the importance of preserving and documenting Web materials, and shows that this is the sort of work that needs to be done without delay. Of all the Occupy sites created in the heat of an ever-evolving movement, only 41% were still active two years later (LaCalle and Reed, 2014). Unlike physical media, which can – especially if printed on acid-free paper – be surprisingly durable (imagine a book placed on a shelf: coming back to it 20 years later, you can probably still access the content, even if there have been some moisture problems and the like), these sources require active preservation. Server fees need to be paid, and with only a brief interruption, sites can disappear. And apart from high-profile projects that had considerable funding and expertise, such as the reconstruction of the first web page hosted at CERN or the first North American site at Stanford, once it is gone it is gone (Karampelas, 2014; Tsukayama, 2013).

Web Archiving Scholarship Today: A Quick Glimpse


All of this is to say that the research landscape for historians seeking to study the 1990s is changing, and that
it matters that historians are ready to engage with this material now. So what are historians doing with it? We
are only now just beginning to see what users can do with web archives, beyond nostalgic explorations and the like. These largely fall into two rough categories: traditional close reading, and the computational digital
humanities approach of distant reading (Moretti, 2007).

Some of these histories are close readings: the inquiry into one website or a few different sites – dozens, or hundreds even, but still of a scale where the historian can click, scroll, read, and explore the individual sites to a very high degree. This builds on a great textual tradition within the historical field, where we dive deeply into one story, flesh it out, and in so doing learn far more about the broader context of the events that we are studying as well. Several of these approaches appear in the Handbook chapters that follow in this part. Federico Nanni, for example, has explored the University of Bologna's website in depth and has
tried to reconstruct the story of that one page, drawing on the Internet Archive, university IT departments, and
eventually using oral histories and personal contacts to tell this deeply textured story. In so doing, it becomes
a good entryway into the robots.txt exclusion protocol and how websites can be removed, even retroactively,
from the Internet Archive; and how a recent history of the website can be done through a close reading of one
domain (Nanni, 2016). Similarly, Stanford University Libraries carried out an in-depth process to reconstruct
the page of the first web server outside Europe, that of the Stanford Linear Accelerator Center, using
the SLAC backup system (AlSum, 2015; Deken, 2017). It is an in-depth story of digital preservation, drawing
on source code, newsgroups, generating web archive files from a non-standard web archive, and beyond.
In so doing, this essential work helps us understand the broader challenges of digital preservation, and how
much work goes into individual stories.

Yet we also have operations on a completely different scale, requiring ‘distant reading’. This leverages the
power of what modern data mining and text analysis tools can do on large amounts of data. This approach
of distant reading is inspired by the literary work of Franco Moretti, or the wide gulfs of time and space
that earlier Annales historians like Fernand Braudel explored (Braudel, 1972; Moretti, 2007). These are
stories on a massive, unprecedented scale, aided by computational analysis. Peter Webster, a pioneering
UK historian, has used the link graphs of the UK Web Archive to explore who linked to creationist websites.
Did governments, academics, bloggers, Christian media, or the mainstream media, for example, link to these
sorts of sites? This was large-scale analysis that would not have been possible if it were to be carried
out manually. In this study, Webster noted that creationists mostly talk to themselves, and are ignored by
academia, media, and the churches, underscoring that even among evangelicals this was a distinctly
minority view (Webster, 2014). This use of link graphs was also employed in a recent piece by Webster,
looking at how linking patterns changed in the wake of the Sharia Law controversy in the United Kingdom;
and, for example, how these links might reflect real-world interest in various dioceses (Webster, 2017). Or, at
times, this has even involved looking at entire national country-code top-level domains (ccTLDs), such as the
Danish .dk. Recent work by Niels Brügger has explored the general challenges and opportunities presented
by exploring a nation's web domain (Brügger, 2017). There is even promising work that looks at now-defunct
top-level domains, such as the deleted Yugoslavian .yu. Anat Ben-David of the Open University of Israel
conducted pioneering research into the shape of this deleted domain, looking at how the networked structure
of millions of pages dramatically changed between 1996 and 2010. She found that the internal links emerged
after the fall of autocrat Slobodan Milosevic, and then began to fall apart as new domains for the independent
countries emerged; as she puts it, ‘the intra-domain linking patterns of the .yu domain are closely tied with
stability and sovereignty’ (Ben-David, 2015, 2016).

We can see from this very brief overview that there is great power and potential in being able to use these
web archives. Yet the number of scholars working with web archives today is still relatively small, due in no
small part to the problem of access.

Going Wayback: Closely Reading Web Archives


One of the major limitations of web history, however, is the sheer difficulty of doing this sort of work. Access to this material needs to be democratized, so that historians and everyday people alike can draw on this important digital cultural resource.

Currently, most historians and other researchers access web archives via some flavor of a ‘Wayback’
instance. You can see a version of the Internet Archive's instance in Figure 1.1. The Wayback Machine, which
is what the Internet Archive calls its version of the Wayback, was launched in 2001 to provide access to its
The SAGE Handbook of Web History
Page 5 of 13
SAGE SAGE Reference
© Niels Brügger & Ian Milligan, 2019

collections. It is now also available as an open-source project known as ‘Open Wayback’.

The Wayback Machine allows a user to browse ‘back in time' by rendering collected pages and resolving each link to the snapshot closest in time to the page from which it is clicked. In short, one can load up the American White House's home page from December 27th, 1996 and, when clicking on links within the Wayback Machine, be brought to snapshots of the linked pages as close to that day as possible.

Figure 1.1

Users can find content in the Wayback Machine by two means. First, they can type in the URL of the address
that they know they want to find (nytimes.com, for example). Second, they can do a limited full-text search
on the home pages of websites (type ‘New York Times’ and you find the URL of the page, which can be very
useful when pages have changed their URL several times over their life).
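This date-oriented browsing rests on a simple URL scheme: the Wayback Machine encodes the requested timestamp directly into the address, and a shortened timestamp is redirected to the nearest snapshot held. The sketch below illustrates the mechanics; it assumes the Internet Archive's public `/web/<timestamp>/<url>` pattern and is not part of the chapter's own apparatus.

```python
from datetime import date

WAYBACK_PREFIX = "https://web.archive.org/web"

def wayback_url(target: str, when: date) -> str:
    """Build a Wayback Machine URL for `target` near the date `when`.

    The Wayback Machine resolves a shortened (here, 8-digit)
    timestamp to the closest snapshot it holds, which is what
    makes the 'back in time' browsing described above work.
    """
    timestamp = when.strftime("%Y%m%d")  # YYYYMMDD
    return f"{WAYBACK_PREFIX}/{timestamp}/{target}"

# The White House home page as close to 27 December 1996 as possible:
url = wayback_url("http://www.whitehouse.gov/", date(1996, 12, 27))
```

Pasting the resulting address into a browser is equivalent to the URL-lookup workflow described above, with the archive choosing the nearest capture for you.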

Researching with the Wayback Machine lends itself to close reading. A user browses one page at a time, and
is limited by the speed of the Wayback Machine and the underlying server. It is not a speedy experience, but
it does in many ways replicate the experience of working with traditional documents. Even when working with
the Web in other ways, finding URLs through various discovery methods discussed below, the end point is
usually a Wayback instance: it is how we view the documents themselves. Yet for most research projects, the
Wayback Machine will not be enough. When working at scale, such as the millions of pages of GeoCities, one
cannot click through each page manually; and the basic search functionality that the Internet Archive provides
is limited only to home pages.

But imagine trying to use the Wayback Machine at scale: clicking through pages manually, limited to home-page searches, it would take decades, if not longer, just to look at all the pages.

Accordingly, new systems of access are being developed to explore this. Some of these are simply various
full-text search interfaces, designed not to handle the massive amount of data within the Wayback Machine,
but to be implemented on relatively smaller collections. Enter the world of distant reading.

Computational Access to Web Archives: We Want to Read It All but Cannot


Historians like to work with content – but the scale of web archives makes that difficult to achieve. What
should historians then do? Traditionally, the historian's approach can be likened to that of a microscope:
closely exploring small numbers of documents; in an era of ever-growing datasets, the macroscope may be
the better metaphor (Graham et al., 2015). Niels Brügger has identified five different layers when it comes to
studying the Web: the Web as a whole, web spheres, websites, web pages, and web elements (Brügger, 2009, 2018).
The concept of web spheres comes from Schneider and Foot, who define this as ‘not simply a collection of
websites, but as a hyperlinked set of dynamically defined digital resources that span multiple websites and
are deemed relevant, or related, to a central theme or “object”’ (Schneider and Foot, 2004: 118).

The debate between a wider focus – thinking of the Web as a whole à la Schneider, Foot, and Brügger
– and a narrower one speaks to the importance of both close and distant reading. In related work, I have
argued that larger perspectives benefit our understanding of the Web, given the importance of context for
understanding single documents (Milligan, 2012, 2016). Other scholars have noted the importance of a ‘Big
Data’ perspective to derive meaning from large amounts of cultural data (Aiden and Michel, 2013; Mayer-
Schönberger and Cukier, 2013; Schroeder, 2014). Some historians, however, drawing on their experience with web archives, have questioned the idea that bigger is necessarily better. Gareth Millward, who worked with
web archives as part of the BUDDAH project, argued as much in a Washington Post opinion piece:

We're going to have to realize that we can't read everything. We already do this with printed documents, but
we need to be more explicit about it and more willing to admit it. Smaller samples of websites, specifically
chosen for their historical importance, can give us a much better understanding. We can begin to ask
questions about how sites are constructed and what information people and organizations chose to reveal.
Similarly, much more focussed searches on smaller time periods, more marginal topics, or specific cultural
groups can produce a more manageable ‘corpus’ for reading and manipulating in the same way we would on
our trips to traditional archives. (Millward, 2015)

Millward also argues that metadata, the links between websites, might be the most useful way forward –
finding common ground with other scholars (Brügger, 2013b).

Metadata is an important concept that thus lies at the heart of web archival research. While a difficult-to-
define term, it can perhaps be best understood – as the American National Security Agency (NSA) does – as
‘information about content (but not the content itself)’ (Greenwald, 2014: 132). Indeed, the NSA's definition is
a useful beginning point, as the NSA itself underscores the power of metadata. The 2013 Edward Snowden
revelations that the agency had been engaged in widespread harvesting of American metadata, such as the
records of phone calls (who you call, who calls you, and how long you spoke for), shocked many Americans.
Despite the agency's insistence that this was not surveillance because it did not engage with content, authors such as journalist Glenn Greenwald convincingly hold that metadata can actually be more revealing than the content itself. A single call, even if it were transcribed and published, would not tell you much about a person's life; a recurring pattern over months or years, however, would begin to (Greenwald, 2014). The MIT Immersion
Lab's tool at https://immersion.media.mit.edu/ also helps to illustrate this, as it takes a user's Gmail account,
extracts the to, from, cc, and timestamp fields, and begins to reconstruct your life based on social ties.

A similar approach works well with web archives, as later chapters in this Handbook discuss. Stevenson and
Ben-David explore this in depth, but here is an example. Consider the Toronto Political Parties and Political
Interest Groups collection, collected by the University of Toronto between 2005 and today. There are some 14
million documents in total, too many to read. Indeed, some domains consist of thousands of different pages.
Using Warcbase, a software program that works with the underlying WARC files in the collection, we extracted
all the hyperlinks between sites and aggregated them by domain.1

For example, any time a page within the left-wing New Democratic Party of Canada domain (ndp.ca) linked to another page, we counted the domain that it came from (ndp.ca) and the domain that it was going to (for example, NewYorkTimes.com). By doing so we could tell stories about the web archive that eluded single-page readings.
page readings. In these hyperlink graphs, for example, we can see the left-wing New Democratic Party of
Canada (ndp.ca) linking heavily to the centrist Liberal Party of Canada (liberal.ca), and largely ignoring the
right-wing Conservative Party of Canada (conservative.ca). This is due to several reasons, notably that the Liberals were in power at the time. Even though the two parties are ideologically closer to each other, it makes more sense to link in order to attack and critique the party in power than an opposition party.

This example helps underscore the importance of metadata for exploring archived web material. At first, the inability to explore web archival data in the manner we are used to in traditional archives may seem like a downside, but by leveraging the structured metadata within archives we are increasingly able to find relevant material quickly.

Other projects have explored the potential of distant reading to explore web archives. In the United Kingdom,
for example, the Big UK Domain Data for the Arts and Humanities (BUDDAH) project sought to develop
technical and methodological approaches for the historical uses of web archives. Partnering with the British
Library to ‘co-produce tools', the team moved towards new access methods. The ensuing Shine viewer
(Figure 1.2) allows a user to see trends in a large web archival collection obtained by the British Library from
the Internet Archive, which covers the .uk domain between 1996 and 2013. Consider the example query in
Figure 1.2, which compares the relative frequency of three prime ministers between these years. We can see
the blue line (Tony Blair) diminish in frequency, replaced by the red line (Gordon Brown) when he is prime
minister, and finally both are eclipsed by the yellow line (David Cameron). A user can click on the lines, and
then be brought to a sample page containing that phrase (Jackson et al., 2016).


Figure 1.2 Three Prime Ministers seen in the UK Web Archive's Shine interface.
Image used with thanks to the British Library.

These interfaces are not perfect. In the above case, the samples are arranged in the order in which they were crawled, earliest to latest. While this is a laudable reaction to the ‘black box' of a search engine (like Google) that we may not understand, when working at scale we need to begin to engage with ranking algorithms. Eschewing relevance ranking is akin to treating archives like Twitter timelines: ordered only by date, so that we understand the ordering, but at any scale it begins to overwhelm the human capacity to reason and engage. Yet such interfaces can still build considerable interest. In 2015, for example, our research team at
the University of Waterloo and York University used the Shine platform to provide access to a ten-year-old
collection of Canadian political party and political interest group websites (webarchives.ca). The University of
Toronto had been collecting these sites since 2005, comprising some 50 domains of all of Canada's major
political parties, minor political parties, and a somewhat nebulous assemblage of political interest groups
covering topics as varied as campaigns to ban land mines, protect the Canadian environment, or fight for
social justice for Canada's First Nations. Given the importance of this collection, once we provided access to
it, we had thousands of visitors after the media picked up the story. This underscores the power of an easy-
to-use interface.

Shine, and other similar interfaces, speak to the new methods that historians will need, facilitating both distant and close reading. Later in this Handbook, my chapter on GeoCities uses some of these approaches to show how they can be deployed in modern historical research. The downside, however, is that not all historians are ready to change their techniques.


The Third Wave of Digital History


This is not the first time that historians have worked at scale, but it comes at a moment when the mainstream
historical profession appears to be retreating from numbers, teams, and computers. Part of this lies with
earlier historical follies. Indeed, we can see the contemporary turn towards the digital humanities as part of a
‘third wave of computational history’ (Milligan, 2012: 30).

In the 1960s and 1970s, computers were indelibly associated with quantitative histories (Anderson, 2008).
These pioneering computational historians relied on large mainframe computers and stacks of punchcards,
saw considerable advancements in the realms of demographic history and studies of large censuses, and
generated arguments foundational to our understanding of social mobility and migration today (Anderson,
2008; Katz, 1975). Others, however, brought hubris to their work, claiming ‘scientific’ rigor (Fogel and Elton,
1984). Perhaps the height of this was the debate around Fogel and Engerman's Time on the Cross: The
Economics of American Negro Slavery, seen by critics as reducing the terrible, human experience of slavery
to numbers and tables (Fogel and Engerman, 1974); by the late 1970s, this first wave of computational
history was in retreat. These earlier debates tempered enthusiasm for computational histories, an enthusiasm that reappeared only in the early 1990s with the personal computing revolution. This second wave of computational
history was marked by graphical user interfaces, word processing, and the rise of the World Wide Web and
attendant scholarly networks such as H-Net. Instead of scholars needing to use punchcards to interact with
mainframe computers, they could interact with computers and data directly through their keyboards, mice,
and monitors. In 1998, the Journal of the Association for History and Computing began to be published,
complementing other 1990s digital humanities-type events such as the 1994 conference Hypertext and the
Future of the Humanities (Graham et al., 2015). Since the early 2010s, we have been in a third wave of computational history, driven by several factors: decreasing storage costs, the power of distributed cloud computing, the rise of digital preservation professionals, and open-source software. Storage is cheaper than
ever before, we can put it to good use thanks to all of the data that is being collected and made available to
researchers, and crucially we can harness computing power to begin to access it (Milligan, 2012).

Yet we can learn something from these earlier experiences with digital history. First, as I have argued
elsewhere, we need to recognize the subjectivity of the tools that are designed and used (Graham et al., 2015;
Milligan, 2012). Claims towards scientific rigor alienated other historians during the first wave of computational
history. The results from a ‘distant reading' algorithm over a web archive may appear to be objective, in that the same data and the same algorithm will always produce the same answer. But the subjectivity of tools is embedded in the machines that we use: the decisions about how we structure datasets, tokenize sentences, calculate central nodes in a network diagram, and beyond, all rest on choices made by generations of scholars; human agency is embedded throughout.
Just as importantly, even if historians base their arguments on the exact same evidence, they will not draw the same historical conclusions. Results always need interpretation. In short, Big Data is not better than earlier forms of exploration, simply different.
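Even a decision as small as how to tokenize text shapes the counts an analysis rests on. A minimal sketch, using an invented sentence, shows two defensible tokenization choices producing different vocabularies:

```python
import re

text = "The web's history is the Web's history."

# Choice 1: split on whitespace, keeping case and punctuation.
naive = text.split()

# Choice 2: lowercase the text and extract word characters only.
normalized = re.findall(r"[a-z']+", text.lower())

print(sorted(set(naive)))       # 7 distinct tokens
print(sorted(set(normalized)))  # 4 distinct tokens
```

Under the first choice, 'history' and 'history.' are counted as different words and 'The' is distinct from 'the'; under the second, they collapse together. Neither choice is wrong, but any frequency table built downstream inherits the decision.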

Conclusions
Historians are in some ways at a crossroads, as we seek to grapple with the implications of scale in the digital
age. This also holds true for non-Web-based research, drawing on databases and beyond (Milligan, 2013;
Putnam, 2016). Imagine a historian trying to understand the rise of somebody like Donald Trump, elected US
president in late 2016. Search interfaces can only get them so far, especially if they rely on keywords; imagine
the number of hits they would get if they put his name into the equivalent of a Google search, or newspaper
database search, or beyond. They would have no sense of what was important, or what was not important. As
Ted Underwood has noted, ‘in a database containing millions of sentences, full-text search can turn up twenty
examples of anything' (Underwood, 2014). In this dystopian scenario, the historian cannot read the tens of thousands of hits returned by a single keyword search in even one small archive; as they begin to put pen to paper, they imagine that they have chosen the right sources, but the ranking algorithm is really writing the work.

Doing this work will require algorithmic and computational literacy. But imagine a more sophisticated approach: the historian decides to take part of a web archive, say a series of blogs around a particular incident during
the Trump campaign; or a circumscribed corpus of media articles. Perhaps they use network analysis to filter
away extreme, unrepresentative voices, focusing instead on pages that both mentioned Trump prominently
and had many other pages (themselves trustworthy based on linking patterns) linking to them. They consult
the information they have about the programs they are using to explore this material, recognize the biases in
the archive and the algorithm, and begin to pare it down. After this filtering process, perhaps they are down to
a manageable set of 500 web pages, which can be closely read. Spanning distant and close reading, hard work still lies ahead, but the historian at least has a rigorous pathway forward.
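The link-based filtering step described here can be approximated with a simple PageRank-style computation: rank pages by their incoming links, then keep only well-linked pages that mention the query term. The tiny link graph and page texts below are invented for illustration, and the code is a sketch rather than a production crawler-scale implementation:

```python
links = {  # page -> pages it links to (hypothetical graph)
    "news-a": ["blog-1", "blog-2"],
    "news-b": ["blog-1"],
    "blog-1": ["blog-2"],
    "blog-2": [],
    "fringe": [],  # nobody links here
}
texts = {
    "news-a": "coverage of the campaign",
    "news-b": "more campaign coverage",
    "blog-1": "trump rally report",
    "blog-2": "trump policy analysis",
    "fringe": "trump conspiracy post",
}

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute rank along links (simplified PageRank)."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if outs:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
            else:  # dangling page: spread its rank everywhere
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
        rank = new
    return rank

rank = pagerank(links)
threshold = 1.0 / len(links)  # keep pages ranked above a uniform baseline
kept = [p for p in links if "trump" in texts[p] and rank[p] > threshold]
print(sorted(kept))  # ['blog-1', 'blog-2'] -- the unlinked 'fringe' page drops out
```

The fringe page mentions the search term just as prominently, but because no other page links to it, it falls below the threshold and is filtered away, which is exactly the kind of principled narrowing the hypothetical historian needs.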

None of this will be easy or straightforward. Our society has been grappling with the profound medium shift brought about by the advent of the World Wide Web, and historians are no exception. It has been over 25 years since the Web was launched, and over 20 years since the beginning of widespread web archiving with the Internet Archive. Responding to this shift will require a new toolkit, ethical sensitivity, and innovative approaches to access. In the Handbook chapters that follow, we will explore the various dimensions in which this rethinking can play out.

Note
1 WARC, or Web ARChive, is the ISO-standardized file format that web archives use to store their content. In a nutshell, WARC files aggregate all the products of a crawl together with metadata. The best way to understand a WARC file is through a tangible example. Imagine we are preserving a university's website. Within it are
potentially millions of files: HTML files, Word and PDF documents, JPG and PNG images, video, stylesheets,
and beyond. A WARC file can aggregate these resources together with descriptive metadata, allowing them to be preserved and, crucially, accessed.
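A single record inside a WARC file pairs named header fields (record type, target URI, capture date, length) with the captured payload. The sketch below hand-builds and re-parses one simplified record to show the layout; real WARC files are written by crawlers such as Heritrix and read with dedicated libraries, and the URI and payload here are invented:

```python
# A minimal, simplified WARC response record: header fields, a blank
# line, then the captured payload.
payload = b"<html><body>Example University</body></html>"

headers = [
    b"WARC/1.0",
    b"WARC-Type: response",
    b"WARC-Target-URI: http://university.example/index.html",
    b"WARC-Date: 2005-10-15T08:34:04Z",
    b"Content-Type: text/html",
    b"Content-Length: " + str(len(payload)).encode(),
]
record = b"\r\n".join(headers) + b"\r\n\r\n" + payload

def parse_record(raw):
    """Split a WARC record into its version line, header fields, and payload."""
    head, body = raw.split(b"\r\n\r\n", 1)
    lines = head.decode().split("\r\n")
    version = lines[0]
    fields = dict(line.split(": ", 1) for line in lines[1:])
    return version, fields, body[: int(fields["Content-Length"])]

version, fields, body = parse_record(record)
print(fields["WARC-Target-URI"], len(body), "bytes")
```

An actual WARC file is simply a long concatenation of such records, one per captured resource, which is what allows millions of files to be preserved and accessed as a unit.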

References
Aiden, E., and Michel, J.B. (2013) Uncharted: Big Data as a Lens on Human Culture. New York: Riverhead.
AlSum, A. (2015) ‘Reconstruction of the US First Website', Proceedings of the 15th ACM/IEEE-CS Joint
Conference on Digital Libraries: 285–286.
Anderson, I. (2008) ‘History and Computing' (http://www.history.ac.uk/makinghistory/resources/articles/
history_and_computing.html). Accessed 22 June 2018.
Ben-David, A. (2015) ‘What Does the Web Remember of Its Deleted Past?' (https://webarchivehistorians.org/
2015/09/07/what-does-the-web-remember-of-its-deleted-past/). Accessed 4 October 2016.
Ben-David, A. (2016) ‘What Does the Web Remember of Its Deleted Past? An Archival Reconstruction of the
Former Yugoslav Top-Level Domain', New Media and Society, 18(7): 1103–1119.
Blank, G., and Dutton, W.H. (2014) ‘Next Generation Internet Users: A New Digital Divide', in M. Graham and
W.H. Dutton (Eds.), Society and the Internet: How Networks of Information and Communication Are Changing
Our Lives. New York: Oxford University Press. pp. 36–52.
Braudel, F. (1972) The Mediterranean and the Mediterranean World in the Age of Philip II. Berkeley: UC
Press.
Brügger, N. (2009) ‘Website History and the Website as an Object of Study', New Media and Society, 11(1–2):
115–132.
Brügger, N. (2013a) ‘Web Historiography and Internet Studies: Challenges and Perspectives', New Media
and Society, 15(5): 752–764.
Brügger, N. (2013b) ‘Historical Network Analysis of the Web', Social Science Computer Review, 31: 306–321.
Brügger, N. (2017) ‘Probing a Nation's Web Domain: A New Approach to Web History and a New Kind
of Historical Source', in G. Goggin and M. McLelland (Eds.), The Routledge Companion to Global Internet
Histories. New York: Routledge. pp. 61–73.
Brügger, N. (2018 forthcoming) The Archived Web: Doing Web History in the Digital Age. Cambridge: MIT
Press.
Deken, J.M. (2017) ‘The Web's First “Killer App”: SLAC National Accelerator Laboratory's World Wide Web
Site, 1991–1993', in N. Brügger (Ed.), Web 25: Histories from the First 25 Years of the World Wide Web. New
York: Peter Lang Publishing. pp. 57–78.
Duggan, M. (2015) ‘The Demographics of Social Media Users' (http://www.pewinternet.org/2015/08/19/the-
demographics-of-social-media-users/). Accessed 10 December 2015.
Fogel, R.W., and Engerman, S.L. (1974) Time on the Cross: The Economics of American Negro Slavery. New
York: W. W. Norton & Company.
Fogel, R., and Elton, G. (1984) Which Road to the Past?: Two Views of History. New Haven: Yale University
Press.
Gitlin, T. (1987) The Sixties: Years of Hope, Days of Rage. New York: Bantam Books.
Graham, S., Milligan, I., and Weingart, S. (2015) Exploring Big Historical Data: The Historian's Macroscope.
London: Imperial College Press.
Greenwald, G. (2014) No Place to Hide: Edward Snowden, the NSA, and the U.S. Surveillance State. New
York: Metropolitan Books.
Isserman, M. (1987) If I Had a Hammer: The Death of the Old Left and the Birth of the New Left. New York:
Basic Books.
Jackson, A., Lin, J., Milligan, I., and Ruest, N. (2016) ‘Desiderata for Exploratory Search Interfaces to Web
Archives in Support of Scholarly Activities', Proceedings of the 16th ACM/IEEE-CS on Joint Conference on
Digital Libraries: 103–106.
Karampelas, G. (2014) ‘Stanford Libraries Unearths the Earliest U.S. Website' (http://news.stanford.edu/
news/2014/october/slac-libraries-wayback-102914.html). Accessed 4 April 2017.
Katz, M.B. (1975) The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City.
Cambridge: Harvard University Press.
Kostash, M. (1980) Long Way from Home: The Story of the Sixties Generation in Canada. Toronto: Lorimer.
LaCalle, M., and Reed, S. (2014) ‘Poster: The Occupy Web Archive: Is the Movement Still on the Live Web?',
presented at Digital Preservation 2014, Washington, DC.
Levitt, C. (1984) Children of Privilege: Student Revolt in the Sixties: A Study of Student Movements in
Canada, the United States, and West Germany. Toronto: University of Toronto Press.
Lin, J. (n.d.) ‘My Data Is Bigger Than Your Data' (http://lintool.github.io/my-data-is-bigger-than-your-data/).
Accessed 22 June 2018.
Mayer-Schönberger, V., and Cukier, K. (2013) Big Data: A Revolution That Will Transform How We Live,
Work, and Think. Boston: Eamon Dolan/Houghton Mifflin Harcourt.
Milligan, I. (2012) ‘Mining the “Internet Graveyard”: Rethinking the Historians’ Toolkit', Journal of the Canadian
Historical Association, 23(2): 21–64.
Milligan, I. (2013) ‘Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History,
1997–2010', Canadian Historical Review, 94(4): 540–569.
Milligan, I. (2014) Rebel Youth: 1960s Labour Unrest, Young Workers, and New Leftists in English Canada.
Vancouver: UBC Press.
Milligan, I. (2016) ‘Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives', International Journal
of Humanities and Arts Computing, 10(1): 78–94.
Millward, G. (2015) ‘I Tried to Use the Internet to Do Historical Research. It Was Nearly Impossible',
Washington Post (https://www.washingtonpost.com/posteverything/wp/2015/02/17/i-tried-to-use-the-internet-
to-do-historical-research-it-was-nearly-impossible/?utm_term=.77cd9f605120). Accessed 19 February 2015.
Moretti, F. (2007) Graphs, Maps, Trees: Abstract Models for Literary History. New York: Verso.
Mosby, I. (2014) Food Will Win the War: The Politics, Culture, and Science of Food on Canada's Home Front.
Vancouver: UBC Press.
Nanni, F. (2016) ‘Reconstructing a Website's Lost Past: Methodological Issues Concerning the History of
www.unibo.it', ArXiv160405923.
Owram, D. (1997) Born at the Right Time: A History of the Baby Boom Generation. Toronto: University of
Toronto Press.
Putnam, L. (2016) ‘The Transnational and the Text-Searchable: Digitized Sources and the Shadows They
Cast', American Historical Review, 121(2): 377–402.
Rosenzweig, R. (2003) ‘Scarcity or Abundance? Preserving the Past in a Digital Era', American Historical
Review, 108(3): 735–762.
Schneider, S.M., and Foot, K.A. (2004) ‘The Web as an Object of Study', New Media and Society, 6(1):
114–122.
Schroeder, R. (2014) ‘Big Data: Towards a More Scientific Social Science and Humanities?', in M. Graham
and W.H. Dutton (Eds.), Society and the Internet: How Networks of Information and Communication Are
Changing Our Lives. New York: Oxford University Press. pp. 164–176.
Taylor, N. (2014) ‘The MH17 Crash and Selective Web Archiving' (https://blogs.loc.gov/thesignal/2014/07/
21503/). Accessed 12 April 2017.
Thelwall, M., and Vaughan, L. (2004) ‘A Fair History of the Web? Examining Country Balance in the Internet
Archive', Library and Information Science Research, 26(2): 162–176.
Tsukayama, H. (2013) ‘CERN Reposts the World's First Web Page', Washington Post
(https://www.washingtonpost.com/business/technology/cern-reposts-the-worlds-first-web-page/2013/04/30/
d8a70128-b1ac-11e2-bbf2-a6f9e9d79e19_story.html). Accessed 10 June 2014.
Underwood, T. (2014) ‘Theorizing Research Practices We Forgot to Theorize Twenty Years Ago',
Representations, 127(1): 64–72.
Webster, P. (2014) ‘Reading Creationism in the Web Archive' (https://peterwebster.me/2014/11/18/reading-
creationism-in-the-web-archive/). Accessed 23 May 2017.
Webster, P. (2017) ‘Religious Discourse in the Archived Web: Rowan Williams, Archbishop of Canterbury, and
the Sharia Law Controversy of 2008', in N. Brügger and R. Schroeder (Eds.), The Web as History. London:
UCL Press. pp. 190–203.
http://dx.doi.org/10.4135/9781526470546.n1
