Large Scale Study of Web Accessibility Metrics
Beatriz Martins · Carlos Duarte
https://doi.org/10.1007/s10209-022-00956-x
LONG PAPER
Abstract
Evaluating the accessibility of web resources is usually done by checking the conformance of the resource against a standard or set of guidelines (e.g., the WCAG 2.1). The result of the evaluation indicates which guidelines are respected (or not) by the resource. While this might hint at the accessibility level of a web resource, it is often complicated to compare the accessibility level of different resources, or of different versions of the same resource, from evaluation reports. Web accessibility metrics synthesize the accessibility level of a web resource into a quantifiable value. The wide number of available accessibility metrics makes it challenging to choose which ones to use. In this paper, we explore the relationship between web accessibility metrics. For that purpose, we investigated eleven web accessibility metrics. The metrics were computed from automated accessibility evaluations obtained using QualWeb. A set of nearly three million web pages was evaluated. By computing the metrics over this sample, it was possible to identify groups of metrics that offer similar results. Our analysis shows that there are metrics that behave similarly, which, when deciding what metrics to use, assists in picking the metric that is less resource intensive or for which it is easier to collect the inputs.
Keywords Web accessibility metrics · Large-scale accessibility evaluation · Automatic accessibility evaluation · QualWeb
subsequent releases of the same website to check for accessibility improvements. But it is also important for researchers, especially when conducting large-scale accessibility evaluations [5], comparing domains of activity [6, 7], geographical areas [8, 9], or user groups [10]. Recently, in the context of the European Web Accessibility Directive¹, the different European member-states reported the results of their accessibility monitoring activities. This large-scale accessibility monitoring exercise was marred by the difficulty of comparing the results reported by the different member-states. The use of a common metric would have mitigated this problem. Other potential benefits of accessibility metrics include support for ranking web pages, which can be relevant for retrieval systems, or providing criteria for adaptations in adaptive hypermedia systems [3].

Given the large number of web accessibility metrics available for researchers, auditors or practitioners to choose from, an important question emerges: which one(s) should be used? To help answer this question, it is important to understand how these web accessibility metrics relate to each other, whether it is possible to group them according to their similarities, and what the differences between each group are.

To identify existing relationships between web accessibility metrics, we computed eleven different web accessibility metrics over a set of nearly three million web pages. In this article we report the findings of this study. We begin by providing a background on web accessibility metrics and a review of 19 web accessibility metrics that were proposed in the literature. Then, we present the methodology, results and discussion of a study where we compared eleven of the 19 reviewed metrics. Afterwards, we present a second study, where we analyzed the validity of the eleven metrics by assessing how they rate a set of pages created to demonstrate good and bad accessibility practices. We finish with an analysis of the studies' limitations before concluding.

With this work we contribute the following:

• A review of existing web accessibility metrics, describing a total of 19 metrics applicable at page or website level;
• The results of computing a subset of eleven metrics over a sample of nearly three million web pages;
• An analysis that identifies relationships between metrics and determines groups of metrics that report similar outcomes.

¹ https://digital-strategy.ec.europa.eu/en/policies/web-accessibility.

2 Accessibility metrics

According to Vigo, Brajnik and Connor [3], web metrics measure properties of websites or web pages. These metrics can summarize results obtained from a guideline review based evaluation [11]. Additionally, Song, et al., [12] state that web accessibility metrics have the ability to measure the accessibility levels of websites.

Metrics should meet five different aspects [13]. They should:

1. be simple to understand;
2. be precisely defined;
3. be objective;
4. be cost-effective; and
5. provide information that supports meaningful interpretations.

Freire, et al., also mention that web accessibility metrics are important to understand, control and improve products and processes in companies [13]. Nevertheless, they affirm that it is not possible to define which metric is more effective, since it depends on the project in question and its needs.

Parmanto and Zeng [14] argue that an accessibility metric should be summarized into a quantitative score that provides a continuous range of values, so it is possible to understand how accessible or inaccessible the web content is. It is also important to guarantee that the range of the metric's values supports more fine-grained discrimination than accessible and inaccessible. Another property the authors ascribe to high-quality metrics is that they should consider the complexity of the websites. It would also be convenient if the accessibility metric were scalable enough to support large-scale accessibility evaluations.

In conclusion, metrics are useful to process and understand the results obtained from an accessibility evaluation. They can also help rank web pages or explore the accessibility level of web pages or websites. The computation of accessibility metrics can produce, as a result, qualitative or quantitative values.

2.1 Literature review

Before presenting the details of the identified web accessibility metrics, it is important to introduce concepts that help to understand how each metric behaves.

Some metrics use the barrier concept. A barrier is a condition caused by the website or web page that prevents the user from accessing the web content [15], i.e., a problem found in a certain website or web page that prevents the user from perceiving or interacting with the web content. Barriers can have different levels of severity.
Whenever an accessibility evaluation is performed, its outcomes vary according to the compliance with standards. Different outcomes are considered by different metrics, but they can be summarized into: (1) pass, which means that the web content fulfills a certain recommendation; (2) fail, which indicates that the web content does not meet the recommendation; (3) warning, an outcome produced by automated evaluation tools to represent those instances where the tool could not determine the conformance, or lack of conformance, with the recommendation, and the intervention of a human expert is required.

Besides the above aspects, it is important to note that some of the web accessibility metrics that have been reviewed verify the conformance with checkpoints, and these checkpoints are grouped into priority levels: priority 1, priority 2 or priority 3. The priority levels in some metrics have associated weights that vary from zero to one. This applies to metrics proposed before the introduction of WCAG 2.0. Metrics proposed after WCAG 2.0 typically verify conformance with success criteria grouped at conformance levels A, AA and AAA.

The score of a metric can be bounded or not. A bounded metric makes it easier to gauge where a score falls within the accessible-to-inaccessible continuum of values. Unbounded metrics, on the other hand, by not having a defined range of values, can make it harder to interpret whether a resource is accessible or inaccessible.

In the following, we present the metrics we found by searching the existing literature on web accessibility. For each metric, we describe the data it is based on, its output range, and any other considerations regarding its application (e.g., if it is applicable to web pages or web sites).

2.1.1 Failure-rate (FR)

The Failure Rate (FR) was developed by Sullivan and Matson in 2000 [16]. According to Vigo, et al., [17], this metric relates the actual points of failure with the potential points of failure. For instance, if a web page has ten images, all these images are potential barriers if they are not properly defined. If five out of these ten images do not have a proper alternative text, according to the accessibility evaluator, they are actual barriers.

A point of failure can be interpreted in two ways: as an accessibility problem or barrier that occurs on a web page's elements, preventing the interaction of a user with the web content; or as the elements that cause accessibility problems. According to the first interpretation, each element can have multiple points of failure, which allows us to count more accessibility problems and better estimate the accessibility level. Therefore, we decided to consider a point of failure as an accessibility problem that occurs on a web page. Consequently, the failure rate can be the ratio between the actual problems that were encountered in a web page and the potential barriers, i.e., all potential problems of a web page that can lead to accessibility issues if they are not properly designed.

Vigo and Brajnik [4] state that the failure rate quantitatively measures accessibility conformance, having a score from zero to one. A web page with a failure rate of zero is totally accessible, whereas a totally inaccessible web page has a failure rate score of one.

The simplicity of this metric can be explained by the fact that it does not consider the error nature, i.e., "whether checkpoints are automatic errors, warnings or generic problems" [18], and by the fact that it does not take into consideration the checkpoints' weights.

$$I_p = \frac{B_p}{P_p} \quad (1)$$

Equation 1 presents the formula for computing the Failure Rate metric, where $I_p$ is the Failure Rate final score, $B_p$ identifies the actual points of failure, and $P_p$ identifies the potential points of failure.
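To make the computation concrete, here is a minimal Python sketch of Equation 1; the list-based input format is our own illustrative assumption, not the report format of any particular evaluation tool.

```python
from typing import List

def failure_rate(actual: List[int], potential: List[int]) -> float:
    """Eq. 1: I_p = B_p / P_p, the ratio of actual to potential points of failure."""
    bp = sum(actual)     # B_p: points of failure found on the page
    pp = sum(potential)  # P_p: points where a failure could occur
    return bp / pp if pp else 0.0

# Ten images, five of them without proper alternative text -> FR = 0.5
print(failure_rate(actual=[5], potential=[10]))
```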
2.1.2 Unified web evaluation methodology (UWEM)

According to Sirithumgul, Suchato and Punyabukkana [10], UWEM 1.0 is an improved version of UWEM 0.5 [19] that was developed in 2006. It is based on user feedback rather than WCAG priority levels [12]. The final value of this metric represents the probability of finding a barrier in a website or web page that could prevent users from completing a certain task [11, 13, 20]. This metric also considers the potential problems and the barriers' weights. The UWEM formula is based on the product of the checkpoints' failure rates [20]. Its results are precise and accurate; however, it only takes into consideration two priority levels of the WCAG guidelines [21].

The formula can be interpreted as a web page score or a website score. If the website score is wanted, then the UWEM formula will be the sum of the UWEM scores of each web page divided by the total number of pages of that website, i.e., the arithmetic mean.

This formula's final score varies between zero and one, where zero means the web page is accessible and one means the web page is inaccessible.

$$UWEM = 1 - \prod_i \left(1 - \frac{B_i}{P_i} W_i\right) \quad (2)$$

Equation 2 presents the formula for computing the UWEM metric, where $B_i$ is the total of actual points of failure of a checkpoint $i$, $P_i$ is the total of potential points of failure of a checkpoint $i$, and $W_i$ identifies the severity of a certain barrier $i$ (this weight is calculated by simple heuristics, by combining the results of an automatic evaluation and manual testing, or from disabled users' feedback [22]).
2.1.3 A3

In 2006, Buhler, et al., proposed some changes to the UWEM 0.5 metric [22]. In particular, some probability properties were used, and some issues related to the complexity of the web page were aggregated. A3 is an improved aggregation formula based on UWEM [11, 13, 20]. Similar to UWEM, A3 also considers the failure rate, i.e., the ratio between the number of barriers (violations of a given checkpoint) and the total number of potential barriers. UWEM and A3 consider barrier weight coefficients based on the impact of each given barrier on the user [13].

This metric produces a small range of values, all between zero and one, where zero means the web page is accessible whereas one means the web page is inaccessible.

$$A3 = 1 - \prod_b \left(1 - F_b \frac{B_{pb}}{N_{pb}}\right)^{B_{pb}/B_p} \quad (3)$$

Equation 3 presents the formula for computing the A3 metric, where $B_{pb}$ is the total of actual points of failure of a checkpoint $b$ in page $p$, $b$ is the barrier (checkpoint violation), $B_p$ is the total of actual points of failure in page $p$, $N_{pb}$ is the total of potential points of failure of a checkpoint $b$ in page $p$, and $F_b$ identifies the severity of a certain barrier $b$ (this weight is calculated by simple heuristics, by combining the results of an automatic evaluation and manual testing, or from disabled users' feedback [22]).
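A minimal Python sketch of Equations 2 and 3 follows, assuming per-checkpoint failure counts and severity weights are already available; the uniform 0.05 weight mirrors the authors' experiment described next.

```python
from math import prod

def uwem(b, p, w):
    """Eq. 2: 1 minus the product of (1 - (B_i / P_i) * W_i) over checkpoints."""
    return 1 - prod(1 - (bi / pi) * wi for bi, pi, wi in zip(b, p, w) if pi)

def a3(b, n, f):
    """Eq. 3: like UWEM, but each factor is raised to B_pb / B_p."""
    bp = sum(b)  # total actual points of failure on the page
    if bp == 0:
        return 0.0
    return 1 - prod((1 - fb * bi / ni) ** (bi / bp)
                    for bi, ni, fb in zip(b, n, f) if ni)

failures  = [5, 2]        # actual points of failure per checkpoint
potential = [10, 8]       # potential points of failure per checkpoint
weights   = [0.05, 0.05]  # uniform checkpoint weight, as in the experiment
print(uwem(failures, potential, weights), a3(failures, potential, weights))
```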
The authors of this metric performed an experimental study to compare the results of A3 and UWEM and understand the differences between them. A checkpoint weight of 0.05 was used for all checkpoints, assuming that all of them would have the same importance. This experiment was conducted with a group of six disabled users that evaluated six web pages. After applying both metrics, the authors concluded that A3 outperformed UWEM in the experiment [11].

2.1.4 Web accessibility barriers (WAB)

The WAB metric was proposed by Hackett, et al., in 2003 [23]. Parmanto and Zeng proposed a new version of the WAB metric in 2005 [14]. It quantitatively measures the accessibility of a web site considering 25 WCAG 1.0 checkpoints (5 checkpoints in Priority 1, 13 checkpoints in Priority 2, and 7 checkpoints in Priority 3). It applies the concepts of potential problems and weights of the barriers. Barriers' weights are related to the relative importance of a given checkpoint. It takes into consideration the total number of pages of a certain website. The WAB formula is defined as the ratio between the sum of the failure rate of each checkpoint and the priority of that checkpoint [4]. The arithmetic mean of all pages of a website represents the metric score for that website. The Hackett and the Parmanto and Zeng formulas are represented in Equations 4 and 5, respectively.

The range of this metric's values is not bounded [18], as there is no limit for this metric's score. The only reference this metric has is that the higher its score, the worse the accessibility level of the website. Since this metric takes into consideration 25 WCAG checkpoints out of 65, it offers a guideline support of 38%. Nevertheless, according to Brajnik and Vigo [24], WAB is the best individual metric compared to A3, Page Measure (PM) and the Web Accessibility Quantitative Metric (WAQM), since it yields an accuracy rate of 96%.

$$WAB = \frac{1}{N_p} \sum_p \sum_c \frac{fr(p, c)}{priority_c} \quad (4)$$

Equation 4 presents the formula for computing the WAB by Hackett metric, where $fr(p, c)$ is the failure rate of a certain checkpoint $c$ in web page $p$, $priority_c$ identifies the priority level of the checkpoint $c$ (1, 2 or 3), and $N_p$ is the total number of web pages of a given website.

$$WAB = \frac{\sum_{j=1}^{T} \sum_{i=1}^{n} \left(\frac{b_{ij}}{B_{ij}}\right) W_i}{T} \quad (5)$$

Equation 5 presents the formula for computing the WAB by Parmanto and Zeng metric, where $b_{ij}$ is the number of actual violations of checkpoint $i$ in page $j$, $B_{ij}$ is the number of potential violations of checkpoint $i$ in page $j$, $n$ is the total number of checkpoints, $W_i$ identifies the weight of the checkpoint $i$, according to its priority level (this weight is calculated from experiments with users with different disabilities [11]), and $T$ is the total number of web pages of a given website.

Parmanto and Zeng [14] weighted the priority levels in the calculation of the WAB score. Priority 1 violations receive a higher weight, since web pages with this level of violations are more difficult to access by people with disabilities.

Ana Belén Martínez, Aquilino A. Juan, Darío Álvarez, and Ma del Carmen Suárez [21] went further and created a quantitative metric based on the WAB metric: WAB∗. The WAB∗ metric is based on WAB and has some UWEM-like extensions. It gets the WAB's precision of the accessibility score and uses more detailed checkpoints, as UWEM does. With all these tools, the authors could build a new metric, namely WAB∗. Martínez, et al. [21], point out the main problems and the main advantages of the WAB and UWEM metrics. For instance, WAB performs tests to evaluate checkpoints, yet it is not precise in the way it determines the number of potential
violations of each checkpoint. However, it specifies checkpoints for all three priorities. Concerning UWEM, this metric produces more precise results, although it only focuses on priority 1 and 2 checkpoints. Thus, these two metrics are merged into WAB∗. Consequently, WAB∗ has more precision in terms of the obtained results. In conclusion, this new metric considers 3 priority levels and has 36 checkpoints (25 WAB checkpoints + 11 UWEM checkpoints). This metric was tested by evaluating 30,600 web pages from banking sector websites. The results show that WAB∗ outperforms WAB and UWEM.
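The two WAB variants can be sketched as follows; the dictionary layout for per-page results is a hypothetical structure chosen for illustration.

```python
def wab_hackett(pages):
    """Eq. 4: average over pages of the failure rate of each checkpoint
    divided by its priority level (1, 2 or 3)."""
    return sum(fr / priority
               for page in pages
               for fr, priority in page.values()) / len(pages)

def wab_parmanto_zeng(pages):
    """Eq. 5: average over pages of (actual / potential) * weight per
    checkpoint; each page maps checkpoints to (actual, potential, weight)."""
    return sum((b / big_b) * w
               for page in pages
               for b, big_b, w in page.values() if big_b) / len(pages)

# Two pages; each checkpoint maps to (failure rate, priority) for Eq. 4
site = [{"alt-text": (0.5, 1), "contrast": (0.2, 2)},
        {"alt-text": (0.0, 1), "contrast": (0.4, 2)}]
print(wab_hackett(site))  # (0.5/1 + 0.2/2 + 0.0/1 + 0.4/2) / 2 = 0.4
```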
2.1.5 Overall accessibility metric (OAM)

In 2005, Bailey and Burd [25] proposed OAM. The calculated value considers the number of violations of a checkpoint and the weight of that checkpoint as the confidence level. This confidence level depends on how certain the checkpoint is. There are four confidence levels: certain checkpoints weigh 10, high-certainty checkpoints weigh 8, low-certainty checkpoints weigh 4, and the most uncertain checkpoints weigh 1. The higher the weight, the more the barrier is penalized.

This metric does not have a bounded range of values. The higher this metric's score, the more inaccessible the web page is.

$$OAM = \frac{\sum_c B_c W_c}{N_{attributes} + N_{elements}} \quad (6)$$

Equation 6 presents the formula for computing the OAM metric, where $B_c$ is the number of violations of checkpoint $c$, $W_c$ is the weight of the checkpoint $c$, $N_{attributes}$ is the number of HTML attributes on a given web page, and $N_{elements}$ is the number of elements on a given web page.

2.1.6 Page measure (PM)

Later, in 2007, Bailey and Burd [26] proposed Page Measure (PM). This metric "analyzes the correlations between the accessibility of web sites and the policies adopted by software companies regarding usage of CMS or maintenance strategies" [4]. It is similar to OAM (Overall Accessibility Metric); however, instead of using checkpoint weights, the checkpoint priority levels are considered. This metric does not have a bounded range of values. The higher this metric's score, the more inaccessible the web page is.

$$PM = \frac{\sum_c \frac{B_c}{priority_c}}{N_{attributes} + N_{elements}} \quad (7)$$

Equation 7 presents the formula for computing the PM metric, where $B_c$ is the number of violations of checkpoint $c$, $priority_c$ identifies the priority level of the checkpoint $c$ (1, 2 or 3), $N_{attributes}$ is the number of HTML attributes on a given web page, and $N_{elements}$ is the number of elements on a given web page.
2.1.7 SAMBA

Brajnik and Lomuscio proposed SAMBA [27], a semi-automatic method for measuring barriers of accessibility that combines automatic evaluations with human judgment.

SAMBA is based on WCAG 1.0. This method applies human judgment in the context of a Barrier Walkthrough analysis [27] to estimate aspects related to the automated tool errors and the severity of the barriers. The Barrier Walkthrough method is used for evaluating web accessibility [28] and is performed by experts. This manual approach contextualizes the accessibility barriers identified by experts within usage scenarios, and these barriers receive a severity score. The severity score of a barrier assumes a value from {0, 1, 2, 3} that corresponds to false positive (FP), minor, major or critical barriers.

This semi-automated approach [27] applies a set of sequential steps. Initially, automatic accessibility tools are used to identify the potential accessibility barriers, and the provided results are submitted to human judgment. Then, it is possible to statistically estimate the false positives and the severity of barriers for each website. Finally, barriers are grouped according to disability types, and it is possible to derive scores that represent non-accessibility.

This metric computes two accessibility indexes: the Raw Accessibility Index ($AI_r$) and the Weighted Accessibility Index ($AI_w$). Since $AI_w$ is based on confidence intervals manually computed by human experts, its result is represented by an interval $[AI_w, \overline{AI_w}]$. The confidence intervals express the minimum and the maximum percentages of a type of barriers (FP, minor, major or critical) for a specific disability (blind users, deaf users, among others) on a given website. For example, having the interval [6, 12] in column 'critical' and row 'blind' means that, in a given website, there are between 6% and 12% of critical barriers for blind users. The $AI_w$ index considers weights that are associated with minor and major severity levels. If both minor and major weights are equal to 1, $AI_w$ becomes unweighted ($AI_u$).

SAMBA has a limitation: it cannot cope with false negatives, i.e., problems that were not identified [4]. This means that, although human judgments are used to evaluate and validate the results obtained by the automated tools, they do not deal with the problem of false negatives, since the experts only verify the identified problems. For this reason, the actual issues that were not identified are not going to be analyzed by the experts, i.e., the problems that are not identified by the evaluation tools are not considered.
$$AI_r = \prod_d (1 - F \cdot \vec{D}_d)^2 \quad (8)$$

$$AI_w = \prod_d (1 - F \cdot \min\{1, H_d\})^2 \quad (9)$$

$$\overline{AI_w} = \prod_d (1 - F \cdot \overline{H}_d)^2 \quad (10)$$

$$F = \frac{\text{number of potential barriers}}{\text{number of HTML lines}} \quad (11)$$

$$H_d = \frac{f_{d,mnr}}{w_{mnr}} + \frac{f_{d,maj}}{w_{maj}} + f_{d,cri} \quad (12)$$

$$\overline{H}_d = \frac{\overline{f}_{d,mnr}}{w_{mnr}} + \frac{\overline{f}_{d,maj}}{w_{maj}} + \overline{f}_{d,cri} \quad (13)$$

In Eq. 8, $F$ is the barrier density of a website, $d$ is a disability type, and $\vec{D}$ is the disability vector of a website. In Eqs. 9 and 10, $H_d$ is the severity of the barriers of a disability type $d$. Equations 12 and 13 identify $f$ as the relative frequency, $mnr$ as a minor barrier, $maj$ as a major barrier, and $cri$ as a critical barrier.
2.1.8 Web accessibility evaluation metric (WAEM)

The Web Accessibility Evaluation Metric based on Partial User Experience Order [29] was proposed by Song et al. and intends to analyze data from the user experience of people with disabilities. To do so, the authors defined a formula that calculates the weighted accessibility score (Eq. 15) by using the pass rate (Eq. 14) of a certain checkpoint on a website.

Besides these formulas, this metric also considers users' experience evaluations through PUEXO pairs. PUEXO (Partial User EXperience Order) defines pairs of websites that establish a comparison in terms of user experience. For instance, the (a, b) pair indicates that a certain user had a better browsing experience on website a than on website b. The PUEXO pairs are then compared to the weighted accessibility scores of the websites in question, by Eq. 16.

Subsequently, the results of Eq. 16 and the users' evaluations are both used to calculate the optimal checkpoint weights (Eq. 17). Equation 17 is not, however, adequate, since the user experience is a subjective aspect. For this reason, the authors developed Eq. 18, where they make use of machine learning.

As seen in [29], "results demonstrate that WAEM really can better match the accessibility evaluation results with the user experience of people with disabilities on Web accessibility". Nevertheless, the user experience is a subjective problem and varies according to the user. This means that it is complicated to confirm a relationship between user experience and web accessibility, since different users can have different user experiences [29]. When using WAEM, the higher the weighted accessibility score, the more accessible the website is.

$$p = \frac{s}{h} \quad (14)$$

Equation 14 presents the formula for computing the Pass Rate, where $p$ is the pass rate of a checkpoint, $s$ is the number of pages of a website that pass the checkpoint, and $h$ is the total number of web pages of the website.

$$q_i = P_i w = \sum_{j=1}^{m} P_{i,j} w_j \quad (15)$$

Equation 15 presents the formula for computing the Weighted Accessibility Score, where $q_i$ is the weighted accessibility score of a website $i$, $P_{i,j}$ is the pass rate of a checkpoint $j$ on a website $i$, $m$ is the number of checkpoints, and $w_j$ is the weight of a checkpoint $j$, according to its priority level.

$$f((a, b), w, P) = \begin{cases} 1 & \text{if } P_a w > P_b w \\ 0 & \text{otherwise} \end{cases} \quad (16)$$

Equation 16 presents the formula for computing the function $f$, where $(a, b)$ is a PUEXO pair that represents an order identified by disabled users, $w$ is the set of checkpoints' weights, and $P$ is the matrix of the pass rates of all websites.

$$\arg\max_w \sum_{i=1}^{k} f(L_i, w, P) \quad \text{s.t.} \quad \sum_{j=1}^{m} w_j = 1; \; \forall i, w_i > 0 \quad (17)$$

Equation 17 presents the formula for computing the optimal checkpoint weight vector $w$, where $w$ is the set of checkpoints' weights, $L$ is the matrix that contains all pairs of websites, $i$ is the website, $j$ is the checkpoint, $m$ is the number of checkpoints, and $P$ is the matrix of pass rates.

$$\arg\min_w \sum_{i=1}^{k} e_i \quad \text{s.t.} \quad \sum_{j=1}^{m} w_j = 1; \; \forall i, e_i \ge 0, w_i > 0, P_{L_{i,1}} w + e_i > P_{L_{i,2}} w \quad (18)$$

Equation 18 presents the formula for computing the optimal checkpoint weight vector $w$, where $i$ is the website, $e$ is the error tolerance vector, $P$ is the matrix of pass rates, and $m$ is the number of checkpoints.
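The automatable part of WAEM (Eqs. 14-16) can be sketched with NumPy; the matrices below are toy data, and the weight optimization of Eqs. 17-18 is left out since it needs PUEXO pairs from user studies.

```python
import numpy as np

def pass_rates(page_results):
    """Eq. 14: per checkpoint, the share of a website's pages that pass it.
    `page_results` is a pages-by-checkpoints boolean matrix (True = pass)."""
    return page_results.mean(axis=0)

def weighted_scores(P, w):
    """Eq. 15: q = P w, one weighted accessibility score per website."""
    return P @ w

def puexo_agrees(pair, w, P):
    """Eq. 16: 1 if the computed scores rank the pair (a, b) as the user did."""
    a, b = pair
    return int(P[a] @ w > P[b] @ w)

P = np.array([[0.9, 0.4],    # pass rates of website 0
              [0.6, 0.8]])   # pass rates of website 1
w = np.array([0.7, 0.3])     # checkpoint weights summing to one
print(weighted_scores(P, w))       # [0.75 0.66]
print(puexo_agrees((0, 1), w, P))  # 1: website 0 also scores higher
```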
2.1.10 Barrier impact factor (BIF)

A score of zero indicates the absence of barriers. The higher this metric's score, the higher the impact of a certain barrier on a specific type of assistive technology/disability.

$$BIF(i) = \sum_{error} error(i) \times weight(i) \quad (22)$$

Equation 22 presents the formula for computing the BIF metric, where $BIF(i)$ is the barrier impact factor of an assistive technology $i$, $error(i)$ is the number of detected errors that affect the assistive technology $i$, and $weight(i)$ is the weight of assistive technology $i$ (1, 2 or 3).

2.1.11 Web accessibility quantitative metric (WAQM)

WAQM was proposed by Vigo, et al. [18], and overcomes some limitations of previous measures (i.e., lack of score normalization and consideration of manual tests). It considers the WCAG guidelines classified according to the 4 principles: Perceivable, Operable, Understandable and Robust [13]. This metric measures conformance using percentages [31], and it produces one score for each WCAG guideline in addition to an overall score. It considers the severity of checkpoint violations according to WCAG priorities, and it provides normalized results.

Unlike other metrics, WAQM also takes into account the problems that are identified as warnings by the accessibility evaluation tools [13]. It considers not only automatic tests but also manual tests.

According to Vigo, Arrue, Brajnik, Abascal and Lomuscio [18], this metric was proposed to overcome the drawbacks of the WAB and FR metrics, as they do not focus on specific user groups, cover fewer guidelines and do not consider expert manual evaluation results.

This metric is based on the sum of failure rates for groups of checkpoints, which are grouped according to their priority levels and their WCAG 2.0 principles (Perceivable, Operable, Understandable, Robust) [20]. The authors defined weights for each priority level: W1 = 0.8, W2 = 0.16 and W3 = 0.04 for checkpoints with priorities 1, 2 and 3, respectively.

Since WAQM was considered to be tool dependent, there was the need to see if it was possible to prove the opposite [18]. Therefore, Vigo, et al., in their study [18], wanted to obtain similar outcomes regardless of the evaluation tool being used. For this matter, the authors proposed a method to reach tool independence for every possible scenario. A total of 1363 web pages from 15 websites were evaluated against the WCAG guidelines, using the automated evaluation tools EvalAccess and LIFT. They used two different tools to understand the behavior of the WAQM metric when accessibility is measured by different tools. So, they tuned two WAQM parameters (a and b) to obtain independence. WAQM then proved to be tool independent when conducting large-scale accessibility evaluations with more than 1400 web pages [4].

WAQM's normalized values range from zero to one hundred, where the latter corresponds to the maximum accessibility level.

$$WAQM = \frac{1}{N} \sum_{x \in \{p,o,u,r\}} N_x \sum_{y \in \{e,w\}} \frac{N_{x,y} \sum_{z \in \{1,2,3\}} W_z A(x, y, z)}{N_x} \quad (23)$$

$$A(x, y, z) = \begin{cases} \frac{-100}{b} \left(\frac{B_{x,y,z}}{P_{x,y,z}}\right) + 100 & \text{if } \frac{B_{x,y,z}}{P_{x,y,z}} < \frac{a - 100}{a - 100/b} \\ -a \frac{B_{x,y,z}}{P_{x,y,z}} + a & \text{otherwise} \end{cases} \quad (24)$$

Equations 23 and 24 present the formulas for computing the WAQM metric, where $N$ is the total number of checkpoints, $N_x$ is the number of checkpoints from a specific principle $x$ ($x \in$ {Perceivable, Operable, Understandable, Robust}), $N_{x,y}$ is the number of checkpoints from a principle $x$ and type of test $y$ ($y \in$ {automatic, manual}), $W_z$ is the weight of the checkpoint, according to its priority level $z$, $B_{x,y,z}$ is the number of accessibility errors of a checkpoint of priority level $z$, type of test $y$ and principle $x$, $P_{x,y,z}$ is the number of test cases of a checkpoint of priority level $z$, type of test $y$ and principle $x$, $a$ is a variable that varies between 0 and 100, and $b$ is a variable that varies between 0 and 1.
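A minimal sketch of Eq. 24's piecewise mapping follows, using the a and b constants that Table 3 later lists; the default values and input shapes are illustrative assumptions.

```python
def waqm_a(failed: int, cases: int, a: float = 20.0, b: float = 0.3) -> float:
    """Eq. 24: map a failure rate B/P onto a 0-100 accessibility score using
    two line segments that meet at the breakpoint (a - 100) / (a - 100 / b)."""
    if cases == 0:
        return 100.0
    fr = failed / cases
    knee = (a - 100) / (a - 100 / b)
    return (-100 / b) * fr + 100 if fr < knee else -a * fr + a

print(waqm_a(0, 10), waqm_a(10, 10))  # 100.0 (no failures) and 0.0 (all fail)
```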
2.1.12 Navigability and listenability

Fukuda et al. [32] proposed two different web metrics. These metrics evaluate the usability of web pages for blind users.

Navigability evaluates the structure of the web page elements. It evaluates headings, intra-page links, labels, among other HTML elements of a certain web page. Listenability takes into consideration the alternative texts and denotes how properly built they are.

Each of these two metrics executes a set of calculations using the aDesigner (Accessibility Designer) engine. This approach provides a visualization of the Web's usability for blind users through colors and gradations [32].

2.1.13 Web interaction environments (WIE)

Lopes and Carriço proposed, in 2008, a metric that quantifies Web accessibility [33]. It calculates the proportion of checkpoints that are violated on a web page [4]. To do so, it considers a set of checkpoints and, for each of them, verifies whether a checkpoint $c$ is successfully evaluated or fails [33]. If it is successfully evaluated, then $v_c = 1$, otherwise $v_c = 0$.
This metric's values have a limited range from zero to one, where one means the web page in question is totally accessible and all checkpoints that were evaluated in that web page have passed.

$$WIE(p) = \frac{\sum v_c}{n} \quad (25)$$

Equation 25 presents the formula for computing the WIE metric, where $WIE(p)$ is this metric's final score for a page $p$, $v_c$ is a variable that assumes 1 if a checkpoint $c$ passes and 0 otherwise, and $n$ is the number of checkpoints.

2.1.14 Conservative, strict and optimistic

Conservative, Strict and Optimistic are the three web accessibility metrics defined by Lopes, Gomes and Carriço in 2010 [5]. These metrics are based on the results of a checkpoint evaluation of an HTML element: PASS, FAIL or WARN. For each checkpoint, a PASS result indicates that an HTML document's compliance is verified; a FAIL result specifies that an HTML document's compliance is not verified; and a WARN result specifies that it is impossible to verify the HTML document's compliance. The main difference between these three metrics resides in the way they consider WARN results. They all contemplate the number of PASS results and the number of applicable elements to evaluate the accessibility results of an automatic accessibility evaluation tool. The conservative metric considers WARN results as failures, the optimistic metric considers them as passes, and the strict metric does not consider them at all.

These three metrics' scores range from zero to one, where one means the web page in question is totally accessible.

$$rate_{conservative} = \frac{passed}{applicable} \quad (26)$$

$$rate_{optimistic} = \frac{passed + warned}{applicable} \quad (27)$$

$$rate_{strict} = \frac{passed}{applicable - warned} \quad (28)$$

Equations 26, 27 and 28 present the formulas for computing the Conservative, Optimistic and Strict metrics, respectively, where $passed$ is the number of passes, $applicable$ is the number of applicable elements, and $warned$ is the number of warnings.
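These four page metrics reduce to a few ratios over pass, warning and applicability counts, as the sketch below shows; the warning handling follows the prose definitions above.

```python
def wie(outcomes):
    """Eq. 25: proportion of evaluated checkpoints that pass."""
    return sum(1 for o in outcomes if o == "pass") / len(outcomes)

def conservative_optimistic_strict(passed, warned, applicable):
    """Eqs. 26-28: warnings counted as failures (conservative), as passes
    (optimistic), or excluded from the denominator (strict)."""
    conservative = passed / applicable
    optimistic = (passed + warned) / applicable
    strict = passed / (applicable - warned) if applicable > warned else 1.0
    return conservative, optimistic, strict

print(wie(["pass", "fail", "pass", "warn"]))     # 0.5
print(conservative_optimistic_strict(6, 2, 10))  # (0.6, 0.8, 0.75)
```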
2.1.15 eXaminator

According to Benavidez [34], this metric classifies specific situations that can be positive or negative. eXaminator presents a quantitative index that measures the accessibility of a web page. It uses WCAG 2.0 as a reference. It has two different modes to calculate the qualifications:

• Standard: eXaminator applies all tests. Some of the tests identify errors, while others qualify good practices;
• Strict: eXaminator applies only the set of most secure tests, i.e., the tests that have fewer possibilities of creating false positives or false negatives.

The author considers that not all tests have the same importance, i.e., they need different weights. This means that it is necessary to first weight the tests to make sure their relative weight reflects their differences from each other. The weight calculation is reflected in Eq. 29: it is the multiplication of the Confidence of the test by its Value. Both Value and Confidence vary between 0 and 1, meaning that the weight will always be a value between 0 and 1. The Value variable depends on the WCAG conformance levels: level A: V = 0.9; level AA: V = 0.5; level AAA: V = 0.1. The Confidence variable verifies, for each test, what procedures are applicable when running it and, for each procedure that cannot be verified, the confidence decreases by 0.1.

eXaminator uses a matrix with information about each test, in particular the Element (E), Situation (S), Note (N), Tolerance (T) and Fraction (F). The Element identifies one or a set of HTML elements, and the test is only applied if the element is present in the web page or if the element is 'all'. The Situation identifies one or a set of HTML elements that fulfill a certain condition. The Note is the initial qualification of the test that was applied to the first detected situation. It is an absolute value that varies between 1 and 10, where 10 means the test classification result is excellent. The Tolerance is the error tolerance threshold, i.e., it indicates the maximum number of errors that are allowed to happen in a specific situation. If the number of errors exceeds the Tolerance, the final test classification decreases by 1 point. Finally, the Fraction variable represents the quantity of errors that decreases the initial note by 1.

The final score of a web page is the ratio between the sum of all tests and the sum of their respective weights. This metric's result uses a scale from 1 to 10, where 1 represents a very bad accessibility level and 10 means otherwise.

$$P = C \times V \quad (29)$$

Equation 29 presents the formula for computing the Test Weight, where $P$ is the final weight score, $C$ is the confidence of the test, and $V$ is the value of the test.

Afterwards, there are three different tests that can be applied: True/False tests, tests of proportional type and tests of decreasing type [34].
Table 1 Summary of the reviewed web accessibility metrics

Metric | Ref. | Year | Level | Score range
Failure Rate (FR) | [16] | 2000 | Page level | 0-1, where 0 is totally accessible
Unified Web Evaluation Methodology (UWEM 1.0) | [19] | 2006 | Page/website level | 0-1, where 0 is totally accessible
A3 | [22] | 2006 | Page level | 0-1, where 0 is totally accessible
Web Accessibility Barriers (WAB) by Hackett | [23] | 2004 | Website level | The higher the score, the worse the accessibility level
Web Accessibility Barriers (WAB) by Parmanto and Zeng | [14] | 2005 | Website level | The higher the score, the worse the accessibility level
Web Accessibility Evaluation Metric (WAEM) | [29] | 2017 | Website level | N/A
Web Accessibility Quantitative Metric (WAQM) | [18] | 2007 | Page level | 0-100, where 100 is totally accessible
Web Interaction Environments (WIE) | [33] | 2008 | Page level | 0-1, where 1 is totally accessible
Conservative | [5] | 2010 | Page level | 0-1, where 1 is totally accessible
Strict | [5] | 2010 | Page level | 0-1, where 1 is totally accessible
Optimistic | [5] | 2010 | Page level | 0-1, where 1 is totally accessible
Overall Accessibility Metric (OAM) | [25] | 2005 | Page level | The higher the score, the worse the accessibility level
Page Measure (PM) | [26] | 2007 | Page level | The higher the score, the worse the accessibility level
SAMBA | [27] | 2007 | Page level | N/A
Reliability Aware Web Accessibility Experience Metric (RA-WAEM) | [12] | 2018 | Website level | N/A
Barrier Impact Factor (BIF) | [35] | 2011 | Assistive technologies/disability types | The higher the score, the higher the impact of the barrier on an assistive technology/disability
Navigability and Listenability | [32] | 2005 | Page level | N/A
Web Accessibility Barrier Severity (WABS) | [15] | 2017 | Accessibility barriers | 0-1, where the closer the score is to 0, the less severe the barrier is
eXaminator | [34] | 2012 | Page level | 1-10, where 1 is totally inaccessible
• WAEM and RA-WAEM – require user experience evaluations by users with disabilities for obtaining the PUEXO pairs; and
• BIF – requires classification of the barriers (errors in BIF) by assistive technology.

In summary, from a data collection perspective, the data required by most metrics should be easily accessible from accessibility evaluations. The exception is data that requires human intervention, be it from experts that classify outcomes of evaluations, or from user tests.

A different type of issue, not related to data collection, is raised by eXaminator. Tests in eXaminator need the definition of multiple parameters. Even though this needs to be done only once, these parameters are not available when data is collected through different tools or methodologies.

In what concerns the complexity of computing the metrics, only one metric group stands out. WAEM and RA-WAEM, by running a vector optimization procedure, require more computing resources than the other metrics. The time complexity of the other metrics is linear. From these, more computing resources are required by A3 and WABS, for computing exponentiation and square root operations, respectively.

3 Comparing accessibility metrics

To compare a subset of the metrics presented in the previous section, we planned a study based on a large-scale evaluation of web pages.

3.1 Methodology

In this section we present the methodology followed in the study. We introduce the automated evaluation tool we used and the data set that was evaluated. We then describe what metrics we were able to compare, based on the constraints of running a large-scale study, which prevented us from
comparing metrics that rely on human judgment. We also describe how the metrics were implemented, based on the results provided by the used tool. We conclude this section with a description of how we analyzed the data.

3.1.1 QualWeb and evaluation data set

To run a study based on a large number of accessibility evaluations of web pages, the only viable option is to conduct automated accessibility assessments [36]. To run those evaluations, we used QualWeb² [37].

QualWeb is an automated web accessibility engine. It performs a set of tests on a web page that check conformance with ACT-Rules³ and WCAG Techniques 2.1⁴. For each web page evaluated, we extracted from the QualWeb report the number of elements that passed, the number of elements that failed, and the number of warnings for each test. We also collected information about the test being applicable or not to the web page. This information is useful since we want to consider only applicable tests when computing the metrics. When an applicable test passes, it means that it has no failures nor warnings. If an applicable test returns no errors but has at least one warning, the test outcome is "warning". If the test has at least one element that fails, the test fails.
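The mapping from element counts to a single test outcome can be expressed directly; the function below is our own restatement of the rules just described, not QualWeb's API.

```python
def test_outcome(failed: int, warnings: int, applicable: bool) -> str:
    """Collapse per-test element counts into one outcome, as described above."""
    if not applicable:
        return "inapplicable"
    if failed > 0:
        return "fail"
    if warnings > 0:
        return "warning"
    return "pass"

print(test_outcome(failed=0, warnings=1, applicable=True))  # "warning"
```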
As previously mentioned, QualWeb has two types of tests: ACT-Rules tests, which test a web page against a set of community-approved checks; and WCAG techniques, which test a web page against the tool developer's interpretation of specific WCAG techniques. To ensure that only checks that correspond to a consensual interpretation of the WCAG are used, and to increase the validity of the results, we used only the outcomes of the ACT-Rules tests in this study. In our study we used the 0.6.1 version of QualWeb, which tested a total of 72 ACT-Rules.

QualWeb was used to evaluate a total of 2,884,498 web pages. The pages were obtained from CommonCrawl⁵. CommonCrawl is an open corpus of web crawl data, including metadata and source of billions of web pages since 2013. We used the most recent crawls to obtain the URLs of the pages. The pages were evaluated in the period from March 2021 to September 2021. The 2,884,498 pages correspond to a total of 166,311 websites, averaging 21 pages per website. The distribution of pages and websites per top-level domain is presented in Table 2.

Table 2 Number of web pages by top-level domain

Top-level domain | Number of web pages
.asia | 322,208
.au | 174,371
.com | 166,388
.org | 154,154
.pt | 139,807
.gov | 130,689
.info | 125,233
.uk | 122,543
.es | 120,085
.it | 120,085
.fr | 113,513
.de | 109,135
.us | 106,432
.news | 105,196
.eu | 102,995
.net | 102,671
.br | 100,353
.edu | 93,956

The evaluation found a total of 86,644,426 errors, averaging 30 errors per page and 521 errors per website. The highest number of errors on a single web page was 15,645 and the lowest was zero. The highest number of errors on a website was 878,776 and the lowest zero. The ACT-Rules violated on most pages were: ACT-R76 (Text has enhanced contrast), ACT-R37 (Text has minimum contrast) and ACT-R12 (Link has accessible name). The ACT-Rule with the highest number of errors was ACT-R76 (Text has enhanced contrast), with 33,109,298 errors.

² http://qualweb.di.fc.ul.pt/evaluator/.
³ https://act-rules.github.io/rules/.
⁴ https://www.w3.org/TR/WCAG21/.
⁵ https://commoncrawl.org/.

3.1.2 Applicable metrics

The analysis of the accessibility metrics shows that not all metrics can be studied with this data set. For instance, metrics that require human judgment cannot be considered, since it is not viable to produce expert judgment over such a large set of pages. Therefore, we needed to identify the ones that could be computed with our data. From the 19 presented metrics, we found that 11 metrics could be computed with our dataset composed from the ACT-Rules evaluation results. Most of the applicable metrics use WCAG 1.0, which considers checkpoints rather than success criteria. Since we used WCAG 2.1 in our accessibility evaluation, when computing the accessibility metrics we will refer to the checkpoints as success criteria. Each ACT-Rule has corresponding success criteria. Each success criterion has an associated principle and conformance level. Through these success criteria, it was possible to define which principle(s) and conformance level(s) characterize each test. As one test can have more than one success criterion, it can also have more than one principle or priority level. This information is required by some of the metrics.
Table 3 Constants used to compute the WAQM metric

Constant | Description | Value
Nall | Number of all tests | 72
N | Number of applicable tests | 51
Np | Number of Perceivable tests | 28
No | Number of Operable tests | 11
Nu | Number of Understandable tests | 7
Nr | Number of Robust tests | 11
Npe | Number of automatic Perceivable tests | 26
Noe | Number of automatic Operable tests | 11
Nue | Number of automatic Understandable tests | 6
Nre | Number of automatic Robust tests | 10
Npw | Number of manual Perceivable tests | 5
Now | Number of manual Operable tests | 4
Nuw | Number of manual Understandable tests | 1
Nrw | Number of manual Robust tests | 1
a | Constant | 20
b | Constant | 0.3

The FR metric is the simplest metric to compute. It requires the number of potential and actual problems. For each page, the sum of all elements that failed a test and the sum of all elements applicable to the test are computed. Having both totals for all the tests, it is possible to calculate the failure rate of the page. It is important to highlight that some tests might evaluate the same elements of the page. However, they evaluate different aspects, and so they cannot be counted only once, since we are considering the total number of failures and not the total number of elements that failed. For the remaining accessibility metrics that utilize the number of potential and actual points of failure for success criteria, the same logic was applied.

The WAQM metric is the most complex to compute. It considers the priority level and its weight, the type of the test, i.e., if it is manual or automatic, and the principle(s) of each test. WAQM is computed for each test, and its computation relies on a number of parameters. Table 3 presents the parameters we used. Parameters a and b are constants because "the tuning was not necessary because WAQM proved to be independent of the tool when conducting large-scale evaluations (approx. 1400 pages)" [4]. The other parameters were tuned to the QualWeb tool and the ACT-Rules it tests.

It was also possible to compute the UWEM and A3 metrics for each web page, since they both rely on the FR of each checkpoint of that page. Since both metrics are computed using a weight that is obtained from user feedback, we had to determine this weight according to the priority levels, due to time and resource constraints. UWEM already calculates a score for each website, by calculating its web pages' average score. Besides applying the UWEM metric to websites as Vigo, et al. define [17], we decided to additionally use another procedure to convert this metric into a website metric, as will be described further on.

WIE, Conservative, Optimistic and Strict are four simple metrics that can be easily applied with our data, as they only require the number of applied success criteria, the number of elements, the number of warnings, the number of fails and the number of passes.

In what concerns metrics that are applicable to websites, instead of web pages, we considered WAB by Parmanto and Zeng and WAB by Hackett. The two WAB formulas were applied, as one requires the priority level and the other one the weight of the priority level. Both formulas also calculate the failure rate and consider the total number of web pages a website contains.

Other metrics like SAMBA or eXaminator were not considered, either because of the lack of information in our data or because the metric is semi-automated, which means that it needs manual intervention. For instance, the indexes that are computed in SAMBA concern the disability type, and eXaminator considers information about HTML elements that are evaluated in each page. Yet, we could partially use the WAEM/RA-WAEM metric. Since both WAEM and RA-WAEM require users' intervention, as they evaluate pairs of websites, i.e., PUEXO pairs, to be compared to the results of the weighted accessibility score computation (Eq. 16), we could only consider the weighted accessibility score (Eq. 15) that can be automatically computed. This score is used in the WAEM and RA-WAEM metrics' process to classify a website and to compare the results with user classifications, and it considers the number of pages a success criterion passes in that website.

We did not consider the OAM nor the Page Measure metrics, since they both consider the number of HTML attributes in their formulas. We do not have that information from the QualWeb reports. Also, we did not consider the two metrics by Fukuda et al. [32], since we do not have information regarding the aspects both formulas take into consideration (alt attributes, reaching time of a given element, page size).

We could not apply BIF since it needs a table that relates the errors that were identified by the accessibility evaluation tool with the assistive technologies that are affected by these errors. For this reason, it is not viable to attend to all the errors of our 2.8 million web pages sample and identify which assistive technologies are affected by them.

WABS was not considered since it classifies the accessibility barriers based on their severity, which means that it refers to the severity of a barrier that was identified in a set of websites and their respective web pages. Thus, this metric is focused on a specific problem that hinders the user's interaction. The final result of this metric would be a list of barriers that were found in our web pages data set and their respective severity scores. For this reason, it is not possible to correlate metrics that evaluate the accessibility of web pages with metrics that evaluate accessibility barriers.
The following list summarizes what metrics were analyzed in our study.

• Web page metrics:
  – Failure Rate (FR);
  – Unified Web Evaluation Methodology (UWEM);
  – A3;
  – Web Accessibility Quantitative Metric (WAQM);
  – Web Interaction Environments (WIE);
  – Conservative;
  – Optimistic;
  – Strict.

• Website metrics:
  – Web Accessibility Barriers (WAB) by Hackett;
  – Web Accessibility Barriers (WAB) by Parmanto & Zeng;
  – Web Accessibility Evaluation Metric (WAEM).

3.1.3 Metrics comparison and analysis

With our goal being to understand the similarities between different accessibility metrics, we computed their correlation pairs. We tested the normality of the data using the Shapiro-Wilk and Kolmogorov-Smirnov tests. We found that our data did not follow a normal distribution. Therefore, we used the Spearman correlation in our analysis. Following the recommendations from Statstutor [38], absolute correlation values above 0.4 represent moderate or stronger correlation. In our analysis we considered two metrics to have similar results when they are at least moderately correlated.

It is important to take into consideration the fact that some metrics are applicable to websites whereas others are applicable to web pages. To be able to compare all metrics, the web page metrics were converted to website metrics via two different approaches:

1. Computing the metric score based on the sum of the evaluation results of all the pages of the website; and
2. Calculating the average of the metric score for all web pages of the website, similar to the UWEM strategy.

Besides analyzing the pairwise similarity obtained from the correlation, we used this information to cluster the correlation scores and find whether groups of metrics present similar behaviors in our data set. For this analysis we used hierarchical clustering [39], as sketched in the example below.
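A toy sketch of this pipeline follows: the two page-to-website aggregation approaches, Spearman correlation, and hierarchical clustering on the correlation scores. The data shapes, the failure-rate example metric and the linkage settings are illustrative assumptions; the paper does not specify them.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import linkage, fcluster

fr = lambda page: page["failed"] / page["applicable"]  # example page metric

def site_score_sum(pages):
    """Approach 1: pool the evaluation results of all pages, then score once."""
    return fr({"failed": sum(p["failed"] for p in pages),
               "applicable": sum(p["applicable"] for p in pages)})

def site_score_mean(pages):
    """Approach 2: score each page, then average (the UWEM strategy)."""
    return float(np.mean([fr(p) for p in pages]))

pages = [{"failed": 5, "applicable": 50}, {"failed": 30, "applicable": 60}]
print(site_score_sum(pages), site_score_mean(pages))  # 0.318..., 0.3

# Pairwise Spearman correlations between metric score vectors, then
# hierarchical clustering of the metrics using 1 - |rho| as the distance.
scores = np.random.rand(4, 1000)    # 4 metrics x 1000 pages (toy data)
rho, _ = spearmanr(scores, axis=1)  # 4 x 4 correlation matrix
condensed = (1 - np.abs(rho))[np.triu_indices(4, k=1)]
groups = fcluster(linkage(condensed), t=0.6, criterion="distance")
print(groups)                       # cluster label per metric
```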
3.2 Results

This section presents the results of the metrics comparison study. We begin by presenting an overview of the outcomes of each metric in the full set of web pages evaluated. We then examine the similarity between metrics across different contexts: metrics over web pages, and metrics over websites, exploring both previously introduced ways to compute a website metric from web page metrics.

3.2.1 Descriptive statistics

Table 4 presents descriptive statistics of the scores for all metrics that are applicable at page level.

Table 4 Descriptive statistics for web page metrics

Metric | Average | Standard deviation | Best score | Worst score | First quartile | Third quartile
FR | 0.0673 | 0.0780 | 0 | 1 | 0.0201 | 0.0856
A3 | 0.6657 | 0.3203 | 0 | 1 | 0.4077 | 0.9443
UWEM | 0.3842 | 0.3212 | 0 | 0.9997 | 0.1010 | 0.800
WAQM | 82.8626 | 19.1529 | 100 | 0 | 76.3289 | 95.6360
WIE | 0.5545 | 0.1561 | 1 | 0 | 0.4375 | 0.6667
Conservative | 0.3936 | 0.2316 | 1 | 0 | 0.2018 | 0.5556
Optimistic | 0.6015 | 0.2314 | 1 | 0 | 0.4453 | 0.7852
Strict | 0.4973 | 0.2692 | 1 | 0 | 0.2658 | 0.7209

Regarding the descriptive statistics for web page metrics, we can observe some worthwhile points. The FR metric average indicates a very optimistic perspective on the accessibility of the evaluated web content. Additionally, the standard deviation is very small, indicating that the web pages' scores do not vary much from the average. Also, WAQM presents a positive perspective about the accessibility of the Web, as the average is approximately 83, which is close to the score that expresses the highest accessibility level. The UWEM metric is slightly positive concerning accessibility, having an average of 0.38, given the fact that the lower the score, the more accessible the web page is. The WIE, Optimistic and Strict metrics present an intermediate accessibility level average. In contrast, the A3 and Conservative metrics report a negative perspective about the accessibility of the evaluated web pages, as their values are closer to the inaccessible reference.
Table 5 Descriptive statistics for website metrics, adding the evaluation results of all website pages

Metric | Average | Standard deviation | Best score | Worst score | First quartile | Third quartile
FR | 0.0832 | 0.0836 | 0 | 1 | 0.0296 | 0.1080
A3 | 0.8390 | 0.2713 | 0 | 1 | 0.8301 | 1.0000
UWEM | 0.4728 | 0.3294 | 0 | 0.9997 | 0.1715 | 0.8131
WAQM | 79.4362 | 21.7414 | 100 | 0 | 71.7592 | 94.8497
WIE | 0.5176 | 0.1662 | 1 | 0.04 | 0.400 | 0.6250
Conservative | 0.4327 | 0.2191 | 1 | 0.0006 | 0.2640 | 0.5799
Optimistic | 0.6366 | 0.1994 | 1 | 0.0006 | 0.5065 | 0.7857
Strict | 0.5390 | 0.2410 | 1 | 0.0006 | 0.3515 | 0.7273
WAB-H | 0.4742 | 0.6927 | 0 | 5.8333 | 0.0263 | 0.6875
WAB-PZ | 0.3053 | 0.4799 | 0 | 4.2 | 0.0133 | 0.400
WAEM | 4.1765 | 1.3273 | 8.68 | 0.0072 | 3.3353 | 5.1446
Table 6 Descriptive statistics for website metrics, considering the average of the website pages' metric scores

Metric | Average | Standard deviation | Best score | Worst score | First quartile | Third quartile
FR | 0.0859 | 0.0849 | 0 | 1 | 0.0312 | 0.1111
A3 | 0.6744 | 0.3007 | 0 | 1 | 0.4569 | 0.9255
UWEM | 0.4433 | 0.3208 | 0 | 0.9997 | 0.1562 | 0.800
WAQM | 77.9130 | 21.3167 | 100 | 0 | 70.1056 | 93.1619
WIE | 0.5715 | 0.1562 | 1 | 0.0588 | 0.4645 | 0.6797
Conservative | 0.4457 | 0.2182 | 1 | 0.00055 | 0.2777 | 0.5955
Optimistic | 0.6470 | 0.1970 | 1 | 0.00055 | 0.5178 | 0.7968
Strict | 0.5527 | 0.2384 | 1 | 0.00055 | 0.3688 | 0.7402
WAB-H | 0.4742 | 0.6927 | 0 | 5.8333 | 0.0263 | 0.6875
WAB-PZ | 0.3053 | 0.4799 | 0 | 4.2 | 0.0133 | 0.400
WAEM | 4.1765 | 1.3273 | 8.68 | 0.0072 | 3.3353 | 5.1446
Tables 5 and 6 present descriptive statistics of the scores of all metrics at the website level. Table 5 shows results where scores for page-level metrics were calculated by adding the evaluation results for all pages of the same website. Table 6 results were calculated by averaging the page metric results for all pages of the same website.

In relation to the descriptive statistics for website metrics, some metrics reported consistent behavior, while for others some differences to the web page metrics could be observed. The FR average paints a very optimistic picture of the accessibility of the evaluated web content, as was already observed in this metric's web page version. Also, WAQM still presents a more positive perspective about the accessibility of the Web, as stated in the descriptive analysis of the web page metrics. The UWEM metric slightly increased its average when compared with the same metric applied to web pages. Nevertheless, it still provides a positive perspective about accessibility. The averages of the Optimistic and Strict website metrics also increased, yet they do not show a clear difference compared to the web page metrics. The average of the website scores using the WIE metric decreased when the evaluation results for all pages of the same website were considered, compared to this metric's web page version. The same did not happen when considering the average of the page metric results for all pages of the same website, which instead shows an increase in its average. Conservative, as a website metric, also has a similarly negative perspective about web accessibility compared to the same metric applied to web pages. The A3 metric for websites, considering the average of the website pages' scores, did not show a noticeable difference in the average result compared to the A3 metric for web pages (around 0.67). Nevertheless, a considerable difference between the two website-level approaches was detected for the A3 metric: treating a website as a single web page yields an average of approximately 0.84. Interestingly, the WAB-H and WAB-PZ metrics reveal differences in their averages that might be explained by their worst scores. WAB-H evaluated a website that had a score of 5.8333, which represents the most inaccessible website. Yet, the WAB-PZ worst score was 4.2, which indicates that the accessibility issues are weighted less than in WAB-H. Since WAB-PZ, WAB-H and WAEM do not provide a limited range of values, it is more complicated to infer the accessibility level from their scores.
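Each row of Tables 5 and 6 is a set of standard descriptive statistics over the per-website scores. For reference, this is how such figures are obtained (illustrative UWEM scores, not the study data):

    import numpy as np

    # Illustrative website-level UWEM scores (not the study data).
    uwem = np.array([0.12, 0.47, 0.83, 0.05, 0.66, 0.31])

    print("average:        ", round(uwem.mean(), 4))
    print("std deviation:  ", round(uwem.std(ddof=1), 4))
    print("best score:     ", uwem.min())   # for UWEM, lower is better
    print("worst score:    ", uwem.max())
    print("first quartile: ", np.percentile(uwem, 25))
    print("third quartile: ", np.percentile(uwem, 75))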
As can be observed from Table 1, different metrics have scores ranging from 0 to 1, others from 0 to 100, and yet others are unbounded. Bounded ranges make results easier to interpret, since a score can be compared with the limits of the range. For example, one intuitively expects that a UWEM score close to zero represents an accessible page, while a score close to one represents an inaccessible page. With unbounded ranges, since there is only one limit, this comparison is not always possible. For example, a WAB score close to zero represents an accessible page, but what about a score of 1? Or 5? The data collected in this study, and presented in Tables 4, 5 and 6, provides not only a reference for the unbounded ranges (the extreme value for WAB-H was 5.83, for WAB-PZ it was 4.2, and for WAEM it was 8.68) but also gives an indication of how metric scores are distributed. This allows us to interpret values from the metrics more precisely. For example, a UWEM score of 0.5 probably represents a web page that is less accessible than an A3 score of 0.5: the first quartile for A3 is 0.4077 (considering web page metrics), which is much closer to 0.5 than the first quartile for UWEM (0.101).

Table 8 Spearman correlation scores for website metrics (moderate, strong and very strong correlation scores are displayed in bold)

          WAEM      WAB-H     WAB-PZ
WAEM      1
WAB-H     −0.1183   1
WAB-PZ    −0.1850   0.9858    1

3.2.2 Web page metrics

WIE increases with the accessibility level of the web page. A3 shows an opposite behavior. Similarly to A3, UWEM shows a moderate negative correlation with WIE (ρ = −0.4963). This behavior might be explained by the fact that both UWEM and A3 consider the number of elements that failed, while WIE considers the number of success criteria that passed in a page. If a page that fails all the tested success criteria also fails one element per test, the number of failed elements will be similar to the number of failed success criteria.

Interestingly, no other pairs of metrics are correlated. This means that FR and WAQM are not correlated to any other metric.
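For reference, the coefficients reported here and in Tables 8, 9 and 10 are plain Spearman rank correlations between the vectors of scores that each pair of metrics assigns to the same pages or sites. A minimal sketch, with illustrative scores rather than the study data (spearmanr is the SciPy routine for the method described in [38]):

    from itertools import combinations
    from scipy.stats import spearmanr  # rank correlation, cf. [38]

    # Illustrative page-level scores for the same five pages (not our data).
    scores = {
        "FR":  [0.02, 0.10, 0.08, 0.01, 0.15],
        "A3":  [0.10, 0.90, 0.60, 0.05, 0.95],
        "WIE": [0.80, 0.30, 0.50, 0.90, 0.25],
    }

    for a, b in combinations(scores, 2):
        rho, p_value = spearmanr(scores[a], scores[b])
        print(f"{a} vs {b}: rho = {rho:+.4f} (p = {p_value:.3f})")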
Table 9 Spearman correlation scores for website metrics, considering a domain as a web page (moderate, strong and very strong correlation scores are displayed in bold)

               FR        A3        UWEM      WAQM      WIE       Conservative   Optimistic   Strict    WAB-H     WAB-PZ    WAEM
FR             1
A3             0.1779    1
UWEM           0.4442    0.4131    1
WAQM           −0.5423   −0.2140   −0.7285   1
WIE            −0.1154   −0.5942   −0.3649   0.2485    1
Conservative   −0.0188   −0.3528   −0.0967   0.0333    0.4838    1
Optimistic     −0.1336   −0.3519   −0.1329   0.0775    0.4523    0.8759         1
Strict         −0.0984   −0.3730   −0.1283   0.0673    0.4898    0.9573         0.9704       1
WAB-H          0.4217    −0.2910   0.5698    −0.5675   −0.0233   −0.0024        −0.0556      −0.0378   1
WAB-PZ         0.4071    −0.2222   0.6457    −0.6249   −0.0525   −0.0098        −0.0591      −0.0434   0.9858    1
WAEM           −0.3125   −0.5252   −0.5057   0.4902    0.6566    0.2870         0.2764       0.3044    −0.1183   −0.1850   1
Table 10 Spearman correlation scores for website metrics, considering the average of the web pages' scores (moderate, strong and very strong correlation scores are displayed in bold)

               FR        A3        UWEM      WAQM      WIE       Conservative   Optimistic   Strict    WAB-H     WAB-PZ    WAEM
FR             1
A3             0.4568    1
UWEM           0.4914    0.8612    1
WAQM           −0.5606   −0.7167   −0.7917   1
WIE            −0.2018   −0.6053   −0.4411   0.3914    1
Conservative   0.0014    −0.3129   −0.0916   0.0177    0.5604    1
Optimistic     −0.1283   −0.3499   −0.1392   0.0748    0.5253    0.8740         1
Strict         −0.0910   −0.3567   −0.1332   0.0630    0.5685    0.9536         0.9718       1
WAB-H          0.4083    0.5310    0.6161    −0.5036   −0.2607   −0.0262        −0.0813      −0.0646   1
WAB-PZ         0.3992    0.5957    0.6846    −0.5665   −0.2798   −0.0291        −0.0816      −0.0666   0.9858    1
WAEM           −0.3390   −0.5563   −0.5077   0.5369    0.5968    0.2577         0.2583       0.2839    −0.1183   −0.1850   1
Table 9 presents the Spearman correlation coefficients for all metrics, with the metrics being computed by considering the domain as a page with the evaluation results of all the pages of the domain.

As would be expected, and matching the web page correlation scores, the Conservative metric has a very strong and positive correlation with Optimistic and Strict: ρ = 0.8759 and ρ = 0.9573, respectively. Also, as observed in the web page scores, the Strict and Optimistic metrics still have the same strong positive correlation (ρ = 0.9704). WAEM appears to have a strong positive correlation with WIE (ρ = 0.6566). A3 has a moderate positive correlation with UWEM (ρ = 0.4131) and is negatively correlated with WAEM (ρ = −0.5252) and WIE (ρ = −0.5942). UWEM has a moderate negative correlation with WAEM (ρ = −0.5057), while WAQM has a moderate positive correlation with WAEM (ρ = 0.4902). UWEM shares a strong negative correlation with WAQM (ρ = −0.7285) and positive correlations with both WAB-PZ (ρ = 0.6457) and WAB-H (ρ = 0.5698). In contrast to UWEM, WAQM has negative correlations with WAB-PZ (ρ = −0.6249) and WAB-H (ρ = −0.5675). FR shows a moderate positive correlation with UWEM (ρ = 0.4442), with WAB-H (ρ = 0.4217) and with WAB-PZ (ρ = 0.4071). It presents a moderate negative correlation with WAQM (ρ = −0.5423).

Average of the web pages' scores

Table 10 presents the Spearman correlation coefficients for all metrics, with the domain metric being computed by averaging the metrics of the pages belonging to the domain.

As expected, the Conservative metric has a very strong and positive correlation with Optimistic and Strict: ρ = 0.8740 and ρ = 0.9536, respectively. These values are very similar to the ones obtained when considering a domain as a web page. Also, as observed in the web page scores,
Fig. 2 Clusters of website metrics, interpreting a website as a web page

Fig. 3 Clusters of website metrics, calculating the average of the web pages' scores
modifying the metrics' behavior. The FR is still distant from the remaining metrics. WAQM and UWEM calculate the failure rate, which might justify the cluster they are grouped in. WAB-PZ and WAB-H had a very strong correlation, so it was expected that they would be part of the same cluster. They share similar formulas that only differ in the way they consider the success criterion weight: as the priority level or as the weight of the priority level. WIE may relate with WAEM, since they both consider when a checkpoint passes: WIE increments one every time a certain checkpoint passes on a website, while WAEM counts the number of pages where a checkpoint passes. For instance, if a certain checkpoint passes on a website that only has one web page, it will count as one for WIE and also for WAEM. The main difference between them is that WIE considers the total number of checkpoints, while WAEM not only considers the number of website pages but also the weight of the checkpoint.
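The difference in how WIE and WAEM count passing checkpoints can be made concrete with a toy sketch; the weights and the exact normalization below are illustrative assumptions, not the metrics' published formulas:

    # Toy data: for each checkpoint, the set of pages where it passes.
    passes = {
        "c1": {"p1", "p2"},  # passes on both pages
        "c2": {"p1"},        # passes on one page only
        "c3": set(),         # never passes
    }
    n_pages = 2
    weight = {"c1": 1.0, "c2": 2.0, "c3": 3.0}  # assumed checkpoint weights

    # WIE-style count: one increment per checkpoint that passes somewhere,
    # normalized by the total number of checkpoints.
    wie_like = sum(1 for ok in passes.values() if ok) / len(passes)

    # WAEM-style count: per checkpoint, the fraction of pages where it
    # passes, weighted by the checkpoint's weight.
    waem_like = sum(w * len(passes[c]) / n_pages for c, w in weight.items())

    print(f"WIE-like: {wie_like:.3f}  WAEM-like: {waem_like:.3f}")

On a website with a single page, both counts register a passing checkpoint exactly once, matching the example above; they diverge as soon as page counts and checkpoint weights differ.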
Regarding the average of the web pages' scores, we identified the following six clusters (Fig. 3; a sketch of the clustering procedure follows the list):

• Conservative, Strict and Optimistic;
• FR;
• A3, UWEM and WAQM;
• WIE;
• WAEM;
• WAB-H and WAB-PZ.
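The clusters in Figs. 2 and 3 come from hierarchical clustering [39] over the correlation results. A minimal sketch of the general technique, using made-up correlation values, an assumed average-linkage setting, and an arbitrary 0.5 distance threshold:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    metrics = ["Conservative", "Strict", "Optimistic", "WAB-H", "WAB-PZ"]
    # Made-up absolute Spearman correlations (not the study's values).
    corr = np.array([
        [1.00, 0.95, 0.87, 0.03, 0.03],
        [0.95, 1.00, 0.97, 0.06, 0.07],
        [0.87, 0.97, 1.00, 0.08, 0.08],
        [0.03, 0.06, 0.08, 1.00, 0.99],
        [0.03, 0.07, 0.08, 0.99, 1.00],
    ])

    # Highly correlated metrics should be close: use 1 - |rho| as distance.
    distance = 1.0 - corr
    linked = linkage(squareform(distance), method="average")

    # Cut the dendrogram at a distance threshold to obtain flat clusters.
    for name, label in zip(metrics, fcluster(linked, t=0.5, criterion="distance")):
        print(f"{name:>12} -> cluster {label}")

With these toy values, Conservative, Strict and Optimistic fall into one cluster and WAB-H and WAB-PZ into another, mirroring the grouping pattern reported in the text.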
The above groups show that the Conservative, Strict and Optimistic metrics belong to the same cluster, as seen before. However, there are two main differences when comparing with the other website metrics approach: (1) A3 is now part of the UWEM and WAQM cluster; (2) WAEM and WIE do not belong to the same cluster, as their cluster distance is higher than the previously defined threshold.

Another interesting aspect is the fact that Conservative, Optimistic and Strict are always in the same independent group, in all the three approaches we have mentioned. This is perhaps because the number of warnings in the considered domains and web pages is not significant enough to change these metrics' results, since the only difference between these three metrics' formulas is the way warnings are considered.

In all metrics' clusters, the FR metric does not form a group with any other metric, even though some of them incorporate the failure rate in their formulas, indicating that their results do not correlate with the FR metric results. Thus, we can recognize that the metrics that integrate the FR consider other important information in their scope that makes them different from the FR. For instance, WAQM is more complex than FR, taking into account the principles, the type and the priority levels of success criteria. For this reason, when opting for one of these metrics, FR seems a more straightforward and easy choice, but it should be kept in mind that the other metrics may give more relevant information.

4 Metric validity

Our analysis allows detecting what metrics produce similar outcomes, but it does not reflect the validity of the outcomes. This is a result of the fact that the metrics were computed from a set of automated evaluation results. Automated evaluation tools are only capable of identifying a subset of the real accessibility problems in a web page. Therefore, a page that gets a good outcome on an automated accessibility evaluation might have undetected accessibility problems. This means that a metric computed on that evaluation might indicate an accessibility level that is better than reality.
Table 12 Web page metrics scores for assessing the metrics' validity (scores that do not reflect the accessibility of the web page are displayed in bold)

Columns: FR, A3, UWEM, WIE, Conservative, Optimistic, Strict, WAQM

Accessible
https://wsnet2.colostate.edu/cwis24/acns/web-accessibility/Example  0.00218  0  0  0.875  0.8297  0.9454  0.93827  98.823
https://www.w3.org/WAI/demos/bad/after/home.html  0.01954  0.02192  0.0057  0.6  0.3909  0.70684  0.5714  89.005
https://www.washington.edu/accesscomputing/AU/after.html  0.00998  0  0  0.7333  0.525  0.6367  0.5910  90.667
https://www.w3.org/WAI/demos/bad/after/template.html  0.0185  0.0192  0.0057  0.6  0.384  0.7159  0.5746  89.005

Inaccessible
https://wsnet2.colostate.edu/cwis24/acns/web-accessibility/Example/index-inaccessible.html  0.271  0.997  0.967  0.308  0.1050  0.2514  0.1230  33.2218
https://www.w3.org/WAI/demos/bad/before/home.html  0.1453  0.9998  0.939  0.267  0.1738  0.444  0.238  38.6422
https://www.washington.edu/accesscomputing/AU/before.html  0.0704  0.995  0.915  0.6154  0.7605  0.901  0.885  51.7320
https://www.w3.org/WAI/demos/bad/before/template.html  0.1448  0.999  0.9518  0.2667  0.1785  0.468  0.251  35.420
4.1 Methodology

To investigate what metrics might better reflect the actual accessibility level of web pages, we conducted a further analysis. Since it was not feasible to conduct manual assessments of the accessibility of the large data set, we compiled a small data set composed of web pages that are published online with the purpose of demonstrating good and bad accessibility practices. Table 11 presents the web pages we considered in our analysis. Two pairs of web pages were developed by universities, while the other two pairs are part of the Before and After Demonstration (BAD)6 published by the W3C Web Accessibility Initiative. They have been created mostly for educational purposes, providing instructors with ready-to-access examples of how web content should be designed to be accessible, but also how it could be designed in an inaccessible way. One limitation common to all the pages is their age. All of them were developed prior to the publication of WCAG 2.1. Therefore, some of the criteria introduced in WCAG 2.1 will not have been explicitly explored in
these examples. In our analysis, we computed the metrics' outcome for all the pages in Table 11, and analyzed the accessibility level they reported the pages to have.

6 https://www.w3.org/WAI/demos/bad/Overview.html.

4.2 Results

Table 12 presents the scores of each page-level metric for each of the pages used to assess the validity of the metrics. Table 13 presents the scores for website metrics. To avoid the canceling effect that a website with the same number of accessible and inaccessible pages would have on these metrics, we split each website into two websites: a good website with the accessible pages and a bad website with the inaccessible pages.

Table 13 Website metrics scores for assessing the metrics' validity (scores that do not reflect the accessibility of the website are displayed in bold)

                       WAEM    WAB-PZ   WAB-H
Accessible domains
wsnet2.colostate.edu   11.06   0.0008   0.007
www.w3.org             4.430   0.001    0.006
www.washington.edu     8.860   0.001    0.011
Inaccessible domains
wsnet2.colostate.edu   5.440   0.83     1.10
www.w3.org             1.920   0.406    0.521
www.washington.edu     3.840   0.811    1.042

4.2.1 Web page metrics

The results of this experiment show that the FR metric produces similar scores when evaluating accessible and inaccessible web pages. Since 1 means that the web page is completely inaccessible and 0 means otherwise, we were expecting values close to 1 for the inaccessible web pages and close to 0 for the accessible web pages. The FR scores for all the accessible web pages seem to be coherent and close to 0. However, all the inaccessible web pages also have low values, indicating a positive accessibility level.

WIE, Conservative, Optimistic and Strict metrics exhibit a score close to 1 for the same inaccessible web page, which means that this page is close to being completely accessible. Also, these metrics' scores for this particular inaccessible page are higher than some accessible pages' scores. This means that the inaccessible page is more accessible than some accessible pages, according to WIE, Conservative, Optimistic and Strict metrics' results. The remaining scores for these metrics seem to be coherent, except for the Conservative metric, which classifies two accessible web pages as inaccessible by showing scores close to some of the inaccessible pages' scores.

A3, UWEM and WAQM are the only three metrics that demonstrated coherent scores for all accessible and inaccessible web pages. The WAQM metric shows values close to 100 for all the accessible web pages. For the inaccessible pages, this metric varies from around 33 to 51, which is not close to 0, but still lower than the scores of the accessible pages. A3 and UWEM exhibit the correct behavior, as all the accessible pages' scores are close to 0 and the inaccessible pages' scores are close to 1. Nevertheless, A3 metric scores for inaccessible web pages are closer to 1 compared to UWEM metric scores for the same web pages.

4.2.2 Website metrics

Since the three website accessibility metrics do not have a range of scores limited by two values, the level of accessibility of a certain website becomes uncertain. The main conclusion we can take from the WAB metric is that the higher the score, the more inaccessible the website is. Nevertheless, it is also possible to detect that the accessible domains' scores are really close to 0, which indicates the domains are more accessible. In addition, and in contrast to the accessible domains, the inaccessible domains' scores are higher and close to 1.

By observing Tables 5 or 6, the accessible scores are in the first quartile, while the inaccessible scores belong to the third quartile, indicating that WAB-PZ and WAB-H may have an appropriate representation of the accessibility. Nevertheless, the authors of these two WAB metrics [14, 23] performed a study [40] where they refer to the meaning of the WAB scores' accessibility level. For instance, for those websites with a WAB score of 5.5 or less, the website is more accessible as it "has better conformance to the WCAG" [40]. Therefore, WAB scores higher than 5.5 indicate more accessibility barriers and so a worse accessibility level. Comparing this information to the scores obtained in our study, we can see that all websites' WAB scores vary between 0.0069 and 1.0998. These scores indicate that all websites (including the inaccessible domains) tend to have a small number of accessibility barriers.

As for the WAEM, the higher this metric's score, the more accessible the website is. Consequently, we cannot define whether a website is accessible or inaccessible. This metric seems to be the only one with incoherent results, as the www.w3.org accessible domain presents a lower score compared to the wsnet2.colostate.edu inaccessible domain.

4.3 Discussion

To define which metric is the most suitable option, it is important to analyze their results regarding the accessible and inaccessible web pages' and domains' evaluations.

FR seems to have coherent scores for all the accessible web pages. This means that all the accessible web pages have expected results. However, all the inaccessible web pages have low scores, indicating that these web pages are accessible when they are not.

WIE, Conservative, Strict and the Optimistic metrics always fail to assess the accessibility level of the inaccessible web page https://www.washington.edu/accesscomputing/AU/before.html, showing high scores that indicate the web page is accessible. Also, Conservative assigns a score below 0.5 to the https://www.w3.org/WAI/demos/bad/after/home.html accessible web page, which means that this page is not accessible.

Regarding the website metrics, it was possible to state that WAB-PZ and WAB-H seem to have an optimistic
behavior for inaccessible pages, considering these metrics' ranges defined by Hackett and Parmanto [40].

A3, UWEM and WAQM seem to have the expected behavior. Interestingly, these three metrics form a cluster in the analysis of domain accessibility based on the average of the scores of the web pages belonging to the domain. Hence, whenever there is the need to measure the accessibility level of a website, one of these three metrics can be considered, as they are all correlated in this specific approach. Regardless of the available resources, the UWEM metric is the least resource intensive.

Nevertheless, if we look deeper into the validity analysis results of each of these metrics, we can detect two important aspects that clarify which metric seems to have the best performance: (1) WAQM metric scores vary between 0 and 100, where 0 means the resource is totally inaccessible, and the scores of the inaccessible pages are not close to 0. Instead, they are above 33, which indicates that this metric is not that discriminating regarding those inaccessible web pages; (2) UWEM and A3 both have inaccessible and accessible scores close to 1 and 0, respectively, which indicates a correct and consistent behavior. Still, A3 metric scores for inaccessible pages can be more discriminating compared to UWEM, as they are all closer to 1.

In conclusion, although UWEM is less resource intensive, A3 provides more discriminating scores, being the most valid metric in this study.

5 Limitations

We acknowledge that the accessibility evaluation reports are the result of an automated tool and that this type of tool is limited in the scope of the accessibility problems it can test [36, 41]. Given that our main objective is to compare web accessibility metrics, and that all metrics compared were applied to the same dataset, we believe the impact of this limitation is not significant. However, for the part of the study that checks the validity of the metrics, this limitation can be significant, since it is probable that several accessibility problems in the web pages have not been identified.

Furthermore, in what relates to the validity study, we acknowledge the sample size is limited and the results presented are only indicative. A study with further web pages is needed to assess the generalizability of these findings.

6 Conclusion

We studied how eleven web accessibility metrics correlate with each other. The studied web accessibility metrics included FR, A3, UWEM, WAQM, WIE, Conservative, Optimistic, Strict, WAB-PZ, WAB-H and WAEM.

By analyzing the pairwise correlations we were able to identify groups of metrics. When considering the subset of metrics that are applicable at page level, we identified four clusters of distinct metrics. By looking at the full set of metrics applicable at site level, we identified a larger number of groups. This information is relevant when a decision between using one metric over another is needed. By knowing that the outcomes of two metrics are similar, it becomes possible to choose the one that is less resource intensive, or from which it is easier to obtain the data required to compute the metric, for instance.

Additionally, we ran an experiment with a small number of web pages with known levels of accessibility to assess the validity of the different metrics. Even though the set of pages was small, and the metrics were computed from the outcomes of an automated tool (i.e., not all accessibility problems were caught), we were able to identify which metrics were consistent with the expected levels of accessibility of the pages, and which were not. This information can also be relevant in assisting in deciding which metrics to employ.

Acknowledgements This work was supported by FCT through the LASIGE Research Unit, ref. UIDB/00408/2020 and ref. UIDP/00408/2020.

Funding The research leading to these results received funding from FCT through the LASIGE Research Unit under Grant Agreements ref. UIDB/00408/2020 and ref. UIDP/00408/2020.

Data availability statement The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1296843.1296853
28. Brajnik, G.: Web accessibility testing: When the method is the culprit. In: Miesenberger, K., Klaus, J., Zagler, W.L., Karshmer, A.I. (eds.) Computers Helping People with Special Needs, pp. 156–163. Springer, Berlin, Heidelberg (2006)
29. Song, S., Wang, C., Li, L., Yu, Z., Lin, X., Bu, J.: WAEM: A web accessibility evaluation metric based on partial user experience order. In: Proceedings of the 14th Web for All Conference on The Future of Accessible Work. W4A '17. Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3058555.3058576
30. Battistelli, M., Mirri, S., Muratori, L.A., Salomoni, P.: Measuring Accessibility Barriers on Large Scale Sets of Pages. (2011). Accessed 28 December 2021. https://www.w3.org/WAI/RD/2011/metrics/paper2/
31. Vigo, M., Abascal, J., Aizpurua, A., Arrue, M.: Attaining Metric Validity and Reliability with the Web Accessibility Quantitative Metric. (2011). Accessed 28 December 2021. https://www.w3.org/WAI/RD/2011/metrics/paper6/
32. Fukuda, K., Saito, S., Takagi, H., Asakawa, C.: Proposing new metrics to evaluate web usability for the blind. In: CHI '05 Extended Abstracts on Human Factors in Computing Systems, pp. 1387–1390. Association for Computing Machinery, New York, NY, USA (2005). https://doi.org/10.1145/1056808.1056923
33. Lopes, R., Carriço, L.: The impact of accessibility assessment in macro scale universal usability studies of the web. In: Proceedings of the 2008 International Cross-Disciplinary Conference on Web Accessibility (W4A), pp. 5–14. Association for Computing Machinery, New York, NY, USA (2008). https://doi.org/10.1145/1368044.1368048
34. Benavidez, C.: Libro Blanco de eXaminator (2012)
35. Mirri, S., Muratori, L.A., Salomoni, P.: Monitoring accessibility: Large scale evaluations at a geopolitical level. In: The Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility. ASSETS '11, pp. 163–170. Association for Computing Machinery, New York, NY, USA (2011). https://doi.org/10.1145/2049536.2049566
36. Lazar, J., Goldstein, D., Taylor, A.: Ensuring Digital Accessibility Through Process and Policy. Morgan Kaufmann (2015)
37. Fernandes, N., Costa, D., Neves, S., Duarte, C., Carriço, L.: Evaluating the accessibility of rich internet applications. In: Proceedings of the International Cross-Disciplinary Conference on Web Accessibility. W4A '12. Association for Computing Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2207016.2207019
38. Statstutor: Spearman's Correlation. (2021). Accessed 28 December 2021. https://www.statstutor.ac.uk/resources/uploaded/spearmans.pdf
39. Wikipedia: Hierarchical Clustering. (2021). Accessed 28 December 2021. https://en.wikipedia.org/wiki/Hierarchical_clustering
40. Hackett, S., Parmanto, B.: Homepage not enough when evaluating web site accessibility. Internet Research (2009)
41. Abascal, J., Arrue, M., Valencia, X.: Tools for web accessibility evaluation. In: Yesilada, Y., Harper, S. (eds.) Web Accessibility: A Foundation for Research, pp. 479–503. Springer, London (2019). https://doi.org/10.1007/978-1-4471-7440-0_26

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.