Evaluating the use of artificial intelligence and big data in policy making
Background
Given the widespread diffusion of artificial intelligence/big data (AI/BD),1 recently conceptualized as 'the newest system technology' and compared in magnitude with the introduction of the steam engine, electricity, and the computer (WRR, 2021), it is important that evaluators address the question of how to evaluate the expected, unexpected, and adverse effects of the use of AI/BD when designing, developing, and implementing interventions (of any kind): policies, programmes, regulation, therapies, drugs, and others.
Indicators of the societal role AI and BD play are manifold. They include mobile health and the Quantified Self movement, legal scholars and practitioners using machine learning to analyze and draft legal texts, law enforcement agencies predicting crime rates and patterns, medical professionals diagnosing diseases and developing therapies and drugs, and policymakers designing and implementing programmes and policies (Dwivedi et al., 2021; Leeuw, 2021; Rajkomar et al., 2019; York & Bamberger, 2020; Zuiderwijk et al., 2021). Added to this are the recently introduced artificial intelligence chatbots, such as ChatGPT, released by big tech companies.
At the same time, it is known that there are various problematic, complicated, and probably adverse effects of living in a 'society of algorithms' (Burrell & Fourcade, 2021). Examples are the bias problem (data collection only or largely includes subjects using apps, smartphones, or desktops), the legacy problem (working with 'old', sometimes biased data), the big data hubris problem (the belief that BD is a substitute for everything else, making theories and causal analysis obsolete), validity issues (Lazer et al., 2021), and the lack of transparency and explainability of what is happening when AI/BD are applied in developing and implementing interventions. Often this lack of transparency is related to the black-box problem of AI/BD, the focus of this essay.
Evaluating black boxes is familiar territory for evaluators; they attempt to uncover and test assumptions about contexts, mechanisms, and outcomes of activities (Astbury & Leeuw, 2010; Leeuw, 2020; Lemire et al., 2020; Nielsen et al., 2021; Pawson, 2013). However, rarely have realist, theory-driven evaluators applied this approach to black boxes when BD/AI are involved in domains including health, law enforcement, labor, and education, among others.2 This essay attempts to outline an approach for doing that.
• Although AI-based tools are often clear about the input (the question or data the AI tool starts with), and usually also about the output (the answer), it often remains unclear how the input is turned into the output and what the respective roles of algorithms and (big) data are. As Price and Rai (2021) mention, 'even when human field experts are given full access to the learning algorithm, training data, training process, and resulting model, the models can be difficult to parse because they are often complex and nonintuitive' (p. 779). They distinguish between two related layers of what is called opacity: 'the opacity of the system being studied, and the opacity of the research tool (machine learning) being deployed to study it' (p. 778).
• Another characteristic, next to opacity and the related lack of explainability of AI/BD, is plasticity: the algorithms change in response to new data. Price (2018) mentions, for the field of medicine, that 'this form of frequent updating is relatively common in software but relatively rare in the context of other medical interventions, such as drugs, that are identified, verified, and then used for years, decades, or even centuries' (p. 1). A minimal code sketch of plasticity follows this list.
• The third characteristic is that the different actors engaged in using AI/BD in organizations like hospitals, governments, and companies often have different levels of practical expertise and experience. Some clinicians, policymakers, administrators, or managers are more 'into' AI and BD than others. Probably this also applies to auditors, inspectors, and other oversight officials. Ranerup and Henriksen (2020, p. 1) studied the introduction of robotic process automation (RPA) into the world of (governmental) social services and what it did to the civil servants. Apart from positive effects, they find 'that a human–technology hybrid actor [RPA] redefines social assistance practices. Simplifications are needed to unpack the automated decision-making process because of the technological and theoretical complexities'.
Black boxes are not a 'given'; they can be unpacked and made into white boxes, which then need (external) validation (or testing). The question is: Does working
Step 1
The first step is to specify the goals or the contributions to be achieved when
applying AI/BD in designing, developing, and implementing interventions (see
Table 9.1).
Step 2
This step concerns the identification of assumptions that underlie the processes
when interventions are designed, developed, and implemented using AI/BD.
Assumptions are sometimes ‘hidden’ (Pawson, 2008), which means that they
have to be articulated. Bennett Moses and Chan (2018) did that for predictive
policing, Mitchell et al. (2021) for ‘algorithmic fairness’, Kempeneer (2021) for
the ‘big data state of mind’, and Domingos (2015) for theory-families (‘tribes’)
existing in AI.4 This step tells us at a minimum that, contrary to the big data hubris claim, theories are important, and they can and will differ (as do the criteria that can be used to 'judge' or 'test' them). Put differently: AI/BD black boxes are full of (often implicit) theories and assumptions (see Figure 9.1).
Figure 9.1 Framework for specifying dimensions of AI/BD and their use. The framework places [evaluating the] use of BD/AI in policy/programs/interventions at its centre and distinguishes three circles of assumptions: (1) assumptions regarding the technical characteristics/components of big data/analytics/AI/types of machine learning; (2) assumptions regarding contexts and their characteristics: accountability, explainability, transparency, privacy, security, fairness and trust of BD/AI, and risks of biases; (3) assumptions regarding the working/impact of BD/AI on behavior/decision making, including its mechanisms and outcomes/consequences.
Circle 1 includes assumptions on the type, quality, and relevance of data and its data ecosystem, including types of machine learning (like reinforcement learning, deep learning, and so on). Examples of these types of assumptions include the following.
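As a concrete illustration (a hedged sketch, not taken from the chapter): the bias and legacy problems mentioned in the Background translate into checkable Circle 1 assumptions about missingness, representativeness, and data age. The file name, column names, and population benchmarks below are hypothetical.

```python
# Hedged sketch: probing Circle 1 assumptions about the data. The file,
# column names, and population benchmarks are hypothetical.
import pandas as pd

df = pd.read_csv("training_data.csv")  # assumed input for the AI/BD tool

# Bias problem: heavy missingness can signal that records exist mainly
# for subjects using apps, smartphones, or desktops.
print(df.isna().mean().sort_values(ascending=False).head())

# Representativeness: compare group shares in the data with known
# population benchmarks (shares invented for illustration).
benchmarks = {"age_65_plus": 0.20, "rural": 0.25}  # assumed 0/1 columns
for group, expected in benchmarks.items():
    print(f"{group}: observed {df[group].mean():.2f} vs population {expected:.2f}")

# Legacy problem: how old are the data the algorithm learns from?
print("newest record:", pd.to_datetime(df["record_date"]).max())  # assumed column
```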
Circle 3 looks 'inside' the operations when algorithms and big data are used and addresses assumptions underlying these operations. They include assumptions on the actors who work with AI/BD, their behavior, and how they 'deal' with the challenges attached to issues like opacity and plasticity: Who are the actors? What are their perspectives on, knowledge of, and attitudes toward AI/BD? Which stakes are involved? The evaluator also has to look for indicators of impact/behavioral and social consequences, as well as the costs and benefits.6 Pedersen and Johansen (2020, p. 520 ff) introduced the concept of behavioral AI (BAI). Studying BAI is believed to open up the 'link between AI-behavior and AI-inference by describing how to study AI behavior'.7 Of particular importance in BAI are:
Step 3
The assumptions that have been surfaced are now a step closer to a white box, but that does not guarantee validity or truth; they need, as is always the case with (small-t and capital-T) theories, to be tested. Questions from a realist evaluator's perspective include:
Step 4
Assessing the validity of the (articulated) assumptions can be done first by using existing empirical evidence from (interdisciplinary) research in which similar or look-alike AI/BD tools and cases are investigated. These studies can be found in the social and behavioral sciences (like Bennett Moses & Chan, 2018), in behavioral computer sciences, and in computational social sciences and studies dealing with machine–human interactions (Lazer et al., 2021; Bowser et al., 2021). They can help by transferring that knowledge to one or more look-alike AI/BD cases to help make predictions about the probable validity of the AI/BD used. Sometimes this approach is called 'subsuming interventions or cases under general theories' (Leeuw, 2012; Pawson, 2002a, 2002b) or framed by Foy et al. (2011, p. 454) as 'generalization through theory'.
Step 5
Price (2018, p. 2) distinguishes three activities for validating black-box algorithms.
• Step 5.1. The first activity is 'ensuring that algorithms are developed according to well-vetted techniques and trained on high-quality data' (p. 2).
• Step 5.2. The second concerns reliability: 'demonstrating that an algorithm reliably finds patterns in data. This type of validation depends on what the algorithm is trying to do. Some algorithms are trained to measure what we already know about the world, just more quickly, cheaply, or accurately than current methods… Showing that this type of algorithm performs at the desired level is relatively straightforward… Other algorithms optimize based purely on patient data and self-feedback without developers providing a "correct" answer, such as an insulin pump programme that measures patient response to insulin and self-adjusts over time. This type of algorithm cannot be validated with test datasets' (p. 2).
• Step 5.3. The third activity 'applies to all sorts of black-box algorithms: they should be continuously validated by tracking successes and failures as they are actually implemented in health-care settings' (Price, 2018, p. 2). For 'performance' one can also read: the impact or effects of the AI-based intervention when dealing with patients/clients in real life. Park and Han (2018, pp. 806–807) add this: 'With a computerized decision-support system such as artificial intelligence, not only its technical analytic capability but also the way in which the computerized results are presented to, interpreted by, and acted on by human practitioners in the clinical workflow could affect the ultimate usefulness of the computerized algorithm'. They suggest using randomized controlled trials to sort this out, but they are also open to other designs. Vijayakumar and Cheung (2021) add that checking the replicability of machine-learning results is part of this validation work.
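To illustrate, a minimal Python sketch follows (assuming scikit-learn is available; the dataset, model, metric, and the 0.90 threshold are illustrative choices, not part of Price's argument). It first performs the test-set validation of Step 5.2 and then mimics the continuous tracking of Step 5.3 with a simple log of predictions against observed outcomes.

```python
# Minimal sketch, assuming scikit-learn. Dataset, model, metric, and the
# 0.90 threshold are illustrative assumptions chosen for brevity.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 5.2: hold out labelled data the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Compare test-set performance with a level agreed on in advance.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
THRESHOLD = 0.90  # the 'desired level', to be fixed by domain experts beforehand
print(f"Test AUC {auc:.3f}: {'passes' if auc >= THRESHOLD else 'fails'} the threshold")

# Step 5.3: once deployed, keep tracking successes and failures over time.
# Here the first 50 test cases stand in for cases seen after deployment.
live_X, live_y = X_test[:50], y_test[:50]
log = list(zip(model.predict(live_X).tolist(), live_y.tolist()))
accuracy_so_far = sum(pred == obs for pred, obs in log) / len(log)
print(f"Post-deployment accuracy so far: {accuracy_so_far:.2f}")
```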
Step 6
This step concerns the transfer of the findings to experts, other professionals, and
society at large. The goal is to inform parties and society about the validity of the
approach, which is intended to help explain how BD/AI has been applied in the
process, show the transparency of that process, and increase its social acceptance.
Conclusions
This essay outlined the relevance of thinking along the lines of realist (theory-driven) evaluation for unpacking and testing AI/BD black boxes. It included a six-step approach. Because human–machine interaction is involved, together with a continuous flow of data, plasticity of algorithms, and different types of machine learning, this is not an easy task.
If the statement 'practice makes perfect' is correct, then that is the way to go. This should include learning from what is already happening in other fields, such as medicine.
All this may and probably will help increase the relevance of evaluating AI/
BD-driven interventions and policies and contribute to an effective, ethical, and
socially acceptable ‘Algorithmic Society’.
Notes
1 For readers not familiar with the concepts of big data, AI, and machine learning, see Janev (2020), www.linkedin.com/pulse/intelligent-things-its-all-machine-learning-roger-attick/ and www.zendesk.com/blog/machine-learning-and-deep-learning/.
2 See Bamberger (2016); York & Bamberger (2020); https://datapopalliance.org/lwl-27-the-role-of-big-data-and-ai-in-monitoring-and-evaluation-me/; and Rathinam et al. (2020).
3 Sometimes one refers to 'alchemy' or 'black art' when characterizing AI black boxes (Campolo & Crawford, 2020, p. 7 ff).
4 Examples are the symbolists, the evolutionaries, and the Bayesians; he described their characteristics, including the assumptions they work with.
5 Sometimes reference is made to 'cultural metacognitions' that exist in organizations. These concern the knowledge of and control over thinking and learning activities in organizations, like the awareness of different contexts, analyzing them, and developing plans of action for different cultural contexts.
6 One example is Ranerup and Henriksen (2020, p. 5) investigating 'Trelleborg, the first municipality in Sweden to use automated decision making for social assistance. The Trelleborg Model is a management model now used in many other municipalities in Sweden'. A second example is a study of AI adoption in public sector organizations, comparing three cases in three countries (van Noordt & Misuraca, 2022).
7 They add this: ‘Behavioral Artificial Intelligence (BAI) would study the artificial
inferences inherent in, and the manifested behavior of, artificial intelligent systems
in the same way as the social sciences have studied human cognition, inference and
behavior’.
8 Contribution analysis may also be an interesting approach to apply. The main reason is that AI/BD are not alone in making and implementing policy programmes/interventions; they always act in combination with human intelligence, experience, prior individual knowledge, and so on. An empirical investigation would therefore be most relevant if it tries to sort out what the contribution of AI (in interaction with humans) has been in developing and implementing programmes and interventions.
References
Astbury, B., & Leeuw, F. (2010). Unpacking black boxes: Mechanisms and theory building in evaluation. American Journal of Evaluation, 31(3), 363–381. https://doi.org/10.1177/1098214010371972
Bamberger, M. (2016). Integrating big data into the monitoring and evaluation of
development programmes. Global Pulse.
Bennett Moses, L., & Chan, J. (2018). Algorithmic prediction in policing: Assumptions,
evaluation, and accountability. Policing and Society, 28(7), 806–822.
Bowser, A., Carmona, A., & Fordyce, A. (2021). Unpacking transparency to support
ethical AI. Science and Technology Innovation Program, Wilson Center.
Burrell, J., & Fourcade, M. (2021). The society of algorithms. Annual Review of Sociology, 47, 23.1–23.25.
Campolo, A., & Crawford, K. (2020). Enchanted determinism: Power without
responsibility in artificial intelligence. Engaging Science, Technology, and Society, 6,
1–19. https://doi.org/10.17351/ests2020.277
Choenni, R., et al. (2021). Exploiting big data for smart government: Facing the
challenges. In J. C. Augusto (Ed.), Handbook of smart cities (pp. 1–23). Springer.
https://doi.org/10.1007/978-3-030-15145-4_82-1
Domingos, P. (2015). The master algorithm. Basic Books.
Dwivedi, Y., et al. (2021). Artificial intelligence: Multidisciplinary perspectives on emerging challenges, opportunities, and agenda for research, practice, and policy. International Journal of Information Management, 57(April), 1–47. https://doi.org/10.1016/j.ijinfomgt.2019.08.002
Foy, R., et al. (2011). The role of theory in research to develop and evaluate the
implementation of patient safety practices. BMJ Quality & Safety, 20(5), 453–459.
Janev, V. (2020). Ecosystem of big data. In V. Janev, D. Graux, H. Jabeen, & E. Sallinger (Eds.), Knowledge graphs and big data processing (pp. 3–19). Springer. https://doi.org/10.1007/978-3-030-53199-7_1
Kempeneer, S. (2021). A big data state of mind: Epistemological challenges to
accountability and transparency in data-driven regulation. Government Information
Quarterly, 38(3), 1–8. https://doi.org/10.1016/j.giq.2021.101578.
Lazer, D., et al. (2021). Meaningful measures of human society in the twenty-first
century. Nature, 595, 189–196.
Leeuw, F. (2012). Linking theory-based evaluation and contribution analysis: Three
problems and a few solutions. Evaluation, 18(3), 348–363.
Leeuw, F. (2020). Program evaluation B: Evaluation, big data, and artificial intelligence:
Two sides of one coin. In E. Vigoda-Gadot & D. R. Vashdi (Eds.), Handbook of
research methods in public administration, management and policy (pp. 277–297).
EE Publishers. https://doi.org/10.4337/9781789903485
Leeuw, F. (2021). Big data, artificial intelligence, and the future of evaluation.
Background report to a presentation given at the Seminar of the Evaluation Network
of DG Regional and Urban Policy, July 1, 2021.
Lemire, S., Kwako, A., Nielsen, S. B., Christie, C. A., Donaldson, S. I., & Leeuw, F.
(2020). What is this thing called a mechanism? Findings from a review of realist
evaluations. Causal Mechanisms in Program Evaluation, 2020(167), 73–86.
Mitchell, S., Potash, E., Barocas, S., D’Amour, A., & Lum, K. (2021). Algorithmic
fairness: Choices, assumptions, and definitions. Annual Review of Statistics and Its
Application, 8, 141–163.
Nielsen, S., Lemire, S., & Tangsig, S. (2021). Unpacking context in realist evaluations:
Findings from a comprehensive review. Evaluation, 28(1), 91–112.
Park, S. H., & Han, K. (2018). Methodologic guide for evaluating clinical performance
and effect of AI technology for medical diagnoses and prediction. Radiology, 286(3),
800–809.
Pawson, R. (2002a). Evidence-based policy: The promise of ‘realist synthesis’.
Evaluation, 8(3), 340–358.
Pawson, R. (2002b). Evidence and policy and naming and shaming. Policy Studies, 23(3),
211–230. https://doi.org/10.1080/0144287022000045993
Pawson, R. (2008). Invisible mechanisms. Evaluation Journal of Australasia, 8(2), 3–13.
https://doi.org/10.1177/1035719X0800800202
Pawson, R. (2013). The science of evaluation: A realist manifesto. Sage.
Pawson, R., Greenhalgh, T., Harvey, G., & Walshe, K. (2005). Realist review – A new method of systematic review designed for complex policy interventions. Journal of Health Services Research & Policy, 10(Suppl 1), 21–34. https://doi.org/10.1258/1355819054308530
Pedersen, T., & Johansen, C. (2020). Behavioural artificial intelligence: An agenda for
systematic empirical studies of artificial inference. AI & Society, 35(3), 519–532.
https://doi.org/10.1007/s00146-019-00928-5
Price, W. (2018). Big data and black-box medical algorithms. Science Translational Medicine, 10(47). https://doi.org/10.1126/scitranslmed.aao5333
Price, W. N., & Rai, A. K. (2021). Clearing opacity through machine learning. Iowa Law
Review, 106, 775–812.
Rajkomar, A., Dean, J., & Kohane, I. (2019). Machine learning in medicine. New England
Journal of Medicine, 380(14), 1347–1358.
Ranerup, A., & Henriksen, H. (2020). Digital discretion: Unpacking human and technological agency in automated decision making in Sweden's social services. Social Science Computer Review, 40(2), 445–461. https://doi.org/10.1177/0894439320980434
Rathinam, F., Khatua, S., Siddiqui, Z., Malik, M., Duggal, P., Watson, S., & Vollenweider,
X. (2020). Using big data for evaluating development outcomes: A systematic map.
CEDIL Methods Working Paper 2. Centre of Excellence for Development Impact and
Learning.
Topol, E. (2019). High-performance medicine: The convergence of human and artificial
intelligence. Nature Medicine, 25(January), 44–56.
Van Noordt, C., & Misuraca, G. (2022). Exploratory insights on artificial intelligence for government in Europe. Social Science Computer Review, 40(2), 426–444. https://doi.org/10.1177/0894439320980449
Vijayakumar, R., & Cheung, M. (2021). Assessing replicability of machine learning results:
An introduction to methods on predictive accuracy in social sciences. Social Science
Computer Review, 39(5), 768–801. https://doi.org/10.1177/0894439319888445
WRR. (2021). Mission AI: The new system technology. Netherlands Scientific Council
for Government Policy.
York, P., & Bamberger, M. (2020). Measuring results and impacts in an age of big data:
The nexus of evaluation, analytics, and digital technology. Rockefeller Foundation.
Zuiderwijk, A., Chen, Y., & Salem, F. (2021). Implications of the use of artificial
intelligence in public governance: A systematic literature review and research agenda.
Government Information Quarterly, 38(3), 1–19.