UC Merced
Proceedings of the Annual Meeting of the Cognitive Science Society

Title: Recovering Quantitative Models of Human Information Processing with Differentiable Architecture Search
Permalink: https://escholarship.org/uc/item/9wd571ts
Journal: Proceedings of the Annual Meeting of the Cognitive Science Society, 43(43)
ISSN: 1069-7977
Author: Musslick, Sebastian
Publication Date: 2021
Peer reviewed
Recovering Quantitative Models of Human Information Processing
with Differentiable Architecture Search
Sebastian Musslick ([email protected])
Princeton Neuroscience Institute, Princeton University
Princeton, NJ 08544, USA

Abstract

The integration of behavioral phenomena into mechanistic models of cognitive function is a fundamental staple of cognitive science. Yet, researchers are beginning to accumulate increasing amounts of data without having the temporal or monetary resources to integrate these data into scientific theories. We seek to overcome these limitations by incorporating existing machine learning techniques into an open-source pipeline for the automated construction of quantitative models. This pipeline leverages the use of neural architecture search to automate the discovery of interpretable model architectures, and automatic differentiation to automate the fitting of model parameters to data. We evaluate the utility of these methods based on their ability to recover quantitative models of human information processing from synthetic data. We find that these methods are capable of recovering basic quantitative motifs from models of psychophysics, learning and decision making. We also highlight weaknesses of this framework and discuss future directions for their mitigation.

Keywords: autonomous empirical research; computation graph; continuous relaxation; NAS; DARTS; AutoML

Introduction

The process of developing a mechanistic model of cognition incurs two challenges: (1) identifying the architecture of the model, i.e. the composition of its functions and parameters, and (2) tuning the parameters of the model to fit experimental data. While there are various methods for automating the fitting of parameters, cognitive scientists typically leverage their own expertise and intuitions to identify the architecture of a model—a process that requires substantial human effort. In machine learning, interest has grown in automating the construction and parameterization of neural networks to solve machine learning problems more efficiently (He, Zhao, & Chu, 2021). This involves the use of neural architecture search (NAS) to automate the discovery of model architectures (Elsken, Metzen, Hutter, et al., 2019), and the use of automatic differentiation to automate parameter fitting (Paszke et al., 2017). This combination of methods has led to breakthroughs in the automated construction of neural networks that are capable of outperforming networks designed by human researchers (e.g. in computer vision: Mendoza, Klein, Feurer, Springenberg, & Hutter, 2016). In this study, we explore the utility of these methods for constructing quantitative models of human information processing.

To ease the application of NAS to the discovery of a quantitative model, it is useful to treat quantitative models as neural networks or, more generally, as computation graphs. In this article, we introduce the notion of a computation graph and describe ways of expressing the architecture of a quantitative model in terms of such a graph. We then review the use of differentiable architecture search (DARTS; Liu, Simonyan, & Yang, 2018) for searching the space of candidate computation graphs, and introduce an adaptation of this method for discovering quantitative models of human information processing. We evaluate two variants of DARTS—regular DARTS (Liu et al., 2018) and fair DARTS (Chu, Zhou, Zhang, & Li, 2020)—based on their ability to recover, from synthetic data, three different models of human cognition that explain behavioral phenomena in psychophysics, learning and perceptual decision making. Our results indicate that such algorithms are capable of recovering computational motifs found in these models. However, we also discuss further developments that are needed to expand the scope of models amenable to DARTS. Reported simulations (https://github.com/musslick/DARTS-Cognitive-Modeling) are embedded in a documented open-source framework for autonomous empirical research (www.empiricalresearch.ai) and can be extended to explore other search methods.

Quantitative Models as Computation Graphs

A broad class of complex mathematical functions—including the functions expressed by a quantitative model of human information processing—can be formulated as a computation graph. A computation graph is a collection of nodes that are connected by directed edges. Each node denotes an expression of a variable, and each outgoing edge corresponds to a function applied to this variable (cf. Figure 1D). The value of a node is typically computed by integrating over the results of all functions (edges) feeding into that node. Akin to quantitative models of cognitive function, a computation graph can take experiment parameters as input (e.g. the brightness of two visual stimuli), and can transform this input through a combination of functions (edges) and latent variables (intermediate nodes) to produce observable dependent measures as output nodes (e.g. the probability that a participant is able to detect the difference in brightness between two stimuli).

The expression of a formula as a computation graph can be illustrated with Weber's law (Fechner, 1860)—a quantitative hypothesis that relates the difference between the intensities of two stimuli to the probability that a participant can detect this difference. It states that the just noticeable difference (JND; the difference in intensity that a participant is capable of detecting in 50% of the trials) amounts to

    \Delta I = c \cdot I_0    (1)

where ∆I is the JND, I_0 corresponds to the intensity of the baseline stimulus and c is a constant. The probability of detecting the difference between two stimuli, I_0 and I_1, can then be formulated as a function of the two stimuli (with I_0 < I_1),

    P(\mathrm{detected}) = \sigma_{\mathrm{logistic}}((I_1 - I_0) - \Delta I)    (2)

where σ_logistic is a logistic function. Figure 1D depicts the argument of σ_logistic as a computation graph for a fixed value of c. The graph encompasses two input nodes, one representing x_0 = I_0 and the other x_1 = I_1. The intermediate node x_2 expresses ∆I, which results from multiplying I_0 with the parameter c. The addition and subtraction of I_1 and I_0, respectively, result in their difference (I_1 − I_0), represented by the intermediate node x_3. The linear combination of x_2 and x_3 in the output node r resembles the argument to σ_logistic.
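To make the data-generating process concrete, Equations (1) and (2) can be written out in a few lines of Python. This is an illustrative sketch rather than code from the accompanying repository; the sampling interval and the value of c are assumptions of the example, while the constraint I_0 ≤ I_1 and the 20 evenly spaced samples follow the test-case description given later in the paper.

```python
import numpy as np

def weber_probability(i0, i1, c=1.0):
    """P(detected) from Equations (1)-(2): logistic of (I1 - I0) - c * I0."""
    jnd = c * i0                                        # Equation (1): just noticeable difference
    return 1.0 / (1.0 + np.exp(-((i1 - i0) - jnd)))     # Equation (2): logistic link

# Synthetic data set: full crossing of two stimulus intensities with I0 <= I1.
# The interval [0, 5] is an assumed placeholder; the paper uses 20 evenly spaced samples.
intensities = np.linspace(0.0, 5.0, 20)
data = [(i0, i1, weber_probability(i0, i1))
        for i0 in intensities for i1 in intensities if i0 <= i1]
print(len(data), data[0])
```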
The automated construction of a mathematical hypothesis, like Weber's law, can be formulated as a search over the space of all possible computation graphs. Machine learning researchers leverage the notion of computation graphs to represent the composition of functions performed by a complex artificial neural network (i.e. its architecture), and deploy NAS to search a space of computation graphs. Although some level of specification of the graph remains with the researcher, NAS relieves the researcher from searching through these possibilities.

Identifying Computation Graphs with Neural Architecture Search

NAS refers to a family of methods for automating the discovery of useful neural network architectures. There are a number of methods to guide this search, such as evolutionary algorithms, reinforcement learning or Bayesian optimization (for a recent survey of NAS search strategies, see Elsken et al., 2019). However, most of these methods are computationally demanding due to the nature of the optimization problem: the search space of candidate computation graphs is high-dimensional and discrete. To address this problem, Liu et al. (2018) proposed DARTS, which relaxes the search space to become continuous, making architecture search amenable to gradient descent. The authors demonstrate that DARTS can yield useful network architectures for image classification and language modeling that are on par with architectures designed by human researchers. In this work, we assess whether variants of DARTS can be adopted to automate the discovery of interpretable quantitative models to explain human information processing.

Regular DARTS

Regular DARTS treats the architecture of a neural network as a directed acyclic computation graph (DAG) containing N nodes in sequential order (Figure 1). Each node x_i corresponds to a latent representation of the input space. Each directed edge e_{i,j} is associated with some operation o_{i,j} that transforms the representation of the preceding node i and feeds it to node j. Each intermediate node is computed by integrating over its transformed predecessors:

    x_j = \sum_{i < j} o_{i,j}(x_i).    (3)

Every output node is computed by linearly combining all intermediate nodes projecting to it. The goal of DARTS is to identify all operations o_{i,j} of the DAG. Following Liu et al. (2018), we define O = {o^1_{i,j}, o^2_{i,j}, ..., o^M_{i,j}} to be the set of M candidate operations associated with edge e_{i,j}, where every operation o^m_{i,j}(x_i) corresponds to some function applied to x_i (e.g. linear, exponential or logistic). DARTS relaxes the problem of searching over candidate operations by formulating the transformation associated with an edge as a mixture of all possible operations in O (cf. Figure 1A-B):

    \bar{o}_{i,j}(x) = \sum_{o \in O} \frac{\exp(\alpha^{o}_{i,j})}{\sum_{o' \in O} \exp(\alpha^{o'}_{i,j})} \, o(x)    (4)

where each operation is weighted by the softmax transformation of its architectural weight α^o_{i,j}. Every edge e_{i,j} is assigned a weight vector α_{i,j} of dimension M, containing the weights of all candidate operations for that edge. The set of all architecture weight vectors α = {α_{i,j}} determines the architecture of the model. Thus, searching the architecture amounts to identifying α. The key contribution of DARTS is that searching α becomes amenable to gradient descent after relaxing the search space to become continuous (Equation (4)).
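In code, the continuous relaxation in Equation (4) boils down to a softmax-weighted sum of candidate operations. The sketch below is illustrative and not the authors' implementation; the operation set is a small subset of Table 1, chosen for brevity.

```python
import torch

# Candidate operations on a scalar node value (a subset of Table 1, for illustration).
ops = [
    lambda x: torch.zeros_like(x),   # "zero" (no connection)
    lambda x: x,                     # addition (+x)
    lambda x: -x,                    # subtraction (-x)
    lambda x: torch.relu(x),         # rectified linear
    lambda x: torch.sigmoid(x),      # logistic
]

# One architectural weight per candidate operation on this edge (alpha_{i,j} in Equation (4)).
alpha = torch.nn.Parameter(torch.zeros(len(ops)))

def mixed_op(x):
    """Softmax-weighted mixture of all candidate operations (Equation (4))."""
    weights = torch.softmax(alpha, dim=0)
    return sum(w * op(x) for w, op in zip(weights, ops))

x = torch.tensor([0.7])
print(mixed_op(x))  # differentiable with respect to alpha
```

Because the mixture is differentiable with respect to α, the discrete choice of an operation is replaced by a continuous parameter that can be optimized jointly with the operation parameters w.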
However, minimizing the loss function of the model L(w, α) requires finding both α∗ and w∗—the parameters of the computation graph.¹ Liu et al. (2018) propose to learn α and w simultaneously using bi-level optimization:

    \min_{\alpha} \; L_{\mathrm{val}}(w^{*}(\alpha), \alpha) \quad \mathrm{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} L_{\mathrm{train}}(w, \alpha).    (5)

That is, one can obtain α∗ through gradient descent by iterating through the following steps:

1. Obtain the optimal set of weights w∗ for the current architecture α by minimizing the training loss L_train(w, α).

2. Update the architecture α (cf. Figure 1C) by following the gradient of the validation loss ∇L_val(w∗, α).

Once α∗ is found, one can obtain the final architecture by replacing ō_{i,j} with the operation that has the highest architectural weight, i.e. o_{i,j} ← argmax_o α^{o∗}_{i,j} (Figure 1D).

¹ This includes the parameters of each candidate operation o^m_{i,j}.
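A minimal sketch of this alternating scheme is given below. It uses a single edge with two hypothetical candidate operations, toy data, and first-order updates (ignoring the second-order term of the exact bi-level gradient); it illustrates the two steps above rather than reproducing the authors' implementation.

```python
import torch

# Toy data: one input, one output (t = 2*x), split into training and validation sets.
torch.manual_seed(0)
x_train, x_val = torch.rand(50, 1), torch.rand(20, 1)
t_train, t_val = 2 * x_train, 2 * x_val

# Trainable pieces: w parameterizes the candidate operations, alpha weights them.
w = torch.nn.Parameter(torch.randn(2))       # one parameter per candidate operation
alpha = torch.nn.Parameter(torch.zeros(2))   # architectural weights for the single edge

def predict(x):
    ops = torch.stack([w[0] * x, torch.exp(w[1] * x)], dim=-1)  # two candidate operations
    return (torch.softmax(alpha, dim=0) * ops).sum(dim=-1)      # Equation (4)

def mse(x, t):
    return ((predict(x) - t) ** 2).mean()

opt_w = torch.optim.SGD([w], lr=0.05, momentum=0.9)
opt_alpha = torch.optim.Adam([alpha], lr=0.01)

for step in range(500):
    opt_w.zero_grad(); mse(x_train, t_train).backward(); opt_w.step()      # step 1: update w
    opt_alpha.zero_grad(); mse(x_val, t_val).backward(); opt_alpha.step()  # step 2: update alpha

print(torch.softmax(alpha, dim=0), w)  # the linear operation should tend to dominate
```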
Fair DARTS

One of the core premises of regular DARTS is that different candidate operations compete with one another in determining the transformation applied by an edge. This results from the softmax function in Equation (4): increasing the architectural weight α^o_{i,j} of operation o suppresses the contribution of other operations o' ≠ o. As a consequence, regular DARTS is biased to prefer operations that yield larger gradients (e.g. an exponential function) over operations with smaller gradients (e.g. a logistic function). To address this problem, Chu et al. (2020) propose to replace the softmax function in Equation (4) with a sigmoid function such as the logistic function,

    \bar{o}_{i,j}(x) = \sum_{o \in O} \frac{1}{1 + \exp(-\alpha^{o}_{i,j})} \, o(x).    (6)

This introduces a cooperative ("fair") mechanism for determining the transformation of an edge, allowing each operation to contribute in a manner that is independent from the architectural weights of other operations. To facilitate discrete encodings of the architecture, Chu et al. (2020) introduce a supplementary loss L_{0-1} that forces the sigmoid value of each architectural weight toward one or zero:

    L_{0\text{-}1} = -\frac{w_{0\text{-}1}}{N} \sum_{l=1}^{N} \left( \frac{1}{1 + \exp(-\alpha_l)} - 0.5 \right)^{2}    (7)

where N corresponds to the total number of architectural weights and w_{0-1} determines the contribution of L_{0-1} to the total loss. Here, w_{0-1} is treated as a fixed hyperparameter.
Figure 1: Learning computation graphs with DARTS. The nodes and edges in a computation graph correspond to variables and functions (operations) performed on those variables, respectively. (A) Edges represent different candidate operations. (B) DARTS relaxes the search space of operations to be continuous. Each intermediate node (blue) is computed as a weighted mixture of operations. The (architectural) weight of a candidate operation in an edge represents the contribution of that operation to the mixture computation. Output nodes (green) are computed by linearly combining all intermediate nodes. (C) Architectural weights are trained using bi-level optimization, and used to sample the final architecture of the computation graph, as shown in (D).

Adapting DARTS for Autonomous Cognitive Modeling

We adopt the framework from Liu et al. (2018) by representing quantitative models of information processing as DAGs, and seek to automate the discovery of model architectures by differentiating through the space of operations in the underlying computation graph. To map computation graphs onto quantitative models of cognitive function, we separate the nodes of the computation graph into input nodes, intermediate nodes and output nodes. Every input node corresponds to a different independent variable (e.g. the brightness of a stimulus) and every output node corresponds to a different dependent variable (e.g. the probability of detecting the stimulus). Intermediate nodes represent latent variables of the model and are computed according to Equation (3), by applying an operation to every predecessor of the node and integrating over all transformed predecessors.² For the simulation experiments reported below, we consider eight candidate operations, summarized in Table 1, including a "zero" operation to indicate the lack of a connection between nodes. Similar to Liu et al. (2018), we compute every output node r_j by linearly combining all intermediate nodes:

    r_j = \sum_{i = S+1}^{K+S} v_{i,j} \, x_i    (8)

where v_{i,j} ∈ w is a trainable weight projecting from intermediate node x_i to the output node r_j, S corresponds to the number of input nodes and K to the number of intermediate nodes. We seek to identify simple scientific models that—unlike complex neural networks—must be parsable by human researchers. To warrant interpretability of the model, we constrain all nodes to be scalar variables, i.e. x_i, r_j ∈ ℝ.

Our goal is to identify a computation graph that can predict each dependent variable from all independent variables. Thus, we seek to minimize, for every dependent variable j, the discrepancy between every output of the model r_j and the observed data t_j. This discrepancy can be formulated as a mean squared error (MSE), L_MSE(r_j, t_j | w, α), that is contingent on both the architecture α and the parameters in w. In addition, we seek to minimize the complexity of the model,

    L_{\mathrm{complexity}} = \gamma \sum_{i} \sum_{j} \sum_{m} p(o^{m}_{i,j})    (9)

where p(o^m_{i,j}) corresponds to the complexity of a candidate operation, amounting to one plus the number of trainable parameters (see Table 1), and γ scales the degree to which complexity is penalized. Following the objective in Equation (5), we seek to minimize the total loss, L_total(w, α) = L_MSE(r_j, t_j | w, α) + L_complexity, by simultaneously finding α∗ and w∗ using gradient descent.³

² Predecessors include both input and intermediate nodes.
³ Fair DARTS adds L_{0-1} (Equation 7) to the total loss.

Table 1: Search space of candidate operations o(x) ∈ O and their complexity p(o). Note that parameters a, b ∈ w are fitted separately for every o^m_{i,j}.

    Description                  o(x)               p(o)
    zero                         0                  0
    addition                     +x                 1
    subtraction                  −x                 1
    multiplication               a · x              2
    linear function              a · x + b          3
    exponential function         exp(a · x + b)     3
    rectified linear function    ReLU(x)            1
    logistic function            σ_logistic(x)      1
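Putting Equations (8) and (9) together, the search objective for a given architecture can be sketched as follows. This is a simplified, illustrative rendering that assumes a hard (already discretized) choice of one operation per edge; the complexity values mirror Table 1, and the default γ is an arbitrary example rather than a value used in the paper.

```python
import numpy as np

# Complexity p(o) per candidate operation (cf. Table 1).
COMPLEXITY = {"zero": 0, "add": 1, "subtract": 1, "multiply": 2,
              "linear": 3, "exponential": 3, "relu": 1, "logistic": 1}

def total_loss(predictions, targets, chosen_ops, gamma=0.25):
    """L_total = L_MSE + gamma * sum of operation complexities (Equations (8)-(9))."""
    mse = np.mean((np.asarray(predictions) - np.asarray(targets)) ** 2)
    complexity = gamma * sum(COMPLEXITY[op] for op in chosen_ops)
    return mse + complexity

# Example: a two-edge graph that multiplies one input and subtracts the other.
print(total_loss([0.1, 0.4], [0.0, 0.5], chosen_ops=["multiply", "subtract"]))
```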
Experiments and Results

Identifying the architecture of a quantitative model is an ambitious task. Consider the challenge of constructing even a small DAG that relates a few independent variables to a dependent variable through a couple of latent variables. Assuming a set of eight candidate operations per edge, for a total number of seven edges, there are 8^7 (more than two million) possible architectures to explore, and endless ways to parameterize the chosen architecture. Adding one more latent variable to the model would expand the search space even further. DARTS offers one way of automating this search.
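For a sense of scale, the size of the discrete search space can be computed directly. The edge counts below assume a particular connectivity convention (every input node feeds every intermediate node, and earlier intermediate nodes feed later ones); this convention is an assumption of the illustration rather than a statement from the text, but under it, three inputs and two latent variables yield the seven searched edges mentioned above.

```python
n_ops = 8                                    # candidate operations per edge (Table 1)

def n_architectures(n_inputs, n_latent):
    # every input connects to every intermediate node, plus one edge per ordered pair
    # of intermediate nodes (i < j); output edges are fixed linear combinations
    n_edges = n_inputs * n_latent + n_latent * (n_latent - 1) // 2
    return n_edges, n_ops ** n_edges

print(n_architectures(3, 2))   # (7, 2097152)      ~ 2.1 million architectures
print(n_architectures(3, 3))   # (12, 68719476736) ~ 6.9 * 10^10 architectures
```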
However, before applying DARTS to explain human data, it is worth assessing whether this method is capable of recovering computational motifs from a known ground truth. Therefore, we seek to evaluate whether DARTS can recover established quantitative models of human cognition from synthetic data. As detailed below, we assess the performance of two variants of DARTS—regular DARTS and fair DARTS—in recovering three distinct computational motifs in cognitive psychology (see Test Cases). For each test case, we vary the number of intermediate nodes k and the complexity penalty γ (including γ = 0, i.e. no penalty) across architecture searches, and initialize each search with ten different seeds.

When evaluating instantiations of NAS, it is important to compare their performance against baselines (Lindauer & Hutter, 2020). In many cases, random search can yield results that are comparable to more sophisticated NAS methods (Li & Talwalkar, 2020; Xie, Kirillov, Girshick, & He, 2019). Thus, we compare the average performance of each search condition against random search. To enable a fair comparison, we allow random search to sample and evaluate architectures without replacement for the same amount of time it took either regular DARTS or fair DARTS (whichever took more time). Finally, we used the same training and evaluation procedure across all search methods.

Training and Evaluation Procedure

For each test case, we used 40% of the generated data set to compute the training loss and 10% to compute the validation loss, optimizing the objective stated in Equation (5). We evaluated the performance of the architecture search on the remaining 50% of the data set (the test set). Experiment sequences for each data set were generated with SweetPea—a programming language for automating experimental design (Musslick, Cherkaev, et al., 2020).

For each test case and each search condition, we optimized the architecture according to Equation (5) using stochastic gradient descent (SGD). To identify w∗, we optimized w on the training set using SGD with momentum and weight decay, annealing the learning rate with a cosine schedule. Following Liu et al. (2018), we initialized the architecture weights to zero. For a given w∗, we optimized α on the validation set using Adam (Kingma & Ba, 2014) with weight decay.

After training w and α, we sampled the final architecture by selecting, for every edge, the operation with the highest architectural weight. Finally, we trained five random initializations of each sampled architecture on the training set for 1000 epochs using SGD with a cosine annealing schedule. All hyperparameters were selected based on recoveries of out-of-sample test cases, and we used the same hyperparameters across all search methods (regular DARTS, fair DARTS and random search). All experiments were run on a four-rack Intel cluster computer (2.5 GHz Ivy Bridge; 20 cores per node); each search condition was performed on a single node with 8 GB of memory.
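The protocol above can be summarized structurally as follows. The function and parameter names (run_search, retrain_and_test, the k and γ grids) are hypothetical placeholders rather than the API of the released pipeline; only the 40/10/50 split, the ten seeds and the five re-initializations are taken from the text.

```python
import numpy as np

def evaluate_search(data, targets, run_search, retrain_and_test,
                    k_values=(1, 2, 3), gammas=(0.0, 0.5), n_seeds=10, n_inits=5):
    """Structural sketch of the evaluation protocol: 40/10/50 split, one architecture
    search per (k, gamma, seed), then re-training several random initializations of the
    sampled architecture and scoring it on the held-out test set. `run_search` and
    `retrain_and_test` are caller-supplied callables with hypothetical signatures."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(data))
    n_train, n_val = int(0.4 * len(data)), int(0.1 * len(data))
    train, val = idx[:n_train], idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    results = []
    for k in k_values:
        for gamma in gammas:
            for seed in range(n_seeds):
                arch = run_search(data[train], targets[train], data[val], targets[val],
                                  k=k, gamma=gamma, seed=seed)
                test_losses = [retrain_and_test(arch, data[train], targets[train],
                                                data[test], targets[test], init=i)
                               for i in range(n_inits)]
                results.append({"k": k, "gamma": gamma, "seed": seed,
                                "test_loss": float(np.mean(test_losses))})
    return results
```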
Test Cases

All test cases are summarized in Table 2. Here, we report the results for three different psychological models as test cases for DARTS. While these models appear fairly simple, they are based on common computational motifs in cognitive psychology, and serve as a proof of concept for uncovering potential weaknesses of DARTS. Below, we describe each computational model in greater detail.

Table 2: Summary of test cases, stating the reference to the respective equation (Eqn.), the number of independent variables (IVs), the number of dependent variables (DVs), the number of free parameters (|Θ|), as well as distinct operations (o∗).

    Test Case        Eqn.   IVs   DVs   |Θ|   o∗
    Weber's Law      (2)    2     1     1     subtraction
    Exp. Learning    (10)   3     1     1     exponential
    LCA              (12)   3     1     3     rectified linear

Case 1: Weber's Law. Weber's law is a quantitative hypothesis from psychophysics relating the difference in intensity between two stimuli (e.g. their brightness) to the probability that a participant can detect the difference. Here, we adopt the formal description of Weber's law from Equation (2) with a fixed constant c (see Quantitative Models as Computation Graphs for a detailed description). We consider the two stimulus intensities, I_0 and I_1, as the independent variables of the model, and P(detected) as the dependent variable. The generated data set (for computing L_val, L_train and L_test) is synthesized based on 20 evenly spaced samples of the stimulus intensity, computing P(detected) for all valid crossings between I_0 and I_1 with I_0 ≤ I_1. Since we seek to explain a single probability, we apply a sigmoid function to the output of each generated computation graph.

Case 2: Exponential Learning. The exponential learning equation is one of the standard equations used to explain the improvement on a task with practice (Thurstone, 1919; Heathcote, Brown, & Mewhort, 2000). It describes the performance P_n on a task as follows:

    P_n = P_{\infty} - (P_{\infty} - P_0) \cdot e^{-\varepsilon \cdot t}    (10)

where t corresponds to the number of practice trials, ε is a learning rate, P_0 corresponds to the initial performance on the task and P_∞ to the final performance for t → ∞. We treat t, P_0 and P_∞ as independent variables of the model and P_n as a real-valued dependent variable. To avoid numerical instabilities caused by large inputs, we constrain t, P_0 and P_∞ to bounded intervals and fix the learning rate ε. We generate the synthesized data set by drawing eight evenly spaced samples for each independent variable, generating a full crossing between these samples, and computing P_n for each condition. The purpose of this test case is to highlight a potential weakness of DARTS: intermediate nodes cannot represent non-linear interactions between input variables—as is the case in Equation (10)—due to the additive integration of their inputs. Thus, DARTS must identify alternative expressions to approximate Equation (10).
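A data-generating sketch for this test case is shown below. The variable ranges and the value of the learning rate ε are assumed placeholders (the paper constrains these quantities, but the specific bounds are not reproduced here); only the structure of Equation (10) and the eight-samples-per-variable crossing follow the description above.

```python
import numpy as np

def exponential_learning(t, p0, p_inf, epsilon=0.2):
    """Equation (10): performance after t practice trials."""
    return p_inf - (p_inf - p0) * np.exp(-epsilon * t)

# Full crossing of eight evenly spaced samples per independent variable
# (ranges and epsilon are illustrative placeholders).
t_vals = np.linspace(0, 10, 8)
p0_vals = np.linspace(0.0, 0.5, 8)
pinf_vals = np.linspace(0.5, 1.0, 8)
dataset = [(t, p0, pinf, exponential_learning(t, p0, pinf))
           for t in t_vals for p0 in p0_vals for pinf in pinf_vals]
print(len(dataset))  # 8 * 8 * 8 = 512 conditions
```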
Case 3: Leaky Competing Accumulator. To model the dynamics of perceptual decision making, Usher and McClelland (2001) introduced the leaky, competing accumulator (LCA). Every unit x_i of the model represents a different choice in a decision making task, and the activity of these units is used to determine the selected choice of an agent. The activity dynamics are determined by the non-linear equation (without consideration of noise):

    dx_i = \left[ \rho_i - \lambda x_i + \mu f(x_i) - \beta \sum_{j \neq i} f(x_j) \right] \frac{dt}{\tau}    (11)

where ρ_i is an external input provided to unit x_i, λ is the decay rate of x_i, µ is the recurrent excitation weight of x_i, β is the inhibition weight between units, τ is a rate constant and f(x_i) is a rectified linear activation function. Here, we seek to recover the dynamics of an LCA with three units, using a typical parameterization of λ, µ, β and τ. In addition, we assume that all units receive no external input, ρ_i = 0. This results in the simplified equation:

    dx_i = \left[ -\lambda x_i + \mu f(x_i) - \beta \sum_{j \neq i} f(x_j) \right] \frac{dt}{\tau}    (12)

We treat the units x_1, x_2 and x_3 as independent variables (constrained to a bounded interval) and dx_1 as the dependent variable for a given time step dt. We generate data from the model by drawing eight evenly spaced samples for each x_i, generating the full crossing between these, and computing dx_1 for each condition.
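The corresponding data-generating sketch for the LCA test case follows. The parameter values and the sampling interval are assumed placeholders for the "typical" parameterization mentioned above; the structure follows Equation (12), with a rectified-linear activation and no external input.

```python
import numpy as np

def lca_dx1(x, lam=0.2, mu=0.2, beta=0.2, tau=1.0, dt=0.01):
    """Equation (12): change of the first accumulator over one time step dt, with
    rectified-linear activation f and no external input (rho_i = 0). Parameter
    values are illustrative placeholders, not those used in the paper."""
    f = np.maximum(x, 0.0)                # rectified linear activation
    inhibition = beta * (f.sum() - f[0])  # sum over competing units j != 1
    return (-lam * x[0] + mu * f[0] - inhibition) * dt / tau

# Full crossing of eight evenly spaced samples per unit activity (interval assumed).
grid = np.linspace(-1.0, 1.0, 8)
dataset = [(x1, x2, x3, lca_dx1(np.array([x1, x2, x3])))
           for x1 in grid for x2 in grid for x3 in grid]
print(len(dataset))  # 512 conditions
```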
Results

Figures 2, 3 and 4 summarize the search results for each test case. The ground truth in each test case is generally best recovered with regular DARTS, using 3 intermediate nodes and no parameter penalty, although the best-fitting architecture may result from different parameters.⁴ Both regular and fair DARTS can achieve higher performance than random search, at least for some settings of k. Below, we examine the best-fitting architectures of regular DARTS for each test case—determined by the lowest validation loss—which are generally capable of recovering distinct operations used by the data-generating model.

⁴ Note that we expect no relationship between γ and the validation loss for random search, as random search is unaffected by γ.

Figure 2: Architecture search results for Weber's law. (A, B, C) The mean test loss as a function of the number of intermediate nodes (k) and the penalty on model complexity (γ) for architectures obtained through (A) regular DARTS, (B) fair DARTS and (C) random search. Vertical bars indicate the standard error of the mean (SEM) across seeds. The star designates the test loss of the best-fitting architecture obtained through regular DARTS, depicted in (D). (E) Psychometric functions for different baseline intensities, generated by the original model and the recovered architecture shown in (D).

Case 1: Weber's Law. The best-fitting architecture for Weber's law (Figure 2D) can be summarized as

    P(\mathrm{detected}) = \sigma_{\mathrm{logistic}}(w_1 \cdot I_1 - w_0 \cdot I_0 - b)    (13)

with fitted weights w_1, w_0 and bias b, and resembles a simplification of the ground truth model in Equation (2), σ_logistic(I_1 − (1 + c)·I_0): it recovers the computational motif of the difference between the two input variables, as well as the role of I_0 as a bias term. The architecture can also reproduce the psychometric functions generated by the original model (Figure 2E). However, the recovery of Weber's law should merely be considered a sanity check, given that the data-generating model could also be recovered with much simpler methods, such as logistic regression. This is reflected in the decent performance of random search.

Figure 3: Architecture search results for exponential learning. (A, B, C) The mean test loss as a function of the number of intermediate nodes (k) and the penalty on model complexity (γ) for architectures obtained through (A) regular DARTS, (B) fair DARTS and (C) random search. Vertical bars indicate the SEM across seeds. The star designates the test loss of the best-fitting architecture obtained through regular DARTS, shown in (D). (E) The learning curves generated by the original model and the recovered architecture in (D).

Case 2: Exponential Learning. One of the core features of this test case is the exponential relationship between task performance and the number of trials. Note that we do not expect DARTS to fully recover Equation (10), as it is—by design—incapable of representing the non-linear interaction of (P_∞ − P_0) and e^{−ε·t}. Nevertheless, regular DARTS recovers the exponential relationship between the number of trials t and performance P_n (Figure 3D). However, the best-fitting architecture relies on a number of other transformations to compute P_n from its independent variables, and fails to fully recover the learning curves of the original model (Figure 3E). In the General Discussion, we examine ways of mitigating this issue.

Figure 4: Architecture search results for LCA. (A, B, C) The mean test loss as a function of the number of intermediate nodes (k) and the penalty on model complexity (γ) for architectures obtained through (A) regular DARTS, (B) fair DARTS and (C) random search. Vertical bars indicate the SEM across seeds. The star designates the test loss of the best-fitting architecture for regular DARTS, depicted in (D). (E) Dynamics of each decision unit simulated with the original model and the best architecture shown in (D), using the same initial condition.

Case 3: Leaky Competing Accumulator. The best-fitting architecture (Figure 4D) bears remarkable resemblance to the original model (cf. Equation (12)),

    dx_1 = \left[ c_0 - c_1 \cdot x_1 - c_2 \sum_{j \neq 1} \mathrm{ReLU}(x_j) \right] dt    (14)

with fitted constants c_0, c_1 and c_2, in that it recovers the rectified linear activation function imposed on the two units competing with x_1, as well as an inhibitory weight c_2 close to β. Yet, the recovered model fails to apply this function to the unit x_1 itself. The latter is not surprising, however, given that the LCA has been reported not to be fully recoverable, partly because its parameters trade off against each other (Miletić, Turner, Forstmann, & van Maanen, 2017). The generated dynamics are nevertheless capable of approximating the behavior of the original model (Figure 4E).

General Discussion and Conclusion

Empirical scientists are challenged with integrating an increasingly large number of experimental phenomena into quantitative models of cognitive function. In this article, we introduced and evaluated a method for recovering quantitative models of cognition using DARTS. The proposed method treats quantitative models as DAGs, and leverages continuous relaxation of the architectural search space to identify candidate models using gradient descent. We evaluated the performance of two variants of this method, regular DARTS (Liu et al., 2018) and fair DARTS (Chu et al., 2020), based on their ability to recover three different quantitative models of human cognition from synthetic data. Our results show that these implementations of DARTS have an advantage over random search, and are capable of recovering computational motifs from quantitative models of human information processing, such as the difference operation in Weber's law or the rectified linear activation function in the LCA. While the initial results reported here seem promising, there are a number of limitations worth addressing in future work.

All limitations of DARTS pertain to its assumptions, most of which limit the scope of discoverable models. First, not all quantitative models can be represented as a DAG, such as ones that require independent variables to be combined in a multiplicative fashion (see Test Case 2). Solving this problem may require expanding the search space to include different integration functions performed on every node.⁵ Symbolic regression algorithms provide another solution to this problem, by recursively identifying modularity of the underlying computation graph, such as multiplicative separability or simple symmetry (Udrescu et al., 2020). Second, some operations may have an unfair advantage over others when trained via gradient descent, e.g. if their gradients are larger. This problem can be circumvented with non-gradient-based architecture search algorithms, such as evolutionary algorithms or reinforcement learning. Finally, the performance of DARTS is contingent on a number of training and evaluation parameters, as is the case for other NAS algorithms. Future work is needed to evaluate DARTS for a larger space of parameters, beyond the number of intermediate nodes and the penalty on model complexity explored in this study. However, despite these limitations, DARTS may provide a first step toward automating the construction of complex quantitative models based on interpretable linear and non-linear expressions, including connectionist models of cognition (McClelland & Rumelhart, 1986; Rogers & McClelland, 2004; Musslick, Saxe, Hoskin, Reichman, & Cohen, 2020).

⁵ Another solution would be to linearize the data or to operate in logarithmic space. However, the former might hamper interpretability for models relying on simple non-linear functions, and the latter may be inconvenient if the ground truth cannot be easily represented in logarithmic space.

In this study, we considered a small number of test cases to evaluate the performance of DARTS. While these test cases present useful proofs of concept, we encourage a rigorous evaluation of this method based on more complex quantitative models of cognitive function. To enable such explorations, we provide open access to a documented implementation of the evaluation pipeline described in this article (www.empiricalresearch.ai). This pipeline is part of a Python toolbox for autonomous empirical research, and allows for the user-friendly integration and evaluation of other search methods and test cases. As such, the repository includes additional test cases (e.g. models of controlled processing) that we could not include in this article due to space constraints. We invite interested researchers to evaluate DARTS based on other computational models, and to utilize this method for the automated discovery of quantitative models of human information processing.

References

Chu, X., Zhou, T., Zhang, B., & Li, J. (2020). Fair DARTS: Eliminating unfair advantages in differentiable architecture search. In ECCV (pp. 465–480).
Elsken, T., Metzen, J. H., Hutter, F., et al. (2019). Neural architecture search: A survey. JMLR, 20(55), 1–21.
Fechner, G. T. (1860). Elemente der Psychophysik (Vol. 2). Breitkopf u. Härtel.
He, X., Zhao, K., & Chu, X. (2021). AutoML: A survey of the state-of-the-art. Knowledge-Based Systems, 212, 106622.
Heathcote, A., Brown, S., & Mewhort, D. J. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7(2), 185–207.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Li, L., & Talwalkar, A. (2020). Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence (pp. 367–377).
Lindauer, M., & Hutter, F. (2020). Best practices for scientific research on neural architecture search. JMLR, 21(243), 1–18.
Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055.
McClelland, J. L., & Rumelhart, D. E. (1986). Parallel distributed processing. Explorations in the Microstructure of Cognition, 2, 216–271.
Mendoza, H., Klein, A., Feurer, M., Springenberg, J. T., & Hutter, F. (2016). Towards automatically-tuned neural networks. In Workshop on AutoML (pp. 58–65).
Miletić, S., Turner, B. M., Forstmann, B. U., & van Maanen, L. (2017). Parameter recovery for the leaky competing accumulator model. Journal of Mathematical Psychology, 76, 25–50.
Musslick, S., Cherkaev, A., Draut, B., Butt, A., Srikumar, V., Flatt, M., & Cohen, J. D. (2020). SweetPea: A standard language for factorial experimental design. PsyArXiv, doi:10.31234/osf.io/mdwqh.
Musslick, S., Saxe, A., Hoskin, A. N., Reichman, D., & Cohen, J. D. (2020). On the rational boundedness of cognitive control: Shared versus separated representations. PsyArXiv, doi:10.31234/osf.io/jkhdf.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., . . . Lerer, A. (2017). Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop.
Rogers, T. T., & McClelland, J. L. (2004). Semantic cognition: A parallel distributed processing approach. MIT Press.
Thurstone, L. L. (1919). The learning curve equation. Psychological Monographs, 26(3), i.
Udrescu, S.-M., Tan, A., Feng, J., Neto, O., Wu, T., & Tegmark, M. (2020). AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity. arXiv preprint arXiv:2006.10782.
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108(3), 550.
Xie, S., Kirillov, A., Girshick, R., & He, K. (2019). Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE/CVF (pp. 1284–1293).
