
Domain Generalization by Marginal Transfer Learning

arXiv:1711.07910v3 [stat.ML] 6 Jan 2021

Gilles Blanchard [email protected]


Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d’Orsay
Aniket Anand Deshmukh [email protected]
Microsoft AI & Research
Ürun Dogan [email protected]
Microsoft AI & Research

Gyemin Lee [email protected]


Dept. Electronic and IT Media Engineering
Seoul National University of Science and Technology
Clayton Scott [email protected]
Electrical and Computer Engineering, Statistics
University of Michigan

Abstract
In the problem of domain generalization (DG), there are labeled training data sets from
several related prediction problems, and the goal is to make accurate predictions on future
unlabeled data sets that are not known to the learner. This problem arises in several ap-
plications where data distributions fluctuate because of environmental, technical, or other
sources of variation. We introduce a formal framework for DG, and argue that it can be
viewed as a kind of supervised learning problem by augmenting the original feature space
with the marginal distribution of feature vectors. While our framework has several con-
nections to conventional analysis of supervised learning algorithms, several unique aspects
of DG require new methods of analysis.
This work lays the learning theoretic foundations of domain generalization, building on
our earlier conference paper where the problem of DG was introduced (Blanchard et al.,
2011). We present two formal models of data generation, corresponding notions of risk, and
distribution-free generalization error analysis. By focusing our attention on kernel meth-
ods, we also provide more quantitative results and a universally consistent algorithm. An
efficient implementation is provided for this algorithm, which is experimentally compared
to a pooling strategy on one synthetic and three real-world data sets.
Keywords: domain generalization, generalization error bounds, Rademacher complexity,
kernel methods, universal consistency, kernel approximation

1. Introduction
Domain generalization (DG) is a machine learning problem where the learner has access to
labeled training data sets from several related prediction problems, and must generalize to
a future prediction problem for which no labeled data are available. In more detail, there
are N labeled training data sets Si = (Xij , Yij )1≤j≤ni , i = 1, . . . , N , that describe similar
but possibly distinct prediction tasks. The objective is to learn a rule that takes as input
a previously unseen unlabeled test data set $X^T_1, \ldots, X^T_{n_T}$, and accurately predicts outcomes
for these or possibly other unlabeled points drawn from the associated learning task.

©2020 Gilles Blanchard, Aniket Anand Deshmukh, Ürun Dogan, Gyemin Lee and Clayton Scott.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.

DG arises in several applications. One prominent example is precision medicine, where a common objective is to design a patient-specific classifier (e.g., of health status) based
on clinical measurements, such as an electrocardiogram or electroencephalogram. In such
measurements, patient-to-patient variation is common, arising from biological variations
between patients, or technical or environmental factors influencing data acquisition. Be-
cause of patient-to-patient variation, a classifier that is trained on data from one patient
may not be well matched to another patient. In this context, domain generalization enables
the transfer of knowledge from historical patients (for whom labeled data are available) to a
new patient without the need to acquire training labels for that patient. A detailed example
in the context of flow cytometry is given below.
We view domain generalization as a conventional supervised learning problem where
the original feature space is augmented to include the marginal distribution generating the
features. We refer to this reframing of DG as “marginal transfer learning,” because it reflects
the fact that in DG, information about the test task must be drawn from that task’s marginal
feature distribution. Leveraging this perspective, we formulate two statistical frameworks
for analyzing DG. The first framework allows the observations within each data set to
have arbitrary dependency structure, and makes connections to the literature on Campbell
measures and structured prediction. The second framework is a special case of the first,
assuming the data points are drawn i.i.d. within each task, and allows for a more refined
risk analysis.
We further develop a distribution-free kernel machine that employs a kernel on the
aforementioned augmented feature space. Our methodology is shown to yield a universally
consistent learning procedure under both statistical frameworks, meaning that the domain
generalization risk tends to the best possible value as the relevant sample sizes tend to infinity,
with no assumptions on the data generating distributions. Although DG may be viewed as
a conventional supervised learning problem on an augmented feature space, the analysis is
nontrivial owing to unique aspects of the sampling plans and risks. We offer a computation-
ally efficient and freely available implementation of our algorithm, and present a thorough
experimental study validating the proposed approach on one synthetic and three real-world
data sets, including comparisons to a simple pooling approach.1
To our knowledge, the problem of domain generalization was first proposed and studied
by our earlier conference publication (Blanchard et al., 2011) which this work extends
in several ways. It adds (1) a new statistical framework, the agnostic generative model
described below; (2) generalization error and consistency results for the new statistical
model; (3) an extensive literature review; (4) an extension to the regression setting in
both theory and experiments; (5) a more general statistical analysis, in particular, we no
longer assume a bounded loss, and therefore accommodate common convex losses such as
the hinge and logistic losses; (6) extensive experiments (the conference paper considered a
single small data set); (7) a scalable implementation based on a novel extension of random
Fourier features; and (8) error analysis for the random Fourier features approximation.

1. Code is available at https://github.com/aniketde/DomainGeneralizationMarginal.


2. Motivating Application: Automatic Gating of Flow Cytometry Data


Flow cytometry is a high-throughput measurement platform that is an important clinical
tool for the diagnosis of blood-related pathologies. This technology allows for quantitative
analysis of individual cells from a given cell population, derived for example from a blood
sample from a patient. We may think of a flow cytometry data set as a set of d-dimensional
attribute vectors (Xj )1≤j≤n , where n is the number of cells analyzed, and d is the number
of attributes recorded per cell. These attributes pertain to various physical and chemical
properties of the cell. Thus, a flow cytometry data set may be viewed as a random sample
from a patient-specific distribution.
Now suppose a pathologist needs to analyze a new (test) patient with data (XjT )1≤j≤nT .
Before proceeding, the pathologist first needs the data set to be “purified” so that only cells
of a certain type are present. For example, lymphocytes are known to be relevant for the
diagnosis of leukemia, whereas non-lymphocytes may potentially confound the analysis. In
other words, it is necessary to determine the label YjT ∈ {−1, 1} associated to each cell,
where YjT = 1 indicates that the j-th cell is of the desired type.
In clinical practice this is accomplished through a manual process known as “gating.”
The data are visualized through a sequence of two-dimensional scatter plots, where at each
stage a line segment or polygon is manually drawn to eliminate a portion of the unwanted
cells. Because of the variability in flow cytometry data, this process is difficult to quantify
in terms of a small subset of simple rules. Instead, it requires domain-specific knowledge
and iterative refinement. Modern clinical laboratories routinely see dozens of cases per day,
so it is desirable to automate this process.
Since clinical laboratories maintain historical databases, we can assume access to a
number ($N$) of historical (training) patients that have already been expert-gated. Because of biological and technical variations in flow cytometry data, the distributions $P^{(i)}_{XY}$ of the historical patients will vary. To illustrate the flow cytometry gating problem, we use
the NDD data set from the FlowCap-I challenge.2 For example, Fig. 1 shows exemplary
two-dimensional scatter plots for two different patients – see caption for details. Despite
differences in the two distributions, there are also general trends that hold for all patients.
Virtually every cell type of interest has a known tendency (e.g., high or low) for most
measured attributes. Therefore, it is reasonable to assume that there is an underlying
distribution (on distributions) governing flow cytometry data sets, that produces roughly
similar distributions thereby making possible the automation of the gating process.

3. Formal Setting and General Results


In this section we formally define domain generalization via two possible data generation
models together with associated notions of risk. We also provide a basic generalization error
bound for the first of these data generation models.
Let X denote the observation space (assumed to be a Radon space) and Y ⊆ R the
output space. Let PX and PX ×Y denote the set of probability distributions on X and
X × Y, respectively. The spaces PX and PX ×Y are endowed with the topology of weak
convergence and the associated Borel σ-algebras.
2. We will revisit this data set in Section 8.5 where details are given.


Figure 1: Two-dimensional projections of multi-dimensional flow cytometry data. Each row corresponds to a single patient, and each column to a particular two-dimensional projection. The distribution of cells differs from patient to patient. The colors indicate the results of gating, where a particular type of cell, marked dark (blue), is separated from all other cells, marked bright (red). Labels were manually selected by a domain expert.

The disintegration theorem for joint probability distributions (see for instance Kallenberg, 2002, Theorem 6.4) tells us that (under suitable regularity properties, satisfied if $\mathcal{X}$ is a Radon space) any element $P_{XY} \in \mathcal{P}_{\mathcal{X} \times \mathcal{Y}}$ can be written as a Markov semi-direct product $P_{XY} = P_X \bullet P_{Y|X}$, with $P_X \in \mathcal{P}_{\mathcal{X}}$ and $P_{Y|X} \in \mathcal{P}_{\mathcal{Y}|\mathcal{X}}$, where $\mathcal{P}_{\mathcal{Y}|\mathcal{X}}$ is the space of conditional probability distributions of $Y$ given $X$, also called Markov transition kernels from $\mathcal{X}$ to $\mathcal{Y}$. This specifically means that

$$\mathbb{E}_{(X,Y) \sim P_{XY}}[h(X, Y)] = \int \Big( \int h(x, y)\, P_{Y|X}(dy \mid X = x) \Big)\, P_X(dx), \qquad (1)$$

for any integrable function $h : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. Following common terminology in the statistical learning literature, we will also call $P_{Y|X}$ the posterior distribution (of $Y$ given $X$).
We assume that $N$ training samples $S_i = (X_{ij}, Y_{ij})_{1 \le j \le n_i}$, $i = 1, \ldots, N$, are observed. To allow for possibly unequal sample sizes $n_i$, it is convenient to formally identify each sample $S_i$ with its associated empirical distribution $\widehat{P}^{(i)}_{XY} = \frac{1}{n_i} \sum_{j=1}^{n_i} \delta_{(X_{ij}, Y_{ij})} \in \mathcal{P}_{\mathcal{X} \times \mathcal{Y}}$. We assume that the ordering of the observations inside a given sample $S_i$ is arbitrary and does not contain any relevant information. We also denote by $\widehat{P}^{(i)}_X = \frac{1}{n_i} \sum_{j=1}^{n_i} \delta_{X_{ij}} \in \mathcal{P}_{\mathcal{X}}$ the $i$th training sample without labels. Similarly, a test sample is denoted by $S^T = (X^T_j, Y^T_j)_{1 \le j \le n_T}$, and the empirical distribution of the unlabeled data by $\widehat{P}^T_X$.


3.1 Data Generation Models


We propose two data generation models. The first is more general, and includes the second
as a special case.
Assumption 1 (AGM) There exists a distribution PS on PX ×Y such that S1 , . . . , SN are
i.i.d. realizations from PS .
We call this the agnostic generative model. This is a quite general model in which samples are
assumed to be identically distributed and independent of each other, but nothing particular
is assumed about the generation mechanism for observations inside a given sample, nor for
the (random) sample size.
We also introduce a more specific generative mechanism, where the observations $(X_{ij}, Y_{ij})$ inside the sample $S_i$ are themselves i.i.d. from $P^{(i)}_{XY}$, a latent unobserved random distribution, as follows. The symbol $\otimes$ indicates a product measure.

Assumption 2 (2SGM) There exists a distribution $\mu$ on $\mathcal{P}_{\mathcal{X} \times \mathcal{Y}}$ and a distribution $\nu$ on $\mathbb{N}$, such that $(P^{(1)}_{XY}, n_1), \ldots, (P^{(N)}_{XY}, n_N)$ are i.i.d. realizations from $\mu \otimes \nu$, and, conditional on $(P^{(i)}_{XY}, n_i)$, the sample $S_i$ consists of $n_i$ i.i.d. realizations of $(X, Y)$ following the distribution $P^{(i)}_{XY}$.

This model, called the 2-stage generative model, is a subcase of (AGM): since the $(P^{(i)}_{XY}, n_i)$ are i.i.d., the samples $S_i$ also are. This model was the one studied in our conference
paper (Blanchard et al., 2011). It has been considered in the distinct but related context of
“learning to learn” (Baxter, 2000; see also a more detailed discussion below, Section 4.2).
Many of our results will hold for the agnostic generative model, but the two-stage generative
model allows for additional developments.
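
To make the two-stage sampling concrete, the following minimal sketch simulates (2SGM); the particular choices of $\mu$ (tasks indexed by a latent Gaussian parameter) and $\nu$ (a Poisson law) are illustrative assumptions, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_task():
    """Draw a random task P_XY^(i) from an illustrative mu: a latent
    parameter theta shifts both class means, so here the posterior is a
    function of the marginal (cf. the discussion after Lemma 9)."""
    theta = rng.normal(0.0, 1.0, size=2)
    def sample(n):
        y = rng.choice([-1, 1], size=n)          # labels Y_ij
        x = theta + y[:, None] + rng.normal(0.0, 0.5, size=(n, 2))
        return x, y
    return sample

N = 5
tasks = []
for i in range(N):
    n_i = 1 + rng.poisson(50)                    # sample size n_i ~ nu
    sample_i = draw_task()                       # P_XY^(i) ~ mu
    tasks.append(sample_i(n_i))                  # S_i: n_i i.i.d. draws
```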
Since in (2SGM) we assume that the latent random distribution of the points in the
sample Si and its size ni are independent (which is not necessarily the case for (AGM)),
in this model it becomes a formally well-defined question to ask how the learning problem
evolves if we only change the size of the samples. In other words, we may study the setting
where the generating distribution µ remains fixed, but the sample size distribution ν changes.
In particular, this work examines the following different situations of interest in which the
distribution µ is fixed:
• The samples all have the same fixed size n, i.e. ν = δn ;

• The training samples are subsampled (without replacement) to a fixed size n in order
to reduce computational complexity; this reduces to the first setting;

• Both the number N of training samples and their sizes grow. In this case the size distribution νN depends on N (possibly νN = δn(N)).
We note that when the distribution of the sample sizes ni is a Poisson or a mixture of
Poisson distributions, the (2SGM) is (a particular case of) what is known as a Cox model
or doubly stochastic Poisson process in the point process literature (see, e.g., Daley and
Vere-Jones, 2003, Section 6.2), which is a Poisson process with random (inhomogeneous)
intensity.


3.2 Decision Functions and Augmented Feature Space


In domain generalization, the learner’s goal is to infer from the training data a general
rule that takes an arbitrary, previously unseen, unlabeled data set corresponding to a new
prediction task, and produces a classifier for that prediction task that could be applied to
any x (possibly outside the unlabeled data set). In other words, the learner should out-
put a mapping g : PX → (X → R). Equivalently, the learner should output a function
f : PX × X → R, where the two notations are related via g(PX )(x) = f (PX , x). In the
latter viewpoint, f may be viewed as a standard decision function on the “augmented”
or “extended” feature space PX × X , which facilitates connections to standard supervised
learning. We refer to this view of DG as marginal transfer learning, because the informa-
tion that facilitates generalization to a new task is conveyed entirely through the marginal
distribution. In the next two subsections, we present two definitions of the risk of a decision
function f , one associated to each of the two data generation models.

3.3 Risk and Generalization Error Bound under the Agnostic Generative
Model
Consider a test sample $S^T = (X^T_j, Y^T_j)_{1 \le j \le n_T}$, whose labels are not observed. If $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function for a single prediction, and the predictions of a fixed decision function $f$ on the test sample are given by $\widehat{Y}^T_j = f(\widehat{P}^T_X, X^T_j)$, then the empirical average loss incurred on the test sample is

$$L(S^T, f) := \frac{1}{n_T} \sum_{j=1}^{n_T} \ell\big(\widehat{Y}^T_j, Y^T_j\big).$$

Thus, we define the risk of a decision function as the average of the above quantity when test samples are drawn according to the same mechanism as the training samples:

$$\mathcal{E}(f) := \mathbb{E}_{S^T \sim P_S}\big[L(S^T, f)\big] = \mathbb{E}_{S^T \sim P_S}\Bigg[\frac{1}{n_T} \sum_{j=1}^{n_T} \ell\big(f(\widehat{P}^T_X, X^T_j), Y^T_j\big)\Bigg].$$

In a similar way, we define the empirical risk of a decision function as its average prediction error over the training samples:

$$\widehat{\mathcal{E}}(f, N) := \frac{1}{N} \sum_{i=1}^{N} L(S_i, f) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\big(f(\widehat{P}^{(i)}_X, X_{ij}), Y_{ij}\big). \qquad (2)$$
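
In code, the empirical risk (2) is a plain double average in which the decision function receives the unlabeled empirical sample alongside each point. A minimal sketch (the logistic loss and the representation of $\widehat{P}^{(i)}_X$ by the raw feature matrix are illustrative assumptions; it can be run on the tasks list from the sketch in Section 3.1):

```python
import numpy as np

def logistic_loss(t, y):
    # ell(t, y) = log(1 + exp(-y t)), an illustrative Lipschitz loss
    return np.log1p(np.exp(-y * t))

def empirical_risk(f, tasks, loss=logistic_loss):
    """Empirical DG risk, Eq. (2): the average over tasks of the
    average per-task loss.  Each task is a pair (X_i, Y_i), and the
    feature matrix X_i doubles as the empirical marginal passed to f."""
    per_task = []
    for X_i, Y_i in tasks:
        preds = np.array([f(X_i, x) for x in X_i])
        per_task.append(loss(preds, Y_i).mean())
    return float(np.mean(per_task))

# example: a "pooled" decision function that ignores the marginal
# empirical_risk(lambda PX, x: x.sum(), tasks)
```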

Remark 3 It is possible to understand the above setting as a particular instance of a structured output learning problem (Tsochantaridis et al., 2005; Bakır et al., 2007), in which the input variable $X^*$ is $\widehat{P}^T_X$, and the "structured output" $Y^*$ is the collection of labels $(Y^T_i)_{1 \le i \le n_T}$ (matched to their respective input points). As is generally the case for structured output learning, the nature of the problem and the "structure" of the outputs is very much encoded in the particular form of the loss function. In our setting the loss function is additive over the labels forming the collection $Y^*$, and we will exploit this particular form for our method and analysis.


Remark 4 The risk $\mathcal{E}(f)$ defined above can be described in the following way: consider the random variable $\xi := (\widehat{P}_{XY}; (X, Y))$ obtained by first drawing $\widehat{P}_{XY}$ according to $P_S$ and then, conditionally on this, drawing $(X, Y)$ according to $\widehat{P}_{XY}$. The risk is then the expectation of a certain function of $\xi$ (namely $F_f(\xi) = \ell(f(\widehat{P}_X, X), Y)$). In the probability theory literature, the
distribution of the variable ξ is known as the Campbell measure associated to the distribu-
tion PS over the measure space PX ×Y ; this object is in particular of fundamental use in
point process theory (see, e.g., Daley and Vere-Jones, 2008, Section 13.1). We will denote
it by C(PS ) here. This intriguing connection suggests that more elaborate tools of point
process literature may find their use to analyze DG when various classical point processes
are considered for the generating distribution. The Campbell measure will also appear in
the Rademacher analysis below.

The next result establishes an analogue of classical Rademacher analysis under the
agnostic generative model.

Theorem 5 (Uniform estimation error control under (AGM)) Let $\mathcal{F}$ be a class of decision functions $\mathcal{P}_{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}$. Assume the following boundedness condition holds:

$$\sup_{f \in \mathcal{F}}\ \sup_{P_X \in \mathcal{P}_{\mathcal{X}}}\ \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \ell(f(P_X, x), y) \le B_\ell. \qquad (3)$$

Under (AGM), if $S_1, \ldots, S_N$ are i.i.d. realizations from $P_S$, then with probability at least $1 - \delta$ with respect to the draws of the training samples:

$$\sup_{f \in \mathcal{F}} \Big( \widehat{\mathcal{E}}(f, N) - \mathcal{E}(f) \Big) \le \mathbb{E}_{(\widehat{P}^{(i)}_{XY}; (X_i, Y_i)) \sim C(P_S)^{\otimes N}}\, \mathbb{E}_{(\varepsilon_i)_{1 \le i \le N}} \Bigg[ \sup_{f \in \mathcal{F}} \frac{2}{N} \sum_{i=1}^{N} \varepsilon_i\, \ell\big(f(\widehat{P}^{(i)}_X, X_i), Y_i\big) \Bigg] + B_\ell \sqrt{\frac{\log(\delta^{-1})}{2N}}, \qquad (4)$$

where $(\varepsilon_i)_{1 \le i \le N}$ are i.i.d. Rademacher variables, independent from $(\widehat{P}^{(i)}_{XY}, (X_i, Y_i))_{1 \le i \le N}$, and $C(P_S)$ is the Campbell measure on $\mathcal{P}_{\mathcal{X} \times \mathcal{Y}} \times (\mathcal{X} \times \mathcal{Y})$ associated to $P_S$ (see Remark 4).

Proof Since the $(S_i)_{1 \le i \le N}$ are i.i.d., $\sup_{f \in \mathcal{F}} \big( \widehat{\mathcal{E}}(f, N) - \mathcal{E}(f) \big)$ takes the form of a uniform deviation between average and expected loss over the function class $\mathcal{F}$. We can therefore apply standard analysis (Azuma-McDiarmid inequality followed by Rademacher complexity analysis for a nonnegative bounded loss; see, e.g., Koltchinskii, 2001; Bartlett and Mendelson, 2002, Theorem 8) to obtain that with probability at least $1 - \delta$ with respect to the draw of the training samples $(S_i)_{1 \le i \le N}$:

$$\sup_{f \in \mathcal{F}} \Big( \widehat{\mathcal{E}}(f, N) - \mathcal{E}(f) \Big) \le \mathbb{E}_{(S_i)_{1 \le i \le N}}\, \mathbb{E}_{(\varepsilon_i)_{1 \le i \le N}} \Bigg[ \sup_{f \in \mathcal{F}} \frac{2}{N} \sum_{i=1}^{N} \varepsilon_i L(S_i, f) \Bigg] + B_\ell \sqrt{\frac{\log(\delta^{-1})}{2N}},$$

where $(\varepsilon_i)_{1 \le i \le N}$ are i.i.d. Rademacher variables, independent of $(S_i)_{1 \le i \le N}$.
We may write

$$L(S_i, f) = \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\big(f(\widehat{P}^{(i)}_X, X_{ij}), Y_{ij}\big) = \mathbb{E}_{(X,Y) \sim \widehat{P}^{(i)}_{XY}} \Big[ \ell\big(f(\widehat{P}^{(i)}_X, X), Y\big) \Big];$$


thus, we have

$$\mathbb{E}_{(S_i)_{1 \le i \le N}}\, \mathbb{E}_{(\varepsilon_i)_{1 \le i \le N}} \Bigg[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{N} \varepsilon_i L(S_i, f) \Bigg] = \mathbb{E}_{(\widehat{P}^{(i)}_{XY})_{1 \le i \le N}}\, \mathbb{E}_{(\varepsilon_i)_{1 \le i \le N}} \Bigg[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{N} \varepsilon_i\, \mathbb{E}_{(X_i, Y_i) \sim \widehat{P}^{(i)}_{XY}} \Big[ \ell\big(f(\widehat{P}^{(i)}_X, X_i), Y_i\big) \Big] \Bigg]$$

$$\le \mathbb{E}_{(\widehat{P}^{(i)}_{XY})_{1 \le i \le N}}\, \mathbb{E}_{(X_1, Y_1) \sim \widehat{P}^{(1)}_{XY}, \ldots, (X_N, Y_N) \sim \widehat{P}^{(N)}_{XY}}\, \mathbb{E}_{(\varepsilon_i)_{1 \le i \le N}} \Bigg[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{N} \varepsilon_i\, \ell\big(f(\widehat{P}^{(i)}_X, X_i), Y_i\big) \Bigg].$$

In the above inequality, the inner expectation over the $(X_i, Y_i)$ is pulled outside the supremum by Jensen's inequality and convexity of the supremum.
To obtain the announced estimate, notice that the above expectation is the same as the expectation with respect to the $N$-fold Campbell measure $C(P_S)$.

Remark 6 The main term in the theorem is just the conventional Rademacher complexity
for the augmented feature space PX × X endowed with the Campbell measure C(PS ). It
could also be thought of as the Rademacher complexity for the meta-distribution PS .

3.4 Idealized Risk under the 2-stage Generative Model


The additional structure of (2SGM) allows us to define a different notion of risk under this model. Toward this end, let $P^T_{XY}$ denote the test data distribution, $P^T_X$ the marginal $X$-distribution of $P^T_{XY}$, $n_T$ the test sample size, $S^T = (X^T_i, Y^T_i)_{1 \le i \le n_T}$ the test sample, and $\widehat{P}^T_X$ the empirical $X$-distribution. Parallel to the training data generating mechanism under (2SGM), we assume that $P^T_{XY}$ is drawn according to $\mu$.
We first define the risk of any $f : \mathcal{P}_{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}$, conditioned on a test sample of size $n_T$, to be

$$\mathcal{E}(f \mid n_T) := \mathbb{E}_{P^T_{XY} \sim \mu}\, \mathbb{E}_{(X^T_i, Y^T_i)_{1 \le i \le n_T} \sim (P^T_{XY})^{\otimes n_T}} \Bigg[ \frac{1}{n_T} \sum_{i=1}^{n_T} \ell\big(f(\widehat{P}^T_X, X^T_i), Y^T_i\big) \Bigg]. \qquad (5)$$

In this definition, the test sample $S^T$ consists of $n_T$ i.i.d. draws from a random $P^T_{XY}$ drawn from $\mu$. This conditional risk may be viewed as the previously defined risk for (AGM) specialized to (2SGM), where $\nu = \delta_{n_T}$.
We are particularly interested in the idealized situation where the test sample size $n_T$ grows to infinity. By the law of large numbers, as $n_T$ grows, $\widehat{P}^T_X$ converges to $P^T_X$ (in the sense of weak convergence). This motivates the introduction of the following idealized risk, which assumes access to an infinite test sample, and thus to the true marginal $P^T_X$:

$$\mathcal{E}^\infty(f) := \mathbb{E}_{P^T_{XY} \sim \mu}\, \mathbb{E}_{(X^T, Y^T) \sim P^T_{XY}} \big[ \ell\big(f(P^T_X, X^T), Y^T\big) \big]. \qquad (6)$$

Note that both notions of risk in (5) and (6) depend on $\mu$, but not on $\nu$.
The following proposition more precisely motivates viewing $\mathcal{E}^\infty(f)$ as a limiting case of $\mathcal{E}(f \mid n_T)$.


Proposition 7 Assume $\ell$ is a bounded, $L$-Lipschitz loss function and $f : \mathcal{P}_{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}$ is a fixed decision function which is continuous with respect to both its arguments (recalling that $\mathcal{P}_{\mathcal{X}}$ is endowed with the weak convergence topology). Then it holds under (2SGM):

$$\lim_{n_T \to \infty} \mathcal{E}(f \mid n_T) = \mathcal{E}^\infty(f).$$

Remark 8 This result provides one setting where the risk E ∞ is clearly motivated as the
goal of asymptotic analysis when nT → ∞. Although Proposition 7 is not used elsewhere in
this work, a more quantitative version of this result is stated below for kernels (see Theorem
15), where convergence holds uniformly and the assumption of a bounded loss is dropped.

To gain more insight into the idealized risk $\mathcal{E}^\infty$, recalling the standard decomposition (1) of $P_{XY}$ into the marginal $P_X$ and the posterior $P_{Y|X}$, we observe that we can apply the disintegration theorem not only to any $P_{XY}$, but also to $\mu$, and thus decompose it into two parts: $\mu_X$, which generates the marginal distribution $P_X$, and $\mu_{Y|X}$, which, conditioned on $P_X$, generates the posterior $P_{Y|X}$. (More precise notation might be $\mu_{P_X}$ instead of $\mu_X$ and $\mu_{P_{Y|X}|P_X}$ instead of $\mu_{Y|X}$, but this is rather cumbersome.) Denote $\widetilde{X} = (P_X, X)$. We then have, using Fubini's theorem,

$$\mathcal{E}^\infty(f) = \mathbb{E}_{P_X \sim \mu_X}\, \mathbb{E}_{P_{Y|X} \sim \mu_{Y|X}}\, \mathbb{E}_{X \sim P_X}\, \mathbb{E}_{Y \sim P_{Y|X}} \big[ \ell(f(\widetilde{X}), Y) \big]$$
$$= \mathbb{E}_{P_X \sim \mu_X}\, \mathbb{E}_{X \sim P_X}\, \mathbb{E}_{P_{Y|X} \sim \mu_{Y|X}}\, \mathbb{E}_{Y \sim P_{Y|X}} \big[ \ell(f(\widetilde{X}), Y) \big]$$
$$= \mathbb{E}_{(\widetilde{X}, Y) \sim Q_\mu} \big[ \ell(f(\widetilde{X}), Y) \big].$$

Here $Q_\mu$ is the distribution that generates $\widetilde{X}$ by first drawing $P_X$ according to $\mu_X$, and then drawing $X$ according to $P_X$; similarly, $Y$ is generated, conditioned on $\widetilde{X}$, by first drawing $P_{Y|X}$ according to $\mu_{Y|X}$, and then drawing $Y$ from $P_{Y|X}$. (The distribution of $\widetilde{X}$ again takes the form of a Campbell measure; see Remark 4.)
From the previous expression, we see that the risk $\mathcal{E}^\infty$ is like a standard supervised learning risk based on $(\widetilde{X}, Y) \sim Q_\mu$. Thus, we can deduce properties that are known to hold for supervised learning risks. For example, in the binary classification setting, if the loss is the 0/1 loss, then $f^*(\widetilde{X}) = 2\widetilde{\eta}(\widetilde{X}) - 1$ is an optimal predictor, where $\widetilde{\eta}(\widetilde{X}) = \mathbb{E}_{Y \sim Q_{\mu, Y|\widetilde{X}}}\big[\mathbf{1}_{\{Y = 1\}}\big]$, and

$$\mathcal{E}^\infty(f) - \mathcal{E}^\infty(f^*) = \mathbb{E}_{\widetilde{X} \sim Q_{\mu, \widetilde{X}}} \Big[ \mathbf{1}_{\{\operatorname{sign}(f(\widetilde{X})) \ne \operatorname{sign}(f^*(\widetilde{X}))\}}\, |2\widetilde{\eta}(\widetilde{X}) - 1| \Big].$$

Furthermore, consistency in the sense of $\mathcal{E}^\infty$ with respect to a general loss $\ell$ (thought of as a surrogate) will imply consistency for the 0/1 loss, provided $\ell$ is classification calibrated (Bartlett et al., 2006).
For a given loss $\ell$, the optimal or Bayes $\mathcal{E}^\infty$-risk in DG is in general larger than the expected Bayes risk under the (random) test sample generating distribution $P^T_{XY}$, because it is typically not possible to fully determine the Bayes-optimal predictor from only the marginal $P^T_X$. There is, however, a condition under which, for $\mu$-almost all test distributions $P^T_{XY}$, the decision function $f^*(P^T_X, \cdot)$ (where $f^*$ is a global minimizer of Equation 6) coincides with an optimal Bayes decision function for $P^T_{XY}$. This condition is simply that the posterior $P_{Y|X}$ is ($\mu$-almost surely) a function of $P_X$ (in other words, with the notation introduced above, $\mu_{Y|X}(P_X)$ is a Dirac measure for $\mu$-almost all $P_X$). Although we will not be assuming this condition throughout the paper under (2SGM), observe that it is implicitly assumed in the motivating application presented in Section 2, where an expert labels the data points by just looking at their marginal distribution.

Lemma 9 For a fixed distribution $P_{XY}$ and a decision function $g : \mathcal{X} \to \mathbb{R}$, let us denote $R(g, P_{XY}) = \mathbb{E}_{(X,Y) \sim P_{XY}}[\ell(g(X), Y)]$ and

$$R^*(P_{XY}) := \min_{g : \mathcal{X} \to \mathbb{R}} R(g, P_{XY}) = \min_{g : \mathcal{X} \to \mathbb{R}} \mathbb{E}_{(X,Y) \sim P_{XY}}[\ell(g(X), Y)]$$

the corresponding optimal (Bayes) risk for the loss function $\ell$ under data distribution $P_{XY}$. Then under (2SGM):
$$\mathcal{E}^\infty(f^*) \ge \mathbb{E}_{P_{XY} \sim \mu}[R^*(P_{XY})],$$
where $f^* : \mathcal{P}_{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}$ is a minimizer of the idealized DG risk $\mathcal{E}^\infty$ defined in (6).
Furthermore, if $\mu$ is a distribution on $\mathcal{P}_{\mathcal{X} \times \mathcal{Y}}$ such that $\mu$-a.s. it holds $P_{Y|X} = F(P_X)$ for some deterministic mapping $F$, then for $\mu$-almost all $P_{XY}$:
$$R(f^*(P_X, \cdot), P_{XY}) = R^*(P_{XY})$$
and
$$\mathcal{E}^\infty(f^*) = \mathbb{E}_{P_{XY} \sim \mu}[R^*(P_{XY})].$$

Proof For any $f : \mathcal{P}_{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}$, one has for all $P_{XY}$: $R(f(P_X, \cdot), P_{XY}) \ge R^*(P_{XY})$. Taking expectation with respect to $P_{XY}$ establishes the first claim. Now for any fixed $P_X \in \mathcal{P}_{\mathcal{X}}$, consider $P_{XY} := P_X \bullet F(P_X)$ and let $g^*(P_X)$ be a Bayes decision function for this joint distribution. Set $f(P_X, x) := g^*(P_X)(x)$. Then $f$ coincides for $\mu$-almost all $P_{XY}$ with a Bayes decision function for $P_{XY}$, achieving equality in the above inequality. The second equality follows by taking expectation over $P_{XY} \sim \mu$.

Under (2SGM), we will establish that our proposed learning algorithm is E ∞ -consistent,
provided the average sample size grows to infinity as well as the total number of samples.
Thus, the above result provides a condition on µ under which it is possible to asymptotically
attain the (classical, single-task) Bayes risk on any test distribution although no labels from
this test distribution are observed.
More generally, and speaking informally, if µ is such that PY |X is close to being a
function of PX in some sense, we can expect the Bayes E ∞ -risk for domain generalization
to be close to the expected (classical single-task) Bayes risk for a random test distribution.
We reiterate, however, that we make no assumptions on µ in this work so that the two
quantities may be far apart. In the worst case, the posterior may be independent of the
marginal, in which case a method for domain generalization will do no better than the
naïve pooling strategy. For further discussion, see the comparison of domain adaptation
and domain generalization in the next section.


4. Related Work
Since at least the 1990s, machine learning researchers have investigated the possibility of
solving one learning problem by leveraging data from one or more related problems. In this
section, we provide an overview of such problems and their relation to domain generalization,
while also reviewing prior work on DG.
Two critical terms are domain and task. Use of these terms is not consistent throughout
the literature, but at a minimum, the domain of a learning problem describes the input
(feature) space X and marginal distribution of X, while the task describes the output space
Y and the conditional distribution of Y given X (also called posterior). In many settings,
however, the sets X and Y are the same for all learning problems, and the terms “domain”
and “task” are used interchangeably to refer to a joint distribution PXY on X × Y. This is
the perspective adopted in this work, as well as in much of the work on multi-task learning,
domain adaptation (DA), and domain generalization.
Multi-task learning is similar to DG, except only the training tasks are of interest,
and the goal is to leverage the similarity among distributions to improve the learning of
individual predictors for each task (Caruana, 1997; Evgeniou et al., 2005; Yang et al., 2009).
In contrast, in DG, we are concerned with generalization to a new task.

4.1 Domain Generalization vs. Domain Adaptation


Domain adaptation refers to the setting in which there is a specific target task and one or
more source tasks. The goal is to design a predictor for the target task, for which there are
typically few to no labeled training examples, by leveraging labeled training data from the
source task(s).
Formulations of domain adaptation may take several forms, depending on the number
of sources and whether there are any labeled examples from the target to supplement
the unlabeled examples. In multi-source, unsupervised domain adaptation, the learner is
presented with labeled training data from several source distributions, and unlabeled data
from a target marginal distribution (see Zhang et al. (2015) and references therein). Thus,
the available data are the same as in domain generalization, and algorithms for one of these
problems may be applied to the other.
In all forms of DA, the goal is to attain optimal performance with respect to the joint
distribution of the target domain. For example, if the performance measure is a risk, the goal
is to attain the Bayes risk for the target domain. To achieve this goal, it is necessary to make
assumptions about how the source and target distributions are related (Quionero-Candela
et al., 2009). For example, several works adopt the covariate shift assumption, which
requires the source and target domains to have the same posterior, allowing the marginals
to differ arbitrarily (Zadrozny, 2004; Huang et al., 2007; Cortes et al., 2008; Sugiyama
et al., 2008; Bickel et al., 2009; Kanamori et al., 2009; Yu and Szepesvari, 2012; Ben-David
and Urner, 2012). Another common assumption is target shift, which stipulates that the
source and target have the same class-conditional distributions, allowing the prior class
probability to change (Hall, 1981; Titterington, 1983; Latinne et al., 2001; Storkey, 2009;
Du Plessis and Sugiyama, 2012; Sanderson and Scott, 2014; Azizzadenesheli et al., 2019).
Mansour et al. (2009b); Zhang et al. (2015) assume that the target posterior is a weighted
combination of source posteriors, while Zhang et al. (2013); Gong et al. (2016) extend target


shift by also allowing the class-conditional distributions to undergo a location-scale shift, and Tasche (2017) assumes the ratio of class-conditional distributions is unchanged. Work
on classification with label noise assumes the source data are obtained from the target
distribution but the labels have been corrupted in either a label-dependent (Blanchard
et al., 2016; Natarajan et al., 2018; van Rooyen and Williamson, 2018) or feature-dependent
(Menon et al., 2018; Cannings et al., 2018; Scott, 2019) way. Finally, there are several works
that assume the existence of a predictor that achieves good performance on both source and
target domains (Ben-David et al., 2007, 2010; Blitzer et al., 2008; Mansour et al., 2009a;
Cortes et al., 2015; Germain et al., 2016).
The key difference between DG and DA may be found in the performance measures
optimized. In DG, the goal is to design a single predictor f (PX , x) that can apply to any
future task, and risk is assessed with respect to the draw of both a new task, and (under
2SGM) a new data point from that task. This is in contrast to DA, where the target
distribution is typically considered fixed, and the goal is to design a predictor f (x) where, in
assessing the risk, the only randomness is in the draw of a new sample from the target task.
This difference in performance measures for DG and DA has an interesting consequence
for analysis. As we will show, it is possible to attain optimal risk (asymptotically) in DG
without making any distributional assumptions like those described above for DA. Of course,
this optimal risk is typically larger than the Bayes risk for any particular target domain
(see Lemma 9). An interesting question for future research is whether it is possible to
close or eliminate this gap (between DG and expected DA risks) by imposing distributional
assumptions like those for DA.
Another difference between DA and DG lies in whether the learning algorithm must be
rerun for each new test data set. Most unsupervised DA methods employ the unlabeled
target data for training and thus, when a new unlabeled target data set is presented, the
learning algorithm must be rerun. In contrast, most existing DG methods do not assume
access to the unlabeled test data at learning time, and are capable of making predictions
as new unlabeled data sets arrive without any further training.

4.2 Domain Generalization vs. Learning to Learn


In the problem of learning to learn (LTL, Thrun, 1996), which has also been called bias
learning, meta-learning, and (typically in an online setting) lifelong learning, there are
labeled data sets for several tasks, as in DG. There is also a given family of learning algo-
rithms, and the objective is to design a meta-learner that selects the learning algorithm that
will perform best on future tasks. The learning theoretic study of LTL traces to the work
of Baxter (2000), who was the first to propose a distribution on tasks, which he calls an
“environment,” and which coincides with our µ. Given this setting, the performance of the
learning algorithm selected by a meta-learner is obtained by drawing a new task at random,
drawing a labeled training data set from that task, running the selected algorithm, drawing
a test point, and evaluating the expected loss, where the expectation is with respect to all
sources of randomness (new task, training data from new task, test point from new task).
Baxter analyzes learning algorithms given by usual empirical risk minimization over a
hypothesis (prediction function) class, and the goal of the meta-learner is then to select a
hypothesis class from a family of such classes. He shows that it is possible to find a good


trade-off between the complexity of a hypothesis class and its approximation capabilities
for tasks sampled from µ, in an average sense. In particular, the information gained by
finding a well-adapted hypothesis class can lead to significantly improved sample efficiency
when learning a new task. See Maurer (2009) for further discussion of the results of Baxter
(2000).
Later work on LTL establishes similar results that quantify the ability of a meta-learner
to transfer knowledge to a new task. These meta-learners all optimize a particular structure
that defines a learning algorithm, such as a feature representation (Maurer, 2009; Maurer
et al., 2016; Denevi et al., 2018b), a prior on predictors in a PAC-Bayesian setting (Pentina
and Lampert, 2014), a dictionary (Maurer et al., 2013), the bias of a regularizer (Denevi
et al., 2018a), and a pretrained neural network (Finn et al., 2017). It is also worth not-
ing that some algorithms for multi-task learning extract structures that characterize an
environment and can be applied to LTL.
Although DG and LTL both involve generalization to a new task, they are clearly
different problems because LTL assumes access to labeled data from the new task, whereas
DG only sees unlabeled data and requires no additional learning. In LTL, the learner can
achieve the Bayes risk for the new task, the only issue is the sample complexity. DG is
thus a more challenging problem, but also potentially more useful since in many transfer
learning settings, labeled data for the new task are unavailable.

4.3 Prior Work on Domain Generalization


To our knowledge, the first paper to consider domain generalization (as formulated in Sec-
tion 3.2) was our earlier conference paper (Blanchard et al., 2011). The term “domain
generalization” was coined by Muandet et al. (2013), who study the same setting and build
upon our work by extracting features that facilitate DG. Yang et al. (2013) study an ac-
tive learning variant of DG in the realizable setting, and directly learn the task sampling
distribution.
Other methods for DG were studied by Khosla et al. (2012); Xu et al. (2014); Grubinger
et al. (2015); Ghifary et al. (2015); Gan et al. (2016); Ghifary et al. (2017); Motiian et al.
(2017); Li et al. (2017, 2018a,b,c,d); Balaji et al. (2018); Ding and Fu (2018); Shankar
et al. (2018); Hu et al. (2019); Dou et al. (2019); Carlucci et al. (2019); Wang et al. (2019);
Akuzawa et al. (2019). Many of these methods learn a common feature space for all tasks.
Such methods are complementary to the method that we study. Indeed, our kernel-based
learning algorithm may be applied after having learned a feature representation by another
method, as was done by Muandet et al. (2013). Since our interest is primarily theoretical,
we restrict our experimental comparison to another algorithm that also operates directly on
the original input space, namely, a simple pooling algorithm that lumps all training tasks
into a single data set and trains a single support vector machine.

5. Learning Algorithm
In this section, we introduce a concrete algorithm to tackle the learning problem described in
Section 3, using an approach based on kernels. The function k : Ω × Ω → R is called a kernel
on Ω if the matrix (k(xi , xj ))1≤i,j≤n is symmetric and positive semi-definite for all positive
integers n and all x1 , . . . , xn ∈ Ω. It is well known that every kernel k on Ω is associated to


a space of functions f : Ω → R called the reproducing kernel Hilbert space (RKHS) Hk with
kernel k. One way to envision Hk is as follows. Define Φ(x) := k(·, x), which is called the
canonical feature map associated with k. Then the span of {Φ(x) : x ∈ Ω}, endowed with
the inner product hΦ(x), Φ(x0 )i = k(x, x0 ), is dense in Hk . We also recall the reproducing
property, which states that hf, Φ(x)i = f (x) for all f ∈ Hk and x ∈ Ω.
For later use, we introduce the notion of a universal kernel. A kernel k on a compact
metric space Ω is said to be universal when its RKHS is dense in C(Ω), the set of continuous
functions on Ω, with respect to the supremum norm. Universal kernels are important for
establishing universal consistency of many learning algorithms. See Steinwart and Christ-
mann (2008) for background on kernels and reproducing kernel Hilbert spaces.
Several well-known learning algorithms, such as support vector machines and kernel
ridge regression, may be viewed as minimizers of a norm-regularized empirical risk over
the RKHS of a kernel. A similar development has also been made for multi-task learning
(Evgeniou et al., 2005). Inspired by this framework, we consider a general kernel-based
algorithm as follows.
Consider the loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$. Let $k$ be a kernel on $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$, and let $\mathcal{H}_k$ be the associated RKHS. For the sample $S_i$, recall that $\widehat{P}^{(i)}_X = \frac{1}{n_i} \sum_{j=1}^{n_i} \delta_{X_{ij}}$ denotes the corresponding empirical $X$-distribution. Also consider the extended input space $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$ and the extended data $\widetilde{X}_{ij} = (\widehat{P}^{(i)}_X, X_{ij})$. Note that $\widehat{P}^{(i)}_X$ plays a role analogous to the task index in multi-task learning. Now define

$$\widehat{f}_\lambda = \arg\min_{f \in \mathcal{H}_k}\ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\big(f(\widetilde{X}_{ij}), Y_{ij}\big) + \lambda \|f\|^2. \qquad (7)$$

Algorithms for solving (7) will be discussed in Section 7.
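
To make the structure of (7) concrete, note that for the squared loss the representer theorem reduces it to a weighted kernel ridge regression over the extended points $\widetilde{X}_{ij}$. A minimal sketch under that assumption (the experiments below instead use the hinge loss with an SVM solver; the Gram matrix K over extended points is assumed precomputed, e.g., with the product kernel of Section 5.1):

```python
import numpy as np

def fit_weighted_krr(K, y, weights, lam):
    """Solve (7) for the squared loss.  Writing f = sum_m alpha_m
    k(xtilde_m, .), the objective becomes
        sum_m w_m ((K alpha)_m - y_m)^2 + lam * alpha' K alpha,
    whose stationarity condition (assuming K invertible) is
        (W K + lam I) alpha = W y."""
    M = K.shape[0]
    W = np.diag(weights)
    return np.linalg.solve(W @ K + lam * np.eye(M), W @ y)

# weights w_m = 1 / (N * n_i) for a point belonging to task i
# reproduce the double average in (7) exactly
```

For a convex non-quadratic loss such as the hinge, the same per-point weights can be passed to a standard weighted solver instead.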

5.1 Specifying the Kernels


In the rest of the paper we will consider a kernel $k$ on $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$ of the product form

$$k((P_1, x_1), (P_2, x_2)) = k_P(P_1, P_2)\, k_X(x_1, x_2), \qquad (8)$$

where $k_P$ is a kernel on $\mathcal{P}_{\mathcal{X}}$ and $k_X$ a kernel on $\mathcal{X}$.
Furthermore, we will consider kernels on $\mathcal{P}_{\mathcal{X}}$ of a particular form. Let $k'_X$ denote a kernel on $\mathcal{X}$ (which might be different from $k_X$) that is measurable and bounded. We define the kernel mean embedding $\Psi : \mathcal{P}_{\mathcal{X}} \to \mathcal{H}_{k'_X}$:

$$P_X \mapsto \Psi(P_X) := \int_{\mathcal{X}} k'_X(x, \cdot)\, dP_X(x). \qquad (9)$$

This mapping has been studied in the framework of "characteristic kernels" (Gretton et al., 2007a), and it has been proved that universality of $k'_X$ implies injectivity of $\Psi$ (Gretton et al., 2007b; Sriperumbudur et al., 2010).
et al., 2007b; Sriperumbudur et al., 2010).


Note that the mapping $\Psi$ is linear. Therefore, if we consider the kernel $k_P(P_X, P'_X) = \langle \Psi(P_X), \Psi(P'_X) \rangle$, it is a linear kernel on $\mathcal{P}_{\mathcal{X}}$ and cannot be a universal kernel. For this reason, we introduce yet another kernel $K$ on $\mathcal{H}_{k'_X}$ and consider the kernel on $\mathcal{P}_{\mathcal{X}}$ given by

$$k_P(P_X, P'_X) = K\big(\Psi(P_X), \Psi(P'_X)\big). \qquad (10)$$


Note that particular kernels inspired by the finite dimensional case are of the form

$$K(v, v') = F(\|v - v'\|), \qquad (11)$$

or

$$K(v, v') = G(\langle v, v' \rangle), \qquad (12)$$

where $F, G$ are real functions of a real variable such that they define a kernel. For example, $F(t) = \exp(-t^2/(2\sigma^2))$ yields a Gaussian-like kernel, while $G(t) = (1 + t)^d$ yields a polynomial-like kernel. Kernels of the above form on the space of probability distributions over a compact space $\mathcal{X}$ have been introduced and studied in Christmann and Steinwart (2010). Below we apply their results to deduce that $k$ is a universal kernel for certain choices of $k_X$, $k'_X$, and $K$.
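
For an empirical marginal, the embedding (9) is the average of the feature maps $k'_X(X_j, \cdot)$, so the Gaussian-like choice (11) is computable in closed form: the squared distance $\|\Psi(\widehat{P}_1) - \Psi(\widehat{P}_2)\|^2$ expands into three Gram-matrix means. A minimal sketch, taking both $k_X$ and $k'_X$ Gaussian (sharing one bandwidth only for brevity; all bandwidths are illustrative):

```python
import numpy as np

def gauss_gram(A, B, sigma):
    # Gaussian kernel k'_X between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kP(X1, X2, sigma_x=1.0, sigma_P=1.0):
    """Gaussian-like kernel (10)-(11) between two empirical marginals,
    via their kernel mean embeddings (9): ||Psi(P1) - Psi(P2)||^2 is
    the mean of k' within X1, plus within X2, minus twice across."""
    d2 = (gauss_gram(X1, X1, sigma_x).mean()
          + gauss_gram(X2, X2, sigma_x).mean()
          - 2 * gauss_gram(X1, X2, sigma_x).mean())
    return np.exp(-d2 / (2 * sigma_P ** 2))

def k_product(X1, x1, X2, x2, sigma_x=1.0, sigma_P=1.0):
    # product kernel (8) on the extended feature space P_X x X
    kx = np.exp(-((x1 - x2) ** 2).sum() / (2 * sigma_x ** 2))
    return kP(X1, X2, sigma_x, sigma_P) * kx
```

With k_product one can assemble the Gram matrix over all extended training points $\widetilde{X}_{ij}$ that algorithm (7) operates on.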

5.2 Relation to Other Kernel Methods


By choosing $k$ differently, one can recover other existing kernel methods. In particular, consider the class of kernels of the same product form as above, but where

$$k_P(P_X, P'_X) = \begin{cases} 1 & \text{if } P_X = P'_X, \\ \tau & \text{if } P_X \ne P'_X. \end{cases}$$

If $\tau = 0$, the algorithm (7) corresponds to training $N$ kernel machines $f(\widehat{P}^{(i)}_X, \cdot)$ using kernel $k_X$ (e.g., support vector machines in the case of the hinge loss) on each training data set, independently of the others (note that this does not offer any generalization ability to a new data set). If $\tau = 1$, we have a "pooling" strategy that, in the case of equal sample sizes $n_i$, is equivalent to pooling all training data sets together in a single data set, and running a conventional supervised learning algorithm with kernel $k_X$ (i.e., this corresponds to trying to find a single "one-fits-all" prediction function which does not depend on the marginal).
In the intermediate case 0 < τ < 1, the resulting kernel is a “multi-task kernel,” and the
algorithm recovers a multitask learning algorithm like that of Evgeniou et al. (2005). We
compare to the pooling strategy below in our experiments. We also examined the multi-
task kernel with τ < 1, but found that, as far as generalization to a new unlabeled task is
concerned, it was always outperformed by pooling, and so those results are not reported.
This fits the observation that the choice τ = 0 does not provide any generalization to a
new task, while τ = 1 at least offers some form of generalization, if only by fitting the same
predictor to all data sets.
In the special case where all labels Yij are the same value for a given task, and kX is
taken to be the constant kernel, the problem we consider reduces to “distributional” classi-
fication or regression, which is essentially standard supervised learning where a distribution
(observed through a sample) plays the role of the feature vector. Many of our analysis
techniques specialize to this setting.

6. Learning Theoretic Study


This section presents generalization error and consistency analysis for the proposed kernel
method under the agnostic and 2-stage generative models. Although the regularized esti-
mation formula (7) defining $\widehat{f}_\lambda$ is standard, the generalization error analysis is not, owing
to the particular sampling structures and risks under (AGM) and (2SGM).


6.1 Universal Consistency under the Agnostic Generative Model


We will consider the following assumptions on the loss function and kernels:

(LB) The loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ is $L_\ell$-Lipschitz in its first variable and satisfies $B_0 := \sup_{y \in \mathcal{Y}} \ell(0, y) < \infty$.

(K-Bounded) The kernels $k_X$, $k'_X$, and $K$ are bounded respectively by constants $B_k^2, B_{k'}^2 \ge 1$, and $B_K^2$.

The condition B0 < ∞ always holds for classification, as well as certain regression
settings. The boundedness assumptions are clearly satisfied for Gaussian kernels, and can
be enforced by normalizing the kernel (discussed further below).
We begin with a generalization error bound that establishes uniform estimation error
control over functions belonging to a ball of Hk . We then discuss universal kernels, and
finally deduce universal consistency of the algorithm.
Let Bk (r) denote the closed ball of radius r, centered at the origin, in the RKHS of the
kernel k. We start with the following simple result allowing us to bound the loss on a RKHS
ball.

Lemma 10 Suppose $k$ is a kernel on a set $\Omega$, bounded by $B^2$. Let $\ell : \mathbb{R} \times \mathcal{Y} \to [0, \infty)$ be a loss satisfying (LB). Then for any $R > 0$ and $f \in B_k(R)$, and any $z \in \Omega$ and $y \in \mathcal{Y}$,

$$\ell(f(z), y) \le B_0 + L_\ell R B. \qquad (13)$$

Proof By the Lipschitz continuity of $\ell$, the reproducing property, and Cauchy-Schwarz, we have

$$\ell(f(z), y) \le \ell(0, y) + |\ell(f(z), y) - \ell(0, y)| \le B_0 + L_\ell |f(z) - 0| = B_0 + L_\ell |\langle f, k(z, \cdot) \rangle| \le B_0 + L_\ell \|f\|_{\mathcal{H}_k} B \le B_0 + L_\ell R B.$$

The expression in (13) serves to replace the boundedness assumption (3) in Theorem 5.
We now state the following, which is a specialization of Theorem 5 to the kernel setting.

Theorem 11 (Uniform estimation error control over RKHS balls) Assume (LB) and (K-Bounded) hold, and data generation follows (AGM). Then for any $R > 0$, with probability at least $1 - \delta$ (with respect to the draws of the samples $S_i$, $i = 1, \ldots, N$):

$$\sup_{f \in B_k(R)} \Big( \widehat{\mathcal{E}}(f, N) - \mathcal{E}(f) \Big) \le \frac{(B_0 + L_\ell R B_K B_k)\big(\sqrt{\log \delta^{-1}} + 2\big)}{\sqrt{N}}. \qquad (14)$$


Proof This is a direct consequence of Theorem 5 and of Lemma 10, the kernel $k$ on $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$ being bounded by $B_k^2 B_K^2$. As noted there, the main term in the upper bound (4) is a standard Rademacher complexity on the augmented input space $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$, endowed with the Campbell measure $C(P_S)$.
In the kernel learning context, we can bound the Rademacher complexity term using a standard bound for the Rademacher complexity of a Lipschitz loss function on the ball of radius $R$ of $\mathcal{H}_k$ (Koltchinskii, 2001; Bartlett and Mendelson, 2002, e.g., Theorems 8, 12 and Lemma 22 there), using again the bound $B_k^2 B_K^2$ on the kernel $k$, giving the conclusion.

Next, we turn our attention to universal kernels (see Section 5 for the definition). A relevant notion for our purposes is that of a normalized kernel. If $k$ is a kernel on $\Omega$, then

$$k^*(x, x') := \frac{k(x, x')}{\sqrt{k(x, x)\, k(x', x')}}$$

is the associated normalized kernel. If a kernel is universal, then so is its associated normalized kernel. For example, the exponential kernel $k(x, x') = \exp(\kappa \langle x, x' \rangle_{\mathbb{R}^d})$, $\kappa > 0$, can be shown to be universal on $\mathbb{R}^d$ through a Taylor series argument. Consequently, the Gaussian kernel

$$k_\sigma(x, x') := \frac{\exp\big(\frac{1}{\sigma^2} \langle x, x' \rangle\big)}{\exp\big(\frac{1}{2\sigma^2} \|x\|^2\big) \exp\big(\frac{1}{2\sigma^2} \|x'\|^2\big)}$$

is universal, being the normalized kernel associated with the exponential kernel with $\kappa = 1/\sigma^2$. See Steinwart and Christmann (2008) for additional details and discussion.
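
Writing out the normalization confirms this in one line:

$$k_\sigma(x, x') = \frac{\exp\big(\frac{1}{\sigma^2}\langle x, x'\rangle\big)}{\exp\big(\frac{1}{2\sigma^2}\|x\|^2\big)\exp\big(\frac{1}{2\sigma^2}\|x'\|^2\big)} = \exp\Big(-\frac{\|x\|^2 - 2\langle x, x'\rangle + \|x'\|^2}{2\sigma^2}\Big) = \exp\Big(-\frac{\|x - x'\|^2}{2\sigma^2}\Big),$$

which is the usual Gaussian kernel.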
To establish that k is universal on PX × X , the following lemma is useful.

Lemma 12 Let $\Omega, \Omega'$ be two compact spaces and $k, k'$ be kernels on $\Omega, \Omega'$, respectively. If $k, k'$ are both universal, then the product kernel

$$k((x, x'), (y, y')) := k(x, y)\, k'(x', y')$$

is universal on $\Omega \times \Omega'$.

Several examples of universal kernels are known on Euclidean space. For our purposes, we also need universal kernels on $\mathcal{P}_{\mathcal{X}}$. Fortunately, this was studied by Christmann and Steinwart (2010). Some additional assumptions on the kernels and feature space are required:

(K-Univ) $k_X$, $k'_X$, $K$, and $\mathcal{X}$ satisfy the following:

• $\mathcal{X}$ is a compact metric space;
• $k_X$ is universal on $\mathcal{X}$;
• $k'_X$ is continuous and universal on $\mathcal{X}$;
• $K$ is universal on any compact subset of $\mathcal{H}_{k'_X}$.

Adapting the results of Christmann and Steinwart (2010), we have the following.


Theorem 13 (Universal kernel) Assume condition (K-Univ) holds. Then, for kP de-
fined as in (10), the product kernel k in (8) is universal on PX × X .
Furthermore, the assumption on K is fulfilled if K is of the form (12), where G is an
analytical function with positive Taylor series coefficients, or if K is the normalized kernel
associated to such a kernel.

Proof By Lemma 12, it suffices to show $\mathcal{P}_{\mathcal{X}}$ is a compact metric space, and that $k_P(P_X, P'_X)$ is universal on $\mathcal{P}_{\mathcal{X}}$. The former statement follows from Theorem 6.4 of Parthasarathy (1967), where the metric is the Prohorov metric. We will deduce the latter statement from Theorem 2.2 of Christmann and Steinwart (2010). The statement of Theorem 2.2 there is apparently restricted to kernels of the form (12), but the proof actually only uses that the kernel $K$ is universal on any compact set of $\mathcal{H}_{k'_X}$. To apply Theorem 2.2, it remains to show that $\mathcal{H}_{k'_X}$ is a separable Hilbert space, and that $\Psi$ is injective and continuous. Injectivity of $\Psi$ is equivalent to $k'_X$ being a characteristic kernel, and follows from the assumed universality of $k'_X$ (Sriperumbudur et al., 2010). The continuity of $k'_X$ implies separability of $\mathcal{H}_{k'_X}$ (Steinwart and Christmann (2008), Lemma 4.33) as well as continuity of $\Psi$ (Christmann and Steinwart (2010), Lemma 2.3 and preceding discussion). Now Theorem 2.2 of Christmann and Steinwart (2010) may be applied, and the result follows.
The fact that kernels of the form (12), where $G$ is analytic with positive Taylor coefficients, are universal on any compact set of $\mathcal{H}_{k'_X}$ was established in the proof of Theorem 2.2 of the same work (Christmann and Steinwart, 2010).

As an example, suppose that $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$. Let $k_X$ and $k'_X$ be Gaussian kernels on $\mathcal{X}$. Taking $G(t) = \exp(t)$, it follows that $K(P_X, P'_X) = \exp(\langle \Psi(P_X), \Psi(P'_X) \rangle_{\mathcal{H}_{k'_X}})$ is universal on $\mathcal{P}_{\mathcal{X}}$. By similar reasoning as in the finite dimensional case, the Gaussian-like kernel $K(P_X, P'_X) = \exp(-\frac{1}{2\sigma^2} \|\Psi(P_X) - \Psi(P'_X)\|^2_{\mathcal{H}_{k'_X}})$ is also universal on $\mathcal{P}_{\mathcal{X}}$. Thus the product kernel is universal on $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$.
From Theorems 11 and 13, we may deduce universal consistency of the learning algo-
rithm.

Corollary 14 (Universal consistency) Assume that conditions (LB), (K-Bounded) and (K-Univ) are satisfied. Let $\lambda = \lambda(N)$ be a sequence such that, as $N \to \infty$, $\lambda(N) \to 0$ and $\lambda(N) N / \log N \to \infty$. Then

$$\mathcal{E}(\widehat{f}_{\lambda(N)}) \to \inf_{f : \mathcal{P}_{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}} \mathcal{E}(f) \quad \text{a.s., as } N \to \infty.$$

The proof of the corollary relies on the bound established in Theorem 11, the universality
of k established in Theorem 13, and otherwise relatively standard arguments.
One notable feature of this result is that we have established consistency where only N
is required to diverge. In particular, the training sample sizes ni may remain bounded. In
the next subsection, we consider the role of the ni under the 2-stage generative model.

6.2 Role of the Individual Sample Sizes under the 2-Stage Generative Model
In this section, we are concerned with the role of the individual sample sizes (ni )1≤i≤N , more
precisely, of their distribution ν under (2SGM), see Section 3.1. A particular motivation


for investigating this point is that in some applications the number of training points per
task is large, which can give rise to a high computational burden at the learning stage (and
also for storing the learned model in computer memory). A practical way to alleviate this
issue is to reduce the number of training points per task by random subsampling, which in
effect modifies the sample size distribution ν while keeping the generating distribution µ
for the tasks’ point distributions unchanged. Observe that under (AGM) the sample size
and the sample point distribution may be dependent in general, and subsampling would
then affect that relationship in an unknown manner. This is why we assume (2SGM) in
the present section.
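
In code, this subsampling amounts to replacing ν by (at most) δn while leaving µ untouched; a minimal sketch, reusing the per-task (X_i, Y_i) format of the earlier sketches:

```python
import numpy as np

def subsample_tasks(tasks, n, seed=0):
    """Subsample each training task to at most n points without
    replacement; this modifies the size distribution nu but not the
    task distribution mu."""
    rng = np.random.default_rng(seed)
    out = []
    for X_i, Y_i in tasks:
        m = min(n, len(Y_i))
        idx = rng.choice(len(Y_i), size=m, replace=False)
        out.append((X_i[idx], Y_i[idx]))
    return out
```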
We will consider the following additional assumption.

(K-Hölder) The canonical feature map $\Phi_K : \mathcal{H}_{k'_X} \to \mathcal{H}_K$ associated to $K$ satisfies a Hölder condition of order $\alpha \in (0, 1]$ with constant $L_K$ on $B_{k'_X}(B_{k'})$:

$$\forall v, w \in B_{k'_X}(B_{k'}) : \quad \|\Phi_K(v) - \Phi_K(w)\| \le L_K \|v - w\|^\alpha. \qquad (15)$$

Sufficient conditions for (15) are described in Section A.4. As an example, the condition is shown to hold with $\alpha = 1$ when $K$ is the Gaussian-like kernel on $\mathcal{H}_{k'_X}$.

Since we are interested in the influence of the number of training points per task, it is helpful to introduce notation for the (2SGM) risks conditioned on a fixed task $P_{XY}$. Thus, in analogy to (5)-(6) introduced in Section 3.4, we define the risk at sample size $n$, and the risk at infinite sample size, conditional on $P_{XY}$:

$$\mathcal{E}(f \mid P_{XY}, n) := \mathbb{E}_{S^T \sim (P_{XY})^{\otimes n}} \Bigg[ \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(\widehat{P}_X, X_i), Y_i\big) \Bigg]; \qquad (16)$$

$$\mathcal{E}^\infty(f \mid P_{XY}) := \mathbb{E}_{(X,Y) \sim P_{XY}} \big[ \ell(f(P_X, X), Y) \big]. \qquad (17)$$

The following proposition gives an upper bound on the discrepancy between these risks.
It can be seen as a quantitative version of Proposition 7 in the kernel setting, which is
furthermore uniform over an RKHS ball.
Theorem 15 Assume conditions (LB), (K-Bounded), and (K-Hölder) hold. If the sample $S = (X_j, Y_j)_{1 \le j \le n}$ is made of $n$ i.i.d. realizations from $P_{XY}$, with $P_{XY}$ and $n$ fixed, then for any $R > 0$, with probability at least $1 - \delta$:

$$\sup_{f \in B_k(R)} |L(S, f) - \mathcal{E}^\infty(f \mid P_{XY})| \le \big(B_0 + 3 L_\ell R B_k (B_{k'}^\alpha L_K + B_K)\big) \left( \frac{\log(3\delta^{-1})}{n} \right)^{\frac{\alpha}{2}}. \qquad (18)$$

Averaging over the draw of $S$, again with $P_{XY}$ and $n$ fixed, it holds for any $R > 0$:

$$\sup_{f \in B_k(R)} |\mathcal{E}(f \mid P_{XY}, n) - \mathcal{E}^\infty(f \mid P_{XY})| \le 2 L_\ell R B_k L_K B_{k'}^\alpha\, n^{-\alpha/2}. \qquad (19)$$

As a consequence, for the unconditional risks when $(P_{XY}, n)$ is drawn from $\mu \otimes \nu$ under (2SGM), for any $R > 0$:

$$\sup_{f \in B_k(R)} |\mathcal{E}(f) - \mathcal{E}^\infty(f)| \le 2 R L_\ell B_k L_K B_{k'}^\alpha\, \mathbb{E}_\nu\big[n^{-\frac{\alpha}{2}}\big]. \qquad (20)$$


The above results are useful in a number of ways. First, under (2SGM), we can consider
the goal of asymptotically achieving the idealized optimal risk $\inf_f \mathcal{E}^\infty(f)$, where we recall that $\mathcal{E}^\infty(f)$ is the expected loss of a decision function $f$ over a random test task $P^T_{XY}$ in the case where $P^T_X$ would be perfectly observed (this can be thought of as observing an infinite sample from the marginal). Equation (20) bounds the risk under (2SGM) in terms of the
risk under (AGM), for which we have already established consistency. Thus, consistency
to the idealized risk under (2SGM) will be possible if the number of examples ni per
training task also grows together with the number of training tasks N . The following result
formalizes this intuition.

Corollary 16 Assume (LB), (K-Bounded), and (K-Hölder), and assume (2SGM). Then for any $R > 0$, with probability at least $1 - \delta$ with respect to the draws of the training tasks and training samples:
$$\sup_{f \in B_k(R)} \left|\frac{1}{N}\sum_{i=1}^{N} \hat{L}(S_i, f) - \mathcal{E}^\infty(f)\right| \le (B_0 + L_\ell R B_K B_k)\frac{\big(\sqrt{\log \delta^{-1}} + 2\big)}{\sqrt{N}} + 2 R L_\ell B_k L_K B_{k'}^\alpha\, \mathbb{E}_\nu\!\left[n^{-\frac{\alpha}{2}}\right]. \tag{21}$$
Consider an asymptotic setting under (2SGM) in which, as the number of training tasks $N \to \infty$, the distribution $\mu$ remains fixed but the sample size distribution $\nu_N$ depends on $N$. Denote $\kappa_N := \mathbb{E}_{\nu_N}\!\left[n^{-\alpha/2}\right]^{-1}$. Assuming (K-Univ) is satisfied, and the regularization parameter $\lambda(N)$ is such that $\lambda(N) \to 0$ and $\lambda(N)\min(N, \kappa_N^2) \to \infty$, then
$$\mathcal{E}(\hat{f}_{\lambda(N)}) \to \inf_{f : \mathcal{P}_X \times \mathcal{X} \to \mathbb{R}} \mathcal{E}^\infty(f) \quad \text{in probability, as } N \to \infty.$$

Proof The setting is that of the (2SGM) model. This is a particular case of (AGM),
so we can apply Theorem 11 and combine with (20) to get the announced bound. The
consistency statement follows the same argument as in the proof of Corollary 14, with E(f )
replaced by E ∞ (f ), and ε(N ) there replaced by the RHS in (21).

Remark 17 The bound (21) is non-asymptotic and can be used as such to assess the respective roles of the number of tasks $N$ and of the sample size distribution $\nu$ when the objective is the idealized risk (see below). The consistency result for that risk, on the other hand, is formalized as a "triangular array" type of asymptotics, where the distribution of the sizes $n_i$ of the i.i.d. training samples $S_i$ changes with their number $N$.

Remark 18 Our conference paper (Blanchard et al., 2011) also established a generalization
error bound and consistency for E ∞ under (2SGM). That bound had a different form for
two main reasons. First, it assumed the loss to be bounded, whereas the present analysis
avoids that assumption via Lemma 10. Second, that analysis did not leverage a connection
to (AGM), which led to a log N in the second term. This required the two sample sizes to
be coupled asymptotically to achieve consistency. In the present analysis, the two sample
sizes N and n may diverge at arbitrary rates.


Remark 19 It is possible to obtain a result similar to Corollary 16 when the training task sample sizes $(n_i)_{1 \le i \le N}$ are fixed (considered as deterministic), unequal and possibly arbitrary. In this case we would follow a slightly different argument, leveraging (18) for each single training task together with a union bound, and applying Theorem 11 to the idealized situation with an infinite number of samples per training task. This way, the term $\mathbb{E}_\nu\!\left[n^{-\frac{\alpha}{2}}\right]$ is replaced by $\log(N)\, N^{-1} \sum_{i=1}^{N} n_i^{-\frac{\alpha}{2}}$, the additional logarithmic factor being the price of the union bound. We eschew an exact statement for brevity.

We now come back to our initial motivation of possibly reducing computational burden
by subsampling and analyze to what extent this affects statistical error. Under (2SGM)
the effect of subsampling (without replacement) is transparent: it amounts to changing
the original individual sample size distribution $\nu$ by $\nu' = \delta_{n_0}$, while keeping the generating distribution $\mu$ for the tasks' point distributions fixed. Here $n_0$ is the common fixed size of the training subsamples, and we must assume implicitly that the original sample sizes are a.s. larger than $n_0$, i.e. that their distribution $\nu$ has support in $[n_0, \infty)$. For simplicity, for the rest of the discussion we only consider the case of equal, deterministic sizes of sample ($n$) and subsample ($n_0 < n$). Using (21) we compare the two settings to a common reference, namely the idealized risk $\mathcal{E}^\infty$. We see that the statistical risk bound in (21) is unchanged up to a small factor if $n_0 \ge \min(N^{1/\alpha}, n)$. Assuming $\alpha = 1$ to simplify, in the case where the
original sample sizes n are much larger than the number of training tasks N , this suggests
that we can subsample to n0 ≈ N without taking a significant hit to performance. This
applies equally well to subsampling the tasks used for prediction or testing. The most precise
statement in this regard is (18), since it bounds the deviations of the observed prediction
loss for a fixed task PXY and i.i.d. sample from that task.
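To make the subsampling rule concrete, here is a minimal sketch (ours, not part of the paper's released code) that subsamples each training task to roughly $n_0 \approx N^{1/\alpha}$ points; the function name and the task representation as (features, labels) pairs are hypothetical.

```python
import numpy as np

def subsample_tasks(tasks, alpha=1.0, rng=None):
    """Subsample each task (X_i, y_i) to n0 = min(n_i, ceil(N**(1/alpha)))
    points without replacement; per the discussion above, this should not
    significantly degrade the risk bound (21) while reducing computation."""
    rng = np.random.default_rng(rng)
    N = len(tasks)
    n0 = int(np.ceil(N ** (1.0 / alpha)))
    out = []
    for X, y in tasks:
        n_i = X.shape[0]
        if n_i > n0:
            idx = rng.choice(n_i, size=n0, replace=False)
            X, y = X[idx], y[idx]
        out.append((X, y))
    return out
```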
The minimal subsampling size n0 can be interpreted as an optimal efficiency/accuracy
tradeoff, since it reduces computational complexity as much as possible without sacrificing
statistical accuracy. Similar considerations appear in the context of distribution regres-
sion (Szabó et al., 2016, Remark 6). In that reference, a sharp analysis giving rise to fast
convergence rates is presented, resulting in a more involved optimal balance between N and
n. In the present work, we have focused on slow rates based on a uniform control of the
estimation error over RKHS balls; we leave for future work sharper convergence bounds (un-
der additional regularity conditions), which would also give rise to more refined balancing
conditions between n and N .

7. Implementation

Implementation of the algorithm in (7) relies on techniques that are similar to those used
for other kernel methods, but with some variations.3 The first subsection illustrates how,
for the case of hinge loss, the optimization problem corresponds to a certain cost-sensitive
support vector machine. The second subsection focuses on more scalable implementations
based on approximate feature mappings.

3. Code is available at https://github.com/aniketde/DomainGeneralizationMarginal


7.1 Representer Theorem and Hinge Loss


For a particular loss `, existing algorithms for optimizing an empirical risk based on that
loss can be adapted to the setting of marginal transfer learning. We now illustrate this
idea for the case of the hinge loss, `(t, y) = max(0, 1 − yt). To make the presentation more
concise, we will employ the extended feature representation X eij = (Pb(i) , Xij ), and we will
X
also “vectorize” the indices (i, j) so as to employ a single index onP these variables and on
the labels. Thus the training data are (X ei , Yi )1≤i≤M , where M = N ni , and we seek a
i=1
solution to
M
ei )) + 1 kf k2 .
X
min ci max(0, 1 − Yi f (X
f ∈Hk 2
i=1
1
Here ci = λN nm ,where m is the smallest positive integer such that i ≤ n1 + · · · + nm . By
the representer theorem (Steinwart and Christmann, 2008), the solution of (7) has the form
M
X
fbλ = ei , ·)
ri k(X
i=1

for real numbers $r_i$. Plugging this expression into the objective function of (7), and introducing the auxiliary variables $\xi_i$, we have the quadratic program
$$\min_{r, \xi} \ \frac{1}{2} r^T K r + \sum_{i=1}^{M} c_i \xi_i$$
$$\text{s.t.} \quad Y_i \sum_{j=1}^{M} r_j\, \bar{k}(\widetilde{X}_i, \widetilde{X}_j) \ge 1 - \xi_i, \ \forall i$$
$$\xi_i \ge 0, \ \forall i,$$
where $K := \big(\bar{k}(\widetilde{X}_i, \widetilde{X}_j)\big)_{1 \le i,j \le M}$. Using Lagrange multiplier theory, the dual quadratic program is
$$\max_{\alpha} \ -\frac{1}{2}\sum_{i,j=1}^{M} \alpha_i \alpha_j Y_i Y_j\, \bar{k}(\widetilde{X}_i, \widetilde{X}_j) + \sum_{i=1}^{M} \alpha_i$$
$$\text{s.t.} \quad 0 \le \alpha_i \le c_i \ \forall i,$$
and the optimal function is
$$\hat{f}_\lambda = \sum_{i=1}^{M} \alpha_i Y_i\, \bar{k}(\widetilde{X}_i, \cdot).$$
This is equivalent to the dual of a cost-sensitive support vector machine, without offset, where the costs are given by $c_i$. Therefore we can learn the weights $\alpha_i$ using any existing software package for SVMs that accepts example-dependent costs and a user-specified kernel matrix, and allows for no offset. Returning to the original notation, the final predictor given a test $X$-sample $S^T$ has the form
$$\hat{f}_\lambda(\hat{P}^T_X, x) = \sum_{i=1}^{N}\sum_{j=1}^{n_i} \alpha_{ij} Y_{ij}\, \bar{k}\big((\hat{P}^{(i)}_X, X_{ij}), (\hat{P}^T_X, x)\big)$$


where the αij are nonnegative. Like the SVM, the solution is often sparse, meaning most
αij are zero.
Finally, we remark on the computation of $k_P(\hat{P}_X, \hat{P}'_X)$. When $K$ has the form of (11) or (12), the calculation of $k_P$ may be reduced to computations of the form $\langle \Psi(\hat{P}_X), \Psi(\hat{P}'_X) \rangle$. If $\hat{P}_X$ and $\hat{P}'_X$ are empirical distributions based on the samples $X_1, \ldots, X_n$ and $X'_1, \ldots, X'_{n'}$, then
$$\big\langle \Psi(\hat{P}_X), \Psi(\hat{P}'_X) \big\rangle = \left\langle \frac{1}{n}\sum_{i=1}^{n} k'_X(X_i, \cdot),\ \frac{1}{n'}\sum_{j=1}^{n'} k'_X(X'_j, \cdot) \right\rangle = \frac{1}{nn'}\sum_{i=1}^{n}\sum_{j=1}^{n'} k'_X(X_i, X'_j).$$
Note that when $k'_X$ is a (normalized) Gaussian kernel, $\Psi(\hat{P}_X)$ coincides (as a function) with a smoothing kernel density estimate for $P_X$.
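As an illustration of these formulas, the following sketch (ours; the function names are hypothetical, all kernels are taken Gaussian, and $k_P$ has the Gaussian-like form) computes the extended kernel $\bar{k}$ between two extended examples directly from the task samples.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gram matrix of the Gaussian kernel between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def embedding_inner(X1, X2, sigma_kx_prime):
    """<Psi(P1_hat), Psi(P2_hat)> = (1/(n n')) sum_ij k'_X(X_i, X'_j)."""
    return gaussian_gram(X1, X2, sigma_kx_prime).mean()

def k_P(X1, X2, sigma_kx_prime, sigma_P):
    """Gaussian-like kernel on distributions, computed from the two samples."""
    sq_dist = (embedding_inner(X1, X1, sigma_kx_prime)
               + embedding_inner(X2, X2, sigma_kx_prime)
               - 2 * embedding_inner(X1, X2, sigma_kx_prime))
    return np.exp(-sq_dist / (2 * sigma_P ** 2))

def k_bar(X1, x, X2, xp, sigma_kx_prime, sigma_P, sigma_X):
    """Product kernel k_bar((P1_hat, x), (P2_hat, x')) = k_P * k_X."""
    kx = gaussian_gram(x[None, :], xp[None, :], sigma_X)[0, 0]
    return k_P(X1, X2, sigma_kx_prime, sigma_P) * kx
```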

7.2 Approximate Feature Mapping for Scalable Implementation


Assuming $n_i = n$ for all $i$, the computational complexity of a nonlinear SVM solver (in our context) is between $O(N^2 n^2)$ and $O(N^3 n^3)$ (Joachims, 1999; Chang and Lin, 2011). Thus,
standard nonlinear SVM solvers may be insufficient when N or n are very large.
One approach to scaling up kernel methods is to employ approximate feature mappings
together with linear solvers. This is based on the idea that kernel methods are solving for
a linear predictor after first nonlinearly transforming the data. Since this nonlinear trans-
formation can have an extremely high- or even infinite-dimensional output, classical kernel
methods avoid computing it explicitly. However, if the feature mapping can be approxi-
mated by a finite dimensional transformation with a relatively low-dimensional output, one
can directly solve for the linear predictor, which can be accomplished in O(N n) time (Hsieh
et al., 2008).
In particular, given a kernel $\bar{k}$, the goal is to find an approximate feature mapping $z(\tilde{x})$ such that $\bar{k}(\tilde{x}, \tilde{x}') \approx z(\tilde{x})^T z(\tilde{x}')$. Given such a mapping $z$, one then applies an efficient linear solver, such as Liblinear (Fan et al., 2008), to the training data $(z(\widetilde{X}_{ij}), Y_{ij})_{ij}$ to obtain a weight vector $w$. The final prediction on a test point $\tilde{x}$ is then $\operatorname{sign}\{w^T z(\tilde{x})\}$. As
described in the previous subsection, the linear solver may need to be tweaked, as in the
case of unequal sample sizes ni , but this is usually straightforward.
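The following sketch (ours) shows the overall shape of such a pipeline. To stay self-contained we use a simple Pegasos-style stochastic subgradient solver for the weighted hinge loss in place of an external linear solver such as Liblinear; `Z` holds the approximate features $z(\widetilde{X}_i)$ produced by any of the mappings discussed below, and the example-dependent costs $c_i$ are those of Section 7.1.

```python
import numpy as np

def weighted_linear_svm(Z, y, c, lam=1e-2, epochs=20, rng=None):
    """Pegasos-style SGD for min_w lam/2 ||w||^2 + sum_i c_i max(0, 1 - y_i w^T z_i).
    Z: M x D approximate features; y in {-1,+1}; c: example-dependent costs."""
    rng = np.random.default_rng(rng)
    M, D = Z.shape
    w, t = np.zeros(D), 0
    for _ in range(epochs):
        for i in rng.permutation(M):
            t += 1
            eta = 1.0 / (lam * t)          # standard Pegasos step size
            w *= 1.0 - eta * lam           # gradient of the regularizer
            if y[i] * (w @ Z[i]) < 1:      # hinge subgradient is active
                # the factor M makes the stochastic subgradient unbiased for the sum
                w += eta * M * c[i] * y[i] * Z[i]
    return w

# Final prediction on a test point: sign(w @ z(x_tilde_test)).
```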
Recently, such low-dimensional approximate feature mappings z(x) have been developed
for several kernels. We examine two such techniques in the context of marginal transfer
learning, the Nyström approximation (Williams and Seeger, 2001; Drineas and Mahoney,
2005) and random Fourier features. The Nyström approximation applies to any kernel
method, and therefore extends to the marginal transfer setting without additional work.
On the other hand, we give a novel extension of random Fourier features to the marginal
transfer learning setting (for the case of all Gaussian kernels), together with performance
analysis. Our approach is similar to the one in Jitkrittum et al. (2015) which proposes a
two-stage approximation for the mean embedding. Note that Jitkrittum et al. (2015) does
not give an error bound. We describe our novel extension of random Fourier features to


the marginal transfer learning setting, with error bounds, in the appendix, where we also
review the Nyström method.
The Nyström approximation holds for any positive definite kernel, but random Fourier
features can be used only for shift invariant kernels. On the other hand, random Fourier
features are very easy to implement and the Nyström method has additional time complexity
due to an eigenvalue decomposition. Moreover, the Nyström method is useful only when
the kernel matrix has low rank. For additional comparison of various kernel approximation
approaches we refer the reader to Le et al. (2013). In our experiments, we use random
Fourier features when all kernels are Gaussian and the Nyström method otherwise.

8. Experiments
This section empirically compares our marginal transfer learning method with pooling.4
One implementation of the pooling algorithm was mentioned in Section 5.2, where kP is
taken to be a constant kernel. Another implementation is to put all the training data sets
together and train a single conventional kernel method. The only difference between the two
implementations is that in the former, weights of 1/ni are used for examples from training
task i. In almost all of our experiments below, the various training tasks have the same
sample sizes, in which case the two implementations coincide. The only exception is the
fourth experiment when we use all training data, in which case we use the second of the
two implementations mentioned above.
We consider three classification problems ($\mathcal{Y} = \{-1, 1\}$), for which the hinge loss is employed, and one regression problem ($\mathcal{Y} \subset \mathbb{R}$), where the $\epsilon$-insensitive loss is employed. Thus, the algorithms implemented are natural extensions of support vector classification and regression to domain generalization. Performance of a learning strategy is assessed by holding out several data sets $S^T_1, \ldots, S^T_{N_T}$, learning a decision function $\hat{f}$ on the remaining data sets, and reporting the average empirical risk $\frac{1}{N_T}\sum_{i=1}^{N_T} \hat{L}(S^T_i, \hat{f})$. In some cases, this value is again averaged over several randomized versions of the experiment.

8.1 Model Selection


The various experiments use different combinations of kernels. In all experiments, linear
−x2 ||2 
kernels k(x1 , x2 ) = xT1 x2 and Gaussian kernels kσ (x1 , x2 ) = exp − ||x12σ 2 were used.
The bandwidth σ of each Gaussian kernel and the regularization parameter λ of the
machines were selected by grid search. For model selection, five-fold cross-validation was
used. In order to stabilize the cross-validation procedure, it was repeated 5 times over
independent random splits into folds (Kohavi, 1995). Thus, candidate parameter values
were evaluated on the 5 × 5 validation sets and the configuration yielding the best average
performance was selected. If any of the chosen hyper-parameters was at the grid boundary,
the grid was extended accordingly, i.e., the same grid size has been used, however, the center
of grid has been assigned to the previously selected point. The grid used for kernels was
σ ∈ 10−2 , 104 with  logarithmic spacing, and the grid used for the regularization parameter
was λ ∈ 10−1 , 101 with logarithmic spacing.
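A sketch of this selection loop (ours; `fit` and `score` stand in for the actual training and validation routines, the folds here are taken over tasks, and the boundary-triggered grid re-centering is omitted for brevity):

```python
import numpy as np

sigmas = np.logspace(-2, 4, 7)    # bandwidth grid, 10^-2 .. 10^4
lambdas = np.logspace(-1, 1, 3)   # regularization grid, 10^-1 .. 10^1

def repeated_cv_select(tasks, fit, score, n_folds=5, n_repeats=5, rng=None):
    """Select (sigma, lambda) by 5 x 5-fold cross-validation over tasks."""
    rng = np.random.default_rng(rng)
    best, best_err = None, np.inf
    for sigma in sigmas:
        for lam in lambdas:
            errs = []
            for _ in range(n_repeats):
                perm = rng.permutation(len(tasks))
                folds = np.array_split(perm, n_folds)
                for k in range(n_folds):
                    val = [tasks[i] for i in folds[k]]
                    train = [tasks[i] for f in folds[:k] + folds[k + 1:] for i in f]
                    errs.append(score(fit(train, sigma, lam), val))
            if np.mean(errs) < best_err:
                best, best_err = (sigma, lam), np.mean(errs)
    return best
```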

4. Code available at https://github.com/aniketde/DomainGeneralizationMarginal


8.2 Synthetic Data Experiment

To illustrate the proposed method, a synthetic problem was constructed. The synthetic
data generation algorithm is given in Algorithm 1. In brief, for each classification task, the
data are uniformly supported on an ellipse, with the major axis determining the labels, and
the rotation of the major axis randomly generated in a 90 degree range for each task. One
random realization of this synthetic data is shown in Figure 2. This synthetic data set is
an ideal candidate for marginal transfer learning, because the Bayes classifier for a task is
uniquely determined by the marginal distribution of the features, i.e. Lemma 9 applies (and the optimal error $\inf_f \mathcal{E}^\infty(f)$ is zero). On the other hand, observe that the expectation of
each X distribution is the same regardless of the task and thus does not provide any relevant
information, so that taking into account at least second order information is needed to
perform domain generalization.
To analyse the effects of number of examples per task (n) and number of tasks (N ), we
constructed 12 synthetic data sets by taking combinations N × n where N ∈ {16, 64, 256}
and n ∈ {8, 16, 32, 256}. For each synthetic data set, the test set contains 10 tasks and each
task contains one million data points. All kernels are taken to be Gaussian, and the random
Fourier features speedup is used. The results are shown in Figure 3 and Tables 1 and 2
(see appendix). The marginal transfer learning (MTL) method significantly outperforms
the baseline pooling method. Furthermore, the performance of MTL improves as N and n
increase, as expected. The pooling method, however, does no better than random guessing
regardless of N and n.
In the remaining experiments, the marginal distribution does not perfectly characterize
the optimal decision function, but still provides some information to offer improvements
over pooling.

Algorithm 1: Synthetic Data Generation
input : N : number of tasks, n: number of training examples per task
output: Realization of synthetic data set for N tasks
for i = 1 to N do
  • Sample rotation $\alpha_i$ uniformly in $\left[\frac{\pi}{4}, \frac{3\pi}{4}\right]$;
  • Take an ellipse whose major axis is aligned with the horizontal axis, and rotate it by an angle of $\alpha_i$ about its center;
  • Sample $n$ points $X_{ij}$, $j = 1, \ldots, n$, uniformly at random from the rotated ellipse;
  • Label the points according to their position with respect to the major axis, i.e. the points that are on the right of the major axis are considered as class 1 and the points on the left of the major axis are considered as class −1.
end
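A runnable version of Algorithm 1 might look as follows (a sketch, ours: the semi-axis lengths are our choice, and the angular parameter is sampled uniformly, a common simplification of "uniformly at random from the ellipse"):

```python
import numpy as np

def generate_synthetic(N, n, a=2.0, b=1.0, rng=None):
    """Algorithm 1: N tasks with n labeled points each, sampled on an
    ellipse with semi-axes (a, b) rotated by a task-specific angle."""
    rng = np.random.default_rng(rng)
    tasks = []
    for _ in range(N):
        alpha = rng.uniform(np.pi / 4, 3 * np.pi / 4)   # task rotation
        t = rng.uniform(0, 2 * np.pi, size=n)           # points on the ellipse
        pts = np.stack([a * np.cos(t), b * np.sin(t)], axis=1)
        y = np.where(pts[:, 0] > 0, 1, -1)              # side of the major axis
        R = np.array([[np.cos(alpha), -np.sin(alpha)],
                      [np.sin(alpha),  np.cos(alpha)]])
        tasks.append((pts @ R.T, y))
    return tasks
```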



Figure 2: Plots of synthetic data sets (red and blue points represent negative and positive
classes) for different settings: (a) Random realization of a single task with 256
training examples per task. Plots (b), (c) and (d) are random realizations of
synthetic data with 256 training examples for 16, 64 and 256 tasks.

8.3 Parkinson’s Disease Telemonitoring

We test our method in the regression setting using the Parkinson’s disease telemonitoring
data set, which is composed of a range of biomedical voice measurements using a telemon-
itoring device from 42 people with early-stage Parkinson’s. The recordings were automat-
ically captured in the patients’ homes. The aim is to predict the clinician’s Parkinson’s
disease symptom score for each recording on the unified Parkinson’s disease rating scale
(UPDRS) (Tsanas et al., 2010). Thus we are in a regression setting, and employ the $\epsilon$-insensitive loss from support vector regression. All kernels are taken to be Gaussian, and
the random Fourier features speedup is used.
There are around 200 recordings per patient. We randomly select 7 test users and then
vary the number of training users N from 10 to 35 in steps of 5, and we also vary the
number of training examples n per user from 20 to 100. We repeat this process several
times to get the average errors which are shown in Fig 4 and Tables 3 and 4 (see appendix).
The marginal transfer learning method clearly outperforms pooling, especially as N and n
increase.


Figure 3: Synthetic data set: Classification error rates for proposed method and difference
with baseline for different experimental settings, i.e., number of examples per task
and number of tasks.

8.4 Satellite Classification


Microsatellites are increasingly deployed in space missions for a variety of scientific and
technological purposes. Because of randomness in the launch process, the orbit of a mi-
crosatellite is random, and must be determined after the launch. One recently proposed
approach is to estimate the orbit of a satellite based on radiofrequency (RF) signals as mea-
sured in a ground sensor network. However, microsatellites are often launched in bunches,
and for this approach to be successful, it is necessary to associate each RF measurement
vector with a particular satellite. Furthermore, the ground antennae are not able to decode
unique identifier signals transmitted by the microsatellites, because (a) of constraints on
the satellite/ground antennae links, including transmission power, atmospheric attenuation,
scattering, and thermal noise, and (b) ground antennae must have low gain and low direc-
tional specificity owing to uncertainty in satellite position and dynamics. To address this
problem, recent work has proposed to apply our marginal transfer learning methodology
(Sharma and Cutler, 2015).
As a concrete instance of this problem, suppose two microsatellites are launched to-
gether. Each launch is a random phenomenon and may be viewed as a task in our frame-
work. For each launch i, training data (Xij , Yij ), j = 1, . . . , ni , are generated using a highly
realistic simulation model, where Xij is a feature vector of RF measurements across a par-
ticular sensor network and at a particular time, and Yij is a binary label identifying which
of the two microsatellites produced a given measurement. By applying our methodology,
we can classify unlabeled measurements XjT from a new launch with high accuracy. Given


Figure 4: Parkinson’s disease telemonitoring data set: Root mean square error rates for
proposed method and difference with baseline for different experimental settings,
i.e., number of examples per task and number of tasks.

these labels, orbits can subsequently be estimated using the observed RF measurements.
We thank Srinagesh Sharma and James Cutler for providing us with their simulated data,
and refer the reader to their paper for more details on the application (Sharma and Cutler,
2015).
To demonstrate this idea, we analyzed the data from Sharma and Cutler (2015) for
T = 50 launches, viewing up to 40 as training data and 10 as testing. We use Gaussian
kernels and the RFF kernel approximation technique to speed up the algorithm. Results
are shown in Fig 5 (tables given in the appendix). As expected, the error for the proposed
method is much lower than for pooling, especially as N and n increase.

8.5 Flow Cytometry Experiments


We demonstrate the proposed methodology for the flow cytometry auto-gating problem,
described in Sec. 2. The pooling approach has been previously investigated in this context
by Toedling et al. (2006). We used a data set that is a part of the FlowCAP Challenges
where the ground truth labels have been supplied by human experts (Aghaeepour et al.,
2013). We used the so-called “Normal Donors” data set. The data set contains 8 different
classes and 30 subjects. Only two classes (0 and 2) have consistent class ratios, so we have
restricted our attention to these two.
The corresponding flow cytometry data sets have sample sizes ranging from 18,641 to
59,411, and the proportion of class 0 in each data set ranges from 25.59 to 38.44%. We
randomly selected 10 tasks to serve as the test tasks. These tasks were removed from the pool of eligible training tasks. We varied the number of training tasks from 5 to 20 with an additive step size of 5, and the number of training examples per task from 1024 to 16384 with a multiplicative step size of 2. We repeated this process 10 times to get the average classification errors, which are shown in Fig. 6 and Tables 7 and 8 (see appendix). The kernel $k_P$ was Gaussian, and the other two were linear. The Nyström approximation was used to achieve an efficient implementation.

Figure 5: Satellite data set: Classification error rates for proposed method and difference with baseline for different experimental settings, i.e., number of examples per task and number of tasks.
For nearly all settings the proposed method has a smaller error rate than the baseline.
Furthermore, for the marginal transfer learning method, when one fixes the number of
training examples and increases the number of tasks then the classification error rate drops.
On the other hand, we observe in Table 7 that the number $n$ of training points per task hardly affects the final performance when $n \ge 10^3$. This is in contrast with the previous experimental examples (synthetic, Parkinson's disease telemonitoring, and satellite classification), for which increasing $n$ led to better performance, but where the values of $n$ remained somewhat modest ($n \le 256$). This is qualitatively in line with the theoretical results under (2SGM) in Section 6.2 (see in particular the concluding discussion there), suggesting that the influence of increasing $n$ on the performance should eventually taper off, in particular if $n \gg N$.

Figure 6: Flow Cytometry Data set: Percentage Classification error rates for proposed method and difference with baseline for different experimental settings, i.e., number of examples per task and number of tasks.

9. Discussion

Our approach to domain generalization relies on the extended input pattern $\widetilde{X} = (P_X, X)$. Thus, we study the natural algorithm of minimizing a regularized empirical loss over a reproducing kernel Hilbert space associated with the extended input domain $\mathcal{P}_X \times \mathcal{X}$. We
also establish universal consistency under two sampling plans. To achieve this, we present
novel generalization error analyses, and construct a universal kernel on PX × X . A detailed
implementation based on novel approximate feature mappings is also presented.
On one synthetic and three real-world data sets, the marginal transfer learning approach
consistently outperforms a pooling baseline. On some data sets, however, the difference
between the two methods is small. This is because the utility of transfer learning varies
from one DG problem to another. As an extreme example, if all of the tasks are the same,
then pooling should do just as well as our method.
Several future directions exist. From an application perspective, the need for adaptive
classifiers arises in many applications, especially in biomedical applications involving biolog-
ical and/or technical variation in patient data. Examples include brain computer interfaces
and patient monitoring. For example, when electrocardiograms are used to continuously
monitor cardiac patients, it is desirable to classify each heartbeat as irregular or not. Given
the extraordinary amount of data involved, automation of this process is essential. How-
ever, irregularities in a test patient’s heartbeat will differ from irregularities of historical
patients, hence the need to adapt to the test distribution (Wiens, 2010).
From a theoretical and methodological perspective, several questions are of interest. We
would like to specify conditions on the meta-distributions PS or µ under which the DG risk
is close to the expected Bayes risk of the test distribution (beyond the simple condition
discussed in Lemma 9). We would also like to develop fast learning rates under suitable dis-


tributional assumptions. Furthermore, given the close connections with supervised learning,
many common variants of supervised learning can also be investigated in the DG context,
including multiclass classification, class probability estimation, and robustness to various
forms of noise.

We can also ask how the methodology and analysis can be extended to the context
where a small number of labels are available for the test distribution (additionally to a
larger number of unlabeled data from the same distribution); this situation appears to be
common in practice, and can be seen as intermediary between the DG and learning to learn
(LTL, see Section 4.2) settings (one could dub it “semi-supervised domain generalization”).
In this setting, two approaches appear promising to take advantage of the labeled data. The
simplest one is to use the same optimization problem (7), where we include additionally
the labeled examples of the test distribution. However, if several test samples are to be
treated in succession, and we want to avoid a full, resource-consuming re-training using
all the training samples each time, an interesting alternative is the following: learn once
a function f0 (PX , x) using the available training samples via (7); then, given a partially
labeled test sample, learn a decision function on this sample only via the usual kernel (kX )
norm regularized empirical loss minimization method, but replace the usual regularizer term $\|f\|^2_{\mathcal{H}}$ by $\big\|f - f_0(\hat{P}^T_X, \cdot)\big\|^2_{\mathcal{H}}$ (note that $f_0(\hat{P}^T_X, \cdot) \in \mathcal{H}_{k_X}$). In this sense, the marginal-
adaptive decision function learned from the training samples would serve as a “prior” or
“informed guess” for learning on the test data. This can be also interpreted as learning
an adequate complexity penalty to improve learning on new samples, thus connecting to
the general principles of LTL (see Section 4.2). An interesting difference with underlying
existing LTL approaches is that those tend to adapt the hypothesis class or the "shape" of the
regularization penalty to the problem at hand, while the approach delineated above would
modify the “origin” of the penalty, using the marginal distribution information. These two
principles could also be combined.
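For the squared loss, this biased-regularization idea admits a simple closed form: writing $g = f - f_0(\hat{P}^T_X, \cdot)$, the problem reduces to standard kernel ridge regression on the residuals. A minimal sketch of that special case (ours, not from the paper):

```python
import numpy as np

def biased_krr(K, y, f0_vals, lam):
    """Solve min_f (1/n) sum_i (f(x_i) - y_i)^2 + lam * ||f - f0||^2_H.
    With g = f - f0 this is kernel ridge regression on the residuals
    y_i - f0(x_i); K is the n x n kernel matrix k_X(x_i, x_j)."""
    n = K.shape[0]
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y - f0_vals)
    return alpha  # f(x) = f0(x) + sum_i alpha_i k_X(x_i, x)
```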

Acknowledgments

C. Scott and A. Deshmukh were supported in part by NSF Grants No. 1422157, 1217880,
1047871, 1838179, and 2008074. G. Blanchard was supported in part by the Deutsche
Forschungsgemeinschaft (DFG) via Research Unit FG1735/2 Structural Inference in Statis-
tics, and via SFB1294/1 - 318763901 DFG Data Assimilation; and by the Agence Nationale
de la Recherche (ANR) via the IA Chair BiSCottE. G. Lee was supported by the National
Research Foundation of Korea (NRF-2014R1A1A1003458).

Appendix A. Proofs, Technical Details, and Experimental Details

This appendix contains the remaining proofs, as well as additional technical and experi-
mental details.


A.1 Proof of Proposition 7


Let PXYT be a fixed probability distribution on X × R, and ε > 0 a fixed number. Since X
is a Radon space, by definition any Borel probability measure on it, in particular PXT (the
X-marginal of PXY T ), is inner regular, so that there exists a compact set K ⊂ X such that
T c
PX (K ) ≤ ε.
For all x ∈ K, by the assumed continuity of the decision function f at point (PXT , x) there
exists an open neighborhood Ux ×Vx ⊂ PX ×X of this point such that f (u, v) − f (PXT , x) ≤
ε for all (u, v) ∈ Ux × Vx . Since the family (Vx )x∈K is an open covering T of the compact
K, there exists a finite subfamily (Vxi )i∈I covering K. Denoting U0 := i∈I Uxi which is
an open neighborhood of PXT in PX , it therefore holds for any P ∈ U0 and uniformly over
x ∈ K that f (P, x) − f (PXT , x) ≤ |f (P, x) − f (P, xi0 )| + f (PXT , x) − f (P, xi0 ) ≤ 2ε, where
i0 ∈ I is such that x ∈ Vxi0 .
Denote
n S T =o (XiT , YiT )1≤i≤nT a sample of size nT drawn i.i.d. from PXY T , and A the

event PbT ∈ U0 . By the law of large numbers, PbT weakly converges to P T in probability,
X X X
so that P [Ac ] ≤ ε holds for nT large enough. We have (denoting B a bound on the loss
function):

nT nT
" # " #
1 X T T T 1 X
T T T
EST `(f (PbX , Xi ), Yi ) ≤ Bε + EST `(f (PbX , Xi ), Yi )1{X T ∈K}
nT nT i
i=1 i=1
nT
" #
1 X
≤ B(ε + P [Ac ]) + EST 1{PbT ∈U0 } `(f (PbXT , XiT ), YiT )1{X T ∈K}
X nT i
i=1
nT
" #
1 X
≤ 2Bε + 2Lε + EST `(f (PXT , XiT ), YiT )
nT
i=1
≤ 2(B + L)ε + E(X T ,Y T )∼P T `(f (PXT , X T ), Y T ) .
 
XY

Conversely,
nT nT
" # " #
1 X 1 X
EST `(f (PbXT , XiT ), YiT ) ≥ EST 1{PbT ∈U0 } `(f (PbXT , XiT ), YiT )1{X T ∈K}
nT X nT i
i=1 i=1
nT
" #
1 X
≥ EST `(f (PXT , XiT ), YiT ) − 2Bε − 2Lε
nT
i=1
≥ E(X T ,Y T )∼P T `(f (PXT , X T ), Y T ) − 2(B + L)ε.
 
XY

Since the above inequalities hold for any ε > 0 provided nT is large enough, this yields that
T , we have
for any fixed PXY
nT
" #
1 X
`(f (PbX , Xi ), Yi ) = E(X T ,Y T )∼P T `(f (PXT , X T ), Y T ) .
T T T
 
lim EST ∼(P T )⊗nT
nT →∞ XY nT XY
i=1

Finally, since the above right-hand side is bounded by B, applying dominated convergence
T ∼ µ yields the desired conclusion.
to integrate over PXY


A.2 Proof of Corollary 14


Proof Denote E ∗ = inf f :PX ×X →R E(f ). Let ε > 0. Since k is a universal kernel on PX × X
and ` is Lipschitz, there exists f0 ∈ Hk such that E(f0 ) ≤ E ∗ + 2ε (Steinwart and Christmann,
2008).
By comparing the objective function in (7) at the minimizer p fλ and at the null function,
b
using assumption p (LB) we deduce that we must have kfbλ k ≤ B0 /λ. Applying Theorem 11
for R = Rλ = B0 /λ, and δ = 1/N , gives that with probability at least 1 − 1/N 2 ,
2


p ( log N + 2)
sup E(f, N ) − E(f ) ≤ ε(N ) := (B0 + L` BK Bk B0 /λ)
b √ .
f ∈Bk (R) N

Let N be large enough so that kf0 k ≤ Rλ . We can now deduce that with probability at
least 1 − 1/N 2 ,
E(fbλ ) ≤ E(
b fbλ , N ) + ε(N )
b fbλ , N ) + λkfbλ k2 − λkfbλ k2 + ε(N )
= E(
b 0 , N ) + λkf0 k2 − λkfbλ k2 + ε(N )
≤ E(f
b 0 , N ) + λkf0 k2 + ε(N )
≤ E(f
≤ E(f0 ) + λkf0 k2 + 2ε(N )
ε
≤ E ∗ + + λkf0 k2 + 2ε(N ).
2

The last two terms become less than 2ε for N sufficiently large by the assumptions on the
growth of λ = λ(N ). This establishes that for any ε > 0, there exists N0 such that
X X 1
Pr(E(fbλ ) ≥ E ∗ + ε) ≤ < ∞,
N2
N ≥N0 N ≥N0

and so the result follows by the Borel-Cantelli lemma.

A.3 Proof of Theorem 15


We control the difference between the training loss and the conditional risk at infinite sample size via the following decomposition:
$$\begin{aligned}
\sup_{f \in B_k(R)} \big|\hat{L}(S, f) - \mathcal{E}^\infty(f \,|\, P_{XY})\big| &= \sup_{f \in B_k(R)} \left|\frac{1}{n}\sum_{i=1}^{n} \ell(f(\hat{P}_X, X_i), Y_i) - \mathcal{E}^\infty(f \,|\, P_{XY})\right|\\
&\le \sup_{f \in B_k(R)} \left|\frac{1}{n}\sum_{i=1}^{n} \Big(\ell(f(\hat{P}_X, X_i), Y_i) - \ell(f(P_X, X_i), Y_i)\Big)\right|\\
&\quad + \sup_{f \in B_k(R)} \left|\frac{1}{n}\sum_{i=1}^{n} \ell(f(P_X, X_i), Y_i) - \mathcal{E}^\infty(f \,|\, P_{XY})\right|\\
&=: (I) + (II). \qquad (22)
\end{aligned}$$


A.3.1 Control of Term (I)


Using the assumption that the loss ` is L` -Lipschitz in its first coordinate, we can bound
the first term as follows:
n
1X
(I) ≤ L` sup f (PbX , Xi ) − f (PX , Xi ) ≤ L` sup f (PbX , ·) − f (PX , ·) . (23)
f ∈Bk (R) n i=1 f ∈Bk (R) ∞

This can now be controlled using the first part of the following result:

Lemma 20 Assume (K-Bounded) holds. Let $P_X$ be an arbitrary distribution on $\mathcal{X}$ and $\hat{P}_X$ denote an empirical distribution on $\mathcal{X}$ based on an i.i.d. sample of size $n$ from $P_X$. Then with probability at least $1 - \delta$ over the draw of this sample, it holds that
$$\sup_{f \in B_k(R)} \big\|f(\hat{P}_X, \cdot) - f(P_X, \cdot)\big\|_\infty \le 3 R B_k L_K B_{k'}^\alpha \left(\frac{\log 2\delta^{-1}}{n}\right)^{\frac{\alpha}{2}}. \tag{24}$$
In expectation, it holds
$$\mathbb{E}\left[\sup_{f \in B_k(R)} \big\|f(\hat{P}_X, \cdot) - f(P_X, \cdot)\big\|_\infty\right] \le 2 R B_k L_K B_{k'}^\alpha\, n^{-\alpha/2}. \tag{25}$$

Proof Let $X_1, \ldots, X_n$ denote the $n$-sample from $P_X$. Let us denote by $\Phi'_X$ the canonical feature mapping $x \mapsto k'_X(x, \cdot)$ from $\mathcal{X}$ into $\mathcal{H}_{k'_X}$. We have for all $x \in \mathcal{X}$, $\|\Phi'_X(x)\| \le B_{k'}$, and so, as a consequence of Hoeffding's inequality in a Hilbert space (see, e.g., Pinelis and Sakhanenko, 1985), it holds with probability at least $1 - \delta$:
$$\big\|\Psi(P_X) - \Psi(\hat{P}_X)\big\| = \left\|\frac{1}{n}\sum_{i=1}^{n} \Phi'_X(X_i) - \mathbb{E}_{X \sim P_X}\big[\Phi'_X(X)\big]\right\| \le 3 B_{k'} \sqrt{\frac{\log 2\delta^{-1}}{n}}. \tag{26}$$
Furthermore, using the reproducing property of the kernel $\bar{k}$, we have for any $x \in \mathcal{X}$ and $f \in B_k(R)$:
$$\begin{aligned}
\big|f(\hat{P}_X, x) - f(P_X, x)\big| &= \Big\langle \bar{k}\big((\hat{P}_X, x), \cdot\big) - \bar{k}\big((P_X, x), \cdot\big),\, f \Big\rangle\\
&\le \|f\|\, \big\|\bar{k}\big((\hat{P}_X, x), \cdot\big) - \bar{k}\big((P_X, x), \cdot\big)\big\|\\
&\le R\, k_X(x, x)^{\frac{1}{2}} \Big(K\big(\Psi(P_X), \Psi(P_X)\big) + K\big(\Psi(\hat{P}_X), \Psi(\hat{P}_X)\big) - 2K\big(\Psi(P_X), \Psi(\hat{P}_X)\big)\Big)^{\frac{1}{2}}\\
&\le R B_k \big\|\Phi_K(\Psi(P_X)) - \Phi_K(\Psi(\hat{P}_X))\big\|\\
&\le R B_k L_K \big\|\Psi(P_X) - \Psi(\hat{P}_X)\big\|^\alpha,
\end{aligned}$$
where in the last step we have used property (K-Hölder) together with the fact that for all $P \in \mathcal{P}_X$, $\|\Psi(P)\| \le \int_{\mathcal{X}} \|k'_X(x, \cdot)\|\, dP_X(x) \le B_{k'}$, so that $\Psi(P) \in B_{k'_X}(B_{k'})$. Combining with (26) gives (24).

For the bound in expectation, we use the inequality above, and can bound further (using Jensen's inequality, since $\alpha \le 1$)
$$\begin{aligned}
\mathbb{E}\Big[\big\|\Psi(P_X) - \Psi(\hat{P}_X)\big\|^\alpha\Big] &\le \mathbb{E}\Big[\big\|\Psi(P_X) - \Psi(\hat{P}_X)\big\|^2\Big]^{\alpha/2}\\
&= \left(\frac{1}{n^2}\sum_{i,j=1}^{n} \mathbb{E}\Big[\big\langle \Phi'_X(X_i) - \mathbb{E}[\Phi'_X(X)],\ \Phi'_X(X_j) - \mathbb{E}[\Phi'_X(X)]\big\rangle\Big]\right)^{\alpha/2}\\
&= \left(\frac{1}{n^2}\sum_{i=1}^{n} \mathbb{E}\Big[\big\|\Phi'_X(X_i) - \mathbb{E}[\Phi'_X(X)]\big\|^2\Big]\right)^{\alpha/2}\\
&\le \left(\frac{4 B_{k'}^2}{n}\right)^{\frac{\alpha}{2}},
\end{aligned}$$
which yields (25) in combination with the above.

A.3.2 Control of Term (II)

Term (II) takes the form of a uniform deviation over an RKHS ball of an empirical loss for the data $(\widetilde{X}_i, Y_i)$, where $\widetilde{X}_i := (P_X, X_i)$. Since $P_X$ is fixed (in contrast with term (I), where $\hat{P}_X$ depended on the whole sample), these data are i.i.d. Similar to the proofs of Theorems 5 and 11, we can therefore apply again standard Rademacher analysis, this time at the level of one specific task (Azuma-McDiarmid inequality followed by Rademacher complexity analysis for a Lipschitz, bounded loss over a RKHS ball; see Koltchinskii, 2001; Bartlett and Mendelson, 2002, Theorems 8, 12 and Lemma 22 there). The kernel $\bar{k}$ is bounded by $B_k^2 B_K^2$ by assumption (K-Bounded); by Lemma 10 and assumption (LB), the loss is bounded by $B_0 + L_\ell R B_k B_K$, and is $L_\ell$-Lipschitz. Therefore, with probability at least $1 - \delta$ we get
$$(II) \le (B_0 + 3L_\ell R B_k B_K)\min\left(\sqrt{\frac{\log(\delta^{-1})}{2n}},\, 1\right) \le (B_0 + 3L_\ell R B_k B_K)\min\left(\frac{\log(\delta^{-1})}{2n},\, 1\right)^{\frac{\alpha}{2}}. \tag{27}$$

Observe that we can cap the second factor at 1 since (II) is upper bounded by the bound
on the loss in all cases; the second inequality then uses α ≤ 1. Combining with a union
bound the probabilistic controls (23), (24) of term (I) and (27) of (II) yields (18).


To establish the bound (19) we use a similar argument. We use the decomposition
$$\begin{aligned}
\sup_{f \in B_k(R)} \big|\mathcal{E}(f \,|\, P_{XY}, n) - \mathcal{E}^\infty(f \,|\, P_{XY})\big|
&\le \sup_{f \in B_k(R)} \left|\mathbb{E}_{S \sim (P_{XY})^{\otimes n}}\left[\frac{1}{n}\sum_{i=1}^{n} \Big(\ell(f(\hat{P}_X, X_i), Y_i) - \ell(f(P_X, X_i), Y_i)\Big)\right]\right|\\
&\quad + \sup_{f \in B_k(R)} \left|\mathbb{E}_{S \sim (P_{XY})^{\otimes n}}\left[\frac{1}{n}\sum_{i=1}^{n} \ell(f(P_X, X_i), Y_i)\right] - \mathbb{E}_{(X,Y) \sim P_{XY}}\big[\ell(f(P_X, X), Y)\big]\right|\\
&=: (I') + (II').
\end{aligned}$$
It is easily seen that the second term vanishes: for any fixed $f \in B_k(R)$ and $P_{XY}$, the difference of the expectations is zero. For the first term, using Lipschitzness of the loss, then (25), we obtain
$$(I') \le L_\ell\, \mathbb{E}\left[\sup_{f \in B_k(R)} \big\|f(\hat{P}_X, \cdot) - f(P_X, \cdot)\big\|_\infty\right] \le 2 L_\ell R B_k L_K B_{k'}^\alpha\, n^{-\alpha/2},$$
yielding (19). The bound (20) is obtained as a direct consequence by taking expectation over $P_{XY} \sim \mu$ and using Jensen's inequality to pull out the absolute value.

A.4 Regularity Conditions for the Kernel on Distributions


We investigate sufficient conditions on the kernel K to ensure the regularity condition (K-
Hölder) (15). Roughly speaking, the regularity of the feature mapping of a reproducing
kernel is “one half” of the regularity of the kernel in each of its variables. The next result
considers the situation where K is itself simply a Hölder continuous function of its variables.
Lemma 21 Let $\alpha \in (0, \frac{1}{2}]$. Assume that the kernel $K$ is Hölder continuous of order $2\alpha$ and constant $L_K^2/2$ in each of its two variables on $B_{k'_X}(B_{k'})$. Then (K-Hölder) is satisfied.

Proof For any $v, w \in B_{k'_X}(B_{k'})$:
$$\|\Phi_K(v) - \Phi_K(w)\| = \big(K(v, v) + K(w, w) - 2K(v, w)\big)^{\frac{1}{2}} \le L_K \|v - w\|^\alpha.$$

The above type of regularity only leads to a Hölder feature mapping of order at most $\frac{1}{2}$
(when the kernel function is Lipschitz continuous in each variable). Since this order plays
an important role in the rate of convergence of the upper bound in the main error control
theorem, it is desirable to study conditions ensuring more regularity, in particular a feature
mapping which has at least Lipschitz continuity. For this, we consider the following stronger
condition, namely that the kernel function is twice differentiable in a specific sense:
Lemma 22 Assume that, for any $u, v \in B_{k'_X}(B_{k'})$ and unit norm vector $e$ of $\mathcal{H}_{k'_X}$, the function $h_{u,v,e} : (\lambda, \mu) \in \mathbb{R}^2 \mapsto K(u + \lambda e, v + \mu e)$ admits a mixed partial derivative $\partial_1 \partial_2 h_{u,v,e}$ at the point $(\lambda, \mu) = (0, 0)$ which is bounded in absolute value by a constant $C_K^2$ independent of $(u, v, e)$. Then (15) is satisfied with $\alpha = 1$ and $L_K = C_K$; that is, the canonical feature mapping of $K$ is Lipschitz continuous on $B_{k'_X}(B_{k'})$.


Proof The argument is along the same lines as Steinwart and Christmann (2008), Lemma 4.34. Observe that, since $h_{u,v,e}(\lambda + \lambda', \mu + \mu') = h_{u+\lambda e,\, v+\mu e,\, e}(\lambda', \mu')$, the function $h_{u,v,e}$ actually admits a uniformly bounded mixed partial derivative at any point $(\lambda, \mu) \in \mathbb{R}^2$ such that $(u + \lambda e, v + \mu e) \in B_{k'_X}(B_{k'})$. Let us denote $\Delta_1 h_{u,v,e}(\lambda, \mu) := h_{u,v,e}(\lambda, \mu) - h_{u,v,e}(0, \mu)$. For any $u, v \in B_{k'_X}(B_{k'})$, $u \ne v$, let us set $\lambda := \|v - u\|$ and the unit vector $e := \lambda^{-1}(v - u)$; we have
$$\begin{aligned}
\|\Phi_K(u) - \Phi_K(v)\|^2 &= K(u, u) + K(u + \lambda e, u + \lambda e) - K(u, u + \lambda e) - K(u + \lambda e, u)\\
&= \Delta_1 h_{u,v,e}(\lambda, \lambda) - \Delta_1 h_{u,v,e}(\lambda, 0)\\
&= \lambda\, \partial_2 \Delta_1 h_{u,v,e}(\lambda, \lambda'),
\end{aligned}$$
where we have used the mean value theorem, yielding existence of $\lambda' \in [0, \lambda]$ such that the last equality holds. Furthermore,
$$\partial_2 \Delta_1 h_{u,v,e}(\lambda, \lambda') = \partial_2 h_{u,v,e}(\lambda, \lambda') - \partial_2 h_{u,v,e}(0, \lambda') = \lambda\, \partial_1 \partial_2 h_{u,v,e}(\lambda'', \lambda'),$$
using again the mean value theorem, yielding existence of $\lambda'' \in [0, \lambda]$ in the last equality. Finally, we get
$$\|\Phi_K(u) - \Phi_K(v)\|^2 = \lambda^2\, \partial_1 \partial_2 h_{u,v,e}(\lambda'', \lambda') \le C_K^2 \|v - u\|^2.$$

Lemma 23 Assume that the kernel $K$ takes the form of either (a) $K(u, v) = g(\|u - v\|^2)$ or (b) $K(u, v) = g(\langle u, v \rangle)$, where $g$ is a twice differentiable real function of a real variable defined on $[0, 4B_{k'}^2]$ in case (a), and on $[-B_{k'}^2, B_{k'}^2]$ in case (b). Assume $\|g'\|_\infty \le C_1$ and $\|g''\|_\infty \le C_2$. Then $K$ satisfies the assumption of Lemma 22 with $C_K^2 := 2C_1 + 16 C_2 B_{k'}^2$ in case (a), and $C_K^2 := C_1 + C_2 B_{k'}^2$ in case (b).

Proof In case (a), we have $h_{u,v,e}(\lambda, \mu) = g(\|u - v + (\lambda - \mu)e\|^2)$. It follows that
$$|\partial_1 \partial_2 h_{u,v,e}(0, 0)| = \big|{-2}g'(\|u - v\|^2)\|e\|^2 - 4g''(\|u - v\|^2)\langle u - v, e \rangle^2\big| \le 2C_1 + 16 C_2 B_{k'}^2.$$
In case (b), we have $h_{u,v,e}(\lambda, \mu) = g(\langle u + \lambda e, v + \mu e \rangle)$. It follows that
$$|\partial_1 \partial_2 h_{u,v,e}(0, 0)| = \big|g'(\langle u, v \rangle)\|e\|^2 + g''(\langle u, v \rangle)\langle u, e \rangle \langle v, e \rangle\big| \le C_1 + C_2 B_{k'}^2.$$


A.5 Proof of Lemma 12


Proof Let H, H0 the RKHS associated to k, k 0 with the associated feature mappings Φ, Φ0 .
Then it can be checked that (x, x0 ) ∈ X × X 0 7→ Φ(x) ⊗ Φ0 (x0 ) is a feature mapping for
k into the Hilbert space H ⊗ H0 . Using (Steinwart and Christmann, 2008), Th. 4.21,
we deduce that the RKHS H of k contains precisely all functions of the form (x, x0 ) ∈
X × X 0 7→ Fw (x, x0 ) = hw, Φ(x) ⊗ Φ(x0 )i, where w ranges over H ⊗ H0 . Taking w of the
form w = g ⊗ g 0 , g ∈ H, g ∈ H0 , we deduce that H contains in particular all functions of
the form f (x, x0 ) = g(x)g(x0 ), and further
e := span (x, x0 ) ∈ X × X 0 7→ g(x)g(x0 ); g ∈ H, g 0 ∈ H0 ⊂ H.

H
Denote C(X ), C(X 0 ), C(X × X 0 ) the set of real-valued continuous functions on the respective
spaces. Let
C(X ) ⊗ C(X 0 ) := span (x, x0 ) ∈ X × X 0 7→ f (x)f 0 (x0 ); f ∈ C(X ), f 0 ∈ C(X 0 ) .


Let G(x, x0 ) be an arbitrary element of C(X ) ⊗ C(X 0 ), G(x, x0 ) = ki=1 λi gi (x)gi0 (x0 ) with
P
gi ∈ C(X ), gi0 ∈ C(X 0 ) for i = 1, . . . , k. For ε > 0, by universality of k and k 0 , there
exist fi ∈ H, fi0 ∈ H0 so that kfi − gi k∞ ≤ ε, kfi0 − gi0 k∞ ≤ ε for i = 1, . . . , k. Let
F (x, x0 ) := ki=1 λi fi (x)fi0 (x0 ) ∈ H.
P e We have
k
X
F (x, x0 ) − G(x, x0 ) ∞
≤ λi (gi (x)gi0 (x) − fi (x)fi0 (x))
i=1 ∞
k
X h
= λi (fi (x) − gi (x))(gi0 (x0 ) − fi0 (x0 ))
i=1
i
+ gi (x)(gi0 (x) − fi0 (x0 )) + (gi (x) − fi (x))gi0 (x0 )

k
X
≤ε |λi | (ε + kgi k∞ + gi0 ∞
).
i=1

This establishes that He is dense in C(X ) ⊗ C(X 0 ) for the supremum norm. It can be easily
checked that C(X ) ⊗ C(X 0 ) is an algebra of functions which does not vanish and separates
points on X × X 0 . By the Stone-Weierstrass theorem, it is therefore dense in C(X × X 0 ) for
the supremum norm. We deduce that H e (and thus also H) is dense in C(X × X 0 ), so that
k is universal.

A.6 Approximate Feature Mapping for Scalable Implementation


We first treat random Fourier features and then the Nyström method.

A.6.1 Random Fourier Features


The approximation of Rahimi and Recht (2007) is based on Bochner’s theorem, which
characterizes shift invariant kernels.


Theorem 24 A continuous kernel $k(x, y) = k(x - y)$ on $\mathbb{R}^d$ is positive definite if and only if $k(x - y)$ is the Fourier transform of a finite positive measure $p(w)$, i.e.,
$$k(x - y) = \int_{\mathbb{R}^d} p(w)\, e^{jw^T(x - y)}\, dw. \tag{28}$$

If a shift invariant kernel $k(x - y)$ is properly scaled, then Theorem 24 guarantees that $p(w)$ in (28) is a proper probability distribution. Random Fourier features (RFFs) approximate the integral in (28) using samples drawn from $p(w)$. If $w_1, w_2, \ldots, w_L$ are i.i.d. draws from $p(w)$,
$$\begin{aligned}
k(x - y) &= \int_{\mathbb{R}^d} p(w)\, e^{jw^T(x - y)}\, dw = \int_{\mathbb{R}^d} p(w)\cos(w^T x - w^T y)\, dw\\
&\approx \frac{1}{L}\sum_{i=1}^{L} \cos(w_i^T x - w_i^T y) = \frac{1}{L}\sum_{i=1}^{L} \big[\cos(w_i^T x)\cos(w_i^T y) + \sin(w_i^T x)\sin(w_i^T y)\big]\\
&= \frac{1}{L}\sum_{i=1}^{L} [\cos(w_i^T x), \sin(w_i^T x)]^T [\cos(w_i^T y), \sin(w_i^T y)]\\
&= z_w(x)^T z_w(y), \qquad (29)
\end{aligned}$$

where $z_w(x) = \frac{1}{\sqrt{L}}\big[\cos(w_1^T x), \sin(w_1^T x), \ldots, \cos(w_L^T x), \sin(w_L^T x)\big] \in \mathbb{R}^{2L}$ is an approximate nonlinear feature mapping of dimensionality $2L$. In the following, we extend the RFF methodology to the kernel $\bar{k}$ on the extended feature space $\mathcal{P}_X \times \mathcal{X}$. Let $X_1, \ldots, X_{n_1}$ and $X'_1, \ldots, X'_{n_2}$ be i.i.d. realizations of $P_X$ and $P'_X$ respectively, and let $\hat{P}_X$ and $\hat{P}'_X$ denote the corresponding empirical distributions. Given $x, x' \in \mathcal{X}$, denote $\tilde{x} = (\hat{P}_X, x)$ and $\tilde{x}' = (\hat{P}'_X, x')$. The goal is to find an approximate feature mapping $\bar{z}(\tilde{x})$ such that $\bar{k}(\tilde{x}, \tilde{x}') \approx \bar{z}(\tilde{x})^T \bar{z}(\tilde{x}')$. Recall that
$$\bar{k}(\tilde{x}, \tilde{x}') = k_P(\hat{P}_X, \hat{P}'_X)\, k_X(x, x');$$
specifically, we consider $k_X$ and $k'_X$ to be Gaussian kernels and the kernel on distributions $k_P$ to have the Gaussian-like form
$$k_P(\hat{P}_X, \hat{P}'_X) = \exp\left(-\frac{1}{2\sigma_P^2}\big\|\Psi(\hat{P}_X) - \Psi(\hat{P}'_X)\big\|^2_{\mathcal{H}_{k'_X}}\right).$$
As noted earlier in this section, the calculation of $k_P(\hat{P}_X, \hat{P}'_X)$ reduces to the computation of
$$\big\langle \Psi(\hat{P}_X), \Psi(\hat{P}'_X) \big\rangle = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} k'_X(X_i, X'_j). \tag{30}$$


We use Theorem 24 to approximate $k'_X$, and thus $\langle \Psi(\hat{P}_X), \Psi(\hat{P}'_X) \rangle$. Let $w_1, w_2, \ldots, w_L$ be i.i.d. draws from $p'(w)$, the inverse Fourier transform of $k'_X$. Then we have:
$$\begin{aligned}
\big\langle \Psi(\hat{P}_X), \Psi(\hat{P}'_X) \big\rangle &= \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} k'_X(X_i, X'_j)\\
&\approx \frac{1}{L n_1 n_2}\sum_{l=1}^{L}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \cos(w_l^T X_i - w_l^T X'_j)\\
&= \frac{1}{L n_1 n_2}\sum_{l=1}^{L}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \big[\cos(w_l^T X_i)\cos(w_l^T X'_j) + \sin(w_l^T X_i)\sin(w_l^T X'_j)\big]\\
&= Z_P(\hat{P}_X)^T Z_P(\hat{P}'_X),
\end{aligned}$$
where
$$Z_P(\hat{P}_X) = \frac{1}{n_1\sqrt{L}}\sum_{i=1}^{n_1} \big[\cos(w_1^T X_i), \sin(w_1^T X_i), \ldots, \cos(w_L^T X_i), \sin(w_L^T X_i)\big], \tag{31}$$
and $Z_P(\hat{P}'_X)$ is defined analogously with $n_1$ replaced by $n_2$. For the proof of Theorem 25, let $z'_X$ denote the approximate feature map corresponding to $k'_X$, which satisfies $Z_P(\hat{P}_X) = \frac{1}{n_1}\sum_{i=1}^{n_1} z'_X(X_i)$.
Note that the lengths of the vectors $Z_P(\hat{P}_X)$ and $Z_P(\hat{P}'_X)$ are $2L$. To approximate $\bar{k}$ we may write
$$\begin{aligned}
\bar{k}(\tilde{x}, \tilde{x}') &\approx \exp\left(\frac{-\|Z_P(\hat{P}_X) - Z_P(\hat{P}'_X)\|^2_{\mathbb{R}^{2L}}}{2\sigma_P^2}\right) \cdot \exp\left(\frac{-\|x - x'\|^2_{\mathbb{R}^d}}{2\sigma_X^2}\right) \qquad (32)\\
&= \exp\left(\frac{-\big(\sigma_X^2\|Z_P(\hat{P}_X) - Z_P(\hat{P}'_X)\|^2_{\mathbb{R}^{2L}} + \sigma_P^2\|x - x'\|^2_{\mathbb{R}^d}\big)}{2\sigma_P^2\sigma_X^2}\right)\\
&= \exp\left(\frac{-\big(\|\sigma_X Z_P(\hat{P}_X) - \sigma_X Z_P(\hat{P}'_X)\|^2_{\mathbb{R}^{2L}} + \|\sigma_P x - \sigma_P x'\|^2_{\mathbb{R}^d}\big)}{2\sigma_P^2\sigma_X^2}\right)\\
&= \exp\left(\frac{-\big\|(\sigma_X Z_P(\hat{P}_X), \sigma_P x) - (\sigma_X Z_P(\hat{P}'_X), \sigma_P x')\big\|^2_{\mathbb{R}^{2L+d}}}{2\sigma_P^2\sigma_X^2}\right).
\end{aligned}$$
This is also a Gaussian kernel, now on $\mathbb{R}^{2L+d}$. Again by applying Theorem 24, we have
$$\bar{k}\big((\hat{P}_X, x), (\hat{P}'_X, x')\big) \approx \int_{\mathbb{R}^{2L+d}} p(v)\, e^{jv^T\big((\sigma_X Z_P(\hat{P}_X),\, \sigma_P x) - (\sigma_X Z_P(\hat{P}'_X),\, \sigma_P x')\big)}\, dv.$$
Let $v_1, v_2, \ldots, v_Q$ be drawn i.i.d. from $p(v)$, the inverse Fourier transform of the Gaussian kernel with bandwidth $\sigma_P\sigma_X$. Let $u = (\sigma_X Z_P(\hat{P}_X), \sigma_P x)$ and $u' = (\sigma_X Z_P(\hat{P}'_X), \sigma_P x')$. Then
$$\bar{k}(\tilde{x}, \tilde{x}') \approx \frac{1}{Q}\sum_{q=1}^{Q} \cos\big(v_q^T(u - u')\big) = \bar{z}(\tilde{x})^T \bar{z}(\tilde{x}'),$$
where
$$\bar{z}(\tilde{x}) = \frac{1}{\sqrt{Q}}\big[\cos(v_1^T u), \sin(v_1^T u), \ldots, \cos(v_Q^T u), \sin(v_Q^T u)\big] \in \mathbb{R}^{2Q} \tag{33}$$
and $\bar{z}(\tilde{x}')$ is defined similarly.
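A minimal NumPy sketch of this two-stage construction (ours; function and variable names are hypothetical):

```python
import numpy as np

def rff_params(dim, n_features, sigma, rng):
    """Frequencies for a Gaussian kernel of bandwidth sigma: w ~ N(0, I/sigma^2)."""
    return rng.normal(scale=1.0 / sigma, size=(n_features, dim))

def rff_map(X, W):
    """z(x) = (1/sqrt(L)) [cos(w_l^T x), sin(w_l^T x)]_{l=1..L} for each row x of X."""
    proj = X @ W.T
    L = W.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(L)

def z_bar(X_task, x, W1, V, sigma_P, sigma_X):
    """Two-stage map: Z_P(P_hat) as in (31), then RFFs of the Gaussian
    kernel on R^{2L+d} with bandwidth sigma_P * sigma_X, as in (33)."""
    Z_P = rff_map(X_task, W1).mean(axis=0)              # Eqn. (31)
    u = np.concatenate([sigma_X * Z_P, sigma_P * x])    # scaled concatenation
    return rff_map(u[None, :], V)[0]                    # Eqn. (33)

# Example wiring (shapes only): d = X_task.shape[1];
# W1 = rff_params(d, L, sigma_kx_prime, rng);
# V = rff_params(2 * L + d, Q, sigma_P * sigma_X, rng).
```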
This completes the construction of the approximate feature map. The following result,
which uses Hoeffding’s inequality and generalizes a result of Rahimi and Recht (2007), says
that the approximation achieves any desired approximation error with very high probability
as L, Q → ∞.

Theorem 25 Let $L$ be the number of random features used to approximate the kernel on distributions and $Q$ the number of features used to approximate the final product kernel. For any $\epsilon_l > 0$, $\epsilon_q > 0$, $\tilde{x} = (\hat{P}_X, x)$, $\tilde{x}' = (\hat{P}'_X, x')$,
$$P\big(|\bar{k}(\tilde{x}, \tilde{x}') - \bar{z}(\tilde{x})^T\bar{z}(\tilde{x}')| \ge \epsilon_l + \epsilon_q\big) \le 2\exp\left(-\frac{Q\epsilon_q^2}{2}\right) + 6 n_1 n_2 \exp\left(-\frac{L\epsilon^2}{2}\right), \tag{34}$$
where $\epsilon = \frac{\sigma_P^2}{2}\log(1 + \epsilon_l)$, $\sigma_P$ is the bandwidth parameter of the Gaussian-like kernel $k_P$, and $n_1$ and $n_2$ are the sizes of the empirical distributions $\hat{P}_X$ and $\hat{P}'_X$, respectively.

Proof Observe:
$$\bar{k}(\tilde{x}, \tilde{x}') = \exp\left(\frac{-1}{2\sigma_P^2}\big\|\Psi(\hat{P}_X) - \Psi(\hat{P}'_X)\big\|^2\right)\exp\left(\frac{-1}{2\sigma_X^2}\|x - x'\|^2\right),$$
and denote
$$\tilde{k}(\tilde{x}, \tilde{x}') = \exp\left(\frac{-1}{2\sigma_P^2}\big\|Z_P(\hat{P}_X) - Z_P(\hat{P}'_X)\big\|^2\right)\exp\left(\frac{-1}{2\sigma_X^2}\|x - x'\|^2\right).$$
We omit the arguments of $\bar{k}, \tilde{k}$ for brevity. Let $k_q$ be the final approximation ($k_q = \bar{z}(\tilde{x})^T\bar{z}(\tilde{x}')$); then we have
$$|\bar{k} - k_q| = |\bar{k} - \tilde{k} + \tilde{k} - k_q| \le |\bar{k} - \tilde{k}| + |\tilde{k} - k_q|. \tag{35}$$
From Eqn. (35) it follows that
$$P(|\bar{k} - k_q| \ge \epsilon_l + \epsilon_q) \le P(|\bar{k} - \tilde{k}| \ge \epsilon_l) + P(|\tilde{k} - k_q| \ge \epsilon_q). \tag{36}$$
By a direct application of Hoeffding's inequality,
$$P(|\tilde{k} - k_q| \ge \epsilon_q) \le 2\exp\left(-\frac{Q\epsilon_q^2}{2}\right). \tag{37}$$


Recall that $\langle \Psi(\hat{P}_X), \Psi(\hat{P}'_X) \rangle = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} k'_X(X_i, X'_j)$. For a pair $X_i, X'_j$, we have again by Hoeffding
$$P\big(|z'_X(X_i)^T z'_X(X'_j) - k'_X(X_i, X'_j)| \ge \epsilon\big) \le 2\exp\left(-\frac{L\epsilon^2}{2}\right).$$
Let $\Omega_{ij}$ be the event $|z'_X(X_i)^T z'_X(X'_j) - k'_X(X_i, X'_j)| \ge \epsilon$, for particular $i, j$. Using the union bound we have
$$P(\Omega_{11} \cup \Omega_{12} \cup \ldots \cup \Omega_{n_1 n_2}) \le 2 n_1 n_2 \exp\left(-\frac{L\epsilon^2}{2}\right).$$
This implies
$$P\big(|Z_P(\hat{P}_X)^T Z_P(\hat{P}'_X) - \langle \Psi(\hat{P}_X), \Psi(\hat{P}'_X)\rangle| \ge \epsilon\big) \le 2 n_1 n_2 \exp\left(-\frac{L\epsilon^2}{2}\right). \tag{38}$$
Therefore,
$$\begin{aligned}
\bar{k} - \tilde{k} &= \exp\left(\frac{-1}{2\sigma_X^2}\|x - x'\|^2\right)\left[\exp\left(\frac{-1}{2\sigma_P^2}\big\|\Psi(\hat{P}_X) - \Psi(\hat{P}'_X)\big\|^2\right) - \exp\left(\frac{-1}{2\sigma_P^2}\big\|Z_P(\hat{P}_X) - Z_P(\hat{P}'_X)\big\|^2\right)\right]\\
&\le \exp\left(\frac{-1}{2\sigma_P^2}\big\|\Psi(\hat{P}_X) - \Psi(\hat{P}'_X)\big\|^2\right) - \exp\left(\frac{-1}{2\sigma_P^2}\big\|Z_P(\hat{P}_X) - Z_P(\hat{P}'_X)\big\|^2\right)\\
&= \exp\left(\frac{-1}{2\sigma_P^2}\big\|\Psi(\hat{P}_X) - \Psi(\hat{P}'_X)\big\|^2\right)\left[1 - \exp\left(\frac{-1}{2\sigma_P^2}\Big\{\big\|Z_P(\hat{P}_X) - Z_P(\hat{P}'_X)\big\|^2 - \big\|\Psi(\hat{P}_X) - \Psi(\hat{P}'_X)\big\|^2\Big\}\right)\right]\\
&\le 1 - \exp\left(\frac{-1}{2\sigma_P^2}\Big\{Z_P(\hat{P}_X)^T Z_P(\hat{P}_X) - \langle\Psi(\hat{P}_X), \Psi(\hat{P}_X)\rangle + Z_P(\hat{P}'_X)^T Z_P(\hat{P}'_X) - \langle\Psi(\hat{P}'_X), \Psi(\hat{P}'_X)\rangle\right.\\
&\qquad\qquad\qquad\left. -\, 2\big(Z_P(\hat{P}_X)^T Z_P(\hat{P}'_X) - \langle\Psi(\hat{P}_X), \Psi(\hat{P}'_X)\rangle\big)\Big\}\right)\\
&\le 1 - \exp\left(\frac{-1}{2\sigma_P^2}\Big\{\big|Z_P(\hat{P}_X)^T Z_P(\hat{P}_X) - \langle\Psi(\hat{P}_X), \Psi(\hat{P}_X)\rangle\big| + \big|Z_P(\hat{P}'_X)^T Z_P(\hat{P}'_X) - \langle\Psi(\hat{P}'_X), \Psi(\hat{P}'_X)\rangle\big|\right.\\
&\qquad\qquad\qquad\left. +\, 2\big|Z_P(\hat{P}_X)^T Z_P(\hat{P}'_X) - \langle\Psi(\hat{P}_X), \Psi(\hat{P}'_X)\rangle\big|\Big\}\right).
\end{aligned}$$
The result now follows by applying the bound of Eqn. (38) to each of the three terms in the exponent of the preceding expression, together with the stated formula for $\epsilon$ in terms of $\epsilon_l$.


The above result holds for fixed $\tilde{x}$ and $\tilde{x}'$. Following again Rahimi and Recht (2007), one can use an $\epsilon$-net argument to prove a stronger statement for every pair of points in the input space simultaneously. They show:

Lemma 26 Let $\mathcal{M}$ be a compact subset of $\mathbb{R}^d$ with diameter $r = \operatorname{diam}(\mathcal{M})$, and let $D$ be the number of random Fourier features used. Then for the mapping defined in (29), we have
$$P\left(\sup_{x, y \in \mathcal{M}} |z_w(x)^T z_w(y) - k(x - y)| \ge \epsilon\right) \le 2^8\left(\frac{\sigma r}{\epsilon}\right)^2 \exp\left(\frac{-D\epsilon^2}{2(d + 2)}\right),$$
where $\sigma^2 = \mathbb{E}[w^T w]$ is the second moment of the Fourier transform of $k$.

Our RFF approximation of k̄ is grounded on Gaussian RFF approximations on Euclidean


spaces, and thus, the following result holds by invoking Lemma 26, and otherwise following
the argument of Theorem 25.

Theorem 27 Using the same notations as in Theorem 25 and Lemma 26,
$$P\left(\sup_{x, x' \in \mathcal{M}} |\bar{k}(\tilde{x}, \tilde{x}') - \bar{z}(\tilde{x})^T\bar{z}(\tilde{x}')| \ge \epsilon_l + \epsilon_q\right) \le 2^8\left(\frac{\sigma'_X r}{\epsilon_q}\right)^2 \exp\left(\frac{-Q\epsilon_q^2}{2(d + 2)}\right) + 2^9\, 3 n_1 n_2 \left(\frac{\sigma_P \sigma_X r}{\epsilon_l}\right)^2 \exp\left(\frac{-L\epsilon_l^2}{2(d + 2)}\right), \tag{39}$$
where $\sigma'_X$ is the width of the kernel $k'_X$ in Eqn. (30), and $\sigma_P$ and $\sigma_X$ are the widths of the kernels $k_P$ and $k_X$ respectively.

Proof The proof is very similar to the proof of Theorem 25. We use Lemma 26 to replace bound (37) with:
$$P\left(\sup_{x, x' \in \mathcal{M}} |\tilde{k} - k_q| \ge \epsilon_q\right) \le 2^8\left(\frac{\sigma'_X r}{\epsilon_q}\right)^2 \exp\left(\frac{-Q\epsilon_q^2}{2(d + 2)}\right). \tag{40}$$
Similarly, Eqn. (38) is replaced by
$$P\left(\sup_{x, x' \in \mathcal{M}} \big|Z_P(\hat{P}_X)^T Z_P(\hat{P}'_X) - \langle \Psi(\hat{P}_X), \Psi(\hat{P}'_X)\rangle\big| \ge \epsilon\right) \le 2^9 n_1 n_2 \left(\frac{\sigma_P \sigma_X r}{\epsilon_l}\right)^2 \exp\left(\frac{-L\epsilon_l^2}{2(d + 2)}\right). \tag{41}$$
The remainder of the proof now proceeds as in the previous proof.

There are recent developments that give faster rates for approximation quality of random
Fourier features and could potentially be combined with our analysis (Sriperumbudur and


Szabó, 2015; Sutherland and Schneider, 2015). For example, approximation quality for the
kernel mean map is discussed in Sutherland and Schneider (2015), and these ideas could be
extended to Theorem 27 by combining with the two-stage approach presented in this paper.
We also note that our analysis of random Fourier features is separate from our analysis of
the kernel learning algorithm. We have not presented a generalization error bound for the
learning algorithm using random Fourier features (Rudi and Rosasco, 2017).

A.6.2 Nyström Approximation

Like random Fourier features, the Nyström approximation is a technique to approximate


kernel matrices (Williams and Seeger, 2001; Drineas and Mahoney, 2005). Unlike random
Fourier features, for the Nyström approximation, the feature maps are data-dependent.
Also, in the last subsection, all kernels were assumed to be shift invariant. With the
Nyström approximation there is no such assumption.
For a general kernel $k$, the goal is to find a feature mapping $z : \mathbb{R}^d \to \mathbb{R}^L$, where $L > d$, such that $k(x, x') \approx z(x)^T z(x')$. Let $r$ be the target rank of the final approximated kernel matrix, and $m$ the number of selected columns of the original kernel matrix. In general $r \le m \ll n$.

Given data points $x_1, \ldots, x_n$, the Nyström method approximates the kernel matrix by first sampling $m$ data points $x'_1, x'_2, \ldots, x'_m$ without replacement from the original sample, and then constructing the low-rank matrix $\hat{K}_r = \bar{K}\hat{K}^{-1}\bar{K}^T$, where $\bar{K} = [k(x_i, x'_j)]_{n \times m}$ and $\hat{K} = [k(x'_i, x'_j)]_{m \times m}$. Hence, the final approximate feature mapping is
$$z_n(x) = \hat{D}^{-\frac{1}{2}}\hat{V}^T\big[k(x, x'_1), \ldots, k(x, x'_m)\big]^T, \tag{42}$$
where $\hat{D}$ is the eigenvalue matrix of $\hat{K}$ and $\hat{V}$ is the corresponding eigenvector matrix.
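A compact sketch of this construction (ours; `kernel(A, B)` stands for any routine returning the kernel matrix between the rows of A and B):

```python
import numpy as np

def nystrom_map(X, kernel, m, rng=None):
    """Nystrom feature map (42): sample m landmark points, eigendecompose
    the m x m kernel matrix, and map x -> D^{-1/2} V^T [k(x, x'_1..m)]."""
    rng = np.random.default_rng(rng)
    landmarks = X[rng.choice(X.shape[0], size=m, replace=False)]
    K_mm = kernel(landmarks, landmarks)                  # m x m
    d, V = np.linalg.eigh(K_mm)                          # eigenvalues ascending
    keep = d > 1e-10                                     # drop numerically null directions
    d, V = d[keep], V[:, keep]
    def z(Xnew):
        return kernel(Xnew, landmarks) @ V / np.sqrt(d)  # one feature row per input
    return z

# Then k(x, x') is approximated by z(x) @ z(x').T .
```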

A.7 Results in Tabular Format

Examples per Task   16 tasks   64 tasks   256 tasks
  8                   36.01      33.08      31.69
 16                   31.55      31.03      30.96
 32                   30.44      29.31      23.87
256                   23.78       7.22       1.27

Table 1: Average Classification Error of Marginal Transfer Learning on Synthetic Data set


Examples per Task   16 tasks   64 tasks   256 tasks
  8                   49.14      49.11      50.04
 16                   49.89      50.04      49.68
 32                   50.32      50.21      49.61
256                   50.01      50.43      49.93

Table 2: Average Classification Error of Pooling on Synthetic Data set

Examples per Task   10 tasks   15 tasks   20 tasks   25 tasks   30 tasks   35 tasks
 20                   13.78      12.37      11.93      10.74      10.08      11.17
 24                   14.18      11.89      11.51      10.90      10.55      10.18
 28                   14.95      13.29      12.00      10.21      10.59       9.52
 34                   13.27      11.66      11.79       9.16       9.34      10.50
 41                   12.89      11.27      11.17       9.91       9.10      10.05
 49                   13.15      11.70      13.81      10.12       9.01       8.69
 58                   12.16       9.59       9.85       9.28       8.44       7.62
 70                   13.03       9.16       8.80       9.03       8.16       7.88
 84                   11.98       9.18       9.74       9.03       7.30       7.01
100                   12.69       8.48       9.52       8.01       7.14       7.5

Table 3: RMSE of Marginal Transfer Learning on Parkinson’s Disease Data set


Examples per Task   10 tasks   15 tasks   20 tasks   25 tasks   30 tasks   35 tasks
 20                   13.64      11.93      11.95      11.06      11.91      12.08
 24                   13.80      11.83      11.70      11.98      11.68      11.48
 28                   13.78      11.70      11.72      11.18      11.58      11.73
 34                   13.71      12.20      12.04      11.17      11.67      11.92
 41                   13.69      11.73      12.08      11.28      11.55      12.59
 49                   13.75      11.85      11.79      11.17      11.34      11.82
 58                   13.70      11.89      12.06      11.06      11.82      11.65
 70                   13.54      11.86      12.14      11.21      11.40      11.96
 84                   13.55      11.98      12.03      11.25      11.54      12.22
100                   13.53      11.85      11.92      11.12      11.96      11.84

Table 4: RMSE of Pooling on Parkinson’s Disease Data set

Examples per Task    10 tasks   20 tasks   30 tasks   40 tasks
 5                      8.62       7.61       8.25       7.17
15                      6.21       5.90       5.85       5.43
30                      6.61       5.33       5.37       5.35
45                      5.61       5.19       4.71       4.70
all training data       5.36       4.91       3.86       4.08

Table 5: Average Classification Error of Marginal Transfer Learning on Satellite Data set

Examples per Task    10 tasks   20 tasks   30 tasks   40 tasks
 5                      8.13       7.54       7.94       6.96
15                      6.55       5.81       5.79       5.57
30                      6.06       5.36       5.56       5.31
45                      5.58       5.12       5.30       4.99
all training data       5.37       4.98       5.32       5.14

Table 6: Average Classification Error of Pooling on Satellite Data set


Examples per Task   5 tasks   10 tasks   15 tasks   20 tasks
 1024                  9         9.03       9.03       8.70
 2048                  9.12      9.56       9.07       8.62
 4096                  8.96      8.91       9.01       8.66
 8192                  9.18      9.20       9.04       8.74
16384                  9.05      9.08       9.04       8.63

Table 7: Average Classification Error of Marginal Transfer Learning on Flow Cytometry Data set

Examples per Task   5 tasks   10 tasks   15 tasks   20 tasks
 1024                  9.41      9.48       9.32       9.52
 2048                  9.92      9.57       9.45       9.54
 4096                  9.72      9.56       9.36       9.40
 8192                  9.43      9.53       9.38       9.50
16384                  9.42      9.56       9.40       9.33

Table 8: Average Classification Error of Pooling on Flow Cytometry Data set

References
Nima Aghaeepour, Greg Finak, The FlowCAP Consortium, The DREAM Consortium,
Holger Hoos, Tim R. Mosmann, Ryan Brinkman, Raphael Gottardo, and Richard H.
Scheuermann. Critical assessment of automated flow cytometry data analysis techniques.
Nature Methods, 10(3):228–238, 2013.

Kei Akuzawa, Yusuke Iwasawa, and Yutaka Matsuo. Adversarial invariant feature learning
with accuracy constraint for domain generalization. In European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in Databases, 2019.

Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized
learning for domain adaptation under label shifts. In International Conference on Learn-
ing Representations, 2019. URL https://openreview.net/forum?id=rJl0r3R9KX.

Gökhan Bakır, Thomas Hofmann, Bernhard Schölkopf, Alexander J Smola, and Ben Taskar.
Predicting Structured Data. MIT Press, 2007.

Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. MetaReg: Towards do-
main generalization using meta-regularization. In S. Bengio, H. Wallach, H. Larochelle,
K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information
Processing Systems 31, pages 998–1008. Curran Associates, Inc., 2018.

Peter Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Peter Bartlett, Michael Jordan, and Jon McAuliffe. Convexity, classification, and risk
bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence
Research, 12:149–198, 2000.

Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility
of unlabeled target samples. In Nader H. Bshouty, Gilles Stoltz, Nicolas Vayatis, and
Thomas Zeugmann, editors, Algorithmic Learning Theory, pages 139–153, 2012.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of repre-
sentations for domain adaptation. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors,
Advances in Neural Information Processing Systems 19, pages 137–144. 2007.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jen-
nifer Wortman Vaughan. A theory of learning from different domains. Machine Learning,
79:151–175, 2010.

Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under co-
variate shift. Journal of Machine Learning Research, 10:2137–2155, 2009.

Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related clas-
sification tasks to a new unlabeled sample. In John Shawe-Taylor, Richard S. Zemel,
Peter L. Bartlett, Fernando Pereira, and Kilian Q. Weinberger, editors, Advances in
Neural Information Processing Systems 24, pages 2178–2186. 2011.

Gilles Blanchard, Marek Flaska, Gregory Handy, Sara Pozzi, and Clayton Scott. Classifica-
tion with asymmetric label noise: Consistency and maximal denoising. Electronic Journal
of Statistics, 10:2780–2824, 2016.

John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman.
Learning bounds for domain adaptation. In J. C. Platt, D. Koller, Y. Singer, and S. T.
Roweis, editors, Advances in Neural Information Processing Systems 20, pages 129–136.
2008.

Timothy I. Cannings, Yingying Fan, and Richard J. Samworth. Classification with imperfect
training labels. Technical Report arXiv:1805.11505, 2018.

Fabio Maria Carlucci, Antonio D’Innocente, Silvia Bucci, Barbara Caputo, and Tatiana
Tommasi. Domain generalization by solving jigsaw puzzles. 2019 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 2224–2233, 2019.

Rich Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.


Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.

Andreas Christmann and Ingo Steinwart. Universal kernels on non-standard input spaces.
In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors,
Advances in Neural Information Processing Systems 23, pages 406–414, 2010.

Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection
bias correction theory. In Algorithmic Learning Theory, pages 38–53, 2008.

Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation algorithm and
theory based on generalized discrepancy. In Proceedings of the 21th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages
169–178, 2015.

Daryl J. Daley and David Vere-Jones. An Introduction to the Theory of Point Processes,
volume I: Elementary Theory and Methods. Springer, 2003.

Daryl J. Daley and David Vere-Jones. An Introduction to the Theory of Point Processes,
volume II: General Theory and Structure. Springer, 2008.

Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Learning to
learn around a common mean. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing
Systems 31, pages 10169–10179. 2018a.

Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Incremental
learning-to-learn with statistical guarantees. In Proc. Uncertainty in Artificial Intelli-
gence, 2018b.

Zhengming Ding and Yun Fu. Deep domain generalization with structured low-rank con-
straint. IEEE Transactions on Image Processing, 27:304–313, 2018.

Qi Dou, Daniel Coelho de Castro, Konstantinos Kamnitsas, and Ben Glocker. Domain gen-
eralization via model-agnostic learning of semantic features. In H. Wallach, H. Larochelle,
A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural
Information Processing Systems 32, pages 6450–6461. 2019.

Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating
a gram matrix for improved kernel-based learning. The Journal of Machine Learning
Research, 6:2153–2175, 2005.

Marthinus Christoffel Du Plessis and Masashi Sugiyama. Semi-supervised learning of class
balance under class-prior change by distribution matching. In J. Langford and J. Pineau,
editors, Proc. 29th Int. Conf. on Machine Learning, pages 823–830, 2012.

Theodoros Evgeniou, Charles A. Michelli, and Massimiliano Pontil. Learning multiple tasks
with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.


Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LI-
BLINEAR: A library for large linear classification. The Journal of Machine Learning
Research, 9:1871–1874, 2008.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast
adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Interna-
tional Conference on Machine Learning, volume 70 of Proceedings of Machine Learning
Research, pages 1126–1135, 2017.

Chuang Gan, Tianbao Yang, and Boqing Gong. Learning attributes equals multi-source
domain generalization. In The IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), June 2016.

Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A new PAC-
Bayesian perspective on domain adaptation. In International Conference on Machine Learn-
ing, volume 48 of JMLR Workshop and Conference Proceedings, pages 859–868, 2016.

Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain
generalization for object recognition with multi-task autoencoders. In IEEE International
Conference on Computer Vision, pages 2551–2559, 2015.

Muhammad Ghifary, David Balduzzi, W. Bastiaan Kleijn, and Mengjie Zhang. Scatter
component analysis: A unified framework for domain adaptation and domain generaliza-
tion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1411–1430,
2017.

Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard
Schölkopf. Domain adaptation with conditional transferable components. In International
conference on machine learning, pages 2839–2848, 2016.

Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander
Smola. A kernel approach to comparing distributions. In R. Holte and A. Howe, editors,
22nd AAAI Conference on Artificial Intelligence, pages 1637–1641, 2007a.

Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander
Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and
T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–
520, 2007b.

Thomas Grubinger, Adriana Birlutiu, Holger Schöner, Thomas Natschläger, and Tom Hes-
kes. Domain generalization based on transfer component analysis. In I. Rojas, G. Joya,
and A. Catala, editors, Advances in Computational Intelligence, International Work-
Conference on Artificial Neural Networks, volume 9094 of Lecture Notes in Computer
Science, pages 325–334. Springer International Publishing, 2015.

Peter Hall. On the non-parametric estimation of mixture proportions. Journal of the Royal
Statistical Society, 43(2):147–156, 1981.


Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A
dual coordinate descent method for large-scale linear SVM. In International Conference
on Machine Learning, pages 408–415. ACM, 2008.

Shoubo Hu, Kun Zhang, Zhitang Chen, and Laiwan Chan. Domain generalization via
multidomain discriminant analysis. In Amir Globerson and Ricardo Silva, editors, Un-
certainty in Artificial Intelligence, 2019.

Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard
Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural
Information Processing Systems, pages 601–608, 2007.

Wittawat Jitkrittum, Arthur Gretton, Nicolas Heess, SM Eslami, Balaji Lakshminarayanan,
Dino Sejdinovic, and Zoltán Szabó. Kernel-based just-in-time learning for passing expec-
tation propagation messages. In Proceedings of the Thirty-First Conference on Uncer-
tainty in Artificial Intelligence, pages 405–414. AUAI Press, 2015.

Thorsten Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges,
and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11,
pages 169–184. MIT Press, Cambridge, MA, 1999.

Olav Kallenberg. Foundations of Modern Probability. Springer, 2002.

Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to
direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.

Aditya Khosla, Tinghui Zhou, Tomasz Malisiewicz, Alexei A. Efros, and Antonio Torralba.
Undoing the damage of dataset bias. In 12th European Conference on Computer Vision
- Volume Part I, pages 158–171, 2012.

Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model
selection. In International Joint Conference on Artificial Intelligence, volume 14, pages
1137–1145, 1995.

Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans-
actions on Information Theory, 47(5):1902–1914, 2001.

Patrice Latinne, Marco Saerens, and Christine Decaestecker. Adjusting the outputs of a
classifier to new a priori probabilities may significantly improve classification accuracy:
Evidence from a multi-class problem in remote sensing. In C. Sammut and A. H. Hoff-
mann, editors, International Conference on Machine Learning, pages 298–305, 2001.

Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: approximating kernel expansions in
loglinear time. In International Conference on International Conference on Machine
Learning-Volume 28, pages III–244, 2013.

Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and
artier domain generalization. In Proceedings of the IEEE International Conference on
Computer Vision, pages 5542–5550, 2017.


Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Learning to generalize:
Meta-learning for domain generalization. In AAAI Conference on Artificial Intelligence,
2018a.

Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C. Kot. Domain generalization with
adversarial feature learning. In IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 5400–5409, 2018b.

Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Domain gen-
eralization via conditional invariant representations. In AAAI Conference on Artificial
Intelligence, 2018c.

Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng
Tao. Deep domain generalization via conditional invariant adversarial networks. In Pro-
ceedings of the European Conference on Computer Vision (ECCV), pages 624–639, 2018d.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning
bounds and algorithms. In Conference on Learning Theory, 2009a.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with
multiple sources. In Advances in Neural Information Processing Systems, pages 1041–
1048, 2009b.

Andreas Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):
327–350, 2009.

Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. Sparse coding
for multitask and transfer learning. In Sanjoy Dasgupta and David McAllester, editors,
International Conference on Machine Learning, volume 28 of Proceedings of Machine
Learning Research, pages 343–351, 2013.

Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of
multitask representation learning. Journal of Machine Learning Research, 17(1):2853–
2884, 2016.

Aditya Krishna Menon, Brendan van Rooyen, and Nagarajan Natarajan. Learning from
binary labels with instance-dependent noise. Machine Learning, 107:1561–1595, 2018.

Saeid Motiian, Marco Piccirilli, Donald A. Adjeroh, and Gianfranco Doretto. Unified deep
supervised domain adaptation and generalization. In IEEE International Conference on
Computer Vision, pages 5715–5725, 2017.

Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via
invariant feature representation. In International Conference on Machine Learning, vol-
ume 28 of Proceedings of Machine Learning Research, pages I–10–I–18, 2013.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Cost-
sensitive learning with noisy labels. Journal of Machine Learning Research, 18(155):1–33,
2018. URL http://jmlr.org/papers/v18/15-226.html.


Kalyanapuram Rangachari Parthasarathy. Probability Measures on Metric Spaces. Aca-
demic Press, 1967.

Anastasia Pentina and Christoph Lampert. A PAC-Bayesian bound for lifelong learning. In
Eric P. Xing and Tony Jebara, editors, International Conference on Machine Learning,
volume 32 of Proceedings of Machine Learning Research, pages 991–999, 2014.

Iosif F. Pinelis and Aleksandr Ivanovich Sakhanenko. Remarks on inequalities for probabil-
ities of large deviations. Theory Probab. Appl., 30(1):143–148, 1985.

Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil Lawrence.
Dataset Shift in Machine Learning. The MIT Press, 2009.

Ali Rahimi and Ben Recht. Random features for large-scale kernel machines. In Advances
in Neural Information Processing Systems, pages 1177–1184, 2007.

Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random
features. In Advances in Neural Information Processing Systems, pages 3215–3225, 2017.

Tyler Sanderson and Clayton Scott. Class proportion estimation with application to mul-
ticlass anomaly rejection. In Conference on Artificial Intelligence and Statistics, 2014.

Clayton Scott. A generalized Neyman-Pearson criterion for optimal domain adaptation. In
Aurélien Garivier and Satyen Kale, editors, Algorithmic Learning Theory, volume 98 of
Proceedings of Machine Learning Research, pages 738–761, 2019.

Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi,
and Sunita Sarawagi. Generalizing across domains via cross-gradient training. In In-
ternational Conference on Learning Representations, 2018. URL https://openreview.
net/forum?id=r1Dx7fbCW.

Srinagesh Sharma and James W. Cutler. Robust orbit determination and classification: A
learning theoretic approach. Interplanetary Network Progress Report, 203:1, 2015.

Bharath Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In
Advances in Neural Information Processing Systems, pages 1144–1152, 2015.

Bharath Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert
Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of
Machine Learning Research, 11:1517–1561, 2010.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.

Amos J Storkey. When training and test sets are different: characterising learning transfer.
In Dataset Shift in Machine Learning, pages 3–28. MIT Press, 2009.

Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and
Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals
of the Institute of Statistical Mathematics, 60:699–746, 2008.


Danica J. Sutherland and Jeff Schneider. On the error of random Fourier features. In
Uncertainty in Artificial Intelligence, pages 862–871, 2015.

Zoltán Szabó, Bharath K Sriperumbudur, Barnabás Póczos, and Arthur Gretton. Learning
theory for distribution regression. The Journal of Machine Learning Research, 17(1):
5272–5311, 2016.

Dirk Tasche. Fisher consistency for prior probability shift. Journal of Machine Learning
Research, 18:1–32, 2017.

Sebastian Thrun. Is learning the n-th thing any easier than learning the first? Advances in
Neural Information Processing Systems, pages 640–646, 1996.

D. Michael Titterington. Minimum distance non-parametric estimation of mixture propor-
tions. Journal of the Royal Statistical Society, 45(1):37–46, 1983.

Joern Toedling, Peter Rhein, Richard Ratei, Leonid Karawajew, and Rainer Spang. Auto-
mated in-silico detection of cell populations in flow cytometry readouts and its application
to leukemia disease monitoring. BMC Bioinformatics, 7:282, 2006.

Athanasios Tsanas, Max A. Little, Patrick E. McSharry, and Lorraine O. Ramig. Accurate
telemonitoring of Parkinson’s disease progression by noninvasive speech tests. IEEE
Transactions on Biomedical Engineering, 57(4):884–893, 2010.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large
margin methods for structured and interdependent output variables. Journal of Machine
Learning Research, 6(Sep):1453–1484, 2005.

Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels.
Journal of Machine Learning Research, 18(228):1–50, 2018.

Haohan Wang, Zexue He, Zachary C. Lipton, and Eric P. Xing. Learning robust represen-
tations by projecting superficial statistics out. In International Conference on Learning
Representations, 2019. URL https://openreview.net/forum?id=rJEjjoR9K7.

Jenna Wiens. Machine Learning for Patient-Adaptive Ectopic Beat Classification. Master's
thesis, Department of Electrical Engineering and Computer Science, Massachusetts
Institute of Technology, 2010.

Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel
machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.

Zheng Xu, Wen Li, Li Niu, and Dong Xu. Exploiting low-rank structure from latent domains
for domain generalization. In European Conference on Computer Vision, pages 628–643.
Springer, 2014.

Liu Yang, Steve Hanneke, and Jamie Carbonell. A theory of transfer learning with appli-
cations to active learning. Machine Learning, 90(2):161–189, 2013.


Xiaolin Yang, Seyoung Kim, and Eric P. Xing. Heterogeneous multitask learning with
joint sparsity constraints. In Advances in Neural Information Processing Systems, pages
2151–2159, 2009.

Yao-Liang Yu and Csaba Szepesvari. Analysis of kernel mean matching under covariate
shift. In International Conference on Machine Learning, pages 607–614, 2012.

Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Inter-
national Conference on Machine Learning, 2004.

Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adapta-
tion under target and conditional shift. In International Conference on Machine Learning,
pages 819–827, 2013.

Kun Zhang, Mingming Gong, and Bernhard Schölkopf. Multi-source domain adaptation:
A causal view. In AAAI Conference on Artificial Intelligence, pages 3150–3157. AAAI
Press, 2015.
