Dataset biases
Figure 2. Computer plays Name That Dataset. Left: classification
performance as a function of dataset size (log scale) for different
descriptors (notice that performance does not appear to saturate).
Right: confusion matrix.
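For concreteness, the "Name That Dataset" game of Figure 2 can be sketched as follows: given precomputed image descriptors and the identity of the dataset each image came from, a classifier is trained to guess the source dataset of held-out images. This is only an illustrative sketch; the linear SVM and the variable names below are assumptions, not the exact descriptors or classifiers used to produce the figure.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

def name_that_dataset(X_train, y_train, X_test, y_test):
    """Train a classifier to guess which dataset each image came from.

    X_*: (num_images, feature_dim) precomputed descriptors (e.g. GIST-like).
    y_*: integer index of the source dataset for each image.
    """
    clf = LinearSVC(C=1.0).fit(X_train, y_train)
    pred = clf.predict(X_test)
    accuracy = float(np.mean(pred == y_test))  # chance level is 1 / num_datasets
    cm = confusion_matrix(y_test, pred)        # cross-dataset confusions (Fig. 2, right)
    return accuracy, cm
```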
Figure 3. Dataset Look-alikes: Above, ImageNet is trying to impersonate three different datasets. Here, the samples from ImageNet that
are closest to the decision boundaries of the three datasets are displayed. Look-alikes using PASCAL VOC are shown below.
lished, carefully fine-tuned methods. For instance, one major issue for many popular dataset competitions is "creeping overfitting", as algorithms over time become too adapted to the dataset, essentially memorizing all its idiosyncrasies and losing the ability to generalize [14]. Fortunately, this problem can be greatly alleviated by either changing the dataset regularly (as done in PASCAL VOC 2005-2007), or withholding the test set and limiting the number of times a team can request evaluation on it (as done in PASCAL VOC 2008+ and the Caltech Pedestrian benchmark [4]).

Another concern is that our community gives too much value to "winning" a particular dataset competition, regardless of whether the improvement over other methods is statistically significant. For PASCAL VOC, Everingham et al. [6] use the Friedman/Nemenyi test, which, for example, showed no statistically significant difference between the eight top-ranked algorithms in the 2010 competition. More fundamentally, it may be that the right way to treat dataset performance numbers is not as a competition for the top place, but rather as a sanity check for new algorithms and an efficient way of comparing against multiple baselines. This way, fundamentally new approaches will not be forced to compete for top performance right away, but will have a chance to develop and mature.
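As a rough sketch of how such a significance check can be run, the snippet below applies the Friedman test to hypothetical per-class AP scores of three methods. The AP values are random placeholders, and the Nemenyi post-hoc comparison used in [6] is not shown.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical per-class AP for three methods over the 20 PASCAL classes.
rng = np.random.default_rng(0)
ap_method_a = rng.uniform(0.2, 0.6, size=20)
ap_method_b = rng.uniform(0.2, 0.6, size=20)
ap_method_c = rng.uniform(0.2, 0.6, size=20)

# Friedman test: do the methods' rankings differ consistently across classes?
stat, p_value = friedmanchisquare(ap_method_a, ap_method_b, ap_method_c)
print(f"Friedman chi-square = {stat:.2f}, p = {p_value:.3f}")
# A large p-value means no statistically significant difference can be claimed.
```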
Luckily, the above issues are more behavioral than scientific, and should be alleviated as our field develops benchmarking best practices similar to those in other fields. However, there is a more fundamental question: are the datasets measuring the right thing, that is, the expected performance on some real-world task? Unlike datasets in machine learning, where the dataset is the world, computer vision datasets are supposed to be a representation of the world. Yet what we have been witnessing is that our datasets, instead of helping us train models that work in the real open world, have become closed worlds unto themselves, e.g. the Corel world, the Caltech101 world, the PASCAL VOC world, etc. This is particularly unfortunate since, historically, the development of visual datasets has been driven, in no small part, by the desire to be a better, more authentic representation of the visual world.

2.1. The Rise of the Modern Dataset

Any good revolution needs a narrative of struggle against perceived unfairness and bias, and the history of dataset development certainly provides that. From the very beginning, every new dataset was, in a way, a reaction against the biases and inadequacies of the previous datasets in explaining the visual world. The famous single-image dataset Lena, one of the first "real" images (digitized in 1972 from a Playboy centerfold), was a reaction against all the carefully controlled lab stock images, the "dull stuff dating back to television standards work" [10]. In the same spirit, the COIL-100 dataset [12] (a hundred household objects on a black background) was a reaction against the model-based thinking of the time (which focused mostly on staplers), and an embrace of data-driven appearance models that could capture textured objects like Tylenol bottles. Professional collections like Corel Stock Photos and 15 Scenes [13] were a reaction against the simple COIL-like backgrounds and an embrace of visual complexity. Caltech-101 [7] (101 objects mined using Google and cleaned by hand) was partially a reaction against the professionalism of the Corel photos, and an embrace of the wilderness of the Internet. MSRC [19] and LabelMe [16] (both researcher-collected sets), in their turn, were a reaction against the Caltech-like single-object-in-the-center mentality, and an embrace of complex scenes with many objects [14]. PASCAL Visual Object Classes (VOC) [6] was a reaction against the lax training and testing standards of previous datasets [14]. Finally, the batch of very-large-scale, Internet-mined datasets – Tiny Images [18], ImageNet [3], and SUN09 [20] – can be considered a reaction against the inadequacies of training and testing on datasets that are just too small for the complexity of the real world.

On the one hand, this evolution in the development of datasets is perhaps a sign of progress. But on the other hand,
one could also detect a bit of a vicious cycle. Time and
again, we as a community reject the current datasets due to
their perceived biases. Yet time and again, we create new
datasets that turn out to suffer from much the same biases,
though differently manifested. What seems missing, then,
is a clear understanding of the types and sources of bias,
without which, we are doomed to repeat our mistakes.
Table 1. Cross-dataset generalization. Object detection and classification performance (AP) for “car” and “person” when training on one
dataset (rows) and testing on another (columns), i.e. each row is: training on one dataset and testing on all the others. “Self” refers to
training and testing on the same dataset (same as diagonal), and “Mean Others” refers to averaging performance on all except self.
                 Test on:   SUN09  LabelMe  PASCAL  ImageNet  Caltech101  MSRC    Self  Mean others  Percent drop
Train on:
"car" classification
  SUN09                      28.2    29.5    16.3     14.6       16.9     21.9    28.2     19.8          30%
  LabelMe                    14.7    34.0    16.7     22.9       43.6     24.5    34.0     24.5          28%
  PASCAL                     10.1    25.5    35.2     43.9       44.2     39.4    35.2     32.6           7%
  ImageNet                   11.4    29.6    36.0     57.4       52.3     42.7    57.4     34.4          40%
  Caltech101                  7.5    31.1    19.5     33.1       96.9     42.1    96.9     26.7          73%
  MSRC                        9.3    27.0    24.9     32.6       40.3     68.4    68.4     26.8          61%
  Mean others                10.6    28.5    22.7     29.4       39.4     34.1    53.4     27.5          48%
"car" detection
  SUN09                      69.8    50.7    42.2     42.6       54.7     69.4    69.8     51.9          26%
  LabelMe                    61.8    67.6    40.8     38.5       53.4     67.0    67.6     52.3          23%
  PASCAL                     55.8    55.2    62.1     56.8       54.2     74.8    62.1     59.4           4%
  ImageNet                   43.9    31.8    46.9     60.7       59.3     67.8    60.7     49.9          18%
  Caltech101                 20.2    18.8    11.0     31.4      100       29.3   100       22.2          78%
  MSRC                       28.6    17.1    32.3     21.5       67.7     74.3    74.3     33.4          55%
  Mean others                42.0    34.7    34.6     38.2       57.9     61.7    72.4     44.8          48%
"person" classification
  SUN09                      16.1    11.8    14.0      7.9        6.8     23.5    16.1     12.8          20%
  LabelMe                    11.0    26.6     7.5      6.3        8.4     24.3    26.6     11.5          57%
  PASCAL                     11.9    11.1    20.7     13.6       48.3     50.5    20.7     27.1         -31%
  ImageNet                    8.9    11.1    11.8     20.7       76.7     61.0    20.7     33.9         -63%
  Caltech101                  7.6    11.8    17.3     22.5       99.6     65.8    99.6     25.0          75%
  MSRC                        9.4    15.5    15.3     15.3       93.4     78.4    78.4     29.8          62%
  Mean others                 9.8    12.3    13.2     13.1       46.7     45.0    43.7     23.4          47%
"person" detection
  SUN09                      69.6    56.8    37.9     45.7       52.1     72.7    69.6     53.0          24%
  LabelMe                    58.9    66.6    38.4     43.1       57.9     68.9    66.6     53.4          20%
  PASCAL                     56.0    55.6    56.3     55.6       56.8     74.8    56.3     59.8          -6%
  ImageNet                   48.8    39.0    40.1     59.6       53.2     70.7    59.6     50.4          15%
  Caltech101                 24.6    18.1    12.4     26.6      100       31.6   100       22.7          77%
  MSRC                       33.8    18.2    30.9     20.8       69.5     74.7    74.7     34.6          54%
  Mean others                44.4    37.5    31.9     38.4       57.9     63.7    71.1     45.6          36%
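For reference, the summary columns of Table 1 can be derived from the cross-dataset AP matrix as sketched below. This is only a minimal sketch: the dataset list and the matrix `ap` (rows = training set, columns = test set) are assumed inputs, not code from the original experiments.

```python
import numpy as np

datasets = ["SUN09", "LabelMe", "PASCAL", "ImageNet", "Caltech101", "MSRC"]

def summarize(ap: np.ndarray):
    """Compute the "Self", "Mean others" and "Percent drop" columns of Table 1."""
    self_ap = np.diag(ap)                                    # train = test ("Self")
    off_diag = ~np.eye(len(datasets), dtype=bool)
    mean_others = (ap * off_diag).sum(axis=1) / off_diag.sum(axis=1)  # row mean w/o self
    percent_drop = 100.0 * (self_ap - mean_others) / self_ap          # "Percent drop"
    return self_ap, mean_others, percent_drop
```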
one could expect, both Caltech 101 and MSRC are the easiest datasets (column averages) across all tasks. PASCAL and ImageNet are, most of the time, the datasets that generalize the best (row averages), although they score higher on object-centric datasets such as Caltech 101 and MSRC than on scene-centric datasets such as SUN09 and LabelMe. In general, there is a dramatic drop in performance in all tasks and classes when testing on a different test set. For instance, for the "car" classification task the average performance obtained when training and testing on the same dataset is 53.4%, which drops to 27.5% when testing on the other datasets. This is a very significant drop that would, for instance, make a method ranking first in the PASCAL competition become one of the worst. Figure 5 shows a typical example of car classification gone bad: a classifier trained on MSRC "cars" has been applied to six datasets, but it can only find cars in one – MSRC itself.

Overall the results look rather depressing, as little generalization appears to be happening beyond the given dataset. This is particularly surprising given that most datasets are collected from the same source – the Internet. Why is this happening? There are likely several culprits. First, there is clearly some selection bias, as we've shown in Section 1 – datasets often prefer particular kinds of images (e.g. street scenes, or nature scenes, or images retrieved via Internet keyword searches). Second, there is probably some capture bias – photographers tending to take pictures of objects in similar ways (although this bias might be similar across the different datasets). Third, there is category or label bias. This comes from the fact that semantic categories are often poorly defined, and different labellers may assign differing labels to the same type of object [11] (e.g. "grass" vs. "lawn", "painting" vs. "picture"). Finally, there is negative set bias. The negative set defines what the dataset considers to be "the rest of the world". If that set is not representative, or is unbalanced, it could produce classifiers that are overconfident and not very discriminative. Of all the above, negative set bias seems to receive the least attention, so in the next section we investigate it in more detail.

3.2. Negative Set Bias

Datasets define a visual phenomenon (e.g. object, scene, event) not just by what it is (positive instances), but also by what it is not (negative instances). Alas, the space of all possible negatives in the visual world is astronomically large, so datasets are forced to rely on only a small sample.
sample is sufficient to allow a classifier to tease apart the important bits of the visual experience. This is particularly important for classification tasks, where the number of negatives is only a few orders of magnitude larger than the number of positives for each class. For example, if we want to find all images of "boats" in a PASCAL VOC-like classification task setting, how can we make sure that the classifier focuses on the boat itself, and not on the water below, or the shore in the distance (after all, all boats are depicted in water)? This is where a large negative set (including rivers, lakes, sea, etc., without boats) is imperative to "push" the lazy classifier into doing the right thing. Unfortunately, it is not at all easy to stress-test the sufficiency of a negative set in the general case, since it would require huge amounts of labelled (and unbiased) negative data. While beyond the scope of the present paper, we plan to evaluate this issue more fully, perhaps with the help of Mechanical Turk.
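A simpler probe in the spirit of Table 2 is to score a trained classifier on a dataset's positives paired with (a) that dataset's own negatives and (b) negatives pooled from several datasets, and compare the two APs. The sketch below illustrates only this comparison; the score arrays are assumed inputs, and the exact protocol behind Table 2 is not reproduced here.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def ap_with_negatives(scores_pos, scores_neg):
    """AP of a classifier given its scores on positives and on a negative set."""
    labels = np.r_[np.ones(len(scores_pos)), np.zeros(len(scores_neg))]
    scores = np.r_[scores_pos, scores_neg]
    return average_precision_score(labels, scores)

def negative_set_drop(scores_pos, scores_neg_self, scores_neg_pooled):
    """Percent drop in AP when a dataset's own negatives are replaced by
    negatives pooled from several datasets (cf. Table 2)."""
    ap_self = ap_with_negatives(scores_pos, scores_neg_self)
    ap_all = ap_with_negatives(scores_pos, scores_neg_pooled)
    return 100.0 * (ap_self - ap_all) / ap_self
```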
Table 2. Measuring Negative Set Bias.
                     Positive set:  SUN09  LabelMe  PASCAL  ImageNet  Caltech101  MSRC    Mean
"car" detection
  Negative set: self                 67.6    62.4    56.3     60.5       97.7     74.5    70.0
  Negative set: all                  53.8    51.3    47.1     65.2       97.7     70.0    64.1
  Percent drop                        20%     18%     16%      -8%         0%       6%      8%
"person" detection
  Negative set: self                 67.4    68.6    53.8     60.4      100       76.7    71.1
  Negative set: all                  52.2    58.0    42.6     63.4      100       71.5    64.6
  Percent drop                        22%     15%     21%      -5%         0%       7%      9%
Table 3. "Market Value" for a "car" sample across datasets.
                         SUN09 market  LabelMe market  PASCAL market  ImageNet market  Caltech101 market
1 SUN09 is worth           1 SUN09       0.91 LabelMe    0.72 PASCAL    0.41 ImageNet    0 Caltech
1 LabelMe is worth         0.41 SUN09    1 LabelMe       0.26 PASCAL    0.31 ImageNet    0 Caltech
1 PASCAL is worth          0.29 SUN09    0.50 LabelMe    1 PASCAL       0.88 ImageNet    0 Caltech
1 ImageNet is worth        0.17 SUN09    0.24 LabelMe    0.40 PASCAL    1 ImageNet       0 Caltech
1 Caltech101 is worth      0.18 SUN09    0.23 LabelMe    0 PASCAL       0.28 ImageNet    1 Caltech
Basket of Currencies       0.41 SUN09    0.58 LabelMe    0.48 PASCAL    0.58 ImageNet    0.20 Caltech
Given the performance AP_ij(n) obtained when training on dataset i and testing on dataset j, as a function of the number of training samples n, we define the sample value (α) by AP_jj(n) = AP_ij(n/α). In the plots of Fig. 6 this corresponds to a horizontal shift, and α can be estimated as the shift needed to align each pair of graphs. For instance, 1 LabelMe car sample is worth 0.26 PASCAL car samples on the PASCAL benchmark. This means that if we want a modest increase (maybe 10% AP) in performance over the car detector trained with the 1250 PASCAL samples available in PASCAL VOC 2007, we will need 1/0.26 × 1250 × 10 ≈ 50,000 LabelMe samples!
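A minimal sketch of this horizontal-shift estimate is given below; the sample-size grid, the two AP curves, and the range of candidate values of α are assumed, illustrative inputs rather than the actual procedure used to produce Fig. 6 and Table 3.

```python
import numpy as np

def estimate_sample_value(n, ap_cross, ap_self):
    """Estimate alpha such that AP_self(n) ~= AP_cross(n / alpha).

    n        : increasing array of training-set sizes (shared grid)
    ap_cross : AP when training on dataset i and testing on dataset j
    ap_self  : AP when training and testing on dataset j
    """
    log_n = np.log(n)
    best_alpha, best_err = 1.0, np.inf
    # Search candidate alphas (log-spaced, mostly below 1, since cross-dataset
    # samples are usually worth less than same-dataset samples).
    for alpha in np.logspace(-2, 0.5, 200):
        # Cross-dataset curve evaluated at n / alpha, i.e. shifted horizontally
        # along the log-sample-size axis (endpoints are clamped by np.interp).
        shifted = np.interp(log_n - np.log(alpha), log_n, ap_cross)
        err = np.mean((shifted - ap_self) ** 2)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```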
Table 3 shows the "market value" of training samples from different datasets.² One observation is that the sample values are always smaller than 1 – each training sample gets devalued when it is used on a different dataset. There is no theoretical reason why this should be the case; it is only due to the strong biases present in actual datasets. So, what is the value of current datasets when used to train algorithms that will be deployed in the real world? The answer that emerges can be summarized as: "better than nothing, but not by much".

² We have also experimented with "incremental market value" – how much data from other datasets helps after using all the original data. We found that this quickly converges to the absolute "market value".

5. Discussion

Is it to be expected that there is a big drop in performance when training on one dataset and testing on another? One could start by arguing that the reason is not that the datasets are bad, but that our object representations and recognition algorithms are terrible and end up over-learning aspects of the visual data that relate to the dataset and not to the ultimate visual task. In fact, a human learns about vision by living in a reduced environment with many potential local biases, and yet the visual system is robust enough to overcome this. However, let us not put all the blame on the algorithms, at least not yet. If a dataset defines a "car" to be the rear view of a race-car, then there is no reasonable algorithm that will say that a side view of a family sedan is also a "car".

So, how well do the currently active recognition datasets stack up overall? Unsurprisingly, our results show that Caltech-101 is extremely biased, with virtually no observed generalization, and should have been retired long ago (as argued by [14] back in 2006). Likewise, MSRC has also fared very poorly. On the other hand, most modern datasets, such as PASCAL VOC, ImageNet and SUN09, have fared comparatively well, suggesting that perhaps things are starting to move in the right direction.

Should we care about the quality of our datasets? If the goal is to reduce computer vision to a set of feature vectors that can be used in some machine learning algorithm, then maybe not. But if the goal is to build algorithms that can understand the visual world, then having the right datasets will be crucial. In the next section we outline some recommendations for developing better datasets.

6. Epilogue

Is there any advice that can be offered to researchers thinking of creating a new dataset on how to detect and avoid bias? We think that a good first step would be to run any new dataset through the battery of tests outlined in this paper (we will be happy to publish all code and data online). While this will not detect all potential sources of bias, it might help to find the main problematic issues quickly and early, not years after the dataset has been released. What about tips on how to avoid, or at least minimize, the effects of bias during dataset construction itself? Here we briefly go over a few suggestions for minimizing each type of bias:

Selection Bias: As suggested by Figure 2, datasets that are gathered automatically fare better than those collected manually. However, getting images from the Internet does not in itself guarantee a fair sampling, since keyword-based searches will return only particular types of images. Obtaining data from multiple sources (e.g. multiple search engines
from multiple countries [3]) can somewhat decrease selection bias. However, it might be even better to start with a large collection of unannotated images and label them by crowd-sourcing.

Capture Bias: Professional photographs, as well as photos collected using keyword search, appear to suffer considerably from capture bias. The most well-known bias is that the object is almost always in the center of the image. Searching for "mug" on Google Image Search reveals another kind of capture bias: almost all the mugs have a right-facing handle. Beyond better data-sampling strategies, one way to deal with this is to perform various data transformations to reduce the bias, such as flipping images left-right [8, 9] (but note that any text will then appear the wrong way), or jittering the image [2], e.g. via small affine transformations [18]. Another fruitful direction might be generating various automatic crops of the image.
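The following is a minimal sketch of such transformations using Pillow; the flip probability, jitter magnitudes and crop ratio are arbitrary illustrative choices, not values prescribed by this paper.

```python
import random
from PIL import Image

def jitter(img: Image.Image) -> Image.Image:
    """Return a randomly perturbed copy of an image: an occasional left-right
    flip, a small rotation/translation, and a random crop resized back."""
    out = img
    # Mirror half of the time (beware: any text will then read backwards).
    if random.random() < 0.5:
        out = out.transpose(Image.FLIP_LEFT_RIGHT)
    # Small rotation plus translation as a mild affine-style jitter.
    angle = random.uniform(-5.0, 5.0)
    dx, dy = random.randint(-8, 8), random.randint(-8, 8)
    out = out.rotate(angle, resample=Image.BILINEAR, translate=(dx, dy))
    # Random crop covering ~90% of the area, resized to the original size.
    w, h = out.size
    cw, ch = int(0.9 * w), int(0.9 * h)
    x0, y0 = random.randint(0, w - cw), random.randint(0, h - ch)
    return out.crop((x0, y0, x0 + cw, y0 + ch)).resize((w, h), Image.BILINEAR)
```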
Negative Set Bias: As we have shown, having a rich and unbiased negative set is important for classifier performance. Therefore, datasets that only collect the things they are interested in might be at a disadvantage, because they are not modeling the rest of the visual world. One remedy, proposed in this paper, is to add negatives from other datasets. Another approach, suggested by Mark Everingham, is to use a few standard algorithms (e.g. bag of words) to actively mine hard negatives from a very large unlabelled set as part of dataset construction, and then manually go through them to weed out true positives. The downside is that the resulting dataset will be biased against existing algorithms.
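A hedged sketch of that mining step is given below; the baseline models, the feature pool and the manual review step are assumptions for illustration, not the procedure of any dataset cited here.

```python
import numpy as np

def mine_hard_negatives(baseline_models, pool_features, top_k=1000):
    """Rank a large unlabelled pool by the average score of a few baseline
    classifiers (e.g. bag-of-words SVMs exposing decision_function) and return
    the indices of the highest-scoring images as candidate hard negatives,
    to be reviewed by hand so that true positives can be weeded out."""
    scores = np.mean(
        [m.decision_function(pool_features) for m in baseline_models], axis=0)
    return np.argsort(-scores)[:top_k]
```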
This paper is only the start of an important conversation about datasets. We suspect that, despite the title, our own biases have probably crept into these pages, so there is clearly much more to be done. All that we hope is that our work will start a dialogue about this very important and underappreciated issue.
Acknowledgements: The authors would like to thank the Eyjafjallajokull volcano as well as the wonderful kirs at the Buvette in Jardin du Luxembourg for the motivation (former) and the inspiration (latter) to write this paper. This work is part of a larger effort, joint with David Forsyth and Jay Yagnik, on understanding the benefits and pitfalls of using large data in vision. The paper was co-sponsored by ONR MURIs N000141010933 and N000141010934.

Disclaimer: No graduate students were harmed in the production of this paper. Authors are listed in order of increasing procrastination ability.
References

[1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[2] D. DeCoste and M. Burl. Distortion-invariant recognition via jittered queries. In CVPR, 2000.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[4] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In CVPR, 2009.
[5] L. Duan, I. W.-H. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
[7] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop of Generative Model Based Vision, 2004.
[8] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[9] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
[10] J. Hutchison. Culture, communication, and an information age madonna. IEEE Professional Communication Society Newsletter, volume 45, 2001.
[11] T. Malisiewicz and A. A. Efros. Recognition by association via learning per-exemplar distances. In CVPR, 2008.
[12] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-100). Technical Report CUCS-006-96, Columbia Univ., 1996.
[13] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.
[14] J. Ponce, T. L. Berg, M. Everingham, D. Forsyth, M. Hebert, S. Lazebnik, M. Marszałek, C. Schmid, C. Russell, A. Torralba, C. Williams, J. Zhang, and A. Zisserman. Dataset issues in object recognition. In Towards Category-Level Object Recognition. Springer, 2006.
[15] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, January 1998.
[16] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, 2008.
[17] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Transferring visual category models to new domains. In ECCV, 2010.
[18] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large database for non-parametric object and scene recognition. IEEE PAMI, 30(11):1958–1970, November 2008.
[19] J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In ICCV, 2005.
[20] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[21] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In MULTIMEDIA '07, 2007.