
FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age

Kimmo Kärkkäinen, UCLA
Jungseock Joo, UCLA

arXiv:1908.04913v1 [cs.CV] 14 Aug 2019

Abstract

Existing public face datasets are strongly biased toward Caucasian faces, and other races (e.g., Latino) are significantly underrepresented. This can lead to inconsistent model accuracy, limit the applicability of face analytic systems to non-White race groups, and adversely affect research findings based on such skewed data. To mitigate the race bias in these datasets, we construct a novel face image dataset, containing 108,501 images, with an emphasis on balanced race composition. We define 7 race groups: White, Black, Indian, East Asian, Southeast Asian, Middle East, and Latino. Images were collected from the YFCC-100M Flickr dataset and labeled with race, gender, and age groups. Evaluations were performed on existing face attribute datasets as well as novel image datasets to measure generalization performance. We find that the model trained from our dataset is substantially more accurate on novel datasets and that its accuracy is consistent across race and gender groups. The dataset will be released via https://github.com/joojs/fairface.

1. Introduction

To date, numerous large scale face image datasets [22, 31, 13, 69, 37, 24, 41, 68, 15, 27, 46, 8, 39] have been proposed and have fostered research and development for automated face detection [35, 21], alignment [66, 44], recognition [57, 49], generation [67, 5, 26, 58], modification [3, 32, 19], and attribute classification [31, 37]. These systems have been successfully translated into many areas including security, medicine, education, and social sciences.

Despite the sheer amount of available data, existing public face datasets are strongly biased toward Caucasian faces, and other races (e.g., Latino) are significantly underrepresented. A recent study shows that most existing large scale face databases are biased towards "lighter skin" faces (around 80%), e.g., White, compared to "darker" faces, e.g., Black [39]. This means the model may not apply to some subpopulations and its results may not be compared across different groups without calibration. Biased data will produce biased models trained from it. This raises ethical concerns about the fairness of automated systems, which has emerged as a critical topic of study in the recent machine learning and AI literature [17, 11].

For example, several commercial computer vision systems (Microsoft, IBM, Face++) have been criticized due to their asymmetric accuracy across sub-demographics in recent studies [7, 42]. These studies found that the commercial face gender classification systems all perform better on male faces and on lighter-skinned faces. This can be caused by biases in their training data. Various unwanted biases in image datasets can easily occur due to biased selection, capture, and negative sets [60]. Most public large scale face datasets have been collected from popular online media (newspapers, Wikipedia, or web search), and these platforms are more frequently used by, or show, White people.

To mitigate the race bias in the existing face datasets, we propose a novel face dataset with an emphasis on balanced race composition. Our dataset contains 108,501 facial images collected primarily from the YFCC-100M Flickr dataset [59], which can be freely shared for research purposes, and also includes examples from other sources such as Twitter and online newspaper outlets. We define 7 race groups: White, Black, Indian, East Asian, Southeast Asian, Middle East, and Latino. Our dataset is well-balanced on these 7 groups (see Figures 3 and 2).

Our paper makes three main contributions. First, we empirically show that existing face attribute datasets and models learned from them do not generalize well to unseen data in which more non-White faces are present. Second, we show that our new dataset performs better on the novel data, not only on average but also across racial groups, i.e., it is more consistent. Third, to the best of our knowledge, our dataset is the first large scale face attribute dataset in the wild which includes Latino and Middle Eastern groups and differentiates East Asian and Southeast Asian. Computer vision has been rapidly adopted in other fields such as economics or social sciences, where researchers want to analyze different demographics using image data. The inclusion of major racial groups, which have been missing in existing datasets, therefore significantly enlarges the applicability of computer vision methods to these fields.

Table 1: Statistics of Face Attribute Datasets
(Columns: Name | Source | # of faces | annotations and balance flags as listed in the original table: In-the-wild?, Age, Gender, race annotation for White* (W, ME), Asian* (E, SE), Black, Indian, Latino, and Balanced?)

PPB [7] | Gov. Official Profiles | 1K | X X **Skin color prediction
MORPH [45] | Public Data | 55K | X X merged X X | no
PubFig [31] | Celebrity | 13K | X | Model generated predictions | no
IMDB-WIKI [46] | IMDB, WIKI | 500K | X X X | no
FotW [13] | Flickr | 25K | X X X | yes
CACD [10] | Celebrity | 160K | X X | no
DiF [39] | Flickr | 1M | X X X **Skin color prediction
†CelebA [37] | CelebFace [54, 55], LFW [23] | 200K | X X X | no
LFW+ [16] | LFW [23] (Newspapers) | 15K | X X X merged merged | no
†LFWA+ [37] | LFW [23] (Newspapers) | 13K | X X merged merged X X | no
†UTKFace [75] | MORPH, CACD, Web | 20K | X X X merged merged X X | yes
FairFace (Ours) | Flickr, Twitter, Newspapers, Web | 108K | X X X X X X X X X X | yes

*FairFace (Ours) also defines East (E) Asian, Southeast (SE) Asian, Middle Eastern (ME), and Western (W) White.
**PPB and DiF do not provide race annotations; skin color is annotated or automatically computed as a proxy to race.
†denotes datasets used in our experiments.

2. Related Work

2.1. Face Attribute Recognition

Face attribute recognition is the task of classifying various human attributes such as gender, race, age, emotions, expressions, or other facial traits from facial appearance [31, 25, 74, 37]. While many techniques have been developed for the task, we mainly review datasets, which are the main concern of this paper.

Table 1 summarizes the statistics of existing large scale face attribute datasets and our new dataset. This is not an exhaustive list; we focus on public, in-the-wild datasets with gender, race, and age annotations. As stated earlier, most of these datasets were constructed from online sources, which are typically dominated by the White race.

Face attribute recognition has been applied as a sub-component of other computer vision systems. For example, Kumar et al. [31] used facial attributes such as gender, race, hair style, expressions, and accessories as features for face verification, as the attributes characterize individual traits. Attributes are also widely used for person re-identification in images or videos, combining features from the human face and body appearance [33, 34, 53]; this is especially effective when faces are not fully visible or too small. These systems have applications in security, such as authentication for electronic devices (e.g., unlocking smartphones) or monitoring surveillance CCTVs [14].

It is imperative to ensure that these systems perform evenly well on different gender and race groups. Failing to do so can be detrimental to the reputations of individual service providers and to public trust in the machine learning and computer vision research community. The most notable incidents regarding racial bias include Google Photos recognizing African American faces as gorillas and Nikon's digital cameras prompting the message "did someone blink?" to Asian users [73]. These incidents, regardless of whether the models were trained improperly or how much they actually affected users, often result in the termination of the service or features (e.g., dropping sensitive output categories). For this reason, most commercial service providers have stopped providing a race classifier.

Face attribute recognition is also widely used for demographic surveys in marketing or social science research, aimed at understanding human social behaviors and their relation to the demographic backgrounds of individuals. Using off-the-shelf tools [2, 4] and commercial services, social scientists, who traditionally did not use images, have begun to use images of people to infer their demographic attributes and analyze their behaviors in many studies. Notable examples are demographic analyses of social media users using their photographs [9, 43, 64, 65, 62]. The cost of unfair classification is huge, as it can over- or under-estimate specific sub-populations in their analysis, which may have policy implications.

Figure 1: Individual Typology Angle (ITA), i.e., skin color, distribution of different races measured in our dataset.

Figure 2: Racial compositions in face datasets.

2.2. Fair Classification and Dataset Bias

Researchers in AI and machine learning have increasingly paid attention to algorithmic fairness and dataset and model biases [71, 11, 76, 72]. There exist many different definitions of fairness in the literature [61]. In this paper, we focus on balanced accuracy: whether the attribute classification accuracy is independent of race and gender. More generally, research in fairness is concerned with a model's ability to produce fair outcomes (e.g., loan approval) independent of protected or sensitive attributes such as race or gender.

Studies in algorithmic fairness have focused on either 1) discovering (auditing) existing bias in datasets or systems [50, 7, 30], 2) making a better dataset [39, 1], or 3) designing a better algorithm or model [12, 1, 47, 71, 70]. Our paper falls into the first two categories.

In computer vision, it has been shown that popular large scale image datasets such as ImageNet are biased in terms of the origin of images (45% were from the U.S.) [56] or the underlying association between scene and race [52]. Can we make a perfectly balanced dataset? It is "infeasible to balance across all possible co-occurrences" of attributes [20]. This is possible in a lab-controlled setting, but not in a dataset "in-the-wild".

Therefore, the contribution of our paper is to mitigate, not entirely solve, the current limitations and biases of existing databases by collecting more diverse face images from non-White race groups. We empirically show that this significantly improves the generalization performance on novel image datasets whose racial compositions are not dominated by the White race. Furthermore, as shown in Table 1, our dataset is the first large scale in-the-wild face image dataset which includes Southeast Asian and Middle Eastern races. While their faces share similarity with the East Asian and White groups, we argue that not having these major race groups in datasets is a strong form of discrimination.

3. Dataset Construction

3.1. Race Taxonomy

In this study, we define 7 race groups: White, Black, Indian, East Asian, Southeast Asian, Middle East, and Latino. Race and ethnicity are different categorizations of humans: race is defined based on physical traits, and ethnicity is based on cultural similarities [48]. For example, Asian immigrants in Latin America can be of Latino ethnicity. In practice, these two terms are often used interchangeably. Race is not a discrete concept and needs to be clearly defined before data collection.

We first adopted a commonly accepted race classification from the U.S. Census Bureau (White, Black, Asian, Hawaiian and Pacific Islanders, Native Americans, and Latino). Latino is often treated as an ethnicity, but we consider Latino a race, which can be judged from facial appearance. We then further divided subgroups such as Middle Eastern, East Asian, Southeast Asian, and Indian, as they look clearly distinct. During the data collection, we found very few examples for Hawaiian and Pacific Islanders and for Native Americans and discarded these categories. All the experiments conducted in this paper were therefore based on the 7 race classification.

A few recent studies [7, 39] use skin color as a proxy for racial or ethnic grouping. While skin color can be easily computed without subjective annotations, it has limitations. First, skin color is heavily affected by illumination and lighting conditions. The Pilot Parliaments Benchmark (PPB) dataset [7] only used profile photographs of government officials taken in well controlled lighting, which makes it non-in-the-wild. Second, within-group variations of skin color are huge; even the same individual can show different skin colors over time. Third, and most importantly, race is a multidimensional concept whereas skin color (i.e., brightness) is one dimensional. Figure 1 shows the distributions of the skin color of multiple race groups, measured by the Individual Typology Angle (ITA) [63]. As clearly shown there, skin color provides no information to differentiate many groups, such as East Asian and White. Therefore, we explicitly use race and annotate the physical race based on human annotators' judgments.
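As an illustration of how such a skin-color proxy can be computed, the sketch below implements the standard ITA formula from the CIELAB values of a skin patch. The formula and the example values are our own illustration, not code or measurements from the paper.

```python
# Minimal sketch (our illustration, not the paper's code): Individual Typology Angle
# from CIELAB lightness L* and yellow-blue component b* of a skin patch,
# ITA = arctan((L* - 50) / b*) expressed in degrees.
import math

def individual_typology_angle(l_star: float, b_star: float) -> float:
    """Return ITA in degrees; larger values correspond to lighter skin."""
    return math.degrees(math.atan2(l_star - 50.0, b_star))

# Example values for a hypothetical skin patch.
print(round(individual_typology_angle(l_star=65.0, b_star=18.0), 1))  # ~39.8 degrees
```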

Figure 3: Random samples from face attribute datasets: (a) FairFace, (b) UTKFace, (c) LFWA+, (d) CelebA.

3.2. Image Collection and Annotation

Many existing face datasets have relied on photographs of public figures such as politicians or celebrities [31, 23, 24, 46, 37]. Despite the ease of collecting images and ground truth attributes, the selection of these populations may be biased. For example, politicians may be older and actors may be more attractive than typical faces. Their images are usually taken by professional photographers in limited situations, leading to a quality bias. Some datasets were collected via web search using keywords such as "Asian boy" [75]. These queries may return only stereotypical faces or prioritize celebrities in those categories rather than diverse individuals from the general public.

Our goal is to minimize the selection bias introduced by such filtering and to maximize the diversity and coverage of the dataset. We started from a huge public image dataset, the Yahoo YFCC100M dataset [59], and detected faces from the images without any preselection. A recent work also used the same dataset to construct a huge unfiltered face dataset (Diversity in Faces, DiF) [39]. Our dataset is smaller but more balanced on race (see Figure 2).

Specifically, we incrementally increased the dataset size. We first detected and annotated 7,125 faces randomly sampled from the entire YFCC100M dataset, ignoring the locations of the images. After obtaining annotations on this initial set, we estimated the demographic composition of each country. Based on this statistic, we adaptively adjusted the number of images sampled from each country such that the dataset is not dominated by the White race. Consequently, we excluded the U.S. and European countries in the later stages of data collection after we had sampled enough White faces from those countries. The minimum size of a detected face was set to 50 by 50 pixels. This is a relatively small size compared to other datasets, but we find the attributes are still recognizable and these examples can actually make the classifiers more robust against noisy data. We only used images with "Attribution" and "Share Alike" Creative Commons licenses, which allow derivative work and commercial usage. We used Amazon Mechanical Turk to verify the race, gender, and age group for each face.
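The minimum-face-size rule above is easy to reproduce. The sketch below is a hypothetical filtering step using dlib's default HOG detector; the paper's experiments mention dlib's CNN detector, so the detector choice and the helper name are our assumptions.

```python
# Minimal sketch (assumption, not the released collection pipeline): detect faces
# with dlib and keep only detections of at least 50 x 50 pixels.
import dlib

detector = dlib.get_frontal_face_detector()  # HOG detector used here for brevity

def usable_faces(image, min_size: int = 50):
    """Return face rectangles whose width and height are both >= min_size pixels."""
    return [rect for rect in detector(image, 1)
            if rect.width() >= min_size and rect.height() >= min_size]

# Usage: faces = usable_faces(dlib.load_rgb_image("photo.jpg"))
```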

We assigned three workers to each image. If two or three workers agreed on their judgments, we took those values as ground truth. If all three workers produced different responses, we republished the image to another three workers and subsequently discarded the image if the new annotators also did not agree. The annotations at this stage were still noisy. We further refined them by training a model from the initial ground truth annotations and applying it back to the dataset; we then manually re-verified the annotations of images whose annotations differed from the model predictions.
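A consolidation rule like the one described above can be sketched as follows; this is our own illustration of a two-out-of-three majority vote, not the released annotation pipeline.

```python
# Minimal sketch (our illustration): keep a label only when at least two of the
# three Mechanical Turk responses agree; otherwise the face is re-published.
from collections import Counter
from typing import List, Optional

def consolidate(responses: List[str]) -> Optional[str]:
    """Return the majority label among three worker responses, or None if all differ."""
    label, count = Counter(responses).most_common(1)[0]
    return label if count >= 2 else None

print(consolidate(["East Asian", "East Asian", "Southeast Asian"]))  # -> East Asian
print(consolidate(["White", "Latino", "Middle Eastern"]))            # -> None (re-publish)
```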
4. Experiments

4.1. Measuring Bias in Datasets

We first measure how skewed each dataset is in terms of its race composition. For the datasets with race annotations, we use the reported statistics. For the other datasets, we annotated the race labels for 3,000 random samples drawn from each dataset. See Figure 2 for the result. As expected, most existing face attribute datasets, especially the ones focusing on celebrities or politicians, are biased toward the White race. Unlike race, we find that most datasets are relatively more balanced on gender, ranging from 40% to 60% male ratio.

4.2. Model and Cross-Dataset Performance

To compare model performance across datasets, we used an identical model architecture, ResNet-34 [18], trained on each dataset. We used Adam optimization [29] with a learning rate of 0.0001. Given an image, we detected faces using dlib's (dlib.net) CNN-based face detector [28] and ran the attribute classifier on each face. The experiments were done in PyTorch.
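A minimal PyTorch sketch of this setup is given below; it assumes a torchvision ResNet-34 fine-tuned as a single-attribute (e.g., gender) classifier, which is our simplification rather than the authors' released training code.

```python
# Minimal sketch (not the authors' code): ResNet-34 attribute classifier trained
# with Adam at the learning rate reported in the paper (0.0001).
import torch
import torch.nn as nn
from torchvision import models

def build_classifier(num_classes: int) -> nn.Module:
    model = models.resnet34(pretrained=True)                 # ImageNet init is an assumption
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # e.g., 2 for gender, 7 for race
    return model

model = build_classifier(num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of cropped face images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```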
Throughout the evaluations, we compare our dataset with three other datasets: UTKFace [75], LFWA+, and CelebA [37]. Both UTKFace and LFWA+ have race annotations and thus are suitable for comparison with our dataset. CelebA does not have race annotations, so we only use it for gender classification. See Table 1 for more detailed dataset characteristics.

Using models trained on these datasets, we first performed cross-dataset classifications by alternating training sets and test sets. Note that FairFace is the only dataset with 6 races. To make it compatible with the other datasets, we merged our fine racial groups when testing on the other datasets. CelebA does not have race annotations but was included for gender classification.

Tables 3 and 4 show the classification results for race, gender, and age on these datasets across subpopulations. As expected, each model tends to perform better on the same dataset on which it was trained. However, the accuracy of our model was highest on some variables of the LFWA+ dataset and very close to the leader in the other cases. This is partly because LFWA+ is the most biased dataset and ours is the most diverse, and thus more generalizable, dataset.

4.3. Generalization Performance

4.3.1 Datasets

To test the generalization performance of the models, we consider three novel datasets. Note that these datasets were collected from completely different sources than our data from Flickr and were not used in training. Since we want to measure the effectiveness of the models on diverse races, we chose test datasets that contain people in different locations, as follows.

Geo-tagged Tweets. First, we consider images uploaded by Twitter users whose locations are identified by geo-tags (longitude and latitude), provided by [51]. From this set, we chose four countries (France, Iraq, Philippines, and Venezuela) and randomly sampled 5,000 faces.

Media Photographs. Next, we also use photographs posted by 500 online professional media outlets. Specifically, we use a public dataset of tweet IDs [36] posted by 4,000 known media accounts, e.g., @nytimes. Note that although we use Twitter to access the photographs, these tweets are simply external links to pages on the main newspaper sites. Therefore this data is considered media photographs and is different from general tweet images, which are mostly uploaded by ordinary users. We randomly sampled 8,000 faces from the set.

Protest Dataset. Lastly, we also use a public image dataset collected for a recent protest activity study [64]. The authors collected the majority of the data from Google Image search using keywords such as "Venezuela protest" or "football game" (for hard negatives). The dataset exhibits a wide range of diverse race and gender groups engaging in different activities in various countries. We randomly sampled 8,000 faces from the set.

These faces were annotated for gender, race, and age by Amazon Mechanical Turk workers.

4.3.2 Result

Table 6 shows the classification accuracy of the different models. Because our dataset is larger than LFWA+ and UTKFace, we report three variants of the FairFace model obtained by limiting the size of the training set (9k, 18k, and Full) for fair comparison.

Improved Accuracy. As clearly shown in the results, the model trained on FairFace outperforms all the other models for race, gender, and age on the novel datasets, which have never been used in training and also come from different data sources. The models trained with fewer training images (9k and 18k) still outperform those trained on the other datasets, including CelebA, which is larger than FairFace.

Figure 4: t-SNE visualizations [38] of faces in datasets: (a) FairFace, (b) UTKFace, (c) LFWA+.

This suggests that the dataset size is not the only reason for the performance improvement.

Balanced Accuracy. Our model also produces more consistent results for race, gender, and age classification across different race groups compared to the other datasets. We measure model consistency by the standard deviation of classification accuracy measured on different sub-populations, as shown in Table 5. More formally, one can consider conditional use accuracy equality [6] or equalized odds [17] as the measure of fair classification. For gender classification:

P(Ŷ = i | Y = i, A = j) = P(Ŷ = i | Y = i, A = k),   i ∈ {male, female}, ∀ j, k ∈ D,   (1)

where Ŷ is the predicted gender, Y is the true gender, A refers to the demographic group, and D is the set of different demographic groups being considered (i.e., race). When we consider different gender groups for A, this needs to be modified to measure accuracy equality [6]:

P(Ŷ = Y | A = j) = P(Ŷ = Y | A = k),   ∀ j, k ∈ D.   (2)

We therefore define the maximum accuracy disparity of a classifier as follows:

ε(Ŷ) = max_{j,k ∈ D} log [ P(Ŷ = Y | A = j) / P(Ŷ = Y | A = k) ].   (3)

Table 2 shows the gender classification accuracy of the different models measured on the external validation datasets for each race and gender group. The FairFace model achieves the lowest maximum accuracy disparity. The LFWA+ model yields the highest disparity, strongly biased toward the male category. The CelebA model tends to exhibit a bias toward the female category, as the dataset contains more female images than male.

The FairFace model achieves less than 1% accuracy discrepancy between male ↔ female and White ↔ non-White for gender classification (Table 6). All the other models show a strong bias toward the male class, yielding much lower accuracy on the female group, and perform more inaccurately on the non-White group. The gender performance gap was the biggest in LFWA+ (32%), which is the smallest among the datasets used in the experiment. Recent work has also reported asymmetric gender biases in commercial computer vision services [7], and our result further suggests the cause is likely due to the unbalanced representation in training data.

Data Coverage and Diversity. We further investigate dataset characteristics to measure the data diversity in our dataset. We first visualize randomly sampled faces in 2D space using t-SNE [38], as shown in Figure 4. We used the facial embedding based on ResNet-34 from dlib, which was trained from the FaceScrub dataset [40], the VGG-Face dataset [41], and other online sources, which are likely dominated by White faces. The faces in FairFace are well spread in the space, and the race groups are loosely separated from each other. This is in part because the embedding was trained from biased datasets, but it also suggests that the dataset contains many non-typical examples. LFWA+ was derived from LFW, which was developed for face recognition, and therefore contains multiple images of the same individuals, i.e., clusters. UTKFace also tends to focus more on local clusters compared to FairFace.

To explicitly measure the diversity of faces in these datasets, we examine the distributions of pairwise distances between faces (Figure 5). On the random subsets, we first obtained the same 128-dimensional facial embedding from dlib and measured pairwise distances.
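The pairwise-distance comparison can be reproduced roughly as sketched below; the embeddings are assumed to be precomputed 128-dimensional dlib descriptors, and the thresholds are arbitrary illustration values.

```python
# Minimal sketch (our illustration): empirical CDF of pairwise L1 distances between
# precomputed 128-d face embeddings (one row per face).
import numpy as np
from scipy.spatial.distance import pdist

def pairwise_l1_cdf(embeddings: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Return P(distance <= t) for each threshold t."""
    distances = np.sort(pdist(embeddings, metric="cityblock"))  # condensed pairwise L1 distances
    return np.searchsorted(distances, thresholds, side="right") / distances.size

# Example with random vectors standing in for dlib descriptors of sampled faces.
embeddings = np.random.rand(200, 128)
print(pairwise_l1_cdf(embeddings, np.array([30.0, 40.0, 50.0])))
```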

Table 2: Gender classification accuracy measured on external validation datasets across gender-race groups (M = male, F = female; the last column is the maximum accuracy disparity ε from Eq. 3).

Model | White M | White F | Black M | Black F | E Asian M | E Asian F | SE Asian M | SE Asian F | Latino M | Latino F | Indian M | Indian F | Mid-East M | Mid-East F | Max | Min | AVG | STDV | ε
FairFace | .967 | .954 | .958 | .917 | .873 | .939 | .909 | .906 | .977 | .960 | .966 | .947 | .991 | .946 | .991 | .873 | .944 | .032 | .055
UTKFace | .926 | .864 | .909 | .795 | .841 | .824 | .906 | .795 | .939 | .821 | .978 | .742 | .949 | .730 | .978 | .730 | .859 | .078 | .127
LFWA+ | .946 | .680 | .974 | .432 | .826 | .684 | .938 | .574 | .951 | .613 | .968 | .518 | .988 | .635 | .988 | .432 | .766 | .196 | .359
CelebA | .829 | .958 | .819 | .919 | .653 | .939 | .768 | .923 | .843 | .955 | .866 | .856 | .924 | .874 | .958 | .653 | .866 | .083 | .166
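The ε column above follows directly from Eq. 3; a small sketch of that computation, with hypothetical per-group accuracies and natural log assumed, is shown below.

```python
# Minimal sketch (our illustration): maximum accuracy disparity of Eq. 3,
# epsilon = max over group pairs (j, k) of log(acc_j / acc_k) = log(max_acc / min_acc).
import math
from typing import Dict

def max_accuracy_disparity(group_accuracy: Dict[str, float]) -> float:
    accuracies = group_accuracy.values()
    return math.log(max(accuracies) / min(accuracies))

# Hypothetical gender-classification accuracies per race group.
print(round(max_accuracy_disparity({"White": 0.96, "Black": 0.94, "East Asian": 0.90}), 3))  # 0.065
```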

Figure 5 shows the CDF functions for the 3 datasets. As conjectured, UTKFace had more faces that are tightly clustered together and very similar to each other, compared to our dataset. Surprisingly, the faces in LFWA+ appear very diverse and far from each other, even though the majority of the examples contain a White face. We believe this is mostly because the face embedding was also trained on a very similar White-oriented dataset, which will be effective at separating White faces, not because the appearance of the faces is actually diverse (see Figure 3).

Figure 5: Distribution of pairwise distances of faces in 3 datasets measured by L1 distance on face embedding.

5. Conclusion

This paper proposes a novel face image dataset balanced on race, gender, and age. Compared to existing large-scale in-the-wild datasets, our dataset achieves much better generalization classification performance for gender, race, and age on novel image datasets collected from Twitter, international online newspapers, and web search, which contain more non-White faces than typical face datasets. We show that the model trained from our dataset produces balanced accuracy across race, whereas other datasets often lead to asymmetric accuracy on different race groups.

This dataset was derived from the Yahoo YFCC100M dataset [59] using the images with Creative Commons licenses by Attribution and Share Alike, which permit both academic and commercial usage. Our dataset can be used for training a new model and for verifying the balanced accuracy of existing classifiers.

Algorithmic fairness is an important aspect to consider in designing and developing AI systems, especially because these systems are being translated into many areas in our society and affecting our decision making. Large scale image datasets have contributed to the recent success in computer vision by improving model accuracy; yet the public and media have doubts about their transparency. The novel dataset proposed in this paper will help us discover and mitigate race and gender bias present in computer vision systems, such that such systems can be more easily accepted in society.

6. Acknowledgement

This work was supported by the National Science Foundation SMA-1831848, a Hellman Fellowship, and a UCLA Faculty Career Development Award.

References

[1] M. Alvi, A. Zisserman, and C. Nellaaker. Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings. In The European Conference on Computer Vision (ECCV) Workshops, September 2018. 3
[2] B. Amos, B. Ludwiczuk, M. Satyanarayanan, et al. OpenFace: A general-purpose face recognition library with mobile applications. CMU School of Computer Science, 6, 2016. 2
[3] G. Antipov, M. Baccouche, and J.-L. Dugelay. Face aging with conditional generative adversarial networks. In 2017 IEEE International Conference on Image Processing (ICIP), pages 2089–2093. IEEE, 2017. 1
[4] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency. OpenFace 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 59–66. IEEE, 2018. 2
[5] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pages 2745–2754, 2017. 1
[6] R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, page 0049124118782533. 6
[7] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018. 1, 2, 3, 6

Table 3: Cross-Dataset Classification Accuracy on White Race.

Trained on \ Tested on | Race: FairFace | Race: UTKFace | Race: LFWA+ | Gender: FairFace | Gender: UTKFace | Gender: LFWA+ | Gender: CelebA* | Age: FairFace | Age: UTKFace
FairFace | .937 | .936 | .970 | .942 | .940 | .920 | .981 | .597 | .565
UTKFace | .800 | .918 | .925 | .860 | .935 | .916 | .962 | .413 | .576
LFWA+ | .879 | .947 | .961 | .761 | .842 | .930 | .940 | - | -
CelebA | - | - | - | .812 | .880 | .905 | .971 | - | -

* CelebA doesn't provide race annotations. The result was obtained from the whole set (White and non-White).

Table 4: Cross-Dataset Classification Accuracy on non-White Races.

Trained on \ Tested on | Race†: FairFace | Race†: UTKFace | Race†: LFWA+ | Gender: FairFace | Gender: UTKFace | Gender: LFWA+ | Gender: CelebA* | Age: FairFace | Age: UTKFace
FairFace | .754 | .801 | .960 | .944 | .939 | .930 | .981 | .607 | .616
UTKFace | .693 | .839 | .887 | .823 | .925 | .908 | .962 | .418 | .617
LFWA+ | .541 | .380 | .866 | .738 | .833 | .894 | .940 | - | -
CelebA | - | - | - | .781 | .886 | .901 | .971 | - | -

* CelebA doesn't provide race annotations. The result was obtained from the whole set (White and non-White).
† FairFace defines 7 race categories, but only 4 races (White, Black, Asian, and Indian) were used in this result to make it comparable to UTKFace.

[8] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 67–74. IEEE, 2018. 1
[9] A. Chakraborty, J. Messias, F. Benevenuto, S. Ghosh, N. Ganguly, and K. P. Gummadi. Who makes trends? Understanding demographic biases in crowdsourced recommendations. In Eleventh International AAAI Conference on Web and Social Media, 2017. 2
[10] B.-C. Chen, C.-S. Chen, and W. H. Hsu. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Transactions on Multimedia, 17(6):804–815, 2015. 2
[11] S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017. 1, 3
[12] A. Das, A. Dantcheva, and F. Bremond. Mitigating bias in gender, age and ethnicity classification: a multi-task convolution neural network approach. In The European Conference on Computer Vision (ECCV) Workshops, September 2018. 3
[13] S. Escalera, M. Torres Torres, B. Martinez, X. Baró, H. Jair Escalante, I. Guyon, G. Tzimiropoulos, C. Corneou, M. Oliu, M. Ali Bagheri, et al. ChaLearn looking at people and faces of the world: Face analysis workshop and challenge 2016. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8, 2016. 1, 2
[14] M. Grgic, K. Delac, and S. Grgic. SCface: Surveillance cameras face database. Multimedia Tools and Applications, 51(3):863–879, 2011. 2
[15] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016. 1
[16] H. Han, A. K. Jain, F. Wang, S. Shan, and X. Chen. Heterogeneous face attribute estimation: A deep multi-task learning approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11):2597–2609, 2018. 2
[17] M. Hardt, E. Price, N. Srebro, et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016. 1, 6
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 5
[19] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen. Arbitrary facial attribute editing: Only change what you want. arXiv preprint arXiv:1711.10678, 1(3), 2017. 1
[20] L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision, pages 793–811. Springer, 2018. 3
[21] P. Hu and D. Ramanan. Finding tiny faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 951–959, 2017. 1
[22] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008. 1
[23] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007. 2, 4

Table 5: Gender classification accuracy on external validation datasets, across race and age groups.

Model trained on | Mean across races | SD across races | Mean across ages | SD across ages
FairFace | 94.89% | 3.03% | 92.95% | 6.63%
UTKFace | 89.54% | 3.34% | 84.23% | 12.83%
LFWA+ | 82.46% | 5.60% | 78.50% | 11.51%
CelebA | 86.03% | 4.57% | 79.53% | 17.96%

Table 6: Classification accuracy on external validation datasets.


Race Classification
All Female Male White Non-White Black Asian E Asian SE Asian Latino Indian Mid-East 0-9 10-29 30-49 50+
FairFace .733 .726 .737 .899 .548 .695 .888 .705 .465 .305 .492 .743 .756 .691 .768 .777
Twitter UTKFace .544 .543 .544 .741 .354 .591 .476 - - - .474 - .606 .516 .574 .567
LFWA+ .626 .596 .647 .965 .284 .283 .425 - - - - - .639 .562 .705 .751
FairFace .866 .874 .863 .949 .685 .890 .918 .886 .152 .267 .691 .704 .833 .853 .852 .893
Media UTKFace .772 .795 .763 .883 .546 .802 .588 - - - .599 - .646 .755 .757 .804
LFWA+ .679 .823 .835 .978 .393 .485 .578 - - - - - .682 .656 .651 .722
FairFace .846 .849 .844 .935 .683 .859 .843 .702 .510 .169 .649 .779 .839 .821 .837 .881
Protest UTKFace .706 .723 .697 .821 .536 .714 .456 - - - .591 - .681 .658 .685 .787
LFWA+ .747 .759 .741 .964 .366 .418 .488 - - - - - .689 .645 .668 .801
FairFace .815 .816 .815 .928 .639 .815 .883 .764 .376 .247 .611 .742 .809 .788 .819 .850
FairFace 18K .800 .812 .795 .917 .588 .779 .856 .685 .355 .279 .502 .625 .786 .773 .809 .827
Average FairFace 9K .774 .788 .768 .885 .564 .756 .827 .641 .315 .281 .531 .544 .723 .757 .789 .787
UTKFace .674 .687 .668 .815 .479 .702 .507 - - - .555 - .644 .643 .672 .719
LFWA+ .684 .726 .741 .969 .348 .395 .497 - - - - - .670 .621 .675 .758
Gender Classification
All Female Male White Non-White Black Asian E Asian SE Asian Latino Indian Mid-East 0-9 10-29 30-49 50+
FairFace .940 .948 .935 .949 .932 .932 .894 .864 .942 .963 .932 .976 .817 .932 .973 .959
UTKFace .884 .859 .899 .897 .874 .864 .829 .803 .871 .901 .898 .947 .671 .874 .933 .912
Twitter
LFWA+ .797 .637 .899 .815 .773 .789 .724 .716 .736 .804 .728 .911 .634 .769 .857 .859
CelebA .829 .955 .750 .850 .812 .818 .764 .716 .839 .843 .831 .876 .539 .818 .889 .881
FairFace .973 .957 .980 .976 .969 .953 .956 .967 .891 .980 .977 .988 .821 .952 .984 .979
UTKFace .927 .841 .961 .928 .915 .907 .908 .915 .869 .928 .945 .932 .679 .917 .931 .924
Media
LFWA+ .887 .656 .976 .893 .871 .851 .864 .875 .804 .859 .897 .944 .688 .835 .832 .911
CelebA .899 .950 .880 .909 .881 .847 .858 .857 .870 .925 .884 .926 .560 .860 .908 .924
FairFace .957 .944 .963 .962 .951 .957 .887 .879 .906 .970 .973 .991 .861 .934 .967 .976
UTKFace .901 .829 .934 .905 .873 .911 .814 .802 .843 .902 .918 .921 .611 .812 .924 .919
Protest
LFWA+ .829 .567 .954 .841 .801 .821 .758 .782 .697 .811 .811 .929 .568 .705 .851 .908
CelebA .882 .935 .856 .893 .866 .876 .892 .750 .833 .892 .878 .956 .492 .842 .904 .927
FairFace .957 .950 .959 .962 .951 .947 .912 .903 .913 .971 .961 .985 .833 .939 .975 .971
FairFace 18K .941 .930 .946 .946 .934 .931 .891 .886 .895 .955 .960 .967 .803 .920 .957 .962
FairFace 9K .926 .921 .927 .929 .921 .922 .864 .851 .883 .942 .951 .974 .760 .901 .949 .943
Average
UTKFace .904 .843 .931 .910 .887 .894 .850 .840 .861 .910 .920 .933 .654 .868 .929 .918
LFWA+ .838 .620 .943 .850 .815 .820 .782 .791 .746 .825 .812 .928 .630 .770 .847 .893
CelebA .870 .947 .829 .884 .853 .847 .838 .774 .847 .887 .864 .919 .530 .840 .900 .911
Age Classification
All Female Male White Non-White Black Asian E Asian SE Asian Latino Indian Mid-East 0-9 10-29 30-49 50+
FairFace .578 .586 .573 .563 .590 .557 .620 .629 .606 .581 .576 .555 .805 .666 .439 .408
Twitter
UTKFace .366 .355 .384 .343 .385 .338 .397 .382 .419 .411 .356 .345 .585 .499 .104 .307
FairFace .516 .511 .517 .513 .520 .483 .557 .559 .543 .537 .532 .475 .714 .686 .447 .501
Media
UTKFace .275 .273 .282 .281 .267 .271 .276 .279 .261 .231 .292 .222 .511 .529 .112 .238
FairFace .515 .543 .502 .498 .539 .527 .584 .605 .531 .507 .581 .469 .885 .687 .395 .478
Protest
UTKFace .302 .306 .294 .291 .319 .305 .316 .318 .312 .314 .371 .318 .516 .503 .114 .349
FairFace .536 .547 .531 .525 .550 .522 .587 .598 .560 .542 .563 .500 .801 .680 .427 .462
FairFace 18K .492 .508 .484 .485 .496 .463 .528 .538 .506 .510 .454 .490 .700 .646 .387 .410
Average
FairFace 9K .470 .493 .459 .462 .478 .449 .506 .515 .483 .473 .458 .463 .662 .611 .361 .394
UTKFace .314 .311 .320 .305 .324 .305 .330 .326 .331 .319 .340 .295 .537 .510 .110 .298

[24] J. Joo, F. F. Steen, and S.-C. Zhu. Automated facial trait judgment and election outcome prediction: Social dimensions of face. In Proceedings of the IEEE International Conference on Computer Vision, pages 3712–3720, 2015. 1, 4
[25] J. Joo, S. Wang, and S.-C. Zhu. Human attribute recognition by rich appearance dictionary. In Proceedings of the IEEE International Conference on Computer Vision, pages 721–728, 2013. 2
[26] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018. 1
[27] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016. 1
[28] D. E. King. Max-margin object detection. arXiv preprint arXiv:1502.00046, 2015. 5

[29] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5
[30] S. Kiritchenko and S. M. Mohammad. Examining gender and race bias in two hundred sentiment analysis systems. arXiv preprint arXiv:1805.04508, 2018. 3
[31] N. Kumar, A. Berg, P. N. Belhumeur, and S. Nayar. Describable visual attributes for face verification and image search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10):1962–1977, 2011. 1, 2, 4
[32] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, et al. Fader networks: Manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, pages 5967–5976, 2017. 1
[33] R. Layne, T. M. Hospedales, S. Gong, and Q. Mary. Person re-identification by attributes. In BMVC, volume 2, page 8, 2012. 2
[34] A. Li, L. Liu, K. Wang, S. Liu, and S. Yan. Clothing attributes assisted person reidentification. IEEE Transactions on Circuits and Systems for Video Technology, 25(5):869–878, 2015. 2
[35] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015. 1
[36] J. Littman, L. Wrubel, D. Kerchner, and Y. Bromberg Gaber. News Outlet Tweet Ids, 2017. 5
[37] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015. 1, 2, 4, 5
[38] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008. 6
[39] M. Merler, N. Ratha, R. S. Feris, and J. R. Smith. Diversity in faces. arXiv preprint arXiv:1901.10436, 2019. 1, 2, 3, 4
[40] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), pages 343–347. IEEE, 2014. 6
[41] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015. 1, 6
[42] I. D. Raji and J. Buolamwini. Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products. In AAAI/ACM Conference on AI Ethics and Society, volume 1, 2019. 1
[43] J. Reis, H. Kwak, J. An, J. Messias, and F. Benevenuto. Demographics of news sharing in the US twittersphere. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 195–204. ACM, 2017. 2
[44] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1685–1692, 2014. 1
[45] K. Ricanek and T. Tesafaye. MORPH: A longitudinal image database of normal adult age-progression. In 7th International Conference on Automatic Face and Gesture Recognition (FGR06), pages 341–345. IEEE, 2006. 2
[46] R. Rothe, R. Timofte, and L. V. Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision (IJCV), July 2016. 1, 2, 4
[47] H. J. Ryu, H. Adam, and M. Mitchell. InclusiveFaceNet: Improving face attribute detection with race and gender diversity. arXiv preprint arXiv:1712.00193, 2017. 3
[48] R. T. Schaefer. Encyclopedia of Race, Ethnicity, and Society, volume 1. Sage, 2008. 3
[49] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015. 1
[50] S. Shankar, Y. Halpern, E. Breck, J. Atwood, J. Wilson, and D. Sculley. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536, 2017. 3
[51] Z. C. Steinert-Threlkeld. Twitter as Data. Cambridge University Press, 2018. 5
[52] P. Stock and M. Cisse. ConvNets and ImageNet beyond accuracy: Understanding mistakes and uncovering biases. In Proceedings of the European Conference on Computer Vision (ECCV), pages 498–512, 2018. 3
[53] C. Su, F. Yang, S. Zhang, Q. Tian, L. S. Davis, and W. Gao. Multi-task learning with low rank attribute embedding for multi-camera person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5):1167–1181, 2018. 2
[54] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1489–1496, 2013. 2
[55] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1891–1898, 2014. 2
[56] H. Suresh, J. J. Gong, and J. V. Guttag. Learning tasks for multitask learning: Heterogenous patient populations in the ICU. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 802–810. ACM, 2018. 3
[57] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014. 1
[58] C. Thomas and A. Kovashka. Persuasive faces: Generating faces in advertisements. arXiv preprint arXiv:1807.09882, 2018. 1
[59] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 1, 4, 7
[60] A. Torralba and A. Efros. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 1521–1528. IEEE Computer Society, 2011. 1
[61] S. Verma and J. Rubin. Fairness definitions explained. In 2018 IEEE/ACM International Workshop on Software Fairness (FairWare), pages 1–7. IEEE, 2018. 3

[62] Y. Wang, Y. Feng, Z. Hong, R. Berger, and J. Luo. How polarized have we become? A multimodal classification of Trump followers and Clinton followers. In International Conference on Social Informatics, pages 440–456. Springer, 2017. 2
[63] M. Wilkes, C. Y. Wright, J. L. du Plessis, and A. Reeder. Fitzpatrick skin type, individual typology angle, and melanin index in an African population: steps toward universally applicable skin photosensitivity assessments. JAMA Dermatology, 151(8):902–903, 2015. 4
[64] D. Won, Z. C. Steinert-Threlkeld, and J. Joo. Protest activity detection and perceived violence estimation from social media images. In Proceedings of the 25th ACM International Conference on Multimedia, pages 786–794. ACM, 2017. 2, 5
[65] N. Xi, D. Ma, M. Liou, Z. C. Steinert-Threlkeld, J. Anastasopoulos, and J. Joo. Understanding the political ideology of legislators from social media images. arXiv preprint arXiv:1907.09594, 2019. 2
[66] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 532–539, 2013. 1
[67] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2Image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016. 1
[68] S. Yang, P. Luo, C.-C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5525–5533, 2016. 1
[69] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014. 1
[70] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pages 962–970, 2017. 3
[71] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325–333, 2013. 3
[72] B. H. Zhang, B. Lemoine, and M. Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340. ACM, 2018. 3
[73] M. Zhang. Google Photos tags two African-Americans as gorillas through facial recognition software, Jul 2015. 2
[74] Z. Zhang, P. Luo, C.-C. Loy, and X. Tang. Learning social relation traits from face images. In Proceedings of the IEEE International Conference on Computer Vision, pages 3631–3639, 2015. 2
[75] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5810–5818, 2017. 2, 4, 5
[76] J. Zou and L. Schiebinger. AI can be sexist and racist - it's time to make it fair, 2018. 3

