FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age
Abstract

Existing public face datasets are strongly biased toward Caucasian faces, and other races (e.g., Latino) are significantly underrepresented. This can lead to inconsistent model accuracy, limit the applicability of face analytic systems to non-White race groups, and adversely affect research findings based on such skewed data. To mitigate the race bias in these datasets, we construct a novel face image dataset, containing 108,501 images, with an emphasis on balanced race composition. We define 7 race groups: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino. Images were collected from the YFCC-100M Flickr dataset and labeled with race, gender, and age groups. Evaluations were performed on existing face attribute datasets as well as on novel image datasets to measure generalization performance. We find that the model trained on our dataset is substantially more accurate on novel datasets and that its accuracy is consistent across race and gender groups. The dataset will be released via https://github.com/joojs/fairface.

1. Introduction

To date, numerous large scale face image datasets [22, 31, 13, 69, 37, 24, 41, 68, 15, 27, 46, 8, 39] have been proposed and have fostered research and development for automated face detection [35, 21], alignment [66, 44], recognition [57, 49], generation [67, 5, 26, 58], modification [3, 32, 19], and attribute classification [31, 37]. These systems have been successfully translated into many areas including security, medicine, education, and social sciences.

Despite the sheer amount of available data, existing public face datasets are strongly biased toward Caucasian faces, and other races (e.g., Latino) are significantly underrepresented. A recent study shows that most existing large scale face databases are biased towards "lighter skin" faces (around 80%), e.g., White, compared to "darker" faces, e.g., Black [39]. This means a model may not apply to some subpopulations, and its results may not be comparable across different groups without calibration. Biased data will produce biased models trained from it. This raises ethical concerns about the fairness of automated systems, which has emerged as a critical topic of study in the recent machine learning and AI literature [17, 11].

For example, several commercial computer vision systems (Microsoft, IBM, Face++) have been criticized for their asymmetric accuracy across sub-demographics in recent studies [7, 42]. These studies found that the commercial face gender classification systems all perform better on male and on lighter faces. This can be caused by biases in their training data. Various unwanted biases in image datasets can easily occur due to biased selection, capture, and negative sets [60]. Most public large scale face datasets have been collected from popular online media – newspapers, Wikipedia, or web search – and these platforms are more frequently used by, or more often depict, White people.

To mitigate the race bias in the existing face datasets, we propose a novel face dataset with an emphasis on balanced race composition. Our dataset contains 108,501 facial images collected primarily from the YFCC-100M Flickr dataset [59], which can be freely shared for research purposes, and it also includes examples from other sources such as Twitter and online newspaper outlets. We define 7 race groups: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino. Our dataset is well-balanced on these 7 groups (see Figures 2 and 3).

Our paper makes three main contributions. First, we empirically show that existing face attribute datasets and the models learned from them do not generalize well to unseen data in which more non-White faces are present. Second, we show that models trained on our new dataset perform better on the novel data, not only on average but also more consistently across racial groups. Third, to the best of our knowledge, our dataset is the first large scale in-the-wild face attribute dataset that includes Latino and Middle Eastern groups and that differentiates East Asian from Southeast Asian. Computer vision has been rapidly transferred into other fields such as economics or the social sciences, where researchers want to analyze different demographics using image data. The inclusion of major racial groups that have been missing in existing datasets therefore significantly enlarges the applicability of computer vision methods to these fields.
Table 1: Statistics of Face Attribute Datasets

| Name | Source | # of faces | In-the-wild? | Age | Gender | Race annotation* | Balanced? |
|---|---|---|---|---|---|---|---|
| PPB [7] | Gov. official profiles | 1K | | ✓ | ✓ | Skin color prediction** | |
| MORPH [45] | Public data | 55K | | ✓ | ✓ | White (merged), Black, Latino | no |
| PubFig [31] | Celebrity | 13K | ✓ | | | Model-generated predictions (age, gender, race) | no |
| IMDB-WIKI [46] | IMDB, WIKI | 500K | ✓ | ✓ | ✓ | | no |
| FotW [13] | Flickr | 25K | ✓ | ✓ | ✓ | | yes |
| CACD [10] | Celebrity | 160K | ✓ | ✓ | | | no |
| †CelebA [37] | CelebFace [54, 55], LFW [23] | 200K | ✓ | ✓ | ✓ | | no |
| DiF [39] | Flickr | 1M | ✓ | ✓ | ✓ | Skin color prediction** | |
| LFW+ [16] | LFW [23] (newspapers) | 15K | ✓ | ✓ | ✓ | White (merged), Asian (merged) | no |
| †LFWA+ [37] | LFW [23] (newspapers) | 13K | ✓ | | ✓ | White (merged), Asian (merged), Black, Indian | no |
| †UTKFace [75] | MORPH, CACD, Web | 20K | ✓ | ✓ | ✓ | White (merged), Asian (merged), Black, Indian | yes |
| FairFace (Ours) | Flickr, Twitter, Newspapers, Web | 108K | ✓ | ✓ | ✓ | White (W, ME), Asian (E, SE), Black, Indian, Latino | yes |

*FairFace (Ours) also defines East (E) Asian, Southeast (SE) Asian, Middle Eastern (ME), and Western (W) White; "merged" indicates the finer sub-groups are not distinguished.
**PPB and DiF do not provide race annotations, but skin color is annotated or automatically computed as a proxy for race.
†denotes datasets used in our experiments.
2. Related Work

2.1. Face Attribute Recognition

Face attribute recognition is the task of classifying various human attributes such as gender, race, age, emotions, expressions, or other facial traits from facial appearance [31, 25, 74, 37]. While many techniques have been developed for the task, we mainly review datasets, which are the main concern of this paper.

Table 1 summarizes the statistics of existing large scale face attribute datasets and our new dataset. This is not an exhaustive list; we focus on public, in-the-wild datasets with gender, race, and age annotations. As stated earlier, most of these datasets were constructed from online sources, which are typically dominated by the White race.

Face attribute recognition has been applied as a sub-component of other computer vision systems. For example, Kumar et al. [31] used facial attributes such as gender, race, hair style, expressions, and accessories as features for face verification, as the attributes characterize individual traits. Attributes are also widely used for person re-identification in images or videos, combining features from human face and body appearance [33, 34, 53], which is especially effective when faces are not fully visible or are too small. These systems have applications in security, such as authentication for electronic devices (e.g., unlocking smartphones) or monitoring surveillance CCTVs [14].

It is imperative to ensure that these systems perform evenly well on different gender and race groups. Failing to do so can be detrimental to the reputations of individual service providers and to public trust in the machine learning and computer vision research community. The most notable incidents involving racial bias include Google Photos recognizing African American faces as gorillas and Nikon's digital cameras prompting a message asking "did someone blink?" to Asian users [73]. These incidents, regardless of whether the models were trained improperly or how much they actually affected users, often result in the termination of the service or its features (e.g., dropping sensitive output categories). For this reason, most commercial service providers have stopped providing a race classifier.

Face attribute recognition is also widely used for demographic surveys in marketing or social science research, aimed at understanding human social behaviors and their relations to the demographic backgrounds of individuals. Using off-the-shelf tools [2, 4] and commercial services, social scientists, who traditionally did not use images, have begun to use images of people to infer their demographic attributes and analyze their behaviors in many studies. Notable examples are demographic analyses of social media users using their photographs [9, 43, 64, 65, 62]. The cost of unfair classification is huge as it can over- or under-estimate specific
Figure 1: Individual Typology Angle (ITA), i.e., skin color, distribution of different races measured in our dataset.

Figure 2: Racial compositions in face datasets.
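For reference, the Individual Typology Angle used in Figure 1 is a standard colorimetric proxy for skin tone computed from CIE-Lab values. The sketch below shows the usual formula under the assumption that the caller has already segmented skin pixels of a face crop and converted them to Lab; the exact pixel-selection procedure behind Figure 1 is not described in this text.

```python
import math

def individual_typology_angle(L_star: float, b_star: float) -> float:
    """Standard ITA formula: ITA = arctan((L* - 50) / b*) * 180 / pi.

    L* and b* are CIE-Lab coordinates averaged over skin pixels (assumption:
    segmentation and RGB-to-Lab conversion happen upstream). Larger ITA
    values correspond to lighter skin. atan2 is used so b* near zero is safe;
    for b* > 0 it equals the textbook arctan form.
    """
    return math.degrees(math.atan2(L_star - 50.0, b_star))

# Example values: a lighter sample (L*=65, b*=18) vs. a darker one (L*=35, b*=22).
print(round(individual_typology_angle(65, 18), 1))   # ~39.8 degrees
print(round(individual_typology_angle(35, 22), 1))   # ~-34.3 degrees
```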
such as East Asian and White. Therefore, we explicitly use race and annotate the physical race by human annotators' judgments.

3.2. Image Collection and Annotation

Many existing face datasets have relied on photographs of public figures such as politicians or celebrities [31, 23, 24, 46, 37]. Despite the ease of collecting images and ground-truth attributes, the selection of these populations may be biased. For example, politicians may be older and actors may be more attractive than typical faces. Their images are usually taken by professional photographers in limited situations, leading to a quality bias. Some datasets were collected via web search using keywords such as "Asian boy" [75]. These queries may return only stereotypical faces or prioritize celebrities in those categories rather than diverse individuals among the general public.

Our goal is to minimize the selection bias introduced by such filtering and to maximize the diversity and coverage of the dataset. We started from a huge public image dataset, the Yahoo YFCC100M dataset [59], and detected faces from the images without any preselection. A recent work also used the same dataset to construct a huge unfiltered face dataset (Diversity in Faces, DiF) [39]. Our dataset is smaller but more balanced on race (see Figure 2).

Specifically, we incrementally increased the dataset size. We first detected and annotated 7,125 faces randomly sampled from the entire YFCC100M dataset, ignoring the locations of the images. After obtaining annotations on this initial set, we estimated the demographic composition of each country. Based on this statistic, we adaptively adjusted the number of images sampled from each country such that the dataset is not dominated by the White race. Consequently, we excluded the U.S. and European countries in the later stages of data collection after we had sampled enough White faces from those countries. The minimum size of a detected face was set to 50 by 50 pixels. This is relatively small compared to other datasets, but we find the attributes are still recognizable, and these examples can actually make the classifiers more robust against noisy data. We only used images with "Attribution" and "Share Alike" Creative Commons licenses, which allow derivative work and commercial usage.
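The collection step described above can be summarized with the schematic sketch below. It is only an illustration of the filtering and quota logic, not the authors' released pipeline: the model file name is the one distributed via dlib.net, while the quota-update rule is a simplified stand-in for the adaptive, per-country adjustment described in the text.

```python
import dlib

MIN_FACE = 50  # minimum detected face size in pixels, as stated above

# dlib's CNN-based face detector; the .dat model file is distributed via dlib.net.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def detect_usable_faces(image_paths):
    """Detect faces and keep only those at least 50x50 pixels."""
    faces = []
    for path in image_paths:
        img = dlib.load_rgb_image(path)
        for det in detector(img, 1):            # upsample once before detection
            r = det.rect
            if r.width() >= MIN_FACE and r.height() >= MIN_FACE:
                faces.append((path, (r.left(), r.top(), r.right(), r.bottom())))
    return faces

def update_country_quotas(quotas, white_counts, enough_white):
    """Schematic version of the adaptive sampling step: once enough White
    faces have been collected from a country (e.g., the U.S. or European
    countries), stop sampling from it; otherwise keep its current quota."""
    return {
        country: 0 if white_counts.get(country, 0) >= enough_white else quota
        for country, quota in quotas.items()
    }
```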
We used Amazon Mechanical Turk to verify the race, gender, and age group for each face. We assigned three workers to each image. If two or three workers agreed on their judgments, we took the values as ground truth. If all three workers produced different responses, we republished the image to another three workers and subsequently discarded the image if the new annotators did not agree. The annotations at this stage were still noisy. We further refined the annotations by training a model on the initial ground-truth annotations and applying it back to the dataset. We then manually re-verified the annotations for images whose annotations differed from the model predictions.
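The two-round agreement rule described above can be written compactly as follows. This is a minimal sketch of the consolidation logic for a single attribute of a single image, not the authors' annotation pipeline; the label strings are only examples, and treating the second round independently of the first is a simplification.

```python
from collections import Counter

def consolidate(first_round, second_round=None):
    """Apply the agreement rule described above.

    first_round / second_round: lists of labels from three workers each.
    Returns the accepted label, 'REPUBLISH' if a second round is needed,
    or 'DISCARD' if the second round still shows no agreement.
    """
    label, count = Counter(first_round).most_common(1)[0]
    if count >= 2:                    # two or three workers agreed
        return label
    if second_round is None:          # all three disagreed: ask three more workers
        return "REPUBLISH"
    label, count = Counter(second_round).most_common(1)[0]
    return label if count >= 2 else "DISCARD"

print(consolidate(["East Asian", "East Asian", "Southeast Asian"]))  # East Asian
print(consolidate(["White", "Latino", "Middle Eastern"]))            # REPUBLISH
```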
4. Experiments

4.1. Measuring Bias in Datasets

We first measure how skewed each dataset is in terms of its race composition. For the datasets with race annotations, we use the reported statistics. For the other datasets, we annotated the race labels of 3,000 random samples drawn from each dataset. See Figure 2 for the result. As expected, most existing face attribute datasets, especially the ones focusing on celebrities or politicians, are biased toward the White race. Unlike race, we find that most datasets are relatively more balanced on gender, ranging from a 40%–60% male ratio.
4.2. Model and Cross-Dataset Performance

To compare the model performance of different datasets, we used an identical model architecture, ResNet-34 [18], trained on each dataset. We used ADAM optimization [29] with a learning rate of 0.0001. Given an image, we detected faces using dlib's CNN-based face detector [28] (dlib.net) and ran the attribute classifier on each face. The experiments were done in PyTorch.
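A minimal PyTorch sketch of this setup is shown below. It reflects only what is stated above (a ResNet-34 backbone, ADAM with a learning rate of 0.0001, classification of attributes on detected face crops); the multi-head output layout, the shared backbone, and the choice of 9 age bins are our assumptions for illustration, not a description of the released training code.

```python
import torch
import torch.nn as nn
from torchvision import models

N_RACE, N_GENDER, N_AGE = 7, 2, 9   # 7 race groups; 9 age bins is an assumption

class FaceAttributeNet(nn.Module):
    """ResNet-34 backbone with one linear head per attribute (assumed layout)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet34(pretrained=True)
        backbone.fc = nn.Identity()          # expose the 512-d pooled feature
        self.backbone = backbone
        self.race = nn.Linear(512, N_RACE)
        self.gender = nn.Linear(512, N_GENDER)
        self.age = nn.Linear(512, N_AGE)

    def forward(self, x):
        f = self.backbone(x)
        return self.race(f), self.gender(f), self.age(f)

model = FaceAttributeNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # ADAM, lr = 0.0001
criterion = nn.CrossEntropyLoss()

def train_step(faces, race_y, gender_y, age_y):
    """One optimization step on a batch of detected face crops."""
    optimizer.zero_grad()
    race_p, gender_p, age_p = model(faces)
    loss = (criterion(race_p, race_y) + criterion(gender_p, gender_y)
            + criterion(age_p, age_y))
    loss.backward()
    optimizer.step()
    return loss.item()
```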
Throughout the evaluations, we compare our dataset with three other datasets: UTKFace [75], LFWA+, and CelebA [37]. Both UTKFace and LFWA+ have race annotations and are thus suitable for comparison with our dataset. CelebA does not have race annotations, so we only use it for gender classification. See Table 1 for more detailed dataset characteristics.

Using models trained on these datasets, we first performed cross-dataset classifications by alternating training sets and test sets. Note that FairFace is the only dataset containing all seven race groups. To make it compatible with the other datasets, we merged our finer racial groups when testing on them. CelebA does not have race annotations but was included for gender classification.

Tables 3 and 4 show the classification results for race, gender, and age on the datasets across subpopulations. As expected, each model tends to perform better on the same dataset on which it was trained. However, the accuracy of our model was the highest on some variables even on the LFWA+ dataset and very close to the leader in the other cases. This is partly because LFWA+ is the most biased dataset and ours is the most diverse, and thus the most generalizable, dataset.

4.3. Generalization Performance

4.3.1 Datasets

To test the generalization performance of the models, we consider three novel datasets. Note that these datasets were collected from sources completely different from our Flickr data and were not used in training. Since we want to measure the effectiveness of the model on diverse races, we chose test datasets that contain people in different locations, as follows.

Geo-tagged Tweets. First, we consider images uploaded by Twitter users whose locations are identified by geo-tags (longitude and latitude), provided by [51]. From this set, we chose four countries (France, Iraq, Philippines, and Venezuela) and randomly sampled 5,000 faces.

Media Photographs. Next, we also use photographs posted by 500 online professional media outlets. Specifically, we use a public dataset of tweet IDs [36] posted by 4,000 known media accounts, e.g., @nytimes. Note that although we use Twitter to access the photographs, these tweets are simply external links to pages on the main newspaper sites. Therefore, this data is considered media photographs, distinct from the general tweet images mostly uploaded by ordinary users. We randomly sampled 8,000 faces from the set.

Protest Dataset. Lastly, we also use a public image dataset collected for a recent protest activity study [64]. The authors collected the majority of the data from Google Image search using keywords such as "Venezuela protest" or "football game" (for hard negatives). The dataset exhibits a wide range of diverse race and gender groups engaging in different activities in various countries. We randomly sampled 8,000 faces from the set.

These faces were annotated for gender, race, and age by Amazon Mechanical Turk workers.

4.3.2 Result

Table 6 shows the classification accuracy of the different models. Because our dataset is larger than LFWA+ and UTKFace, we report three variants of the FairFace model trained with limited training set sizes (9k, 18k, and Full) for fair comparison.
Improved Accuracy. As clearly shown in the results, the model trained on FairFace outperforms all the other models for race, gender, and age on the novel datasets, which have never been used in training and also come from different data sources. The models trained with fewer training images (9k and 18k) still outperform the other datasets, including CelebA, which is larger than FairFace. This suggests that dataset size is not the only reason for the performance improvement.

Balanced Accuracy. Our model also produces more consistent results – for race, gender, and age classification – across different race groups compared to the other datasets. We measure model consistency by the standard deviation of classification accuracy measured on different subpopulations, as shown in Table 5. More formally, one can consider conditional use accuracy equality [6] or equalized odds [17] as the measure of fair classification. For gender classification:

P(\hat{Y} = i \mid Y = i, A = j) = P(\hat{Y} = i \mid Y = i, A = k), \quad i \in \{\text{male}, \text{female}\}, \; \forall j, k \in D, \qquad (1)

where $\hat{Y}$ is the predicted gender, $Y$ is the true gender, $A$ refers to the demographic group, and $D$ is the set of different demographic groups being considered (i.e., race). When we consider different gender groups for $A$, this needs to be modified to measure accuracy equality [6]:

P(\hat{Y} = Y \mid A = j) = P(\hat{Y} = Y \mid A = k), \quad \forall j, k \in D. \qquad (2)

We therefore define the maximum accuracy disparity of a classifier as follows:

\varepsilon(\hat{Y}) = \max_{j, k \in D} \log \frac{P(\hat{Y} = Y \mid A = j)}{P(\hat{Y} = Y \mid A = k)}. \qquad (3)

Table 2 shows the gender classification accuracy of the different models measured on the external validation datasets for each race and gender group. The FairFace model achieves the lowest maximum accuracy disparity. The LFWA+ model yields the highest disparity, strongly biased toward the male category. The CelebA model tends to exhibit a bias toward the female category, as the dataset contains more female images than male.

The FairFace model achieves less than 1% accuracy discrepancy between male ↔ female and White ↔ non-White for gender classification (Table 6). All the other models show a strong bias toward the male class, yielding much lower accuracy on the female group, and perform more inaccurately on the non-White group. The gender performance gap was the biggest in LFWA+ (32%), which is the smallest among the datasets used in the experiment. Recent work has also reported asymmetric gender biases in commercial computer vision services [7], and our result further suggests the cause is likely due to the unbalanced representation in training data.

Data Coverage and Diversity. We further investigate dataset characteristics to measure the data diversity in our dataset. We first visualize randomly sampled faces in 2D space using t-SNE [38], as shown in Figure 4. We used the facial embedding based on ResNet-34 from dlib, which was trained on the FaceScrub dataset [40], the VGG-Face dataset [41], and other online sources, which are likely dominated by White faces. The faces in FairFace are well spread in the space, and the race groups are loosely separated from each other. This is in part because the embedding was trained on biased datasets, but it also suggests that the dataset contains many non-typical examples. LFWA+ was derived from LFW, which was developed for face recognition, and therefore contains multiple images of the same individuals, i.e., clusters. UTKFace also tends to focus more on local clusters compared to FairFace.

To explicitly measure the diversity of faces in these datasets, we examine the distributions of pairwise distances between faces (Figure 5). On random subsets, we first obtained the same 128-dimensional facial embedding from
| Model | White M | White F | Black M | Black F | East Asian M | East Asian F | SE Asian M | SE Asian F | Latino M | Latino F | Indian M | Indian F | Middle Eastern M | Middle Eastern F | Max | Min | AVG | STDV | Disparity (Eq. 3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FairFace | .967 | .954 | .958 | .917 | .873 | .939 | .909 | .906 | .977 | .960 | .966 | .947 | .991 | .946 | .991 | .873 | .944 | .032 | .055 |
| UTK | .926 | .864 | .909 | .795 | .841 | .824 | .906 | .795 | .939 | .821 | .978 | .742 | .949 | .730 | .978 | .730 | .859 | .078 | .127 |
| LFWA+ | .946 | .680 | .974 | .432 | .826 | .684 | .938 | .574 | .951 | .613 | .968 | .518 | .988 | .635 | .988 | .432 | .766 | .196 | .359 |
| CelebA | .829 | .958 | .819 | .919 | .653 | .939 | .768 | .923 | .843 | .955 | .866 | .856 | .924 | .874 | .958 | .653 | .866 | .083 | .166 |

Table 2: Gender classification accuracy measured on external validation datasets across gender-race groups (M = male, F = female).
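To make Eq. (3) concrete, the small sketch below computes the maximum accuracy disparity from per-group accuracies such as those in Table 2. The base of the logarithm is not stated above; base 10 is our assumption, chosen because it reproduces the last column of Table 2 (e.g., log10(.991/.873) ≈ .055 for the FairFace row).

```python
import math
from itertools import combinations

def max_accuracy_disparity(acc_by_group):
    """Eq. (3): max over group pairs of |log10 P(correct | j) - log10 P(correct | k)|."""
    return max(
        abs(math.log10(a) - math.log10(b))
        for a, b in combinations(acc_by_group.values(), 2)
    )

# Per-(race, gender) gender-classification accuracies from Table 2, FairFace row.
fairface = {"White M": .967, "White F": .954, "Black M": .958, "Black F": .917,
            "E Asian M": .873, "E Asian F": .939, "SE Asian M": .909, "SE Asian F": .906,
            "Latino M": .977, "Latino F": .960, "Indian M": .966, "Indian F": .947,
            "Mid East M": .991, "Mid East F": .946}

print(round(max_accuracy_disparity(fairface), 3))   # 0.055, matching Table 2
```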
Table 3: Cross-Dataset Classification Accuracy on the White Race.

| Trained on \ Tested on | Race: FairFace | Race: UTKFace | Race: LFWA+ | Gender: FairFace | Gender: UTKFace | Gender: LFWA+ | Gender: CelebA* | Age: FairFace | Age: UTKFace |
|---|---|---|---|---|---|---|---|---|---|
| FairFace | .937 | .936 | .970 | .942 | .940 | .920 | .981 | .597 | .565 |
| UTKFace | .800 | .918 | .925 | .860 | .935 | .916 | .962 | .413 | .576 |
| LFWA+ | .879 | .947 | .961 | .761 | .842 | .930 | .940 | – | – |
| CelebA | – | – | – | .812 | .880 | .905 | .971 | – | – |

* CelebA doesn't provide race annotations. The result was obtained from the whole set (White and non-White).
[8] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 67–74. IEEE, 2018. 1
[9] A. Chakraborty, J. Messias, F. Benevenuto, S. Ghosh, N. Ganguly, and K. P. Gummadi. Who makes trends? understanding demographic biases in crowdsourced recommendations. In Eleventh International AAAI Conference on Web and Social Media, 2017. 2
[10] B.-C. Chen, C.-S. Chen, and W. H. Hsu. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Transactions on Multimedia, 17(6):804–815, 2015. 2
[11] S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017. 1, 3
[12] A. Das, A. Dantcheva, and F. Bremond. Mitigating bias in gender, age and ethnicity classification: a multi-task convolution neural network approach. In The European Conference on Computer Vision (ECCV) Workshops, September 2018. 3
[13] S. Escalera, M. Torres Torres, B. Martinez, X. Baró, H. Jair Escalante, I. Guyon, G. Tzimiropoulos, C. Corneou, M. Oliu, M. Ali Bagheri, et al. Chalearn looking at people and faces of the world: Face analysis workshop and challenge 2016. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8, 2016. 1, 2
[14] M. Grgic, K. Delac, and S. Grgic. Scface–surveillance cameras face database. Multimedia tools and applications, 51(3):863–879, 2011. 2
[15] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016. 1
[16] H. Han, A. K. Jain, F. Wang, S. Shan, and X. Chen. Heterogeneous face attribute estimation: A deep multi-task learning approach. IEEE transactions on pattern analysis and machine intelligence, 40(11):2597–2609, 2018. 2
[17] M. Hardt, E. Price, N. Srebro, et al. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323, 2016. 1, 6
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 5
[19] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen. Arbitrary facial attribute editing: Only change what you want. arXiv preprint arXiv:1711.10678, 1(3), 2017. 1
[20] L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision, pages 793–811. Springer, 2018. 3
[21] P. Hu and D. Ramanan. Finding tiny faces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 951–959, 2017. 1
[22] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on faces in 'Real-Life' Images: detection, alignment, and recognition, 2008. 1
[23] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007. 2, 4
Table 5: Gender classification accuracy on external validation datasets, across race and age groups.

| Model trained on | Mean across races | SD across races | Mean across ages | SD across ages |
|---|---|---|---|---|
| FairFace | 94.89% | 3.03% | 92.95% | 6.63% |
| UTKFace | 89.54% | 3.34% | 84.23% | 12.83% |
| LFWA+ | 82.46% | 5.60% | 78.50% | 11.51% |
| CelebA | 86.03% | 4.57% | 79.53% | 17.96% |
[24] J. Joo, F. F. Steen, and S.-C. Zhu. Automated facial trait judgment and election outcome prediction: Social dimensions of face. In Proceedings of the IEEE international conference on computer vision, pages 3712–3720, 2015. 1, 4
[25] J. Joo, S. Wang, and S.-C. Zhu. Human attribute recognition by rich appearance dictionary. In Proceedings of the IEEE International Conference on Computer Vision, pages 721–728, 2013. 2
[26] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018. 1
[27] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016. 1
[28] D. E. King. Max-margin object detection. arXiv preprint arXiv:1502.00046, 2015. 5
[29] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5
[30] S. Kiritchenko and S. M. Mohammad. Examining gender and race bias in two hundred sentiment analysis systems. arXiv preprint arXiv:1805.04508, 2018. 3
[31] N. Kumar, A. Berg, P. N. Belhumeur, and S. Nayar. Describable visual attributes for face verification and image search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10):1962–1977, 2011. 1, 2, 4
[32] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, et al. Fader networks: Manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, pages 5967–5976, 2017. 1
[33] R. Layne, T. M. Hospedales, S. Gong, and Q. Mary. Person re-identification by attributes. In Bmvc, volume 2, page 8, 2012. 2
[34] A. Li, L. Liu, K. Wang, S. Liu, and S. Yan. Clothing attributes assisted person reidentification. IEEE Transactions on Circuits and Systems for Video Technology, 25(5):869–878, 2015. 2
[35] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5325–5334, 2015. 1
[36] J. Littman, L. Wrubel, D. Kerchner, and Y. Bromberg Gaber. News Outlet Tweet Ids, 2017. 5
[37] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015. 1, 2, 4, 5
[38] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008. 6
[39] M. Merler, N. Ratha, R. S. Feris, and J. R. Smith. Diversity in faces. arXiv preprint arXiv:1901.10436, 2019. 1, 2, 3, 4
[40] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), pages 343–347. IEEE, 2014. 6
[41] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In bmvc, volume 1, page 6, 2015. 1, 6
[42] I. D. Raji and J. Buolamwini. Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial ai products. In AAAI/ACM Conf. on AI Ethics and Society, volume 1, 2019. 1
[43] J. Reis, H. Kwak, J. An, J. Messias, and F. Benevenuto. Demographics of news sharing in the us twittersphere. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 195–204. ACM, 2017. 2
[44] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1685–1692, 2014. 1
[45] K. Ricanek and T. Tesafaye. Morph: A longitudinal image database of normal adult age-progression. In 7th International Conference on Automatic Face and Gesture Recognition (FGR06), pages 341–345. IEEE, 2006. 2
[46] R. Rothe, R. Timofte, and L. V. Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision (IJCV), July 2016. 1, 2, 4
[47] H. J. Ryu, H. Adam, and M. Mitchell. Inclusivefacenet: Improving face attribute detection with race and gender diversity. arXiv preprint arXiv:1712.00193, 2017. 3
[48] R. T. Schaefer. Encyclopedia of race, ethnicity, and society, volume 1. Sage, 2008. 3
[49] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. 1
[50] S. Shankar, Y. Halpern, E. Breck, J. Atwood, J. Wilson, and D. Sculley. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536, 2017. 3
[51] Z. C. Steinert-Threlkeld. Twitter as data. Cambridge University Press, 2018. 5
[52] P. Stock and M. Cisse. Convnets and imagenet beyond accuracy: Understanding mistakes and uncovering biases. In Proceedings of the European Conference on Computer Vision (ECCV), pages 498–512, 2018. 3
[53] C. Su, F. Yang, S. Zhang, Q. Tian, L. S. Davis, and W. Gao. Multi-task learning with low rank attribute embedding for multi-camera person re-identification. IEEE transactions on pattern analysis and machine intelligence, 40(5):1167–1181, 2018. 2
[54] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In Proceedings of the IEEE international conference on computer vision, pages 1489–1496, 2013. 2
[55] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1891–1898, 2014. 2
[56] H. Suresh, J. J. Gong, and J. V. Guttag. Learning tasks for multitask learning: Heterogenous patient populations in the icu. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 802–810. ACM, 2018. 3
[57] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014. 1
[58] C. Thomas and A. Kovashka. Persuasive faces: Generating faces in advertisements. arXiv preprint arXiv:1807.09882, 2018. 1
[59] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 1, 4, 7
[60] A. Torralba and A. Efros. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 1521–1528. IEEE Computer Society, 2011. 1
[61] S. Verma and J. Rubin. Fairness definitions explained. In 2018 IEEE/ACM International Workshop on Software Fairness (FairWare), pages 1–7. IEEE, 2018. 3
[62] Y. Wang, Y. Feng, Z. Hong, R. Berger, and J. Luo. How
polarized have we become? a multimodal classification
of trump followers and clinton followers. In International
Conference on Social Informatics, pages 440–456. Springer,
2017. 2
[63] M. Wilkes, C. Y. Wright, J. L. du Plessis, and A. Reeder.
Fitzpatrick skin type, individual typology angle, and melanin
index in an african population: steps toward universally ap-
plicable skin photosensitivity assessments. JAMA dermatol-
ogy, 151(8):902–903, 2015. 4
[64] D. Won, Z. C. Steinert-Threlkeld, and J. Joo. Protest activity
detection and perceived violence estimation from social me-
dia images. In Proceedings of the 25th ACM international
conference on Multimedia, pages 786–794. ACM, 2017. 2, 5
[65] N. Xi, D. Ma, M. Liou, Z. C. Steinert-Threlkeld, J. Anas-
tasopoulos, and J. Joo. Understanding the political ideol-
ogy of legislators from social media images. arXiv preprint
arXiv:1907.09594, 2019. 2
[66] X. Xiong and F. De la Torre. Supervised descent method
and its applications to face alignment. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 532–539, 2013. 1
[67] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Con-
ditional image generation from visual attributes. In European
Conference on Computer Vision, pages 776–791. Springer,
2016. 1
[68] S. Yang, P. Luo, C.-C. Loy, and X. Tang. Wider face: A
face detection benchmark. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages
5525–5533, 2016. 1
[69] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face represen-
tation from scratch. arXiv preprint arXiv:1411.7923, 2014.
1
[70] M. B. Zafar, I. Valera, M. G. Rogriguez, and K. P. Gummadi.
Fairness constraints: Mechanisms for fair classification. In
Artificial Intelligence and Statistics, pages 962–970, 2017. 3
[71] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork.
Learning fair representations. In International Conference
on Machine Learning, pages 325–333, 2013. 3
[72] B. H. Zhang, B. Lemoine, and M. Mitchell. Mitigating un-
wanted biases with adversarial learning. In Proceedings of
the 2018 AAAI/ACM Conference on AI, Ethics, and Society,
pages 335–340. ACM, 2018. 3
[73] M. Zhang. Google photos tags two african-americans as go-
rillas through facial recognition software, Jul 2015. 2
[74] Z. Zhang, P. Luo, C.-C. Loy, and X. Tang. Learning social
relation traits from face images. In Proceedings of the IEEE
International Conference on Computer Vision, pages 3631–
3639, 2015. 2
[75] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by
conditional adversarial autoencoder. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 5810–5818, 2017. 2, 4, 5
[76] J. Zou and L. Schiebinger. AI can be sexist and racist – it's time to make it fair, 2018. 3