What Makes a Good Image? Airbnb Demand Analytics Leveraging Interpretable Image Features
Shunyuan Zhang
Harvard Business School
[email protected]
Dokyun Lee
Questrom School of Business, Boston University
[email protected]
Kannan Srinivasan
Tepper School of Business, Carnegie Mellon University
[email protected]
ABSTRACT
We study how Airbnb property demand changed after the acquisition of verified images (taken by Airbnb’s
photographers) and explore what makes a good image for an Airbnb property. Using deep learning and
difference-in-differences analyses on an Airbnb panel dataset spanning 7,423 properties over 16 months, we find
that properties with verified images had 8.98% higher occupancy than properties without verified images (images
taken by the host). To explore what constitutes a good image for an Airbnb property, we quantify 12 human-
interpretable image attributes that pertain to three artistic aspects—composition, color, and the figure-ground
relationship—and we find systematic differences between the verified and unverified images. We also predict the
relationship between each of the 12 attributes and property demand, and we find that most of the correlations
are significant and in the theorized direction. Our results provide actionable insights for both Airbnb
photographers and amateur host photographers who wish to optimize their images. Our findings contribute to
and bridge the literature on photography and marketing (e.g., staging), which often either ignores the demand
side (photography) or does not systematically characterize the images (marketing).
Keywords: sharing economy, Airbnb, property demand, computer vision, deep learning, image feature
extraction, content engineering
It remains unclear, however, whether the professional photography program has been beneficial for Airbnb
and its hosts. In fact, the photography program has raised much controversy among Airbnb hosts and guests. It
seems plausible that verified photos could attract more guests, but they also could oversell or misrepresent the
property, leading to a negative impact on property demand—a fear voiced by some Airbnb hosts on the host forum.1
1 http://airhostsforum.com/t/professional-photography/3675/35.
2. Empirical Framework
2.1 Data Description
We randomly selected more than 13,000 Airbnb property listings from seven US cities (Austin, Boston, Los
Angeles, New York, San Diego, San Francisco, and Seattle), and we collected data on the listings from January
2016 to April 2017. We obtained information about each property host from that host’s public profile on
Airbnb.com. Each profile specifies the date on which the host joined Airbnb and whether the host had a verified
Airbnb account at the time of the analysis. For each property, we obtained information about static
characteristics: location (city and zip code), property type (e.g., house, apartment), property size (number of beds),
amenities (e.g., pool, AC, proximity to a beach), and capacity (maximum guests). We also obtained information
about dynamic characteristics: property bookings, nightly prices, guests’ reviews, property photos, and whether
the photos were verified. In the appendix (Section VI), we provide detailed methods for data collection and
matched sample construction. In the next section, we describe the measures of our key variables and report their summary statistics.
2 Of the 13,000 listings in our original dataset, approximately 5,000 were already “treated” in January 2016. We exclude them
from our main analyses, but as a robustness check, we repeat our main analysis with the pre-treated units, and we obtain
consistent results (see Section V.7 in the appendix for details).
Property Demand
We purchased listing-level booking data from a company that specializes in collecting Airbnb property demand
data. The booking data includes the number of days in a month in which the property was open (i.e., available to
be booked), blocked (i.e., marked as unavailable by the host), and booked (i.e., by a guest). For property i in
period t, we operationalize property demand as the occupancy rate, that is, the fraction of the open days that
were booked, scaled by 100. For example, if a property in March was open for 24 days and booked for six days,
then its demand for that month was DEMANDit = (6/24) * 100 = 25.00.
Property Price
The property price for property i in period t, NIGHTLY_RATEit, refers to the average nightly price over the days in period t (Zhang et al. 2019). Property price is endogenous because it correlates with random demand shocks in the current period, which also affect property demand. To address the endogeneity concern, we use a set of instrumental variables (IVs) for price. Following the extant literature, we include the characteristics of
competing properties (Berry et al. 1995; Nevo 2001). The logic is that the characteristics of competing products
are unlikely to be correlated with unobserved shocks in the demand for the focal property. However, the
proximity of the characteristics of a property and its competitors influences the competition and, as a result, the
property markup and price. 3 In addition, we collect cost-related variables, that is, the factors that enter the
cost/supply side but not the demand side. We use the local (zip-code level) residential utility fee obtained from
OpenEI and local rental information collected from Zillow.4 These factors affect the host’s expenses and thus
provide an indirect measure of cost but are unlikely to be correlated with demand in the short-term lodging
market.
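To make the estimation strategy concrete, the following is a minimal two-stage least squares sketch using Python's linearmodels package. All column names and the input file are illustrative placeholders for our variables, not the paper's actual variable definitions.

```python
# Minimal 2SLS sketch: instrument NIGHTLY_RATE with competitor
# characteristics (BLP-style) and cost shifters (utility fees, rents).
# Column names and the file name are illustrative placeholders.
import pandas as pd
import statsmodels.api as sm
from linearmodels.iv import IV2SLS

df = pd.read_csv("airbnb_panel.csv")  # hypothetical panel extract

dep = df["demand"]                                   # occupancy rate (0-100)
exog = sm.add_constant(df[["image_count", "review_count"]])
endog = df[["nightly_rate"]]                         # endogenous price
instruments = df[["competitor_capacity", "competitor_reviews",
                  "utility_fee", "zillow_rent"]]     # excluded instruments

res = IV2SLS(dep, exog, endog, instruments).fit(cov_type="robust")
print(res.summary)
```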
Property Photos
The property photos are the set of photos posted on the property web page in the given period. Three variables
characterize the property photos: photo quantity, photo quality, and the distribution of photographed room
types. IMAGE_COUNTit is the number of photos of property i available during period t. We calculate
IMAGE_QUALITYit using machine learning techniques that are appropriate for the size of the dataset (over
510,000 images).5 We build a supervised image-quality classifier that classifies each image as high-quality (value 1) or low-quality (value 0).
3 We compute IVs based on property type, listing type, property capacity, and number of reviews, none of which are directly
controlled by the host. Competitors are defined as the properties in the same zip code.
4 The OpenEI dataset provides average residential, commercial, and industrial electricity rates by zip code, compiled from
ABB, Velocity Suite, and the US Energy Information Administration dataset 861: https://openei.org/doe-
opendata/dataset/u-s-electric-utility-companies-and-rates-look-up-by-zipcode-feb-2011. Zillow Research provides average
home values by zip code and home size (# of bedrooms): https://www.zillow.com/research/data/.
5 The image data contain all images associated with all properties included in the dataset. That is, they include images for properties that were verified before the observation started (and hence are not included in the sample for the DiD analyses) and all images updated/added/deleted during the observation periods.
Training step
CNN Approach: We apply a CNN, a deep learning framework widely applied in the field of computer vision with
breakthrough performances on tasks including object recognition and image classification (Krizhevsky et al.
2012). Our CNN image-quality classifier (see Figure 2) follows the architecture of a classic CNN model. The
CNN consists of a sequence of neural layers (also called filters); the first layer extracts features from the input
image and summarizes them into an intermediate output, which becomes the input with which the next layer
generates a higher-level summary, and so on. The key component in the CNN is a convolution kernel (or
convolution filter), represented by an n by n weighting matrix. Given the intermediate output from the previous
layer, the convolution kernel extracts features through a matrix dot product operation between the weighting
matrix and the intermediate output. In other words, the sequence of layers in the CNN extracts a hierarchical set
of image features from the input. A training set (in which the label, “high-quality” or “low-quality,” is already
known for each image) enables the model to learn the relationships between the extracted features and the labels
such that the model can extract the features that have the most discriminative power for predicting the labels.
To reduce overfitting in the training step, we increase the size of, and random variation within, the training sample by randomly applying one of three transformations to each image (i.e., data augmentation; see the appendix for details).
Transfer Learning and Fine-Tuning the Parameters of the CNN: Because a deep learning model has many filters and
parameters, it requires a large quantity of data to train the model. We overcome our limited training data by
leveraging transfer learning. We start with the pre-trained parameters of the widely applied CNN model VGG-16 (trained on over one million images; Simonyan and Zisserman 2015) and fine-tune the parameters with 80% of our training set of Airbnb property images. We use the remaining 20% as a hold-out sample to test the performance of the trained CNN, on which we achieve 90.4% accuracy. The high accuracy in predicting image quality
ensures a valid interpretation of our results of image quality in the demand model. In the appendix, we provide
a detailed description of the CNN architecture and technical notes on the training process.
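As a concrete illustration, here is a minimal Keras sketch of this transfer-learning setup. The classification head (layer sizes, dropout, learning rate) is our assumption for illustration, not the paper's exact architecture, which is detailed in the appendix.

```python
# Sketch: start from VGG-16 weights pre-trained on ImageNet, freeze the
# convolutional base, and train a small binary (high-/low-quality) head.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # optionally unfreeze the top blocks to fine-tune

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # assumed head size
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # P(high quality)
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_split=0.2, epochs=10)
```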
Prediction step
Once the CNN classifier learns the relationship between image features and the image label, it can be applied to
the unlabeled images (Figure 2). The classifier takes the unlabeled image as the input, extracts the hierarchical set
of image features using the parameters of the trained classifier, and outputs the predicted label: either “1” for a
high-quality image or “0” for a low-quality image.
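A hedged sketch of this prediction step, reusing the `model` from the training sketch above; the file name is a placeholder.

```python
# Scoring an unlabeled image with the trained classifier (sketch).
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input

img = image.load_img("listing_photo.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
prob = float(model.predict(x)[0, 0])  # model from the training sketch
label = int(prob > 0.5)               # 1 = high quality, 0 = low quality
```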
Figure 2. Description of Architecture and Layer Description of the CNN Classifier
Filters: The number of convolution windows (i.e., number of feature maps) on each convolution layer.
Zero-padding: Pads the input with zeros on the edges to control the spatial size of the output. Zero-padding has no
impact on the predicted output.
Max-pooling: Subsampling method. A 2 × 2 window slides through (without overlap) each feature map at that layer,
and the maximum value in the window is picked as the representation of the window. This reduces computation and
provides translation invariance.
In an ideal setting, the two groups would be comparable such that the impact of treatment (i.e., the adoption
of verified photos) on demand would be reflected by the demand difference in the post-treatment period. In our
unrandomized (i.e., self-selecting) setting, however, the treatment is endogenous, and the two groups may not be
comparable. As shown in Table 1, the treatment and control groups differ on some pre-treatment covariates. If
certain differences affect both property demand and hosts’ decisions about whether to join the photography
program, then we cannot simply attribute any observed difference in the property demand to the treatment
(Athey and Imbens 2006). To mitigate the endogeneity concern, we use the PSW method to create groups that
are sufficiently comparable on the observed characteristics.
In the covariate vector X of the PSW model, we include covariates to control for the self-selection issue identified by Zhang et al. (2019)—that a rational host should adopt high-quality images if they have an appropriately high-quality property.
ω(T_i, X_i) = T_i / ps(X_i) + (1 − T_i) / (1 − ps(X_i)),   (1)

where ps(X_i) is the estimated propensity score of unit i computed from its observed covariates X_i, and T_i is a dummy variable that equals 1 if unit i is in the treatment group and 0 otherwise.
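In code, the weighting in Eq. (1) can be sketched as follows, with the propensity score estimated by a logistic regression; the covariate list is illustrative, not the full covariate set X.

```python
# Inverse-propensity weights from Eq. (1), sketched with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = df[["review_count", "image_quality", "image_count",
        "nightly_rate", "max_guests"]].to_numpy()   # illustrative covariates
T = df["treated"].to_numpy()

ps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
ps = np.clip(ps, 0.01, 0.99)              # trim extreme scores (common practice)
weights = T / ps + (1 - T) / (1 - ps)     # Eq. (1)
```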
The PSW results are validated to ensure that the treatment and control groups in the weighted sample are
balanced on the included covariates (see the appendix for details on the PSW approach, results, and validation).
As shown in Table 2, the group means are statistically indistinguishable, suggesting that the PSW method successfully
removed systematic differences in host and property observed characteristics that might confound the treatment
assignment. See Section V.5 in the appendix for the analysis details.
Table 2. Covariate Balance in the Weighted Sample: Weighted Means and T-tests
Variables Treated Untreated t p-value
REVIEW_COUNT 20.56 19.88 0.27 0.790
IMAGE_QUALITY 0.27 0.25 1.00 0.316
IMAGE_COUNT 14.48 15.1 -0.67 0.506
NIGHTLY_RATE 170.15 191.36 -1.14 0.257
MINIMUM_STAY 2.57 2.57 -0.00 1.000
MAX_GUESTS 3.5 3.67 -0.82 0.410
RESPONSE_RATE 92.25 91.19 0.79 0.431
RESPONSE_TIME (minutes) 225.12 260.98 -1.18 0.238
SUPER_HOST 0.15 0.11 1.05 0.292
INSTANT_BOOK 0.11 0.11 -0.14 0.888
# BLOCKED DAYS 9.51 8.32 1.10 0.271
# RESERVATION DAYS 6.62 6.74 -0.16 0.877
PARKING 0.5 0.49 0.09 0.929
POOL 0.1 0.08 0.76 0.445
BEACH 0.02 0.02 0.34 0.737
INTERNET 0.99 1 -0.58 0.563
TV 0.79 0.81 -0.55 0.579
WASHER 0.6 0.57 0.81 0.419
MICROWAVE 0.15 0.13 0.64 0.523
ELEVATOR 0.2 0.21 -0.22 0.826
GYM 0.11 0.13 -0.69 0.490
FAMILY_FRIENDLY 0.19 0.2 -0.56 0.576
DEMAND_it = INTERCEPT + α·TREATIND_it + λ·CONTROLS_it + SEASONALITY_ct + PROPERTY_i + ε_it,   (2)

where TREATIND_it is the treatment status indicator, which equals 1 if property i has received treatment in period t and 0 otherwise. The key coefficient α estimates the percentage change in property demand that is associated with the treatment (i.e., having verified photos), and ε_it is a random shock to property i's demand in period t, assumed to follow an i.i.d. normal distribution. The vector CONTROLS_it represents a set of control variables that may be correlated with property demand, for example, the property rules and consumer reviews.6 We include the property fixed effect term, PROPERTY_i, to capture time-invariant factors that may impact property demand, such as geographic information and property-specific characteristics. We also include the time fixed effect term SEASONALITY_ct, which captures city-specific trends in property demand (i.e., we allow each city to have its own seasonal pattern).
Note that the key outcome variable in this study is property demand, operationalized as the monthly
occupancy rate. As an empirical extension, in the appendix, we analyze whether the property price changed with
treatment (i.e., we replace the DV in Eq. 2 with the property’s nightly rate). We find that after controlling for
6 The vector CONTROLS includes two metrics that measure hosts’ responsiveness, RESPONSE_RATE (percentage of
messages/requests from guests that receive a response from the host) and RESPONSE_TIME (average number of minutes
to respond to a guest); MIN_STAYS, the minimum number of nights that a guest can book; MAX_GUESTS, the maximum
number of guests that may stay at once; SECURITY_DEPOSIT, the money that the guest will be charged if the host claims
that the guest damaged the property and Airbnb approves the claim; CANCELLATION_STRICT, whether the cancellation
policy is strict (1) or not strict (0); SUPER_HOST, whether the host has a “super host” badge (1) or not (0), which Airbnb
assigns based on consumers’ reviews, the host’s responsiveness, etc.; BUSINESS_READY, whether the property has
business-related amenities (1) or not (0); HAS_RATING, whether the average guest ratings are presented on the property
page (1) or not (0); and the interaction terms of HAS_RATING and the multi-dimensional ratings.
7 A possible explanation is that a change in photo quality does not necessarily reflect changes in the property or host. In a
report on the smart pricing model, Airbnb did not describe property pictures as one of the many “factors at play” that the
algorithm used: https://blog.atairbnb.com/smart-pricing/.
8 The properties in our sample in the pre-treatment period (January 2016) were priced at $179.50 per night on average.
We define PRE(1) as the period prior to the treatment month and set it as the reference period (i.e., we
normalize its coefficient to zero). Then, PRE(2) is two months prior to treatment, PRE(3) is three months prior
to treatment, and PRE(4) represents all the periods from the beginning (i.e., January 2016) through four months
prior to treatment (Autor 2003). For properties with fewer than four pre-treatment periods (e.g., a property that
acquired verified photos in February 2016 would have only one pre-treatment period), the period dummies that
correspond to months before January 2016 are set to zero.
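The following pandas sketch shows one way to construct these relative-time dummies, assuming integer month indices; column names are illustrative.

```python
# Sketch: relative-time (event-study) dummies. `month` and `treat_month`
# are integer month indices (illustrative names); never-treated controls
# get all-zero dummies because NaN comparisons evaluate to False.
import pandas as pd

df["rel"] = df["treat_month"] - df["month"]   # months until treatment
df["PRE2"] = (df["rel"] == 2).astype(int)     # two months prior
df["PRE3"] = (df["rel"] == 3).astype(int)     # three months prior
df["PRE4"] = (df["rel"] >= 4).astype(int)     # pooled: >= 4 months prior
# rel == 1, the month just before treatment, is the omitted reference
# period, which normalizes its coefficient to zero.
```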
The set of coefficients 𝛽 allows us to validate the DiD model by comparing the trends in the property
demand of the weighted control and treatment groups prior to treatment. The common trends assumption is
validated if there is no significant pre-treatment difference between the groups, and Table 4 confirms that none of the period-dummy coefficients 𝛽 is statistically different from zero. Moreover, the set of 𝛽 coefficients does not exhibit
an increasing trend in the property demand for the treatment units compared with the control units. In other
words, the demand for the treated units was not already deviating from the demand for the control units prior
to treatment, which would suggest that the difference between groups was caused by an idiosyncratic shock that
affected both the treatment likelihood and property demand rather than by the treatment itself.
Table 4. Falsification Check: A Relative-Time Model of Pre-Treatment Trends
First, we run two analyses to test for potential inflation in the treatment coefficient in the long term, which
could occur if Airbnb’s ranking algorithm favors properties with verified images. We estimate our main DiD
9 We do not include the review-writing probability in the main DiD model because: 1) it is indirectly controlled for through the number of reviews and property characteristics, and 2) including it would significantly reduce the sample size. See Section V.9 in the appendix for a detailed discussion.
Our main analyses in Section 3.3 established that the demand for properties with verified images was 8.98%
higher than the demand for properties with unverified images. In this section, we correlationally explore the potential sources of the positive coefficient of verified photos by extracting interpretable image aesthetic attributes. We analyze the
images and then estimate a demand equation (see Eq. 4) that captures three aspects of a property’s image set:
IMAGE_COUNT (the number of property images), IMAGE_QUALITY (the average of the binary labels of
“high-quality” [1] and “low-quality” [0]), and the distribution of the five main types of photographed rooms. In
Section 2.3 and the appendix, we define the variables and provide technical details regarding their measurements.
DEMAND_it = INTERCEPT + α·TREATIND_it + μ·IMAGE_COUNT_it + γ·IMAGE_QUALITY_it
+ ρ1·BATHROOM_PHOTO_RATIO_it + ρ2·BEDROOM_PHOTO_RATIO_it
+ ρ3·KITCHEN_PHOTO_RATIO_it + ρ4·LIVING_PHOTO_RATIO_it
+ λ·CONTROLS_it + SEASONALITY_ct + PROPERTY_i + ε_it   (4)
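A sketch of estimating Eq. (4) with linearmodels' PanelOLS follows; for simplicity it uses common month effects rather than the paper's city-specific seasonality terms, and all names are illustrative.

```python
# Sketch of Eq. (4) as a two-way fixed effects panel regression.
# TimeEffects approximates SEASONALITY with common month effects;
# the paper allows city-specific seasonal patterns instead.
from linearmodels.panel import PanelOLS

panel = df.set_index(["property_id", "month"])
formula = ("demand ~ treat_ind + image_count + image_quality"
           " + bathroom_ratio + bedroom_ratio + kitchen_ratio + living_ratio"
           " + EntityEffects + TimeEffects")
res = PanelOLS.from_formula(formula, data=panel).fit(
    cov_type="clustered", cluster_entity=True)
print(res.params["treat_ind"])  # analogue of the treatment coefficient
```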
Table 5 reports the estimation results; the model specification does not include IMAGE_QUALITY in
column (1) and does include it in column (2). Recall that the estimated coefficient of the key variable TREATIND
was 8.985 in the original model; the coefficient decreases to 7.453 in column (1) with the inclusion of
IMAGE_COUNT, which has a positive and significant coefficient (none of the room type ratios have a
significant coefficient). The treatment coefficient decreases by another 41% (from 7.453 to 4.397) in column (2)
with the inclusion of IMAGE_QUALITY, which has a positive and significant coefficient. This significant
reduction suggests that the high quality of the verified images explains some but not all of the treatment effect,
as there is a significant residual treatment coefficient even when we control for both covariates.
Table 5. DiD Model: Controlling for the Number and Quality of Property Images
The photography literature highlights 12 image attributes across three components: composition, color, and
the figure-ground relationship (Datta et al. 2006; Freeman 2007; Wang et al. 2013). The features affect both the
vertical quality of the photographs and horizontal qualities that might affect a property’s appeal to potential guests
who browse the images.
Attribute 1: Diagonal Dominance. A photographer can guide the viewer’s eyes with leading lines, and in a
rectangular frame, the two diagonals of the photograph are the longest possible straight lines. In a diagonally
dominant photograph, the most salient visual elements are placed close to the two diagonals (Grill and Scanlon
1990). Diagonal dominance creates a perception of spaciousness, so we predict a positive relationship between
diagonal dominance and property demand. In Figure 3, it is likely that viewers will perceive that the image on the
right represents a larger room than the one on the left.
Figure 3. Comparing Images on Diagonal Dominance
Attribute 2: Rule of Thirds (ROT). An image can be divided into nine equal parts with (imaginary) horizontal
and vertical third lines. The ROT states that the main visual elements should be placed along the imaginary third
lines or close to the four intersections (Krages 2005). These off-center focal points introduce movement into the
photograph, engaging the viewer and making the image aesthetically pleasing and dynamic (Meech 2004), so we
predict a positive relationship between the use of the ROT and property demand. In Figure 4, the image on the
right follows the ROT better than the image on the left. Hence, when looking at the image, the viewer’s attention
first goes to the bed and then to its counterpoint—the other vertical third line. By contrast, in the image on the left,
the focus and key objects are not obvious.
Figure 4. Comparing Images on the ROT
Image That Does Not Follow the ROT Image That Follows the ROT
Component: Color
Color, one of the most significant elements in photography, affects the viewer’s emotional arousal. Building on
past research, Gorn et al. (1997) identified two dimensions of arousal: from boredom to excitement, and from
tension to relaxation. Excitement is preferred to boredom, and relaxation is preferred to tension. Three
dimensions of color—hue, saturation (chroma), and brightness (value)—can affect the level of arousal. Each
dimension has been widely studied in the marketing literature, particularly in the contexts of web design, product
packaging design, and advertisement design (Gorn et al. 2004; Miller and Kahn 2005). We add another attribute,
image clarity, that is affected by the combination of the aforementioned three.
Attribute 5: Warm Hue. Hues (e.g., red, green) are believed to be a major driver of emotion. The warmth in an
image is affected by the actual colors of the subject, but the photographer can also manipulate hue by warming
up or cooling down a picture during post-processing. Warm hues (such as red and yellow) elicit higher levels of
excitement (Gorn et al. 2004; Valdez and Mehrabian 1994), while cool hues (such as blue and green) elicit higher
levels of relaxation. Hence, we predict a positive relationship between warm hues and property demand. In Figure
6, we present a cool photo of a living room on the left and a warm photo of a living room on the right.
Attribute 6: Saturation. Saturation refers to the richness of color. Highly saturated images are colorful, while
weakly saturated images contain low levels of pigmentation. Saturation is positively associated with happiness
and purity; less-saturated colors are associated with sadness and distress (Valdez and Mehrabian 1994; Gorn et
al. 2004). Thus, we predict that real estate images with saturated colors can induce positive emotions in viewers
and that there is a positive relationship between saturation and property demand. To illustrate the difference in
emotional arousal, Figure 7 displays two images of the same room with different levels of saturation.
Attributes 7 and 8: Brightness and the Contrast of Brightness. The photography literature identifies two
image attributes regarding image illumination: brightness and its contrast. Brightness is the level of overall
illumination; viewers prefer bright images as they induce a sense of relaxation but do not affect the level of
excitement (Valdez and Mehrabian 1994; Gorn et al. 1997). Furthermore, sufficient illumination makes the
content of an image clear to viewers because images convey information through pixel brightness, so we predict
a positive relationship between brightness and property demand. Meanwhile, the contrast of brightness is the
variance in the illumination and describes whether the illumination is evenly distributed over the image with a
smooth flow. In other words, a low contrast of brightness indicates an even distribution of illumination across
the photograph. An uneven distribution (i.e., high contrast) of brightness may induce a feeling of harshness, so
we predict a negative relationship between the contrast of brightness and property demand. For example, in Figure 8, we compare an image with low brightness and high contrast to an image with high brightness and low contrast.
Figure 8. Comparing Images on Brightness and Contrast
Image with Low Brightness and High Contrast Image with High Brightness and Low Contrast
Attribute 9: Image Clarity. Clarity reflects the intensity of hues in the HSV (i.e., hue, saturation, value) space
(Levkowitz and Herman 1993). An image is “dull” if it is dominated by desaturated colors or has near-zero hue
intensities in some color channels (He et al. 2011). Amateur photographers often shoot dull photos, inducing a
so-called haze effect in which some parts of the image look unclear and ill-focused. By contrast, images with
“clear” color reduce the friction in information transfer, so we predict a positive relationship between image
clarity and property demand. In Figure 9, the photo on the right has higher clarity than the one on the left.
Figure 9. Comparing Images on Clarity
Attributes 10, 11, and 12: Area Difference, Color Difference, and Texture Difference. The figure-ground (FG) relationship within an image is evaluated in relation to three aspects—area, color, and texture. The principle of the FG
relationship is one of the most basic laws of perception and is used extensively by expert photographers to plan
their photographs. In visual art, the figure refers to the key region (i.e., foreground), and the ground refers to the
background. The FG relationship describes the separation between the figure and ground. Gestalt theory states
that objects that share visual characteristics, such as size, color, and texture, are seen as belonging together
(Arnheim 1974). Hence, the figure is more salient if it differs from the ground in size, color, and texture. In
advertising research, consumers pay more attention to images with clear FG relationships (Larsen et al. 2004;
The low-quality images rate lower than the high-quality images on all image attributes except for the contrast
of brightness, which is theorized to have a negative relationship with the property demand. More interestingly,
the unverified high-quality images also perform significantly worse than the verified high-quality images on most
attributes. We conclude that there is a systematic difference between the high-quality images taken by Airbnb photographers and those taken by other photographers.
Table 7. Summary Statistics: Mean (Standard Deviation) of Image Attributes of Verified vs. Unverified
High-Quality Images
10 Note that the composition measurements are negative because they reflect distances: we subtract each distance from zero (i.e., negate it), which reverses the direction while preserving the absolute magnitude. A higher value (i.e., less negative) suggests better performance on that composition attribute. For example, a higher value of diagonal dominance suggests that the image is more diagonally dominant.
Among the four composition attributes, the largest coefficients belong to the visual balance of color and
visual balance of intensity. Both features describe the symmetry of the image, which is determined by both the
physical arrangement of elements in the photograph and the photographer’s position from which the image is
taken.
Among the five color attributes, the largest magnitudes of coefficients belong to image clarity and the contrast
of brightness. Image clarity is obviously desirable; images that do not convey information clearly cannot make
elements of the property appear attractive to the consumer. One would expect most skilled photographers to
prioritize image clarity, but Table 7 shows a significant difference in image clarity between the unverified and
verified high-quality images; unsurprisingly, the verified photos score almost twice as high as the unverified low-
quality images. Meanwhile, a low value is usually preferred for the contrast of brightness, though several hosts
11 For ease of understanding, we use standardized image attributes (i.e., variables are normalized to zero mean and unit variance).
Among the three FG relationship features, the most prominent one is the size difference, as suggested by its
largest coefficient. When the figure has a much larger size than the background, it is better able to capture the
viewer’s attention. A photographer’s ability to manipulate the size difference may be physically constrained; for
instance, if there is little size difference between the two, it will be hard to separate the figure from the ground.
With the inclusion of the 12 image attributes, the coefficient of the key variable TREATIND falls to 1.721
and is statistically insignificant (Table 7), suggesting that the treatment effect is associated with the ability of
Airbnb professional photographers to achieve more optimal values of the 12 interpretable attributes.
The PSW approach accounts for and matches properties on observed characteristics, but unobserved aspects
of property quality may also affect both the property demand and treatment likelihood. We assess the robustness
of our results to unobserved factors with two follow-up analyses: a sensitivity analysis (Rosenbaum Bounds Test,
Section 3.4.2) and a DiD estimation with the pre-treated units (i.e., those that acquired verified photos before
January 2016) as the control group, such that any unobservables that drive self-selection should affect the
treatment and control groups equally (Section 3.4.3; Section V of the appendix). The results increase our
confidence that the PSW method adequately mitigated the endogeneity concern.
There are a few limitations to this research. First, our trained CNN classifier has a relatively high accuracy of
90.4%, but it is not perfectly accurate;13 future studies may further improve the classifier’s performance with
more training data. Second, we ignore an important element of the user search process on Airbnb. Typically, a
potential guest browses several properties that appear on an Airbnb search page based on user-specified criteria (e.g., location, dates). A single image appears for each property on the search result page and may influence
which properties the guest chooses to evaluate further (at which point the guest could see all images available for
the property). Without access to consumer search processes, we cannot explicitly incorporate relevant
information into our analysis, but future research may pursue such an analysis as more data (e.g., on the search
process and transaction) become available to researchers. Lastly, as noted previously, our observational data do
not enable us to make causal claims about the relationship between image features and property demand. Future
research can investigate the causal impact of image attributes on property revenue if the data and context allow
(e.g., in a randomized controlled trial).
This paper relates property demand to both high-level and low-level dimensions of image features. Certain
industries could benefit from the documented insights regarding the 12 image attributes. For example, home
rental markets, such as Airbnb and VRBO, could reduce the issue of quality uncertainty by incentivizing their
hosts to display high-quality property images. In the related industry of real estate (Zillow.com, Redfin.com,
13In our analyses, the IMAGE_QUALITY variable in the econometric model is the mean of the quality labels for the
property’s set of images. Though some images likely were misclassified during the machine learning step, some
misclassifications should cancel out in the computed mean quality, leading to a relatively accurate mean value.
References
Angrist, J., and Pischke, J. 2008. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Arnheim, R. 1974. Art and Visual Perception: A Psychology of the Creative Eye. Berkeley: University of California Press.
Athey, S., and Imbens, G.W. 2006. Identification and Inference in Nonlinear Difference-in-Differences Models. Econometrica
74(2): 431–497.
Austin, P.C., and Stuart, E.A. 2015. Moving towards Best Practice When Using Inverse Probability of Treatment Weighting
(IPTW) Using the Propensity Score to Estimate Causal Treatment Effects in Observational Studies. Statistics in Medicine
34(28): 3661–3679.
Autor, D. 2003. Outsourcing at Will: The Contribution of Unjust Dismissal Doctrine to the Growth of Employment
Outsourcing. Journal of Labor Economics 21(1): 1–42.
Berry, S., Levinsohn, J., and Pakes, A. 1995. Automobile Prices in Market Equilibrium. Econometrica 63(4): 841–890.
Bertrand, M., Karlan, D., Mullainathan, S., Shafir, E., and Zinman, J. 2010. What’s Advertising Content Worth? Evidence
from a Consumer Credit Marketing Field Experiment. The Quarterly Journal of Economics, 125(1): 263–306.
Bornstein, M.H., Ferdinandsen, K., and Charles, G.G. 1981. Perception of Symmetry in Infancy. Developmental Psychology
17(1): 82–86.
Datta, R., Joshi, D., Li, J., and Wang, J.Z. 2006. Studying Aesthetics in Photographic Images Using a Computational
Approach. ECCV 3953: 288–301.
DiPrete, T.A., and Gangl, M. 2004. Assessing Bias in the Estimation of Causal Effects: Rosenbaum Bounds on Matching
Estimators and Instrumental Variables Estimation with Imperfect Instruments. Sociological Methodology, 34: 271–310.
Freeman, M. 2007. The Photographer’s Eye: Composition and Design for Better Digital Photos. (1st ed.). Focal Press.
Gorn, G.J., Chattopadhyay, A., Sengupta, and Tripathi, J.S. 2004. Waiting for the Web: How Screen Color Affects Time
Perception. Journal of Marketing Research 41(2): 215–225.
Gorn, G.J., Chattopadhyay, A., Yi, T., and Dahl, D.W. 1997. Effects of Color as an Executional Cue in Advertising: They’re
in the Shade. Management Science 43(10): 1387–1400.
Grill, T., and Scanlon, M. 1990. Photographic Composition. Watson-Guptill.
Hagtvedt, H., and Patrick, V.M. 2008. Art Infusion: The Influence of Visual Art on the Perception and Evaluation of
Consumer Products. Journal of Marketing Research 45(3): 379–389.
He, K., Sun, J., and Tang, X. 2011. Single Image Haze Removal Using Dark Channel Prior. Pattern Analysis and Machine
Intelligence. IEEE Transactions 33(12): 2341–2353.
Heckman, J., Ichimura, H., and Todd, P. 1997. Matching as an Econometric Evaluation Estimator: Evidence from
Evaluating a Job Training Programme. The Review of Economic Studies 64(4): 605–654.
Krages, B. 2005. Photography: The Art of Composition. Allworth Press, USA.
Kreitler, H., and Kreitler, S. 1972. Psychology of the Arts. Duke University Press.
Krizhevsky, A., Sutskever I., and Hinton, G.E. 2012. Imagenet Classification with Deep Convolutional Neural Networks.
In NIPS’12: Proceedings of the 25th International Conference on Neural Information Processing Systems, Vol. 1, (pp. 1097–1105).
Levkowitz, H., and Herman, G.T. 1993. GLHS: A Generalized Lightness, Hue and Saturation Color Model. CVGIP:
Graphical Models and Image Processing 55(4): 271–285. doi:10.1006/cgip.1993.1019.
Li, J., Moreno, A., and Zhang, D. 2016. Pros vs Joes: Agent Pricing Behavior in the Sharing Economy. Available at
SSRN: https://ssrn.com/abstract=2708279.
Liu, L., Dzyabura, D., and Mizik, N. 2020. Visual Listening In: Extracting Brand Image Portrayed on Social Media. Marketing
Science 39(4): 669-686. https://doi.org/10.1287/mksc.2020.1226
Machajdik, J., and Hanbury, A. 2010. Affective Image Classification Using Features Inspired by Psychology and Art Theory.
In Proceedings of the International Conference on Multimedia (pp. 83–92) ACM.
Malik, N., Singh, P.V., and Srinivasan, K. 2019. A Dynamic Analysis of Beauty Premium. Available at SSRN: https://ssrn.com/abstract=3208162.
Malik, N., and Singh, P.V. 2019. Deep Learning in Computer Vision: Methods, Interpretation, Causation and Fairness. Tutorials in Operations Research (pp. 73–100).
Manchanda, P., Packard, G., and Pattabhiramaiah, A. 2015. Social Dollars: The Economic Impact of Customer Participation
in a Firm-Sponsored Online Customer Community. Marketing Science 34(3): 367–387.
Meech, S. 2004. Contemporary Quilts: Design, Surface and Stitch. Batsford: London, UK.
Meyers-Levy, J., and Peracchio, L. A. 1992. Getting an Angie in Advertising: The Effect of Camera Angle on Product
Evaluations. Journal of Marketing Research 29: 454–461.
Miller, E.G., and Kahn, B.E. 2005. Shades of Meaning: The Effect of Color and Flavor Names on Consumer Choice. Journal
of Consumer Research 32(1): 86–92.
Mitchell, A., and Olsen, J.O. 1981. Are Product Attribute Beliefs the Only Mediator of Advertising Effects on Brand
Attitude? Journal of Marketing Research 18: 318–332.
Narang, U., and Shankar, V. 2019. Mobile App Introduction and Online and Offline Purchases and Product Returns. Marketing Science 38. doi:10.1287/mksc.2019.1169.
Nevo, A. 2000. Mergers with Differentiated Products: The Case of the Ready-to-Eat Cereal Industry. The RAND Journal of
Economics 31(3): 395–421.
Netzer, O., Lemaire, A., and Herzenstein, M. 2019. When Words Sweat: Identifying Signals for Loan Default in the Text of Loan Applications. Journal of Marketing Research 56(6): 960–980.
Peracchio, L.A., and Meyers-Levy, J. 1994. How Ambiguous Cropped Objects in Ad Photos Can Affect Product
Evaluations. Journal of Consumer Research 21: 190–204.
PwC Report. 2015. Consumer Intelligence Series: The Sharing Economy.
Rosenbaum, P.R. 1993. Hodges–Lehmann Point Estimates of Treatment Effect in Observational Studies. Journal of the American Statistical Association 88: 1250–1253.
Rosenbaum, P. R. 2002. Observational Studies. (2nd ed.). New York: Springer.
Rosenbaum, P.R., and Rubin, D.B. 1983. The Central Role of the Propensity Score in Observational Studies for Causal
Effects. Biometrika 70(1): 41–55.
Schloss, K. B., and Palmer, S. E. 2011. Aesthetic Response to Color Combinations: Preference, Harmony, and Similarity.
Attention, Perception, & Psychophysics 73(2) 551–571. Available at http://doi.org/10.3758/s13414-010-0027-0.
Simonyan, K., and Zisserman, A. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. Presented
at International Conference on Learning Representations 2015 (ICLR 2015).
Sun, M., and Zhu, F. 2013. Ad Revenue and Content Commercialization: Evidence from Blogs. Management Science 59(10):
2314–2331.
Ufford, S. 2015. The Future of the Sharing Economy Depends on Trust. Available at
http://www.forbes.com/sites/theyec/2015/02/10/the-future-of-the-sharing-economy-depends-on-trust/#21a3370b58ff.
Valdez, P., and Mehrabian, A. 1994. Effects of Color on Emotions. Journal of Experimental Psychology: General 123(4): 394–409.
Wang, Q., Li, B., and Singh, P.V. 2018. Copycats versus Original Mobile Apps: A Machine Learning Detection Method and
Empirical Analysis. Information Systems Research, 29(2): 273–291.
Wang, X., Jia, J., Yin, J., and Cai, L. 2013. Interpretable Aesthetic Features for Affective Image Classification. In IEEE
International Conference on Image Processing (pp. 3230–3234). Melbourne, Australia.
Wang, K., and Goldfarb, A. 2017. Can Offline Stores Drive Online Sales?. Journal of Marketing Research 54(5): 706–719.
Winkler, R., and Macmillan, D. 2015. The Secret Math of Airbnb’s $24 Billion Valuation. Available at
https://www.wsj.com/articles/the-secret-math-of-airbnbs-24-billion-valuation-1434568517.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. 2014. Learning Deep Features for Scene Recognition Using
Places Database. In Proceedings of the 27th International Conference on Neural Information Processing Systems. Vol. 1. December 2014.
Zhang, S., Mehta, N., Singh, P.V., and Srinivasan, K. 2019. Can Lower-Quality Images Lead to Greater Demand on Airbnb? Working Paper, Carnegie Mellon University.
Appendix
We describe the steps for creating the dataset to train our classifier using Amazon Mechanical Turk (AMT). AMT
is a platform of Amazon Web Services that enables users to outsource small tasks to a large group of workers at
a relatively low cost. It has been widely used for human intelligence tasks, such as data collection and data
cleaning. As a crowdsourcing method, AMT has been found to be quite efficient and accurate (Casalboni 2015;
Laws et al. 2011).
To construct a labeled training set for supervised learning, we selected 3,000 Airbnb property images from
our dataset and used AMT to tag each image based on its quality. In selecting images for AMT tagging, full random sampling was not optimal, as we did not know the distribution of image quality beforehand. Instead, we drew a random sample stratified by a crude metric of quality, which ensures that a sufficient number of images were evaluated and labeled in each image-quality category while keeping the sample random within each stratum. We randomly selected 500 images from the pool
of verified images, as these were guaranteed to be taken by professional photographers and are most likely to be
of high quality. Then, from all the unverified images, a human judge chose 8,000 images that looked bad, and 500
images from this group were randomly sampled. From the unverified images, we chose 5,000 images that we
judged to be in between excellent and very bad. Of these, 500 images were randomly sampled. Lastly, we randomly
sampled 1,500 images from the entire sample. Constructing the AMT data in this way ensured that we would
have a sufficient random sample of images from each stratum (i.e., subgroup of images with a certain quality).
We also manually reviewed the selected images to ensure that no image was repeatedly sampled.
For the AMT tagging task, we created a survey instrument that asked the Turker to assign a score to a displayed
image based on its aesthetic quality. To provide accurate guidelines for image evaluation, we borrowed
instructions from professional photography forums, as well as Airbnb’s guidelines for shooting good property
photos. We also provided example photos. The quality measurement was based on a Likert scale from 1 to 7, in
which 1 is very bad and 7 is excellent. Figure A1 shows an example question on the survey given to the Turkers. To
ensure high-quality and consistent responses from the Turkers, we required them to have an approval rate of
higher than 95% and to have completed at least 50 approved tasks.
After a training set was constructed, the next task was to build an image quality classifier using labeled data. We
applied a CNN, a deep learning framework that is widely used in the field of computer vision and has been shown to perform very well on tasks such as object recognition and image classification (Krizhevsky
et al. 2012; Simonyan and Zisserman 2015).
As shown in Figure A2, a CNN model consists of a sequence of layers, each having multiple neurons. The
number of neurons can vary from one to thousands. These neuron layers perform matrix multiplication based
on an input, generating an output to serve as the input for the next layer. Both the input and output take the
form of multi-dimensional matrices. The sequence of layers makes the neural network deep.
Images serve as the first input for the deep learning framework. In our training task, we resized all the images
to 224 × 224 pixels, determined the pixel intensity of each image, and represented the image with a 3D array
(matrix) that contains pixel information for the three channels (RGB). This was done to alleviate the
computational burden and ensure that the image size aligned with the pre-trained VGG16 model (described
below).
The last output layer predicts the binary label for its input, which, after passing through the whole network, is an N-dimensional vector extracted from the image. For an image in our training set (represented by IMG_i), the output layer applies a sigmoid function and predicts the label regarding the image's quality:

p(high quality | IMG_i) = σ(W^T·X_i + W_0) = 1 / (1 + exp(−(W^T·X_i + W_0))),

where X_i represents the output from the layer preceding the output layer (in our model, this was the FC2 layer, which produces a 4096 × 1 vector), W represents the weight parameters, W_0 represents the bias (a constant) connecting the preceding layer to the output layer, and p(high quality | IMG_i) is the probability that the image is of high quality, given the X_i, W, and W_0 values for the output layer.
Throughout the CNN model, the sequence of weights on each layer defines the intermediate vectors extracted at each layer, including X_i. These weights are adjusted during the training process to optimize the model's
predictive performance.
A few key layers determine the performance of a CNN: the convolution layer, the zero-padding layer, and the
max-pooling layer. We describe these below.
Convolution Layer
The convolution layer is the most important and unique layer in the CNN. It consists of a stack of convolution
filters or convolution kernels. For example, the two convolution layers in Layer Block A (shown in Figure A2)
consist of 64 and 128 convolution filters, respectively. A convolution filter is a matrix in which each element
represents a numeric value. For example, in Layer Block A, the convolution filters have a size of 3 × 3 and hence consist of nine numeric values.14 Treating an image or an intermediate input as a matrix, the filter slides across the input and computes a dot product at each position. For an input of a relatively large size (e.g., 224 × 224), a 3 × 3 convolution filter computes a dot product over every 3 × 3 section. The convolution operation is beneficial because it reduces the number of parameters and exploits and preserves the (local) spatial relationships of the input.
Regarding the second benefit, if a convolution kernel extracts an oriented edge of an object, operating this kernel
on every small square (e.g., 3 × 3) of an image would enable the extraction of all edges with that direction from
the image. Many kernels that extract edges do so in all directions, potentially constructing the contour of an
object. As can be seen in Figure A2, each block consists of a varying number of convolutional filters (e.g., 64,
128, 256, and 512 filters). These kernels extract features from their input data, which represent the features extracted by the preceding layers. Toward the output layer (i.e., in layers closer to the output), the filters jointly extract increasingly higher-level features. That is, the CNN extracts a hierarchical structure of related features to predict the output labels.
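To make the sliding dot product concrete, here is a toy NumPy sketch of a single convolution filter; the hand-set edge kernel is our illustration, unlike the learned filters in the CNN.

```python
# Sketch: one 3x3 convolution filter sliding over a grayscale image,
# computing a dot product at every position (no padding, stride 1).
import numpy as np

def conv2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    k = kernel.shape[0]
    h, w = img.shape[0] - k + 1, img.shape[1] - k + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * kernel)  # dot product
    return out

edge_kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # vertical edges
features = conv2d(np.random.rand(224, 224), edge_kernel)      # -> (222, 222)
```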
Zero-Padding Layer
The zero-padding layer adds numerical arrays consisting of all 0 values to the edge of an intermediate output
from a layer. The size of the zero-padding layer is a hyperparameter of the CNN model. Typically, as was done in our model, the intermediate output is padded with 1 × M zero vectors on each side, causing the width and height each to increase by 2 after zero-padding. As zeros do not contribute to the matrix multiplication
procedure, the zero-padding layer does not affect the features extracted by the layers. In addition, the zero-
padding layer allows us to control the spatial size of the intermediate outputs, and it can prevent the outputs
from decreasing too quickly after layers of convolution operations.
Max-Pooling Layer
Inserting a max-pooling layer between successive convolution layers is common in CNNs. A max-pooling layer
is a small square filter (in our model, a 2 × 2 matrix). Similar to the convolution filter, a max-pooling layer applies to every 2 × 2 square patch in the input data. Its function is to select and preserve only the maximum value in that 2 × 2 patch.
14 The size of a convolution filter is a choice of the model architecture. A 3 × 3 or 5 × 5 configuration is common.
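A NumPy sketch of this non-overlapping 2 × 2 max-pooling, assuming even spatial dimensions:

```python
# Sketch: 2x2 max-pooling via a reshape, halving each spatial dimension.
import numpy as np

def max_pool_2x2(fmap: np.ndarray) -> np.ndarray:
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

pooled = max_pool_2x2(np.random.rand(224, 224))  # -> (112, 112)
```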
Figure A2. Description of the Architecture and Description of the Layers of the CNN Classifier
Filters: Indicate the number of convolution windows (i.e., number of feature maps) on each convolution
layer.
Zero-padding layer: Pads the input with zeros on the edges to control the spatial size of the output. It has
no impact on the predicted output.
Max-pooling: Subsampling method. A 2 × 2 window slides through each feature map (without overlap) at
that layer, and then the maximum value in the window is selected as a representation of the window. This
reduces computation and ensures translation invariance.
We randomly split the dataset into training and validation sets, with 80% of the examples forming the training
set and the remainder being used as the validation set. To reduce the overfitting problem in the training step, we
use data augmentation and implement a real-time (i.e., during training) image transformation for each image in
the training sample by randomly (1) flipping the input image horizontally, (2) rescaling the input image within a
scale of 1.2, and (3) rotating the image within 20˚ (images being rotated by a random value between 0˚ and 20˚).
This method introduces random variation in the training sample, increases the training set size, and reduces
overfitting (Krizhevsky et al. 2012).
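A sketch of this augmentation pipeline with Keras's ImageDataGenerator follows; the exact parameterization (e.g., zoom_range) is our approximation of the transformations described above.

```python
# Sketch of the real-time augmentation: random horizontal flips,
# rescaling within roughly 1.2x, and rotations of up to 20 degrees.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    horizontal_flip=True,  # (1) random horizontal flip
    zoom_range=0.2,        # (2) random rescaling within ~1.2x
    rotation_range=20,     # (3) random rotation in [0, 20] degrees
)
# model.fit(augmenter.flow(train_images, train_labels, batch_size=32), ...)
```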
To effectively learn features from a relatively small sample, we apply the idea of transfer learning, which involves building our model on top of an existing well-trained CNN model and then fine-tuning it. As the
features extracted from images are generic to some extent (e.g., almost all CNNs extract edge information at the
first layer), transfer learning is quite common in deep learning and is suggested as an effective approach for
dealing with problems caused by limited data (Girshick et al. 2014; Lin et al. 2015; Zhang et al. 2015). In this
study, we used VGG16 (Simonyan and Zisserman 2015), as it is a conceptually simple and popular pre-trained
In this section, we define the key concepts used in the process of image attribute computation. These key
concepts include image saliency, the key/salient region, and the figure-ground (FG). We first discuss each
concept’s definition and then present the image algorithm used to detect, extract, or compute the concept.
1. Saliency
The basic unit determining image saliency is visual saliency at the pixel level. The overall saliency score for a local
patch of an image can be computed based on the pixel saliency within the local region.
Definition
Saliency describes a concept that originates from visual unpredictability. In images, it is often captured by
variations in, for example, boundaries and colors. Studies in cognitive psychology and computer vision have investigated how humans process and attend to visual information and have found that we allocate our attention to parts of the given information (e.g., certain regions of an image) while cognitively ignoring other parts. Visually unique content is salient in the sense that it easily attracts viewers' attention.
Calculation
In general, the models proposed for calculating visual saliency are based on local contrast to surroundings. The
contrasts are determined using features, such as color, intensity, edge density, and orientation. A simple example
is the gradient of pixel intensity. A pixel with great contrast is assigned a high saliency value.
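As a minimal illustration, pixel saliency can be proxied by the gradient magnitude of pixel intensity; the sketch below uses OpenCV Sobel filters and stands in for the richer contrast features (color, edge density, orientation) used in practice.

```python
# Sketch: pixel-saliency proxy via local intensity contrast (Sobel
# gradient magnitude). High local contrast -> high saliency value.
import cv2
import numpy as np

gray = cv2.imread("listing_photo.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
saliency = np.sqrt(gx ** 2 + gy ** 2)
```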
2. Salient Region
Definition
Following the definition of visual saliency, salient regions are defined as the regions of an image that have a high
overall saliency score.
Detection
The detection of a salient region involves four steps: (1) segmenting an image into local patches, (2) assigning a saliency score to each patch, (3) merging similar patches into regions, and (4) finding the most salient region. These steps are discussed in greater depth below.
1) Segmenting an image: This process generally involves grouping the pixels of an image into multiple parts
containing pixels that are similar to one another. Segmentation can be based on edges (detected edges
are assumed to define the boundaries of objects), colors, or other factors. A segmentation algorithm is
3. FG
Definition
The figure is the foreground of an image, and the ground is the background. Only one figure and one ground can
be detected for each image. This is different from the detection of salient regions, for which multiple regions can
be detected.
Detection
The figure is detected and extracted from an image, and then the ground is defined as the rest of the image.
Detection of the foreground is an extension of image segmentation, as a pixel is assigned a value of either 1
(foreground) or 0 (background).
Detailed Algorithm
We used GrabCut, a state-of-the-art model for foreground extraction (Rother et al. 2004). In this model, an image is treated as a graph, with each pixel serving as a node and edges between neighboring pixels weighted by their similarity. GrabCut implements the expectation–maximization (EM) algorithm and the min-cut algorithm to iteratively assign a foreground/background label to pixels and to cut the graph into two subgraphs (a minimal code sketch follows the steps below): one representing the foreground
foreground/background label to pixels and to cut the graph into two subgraphs: one representing the foreground
and the other representing the background.
1) Initially, an arbitrary rectangle separates the image into two parts. The pixels in the rectangle are labeled
“1” (foreground), and those outside it are labeled “0” (background). The initial position of the rectangle
can be arbitrary. Alternatively, one can specify the rectangle’s location or hard label some pixels with
good prior knowledge of where the foreground might be.
2) A Gaussian mixture model (GMM) is trained with the EM algorithm based on the distribution of pixel
color statistics. From the GMM, we can determine the probability that each pixel belongs to a particular
mode (or cluster). That is, the GMM labels a pixel as a probable foreground or a probable background.
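The remaining iteration steps are lost to a page break above, but the full loop is implemented in OpenCV's grabCut function. Below is a minimal sketch of figure extraction along these lines; the file name and initial rectangle are illustrative assumptions, not values from the paper:

```python
import cv2
import numpy as np

img = cv2.imread("listing_photo.jpg")        # hypothetical input image
mask = np.zeros(img.shape[:2], np.uint8)

# Step 1: an initial rectangle assumed to contain the foreground,
# given as (x, y, width, height); here it is set arbitrarily.
rect = (10, 10, img.shape[1] - 20, img.shape[0] - 20)

bgd_model = np.zeros((1, 65), np.float64)    # internal GMM state (background)
fgd_model = np.zeros((1, 65), np.float64)    # internal GMM state (foreground)

# Step 2 onward: alternate GMM fitting (EM) and graph min-cuts for 5 iterations.
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels labeled definite or probable foreground become the figure (1);
# everything else is the ground (0).
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
figure = img * fg[:, :, np.newaxis]
```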
Using the computer vision algorithm described above, we perform image processing tasks to segment images
into patches and detect key/salient regions. After salient regions are detected, subsequent computation is
performed to measure image attributes. This section discusses the steps for computing image attribute
measurements after image processing.
a. Composition
Four image attributes are categorized during the composition step. How well an image performs concerning a
particular attribute is evaluated by distance, such as the distance between two pixels. A smaller distance indicates
better performance. For all four composition attributes, we compute the distance metrics and then subtract the
metrics from zero.
Diagonal Dominance (Attribute 1): Diagonal dominance captures how close an image’s key region lies to the two diagonals of the image. For an image, we first identify the key region and then measure the weighted Manhattan distance from the key region to each diagonal (Liu et al. 2010; Wang et al. 2013).15 The diagonal dominance measure is computed by subtracting the minimum weighted distance from zero.
A greater diagonal dominance value suggests that the image is more diagonally dominant.
Rule of Thirds (ROT) (Attribute 2): We first divide an image into nine equal parts with two (imaginary) equally
spaced vertical lines and two (imaginary) equally spaced horizontal lines. Then, we calculate the Euclidean
distance from the centroid of the key region to each of the intersections (Wang et al. 2013). If the minimum
distance is small, then the image follows the ROT, with its key region close to at least one intersection. The ROT
is measured by subtracting the minimum distance from zero. If an image follows the ROT more closely, the ROT
value is greater.
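As an illustration, the ROT measurement can be sketched as follows, assuming the key-region centroid has already been located by the salient region detection step described earlier:

```python
import numpy as np

def rule_of_thirds(centroid, width, height):
    """Negated minimum Euclidean distance from the key-region centroid
    to the four rule-of-thirds intersections (greater = closer to ROT)."""
    cx, cy = centroid
    intersections = [(width * i / 3.0, height * j / 3.0)
                     for i in (1, 2) for j in (1, 2)]
    d_min = min(np.hypot(cx - x, cy - y) for x, y in intersections)
    return -d_min  # subtract the minimum distance from zero
```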
Visual Balance Intensity (Attribute 3): In this measurement, we split the image along its vertical central line.
On each half of the image, we identify a key region and compute the distance from its centroid to the vertical
central line (Liu et al. 2010). A relative distance measure is calculated by subtracting the shorter distance from the
longer one and dividing the difference by the longer distance. Then, visual balance intensity is computed by
subtracting the relative distance from zero. A greater value for this measure suggests that the image is more
(vertically) visually balanced in terms of pixel intensity.
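A corresponding sketch for visual balance intensity, assuming the centroids of the two half-image key regions have already been located:

```python
def visual_balance_intensity(cx_left, cx_right, width):
    """Negated relative distance of the two half-image key-region
    centroids from the vertical centerline (greater = more balanced)."""
    mid = width / 2.0
    d_left, d_right = abs(cx_left - mid), abs(cx_right - mid)
    longer, shorter = max(d_left, d_right), min(d_left, d_right)
    relative = (longer - shorter) / longer if longer > 0 else 0.0
    return -relative  # subtract the relative distance from zero
```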
Visual Balance Color (Attribute 4): The color measurement of visual balance compares the left half of an image with the right half based on color. We first calculate the Euclidean distance between the two halves in terms of color intensity (i.e., RGB values) and then compute visual balance color by subtracting that distance from zero, following the convention above (greater values indicate better color balance).
15The Manhattan distance between two points on an image is measured as the number of pixels between them, with only
horizontal and vertical paths allowed.
b. Color
Five image attributes related to color are computed. The measurements are based on pixel intensity or related
values (e.g., hue and saturation).
Warm Hue (Attribute 5): The warm hue measurement captures the warmth of an image, defined as the relative proportion of warm hues (e.g., yellow) to cool hues (e.g., green). The measurement is computed in the HSV (hue, saturation, and value) space. Specifically, we calculate the proportion of pixel hues that fall outside the cool range (30–110) on the hue spectrum (Wang et al. 2013). If an image contains more warm hues, such as yellow and orange, it will have a greater warm hue value.
Saturation (Attribute 6): We compute saturation in the HSV space by averaging the saturation values of all pixels in the image. A greater value indicates higher saturation (i.e., the image contains more saturated colors).
Brightness (Attribute 7): The brightness of an image is defined as the overall level of illumination. We calculate
the intensity of each pixel and then average the intensity values across all pixels in the image. A brighter image
has a greater brightness value.
Contrast of Brightness (Attribute 8): Contrast is calculated as the SD of pixel intensity in the whole image. A
lower contrast of brightness value suggests that the brightness is more evenly distributed across the image.
Image Clarity (Attribute 9): Image clarity captures the proportion of pixels with sufficient brightness. We measure pixel brightness on a 0–1 scale and then compute the proportion of pixels whose brightness falls between 0.7 and 1 (Wang et al. 2013). A clear image has a higher image clarity value.
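The five color attributes can be sketched in a few lines of Python with OpenCV. Note that OpenCV stores hue on a 0–179 scale (half-degrees), so the sketch rescales hue before applying the 30–110 cool range; the exact scaling in the paper may differ:

```python
import cv2
import numpy as np

def color_attributes(image_path):
    """Sketch of Attributes 5-9 following the definitions above."""
    img = cv2.imread(image_path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float64)
    hue = hsv[..., 0] * 2.0    # rescale OpenCV's 0-179 hue to 0-360 degrees
    sat = hsv[..., 1] / 255.0  # saturation on a 0-1 scale
    val = hsv[..., 2] / 255.0  # brightness (value) on a 0-1 scale
    return {
        "warm_hue":   ((hue < 30) | (hue > 110)).mean(),   # share outside cool range
        "saturation": sat.mean(),                          # mean pixel saturation
        "brightness": val.mean(),                          # mean pixel intensity
        "contrast":   val.std(),                           # SD of pixel intensity
        "clarity":    ((val >= 0.7) & (val <= 1.0)).mean() # share of bright pixels
    }
```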
c. FG Relationship
The FG relationship is described as the difference between the figure and its ground in terms of three metrics:
size, color, and texture. An image with a good FG relationship has a clearly separable figure and ground (i.e.,
greater differences).
Size Difference (Attribute 10): The size difference attribute compares the size of the figure with that of the ground. We detect the figure and ground in an image and calculate the size of each in relation to the whole image (Cheng et al. 2011). The size difference is then computed by subtracting the ground’s size ratio from the figure’s size ratio.
Color Difference (Attribute 11): The color difference attribute captures the difference in colors between the
figure and the ground. We compute the Euclidean distance between the mean color of the figure and that of the
ground. A high color difference value suggests that the figure and ground contain distinct colors. In such cases,
the figure can be easily distinguished from the ground.
Texture Difference (Attribute 12): Texture difference measures the difference between the figure and the ground in terms of texture, which is captured by the edge density within a local region. For the figure and the ground, we use the Canny edge detector to detect edges and then compute each region’s edge density. We then measure the absolute difference between the two densities. A high texture difference value suggests that the figure and ground are clearly separated in terms of texture.
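A compact sketch of the three FG attributes, assuming a binary foreground mask such as the one produced by the GrabCut step above; the Canny thresholds are illustrative, not the paper's:

```python
import cv2
import numpy as np

def fg_attributes(img, fg_mask):
    """Sketch of Attributes 10-12: size, color, and texture differences
    between the figure (mask == 1) and the ground (mask == 0)."""
    fg = fg_mask.astype(bool)
    size_diff = fg.mean() - (~fg).mean()  # figure size ratio minus ground's

    mean_fg = img[fg].mean(axis=0)        # mean BGR color of the figure
    mean_bg = img[~fg].mean(axis=0)       # mean BGR color of the ground
    color_diff = np.linalg.norm(mean_fg - mean_bg)

    edges = cv2.Canny(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), 100, 200) > 0
    texture_diff = abs(edges[fg].mean() - edges[~fg].mean())  # edge densities

    return size_diff, color_diff, texture_diff
```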
We build a deep learning model to automatically categorize the type of scene photographed. The goal is to compute the distribution of room types depicted in a property’s images; controlling for this distribution in our demand model addresses the concern that professional photographers know which types of places appeal more to consumers and thus disproportionately present those aspects of properties.
The model classifies the room type (bathroom, bedroom, kitchen, living room, or outdoor area) depicted in a given property image. Using transfer learning with a deep learning model
that was pre-trained on a large scene classification dataset, Places205 (Zhou et al. 2014), we optimize the classifier
for a dataset we collected, which consists of 54,557 images of bathrooms, 59,082 images of bedrooms, 88,030
images of kitchens, 81,819 images of living rooms, and 5,734 images of outdoor areas. The average classification
accuracy is 95.05% on a hold-out set across the five categories.
To train a room type classifier, we need a large number of room images labeled with a room type. We would have preferred to use original images from Airbnb.com, but the images on property web pages are not labeled, and having them labeled manually (e.g., by Amazon Mechanical Turk workers) would incur a high labor cost. Therefore, we crawled data from real estate–related websites on which a vast number of indoor/outdoor images are already classified into categories. Figure A3 shows a portion of a web page displaying 23 images of kitchens at multiple properties.
From the website, we collected images aligning with the five room types: bathroom, bedroom, living room,
kitchen, and outdoor area. We then split the dataset, 80% of which was used as a training set and the remaining
20% used as a hold-out test set.
We used the VGG16 ConvNet model, which was pre-trained on the Places205 dataset (Zhou et al. 2014) and then fine-tuned on our training set.16 The pre-trained model was originally built to classify 205 categories of places. To transfer it to our study, we removed its output layer and added an output layer designed for our specific task. The added output layer is a 5×1 vector in which each element indicates the predicted probability of assigning the corresponding label. Because we assume that a room belongs to only one category, the Softmax function is used so that the predicted probabilities sum to 1. For example, the probability that an image is assigned room type k is computed as follows:
$$P(\text{room type} = k \mid \mathbf{h}) = \frac{\exp(\mathbf{w}_k^{\top}\mathbf{h} + b_k)}{\sum_{j=1}^{5} \exp(\mathbf{w}_j^{\top}\mathbf{h} + b_j)},$$
where $\mathbf{h}$ denotes the activation of the FC2 layer, $\mathbf{w}_j$ represents the weights connecting FC2 to the $j$th node on the output layer, and $b_j$ represents the bias connecting FC2 to the $j$th node on the output layer.
16 The pre-trained VGG16 model (including the architecture and parameters) can be accessed at http://places.csail.mit.edu/downloadCNN.html.
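The head-replacement step can be illustrated with the PyTorch sketch below. This is not the authors' exact setup: torchvision's ImageNet-pretrained VGG16 stands in for the Places205 weights, which are distributed separately at the URL in footnote 16:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a pre-trained VGG16 (ImageNet weights as a stand-in for Places205).
model = models.vgg16(weights="IMAGENET1K_V1")

# Remove the original output layer and add a 5-way head, one node per room
# type; Softmax is applied inside CrossEntropyLoss during fine-tuning.
model.classifier[6] = nn.Linear(in_features=4096, out_features=5)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

def train_step(images, labels):
    """One fine-tuning step on a mini-batch of labeled room images."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```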
This section reports a series of analyses performed to test the robustness of our main results and/or to exclude alternative explanations. We begin by presenting the validation of the propensity score method used in our empirical model, which includes the propensity score weighting (PSW) strategy and a sensitivity assessment for unobservables.
To ensure that the propensity score (PS) approach effectively eliminates potential systematic imbalances between
the treatment and control groups, one needs to show that the PSs have balanced the covariates for matched or
weighted samples.
We implemented a balance check, which compares, in terms of the covariates, the weighted means of the treatment group, $\bar{\mathbf{X}}_{\text{treatment}} = \frac{\sum_{i \in T} \omega_i \mathbf{X}_i}{\sum_{i \in T} \omega_i}$, and the control group, $\bar{\mathbf{X}}_{\text{control}} = \frac{\sum_{i \in C} \omega_i \mathbf{X}_i}{\sum_{i \in C} \omega_i}$. Here, $\mathbf{X}_i$ is a 1×M dimensional vector of the pre-treatment covariates (i.e., the covariates observed before the treatment) of unit $i$, and $\omega_i$ is the sample weight for unit $i$, computed based on the estimated PSs.
Table 2 presents the weighted group means and tests for the differences in the means for each variable $X_m$ (m = 1, 2, …, M). As shown by the t-statistics in the table, the weighted samples are not statistically different at the 5% significance level. That is, the systematic differences in the weighted samples are negligible after performing the PSW method. Therefore, we validated that our PSW method effectively eliminated imbalances in the sample and that our weighted treatment and control groups are comparable in terms of the observed covariates that may affect the treatment selection process.
                              Weighted Means in Groups          T-test
Variables                      Treated     Untreated        t     p-value
REVIEW_COUNT                     20.56         19.88     0.27       0.790
IMAGE_QUALITY                     0.27          0.25     1.00       0.316
IMAGE_COUNT                      14.48         15.10    −0.67       0.506
NIGHTLY_RATE                    170.15        191.36    −1.14       0.257
MINIMUM_STAY                      2.57          2.57    −0.00       1.000
MAX_GUESTS                        3.50          3.67    −0.82       0.410
RESPONSE_RATE                    92.25         91.19     0.79       0.431
RESPONSE_TIME (minutes)         225.12        260.98    −1.18       0.238
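For concreteness, one way to compute the weighted group means and an approximate t-test is sketched below, using Kish's effective sample size in place of the raw group sizes; the paper's exact test statistics may be computed differently:

```python
import numpy as np

def weighted_balance_check(x, w, treated):
    """Weighted group means of covariate x and a Welch-style t-statistic,
    with weight sums replaced by effective sample sizes."""
    def group_stats(values, weights):
        m = np.average(values, weights=weights)
        var = np.average((values - m) ** 2, weights=weights)
        n_eff = weights.sum() ** 2 / (weights ** 2).sum()  # Kish's formula
        return m, var, n_eff

    m1, v1, n1 = group_stats(x[treated == 1], w[treated == 1])
    m0, v0, n0 = group_stats(x[treated == 0], w[treated == 0])
    t = (m1 - m0) / np.sqrt(v1 / n1 + v0 / n0)
    return m1, m0, t
```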
In the model for estimating the propensity scores, we included a rich set of variables and their interactions. The
inclusion of a complete set of covariates reduced the odds that our main results would be affected by variables
that were not accounted for when computing the propensity scores.
Despite the long list of covariates included in the propensity score estimation, some variables that may affect a host’s decision to participate in the professional photography program were omitted. As is commonly acknowledged,
Note: Gamma: log odds of differential assignment because of unobserved factors; sig + (−): upper (lower) bound of the significance level; t-hat + (−): upper (lower) bound of the Hodges–Lehmann point estimate; CI + (−): upper (lower) bound of the confidence interval (α = 0.95).
Observations 75,406
R-squared 0.6943
Note: The number of observations is smaller than that in the main analyses (Tables 2, 3, 4, and 7 in the
main paper) because a shorter period was used for the treated units. Specifically, instead of using the whole
panel of 16 periods of data, we used up to eight periods for the estimation in this table.
*p < 0.05, **p < 0.01, ***p < 0.001.
The inclusion of the period leads, POST(k), allowed us to examine possible dynamics in the treatment effect across post-treatment periods. That is, if the treatment had a greater impact a certain number of months after adoption, we would expect the estimated coefficient on that period dummy to be greater.
As shown in Table A12, the treatment effect exhibits dynamics in the post-treatment period. Specifically, the effect size increases over time before stabilizing and then slightly declining. We interpret this as follows: in the month in which the treatment was adopted (POST(0)*TREAT), we may have underestimated the treatment effect, as some of that month’s bookings may have been made in earlier periods. In later periods, the estimated coefficient of the verified photos initially increases before stabilizing. Two forces may contribute to this increase:
(1) All the demand in these later periods is attributable to the verified photos rather than the unverified ones.
(2) With increased demand, Airbnb’s ranking algorithm may have begun to show treated properties higher in search results, leading to further demand.
Comparing the effect size in the month following the treatment month (i.e., the coefficient of POST(1)*TREAT) with the treatment effect obtained from the main model, we see that the results are consistent (8.958 versus 9.303). The coefficient of the interaction term with the last period dummy, POST(5)*TREAT, is smaller than the estimated coefficients for the earlier interaction terms POST(k)*TREAT. This is likely because we observe fewer properties with more than four post-treatment periods (for example, a property that adopted the treatment in February 2017 contributes to the estimation of the coefficients for POST(k)*TREAT with k = 0–2 only).
Note that to conclusively rule out or separate any potential effect of Airbnb’s ranking system that could be
confounded with verified photo adoption, we would like to know whether Airbnb’s ranking system considers
property images and, if so, how the inclusion of images affects the ranking algorithm. Unfortunately, we do not
have information about whether and how (if at all) Airbnb increases the ranking of treated properties, as the
algorithm is proprietary. In addition, if the data allow, we would like to use more granular information for
investigating whether Airbnb ranks properties with verified photos higher immediately after the adoption of
verified photos. For example, using the available data, one could examine whether a property received more
One concern regarding the self-selection issue is that properties with particular amenities may be more likely to
adopt the treatment. For example, if some amenities make the properties more attractive in particular seasons
(e.g., a pool or beach in summer), and the hosts adopt the treatment at that time, then some of the increase in
demand could be brought about by those attractive amenities. The effects of these amenities on demand (which
are fixed characteristics of the property) cannot be fully accounted for by the property’s fixed effect terms, as
amenities’ effects may be time varying (e.g., a pool has a greater effect in summer than in winter).
To address this concern, when estimating the PSs for PSW, we obtained additional data on the properties’
amenities and incorporated amenity information (e.g., AC, pool, whether the property is close to a beach) to
account for the possible factors that may be correlated with both property demand and the hosts’ treatment
adoption decision in particular seasons but that cannot be captured by property fixed effects. In addition, in the
model specification, we added interaction terms for the dummy AFTER and meaningful amenities (e.g., pool,
beach, AC) to account for the higher effect of these time-invariant variables after the treatment or in particular
seasons.
In Table A13, we present the estimation results. Column (1) presents the results of Equation (2a), which is
our main specification, and column (2) presents the results of Equation (2b), in which we incorporated the
interaction of AFTER with meaningful amenities. The consistency of the estimated results confirms a positive
and significant treatment coefficient of more than an 8% increase in the property occupancy rate after controlling
for area-specific seasonality, as well as the time-varying effect of particular time-invariant property amenities. In
the panel “Interacting with Meaningful Property Amenities,” we present the coefficients of the interaction terms.
As can be seen, all coefficients are insignificant.
5) Testing Changes in the Property or the Host’s Unobserved Quality (via Multidimensional Ratings)
It is possible that properties or hosts self-select the adoption of professional Airbnb images when there are
substantial changes in the quality of the experience delivered to guests. Such changes could include an upgrade
of the house/room or the delivery of friendlier or better services to guests. Such changes would introduce an upward bias in the estimated coefficient, as they could coincide with the treatment adoption. Although we could not observe and control for everything, we adopted several measures to help control for or alleviate this issue.
First, in the demand model, we added measures of host responsiveness (host response rate and host response time) to address the concern that hosts are more responsive in the post-treatment period.
Second, in the demand model, we added a complete set of multidimensional guest ratings to capture and
account for any potential changes in the stay experience or hosting quality.
Third, as we show below, for the properties for which ratings are available, we compared the average ratings
in terms of multidimensional aspects for a few periods before and after the treatment. The goal is to examine
whether there are substantial changes in the property characteristics that were unobserved by us but captured in
guests’ ratings. To implement this robustness test, we estimated a regression whose specification is the same as that of our main demand model but whose dependent variable is replaced with one of the following guest ratings, each capturing, to some extent, how good a stay or host was: communication, accuracy, cleanliness, and check-in. The cleanliness rating captures potential changes in the property, whereas the communication rating captures the quality of communication between a host and a guest. The coefficient α on the treatment indicator captures changes in ratings along a particular dimension after treatment adoption.
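A minimal statsmodels sketch of this test is shown below, with hypothetical column names (TREATIND, property and period identifiers, and a psw_weight column for the PSW weights); the paper's actual specification includes the full set of controls from the main demand model:

```python
import statsmodels.formula.api as smf

def rating_regression(panel, rating):
    """Regress one guest-rating dimension on the treatment indicator,
    with property and period fixed effects and PSW sample weights."""
    formula = f"{rating} ~ TREATIND + C(property_id) + C(year_month)"
    return smf.wls(formula, data=panel, weights=panel["psw_weight"]).fit()

# e.g., rating_regression(panel, "cleanliness").params["TREATIND"]
```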
As shown in Table A14, the coefficients of TREATIND are insignificant across all specifications, suggesting
that for the treated units, there was no substantial change in the stay or hosting quality delivered to guests before
and after adopting verified photos.
Table A15. Alternative DiD Model Specification: Regressing Property Demand on Verified Photos
17 Note that although the effect size was 11% lower (compared with the value of 8.985 that we obtained from the main DiD regression), the estimated treatment effect was very close and remained at the same level. For comparison, Narang and Shankar (2019) reported that the estimated effects from their robustness tests were 16%–35% lower than their main results.
Table A17 presents the estimation results. As can be seen, we obtained consistent results (the estimated coefficient of TREATIND is 6.77, significant at the 0.05 level) when using only the pre-treated units as the control.18 This robustness analysis, along with Table A8, suggests that our PSW combined with DiD can mitigate the self-selection issue in verified photo adoption.
18 Note that despite the effect size being 24% lower, the estimated effect was very close and remained at the same level (e.g., Narang and Shankar [2019] reported that the estimated effects from this type of robustness analysis were 16%–35% lower than their main results across multiple DVs of interest).
Table A18. Regressing Property Price on the Treatment Indicator: DiD Model
19 See https://blog.atairbnb.com/smart-pricing/.
As can be seen in Table A19, the estimated coefficient of TREATIND (b = 7.271, p < 0.001) is consistent with our main results.20 In addition, the estimated coefficient of REVIEW_PROBABILITY is insignificant at the 0.05 level.
20 Note that this regression is estimated on the observations for which review probability can be defined and computed, that is, property-months with at least one booking. By including review probability in the model, we automatically excluded observations with zero bookings (demand of 0) in a month, which are more likely for untreated properties than for treated ones. As a result, the coefficient of TREATIND in this analysis is a conservative estimate.
Table A20. Robustness Test: Changes in the Number of Open Days with Verified Photo Adoption
We started with over 13,000 unique Airbnb properties in seven US cities, collecting images from the property
pages in each month from January 2016 to April 2017. Of the full sample, 4,932 properties were pre-treated (i.e.,
had verified photos prior to our observation in January 2016). For the analyses presented in this study, we
removed these properties from our sample for the sake of conceptual clarity (in an additional robustness test, we
included these units in the control group and obtained consistent results; see Section V.7 in this Appendix for detailed results and discussion). The remaining properties totaled 8,858 (13,790 – 4,932). As described in the
main paper, we removed missing data points and properties for which there were web page errors during data
collection. Errors may have occurred because Airbnb blocked our data collection requests, and/or hosts unlisted
their properties from Airbnb temporarily or permanently.
Next, using the valid units, we performed a propensity score analysis to match the treated and untreated units based on a set of observed covariates. Of the 7,825 valid properties identified during the observational window, there were 231 treated units and 7,594 untreated units before PSW. To find properties in the two groups that were close to or matched with each other in terms of property and host characteristics in the pre-treatment period, we first estimated the propensity score as a logit function of a set of property, host, and neighborhood covariates measured at the start of the observation window. We then computed a sample weight for each unit based on its estimated propensity score. Specifically, we calculated the weight as $w_i = 1/\hat{e}_i$ if $i$ was a treated unit and as $w_i = 1/(1-\hat{e}_i)$ if $i$ was an untreated unit, where $\hat{e}_i$ denotes unit $i$’s estimated propensity score. As we described in the paper, this is a widely used PSW approach called inverse probability treatment weighting. Later, we used these sample weights in the DiD regressions.
Furthermore, to address the concern that a very low estimated probability can result in extremely large (and
possibly unstable) weights, we followed common practice in PSW and used trimmed weights, excluding those
that were outside 1% and 99% of the distribution (Austin and Stuart 2015; Cole and Hernan 2008). The PSW
generated a sample that included 7,211 of the 7,594 untreated units and 212 of the 231 treated units. This matched
sample was used in the main regressions (Tables 2, 3, 4, and 7 in the main paper). The total number of
observations used in the analyses was 76,901.21
21 Note that the total number of observations in the regressions was smaller than the number of units × the number of periods because, for properties that had no open days in a month, the dependent variable, property demand (occupancy rate), would be undefined.
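The weighting and trimming steps can be sketched with scikit-learn as follows; the covariate matrix and treatment indicator names are hypothetical, and the paper's logit specification includes interactions not shown here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iptw_weights(X, treated):
    """Estimate propensity scores with a logit model, form inverse
    probability of treatment weights, and flag extreme weights to trim."""
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    w = np.where(treated == 1, 1.0 / ps, 1.0 / (1.0 - ps))
    lo, hi = np.percentile(w, [1, 99])   # trim outside the 1st-99th percentiles
    keep = (w >= lo) & (w <= hi)
    return w, keep
```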
As an example, we show what an increase of 1 SD may look like visually in Figure A1. The figure shows the
original photos and a 1 SD increase for two example attributes—brightness and saturation.
Figure A1. Original photos and photos with a 1 SD increase in brightness and in saturation.