Vision-Based Real Estate Price Estimation
A core component of real estate websites like Zillow and Redfin is an automated valuation method (AVM) which es[...]

¹ The Digital House Hunt: Consumer and Market Trends in Real Estate (https://www.nar.realtor/sites/default/files/Study-Digital-House-Hunt-2013-01_1.pdf)
[...]lecting data (June 2016). While the latest reported median error rate of Zestimate is 5.6% (https://www.zillow.com/zestimate/#acc), the approach described in this paper can be used to decrease the error rate.
³ About the Redfin Estimate: www.redfin.com/redfin-estimate

[...] data.

• We release a new dataset of photos and metadata for 9k houses obtained from Zillow. By applying our valuation method to this dataset, we show that it outperforms Zillow’s estimates.

• We release a new, large-scale dataset of 140k interior design photos from Houzz. The images are classified based on their room types: bathroom, bedroom, kitchen, living room, dining room, interior (miscellaneous) and exterior.

• We present a qualitative visualization of real estate photos in which images at similar luxury levels are clustered near one another.

2. Related Work

We now provide an overview of related work, with a focus on automated real estate valuation methods and visual design. We also give a brief overview of machine learning methods and datasets relevant to our approach.

2.1. Automated Valuation Methods

Real estate price estimation plays a significant role in several businesses. Home valuation is required for purchase and sale, transfer, tax assessment, inheritance or estate settlement, investment and financing. The goal of automated valuation methods is to automatically estimate the market value of a house based on its available information. Based on the definition of the International Valuation Standards Committee (IVSC), market value is a representation of value in exchange, or the amount a property would bring if offered for sale in the open market at the date of valuation. A survey of real estate price estimation methods is given in [22]. In this section, we give an overview of these methods. To our knowledge, none of these methods considers the impact of visual features on value estimation.

One of the traditional methods for market valuation is the “comparables” model, a form of k-nearest-neighbors regression. In this model, it is assumed that the value of the property being appraised is closely related to the selling prices of similar properties within the same area. The appraiser must adjust the selling price of each comparable to account for differences between the subject and the comparable. The market value of the subject is inferred from the adjusted sales prices of the comparables. This approach depends heavily on the accuracy and availability of sale transaction data [22].

The problem of price estimation can be viewed as a regression problem in which the dependent variable is the market value of a house and the independent variables are home characteristics like size, age, number of bedrooms, etc. Given the market value and characteristics for a large number of houses, the goal is to obtain a function that relates the metadata of a house to its value. Many works apply regression methods to the problem of real estate price estimation. Linear regression models assume that the market value is a weighted sum of home characteristics. They are not robust to outliers and cannot address non-linearity within the data. Another model that is used for price estimation is the hedonic pricing model. It supposes that the relationship between the price and independent variables is a nonlinear logarithmic relation. The interested reader is referred to [22], [3] and [18] for an overview of regression analysis for price estimation. Other approaches for price estimation include Artificial Neural Networks [19] and fuzzy logic [1].

Zillow and Redfin use their own algorithms for real estate price estimation. Home characteristics, such as square footage, location or the number of bathrooms, are given different weights according to their influence on home sale prices in each specific location over a specific period of time, resulting in a set of valuation rules, or models, that are applied to generate each home’s Zestimate⁴. Redfin, having direct access to Multiple Listing Services (MLSs), the databases that real estate agents use to list properties, provides a more reliable estimation tool⁵. While Zillow and Redfin do not disclose how they compute their estimates, their algorithms are prone to error and do not consider the impact of property photos on the market value of residential properties.

⁴ What is a Zestimate? Zillow’s Home Value Forecast (http://www.zillow.com/zestimate/)
⁵ About the Redfin Estimate: www.redfin.com/redfin-estimate

2.2. Convolutional Neural Networks (ConvNets)

Convolutional neural networks (ConvNets) have achieved state-of-the-art performance on tasks such as image recognition [14, 9, 10, 26], segmentation [17, 5, 29, 31], object detection [7, 6, 25] and generative modeling [8, 24, 11, 13, 23] in the last few years. The recent surge of interest in ConvNets has resulted in new approaches and architectures appearing on arXiv on a weekly basis. The interested reader is referred to [16] for a review of ConvNets and deep learning.

2.3. Scene Understanding

One of the hallmark tasks of computer vision is scene understanding. In scene recognition, the goal is to determine the overall scene category by understanding its global properties. The first benchmark for scene classification was the Scene15 database [15], which contains only 15 scene categories with a few hundred images per class. The Places dataset, a scene-centric database containing more than 7 million images from 476 place categories, is presented in [32]. [30] constructs a new image dataset, named LSUN, which contains around one million labeled images for 10 scene categories and 20 object categories. Table 1 shows relevant categories of the LSUN, Places and Houzz datasets as well as the number of images in each category. Several categories in the Places dataset are subsumed under the term “Exterior”: “apartment building (outdoor)”, “building facade”, “chalet”, “doorway (outdoor)”, “house”, “mansion”, “manufactured home”, “palace”, “residential neighborhood” and “schoolhouse”. Miscellaneous indoor classes such as “stairway” and “entrance hall” are categorized as “Interior (misc.)”⁶.

Table 1: Number of images per room category in different datasets

Category              LSUN        Places    Houzz
Living Room           1,315,802   28,842    971,512
Bedroom               3,033,042   71,033    619,180
Dining Room           657,571     27,669    435,160
Kitchen/Kitchenette   2,212,277   84,054    1,891,946
Bathroom              −           27,990    1,173,365
Exterior              −           25,869    868,383
Interior (misc.)      −           20,000    368,293

⁶ While the Houzz dataset contains millions of images in each category, we download and use 20k images in each category.

2.4. Visual Design

In spite of the importance of visual design and style, they are rarely addressed in the computer vision literature. One of the main challenges in assigning a specific style to an image is that style is hard to define rigorously, as its interpretation can vary from person to person. In our work, we are interested in encoding information relevant to the luxury level of real estate photos.

An approach for predicting the style of images is described in [12]. It defines different types of image styles, and gathers a large-scale dataset of style-annotated photos that encompasses several different aspects of visual style. It also compares different image features for the task of style prediction and shows that features obtained from deep Convolutional Neural Networks (ConvNets) outperform other features. A visual search algorithm to match in-situ images with iconic product images is presented in [2]. It also provides an embedding that can be used for several visual search tasks including searching for products within a category, searching across categories, and searching for instances of a product in scenes. [4] presents a scalable algorithm for learning image similarity that captures both semantic and visual aspects of image similarity. [21] discovers and categorizes learnable visual attributes from a large-scale collection of images, tags and titles of furniture. A computational model of the recognition of real-world scenes that bypasses the segmentation and the processing of individual objects or regions is proposed in [20]. It is based on a low-dimensional representation of the scene, called the Spatial Envelope. It proposes a set of perceptual dimensions that represent the dominant spatial structure of a scene.

3. Our Approach

In order to quantify the impact of visual characteristics on the value of residential properties, we need to encode real estate photos based on the value they add to the market price of a house. This value is tightly correlated with the concept of luxury. Luxurious designs increase the value of a house, while austere ones decrease it. Hence, we focus on the problem of estimating the luxury level of real estate imagery and quantifying it in a way that can be used alongside the metadata to predict the price of residential properties.

3.1. Classifying Photos Based on Room Categories

To make a reasonable comparison, we consider photos of each room type (kitchen, bathroom, etc.) separately. In other words, we expect that comparing rooms of the same type will give us better results than comparing different room categories. Hence, we trained a classifier to categorize pictures based on the categories shown in Table 1. In order to train the classifier, we used data from the Places dataset [32], Houzz and Google image search. Our final dataset contains more than 200k images.

Figure 1: Examples of correctly and incorrectly classified pictures. The first row illustrates images classified correctly, and the second row represents wrongly classified photos.

Using labeled pictures from our dataset, we trained DenseNet [10] for the task of classifying real estate photos into the following categories: bathroom, bedroom, kitchen, living room, dining room, interior (miscellaneous) and exterior. Using this classifier, we achieved an accuracy of 91% on the test set. After collecting a large dataset of real estate photos and metadata from Zillow, we applied the classifier to the images to categorize them based on their room type.
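As an illustration only, the label consolidation described in Section 2.3 and the per-room grouping used in Section 3.1 might be sketched as follows. The function names, the dictionary layout and the treatment of "kitchenette" (taken from the "Kitchen/Kitchenette" row of Table 1) are assumptions made for this sketch; the paper does not specify how the mapping was implemented.

```python
from collections import defaultdict

# The seven room categories used by the classifier (Section 3.1).
ROOM_CATEGORIES = [
    "bathroom", "bedroom", "kitchen", "living room",
    "dining room", "interior (misc.)", "exterior",
]

# Places scene labels that the text folds into broader categories.
PLACES_TO_ROOM = {
    "apartment building (outdoor)": "exterior",
    "building facade": "exterior",
    "chalet": "exterior",
    "doorway (outdoor)": "exterior",
    "house": "exterior",
    "mansion": "exterior",
    "manufactured home": "exterior",
    "palace": "exterior",
    "residential neighborhood": "exterior",
    "schoolhouse": "exterior",
    "stairway": "interior (misc.)",
    "entrance hall": "interior (misc.)",
    # Assumption: inferred from the "Kitchen/Kitchenette" row of Table 1.
    "kitchenette": "kitchen",
}


def to_room_category(scene_label):
    """Map a raw scene label to one of the seven room categories, or None."""
    label = scene_label.strip().lower()
    if label in ROOM_CATEGORIES:
        return label
    return PLACES_TO_ROOM.get(label)


def group_by_room(predictions):
    """Group photo ids by room category.

    `predictions` is an iterable of (photo_id, scene_label) pairs. Labels
    that map to no room category are skipped.
    """
    groups = defaultdict(list)
    for photo_id, scene_label in predictions:
        room = to_room_category(scene_label)
        if room is not None:
            groups[room].append(photo_id)
    return dict(groups)
```

Grouping by room type before any comparison keeps kitchens from being compared with bedrooms, which is the motivation given at the start of Section 3.1.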
Figure 1 shows examples of photos that are classified correctly and incorrectly. As we can observe from this figure, the classifier performs well for typical photos, while it sometimes wrongly classifies empty rooms and rooms which combine elements from different categories.

3.2. Luxury Level Estimation

After classifying the images based on their room categories, we used crowdsourcing for luxury level estimation. Since our goal is to quantify the luxury level of photos, we need to assign a value to each photo to represent its level of luxury. Hence, we used a classification framework to categorize photos based on their luxury level. However, since it was not clear how many classes should be used and which photos should represent each of those classes, we used another crowdsourcing framework to compare images in our dataset according to their luxury level. Using these comparisons, we could obtain a low-dimensional embedding of real estate photos in which images with the same level of luxury are clustered near each other. By inspecting the embedding, we can determine the number of clusters that best represent variations in luxury, and choose the number of classes for the classification framework accordingly. We can also sample images from each cluster to represent different classes in the classification framework.

3.3. Crowdsourcing Framework

We first discuss our crowdsourcing framework for comparing images based on their luxury. Motivated by [28], we presented a grid user interface to crowd workers, with a probe image on the left and 9 gallery images on the right. We asked the workers to select all images on the right that are similar in terms of luxury level to the image on the left. In order to extract meaningful comparisons from each grid, we want it to have images from several different luxury levels. Therefore, for each grid, we need to select images from our dataset uniformly.

Figure 2: Crowdsourcing user interface for comparing photos based on their luxury level. Each probe image on the left is compared with 9 other images, uniformly drawn from the dataset. Using these comparisons, we obtain an embedding of real estate photos based on their luxury level and anchor images which represent different levels of luxury.

The images from Houzz have a ‘budget’ label which determines the cost of each design. There are 4 different budget levels, and photos with a higher level represent more luxurious designs. Houses from Zillow are labeled with their offered price and Zestimates. We expect that houses with a higher price and Zestimate have more luxurious photos, and vice versa. Hence, to uniformly divide our dataset, we divided Zillow houses into 2 classes based on the average value of their offered prices and Zestimates. Moreover, to add images with a low level of luxury to our dataset, we used Google image search. We searched for terms like ‘ugly bedroom,’ ‘ugly kitchen,’ etc. In this way, we obtained photos of ugly and spartan designs which generally decrease the value of a house.

In order to create each crowdsourcing grid, we sampled two images from the 2 classes of Zillow photos (one per class), four images from the 4 ‘budget’ categories of Houzz pictures (one per category), two photos from the Places dataset, and two photos from Google search results, for 10 images in total. Then we selected one random picture as the probe and constructed the grid from the other 9 images. In this way, each grid contains photos of several different luxury levels to help the crowd workers provide meaningful comparisons among the pictures. A schematic of our crowdsourcing user interface is shown in Figure 2. We used Amazon Mechanical Turk (AMT) to collect comparisons on our images.

Using the crowdsourced data, we obtained triplet comparisons based on luxury level for a large-scale dataset of real estate photos. We then used the t-STE algorithm [27] to obtain a two-dimensional embedding of the images. The result is shown in Figure 3. By examining the embedding, we observe that images with similar luxury levels are clustered near one another. This indicates the quality of the crowdsourced data. Each cluster represents a specific luxury level. We selected one anchor image from each cluster to represent photos in that cluster. Based on these representative pictures, we created another crowdsourcing task to rank photos according to their degree of luxury. Figure 4 shows this task for kitchen images. Figure 5 illus[...]

Figure 3: 2D embedding visualization of real estate photos based on their luxury using the t-STE algorithm: (a) Bathroom, (b) Living room. The embedding is obtained using 10,000 triplet comparisons. More luxurious photos are clustered at the center and more austere ones are scattered around.

Figure 4: User interface for classifying real estate photos. Each of the 8 levels of luxury is represented with an anchor image, and the worker is asked to classify the probe image according to its luxury level.

Figure 5: Examples of bedroom photos classified at different luxury levels. Level 1 represents the lowest level of luxury, and level 8 the highest.

Figure 6: The price estimation network. After classifying photos based on their room category, a vector representing luxury is extracted and concatenated with the normalized metadata vector and passed through a regression layer to produce the estimated price. The loss function is then computed as the difference between the estimated price and the purchase price.
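The data flow in the Figure 6 caption can be illustrated with a minimal sketch: a luxury vector is concatenated with a normalized metadata vector and passed through one linear regression layer. Several details here are assumptions rather than the paper's implementation: the z-score normalization, the absolute difference used as the loss, the toy three-level luxury encoding, and all function and parameter names.

```python
def normalize(values, means, stds):
    """Standardize metadata features (z-score).

    Assumption: the caption only says the metadata is normalized; the
    statistics are taken to be precomputed on the training set.
    """
    return [(v - m) / s for v, m, s in zip(values, means, stds)]


def estimate_price(luxury_vec, metadata_vec, weights, bias):
    """Concatenate both vectors and apply a single linear regression layer."""
    features = list(luxury_vec) + list(metadata_vec)
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def absolute_error(estimated_price, purchase_price):
    """Loss as the absolute difference (assumed form of the 'difference')."""
    return abs(estimated_price - purchase_price)


# Toy example with made-up numbers: a one-hot luxury vector over three
# hypothetical levels and two metadata features (square footage, bedrooms).
luxury = [0, 0, 1]
metadata = normalize([2000, 3], means=[1500, 2], stds=[500, 1])
price = estimate_price(luxury, metadata,
                       weights=[0, 10000, 20000, 50000, 30000],
                       bias=200000)
loss = absolute_error(price, purchase_price=310000)
```

In a trained network the weights and bias would be learned by minimizing this loss over many houses; the fixed values above only show how the pieces connect.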