Serving Automated Home
Valuation with Redis & Kafka
Jiaqi Wang
jiaqi.wang@redfin.com
Redfin
Tech powered brokerage
RedisConf18 - Serving Automated Home Valuation with Redis & Kafka
Redis at Redfin
• Since 2016
• Open-source version
• LRU cache
• Rate limiting
• …
01 Feature Overview
02 Design Iterations
03 Deep Dive
04 Next Steps
Agenda
Feature Overview
Redfin Estimate
A calculation of the market value of an
individual home
Automated Home Valuation
Design Iterations
AWS
Before…
property_id,
listing_id,
other metadata
Unclear ownership
Complex API that...
• Prompts misuse
• Leads to inconsistent data
Before…
First Step
property_id AWS
Goals
• Performance
Goals
• Network round trip, DB calls
• Performance
• AWS cost
Goals
• Cost on Amazon API Gateway
• Performance
• AWS cost
• Data consistency/accuracy
Goals
• All products return the same estimate value for a given
home at a given time
• Near-realtime data
pid → estimate value
AWS
Proposal
¯_(ツ)_/¯
• Performance
• AWS cost
• Data consistency/accuracy
Goals
pid → estimate value
AWS
Iteration I
• Access level
Business Logic
• Visibility compliance
• Local MLS rules
• Seller overrides
pid → estimate value,
access level
AWS
Iteration II
¯_(ツ)_/¯
• Performance
• AWS cost
• Data consistency/accuracy
Goals
Volatile Housing Market
• Caching for too long leads
to stale and inaccurate data
• Caching for too short
reduces perf/cost benefit
Volatile Estimate
• Active properties
• Off-market properties
• For sale
• More activities
• Update more often
• Not for sale
• Less activities
• Update less often
Active vs. Off-Market
Active properties: TTL 5 mins
Off-market properties: TTL 4 hrs
pid → estimate value,
access level,
with TTL
AWS
Iteration III
Before
• hit rate: N/A
• p90: 102ms
• median: 45ms
After
• hit rate: 53%
• p90: 73ms
• median: 8ms
Results
(Single Get)
Roll out to 100%
Response Time Drops
Roll out to 100%
45% Fewer Estimate API Calls
• No significant perf gain
• Similar amount of Estimate API calls
It Turns Out…
¯_(ツ)_/¯
~55% hit rate
It Turns Out…
Increase hit rate to eliminate the sync call
Solution
• Prefetch estimate values on the
neighboring tiles
• Populate the cache ahead of time
Cache Warmup
async, near-realtime
Cache Warmup
AWS
Iteration III
Iteration IV
Deep Dive
Cache Warmup Pipeline
• Distributed streaming framework
• Similar to Storm, Spark, Flink, etc.
• Redfin adopted Kafka in 2015
Kafka
• Distributed stream processing framework
• Scalable near-realtime event streaming and
data processing
• Uses Kafka for messaging
• Uses YARN for fault tolerance, resource
management, etc.
Samza
• Fault-tolerant local state
• Ordered, partitioned, replayable streams
• Processor isolation, security, fault tolerance provided by YARN
• Decoupled jobs
Stand-out Features
• 70+ Samza apps
• Tracking market activity
• Sending tour notifications
• Sending listing updates
• Cache warmup
• ...
Kafka & Samza
at Redfin
Architecture
Forecaster
Request filter
Data fetcher
Data writer
Cache Warmup
Identify nearby properties
Forecaster
Validate requests
Request Filter
Batch fetch from API gateway
Data Fetcher
Populate the cache
Data Writer
Standalone Samza apps for max
horizontal scalability
Cache Warmup
Before
• hit rate: 55%
• p90: 314ms
• median: 58ms
After
• hit rate: 80%
• p90: 29ms
• median: 3ms
Results
(Multi Get)
Roll out to 50%
Roll out to 100%
Response Time Drops
40% Fewer Estimate API Calls
Cache Invalidation
Lessons
Lessons
System Couplings
• Slow Kafka broker, slow server response time
Lessons
Monitoring
Next Steps
What’s Next
• System decoupling
• Envoy: https://github.com/envoyproxy/envoy
• Increase hit rate even further
• Unsharded to sharded with proxy
• Experiment with TTL
• Cache more data
• Further reduce AWS cost
Thank You

Editor's Notes

  • #2: Self intro: I'm an engineer on the Owner Engagement team at Redfin.
  • #3: Redfin intro: a technology-powered real estate brokerage. We not only serve home information as a platform, but also employ real estate agents to provide professional home buying and selling services.
  • #4: Stats: every month, over 22 million users visit Redfin to check out homes across the country, find out how much their home is worth, keep up with neighborhood trends, and get in touch with a real estate professional. Owner Engagement focuses primarily on the homeowner experience, providing tools that help owners better understand their home's value and keep their home information up to date.
  • #5: Redis at Redfin: I'll first briefly touch on how Redfin uses Redis. We run the open-source version, with two major use cases: LRU cache and rate limiting. We hope to open it up to more use cases soon. This talk focuses on one specific use case of Redis at Redfin and walks through our thought process in designing the architecture around it, more specifically a cache warmup pipeline powered by Kafka and Samza. Hopefully you will find it useful.
  • #6: Agenda
  • #8: Leading into AVM: coming back to automated home valuation, or what we call the Redfin Estimate. What is it? The Redfin Estimate is a calculation of the market value of an individual home, or what we think your home is worth. We've commissioned analysis showing that the Redfin Estimate is the most accurate on the market.
  • #9: If you are a Redfin user, you have probably seen it on a listing details page before, either on the web (screenshot: desktop web)
  • #10: or in the mobile app (screenshot: mobile app).
  • #11: We also use it extensively in email, in internal tools for agents, and in a fairly new experiment that we are running in a few markets called Redfin Now. Redfin Now tries to buy homes directly from customers, and the amount we pay for a home largely depends on its Redfin Estimate value.
  • #12: One more place you will see Redfin Estimate values is when you zoom in on the map page. I personally like this feature a lot because it gives you a good idea of the overall home prices in a neighborhood, not because I want to know if my neighbor's home is worth more. :grin: Unlike the other use cases, this one presents some unique technical challenges, which I will dive into later in this talk. (Screenshot: AVM on map.)
  • #14: Coming back: this was our tech stack before a cache layer was introduced. On the right side we have an Amazon API Gateway set up on AWS serving raw estimate data. Each feature is responsible for fetching the estimate on its own.
  • #15: This original setup has several issues. Due to the complex nature of the estimate data, and the fact that a property can be represented in many different forms, the estimate API takes multiple parameters. It can return a different estimate even for the same home, depending on the parameters passed in. This inevitably led to misuse by other services when they provided undesired inputs, which in turn caused inconsistent data across products. For example, an email and XDP could display different values. It also introduced unclear ownership between feature teams: who is going to be responsible for the cache layer? Does each team add its own implementation? That means duplicated effort and painful debugging.
  • #16: Before adding the cache layer, we unified the code path and created the Estimate Service to take on that responsibility: provide a simple API and define clear ownership. Then we rolled up our sleeves and resumed the caching work.
  • #17: Spend a minute on our goals here. Why? It's good software engineering practice: without a clear goal, you may not be spending your time solving the right problem. Perf: network round trips and DB calls to fetch metadata. AWS cost: API Gateway is not free; it charges by the number of API calls and the amount of data transferred out. Data consistency matters a lot to us, in order to provide a good UX.
  • #18: Spend a minute on our goals here. Why? It's good software engineering practice: without a clear goal, you may not be spending your time solving the right problem. Perf: cut down network round trips and DB calls to fetch metadata. AWS cost: API Gateway is not free; it charges by the number of API calls and the amount of data transferred out. Data consistency matters a lot to us, in order to provide a good UX.
  • #19: Spend a minute on our goals here. Why? It's good software engineering practice: without a clear goal, you may not be spending your time solving the right problem. Perf: network round trips and DB calls to fetch metadata. AWS cost: Amazon API Gateway is not free; it charges by the number of API calls and the amount of data transferred out. Data consistency matters a lot to us, in order to provide a good UX.
  • #20: Spend a minute on our goals here. Why? It's good software engineering practice: without a clear goal, you may not be spending your time solving the right problem. Perf: network round trips and DB calls to fetch metadata. AWS cost: API Gateway is not free; it charges by the number of API calls and the amount of data transferred out. Data consistency matters a lot to us, in order to provide a good UX: all products should return the same estimate value for a given home at a given time.
  • #21: First proposal. To simplify the diagram, the intermediate services are consolidated into the webserver. We naturally fit a Redis cache between the Estimate Service and API Gateway. It stores pairs of property ID and corresponding estimate value. Now every time someone requests estimate data, the Estimate Service performs the expensive fetch only on a cache miss, and the result is cached in Redis. So far so good. Caching is that simple… right?
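The read path described in this note is the usual cache-aside pattern. A minimal sketch, assuming a plain dict in place of Redis and a made-up `fake_estimate_api` function in place of the Estimate API on AWS (both are illustrative stand-ins, not Redfin's code):

```python
from typing import Callable, Dict


class EstimateCache:
    """Cache-aside lookup: check the cache first, fetch and store on a miss."""

    def __init__(self, fetch_from_api: Callable[[int], float]) -> None:
        # A plain dict stands in for Redis so the sketch runs anywhere.
        self._store: Dict[int, float] = {}
        self._fetch_from_api = fetch_from_api
        self.api_calls = 0  # counts the expensive fetches

    def get_estimate(self, property_id: int) -> float:
        if property_id in self._store:
            return self._store[property_id]  # cache hit, no API call
        self.api_calls += 1                  # cache miss: expensive fetch
        value = self._fetch_from_api(property_id)
        self._store[property_id] = value
        return value


def fake_estimate_api(property_id: int) -> float:
    # Hypothetical stand-in for the Estimate API behind Amazon API Gateway.
    return 500_000.0 + property_id


cache = EstimateCache(fake_estimate_api)
first = cache.get_estimate(42)   # miss: goes to the API
second = cache.get_estimate(42)  # hit: served from the cache
```

Only the first lookup for a property pays for the API call; later lookups are served from the cache until the entry is removed.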
  • #22: How hard could it be to cache some values :shrug:
  • #23: Remember the goals we talked about earlier: the first one is to improve perf. Now that we've eliminated the network round trip to API Gateway, are we done?
  • #24: You'll notice that DB calls are still being made. What are they doing? Can we eliminate them?
  • #25: We took a look, and it turns out we had saved a bunch of “legal” rules in the database governing whether or not we can show the estimate value. Some of these come from the local MLS (multiple listing service), the databases that real estate agents use to list properties. Others come from the listing agent's settings. In the end, it all boils down to a value called access level, which controls the accessibility of a piece of data based on the user's role: unregistered, registered, email verified, or agent.
  • #26: So we put the access level in the cache as well and eliminated the DB call. So far so good. Caching is that simple… right??
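One way to picture the cached entry after this iteration is a value paired with the access level needed to see it. The sketch below assumes the four roles form an ordered ladder (unregistered < registered < email verified < agent); the talk lists the roles but does not state how they are compared, so the ordering and all names here are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class AccessLevel(IntEnum):
    # Assumed ordering; the source only names the four roles.
    UNREGISTERED = 0
    REGISTERED = 1
    EMAIL_VERIFIED = 2
    AGENT = 3


@dataclass(frozen=True)
class CachedEstimate:
    value: float
    access_level: AccessLevel  # minimum role allowed to see the value


def visible_value(entry: CachedEstimate, viewer: AccessLevel) -> Optional[float]:
    """Return the estimate if the viewer's role is high enough, else None."""
    return entry.value if viewer >= entry.access_level else None


entry = CachedEstimate(value=750_000.0, access_level=AccessLevel.REGISTERED)
hidden = visible_value(entry, AccessLevel.UNREGISTERED)
shown = visible_value(entry, AccessLevel.AGENT)
```

Because the access level travels with the value, the visibility check needs no database call at read time.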
  • #27: How hard could it be to cache some values :shrug:
  • #28: Before popping open the champagne to celebrate, let's double check our goals. There's a line item called consistency. What does that mean? :thinking_face: Have we achieved it? Some of you might have already thought about it: it has something to do with expiration.
  • #29: It turns out the housing market is pretty volatile. Home prices change frequently and dramatically. If you've gone through a home buying or selling process in the last couple of years, you know very well what that means.
  • #30: What does this mean for us? Say the housing price fluctuates every hour and we cache it for two hours. What happens? We end up with stale data. Can we simply cache everything for a short amount of time? We can, but we'd be making unnecessary calls to API Gateway, which we already know costs us a fortune.
  • #31: How can we strike the perfect balance between the cost and benefit?
  • #32: Well, we addressed this by treating data with different characteristics differently. We split the data into two groups: active properties that are for sale, and off-market properties that are not. For-sale listings have more activity and their estimates update more often, so we give them a much shorter TTL. Off-market properties don't have much going on and their estimates don't update as often, so it's fine for them to live in the cache longer (this data is public). In fact, our estimate algorithm updates the estimate for active properties at least once a day, and even more often for newer listings. For off-market ones, it updates only once a week.
  • #33: We ended up with this: caching estimates for 5 minutes for active properties and 4 hours for off-market properties. One question people often ask: if the estimate value only updates once a week, why cache it for only 4 hours instead of a week? Cache expiration is a broad topic, and the answer varies case by case. If you know exactly when the data will be updated, you can cache it for as long as you want, until you're notified that new data has arrived. In our case, our side of the system assumes it doesn't know exactly when the update happens. So to be on the cautious side, we experimented with a shorter TTL to make sure the data isn't too stale. You never know whether new data will arrive right after you cache the old data, right?
  • #34: And here comes a handy command in Redis: SETEX, short for "set with expire". It sets the key/value and makes the key expire after a given number of seconds. It combines SET and EXPIRE into one atomic operation, and it is a very common operation when using Redis as a cache.
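A sketch of how the two TTLs and SETEX might fit together. So that it runs without a server, `InMemoryRedis` is a tiny stand-in that takes `setex(name, seconds, value)` in the same positional order as redis-py's `setex`; the `estimate:<id>` key format and the JSON payload are assumptions for illustration only.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple

ACTIVE_TTL_SECONDS = 5 * 60            # active (for-sale) properties: 5 minutes
OFF_MARKET_TTL_SECONDS = 4 * 60 * 60   # off-market properties: 4 hours


class InMemoryRedis:
    """Stand-in for a Redis client: values expire once their TTL has passed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def setex(self, name: str, seconds: int, value: str) -> bool:
        # Store the value together with its expiry time in one step.
        self._data[name] = (value, self._clock() + seconds)
        return True

    def get(self, name: str) -> Optional[str]:
        item = self._data.get(name)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[name]  # expired: behave like a missing key
            return None
        return value


def cache_estimate(client, property_id: int, value: float,
                   access_level: str, is_active: bool) -> int:
    """Write one estimate with a TTL chosen by listing status; return the TTL."""
    ttl = ACTIVE_TTL_SECONDS if is_active else OFF_MARKET_TTL_SECONDS
    payload = json.dumps({"value": value, "access_level": access_level})
    client.setex(f"estimate:{property_id}", ttl, payload)  # hypothetical key format
    return ttl
```

With a real redis-py client the `cache_estimate` function should work unchanged, since it only relies on `setex` with that argument order.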
  • #35: With all goals covered, we rolled it out to 100% and, unsurprisingly, observed the expected drop in response time, with an average hit rate of 53%.
  • #36: A graph illustrating the response time
  • #37: A graph illustrating the drop in number of single calls to Estimate API
  • #38: Alright, so far the setup works pretty well for most use cases, where we only need estimate data for a single property, such as this one. What about situations where we need multiple estimate values at once?
  • #39: Such as this. As I mentioned earlier, when you zoom in far enough on the map, you see the estimate values for homes in your viewport that are not currently on the market.
  • #40: And it turns out that we didn’t achieve much performance gain
  • #41: Wuuuuut? :shrug:
  • #42: We observed that on the map we achieve an average hit rate of 55%, even slightly higher than the single-fetch case. However, hit rate doesn't matter as much when fetching multiple values at a time, at least when it's only in the 50s. Say we fetch 100 estimate values at once and 55 come from the cache; for the remaining 45, we still need to contact AWS. Even though part of the call is faster thanks to the cache, the call as a whole doesn't improve much because we haven't eliminated the expensive part, the network round trip. So as long as we still make the network call as part of the map request, we won't see much improvement, and we won't save much on AWS cost.
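The 100-value example can be sketched as a multi-get with a batched fallback. The dict cache and the `batch_fetch` callback are illustrative stand-ins; the point is only that a single missing ID still triggers the synchronous round trip.

```python
from typing import Callable, Dict, List


def multi_get_estimates(
    cache: Dict[int, float],
    property_ids: List[int],
    batch_fetch: Callable[[List[int]], Dict[int, float]],
) -> Dict[int, float]:
    """Serve what the cache has, then fetch every miss in one remote call."""
    found = {pid: cache[pid] for pid in property_ids if pid in cache}
    missing = [pid for pid in property_ids if pid not in cache]
    if missing:
        # One synchronous round trip, however few IDs are missing.
        fetched = batch_fetch(missing)
        cache.update(fetched)
        found.update(fetched)
    return found


remote_calls: List[List[int]] = []


def fake_batch_fetch(ids: List[int]) -> Dict[int, float]:
    # Hypothetical stand-in for the batched Estimate API call.
    remote_calls.append(list(ids))
    return {pid: 1.0 for pid in ids}


cache = {pid: 1.0 for pid in range(55)}  # 55 of 100 already cached
result = multi_get_estimates(cache, list(range(100)), fake_batch_fetch)
```

At a 55% hit rate the request still makes one remote call for the 45 misses. Only when every ID is already cached does the call disappear, which is the situation cache warmup aims for.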
  • #43: We then asked ourselves: is it possible to eliminate the synchronous call altogether? The answer is yes, if the hit rate is high enough.
  • #44: So we introduced cache warmup. Here's how it works: given the map area the user is looking at (what we call the viewport), prefetch estimate values for the neighboring tiles and populate the cache ahead of time.
  • #45: When the user is browsing homes in this box, or the viewport
  • #46: Which is the orange tile in the center at a more zoomed out level We identify all neighboring tiles of the orange tile
  • #47: And find all properties that are not for sale on these neighboring tiles
  • #48: Prefetch the estimate values for these homes and populate them into the Redis cache in a non-blocking background process, one that is supposed to happen really fast.
  • #49: The expectation is that when the user pans around the map and enters the neighboring tiles, the estimate values on these tiles will ideally already be in the cache. There'll be no need to fetch synchronously from Amazon API Gateway and the DB. It sounds pretty straightforward, so what does the system look like?
  • #50: Remember, this is our previous iteration, with a Redis cache sitting between the Estimate Service and the Estimate API on AWS. How hard could it be to add a background cache warmup pipeline? Probably just a couple of boxes around it and fill in the details, right?
  • #51: This is the final architecture we came up with. It consists of a cache warmup pipeline in the middle and a couple of invalidation pipelines at the bottom. Not far from the previous iteration, right? :grin:
  • #52: Exactly like how you draw an owl
  • #53: Putting the owl-ful jokes aside, deep dive time
  • #54: We built cache warmup with Kafka and Samza. We chose stream processing because the problem fits the paradigm well: given a sequence of data (what we call a stream), a series of operations is applied to each element in the stream. It needs to be near-realtime and works on messages one by one. More importantly, a stream processing pipeline built on top of Kafka and Samza is durable and scalable.
  • #55: For those of you who aren't familiar with these: Kafka is a distributed streaming framework, similar to Storm, Spark, Flink, etc. It was originally developed at LinkedIn and later open-sourced. Redfin adopted Kafka in its earlier days, back in 2015, to scale up our fast-growing notification needs. At the time it mainly served as a low-latency messaging system, and that's where Samza comes into the picture.
  • #56: Samza is an open-source distributed stream processing framework. It uses Kafka for messaging and Hadoop YARN for fault tolerance, resource management, processor isolation, security, etc. It provides scalable near-realtime event streaming and data processing.
  • #57: Without going into too much detail, here’s a graph illustrating the relationship between Kafka and Samza Kafka: buffer between Samza apps
  • #58: We were evaluating multiple similar stream processing products at the time, and a couple of characteristics made us choose Samza over the others. Differentiators: It supports local storage out of the box. Because the state itself is modeled as a stream, after a machine failure the state stream can be replayed to restore the state as it was before the crash. Streams are ordered, partitioned, and replayable, which provides scalability and durability. It takes advantage of YARN for processor isolation, security, and fault tolerance, and YARN provides a distributed environment for Samza containers to run in. All jobs are decoupled: if one job experiences issues, the rest of the system is unaffected.
  • #59: As of today, Redfin has more than 70 Samza apps running in prod, processing millions of bytes of content per second. Usage ranges from tracking market activity to sending tour notifications and listing updates in a timely fashion. More than 90% of listing updates land in our users' inboxes within 10 minutes of being updated on the MLS, which is at least 10x faster than our competitors. And you all know how valuable it is to learn about new listings faster than other potential home buyers in such a competitive market.
  • #60: This is the final architecture of the estimate service with a caching pipeline. Let’s take a closer look at the big grey box.
  • #61: The caching pipeline consists of four components, each of which is a standalone Samza app.
  • #62: The forecaster identifies properties on the neighboring tiles and sends them to the request filter. More specifically, it takes in a list of property IDs, finds the bounding box of these properties, and identifies the boxes surrounding that box. For each surrounding box, it then fetches all the properties in it and sends them to the downstream app.
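The bounding-box step in the forecaster can be sketched roughly as follows. This is a minimal illustration, not Redfin's implementation: it assumes properties are represented as (x, y) coordinates and that the "surrounding boxes" are the eight same-sized boxes adjacent to the bounding box. The names `bounding_box` and `surrounding_boxes` are hypothetical.

```python
from typing import Iterable, List, Tuple

# A box is (min_x, min_y, max_x, max_y). Hypothetical representation.
Box = Tuple[float, float, float, float]


def bounding_box(points: Iterable[Tuple[float, float]]) -> Box:
    """Smallest axis-aligned box containing every (x, y) point."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def surrounding_boxes(box: Box) -> List[Box]:
    """The eight same-sized boxes adjacent to `box` (assumed tiling)."""
    min_x, min_y, max_x, max_y = box
    width, height = max_x - min_x, max_y - min_y
    return [
        (min_x + dx * width, min_y + dy * height,
         max_x + dx * width, max_y + dy * height)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)  # skip the original box itself
    ]
```

Each surrounding box would then be used to look up the properties inside it, which are passed downstream.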
  • #63: The request filter filters the incoming properties. It checks a couple of things: whether the estimate value for a given property already exists in the cache, in which case there’s no need to fetch it again, and whether the given property is allowed to display an estimate value at all. Remember the “legal” rules stored in the database that I mentioned earlier? This is where they apply.
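As a rough sketch, the two checks in the request filter amount to something like the following. The dict standing in for the Redis cache and the `hidden` set standing in for the database visibility rules are assumptions made for illustration, as is the function name.

```python
from typing import Dict, Iterable, List, Set


def filter_requests(
    property_ids: Iterable[int],
    cache: Dict[int, float],   # stand-in for the Redis cache
    hidden: Set[int],          # stand-in for properties barred from showing an estimate
) -> List[int]:
    """Keep only properties that still need an estimate fetched."""
    return [
        pid
        for pid in property_ids
        if pid not in cache      # already cached: no need to fetch again
        and pid not in hidden    # not allowed to display an estimate at all
    ]
```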
  • #64: Next, the eligible properties are sent to the data fetcher, which batches them together and fetches the estimate data from API Gateway.
  • #65: Finally, the data writer writes this data into the Redis cache.
  • #66: We built four standalone apps so that each app is responsible for one simple, specific task and nothing more. Our goal is to complete the entire flow, from a request entering the pipeline to its data landing in the cache, within 10 seconds. Separating the responsibilities allows us to achieve maximum horizontal scalability.
  • #67: For example, we knew ahead of time that the request filter would have to handle a comparatively larger number of incoming messages than the other apps, because the forecaster fans out multiple messages for every incoming message. So we made it a separate app, gave it 4x the containers, and enabled multi-threading.
  • #68: As another example, the data fetcher is responsible for making requests to AWS. Instead of making a request for every incoming message, we used a Samza feature called local storage to batch the requests together and reduce the number of calls.
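The batching idea in the data fetcher can be illustrated with a small sketch. This is not Samza code: a plain Python list stands in for Samza's local storage, and `fetch_batch` is a hypothetical callable representing one batched call to the estimate API. The batch size is an arbitrary illustrative value.

```python
from typing import Callable, Dict, List


class BatchingFetcher:
    """Buffers property IDs and issues one fetch per full batch."""

    def __init__(
        self,
        fetch_batch: Callable[[List[int]], Dict[int, float]],  # hypothetical API call
        batch_size: int = 50,                                   # illustrative value
    ) -> None:
        self.fetch_batch = fetch_batch
        self.batch_size = batch_size
        self.pending: List[int] = []  # stand-in for Samza local storage

    def process(self, property_id: int) -> Dict[int, float]:
        """Buffer one incoming message; fetch only when the batch is full."""
        self.pending.append(property_id)
        if len(self.pending) < self.batch_size:
            return {}
        return self.flush()

    def flush(self) -> Dict[int, float]:
        """Fetch whatever is buffered, e.g. on a timer, and clear the buffer."""
        if not self.pending:
            return {}
        batch, self.pending = self.pending, []
        return self.fetch_batch(batch)
```

With a batch size of 50, fifty incoming messages result in a single call instead of fifty. A periodic `flush` keeps a partially filled batch from waiting indefinitely.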
  • #69: With the new caching pipeline in place, once the new design was rolled out in prod, we saw good results when fetching multiple estimate values: a significant bump in hit rate, from 55% to 80%, and a large performance gain.
  • #70: The orange line on top is the upper 90th percentile, and the yellow one on the bottom is the median. We also didn’t eliminate the API Gateway call entirely: when the hit rate is low, a sync call is still initiated to ensure a good UX. Even so, the load on API Gateway decreased significantly.
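The cache-first read with a synchronous fallback might look roughly like this. The talk does not say how "low hit rate" is defined, so the `min_hit_rate` threshold here is purely an assumption, as are the function name and the `fetch_batch` callable standing in for the API Gateway call.

```python
from typing import Callable, Dict, List


def get_estimates(
    property_ids: List[int],
    cache: Dict[int, float],                               # stand-in for Redis
    fetch_batch: Callable[[List[int]], Dict[int, float]],  # hypothetical sync API call
    min_hit_rate: float = 0.8,                             # assumed threshold
) -> Dict[int, float]:
    """Serve from cache; fall back to a sync call when too many IDs miss."""
    hits = {pid: cache[pid] for pid in property_ids if pid in cache}
    if not property_ids:
        return hits
    hit_rate = len(hits) / len(property_ids)
    if hit_rate < min_hit_rate:
        missing = [pid for pid in property_ids if pid not in hits]
        hits.update(fetch_batch(missing))  # synchronous fallback
    return hits
```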
  • #71: A graph illustrating the drop in the number of batch calls to the Estimate API.
  • #72: Now that we have a fancy async pipeline running to update the cache, we piggybacked on it and built two additional streams to invalidate the cache when estimate visibility changes.
  • #73: What is the catch?
  • #74: The first thing that bit us: because we send messages to the streaming pipeline within the map request, the webserver is coupled to the streaming infrastructure. It turns out that the supposedly asynchronous process isn’t entirely async; there’s one synchronous link in the whole chain. What happened is that after a recent Kafka upgrade, due to an unexpected bug in the Kafka Python client, some of the nodes in the Kafka cluster (called Kafka brokers) experienced enormous latency in accepting messages. That trickled down and caused the webservers to respond slowly. Fortunately, we were able to shut down the pipeline by flipping a feature toggle. The lesson: build a thorough monitoring and alarming system.
  • #75: Another lesson we learnt is that as your system gets more complicated, it gets harder to detect where an issue is coming from when there’s a fire, so you need to up your game in monitoring. For this feature alone, we built at least 4 dashboards to monitor the individual parts, including the cache itself, the caching pipeline, the estimate service, and the webserver endpoints.
  • #77: Tweak the Kafka producer configs so that requests to the brokers time out faster, to better isolate Kafka issues from the webserver. Explore Envoy to move from unsharded to sharded.
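As an illustration of the producer tweak, here is a hedged config sketch. It assumes the kafka-python client, whose `KafkaProducer` accepts `max_block_ms` and `request_timeout_ms`. The talk does not give the actual settings, so the values and the broker address below are placeholders.

```python
# Illustrative values only; the real settings are not stated in the talk.
PRODUCER_CONFIG = {
    "bootstrap_servers": ["localhost:9092"],  # hypothetical broker address
    "acks": 1,                    # wait for the partition leader only
    "retries": 0,                 # fail fast rather than retry inside the web request
    "max_block_ms": 200,          # cap how long send() may block (library default: 60000)
    "request_timeout_ms": 1000,   # give up on a slow broker sooner (library default: 30000)
}

# With kafka-python installed, the config would be passed like this:
#   from kafka import KafkaProducer
#   producer = KafkaProducer(**PRODUCER_CONFIG)
```

The point of the shorter timeouts is that a slow broker causes a quick, contained failure in the producer instead of holding the web request open.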