
Common Salary Platform Design Review

Author: Khee-Chin Chua , [email protected]


Author’s team: Jobseeker Content
Design Review Date and Time: Tuesday, March 8, 09:00 – 10:00
Design Review URL: <INSERT THE REVIEW MEETING CALENDAR EVENT URL HERE>
Design Review Ticket: <CREATE A DESIGN REVIEW TICKET (DESIGNREV)>

Attendees
If you plan to attend this design review meeting, please add your name below so that we know which functions and teams will be represented.
[email protected]
[email protected]
● Yiqing Zhu Manager, Taxonomy
● Tuan Phuong Trinh SWE, Jobseeker Content
● Khee-Chin Chua SWE, Jobseeker Content
● Koki Makino SWE, Taxonomy
● Shuiyuan Zhang Senior Manager, Taxonomy
● Filbert Teo SWE, Careers
● Rong Yu SWE, International
● Tom Fitzsimmons Senior Manager, Careers/Education
● Rong Liu SWE, Taxonomy
● Fabien Cortina Manager, Acme Content
● Sourav Kumar Mohanty SWE, Jobseeker Content
● Bin Chen SWE, Acme Content
● Hui Ying Ooi SWE, Localization
● Louis Lai SWE, Careers
● Vlasto Chvojka Senior Director, Jobseeker Content
● Willie Loyd Tandingan SWE, Acme Content
● Craig Kost Principal SWE, JSBE
● Branson Ng Khin Swee SWE, Acme Content
● Matt Swarthout Staff SWE, Waldo
● Michael Werle Technical Fellow, Jobseeker
● Gayathri Veale Senior Cloud Data Engineer, Product Tech Engineering Platform

Problem Space Review


Our mission is to make salary information accessible to all teams at Indeed and Glassdoor.

To do so, we intend to build a common architecture that will allow Salary teams at Indeed and
Glassdoor to store and retrieve salary information.

For more details, please refer to Common Salary Architecture

Overview of Existing Architecture

Goals
● Allow salary teams from both companies to produce facts (salary datapoints)
● Allow teams from either company to produce labels (company information, geolocation
information) for their own datapoints asynchronously
● Allow teams from either company to produce labels (augmented labels) for another team's datapoints asynchronously
● Product teams from either company can consume a single unified stream of enriched
salary datapoints.
Technical Design Review
Projects Affected
● Salary Aggregation Service (SALAGS) - serves salary aggregates on Career Pages and Company Salary pages. Owned by the careers-explorer team
● Salary Aggregation App (Glassdoor Service)
● Salary Estimation (Glassdoor ML)
● Careers-api-service - used on Company Pages Webapp to serve aggregates from
Glassdoor (link) when SALAGS does not have any datapoints
● New applications

Assumptions
● We are using Confluent Cloud’s Kafka as our message bus
● Indeed has ~1 billion salary datapoints across different sources (job descriptions,
reviews, resume surveys) over the last 3 years.
● On average, each salary datapoint has ~5 labels. This will increase when Glassdoor
starts to add labels to Indeed datapoints and vice versa.
○ Initial load: 1 billion salary datapoints, 5 billion labels
○ Daily load: 1 million salary datapoints, 5 million labels
● The architecture design follows the Four Stage Data Pipelines

Terminology
● Producer: Teams that produce salary information to the platform: Indeed-Salary and Glassdoor-Salary
● Integrate: Team that aggregates and merges the salary facts and labels emitted from the various Producer teams and emits them to a single unified Salary datastream/dataset
● Consumer: Product teams that will consume from the Integrate team’s
datastream/datasets
● Fact: Salary datapoint extracted from a source
● Label: Computed/Derived data that adds to the understanding of the Fact.
● Enriched datapoints: The salary datapoint emitted by the Integrate team, which
contains the Fact, Labels from the producers and additional labels computed by the
Integrate team.

Design considerations
The common salary platform needs to support ingesting salary datapoints from different teams.
In the first version of the platform, we want to be able to support Glassdoor and Indeed salary
datapoints.

There were several considerations made in the system design.

Do we allow each team to define their own salary schema(s) or enforce a standard schema?
1. Allow teams to define and maintain their own schema
○ Pros:
■ Producer teams can begin to emit salary datapoints in their own topics
with minimal effort.
○ Cons:
■ The integrate team will need to handle a mapping from each producer
team’s schema into a common schema.
2. Require teams to produce salary datapoints in a common schema
○ Pros:
■ Less effort will be required for the integrate team to onboard new teams
onto the platform since the input schema is the same as the output
schema
○ Cons:
■ Producer teams need to handle the mapping from their own salary
schema to the standard schema.
■ Teams have different labels for their salary datapoints, for example,
Indeed has title normalization while Glassdoor has GOC labels.
3. Have separate Fact and Label topics for each team; Facts share a common schema across producers, while Label schemas are unique to each producer.
○ Pros:
■ Less effort will be required for the integrate team to onboard new teams
onto the platform since the input schema for Facts is the same as the
output schema
■ Different producers can add their own team-specific labels
○ Cons:
■ Producer teams need to handle the mapping from their own salary
schema to the standard schema.

Recommendation: Approach 3 provides us with the most flexibility to support custom labels
from different producer teams, while enforcing a consistent schema to store the essential salary
information.

How do we process/store salary datapoints?


1. Kafka Streams application with infinite retention topics as a state store
○ Pros:
■ We won’t need a separate data store, Confluent Cloud supports infinite
storage and we can make use of topic compaction to retain only the latest
version of messages (unique by their key) in each topic
○ Cons:
■ Kafka Streams consumers will need to maintain their own state, e.g. with the Kafka Streams local state store or a separate storage
■ Kafka Streams applications will need to rebuild their state from the beginning of the Kafka topic when they are first deployed or when there is an update in topology.
2. Kafka Consumer that reads and writes to a persistent datastore such as DynamoDB or
Osiris
○ Pros:
■ No need to rebuild the state store at application start-up, since data is stored in a separate datastore.
○ Cons:
■ Each update to a salary datapoint will consist of a GET and a PUT
■ Each message is limited to 300kb (hard limit)
■ DynamoDB costs $1.25 per million writes and $0.25 per million reads; a full table scan by a consumer will cost around $250.

Recommendation: Kafka Streams application with infinite retention topics as a state store

From a Proof-of-Concept implementation of Approach 1, we believe that using Confluent Cloud's Kafka topics with infinite retention can work for our use-case.

A test-load of the dataset indicates that our initial load of 1 billion salary datapoints will take up
around 40GB in the Fact topic and 320GB in the Label topic.

Based on our subsequent experiments in this document on a virtual machine, we are able to
reprocess 2% of the entire dataset (20 million salary datapoints) in around 12 minutes. A
subsequent test with 10% of the dataset took around x minutes. Re-processing the initial
dataset of a billion salary datapoints should take around 10 hours to complete.

We will use an S3 Sink Connector to take periodic backups of the topics and an S3 Source Connector to restore from backups in the event of a cluster outage.
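As an illustration only, the sketch below shows what the backup sink configuration could look like, expressed as the key/value map that would be submitted to a Kafka Connect worker (or configured through the Confluent Cloud UI). The connector class and format settings are those of the self-managed Confluent S3 Sink Connector; the bucket name, region, topic list and flush size are placeholder assumptions, and the fully-managed Confluent Cloud connector exposes a slightly different configuration surface.

import java.util.Map;

public class S3BackupConnectorConfigSketch {
    // Returns an illustrative configuration for backing up the Fact and Label topics to S3.
    public static Map<String, String> sinkConfig() {
        return Map.of(
                "connector.class", "io.confluent.connect.s3.S3SinkConnector",
                "topics", "indeed.salary.producer.fact.1,indeed.salary.producer.label.1",
                "s3.bucket.name", "<BACKUP_BUCKET>",   // placeholder bucket
                "s3.region", "us-east-1",              // illustrative region
                "storage.class", "io.confluent.connect.s3.storage.S3Storage",
                "format.class", "io.confluent.connect.s3.format.avro.AvroFormat",
                "flush.size", "10000");                // illustrative records per S3 object
    }
}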

How will we provide salary information to consumers?


1. Kafka topic as a datastream
○ Pros:
■ Consumers will incrementally receive new salary datapoints as they are
being produced and when datapoints are updated with labels
■ Ideal for use-cases such as online learning models, product features that
make use of incremental events
○ Cons:
■ Consumers that need batch processing of salary datapoints will need to read the entire topic periodically
2. Snapshot datasets in a S3 repository
○ Pros:
■ Ideal for use-cases that require batch-processing such as daily generation
of artifacts for salary aggregates or offline model training.
○ Cons:
■ Data is refreshed periodically, so there is always an element of staleness.
3. Microservice with a GraphQL endpoint
○ Pros:
■ Supports querying for individual salary datapoints and their enriched fields
○ Cons:
■ Enriched salary datapoints have to be stored in a separate datastore for lookups.
■ Not many active product use-cases for individual salary datapoints

Recommendation: We will start with Approach 1 for the initial implementation as it is a prerequisite for the other two approaches. Approach 2 can be implemented as a Sink connector of the output topic. We will defer Approach 3 for discussion after the initial implementation.
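To illustrate what Approach 1 looks like from a consumer's point of view, here is a minimal sketch (not production code) of a product team reading the unified datastream with a plain Kafka consumer and Confluent's Avro deserializer. The group id, endpoints and printed field are illustrative assumptions; a real consumer would most likely deserialize into the generated Avro classes rather than GenericRecord and commit offsets according to its own delivery semantics.

import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class UnifiedStreamConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<CONFLUENT_BOOTSTRAP_SERVERS>"); // placeholder
        props.put("schema.registry.url", "<SCHEMA_REGISTRY_URL>");       // placeholder
        props.put("group.id", "salary-unified-stream-consumer");         // illustrative group id
        props.put("key.deserializer", KafkaAvroDeserializer.class.getName());
        props.put("value.deserializer", KafkaAvroDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest"); // batch-style consumers re-read the topic from the beginning

        try (KafkaConsumer<GenericRecord, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("indeed.salary.integrate.unified.1"));
            while (true) {
                ConsumerRecords<GenericRecord, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<GenericRecord, GenericRecord> record : records) {
                    // Each value carries the Fact, the producer Labels, Augment labels and enriched fields.
                    GenericRecord enriched = record.value();
                    System.out.println(record.key() + " -> " + enriched.get("salary"));
                }
            }
        }
    }
}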

How do we allow teams to augment another team’s datapoints?


1. Separate Kafka topic for each team to augment datapoints owned by other teams
2. Have a single working Kafka topic shared by both producer teams

Recommendation: Approach 1 provides a clearer separation of ownership and responsibility, since each team maintains their own topic and does not emit to another team's topic.

Data Size Estimates from POC


- The average size of a message in a Fact topic is around 50 bytes. (Key and value)
- The average size of all messages in a Label topic for a single salary datapoint is around
200 bytes. (Key and value)
- The average total size of a raw salary datapoint is around 250 bytes.
- Over a 3-year period, we collected around 1 billion salary datapoints, with around 5
billion labels.
- With a test run of 1 billion salary datapoints, 5 billion labels,
- Fact topic: 44.82G
- Label topic: 209.04G
- Integrate output topic: ~260G (estimated)
Implementation

Architecture Overview

In this presentation, we built a proof of concept of the above architecture to estimate the total
amount of space required based on a 3-year snapshot of salary datapoints from Indeed. We
placed the sample IDLs used for the POC in this Google Doc for our colleagues who may not
have access to our Gitlab instances.

Lifecycle of a Salary Datapoint


Flow of a Salary datapoint being processed by its own team
1. The Collector service will emit salary datapoints to the Fact topic
2. A Compute service will consume from the Integrate Output Topic and emit labels to the Label topic
3. The Integrate team’s service will consume from both Fact and Label topics, enrich the
Salary datapoint and emit it to the Integrate Output Topic

Flow of a Salary datapoint being processed by the other team


1. The Augment service consumes from the Integrate Output Topic for new salary
datapoints emitted by the other team.
a. If there are additional labels that it can add to the datapoint, it will emit a message to the Augment topic
2. The Integrate team’s service will consume from the Fact, Label and Augment topics,
enrich the Salary datapoint and emit it to the Integrate Output Topic
Kafka Topics

Each producer team maintains 3 Kafka topics and their schemas


● Messages in topics are retained indefinitely
● Rely on topic log compaction to remove and clean up messages with duplicate keys in
each topic
● Guidelines for topic naming. Adapted from this article
○ <organization>.<domain>.<description>.<classification>.<version>
■ Organization: indeed, glassdoor
■ Domain for the data: salary
■ Description: producer, integrate
■ Classification: fact, label, augment, unified
■ Version: 1
○ indeed.salary.producer.fact.1
○ indeed.salary.producer.label.1
○ indeed.salary.producer.augment.1
○ glassdoor.salary.producer.fact.1
○ glassdoor.salary.producer.label.1
○ glassdoor.salary.producer.augment.1

The schemas for the Kafka topics can be found in this doc.

Producer Teams

Fact Topic
● This topic stores the raw salary datapoint produced by the team’s Collector
○ Key: A unique identifier for this salary datapoint [Sample IDL]
{'record_type': 'JOB_DESCRIPTION', 'id': 5735510521}
○ Value: A raw salary datapoint. The schema for this salary datapoint should be
consistent across producers. [Sample IDL]
{'base_salary': {'salary_type': 'WEEKLY', 'range': {'min': 385.0, 'max': 675.0}}, 'currency': 'USD',
'created_date': 1546917033000, 'updated_date': 1546917033000}
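To make the Fact topic contract concrete, here is a minimal sketch (not production code) of a Collector emitting the sample datapoint above. It assumes Java classes generated from the Avro IDLs later in this document and Confluent's KafkaAvroSerializer; endpoints are placeholders, and timestamps, error handling and delivery callbacks are omitted for brevity.

import com.indeed.common.kafka.salary.BaseSalary;
import com.indeed.common.kafka.salary.CompositeSalaryValue;
import com.indeed.common.kafka.salary.Currency;
import com.indeed.common.kafka.salary.IndeedSalaryFactKey;
import com.indeed.common.kafka.salary.IndeedSalaryFactValue;
import com.indeed.common.kafka.salary.Range;
import com.indeed.common.kafka.salary.SalaryRecordType;
import com.indeed.common.kafka.salary.SalaryType;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class FactProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<CONFLUENT_BOOTSTRAP_SERVERS>"); // placeholder
        props.put("schema.registry.url", "<SCHEMA_REGISTRY_URL>");       // placeholder
        props.put("key.serializer", KafkaAvroSerializer.class.getName());
        props.put("value.serializer", KafkaAvroSerializer.class.getName());

        // Key: unique identifier for the salary datapoint (record_type + id).
        IndeedSalaryFactKey key = IndeedSalaryFactKey.newBuilder()
                .setRecordType(SalaryRecordType.JOB_DESCRIPTION)
                .setId(5735510521L)
                .build();

        // Value: the raw salary datapoint in the common schema (created/updated dates omitted here).
        IndeedSalaryFactValue value = IndeedSalaryFactValue.newBuilder()
                .setSalary(CompositeSalaryValue.newBuilder()
                        .setBaseSalary(BaseSalary.newBuilder()
                                .setRange(Range.newBuilder().setMin(385.0f).setMax(675.0f).build())
                                .setSalaryType(SalaryType.WEEKLY)
                                .build())
                        .setCurrency(Currency.USD)
                        .build())
                .build();

        try (KafkaProducer<IndeedSalaryFactKey, IndeedSalaryFactValue> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("indeed.salary.producer.fact.1", key, value));
        }
    }
}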

Label Topic
● This topic stores the labels for the salary datapoint produced by Compute processes
● Labels enrich the salary datapoint with additional information such as title normalization,
geocoding, company information
○ Key: A unique identifier for this entry [Sample IDL]
{'record_type': 'JOB_DESCRIPTION', 'id': 5735510521, 'label_type': 'TITLE_NORMALIZATION'}
● Value: Processed Data [Sample IDL]
○ {'key': {'record_type': 'JOB_DESCRIPTION', 'id': 5735510521, 'label_type':
'TITLE_NORMALIZATION'}, 'data': {'jobTitle': 'CDL-A Lease Purchase Truck Driver - *$2500 SIGN ON
BONUS* Ta', 'normTitle': 'truck driver', 'normLanguage': 'en', 'displayTitle': 'truck driver',
'normCategory': 'driver'}}

Augment Topic
- This topic stores the augmented labels for other team’s datapoints.
- Key: This is a composite key that contains the Fact key from the other team
- {'record_source': 'GLASSDOOR', 'record_key': {'record_type': 'REVIEWS', 'id': 12345, 'label_type': 'TITLE_NORMALIZATION'}}
- Value: Processed Data
- {'key': {'record_source': 'GLASSDOOR', 'record_key': {'record_type': 'REVIEWS', 'id': 12345, 'label_type': 'TITLE_NORMALIZATION'}}, 'data': {'jobTitle': 'CDL-A Lease Purchase Truck Driver - *$2500 SIGN ON BONUS* Ta', 'normTitle': 'truck driver', 'normLanguage': 'en', 'displayTitle': 'truck driver', 'normCategory': 'driver'}}

Integrate Team

Unified Output Topic


● The unified data stream contains data from the producers and the enriched data fields that are computed by the Integrate team. For example, a Glassdoor salary datapoint may be mapped to an Indeed fccid (company id), or an Indeed salary datapoint with an Occupation could be mapped to a Glassdoor GOC
○ Topic for unified data stream: indeed.salary.integrate.unified.1
○ Key: A unique identifier for this salary datapoint
■ { "id": { "com.indeed.common.kafka.salary.IndeedCompositeSalaryKey": { "record_type":
"JOB_DESCRIPTION", "id": { "long": 6080038770 } } } }
○ Value: A composite structure with the raw facts and aggregated labels from the
producers.
■ Fact
● Same value as producing team’s Fact topic
■ Labels
● Aggregated list of labels from the producer team’s Label topic
■ Augment
● Aggregated list of augmented labels from the other team’s
Augment topic
■ Enriched Data (M2)
● Fields that are computed by the Integrate team.

Data Lineage
Avro Directory Layout
- The layout of the directory is based on <team>/<topic-name>-<key|value>.avdl
- Shared data structures should be stored under common/
- Team-specific shared data structures should be stored under <team>/common.avdl such
as indeed/common.avdl

Avro IDL

common/common.avdl
@namespace("com.indeed.common.kafka.salary")
protocol CommonSalary {
// Describes a composite salary value
record CompositeSalaryValue {
union{null, BaseSalary} base_salary = null;
union{null, GigSalary} gig_salary = null;
union{null, StartingBonus} starting_bonus = null;
union{null, ReferralBonus} referral_bonus = null;
union{null, array<RecurringBonus>} recurring_bonus = null;
union{null, GigBonus} gig_bonus = null;
union{null, array<Reimbursement>} reimbursement = null;
union{null, array<Stipend>} stipend = null;
union{null, WorkVolume} work_volume = null;
union{null, Currency} currency = null;
// additional information
union{boolean, null} visibility = true;
union{null, timestamp_ms} created_date = null;
union{null, timestamp_ms} updated_date = null;
}
// Time-based pay cycles for base salary.
enum SalaryType {NONE, HOURLY, DAILY, WEEKLY, MONTHLY, YEARLY, BIWEEKLY, QUARTERLY} = NONE;

enum Currency {
NO_CURRENCY, USD, EUR, GBP, INR, AUD, CAD, SGD, CHF, MYR, JPY, CNY, NZD, THB, HUF, AED, HKD, MXN,
ZAR, PHP, SEK,
IDR, SAR, BRL, TRY, KES, KRW, EGP, IQD, NOK, KWD, RUB, DKK, PKR, ILS, PLN, QAR, XAU, OMR,
COP, CLP, TWD,
ARS, CZK, VND, MAD, JOD, BHD, XOF, LKR, UAH, NGN, TND, UGX, RON, BDT, PEN, GEL, XAF, FJD,
VEF, BYN, HRK,
UZS, BGN, DZD, IRR, DOP, ISK, XAG, CRC, SYP, LYD, JMD, MUR, GHS, AOA, UYU, AFN, LBP, XPF,
TTD, TZS, ALL,
XCD, GTQ, NPR, BOB, ZWD, BBD, CUC, LAK, BND, BWP, HNL, PYG, ETB, NAD, PGK, SDG, MOP, NIO,
BMD, KZT, PAB,
BAM, GYD, YER, MGA, KYD, MZN, RSD, SCR, AMD, SBD, AZN, SLL, TOP, BZD, MWK, GMD, BIF, SOS,
HTG, GNF, MVR,
MNT, CDF, STD, TJS, KPW, MMK, LSL, LRD, KGS, GIP, XPT, MDL, CUP, KHR, MKD, VUV, MRO, ANG,
SZL, CVE, SRD,
XPD, SVC, BSD, XDR, RWF, AWG, DJF, BTN, KMF, WST, SPL, ERN, FKP, SHP, JEP, TMT, TVD, IMP,
GGP, ZMW} =
NO_CURRENCY;

// Work-metric based units for gig salary.
enum AlternateCycle {NO_ALTERNATE_CYCLE, GIG, SESSION, LESSON, CLASS, MILE, KILOMETER, CONTACT,
CREDIT_HOUR, // most college courses are 3 semester credit hours = 45-48 contact hours (3hr/week for ~15 weeks)
EVENT, OCCASION, OCCURRENCE, SHIFT, ORDER, REFERRAL, DELIVERY, TRIP, SALE, // general unit for commissions
STUDENT, GAME, CONTACT_HOUR, // number of hours of scheduled instruction
CHILDBIRTH, FLIGHT_HOUR} = NO_ALTERNATE_CYCLE;

enum BonusRewardUnit {UNSPECIFIED, CURRENCY, OF_SALARY, // percentage of base salary, e.g. 1/12 for 13th month
PERCENTAGE, // percentage of other, e.g. with AlternateCycle=SALE means commission
SHARE, SHARE_OPTION, PROFIT_UNIT, REVENUE_UNIT} = UNSPECIFIED;

enum BonusReason {
UNKNOWN_REASON, PERFORMANCE, SAFETY, HAZMAT, PERFECT_ATTENDANCE, FITNESS, MONTH13, // 13th month bonus
TIPS} = UNKNOWN_REASON;

enum AllowanceReason {
UNKNOWN_ALLOWANCE, TUITION, STUDENT_LOAN, COMMUTING, TRAVEL, RELOCATION, REGIONAL, HOUSING,
FAMILY,
CERTIFICATION, RESPONSIBILITY, DEPENDENTS, FOOD, DRIVING} = UNKNOWN_ALLOWANCE;

// General definition of a range.
// Used by multiple feature types.
//
// The value '-1' should be interpreted as a special value that means an open max.
//
// Examples
//
// (10, 20): Well defined range.
// Any value 10 <= x <= 20 is in the range.
//
// (0, 20): Open-ended min.
// Any value 0 <= x <= 20 is in the range; 0 is implicit.
// Can be read as "up to 20".
//
// (10, -1): Open-ended max.
// Any value 10 <= x is in the range.
// Can be read as "at least 10".
//
// (-1, -1): There's no range.
// Any value 0 <= x is in the range; 0 is implicit.
//
// for point values, min=max, e.g. (1000, 1000)
record Range {
float min;
float max;
}

// Feature
record BaseSalary {
Range range;
SalaryType salary_type = "NONE";
}

// Feature
record GigSalary {
Range range;
AlternateCycle alternate_cycle = "NO_ALTERNATE_CYCLE";
}

// Feature
record StartingBonus {
Range range;
}

// Feature
record ReferralBonus {
Range range;
}

// Feature
record RecurringBonus {
Range range;
BonusRewardUnit unit = "UNSPECIFIED";
SalaryType period = "NONE";
BonusReason reason = "UNKNOWN_REASON";
}

// Feature
record GigBonus {
Range range;
BonusRewardUnit unit = "UNSPECIFIED";
AlternateCycle alternate_cycle = "NO_ALTERNATE_CYCLE";
BonusReason reason = "UNKNOWN_REASON";
}

// Feature
record Reimbursement {
Range range;
SalaryType period = "NONE";
AllowanceReason reason = "UNKNOWN_ALLOWANCE";
}

// Feature
record Stipend {
Range range;
SalaryType period = "NONE";
AllowanceReason reason = "UNKNOWN_ALLOWANCE";
}

// Feature
record GigOrPeriod {
AlternateCycle gig = "NO_ALTERNATE_CYCLE";
SalaryType period = "NONE";
}
// Feature
record WorkVolume {
Range range;
GigOrPeriod unit;
GigOrPeriod cycle;
}
}

indeed/common.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IndeedCommonSalary {
enum SalaryRecordType {
NULL, JOB_DESCRIPTION, REVIEW, RESUME_INLINE_WIZ, RESUME_SURVEY_UPLOAD, RESUME_SURVEY,
RESUME_SURVEY_HOMEPROMO,
RESUME_SURVEY_WIZ, VISA_APPLICATION, GLOBAL_SALARY_SURVEY, SALARY_THIRD_PARTY, GLASSDOOR,
SALARY_SURVEY,
PAYCHECK} = NULL;
}

indeed/indeed.salary.producer.fact.1-key.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IndeedSalaryFactKey {
// Kafka topic - Key Schema
import idl "common.avdl";

record IndeedSalaryFactKey {
SalaryRecordType record_type;
union{long, string} id;
}
}

indeed/indeed.salary.producer.fact.1-value.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IndeedSalaryFactValue {
import idl "../common/common.avdl";

// this is the schema for the Fact topic
record IndeedSalaryFactValue {
union{null, CompositeSalaryValue} salary = null;
union{null, IndeedProperties} properties = null;
}

record IndeedProperties {
union{null, string} title = null;
union{null, string} language = null;
union{null, string} country = null;
// job related fields
union{null, JobInformation} job_information = null;
// review related fields
union{null, ReviewInformation} review_information = null;
}

// properties
record JobInformation {
// JobSourceInfo
union{null, int} feed_id = null;
union{null, int} source_id = null;
// JobTypeInfo
union{null, int} jobtype_id = null;
union{null, array<string>} jobtypes = null;
}

record ReviewInformation {
//This is designed to store the information about a review that we would want to sort on
union{null, timestamp_ms} submission_time = null;
union{null, int} language_id = null;
union{null, int} country_id = null;
}
}

indeed/indeed.salary.producer.label.1-key.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IndeedSalaryLabelKey {
import idl "indeed.salary.producer.fact.1-key.avdl";
import idl "common.avdl";

// Salary Labels
// initial version adapted from
// https://sourcegraph.indeed.tech/code.corp.indeed.com/salary/salary-aggregate-index/-/blob/library/src/main/proto/HBaseSchema.proto
enum IndeedSalaryLabelType {
NO_LABEL, JOB_SOURCE_INFO, JOB_TYPE_INFO, TITLE_NORMALIZATION, GEOLOCATION, COMPANY, LANGUAGE,
STATUS_FLAGS,
DELETED_FLAGS, SALARY_OVERRIDE} = NO_LABEL;

// Kafka topic - Key Schema
record IndeedSalaryLabelKey {
union{null, IndeedSalaryFactKey} id;
union{null, IndeedSalaryLabelType} label_type;
}
}

indeed/indeed.salary.producer.label.1-value.avdl

@namespace("com.indeed.common.kafka.salary")
protocol IndeedSalaryLabelValue {
import idl "indeed.salary.producer.fact.1-value.avdl";
import idl "../common/common.avdl";

// column f1:nt
record NormTitle {
union{null, int} norm_status = null;
union{null, string} job_title = null;
union{null, string} norm_title = null;
union{null, string} norm_language = null;
// detected language by title normalization
union{null, string} display_title = null;
union{null, string} base_level_title = null;
union{null, string} base_level_display_title = null;
union{null, string} norm_category = null;
union{null, long} normalization_timestamp = null;
}

// column f1:loc
record Location {
union{null, string} country = null;
union{null, string} region1 = null;
union{null, string} region2 = null;
union{null, string} region3 = null;
union{null, string} region4 = null;
union{null, string} region_name1 = null;
union{null, string} region_name2 = null;
union{null, string} region_name3 = null;
union{null, string} region_name4 = null;
union{null, string} city = null;
union{null, string} postal_code = null;
union{null, string} latitude = null;
union{null, string} longtitude = null;
union{null, string} msa_code = null;
union{null, string} msa_name = null;
}

// column f1:c
record Company {
union{null, int} fccid = null;
union{null, int} legacy_id = null;
union{null, string} companyname = null;
union{null, string} type = null;
union{null, int} canonical_fccid = null;
union{null, string} canonical_companyname = null;
}

// column f1:fl
record StatusFlag {
union{null, boolean} is_overridden = null;
union{null, boolean} is_invalid_salary = null;
}

// column f1:del
record DeletedFlag {
union{null, boolean} is_deleted = null;
union{null, long} deletion_timestamp = null;
union{null, string} deletion_reason = null;
}

// An override for a composite salary.
// CompositeValueOverride
//
// @param fullTakedown True indicates the whole composite salary should be taken
// down; in which case all other fields should be ignored.
//
// @param takedown Set of features to takedown. Valid only when
// fullTakedown=false.
//
// @param modifications Alternative values for features. For each feature that is
// present in this composite salary, it is valid only when
// fullTakedown=false and when 'takedown' does not contain its associated
// FeatureType enum.
//
// An enumeration of known extraction feature types.
enum FeatureType {
BASE_SALARY, GIG_SALARY, STARTING_BONUS, REFERRAL_BONUS, RECURRING_BONUS, GIG_BONUS, STIPEND,
REIMBURSEMENT,
WORK_VOLUME} = BASE_SALARY;

record CompositeValueOverride {
union{boolean, null} full_takedown = false;
union{null, array<FeatureType>} takedown = null;
union{null, CompositeSalaryValue} modifications = null;
}

// Kafka topic - Value Schema
record IndeedSalaryLabelValue {
union{null, NormTitle, Location, Company, StatusFlag, DeletedFlag, CompositeValueOverride}
label_data = null;
}
}

integrate/indeed.salary.integrate.unified-output.1-key.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IntegrateSalaryOutputKey {
// Kafka topic - Key Schema
import idl "../indeed/indeed.salary.producer.fact.1-key.avdl";

record IntegrateSalaryOutputKey {
union{null, IndeedSalaryFactKey, GlassdoorSalaryFactKey} id;
}
}

integrate/indeed.salary.integrate.unified-output.1-value.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IntegrateSalaryOutputValue {
import idl "../indeed/indeed.salary.producer.fact.1-value.avdl";
import idl "../indeed/indeed.salary.producer.label.1-key.avdl";
import idl "../indeed/indeed.salary.producer.label.1-value.avdl";

record IndeedSalaryLabel {
union{null, IndeedSalaryLabelKey} key = null;
union{null, IndeedSalaryLabelValue} value = null;
}

record IndeedSalaryContent {
union{null, IndeedSalaryFactValue} fact = null;
union{null, array<IndeedSalaryLabel>} labels = null;
}

record IntegrateSalaryOutputValue {
union{null, IndeedSalaryContent} raw = null;
// salary structure with a standard defined format
union{null, CompositeSalaryValue} salary = null;
// Indeed specific properties
union{null, IndeedProperties} indeed_properties = null;
/* Labels generated by compute services from Indeed in a map {LABEL_TYPE: LABEL_VALUE}
- {"TITLE_NORMALIZATION": {"norm_title": "xxx", ..., "display_title": "xxx"},
"GEOLOCATION": {"country": "US", "region1": "TX", ..., "city": "Austin"}}
*/
union{null, map<IndeedSalaryLabelValue>} indeed_labels = null;
/*
// Glassdoor specific properties
union{null, GlassdoorProperties} glassdoor_properties;
// Labels generated by compute services from Glassdoor
union{null, map<GlassdoorSalaryLabelValue>} glassdoor_labels;
*/
}
}
Kafka Configuration

Cluster
- Our initial deployment strategy is a dedicated Confluent Cloud cluster with multi-zone
availability.

Topic
- Kafka topic keys and value schemas will be managed with the Confluent Cloud’s
Schema Registry.
- The message format will be in Avro
- Schema validation will be enabled for keys and values in all topics -
(confluent.key.schema.validation and confluent.value.schema.validation)
- The compression codec on all topics is lz4. Based on this study, it appears that lz4 gives
the best compression/decompression rates at the cost of a slightly lower compression
ratio compared to zstd. - (compression.type)
- The cleanup policy for all topics will be “compact” - (cleanup.policy)
- To preserve ordering of messages in a partition, the timestamp for messages in logs will
be set to LogAppendTime. - (message.timestamp.type)
- Fact and Label topics will use a “TopicNameStrategy” for their keys and values
- We will use FULL compatibility type to support schema evolution. This allows us to
perform schema updates without worrying about the order of upgrading clients.
- The number of partitions for each topic is set to 64. This will allow us to scale the number of processing instances for each topic up to 64 instances. The total number of partitions should not exceed 4,500 per broker before replication.
- Producers should use "murmur2_random" as the partitioner. This is the default partitioner for the Java library, but not for the Python libraries. (An example of creating a topic with these settings follows this list.)
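As a hedged example of the settings above, the sketch below creates one Fact topic through the Kafka AdminClient. The bootstrap endpoint and credentials are placeholders, and the replication factor of 3 is an assumption about the dedicated multi-zone cluster; topics could equally be created through the Confluent Cloud UI, CLI or Terraform.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateSalaryTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<CONFLUENT_BOOTSTRAP_SERVERS>"); // placeholder
        // SASL credentials for Confluent Cloud would also be configured here.

        try (AdminClient admin = AdminClient.create(props)) {
            // 64 partitions per the sizing above; replication factor 3 is an assumption.
            NewTopic factTopic = new NewTopic("indeed.salary.producer.fact.1", 64, (short) 3)
                    .configs(Map.of(
                            "cleanup.policy", "compact",              // retain the latest message per key
                            "compression.type", "lz4",                // chosen over zstd for speed
                            "message.timestamp.type", "LogAppendTime",
                            "confluent.key.schema.validation", "true",
                            "confluent.value.schema.validation", "true"));
            admin.createTopics(Collections.singletonList(factTopic)).all().get();
        }
    }
}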

Integrate Team’s Service

The Integrate team maintains the service to process all the messages from both companies’
Kafka topics and output a unified datastream

Approaches
There are 3 approaches we considered to build this service
1. Using Kafka as a queue for messages and persisting the data into a backing storage
such as Dynamodb
2. Using Kafka Streams with infinite retention to store and process the data
3. Using Kafka Streams with infinite retention and Redis as a distributed cache
Approach 1: Kafka as a queue and Dynamodb as the persistent storage

With this approach, Kafka is used as a transient data store where we have consumers to
process the messages from each topic and store them into Dynamodb for persistence.

The Integrate Service logic is straightforward: it fetches the latest state of the salary datapoint from the Processed Table, creating it if it does not exist, then appends the latest label to the value. The sequence of events is as follows
- A new message is sent to the Fact/Label topic
- The Integrate Service identifies the primary key of the message
- Lookup the Processed Table for the latest state of the primary key
- If it does not exist, create a new entry
- Add the message to the current state and perform necessary transformations
- Store the updated state of the salary datapoint into the Processed Table
- Emit a message to the Output Topic
Approach 2: Kafka Streams application with local state-stores

Using this approach, messages are retained indefinitely in the Kafka topics and the data in Kafka becomes the primary source of truth.

The sequence of events in this flow


- A new message is sent to the Fact/Label topic
- The Processing Node will identify the primary key of the message
- Look up the Fact and Label state stores with the primary key to get the latest state of the facts and labels
- Add the message to the current state and perform necessary transformations
- Store the results to the state-stores
- Emit a message to the Output Topic
Approach 3: Kafka Streams and Redis as a distributed cache
With this approach, we rely on Kafka Streams to build the changelog topics for the Fact and Labels state-stores, and rely on Redis as a distributed cache to store the current values for each key.

To do so, we will create a new KeyValueBytesStoreSupplier for the state-stores that will read from and write to a Redis Cluster.

The sequence of events in this flow


- A new message is sent to the Fact/Label topic
- The Processing Node will identify the primary key of the message
- Look up the Fact and Label entries in Redis with the primary key to get the latest state of the facts and labels
- Add the message to the current state and perform necessary transformations
- Store the results to the state-stores
- This will emit a message to the state store’s changelog topic
- Write the latest value of the key to Redis

Comparison of approaches

Earlier discussion: How do we process/store salary datapoints?

Approach 1
- Depending on the persistent state store solution chosen, there is a trade-off between
latency and costs.
- Dynamodb provides low latency reads (sub-millisecond) and writes, however the
cost is expensive if we need to reprocess our data.
- It costs $1.25 per million write requests and $0.25 per million read requests.
- With the initial load of 1 billion salary datapoints and 5 billion labels, this
will cost (1250 + 5 * 1250) * 2 = $15,000 to write the data into Dynamodb,
and around (250 + 5 * 250) * 2 = $3,000 for reads, for a total of $18,000
- Osiris provides low latency reads (based on this dashboard from DocService
KVStore, read latency is ~10ms for a single key and handles around 300
hits/second)
- Investigate if Osiris can handle the read throughput and latency of this
system. We can use LogRepo’s volume as a proxy for its write
throughput.
- Pros:
- Straightforward to implement
- Cons:
- Cost of Dynamodb is expensive

Approach 2
- Pros:
- Built in with Kafka Streams
- Does not need a separate database
- Cons:
- Outlined in Concerns with Kafka Streams and a local state-store
- Slow application start-up time
- Profiling this approach showed that a major contributor to this delay is RocksDB compaction. When we disabled compactions with .setDisableAutoCompactions(true), we were able to reload 45G of state storage in under 15 minutes, around 3G/minute. This is most likely bottlenecked by IO due to running the Kafka cluster and the Kafka Streams application on a single machine. (A sketch of this configuration appears after the recommendation below.)

Approach 3
- Pros:
- Lower in cost compared to Approach 1
- Redis Cluster can be rebuilt from the changelog topic if it goes down
- Kafka Streams application startup time will be reduced since we do not rely on
RocksDB

Recommendation: We decided to go with Approach 2 for the initial design. Based on initial experiments, RocksDB as a local state store across our consumer group should be able to handle our initial workload in a reasonable amount of time (around 10 hours).

With the Transformer API approach chosen in the next section, it will be straightforward to swap the Fact and Label state stores from RocksDB to a separate database such as DynamoDB, Osiris or Redis.

This reduces the concern of having to rebuild the state stores when there is a change in the
network topology due to repartitioning.
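The profiling note in Approach 2 above mentions disabling RocksDB auto-compactions while the state stores are being rebuilt. As a minimal sketch (not a recommendation of final settings), this is how such a tweak could be wired into the Kafka Streams application through its RocksDBConfigSetter hook:

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

import java.util.Map;

public class BulkLoadRocksDbConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        // Skip background compactions while the state stores are rebuilt from the changelog topics.
        // Whether to re-enable them afterwards is a tuning question, not settled here.
        options.setDisableAutoCompactions(true);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Nothing to clean up; any RocksDB objects created in setConfig would be released here.
    }
}

// Registered on the application with:
//   props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BulkLoadRocksDbConfig.class);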

Kafka Streams Application


Kafka Streams is a client library that allows us to write streaming applications. It supports
stateful processing by storing changelogs in intermediate Kafka topics, and uses RocksDB as
the underlying state-store for the application.

Overview
This diagram shows an overview of the role of the Kafka Streams application. It reads a salary
datapoint from the Fact topic and joins it with a list of Labels for the same key (type, id) and
stores the enriched message into an output topic.

There are a few approaches we considered when designing the Kafka Streams application
topology
- KStream - KTable join [code]
- Pros:
- Most space-efficient topology as there are only 2 intermediate topics and
a single KTable
- Cons:
- An entry will only be produced to the output topic when there is a new
message in the KStream, so updates to the KTable will not be visible in
the output topic until the entry in the KStream has been inserted again
- KTable - KTable join [code]
- Pros:
- A new entry will be emitted whenever there is a new message on either
side of the join.
- Cons:
- Uses one more topic to materialize the left-hand side of the join, so there will be 3 intermediate topics. Requires more storage space than the KStream-KTable approach.
- Transformer API [code] (a minimal sketch follows this list)
- Pros:
- Provides us with the most low-level control over how the application
works
- Implement optimizations based on our domain knowledge.
- Implement an LRU cache so that we do not have to go over the network to query the state-store.
- Optimize batching in the Transformer API so that we emit the
latest record for each key every 30 seconds. For each Salary Fact
with 5 Labels, a straightforward approach will emit 6 messages for
the same key to the output stream. By batching them together, we
can reduce it to 1 message per time interval.
- Cons:
- Requires more storage space than the KStreams approach
- We have to manually maintain the state-stores for Fact and Labels and
perform lookups
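For orientation, below is a minimal sketch (not production code) of the Transformer-based topology. It uses String keys and values and a string-concatenation "join" purely for brevity; the real implementation uses the Avro serdes, composite keys and label aggregation described elsewhere in this document, plus the LRU cache and 30-second batching optimizations, and lives in the linked [code].

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class IntegrateTopologySketch {

    @SuppressWarnings("unchecked")
    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Changelog-backed stores holding the latest Fact and the labels seen so far for each key.
        StoreBuilder<KeyValueStore<String, String>> factStore =
                Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("fact-store"),
                        Serdes.String(), Serdes.String());
        StoreBuilder<KeyValueStore<String, String>> labelStore =
                Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("label-store"),
                        Serdes.String(), Serdes.String());
        builder.addStateStore(factStore);
        builder.addStateStore(labelStore);

        builder.stream("indeed.salary.producer.fact.1", Consumed.with(Serdes.String(), Serdes.String()))
                .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
                    private KeyValueStore<String, String> facts;
                    private KeyValueStore<String, String> labels;

                    @Override
                    public void init(ProcessorContext context) {
                        facts = (KeyValueStore<String, String>) context.getStateStore("fact-store");
                        labels = (KeyValueStore<String, String>) context.getStateStore("label-store");
                    }

                    @Override
                    public KeyValue<String, String> transform(String key, String fact) {
                        // Remember the latest fact, then join it with any labels already seen for this key.
                        facts.put(key, fact);
                        return KeyValue.pair(key, fact + "|" + labels.get(key));
                    }

                    @Override
                    public void close() {}
                }, "fact-store", "label-store")
                .to("indeed.salary.integrate.unified.1", Produced.with(Serdes.String(), Serdes.String()));

        // A symmetric transformer on the Label (and Augment) topics would update "label-store"
        // and re-emit the enriched record for the same key; it is omitted here for brevity.
        return builder;
    }
}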

Recommendation: Transformer API


As the KStream-KTable join approach will only emit a message to the output topic when there is a new message on the KStream side of the join, it will not suffice for our use case. Therefore, in this document, we ran an experiment to compare the performance of the KTable-KTable join and the Transformer API.

The results of this experiment on a 2% subset of the initial load (around 20 million salary datapoints), after optimizations, were:

- KTable-KTable Join: 17 minutes
- Transformer API: 12 minutes (a 29.4% improvement)

This implies that we should be able to process our initial dataset of around 1 billion salary
datapoints in around 600 minutes, or 10 hours.

While we acknowledge that the increase in time taken is not entirely linear (larger datasets will have increased lookup times), we should be able to mitigate this increase in latency by
- Tuning the number of partitions for each topic
- Balancing the number of running application instances
- Tuning the underlying RocksDB configuration
Concerns with Kafka Streams and a local state-store
Restarting a Kafka Streams application is challenging
- When deploying in Marvin or Kubernetes, container storage is usually ephemeral. This
means that the local state stores will be deleted when a container is restarted and will
need to be rebuilt from the changelog topic.
- A change in the stream topology will require that the source topics be reprocessed again
from the beginning.
- An update in the application code or stream processing logic may require the source
topics to be reprocessed again from the beginning.

This delays the processing of new data until the Kafka Streams application has finished rebuilding its state stores.

There are several ways to address/mitigate these concerns


- Container storage is ephemeral
- Instead of deploying in Marvin, we can deploy with ArgoCD and use Stateful Sets
in Kubernetes to ensure that the container’s volume will be persisted across
application restarts/deployments.
- Using standby replicas to ensure there are multiple instances serving a single partition, for higher availability (see the configuration sketch after this list)
- Consider investigating if it is possible to periodically checkpoint a snapshot of the
RocksDB state-stores into s3, so that when the application starts-up, it can
download the snapshot and continue the state-store restoration process from
there.
- Reprocessing of the source streams from the beginning leads to slow start-up times
- With the current approach using the Transformer API, we can implement a custom KeyValueBytesStoreSupplier to read from and write to a separately hosted datastore such as DynamoDB [example repository]
- Separate the changelog topic and storage layer by implementing our own KeyValueBytesStoreSupplier
- Use Kafka to store the changelog topic for the state store, and rely on a
distributed cache to fetch the latest values for a key
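A minimal sketch of the Kafka Streams settings related to the mitigations above: standby replicas for faster failover, and a state directory pointed at a persistent volume when deploying with StatefulSets. The application id, paths and replica count are illustrative assumptions, not tuned recommendations.

import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class IntegrateStreamsConfigSketch {
    public static Properties streamsConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "salary-integrate-service");        // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "<CONFLUENT_BOOTSTRAP_SERVERS>"); // placeholder
        // Keep a warm copy of each state store on another instance so a restart does not
        // have to replay the full changelog before the application can serve traffic.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        // Point local state stores at a volume that survives pod restarts (e.g. a StatefulSet PVC).
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/data/kafka-streams");
        return props;
    }
}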

Indeed Collector and Compute

An overview of the Indeed Collector data flow and its sources of data is described here to provide some context. The design is in progress and can be tracked in SAL-2584
Overview of the Indeed Collector and Compute data flow

Sources of Salary Data


- Jobs
- VISA from BLS
- User Generated Content
- Reviews

- Resume Survey
- Paycheck

Breakdown of Salary datapoints by sources


- Number of salary datapoints in salaryhbasetable over the past 3 years (IQL)

- Number of ugc salary datapoints (from jobseekersalary) over the last 3 years (IQL)
Onboarding new sources of salary information

In general, there are two approaches to adding a new salary source into the platform.
1. Add a new SalaryRecordType for an existing producer
2. Add a new producer team, with its own Fact, Label and Augment topics.

Add a new SalaryRecordType for an existing producer


This is the most straightforward and common approach that should be preferred when we add a
new channel to collect salary datapoints from the existing Producer teams. For example, a new
survey form implementation or call-to-action that is designed to collect salary information from
job seekers on Indeed.

- Add a new value to the SalaryRecordType in the Producer team’s fact key schema (e.g.
indeed.salary.producer.fact.1-key)
- Start emitting the new salary datapoints to the fact topic.

Add a new producer team


This approach is used when there is a new source of salary information that is independent of
the existing producer-teams.

- Create a new set of Fact, Label and Augment topics and their schema files
- Update the Integrate Application to consume from these new topics and update the
Integrate Output Topic (indeed.salary.integrate.unified-output.1-value) with the new fields
for the properties and labels from this new producer.
- Start emitting the new salary datapoints to the new Fact topic

In cases where there are multiple new sources of salary information with a similar structure, but each source has only a small volume of datapoints, it might be prudent to create a Generic Producer and differentiate each source with its own SalaryRecordType, instead of creating a new producer team for each of them.

Adding a new Compute service

For a new compute service to add labels to salary datapoints, it needs to


- Add a new Label type to the label topic's key (indeed.salary.producer.label.1-key)
- Add a Label record with the necessary fields to the label topic's value (indeed.salary.producer.label.1-value)
- Start emitting the new label values to the Label or Augment topics

The new labels will be stored in the indeed_labels field in the Integrate Output Topic.

Success Metrics

The success metrics of this platform are as follows


- Have 3 years of Indeed’s Salary information in the platform
- Have all of Glassdoor’s Salary information in the platform
- Have 1 product use-case in Q2
- Replace the extracted_salaries HBase table as a source of Salary data for the Salary Aggregation Service
- Replace SalaryHbaseTable IQL index
- Use raw Salary datapoints as a source for training estimation models

Infrastructure
● Confluent Cloud Cluster
● Confluent Cloud Schema Registry

Data Stores
● Kafka

New Technologies
● Kafka

Privacy/Security

Privacy Ticket to share salary datapoints with Glassdoor - PRIV-195366


RASS Ticket to share a Confluent Cloud cluster with Glassdoor - RASS-3709
Operations
Service Level Objectives
List (or link to) the service level objectives this project will target within the first six months. Consider whether any
design changes materially affect the ability to meet existing SLOs. See the SLO Resource Hub for details.

[FILL OUT THIS SECTION]

Sizing
What kind of traffic do you expect your solution to receive? Roughly how much storage do you need, in databases or
in artifacts?

[FILL OUT THIS SECTION]

Monitoring

Datadog Dashboard for Confluent Cloud Cluster - [link]

Set up a monitoring dashboard for the Integrate Service’s Kafka Streams application
- TBF

Open Questions
List any open questions that you need answered in the design review.

● [FILL OUT THIS SECTION]

References
● Common Salary Architecture
● Common Salary Platform - POC Slides
● Common Salary Platform - Schema Design
● Common Salary Platform - Technical Discussion notes with Confluent
● Optimizing Kafka Streams performance

Annex
Sample output from the Integrate Output Topic
{
"salary": {
"com.indeed.common.kafka.salary.CompositeSalaryValue": {
"base_salary": {
"com.indeed.common.kafka.salary.BaseSalary": {
"range": {
"min": 1141,
"max": 1141
},
"salary_type": "HOURLY"
}
},
"gig_salary": null,
"starting_bonus": null,
"referral_bonus": null,
"recurring_bonus": null,
"gig_bonus": null,
"reimbursement": null,
"stipend": null,
"work_volume": null,
"currency": {
"com.indeed.common.kafka.salary.Currency": "JPY"
},
"visibility": {
"boolean": true
},
"created_date": {
"long": 1557616629000
},
"updated_date": {
"long": 1557616629000
}
}
},
"indeed_properties": {
"com.indeed.common.kafka.salary.IndeedProperties": {
"title": null,
"language": {
"string": "ja"
},
"country": {
"string": "JP"
},
"job_information": {
"com.indeed.common.kafka.salary.JobInformation": {
"feed_id": {
"int": 18132
},
"source_id": {
"int": 33103
},
"jobtype_id": null,
"jobtypes": null
}
},
"review_information": null
}
},
"indeed_labels": {
"map": {
"TITLE_NORMALIZATION": {
"label_data": {
"com.indeed.common.kafka.salary.NormTitle": {
"norm_status": null,
"job_title": {
"string": "介護staff 午前中のみOK 週1日,1日4h~ 主婦さん活躍中"
},
"norm_title": {
"string": "介護スタッフ"
},
"norm_language": {
"string": "ja"
},
"display_title": {
"string": "介護スタッフ"
},
"base_level_title": null,
"base_level_display_title": null,
"norm_category": {
"string": "care"
},
"normalization_timestamp": null
}
}
},
"GEOLOCATION": {
"label_data": {
"com.indeed.common.kafka.salary.Location": {
"country": {
"string": "JP"
},
"region1": {
"string": "JPC"
},
"region2": {
"string": "J12"
},
"region3": {
"string": ""
},
"region4": {
"string": "J4122033"
},
"region_name1": null,
"region_name2": null,
"region_name3": null,
"region_name4": null,
"city": {
"string": "市川市大町"
},
"postal_code": null,
"latitude": null,
"longtitude": null,
"msa_code": null,
"msa_name": null
}
}
},
"COMPANY": {
"label_data": {
"com.indeed.common.kafka.salary.Company": {
"fccid": null,
"legacy_id": {
"int": 62673022
},
"companyname": null,
"type": null,
"canonical_fccid": {
"int": 12214443
},
"canonical_companyname": null
}
}
}
}
}
}

Design Review Notes


Blank until the review. Somebody in the room needs to capture relevant notes as the review progresses, for
post-meeting use by the presenting engineer. These are informal and should include anything in the meeting that
clarifies the design, calls out some additional data to gather, suggests next steps, etc.
