CSP Design
Attendees
If you plan to attend this design review meeting, please add your name below so that we know which functions and teams will be represented.
● [email protected]
● [email protected]
● Yiqing Zhu - Manager, Taxonomy
● Tuan Phuong Trinh - SWE, Jobseeker Content
● Khee-Chin Chua - SWE, Jobseeker Content
● Koki Makino - SWE, Taxonomy
● Shuiyuan Zhang - Senior Manager, Taxonomy
● Filbert Teo - SWE, Careers
● Rong Yu - SWE, International
● Tom Fitzsimmons - Senior Manager, Careers/Education
● Rong Liu - SWE, Taxonomy
● Fabien Cortina - Manager, Acme Content
● Sourav Kumar Mohanty - SWE, Jobseeker Content
● Bin Chen - SWE, Acme Content
● Hui Ying Ooi - SWE, Localization
● Louis Lai - SWE, Careers
● Vlasto Chvojka - Senior Director, Jobseeker Content
● Willie Loyd Tandingan - SWE, Acme Content
● Craig Kost - Principal SWE, JSBE
● Branson Ng Khin Swee - SWE, Acme Content
● Matt Swarthout - Staff SWE, Waldo
● Michael Werle - Technical Fellow, Jobseeker
● Gayathri Veale - Senior Cloud Data Engineer, Product Tech Engineering Platform
We intend to build a common architecture that allows salary teams at Indeed and Glassdoor to store and retrieve salary information.
Goals
● Allow salary teams from both companies to produce facts (salary datapoints)
● Allow teams from either company to produce labels (company information, geolocation information) for their own datapoints asynchronously
● Allow teams from either company to produce labels (augmented labels) for other teams' datapoints asynchronously
● Allow product teams from either company to consume a single unified stream of enriched salary datapoints
Technical Design Review
Projects Affected
● Salary Aggregation Service (SALAGS) - serves salary aggregates on Career Pages and Company Salary pages. Owned by the careers-explorer team
● Salary Aggregation App (Glassdoor Service)
● Salary Estimation (Glassdoor ML)
● Careers-api-service - used by the Company Pages Webapp to serve aggregates from Glassdoor (link) when SALAGS does not have any datapoints
● New applications
Assumptions
● We are using Confluent Cloud’s Kafka as our message bus
● Indeed has ~1 billion salary datapoints across different sources (job descriptions,
reviews, resume surveys) over the last 3 years.
● On average, each salary datapoint has ~5 labels. This will increase when Glassdoor
starts to add labels to Indeed datapoints and vice versa.
○ Initial load: 1 billion salary datapoints, 5 billion labels
○ Daily load: 1 million salary datapoints, 5 million labels
● The architecture design follows the Four Stage Data Pipelines pattern
Terminology
● Producer: Teams that produce salary information to the platform (Indeed-Salary and Glassdoor-Salary)
● Integrate: The team that aggregates and merges the salary facts and labels emitted by the various Producer teams and emits them to a single unified Salary datastream/dataset
● Consumer: Product teams that will consume from the Integrate team’s
datastream/datasets
● Fact: Salary datapoint extracted from a source
● Label: Computed/Derived data that adds to the understanding of the Fact.
● Enriched datapoints: The salary datapoint emitted by the Integrate team, which
contains the Fact, Labels from the producers and additional labels computed by the
Integrate team.
Design considerations
The common salary platform needs to support ingesting salary datapoints from different teams.
In the first version of the platform, we want to be able to support Glassdoor and Indeed salary
datapoints.
Recommendation: Approach 3 provides us with the most flexibility to support custom labels
from different producer teams, while enforcing a consistent schema to store the essential salary
information.
Recommendation: Kafka Streams application with infinite retention topics as a state store
A test load of the dataset indicates that our initial load of 1 billion salary datapoints will take up around 40 GB in the Fact topic and 320 GB in the Label topic.
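As a rough per-record estimate derived from these figures: 40 GB / 1 billion facts ≈ 40 bytes per fact and 320 GB / 5 billion labels ≈ 64 bytes per label (compressed, on disk). At that rate, the daily load of 1 million datapoints and 5 million labels adds only a few hundred megabytes per day.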
Based on our subsequent experiments in this document on a single virtual machine, we were able to reprocess 2% of the entire dataset (20 million salary datapoints) in around 12 minutes. A subsequent test with 10% of the dataset took around x minutes. Reprocessing the initial dataset of a billion salary datapoints should take around 10 hours to complete.
We will use an S3 Sink Connector to take periodic backups of the topics and the S3 Source Connector to restore from backups in the event of a cluster outage.
Architecture Overview
In this presentation, we built a proof of concept of the above architecture to estimate the total
amount of space required based on a 3-year snapshot of salary datapoints from Indeed. We
placed the sample IDLs used for the POC in this Google Doc for our colleagues who may not
have access to our Gitlab instances.
The schemas for the Kafka topics can be found in this doc.
Producer Teams
Fact Topic
● This topic stores the raw salary datapoint produced by the team’s Collector
○ Key: A unique identifier for this salary datapoint [Sample IDL]
{'record_type': 'JOB_DESCRIPTION', 'id': 5735510521}
○ Value: A raw salary datapoint. The schema for this salary datapoint should be
consistent across producers. [Sample IDL]
{'base_salary': {'salary_type': 'WEEKLY', 'range': {'min': 385.0, 'max': 675.0}}, 'currency': 'USD',
'created_date': 1546917033000, 'updated_date': 1546917033000}
Label Topic
● This topic stores the labels produced by Compute processes for the salary datapoint
● Labels enrich the salary datapoint with additional information such as title normalization, geocoding, and company information
○ Key: A unique identifier for this entry [Sample IDL]
{'record_type': 'JOB_DESCRIPTION', 'id': 5735510521, 'label_type': 'TITLE_NORMALIZATION'}
○ Value: Processed Data [Sample IDL]
{'key': {'record_type': 'JOB_DESCRIPTION', 'id': 5735510521, 'label_type': 'TITLE_NORMALIZATION'}, 'data': {'jobTitle': 'CDL-A Lease Purchase Truck Driver - *$2500 SIGN ON BONUS* Ta', 'normTitle': 'truck driver', 'normLanguage': 'en', 'displayTitle': 'truck driver', 'normCategory': 'driver'}}
Augment Topic
● This topic stores the augmented labels for other teams' datapoints
○ Key: A composite key that contains the Fact key from the other team [Sample IDL]
{'record_source': 'GLASSDOOR', 'record_key': {'record_type': 'REVIEWS', 'id': 12345, 'label_type': 'TITLE_NORMALIZATION'}}
○ Value: Processed Data
{'key': {'record_source': 'GLASSDOOR', 'record_key': {'record_type': 'REVIEWS', 'id': 12345, 'label_type': 'TITLE_NORMALIZATION'}}, 'data': {'jobTitle': 'CDL-A Lease Purchase Truck Driver - *$2500 SIGN ON BONUS* Ta', 'normTitle': 'truck driver', 'normLanguage': 'en', 'displayTitle': 'truck driver', 'normCategory': 'driver'}}
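To make the producer contract concrete, the following is a minimal sketch of how a Collector might publish a fact to its Fact topic. It assumes Java classes generated by the Avro compiler from the IDLs later in this document (IndeedSalaryFactKey, CompositeSalaryValue), the Confluent KafkaAvroSerializer, and, per the sample value above, that the fact value is serialized as the common CompositeSalaryValue record. The bootstrap and schema-registry settings are placeholders, and accessor names follow whatever the Avro compiler generates from the IDL field names.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
// Generated by the Avro compiler from the IDLs in this document.
import com.indeed.common.kafka.salary.CompositeSalaryValue;
import com.indeed.common.kafka.salary.Currency;
import com.indeed.common.kafka.salary.IndeedSalaryFactKey;
import com.indeed.common.kafka.salary.SalaryRecordType;

public class FactProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "<confluent-cloud-bootstrap>"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "<schema-registry-url>"); // placeholder
        // murmur2_random is the default partitioner in the Java client, so no override is needed here.

        // Key: (record_type, id) uniquely identifies this salary datapoint.
        IndeedSalaryFactKey key = IndeedSalaryFactKey.newBuilder()
                .setRecordType(SalaryRecordType.JOB_DESCRIPTION)
                .setId(5735510521L)
                .build();

        // Value: the raw salary datapoint; base_salary, created_date, etc. are set the same way.
        CompositeSalaryValue value = CompositeSalaryValue.newBuilder()
                .setCurrency(Currency.USD)
                .build();

        try (KafkaProducer<IndeedSalaryFactKey, CompositeSalaryValue> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("indeed.salary.producer.fact.1", key, value));
        }
    }
}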
Integrate Team
Data Lineage
Avro Directory Layout
- The layout of the directory is based on <team>/<topic-name>-<key|value>.avdl (see the example layout below)
- Shared data structures should be stored under common/
- Team-specific shared data structures should be stored under <team>/common.avdl, such as indeed/common.avdl
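For example, with the schema files defined in the rest of this section, the repository layout would be:

common/common.avdl
indeed/common.avdl
indeed/indeed.salary.producer.fact.1-key.avdl
indeed/indeed.salary.producer.fact.1-value.avdl
indeed/indeed.salary.producer.label.1-key.avdl
indeed/indeed.salary.producer.label.1-value.avdl
integrate/indeed.salary.integrate.unified-output.1-key.avdl
integrate/indeed.salary.integrate.unified-output.1-value.avdl

A glassdoor/ directory would follow the same pattern once the Glassdoor schemas are added.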
Avro IDL
common/common.avdl
@namespace("com.indeed.common.kafka.salary")
protocol CommonSalary {
// Describes a composite salary value
record CompositeSalaryValue {
union{null, BaseSalary} base_salary = null;
union{null, GigSalary} gig_salary = null;
union{null, StartingBonus} starting_bonus = null;
union{null, ReferralBonus} referral_bonus = null;
union{null, array<RecurringBonus>} recurring_bonus = null;
union{null, GigBonus} gig_bonus = null;
union{null, array<Reimbursement>} reimbursement = null;
union{null, array<Stipend>} stipend = null;
union{null, WorkVolume} work_volume = null;
union{null, Currency} currency = null;
// additional information
union{boolean, null} visibility = true;
union{null, timestamp_ms} created_date = null;
union{null, timestamp_ms} updated_date = null;
}
// Time-based pay cycles for base salary.
enum SalaryType {NONE, HOURLY, DAILY, WEEKLY, MONTHLY, YEARLY, BIWEEKLY, QUARTERLY} = NONE;
enum Currency {
NO_CURRENCY, USD, EUR, GBP, INR, AUD, CAD, SGD, CHF, MYR, JPY, CNY, NZD, THB, HUF, AED, HKD, MXN,
ZAR, PHP, SEK,
IDR, SAR, BRL, TRY, KES, KRW, EGP, IQD, NOK, KWD, RUB, DKK, PKR, ILS, PLN, QAR, XAU, OMR,
COP, CLP, TWD,
ARS, CZK, VND, MAD, JOD, BHD, XOF, LKR, UAH, NGN, TND, UGX, RON, BDT, PEN, GEL, XAF, FJD,
VEF, BYN, HRK,
UZS, BGN, DZD, IRR, DOP, ISK, XAG, CRC, SYP, LYD, JMD, MUR, GHS, AOA, UYU, AFN, LBP, XPF,
TTD, TZS, ALL,
XCD, GTQ, NPR, BOB, ZWD, BBD, CUC, LAK, BND, BWP, HNL, PYG, ETB, NAD, PGK, SDG, MOP, NIO,
BMD, KZT, PAB,
BAM, GYD, YER, MGA, KYD, MZN, RSD, SCR, AMD, SBD, AZN, SLL, TOP, BZD, MWK, GMD, BIF, SOS,
HTG, GNF, MVR,
MNT, CDF, STD, TJS, KPW, MMK, LSL, LRD, KGS, GIP, XPT, MDL, CUP, KHR, MKD, VUV, MRO, ANG,
SZL, CVE, SRD,
XPD, SVC, BSD, XDR, RWF, AWG, DJF, BTN, KMF, WST, SPL, ERN, FKP, SHP, JEP, TMT, TVD, IMP,
GGP, ZMW} =
NO_CURRENCY;
enum BonusRewardUnit {
UNSPECIFIED,
CURRENCY,
OF_SALARY, // percentage of base salary, e.g. 1/12 for 13th month
PERCENTAGE, // percentage of other, e.g. with AlternateCycle=SALE means commission
SHARE, SHARE_OPTION, PROFIT_UNIT, REVENUE_UNIT} = UNSPECIFIED;
enum BonusReason {
UNKNOWN_REASON, PERFORMANCE, SAFETY, HAZMAT, PERFECT_ATTENDANCE, FITNESS,
MONTH13, // 13th month bonus
TIPS} = UNKNOWN_REASON;
enum AllowanceReason {
UNKNOWN_ALLOWANCE, TUITION, STUDENT_LOAN, COMMUTING, TRAVEL, RELOCATION, REGIONAL, HOUSING,
FAMILY,
CERTIFICATION, RESPONSIBILITY, DEPENDENTS, FOOD, DRIVING} = UNKNOWN_ALLOWANCE;
// Feature
record BaseSalary {
Range range;
SalaryType salary_type = "NONE";
}
// Feature
record GigSalary {
Range range;
AlternateCycle alternate_cycle = "NO_ALTERNATE_CYCLE";
}
// Feature
record StartingBonus {
Range range;
}
// Feature
record ReferralBonus {
Range range;
}
// Feature
record RecurringBonus {
Range range;
BonusRewardUnit unit = "UNSPECIFIED";
SalaryType period = "NONE";
BonusReason reason = "UNKNOWN_REASON";
}
// Feature
record GigBonus {
Range range;
BonusRewardUnit unit = "UNSPECIFIED";
AlternateCycle alternate_cycle = "NO_ALTERNATE_CYCLE";
BonusReason reason = "UNKNOWN_REASON";
}
// Feature
record Reimbursement {
Range range;
SalaryType period = "NONE";
AllowanceReason reason = "UNKNOWN_ALLOWANCE";
}
// Feature
record Stipend {
Range range;
SalaryType period = "NONE";
AllowanceReason reason = "UNKNOWN_ALLOWANCE";
}
// Feature
record GigOrPeriod {
AlternateCycle gig = "NO_ALTERNATE_CYCLE";
SalaryType period = "NONE";
}
// Feature
record WorkVolume {
Range range;
GigOrPeriod unit;
GigOrPeriod cycle;
}
}
indeed/common.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IndeedCommonSalary {
enum SalaryRecordType {
NULL, JOB_DESCRIPTION, REVIEW, RESUME_INLINE_WIZ, RESUME_SURVEY_UPLOAD, RESUME_SURVEY,
RESUME_SURVEY_HOMEPROMO,
RESUME_SURVEY_WIZ, VISA_APPLICATION, GLOBAL_SALARY_SURVEY, SALARY_THIRD_PARTY, GLASSDOOR,
SALARY_SURVEY,
PAYCHECK} = NULL;
}
indeed/indeed.salary.producer.fact.1-key.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IndeedSalaryFactKey {
// Kafka topic - Key Schema
import idl "common.avdl";
record IndeedSalaryFactKey {
SalaryRecordType record_type;
union{long, string} id;
}
}
indeed/indeed.salary.producer.fact.1-value.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IndeedSalaryFactValue {
import idl "../common/common.avdl";
record IndeedProperties {
union{null, string} title = null;
union{null, string} language = null;
union{null, string} country = null;
// job related fields
union{null, JobInformation} job_information = null;
// review related fields
union{null, ReviewInformation} review_information = null;
}
// properties
record JobInformation {
// JobSourceInfo
union{null, int} feed_id = null;
union{null, int} source_id = null;
// JobTypeInfo
union{null, int} jobtype_id = null;
union{null, array<string>} jobtypes = null;
}
record ReviewInformation {
//This is designed to store the information about a review that we would want to sort on
union{null, timestamp_ms} submission_time = null;
union{null, int} language_id = null;
union{null, int} country_id = null;
}
}
indeed/indeed.salary.producer.label.1-key.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IndeedSalaryLabelKey {
import idl "indeed.salary.producer.fact.1-key.avdl";
import idl "common.avdl";
// Salary Labels
// initial version adapted from
// https://sourcegraph.indeed.tech/code.corp.indeed.com/salary/salary-aggregate-index/-/blob/library/src/main/proto/HBaseSchema.proto
enum IndeedSalaryLabelType {
NO_LABEL, JOB_SOURCE_INFO, JOB_TYPE_INFO, TITLE_NORMALIZATION, GEOLOCATION, COMPANY, LANGUAGE, STATUS_FLAGS,
DELETED_FLAGS, SALARY_OVERRIDE} = NO_LABEL;
}
indeed/indeed.salary.producer.label.1-value.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IndeedSalaryLabelValue {
import idl "indeed.salary.producer.fact.1-value.avdl";
import idl "../common/common.avdl";
// column f1:nt
record NormTitle {
union{null, int} norm_status = null;
union{null, string} job_title = null;
union{null, string} norm_title = null;
union{null, string} norm_language = null;
// detected language by title normalization
union{null, string} display_title = null;
union{null, string} base_level_title = null;
union{null, string} base_level_display_title = null;
union{null, string} norm_category = null;
union{null, long} normalization_timestamp = null;
}
// column f1:loc
record Location {
union{null, string} country = null;
union{null, string} region1 = null;
union{null, string} region2 = null;
union{null, string} region3 = null;
union{null, string} region4 = null;
union{null, string} region_name1 = null;
union{null, string} region_name2 = null;
union{null, string} region_name3 = null;
union{null, string} region_name4 = null;
union{null, string} city = null;
union{null, string} postal_code = null;
union{null, string} latitude = null;
union{null, string} longtitude = null;
union{null, string} msa_code = null;
union{null, string} msa_name = null;
}
// column f1:c
record Company {
union{null, int} fccid = null;
union{null, int} legacy_id = null;
union{null, string} companyname = null;
union{null, string} type = null;
union{null, int} canonical_fccid = null;
union{null, string} canonical_companyname = null;
}
// column f1:fl
record StatusFlag {
union{null, boolean} is_overridden = null;
union{null, boolean} is_invalid_salary = null;
}
// column f1:del
record DeletedFlag {
union{null, boolean} is_deleted = null;
union{null, long} deletion_timestamp = null;
union{null, string} deletion_reason = null;
}
record CompositeValueOverride {
union{boolean, null} full_takedown = false;
union{null, array<FeatureType>} takedown = null;
union{null, CompositeSalaryValue} modifications = null;
}
}
integrate/indeed.salary.integrate.unified-output.1-key.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IntegrateSalaryOutputKey {
// Kafka topic - Key Schema
import idl "../indeed/indeed.salary.producer.fact.1-key.avdl";
record IntegrateSalaryOutputKey {
union{null, IndeedSalaryFactKey, GlassdoorSalaryFactKey} id;
}
}
integrate/indeed.salary.integrate.unified-output.1-value.avdl
@namespace("com.indeed.common.kafka.salary")
protocol IntegrateSalaryOutputValue {
import idl "../indeed/indeed.salary.producer.fact.1-value.avdl";
import idl "../indeed/indeed.salary.producer.label.1-key.avdl";
import idl "../indeed/indeed.salary.producer.label.1-value.avdl";
record IndeedSalaryLabel {
union{null, IndeedSalaryLabelKey} key = null;
union{null, IndeedSalaryLabelValue} value = null;
}
record IndeedSalaryContent {
union{null, IndeedSalaryFactValue} fact = null;
union{null, array<IndeedSalaryLabel>} labels = null;
}
record IntegrateSalaryOutputValue {
union{null, IndeedSalaryContent} raw = null;
// salary structure with a standard defined format
union{null, CompositeSalaryValue} salary = null;
// Indeed specific properties
union{null, IndeedProperties} indeed_properties = null;
/* Labels generated by compute services from Indeed in a map {LABEL_TYPE: LABEL_VALUE}
- {"TITLE_NORMALIZATION": {"norm_title": "xxx", ..., "display_title": "xxx"},
"GEOLOCATION": {"country": "US", "region1": "TX", ..., "city": "Austin"}}
*/
union{null, map<IndeedSalaryLabelValue>} indeed_labels = null;
/*
// Glassdoor specific properties
union{null, GlassdoorProperties} glassdoor_properties;
// Labels generated by compute services from Glassdoor
union{null, map<GlassdoorSalaryLabelValue>} glassdoor_labels;
*/
}
}
Kafka Configuration
Cluster
- Our initial deployment strategy is a dedicated Confluent Cloud cluster with multi-zone
availability.
Topic
- Kafka topic key and value schemas will be managed with Confluent Cloud's Schema Registry.
- The message format will be Avro.
- Schema validation will be enabled for keys and values in all topics - (confluent.key.schema.validation and confluent.value.schema.validation)
- The compression codec on all topics is lz4. Based on this study, it appears that lz4 gives the best compression/decompression throughput at the cost of a slightly lower compression ratio compared to zstd. - (compression.type)
- The cleanup policy for all topics will be "compact" - (cleanup.policy)
- To preserve ordering of messages in a partition, the timestamp for messages in logs will be set to LogAppendTime. - (message.timestamp.type)
- Fact and Label topics will use a "TopicNameStrategy" for their keys and values
- We will use the FULL compatibility type to support schema evolution. This allows us to perform schema updates without worrying about the order in which clients are upgraded.
- The number of partitions for each topic is set to 64. This allows us to scale the number of processing instances for each topic up to 64. The total number of partitions should not exceed 4,500 per broker before replication. (A topic-creation sketch with these settings follows this list.)
- Producers should use "murmur2_random" as the partitioner. This is the default partitioner for the Java library, but not for the Python libraries.
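As a concrete illustration, the following is a minimal sketch of creating one of these topics with the Kafka AdminClient. The topic name and replication factor are placeholders; in practice the topics may be created through the Confluent Cloud UI, CLI, or Terraform with the same configuration values.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateFactTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<confluent-cloud-bootstrap>"); // placeholder

        try (Admin admin = Admin.create(props)) {
            NewTopic factTopic = new NewTopic("indeed.salary.producer.fact.1", 64, (short) 3) // 64 partitions
                    .configs(Map.of(
                            "cleanup.policy", "compact",               // compacted, effectively infinite retention
                            "compression.type", "lz4",
                            "message.timestamp.type", "LogAppendTime",
                            "confluent.key.schema.validation", "true",
                            "confluent.value.schema.validation", "true"));
            admin.createTopics(List.of(factTopic)).all().get();
        }
    }
}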
The Integrate team maintains the service that processes all the messages from both companies' Kafka topics and outputs a unified datastream.
Approaches
We considered three approaches to building this service:
1. Using Kafka as a queue for messages and persisting the data into a backing store such as DynamoDB
2. Using Kafka Streams with infinite retention to store and process the data
3. Using Kafka Streams with infinite retention and Redis as a distributed cache
Approach 1: Kafka as a queue and DynamoDB as the persistent storage
With this approach, Kafka is used as a transient data store, and consumers process the messages from each topic and store them in DynamoDB for persistence.
The Integrate Service logic is straightforward: it fetches the latest state of the salary datapoint from the Processed Table, creates it if it does not exist, and then appends the latest label to the value. The sequence of events is as follows (a sketch is given after this list):
- A new message is sent to the Fact/Label topic
- The Integrate Service identifies the primary key of the message
- Lookup the Processed Table for the latest state of the primary key
- If it does not exist, create a new entry
- Add the message to the current state and perform necessary transformations
- Store the updated state of the salary datapoint into the Processed Table
- Emit a message to the Output Topic
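A minimal sketch of this loop is shown below. ProcessedTable is a hypothetical interface standing in for whichever persistent store is chosen (DynamoDB, Osiris), and keys and values are simplified to strings so the read-modify-write sequence stays visible.

import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IntegrateServiceSketch {

    // Hypothetical wrapper around the Processed Table (DynamoDB, Osiris, ...).
    interface ProcessedTable {
        String get(String primaryKey);              // latest state, or null if absent
        void put(String primaryKey, String state);  // upsert the updated state
    }

    static void run(KafkaConsumer<String, String> consumer,
                    KafkaProducer<String, String> producer,
                    ProcessedTable processedTable) {
        consumer.subscribe(List.of("indeed.salary.producer.fact.1", "indeed.salary.producer.label.1"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                String primaryKey = record.key();                 // identify the primary key
                String state = processedTable.get(primaryKey);    // look up the latest state
                if (state == null) {
                    state = "";                                   // create a new entry if it does not exist
                }
                String updated = merge(state, record.value());    // add the message and transform
                processedTable.put(primaryKey, updated);          // store the updated state
                producer.send(new ProducerRecord<>(               // emit to the output topic
                        "indeed.salary.integrate.unified-output.1", primaryKey, updated));
            }
        }
    }

    private static String merge(String currentState, String newMessage) {
        return currentState + newMessage; // placeholder for the real fact/label merge logic
    }
}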
Approach 2: Kafka Streams application with local state-stores
Using this approach, messages are retained indefinitely in the Kafka topics, and the data in Kafka becomes the primary source of truth. State is kept in local (RocksDB-backed) state-stores.
Approach 3: Kafka Streams application with Redis as a distributed state-store
This is the same as Approach 2, except that the state-stores are backed by Redis instead of local RocksDB. To do so, we will create a new KeyValueBytesStoreSupplier for the state-stores that reads from and writes to a Redis Cluster (a wiring sketch follows).
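A minimal wiring sketch for this supplier is shown below. RedisKeyValueStore is a hypothetical class we would still need to write; it has to implement Kafka Streams' KeyValueStore<Bytes, byte[]> interface (get, put, delete, range, flush, and so on) on top of a Redis Cluster client. Everything else uses the standard Kafka Streams API, so the topology code stays the same whether the store is RocksDB or Redis.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.state.KeyValueBytesStoreSupplier;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class RedisStoreWiring {

    // Supplier that hands Kafka Streams a Redis-backed store instead of RocksDB.
    static KeyValueBytesStoreSupplier redisSupplier(String storeName) {
        return new KeyValueBytesStoreSupplier() {
            @Override public String name() { return storeName; }
            @Override public KeyValueStore<Bytes, byte[]> get() {
                return new RedisKeyValueStore(storeName); // hypothetical implementation backed by Redis
            }
            @Override public String metricsScope() { return "redis"; }
        };
    }

    static void addFactStore(StreamsBuilder builder) {
        // Registered exactly like a RocksDB store; the changelog topic is still written,
        // so the Redis cluster can be rebuilt from it after an outage.
        StoreBuilder<KeyValueStore<String, String>> factStore =
                Stores.keyValueStoreBuilder(redisSupplier("fact-store"), Serdes.String(), Serdes.String());
        builder.addStateStore(factStore);
    }
}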
Comparison of approaches
Approach 1
- Depending on the persistent state store solution chosen, there is a trade-off between latency and cost.
- DynamoDB provides low-latency reads (sub-millisecond) and writes; however, it becomes expensive if we need to reprocess our data.
- It costs $1.25 per million write requests and $0.25 per million read requests.
- With the initial load of 1 billion salary datapoints and 5 billion labels, this will cost (1250 + 5 * 1250) * 2 = $15,000 to write the data into DynamoDB, and around (250 + 5 * 250) * 2 = $3,000 for reads, for a total of $18,000.
- Osiris provides low-latency reads (based on this dashboard from DocService KVStore, read latency is ~10ms for a single key and it handles around 300 hits/second).
- Investigate whether Osiris can handle the read throughput and latency of this system. We can use LogRepo's volume as a proxy for its write throughput.
- Pros:
- Straightforward to implement
- Cons:
- DynamoDB is expensive
Approach 2
- Pros:
- Built into Kafka Streams
- Does not need a separate database
- Cons:
- Outlined in Concerns with Kafka Streams and a local state-store
- Slow application start-up time
- Profiling this approach showed that a major contributor to this delay is RocksDB compaction. When we disabled compactions with .setDisableAutoCompactions(true), we were able to reload 45 GB of state storage in under 15 minutes, around 3 GB/minute. This is most likely bottlenecked by IO due to running the Kafka cluster and the Kafka Streams application on a single machine.
Approach 3
- Pros:
- Lower in cost compared to Approach 1
- Redis Cluster can be rebuilt from the changelog topic if it goes down
- Kafka Streams application startup time will be reduced since we do not rely on
RocksDB
Recommendation: We decided to go with Approach 2 for the initial design. Based on initial experiments, RocksDB as a local state store across our consumer group should be able to handle our initial workload in a reasonable amount of time (about 10 hours).
With the Transformer API approach chosen in the next section, it will be straightforward to swap the RocksDB-backed Fact and Label datastores for a separate database such as DynamoDB, Osiris, or Redis.
This reduces the concern of having to rebuild the state stores when there is a change in the
network topology due to repartitioning.
Overview
This diagram shows an overview of the role of the Kafka Streams application. It reads a salary
datapoint from the Fact topic and joins it with a list of Labels for the same key (type, id) and
stores the enriched message into an output topic.
There are a few approaches we considered when designing the Kafka Streams application topology (a sketch of the chosen Transformer approach follows this list):
- KStream - KTable join [code]
- Pros:
- Most space-efficient topology as there are only 2 intermediate topics and
a single KTable
- Cons:
- An entry will only be produced to the output topic when there is a new
message in the KStream, so updates to the KTable will not be visible in
the output topic until the entry in the KStream has been inserted again
- KTable - KTable join [code]
- Pros:
- A new entry will be emitted whenever there is a new message on either
side of the join.
- Cons:
- Uses one more topic to materialize the left hand side of the join, so there will be 3 intermediate topics. Requires more storage space than the KStreams approach.
- Transformer API [code]
- Pros:
- Provides us with the most low-level control over how the application works
- Lets us implement optimizations based on our domain knowledge:
- Implement an LRU cache so that we do not have to go over the network to query the state-store.
- Optimize batching in the Transformer API so that we emit the
latest record for each key every 30 seconds. For each Salary Fact
with 5 Labels, a straightforward approach will emit 6 messages for
the same key to the output stream. By batching them together, we
can reduce it to 1 message per time interval.
- Cons:
- Requires more storage space than the KStreams approach
- We have to manually maintain the state-stores for Fact and Labels and
perform lookups
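Below is a minimal sketch of the Transformer API approach. The Fact and Label messages are simplified to strings, and the store and topic names are illustrative placeholders; the real implementation would use the Avro types defined above, an LRU cache in front of the state-store lookups, and the 30-second batching described in the pros.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class IntegrateTopologySketch {

    static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // State stores holding the latest Fact and the accumulated Labels per key.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("fact-store"), Serdes.String(), Serdes.String()));
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("label-store"), Serdes.String(), Serdes.String()));

        // Facts: remember the latest fact, then emit it joined with any labels seen so far.
        builder.stream("indeed.salary.producer.fact.1", Consumed.with(Serdes.String(), Serdes.String()))
                .transform(FactTransformer::new, "fact-store", "label-store")
                .to("indeed.salary.integrate.unified-output.1", Produced.with(Serdes.String(), Serdes.String()));
        return builder;
    }

    static class FactTransformer implements Transformer<String, String, KeyValue<String, String>> {
        private KeyValueStore<String, String> factStore;
        private KeyValueStore<String, String> labelStore;

        @Override
        public void init(ProcessorContext context) {
            factStore = (KeyValueStore<String, String>) context.getStateStore("fact-store");
            labelStore = (KeyValueStore<String, String>) context.getStateStore("label-store");
        }

        @Override
        public KeyValue<String, String> transform(String key, String fact) {
            factStore.put(key, fact);
            String labels = labelStore.get(key); // in the real service this lookup sits behind an LRU cache
            String enriched = fact + (labels == null ? "" : labels);
            return KeyValue.pair(key, enriched); // per-key batching/suppression would happen here
        }

        @Override
        public void close() { }
    }
}

A symmetric transformer on the Label stream would write to label-store and re-emit the affected fact, and the 30-second batching described above would be layered on this skeleton (for example with a scheduled punctuation).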
The results of this experiment on a 2% subset of the initial load (around 20 million salary datapoints), after optimizations, show a processing time of around 12 minutes. This implies that we should be able to process our initial dataset of around 1 billion salary datapoints in around 50 × 12 minutes ≈ 600 minutes, or 10 hours.
While we acknowledge that the increase in time taken is not entirely linear (larger datasets will have increased lookup times), we should be able to mitigate this increase in latency by:
- Tuning the number of partitions for each topic
- Balancing the number of running application instances
- Tuning the underlying RocksDB configuration (a tuning sketch follows this list)
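For the last point, Kafka Streams exposes RocksDB tuning through the rocksdb.config.setter property. The sketch below shows the mechanism only; the actual option values (write buffer size, compaction settings) would come out of the profiling mentioned above and are illustrative here.

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

public class SalaryRocksDBConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(String storeName, Options options, Map<String, Object> configs) {
        // Example knobs only; real values come from the profiling runs described above.
        options.setDisableAutoCompactions(true);       // what we toggled during the bulk-reload experiment
        options.setWriteBufferSize(64 * 1024 * 1024L); // 64 MB memtables (illustrative)
    }

    @Override
    public void close(String storeName, Options options) {
        // Nothing allocated above, so nothing to release.
    }

    static void register(Properties streamsProps) {
        streamsProps.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, SalaryRocksDBConfig.class);
    }
}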
Concerns with Kafka Streams and a local state-store
Restarting a Kafka Streams application is challenging
- When deploying in Marvin or Kubernetes, container storage is usually ephemeral. This
means that the local state stores will be deleted when a container is restarted and will
need to be rebuilt from the changelog topic.
- A change in the stream topology will require the source topics to be reprocessed from the beginning.
- An update to the application code or stream processing logic may require the source topics to be reprocessed from the beginning.
This delays the processing of new data by the service until the Kafka Streams application has finished rebuilding its state stores.
An overview of the Indeed Collector data flow and its sources of data is described here to provide some context. The design is in progress and can be tracked in SAL-2584.
Overview of the Indeed Collector and Compute data flow
- Resume Survey
- Paycheck
- Number of ugc salary datapoints (from jobseekersalary) over the last 3 years (IQL)
Onboarding new sources of salary information
In general, there are two approaches to adding a new salary source to the platform.
1. Add a new SalaryRecordType for an existing producer
- Add a new value to the SalaryRecordType in the Producer team's fact key schema (e.g. indeed.salary.producer.fact.1-key)
- Start emitting the new salary datapoints to the fact topic
2. Add a new producer team, with its own Fact, Label and Augment topics
- Create a new set of Fact, Label and Augment topics and their schema files
- Update the Integrate Application to consume from these new topics and update the Integrate Output Topic (indeed.salary.integrate.unified-output.1-value) with the new fields for the properties and labels from this new producer
- Start emitting the new salary datapoints to the new Fact topic
In cases where there are multiple new sources of salary information with a similar structure, but each source has only a small volume of datapoints, it might be prudent to create a Generic Producer instead of a new producer team for each source, and to differentiate each source with its own SalaryRecordType.
The new labels will be stored in the indeed_labels field in the Integrate Output Topic.
Success Metrics
Infrastructure
● Confluent Cloud Cluster
● Confluent Cloud Schema Registry
Data Stores
● Kafka
New Technologies
● Kafka
Privacy/Security
Sizing
Initial load: 1 billion salary datapoints and 5 billion labels, roughly 40 GB in the Fact topic and 320 GB in the Label topic. Ongoing load: around 1 million datapoints and 5 million labels per day.
Monitoring
Set up a monitoring dashboard for the Integrate Service’s Kafka Streams application
- TBF
Open Questions
List any open questions that you need answered in the design review.
References
● Common Salary Architecture
● Common Salary Platform - POC Slides
● Common Salary Platform - Schema Design
● Common Salary Platform - Technical Discussion notes with Confluent
● Optimizing Kafka Streams performance
Annex
Sample output from the Integrate Output Topic
{
"salary": {
"com.indeed.common.kafka.salary.CompositeSalaryValue": {
"base_salary": {
"com.indeed.common.kafka.salary.BaseSalary": {
"range": {
"min": 1141,
"max": 1141
},
"salary_type": "HOURLY"
}
},
"gig_salary": null,
"starting_bonus": null,
"referral_bonus": null,
"recurring_bonus": null,
"gig_bonus": null,
"reimbursement": null,
"stipend": null,
"work_volume": null,
"currency": {
"com.indeed.common.kafka.salary.Currency": "JPY"
},
"visibility": {
"boolean": true
},
"created_date": {
"long": 1557616629000
},
"updated_date": {
"long": 1557616629000
}
}
},
"indeed_properties": {
"com.indeed.common.kafka.salary.IndeedProperties": {
"title": null,
"language": {
"string": "ja"
},
"country": {
"string": "JP"
},
"job_information": {
"com.indeed.common.kafka.salary.JobInformation": {
"feed_id": {
"int": 18132
},
"source_id": {
"int": 33103
},
"jobtype_id": null,
"jobtypes": null
}
},
"review_information": null
}
},
"indeed_labels": {
"map": {
"TITLE_NORMALIZATION": {
"label_data": {
"com.indeed.common.kafka.salary.NormTitle": {
"norm_status": null,
"job_title": {
"string": "介護staff 午前中のみOK 週1日,1日4h~ 主婦さん活躍中"
},
"norm_title": {
"string": "介護スタッフ"
},
"norm_language": {
"string": "ja"
},
"display_title": {
"string": "介護スタッフ"
},
"base_level_title": null,
"base_level_display_title": null,
"norm_category": {
"string": "care"
},
"normalization_timestamp": null
}
}
},
"GEOLOCATION": {
"label_data": {
"com.indeed.common.kafka.salary.Location": {
"country": {
"string": "JP"
},
"region1": {
"string": "JPC"
},
"region2": {
"string": "J12"
},
"region3": {
"string": ""
},
"region4": {
"string": "J4122033"
},
"region_name1": null,
"region_name2": null,
"region_name3": null,
"region_name4": null,
"city": {
"string": "市川市大町"
},
"postal_code": null,
"latitude": null,
"longtitude": null,
"msa_code": null,
"msa_name": null
}
}
},
"COMPANY": {
"label_data": {
"com.indeed.common.kafka.salary.Company": {
"fccid": null,
"legacy_id": {
"int": 62673022
},
"companyname": null,
"type": null,
"canonical_fccid": {
"int": 12214443
},
"canonical_companyname": null
}
}
}
}
}
}