JSLT: JSON query & transform
Lars Marius Garshol, lars.marius.garshol@schibsted.com
http://twitter.com/larsga 2018-09-12, JavaZone 2018
Data Platform
2
Data Platform
Batch
Streaming
Pulse
Data volume
3
Routing
• We send data to ~210 different destinations
• Filters on the data specify which data should go
where
• often very detailed conditions on many fields
• Full routing tree has ~600 filter/transform/sink
nodes
4
Transforms
• Because of GDPR, we need to anonymize most incoming data
formats
• Some data has quality issues that cannot be fixed at the
source and requires transforms to solve
• In many cases data needs to be transformed from one format to
another
• Pulse to Amplitude
• Pulse to Adobe Analytics
• ClickMeter to Pulse
• Convert data to match database structures
• …
5
Who configures?
• Schibsted has >100 business units
• for Data Platform to do detailed configuration for all of
these isn’t going to scale
• for sites to do it themselves saves lots of time
• Configuration requires domain knowledge
• each site has its own specialities in Pulse tracking
• to transform and route these correctly requires knowing all
this
6
Batch config: 1 sink
{
"driver": "anyoffilter",
"name": "image-classification",
"rules": [
{ "name": "ImageClassification", "key": "provider.component", "value": "ImageClassification" },
{ "name": "ImageSimilarity", "key": "provider.component", "value": "ImageSimilarity" }
],
"onmatch": [
{
"driver": "cache",
"name": "image-classification",
"level": "memory+disk"
},
{
"driver": "demuxer",
"name": "image-classification",
"rules": "${pulseSdrnFilterUri}",
"parallel": true,
"onmatch": {
"driver": "textfilewriter",
"uri": "${imageSiteUri}",
"numFiles": {
"eventsPerFile": 500000,
"max": ${numExecutors}
}
}
}
],
"consume": true
}
7
Early config was 1838 lines
JSON matching
A real transform
8
What if?
• We had an expression language for JSON
• something like, say, XPath for JSON
• could write routing filters using that
• We had a transformation language for JSON
• write as JSON template, using expression language to
compute values to insert
• A custom routing language for both batch and
streaming, based on this language
• designed for easy expressivity & deploy
9
• jq: an already existing query language for JSON
• https://stedolan.github.io/jq/
• Originally implemented in C
• there is a Java implementation, too
• Can do things like
• .foo
• .foo.bar
• .foo.bar > 25
• …
10
First, fumbling attempt
{
  "event_type" : "View",
  "insert_id" : {"__expr__" : ".object.id"},
  "source" : {"__if__" : {
    "test" : ".source",
    "then" : ".source.id",
    "else" : ".src"
  }}
}
11
Second, fumbling attempt
{
  "event_type" : "View",
  "insert_id" : "${.object.id}",
  "source" : "${if (.source) .source.id else .src}"
}
12
Third attempt
{
  "event_type" : "View",
  "insert_id" : ${ .object.id },
  "source" : if ${ .source } ${ .source.id } else ${ .src }
}
13
Proof-of-concept
• Implement real-world transforms in this language
• before it was implemented
• Helped improve and solidify the design
• Verified that the language could do what we
needed
• Transforms looked quite reasonable.
14
A simple language
• JSON is written in JSON syntax
• evaluates to itself
• if <expr> <expr> else <expr>
• [for <expr> <expr>]
• let <name> = <expr>
• ${ … jq … }
15
Matchers
{
  "event_type" : "View",
  "insert_id" : ${ .object.id },
  * : ${ . }
}
16
{
  "event_type" : "View",
  "insert_id" : ${ .object.id },
  * - "object" : ${ . }
}
Stunt prototype
• Most of it implemented in two days
• Implemented in Scala
• using Antlr 3 to generate the parser
• jackson-jq for jq
• jackson for JSON
• A simple object tree interpreter
• Constructor.construct(Context, JsonNode) => JsonNode
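The object tree interpreter idea can be sketched in a few classes. This is a minimal illustration, not the prototype's code (which was written in Scala): the `Constructor` name comes from the slide, but `PathConstructor`, the simplified truthiness rule, the omitted `Context` parameter, and the use of plain Java maps instead of Jackson's `JsonNode` are all assumptions made to keep the example dependency-free.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the Constructor type named on the slide.
// Plain Java values (Map, String, Long) replace Jackson's JsonNode,
// and the Context parameter is left out to keep the sketch short.
interface Constructor {
    Object construct(Map<String, Object> input);
}

// JSON written in the template evaluates to itself.
final class LiteralConstructor implements Constructor {
    private final Object value;
    LiteralConstructor(Object value) { this.value = value; }
    public Object construct(Map<String, Object> input) { return value; }
}

// Stand-in for a jq path such as .source.id: walks nested maps key by key.
final class PathConstructor implements Constructor {
    private final String[] keys;
    PathConstructor(String... keys) { this.keys = keys; }
    public Object construct(Map<String, Object> input) {
        Object current = input;
        for (String key : keys) {
            if (!(current instanceof Map)) return null; // missing data yields null
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }
}

// Evaluate the condition, then construct either the then or the else branch.
final class IfConstructor implements Constructor {
    private final Constructor test, then, otherwise;
    IfConstructor(Constructor test, Constructor then, Constructor otherwise) {
        this.test = test; this.then = then; this.otherwise = otherwise;
    }
    public Object construct(Map<String, Object> input) {
        Object value = test.construct(input);
        // Simplified truthiness (an assumption): null and false count as false.
        boolean truthy = value != null && !Boolean.FALSE.equals(value);
        return (truthy ? then : otherwise).construct(input);
    }
}

// Create the object, construct each value, add it to the object.
final class ObjectConstructor implements Constructor {
    private final Map<String, Constructor> pairs = new LinkedHashMap<>();
    ObjectConstructor pair(String key, Constructor value) {
        pairs.put(key, value);
        return this;
    }
    public Object construct(Map<String, Object> input) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Constructor> pair : pairs.entrySet()) {
            result.put(pair.getKey(), pair.getValue().construct(input));
        }
        return result;
    }
}

final class TreeInterpreterDemo {
    // The "View" template from the earlier slides, built by hand instead of parsed.
    static Constructor viewTransform() {
        return new ObjectConstructor()
            .pair("event_type", new LiteralConstructor("View"))
            .pair("insert_id", new PathConstructor("object", "id"))
            .pair("source", new IfConstructor(
                new PathConstructor("source"),
                new PathConstructor("source", "id"),
                new PathConstructor("src")));
    }

    public static void main(String[] args) {
        Map<String, Object> object = new LinkedHashMap<>();
        object.put("id", 21323L);
        Map<String, Object> input = new LinkedHashMap<>();
        input.put("object", object);
        input.put("src", "App123");
        System.out.println(viewTransform().construct(input));
    }
}
```

Each node only knows how to construct its own value, so adding a new language construct means adding one more small class.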
17
Object tree
{
  "event_type" : "View",
  "insert_id" : ${ .object.id },
  "source" : if ${ .source } ${ .source.id } else ${ .src }
}
18
ObjectConstructor
PairConstructor(LiteralConstructor)
PairConstructor(JqConstructor)
PairConstructor(IfConstructor(Jq, Jq, Jq))
Literal expression
19
Object expression
20
Create object
Construct value
Add to object
If
21
Evaluate condition
Construct then
Construct else
The parser
• Code that takes a character stream and builds the
expression tree
• Use a parser generator to handle the difficult part
• requires writing a grammar
• Parser generator produces Abstract Syntax Tree
• basically corresponds to the grammar structure
22
Antlr Grammar
grammar Jstl;
WS : [ \t\r\n]+ -> skip ; // ignore whitespace
COMMENT : '//' ~[\r\n]* [\r\n] -> skip ; // ignore comments
STRING : '"' ((~["]) | ('\\"'))+ '"' ;
INT: '-'? [0-9]+ ;
FLOAT: '-'? [0-9]+ '.' [0-9]+ ;
NULL: 'null';
BOOL: 'true' | 'false';
IDENT: [A-Za-z] [_A-Za-z]* ;
JQ : '$' '{' (~[}"] | '"' (~["] | '\\' .)* '"')+ '}' ;
jstl : let* expr EOF ;
expr : object | array | STRING | INT | FLOAT | NULL | BOOL |
JQ | ifvalue | forexpr | parenthesis;
23
Parser
24
Language in use
• Implemented Data Quality Tooling using jq
• filters done in jq
• Implemented routing using jq filters
• and transforms in JSLT
• Wrote some transforms using the language
• anonymization of tracking data
• cleanup transforms to handle bad data
• …
25
The good
• The language works
• proven by DQT, routing, and transforms
• Minimal implementation effort required
• Users approved of the language
• general agreement it was a major improvement
• people started writing their own transforms
26
The bad
• Performance could be better
• not horrible, but not great, either
• The ${ … } wrappers are really ugly
• jq
• does not handle missing data well
• has dangerous features
• has weird and difficult syntax for some things
• Too many dependencies
• Scala runtime (with versioning issues)
• Antlr runtime
27
2.0
• Implement the complete language ourselves
• goodbye ${ … }
• Get rid of the jq strangeness
• Add some new functionality
• Implement in pure Java with JavaCC
• JavaCC has no runtime dependencies
• only dependency is Jackson
28
JSLT expressions
29
.foo                        Get "foo" key from input object
.foo.bar                    Get "foo", then ".bar" on that
.foo == 231                 Comparison
.foo and .bar < 12          Boolean operator
$baz.foo                    Variable reference
test(.foo, "^[a-z0-9]+$")   Functions (& regexps)
JSLT transforms
30
Anonymization
31
Sinks
VG-FrontExperimentsEngagement-1:
  eventType: PulseAnonymized
  filter: get-client(.) == "vg" and ."@type" == "Engagement" and
          contains(.object."@type", ["Article", "SalesPoster"]) and
          (contains("df-86-", .origin.terms) or
           contains("df-86-", .object."spt:custom".terms))
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
32
Expressions
• + - / *
• and or
• > < <= != == >=
• literals
• variable references
• function calls (+ function library)
33
Dealing with missing data
• 2 + null => null
• size(null) => null
• number("12") => 12
• number(null) => null
34
Operators
35
Evaluate left & right
Do the operation
Numeric operators
36
Null handling
Convert to numbers
or error
Decide if int or float
The actual operations
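The four steps listed for numeric operators can be sketched as a single method. This is an illustration under stated assumptions, not the JSLT source: `NumericMinus` and `minus` are hypothetical names, and plain `Long`/`Double` values stand in for Jackson's numeric nodes.

```java
// Hypothetical sketch of a numeric operator; Long and Double replace
// Jackson's numeric node types.
final class NumericMinus {

    // Returns null, a Long, or a Double.
    static Number minus(Object left, Object right) {
        // 1. Null handling: a missing operand makes the whole result null.
        if (left == null || right == null) return null;

        // 2. Convert to numbers, or fail with an error.
        if (!(left instanceof Number) || !(right instanceof Number)) {
            throw new IllegalArgumentException(
                "Cannot subtract non-numbers: " + left + " - " + right);
        }
        Number l = (Number) left;
        Number r = (Number) right;

        // 3. Decide if the result is an integer or a float.
        if (isIntegral(l) && isIntegral(r)) {
            // 4. The actual operation, on integers.
            return l.longValue() - r.longValue();
        }
        // 4. The actual operation, on floats.
        return l.doubleValue() - r.doubleValue();
    }

    private static boolean isIntegral(Number n) {
        return n instanceof Long || n instanceof Integer;
    }

    public static void main(String[] args) {
        System.out.println(minus(5L, 2L));   // integer result
        System.out.println(minus(5L, 0.5));  // float result
        System.out.println(minus(2L, null)); // null propagates
    }
}
```

Returning null instead of throwing on a missing operand is what lets a transform keep running on incomplete events.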
Minus
37
Object for expressions
38
Transform:
{
  "event_type" : "View",
  "insert_id" : .object.id,
  "source" : if (.source) .source.id else .src
} + {
  for (.custom)
    "custom_" + .key : .value
}
Input:
{
  "object" : {"id" : 21323, … },
  "src" : "App123",
  "custom" : {
    "time" : 4592593492,
    "event": 3433
  }
}
Output:
{
  "event_type" : "View",
  "insert_id" : 21323,
  "source" : "App123",
  "custom_time": 4592593492,
  "custom_event" : 3433
}
Function declarations
def <name>(<param1>, <param2>)
<let>*
<expr>
Very easy to implement
Very useful
But means going Turing-complete…
39
A real function
40
Implementation
41
Turing-complete?
• Means that the language can express any
computation
• It’s known that all that’s required is
• conditionals (we have if tests)
• recursion (our functions can call themselves)
• But can this really be true?
42
N-queens
• Write a function that takes the size of the
chessboard and returns it with queens
• queens(4) =>
[
[ 0, 1, 0, 0 ],
[ 0, 0, 0, 1 ],
[ 1, 0, 0, 0 ],
[ 0, 0, 1, 0 ]
]
https://github.com/schibsted/jslt/blob/master/examples/queens.jslt
43
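To show that conditionals plus recursion really are enough, here is a sketch of the same problem in Java where the search itself uses only `if` tests and self-calls. It is not a port of queens.jslt: the class and method names are hypothetical, the sketch mutates an array (which JSLT cannot do), and a plain loop is used at the end purely to lay out the board.

```java
import java.util.Arrays;

// Hypothetical sketch: the search uses only conditionals and recursion.
final class Queens {

    // Returns an n x n board with 1 marking each queen, or null if no solution exists.
    static int[][] queens(int n) {
        int[] cols = new int[n]; // cols[row] = column of the queen in that row
        if (!place(0, n, cols)) return null;
        int[][] board = new int[n][n];
        for (int row = 0; row < n; row++) board[row][cols[row]] = 1;
        return board;
    }

    // Place queens in rows row..n-1; true when every row has a queen.
    private static boolean place(int row, int n, int[] cols) {
        if (row == n) return true;
        return tryColumn(row, 0, n, cols);
    }

    // Try columns col..n-1 for this row, backtracking by recursion.
    private static boolean tryColumn(int row, int col, int n, int[] cols) {
        if (col == n) return false;
        if (safe(row, col, 0, cols)) {
            cols[row] = col;
            if (place(row + 1, n, cols)) return true;
        }
        return tryColumn(row, col + 1, n, cols);
    }

    // Check the candidate square against the queens in rows r..row-1.
    private static boolean safe(int row, int col, int r, int[] cols) {
        if (r == row) return true;
        int c = cols[r];
        if (c == col || Math.abs(c - col) == row - r) return false;
        return safe(row, col, r + 1, cols);
    }

    public static void main(String[] args) {
        // Prints [[0, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0]]
        System.out.println(Arrays.deepToString(queens(4)));
    }
}
```

Trying columns left to right finds the same first solution for a 4x4 board as the one shown on the slide.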
Getting started
44
Danger?
• It’s possible to implement operations that run
forever
• But in practice the stack quickly gets too deep
• The JVM will then terminate the transform
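The JVM behaviour this relies on can be demonstrated directly. The sketch below is plain Java, not JSLT: `RunawayRecursion` is a hypothetical name, and how JSLT itself reports the failure to the caller is not covered in the slides.

```java
// Hypothetical sketch: unbounded recursion ends in a StackOverflowError,
// which the host application can catch instead of hanging forever.
final class RunawayRecursion {

    // Never returns normally: every call makes another call.
    static long forever(long n) {
        return forever(n + 1) + 1;
    }

    static String runGuarded() {
        try {
            forever(0);
            return "finished";
        } catch (StackOverflowError e) {
            // The stack got too deep, so the JVM stopped the computation.
            return "terminated: stack too deep";
        }
    }

    public static void main(String[] args) {
        System.out.println(runGuarded());
    }
}
```

Note that this only covers runaway recursion; it says nothing about transforms that are merely slow.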
45
Performance
• 5-10 times faster than 1.0
• The main difference: no more jackson-jq
• jackson-jq is not very efficient
• internal model is List<JsonNode>
• creates many unnecessary objects during evaluation
• does work at run-time that should be done at compile-time
46
JSLT improvements
• Value model is JsonNode
• can usually just return data from input object or from code
• Efficient internal structures
• all collections are arrays
• very fast to traverse
• Boolean short-circuiting
• once we know the result, stop evaluating
• Cache regular expressions to avoid recompiling
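Two of these improvements are easy to sketch in isolation. The code below is an illustration, not JSLT's implementation: `EvalTricks`, its method names, and the `compilations` counter are hypothetical, and matching with `find()` rather than a full match is an assumption about how `test` behaves.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;
import java.util.regex.Pattern;

// Hypothetical sketch of regexp caching and boolean short-circuiting.
final class EvalTricks {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Demo-only counter of real compilations (not thread-safe).
    static int compilations = 0;

    // Like test(.foo, "regex"): compile each distinct regex only once.
    static boolean test(String input, String regex) {
        if (input == null) return false; // missing data does not match
        Pattern pattern = CACHE.computeIfAbsent(regex, r -> {
            compilations++;
            return Pattern.compile(r);
        });
        return pattern.matcher(input).find();
    }

    // Short-circuiting "and": once the left side is false the result is
    // known, so the right side is never evaluated.
    static boolean and(BooleanSupplier left, BooleanSupplier right) {
        if (!left.getAsBoolean()) return false;
        return right.getAsBoolean();
    }

    public static void main(String[] args) {
        System.out.println(test("abc123", "^[a-z0-9]+$"));
        System.out.println(test("ABC", "^[a-z0-9]+$"));
        System.out.println("compilations: " + compilations);
    }
}
```

When the same filter runs on every event, the regexp is compiled on the first event only; every later event reuses the cached `Pattern`.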
47
The optimizer
• An optimizer is a function that takes an expression
and outputs an expression such that
• the new expression is at least as fast, and
• always outputs the same value
• Improves performance quite substantially even with
very simple techniques
48
Constant folding
contains(get-client(.), ["vg", "aftenposten", "bt"])
49
Before:
FunctionExpression
  FunctionExpression
    DotExpression
  ArrayExpression
    LiteralExpression
    LiteralExpression
    LiteralExpression
After:
FunctionExpression
  FunctionExpression
    DotExpression
  LiteralExpression
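Constant folding on a tree like this can be sketched with an `optimize()` method on each node. This is a minimal illustration, not the JSLT optimizer: the class names are hypothetical, plain Java maps and lists replace `JsonNode`, and a simple `.client` lookup stands in for the `get-client(.)` call because that function's definition is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical expression tree; optimize() returns an equivalent,
// possibly cheaper expression.
interface Expr {
    Object eval(Map<String, Object> input);
    default Expr optimize() { return this; }
}

final class Literal implements Expr {
    final Object value;
    Literal(Object value) { this.value = value; }
    public Object eval(Map<String, Object> input) { return value; }
}

// Stand-in for get-client(.): depends on the input, so it cannot be folded.
final class Dot implements Expr {
    final String key;
    Dot(String key) { this.key = key; }
    public Object eval(Map<String, Object> input) { return input.get(key); }
}

final class ArrayExpr implements Expr {
    final List<Expr> items;
    ArrayExpr(List<Expr> items) { this.items = items; }
    public Object eval(Map<String, Object> input) {
        List<Object> values = new ArrayList<>();
        for (Expr item : items) values.add(item.eval(input));
        return values;
    }
    public Expr optimize() {
        List<Expr> optimized = new ArrayList<>();
        boolean allConstant = true;
        for (Expr item : items) {
            Expr o = item.optimize();
            optimized.add(o);
            allConstant &= o instanceof Literal;
        }
        ArrayExpr result = new ArrayExpr(optimized);
        // All items are literals: build the array once, at compile time.
        // Literals ignore their input, so passing null is safe here.
        return allConstant ? new Literal(result.eval(null)) : result;
    }
}

final class ContainsExpr implements Expr {
    final Expr element, sequence;
    ContainsExpr(Expr element, Expr sequence) {
        this.element = element; this.sequence = sequence;
    }
    public Object eval(Map<String, Object> input) {
        Object seq = sequence.eval(input);
        return seq instanceof List && ((List<?>) seq).contains(element.eval(input));
    }
    public Expr optimize() {
        ContainsExpr result = new ContainsExpr(element.optimize(), sequence.optimize());
        boolean constant = result.element instanceof Literal
            && result.sequence instanceof Literal;
        return constant ? new Literal(result.eval(null)) : result;
    }
}

final class FoldingDemo {
    // Roughly contains(.client, ["vg", "aftenposten", "bt"])
    static Expr clientFilter() {
        return new ContainsExpr(new Dot("client"), new ArrayExpr(Arrays.<Expr>asList(
            new Literal("vg"), new Literal("aftenposten"), new Literal("bt"))));
    }
}
```

After `optimize()`, the array is a single literal that is built once rather than once per event, while the part that depends on the input is left untouched.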
Implementation
50
Performance
• Test case: pulse-cleanup.jslt, real data, my laptop
• a complicated transform: 165 lines
• Transforms 132,000 events/second in one thread
• 1.0 did about 20,000 events/second
51
Three strategies
• Syntax tree interpreter
• known to be the slowest approach
• Bytecode compiler with virtual machine
• C version of jq does this
• Java does this (until the JIT kicks in)
• Python does this
• Native compilation
• what JIT compiler in Java does
52
Designing a VM
53
Opcode Param
DUP
MKOBJ
CALL <func>
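int[] bytecode;     // the compiled program
JsonNode[] stack;   // operand stack
int top;            // index of the topmost stack element
switch (opcode) {
  case DUP:
    // push a copy of the current top; Java evaluates the index ++top
    // first, so top-1 on the right-hand side refers to the old top
    stack[++top] = stack[top-1];
    break;
  case MKOBJ:
    stack[++top] = mapper.createObj…
    break;
  // …
}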
Compiler
• Traverse down the object tree
• emit bytecode as you go
• Stack nature of the VM matches object tree
structure
• each Expression produces code that leaves the value of
that expression on the stack
• Example:
• MKARR, <first value>, ARRADD, <second>, ARRADD, …
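The MKARR/ARRADD example can be sketched as a tiny VM plus the compiler step that emits its bytecode. This is an illustration under assumptions, not the prototype: `TinyVm`, the `PUSH` opcode with a constant-pool operand, the fixed stack size, and the use of `Object` instead of `JsonNode` are all made up for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stack VM with just enough opcodes to build an array.
final class TinyVm {
    static final int PUSH = 0;   // operand: index into the constant pool
    static final int MKARR = 1;  // push a new empty array
    static final int ARRADD = 2; // pop a value, append it to the array below it
    static final int DUP = 3;    // push a copy of the top element

    @SuppressWarnings("unchecked")
    static Object run(int[] bytecode, Object[] constants) {
        Object[] stack = new Object[16]; // fixed size is enough for the sketch
        int top = -1;
        int pc = 0;
        while (pc < bytecode.length) {
            int opcode = bytecode[pc++];
            switch (opcode) {
                case PUSH:
                    stack[++top] = constants[bytecode[pc++]];
                    break;
                case MKARR:
                    stack[++top] = new ArrayList<Object>();
                    break;
                case ARRADD: {
                    Object value = stack[top--];
                    ((List<Object>) stack[top]).add(value);
                    break;
                }
                case DUP: {
                    Object value = stack[top];
                    stack[++top] = value;
                    break;
                }
                default:
                    throw new IllegalStateException("bad opcode " + opcode);
            }
        }
        return stack[top]; // the expression's value is left on the stack
    }

    // Compiler step for an array literal with `count` constant elements:
    // MKARR, PUSH 0, ARRADD, PUSH 1, ARRADD, ...
    static int[] compileArray(int count) {
        int[] code = new int[1 + 3 * count];
        int i = 0;
        code[i++] = MKARR;
        for (int k = 0; k < count; k++) {
            code[i++] = PUSH;
            code[i++] = k;
            code[i++] = ARRADD;
        }
        return code;
    }

    public static void main(String[] args) {
        Object[] constants = {"a", "b", "c"};
        System.out.println(Arrays.toString(compileArray(3)));
        System.out.println(run(compileArray(3), constants));
    }
}
```

Because every sub-expression leaves exactly one value on the stack, the compiler for a parent node can simply emit its children's code in order and then its own opcode.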
54
Prototype
• Stunt implemented over a couple of days
• Depressing result: object tree interpreter ~40%
faster
• Anthony Sparks tried the same thing
• original VM implementation 5x slower
• eventually managed to achieve performance parity
• So far: performance does not justify complexity
55
Java bytecode?
• The JVM is actually a stack-based VM
• can simply compile to Java bytecode instead
• Tricky to learn tools for generating bytecode
• no examples, very little documentation
• In the end decided to use the Asm library
• not very nice to use
• very primitive API
• crashes with NullPointerException on bad bytecode
56
Compiler
57
Compile dot accessor
58
Results
• Hard work to build
• many surprising issues in Java bytecode
• Performance boost of 15-25%
• code lives on the jvm-bytecode branch on GitHub
• Ideas for how it could be even faster…
• through type inference
59
Type inference benefits
"sdrn:" + $namespace + ":" + $rType + ":" + $rId
Plus
Plus(“sdrn” $namespace)
Plus(
Plus(“:” $rType)
Plus(“:” $rId)
)
60
Plus:
JsonNode -> String
JsonNode -> String
String + String
new String -> new JsonNode
Will make 4 unnecessary TextNode objects
Will wrap and unwrap String repeatedly
Will check types unnecessarily
Solution
• + operator can ask both sides: what type will you
produce?
• If one side says “string” then the result will be a string
• When compiling, do compile(generator, String)
• will compile code that produces a Java String object
• + operator will make a new String if that’s what’s
wanted
• or turn it into a TextNode if the context wants Any
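The saving can be illustrated with a runtime analogue of the idea: let the caller say whether it wants a raw String or a wrapped node. This is a sketch, not the compiler described above, which would make this choice at compile time. `TextNodeStub` is a hypothetical stand-in for Jackson's `TextNode` with an allocation counter, and the values used for `$namespace`, `$rType` and `$rId` are made up.

```java
// Hypothetical stand-in for Jackson's TextNode that counts allocations.
final class TextNodeStub {
    static int created = 0;
    final String text;
    TextNodeStub(String text) { this.text = text; created++; }
}

interface TypedExpr {
    TextNodeStub evalNaive(); // no type information: always wrap
    String evalString();      // caller wants a Java String
    TextNodeStub evalNode();  // caller wants "Any", so wrap once at the end
}

// A literal or variable whose value already exists as a node.
final class Value implements TypedExpr {
    private final TextNodeStub node;
    Value(String text) { this.node = new TextNodeStub(text); }
    public TextNodeStub evalNaive() { return node; }
    public String evalString() { return node.text; }
    public TextNodeStub evalNode() { return node; }
}

final class Plus implements TypedExpr {
    private final TypedExpr left, right;
    Plus(TypedExpr left, TypedExpr right) { this.left = left; this.right = right; }

    // Unwraps both sides, concatenates, wraps again: one node per Plus.
    public TextNodeStub evalNaive() {
        return new TextNodeStub(left.evalNaive().text + right.evalNaive().text);
    }

    // Both sides are asked for Strings, so no intermediate nodes are made.
    public String evalString() {
        return left.evalString() + right.evalString();
    }

    // Only the outermost Plus wraps the final String.
    public TextNodeStub evalNode() {
        return new TextNodeStub(evalString());
    }
}

final class ConcatDemo {
    // "sdrn:" + $namespace + ":" + $rType + ":" + $rId with made-up values.
    static TypedExpr sdrn() {
        TypedExpr e = new Value("sdrn:");
        for (String part : new String[] {"vg", ":", "article", ":", "42"}) {
            e = new Plus(e, new Value(part));
        }
        return e;
    }

    public static void main(String[] args) {
        TypedExpr e = sdrn();
        TextNodeStub.created = 0;
        e.evalNaive();
        System.out.println("naive nodes: " + TextNodeStub.created);
        TextNodeStub.created = 0;
        e.evalNode();
        System.out.println("typed nodes: " + TextNodeStub.created);
    }
}
```

With five `+` operators the naive path allocates five nodes and the typed path one, which matches the four unnecessary objects counted above.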
61
Freedom from Jackson
• The current codebase is bound to Jackson
• JVM bytecode compilation might be a way to
escape that
• Could build compilers that can interface with
different JSON representations
• Have ideas for a more efficient JSON representation
• basically encode everything as arrays of ints
• should save memory, GC, and produce faster code
62
Freedom from JSON
• If we aren’t bound to Jackson, why should we be
bound to JSON?
• Could support Avro, too
• Perhaps also other formats
63
Conclusion
Internal status
• JSLT now used in
• Data Quality Tooling (to express tests on data)
• routing filters
• transforms
• In Schibsted we have
• 52 transforms, 2370 lines of code
• written by many people in different parts of the company
• Data Platform runs ~11 billion transforms/day
65
Open source status
• Released in June
• People are using it for real
• one certain case, several more examples
• details unknown
• Useful contributions from outsiders
• several bug fixes to datetime/number handling
• Two alternative implementations being worked on
• one in .NET
• one is virtual machine-based in Java
66
Lessons learned
• A custom language can make life much simpler
• if it fits the use case well
• Implementing a language is easier than it seems
• basically doable in a week
• Designing a language is not easy
• unfortunately
67
https://github.com/schibsted/jslt
Slides at https://www.slideshare.net/larsga
Alexander Romero Arosquipa
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
Decision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdfDecision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdf
Saikat Basu
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptxISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
pankaj6188303
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Modern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx AaModern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx Aa
MuhammadAwaisKamboh
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Process Mining and Data Science in the Financial Industry
Process Mining and Data Science in the Financial IndustryProcess Mining and Data Science in the Financial Industry
Process Mining and Data Science in the Financial Industry
Process mining Evangelist
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
Decision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdfDecision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdf
Saikat Basu
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptxISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
pankaj6188303
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Modern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx AaModern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx Aa
MuhammadAwaisKamboh
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Process Mining and Data Science in the Financial Industry
Process Mining and Data Science in the Financial IndustryProcess Mining and Data Science in the Financial Industry
Process Mining and Data Science in the Financial Industry
Process mining Evangelist
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 

JSLT: JSON querying and transformation

  • 1. JSLT: JSON query & transform Lars Marius Garshol, lars.marius.garshol@schibsted.com http://twitter.com/larsga 2018–09–12, JavaZone 2018
  • 4. Routing • We send data to ~210 different destinations • Filters on the data specify which data should go where • often very detailed conditions on many fields • Full routing tree has ~600 filter/transform/sink nodes
  • 5. Transforms • Because of GDPR we need to anonymize most incoming data formats • Some data has data quality issues that cannot be fixed at source, and requires transforms to solve • In many cases data needs to be transformed from one format to another • Pulse to Amplitude • Pulse to Adobe Analytics • ClickMeter to Pulse • Convert data to match database structures • …
  • 6. Who configures? • Schibsted has >100 business units • for Data Platform to do detailed configuration for all of these isn’t going to scale • for sites to do it themselves saves lots of time • Configuration requires domain knowledge • each site has its own specialities in Pulse tracking • to transform and route these correctly requires knowing all this
  • 7. Batch config: 1 sink • { "driver": "anyoffilter", "name": "image-classification", "rules": [ { "name": "ImageClassification", "key": "provider.component", "value": "ImageClassification" }, { "name": "ImageSimilarity", "key": "provider.component", "value": "ImageSimilarity" } ], "onmatch": [ { "driver": "cache", "name": "image-classification", "level": "memory+disk" }, { "driver": "demuxer", "name": "image-classification", "rules": "${pulseSdrnFilterUri}", "parallel": true, "onmatch": { "driver": "textfilewriter", "uri": "${imageSiteUri}", "numFiles": { "eventsPerFile": 500000, "max": ${numExecutors} } } } ], "consume": true } • Early config was 1838 lines • JSON matching
  • 9. What if? • We had an expression language for JSON • something like, say, XPath for JSON • could write routing filters using that • We had a transformation language for JSON • write as JSON template, using expression language to compute values to insert • A custom routing language for both batch and streaming, based on this language • designed for easy expressivity & deploy
  • 10. jq • Already existing query language for JSON • https://stedolan.github.io/jq/ • Originally implemented in C • there is a Java implementation, too • Can do things like • .foo • .foo.bar • .foo.bar > 25 • …
  • 11. First, fumbling attempt • { "event_type" : "View", "insert_id" : {"__expr__" : ".object.id"}, "source" : {"__if__" : { "test" : ".source", "then" : ".source.id", "else" : ".src" } } }
  • 12. Second, fumbling attempt • { "event_type" : "View", "insert_id" : "${.object.id}", "source" : "${if (.source) .source.id else .src}" }
  • 13. Third attempt • { "event_type" : "View", "insert_id" : ${ .object.id }, "source" : if ${ .source } ${ .source.id } else ${ .src } }
  • 14. Proof-of-concept • Implement real-world transforms in this language • before it was implemented • Helped improve and solidify the design • Verified that the language could do what we needed • Transforms looked quite reasonable.
  • 15. A simple language • JSON is written in JSON syntax • evaluates to itself • if <expr> <expr> else <expr> • [for <expr> <expr>] • let <name> = <expr> • ${ … jq … }
  • 16. Matchers • { "event_type" : "View", "insert_id" : ${ .object.id }, * : ${ . } } • { "event_type" : "View", "insert_id" : ${ .object.id }, * - "object" : ${ . } }
  • 17. Stunt prototype • Most of it implemented in two days • Implemented in Scala • using Antlr 3 to generate the parser • jackson-jq for jq • jackson for JSON • A simple object tree interpreter • Constructor.construct(Context, JsonNode) => JsonNode
  • 18. Object tree • { "event_type" : "View", "insert_id" : ${ .object.id }, "source" : if ${ .source } ${ .source.id } else ${ .src } } • ObjectConstructor • PairConstructor(LiteralConstructor) • PairConstructor(JqConstructor) • PairConstructor(IfConstructor(Jq, Jq, Jq))
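The object tree on this slide can be sketched in a few lines of plain Java. This is only an illustration, not the prototype's code: `Expr`, `Literal`, `PathExpr`, `IfExpr`, `ObjectExpr` and `TreeDemo` are hypothetical names, plain `Map`/`String`/`Number` values stand in for Jackson's `JsonNode`, the `Context` argument is left out, and `PathExpr` is a simplified stand-in for the embedded jq expressions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Assumption: plain Java values (Map, String, Number) stand in for Jackson's JsonNode,
// and the Context argument of construct() is omitted to keep the sketch short.
interface Expr {
    Object eval(Object input);
}

// A JSON literal evaluates to itself.
final class Literal implements Expr {
    private final Object value;
    Literal(Object value) { this.value = value; }
    public Object eval(Object input) { return value; }
}

// Hypothetical stand-in for a jq path such as .source.id: follows the keys in order.
final class PathExpr implements Expr {
    private final String[] keys;
    PathExpr(String... keys) { this.keys = keys; }
    public Object eval(Object input) {
        Object current = input;
        for (String key : keys) {
            if (!(current instanceof Map)) return null; // missing data yields null
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }
}

// if <test> <then> else <otherwise>; only null and false count as false in this sketch.
final class IfExpr implements Expr {
    private final Expr test, then, otherwise;
    IfExpr(Expr test, Expr then, Expr otherwise) {
        this.test = test; this.then = then; this.otherwise = otherwise;
    }
    public Object eval(Object input) {
        Object result = test.eval(input);
        boolean truthy = result != null && !Boolean.FALSE.equals(result);
        return (truthy ? then : otherwise).eval(input);
    }
}

// Builds an output object by evaluating one child expression per key.
final class ObjectExpr implements Expr {
    private final Map<String, Expr> pairs = new LinkedHashMap<>();
    ObjectExpr pair(String key, Expr value) { pairs.put(key, value); return this; }
    public Object eval(Object input) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Expr> pair : pairs.entrySet()) {
            out.put(pair.getKey(), pair.getValue().eval(input));
        }
        return out;
    }
}

final class TreeDemo {
    // The tree for the template shown on the slide.
    static Expr viewTemplate() {
        return new ObjectExpr()
            .pair("event_type", new Literal("View"))
            .pair("insert_id", new PathExpr("object", "id"))
            .pair("source", new IfExpr(new PathExpr("source"),
                                       new PathExpr("source", "id"),
                                       new PathExpr("src")));
    }

    public static void main(String[] args) {
        Map<String, Object> object = new LinkedHashMap<>();
        object.put("id", 21323);
        Map<String, Object> input = new LinkedHashMap<>();
        input.put("object", object);
        input.put("src", "App123");
        System.out.println(viewTemplate().eval(input));
    }
}
```

Each node only knows how to evaluate itself against the input, which is why such an interpreter is quick to write: adding a language construct means adding one small class.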
  • 22. The parser • Code that takes a character stream and builds the expression tree • Use a parser generator to handle the difficult part • requires writing a grammar • Parser generator produces Abstract Syntax Tree • basically corresponds to the grammar structure
  • 23. Antlr Grammar
      grammar Jstl;
      WS : [ \t\r\n]+ -> skip ; // ignore whitespace
      COMMENT : '//' ~[\r\n]* [\r\n] -> skip ; // ignore comments
      STRING : '"' ((~["]) | ('\\"'))+ '"' ;
      INT: '-'? [0-9]+ ;
      FLOAT: '-'? [0-9]+ '.' [0-9]+ ;
      NULL: 'null';
      BOOL: 'true' | 'false';
      IDENT: [A-Za-z] [_A-Za-z]* ;
      JQ : '$' '{' (~[}"] | '"' (~["] | '\\' .)* '"')+ '}' ;
      jstl : let* expr EOF ;
      expr : object | array | STRING | INT | FLOAT | NULL | BOOL | JQ | ifvalue | forexpr | parenthesis;
  • 25. Language in use • Implemented Data Quality Tooling using jq • filters done in jq • Implemented routing using jq filters • and transforms in JSLT • Wrote some transforms using the language • anonymization of tracking data • cleanup transforms to handle bad data • …
  • 26. The good • The language works • proven by DQT, routing, and transforms • Minimal implementation effort required • Users approved of the language • general agreement it was a major improvement • people started writing their own transforms
  • 27. The bad • Performance could be better • not horrible, but not great, either • The ${ … } wrappers are really ugly • jq • does not handle missing data well • has dangerous features • has weird and difficult syntax for some things • Too many dependencies • Scala runtime (with versioning issues) • Antlr runtime
  • 28. 2.0 • Implement the complete language ourselves • goodbye ${ … } • Get rid of the jq strangeness • Add some new functionality • Implement in pure Java with JavaCC • JavaCC has no runtime dependencies • only dependency is Jackson
  • 29. JSLT expressions • .foo (get "foo" key from input object) • .foo.bar (get "foo", then ".bar" on that) • .foo == 231 (comparison) • .foo and .bar < 12 (Boolean operator) • $baz.foo (variable reference) • test(.foo, "^[a-z0-9]+$") (functions & regexps)
  • 32. Sinks • VG-FrontExperimentsEngagement-1: eventType: PulseAnonymized filter: get-client(.) == "vg" and ."@type" == "Engagement" and contains(.object."@type", ["Article", "SalesPoster"]) and (contains("df-86-", .origin.terms) or contains("df-86-", .object."spt:custom".terms)) transform: transforms/vg-article-views.jslt kinesis: arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
  • 33. Expressions • + - / * • and or • > < <= != == >= • literals • variable references • function calls (+ function library)
  • 34. Dealing with missing data • 2 + null => null • size(null) => null • number("12") => 12 • number(null) => null
  • 35. Operators • Evaluate left & right • Do the operation
  • 36. Numeric operators • Null handling • Convert to numbers or error • Decide if int or float • The actual operations
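The four steps listed for numeric operators, together with the `2 + null => null` rule from the missing-data slide, could look roughly like this in Java. This is a sketch under assumptions, not JSLT's implementation: `NumericOps` is a hypothetical name, plain `Number` objects stand in for Jackson's numeric nodes, and only numeric operands are handled.

```java
// Assumption: plain Java Numbers stand in for Jackson's numeric nodes.
final class NumericOps {
    static Number add(Object left, Object right) {
        // 1. Null handling: a null operand makes the whole result null.
        if (left == null || right == null) return null;
        // 2. Convert to numbers or error.
        Number l = toNumber(left);
        Number r = toNumber(right);
        // 3. Decide if int or float, then 4. do the actual operation.
        if (isIntegral(l) && isIntegral(r)) {
            return Long.valueOf(l.longValue() + r.longValue());
        }
        return Double.valueOf(l.doubleValue() + r.doubleValue());
    }

    private static Number toNumber(Object value) {
        if (value instanceof Number) return (Number) value;
        throw new IllegalArgumentException("Not a number: " + value);
    }

    private static boolean isIntegral(Number n) {
        return n instanceof Integer || n instanceof Long
            || n instanceof Short || n instanceof Byte;
    }

    public static void main(String[] args) {
        System.out.println(add(2, null));
        System.out.println(add(2, 3));
        System.out.println(add(2, 0.5));
    }
}
```

Returning null instead of throwing on a missing operand means a transform keeps running when a field is absent, which is the behaviour the missing-data slide asks for.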
  • 38. Object for expressions • Template: { "event_type" : "View", "insert_id" : .object.id, "source" : if (.source) .source.id else .src } + { for (.custom) "custom_" + .key : .value } • Input: { "object" : {"id" : 21323, … }, "src" : "App123", "custom" : { "time" : 4592593492, "event": 3433 } } • Output: { "event_type" : "View", "insert_id" : 21323, "source" : "App123", "custom_time": 4592593492, "custom_event" : 3433 }
  • 39. Function declarations • def <name>(<param1>, <param2>) <let>* <expr> • Very easy to implement • Very useful • But means going Turing-complete…
  • 42. Turing-complete? • Means that the language can express any computation • It’s known that all that’s required is • conditionals (we have if tests) • recursion (our functions can call themselves) • But can this really be true?
  • 43. N-queens • Write a function that takes the size of the chessboard and returns it with queens • queens(4) => [ [ 0, 1, 0, 0 ], [ 0, 0, 0, 1 ], [ 1, 0, 0, 0 ], [ 0, 0, 1, 0 ] ] • https://github.com/schibsted/jslt/blob/master/examples/queens.jslt
  • 45. Danger? • It’s possible to implement operations that run forever • But in practice the stack quickly gets too deep • The JVM will then terminate the transform
  • 46. Performance • 5-10 times faster than 1.0 • The main difference: no more jackson-jq • jackson-jq is not very efficient • internal model is List<JsonNode> • creates many unnecessary objects during evaluation • does work at run-time that should be done at compile-time
  • 47. JSLT improvements • Value model is JsonNode • can usually just return data from input object or from code • Efficient internal structures • all collections are arrays • very fast to traverse • Boolean short-circuiting • once we know the result, stop evaluating • Cache regular expressions to avoid recompiling
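Caching regular expressions, the last improvement in the list above, can be sketched with the standard `java.util.regex` and `ConcurrentHashMap` classes. The class name `RegexCache` and its methods are hypothetical, and the choices to return false for null input and to match anywhere in the string are assumptions made for this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical helper: compiles each distinct regexp once and reuses the Pattern.
final class RegexCache {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Assumption for this sketch: null input gives false, and the regexp
    // may match anywhere in the input (find) rather than the whole string.
    static boolean test(String input, String regex) {
        if (input == null) return false;
        Pattern pattern = CACHE.computeIfAbsent(regex, Pattern::compile);
        return pattern.matcher(input).find();
    }

    static int cachedPatterns() {
        return CACHE.size();
    }

    public static void main(String[] args) {
        System.out.println(test("abc123", "^[a-z0-9]+$"));
        System.out.println(test("ABC", "^[a-z0-9]+$"));
        System.out.println(cachedPatterns());
    }
}
```

A transform that applies the same regexp to every event then pays the compilation cost once rather than once per event.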
  • 48. The optimizer • An optimizer is a function that takes an expression and outputs an expression such that • the new expression is at least as fast, and • always outputs the same value • Improves performance quite substantially even with very simple techniques
  • 49. Constant folding • contains(get-client(.), ["vg", "aftenposten", "bt"]) • Before: FunctionExpression( FunctionExpression( DotExpression ), ArrayExpression( LiteralExpression, LiteralExpression, LiteralExpression ) ) • After: FunctionExpression( FunctionExpression( DotExpression ), LiteralExpression )
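A rough Java sketch of constant folding on a small expression tree follows. The class names (`Node`, `Lit`, `Dot`, `ArrayNode`, `ContainsNode`) are hypothetical and do not come from the JSLT codebase, plain Java values replace `JsonNode`, and the `get-client(.)` call from the slide is replaced by a bare `.` to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

interface Node {
    Object eval(Object input);
    Node optimize(); // returns an equivalent node that is at least as fast
}

final class Lit implements Node {
    final Object value;
    Lit(Object value) { this.value = value; }
    public Object eval(Object input) { return value; }
    public Node optimize() { return this; }
}

// The "." expression: returns the input itself, so it can never be folded.
final class Dot implements Node {
    public Object eval(Object input) { return input; }
    public Node optimize() { return this; }
}

final class ArrayNode implements Node {
    final List<Node> items;
    ArrayNode(List<Node> items) { this.items = items; }
    public Object eval(Object input) {
        List<Object> out = new ArrayList<>();
        for (Node item : items) out.add(item.eval(input));
        return out;
    }
    public Node optimize() {
        List<Node> optimized = new ArrayList<>();
        boolean allLiterals = true;
        for (Node item : items) {
            Node o = item.optimize();
            optimized.add(o);
            allLiterals &= o instanceof Lit;
        }
        Node result = new ArrayNode(optimized);
        // No dependence on the input: evaluate once now and keep the value.
        return allLiterals ? new Lit(result.eval(null)) : result;
    }
}

// contains(element, array)
final class ContainsNode implements Node {
    final Node element, array;
    ContainsNode(Node element, Node array) { this.element = element; this.array = array; }
    public Object eval(Object input) {
        Object haystack = array.eval(input);
        return haystack instanceof List && ((List<?>) haystack).contains(element.eval(input));
    }
    public Node optimize() {
        Node e = element.optimize();
        Node a = array.optimize();
        Node result = new ContainsNode(e, a);
        return (e instanceof Lit && a instanceof Lit) ? new Lit(result.eval(null)) : result;
    }
}

final class FoldDemo {
    public static void main(String[] args) {
        List<Node> items = new ArrayList<>();
        items.add(new Lit("vg"));
        items.add(new Lit("aftenposten"));
        items.add(new Lit("bt"));
        Node optimized = new ContainsNode(new Dot(), new ArrayNode(items)).optimize();
        System.out.println(optimized.eval("vg"));
    }
}
```

After `optimize()`, the array of three literals has become a single literal, so the list is built once at compile time instead of once per event.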
  • 51. Performance • Test case: pulse-cleanup.jslt, real data, my laptop • a complicated transform: 165 lines • Transforms 132,000 events/second in one thread • 1.0 did about 20,000 events/second
  • 52. Three strategies • Syntax tree interpreter • known to be the slowest approach • Bytecode compiler with virtual machine • C version of jq does this • Java does this (until the JIT kicks in) • Python does this • Native compilation • what JIT compiler in Java does
  • 53. Designing a VM • Opcode (Param): DUP, MKOBJ, CALL (<func>) • int[] bytecode; JsonNode[] stack; int top; switch (opcode) { case DUP: stack[++top] = stack[top-1]; break; case MKOBJ: stack[++top] = mapper.createObj… break; // … }
  • 54. Compiler • Traverse down the object tree • emit bytecode as you go • Stack nature of the VM matches object tree structure • each Expression produces code that leaves the value of that expression on the stack • Example: • MKARR, <first value>, ARRADD, <second>, ARRADD, …
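The `MKARR, <first value>, ARRADD, …` example can be made concrete with a very small stack machine. This is a sketch only: `TinyVm`, the `PUSH` opcode, the constants table, and the fixed stack size are assumptions introduced for illustration, and `Object`/`List` replace `JsonNode`.

```java
import java.util.ArrayList;
import java.util.List;

final class TinyVm {
    static final int PUSH = 0;   // PUSH <index>: push constants[index] (hypothetical opcode)
    static final int MKARR = 1;  // push a new empty array
    static final int ARRADD = 2; // pop a value and append it to the array now on top
    static final int DUP = 3;    // push a second reference to the top value

    // "Compiler" for an array of `count` constants:
    // emits MKARR, then PUSH i, ARRADD for each element.
    static int[] compileArray(int count) {
        int[] code = new int[1 + 3 * count];
        int pc = 0;
        code[pc++] = MKARR;
        for (int i = 0; i < count; i++) {
            code[pc++] = PUSH;
            code[pc++] = i;
            code[pc++] = ARRADD;
        }
        return code;
    }

    // Runs the bytecode and returns whatever is left on top of the stack.
    static Object run(int[] code, Object[] constants) {
        Object[] stack = new Object[16]; // fixed size, enough for this sketch
        int top = -1;
        int pc = 0;
        while (pc < code.length) {
            switch (code[pc++]) {
                case PUSH:
                    stack[++top] = constants[code[pc++]];
                    break;
                case MKARR:
                    stack[++top] = new ArrayList<Object>();
                    break;
                case ARRADD: {
                    Object value = stack[top--];
                    @SuppressWarnings("unchecked")
                    List<Object> array = (List<Object>) stack[top];
                    array.add(value);
                    break;
                }
                case DUP: {
                    Object value = stack[top];
                    stack[++top] = value;
                    break;
                }
                default:
                    throw new IllegalStateException("Unknown opcode");
            }
        }
        return stack[top];
    }

    public static void main(String[] args) {
        int[] code = compileArray(2);
        System.out.println(run(code, new Object[] { "vg", "bt" }));
    }
}
```

Every instruction sequence leaves exactly one value on the stack, which is the property that lets nested expressions be compiled by simple concatenation.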
  • 55. Prototype • Stunt implemented over a couple of days • Depressing result: object tree interpreter ~40% faster • Anthony Sparks tried the same thing • original VM implementation 5x slower • eventually managed to achieve performance parity • So far: performance does not justify complexity
  • 56. Java bytecode? • The JVM is actually a stack-based VM • can simply compile to Java bytecode instead • Tricky to learn tools for generating bytecode • no examples, very little documentation • In the end decided to use the Asm library • not very nice to use • very primitive API • crashes with NullPointerException on bad bytecode
  • 59. Results • Hard work to build • many surprising issues in Java bytecode • Performance boost of 15-25% • code lives on jvm-bytecode branch on GitHub • Ideas for how it could be even faster… • through type inference
  • 60. Type inference benefits • "sdrn:" + $namespace + ":" + $rType + ":" + $rId • Plus • Plus("sdrn" $namespace) • Plus( Plus(":" $rType) Plus(":" $rId) ) • Plus: JsonNode -> String, JsonNode -> String, String + String, new String -> new JsonNode • Will make 4 unnecessary TextNode objects • Will wrap and unwrap String repeatedly • Will check types unnecessarily
  • 61. Solution • + operator can ask both sides: what type will you produce? • If one side says “string” then the result will be a string • When compiling, do compile(generator, String) • will compile code that produces a Java String object • + operator will make a new String if that’s what’s wanted • or turn it into a TextNode if the context wants Any
  • 62. Freedom from Jackson • The current codebase is bound to Jackson • JVM bytecode compilation might be a way to escape that • Could build compilers that can interface with different JSON representations • Have ideas for a more efficient JSON representation • basically encode everything as arrays of ints • should save memory, GC, and produce faster code
  • 63. Freedom from JSON • If we aren’t bound to Jackson, why should we be bound to JSON? • Could support Avro, too • Perhaps also other formats
  • 65. Internal status • JSLT now used in • Data Quality Tooling (to express tests on data) • routing filters • transforms • In Schibsted we have • 52 transforms, 2370 lines of code • written by many people in different parts of the company • Data Platform runs ~11 billion transforms/day
  • 66. Open source status • Released in June • People are using it for real • one certain case, several more examples • details unknown • Useful contributions from outsiders • several bug fixes to datetime/number handling • Two alternative implementations being worked on • one in .NET • one is virtual machine-based in Java
  • 67. Lessons learned • A custom language can make life much simpler • if it fits the use case well • Implementing a language is easier than it seems • basically doable in a week • Designing a language is not easy • unfortunately