Building analytics applications with Streaming Expressions in Apache Solr
Amrit Sarkar, Search Engineer, Lucidworks
Who are we?
Based in San Francisco
Offices in Cambridge, Bangalore, Bangkok, New York City, Raleigh, Munich
Over 300 customers across the Fortune 1000
Fusion, a Solr-powered platform for search-driven apps
Consulting and support for organizations using Solr
Outline
• Challenges building analytics applications with real-time data
• Introduction to and overview of Streaming Expressions
• Sources, Decorators and Evaluators
• Short, optimised solutions for simple to complex use cases
• Statistical Programming
• References
Challenges building applications on real-time data
• Data has to be searched and filtered before analytics can be performed.
• Executing complex operations and correlations on unstructured, non-preprocessed data is time consuming.
• Depending on multiple tools leads to higher maintenance costs.
[Diagram: data visualizer, two clients, database, real-time updates]
Apache Solr: the standard for enterprise search
• Apache Lucene
• Full text search
• Facets/Guided Nav galore!
• Distributed Search | SolrCloud
• Spelling, autocomplete, highlighting
• Deduplication
• Grouping and Joins
• Streaming, SQL, aggregations and more
• Learning to Rank
• Massive Scale/Fault tolerance
Parallel Computing Framework
Available in SolrCloud mode
• Streaming API
• Streaming Expressions
• Shuffling
• Worker collections
• Parallel SQL
Streaming API
• Java API for parallel computation
• Real-time MapReduce and parallel relational algebra
• Search results are streams of tuples (TupleStream)
• org.apache.solr.client.solrj.io.*
ParallelStream pstream =
(ParallelStream) streamFactory.constructStream("parallel(collectionName, ……....)");
pstream.open();
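The open/read/close pattern behind a stream of tuples can be sketched in Python as follows. This is only an illustrative analogue, not SolrJ code: the class `ListTupleStream`, the helper `read_all` and the sample documents are hypothetical, and the end of the stream is assumed to be signalled by a tuple carrying an `EOF` marker.

```python
from typing import Dict, List


class ListTupleStream:
    """Hypothetical in-memory stand-in for a stream of tuples (not a SolrJ class)."""

    def __init__(self, docs: List[Dict]):
        self._docs = docs
        self._pos = None  # None means the stream has not been opened yet

    def open(self) -> None:
        self._pos = 0

    def read(self) -> Dict:
        if self._pos is None:
            raise RuntimeError("stream must be opened before reading")
        if self._pos >= len(self._docs):
            return {"EOF": True}  # marker tuple: no more data
        doc = self._docs[self._pos]
        self._pos += 1
        return doc

    def close(self) -> None:
        self._pos = None


def read_all(stream: ListTupleStream) -> List[Dict]:
    """Open the stream, read tuples until the EOF marker, and always close it."""
    stream.open()
    try:
        tuples = []
        while True:
            t = stream.read()
            if t.get("EOF"):
                break
            tuples.append(t)
        return tuples
    finally:
        stream.close()
```

A caller only sees one tuple at a time, which is what lets larger result sets be processed without holding them in memory.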
Streaming Expressions
curl --data-urlencode 'expr=
search(gettingstarted,
       zkHost="localhost:9983",
       qt="/export",
       q="hatchbacks",
       fq="year:2014",
       fl="id,model_name",
       sort="id asc")' http://localhost:8983/solr/gettingstarted/stream
Use case: perform a full-index search and retrieve specific fields, sorted
• String query language and serialisation format for the Streaming API
• Streaming Expressions compile to a TupleStream; a TupleStream serialises to a Streaming Expression
• Can be used directly over HTTP or through SolrJ
• Expressions are executed against the Solr API: /solr/<collection-name>/stream
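As a rough sketch of the HTTP route, the request that the curl command sends can be built with Python's standard library. The helper name `build_stream_request` is hypothetical; the host, collection and expression are taken from the slide, and the request is only constructed here, not sent.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_stream_request(base_url: str, collection: str, expr: str) -> Request:
    """Build a POST to /solr/<collection>/stream with a URL-encoded 'expr' parameter."""
    url = f"{base_url.rstrip('/')}/{collection}/stream"
    body = urlencode({"expr": expr}).encode("utf-8")
    return Request(url, data=body, method="POST")


EXPR = (
    'search(gettingstarted, zkHost="localhost:9983", qt="/export", '
    'q="hatchbacks", fq="year:2014", fl="id,model_name", sort="id asc")'
)
request = build_stream_request("http://localhost:8983/solr", "gettingstarted", EXPR)
# To execute it, pass `request` to urllib.request.urlopen() against a running SolrCloud node.
```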
Streaming Expressions
• Stream Sources
The origin of a TupleStream:
search, facet, jdbc, stats, topic, timeseries, train and more
• Stream Decorators
Wrap other stream functions and operate on the stream, row-wise:
complement, hashJoin, innerJoin, merge, intersect, top, unique and more
• Stream Evaluators
Evaluate (calculate) new values based on other values in a tuple, column-wise:
add, eq, div, mul, sub, length, asin, acos, abs, if:then and more
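The three roles can be illustrated with plain Python generators over dictionaries. This is a conceptual sketch, not Solr's implementation: the function names `source`, `unique` and `with_div` and the sample rows are hypothetical, and `unique` assumes its input is already sorted on the field it deduplicates.

```python
from typing import Dict, Iterable, Iterator


def source(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Stream source: the origin of the tuples."""
    for row in rows:
        yield dict(row)


def unique(stream: Iterable[Dict], over: str) -> Iterator[Dict]:
    """Decorator (row-wise): keep the first tuple of each run of equal `over` values."""
    missing = object()
    last = missing
    for t in stream:
        if last is missing or t[over] != last:
            yield t
        last = t[over]


def with_div(stream: Iterable[Dict], numerator: str, denominator: str, alias: str) -> Iterator[Dict]:
    """Evaluator (column-wise): add a new field computed from two existing fields."""
    for t in stream:
        out = dict(t)
        out[alias] = t[numerator] / t[denominator]
        yield out


rows = [
    {"id": "a", "clicks": 10, "impressions": 100},
    {"id": "a", "clicks": 12, "impressions": 120},
    {"id": "b", "clicks": 5, "impressions": 50},
]
result = list(with_div(unique(source(rows), over="id"), "clicks", "impressions", "ctr"))
```

Each stage wraps the previous one, which mirrors how streaming expressions nest inside each other.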
Streaming Expressions - Use cases
Use case: Destinations reachable with a single stop from ‘New York’ (graph traversal)
nodes(distances,
      nodes(distances,
            walk="New York->source_s",
            gather="destination_s"),
      walk="node->source_s",
      gather="destination_s",
      trackTraversal="true",
      scatter="branches,leaves")
A Solr index maps each token to the IDs of the documents that contain it; ‘nodes’ performs a breadth-first search (BFS) over field tokens.
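The two-level traversal can be sketched in Python over an in-memory edge list. The edges below are made-up sample data and the helper names `gather` and `one_stop_destinations` are hypothetical; the sketch only mirrors the walk/gather idea and ignores details such as cycle handling.

```python
from typing import Dict, List, Set


def gather(edges: List[Dict[str, str]], sources: Set[str]) -> Set[str]:
    """Collect destination_s of every edge whose source_s is in `sources`."""
    return {e["destination_s"] for e in edges if e["source_s"] in sources}


def one_stop_destinations(edges: List[Dict[str, str]], origin: str) -> Dict[str, Set[str]]:
    direct = gather(edges, {origin})   # inner nodes(): walk from the origin
    one_stop = gather(edges, direct)   # outer nodes(): walk from the gathered nodes
    return {"direct": direct, "one_stop": one_stop}


# Hypothetical sample edges, one document per route.
EDGES = [
    {"source_s": "New York", "destination_s": "Chicago"},
    {"source_s": "New York", "destination_s": "Boston"},
    {"source_s": "Chicago", "destination_s": "Denver"},
    {"source_s": "Boston", "destination_s": "Miami"},
    {"source_s": "Denver", "destination_s": "Seattle"},
]
reachable = one_stop_destinations(EDGES, "New York")
```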
Streaming Expressions - Use cases
Use case: Determine the most relevant terms in a dynamic data set
significantTerms(
  enron-emails,
  q="To:*Tim Belden*",
  field="content",
  limit="2",
  minDocFreq="10",
  maxDocFreq=".20",
  minTermLength="5"
)
A Solr index maps each token to the IDs of the documents that contain it; ‘significantTerms’ aggregates over tokens.
Streaming Expressions - Use cases
Use case: Calculate useful metrics on data fetched from various sources.
• conversion ratio (conversions to clicks)
• CTR (clicks to impressions)
• cost ratio (currency cost to conversions)
Events captured in Solr collection ‘weekly_data’:
campaign_id_s  org_id_s  conversions_i  impressions_i  clicks_i
cmp-01         org-01    4              134            48
cmp-02         org-02    2              174            26
cmp-03         org-01    6              152            49
cmp-01         org-01    5              154            27
cmp-02         org-01    9              176            38
cmp-03         org-01    5              137            83
cmp-01         org-01    3              154            36
cmp-02         org-02    1              178            35
cmp-03         org-01    7              124            49
……...          ……...     ……...          ……...          ……...

Campaign costs stored in Solr collection ‘currency_cost’:
campaign_id_s  currency_cost_i
cmp-01         6600
cmp-02         5840
cmp-03         8400
Streaming Expressions - Use cases
Use case: Join cost data with aggregated conversions, clicks and impressions per campaign for organisation ‘org-01’
Building blocks of the expression:
rollup(
  search(weekly_data, zkHost="localhost:9983", qt="/export", q="*:*",
         fq="org_id_s:org-01",
         fl="id,campaign_id_s,org_id_s,conversions_i,impressions_i,clicks_i",
         sort="campaign_id_s asc"),
  over="campaign_id_s",
  sum(conversions_i), sum(impressions_i), sum(clicks_i)),
select(
  campaign_id_s as campaign_id_s, sum(conversions_i) as aggr_conv,
  sum(impressions_i) as aggr_impr, sum(clicks_i) as aggr_clicks),
innerJoin(
  search(currency_cost, zkHost="localhost:9983", qt="/export", q="*:*",
         fl="campaign_id_s,currency_cost_i", sort="campaign_id_s asc"),
  on="campaign_id_s")
Streaming Expressions - Use cases
Use case: Join cost data with aggregated conversions, clicks and impressions per campaign for organisation ‘org-01’
innerJoin(
  select(
    rollup(
      search(weekly_data,
             zkHost="localhost:9983",
             qt="/export", q="*:*",
             fq="campaign_id_s:(cmp-01 OR cmp-02 OR cmp-03)",
             fq="org_id_s:org-01",
             fl="id,campaign_id_s,org_id_s,conversions_i,impressions_i,clicks_i",
             sort="campaign_id_s asc"),
      over="campaign_id_s",
      sum(conversions_i), sum(impressions_i), sum(clicks_i)),
    campaign_id_s as campaign_id_s,
    sum(conversions_i) as aggr_conv,
    sum(impressions_i) as aggr_impr,
    sum(clicks_i) as aggr_clicks),
  search(currency_cost,
         zkHost="localhost:9983",
         qt="/export", q="*:*",
         fq="campaign_id_s:(cmp-01 OR cmp-02 OR cmp-03)",
         fl="campaign_id_s,currency_cost_i",
         sort="campaign_id_s asc"),
  on="campaign_id_s")
Streaming Expressions - Use cases
Use case: Calculate useful metrics on data fetched from various sources for ‘org-01’:
● conversion ratio (conversions to clicks)
● CTR (clicks to impressions)
● cost ratio (currency cost to conversions)
select(
  innerJoin(
    ……..
    on="campaign_id_s"),
  div(aggr_conv, aggr_clicks) as conversion_ratio,
  div(aggr_clicks, aggr_impr) as ctr,
  div(currency_cost_i, aggr_conv) as campaign_cost_ratio)
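To make the arithmetic concrete, the same rollup, join and division steps can be reproduced in Python on the sample rows shown on the data slide. This is a sketch under assumptions: only the nine visible rows are used (the slide elides the rest), and the function names `rollup` and `metrics` are illustrative stand-ins, not Solr code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (campaign_id_s, org_id_s, conversions_i, impressions_i, clicks_i), visible sample rows only.
EVENTS: List[Tuple[str, str, int, int, int]] = [
    ("cmp-01", "org-01", 4, 134, 48),
    ("cmp-02", "org-02", 2, 174, 26),
    ("cmp-03", "org-01", 6, 152, 49),
    ("cmp-01", "org-01", 5, 154, 27),
    ("cmp-02", "org-01", 9, 176, 38),
    ("cmp-03", "org-01", 5, 137, 83),
    ("cmp-01", "org-01", 3, 154, 36),
    ("cmp-02", "org-02", 1, 178, 35),
    ("cmp-03", "org-01", 7, 124, 49),
]
COSTS: Dict[str, int] = {"cmp-01": 6600, "cmp-02": 5840, "cmp-03": 8400}


def rollup(events, org: str) -> Dict[str, Dict[str, int]]:
    """Filter by organisation, then sum the three counters per campaign."""
    totals = defaultdict(lambda: {"aggr_conv": 0, "aggr_impr": 0, "aggr_clicks": 0})
    for campaign, org_id, conv, impr, clicks in events:
        if org_id != org:
            continue
        totals[campaign]["aggr_conv"] += conv
        totals[campaign]["aggr_impr"] += impr
        totals[campaign]["aggr_clicks"] += clicks
    return dict(totals)


def metrics(events, costs: Dict[str, int], org: str) -> Dict[str, Dict[str, float]]:
    """Inner-join the aggregates with the costs and derive the three ratios."""
    report = {}
    for campaign, agg in sorted(rollup(events, org).items()):
        if campaign not in costs:  # inner join: drop campaigns without a cost row
            continue
        report[campaign] = {
            "conversion_ratio": agg["aggr_conv"] / agg["aggr_clicks"],
            "ctr": agg["aggr_clicks"] / agg["aggr_impr"],
            "campaign_cost_ratio": costs[campaign] / agg["aggr_conv"],
        }
    return report


REPORT = metrics(EVENTS, COSTS, "org-01")
```

For the visible rows, campaign cmp-01 of org-01 aggregates to 12 conversions, 442 impressions and 111 clicks, which gives a cost of 550 per conversion.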
Streaming Expressions - Use cases
Use case: Create a view from the result set of the previous use case (calculated metrics) by indexing it into a new collection
update(report, batchSize=500,
  select( ……..
    campaign_id_s as campaign)
)
Complexity: O(N), where N is the total number of rows processed
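The role of a batch size can be sketched as follows: tuples are grouped into fixed-size batches and each batch is handed to an indexing call, so every row is touched once. The names `batches`, `index_all` and the `send` callback are hypothetical placeholders for whatever actually writes to the target collection.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batches(tuples: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Group a stream of tuples into lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Dict] = []
    for t in tuples:
        batch.append(t)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def index_all(tuples: Iterable[Dict], batch_size: int, send: Callable[[List[Dict]], None]) -> int:
    """Send every batch through `send` and return the number of tuples processed."""
    sent = 0
    for batch in batches(tuples, batch_size):
        send(batch)
        sent += len(batch)
    return sent
```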
Streaming Expressions - Shuffle
[Diagram: Client; /stream handlers; Workers 1 to 5; Shards 1 to 5, each with Replica 1 and Replica 2]
Streaming Expressions - Worker Collections
● Regular SolrCloud collections
● Perform streaming aggregations using the Streaming API
● Receive shuffled streams from replicas
● May be empty, created just-in-time, or hold regular data
● The goal is to separate processing from data if necessary
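The idea of shuffling on a partition key can be sketched in Python: every tuple is routed to a worker chosen from a hash of its key, so all tuples sharing a key land on the same worker and can be aggregated there. The CRC32 hash and the names `worker_for` and `shuffle` are assumptions made for this illustration; Solr uses its own hashing.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def worker_for(key: str, workers: int) -> int:
    """Deterministically map a partition key to a worker index (illustrative hash)."""
    return zlib.crc32(key.encode("utf-8")) % workers


def shuffle(tuples: Iterable[Dict], partition_key: str, workers: int) -> Dict[int, List[Dict]]:
    """Route each tuple to the worker that owns its partition key."""
    assigned: Dict[int, List[Dict]] = defaultdict(list)
    for t in tuples:
        assigned[worker_for(str(t[partition_key]), workers)].append(t)
    return dict(assigned)
```

Because the assignment depends only on the key, per-key operations such as a rollup over a campaign id can run independently on each worker.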
Streaming Expressions - Use cases
Use case: Index the result set of the previous use case (metrics for an organisation) into a new collection ‘report’, in parallel, using ‘n’ workers
parallel(worker,
  update(report, batchSize=10,
    select(
      innerJoin(
        select(
          rollup(
            search(weekly_data, zkHost="localhost:9983", qt="/export", q="*:*",
                   fq="campaign_id_s:(cmp-01 OR cmp-02 OR cmp-03)",
                   fq="org_id_s:org-01",
                   fl="id,campaign_id_s,org_id_s,conversions_i,impressions_i,clicks_i",
                   sort="campaign_id_s asc",
                   partitionKeys="campaign_id_s"),
            over="campaign_id_s",
            sum(conversions_i), sum(impressions_i), sum(clicks_i)),
          campaign_id_s as campaign_id_s, sum(conversions_i) as aggr_conv,
          sum(impressions_i) as aggr_impr, sum(clicks_i) as aggr_clicks),
        search(currency_cost, zkHost="localhost:9983", qt="/export", q="*:*",
               fq="campaign_id_s:(cmp-01 OR cmp-02 OR cmp-03)",
               fl="campaign_id_s,org_id_s,currency_cost_i",
               partitionKeys="campaign_id_s",
               sort="campaign_id_s asc"),
        on="campaign_id_s"),
      div(aggr_conv, aggr_clicks) as conversion_ratio,
      div(aggr_clicks, aggr_impr) as ctr,
      div(currency_cost_i, aggr_conv) as campaign_cost_ratio,
      campaign_id_s as campaign)),
  workers=3,
  zkHost="localhost:9983",
  sort="campaign asc")
Streaming Expressions - Use cases
Use case: Index the result set of the previous use case (metrics for an organisation) into a new collection ‘report’, in parallel, using ‘n’ workers.
Complexity: Z(3) + O(N)/W ~ O(N)/W
N: total rows processed; Z(3): aggregation; W: number of workers used
Statistical Programming
• Solr’s powerful data retrieval capabilities can be combined with in-depth statistical analysis.
• SQL, time series aggregation, KNN, graph expressions and more
• The syntax can create arrays from the data so that they can be manipulated, transformed and analyzed
• Statistical function library:
  • Correlation, covariance, percentiles, Euclidean distance and more
  • Backed by the Apache Commons Math library
Statistical Programming - Use cases
Use case: Determine correlation among stocks from their historical data.
Correlation measures the extent to which two variables fluctuate together. For example, if a rise in stock A typically coincides with a rise in stock B, they are positively correlated. If a rise in stock A typically coincides with a fall in stock B, they are negatively correlated.
Data representation (Feb 2013 to Jan 2017):
EventID (unique)  StockID  Date        Closing points
stockA-1          stockA   01-02-2013  30
stockB-1          stockB   01-02-2013  168
stockC-1          stockC   01-02-2013  356
stockB-2          stockB   02-02-2013  237
stockA-2          stockA   02-02-2013  43
……...             ……...    ……...       ……...
Statistical Programming - Use cases
Use case: Determine the correlation between stocks A and B from their historical data.
let(
  stockA=search(historical_stocks_data,
                zkHost="localhost:9983", qt="/export",
                q="stock_s:stockA", fl="timestamp_dt, closing_pts_i",
                sort="timestamp_dt asc"),
  stockB=search(historical_stocks_data,
                zkHost="localhost:9983", qt="/export",
                q="stock_s:stockB", fl="timestamp_dt, closing_pts_i",
                sort="timestamp_dt asc"),
  pricesA = col(stockA, closing_pts_i),
  pricesB = col(stockB, closing_pts_i),
  tuple(correlation=corr(pricesA, pricesB)))
• ‘let’ sets variables and outputs a single tuple
• The first ‘search’ limits the result set to stockA and assigns it to the variable ‘stockA’
• The second ‘search’ limits the result set to stockB and assigns it to the variable ‘stockB’
• ‘col’ creates an array from a list of tuples
• ‘corr’ is an evaluator that performs the Pearson product-moment correlation calculation on two columns of numbers
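The ‘col’ and ‘corr’ steps can be sketched in plain Python to show what is being computed. This is an illustrative re-implementation of the Pearson product-moment formula, not the library code Solr relies on; the Python functions `col` and `corr` simply borrow the names from the expression.

```python
import math
from typing import Dict, List, Sequence


def col(tuples: List[Dict], field: str) -> List[float]:
    """Turn one field of a list of tuples into a numeric array."""
    return [float(t[field]) for t in tuples]


def corr(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)
```

The result lies between -1 and 1: values near 1 mean the two series rise and fall together, values near -1 mean they move in opposite directions. Both columns must be aligned on the same dates, which is why the searches sort by timestamp.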
Statistical Programming - Use cases
Use case: Determine the correlation of stock A with stocks B and C from their historical data.
[Plots: ‘A’ to ‘B’ and ‘A’ to ‘C’]
Stock ‘A’ is highly positively correlated with stock ‘B’: if stock ‘B’ is predicted to rise, the price of stock ‘A’ is highly likely to rise too, and a similar trend follows when it falls.
Stock ‘A’ is moderately negatively correlated with stock ‘C’, so the trend of stock ‘C’ is not a reliable basis for predicting stock ‘A’.
Takeaway
• Streaming Expressions in Apache Solr make it possible to perform complex correlations with map-reduce and statistical operations / functions, in parallel, on dynamic subsets fetched from various sources, in near real-time
References & Knowledge Base
• Use cases and examples available on Github: /sarkaramrit2/stream-solr
• Streaming expression official documentation in Apache Solr.
• Statistical Programming official documentation in Apache Solr.
• Joel Bernstein’s blog.
• Presentation links:
• The Evolution of Streaming Expressions
• Streaming Aggregation, New Horizons for Search
• Analytics and Graph Traversal with Solr
• Creating New Streaming Expressions
Thank you!
Ad

More Related Content

Similar to Building analytics applications with streaming expressions in apache solr (20)

Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Lucidworks
 
Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming
Stratio
 
WSO2 Analytics Platform: The one stop shop for all your data needs
WSO2 Analytics Platform: The one stop shop for all your data needsWSO2 Analytics Platform: The one stop shop for all your data needs
WSO2 Analytics Platform: The one stop shop for all your data needs
Sriskandarajah Suhothayan
 
Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP Engine
Myung Ho Yun
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
b0ris_1
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
WSO2
 
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
Lucidworks
 
Swift distributed tracing method and tools v2
Swift distributed tracing method and tools v2Swift distributed tracing method and tools v2
Swift distributed tracing method and tools v2
zhang hua
 
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
WSO2
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
Abhishek M Shivalingaiah
 
Analytics Patterns for Your Digital Enterprise
Analytics Patterns for Your Digital EnterpriseAnalytics Patterns for Your Digital Enterprise
Analytics Patterns for Your Digital Enterprise
Sriskandarajah Suhothayan
 
WSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2Con USA 2017: Analytics Patterns for Your Digital EnterpriseWSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2
 
Qlik_Sense_May_2023_Viz_update_1683564048dddddddd.pdf
Qlik_Sense_May_2023_Viz_update_1683564048dddddddd.pdfQlik_Sense_May_2023_Viz_update_1683564048dddddddd.pdf
Qlik_Sense_May_2023_Viz_update_1683564048dddddddd.pdf
akilanarayanantechie
 
Splunk 6.2 new features
Splunk 6.2 new featuresSplunk 6.2 new features
Splunk 6.2 new features
CleverDATA
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Lucidworks
 
Snowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your businessSnowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your business
Giuseppe Gaviani
 
Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...
Maxime Beugnet
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
MapR Technologies
 
Snowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessSnowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your business
yalisassoon
 
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
Noriaki Tatsumi
 
Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Lucidworks
 
Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming
Stratio
 
WSO2 Analytics Platform: The one stop shop for all your data needs
WSO2 Analytics Platform: The one stop shop for all your data needsWSO2 Analytics Platform: The one stop shop for all your data needs
WSO2 Analytics Platform: The one stop shop for all your data needs
Sriskandarajah Suhothayan
 
Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP Engine
Myung Ho Yun
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
b0ris_1
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
WSO2
 
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
The Evolution of Streaming Expressions - Joel Bernstein, Alfresco & Dennis Go...
Lucidworks
 
Swift distributed tracing method and tools v2
Swift distributed tracing method and tools v2Swift distributed tracing method and tools v2
Swift distributed tracing method and tools v2
zhang hua
 
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...
WSO2
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
Abhishek M Shivalingaiah
 
Analytics Patterns for Your Digital Enterprise
Analytics Patterns for Your Digital EnterpriseAnalytics Patterns for Your Digital Enterprise
Analytics Patterns for Your Digital Enterprise
Sriskandarajah Suhothayan
 
WSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2Con USA 2017: Analytics Patterns for Your Digital EnterpriseWSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2Con USA 2017: Analytics Patterns for Your Digital Enterprise
WSO2
 
Qlik_Sense_May_2023_Viz_update_1683564048dddddddd.pdf
Qlik_Sense_May_2023_Viz_update_1683564048dddddddd.pdfQlik_Sense_May_2023_Viz_update_1683564048dddddddd.pdf
Qlik_Sense_May_2023_Viz_update_1683564048dddddddd.pdf
akilanarayanantechie
 
Splunk 6.2 new features
Splunk 6.2 new featuresSplunk 6.2 new features
Splunk 6.2 new features
CleverDATA
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Lucidworks
 
Snowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your businessSnowplow - Evolve your analytics stack with your business
Snowplow - Evolve your analytics stack with your business
Giuseppe Gaviani
 
Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...
Maxime Beugnet
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
MapR Technologies
 
Snowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessSnowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your business
yalisassoon
 
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
Noriaki Tatsumi
 

Recently uploaded (20)

Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Shift Left using Lean for Agile Software Development
Shift Left using Lean for Agile Software DevelopmentShift Left using Lean for Agile Software Development
Shift Left using Lean for Agile Software Development
SathyaShankar6
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Minitab 22 Full Crack Plus Product Key Free Download [Latest] 2025
Minitab 22 Full Crack Plus Product Key Free Download [Latest] 2025Minitab 22 Full Crack Plus Product Key Free Download [Latest] 2025
Minitab 22 Full Crack Plus Product Key Free Download [Latest] 2025
wareshashahzadiii
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Sales Deck SentinelOne Singularity Platform.pptx
Sales Deck SentinelOne Singularity Platform.pptxSales Deck SentinelOne Singularity Platform.pptx
Sales Deck SentinelOne Singularity Platform.pptx
EliandoLawnote
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025
kashifyounis067
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Eric D. Schabell
 
Agentic AI Use Cases using GenAI LLM models
Agentic AI Use Cases using GenAI LLM modelsAgentic AI Use Cases using GenAI LLM models
Agentic AI Use Cases using GenAI LLM models
Manish Chopra
 
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and CollaborateMeet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate
Maxim Salnikov
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Shift Left using Lean for Agile Software Development
Shift Left using Lean for Agile Software DevelopmentShift Left using Lean for Agile Software Development
Shift Left using Lean for Agile Software Development
SathyaShankar6
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Minitab 22 Full Crack Plus Product Key Free Download [Latest] 2025
Minitab 22 Full Crack Plus Product Key Free Download [Latest] 2025Minitab 22 Full Crack Plus Product Key Free Download [Latest] 2025
Minitab 22 Full Crack Plus Product Key Free Download [Latest] 2025
wareshashahzadiii
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 

Building analytics applications with Streaming Expressions in Apache Solr

  • 1. Building analytics applications with Streaming Expressions in Apache Solr
      Amrit Sarkar, Search Engineer, Lucidworks
  • 2. Who are we?
      Based in San Francisco
      Offices in Cambridge, Bangalore, Bangkok, New York City, Raleigh, Munich
      Over 300 customers across the Fortune 1000
      Fusion, a Solr-powered platform for search-driven apps
      Consulting and support for organizations using Solr
  • 3. Outline
      • Challenges building analytics applications with real-time data
      • Introduction to Streaming Expressions and overview
      • Sources, Decorators and Evaluators
      • Short, optimised solutions for simple to complex use cases
      • Statistical Programming
      • References
  • 4. Challenges building applications on real-time data
      • Data must be searched and filtered before analytics can be performed.
      • Executing complex operations and correlations on unstructured, non-preprocessed data is time-consuming.
      • Depending on multiple tools leads to higher maintenance cost.
      [Diagram labels: Data visualizer, Client, Client, Database, Real-time Updates]
  • 5. The standard for enterprise search.
      • Apache Lucene
      • Full text search
      • Facets/Guided Nav galore!
      • Distributed Search | SolrCloud
      • Spelling, autocomplete, highlighting
      • Deduplication
      • Grouping and Joins
      • Streaming, SQL, aggregations and more
      • Learning to Rank
      • Massive Scale/Fault tolerance
  • 6. Parallel Computing Framework
      • Streaming API
      • Streaming Expressions
      • Shuffling
      • Worker collections
      • Parallel SQL
      Available in SolrCloud mode
  • 7. Streaming API
      • Java API for parallel computation
      • Real-time MapReduce and Parallel Relational Algebra
      • Search results are streams of tuples (TupleStream)
      • org.apache.solr.client.solrj.io.*

          ParallelStream pstream =
              (ParallelStream) streamFactory.constructStream("parallel(collectionName, ……....)");
          pstream.open();
  • 8. Streaming Expressions
      • String query language and serialisation format for the Streaming API
      • Streaming expressions compile to a TupleStream; a TupleStream serialises back to a streaming expression
      • Can be used directly via HTTP or SolrJ
      • Expressions are executed against the Solr API endpoint /solr/<collection-name>/stream
      Use case: search the full index and retrieve specific fields, sorted

          curl --data-urlencode 'expr=search(gettingstarted,
                                             zkHost="localhost:9983",
                                             qt="/export",
                                             q="hatchbacks",
                                             fq="year:2014",
                                             fl="id, model_name",
                                             sort="id asc")' http://localhost:8983/solr/gettingstarted/stream
  • 9. Streaming Expressions
      Use case: search the full index and retrieve specific fields, sorted

          curl --data-urlencode 'expr=search(gettingstarted,
                                             zkHost="localhost:9983",
                                             qt="/export",
                                             q="hatchbacks",
                                             fq="year:2014",
                                             fl="id, model_name",
                                             sort="id asc")' http://localhost:8983/solr/gettingstarted/stream
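
The curl call on slides 8 and 9 can also be prepared from a program. The following is a minimal sketch in Python, not part of the original deck: `build_stream_request` is a hypothetical helper name, and the host, port and collection are taken from the slide. It only builds the URL and the URL-encoded body, so it runs without a Solr instance.

```python
from urllib.parse import urlencode, parse_qs


def build_stream_request(base_url, collection, expr):
    """Return (url, body) for a POST to the collection's /stream handler.

    Hypothetical helper: mirrors `curl --data-urlencode 'expr=...'`.
    """
    url = f"{base_url.rstrip('/')}/{collection}/stream"
    body = urlencode({"expr": expr})  # percent-encodes quotes, spaces, etc.
    return url, body


# Expression from the slide, written on one line.
expr = (
    'search(gettingstarted, zkHost="localhost:9983", qt="/export", '
    'q="hatchbacks", fq="year:2014", fl="id, model_name", sort="id asc")'
)

url, body = build_stream_request("http://localhost:8983/solr", "gettingstarted", expr)
print(url)
print(body)
```

Against a running SolrCloud node, the pair could be sent with `urllib.request.urlopen(url, data=body.encode())`; that step is left out here so the sketch stays self-contained.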
  • 10. Streaming Expressions
      • Stream sources: the origin of a TupleStream
          search, facet, jdbc, stats, topic, timeseries, train and more
      • Stream decorators: wrap other stream functions and perform row-wise operations on the stream
          complement, hashJoin, innerJoin, merge, intersect, top, unique and more
      • Stream evaluators: evaluate (calculate) new values based on other values in a tuple, column-wise
          add, eq, div, mul, sub, length, asin, acos, abs, if:then and more
  • 11. Streaming Expressions - Use cases
      Use case: Destinations reachable with a single stop from 'New York' (graph traversal)

          nodes(distances,
                nodes(distances,
                      walk="New York->source_s",
                      gather="destination_s"),
                walk="node->source_s",
                gather="destination_s",
                trackTraversal="true",
                scatter="branches,leaves")

      A Solr index maps each token to the IDs of the documents containing it; 'nodes' performs a breadth-first search (BFS) over field tokens.
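
The nested `nodes` call on slide 11 amounts to two breadth-first steps over (source, destination) pairs. The sketch below illustrates that idea in plain Python; it is not how Solr implements `nodes`, and the edge list is invented for illustration (the deck does not show the contents of the `distances` collection).

```python
def gather(edges, frontier):
    """One traversal step: follow every edge whose source is in the frontier."""
    return {dst for src, dst in edges if src in frontier}


# Hypothetical (source_s, destination_s) pairs standing in for 'distances'.
edges = [
    ("New York", "Chicago"),
    ("New York", "Boston"),
    ("Chicago", "Denver"),
    ("Boston", "Miami"),
    ("Denver", "Seattle"),
]

direct = gather(edges, {"New York"})  # inner nodes(...): direct destinations
one_stop = gather(edges, direct)      # outer nodes(...): one stop further

print(sorted(direct))
print(sorted(one_stop))
```

In the slide's expression, `scatter="branches,leaves"` asks for both levels in the output, which corresponds to reporting `direct` as well as `one_stop` here.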
  • 12. Streaming Expressions - Use cases
      Use case: Determine the most relevant terms in a dynamic data set

          significantTerms(enron-emails,
                           q="To:*Tim Belden*",
                           field="content",
                           limit="2",
                           minDocFreq="10",
                           maxDocFreq=".20",
                           minTermLength="5")

      A Solr index maps each token to the IDs of the documents containing it; 'significantTerms' aggregates over tokens.
  • 13. Streaming Expressions - Use cases
      Use case: Calculate useful metrics on data fetched from various sources.
      • conversion ratio (conversions to clicks)
      • CTR (clicks to impressions)
      • cost ratio (conversions to currency cost)

      Events captured in Solr collection 'weekly_data':

          campaign_id_s   org_id_s   conversions_i   impressions_i   clicks_i
          cmp-01          org-01     4               134             48
          cmp-02          org-02     2               174             26
          cmp-03          org-01     6               152             49
          cmp-01          org-01     5               154             27
          cmp-02          org-01     9               176             38
          cmp-03          org-01     5               137             83
          cmp-01          org-01     3               154             36
          cmp-02          org-02     1               178             35
          cmp-03          org-01     7               124             49
          ……...           ……...      ……...           ……...           ……...

      Campaign costs stored in Solr collection 'currency_cost':

          campaign_id_s   currency_cost
          cmp-01          6600
          cmp-02          5840
          cmp-03          8400
  • 14. Streaming Expressions - Use cases
      Use case: Join cost data with aggregated conversions, clicks and impressions per campaign for organisation 'org-01'

          rollup(
            search(weekly_data, zkHost="localhost:9983", qt="/export", q="*:*",
                   fq="org_id_s:org-01",
                   fl="id,campaign_id_s,org_id_s,conversions_i,impressions_i,clicks_i",
                   sort="campaign_id_s asc"),
            over="campaign_id_s",
            sum(conversions_i), sum(impressions_i), sum(clicks_i)),

          select(
            campaign_id_s as campaign_id_s,
            sum(conversions_i) as aggr_conv,
            sum(impressions_i) as aggr_impr,
            sum(clicks_i) as aggr_clicks),

          innerJoin(
            search(currency_cost, zkHost="localhost:9983", qt="/export", q="*:*",
                   fl="campaign_id_s,currency_cost_i",
                   sort="campaign_id_s asc"),
            on="campaign_id_s")
  • 15. Streaming Expressions - Use cases
      Use case: Join cost data with aggregated conversions, clicks and impressions per campaign for organisation 'org-01'

          innerJoin(
            select(
              rollup(
                search(weekly_data, zkHost="localhost:9983", qt="/export", q="*:*",
                       fq="campaign_id_s:(cmp-01 OR cmp-02 OR cmp-03)",
                       fq="org_id_s:org-01",
                       fl="id,campaign_id_s,org_id_s,conversions_i,impressions_i,clicks_i",
                       sort="campaign_id_s asc"),
                over="campaign_id_s",
                sum(conversions_i), sum(impressions_i), sum(clicks_i)),
              campaign_id_s as campaign_id_s,
              sum(conversions_i) as aggr_conv,
              sum(impressions_i) as aggr_impr,
              sum(clicks_i) as aggr_clicks),
            search(currency_cost, zkHost="localhost:9983", qt="/export", q="*:*",
                   fq="campaign_id_s:(cmp-01 OR cmp-02 OR cmp-03)",
                   fl="campaign_id_s,currency_cost_i",
                   sort="campaign_id_s asc"),
            on="campaign_id_s")
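
Both `rollup` and `innerJoin` work on streams that arrive sorted by the key, which is why each `search` on slide 15 sorts by `campaign_id_s`. The sketch below mimics that behaviour in plain Python as an illustration only; it is not Solr's implementation. Field names follow the table on slide 13, the rows are the visible `org-01` rows from that table (elided rows are ignored), and the join assumes each campaign appears once in the cost stream.

```python
from itertools import groupby
from operator import itemgetter


def rollup(stream, over, sum_fields):
    """Sum `sum_fields` per `over` value; `stream` must be sorted by `over`."""
    for key, rows in groupby(stream, key=itemgetter(over)):
        totals = dict.fromkeys(sum_fields, 0)
        for row in rows:
            for field in sum_fields:
                totals[field] += row[field]
        yield {over: key, **{f"sum({f})": v for f, v in totals.items()}}


def inner_join(left, right, on):
    """Merge join of two streams sorted ascending by `on`.

    Assumption: keys are unique in `right`, so only `left` advances on a match.
    """
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[on] == r[on]:
            yield {**l, **r}
            l = next(left, None)
        elif l[on] < r[on]:
            l = next(left, None)
        else:
            r = next(right, None)


FIELDS = ("campaign_id_s", "conversions_i", "impressions_i", "clicks_i")
weekly = [dict(zip(FIELDS, row)) for row in [
    ("cmp-01", 4, 134, 48), ("cmp-01", 5, 154, 27), ("cmp-01", 3, 154, 36),
    ("cmp-02", 9, 176, 38),
    ("cmp-03", 6, 152, 49), ("cmp-03", 5, 137, 83), ("cmp-03", 7, 124, 49),
]]
costs = [
    {"campaign_id_s": "cmp-01", "currency_cost_i": 6600},
    {"campaign_id_s": "cmp-02", "currency_cost_i": 5840},
    {"campaign_id_s": "cmp-03", "currency_cost_i": 8400},
]

joined = list(inner_join(rollup(weekly, "campaign_id_s", FIELDS[1:]),
                         costs, "campaign_id_s"))
for row in joined:
    print(row)
```

Because both inputs are consumed in key order, neither function needs to hold more than one group or one row per side in memory at a time.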
  • 16. Streaming Expressions - Use cases
      Use case: Calculate useful metrics on data fetched from various sources for 'org-01':
      • conversion ratio (conversions to clicks)
      • CTR (clicks to impressions)
      • cost ratio (conversions to currency cost)

          select(
            innerJoin(
              ……..
              on="campaign_id_s"),
            div(aggr_conv, aggr_clicks) as conversion_ratio,
            div(aggr_clicks, aggr_impr) as ctr,
            div(currency_cost_i, aggr_conv) as campaign_cost_ratio)
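
As a worked example of the three `div` evaluators on slide 16, the sketch below computes the metrics for one joined tuple. The input values are the sums of the visible `cmp-01` / `org-01` rows on slide 13 (12 conversions, 442 impressions, 111 clicks) and that campaign's cost of 6600; since the slide's table is elided, real totals would differ. `campaign_metrics` is a hypothetical helper name, and the cost ratio follows the expression as written (cost divided by conversions).

```python
def campaign_metrics(row):
    """Apply the three divisions from the slide to one joined tuple."""
    return {
        "campaign_id_s": row["campaign_id_s"],
        "conversion_ratio": row["aggr_conv"] / row["aggr_clicks"],
        "ctr": row["aggr_clicks"] / row["aggr_impr"],
        "campaign_cost_ratio": row["currency_cost_i"] / row["aggr_conv"],
    }


row = {
    "campaign_id_s": "cmp-01",
    "aggr_conv": 12,      # 4 + 5 + 3
    "aggr_impr": 442,     # 134 + 154 + 154
    "aggr_clicks": 111,   # 48 + 27 + 36
    "currency_cost_i": 6600,
}

metrics = campaign_metrics(row)
print(metrics)
```

For these inputs the conversion ratio is 12/111 (about 0.108), the CTR is 111/442 (about 0.251), and the cost ratio is 6600/12 = 550 currency units per conversion. A campaign with zero clicks, impressions or conversions would raise `ZeroDivisionError` in this sketch.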
  • 17. Streaming Expressions - Use cases
      Use case: Create a view from the result set of the previous use case (calculate metrics) by indexing it into a new collection

          update(report,
                 batchSize=500,
                 select(
                   ……..
                   campaign_id_s as campaign))

      Complexity: O(N)
          N: total rows processed
  • 18. Streaming Expressions - Shuffle
      [Diagram labels: Client; /stream handler (x3); Worker 1 to Worker 5; Shard 1 to Shard 5, each with Replica 1 and Replica 2]
  • 19. Streaming Expressions - Shuffle
      [Diagram labels: Client; /stream handler (x3); Worker 1 to Worker 5; Shard 1 to Shard 5, each with Replica 1 and Replica 2; annotation "Controlled[subset S1]"]
  • 20. Streaming Expressions - Worker Collections
      • Regular SolrCloud collections
      • Perform streaming aggregations using the Streaming API
      • Receive shuffled streams from replicas
      • May be empty, created just-in-time, or hold regular data
      • The goal is to separate processing from data if necessary
  • 21. Streaming Expressions - Use cases
      Use case: Index the result set of the previous use case (calculate metrics for an organisation) into a new collection 'report', in parallel across 'n' workers

          parallel(worker,
            update(report, batchSize=10,
              select(
                innerJoin(
                  select(
                    rollup(
                      search(weekly_data, zkHost="localhost:9983", qt="/export", q="*:*",
                             fq="campaign_id_s:(cmp-01 OR cmp-02 OR cmp-03)",
                             fq="org_id_s:org-01",
                             fl="id,campaign_id_s,org_id_s,conversions_i,impressions_i,clicks_i",
                             sort="campaign_id_s asc",
                             partitionKeys="campaign_id_s"),
                      over="campaign_id_s",
                      sum(conversions_i), sum(impressions_i), sum(clicks_i)),
                    campaign_id_s as campaign_id_s,
                    sum(conversions_i) as aggr_conv,
                    sum(impressions_i) as aggr_impr,
                    sum(clicks_i) as aggr_clicks),
                  search(currency_cost, zkHost="localhost:9983", qt="/export", q="*:*",
                         fq="campaign_id_s:(cmp-01 OR cmp-02 OR cmp-03)",
                         fl="campaign_id_s,org_id_s,currency_cost_i",
                         partitionKeys="campaign_id_s",
                         sort="campaign_id_s asc"),
                  on="campaign_id_s"),
                div(aggr_conv, aggr_clicks) as conversion_ratio,
                div(aggr_clicks, aggr_impr) as ctr,
                div(currency_cost_i, aggr_conv) as campaign_cost_ratio,
                campaign_id_s as campaign)),
            workers=3,
            zkHost="localhost:9983",
            sort="campaign asc")
  • 22. Streaming Expressions - Use cases
      Use case: Index the result set of the previous use case (calculate metrics for an organisation) into a new collection 'report', in parallel across 'n' workers.

      Complexity: Z(3) + O(N)/W ~ O(N)/W
          N: total rows processed
          Z(3): aggregation
          W: number of workers used
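
The O(N)/W estimate relies on `partitionKeys` splitting the tuples so that each worker sees a disjoint share, with all tuples for one campaign landing on the same worker. The sketch below illustrates that property in plain Python. It uses CRC32 modulo the worker count purely as a stand-in hash, which is an assumption and not Solr's actual partitioning function; the tuples are invented.

```python
import zlib


def worker_for(key, workers):
    """Pick a worker index from the partition key (stand-in hash, not Solr's)."""
    return zlib.crc32(key.encode("utf-8")) % workers


def partition(tuples, key_field, workers):
    """Split tuples into one bucket per worker based on `key_field`."""
    buckets = [[] for _ in range(workers)]
    for t in tuples:
        buckets[worker_for(t[key_field], workers)].append(t)
    return buckets


# Hypothetical tuples: 12 events spread over campaigns cmp-01..cmp-03.
tuples = [{"campaign_id_s": f"cmp-{i % 3 + 1:02d}", "clicks_i": i} for i in range(12)]
buckets = partition(tuples, "campaign_id_s", workers=3)

for index, bucket in enumerate(buckets):
    print(index, sorted({t["campaign_id_s"] for t in bucket}), len(bucket))
```

Since a campaign never spans two buckets, each worker can run `rollup` and `innerJoin` on its own share without coordinating with the others. With only three distinct keys some workers may receive nothing, so the speed-up depends on how evenly the keys hash.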
  • 23. Statistical Programming
      • Solr's powerful data retrieval capabilities can be combined with in-depth statistical analysis.
      • SQL, time series aggregation, KNN, graph expressions and more
      • The syntax can create arrays from the data so that it can be manipulated, transformed and analyzed
      • Statistical function library:
          • Correlation, covariance, percentiles, Euclidean distance and more
          • Backed by the Apache Commons Math library
  • 24. Statistical Programming - Use cases
      Use case: Determine the correlation among stocks from their historical data.
      Correlation measures the extent to which two variables fluctuate together. For example, if a rise in stock A typically coincides with a rise in stock B, they are positively correlated. If a rise in stock A typically coincides with a fall in stock B, they are negatively correlated.

      Data representation (Feb 2013 to Jan 2017):

          EventID (unique)   StockID   Date         Closing points
          stockA-1           stockA    01-02-2013   30
          stockB-1           stockB    01-02-2013   168
          stockC-1           stockC    01-02-2013   356
          stockB-2           stockB    02-02-2013   237
          stockA-2           stockA    02-02-2013   43
          ……...              ……...     ……...        ……...
  • 25. Statistical Programming - Use cases
      Use case: Determine the correlation of stock A to stock B from their historical data.

          let(
            stockA=search(historical_stocks_data, zkHost="localhost:9983", qt="/export",
                          q="stock_s:stockA",
                          fl="timestamp_dt, closing_pts_i",
                          sort="timestamp_dt asc"),
            stockB=search(historical_stocks_data, zkHost="localhost:9983", qt="/export",
                          q="stock_s:stockB",
                          fl="timestamp_dt, closing_pts_i",
                          sort="timestamp_dt asc"),
            pricesA = col(stockA, closing_pts_i),
            pricesB = col(stockB, closing_pts_i),
            tuple(correlation=corr(pricesA, pricesB)))

      • let: sets variables and outputs a single tuple
      • stockA=search(...): limits the result set to stockA and assigns it to the variable 'stockA'
      • stockB=search(...): limits the result set to stockB and assigns it to the variable 'stockB'
      • col: creates an array from a list of tuples
      • corr: evaluator that performs the Pearson product-moment correlation calculation on two columns of numbers
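
The slide describes `corr` as the Pearson product-moment correlation of two numeric columns. For readers who want to see the arithmetic, here is a minimal Python sketch of that formula. It illustrates the calculation only and is not Solr's code; the price series are invented, since the deck shows just the first rows of the data.

```python
from math import sqrt


def corr(xs, ys):
    """Pearson product-moment correlation of two equal-length columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length with at least 2 values")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of co-deviations, divided by the product of the deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical closing points, ordered by date as in the slide's sort.
prices_a = [30.0, 43.0, 41.0, 55.0, 52.0]
prices_b = [2 * p + 1 for p in prices_a]  # moves exactly with A
prices_c = [-p for p in prices_a]         # moves exactly against A

print(corr(prices_a, prices_b))
print(corr(prices_a, prices_c))
```

The result lies between -1 and 1. Both columns must be aligned by date, which is why each `search` sorts by `timestamp_dt`; a constant column has zero variance and would raise `ZeroDivisionError` in this sketch.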
  • 26. Statistical Programming - Use cases
      Use case: Determine the correlation of stock A to stocks B and C from their historical data.
      [Chart panels: 'A' to 'B'; 'A' to 'C']
      Stock 'A' is highly positively correlated with stock 'B': if stock 'B' is predicted to rise, the price of stock 'A' is highly likely to rise too, and the same holds when it falls.
      Stock 'A' is moderately negatively correlated with stock 'C': the trend of stock 'C' is not a reliable basis for predicting stock 'A'.
  • 27. Takeaway
      Streaming expressions in Apache Solr make it possible to perform
      • complex correlations with
      • map-reduce and statistical
      • operations / functions, in parallel,
      • on dynamic subsets
      • fetched from various sources
      • in near real-time
  • 28. References & Knowledge Base
      • Use cases and examples available on GitHub: /sarkaramrit2/stream-solr
      • Streaming expression official documentation in Apache Solr.
      • Statistical Programming official documentation in Apache Solr.
      • Joel Bernstein's blog.
      • Presentation links:
          • The Evolution of Streaming Expressions
          • Streaming Aggregation, New Horizons for Search
          • Analytics and Graph Traversal with Solr
          • Creating New Streaming Expressions