Integrate Solr with real-time stream processing applications
INTEGRATE SOLR WITH REAL-TIME STREAM
PROCESSING APPLICATIONS

Timothy Potter
@thelabdude
linkedin.com/thelabdude
whoami
independent consultant search / big data projects
soon to be joining engineering team @LucidWorks
co-author Solr In Action
previously big data architect Dachis Group
my storm story
re-designed a complex batch-oriented indexing pipeline based on Hadoop (Oozie, Pig, Hive, Sqoop) into a real-time storm topology
agenda
walk through how to develop a storm topology
common integration points with Solr
(near real-time indexing, percolator, real-time get)
example
listen to click events from 1.usa.gov URL shortener
(bit.ly) to determine trending US government sites
stream of click events:
http://developer.usa.gov/1usagov
http://www.smartgrid.gov -> http://1.usa.gov/ayu0Ru
beyond word count
tackle real challenges you’ll encounter when developing a storm topology
and what about ... unit testing, dependency injection, measuring runtime behavior of your components, separation of concerns, reducing boilerplate, hiding complexity ...
storm
open source distributed computation system
scalability, fault-tolerance, guaranteed message
processing (optional)
storm primitives
•  tuple: ordered list of values
•  stream: unbounded sequence of tuples
•  spout: emits a stream of tuples (source)
•  bolt: performs some operation on each tuple
•  topology: DAG of spouts and bolts
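For a concrete feel of these primitives, here is a minimal bolt written against the plain Storm API (this is not the deck's SpringBolt; the class and field names are illustrative): it receives one tuple per click event and emits a new single-field tuple into the stream.

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExtractHashBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // read a field from the incoming tuple and emit a new tuple downstream
        String hash = input.getStringByField("globalBitlyHash");
        collector.emit(new Values(hash));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hash"));
    }
}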
solution requirements
•  receive click events from 1.usa.gov stream
•  count frequency of pages in a time window
•  rank top N sites per time window
•  extract title, body text, image for each link
•  persist rankings and metadata for visualization
trending snapshot (sept 12, 2013)
[topology diagram]  The 1.usa.gov Spout feeds two paths, each using a field grouping on the bit.ly hash: (1) the EnrichLink Bolt, which calls out to the embed.ly API, followed by the Solr Indexing Bolt, which writes to Solr (data store); and (2) the Rolling Count Bolt, the Intermediate Rankings Bolt (field grouping on obj), and the Total Rankings Bolt (global grouping), followed by the Persist Rankings Bolt (global grouping), which writes to a Metrics DB. The Rolling Count, Intermediate Rankings, and Total Rankings bolts are provided by the storm-starter project.
stream grouping
•  shuffle: random distribution of tuples to all instances of a bolt
•  field(s): group tuples by one or more fields in common
•  global: reduce down to one
•  all: replicate stream to all instances of a bolt
source: https://github.com/nathanmarz/storm/wiki/Concepts
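These groupings are declared when wiring bolts into the topology. A generic sketch (the component ids, field names, and the *Bolt classes below are placeholders, not the deck's actual wiring):

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();

builder.setBolt("count-bolt", new CountBolt(), 4)
       .fieldsGrouping("click-spout", new Fields("hash"));   // field grouping: same hash -> same bolt instance

builder.setBolt("rankings-bolt", new RankingsBolt())
       .globalGrouping("count-bolt");                        // global grouping: everything to one instance

builder.setBolt("audit-bolt", new AuditBolt())
       .shuffleGrouping("click-spout");                      // shuffle grouping: random distribution

builder.setBolt("cache-refresh-bolt", new CacheRefreshBolt())
       .allGrouping("config-spout");                         // all grouping: replicate to every instance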
useful storm concepts
•  bolts can receive input from many spouts
•  tuples in a stream can be grouped together
•  streams can be split and joined
•  bolts can inject new tuples into the stream
•  components can be distributed across a cluster at a configurable parallelism level
•  optionally, storm keeps track of each tuple emitted by a spout (ack or fail)
tools
•  Spring framework – dependency injection, configuration, unit testing, mature, etc.
•  Groovy – keeps your code tidy and elegant
•  Mockito – ignore stuff your test doesn’t care about
•  Netty – fast & powerful NIO networking library
•  Coda Hale metrics – get visibility into how your bolts and spouts are performing (at a very low level)
spout
easy! just produce a stream of tuples ...
and ... avoid blocking when waiting for more data, ease off throttle if
topology is not processing fast enough, deal with failed tuples, choose if it
should use message Ids for each tuple emitted, data model / schema, etc ...
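Those concerns show up directly in a raw Storm spout's nextTuple method. A minimal sketch of the non-blocking / back-off pattern (the deck's SpringSpout handles this for you; the queue, field names, and sleep interval are illustrative):

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ClickEventSpout extends BaseRichSpout {
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<String>();
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // start a background listener that calls queue.offer(msg) as events arrive (omitted)
    }

    public void nextTuple() {
        String msg = queue.poll();              // non-blocking: never wait for data here
        if (msg == null) {
            Utils.sleep(50);                    // ease off the throttle when there is nothing to emit
            return;
        }
        collector.emit(new Values(msg), msg);   // passing a message id enables ack/fail tracking
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("json"));
    }
}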
[architecture diagram]  SpringSpout and SpringBolt hide the complexity of implementing the Storm contract, so the developer focuses on business logic in StreamingDataProvider and StreamingDataAction POJOs. Spring dependency injection wires the POJOs inside a Spring container (1 per topology per JVM), which also provides shared resources such as JDBC and WebService clients.
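The deck never shows the provider/action contracts themselves; inferred from the calls made in the code that follows, the provider side plausibly looks like the sketch below. Only open() and next() are grounded in the deck; StreamingDataAction and its method name are a guess, and NamedValues is the deck's own record type.

import java.util.Map;

interface StreamingDataProvider {
    void open(Map stormConf);          // called once when SpringSpout opens
    boolean next(NamedValues record);  // fill the next record; false means nothing to emit right now
}

interface StreamingDataAction {
    void execute(NamedValues input);   // hypothetical: invoked by SpringBolt for each incoming tuple
}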
streaming data provider
class OneUsaGovStreamingDataProvider implements StreamingDataProvider, MessageHandler {

    MessageStream messageStream   // injected via Spring Dependency Injection

    ...

    void open(Map stormConf) { messageStream.receive(this) }

    boolean next(NamedValues nv) {
        // non-blocking call to get the next message from 1.usa.gov
        String msg = queue.poll()
        if (msg) {
            // use Jackson JSON parser to create an object from the raw incoming data
            OneUsaGovRequest req = objectMapper.readValue(msg, OneUsaGovRequest)
            if (req != null && req.globalBitlyHash != null) {
                nv.set(OneUsaGovTopology.GLOBAL_BITLY_HASH, req.globalBitlyHash)
                nv.set(OneUsaGovTopology.JSON_PAYLOAD, req)
                return true
            }
        }
        return false
    }

    void handleMessage(String msg) { queue.offer(msg) }
}
  
jackson json to java
@JsonIgnoreProperties(ignoreUnknown = true)
class OneUsaGovRequest implements Serializable {

    @JsonProperty("a")
    String userAgent;

    @JsonProperty("c")
    String countryCode;

    @JsonProperty("nk")
    int knownUser;

    @JsonProperty("g")
    String globalBitlyHash;

    @JsonProperty("h")
    String encodingUserBitlyHash;

    @JsonProperty("l")
    String encodingUserLogin;

    ...
}

Spring converts json to java object for you:

<bean id="restTemplate"
      class="org.springframework.web.client.RestTemplate">
  <property name="messageConverters">
    <list>
      <bean id="messageConverter"
            class="...json.MappingJackson2HttpMessageConverter">
      </bean>
    </list>
  </property>
</bean>
  
spout data provider spring-managed bean
<bean id="oneUsaGovStreamingDataProvider"
      class="com.bigdatajumpstart.storm.OneUsaGovStreamingDataProvider">
  <property name="messageStream">
    <bean class="com.bigdatajumpstart.netty.HttpClient">
      <constructor-arg index="0" value="${streamUrl}"/>
    </bean>
  </property>
</bean>

Note: when building the StormTopology to submit to Storm, you do:

builder.setSpout("1.usa.gov-spout",
        new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
  
spout data provider unit test
class OneUsaGovStreamingDataProviderTest extends StreamingDataProviderTestBase {

    @Test
    void testDataProvider() {
        // mock json to simulate data from the 1.usa.gov feed
        String jsonStr = '''{
            "a": "user-agent", "c": "US",
            "nk": 0, "tz": "America/Los_Angeles",
            "gr": "OR", "g": "2BktiW",
            "h": "12Me4B2", "l": "usairforce",
            "al": "en-us", "hh": "1.usa.gov",
            "r": "http://example.com/foo",
            ...
        }'''

        OneUsaGovStreamingDataProvider dataProvider = new OneUsaGovStreamingDataProvider()
        // use Mockito to satisfy dependencies not needed for this test
        dataProvider.setMessageStream(mock(MessageStream))
        dataProvider.open(stormConf) // Config setup in base class
        dataProvider.handleMessage(jsonStr)

        // asserts to verify the data provider works correctly
        NamedValues record = new NamedValues(OneUsaGovTopology.spoutFields)
        assertTrue dataProvider.next(record)
        ...
    }
}
rolling count bolt
•  counts frequency of links in a sliding time window (see the counter sketch below)
•  emits topN in current window every M seconds
•  uses the tick tuple trick provided by Storm to emit every M seconds (configurable)
•  provided with the storm-starter project

http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
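Roughly how the sliding-window counting works, as a minimal sketch (the real implementation is storm-starter's SlidingWindowCounter; class and method names here are illustrative): counts go into the current slot, the window total is the sum over all slots, and each tick tuple advances the window by expiring the oldest slot.

import java.util.HashMap;
import java.util.Map;

class SlidingWindowCounterSketch<T> {
    private final Map<T, long[]> slotsByObject = new HashMap<>();
    private final int numSlots;    // window length = numSlots * tick interval
    private int headSlot = 0;

    SlidingWindowCounterSketch(int numSlots) { this.numSlots = numSlots; }

    void incrementCount(T obj) {
        slotsByObject.computeIfAbsent(obj, k -> new long[numSlots])[headSlot]++;
    }

    // called on each tick tuple: snapshot window totals, then expire the oldest slot
    Map<T, Long> getCountsThenAdvanceWindow() {
        Map<T, Long> totals = new HashMap<>();
        for (Map.Entry<T, long[]> e : slotsByObject.entrySet()) {
            long sum = 0;
            for (long c : e.getValue()) sum += c;
            totals.put(e.getKey(), sum);
        }
        headSlot = (headSlot + 1) % numSlots;
        for (long[] slots : slotsByObject.values()) slots[headSlot] = 0;  // wipe the slot being reused
        // (a real implementation would also evict objects whose counts have dropped to zero)
        return totals;
    }
}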
enrich link metadata bolt
•  calls out to the embed.ly API
•  caches results locally in the bolt instance (see the cache sketch below)
•  relies on field grouping (incoming tuples)
•  outputs data to be indexed in Solr
•  benefits from parallelism to enrich more links concurrently (watch those rate limits)
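A minimal sketch of the per-instance cache (names are illustrative, not from the deck): because tuples are field-grouped on the bit.ly hash, the same link always lands on the same bolt instance, so a small local LRU map avoids repeat embed.ly calls for links already seen.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class LinkMetadataCache<V> {
    private static final int MAX_ENTRIES = 10000;

    // access-ordered LinkedHashMap used as a simple LRU cache: evict the eldest entry when full
    private final Map<String, V> cache =
        new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > MAX_ENTRIES;
            }
        };

    V getOrFetch(String link, Function<String, V> fetchFromEmbedly) {
        // only call the embed.ly service on a cache miss
        return cache.computeIfAbsent(link, fetchFromEmbedly);
    }
}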
embed.ly service

class EmbedlyService {

    @Autowired
    RestTemplate restTemplate

    String apiKey

    // integrate coda hale metrics
    private Timer apiTimer = MetricsSupport.timer(EmbedlyService, "apiCall")

    Embedly getLinkMetadata(String link) {
        String urlEncoded = URLEncoder.encode(link, "UTF-8")
        URI uri = new URI("https://api.embed.ly/1/oembed?key=${apiKey}&url=${urlEncoded}")

        Embedly embedly = null
        // simple closure to time our requests to the Web service
        MetricsSupport.withTimer(apiTimer, {
            embedly = restTemplate.getForObject(uri, Embedly)
        })
        return embedly
    }
}
metrics
•  capture runtime behavior of the components in your topology
•  Coda Hale metrics - http://metrics.codahale.com/
•  output metrics every N minutes
•  report metrics to JMX, ganglia, graphite, etc. (see the reporter sketch after the sample output below)
-- Meters ----------------------------------------------------------------------
EnrichLinkBoltLogic.solrQueries
             count = 97
         mean rate = 0.81 events/second
     1-minute rate = 0.89 events/second
     5-minute rate = 1.62 events/second
    15-minute rate = 1.86 events/second

SolrBoltLogic.linksIndexed
             count = 60
         mean rate = 0.50 events/second
     1-minute rate = 0.41 events/second
     5-minute rate = 0.16 events/second
    15-minute rate = 0.06 events/second

-- Timers ----------------------------------------------------------------------
EmbedlyService.apiCall
             count = 60
         mean rate = 0.50 calls/second
     1-minute rate = 0.40 calls/second
     5-minute rate = 0.16 calls/second
    15-minute rate = 0.06 calls/second
               min = 138.70 milliseconds
               max = 7642.92 milliseconds
              mean = 1148.29 milliseconds
            stddev = 1281.40 milliseconds
            median = 652.83 milliseconds
              75% <= 1620.96 milliseconds
               ...
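Output like the dump above comes from a periodic reporter. A sketch assuming the Coda Hale Metrics 3.x API (com.codahale.metrics); the deck's MetricsSupport helper presumably wraps something similar, and the metric names and reporting interval are illustrative:

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;

MetricRegistry registry = new MetricRegistry();
Meter linksIndexed = registry.meter("SolrBoltLogic.linksIndexed");
Timer apiCall = registry.timer("EmbedlyService.apiCall");

// print all meters/timers every 5 minutes (JmxReporter, ganglia, graphite reporters work the same way)
ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
        .convertRatesTo(TimeUnit.SECONDS)
        .convertDurationsTo(TimeUnit.MILLISECONDS)
        .build();
reporter.start(5, TimeUnit.MINUTES);

// in a bolt: linksIndexed.mark();  around an API call: Timer.Context t = apiCall.time(); ... t.stop();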
  
storm cluster concepts
•  nimbus: master node (~job tracker in Hadoop)
•  zookeeper: cluster management / coordination
•  supervisor: one per node in the cluster to manage worker processes
•  worker: one or more per supervisor (JVM process)
•  executor: thread in worker
•  task: work performed by a spout or bolt
[cluster diagram]  A Topology JAR is submitted to Nimbus, which coordinates through Zookeeper with a Supervisor (1 per node) on each of the M nodes in the cluster. Each Supervisor manages N workers (JVM processes, e.g. Worker 1 on port 6701), and each worker runs executors (threads). Each component (spout or bolt) is distributed across a cluster of workers based on a configurable parallelism.
  
@Override
StormTopology build(StreamingApp app) throws Exception {
    ...
    TopologyBuilder builder = new TopologyBuilder()

    // parallelism hint to the framework (can be rebalanced)
    builder.setSpout("1.usa.gov-spout",
            new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)

    builder.setBolt("enrich-link-bolt",
            new SpringBolt("enrichLinkAction", enrichedLinkFields), 3)
        .fieldsGrouping("1.usa.gov-spout", globalBitlyHashGrouping)

    ...
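The deck's StreamingApp presumably takes care of submitting the built topology; with the raw Storm API it would look roughly like this (the topology name and worker count are illustrative), continuing from the builder above:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;

Config conf = new Config();
conf.setNumWorkers(2);   // number of worker JVMs the executors are spread across

// on a real cluster:
StormSubmitter.submitTopology("1usagov-topology", conf, builder.createTopology());

// or, during development, run everything in-process:
// new LocalCluster().submitTopology("1usagov-topology", conf, builder.createTopology());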
  
solr integration points
•  real-time get
•  near real-time indexing (NRT)
•  percolate (match incoming docs to pre-existing queries)
real-time get
use Solr for fast lookups by document ID
class SolrClient {

    @Autowired
    SolrServer solrServer

    SolrDocument get(String docId, String... fields) {
        SolrQuery q = new SolrQuery()
        // send the request to the "/get" request handler
        q.setRequestHandler("/get")
        q.set("id", docId)
        q.setFields(fields)
        QueryRequest req = new QueryRequest(q)
        req.setResponseParser(new BinaryResponseParser())
        QueryResponse rsp = req.process(solrServer)
        return (SolrDocument)rsp.getResponse().get("doc")
    }
}
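Illustrative usage from a bolt (the variable and field names are assumptions, not from the deck); the /get handler returns a null doc when the id is unknown:

SolrDocument doc = solrClient.get(bitlyHash, "id", "title");
if (doc == null) {
    // not indexed yet: enrich via embed.ly and emit the link for indexing
}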
near real-time indexing
•  If possible, use CloudSolrServer to route documents
directly to the correct shard leaders (SOLR-4816)
•  Use <openSearcher>false</openSearcher> for auto “hard”
commits
•  Use auto soft commits as needed
•  Use parallelism of Storm bolt to distribute indexing work to
N nodes
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
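A sketch of the indexing-bolt side under these guidelines, assuming SolrJ 4.x (the ZooKeeper connect string, collection name, and field values are illustrative):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// CloudSolrServer is ZooKeeper-aware, so it can route each document to the correct shard leader
CloudSolrServer solr = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
solr.setDefaultCollection("links");

SolrInputDocument doc = new SolrInputDocument();
doc.setField("id", bitlyHash);
doc.setField("title", metadata.title);
solr.add(doc);   // no explicit commit from the bolt: rely on autoCommit (openSearcher=false) + autoSoftCommit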
percolate
•  match incoming documents to pre-configured
queries (inverted search)
–  example: Is this tweet related to campaign Y for brand X?

•  use storm’s distributed computation support to
evaluate M pre-configured queries per doc
two possible approaches
•  Lucene-only solution using MemoryIndex
–  See presentation by Charlie Hull and Alan Woodward

•  EmbeddedSolrServer
–  Full solrconfig.xml / schema.xml
–  RAMDirectory
–  Relies on Storm to scale up documents / second
–  Easy solution for up to a few thousand queries
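A sketch of the Lucene-only approach (illustrative; Analyzer construction and field names depend on your Lucene version and schema): each incoming document is indexed alone in a MemoryIndex and scored against every stored query, and Storm's parallelism spreads those M evaluations across percolator bolt instances.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

float percolate(String tweetText, Query storedQuery, Analyzer analyzer) {
    MemoryIndex index = new MemoryIndex();
    index.addField("text", tweetText, analyzer);  // index just this one document, in memory
    return index.search(storedQuery);             // > 0.0f means the pre-configured query matches
}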
[diagram]  A Twitter Spout feeds PercolatorBolt 1 ... PercolatorBolt N (could be 100’s of these) via a random stream grouping; each PercolatorBolt wraps an EmbeddedSolrServer. Pre-configured queries are stored in a database, and ZeroMQ pub/sub is used to push query changes to the percolators.
  
tick tuples
•  send a special kind of tuple to a bolt every N
seconds
	
if (TupleHelpers.isTickTuple(input)) {
    // do special work
}
  

used in percolator to delete accumulated documents every minute or so ...
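To have Storm deliver tick tuples every N seconds, the bolt declares the frequency in its component configuration, and the TupleHelpers.isTickTuple check presumably just inspects the tuple's source. A sketch (the 60-second value is illustrative):

import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.tuple.Tuple;
import java.util.HashMap;
import java.util.Map;

// inside the bolt class:
@Override
public Map<String, Object> getComponentConfiguration() {
    Map<String, Object> conf = new HashMap<String, Object>();
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);   // this bolt receives a tick tuple every 60 seconds
    return conf;
}

public static boolean isTickTuple(Tuple tuple) {
    return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
        && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
}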
references
•  Storm Wiki
  •  https://github.com/nathanmarz/storm/wiki/Documentation
•  Overview: Krishna Gade
  •  http://www.slideshare.net/KrishnaGade2/storm-at-twitter
•  Trending Topics: Michael Noll
  •  http://www.michael-noll.com/blog/2013/01/18/implementing-real-timetrending-topics-in-storm/
•  Understanding Parallelism: Michael Noll
  •  http://www.michael-noll.com/blog/2012/10/16/understanding-theparallelism-of-a-storm-topology/
Q&A
get the code: https://github.com/thelabdude/lsrdublin

Manning coupon codes for conference related books:
http://deals.manningpublications.com/RevolutionsEU2013.html
  
