Meetup Data Analysis
By,
Sushmanth Sagala
Spark Project – UpX Academy
Project Information
 Domain: Social
 Technologies used: Spark, Spark Streaming, Spark MLlib
 Dataset: http://stream.meetup.com/2/open_events
 Meetup is an online social networking portal that facilitates offline group meetings
in various localities around the world. Meetup allows members to find and join
groups unified by a common interest, such as politics, books and games.
Sample Events Dataset
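 An abridged sample record from the open_events stream, for reference (field names follow the case classes defined later; the values shown are hypothetical):
{
  "id": "221069430",
  "name": "Spark Meetup",
  "status": "upcoming",
  "duration": 10800000,
  "payment_required": 0,
  "venue": { "name": "Some Venue", "city": "New York", "state": "NY", "country": "us", "lon": -73.98, "lat": 40.75 },
  "group": {
    "name": "NY Data Engineers",
    "city": "New York",
    "country": "us",
    "category": { "name": "Tech", "id": 34, "shortname": "tech" }
  }
}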
Business Questions
 Streaming / Spark SQL
 Load the streaming data
 Count the number of events happening in a given city, e.g. Hyderabad
 Count the number of free events
 Count the events in Technology category
 Count the number of Big data events happening in US
 Find the average duration of Technology events
 Spark MLlib
 Group the events by their category (k-means clustering)
Q1. Load the streaming data
 A custom receiver loads data from the external URL.
 An asynchronous HTTP request reads the data from the streaming URL.
 def onStart() {
   // Async HTTP client with effectively unbounded timeouts, since the stream never ends
   val cf = new AsyncHttpClientConfig.Builder()
   cf.setRequestTimeout(Integer.MAX_VALUE)
   cf.setReadTimeout(Integer.MAX_VALUE)
   cf.setPooledConnectionIdleTimeout(Integer.MAX_VALUE)
   client = new AsyncHttpClient(cf.build())
   // Pipe the HTTP response body to a reader thread
   inputPipe = new PipedInputStream(1024 * 1024)
   outputPipe = new PipedOutputStream(inputPipe)
   val producerThread = new Thread(new DataConsumer(inputPipe))
   producerThread.start()
   client.prepareGet(url).execute(new AsyncHandler[Unit] {
     def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
       bodyPart.writeTo(outputPipe)  // forward each chunk into the pipe
       AsyncHandler.STATE.CONTINUE
     }
     ….
   })
 }
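 The remaining AsyncHandler callbacks are elided above. The receiver also needs an onStop for cleanup; a minimal sketch, assuming client, inputPipe and outputPipe are the fields initialised in onStart:
def onStop() {
  // Best-effort cleanup of the HTTP client and the pipe (field names assumed from onStart)
  if (client != null) client.close()
  if (outputPipe != null) outputPipe.close()
  if (inputPipe != null) inputPipe.close()
}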
Q1. Load the streaming data
 Class DataConsumer extends Runnable; it reads lines from the piped stream and hands each one to the receiver via store().
class DataConsumer(inputStream: InputStream) extends Runnable {
  override def run() {
    val bufferedReader = new BufferedReader(new InputStreamReader(inputStream))
    var input = bufferedReader.readLine()
    while (input != null) {
      store(input)  // store() comes from the enclosing Receiver
      input = bufferedReader.readLine()
    }
  }
}
 Defining the case classes used to extract the respective data.
case class EventDetails(id: String, name: String, city: String, country: String, payment_required: Int, cat_id: Int, cat_name: String, duration: Long)
case class Venue(name: Option[String], address1: Option[String], city: Option[String], state: Option[String], zip: Option[String], country: Option[String], lon: Option[Float], lat: Option[Float])
case class Event(id: String, name: Option[String], eventUrl: Option[String], description: Option[String], duration: Option[Long], rsvpLimit: Option[Int], paymentRequired: Option[Int], status: Option[String])
case class Group(id: Option[String], name: Option[String], city: Option[String], state: Option[String], country: Option[String])
case class Category(name: Option[String], id: Option[Int], shortname: Option[String])
Q1. Load the streaming data
 The parseEvent method uses the Json4s library to extract the JSON data and build an EventDetails value (a missing duration defaults to 3 hours in milliseconds).
implicit val formats = DefaultFormats  // required by json4s extract
val json = parse(eventJson).camelizeKeys
val event = json.extract[Event]
val venue = (json \ "venue").extract[Venue]
val group = (json \ "group").extract[Group]
val category = (json \ "group" \ "category").extract[Category]
EventDetails(event.id, event.name.getOrElse(""), venue.city.getOrElse(""),
  venue.country.getOrElse(""), event.paymentRequired.getOrElse(0),
  category.id.getOrElse(0), category.shortname.getOrElse(""),
  event.duration.getOrElse(10800000L))
 Starting the event stream with a batch interval of 2 seconds:
val ssc = new StreamingContext(conf, Seconds(2))
val eventStream = ssc.receiverStream(new MeetupReceiver("http://stream.meetup.com/2/open_events"))
  .flatMap(parseEvent)
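 The slides omit the surrounding driver setup; a minimal sketch of the skeleton these snippets run under (the imports, app name and master setting are assumptions):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MeetupDataAnalysis").setMaster("local[2]")  // >= 2 cores: one is taken by the receiver
val ssc = new StreamingContext(conf, Seconds(2))
// ... define the streams shown on the following slides ...
ssc.start()             // nothing is processed until start() is called
ssc.awaitTermination()  // keep the driver alive while the job runs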
Stateful Stream
 Using a windowed stream to aggregate across intervals of the stream.
 Window and slide interval = 10 sec
 Batch interval = 2 sec
 val windowEventStream = eventStream.window(Seconds(10), Seconds(10))
windowEventStream.cache()
 Custom functions to sum aggregations when using updateStateByKey.
 def updateSumFunc(values: Seq[Int], state: Option[Int]): Option[Int] = {
   val currentCount = values.sum
   val previousCount = state.getOrElse(0)
   Some(currentCount + previousCount)
 }
 def updateSumFunc2f(values: Seq[Double], state: Option[Double]): Option[Double] = {
   val currentCount = values.sum
   val previousCount = state.getOrElse(0.0)
   Some(currentCount + previousCount)
 }
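 Note: updateStateByKey carries state across batches, so Spark Streaming requires a checkpoint directory to be set before these stateful streams run; the slides do not show this call, and the path below is an assumption.
ssc.checkpoint("checkpoint/")  // use a reliable location such as HDFS on a cluster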
Q2. Count the number of events happening in a city, e.g. Hyderabad
 Filtering the list of events happening in a given city, here "New York".
 Reducing the events to a count for this city in the current window computation.
 Aggregating the counts across window intervals using updateStateByKey.
 val cityEventsStream = windowEventStream
   .filter(event => event.city == "New York")
   .map(event => (event.city, 1))
   .reduceByKey(_ + _)
   .updateStateByKey(updateSumFunc _)
 Printing the number of events happening in "New York" during each window interval.
 cityEventsStream.foreachRDD(rdd => {rdd.foreach{case (city, count) =>
   println("No. of Events happening in %s city::%s".format(city, count))}})
Q3. Count the number of free events
 Filtering the list of free events, i.e. those whose payment_required value is 0.
 Reducing the events to get the number of free events in the current window computation.
 Aggregating the counts across window intervals using updateStateByKey.
 val freeEventsStream = windowEventStream
   .filter(event => event.payment_required == 0)
   .map(event => ("Free", 1))
   .reduceByKey(_ + _)
   .updateStateByKey(updateSumFunc _)
 Printing the number of free events happening during each window interval.
 freeEventsStream.foreachRDD(rdd => {rdd.foreach{case (free, count) =>
   println("No. of Free Events happening::%s".format(count))}})
Q4. Count the events in Technology category
 Filtering the list of Technology events (category short name "tech").
 Reducing the events to get the number of Technology events in the current window computation.
 Aggregating the counts across window intervals using updateStateByKey.
 The Technology stream is reused by a later question; its running count is kept in a driver-side variable. Note that mutating a driver variable inside foreach is only dependable in local mode; an accumulator is the portable alternative, sketched after this slide.
 val techEventsStream = windowEventStream.filter(event => event.cat_name == "tech")
 var techCount = 0
 val countTechEventsStream = techEventsStream
   .map(event => (event.cat_name, 1))
   .reduceByKey(_ + _)
   .updateStateByKey(updateSumFunc _)
 Printing the number of Technology events happening during each window interval.
 countTechEventsStream.foreachRDD(rdd => {rdd.foreach{case (cat_name, count) =>
   techCount = count
   println("No. of %s Events happening::%s".format(cat_name, count))}})
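 A sketch of the same bookkeeping with an accumulator, the dependable way to feed a value back to the driver on a cluster (assumes Spark 2.x's longAccumulator; the original code uses the plain var above):
val techCountAcc = ssc.sparkContext.longAccumulator("techCount")
countTechEventsStream.foreachRDD { rdd =>
  techCountAcc.reset()                  // driver side: clear the previous window's total
  rdd.foreach { case (cat_name, count) =>
    techCountAcc.add(count)             // executor side: adding from tasks is safe
  }
  println("No. of tech Events happening::" + techCountAcc.value)  // read back on the driver
}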
Q5. Count the number of Big data events happening in US
 Filtering the list of Big data events happening in the "US".
 Reducing the events to get the number of Big data events in the US for the current window computation.
 Aggregating the counts across window intervals using updateStateByKey.
 val bigDataUSEventsStream = windowEventStream
   .filter(event => event.country == "us" && event.name.toLowerCase.indexOf("big data") >= 0)
   .map(event => ("Big Data", 1))
   .reduceByKey(_ + _)
   .updateStateByKey(updateSumFunc _)
 Printing the number of Big data events happening in the "US" during each window interval.
 bigDataUSEventsStream.foreachRDD(rdd => {rdd.foreach{case (name, count) =>
   println("No. of %s Events happening in US::%s".format(name, count))}})
Q6. Find the average duration of Technology events
 Mapping the Technology events to their duration in minutes and reducing within the current window computation.
 Aggregating the durations across window intervals using updateStateByKey.
 Computing and printing the average duration of Technology events during each window interval.
 val sumDurTechEventsStream = techEventsStream
   .map(event => (event.cat_name + " Events", event.duration.toDouble / 60000.0))  // ms -> minutes
   .reduceByKey(_ + _)
   .updateStateByKey(updateSumFunc2f _)
 sumDurTechEventsStream.foreachRDD(rdd => {
   rdd.map{case (x: String, y: Double) => (x, y / techCount.toDouble)}
     .foreach{case (cat_name: String, avg: Double) =>
       val hrs = (avg / 60.0).toInt
       val min = (avg % 60).toInt
       println("Avg duration of %s happening::%d hours %d minutes".format(cat_name, hrs, min))
     }
 })
Sample output screenshot
Q7. Group the events by their category (k-means clustering)
 Building a recommendation model by applying k-means clustering to events.
 Group members are recommended by clustering event categories together with members' RSVP responses to those events.
 Parsing history events.
 val eventsHistory = ssc.sparkContext.textFile("data/events/events.json", 1).flatMap(parseHisEvent)
 Parsing history RSVPs.
 case class Member(memberName: Option[String], memberId: Option[String])
 case class MemberEvent(eventId: Option[String], eventName: Option[String], eventUrl: Option[String], time: Option[Long])
 The body of parseRsvp extracts the member, the event and the response:
val json = parse(rsvpJson).camelizeKeys
val member = (json \ "member").extract[Member]
val event = (json \ "event").extract[MemberEvent]
val response = (json \ "response").extract[String]
(member, event, response)
 val rsvpHistory = ssc.sparkContext.textFile("data/rsvps/rsvps.json", 1).flatMap(parseRsvp)
Q7. Group the events by their category (k-means clustering)
 Broadcasting a dictionary of English words.
 val localDictionary = Source.fromURL(getClass.getResource("/wordsEn.txt")).getLines.zipWithIndex.toMap
 val dictionary = ssc.sparkContext.broadcast(localDictionary)
 Feature extraction: take the 10 most popular words of the event description to form a category vector for each event.
 def eventToVector(dictionary: Map[String, Int], description: String): Option[Vector] = {
   val wordsIterator = breakToWords(description)
   val topWords = popularWords(wordsIterator)
   if (topWords.size == 10) Some(Vectors.sparse(dictionary.size, topWords)) else None
 }
 val eventVectors = eventsHistory.flatMap{event =>
   eventToVector(dictionary.value, event.description.getOrElse(""))}
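 breakToWords and popularWords are not shown on the slides; a plausible sketch, assuming they are nested inside eventToVector (so they can see its dictionary parameter) and that popularWords returns (dictionary index, frequency) pairs:
def breakToWords(description: String): Iterator[String] =
  "[a-z]+".r.findAllIn(description.toLowerCase)             // crude tokeniser: runs of lowercase letters

def popularWords(words: Iterator[String]): Seq[(Int, Double)] =
  words.toSeq
    .flatMap(w => dictionary.get(w))                        // keep only dictionary words, as indices
    .groupBy(identity).mapValues(_.size.toDouble).toSeq     // frequency per index
    .sortBy(-_._2).take(10)                                 // the 10 most frequent
    .sortBy(_._1)                                           // sparse vectors expect increasing indices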
Q7. Group the events by their category (k-means clustering)
 Training a k-means model on the history events (k = 10 clusters, 2 iterations).
 val eventClusters = KMeans.train(eventVectors, 10, 2)
 Keying history events and RSVPs by event ID so the two datasets can be joined.
 val eventHistoryById = eventsHistory
   .map(event => (event.id, event.description.getOrElse("")))
   .reduceByKey{(first: String, second: String) => first}  // keep one description per id
 val membersByEventId = rsvpHistory.flatMap{case (member, memberEvent, response) =>
   memberEvent.eventId.map(id => (id, (member, response)))}
 val rsvpEventInfo = membersByEventId.join(eventHistoryById)
 Example: (eventId, ((member, response), description))
 (221069430, ((Member(Some(Susan Beck),Some(101089292)), yes), ‘…’))
 (221149038, ((Member(Some(Tracy Ramey),Some(153724262)), no), ‘…’))
Q7. Group the events by their category (k-means clustering)
 Predicting each event's cluster with the trained model.
 val memberEventInfo = rsvpEventInfo.flatMap{case (eventId, ((member, response), description)) =>
   eventToVector(dictionary.value, description).map{eventVector =>
     val eventCluster = eventClusters.predict(eventVector)
     (eventCluster, (member, response))
   }
 }
 Clustering members into groups based on the predictions, keeping only "yes" responses.
 val memberGroups = memberEventInfo
   .filter{case (cluster, (member, memberResponse)) => memberResponse == "yes"}
   .map{case (cluster, (member, memberResponse)) => (cluster, member)}
   .groupByKey()
   .map{case (cluster, memberItr) => (cluster, memberItr.toSet)}
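 A quick sanity check on the grouping, not on the original slides, is to print each cluster's membership size:
memberGroups.mapValues(_.size).collect().foreach { case (cluster, n) =>
  println(s"cluster $cluster has $n members") }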
Q7. Group the events by their category (k-means clustering)
 Member recommendations based on the clustering: each member is recommended the other members of the same cluster.
 val recommendations = memberEventInfo.join(memberGroups).map{
   case (cluster, ((member, memberResponse), members)) => (member.memberName, members - member)}
 Example: (member.memberName, members)
 (Some(Rosie),Set(Member(Some(Derek),Some(84715352)), Member(Some(Pastor Jim Billetdeaux),Some(7569836)), Member(Some(Tom),Some(11503256)), Member(Some(Haeran Dempsey),Some(10724391)), Member(Some(Jane),Some(130609252)), Member(Some(Cathy),Some(42921402))))
Sample output screenshot
Conclusion
 Meetup streaming data was loaded and analysed successfully.
 Streaming event data was ingested through Spark Streaming using a custom receiver driven by asynchronous HTTP requests.
 Historical event and RSVP data were analysed with Spark MLlib to build group member recommendations based on a k-means clustering model.
 Code: https://github.com/ssushmanth/meetup-stream