We all know that MongoDB is one of the most flexible and feature-rich databases available. In this session we'll discuss how you can leverage this feature set and maintain high performance with your project's massive data sets and high loads. We'll cover how indexes can be designed to optimize the performance of MongoDB. We'll also discuss tips for diagnosing and fixing performance issues should they arise.
Media owners are turning to MongoDB to drive social interaction with their published content. The way customers consume information has changed and passive communication is no longer enough. They want to comment, share and engage with publishers and their community through a range of media types and via multiple channels whenever and wherever they are. There are serious challenges with taking this semi-structured and unstructured data and making it work in a traditional relational database. This webinar looks at how MongoDB’s schemaless design and document orientation gives organisations like the Guardian the flexibility to aggregate social content and scale out.
Building a Scalable Inbox System with MongoDB and Java (antoinegirbal)
Many user-facing applications present some kind of news feed/inbox system. You can think of Facebook, Twitter, or Gmail as different types of inboxes where the user can see data of interest, sorted by time, popularity, or other parameter. A scalable inbox is a difficult problem to solve: for millions of users, varied data from many sources must be sorted and presented within milliseconds. Different strategies can be used: scatter-gather, fan-out writes, and so on. This session presents an actual application developed by 10gen in Java, using MongoDB. This application is open source and is intended to show the reference implementation of several strategies to tackle this common challenge. The presentation also introduces many MongoDB concepts.
This document discusses tuning MongoDB performance. It covers tuning queries using the database profiler and explain commands to analyze slow queries. It also covers tuning system configurations like Linux settings, disk I/O, and memory to optimize MongoDB performance. Topics include setting ulimits, IO scheduler, filesystem options, and more. References to MongoDB and Linux tuning documentation are also provided.
- MongoDB is a document-oriented, non-relational database that scales horizontally and uses JSON-like documents with dynamic schemas.
- It offers features like embedded documents, indexing, replication, and sharding.
- Documents are stored and queried with simple statements in a JavaScript-like shell syntax, as in the sketch below.
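A minimal mongo shell session, with invented collection and field names, shows the idea:

    // store a document with whatever fields it needs -- no schema declaration required
    db.contacts.insert({ name: "Ada", emails: ["ada@example.com"], createdAt: new Date() })
    // query it back with a JSON-style predicate
    db.contacts.find({ name: "Ada" })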
MongoDB San Francisco 2013: Hash-based Sharding in MongoDB 2.4 presented by B... (MongoDB)
In version 2.4, MongoDB introduces hash-based sharding, a new option for distributing data in sharded collections. Hash-based sharding and range-based sharding present different advantages for MongoDB users deploying large scale systems. In this talk, we'll provide an overview of this new feature and discuss when to use hash-based sharding or range-based sharding.
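As a sketch of the era's shell commands (database and collection names here are invented), hash-based sharding is enabled like this:

    // shard the collection on a hash of _id so writes spread evenly across chunks
    sh.enableSharding("mydb")
    sh.shardCollection("mydb.users", { _id: "hashed" })

Range-based sharding would instead pass an ordinary ascending key such as { _id: 1 }, preserving locality for range queries at the cost of potential hot spots.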
Reducing Development Time with MongoDB vs. SQL (MongoDB)
Buzz Moschetti compares the development time and effort required to save and fetch contact data using MongoDB versus SQL over the course of two weeks. With SQL, each time a new field is added or the data structure changes, the SQL schema must be altered and code updated in multiple places. With MongoDB, the data structure can evolve freely without changes to the data access code - it remains a simple insert and find. By day 14, representing the more complex data structure in SQL would require flattening some data and storing it in non-ideal ways, while MongoDB continues to require no changes to the simple data access code.
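A hedged sketch of the point being made: in the mongo shell, the access code below is the same on day 1 and day 14, whatever shape the contact takes (field names are invented for illustration).

    // day 1: a flat contact
    db.contacts.insert({ name: "Buzz", phone: "212-555-1212" })
    // day 14: a richer structure -- no ALTER TABLE, no changes to the access code
    db.contacts.insert({
        name: "Buzz",
        phones: [ { type: "work", number: "212-555-1212" } ],
        addresses: [ { city: "New York", state: "NY" } ]
    })
    db.contacts.find({ name: "Buzz" })   // the insert-and-find pattern is unchanged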
MongoDB Europe 2016 - Debugging MongoDB Performance (MongoDB)
Asya is back, and so is Sherlock Holmes and his techniques to gather and analyze data from your poorly performing MongoDB clusters. In this advanced talk we take a deep look at all the diagnostic data that lives inside MongoDB - how to interrogate and interpret it to help you solve those frustrating performance bottlenecks that we all face occasionally.
10gen Presents Schema Design and Data Modeling (DATAVERSITY)
This document provides an overview of schema design in MongoDB. It discusses topics such as:
- The goals of schema design, which include avoiding anomalies, minimizing redesign, avoiding query bias, and making use of features.
- Key terminology when comparing MongoDB to relational databases, such as using collections instead of tables and embedding/linking instead of joins.
- Examples of basic collections, documents, indexing, and query operators.
- Common schema patterns for MongoDB like embedding, normalization, inheritance, one-to-many, many-to-many, and trees.
- Use cases like time series are also briefly covered.
This document discusses various indexing strategies in MongoDB to help scale applications. It covers the basics of indexes, including creating and tuning indexes. It also discusses different index types like geospatial indexes, text indexes, and how to use explain plans and profiling to evaluate queries. The document concludes with a section on scaling strategies like sharding to scale beyond a single server's resources.
Indexing in MongoDB works similarly to indexing in relational databases. An index is a data structure that can make certain queries more efficient by maintaining a sorted order of documents. Indexes are created using the ensureIndex() method and take up additional space and slow down writes. The explain() method is used to determine whether a query is using an index.
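A minimal shell sketch of that workflow (collection and field names assumed):

    // build a single-field index; ensureIndex() was the era-appropriate call
    // (newer releases spell it createIndex())
    db.orders.ensureIndex({ customerId: 1 })
    // explain() reveals whether the index is used: look for a BtreeCursor
    // (or IXSCAN in later versions) rather than a BasicCursor / collection scan
    db.orders.find({ customerId: 42 }).explain()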
Developers love MongoDB because its flexible document model enhances their productivity. But did you know that MongoDB supports rich queries and lets you accomplish some of the same things you currently do with SQL statements? And that MongoDB's powerful aggregation framework makes it possible to perform real-time analytics for dashboards and reports?
Attend this webinar for an introduction to the MongoDB aggregation framework and a walk through of what you can do with it. We'll also demo using it to analyze U.S. census data.
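The webinar's census demo is not reproduced here, but a pipeline in that spirit, with invented collection and field names, could look like:

    // total population per state for one census year, largest five first
    db.census.aggregate([
        { $match: { year: 2010 } },
        { $group: { _id: "$state", totalPop: { $sum: "$population" } } },
        { $sort: { totalPop: -1 } },
        { $limit: 5 }
    ])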
MongoDB .local Chicago 2019: Practical Data Modeling for MongoDB: Tutorial (MongoDB)
For 30 years, developers have been taught that relational data modeling was THE way to model, but as more companies adopt MongoDB as their data platform, the approaches that work well in relational design actually work against you in a document model design. In this talk, we will discuss how to conceptually approach modeling data with MongoDB, focusing on practical foundational techniques, paired with tips and tricks, and wrapping with discussing design patterns to solve common real world problems.
MongoDB .local Munich 2019: Best Practices for Working with IoT and Time-seri... (MongoDB)
Time series data is increasingly at the heart of modern applications - think IoT, stock trading, clickstreams, social media, and more. With the move from batch to real time systems, the efficient capture and analysis of time series data can enable organizations to better detect and respond to events ahead of their competitors or to improve operational efficiency to reduce cost and risk. Working with time series data is often different from regular application data, and there are best practices you should observe.
This talk covers:
• Common components of an IoT solution
• The challenges involved with managing time-series data in IoT applications
• Different schema designs, and how these affect memory and disk utilization – two critical factors in application performance.
• How to query, analyze and present IoT time-series data using MongoDB Compass and MongoDB Charts
At the end of the session, you will have a better understanding of key best practices in managing IoT time-series data with MongoDB.
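One schema pattern often recommended in this space, shown here only as an illustrative sketch with made-up device and field names, is bucketing: one document per device per hour, with readings appended to an array, which keeps document counts and index sizes down.

    // append a reading into the current hour's bucket, creating it if needed
    db.readings.update(
        { deviceId: "sensor-1", hour: ISODate("2019-06-01T10:00:00Z") },
        { $push: { samples: { ts: ISODate("2019-06-01T10:15:23Z"), temp: 21.4 } },
          $inc:  { count: 1 } },
        { upsert: true }
    )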
Webinar: Exploring the Aggregation Framework (MongoDB)
Developers love MongoDB because its flexible document model enhances their productivity. But did you know that MongoDB supports rich queries and lets you accomplish some of the same things you currently do with SQL statements? And that MongoDB's powerful aggregation framework makes it possible to perform real-time analytics for dashboards and reports?
Watch this webinar for an introduction to the MongoDB aggregation framework and a walk through of what you can do with it. We'll also demo an analysis of U.S. census data.
PistonHead's use of MongoDB for Analytics (Andrew Morgan)
Haymarket Media Group is building a reporting and analytics suite called PistonHub to provide dealers and administrators insights into classifieds and stock performance data. PistonHub will aggregate data from various sources like classifieds, calls, emails, and stock information to generate daily statistics for each dealer that can be viewed on a dashboard. This consolidated data will give dealers and sales teams more visibility to help dealers improve performance. Initial feedback on PistonHub has been positive, with users citing the extra insight it provides.
As your data grows, the need to establish proper indexes becomes critical to performance. MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application?
In this talk we’ll cover how indexing works, the various indexing options, and use cases where each can be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale.
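As one example of the kind of pitfall such talks cover (names invented), a compound index only helps queries that use its leading fields:

    db.events.ensureIndex({ userId: 1, createdAt: -1 })
    // served by the index: equality on the prefix, then sort on the next field
    db.events.find({ userId: 42 }).sort({ createdAt: -1 })
    // not served by it: the leading field is missing, so the index prefix can't be used
    db.events.find({ createdAt: { $gt: ISODate("2019-01-01") } })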
Back to Basics Webinar 5: Introduction to the Aggregation Framework (MongoDB)
The document provides information about an upcoming webinar on the MongoDB aggregation framework. Key details include:
- The webinar will introduce the aggregation framework and provide an overview of its capabilities for analytics.
- Examples will use a real-world vehicle testing dataset to demonstrate aggregation pipeline stages like $match, $project, and $group (sketched after this list).
- Attendees will learn how the aggregation framework provides a simpler way to perform analytics compared to other tools like Spark and Hadoop.
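The webinar's actual dataset is not reproduced here; a hypothetical pipeline over similarly shaped documents would chain those stages like this:

    // failures per make: filter, trim fields, then count per group
    db.vehicletests.aggregate([
        { $match: { result: "FAIL" } },
        { $project: { make: 1, year: 1 } },
        { $group: { _id: "$make", failures: { $sum: 1 } } }
    ])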
Thomas Rückstieß gave a presentation on indexing and query optimization in MongoDB. He discussed what indexes are, why they are needed, how to create and manage indexes, and how to optimize queries. He emphasized that absent or suboptimal indexes are a common performance problem and outlined some common indexing mistakes to avoid, such as trying to use multiple indexes per query, low-selectivity indexes, and queries that cannot use indexes at all, such as unanchored regular expressions and negation.
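To make the last point concrete (index and field names assumed), anchored regexes can walk an index while unanchored ones and negations generally cannot:

    db.users.ensureIndex({ username: 1 })
    db.users.find({ username: /^ada/ })            // anchored prefix: can use the index
    db.users.find({ username: /ada/ })             // unanchored: scans every index key
    db.users.find({ status: { $ne: "active" } })   // negation: usually a full scan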
These are slides from our Big Data Warehouse Meetup in April. We talked about NoSQL databases: What they are, how they’re used and where they fit in existing enterprise data ecosystems.
Mike O’Brian from 10gen introduced the syntax and usage patterns for a new aggregation system in MongoDB and gave some demonstrations of aggregation using the new system. The new MongoDB aggregation framework makes it simple to do tasks such as counting, averaging, and finding minima or maxima while grouping by keys in a collection, complementing MongoDB’s built-in map/reduce capabilities.
For more information, visit our website at https://casertaconcepts.com/ or email us at [email protected].
The document discusses schema design basics for MongoDB, including terms, considerations for schema design, and examples of modeling different types of data structures like trees, single table inheritance, and many-to-many relationships. It provides examples of creating indexes, evolving schemas, and performing queries and updates. Key topics covered include embedding data versus normalization, indexing, and techniques for modeling one-to-many and many-to-many relationships.
Data Processing and Aggregation with MongoDB (MongoDB)
The document discusses data processing and aggregation using MongoDB. It provides an example of using MongoDB's map-reduce functionality to count the most popular pub names in a dataset of UK pub locations and attributes. It shows the map and reduce functions used to tally the name occurrences and outputs the top 10 results. It then demonstrates performing a similar analysis on just the pubs located in central London using MongoDB's aggregation framework pipeline to match, group and sort the results.
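The shapes of the two approaches, roughly as described (collection and field names assumed):

    // map/reduce tally of pub names
    var map    = function () { emit(this.name, 1); };
    var reduce = function (key, values) { return Array.sum(values); };
    db.pubs.mapReduce(map, reduce, { out: { inline: 1 } })

    // the same tally via the aggregation framework, top ten only
    db.pubs.aggregate([
        { $group: { _id: "$name", count: { $sum: 1 } } },
        { $sort:  { count: -1 } },
        { $limit: 10 }
    ])

The central-London variant would simply prepend a $match stage (for example a geospatial $geoWithin filter) to the pipeline.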
Beyond the Basics 2: Aggregation Framework (MongoDB)
The aggregation framework is one of the most powerful analytical tools available with MongoDB.
Learn how to create a pipeline of operations that can reshape and transform your data and apply a range of analytics functions and calculations to produce summary results across a data set.
Relational databases are central to web applications, but they have also been the primary source of pain when it comes to scale and performance. Recently, non-relational databases (also referred to as NoSQL) have arrived on the scene. This session explains not only what MongoDB is and how it works, but when and how to gain the most benefit.
Back to Basics Webinar 4: Advanced Indexing, Text and Geospatial Indexes (MongoDB)
This is the fourth webinar of a Back to Basics series that will introduce you to the MongoDB database. This webinar will introduce you to advanced indexing, including text and geospatial indexes.
CosmosDB is a globally distributed, multi-model NoSQL database service designed for scalable, high-performance modern applications. CosmosDB is delivered as a fully managed service with an enterprise-grade SLA. It supports querying documents using familiar SQL over hierarchical JSON documents. Azure Cosmos DB is a superset of the DocumentDB service: it allows you to store and query NoSQL data, regardless of schema. In this presentation, you will learn: • How to get started with DocumentDB by provisioning a new database account • How to index documents • How to create applications using CosmosDB (using the REST API or programming libraries for several popular languages) • Best practices for designing applications with CosmosDB • Best practices for creating queries.
Map/Confused? A practical approach to Map/Reduce with MongoDB (Uwe Printz)
Talk given at MongoDB Munich on 16.10.2012 about the different approaches in MongoDB for using the Map/Reduce algorithm. The talk compares the performance of built-in MongoDB Map/Reduce, group(), aggregate(), find() and the MongoDB-Hadoop Adapter using a practical use case.
This document provides an overview of MongoDB aggregation which allows processing data records and returning computed results. It describes some common aggregation pipeline stages like $match, $lookup, $project, and $unwind. $match filters documents, $lookup performs a left outer join, $project selects which fields to pass to the next stage, and $unwind deconstructs an array field. The document also lists other pipeline stages and aggregation pipeline operators for arithmetic, boolean, and comparison expressions.
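Putting the stages it names together, on two hypothetical collections:

    db.orders.aggregate([
        { $match: { status: "shipped" } },          // filter documents
        { $unwind: "$items" },                      // one document per array element
        { $lookup: {                                // left outer join against products
            from: "products",
            localField: "items.sku",
            foreignField: "sku",
            as: "product" } },
        { $project: { _id: 0, "items.sku": 1, "product.name": 1 } }  // keep selected fields
    ])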
This document provides an overview and agenda for a presentation on NoSQL databases and MongoDB. It discusses the benefits and criticisms of SQL databases, introduces key concepts of NoSQL and MongoDB, and provides examples of MongoDB queries. The presentation covers why organizations adopt NoSQL, introduces MongoDB and demonstrates how to query and integrate it using Java. It also discusses data consistency, scaling and tips for using MongoDB.
Spark-driven audience counting by Boris Trofimov (JavaDayUA)
A story about the ad world and real-time segment counting. The size of the data does not allow straightforward calculations, so we dive into the solution step by step, involving some "secret" algorithms from Google.
This document discusses code contracts, which extend abstract data types with preconditions, postconditions, and invariants, allowing programmers to specify conditions that must hold before, after, and during execution. It discusses how to add contracts to code using Code Contracts in .NET and demonstrates contract verification, inheritance of contracts, and handling contract failures at runtime. Code Contracts allow formal specification and static/dynamic checking of interface behaviors to help catch errors and improve code quality.
1. Scalding is a library that provides a concise domain-specific language (DSL) for writing MapReduce jobs in Scala. It allows defining source and sink connectors, as well as data transformation operations like map, filter, groupBy, and join in a more readable way than raw MapReduce APIs.
2. Some use cases for Scalding include splitting or reusing data streams, handling exotic data sources like JDBC or HBase, performing joins, distributed caching, and building connected user profiles by bridging data from different sources.
3. For connecting user profiles, Scalding can be used to model the data as a graph with vertices for user interests and edges for bridging rules.
- MongoDB is a non-relational, document-oriented database that scales horizontally and uses JSON-like documents with dynamic schemas.
- It supports complex queries, embedded documents and arrays, and aggregation and MapReduce for querying and transforming data.
- MongoDB is used by many large companies for operational databases and analytics due to its scalability, flexibility, and performance.
This document summarizes a cloud-native stream processor. It discusses how the stream processor is lightweight, open source, and supports distributed deployment on Docker and Kubernetes. It also outlines key features like real-time data integration, complex pattern detection, online machine learning, and integration with databases and services. Use cases like fraud detection, IoT analytics, and real-time decision making are provided.
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise (WSO2)
The WSO2 analytics platform provides a high performance, lean, enterprise-ready, streaming solution to solve data integration and analytics challenges faced by connected businesses. This platform offers real-time, interactive, machine learning and batch processing technologies that empower enterprises to build a digital business. This session explores how to enable digital transformation by building a data analytics platform.
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp... (MongoDB)
This document discusses using machine learning and various machine learning platforms like MongoDB, Spark, Watson, Azure, and AWS to engage customers. It provides examples of using these platforms for tasks like topic detection on tweets, sentiment analysis, recommendation engines, forecasting, and marketing response prediction. It also discusses architectures, languages, and functions supported by tools like Mahout, MLlib, and Watson Developer Cloud.
OrientDB - The 2nd generation of (multi-model) NoSQL (Roberto Franchini)
This document provides an overview of OrientDB, a multi-model database that combines features of document, graph, and other databases. It discusses data modeling and schema, querying and traversing graph data, full-text and spatial search, deployment scenarios, and APIs. Examples show creating classes and properties, inserting and querying graph data, and live reactive queries in OrientDB.
CouchApps are web applications built using CouchDB, JavaScript, and HTML5. CouchDB is a document-oriented database that stores JSON documents, has a RESTful HTTP API, and is queried using map/reduce views. This talk will answer your basic questions about CouchDB, but will focus on building CouchApps and related tools.
This document discusses using F# for learning probabilistic models and projects at Microsoft. It covers factor graphs and inference in factor graphs for representing probabilistic models. It then describes two projects - analyzing the TrueSkill ranking algorithm and an internal adCenter competition. It concludes by outlining benefits of F# such as producing correct, succinct and high performance code while being fun to program in.
The document discusses MongoDB transactions and concurrency. It provides code examples of how to perform transactions in MongoDB using logical sessions, including inserting a document into a collection and updating related documents in another collection atomically. It also discusses some of the features and timeline for implementing distributed transactions in sharded MongoDB clusters.
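A sketch of that shell pattern, assuming a MongoDB 4.0+ replica set and invented database and collection names:

    var session = db.getMongo().startSession();
    session.startTransaction();
    var orders    = session.getDatabase("shop").orders;
    var inventory = session.getDatabase("shop").inventory;
    try {
        // both writes commit or abort together
        orders.insertOne({ _id: 1, sku: "abc", qty: 2 });
        inventory.updateOne({ sku: "abc" }, { $inc: { qty: -2 } });
        session.commitTransaction();
    } catch (e) {
        session.abortTransaction();
        throw e;
    }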
Christian Kvalheim gave an introduction to NoSQL and MongoDB. Some key points:
1) MongoDB is a scalable, high-performance, open source NoSQL database that uses a document-oriented model.
2) It supports indexing, replication, auto-sharding for horizontal scaling, and querying.
3) Documents are stored in JSON-like records which can contain various data types, including nested objects and arrays, as the example below shows.
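A small illustration (field names invented):

    db.people.insert({
        name: "Christian",
        address: { city: "Oslo", country: "NO" },
        langs: ["javascript", "scala"]
    })
    db.people.find({ "address.city": "Oslo" })   // dot notation reaches into nested objects
    db.people.find({ langs: "scala" })           // arrays match on any element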
Building Analytics Applications with Streaming Expressions in Apache Solr - A... (Lucidworks)
This document discusses building analytics applications with streaming expressions in Apache Solr. It introduces parallel computing frameworks, the streaming API, and streaming expressions. It provides examples of use cases like performing searches, facets, joins, and aggregations on real-time data from different sources. It also demonstrates how to execute expressions in parallel using worker collections and shuffling to improve performance.
How sitecore depends on mongo db for scalability and performance, and what it... (Antonios Giannopoulos)
Percona Live 2017 - How sitecore depends on mongo db for scalability and performance, and what it can teach you by Antonios Giannopoulos and Grant Killian
This document discusses MongoDB and the needs of Rivera Group, an IT services company. It notes that Rivera Group has been using MongoDB since 2012 to store large, multi-dimensional datasets with heavy read/write and audit requirements. The document outlines some of the challenges Rivera Group faces around indexing, aggregation, and flexibility in querying datasets.
Eagle6 is a product that uses system artifacts to create a replica model representing a near real-time view of system architecture. Eagle6 was built to collect system data (log files, application source code, etc.) and to link system behaviors so that the user can quickly identify risks associated with unknown or unwanted behavioral events that may have unknown impacts on seemingly unrelated downstream systems. This session presents the capabilities of the Eagle6 modeling product and how we are using MongoDB to support near-real-time analysis of large disparate datasets.
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016 (DataStax)
A deep learning startup has a requirement for a robust and scalable data architecture. Training a Deep Neural Network requires 10s-100s of millions of examples consisting of data and metadata. In addition to training it is necessary to support test/validation, data exploration and more traditional data science analytics workloads. As a startup we have minimal resources and an engineering team of 1.
Cassandra, Spark, and Kafka running on Mesos in AWS form a scalable architecture that is fast and easy to set up and maintain, delivering a data architecture for Deep Learning.
About the Speaker
Andrew Jefferson VP Engineering, Tractable
A software engineer specialising in realtime data systems. I've worked at companies from Startups to Apple on applications ranging from Ticketing to Genetics. Currently building data systems for training and exploiting Deep Neural Networks.
Real-Time Spark: From Interactive Queries to Streaming (Databricks)
This document summarizes Michael Armbrust's presentation on real-time Spark. It discusses:
1. The goals of real-time analytics, which include having the freshest answers as fast as possible while keeping them up to date.
2. Spark 2.0 introduces unified APIs for SQL, DataFrames and Datasets to make developing real-time analytics simpler with powerful yet simple APIs.
3. Structured streaming allows running the same SQL queries on streaming data to continuously aggregate data and update outputs, unifying batch, interactive, and streaming queries into a single API.
New feature overview of Cubes 1.0 – lightweight Python OLAP and pluggable data warehouse. Video: https://www.youtube.com/watch?v=-FDTK80zsXc Github sources: https://github.com/databrewery/cubes
The document discusses MongoDB and how it allows storing data in flexible, document-based collections rather than rigid tables. Some key points:
- MongoDB uses a flexible document model that allows embedding related data rather than requiring separate tables joined by foreign keys.
- It supports dynamic schemas that allow fields within documents to vary unlike traditional SQL databases that require all rows to have the same structure.
- Aggregation capabilities allow complex analytics to be performed directly on the data, without requiring data warehousing or manual export/import as with SQL databases. Pipelines of aggregation operations can be chained together, as the sketch after this list shows.
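A hedged sketch combining both points, with invented names: line items embedded in an order, analyzed in place by a chained pipeline.

    db.orders.insert({
        _id: 1001,
        customer: "Acme",
        items: [ { sku: "A1", qty: 2, price: 9.99 },
                 { sku: "B2", qty: 1, price: 24.50 } ]
    })
    // revenue per SKU, computed directly on the operational data
    db.orders.aggregate([
        { $unwind: "$items" },
        { $group: { _id: "$items.sku",
                    revenue: { $sum: { $multiply: ["$items.qty", "$items.price"] } } } }
    ])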
DataScience Lab, May 13, 2017
Correcting geometric distortions in optical satellite imagery
Alexey Kravchenko (Senior Data Scientist at Zoral Labs)
We survey the variety of available satellite data and its applications in agriculture, forestry, and land-cover mapping, then focus on geometric correction of imagery as the first step in the satellite data processing pipeline: georeferencing, image registration, sub-pixel identification of ground control points, and band alignment. We also describe some interesting and unexpected approaches to estimating satellite orientation and jitter and to building cloud masks.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_Kappa Architecture: How to implement a real-time streami... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Kappa Architecture: How to implement a real-time streaming data analytics engine
Juantomás García (Data Solutions Manager at OpenSistemas, Madrid, Spain)
We will have an introduction of what is the kappa architecture vs lambda architecture. We will see how kappa architecture is a good solution to implement solutions in (almost) real time when we need to analyze data in streaming. We will show in a case of real use: how architecture is designed, how pipelines are organized and how data scientists use it. We will review the most used technologies to implement it from apache Kafka + spark using Scala to new tools like apache beam / google dataflow.
All materials: http://datascience.in.ua/report2017
Semgrex allows users to extract information from text using patterns that match syntactic dependencies in sentences. It provides examples of patterns that match parts of speech tags and dependency relations. The document also includes links to the Semgrex npm package, a demo application on GitHub, and resources for natural language processing and syntactic dependencies.
DataScience Lab 2017_A survey of face detection methods in images (GeeksLab Odessa)
DataScience Lab, May 13, 2017
A survey of face detection methods in images
Yuriy Pashchenko (Research Engineer, Ring Labs)
In this talk we survey the newest and most popular face detection methods, such as Viola-Jones, Faster-RCNN, MTCNN, and others. We discuss the main criteria for evaluating algorithm quality as well as benchmark datasets, including FDDB, WIDER, and IJB-A.
All materials: http://datascience.in.ua/report2017
DataScienceLab2017_Patient similarity: cleaning out duplicates and predicting mis... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Patient similarity: cleaning out duplicates and predicting missing diagnoses
Viktor Sarapin (CEO at V.I.Tech)
How to efficiently detect duplicates across tens of millions of patients, and how to identify missing diagnoses and treatment actions.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab, May 13, 2017
Recent deep learning approaches for speech generation
Dmitriy Belevtsov (Techlead at IBDI)
In the last half year, several important deep-neural-network models have appeared that can successfully synthesize human speech at the level of individual samples, sidestepping many shortcomings of the classical spectral approaches. In this talk I give a short overview of the architectures of the most popular networks, such as WaveNet and SampleRNN.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab, May 13, 2017
Distributed computing: using BOINC in Data Science
Vitaliy Koshura (Software Developer at Lohika)
BOINC is open-source software for distributed computing. This talk covers how BOINC is used in various areas of science that involve processing huge volumes of data, illustrated by currently active research projects.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab, May 13, 2017
The "Data Science" master's program at UCU
Orest Kupin (Master's Student at UCU)
In this talk I present the master's program specializing in data analysis at the Ukrainian Catholic University: the program's structure and core courses, my own experience as a UCU student, and the challenges we faced this year.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_Serving models built on big data with A... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Serving models built on big data with Apache Spark
Stepan Pushkarev (GM (Kazan) at Provectus / CTO at Hydrosphere.io)
Once data preparation and model training on big data with Apache Spark are done, the question arises of how to use the trained models in real applications. Beyond the model itself, the whole data pre-processing pipeline must reach production exactly as the data scientist designed and implemented it. Solutions such as PMML/PFA, based on exporting/importing the model and the algorithm, have obvious drawbacks and limitations. In this talk we propose an alternative that simplifies using models and pipelines in real production applications.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_BioVec: Word2Vec for genomic data analysis and bioin... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
BioVec: Word2Vec for genomic data analysis and bioinformatics
Dmitriy Novitskiy (Senior Researcher at IPMMS NANU)
This talk is devoted to BioVec: applying word2vec to bioinformatics problems. We first recall how word2vec and similar word-embedding methods work, then discuss the specifics of word2vec applied to genomic sequences, the main kind of data in bioinformatics: how to train BioVec and apply it to protein classification, function prediction, and more. We close with code examples for training and using BioVec.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_Data Science and Big Data in Telecom_Alexander Saenko (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Data Science and Big Data in Telecom
Alexander Saenko (Software Engineer at SoftServe/CISCO)
Alexander presents several interesting applications of Big Data and Data Science in telecom: cellular network optimization, customer experience improvement, models for predicting mobile device location, churn prevention, fraud detection, and others, and reviews the main modern approaches to solving them with machine learning algorithms.
All materials are available at: http://datascience.in.ua/report2017
DataScienceLab2017_High-performance computing capabilities for syst... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
High-performance computing capabilities for data analysis systems
Mikhail Fedoseev (Infrastructure Solutions Architect, LanTec)
This talk covers the hardware side of data analysis systems for private clouds and on-premises high-performance computing clusters. We look at the technologies and integrated solutions from Hewlett Packard Enterprise that speed up data analysis: not only the proven, segment-leading HPE Apollo server line and high-speed HPE network switches, but also supporting components such as powerful NVIDIA graphics cards and Xeon Phi host processors. We also review the HPE Core HPC Software Stack, which lets administrators control how system resources are used.
All materials are available at: http://datascience.in.ua/report2017
DataScience Lab 2017_Monitoring fashion trends with deep learning and... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Monitoring fashion trends with deep learning and TensorFlow, Olga Romanyuk (Data Scientist at Eleks)
For the last 8 months at Eleks we worked on a fashion-trend tracking system based on a deep residual neural network with identity mappings. During training we used online data augmentation and data parallelism across two GPU cards. We built the system from scratch with TensorFlow. In this presentation I cover the practical side of the project, implementation nuances, and the pitfalls we hit along the way.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_Who's there? Automatic speaker labeling on phone ca... (GeeksLab Odessa)
DataScience Lab, May 13, 2017
Who's there? Automatic speaker labeling in phone conversations
Yuriy Guts (Machine Learning Engineer, DataRobot)
Automatic speaker diarization is an interesting problem in multimedia data processing. We need to answer the question "Who speaks when?" without knowing anything about the number or identity of the speakers present on a recording. In this talk we look at methods that work for speaker diarization of phone conversations.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_From bag of texts to bag of clusters_Evgeniy Terpil / P... (GeeksLab Odessa)
From bag of texts to bag of clusters
Evgeniy Terpil / Pavel Khudan (Data Scientists / NLP Engineer at YouScan)
We survey modern approaches to text clustering and visualization, from classic K-means on TF-IDF to deep-learning text representations. As a practical example, we analyze a set of social media posts and try to find the main topics of discussion.
All materials: http://datascience.in.ua/report2017
DataScience Lab 2017_Probabilistic graphical models for decision making in ... (GeeksLab Odessa)
Probabilistic graphical models for decision making in project management
Olga Tatarintseva (Data Scientist at Eleks)
How often do you make decisions using knowledge of a particular domain, and how good are those decisions? Now imagine you have gathered the knowledge of the best experts in that domain: decisions based on it should be far better grounded, shouldn't they? We talk about ProjectHealth, a system built on the experience of Eleks' best project management experts. It uses a probabilistic graphical model, namely a Bayesian network implemented in Python. Over the course of the project we went from requirements elicitation, data gathering, and building the model from scratch to a BI dashboard that can drill down all the way to the raw data. ProjectHealth now saves a great deal of top management time and company resources, monitoring the state of the business in the finest detail every day, like a true expert.
All materials: http://datascience.in.ua/report2017
DataScienceLab2017_Hyperparameter optimization for machine learning via ... (GeeksLab Odessa)
DataScienceLab, May 13, 2017
Hyperparameter optimization for machine learning via Bayesian optimization
Maksym Bevza (Research Engineer at Grammarly)
All machine learning algorithms need tuning. We often use grid search, randomized search, or our intuition to pick hyperparameters. Bayesian optimization helps steer randomized search toward the most promising regions, so that we get the same (or a better) result in fewer iterations.
All materials: http://datascience.in.ua/report2017
DataScienceLab2017_How to know everything about your customers (or almost everything)?_Darina Peremot (GeeksLab Odessa)
DataScienceLab, May 13, 2017
How to know everything about your customers (or almost everything)?
Darina Peremot (ML Engineer at SynergyOne)
We share our own answer to the question "What does the customer actually want?", present findings from our transaction research (down to whether you have a pet at home), and demonstrate how machine learning already helps us get to know you better.
All materials: http://datascience.in.ua/report2017
JS Lab 2017_Mapbox GL: how modern interactive maps work_Vladimir ... (GeeksLab Odessa)
JS Lab 2017, March 25
Mapbox GL: how modern interactive maps work
Vladimir Agafonkin (Lead JavaScript Engineer at MapBox)
Mapbox GL JS is an open-source JS library for building modern interactive maps on top of WebGL. In development for more than three years, it combines many remarkable technologies, complex algorithms, and ideas to achieve smooth rendering of thousands of vector features with millions of points in real time. In this talk you will learn how the library works inside and what difficulties developers of modern WebGL applications face: font rendering, triangulation of lines and polygons, spatial indexes, collision detection, label placement, point clustering, shape clipping, line simplification, sprite packing, compact binary formats, parallel data processing in the browser, render testing, and more.
All materials: http://jslab.in.ua/2017
JS Lab2017_Under the microscope: the splendor and misery of Node.js microservices (GeeksLab Odessa)
JS Lab2017, March 25, Odessa
Under the microscope: the splendor and misery of Node.js microservices
Ilya Klimov (CEO at Javascript.Ninja)
"- What is this?
- A microservice!
- And what does it do?
- It micro-crashes."
Everyone is talking about microservices these days: how they rescue you from development complexity, cut deployment time, and raise overall system reliability. This talk is about the pitfalls awaiting those who ride this hype wave with Node.js. We talk about the mistakes that cost me and my company sleepless nights, lost revenue and, at times, our faith in the power of microservice architecture.
All materials: http://jslab.in.ua/
Organizers: http://geekslab.org.ua/
Quantum Computing Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfAbi john
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility as they elevate themselves to a new era in cryptocurrency.
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
Mobile App Development Company in Saudi ArabiaSteve Jonas
EmizenTech is a globally recognized software development company, proudly serving businesses since 2013. With over 11+ years of industry experience and a team of 200+ skilled professionals, we have successfully delivered 1200+ projects across various sectors. As a leading Mobile App Development Company In Saudi Arabia we offer end-to-end solutions for iOS, Android, and cross-platform applications. Our apps are known for their user-friendly interfaces, scalability, high performance, and strong security features. We tailor each mobile application to meet the unique needs of different industries, ensuring a seamless user experience. EmizenTech is committed to turning your vision into a powerful digital product that drives growth, innovation, and long-term success in the competitive mobile landscape of Saudi Arabia.
HCL Nomad Web – Best Practices and Managing Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades will be installed “automatically” in the background. This significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web present unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understand the difference between single- and multi-user scenarios
- Utilizing Client Clocking
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfSoftware Company
Explore the benefits and features of advanced logistics management software for businesses in Riyadh. This guide delves into the latest technologies, from real-time tracking and route optimization to warehouse management and inventory control, helping businesses streamline their logistics operations and reduce costs. Learn how implementing the right software solution can enhance efficiency, improve customer satisfaction, and provide a competitive edge in the growing logistics sector of Riyadh.
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...Alan Dix
Talk at the final event of Data Fusion Dynamics: A Collaborative UK-Saudi Initiative in Cybersecurity and Artificial Intelligence funded by the British Council UK-Saudi Challenge Fund 2024, Cardiff Metropolitan University, 29th April 2025
https://alandix.com/academic/talks/CMet2025-AI-Changes-Everything/
Is AI just another technology, or does it fundamentally change the way we live and think?
Every technology has a direct impact with micro-ethical consequences, some good, some bad. However more profound are the ways in which some technologies reshape the very fabric of society with macro-ethical impacts. The invention of the stirrup revolutionised mounted combat, but as a side effect gave rise to the feudal system, which still shapes politics today. The internal combustion engine offers personal freedom and creates pollution, but has also transformed the nature of urban planning and international trade. When we look at AI the micro-ethical issues, such as bias, are most obvious, but the macro-ethical challenges may be greater.
At a micro-ethical level AI has the potential to deepen social, ethnic and gender bias, issues I have warned about since the early 1990s! It is also being used increasingly on the battlefield. However, it also offers amazing opportunities in health and educations, as the recent Nobel prizes for the developers of AlphaFold illustrate. More radically, the need to encode ethics acts as a mirror to surface essential ethical problems and conflicts.
At the macro-ethical level, by the early 2000s digital technology had already begun to undermine sovereignty (e.g. gambling), market economics (through network effects and emergent monopolies), and the very meaning of money. Modern AI is the child of big data, big computation and ultimately big business, intensifying the inherent tendency of digital technology to concentrate power. AI is already unravelling the fundamentals of the social, political and economic world around us, but this is a world that needs radical reimagining to overcome the global environmental and human challenges that confront us. Our challenge is whether to let the threads fall as they may, or to use them to weave a better future.
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Impelsys Inc.
Impelsys provided a robust testing solution, leveraging a risk-based and requirement-mapped approach to validate ICU Connect and CritiXpert. A well-defined test suite was developed to assess data communication, clinical data collection, transformation, and visualization across integrated devices.
Artificial Intelligence is providing benefits in many areas of work within the heritage sector, from image analysis, to ideas generation, and new research tools. However, it is more critical than ever for people, with their analogue intelligence, to ensure the integrity and ethical use of AI. Including real people can improve the use of AI by identifying potential biases, cross-checking results, refining workflows, and providing contextual relevance to AI-driven results.
News about the impact of AI often paints a rosy picture. In practice, there are many potential pitfalls. This presentation discusses these issues and looks at the role of analogue intelligence and analogue interfaces in providing the best results to our audiences. How do we deal with factually incorrect results? How do we get content generated that better reflects the diversity of our communities? What roles are there for physical, in-person experiences in the digital world?
Big Data Analytics Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
Java/Scala Lab: Boris Trofimov - Scorching Big Data.
1. Scalding Big ADta, or firing pots with ads
Boris Trofimov
@b0ris_1
2. Agenda
•Two stories on how AD is served inside an AD company
•Awesome Scalding
The stories mention one company that has built a multimillion-dollar business on ordinary cookies
9. The first second of an ad request (timeline reconstructed from the slide diagram):
•0 ms: Publisher receives the request
•20 ms: Publisher sends the response
•100 ms: Content delivered to the user
•150 ms: Site sends a request to the Ad Server
•170 ms: Ad Server receives the ad request and redirects to the Ad Exchange
•200 ms: SSP (Ad Exchange) receives the ad request and opens an RTB auction
•210 ms: Every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site url
•280 ms: All bidders must send their decision (participate? & price) back (an ~80 ms bidding window)
•300 ms: SSP picks the winning bid and sends the redirect url back to the Ad Server
•350 ms: Ad Server shows the user a page which redirects to the bidder’s server
•400 ms: The user’s web page asks for the ad banner from the CDN, showing the ad & the bidder’s 1x1 pixel (impression)
•~1 sec: the first second is over
Notes: ~70% of users have this cookie aboard; many (>>1) independent companies take part in each auction.
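The slide reduces each bidder’s job to a simple contract: given the user info above (ssp_cookie_id, geo data, site url), return a participate/price decision within the bidding window. Below is a minimal sketch of that contract in Scala; all type and field names are invented for illustration and are not taken from the deck (real exchanges use OpenRTB JSON).

// Hypothetical request/response shapes for a toy bidder.
case class BidRequest(sspCookieId: String, geo: String, siteUrl: String)
case class BidResponse(participate: Boolean, priceCpm: BigDecimal)

object Bidder {
  // Toy policy: bid only on users we already hold segments for,
  // pricing by how many segments the profile carries.
  def decide(req: BidRequest, profiles: Map[String, Set[Int]]): BidResponse =
    profiles.get(req.sspCookieId) match {
      case Some(segments) if segments.nonEmpty =>
        BidResponse(participate = true, priceCpm = BigDecimal("0.10") * segments.size)
      case _ =>
        BidResponse(participate = false, priceCpm = BigDecimal(0))
    }
}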
10. Return info about new user interests with special markers (segments), each indicating a newly learned fact about the user, e.g. the user is a man who has an iPhone, lives in NYC, and has a dog.
Major format: <cookie_id – segment_id>
[Slide diagram: the ad-data pipeline. On the real-time side, the Bidder Farm serves auction requests coming from the SSP / Ad Exchange, and a Pixel Tracking Farm records impressions, clicks, and post-click activities. On the offline side, hourly logs, 3rd-party data, and householder data land in Hadoop’s HDFS; Scalding and MapReduce jobs, orchestrated by Oozie, update user profiles with new segments (HBase keeps the user profiles); Hive serves as the warehouse used by the Data Scientists and for data export to partners, producing a brand-new feed about user interests.]
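To make the profile-update step concrete, here is a minimal Scalding-style sketch that folds the hourly <cookie_id – segment_id> feed into one row per cookie. The job, file, and field names are illustrative, not from the deck.

import com.twitter.scalding._

// Aggregates the hourly <cookie_id, segment_id> feed into one
// row per cookie carrying its distinct set of segments.
class UpdateProfilesJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('cookieId, 'segmentId))
    .read
    .groupBy('cookieId) { _.toList[Int]('segmentId -> 'segments) }
    .map('segments -> 'segmentsCsv) { s: List[Int] => s.distinct.mkString(",") }
    .project('cookieId, 'segmentsCsv)
    .write(Tsv(args("output")))
}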
11. Why do we need all this science?
•Deep audience targeting
•Case: a customer would like to show an ad to all men who live in NYC, have an iPhone, and have a dog (see the sketch below)
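Once interests are encoded as segments, such a targeting rule reduces to a set test. A tiny sketch, with invented segment ids:

// Hypothetical segment ids for the example targeting rule.
val Man = 101
val LivesInNYC = 202
val HasIphone = 303
val HasDog = 404
val required = Set(Man, LivesInNYC, HasIphone, HasDog)

// A profile matches the campaign if it carries every required segment.
def matches(profileSegments: Set[Int]): Boolean =
  required.subsetOf(profileSegments)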
12. Facts about Data Scientists
•Data Scientists do:
–Audience Modeling
identifying new user interests [segments] and finding ways to track them
–Audience Bridging
–Insights and Analytics
•They use IBM Netezza as a local warehouse
•They use R language
13. Facts about Realtime team
•Scala, Java
•Restful Services
•Akka
•In Memory Cache : Aerospike, Redis
14. Facts about Offline team
•The tasks we solve over Hadoop:
–As a Storage to keep all logs we need
–As Profile DB to keep all users and their interests [segments]
–As MapReduce Engine to run jobs on transformations between data
–As a Warehouse to export data via hive
•We use Cloudera CDH 5.1.2
•Major language: Scala
•Pure MapReduce jobs & Scalding/Cascading
•All map reduce applications are wrapped by Oozie’s workflow(s)
•Developing a next-gen platform version based on Spark Streaming/Kafka
16. Scalding in a nutshell (a pipeline reads from hdfs and writes back to hdfs)
•Concise DSL
•Configurable source(s) and sink(s)
•Data transform operations:
–map/flatMap
–pivot/unpivot
–project
–groupBy/reduce/foldLeft
17. Just one example (Java way)

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts emitted for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
18. Just one example (Scalding way)

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )                                        // Source
    .flatMap('line -> 'word) { line : String => tokenize(line) }   // Transform operations
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )                                // Sink

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase the text and split it on whitespace.
    text.toLowerCase.split("\\s+")
  }
}
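For reference, a fields-based Scalding job like this is typically launched through com.twitter.scalding.Tool; the jar name and paths below are placeholders:

hadoop jar myjob-assembly.jar com.twitter.scalding.Tool WordCountJob --hdfs --input /data/in.txt --output /data/out.tsv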
19. Use Case 1 Split
•Motivation: reuse calculated streams
val common = Tsv("./file").map(...)
val branch1 = common.map(..).write(Tsv("output"))
val branch2 = common.groupBy(..).write(Tsv("output"))
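A slightly fuller sketch of the same split pattern with distinct sinks; field names and paths are illustrative:

import com.twitter.scalding._

class SplitJob(args: Args) extends Job(args) {
  // The common pipe is defined once and feeds both branches.
  val common = Tsv(args("input"), ('id, 'value)).read
    .map('value -> 'upper) { v: String => v.toUpperCase }

  // Branch 1: plain projection.
  common
    .project('id, 'upper)
    .write(Tsv(args("out1")))

  // Branch 2: aggregation over the same upstream computation.
  common
    .groupBy('id) { _.size('count) }
    .write(Tsv(args("out2")))
}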
20. Use Case 2 Exotic Sources JDBC (out of the box)
case object YourTableSource extends JDBCSource {
override val tableName = "tableName"
override val columns = List(
varchar("col1", 64),
date("col2"),
tinyint("col3"),
double("col4"),
)
override def currentConfig = ConnectionSpec("www.gt.com", "username", "password", "mysql")
}
YourTableSource.read.map(...) ...
21. Use Case 2 Exotic Sources HBASE
HBaseSource (https://github.com/ParallelAI/SpyGlass)
•SCAN_ALL,
•GET_LIST,
•SCAN_RANGE
HBaseRawSource (https://github.com/andry1/SpyGlass)
•Advanced filtering via base64Scan
val hbs3 = new HBaseSource( tableName, quorum, 'key,
  List("data"), List('data),
  sourceMode = SourceMode.SCAN_ALL ).read

val scan = new Scan()
scan.setCaching(caching)
// Assumes static imports: FilterList.Operator.MUST_PASS_ONE,
// CompareFilter.CompareOp.GREATER_OR_EQUAL, and Bytes.toBytes.
val activity_filters = new FilterList(MUST_PASS_ONE, {
  val scvf = new SingleColumnValueFilter(toBytes("family"), toBytes("column"), GREATER_OR_EQUAL, toBytes(value))
  scvf.setFilterIfMissing(true)
  scvf.setLatestVersionOnly(true)
  val scvf2 = ...
  List(scvf, scvf2)
})
scan.setFilter(activity_filters)
new HBaseRawSource(tableName, quorum, families,
  base64Scan = convertScanToBase64(scan)).read. ...
22. Use Case 3 Join
•Motivation: joining two streams by key
•Different join strategies:
–joinWithLarger
–joinWithSmaller
–joinWithTiny
•Inner, Left, and Right join modes
val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file
val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)
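The choice between them is about relative data size: joinWithTiny replicates the right-hand pipe to every mapper (a map-side join), while joinWithSmaller/joinWithLarger are reduce-side joins that differ only in which side is streamed. The join mode is selected with a cascading joiner; a hedged sketch:

import cascading.pipe.joiner.LeftJoin

// Reduce-side left outer join: keeps pipe1 rows without a match.
val leftJoined = pipe1.joinWithSmaller('id1 -> 'id2, pipe2, joiner = new LeftJoin)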
23. Use Case 4 Distributed Caching and Counters

// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the resulting path can be passed into any Scalding job, e.g. through its Args object
val fileName = fl.path
...
class MyJob(args: Args) extends Job(args) {
  // once we receive fl.path we can read it like an ordinary file
  val fileName = args("fileName")
  lazy val data = readJSONFromFile(fileName)
  ...
  Tsv(args("input")).read.map('line -> 'word) {
    line: String => ... /* using the data json object */ ...
  }
}

// counter example
Stat("jdbc.call.counter", "myapp").incBy(1)
24. Use Case 5 Bridging Profiles
Motivation: bridge information from different sources and build a complete person profile
•The company’s own private cookie is set thanks to the 1x1-pixel impression (imp)
•Two SSP cookies (ssp_cookie_Id1, ssp_cookie_Id2) are bridged via the private cookie
•Profiles can also be bridged via ip address
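A hedged sketch of the private-cookie bridge as a Scalding self-join: any two SSP cookie ids ever observed alongside the same private cookie get linked. All file and field names are invented:

import com.twitter.scalding._

// Impression log rows: (privateCookie, sspCookie).
// Self-joining the log on the private cookie yields pairs of
// SSP cookies that belong to the same person.
class BridgeCookiesJob(args: Args) extends Job(args) {
  val imps = Tsv(args("impressions"), ('privateCookie, 'sspCookie)).read

  imps
    .joinWithSmaller('privateCookie -> 'privateCookie2,
      imps.rename(('privateCookie, 'sspCookie) -> ('privateCookie2, 'sspCookie2)))
    .filter(('sspCookie, 'sspCookie2)) { p: (String, String) => p._1 < p._2 } // drop self and duplicate pairs
    .project('sspCookie, 'sspCookie2)
    .write(Tsv(args("bridges")))
}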