Big Data For Chimps
See http://oreilly.com/catalog/errata.csp?isbn= for release details. Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
Table of Contents
2. Hadoop Basics
Chapter Doneness: B
Pardon this digression-laden introduction
The Map Phase Processes and Labels Records Individually
SIDEBAR: What's Fast At High Scale
Chimpanzee and Elephant Start a Business
Map-only Jobs: Process Records Individually
Transfer Data to the Cluster
Run the Job on the Cluster
Map/Reduce
Wikipedia Visitor Counts
See Progress and Results
The HDFS: Highly Durable Storage Optimized for Analytics
Summarizing UFO Sightings using Map/Reduce
UFO Sighting Data Model
Group the UFO Sightings by Time Bucket
Secondary Sort: Extend UFO Sightings with Detailed Location Information
Close Encounters of the Reindeer Kind (pt 2)
Put UFO Sightings And Places In Context By Location Name
Extend The UFO Sighting Records In Each Location Co-Group With Place Data
Partitioning, Grouping and Sorting
Chimpanzee and Elephant Save Christmas (pt 1)
Letters Cannot be Stored with the Right Context for Toy-Making
Chimpanzees Process Letters into Labelled Toy Requests
Pygmy Elephants Carry Each Toyform to the Appropriate Workbench
Hadoop vs Traditional Databases
The Map-Reduce Haiku
Hadoop's Contract
The Mapper Guarantee
The Group/Sort Guarantee
Elephant and Chimpanzee Save Christmas pt 2: A Critical Bottleneck Emerges
How Hadoop Manages Midstream Data
Mappers Spill Data In Sorted Chunks
Partitioners Assign Each Record To A Reducer By Label
Playing with Partitions: Aggregate by
Reducers Receive Sorted Chunks From Mappers
Reducers Read Records With A Final Merge/Sort Pass
Reducers Write Output Data (Which May Cost More Than You Think)
Olga, the Remarkable Calculating Pig
Pig Helps Hadoop work with Tables, not Records
Wikipedia Visitor Counts
4. Structural Operations
5. Analytic Patterns
Nanette and Olga Have an Idea
Fundamental Data Operations
LOAD..AS gives the location and schema of your source data
Simple Types
Complex Type 1: Tuples are fixed-length sequences of typed fields
Complex Type 2: Bags hold zero one or many tuples
Complex Type 3: Maps hold collections of key-value pairs for lookup
FOREACH: modify the contents of records individually
Pig Functions act on fields
FILTER: eliminate records using given criteria
LIMIT selects only a few records
Pig matches records in datasets using JOIN
Grouping and Aggregating
Complex FOREACH
Ungrouping operations (FOREACH..FLATTEN) expand records
Sorting (ORDER BY, RANK) places all records in total order
STORE operation serializes to disk
Directives that aid development: DESCRIBE, ASSERT, EXPLAIN, LIMIT..DUMP, ILLUSTRATE
DESCRIBE shows the schema of a table
ASSERT checks that your data is as you think it is
DUMP shows data on the console with great peril
ILLUSTRATE magically simulates your script's actions, except when it fails to work
EXPLAIN shows Pig's execution graph
Core Platform: Batch Processing
Sidebar: Which Hadoop Version?
Core Platform: Streaming Data Processing
Stream Analytics
Online Analytic Processing (OLAP) on Hadoop
Database Crossloading
Core Platform: Data Stores
Traditional Relational Databases
Billions of Records
Scalable Application-Oriented Data Stores
Scalable Free-Text Search Engines: Solr, ElasticSearch and More
Lightweight Data Structures
Graph Databases
Programming Languages, Tools and Frameworks
SQL-like High-Level Languages: Hive and Pig
High-Level Scripting Languages: Wukong (Ruby), mrjob (Python) and Others
Statistical Languages: R, Julia, Pandas and more
Mid-level Languages
Frameworks
Filtering
cut
Character encodings
head and tail
grep
GOOD TITLE HERE
sort
uniq
join
Summarizing
wc
md5sum and sha1sum
Enter the Dragon: C&E Corp Gains a New Partner
Intro: Storm+Trident Fundamentals
Your First Topology
Skeleton: Statistics
8. Intro to Storm+Trident
9. Statistics
Quadtile Practicalities
Converting points to quadkeys (quadtile indexes)
Exploration
Interesting quadtile properties
Quadtile Ready Reference
Working with paths
Calculating Distances
Distributing Boundaries and Regions to Grid Cells
Adaptive Grid Size
Tree structure of Quadtile indexing
Map Polygons to Grid Tiles
Weather Near You
Find the Voronoi Polygon for each Weather Station
Break polygons on quadtiles
Map Observations to Grid Cells
Turning Points of Measurements Into Regions of Influence
Finding Nearby Objects
Voronoi Polygons turn Points into Regions
Smoothing the Distribution
Results
Keep Exploring
Balanced Quadtiles
It's not just for Geo
Exercises
Refs
12. Placeholder
13. Data Munging
14. Organizing Data
15. Placeholder
16. Conceptual Model for Data Analysis
17. Machine Learning
18. Java Api
When to use the Hadoop Java API
How to use the Hadoop Java API
The Skeleton of a Hadoop Java API program
Optimizing Hadoop Dataflows
Efficient JOINs in Pig
Exercises
Executor Queues
Executor Details (?)
The Spout Pending Register
Acking and Reliability
Lifecycle of a Trident batch
exactly once Processing
Walk-through of the Github dataflow
Goal
Provisioning
Topology-level settings
Initial tuning
Sidebar: Little's Law
Batch Size
Garbage Collection and other JVM options
Tempo and Throttling
Graph Data
Refs
Appendix 1: Acquiring a Hadoop Cluster
Appendix 2: Cheatsheets
Appendix 3: Overview of Example Scripts and Datasets
Author
License
Open Street Map
References
Glossary
Preface
Mission Statement
Big Data for Chimps will:

1. Explain a practical, actionable view of big data, centered on tested best practices, as well as give readers street-fighting smarts with Hadoop.
2. Readers will also come away with a useful, conceptual idea of big data; big data in its simplest form is a small cluster of well-tagged information that sits upon a central pivot, and can be manipulated through various shifts and rotations with the purpose of delivering insights ("insight is data in context"). Key to understanding big data is scalability: infinite amounts of data can rest upon infinite pivot points. (Flip: is that accurate, or would you say there's just one central pivot, like a Rubik's cube?)
3. Finally, the book will contain examples with real data and real problems that will bring the concepts and applications for business to life.
Analogies: We'll be accompanied on part of our journey by Chimpanzee and Elephant, whose adventures are surprisingly relevant to understanding the internals of Hadoop. I don't want to waste your time laboriously remapping those adventures back to the problem at hand, but I definitely don't want to get too cute with the analogy. Again, please let me know if I err on either side.
- Run job on cluster
- See progress on jobtracker, results on HDFS
- Message Passing: visit frequency
- SQL-like Set Operations: visit frequency II
- Graph operations

Hadoop Derives Insight from Data in Context
You've already seen the first trick: processing records individually. The second trick is to form sorted context groups. There isn't a third trick. With these two tiny mustard seeds (process and contextify) we can reconstruct the full set of data analytic operations that turn mountains of data into gems of insight. C&E help SantaCorp optimize the Christmas toymaking process, demonstrating the essential problem of data locality (the central challenge of Big Data). We'll follow along with a job requiring map and reduce, and learn a bit more about Wukong (a Ruby-language framework for Hadoop).
- Chimpanzee and Elephant Save Christmas pt 1
- map/reduce: count UFO sightings
- The Hadoop Haiku
- Hadoop vs Traditional databases
- Chimpanzee and Elephant Save Christmas pt 2
- reducer guarantee
- reducers in action
- secondary sort

Hadoop Enables SQL-like Set Operations
By this point in the book you should: a) have your mind blown; b) see some compelling enough data and a compelling enough question, and a Wukong job that answers that question by using only a mapper; c) see some compelling enough data and a compelling enough question which requires a map and reduce job, written in both Pig and Wukong; d) believe the map/reduce story, i.e. you know, in general, the high-level conceptual mechanics of a map/reduce job. You'll have seen whimsical & concrete explanations of map/reduce, what's happening as a job is born and run, and HDFS.
- Count UFO visits by month
- visit jobtracker to see what Pig is doing
- Counting Wikipedia pageviews by hour (or whatever)
- should be same as UFO exploration, but: will actually require Hadoop; also do a total sort at the end

Fundamental Data Operations in Hadoop
Here's the stuff you'd like to be able to do with data, in Wukong and in Pig:
- Foreach/filter operations (messing around inside a record)
- reading data (brief practical directions on the level of "this is what you type in")
- limit
- filter
- sample
- using a hash digest function to take a signature
- top k and reservoir sampling
- refer to subuniverse (which is probably elsewhere)
- group
- join
- ??cogroup?? (does this go with group? Does it go anywhere?)
- sort, etc.: cross, cube
- total sort
- partitioner
- basic UDFs
- ?using ruby or python within a pig dataflow?

Analytic Patterns
Connect the structural operations you've seen Pig do with what is happening underneath, and flesh out your understanding of them.

1. The Hadoop Toolset and Other Practical Matters
- toolset overview
- it's a necessarily polyglot sport
- Pig is a language that excels at describing
- we think you are doing it wrong if you are not using: a declarative orchestration language, a high-level scripting language for the dirty stuff (e.g. parsing, contacting external APIs, etc.)
- UDFs (without saying "UDFs") are for: accessing a java-native library, e.g. geospatial libraries; when you really care about performance; gifting Pig with a new ability; custom loaders; etc.
- there are a lot of tools, and they all have merits: Hive, Pig, Cascading, Scalding, Wukong, MrJob, R, Julia (with your eyes open), Crunch. There aren't others that we would recommend for production use, although we see enough momentum from Impala and Spark that you can adopt them with confidence that they will mature.
- launching and debugging jobs
- overview of Wukong
- overview of Pig

2. Filesystem Mojo and cat herding
- dumping, listing, moving and manipulating files on the HDFS and local filesystems
- total sort
- transformations from the commandline (grep, cut, wc, etc.)
- pivots from the commandline (head, sort, etc.)
- commandline workflow tips
- advanced hadoop filesystem (chmod, setrep, fsck)
- pig schema
- wukong model
- loading TSV
- loading generic JSON
- storing JSON
- loading schematized JSON
- loading Parquet or Trevni
- (Reference the section on working with compressed files; call back to the points about splitability and performance/size tradeoffs)
- TSV, JSON, not XML; Protobufs, Thrift, Avro; Trevni, Parquet; Sequence Files; HAR
- compression: gz, bz2, snappy, LZO
- subsetting your data

3. Intro to Storm+Trident
- Meet Nim Seadragon
- What and Why Storm and Trident
- First Storm Job

4. Statistics (this is the first deep experience with Storm+Trident)
- Summarizing: Averages, Percentiles, and Normalization
- running / windowed stream summaries
- make a SummarizingTap trident operation that collects {Sum Count Min Max Avg Stddev SomeExampleValuesReservoirSampled} (fill in the details of what exactly this means)
- also, maybe: Median+Deciles, Histogram
- understand the flow of data going on in preparing such an aggregate, by either making sure the mechanics of working with Trident don't overwhelm that or by retracing the story of records in an aggregation
- you need a group operation: everything in a group goes to exactly one executor, exactly one machine; the aggregator hits everything in a group
- combiner-aggregators (in particular) do some aggregation beforehand, and send an intermediate aggregation to the executor that hosts the group operation
- by default, always use persistent aggregate until we find out why you wouldn't
- (BUBBLE) highlight the corresponding map/reduce dataflow and illuminate the connection
- (BUBBLE) Median / calculation of quantiles at large enough scale that doing so is hard (in next chapter we can do histogram)
- Use a sketching algorithm to get an approximate but distributed answer to a holistic aggregation problem, e.g. most frequent elements
- Rolling timeseries averages
- Sampling responsibly: it's harder and more important than you think
- consistent sampling using hashing; don't use an RNG
- appreciate that external data sources may have changed
- reservoir sampling
- connectivity sampling (BUBBLE)
- subuniverse sampling (LOC?)
- Statistical aggregates and the danger of large numbers
- numerical stability
- overflow/underflow
- working with distributions at scale
- your intuition is often incomplete
- with trillions of things, 1-in-a-billion chance things happen thousands of times
- weather temperature histogram in streaming fashion
- approximate distinct counts (using HyperLogLog)
- approximate percentiles (based on quantile digest)

5. Time Series and Event Log Processing
- Parsing logs and using regular expressions with Hadoop
- logline model
- regexp to match lines, highlighting this as a parser pattern
- reinforce the source blob, source model, domain model practice
- Histograms and time series of pageviews using Hadoop
- sessionizing: flow chart throughout site? n-views: pages viewed in sequence? ??
- Audience metrics: make sure that this serves the later chapter with the live recommender engine (lambda architecture)
- Geolocate visitors based on IP with Hadoop
- use World Cup data?
- demonstrate using a lookup table; explain it as a range query
- use a mapper-only (replicated) join; explain why we use that (small with big) but don't explain what it's doing (will be covered later)
- (Ab)Using Hadoop to stress-test your web server
- Exercise: what predicts the team a country will root for next? In particular: if, say, Mexico knocks out Greece, do Greeks root for, or against, Mexico in general?

1. Geographic Data
- Spatial join (find all UFO sightings near airports) of points with points
- map points to grid cell in the mapper; truncate at a certain zoom level (explain how to choose zoom level). Must send points to reducers for own grid key and also neighbors (9 total squares).
- Perhaps, be clever about not having to use all 9 quad grid neighbors by partitioning on a grid size more fine-grained than your original one and then use that to send points only to the pertinent grid cell reducers
- Perhaps generate the four points that are x away from you and use their quad cells.
- In the reducer, do point-by-point comparisons. Maybe a secondary sort???
- Geospatial data model, i.e. the terms and fields that you use in, e.g., GeoJSON
- We choose X; we want the focus to be on data science, not on GIS. Still have to explain feature, region, latitude, longitude, etc.
- Decomposing a map into quad-cell mapping at constant zoom level
- mapper input: <name of region, GeoJSON region boundary>. Goal 1: have a mapping from region to the quad cells it covers; Goal 2: have a mapping from quad key to partial GeoJSON objects on it. Mapper output: [thing, quadkey]; [quadkey, list of region ids, hash of region ids to GeoJSON region boundaries]
- Spatial join of points with regions, e.g. what congressional district are you in?
- in the mapper for points, emit the truncated quad key and the rest of the quad key; just stream the regions through (with the result from the prior exploration). A reducer has a quadcell, all points that lie within that quadcell, and all regions (truncated) that lie on that quadcell. Do a brute force search for the regions that the points lie on.
- Nearness query
- suppose the set of items you want to find nearness to is not huge; produce the Voronoi diagrams
- Decomposing a map into quad-cell mapping at multiple zoom levels; in particular, use Voronoi regions to show multi-scale decomposition
- Re-do spatial join with Voronoi cells in multi-scale fashion (fill in details later)
- Framing the problem (NYC vs Pacific Ocean)
- Discuss how, given a global set of features, to decompose into a multi-scale grid representation
- Other mechanics of working with geo data

2. Conceptual Model for Data Analysis
There's just one framework

3. Data Munging (Semi-Structured Data)
The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book:
- Datasets
  - Wikipedia Articles: every English-language article (12 million) from Wikipedia.
  - Wikipedia Pageviews: hour-by-hour counts of pageviews for every Wikipedia article since 2007.
  - US Commercial Airline Flights: every commercial airline flight since 1987.
  - Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.
  - Star Wars Kid weblogs: a large collection of Apache webserver logs from a popular internet site (Andy Baio's waxy.org).
- Wiki pageviews: string encoding and other bullshit
- Airport data: reconciling two mostly-agreeing datasets
- Something that has errors (SW Kid): dealing with bad records
- Weather Data: parsing a flat-pack file
- bear witness, explain that you DID have to temporarily become an amateur meteorologist, and had to write code to work with that many fields. When your schema is so complicated, it needs to be automated, too.
- join hell, when your keys change over time
- Data formats
  - airing of grievances on XML
  - airing of grievances on CSV
  - don't quote, escape
  - the only 3 formats you should use, and when to use them
- Just do a data munging project from beginning to end that wasn't too horrible
- Talk about the specific strategies and tactics
- source blob to source domain object, source domain object to business object. E.g. you want your initial extraction into a model to mirror closely the source domain data format, mainly because you do not want to mix your extraction logic and business logic (extraction logic will pollute business object code).
Also, you will end up building the wrong model for the business object, i.e. it will look like the source domain.
- Airport data: the chief challenge is reconciling data sets, dealing with conflicting errors

4. Machine Learning without Grad School
We'll equip you with a picture of how they work, but won't go into the math of how or why. We will show you how to choose a method, and how to cheat to win. We'll combine the record of every commercial flight since 1987 with the hour-by-hour weather data to predict flight delays using:
- Naive Bayes
- Logistic Regression
- Random Forest (using Mahout)

5. Full Application: Regional Flavor

6. Hadoop Native Java API
- don't

7. Advanced Pig
- Specialized joins that can dramatically speed up (or make feasible) your data transformations
- why algebraic UDFs are awesome and how to be algebraic
- Custom Loaders
- Performance efficiency and tunables
- using a filter after a cogroup will get pushed up by Pig, sez Jacob

8. Data Modeling for HBase-style Database

9. Hadoop Internals
- What happens when a job is launched
- A shallow dive into the HDFS
HDFS
Lifecycle of a File: what happens as the Namenode and Datanode collaborate to create a new file; how that file is replicated to and acknowledged by other Datanodes.
What happens when a Datanode goes down or the cluster is rebalanced. Briefly, the S3 DFS facade // (TODO: check if HFS?).
Where memory is used: in particular, mapper-sort buffers, both kinds of reducer-merge buffers, application internal buffers.

Hadoop Tuning
- Tuning for the Wise and Lazy
- Tuning for the Brave and Foolish
- The USE Method for understanding performance and diagnosing problems

Storm+Trident Internals
- Understand the lifecycle of a Storm tuple, including spout, tupletree and acking.
- (Optional but not essential) Understand the details of its reliability mechanism and how tuples are acked.
- Understand the lifecycle of partitions within a Trident batch and thus the context behind partition operations such as Apply or PartitionPersist.
- Understand Trident's transactional mechanism, in the case of a PartitionPersist.
- Understand how Aggregators, Statemap and the Persistence methods combine to give you exactly-once processing with transactional guarantees. Specifically, what an OpaqueValue record will look like in the database and why.
- Understand how the master batch coordinator and spout coordinator for the Kafka spout in particular work together to uniquely and efficiently process all records in a Kafka topic. One specific: how Kafka partitions relate to Trident partitions.

Storm+Trident Tuning

Overview of Datasets and Scripts
- Datasets
  - Wikipedia (corpus, pagelinks, pageviews, dbpedia, geolocations)
  - Airline Flights
  - UFO Sightings
  - Global Hourly Weather
  - Waxy.org Star Wars Kid Weblogs
- Scripts

Cheatsheets:
- Regular Expressions
- Sizes of the Universe
- Hadoop Tuning & Configuration Variables

Chopping block

1. Interlude I: Organizing Data
- How to design your data models
- How to serialize their contents (orig, scratch, prod)
- How to organize your scripts and your data

2. Graph Processing
- Graph Representations
- Community Extraction: use the page-to-page links in Wikipedia to identify similar documents
- Pagerank (centrality): reconstruct pageview paths from web logs, and use them to identify important pages

3. Text Processing
We'll show how to combine powerful existing libraries with Hadoop to do effective text handling and Natural Language Processing:
- Indexing documents
- Tokenizing documents using Lucene
- Pointwise Mutual Information
- K-means Clustering

4. Interlude II: Best Practices and Pedantic Points of Style
- Pedantic Points of Style
- Best Practices
- How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc.); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
- Why Hadoop
- robots are cheap, people are important
Hadoop
In Doug Cutting's words, Hadoop is "the kernel of the big-data operating system." It is the dominant batch-processing solution, has both commercial enterprise support and a huge open source community, runs on every platform and cloud, and there are no signs any of that will change in the near term. The code in this book will run unmodified on your laptop computer and on an industrial-strength Hadoop cluster. (Of course, you will need to use a reduced data set for the laptop.) You do need a Hadoop installation of some sort; the appendix (TODO: ref) describes your options, including instructions for running Hadoop on a multi-machine cluster in the public cloud. For a few dollars a day you can analyze terabyte-scale datasets.
Helpful Reading
- Hadoop: The Definitive Guide by Tom White is a must-have. Don't try to absorb it whole (the most powerful parts of Hadoop are its simplest parts), but you'll refer to it often as your applications reach production.
- Hadoop Operations by Eric Sammer: hopefully you can hand this to someone else, but the person who runs your Hadoop cluster will eventually need this guide to configuring and hardening a large production cluster.
- Big Data: Principles and Best Practices of Scalable Realtime Data Systems by Nathan Marz
Feedback
The source code for the book (all the prose, images, the whole works) is on GitHub at http://github.com/infochimps-labs/big_data_for_chimps. Contact us! If you have questions, comments or complaints, the issue tracker at http://github.com/infochimps-labs/big_data_for_chimps/issues is the best forum for sharing those. If you'd like something more direct, please email [email protected]
(the ever-patient editor) and [email protected] (your eager author). Please include both of us. OK! On to the book. Or, on to the introductory parts of the book and then the book.
About
What this book covers
Big Data for Chimps shows you how to solve important hard problems using simple, fun, elegant tools.

Geographic analysis is an important hard problem. To understand a disease outbreak in Europe, you need to see the data from Zurich in the context of Paris, Milan, Frankfurt and Munich; but to understand the situation in Munich requires context from Zurich, Prague and Vienna; and so on. How do you understand the part when you can't hold the whole world in your hand?

Finding patterns in massive event streams is an important hard problem. Most of the time, there aren't earthquakes, but the patterns that will let you predict one in advance lie within the data from those quiet periods. How do you compare the trillions of subsequences in billions of events, each to each other, to find the very few that matter? Once you have those patterns, how do you react to them in real-time?

We've chosen case studies anyone can understand that generalize to problems like those and the problems you're looking to solve. Our goal is to equip you with:
- How to think at scale: a deep understanding of how to break a problem into efficient data transformations, and of how data must flow through the cluster to effect those transformations.
- Detailed example programs applying Hadoop to interesting problems in context
- Advice and best practices for efficient software development

All of the examples use real data, and describe patterns found in many problem domains:
- Statistical summaries
- Identifying patterns and groups in the data
- Searching, filtering and herding records in bulk
- Advanced queries against spatial or time-series data sets

The emphasis on simplicity and fun should make this book especially appealing to beginners, but this is not an approach you'll outgrow. We've found it's the most powerful and valuable approach for creative analytics. One of our maxims is "Robots are cheap,
Humans are important": write readable, scalable code now and find out later whether you want a smaller cluster. The code you see is adapted from programs we write at Infochimps to solve enterprise-scale business problems, and these simple high-level transformations (most of the book) plus the occasional Java extension (chapter XXX) meet our needs.

Many of the chapters have exercises included. If you're a beginning user, I highly recommend you work out at least one exercise from each chapter. Deep learning will come less from having the book in front of you as you read it than from having the book next to you while you write code inspired by it. There are sample solutions and result datasets on the book's website.

Feel free to hop around among chapters; the application chapters don't have large dependencies on earlier chapters.
How to Contact Us
Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(707) 829-0515 (international or local)

To comment or ask technical questions about this book, send email to [email protected]

To reach the authors: Flip Kromer is @mrflip on Twitter. For comments or questions on the material, file a GitHub issue at http://github.com/infochimps-labs/big_data_for_chimps/issues
CHAPTER 1
First Exploration
////Write an introductory paragraph that specifically plants a first seed about the conceptual way of viewing big data. Then, write a paragraph that puts this chapter in context for the reader, introduces it ("in this chapter we'll show you how to start with a question and arrive at an answer without coding a big, hairy, monolithic program"). Orient your reader to big data and the goals for lassoing it. Doing this will hook your reader and prep their mind for the chapter's main thrust, its juicy bits. Finally, say a word or two about big data before getting into Hadoop, for context (like "big data is to Hadoop what x is to y"). Do these things before you jump so quickly into Hadoop. Amy////

Hadoop is a remarkably powerful tool for processing data, giving us at long last mastery over massive-scale distributed computing. More than likely, that's how you came to be reading this paragraph. What you might not yet know is that Hadoop's power comes from embracing, not conquering, the constraints of distributed computing. This exposes a core simplicity that makes programming it exceptionally fun.

Hadoop's bargain is this. You must give up fine-grained control over how data is read and sent over the network. Instead, you write a series of short, constrained transformations, a sort of programming haiku:
Data flutters by
Elephants make sturdy piles
Insight shuffles forth
For any such program, Hadoop's diligent elephants intelligently schedule the tasks across one or dozens or thousands of machines. They attend to logging, retry and error handling; distribute your data to the workers that process it; handle memory allocation, partitioning and network routing; and attend to myriad other details that otherwise stand between you and insight. Putting these constraints on how you ask your question releases constraints that traditional database technology placed on your data. Unlocking access to data that is huge, unruly, organic, highly-dimensional and deeply connected unlocks answers to a new, deeper set of questions about the large-scale behavior of humanity and our universe. <remark>too much?? pk4</remark>
1. Please discard any geographic context of the word "local": for the rest of the book it will always mean "held in the same computer location".
to: "This question has what we call easy locality, essentially, held in the same computer location (nothing to do with geography)." Amy////

Instead of the places, let's look at the words. Barbeque is popular all through Texas and the Southeastern US, and as you'll soon be able to prove, the term "Barbeque" is overrepresented in articles from that region. You and cousin Bubba would be able to brainstorm a few more terms with strong place affinity, like "beach" (the coasts) or "wine" (France, Napa Valley), and you would guess that terms like "hat" or "couch" will not. But there's certainly no simple way you could do so comprehensively or quantifiably. That's because this question has no easy locality: we'll have to dismantle and reassemble the entire dataset, in stages, to answer it. This understanding of locality is the most important concept in the book, so let's dive in and start to grok it. We'll just look at the step-by-step transformations of the data for now, and leave the actual code for a later chapter.
Where is Barbecue?
So here's our first exploration:
For every word in the English language, which of them have a strong geographic flavor, and what are the places they attach to?
This may not be a practical question (though I hope you agree it's a fascinating one), but it is a template for a wealth of practical questions. It's a geospatial analysis showing how patterns of term usage, such as ////list a couple quick examples of usage////, vary over space; the same approach can instead uncover signs of an epidemic from disease reports, or common browsing behavior among visitors to a website. It's a linguistic analysis attaching estimated location to each term; the same approach can instead quantify document authorship for legal discovery, letting you prove the CEO did authorize his nogoodnik stepson to destroy that orphanage. It's a statistical analysis requiring us to summarize and remove noise from a massive pile of term counts; we'll use those methods ////unclear on which methods you're referring to? Amy//// in almost every exploration we do. It isn't itself a time-series analysis, but you'd use this data to form a baseline to detect trending topics on a social network or the anomalous presence of drug-trade related terms on a communication channel.

//// Consider defining the italicized terms, above, such as geospatial analysis, linguistic analysis, etc., inline (for example, "It's a linguistic analysis, the study of language, attaching estimated location to each term"). Amy////

//// Provide brief examples of how these methods might be useful, examples to support the above; offer questions that could be posed for each. For example, "for every symptom, how it correlates to the epidemic and what zip codes the symptoms are attached to." Amy////
Figure 1-1. Not the actual output, but gives you the picture; TODO insert actual results
Lexington is a town in Lee County, Texas, United States. Snow's BBQ, which Texas Monthly called "the best barbecue in Texas" and The New Yorker named "the best Texas BBQ in the world", is located in Lexington.
You can do this to each article separately, in any order, and with no reference to any other article. That's important! Among other things, it lets us parallelize the process across as many machines as we care to afford. We'll call this type of step a transform: it's independent, non-order-dependent, and isolated.
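The real code waits for a later chapter, but to make the shape of a transform concrete, here is a minimal sketch in Pig. Everything in it is illustrative: the file name wikipedia_articles.tsv and its (page_id, text) schema are assumptions, and rolling the tokens up into per-article counts would take one more small step (a UDF or a grouping) that the real scripts handle.

    -- Assumed input: one row per article, as (page_id, text).
    articles  = LOAD 'wikipedia_articles.tsv' AS (page_id:chararray, text:chararray);

    -- A pure transform: each record is handled on its own, in any order, with no
    -- reference to any other record, so Hadoop is free to parallelize it widely.
    word_uses = FOREACH articles GENERATE page_id, FLATTEN(TOKENIZE(text)) AS word;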
Bin by Location
The article geolocations are kept in a different data file: Article coordinates.
Lexington,_Texas -97.01 30.41 023130130
We don't actually need the precise latitude and longitude, because rather than treating the article as a point, we want to aggregate by area. Instead, we'll lay a set of grid lines down covering the entire world and assign each article to the grid cell it sits on. That funny-looking number in the fourth column is a quadkey 2, a very cleverly-chosen label for the grid cell containing this article's location. To annotate each wordbag with its grid cell location, we can do a join of the two files on the Wikipedia ID (the first column).

Picture for a moment a tennis meetup, where you'd like to randomly assign the attendees to mixed-doubles (one man and one woman) teams. You can do this by giving each person a team number when they arrive (one pool of numbers for the men, an identical but separate pool of numbers for the women). Have everyone stand in numerical order on the court (men on one side, women on the other) and walk forward to meet in the middle; people with the same team number will naturally arrive at the same place and form a single team. That is effectively how Hadoop joins the two files: it puts them both in order by page ID, making records with the same page ID arrive at the same locality, and then outputs the combined record: Wordbag with coordinates.
2. you will learn all about quadkeys in the Geographic Data chapter
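Here is how that join might be sketched in Pig. The file names and schemas are assumptions for illustration; the wordbag rides along as a single bag-valued field.

    -- Hypothetical inputs: per-article wordbags, and the coordinates file shown
    -- above (page id, longitude, latitude, quadkey).
    wordbags = LOAD 'article_wordbags'    AS (page_id:chararray, wordbag:bag{t:(word:chararray, uses:long)});
    coords   = LOAD 'article_coordinates' AS (page_id:chararray, lng:double, lat:double, quadkey:chararray);

    -- Hadoop puts both inputs in order by page_id, so records with the same page ID
    -- arrive at the same place (the tennis-team trick) and emerge combined.
    located  = JOIN wordbags BY page_id, coords BY page_id;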
Gridcell statistics
We have wordbag records labeled by quadkey for each article, but we want combined wordbags for each grid cell. So we'll group the wordbags by quadkey, then turn the individual wordbags into a combined wordbag:
023130130 {(("many", X),...,("texas",X),...,("town",X)...("longhorns",X),...("bbq",X),...}
A pause, to think
Let's look at the fundamental pattern that we're using. Our steps:

1. transform each article individually into its wordbag
2. augment the wordbags with their geo coordinates by joining on page ID
3. organize the wordbags into groups having the same grid cell
4. form a single combined wordbag for each grid cell.

//// Consider adding some text here that guides the reader with regard to the findings they might expect to result. For example, if you were to use the example of finding
symptoms that intersect with illness as part of an epidemic, you would have done x, y, and z. This will bring the activity to life and help readers appreciate how it applies to their own data at hand. Amy////

It's a sequence of transforms (operations on each record in isolation: steps 1 and 4) and pivots (operations that combine records, whether from different tables, as with the join in step 2, or the same dataset, as with the group in step 3). In doing so, we've turned articles that have a geolocation into coarse-grained regions that have implied frequencies for words. The particular frequencies arise from this combination of forces:

- signal: terms that describe aspects of the human condition specific to each region, like "longhorns" or "barbecue", and direct references to place names, such as "Austin" or "Texas"
- background: the natural frequency of each term ("second" is used more often than "syzygy"), slanted by its frequency in geo-locatable texts (the word "town" occurs far more frequently than its natural rate, simply because towns are geolocatable)
- noise: deviations introduced by the fact that we have a limited sample of text to draw inferences from.

Our next task, the sprint home, is to use a few more transforms and pivots to separate the signal from the background and, as far as possible, from the noise.
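Continuing the hypothetical Pig sketch from the join above (the located relation), steps 3 and 4 might look like the following. For simplicity the combined wordbag comes out as flat (quadkey, word, count) rows rather than one bag-valued record per grid cell.

    -- step 3 (a pivot): organize wordbag entries by the grid cell they fall in
    cell_words  = FOREACH located GENERATE coords::quadkey AS quadkey,
                                           FLATTEN(wordbags::wordbag) AS (word, uses);
    by_cell     = GROUP cell_words BY (quadkey, word);

    -- step 4 (a transform of each group): total the counts for each cell's words
    cell_counts = FOREACH by_cell GENERATE FLATTEN(group) AS (quadkey, word),
                                           SUM(cell_words.uses) AS uses;

    STORE cell_counts INTO 'gridcell_wordbags';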
combine those counts into rates, and form the PMI scores. Rather than step through each operation, I'll wave my hands and pull its output from the oven:
023130130 {(("texas",X),...,("longhorns",X),...("bbq",X),...,...}
As expected, in Figure 1-1 you see BBQ loom large over Texas and the Southern US; Wine, over the Napa Valley 3.
3. This is a simplified version of work by Jason Baldridge, Ben Wing (TODO: rest of authors), who go farther and show how to geolocate texts based purely on their content. An article mentioning barbecue and Willie Nelson would be placed near Austin, TX; one mentioning startups and trolleys in San Francisco. See: Baldridge et al (TODO: reference)
CHAPTER 2
Hadoop Basics
Chapter Doneness: B
- Introduction: exists, lots of extra stuff, not readable
- description of map phase: good
- demonstration hadoop job: Chimpanzee and Elephant translate Shakespeare
- description of mapper phase
- how to run a Hadoop job
- Seeing your job's progress and output
- Sidebar: What's fast at high scale
- Overview of the HDFS

This chapter and the next couple will see some reshuffling to give the following narrative flow:
1. (first chapter)
2. (this one) Here's how to use Hadoop
3. Here's your first map/reduce jobs, and how data moves around behind the scenes
4. Pig lets you work with whole datasets, not record-by-record
5. The core analytic patterns of Hadoop, as Pig scripts and as Wukong map/reduce scripts
We've just seen how ... Now let's understand a high-level picture of what Hadoop is doing, and why this makes it scalable. (Merge sort, secondary sort.) So far we've seen two paradigms: distributed work; record-oriented.
- Letters to toyforms
- Toyforms to parts forms, parts and toyforms to desks
- Toys by type and subtype
- Toys by crate and then address

Map/Reduce Paradigm
- Elephant and Chimpanzee Save Christmas, part 1: Elves in Crisis; Making Toys: Children's Letters Become Labelled Toy Forms; Making Toys: Toy Forms Dispatched to Workbench; map/reduce
- part 2: Shipping Toys: Cities are Mapped to Crates; Shipping Toys: Tracking Inventory: Secondary Sort
- part 3?: Tracking Inventory; Aggregation

Structural Operations Paradigm
- Overview of Operation types
- FOREACH processes records individually
- FILTER
- JOIN matches records in two tables
- Use a Replicated JOIN When One Table is Small
- GROUP with Aggregating Functions Summarize Related Records
- GROUP and COGROUP Assemble Complex
- After a GROUP, a FOREACH has special abilities

The harsh realities of the laws of physics and economics prevent traditional data analysis solutions such as relational databases, supercomputing and so forth from economically
scaling to arbitrary-sized data, for reasons very similar to Santa's original system (see sidebar). Hadoop's Map/Reduce paradigm does not provide complex operations, modification of existing records, fine-grain control over how the data is distributed or anything else beyond the ability to write programs that adhere to a single, tightly-constrained template. If Hadoop were a publishing medium, it would be one that refused essays, novels, sonnets and all other literary forms beyond the haiku:
data flutters by
elephants make sturdy piles
context yields insight

data flutters by              (process and label records)
elephants make sturdy piles   (contextify/assemble (?) by label)
insight shuffles forth        (process context groups; store(?))
(TODO: insight shuffles forth?) Our Map/Reduce haiku illustrates Hadoop's template:

1. The Mapper portion of your script processes records, attaching a label to each.
2. Hadoop assembles those records into context groups according to their label.
3. The Reducer portion of your script processes those context groups and writes them to a data store or external system.

What is remarkable is that from this single primitive, we can construct the familiar relational operations (such as GROUPs and ROLLUPs) of traditional databases, many machine-learning algorithms, matrix and graph transformations and the rest of the advanced data analytics toolkit. In the next two chapters, we will demonstrate high-level relational operations and illustrate the Map/Reduce patterns they express. In order to understand the performance and reasoning behind those patterns, let's first understand the motion of data within a Map/Reduce job.
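As a preview of that correspondence, here is a tiny Pig script with an assumed schema; the comments note which line of the haiku each part becomes once Pig translates it into a Map/Reduce job.

    -- Assumed input: one record per (page, hour, number of views).
    visits  = LOAD 'pageview_logs' AS (page:chararray, hour:chararray, views:long);

    -- "process and label records": the Mapper labels each record with its page
    by_page = GROUP visits BY page;

    -- "assemble by label" happens inside Hadoop; then "process context groups":
    -- the Reducer receives one page's context group at a time and totals it
    totals  = FOREACH by_page GENERATE group AS page, SUM(visits.views) AS total_views;

    STORE totals INTO 'pageview_totals';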
while the spam letter produced no toy forms. Hadoop's distcp utility, used to copy data from cluster to cluster, takes this to a useful extreme: each Mapper's input is a remote file to fetch. Its action is to write that file's contents directly to the HDFS as a Datanode client, and its output is a summary of what it transferred.

The right way to bring in data from an external resource is by creating a custom loader or input format (see the chapter on Advanced Pig (TODO: REF)), which decouples loading data from processing data and allows Hadoop to intelligently manage tasks. The poor-man's version of a custom loader, useful for one-offs, is to prepare a small number of file names, URLs, database queries or other external handles as input and emit the corresponding contents.

Please be aware, however, that it is only appropriate to access external resources from within a Hadoop job in exceptionally rare cases. Hadoop processes data in batches, which means failure of a single record results in the retry of the entire batch. It also means that when the remote resource is unavailable or responding sluggishly, Hadoop will spend several minutes and unacceptably many retries before abandoning the effort. Lastly, Hadoop is designed to drive every system resource at its disposal to its performance limit. (FOOTNOTE: We will drive this point home in the chapter on Event Log Processing (TODO: REF), where we will stress-test a web server to its performance limit by replaying its request logs at full speed.)

While a haiku with only its first line is no longer a haiku, a Hadoop job with only a Mapper is a perfectly acceptable Hadoop job, as you saw in the Pig Latin translation example. In such cases, each Map Task's output is written directly to the HDFS, one file per Map Task, as you've seen. Such jobs are only suitable, however, for so-called "embarrassingly parallel" problems, where each record can be processed on its own with no additional context.

The Map stage in a Map/Reduce job has a few extra details. It is responsible for labeling the processed records for assembly into context groups. Hadoop files each record into the equivalent of the pigmy elephants' file folders: an in-memory buffer holding each record in sorted order. There are two additional wrinkles, however, beyond what the pigmy elephants provide. First, the Combiner feature lets you optimize certain special cases by preprocessing partial context groups on the Map side; we will describe these more in a later chapter (TODO: REF). Second, if the sort buffer reaches or exceeds a total count or size threshold, its contents are spilled to disk and subsequently merge/sorted to produce the Mapper's proper output.
The table at the right (TODO: REF) summarizes the 2013 values for Peter Norvig's "Numbers Every Programmer Should Know": the length of time for each computation primitive on modern hardware. We've listed the figures several different ways: as latency (time to execute); as the number of 500-byte records that could be processed in an hour (TODO: day), if that operation were the performance bottleneck of your process; and as an amount of money to process one billion records of 500 bytes each on commodity hardware. Big Data requires high-volume, high-throughput computing, so our principal bound is the speed at which data can be read from and stored to disk. What is remarkable is that with the current state of technology, most of the other operations are slammed to one limit or the other: either bountifully unconstraining or devastatingly slow. That lets us write down the following rules for performance at scale:

- High-throughput programs cannot run faster than x (TODO: insert number).
- Data can be streamed to and from disk at x GB per hour (x records per hour, y records per hour, z dollars per billion records) (TODO: insert numbers).
- High-throughput programs cannot run faster than that, but should not run an order of magnitude slower, either.
- Data streams over the network at the same rate as disk.
- Memory access is infinitely fast.
- CPU is fast enough to not worry about, except in the obvious cases where it is not.
- Random access (seeking to individual records) on disk is unacceptably slow.
- Network requests for data (anything involving a round trip) are infinitely slow.
- Disk capacity is free.
- CPU and network transfer costs are cheap.
- Memory is expensive and cruelly finite. For most tasks, available memory is either all of your concern or none of your concern.

Now that you know how Hadoop moves data around, you can use these rules to explain its remarkable scalability:

1. The Mapper streams data from disk and spills it back to disk; it cannot go faster than that.
2. In between, your code processes the data.
3. If your unwinding of proteins or multiplying of matrices is otherwise CPU- or memory-bound, Hadoop at least will not get in your way; the typical Hadoop job can process records as fast as they are streamed.
4. Spilled records are sent over the network and spilled back to disk; again, it cannot go faster than that.

That leaves the big cost of most Hadoop jobs: the midstream merge-sort. Spilled blocks are merged in several passes (at the Reducer and sometimes at the Mapper) as follows. Hadoop begins streaming data from each of the spills in parallel. Under the covers, what this means is that the OS is handing off the contents of each spill as blocks of memory in sequence. It is able to bring all its cleverness to bear, scheduling disk access to keep the streams continually fed as rapidly as each is consumed. Hadoop's actions are fairly straightforward. Since the spills are each individually sorted, at every moment the next (lowest-ordered) record to emit is guaranteed to be the next unread record from one of its streams. It continues in this way, eventually merging each of its inputs into an unbroken output stream to disk. The memory requirements (the number of parallel streams times the buffer size per stream) are manageable, and the CPU burden is effectively nil, so the merge/sort also runs at the speed of streaming to disk. At no point does the Hadoop framework require a significant number of seeks on disk or requests over the network.

Introduce the chapter to the reader:
- take the strands from the last chapter, and show them braided together
- in this chapter, you'll learn ... OR "ok, we're done looking at that, now let's xxx"
- Tie the chapter to the goals of the book, and weave in the larger themes
- perspective, philosophy, what we'll be working on, a bit of repositioning, a bit opinionated, a bit personal.
form 1 To ensure that no passage is ever lost, the librarians on Nanette's team send regular reports on the scrolls they maintain. If ever an elephant doesn't report in (whether it stepped out for an hour or left permanently), Nanette identifies the scrolls designated for that elephant and commissions the various librarians who hold other replicas of that scroll to make and dispatch fresh copies. Each scroll also bears a check of authenticity validating that photocopying, transferring its contents or even mouldering on the shelf has caused no loss of fidelity. Her librarians regularly recalculate those checks and include them in their reports, so if even a single letter on a scroll has been altered, Nanette can commission a new replica at once.
1. When Nanette is not on the job, it's a total meltdown; a story for much later in the book. But you'd be wise to always take extremely good care of the Nanettes in your life.
    head = 'w' if head.blank?
    tail.capitalize! if head =~ UPPERCASE_RE
    "#{tail}-#{head.downcase}ay"
  end
  yield(latinized)
end
Ruby helper
The first few lines define regular expressions selecting the initial characters (if any) to move. Writing their names in ALL CAPS makes them constants. Wukong calls the each_line do ... end block with each line; the |line| part puts it in the line variable. The gsub (globally substitute) statement calls its do ... end block with each matched word, and replaces that word with the last line of the block. yield(latinized) hands off the latinized string for Wukong to output.
It's best to begin developing jobs locally on a subset of data. Run your Wukong script directly from your terminal's command line:
wu-local examples/text/pig_latin.rb data/magi.txt -
The - at the end tells Wukong to send its results to standard out (STDOUT) rather than a file; you can pipe its output into other Unix commands or Wukong scripts. In this case, there is no consumer, and so the output should appear on your terminal screen. The last line should read:
Everywhere-way ey-thay are-way isest-way. Ey-thay are-way e-thay agi-may.
That's what it looks like when a cat is feeding the program data; let's see how it works when an elephant sets the pace.
These commands understand ./data/text to be a path on the HDFS, not your local disk; the dot . is treated as your HDFS home directory (use it as you would ~ in Unix). The wu-put command, which takes a list of local paths and copies them to the HDFS, treats its final argument as an HDFS path by default, and all the preceding paths as being local.
TODO: something about what the reader can expect to see on screen.

While the script outputs a bunch of happy robot-ese to your screen, open up the jobtracker in your browser window by visiting http://hostname_of_jobtracker:50030. The job should appear on the jobtracker window within a few seconds, likely in more time than the whole job took to complete. You will see (TODO: describe jobtracker job overview). You can compare its output to the earlier run by running
hadoop fs -cat ./output/latinized_magi/\*
That command, like the Unix cat command, dumps the contents of a file to standard out, so you can pipe it into any other command-line utility. It produces the full contents of the file, which is what you would like for use within scripts; but if your file is hundreds of MB large, as HDFS files typically are, dumping its entire contents to your terminal screen is ill appreciated. We typically use the Unix head command instead to limit its output (in this case, to the first ten lines).
hadoop fs -cat ./output/latinized_magi/\* | head -n 10
Since you wouldn't want to read a whole 10 GB file just to see whether the right number of closing braces come at the end, there is also a hadoop fs -tail command that dumps the final kilobyte of a file. Here's what the head and tail of your output should contain:
TODO screenshot of hadoop fs -cat ./output/latinized_magi/\* | head -n 10 TODO screenshot of hadoop fs -tail ./output/latinized_magi/\*
Map/Reduce
As a demonstration, let's find out when aliens like to visit planet Earth. Here is a Wukong script that processes the UFO dataset to find the aggregate number of sightings per month:
DEFINE MODEL FOR INPUT RECORDS
MAPPER EXTRACTS MONTHS, EMITS MONTH AS KEY WITH NO VALUE
COUNTING REDUCER INCREMENTS ON EACH ENTRY IN GROUP AND EMITS TOTAL IN FINALIZED METHOD
To run the Wukong job, go into the (TODO: REF) directory and run it. The output shows (TODO: CODE: INSERT CONCLUSIONS).
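The Wukong version is only sketched in pseudocode above; for comparison, here is roughly how the same monthly rollup could be written in Pig. The field names and date format are assumptions, not the actual UFO dataset's schema.

    -- Assumed schema: the sighting date arrives as a YYYY-MM-DD string.
    sightings = LOAD 'ufo_sightings.tsv' AS (sighted_at:chararray, reported_at:chararray, location:chararray, description:chararray);

    -- Mapper side: extract the month and use it to label each record
    by_month  = GROUP sightings BY SUBSTRING(sighted_at, 0, 7);

    -- Reducer side: count the sightings in each month's context group
    counts    = FOREACH by_month GENERATE group AS month, COUNT(sightings) AS sightings_count;

    STORE counts INTO 'ufo_sightings_by_month';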
The most important numbers to note are the number of running tasks (there should be some, unless your job is finished or the cluster is congested) and the number of failed tasks (for a healthy job on a healthy cluster, there should never be any). Don't worry about killed tasks; for reasons we'll explain later on, it's OK if a few appear late in a job. We will describe what to do when there are failing attempts later, in the section on debugging Hadoop jobs (TODO: REF), but in this case, there shouldn't be any.

Clicking on the number of running Map tasks will take you to a window that lists all running attempts (and similarly for the other categories). On the completed tasks listing, note how long each attempt took; for the Amazon M3.xlarge machines we used, each attempt took about x seconds (TODO: correct time and machine size). There is a lot of information here, so we will pick this back up in chapter (TODO: ref), but the most important indicator is that your attempts complete in a uniform and reasonable length of time. There could be good reasons why you might find task 00001 to still be running after five minutes while other attempts have been finishing in ten seconds, but if that's not what you thought would happen you should dig deeper 3.

You should get in the habit of sanity-checking the number of tasks and the input and output sizes at each job phase for the jobs you write. In this case, the job should ultimately require x Map tasks, no Reduce tasks and, on our x-machine cluster, it completed in x minutes. For this input, there should be one Map task per HDFS block: x GB of input, with the typical one-eighth-GB block size, means there should be 8x Map tasks. Sanity-checking the figure will help you flag cases where you ran on all the data rather than the one little slice you intended, or vice versa; cases where the data is organized inefficiently; or deeper reasons that will require you to flip ahead to chapter (TODO: REF).

Annoyingly, the Job view does not directly display the Mapper input data, only the cumulative quantity of data per source, which is not always an exact match. Still, the figure for HDFS bytes read should closely match the size given by hadoop fs -du (TODO: add pads to command). You can also estimate how large the output should be, using the Gift of the Magi sample we ran earlier (one of the benefits of first running in local mode). That job had an input size of x bytes and an output size of y bytes, for an expansion factor of z, and there is no reason to think the expansion factor on the whole Wikipedia corpus should be much different. In fact, dividing the HDFS bytes written by the HDFS bytes read shows an expansion factor of q.

We cannot stress enough how important it is to validate that your scripts are doing what you think they are. The whole problem of Big Data is that it is impossible to see your
3. A good reason is that task 00001's input file was compressed in a non-splittable format and is 40 times larger than the rest of the files. A bad reason is that task 00001 is trying to read from a failing-but-not-failed datanode, or has a corrupted record that is sending the XML parser into recursive hell. The good reasons you can always predict from the data itself; otherwise, assume it's a bad reason.
data in its totality. You can spot-check your data, and you should, but without independent validations like these you're vulnerable to a whole class of common defects. This habit of validating your prediction of the job's execution is not a crutch offered to the beginner, unsure of what will occur; it is a best practice, observed most diligently by the expert, and one every practitioner should adopt.
copies of the affected blocks by issuing replication commands to other Datanodes as they heartbeat in. A final prominent role the Namenode serves is to act as the public face of the HDFS. The put and get commands you just ran were Java programs that made network calls to the Namenode. There are API methods for the rest of the file system commands you would expect, for use by that or any other low-level native client. You can also access its web interface, typically by visiting port 50070 (http://hostname.of.namenode:50070), which gives you the crude but effective ability to view its capacity and operational status and, for the very patient, inspect the contents of the HDFS. Sitting behind the scenes is the often-misunderstood secondary Namenode; this is not, as its name implies and as you might hope, a hot standby for the Namenode. Unless you are using the HA Namenode feature provided in later versions of Hadoop, if your Namenode goes down, your HDFS has gone down. All the secondary Namenode does is perform some essential internal bookkeeping. Apart from ensuring that it, like your Namenode, is always running happily and healthily, you do not need to know anything more about the secondary Namenode for now. One last essential to note about the HDFS is that its contents are immutable. On a regular file system, every time you hit save, the application modifies the file in place; on Hadoop, no such thing is permitted. This is driven by the necessities of distributed computing at high scale, but it is also the right thing to do. Data analysis should proceed by chaining reproducible syntheses of new beliefs from input data. If the actions you are applying change, so should the output. This casual consumption of hard drive resources can seem disturbing to those used to working within the constraints of a single machine, but the economics of data storage are clear: it costs $0.10 per GB per month at current commodity prices, or one-tenth that for archival storage, and at least $50 an hour for the analysts who will use it. Possibly the biggest rookie mistake made by those new to Big Data is a tendency to economize on the amount of data they store; we will try to help you break that habit. You should be far more concerned with the amount of data you send over the network or to your CPU than with the amount of data you store, and most of all with the amount of time you spend deriving insight rather than acting on it. Checkpoint often, denormalize when reasonable and preserve the full provenance of your results. We'll spend the next few chapters introducing these core operations from the ground up. Let's start by joining JT and Nanette with their next client.
CHAPTER 3
In the previous chapter, you worked with the simple-as-possible Pig Latin script, which let you learn the mechanics of running Hadoop jobs, understand the essentials of the HDFS, and appreciate its scalability. It is an example of an embarrassingly parallel problem: each record could be processed individually, just as they were organized in the source files. Hadoop's real power comes from the ability to process data in context, using what's known as the Map/Reduce paradigm. Every map/reduce job is a program with the same three phases. In the first phase, your program processes its input in any way you see fit, emitting labelled output records. In the second phase, Hadoop groups and sorts those records according to their labels. Finally, your program processes each group and Hadoop stores its output. That grouping-by-label part is where the magic lies: it ensures that no matter where the relevant records started, they arrive at the same place in a predictable manner, ready to be synthesized. We'll open the chapter with a straightforward example map/reduce program: aggregating records from a dataset of Unidentified Flying Object sightings to find out when UFOs are most likely to appear. Next, we'll outline how a map/reduce dataflow works, first with a physical analogy provided by our friends at Elephant and Chimpanzee Inc., and then in moderate technical detail. The most important thing to take away is this: it's essential that you gain an innate, physical sense of how Hadoop moves data around. Without it, you can't understand the fundamental patterns of data analysis in Hadoop (grouping, filtering, joining records, and so forth), nor assemble those patterns into the solution you seek. For two good reasons, we're going to use very particular language whenever we discuss how to design a map/reduce dataflow. First, because it will help you reason by comparison
as you meet more and more map/reduce patterns. The second reason is that those core patterns are not specific to the map/reduce paradigm. You'll see them in different dress but with the same essentials when we dive into the Streaming Analytics paradigm (REF) later in the book. In Chapter (REF) we'll put forth a conceptual model that explains much about how not just streaming and map/reduce dataflows are designed, but also service-oriented architectures, distributed message
1. For our purposes, although sixty thousand records are too small to justify Hadoop on their own, it's the perfect size to learn with.
Mapper
In the Chimpanzee&Elephant world, a chimp had the following role:

- reads and understands each letter
- creates a new intermediate item having a label (the type of toy) and information about the toy (the work order)
- hands it to the elephants for delivery to the elf responsible for making that toy type

We're going to write a Hadoop mapper that serves a similar purpose:

- reads the raw data and parses it into a structured record
- creates a new intermediate item having a label (the shape of craft) and information about the sighting (the original record)
- hands it to Hadoop for delivery to the reducer responsible for that group

The program looks like this:
mapper(:count_ufo_shapes) do
  consumes UfoSighting, from: json

  process do |ufo_sighting|       # for each record
    record = 1                    #   create a dummy payload,
    label  = ufo_sighting.shape   #   label it with the shape,
    yield [label, record]         #   and send it downstream for processing
  end
end
The output is simply the partitioning label (UFO shape), followed by the attributes of the sighting, separated by tabs. The framework uses the first field to group/sort by default; the rest is cargo.
Reducer
Just as the pygmy elephants transported work orders to the elves' workbenches, Hadoop delivers each record to the reducer, the second stage of our job.
reducer(:count_sightings) do
  def process_group(label, group)
    count = 0
    group.each do |record|            # on each record,
      count += 1                      #   increment the count
      yield record                    #   re-output the record
    end
    yield ['# count:', label, count]  # at end of group, summarize
  end
end
The elf at each workbench saw a series of work orders, with the guarantee that a) work orders for each toy type are delivered together and in order; and b) this was the only workbench to receive work orders for that toy type. Similarly, the reducer receives a series of records, grouped by label, with a guarantee that it is the unique processor for such records. All we have to do here is re-emit records as they come in, then add a line following each group with its count. We've put a # at the start of the summary lines, which lets you easily filter them. Test the full mapper-sort-reducer stack from the command line:
$ cat ./data/geo/ufo_sightings/ufo_sightings-sample.json         \
    | ./examples/geo/ufo_sightings/count_ufo_shapes.rb --map     \
    | sort                                                       \
    | ./examples/geo/ufo_sightings/count_ufo_shapes.rb --reduce  \
    | wu-lign
1985-06-01T05:00:00Z  1999-01-14T06:00:00Z  North Tonawanda, NY  chevron  1 hr
1999-01-20T06:00:00Z  1999-01-31T06:00:00Z  Olney, IL            chevron  10 sec
1998-12-16T06:00:00Z  1998-12-16T06:00:00Z  Lubbock, TX          chevron  3 minu
# count:  chevron  3
1999-01-16T06:00:00Z  1999-01-16T06:00:00Z  Deptford, NJ         cigar    2 Hour
# count:  cigar    1
1947-10-15T06:00:00Z  1999-02-25T06:00:00Z  Palmira,             circle   1 hour
1999-01-10T06:00:00Z  1999-01-11T06:00:00Z  Tyson's Corner, VA   circle   1 to 2
...
interest, so we can extend each UFO sighting whose location matches a populated place name with its longitude, latitude, population and more. Your authors have additionally run the free-text locations ("Merrimac, WI" or "Newark, NJ (south of Garden State Pkwy)") through a geolocation service to add, where possible, structured geographic information: longitude, latitude and so forth.
Extend The UFO Sighting Records In Each Location Co-Group With Place Data
Building a toy involved selecting, first, the toy form, then each of the corresponding parts, so the elephants carrying toy forms stood at the head of the workbench next to
all the parts carts. While the first part of the label (the partition key) defines how records are grouped, the remainder of the label (the sort key) describes how they are ordered within the group. Denoting places with an A and sightings with a B ensures our Reducer always first receives the place for a given location name, followed by the sightings. For each group, the Reducer holds the place record in a temporary variable and appends the place's fields to those of each sighting that follows. In the happy case where a group holds both place and sightings, the Reducer iterates over each sighting. There are many places that match no UFO sightings; these are discarded. There are some UFO sightings without reconcilable location data; we will hold onto those but leave the place fields blank. Even if these groups had been extremely large, this matching would require no more memory overhead than the size of a place record.
Deer SANTA

I wood like a doll for me and and an optimus prime robot for my brother joe

I have been good this year love julia
--------------------------------------
# Spam, no action

Greetings to you Mr Claus, I came to know of you in my search for a reliable and
reputable person to handle a very confidential business transaction, which involves
the transfer of a huge sum of money...
--------------------------------------
# Frank is not only a jerk but a Yankees fan. He will get coal.

HEY SANTA I WANT A YANKEES HAT AND NOT ANY DUMB BOOKS THIS YEAR

FRANK
The first note, from a very good girl who is thoughtful for her brother, creates two toyforms: one for Joe's robot and one for Julia's doll. The second note is spam, so it creates no toyforms, while the third one yields a toyform directing Santa to put coal in his stocking.
Finally, the pygmy elephants would march through the now-quiet hallways to the toy shop floor, each reporting to the workbench that matched its toy types. So the Archery Kit/Doll workbench had a line of pygmy elephants, one for every Chimpanzee&Elephant desk; similarly the Xylophone/Yo-Yo workbench, and all the rest. Toymaker elves now began producing a steady stream of toys, no longer constrained by the overhead of walking the hallway and waiting for Big-Tree retrieval on every toy.
What's more, traditional database applications lend themselves very well to low-latency operations (such as rendering a webpage showing the toys you requested), but very poorly to high-throughput operations (such as requesting every single doll order in sequence). Unless you invest specific expertise and effort, you have little ability to organize requests for efficient retrieval. You either suffer a variety of non-locality and congestion-based inefficiencies, or wind up with an application that caters to the database more than to its users. You can, to a certain extent, use the laws of economics to bend the laws of physics (as the commercial success of Oracle and Netezza shows), but the finiteness of time, space and memory presents an insoluble scaling problem for traditional databases. Hadoop solves the scaling problem by not solving the data organization problem. Rather than insisting that the data be organized and indexed as it's written to disk, catering to every context that could be requested, it focuses purely on the throughput case. (TODO: explain "disk is the new tape": it takes X to seek, but the typical Hadoop operation streams large swaths of data; the locality ...)
More prosaically:

1. process and label: turn each input record into any number of labelled records
2. sorted context groups: Hadoop groups those records uniquely under each label, in a sorted order
3. synthesize (process context groups): for each group, process its records in order; emit anything you want

The trick lies in the group/sort step: assigning the same label to two records in the label step ensures that they will become local in the reduce step. The machines in stage 1 (label) are out of context. They see each record exactly once, but with no promises as to order, and no promises as to which machine sees which record. We've moved the compute to the data, allowing each process to work quietly on the data in its workspace. As each pile of output products starts to accumulate, we can begin to group them. Every group is assigned to its own reducer. When a pile reaches a convenient size, it is shipped
to the appropriate reducer while the mapper keeps working. Once the map finishes, we organize those piles for its reducer to process, each in proper order. If you notice, the only time data moves from one machine to another is when the intermediate piles of data get shipped. Instead of monkeys flinging poo, we now have a dignified elephant parade conducted in concert with the efforts of our diligent workers.
Hadoop's Contract
Hadoop imposes a few seemingly-strict constraints and provides very few guarantees in return. As you're starting to see, that simplicity provides great power and is not as confining as it seems. You can gain direct control over things like partitioning, input splits and input/output formats. We'll touch on a very few of those, but for the most part this book concentrates on using Hadoop from the outside; (TODO: ref) Hadoop: The Definitive Guide covers this stuff (definitively).
Hadoop's guarantees are few but strong: each group is processed by exactly one reducer; groups are sorted lexically by the chosen group key; and records are further sorted lexically by the chosen sort key. It's very important that you understand what that unlocks, so I'm going to redundantly spell it out a few different ways:

- Each mapper-output record goes to exactly one reducer, solely determined by its key.
- If several records have the same key, they will all go to the same reducer.
- From the reducer's perspective, if it sees any element of a group it will see all elements of the group.

You should typically think in terms of groups and not about the whole reduce set: imagine each partition is sent to its own reducer. It's important to know, however, that each reducer typically sees multiple partitions. (Since it's more efficient to process large batches, a certain number of reducer processes are started on each machine. This is in contrast to the mappers, which run one task per input split.) Unless you take special measures, the partitions are distributed arbitrarily among the reducers.2 They are fed to the reducer in order by key. Similar to a mapper-only task, your reducer can output anything you like, in any format you like. It's typical to output structured records of the same or different shape, but you're free to engage in any of the shenanigans listed above.
To clear his mind, JT wandered over to the reindeer ready room, eager to join in the cutthroat games of poker Rudolph and his pals regularly ran. During a break in the action, JT found himself idly sorting out the deck of cards by number, as you do to check that it is a regular deck of 52. (With reindeer, you never know when an extra ace or three will inexplicably appear at the table.) As he did so, something in his mind flashed back to the unfinished toys on the assembly floor: mounds of number blocks, stacks of Jack-in-the-boxes, rows of dolls. Sorting the cards by number had naturally organized them into groups by kind as well: he saw all the number cards in a run, followed by all the jacks, then the queens and the kings and the aces. "Sorting is equivalent to grouping!" he exclaimed, to the reindeer's puzzlement. "Sorry, boys, you'll have to deal me out," he said, as he ran off to find Nanette. The next day, they made several changes to the toy-making workflow. First, they set up a delegation of elvish parts clerks at desks behind the letter-writing chimpanzees, directing the chimps to hand a carbon copy of each toy form to a parts clerk as well. On receipt of a toy form, each parts clerk would write out a set of tickets, one for each part in that toy, and note on each ticket the ID of its toyform. These tickets were then dispatched by pygmy elephant to the corresponding section of the parts warehouse, to be retrieved from the shelves. Now, here is the truly ingenious part that JT struck upon that night. Before, the chimpanzees placed their toy forms onto the back of each pygmy elephant in no particular order. JT replaced these baskets with standing file folders, the kind you might see on an organized person's desk. He directed the chimpanzees to insert each toy form into the file folder according to the alphabetical order of its ID. (Chimpanzees are exceedingly dextrous, so this did not appreciably impact their speed.) Meanwhile, at the parts warehouse, Nanette directed a crew of elvish carpenters to add a clever set of movable frames to each of the part carts. She similarly prompted the parts pickers to place each cart's parts in a way that properly preserved the alphabetical order of their toyform IDs.
//// Perhaps a smaller sizing for the image? Amy ////
After a double shift that night by the parts department and the chimpanzees, the toymakers arrived in the morning to find, next to each workbench, the pygmy elephants with their toy forms and a set of carts from each warehouse section holding the parts they'd need. As work proceeded, a sense of joy and relief soon spread across the shop. The elves were now producing a steady stream of toys as fast as their hammers could fly, with an economy of motion they'd never experienced. Since both the parts and the toy forms were in the same order by toyform ID, as the toymakers pulled the next toy form from the file they would always find the parts for it first at hand. Pull the toy form for a wooden toy train and you would find a train chassis next in the chassis cart, small wooden wheels next in the wheel cart, and magnetic bumpers next in the small parts cart. Pull the toy form for a rolling duck on a string, and you would find instead a duck chassis, large wooden wheels and a length of string at the head of their respective carts. Not only did work now proceed with an unbroken swing, but the previously cluttered workbenches were now clear: their only contents were the parts immediately required to assemble the next toy. This space efficiency let Santa pull in extra temporary workers from the elves' Rivendale branch, who were bored with fighting orcs and excited to help out.
Toys were soon coming off the line at a tremendous pace, far exceeding what the elves had ever been able to achieve. By the second day of the new system, Mrs. Claus excitedly reported the news everyone was hoping to hear: they were fully on track to hit the Christmas Eve deadline! And that's the story of how Elephant and Chimpanzee saved Christmas.
observed months); all the files were close to (TODO size) MB large. (TODO: consider updating to 1,2,3 syntax, perhaps with a gratuitous randomizing field as well. If not, make sure Wukong errors on a partition_keys larger than the sort_keys.) Running with --partition_keys=3 --sort_keys=4 doesn't change anything: the get_key method in this particular reducer only pays attention to the century/year/month, so the ordering within the month is irrelevant. Running it instead with --partition_keys=2 --sort_keys=3 tells Hadoop to partition on the century/year, but do a secondary sort on the month as well. All records that share a century and year now go to the same reducer, while the reducers still see months as continuous chunks. Now there are only six (or fewer) reducers that receive data: all of 2008 goes to one reducer, similarly 2009, 2010, and the rest of the years in the dataset. In our runs, we saw years X and Y (TODO: adjust reducer count to let us prove the point, insert numbers) land on the same reducer. This uneven distribution of data across the reducers should cause the job to take slightly longer than the first run. To push that point even farther, running with --partition_keys=1 --sort_keys=3 now partitions on the century, which all the records share. You'll now see 19 reducers finish promptly following the last mapper, and the job should take nearly twenty times as long as with --partition_keys=3. Finally, try running it with --partition_keys=4 --sort_keys=4, causing records to be partitioned by century/year/month/day. Now the days in a month will be spread across all the reducers: for December 2010, we saw -00000 receive X, Y and -00001 receive X, Y, Z; out of 20 reducers, X of them received records from that month (TODO: insert numbers). Since our reducer class is coded to aggregate by century/year/month, each of those reducers prepared its own meaningless total pageview count for December 2010, each of them a fraction of the true value. You must always ensure that all the data you'll combine in an aggregate lands on the same reducer.
Reducers Write Output Data (Which May Cost More Than You Think)
As your Reducers emit records, they are streamed directly to the job output, typically the HDFS or S3. Since this occurs in parallel with reading and processing the data, the primary spill to the Datanode typically carries minimal added overhead. However, the data is simultaneously being replicated as well, which can extend your job's runtime by more than you might think. Let's consider how data flows in a job intended to remove duplicate records: for example, processing 100 GB of data with one-percent duplicates, and writing output with replication factor three. As you'll see when we describe the distinct patterns in Chapter 5 (REF), the Reducer input is about the same size as the Mapper input. Using what you now know, Hadoop moves roughly the following amount of data, largely in parallel:

- 100 GB of Mapper input read from disk;
- 100 GB spilled back to disk;
- 100 GB of Reducer input sent and received over the network;
- 100 GB of Reducer input spilled to disk;
- some amount of data merge/sorted to disk if your cluster size requires multiple passes;
- 100 GB of Reducer output written to disk by the local Datanode;
- 200 GB of replicated output sent over the network, received over the network and written to disk by the Datanode.
If your Datanode is backed by remote volumes (common in some virtual environments3), you'll additionally incur 300 GB sent over the network to the remote file store. As you can see, unless your cluster is undersized (producing significant merge/sort overhead), the cost of replicating the data rivals the cost of the rest of the job. The default replication factor is 3 for two very good reasons: it helps guarantee the permanence of your data and it allows the Job Tracker to efficiently allocate Mapper-local tasks. But in certain cases (intermediate checkpoint data, scratch data, or data backed by a remote file system with its own durability guarantee), an expert who appreciates the risk can choose to reduce the replication factor to 2 or 1. You may wish to send your job's output not to the HDFS or S3 but to a scalable database or other external data store. We will show an example of this in the chapter on HBase (REF), and there are a great many other output formats available. While your job is in development, though, it is typically best to write its output directly to the HDFS (perhaps at replication factor 1), then transfer it to the external target in a separate stage. The HDFS is generally the most efficient output target and the least likely to struggle under load. This checkpointing also encourages the best practice of sanity-checking your output and asking questions.
3. This may sound outrageous to traditional IT folk, but the advantages of elasticity are extremely powerful; we'll outline the case for virtualized Hadoop in Chapter (REF).
CHAPTER 4
Structural Operations
They turned to find Olga, now dressed in street clothes. "Why don't you join me for a drink? We can talk then."
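The Pig script this section discusses is not reproduced here; the following is a minimal sketch, with field names and paths assumed rather than taken from the book's example repository, of what a monthly_visit_counts.pig that answers the same question as the Wukong script below might contain:

-- sketch only: field names and paths are assumptions
sightings = LOAD './data/geo/ufo_sightings/ufo_sightings.tsv' AS
    (sighted_at: chararray, reported_at: chararray, location: chararray,
     shape: chararray, duration: chararray);
months    = FOREACH sightings GENERATE SUBSTRING(sighted_at, 0, 7) AS month;  -- 'YYYY-MM'
by_month  = GROUP months BY month;
counts    = FOREACH by_month GENERATE group AS month, COUNT_STAR(months) AS ct;
STORE counts INTO './data/results/monthly_sighting_counts';

Each line names an operation on a whole table; the actual script may differ in its particulars.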
For comparison, here's the Wukong script we wrote earlier to answer the same question:
DEFINE MODEL FOR INPUT RECORDS
MAPPER EXTRACTS MONTHS, EMITS MONTH AS KEY WITH NO VALUE
COUNTING REDUCER INCREMENTS ON EACH ENTRY IN GROUP AND EMITS TOTAL IN FINALIZED METHOD
In a Wukong script or traditional Hadoop job, the focus is on the record, and you're best off thinking in terms of message passing. In Pig, the focus is much more on the table as a whole, and you're able to think in terms of its structure or its relations to other tables. In the example above, each line described an operation on a full table. We declare what change to make and Pig, as you'll see, executes those changes by dynamically assembling and running a set of Map/Reduce jobs. To run the Pig job, go into the EXAMPLES/UFO directory and run
pig monthly_visit_counts.pig /data_UFO_sightings.tsv /dataresults monthly_visit_counts-pig.tsv
To run the Wukong job, go into the (TODO: REF) directory and run (TODO: command). The output shows (TODO:CODE: INSERT CONCLUSIONS). If you consult the Job Tracker Console, you should see a single Map/Reduce job for each, with effectively similar statistics; the dataflow Pig instructed Hadoop to run is essentially similar to the Wukong script you ran. What Pig ran was, in all respects, a Hadoop job. It calls on some of Hadoop's advanced features to help it operate, but nothing you could not access through the standard Java API.
Did you notice, by the way, that in both cases the output was sorted? That is no coincidence: as you saw in Chapter (TODO: REF), Hadoop sorted the results in order to group them.
(TODO: If you do an order and then group, is Pig smart enough to not add an extra REDUCE stage?) Run the script just as you did above:
(TODO: command to run the script)
Up until now, we have described Pig as authoring the same Map/Reduce job you would. In fact, Pig has automatically introduced the same optimizations an advanced practitioner would have introduced, but with no effort on your part. If you compare the Job Tracker Console output for this Pig job with the earlier ones, you'll see that, although x bytes were read by the Mapper, only y bytes were output. Pig instructed Hadoop to use a Combiner. In the naive Wukong job, every Mapper output record was sent across the network to the Reducer; but in Hadoop, as you will recall from (TODO: REF), the Mapper output files have already been partitioned and sorted. Hadoop offers you the opportunity to do pre-aggregation on those groups. Rather than send every record for, say, August 8, 2008 8 pm, the Combiner outputs the hour and sum of visits emitted by the Mapper.
SIDEBAR: You can write Combiners in Wukong, too. (TODO:CODE: Insert example with Combiners)
You'll notice that, in the second script, we introduced the additional operation of instructing Pig to explicitly sort the output by minute. We did not do that in the first example because its data was so small that we had instructed Hadoop to use a single Reducer. As you will recall from (TODO: REF), Hadoop uses a sort to prepare the Reducer groups, so its output was naturally ordered. If there are multiple Reducers, however, that would not be enough to give you a result file you can treat as ordered. By default, Hadoop assigns partitions to Reducers using a hashing partitioner, designed to give each Reducer a uniform chance of claiming any given partition. This defends against the problem of one Reducer becoming overwhelmed with an unfair share of records, but means the keys are distributed willy-nilly across machines. Although each
Reducer's output is sorted, you will see records from 2008 at the top of each result file and records from 2012 at the bottom of each result file. What we want instead is a total sort: the earliest records in the first numbered file, in order; the following records in the next file, in order; and so on until the last numbered file. Pig's ORDER operator does just that. In fact, it does better than that. If you look at the Job Tracker Console, you will see Pig actually ran three Map/Reduce jobs. As you would expect, the first job is the one that did the grouping and summing and the last job is the one that sorted the output records. In the last job, all the earliest records were sent to Reducer 0, the middle range of records were sent to Reducer 1 and the latest records were sent to Reducer 2. Hadoop, however, has no intrinsic way to make that mapping happen. Even if it figured out, say, that the earliest buckets were in 2008 and the latest buckets were in 2012, if we fed it a dataset with skyrocketing traffic in 2013, we would end up sending an overwhelming portion of results to that Reducer. In the second job, Pig sampled the set of output keys, brought them to the same Reducer, and figured out the set of partition breakpoints that distribute records fairly. In general, Pig offers many more optimizations beyond these, and we will talk more about them in the chapter on Advanced Pig (TODO: REF). In our experience, the only times Pig will author a significantly less-performant dataflow than would an expert come when Pig is overly aggressive about introducing an optimization. The chief example you'll hit is that often the intermediate stage in the total sort, which calculates partitions, has a larger time penalty than doing a bad job of partitioning would; you can disable that by (TODO:CODE: Describe how).
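For reference, the statement behind all of this is a one-liner. Here is a sketch, with aliases assumed, of a total sort pinned to three reducers to match the Reducer 0/1/2 description above:

-- sketch only: hourly_counts(visit_hour, visits) is an assumed alias from the grouping step
sorted_counts = ORDER hourly_counts BY visit_hour ASC PARALLEL 3;
STORE sorted_counts INTO './data/results/sorted_visit_counts';

The PARALLEL clause fixes the reducer count at three; omit it and Pig chooses a count on its own.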
CHAPTER 5
Analytic Patterns
"I wish I could say I invited you for this drink because I knew the solution, but all I have is a problem I'd like to fix. I know your typewriter army helps companies process massive amounts of documents, so you're used to working with the amount of information I'm talking about. Is the situation hopeless, or can you help me find a way to apply my skills at a thousand times the scale I work at now?" Nanette smiled. "It's not hopeless at all, and to tell you the truth your proposal sounds like the other end of a problem I've been struggling with. We've now had several successful client deliveries, and recently JT's made some breakthroughs in what our document handling system can do; it involves having the chimpanzees at one set of typewriters send letters to another set of chimpanzees at a different set of typewriters. One thing we're learning is that even though the actions the chimpanzees take are different for every client, there are certain themes in how the chimpanzees structure their communication that recur across clients. Now JT here" (at this, JT rolled his eyes for effect, as he knew what was coming) "spent all his time growing up at a typewriter, and so he thinks about information flow as a set of documents. Designing a new scheme for chimpanzees to send inter-office memos is like pie for him. But where JT thinks about working with words on a page, I think about managing books and libraries. And the other thing we're learning is that our clients think like me. They want to be able to tell us the big picture, not fiddly little rules about what should happen to each document. Tell me how you describe the players-and-stadiums trick you did in the grand finale." "Well, I picture in my head the teams every player was on for each year they played, and at the same time a listing of each team's stadium by year. Then I just think 'match the players' seasons to the teams' seasons using the team and year', and the result pops into my head." Nanette nodded and looked over at JT. "I see what you're getting at now," he replied. "In my head I'm thinking about the process of matching individual players and stadiums; when I explain it you're going to think it sounds more complicated, but I don't know, to me it seems simpler. I imagine that I could ask each player to write down on a yellow post-it note the team-years they played on, and ask each stadium manager to write down on blue post-it notes the team-years it served. Then I put those notes in piles; whenever there's a pile with yellow post-it notes, I can read off the blue post-it notes it matched." Nanette leaned in. "So here's the thing. Elephants and Pigs have amazing memories, but not Chimpanzees; JT can barely keep track of what day of the week it is. JT's scheme never requires him to remember anything more than the size of the largest pile; in fact, he can get by with just remembering what's on the yellow post-it notes. But, well," Nanette said with a grin, "pack a suitcase with a very warm jacket. We're going to take a trip up north. Way north."
and a FILTER will only require one map phase and one reduce phase. In that case, the FOREACH and FILTER will be done in the reduce step, and in the right circumstances Pig will push part of the FOREACH and FILTER before the JOIN, potentially eliminating a great deal of processing. In the remainder of this chapter, we'll illustrate the essentials for each family of operations, demonstrating them in actual use. In the following chapter (TODO ref), we'll learn how to implement the corresponding patterns in a plain map/reduce approach and therefore how to reason about their performance. Finally, the chapter on Advanced Pig (TODO ref) will cover some deeper-level topics, such as a few important optimized variants of the JOIN statement and how to endow Pig with new functions and loaders. We will not explore every nook and cranny of its syntax, only illustrate its patterns of use. We've omitted operations whose need hasn't arisen naturally in the explorations later, along with fancy but rarely-used options or expressions.1
articles        = LOAD './data/wp/articles.tsv' AS (page_id: long, namespace: int, wikipedia_id: chararray, ..., text: chararray);
hadoop_articles = FILTER articles BY text matches '.*[Hh]adoop.*';
STORE hadoop_articles INTO './data/tmp/hadoop_articles.tsv';
Simple Types
As you can see, the LOAD statement not only tells Pig where to find the data, it also describes the table's schema. Pig understands ten kinds of simple type. Six of them are numbers: signed machine integers, as int (32-bit) or long (64-bit); signed floating-point numbers, as float (32-bit) or double (64-bit); arbitrary-length integers as biginteger; and arbitrary-precision real numbers as bigdecimal. If you're supplying a literal value for a long, you should append a capital L to the quantity: 12345L; if you're supplying a literal float, use an F: 123.45F. The chararray type loads text as UTF-8 encoded strings (the only kind of string you should ever traffic in). String literals are contained in single quotes: 'hello, world'. Regular expressions are supplied as string literals, as in the example above: '.*[Hh]adoop.*'. The bytearray type does no interpretation of its contents whatsoever,
1. For example, it's legal in Pig to load data without a schema, but you shouldn't, and so we're not going to tell you how.
but be careful: the most common interchange formats (tsv, xml and json) cannot faithfully round-trip data that is truly freeform. Lastly, there are two special-purpose simple types. Time values are described with datetime, and should be serialized in the ISO-8601 format: 1970-01-01T00:00:00.000+00:00. Boolean values are described with boolean, and should bear the values true or false.
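Here is a sketch of a LOAD that exercises several of these simple types at once; the file and field names are invented purely for illustration:

-- hypothetical dataset; the point is the variety of types in the schema
sensor_readings = LOAD './data/sensor_readings.tsv' AS (
    sensor_id: long, reading: double, is_valid: boolean,
    taken_at: datetime, notes: chararray);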
Pig displays tuples using parentheses: it would dump a line from the input file as (BOS07,Fenway Park,(4 Yawkey Way,Boston,MA,02215),(-71.097378,42.3465909)). As shown above, you address single values within a tuple using tuple_name.subfield_name: address.state will have the schema state:chararray. You can also project fields in a tuple into a new tuple by writing tuple_name.(subfield_a, subfield_b, ...): address.(zip, city, state) will have schema address_zip_city_state:tuple(zip:chararray, city:chararray, state:chararray). (Pig helpfully generated a readable name for the new tuple.) Tuples can contain values of any type, even bags and other tuples, but that's nothing to be proud of. You'll notice we follow almost every structural operation with a FOREACH to simplify its schema as soon as possible, and so should you; it doesn't cost anything and it makes your code readable.
You address values within a bag again using bag_name.(subfield_a, subfield_b), but this time the result is a bag containing the projected tuples; you'll see examples of this shortly when we discuss FLATTEN and the various group operations. Note that the only type a bag holds is tuple, even if there's only one field: a bag of just park IDs would have schema bag{tuple(park_id:chararray)}.
A map schema is described using square brackets: map[value_schema]. You can leave the value schema blank if you supply one later (as in the example that follows). The keys of a map are always of type chararray; the values can be any simple type. Pig renders a map as [key#value,key#value,...]: my Twitter user record as a hash would look like [name#Philip Kromer,id#1554031,screen_name#mrflip]. Apart from loading complex data, the map type is surprisingly useless. You might think it would be useful to carry around a lookup table in a map field (a mapping from IDs to names, say) and then index into it using the value of some other field, but a) you cannot do so and b) it isn't useful. The only thing you can do with a map field is dereference by a constant string, as we did above (user#'id'). Carrying around such a lookup table would be kind of silly, anyway, as you'd be duplicating it on every row. What you most likely want is either an off-the-cuff UDF or Pig's replicated JOIN operation; both are described in the chapter on Advanced Pig (TODO ref). Since the map type is mostly useless, we'll seize the teachable moment and use this space to illustrate the other way schemas are constructed: using a FOREACH. As always when given a complex schema, we took the first available opportunity to simplify it. The
FOREACH in the snippet above dereferences the elements of the user map and supplies a schema for each new field with the AS <schema> clauses. The DESCRIBE directive that follows causes Pig to dump the schema to the console: in this case, you should see tweets: {created_at: chararray,text: chararray,user_id: long,user_name: chararray,user_screen_name: chararray}.
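A sketch of what such a FOREACH might look like, with field names taken from the DESCRIBE output just quoted (this is our reconstruction, not necessarily the book's exact code, and raw_tweets with its user map field is an assumed alias):

tweets = FOREACH raw_tweets GENERATE
    created_at                AS created_at:       chararray,
    text                      AS text:             chararray,
    (long)(user#'id')         AS user_id:          long,
    user#'name'               AS user_name:        chararray,
    user#'screen_name'        AS user_screen_name: chararray;
DESCRIBE tweets;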
(TODO ensure these topics are covered later: combining input splits in Pig; loading different data formats)
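The script the next paragraph describes is not reproduced here; the following is a plausible sketch in which the field names and the qualifying threshold are assumptions, not the book's actual code:

-- sketch only: field names and the 450 plate-appearance threshold are assumed
player_seasons = LOAD './data/baseball/player_seasons.tsv' AS (
    player_id: chararray, year: int, plate_appearances: int,
    at_bats: int, hits: int, walks: int);
qualified = FILTER player_seasons BY plate_appearances >= 450;
stats     = FOREACH qualified GENERATE
    player_id,
    year,
    (float)hits / at_bats                      AS batting_avg,
    (float)(hits + walks) / plate_appearances  AS on_base_pct;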
This example digests the players table; selects only players who have more than a qualified number of plate appearances; and generates the stats we're interested in. (If you're not a baseball fan, just take our word that these fields are particularly interesting.) A FOREACH won't cause a new Hadoop job stage: it's chained onto the end of the preceding operation (and when it's on its own, like this one, there's just a single mapper-only job). A FOREACH always produces exactly the same count of output records as input records. Within the GENERATE portion of a FOREACH, you can apply arithmetic expressions (as shown); project fields (rearrange and eliminate fields); apply the FLATTEN operator (see below); and apply Pig functions to fields. Let's look at Pig's functions.
- Math functions, for all the things you'd expect to see on a good calculator: LOG/LOG10/EXP, RANDOM, ROUND/FLOOR/CEIL, ABS, trigonometric functions, and so forth.
- String comparison: matches tests a value against a regular expression. Compare strings directly using ==. EqualsIgnoreCase does a case-insensitive match, while STARTSWITH/ENDSWITH test whether one string is a prefix or suffix of the other. SIZE returns the number of characters in a chararray, and the number of bytes in a bytearray. Be reminded that characters often occupy more than one byte: the string Motörhead has nine characters, but because of its umlauted ö it occupies ten bytes. You can use SIZE on other types, too; but as mentioned, use COUNT_STAR and not SIZE to find the number of elements in a bag. INDEXOF finds the character position of a substring within a chararray // LAST_INDEX_OF
- Transform strings: CONCAT concatenates all its inputs into a new string; LOWER converts a string to lowercase characters, UPPER to all uppercase // LCFIRST, UCFIRST; TRIM strips leading and trailing whitespace // LTRIM, RTRIM. REPLACE(string, 'regexp', 'replacement') substitutes the replacement string wherever the given regular expression matches, as implemented by java.string.replaceAll; if there are no matches, the input string is passed through unchanged. REGEX_EXTRACT(string, regexp, index) applies the given regular expression and returns the contents of the indicated matched group; if the regular expression does not match, it returns NULL. The REGEX_EXTRACT_ALL function is similar, but returns a tuple of the matched groups. STRSPLIT splits a string at each match of the given regular expression. SUBSTRING selects a portion of a string based on position.
- Datetime functions, such as CurrentTime, ToUnixTime, SecondsBetween (duration between two given datetimes).
- Aggregate functions that act on bags: AVG, MAX, MIN, SUM. COUNT_STAR reports the number of elements in a bag, including nulls; COUNT reports the number of non-null elements. IsEmpty tests whether a bag has elements.
Don't use the quite-similar-sounding SIZE function on bags: it's much less efficient. SUBTRACT(bag_a, bag_b) returns a new bag with all the tuples that are in the first but not in the second, and DIFF(bag_a, bag_b) returns a new bag with all tuples that are in either but not in both. These are rarely used, as the bags must be of modest size; in general, use an inner JOIN as described below. TOP(num, column_index, bag) selects the top num tuples from the given bag, as ordered by the field at column_index. This uses a clever algorithm that doesn't require an expensive total sort of the data; you'll learn about it in the Statistics chapter (TODO ref).
- Conversion functions, to perform higher-level type casting: TOTUPLE, TOBAG, TOMAP
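A quick sketch showing a few of these functions in play; the table and field names here are invented for illustration:

-- normalize a hypothetical places table
tidy_places = FOREACH raw_places GENERATE
    UPPER(TRIM(name))                   AS name,
    CONCAT(CONCAT(city, ', '), state)   AS location,
    SIZE(name)                          AS name_length;

The nested CONCAT keeps the sketch compatible with older Pig releases, which accept only two arguments per call; recent versions accept more.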
articles        = LOAD './data/wp/articles.tsv' AS (page_id: long, namespace: int, wikipedia_id: chararray, ..., text: chararray);
hadoop_articles = FILTER articles BY text matches '.*Hadoop.*';
STORE hadoop_articles INTO './data/tmp/hadoop_articles.tsv';
Filter as early as possible, and in all other ways reduce the number of records you're working with. (This may sound obvious, but in the next chapter (TODO ref) we'll highlight many non-obvious expressions of this rule.) It's common to want to extract a uniform sample, one where every record has an equivalent chance of being selected. Pig's SAMPLE operation does so by generating a random number to select records. This brings an annoying side effect: the output of your job is different on every run. A better way to extract a uniform sample is the consistent hash digest; we'll describe it, and much more about sampling, in the Statistics chapter (TODO ref).
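For reference, SAMPLE is a one-liner; this sketch keeps roughly one percent of the articles table from the example above:

some_articles = SAMPLE articles 0.01;  -- pseudorandom, so results differ run to run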
The output schema of the new gls_with_parks table has all the fields from the parks table first (because it's first in the join statement), stapled to all the fields from the game_logs table. We only want some of the fields, so immediately following the JOIN is a FOREACH to extract what we're interested in. Note there are now two park_id columns, one from each dataset, so in the subsequent FOREACH we need to dereference the column name with the table from which it came. (TODO: check that Pig does push the projection of fields up above the JOIN.) If you run the script, examples/geo/baseball_weather/geolocate_games.pig, you will see that its output has exactly as many records as there are game_logs, because there is exactly one entry in the parks table for each park. In the general case, though, a JOIN can be many to many. Suppose we wanted to build a table listing all the home ballparks for every player over their career. The player_seasons table has a row for each year and team over their career. If a player changed teams mid-year, there will be two rows for that player. The park_years table, meanwhile, has rows by season for every team and year it was used as a home stadium. Some ballparks have served as home for multiple teams within a season and, in other cases (construction or special circumstances), teams have had multiple home ballparks within a season.
The Pig script (TODO: write script) includes the following JOIN:
player_park_years = JOIN parks BY (year, team_id), players BY (year, team_id);
EXPLAIN player_park_years;
First notice that the JOIN expression has multiple columns, in this case separated by commas; you can actually enter complex expressions here, almost all (but not all) of the things you can do within a FOREACH. If you examine the output file (TODO: name of output file), you will notice it has appreciably more lines than the input player file. For example (TODO: find an example of a player with multiple teams having multiple parks), in year x player x played for the x and the y, and y played in stadiums p and q. The one line in the players table has turned into three lines in the players_parks_years table. The examples we have given so far join on hard IDs within closely-related datasets, so every row was guaranteed to have a match. It is frequently the case, however, that you will join tables having records in one or both tables that fail to find a match. The parks_info datasets from Retrosheet only list the city name of each ballpark, not its location. In this case we found a separate human-curated list of ballpark geolocations, but geolocating records, that is, using a human-readable location name such as "Austin, Texas" to find its nominal geocoordinates (-97.7,30.2), is a common task; it is also far more difficult than it has any right to be, but a useful first step is to match the location names directly against a gazetteer of populated place names such as the open source Geonames dataset. Run the script (TODO: name of script) that includes the following JOIN:
park_places = JOIN parks BY location LEFT OUTER, places BY CONCAT(city, ', ', state);
DESCRIBE park_places;
In this example, there will be some parks that have no direct match to location names and, of course, there will be many, many places that do not match a park. The first two JOINs we did were inner JOINs: the output contains only rows that found a match. In this case, we want to keep all the parks, even if no places matched, but we do not want to keep any places that lack a park. Since all rows from the left (first-listed) dataset will be retained, this is called a left outer JOIN. If, instead, we were trying to annotate all places with such parks as could be matched, producing exactly one output row per place, we would use a right outer JOIN instead. If we wanted to do the latter but (somewhat inefficiently) flag parks that failed to find a match, we would use a full outer JOIN. (Full JOINs are pretty rare.) In a Pig JOIN it is important to order the tables by size, putting the smallest table first and the largest table last. (You'll learn why in the Map/Reduce Patterns (TODO: REF)
chapter.) So while a right join is not terribly common in traditional SQL, it's quite valuable in Pig. If you look back at the previous examples, you will see we took care to always put the smaller table first. For small tables or tables of similar size it is not a big deal, but in some cases it can have a huge impact, so get in the habit of always following this best practice.
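A sketch of the right outer variant described above, using the same assumed aliases and key expression, with the smaller parks table still listed first:

place_parks = JOIN parks BY location RIGHT OUTER, places BY CONCAT(city, ', ', state);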
NOTE A Pig join is outwardly similar to the join portion of a SQL SELECT statement, but notice that
Complex FOREACH
Let's continue our example of finding the list of home ballparks for each player over their career.
parks          = LOAD '.../parks.tsv'          AS (...);
team_seasons   = LOAD '.../team_seasons.tsv'   AS (...);
park_seasons   = JOIN parks BY park_id, team_seasons BY park_id;
park_seasons   = FOREACH park_seasons GENERATE
    team_seasons.team_id, team_seasons.year, parks.park_id, parks.name AS park_name;

player_seasons = LOAD '.../player_seasons.tsv' AS (...);
player_seasons = FOREACH player_seasons GENERATE
    player_id, name AS player_name, year, team_id;

player_season_parks = JOIN parks BY (year, team_id), player_seasons BY (year, team_id);
player_season_parks = FOREACH player_season_parks GENERATE player_id, player_name, parks::year AS ...;

player_all_parks = GROUP player_season_parks BY (player_id);
DESCRIBE player_all_parks;

player_parks = FOREACH player_all_parks {
    player     = FirstFromBag(players);
    home_parks = DISTINCT(parks.park_id);
    GENERATE group AS player_id,
        FLATTEN(player.name),
        MIN(players.year) AS beg_year,
        MAX(players.year) AS end_year,
        home_parks;  -- TODO ensure this is still tuple-ized
};
Whoa! There are a few new tricks here. This alternative curly-braces form of FOREACH lets you describe its transformations in smaller pieces, rather than smushing everything into the single GENERATE clause. New identifiers within the curly braces (such as player) only have meaning within those braces, but they do inform the schema.
We would like our output to have one row per player, whose fields have these different flavors:

- Aggregated fields (beg_year, end_year) come from functions that turn a bag into a simple type (MIN, MAX).
- The player_id is pulled from the group field, whose value applies uniformly to the whole group by definition. Note that it's also in each tuple of the bagged player_season_parks, but then you'd have to turn many repeated values into the one you want, which we do have to do for uniform fields (like name) that are not part of the group key but are the same for all elements of the bag. The awareness that those values are uniform comes from our understanding of the data; Pig doesn't know that the name will always be the same. The FirstFromBag (TODO fix name) function from the DataFu package grabs just the first of those values.
- Inline bag fields (home_parks) continue to have multiple values. We've applied the DISTINCT operation so that each home park for a player appears only once. DISTINCT is one of a few operations that can act as a top-level table operation, and can also act on bags within a FOREACH; we'll pick this up again in the next chapter (TODO ref).

For most people, the biggest barrier to mastery of Pig is understanding how the name and type of each field changes through restructuring operations, so let's walk through the schema evolution. We JOINed player seasons and team seasons on (year, team_id). The resulting schema has those fields twice. To select the name, we use two colons (the disambiguate operator): players::year. After the GROUP BY operation, the schema is group:int, player_season_parks:bag{tuple(player_id, player_name, year, team_id, park_id, park_name)}. The schema of the new group field matches that of the BY clause: since park_id has type chararray, so does the group field. (If we had supplied multiple fields to the BY clause, the group field would have been of type tuple.) The second field, player_season_parks, is a bag of size-6 tuples. Be clear about what the names mean here: grouping on the player_season_parks table (whose schema has six fields) produced the player_parks table. The second field of the player_parks table is a bag of size-six tuples (the six fields in the corresponding table) named player_season_parks (the name of the corresponding table). So within the FOREACH, the expression player_season_parks.park_id is also a bag of tuples (remember, bags only hold tuples!), now size-1 tuples holding only the park_id. That schema is preserved through the DISTINCT operation, so home_parks is also a bag of size-1 tuples.
In a case where you mean to use the disambiguation operator (players::year), it's easy to confuse yourself and use the tuple element operation (players.year) instead. That leads to a baffling error message.
team_park_seasons = LOAD '/tmp/team_parks.tsv' AS (
    team_id: chararray,
    park_years: bag{tuple(year: int, park_id: chararray)},
    park_ids_lookup: map[chararray]
);

team_parks = FOREACH team_park_seasons {
    distinct_park_ids = DISTINCT park_years.park_id;
    GENERATE ...
};
DUMP team_parks;
In this case, since we grouped on two fields, group is a tuple; earlier, when we grouped on just the player_id field, group was just the simple value. The contextify/reflatten pattern can be applied even within one table. This script will find the career list of teammates for each player: all other players with a team and year in common.2
GROUP   player_years BY (team, year);
FOREACH each group, cross all players to flatten each (player_a, player_b) pair;
FILTER  coplayers BY (player_a != player_b);
GROUP   BY player_a;
FOREACH each group, DISTINCT player_b
2. yes, this will have some false positives for players who were traded mid-year. A nice exercise would be to rewrite the above script using the game log data, now defining teammate to mean all other players they took the field with over their career.
Here's another subtlety: the result of the cross operation will include pairing each player with themselves, but since we don't consider a player to be their own teammate we must eliminate player pairs of the form (Aaronha, Aaronha). We did this with a FILTER immediately before the second GROUP (the best practice of removing data before a restructure), but a defensible alternative would be to SUBTRACT playerA from the bag right after the DISTINCT operation.
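The sorting script the next paragraph describes is not reproduced here; a minimal sketch, with the alias and sort fields assumed, might read:

player_seasons_sorted = ORDER player_seasons BY year ASC, player_id ASC;
STORE player_seasons_sorted INTO './data/tmp/player_seasons_sorted';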
This script will run two Hadoop jobs. One pass is a light mapper-only job to sample the sort key, necessary for Pig to balance the amount of data each reducer receives (we'll learn more about this in the next chapter (TODO ref)). The next pass is the map/reduce job that actually sorts the data: output file part-r-00000 has the earliest-ordered records, followed by part-r-00001, and so forth.
articles        = LOAD './data/wp/articles.tsv' AS (page_id: long, namespace: int, wikipedia_id: chararray, ..., text: chararray);
hadoop_articles = FILTER articles BY text matches '.*[Hh]adoop.*';
STORE hadoop_articles INTO './data/tmp/hadoop_articles.tsv';
As with any Hadoop job, Pig creates a directory (not a file) at the path you specify; each task generates a file named with its task ID into that directory. In a slight difference from vanilla Hadoop, if the last stage is a reduce, the files are named like part-r-00000 (r for reduce, followed by the task ID); if a map, they are named like part-m-00000. Try removing the STORE line from the script above and re-run the script. You'll see nothing happen! Pig is declarative: your statements inform Pig how it could produce certain tables, rather than commanding Pig to produce those tables in order. The behavior of only evaluating on demand is an incredibly useful feature for development work. One of the best pieces of advice we can give you is to checkpoint all the time. Smart data scientists iteratively develop the first few transformations of a project, then save that result to disk; working with that saved checkpoint, they develop the next few transformations, then save it to disk; and so forth. Here's a demonstration:
great_start = LOAD '...' AS (...);
-- ...
-- lots of stuff happens, leading up to
-- ...
important_milestone = JOIN [...];

-- reached an important milestone, so checkpoint to disk.
STORE important_milestone INTO './data/tmp/important_milestone';
important_milestone = LOAD './data/tmp/important_milestone' AS (...schema...);
In development, once you've run the job past the STORE important_milestone line, you can comment it out to make Pig skip all the preceding steps: since there's nothing tying the graph to an output operation, nothing will be computed on behalf of important_milestone, and so execution will start with the following LOAD. The gratuitous save and load does impose a minor cost, so in production, comment out both the STORE and its following LOAD to eliminate the checkpoint step. These checkpoints bring two other benefits: an inspectable copy of your data at that checkpoint, and a description of its schema in the re-LOAD line. Many newcomers to Big Data processing resist the idea of checkpointing often. It takes a while to accept that a terabyte of data on disk is cheap, but the cluster time to generate that data is far less cheap, and the programmer time to create the job that creates the data is most expensive of all. We won't include the checkpoint steps in the printed code snippets of the book, but we've left them in the example code.
ILLUSTRATE magically simulates your script's actions, except when it fails to work
The ILLUSTRATE directive is one of our best-loved, and most-hated, Pig operations. Even if you only want to see an example line or two of your output, using a DUMP or a STORE requires passing the full dataset through the processing pipeline. You might think, "OK, so just choose a few rows at random and run on that," but if your job has steps that try to match two datasets using a JOIN, it's exceptionally unlikely that any matches will survive the limiting. (For example, the players in the first few rows of the baseball players table belonged to teams that are not in the first few rows from the baseball teams table.) ILLUSTRATE walks your execution graph to intelligently mock up records at each processing stage. If the sample rows would fail to join, Pig uses them to generate fake records that will find matches. It solves the problem of running on ad-hoc subsets, and that's why we love it.
However, not all parts of Pig's functionality work with ILLUSTRATE, meaning that it often fails to run. When is the ILLUSTRATE command most valuable? When applied to less-widely-used operations and complex sequences of statements, of course. What parts of Pig are most likely to lack ILLUSTRATE support or trip it up? Well, less-widely-used operations and complex sequences of statements, of course. And when it fails, it does so with perversely opaque error messages, leaving you to wonder if there's a problem in your script or if ILLUSTRATE has left you short.
If you, eager reader, are looking for a good place to return some open-source karma: consider making ILLUSTRATE into the tool it could be. Until somebody does, you should checkpoint often (described along with the STORE command above) and use the strategies for subuniverse sampling from the Statistics chapter (TODO ref).
Lastly, while we're on the subject of development tools that don't work perfectly in Pig: the Pig shell gets confused too easily to be useful. You're best off just running your script directly.
CHAPTER 6
Big Data Ecosystem and Toolset
Big data is necessarily a polyglot sport. The extreme technical challenges demand diverse technological solutions, and the relative youth of this field means, unfortunately, largely incompatible languages, formats, nomenclature and transport mechanisms. What's more, every ecosystem niche has multiple open source and commercial contenders vying for prominence, and it is difficult to know which are widely used, which are being adopted and even which of them work at all.
Fixing a map of this ecosystem to print would be nearly foolish: predictions of success or failure will prove wrong, the companies and projects whose efforts you omit or downplay will inscribe your name in their Enemies list, the correct choice can be deeply use-case specific, and any list will become out of date the minute it is committed to print. Your authors, fools both, feel you are better off with a set of wrong-headed, impolitic, oblivious and obsolete recommendations based on what has worked for us and what we have seen work for other people.
Cloudera was the first company to commercialize Hadoop; its distribution is, by far, the most widely adopted and if you don't feel like thinking, it's the easy choice. The company is increasingly focused on large-scale enterprise customers, and its feature velocity is increasingly devoted to its commercial-only components.
Hortonworks was founded two years later by Eric Baldeschwieler (aka Eric 14), who brought the project into Yahoo! and fostered its essential early growth, and a no-less-impressive set of core contributors. It has rapidly matured a first-class offering with its own key advantages. Hortonworks' offering is 100-percent open source, which will appeal to those uninterested in a commercial solution or who abhor the notion of vendor lock-in. More impressive to us has been Hortonworks' success in establishing beneficial ecosystem partnerships. The most important of these partnerships is with Microsoft and, although we do not have direct experience, any Microsoft shop should consider Hortonworks first.
There are other smaller distributions, from IBM, VMware and others, which are only really interesting if you use IBM, VMware or one of those others. The core project has a distribution of its own, but apart from people interested in core development, you are better off with one of the packaged distributions.
The most important alternative to Hadoop is Map/R, a C++-based rewrite that is 100-percent API-compatible with Hadoop. It is a closed-source commercial product for high-end enterprise customers, and has a free version with all essential features for smaller installations. Most compellingly, its HDFS also presents an NFS interface and so can be mounted as a native file system. See the section below (TODO: REF) on why this is such a big deal. Map/R is faster, more powerful and much more expensive than the open source version, which is pretty much everything you need to know to decide if it is the right choice for you.
There are two last alternatives worthy of note. Both discard compatibility with the Hadoop code base entirely, freeing them from any legacy concerns. Spark is, in some sense, an encapsulation of the iterative development process espoused in this book: prepare a sub-universe, author small self-contained scripts that checkpoint frequently, and periodically re-establish a beachhead by running against the full input dataset. The output of Spark's Scala-based domain-specific statements is managed intelligently in memory and persisted to disk when directed. This eliminates, in effect, the often-unnecessary cost of writing out data from the Reducer tasks and reading it back in again to Mapper tasks. That's just one example of the many ways in which Spark is able to impressively optimize development of Map/Reduce jobs.
Disco is an extremely lightweight Python-based implementation of Map/Reduce that originated at Nokia. 1 Its advantage and disadvantage is that it is an essentials-only realization of what Hadoop provides, with a code base that is a small fraction of the size. We do not see either of them displacing Hadoop, but since both are perfectly happy to run on top of any standard HDFS, they are reasonable tools to add to your repertoire.
The 2.0 branch has cleaner code, some feature advantages and the primary attention of the core team. However, the MR1 toolkit has so much ecosystem support, documentation, applications, and lessons learned from wide-scale deployment that it continues to be our choice for production use. Note that you can have your redundant Namenode without having to adopt the new-fangled API. Adoption of MR2 is highly likely (though not certain); if you are just starting out, adopting it from the start is probably a sound decision. If you have a legacy investment in MR1 code, wait until you start seeing blog posts from large-scale deployers titled "We Spent Millions Upgrading To MR2 And Boy Are We Ever Happy We Did So." The biggest pressure to move forward will be Impala, which requires the MR2 framework. If you plan to invest in Impala heavily, it is probably best to uniformly adopt MR2.
do almost all the work. Flume, from Cloudera and also Java-based, solves the same problem but less elegantly, in our opinion. It offers the ability to do rudimentary in-stream processing similar to Storm, but lacks the additional sophistication Trident provides. Both Kafka and Flume are capable of extremely high throughput and scalability. Most importantly, they guarantee at least once processing. Within the limits of disk space and the laws of physics, they will reliably transport each record to its destination even as networks and intervening systems fail.
Kafka and Flume can both deposit your data reliably onto an HDFS, but take very different approaches to doing so. Flume uses the obvious approach of having an always-live sink write records directly to a DataNode acting as a native client. Kafka's Camus add-on uses a counterintuitive but, to our mind, superior approach. In Camus, data is loaded onto the HDFS using mapper-only MR jobs running in an endless loop. Its map tasks are proper Hadoop jobs and Kafka clients, and elegantly leverage the reliability mechanisms of each. Data is live on the HDFS as often as the import job runs, not more, not less.
Flume's scheme has two drawbacks. First, the long-running connections it requires to individual DataNodes silently compete with the traditional framework. 2 Second, a file does not become live on the HDFS until either a full block is produced or the file is closed. That's fine if all your datastreams are high rate, but if you have a range of rates or variable rates, you are forced to choose between inefficient block sizes (larger NameNode burden, more map tasks) or exceedingly long delays until data is ready to process. There are workarounds, but they are workarounds.
Both Kafka and Flume have evolved into general-purpose solutions from their origins in high-scale server log transport, but there are other use-case specific technologies. You may see Scribe and S4 mentioned as alternatives, but they are not seeing the same widespread adoption. Scalable message queue systems such as AMQP, RabbitMQ or Kestrel will make sense if (a) you are already using one; (b) you require complex event-driven routing; or (c) your system is zillions of sources emitting many events rather than many sources emitting zillions of events. AMQP is Enterprise-y and has rich commercial support. RabbitMQ is open source-y and somewhat more fresh. Kestrel is minimal and fast.
Stream Analytics
The streaming transport solutions just described focus on getting your data from here to there as efficiently as possible. A streaming analytics solution allows you to perform, well, analytics on the data in flight. While a transport solution only guarantees at least
2. Make sure you increase DataNode handler counts to match.
once processing, frameworks like Trident guarantee exactly once processing, enabling you to perform aggregation operations. They encourage you to do anything to the data in flight that Java or your high-level language of choice permits you to do, including even high-latency actions such as pinging an external API or legacy data store, while giving you efficient control over locality and persistence. There is a full chapter introduction to Trident in Chapter (TODO: REF), so we won't go into much more detail here.
Trident, a Java- and Clojure-based open source project from Twitter, is the most prominent such framework so far, and there are two notable alternatives. Spark Streaming, an offshoot of the Spark project mentioned above (TODO: REF), is receiving increasing attention. Continuuity offers an extremely slick, developer-friendly commercial alternative. It is extremely friendly with HBase (the company was started by some members of the HBase core team); as we understand it, most of the action actually takes place within HBase, an interesting alternative approach. Trident is extremely compelling, the most widely used, our choice for this book and our best recommendation for general use.
Lastly, just as this chapter was being written, Facebook open sourced their Presto project. It is too early to say whether it will be widely adopted, but Facebook doesn't do anything thoughtlessly or at a small scale. We'd include it in any evaluation. Which to choose? If you want the simple answer, use Impala if you run your own clusters or RedShift if you prefer a cloud solution. But this technology only makes sense when you've gone beyond what traditional solutions support. You'll be spending hundreds of thousands of dollars here, so do a thorough investigation.
You'll hear the word realtime attached to both streaming and OLAP technologies; there are actually three things meant by that term. The first, let's call it immediate realtime, is provided by the CEP solutions: if the consequent actions of a new piece of data have not occurred within 50 milliseconds or less, forget about it. Let's call what the streaming analytics solutions provide prompt realtime; there is a higher floor on the typical processing latency, but you are able to handle all the analytical processing and consequent actions for each piece of data as it is received. Lastly, the OLAP data stores provide what we will call interactive realtime; data is both promptly manifested in the OLAP system's tables and the results of queries are returned and available within an analyst's attention span.
Database Crossloading
All the tools above focus on handling massive streams of data in constant flight.
Sometimes, what
Most large enterprises are already using a traditional ETL 4 tool such as Informatica and (TODO: put in name of the other one). If you want a stodgy, expensive, Enterprise-grade solution, their salespeople will enthusiastically endorse it for your needs; but if extreme scalability is essential, and their relative immaturity is not a deal breaker, use Sqoop, Kafka or Flume to centralize your data.
like Oracle or Netezza large sums of money to fight a rear-guard action against data locality on your behalf, or you can abandon the Utopian conceit that one device can perfectly satisfy the joint technical constraints of storing, interrogating and restructuring data at arbitrary scale and velocity for every application in your shop. As it turns out, there are a few coherent ways to variously relax those constraints, and around each of those solution sets has grown a wave of next-generation data stores referred to with the (TODO: change word) idiotic collective term "NoSQL databases."
The resulting explosion in the number of technological choices presents a baffling challenge to anyone deciding "which NoSQL database is the right one for me?" Unfortunately, the answer is far worse than that, because the right question is "which NoSQL databases are the right choices for me?" Big data applications at scale are best architected using a variety of data stores and analytics systems. The good news is that, by focusing on narrower use cases and relaxing selected technical constraints, these new data stores can excel at their purpose far better than an all-purpose relational database would. Let's look at the respective data store archetypes that have emerged and their primary contenders.
Billions of Records
At the extreme far end of the ecosystem are a set of data stores that give up the ability to be queried in all but the simplest ways in return for the ability to store and retrieve trillions of objects with exceptional durability, throughput and latency. The choices we like here are Cassandra, HBase or Accumulo, although Riak, Voldemort, Aerospike, Couchbase and Hypertable deserve consideration as well. Cassandra is the pure-form expression of the "trillions of things" mission. It is operationally simple and exceptionally fast on write, making it very popular for time-series applications. HBase and Accumulo are architecturally similar in that they sit on top of
Hadoop's HDFS; this makes them operationally more complex than Cassandra but gives them an unparalleled ability to serve as source and destination of Map/Reduce jobs. All three are widely popular open source, Java-based projects. Accumulo was initially developed by the U.S. National Security Agency (NSA) and was open sourced in 2011. HBase has been an open source Apache project since its inception in 2006, and the two are nearly identical in architecture and functionality. As you would expect, Accumulo has unrivaled security support, while HBase's longer visibility gives it a wider installed base.
We can try to make the choice among the three sound simple: if security is an overriding need, choose Accumulo. If simplicity is an overriding need, choose Cassandra. For overall best compatibility with Hadoop, use HBase. However, if your use case justifies a data store in this class, it will also require investing hundreds of thousands of dollars in infrastructure and operations. Do a thorough bakeoff among these three and perhaps some of the others listed above.
What you give up in exchange is all but the most primitive form of locality. The only fundamental retrieval operation is to look up records or ranges of records by primary key. There is sugar for secondary indexing, and tricks that help restore some of the power you lost, but effectively that's it. No JOINs, no GROUPs, no SQL.
Claims that MongoDB doesn't scale, though, are overblown; it scales quite capably into the billion-record regime, but doing so requires expert guidance. Probably the best thing to do is think about it this way: the open source version of MongoDB is free to use on single machines by amateurs and professionals, one and all; anyone considering using it on multiple machines should only do so with commercial support from the start.
The increasingly popular ElasticSearch data store is our first choice for hitting the sweet spot of programmer delight and scalability. The heart of ElasticSearch is Lucene, which encapsulates the exceptionally difficult task of indexing records and text in a streamlined gem of functionality, hardened by a decade of wide open source adoption. 5 ElasticSearch embeds Lucene into a first-class distributed data framework and offers a powerful, programmer-friendly API that rivals MongoDB's. Since Lucene is at its core, it would be easy to mistake ElasticSearch for a text search engine like Solr; it is one of those and, to our minds, the best one, but it is also a first-class database.
5. Lucene was started, incidentally, by Doug Cutting several years before he started the Hadoop project.
If you turn the knob for programmer delight all the way to the right, one request that would fall out would be, "Hey, can you take the same data structures I use while I'm coding, but make it so I can have as many of them as I have RAM, shared across as many machines and processes as I like?" The Redis data store is effectively that. Its API gives you the fundamental data structures you know and love (hashmap, stack, buffer, set, etc.) and exposes exactly the set of operations that can be performant and distributedly correct. It is best used when the amount of data does not much exceed the amount of RAM you are willing to provide, and should only be used when its data structures are a direct match to your application. Given those constraints, it is simple, light and a joy to use.
Sometimes, the only data structure you need is "given name, get thing." Memcached is an exceptionally fast in-memory key-value store that serves as the caching layer for many of the Internet's largest websites. It has been around for a long time and will not go away any time soon.
If you are already using MySQL or PostgreSQL, and therefore only have to scale by cost of RAM not cost of license, you will find that they are perfectly defensible key-value stores in their own right. Just ignore 90 percent of their user manuals, and find out when the need for better latency or lower cost of compute forces you to change.
Kyoto Tycoon (TODO LINK) is an open source C++-based distributed key-value store with the venerable DBM database engine at its core. It is exceptionally fast and, in our experience, is the simplest way to efficiently serve a mostly-cold data set. It will quite happily serve hundreds of gigabytes or terabytes of data out of not much more RAM than you require for efficient caching.
Graph Databases
Graph-based databases have been around for some time but have yet to see general adoption outside of, as you might guess, the intelligence and social networking communities (NASH). We suspect that, as the price of RAM continues to drop and the number of data scientists continues to rise, sophisticated analysis of network graphs will become increasingly important and graph data stores will, we hear, see increasing adoption. The two open source projects we hear the most about are the longstanding Neo4J project and the newer, fresher TitanDB. Your authors do not have direct experience here, but the adoption rate of TitanDB is impressive and we believe that is where the market is going.
your program to run faster, use more machines, not more code). These languages have an incredibly rich open source toolkit ecosystem and cross-platform glue. Most importantly, their code is simpler, shorter and easier to read; far more of data science than you expect is brass-knuckle street fighting, necessary acts of violence to make your data look like it should. These are messy, annoying problems, not deep problems, and in our experience the only way to handle them maintainably is in a high-level scripting language.
You probably come in with a favorite scripting language in mind, and so by all means, use that one. The same Hadoop streaming interface powering the ones we will describe below is almost certainly available in your language of choice. If you do not, we will single out Ruby, Python and Scala as the most plausible choices, roll our eyes at the language warhawks sharpening their knives, and briefly describe the advantages of each.
Ruby is elegant, flexible and maintainable. Among programming languages suitable for serious use, Ruby code is naturally the most readable, and so it is our choice for this book. We use it daily at work and believe its clarity makes the thought we are trying to convey most easily portable into the reader's language of choice.
Python is elegant, clean and spare. It boasts two toolkits appealing enough to serve as the sole basis of choice for some people. The Natural Language Toolkit (NLTK) is not far from the field of computational linguistics set to code. SciPy is widely used throughout scientific computing and has a full range of fast, robust matrix and numerical algorithms.
Lastly, Scala, a relative newcomer, is essentially Java but readable. Its syntax feels very natural to native Java programmers, and it executes directly on the JVM, giving it strong performance and first-class access to native Java frameworks, which means, of course, native access to the code under Hadoop, Storm, Kafka, etc.
If runtime efficiency and a clean match to Java are paramount, you will prefer Scala. If your primary use case is text processing or hardcore numerical analysis, Python's superior toolkits make it the best choice. Otherwise, it is a matter of philosophy. Against Perl's mad credo of "there is more than one way to do it," Python says "there is exactly one right way to do it," while Ruby says "there are a few good ways to do it; be clear and use good taste." One of those alternatives gets your world view; choose accordingly.
R, considered as a language, is inelegant, often frustrating and Vulcanized. Do not take that last part too seriously; whatever you are looking to do that can be done on a single machine, R can do. There are Hadoop integrations, like RHIPE, but we do not take them very seriously. R is best used on single machines or trivially parallelized using, say, Hadoop.
Julia is an upstart language designed by programmers, not statisticians. It openly intends to replace R by offering cleaner syntax, significantly faster execution and better distributed awareness. If its library support begins to rival R's, it is likely to take over, but that probably has not happened yet.
Lastly, Pandas, Anaconda and other Python-based solutions give you all the linguistic elegance of Python, a compelling interactive shell, and the extensive statistical and machine-learning capabilities that NumPy and scikit-learn provide. If Python is your thing, you should likely start here.
Mid-level Languages
You cannot do everything in a high-level language, of course. Sometimes you need closer access to the Hadoop API or to one of the many powerful, extremely efficient domain-specific frameworks provided within the Java ecosystem. Our preferred approach is to write Pig or Hive UDFs; you can learn more in Chapter (TODO: REF). Many people, however, prefer to live exclusively at this middle level. Cascading strikes a wonderful balance here. It combines an elegant DSL for describing your Hadoop job as a dataflow and a clean UDF framework for record-level manipulations. Much of Trident's API was inspired by Cascading; it is our hope that Cascading eventually supports Trident or Storm as a back end. Cascading is quite popular and, besides its native Java experience, offers first-class access from Scala (via the Scalding project) or Clojure (via the Cascalog project). Lastly, we will mention Crunch, an open source Java-based project from Cloudera. It is modeled after a popular internal tool at Google; it sits much closer to the Map/Reduce paradigm, which is either compelling to you or not.
Frameworks
Finally, for the programmers, there are many open source frameworks to address various domain-specific problems you may encounter as a data scientist. Going into any depth here is outside the scope of this book, but we will at least supply you with a list of pointers. Elephant Bird, Datafu and Akela offer extremely useful additional Pig and Hive UDFs. While you are unlikely to need all of them, we consider no Pig or Hive installation complete without them. For more domain-specific purposes, anyone in need of a machine-learning algorithm should look first at Mahout, Kiji, Weka, scikit-learn or those
available in a statistical language, such as R, Julia or NumPy. Apache Giraph and Gremlin are both useful for graph analysis. The HIPI (http://hipi.cs.virginia.edu/) toolkit enables image processing on Hadoop, with library support and a bundle format to address the dreaded many-small-files problem (TODO ref). (NOTE TO TECH REVIEWERS: What else deserves inclusion?)
Lastly, because we do not know where else to put them, there are several Hadoop environments, some combination of IDE, frameworks and conveniences that aim to make Hadoop friendlier to the Enterprise programmer. If you are one of those, they are worth a look.
CHAPTER 7
When working with big data, there are thousands of housekeeping tasks to do. As you will soon discover, moving data around, munging it, transforming it, and other mundane tasks will take up a depressingly inordinate amount of your time. A proper knowledge of some useful tools and tricks will make these tasks doable, and in many cases, easy.
In this chapter we discuss a variety of Unix commandline tools that are essential for shaping, transforming, and munging text. All of these commands are covered elsewhere, and covered more completely, but we've focused on their applications for big data, specifically their use within Hadoop. FLAVORISE
By the end of this chapter you should be FINISH
If you're already familiar with Unix pipes and chaining commands together, feel free to skip the first few sections and jump straight into the tour of useful commands.
A series of pipes
One of the key aspects of the Unix philosophy is that it is built on simple tools that can be combined in nearly infinite ways to create complex behavior. Command line tools are linked together through pipes, which take output from one tool and feed it into the input of another. For example, the cat command reads a file and outputs its contents. You can pipe this output into another command to perform useful transformations. For example, to select only the first field (delimited by tabs) of each line in a file you could:
cat somefile.txt | cut -f1
The vertical pipe character, |, represents the pipe between the cat and cut commands. You can chain as many commands as you like, and it's common to construct chains of 5 commands or more. In addition to redirecting a command's output into another command, you can also redirect output into a file with the > operator. For example:
echo 'foobar' > stuff.txt
writes the text foobar to stuff.txt. stuff.txt is created if it doesn't exist, and is overwritten if it does. If you'd like to append to a file instead of overwriting it, the >> operator will come in handy. Instead of creating or overwriting the specified file, it creates the file if it doesn't exist, and appends to the file if it does.
As a side note, the Hadoop streaming API is built using pipes. Hadoop sends each record in the map and reduce phases to your map and reduce scripts' stdin. Your scripts print the results of their processing to stdout, which Hadoop collects.
It's occasionally useful to be able to redirect these streams independently or into each other. For example, if you're running a command and want to log its output as well as any errors generated, you should redirect stderr into stdout and then direct stdout to a file:
*EXAMPLE*
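One way to do this (a sketch, using a hypothetical command named my_command):

my_command > results.log 2>&1

The 2>&1 clause points stderr (stream 2) at wherever stdout (stream 1) is going, which by then is results.log.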
Alternatively, you could redirect stderr and stdout into separate files:
*EXAMPLE*
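For instance (again with a hypothetical my_command):

my_command 1> results.log 2> errors.log

Here stdout lands in results.log while anything written to stderr lands in errors.log.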
You might also want to suppress stderr if the command you're using gets too chatty. You can do that by redirecting stderr to /dev/null, which is a special file that discards everything you hand it. Now that you understand the basics of pipes and output redirection, let's get on to the fun part: munging data!
cat is generally used to examine the contents of a file or as the start of a chain of commands:
cat foo.txt | sort | uniq > bar.txt
In addition to examining and piping around files, cat is also useful as an identity mapper, a mapper which does nothing. If your data already has a key that you would like to group on, you can specify cat as your mapper and each record will pass untouched through the map phase to the sort phase. Then, the sort and shuffle will group all records with the same key at the proper reducer, where you can perform further manipulations.
echo is very similar to cat except it prints the supplied text to stdout. For example:
echo foo bar baz bat > foo.txt
will result in foo.txt holding foo bar baz bat, followed by a newline. If you don't want the newline you can give echo the -n option.
Filtering
cut
The cut command allows you to cut certain pieces from each line, leaving only the interesting bits. The -f option means "keep only these fields," and takes a comma-delimited list of numbers and ranges. So, to select fields 1 through 3 and field 5 of a tsv file you could use:
cat somefile.txt | cut -f 1-3,5
Watch out: the field numbering is one-indexed. By default cut assumes that fields are tab-delimited, but delimiters are configurable with the -d option. This is especially useful if you have tsv output on the HDFS and want to filter it down to only a handful of fields. You can create a Hadoop streaming job to do this like so:
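A sketch of what that might look like, using the wu-mapred wrapper that appears later in this chapter (treat the exact invocation as illustrative; your fields and paths will differ):

wu-mapred --mapper='cut -f 1-3,5'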
cut is great if you know the indices of the columns you want to keep, but if your data is schema-less or nearly so (like unstructured text), things get slightly more complicated. For example, if you want to select the last field from all of your records, but the field length of your records varies, you can combine cut with the rev command, which reverses text:
cat foobar.txt | rev | cut -f1 | rev
This reverses each line, selects the first field in the reversed line (which is really the last field), and then reverses the text again before outputting it.
cut also has a -c (for character) option that allows you to select ranges of characters. This is useful for quickly verifying the output of a job with long lines. For example, in the Regional Flavor exploration, many of the jobs output wordbags, which are just giant JSON blobs; a single line would overflow your entire terminal. If you want to quickly verify that the output looks sane, you could use:
wu-cat /data/results/wikipedia/wordbags.tsv | cut -c 1-100
Character encodings
Cut's -c option, as well as many other Unix text manipulation tools, requires a little forethought when working with different character encodings, because each encoding can use a different number of bits per character. If cut thinks that it is reading ASCII (7 bits per character) when it is really reading UTF-8 (a variable number of bytes per character), it will split characters and produce meaningless gibberish. Our recommendation is to get your data into UTF-8 as soon as possible and keep it that way, but the fact of the matter is that sometimes you have to deal with other encodings. Unix's solution to this problem is the LC_* environment variables. LC stands for locale, and lets you specify your preferred language and character encoding for various types of data.
LC_CTYPE (locale character type) sets the default character encoding used system-wide. In the absence of LC_CTYPE, LANG is used as the default, and LC_ALL can be used to override all other locale settings. If you're not sure whether your locale settings are having their intended effect, check the man page of the tool you are using and make sure that it obeys the LC_* variables.
You can view your current locale settings with the locale command. Operating systems differ on how they represent languages and character encodings, but on my machine en_US.UTF-8 represents English, encoded in UTF-8.
Remember that if you're using these commands as Hadoop mappers or reducers, you must set these environment variables across your entire cluster, or set them at the top of your script.
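For example, a line like this at the top of a mapper script will do (a sketch; substitute whatever locale your data actually uses):

export LC_ALL=en_US.UTF-8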
head is especially useful for sanity-checking the output of a Hadoop job without overflowing your terminal. head and cut make a killer combination:
wu-cat /data/results/foobar | head -10 | cut -c 1-100
tail works almost identically to head. Viewing the last ten lines of a file is easy:
tail -10 foobar.txt
tail also lets you specify the selection in relation to the beginning of the file with the + operator. So, to select every line from the 10th line on:
tail +10 foobar.txt
What if you just finished uploading 5,000 small files to the HDFS and realized that you left a header on every one of them? No worries, just use tail as a mapper to remove the header:
wu-mapred --mapper='tail +2'
tail's -f option is also worth knowing about: tail -f yourlogs.log outputs the end of the log to your terminal and waits for new content, updating the output as more is written to yourlogs.log.
grep
grep is a tool for finding patterns in text. You can give it a word, and it will diligently search its input, printing only the lines that contain that word:
GREP EXAMPLE
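For instance (a sketch with a hypothetical logfile):

grep hadoop server.log

prints only the lines of server.log that contain the string hadoop.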
grep has many options, and accepts regular expressions as well as words and word sequences:
ANOTHER EXAMPLE
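A sketch using a regular expression against the same hypothetical file; -E enables extended regular expressions and -i makes the match case-insensitive:

grep -iE 'hadoop|elephant' server.log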
The -z option, which decompresses gzipped text before grepping through it, can be tremendously useful if you keep files on your HDFS in a compressed form to save space. When using grep in Hadoop jobs, beware its non-standard exit statuses. grep returns 0 if it finds matching lines, 1 if it doesn't find any matching lines, and a number greater than 1 if there was an error. Because Hadoop interprets any exit code greater than 0 as an error, any Hadoop job that doesn't find any matching lines will be considered failed by Hadoop, which will result in Hadoop re-trying those jobs without success. To fix this, we have to swallow grep's exit status like so:
(grep foobar || true)
sort
The sort command sorts its input lines; by default, the ordering is lexical. You can also tell it to sort numerically with the -n option, but -n only sorts integers properly. To sort decimals and numbers in scientific notation properly, use the -g option:
EXAMPLE
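For instance (a sketch; assume each line of scores.tsv begins with a numeric value):

sort -g scores.tsv

orders 2, 10 and 1.5e3 numerically, where a plain sort would order them lexically.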
By default the column delimiter is a non-blank to blank transition, so any content character followed by a whitespace character (tab, space, etc.) is treated as the end of a column. This can be tricky if your data is tab-delimited but contains spaces within columns. For example, if you were trying to sort some tab-delimited data containing movie titles, you would have to tell sort to use tab as the delimiter. If you try the obvious solution, you might be disappointed with the result:
sort -t"\t" sort: multi-character tab `\\t'
Instead we have to somehow give the -t option a literal tab. The easiest way to do this is:
sort -t$'\t'
$'<string>' is a special directive that tells your shell to expand <string> into its equivalent literal. You can do the same with other control characters, including \n, \r, etc.
Alternatively, to insert a literal tab between the single quotes, type CTRL-V and then Tab. If you find your sort command is taking a long time, try increasing its sort buffer size with the --buffer-size option. This can make things go a lot faster:
example
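A sketch (GNU sort; the 2G figure is arbitrary):

sort --buffer-size=2G huge_file.tsv > huge_file-sorted.tsv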
uniq
uniq is used for working with duplicate lines: you can count them, remove them, look for them, among other things. For example, here is how you would find the number of Oscars each actor has in a list of annual Oscar winners:
example
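A sketch, assuming a hypothetical file oscar_winners.tsv whose second tab-delimited column holds the actor's name:

cut -f2 oscar_winners.tsv | sort | uniq -c

The output has one line per actor, prefixed with the number of times that actor appears in the list.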
Note the -c option, which prepends each output line with a count of the number of duplicates. Also note that we sort the list before piping it into uniq; input to uniq must always be sorted or you will get erroneous results. You can also restrict the output to lines that are never repeated with the -u option:
example
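For instance, with the same hypothetical file:

cut -f2 oscar_winners.tsv | sort | uniq -u

prints only the actors who appear exactly once.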
join
TBD - do we even want to talk about this?
Summarizing
wc
wc is a utility for counting words, lines, and characters in text. Without options, it searches its input and outputs the number of lines, words, and bytes, in that order:
EXAMPLE
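For instance (file name and figures are hypothetical):

wc access.log
    1000    8000   64000 access.log

meaning 1000 lines, 8000 words and 64000 bytes.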
wc will also print out the number of characters, as defined by the LC_CTYPE environment variable:
EXAMPLE
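A sketch; the -m flag asks for characters rather than bytes, which matters for multi-byte encodings like UTF-8:

wc -m access.log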
We can use wc as a mapper to count the total number of words in all of our files on the HDFS:
EXAMPLE
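A sketch using the wu-mapred wrapper from earlier (-w restricts wc to word counts; you would still need a final step to sum the per-task counts):

wu-mapred --mapper='wc -w'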
Intro to Storm+Trident
CHAPTER 8
arbitrarily powerful general-purpose code to handle every record. A lot of Storm+Trident's adoption is in application to real-time systems. 1 But, just as importantly, the framework exhibits radical tolerance of latency. It's perfectly reasonable to, for every record, perform reads of a legacy data store, call an internet API and the like, even if those might have hundreds or thousands of milliseconds worst-case latency. That range of timescales is simply impractical within a batch processing run or database query. In the later chapter on the Lambda Architecture, you'll learn how to use stream and batch analytics together for latencies that span from milliseconds to years.
As an example, one of the largest hard drive manufacturers in the world ingests sensor data from its manufacturing line, test data from quality assurance processes, reports from customer support and post-mortem analysis of returned devices. They have been able to mine the accumulated millisecond-scale sensor data for patterns that predict flaws months and years later. Hadoop produces the slow, deep results, uncovering the patterns that predict failure. Storm+Trident produces the fast, relevant results: operational alerts when those anomalies are observed.
Things you should take away from this chapter:
* Understand the type of problems you solve using stream processing and apply it to real examples using the best-in-class stream analytics frameworks.
* Acquire the practicalities of authoring, launching and validating a Storm+Trident flow.
* Understand Trident's operators and how to use them: Each; apply `CombinerAggregator`s, `ReducerAggregator`s and `AccumulatingAggregator`s (generic aggregator?); persist records or aggregations directly to a backing database or to Kafka for idempotent downstream storage.
(probably not going to discuss how to do a streaming join, using either DRPC or a hashmap join)
This chapter will only speak of Storm+Trident at a high level and from the outside. We won't spend any time on how it's making this all work until (TODO ref the chapter on Storm+Trident internals).
1. For reasons you'll learn in the Storm internals chapter, it's not suitable for ultra-low latency (below, say, 5s of milliseconds), Wall Street-type applications, but if latencies above that are real-time enough for you, Storm+Trident shines.
You define your topology and Storm handles all the hard parts: fault tolerance, retrying, and distributing your code across the cluster, among other things. For your first Storm+Trident topology, we're going to create a topology to handle a typical streaming use case: accept a high-rate event stream, process the events to power a realtime dashboard, and then store the records for later analysis. Specifically, we're going to analyze the Github public timeline and monitor the number of commits per language. A basic logical diagram of the topology looks like this:
(figure: 89-intro-to-storm-topo.png)
Each node in the diagram above represents a specific operation on the data flow. Initially JSON records are retrieved from Github and injected into the topology by the Github Spout, where they are transformed by a series of operations and eventually persisted to an external data store. Trident spouts are sources of streams of data; common use cases include pulling from a Kafka queue, Redis queue, or some other external data source. Streams are in turn made up of tuples, which are just lists of values with names attached to each field. The meat of the Java code that constructs this topology is as follows:
IBlobStore bs = new FileBlobStore("~/dev/github-data/test-data");
OpaqueTransactionalBlobSpout spout = new OpaqueTransactionalBlobSpout(bs, StartPolicy.EARLIEST, null);
TridentTopology topology = new TridentTopology();
topology.newStream("github-activities", spout)
  .each(new Fields("line"), new JsonParse(), new Fields("parsed-json"))
  .each(new Fields("parsed-json"), new ExtractLanguageCommits(), new Fields("language", "commits"))
  .groupBy(new Fields("language"))
  .persistentAggregate(new VisibleMemoryMapState.Factory(), new Count(), new Fields("commit-sum"));
The first two lines are responsible for constructing the spout. Instead of pulling directly from Github, we'll be using a directory of downloaded JSON files so as not to a) unnecessarily burden Github and b) unnecessarily complicate the code. You don't need to worry about the specifics, but the OpaqueTransactionalBlobSpout reads each JSON file and feeds it line by line into the topology. After creating the spout we construct the topology by calling new TridentTopology(). We then create the topology's first (and only) stream by calling newStream and
passing in the spout we instantiated earlier along with a name, github-activities. We can then chain a series of method calls off newStream() to tack on our logic after the spout.
The each method call, appropriately, applies an operation to each tuple in the stream. The important parameter in the each calls is the second one, which is a class that defines
the operation to be applied to each tuple. The first each uses the JsonParse class, which parses the JSON coming off the spout and turns it into an object representation that we can work with more easily. Our second each uses ExtractLanguageCommits.class to pull the statistics we're interested in from the parsed JSON objects, namely the language and number of commits. ExtractLanguageCommits.class is fairly straightforward, and it is instructive to digest it a bit:
public static class ExtractLanguageCommits extends BaseFunction {
  private static final Logger LOG = LoggerFactory.getLogger(ExtractLanguageCommits.class);
  public void execute(TridentTuple tuple, TridentCollector collector){
    JsonNode node = (JsonNode) tuple.getValue(0);
    if(!node.get("type").toString().equals("\"PushEvent\"")) return;
    List values = new ArrayList(2);
    // grab the language and the action
    values.add(node.get("repository").get("language").asText());
    values.add(node.get("payload").get("size").asLong());
    collector.emit(values);
    return;
  }
}
There is only one method, execute, that accepts a tuple and a collector. The tuples coming into ExtractLanguageCommits have only one field, parsed-json, which contains a JsonNode, so the first thing we do is cast it. We then use the get method to access the various pieces of information we need. At the time of writing, the full schema for Github's public stream is available here, but here are the important bits:
{
  "type": "PushEvent",
  // can be one of .. finish JSON
}
finish this section
At this point the tuples in our stream might look something like this: (C, 2), (JavaScript, 5), (CoffeeScript, 1), (PHP, 1), (JavaScript, 1), (PHP, 2). We then group on the language and sum the counts, giving our final tuple stream, which could look like this: (C, 2), (JavaScript, 6), (CoffeeScript, 1), (PHP, 3). The group by is exactly what you think it is: it ensures that every tuple with the same language is grouped together and passed through the same thread of execution, allowing you to perform the sum operation across all tuples in each group. After summing the commits, the final counts are stored in a database. Feel free to go ahead and try it out yourself.
So What?
You might be thinking to yourself, "So what, I can do that in Hadoop in 3 lines," and you'd be right, almost. It's important to internalize the difference in focus between Hadoop and Storm+Trident: when using Hadoop, you must have all your data sitting in front of you before you can start, and Hadoop won't provide any results until
processing all of the data is complete. The Storm+Trident topology you just built allows you to update your results as you receive your stream of data in real time, which opens up a whole set of applications you could only dream about with Hadoop.
Statistics
CHAPTER 9
Skeleton: Statistics
Data is worthless. Actually, it's worse than worthless: it costs you money to gather, store, manage, replicate and analyze. What you really want is insight: a relevant summary of the essential patterns in that data, produced using relationships to analyze data in context. Statistical summaries are the purest form of this activity, and will be used repeatedly in the book to come, so now that you see how Hadoop is used it's a good place to focus.
Some statistical measures let you summarize the whole from summaries of the parts: I can count all the votes in the state by summing the votes from each county, and the votes in each county by summing the votes at each polling station. Those types of aggregations (average/standard deviation, correlation, and so forth) are naturally scalable, but just having billions of objects introduces some practical problems you need to avoid. We'll also use them to introduce Pig, a high-level language for SQL-like queries on large datasets.
Other statistical summaries require assembling context that grows with the size of the whole dataset. The amount of intermediate data required to count distinct objects, extract an accurate histogram, or find the median and other quantiles can become costly and cumbersome. That's especially unfortunate because so much data at large scale has a long-tail, not normal (Gaussian), distribution; the median is a far more robust indicator of the typical value than the average. (If Bill Gates walks into a bar, everyone in there is a billionaire on average.) But you don't always need an exact value; you need actionable insight. There's a clever pattern for approximating the whole by combining carefully re-mixed summaries of the parts, and we'll apply it to:
* Holistic vs algebraic aggregations
* Underflow and the Law of Huge Numbers
* Approximate holistic aggs: Median vs remedian; percentile; count distinct (hyperloglog)
* Count-min sketch for most frequent elements
* Approx histogram
* Counting total burgers sold - total commits, repos, counting
* a running and/or smoothed average
* standard deviation
* sampling: uniform, top k, reservoir
* ?rolling topk/reservoir sampling?
* algebraic vs holistic aggregate
* use count-min sketch to turn a holistic aggregate into an algebraic aggregate
* quantile or histogram
* numeric stability
Event Streams
CHAPTER 10
doc: "GET, POST, etc" doc: "Combined path and query string of request" doc: "eg 'HTTP/1.1'" doc: doc: doc: doc: "HTTP status code (j.mp/httpcodes)" "Bytes in response body", blankish: ['', nil, '-'] "URL of linked-from page. Note speling." "Version info of application making the request"
Since most of our questions are about what visitors do, we'll mainly use visitor_id (to identify common requests for a visitor), uri_str (what they requested), requested_at (when they requested it) and referer (the page they clicked on). Don't worry if you're not deeply familiar with the rest of the fields in our model; they'll become clear in context. Two notes, though.
In these explorations, we're going to use the ip_address field for the visitor_id. That's good enough if you don't mind artifacts, like every visitor from the same coffee shop being treated identically. For serious use, though, many web applications assign an identifying cookie to each visitor and add it as a custom logline
field. Following good practice, we've built this model with a visitor_id method that decouples the semantics (visitor) from the data (the IP address they happen to have visited from). Also please note that, though the dictionary blesses the term referrer, the early authors of the web used the spelling referer and we're now stuck with it in this context. ///Here is a great example of supporting with real-world analogies, above where you wrote, like every visitor from the same coffee shop being treated identically Yes! That is the kind of simple, sophisticated tying-together type of connective tissue needed throughout, to greater and lesser degrees. Amy////
This file is a subset of the Apache server logs from April 10 to November 26, 2003. It contains every request for my homepage, the original video, the remix video, the mirror redirector script, the donations spreadsheet, and the seven blog entries I made related to Star Wars Kid. I included a couple weeks of activity before I posted the videos so you can determine the baseline traffic I normally received to my homepage. The data is public domain. If you use it for anything, please drop me a note!
The details of parsing are mostly straightforward: we use a regular expression to pick apart the fields in each line. That regular expression, however, is another story:
class Logline
  # Extract structured fields using the `raw_regexp` regular expression
  def self.parse(line)
    mm = raw_regexp.match(line.chomp) or return BadRecord.new('no match', line)
    new(mm.captures_hash)
  end

  ### @export
  class_attribute :raw_regexp
  #
  # Regular expression to parse an apache log line.
  #
  # 83.240.154.3 - - [07/Jun/2008:20:37:11 +0000] "GET /faq?onepage=true HTTP/1.1" 200 569 "http:/
  #
  self.raw_regexp = %r{\A
    (?<ip_address>      [\w\.]+)            # ip_address      83.240.154.3
    \ (?<identd>        \S+)                # identd          - (rarely used)
    \ (?<authuser>      \S+)                # authuser        - (rarely used)
    #
    \ \[(?<requested_at>                    #
      \d+/\w+/\d+                           # date part       [07/Jun/2008
      :\d+:\d+:\d+                          # time part       :20:37:11
      \ [\+\-]\S*)\]                        # timezone        +0000]
    #
    \ \"(?:(?<http_method> [A-Z]+)          # http_method     "GET
    \ (?<uri_str>       \S+)                # uri_str         faq?onepage=true
    \ (?<protocol>      HTTP/[\d\.]+)|-)\"  # protocol        HTTP/1.1"
    #
    \ (?<response_code> \d+)                # response_code   200
    \ (?<bytesize>      \d+|-)              # bytesize        569
    \ \"(?<referer>     [^\"]*)\"           # referer         "http://infochimps.org/search?
    \ \"(?<user_agent>  [^\"]*)\"           # user_agent      "Mozilla/5.0 (Windows; U; Wind
    \z}x
end
It may look terrifying, but taken piece by piece it's not actually that bad. Regexp-fu is an essential skill for data science in practice; you're well advised to walk through it. Let's do so.
* The meat of each line describes the contents to match: \S+ for a sequence of non-whitespace, \d+ for a sequence of digits, and so forth. If you're not already familiar with regular expressions at that level, consult the excellent tutorial at regular-expressions.info.
* This is an extended-form regular expression, as requested by the x at the end of it. An extended-form regexp ignores whitespace and treats # as a comment delimiter; constructing a regexp this complicated would be madness otherwise. Be careful to backslash-escape spaces and hash marks.
* The \A and \z anchor the regexp to the absolute start and end of the string respectively.
* Fields are selected using named capture group syntax: (?<ip_address>\S+). You can retrieve its contents using match[:ip_address], or get all of them at once using captures_hash as we do in the parse method.
* Build your regular expressions to be brittle, and consider that a good thing: if you only expect HTTP request methods to be uppercase strings, make your program reject records that are otherwise. When you're processing billions of events, those one-in-a-million deviants start occurring thousands of times.
That regular expression does almost all the heavy lifting, but isn't sufficient to properly extract the requested_at time. Wukong models provide a security gate for each field in the form of the receive_(field name) method. The setter method (requested_at=) applies a new value directly, but the receive_requested_at method is expected to appropriately validate and transform the given value. The default method performs simple "do the right thing"-level type conversion, sufficient to (for example) faithfully load an object from a plain JSON hash. But for complicated cases you're invited to override it, as we do here.
class Logline
  # Map of abbreviated months to date number.
  MONTHS = { 'Jan' => 1, 'Feb' => 2, 'Mar' => 3, 'Apr' => 4, 'May' => 5, 'Jun' => 6, 'Jul' => 7,
             'Aug' => 8, 'Sep' => 9, 'Oct' => 10, 'Nov' => 11, 'Dec' => 12 }

  def receive_requested_at(val)
    # Time.parse doesn't like the quirky apache date format, so handle those directly
    mm = %r{(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)\s([\+\-]\d\d)(\d\d)}.match(val) rescue nil
    if mm
      day, mo, yr, hour, min, sec, tz1, tz2 = mm.captures
      val = Time.new(
        yr.to_i, MONTHS[mo], day.to_i,
        hour.to_i, min.to_i, sec.to_i, "#{tz1}:#{tz2}")
There's a general lesson here for data-parsing scripts: don't try to be a hero and get everything done in one giant method. The giant regexp just coarsely separates the values; any further special handling happens in isolated methods. Test the script in local mode:
~/code/wukong$ head -n 5 data/serverlogs/star_wars_kid/star_wars_kid-raw-sample.log | examples/ser
170.20.11.59     2003-04-30T20:17:02Z  GET  /archive/2003/04/29/star_war.shtml  HTTP/1.0
154.5.248.92     2003-04-30T20:17:04Z  GET  /random/video/Star_Wars_Kid.wmv     HTTP/1.0
199.91.33.254    2003-04-30T20:17:09Z  GET  /random/video/Star_Wars_Kid.wmv     HTTP/1.0
131.229.113.229  2003-04-30T20:17:09Z  GET  /random/video/Star_Wars_Kid.wmv     HTTP/1.1
64.173.152.130   2003-04-30T20:17:18Z  GET  /archive/2003/02/19/coachell.shtml  HTTP/1.1
Then run it on the full dataset to produce the starting point for the rest of our work:
TODO
Geo-IP Matching
You can learn a lot about your site's audience in aggregate by mapping IP addresses to geolocation. Not just in itself, but joined against other datasets, like census data, store locations, weather and time. 1
Maxmind makes their GeoLite IP-to-geo database available under an open license (CC-BY-SA). 2 Out of the box, its columns are beg_ip, end_ip, location_id, where the first two columns show the low and high ends (inclusive) of a range that maps to that location. Every address lies in at most one range; locations may have multiple ranges. This arrangement caters to range queries in a relational database, but isn't suitable for our needs. A single IP-geo block can span thousands of addresses.
To get the right locality, take each range and break it at some block level. Instead of having 1.2.3.4 to 1.2.5.6 on one line, let's use the first three quads (first 24 bits) and emit rows for 1.2.3.4 to 1.2.3.255, 1.2.4.0 to 1.2.4.255, and 1.2.5.0 to 1.2.5.6. This lets us use the first segment as the partition key, and the full IP address as the sort key.
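Here is a minimal Ruby sketch of that splitting step (not the book's actual code; it assumes IPv4 addresses and a fixed 24-bit block size):

require 'ipaddr'
require 'socket'

# Split an inclusive [beg_ip, end_ip] range into pieces that each lie
# within a single /24 block, so the first three quads can act as the
# partition key.
def split_range_at_24(beg_ip, end_ip)
  beg_i = IPAddr.new(beg_ip).to_i
  end_i = IPAddr.new(end_ip).to_i
  pieces = []
  while beg_i <= end_i
    block_end = (beg_i | 0xFF)                     # last address in this /24
    piece_end = [block_end, end_i].min
    pieces << [IPAddr.new(beg_i,     Socket::AF_INET).to_s,
               IPAddr.new(piece_end, Socket::AF_INET).to_s]
    beg_i = piece_end + 1
  end
  pieces
end

# split_range_at_24('1.2.3.4', '1.2.5.6')
# # => [["1.2.3.4", "1.2.3.255"], ["1.2.4.0", "1.2.4.255"], ["1.2.5.0", "1.2.5.6"]]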
1. These databases only impute a coarse-grained estimate of each visitor's location; they hold no direct information about the person. Please consult your priest/rabbi/spirit guide/grandmom or other appropriate moral compass before diving too deep into the world of unmasking your site's guests.
2. For serious use, there are professional-grade datasets from Maxmind, Quova, Digital Element among others.
bytes          description                  file
1_094_541_688  24-bit partition key         maxmind-geolite_city-20121002.tsv
  183_223_435  16-bit partition key         maxmind-geolite_city-20121002-16.tsv
   75_729_432  original (not denormalized)  GeoLiteCity-Blocks.csv
Range Queries
////Gently introduce the concept. So, heres what range queries are all about, in a nut shell Amy////
This is a generally-applicable approach for doing range queries:
* Choose a regular interval, fine enough to avoid skew but coarse enough to avoid ballooning the dataset size.
* Wherever a range crosses an interval boundary, split it into multiple records, each filling or lying within a single interval.
* Emit a compound key of [interval, join_handle, beg, end], where interval is
* join_handle identifies the originating table, so that records are grouped for a join (this is what ensures
* If the interval is transparently a prefix of the index (as it is here), you can instead just ship the remainder: [interval, join_handle, beg_suffix, end_suffix].
* Use the
In the geodata section, the quadtile scheme is (if you bend your brain right) something of an extension on this idea: instead of splitting ranges on regular intervals, we'll split regions on a regular grid scheme.
  def process(logline)
    beg_at = Time.now.to_f
    resp   = Faraday.get url_to_fetch(logline)
    yield summarize(resp, beg_at)
  end

  def summarize(resp, beg_at)
    duration = Time.now.to_f - beg_at
    bytesize = resp.body.bytesize
    { duration: duration, bytesize: bytesize }
  end

  def url_to_fetch(logline)
    logline.url
  end
end

flow(:mapper){ input > parse_loglines > elephant_stampede }
You must use Wukong's EventMachine bindings to make more than one simultaneous request per mapper.
Refs
Database of Robot User Agent strings
Improving Web Search Results Using Affinity Graph
Geographic Data Processing

CHAPTER 11
Spatial data is fundamentally important. Spatial data, which identifies the geographic location of features and boundaries on Earth, is very easy to acquire: from smartphones and other GPS devices, from government and public sources, and from a rich ecosystem of commercial suppliers. It is also easy to bring our physical and cultural intuition to bear on geospatial problems. There are several big ideas introduced here. First, of course, are the actual mechanics of working with spatial data, and of projecting the Earth onto a coordinate plane. The statistics and timeseries chapters dealt with their dimensions either singly or interacting weakly; geographic data, with its two strongly interacting dimensions, is a good jumping-off point for machine learning. Take a tour through some of the sites that curate the best in data visualization, and you'll see a strong overrepresentation of geographic explorations. With most datasets, you need to figure out the salient features, eliminate confounding factors, and of course do all the work of transforming them to be joinable.2 Taking a step back, the fundamental idea this chapter introduces is a direct way to extend locality to two dimensions. It so happens we did so in the context of geospatial data, which required a brief prelude about how to map our nonlinear feature space to the plane.
2. We dive deeper into these basics in the machine learning chapter (Chapter 17), later on.
Browse any of the open data catalogs (REF) or data visualization blogs, and you'll see that geographic datasets and visualizations are by far the most frequent. Partly this is because there are these two big, obvious feature components, highly explanatory and direct to understand. But you can apply these tools any time you have a small number of dominant features and a sensible distance measure mapping them to a flat space.

TODO: Will be reorganizing below in this order:
- do a nearness query example, reveal that it is such a thing known as the spatial join, and broaden your mind as to how you think about locality
- cover the geographic data model, GeoJSON etc.
- spatial concept of quadtiles, none of the mechanics of the projection yet
- something with points and regions, using quadtiles
- actual mechanics of quadtile projection from lng/lat to quadkey
- multiscale quadkey assignment (k-means will move to ML chapter)
- complex nearness: voronoi cells and weather data
Also TODO: untangle the following two paragraphs, and figure out whether to put them at beginning or end (probably as a sidebar, at the beginning).
Spatial Data
It not only unwinds two dimensions to one; the same system lends itself to spatial analysis in more dimensions. See the Exercises, which also extend the coordinate handling to three dimensions.
3. In other works you'll see the term Point of Interest (POI) for a place.
Point a single location on the earth's surface, given as [longitude, latitude]
Path an array of points: [[longitude,latitude],[longitude,latitude],...]
Region an array of paths, understood to connect and bound a region of space: [ [[longitude,latitude],[longitude,latitude],...], [[longitude,latitude],[longitude,latitude],...] ]. Your array will be of length one unless there are holes or multiple segments.
Feature a generic term for Point or Path or Region
Bounding Box (or bbox) a rectangular bounding region, [-5.0, 30.0, 5.0, 40.0]
The term feature is somewhat muddied: to a geographer, feature indicates a thing being described (places, regions and paths are all geographic features). In the machine learning literature, a feature describes a potentially significant attribute of a data element (manufacturer, top speed and weight are features of a car). Since we're here as data scientists dabbling in geography, we'll reserve the term feature for its machine-learning sense only, and just say object in place of geographic feature.
Features of Features
Geospatial Information Science (GIS), the discipline of capturing, analyzing and presenting geographically referenced data, is a deep subject, treated here shallowly: we're interested in models that have a geospatial context, not in precise modeling of the geographic features themselves. Without apology we're going to use the good-enough WGS-84 earth model and a simplistic map projection. We'll execute again the approach of using existing traditional tools on partitioned data, and Hadoop to reshape and orchestrate their output at large scale.4
4. If you can't find a good way to scale a traditional GIS approach, algorithms from computer graphics are surprisingly relevant.
Figure 11-1. Z-path of quadtiles

As you go along, index each tile, as shown in Figure 11-2:
Figure 11-2. Quadtile Numbering

This is a 1-d index into a 2-d space! What's more, nearby points in space are typically nearby in index value. By applying Hadoop's fundamental locality operation, sorting, geographic locality falls out of numerical locality. Note: you'll sometimes see people refer to quadtile coordinates as X/Y/Z or Z/X/Y; the Z here refers to zoom level, not a traditional third coordinate.
Going into this, I predict that UFO sightings will generally follow the population distribution (because you need people around to see them), but that sightings in cities will be under-represented per capita. I also suspect UFO sightings will be more likely near airports and military bases, and in the southwestern US. We will restrict attention to the continental US; coverage of both datasets is spotty elsewhere, which would contaminate our results. Looking through some weather reports, visibilities of ten to fifteen kilometers (6-10 miles) are a reasonable midrange value; let's use that distance to mean "nearby". Given this necessarily fuzzy boundary, let's simplify matters further by saying two objects are nearby if one point lies within the 20-km-per-side bounding box centered on the other:
+---------+---------+
|         |   B     |
|         |         |
|         |         |
+         A         +
|         |         |
|         |         |
|         |         |
+---------+---------+
          |- 10 km -|
Data is cheap and code is expensive, so for these 60,000 points we'll just serialize out the bounding box coordinates with each record rather than recalculate them in the reducer. We'll discard most of the UFO sighting's fields, but during development let's keep the location and time fields in so we can spot-check results. Mapper output:
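A minimal sketch of the mapper logic that would produce such records, assuming hypothetical field names and the point_north/point_east helpers defined later in this chapter:

NEARBY_DIST = 10_000 # meters: half of a 20-km-per-side box

# Bounding box of the "nearby" region centered on (lng, lat)
def bbox_for(lng, lat)
  _,     top = point_north(lng, lat,  NEARBY_DIST)
  _,     btm = point_north(lng, lat, -NEARBY_DIST)
  right, _   = point_east( lng, lat,  NEARBY_DIST)
  left,  _   = point_east( lng, lat, -NEARBY_DIST)
  [left, btm, right, top]
end

def process(sighting)
  left, btm, right, top = bbox_for(sighting.longitude.to_f, sighting.latitude.to_f)
  # keep the time and location fields for spot-checking during development
  yield [left, btm, right, top, sighting.sighted_at, sighting.longitude, sighting.latitude]
end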
For now I'm emitting the full place and sighting record, so we can see what's going on. In a moment we will change the combined_record method to output a more disciplined set of fields. Output data:
...
Comparing Distributions
We now have a set of [place, sighting] pairs, and we want to understand how the distribution of coincidences compares to the background distribution of places. (TODO: don't like the way I'm currently handling places near multiple sightings.) That is, we will compare the following quantities:
- count of sightings
- count of features
- for each feature type, count of records
- for each feature type, count of records near a sighting
The dataset at this point is small enough to do this locally, in R or equivalent; but if you're playing along at work, your dataset might not be. So let's use Pig.
place_sightings = LOAD '...' AS (...);
features        = GROUP place_sightings BY feature_code;
feature_stats   = FOREACH features {
  sighted = FILTER place_sightings BY sighted;
  GENERATE group                       AS feature_code,
           COUNT(sighted)              AS sighted_count,
           COUNT_STAR(place_sightings) AS total_count
           ;
};
STORE feature_stats INTO '...';
Data Model
We'll represent geographic features in two different ways, depending on focus: If the geography is the focus (a set of features with data riding sidecar), use GeoJSON data structures.
If the object is the focus (among many interesting fields, some happen to have a position or other geographic context), use a natural Wukong model. If you're drawing on traditional GIS tools, use GeoJSON if possible; if not, use the legacy format it forces, and a lot of cursewords as you go.
GeoJSON
GeoJSON is a new but well-thought-out geodata format; here's a brief overview. The GeoJSON spec is about as readable as I've seen, so refer to it for anything deeper. The fundamental GeoJSON data structures are:
module GeoJson
  class Base ; include Wukong::Model ; end

  # classes are defined innermost-first so constant references resolve

  # lowest values then highest values: [left, btm, right, top]
  class BboxCoords < Array
    def left  ; self[0] ; end
    def btm   ; self[1] ; end
    def right ; self[2] ; end
    def top   ; self[3] ; end
  end

  class Geometry < Base
    field :type,        String
    field :coordinates, Array, doc: "for a 2-d point, the array is a single `(x,y)` pair; for other shapes, an array of such pairs (nested as the shape requires)"
  end

  class Feature < Base
    field :type,       String
    field :geometry,   Geometry
    field :properties, Hash
    field :bbox,       BboxCoords
  end

  class FeatureCollection < Base
    field :type,     String
    field :features, Array, of: Feature
    field :bbox,     BboxCoords
  end
end
GeoJSON specifies these orderings for features:

- Point: [longitude, latitude]
- Polygon: [ [[lng1,lat1],[lng2,lat2],...,[lngN,latN],[lng1,lat1]] ]; you must repeat the first point. The first array is the outer ring; other paths in the array are interior rings or holes (e.g. South Africa/Lesotho). For regions with multiple parts (US/Alaska/Hawaii) use a MultiPolygon.
- Bbox: [left, btm, right, top], i.e. [xmin, ymin, xmax, ymax]

An example hash, taken from the spec:
{ "type": "FeatureCollection", "features": [ { "type": "Feature", "properties": {"prop0": "value0"}, "geometry": {"type": "Point", "coordinates": [102.0, 0.5]} }, { "type": "Feature", "properties": { "prop0": "value0", "prop1": {"this": "that"} }, "bbox": [ "geometry": { "type": "Polygon", "coordinates": [ [ [-10.0, 0.0], [5.0, -1.0], [101.0, 1.0], [100.0, 1.0], [-10.0, 0.0] ] ] } } ] }
Quadtile Practicalities
Converting points to quadkeys (quadtile indexes)
Each grid cell is contained in its parent
The quadkey is a string of 2-bit tile selectors for a quadtile.

@example
  infochimps_hq = Geo::Place.receive("Infochimps HQ", -97.759003, 30.273884)
  infochimps_hq.quadkey(8) # "02313012"

First, some preliminaries:
EARTH_RADIUS      = 6371000 # meters
MIN_LONGITUDE     = -180
MAX_LONGITUDE     =  180
MIN_LATITUDE      = -85.05112878
MAX_LATITUDE      =  85.05112878
ALLOWED_LONGITUDE = (MIN_LONGITUDE..MAX_LONGITUDE)
ALLOWED_LATITUDE  = (MIN_LATITUDE..MAX_LATITUDE)
TILE_PIXEL_SIZE   = 256
The maximum latitude this projection covers is plus or minus 85.05112878 degrees. With apologies to the elves of chapter (TODO: ref), this is still well north of Alert, Canada, the northernmost populated place in the world (latitude 82.5 degrees, 817 km from the North Pole).

It's straightforward to calculate tile_x indices from the longitude (because all the brutality is taken up in the Mercator projection's severe distortion). Finding the Y tile index requires a slightly more complicated formula. This makes each grid cell an increasingly better locally-flat approximation to the earth's surface, palliating the geographer's anger at our clumsy map projection. In code:
# Convert longitude, latitude in degrees to integer tile x, y coordinates at
# the given zoom level
def lng_lat_zl_to_tile_xy(longitude, latitude, zl)
  tile_size = map_tile_size(zl)
  xx      = (longitude.to_f + 180.0) / 360.0
  sin_lat = Math.sin(latitude.to_radians)
  yy      = Math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * Math::PI)
  [ (tile_size * xx).floor, (tile_size * (0.5 - yy)).floor ]
end

# Convert from tile_x, tile_y, zoom level to longitude and latitude in
# degrees (slight loss of precision).
#
# Tile coordinates may be floats or integer; they must lie within map range.
def tile_xy_zl_to_lng_lat(tile_x, tile_y, zl)
  tile_size = map_tile_size(zl)
  raise ArgumentError, "tile index must be within bounds ((#{tile_x},#{tile_y}) vs #{tile_size})" unless (0..tile_size).include?(tile_x.to_f) && (0..tile_size).include?(tile_y.to_f)
  xx  = (tile_x.to_f / tile_size)
  yy  = 0.5 - (tile_y.to_f / tile_size)
  lng = 360.0 * xx - 180.0
  lat = 90 - 360 * Math.atan(Math.exp(-yy * 2 * Math::PI)) / Math::PI
  [lng, lat]
end
Take care to put coordinates in the order longitude, latitude, maintaining consistency with the (X, Y) convention for regular points. Natural English idiom switches their order, a pernicious source of error, but the convention in geographic systems is unambiguously to use x, y, z ordering. Also, don't abbreviate longitude as long; it's a keyword in Pig and other languages. I like lng.
Exploration
Exemplars Tokyo
"02313012"
You can also form a packed quadkey: the integer formed by interleaving the bits as shown above. At zoom level 15, the packed quadkey is a 30-bit unsigned integer, meaning you can store it in a Pig int; for languages with an unsigned int type, you can go to zoom level 16 before you have to use a less-efficient type. Zoom level 15 has a resolution of about one tile per kilometer (about 1.25 km/tile near the equator; 0.75 km/tile at London's latitude). It takes 1 billion tiles to tile the world at that scale.

Packed quadkeys have two other nice properties: a limited number of range scans suffice to cover any given area, and each grid cell's parent is a 2-place bit shift of the grid index itself.

A 64-bit quadkey, corresponding to zoom level 32, has an accuracy of better than 1 cm over the entire globe. In some intensive database installs, rather than storing longitude and latitude separately as floating-point numbers, consider storing either the interleaved packed quadkey or the individual 32-bit tile IDs as your indexed value. The performance impact for Hadoop is probably not worth it, but for a database schema it may be.
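A sketch of forming that packed quadkey by interleaving the tile bits, most significant bits first; each base-4 digit is 2*y_bit + x_bit, matching the BIT_TO_QUADKEY table below.

def tile_xy_zl_to_packed_quadkey(tile_x, tile_y, zl)
  packed = 0
  (zl - 1).downto(0) do |bit|
    packed = (packed << 2) | (((tile_y >> bit) & 1) << 1) | ((tile_x >> bit) & 1)
  end
  packed
end

tile_xy_zl_to_packed_quadkey(58, 105, 8)  # => 11718, i.e. "02313012" read as a base-4 number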
5. Briefly featured in the Clash's "Rock the Casbah" video, and where much of this book was written.
# converts from even/odd state of tile x and tile y to quadkey. NOTE: bit order means y, x
BIT_TO_QUADKEY = { [false, false] => "0", [false, true] => "1", [true, false] => "2", [true, true] => "3" }
# converts from quadkey char to bits. NOTE: bit order means y, x
QUADKEY_TO_BIT = { "0" => [0,0], "1" => [0,1], "2" => [1,0], "3" => [1,1] }

# Convert from tile x,y into a quadkey at a specified zoom level
def tile_xy_zl_to_quadkey(tile_x, tile_y, zl)
  quadkey_chars = []
  tx = tile_x.to_i
  ty = tile_y.to_i
  zl.times do
    quadkey_chars.push BIT_TO_QUADKEY[[ty.odd?, tx.odd?]] # bit order y, x
    tx >>= 1 ; ty >>= 1
  end
  quadkey_chars.join.reverse
end
# Convert a quadkey into tile x,y coordinates and zoom level
def quadkey_to_tile_xy_zl(quadkey)
  raise ArgumentError, "Quadkey must contain only the characters 0, 1, 2 or 3: #{quadkey}!" unless quadkey.to_s =~ /\A[0-3]*\z/
  zl = quadkey.to_s.length
  tx = 0 ; ty = 0
  quadkey.to_s.chars.each do |char|
    ybit, xbit = QUADKEY_TO_BIT[char] # bit order y, x
    tx = (tx << 1) + xbit
    ty = (ty << 1) + ybit
  end
  [tx, ty, zl]
end
Though quadtile properties do vary, the variance is modest within most of the inhabited world:
The (ref table) gives the full coordinates at every zoom level for our exemplar set.
When points cross major tile boundaries, the result is less pretty. Austin's airport (quad 0231301212221213) shares only the zoom-level 8 tile 02313012:
Calculating Distances
To find the distance between two points on the globe, we use the Haversine formula in code:
# Return the haversine distance in meters between two points
def haversine_distance(left, top, right, btm)
  delta_lng = (right - left).abs.to_radians
  delta_lat = (btm   - top ).abs.to_radians
  top_rad   = top.to_radians
  btm_rad   = btm.to_radians

  aa = (Math.sin(delta_lat / 2.0))**2 +
       Math.cos(top_rad) * Math.cos(btm_rad) * (Math.sin(delta_lng / 2.0))**2
  cc = 2.0 * Math.atan2(Math.sqrt(aa), Math.sqrt(1.0 - aa))
  cc * EARTH_RADIUS
end
# Return the haversine midpoint between two points
def haversine_midpoint(left, top, right, btm)
  cos_btm   = Math.cos(btm.to_radians)
  cos_top   = Math.cos(top.to_radians)
  bearing_x = cos_btm * Math.cos((right - left).to_radians)
  bearing_y = cos_btm * Math.sin((right - left).to_radians)
  mid_lat   = Math.atan2(
    (Math.sin(top.to_radians) + Math.sin(btm.to_radians)),
    (Math.sqrt((cos_top + bearing_x)**2 + bearing_y**2)))
  mid_lng   = left.to_radians + Math.atan2(bearing_y, (cos_top + bearing_x))
  [mid_lng.to_degrees, mid_lat.to_degrees]
end

# From a given point, calculate the point directly north a specified distance
def point_north(longitude, latitude, distance)
  north_lat = (latitude.to_radians + (distance.to_f / EARTH_RADIUS)).to_degrees
  [longitude, north_lat]
end

# From a given point, calculate the change in degrees directly east a given distance
def point_east(longitude, latitude, distance)
  radius   = EARTH_RADIUS * Math.sin(((Math::PI / 2.0) - latitude.to_radians.abs))
  east_lng = (longitude.to_radians + (distance.to_f / radius)).to_degrees
  [east_lng, latitude]
end
The world is a big place, but we don't use all of it the same way. Most of the world is water. Lots of it is Siberia. Half the tiles at zoom level 2 have only a few thousand inhabitants.6 Suppose you wanted to store a "what country am I in" dataset: a geo-joinable decomposition of the region boundaries of every country. You'll immediately note that Monaco fits easily within one zoom-level 12 quadtile, while Russia spans two zoom-level 1 quadtiles. Without multiscaling, covering the globe at 1-km scale with 64-kB records would take 70 terabytes, and 1-km is not all that satisfactory. Huge parts of the world would be taken up by grid cells holding no border that simply said "Yep, still in Russia."

There's a simple modification of the grid system that lets us very naturally describe multiscale data. The figures (REF: multiscale images) show the quadtiles covering Japan at ZL=7. For reasons you'll see in a bit, we will split everything up to at least that zoom level; we'll show the further decomposition down to ZL=9.
Already six of the 16 tiles shown don't have any land coverage, so you can record their values:
1330000xx  { Pacific Ocean }
1330011xx  { Pacific Ocean }
1330013xx  { Pacific Ocean }
1330031xx  { Pacific Ocean }
1330033xx  { Pacific Ocean }
1330032xx  { Pacific Ocean }
Pad out each of the keys with x's to meet our lower limit of ZL=9. The quadkey 1330011xx means "I carry the information for grids 133001100, 133001101, 133001110, 133001111, ...".
You should uniformly decompose everything to some upper zoom level, so that if you join on something uniformly distributed across the globe you don't have cripplingly large skew in the data size sent to each partition. A zoom level of 7 implies about 16,000 tiles, a small quantity given the exponential growth of tile counts. With the upper range as your partition key, and the whole quadkey as the sort key, you can now do joins. In the reducer, as sketched below:

- read keys from each side until one key is equal to, or a prefix of, the other
- emit a combined record using the more specific of the two keys
- read the next record from the more-specific side, until there's no overlap

Take each grid cell: if it needs subfeatures, divide it; else emit it directly.
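A sketch of that prefix-match merge, assuming each side of the join arrives as a sorted array of [quadkey, record] pairs for one partition:

def multiscale_merge(side_a, side_b)
  joined = []
  until side_a.empty? || side_b.empty?
    key_a, rec_a = side_a.first
    key_b, rec_b = side_b.first
    if key_a.start_with?(key_b) || key_b.start_with?(key_a)
      # emit the combined record under the more specific (longer) key ...
      joined << [[key_a, key_b].max_by(&:length), rec_a, rec_b]
      # ... then advance the more-specific side, so the coarser cell can still
      # match that side's sibling tiles
      (key_a.length >= key_b.length) ? side_a.shift : side_b.shift
    elsif key_a < key_b
      side_a.shift # no partner for this key; move on
    else
      side_b.shift
    end
  end
  joined
end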
You must emit high-level grid cells with the least-significant digits filled with xx, or something else that sorts after a normal cell. This means that to find the value for a point:

- find the corresponding tile ID
- index into the table to find the first tile whose ID is larger than the given one
00.00.00
00.00.01
00.00.10
00.00.11
00.01.--
00.10.--
00.11.00
00.11.01
00.11.10
00.11.11
01.--.--
10.00.--
10.01.--
10.10.01
10.10.10
10.10.11
10.10.00
10.11.--
(Figure: polygons A, B, C and D overlaid on the quadtile grid; the grid columns are labeled 000x, 001x and 100x, and individual tiles such as 0100, 0110, 1000, 1001 and 1100 are marked.)
Tile 0000: [A, B, C   ]
Tile 0001: [   B, C   ]
Tile 0010: [A, B, C, D]
Tile 0011: [   B, C, D]
Tile 0100: [      C   ]
Tile 0110: [      C, D]
Tile 1000: [A,       D]
Tile 1001: [         D]
Tile 1100: [         D]

For each grid, also calculate the area each polygon covers within that grid. Pivot:

A: [ 0000 0010 1000 ]
B: [ 0000 0001 0010 0011 ]
C: [ 0000 0001 0010 0011 0100 0110 ]
the edge in either direction, I will remain equidistant from each point. Extending these lines defines the Voronoi diagram a set of polygons, one per point, enclosing the area closer to that point than any other. <remark>TODO: above paragraph not very clear, may not be necessary.</remark>
The polygons partition the space: every piece of the plane belongs to exactly one polygon. There is exactly one polygon for each seed location, and all points within it are closer to that seed location than to any other seed location. All points on the boundary of two polygons are equidistant from the two neighboring seed locations; and all vertices where Voronoi polygons meet are equidistant from the respective seed locations.

This effectively precomputes the "nearest X" problem: for any point in question, find the unique polygon within which it resides (or, rarely, the polygon boundaries upon which it lies). Breaking those polygons up by quadtile at a suitable zoom level makes it easy either to store them in HBase (or equivalent) for fast querying, or as data files optimized for a spatial JOIN. It also presents a solution to the spatial sampling problem, by assigning the measurements taken at each sample location to its Voronoi region. You can use these piecewise regions directly, or follow up with some sort of spatial smoothing as your application requires. Let's dive in and see how to do this in practice.
Pig does not have any built-in geospatial features, so we will have to use a UDF. In fact, we will reach into the future and use one of the ones you will learn about in the Advanced Pig chapter (TODO: REF). Here is the script, in outline:
1. Register the UDF
2. Give it an alias
3. Load the polygons file
4. Turn each polygon into a bag of (quadkey, polygon, metadata) tuples
5. Group by quadkey
6. FOREACH, generate the output data structure
7. Store results
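A hedged sketch of such a script; the UDF name, aliases and paths here are stand-ins for the ones in the example code:

-- names below are hypothetical placeholders, not the real example-code ones
REGISTER 'geo-udfs.jar';
DEFINE PolygonToQuadtiles com.example.geo.PolygonToQuadtiles();

polygons  = LOAD '/data/voronoi/polygons.json' AS (region_id:chararray, geo_json:chararray);
-- one (quadkey, polygon shard) tuple per quadtile the polygon covers
tiles     = FOREACH polygons GENERATE region_id, FLATTEN(PolygonToQuadtiles(geo_json)) AS (quadkey:chararray, shard:chararray);
by_tile   = GROUP tiles BY quadkey;
tile_recs = FOREACH by_tile GENERATE group AS quadkey, tiles.(region_id, shard);
STORE tile_recs INTO '/data/voronoi/quadtile_regions';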
Transfer the output of the Voronoi script onto the HDFS and run the above Pig script. Its output is a set of TSV files in which the first column is a quadkey and the second column is a set of regions in GeoJSON format. We will not go into the details, but the example code shows how to use this to power the "nearest X" application. Follow the instructions to load the data into HBase and start the application.

The application makes two types of requests. One is to determine which polygon is the nearest; it takes the input coordinates and uses the corresponding quadtile to retrieve the relevant regions. It then calls into a geo library to determine which polygon contains the point, and sends a response containing the GeoJSON polygon. The application also answers direct requests for a quadtile with the GeoJSON stored in its database: exactly what is required to power the drivable slippy-map widget used on the page. This makes the front-end code simple, light and fast, enough that mobile devices will have no trouble rendering it. If you inspect the Javascript file, in fact, it is simply the slippy-maps example, with the only customization being the additional query for the region of interest. It uses the server's response to simply modify the style-sheet rule for that portion of the map.

The same data-locality advantages that the quadkey scheme grants are perhaps even more valuable in a database context, especially ones like HBase that store data in sorted form. We are not expecting an epic storm of viral interest in this little app, but you might be for the applications you write. The very thing that makes such a flood difficult to manage, the long-tail nature of the requests, makes caching a suitable remedy. You will get a lot more repeated requests for downtown San Francisco than you will for downtown Cheboygan, so those rows will always be hot in memory. Since those points of interest lie within compact spatial regions, they also lie within not many more quadkey regions, so the number of database blocks contending for cache space is very much smaller than the number of popular quadkeys.

It addresses the short-tail caching problem as well. When word does spread to Cheboygan and the quadtile for its downtown is loaded, you can be confident requests for nearby tiles, driven by the slippy map, will follow as well. Even if those rows are not loaded within the same database block, the quadkey helps the operating system pick up the slack: since this access pattern is so common, when a read causes the OS to go
all the way to disk, it optimistically pre-fetches not just the data you requested but a bit of what follows. When the database gets around to loading a nearby database block, there is a good chance the OS will have already buffered its contents. The strategies employed here (precalculating all possible requests, identifying the nature of popular requests, identifying the nature of adjacent requests, and organizing the key space to support that adjacency) will let your database serve large amounts of data with millisecond response times even under heavy load.

Sidebar: Choosing A Decomposition Zoom Level
When you are decomposing spatial data onto quadtiles, you will face the question of what zoom level to use.

To cover the entire globe at zoom level 13 requires 67 million records, each covering about four kilometers on a side.

If the preceding considerations leave you with a range of acceptable zoom levels, choose one in the middle of that range.
consensus value; this will render wonderfully as a heat map of values and, since each record corresponds to a full quad cell, will be usable directly by downstream analytics or applications without requiring a geospatial library. Consulting the quadkey grid size cheat sheet (TODO: REF), zoom level 12 implies 17 million total grid cells that are about five to six miles on a side in populated latitudes, which seems reasonable for the domain.

As such, though, it is not reasonable for the database. The dataset has reasonably global coverage going back at least 50 years, or nearly half a million hours. Storing 1 KB of weather data per hour at zoom level 12 over that stretch will take about 7.5 PB, but the overwhelming majority of those quad cells are boring. As mentioned, weather stations are sparse over huge portions of the earth. The density of measurements covering much of the Atlantic Ocean would be well served by zoom level 7; at that grid coarseness, 50 years of weather data occupies a mere 7 TB; isn't it nice to be able to say "a mere 7 TB"?

What we can do is use a multiscale grid. We will start with a coarsest-grain zoom level to partition; 7 sounds good. In the Reducers (that is, after the group), we will decompose down to zoom level 12, but stop early if a region is completely covered by a single polygon. Run the multiscale decompose script (TODO: demonstrate it). The results are as you would hope for; even the most recent year's map requires only X entries, and the full dataset should require only X TB.

The stunningly clever key to the multiscale JOIN is, well, the keys. As you recall, the prefixes of a quadkey (shortening it from right to left) give the quadkeys of each containing quadtile. The multiscale trick is to serialize quadkeys at the fixed length of the finest zoom level, but where you stop early, to fill in with a '.' character, because it sorts lexicographically earlier than the numerals do. This means that the lexicographic sort order Hadoop applies in the midstream group-sort still has the correct spatial ordering, just as Zorro would have it.

Now it is time to recall how a JOIN works, covered back in the Map/Reduce Patterns chapter (TODO: REF). The coarsest Reduce key is the JOIN value, while the secondary sort key is the name of the dataset. Ordinarily, for a two-way join on a key like 012012, the Reducer would buffer in all rows of the form <012012 | A | >, then apply the join to each row of the form <012012 | B | >. All rows involved in the join would have the same join key value. For a multiscale spatial join, you would like rows in the two datasets to be matched whenever one is the same as, or a prefix of, the other. A key of 012012 in B should be joined against a key of 0120.., 01201. and 012012, but not, of course, against 013.

We can accomplish this fairly straightforwardly. When we defined the multiscale decomposition, we chose a coarsest zoom level at which to begin decomposing and a finest zoom level which defined the total length of the quadkey. What we do is break the quadkey into two pieces: the prefix at the coarsest zoom level (these will always have numbers, never dots) and the remainder (fixed length, with some number of quadkey digits then
some number of dots). We use the quadkey prefix as the partition key, with a secondary sort on the quadkey remainder and then the dataset label. Explaining this will be easier with some concrete values, so let's say we are doing a multiscale join between two datasets, partitioning on a coarsest zoom level of 4 and a total quadkey length of 6, leading to the following snippet of raw reducer input.

Snippet of Raw Reducer Input for a Multiscale Spatial Join:
0120  1.  A
0120  10  B
0120  11  B
0120  12  B
0120  13  B
0120  2.  A
0120  30  B
0121  00  A
0121  00  B
As before, the reducer buffers in rows from A for a given key; in our example, the first of these looks like <0120 | 1. | A | >. It will then apply the join to each row that follows of the form <0120 | (ANYTHING) | B | >. In this case, the 01201. record from A will be joined against the 012010, 012011, 012012 and 012013 records from B. Watch carefully what happens next, though. The following line, for quadkey 01202., is from A, and so the Reducer clears the JOIN buffer and gets ready to accept records from B to join with it. As it turns out, though, there is no record from B of the form 01202-anything. In this case, the 01202. key from A matches nothing in B, and the 012030 key in B is matched by nothing in A (this is why it is important that the replacement character is lexicographically earlier than the digits; otherwise, you would have to read past all your brothers to find out if you have a parent). The behavior is the same as that for a regular JOIN in all respects but one: JOIN keys are considered to be equal whenever their digit portions match.

The payoff for all this is pretty sweet. We only have to store, and we only have to ship and group-sort, data down to the level at which it remains interesting in either dataset. (TODO: do we get to be multiscale in both datasets?) When the two datasets meet in the Reducer, the natural outcome is as if they were broken down to the mutually required resolution. The output is also efficiently multiscale.
The multiscale keys work very well in HBase, too. For the case where you are storing multiscale regions and querying on points, you will want to use a replacement character that is lexicographically after the digits; say, the letter x. To find the record for a given point, do a range request for one record on the interval starting with that point's quadkey and extending to infinity (xxxxx). For a point with the finest-grain quadkey of 012012, if the database has a record for 012012, that will turn up; if, instead, that region only required zoom level 4, the appropriate row (0120xx) would be correctly returned.
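A sketch of the two padding schemes, using illustrative zoom levels:

FINEST_ZL = 6

# '.' sorts before the digits 0-3: right for Hadoop's midstream group-sort
def multiscale_key_for_sort(quadkey, zl = FINEST_ZL)  ; quadkey.ljust(zl, '.') ; end
# 'x' sorts after the digits 0-3: right for HBase point lookups on regions
def multiscale_key_for_hbase(quadkey, zl = FINEST_ZL) ; quadkey.ljust(zl, 'x') ; end

multiscale_key_for_sort('0120')   # => "0120.." (sorts before "012010")
multiscale_key_for_hbase('0120')  # => "0120xx" (found by a one-row range scan starting at "012012")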
Baseball enthusiasts are wonderfully obsessive, so it was easy to find online data listing the geographic location of every single baseball stadium: the file sports/baseball/stadium_geolocations.tsv lists each Retrosheet stadium ID followed by its coordinates and zoom-level 12 quadkey. Joining that on the Retrosheet game logs equips each game log record with the same quadkey and hour keys used in the smoothed weather dataset. (Since the data is so small, we turned parallelism down to 1.)

Next, we will join against the weather data; this data is so large, it is worth making a few optimizations. First, we will apply the guideline of joining against the smallest amount of data possible. There are fewer than a hundred quadkeys we are interested in over the whole time period of interest, and the quadkey breakdown only changes year by year, so rather than doing a multiscale join against the full hourly record, we will use the index that gives the quadkey breakdown per year to find the specific containing quadkeys for each stadium over time. For example (TODO: find an example where a quadkey was at a higher zoom level one year and a lower one a different year). Doing the multiscale join of stadium quadkeys against the weather quadkey-year gives (TODO: name of file). Having done the multiscale join against the simpler index, we can proceed using the results as direct keys; no more multiscale magic is required.

Now that we know the specific quadkeys and hours, we need to extract the relevant weather records. We will describe two ways of doing this. The straightforward way is with a join, in this case of the massive weather quadtile data against the relatively tiny set of quadkey-hours we are interested in. Since we do not need multiscale matching any more, we can use Pig, and Pig provides a specialized join for the specific case of joining a tiny dataset to a massive one, called the replicated join. You can skip ahead to the Advanced Pig chapter (TODO: REF) to learn more about it; for now, all you need to know is that you should put the words USING 'replicated' at the end of the line, and that the smallest dataset should be on the right. (Yes, it's backwards: for replicated joins the smallest should be on the right, while for regular joins it should be on the left.) This type of join loads the small dataset into memory and simply streams through the larger dataset, so no Reduce is necessary. It's always a good thing when you can avoid streaming terabytes of data through the network card when all you want are a few MB.

In this case, there are a few thousand lines in the small dataset, so it is reasonable to do it the honest way, as just described. In the case where you are just trying to extract a few dozen keys, your authors have been known to cheat by inlining the keys in a filter. Regular expression engines are much faster than most people realize, and are perfectly content to accept patterns with even a few hundred alternations. An alternative approach here is to take the set of candidate keys, staple them together into a single ludicrous regexp, and template it into the Pig script you will run.

Cheat to Win: Filtering down to only joinable keys using a regexp.
huge_data     = LOAD '...' AS (f1, f2, f3);
filtered_data = FILTER huge_data BY (f1 MATCHES '^(012012|013000|020111|[...dozens more...])$');
STORE filtered_data INTO '...';
Results
With just the relevant records extracted, we can compare the score sheet data with the weather data. Our script lists output columns for the NCDC weather and wind speed, the score sheet weather and wind speed, the distance from the stadium to the relevant weather station, and the percentage difference for wind speed and temperature.

It would be an easy mistake, at this point, to simply evict the Retrosheet measurements and replace them with the NCDC measurements; we would not argue for doing so. First, the weather does vary, so there is some danger in privileging the measurement at a weather station some distance away (even if more precise) over a direct measurement at the correct place and time. In fact, we have far better historical coverage of the baseball data than the weather data. The weather data we just prepared gives a best-effort estimate of the weather at every quadtile, leaving it in your hands to decide whether to accept a reading from a weather station dozens or hundreds of miles away. Rather, the philosophically sound action would be to flag values for which the two datasets disagree as likely outliers.

The successful endpoint of most Big Data explorations is a transition to traditional statistical packages and elbow grease; it shows you've found domain patterns worth exploring. If this were a book about baseball or forensic econometrics, we'd carry forward comparing those outliers with local trends, digging up original entries, and so forth. Instead, we'll just label them with a scarlet "O" for outlier, drop the mic and walk off stage.
Keep Exploring
Balanced Quadtiles
Earlier, we described how quadtiles define a tree structure, where each branch of the tree divides the plane exactly in half and leaf nodes hold features. The multiscale scheme handles skewed distributions by developing each branch only to a certain depth. Splits are even, but the tree is lopsided (think of the many finer zoom levels you needed for New York City compared to Irkutsk).

K-D trees are another approach. The rough idea: rather than blindly splitting in half by area, split the plane so that each half holds the same-ish number of points. It's more complicated, but it leads to a balanced tree while still accommodating highly skewed distributions. Jacob Perkins (@thedatachef) has a great post about K-D trees with further links.
Exercises
These exercises extend the chapter's core ideas (spatial keys, locality, and multiscale decomposition) to new settings.

Exercise 1: Extend quadtile mapping to three dimensions
To jointly model the network and spatial relationships of neurons in the brain, you will need to use not two but three spatial dimensions. Write code to map positions within a 200-mm-per-side cube to an octcube index analogous to the quadtile scheme. How large (in mm) is each cube using 30-bit keys? Using 63-bit keys? For even higher dimensions of fun, extend the Voronoi diagram to three dimensions.

Exercise 2: Locality
We've seen a few ways to map feature data to joinable datasets. Describe how you'd join each possible pair of datasets from this list (along with the story it would tell):
- Census data: dozens of variables, each attached to a census tract ID, along with a region polygon for each census tract.
- Cell phone antenna locations: cell towers are spread unevenly, and have a maximum range that varies by type of antenna. Case 1: you want to match locations to the single nearest antenna, if any is within range. Case 2: you want to match locations to all antennae within range.
- Wikipedia pages having geolocations.
- Disease reporting: 60,000 points distributed sparsely and unevenly around the country, each reporting the occurrence of a disease.
For example, joining disease reports against census data might expose correlations of outbreak with ethnicity or economic status. I would prepare the census regions as quadtile-split polygons. Next, map each disease report to the right quadtile and, in the reducer, identify the census region it lies within. Finally, join on the tract ID-to-census record table.

Exercise 3: Write a generic utility to do multiscale smoothing
Its input is a uniform sampling of values: a value for every grid cell at some zoom level. However, lots of those values are similar. Combine all grid cells whose values lie within a certain tolerance into a single multiscale cell. Example: merge all cells whose contents lie within 10% of each other.
(Example grids for Exercise 3: a 4x4 grid of cell values, keyed 00 through 33, shown before and after merging cells whose values lie within the tolerance.)
Refs
- http://kartoweb.itc.nl/geometrics/Introduction/introduction.html: an excellent overview of projections, reference surfaces and other fundamentals of geospatial analysis
- http://msdn.microsoft.com/en-us/library/bb259689.aspx
- http://www.maptiler.org/google-maps-coordinates-tile-bounds-projection/
- http://wiki.openstreetmap.org/wiki/QuadTiles
- https://github.com/simplegeo/polymaps
- Scaling GIS Data in Non-relational Data Stores, by Mike Malone
- Voronoi Diagrams
- US County borders in GeoJSON
- Spatial references, coordinate systems, projections, datums, ellipsoids, by Morten Nielsen
- Public repository of geometry boundaries in text
- Making a map in D3; see also guidance on making your own geographic boundaries
Placeholder
CHAPTER 12
Data Munging
CHAPTER 13
Organizing Data
CHAPTER 14
Placeholder
CHAPTER 15
CHAPTER 16
lambda architecture
unified conceptual model
patterns for analytics
patterns for lambda architecture
Machine Learning
CHAPTER 17
Java API
CHAPTER 18
When this happens around the office, we sing this little dirge[^2]:
Life, sometimes, is Russian Novel. Is having unhappy marriage and much snow and little vodka. But when Russian Novel it is short, then quickly we finish and again is Sweet Valley High.
What we don't do is write a pure Hadoop-API Java program. In practice, those look like this:
HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICEABLE JOB AT PARSING PARAMS HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICEABLE JOB AT READING FILES COBBLED-TOGETHER CODE TO DESERIALIZE THE FILE, HANDLE SPLITS, ETC [ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
COBBLED-TOGETHER CODE THAT KINDA DOES WHAT PIG'S FLATTEN COMMAND DOES COBBLED-TOGETHER CODE THAT KINDA DOES WHAT PIG'S CROSS COMMAND DOES A SIMPLE COMBINER COPIED FROM TOM WHITE'S BOOK 1000 LINES OF CODE TO DO WHAT RUBY COULD IN THREE LINES OF CODE HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICEABLE JOB AT WRITING FILES UGLY BUT NECESSARY CODE TO GLUE THIS TO THE REST OF THE ECOSYSTEM
The surrounding code is ugly and boring; it will take more time, produce more bugs, and carry a higher maintenance burden than the important stuff. More importantly, the high-level framework provides an implementation far better than it's worth your time to recreate.[^3] Instead, we write
A SKELETON FOR A PIG UDF DEFINITION [ Your breathtakingly elegant, stunningly performant solution to a novel problem ] A PIG SCRIPT
Advanced Pig
CHAPTER 19
In general, wasting CPU to save network or disk bandwidth is a good idea. If you grew up watching a 386 try to produce a ZIP file, it probably seems counterintuitive that storing your data in compressed files not only saves disk space but also speeds up processing time. However, most Hadoop jobs are overwhelmingly throughput bound, so spending the extra CPU cycles to compress and decompress data is more than justified by the overall efficiency of streaming less data off the disk or network. The section on Hadoop Internals (TODO: REF) explains how to enable compression of data between Mapper and Reducer (which you should always do) and how to read or write compressed data (which has tradeoffs you need to understand). In the case of intermediate checkpoints within a multi-stage workflow, it almost always makes sense to use a light compression format such as LZO or Snappy. In Pig, if you set the pig.tmpfilecompression and pig.tmpfilecompression.codec configuration variables appropriately, it will do that for you.

There are a few other cases where you should invoke this rule. If you have a large or variable text string that you only need to compare for uniqueness (e.g., using URLs as keys), it is worth using a string digest function to shorten its length, as described in the chapter on Sketch Functions (TODO: REF). Regular expressions are much faster than you'd think, especially in Ruby. If you only need a small part of a string and it does not cost you in readability, it might be worth slicing out only the interesting part.

Use types efficiently. Always use a schema in Pig; no exceptions. In case you're wondering, it makes your script more efficient and catches errors early, but most of all, it shows respect to your colleagues or future self.

There are a couple of Pig-specific tradeoffs to highlight. Wherever possible, make your UDFs algebraic or at least Accumulator-like, as described in the section on UDFs (TODO: REF). If you use two FOREACH statements consecutively, Pig is often able to merge them, but not always. If you see Pig introduce an extra Map-side-only job where you didn't think one was necessary, it has probably failed to combine them. Always start with the more readable code, then decide if the problem is worth solving. Most importantly, be aware of Pig's specialized JOINs; these are important enough that they get their whole section below.

As you've seen, Pig is extremely sugar-free; more or less every structural operation corresponds to a unique Map/Reduce plan. In principle, a JOIN is simply a COGROUP with a FLATTEN, and a DISTINCT is a COGROUP with a projection of just the GROUP key. Pig offers those more specific operators because it is able to do them more efficiently. Watch for cases where you have unwittingly spelled these out explicitly.

Always remove records with a NULL group key before a JOIN; those records will never appear in the output of a JOIN statement, yet they are not eliminated until after they have been sent across the network. Even worse, since all these records share the same (worthless) key, they are all sent to the same Reducer, almost certainly causing a hotspot.
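For example, a sketch of dropping NULL keys before a join (relation and field names are hypothetical):

clean_logs = FILTER logs BY visitor_id IS NOT NULL;
joined     = JOIN clean_logs BY visitor_id, visitors BY visitor_id;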
If you are processing a lot of small files, Pig offers the ability to process many at once per Mapper, an extremely important optimization. Set the pig.splitCombination and pig.maxCombinedSplitSize options; if you're writing a custom loader, spend the extra effort to make it compatible with this optimization.

Do not use less or more parallelism than reasonable. We have talked repeatedly throughout the book about the dangers of hotspots: a few Reducers laboring to swallow the bulk of the dataset while their many comrades clock out early. Sometimes, however, your job's configuration might unwittingly recommend to Hadoop that it use only one, or too few, Reducers. In this case, the Job Tracker would show only a few heavyweight Reduce tasks running; the other Reduce slots are sitting idle because nothing has been asked of them. Set the number of Reducers in Pig using the PARALLEL directive, and in Wukong using --REDUCE_TASKS=N (TODO: check spelling). It can also be wasteful to have too many Reducers. If your job has many Reducers uniformly processing only a few kB of records, the large fixed costs of launching and accounting for those attempts justify using the parallelism settings to limit the number of Reducers.
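For example (the value 50 is purely illustrative, and the relation names are hypothetical):

SET default_parallel 50;                                       -- script-wide default
sightings_by_place = GROUP sightings BY place_id PARALLEL 50;  -- per-statement override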
A Mapper-only JOIN works analogously. Every Mapper reads the small dataset into a lookup table: a hash map keyed by the JOIN key (this is why you will also see it referred to as a HashMap JOIN). Every Mapper loads the contents of the smaller dataset in full into its own local lookup table (which is why it is also known as a Replicated JOIN). The minor cost of replicating that dataset to every single Mapper is often a huge improvement in processing speed, by eliminating the entire Reduce stage. The constraint, however, is that the smaller dataset must fit entirely in RAM. The usher's task is manageable when there is one type of libretto for each of a dozen languages, but would be unmanageable if there were one type of libretto for each of several thousand home towns.

How much is too much? Watch for excessive GC activity. (TODO: Pig probably has a warning too; find out what it is.) Within the limits of available RAM, you can use fewer Mappers with more available RAM; the Hadoop tuning chapter (TODO: REF) shows you how. Don't be too aggressive, though; datasets have a habit of growing over time, and you would hate to spend Thanksgiving day reworking the jobs that process retail sales data because you realized they would not stand up to the Black Friday rush. There is a general principle here: it is obvious there is a class of problems which only crop up past a certain threshold of data. What may not be obvious, until you've learned it the hard way, is that the external circumstances most likely to produce that flood of extra data are also the circumstances that leave you least able to address the problem.
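A sketch of what this looks like in Pig; relation and field names are hypothetical, and note that the small dataset goes on the right:

big_logs   = LOAD '/data/logs'    AS (ip:chararray, url:chararray);
geo_lookup = LOAD '/data/geo.tsv' AS (ip:chararray, city:chararray);
joined     = JOIN big_logs BY ip, geo_lookup BY ip USING 'replicated';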
number of future jobs we typically spend the time to do a last pass total sort of the dataset against the most likely JOIN key. It is a nice convenience for future users of the dataset, helps in sanity checking and improves the odds that you will be able to use the more efficient MERGE/JOIN.
Exercises
1. Quoting Pig docs: > You will also see better performance if the data in the left table is partitioned evenly across part files (no significant skew and each part file contains at least one full block of data).
Why is this?
2. Each of the following snippets goes against the Pig documentation's recommendations in one clear way. Rewrite it according to best practices, and compare the run time of your improved script against the bad version shown here.
things like this from http://pig.apache.org/docs/r0.9.2/perf.html --
a. (fails to use a map-side join)
b. (joins large on small, when it should join small on large)
c. (many `FOREACH`es instead of one expanded-form `FOREACH`)
d. (expensive operation before LIMIT)
For each, use weather data on weather stations.
Hadoop Internals
CHAPTER 20
For 16 chapters now, we've been using Hadoop and Storm+Trident from the outside. The biggest key to writing efficient dataflows is to understand the interface and fundamental patterns of use, not the particulars of how these frameworks execute the dataflow. However, even the strongest abstractions, pushed far enough, can leak, and so at some point it's important to understand these internal details. The next few chapters concentrate on equipping you to understand your jobs' performance, with practical tips for improving it; if you're looking for more, by far the best coverage of this material is found in (TODO: Add links) Tom White's Hadoop: The Definitive Guide and Eric Sammer's Hadoop Operations. Let's first focus on the internals of Hadoop.
When a client wishes to create a file, it contacts the NameNode with the desired path and high-level metadata. The NameNode records that information in an internal table and identifies the DataNodes that will hold the data. The NameNode then replies with that set of DataNodes, identifying one of them as the initial point of contact. (When we say "client", that's anything accessing the NameNode, whether it's a Map/Reduce job, one of the Hadoop filesystem commands, or any other program.) The file is now exclusively available to the client for writing, but will remain invisible to anybody else until the write has concluded (TODO: Is it when the first block completes or when the initial write completes?). Within the client's request, it may independently prescribe a replication factor, file permissions, and block size _ (TODO: fill in).

The client now connects to the indicated DataNode and begins sending data. At the point you've written a full block's worth of data, the DataNode transparently finalizes that block and begins another (TODO: check that it's the DataNode that does this). As it finishes each block, or at the end of the file, it independently prepares a checksum of that block, radioing it back to the NameNode, and begins replicating its contents to the other DataNodes. (TODO: Is there an essential end-of-file ritual?) This is all transparent to the client, who is able to send data as fast as it can cram it through the network pipe.

Once you've created a file, its blocks are immutable: as opposed to a traditional file system, there is no mechanism for modifying its internal contents. This is not a limitation; it's a feature. Making the file system immutable not only radically simplifies its implementation, it makes the system more predictable operationally and simplifies client access. For example, you can have multiple jobs and clients access the same file knowing that a client in California hasn't modified a block being read in Tokyo (or, even worse, simultaneously modified by someone in Berlin). (TODO: When does append become a thing?)

The end of the file means the end of its data but not the end of the story. At all times, the DataNode periodically reads a subset of its blocks to verify their checksums and sends a heartbeat back to the NameNode with the (hopefully) happy news. (TODO: fill in.) There are several reasons a NameNode will begin replicating a block. If a DataNode's heartbeat reports an incorrect block checksum, the NameNode will remove that DataNode from the list of replica holders for that block, triggering the block's replication from one of its remaining DataNodes. If the NameNode has not received a heartbeat from a given DataNode within the configured timeout, it will begin replicating all of that DataNode's blocks; if that DataNode comes back online, the NameNode calmly welcomes it back into the cluster, cancelling replication of the valid blocks that DataNode holds. Furthermore, if the amount of data on the most populated and least populated DataNodes differs by more than a certain threshold, or the replication factor for a file is increased, it will rebalance; you can optionally trigger one earlier using the hadoop balancer command.
However it's triggered, there is no real magic; one of the valid replica-holder DataNodes sends the block contents to the new replica holder, which heartbeats back the block once received. (TODO: Check details.)

As you can see, the NameNode and its metadata are at the heart of every HDFS story. This is why the new High-Availability (HA) NameNode feature in recent versions is so important and should be used in any production installation. It's even more important, as well, to protect and back up the NameNode's metadata, which, unfortunately, many people don't know to do. (TODO: Insert notes on NameNode metadata hygiene previously written.)

The NameNode selects the recipient DataNodes with some intelligence. Information travels faster among machines on the same switch, switches within the same data center, and so forth. However, if a switch fails, all its machines are unavailable, and if a data center fails, all its switches are unavailable, and so forth. When you configure Hadoop, there is a setting that allows you to tell it about your network hierarchy. If you do so, the NameNode will ensure the first replica lies within the most distant part of the hierarchy (durability is important above all else); all further replicas will be stored within the same rack, providing the most efficient replication. (TODO: Mention the phrase "rack aware".) As permitted within that scheme, it also tries to ensure that the cluster is balanced, preferring DataNodes with less data under management. (TODO: Check that it's amount of data, not percent free.) (TODO: Does it make contact on every block or at the start of a file?)

The most important feature of the HDFS is that it is highly resilient against failure of its underlying components. Any system achieves resiliency through four mechanisms: act to prevent failures; insulate against failure through redundancy; isolate failure within independent fault zones; and, lastly, detect and remediate failures rapidly. (FOOTNOTE: This list is due to James Hamilton, TODO: link, whose blog and papers are essential reading.) The HDFS is largely insulated from failure by using filesystem-based access (it does not go behind the back of the operating system) and by being open source (ensuring code is reviewed by thousands of eyes and run at extremely high scale by thousands of users), and so forth. Failure above the hardware level is virtually unheard of. The redundancy is provided by replicating the data across multiple DataNodes and, in recent versions, using the Zookeeper-backed High-Availability NameNode implementation. The rack awareness, described above, isolates failure using your defined network hierarchy, and, at the semantic level, independently for each HDFS block. Lastly, the heartbeat and checksum mechanisms, along with its active replication and monitoring hooks, allow it and its operators to detect and remediate intermediate faults.
S3 File System
The Amazon EC2 Cloud provides an important alternative to the HDFS: its S3 object store. S3 transparently provides multi-region replication far exceeding even that of HDFS, at exceptionally low cost (at time of writing, about $80 per terabyte per month, and decreasing at petabyte and higher scale). What's more, its archival datastore solution, Glacier, will hold rarely-accessed data at one-tenth that price and even higher durability. (FOOTNOTE: The quoted durability figure puts the engineering risk below, say, the risk of violent overthrow of our government). For machines in the Amazon Cloud with a provisioned connection, the throughput to and from S3 is quite acceptable for Map/Reduce use.

Hadoop has a built-in facade for the S3 file system, and you can do more or less all the things you do with an HDFS: list, put and get files; run Map/Reduce jobs to and from any combination of HDFS and S3; read and create files transparently using Hadoop's standard file system API. There are actually two facades. The s3hdfs facade (confusingly labeled as plain s3 by Hadoop, but we will refer to it here as s3hdfs) stores blocks in individual objects, using the same checksum format as on a DataNode, and stores the NameNode-like metadata separately in a reserved area. The s3n facade, instead, stores a file as it appears to the Hadoop client, entirely in an S3 object with a corresponding path. When you visit S3 using Amazon's console or any other standard S3 client, a file written via s3n as /my/data.txt appears as an object with that path in the corresponding bucket, and its contents are immediately available to any such client; the same file, written via s3hdfs, will appear in objects named for 64-bit identifiers like 0DA37f... and with uninterpretable contents. However, s3n cannot store an individual file larger than 5 terabytes. The s3hdfs blocks modestly improve Map/Reduce efficiency and can store files of arbitrary size. All in all, we prefer the s3n facade; the efficiency improvement for the robots does not make up for the impact on convenience for the humans using the system, and it is a best practice not to make individual files larger than 1 terabyte anyway.

The universal availability of client libraries makes S3 a great hand-off point to other systems or other people looking to use the data. We typically use a combination of S3, HDFS and Glacier in practice. Gold data (anything one project produces that another might use) is stored on S3. In production runs, jobs read their initial data from S3 and write their final data to S3, but use an HDFS local to all its compute nodes for any intermediate checkpoints. When developing a job, we run an initial distcp from S3 onto the HDFS and do all further development using the cluster-local HDFS. The cluster-local HDFS provides better (but not earth-shakingly better) Map/Reduce performance. It is, however, noticeably faster in interactive use (file system commands, launching jobs, etc). Applying the "robots are cheap, humans are important" rule easily justifies the maintenance of the cluster-local HDFS.
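Here is a hedged sketch of what "more or less all the things you do with an HDFS" looks like in code: the same FileSystem API, pointed at an s3n:// URI instead of an hdfs:// one. The bucket name and paths are hypothetical, and you will need your AWS credentials in the configuration (the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties) or embedded in the URI.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class S3FacadeSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // conf.set("fs.s3n.awsAccessKeyId", "..."); conf.set("fs.s3n.awsSecretAccessKey", "...");
        FileSystem s3   = FileSystem.get(new URI("s3n://example-gold-bucket/"), conf); // hypothetical bucket
        FileSystem hdfs = FileSystem.get(conf);

        // List the "gold" data sitting in S3
        for (FileStatus f : s3.listStatus(new Path("s3n://example-gold-bucket/github_events/"))) {
          System.out.println(f.getPath() + "\t" + f.getLen());
        }

        // Pull one file down to the cluster-local HDFS for development
        FileUtil.copy(s3,   new Path("s3n://example-gold-bucket/github_events/2012-10-31.json"),
                      hdfs, new Path("/data/scratch/github_events/2012-10-31.json"),
                      false /* don't delete source */, conf);
      }
    }

For anything larger than a handful of files, distcp (which runs the copy as a Map/Reduce job) is the better tool; the point here is only that the two stores share one API.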
If you use a cluster-local HDFS in the way described, that is, it holds no gold data, only development and checkpoint artifacts, ___ (TODO: fill in). Provision your HDFS to use EBS volumes, not the local (ephemeral) ones. EBS volumes surprisingly offer the same or better throughput as local ones and allow you to snapshot a volume in use, or even kill all the compute instances attached to those volumes and then reattach them to a later incarnation of the cluster. (FOOTNOTE: This does require careful coordination. Our open-source Ironfan framework has all the code required to do so.) Since the EBS volumes have significant internal redundancy, it then becomes safe to run a replication factor of 2 or even 1. For many jobs, the portion of the commit stage spent waiting for all DataNodes to acknowledge replication can become a sizable portion of the time it takes a Map/Reduce stage to complete. Do this only if you're an amateur with low stakes or a professional whose colleagues embrace these tradeoffs; nobody ever got fired for using a replication factor of 3.

As your S3 usage grows (certainly if you find you have more than, say, a dozen terabytes of data not in monthly use), it's worth marking that data for storage in Glacier, not S3 (you can only do this, of course, if you're using the s3n facade). There's a charge for migrating data and, of course, your time is valuable, but the savings can be enormous.
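Lowering the replication factor is a one-line configuration change. A hedged sketch of the two usual places to make it follows; dfs.replication is the Hadoop 1.x-era property name used throughout this book, and the paths are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ScratchClusterReplication {
      public static void main(String[] args) throws Exception {
        // Client-side default: files created through this configuration get two replicas.
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);
        FileSystem fs = FileSystem.get(conf);

        // Or dial down a file that is already on the scratch HDFS (setReplication applies per file):
        fs.setReplication(new Path("/data/scratch/checkpoints/part-00000"), (short) 2);
      }
    }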
Map-Reduce Internals
How the Mapper and DataNode handle record splitting, and how and when partial records are dispatched.
* The Mapper sort buffer and spilling to disk (maybe here or maybe later, io.sort.record.percent).
* Briefly note that data is not sent from Mapper to Reducer using HDFS, so you should pay attention to where you put the Map/Reduce scratch space and how stupid it is about handling an overflow volume.
* Briefly, that Combiners are a thing.
* Briefly, how records are partitioned to Reducers and that custom Partitioners are a thing.
* How the Reducer accepts and tracks its Mapper outputs.
* Details of the merge/sort (shuffle and sort), including the relevant buffers and flush policies and why it can skip the last merge phase. (NOTE: Secondary sort and so forth will have been described earlier.)
* Delivery of output data to the HDFS and commit, whether from Mapper or Reducer.
* Highlight the fragmentation problem with map-only jobs.
* Where memory is used: in particular, Mapper sort buffers, both kinds of Reducer merge buffers, and application-internal buffers.
* When using EBS volumes, beware of the commit and replication factor.
Hadoop Tuning
CHAPTER 21
One of the wonderful and terrible things about Hadoop (or anything else at Big Data scale) is that there are very few boundary cases for performance optimization. If your dataflow has the wrong shape, it is typically so catastrophically inefficient as to be unworkable. Otherwise, Hadoop's scalability makes the price of simply throwing more hardware at the problem competitive with investing to optimize it, especially for exploratory analytics. That's why the repeated recommendation of this book is: get the algorithm right, get the contextability right and size your cluster to your job.

When you begin to productionize the results of all your clever work, it becomes valuable to look for those 30-percent, 10-percent improvements. Debug loop time is also important, though, so it is useful to get a feel for when and why early optimization is justified.

The first part of this chapter discusses Tuning for the Wise and Lazy, showing you how to answer the question, "Is my job slow and should I do anything about it?" Next, we will discuss Tuning for the Brave and Foolish, diving into the details of Hadoop's maddeningly numerous, often twitchy, configuration knobs. There are low-level setting changes that can dramatically improve runtime; we will show you how to recognize them and how to determine what per-job overrides to apply. Lastly, we will discuss a formal approach for diagnosing the health and performance of your Hadoop cluster and your Hadoop jobs, to help you confidently and efficiently identify the source of deeper flaws or bottlenecks.
Some passages are harder than others, so it's important that any elephant can deliver page blocks to any chimpanzee; otherwise you'd have some chimps goofing off while others are stuck translating King Lear into Kinyarwanda. On the other hand, sending page blocks around arbitrarily will clog the hallways and exhaust the elephants. The elephants' chief librarian, Nanette, employs several tricks to avoid this congestion.

Since each chimpanzee typically shares a cubicle with an elephant, it's most convenient to hand a new page block across the desk rather than carry it down the hall. J.T. assigns tasks accordingly, using a manifest of page blocks he requests from Nanette. Together, they're able to make most tasks be local.

Second, the page blocks of each play are distributed all around the office, not stored in one book together. One elephant might have pages from Act I of Hamlet, Act II of The Tempest, and the first four scenes of Troilus and Cressida 1. Also, there are multiple replicas (typically three) of each book collectively on hand. So even if a chimp falls behind, J.T. can depend on some other colleague having a cubicle-local replica. (There's another benefit to having multiple copies: it ensures there's always a copy available. If one elephant is absent for the day, leaving her desk locked, Nanette will direct someone to make a xerox copy from either of the two other replicas.)

Nanette and J.T. exercise a bunch more savvy optimizations (like handing out the longest passages first, or having folks who finish early pitch in so everyone can go home at the same time, and more). There's no better demonstration of power through simplicity.
1. Does that sound complicated? It is. Nanette is able to keep track of all those blocks, but if she calls in sick, nobody can get anything done. You do NOT want Nanette to call in sick.
2. Instead, the putter process asks the NameNode to allocate a new data block. The NameNode designates a set of DataNodes (typically three), along with a permanently-unique block ID.
3. The putter process transfers the file over the network to the first DataNode in the set; that DataNode transfers its contents to the next one, and so forth. The putter doesn't consider its job done until a full set of replicas has acknowledged successful receipt.
4. As soon as each HDFS block fills, even if it is mid-record, it is closed; steps 2 and 3 are repeated for the next block.
Please keep in mind that the tasktracker does not run your code directly; it forks a separate process in a separate JVM with its own memory demands. The tasktracker rarely needs more than a few hundred megabytes of heap, and you should not see it consuming significant I/O or CPU.
* Streams data to the Mapper, either locally from disk or remotely over the network;
* Runs that data through your code;
* Spills the midstream data to disk one or more times;
* Applies Combiners, if any;
* Merge/sorts the spills and sends them over the network to the Reducer.

The Reducer:

* Writes each Mapper's output to disk;
* Performs some number of merge/sort passes;
* Reads the data from disk;
* Runs that data through your code;
* Writes its output to the DataNode, which writes that data once to disk and twice through the network;
* Receives two other Reducers' output from the network and writes those to disk.

Hadoop is, of course, pipelined; to every extent possible, those steps are happening at the same time. What we will do, then, is layer in these stages one by one, at each point validating that your job is as fast as your disk, until we hit the stage where it is not.
Fixed Overhead
The first thing we want is a job that does nothing; this will help us understand the fixed overhead costs. Actually, what we will run is a job that does almost nothing; it is useful to know that your test really did run.
(TODO: Disable combining splits)

    Load 10_tiny_files
    Filter most of it out
    Store to disk

(TODO: Restrict to 50 Mapper slots or rework)

    Load 10000_tiny_files
    Filter most of it out
    Store to disk
(TODO: is there a way to limit the number of Reduce slots in Pig? Otherwise, revisit the below.) In (TODO: REF), there is a performance comparison worksheet that you should copy and fill in as we go along. It lists the performance figures for several reference clusters on both cloud and dedicated environments for each of the tests we will perform. If your figures do not compare well with the appropriate reference cluster, it is probably
worthwhile adjusting the overall configuration. Assuming your results are acceptable, you can tape the worksheet to your wall and use it to baseline all the jobs you will write in the future. The rest of the chapter will assume that your cluster is large enough to warrant tuning but not grossly larger than the largest reference cluster.

If you run the Pig script above (TODO: REF), Hadoop will execute two jobs: one with 10 Mappers and no Reducers, and another with 10,000 Mappers and no Reducers. From the Hadoop Job Tracker page for your job, click on the link showing the number of Map tasks to see the full task listing. All 10 tasks for the first job should have started at the same time and uniformly finished a few seconds after that. Back on the main screen, you should see that the total job completion time was more or less identical to that of the slowest Map task. The second job ran its 10,000 Map tasks through a purposefully restricted 50 Mapper slots, so each Mapper slot will have processed around 200 files. If you click through to the task listing, the first wave of tasks should start simultaneously, and all of them should run in the same amount of time that the earlier test did. (TODO: show how to find out if one node is way slower)

Even in this trivial case, there is more variance in launch and runtimes than you might first suspect (if you don't see it here, you definitely will in the next example, but for continuity we will discuss it here). If that splay (the delay between the bulk of tasks finishing and the final task finishing) is larger than the runtime of a typical task, it may indicate a problem; but as long as it is only a few seconds, don't sweat it.

If you are interested in a minor but worth-it tweak, adjust the mapred.job.reuse.jvm.num.tasks setting to 10. This causes each Mapper slot to use the same child-process JVM for multiple attempts, minimizing the impact of the brief but noticeable JVM startup time. If you are writing your own native Java code, you might know a reason to force no reuse (the default), but it is generally harmless for any well-behaved program.

On the Job screen, you should see that the total runtime for the second job was about 200 times that of the first, and not much more than 200 times the typical task's runtime; if not, you may be putting pressure on the Job Tracker. Rerun your job and watch the Job Tracker's heap size; you would like the Job Tracker heap to spend most of its life below, say, 50 percent, so if you see it making any significant excursions toward 100 percent, that will unnecessarily impede cluster performance. The 1 GB out-of-the-box setting is fairly small; for production use we recommend at least 3 GB of heap on a dedicated machine with at least 7 GB total RAM. If the job coordination overhead is unacceptable but the Job Tracker heap is not to blame, a whole host of other factors might be involved; apply the USE method, described in (TODO: REF).
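Here is a hedged sketch of applying that tweak as a per-job override from Java, using the Hadoop 1.x-era property name the book uses elsewhere (from a Pig script you would accomplish the same thing with a set statement):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JvmReuseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 10);  // reuse each child JVM for up to 10 task attempts
        Job job = new Job(conf, "fixed-overhead-probe");    // hypothetical job name
        // ... set mapper, input and output paths as usual, then:
        // job.waitForCompletion(true);
      }
    }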
Mapper Input
Now that we've done almost nothing, let's do almost something: read in a large amount of data, writing just enough to disk to know that we really were there.
    Load 100 GB from disk
    Filter all but 100 MB
    Store it to disk
Run that job on the 100-GB GitHub archive dataset. (TODO: Check that it will do speculative execution.) Once the job completes, you will see as many successful Map tasks as there were HDFS blocks in the input; if you are running a 128-MB block size, this will be about (TODO: How many blocks are there?). Again, each Map task should complete in a uniform amount of time, and the job as a whole should take about length_of_Map_task * number_of_Map_tasks / number_of_Mapper_slots. The Map phase does not end until every Mapper task has completed and, as we saw in the previous example, even in typical cases there is some amount of splay in runtimes. (TODO: Move some of J.T. and Nanette's optimizations forward to this chapter). Like the chimpanzees at quitting time, the Map phase cannot finish until all Mapper tasks have completed.

You will probably notice a half-dozen or so killed attempts as well. The (TODO: name of speculative execution setting) setting, which we recommend enabling, causes Hadoop to opportunistically launch a few duplicate attempts for the last few tasks in a job. The faster job cycle time justifies the small amount of duplicate work.

Check that there are few non-local Map tasks: Hadoop tries to assign Map attempts (TODO: check tasks versus attempts) to run on a machine whose DataNode holds that input block, thus avoiding a trip across the network (or, in the chimpanzees' case, down the hallway). A non-local task is not that costly, but if you are seeing a large number of non-local tasks on a lightly-loaded cluster, dig deeper.

Dividing the size of an HDFS block by the average runtime of a full-block Map task gives you the Mapper's data rate. In this case, since we did almost nothing and wrote almost nothing, that value is your cluster's effective top speed. This has two implications. First, you cannot expect a data-intensive job to run faster than its top speed. Second, there should be apparent reasons for any job that runs much slower than its top speed. Tuning Hadoop is basically about making sure no other part of the system is slower than the fundamental limit at which it can stream from disk.

While setting up your cluster, it might be worth baselining Hadoop's top speed against the effective speed of your disk and your network. Follow the instructions for the scripts/baseline_performance script (TODO: write script) from the example code above. It uses a few dependable user-level processes to measure the effective data rate to disk
(dd and cp) and the effective network rate (nc and scp). (We have purposely used user-level processes to account for system overhead; if you want to validate that as well, use a benchmark like Bonnie++ (TODO: link)). If you are on dedicated hardware, the network throughput should be comfortably larger than the disk throughput. If you are on cloud machines, this, unfortunately, might not hold, but it should not be atrociously lower.

If the effective top speed you measured above is not within (TODO: figure out healthy percent) percent, dig deeper; otherwise, record each of these numbers on your performance comparison chart.

If you're setting up your cluster, take the time to generate enough additional data to keep your cluster fully saturated for 20 or more minutes, and then ensure that each machine processed about the same amount of data. There is a lot more variance in effective performance among machines than you might expect, especially in a public cloud environment; this can also catch a machine with faulty hardware or setup. This is a crude but effective benchmark, but if you're investing heavily in a cluster, consider running a comprehensive benchmarking suite on all the nodes; the chapter on Stupid Hadoop Tricks shows how (TODO: ref).
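The top-speed arithmetic is simple enough to keep on a sticky note; here it is spelled out as a hedged sketch, with the block size, task time and slot count as hypothetical figures you would read off your own Job Tracker.

    public class TopSpeedEstimate {
      public static void main(String[] args) {
        double blockMB        = 128.0;   // HDFS block size
        double avgMapTaskSecs = 40.0;    // average runtime of a full-block, do-almost-nothing map task
        int    numMapTasks    = 800;     // roughly one per input block
        int    mapperSlots    = 50;      // total map slots on the cluster

        double topSpeedMBperSec = blockMB / avgMapTaskSecs;                    // per-slot data rate
        double expectedJobSecs  = avgMapTaskSecs * numMapTasks / mapperSlots;  // waves of tasks through the slots

        System.out.printf("top speed ~ %.1f MB/s per slot (%.1f MB/s across the cluster)%n",
            topSpeedMBperSec, topSpeedMBperSec * mapperSlots);
        System.out.printf("expect the map phase to take ~ %.0f seconds%n", expectedJobSecs);
      }
    }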
For example, when you apply a highly-restrictive filter to a large input, your output files will have poor occupancy. A sneakier version of this is a slightly expansive Mapper-only job. A job whose Mappers turned a 128-MB block into, say, 150 MB of output data would reduce the block occupancy by nearly half and require nearly double the Mapper slots in the following jobs. Done once, that is merely annoying, but in a workflow that iterates or has many stages, the cascading dilution can become dangerous.

You can audit your HDFS to see if this is an issue using the hadoop fsck [directory] command. Running that command against the directory holding the GitHub data should show 100 GB of data in about 800 blocks. Running it against your last job's output should show only a few MB of data in an equivalent number of blocks.

You can always distill a set of files by doing a group_by with a small number of Reducers, using the record itself as a key. Pig and Hive both have settings to mitigate the many-small-files problem. In Pig, the (TODO: find name of option) setting will feed multiple small files to the same Mapper; in Hive, (TODO: look up what to do in Hive). In both cases, we recommend modifying your configuration to make that the default and disabling it on a per-job basis when warranted.
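If you would rather audit occupancy programmatically than eyeball fsck output, the FileSystem API gives you what you need. This is a hedged sketch with a hypothetical output directory; it only walks one directory level and approximates block counts from file lengths.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockOccupancyAudit {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/data/outputs/last_job");    // hypothetical
        long totalBytes = 0, totalCapacity = 0;
        for (FileStatus f : fs.listStatus(dir)) {
          if (f.isDir()) continue;                        // skip _logs and the like
          long blockSize = f.getBlockSize();
          long blocks = (f.getLen() + blockSize - 1) / blockSize;  // ceil(len / blockSize)
          totalBytes    += f.getLen();
          totalCapacity += blocks * blockSize;
        }
        System.out.printf("%d bytes across %d bytes of block capacity: occupancy %.1f%%%n",
            totalBytes, totalCapacity, 100.0 * totalBytes / Math.max(1, totalCapacity));
      }
    }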
Midstream Data
Now let's start to understand the performance of a proper Map/Reduce job. Run the following script, again against the 100 GB GitHub data.
    Parallel 50
    Disable optimizations for pushing up filters and for Combiners
    Load 100 GB of data
    Group by record itself
    Filter out almost everything
    Store data
The purpose of that job is to send 100 GB of data at full speed through the Mappers and midstream processing stages, but to do almost nothing in the Reducers and write almost nothing to disk. To keep Pig from helpfully economizing the amount of midstream data, you will notice that we disabled some of its optimizations in the script. The number of Map tasks and their runtime should be effectively the same as in the previous example, and all the sanity checks we've given so far should continue to apply. The overall runtime of the Map phase should be only slightly longer (TODO: how much is slightly?) than in the previous Map-only example, depending on how well your network is able to outpace your disk.

It is an excellent idea to get into the habit of predicting the record counts and data sizes in and out of both Mapper and Reducer, based on what you believe Hadoop will be doing to each record, and then comparing your predictions to what you see on the Job Tracker screen. In this case, you will see identical record counts for Mapper input, Mapper output and Reducer
input, and nearly identical data sizes for HDFS bytes read, Mapper output, Mapper file bytes written and Reducer input. The reason for the small discrepancies is that, for the file system metrics, Hadoop is recording everything that is read or written, including log files and so forth.

Midway or so through the job, well before the finish of the Map phase, you should see the Reducer tasks start up; their eagerness can be adjusted using the (TODO: name of setting) setting. By starting them early, the Reducers are able to begin merge/sorting the various Map task outputs in parallel with the Map phase. If you err low on this setting, you will disappoint your coworkers by consuming Reducer slots with lots of idle time early, but that is better than starting them too late, which will sabotage parallelism.

Visit the Reducer task listing. Each Reducer task should have taken a uniform amount of time, very much longer than the length of the Map tasks. Open a few of those tasks in separate browser tabs and look at their counters; each should have roughly the same input record count and data size. It is annoying that this information is buried as deeply as it is, because it is probably the single most important indicator of a flawed job; we will discuss it in detail a bit later on.
Spills
First, though, let's finish understanding the data's detailed journey from Mapper to Reducer. As a Map task outputs records, Hadoop sorts them in the fixed-size io.sort buffer, filing records into the buffer in partitioned, sorted order as it goes. When that buffer fills up (or the attempt completes), Hadoop begins writing to a new, empty io.sort buffer and, in parallel, spills the full one to disk. As the Map task concludes, Hadoop merge/sorts these spills (if there were more than one) and sends the sorted chunks to each Reducer for further merge/sorting.

The Job Tracker screen shows the number of Mapper spills. If the number of spills equals the number of Map tasks, all is good: the Mapper output is checkpointed to disk before being dispatched to the Reducer. If the size of your Map output data is large, having multiple spills is the natural outcome of using memory efficiently; that data was going to be merge/sorted anyway, so it is a sound idea to do it on the Map side, where you are confident it will have a uniform size distribution. (TODO: do Combiners show as multiple spills?)

What you hate to see, though, are Map tasks with two or three spills. As soon as you have more than one spill, the data has to be initially flushed to disk as output, then read back in full and written again in full for at least one merge/sort pass. Even the first extra spill can cause roughly a 30-percent increase in Map task runtime.

There are two frequent causes of unnecessary spills. First is the obvious one: Mapper output that slightly outgrows the io.sort buffer size. We recommend sizing the io.sort buffer to comfortably accommodate Map task output slightly larger than your typical
HDFS block size; the next section (TODO: REF) shows you how to calculate it. In the significant majority of jobs that involve a Reducer, the Mapper output is the same or nearly the same size as its input: JOINs or GROUPs that are direct, are preceded by a projection or filter, or have only a few additional derived fields. If you see many of your Map tasks tripping slightly over that limit, it is probably worth requesting a larger io.sort buffer specifically for your job.

There is also a disappointingly sillier way to cause unnecessary spills. The io.sort buffer holds both the records it will later spill to disk and an index to maintain the sorted order. An unfortunate early design decision set a fixed size on both of those, with fairly confusing control knobs. The io.sort.record.percent (TODO: check name of setting) setting gives the size of that index as a fraction of the sort buffer. Hadoop spills to disk when either the fraction devoted to records or the fraction devoted to the index becomes full. If your output is long and skinny, cumulatively not much more than an HDFS block but with a typical record size smaller than, say, 100 bytes, you will end up spilling multiple small chunks to disk when you could have easily afforded to increase the size of the bookkeeping buffer.

There are lots of ways to cause long, skinny output, but set a special trigger in your mind for cases where you have long, skinny input: turning an adjacency-listed graph into an edge-listed graph, or otherwise FLATTENing bags of records on the Mapper side. In each of these cases, the later section (TODO: REF) will show you how to calculate the right settings. (TODO: either here or later, talk about the surprising cases where you fill up the MapRed scratch space or fs.s3.buffer.dir and the rest of the considerations about where to put this).
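To make the "will it spill?" arithmetic concrete, here is a hedged back-of-envelope sketch. The 16-bytes-per-record figure is an approximation of the per-record index accounting described above, and every input is a hypothetical number you would pull from your own job's counters.

    public class SortBufferSizing {
      public static void main(String[] args) {
        double ioSortMB        = 160;    // io.sort.mb
        double recordPercent   = 0.05;   // io.sort.record.percent: fraction reserved for the index
        long   bytesPerRecAcct = 16;     // approximate accounting bytes per record in the index

        double dataCapacityMB  = ioSortMB * (1 - recordPercent);
        long   recordCapacity  = (long) (ioSortMB * 1024 * 1024 * recordPercent / bytesPerRecAcct);

        long   mapOutputRecords = 2_000_000;   // from the job counters
        double mapOutputMB      = 140;         // ditto

        boolean spillsOnData    = mapOutputMB      > dataCapacityMB;
        boolean spillsOnRecords = mapOutputRecords > recordCapacity;
        System.out.printf("data room: %.0f MB, record room: %d records%n", dataCapacityMB, recordCapacity);
        System.out.println(spillsOnRecords && !spillsOnData
            ? "long and skinny: raise io.sort.record.percent"
            : spillsOnData ? "raise io.sort.mb (or shrink map output)" : "fits in one spill");
      }
    }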
Combiners
It is a frequent case that the Reducer output is smaller than its input (and kind of annoying that the word Reducer was chosen, since it also frequently is not smaller). Algebraic aggregations such as COUNT, AVG and so forth, and many others, can implement part of the Reducer operation on the Map side, greatly lessening the amount of data sent to the Reducer. Pig and Hive are written to use Combiners whenever generically appropriate.

Applying a Combiner requires extra passes over your data on the Map side and so, in some cases, can itself cost much more time than it saves. If you ran a distinct operation over a data set with 50-percent duplicates, the Combiner is easily justified, since many duplicate pairs will be eliminated early. If, however, only a tiny fraction of records are duplicated, only a disappearingly-tiny fraction will occur on the same Mapper, so you will have spent disk and CPU without reducing the data size.

Whenever your Job Tracker output shows that Combiners are being applied, check that the Reducer input data is, in fact, diminished. (TODO: check which numbers show this)
If Pig or Hive have guessed badly, disable the (TODO: name of setting) setting in Pig or the (TODO: name of setting) setting in Hive.
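For plain Java MapReduce jobs you opt into a Combiner yourself. Here is a hedged sketch of the classic pattern for an algebraic aggregation like the visitor counts, using the Hadoop 1.x-era Job API; the mapper, job name and paths are hypothetical, and the reducer class doubles as the Combiner precisely because summing is algebraic.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class VisitorCountsJob {
      // Hypothetical mapper: emits (page, 1) for each visit record, page being the first tab-separated field
      public static class PageToCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, Text line, Context ctx)
            throws java.io.IOException, InterruptedException {
          ctx.write(new Text(line.toString().split("\t")[0]), ONE);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "visitor counts");
        job.setJarByClass(VisitorCountsJob.class);
        job.setMapperClass(PageToCountMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // map-side partial sums, safe because addition is algebraic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/gold/pageviews"));           // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/data/outputs/visitor_counts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }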
job, however, the only good mechanism is to examine the Reducer logs directly. At some reasonable time after the Reducer has started, you will see it initiate spills to disk (TODO: tell what the log line looks like). At some later point, it will begin merge/sorting those spills (TODO: tell what the log line looks like).

The CPU burden of a merge/sort is disappearingly small against the dominating cost of reading and then writing the data to disk. If, for example, your job only triggered one merge/sort pass halfway through receiving its data, the cost of the merge/sort is effectively one and a half times the base cost of writing that data at top speed to disk: all of the data was spilled once, and half of it was rewritten as merged output. Comparing the total size of data received by the Reducer to the merge/sort settings will let you estimate the expected number of merge/sort passes; that number, along with the top speed figure you collected above, will in turn allow you to estimate how long the Reduce should take. Much of this action happens in parallel, but it happens in parallel with your Mappers mapping, spilling and everything else that is happening on the machine. A healthy, data-intensive job will have Mappers with nearly top-speed throughput and the expected number of merge/sort passes, and the merge/sort should conclude shortly after the last Map input is received. (TODO: tell what the log line looks like). In general, if the amount of data each Reducer receives is less than a factor of two to three times its share of machine RAM (TODO: should I supply a higher-fidelity thing to compare against?), all those conditions should hold. Otherwise, consult the USE method (TODO: REF).

If the merge/sort phase is killing your job's performance, it is most likely because either all of your Reducers are receiving more data than they can accommodate or because some of your Reducers are receiving far more than their fair share. We will take the uniform distribution case first.

The best fix to apply is to send less data to your Reducers. The chapters on writing Map/Reduce jobs (TODO: REF, or whatever we are calling Chapter 5) and on advanced Pig (TODO: REF, or whatever we are calling that now) both have generic recommendations for how to send around less data, and throughout the book we have described powerful methods in a domain-specific context which might translate to your problem.

If you cannot lessen the data burden, well, the laws of physics and economics must be obeyed. The cost of a merge/sort is O(N log N). In a healthy job, however, most of the merge/sort has been paid down by the time the final merge pass begins, so up to that limit, your Hadoop job should run in O(N) time, governed by its top speed. The cost of excessive merge passes, however, accrues directly to the total runtime of the job. Even though there are other costs that increase with the number of machines, the benefits of avoiding excessive merge passes are massive. A cloud environment makes it particularly easy to arbitrage the laws of physics against the laws of economics: it costs
the same to run 60 machines for two hours as it does to run ten machines for twelve hours. As long as your runtime stays roughly linear with the increased number of machines, you should always size your cluster to your job, not the other way around. The thresholding behavior of excessive merge passes makes it exceptionally valuable to do so. This is why we feel exploratory data analytics is far more efficiently done in an elastic cloud environment, even given the quite significant performance hit you take. Any physical cluster is too large and also too small: you are overpaying for your cluster overnight while your data scientists sleep, and you are overpaying your data scientists to hold roller-chair sword fights while their undersized cluster runs. Our rough rule of thumb is to have not more than two to three times as much total Reducer data as you have total child heap size on all the Reducer machines you'll use. (TODO: complete)
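A hedged sketch of the estimate described above: given a Reducer's input size and the sort settings, roughly how many merge passes should you expect, and what do they cost in disk traffic? The formula is a simplification (the real merger is cleverer about the final pass), and every input figure is hypothetical.

    public class MergePassEstimate {
      public static void main(String[] args) {
        double reducerInputMB = 6_000;   // per-reducer input, from the job counters
        double sortBufferMB   = 160;     // reduce-side merge buffer, roughly io.sort.mb-sized
        int    ioSortFactor   = 25;      // io.sort.factor: streams merged at once

        int spills = (int) Math.ceil(reducerInputMB / sortBufferMB);
        // each pass merges up to ioSortFactor spills into one; repeat until few enough remain
        int passes = (int) Math.ceil(Math.log(spills) / Math.log(ioSortFactor));

        double diskMBWritten = reducerInputMB * (1 + Math.max(0, passes - 1)); // initial spill + each extra full pass
        System.out.printf("%d spills -> about %d merge pass(es), roughly %.0f MB written to local disk%n",
            spills, passes, diskMBWritten);
      }
    }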
Reducer Processing
(TODO: complete)
* The processing stage of Reduce attempts should be at full speed.
* Not too many merge passes in Reducers.
* Shuffle and sort time is explained by the number of merge passes.
* The commit phase should be brief.
* Total job runtime is not much more than the combined Map phase and Reduce phase runtimes.
* Reducers generally process the same amount of data.
* Most Reducers process at least enough data to be worth it.
Hadoop Tuning for the Brave and Foolish
CHAPTER 22
The default settings are those that satisfy, in some mixture, the constituencies of a) Yahoo, Facebook, Twitter, etc.; and b) Hadoop developers, i.e., people who write Hadoop but rarely use Hadoop. This means that many low-stakes settings (like keeping job stats around for more than a few hours) are at the values that make sense when you have a petabyte-scale cluster and a hundred data engineers.

If you're going to run two master nodes, you're a bit better off running one master as (namenode only) and the other master as (jobtracker, 2NN, balancer); the 2NN should be distinctly less utilized than the namenode. This isn't a big deal, as I assume your master nodes never really break a sweat even during heavy usage.
Memory
Here's a plausible configuration for a 16-GB physical machine with 8 cores:
    `mapred.tasktracker.reduce.tasks.maximum`  = 2
    `mapred.tasktracker.map.tasks.maximum`     = 5
    `mapred.child.java.opts`                   = 2 GB
    `mapred.map.child.java.opts`               = blank (inherits mapred.child.java.opts)
    `mapred.reduce.child.java.opts`            = blank (inherits mapred.child.java.opts)

    total mappers' heap size   = 10 GB (5 * 2 GB)
    total reducers' heap size  =  4 GB (2 * 2 GB)
    datanode heap size         = ___ GB
    tasktracker heap size      = ___ GB
    .....
    total                      = ___ GB on a 16 GB machine
It's rare that you need to increase the tasktracker heap at all. With both the TT and DN daemons, just monitor them under load; as long as the heap healthily exceeds their observed usage, you're fine.
If you find that most of your time is spent in reduce, you can grant the reducers more RAM with mapred.reduce.child.java.opts (in which case, lower the child heap size setting for the mappers to compensate).

It's standard practice to disable swap; you're better off OOMing (FOOTNOTE: OOM = Out of Memory error, causing the kernel to start killing processes outright) than swapping. If you do not disable swap, make sure to reduce the swappiness sysctl (5 is reasonable). Also consider setting overcommit_memory (1) and overcommit_ratio (100). Your sysadmin might get angry when you suggest these changes; on a typical server, OOM errors cause pagers to go off. A misanthropically funny T-shirt, or whiskey, will help establish your bona fides.

io.sort.mb: default X, recommended at least 1.25 * typical output size (so for a 128-MB block size, 160). It's reasonable to devote up to 70% of the child heap size to this value.

io.sort.factor: default X, recommended io.sort.mb * 0.x5 * (seeks/s) / (thruput MB/s). You want transfer time to dominate seek time: with too many input streams, the disk will spend more time switching among them than reading them. You also want the CPU well-fed: with too few input streams, the merge sort will run the sort buffers dry. My laptop does 76 seeks/s and has 56 MB/s throughput, so with io.sort.mb = 320 I'd set io.sort.factor to 27. A server that does 100 seeks/s with 100 MB/s throughput and a 160-MB sort buffer should set io.sort.factor to 80.

io.sort.record.percent: default X, recommended X (but adjust for certain jobs).

mapred.reduce.parallel.copies: default X, recommended to be in the range of sqrt(Nw*Nm) to Nw*Nm/2. You should see the shuffle/copy phase of your reduce tasks speed up.

mapred.job.reuse.jvm.num.tasks: default 1, recommended 10. If a job requires a fresh JVM for each process, you can override that in its jobconf. Going to -1 (reuse unlimited times) can fill up the disk if your input format uses delete-on-exit temporary files (as, for example, the S3 filesystem does), with little additional speedup.

You never want Java to be doing stop-the-world garbage collection, but for large JVM heap sizes (above 4 GB) it becomes especially dangerous. If a full garbage collect takes too long, sockets can time out, causing loads to increase, causing garbage collects to happen, causing trouble, as you can guess.
Given the number of files and amount of data you're storing, I would set the NN heap size aggressively: at least 4 GB to start, and keep an eye on it. Having the NN run out of memory is Not Good. Always make sure the secondary namenode has the same heap setting as the namenode.
Storage
mapred.system.dir: default X, recommended /hadoop/mapred/system. (Note that this is a path on the HDFS, not the local filesystem.)

Ensure the HDFS data dirs (dfs.name.dir, dfs.data.dir and fs.checkpoint.dir) and the mapreduce local scratch dirs (mapred.local.dir) include all your data volumes (and are off the root partition). The more volumes to write to, the better. Include all the volumes in all of the preceding. If you have a lot of volumes, you'll need to ensure they're all attended to; have 0.5-2x the number of cores as physical volumes.

HDFS-3652: don't name your dirs /data1/hadoop/nn; name them /data1/hadoop/nn1 (the final element differs).

Solid-state drives are unjustifiable from a cost perspective. Though they're radically better on seek, they don't improve performance on bulk transfer, which is what limits Hadoop. Use regular disks.
Do not construct a RAID partition for Hadoop; it is happiest with a large JBOD. (There's no danger to having Hadoop sit on top of a RAID volume; you're just hurting performance.)

We use xfs; I'd avoid ext3.

Set the noatime option (turns off tracking of last-access time); otherwise the OS updates the disk on every read.

Increase the ulimits for open file handles (nofile) and number of processes (nproc) to a large number for the hdfs and mapred users: we use 32768 and 50000. Be aware: you need to fix the ulimit for root (?instead? as well?).

dfs.blockreport.intervalMsec: default 3_600_000 (1 hour); recommended 21_600_000 (6 hours) for a large cluster. Aim for about 100_000 blocks per datanode for a good ratio of CPU to disk.
Other
mapred.map.output.compression.codec: default XX, recommended ``. Enable the Snappy codec for intermediate task output. Related settings: mapred.compress.map.output, mapred.output.compress, mapred.output.compression.type, mapred.output.compression.codec.

mapred.reduce.slowstart.completed.maps: default X, recommended 0.2 for a single-purpose cluster, 0.8 for a multi-user cluster. Controls how long, as a fraction of the full map run, the reducers should wait to start. Set this too high and you use the network poorly; reducers will be waiting to copy all their data. Set this too low and you will hog all the reduce slots.

mapred.map.tasks.speculative.execution: default true, recommended true. Speculative execution (FIXME: explain). This setting makes jobs finish faster but makes cluster utilization higher; the tradeoff is typically worth it, especially in a development environment. Disable it for any map-only job that writes to a database or has side effects besides its output. Also disable it if the map tasks are expensive and your cluster utilization is high.

mapred.reduce.tasks.speculative.execution: default false, recommended false.

(hadoop log location): default /var/log/hadoop, recommended /var/log/hadoop (usually). As long as the root partition isn't under heavy load, store the logs
on the root partition. Check the JobTracker, however; it typically has a much larger log volume than the others, and low disk utilization otherwise. In other words: use the disk with the least competition.

fs.trash.interval: default 1440 (one day), recommended 2880 (two days). I've found that files are either a) so huge I want them gone immediately, or b) of no real concern. A setting of two days lets you realize this afternoon that you made a mistake yesterday morning.

Unless you have a ton of people using the cluster, increase the amount of time the jobtracker holds log and job info; it's nice to be able to look back a couple of days at least. Also increase mapred.jobtracker.completeuserjobs.maximum to a larger value. These are just for politeness to the folks writing jobs. Related settings: mapred.userlog.retain.hours, mapred.jobtracker.retirejob.interval, mapred.jobtracker.retirejob.check, mapred.jobtracker.completeuserjobs.maximum, mapred.job.tracker.retiredjobs.cache, mapred.jobtracker.restart.recover.

Bump mapreduce.job.counters.limit; it's not configurable per-job.

(From http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/: a 512-MB block size is fairly reasonable.)
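These properties can also be set per job rather than cluster-wide. A hedged sketch of doing so from Java follows; the property names are the Hadoop 1.x ones listed above, and the values mirror the recommendations rather than anything mandatory.

    import org.apache.hadoop.conf.Configuration;

    public class PerJobTuningDefaults {
      public static Configuration tuned() {
        Configuration conf = new Configuration();
        // compress map output with Snappy: a little CPU for much less shuffle traffic
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");
        // compress final output as block-compressed Snappy as well
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.type", "BLOCK");
        conf.set("mapred.output.compression.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");
        // single-purpose cluster: start reducers once 20% of maps have finished
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.2f);
        // speculative execution: on for maps, off for reduces, per the recommendations above
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return conf;
      }
    }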
Storm+Trident Internals
CHAPTER 23
What should you take away from this chapter? You should:

* Understand the lifecycle of a Storm tuple, including spout, tupletree and acking.
* (Optional but not essential) Understand the details of its reliability mechanism and how tuples are acked.
* Understand the lifecycle of partitions within a Trident batch and thus the context behind partition operations such as Apply or PartitionPersist.
* Understand Trident's transactional mechanism, in the case of a PartitionPersist.
* Understand how Aggregators, StateMap and the Persistence methods combine to give you exactly-once processing with transactional guarantees. Specifically, what an OpaqueValue record will look like in the database and why.
* Understand how the master batch coordinator and spout coordinator for the Kafka spout in particular work together to uniquely and efficiently process all records in a Kafka topic. One specific: how Kafka partitions relate to Trident partitions.
relies on the final result of the acking mechanism, it is distinct from that and handled by the spout's Executor. (TODO: Check) If the spout doesn't emit a tuple, the Worker will sleep for a fixed number of milliseconds (by default; you can change the sleep policy). Otherwise, the Worker will keep calling nextTuple until either its send queue is full (see below) or until there are MAX_SPOUT_PENDING or more tuples pending.
Executor Queues
At this point, you see that the spout spins in an independent loop, emitting records to its Collector until one of its limits is hit. We will pick up with the specific tuple in a moment, but first let's get a picture of how tuples move between Executors, locally and remotely.

Each Executor, whether bolt or spout, has both a send and a receive queue. (For now, all you need to know about a bolt Executor is that it takes things from its receive queue, does stuff to them and puts them into the send queue.) (TODO: Insert information on what the Disruptor queue is and how wonderful it is)

When a tuple is emitted, the Collector places it into a slot in the send queue, once for each downstream Executor that will process it. (The code, should you find yourself reading it, doesn't distinguish the tuple as emitted from the copies used for sending. It has to make these multiple copies so that each can be acked independently.) These writes are done in a blocking fashion: if the queue is full, the write method does not return until it has been swept; this means that, in turn, the Collector's emit and the Executor's execute methods block as well, preventing the Executor from sending more records than downstream stages can process.
The Worker sweeps each Executor's send queue on a tight loop. Unless the downstream queue blocks, it repeats immediately (TODO: Check). Each sweep of the queue gathers all tuples, removing them into a temporary buffer. Tuples to be handled by a local Executor are deposited directly into that Executor's receive queue. All tuples destined for remote Executors are placed in the Worker's transfer queue in a single write, regardless of how many remote Executors there are.

A couple of notes. First, send and receive queues are per Executor, while the transfer queue is shared by all Executors in the Worker. Second, when the Worker writes to a downstream queue, it deposits all records from that sweep into the queue in a bunch. [FOOTNOTE: I'm being careful to use the informal term bunch rather than batch, because a batch is a principal element for Trident, whereas we'll never talk about these bunches again.] So note that, while each slot in a send queue holds exactly one tuple, each slot in a receive or transfer queue can hold up to the SEND_QUEUE size amount of tuples. (TODO: Check variable's name) This is important when you're thinking about memory usage.
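For orientation, here is what an Executor's work looks like from the programmer's side: a minimal bolt sketch using the classic backtype.storm API (the class and field names are illustrative, not from the book's example code). Each call to emit places copies of the tuple on the send queue for its downstream Executors, and the ack call feeds the bookkeeping described in the next section.

    import java.util.Map;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class SplitWordsBolt extends BaseRichBolt {
      private OutputCollector collector;

      @Override
      public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
      }

      @Override
      public void execute(Tuple tuple) {                      // called with tuples from the receive queue
        for (String word : tuple.getString(0).split("\\s+")) {
          collector.emit(tuple, new Values(word));            // anchored emit: the child joins the tupletree
        }
        collector.ack(tuple);                                 // tell the acker this parent is done
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
      }
    }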
If an Executor's receive queue fills, the upstream Worker will become blocked writing to that receive queue, keeping it from sweeping upstream send queues. This will continue all the way up the flow to the spout. Finally, as we hinted at the start, once the spout's send queue is full, it will stop calling nextTuple, stop draining its source and so stop writing more records into the flow.

If this sounds awfully coarse-grained, you're right. While nothing will break if you get to this L.A.-freeway state of gridlock, your flow will have become disastrously inefficient, even well before that point. You can straightforwardly prevent the situation by adjusting the MAX_SPOUT_PENDING parameter and each stage's parallelism correctly; the next chapter (TODO: ref), Storm+Trident Tuning, will show you how. In normal operation, you shouldn't have to think about the backpressure mechanics, though; Storm quietly and efficiently buffers records in front of your slowest stages and handles latency shocks (such as transient sluggishness from a remote database or API).
When a spout produces a tuple (let's take, for example, one named Methuselah), it notifies the acker to do two things: to start tracking Methuselah's tupletree and to inscribe Methuselah's name in that tupletree's Scroll of Ages. [FOOTNOTE: Actually, since a tuple can be sent to multiple downstream Executors, it's more appropriate to say it inscribes each of Methuselah's clones in the Scroll of Ages.]

As described above, that tuple will eventually be processed by a downstream Executor's execute method, which typically emits tuples and must call ack or fail. (TODO: insert details of what happens when a tuple fails) In the typical case, the Executor's bolt happily calls emit 0, 1 or many times and then calls ack. As each emitted tuple is placed in the send queue, the Executor notes its name [FOOTNOTE: Actually, the names of all its clones.] for later delivery to the acker. When the bolt calls ack, the Executor notifies the acker with the name of the parent and each child. So if a bolt, receiving a tuple called Noah, emitted tuples called Ham and Shem, it strikes Noah from the Scroll of Ages but lists Ham and Shem therein. (TODO: Rearrange?) When a bolt emits one or more tuples, the parent is removed but the children are added, and so the Scroll of Ages continues to have at least those entries in it. If a bolt received a tuple called Onan and emitted nothing, it would only notify the acker to clear Onan, adding nothing. Ultimately, for a tupletree to be successfully completed, every descendent must ultimately encounter a bolt that emits nothing.

Up until now, I've made it sound as if each name in the Scroll of Ages is maintained separately. The actual implementation is far more elegant than that and relies on a few special properties of the XOR function. First, you can freely rearrange the order in which several terms are XORed together: Noah XOR Shem XOR Ham is the same as Shem XOR Noah XOR Ham and so forth. Second, the XOR of a term with itself is 0: Noah XOR Noah is 0, for anybody. Do you see where this is going?

In our example (TODO: Repair so it's Noah's tree), when the Scroll of Ages was first prepared, only Noah's name was inscribed on it. When the Executor handling that tuple notified back, it didn't have to send Noah, Ham and Shem distinctly; it just sent the single 64-bit integer Noah XOR Ham XOR Shem. So the Scroll of Ages is pretty brief, as Scrolls go; it actually only holds the one entry that is the combined XOR of every tuple ID that has been sent. When the acker receives the ack for Noah, namely Noah XOR Ham XOR Shem, it XORs that single 64-bit entry with the existing tupletree checksum, storing the result back to the Scroll of Ages. (NOTE: TODO Rework Scroll of Ages metaphor to hold all tupletrees.) The value at this point is effectively Noah XOR (Noah XOR Shem XOR Ham). From these properties, the two Noah terms cancel out and so our tupletree state is now just Shem XOR Ham. Thanks to these properties, even as acks come in asynchronously, the Scroll of Ages remains correct: (Shem XOR Ham) XOR (Shem XOR Abraham) XOR (Ham) XOR (Abraham) rearranges to provide two Shems, two Hams and two Abrahams. Since, in this
example, the family lines of Shem and Abraham produced no further tuples, we are left with 0. As soon as that last ack comes in, producing a 0 in the Scroll of Ages, the acker notifies the spout that the tupletree has concluded. This lets the spout remove that very first tuple from its pending list. The loop that calls nextTuple will, on its next trip through, see the new pending count and, if conditions are right, call nextTuple.

This system is thus able to accommodate many millions of active tuples with remarkably little network chatter or memory footprint. Only the spout's pending tuples are retained for anything except immediate processing. Now, this comes at a cost: if any downstream tuple fails, the whole tree is retried. But since failure is the uncommon case (and finite RAM is the universal case), this is the right tradeoff. Second, the XOR trick means a single 64-bit integer is sufficient to track the legacy of an entire tupletree, no matter how large, and a single 64-bit integer is all that has to be tracked and sent to the acker, no matter how many downstream tuples an Executor produces. If you're scoring at home, for each tupletree, the entire bookkeeping system consumes O(1) number of tuples, O(1) size of checksum and only as many acks as tuples.

One last note: you can do the math on this yourself, but 64 bits is enough that the composed XOR of even millions of arbitrary 64-bit integers will effectively never come out to be 0 unless each term is repeated.
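Here is a tiny, self-contained demonstration of the checksum trick; the tuple names and 64-bit IDs come from the story above, not from Storm's API.

    public class ScrollOfAgesDemo {
      public static void main(String[] args) {
        long noah = 0x1122334455667788L;    // ids assigned when each tuple (clone) is created
        long shem = 0x0badcafe0badcafeL;
        long ham  = 0x7355608deadbeef1L;

        long scroll = noah;                 // spout anchors the tree: inscribe Noah
        scroll ^= (noah ^ shem ^ ham);      // Noah's bolt acks: clear Noah, add Shem and Ham
        scroll ^= shem;                     // Shem's bolt emits nothing and acks
        scroll ^= ham;                      // Ham's bolt emits nothing and acks

        System.out.println(scroll == 0      // 0 means every inscribed name has been struck out
            ? "tupletree complete: acker notifies the spout"
            : "still waiting on acks");
      }
    }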
interface and the features (like aggregations and exactly-once processing) that make stream analytics possible.

You might think that we would begin by talking about a Trident spout; after all, as you've been using Trident, that's where your flow conceptually begins. It's time we revealed a confusing clarification: the Trident spout is actually a Storm bolt. Viewed from the programmer interface, Trident spouts independently source tuples for downstream processing. But all of the Trident operations you are familiar with (spouts, eaches, aggregations) actually take place in Storm bolts. Trident turns your topology into a dataflow graph that it uses to assign operations to bolts and then to assign those bolts to workers. It's smart enough to optimize that assignment: it combines operations into bolts so that, as much as possible, tuples are handed off with simple method calls, and it arranges bolts among workers so that, as much as possible, tuples are handed off to local Executors. (connecting material here)

The actual spout of a Trident topology is called the Master Batch Coordinator (MBC). From Storm's end, it's the dullest possible spout; all it does is emit a tuple describing itself as batch 1, then a tuple describing itself as batch 2, and so forth, ad infinitum. (Of course, deciding when to emit those batches, retry them, etc., is quite exciting, but Storm doesn't know anything about all that.) Those batch tuples go to the topology's Spout Coordinator. The Spout Coordinator understands the location and arrangement of records in the external source and ensures that each source record belongs uniquely to a successful Trident batch.

The diagram on the right shows this in action for the GitHub topology. In this section, we are going to track three Trident batches (labeled 1, 2 and 3) through two parallel Kafka spouts, each pulling from a single Kafka partition. The Spout Coordinator passes the single seed tuple from the MBC on to each of its spouts, equipping each of them with a starting Kafka offset to read from. Each spout then requests, from the Kafka broker, a range of messages beginning at its determined offset and extending to, at most, Max_Fetch_Bytes. If Max_Fetch_Bytes were, say, 1000 bytes, and your records were uniformly 300 bytes, Kafka would return to the spout just the next three records, totalling 900 bytes. You must set Max_Fetch_Bytes larger than your largest expected record; the Kafka spout will fail if a record is too large to fit in a single batch. In most cases, records have a fairly bounded spread of sizes around a typical value. The GitHub records, for example, are (TODO: Find size x +- y bytes long). This means that a Max_Fetch_Bytes size of (TODO: value) might return as few as (A) and as many as (B) records. Pesky but harmless. If the size variance of your records is large enough to make this a real problem, unfortunately, you'll have to modify the Kafka spout.

Let's pause there, and I'm going to tell you a story. Here's the system chimpanzee schoolchildren follow when they go on a field trip. At the start of the day, a set of school buses
pull up in waves at the school. The first graders all file onto the first set of buses and head off first, followed by the set of buses for the second graders and so forth. As each bus pulls up at the museum, all the kids come off that bus in a group, known as a partition, and each group is met by a simple-minded docent assigned to it by the museum. Now, chimpanzees are an unruly sort, but they are able to be well-behaved in at least the following way: all the chimpanzees in a partition follow the same path through the museum and no chimpanzee in the partition ever cuts ahead of another chimp. So, the third kid off the bus will see the same paintings as the second kid and the fourth kid, and she'll see each of those paintings some time after the second kid did and some time before the fourth kid did. Each docent memorizes the number of students in its assigned partition, patiently takes a place in line after the last chimpanzee and follows along with them through the museum.

If you visited the museum on chimpanzee field trip day, well, it can sometimes be just as chaotic as you'd expect, what with kids looking at the pointless paintings from up close and afar, a set of them hooting in recognition at the exhibition on ostrolopithazines, and others gathering to intently study paintings they'll be discussing later in class. If you were to try to manage the ebb and flow of each of these partition groups in bulk, you wouldn't decrease the chaos; you'd just make it so nobody ever got through the hallways at all. No, the only good system is the one that lets each chimpanzee browse at his or her own pace. Most exhibits are enjoyed by EACH chimpanzee individually, and so the chimpanzees file by as they come. If a set of third graders and a set of first graders arrived at an exhibit at the same time, they'd file through the exhibit in whatever interleaved order happened by circumstance; that's OK because, of course, within those partitions, the good little chimpanzee boys and girls were obeying the field trip rules: no first grader ever jumped ahead of the first grader it followed, and no third grader ever jumped ahead of the third grader it followed.

Now, at some points during the field trip, the chimpanzees are to discuss an exhibit as a partition. When the first chimpanzee in that partition arrives at an exhibit, the exhibit's Operator will ask her to stand to the side and direct each chimpanzee in the partition to gather behind her. When, at some point, the docent shows up (last in line because of the field trip rules), the Operator checks that everyone is there by counting the number of kids in the partition and checking against the count that the docent carries. With that ritual satisfied, the Operator conducts the partitionQuery Q&A session. Each student, thus enlightened as a group, is then sent along to the next exhibit in exactly the original partition order.

As you can see, the students are able to enjoy the exhibits in the museum singly or in groups without any more coordination than is necessary. However, at the end of the day, when it's time to go home, a much higher level of commitment to safety is necessary. What happens when it's time to return is this: as each partition group files out of the museum, it gathers back at its original school bus. Just as in the group discussion, the
bus Operator notices when the docent shows up (signaling the end of the partition) and compares the actual to the expected count. Once satisfied that the full partition is present, it signals the Master Batch Coordinator for that grade that all of the bus's students are present. Once the Master Batch Coordinator has received the ready signal from all the buses in a grade, it signals all the bus Operators that they are approved to transport the students back to school. Finally, once safely back at school, each bus Operator radios the Master Batch Coordinator of their safe arrival, allowing the MBC to declare the field trip a success.
When the partition is complete and the persistentAggregate receives the commit signal from the MBC, it therefore has on hand the following: all group keys seen in this partition and, for each of those keys, an Aggregator instance, fat from having consumed all the records in that group. It now needs to retrieve the prior existing value, if any, from the backing store. For example, in the case of the simple counting aggregation, it needs only the primitive integer holding the previous batch's value; in the case where we accumulated a complex profile, it's a HashMap describing that profile. (TODO: Perhaps more here)

The persistentAggregate wrapper hands the cache map (the first backing layer) the full set of group keys in the given partition, requesting their values. Assuming things are behaving well, the cache map will have those values, hot and ready, in memory, but of course it may be missing some or all. The cache map, using exactly the same interface, asks the concrete backing store's StateMap for the values it lacks. (TODO: What does it send back when a value is missing?) The cache map accepts the results and proudly presents the full set of values back to the persistentAggregate wrapper. The wrapper then promptly finalizes the aggregated values and hands a map of group keys and their updated values back to the backing store. The cache map stores all those values in its cache, possibly triggering the least-recently-used values to be discarded, and then, in turn, hands the full set of values to the concrete datastore, which persists them to the external database. Note that only a fraction of the values for any given partition are typically read from the database, but every value in a partition is written back.

Now, here's the clever part. The concrete datastore accepts what it's given (the actual value to store), and it returns what it's asked for (the value as of the last batch). But it stores within the record for a given group key the following things: the transaction ID of the current batch, the newly-updated value and the prior value that it was based on; let's call those values the aligned value and the pre-aligned value, respectively. At whatever later time it's asked to retrieve the value for that record, it demands to know the current transaction ID as well.

Now, let's think back to the transactional guarantee we described above. Suppose the record it retrieves has a transaction ID of 8 and the current batch's transaction ID is 12. Great! The backing store knows that, although this record wasn't involved, batches 8, 9, 10 and 11 were all processed successfully. It then takes the aligned value from batch 8 and faithfully reports it as the value to update. (TODO: SIDEBAR: It could happen that the record it retrieves shows a transaction ID of 12. It might be that this worker is retrying an earlier failed attempt, or it might be that this worker fell off the grid and it's seeing the result of the retry due to its laziness.)
It might be, as described in the sidebar, that the transaction ID is 12. Remember, the request is for the value prior to the current batch; luckily, that's exactly what was stored in the pre-aligned value, and so that's what is returned. Now you see why it's important that the Aggregator promises acceptably-identical results given the same records and prior value; you're not allowed to care which attempt of a retry is the last one to complete. This is all done behind the scenes and you never have to worry about it. In fact, the class that hides this transactional behavior is called the OpaqueValue class, and this type of dataflow is what Trident calls an OpaqueTransactional topology.

For folks coming from a traditional database background, please notice that while we use the word "transactional" here, don't read too much into it. First, we're not using and not relying on any native transactional guarantee in the commit to the external database. The transactional behavior we've described covers the entire progress of a batch, not just the commit of any given partition to the database. Second, the transactional behavior is only eventually consistent. In practice, since the Master Batch Coordinator signals all persistentAggregates to commit simultaneously, there is very little jitter among attempts to commit. If your database administrator is doing her job, in normal circumstances an external read will not notice misaligned batches. Of course, all this machinery was put in place to tolerate the fact that sometimes a subset of workers might hang or die trying to commit their partition within a batch. In that case, a read of the database would return some values current as of, say, batch 12, while (until the retry happens) the failed workers' records are only up to date as of batch 11.
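The bookkeeping just described fits in a few lines. Below is a minimal sketch, not Trident's actual OpaqueValue implementation: the class name, fields and methods are hypothetical, but they follow the rule laid out above (store the batch's transaction ID alongside the aligned and pre-aligned values, and decide which one to report based on the transaction ID of the batch doing the asking).

```java
/** Hypothetical sketch of an opaque-transactional stored value (not Trident's own class). */
public class OpaqueStoredValue<T> {
  long txid;     // transaction ID of the batch that wrote this record
  T aligned;     // value after applying that batch (the "newly-updated value")
  T prealigned;  // value that batch started from (the "prior value")

  /** The value to hand the aggregator as "the result as of the previous batch". */
  public T priorValue(long currentTxid) {
    if (currentTxid > this.txid) {
      // Batches txid .. currentTxid-1 all committed successfully, so the
      // aligned value already reflects everything before the current batch.
      return aligned;
    } else {
      // currentTxid == this.txid: we are replaying the same batch (a retry),
      // so report the value as it stood before that batch's first attempt.
      return prealigned;
    }
  }

  /** Record the outcome of the current batch. */
  public void update(long currentTxid, T newValue) {
    this.prealigned = priorValue(currentTxid);
    this.aligned    = newValue;
    this.txid       = currentTxid;
  }
}
```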
CHAPTER 24
Storm+Trident Tuning
Outline

Our approach for Storm+Trident, as it is for Hadoop, is to use it from the outside. If you find yourself seeking out details of its internal construction to accomplish a task, stop and consider whether you're straying outside its interface. But once your command of its external model solidifies and your application reaches production, it's worth understanding what is going on. This chapter describes the lifecycle of, first, a Storm+Trident tuple and, next, a batch. Along the way, we can also illuminate how Storm+Trident achieves its breakthroughs: its ability to do exactly-once processing is a transformative improvement over transfer-oriented systems (such as Flume or S4) and Wall Street-style Complex Event Processing (CEP) systems (such as Esper). Because Storm+Trident has not reached Hadoop's relative maturity, it's important for the regular data scientist to understand what is happening under the hood.

* Tuning constraints: refer to the earlier discussion on acceptable delay, throughput, horizon of compute, etc.
* Process for initial tuning
* General recommendations
* Cloud tuning

Tuning a dataflow system is easy:
The Dataflow Tuning Rules: * Ensure each stage is always ready to accept records, and * Deliver each processed record promptly to its destination
That may seem insultingly simplistic, but most tuning questions come down to finding some barrier to one or the other. It also implies a corollary: once your dataflow does
obey the Dataflow Tuning Rules, stop tuning it. Storm+Trident flows are not subtle: they either work (in which case configuration changes typically have small effect on performance) or they are catastrophically bad. So most of the tuning you'll do is to make sure you don't sabotage Storm's ability to meet the Dataflow Tuning Rules.
Goal
First, identify your principal goal and your principal bottleneck. The goal of tuning here is to optimize one of latency, throughput, memory or cost, without sacrificing the other measures or harming stability. Next, identify your dataflow's principal bottleneck: the constraining resource that most tightly bounds the performance of its slowest stage. A dataflow can't pass through more records per second than the cumulative output of its most constricted stage, and it can't deliver records in less end-to-end time than the stage with the longest delay. The principal bottleneck may be:

* IO volume: a hardware bottleneck for the number of bytes per second that a machine's disks or network connection can sustain. For example, event log processing often involves large amounts of data but only trivial transformations before storage.
* CPU: by contrast, a CPU-bound flow spends more time in calculations to process a record than to transport that record.
* Memory: large windowed joins or memory-intensive analytics algorithms will in general require you to provision each machine for the largest expected memory extent. Buying three more machines won't help: if you have a 10 GB window, you need a machine with 10 GB+ of RAM.
* Concurrency: if your dataflow makes external network calls, you will typically find that the network request latency for each record is far more than the time spent to process and transport the record. The solution is to parallelize the requests by running small batches and high parallelism for your topology. However, increasing parallelism has a cost: eventually thread switching, per-executor bookkeeping, and other management tasks will consume either all available memory or CPU.
* Remote rate limit: alternatively, you may be calling an external resource that imposes a maximum throughput limit. For example, terms-of-service restrictions from a third-party web API might only permit a certain number of bulk requests per hour, or a legacy datastore might only be able to serve a certain volume of requests before its performance degrades. If the remote resource allows bulk requests, you should take care that each Trident batch is sized uniformly. For example, the twitter users/lookup API returns user records for up to 100 user IDs, so it's essential that each
batch consist of 100 tuples (and no more). Otherwise, there's not much to do here besides tuning your throughput to comfortably exceed the maximum expected rate.

For each of the cases besides IO-bound and CPU-bound, there isn't that much to say. For memory-bound flows, the answer is "buy enough RAM" (though you should read the tips on JVM tuning later in this chapter). For remote rate-limited flows, buy a better API plan or remote datastore; otherwise, tune as if CPU-bound, allowing generous headroom. For concurrency-bound flows, apply the recommendations that follow, increasing concurrency until things get screwy. If that per-machine throughput is acceptable to your budget, great; otherwise, hire an advanced master sysadmin to help you chip away at it.
Provisioning
Unless you're memory-bound, the Dataflow Tuning Rules imply that network performance and multiple cores are valuable, but that you should not need machines with a lot of RAM. Since tuples are handled and then handed off, they should not accumulate in memory to any significant extent. Storm works very well on commodity hardware or cloud/virtualized machines. Buy server-grade hardware, but don't climb the price/performance curve. The figures of merit here are the number and quality of CPU cores, which govern how much parallelism you can use; the amount of RAM per core, which governs how much memory is available to each parallel executor chain; and the cost per month of CPU cores (if CPU-bound) or of RAM (if memory-bound). Using Amazon cloud machines as a reference, we like to use either the c1.xlarge machines (7 GB RAM, 8 cores, $424/month, giving the highest CPU-performance-per-dollar) or the m3.xlarge machines (15 GB RAM, 4 cores, $365/month, the best balance of CPU-per-dollar and RAM-per-dollar). You shouldn't use fewer than four worker machines in production, so if your needs are modest, feel free to downsize the hardware accordingly.
Topology-level settings
Use one worker per machine for each topology: since sending a tuple is much more efficient if the destination executor is in the same worker, the fewer workers the better. (Tuples go directly from send queue to receive queue, skipping the worker transfer buffers and the network overhead.) Also, if you're using Storm pre-0.9, set the number of ackers equal to the number of workers; previously, the default was one acker per topology. The total number of worker slots per machine is set when the supervisor is launched; each supervisor manages some number of JVM child processes. In your topology, you specify how many worker slots it will try to claim.
In our experience, there isn't a great reason to use more than one worker per topology per machine. With one topology running on three nodes and a parallelism hint of 24 for the critical path, you will get 8 executors per bolt per machine, i.e. one for each core. This gives you three benefits.

The primary benefit is that when data is repartitioned (shuffles or group-bys) to executors in the same worker, it will not have to hit the transfer buffer; tuples are deposited directly from send buffer to receive buffer. That's a big win. By contrast, if the destination executor were on the same machine but in a different worker, the tuple would have to go exec send buffer -> worker transfer buffer -> local socket -> worker receive buffer -> exec receive buffer. It doesn't hit the network card, but it's not as big a win as when executors are in the same worker.

Second, you're typically better off with three aggregators having very large backing caches than with twenty-four aggregators having small backing caches. This reduces the effect of skew and improves LRU efficiency.

Lastly, fewer workers reduces control flow chatter.

For CPU-bound stages, set one executor per core for the bounding stage (if there are many cores, use one less). Using the examples above, you would run a parallelism of 7 or 8 on a c1.xlarge and a parallelism of 4 on an m3.xlarge. Don't adjust the parallelism unless there's good reason; even a shuffle implies network transfer, and shuffles don't impart any load-balancing. For memory-bound stages, set the parallelism to make good use of the system RAM; for concurrency-bound stages, find the parallelism that makes performance start to degrade and then back off to, say, 80% of that figure.

Match your spout parallelism to its downstream flow. Use the same number of kafka partitions as kafka spouts (or a small multiple). If there are more spouts than kafka machines * partitions, the extra spouts will sit idle.

Map states or persistentAggregates accumulate their results into memory structures, and so you'll typically see the best cache efficiency and lowest bulk-request overhead by using one such stage per worker.
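As a sketch, the worker, acker and parallelism settings above translate into code roughly like this (names and numbers are illustrative only; the Config calls are the standard Storm ones, though details can shift between versions):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import storm.trident.TridentTopology;

public class TopologySettings {
  public static void main(String[] args) throws Exception {
    TridentTopology topology = new TridentTopology();
    // ... define spouts and streams here. Give the critical-path stream a
    // parallelism hint of (workers) * (cores per machine), e.g. 3 * 8 = 24:
    //   stream.each(...).parallelismHint(24);

    Config conf = new Config();
    conf.setNumWorkers(3);   // one worker per machine, three machines
    conf.setNumAckers(3);    // needed pre-0.9 only: one acker per worker
    StormSubmitter.submitTopology("my-topology", conf, topology.build());
  }
}
```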
Initial tuning
If you have the ability to specify your development hardware, start tuning on a machine with many cores and over-provisioned RAM so that you can qualify the flow's critical bottleneck. A machine similar to Amazon's m3.2xlarge (30 GB RAM, 8 cores) lets you fall back to either of the two reference machines described above.

For a CPU-bound flow: construct a topology with parallelism one, set max-pending to one, use one acker per worker, and ensure that Storm's "nofiles" ulimit is large (65000 is a decent number).
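A sketch of that starting configuration (standard Config calls from 0.8/0.9-era Storm; the nofiles limit is an operating-system setting, noted here only as a reminder):

```java
import backtype.storm.Config;

public class InitialTuningConf {
  public static Config startingPoint() {
    Config conf = new Config();
    conf.setMaxSpoutPending(1);  // exactly one batch in flight while qualifying the flow
    conf.setNumAckers(1);        // one acker per worker; a single dev worker needs just one
    // Also raise the OS "nofiles" ulimit for the storm user (65000 is a decent number).
    // That lives in /etc/security/limits.conf or your provisioning scripts, not here.
    return conf;
  }
}
```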
* Set the trident-batch-delay to be comfortably larger than the end-to-end latency; there should be a short additional delay after each batch completes.
* Time the flow through each stage.
* Increase the parallelism of CPU-bound stages to nearly saturate the CPU, and at the same time adjust the batch size so that state operations (aggregates, bulk database reads/writes, kafka spout fetches) don't slow down the total batch processing time.
* Keep an eye on the GC activity. You should see no old-gen or stop-the-world (STW) GCs, and efficient new-gen GCs. (Your production goal is no more than one new-gen GC every 10 seconds and no more than 10 ms pause time per new-gen GC, but for right now just over-provision: set the new-gen size to give infrequent collections and don't worry about pause times.)

Once you have roughly dialed in the batch size and parallelism, check in with the First Rule. The stages upstream of your principal bottleneck should always have records ready to process. The stages downstream should always have capacity to accept and promptly deliver processed records.
This implies that you can't have better throughput than the collective rate of your slowest stage, and you can't have better latency than the sum of the individual latencies. For example, if all records must pass through a stage that handles 10 records per second, then the flow cannot possibly proceed faster than 10 records per second, and it cannot have latency smaller than 100 ms (1/10 second). What's more, with 20 parallel stages, the 95th-percentile latency of your slowest stage becomes the median latency of the full set. (TODO: nail down numbers) Current versions of Storm+Trident don't do any load-balancing within batches, and so it's worth benchmarking each machine to ensure performance is uniform.
Batch Size
Next, we'll set the batch size.
The batch size for the Kafka spout is controlled indirectly by the max fetch bytes. The resulting total batch size is at most (kafka partitions) * (max fetch bytes). For example, given a topology with six kafka spouts and four brokers with three kafka-partitions per broker, you have twelve kafka-partitions total, two per spout. When the MBCoordinator calls for a new batch, each spout produces two sub-batches (one for each kafka-partition), each into its own trident-partition. Now also say you have records of 1000 +/- 100 bytes, and that you set max-fetch-bytes to 100_000. The spout fetches the largest discrete number of records that fit within max-fetch-bytes, so in this case each sub-batch will have between 90 and 111 records. That means the full batch will have between 1080 and 1332 records, and roughly 1_186_920 to 1_200_000 bytes.
Choosing a value
In some cases, there is a natural batch size: for example, the twitter users/lookup API call returns information on up to 100 distinct user IDs. If so, use that figure.

Otherwise, you want to optimize the throughput of your most expensive batch operation. each() functions should not care about batch size; batch operations like bulk database requests, batched network requests, or intensive aggregation (partitionPersist, partitionQuery, or partitionAggregate) do care. Typically, you'll find that there are three regimes:

1. When the batch size is too small, the response time per batch is flat: it's dominated by bookkeeping.
2. It then grows slowly with batch size. For example, a bulk put to elasticsearch will take about 200 ms for 100 records, about 250 ms for 1000 records, and about 300 ms for 2000 records (TODO: nail down these numbers).
3. At some point, you start overwhelming some resource on the other side, and execution time increases sharply.

Since the execution time increases slowly in case (2), you get better and better records-per-second throughput. Choose a value that is near the top range of (2) but comfortably less than regime (3).
An important fact is that the executor send queue contains tuples, while the receive/transfer queues contain bunches of tuples. These are advanced-level settings, so don't make changes unless you can quantify their effect, and make sure you understand why any large change is necessary. In all cases, the sizes have to be a power of two (1024, 2048, 4096, and so forth).

As long as the executor send queue is large enough, further increase makes no real difference apart from increased RAM use and a small overhead cost. If the executor send queue is way too small, a burst of records will clog it unnecessarily, causing the executor to block. The more likely pathology is that if it is slightly too small, you'll get skinny residual batches that make poor use of the downstream receive queues. Picture an executor that emits 4097 tuples, fast enough to cause one sweep of 4096 records and a second sweep of the final record; that sole record at the end requires its own slot in the receive queue. Unfortunately, in current versions of Storm the setting applies universally, so everyone has to live with the needs of the piggiest customer. This is most severe in the case of a spout, which will receive a large number of records in a burst, or anywhere there is high fanout (one tuple that rapidly turns into many tuples). Set the executor send buffer to be larger than the batch record count of the spout or first couple of stages.
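These buffers are set through the topology configuration. A sketch (the key constants exist in 0.8/0.9-era Storm; the values are examples only):

```java
import backtype.storm.Config;

public class QueueSizes {
  public static Config queueConf() {
    Config conf = new Config();
    // All three must be powers of two.
    // Send queue holds tuples: make it larger than the spout's batch record count.
    conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE,    16384);
    // Receive and transfer queues hold bunches of tuples; modest sizes are usually fine.
    conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
    conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE,         32);
    return conf;
  }
}
```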
* Total heap of 2500 MB (-Xmx2500m): a 1000 MB new-gen, a 100 MB perm-gen, and the implicit 1500 MB old-gen. Don't use gratuitously more heap than you need; long GC times can cause timeouts and jitter. Heap size larger than 12 GB is trouble on AWS, and heap size larger than 32 GB is trouble everywhere.
* Tell it to use the concurrent-mark-and-sweep collector for long-lived objects, and to only do so when the old-gen becomes crowded.
* Enable a few mysterious performance options.
* Log GC activity at max verbosity, with log rotation.

If you watch your GC logs, in steady state you should see:

* No stop-the-world (STW) GCs, and nothing in the logs about aborting parts of CMS
* Old-gen GCs should not last longer than 1 second or happen more often than every 10 minutes
* New-gen GCs should not last longer than 50 ms or happen more often than every 10 seconds
* New-gen GCs should not fill the survivor space
* Perm-gen occupancy is constant

Side note: regardless of whether you're tuning your overall flow for latency or throughput, you want to tune the GC for latency (low pause times). Since things like committing a batch can't proceed until the last element is received, local jitter induces global drag.
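Here is what those flags might look like wired into the worker child options. The flag list mirrors the bullets above; the exact numbers and log path are placeholders, and some flags (perm-gen sizing, the CMS collector) apply to pre-Java-8 HotSpot.

```java
import backtype.storm.Config;

public class WorkerGcOpts {
  public static Config gcConf() {
    Config conf = new Config();
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
        "-Xmx2500m -Xms2500m -Xmn1000m -XX:MaxPermSize=100m "          // 2500 MB heap: 1000 MB new-gen, implicit old-gen; 100 MB perm-gen
      + "-XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly "   // CMS for long-lived objects...
      + "-XX:CMSInitiatingOccupancyFraction=75 "                        // ...and only once the old-gen is crowded
      + "-XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled "               // the "mysterious performance options"
      + "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps "       // GC logging at max verbosity...
      + "-Xloggc:logs/gc-worker.log -XX:+UseGCLogFileRotation "
      + "-XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10M");              // ...with rotation
    return conf;
  }
}
```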
* Make the number of workers a multiple of the number of machines; the parallelism a multiple of the number of workers; and the number of kafka partitions a multiple of the spout parallelism
* Use one worker per topology per machine
* Start with fewer, larger aggregators, one per machine with workers on it
* Use the isolation scheduler
* Use one acker per worker; [pull request #377](https://github.com/nathanmarz/storm/issues/377) makes that the default.
CHAPTER 25
HBase Data Modeling
////Definitely add some additional introduction here. Describe how this connects with other chapters. Also characterize the overall goal of the chapter. And, then, put into context to the real-world (and even connect to what readers have learned thus far in the book). Amy//// Space doesn't allow treating HBase in any depth, but it's worth equipping you with a few killer dance moves for the most important part of using it well: data modeling. It's also good for your brain: optimizing data at rest presents new locality constraints, dual to the ones you've by now mastered for data in motion. ////If it's true that readers may get something crucial out of first playing with the tool before reading this chapter, say so here as a suggestion. Amy//// So please consult other references (like HBase: The Definitive Guide (TODO: reference) or the free HBase Book online), load a ton of data into it, play around, then come back to enjoy this chapter.
1. Given a row key: get, put or delete a single value into which you've serialized a whole record.
2. Given a row key: get, put or delete a hash of column/value pairs, sorted by column key.
3. Given a key: find the first row whose key is equal or larger, and read a hash of column/value pairs (sorted by column key).
4. Given a row key: atomically increment one or several counters and receive their updated values.
5. Given a range of row keys: get a hash of column/value pairs (sorted by column key) from each row in the range. The lowest value in the range is examined, but the highest is not. (If the amount of data is small and uniform for each row, the performance of this type of query isn't too different from case (3). If there are potentially many rows, or more data than would reasonably fit in one RPC call, this becomes far less performant.)
6. Feed a map/reduce job by scanning an arbitrarily large range of values.

That's pretty much it! There are some conveniences (versioning by timestamp, time-expirable values, custom filters, and a type of vertical partitioning known as column families); some tunables (read caching, fast rejection of missing rows, and compression); and some advanced features, not covered here (transactions, and a kind of stored procedures/stored triggers called coprocessors). For the most part, however, those features just ameliorate the access patterns listed above.
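A sketch of those access patterns against the plain HBase Java client (0.94-era API; the table, family and qualifier names are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AccessPatterns {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");

    // (1)/(2): put or get values by row key
    Put put = new Put(Bytes.toBytes("row-0001"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("col"), Bytes.toBytes("value"));
    table.put(put);
    Result row = table.get(new Get(Bytes.toBytes("row-0001")));

    // (4): atomically increment a counter
    long count = table.incrementColumnValue(
        Bytes.toBytes("row-0001"), Bytes.toBytes("f"), Bytes.toBytes("ctr"), 1L);

    // (3)/(5): scan a range of row keys; lower bound inclusive, upper bound exclusive
    Scan scan = new Scan(Bytes.toBytes("row-0001"), Bytes.toBytes("row-0100"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) { /* ... handle each row ... */ }
    scanner.close();
    table.close();
  }
}
```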
More than most engineering tools, it's essential to play to HBase's strengths, and in general the simpler your schema, the better HBase will serve you. Somehow, though, the sparsity of its feature set amplifies the siren call of even those few features. Resist, resist. The more stoically you treat HBase's small feature set, the better you will realize how surprisingly large HBase's solution set is.
...to group the prefixes. The mapper takes each title and emits the first three, four, five, up to say twelve characters, along with the pagerank. Use the prefix as the partition key, and the prefix-rank as a descending sort key. Within each prefix group, the first ten records will be the ten most prominent completions; store them as a JSON-ized list and ignore all following completions for that prefix.)

What will we store into HBase? Your first instinct might be to store each of the ten titles, each in its own cell. Reasonable, but still too clever. Instead, serialize the full JSON-encoded response as a single value. This minimizes the cell count (memory- and disk-efficient), lets the API front end put the value straight onto the wire (speed- and lines-of-code-efficient), and puts us in the most efficient access pattern: single row, single value.

Table 25-1. Autocomplete HBase schema
| table          | row key | column family | column qualifier | options                                                      |
|----------------|---------|---------------|------------------|--------------------------------------------------------------|
| title_autocomp | prefix  | j             | -                | VERSIONS => 1, BLOOMFILTER => 'ROW', COMPRESSION => 'SNAPPY'  |
With block compression enabled, the disk bus has less data to stream off disk. Row locality often means nearby data elements are highly repetitive (definitely true here), so you realize a great compression ratio. There are two tradeoffs: first, a minor CPU hit to decompress the data; worse, though, you must decompress blocks at a time even if you only want one cell. In the case of autocomplete, row locality means you're quite likely to use some of those other cells.
Geographic Data
For our next example, let's look at geographic data: the Geonames dataset of places, the Natural Earth dataset of region boundaries, and our Voronoi-spatialized version of the NCDC weather observations (TODO: ref). We require two things. First, direct information about each feature. Here no magic is called for: compose a row key from the feature type and id, and store the full serialized record as the value. It's important to keep row keys short and sortable, so map the region types to single-byte ids (say, a for country, b for admin 1, etc.) and use standard ISO identifiers for the region id (us for the USA, dj for Djibouti, etc.). More interestingly, we would like a slippy map (e.g. Google Maps or Leaflet) API: given the set of quadtiles in view, return partial records (coordinates and names) for each feature. To ensure a responsive user experience, we need low latency, concurrent access and intelligent caching; HBase is a great fit.
Quadtile Rendering
The boundaries dataset gives coordinates for continents, countries, states (admin1), and so forth. In (TODO: ref the Geographic Data chapter), we fractured those boundaries into quadtiles for geospatial analysis, which is the first thing we need. We also need to choose a base zoom level: fine-grained enough that the records are of manageable size to send back to the browser, but coarse-grained enough that we don't flood the database with trivial tiles ("In Russia. Still in Russia. Russia, next 400,000 tiles"). Consulting the (TODO: ref How big is a Quadtile) table, zoom level 13 means 67 million quadtiles, each about 4 km per side; this is a reasonable balance based on our boundary resolution.
| ZL | recs   | @64kB/qk | reference size              |
|----|--------|----------|------------------------------|
| 12 | 17 M   | 1 TB     | about the size of Manhattan  |
| 13 | 67 M   | 4 TB     | about 4 km per side          |
| 14 | 260 M  | 18 TB    | about 2 km per side          |
| 15 | 1024 M | 70 TB    | about 1 km per side          |
For API requests at finer zoom levels, we'll just return the ZL 13 tile and crop it (at the API or browser stage). You'll need to run a separate job (not described here, but see the references (TODO: ref migurski boundary thingy)) to create simplified boundaries for each of the coarser zoom levels. Store these in HBase with three-byte row keys built
from the zoom level (byte 1) and the quadtile id (bytes 2 and 3); the value should be the serialized GeoJSON record we'll serve back.
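Packing that three-byte row key is plain byte twiddling, no HBase API involved; a sketch (the method name is made up):

```java
public class QuadtileRowKey {
  /** Zoom level in byte 1, 16-bit quadtile id in bytes 2 and 3. */
  public static byte[] rowKey(int zoomLevel, int quadtileId) {
    return new byte[] {
      (byte) zoomLevel,
      (byte) ((quadtileId >> 8) & 0xFF),
      (byte) (quadtileId & 0xFF)
    };
  }
}
```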
Column Families
We want to serve several kinds of regions: countries, states, metropolitan areas, counties, voting districts and so forth. It's reasonable for a request to specify one, some combination, or all of the region types, and so, given our goal of one read per client request, we should store the popular region types in the same table. The most frequent requests will be for one or two region types, though. HBase lets you partition values within a row into Column Families. Each column family has its own set of store files, bloom filters and block cache (TODO: verify caching details), and so if only a couple of column families are requested, HBase can skip loading the rest (footnote 2). We'll store each region type (using the scheme above) as the column family, and the feature ID (us, jp, etc.) as the column qualifier. This means I can:

* request all region boundaries on a quadtile by specifying no column constraints
* request country, state and voting district boundaries by specifying those three column families
* request only Japan's boundary on the quadtile by specifying the column key a:jp

Most client libraries will return the result as a hash mapping column keys (combined family and qualifier) to cell values; it's easy to reassemble this into a valid GeoJSON feature collection without even parsing the field values.
2. Many relational databases accomplish the same end with vertical partitioning.
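In the Java client, those three requests differ only in which families or columns you attach to the Get; a sketch (the family letters follow the single-byte scheme above, except "v" for voting districts, which is an assumption; the row key is a packed quadtile key as described earlier):

```java
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionRequests {
  public static Get allRegions(byte[] quadtileKey) {
    return new Get(quadtileKey);                      // no constraints: every column family
  }

  public static Get someRegionTypes(byte[] quadtileKey) {
    Get get = new Get(quadtileKey);
    get.addFamily(Bytes.toBytes("a"));                // countries
    get.addFamily(Bytes.toBytes("b"));                // admin 1 (states)
    get.addFamily(Bytes.toBytes("v"));                // hypothetical: voting districts
    return get;
  }

  public static Get justJapan(byte[] quadtileKey) {
    Get get = new Get(quadtileKey);
    get.addColumn(Bytes.toBytes("a"), Bytes.toBytes("jp"));  // the column key a:jp
    return get;
  }
}
```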
To find all the points in quadtile 0121, scan from 012100 to 012200 (returning a through j). Scans ignore the last index in their range, so k is excluded, as it should be. To find all the points in quadtile 012 121, scan from 012121 to 012122 (returning g, h and i). Don't
store the quadkeys as the base-4 strings that we use for processing: the efficiency gained by packing them into 16- or 32-bit integers is worth the trouble. The quadkey 12301230 is eight bytes as the string "12301230", but two bytes as the 16-bit integer 27756.
When you are using this Rows as Columns technique, or any time you're using a scan query, make sure you turn scanner caching on. It's an incredibly confusing name: it does not control a cache of scanner objects. Instead, think of it as a batch size, allowing many rows of data to be sent per network call.
Typically with a keyspace this sparse you'd use a bloom filter, but we won't be doing direct gets, so it's not called for here (bloom filters are not consulted in a scan).
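A sketch of that prefix-range scan with scanner caching turned on (string quadkeys here for readability, though as noted you would really use the packed-integer form):

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class QuadtileScan {
  /** All points within quadtile 0121: scan [012100, 012200); the stop row is excluded. */
  public static void pointsIn0121(HTable table) throws Exception {
    Scan scan = new Scan(Bytes.toBytes("012100"), Bytes.toBytes("012200"));
    scan.setCaching(1000);   // "scanner caching": rows per network round trip
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) { /* ... handle each point ... */ }
    scanner.close();
  }
}
```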
Use column families to hold high, medium and low importance points; at coarse zoom levels, only return the few high-prominence points, while at fine zoom levels, return points from all the column families.
Filters
There are many kinds of features, and some of them are distinctly more populous and interesting. Roughly speaking, geonames features are:

* A (XXX million): political features (states, counties, etc.)
* H (XXX million): water-related features (rivers, wells, swamps, ...)
* P (XXX million): populated places (city, county seat, capital, ...)
* R: road, railroad, ...
* S: spot, building, farm

Very frequently we only want one feature type (only cities, or only roads), and it's common to want one, several or all at a time. You could further nest the feature codes. To pull a subset of columns in a single get, you need to use a ColumnPrefixFilter: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html
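A sketch of using ColumnPrefixFilter to pull only one feature class (here the populated-places columns, assuming the feature code is the leading character of the column qualifier):

```java
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FeatureTypeGet {
  /** Only columns whose qualifier starts with "P" (populated places). */
  public static Get populatedPlacesOnly(byte[] quadtileKey) {
    Get get = new Get(quadtileKey);
    get.setFilter(new ColumnPrefixFilter(Bytes.toBytes("P")));
    return get;
  }
}
```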
Retrieve the next existing tile: it's a one-row operation, but we specify a range from the specific tile to the max tile ID. The next tile is either the specific one with that key, or the first parent. Note that a "next interesting record" query doesn't use the bloom filter. To do a range scan on a zoomed-out tile, we want to scan all cells in 011 123; this means scanning from 011 123 000 to 011 123 ~~~.

Table 25-2. Geographic data HBase schema

| table       | row key                 | column qualifiers | options                                |
|-------------|-------------------------|-------------------|----------------------------------------|
| region_info | region_type-region_name | geonames_id, name | VERSIONS => 1, COMPRESSION => 'SNAPPY' |
From where we stand, a best-of-class big data stack has three legs: Hadoop, one or more scalable databases, and multi-latency streaming analytics.
A high-volume website might have 2 million unique daily visitors, causing 100 M requests/day on average (4000 requests/second peak), at say 600 bytes per log line from 20-40 servers. Over a year, that becomes about 40 billion records and north of 20 terabytes of raw data. Feed that to most databases and they will crumble. Feed it to HBase and it will smile, belch and ask for seconds and thirds, which in fact we will supply. Designing for reads means aggressively denormalizing data, to an extent that turns the stomach and tests the will of traditional database experts. Use a streaming data pipeline such as Storm+Kafka or Flume, or a scheduled batch job, to denormalize the data.

Webserver log lines contain these fields: ip_address, cookie (a unique ID assigned to each visitor), url (the page viewed), referer_url (the page they arrived from), status_code (success or failure of the request) and duration (time taken to render the page). We'll add a couple more fields as we go along.
Timestamped Records
We'd like to understand user journeys through the site. (Here's what you should not do: use a row key of timebucket-cookie; see the Row Locality section below for why.) To sort the values in descending timestamp order, use a reverse timestamp: LONG_MAX - timestamp. (You can't simply use the negative of the timestamp, since sorts are always lexicographic: -1000 sorts before -9999.) By using a row key of cookie-rev_time:

* we can scan with a prefix of just the cookie to get all pageviews per visitor, ever.
* we can scan with a prefix of the cookie, limit one row, to get only the most recent session.
* if all you want are the distinct pages (not each page view), specify VERSIONS => 1 in your request.

In a map/reduce job, using the column key and the referring page URL gives a graph view of the journey; using the column key and the timestamp gives a timeseries view of the journey.
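A sketch of building the cookie-rev_time row key and scanning a visitor's journey (table and family layout follow the schema sketched in this chapter; a real key scheme would also want fixed-width or hashed cookies so that prefixes cannot collide):

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class VisitorJourney {
  /** Row key: cookie bytes followed by (LONG_MAX - timestamp), so the newest entry sorts first. */
  public static byte[] rowKey(String cookie, long timestampMillis) {
    return Bytes.add(Bytes.toBytes(cookie), Bytes.toBytes(Long.MAX_VALUE - timestampMillis));
  }

  /** All pageviews for a visitor, most recent first. */
  public static ResultScanner journey(HTable visits, String cookie) throws Exception {
    Scan scan = new Scan(Bytes.toBytes(cookie));              // start at the cookie prefix
    scan.setFilter(new PrefixFilter(Bytes.toBytes(cookie)));  // stop once the prefix no longer matches
    return visits.getScanner(scan);
  }
}
```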
Row Locality
Row keys determine data locality. When activity is focused on a set of similar and thus adjacent rows, it can be very efficient or very problematic.
Adjacency is good: most of the time, adjacency is good (hooray locality!). When common data is stored together, it enables:

* range scans: retrieve all pageviews having the same path prefix, or a continuous map region.
* sorted retrieval: ask for the earliest entry, or the top-k rated entries.
* space-efficient caching: map cells for New York City will be much more commonly referenced than those for Montana. Storing records for New York City together means fewer HDFS blocks are hot, which means the operating system is better able to cache those blocks.
* time-efficient caching: if I retrieve the map cell for Minneapolis, I'm much more likely to next retrieve the adjacent cell for nearby St. Paul. Adjacency means that cell will probably be hot in the cache.

Adjacency is bad: if everyone targets a narrow range of keyspace, all that activity will hit a single regionserver and your wonderful massively-distributed database will limp along at the speed of one abused machine. This could happen because of high skew: for example, if your row keys were URL paths, the pages in the /product namespace would see far more activity than pages under laborday_2009_party/photos (unless they were particularly exciting photos). Similarly, a phenomenon known as Benford's law means that addresses beginning with 1 are far more frequent than addresses beginning with 9 (footnote 3). In this case, managed splitting (pre-assigning a rough partition of the keyspace to different regions) is likely to help. Managed splitting won't help for timestamp keys and other monotonically increasing values, though, because the focal point moves constantly.

You'd often like to spread the load out a little, but still keep similar rows together. Options include:

* swap your first two key levels: if you're recording time series metrics, use metric_name-timestamp, not timestamp-metric_name, as the row key.
* add some kind of arbitrary low-cardinality prefix: a server or shard id, or even the least-significant bits of the row key. To retrieve whole rows, issue a batch request against each prefix at query time.
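A sketch of the second option, salting keys with a low-cardinality prefix (the shard count and key layout are our choices, nothing HBase mandates; reads then fan out one request per prefix):

```java
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {
  static final int NUM_SHARDS = 16;   // low cardinality: one scan or get per shard at read time

  /** Prepend a one-byte shard id derived deterministically from the natural key. */
  public static byte[] salted(byte[] naturalKey) {
    int shard = (java.util.Arrays.hashCode(naturalKey) & 0x7fffffff) % NUM_SHARDS;
    return Bytes.add(new byte[] { (byte) shard }, naturalKey);
  }
}
```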
Timestamps
You could also track the most recently-viewed pages directly. In the cookie_stats table, add a column family r having VERSIONS => 5. Now each time the visitor loads a page, write to that exact cell; HBase store files record the timestamp range of their contained records. If your request is limited to values less than one hour old, HBase can ignore all store files older than that.
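Reading it back with a time-range restriction looks roughly like this (the family r and the five-version limit come from the paragraph above; the one-hour window is just an example):

```java
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.util.Bytes;

public class RecentViews {
  /** Up to five recently-viewed pages, touching only store files written in the last hour. */
  public static Get lastHour(byte[] cookie) throws Exception {
    long now = System.currentTimeMillis();
    Get get = new Get(cookie);
    get.addFamily(Bytes.toBytes("r"));
    get.setTimeRange(now - 3600 * 1000L, now);  // store files wholly older than this are skipped
    get.setMaxVersions(5);
    return get;
  }
}
```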
3. A visit to the hardware store will bear this out; see if you can figure out why. (Hint: on a street with 200 addresses, how many start with the numeral 1?)
Domain-reversed values
It's often best to store URLs in domain-reversed form, where the hostname segments are placed in reverse order: e.g. org.apache.hbase/book.html for hbase.apache.org/book.html. The domain-reversed URL orders pages served from different hosts within the same organization (org.apache.hbase and org.apache.kafka and so forth) adjacently. To get a picture of inbound traffic by organization, store the referer URL in domain-reversed form as well.
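Domain reversal is just string housekeeping; a sketch:

```java
public class DomainReversal {
  /** hbase.apache.org/book.html -> org.apache.hbase/book.html */
  public static String reverseDomain(String url) {
    int slash = url.indexOf('/');
    String host = (slash < 0) ? url : url.substring(0, slash);
    String path = (slash < 0) ? ""  : url.substring(slash);
    String[] parts = host.split("\\.");
    StringBuilder reversed = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {   // hostname segments in reverse order
      reversed.append(parts[i]);
      if (i > 0) reversed.append('.');
    }
    return reversed.append(path).toString();
  }
}
```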
ID Generation Counting
One of the elephants recounts this tale: In my land it's essential that every person's prayer be recorded. One approach is to have diligent monks add a grain of rice to a bowl on each event, then in daily ritual recount them from beginning to end. You and I might instead use a threadsafe [UUID](http://en.wikipedia.org/wiki/Universally_unique_identifier) library to create a guaranteed-unique ID. However, neither grains of rice nor time-based UUIDs can easily be put in time order. Since monks may neither converse (it's incommensurate with mindfulness) nor own fancy wristwatches (vow of poverty and all that), a strict ordering is impossible. Instead, a monk writes on each grain of rice the date and hour, his name, and the index of that grain of rice this hour. You can read a great writeup of distributed UUID generation in Boundary's [Flake project announcement](http://boundary.com/blog/2012/01/12/flake-a-decentralized-k-ordered-unique-id-generator-in-erlang/) (see also Twitter's [Snowflake](https://github.com/twitter/snowflake)). You can also use block-granted counters: a central server gives me a lease on a block of IDs, which I then hand out locally.
HBase actually provides atomic counters. (Another approach from elephant-land: have an enlightened Bodhisattva hold the single running value in mindfulness.)

* http://stackoverflow.com/questions/9585887/pig-hbase-atomic-increment-column-values
* From http://www.slideshare.net/larsgeorge/realtime-analytics-with-hadoop-and-hbase: 1 million counter updates per second on 100 nodes (10k ops per node)
* Use a different column family for month, day, hour, etc. (each with a different TTL) for increments
Atomic Counters
Second, for each visitor we want to keep a live count of the times they've viewed each distinct URL. In principle you could use the cookie_urls table, but maintaining a consistent count is harder than it looks: it does not work to read a value from the database, add one to it, and write the new value back. Some other client may be busy doing the same, and so one of the counts will be off. Without native support for counters, this simple process requires locking, retries, or other complicated machinery.

HBase offers atomic counters: a single incr command that adds or subtracts a given value, responding with the new value. From the client perspective it's done in a single action (hence, atomic) with guaranteed consistency. That makes the visitor-URL tracking trivial. Build a table called cookie_url_count, with a column family u. On each page view:

1. Increment the counter for that URL: count = incr(table: "cookie_url_count", row: cookie, col: "u:#{url}"). The return value of the call has the updated count. You don't have to initialize the cell; if it was missing, HBase will treat it as having had a count of zero.
2. Store the URL in the cookie_stats table, but use a timestamp equal to that URL's count, not the current time, in your request: put("cookie_stats", row: cookie, col: "c", timestamp: count, value: url).

To find the most-frequent URL for a given cookie, do a get(table: "cookie_stats", row: cookie, col: 'c'). HBase will return the most recent value, namely the one with the highest timestamp, which means the value with the highest count. Although we're constantly writing in values with lower timestamps (counts), HBase ignores them on queries and eventually compacts them away.
For this hack to work, the value must be forever monotonically increasing (that is, it can never decrease). The value "total lifetime pageviews" can only go up; "pageviews in the last 30 days" will go up or down over time.
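A sketch of both steps with the plain Java client (0.94-era API; the table and family names follow the text, while the empty qualifier under cookie_stats:c is an assumption):

```java
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MostFrequentUrl {
  /** counts is the cookie_url_count table, stats is the cookie_stats table. */
  public static void recordPageview(HTable counts, HTable stats,
                                    String cookie, String url) throws Exception {
    byte[] row = Bytes.toBytes(cookie);
    // 1. Atomically bump the per-URL counter; HBase treats a missing cell as zero.
    long count = counts.incrementColumnValue(
        row, Bytes.toBytes("u"), Bytes.toBytes(url), 1L);
    // 2. Write the URL into cookie_stats:c using the count as the cell timestamp.
    Put put = new Put(row);
    put.add(Bytes.toBytes("c"), Bytes.toBytes(""), count, Bytes.toBytes(url));
    stats.put(put);
  }

  /** Most recent timestamp == highest count == most-frequent URL. */
  public static String mostFrequent(HTable stats, String cookie) throws Exception {
    Result r = stats.get(new Get(Bytes.toBytes(cookie))
        .addColumn(Bytes.toBytes("c"), Bytes.toBytes("")));
    return Bytes.toString(r.getValue(Bytes.toBytes("c"), Bytes.toBytes("")));
  }
}
```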
Exercises
1. Besides the pedestrian janitorial work of keeping table sizes in check, TTLs are another feature to joyfully abuse. Describe how you would use TTLs to track time-based rolling aggregates, like average air-speed velocity over the last 10 minutes.

Table 25-3. Server logs HBase schema
| table       | row key           | qualifier  | options |
|-------------|-------------------|------------|---------|
| visits      | cookie-timebucket | referer    |         |
| visits      | cookie-timebucket | term       |         |
| visits      | cookie-timebucket | product_id |         |
| visits      | cookie-timebucket | cart_id    |         |
| cookie_urls | cookie            | -          |         |
| ip_tbs      | ip-timebucket     |            |         |

4. The TTL will only work if you're playing honest with the timestamps; you can't use it with the most-frequent URL table.
IP Address Geolocation
An increasing number of websites personalize content for each reader. Retailers find that even something as simple as saying "Free Shipping" or "No Sales Tax" (each true only for people in certain geographic areas) dramatically increases sales. HBase's speed and simplicity shine for a high-stakes, low-latency task like estimating the geographic location of a visitor based on their IP address.

If you recall from (TODO: ref server logs chapter), the Geo-IP dataset stores information about IP addresses a block at a time. Fields: IP address, ISP, latitude, longitude, quadkey. The query: given an IP address, retrieve geolocation and metadata with very low latency.

Table 25-4. IP-Geolocation lookup
| table | row key         | column families | column qualifiers | versions | value |
|-------|-----------------|-----------------|-------------------|----------|-------|
| ip    | ip_upper_in_hex |                 | field name        | none     |       |
Store the upper range of each IP address block, in hexadecimal, as the row key. To look up an IP address, do a scan query, max 1 result, on the range from the given ip_address to a value larger than the largest 32-bit IP address. A get is simply a scan-with-equality-max-1, so there's no loss of efficiency here. Since row keys are sorted, the first value equal to or larger than your key is the end of the block it lies on. For example, say we had block A covering 50.60.a0.00 to 50.60.a1.08, B covering 50.60.a1.09 to 50.60.a1.d0, and C covering 50.60.a1.d1 to 50.60.a1.ff. We would store 50.60.a1.08 => {...A...}, 50.60.a1.d0 => {...B...}, and 50.60.a1.ff => {...C...}. Looking up 50.60.a1.09 would get block B, because 50.60.a1.d0 is lexicographically after it. So would looking up 50.60.a1.d0: range queries are inclusive on the lower and exclusive on the upper bound, so the row key for block B matches as it should.

As for the column keys, it's a toss-up based on your access pattern. If you always request full rows, store a single value holding the serialized IP block metadata. If you often want only a subset of fields, store each field in its own column.
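A sketch of that scan-with-max-1 lookup (the hex-string row keys follow Table 25-4; the stop-row constant is just any string that sorts after the largest hex key):

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class IpGeolocation {
  /** The first row key >= the visitor's IP is the upper bound of the block it falls in. */
  public static Result blockFor(HTable ipTable, String ipInHex) throws Exception {
    Scan scan = new Scan(Bytes.toBytes(ipInHex), Bytes.toBytes("g"));  // "g" sorts after any hex row key
    scan.setCaching(1);                  // we only ever want the first row
    ResultScanner scanner = ipTable.getScanner(scan);
    try {
      return scanner.next();             // null means no block covers this address
    } finally {
      scanner.close();
    }
  }
}
```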
Graph Data
Just as we saw with Hadoop, there are two sound choices for storing a graph: as an edge list of (from, into) pairs, or as an adjacency list of all into nodes for each from node.

Table 25-6. HBase schema for the Wikipedia pagelink graph: three reasonable implementations
| table | row key   | column families | column qualifiers | value  | options |
|-------|-----------|-----------------|-------------------|--------|---------|
|       | from_page | l (links)       | into_page         | (none) |         |
If we were serving a live wikipedia site, every time a page was updated I'd calculate its adjacency list and store it as a static, serialized value. For a general graph in HBase, here are some tradeoffs to consider:

* The pagelink graph never has more than a few hundred links for each page, so there are no concerns about having too many columns per row. On the other hand, there are many celebrities on the Twitter follower graph with millions of followers or followees. You can shard those cases across multiple rows, or use an edge list instead.
* An edge list gives you fast "are these two nodes connected" lookups, using the bloom filter on misses and the read cache for frequent hits.
* If the graph is read-only (e.g. a product-product similarity graph prepared from server logs), it may make sense to serialize the adjacency list for each node into a single cell. You could also run a regular map/reduce job to roll up the adjacency list into its own column family, and store deltas to that list between rollups.
Refs
I've drawn heavily on the wisdom of the HBase Book. Thanks to Lars George for many of these design guidelines, and for the "Design for Reads" motto.

* HBase Shell Commands
* HBase Advanced Schema Design, by Lars George
* http://www.quora.com/What-are-the-best-tutorials-on-HBase-schema
* Encoding numbers for lexicographic sorting: an insane but interesting scheme: http://www.zanopha.com/docs/elen.pdf
* A Java library for wire-efficient encoding of many datatypes: https://github.com/mrflip/orderly
* http://www.quora.com/How-are-bloom-filters-used-in-HBase
CHAPTER 26
Appendix 2: Cheatsheets
Author
Philip (flip) Kromer is cofounder of Infochimps, a big data platform that makes acquiring, storing and analyzing massive data streams transformatively easier. Infochimps became part of Computer Sciences Corporation in 2013, and their big data platform now serves customers such as Cisco, HGST and Infomart. He enjoys bowling, Scrabble, working on old cars or new wood, and rooting for the Red Sox.

Graduate School, Dept. of Physics, University of Texas at Austin, 2001-2007
Bachelor of Arts, Computer Science, Cornell University, Ithaca NY, 1992-1996
* Cofounder of Infochimps, now Head of Technology and Architecture at Infochimps, a CSC Company
* Core committer for Storm, a framework for scalable stream processing and analytics
* Core committer for Ironfan, a framework for provisioning complex distributed systems in the cloud or data center
* Original author and core committer of Wukong, the leading Ruby library for Hadoop
* Contributed a chapter to The Definitive Guide to Hadoop by Tom White

Dieterich Lawson is a recent graduate of Stanford University. TODO DL: biography
License
TODO: actual license stuff

Text and assets are released under CC-BY-NC-SA (Creative Commons Attribution, Non-commercial, derivatives encouraged but Share Alike).
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Code is Apache licensed unless specifically labeled otherwise. For access to the source code, visit http://github.com/infochimps-labs/big_data_for_chimps/
Glossary
* secondarynn (aka secondary namenode): handles compaction of the namenode directory. It is NOT a backup for the namenode.
* support daemon: one of namenode, datanode, jobtracker, tasktracker or secondarynn.
* job process (aka child process): the actual process that executes your code.
* tasktracker: interface between jobtracker and task processes. It does NOT execute your code, and typically requires a minimal amount of RAM.
* shuffle merge (aka shuffle and sort):
* shuffle buffer:
* map sort buffer:
* Resident memory (RSS or RES):
* JVM:
* JVM heap size:
* Old-Gen heap:
* New-Gen heap: portion of JVM RAM used for short-lived objects. If too small, transient objects will go into the old-gen, causing fragmentation and an eventual STW garbage collection.
* garbage collection:
* STW garbage collection: "Stop-the-world" garbage collection, a signal that there has been significant fragmentation or heap pressure. Not Good.
* attempt ID:
* task ID:
* job ID:

Questions: task process or job process for child process?

Storm+Trident Glossary

* Storm Execution: Worker, Daemons, Nimbus, UI, Supervisor, Child Process, Executor, Task
* Trident: Function, Trident Tuple / TridentTuple, tupletree, Batch, Partition, Group, Batch (Transaction) ID, TridentOperation, TridentFunction, Aggregator
* Layout: Topology, Stream, Assembly
* Internal:
  Master Batch Coordinator, Spout Coordinator, TridentSpoutExecutor, TridentBoltExecutor
* Transport: DisruptorQueue, Executor Send Queue, Executor Receive Queue, Worker Transfer Queue (Worker Receive Buffer)
References
Other Hadoop Books
* Hadoop: The Definitive Guide, Tom White
* Hadoop Operations, Eric Sammer
* Hadoop In Practice (Alex Holmes)
* Hadoop Streaming FAQ
* Hadoop Configuration defaults (mapred)

Unreasonable Effectiveness of Data
* Peter Norvig's Facebook Tech Talk
* Later version of that talk at ?? (I like the "On the Unreasonable Effectiveness of Data" one)

Source material
* Wikipedia article on Lexington, Texas (CC-BY-SA)
* Installing Hadoop on OSX Lion
* JMX through a ssh tunnel

To Consider
* http://www.cse.unt.edu/~rada/downloads.html : texts semantically annotated with WordNet 1.6 senses (created at Princeton University), and automatically mapped to WordNet 1.7, WordNet 1.7.1, WordNet 2.0, WordNet 2.1, WordNet 3.0

Code Sources
* wp2txt, by Yoichiro Hasebe
* The git-scribe toolchain was very useful in creating this book. Instructions on how to install the tool and use it for things like editing this book, submitting errata and providing translations can be found at that site.