Kafka Summit NYC 2017 - Running Hundreds of Kafka Clusters with 5 People

1. Hi (some code)

2. apache kafka is great software

3. but it isn't perfect

4. running Kafka in production

5. hundreds of clusters with 5 people

6. Tom Crayford

7. Tom Crayford @t_crayford

8. Tom Crayford Heroku @t_crayford

9. Context Rules Everything Around Me

10. Apache Kafka on Heroku

11. Kafka as a Service

12. heroku addons:create heroku-kafka:standard-0

13. KAFKA_URL
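
The add-on hands the application its broker list through the `KAFKA_URL` config var. A minimal sketch of turning that value into `host:port` strings, assuming it holds comma-separated `kafka+ssl://` URLs; the hostnames and port below are made up for illustration:

```python
from urllib.parse import urlparse

def bootstrap_servers(kafka_url: str) -> list:
    """Split a comma-separated KAFKA_URL value into host:port strings."""
    servers = []
    for raw in kafka_url.split(","):
        parsed = urlparse(raw.strip())
        servers.append(f"{parsed.hostname}:{parsed.port}")
    return servers

# Hypothetical value; a real app would read os.environ["KAFKA_URL"].
example = "kafka+ssl://broker-1.example.com:9096,kafka+ssl://broker-2.example.com:9096"
servers = bootstrap_servers(example)
print(servers)
```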

14. 3 kafkas 5 zookeepers

15. provide and enforce best practices

16. RF >= 3

17. not safe in AWS without that!

19. automation

20. heroku addons:upgrade heroku-kafka:extended-0

21. heroku kafka:upgrade --version 0.10.2.1

22. we take care of your cluster

23. it's your cluster

24. use it how you like

25. we wear the pager

26. about us

27. heroku data

28. postgres

29. postgres redis

30. postgres redis kafka

31. A U T O M A T I O N

32. my start date: 17 August 2015

33. SELECT COUNT(*) FROM pages WHERE created_at > '2015-08-17'

34. 29,000 SELECT COUNT(*) FROM pages WHERE created_at > '2015-08-17'

35. 46 pages a day

36. pager storms

37. SELECT COUNT(*) FROM incidents WHERE created_at > '2015-08-17'

38. 800,000 SELECT COUNT(*) FROM incidents WHERE created_at > '2015-08-17'

39. 29,000

40. 29,000 /

41. 29,000 / 800,000

42. 29,000 / 800,000 = 3%

43. 97%
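
The arithmetic behind these numbers can be reproduced in a few lines. This is a sketch only: the page and incident counts and the start date come from the slides, while the end date is an assumption (roughly the day of the talk). The ratio works out to a little over 3%, which the slides show as 3% and 97%.

```python
from datetime import date

pages = 29_000
incidents = 800_000
start = date(2015, 8, 17)    # start date from the slides
talk_day = date(2017, 5, 8)  # assumption: approximate date of the talk

days = (talk_day - start).days
pages_per_day = pages / days
paged_fraction = pages / incidents        # incidents that paged a human
unpaged_fraction = 1 - paged_fraction     # incidents that never paged anyone

print(f"{days} days, {pages_per_day:.0f} pages a day")
print(f"paged: {paged_fraction:.2%}, not paged: {unpaged_fraction:.2%}")
```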

44. [Chart; axis ticks from 0 to 800,000]

45. 3%

46. 3%

47. 3%

48. 3%

50. what you are going to learn

51. Context Rules Everything Around Me

52. how to talk about 2 years of work?

53. how to talk about 10 years of work?

54. Incidents

55. Incident 1 Incident 2 Incident 3

56. Incident 1 lessons Incident 2 Incident 3

57. Incident 1 lessons Incident 2 lessons Incident 3

58. Incident 1 lessons Incident 2 lessons Incident 3 lessons

59. why does this broker have 8TB of disk?

60. 🆒

61. context

62. context: "infinite disk"

63. it's your cluster

64. you do what you want

65. some clusters are small

66. some clusters are giant

67. disk growth

69. lvm array

70. why does this broker have 8TB of disk?

71. RULE 1

72. scan the fleet

73. YOU HAVE TO SCAN THE FLEET

74. "why do several of our clusters have very large on disk size?"

75. 🆒

76. pick a cluster

77. grab a shovel

78. nearly all the data in one topic

79. high volume?

80. not enough…

81. what else is different

82. compaction!

83. to the JIRA mine!

84. KAFKA-3587

85. fix version: 0.10.0.0

86. cluster version: 0.10.0.0

87. welp

88. can't find anything in JIRA!

89. ok!

90. back to the shovel

91. time to look at logs

92. eventually

93. at=LogCleaner [kafka-log-cleaner-thread-0], Error due to java.lang.IllegalArgumentException: requirement failed: 9750860 messages in segment MY_FAVORITE_TOPIC-2/00000000000047580165.log but offset map can fit only 5033164. You can increase log.cleaner.dedupe.buffer.size or decrease log.cleaner.threads
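
The capacity in that message is consistent with Kafka's default cleaner settings. A rough sketch of the arithmetic, assuming a 128 MiB `log.cleaner.dedupe.buffer.size`, one cleaner thread, a 0.9 load factor, and 24 bytes per map entry (a 16-byte key hash plus an 8-byte offset). These values are assumptions about this broker's configuration, not something the slides state:

```python
# Assumed settings (Kafka defaults of that era, not taken from the slides).
dedupe_buffer_bytes = 128 * 1024 * 1024   # log.cleaner.dedupe.buffer.size
cleaner_threads = 1                       # log.cleaner.threads
load_factor = 0.9                         # log.cleaner.io.buffer.load.factor
bytes_per_entry = 16 + 8                  # key hash + offset

# Each cleaner thread gets an equal share of the dedupe buffer.
slots = (dedupe_buffer_bytes // cleaner_threads) // bytes_per_entry
capacity = int(slots * load_factor)

messages_in_segment = 9_750_860           # from the log line
segment_fits = messages_in_segment <= capacity

print(capacity, segment_fits)
```

Under these assumptions the segment holds almost twice as many messages as the map has room for, which is exactly the condition the `requirement failed` check rejects.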

94. some gold!

95. grep the code for the error

96. uhoh!

97. this exception…

98. this exception… kills the thread

99. this thread can die without any monitoring data

100. everything will work…

101. but compaction

102. "infinite disk, but"

103. scan the fleet

104. JMX thread dumps

105. dozens of brokers with log cleaner thread missing
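
A fleet scan of this kind boils down to checking every broker's thread dump for the cleaner thread. A minimal sketch, assuming the thread names have already been collected (for example over JMX) into a dictionary. The broker names and the non-cleaner thread names are hypothetical; the `kafka-log-cleaner-thread-` prefix is the one seen in the log line earlier.

```python
CLEANER_PREFIX = "kafka-log-cleaner-thread-"

def brokers_missing_cleaner(thread_names_by_broker: dict) -> list:
    """Return the brokers whose thread dump has no log cleaner thread."""
    missing = []
    for broker, names in sorted(thread_names_by_broker.items()):
        if not any(name.startswith(CLEANER_PREFIX) for name in names):
            missing.append(broker)
    return missing

# Hypothetical thread dumps, one list of thread names per broker.
fleet = {
    "broker-a": ["kafka-request-handler-0", "kafka-log-cleaner-thread-0"],
    "broker-b": ["kafka-request-handler-0"],
    "broker-c": ["kafka-network-thread-0", "kafka-log-cleaner-thread-0"],
}
dead = brokers_missing_cleaner(fleet)
print(dead)
```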

106. 🆒

107. ok!

108. time to fix

109. first, the shovel

110. how does compaction work?

111. log segments 0-1000 1001-2000 2001-3000 3001-4000 4001-5000

112. inside a segment 0-1000

113. 0-1000 offset 0 offset 1 offset 2 offset 3 ...

114. 0-1000 key1 key2 key3 key4 ...

115. 0-1000 offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key4 ...

116. 0-1000 offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1 ...

117. 0-1000 offset 1 key2 offset 2 key3 offset 3 key1 ...

118. how does compaction work?

119. offset map { key: offset }

120. offset map {}

121. {} offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1

122. {} offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1

123. { "key1": 0, } offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1

124. { "key1": 0, "key2": 1, } offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1

125. { "key1": 0, "key2": 1, "key3": 2, } offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1

126. { "key1": 3, "key2": 1, "key3": 2, } offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1

127. { "key1": 3, "key2": 1, "key3": 2, } offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1 0-1000 1001-2000 2001-3000 3001-4000 4001-5000

134. latest offset: 3 { "key1": 3, "key2": 1, "key3": 2, } offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1 0-1000 1001-2000 2001-3000 3001-4000 4001-5000
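
The walkthrough above can be sketched in a few lines. This is a simplified model of compaction over one segment, not Kafka's implementation: records are `(offset, key)` pairs, the offset map remembers the latest offset per key, and a record survives only if it is the latest one for its key.

```python
def build_offset_map(records):
    """Map each key to the highest offset at which it appears."""
    offset_map = {}
    for offset, key in records:
        offset_map[key] = offset  # later offsets overwrite earlier ones
    return offset_map

def compact(records, offset_map):
    """Keep only the record holding the latest offset for its key."""
    return [(offset, key) for offset, key in records if offset_map[key] == offset]

# The segment from the slides: key1 is written twice.
segment = [(0, "key1"), (1, "key2"), (2, "key3"), (3, "key1")]
offset_map = build_offset_map(segment)
compacted = compact(segment, offset_map)
print(offset_map, compacted)
```

Offset 0 is dropped because `key1` was written again at offset 3, matching the compacted segment shown earlier.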

135. what was the bug?

136. assumption: "fit the whole segment in the map"

137. the fix

138. { "key1": 0, "key2": 1, } offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1

139. { "key1": 0, "key2": 1, } offset 0 key1 offset 1 key2 offset 2 key3 offset 3 key1 latest offset: 1
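
The idea of the fix can be modelled as follows. This is a simplified sketch, not the upstream patch: when the map is full, stop filling it and remember the last offset it covers, then compact only up to that offset instead of throwing. The `capacity` parameter and function names are illustrative.

```python
def build_bounded_offset_map(records, capacity):
    """Fill the map until it is full; return it with the last offset it covers."""
    offset_map = {}
    latest_offset = None
    for offset, key in records:
        if key not in offset_map and len(offset_map) >= capacity:
            break  # map is full: stop here instead of failing
        offset_map[key] = offset
        latest_offset = offset
    return offset_map, latest_offset

def compact_up_to(records, offset_map, latest_offset):
    """Compact records covered by the map; leave later records untouched."""
    kept = []
    for offset, key in records:
        covered = latest_offset is not None and offset <= latest_offset
        if not covered or offset_map[key] == offset:
            kept.append((offset, key))
    return kept

segment = [(0, "key1"), (1, "key2"), (2, "key3"), (3, "key1")]
offset_map, latest = build_bounded_offset_map(segment, capacity=2)
kept = compact_up_to(segment, offset_map, latest)
print(offset_map, latest, kept)
```

With room for two keys, the map stops at offset 1, as on the slide. Nothing is removed from this tiny segment on this pass, but the cleaner makes progress instead of dying, and a later pass can continue from the next offset.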

140. contributed back to upstream

143. Lessons

146. 1. Impact

147. 1. Impact 2. Mitigate

148. 1. Impact 2. Mitigate 3. Fix

149. 1. Impact 2. Mitigate 3. Fix 4. Follow up

153. you can fix things

154. (I don't know scala)

155. you can fix things

156. takeaways: scan the fleet; 4 steps; A U T O M A T I O N; you can fix things

157. EBS

158. start getting paged by a bunch of postgres/redis failing health checks in a single AZ

159. hundreds of pages

160. grab a shovel

161. ok, EBS degradation

162. seen this in the past

163. start evacuating postgres/redis servers

164. but… why isn't kafka impacted?

165. hypothesis: our healthchecks weren't sufficient

166. false!

167. found several kafka nodes had been automatically replaced!

168. can you spell

170. HANDLING FAILURE

171. health checks

172. kafka does much of the work

173. controller in sync replica faster

174. our job: make the cluster fully healthy again

175. two kinds of failures:

176. server fails health check

177. continues to fail

178. "hello, automation, have you tried turning it off and on again?"

179. that 3% number?

180. mostly turning shit off and back on again

181. replace the node

182. process fails health check

183. "hello, automation, have you tried turning it off and on again?"
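
One way to read the two failure kinds above is as a small decision function. This is a hypothetical sketch, not Heroku's automation: the action names and the `restart_budget` threshold are invented for illustration.

```python
def remediation(kind: str, consecutive_failures: int, restart_budget: int = 1) -> str:
    """Pick the next automated action for a failing health check.

    kind is "server" (the whole machine fails its check) or "process"
    (the Kafka process fails its check). All names here are illustrative.
    """
    if kind == "process":
        return "restart_process"      # turn it off and on again
    if kind == "server":
        if consecutive_failures <= restart_budget:
            return "restart_server"   # turn it off and on again
        return "replace_node"         # it continues to fail
    raise ValueError(f"unknown failure kind: {kind}")

print(remediation("process", 1))
print(remediation("server", 1))
print(remediation("server", 3))
```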

184. automation saved us

185. wooo

186. LESSONS

187. everything worked?

188. kafka's HA

189. rack aware replication

190. what can we do better?

191. detect EBS is to blame

192. internal dashboard app

194. an aside: safe automation

195. unclean.leader.election

196. min.insync.replicas

199. one thing at a time

200. takeaways: kafka's HA; A U T O M A T I O N; safe automation

201. 3am

202. broker won't restart

203. at all

204. well, kafka is HA enough, and this cluster has 8 brokers

205. back to sleep!

206. next day:

207. not a 3am ops monkey

208. coffee breakfast

209. can't scan the fleet yet, don't know what we're looking for

210. grab a shovel

211. in kafka's logs:

212. Recovering unflushed segment

213. happens within 20s of broker boot

214. syslog though…

215. There is insufficient memory for the Java Runtime Environment to continue.

216. 🆒

218. only this cluster has seen this error

219. not about to restart other things!

220. preserving debugging info

221. but first, fix the cluster

222. happens to be an internal cluster

223. staging has mirrored traffic

224. it reproduces perfectly there!

225. talk to internal team

226. replaced the node

227. time for that shovel

228. hypothesis: memory leak somehow

229. query memory using jmx in a tight loop during boot

230. ♥ JMX exposure starts *super* early

231. uhh, max of 63.2MB

232. -XX:+HeapDumpOnOutOfMemoryError

233. on heap vs off heap

234. off heap :(

236. use sysdig to look at mmap calls

237. no notable patterns

238. time for a walk

239. periodic reminder: you are human

240. periodic reminder: you are human so is your team

241. help your brain out

242. hunch!

243. this cluster switched to gzip recently

244. gzip might allocate native memory…

245. let me google that for you

246. "JVM gzip memory leak"

247. http://www.evanjones.ca/java-native-leak-bug.html

249. "This shows that 94% of the "live" blocks were allocated by Java_java_util_zip_Deflater_init and deflateInit2 (part of zlib)"

250. search jira: nothing

251. file KAFKA-3933

252. fix it!

253. time to look at the source

254. you can grep!

255. grep gzip src/main -ri

256. follow the chain to

257. ByteBufferMessageSet

258. how is this used during startup?

259. we have that log message from before…

260. Recovering unflushed segment

261. ok, it comes from Log

262. LogSegment.recover

263. // we need to decompress the message, if required, to get the offset of the first uncompressed message
     val startOffset =
       entry.message.compressionCodec match {
         case NoCompressionCodec => entry.offset
         case _ => ByteBufferMessageSet.deepIterator(entry.message).next().offset
       }

265. ByteBufferMessageSet.deepIterator(entry.message).next().offset

266. no deepIterator.close()

267. but jvm not under heap pressure!

268. finalizers

269. patch to call close
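
The shape of that first patch, closing the decompressor as soon as the first inner offset has been read rather than leaving cleanup to finalizers, can be illustrated with a Python analogue. This is not Kafka's Scala code: the wire format here (a gzip stream of 8-byte big-endian offsets) is invented for the example.

```python
import gzip
import io
import struct

def first_inner_offset(compressed: bytes) -> int:
    """Read only the first 8-byte offset from a gzip-compressed wrapper.

    The `with` block closes the decompression stream deterministically,
    instead of relying on garbage collection to release it later.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
        (offset,) = struct.unpack(">q", stream.read(8))
    return offset

# Hypothetical wrapper message holding three inner offsets.
inner_offsets = (47580165, 47580166, 47580167)
wrapper = gzip.compress(b"".join(struct.pack(">q", o) for o in inner_offsets))
first = first_inner_offset(wrapper)
print(first)
```

On the JVM the stakes are higher than in this analogue: the zlib buffers live in native memory, so a heap that is under no pressure gives the garbage collector no reason to run the finalizers that would free them.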

271. super ugly, big, introduced new abstractions

274. much nicer!

275. no real loss, if message format > 0 we already do this work

276. tested patched version in staging

277. Ship it!

280. Lessons

281. 29k?

282. that 3%?

283. novel failure

285. the space to solve real problems

286. take a break

287. 1. Impact 2. Mitigate 3. Fix 4. Follow up

291. takeaways: take a break; kafka is not perfect; A U T O M A T I O N

292. Conclusion

293. if you run kafka in production

294. happy to talk

295. you can't waste my time

299. hundreds of clusters with 5 people

300. Tom Crayford Heroku @t_crayford
