Lessons learned while taking Presto from alpha to production at Twitter. Presented at the Presto meetup at Facebook on 2016.03.22.
Video: https://www.facebook.com/prestodb/videos/531276353732033/
1. Presto at Twitter
From Alpha to Production
Bill Graham - @billgraham
Sailesh Mittal - @saileshmittal
March 22, 2016
Facebook Presto Meetup
11. ● Building a dedicated Mesos cluster
○ 200 nodes
○ 128 GB RAM
○ 56 cores
○ 10 GbE
● One worker per container per host
● Consistent support model within Twitter
Mesos/Aurora
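The "one worker per container per host" layout above might look like the following Aurora job definition. This is a hypothetical sketch, not Twitter's actual config: the cluster/role names, launcher command, and resource values are illustrative, and the DSL details should be checked against your Aurora version.

```
# Hypothetical Aurora job: one Presto worker container per host.
run_worker = Process(
  name = 'presto_worker',
  cmdline = 'bin/launcher run --etc-dir=etc')

worker_task = Task(
  processes = [run_worker],
  resources = Resources(cpu = 56, ram = 128*GB, disk = 64*GB))

jobs = [Service(
  cluster = 'example', role = 'presto', environment = 'prod',
  name = 'presto_worker', instances = 200, task = worker_task,
  constraints = {'host': 'limit:1'})]  # at most one instance per host
```

Sizing the task to nearly the full machine (56 cores, 128 GB) plus the host constraint effectively dedicates each node to a single worker, matching the cluster specs listed above.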
17. ● User-level auth required
● UGI.proxyUser when accessing HDFS (PR #4382 and Teradata PR #105)
● Manage access via per-cluster LDAP groups that Presto can proxy as
● HMS cache complicates things
● HMS file-based auth on writes only
Authorization
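The per-cluster LDAP check above can be sketched as follows. This is an illustrative model, not the actual Twitter code: the group and cluster names are made up, and group membership is stubbed with dicts where a real deployment would query LDAP.

```python
# Hypothetical sketch: each cluster maps to an LDAP group, and Presto
# may only proxy HDFS access as users who belong to that group.
LDAP_GROUPS = {
    "presto-adhoc-users": {"alice", "bob"},      # stubbed LDAP data
    "presto-batch-users": {"svc-pipeline"},
}

CLUSTER_GROUP = {
    "adhoc": "presto-adhoc-users",
    "batch": "presto-batch-users",
}

def may_proxy_as(cluster: str, user: str) -> bool:
    """True if this cluster is allowed to proxy as `user`."""
    group = CLUSTER_GROUP.get(cluster)
    return group is not None and user in LDAP_GROUPS.get(group, set())
```

Under this scheme, adding a user to a cluster's LDAP group is the whole onboarding step; Presto itself needs no per-user configuration.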
18. ● Hadoop client memory leaks (user x query x FileSystems)
● GC Pressure on coordinator
● Implemented FileSystem cache (user x FileSystems)
Authorization Challenges
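The leak and its fix above come down to the cache key: caching a Hadoop FileSystem client per (user, query) accumulates clients as queries pile up, while caching per user bounds the count. A minimal sketch of the per-user cache idea (not the actual Presto/Hadoop implementation):

```python
import threading

class PerUserFileSystemCache:
    """Cache one FileSystem-like client per user, not per (user, query)."""

    def __init__(self, factory):
        self._factory = factory   # callable: user -> client
        self._cache = {}          # user -> client
        self._lock = threading.Lock()

    def get(self, user):
        with self._lock:
            fs = self._cache.get(user)
            if fs is None:
                fs = self._factory(user)
                self._cache[user] = fs
            return fs

# Usage: repeated queries by the same user reuse one client,
# so the client count is bounded by the number of distinct users.
cache = PerUserFileSystemCache(lambda user: object())
assert cache.get("alice") is cache.get("alice")
assert cache.get("alice") is not cache.get("bob")
```

Bounding the cache by users rather than queries is what relieved the coordinator GC pressure described above.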
19. java.lang.OutOfMemoryError: unable to create new native thread
● Queries failing on the coordinator
● Coordinator is thread-hungry, up to 1500 threads
● Default user process limit is 1024
$ ulimit -u
1024
● Increase ulimit
Stability #1
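To make the ulimit increase survive reboots, the per-user process cap (which on Linux also bounds threads) can be raised in /etc/security/limits.conf. The account name and value below are illustrative:

```
# /etc/security/limits.conf — raise the process/thread cap for the
# account running the coordinator (name and value are examples):
presto  soft  nproc  8192
presto  hard  nproc  8192
```

With the coordinator using up to 1500 threads, anything comfortably above that (rather than the 1024 default) avoids the native-thread OutOfMemoryError shown above.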
20. Encountered too many errors talking to a worker node
● Outbound network spikes hitting caps (300 Mb/s)
● Coordinator sending plan was costly (Fixed in PR #4538)
● Tuned timeouts
● Increased Tx cap
Stability #2
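The timeout tuning above is done in the coordinator/worker config.properties. The property names below follow the "exchange.http-client.*" prefix used for Presto's internal exchange client, but they vary across versions, so treat this as an assumption to verify against your release; the values are illustrative:

```
# etc/config.properties — illustrative exchange timeout tuning;
# verify property names against your Presto version.
exchange.http-client.connect-timeout=10s
exchange.http-client.request-timeout=30s
```

Looser timeouts only mask the underlying cause (bandwidth caps and, as the next slide shows, GC pauses), so they are best paired with the Tx cap increase mentioned above.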
21. Encountered too many errors talking to a worker node
● Timeouts still being hit
● Correlated GC pauses with errors
● Tuned GC
● Changed to the G1 GC collector - BAM!
● GC pauses: tens of seconds -> hundreds of millis
Stability #3
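The switch to G1 is made in Presto's etc/jvm.config. The flags below are standard HotSpot options; the heap size is illustrative and should match your hardware:

```
# etc/jvm.config — enable G1 (heap size is an example):
-Xmx100G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
```

Because the coordinator's exchange timeouts were being tripped by long stop-the-world pauses, cutting pause times from tens of seconds to hundreds of milliseconds also eliminated the "too many errors talking to a worker node" failures above.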