Tuning TCP and NGINX on EC2
Who are we? 
Chartbeat measures and monetizes attention on the web. Working with 80% of 
the top US news sites and global media sites in 50 countries, Chartbeat brings 
together editors and advertisers to identify in real time the active time an 
audience consumes articles, videos, paid content, and display advertising. 
● Founded in 2009 
● Hosted on AWS, 400-500 servers
depending on time of day 
● Around 180k - 220k req/sec 
● 6 - 9 million concurrents 
chartbeat.com/totaltotal 
Who am I? 
● Sr Web Operations Engineer 
● Previously worked at 
○ Bitly 
○ TheStreet.com 
○ Promotions.com 
Traffic Characteristics 
Every 15 seconds:
213-byte request size
43-byte response size
Problem 
● Reports of slowness from some customers 
● Taking 3 seconds to send data 
Default Retransmission Timeout 
RFC 1122: Section 4.2.3.1 
The following values SHOULD be used to initialize the 
estimation parameters for a new connection: 
(a) RTT = 0 seconds. 
(b) RTO = 3 seconds. (The smoothed variance is to be 
initialized to the value that will result in this RTO).
flickr: wallyg
flickr: oregondot
Now what? 
tcpdump + Wireshark confirm the retransmissions
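A minimal capture along these lines is enough to check for yourself; the interface and port here are assumptions for illustration:

$ tcpdump -i eth0 -s 0 -w pings.pcap 'tcp port 80'

Open pings.pcap in Wireshark and apply the display filter tcp.analysis.retransmission to see only the retransmitted segments.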
DON’T GRAPH ALL THE THINGS 
● Graph only relevant metrics 
○ otherwise you’ll end up with a ton of red herrings
Sources of info 
● ss -s 
○ summary of socket statistics 
TCP: 10678 (estab 2503, closed 8167, orphaned 0, synrecv 0, timewait 8167/0), 
ports 0 
● netstat -s 
"tcp_active_connections_openings", 
"tcp_connections_aborted_due_to_timeout", 
"tcp_data_loss_events", 
"tcp_failed_connection_attempts", 
"tcp_other_tcp_timeouts", 
"tcp_passive_connection_openings", 
"tcp_segments_retransmited", 
"tcp_segments_send_out", 
"tcp_syns_to_listen_sockets_dropped", 
"tcp_times_the_listen_queue_of_a_socket_overflowed", 
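To pull just the interesting counters out of the full netstat output, a quick grep sketch works (the patterns are illustrative):

$ ss -s
$ netstat -s | egrep -i 'retrans|listen|timeout'

Each matching line is a cumulative counter since boot, so graph the deltas rather than the raw values.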
TCP/IP Illustrated, Volume 1, Second Edition
Logster + Graphite 
https://github.com/etsy/logster 
Tails logs, generates metrics and 
outputs to Graphite or Ganglia
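A typical invocation looks roughly like this; the parser class, log path, and Graphite host are placeholders, not our exact setup:

$ logster --output=graphite --graphite-host=graphite.example.com:2003 \
    SampleLogster /var/log/nginx/access.log

Run it from cron every minute; it keeps track of its position in the log between runs.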
FINDINGS 
Sources of info 
● netstat -s 
"tcp_active_connections_openings", 
"tcp_connections_aborted_due_to_timeout", 
"tcp_data_loss_events", 
"tcp_failed_connection_attempts", 
"tcp_other_tcp_timeouts", 
"tcp_passive_connection_openings", 
"tcp_segments_retransmited", 
"tcp_segments_send_out", 
"tcp_syns_to_listen_sockets_dropped", 
"tcp_times_the_listen_queue_of_a_socket_overflowed", 
Values > 1 can’t be good
Confirmed what we suspected
WHUT
net.ipv4.tcp_max_syn_backlog
net.core.somaxconn
Nginx: listen backlog=####
Systems Performance: Enterprise and the Cloud by Brendan Gregg, p. 492
Insane Defaults 
● net.core.netdev_max_backlog = 1000 
○ Per CPU backlog? 
○ Network Frames 
● net.ipv4.tcp_max_syn_backlog = 128 
● net.core.somaxconn = 128 
● nginx listen backlog = 511 ?!? 
○ Silently truncated to somaxconn value 
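To see what a given instance is actually running with, query the kernel directly:

$ sysctl net.core.netdev_max_backlog net.ipv4.tcp_max_syn_backlog net.core.somaxconn
net.core.netdev_max_backlog = 1000
net.ipv4.tcp_max_syn_backlog = 128
net.core.somaxconn = 128

(Output shown assumes the defaults above; your kernel may differ.)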
New Values 
● net.core.netdev_max_backlog = 16384 
● net.ipv4.tcp_max_syn_backlog = 65536 
● net.core.somaxconn = 16384 
● nginx listen backlog = 16384 
○ should be <= somaxconn 
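To make the new values survive a reboot, drop them into a sysctl file and load it; the file path is illustrative:

# /etc/sysctl.d/10-tcp-tuning.conf
net.core.netdev_max_backlog = 16384
net.ipv4.tcp_max_syn_backlog = 65536
net.core.somaxconn = 16384

$ sysctl -p /etc/sysctl.d/10-tcp-tuning.conf

The nginx backlog goes on the listen directive (see the NGINX section below) and needs an nginx reload to take effect.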
Results
Further settings explored 
net.ipv4.tcp_slow_start_after_idle 
net.ipv4.tcp_max_tw_buckets 
net.ipv4.tcp_rmem/wmem
net.ipv4.tcp_fin_timeout 
net.ipv4.tcp_mem 
net.ipv4.tcp_slow_start_after_idle 
Set to 0 to ensure connections don’t go back to 
default window size after being idle too long. 
Example: HTTP KeepAlive 
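One line in the same sysctl file covers it:

net.ipv4.tcp_slow_start_after_idle = 0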
net.ipv4.tcp_max_tw_buckets 
Max number of sockets in TIME_WAIT. We 
actually set this very high, since before we 
moved instances behind an ELB it was normal 
to have 200k+ sockets in TIME_WAIT state. 
Exceeding this leads to sockets being torn
down until the count is back under the limit.
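To see how close you are to the limit, count TIME_WAIT sockets directly (subtract one for the header line):

$ ss -tan state time-wait | wc -l
$ sysctl net.ipv4.tcp_max_tw_buckets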
net.ipv4.tcp_rmem/wmem
Format: min default max (in bytes)
The kernel will autotune the number of bytes to
use for each socket based on these settings. It
will start at default and stay between the
min and max.
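Reading the current triple shows the format; the exact numbers below are just an example and vary by kernel and instance size:

$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 87380 6291456

Here each socket’s receive buffer starts at 87380 bytes and is autotuned between 4096 and 6291456.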
net.ipv4.tcp_fin_timeout 
The time a connection should spend in the
FIN_WAIT_2 state. The default is 60 seconds;
lowering it frees memory more quickly and
transitions the socket to TIME_WAIT sooner.
This will NOT reduce the time a socket is in
TIME_WAIT, which is set to 2 * MSL (maximum
segment lifetime).
net.ipv4.tcp_fin_timeout continued... 
The TIME_WAIT length is hardcoded in the kernel at 60 seconds!
https://github.com/torvalds/linux/blob/master/include/net/tcp.h#L115

#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT state, about 60 seconds */
net.ipv4.tcp_mem 
Format: low pressure max (in pages!)
Below low, the kernel won’t put pressure on
sockets to reduce memory usage. Once pressure
is hit, sockets reduce memory until usage falls
back below low. If max is hit, no new sockets.
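Because the values are in pages, convert with the page size to reason about them in bytes; the numbers are only an example:

$ sysctl net.ipv4.tcp_mem
net.ipv4.tcp_mem = 188319 251092 376638
$ getconf PAGESIZE
4096

With 4 KB pages, that max of 376638 pages is roughly 1.5 GB of memory for all TCP sockets combined.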
net.ipv4.tcp_tw_recycle (DANGEROUS) 
● Clients behind NAT/stateful firewalls will get
dropped
● *99.99999999% of time should never be 
enabled 
* Probably 100% but there may be a valid case out there 
net.ipv4.tcp_tw_reuse 
● Makes a safer attempt at freeing sockets in 
TIME_WAIT state. 
Recycle vs Reuse Deep Dive 
http://bit.ly/tcp-time-wait 
One last thing… 
TCP Congestion Window - initcwnd (initial) 
Starting in kernel 2.6.39, set to 10.
The previous default was 3!
http://research.google.com/pubs/pub36640.html 
Older Kernel? 
$ ip route change default via 192.168.1.1 dev eth0 proto static initcwnd 10
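To confirm what a route is using, read it back; the gateway and device mirror the example above:

$ ip route show
default via 192.168.1.1 dev eth0 proto static initcwnd 10

If initcwnd is absent from the output, the route is using the kernel’s built-in default.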
NGINX
listen statement 
● backlog 
○ limited by net.core.somaxconn 
● deferred
○ TCP_DEFER_ACCEPT - wait until we receive a data
packet before passing the socket to the server. Completing
the TCP handshake alone won’t trigger an accept() (combined
example after this list)
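Put together, a listen directive using both options might look like this; the port is a placeholder and the backlog value is the one from our tuning above:

listen 80 backlog=16384 deferred;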
server block 
● sendfile 
○ Saves context switching from userspace on 
read/write. 
○ “zero copy”, happens in kernel space
● tcp_nopush 
○ TCP_CORK 
○ allows the application to control building of the packet, e.g.
pack a packet with a full HTTP response
● tcp_nodelay 
○ Nagle’s Algorithm 
○ Only affects keep-alive connections 
● multi_accept 
○ Accept all connections on the listen queue at once
(careful, can overwhelm workers; sketch below)
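A sketch of how these land in nginx.conf; note that multi_accept actually lives in the events block, and per the speaker notes we run it off:

events {
    multi_accept off;  # constant stream of connections can overwhelm workers
}

http {
    sendfile    on;    # zero copy in kernel space
    tcp_nopush  on;    # TCP_CORK: pack full packets, e.g. a whole HTTP response
    tcp_nodelay on;    # disable Nagle's algorithm on keep-alive connections
}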
Nagle’s Algorithm (tcp_nodelay)
Small payload + need for low latency?
Disable Nagle (tcp_nodelay on)
HTTP Keep-Alive 
● Enabled once behind an ELB
● Given the small payload and 15 seconds between
data points, it was a waste of resources for us to
enable it while exposed directly to the internet
HTTP Keep-Alive cont...
● Also enable on upstream proxies
○ available since nginx 1.1.4
○ *cough* had to upgrade nginx and fix a memory
leak dealing with libevent and keepalives before we
could get this fully set up
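A minimal upstream keepalive setup per the nginx docs; the upstream name and address are placeholders:

upstream backend {
    server 10.0.0.10:8080;
    keepalive 32;                       # idle connections kept open per worker
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;         # upstream keep-alive needs HTTP/1.1
        proxy_set_header Connection ""; # clear the default "Connection: close"
    }
}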
ELB
Cross-Zone load balancing 
Ensures requests to 
each ELB in each AZ 
go to ALL instances 
in ALL AZs 
Idle Connection Timeout 
● Defaults to 60 seconds 
● Finally tunable via API. 
● Tweak if doing anything long-lived, e.g.
WebSockets, or ensure you are sending
“pings”
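With the classic ELB API it’s a one-liner; the load balancer name and timeout value are illustrative:

$ aws elb modify-load-balancer-attributes \
    --load-balancer-name my-elb \
    --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":300}}'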
Connection draining 
“Graceful” removal of a node from the ELB; ensures
existing connections can finish instead of a
hard cutoff (the old behavior)
Metrics to monitor 
● SurgeQueueLength (Not Good) 
A count of the total number of requests that are 
pending submission to a registered instance. 
● SpilloverCount (BAD) 
A count of the total number of requests that 
were rejected due to the queue being full. 
Conclusions 
● The internet is full of lies 
● With enough traffic, tweaking system and
application defaults becomes a necessity
● Find trusted sources (Me? Maybe?) for 
settings and test in staging environments 
● Measure impact and understand what metrics 
may be impacted by your tweaks 
● Don’t get lost in all the sysctl settings 
● TCP is complicated 
FIN 
FIN_WAIT_1 
FIN_WAIT_2 
TIME_WAIT
Resources and References 
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
man tcp(7)
Additional reading 
http://engineering.chartbeat.com 
The full story about our experiences with our
architecture and the material discussed in
these slides
Questions / Comments? 
@Lintzston 
justin@chartbeat.com



Editor's Notes

  • #4: Record traffic during the US election and the World Cup; USA vs. Germany hit 10+ million concurrents. The presidential election at the time was 2x our normal traffic.
  • #6: High packet rate, low bandwidth. The 43 bytes is a small empty image we can send; we need it for error-handling purposes on the frontend side, since we can’t send an empty response.
  • #7: Reports from users about slowness in sending “pings” to our servers. Slow clients and slowness don’t really affect our numbers much as long as data is arriving in < 5 seconds. We asked for some numbers and saw pings taking around 3 seconds. That number sets off some alarms.
  • #8: These two numbers should raise alarms when you are doing any troubleshooting with TCP connections. 3 seconds is the default timeout for retrying a connection; it backs off and retries after 6 more seconds, so 9 seconds total for a connection.
  • #9: Maybe it’s on the client side? How do we know it’s on us? At the time our Pingdom monitoring didn’t show anything unusual; we later learned this is definitely not enough.
  • #11: Especially if you don’t have a good baseline for some metrics, you will end up chasing oddities in graphs that may be completely irrelevant.
  • #12: We only graphed relevant info from netstat -s; there are a ton of metrics that may be useful for debugging other issues, but we started with these since they appeared most related to the issue at hand. For example, “fast retransmit”, while relevant, wouldn’t indicate a delay of 3 seconds, since it bypasses the timeout. Push these to ganglia/graphite.
  • #14: Had to enable logging and discard after an hour due to space constraints. Rotation impacts performance; switching to ext4 on the log volume helped, with no compression.
  • #16: Confirmed issues, didn’t give us a source. But some symptoms to look into.
  • #18: Two queues: the first is for half-established connections; you can make it large to help with SYN floods, although given today’s flood attacks it’s probably not much help. The second queue is for established connections waiting for your app to pluck them off. tcp_max_syn_backlog is system-wide; net.core.somaxconn is per process.
  • #19: Still not 100% sure what this controls; from looking at the kernel source, it appears to be this. The nginx backlog originally wasn’t covered in the documentation; I had to find it in the source code and from googling.
  • #20: Didn’t know about the nginx listen backlog at first. We initially changed the first three values and saw a slight decrease in timeouts and listen queue overflows; it took a bunch of reading until I learned that each application has to set its own backlog queue, and even further research to find what nginx’s default value was.
  • #24: The kernel will reuse sockets in TIME_WAIT when it can; a socket in the TIME_WAIT state actually doesn’t take up any resources.
  • #25: Tweak these if you are sending/receiving large amounts of data and want to improve throughput. We changed them, but our throughput is fairly low per server, so we didn’t see any measurable impact.
  • #26: Internet gets this wrong a lot. TIME_WAIT takes up no memory
  • #27: Everyone on the internet gets this wrong! If you really want to change the TIME_WAIT time, see ip_conntrack_tcp_timeout_time_wait in the ip_conntrack module.
  • #28: The pressure relates to the rmem and wmem settings we set earlier.
  • #29: Definitions are wrong and harmful settings get recommended; I’ve even seen this in a lot of books when searching books.google.com for settings.
  • #31: If you are reading any blog/book that recommends enabling this, run far away
  • #33: Amazing read into why recycle is bad, and why TIME_WAIT exists
  • #34: Allows for more data in flight; if you are serving up larger content, you will see nice improvements here.
  • #36: Defer saves resources where the handshake occurs but no data is sent or data is delayed. It leaves nginx free to deal with connections already sending a data payload.
  • #37: We set both to on for our small payloads: tcp_nopush means the application controls packet building; tcp_nodelay is seamless to the developer, it just happens. multi_accept we have off; given our constant stream of connections, it can overwhelm things downstream.
  • #39: Lower CPU utilization as well
  • #42: Previous behavior: if a request hit the ELB in AZ us-east-1d, it would only get routed to instances there. This change really smoothed out distribution for us.
  • #45: Indicates capacity issues
  • #46: It’s easy to get carried away and tune too many things (premature optimization) or settings which may have little to no effect for you.
  • #48: Don’t trust random blogs; they’re filled with terrible information, with sysctl settings defined wrong or extremely vague.