Commit a563193

committed
Finish up analysis and edits for clarity.
1 parent 8dc758c commit a563193

File tree

1 file changed

+44
-21
lines changed


nginx-log-analysis.ipynb

@@ -21,11 +21,13 @@
 "source": [
 "By Jess Johnson [http://grokcode.com](http://grokcode.com)\n",
 "\n",
-"This notebook goes through an analysis of Nginx access logs in order to do capacity planning. I'm looking at access logs for [Author Alcove](http://authoralcove.com) which was hit by a big traffic spike when it spent around 24 hours at the top of [/r/books](http://reddit.com/r/books/). The site was hugged to death by reddit and many visitors couldn't access the site due to 50x errors. So let's estimate how much extra capacity would be needed to survive this spike.\n",
+"This notebook analyzes Nginx access logs in order to estimate the capacity needed to survive a traffic spike. I'm looking at access logs for [Author Alcove](http://authoralcove.com) which was hit by a big traffic spike when it spent around 24 hours at the top of [/r/books](http://reddit.com/r/books/). The site was hugged to death by reddit. Visitors experienced very slow load times, and many people couldn't access the site at all due to 50x errors. So let's estimate how much extra capacity would be needed to survive this spike.\n",
 "\n",
 "The source for this notebook is located on [github](https://github.com/grokcode/ipython-notebooks). See a mistake? Pull requests welcome.\n",
 "\n",
-"Thanks to Nikolay Koldunov for his [notebook on Apache log analysis](http://nbviewer.ipython.org/github/koldunovn/nk_public_notebooks/blob/master/Apache_log.ipynb), and thanks to my bro Aaron for the much needed optimism and server optimization tips while everything was on fire."
+"Thanks to Nikolay Koldunov for his [notebook on Apache log analysis](http://nbviewer.ipython.org/github/koldunovn/nk_public_notebooks/blob/master/Apache_log.ipynb), and thanks to my bro Aaron for the much needed optimism and server optimization tips while everything was on fire.\n",
+"\n",
+"OK let's get started."
 ]
},
{
@@ -80,7 +82,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We will also use [apachelog](https://code.google.com/p/apachelog/), which is a module for parsing apache logs, but it works fine with nginx logs if we give it the right format string. You can install it with `pip install apachelog`. "
+"We will also use [apachelog](https://code.google.com/p/apachelog/), which is a module for parsing apache logs, but it works fine with nginx logs as long as we give it the right format string. You can install it with `pip install apachelog`. "
 ]
},
{
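As a rough illustration of what that parsing produces, here is a minimal regex-based sketch of parsing one line in nginx's default "combined" log format. The sample line is invented for the example, and the regex is only an approximation of what apachelog does with the format string:

```python
import re

# Invented sample line in nginx's default "combined" log format.
line = ('1.2.3.4 - - [18/Feb/2014:09:20:51 +0000] "GET / HTTP/1.1" 200 612 '
        '"http://reddit.com/r/books/" "Mozilla/5.0"')

# Rough regex equivalent of the format codes used for this layout:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

entry = LOG_RE.match(line).groupdict()
print(entry['status'], entry['request'])
```

Each parsed line becomes a dict keyed by field name, which is the shape the rest of the analysis works with.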
@@ -105,7 +107,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"I started out by doing some command line preproccessing on the log in order to remove bots. I used `egrep -v` to filter out the bots that were hitting the site the most often. These were Googlebot, Bingbot, the New Relic uptime checker, Buidu spider, and a few others. A more careful approach would filter out everything on one of the known bot lists ([like this one](http://www.robotstxt.org/db.html)), but I'm going to play it a bit fast and loose.\n",
+"I started out by doing some command line preprocessing on the log in order to remove bots. I used `egrep -v` to filter out the bots that were hitting the site the most often. These were Googlebot, Bingbot, the New Relic uptime checker, the Baidu spider, and a few others. A more careful approach would filter out everything on one of the known bot lists ([like this one](http://www.robotstxt.org/db.html)), but I'm going to play it a bit fast and loose.\n",
 "\n",
 "First of all let's get a sample line out of the `access.log` and try to parse it. Here is a description of the codes in the log format we are working with:\n",
 "\n",
@@ -468,6 +470,15 @@
 "Analysis"
 ]
},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let's graph the data to visualize what is happening.\n",
+"\n",
+"First let's increase the graph size."
+]
+},
{
 "cell_type": "code",
 "collapsed": false,
@@ -480,6 +491,13 @@
 "outputs": [],
 "prompt_number": 11
},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Now let's graph the requests hitting the web server. `10t` will use a 10 minute interval size, so each point on the graph shows the number of requests in a 10 minute window."
+]
+},
{
 "cell_type": "code",
 "collapsed": false,
@@ -513,7 +531,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Above we see that we were peaking at around 4500 requests every 10 minutes, or about 450 requests a minute. There was a quick ramp up as the site climbed to the top position on /r/books, then a drop off overnight (for US timezones), another small peak the next day as people woke up, and then a decline as the site fell back down the /r/books page.\n",
+"Above we see that we were peaking at around 4500 requests every 10 minutes, or about 450 requests a minute. There was a quick ramp up as the site climbed to the top position on /r/books, then a drop off overnight (for US timezones), another peak the next day as people woke up, and then a decline as the site fell back down the /r/books page.\n",
 "\n",
 "Let's see how the server held up by looking at some response codes."
 ]
@@ -596,9 +614,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"OK that doesn't look too bad. The vast majority of requests were served a 200 response.\n",
+"OK that doesn't look too bad. The vast majority of requests were served a 200 (OK) response.\n",
 "\n",
-"Let's graph the response codes over time, with a sample timespan of 1 hour."
+"Let's graph the most common response codes over time, with a sample timespan of 1 hour."
 ]
},
{
@@ -788,7 +806,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"This doesn't look too bad either, but it's hard to make sense of what is going on with so many different response codes. Let's group them into success (webserver handled it as expected) and failure (webserver error or client closed request).\n",
+"This doesn't look too bad either, but it's hard to make sense of what is going on with so many different response codes. Let's group them into success (web server handled it as expected) and failure (web server error or client closed request).\n",
 "\n",
 " 200 413236 Success OK \n",
 " 502 39769 Failure Bad Gateway\n",
@@ -1051,7 +1069,15 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"It looks like about 2,000 requests / hour is all uwsgi can handle without starting to throw errors. That is only about 1 request every two seconds so that is a pretty big problem."
+"It looks like about 2,000 requests / hour is all uwsgi can handle before it starts to throw errors. That is only about 1 request every two seconds so that is a pretty big problem.\n",
+"\n",
+"We needed to handle 10,000 requests / hour, but we were only able to handle 2,000 requests / hour. So it looks like we would need roughly 5x the capacity to handle a spike of this size. This is a little bit misleading though because many visitors abandoned the site after hitting 50X errors and long page loads.\n",
+"\n",
+"Here is a view of the same traffic spike from Google Analytics.\n",
+"\n",
+"![Google Analytics Pageviews vs. Pages per Visit](files/data/google-analytics.png)\n",
+"\n",
+"During the spike, pages per visit were roughly half of what they normally were, so let's double the 5x estimate and plan for 10x the capacity to handle this load."
 ]
},
{
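The capacity estimate in that paragraph works out as follows:

```python
# The capacity arithmetic from the paragraph above, spelled out.
peak_rate = 10000        # requests/hour arriving during the spike
sustainable_rate = 2000  # requests/hour uwsgi handled before erroring

raw_multiplier = peak_rate / sustainable_rate   # 5x
# Pages per visit were roughly halved during the spike, so double it.
planned_multiplier = raw_multiplier * 2         # 10x
print(planned_multiplier)
```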
@@ -1066,15 +1092,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The easy solution is to upgrade the server running uwsgi and increase the number of workers handling requests. At first glance it looks like we would need roughly 5x the capacity to handle a spike of this size. This is a little bit misleading though because many visitors abandoned the site after hitting 50X errors and long page loads.\n",
-"\n",
-"Here is a view of the same traffic spike from Google Analytics.\n",
+"One solution is to upgrade the server running uwsgi and use ten times the number of workers to handle requests, but let's see if we can be a bit smarter.\n",
 "\n",
-"![Google Analytics Pageviews vs. Pages per Visit](files/data/google-analytics.png)\n",
-"\n",
-"Pages per visit are roughly half of what they normally are during the spike, so let's double the 5x estimate and plan for 10x the capacity to handle this load.\n",
+"Another solution is to optimize the web app by shortening long-running requests, but I have already picked most of the low hanging fruit there.\n",
 "\n",
-"There are a couple of choices for optimization here. The most obvious way to handle it is to just beef up the server running the app and increase the number of uwsgi workers, but let's see if we can be a bit smarter. There are also optimizations to be made to the web app to shorten long running requests, but I have already picked most of the low hanging fruit there. Let's investigate to see if there are any opportunities to prevent requests from hitting uwsgi in the first place.\n",
+"Let's investigate to see if there are any opportunities to prevent requests from hitting uwsgi in the first place.\n",
 "\n",
 "First let's get a count of the pages with the most requests.\n"
 ]
@@ -1143,6 +1165,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"There isn't much that can be done with the POST requests - they need to hit uwsgi so we can save new accounts or ratings in the database. So let's move on to the GET requests. There are two basic techniques for having nginx serve these pages. First, we could have uwsgi store the fully rendered page in memcached or similar, then have nginx try to pull the page from memcached, falling back to uwsgi if the page isn't in the cache. The second idea is to have uwsgi create a static file, and then let nginx serve that if it exists. Unfortunately, in this case both of those solutions are problematic. It is beyond the scope of this notebook to go into details (hopefully I will have a separate blog post on that soon), but the gist is that for most of these pages, the content changes depending on who is viewing them, so they can't readily be cached at the page level.\n",
+"\n",
 "The biggest gain would be to make the homepage static. The homepage will redirect to a user's recommended page if they are already logged in, but we could possibly detect logged in users with nginx via the http headers and only serve the static page to logged out visitors. Let's see what proportion of visitors who hit the homepage were logged in already."
 ]
},
@@ -1177,10 +1201,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"About (19331 / (3053 + 946 + 19331)) * 100 = 83% of visitors were logged out. So that nginx configuration change would have saved a bit more than 40,000 requests.\n",
+"Of the visitors who were able to access the homepage, (19331 / (3053 + 946 + 19331)) * 100 = 83% were logged out. So that nginx configuration change would have saved a bit more than 40,000 requests.\n",
 "\n",
-"Out of how many total requests?\n",
-"\n"
+"Out of how many total?"
 ]
},
{
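As a quick check of the percentage quoted in that cell:

```python
# The logged-out share computed above, as a quick check. The three
# counts are the ones from the expression in the cell above.
logged_out = 19331
logged_in = 3053 + 946   # the two logged-in counts

share = logged_out / (logged_in + logged_out) * 100
print(round(share))
```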
@@ -1210,7 +1233,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"So that optimization would prevent 13% of requests from even reaching uwsgi.\n",
+"So that change alone would offload 13% of uwsgi's request load to nginx. That's a good start.\n",
 "\n",
 "Another thing that jumps out is that `/apple-touch-icon.png` and `/apple-touch-icon-precomposed.png` don't actually exist but have to be passed off to uwsgi before they 404. Setting up nginx to serve anything ending in `.png` will save some requests. Let's see if there are any other files like that."
 ]
@@ -1281,7 +1304,7 @@
 "source": [
 "That is another 3.5% of requests.\n",
 "\n",
-"TODO: Explain a bit about the problems with caching other files."
+"Between those two changes we will save 16.5% of requests from hitting uwsgi. It isn't enough to prevent the need for a server upgrade, but it does get us a bit closer."
 ]
}
],
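The combined saving quoted in the closing cell is just the sum of the two estimates:

```python
# Combining the two savings estimates from the analysis above.
homepage_saving = 13.0     # % of requests offloaded by a static homepage
static_files_saving = 3.5  # % offloaded by serving missing static files
total_saving = homepage_saving + static_files_saving
print(total_saving)
```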
