Inside GitHub with Chris Wanstrath

My name is Chris Wanstrath. I go by @defunkt online.

inside
github

And today I’m going to talk about GitHub.

inside
github

That’s me.

GitHub is what we like to call “social coding.”

You can see what your friends are doing from your dashboard or news feed

Everyone has a profile showing off their code and activity

And you can do things like leave comments on commits.

But it wasn’t always like this.

Originally we just wanted to make a git hosting site.

In fact, that was the first tagline.

git repository hosting

git repository hosting.

That’s what we wanted to do: give us and our friends a place to share git repositories.

a brief
history

let’s start with a brief history

It’s not easy to set up a git repository. It never was.

But back in 2007 I really wanted to.

I had seen Torvalds’ talk on YouTube about git.

But it wasn’t really about git - it was more about distributed version control.

It answered many of my questions and clarified DVCS ideas.

I still wasn’t sold on the whole idea, and I had no idea what it was good for.

CVS
is stupid

But when Torvalds says “CVS is stupid”

and so are
you

“and so are you,” the natural reaction for me is...

At the time the biggest and best free hosting site was repo.or.cz.

Right after I had seen the Torvalds video, the god project was posted up on repo.or.cz

I was interested in the project so I finally got a chance to try it out with some other people.

Namely this guy, Tom Preston-Werner.

Seen here in his famous “I put ketchup on my ketchup” shirt.

I managed to make a few contributions to god before realizing that repo.or.cz was not different.

git was not different.

Just more of the same - centralized, inflexible code hosting.

This is what I always imagined.

No rules. Project belongs to you, not the site. Share, fork, change - do what you want.

Give people tools and get out of their way. Less ceremony.

So, we set off to create our own site.

A git hub - learning, code hosting, etc.

We started with the code browsing and commit viewing...

But once we added the current version of the dashboard, we knew this was different.

And eventually “git repository hosting” gave way to “social coding”

Unleash Your Code
Join 500,000 coders with
over 1,500,000 repositories

What’s special about GitHub is that people use the site in spite of git.

Many git haters use the site because of what it is - more than a place to host
git repositories, but a place to share code with others.

2007 october

The first commit was on a Friday night in October, around 10pm.

2008 january

We launched the beta in January at Steff’s on 2nd street in San Francisco’s SOMA district.

The first non-github user was wycats, and the first non-github project was merb-core.

They wanted to use the site for their refactoring and 0.9 branch.

2008 april

A few short months after that we launched to the public.

Along the way we managed to pick up Scott Chacon, our VP of R&D

Tekkub, our level 80 support druid

Melissa Severini, who keeps us all in check

Kyle Neath, who makes the site pretty

Ryan Tomayko, who helps keep the site running smoothly.

Zach Holman, head of enterprise

Rick Olson, Rails extraordinaire

Eston Bond, Design Generalissimo

Corey Donohoe, Director of Shipology

And Brian Lopez, our bleeding edge cowboy

Oh yeah, and the other founders: PJ and Tom.

github.com

That’s where we’re at today.

So let’s talk about the technical details of the website: github.com

.com as opposed to fi, which I’m not going to get into today.

You’ll have to invite PJ out if you want to hear about that.

the web site

As everyone knows, a web “site” is really a bunch of different components.

Some of them generate and deliver HTML to you, but most of them don’t.

Our site consists of four major code “frameworks” or “apps”

rails
#1

GitHub.com, Gist, etc

resque
#2

Background processing, 50ish different job types currently

smoke
#3

All git calls happen over the wire

utils
#4

Exception logging, stats, helper apps, etc

rails

We use Ruby on Rails 2.2.2 as our web framework.

It’s kept up to date with all the security patches and includes custom patches we’ve added
ourselves, as well as patches we’ve cherry-picked from more recent versions of Rails.

rails

GitHub is about 20,000 lines of Rails code, not counting Rails itself, plugins, or gems.

We found out Rails was moving to GitHub in March 2008, after we had reached out to
them and they had turned us down.

So it was a bit of a surprise.

rails plugins

We currently have 27 Rails plugins installed, and that number is always changing.

technoweenie /
serialized_attributes

rubygems

GitHub depends on about 50 RubyGems

rack

One of the big features in Rails 2.3 is Rack support.

We badly wanted this, but didn’t want to invest the time upgrading.

So using a few open source libraries we’ve wrapped our Rails 2.2.2 instance in Rack.

Now we can use awesome Rack middleware like Rack::Bug in GitHub

Coders created and submitted dozens of Rack middleware for the Coderack competition last year.

I was a judge so I got to see the submissions already. Some of my favorites
were

talison / rack-mobile-detect

sets the X_MOBILE_DEVICE header to the mobile device, if
recognized
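
To make the idea of a Rack middleware concrete, here is a minimal sketch of one that does something similar. It is not the rack-mobile-detect source: the class name `TinyMobileDetect` and the short user-agent pattern are assumptions made for the example. It runs in plain Ruby because a Rack app is just an object that responds to `call(env)`.

```ruby
# Hypothetical, simplified middleware in the spirit of rack-mobile-detect.
class TinyMobileDetect
  # Assumed pattern; a real detector recognizes far more devices.
  DEVICES = /iPhone|iPod|Android|BlackBerry/

  def initialize(app)
    @app = app
  end

  def call(env)
    # String#[] with a regexp returns the matched text, or nil on no match.
    device = env["HTTP_USER_AGENT"].to_s[DEVICES]
    env["X_MOBILE_DEVICE"] = device if device
    @app.call(env)
  end
end

# Any object responding to call(env) is a Rack app, so a lambda works here.
app = lambda do |env|
  body = "device: #{env['X_MOBILE_DEVICE'] || 'none'}"
  [200, { "Content-Type" => "text/plain" }, [body]]
end

stack = TinyMobileDetect.new(app)
```

The downstream app only reads the value from `env`; it does not need to know how the device was detected.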

unicorn

We use unicorn as our application server

- master / worker
- 16 workers
- preforking

unicorn

- instant restart after kill
- hard 30s request timeouts
- control ram growth

unicorn

- 0 downtime deploys
- protects against bad rails startup
- migrations handled old fashioned way
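
The bullets above map fairly directly onto a Unicorn config file. The fragment below is a sketch, not GitHub's actual configuration: only the worker count and the 30 second timeout come from the slides, and the rest follows the pattern in Unicorn's own example config. It is meant to be loaded by Unicorn, not run on its own.

```ruby
# config/unicorn.rb -- illustrative only; everything except the worker
# count and the timeout is an assumption.
worker_processes 16   # one master process, 16 forked workers
timeout 30            # the master kills any worker stuck longer than 30s
preload_app true      # load Rails once in the master, then fork workers

before_fork do |server, worker|
  # During a restart the old master keeps serving while the new master
  # boots. If the new master fails to load Rails it never reaches this
  # hook, so the old one simply carries on. Once the new master is
  # forking workers, ask the old one to quit gracefully.
  old_pid = "#{server.config[:pid]}.oldbin"
  if File.exist?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # the old master is already gone
    end
  end
end
```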

nginx

For serving static content and slow clients, we use nginx

nginx is pretty much the greatest http server ever

it’s simple, fast, and has a great module system

nginx
Limit Zone

Limit simultaneous connections from a client

nginx
Limit Requests

Limit frequency of connections from a client

Anti-DDOS

nginx

I see many people using Rack to do what the Limit modules do.

Don’t.

nginx
memcached

memcached support

can serve directly from memcached

git

The next major part of GitHub is git

grit

We wrote an open source library called Grit
which lets us use git from Ruby

mojombo / grit

you can get it here

it originally shelled out to git and just parsed the responses.

which worked well for a long time.

grit
File.read()

Eventually we realized, however, that File.read() can be 100 times faster

grit
system()

Than shelling out

One of the first things Scott worked on was rewriting the core parts of Grit
to be pure Ruby

Basically a Ruby implementation of Git
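
To show why a file read can stand in for a shell-out, here is a small sketch that reads a git loose object using nothing but a file read and zlib. This is not Grit's code; the helper names are made up for the example, and it writes its own object into a temporary directory so that it needs neither a real repository nor the git binary. It also ignores packfiles, which a real implementation has to handle.

```ruby
require "digest/sha1"
require "fileutils"
require "tmpdir"
require "zlib"

# Path of a loose object: objects/<first two hex chars>/<remaining 38>.
def loose_object_path(git_dir, sha)
  File.join(git_dir, "objects", sha[0, 2], sha[2..-1])
end

# Write a blob the way git stores loose objects:
# "<type> <size>\0<content>", zlib-deflated, named by its SHA-1.
def write_loose_blob(git_dir, content)
  store = "blob #{content.bytesize}\0#{content}"
  sha = Digest::SHA1.hexdigest(store)
  path = loose_object_path(git_dir, sha)
  FileUtils.mkdir_p(File.dirname(path))
  File.binwrite(path, Zlib::Deflate.deflate(store))
  sha
end

# Read it back with a file read and zlib: no fork, no exec.
def read_loose_object(git_dir, sha)
  raw = Zlib::Inflate.inflate(File.binread(loose_object_path(git_dir, sha)))
  header, body = raw.split("\0", 2)
  type, size = header.split(" ")
  { type: type, size: size.to_i, body: body }
end

git_dir = Dir.mktmpdir("fake-git")
sha = write_loose_blob(git_dir, "hello\n")
object = read_loose_object(git_dir, sha)
```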

mojombo / grit

And that’s what we run now

smoke

Kinda.

Eventually we needed to move our git repositories off of our web servers

Today our HTTP servers are distinct from our git servers. The two communicate using smoke

smoke

“Grit in the cloud”

Instead of reading and writing from the disk, Grit makes Smoke calls

The reading and writing then happens on our file servers

bert-rpc

Rather than use Protocol Buffers or Thrift or JSON-RPC, Smoke uses BERT-RPC

bert-rpc
bert : erlang :: json : javascript

BERT is an erlang-based protocol

BERT-RPC is really great at dealing with large binaries
Which is a lot of what we do

bert-rpc

we have four file servers, each running bert-rpc servers

our front ends and job queue make RPC calls to the backend servers
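
For a feel of what goes over the wire, here is a hand-rolled sketch of a BERT-RPC request. It is not the bert or bertrpc gem: it encodes only a small subset of Erlang's external term format (atoms, small integers, binaries, tuples, lists), and the `Tuple` wrapper and function names are assumptions for the example. The request shape `{call, Module, Function, Arguments}` and the 4-byte length header follow the BERT-RPC spec. Note that a binary is written as a length plus raw bytes, with no escaping, which is one reason large binaries are cheap to send.

```ruby
# Hypothetical wrapper so tuples can be told apart from lists (Arrays).
Tuple = Struct.new(:items)

# Encode one term, without the leading version byte.
def bert_term(obj)
  case obj
  when Symbol
    name = obj.to_s.b
    [100, name.bytesize].pack("Cn") + name           # ATOM_EXT
  when Integer
    raise ArgumentError, "sketch handles 0..255 only" unless (0..255).cover?(obj)
    [97, obj].pack("CC")                             # SMALL_INTEGER_EXT
  when String
    [109, obj.bytesize].pack("CN") + obj.b           # BINARY_EXT: length + raw bytes
  when Tuple
    [104, obj.items.size].pack("CC") +               # SMALL_TUPLE_EXT
      obj.items.map { |item| bert_term(item) }.join
  when Array
    return [106].pack("C") if obj.empty?             # NIL_EXT (empty list)
    [108, obj.size].pack("CN") +                     # LIST_EXT
      obj.map { |item| bert_term(item) }.join +
      [106].pack("C")                                # list tail
  else
    raise ArgumentError, "unsupported type: #{obj.class}"
  end
end

# A complete BERT document starts with version byte 131.
def bert_encode(obj)
  [131].pack("C") + bert_term(obj)
end

# BERT-RPC frames each document with a 4-byte big-endian length.
def berp(obj)
  payload = bert_encode(obj)
  [payload.bytesize].pack("N") + payload
end

call = Tuple.new([:call, :nat, :add, [1, 2]])
framed = berp(call)
```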

mojombo / bertrpc

You can grab bert-rpc on GitHub

mojombo / bert

Or if you just want to play with BERT

chimney

We have a proprietary library called chimney

It routes the smoke. I know, don’t blame me.

chimney

All user routes are kept in Redis

Chimney is how our BERT-RPC clients know which server to hit

It falls back to a local cache and auto-detection if Redis is down

chimney

It can also be told a backend is down.

It was optimized for “connection refused” errors, but in reality those weren’t the real problem - timeouts were
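
Since chimney is proprietary, the following is only a guess at the shape of such a router, not its API. The class and method names are hypothetical, the primary store is anything with a `get(key)` method (a Redis client has one), and the sketch covers the local-cache fallback and the "backend is down" flag but not auto-detection.

```ruby
# Hypothetical router; names and behaviour are assumptions, not Chimney's API.
class RouteLookup
  def initialize(primary, local_cache = {})
    @primary = primary          # anything with #get(key), e.g. a Redis client
    @local_cache = local_cache  # last known routes, used when primary fails
    @down = {}                  # backends we have been told are down
  end

  def mark_down(host)
    @down[host] = true
  end

  def host_for(user)
    host = begin
      @primary.get("route:#{user}")
    rescue StandardError
      nil # treat any lookup failure (refused, timed out) as a miss
    end
    if host
      @local_cache[user] = host
    else
      host = @local_cache[user]
    end
    return nil if host.nil? || @down[host]
    host
  end
end

# Stand-ins for a healthy and an unreachable route store.
class HashStore
  def initialize(data)
    @data = data
  end

  def get(key)
    @data[key]
  end
end

class BrokenStore
  def get(_key)
    raise IOError, "connection timed out"
  end
end

cache = {}
live = RouteLookup.new(HashStore.new("route:defunkt" => "fs1"), cache)
live.host_for("defunkt") # primes the local cache
degraded = RouteLookup.new(BrokenStore.new, cache)
```

With the store unreachable, `degraded` still answers from the shared local cache for users it has seen before.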

proxymachine

All anonymous git clones hit the front end machines

the git-daemon connects to proxymachine, which uses chimney to proxy your
connection between the front end machine and the back end machine (which holds
the actual git repository)

very fast, transparent to you

mojombo / proxymachine

proxymachine can be used to proxy any kind of tcp connection
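
Here is a sketch of the routing decision for an anonymous clone, written as a plain function so it runs without the proxymachine gem. The function name, the regexp, and the `routes` hash (standing in for a chimney lookup) are assumptions. The returned hashes are modelled on the `remote` / `noop` / `close` results that ProxyMachine's README describes, and the request line is the one a git client sends to git-daemon.

```ruby
# First packet of a clone: 4 hex length digits, then
# "git-upload-pack /user/repo\0host=name\0".
REQUEST = /\A[0-9a-f]{4}git-upload-pack \/([\w.-]+)\/[\w.-]+\x00/

# routes: user name => backend address (hypothetical stand-in for chimney).
def route_clone(data, routes)
  if (match = REQUEST.match(data))
    backend = routes[match[1]]
    backend ? { remote: backend } : { close: true }
  elsif data.include?("\x00")
    { close: true } # a complete request line, but not one we understand
  else
    { noop: true }  # not enough bytes yet; wait for more
  end
end

payload = "git-upload-pack /defunkt/resque.git\x00host=example.com\x00"
packet = format("%04x", payload.bytesize + 4) + payload
routes = { "defunkt" => "fs1.example:9418" }
decision = route_clone(packet, routes)
```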

open source

ssh

Sometimes you need to access a repository over ssh

In those instances, you ssh to a front end machine and we tunnel your connection to
the appropriate backend

To figure that out we use chimney

node.js
downloads
http => https <img>
event streams

jobs

We do a lot of work in the background at GitHub

resque

Currently we use a system called Resque.

defunkt / resque

You can grab it on GitHub

resque

- dealing with pushes
- web hooks
- creating events in the database
- generating GitHub Pages
- clearing & warming caches
- search indexing

queues

In Resque, a queue is used as both a priority and a localization technique

By localization I mean, “where your workers live”

queues
critical,high,low

these three run on our front end servers

Resque processes them in this order
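
The ordering rule can be sketched with an in-memory stand-in. This is not Resque's code (Resque keeps its queues in Redis), and `TinyQueues` and the job names are made up for the example. The point is only that a worker checks its queues in the order listed and takes work from the first one that has any.

```ruby
# Hypothetical in-memory stand-in for the queue store.
class TinyQueues
  def initialize
    @queues = Hash.new { |hash, name| hash[name] = [] }
  end

  def push(queue, job)
    @queues[queue] << job
  end

  # Check queues in the order given and take the oldest job from the
  # first one that has work, so "low" only runs when the others are empty.
  def reserve(order)
    order.each do |name|
      job = @queues[name].shift
      return [name, job] if job
    end
    nil
  end
end

queues = TinyQueues.new
queues.push("low", "warm_cache")
queues.push("critical", "handle_push")
queues.push("high", "deliver_web_hook")

order = %w[critical high low]
processed = []
while (reserved = queues.reserve(order))
  processed << reserved
end
```

Which queues a worker lists is also what ties it to a machine: a worker that only lists one queue will never pick up jobs from the others.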

queues
page

GitHub Pages are generated on their own machine using the `page` queue

queues
archive

And tarball and zip downloads are created on the fly using the `archive` queue
on our archiving machines

search

On GitHub, you can search code, repositories, and people

solr

Solr is basically an HTTP interface on top of Lucene. This makes it pretty simple
to use in your code.

We use solr because of its ability to incrementally add documents to
an index.

Here I am searching for my name in source code

solr

We’ve had some problems making it stable but luckily the guys at Pivotal
have given us some tips

Like bumping the Java heap size.

Whatever that means

database

Our database story is pretty uninteresting

master / slave

All reads and writes go to the master

We use the slave for backups and failover

caching

On the site we do a ton of caching
using memcached

fragments

We cache chunks of HTML all over

Usually they are invalidated by some action

fragments

Formerly we invalidated most of our fragments using a generation scheme,
where you put a number into a bunch of related keys and increment it
when you want all those caches to be missed (thus creating new cache
entries with fresh data)
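
A minimal sketch of a generation scheme, using a plain Hash in place of memcached. The class name, key layout, and method names are assumptions for illustration.

```ruby
# Hypothetical generation-keyed fragment cache on top of a plain Hash.
class GenerationCache
  def initialize(store = {})
    @store = store
  end

  # Current generation number for a group of related keys.
  def generation(group)
    @store["gen:#{group}"] ||= 0
  end

  # The generation is baked into every key in the group.
  def key(group, name)
    "#{group}:v#{generation(group)}:#{name}"
  end

  def fetch(group, name)
    k = key(group, name)
    @store.fetch(k) { @store[k] = yield }
  end

  # Bumping the number makes every old key in the group unreachable.
  def expire(group)
    @store["gen:#{group}"] = generation(group) + 1
  end

  def size
    @store.size
  end
end

cache = GenerationCache.new
renders = 0
2.times { cache.fetch("repo:42", "sidebar") { renders += 1; "<div>v1</div>" } }
cache.expire("repo:42")
cache.fetch("repo:42", "sidebar") { renders += 1; "<div>v2</div>" }
```

Nothing is deleted on `expire`: the old entry stays in the store and simply stops being read.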

fragments

But we had high cache eviction due to low ram and hardware constraints, and found
that scheme did more harm than good.

We also noticed some cached data we wanted to remain forever was being evicted due
to the slabs with generational keys filling up fast

page

We cache entire pages using nginx’s memcached module

Lots of HTML, but also other data which gets hit a lot and changes rarely:

page

- network graph json
- participation graph data

Always looking to stick more into page caches

object

We do basic object caching of ActiveRecord objects such as
repositories and users all over the place

Caches are invalidated whenever the objects are saved

associations

We also cache associations as arrays of IDs

Grab the array, then do a get_multi on its contents to get a list of objects

That way we don’t have to worry about caching stale objects
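
A sketch of the ID-array approach, with a tiny in-memory cache standing in for memcached. `TinyCache`, the key names, and `cached_watchers` are hypothetical. A real version would load any misses from the database; this one just drops them.

```ruby
# Hypothetical in-memory cache with a memcached-like get_multi.
class TinyCache
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def set(key, value)
    @data[key] = value
  end

  # Returns a hash of only the keys that were found.
  def get_multi(keys)
    keys.each_with_object({}) do |key, found|
      found[key] = @data[key] if @data.key?(key)
    end
  end
end

# The association is cached as IDs only; each object lives under its own key.
def cached_watchers(cache, repo_id)
  ids = cache.get("repo:#{repo_id}:watcher_ids") || []
  keys = ids.map { |id| "user:#{id}" }
  found = cache.get_multi(keys)
  keys.map { |key| found[key] }.compact
end

cache = TinyCache.new
cache.set("user:1", { login: "pjhyett" })
cache.set("user:2", { login: "mojombo" })
cache.set("repo:42:watcher_ids", [1, 2])

before = cached_watchers(cache, 42).map { |user| user[:login] }

# Saving a user rewrites one key; the cached association needs no change.
cache.set("user:2", { login: "mojombo-renamed" })
after = cached_watchers(cache, 42).map { |user| user[:login] }
```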

walker

We also have a proprietary caching library called Walker

walker

It originally walked trees and cached them when someone pushed

But now it caches everything related to git:

walker

- commits
- diffs
- commit listing
- branches
- tags
- everything

Every git-related page load hits Walker a lot

walker

For most big apps, you need to write a caching layer
that knows your business domain

Generic, catch-all caching libraries probably won’t do

events

An example of this is our events system

They’re also cached as objects

And that’s just for the dashboard...

optimizations

So what other optimizations have we done

asset servers

Well we do the common trick of serving assets from multiple subdomains

asset servers
assets0.github.com
assets1.github.com

and so forth

sha asset id

Instead of using timestamps for asset ids, which may end up hitting the disk
multiple times on each request, we set the asset id to be the sha of the last commit
which modified a javascript or css file

sha asset id
/css/bundle.css?197d742e9fdec3f7

/js/bundle.js?197d742e9fdec3f7

Now simple code changes won’t force everyone to re-download the css or js bundles
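
Both asset tricks fit in a few lines. This is a sketch under assumptions: the helper names are invented, the 16-character truncation is inferred from the example URLs on the slide, the directory names are placeholders, and the CRC-based host choice is just one way to keep a given file on the same subdomain.

```ruby
require "zlib"

ASSET_HOSTS = %w[assets0.github.com assets1.github.com].freeze

# Pick a host deterministically so a given file always comes from the
# same subdomain and stays cacheable.
def asset_host(path)
  ASSET_HOSTS[Zlib.crc32(path) % ASSET_HOSTS.size]
end

# Asset id derived from a commit sha rather than a file timestamp.
def asset_url(path, sha)
  "http://#{asset_host(path)}#{path}?#{sha[0, 16]}"
end

# One possible way to find the sha, computed once at boot or deploy time.
# Assumes git is on the PATH; the directories are hypothetical.
def last_asset_commit(dirs = %w[public/stylesheets public/javascripts])
  `git log -1 --format=%H -- #{dirs.join(" ")}`.strip
end

# Sha from the slide, zero-padded to 40 characters for the example.
example_sha = "197d742e9fdec3f7" + "0" * 24
css_url = asset_url("/css/bundle.css", example_sha)
```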

bundling

For bundling itself, we use

bundling

yui’s compressor for css and

bundling

google’s closure compiler for javascript

we don’t use the most aggressive setting because it means changing
your javascript to appease the compression gods,
which we haven’t committed to yet

scripty 301

Again, for most of these tricks you need to really pay
attention to your app.

One example is scriptaculous’ wiki

scripty 301

When we changed our wiki URL structure, we set up dynamic 301 redirects
for the old urls.

Scriptaculous’ old wiki was getting hit so much we put the redirect into nginx itself -
this took strain off our web app and made the redirects happen almost instantly

ajax loading

We also load data in via ajax in many places.

Sometimes a piece of information will just take too long to retrieve

In those instances, we usually load it in with ajax

If Walker sees that it doesn’t have all the information it needs, it kicks off a job
to stick that information in memcached.

We then periodically hit a URL which checks if the information is in memcached or not.
If it is, we get it and rewrite the page with the new information.
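
The server side of that polling loop might look roughly like this. It is a sketch, not Walker's code: a Hash stands in for memcached, an Array for the job queue, and the function name, status values, and the de-duplication check are all assumptions.

```ruby
# Hypothetical handler for the URL the page polls.
def fetch_or_enqueue(cache, queue, key)
  if cache.key?(key)
    { status: "ready", data: cache[key] }
  else
    queue << key unless queue.include?(key) # do not enqueue the same job twice
    { status: "pending" }
  end
end

cache = {}
queue = []
key = "network:defunkt/resque"

first = fetch_or_enqueue(cache, queue, key)   # nothing cached: enqueue a job
second = fetch_or_enqueue(cache, queue, key)  # still pending, no second job
queued_after_two_polls = queue.size

# A background worker would do this part.
job = queue.shift
cache[job] = "graph data"

third = fetch_or_enqueue(cache, queue, key)   # now the page can be rewritten
```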

We use this same trick on the Network Graph

ajax loading

and anywhere else it makes sense.

comet loading

very soon this will all be comet, though

monitoring

what do we use for monitoring?

nagios

Our support team monitors the health of our machines and core
services using nagios.

I don’t really touch the thing.

Here’s a screenshot from my IE browser, complete with the ICQ plugin

resque web

We monitor our queue using Resque’s included Sinatra app

haystack

We use an in-house app called Haystack to monitor arbitrary information,
tracked as JSON.

Here’s an example of Haystack’s “exceptions” view

collectd

We also use collectd to monitor load, RAM usage, CPU usage, and other
app-related metrics

pingdom

pingdom sends us SMSes when the site is down

it’s nice

tender

tender is what we use for customer support

it works incredibly well, and they’re constantly improving it

testing

Our testing setup is pretty standard

test unit

We mostly use Ruby’s test/unit.

We’ve experimented with other libraries including test/spec, shoulda, and RSpec, but in the end
we keep coming back to test/unit

git fixtures

As many of our fixtures are git repositories, we specify in the test what sha
we expect to be the HEAD of that fixture.

This means we can completely delete a git repository in one test, then have it back in
pristine state in another. We plan to move all our fixtures to a similar git-system in the future.

machinist

We use machinist for our fixtures

running_man

Gives us setup_once

Use it to cache machinist fixtures on a per-test-class basis

ci joe

We use ci joe, a continuous integration server, to run our tests after each push.

He then notifies us if the tests fail.

defunkt / cijoe

You can grab him at GitHub

staging

We also always deploy the current branch to staging

This means you can be working on your branch, someone else can be working on theirs,
and you don’t need to worry about reconciling the two to test out a feature

One of the best parts of Git

github.com/
security

having a security page really helps

security@
github.com

we get weekly emails to our security email (that people find on the security page)

and people are always grateful when we can reassure them or answer their question

regular audits

if you can, find a security consultant to poke your site for XSS vulnerabilities

having your target audience be developers helps, too

24/7 monitoring

24/7 monitoring is cool too

backups

backups are incredibly important

don’t just make backups: ensure you can restore them, as well

sql

we keep nightly, off-site backups of our sql databases

git

and the same for all our git repositories
