Python, MongoDB, and
asynchronous web frameworks
        A. Jesse Jiryu Davis
        jesse@10gen.com
         emptysquare.net
Agenda
• Talk about web services in a really dumb
  (“abstract”?) way
• Explain when we need async web servers
• Why is async hard?
• What is Tornado and how does it work?
• Why am I writing a new PyMongo wrapper to
  work with Tornado?
• How does my wrapper work?
CPU-bound web service


   Client ---- socket ---- Server




• No need for async
• Just spawn one process per core
Normal web service


   Client ---- socket ---- Server ---- socket ---- Backend (DB, web service, SAN, …)




• Assume backend is unbounded
• Service is bound by:
  • Context-switching overhead
  • Memory!
What’s async for?
• Minimize resources per connection
• I.e., wait for backend as cheaply as possible
CPU- vs. Memory-bound



CPU-bound <------------------------------> Memory-bound
Crypto         Most web services?           Chat
HTTP long-polling (“COMET”)
• E.g., chat server
• Async’s killer app
• Short-polling is CPU-bound: tradeoff between
  latency and load
• Long-polling is memory-bound
• “C10K problem”: kegel.com/c10k.html
• Tornado was invented for this
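A minimal sketch of why long-polling is memory-bound rather than CPU-bound: until a message arrives, each idle client costs only a stored callback. The `MessageBuffer` class and its method names are hypothetical, made up for this illustration and not taken from Tornado.

```python
class MessageBuffer(object):
    """Holds one callback per waiting (long-polling) client."""

    def __init__(self):
        self.waiters = []   # one entry per open, idle connection

    def wait_for_message(self, callback):
        # A long-polling request arrives: do no work, just remember
        # how to answer it later.
        self.waiters.append(callback)

    def new_message(self, message):
        # A message arrives: answer every waiting client at once.
        waiters, self.waiters = self.waiters, []
        for callback in waiters:
            callback(message)


buf = MessageBuffer()
received = []
for client_id in range(3):
    # Bind client_id as a default argument; a bare closure would
    # only ever see the last value of the loop variable.
    buf.wait_for_message(
        lambda msg, client_id=client_id: received.append((client_id, msg)))

buf.new_message("hello")
print(received)
```

With short-polling, the same three clients would each have to ask repeatedly, trading CPU load against latency; here they cost nothing but list entries while they wait.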
Why is async hard to code?
(time runs downward)

Client                   Server                 Backend
   |------ request ------->|                       |
   |                       |------ request ------->|
   |                  store state                  |
   |                       |<----- response -------|
   |<----- response -------|                       |
Ways to store state
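                                    (this slide is in beta)

[Plot: memory per connection (vertical axis) vs. coding difficulty
(horizontal axis). Multithreading sits highest on memory. Greenlets /
Gevent sits lower, at about the same coding difficulty. Tornado and
Node.js sit lowest on memory and furthest along on coding difficulty.]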
What’s a greenlet?
• A.K.A. “green threads”
• A feature of Stackless Python, packaged as a
  module for standard Python
• Greenlet stacks are stored on heap, copied to
  / from OS stack on resume / pause
• Cooperative
• Memory-efficient
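The `greenlet` module is a third-party package, so as a rough standard-library analogy the sketch below uses plain generators to show what "cooperative" means: a task runs until it explicitly gives up control. Generators are not greenlets (a generator can only yield from its own frame, while a greenlet can switch from any depth of nested calls), and the `task` and `run_round_robin` names are invented for illustration.

```python
from collections import deque


def task(name, steps, log):
    for i in range(steps):
        log.append("%s step %d" % (name, i))
        yield  # explicitly hand control back to the scheduler


def run_round_robin(tasks):
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)          # resume until the next yield
        except StopIteration:
            continue               # task finished; drop it
        queue.append(current)      # otherwise reschedule it


log = []
run_round_robin([task("a", 2, log), task("b", 3, log)])
print(log)
```

Nothing preempts a task here: if one never yields, the others never run. That is the price of cooperative scheduling, and also why no locks are needed around `log`.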
Threads:
       State stored on OS stacks
# pseudo-Python

sock = listen()

request = parse_http(sock.recv())

mongo_data = db.collection.find()

response = format_response(mongo_data)

sock.sendall(response)
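A runnable sketch of the same shape, assuming one thread per connection. `fake_backend_query` is a hypothetical stand-in for the blocking `db.collection.find()` call, and a `socketpair` replaces the listening socket so the example needs no network.

```python
import socket
import threading


def fake_backend_query():
    # Stand-in for a blocking call such as db.collection.find().
    return "data"


def handle_connection(sock):
    # Everything this request needs lives in local variables,
    # i.e. on this thread's own stack, while the thread blocks.
    request = sock.recv(1024)
    backend_data = fake_backend_query()
    response = request.upper() + b" -> " + backend_data.encode("ascii")
    sock.sendall(response)
    sock.close()


client, server = socket.socketpair()
worker = threading.Thread(target=handle_connection, args=(server,))
worker.start()
client.sendall(b"ping")
reply = client.recv(1024)
worker.join()
client.close()
print(reply)
```

The code is easy to write because the OS keeps the state for us; the cost is one thread, with its stack, per open connection.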
Gevent:
      State stored on greenlet stacks
# pseudo-Python
from gevent import monkey; monkey.patch_all()

sock = listen()

request = parse_http(sock.recv())

mongo_data = db.collection.find()

response = format_response(mongo_data)

sock.sendall(response)
Tornado:
 State stored in RequestHandler
class MainHandler(tornado.web.RequestHandler):
  @tornado.web.asynchronous
  def get(self):
    AsyncHTTPClient().fetch(
      "http://example.com",
      callback=self.on_response)

  def on_response(self, response):
    formatted = format_response(response)
    self.write(formatted)
    self.finish()
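To make "state stored in RequestHandler" concrete without depending on Tornado, here is a toy sketch of the callback style. `pending`, `fake_fetch`, and `Handler` are hypothetical names invented for this example; a plain list stands in for the event loop's callback queue.

```python
pending = []  # toy stand-in for an event loop's callback queue


def fake_fetch(url, callback):
    # Hypothetical non-blocking fetch: returns immediately and
    # arranges for callback to run later with the "response".
    pending.append(lambda: callback("body of " + url))


class Handler(object):
    def __init__(self, request_id):
        # State that must survive between get() and on_response()
        # is stored on the handler, not on a stack.
        self.request_id = request_id
        self.output = None

    def get(self):
        fake_fetch("http://example.com", callback=self.on_response)
        # get() returns here; no stack frame remains for this request.

    def on_response(self, response):
        self.output = "request %d: %s" % (self.request_id, response)


handlers = [Handler(1), Handler(2)]
for h in handlers:
    h.get()

# Both requests are now in flight, and nothing has blocked.
in_flight = [h.output for h in handlers]

while pending:
    pending.pop(0)()   # the "event loop" runs the callbacks

print([h.output for h in handlers])
```

Each request costs one small object instead of a thread or greenlet stack, but the logic is split across two methods, which is the coding difficulty the earlier slide refers to.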
Tornado IOStream
class IOStream(object):
  def read_bytes(self, num_bytes, callback):
    # Remember how much to read and whom to tell. Stored as
    # num_bytes so it doesn't overwrite the read_bytes method.
    self.num_bytes = num_bytes
    self.read_callback = callback

    io_loop.add_handler(
      self.socket.fileno(),
      self.handle_events,
      events=READ)

  def handle_events(self, fd, events):
    data = self.socket.recv(self.num_bytes)
    self.read_callback(data)
Tornado IOLoop
class IOLoop(object):
  def add_handler(self, fd, handler, events):
    self._handlers[fd] = handler
    # _impl is epoll or kqueue or ...
    self._impl.register(fd, events)

  def start(self):
    while True:
      event_pairs = self._impl.poll()
      for fd, events in event_pairs:
        self._handlers[fd](fd, events)
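The loop above can be tried out with only the standard library. This is a sketch under stated assumptions, not Tornado's implementation: `MiniLoop` is a made-up name, `selectors.DefaultSelector` plays the role of `_impl`, and the loop exits once no handlers remain so that the example terminates.

```python
import selectors
import socket


class MiniLoop(object):
    def __init__(self):
        self._handlers = {}
        # Picks epoll, kqueue, or select depending on the platform.
        self._impl = selectors.DefaultSelector()

    def add_handler(self, fd, handler, events):
        self._handlers[fd] = handler
        self._impl.register(fd, events)

    def remove_handler(self, fd):
        del self._handlers[fd]
        self._impl.unregister(fd)

    def start(self):
        # Unlike a real server loop, stop when nothing is registered.
        while self._handlers:
            for key, events in self._impl.select():
                self._handlers[key.fd](key.fd, events)


loop = MiniLoop()
reader, writer = socket.socketpair()
reader.setblocking(False)
received = []


def on_readable(fd, events):
    received.append(reader.recv(1024))
    loop.remove_handler(fd)   # nothing left to wait for


loop.add_handler(reader.fileno(), on_readable, selectors.EVENT_READ)
writer.sendall(b"hello")
loop.start()
reader.close()
writer.close()
print(received)
```

The whole server is a single thread sleeping in `select()`; a waiting connection costs one dictionary entry and one registered file descriptor.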
Python, MongoDB, & concurrency
• Threads work great with pymongo
• Gevent works great with pymongo
  – monkey.patch_socket(); monkey.patch_thread()
• Tornado works so-so
  – asyncmongo
     • No replica sets, only first batch, no SON manipulators, no
       document classes, …
  – pymongo
     • OK if all your queries are fast
     • Use extra Tornado processes
Introducing: “Motor”
•   Mongo + Tornado
•   Experimental
•   Might be official in a few months
•   Uses Tornado IOLoop and IOStream
•   Presents standard Tornado callback API
•   Stores state internally with greenlets
•   github.com/ajdavis/mongo-python-driver/tree/tornado_async
Motor
class MainHandler(tornado.web.RequestHandler):
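  def initialize(self):
    # Tornado calls initialize() for each request; overriding
    # __init__ would mean accepting (application, request) too
    self.c = MotorConnection()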

  @tornado.web.asynchronous
  def post(self):
    # No-op if already open
    self.c.open(callback=self.connected)

  def connected(self, c, error):
    self.c.collection.insert(
      {'x': 1},
      callback=self.inserted)

  def inserted(self, result, error):
    self.write('OK')
    self.finish()
Motor internals
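[Sequence diagram: time runs downward; stack depth runs left to right
across Client, IOLoop, RequestHandler, greenlet, pymongo]

• Client → IOLoop: request
• RequestHandler: start, then switch() into a child greenlet
• pymongo: IOStream.sendall(callback)
• Control returns to IOLoop
• Later, IOLoop: callback(), then switch() back into the greenlet
• pymongo: parse Mongo response
• Greenlet: schedule callback on IOLoop
• IOLoop: callback()
• IOLoop → Client: HTTP response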
Motor internals: wrapper
# (1)..(8) mark the order of execution; the sequence continues
# in the fake-socket code
class MotorCollection(object):
  def insert(self, *args, **kwargs):
    # (1) pymongo's insert doesn't take a callback, so remove it
    callback = kwargs.pop('callback')
    kwargs['safe'] = True

    def call_insert():
      # Runs on child greenlet
      result, error = None, None
      try:
        sync_insert = self.sync_collection.insert
        # (3)
        result = sync_insert(*args, **kwargs)
      except Exception as e:
        error = e

      # (8) Schedule the callback to be run on the main greenlet
      tornado.ioloop.IOLoop.instance().add_callback(
        lambda: callback(result, error))

    # (2) Start child greenlet
    greenlet.greenlet(call_insert).switch()

    # (6)
    return
Motor internals: fake socket
class MotorSocket(object):
  def __init__(self, socket):
    # Makes socket non-blocking
    self.stream = tornado.iostream.IOStream(socket)

  def sendall(self, data):
    child_gr = greenlet.getcurrent()

    # This is run by IOLoop on the main greenlet
    # when data has been sent;
    # switch back to child to continue processing
    def sendall_callback():
      child_gr.switch()  # (7)

    # (4)
    self.stream.write(data, callback=sendall_callback)

    # (5) Resume main greenlet
    child_gr.parent.switch()
Motor
• Shows a general method for asynchronizing
  synchronous network APIs in Python
• Who wants to try it with MySQL? Thrift?
• (Bonus round: resynchronizing Motor for
  testing)
Questions?
   A. Jesse Jiryu Davis
   jesse@10gen.com
    emptysquare.net

(10gen is hiring, of course:
   10gen.com/careers)
