JCVI Cloud Computing Talk
Konstantinos (ntino) Krampis
Bioinformatics Engineer BCIS
A little background
• PhD in Bioinformatics & Computational Biology, Virginia Tech
• Software developer for 5 years at the Virginia Bioinformatics Institute: Perl, Java, SQL, from genomic data analysis to web services
(Ruby is my mistress)
• Have done research, published papers, presented at ISMB Europe / Brazil
• This is my second week at JCVI, member of the BCIS group
• Not a cloud computing evangelist, or infrastructure expert
• Researcher coming from a smaller institute, in need of compute cycles
@agbiotec on Twitter
What is cloud computing ?
“a style of computing in which dynamically scalable resources are provided as a service
over the Internet”
“a network of data centers providing users access to powerful applications, platform, and
services delivered over the internet.”
“Virtualized servers running Windows or Linux operating systems that are instantiated
via a web interface or API”
“...just another buzzword that is used to describe too many technologies...” ?!?
[Plot via Google Trends]
The cloud casts a wide shade...
...and these formal terms will help us define it :
Infrastructure as a Service (IaaS)
outsourcing servers, data center, network
same as lserver14 here, but lserver somewhere far from JCVI
Platform as a Service (PaaS)
builds on the above, not clear distinction with the one below
say you need a portal on your sequenced and annotated genomes
Linux, MySQL, Apache, content management, Drupal, wikis, forums
outsource the installation, integration and setup of all these
Software as a Service (SaaS)
same as above but you also outsource patches, upgrades, security maintenance
The “cloud” casts an even wider shade...
before we get all formal, let's see what else is cloud (SaaS for free) :
Back to formality: IaaS providers on the cloud
(need $$ and contracts?)
Ecosystem around IaaS, that leads to PaaS and SaaS:
most offerings built on Amazon EC2 / S3
each specializes in a single or a few platforms/software
(Google is somewhere in the middle)
My experience with the cloud
● Elastic Compute Cloud – EC2: rent Linux/Windows servers by the hour, with full admin access
● Virtualized servers (Xen); Amazon Machine Images – AMIs (VM snapshots)
● 6 weeks before my PhD defense, needed to run an algorithm on 20 datasets
● each: 300 Affy chips, takes about 24 hrs on 8 cores / 16 GB (best case, since the server is multiuser)
● got my manager's Amex, signed up and had 20 of those servers
● 80 cents / hour for each, could run 20 datasets in parallel
● ~$400 and a day later the computation was done
● I could hope to graduate and move on to the next job!
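The arithmetic behind the bullets above can be checked with a short sketch. It uses only the figures quoted on the slide (20 datasets, about 24 hours each, 80 cents per instance-hour); the function names are illustrative, and the single-server comparison assumes the datasets would otherwise run one after another.

```python
# Figures taken from the slide; everything else is an illustrative assumption.
DATASETS = 20
HOURS_PER_DATASET = 24      # best-case runtime per dataset
PRICE_PER_HOUR = 0.80       # USD per instance-hour

def rental_cost(datasets, hours, price):
    """Total bill when each dataset gets its own rented instance."""
    return datasets * hours * price

def wall_clock_hours(datasets, hours, servers):
    """Hours until the last dataset finishes with `servers` machines."""
    rounds = -(-datasets // servers)  # ceiling division
    return rounds * hours

total = rental_cost(DATASETS, HOURS_PER_DATASET, PRICE_PER_HOUR)
cloud_hours = wall_clock_hours(DATASETS, HOURS_PER_DATASET, servers=20)
local_hours = wall_clock_hours(DATASETS, HOURS_PER_DATASET, servers=1)

print(f"bill: ${total:.2f}")
print(f"20 rented servers: {cloud_hours} h, one shared server: {local_hours} h")
```

The bill is the same whether the instance-hours are spent in parallel or in sequence; what the 20 rented servers buy is the shorter wall-clock time.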
Virtual machine instance types at Amazon Web Services EC2
One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
If up to 1.6 TB is not enough... storage costs on Amazon Web Services S3
AMI
Management
Console
Also an API, for automatic start / stop / monitoring.
Especially useful for those building scaling on-demand infrastructures.
Choose your quantity and trim and... That's it!
Who makes these ?
Starting AMIs (Fedora, Windows) by Amazon.
Create your own, from scratch or from VMware / VirtualBox; give away for free or sell.
Using free AMIs, Amazon's or another user's, you can customize, save and share alike.
User forums, AMI ratings
Tools & APIs for instance control: Firefox plugins, desktop, open-source API libs in Java / Ruby / Perl / shell / Erlang / PHP / C# / Python / VB
some PaaS / SaaS here
my favorite
Other Amazon offerings: Elastic MapReduce
● MapReduce is Google's algorithm
● distribute queries over distributed file system
● not MPI, but think of a large distributed data index
● Apache Hadoop open source implementation
● Map: send query over FS nodes, Reduce: collect result
● you can make your own cluster at home
● Elastic MapReduce only requires you to upload data
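To make the Map and Reduce steps concrete, here is a toy, single-process sketch in Python. It is not the Hadoop or Elastic MapReduce API: the two strings stand in for data chunks that would live on separate file-system nodes, and the k-mer counting task and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_kmers(sequence, k=3):
    """Map step: emit a (kmer, 1) pair for every k-mer in one data chunk."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between the steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reduce step: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

# Each string plays the role of a chunk stored on a different node.
chunks = ["ACGTAC", "GTACGT"]
mapped = chain.from_iterable(map_kmers(chunk) for chunk in chunks)
counts = reduce_counts(shuffle(mapped))
print(counts)
```

In a real cluster the map calls run on the nodes that hold the data and only the small (key, value) pairs travel over the network, which is why the model suits large distributed data sets.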
... ok, now onto the risky part of the talk...
REST under the shade of the cloud (Amazon's S3)
● many interesting things to read and talk about: autoscalability, Hadoop on the cloud, etc.
● but at the core is the data, and we access data via web service technologies
● my first web services experience was with SOAP – Simple (hmm...) Object Access Protocol
● why would you need additional data (XML) and object methods to control your data?
● back to where we started: simple hypertext (HTTP), as the web was meant to be
● REST: Representational State Transfer (implemented in Amazon S3)
● control your cloud data via HTTP requests: PUT / GET / DELETE / POST
● no need to call methods on the web service end to manipulate your data
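A minimal sketch of what "control your data via HTTP requests" looks like, using Python's standard library. The requests are only constructed, never sent, so no account is needed. The bucket and object names are hypothetical, and a real S3 call would also need a signed Authorization header, which is left out here.

```python
from urllib.request import Request

BASE = "http://s3.amazonaws.com"

def rest_request(verb, bucket, key, body=None):
    """Build (but do not send) an HTTP request for one stored object."""
    url = f"{BASE}/{bucket}/{key}"
    return Request(url, data=body, method=verb)

# Hypothetical bucket and object names, for illustration only.
put = rest_request("PUT", "my-genomes", "ecoli.fasta", b">seq1\nACGT\n")
get = rest_request("GET", "my-genomes", "ecoli.fasta")
delete = rest_request("DELETE", "my-genomes", "ecoli.fasta")

for req in (put, get, delete):
    print(req.get_method(), req.full_url)
```

The URL is the same in all three requests; only the HTTP verb changes, so there is no service-specific method to look up.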
REST under the shade of the cloud
● most of the time in Bioinformatics, we read/write data
● http://www.ncbi.nlm.nih.gov/protein/homosapiens/MAPK14
  why efetch, then esearch, when the above URI/URL would do?
● your HTTP PUT / GET / DELETE / POST request changes the state of your data
● think of it as manipulating web resources akin to files on your hard disk
● no need for the complex business processes that SOAP methods were designed for
● when I pipelined NCBI-VBI-DDBJ web services, I had to study 3 interfaces...
● ...they were SOAP-based! But the web/HTTP is the standard, stick with it.
Amazon S3 Architecture
very simple data model: objects and buckets
objects store data (FILES), each bucket stores objects (DIRECTORIES)
http://s3.amazonaws.com/bucket_name/object_name
1 http://s3.amazonaws.com/
2 http://s3.amazonaws.com/bucket_name
3 http://s3.amazonaws.com/bucket_name/object_name
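The three URL levels listed above (service, bucket, object) can be built and taken apart with a few lines of Python. This is a sketch of the path-style addressing shown on the slide; the helper names `s3_url` and `split_s3_url` are made up for this example and are not part of any Amazon library.

```python
from urllib.parse import quote, urlsplit

ENDPOINT = "http://s3.amazonaws.com"

def s3_url(bucket=None, key=None):
    """Return the service, bucket or object URL, depending on the arguments."""
    if bucket is None:
        return ENDPOINT + "/"
    if key is None:
        return f"{ENDPOINT}/{quote(bucket)}"
    return f"{ENDPOINT}/{quote(bucket)}/{quote(key)}"

def split_s3_url(url):
    """Recover (bucket, key) from a URL; missing parts come back as None."""
    path = urlsplit(url).path.lstrip("/")
    bucket, _, key = path.partition("/")
    return (bucket or None, key or None)

print(s3_url())
print(s3_url("bucket_name"))
print(s3_url("bucket_name", "object_name"))
print(split_s3_url("http://s3.amazonaws.com/bucket_name/object_name"))
```

Because the bucket and the object name are both plain path segments, the address of a stored file reads much like a directory path on disk.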
we need some REST...
● while I have only scratched the surface of this
● as a bioinformatics dev. / data consumer I can see the simplicity of the URL/URI
● complex web service interfaces add barriers for data consumers
● more resources:
“Programming Amazon Web Services” by James Murty (publisher: O'Reilly)
“RESTful Web Services” by (publisher: O'Reilly)
http://delicious.com/agbiotec/cloud
http://delicious.com/agbiotec/aws
http://delicious.com/agbiotec/REST
http://delicious.com/agbiotec/rails
(see something RESTful in the URLs above? delicious was one of the first)
Conclusion and Summary
● if the sky was blue (no cloud) I would still be doing my PhD
● cloud provides that extra bit of infrastructure, no long-term obligations
● democratize bioinformatics research: resources for smaller labs and institutions
● give back to the community: AMIs with preconfigured assembly tools
● lab with a genome and no grid, can buy 12 days compute on the cloud
● traditional infrastructures will stay, but cloud changes the game opportunities
Thank you !