diff --git a/content/post/build_ml_powered_game_ai_tensorflow.md b/content/post/build_ml_powered_game_ai_tensorflow.md new file mode 100644 index 0000000..f0fa692 --- /dev/null +++ b/content/post/build_ml_powered_game_ai_tensorflow.md @@ -0,0 +1,109 @@ ++++ +title = "Building an ML-Powered Game AI using TensorFlow in Go" +subtitle = "" +date = "2017-08-10T08:43:45+10:00" +math = false +draft = false + ++++ +*Based on a lightning talk given at GopherCon 2017 "Building an ML-Powered Game AI using TensorFlow in Go" [Video](https://www.youtube.com/watch?v=oiorteQg9n0) / [Slides](https://github.com/gophercon/2017-talks/tree/master/lightningtalks/PeteGarcin-BuildingMLPoweredGameAIwithTensorFlow)* + +(Author: Pete Garcin, Developer Advocate @ [ActiveState](https://activestate.com), @rawktron on [Twitter](https://twitter.com/rawktron) and @peteg on Gophers Slack) + +For GopherCon, we wanted to demonstrate some of the capabilities of the emerging machine learning and data science ecosystem in Go. I had originally put together a simple arcade space shooter game as a demo for PyCon, featuring enemies powered by machine learning. It was a fun way to get folks engaged at conferences and to learn about the growing library of tools that are available. It also gave me an opportunity to build something non-trivial using machine learning techniques, and my background in games made this kind of interactive demo a good fit. + +NeuroBlast is a vertically scrolling space shooter where you control a ship that tries to defeat increasing waves of enemies. Normally, these enemies fly in predefined formations, with predefined firing patterns, and come in waves. The big difference in NeuroBlast is that the enemies use machine learning to determine what their firing pattern should be. 
+ +## Under the Hood: TensorFlow +Go is one of the languages that has a TensorFlow client available, and so it was a great opportunity to port the original Python game to Go and demonstrate that it’s also possible to deploy trained models in Go applications. + +It is worth noting that the Go TensorFlow client currently does not support training models, and so we relied on a model that was previously trained using the Python version of the game and then exported using the `SavedModelBuilder` functionality in order to load it in Go. This will export a TensorFlow graph as a protocol buffer and allow it to be loaded in Go using the `LoadSavedModel` function. + +For the game portion, I used a library called [Pixel](https://github.com/faiface/pixel) which is still early in development but has a really active community, and offered excellent stability and performance. I was pretty performance conscious when building and porting the game, so there are certain limitations such as non-pixel-perfect collisions, in order to ensure that the game could run acceptably under all conditions. + +### Training the Neural Net +Our Neural Net is ultimately a very simple one -- four inputs and a single output neuron. It will use supervised learning to do binary classification on a simple problem: was each shot a hit or a miss? It utilizes the delta between player and enemy position, and player and enemy velocity as the inputs. The single output neuron will fire if its activation value is >= 0.5 and will not fire if it is < 0.5. + +When building the network, I initially had only a single hidden layer with 4 nodes but found that after training it, it was somewhat erratic. It seemed like it was very sensitive to the training data and would not ‘settle’ on a particular strategy in any consistent way. I experimented with a few different configurations, and ultimately settled on the one we used for the demo. It’s quite likely not the optimal setup, and may have more layers than is necessary. 
What appealed to me though was that even with a very small amount of training data, and regardless of how you trained it, it would consistently settle into a similar behaviour pattern, which made it great for a floor demo where anyone could play or train the game. + +![Neural Network Visualization](/img/gameai/NNViz.png "Neural Net Visualization") +*The inputs are the four nodes at the top, with the output node (primed to fire here) at the bottom. Thicker lines represent higher weights. Activation values appear in the node centers.* + +I cobbled the visualization together myself using the Pixel immediate-mode drawing functions. Inspired by [this blog post](https://medium.com/deep-learning-101/how-to-generate-a-video-of-a-neural-network-learning-in-python-62f5c520e85c), the visualization here shows connections between nodes as either red or green lines. Green lines indicate positive weights that bias the network towards “shooting”, and red lines indicate negative weights that inhibit shooting. Thicker lines indicate higher weight values and thus “stronger” connections. + +After training, the network consistently seems to converge on the following strategy: + +If the player is within a reasonable cone of forward “vision”, then fire indiscriminately. +If the player is not within that reasonable forward cone, then do not fire at all. + +At first, it was very interesting to me that the network did not settle on the more obvious “just fire constantly” strategy, but given that it does receive training data indicating that “misses” are undesirable, it makes sense that it would avoid firing shots with a low probability of hitting. 
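To make that threshold rule concrete, here is a minimal, self-contained sketch of how a single sigmoid output neuron turns weighted inputs into a fire/no-fire decision. The weights and bias below are invented purely for illustration; in the game, the equivalent values come out of training and the arithmetic happens inside the TensorFlow graph.

```Go
package main

import (
	"fmt"
	"math"
)

// sigmoid squashes any real value into the (0, 1) range, matching the
// activation used on the network's output neuron.
func sigmoid(x float64) float64 {
	return 1.0 / (1.0 + math.Exp(-x))
}

// fireDecision mimics the output neuron: a weighted sum of the four
// normalized inputs plus a bias, passed through a sigmoid and
// thresholded at 0.5. The weights and bias here are made up for
// illustration; the real network learns them during training.
func fireDecision(inputs, weights []float64, bias float64) (float64, bool) {
	sum := bias
	for i, in := range inputs {
		sum += in * weights[i]
	}
	activation := sigmoid(sum)
	return activation, activation >= 0.5
}

func main() {
	inputs := []float64{0.1, 0.6, 0.0, 0.2}    // dx, dy, du, dv (normalized)
	weights := []float64{-2.0, 3.0, 0.5, -1.0} // illustrative only
	activation, fire := fireDecision(inputs, weights, 0.5)
	fmt.Printf("activation=%.2f fire=%v\n", activation, fire)
}
```

The "cone of vision" behaviour above is simply what this arithmetic looks like once training pushes the weights toward penalizing low-probability shots.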
+ +![Positive Network Output](/img/gameai/GopherPos.png "Positive Network Output") +*In this image, notice that because I’m in a forward cone, most of the enemies are firing at me indiscriminately.* + +![Negative Network Output](/img/gameai/GopherNeg.png "Negative Network Output") +*In this image, you’ll notice none of the enemies are firing because I am clearly outside their range of possible success. However, if you look at the activation values, you’ll see that the enemy who has just come onto the screen is about to blast me because his output is 0.83. Yikes!* + +In the training mode, the enemies fire randomly, and then each shot taken by the enemy is recorded as a hit or a miss along with its initial relative position/velocity values. + +It’s worth noting that in early iterations of the game, I was passing in raw pixel values for positions and velocities. This meant that there was a really wide variation between the values in the input, and I found that the network would just not converge to a consistent behaviour. So, I normalized the input data to be roughly between 0.0-1.0 and found that it basically instantly converged to a usable behaviour. So, lesson for you kids: normalize your input data! + +It's also important to make all your input values framerate-independent. Since the model is trained in Python, any discrepancies in either the co-ordinate space or the velocity values when running in Go will result in incorrect results coming back from the network. + +Once the network is trained, when you play the game, every instance of an enemy spaceship uses its own instance of the neural network to make decisions about when it should fire. + +### Using the model in Go + +As mentioned above, we exported our model from Python using the `SavedModelBuilder` and then need to load this in Go. The `SavedModelBuilder` will export the model into a folder, and you only need to specify that folder when loading in Go, along with the tag for your model. 
There are two available tags - TRAIN and SERVING. In our case, I used TRAIN, but for deployment it is suggested that you use the SERVING tag. + +```Go + // bundle contains Session + Graph + bundle, err := tf.LoadSavedModel("exported_brain", []string{"train"}, nil) +``` + +This code will load your saved model and return a struct that contains pointers to the TensorFlow graph, and a new TensorFlow session that you can use for evaluation. However, at this stage many are left wondering - so, now what? + +This is where current limitations in the Go binding and documentation show themselves. The key next step is to get references to the input and output nodes in the TensorFlow graph. Right now, the only way to do this is to print the graph node names from Python when you are exporting. You do have the option of labelling nodes as you output them, but you don't have the ability to access those nodes by their labels from Go. + +In this case, once we have the names of the nodes, we need to access them in Go via the `Operation` method on the TensorFlow graph. Remember that TensorFlow stores our network as a series of computation operations, which is why the method is called `Operation`. + +```Go + inputop := bundle.Graph.Operation("dense_1_input") + outputop := bundle.Graph.Operation("dense_5/Sigmoid") +``` + +Now we have both a Session with our Graph, and we have found our input and output Operations, so we're ready to send data to the graph and use it to evaluate that data. In our case, since we've trained the graph using the relative position and velocity in the game, each frame, for each enemy, we will take its relative position and velocity to the player and feed it into the network to get a decision back on whether or not we should fire at the player. 
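The per-frame feature computation might look something like the sketch below. The `Ship` struct, its field names, and the normalization constants are all hypothetical (the actual game code may differ); the point is that the four inputs are scaled into a small range and derived from framerate-independent units, as discussed earlier.

```Go
package main

import "fmt"

// Ship holds the minimal state needed to compute the network inputs.
// Field names and units are illustrative assumptions, not the game's
// actual types.
type Ship struct {
	X, Y   float32 // position in pixels
	VX, VY float32 // velocity in pixels per second (not per frame!)
}

const (
	screenW  = 800.0 // assumed playfield width in pixels
	screenH  = 600.0 // assumed playfield height in pixels
	maxSpeed = 300.0 // assumed top speed in pixels per second
)

// networkInputs returns the normalized deltas (dx, dy, du, dv) between
// an enemy and the player, each scaled into roughly the -1.0 to 1.0 range.
func networkInputs(enemy, player Ship) (dx, dy, du, dv float32) {
	dx = (player.X - enemy.X) / screenW
	dy = (player.Y - enemy.Y) / screenH
	du = (player.VX - enemy.VX) / maxSpeed
	dv = (player.VY - enemy.VY) / maxSpeed
	return dx, dy, du, dv
}

func main() {
	enemy := Ship{X: 400, Y: 100, VX: 0, VY: 60}
	player := Ship{X: 420, Y: 500, VX: 150, VY: 0}
	dx, dy, du, dv := networkInputs(enemy, player)
	fmt.Println(dx, dy, du, dv)
}
```

With something like this in place, the resulting (dx, dy, du, dv) tuple is what gets packed into the input tensor each frame.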
+ +Our first step in having our enemies "think" is to build a new 'Tensor' (in this case, effectively a vector) to feed into the network: + +```Go + // Error handling omitted for brevity. + var column *tf.Tensor + column, err = tf.NewTensor([1][4]float32{{dx, dy, du, dv}}) + results, err := bundle.Session.Run(map[tf.Output]*tf.Tensor{inputop.Output(0): column}, []tf.Output{outputop.Output(0)}, nil) +``` + +What you see above is that we're building a tensor from the relative position (dx, dy) and the relative velocity (du, dv) and then running our TensorFlow session, specifying which nodes are the input and output nodes. The session will then return the results in the form of an array. In our case, since we only have one output node, the array only has one entry. + +Our enemies have almost made a decision - all we need to do is read the output value from their neuron. If the value at the output node is greater than or equal to 0.5, we determine that we should fire at the player: + +```Go + for _, result := range results { + if result.Value().([][]float32)[0][0] >= 0.5 && enemy.canfire { + // FIRE!! + } + } +``` + +And that's it! This is literally the only logic governing the attack behaviour of the enemies, and despite its simplicity, this network generates a compelling and interesting set of enemy behaviours. By applying this repeatedly to all enemies, we've been able to create an enemy AI for our game. + +There are lots of obvious ways to improve this and make it more sophisticated - but as a demonstration of utilizing a model loaded at runtime in Go and 'deployed' into our game - it shows the ease and power of these techniques and how accessible the tools are making them. + +## Next Steps + +The Machine Learning and Data Science ecosystem in Go is growing, and there are lots of exciting opportunities to contribute to a wide variety of projects - including TensorFlow. 
+ +In the near future, I plan to push this to GitHub as I had a number of requests at both GopherCon and PyCon to do so and would love to see others learn from this project as well as help to develop and expand its capabilities. + +*Update 08/29/17 - Full Source Code is now live on [GitHub](https://github.com/ActiveState/neuroblast)!* + +If you want to ask questions about this project, feel free to hit me up on Twitter [@rawktron](https://twitter.com/rawktron). diff --git a/content/post/distributed_trump_finder.md b/content/post/distributed_trump_finder.md index dfe20a1..88be09d 100644 --- a/content/post/distributed_trump_finder.md +++ b/content/post/distributed_trump_finder.md @@ -4,7 +4,7 @@ draft = false tags = ["academic", "hugo"] title = "Building a distributed Trump finder" math = false -summary = """A step-by-step guide to building a distributed facial recognition system with Pachyderm and Machine Box. +summary = """A step-by-step guide to building a distributed facial recognition system with Pachyderm and Machine Box. """ [header] @@ -57,7 +57,7 @@ trump1.jpg file 78.98 KiB trump2.jpg file 334.5 KiB trump3.jpg file 11.63 KiB trump4.jpg file 27.45 KiB -trump5.jpg file 33.6 KiB +trump5.jpg file 33.6 KiB ➔ cd ../../labels/ ➔ ls clinton.jpg trump.jpg @@ -83,16 +83,16 @@ We are using a little bash magic here to perform these operations. 
However, it ```sh create-MB-pipeline.sh identify.json tag.json train.json -➔ ./create-MB-pipeline.sh train.json +➔ ./create-MB-pipeline.sh train.json ➔ pachctl list-pipeline NAME INPUT OUTPUT STATE model training model/master running ➔ pachctl list-job ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE -3425a7a0-543e-4e2a-a244-a3982c527248 model/- 9 seconds ago - 1 0 / 1 running +3425a7a0-543e-4e2a-a244-a3982c527248 model/- 9 seconds ago - 1 0 / 1 running ➔ pachctl list-job ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE -3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 5 minutes ago 5 minutes 1 1 / 1 success +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 5 minutes ago 5 minutes 1 1 / 1 success ➔ pachctl list-repo NAME CREATED SIZE model 5 minutes ago 4.118 KiB @@ -111,15 +111,15 @@ As you can see the output of this pipeline is a `.facebox` file that contained t We then launch another Pachyderm pipeline, based on an [identify.json](https://github.com/dwhitena/pach-machine-box/blob/master/pipelines/identify.json) pipeline specification, to identify faces within the `unidentified` images. This pipeline will take the persisted state of our model in `model` along with the `unidentified` images as input. It will also execute cURL commands to interact with facebox, and it will output indications of identified faces to JSON files, one per `unidentified` image. 
```sh -➔ ./create-MB-pipeline.sh identify.json +➔ ./create-MB-pipeline.sh identify.json ➔ pachctl list-job ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE -281d4393-05c8-44bf-b5de-231cea0fc022 identify/- 6 seconds ago - 0 0 / 2 running -3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 8 minutes ago 5 minutes 1 1 / 1 success +281d4393-05c8-44bf-b5de-231cea0fc022 identify/- 6 seconds ago - 0 0 / 2 running +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 8 minutes ago 5 minutes 1 1 / 1 success ➔ pachctl list-job ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE -281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 About a minute ago 53 seconds 0 2 / 2 success -3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 9 minutes ago 5 minutes 1 1 / 1 success +281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 About a minute ago 53 seconds 0 2 / 2 success +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 9 minutes ago 5 minutes 1 1 / 1 success ➔ pachctl list-repo NAME CREATED SIZE identify About a minute ago 1.932 KiB @@ -166,17 +166,17 @@ We are most of the way there! We have identified Trump in the `unidentified` ima To do this, we can use a [simple Go program](https://github.com/dwhitena/pach-machine-box/blob/master/tagimage/main.go) to draw the label image on the `unidentified` image at the appropriate location. 
This part of the pipeline is specified by a [tag.json](https://github.com/dwhitena/pach-machine-box/blob/master/pipelines/tag.json) pipeline specification, and can be created as follows: ```sh -➔ pachctl create-pipeline -f tag.json +➔ pachctl create-pipeline -f tag.json ➔ pachctl list-job ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE -cd284a28-6c97-4236-9f6d-717346c60f24 tag/- 2 seconds ago - 0 0 / 2 running -281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 5 minutes ago 53 seconds 0 2 / 2 success -3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 13 minutes ago 5 minutes 1 1 / 1 success +cd284a28-6c97-4236-9f6d-717346c60f24 tag/- 2 seconds ago - 0 0 / 2 running +281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 5 minutes ago 53 seconds 0 2 / 2 success +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 13 minutes ago 5 minutes 1 1 / 1 success ➔ pachctl list-job ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE -cd284a28-6c97-4236-9f6d-717346c60f24 tag/ae747e8032704b6cae6ae7bba064c3c3 25 seconds ago 11 seconds 0 2 / 2 success -281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 5 minutes ago 53 seconds 0 2 / 2 success -3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 14 minutes ago 5 minutes 1 1 / 1 success +cd284a28-6c97-4236-9f6d-717346c60f24 tag/ae747e8032704b6cae6ae7bba064c3c3 25 seconds ago 11 seconds 0 2 / 2 success +281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 5 minutes ago 53 seconds 0 2 / 2 success +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 14 minutes ago 5 minutes 1 1 / 1 success ➔ pachctl list-repo NAME CREATED SIZE tag 30 seconds ago 591.3 KiB @@ -195,9 +195,9 @@ As you can see, we now have two "tagged" versions of the images in the output `t ![alt 
text](https://raw.githubusercontent.com/dwhitena/pach-machine-box/master/tagged_images1.jpg) -**Teaching a new faces, updating the output**: +**Teaching a new face, updating the output**: -Our pipeline isn't restricted to Trump or any one face. Actually, we can teach facebox another face by updating our `training`. Moreover, becauce Pachyderm verions your data and know what data is new, it can automatically update all our results once facebox learns the new face: +Our pipeline isn't restricted to Trump or any one face. Actually, we can teach facebox another face by updating our `training`. Moreover, because Pachyderm versions your data and knows what data is new, it can automatically update all our results once facebox learns the new face: ```sh ➔ cd ../data/train/faces2/ @@ -206,25 +206,25 @@ clinton1.jpg clinton2.jpg clinton3.jpg clinton4.jpg ➔ pachctl put-file training master -c -r -f . ➔ pachctl list-job ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE -56e24ac0-0430-4fa4-aa8b-08de5c1884db model/- 4 seconds ago - 0 0 / 1 running -cd284a28-6c97-4236-9f6d-717346c60f24 tag/ae747e8032704b6cae6ae7bba064c3c3 6 minutes ago 11 seconds 0 2 / 2 success -281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 11 minutes ago 53 seconds 0 2 / 2 success -3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 20 minutes ago 5 minutes 1 1 / 1 success +56e24ac0-0430-4fa4-aa8b-08de5c1884db model/- 4 seconds ago - 0 0 / 1 running +cd284a28-6c97-4236-9f6d-717346c60f24 tag/ae747e8032704b6cae6ae7bba064c3c3 6 minutes ago 11 seconds 0 2 / 2 success +281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 11 minutes ago 53 seconds 0 2 / 2 success +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 20 minutes ago 5 minutes 1 1 / 1 success ➔ pachctl list-job ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE -6aa6c995-58ce-445d-999a-eb0e0690b041 tag/7cbd2584d4f0472abbca0d9e015b9829 5 
seconds ago 1 seconds 0 2 / 2 success -8a7961b7-1085-404a-b0ee-66034fae7212 identify/1bc94ec558e44e0cb45ed5ab7d9f9674 59 seconds ago 54 seconds 0 2 / 2 success -56e24ac0-0430-4fa4-aa8b-08de5c1884db model/002f16b63a4345a4bc6bdf5510c9faac About a minute ago 19 seconds 0 1 / 1 success -cd284a28-6c97-4236-9f6d-717346c60f24 tag/ae747e8032704b6cae6ae7bba064c3c3 8 minutes ago 11 seconds 0 2 / 2 success -281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 13 minutes ago 53 seconds 0 2 / 2 success -3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 21 minutes ago 5 minutes 1 1 / 1 success +6aa6c995-58ce-445d-999a-eb0e0690b041 tag/7cbd2584d4f0472abbca0d9e015b9829 5 seconds ago 1 seconds 0 2 / 2 success +8a7961b7-1085-404a-b0ee-66034fae7212 identify/1bc94ec558e44e0cb45ed5ab7d9f9674 59 seconds ago 54 seconds 0 2 / 2 success +56e24ac0-0430-4fa4-aa8b-08de5c1884db model/002f16b63a4345a4bc6bdf5510c9faac About a minute ago 19 seconds 0 1 / 1 success +cd284a28-6c97-4236-9f6d-717346c60f24 tag/ae747e8032704b6cae6ae7bba064c3c3 8 minutes ago 11 seconds 0 2 / 2 success +281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 13 minutes ago 53 seconds 0 2 / 2 success +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 21 minutes ago 5 minutes 1 1 / 1 success ➔ pachctl list-file tag master NAME TYPE SIZE tagged_image1.jpg file 557 KiB tagged_image2.jpg file 36.03 KiB ``` -Now if we look at our images, we find that everything has been updated without any annoying manual work on our end: +Now if we look at our images, we find that everything has been updated without any annoying manual work on our hands: ![alt text](https://raw.githubusercontent.com/dwhitena/pach-machine-box/master/tagged_images2.jpg) diff --git a/content/post/more_go_based_workflow_tools_in_bioinformatics.md b/content/post/more_go_based_workflow_tools_in_bioinformatics.md new file mode 100644 index 0000000..44f4101 --- 
/dev/null +++ b/content/post/more_go_based_workflow_tools_in_bioinformatics.md @@ -0,0 +1,266 @@ ++++ +title = "More Go-based Workflow Tools in Bioinformatics" +date = "2017-11-02T23:10:02+01:00" + +math = false +highlight = true +tags = ["workflows", "bioinformatics"] + +# Optional featured image (relative to `static/img/` folder). +[header] +image = "" +caption = "" ++++ + +It is an exciting time for Go as a data science language and for the +[\#gopherdata](http://gopherdata.io/) movement. The +[ecosystem of tools](https://github.com/gopherdata/resources/blob/master/tooling/README.md) +is constantly improving, with both general purpose tools (e.g., for data frames and +statistical analyses) and more specialized ones (e.g., for neural networks and +graph-based algorithms) popping up every day. + +Recent weeks have been particularly exciting for those involved in the +bioinformatics field. In addition to generic libraries for bioinformatics such +as [bíogo](https://github.com/biogo/biogo), which was recently reviewed in quite some detail in two blog posts +([part I](https://medium.com/@boti_ka/a-gentle-introduction-to-b%C3%ADogo-part-i-65dbd40e31d4) +and [part II](https://medium.com/@boti_ka/a-gentle-introduction-to-bíogo-part-ii-1f0df1cf72f0)), +the ecosystem of scientific workflow tools focusing on or being used in +bioinformatics is also growing: Last week, another Go-based workflow +orchestration tool, [Reflow](https://github.com/grailbio/reflow), was released +as open source by life science startup [Grail Inc](https://grail.com/). + +Reflow brings a number of interesting features to the +table, including: (i) comprehensive and very easy-to-use Amazon Web +Services (AWS) integration, (ii) memoization features based on S3 +storage, and (iii) the ability to run the same docker-based workflow on your +local computer or on AWS. 
+ +Because we are users of and contributors to two other existing Go-based +workflow projects ([Pachyderm](http://pachyderm.io/) and +[SciPipe](http://scipipe.org/)), we thought that a brief comparison of the +different approaches would be useful. In particular, we hope that this summary +helps highlight differences between the tools, guide users to appropriate +workflow and orchestration tools, and/or provide a jumping-off point for +contributions and experimentation. + +Workflow Tools in Bio- and Cheminformatics +---------------------------------------------------- + +Before we continue, let's step back and add a few words about what workflow +tools are and why they are relevant in Bioinformatics (or Science in general), +for anyone new to the term. + +[Scientific workflow tools](https://en.wikipedia.org/wiki/Scientific_workflow_system) are tools or +systems designed to help coordinate computations when the number of computation +steps is very large and/or the steps have complex dependencies between their +input and output data. This is common in many scientific fields, but it is +especially common in bioinformatics, because of the plethora of tools +built to analyze the extremely heterogeneous data types describing biological +organisms, from molecules to cells to organs to full body physiology. + +Common tasks in bioinformatics necessitating the use of workflow tools include +(DNA) sequence alignment, (gene) variant calling, RNA quantification, and more. +In the related field of cheminformatics, workflow tools are often used to build +predictive models relating the chemical structure of drug compounds to some +measurable property of the compound, in what is commonly called Quantitative +Structure-Activity Relationship (QSAR) modeling. Examples of such properties are the +level of binding to protein targets in the body, or various chemical properties, +such as solubility, that might affect a compound's potential for uptake in the body. 
+ +The Go bioinformatics workflow tooling ecosystem +------------------------------------------------ + +![Gopher thinking about workflows, surrounded by workflow tool logos](/img/wftools/gopher_thinking_workflows.png) + +The main Go-based workflow tools that have targeted bioinformatics workflows +(or have been used to implement bioinformatics workflows) are: + +**[Pachyderm](http://pachyderm.io/)** - This platform allows users to build +scalable, containerized data pipelines using any language or framework, while +also getting the right data to the right code as both data and code change over +time. Individual pipeline stages are defined by docker images and can be +individually parallelized across large data sets (by leveraging +[Kubernetes](https://kubernetes.io/) under the hood). + +**[SciPipe](http://scipipe.org/)** - This tool is much smaller than something +like Pachyderm and is focused on highly dynamic workflow constructs in +which on-line scheduling is key. SciPipe also focuses, thus far, on workflows +that run on HPC clusters and local computers, although it can be integrated +into a more comprehensive framework, such as Pachyderm or Reflow. + +**[AWE](https://github.com/MG-RAST/AWE)** - The AWE framework is a +comprehensive bioinformatics system targeting the cloud. It reportedly +comes with multi-cloud support and, it seems, HPC support, and it allows +users to leverage the [Common Workflow Language (CWL)](http://www.commonwl.org/). As +with Reflow, we don’t have hands-on experience with AWE, so our comments +are limited to what we’ve read in the documentation and examples. + +Comparing Reflow and Pachyderm +------------------------------ + +Reflow and Pachyderm seem to be the closest pair at first glance. Both frameworks +utilize containers as the main unit of data processing. However, there are some +differences in both workflow orchestration and data storage/management that we +will stress below. 
+ +### Orchestration + +While Reflow implements custom “runners” for each target environment (AWS and +your local machine currently), Pachyderm leverages Kubernetes under the hood +for container orchestration. This use of Kubernetes allows Pachyderm to +be maximally vendor-agnostic, as Kubernetes is widely deployed in all the +major clouds and on-premise. Reflow's authors are reportedly planning to +support other cloud providers, like GCP, in the near future, but +this effort seems to require writing custom runner code per target +environment/cloud. + +The use of Kubernetes also allows Pachyderm to automatically inherit the +orchestration functionality that has +propelled Kubernetes to become the industry leader in orchestration. More +specifically, Pachyderm pipelines are automatically “self-healing” in that they are +rescheduled and restarted when cluster nodes or jobs fail, and Pachyderm is +able to optimally schedule work across high-value nodes (e.g., high-CPU or GPU +nodes) using methods such as auto-scaling. Reflow appears to use its own +logic to match workloads with available instance types. + +Finally, Pachyderm and Reflow have differences in the way workflows are +actually defined. Pachyderm leverages the established JSON/YAML format that is +common in the Kubernetes community, and Reflow implements what seems to be its +own readable, Go-inspired domain-specific language (DSL). + +### Data Storage/Management + +For the data that is to be processed in workflow stages, Reflow provides an +interesting memoization mechanism based on S3, with automatic re-runs triggered +by updates to data. Similarly, Pachyderm provides a familiar, git-inspired +versioned data store, where new versions of data in the versioned data store +are also used to automatically trigger relevant parts of the workflow. 
+Pachyderm’s data store can be backed by any of the popular vendor-specific +object stores (GCS in GCP, S3 in AWS, or Blob Storage in Azure) or any other +object store with an S3-compatible API (e.g., an open source option like +Minio). + +In terms of metadata associated with jobs, versions of data, etc., Reflow +appears to use a Reflow-specific cache based on S3 and DynamoDB. In +contrast, Pachyderm utilizes etcd, a distributed key-value store, for metadata. + +Lastly, with regard to data sharding and parallelization, Pachyderm seems to go +further than Reflow. While Reflow does parallelize tools running on separate +data sets, Pachyderm also provides automatic parallelization of tools accessing +“datums” that may correspond to the same data set or even the same +file. Pachyderm automatically distributes the processing of these datums to +containers running in parallel (pods in Kubernetes) and gathers all of the +results back into a single logical collection of data. Pachyderm thus provides +what other frameworks like Hadoop and Spark are promising, but without the need +to replace legacy code and tools with code written in the MapReduce style or +to explicitly implement parallelism and data sharding in code. + +With these notes in mind, we think Reflow and Pachyderm are addressing slightly +different user needs. While Reflow seems to be an excellent choice for a quick +setup on AWS or a local docker-based server, we think Pachyderm will generally +provide more vendor agnosticism, better parallelization, and valuable +optimizations and updates that come out of the growing +Kubernetes community. Finally, we think that Pachyderm provides a stronger +foundation for rigorous and manageable data science with its unified data +versioning system, which can help data scientists and engineers better +understand data, collaborate, perform tests, share workflows, and so on. + +How does SciPipe compare? 
+------------------------------------------------- +SciPipe is, in this context, more of an apples-to-oranges comparison to +Pachyderm, Reflow or AWE. While Reflow and Pachyderm provide an integrated +tool encapsulation solution based on containers, SciPipe (in its current +form) is primarily focused on managing command-line-driven workflows on local +computers or HPC clusters with a shared file system, where containers might not +be an option. However, there are also other relevant differences related to +complexity of the tools, workflow implementation, and deployment/integration: + +### Complexity + +SciPipe is a much smaller tool in many ways. For example, a very simple +count of LOC for the different frameworks shows that SciPipe is implemented with +more than an order of magnitude fewer lines of code than the other tools: + +```bash +$ cd $GOPATH/src/github.com/grailbio/reflow +$ find | grep "\.go" | grep -vP "(vendor|examples|_test)" | xargs cat | grep -vP "^\/\/" | sed '/^\s*$/d' | wc -l +26371 +``` + +```bash +$ cd $GOPATH/src/github.com/pachyderm/pachyderm/src +$ find | grep "\.go" | grep -vP "(vendor|examples|_test)" | xargs cat | grep -vP "^\/\/" | sed '/^\s*$/d' | wc -l +25778 +``` + +```bash +$ cd $GOPATH/src/github.com/MG-RAST/AWE +$ find | grep "\.go" | grep -vP "(vendor|examples|_test)" | xargs cat | grep -vP "^\/\/" | sed '/^\s*$/d' | wc -l +24485 +``` + +```bash +$ cd $GOPATH/src/github.com/scipipe/scipipe +$ find | grep "\.go" | grep -vP "(vendor|examples|_test)" | xargs cat | grep -vP "^\/\/" | sed '/^\s*$/d' | wc -l +1699 +``` + +### Workflow Implementation + +Further, SciPipe was designed to primarily support highly dynamic workflow +constructs, where dynamic/on-line scheduling is needed. These workflows include +scenarios in which you are continuously chunking up and computing a dataset of +unknown size or parametrizing parts of the workflow with parameter values +extracted in an earlier part of the workflow. 
An example of the former would
+be lazily processing data extracted from a database without saving the
+temporary output to disk. An example of the latter would be doing a parameter
+optimization to select, e.g., good gamma and cost values for libSVM before
+actually training the model with the obtained parameters.
+
+Also, where Pachyderm and Reflow provide manifest formats or DSLs for writing
+workflows, SciPipe lets you write workflows directly in Go. SciPipe is
+thus consumed as a programming library rather than as a framework.
+This approach might scare off some users intimidated by Go’s relative
+verbosity compared to specialized DSLs, but it also allows users to leverage
+the extremely powerful existing tooling and editor support for Go.
+
+### Deployment/Integration
+
+What is perhaps most interesting in the context of this comparison is the fact
+that SciPipe workflows can be compiled to small static binaries. This
+makes it very easy to package up smaller SciPipe workflows in
+individual containers and integrate them into tools like Pachyderm or Reflow or
+other services. We thus imagine that SciPipe could complement Pachyderm
+or Reflow when highly dynamic workflow constructs are needed, which
+may be a challenge to implement in the manifests and DSLs of Pachyderm or
+Reflow.
+
+Reflow and AWE?
+---------------------------------------
+
+We know the least about AWE at this point, so we don’t want to venture too far
+into a detailed comparison with Reflow (which is also new to us). However,
+based on our reading of online materials, we can note that they seem to share
+a focus on bioinformatics and cloud support. AWE additionally supports the
+Common Workflow Language and HPC clusters. We expect there to be some
+differences in terms of storage, because AWE ships with its own storage
+solution and isn’t based on a cloud offering like S3.
Past that, we will leave it to the authors
+of AWE and Reflow, or to the community, to provide more comprehensive
+comparisons.
+
+In Summary
+-----------------------
+
+In summary, we think it is extremely exciting and reassuring to see
+continued innovation in the Go data science ecosystem, and we are excited to
+see more and more data gophers and projects join the community. We also hope
+this little overview will help users navigate the growing ecosystem of workflow
+tools by highlighting some of their inherent differences.
+
+----
+
+[Samuel Lampa](https://twitter.com/smllmp), PhD Student at [Uppsala University](http://pharmb.io)
+ +[Jon Ander Novella](https://www.linkedin.com/in/jon-ander-novella/), Research Assistant at [Uppsala University](http://pharmb.io)
+ +[Daniel Whitenack](https://twitter.com/dwhitena), Data Scientist and Lead Developer Advocate at [Pachyderm Inc.](http://pachyderm.io) diff --git a/public b/public index a8480d0..8dba0dd 160000 --- a/public +++ b/public @@ -1 +1 @@ -Subproject commit a8480d08cf8aa39f2ba90a22acc1b21a7af32e5c +Subproject commit 8dba0ddb17d225280774a968e60ae9e567d7e3f8 diff --git a/static/img/gameai/GopherNeg.png b/static/img/gameai/GopherNeg.png new file mode 100644 index 0000000..95ac22f Binary files /dev/null and b/static/img/gameai/GopherNeg.png differ diff --git a/static/img/gameai/GopherPos.png b/static/img/gameai/GopherPos.png new file mode 100644 index 0000000..6e279ff Binary files /dev/null and b/static/img/gameai/GopherPos.png differ diff --git a/static/img/gameai/NNViz.png b/static/img/gameai/NNViz.png new file mode 100644 index 0000000..033e6c2 Binary files /dev/null and b/static/img/gameai/NNViz.png differ diff --git a/static/img/portrait.jpg b/static/img/portrait.jpg index 04627bd..760d97b 100644 Binary files a/static/img/portrait.jpg and b/static/img/portrait.jpg differ diff --git a/static/img/wftools/gopher_thinking_workflows.png b/static/img/wftools/gopher_thinking_workflows.png new file mode 100644 index 0000000..9da21e9 Binary files /dev/null and b/static/img/wftools/gopher_thinking_workflows.png differ diff --git a/themes/academic b/themes/academic index 3af123c..4859cbe 160000 --- a/themes/academic +++ b/themes/academic @@ -1 +1 @@ -Subproject commit 3af123c4d58edfe9a3d441e2d9e8af03f2f649da +Subproject commit 4859cbe36a35e03ed060c8377982d802371009e9