Commit fffe9fc

Fix typos and link directly to README.md for immediate viewing on mobile
1 parent 4ac7573 commit fffe9fc

1 file changed

+20
-20
lines changed

content/post/more_go_based_workflow_tools_in_bioinformatics.md

Lines changed: 20 additions & 20 deletions
@@ -14,9 +14,9 @@ caption = ""
 
 It is an exciting time for Go as a data science language and for the
 [\#gopherdata](http://gopherdata.io/) movement. The
-[ecosystem of tools](https://github.com/gopherdata/resources/tree/master/tooling)
-is constantly improving, with both general purpose tools (e.g., for dataframes and
-statistical analyses) and more specialised ones (e.g., for neural networks and
+[ecosystem of tools](https://github.com/gopherdata/resources/blob/master/tooling/README.md)
+is constantly improving, with both general purpose tools (e.g., for data frames and
+statistical analyses) and more specialized ones (e.g., for neural networks and
 graph-based algorithms) popping up every day.
 
 Recent weeks have been particularly exciting for those involved in the
@@ -55,7 +55,7 @@ systems designed to help coordinate computations when the number of computation
 steps is very large and/or the steps have complex dependencies between their
 input and output data. This is common in many scientific fields, but it is
 common in bioinformatics in particular, because of the vast plethora of tools
-built to analyse the extremely heterogenous data types describing biological
+built to analyze the extremely heterogeneous data types describing biological
 organisms, from molecules to cells to organs to full body physiology.
 
 Common tasks in bioinformatics necessitating the use of workflow tools include
@@ -82,7 +82,7 @@ time. Individual pipeline stages are defined by docker images and can be
 individually parallelized across large data sets (by leveraging
 [Kubernetes](https://kubernetes.io/) under the hood).
 
-**[SciPipe](http://scipipe.org/)** - This tool is a much smaller than something
+**[SciPipe](http://scipipe.org/)** - This tool is much smaller than something
 like Pachyderm and is focused on highly dynamic workflow constructs in
 which on-line scheduling is key. SciPipe also focuses, thus far, on workflows
 that run on HPC clusters and local computers.  Although, it can be integrated
@@ -114,17 +114,17 @@ support other cloud providers, like GCP, in the near future though, but
 this effort seems to require writing custom runner code per target
 environment/cloud.
 
-use of Kubernetes also allows Pachyderm to matically bring certain
-orchestration functionality into a nformatics context that has otherwise
+The use of Kubernetes also allows Pachyderm to automatically bring certain
+orchestration functionality into a context that has otherwise
 propelled Kubernetes to become industry leader in orchestration. More
-specifically, Pachyderm lines are automatically “self-healing” in that they are
+specifically, Pachyderm pipelines are automatically “self-healing” in that they are
 rescheduled and restarted when cluster nodes or jobs fail, and Pachyderm is
 able to optimally schedule work across high-value nodes (e.g., high-CPU or GPU
-nodes) using methods such as auto-scaling.  Reflow appears to use it’s own
+nodes) using methods such as auto-scaling. Reflow appears to use it’s own
 logic to match workloads with available instance types.
 
 Finally, Pachyderm and Reflow have differences in the way workflows are
-actually defined.  Pachyderm leverages the established JSON/YAML format that is
+actually defined. Pachyderm leverages the established JSON/YAML format that is
 common in the Kubernetes community, and Reflow implements what seems to be its
 own readable, Go-inspired domain specific language (DSL).
 
@@ -145,8 +145,8 @@ appears to use a Reflow-specific cache based on S3 and DynamoDB. In
 contrast, Pachyderm utilizes etcd, a distributed key-value store, for metadata.
 
 Lastly with regard to data sharding and parallelization, Pachyderm seems to go
-further than Reflow. While Reflow does parallelise tools running on separate
-data sets, Pachyderm also provides automatic parallelisation of tools accessing
+further than Reflow. While Reflow does parallelize tools running on separate
+data sets, Pachyderm also provides automatic parallelization of tools accessing
 “datums” that may correspond to the same data set or even the same
 file. Pachyderm automatically distributes the processing of these datums to
 containers running in parallel (pods in Kubernetes) and gathers all of the
@@ -158,7 +158,7 @@ explicitly implement parallelism and data sharding in code.
 With these notes in mind, we think Reflow and Pachyderm are addressing slightly
 different user needs. While Reflow seems to be an excellent choice for a quick
 setup on AWS or a local docker-based server, we think Pachyderm will generally
-provide more vendor agnosticity, better parallelisation, and valuable
+provide more vendor agnosticity, better parallelization, and valuable
 optimizations and updates that come out of the growing
 Kubernetes community. Finally, we think that Pachyderm provides a stronger
 foundation for rigorous and manageable data science with its unified data
@@ -178,8 +178,8 @@ complexity of the tools, workflow implementation, and deployment/integration:
 
 ### Complexity
 
-SciPipe is also a much smaller tool in many ways. For example, a very simple
-count of LOC for the three frameworks shows that SciPipe is implemented with
+SciPipe is a much smaller tool in many ways. For example, a very simple
+count of LOC for the different frameworks shows that SciPipe is implemented with
 more than an order of magnitude less lines of code than the other tools:
 
 ```bash
@@ -211,16 +211,16 @@ $ find | grep "\.go" | grep -vP “(vendor|examples|_test)” | xargs cat | grep
 Further, SciPipe was designed to primarily support highly dynamic workflow
 constructs, where dynamic/on-line scheduling is needed. These workflows include
 scenarios in which you are continuously chunking up and computing a dataset of
-unknown size or parametrising parts of the workflow with parameter values
+unknown size or parametrizing parts of the workflow with parameter values
 extracted in an earlier part of the workflow. An example of the former would
 be lazily processing data extracted from a database without saving the
 temporary output to disk. An example of the latter would be doing a parameter
-optimisation to select, e.g., good gamma and cost values for libSVM before
+optimization to select, e.g., good gamma and cost values for libSVM before
 actually training the model with the obtained parameters.
 
 Also, where Pachyderm and Reflow provide manifest formats or DSLs for writing
-workflows, SciPipe lets you write workflows directly in Go. SciPipe is,
-thus, consumed as a programming-library rather than a framework.
+workflows, SciPipe lets you write workflows directly in Go. SciPipe is
+thus consumed as a programming-library rather than a framework.
 This feature might scare off some users intimidated by Go’s relative
 verboseness compared to specialized DSLs, but it also allows users to leverage
 extremely powerful existing tooling and editor support for Go.
@@ -232,7 +232,7 @@ that SciPipe workflows can be compiled to small static binaries. This
 compilation makes it very easy to package up smaller SciPipe workflows in
 individual containers and integrate them into tools like Pachyderm or Reflow or
 other services. We thus imagine that SciPipe could be a complement to Pachyderm
-or Reflow when highly dynamic workflow constructs are needed, which constructs
+or Reflow when highly dynamic workflow constructs are needed, which
 may be a challenge to implement in the manifests and DSLs of Pachyderm or
 Reflow.
 