Commit 6d1e993

Merge pull request #4 from gopherdata/dw-post
trump finder post
2 parents 4646933 + b39715e commit 6d1e993

3 files changed

+240
-1
lines changed

Lines changed: 239 additions & 0 deletions
@@ -0,0 +1,239 @@
+++
date = "2017-04-26T12:00:00"
draft = false
tags = ["academic", "hugo"]
title = "Building a distributed Trump finder"
math = false
summary = """A step-by-step guide to building a distributed facial recognition system with Pachyderm and Machine Box.
"""

[header]
image = "headers/trump_finder.jpg"
caption = ""

+++

(Author: Daniel Whitenack, @dwhitena on [Twitter](https://twitter.com/dwhitena) and Gophers Slack)

If you haven't heard, there is a new kid on the machine learning block named [Machine Box](https://machinebox.io/). It's pretty cool, and you should check it out. Machine Box provides pre-built Docker images that enable easy, production-ready, and reproducible machine learning operations. For example, you can get a "facebox" Docker image from Machine Box for facial recognition. When you run "facebox," you get a full JSON API that lets you easily "teach" facebox certain people's faces, identify those faces in images, and persist the "state" of trained facial recognition models.

Experimenting with "facebox" got me thinking about how it could be integrated into some of my workflows. In particular, I wanted to see how Machine Box images could be used as part of a distributed data processing pipeline built with [Pachyderm](http://pachyderm.io/). Pachyderm builds, runs, and manages pipelines, such as machine learning workflows, based on Docker images, so integrating Machine Box images seemed only natural.

And that's how the first (to my knowledge) distributed, Docker-ized, "Trump-finding" data pipeline came to be. In the sections below, we will walk through the creation of a facial recognition pipeline that is able to find and tag the location of Donald Trump's face in images. Actually, this pipeline could be used to identify any faces, and we will illustrate this flexibility by updating the pipeline to learn a second face, Hillary Clinton.

The sections below assume that you have a Pachyderm cluster running, that `pachctl` (the Pachyderm CLI tool) is connected to that cluster, and that you have signed up for a Machine Box key (which you can do for free). All the code and more detailed instructions can be found [here](https://github.com/dwhitena/pach-machine-box).

**Create the pipeline inputs**:

To train our facial recognition model, identify faces in images, and tag those faces with certain labels, we need to create three "data repositories" that will be the inputs to our Pachyderm pipeline:

1. `training` - which includes images of faces that we use to "teach" facebox
2. `unidentified` - which includes images with faces we want to detect and identify
3. `labels` - which includes label images that we will overlay on the unidentified images to indicate identified faces

This can be done with `pachctl`:

```sh
➔ pachctl create-repo training
➔ pachctl create-repo unidentified
➔ pachctl create-repo labels
➔ pachctl list-repo
NAME           CREATED              SIZE
labels         3 seconds ago        0 B
unidentified   11 seconds ago       0 B
training       17 seconds ago       0 B
➔ cd data/train/faces1/
➔ ls
trump1.jpg  trump2.jpg  trump3.jpg  trump4.jpg  trump5.jpg
➔ pachctl put-file training master -c -r -f .
➔ pachctl list-repo
NAME           CREATED              SIZE
training       5 minutes ago        486.2 KiB
labels         5 minutes ago        0 B
unidentified   5 minutes ago        0 B
➔ pachctl list-file training master
NAME         TYPE   SIZE
trump1.jpg   file   78.98 KiB
trump2.jpg   file   334.5 KiB
trump3.jpg   file   11.63 KiB
trump4.jpg   file   27.45 KiB
trump5.jpg   file   33.6 KiB
➔ cd ../../labels/
➔ ls
clinton.jpg  trump.jpg
➔ pachctl put-file labels master -c -r -f .
➔ cd ../unidentified/
➔ ls
image1.jpg  image2.jpg
➔ pachctl put-file unidentified master -c -r -f .
➔ pachctl list-repo
NAME           CREATED              SIZE
unidentified   7 minutes ago        540.4 KiB
labels         7 minutes ago        15.44 KiB
training       7 minutes ago        486.2 KiB
```

**Train, or "teach," facebox**:

Next, we create a Pachyderm pipeline stage that will take the `training` data as input, provide those training images to facebox, and output the "state" of a trained model. This is done by providing Pachyderm with a pipeline spec, [train.json](https://github.com/dwhitena/pach-machine-box/blob/master/pipelines/train.json). This pipeline spec specifies the Docker image to use for data processing, the commands to execute in that Docker container, and what data is input to the pipeline.

In our particular case, `train.json` specifies that we should use an image based on the facebox image from Machine Box and execute a number of cURL commands to post the training data to facebox. Once those training images are provided to and processed by facebox, we specify another cURL command to export the state of the facebox model (for use later in our pipeline).

We are using a little bash magic here to perform these operations. However, it is very possible that, in the future, Machine Box will provide a more standardized command line implementation for these sorts of use cases.

```sh
create-MB-pipeline.sh identify.json tag.json train.json
➔ ./create-MB-pipeline.sh train.json
➔ pachctl list-pipeline
NAME    INPUT      OUTPUT         STATE
model   training   model/master   running
➔ pachctl list-job
ID                                     OUTPUT COMMIT                               STARTED              DURATION    RESTART  PROGRESS  STATE
3425a7a0-543e-4e2a-a244-a3982c527248   model/-                                     9 seconds ago        -           1        0 / 1     running
➔ pachctl list-job
ID                                     OUTPUT COMMIT                               STARTED              DURATION    RESTART  PROGRESS  STATE
3425a7a0-543e-4e2a-a244-a3982c527248   model/1b9c158e33394056a18041a4a86cb54a      5 minutes ago        5 minutes   1        1 / 1     success
➔ pachctl list-repo
NAME           CREATED              SIZE
model          5 minutes ago        4.118 KiB
unidentified   18 minutes ago       540.4 KiB
labels         18 minutes ago       15.44 KiB
training       19 minutes ago       486.2 KiB
➔ pachctl list-file model master
NAME            TYPE   SIZE
state.facebox   file   4.118 KiB
```

As you can see, the output of this pipeline is a `.facebox` file that contains the trained state of our facebox model.

**Use the trained facebox to identify faces**:

We then launch another Pachyderm pipeline, based on an [identify.json](https://github.com/dwhitena/pach-machine-box/blob/master/pipelines/identify.json) pipeline specification, to identify faces within the `unidentified` images. This pipeline will take the persisted state of our model in `model` along with the `unidentified` images as input. It will also execute cURL commands to interact with facebox, and it will output indications of identified faces to JSON files, one per `unidentified` image.

```sh
➔ ./create-MB-pipeline.sh identify.json
➔ pachctl list-job
ID                                     OUTPUT COMMIT                               STARTED              DURATION    RESTART  PROGRESS  STATE
281d4393-05c8-44bf-b5de-231cea0fc022   identify/-                                  6 seconds ago        -           0        0 / 2     running
3425a7a0-543e-4e2a-a244-a3982c527248   model/1b9c158e33394056a18041a4a86cb54a      8 minutes ago        5 minutes   1        1 / 1     success
➔ pachctl list-job
ID                                     OUTPUT COMMIT                               STARTED              DURATION    RESTART  PROGRESS  STATE
281d4393-05c8-44bf-b5de-231cea0fc022   identify/287fc78a4cdf42d89142d46fb5f689d9   About a minute ago   53 seconds  0        2 / 2     success
3425a7a0-543e-4e2a-a244-a3982c527248   model/1b9c158e33394056a18041a4a86cb54a      9 minutes ago        5 minutes   1        1 / 1     success
➔ pachctl list-repo
NAME           CREATED              SIZE
identify       About a minute ago   1.932 KiB
model          10 minutes ago       4.118 KiB
unidentified   23 minutes ago       540.4 KiB
labels         23 minutes ago       15.44 KiB
training       24 minutes ago       486.2 KiB
➔ pachctl list-file identify master
NAME          TYPE   SIZE
image1.json   file   1.593 KiB
image2.json   file   347 B
```

If we look at the JSON output for, e.g., `image1.jpg`, we can see that there is a portion of the file that clearly identifies Donald Trump in the image along with the location and size of his face in the image:

```
{
    "success": true,
    "facesCount": 13,
    "faces": [
        ...
        ...
        {
            "rect": {
                "top": 175,
                "left": 975,
                "width": 108,
                "height": 108
            },
            "id": "58ff31510f7707a01fb3e2f4d39f26dc",
            "name": "trump",
            "matched": true
        },
        ...
        ...
    ]
}
```

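To pull the matched faces out of a file like `image1.json` programmatically, a small Go sketch could look like the following. The struct fields mirror the JSON keys in the output above; the type names, the `matchedFaces` helper, and the unmatched face in the sample input are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rect, face, and checkResponse mirror the keys in the facebox output.
// The Go type names themselves are hypothetical.
type rect struct {
	Top    int `json:"top"`
	Left   int `json:"left"`
	Width  int `json:"width"`
	Height int `json:"height"`
}

type face struct {
	Rect    rect   `json:"rect"`
	ID      string `json:"id"`
	Name    string `json:"name"`
	Matched bool   `json:"matched"`
}

type checkResponse struct {
	Success    bool   `json:"success"`
	FacesCount int    `json:"facesCount"`
	Faces      []face `json:"faces"`
}

// matchedFaces decodes the JSON and keeps only faces flagged as matched.
func matchedFaces(data []byte) ([]face, error) {
	var resp checkResponse
	if err := json.Unmarshal(data, &resp); err != nil {
		return nil, err
	}
	var out []face
	for _, f := range resp.Faces {
		if f.Matched {
			out = append(out, f)
		}
	}
	return out, nil
}

func main() {
	// Trimmed sample: the first (unmatched) face is invented for illustration;
	// the second is the Trump entry from the output above.
	sample := []byte(`{
		"success": true,
		"facesCount": 2,
		"faces": [
			{"rect": {"top": 10, "left": 20, "width": 30, "height": 30}, "matched": false},
			{"rect": {"top": 175, "left": 975, "width": 108, "height": 108},
			 "id": "58ff31510f7707a01fb3e2f4d39f26dc", "name": "trump", "matched": true}
		]
	}`)

	faces, err := matchedFaces(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range faces {
		fmt.Printf("%s at left=%d top=%d (%dx%d)\n",
			f.Name, f.Rect.Left, f.Rect.Top, f.Rect.Width, f.Rect.Height)
	}
}
```

The `rect` values recovered this way are exactly what a later stage needs in order to know where to draw a label.
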
**Tagging identified faces in the images**:

We are most of the way there! We have identified Trump in the `unidentified` images, but the JSON output isn't the most visually appealing. As such, let's overlay a label on the images at the location of Trump's face.

To do this, we can use a [simple Go program](https://github.com/dwhitena/pach-machine-box/blob/master/tagimage/main.go) to draw the label image on the `unidentified` image at the appropriate location. This part of the pipeline is specified by a [tag.json](https://github.com/dwhitena/pach-machine-box/blob/master/pipelines/tag.json) pipeline specification, and can be created as follows:

```sh
➔ pachctl create-pipeline -f tag.json
➔ pachctl list-job
ID                                     OUTPUT COMMIT                               STARTED              DURATION    RESTART  PROGRESS  STATE
cd284a28-6c97-4236-9f6d-717346c60f24   tag/-                                       2 seconds ago        -           0        0 / 2     running
281d4393-05c8-44bf-b5de-231cea0fc022   identify/287fc78a4cdf42d89142d46fb5f689d9   5 minutes ago        53 seconds  0        2 / 2     success
3425a7a0-543e-4e2a-a244-a3982c527248   model/1b9c158e33394056a18041a4a86cb54a      13 minutes ago       5 minutes   1        1 / 1     success
➔ pachctl list-job
ID                                     OUTPUT COMMIT                               STARTED              DURATION    RESTART  PROGRESS  STATE
cd284a28-6c97-4236-9f6d-717346c60f24   tag/ae747e8032704b6cae6ae7bba064c3c3        25 seconds ago       11 seconds  0        2 / 2     success
281d4393-05c8-44bf-b5de-231cea0fc022   identify/287fc78a4cdf42d89142d46fb5f689d9   5 minutes ago        53 seconds  0        2 / 2     success
3425a7a0-543e-4e2a-a244-a3982c527248   model/1b9c158e33394056a18041a4a86cb54a      14 minutes ago       5 minutes   1        1 / 1     success
➔ pachctl list-repo
NAME           CREATED              SIZE
tag            30 seconds ago       591.3 KiB
identify       5 minutes ago        1.932 KiB
model          14 minutes ago       4.118 KiB
unidentified   27 minutes ago       540.4 KiB
labels         27 minutes ago       15.44 KiB
training       27 minutes ago       486.2 KiB
➔ pachctl list-file tag master
NAME                TYPE   SIZE
tagged_image1.jpg   file   557 KiB
tagged_image2.jpg   file   34.35 KiB
```

As you can see, we now have two "tagged" versions of the images in the output `tag` data repository. If we get these images, we can see that... Boom! Our Trump finder works:

![alt text](https://raw.githubusercontent.com/dwhitena/pach-machine-box/master/tagged_images1.jpg)

**Teaching a new face, updating the output**:

Our pipeline isn't restricted to Trump or any one face. In fact, we can teach facebox another face by updating our `training` repository. Moreover, because Pachyderm versions your data and knows what data is new, it can automatically update all our results once facebox learns the new face:

```sh
➔ cd ../data/train/faces2/
➔ ls
clinton1.jpg  clinton2.jpg  clinton3.jpg  clinton4.jpg
➔ pachctl put-file training master -c -r -f .
➔ pachctl list-job
ID                                     OUTPUT COMMIT                               STARTED              DURATION    RESTART  PROGRESS  STATE
56e24ac0-0430-4fa4-aa8b-08de5c1884db   model/-                                     4 seconds ago        -           0        0 / 1     running
cd284a28-6c97-4236-9f6d-717346c60f24   tag/ae747e8032704b6cae6ae7bba064c3c3        6 minutes ago        11 seconds  0        2 / 2     success
281d4393-05c8-44bf-b5de-231cea0fc022   identify/287fc78a4cdf42d89142d46fb5f689d9   11 minutes ago       53 seconds  0        2 / 2     success
3425a7a0-543e-4e2a-a244-a3982c527248   model/1b9c158e33394056a18041a4a86cb54a      20 minutes ago       5 minutes   1        1 / 1     success
➔ pachctl list-job
ID                                     OUTPUT COMMIT                               STARTED              DURATION    RESTART  PROGRESS  STATE
6aa6c995-58ce-445d-999a-eb0e0690b041   tag/7cbd2584d4f0472abbca0d9e015b9829        5 seconds ago        1 seconds   0        2 / 2     success
8a7961b7-1085-404a-b0ee-66034fae7212   identify/1bc94ec558e44e0cb45ed5ab7d9f9674   59 seconds ago       54 seconds  0        2 / 2     success
56e24ac0-0430-4fa4-aa8b-08de5c1884db   model/002f16b63a4345a4bc6bdf5510c9faac      About a minute ago   19 seconds  0        1 / 1     success
cd284a28-6c97-4236-9f6d-717346c60f24   tag/ae747e8032704b6cae6ae7bba064c3c3        8 minutes ago        11 seconds  0        2 / 2     success
281d4393-05c8-44bf-b5de-231cea0fc022   identify/287fc78a4cdf42d89142d46fb5f689d9   13 minutes ago       53 seconds  0        2 / 2     success
3425a7a0-543e-4e2a-a244-a3982c527248   model/1b9c158e33394056a18041a4a86cb54a      21 minutes ago       5 minutes   1        1 / 1     success
➔ pachctl list-file tag master
NAME                TYPE   SIZE
tagged_image1.jpg   file   557 KiB
tagged_image2.jpg   file   36.03 KiB
```

Now if we look at our images, we find that everything has been updated without any annoying manual work on our end:

![alt text](https://raw.githubusercontent.com/dwhitena/pach-machine-box/master/tagged_images2.jpg)

**Conclusion/Resources**:

As you can see, Machine Box and Pachyderm make it really quick and easy to deploy a distributed, machine learning data pipeline. Be sure to:

- Visit [this repo](https://github.com/dwhitena/pach-machine-box) to get the code and pipeline specs, so you can create your own Trump finder!
- Join the [Pachyderm Slack team](http://slack.pachyderm.io/) to get help implementing your ML pipelines, and participate in the discussion in the #data-science channel on Gophers Slack.
- Follow [Pachyderm on Twitter](https://twitter.com/pachydermIO),
- Sign up for a free [Machine Box](https://machinebox.io/) API key, and
- Follow [Machine Box on Twitter](https://twitter.com/machineboxio).

static/img/headers/trump_finder.jpg

28.1 KB

themes/academic
