|
| 1 | ++++ |
| 2 | +date = "2017-04-26T12:00:00" |
| 3 | +draft = false |
| 4 | +tags = ["academic", "hugo"] |
| 5 | +title = "Building a distributed Trump finder" |
| 6 | +math = false |
| 7 | +summary = """A step-by-step guide to building a distributed facial recognition system with Pachyderm and Machine Box. |
| 8 | +""" |
| 9 | + |
| 10 | +[header] |
| 11 | +image = "headers/trump_finder.jpg" |
| 12 | +caption = "" |
| 13 | + |
| 14 | ++++ |
| 15 | + |
| 16 | +(Author: Daniel Whitenack, @dwhitena on [Twitter](https://twitter.com/dwhitena) and Gophers Slack) |
| 17 | + |
| 18 | +If you haven't heard, there is a new kid on the machine learning block named [Machine Box](https://machinebox.io/). It's pretty cool, and you should check it out. Machine Box provides pre-built Docker images that enable easy, production-ready, and reproducible machine learning operations. For example, you can get a "facebox" Docker image from Machine Box for facial recognition. When you run "facebox," you get a full JSON API that lets you easily "teach" facebox certain people's faces, identify those faces in images, and persist the "state" of trained facial recognition models. |
| 19 | + |
| 20 | +Experimenting with "facebox" got me thinking about how it could be integrated into some of my workflows. In particular, I wanted to see how Machine Box images could be used as part of a distributed data processing pipeline built with [Pachyderm](http://pachyderm.io/). Pachyderm builds, runs, and manages pipelines, such as machine learning workflows, based on Docker images. Thus, integrating Machine Box images seemed only natural. |
| 21 | + |
| 22 | +And that's how the first (to my knowledge) distributed, Docker-ized, "Trump-finding" data pipeline came to be. In the sections below we will walk through the creation of a facial recognition pipeline that is able to find and tag the location of Donald Trump's face in images. Actually, this pipeline could be used to identify any face, and we will illustrate this flexibility by updating the pipeline to learn a second face, Hillary Clinton's. |
| 23 | + |
| 24 | +The sections below assume that you have a Pachyderm cluster running, that `pachctl` (the Pachyderm CLI tool) is connected to that cluster, and that you have signed up for a Machine Box key (which you can do for free). All the code and more detailed instructions can be found [here](https://github.com/dwhitena/pach-machine-box). |
| 25 | + |
| 26 | +**Create the pipeline inputs**: |
| 27 | + |
| 28 | +To train our facial recognition model, identify faces in images, and tag those faces with certain labels, we need to create three "data repositories" that will be the inputs to our Pachyderm pipeline: |
| 29 | + |
| 30 | +1. `training` - which includes images of faces that we use to "teach" facebox |
| 31 | +2. `unidentified` - which includes images with faces we want to detect and identify |
| 32 | +3. `labels` - which includes label images that we will overlay on the unidentified images to indicate identified faces |
| 33 | + |
| 34 | +This can be done with `pachctl`: |
| 35 | + |
| 36 | +```sh |
| 37 | +➔ pachctl create-repo training |
| 38 | +➔ pachctl create-repo unidentified |
| 39 | +➔ pachctl create-repo labels |
| 40 | +➔ pachctl list-repo |
| 41 | +NAME CREATED SIZE |
| 42 | +labels 3 seconds ago 0 B |
| 43 | +unidentified 11 seconds ago 0 B |
| 44 | +training 17 seconds ago 0 B |
| 45 | +➔ cd data/train/faces1/ |
| 46 | +➔ ls |
| 47 | +trump1.jpg trump2.jpg trump3.jpg trump4.jpg trump5.jpg |
| 48 | +➔ pachctl put-file training master -c -r -f . |
| 49 | +➔ pachctl list-repo |
| 50 | +NAME CREATED SIZE |
| 51 | +training 5 minutes ago 486.2 KiB |
| 52 | +labels 5 minutes ago 0 B |
| 53 | +unidentified 5 minutes ago 0 B |
| 54 | +➔ pachctl list-file training master |
| 55 | +NAME TYPE SIZE |
| 56 | +trump1.jpg file 78.98 KiB |
| 57 | +trump2.jpg file 334.5 KiB |
| 58 | +trump3.jpg file 11.63 KiB |
| 59 | +trump4.jpg file 27.45 KiB |
| 60 | +trump5.jpg file 33.6 KiB |
| 61 | +➔ cd ../../labels/ |
| 62 | +➔ ls |
| 63 | +clinton.jpg trump.jpg |
| 64 | +➔ pachctl put-file labels master -c -r -f . |
| 65 | +➔ cd ../unidentified/ |
| 66 | +➔ ls |
| 67 | +image1.jpg image2.jpg |
| 68 | +➔ pachctl put-file unidentified master -c -r -f . |
| 69 | +➔ pachctl list-repo |
| 70 | +NAME CREATED SIZE |
| 71 | +unidentified 7 minutes ago 540.4 KiB |
| 72 | +labels 7 minutes ago 15.44 KiB |
| 73 | +training 7 minutes ago 486.2 KiB |
| 74 | +``` |
| 75 | + |
| 76 | +**Train, or "teach," facebox**: |
| 77 | + |
| 78 | +Next, we create a Pachyderm pipeline stage that will take the `training` data as input, provide those training images to facebox, and output the "state" of a trained model. This is done by providing Pachyderm with a pipeline spec, [train.json](https://github.com/dwhitena/pach-machine-box/blob/master/pipelines/train.json). This pipeline spec specifies the Docker image to use for data processing, the commands to execute in that Docker container, and what data is input to the pipeline. |
| 79 | + |
| 80 | +In our particular case, `train.json` specifies that we should use an image based on the facebox image from Machine Box and execute a number of cURL commands to post the training data to facebox. Once those training images are provided to and processed by facebox, we specify another cURL command to export the state of the facebox model (for use later in our pipeline). |
| 81 | + |
| 82 | +We are using a little bash magic here to perform these operations. However, it is very possible that, in the future, Machine Box will provide a more standardized command line implementation for these sorts of use cases. |
| 83 | + |
| 84 | +```sh |
| 85 | +create-MB-pipeline.sh identify.json tag.json train.json |
| 86 | +➔ ./create-MB-pipeline.sh train.json |
| 87 | +➔ pachctl list-pipeline |
| 88 | +NAME INPUT OUTPUT STATE |
| 89 | +model training model/master running |
| 90 | +➔ pachctl list-job |
| 91 | +ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE |
| 92 | +3425a7a0-543e-4e2a-a244-a3982c527248 model/- 9 seconds ago - 1 0 / 1 running |
| 93 | +➔ pachctl list-job |
| 94 | +ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE |
| 95 | +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 5 minutes ago 5 minutes 1 1 / 1 success |
| 96 | +➔ pachctl list-repo |
| 97 | +NAME CREATED SIZE |
| 98 | +model 5 minutes ago 4.118 KiB |
| 99 | +unidentified 18 minutes ago 540.4 KiB |
| 100 | +labels 18 minutes ago 15.44 KiB |
| 101 | +training 19 minutes ago 486.2 KiB |
| 102 | +➔ pachctl list-file model master |
| 103 | +NAME TYPE SIZE |
| 104 | +state.facebox file 4.118 KiB |
| 105 | +``` |
| 106 | + |
| 107 | +As you can see, the output of this pipeline is a `.facebox` file that contains the trained state of our facebox model. |
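To make the "bash magic" a little more concrete, here is a minimal Python sketch of the cURL calls such a training stage might issue. It is not the contents of `train.json`. The assumptions are: facebox listens on `http://localhost:8080` inside the container, the `/facebox/teach` and `/facebox/state` endpoints and the `name`/`id`/`file` form fields behave as I understand the Machine Box API (check their docs), and the person's name is derived from the file name by dropping trailing digits. The `/pfs/training` and `/pfs/out` paths follow Pachyderm's convention of mounting input repos under `/pfs/<repo>` and collecting output from `/pfs/out`.

```python
import re
from pathlib import PurePosixPath

FACEBOX_URL = "http://localhost:8080"  # assumed address of facebox in the container


def face_name(path):
    """Derive a label such as 'trump' from 'trump1.jpg' (assumed naming convention)."""
    stem = PurePosixPath(path).stem
    return re.sub(r"\d+$", "", stem)


def teach_command(path, base_url=FACEBOX_URL):
    """Build the cURL argument list that posts one training image to facebox."""
    file_name = PurePosixPath(path).name
    return [
        "curl", "-X", "POST",
        "-F", f"name={face_name(path)}",
        "-F", f"id={file_name}",
        "-F", f"file=@{path}",
        f"{base_url}/facebox/teach",  # assumed endpoint
    ]


def export_state_command(out_path, base_url=FACEBOX_URL):
    """Build the cURL argument list that saves the trained model state to a file."""
    return ["curl", "-o", out_path, f"{base_url}/facebox/state"]  # assumed endpoint


if __name__ == "__main__":
    for image in ["/pfs/training/trump1.jpg", "/pfs/training/trump2.jpg"]:
        print(" ".join(teach_command(image)))
    print(" ".join(export_state_command("/pfs/out/state.facebox")))
```

Building the commands as argument lists keeps the sketch free of network calls; in a real stage each list would be run once facebox has finished starting up.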
| 108 | + |
| 109 | +**Use the trained facebox to identify faces**: |
| 110 | + |
| 111 | +We then launch another Pachyderm pipeline, based on an [identify.json](https://github.com/dwhitena/pach-machine-box/blob/master/pipelines/identify.json) pipeline specification, to identify faces within the `unidentified` images. This pipeline will take the persisted state of our model in `model` along with the `unidentified` images as input. It will also execute cURL commands to interact with facebox, and it will output indications of identified faces to JSON files, one per `unidentified` image. |
| 112 | + |
| 113 | +```sh |
| 114 | +➔ ./create-MB-pipeline.sh identify.json |
| 115 | +➔ pachctl list-job |
| 116 | +ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE |
| 117 | +281d4393-05c8-44bf-b5de-231cea0fc022 identify/- 6 seconds ago - 0 0 / 2 running |
| 118 | +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 8 minutes ago 5 minutes 1 1 / 1 success |
| 119 | +➔ pachctl list-job |
| 120 | +ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE |
| 121 | +281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 About a minute ago 53 seconds 0 2 / 2 success |
| 122 | +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 9 minutes ago 5 minutes 1 1 / 1 success |
| 123 | +➔ pachctl list-repo |
| 124 | +NAME CREATED SIZE |
| 125 | +identify About a minute ago 1.932 KiB |
| 126 | +model 10 minutes ago 4.118 KiB |
| 127 | +unidentified 23 minutes ago 540.4 KiB |
| 128 | +labels 23 minutes ago 15.44 KiB |
| 129 | +training 24 minutes ago 486.2 KiB |
| 130 | +➔ pachctl list-file identify master |
| 131 | +NAME TYPE SIZE |
| 132 | +image1.json file 1.593 KiB |
| 133 | +image2.json file 347 B |
| 134 | +``` |
| 135 | + |
| 136 | +If we look at the JSON output for, e.g., `image1.jpg`, we can see that there is a portion of the file that clearly identifies Donald Trump in the image along with the location and size of his face in the image: |
| 137 | + |
| 138 | +``` |
| 139 | +{ |
| 140 | + "success": true, |
| 141 | + "facesCount": 13, |
| 142 | + "faces": [ |
| 143 | + ... |
| 144 | + ... |
| 145 | + { |
| 146 | + "rect": { |
| 147 | + "top": 175, |
| 148 | + "left": 975, |
| 149 | + "width": 108, |
| 150 | + "height": 108 |
| 151 | + }, |
| 152 | + "id": "58ff31510f7707a01fb3e2f4d39f26dc", |
| 153 | + "name": "trump", |
| 154 | + "matched": true |
| 155 | + }, |
| 156 | + ... |
| 157 | + ... |
| 158 | + ] |
| 159 | +} |
| 160 | +``` |
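As a rough illustration of how a downstream step could consume this output, the Python sketch below pulls out only the matched faces. The field names (`success`, `faces`, `rect`, `name`, `matched`) come from the response shown above; the unmatched face in the sample is made up for the example, and the function name `matched_faces` is hypothetical.

```python
import json


def matched_faces(response_text):
    """Return (name, rect) pairs for every face that facebox matched to a known person."""
    response = json.loads(response_text)
    if not response.get("success"):
        return []
    return [
        (face["name"], face["rect"])
        for face in response.get("faces", [])
        if face.get("matched")
    ]


# Sample input: the second face is copied from the output above,
# the first (unmatched) face is invented for illustration.
sample = """
{
  "success": true,
  "facesCount": 2,
  "faces": [
    {"rect": {"top": 10, "left": 20, "width": 50, "height": 50}, "matched": false},
    {"rect": {"top": 175, "left": 975, "width": 108, "height": 108},
     "id": "58ff31510f7707a01fb3e2f4d39f26dc", "name": "trump", "matched": true}
  ]
}
"""

for name, rect in matched_faces(sample):
    print(name, rect["left"], rect["top"], rect["width"], rect["height"])
```

Filtering on `matched` matters here because facebox reports every face it detects, known or not.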
| 161 | + |
| 162 | +**Tagging identified faces in the images**: |
| 163 | + |
| 164 | +We are most of the way there! We have identified Trump in the `unidentified` images, but the JSON output isn't the most visually appealing. As such, let's overlay a label on the images at the location of Trump's face. |
| 165 | + |
| 166 | +To do this, we can use a [simple Go program](https://github.com/dwhitena/pach-machine-box/blob/master/tagimage/main.go) to draw the label image on the `unidentified` image at the appropriate location. This part of the pipeline is specified by a [tag.json](https://github.com/dwhitena/pach-machine-box/blob/master/pipelines/tag.json) pipeline specification, and can be created as follows: |
| 167 | + |
| 168 | +```sh |
| 169 | +➔ pachctl create-pipeline -f tag.json |
| 170 | +➔ pachctl list-job |
| 171 | +ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE |
| 172 | +cd284a28-6c97-4236-9f6d-717346c60f24 tag/- 2 seconds ago - 0 0 / 2 running |
| 173 | +281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 5 minutes ago 53 seconds 0 2 / 2 success |
| 174 | +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 13 minutes ago 5 minutes 1 1 / 1 success |
| 175 | +➔ pachctl list-job |
| 176 | +ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE |
| 177 | +cd284a28-6c97-4236-9f6d-717346c60f24 tag/ae747e8032704b6cae6ae7bba064c3c3 25 seconds ago 11 seconds 0 2 / 2 success |
| 178 | +281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 5 minutes ago 53 seconds 0 2 / 2 success |
| 179 | +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 14 minutes ago 5 minutes 1 1 / 1 success |
| 180 | +➔ pachctl list-repo |
| 181 | +NAME CREATED SIZE |
| 182 | +tag 30 seconds ago 591.3 KiB |
| 183 | +identify 5 minutes ago 1.932 KiB |
| 184 | +model 14 minutes ago 4.118 KiB |
| 185 | +unidentified 27 minutes ago 540.4 KiB |
| 186 | +labels 27 minutes ago 15.44 KiB |
| 187 | +training 27 minutes ago 486.2 KiB |
| 188 | +➔ pachctl list-file tag master |
| 189 | +NAME TYPE SIZE |
| 190 | +tagged_image1.jpg file 557 KiB |
| 191 | +tagged_image2.jpg file 34.35 KiB |
| 192 | +``` |
| 193 | + |
| 194 | +As you can see, we now have two "tagged" versions of the images in the output `tag` data repository. If we get these images, we can see that... Boom! Our Trump finder works: |
| 195 | + |
| 196 | + |
| 197 | + |
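The tagging logic itself lives in the linked Go program. Purely as an illustration of the bookkeeping involved, here is a small Python sketch that decides which label image to draw and where. It assumes the label file is named after the matched person (as with `trump.jpg` and `clinton.jpg` in the `labels` repo), that the label is anchored at the top-left corner of the face rectangle, and that inputs are mounted under `/pfs/labels`. The actual program may place or scale labels differently, and the function names are hypothetical.

```python
from pathlib import PurePosixPath


def tagged_name(image_path):
    """Map 'image1.jpg' to 'tagged_image1.jpg', matching the names in the `tag` repo."""
    return "tagged_" + PurePosixPath(image_path).name


def plan_overlays(faces, labels_dir="/pfs/labels"):
    """For each matched face, pick the label image and the point at which to draw it."""
    overlays = []
    for face in faces:
        if not face.get("matched"):
            continue  # unknown faces get no label
        rect = face["rect"]
        overlays.append({
            "label": f"{labels_dir}/{face['name']}.jpg",
            "x": rect["left"],  # assumed anchor: top-left corner of the face
            "y": rect["top"],
        })
    return overlays


if __name__ == "__main__":
    faces = [
        {"rect": {"top": 175, "left": 975, "width": 108, "height": 108},
         "name": "trump", "matched": True},
    ]
    print(tagged_name("/pfs/unidentified/image1.jpg"), plan_overlays(faces))
```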
| 198 | +**Teaching a new face, updating the output**: |
| 199 | + |
| 200 | +Our pipeline isn't restricted to Trump or any one face. Actually, we can teach facebox another face by updating our `training` data repository. Moreover, because Pachyderm versions your data and knows what data is new, it can automatically update all our results once facebox learns the new face: |
| 201 | + |
| 202 | +```sh |
| 203 | +➔ cd ../data/train/faces2/ |
| 204 | +➔ ls |
| 205 | +clinton1.jpg clinton2.jpg clinton3.jpg clinton4.jpg |
| 206 | +➔ pachctl put-file training master -c -r -f . |
| 207 | +➔ pachctl list-job |
| 208 | +ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE |
| 209 | +56e24ac0-0430-4fa4-aa8b-08de5c1884db model/- 4 seconds ago - 0 0 / 1 running |
| 210 | +cd284a28-6c97-4236-9f6d-717346c60f24 tag/ae747e8032704b6cae6ae7bba064c3c3 6 minutes ago 11 seconds 0 2 / 2 success |
| 211 | +281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 11 minutes ago 53 seconds 0 2 / 2 success |
| 212 | +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 20 minutes ago 5 minutes 1 1 / 1 success |
| 213 | +➔ pachctl list-job |
| 214 | +ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE |
| 215 | +6aa6c995-58ce-445d-999a-eb0e0690b041 tag/7cbd2584d4f0472abbca0d9e015b9829 5 seconds ago 1 seconds 0 2 / 2 success |
| 216 | +8a7961b7-1085-404a-b0ee-66034fae7212 identify/1bc94ec558e44e0cb45ed5ab7d9f9674 59 seconds ago 54 seconds 0 2 / 2 success |
| 217 | +56e24ac0-0430-4fa4-aa8b-08de5c1884db model/002f16b63a4345a4bc6bdf5510c9faac About a minute ago 19 seconds 0 1 / 1 success |
| 218 | +cd284a28-6c97-4236-9f6d-717346c60f24 tag/ae747e8032704b6cae6ae7bba064c3c3 8 minutes ago 11 seconds 0 2 / 2 success |
| 219 | +281d4393-05c8-44bf-b5de-231cea0fc022 identify/287fc78a4cdf42d89142d46fb5f689d9 13 minutes ago 53 seconds 0 2 / 2 success |
| 220 | +3425a7a0-543e-4e2a-a244-a3982c527248 model/1b9c158e33394056a18041a4a86cb54a 21 minutes ago 5 minutes 1 1 / 1 success |
| 221 | +➔ pachctl list-file tag master |
| 222 | +NAME TYPE SIZE |
| 223 | +tagged_image1.jpg file 557 KiB |
| 224 | +tagged_image2.jpg file 36.03 KiB |
| 225 | +``` |
| 226 | + |
| 227 | +Now if we look at our images, we find that everything has been updated without any annoying manual work on our end: |
| 228 | + |
| 229 | + |
| 230 | + |
| 231 | +**Conclusion/Resources**: |
| 232 | + |
| 233 | +As you can see, Machine Box and Pachyderm make it really quick and easy to deploy a distributed, machine learning data pipeline. Be sure to: |
| 234 | + |
| 235 | +- Visit [this repo](https://github.com/dwhitena/pach-machine-box) to get the code and pipeline specs, so you can create your own Trump finder! |
| 236 | +- Join the [Pachyderm Slack team](http://slack.pachyderm.io/) to get help implementing your ML pipelines, and participate in the discussion in the #data-science channel on Gophers Slack. |
| 237 | +- Follow [Pachyderm on Twitter](https://twitter.com/pachydermIO), |
| 238 | +- Sign up for a free [Machine Box](https://machinebox.io/) API key, and |
| 239 | +- Follow [Machine Box on Twitter](https://twitter.com/machineboxio). |