
BIDViz: Real-time Monitoring and Debugging of Machine

Learning Training Processes

Han Qi
Jingqiu Liu
Xuan Zou
Allen Tang
John F. Canny, Ed.

Electrical Engineering and Computer Sciences


University of California at Berkeley

Technical Report No. UCB/EECS-2017-99


http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-99.html

May 12, 2017


Copyright © 2017, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior
specific permission.
BIDViz: Real-time Monitoring and Debugging of Machine

Learning Training Processes

Han Qi

Project in collaboration with Allen Tang, Jingqiu Liu, Xuan Zou,

under the supervision of Prof. John Canny


Executive Summary

Artificial Intelligence is a thriving field with many applications, whether automating routine human labor or supporting basic research such as diagnosing diseases (Goodfellow et al., 2016). Deep learning, or machine learning with deep neural networks, is particularly successful at providing human-like intelligence, such as image captioning or translation, because it has a more general model of the world and makes relatively few assumptions about the world it is trying to model (Goodfellow et al., 2016). However, compared to the tools available for conventional programming and debugging, the tooling for building and training deep learning models is immature.

In this report, we present BIDViz, a visualization platform that allows data scientists to visualize and debug their models interactively while the models are training. BIDViz is applicable to general machine learning but is oriented toward deep neural models, which are often challenging to fully understand. BIDViz emphasizes dynamic exploration of models by allowing the execution of arbitrary metrics or commands on the model while it is training.

BIDViz was developed jointly by Jingqiu Liu, Xuan Zou, Allen Tang and myself, under the supervision of Professor John Canny and PhD candidate Biye Jiang. The implementation of BIDViz consists of two parts: the design and implementation of the user interface, which is covered mainly in Xuan's report; and the design of the computing and serving backend, which is covered jointly in Jingqiu's report and this one.

Chapter 1

Technical Contributions

1.1 Introduction

Machine learning models are powerful and versatile tools for understanding and exploiting data. But it is also the case that users often under-exploit the power of ML and DL because of limited understanding of the loss functions and optimization methods [17]. For instance, clustering models like k-Means optimize only internal cluster coherence, not inter-cluster distance or cluster size uniformity. The latter goals are often more important to, or assumed by, the user. Previous interactive machine learning tools like [17] allow users to tailor the loss function as a blend of component losses while visualizing the effects on those component losses. These losses and visualizations must be defined ahead of time, although they can be added to or removed from a particular training session.

For deep neural models, training often lasts hours or days. It is extremely expensive to stop and restart training in order to add diagnostics or controls. Therefore, we propose a system that allows dynamic creation and customization of diagnostics or controls for deep learning models while they are training. To illustrate the main purposes of our system, we describe three scenarios below that are critical to machine learning researchers and data scientists:

1. Introduce code to compute and stream new statistics. During the training process, instead of waiting for the process to complete, practitioners can print or plot useful statistics to understand model training progress. For example, printing out the training loss or validation accuracy is the most common strategy people use to check whether model training is on the right track. Training may stall for a variety of reasons: vanishing or even exploding gradients, bad operating points (relative to the "bends" in non-linear layers), and "flat" regions on the loss surface. It's difficult to anticipate all the possible pathologies, and ad-hoc querying is often needed to diagnose which is in play. Once a cause is found, further exploration may be needed to localize and solve the problem. These queries may be quite complex, and need to be defined on-the-fly.

2. View these statistics as streaming or dynamic plots. Although it is useful to see the current training accuracy, it is more meaningful to see how much the accuracy has changed over the past few iterations. By looking at historical data, users can conclude whether the training process has an incorrect implementation or suboptimal hyper-parameters, or has already converged. These data are in effect time series and can be conveniently viewed as streaming plots. It is also useful to make other forms of plot, such as histograms, dynamic, so that they change over windows of data.

3. Change model parameters or variables by intervening in the training process. Deep learning training is very time-consuming, and it is very inefficient to diagnose a model by trial and error. If a hyper-parameter such as the learning rate is sub-optimal, shutting down the current process and restarting wastes the work invested in training to that point. Once problems with the model are identified through streaming/dynamic plots, users will often want to directly make a fix in the model and continue the current training process. BIDViz supports direct setting of hyper-parameters, creation of dependencies between parameters, and creation of new time-dependent policies to adjust parameters (a sketch of such a policy follows this list).
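
To make the third scenario concrete, the following is a minimal sketch, not BIDViz's actual API, of the kind of time-dependent policy a user might submit at runtime; the Opts class and its lrate field are hypothetical stand-ins for BIDMach-style options.

// Hypothetical sketch: Opts and lrate are stand-ins, not BIDMach's API.
case class Opts(var lrate: Float)

// A time-dependent policy: scale the learning rate every `every` passes.
def stepDecay(opts: Opts, ipass: Int, every: Int = 10, factor: Float = 0.5f): Unit =
  if (ipass > 0 && ipass % every == 0) opts.lrate *= factor

// Example: invoked once per pass by the monitoring hook.
val opts = Opts(lrate = 0.1f)
(1 to 30).foreach(i => stepDecay(opts, i))  // lrate halves at passes 10, 20, 30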

Currently, monitoring machine learning processes is done in the traditional way, which requires users to:

• Modify the production/experiment script, before training starts, to include code that computes the desired metrics. Once the training process starts, there is no way to add new metrics or delete old ones.

• Log those metrics into a file or print them to the screen. Users cannot freely browse through the history and select specific metrics to look at.

• Write custom code to parse and plot the logged metrics to gain insights. For a different experiment, users may

need to write different code to parse those files and plot debugging statistics.

It is clear that the current approach is inefficient and slow, especially when training cycles are long. In

particular, it has three main disadvantages:

1. Coding both the logging and the plotting is tedious. The code users write for one training process is usually not reusable for other experiments. In addition, the plotting program must directly follow the format of the logging program, which makes the code hard to extend.

2. Metric computation and logging code cannot be changed after the model is running. It is often the case that user-specified metrics cannot capture the whole training process, so new metrics need to be added after training starts. Therefore, if users forget to log one of the metrics, or want to explore the model further after seeing something rare, they must stop the current process, modify the script to include the new logging, and rerun the model.

3. If the model is showing unexpected behavior, the only way to modify it is to stop and restart. After training a model for an hour, or even longer, and then detecting abnormal metrics, users need to start the whole process over. The current approach is very time-consuming and inefficient.

Therefore, we present our machine learning visualization and monitoring system, BIDViz, an extensible toolkit that helps machine learning practitioners interact with the model training process. BIDViz is based on the GPU-accelerated machine learning library BIDMach [6]. To overcome the drawbacks of the traditional approach mentioned above, the system supports adding metrics, rendering plots and modifying model parameters in the same live training session, through a user-friendly interface. We in fact followed these procedures in our effort to diagnose deep neural networks run on very sparse input data. This approach leverages support for real-time compilation, serialization and multi-threading in BIDMach and Scala. We also use the Scala Play framework to support high-level HTTP client/server communication with the web visualization client. We discuss the features and system design in later sections.

1.2 Related Work

1.2.1 Machine Learning Visualization Platform

There exist several visualization platforms for machine learning training. The most popular is TensorBoard [2], supported by the Google TensorFlow project. To use TensorBoard, a user instructs the model to log information into a specific file directory; TensorBoard then reads that directory and visualizes the logged data. TensorBoard is extremely useful for visualizing the TensorFlow computation graph, which helps verify that the model is built as expected. In addition, users use TensorBoard to visualize statistics that monitor the training process. We argue that our system outperforms TensorBoard on several key features:

• TensorBoard is a separate process from the running TensorFlow training process. The two processes do not share memory space, which makes interaction very limited. For example, as mentioned above, such a system cannot intervene in the training process to fix suboptimal training performance. Our system puts training and visualization in the same process, which gives the visualization platform direct access to modify the training system.

• TensorFlow's computation-graph structure makes it very hard to add additional evaluations, such as debugging metrics, into the graph once training starts. Although TensorBoard takes care of the plotting, users must explicitly write code to add those calculations to summary operators before training. Our system allows users to add and delete metrics and debugging information at any time during training.

• TensorBoard operates by reading TensorFlow event files. Thus, the visualization server needs to check the event files periodically, which may create busy waiting. In addition, event files are written and read on disk, which can be more expensive than memory. Our system mitigates this issue by letting the training thread communicate with the visualization server directly through a socket. The socket does not create busy waiting in the system and primarily uses memory.

In addition to TensorBoard, other machine learning visualization toolkits include Keras Callbacks [9] and TensorDebugger [16]. Keras Callbacks have features very similar to TensorBoard's, opening a separate process to evaluate all pre-specified logging code, and therefore share all the drawbacks mentioned above. Another TensorFlow tool, TensorDebugger, uses the Jupyter notebook [26] as a backend to run both training and plotting in the same process. It can set breakpoints that allow users to pause the training process and debug the current data through the computation graph. However, the key issue is still the limitation of the computation graph: whether one uses TensorDebugger or TensorBoard, once the computation graph is constructed, no new code or diagnostics can be added. In addition, although a Jupyter notebook is very convenient for exploring a paused training session, it is very difficult to reuse the logging and plotting code in other experiments, or to share it with other people. Our implementation has a simple interface that makes sharing and open-sourcing diagnostic code possible.

1.2.2 System Monitoring Platform

Part of BIDViz's functionality is monitoring the machine learning system. System monitoring has several very mature platforms and toolkits. One of the most popular in industry is Graphite [10]. Graphite lets users submit metrics to its own database system, storing numeric time-series data for those metrics at fixed intervals. It then supports fetching the metric data from the database and rendering graphs on demand. Alongside Graphite, Grafana [1] is a powerful dashboard built on top of Graphite to support time-series analytics. This ecosystem is very similar to our system, but with the following key differences:

• Since our goal is to monitor machine learning training processes, using the Graphite and Grafana ecosystem would yield a result very similar to TensorBoard. In particular, since Graphite asks users to submit metrics scripts to the database, it is equivalent to having the metrics code in the training script and submitting the training script to the Graphite database. They are two different processes in nature, which creates the separation between server and training processes. Therefore, we would not be able to modify logging code or change model parameters once the job starts.

• The monitoring system in Graphite can become very expensive as users request more graphs. Because the underlying design has the web interface fetch data directly from the database, a cache, or even disk, once users open a new tab and request new graphs, Graphite needs to query more data from the database and refresh all the graphs currently displayed in the browser.

Beyond those two key points, our system actually shares some similarities with the Graphite ecosystem. For example, Grafana has an online dashboard community where users share their self-defined visualizations. We likewise have a clearly defined API for users to share their plots with others.

There are many real-time monitoring platforms developed in the academic world, such as [8, 25, 29]. However, those platforms have specific monitoring targets and do not allow dynamic changes to the monitored metrics. In addition, similar to Graphite, those systems are not suitable for machine learning processes: they do not directly share memory with the training thread, so they cannot serve as real-time debuggers.

1.2.3 Interactive Environment Platform

Interactive machine learning has become a very popular field of research in human-computer interaction (HCI). However, interaction is approached very differently across research papers. For example,

• In "Interactive learning with convolutional neural networks for image labeling" [21], Längkvist et al. propose a human-in-the-loop system to fix falsely labeled data when training a convolutional neural network. Similar work is [18], where Kapoor et al. ask users to express their preferences on decision boundaries for multiclass classification problems. In [20], Kulesza et al. introduce an explanatory debugging system that lets users communicate with the system about how it learns and correct false predictions. All the papers mentioned here, along with others such as [19], mostly focus on using human knowledge to guide the system in labeling or training datasets.

• In [24], Patel et al. propose letting users use the results from multiple models to diagnose bad features, noisy observations and suboptimal algorithms. Similar work is Talbot et al.'s EnsembleMatrix, a tool that presents confusion matrices to help users understand the performance of various models [30]. Much other research, such as [28], tries to provide a platform where users can directly analyze the outcomes after training models.

• In [17], Jiang et al. propose a machine learning architecture that can directly change model parameters during training. Interaction here means directly changing training models to tune hyperparameters. New code and diagnostics cannot be added to running models.

It can be seen from the list above that interactive machine learning is a very broad area. From using human knowledge to guide training processes to visually analyzing training results, interactive machine learning covers every aspect of the machine learning training process. As the authors of [4, 5, 7, 13] argue about the expectations and challenges of interactive machine learning systems, one should ideally involve users all the way from exploratory model building through refining models via the interface. However, none of the papers mentioned above covers the whole process of interactive machine learning. Our system defines interaction precisely as a means for users to be directly involved in the entire training process, from exploratory analysis, to model diagnostics, to production inference pipeline debugging. Such interaction has never been fully accomplished.

1.3 System Design

1.3.1 Overview

Figure 1.1: An illustration of our system design connecting the BIDMach server with the user interface. Each blue square represents a thread.

BIDViz is designed with extensibility in mind. We want a flexible tool for scientists and developers that can be customized to their needs. To accomplish this, we designed the system to be modular, with clearly defined interfaces between its pieces. This not only achieved extensibility, but also made our own development easier.

A BIDViz application consists of three modules, illustrated in Figure 1.1. The first is a machine learning library that handles the definition and training of the model. The second is the BIDViz server, which consists of a channel that observes the current training state and a web server that interacts with the user. The third is the web application, served by BIDViz, that users see and interact with.

We chose BIDMach [6], a fast GPU-accelerated machine learning library, as the foundation for BIDViz. A script written for BIDMach usually instantiates a "Learner" object and trains that Learner. A Learner contains the main training loop, which iterates through each mini-batch of data and runs optimization algorithms, such as stochastic gradient descent or ADAM, together with regularizers defined by mixins.

The BIDViz server handles the creation of a new function when a user submits a code snippet defining a metric (through code evaluation), and handles user requests as a websocket RPC server.

When BIDViz starts, two separate threads are created: the training thread, which runs the current training loop, and the serving thread, which responds to user requests and sends out relevant information. BIDViz has the option to execute code in either thread, as sketched below.
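
As one illustration of how such cross-thread execution might be structured (a sketch under our own naming, not the literal BIDViz source), the serving thread can enqueue user-submitted tasks that the training thread drains at a safe point in each iteration:

import java.util.concurrent.ConcurrentLinkedQueue

// Sketch: the serving thread submits closures; the training thread drains
// them between iterations, so user code sees a consistent model state.
class ExecQueue {
  private val tasks = new ConcurrentLinkedQueue[() => Unit]()

  def submit(task: () => Unit): Unit = tasks.add(task)  // serving thread

  def drain(): Unit = {                                 // training thread
    var t = tasks.poll()
    while (t != null) { t(); t = tasks.poll() }
  }
}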

1.3.2 Webserver as listener of the main training loop

To make the data stored in BIDMach accessible to the webserver, we make the BIDViz server an observer of BIDMach's Learner object by implementing the trait below.

trait Observer {
  def notify(model: Model): Unit
}

The Learner class owns the training loop; it is responsible for loading each minibatch of data and executing model update steps. For gradient-based optimization problems, such as training a neural network with SGD, this means computing the gradient and calling the corresponding updater; for EM-style clustering algorithms such as k-Means, it means one step of mean updates.

An observer is notified on each iteration of the training loop and receives a reference to the model object. The model contains everything needed to define the current model. For a neural network, this means a list of matrices representing the weights of every layer. It also contains a list of matrices for the current gradient and the current minibatch of data. With access to this information, we can compute metrics of interest such as training loss and accuracy, or the distribution of model weights and gradients.

When BIDViz is notified, it iterates through a Map of functions, each taking a model and returning a matrix. How those functions are created is described in the next section. The returned matrix is then serialized and sent to the front end through websockets. This evaluation is done in the training thread, because that way we are sure the model weights do not change while we are doing the computation. If we did the computation in the serving thread, we would need to snapshot the current model weights to avoid the race condition of computing on partially updated weights, and that means copying the weights out of the GPU, which could be expensive given the large size of model weights in a deep neural network. We assume that the result of the metric computation, such as model loss or accuracy, is small in comparison. Therefore, we send only a small amount of data out of the training thread and over the wire (websocket). Writes to the websocket are asynchronous and are handled by Scala's Play framework, using the actor library (Akka) that comes with Play. Metric computation does add some overhead to the training thread that can potentially slow down training. However, the metric computation should be small compared to the gradient computation for one step, so the overhead should be minimal. It is still possible for a user to submit slow code in a metric computation, such as network calls; in the future we would try to detect and warn about this case.
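
Putting the pieces together, a minimal sketch of this observer mechanism might look as follows; Model, Mat and the method names are placeholders rather than BIDMach's actual types:

import scala.collection.concurrent.TrieMap

trait Mat                                   // stand-in for a BIDMach matrix
trait Model { def weights: Seq[Mat] }       // stand-in for a BIDMach model

trait Observer { def notify(model: Model): Unit }

// Keeps a map of user-defined metrics and evaluates them on each iteration.
class MetricChannel(send: (String, Mat) => Unit) extends Observer {
  private val metrics = TrieMap.empty[String, Model => Mat]

  def addMetric(name: String, fn: Model => Mat): Unit = metrics.put(name, fn)
  def removeMetric(name: String): Unit = metrics.remove(name)

  // Runs in the training thread, so weights cannot change mid-computation;
  // only the (small) metric results leave the thread over the websocket.
  override def notify(model: Model): Unit =
    for ((name, fn) <- metrics) send(name, fn(model))
}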

1.4 Communication Protocol with the Front End

All communication is handled via JSON-based messages over a websocket, which is established when the webapp loads. BIDViz defines a VizManager (JavaScript) singleton object that handles communication with the server.

There are mainly two types of communication between the server and the client. The first is data point notification: this occurs when BIDViz computes a metric, such as loss, and sends it to the front end to be displayed. The second is client requests, such as getting a list of current parameters or sending a code snippet to be evaluated. The first type is initiated by the server, so it is natural to use a websocket, as this is the only way the web browser can receive unsolicited data from the server. The second type starts with a request from the client, after which the server performs the requested actions and replies with the requested data. This process is normally implemented as an AJAX request; in BIDViz, however, both types are implemented over the same websocket.

1.4.1 Handling data point event

A data point message looks like the following:

{
  "msgType": "data_point",
  "content": {
    "name": "",
    "ipass": int,
    "shape": [int, ...],
    "data": [float, ...]
  }
}

The msgType field indicates the type of message, so VizManager knows to route it to a chart. The name field in content identifies which chart the data point belongs to. The ipass field gives the current pass, an increasing integer recording roughly how many iterations the training loop has run. Finally, shape and data are the serialized matrix: data is a list of floats representing the flattened matrix, and shape records the original dimensions. For example, a 5×5 matrix is represented with shape = [5,5] and data a list of 25 numbers.
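
On the server side, building such a message is straightforward with Play JSON, which BIDViz already depends on through the Play framework; the helper below is our own illustrative sketch, not the exact BIDViz code:

import play.api.libs.json.{JsValue, Json}

// Serialize a flattened matrix into a data_point message.
def dataPointMessage(name: String, ipass: Int,
                     shape: Seq[Int], data: Seq[Float]): JsValue =
  Json.obj(
    "msgType" -> "data_point",
    "content" -> Json.obj(
      "name"  -> name,
      "ipass" -> ipass,
      "shape" -> shape,
      "data"  -> data
    )
  )

// A 5x5 matrix becomes shape = [5,5] and data with 25 numbers:
val msg = dataPointMessage("loss", 42, Seq(5, 5), Seq.fill(25)(0.0f))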

Once this message is understood, VizManager routes the data to the corresponding graph by calling the addPoint method on that graph. Any object that implements this method can be treated as a graph, which makes it very easy to create different charts using any graphing backend, such as D3, C3, Highcharts or Vega. Xuan's report goes into depth on the different graphs that we support.
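
For illustration, the graph contract can be sketched as follows (in Scala for consistency with the other listings; the real front end is duck-typed JavaScript):

import scala.collection.mutable.ArrayBuffer

// Anything with addPoint can act as a chart.
trait Chart { def addPoint(ipass: Int, shape: Seq[Int], data: Seq[Float]): Unit }

class LineChart extends Chart {
  private val history = ArrayBuffer.empty[(Int, Float)]
  def addPoint(ipass: Int, shape: Seq[Int], data: Seq[Float]): Unit = {
    history += ((ipass, data.head))  // scalar metric: keep (pass, value) pairs
    render()
  }
  private def render(): Unit = ()    // a real chart delegates to D3/C3/Highcharts/Vega
}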

1.4.2 Client-initiated request events

When the user performs an action, such as adding a new chart or editing a hyperparameter, a request is sent to the server. Traditionally this is done using AJAX requests. AJAX has the benefit that the browser handles routing each response to the corresponding callback, but has the disadvantage of being a full HTTP request, which incurs significant overhead in sending the full HTTP header. We have chosen to mimic AJAX over websockets. VizManager assigns an id to each outgoing websocket request and keeps a map from those ids to the corresponding callback functions. On receiving a response, it uses the id to route the result to the correct callback. A callback message looks like this:

{
  "msgType": "callback",
  "content": {
    "id": "",
    "success": true,
    "data": { ... }
  }
}

When the client initiates a request, VizManager assigns a caller id to the request and remembers its callback function in a dictionary. When the corresponding reply arrives on the websocket, VizManager routes the data field to that callback.
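
The id-to-callback bookkeeping can be summarized with the following sketch (again rendered in Scala for consistency; VizManager itself implements this in JavaScript):

import scala.collection.mutable

// Mimic AJAX over a websocket: tag each request with an id and route the
// reply back to the callback registered under that id.
class RpcClient(sendOverSocket: String => Unit) {
  private var nextId = 0
  private val pending = mutable.Map.empty[Int, String => Unit]

  def request(payload: String, onReply: String => Unit): Unit = {
    nextId += 1
    pending(nextId) = onReply
    sendOverSocket(s"""{"id": $nextId, "payload": $payload}""")
  }

  // Called when a "callback" message arrives from the server.
  def onCallback(id: Int, data: String): Unit =
    pending.remove(id).foreach(cb => cb(data))
}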

1.4.3 Usage

BIDViz currently targets machine learning practitioners who use BIDMach. It is very simple to add BIDViz to any existing script.

Usually, a BIDMach script looks like the following:

// ... defines options
val learner: Learner = Net.createSomeSubclassOfLearner(...)
// ... more code
learner.train

The user defines the options allowed by the Learner.Options class, instantiates an object of the Learner class (or a subclass), then calls the train method on it.

To add BIDViz, simply instantiate a WebServerChannel as an observer of the learner before calling train, as follows:

learner.opts.observer = new WebServerChannel(learner)

When this line is executed, a web server is created and you can point your browser to localhost:10001 to start using BIDViz.

Figure 1.2: User interface that communicates between end-users and training processes. Most of the space on the webpage is for rendering time series of user-defined metrics; a sample line chart is shown in the figure. The three blue buttons at the top of the page handle resuming/restarting the training process, adding metrics code, and changing model parameters. On the right side of the page, users can expand a floating terminal, already expanded in the demonstration above, and use it to evaluate specific commands in the real-time training environment.

As shown in Figure 1.2, the BIDViz UI provides features such as displaying real-time charts, running custom commands on a Scala interpreter attached to the running model, and defining custom charts through code snippets. UI features are described in more depth in Xuan's report.

1.5 Conclusion

BIDViz provides a way to interactively visualize the training process of a learning model. This dynamic interaction provides an alternative workflow for machine learning practitioners and data scientists. In the future, we would like to extend this tool to work with other popular machine learning frameworks, such as Keras or TensorFlow.

Also, bundling the code that computes a statistic with a declarative graph definition could be used to build a sharable format, in which one could create useful metrics and share them with the community.

Chapter 2

Engineering Leadership

2.1 Introduction

As an important branch of machine learning, deep learning provides human-like intelligence, such as image captioning or translation. However, it requires a massive amount of data and machine time to train a reasonable deep learning model, and a large number of hyperparameters are associated with the training itself. Any mistake in model design or hyperparameters results in a big loss of both time and energy, as training must be rerun from scratch after making the required modifications to the parameters or the code. Our project designs and creates a tool in which scientists can babysit the training process, inspecting the current results and parameters, and even live-tweaking or experimenting with different configurations without stopping and restarting the training from scratch. This way, not only can we detect potential model incorrectness early on and save time and energy, we can also discover the configurations that yield the most efficient model.

2.2 Project Management and software engineering

The goal of project management is to ensure maximal throughput and efficient usage of the team's engineering hours to deliver a project. Many aspects can affect the team's throughput. Some are social, such as how well the team's goals align with each member's personal goals, or whether people feel the team is a friendly environment. Others are technical, such as whether several tasks can be carried forward without conflict, enabling parallelism among team members. This section discusses only the technical aspects, to show how following a modular design pattern enables team members to work more efficiently, and how this design also fits many software engineering goals.

Software engineering is a systematic approach to the entire lifecycle of a software system: design, implementation, testing and maintenance (Laplante, 2007 [22]). As opposed to the concerns of computer science, software engineering concerns the adaptation of a system to a series of changing and vaguely defined requirements, rather than just creating a solution for a particular well-defined problem. Well-engineered software should have the following properties: 1. It should be easy to add new features or modify requirements without massively affecting other existing features; in particular, this means we can add features by adding code instead of modifying code. And 2. When the system behaves unexpectedly, it should be easy to pinpoint the few places that need to be fixed. Both are essential to ensure maintainability of the system.

To achieve both the goals of software engineering and project management, we followed these principles: divide the entire system into several independent modules; let each module communicate with the others only through a well-defined contract, or interface; allow interfaces to be expanded, but never modified or removed; and let each module change freely internally, without changing the contract. Each module is owned by one team member, though any member can work on any module. A new feature is implemented by defining the additional interface each module needs to support, then implementing the pieces in parallel. This design allows the modules to be worked on independently and in parallel; the owner of a module is responsible for ensuring it abides by the predefined interface through changes, and also serves as the go-to person for questions when other members are working on that module. It also has the additional benefit of allowing each module to be unit tested independently, leaving less room for errors or bugs.

We divided our project into four modules. The first is core: the existing machine learning library (BIDMach) that we build on; this part is developed and maintained by Prof. Canny. The next is channel: channel observes events that happen inside core and sends them out. This module follows the Observer pattern specified in the Design Patterns book by Gamma et al. (2004, [12]). The channel observes events, computes statistics on those events, and finally sends them out to the last module, the web interface. The web interface is itself a complicated system, so it is further divided into smaller modules following the Model-View-Controller pattern from the same book (Gamma et al., 2004 [12]). This design allows each component to evolve on its own, and also allows Prof. Canny's other projects to continue on the same code base without the two affecting each other.

2.3 Industry Analysis

Deep learning is changing the world. In the early days, however, the AI research community disregarded the potential of neural networks. For example, Marvin Minsky et al. (1969, [23]) in the book Perceptrons pointed out many drawbacks and limitations of neural nets. This situation did not improve for years, until the popularity of the Internet led to the era of Big Data. Online activity makes the Internet a giant pool of data. Unlike the traditional way of telling the machine what to do by hard-coding, machine learning trains the machine on data and expects it to make correct predictions on new data after training. Therefore, the more data we use to train the model, the more experienced the machine becomes, which greatly increases the accuracy of the model. For example, in their research on generating image captions using deep neural networks, Vinyals' team used images uploaded to Flickr.com, a dataset of as many as 30,000 images (Vinyals et al., 2015 [31]). In addition, they used the MS COCO dataset of over 80,000 images and corresponding picture descriptions (Vinyals et al., 2015 [31]).

In recent years, many tech giants in Silicon Valley have joined a so-called Race for AI. Sundar Pichai, Google's CEO, claimed in Alphabet's Q1 earnings call that we are moving to an AI-first world (D'Onfro, 2016 [11]). Apple, Microsoft and Amazon are heavily investing in smart personal assistants, such as Siri and Cortana. Intel acquired three AI-focused startups in 2016 alone (CB Insights blog, 2017 [15]). Companies invest tremendous resources in their AI research groups, aiming to design better algorithms and build more efficient models to improve their products and services.

Besides technology firms, deep learning is widely used in other industries, such as financial institutions. Banks build neural nets that provide a risk score for a customer based on multiple data sources, such as salary, credit history, etc. Banks and merchants worldwide suffered around $16.31 billion in fraud losses in 2015 (Allerin, 2016 [3]). Deep learning algorithms can be used to predict criminal transaction patterns and distinguish fraudulent activities from normal ones.

The broad application of deep neural networks demonstrates a strong need for visualization tools, and we will target every industry that uses machine learning algorithms as potential users.

Having discussed our potential users, we need to analyze potential competitors. We believe TensorBoard will be our major competitor. TensorBoard is a visualization tool that comes with TensorFlow, a widely used machine learning library. TensorFlow generates summary data during the training process, and TensorBoard operates by reading those data. While both tools have similar operating mechanisms, our tool enjoys some features that are essential to a data scientist. First, we allow users to add visualization requests during the training process, while with TensorBoard one has to stop training to add logging. Second, we enable users to tune hyperparameters while training, which can dramatically reduce training time.

2.4 Marketing Strategy

Our project is about understanding deep neural networks through visualization, and our market focuses on fields that use neural networks for data analysis, pattern recognition or image classification.

Neural networks have plenty of applications in all kinds of fields and have already been integrated into much software and many devices. One of the most straightforward applications is using neural networks to recognize characters. For example, postal services categorize letters according to the post code on the envelope; software integrated with a neural network can distinguish the digits with high efficiency and accuracy, saving post offices a great deal of money and relieving humans of this tedious work. Achieving good performance and accuracy in such an application requires developing and tuning the neural network, which is where our product can help a great deal (B. Hussain 1994:98 [14]).

Another application of neural networks that may be less obvious but is much more profitable is in finance. According to companies such as MJ Futures, neural networks have been touted as all-powerful tools for stock-market prediction; MJ Futures claims 199.2% returns over a two-year period using its neural network prediction methods. Meanwhile, software integrated with neural networks is easy to use. As technical editor John Sweeney said, "you can skip developing complex rules (and redeveloping them as their effectiveness fades): just define the price series and indicators you want to use, and the neural network does the rest" (Artificial Neural Networks [27]).

The idea of stock market prediction is not new, of course. Business people have always attempted to anticipate the trend of the stock market using their experience with external parameters, such as economic indicators, public opinion, and the current political climate. With neural networks, software is able to discover trends that humans might not notice and use those trends in prediction.

Our project is about understanding deep neural networks through visualization, so the outcome of our research is a software tool that can monitor and analyze the training process of a neural network. The software can be used to improve the performance of neural networks and to help tune the parameters and architecture of deep neural networks. Therefore, our software can play a role in any area that requires well-architected neural networks.

To commercialize our product, we have a three-step plan. The first step is to present demos and results in well-known communities, in order to attract attention from academia. This can raise the profile of our product and gain the endorsement of experts. The second step is to build a website for our product. We will allow users and companies to download our software freely for a time-limited trial, a common strategy for commercial software. After the trial period, they must pay for a membership to continue using it. The last step, after we have acquired a certain number of users and further refined our product, is to contact large companies to promote our product and provide customized service for them. Through cooperation with large companies, our product can receive industry feedback and improve further.

2.5 Conclusion

Based on our project management strategies, we efficiently keep track of project progress and make adjustments when necessary, ensuring on-time, high-quality deliverables. From the industry analysis, we gained a good understanding of industry needs and requirements, as well as of potential users. We analyzed the current market and identified steps to efficiently promote our final product to users.

Bibliography

[1] Grafana - the open platform for analytics and monitoring. https://grafana.com/.

[2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.

[3] Allerin. How is deep learning being used in the banking industry? https://www.allerin.com/blog/how-is-deep-learning-being-used-in-the-banking-industry, 2016.

[4] S. Amershi, M. Cakmak, W. B. Knox, and T. Kulesza. Power to the people: The role of humans in interactive

machine learning. AI Magazine, 35(4):105–120, 2014.

[5] F. Bernardo, M. Zbyszynski, R. Fiebrink, and M. Grierson. Interactive machine learning for end-user innovation.

2017.

[6] J. Canny and H. Zhao. Big data analytics with small footprint: Squaring the cloud. In Proceedings of the 19th

ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, pages 95–103,

New York, NY, USA, 2013. ACM.

[7] D. Chen, R. K. E. Bellamy, P. K. Malkin, and T. Erickson. Diagnostic visualization for non-expert machine

learning practitioners: A design study. In 2016 IEEE Symposium on Visual Languages and Human-Centric

Computing (VL/HCC), pages 87–95, Sept 2016.

[8] S. E. Chodrow, F. Jahanian, and M. Donner. Run-time monitoring of real-time systems. In Proceedings of the Twelfth Real-Time Systems Symposium, pages 74–83. IEEE, 1991.

[9] F. Chollet. Callbacks - keras documentation. https://keras.io/callbacks/, 2016.

[10] C. Davis. Graphite. http://www.aosabook.org/en/graphite.html, 2008.

[11] J. D'Onfro. Google's CEO is looking to the next big thing beyond smartphones. http://www.businessinsider.com/sundar-pichai-ai-first-world-2016-4, 2016.

[12] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Indianapolis, Indiana, 2004.

[13] M. Gillies, R. Fiebrink, A. Tanaka, J. Garcia, F. Bevilacqua, A. Heloir, F. Nunnari, W. Mackay, S. Amershi,

B. Lee, N. d’Alessandro, J. Tilmanne, T. Kulesza, and B. Caramiaux. Human-centred machine learning. In

Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA

’16, pages 3558–3565, New York, NY, USA, 2016. ACM.

[14] B. Hussain and M. R. Kabuka. A novel feature recognition neural network and its application to character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2):98–106, January 1994.

[15] CB Insights. The race for AI: Google, Twitter, Intel, Apple in a rush to grab artificial intelligence startups. https://www.cbinsights.com/blog/top-acquirers-ai-startups-ma-timeline/, 2017. Retrieved 2017-03-10.

[16] E. Jang. Tensordebugger. https://github.com/ericjang/tdb, Jan 2017.

[17] B. Jiang and J. Canny. Interactive machine learning via a GPU-accelerated toolkit. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, pages 535–546, New York, NY, USA, 2017. ACM.

[18] A. Kapoor, B. Lee, D. Tan, and E. Horvitz. Performance and preferences: Interactive refinement of machine

learning procedures. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI’12,

pages 1578–1584. AAAI Press, 2012.

[19] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. arXiv preprint

arXiv:1703.04730, 2017.

[20] T. Kulesza, M. Burnett, W.-K. Wong, and S. Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, IUI '15, pages 126–137, New York, NY, USA, 2015. ACM.

[21] M. Längkvist, M. Alirezaie, A. Kiselev, and A. Loutfi. Interactive learning with convolutional neural networks for image labeling. In International Joint Conference on Artificial Intelligence (IJCAI), New York, USA, July 9–15, 2016.

[22] P. Laplante. What Every Engineer Should Know about Software Engineering. Boca Raton, 2007.

[23] M. Minsky and S. Papert. Perceptrons. The MIT Press, Cambridge, Massachusetts, 1969.

[24] K. Patel, S. M. Drucker, J. Fogarty, A. Kapoor, and D. S. Tan. Using multiple models to understand data. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1723–1728. AAAI Press, 2011.

[25] V. Paxson. Bro: a system for detecting network intruders in real-time. Computer networks, 31(23):2435–2463,

1999.

[26] F. Pérez and B. E. Granger. IPython: a system for interactive scientific computing. Computing in Science and

Engineering, 9(3):21–29, May 2007.

[27] A. Roghani. Artificial Neural Networks. CreateSpace Independent Publishing Platform, London, 2 edition, 2016.

[28] D. Sacha, M. Sedlmair, L. Zhang, J. A. Lee, D. Weiskopf, S. North, and D. Keim. Human-centered machine

learning through interactive visualization. ESANN, 2016.

[29] C. Stauffer and W. E. L. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on

pattern analysis and machine intelligence, 22(8):747–757, 2000.

[30] J. Talbot, B. Lee, A. Kapoor, and D. S. Tan. Ensemblematrix: Interactive visualization to support machine

learning with multiple classifiers. In Proceedings of the SIGCHI Conference on Human Factors in Computing

Systems, CHI ’09, pages 1283–1292, New York, NY, USA, 2009. ACM.

[31] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. https://arxiv.org/pdf/1411.4555.pdf, 2014.
