Reinforcement Learning & Biologically Inspired Artificial Neural Networks

Fiuri Ariel M., Dominguez Martin A., and Francisco Tamarit
Abstract. Over the last few years, machine learning methods have relied on Deep Neural Network architectures to tackle complex problems. In this paper, we apply biologically inspired neural machine learning to solve two classical and well-known challenging problems: the Mountain Car Continuous and the Cart Pole. We use a neural network extracted from the connectome of C. elegans to learn a policy able to yield a good solution. We used Reinforcement Learning (RL) and optimization techniques to train the models, in addition to proposing a novel neural dynamics model. We use different metrics to make a detailed comparison of the results obtained, combining different neuronal dynamics and optimization methods. We obtained very competitive results compared with the solutions reported in the literature, particularly with the novel neuronal dynamics model.
1 Background
Reinforcement learning (RL) is a machine learning technique in which an agent learns to perform a task through a succession of trial-and-error interactions with a changing environment, usually modeled with Markov decision processes. To train the agent to solve a problem, the environment provides a reward and a description of the state of the problem. Each time we execute an action on the environment, we obtain a new reward and state descriptor. A desired solution is one that maximizes the rewards obtained in a given number of interactions [8]. An episode refers to the sequence of interactions with the environment starting from its initial state. In each episode, the rewards from each interaction accumulate into an overall reward for the episode.
Code 1. The optimization procedure, which explores the different internal parameters and returns the best configuration.
Code 2. The episode procedure, which represents the interaction between the agent and the environment.
Training consists of repeatedly assigning values to the internal parameters of the network (setParams) and calculating in each step the reward obtained in an episode. The assignment of the values is represented by the policy, which is associated with the optimizer.
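The authors' listings are not reproduced here, so the following is only a minimal Python sketch of the two procedures, assuming a gym-style environment and an optimizer exposing hypothetical propose()/update() methods; setParams is the only identifier taken from the text.

def run_episode(env, model, max_steps=1000):
    # Code 2 sketch: one episode of agent-environment interaction.
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = model.act(state)              # map sensory input to a motor action
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

def optimize(env, model, optimizer, n_steps):
    # Code 1 sketch: explore the internal parameters, return the best configuration.
    best_params, best_reward = None, float("-inf")
    for _ in range(n_steps):
        params = optimizer.propose()           # the policy for choosing candidate values
        model.setParams(params)                # assign the candidate internal parameters
        reward = run_episode(env, model)
        optimizer.update(params, reward)       # feed the score back to the optimizer
        if reward > best_reward:
            best_params, best_reward = params, reward
    return best_params, best_reward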
From now on, we will use the word model to refer to the neural network that
we are using as an agent.
The Cart Pole Problem (CPP) is another classic problem in RL, also described by Sutton and Barto in [8]. It involves a cart holding a pole attached with an ideal (frictionless) joint. The goal is to learn to balance the pole and keep it as vertical as possible while the cart moves on a frictionless flat surface. The initial state is random; four variables describe the state of the system (position and velocity of the cart, along with angle and angular velocity of the pole). The agent receives a reward for each action, which moves the cart either left or right. If the pole is kept upright for an average of 195 steps over 100 separate attempts, the problem is considered solved. Table 2 provides a visual representation of this scenario.
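The following is an illustrative check of the CPP state variables and the solved criterion using Gymnasium's CartPole-v1; the exact environment and API version used in the paper may differ.

import gymnasium as gym
import numpy as np

def evaluate(policy, attempts=100):
    # Average number of steps the pole stays up over `attempts` episodes.
    env = gym.make("CartPole-v1")
    steps = []
    for _ in range(attempts):
        obs, _ = env.reset()  # obs = [cart position, cart velocity, pole angle, pole angular velocity]
        t, done = 0, False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            done = terminated or truncated
            t += 1
        steps.append(t)
    return float(np.mean(steps))

# Solved when the average over 100 attempts reaches 195 (my_policy stands for any trained controller):
# solved = evaluate(my_policy) >= 195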
2 Related works
Many approaches have been used to solve these classical RL problems. For ex-
ample, the MCP was solved using SARSA with a semi-gradient approximation
[8]. Other authors have also employed tabular methods, as seen in Nguyen’s work
[13]. Some studies combine techniques like tabular approximation with RBF [12],
while others attempt to incorporate domain knowledge, such as potential func-
tions, as done by Xiao [11].
Regarding the CPP, a similar pattern emerges. Initially, it was solved us-
ing a tabular method. However, in Berez’s work [14], domain knowledge was
incorporated through constraints. Towers in [16] employed a single DQN, while
Kurban in [17] utilized both DQN and Double DQN, like Simmons in [18]. Fur-
thermore, Lanillos in [15] tackled the problem by utilizing images generated by
the environment.
The previously mentioned works are related to ours solely because of the problems they aim to solve. A particularly interesting work for us, however, is that of Lechner et al. [1], where the authors propose to solve the MCP using RL, employing an architecture called the Tap Withdrawal Circuit (TWC) and the Integrate and Fire (IandF) neural dynamics model, with Random Search (RS) as the search method in the solution space. That work reports a successful resolution using the OpenAI Gym library for the MCP and rllab for the CPP. It is important to note that the authors did not compare their results and performance with other state-of-the-art techniques.
In the present work, we propose to extend this idea by adding other models
of neural dynamics, other optimization methods, and other architectures. To
evaluate the performance of the different models, we established metrics that
allowed us to differentiate the best solutions and compare them with other recent
works in the area.
3 Our approach
3.1 Exploring the solution space
[...] neurons. To model the synaptic connections, we took into account whether each connection is excitatory, inhibitory, or electrical.
In the second stage, we also included the search for an optimal architecture to compare with the biological connectome by creating what we have called the shuffle architecture (SA). An SA is derived by randomly shuffling the set of synaptic connections while ensuring that no neuron remains disconnected. In figures 3 and 4 (table 3), we compare the TWC of C. elegans with an SA.
Each model contains internal parameters that need to be learned to solve the different problems. It is important to mention that the models have an internal representation that in no case coincides with the state variables of the environment or with the actions that can be executed on the problem. Consequently, it is necessary to define an interface connecting the motor and sensory neurons with the environment. The interface on the motor side translates the neuronal output into the action applied to the environment, while the sensory neurons receive the information coming from the environment.
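As a hedged sketch of this interface, the mapping below feeds each state variable to a sensory neuron and decodes two motor neurons into a left/right action for the CPP; the actual assignment of variables to neurons and the decoding rule used in the paper may differ.

class Interface:
    def __init__(self, sensory_ids, motor_ids):
        self.sensory_ids = sensory_ids     # indices of the sensory neurons
        self.motor_ids = motor_ids         # indices of the motor neurons

    def encode(self, observation):
        # Turn the environment state variables into sensory stimuli.
        return {n: float(x) for n, x in zip(self.sensory_ids, observation)}

    def decode(self, outputs):
        # Turn motor-neuron outputs into an action (here: 0 = left, 1 = right).
        left, right = (outputs[n] for n in self.motor_ids)
        return 1 if right > left else 0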
So far, we have discussed the training methodology (i.e., RL), the neural network architecture, and how we will explore the solution space. However, a fundamental piece is still missing: the model of neural dynamics. We work with two models that are well studied in the literature, as well as with our own proposal. Next, we detail the three neural models.
The Integrate and Fire (IandF) model [1] is a continuous model that describes the temporal evolution of the voltage across the neuronal membrane (which may be stimulated at the dendritic connections) through a single ordinary differential equation. It has been widely used in the literature, although it is not suitable for modeling neuronal spikes.
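A minimal sketch of one Euler step of a leaky integrate-and-fire neuron is shown below; the parameter names and default values are illustrative and do not correspond to the paper's exact formulation or learned values.

def iandf_step(v, i_syn, dt=0.01, g_leak=1.0, v_rest=-70.0, c_m=1.0):
    # dV/dt is driven by the leak current and the synaptic (dendritic) input.
    dv = (-g_leak * (v - v_rest) + i_syn) / c_m
    return v + dt * dv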
The Izhikevich (IZH) model [2] is a simplification of the famous Hodgkin-Huxley model and describes the neuronal evolution through a system of two coupled ordinary differential equations. Unlike the IandF model, it not only reproduces the firing and bursting behavior of different types of biological neurons but also provides a more detailed description of neuronal dynamics.
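For reference, the standard Izhikevich update consists of the two coupled equations plus a reset rule; the constants below are those of the original paper [2], while a, b, c, d are the per-neuron parameters to be learned (the exact discretization used in our implementation may differ).

def izh_step(v, u, i_syn, a, b, c, d, dt=0.5):
    dv = 0.04 * v * v + 5.0 * v + 140.0 - u + i_syn   # membrane potential dynamics
    du = a * (b * v - u)                              # recovery variable dynamics
    v, u = v + dt * dv, u + dt * du
    if v >= 30.0:                                     # spike: reset v and bump u
        v, u = c, u + d
    return v, u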
Finally, in this work we introduce a new neuron model, originally proposed by Fiuri and hence named the FIU model (FIU). It attempts to represent a simplified neuronal state, and consequently the resulting dynamics, without losing sight of the biological inspiration. In this model, a neuron n has two well-differentiated variables: an internal one (En) and an output one (On). Both depend on the dendritic stimulus, and Table 4 gives the updating rules of each one. Each neuron is characterized by its threshold Thn and a parameter dn, which represents the decay of En in the absence of stimulus. Unlike the IandF and IZH models, this is a discrete-time model, which makes it much more efficient from a computational point of view.
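Since Table 4 is not reproduced here, the following is only a plausible sketch consistent with the description above, i.e. a discrete-time update in which En integrates the dendritic stimulus, decays by dn when unstimulated, and drives On through the threshold Thn; the actual update rules are those of Table 4.

def fiu_step(E, stimulus, Th, d):
    if stimulus != 0.0:
        E = E + stimulus              # integrate the dendritic stimulus
    else:
        E = E - d                     # decay in the absence of stimulus
    O = 1.0 if E >= Th else 0.0       # output gated by the neuron's threshold
    return E, O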
Table 4 shows the equations that define each dynamic. We can see that each IandF neuron has three parameters to be learned, and each connection two. Therefore, for the TWC, there are 33+52=85 parameters to learn. In the case of IZH, we have four parameters per neuron and one per connection, so to train the TWC we have 44+26=70 learnable parameters. Finally, for FIU, we have two parameters per neuron and one per connection, totaling 22+26=48 parameters to learn. The smaller number of parameters and the faster computation of the dynamics (since no differential equations have to be solved at each simulation step) put the FIU model in the spotlight for computational speed. This explains why it is almost five times faster than the others in training and testing.
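These counts imply a circuit with 11 neurons and 26 connections, which the following quick check makes explicit.

def n_params(n_neurons, n_connections, per_neuron, per_connection):
    return n_neurons * per_neuron + n_connections * per_connection

assert n_params(11, 26, 3, 2) == 85   # IandF: 33 + 52
assert n_params(11, 26, 4, 1) == 70   # IZH:   44 + 26
assert n_params(11, 26, 2, 1) == 48   # FIU:   22 + 26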
3.4 Experimental setup
In this section, we present the experiments carried out to evaluate the performance of the various combinations of dynamics, optimizers, and architectures that we applied to the two classical RL problems under consideration.
To achieve this, we conducted the same experiment for each configuration. In other words, given an architecture A, a model M, and an optimizer O, the network was trained on the problem under different scenarios (keeping A, M, and O fixed), which vary in the number of optimization steps.
Furthermore, we introduced two parameters, called batch and worst, to mitigate the stochastic effect on the training performance of the networks. For example, a network trained with batch=5 and worst=2 means that during training five episodes are executed and the two worst results are kept; the candidate is then scored by the average performance of those two worst episodes. Note that the pseudocode of table 1 presents the case of batch=1 and worst=1.
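A hedged sketch of this scoring rule, reusing the run_episode sketch given earlier: the candidate is evaluated on `batch` episodes and scored by the average of its `worst` lowest rewards.

def batch_worst_score(env, model, batch=5, worst=2):
    rewards = sorted(run_episode(env, model) for _ in range(batch))
    return sum(rewards[:worst]) / worst   # mean of the `worst` lowest episode rewards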
For each network, we run five training sessions for each scenario. Once the
number of optimizer steps (Steps) is chosen, the batch and worst parameters
are used to run five instances (called a,b,c,d,e) of the same configuration. This
differentiation helps in organizing and comparing the results.
In summary, for the MCP, consider a network R characterized by an architecture A ∈ {TWC, SA}, a model M ∈ {IandF, IZH, FIU}, an optimizer O ∈ {RS, GA, BO}, and a number of steps Steps ∈ {S1, S2, S3} (where each Si depends on the optimizer). Five instances of R were trained in five different scenarios with (batch, worst) ∈ {(1,1), (5,2), (5,5), (10,5), (10,10)}, and for each scenario five training sessions (a, b, c, d, e) were run. Therefore, 1350 trainings were executed for the MCP over both architectures. A similar situation occurs with the CPP; the only differences are that four values Steps ∈ {S1, S2, S3, S4} were used for each optimizer and only three combinations (batch, worst) ∈ {(1,1), (5,2), (5,5)}, so the total number of trainings for the CPP over both architectures was 1080. This setting was sufficient for our purposes.
We identify a network by constructing its name following the pattern M+O+Steps+batch+worst+version, where M ∈ {IandF, IZH, FIU} represents the model type, O ∈ {RS, GA, BO} indicates the optimizer used, Steps is the number of optimization steps, and batch and worst are as explained previously. Additionally, version ∈ {a, b, c, d, e} differentiates between instances of the same configuration. For instance, a network named FIUBO150052e is built upon the FIU model, uses the BO optimizer with 1500 optimization steps, has batch and worst values of 5 and 2, respectively, and is instance e of that configuration.
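A hypothetical helper mirroring this naming pattern (not part of the released code):

def build_name(model, optimizer, steps, batch, worst, version):
    # e.g. build_name("FIU", "BO", 1500, 5, 2, "e") -> "FIUBO150052e"
    return f"{model}{optimizer}{steps}{batch}{worst}{version}"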
Typically, in the area of reinforcement learning, training results are reported, commonly using rewards as performance indicators. However, in most other learning areas (such as supervised and unsupervised learning), evaluation is conducted separately from training. By adopting this approach, we can more effectively assess whether one model outperforms another on a given task. To achieve this, we run the experiments 1000 times in evaluation mode rather than training mode, collecting different metrics for comparison. In the next section, we establish specific performance metrics for each problem, allowing us to conduct a thorough comparison of the trained models.
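A sketch of this evaluation protocol, assuming a gym-style environment with the classic 4-tuple step API: the trained model is frozen and run 1000 times, collecting the per-episode reward and step count for the metrics defined below.

def evaluate_model(env, model, n_runs=1000):
    rewards, steps = [], []
    for _ in range(n_runs):
        state, total, t, done = env.reset(), 0.0, 0, False
        while not done:
            state, reward, done, _ = env.step(model.act(state))
            total += reward
            t += 1
        rewards.append(total)
        steps.append(t)
    return rewards, steps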
3.5 Metrics
It is standard practice to evaluate RL models by reporting the rewards obtained in training (T. Rew. in our tables) and in evaluation. However, given the problem's stochastic nature, some solutions that performed very well in the training phase did not maintain this behavior in the evaluation stage. For this reason, we decided to collect two metrics in addition to the obtained reward to discriminate the quality of the learned models, thus setting a fairer criterion.
In the case of the MCP, we collect the average number of steps (Me Stp in the tables) the car needs to reach its destination over the 1000 test executions. In addition, we keep the minimum number of steps (Mi Stp in the tables) in which the problem was solved. For the CPP, we define something similar, collecting the average (over the 1000 tests) number of steps during which the pole managed to stay within the limits (also Me Stp). In addition, we record the largest number of steps the pole remained upright over the 1000 attempts (Ma Stp). Execution times were also recorded on the same machine in training and evaluation to compare the computational cost of the different models.
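From the lists collected in evaluation mode, the reported quantities reduce to simple aggregates; the sketch below assumes the rewards/steps lists returned by the evaluation sketch above.

import numpy as np

def summarize(rewards, steps):
    return {
        "Me Rew": np.mean(rewards), "sigma Rew": np.std(rewards),
        "Me Stp": np.mean(steps),   "sigma Stp": np.std(steps),
        "Mi Stp": int(np.min(steps)),   # reported for MCP
        "Ma Stp": int(np.max(steps)),   # reported for CPP
    }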
4 Results
In this section, we report the results obtained for all the combinations described in Section 3.4, evaluated with the metrics introduced in the previous section.
MCP: For this problem, the authors in [1, 11] reported reward values between
90 and 96, based on a sample size of 100 experiments. In our experiments, we
achieved a higher reward of 98.9 ± 0.9, averaging over 1000 experiments. Addi-
tionally, for a more detailed analysis of our results, the average number of steps
taken by our best network was 121.17 ± 5.58, and the minimum number of steps
for solving the problem with our best model was 85. We could not find any pre-
vious reports of these last metrics for comparison. The results presented in the
top section of Table 5 display the nine best rewards obtained for different neural
dynamics and each optimizer, considering both TWC and SA. The resulting
models achieved excellent results, particularly for IandF and FIU in both ar-
chitectures, when compared with the previously mentioned results. The bottom
part of Table 5 showcases the best results for each dynamic model and optimizer,
taking into account the mean number of steps required to solve the MCP prob-
lem. Notably, the proposed model (FIU) achieved outstanding results, ranking
among the top positions.
Table 6 displays the results for the TWC and the SA. The graph on the left shows the TWC architecture, while the one on the right shows an SA. Different colors identify the neural dynamics, and the graphs show the variance of the mean number of steps taken to solve the MCP for the top 15 models. The ordering criterion is the mean number of steps taken by the model to solve the problem, and each sample is the result of 1000 tests. The FIU model performs best for both architectures, obtaining a model that solves the MCP in fewer steps and with more stability. The IandF and IZH models also obtain good results, but with a larger standard deviation.
CPP: For this problem, the literature deems the problem solved if a sample of 100 tests produces an average of 195 ticks with the pole within the defined boundaries. The most noteworthy reported results are close to 500 ticks (see [14-19]). We are pleased to report that many of our models also reach values at this upper bound, as detailed below.
Table 5. MCP: Experimental results. T. Rew. stands for training reward, Me Rew. is the mean reward over the 1000-run test and σ is its standard deviation, Me Stp and its respective σ are defined analogously for steps, and Mi Stp is the minimum number of steps a test returned.
Table 6. Results for MCP with the TWC architecture (left) and with the shuffle architecture (right): top 15 models ordered by the mean number of steps needed to solve the problem. We report the standard deviation of the performance for the different runs, including the different neural dynamics and optimizers.
The top of Table 7 shows the best models for the different neuronal dynamics and optimizers, considering the reward obtained in 1000 experiments; we can see very good results for the IandF and FIU models in both architectures. Notice that in the CPP the reward equals the number of steps during which the pole is considered to be in a good position. The bottom of Table 7 shows the best models for the different neuronal dynamics and optimizers, considering the average number of steps the pole stays up over 1000 evaluation attempts. It is important to mention that many models reached an average of 500 and that almost all of them reached the maximum of 500 steps at least once. Table 8 displays the results for the TWC and SA when solving the CPP. The graph on the left shows the TWC architecture, while the one on the right shows the SA. We report graphs showing the variance of the performance achieved by the models, using the mean number of steps the model holds the pole (over the 1000 test attempts). Note that FIU and IandF perform very well and that it is difficult to differentiate between model qualities because most models reach a value of 500.
Table 7. CPP: Experimental results. T. Rew. stands for training reward, Me Rew. is the mean reward over the 1000-run test and σ is its standard deviation, Me Stp and its respective σ are defined analogously for steps, and Ma Stp is the maximum number of steps a test returned.
Extending the CPP limit: Table 9 details the number of networks that obtain an average of 500 steps (again, we notice a tendency for the IandF and FIU models to yield a larger number of good models). As we can see, many models reached the maximum of 500 steps. Therefore, to distinguish whether one model is better than another, we carried out new experiments extending the cutoff limit of the standard problem. We extended the limit from 500 to 5000 and reran the 1000-attempt test, this time including only models that had achieved an average of 500 steps in the standard environment version. Table 10 shows the evaluation graphs with the extended limit for the top 15 models in both architectures. Surprisingly, these networks, trained with a limit of 500 steps per episode, obtained mean values much higher than those reported in the literature, some even reaching 5000 steps.
Table 8. CPP with TWC (left) and with SA (right): results for the 15 best ANNs, differentiated by neural dynamics and optimizer and ordered by mean steps. We report the standard deviation of the performance for the different runs.
Table 10. ECPP with TWC (left) and with SA (right): results for the 15 best ANNs, differentiated by neural dynamics and optimizer and ordered by mean steps.
Table 11 contains the details of the best results for the different dynamics and optimizers on both architectures. For most of the retested networks, we obtain maxima of 5000 steps for an episode. In this case, we noted a tendency for the IandF dynamics to find better networks than the rest.
Time results: The one variable not analyzed so far is the time it took to train the models and the time a model needs to produce a response. To analyze the time required to obtain a response from a model, we used the TWC and measured the time spent stimulating the sensory neurons and reading the output from the motor neurons. We repeated this experiment 1,000,000 times for different configurations, recording the time taken by each neural dynamics model. The resulting cumulative times are the following: for the FIU model, 7m 34.156s; for the IandF model, 38m 59.682s; and finally, for the IZH model, 34m 5.389s. Clearly, FIU is almost five times faster than IZH and more than five times faster than IandF. To analyze the time spent in the training phase, in the next section we report the training times for the different optimization methods.
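A sketch of this timing experiment (wall-clock accumulation over many forward passes; model.act and the observation are placeholders for any dynamics model and input):

import time

def benchmark(model, observation, repetitions=1_000_000):
    start = time.perf_counter()
    for _ in range(repetitions):
        model.act(observation)   # feed sensory neurons, step the dynamics, read motor neurons
    return time.perf_counter() - start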
Table 12. Best models. In gray, the best values for each architecture-problem box and, in light gray, the second-best values for each model. For the MCP, a t-test analysis was performed between the gray and the light gray entries of the same problem and also between the best of MCP with TWC and MCP with SA. The results are: TWC/FIURS vs TWC/IandFGA = 2.9e-226, TWC/FIURS vs TWC/IZHBO = 0.0, SA/FIUBO vs SA/IandFGA = 0.0, SA/FIUBO vs SA/IZHRS = 5.9e-304, TWC/FIURS vs SA/FIUBO = 7.15e-12.
Finding the best models: To identify the best models, we select one model for each dynamics-optimizer combination and compare them in Table 12. We established a criterion to decide when a model m1 is better than another model m2, based on the mean-number-of-steps metric for both problems. Suppose si ± σi is the mean-steps metric of mi. For the MCP, if we take si + σi as the upper bound on the mean number of steps the model may need to solve the problem, we want to select the model with the smallest such value; so if s1 + σ1 < s2 + σ2, we consider m1 better than m2. With similar reasoning for the CPP, we want to select the model that guarantees the largest number of steps during which, at a minimum, the pole is kept within the limits; so if s1 − σ1 < s2 − σ2, we take m2 to be better than m1.
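The criterion can be written compactly as follows (each model is summarized by its mean steps s and standard deviation sigma):

def better(m1, m2, problem):
    s1, sg1 = m1
    s2, sg2 = m2
    if problem == "MCP":
        return s1 + sg1 < s2 + sg2   # fewer steps, even in the pessimistic case
    else:  # CPP
        return s1 - sg1 > s2 - sg2   # more steps guaranteed, even in the pessimistic case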
Grouping results: For the MCP, FIU clearly obtains the best results in both architectures. To show the statistical significance of the improvements between the different models, we performed t-tests comparing the results over the 1000 runs. The results are shown in the caption of Table 12; the very small p-values indicate that the compared solutions are statistically distinguishable. For the CPP, we have the exceptional situation that many models reach very good results (i.e., 500). The results corresponding to the ECPP show that the IandF models reach better performance than the others, followed by FIU, which also achieves outstanding results.
In this section, we analyze the behavior of the optimization methods when exploring the hyperparameter space. To do so, we perform a global analysis, computing a single magnitude over all the models considered good solutions, in order to characterize each method as a whole and to infer some kind of preference, if it exists, between the different dynamics models or even between the different optimizers. The first step was to select the models above a certain quality threshold, thereby defining a sample, and then to calculate the means of the training time and of the mean number of steps over each sample. For the MCP, a model is included in the sample if the obtained reward (training and test) is greater than 91.5. For the CPP, a model is added to the sample if the obtained reward (training and test) is greater than 190. The following tables show the training-time and mean-of-steps metrics for all combinations of dynamics model, optimizer, and architecture. For the training-time metric, we report the standard deviation of the sample as the error. For the mean-of-steps metric, we report the mean of the individual standard deviations as the error.
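A sketch of this grouping procedure; `results` is a hypothetical list of per-model records, not the actual data structure of the released code.

def group(results, threshold):
    # Keep only models whose training and test rewards exceed the threshold
    # (91.5 for MCP, 190 for CPP), then average per dynamics/optimizer/architecture.
    sample = [r for r in results
              if r["train_reward"] > threshold and r["test_reward"] > threshold]
    groups = {}
    for r in sample:
        key = (r["model"], r["optimizer"], r["architecture"])
        groups.setdefault(key, []).append(r)
    return {k: {"train_time": sum(r["train_time"] for r in v) / len(v),
                "mean_steps": sum(r["mean_steps"] for r in v) / len(v)}
            for k, v in groups.items()}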
Now we analyze the time the optimizers took for the hyperparameter search of our models. In Table 13, in the rows labeled "Training Time", we can see at the top the amount of time consumed by each of the optimizers for the different neural models and architectures. The times consumed by GA and BO are much lower than those consumed by RS. Note also that, according to the rows labeled "Mean Steps" in Table 13, on average the best solutions were obtained by the GA, both in mean and in variance. Regarding the CPP, a similar analysis applies; however, Table 14 shows, in the rows labeled "Training Time", that the time consumed by GA is higher than that consumed by BO and RS. This is because the best solutions for the CPP are the ones that take more steps holding the pole; indeed, Table 9 shows that the GA obtains many more models with an average of 500 steps, which may explain the longer training time, since many of the genomes evaluated correspond to better models that take longer to simulate on average.
References
1. Lechner, M., Hasani, R., Grosu, R.: Neuronal Circuit Policies. https://arxiv.org/abs/1803.08554 (2018)
2. Izhikevich, E. M.: Simple Model of Spiking Neurons. IEEE Transactions on Neural Networks, Vol. 14, No. 6. https://arxiv.org/pdf/2106.06158.pdf (2003)
3. Haimovici, A., Tagliazucchi, E., Balenzuela, P., Chialvo, D.: Brain Organization into Resting State Networks Emerges at Criticality on a Model of the Human Connectome. https://arxiv.org/abs/1209.5353 (2013)
Code available at https://github.com/afiuriG/BiMPuRLE_CMP/tree/MCP