Reinforcement Learning & Biologically Inspired Artificial Neural Networks

Fiuri Ariel M., Dominguez Martin A., and Francisco Tamarit
Abstract. Over the last few years, machine learning methods have relied on Deep Neural Network architectures to tackle complex problems. In this paper, we apply biologically inspired neural machine learning to solve two classical and well-known challenging problems: the Mountain Car Continuous and the Cart Pole. We use a neural network extracted from the connectome of C. elegans to learn a policy able to yield a good solution. We used Reinforcement Learning (RL) and optimization techniques to train the models, in addition to proposing a novel neural dynamics model. We use different metrics to make a detailed comparison of the results obtained, combining different neuronal dynamics and optimization methods. We obtained very competitive results compared with the solutions reported in the literature, particularly with the novel neuronal dynamics model.
1 Background
Reinforcement learning (RL) is a machine learning technique in which an agent learns to perform a task through a succession of trial-and-error interactions with a changing environment, usually modeled with Markov decision processes. To train the agent to solve a problem, the environment provides a reward and a description of the state of the problem. Each time we execute an action on the environment, we obtain a new reward and state descriptor. A desired solution is one that maximizes the rewards obtained in a given number of interactions [8]. An episode refers to the sequence of interactions with the environment starting from its initial state. In each episode, the rewards from each interaction accumulate into an overall reward for the episode.
Code 1. The optimization procedure, which explores the different internal parameters and returns the best configuration.
Code 2. The episode procedure, which represents the interaction between the agent and the environment.
Training consists of repeatedly assigning values to the internal parameters of the network (setParams) and calculating in each step the reward obtained in an episode. The assignment of the values is represented by the policy, which is associated with the optimizer.
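The authors' listings are not reproduced here, so the following is only a minimal Python sketch of the two procedures, assuming a gym-style environment and an optimizer exposing hypothetical propose()/update() methods; setParams is the only identifier taken from the text.

def run_episode(env, model, max_steps=1000):
    # Code 2 sketch: one episode of agent-environment interaction.
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = model.act(state)              # map sensory input to a motor action
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

def optimize(env, model, optimizer, n_steps):
    # Code 1 sketch: explore the internal parameters, return the best configuration.
    best_params, best_reward = None, float("-inf")
    for _ in range(n_steps):
        params = optimizer.propose()           # the policy for choosing candidate values
        model.setParams(params)                # assign the candidate internal parameters
        reward = run_episode(env, model)
        optimizer.update(params, reward)       # feed the score back to the optimizer
        if reward > best_reward:
            best_params, best_reward = params, reward
    return best_params, best_reward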
From now on, we will use the word model to refer to the neural network that
we are using as an agent.
The Cart Pole Problem (CPP) is another classic problem in RL, also described by Sutton and Barto in [8]. It involves a cart holding a pole attached with an ideal (frictionless) joint. The goal is to learn to balance the pole and keep it as vertical as possible while the cart moves on a frictionless flat surface. The initial state is random; four variables describe the state of the system (position and velocity of the cart, along with angle and angular velocity of the pole). The agent receives a reward for each action, which moves the cart either left or right. If the pole is kept upright for an average of 195 steps over 100 separate attempts, the problem is considered solved. Table 2 provides a visual representation of this scenario.
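The following is an illustrative check of the CPP state variables and the solved criterion using Gymnasium's CartPole-v1; the exact environment and API version used in the paper may differ.

import gymnasium as gym
import numpy as np

def evaluate(policy, attempts=100):
    # Average number of steps the pole stays up over `attempts` episodes.
    env = gym.make("CartPole-v1")
    steps = []
    for _ in range(attempts):
        obs, _ = env.reset()  # obs = [cart position, cart velocity, pole angle, pole angular velocity]
        t, done = 0, False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            done = terminated or truncated
            t += 1
        steps.append(t)
    return float(np.mean(steps))

# Solved when the average over 100 attempts reaches 195 (my_policy stands for any trained controller):
# solved = evaluate(my_policy) >= 195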
2 Related works
Many approaches have been used to solve these classical RL problems. For ex-
ample, the MCP was solved using SARSA with a semi-gradient approximation
[8]. Other authors have also employed tabular methods, as seen in Nguyen’s work
[13]. Some studies combine techniques like tabular approximation with RBF [12],
while others attempt to incorporate domain knowledge, such as potential func-
tions, as done by Xiao [11].
Regarding the CPP, a similar pattern emerges. Initially, it was solved us-
ing a tabular method. However, in Berez’s work [14], domain knowledge was
incorporated through constraints. Towers in [16] employed a single DQN, while
Kurban in [17] utilized both DQN and Double DQN, like Simmons in [18]. Fur-
thermore, Lanillos in [15] tackled the problem by utilizing images generated by
the environment.
The previously mentioned works are related to ours solely because of the problems they aim to solve. A particularly interesting work for us, however, is that of Lechner et al. [1], where the authors propose to solve the MCP using RL, employing an architecture called the Tap Withdrawal Circuit (TWC) and the Integrate and Fire (IandF) neural dynamics model, with Random Search (RS) as the search method in the solution space. That work reports a successful resolution using the OpenAI Gym library for the MCP and rllab for the CPP. It is important to note that the authors did not compare their results and performance with other state-of-the-art techniques.
In the present work, we propose to extend this idea by adding other models
of neural dynamics, other optimization methods, and other architectures. To
evaluate the performance of the different models, we established metrics that
allowed us to differentiate the best solutions and compare them with other recent
works in the area.
3 Our approach
3.1 Exploring the solution space
[...] neurons. To model the synaptic connections, we took into account whether each connection is excitatory, inhibitory, or electrical.
In the second stage, we also included the search for an optimal architecture to compare with the biological connectome by creating what we have called the shuffle architecture (SA). An SA is derived by randomly shuffling the set of synaptic connections while ensuring that no neuron remains disconnected. In figures 3 and 4 (table 3), we compare the TWC of C. elegans with an SA.
Each model contains internal parameters that need to be learned to solve the different problems. It is important to mention that the models have an internal representation that in no case coincides with the state variables of the environment or with the actions that can be executed on the problem. Consequently, it is necessary to define an interface connecting the motor and sensory neurons with the environment. The interface on the motor side translates the neuronal output into the action applied to the environment, while the sensory neurons receive the information coming from the environment.
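As a hedged sketch of this interface, the mapping below feeds each state variable to a sensory neuron and decodes two motor neurons into a left/right action for the CPP; the actual assignment of variables to neurons and the decoding rule used in the paper may differ.

class Interface:
    def __init__(self, sensory_ids, motor_ids):
        self.sensory_ids = sensory_ids     # indices of the sensory neurons
        self.motor_ids = motor_ids         # indices of the motor neurons

    def encode(self, observation):
        # Turn the environment state variables into sensory stimuli.
        return {n: float(x) for n, x in zip(self.sensory_ids, observation)}

    def decode(self, outputs):
        # Turn motor-neuron outputs into an action (here: 0 = left, 1 = right).
        left, right = (outputs[n] for n in self.motor_ids)
        return 1 if right > left else 0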
So far, we have discussed the training methodology (i.e., RL), the neural network architecture, and how we will explore the solution space. However, a fundamental piece is still missing: the model of neural dynamics. We work with two models that are well studied in the literature, as well as with our own proposal. Next, we detail the three neural models.
The Integrate and Fire (IandF) model [1] is a continuous model that describes the temporal evolution of the voltage across the neuronal membrane (which may be stimulated at the dendritic connections) through a single ordinary differential equation. It has been widely used in the literature, although it is not suitable for modeling neuronal spikes.
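A minimal sketch of one Euler step of a leaky integrate-and-fire neuron is shown below; the parameter names and default values are illustrative and do not correspond to the paper's exact formulation or learned values.

def iandf_step(v, i_syn, dt=0.01, g_leak=1.0, v_rest=-70.0, c_m=1.0):
    # dV/dt is driven by the leak current and the synaptic (dendritic) input.
    dv = (-g_leak * (v - v_rest) + i_syn) / c_m
    return v + dt * dv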
The Izhikevich (IZH) model [2] is a simplification of the famous Hodgkin-Huxley model and describes the neuronal evolution through a system of two coupled ordinary differential equations. Unlike the IandF model, it not only reproduces the firing and bursting behavior of different types of biological neurons but also provides a more detailed description of neuronal dynamics.
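For reference, the standard Izhikevich update consists of the two coupled equations plus a reset rule; the constants below are those of the original paper [2], while a, b, c, d are the per-neuron parameters to be learned (the exact discretization used in our implementation may differ).

def izh_step(v, u, i_syn, a, b, c, d, dt=0.5):
    dv = 0.04 * v * v + 5.0 * v + 140.0 - u + i_syn   # membrane potential dynamics
    du = a * (b * v - u)                              # recovery variable dynamics
    v, u = v + dt * dv, u + dt * du
    if v >= 30.0:                                     # spike: reset v and bump u
        v, u = c, u + d
    return v, u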
Finally, in this work we introduce a new neuron model, originally proposed by Fiuri and hence named the FIU model (FIU). It attempts to represent a simplified neuronal state, and consequently the resulting dynamics, without losing sight of the biological inspiration. In this model, a neuron n has two well-differentiated variables: an internal one (En) and an output one (On). Both depend on the dendritic stimulus, and Table 4 gives the updating rules of each one. Each neuron is characterized by its threshold Thn and a parameter dn, which represents the decay of En in the absence of stimulus. Unlike the IandF and IZH models, this is a discrete-time model, which makes it much more efficient from a computational point of view.
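Since Table 4 is not reproduced here, the following is only a plausible sketch consistent with the description above, i.e. a discrete-time update in which En integrates the dendritic stimulus, decays by dn when unstimulated, and drives On through the threshold Thn; the actual update rules are those of Table 4.

def fiu_step(E, stimulus, Th, d):
    if stimulus != 0.0:
        E = E + stimulus              # integrate the dendritic stimulus
    else:
        E = E - d                     # decay in the absence of stimulus
    O = 1.0 if E >= Th else 0.0       # output gated by the neuron's threshold
    return E, O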
Table 4 shows the equations that define each dynamic. We can see that each IandF neuron has three parameters to be learned, and each connection two. Therefore, for the TWC, there are 33+52=85 parameters to learn. In the case of IZH, we have four parameters per neuron and one per connection, so to train the TWC we have 44+26=70 learnable parameters. Finally, for FIU, we have two parameters per neuron and one per connection, totaling 22+26=48 parameters to learn. The smaller number of parameters and the faster computation of the dynamics (since no differential equations have to be solved at each simulation step) put the FIU model in the spotlight for computational speed. This explains why it is almost five times faster than the others in training and testing.
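These counts imply a circuit with 11 neurons and 26 connections, which the following quick check makes explicit.

def n_params(n_neurons, n_connections, per_neuron, per_connection):
    return n_neurons * per_neuron + n_connections * per_connection

assert n_params(11, 26, 3, 2) == 85   # IandF: 33 + 52
assert n_params(11, 26, 4, 1) == 70   # IZH:   44 + 26
assert n_params(11, 26, 2, 1) == 48   # FIU:   22 + 26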
3.4 Experimental setup
In this section, we present the experiments carried out to evaluate the performance of the various combinations of dynamics, optimizers, and architectures that we applied to the two classical RL problems under consideration.
To achieve this, we conducted the same experiment for each configuration. In other words, given an architecture A, a model M, and an optimizer O, the network was trained on the problem under different scenarios (keeping A, M, and O fixed), which vary in the number of optimization steps.
Furthermore, we introduced two parameters, called batch and worst, to mitigate the stochastic effect on the training performance of the networks. For example, a network trained with batch=5 and worst=2 means that during training five episodes are executed and the two worst results are kept; the candidate is then scored by the average performance of those two worst episodes. Note that the pseudocode of table 1 presents the case of batch=1 and worst=1.
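A hedged sketch of this scoring rule, reusing the run_episode sketch given earlier: the candidate is evaluated on `batch` episodes and scored by the average of its `worst` lowest rewards.

def batch_worst_score(env, model, batch=5, worst=2):
    rewards = sorted(run_episode(env, model) for _ in range(batch))
    return sum(rewards[:worst]) / worst   # mean of the `worst` lowest episode rewards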
For each network, we run five training sessions for each scenario. Once the
number of optimizer steps (Steps) is chosen, the batch and worst parameters
are used to run five instances (called a,b,c,d,e) of the same configuration. This
differentiation helps in organizing and comparing the results.
In summary, for the MCP, consider a network R characterized by an architecture A ∈ {TWC, SA}, a model M ∈ {IandF, IZH, FIU}, an optimizer O ∈ {RS, GA, BO}, and a number of steps Steps ∈ {S1, S2, S3} (where each Si depends on the optimizer). Five instances of R were trained in five different scenarios with (batch, worst) ∈ {(1,1), (5,2), (5,5), (10,5), (10,10)}, and for each scenario five training sessions (a, b, c, d, e) were run. Therefore, 1350 trainings were executed for the MCP over both architectures. A similar situation occurs with the CPP; the only differences are that four values Steps ∈ {S1, S2, S3, S4} were used for each optimizer and only three combinations (batch, worst) ∈ {(1,1), (5,2), (5,5)}, so the total number of trainings for the CPP over both architectures was 1080. This setting was sufficient for our purposes.
We identify a network by constructing its name following the pattern M+O+Steps+batch+worst+version, where M ∈ {IandF, IZH, FIU} represents the model type, O ∈ {RS, GA, BO} indicates the optimizer used, Steps is the number of optimization steps, and batch and worst are as explained previously. Additionally, version ∈ {a, b, c, d, e} differentiates between instances of the same configuration. For instance, a network named FIUBO150052e is built upon the FIU model, uses the BO optimizer with 1500 optimization steps, has batch and worst values of 5 and 2, respectively, and is instance e of that configuration.
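A hypothetical helper mirroring this naming pattern (not part of the released code):

def build_name(model, optimizer, steps, batch, worst, version):
    # e.g. build_name("FIU", "BO", 1500, 5, 2, "e") -> "FIUBO150052e"
    return f"{model}{optimizer}{steps}{batch}{worst}{version}"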
Typically, in the area of reinforcement learning, training results are reported, commonly using rewards as performance indicators. However, in most other learning areas (such as supervised and unsupervised learning), evaluation is conducted separately from training. By adopting this approach, we can more effectively assess whether one model outperforms another on a given task. To achieve this, we run the experiments 1000 times in evaluation mode rather than training mode, collecting different metrics for comparison. In the next section, we establish specific performance metrics for each problem, allowing us to conduct a thorough comparison of the trained models.
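A sketch of this evaluation protocol, assuming a gym-style environment with the classic 4-tuple step API: the trained model is frozen and run 1000 times, collecting the per-episode reward and step count for the metrics defined below.

def evaluate_model(env, model, n_runs=1000):
    rewards, steps = [], []
    for _ in range(n_runs):
        state, total, t, done = env.reset(), 0.0, 0, False
        while not done:
            state, reward, done, _ = env.step(model.act(state))
            total += reward
            t += 1
        rewards.append(total)
        steps.append(t)
    return rewards, steps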
3.5 Metrics
It is standard practice to evaluate RL models by reporting the rewards obtained in training (T. Rew. in our tables) and in evaluation. However, given the problem's stochastic nature, some solutions that performed very well in the training phase did not maintain this behavior in the evaluation stage. For this reason, we decided to collect two metrics in addition to the obtained reward to discriminate the quality of the learned models, thus setting a fairer criterion.
In the case of the MCP, we collect the average number of steps (Me Stp in the tables) the car needs to reach its destination over the 1000 test executions. In addition, we keep the minimum number of steps (Mi Stp in the tables) in which the problem was solved. For the CPP, we define something similar, collecting the average (over the 1000 tests) number of steps during which the pole managed to stay within the limits (also Me Stp). In addition, we record the largest number of steps the pole remained upright over the 1000 attempts (Ma Stp). Execution times were also recorded on the same machine in training and evaluation to compare the computational cost of the different models.
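From the lists collected in evaluation mode, the reported quantities reduce to simple aggregates; the sketch below assumes the rewards/steps lists returned by the evaluation sketch above.

import numpy as np

def summarize(rewards, steps):
    return {
        "Me Rew": np.mean(rewards), "sigma Rew": np.std(rewards),
        "Me Stp": np.mean(steps),   "sigma Stp": np.std(steps),
        "Mi Stp": int(np.min(steps)),   # reported for MCP
        "Ma Stp": int(np.max(steps)),   # reported for CPP
    }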
4 Results
In this section, we report the results obtained for all the combinations described in Section 3.4, evaluated with the metrics introduced in the previous section.
MCP: For this problem, the authors in [1, 11] reported reward values between
90 and 96, based on a sample size of 100 experiments. In our experiments, we
achieved a higher reward of 98.9 ± 0.9, averaging over 1000 experiments. Addi-
tionally, for a more detailed analysis of our results, the average number of steps
taken by our best network was 121.17 ± 5.58, and the minimum number of steps
for solving the problem with our best model was 85. We could not find any pre-
vious reports of these last metrics for comparison. The results presented in the
top section of Table 5 display the nine best rewards obtained for different neural
dynamics and each optimizer, considering both TWC and SA. The resulting
models achieved excellent results, particularly for IandF and FIU in both ar-
chitectures, when compared with the previously mentioned results. The bottom
part of Table 5 showcases the best results for each dynamic model and optimizer,
taking into account the mean number of steps required to solve the MCP prob-
lem. Notably, the proposed model (FIU) achieved outstanding results, ranking
among the top positions.
Table 6 displays the results for the TWC and the SA. The graph on the left shows the TWC architecture, while the one on the right shows an SA. Different colors identify the neural dynamics, and the graphs show the variance of the mean number of steps taken to solve the MCP for the top 15 models. The ordering criterion is the mean number of steps taken by the model to solve the problem, and each sample is the result of 1000 tests. The FIU model performs best for both architectures, obtaining a model that solves the MCP in fewer steps and with more stability. The IandF and IZH models also obtain good results, but with a larger standard deviation.
CPP: For this problem, the literature deems the problem solved if a sample of 100 tests produces an average of 195 ticks with the pole within the defined boundaries. The most noteworthy reported results are close to 500 ticks (see [14-19]). We are pleased to report that many of our models also reach values at this upper bound, as detailed below.
Table 5. MCP: Experimental results. T. Rew. stands for training reward, Me Rew. is the mean reward over the 1000-run test and σ is its standard deviation, Me Stp and its respective σ are defined analogously for steps, and Mi Stp is the minimum number of steps a test returned.
Table 6. Results for MCP with the TWC architecture (left) and with the shuffle architecture (right): top 15 models ordered by the mean number of steps needed to solve the problem. We report the standard deviation of the performance for the different runs, including the different neural dynamics and optimizers.
The top of Table 7 shows the best models for the different neuronal dynamics and optimizers, considering the reward obtained in 1000 experiments; we can see very good results for the IandF and FIU models in both architectures. Notice that in the CPP the reward equals the number of steps during which the pole is considered to be in a good position. The bottom of Table 7 shows the best models for the different neuronal dynamics and optimizers, considering the average number of steps the pole stays up over 1000 evaluation attempts. It is important to mention that many models reached an average of 500 and that almost all of them reached the maximum of 500 steps at least once. Table 8 displays the results for the TWC and SA when solving the CPP. The graph on the left shows the TWC architecture, while the one on the right shows the SA. We report graphs showing the variance of the performance achieved by the models, using the mean number of steps the model holds the pole (over the 1000 test attempts). Note that FIU and IandF perform very well and that it is difficult to differentiate between model qualities because most models reach a value of 500.
Table 7. CPP: Experimental results. T. Rew. stands for training reward, Me Rew. is the mean reward over the 1000-run test and σ is its standard deviation, Me Stp and its respective σ are defined analogously for steps, and Ma Stp is the maximum number of steps a test returned.
Extending the CPP limit: Table 9 details the number of networks that obtain an average of 500 steps (again, we notice a tendency for the IandF and FIU models to yield a larger number of good models). As we can see, many models reached the maximum of 500 steps. Therefore, to distinguish whether one model is better than another, we carried out new experiments extending the cutoff limit of the standard problem. We extended the limit from 500 to 5000 and reran the 1000-attempt test, this time including only models that had achieved an average of 500 steps in the standard environment version. Table 10 shows the evaluation graphs with the extended limit for the top 15 models in both architectures. Surprisingly, these networks, trained with a limit of 500 steps per episode, obtained mean values much higher than those reported in the literature, some even reaching 5000 steps.
Table 8. CPP with TWC (left) and with SA (right): results for the 15 best ANNs, differentiated by neural dynamics and optimizer and ordered by mean steps. We report the standard deviation of the performance for the different runs.
Table 10. ECPP with TWC (left) and with SA (right): results for the 15 best ANNs, differentiated by neural dynamics and optimizer and ordered by mean steps.
Table 11 contains the details of the best results for the different dynamics and optimizers on both architectures. For most of the retested networks, we obtain maxima of 5000 steps for an episode. In this case, we noted a tendency for the IandF dynamics to find better networks than the rest.
Time results: The one variable not analyzed so far is the time it took to train the models and the time a model needs to produce a response. To analyze the time required to obtain a response from a model, we used the TWC and measured the time spent stimulating the sensory neurons and reading the output from the motor neurons. We repeated this experiment 1,000,000 times for different configurations, recording the time taken by each neural dynamics model. The resulting cumulative times are the following: for the FIU model, 7m 34.156s; for the IandF model, 38m 59.682s; and finally, for the IZH model, 34m 5.389s. Clearly, FIU is almost five times faster than IZH and more than five times faster than IandF. To analyze the time spent in the training phase, in the next section we report the training times for the different optimization methods.
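A sketch of this timing experiment (wall-clock accumulation over many forward passes; model.act and the observation are placeholders for any dynamics model and input):

import time

def benchmark(model, observation, repetitions=1_000_000):
    start = time.perf_counter()
    for _ in range(repetitions):
        model.act(observation)   # feed sensory neurons, step the dynamics, read motor neurons
    return time.perf_counter() - start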
Table 12. Best models. In gray, the best values for each architecture-problem box and, in light gray, the second-best values for each model. For the MCP, a t-test analysis was performed between the gray and the light gray entries of the same problem and also between the best of MCP with TWC and MCP with SA. The results are: TWC/FIURS vs TWC/IandFGA = 2.9e-226, TWC/FIURS vs TWC/IZHBO = 0.0, SA/FIUBO vs SA/IandFGA = 0.0, SA/FIUBO vs SA/IZHRS = 5.9e-304, TWC/FIURS vs SA/FIUBO = 7.15e-12.
Finding the best models: To identify the best models, we select one model for each dynamics-optimizer combination and compare them in Table 12. We established a criterion to decide when a model m1 is better than another model m2, based on the mean-number-of-steps metric for both problems. Suppose si ± σi is the mean-steps metric of mi. For the MCP, if we take si + σi as the upper bound on the mean number of steps the model may need to solve the problem, we want to select the model with the smallest such value; so if s1 + σ1 < s2 + σ2, we consider m1 better than m2. With similar reasoning for the CPP, we want to select the model that guarantees the largest number of steps during which, at a minimum, the pole is kept within the limits; so if s1 − σ1 < s2 − σ2, we take m2 to be better than m1.
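The criterion can be written compactly as follows (each model is summarized by its mean steps s and standard deviation sigma):

def better(m1, m2, problem):
    s1, sg1 = m1
    s2, sg2 = m2
    if problem == "MCP":
        return s1 + sg1 < s2 + sg2   # fewer steps, even in the pessimistic case
    else:  # CPP
        return s1 - sg1 > s2 - sg2   # more steps guaranteed, even in the pessimistic case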
Grouping results: For the MCP, FIU clearly obtains the best results in both architectures. To show the statistical significance of the improvements between the different models, we performed t-tests comparing the results over the 1000 runs. The results are shown in the caption of Table 12; the very small p-values indicate that the compared solutions are statistically distinguishable. For the CPP, we have the exceptional situation that many models reach very good results (i.e., 500). The results corresponding to the ECPP show that the IandF models reach better performance than the others, followed by FIU, which also achieves outstanding results.
In this section, we analyze the behavior of the optimization methods when exploring the hyperparameter space. To do so, we perform a global analysis, computing a single magnitude over all the models considered good solutions, in order to characterize each method as a whole and to infer some kind of preference, if it exists, between the different dynamics models or even between the different optimizers. The first step was to select the models above a certain quality threshold, thereby defining a sample, and then to calculate the means of the training time and of the mean number of steps over each sample. For the MCP, a model is included in the sample if the obtained reward (training and test) is greater than 91.5. For the CPP, a model is added to the sample if the obtained reward (training and test) is greater than 190. The following tables show the training-time and mean-of-steps metrics for all combinations of dynamics model, optimizer, and architecture. For the training-time metric, we report the standard deviation of the sample as the error. For the mean-of-steps metric, we report the mean of the individual standard deviations as the error.
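A sketch of this grouping procedure; `results` is a hypothetical list of per-model records, not the actual data structure of the released code.

def group(results, threshold):
    # Keep only models whose training and test rewards exceed the threshold
    # (91.5 for MCP, 190 for CPP), then average per dynamics/optimizer/architecture.
    sample = [r for r in results
              if r["train_reward"] > threshold and r["test_reward"] > threshold]
    groups = {}
    for r in sample:
        key = (r["model"], r["optimizer"], r["architecture"])
        groups.setdefault(key, []).append(r)
    return {k: {"train_time": sum(r["train_time"] for r in v) / len(v),
                "mean_steps": sum(r["mean_steps"] for r in v) / len(v)}
            for k, v in groups.items()}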
Now we analyze the time the optimizers took for the hyperparameter search of our models. In Table 13, in the rows labeled "Training Time", we can see at the top the amount of time consumed by each of the optimizers for the different neural models and architectures. The times consumed by GA and BO are much lower than those consumed by RS. Note also that, according to the rows labeled "Mean Steps" in Table 13, on average the best solutions were obtained by the GA, both in mean and in variance. Regarding the CPP, a similar analysis applies; however, Table 14 shows, in the rows labeled "Training Time", that the time consumed by GA is higher than that consumed by BO and RS. This is because the best solutions for the CPP are the ones that take more steps holding the pole; indeed, Table 9 shows that the GA obtains many more models with an average of 500 steps, which may explain the longer training time, since many of the genomes evaluated correspond to better models that take longer to simulate on average.
References
1. Lechner, M., Hasani, R., Grosu, R.: Neuronal Circuit Policies. https://arxiv.org/abs/1803.08554 (2018)
2. Izhikevich, E. M.: Simple Model of Spiking Neurons. IEEE Transactions on Neural Networks, Vol. 14, No. 6. https://arxiv.org/pdf/2106.06158.pdf (2003)
3. Haimovici, A., Tagliazucchi, E., Balenzuela, P., Chialvo, D.: Brain Organization into Resting State Networks Emerges at Criticality on a Model of the Human Connectome. https://arxiv.org/abs/1209.5353 (2013)
Code available at https://github.com/afiuriG/BiMPuRLE_CMP/tree/MCP