Logic Tensor Networks
Citation: Badreddine, S., d'Avila Garcez, A. S., Serafini, L. & Spranger, M. (2022). Logic
Tensor Networks. Artificial Intelligence, 303, 103649. doi: 10.1016/j.artint.2021.103649
Abstract
Attempts at combining logic and neural networks into neurosymbolic approaches have been on
the increase in recent years. In a neurosymbolic system, symbolic knowledge assists deep learning,
which typically uses a sub-symbolic distributed representation, to learn and reason at a higher
level of abstraction. We present Logic Tensor Networks (LTN), a neurosymbolic framework that
supports querying, learning and reasoning with both rich data and abstract knowledge about
the world. LTN introduces a fully differentiable logical language, called Real Logic, whereby the
elements of a first-order logic signature are grounded onto data using neural computational graphs
and first-order fuzzy logic semantics. We show that LTN provides a uniform language to represent
and compute efficiently many of the most important AI tasks such as multi-label classification,
relational learning, data clustering, semi-supervised learning, regression, embedding learning
and query answering. We implement and illustrate each of the above tasks with several simple
explanatory examples using TensorFlow 2. The results indicate that LTN can be a general and
powerful framework for neurosymbolic AI.
1. Introduction
Artificial Intelligence (AI) agents are required to learn from their surroundings and reason
about what has been learned to make decisions, act in the world, or react to various stimuli. The
latest Machine Learning (ML) has adopted mostly a pure sub-symbolic learning approach. Using
distributed representations of entities, the latest ML performs quick decision-making without
building a comprehensible model of the world. While achieving impressive results in computer
vision, natural language, game playing, and multimodal learning, such approaches are known
to be data inefficient and to struggle at out-of-distribution generalization. Although the use of
appropriate inductive biases can alleviate such shortcomings, in general, sub-symbolic models lack
comprehensibility. By contrast, symbolic AI is based on rich, high-level representations of the world
that use human-readable symbols. By rich knowledge, we refer to logical representations which are
* Corresponding author
Email addresses: badreddine.samy@gmail.com (Samy Badreddine), a.garcez@city.ac.uk (Artur d’Avila Garcez),
serafini@fbk.eu (Luciano Serafini), michael.spranger@sony.com (Michael Spranger)
universal and existential quantifiers now allow the user to limit the quantification to the elements that satisfy some Boolean condition, e.g. ∀x : age(x) < 10 (playsPiano(x) → enfantProdige(x)) restricts the quantification to the cases where age is lower than 10; (3) Diagonal quantification: Diagonal quantification allows the user to write statements about specific tuples extracted in order from n variables. For example, if the variables capital and country both have k instances such that the i-th instance of capital corresponds to the i-th instance of country, one can write ∀Diag(capital, country) capitalOf(capital, country).
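To make the contrast concrete, the sketch below evaluates a toy capitalOf score over the full grid of capital/country combinations and over the Diag tuples only. The numeric groundings and the score function are invented for illustration and stand in for a learned predicate.

```python
import numpy as np

# Hypothetical 1-d groundings for k = 3 paired instances of the variables
# `capital` and `country` (real groundings would be feature tensors).
G_capital = np.array([0.1, 0.5, 0.9])
G_country = np.array([0.2, 0.4, 0.8])

def capital_of(c, n):
    # toy fuzzy predicate in [0, 1], standing in for a learned capitalOf
    return 1.0 / (1.0 + (c - n) ** 2)

# Without Diag: one truth-value per combination (a k x k grid).
grid = capital_of(G_capital[:, None], G_country[None, :])   # shape (3, 3)

# With Diag: only the i-th capital is paired with the i-th country.
diag = capital_of(G_capital, G_country)                     # shape (3,)

assert grid.shape == (3, 3) and diag.shape == (3,)
assert np.allclose(diag, np.diag(grid))   # Diag is the diagonal of the grid
```

As the final assertion shows, Diag picks out exactly the diagonal of the full evaluation grid, without ever materializing the off-diagonal entries.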
Inspired by the work of [69], this paper also extends the product t-norm configuration of LTN
with the generalized mean aggregator, and it introduces solutions to the vanishing or exploding
gradient problems. Finally, the paper formally defines a semantic approach to refutation-based
reasoning in Real Logic to verify if a statement is a logical consequence of a knowledge base.
Example 4.8 proves that this new approach can better capture logical consequences compared to
simply querying unknown formulas after learning (as done in [6]).
The new version of LTN has been implemented in TensorFlow 2 [1]. Both the LTN li-
brary and the code for the examples used in this paper are available at https://github.com/
logictensornetworks/logictensornetworks.
The remainder of the paper is organized as follows: In Section 2, we define and illustrate Real
Logic as a fully-differentiable first-order logic. In Section 3, we specify learning and reasoning in
Real Logic and its modeling into deep networks with Logic Tensor Networks (LTN). In Section
4, we illustrate the reach of LTN by investigating a range of learning problems from clustering
to embedding learning. In Section 5, we place LTN in the context of the latest related work in
neurosymbolic AI. In Section 6 we conclude and discuss directions for future work. The Appendix
contains information about the implementation of LTN in TensorFlow 2, experimental set-ups,
the different options for the differentiable logic operators, and a study of their relationship with
gradient computations.
2. Real Logic
2.1. Syntax
Real Logic forms the basis of Logic Tensor Networks. Real Logic is defined on a first-order
language L with a signature that contains a set C of constant symbols (objects), a set F of
functional symbols, a set P of relational symbols (predicates), and a set X of variable sym-
bols. L-formulas allow us to specify relational knowledge with variables, e.g. the atomic formula is_friend(v₁, v₂) may state that the person v₁ is a friend of the person v₂, the formula ∀x∀y(is_friend(x, y) → is_friend(y, x)) states that the relation is_friend is symmetric, and the formula ∀x(∃y(Italian(y) ∧ is_friend(x, y))) states that every person has a friend that is Italian. Since we are interested in learning and reasoning in real-world scenarios where degrees of truth are often fuzzy and exceptions are present, formulas can be partially true; we therefore adopt fuzzy semantics.
Objects can be of different types. Similarly, functions and predicates are typed. Therefore, we assume there exists a non-empty set of symbols D called domain symbols. To assign types to the elements of L, we introduce the functions D, D_in and D_out such that:
• D : X ∪ C → D. Intuitively, D(x) and D(c) return the domain of a variable x or a constant c.
• D_in : F ∪ P → D*, where D* is the Kleene star of D, that is, the set of all finite sequences of symbols in D. Intuitively, D_in(f) and D_in(p) return the domains of the arguments of a function f or a predicate p. If f takes two arguments (for example, f(x, y)), D_in(f) returns two domains, one per argument.
• D_out : F → D. Intuitively, D_out(f) returns the range of a function symbol.
Real Logic may also contain propositional variables, as follows: if P is a 0-ary predicate with D_in(P) = ⟨⟩ (the empty sequence of domains), then P is a propositional variable (an atom with truth-value in the interval [0, 1]).
A term is constructed recursively in the usual way from constant symbols, variables, and
function symbols. An expression formed by applying a predicate symbol to an appropriate number
of terms with appropriate domains is called an atomic formula, which evaluates to true or false in classical logic and to a number in [0, 1] in the case of Real Logic. We define the set of terms of the
language as follows:
Example 1. Let Town denote the domain of towns in the world and People denote the domain of living people. Suppose that L contains the constant symbols Alice, Bob and Charlie of domain People, and Rome and Seoul of domain Town. Let x be a variable of domain People and u be a variable of domain Town. The term x, u (i.e. the sequence x followed by u) has domain People, Town, which denotes the Cartesian product of People and Town (People × Town). Alice, Rome is interpreted as an element of the domain People, Town. Let lives_in be a predicate with input domain D_in(lives_in) = People, Town. lives_in(Alice, Rome) is a well-formed expression, whereas lives_in(Bob, Charlie) is not.
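The typing discipline of Example 1 can be checked mechanically. The following sketch (our own illustration, not part of the formal definitions) encodes D and D_in as dictionaries and tests the well-formedness of the two expressions above.

```python
# Minimal well-formedness check for Example 1. The symbols are those of the
# example; the checking code itself is an illustrative sketch.
D = {"Alice": "People", "Bob": "People", "Charlie": "People",
     "Rome": "Town", "Seoul": "Town",
     "x": "People", "u": "Town"}          # D : X u C -> domain symbols
D_in = {"lives_in": ("People", "Town")}   # D_in : P -> sequences of domains

def well_formed(pred, args):
    # an atomic formula is well formed if arity and argument domains match
    expected = D_in[pred]
    return (len(args) == len(expected)
            and all(D[a] == d for a, d in zip(args, expected)))

assert well_formed("lives_in", ["Alice", "Rome"])       # well-formed
assert not well_formed("lives_in", ["Bob", "Charlie"])  # Charlie is not a Town
```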
1 In the rest of the paper, we commonly use "tensor" to designate "tensor in the real field".
interpreted as real functions or tensor operations. Predicates are interpreted as functions or tensor operations projecting onto a value in the interval [0, 1].
To emphasize the fact that in Real Logic symbols are grounded onto real-valued features, we use the term grounding, denoted by G, in place of interpretation². Notice that this is different from the common use of the term grounding in logic, which indicates the operation of replacing the variables of a term or formula with constants or with terms containing no variables. To avoid confusion, we use the synonym instantiation for this purpose. G associates a tensor of real numbers to any term of L, and a real number in the interval [0, 1] to any formula φ of L. Intuitively, G(t) contains the numeric features of the objects denoted by t, and G(φ) represents the system's degree of confidence in the truth of φ; the higher the value, the higher the confidence.
Definition 1. A grounding G associates to each domain D ∈ D a set G(D) ⊆ ⋃_{n₁...n_d ∈ ℕ*} ℝ^{n₁ × ⋯ × n_d}. For every D₁ ... Dₙ ∈ D*, G(D₁ ... Dₙ) = ∏_{i=1}^{n} G(Dᵢ), that is, G(D₁) × G(D₂) × ... × G(Dₙ).
Notice that the elements in G(D) may be tensors of any rank d and any dimensions n₁ × ⋯ × n_d, as ℕ* denotes the Kleene star of ℕ.³
Example 2. Let digit_images denote a domain of images of handwritten digits. If we use images of 256 × 256 RGB pixels, then G(digit_images) ⊆ ℝ^{256×256×3}. Let us consider the predicate is_digit applied to an image term and the constant 8; the two terms have domains digit_images and digits, respectively. Any input to the predicate is a tuple in G(digit_images, digits) = G(digit_images) × G(digits).
A grounding assigns to each constant symbol c a tensor G(c) in the domain G(D(c)). It assigns to a variable x a finite sequence of tensors d₁ ... d_k, each in G(D(x)). These tensors represent the instances of x. Differently from FOL, where a variable is assigned a single value of the domain of interpretation at a time, in Real Logic a variable is assigned a sequence of values in its domain, the k examples of x. A grounding assigns to a function symbol f a function taking tensors from G(D_in(f)) as input and producing a tensor in G(D_out(f)) as output. Finally, a grounding assigns to a predicate symbol p a function taking tensors from G(D_in(p)) as input and producing a truth-value in the interval [0, 1] as output.
2 An interpretation is an assignment of truth-values, true or false or, in the case of Real Logic, a value in [0, 1], to a formula.
2. G(f) ∈ G(D_in(f)) → G(D_out(f)) for every function symbol f ∈ F;
3. G(p) ∈ G(D_in(p)) → [0, 1] for every predicate symbol p ∈ P.
• G(p(t))_{i₁...iₙ} = G(p)(G(t)_{i₁...iₙ}), i.e. the element-wise application of G(p) to G(t).
If a term tᵢ contains nᵢ variables x_{j₁}, ..., x_{j_{nᵢ}} selected from x₁, ..., xₙ, then G(tᵢ)_{k_{j₁} ... k_{j_{nᵢ}}} can be obtained from G(t)_{i₁...iₙ} with an appropriate mapping of the indices i to k.
4 We assume the usual syntactic definition of free and bound variables in FOL. A variable is free if it is not bound by a quantifier.
Figure 1: Illustration of Example 3: separate colors mark the dimensions associated with the free variables x and y. A tensor representing a term that includes a free variable x will have an axis for x; one can index it to obtain results calculated using each of the v₁, v₂ or v₃ values of x. In our graphical convention, the depth of the boxes indicates that the tensor can have feature dimensions (refer to the end of Example 3).
Example 3. Suppose that L contains the variables x and y, the function f, the predicate p, and the set of domains D = {V, W}. Let D(x) = V, D(y) = W, D_in(f) = V, W, D_out(f) = W and D_in(p) = V, W. In what follows, an example of the grounding of L and D is shown on the left, and the grounding of some examples of possible terms and atomic formulas is shown on the right.

G(V) = ℝ⁺, G(W) = ℝ⁻
G(x) = ⟨v₁, v₂, v₃⟩, G(y) = ⟨w₁, w₂⟩
G(p) : x, y ↦ σ(x + y)
G(f) : x, y ↦ x · y

G(f(x, y)) =
⎡ v₁·w₁  v₁·w₂ ⎤
⎢ v₂·w₁  v₂·w₂ ⎥
⎣ v₃·w₁  v₃·w₂ ⎦

G(p(x, f(x, y))) =
⎡ σ(v₁ + v₁·w₁)  σ(v₁ + v₁·w₂) ⎤
⎢ σ(v₂ + v₂·w₁)  σ(v₂ + v₂·w₂) ⎥
⎣ σ(v₃ + v₃·w₁)  σ(v₃ + v₃·w₂) ⎦
Notice the dimensions of the results. G(f(x, y)) and G(p(x, f(x, y))) return |G(x)| × |G(y)| = 3 × 2 values, one for each combination of individuals that occur in the variables. For functions, we can have additional dimensions associated with the output domain. Suppose a different grounding such that G(D_out(f)) = ℝ^m. Then the dimensions of G(f(x, y)) would have been |G(x)| × |G(y)| × m, where |G(x)| × |G(y)| are the dimensions for indexing the free variables and m the dimensions associated with the output domain of f. We call the latter feature dimensions, as captioned in Figure 1. Notice that G(p(x, f(x, y))) will always return a tensor with the exact dimensions |G(x)| × |G(y)| × 1 because, under any grounding, a predicate always returns a value in [0, 1]. Therefore, as the feature dimension of predicates is always 1, we choose to squeeze it and not to represent it in our graphical convention (see Figure 1; the box output by the predicate has no depth).
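The broadcasting behaviour of Example 3 can be reproduced directly. The concrete values of the vᵢ and wⱼ below are invented (any positive and negative reals, respectively, would do).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Groundings of Example 3, with invented values (v_i in R+, w_j in R-).
G_x = np.array([1.0, 2.0, 3.0])      # v1, v2, v3
G_y = np.array([-0.5, -1.0])         # w1, w2

G_f = lambda x, y: x * y             # G(f) : x, y -> x . y
G_p = lambda x, y: sigmoid(x + y)    # G(p) : x, y -> sigma(x + y)

# Broadcasting indexes the free variables: axis 0 for x, axis 1 for y.
fxy = G_f(G_x[:, None], G_y[None, :])   # G(f(x, y)), shape (3, 2): v_i * w_j
pxf = G_p(G_x[:, None], fxy)            # G(p(x, f(x, y))): sigma(v_i + v_i*w_j)

assert fxy.shape == (3, 2) and pxf.shape == (3, 2)
assert np.allclose(pxf, sigmoid(G_x[:, None] * (1 + G_y[None, :])))
```

Each free variable contributes one axis; the entry at index (i, j) is the evaluation on the pair (vᵢ, wⱼ), exactly as in the matrices above.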
Figure 2: Illustration of an element-wise operator implementing conjunction (p(x) ∧ q(y)). We assume that x and y are two different variables. The result contains one number in the interval [0, 1] for every combination of individuals from G(x) and G(y).
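As a sketch of Figure 2, assuming the product t-norm T_P(a, b) = a · b for conjunction (the truth-values below are invented):

```python
import numpy as np

# Truth-values for 3 instances of x and 2 instances of y (invented numbers).
p_x = np.array([0.9, 0.2, 0.6])
q_y = np.array([0.8, 0.1])

# Element-wise conjunction under the product t-norm T_P(a, b) = a * b;
# broadcasting yields one truth-value per combination of individuals.
conj = p_x[:, None] * q_y[None, :]     # shape (3, 2)

assert conj.shape == (3, 2)
assert np.isclose(conj[0, 0], 0.9 * 0.8)   # T_P(p(x1), q(y1))
```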
where Agg(Q) is the aggregation operator associated with the quantifier Q. Intuitively, we obtain G(Q x₁, ..., x_h (φ)) by reducing the dimensions associated with x₁, ..., x_h using the operator Agg(Q) (see Figure 3).
Notice that the above grounded semantics can assign different meanings to the three formulas:

∀x, y (φ(x, y))    ∀x(∀y(φ(x, y)))    ∀y(∀x(φ(x, y)))
Figure 3: Illustration of an aggregation operation implementing quantification (∀y∃x) over the variables x and y. We assume that x and y have different domains. The result is a single number in the interval [0, 1].
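A minimal sketch of the aggregation in Figure 3, using max for Agg(∃) and the mean for Agg(∀) purely for illustration (the paper's configurations also use other fuzzy aggregators):

```python
import numpy as np

# Truth-values of phi(x, y) for 3 instances of x (rows), 2 of y (columns).
phi = np.array([[0.9, 0.1],
                [0.4, 0.7],
                [0.2, 0.3]])

# Agg(exists) = max and Agg(forall) = mean, chosen purely for illustration.
exists_x = phi.max(axis=0)    # reduce the x axis -> one value per instance of y
forall_y = exists_x.mean()    # reduce the y axis -> a single number in [0, 1]

assert exists_x.shape == (2,)
assert np.isclose(forall_y, (0.9 + 0.7) / 2)   # G(forall y exists x phi) = 0.8
```

Each quantifier removes the axis of its variable, so a closed formula ends up as a single scalar truth-value.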
The semantics of the three formulas will coincide if the aggregation operator is bi-symmetric.
LTN also allows the following form of quantification, here called diagonal quantification (Diag): Diag(x₁, ..., x_h) quantifies over specific tuples such that the i-th tuple contains the i-th instance of each of the variables in the argument of Diag, under the assumption that all variables in the argument are grounded onto sequences with the same number of instances. Diag(x₁, ..., x_h) is called diagonal quantification because it quantifies over the diagonal of G(φ) along the axes associated with x₁, ..., x_h, although in practice only the diagonal is built and not the entire G(φ), as shown in Figure 4. For example, given a data set with samples x and target labels y, if looking to write a statement p(x, y) that holds true for each pair of sample and label, one can write ∀Diag(x, y) p(x, y), given that |G(x)| = |G(y)|. As another example, given two variables x and y whose groundings each contain 10 instances, the expression ∀Diag(x, y) p(x, y) produces 10 results such that the i-th result corresponds to the i-th instances of each grounding. Without Diag, the expression would be evaluated for all 10 × 10 combinations of the elements in G(x) and G(y).⁵ Diag will find much application in the examples and experiments to follow.
The grounding of such a formula is obtained by aggregating the values of parent(x, y) only for the instances of x that satisfy the condition age(x) > age(y), that is:
5 Notice how Diag is not simply "syntactic sugar" for creating a new variable pairs_xy by stacking pairs of examples from G(x) and G(y). If the groundings of x and y have incompatible ranks (for instance, if x denotes images and y denotes their labels), stacking them in a tensor G(pairs_xy) is non-trivial, requiring several reshaping operations.
Figure 4: Diagonal quantification: Diag(x₁, x₂) quantifies over specific tuples only, such that the i-th tuple contains the i-th instances of the variables x₁ and x₂ in the groundings G(x₁) and G(x₂), respectively. Diag(x₁, x₂) assumes, therefore, that x₁ and x₂ have the same number of instances, as in the case of samples x₁ and their labels x₂ in a typical supervised learning task.
The evaluation of which tuple is safe is purely symbolic and non-differentiable. Guarded quantifiers operate over only a subset of the variables, when this symbolic knowledge is crisp and available. More generally, in what follows, m is a symbol representing the condition, which we shall call a mask, and G(m) associates to m a function⁶ returning a Boolean.

G(Q x₁, ..., x_h : m(x₁, ..., xₙ) (φ))_{i_{h+1}, ..., iₙ}  :=  Agg(Q)  G(φ)_{i₁, ..., i_h, i_{h+1}, ..., iₙ}    (6)

where the aggregation ranges over i₁ = 1, ..., |G(x₁)|; ...; i_h = 1, ..., |G(x_h)| such that G(m)(G(x₁)_{i₁}, ..., G(xₙ)_{iₙ}) holds.
Notice that the semantics of a guarded sentence ∀x : m(x) (φ(x)) is different from the semantics of ∀x(m(x) → φ(x)). In crisp, traditional FOL, the two statements would be equivalent. In Real Logic, they can give different results. Let G(x) be a sequence of 3 values, G(m(x)) = (0, 1, 1) and G(φ(x)) = (0.2, 0.7, 0.8). Only the second and third instances of x are safe, that is, are in the masked subset. Let → be defined using the Reichenbach operator I_R(a, b) = 1 − a + ab and ∀ be defined using the mean operator. We have G(∀x(m(x) → φ(x))) = (1 + 0.7 + 0.8)/3 = 0.833..., whereas G(∀x : m(x) (φ(x))) = (0.7 + 0.8)/2 = 0.75. Also, in the computational graph of the guarded sentence, there are no gradients attached to the instances that do not verify the mask. Similarly, the semantics of ∃x : m(x) (φ(x)) is not equivalent to that of ∃x(m(x) ∧ φ(x)).
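The worked numbers above can be replayed directly:

```python
import numpy as np

# Numbers from the text: G(m(x)) = (0, 1, 1), G(phi(x)) = (0.2, 0.7, 0.8).
m = np.array([0.0, 1.0, 1.0])      # crisp Boolean mask
phi = np.array([0.2, 0.7, 0.8])

# Reichenbach implication I_R(a, b) = 1 - a + a*b; forall as the mean.
unguarded = (1.0 - m + m * phi).mean()   # G(forall x (m(x) -> phi(x)))

# Guarded quantifier: aggregate phi only over the safe (masked) instances.
guarded = phi[m > 0].mean()              # G(forall x : m(x) (phi(x)))

assert np.isclose(unguarded, 2.5 / 3)    # 0.833...
assert np.isclose(guarded, 0.75)
```

Note how the first (unsafe) instance contributes the value 1 to the implication average but is simply dropped by the guard.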
6 In some edge cases, a masking may produce an empty sequence: for example, if for some value of G(y) there is no value in G(x) that satisfies age(x) > age(y). In such cases, we resort to the concept of an empty semantics: ∀ returns 1 and ∃ returns 0.
Figure 5: Example of guarded quantification: one can filter out elements of the various domains that do not satisfy some condition before the aggregation operators for ∀ and ∃ are applied.
A_pME measures the power of the deviation of each value from the ground truth 1. With p = 2, it is equivalent to 1 − RMSE(a, 1), where RMSE is the root-mean-square error, a is the vector of truth-values and 1 is a vector of ones.
7 We define a symmetric configuration as a set of fuzzy operators such that conjunction and disjunction are defined by a t-norm and its dual t-conorm, respectively, and the implication operator is derived from such conjunction or disjunction operators and standard negation (cf. Appendix B for details). In [69], van Krieken et al. also analyze non-symmetric configurations, and even operators that do not strictly verify fuzzy logic semantics.
The intuition behind the choice of p is that the higher p is, the more weight A_pM (resp. A_pME) gives to true (resp. false) truth-values, converging to the max (resp. min) operator. The value of p can therefore be seen as a hyper-parameter, as it offers flexibility to account for outliers in the data, depending on the application.
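Assuming the standard generalized-mean forms for A_pM and A_pME (consistent with the RMSE equivalence noted above), the effect of p can be checked numerically:

```python
import numpy as np

def p_mean(a, p):
    # A_pM(a_1, ..., a_n) = (1/n * sum a_i^p)^(1/p), a smooth existential
    return np.mean(a ** p) ** (1.0 / p)

def p_mean_error(a, p):
    # A_pME(a_1, ..., a_n) = 1 - (1/n * sum (1 - a_i)^p)^(1/p), a smooth universal
    return 1.0 - np.mean((1.0 - a) ** p) ** (1.0 / p)

a = np.array([0.1, 0.5, 0.9])

# With p = 2, A_pME is exactly 1 - RMSE(a, 1):
assert np.isclose(p_mean_error(a, 2), 1.0 - np.sqrt(np.mean((1.0 - a) ** 2)))

# Larger p weighs the extreme values more: A_pM -> max, A_pME -> min.
assert p_mean(a, 1) < p_mean(a, 20) < a.max()
assert a.min() < p_mean_error(a, 20) < p_mean_error(a, 1)
```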
Nevertheless, Product Real Logic still has the following gradient problems: T_P(a, b) has vanishing gradients on the edge case a = b = 0; S_P(a, b) has vanishing gradients on the edge case a = b = 1; I_R(a, b) has vanishing gradients on the edge case a = 0, b = 1; A_pM(a₁, ..., aₙ) has exploding gradients when Σᵢ (aᵢ)^p tends to 0; and A_pME(a₁, ..., aₙ) has exploding gradients when Σᵢ (1 − aᵢ)^p tends to 0 (see Appendix C for details).
To address these problems, we define the projections π₀ and π₁ below, where ε is an arbitrarily small positive real number:
We then derive the following stable operators to produce what we call the Stable Product Real Logic configuration:
It is worth noting that the conjunction operator in the stable product semantics is not a t-norm⁸: T'_P(a, b) does not satisfy identity in [0, 1), since for any 0 ≤ a < 1, T'_P(a, 1) = (1 − ε)a + ε ≠ a, although ε can be chosen arbitrarily small. In the experimental evaluations reported in Section 4, we find that the adoption of the stable product semantics is an important practical step to improve the numerical stability of the learning system.
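As a sketch of the stable construction, under the reading that π₀(a) = (1 − ε)a + ε, π₁(a) = (1 − ε)a, and that the stable conjunction applies π₀ to the arguments of the product t-norm (a reading consistent with T'_P(a, 1) = (1 − ε)a + ε above); ε = 10⁻⁴ is an arbitrary choice:

```python
import numpy as np

EPS = 1e-4  # epsilon: an arbitrarily small positive real number

def pi_0(a):
    # projects truth-values away from 0: pi_0(a) = (1 - eps) * a + eps
    return (1 - EPS) * a + EPS

def pi_1(a):
    # projects truth-values away from 1: pi_1(a) = (1 - eps) * a
    return (1 - EPS) * a

def and_stable(a, b):
    # stable product conjunction (our reading): T'_P(a, b) = pi_0(a) * pi_0(b)
    return pi_0(a) * pi_0(b)

# Identity fails on [0, 1): T'_P(a, 1) = (1 - eps) * a + eps != a ...
a = 0.5
assert np.isclose(and_stable(a, 1.0), (1 - EPS) * a + EPS)
assert not np.isclose(and_stable(a, 1.0), a, atol=1e-9)
# ... but the edge case a = b = 0 no longer sits at an exact zero of the product:
assert and_stable(0.0, 0.0) > 0.0
```

Shifting the inputs by ε keeps the product (and hence its gradients) away from the exact edge cases listed above, at the cost of the t-norm identity property.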
In Real Logic, one can define the tasks of learning, reasoning and query answering. Given a
Real Logic theory that represents the knowledge of an agent at a given time, learning is the task
of making generalizations from specific observations obtained from data. This is often called
inductive inference. Reasoning is the task of deriving what knowledge follows from the facts which
are currently known. Query answering is the task of evaluating the truth value of a certain logical
expression (called a query), or finding the set of objects in the data that evaluate a certain expression
to true. In what follows, we define and exemplify each of these tasks. To do so, we first need to
specify which types of knowledge can be represented in Real Logic.
8 Recall that a t-norm is a function T : [0, 1] × [0, 1] → [0, 1] satisfying commutativity, monotonicity, associativity and identity (T(a, 1) = a).
3.1. Representing Knowledge with Real Logic
In logic-based knowledge representation systems, knowledge is represented by logical formulas
whose intended meanings are propositions about a domain of interest. The connection between
the symbols occurring in the formulas and what holds in the domain is not represented in the
knowledge base and is left implicit since it does not have any effect on the logic computations.
In Real Logic, by contrast, the connection between the symbols and the domain is represented
explicitly in the language by the grounding G, which plays an important role in both learning
and reasoning. G is an integral part of the knowledge represented by Real Logic. A Real Logic
knowledge base is therefore defined by the formulas of the logical language and knowledge about
the domain in the form of groundings obtained from data. The following types of knowledge can
be represented in Real Logic.
9 Notice that softmax is often used as the last layer in neural networks to turn logits into a probability distribution.
However, we do not use the softmax function as such here. Instead, we use it here to enforce an exclusivity constraint on
satisfiability scores.
their groundings G(wᵢ | θ_emb) are defined parametrically w.r.t. θ_emb as emb(wᵢ | θ_emb). An example of parametric grounding for a function symbol f is to assume that G(f) is a linear function such that G(f) : ℝ^m → ℝ^n maps each v ∈ ℝ^m to A_f v + b_f, with A_f a matrix of real numbers and b_f a vector of real numbers. In this case, G(f) = G(f | θ_f), where θ_f = {A_f, b_f}. Finally, the grounding of a predicate symbol can be given, for example, by a neural network N with parameters θ_N. As an example, consider a neural network N trained for image classification into n classes: cat, dog, horse, etc. N takes as input a vector v of pixel values and produces as output a vector y = (y_cat, y_dog, y_horse, ...) in [0, 1]^n such that y = N(v | θ_N), where y_c is the probability that input image v is of class c. If classes are, alternatively, chosen to be represented by unary predicate symbols such as cat(v), dog(v), horse(v), ..., then G(cat(v)) = N(v | θ_N)_cat, G(dog(v)) = N(v | θ_N)_dog, G(horse(v)) = N(v | θ_N)_horse, etc.
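A compact sketch of the parametric groundings just described; the dimensions, the random initialization and the one-layer softmax network are placeholders for whatever architecture is chosen in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Function symbol f: G(f | theta_f) is linear, v -> A_f v + b_f
# (illustrative dimensions m = 4, n = 3).
A_f = rng.normal(size=(3, 4))
b_f = rng.normal(size=3)
G_f = lambda v: A_f @ v + b_f

# Predicate symbols cat, dog, horse: a tiny one-layer network N whose
# softmax outputs give G(cat(v)), G(dog(v)), G(horse(v)).
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
def N(v):
    logits = W @ v + b
    e = np.exp(logits - logits.max())
    return e / e.sum()            # vector y in [0, 1]^3, summing to 1

v = rng.normal(size=4)            # a stand-in for an input feature vector
y = N(v)
G_cat, G_dog, G_horse = y         # G(cat(v)) = N(v | theta_N)_cat, etc.

assert G_f(v).shape == (3,)
assert np.isclose(y.sum(), 1.0) and np.all((0 <= y) & (y <= 1))
```

Learning then amounts to adjusting θ_f = {A_f, b_f} and θ_N = {W, b} rather than the symbols themselves.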
10 This can also be specified using the guarded quantifier ∀x : ((l₁(x) ∧ ⋯ ∧ l_k(x)) > th) l_{k+1}(x), where th is a threshold value in [0, 1].
Smokers and Friends example, people who are smokers are associated by the friendship relation. In Real Logic, the formula ∀x, y ((smokes(x) ∧ friend(x, y)) → smokes(y)) would be used to encode the soft constraint that friends of smokers are normally smokers.
3.1.4. Satisfiability
In summary, a Real Logic knowledge-base has three components: the first describes knowledge
about the grounding of symbols (domains, constants, variables, functions, and predicate symbols);
the second is a set of closed logical formulas describing factual propositions and general knowledge;
the third lies in the operators and the hyperparameters used to evaluate each formula. The
definition that follows formalizes this notion.
Definition 4 (Theory/Knowledge-base). A theory of Real Logic is a triple T = ⟨K, G(· | θ), Θ⟩, where K is a set of closed first-order logic formulas defined on the set of symbols S = D ∪ X ∪ C ∪ F ∪ P denoting, respectively, domains, variables, constants, function and predicate symbols; G(· | θ) is a parametric grounding for all the symbols s ∈ S and all the logical operators; and Θ = {Θ_s}_{s∈S} is the hypothesis space for each set of parameters θ_s associated with symbol s.
Learning and reasoning in a Real Logic theory are both associated with searching for and applying the set of values of parameters θ from the hypothesis space Θ that maximize the satisfaction of the formulas in K. We use the term grounded theory, denoted by ⟨K, G_θ⟩, to refer to a Real Logic theory with a specific set of learned parameter values. This idea shares some similarity with the weighted MAX-SAT problem [43], where the weights for the formulas in K are given by their fuzzy truth-values obtained by choosing the parameter values of the grounding. To define this optimization problem, we aggregate the truth-values of all the formulas in K by selecting a formula aggregating operator SatAgg : [0, 1]* → [0, 1].
Definition 5. The satisfiability of a theory T = ⟨K, G_θ⟩ with respect to the aggregating operator SatAgg is defined as SatAgg_{φ∈K} G_θ(φ).
3.2. Learning
Given a Real Logic theory T = ⟨K, G(· | θ), Θ⟩, learning is the process of searching for the set of parameter values θ* that maximize the satisfiability of T w.r.t. a given aggregator:

θ* = argmax_{θ∈Θ} SatAgg_{φ∈K} G_θ(φ)
Notice that with this general formulation, one can learn the grounding of constants, functions,
and predicates. The learning of the grounding of constants corresponds to the learning of em-
beddings. The learning of the grounding of functions corresponds to the learning of generative
models or a regression task. Finally, the learning of the grounding of predicates corresponds to a
classification task in Machine Learning.
In some cases, it is useful to impose some regularization (as done customarily in ML) on the set
of parameters θ, thus encoding a preference on the hypothesis space Θ, such as a preference for
smaller parameter values. In this case, learning is defined as follows:
θ* = argmax_{θ∈Θ} ( SatAgg_{φ∈K} G_θ(φ) − λR(θ) )
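As a self-contained illustration of this objective (not the LTN implementation, which relies on TensorFlow 2 automatic differentiation), the toy sketch below learns the parameter θ of a sigmoid-grounded predicate by gradient ascent on SatAgg, using numerical gradients; the knowledge-base and the constants are invented:

```python
import math

# Toy setting: a unary predicate grounded as G(p(x) | theta) = sigmoid(theta*x),
# knowledge-base K = { p(c1), not p(c2) } with G(c1) = 2.0, G(c2) = -1.0.
c1, c2 = 2.0, -1.0

def sat_agg(theta):
    # SatAgg as the mean of the truth-values of the formulas in K,
    # with standard negation N(a) = 1 - a.
    p = lambda x: 1.0 / (1.0 + math.exp(-theta * x))
    return (p(c1) + (1.0 - p(c2))) / 2.0

theta, lr, h = 0.0, 1.0, 1e-5
for _ in range(200):
    # central-difference gradient of the satisfiability w.r.t. theta
    grad = (sat_agg(theta + h) - sat_agg(theta - h)) / (2 * h)
    theta += lr * grad            # ascend: maximize SatAgg

assert sat_agg(theta) > 0.9 > sat_agg(0.0)
```

At θ = 0 both formulas are half-true (SatAgg = 0.5); gradient ascent drives θ to a value under which p(c1) is nearly true and p(c2) nearly false.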
3.3. Querying
Given a grounded theory T = ⟨K, G_θ⟩, query answering allows one to check if a certain fact is true (or, more precisely, by how much it is true, since in Real Logic truth-values are real numbers in the interval [0, 1]). There are various types of queries that can be asked of a grounded theory.
A first type of query is called a truth query. Any formula in the language of T can be a truth query. The answer to a truth query φ_q is the truth-value of φ_q obtained by computing its grounding, i.e. G_θ(φ_q). Notice that, if φ_q is a closed formula, the answer is a scalar in [0, 1] denoting the truth-value of φ_q according to G_θ. If φ_q contains n free variables x₁, ..., xₙ, the answer to the query is a tensor of order n such that the component indexed by i₁ ... iₙ is the truth-value of φ_q evaluated in G_θ(x₁)_{i₁}, ..., G_θ(xₙ)_{iₙ}.
The second type of query is called a value query. Any term in the language of T can be a value query. The answer to a value query t_q is a tensor of real numbers obtained by computing the grounding of the term, i.e. G_θ(t_q). Analogously to truth queries, the answer to a value query is a "tensor of tensors" if t_q contains variables. Using value queries, one can inspect how a constant or, more generally, a term is embedded in the manifold.
The third type of query is called a generalization truth query. With generalization truth queries, we are interested in knowing the truth-values of formulas applied to a new (unseen) set of objects of a domain, such as a validation or a test set of examples typically used in the evaluation of machine learning systems. A generalization truth query is a pair (φ_q(x), U), where φ_q is a formula with a free variable x and U = (u⁽¹⁾, ..., u⁽ᵏ⁾) is a set of unseen examples whose dimensions are compatible with those of the domain of x. The answer to the query (φ_q(x), U) is G_θ(φ_q(x)) for x taking each value u⁽ⁱ⁾, 1 ≤ i ≤ k, in U. The result of this query is therefore a vector of |U| truth-values corresponding to the evaluation of φ_q on the new data u⁽¹⁾, ..., u⁽ᵏ⁾.
The fourth and final type of query is the generalization value query. These are analogous to generalization truth queries, with the difference that they evaluate a term t_q(x), and not a formula, on new data U. The result, therefore, is a vector of |U| values corresponding to the evaluation of the trained model on a regression task using test data U.
3.4. Reasoning
3.4.1. Logical consequence in Real Logic
From a pure logic perspective, reasoning is the task of verifying whether a formula is a logical consequence of a set of formulas. This can be achieved semantically using model theory (⊨) or syntactically via a proof theory (⊢). To characterize reasoning in Real Logic, we adapt the notion of logical consequence for fuzzy logic provided in [9]: a formula φ is a fuzzy logical consequence of a finite set of formulas Γ, in symbols Γ ⊨ φ, if for every fuzzy interpretation f, if all the formulas in Γ are true (i.e. evaluate to 1) in f, then φ is true in f. In other words, every model of Γ is a model of φ. A direct application of this definition to Real Logic is not practical, since in most practical cases the level of satisfiability of a grounded theory ⟨K, G_θ⟩ will not be equal to 1. We therefore define an interval [q, 1] with 1/2 < q < 1 and assume that a formula is true if its truth-value is in the interval [q, 1]. This leads to the following definition:
Definition 6. A closed formula φ is a logical consequence of a knowledge-base ⟨K, G(· | θ), Θ⟩, in symbols ⟨K, G(· | θ), Θ⟩ ⊨_q φ, if, for every grounded theory ⟨K, G_θ⟩, if SatAgg(K, G_θ) ≥ q then G_θ(φ) ≥ q.
Reasoning Option 1 (Querying after learning). This is approximate logical inference obtained by considering only the grounded theories that maximally satisfy ⟨K, G(· | θ), Θ⟩. We therefore define that φ is a brave logical consequence of a Real Logic knowledge-base ⟨K, G(· | θ), Θ⟩ if G_θ*(φ) ≥ q for all the θ* such that:

θ* = argmax_θ SatAgg(K, G_θ)   and   SatAgg(K, G_θ*) ≥ q

The objective is to find all θ* that optimally satisfy the knowledge-base and to measure whether they also satisfy φ. One can search for such θ* by running multiple optimizations with the objective function of Section 3.2.
This approach is somewhat naive. Even if we run the optimization multiple times with multiple
parameter initializations (to, hopefully, reach different optima in the search space), the obtained
groundings may not be representative of other optimal or close-to-optimal groundings. In Section
4.8, we give an example that shows the limitations of this approach and motivates the next one.
Reasoning Option 2 (Proof by Refutation). Here, we reason by refutation and search for a counter-
example to the logical consequence by introducing an alternative search objective. Normally,
according to Definition 6, one tries to verify that:11
11 For simplicity, we temporarily adopt the notation G(K) := SatAgg_{φ∈K} G(φ).
If Eq.(22) is true then a counterexample to Eq.(21) has been found and the logical consequence
does not hold. If Eq.(22) is false then no counterexample to Eq.(21) has been found and the logical
consequence is assumed to hold true. A search for such parameters θ (the counterexample) can be
performed by minimizing Gθ pφq while imposing a constraint that seeks to invalidate results where
Gθ pKq ă q. We therefore define:
#
c if Gθ pKq ă q,
penaltypGθ , qq “ where c ą 1.12
0 otherwise,
Given G* such that:

    G* = argmin_{G_θ} (G_θ(φ) + penalty(G_θ, q))    (23)

• If G*(K) < q: then for all G_θ, G_θ(K) < q, and therefore (K, G(· | θ), Θ) ⊨_q φ (vacuously).
• If G*(K) ≥ q and G*(φ) ≥ q: then for all G_θ with G_θ(K) ≥ q, we have G_θ(φ) ≥ G*(φ) ≥ q,
  and therefore (K, G(· | θ), Θ) ⊨_q φ.
• If G*(K) ≥ q and G*(φ) < q: then (K, G(· | θ), Θ) ⊭_q φ.
Clearly, Equation (23) cannot be used as an objective function for gradient descent due to null
derivatives. Therefore, we propose to approximate the penalty function with the soft constraint:

    elu(α, β(q − G_θ(K))) = β(q − G_θ(K))        if G_θ(K) ≤ q,
                            α(e^{q − G_θ(K)} − 1)  otherwise,

where α ≥ 0 and β ≥ 0 are hyper-parameters (see Figure 6). When G_θ(K) < q, the penalty is linear
in q − G_θ(K) with slope β. Setting β high, the gradients for G_θ(K) will be high in absolute value
if the knowledge-base is not satisfied. When G_θ(K) > q, the penalty is a negative exponential that
converges to −α. Setting α low but non-zero seeks to ensure that the gradients do not vanish when
the penalty should not apply (when the knowledge-base is satisfied). We obtain the following
approximate objective function:

    G* = argmin_{G_θ} (G_θ(φ) + elu(α, β(q − G_θ(K))))    (24)
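The soft constraint above can be sketched in a few lines of NumPy. This is a sketch under our own naming: `sat_K` stands for G_θ(K), `sat_phi` for G_θ(φ), and the default values of α and β are purely illustrative.

```python
import numpy as np

def soft_penalty(sat_K, q, alpha=0.5, beta=2.0):
    # Soft approximation of penalty(G_theta, q): linear with slope beta
    # when the knowledge-base is not satisfied (sat_K <= q); a negative
    # exponential converging to -alpha when it is satisfied (sat_K > q).
    x = q - sat_K
    if x >= 0:
        return beta * x
    return alpha * (np.exp(x) - 1.0)

def refutation_objective(sat_phi, sat_K, q, alpha=0.5, beta=2.0):
    # Approximate objective of Eq. (24): G_theta(phi) + elu(alpha, beta(q - G_theta(K)))
    return sat_phi + soft_penalty(sat_K, q, alpha, beta)
```

For example, with sat_K = 0.2 and q = 0.7 the penalty is β(0.7 − 0.2) = 1.0, while with sat_K = 0.9 it is a small negative value approaching −α, keeping gradients alive on both sides of the threshold.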
Section 4.8 will illustrate the use of reasoning by refutation with an example in comparison
with reasoning as querying after learning. Of course, other forms of reasoning are possible, not least
that adopted in [6], but a direct comparison is outside the scope of this paper and left as future
work.
¹² In the objective function, G* should satisfy G*(K) ≥ q before reducing G*(φ), because the penalty c > 1 outweighs
any potential reduction in G(φ), which is at most 1.
¹³ https://github.com/logictensornetworks/logictensornetworks
Figure 6: elu(α, βx) for hyper-parameters α ≥ 0 and β ≥ 0 (shown for α = 1, β = 1 and α = 0.5, β = 2). The function
elu(α, β(q − G_θ(K))) with α low and β high is a soft constraint for penalty(G_θ, q) suitable for learning.
is grounded using TensorFlow primitives, such that LTN directly implements a TensorFlow graph.
Owing to TensorFlow's built-in optimization, LTN is relatively efficient while providing the expressive
power of first-order logic. Details on the implementation of the examples described in this section
are reported in Appendix A. The implementation of the examples presented here is also available
from the LTN repository on GitHub. Except when stated otherwise, the results reported are
averages over 10 runs with 95% confidence intervals. Every example uses the stable real
product configuration to approximate the Real Logic operators and the Adam optimizer [35] with
a learning rate of 0.001. Table A.3 in the Appendix gives an overview of the network architectures
used to obtain the results reported in this section.
Variables:
x+ for the positive examples.
x− for the negative examples.
x for all examples.
D(x) = D(x+) = D(x−) = points.
Predicates:
A(x) for the trainable classifier.
D_in(A) = points.
Axioms:
Grounding:
G(points) = [0, 1]².
G(x) ∈ [0, 1]^{m×2} (G(x) is a sequence of m points, that is, m examples).
G(x+) = ⟨d ∈ G(x) | ‖d − (0.5, 0.5)‖ < 0.09⟩.¹⁴
G(x−) = ⟨d ∈ G(x) | ‖d − (0.5, 0.5)‖ ≥ 0.09⟩.¹⁵
G(A | θ) : x ↦ sigmoid(MLP_θ(x)), where MLP is a Multilayer Perceptron with a single output
neuron, whose parameters θ are to be learned.¹⁶
Learning:
Let D be the data set of all examples. The objective function with K = {∀x+ A(x+), ∀x− ¬A(x−)}
is given by argmax_{θ∈Θ} SatAgg_{φ∈K} G_{θ, x←D}(φ).¹⁷ In practice, the optimizer uses the following
loss function:

    L = 1 − SatAgg_{φ∈K} G_{θ, x←B}(φ)

where B is a mini-batch sampled from D.¹⁸ The objective and loss functions depend on the
following hyper-parameters:
• the choice of fuzzy logic operator semantics used to approximate each connective and
  quantifier,
• the choice of hyper-parameters underlying the operators, such as the value of the exponent p
  in any generalized mean,
• the choice of formula aggregator function.
Using the stable product configuration to approximate connectives and quantifiers, with p = 2
for every occurrence of A_pME, and using A_pME with p = 2 also for the formula aggregator,
yields the following satisfaction equation:

    SatAgg_{φ∈K} G_θ(φ) = 1 − ( (1/2) [ ( 1 − ( 1 − ( (1/|G(x+)|) Σ_{v∈G(x+)} (1 − sigmoid(MLP_θ(v)))² )^{1/2} ) )²
                          + ( 1 − ( 1 − ( (1/|G(x−)|) Σ_{v∈G(x−)} (sigmoid(MLP_θ(v)))² )^{1/2} ) )² ] )^{1/2}
¹⁴ G(x+) are, by definition in this example, the training examples with Euclidean distance to the centre (0.5, 0.5) smaller
than the threshold of 0.09.
¹⁵ G(x−) are, by definition, the training examples with Euclidean distance to the centre (0.5, 0.5) larger than or equal to
the threshold of 0.09.
¹⁶ sigmoid(x) = 1 / (1 + e^{−x}).
¹⁷ The notation G_{x←D}(φ(x)) means that the variable x is grounded with the data D (that is, G(x) := D) when grounding
φ(x).
¹⁸ As usual in ML, while it is possible to compute the loss function and gradients over the entire data set, it is preferred
to compute them over mini-batches.
Figure 7: Symbolic Tensor Computational Graph for the Binary Classification Example. In the figure, G(x+) and G(x−)
are inputs to the network G_θ(A), and the dotted lines indicate the propagation of activation from each input through the
network, which produces two outputs.
The computational graph of Figure 7 shows SatAgg_{φ∈K} G_θ(φ) as used with the above loss
function.
We are therefore interested in learning the parameters θ of the MLP used to model the binary
classifier. We sample 100 data points uniformly from [0, 1]² to populate the data set of
positive and negative examples. The data set was split into 50 data points for training and
50 points for testing. The training was carried out for a fixed number of 1000 epochs using
backpropagation with the Adam optimizer [35] and a batch size of 64 examples. Figure 8
shows the classification accuracy and satisfaction level of the LTN on both training and test
sets, averaged over 10 runs with 95% confidence intervals. The accuracy shown is the ratio
of examples correctly classified, with an example deemed positive if the classifier
outputs a value higher than 0.5.
Notice that a model can reach an accuracy of 100% while the satisfaction of the knowledge-base
is not yet maximized. For example, if the threshold for an example to be deemed positive
is 0.7, all examples may be classified correctly with a confidence score of 0.7. In that case,
while the accuracy is already maximal, the satisfaction of ∀x+ A(x+) would still be 0.7, and
can improve until the confidence for every sample reaches 1.0.
This first example, although straightforward, illustrates step-by-step the process of using LTN
in a simple setting. Notice that, according to the nomenclature of Section 3.3, measuring accuracy
amounts to issuing the truth query (respectively, the generalization truth query) A(x) for all the
examples of the training set (respectively, test set) and comparing the results with the classification
threshold. In Figure 9, we show the results of such queries A(x) after optimization. Next, we show
how the LTN language can be used to solve progressively more complex problems by combining
learning and reasoning.
Figure 8: Binary Classification task (training and test set performance): Average accuracy (left) and satisfiability (right).
Due to the random initializations, accuracy and satisfiability start on average at 0.5, with performance increasing rapidly
after a few epochs.
classes A, B or C. This syntax allows one to write statements quantifying over the classes, e.g.
∀x(∃l(P(x, l))). Since the classes are mutually exclusive, the output layer of the MLP representing
P(x, l) will be a softmax layer, instead of a sigmoid function, to ensure the exclusivity constraint
on satisfiability scores.¹⁹ The problem can be specified as follows:
Domains:
items, denoting the examples from the Iris flower data set.
labels, denoting the class labels.
Variables:
x_A, x_B, x_C for the positive examples of classes A, B, C.
x for all examples.
D(x_A) = D(x_B) = D(x_C) = D(x) = items.
Constants:
l_A, l_B, l_C, the labels of classes A (Iris setosa), B (Iris virginica), C (Iris versicolor), respectively.
D(l_A) = D(l_B) = D(l_C) = labels.
Predicates:
P(x, l), denoting the fact that item x is classified as l.
D_in(P) = items, labels.
Axioms:
Notice that rules about exclusiveness, such as ∀x(P(x, l_A) → (¬P(x, l_B) ∧ ¬P(x, l_C))), are
not included, since such constraints are already imposed by the grounding of P below, more
specifically by the softmax function.

¹⁹ softmax(x)_i = e^{x_i} / Σ_j e^{x_j}.
Figure 9: Binary Classification task (querying the trained predicate A(x)): ground truth (top) and queried truth degrees of
A(x) and ¬A(x) on training and test data. It is interesting to see how A(x) could be appropriately named as denoting the
inside of the central region shown in the figure, and therefore ¬A(x) represents the outside of the region.
Grounding:
G(items) = R⁴; items are described by 4 features: the length and the width of the sepals and
petals, in centimetres.
G(labels) = N³; we use a one-hot encoding to represent classes.
G(x_A) ∈ R^{m₁×4}, that is, G(x_A) is a sequence of m₁ examples of class A.
G(x_B) ∈ R^{m₂×4}, G(x_B) is a sequence of m₂ examples of class B.
G(x_C) ∈ R^{m₃×4}, G(x_C) is a sequence of m₃ examples of class C.
G(x) ∈ R^{(m₁+m₂+m₃)×4}, G(x) is a sequence of all the examples.
G(l_A) = [1, 0, 0], G(l_B) = [0, 1, 0], G(l_C) = [0, 0, 1].
G(P | θ) : x, l ↦ lᵀ · softmax(MLP_θ(x)), where the MLP has three output neurons corresponding
to as many classes, and · denotes the dot product as a way of selecting an output for G(P | θ);
multiplying the MLP's output by the one-hot vector lᵀ gives the truth degree corresponding
to the class denoted by l.
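This grounding can be sketched as follows. It is a minimal NumPy sketch under our own naming, where `logits` stands for a hypothetical output of MLP_θ(x) for one item.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the class logits
    e = np.exp(z - np.max(z))
    return e / e.sum()

def P(logits, l_onehot):
    # G(P | theta): the one-hot vector l selects, via a dot product,
    # the softmax output corresponding to the class denoted by l
    return float(np.dot(l_onehot, softmax(logits)))

l_A, l_B, l_C = np.eye(3)                 # one-hot label vectors
logits = np.array([2.0, 0.5, 0.1])        # hypothetical MLP output for one item
truths = [P(logits, l) for l in (l_A, l_B, l_C)]
```

Because of the softmax, the three truth degrees sum to 1, which is what imposes the exclusivity constraint without any explicit exclusiveness axiom.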
Learning:
The logical operators and connectives are approximated using the stable product configuration
with p = 2 for A_pME. For the formula aggregator, A_pME is also used with p = 2.
The computational graph of Figure 10 illustrates how SatAgg_{φ∈K} G_θ(φ) is obtained. If U
denotes batches sampled from the data set of all examples, the loss function (to minimize) is:
Figure 11 shows the result of training with the Adam optimizer with batches of 64 examples.
Accuracy measures the ratio of examples correctly classified, with example x labelled as
argmax_l(P(x, l)).²⁰ Classification accuracy reaches an average value near 1.0 for both the
training and test data after some 100 epochs. Satisfaction levels of the Iris flower predictions
continue to increase for the rest of the training (500 epochs) to more than 0.8.
It is worth contrasting the choice of a binary predicate P(x, l) in this example with
the option of multiple unary predicates l_A(x), l_B(x), l_C(x), one for each class. Notice
how each predicate is normally associated with an output neuron. In the case of the unary
predicates, the networks would be disjoint (or modular), whereas weight-sharing takes place
with the use of the binary predicate. Since l is instantiated into l_A, l_B, l_C, in practice P(x, l)
becomes P(x, l_A), P(x, l_B), P(x, l_C), which is implemented via three output neurons to which
a softmax function applies.
Figure 10: Symbolic Tensor Computational Graph for the Multi-Class Single-Label Problem (one-hot label selected by dot
product over the MLP classifier outputs). As before, the dotted lines in the figure indicate the propagation of activation
from each input through the network, in this case producing three outputs.
Figure 11: Multi-Class Single-Label Classification: Classification accuracy (left) and satisfaction level (right).
to each other, thus becoming a powerful tool for the multi-label problem, where typically
labelled data is scarce. We explore the Leptograpsus crabs data set [10], consisting of 200 examples
of 5 morphological measurements of 50 crabs. The task is to classify the crabs according to their
colour and sex. There are four labels: blue, orange, male, and female. The colour labels are mutually
exclusive, and so are the labels for sex. LTN will be used to specify such information logically.
Domains:
items, denoting the examples from the crabs data set.
labels, denoting the class labels.
Variables:
x_blue, x_orange, x_male, x_female for the positive examples of each class.
x for all examples.
D(x_blue) = D(x_orange) = D(x_male) = D(x_female) = D(x) = items.
Constants:
l_blue, l_orange, l_male, l_female (the labels for each class).
D(l_blue) = D(l_orange) = D(l_male) = D(l_female) = labels.
Predicates:
P(x, l), denoting the fact that item x is labelled as l.
D_in(P) = items, labels.
Axioms:
Notice how logical rules (34) and (35) above represent the mutual exclusion of the labels on
colour and sex, respectively. As a result, negative examples are not used explicitly in this
specification.
Grounding:
G(items) = R⁵; the examples from the data set are described using 5 features.
G(labels) = N⁴; one-hot vectors are used to represent class labels.²¹
G(x_blue) ∈ R^{m₁×5}, G(x_orange) ∈ R^{m₂×5}, G(x_male) ∈ R^{m₃×5}, G(x_female) ∈ R^{m₄×5}. These
sequences are not mutually exclusive; one example can, for instance, be in both x_blue and x_male.
G(l_blue) = [1, 0, 0, 0], G(l_orange) = [0, 1, 0, 0], G(l_male) = [0, 0, 1, 0], G(l_female) = [0, 0, 0, 1].

²¹ There are two possible approaches here: either each item is labelled with one multi-hot encoding, or each item is labelled
with several one-hot encodings. The latter approach was used in this example.

G(P | θ) : x, l ↦ lᵀ · sigmoid(MLP_θ(x)), with the MLP having four output neurons corresponding
to as many classes. As before, · denotes the dot product, which selects a single output. By
contrast with the previous example, notice the use of a sigmoid function instead of a softmax
function.
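The contrast with the softmax grounding of Section 4.2 can be sketched as follows (a NumPy sketch with hypothetical logits): with a sigmoid applied elementwise, the scores are not normalized across labels, so several labels can hold for the same item.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def P(logits, l_onehot):
    # G(P | theta) for the multi-label case: elementwise sigmoid, so the
    # truth degrees of different labels are independent of each other
    return float(np.dot(l_onehot, sigmoid(logits)))

l_blue, l_orange, l_male, l_female = np.eye(4)
logits = np.array([2.0, -3.0, 1.5, -1.5])  # hypothetical MLP output for one crab
t_blue = P(logits, l_blue)
t_male = P(logits, l_male)
t_orange = P(logits, l_orange)
```

Here both t_blue and t_male exceed 0.5 simultaneously, which a softmax layer would not allow; the mutual exclusion of blue and orange is instead left to the logical axioms.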
Learning:
As before, the fuzzy logic operators and connectives are approximated using the stable
product configuration with p = 2 for A_pME, and A_pME with p = 2 is also used for the
formula aggregator.
Figure 12 shows the result of training with the Adam optimizer using backpropagation with
batches of 64 examples. This time, the accuracy is defined as 1 − HL, where HL is the average
Hamming loss, i.e. the fraction of labels predicted incorrectly, with a classification threshold of
0.5 (given an example u, if the model outputs a value greater than 0.5 for class C, then u is
deemed as belonging to class C). The rightmost graph in Figure 12 illustrates how LTN learns
the constraint that a crab cannot have both blue and orange colour, which is discussed in more
detail in what follows.
Querying:
To illustrate the learning of constraints by LTN, we have queried three formulas φ₁, φ₂, φ₃ that
were not explicitly part of the knowledge-base, over time during learning:
For querying, we use p = 5 when approximating the universal quantifiers with A_pME. A
higher p denotes a stricter universal quantification, with a stronger focus on outliers (see
Section 2.4).²² We should expect φ₁ to hold true (every blue crab cannot be orange, and
vice-versa²³), and we should expect φ₂ (every blue crab is also orange) and φ₃ (every blue
crab is male) to be false. The results are reported in the rightmost plot of Figure 12. Prior to
training, the truth-values of φ₁ to φ₃ are non-informative. During training, with the
maximization of the satisfaction of the knowledge-base, one can see a trend towards the
satisfaction of φ₁, and an opposite trend of φ₂ and φ₃ towards false.

²² Training should usually not focus on outliers, as optimizers would struggle to generalize and tend to get stuck in local
minima. However, when querying φ₁, φ₂, φ₃, we wish to be more careful about the interpretation of our statement. See also
Section 3.1.3.
²³ Notice how, strictly speaking, other colours remain possible, since the prior knowledge did not specify the bi-conditional:
Figure 12: Multi-Class Multi-Label Classification: Classification accuracy (left), satisfiability level (middle), and querying
of the constraints φ₁, φ₂, φ₃ (right).
Single Digits Addition: Consider the predicate addition(X, Y, N), where X and Y are images of
digits (the MNIST data set will be used), and N is a natural number corresponding to the sum
of these digits. This predicate should return an estimate of the validity of the addition. For
instance, addition( , , 11) is a valid addition; addition( , , 5) is not.
Multi Digits Addition: The experiment is extended to numbers with more than one digit. Consider
the predicate addition([X₁, X₂], [Y₁, Y₂], N). [X₁, X₂] and [Y₁, Y₂] are lists of images of
digits, representing two multi-digit numbers; N is a natural number corresponding to the
sum of the two multi-digit numbers. For instance, addition([ , ], [ , ], 130) is a valid
addition; addition([ , ], [ , ], 26) is not.
A natural neurosymbolic approach is to seek to learn a single-digit classifier and benefit from the
knowledge readily available about the properties of addition. For instance, suppose
that a predicate digit(x, d) gives the likelihood of an image x being of digit d. A definition for
addition( , , 11) in LTN is:
In [41], the above task is made more complicated by not providing labels for the single-digit
images during training. Instead, training takes place on pairs of images with labels made available
for the result only, that is, the sum of the individual labels. The single-digit classifier is not explicitly
trained by itself; its output is a piece of latent information that is used by the logic. However, this
does not pose a problem for end-to-end neurosymbolic systems such as LTN or DeepProbLog, for
which the gradients can propagate through the logical structures.
We start by illustrating an LTN theory that can be used to learn the predicate digit. The speci-
fication of the theory below is for the single digit addition example, although it can be extended
easily to the multiple digits case.
Domains:
images, denoting the MNIST digit images.
results, denoting the integers that label the results of the additions.
digits, denoting the digits from 0 to 9.
Variables:
x, y, ranging over the MNIST images in the data.
n for the labels, i.e. the result of each addition.
d₁, d₂, ranging over digits.
D(x) = D(y) = images.
D(n) = results.
D(d₁) = D(d₂) = digits.
Predicates:
digit(x, d) for the single digit classifier, where d is a term denoting a digit constant or a digit
variable. The classifier should return the probability of an image x being of digit d.
D_in(digit) = images, digits.
Axioms:
Single Digit Addition:

    ∀Diag(x, y, n) (∃d₁, d₂ : d₁ + d₂ = n  (digit(x, d₁) ∧ digit(y, d₂)))    (39)

Multi Digits Addition:

    ∀Diag(x₁, x₂, y₁, y₂, n) (∃d₁, d₂, d₃, d₄ : 10d₁ + d₂ + 10d₃ + d₄ = n
        (digit(x₁, d₁) ∧ digit(x₂, d₂) ∧ digit(y₁, d₃) ∧ digit(y₂, d₄)))    (40)

Notice the use of Diag: when grounding x, y, n with three sequences of values, the i-th
examples of each variable are matching. That is, (G(x)ᵢ, G(y)ᵢ, G(n)ᵢ) is a tuple from our
data set of valid additions. Using the diagonal quantification, LTN aggregates pairs of images
and their corresponding result, rather than any combination of images and results.
Notice also the guarded quantification: by quantifying only on the latent "digit labels" (i.e.
d₁, d₂, . . . ) that can add up to the result label (n, given in the data set), we incorporate symbolic
information into the system. For example, in (39), if n = 3, the only valid tuples (d₁, d₂) are
(0, 3), (3, 0), (1, 2), (2, 1). Gradients will only backpropagate to these values.
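The guard can be sketched as a simple enumeration of the admissible digit pairs (our own helper function, not part of the LTN API):

```python
def valid_digit_pairs(n):
    # Guard of axiom (39): only pairs (d1, d2) with d1 + d2 == n survive
    # the guarded existential quantification, so gradients backpropagate
    # only through these digit combinations.
    return [(d1, d2) for d1 in range(10) for d2 in range(10) if d1 + d2 == n]
```

For n = 3 this returns the four pairs listed above; for n = 18 only (9, 9) remains, so the guard can prune up to 96 of the 100 candidate pairs.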
Grounding:
G(images) = [0, 1]^{28×28×1}. The MNIST data set has images of 28 by 28 pixels. The images are
grayscale and have just one channel; the pixel values from 0 to 255 of the MNIST data set are
converted to the range [0, 1].
G(results) = N.
G(digits) = {0, 1, . . . , 9}.
G(x) ∈ [0, 1]^{m×28×28×1}, G(y) ∈ [0, 1]^{m×28×28×1}, G(n) ∈ Nᵐ.²⁴
G(d₁) = G(d₂) = ⟨0, 1, . . . , 9⟩.
G(digit | θ) : x, d ↦ onehot(d)ᵀ · softmax(CNN_θ(x)), where CNN is a Convolutional Neural
Network with 10 output neurons, one for each class. Notice that, in contrast with the previous
examples, d is an integer label; onehot(d) converts it into a one-hot label.

²⁴ Notice the use of the same number m of examples for each of these variables, as they are supposed to match one-to-one.
Learning:
The computational graph of Figure 13 shows the objective function for the satisfiability of the
knowledge-base. A stable product configuration is used with hyper-parameter p = 2 for the
operator A_pME for universal quantification (∀). Let p_∃ denote the exponent hyper-parameter
used in the generalized mean A_pM for existential quantification (∃). Three scenarios are
investigated and compared in the Multiple Digit experiment (Figure 15):
In the Single Digit experiment, only the last scenario above (the schedule) is investigated (Figure
14).
We train to maximize satisfiability using batches of 32 examples of image pairs, labelled
by the result of their addition. As done in [41], the experimental results vary the number
of examples in the training set to emphasize the generalization abilities of a neurosymbolic
approach. Accuracy is measured by predicting the digit values using the predicate digit and
reporting the ratio of examples for which the addition is correct. A comparison is made with
the same baseline method used in [41]: given a pair of MNIST images, a non-pre-trained
CNN outputs embeddings for each image (Siamese neural network). The embeddings are
provided as input to dense layers that classify the addition into one of the 19 (respectively,
199) possible results of the Single Digit Addition (respectively, Multiple Digit Addition)
experiments. The baseline is trained using a cross-entropy loss between the labels and the
predictions. As expected, such a standard deep learning approach struggles with the task
without the provision of symbolic meaning about intermediate parts of the problem.
Experimentally, we find that the optimizer for the neurosymbolic system gets stuck in a local
optimum at initialization in about 1 out of 5 runs. We therefore present the results as an
average of the 10 best outcomes out of 15 runs of each algorithm (that is, for the baseline
as well). The examples of digit pairs selected from the full MNIST data set are randomized
at each run.
Figure 15 shows that the use of p_∃ = 2 from the start produces poor results. A higher value
for p_∃ in A_pM weighs up the instances with a higher truth-value (see also Appendix C for a
discussion). Starting already with a high value for p_∃, the classes with a higher initial truth-
value for a given example will have higher gradients and be prioritized for training, which
does not make practical sense when randomly initializing the predicates. Increasing p_∃ by
following a schedule is the most promising approach. In this particular example, p_∃ = 1 is
also shown to be adequate purely from a learning perspective. However, p_∃ = 1 implements
a simple average, which does not account well for the meaning of ∃; the resulting satisfaction
value is not meaningful from a reasoning perspective.
Table 1 shows that the training and test times of LTN are of the same order of magnitude as
those of the CNN baselines. Table 2 shows that LTN reaches similar accuracy to that reported
by DeepProbLog.
Figure 13: Symbolic Tensor Computational Graph for the Single Digit Addition task (with one-hot selection of the digit
outputs). Notice that the figure does not depict accurate dimensions for the tensors; G(x) and G(y) are in fact 4D tensors
of dimensions m × 28 × 28 × 1. Computing results with the variables d₁ or d₂ corresponds to the addition of a further axis
of dimension 10.
Figure 14: Single Digit Addition Task: Accuracy and satisfiability results (top) and results in the presence of fewer examples
(bottom), in comparison with standard deep learning using a CNN (blue lines).
             Single Digits                 Multi Digits
Model        Train          Test          Train          Test
baseline     2.72±0.23 ms   1.45±0.21 ms  3.87±0.24 ms   2.10±0.30 ms
LTN          5.36±0.25 ms   3.44±0.39 ms  8.51±0.72 ms   5.72±0.57 ms

Table 1: The computation time of training and test steps on the single and multiple digit addition tasks, measured on a
computer with a single Nvidia Tesla V100 GPU and averaged over 1000 steps. Each step operates on a batch of 32 examples.
The computational efficiency of the LTN and the CNN baseline systems is of the same order of magnitude.
Table 2: Accuracy (in %) on the test set: comparison of the final results obtained with LTN and those reported for
DeepProbLog [41]. Although it is difficult to compare the results over time directly (the frameworks are implemented in
different libraries), while achieving similar computational efficiency to the CNN baseline, LTN also reaches similar accuracy
to that reported by DeepProbLog.
4.5. Regression
Another important problem in machine learning is regression, where a relationship is estimated
between one independent variable X and a continuous dependent variable Y. The essence of
regression is, therefore, to approximate a function f(x) = y by a function f*, given examples
(xᵢ, yᵢ) such that f(xᵢ) = yᵢ. In LTN, one can model a regression task by defining f* as a learnable
function whose parameter values are constrained by data. Additionally, a regression task requires
a notion of equality. We therefore define the predicate eq as a smooth version of the symbol = to
turn the constraint f(xᵢ) = yᵢ into a smooth optimization problem.
In this example, we explore regression using a problem from a real estate data set²⁵ with 414
examples, each described in terms of 6 real-numbered features: the transaction date (converted to
a float), the age of the house, the distance to the nearest station, the number of convenience stores
in the vicinity, and the latitude and longitude coordinates. The model has to predict the house
price per unit area.
Domains:
samples, denoting the houses and their features.
prices, denoting the house prices.
Variables:
x for the samples.
y for the prices.
D(x) = samples.
D(y) = prices.

²⁵ https://www.kaggle.com/quantbruce/real-estate-price-prediction
Figure 15: Multiple Digit Addition Task: Accuracy and satisfiability results (top, 15000 examples) and results in the
presence of fewer examples (bottom, 1500 examples), in comparison with standard deep learning using a CNN (blue lines),
for LTN with the p schedule (p = 1, 2, 4, 6), p = 1, and p = 2.
Functions:
f*(x), the regression function to be learned.
D_in(f*) = samples, D_out(f*) = prices.
Predicates:
eq(y₁, y₂), a smooth equality predicate that measures how similar y₁ and y₂ are.
D_in(eq) = prices, prices.
Axioms:
Notice again the use of Diag: when grounding x and y onto sequences of values, this is
done by obeying a one-to-one correspondence between the sequences. In other words, we
aggregate pairs of corresponding samples and prices, instead of any combination thereof.
Grounding:
G(samples) = R⁶.
G(prices) = R.
G(x) ∈ R^{m×6}, G(y) ∈ R^{m×1}. Notice that this specification refers to the same number m of
examples for x and y, due to the above one-to-one correspondence obtained with the use of
Diag.
G(eq(u, v)) = exp(−α √(Σⱼ (uⱼ − vⱼ)²)), where the hyper-parameter α is a real number that
Figure 16: Regression task: RMSE and satisfaction level over time.
Figure 17: Fitted versus actual y values on the training data (left) and the test data (right).
scales how strict the smooth equality is.²⁶ In our experiments, we use α = 0.05.
G(f*(x) | θ) = MLP_θ(x), where MLP_θ is a multilayer perceptron ending in one neuron
corresponding to a price prediction, with a linear output layer (no activation function).
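The smooth equality can be sketched in NumPy as follows (a sketch under our own naming; the default α mirrors the value used in the experiments):

```python
import numpy as np

def eq(u, v, alpha=0.05):
    # G(eq(u, v)) = exp(-alpha * ||u - v||): equals 1 when u == v and
    # decays exponentially towards 0 as the Euclidean distance grows
    u = np.atleast_1d(np.asarray(u, dtype=float))
    v = np.atleast_1d(np.asarray(v, dtype=float))
    return float(np.exp(-alpha * np.linalg.norm(u - v)))
```

For instance, eq(10.0, 10.0) is exactly 1, while predictions further from the target receive smoothly smaller truth degrees, which is what turns the hard constraint f(xᵢ) = yᵢ into a differentiable objective.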
Learning:
The theory is constrained by the parameters of the model of f*. LTN is used to estimate
such parameters by maximizing the satisfaction of the knowledge-base, in the usual way.
Approximating ∀ using A_pME with p = 2, as before, we randomly split the data set into 330
examples for training and 84 examples for testing. Figure 16 shows the satisfaction level over
500 epochs. We also plot the Root Mean Squared Error (RMSE) between the predicted prices
and the labels (i.e. actual prices, also known as target values). Figure 17 visualizes the
strong correlation between actual and predicted prices at the end of one of the runs.
²⁶ Intuitively, the smooth equality is exp(−α d(u, v)), where d(u, v) is the Euclidean distance between u and v. It
produces 1 if the distance is zero; as the distance increases, the result decreases exponentially towards 0. In case an
exponential decrease is undesirable, one can adopt the following alternative equation: eq(u, v) = 1 / (1 + α d(u, v)).
alone. LTN can formulate constraints such as:
• clusters should be disjoint,
• every example should be assigned to a cluster,
• a cluster should not be empty,
• if two points are near, they should belong to the same cluster,
• if two points are far apart, they should belong to different clusters, etc.
Domains:
points, denoting the data to cluster.
points_pairs, denoting pairs of examples.
clusters, denoting the clusters.
Variables:
x, y for all points.
c for the clusters.
D(x) = D(y) = points.
D(c) = clusters.
Predicates:
C(x, c), the truth degree of a given point belonging to a given cluster.
D_in(C) = points, clusters.
Axioms:

    ∀x ∃c C(x, c)    (42)
    ∀c ∃x C(x, c)    (43)
    ∀(c, x, y : |x − y| < th_close) (C(x, c) ↔ C(y, c))    (44)
    ∀(c, x, y : |x − y| > th_distant) ¬(C(x, c) ∧ C(y, c))    (45)

Notice the use of guarded quantifiers: all pairs of points with Euclidean distance lower
(resp. higher) than the value th_close (resp. th_distant) should belong to the same cluster (resp.
should not). th_close and th_distant are arbitrary threshold values that define some of the closest
and most distant pairs of points. In our example, they are set to 0.2 and 1.0, respectively.
As done in the example of Section 4.2, the clustering predicate has mutually exclusive satisfi-
ability scores for each cluster using a softmax layer. Therefore, there is no explicit constraint
about clusters being disjoint.
Grounding:
G(points) = [−1, 1]².
G(clusters) = N⁴; we use one-hot vectors to represent a choice of 4 clusters.
G(x) ∈ [−1, 1]^{m×2}, that is, x is a sequence of m points. G(y) = G(x).
th_close = 0.2, th_distant = 1.0.
G(c) = ⟨[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]⟩.
G(C | θ) : x, c ↦ cᵀ · softmax(MLP_θ(x)), where the MLP has 4 output neurons corresponding to
the 4 clusters.
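The guards of axioms (44) and (45) can be sketched by precomputing the index pairs they quantify over. This is our own helper, not part of the LTN API; the threshold defaults mirror th_close and th_distant.

```python
import numpy as np

def guarded_pairs(points, th_close=0.2, th_distant=1.0):
    # Pairwise Euclidean distances between all points
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    upper = np.triu(np.ones(d.shape, dtype=bool), k=1)  # count each pair once
    close = list(zip(*np.where(upper & (d < th_close))))      # guard of axiom (44)
    distant = list(zip(*np.where(upper & (d > th_distant))))  # guard of axiom (45)
    return close, distant

points = [[0.0, 0.0], [0.1, 0.0], [0.9, 0.9]]  # toy data
close, distant = guarded_pairs(points)
```

On the toy data, only the first two points are close enough to be constrained into the same cluster, while both are far enough from the third to be constrained into different clusters.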
Figure 18: LTN solving a clustering problem by constraint optimization: ground truth (top) and querying of each cluster
C0, C1, C2 and C3, in turn.
Learning:
We use the stable real product configuration to approximate the logical operators. For ∀, we use A_pME with p = 4. For ∃, we use A_pM with p = 1 during the first 100 epochs and p = 6 thereafter, as a simplified version of the schedule used in Section 4.4. The formula aggregator is approximated by A_pME with p = 2. The model is trained for a total of 1000 epochs using the Adam optimizer, which is sufficient for LTN to solve the clustering problem shown in Figure 18. Ground-truth data for this task was generated artificially by creating 4 centers and drawing 50 random samples from a multivariate Gaussian distribution around each center. The trained LTN achieves a satisfaction level of 0.857 on the clustering constraints.
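The quantifier approximations used above can be sketched as follows (a NumPy stand-in for the paper's TensorFlow 2 operators). A_pM behaves like a soft maximum as p grows, which is why the p used for ∃ is scheduled from 1 to 6:

```python
import numpy as np

def A_pM(truths, p):
    # Generalized mean: approximates "exists"; p = 1 is the arithmetic
    # mean, larger p moves towards the maximum.
    t = np.asarray(truths, dtype=float)
    return float(np.mean(t ** p) ** (1.0 / p))

def A_pME(truths, p):
    # Generalized mean of the errors: approximates "forall"; larger p
    # moves towards the minimum (the worst-satisfied instance).
    t = np.asarray(truths, dtype=float)
    return 1.0 - float(np.mean((1.0 - t) ** p) ** (1.0 / p))

def p_exists(epoch, switch=100):
    # Schedule used in this example: a soft mean early on, so gradients
    # reach every example, then a sharper approximation of max.
    return 1 if epoch < switch else 6
```

Starting with p = 1 avoids the vanishing gradients that a hard max would give to all but the best-satisfying example in the early epochs.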
Otherwise, knowledge about friendship is incomplete in that it may be known that e.g. a is a friend
of b, while it may not be known whether b is a friend of a. Finally, there is general knowledge
about smoking, friendship, and cancer, namely that smoking causes cancer, friendship is normally
symmetric and anti-reflexive, everyone has a friend, and smoking propagates (actively or passively)
among friends. All this knowledge is represented in the axioms further below.
Domains:
people, to denote the individuals.
Constants:
a, b, . . . , h, i, j, . . . , n, the 14 individuals. Our goal is to learn an adequate embedding for each constant.
D(a) = D(b) = ⋯ = D(n) = people.
Variables:
x, y ranging over the individuals.
D(x) = D(y) = people.
Predicates:
S(x) for smokes, F(x, y) for friends, C(x) for cancer.
D(S) = D(C) = people. D(F) = people, people.
Axioms:
Let X₁ = {a, b, . . . , h} and X₂ = {i, j, . . . , n} be the two groups of individuals.
Let S = {a, e, f, g, j, n} be the smokers; knowledge is complete in both groups.
Let C = {a, e} be the individuals with cancer; knowledge is complete in X₁ only.
Let F = {(a, b), (a, e), (a, f), (a, g), (b, c), (c, d), (e, f), (g, h), (i, j), (j, m), (k, l), (m, n)} be the set of friendship relations; knowledge is complete if assuming symmetry.
These facts are illustrated in Figure 20a.
We have the following axioms:
Notice that the knowledge base is not satisfiable in the strict logical sense of the word. For instance, f is said to smoke but not to have cancer, which is inconsistent with the rule ∀x (S(x) → C(x)). Hence, it is important to adopt a fuzzy approach as done with MLN or a many-valued fuzzy logic interpretation as done with LTN.
Grounding:
G(people) = ℝ⁵. The model is expected to learn embeddings in ℝ⁵.
G(a | θ) = v_θ(a), . . . , G(n | θ) = v_θ(n). Every individual is associated with a vector of 5 real numbers. The embeddings are initialized randomly from a uniform distribution.
G(x | θ) = G(y | θ) = ⟨v_θ(a), . . . , v_θ(n)⟩.
G(S | θ) : x ↦ sigmoid(MLP_S_θ(x)), where MLP_S_θ has 1 output neuron.
G(F | θ) : x, y ↦ sigmoid(MLP_F_θ(x, y)), where MLP_F_θ has 1 output neuron.
G(C | θ) : x ↦ sigmoid(MLP_C_θ(x)), where MLP_C_θ has 1 output neuron.
The MLP models for S, F, and C are kept simple, so that most of the learning is focused on the embeddings.
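A schematic NumPy version of this grounding (the paper trains small TensorFlow 2 MLPs; single linear layers stand in here, and all weights and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
individuals = list("abcdefghijklmn")

# Trainable embeddings: one vector in R^5 per constant, initialized
# uniformly at random.
emb = {name: rng.uniform(-1.0, 1.0, size=5) for name in individuals}

def sigmoid(z):
    return float(1.0 / (1.0 + np.exp(-z)))

# Simple predicate groundings with one output neuron each.
w_S = rng.normal(size=5)        # Smokes(x)
w_C = rng.normal(size=5)        # Cancer(x)
w_F = rng.normal(size=10)       # Friends(x, y), on the concatenated pair

def S(x): return sigmoid(w_S @ x)
def Cancer(x): return sigmoid(w_C @ x)
def F(x, y): return sigmoid(w_F @ np.concatenate([x, y]))
```

Keeping the predicate networks this simple pushes most of the learning signal into the embeddings, as intended.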
Learning:
We use the stable real product configuration to approximate the operators. For ∀, we use A_pME with p = 2 for all the rules, except for rules (52) and (53), where we use p = 6. The intuition behind this choice of p is that no outliers are accepted for the friendship relation, since it is expected to be symmetric and anti-reflexive, while outliers are accepted for the other rules. For ∃, we use A_pM with p = 1 during the first 200 epochs of training and p = 6 thereafter, with the same motivation as that of the schedule used in Section 4.4. The formula aggregator is approximated by A_pME with p = 2.
Figure 19 shows the satisfiability over 1000 epochs of training. At the end of one of these
runs, we query Spxq, F px, yq, Cpxq for each individual; the results are shown in Figure 20b.
We also plot the principal components of the learned embeddings [51] in Figure 21. The friendship relations are learned as expected. The rule "smoking implies cancer" (56) is inferred for group 2 even though such information was not present in the knowledge base. For group 1, the given facts about smoking and cancer for the individuals f and g are slightly altered, as these were inconsistent with the rules (the rule (55), that smoking propagates via friendship, is incompatible with many of the given facts). Increasing the satisfaction of this rule would require decreasing the overall satisfaction of the knowledge base, which explains why it is partly ignored by LTN during training. Finally, it is interesting to note that the principal components of the learned embeddings seem to be linearly separable for the smoking and cancer classifiers (cf. Figure 21, top right and bottom right plots).
Querying:
To illustrate querying in LTN, we query over time two formulas that are not present in the
knowledge-base:
We use p = 5 when approximating ∀, since the impact of an outlier at querying time should be seen as more important than at learning time. It can be seen that, as the grounding approaches satisfiability of the knowledge-base, φ₁ approaches true whereas φ₂ approaches false (cf. Figure 19, right).
Figure 19: Smoker-Friends-Cancer example: Satisfiability levels during training (left) and truth-values of queries φ1 and
φ2 over time (right).
(a) Incomplete facts in the knowledge-base: axioms for smokers and cancer for individuals a to n (left), friendship
relations in group 1 (middle), and friendship relations in group 2 (right).
(b) Querying all the truth-values using LTN after training: smokers and cancer (left), friendship relations (middle
and right).
Figure 20: Smoker-Friends-Cancer example: Illustration of the facts before and after training.
Figure 21: Smoker-Friends-Cancer example: learned embeddings showing the result of applying PCA on the individuals
(top left); truth-values of smokes and cancer predicates for each embedding (top and bottom right); illustration of the
friendship relations which are satisfied after learning (bottom left).
Propositional Variables:
The symbols A and B denote two propositional variables.
Axioms:
A ∨ B (60)
27 Here, learning refers to Section 3.2, which optimizes the satisfaction of the knowledge base as an objective.
Figure 22: Querying after learning: 10 runs of the optimizer with objective G* = argmax_{G_θ} G_θ(K). All runs converge to the optimum G₁; the grid search misses the counter-example.
Grounding:
G(A) = a, G(B) = b, where a and b are two real-valued parameters. The set of parameters is therefore θ = {a, b}. At initialization, a = b = 0.
We use the probabilistic sum S_P to approximate ∨, resulting in the following satisfiability measure:²⁸
G_θ(K) = G_θ(A ∨ B) = a + b − ab. (61)
There are infinitely many global optima maximizing the satisfiability of the theory, as any G_θ such that G_θ(A) = 1 (resp. G_θ(B) = 1) gives a satisfiability G_θ(K) = 1 for any value of G_θ(B) (resp. G_θ(A)). As expected, the following groundings are examples of global optima:
G₁ : G₁(A) = 1, G₁(B) = 1, G₁(K) = 1,
G₂ : G₂(A) = 1, G₂(B) = 0, G₂(K) = 1,
G₃ : G₃(A) = 0, G₃(B) = 1, G₃(K) = 1.
Reasoning:
(A ∨ B) |=_q A? That is, given the threshold q = 0.95, does every G_θ such that G_θ(K) ≥ q verify G_θ(φ) ≥ q? One can immediately notice that this is not the case. For instance, the grounding G₃ is a counter-example.
If one simply reasons by querying multiple groundings after learning with the usual objective argmax_{G_θ} G_θ(K), the results will all converge to G₁: since ∂G_θ(K)/∂a = 1 − b and ∂G_θ(K)/∂b = 1 − a, every run of the optimizer will increase a and b simultaneously until they reach the optimum a = b = 1. Because the grid search always converges to the same point, no counter-example is found and the logical consequence is mistakenly assumed true. This is illustrated in Figure 22.
When reasoning by refutation, however, the objective function has an incentive to find a counter-example to A, as illustrated in Figure 23. LTN converges to the optimum G₃, which refutes the logical consequence.
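For this propositional example, the refutation objective of Figure 23 can be written out directly. A small NumPy sketch, with the parameter values taken from the figure caption:

```python
import numpy as np

def elu(alpha, x):
    # Shifted penalty: roughly linear when the constraint G(K) >= q is
    # violated, bounded below by -alpha when it is satisfied.
    return float(x) if x > 0 else float(alpha * (np.exp(x) - 1.0))

def refutation_loss(a, b, q=0.95, alpha=0.05, beta=10.0):
    sat_K = a + b - a * b      # G(A or B) under the probabilistic sum
    sat_phi = a                # the candidate consequence phi = A
    # Minimize phi's truth-value while softly enforcing G(K) >= q.
    return sat_phi + elu(alpha, beta * (q - sat_K))
```

The counter-example G₃ (a = 0, b = 1) satisfies K fully while making φ false, so it attains a lower loss than the naive optimum G₁ (a = b = 1), which is what drives the search towards the refutation.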
Figure 23: Reasoning by refutation: one run of the optimizer with objective G* = argmin_{G_θ} (G_θ(φ) + elu(α, β(q − G_θ(K)))), q = 0.95, α = 0.05, β = 10. In the first training epochs, the directed search prioritizes the satisfaction of the knowledge base. Then, the minimization of G_θ(φ) starts to weigh in more and the search focuses on finding a counter-example. Eventually, the run converges to the optimum G₃, which refutes the logical consequence.
5. Related Work
The past years have seen considerable work aiming to integrate symbolic systems and neural
networks. We shall focus on work whose objective is to build computational models that integrate
deep learning and logical reasoning into a so-called end-to-end (fully differentiable) architecture.
We summarize a categorization in Figure 24 where the class containing LTN is further expanded
into three sub-classes. The sub-class highlighted in red is the one that contains LTN. The reason why
one may wish to combine symbolic AI and neural networks into a neurosymbolic AI system may
vary; cf. [17] for a recent comprehensive overview of approaches and challenges for neurosymbolic AI.
is subsequently trained from data using backpropagation. In [24], first-order logic programs
in the form of Horn clauses are used to define a neural network that can solve Inductive Logic
Programming tasks, starting from the most specific hypotheses covering the set of examples. Lifted
relational neural networks [66] is a declarative framework where a Datalog program is used as a
compact specification of a diverse range of existing advanced neural architectures, with a particular
focus on Graph Neural Networks (GNNs) and their generalizations. In [56] a weighted Real Logic
is introduced and used to specify neurons in a highly modular neural network that resembles a tree
structure, whereby neurons with different activation functions are used to implement the different
logic operators.
To some extent, it is also possible to specify neural architectures using logic in LTN. For example, a user can define a classifier P(x, y) as the formula P(x, y) = (Q(x, y) ∧ R(y)) ∨ S(x, y). G(P) then becomes a computational graph that combines the sub-architectures G(Q), G(R), and G(S) according to the syntax of the logical formula.
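A hedged NumPy sketch of this composition (Q, R, and S are placeholder sub-networks returning truth-values in [0, 1], and product semantics are assumed for ∧ and ∨):

```python
import numpy as np

def sigmoid(z):
    return float(1.0 / (1.0 + np.exp(-z)))

rng = np.random.default_rng(0)
wq, wr, ws = rng.normal(size=4), rng.normal(size=2), rng.normal(size=4)

# Placeholder sub-architectures G(Q), G(R), G(S).
def Q(x, y): return sigmoid(wq @ np.concatenate([x, y]))
def R(y):    return sigmoid(wr @ y)
def S(x, y): return sigmoid(ws @ np.concatenate([x, y]))

def P(x, y):
    # G(P): computational graph for (Q(x, y) AND R(y)) OR S(x, y),
    # using the product t-norm for AND and the probabilistic sum for OR.
    conj = Q(x, y) * R(y)
    return conj + S(x, y) - conj * S(x, y)
```

Gradients flow through the whole composed graph, so training P also trains its sub-architectures.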
5.3. Neurosymbolic architectures for the integration of inductive learning and deductive reasoning
These architectures seek to enable the integration of inductive and deductive reasoning in a
unique fully differentiable framework [15, 23, 41, 46, 47]. The systems that belong to this class
combine a neural component with a logical component. The former consists of one or more neural
networks, the latter provides a set of algorithms for performing logical tasks such as model check-
ing, satisfiability, and logical consequence. These two components are tightly integrated so that
learning and inference in the neural component are influenced by reasoning in the logical compo-
nent and vice versa. Logic Tensor Networks belong to this category. Neurosymbolic architectures
for integrating learning and reasoning can be further separated into three sub-classes:
1. Approaches that introduce additional layers to the neural network to encode logical con-
straints which modify the predictions of the network. This sub-class includes Deep Logic
Models [46] and Knowledge Enhanced Neural Networks [15].
2. Approaches that integrate logical knowledge as additional constraints in the objective func-
tion or loss function used to train the neural network (LTN and [23, 33, 47]).
3. Approaches that apply (differentiable) logical inference to compute the consequences of
the predictions made by a set of base neural networks. Examples of this sub-class are
DeepProblog [41] and Abductive Learning [14].
In what follows, we review recent neurosymbolic architectures in the same class as LTN: integrating learning and reasoning.
Systems that modify the predictions of a base neural network:. Among the approaches that modify the
predictions of the neural network using logical constraints are Deep Logic Models [46] and Knowl-
edge Enhanced Neural Networks [15]. Deep Logic Models (DLM) are a general architecture for
learning with constraints. Here, we will consider the special case where constraints are expressed
by logical formulas. In this case, a DLM predicts the truth-values of a set of n ground atoms of
a domain Δ = {a₁, . . . , a_k}. It consists of two models: a neural network f(x | w), which takes as input the features x of the elements of Δ and produces as output an evaluation f for all the ground atoms, i.e. f ∈ [0, 1]ⁿ; and a probability distribution p(y | f, λ), which is modeled by an undirected graphical model of the exponential family with each logical constraint characterized by a clique that contains the ground atoms, rather similarly to GNNs. The model returns the assignment to
Figure 24: Three classes of neurosymbolic approaches with Architectures Integrating Learning and Reasoning further sub-
divided into three sub-classes, with LTN belonging to the sub-class highlighted in red.
the atoms that maximize the weighted truth-value of the constraints and minimize the difference
between the prediction of the neural network and a target value y. Formally:
DLM(x | λ, w) = argmax_y ( Σ_c λ_c Φ_c(y_c) − (1/2) ||y − f(x | w)||² )
Each Φ_c(y_c) corresponds to a ground propositional formula which is evaluated w.r.t. the target truth assignment y, and λ_c is the weight associated with formula Φ_c. Intuitively, the upper model (the undirected graphical model) should modify the prediction of the lower model (the neural network) minimally to satisfy the constraints. f and y are truth-values of all the ground atoms obtained from the constraints appearing in the upper model in the domain specified by the data input.
Similar to LTN, DLM evaluates constraints using fuzzy semantics. However, it considers only
propositional connectives, whereas universal and existential quantifiers are supported in LTN.
Inference in DLM requires maximizing the prediction of the model, which might be prohibitive in the presence of a large number of instances. In LTN, inference involves only a forward pass through the neural component, which is rather simple and can be carried out in parallel. On the other hand, in DLM the weights associated with constraints can be learned, while in LTN they are specified in the background knowledge.
The approach taken in Knowledge Enhanced Neural Networks (KENN) [15] is similar to that
of DLM. Starting from the predictions y = f_nn(x | w) made by a base neural network f_nn(· | w), KENN adds a knowledge enhancer, which is a function that modifies y based on a set of weighted constraints formulated in terms of clauses. The formal model can be specified as follows:

KENN(x | λ, w) = σ( f′_nn(x | w) + Σ_c λ_c · (softmax(sign(c) ⊙ f′_nn(x | w)) ⊙ sign(c)) )

where f′_nn(x | w) are the pre-activations of f_nn(x | w), sign(c) is a vector of the same dimension as y containing 1, −1, and −∞, such that sign(c)_i = 1 (resp. sign(c)_i = −1) if the i-th atom occurs
positively (resp. negatively) in c, or sign(c)_i = −∞ otherwise, and ⊙ is the element-wise product. KENN learns the weights λ of the clauses in the background knowledge and the base network parameters w by minimizing a standard loss (e.g. cross-entropy) on a set of training data. If the training data is inconsistent with a constraint, the weight of that constraint will be close to zero. This intuitively implies that the latent knowledge present in the data is preferred to the knowledge
uniformly with a formula, and we require that they are both satisfied. A second difference between
KENN and LTN is the language: while LTN supports constraints written in full first-order logic,
constraints in KENN are limited to universally quantified clauses.
Systems that add knowledge to a neural network by adding a term to the loss function:. In [33], a frame-
work is proposed that learns simultaneously from labeled data and logical rules. The proposed
architecture is made of a student network fnn and a teacher network, denoted by q. The student
network is trained to do the actual predictions, while the teacher network encodes the information
of the logical rules. The transfer of information from the teacher to the student network is done by
defining a joint loss L for both networks as a convex combination of the loss of the student and the
teacher. If ỹ = f_nn(x | w) is the prediction of the student network for input x, the loss is defined as:
Systems that apply logical reasoning on the predictions of a base neural network:. The most notable
architecture in this category is DeepProbLog [41]. DeepProbLog extends the ProbLog framework
for probabilistic logic programming to allow the computation of probabilistic evidence from neural
networks. A ProbLog program is a logic program where facts and rules can be associated with
probability values. Such values can be learned. Inference in ProbLog to answer a query q is
performed by knowledge compilation into a function p(q | λ) that computes the probability that q is true according to the logic program with relative frequencies λ. In DeepProbLog, a neural network f_nn that outputs a probability distribution t = (t₁, . . . , t_n) over a set of atoms a = (a₁, . . . , a_n) is integrated into ProbLog by extending the logic program with a and the respective probabilities t. The probability of a query q is then given by p′(q | λ, f_nn(x | w)), where x is the input of f_nn and p′ is the function corresponding to the logic program extended with a. Given a set of queries q, input vectors x and ground-truths y for all the queries, training is performed by minimizing a loss function that measures the distance between the probabilities predicted by the logic program and the ground-truths, as follows:

L(y, p′(q | λ, f_nn(x | w)))
The most important difference between DeepProbLog and LTN concerns the logic on which they
are based. DeepProbLog adopts probabilistic logic programming. The output of the base neu-
ral network is interpreted as the probability of certain atoms being true. LTN instead is based
on many-valued logic. The predictions of the base neural network are interpreted as fuzzy truth-
values (though previous work [67] also formalizes Real Logic as handling probabilities with relaxed
constraints). This difference of logic leads to the second main difference between LTN and Deep-
Problog: their inference mechanism. DeepProblog performs probabilistic inference (based on
model counting) while LTN inference consists of computing the truth-value of a formula starting
from the truth-values of its atomic components. The two types of inference are incomparable.
However, computing the fuzzy truth-value of a formula is more efficient than model counting,
resulting in a more scalable inference task that allows LTN to use full first-order logic with function
symbols. In DeepProblog, to perform probabilistic inference, a closed-world assumption is made
and a function-free language is used. Typically, DeepProbLog clauses are compiled into Sentential
Decision Diagrams (SDDs) to accelerate inference considerably[36], although the compilation step
of clauses into the SDD circuit is still costly.
An approach that extends the predictions of a base neural network using abductive reasoning
is [14]. Given a neural network f_nn(x | w) that produces a crisp output y ∈ {0, 1}ⁿ for n predicates p₁, . . . , p_n and background knowledge in the form of a logic program p, the parameters w of f_nn are learned alongside a set of additional rules ΔC that define a new concept C w.r.t. p₁, . . . , p_n such that, for every object o with features x_o:
Various other loosely-coupled approaches have been proposed recently such as [44], where
image classification is carried out by a neural network in combination with reasoning from text
data for concept learning at a higher level of abstraction than what is normally possible with
pixel data alone. The proliferation of such approaches has prompted Henry Kautz to propose a
taxonomy for neurosymbolic AI in [34] (also discussed in [17]), including recent work combining
neural networks with graphical models and graph neural networks [4, 40, 58], statistical relational
learning [21, 55], and even verification of neural multi-agent systems [2, 8].
In this paper, we have specified the theory and exemplified the reach of Logic Tensor Networks
as a model and system for neurosymbolic AI. LTN is capable of combining approximate reasoning
and deep learning, knowledge and data.
For ML practitioners, learning in LTN (see Section 3.2) can be understood as optimizing under
first-order logic constraints relaxed into a loss function. For logic practitioners, learning is similar
to inductive inference: given a theory, learning makes generalizations from specific observations
obtained from data. Compared to other neuro-symbolic architectures (see Section 5), the LTN
framework has useful properties for gradient-based optimization (see Section 2.4) and a syntax
that supports many traditional ML tasks and their inductive biases (see Section 4), all while
remaining computationally efficient (see Table 1).
Section 3.4 discussed reasoning in LTN. Reasoning is normally under-specified within neural
networks. Logical reasoning is the task of proving if some knowledge follows from the facts which
are currently known. It is traditionally achieved semantically using model theory or syntactically
via a proof system. The current LTN framework approaches reasoning semantically, although
it should be possible to use LTN and querying alongside a proof system. When reasoning by
refutation in LTN, to find out if a statement φ is a logical consequence of given data and knowledge-base K, a proof by refutation attempts to find a semantic counterexample in which K is satisfied but φ is not. If the search fails, then φ is assumed to hold. This approach is efficient in LTN when we allow for a direct search to find counterexamples via gradient-descent optimization. It is assumed
that φ, the statement to prove or disprove, is known. Future work could explore automatically
inducing which statement φ to consider, possibly using syntactical reasoning in the process.
The paper formalizes Real Logic, the language supporting LTN. The semantics of Real Logic are
close to the semantics of Fuzzy FOL with the following major differences: 1) Real Logic domains are
typed and restricted to real numbers and real-valued tensors, 2) Real Logic variables are sequences
of fixed length, whereas FOL variables are a placeholder for any individual in a domain, 3) Real
Logic relations are interpreted as mathematical functions, whereas Fuzzy Logic relations
are interpreted as fuzzy set membership functions. Concerning the semantics of connectives and
quantifiers, some LTN implementations correspond to semantics for t-norm fuzzy logic, but not
all. For example, the conjunction operator in stable product semantics is not a t-norm, as pointed
out at the end of Section 2.4.
Integrative neural-symbolic approaches are known for either seeking to bring neurons into a
symbolic system (neurons into symbols) [41] or to bring symbols into a neural network (symbols
into neurons) [60]. LTN adopts the latter approach while maintaining a close link between the
symbols and their grounding into the neural network. The discussion around these two options -
neurons into symbols vs. symbols into neurons - is likely to take center stage in the debate around
neurosymbolic AI in the next decade. LTN and related approaches are well placed to play an
important role in this debate by offering a rich logical language tightly coupled with an efficient
distributed implementation into TensorFlow computational graphs.
The close connection between first-order logic and its implementation in LTN makes LTN very
suitable as a model for the neural-symbolic cycle [27, 29], which seeks to translate between neural
and symbolic representations. Such translations can take place at the level of the structure of a
neural network, given a symbolic language [27], or at the level of the loss functions, as done by
LTN and related approaches [13, 45, 46]. LTN opens up a number of promising avenues for further
research:
Firstly, a continual learning approach might allow one to start with very little knowledge, build
up and validate knowledge over time by querying the LTN network. Translations to and from
neural and symbolic representations will enable reasoning also to take place at the symbolic level
(e.g. alongside a proof system), as proposed recently in [70] with the goal of improving fairness of
the network model.
Secondly, LTN should be compared in large-scale practical use cases with other recent efforts
to add structure to neural networks such as the neuro-symbolic concept learner [44] and high-level
capsules which were used recently to learn the part-of relation [38], similarly to how LTN was used
for semantic image interpretation in [19].
Finally, LTN should also be compared with Tensor Product Representations, e.g. [59], which
show that state-of-the-art recurrent neural networks may fail at simple question-answering tasks,
despite achieving very high accuracy. Efforts in the area of transfer learning, mostly in computer
vision, which seek to model systematicity could also be considered a benchmark [5]. Experiments
using fewer data and therefore lower energy consumption, out-of-distribution extrapolation, and
knowledge-based transfer are all potentially suitable areas of application for LTN as a framework
for neurosymbolic AI based on learning from data and compositional knowledge.
Acknowledgement
We would like to thank Benedikt Wagner for his comments and a number of productive dis-
cussions on continual learning, knowledge extraction and reasoning in LTNs.
References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Good-
fellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz
Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore,
Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Ku-
nal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals,
Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow:
Large-scale machine learning on heterogeneous systems, 2015. Software available from ten-
sorflow.org.
[2] Michael Akintunde, Elena Botoeva, Panagiotis Kouvaros, and Alessio Lomuscio. Verifying
strategic abilities of neural multi-agent systems. In Proceedings of 17th International Conference
on Principles of Knowledge Representation and Reasoning, KR2020, Rhodes, Greece, September 2020.
[3] Samy Badreddine and Michael Spranger. Injecting Prior Knowledge for Transfer Learning
into Reinforcement Learning Algorithms using Logic Tensor Networks. arXiv:1906.06576 [cs,
stat], June 2019. arXiv: 1906.06576.
[4] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In
Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16,
pages 4509–4517, USA, 2016. Curran Associates Inc.
[5] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sebastien Lachapelle,
Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to
disentangle causal mechanisms. In International Conference on Learning Representations, 2020.
[6] Federico Bianchi and Pascal Hitzler. On the capabilities of logic tensor networks for deductive reasoning. In Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019), Stanford University, Palo Alto, California, USA, March 25-27, 2019.
[7] Federico Bianchi, Matteo Palmonari, Pascal Hitzler, and Luciano Serafini. Complementing
logical reasoning with sub-symbolic commonsense. In International Joint Conference on Rules
and Reasoning, pages 161–170. Springer, 2019.
[8] Rafael Borges, Artur d'Avila Garcez, and Luís Lamb. Learning and representing temporal knowledge in recurrent networks. IEEE Transactions on Neural Networks, 22:2409–2421, December 2011.
[9] Liber Běhounek, Petr Cintula, and Petr Hájek. Introduction to mathematical fuzzy logic. In
Petr Cintula, Petr Hájek, and Carles Noguera, editors, Handbook of Mathematical Fuzzy Logic,
Volume 1, volume 37 of Studies in Logic, Mathematical Logic and Foundations, pages 1–102. College
Publications, 2011.
[10] N. A. Campbell and R. J. Mahon. A multivariate study of variation in two species of rock
crab of the genus Leptograpsus. Australian Journal of Zoology, 22(3):417–425, 1974. Publisher:
CSIRO PUBLISHING.
[11] Andres Campero, Aldo Pareja, Tim Klinger, Josh Tenenbaum, and Sebastian Riedel. Logical
rule induction and theory learning using neural theorem proving. CoRR, abs/1809.02193,
2018.
[12] Benhui Chen, Xuefen Hong, Lihua Duan, and Jinglu Hu. Improving multi-label classification
performance by label constraints. In The 2013 International Joint Conference on Neural Networks
(IJCNN), pages 1–5. IEEE, 2013.
[13] William W. Cohen, Fan Yang, and Kathryn Mazaitis. Tensorlog: A probabilistic database
implemented using deep-learning infrastructure. J. Artif. Intell. Res., 67:285–325, 2020.
[14] W.-Z. Dai, Q. Xu, Y. Yu, and Z.-H. Zhou. Bridging machine learning and logical reasoning
by abductive learning. In Proceedings of the 33rd International Conference on Neural Information
Processing Systems, NeurIPS’19, USA, 2019. Curran Associates Inc.
[15] Alessandro Daniele and Luciano Serafini. Knowledge enhanced neural networks. In Pacific
Rim International Conference on Artificial Intelligence, pages 542–554. Springer, 2019.
[16] Artur d’Avila Garcez, Marco Gori, Luís C. Lamb, Luciano Serafini, Michael Spranger, and
Son N. Tran. Neural-symbolic computing: An effective methodology for principled integration
of machine learning and reasoning. FLAP, 6(4):611–632, 2019.
[17] Artur d’Avila Garcez and Luis C. Lamb. Neurosymbolic AI: The 3rd wave, 2020.
[18] Ivan Donadello and Luciano Serafini. Compensating supervision incompleteness with prior
knowledge in semantic image interpretation. In 2019 International Joint Conference on Neural
Networks (IJCNN), pages 1–8. IEEE, 2019.
[19] Ivan Donadello, Luciano Serafini, and Artur d’Avila Garcez. Logic tensor networks for se-
mantic image interpretation. In Proceedings of the Twenty-Sixth International Joint Conference on
Artificial Intelligence, IJCAI-17, pages 1596–1602, 2017.
[20] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[21] Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. J. Artif.
Intell. Res., 61:1–64, 2018.
[22] Ronald Fagin, Ryan Riegel, and Alexander Gray. Foundations of reasoning with uncertainty
via real-valued logics, 2020.
[23] Marc Fischer, Mislav Balunovic, Dana Drachsler-Cohen, Timon Gehr, Ce Zhang, and Martin
Vechev. Dl2: Training and querying neural networks with logic. In International Conference on
Machine Learning, pages 1931–1941, 2019.
[24] Manoel Franca, Gerson Zaverucha, and Artur d’Avila Garcez. Fast relational learning using
bottom clause propositionalization with artificial neural networks. Machine Learning, 94:81–
104, January 2014.
[25] Dov M. Gabbay and John Woods, editors. The Many Valued and Nonmonotonic Turn in Logic,
volume 8 of Handbook of the History of Logic. Elsevier, 2007.
[26] Artur d’Avila Garcez, Dov M. Gabbay, and Krysia B. Broda. Neural-Symbolic Learning Systems:
Foundations and Applications. Springer-Verlag, Berlin, Heidelberg, 2002.
[27] Artur d’Avila Garcez, Luís C. Lamb, and Dov M. Gabbay. Neural-Symbolic Cognitive Reasoning.
Springer Publishing Company, Incorporated, 1 edition, 2008.
[28] Petr Hájek. Metamathematics of Fuzzy Logic. Kluwer Academic Publishers, 1998.
[29] Barbara Hammer and Pascal Hitzler, editors. Perspectives of Neural-Symbolic Integration, vol-
ume 77 of Studies in Computational Intelligence. Springer, 2007.
[30] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–
346, 1990.
[31] Patrick Hohenecker and Thomas Lukasiewicz. Ontology reasoning with deep neural net-
works. Journal of Artificial Intelligence Research, 68:503–540, 2020.
[32] Steffen Hölldobler and Franz J. Kurfess. CHCL - A connectionist inference system. In Bertram
Fronhöfer and Graham Wrightson, editors, Parallelization in Inference Systems, International
Workshop, Dagstuhl Castle, Germany, December 17-18, 1990, Proceedings, volume 590 of Lecture
Notes in Computer Science, pages 318–342. Springer, 1990.
[33] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep
neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 2410–2420, Berlin, Germany, August
2016. Association for Computational Linguistics.
[34] Henry Kautz. The Third AI Summer, AAAI Robert S. Engelmore Memorial Lecture, Thirty-
fourth AAAI Conference on Artificial Intelligence, New York, NY, February 10, 2020.
[35] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization.
arXiv:1412.6980 [cs], January 2017.
[36] Doga Kisa, Guy Van den Broeck, Arthur Choi, and Adnan Darwiche. Probabilistic sentential
decision diagrams. In Proceedings of the 14th International Conference on Principles of Knowledge
Representation and Reasoning (KR), July 2014.
[37] Erich Peter Klement, Radko Mesiar, and Endre Pap. Triangular Norms, volume 8 of Trends in
Logic. Springer Netherlands, Dordrecht, 2000.
[38] Adam Kosiorek, Sara Sabour, Yee Whye Teh, and Geoffrey E Hinton. Stacked capsule autoen-
coders. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett,
editors, Advances in Neural Information Processing Systems 32, pages 15512–15522. Curran As-
sociates, Inc., 2019.
[39] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep
convolutional neural networks. In Proceedings of the 25th International Conference on Neural
Information Processing Systems - Volume 1, NIPS’12, page 1097–1105, Red Hook, NY, USA, 2012.
Curran Associates Inc.
[40] Luís C. Lamb, Artur d’Avila Garcez, Marco Gori, Marcelo O. R. Prates, Pedro H. C. Avelar,
and Moshe Y. Vardi. Graph neural networks meet neural-symbolic computing: A survey
and perspective. In Christian Bessiere, editor, Proceedings of the Twenty-Ninth International Joint
Conference on Artificial Intelligence, IJCAI 2020 [scheduled for July 2020, Yokohama, Japan, postponed
due to the Corona pandemic], pages 4877–4884. ijcai.org, 2020.
[41] Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De
Raedt. Deepproblog: Neural probabilistic logic programming. In Proceedings of the 32nd
International Conference on Neural Information Processing Systems, NeurIPS’18, pages 3753–3763,
USA, 2018. Curran Associates Inc.
[42] Francesco Manigrasso, Filomeno Davide Miro, Lia Morra, and Fabrizio Lamberti. Faster-LTN:
a neuro-symbolic, end-to-end object detection architecture. arXiv:2107.01877 [cs], July 2021.
[43] Vasco Manquinho, Joao Marques-Silva, and Jordi Planes. Algorithms for weighted boolean
optimization. In International conference on theory and applications of satisfiability testing, pages
495–508. Springer, 2009.
[44] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. The neuro-
symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision.
CoRR, abs/1904.12584, 2019.
[45] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori. Constraint-
based visual generation. In Igor V. Tetko, Vera Kurková, Pavel Karpov, and Fabian J. Theis,
editors, Artificial Neural Networks and Machine Learning - ICANN 2019: Image Processing - 28th
International Conference on Artificial Neural Networks, Munich, Germany, September 17-19, 2019,
Proceedings, Part III, volume 11729 of Lecture Notes in Computer Science, pages 565–577. Springer,
2019.
[46] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori. Integrating
learning and reasoning with deep logic models. In Machine Learning and Knowledge Discovery
in Databases - European Conference, ECML PKDD 2019, Würzburg, Germany, September 16-20,
2019, Proceedings, Part II, volume 11907 of Lecture Notes in Computer Science, pages 517–532.
Springer, 2019.
[47] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori. Lyrics: A gen-
eral interface layer to integrate logic inference and deep learning. In Joint European Conference
on Machine Learning and Knowledge Discovery in Databases, pages 283–298. Springer, 2019.
[48] Giuseppe Marra and Ondřej Kuželka. Neural markov logic networks. arXiv preprint
arXiv:1905.13462, 2019.
[49] Pasquale Minervini, Sebastian Riedel, Pontus Stenetorp, Edward Grefenstette, and Tim Rock-
täschel. Learning reasoning strategies in end-to-end differentiable proving, 2020.
[50] Stephen H. Muggleton, Dianhuan Lin, Niels Pahlavi, and Alireza Tamaddoni-Nezhad. Meta-
interpretive learning: Application to grammatical inference. Mach. Learn., 94(1):25–49, January
2014.
[51] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London,
Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
[52] Gadi Pinkas. Reasoning, nonmonotonicity and learning in connectionist networks that capture
propositional knowledge. Artif. Intell., 77(2):203–247, 1995.
[53] Meng Qu and Jian Tang. Probabilistic logic neural networks for reasoning. In Advances in
Neural Information Processing Systems, pages 7712–7722, 2019.
[54] Luc De Raedt, Sebastijan Dumančić, Robin Manhaeve, and Giuseppe Marra. From statistical
relational to neuro-symbolic artificial intelligence, 2020.
[55] Matthew Richardson and Pedro Domingos. Markov logic networks. Mach. Learn., 62(1-2):107–
136, February 2006.
[56] Ryan Riegel, Alexander Gray, Francois Luus, Naweed Khan, Ndivhuwo Makondo, Is-
mail Yunus Akhalwaya, Haifeng Qian, Ronald Fagin, Francisco Barahona, Udit Sharma, Sha-
jith Ikbal, Hima Karanam, Sumit Neelam, Ankita Likhyani, and Santosh Srivastava. Logical
Neural Networks. arXiv:2006.13155 [cs], June 2020.
[57] Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. In Advances in
Neural Information Processing Systems, pages 3788–3800, 2017.
[58] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfar-
dini. The graph neural network model. Trans. Neur. Netw., 20(1):61–80, January 2009.
[59] Imanol Schlag and Jürgen Schmidhuber. Learning to reason with third-order tensor products.
CoRR, abs/1811.12143, 2018.
[60] Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and
Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem
solving. CoRR, abs/1910.06611, 2019.
[61] Luciano Serafini and Artur d’Avila Garcez. Logic tensor networks: Deep learning and logical
reasoning from data and knowledge. arXiv preprint arXiv:1606.04422, 2016.
[62] Luciano Serafini and Artur d’Avila Garcez. Learning and reasoning with logic tensor networks.
In Conference of the Italian Association for Artificial Intelligence, pages 334–348. Springer, 2016.
[63] Lokendra Shastri. Advances in SHRUTI-A neurally motivated model of relational knowledge
representation and rapid inference using temporal synchrony. Appl. Intell., 11(1):79–108, 1999.
[64] Yun Shi. A deep study of fuzzy implications. PhD thesis, Ghent University, 2009.
[65] Paul Smolensky and Géraldine Legendre. The Harmonic Mind: From Neural Computation to
Optimality-Theoretic Grammar, Volume I: Cognitive Architecture (Bradford Books). The MIT Press,
2006.
[66] Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezny, Steven Schockaert, and Ondrej Kuzelka.
Lifted relational neural networks: Efficient learning of latent relational structures. Journal of
Artificial Intelligence Research, 62:69–100, 2018.
[67] Emile van Krieken, Erman Acar, and Frank van Harmelen. Semi-Supervised Learning using
Differentiable Reasoning. arXiv:1908.04700 [cs], August 2019.
[68] Emile van Krieken, Erman Acar, and Frank van Harmelen. Analyzing Differentiable Fuzzy
Implications. In Proceedings of the 17th International Conference on Principles of Knowledge Repre-
sentation and Reasoning, pages 893–903, 9 2020.
[69] Emile van Krieken, Erman Acar, and Frank van Harmelen. Analyzing Differentiable Fuzzy
Logic Operators. arXiv:2002.06100 [cs], February 2020.
[70] Benedikt Wagner and Artur d’Avila Garcez. Neural-Symbolic Integration for Fairness in
AI. Proceedings of the AAAI Spring Symposium: Combining Machine Learning with Knowledge
Engineering 2021, page 14, 2021.
[71] Po-Wei Wang, Priya L Donti, Bryan Wilder, and Zico Kolter. Satnet: Bridging deep learning
and logical reasoning using a differentiable satisfiability solver. arXiv preprint arXiv:1905.12149,
2019.
[72] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A Semantic Loss Function
for Deep Learning with Symbolic Knowledge. In International Conference on Machine Learning,
pages 5502–5511. PMLR, July 2018. ISSN: 2640-3498.
Appendix A. Implementation Details
The LTN library is implemented in TensorFlow 2 [1] and is available from GitHub.29 Every
logical operator is grounded using TensorFlow primitives, so the LTN code directly implements
a TensorFlow computational graph. Owing to TensorFlow's built-in optimizations, LTN is
relatively efficient while providing the expressive power of FOL.
Table A.3 gives an overview of the network architectures used to obtain the results of the
examples in Section 4. The LTN repository includes the code for these examples. Unless explicitly
mentioned otherwise, the reported results are averaged over 10 runs and reported with 95%
confidence intervals. Every example uses the stable real product configuration to approximate the
Real Logic operators, and the Adam optimizer [35] with a learning rate of 0.001 to train the
parameters.
29 https://github.com/logictensornetworks/logictensornetworks
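As an illustration of how logical operators become ordinary differentiable computations, the following sketch implements the product-configuration connectives and the ApME universal aggregator in plain Python. The function names and the example formula are ours, not the library's API; in the actual implementation these are TensorFlow ops, so the satisfaction value is differentiable w.r.t. network parameters.

```python
# Illustrative sketch only (plain Python, not the LTN/TensorFlow code):
# grounding Real Logic connectives with the product configuration.

def Not(x):                 # standard negation N_S
    return 1.0 - x

def And(x, y):              # product t-norm T_P
    return x * y

def Implies(x, y):          # Reichenbach implication I_R
    return 1.0 - x + x * y

def Forall(xs, p=2):        # generalized mean w.r.t. the errors, A_pME
    n = len(xs)
    return 1.0 - (sum((1.0 - x) ** p for x in xs) / n) ** (1.0 / p)

# Satisfaction of "forall x: P(x) -> Q(x)" over sample truth values
P = [0.9, 0.8, 0.95]
Q = [0.85, 0.9, 0.9]
sat = Forall([Implies(p, q) for p, q in zip(P, Q)])
print(round(sat, 3))
```

Maximizing `sat` by gradient ascent is, in essence, what LTN training does.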
Appendix B. Fuzzy Operators and Properties
This appendix presents the most common operators used in the fuzzy logic literature, together
with some noteworthy properties [28, 37, 64, 69].
Appendix B.3. Disjunction
Definition 10. A disjunction is a function D : [0, 1]^2 → [0, 1] that satisfies at least:
D1. boundary conditions: D(0, 0) = 0 and D(0, 1) = D(1, 0) = D(1, 1) = 1,
D2. monotonically increasing: for all (x, y, z) ∈ [0, 1]^3, if x ≤ y, then D(x, z) ≤ D(y, z) and
D(z, x) ≤ D(z, y).
Note that the only distributive pair of t-norm and t-conorm is TM and SM; that is, the only pair
for which the t-norm distributes over the t-conorm and vice versa.
Definition 12. The N-dual t-conorm S of a t-norm T w.r.t. a strict fuzzy negation N is defined as
S(x, y) = N(T(N(x), N(y))).
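Each standard t-conorm is the NS-dual of its companion t-norm, i.e. S(x, y) = N(T(N(x), N(y))) with N(x) = 1 - x. A small numerical check of this duality (our own sketch) on a grid:

```python
# Verify S(x, y) = N(T(N(x), N(y))) for the three standard (T, S) pairs
# with the standard negation N_S(x) = 1 - x.

N = lambda x: 1.0 - x
pairs = [
    (lambda x, y: min(x, y),            lambda x, y: max(x, y)),        # T_M, S_M
    (lambda x, y: x * y,                lambda x, y: x + y - x * y),    # T_P, S_P
    (lambda x, y: max(x + y - 1, 0.0),  lambda x, y: min(x + y, 1.0)),  # T_L, S_L
]
grid = [i / 10 for i in range(11)]
for T, S in pairs:
    for x in grid:
        for y in grid:
            assert abs(S(x, y) - N(T(N(x), N(y)))) < 1e-9
print("duality verified")
```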
Definition 14. There are two main classes of implications generated from the fuzzy logic operators
for negation, conjunction and disjunction.
S-Implications Strong implications are defined using x → y = ¬x ∨ y (material implication).
R-Implications Residuated implications are defined using x → y = sup{z ∈ [0, 1] | x ∧ z ≤ y}. One
way of understanding this approach is as a generalization of modus ponens: the consequent is at
least as true as the (fuzzy) conjunction of the antecedent and the implication.
Example 6. Popular fuzzy implications and their classes are presented in Table B.5.
Name                I(x, y)                  S-Implication     R-Implication
Kleene-Dienes IKD   max(1 - x, y)            S = SM, N = NS    -
Goedel IG           1 if x ≤ y, else y       -                 T = TM
Reichenbach IR      1 - x + xy               S = SP, N = NS    -
Goguen IP           1 if x ≤ y, else y/x     -                 T = TP
Łukasiewicz ILuk    min(1 - x + y, 1)        S = SL, N = NS    T = TL

Table B.5: Popular fuzzy implications and their classes. Strong implications (S-Implications) are defined using a fuzzy
negation and a fuzzy disjunction. Residuated implications (R-Implications) are defined using a fuzzy conjunction.
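The two constructions of Definition 14 can be checked against the closed forms in Table B.5. The sketch below (our own, with the R-implication supremum approximated on a finite grid) derives Kleene-Dienes and Reichenbach as S-implications and Goedel as an R-implication:

```python
# Deriving implications from their defining constructions and
# checking them against the closed forms of Table B.5.

def s_implication(S, N):
    return lambda x, y: S(N(x), y)          # x -> y := not x  or  y

def r_implication(T, step=1000):
    # x -> y := sup{ z in [0,1] | T(x, z) <= y }, approximated on a grid
    def I(x, y):
        return max(z / step for z in range(step + 1) if T(x, z / step) <= y)
    return I

N_S = lambda x: 1.0 - x
I_KD = s_implication(lambda x, y: max(x, y), N_S)         # Kleene-Dienes
I_R  = s_implication(lambda x, y: x + y - x * y, N_S)     # Reichenbach
I_G  = r_implication(lambda x, y: min(x, y))              # Goedel

assert I_KD(0.8, 0.3) == max(1 - 0.8, 0.3)                  # max(1-x, y)
assert abs(I_R(0.8, 0.3) - (1 - 0.8 + 0.8 * 0.3)) < 1e-9    # 1 - x + xy
assert I_G(0.8, 0.3) == 0.3                                 # y  when x > y
assert I_G(0.3, 0.8) == 1.0                                 # 1  when x <= y
```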
Example 7. Candidates for universal quantification ∀ can be obtained from t-norms via the
recursion AT(x1) = x1 and AT(x1, ..., xn) = T(x1, AT(x2, ..., xn)). Similarly, candidates for
existential quantification ∃ can be obtained from t-conorms via AS(x1) = x1 and
AS(x1, ..., xn) = S(x1, AS(x2, ..., xn)).
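A minimal sketch of this recursion (our own code): extending the product t-norm and its dual t-conorm to n arguments.

```python
# Extend a binary t-norm/t-conorm to an n-ary aggregator by
# A(x1, ..., xn) = op(x1, A(x2, ..., xn)).
from functools import reduce

def aggregate(op, xs):
    return reduce(lambda acc, x: op(x, acc), reversed(xs))

T_P = lambda x, y: x * y               # product t-norm  -> candidate for forall
S_P = lambda x, y: x + y - x * y       # probabilistic sum -> candidate for exists

xs = [0.9, 0.8, 0.7]
assert abs(aggregate(T_P, xs) - 0.9 * 0.8 * 0.7) < 1e-9
# A_{S_P} equals 1 - prod(1 - x_i), by De Morgan duality with N_S
assert abs(aggregate(S_P, xs) - (1 - 0.1 * 0.2 * 0.3)) < 1e-9
```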
                               (TM, SM, NS)     (TP, SP, NS)     (TL, SL, NS)
                               IKD      IG      IR       IP      ILuk
Commutativity of ∧, ∨          ✓        ✓       ✓        ✓       ✓
Associativity of ∧, ∨          ✓        ✓       ✓        ✓       ✓
Distributivity of ∧ over ∨     ✓        ✓
Distributivity of ∨ over ∧     ✓        ✓
Distrib. of → over ∨, ∧        ✓        ✓
Double negation ¬¬p = p        ✓        ✓       ✓        ✓       ✓
Law of excluded middle                                           ✓
Law of non-contradiction                                         ✓
De Morgan's laws               ✓        ✓       ✓        ✓       ✓
Material Implication           ✓                ✓                ✓
Contraposition                 ✓                ✓                ✓
Here ApM is the generalized mean, and ApME can be understood as the generalized mean
measured w.r.t. the errors:

ApM(x1, ..., xn) = ((1/n) ∑_i xi^p)^(1/p),
ApME(x1, ..., xn) = 1 - ((1/n) ∑_i (1 - xi)^p)^(1/p).

That is, ApME measures the power of the deviation of each value from the ground truth 1. A few
particular values of p yield special cases of aggregators. Notably:

• lim_{p→+∞} ApM(x1, ..., xn) = max(x1, ..., xn),
• lim_{p→-∞} ApM(x1, ..., xn) = min(x1, ..., xn),
• lim_{p→+∞} ApME(x1, ..., xn) = min(x1, ..., xn),
• lim_{p→-∞} ApME(x1, ..., xn) = max(x1, ..., xn).

These "smooth" min (resp. max) approximators are good candidates for ∀ (resp. ∃) in a fuzzy
context. The value of p leaves more or less room for outliers, depending on the use case and its
needs. Note that ApME and ApM are related in the same way that ∃ and ∀ are related via the
definition ∃ ≡ ¬∀¬, where ¬ is approximated by the standard negation NS(x) = 1 - x.

We propose to use ApME with p ≥ 1 to approximate ∀, and ApM with p ≥ 1 to approximate
∃. When p ≥ 1, these operators resemble the lp norm of a vector u = (u1, u2, ..., un), where
||u||p = (|u1|^p + |u2|^p + ... + |un|^p)^(1/p). In our case, many properties of the lp norm carry over to
ApM (positive homogeneity, triangular inequality, ...).
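The limit behaviour above is easy to observe numerically. The sketch below (our own) implements ApM and ApME and checks that large p pushes them towards max and min respectively:

```python
# Generalized mean A_pM and generalized mean of errors A_pME.

def A_pM(xs, p):
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def A_pME(xs, p):
    return 1 - A_pM([1 - x for x in xs], p)

xs = [0.2, 0.5, 0.9]
assert abs(A_pM(xs, 1) - sum(xs) / 3) < 1e-9    # p = 1: arithmetic mean
assert abs(A_pM(xs, 100) - max(xs)) < 0.05      # large p: smooth max (exists)
assert abs(A_pME(xs, 100) - min(xs)) < 0.05     # large p: smooth min (forall)
assert A_pME(xs, 2) < A_pME(xs, 1)              # larger p weights low truths more
```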
Appendix C. Analyzing Gradients of Generalized Mean Aggregators
Van Krieken et al. [69] show that some operators used in fuzzy logics are unsuitable for use in a
differentiable learning setting. Three types of gradient problem commonly arise with fuzzy logic operators.
Single-Passing The derivatives of some operators are non-null for only one argument. The gradi-
ents propagate to only one input at a time.
Vanishing Gradients The gradients vanish on some part of the domain. The learning does not
update inputs that are in the vanishing domain.
Exploding Gradients Large error gradients accumulate and result in unstable updates.
Tables C.7 and C.8 summarize their conclusions for the most common operators. In addition, we
highlight here exploding-gradient issues that arise experimentally in ApM and ApME and that are
not covered in the original report. Given the truth values of n propositions (x1, ..., xn) in [0, 1]^n:
1. ApM(x1, ..., xn) = ((1/n) ∑_i xi^p)^(1/p)

The partial derivatives are

∂ApM(x1, ..., xn)/∂xi = (1/n) ((1/n) ∑_{j=1}^{n} xj^p)^((1/p)-1) xi^(p-1).

When p > 1, the operator weights inputs with a higher truth value more, i.e. their partial
derivative is also higher, and it suits existential quantification. When p < 1, the operator
weights inputs with a lower truth value more and suits universal quantification.

Exploding Gradients When p > 1, if ∑_{j=1}^{n} xj^p → 0, then (∑_{j=1}^{n} xj^p)^((1/p)-1) → ∞ and the
gradients explode. When p < 1, if xi → 0, then xi^(p-1) → ∞.
2. ApME(x1, ..., xn) = 1 - ((1/n) ∑_i (1 - xi)^p)^(1/p)

The partial derivatives are

∂ApME(x1, ..., xn)/∂xi = (1/n) ((1/n) ∑_{j=1}^{n} (1 - xj)^p)^((1/p)-1) (1 - xi)^(p-1).

When p > 1, the operator weights inputs with a lower truth value more, i.e. their partial
derivative is also higher, and it suits universal quantification. When p < 1, the operator
weights inputs with a higher truth value more and suits existential quantification.

Exploding Gradients When p > 1, if ∑_{j=1}^{n} (1 - xj)^p → 0, then (∑_{j=1}^{n} (1 - xj)^p)^((1/p)-1) → ∞ and the
gradients explode. When p < 1, if 1 - xi → 0, then (1 - xi)^(p-1) → ∞.
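The blow-ups derived above can be reproduced with the closed-form partial derivative of ApM; the helper below is our own sketch, not library code:

```python
# Exploding gradients of A_pM, using its closed-form partial derivative.

def dA_pM_dxi(xs, i, p):
    # d A_pM / d x_i = (1/n) * ((1/n) sum_j x_j^p)^(1/p - 1) * x_i^(p-1)
    n = len(xs)
    mean_p = sum(x ** p for x in xs) / n
    return (1 / n) * mean_p ** (1 / p - 1) * xs[i] ** (p - 1)

# p < 1: the partial derivative itself explodes as x_i -> 0
assert dA_pM_dxi([1e-8, 0.5], 0, p=0.5) > 1e3

# p > 1: the intermediate factor (sum_j x_j^p)^(1/p - 1) explodes as the
# sum approaches 0 -- a backward pass can materialize it as inf/NaN
factor = lambda xs, p: sum(x ** p for x in xs) ** (1 / p - 1)
assert factor([1e-8, 1e-8], p=2) > 1e6
```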
We propose the stable product configuration described below, which does not suffer from any of
the aforementioned gradient problems.
                   Single-Passing   Vanishing   Exploding
Goedel (minimum)
  TM, SM           ✗
  IKD              ✗
  IG               ✗                ✗
Goguen (product)
  TP, SP                            (✗)
  IR                                (✗)
  IP                                ✗           (✗)
Łukasiewicz
  TL, SL                            ✗
  ILuk                              ✗

Table C.7: Gradient problems for some binary connectives. (✗) means that the problem only appears on an edge case.
Table C.8: Gradient problems for some aggregators. (7) means that the problem only appears on an edge case.
In this stable product configuration, NS is the operator for negation, TP′ for conjunction, SP′ for
disjunction, IP′ for implication, A′pM for existential aggregation, and A′pME for universal aggregation.