diff --git a/docs/db.pdf b/docs/db.pdf index 8b2b5e8..0d4a93e 100644 Binary files a/docs/db.pdf and b/docs/db.pdf differ diff --git a/docs/db.tex b/docs/db.tex index 584306e..36a7e33 100644 --- a/docs/db.tex +++ b/docs/db.tex @@ -8,6 +8,7 @@ \usepackage{hyperref} \usepackage{url} \usepackage{amsthm} +\usepackage{amssymb} \usepackage{graphicx} \usepackage{comment} \usepackage{xcolor} @@ -27,7 +28,7 @@ % as long as the \iclrfinalcopy macro remains commented out below. % Non-anonymous submissions will be rejected without review. -\author{Ian Wright} +\author{Ian Wright\thanks{wrighti@acm.org}} % The \author macro works with any number of authors. There are two commands % used to separate the names and addresses of multiple authors: \And and \AND. @@ -45,6 +46,20 @@ \newtheorem{prop}{Proposition} \newtheorem{lemma}{Lemma} +\makeatletter +\newcommand{\tleft}{\mathrel\triangleleft} +\newcommand{\tright}{\mathrel\triangleright} +\DeclareRobustCommand{\btleft}{\mathrel{\mathpalette\btlr@\blacktriangleleft}} +\DeclareRobustCommand{\btright}{\mathrel{\mathpalette\btlr@\blacktriangleright}} + +\newcommand{\btlr@}[2]{% + \begingroup + \sbox\z@{$\m@th#1\triangleright$}% + \sbox\tw@{\resizebox{1.1\wd\z@}{1.1\ht\z@}{\raisebox{\depth}{$\m@th#1\mkern-1mu#2$}}}% + \ht\tw@=\ht\z@ \dp\tw@=\dp\z@ \wd\tw@=\wd\z@ + \copy\tw@ + \endgroup +} \iclrfinalcopy % Uncomment for camera-ready version, but NOT for submission. \begin{document} @@ -113,12 +128,12 @@ \section{$\partial\mathbb{B}$ nets}\label{sec:db-nets} \begin{definition}[Hard-equivalence] A function, $f: [0,1]^n \rightarrow [0,1]^m$, is {\em hard-equivalent} to a discrete function, $g: \{1,0\}^n \rightarrow \{1,0\}^m$, if \begin{equation*} - \operatorname{harden}(f(\operatorname{harden}({\bf x}))) = g(\operatorname{harden}({\bf x})) + \operatorname{harden}(f({\bf x})) = g(\operatorname{harden}({\bf x})) \end{equation*} -for all ${\bf x} \in [0,1]^{n}$. For shorthand write $f \equiv g$. + for all ${\bf x} \in \{(x_{1}, \dots, x_{n}) ~|~ x_{i} \in [0,1] \setminus \{1/2\}\}$. For shorthand write $f \btright g$. \end{definition} -Neural networks are typically composed of nonlinear activation functions (for representational generality) that are strictly monotonic (so gradients always exist that link changes in inputs to outputs without local minima) and differentiable (so gradients reliably represent the local loss surface). However, activation functions that are monotonic but not strictly (so some gradients are zero) and differentiable almost everywhere (so some gradients are undefined) can also work, e.g. RELU \citep{10.5555/3104322.3104425}. $\partial \mathbb{B}$ nets are composed from `activation' functions that also satisfy these properties plus the additional property of hard-equivalence to a boolean function (and natural generalisations). +Neural networks are typically composed of nonlinear activation functions (for representational generality) that are strictly monotonic (so gradients always exist that link changes in inputs to outputs without local minima) and smooth (so gradients reliably represent the local loss surface). However, activation functions that are monotonic but not strictly (so some gradients are zero) and differentiable almost everywhere (so some gradients are undefined) can also work, e.g. RELU \citep{10.5555/3104322.3104425}. $\partial \mathbb{B}$ nets are composed from `activation' functions that also satisfy these properties plus the additional property of hard-equivalence to a boolean function (and natural generalisations). 
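To make the definition concrete, the following minimal Python sketch (an illustration only, not a full implementation; it assumes $\operatorname{harden}$ thresholds soft bits at $1/2$ and samples inputs at random) checks hard-equivalence empirically, anticipating the operator comparison in the next subsection:

\begin{lstlisting}[language=Python,style=mystyle,frame=single]
# Minimal sketch: empirical hard-equivalence check, assuming harden(x) = [x > 1/2].
import random

def harden(x):
    return 1 if x > 0.5 else 0

def hard_equivalent(f, g, trials=100000):
    # Test harden(f(x, y)) == g(harden(x), harden(y)) on random soft bits != 1/2.
    for _ in range(trials):
        x, y = random.random(), random.random()
        if x == 0.5 or y == 0.5:
            continue
        if harden(f(x, y)) != g(harden(x), harden(y)):
            return False
    return True

AND = lambda a, b: a & b
print(hard_equivalent(min, AND))                 # True: min(x,y) hardens like x AND y
print(hard_equivalent(lambda x, y: x * y, AND))  # False: harden(0.6*0.6) = 0 but 1 AND 1 = 1
\end{lstlisting}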
We now turn to specifying the kind of `activation' functions used by $\partial \mathbb{B}$ nets. \begin{figure}[t!] \centering @@ -129,16 +144,22 @@ \section{$\partial\mathbb{B}$ nets}\label{sec:db-nets} \subsection{Learning to negate} -We aim to learn to negate a boolean value, $x$, or simply leave it unaltered. Represent this decision by a boolean weight, $w$, where low $w$ means negate and high $w$ means do not negate. The boolean function that meets this requirement is $\neg(x \oplus w)$. However, this function is not differentiable. Define the differentiable function, +Say we aim to learn to negate a boolean value, $x$, or leave it unaltered. Represent this decision by a boolean weight, $w$, where low $w$ means negate and high $w$ means do not negate. The boolean function that meets this requirement is $\neg(x \oplus w)$. However, this function is not differentiable. Define the differentiable function, \begin{equation*} \begin{aligned} \partial_{\neg}: [0, 1]^{2} &\to [0,1], \\ (w, x) &\mapsto 1 - w + x (2w - 1)\text{,} \end{aligned} \end{equation*} -where $\partial_{\neg}(w, x) \equiv \neg(x \oplus w)$ (see proposition \ref{prop:not}). +where $\partial_{\neg}(w, x) \btright \neg(x \oplus w)$ (see proposition \ref{prop:not}). + +There are many kinds of differentiable fuzzy logic operators (see \cite{VANKRIEKEN2022103602} for a review). So why this functional form? Product logics, where $f(x,y) = x y$ is a soft version of $x \wedge y$, although hard-equivalent at extreme values, e.g. $f(1,1)=1$ and $f(0,1)=0$, are not hard-equivalent at intermediate values, e.g. $f(0.6, 0.6) = 0.36$, which hardens to $\operatorname{False}$ not $\operatorname{True}$. G\"{o}del-style $\operatorname{min}$ and $\operatorname{max}$ functions, although hard-equivalent over the entire soft-bit range, i.e. $\operatorname{min}(x,y) \btright x \wedge y$ and $\operatorname{max}(x,y) \btright x \vee y$, are gradient-sparse in the sense that their outputs do not always vary when any input changes, e.g. $\frac{\partial}{\partial x} \operatorname{max}(x,y) = 0$ when $(x,y)=(0.1, 0.9)$. So although the composite function $\operatorname{max}(\operatorname{min}(w, x), \operatorname{min}(1-w, 1-x))$ is differentiable and $\btright \neg(x \oplus w)$ it does not always backpropagate error to its inputs. In contrast, $\partial_{\neg}$ always backpropagates error to its inputs because it is a gradient-rich function (see figure \ref{fig:gradient-rich}). + +\begin{definition}[Gradient-rich] + A function, $f: [0,1]^n \rightarrow [0,1]^m$, is {\em gradient-rich} if $\frac{\partial f({\bf x})}{\partial x_{i}} \neq {\bf 0}$ for all ${\bf x} \in \{(x_{1}, \dots, x_{n}) ~|~ x_{i} \in [0,1] \setminus \{1/2\}\}$. +\end{definition} -There are many kinds of differentiable fuzzy logic operators (see \cite{VANKRIEKEN2022103602} for a review). So why this functional form? Product logics, where $f(x,y) = x y$ is as a soft version of $x \wedge y$, although hard-equivalent at extreme values, e.g. $f(1,1)=1$ and $f(0,1)=0$, are not hard-equivalent at intermediate values, e.g. $f(0.6, 0.6) = 0.36 \not\equiv 1$. G\"{o}del-style $\operatorname{min}$ and $\operatorname{max}$ functions, although hard-equivalent over the entire soft-bit range, i.e. $\operatorname{min}(x,y) \equiv x \wedge y$ and $\operatorname{min}(x,y) \equiv x \vee y$, are gradient-sparse in the sense that their outputs are not always a function of all their inputs, e.g. $\frac{\partial}{\partial x} \operatorname{max}(x,y) = 0$ when $(x,y)=(0.1, 0.9)$.
So although the composite function $\operatorname{max}(\operatorname{min}(w, x), \operatorname{min}(1-w, 1-x))$ is differentiable and $\equiv \neg(x \oplus w)$ it does not always backpropagate error to its inputs. In contrast, $\partial_{\neg}$ always backpropagates error to its inputs because it is a gradient-rich function (see figure \ref{fig:gradient-rich}). +$\partial \mathbb{B}$ nets must be composed of `activation' functions that are hard-equivalent to discrete functions but also, where possible, gradient-rich. To meet this requirement we introduce the technique of margin packing. \subsection{Margin packing} @@ -165,12 +186,12 @@ \subsection{Margin packing} \end{aligned} \label{eq:augmented-bit} \end{equation} -Note that if the representative bit is high (resp. low) then the augmented bit is also high (resp. low). The difference between the augmented and representative bit depends on the size of the available margin and the mean soft-bit value. Almost everywhere, an increase (resp. decrease) of the mean soft-bit increases (resp. decreases) the value of the augmented bit (see figure \ref{fig:margin-trick}). Note that if the $i$th bit is representative (i.e. hard-equivalent to the target function) then so is the augmented bit (see lemma \ref{prop:augmented}). We will use margin packing, where appropriate, to define gradient-rich, hard-equivalents of boolean functions. +Note that if the representative bit is high (resp. low) then the augmented bit is also high (resp. low). The difference between the augmented and representative bit depends on the size of the available margin and the mean soft-bit value. Almost everywhere, an increase (resp. decrease) of the mean soft-bit increases (resp. decreases) the value of the augmented bit (see figure \ref{fig:margin-trick}). Note that if the $i$th bit is representative (i.e. hard-equivalent to the target function) then so is the augmented bit (see lemma \ref{prop:augmented}). We use margin packing, where appropriate, to define gradient-rich, hard-equivalents of boolean functions. \begin{figure}[t!] \centering \includegraphics[trim=30pt 5pt 30pt 10pt, clip, width=1.0\textwidth]{margin-trick.png} - \caption{{\em Margin packing for constructing gradient-rich, hard-equivalent functions}. A representative bit, $z$, is hard-equivalent to a discrete target function but gradient-sparse (e.g. $z=\operatorname{min}(x,y) \equiv x \wedge y$). On the left $z$ is low, $z<1/2$; on the right $z$ is high, $z>1/2$. We can pack a fraction of the margin between $z$ and the hard threshold $1/2$ with additional gradient-rich information without affecting hard-equivalence. A natural choice is the mean soft-bit, $\bar{\bf x} \in [0,1]$. The grey shaded areas denote the packed margins and the final augmented bit. On the left $\approx 60\%$ of the margin is packed; on the right $\approx 90\%$.} + \caption{{\em Margin packing for constructing gradient-rich, hard-equivalent functions}. A representative bit, $z$, is hard-equivalent to a discrete target function but gradient-sparse (e.g. $z=\operatorname{min}(x,y) \btright x \wedge y$). On the left $z$ is low, $z<1/2$; on the right $z$ is high, $z>1/2$. We can pack a fraction of the margin between $z$ and the hard threshold $1/2$ with additional gradient-rich information without affecting hard-equivalence. A natural choice is the mean soft-bit, $\bar{\bf x} \in [0,1]$. The grey shaded areas denote the packed margins and the final augmented bit. 
On the left $\approx 60\%$ of the margin is packed; on the right $\approx 90\%$.} \label{fig:margin-trick} \end{figure} % On the left, ${\bf x}=[0.9,0.23]$, $z=0.23$, $\bar{\bf x}=0.57$ and therefore $\approx 60\%$ of the margin is packed; on the right, ${\bf x}=[0.9,0.83]$, $z=0.83$, $\bar{\bf x}=0.87$, and therefore $\approx 90\%$ of the margin is packed. @@ -241,7 +262,7 @@ \subsection{Differentiable majority} \begin{figure}[t] \centering \includegraphics[trim=0pt 0pt 0pt 0pt, clip, width=1.0\textwidth]{majority-gates.png} - \caption{{\em Differentiable boolean majority.} The boolean majority function for three variables in DNF form is $\operatorname{Maj}(x,y,z) = (x \wedge y) \vee (x \wedge y) \vee (y \wedge z)$. The upper row contains contour plots of $f(x,y,z) = \operatorname{min}(\operatorname{max}(x,y), \operatorname{max}(x,z), \operatorname{max}(y,z))$ for values of $z \in \{0.2, 0.4, 0.6, 0.8\}$. $f$ is differentiable and $\equiv \operatorname{Maj}$ but gradient-sparse (vertical and horizontal contours indicate constancy with respect to an input). Also, the number of terms in $f$ grows exponentially with the number of variables. The lower row contains contour plots of $\partial\!\operatorname{Maj}(x,y,z)$ for the same values of $z$. $\partial\!\operatorname{Maj}$ is differentiable and $\equiv \operatorname{Maj}$ yet gradient-rich (curved contours indicate variability with respect to any inputs). In addition, the number of terms in $\partial\!\operatorname{Maj}$ is constant with respect to the number of variables.} + \caption{{\em Differentiable boolean majority.} The boolean majority function for three variables in DNF form is $\operatorname{Maj}(x,y,z) = (x \wedge y) \vee (x \wedge z) \vee (y \wedge z)$. The upper row contains contour plots of $f(x,y,z) = \operatorname{min}(\operatorname{max}(x,y), \operatorname{max}(x,z), \operatorname{max}(y,z))$ for values of $z \in \{0.2, 0.4, 0.6, 0.8\}$. $f$ is differentiable and $\btright\!\operatorname{Maj}$ but gradient-sparse (vertical and horizontal contours indicate constancy with respect to an input). Also, the number of terms in $f$ grows exponentially with the number of variables. The lower row contains contour plots of $\partial\!\operatorname{Maj}(x,y,z)$ for the same values of $z$. $\partial\!\operatorname{Maj}$ is differentiable and $\btright\!\operatorname{Maj}$ yet gradient-rich (curved contours indicate variability with respect to every input). In addition, the number of terms in $\partial\!\operatorname{Maj}$ is constant with respect to the number of variables.} \label{fig:majority-plot} \end{figure} @@ -277,7 +298,7 @@ \subsection{Differentiable counting} -A boolean counting function $f({\bf x})$ is $\operatorname{True}$ if a counting predicate, $c({\bf x})$, holds over its $n$ inputs. We aim to construct a differentiable analogue of $\operatorname{count}({\bf x}, k)$ where $c({\bf x}) := |\{x_{i} : x_{i} = 1 \}| = k$ (i.e. `exactly $k$ high'), which is useful in multiclass classification problems. +A boolean counting function $f({\bf x})$ is $\operatorname{True}$ if a counting predicate, $c({\bf x})$, holds over its $n$ inputs. We aim to construct a differentiable analogue of $\operatorname{count}({\bf x}, k)$ where $c({\bf x}) := |\{x_{i} : x_{i} = 1 \}| = k$ (i.e. `exactly $k$ high'), which can be useful in multiclass classification problems. As before, we use $\operatorname{sort}$ to trade off time for memory costs.
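A minimal sketch (assuming $\operatorname{harden}$ thresholds at $1/2$; the finite-difference step is an arbitrary choice) makes the contrast drawn in figure \ref{fig:majority-plot} concrete: the G\"{o}del-style majority agrees with $\operatorname{Maj}$ on hard inputs yet has vanishing partial derivatives at interior points.

\begin{lstlisting}[language=Python,style=mystyle,frame=single]
# Minimal sketch: the Godel-style majority from figure 'majority-plot' is
# hard-equivalent to Maj at hard inputs but gradient-sparse in the interior.
from itertools import product

def harden(v):
    return 1 if v > 0.5 else 0

def godel_maj(x, y, z):
    return min(max(x, y), max(x, z), max(y, z))

def maj(x, y, z):
    return 1 if x + y + z >= 2 else 0

# Agreement with Maj at the eight hard corners.
assert all(harden(godel_maj(x, y, z)) == maj(x, y, z)
           for x, y, z in product((0, 1), repeat=3))

# Gradient sparsity: at (0.2, 0.4, 0.6) a small change in x leaves the output unchanged.
eps = 1e-6
print((godel_maj(0.2 + eps, 0.4, 0.6) - godel_maj(0.2, 0.4, 0.6)) / eps)  # prints 0.0
\end{lstlisting}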
Observe that if the elements of ${\bf x}$ are in ascending order then, if any soft-bits are high, there exists a unique contiguous pair of indices $(i,i+1)$ where $x_{i}$ is low and $x_{i+1}$ is high, where index $i$ is a direct count of the number of soft-bits that are low in ${\bf x}$. In consequence, define \begin{equation*} @@ -335,7 +356,6 @@ \subsection{Boolean logic layers} \partial_{\Rightarrow}(w_{n,1}, x_{1}) & \dots & \partial_{\Rightarrow}(w_{n,m}, x_{m}) \end{bmatrix}\text{.} \end{equation*} - A $\partial_{\wedge}\!\operatorname{Neuron}$ learns to logically $\wedge$ a subset of its input vector: \begin{equation*} \begin{aligned} @@ -343,9 +363,7 @@ \subsection{Boolean logic layers} ({\bf w}, {\bf x}) &\mapsto \min(\partial_{\Rightarrow}\!(w_{1}, x_{1}), \dots, \partial_{\Rightarrow}\!(w_{n}, x_{n}))\text{,} \end{aligned} \end{equation*} -where ${\bf w}$ is a weight vector. Each $\partial_{\Rightarrow}(w_{i},x_{i})$ learns to include or exclude $x_{i}$ from the conjunction depending on weight $w_{i}$. For example, if $w_{i}>0.5$ then $x_{i}$ affects the value of the conjunction since $\partial_{\Rightarrow}(w_{i},x_{i})$ passes-through a soft-bit that is high if $x_{i}$ is high, and low otherwise; but if $w_{i} \leq 0.5$ then $x_{i}$ does not affect the conjunction since $\partial_{\Rightarrow}(w_{i},x_{i})$ always passes-through a high soft-bit. A $\partial_{\wedge}\!\operatorname{Layer}$ of width $n$ learns up to $n$ different conjunctions of subsets of its input (of whatever size). - -A $\partial_{\vee}\!\operatorname{Neuron}$ is defined similarly: +where ${\bf w}$ is a weight vector. Each $\partial_{\Rightarrow}(w_{i},x_{i})$ learns to include or exclude $x_{i}$ from the conjunction depending on weight $w_{i}$. For example, if $w_{i}>0.5$ then $x_{i}$ affects the value of the conjunction since $\partial_{\Rightarrow}(w_{i},x_{i})$ passes-through a soft-bit that is high if $x_{i}$ is high, and low otherwise; but if $w_{i} \leq 0.5$ then $x_{i}$ does not affect the conjunction since $\partial_{\Rightarrow}(w_{i},x_{i})$ always passes-through a high soft-bit. A $\partial_{\wedge}\!\operatorname{Layer}$ of width $n$ learns up to $n$ different conjunctions of subsets of its input (of whatever size). A $\partial_{\vee}\!\operatorname{Neuron}$ is defined similarly: \begin{equation*} \begin{aligned} \partial_{\vee}\!\operatorname{Neuron}: [0,1]^{n} \times [0,1]^{n} &\to [0,1], \\ @@ -424,7 +442,7 @@ \subsection{Hardening} ] \end{lstlisting} -with trainable weights $w_{1}$ and $w_{2}$. We randomly initialize the network and train using the RAdam optimizer \citep{Liu2020On} with softmax cross-entropy loss until training and test accuracies are both $100\%$. We harden the learned weights to get $w_{1} = \operatorname{False}$ and $w_{2} = \operatorname{True}$, and bind with the discrete program, which then simplifies to: +with trainable weights $w_{1}$ and $w_{2}$. We randomly initialize the network and train using the RAdam optimizer \citep{Liu2020On} with softmax cross-entropy loss until training and test accuracies are both $100\%$. 
We harden the learned weights to get $w_{1} = \operatorname{False}$ and $w_{2} = \operatorname{True}$, and bind with the discrete program, which then symbolically simplifies to: \begin{lstlisting}[language=Python,style=mystyle,frame=single] def dbNet(outside): @@ -463,7 +481,7 @@ \subsection{Hardening} \begin{lstlisting}[language=Python,style=mystyle,frame=single] def dbNet(very-cold, cold, warm, very-warm, outside): return [ - zge(sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((0, not(xor(ne(very-cold, 0), w1)))), not(xor(ne(cold, 0), w2)))), not(xor(ne(warm, 0), w3)))), not(xor(ne(very-warm, 0), w4)))), not(xor(ne(outside, 0), w5)))), not(xor(ne(very-cold, 0), w6)))), not(xor(ne(cold, 0), w7)))), not(xor(ne(warm, 0), w8)))), not(xor(ne(very-warm, 0), w9)))), not(xor(ne(outside, 0), w10)))), not(xor(ne(very-cold, 0), w11)))), not(xor(ne(cold, 0), w12)))), not(xor(ne(warm, 0), w13)))), not(xor(ne(very-warm, 0), w14)))), not(xor(ne(outside, 0), w15)))), not(xor(ne(very-cold, 0), w16)))), not(xor(ne(cold, 0), w17)))), not(xor(ne(warm, 0), w18)))), not(xor(ne(very-warm, 0), w19)))), not(xor(ne(outside, 0), w20)))), 11), + ge(sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((0, not(xor(ne(very-cold, 0), w1)))), not(xor(ne(cold, 0), w2)))), not(xor(ne(warm, 0), w3)))), not(xor(ne(very-warm, 0), w4)))), not(xor(ne(outside, 0), w5)))), not(xor(ne(very-cold, 0), w6)))), not(xor(ne(cold, 0), w7)))), not(xor(ne(warm, 0), w8)))), not(xor(ne(very-warm, 0), w9)))), not(xor(ne(outside, 0), w10)))), not(xor(ne(very-cold, 0), w11)))), not(xor(ne(cold, 0), w12)))), not(xor(ne(warm, 0), w13)))), not(xor(ne(very-warm, 0), w14)))), not(xor(ne(outside, 0), w15)))), not(xor(ne(very-cold, 0), w16)))), not(xor(ne(cold, 0), w17)))), not(xor(ne(warm, 0), w18)))), not(xor(ne(very-warm, 0), w19)))), not(xor(ne(outside, 0), w20)))), 11), ge(sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((sum((0, not(xor(ne(very-cold, 0), w21)))), not(xor(ne(cold, 0), w22)))), not(xor(ne(warm, 0), w23)))), not(xor(ne(very-warm, 0), w24)))), not(xor(ne(outside, 0), w25)))), not(xor(ne(very-cold, 0), w26)))), not(xor(ne(cold, 0), w27)))), not(xor(ne(warm, 0), w28)))), not(xor(ne(very-warm, 0), w29)))), not(xor(ne(outside, 0), w30)))), not(xor(ne(very-cold, 0), w31)))), not(xor(ne(cold, 0), w32)))), not(xor(ne(warm, 0), w33)))), not(xor(ne(very-warm, 0), w34)))), not(xor(ne(outside, 0), w35)))), not(xor(ne(very-cold, 0), w36)))), not(xor(ne(cold, 0), w37)))), not(xor(ne(warm, 0), w38)))), not(xor(ne(very-warm, 0), w39)))), not(xor(ne(outside, 0), w40)))), 11) ] \end{lstlisting} @@ -477,7 +495,7 @@ \subsection{Hardening} ] \end{lstlisting} -The predictions combine multiple pieces of evidence due to the presence of the $\partial\!\operatorname{Maj}$ operator. For example, we can read-off that the $\partial\mathbb{B}$ net has learned `if not $\operatorname{very-cold}$ and not $\operatorname{cold}$ and not $\operatorname{outside}$ then wear a $\operatorname{t-shirt}$'; and `if $\operatorname{cold}$ and not ($\operatorname{warm}$ or $\operatorname{very-warm}$) and $\operatorname{outside}$ then wear a $\operatorname{coat}$' etc. The discrete program is more interpretable compared to typical neural networks, and can be exactly encoded as a SAT problem in order to verify its properties, such as robustness. 
+The predictions linearly weight multiple pieces of evidence due to the presence of the $\partial\!\operatorname{Maj}$ operator (which is probably overkill for this toy problem). From this expression we can read off that the $\partial\mathbb{B}$ net has learned `if not $\operatorname{very-cold}$ and not $\operatorname{cold}$ and not $\operatorname{outside}$ then wear a $\operatorname{t-shirt}$'; and `if $\operatorname{cold}$ and not ($\operatorname{warm}$ or $\operatorname{very-warm}$) and $\operatorname{outside}$ then wear a $\operatorname{coat}$' etc. The discrete program is more interpretable than typical neural networks, and can be exactly encoded as a SAT problem in order to verify its properties, such as robustness. \begin{figure}[t!] \centering @@ -517,7 +535,7 @@ \subsection{Binary Iris} \subsection{Noisy XOR} -The noisy XOR dataset \citep{noisy-xor-dataset} is an adversarial parity problem with noisy non-informative features. The dataset consists of 10K examples with 12 boolean inputs and a target label (where 0 = odd and 1 = even) that is a XOR function of 2 inputs. The remaining 10 inputs are entirely random. We train on 50\% of the data where, additionally, 40\% of the labels are inverted. We initialize the network described in figure \ref{fig:noisy-xor-architecture} with random weights distributed close to the hard threshold at $1/2$ (i.e. in the $\partial_{\wedge}\!\operatorname{Layer}$, $w_{i} = 0.501 \times b + 0.3 \times (1-b)$ where $b \sim \operatorname{Bernoulli}(0.01)$; in the $\partial_{\vee}\!\operatorname{Layer}$, $w_{i} = 0.7 \times b + 0.499 \times (1-b)$ where $b \sim \operatorname{Bernoulli}(0.99)$); and in the $\partial_{\neg}\!\operatorname{Layer}$, $w_{i} \sim \operatorname{Uniform}(0.499, 0.501)$. We train for 2000 epochs with the RAdam optimizer and softmax cross-entropy loss. +The noisy XOR dataset \citep{noisy-xor-dataset} is an adversarial parity problem with noisy non-informative features. The dataset consists of 10K examples with 12 boolean inputs and a target label (where 0 = odd and 1 = even) that is an XOR function of 2 of the inputs. The remaining 10 inputs are entirely random. We train on 50\% of the data where, additionally, 40\% of the labels are inverted. We initialize the network described in figure \ref{fig:noisy-xor-architecture} with random weights distributed close to the hard threshold at $1/2$ (i.e. in the $\partial_{\wedge}\!\operatorname{Layer}$, $w_{i} = 0.501 \times b + 0.3 \times (1-b)$ where $b \sim \operatorname{Bernoulli}(0.01)$; in the $\partial_{\vee}\!\operatorname{Layer}$, $w_{i} = 0.7 \times b + 0.499 \times (1-b)$ where $b \sim \operatorname{Bernoulli}(0.99)$); and in the $\partial_{\neg}\!\operatorname{Layer}$, $w_{i} \sim \operatorname{Uniform}(0.499, 0.501)$. We train for 2000 epochs with the RAdam optimizer and softmax cross-entropy loss. We measure the accuracy of the final net on the test data to avoid hand-picking the best configuration. Table \ref{tab:noisy-xor-results} compares the $\partial\mathbb{B}$ net against other classifiers \citep{granmo18}. The high noise causes logistic regression and naive Bayes to randomly guess. The SVM hardly performs better. In contrast, the multilayer neural network, Tsetlin machine, and $\partial\mathbb{B}$ net all successfully learn the underlying XOR signal. The Tsetlin machine performs best on this problem, with the $\partial\mathbb{B}$ net second.
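For reference, this initialization scheme can be sketched in a few lines of numpy (a minimal sketch, not the full training code; the layer shapes are illustrative placeholders):

\begin{lstlisting}[language=Python,style=mystyle,frame=single]
# Minimal sketch: near-threshold weight initialization as described above.
# The layer shapes below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

def init_and_layer(shape):
    # w_i = 0.501*b + 0.3*(1-b), b ~ Bernoulli(0.01):
    # mostly 0.3 (input excluded), occasionally 0.501 (just included).
    b = rng.binomial(1, 0.01, size=shape)
    return 0.501 * b + 0.3 * (1 - b)

def init_or_layer(shape):
    # w_i = 0.7*b + 0.499*(1-b), b ~ Bernoulli(0.99): mostly 0.7, occasionally 0.499.
    b = rng.binomial(1, 0.99, size=shape)
    return 0.7 * b + 0.499 * (1 - b)

def init_not_layer(shape):
    # w_i ~ Uniform(0.499, 0.501): negation weights start at the hard threshold.
    return rng.uniform(0.499, 0.501, size=shape)

w_and = init_and_layer((32, 12))  # hypothetical widths
w_or  = init_or_layer((2, 32))
w_not = init_not_layer((12,))
\end{lstlisting}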
@@ -579,7 +597,7 @@ \section{Conclusion}\label{sec:conclusion} \subsubsection*{Acknowledgments} -Thanks to GitHub Next for sponsoring this research. And thanks to Pavel Augustinov, Richard Evans, Johan Rosenkilde, Max Schaefer, Tam\'{a}s Szab\'{o} and Albert Ziegler for helpful discussions and feedback. +Thanks to GitHub Next for sponsoring this research. And thanks to Pavel Augustinov, Richard Evans, Johan Rosenkilde, Max Schaefer, Ganesh Sittampalam, Tam\'{a}s Szab\'{o} and Albert Ziegler for helpful discussions and feedback. \bibliographystyle{iclr2021_conference} \bibliography{db} @@ -591,21 +609,22 @@ \section*{Appendix} \section{Proofs} \begin{prop}\label{prop:not} - $\partial_{\neg}(x,y) \equiv \neg (x \oplus y)$. + $\partial_{\neg}(x,y) \btright \neg (x \oplus y)$. \begin{proof} Table \ref{not-table} is the truth table of the boolean function $\neg (x \oplus w)$, where $h(x) = \operatorname{harden}(x)$. \begin{table}[h!] \begin{center} - \begin{tabular}{cccccc} - \multicolumn{1}{c}{$x$} &\multicolumn{1}{c}{$y$} &\multicolumn{1}{c}{$h(x)$} &\multicolumn{1}{c}{$h(y)$} &\multicolumn{1}{c}{$\partial_{\neg}(h(x), h(y))$} &\multicolumn{1}{c}{$h(\partial_{\neg}(h(x), h(y)))$} + \begin{tabular}{ccccccc} + \multicolumn{1}{c}{$x$} &\multicolumn{1}{c}{$y$} &\multicolumn{1}{c}{$h(x)$} &\multicolumn{1}{c}{$h(y)$} &\multicolumn{1}{c}{$\partial_{\neg}(x, y)$} &\multicolumn{1}{c}{$h(\partial_{\neg}(x, y))$} + &\multicolumn{1}{c}{$\neg (h(y) \oplus h(x))$} \\ \hline \\ - $\left[0, \frac{1}{2}\right]$ & $\left[0, \frac{1}{2}\right]$ & 0 & 0 & 1 & 1\\[0.1cm] - $\left(\frac{1}{2}, 1\right]$ & $\left[0, \frac{1}{2}\right]$ &1 & 0 & 0 & 0\\[0.1cm] - $\left[0, \frac{1}{2}\right]$ & $\left(\frac{1}{2}, 1\right]$ &0 & 1 & 0 & 0\\[0.1cm] - $\left(\frac{1}{2}, 1\right]$ & $\left(\frac{1}{2}, 1\right]$ &1 & 1 & 1 & 1\\[0.1cm] + $\left[0, \frac{1}{2}\right)$ & $\left[0, \frac{1}{2}\right)$ & 0 & 0 & $\left(\frac{1}{2},1\right]$ & 1 & 1\\[0.1cm] + $\left(\frac{1}{2}, 1\right]$ & $\left[0, \frac{1}{2}\right)$ &1 & 0 & $\left[0, \frac{1}{2}\right)$ & 0 & 0\\[0.1cm] + $\left[0, \frac{1}{2}\right)$ & $\left(\frac{1}{2}, 1\right]$ &0 & 1 & $\left[0, \frac{1}{2}\right)$ & 0 & 0\\[0.1cm] + $\left(\frac{1}{2}, 1\right]$ & $\left(\frac{1}{2}, 1\right]$ &1 & 1 & $\left(\frac{1}{2}, 1\right]$ & 1 & 1\\[0.1cm] \end{tabular} \end{center} - \caption{$\partial_{\neg}(x,y) \equiv \neg (y \oplus x)$.}\label{not-table} + \caption{$\partial_{\neg}(x,y) \btright \neg (y \oplus x)$.}\label{not-table} \end{table} \end{proof} \end{prop} @@ -633,61 +652,64 @@ \section{Proofs} \begin{prop}\label{prop:and} - $\partial_{\wedge}\!(x,y) \equiv x \wedge y$. + $\partial_{\wedge}\!(x,y) \btright x \wedge y$. \begin{proof} Table \ref{and-table} is the truth table of the boolean function $x \wedge y$, where $h(x) = \operatorname{harden}(x)$.. \begin{table}[h!] 
\begin{center} - \begin{tabular}{cccccc} - \multicolumn{1}{c}{$x$} &\multicolumn{1}{c}{$y$} &\multicolumn{1}{c}{$h(x)$} &\multicolumn{1}{c}{$h(y)$} &\multicolumn{1}{c}{$\partial_{\wedge}(h(x), h(y))$} &\multicolumn{1}{c}{$h(\partial_{\wedge}(h(x), h(y)))$} + \begin{tabular}{ccccccc} + \multicolumn{1}{c}{$x$} &\multicolumn{1}{c}{$y$} &\multicolumn{1}{c}{$h(x)$} &\multicolumn{1}{c}{$h(y)$} &\multicolumn{1}{c}{$\partial_{\wedge}(x, y)$} &\multicolumn{1}{c}{$h(\partial_{\wedge}(x, y))$} + &\multicolumn{1}{c}{$h(x) \wedge h(y)$} \\ \hline \\ - $\left[0, \frac{1}{2}\right]$ & $\left[0, \frac{1}{2}\right]$ & 0 & 0 & 0 & 0\\[0.1cm] - $\left(\frac{1}{2}, 1\right]$ & $\left[0, \frac{1}{2}\right]$ &1 & 0 & $\frac{1}{4}$ & 0\\[0.1cm] - $\left[0, \frac{1}{2}\right]$ & $\left(\frac{1}{2}, 1\right]$ &0 & 1 & $\frac{1}{4}$ & 0\\[0.1cm] - $\left(\frac{1}{2}, 1\right]$ & $\left(\frac{1}{2}, 1\right]$ &1 & 1 & 1 & 1\\[0.1cm] + $\left[0, \frac{1}{2}\right)$ & $\left[0, \frac{1}{2}\right)$ & 0 & 0 & $\left[0, \frac{1}{2}\right)$ & 0 & 0\\[0.1cm] + $\left(\frac{1}{2}, 1\right]$ & $\left[0, \frac{1}{2}\right)$ &1 & 0 & $\left(\frac{1}{4}, \frac{1}{2}\right)$ & 0 & 0\\[0.1cm] + $\left[0, \frac{1}{2}\right)$ & $\left(\frac{1}{2}, 1\right]$ &0 & 1 & $\left(\frac{1}{4}, \frac{1}{2}\right)$ & 0 & 0\\[0.1cm] + $\left(\frac{1}{2}, 1\right]$ & $\left(\frac{1}{2}, 1\right]$ &1 & 1 & $\left(\frac{1}{2}, 1\right]$ & 1 & 1\\[0.1cm] \end{tabular} \end{center} - \caption{$\partial_{\wedge}(x,y) \equiv x \wedge y$.}\label{and-table} + \caption{$\partial_{\wedge}(x,y) \btright x \wedge y$.}\label{and-table} \end{table} \end{proof} \end{prop} \begin{prop}\label{prop:or} - $\partial_{\vee}\!(x,y) \equiv x \vee y$. + $\partial_{\vee}\!(x,y) \btright x \vee y$. \begin{proof} Table \ref{or-table} is the truth table of the boolean function $x \vee y$, where $h(x) = \operatorname{harden}(x)$.. \begin{table}[h!] 
\begin{center} - \begin{tabular}{cccccc} - \multicolumn{1}{c}{$x$} &\multicolumn{1}{c}{$y$} &\multicolumn{1}{c}{$h(x)$} &\multicolumn{1}{c}{$h(y)$} &\multicolumn{1}{c}{$\partial_{\vee}(h(x), h(y))$} &\multicolumn{1}{c}{$h(\partial_{\vee}(h(x), h(y)))$} + \begin{tabular}{ccccccc} + \multicolumn{1}{c}{$x$} &\multicolumn{1}{c}{$y$} &\multicolumn{1}{c}{$h(x)$} &\multicolumn{1}{c}{$h(y)$} &\multicolumn{1}{c}{$\partial_{\vee}(x, y)$} &\multicolumn{1}{c}{$h(\partial_{\vee}(x, y))$} + &\multicolumn{1}{c}{$h(x) \vee h(y)$} \\ \hline \\ - $\left[0, \frac{1}{2}\right]$ & $\left[0, \frac{1}{2}\right]$ & 0 & 0 & 0 & 0\\[0.1cm] - $\left(\frac{1}{2}, 1\right]$ & $\left[0, \frac{1}{2}\right]$ &1 & 0 & $\frac{3}{4}$ & 1\\[0.1cm] - $\left[0, \frac{1}{2}\right]$ & $\left(\frac{1}{2}, 1\right]$ &0 & 1 & $\frac{3}{4}$ & 1\\[0.1cm] - $\left(\frac{1}{2}, 1\right]$ & $\left(\frac{1}{2}, 1\right]$ &1 & 1 & 1 & 1\\[0.1cm] + $\left[0, \frac{1}{2}\right)$ & $\left[0, \frac{1}{2}\right)$ & 0 & 0 & $\left[0,\frac{1}{2}\right)$ & 0 & 0\\[0.1cm] + $\left(\frac{1}{2}, 1\right]$ & $\left[0, \frac{1}{2}\right)$ &1 & 0 & $\left(\frac{1}{2},1\right]$ & 1 & 1\\[0.1cm] + $\left[0, \frac{1}{2}\right)$ & $\left(\frac{1}{2}, 1\right]$ &0 & 1 & $\left(\frac{1}{2},1\right]$ & 1 & 1\\[0.1cm] + $\left(\frac{1}{2}, 1\right]$ & $\left(\frac{1}{2}, 1\right]$ &1 & 1 & $\left(\frac{1}{2},1\right]$ & 1 & 1\\[0.1cm] \end{tabular} \end{center} - \caption{$\partial_{\vee}(x,y) \equiv x \vee y$.}\label{or-table} + \caption{$\partial_{\vee}(x,y) \btright x \vee y$.}\label{or-table} \end{table} \end{proof} \end{prop} \begin{prop}\label{prop:implies} - $\partial_{\Rightarrow}\!(x,y) \equiv x \Rightarrow y$. + $\partial_{\Rightarrow}\!(x,y) \btright x \Rightarrow y$. \begin{proof} Table \ref{implies-table} is the truth table of the boolean function $x \Rightarrow y$, where $h(x) = \operatorname{harden}(x)$.. \begin{table}[h!] 
\begin{center} - \begin{tabular}{cccccc} - \multicolumn{1}{c}{$x$} &\multicolumn{1}{c}{$y$} &\multicolumn{1}{c}{$h(x)$} &\multicolumn{1}{c}{$h(y)$} &\multicolumn{1}{c}{$\partial \Rightarrow(h(x), h(y))$} &\multicolumn{1}{c}{$h(\partial \Rightarrow(h(x), h(y)))$} + \begin{tabular}{ccccccc} + \multicolumn{1}{c}{$x$} &\multicolumn{1}{c}{$y$} &\multicolumn{1}{c}{$h(x)$} &\multicolumn{1}{c}{$h(y)$} &\multicolumn{1}{c}{$\partial_{\Rightarrow}(x, y)$} &\multicolumn{1}{c}{$h(\partial_{\Rightarrow}(x, y))$} + &\multicolumn{1}{c}{$h(x) \Rightarrow h(y)$} \\ \hline \\ - $\left[0, \frac{1}{2}\right]$ & $\left[0, \frac{1}{2}\right]$ & 0 & 0 & $\frac{3}{4}$ & 1\\[0.1cm] - $\left(\frac{1}{2}, 1\right]$ & $\left[0, \frac{1}{2}\right]$ &1 & 0 & 0 & 0\\[0.1cm] - $\left[0, \frac{1}{2}\right]$ & $\left(\frac{1}{2}, 1\right]$ &0 & 1 & 1 & 1\\[0.1cm] - $\left(\frac{1}{2}, 1\right]$ & $\left(\frac{1}{2}, 1\right]$ &1 & 1 & $\frac{3}{4}$ & 1\\[0.1cm] + $\left[0, \frac{1}{2}\right)$ & $\left[0, \frac{1}{2}\right)$ & 0 & 0 & $\left(\frac{1}{2}, 1\right]$ & 1 & 1\\[0.1cm] + $\left(\frac{1}{2}, 1\right]$ & $\left[0, \frac{1}{2}\right)$ &1 & 0 & $\left[0, \frac{1}{2}\right)$ & 0 & 0\\[0.1cm] + $\left[0, \frac{1}{2}\right)$ & $\left(\frac{1}{2}, 1\right]$ &0 & 1 & $\left(\frac{1}{2},1\right]$ & 1 & 1\\[0.1cm] + $\left(\frac{1}{2}, 1\right]$ & $\left(\frac{1}{2}, 1\right]$ &1 & 1 & $\left(\frac{1}{2}, \frac{7}{8}\right)$ & 1 & 1\\[0.1cm] \end{tabular} \end{center} - \caption{$\partial \Rightarrow(x,y) \equiv x \Rightarrow y$.}\label{implies-table} + \caption{$\partial_{\Rightarrow}(x,y) \btright x \Rightarrow y$.}\label{implies-table} \end{table} \end{proof} \end{prop} @@ -705,17 +727,17 @@ \section{Proofs} \end{lemma} \begin{theorem}\label{prop:majority} - $\partial\!\operatorname{Maj} \equiv \operatorname{Maj}$. + $\partial\!\operatorname{Maj} \btright \operatorname{Maj}$. \begin{proof} - $\partial\!\operatorname{Maj}$ augments the representative bit $x_{i} = \operatorname{sort}({\bf x})[\operatorname{majority-index}({\bf x})]$. By lemma \ref{lem:maj}, the representative bit is $\equiv \operatorname{Maj}(\operatorname{harden}({\bf x}))$. - By lemma \ref{prop:augmented}, the augmented bit, $\operatorname{augmented-bit}(\operatorname{sort}({\bf x}), \operatorname{majority-index}({\bf x}))$, is also $\equiv \operatorname{Maj}(\operatorname{harden}({\bf x}))$. Hence $\partial\!\operatorname{Maj} \equiv \operatorname{Maj}$. + $\partial\!\operatorname{Maj}$ augments the representative bit $x_{i} = \operatorname{sort}({\bf x})[\operatorname{majority-index}({\bf x})]$. By lemma \ref{lem:maj} the representative bit is $\btright \operatorname{Maj}(\operatorname{harden}({\bf x}))$. + By lemma \ref{prop:augmented}, the augmented bit, $\operatorname{augmented-bit}(\operatorname{sort}({\bf x}), \operatorname{majority-index}({\bf x}))$, is also $\btright\!\operatorname{Maj}(\operatorname{harden}({\bf x}))$. Hence $\partial\!\operatorname{Maj} \btright\!\operatorname{Maj}$. \end{proof} \end{theorem} \begin{prop}\label{prop:count} - $\partial\!\operatorname{count-hot} \equiv \operatorname{count-hot}$. + $\partial\!\operatorname{count-hot} \btright \operatorname{count-hot}$. \begin{proof} - Let $l$ denote the number of bits that are low in ${\bf x} = [x_{1},\dots,x_{n}]$, and let ${\bf y} = \partial\!\operatorname{count-hot}({\bf x})$. Then ${\bf y}[l+1]$ is high and any ${\bf y}[i]$, where $i \neq l+1$, is low. Let ${\bf z} = \operatorname{count-hot}(\operatorname{harden}({\bf x}))$.
Then ${\bf z}[l+1]$ is high and any ${\bf z}[i]$, where $i \neq l+1$, is low. Hence, $\operatorname{harden}({\bf y}) = {\bf z}$, and therefore $\partial\!\operatorname{count-hot} \equiv \operatorname{count-hot}$. + Let $l$ denote the number of bits that are low in ${\bf x} = [x_{1},\dots,x_{n}]$, and let ${\bf y} = \partial\!\operatorname{count-hot}({\bf x})$. Then ${\bf y}[l+1]$ is high and any ${\bf y}[i]$, where $i \neq l+1$, is low. Let ${\bf z} = \operatorname{count-hot}(\operatorname{harden}({\bf x}))$. Then ${\bf z}[l+1]$ is high and any ${\bf z}[i]$, where $i \neq l+1$, is low. Hence, $\operatorname{harden}({\bf y}) = {\bf z}$, and therefore $\partial\!\operatorname{count-hot} \btright \operatorname{count-hot}$. \end{proof} \end{prop} diff --git a/docs/iclr2021_conference.sty b/docs/iclr2021_conference.sty index 6529df6..f68d5dc 100644 --- a/docs/iclr2021_conference.sty +++ b/docs/iclr2021_conference.sty @@ -86,7 +86,7 @@ %\bottomtitlebar % \vskip 0.1in % minus \ificlrfinal %\lhead{Published as a conference paper at ICLR 2021} - \lhead{March 2023. DRAFT 1.0.} + \lhead{April 2023. DRAFT 1.1.} \def\And{\end{tabular}\hfil\linebreak[0]\hfil \begin{tabular}[t]{l}\bf\rule{\z@}{24pt}\ignorespaces}% \def\AND{\end{tabular}\hfil\linebreak[4]\hfil diff --git a/docs/logic-gates.png b/docs/logic-gates.png index f8ad29a..26a5c77 100644 Binary files a/docs/logic-gates.png and b/docs/logic-gates.png differ diff --git a/docs/majority-gates.png b/docs/majority-gates.png index 8554006..abf3e45 100644 Binary files a/docs/majority-gates.png and b/docs/majority-gates.png differ