Fair SVM
Anonymized Authors
Anonymized Universities
independent of the protected class, though this can be too conservative and reduce predictive accuracy more than desired. Another method [Hardt et al., 2016] modifies any classifier to reduce its accuracy with respect to protected classes until fairness is achieved. Several techniques compute a fair classifier at a single threshold [Calders et al., 2009, Cotter et al., 2016]; however, our interest is in classifiers that are fair at all thresholds. The only method we are aware of that tries to compute a fair classifier for all thresholds is that of Zafar et al. [2017], which will be our main comparison.

1.3 Outline

After describing the data and our notation in Section 2, we next define two fairness notions and provide a new ROC visualization of fairness in Section 3. Section 4 derives constraints to improve the fairness of linear and kernel SVM's at all thresholds. This involves nonconvex constraints, and in Section 5 we present iterative algorithms that compute fair linear and kernel SVM's by solving a sequence of convex problems defined using a spectral decomposition. Section 6 conducts numerical experiments using both synthetic and real datasets to demonstrate the efficacy of our approach in computing accurate but fair SVM's.

2 DATA AND NOTATION

Our data consists of 3-tuples (x_i, y_i, z_i) for i = 1, . . . , n points, where x_i ∈ R^p are predictors, y_i ∈ {−1, +1} are labels, and z_i ∈ {−1, +1} label a protected class. For a matrix W, the i-th row of W is denoted W_i. Define X ∈ R^{n×p}, Y ∈ R^n, and Z ∈ R^n to be matrices such that X_i = x_i^T, Y_i = y_i, and Z_i = z_i, respectively.

Let N = {i : z_i = −1} be the set of indices for which the protected class is negative, and similarly let P = {i : z_i = +1} be the set of indices for which the protected class is positive. We use #N and #P for the cardinality of the sets N and P, respectively. Now define X_+ to be a matrix whose rows are x_i^T for i ∈ P, and similarly define X_− to be a matrix whose rows are x_i^T for i ∈ N. Let Σ_+ and Σ_− be the covariance matrices of [x_i | z_i = +1] and [x_i | z_i = −1], respectively.

Next let K(x, x') : R^p × R^p → R be a kernel function, and consider the notation

    K(X, X') is the matrix with entries [K(X, X')]_{ij} = K(X_i, X'_j).   (1)

Recall that the essence of the kernel trick is to replace x_i^T x_j with K(x_i, x_j), and so the benefit of the matrix notation given in (1) is that it allows us to replace X(X')^T with K(X, X') as part of the kernel trick.
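To make the notation in (1) concrete, the following sketch builds the kernel matrix entrywise. It is our illustration rather than code from this paper, and the Gaussian kernel is only an example choice of K(x, x'):

```python
import numpy as np

def rbf_kernel(x, xp, gamma=1.0):
    # Example kernel K(x, x') = exp(-gamma * ||x - x'||^2); any valid kernel function works here.
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def kernel_matrix(X, Xp, kernel=rbf_kernel):
    # Returns the matrix K(X, X') from (1), whose (i, j)-th entry is K(X_i, X'_j).
    K = np.empty((X.shape[0], Xp.shape[0]))
    for i in range(X.shape[0]):
        for j in range(Xp.shape[0]):
            K[i, j] = kernel(X[i], Xp[j])
    return K
```

In particular, with the linear kernel K(x, x') = x^T x', the matrix K(X, X') reduces to X(X')^T, which is exactly the substitution made by the kernel trick.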
Last, we define some additional notation. Let [n] = {1, . . . , n}, and note 1(u) is the indicator function. A positive semidefinite matrix U is denoted U ⪰ 0. If U, V are vectors of equal dimension, then the notation U ◦ V refers to their element-wise product: (U ◦ V)_i = U_i · V_i. Also, e is the vector whose entries are all 1.

3 ROC Visualization of Fairness

3.1 Demographic Parity

One popular notion of fairness is that predictions of the label y are independent of the protected class z. This definition is typically stated [Calders et al., 2009, Zliobaite, 2015, Zafar et al., 2017] in terms of a single threshold, though it can be generalized to multiple thresholds. We say that a classifier h(x, t) has demographic parity at level ∆ (abbreviated as DP-∆) if

    | P(h(x, t) = +1 | z = +1) − P(h(x, t) = +1 | z = −1) | ≤ ∆,  ∀t ∈ R.   (2)

To understand this, note P(h(x, t) = +1 | z = +1) is the true positive rate when predicting the protected class at threshold t, while P(h(x, t) = +1 | z = −1) is the false positive rate when predicting the protected class at threshold t. So the intuition is that a classifier is DP-∆ if its false positive rates and true positive rates are approximately (up to ∆ deviation) equal at all threshold levels for the protected class.

Reinterpreted, demographic parity requires that predictions of the classifier cannot reveal information about the protected class any better (up to ∆ deviation) than random guessing. DP-∆ is in fact equivalent to requiring that the ROC curve for the classifier h(x, t) in predicting z is within ∆ of the line of no-discrimination, which is the line that is achievable by biased random guessing. More visually, Figure 1 shows how DP-∆ can be seen using an ROC curve.
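To illustrate definition (2), the sketch below (our own, not from the paper) estimates the empirical DP-∆ of a score-based classifier h(x, t) = sign(f(x) + t) by sweeping a grid of thresholds and comparing the two groupwise rates of positive predictions:

```python
import numpy as np

def empirical_dp_delta(scores, z, num_thresholds=200):
    """Largest gap between P(h = +1 | z = +1) and P(h = +1 | z = -1) over a threshold grid.

    scores : array of f(x_i), so that h(x_i, t) = sign(f(x_i) + t)
    z      : array of protected-class labels in {-1, +1}
    """
    thresholds = np.linspace(-scores.max(), -scores.min(), num_thresholds)
    pos, neg = scores[z == +1], scores[z == -1]
    gaps = [abs(np.mean(pos + t > 0) - np.mean(neg + t > 0)) for t in thresholds]
    return max(gaps)
```

A classifier satisfies DP-∆ in the sense of (2) exactly when this gap, taken over all thresholds rather than a finite grid, is at most ∆; it is also the maximum deviation of the z-prediction ROC curve from the line of no-discrimination.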
3.2 Equal Opportunity

Demographic parity has been criticized as too strict [Dwork et al., 2012, Hardt et al., 2016], and so another notion of fairness has been proposed in which predictions of the label y are independent of the protected class z, when the true label is positive (i.e., y = +1). In this definition, we must interpret y = +1 as a better label than y = −1; for instance, y = −1 may be a loan default, while y = +1 is full repayment of a loan. This definition is typically stated [Hardt et al., 2016] in terms of a single threshold, though it can be generalized to multiple thresholds. We say that a classifier h(x, t) has equal opportunity with level ∆
for t ∈ {t_1, . . . , t_k}; where M > 0 is a large constant, and {t_1, . . . , t_k} is a fixed set of values. This removes the sign(·) function, and it ensures a finite number of constraints. However, we found in numerical experiments using Gurobi [2016] and Mosek [2017] that computing a fair linear SVM with the above constraints was prohibitively slow except for very small data sets.

4.1.3 Convex Relaxation

We next derive a convex relaxation of the indicator constraints (6). Let M > 0 be a constant, and consider −M ≤ u ≤ M. Then convex upper bounds for the indicators are 1(sign(u) = +1) ≤ 1 + u/M and −1(sign(u) = +1) ≤ −u/M. Using these upper bounds on (6) leads to the following convex relaxation

    −d ≤ ( (1/#P) ∑_{i∈P} x_i − (1/#N) ∑_{i∈N} x_i )^T w ≤ d,   (8)

where d = M · (∆ − 1). There are three important remarks about this convex relaxation. The first is that the threshold t does not appear, which means this is simply a linear constraint. The second is that the bound M can be subsumed into the parameter d used to control the fairness level, meaning that this relaxation is practically independent of the bound M. The third is that this convex relaxation is equivalent to the fairness constraint proposed by Zafar et al. [2017], though they derive this using a correlation argument.

4.1.4 Covariance Constraints

The convex relaxation (8) is fairly weak, and so it is relevant to ask whether constraints can be added to (8) to increase the fairness level ∆ of the resulting linear SVM. Instead of using convex lifts [Lasserre, 2001, Gouveia et al., 2013, Chandrasekaran and Jordan, 2013] to tighten the above constraint, we take a geometric approach to derive new constraints.

Specifically, consider the two conditional distributions [x | z = +1] and [x | z = −1]. Observe that the conditional distributions of [x^T w + t | z = +1] and [x^T w + t | z = −1] have means E[x | z = +1]^T w + t and E[x | z = −1]^T w + t, respectively. This means (8) can be interpreted as requiring

    −d ≤ E[x^T w + t | z = +1] − E[x^T w + t | z = −1] ≤ d,   (9)

or equivalently that the means of x^T w + t when conditioned on z are approximately (i.e., up to d apart) equal. Thus it is natural to consider adding constraints to match the conditional distributions of x^T w + t using higher order moments.

Here, we define a constraint to ensure that the covariances of x^T w + t conditioned on z are approximately equal. Let Σ_+ and Σ_− be the sample covariance matrices for [x_i | z_i = +1] and [x_i | z_i = −1], respectively. Then the sample variances of [x_i^T w + t | z_i = +1] and [x_i^T w + t | z_i = −1] are w^T Σ_+ w and w^T Σ_− w, respectively. So we specify our covariance constraint as

    −s ≤ w^T (Σ_+ − Σ_−) w ≤ s.   (10)

To our knowledge, this constraint has not been previously used to improve the fairness of classifiers. Unfortunately, it is nonconvex because (Σ_+ − Σ_−) is symmetric but typically indefinite (i.e., not positive or negative semidefinite). Hence, computing a linear SVM with this constraint requires further development.

One obvious approach is to lift the constraint (10) and then construct a semidefinite programming (SDP) relaxation [Goemans and Williamson, 1995, Luo et al., 2010]. Specifically, note that (10) is equivalent to

    −s ≤ trace(W (Σ_+ − Σ_−)) ≤ s
    U = [ W  w ; w^T  1 ] ⪰ 0   (11)
    rank(U) = 1

The above is nonconvex, but it can be convexified by dropping the rank(U) = 1 constraint. However, we found in numerical experiments using the Mosek [2017] solver that the SDP relaxation was weak and did not consistently affect the fairness or accuracy of the SVM. Despite this result, we believe that additional convexification techniques [Kocuk et al., 2016, Madani et al., 2017] can be used to strengthen the quality of the SDP relaxation; we leave the problem of how to design a strengthened SDP relaxation for future work.

4.2 Constraints for Kernel SVM

We next study constraints that can be used to ensure fairness with respect to z for a kernel SVM

    h(x, b) = sign(K(X, x)^T (Y ◦ α) + b),   (12)

where α ∈ R^n are coefficients to predict the label y, and b = (1/#I) ∑_{i∈I} (y_i − K(X, x_i)^T (Y ◦ α)) with the set of indices I = {i : 0 < α_i < λ}. Consider the following generic kernel SVM formulation

    min  (Y ◦ α)^T K(X, X) (Y ◦ α) − ∑_{i=1}^n α_i
    s.t. Y^T α = 0
         0 ≤ α_i ≤ λ,  for i ∈ [n]   (13)
         H(w, θ) ≤ 0,

where λ ∈ R is a tuning parameter that can be chosen using cross-validation, and H(w, θ) ≤ 0 is a fairness constraint with the fairness level controlled by the (possibly vector-valued) parameter θ. We will next consider several possibilities for H(w, θ) ≤ 0.
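As a point of reference before adding fairness constraints, the generic problem (13) without H(w, θ) ≤ 0 is a convex quadratic program. The sketch below is our own CVXPY illustration (not the authors' code); it factors the kernel matrix so the quadratic term is expressed in an explicitly convex form, adding a small ridge to guard against numerical loss of positive semidefiniteness.

```python
import numpy as np
import cvxpy as cp

def kernel_svm(K, y, lam, ridge=1e-8):
    """Solve the generic kernel SVM (13) without a fairness constraint.

    K   : (n, n) kernel matrix K(X, X)
    y   : (n,) labels in {-1, +1}
    lam : upper bound lambda on each coefficient alpha_i
    """
    n = K.shape[0]
    alpha = cp.Variable(n)
    v = cp.multiply(y, alpha)                       # Y ∘ α
    L = np.linalg.cholesky(K + ridge * np.eye(n))   # K ≈ L L^T, so v^T K v = ||L^T v||^2
    objective = cp.Minimize(cp.sum_squares(L.T @ v) - cp.sum(alpha))
    constraints = [y @ alpha == 0, alpha >= 0, alpha <= lam]
    cp.Problem(objective, constraints).solve()
    return alpha.value
```

The fairness constraints developed below are added to this problem: the linear constraint (15) enters as an affine constraint on α, while the covariance constraint (17) requires the spectral algorithm of Section 5.2.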
4.2.1 Indicator and Integer Constraints

The indicator constraints (6) can be rewritten for kernel SVM by replacing (4) with (12). Unfortunately, the indicator constraints are still impractical because they are infinite-dimensional and contain a discontinuous, nonconvex function. However, we can approximate the indicator constraints for kernel SVM as the following mixed-integer linear inequalities

    −∆ ≤ (1/#P) ∑_{i∈P} v_i(t) − (1/#N) ∑_{i∈N} v_i(t) ≤ ∆
    −M · (1 − v_i(t)) ≤ K(X, x_i)^T (Y ◦ α) + t ≤ M · v_i(t)   (14)
    v_i(t) ∈ {0, 1}

for t ∈ {t_1, . . . , t_k}; where M > 0 is a large constant, and {t_1, . . . , t_k} is a fixed set of values. However, we found in numerical experiments using the Gurobi [2016] and Mosek [2017] solvers that computing a fair kernel SVM with the above constraints was prohibitively slow except for very small data sets.

4.2.2 Convex Relaxation

We next provide a convex relaxation of the indicator constraints for kernel SVM. A similar argument as the one used for linear SVM gives

    −d ≤ (1/#P) ∑_{i∈P} K(X, x_i)^T (Y ◦ α) − (1/#N) ∑_{i∈N} K(X, x_i)^T (Y ◦ α) ≤ d.   (15)

Note that threshold t does not appear, which means this is simply a linear constraint on α.

4.2.3 Covariance Constraints

Our covariance constraint can be rewritten for the kernel SVM by first recalling that the kernel SVM with K(x, x') = x^T x' generates the same classifier as directly solving a linear SVM. Thus, we have the relationship w = X^T (Y ◦ α) in this special case. Next, observe that

    Σ_+ = (1/#P) X_+^T (I − (1/#P) e e^T) X_+
    Σ_− = (1/#N) X_−^T (I − (1/#N) e e^T) X_−   (16)

So if we apply the kernel trick to our covariance constraint (10) with the above relationships, then the resulting covariance constraint for kernel SVM becomes

    −s ≤ (Y ◦ α)^T (S_+ − S_−) (Y ◦ α) ≤ s,   (17)

where we have

    S_+ = (1/#P) K(X, X_+) (I − (1/#P) e e^T) K(X, X_+)^T
    S_− = (1/#N) K(X, X_−) (I − (1/#N) e e^T) K(X, X_−)^T   (18)

The constraint (17) is a nonconvex quadratic constraint because (S_+ − S_−) is symmetric but typically indefinite (i.e., not positive or negative semidefinite).

We can also construct an SDP relaxation of (17). Specifically, note that (17) is equivalent to

    −s ≤ trace(W (S_+ − S_−)) ≤ s
    U = [ W  Y ◦ α ; (Y ◦ α)^T  1 ] ⪰ 0   (19)
    rank(U) = 1

The above is nonconvex, but it can be convexified by dropping the rank(U) = 1 constraint. However, our numerical experiments using the Mosek [2017] solver found the SDP relaxation was weak and did not consistently affect the fairness or accuracy of the SVM.
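For reference, the matrices S_+ and S_− in (18) can be assembled directly from sub-blocks of the kernel matrix. The sketch below is our own illustration (NumPy assumed); note that (16) and (18) normalize by the group sizes #P and #N rather than by #P − 1 and #N − 1.

```python
import numpy as np

def centered_kernel_cov(K_XG):
    # (1/m) * K(X, X_G) (I - (1/m) e e^T) K(X, X_G)^T as in (18), where m = #G is the
    # group size and K_XG = K(X, X_G) has shape (n, m).
    m = K_XG.shape[1]
    C = np.eye(m) - np.ones((m, m)) / m
    return K_XG @ C @ K_XG.T / m

def covariance_gap(alpha, y, K_Xpos, K_Xneg):
    # Value of (Y ∘ α)^T (S_+ - S_-) (Y ∘ α) in (17); K_Xpos = K(X, X_+), K_Xneg = K(X, X_-).
    v = y * alpha
    return v @ (centered_kernel_cov(K_Xpos) - centered_kernel_cov(K_Xneg)) @ v
```

With the linear kernel this quantity reduces to w^T (Σ_+ − Σ_−) w for w = X^T (Y ◦ α), recovering the constraint (10).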
5 SPECTRAL ALGORITHM

The covariance constraints are conceptually promising, but they result in nonconvex optimization problems. Here, we describe an iterative algorithm to effectively solve the SVM with our covariance constraints.

5.1 Linear SVM

Though we could compute the linear SVM with the covariance constraint (10) using successive linearization, a better approach is possible through careful design of the algorithm: Our key observation regarding (10) is that (Σ_+ − Σ_−) is symmetric, which means it can be diagonalized by an orthogonal matrix:

    Σ_+ − Σ_− = ∑_{i=1}^p ζ_i v_i v_i^T,   (20)

where ζ_i ∈ R and the v_i form an orthonormal basis. Now let I_{ζ+} = {i : ζ_i > 0} and I_{ζ−} = {i : ζ_i < 0}, and define the positive semidefinite matrices

    U_{ζ+} = ∑_{i∈I_{ζ+}} ζ_i v_i v_i^T
    U_{ζ−} = −∑_{i∈I_{ζ−}} ζ_i v_i v_i^T   (21)

This means that the function

    w^T (Σ_+ − Σ_−) w = w^T U_{ζ+} w − w^T U_{ζ−} w   (22)

in the covariance constraint (10) is the difference of the two convex functions w^T U_{ζ+} w and w^T U_{ζ−} w.

There is an important point to note regarding the practical importance of the spectral decomposition we performed above. The function in the covariance constraint (10) can alternatively be written as the difference of the two convex functions w^T Σ_+ w and w^T Σ_− w. However, using this alternative decomposition yields an algorithm where the convexified subproblems are weaker relaxations than the convex subproblems generated using the spectral decomposition. As a result, our algorithm given below is ultimately more effective because it employs the spectral decomposition.
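The split in (20)–(21) is a single symmetric eigendecomposition. A minimal sketch (ours, not from the paper; NumPy assumed):

```python
import numpy as np

def spectral_split(D):
    """Split a symmetric matrix D (e.g. Sigma_+ - Sigma_-) into PSD parts with
    D = U_plus - U_minus, following (20)-(21)."""
    zeta, V = np.linalg.eigh(D)                                    # D = V diag(zeta) V^T
    U_plus = (V[:, zeta > 0] * zeta[zeta > 0]) @ V[:, zeta > 0].T
    U_minus = -(V[:, zeta < 0] * zeta[zeta < 0]) @ V[:, zeta < 0].T
    return U_plus, U_minus
```

The same routine applied to S_+ − S_− produces the matrices U_{ξ+} and U_{ξ−} used in the kernel case (Section 5.2).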
Figure 2: A comparison of the unconstrained, linear and spectral SVM methodologies on two-dimensional data. The first row visualizes the data, as well as the optimal support vectors from each of the methodologies. The second and third rows show the density of x_i^T w conditioned on the z and y variables, respectively. In each case, the blue curve represents the density for z_i = 1 (y_i = 1) and the red curve the density for z_i = −1 (y_i = −1).
Consequently, the constrained convex-concave procedure [Tuy, 1995, Yuille and Rangarajan, 2002, Smola et al., 2005] can be used to design an algorithm for our setting. We opt to use a penalized form of the quadratic constraint in our spectral algorithm to ensure feasibility always holds. Let w_k ∈ R^p be a fixed point, and consider the optimization problem where the concave terms are linearized:

    min  ∑_{i=1}^n u_i + λ ‖w‖_2^2 + µ · t
    s.t. y_i (x_i^T w + b) ≥ 1 − u_i,  for i ∈ [n]
         u_i ≥ 0,  for i ∈ [n]
         −d ≤ ( (1/#P) ∑_{i∈P} x_i − (1/#N) ∑_{i∈N} x_i )^T w ≤ d   (23)
         w^T U_{ζ+} w − w_k^T U_{ζ−} w_k − 2 w_k^T U_{ζ−} (w − w_k) ≤ t
         w^T U_{ζ−} w − w_k^T U_{ζ+} w_k − 2 w_k^T U_{ζ+} (w − w_k) ≤ t

Our spectral algorithm for computing a fair linear SVM consists of the constrained CCP adapted to the problem of computing a linear SVM with the linear constraint (8) and covariance constraint (10): We initialize w_0 by solving a linear SVM with only the linear constraint (8), and then compute successive w_k by solving (23). This produces a local minimum.

Theorem 1 (Smola et al. [2005]) The spectral algorithm defined above for computing a fair linear SVM gives iterates w_k that converge to a local minimum.

This theorem is simply an application of a theorem by Smola et al. [2005], and the constraint qualification required by this theorem trivially holds in our case because all of our convex constraints are linear.
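For concreteness, one iteration of the convexified subproblem (23) can be prototyped with an off-the-shelf convex solver. The sketch below is our own CVXPY illustration rather than the authors' implementation; U_plus and U_minus are the matrices from (21) (e.g. as returned by the spectral_split helper sketched above), and mean_diff is the vector of groupwise mean differences from (8).

```python
import numpy as np
import cvxpy as cp

def fair_linear_svm_step(X, y, w_k, U_plus, U_minus, mean_diff, lam, mu, d):
    """Solve one convexified subproblem (23), linearizing the concave parts of the
    covariance constraint around the current iterate w_k."""
    n, p = X.shape
    w, b, t, u = cp.Variable(p), cp.Variable(), cp.Variable(), cp.Variable(n)
    objective = cp.Minimize(cp.sum(u) + lam * cp.sum_squares(w) + mu * t)
    constraints = [
        cp.multiply(y, X @ w + b) >= 1 - u,   # hinge-loss slack constraints
        u >= 0,
        cp.abs(mean_diff @ w) <= d,           # linear fairness constraint (8)
        # difference-of-convex constraints with concave terms linearized at w_k:
        cp.quad_form(w, U_plus) - w_k @ U_minus @ w_k - 2 * (U_minus @ w_k) @ (w - w_k) <= t,
        cp.quad_form(w, U_minus) - w_k @ U_plus @ w_k - 2 * (U_plus @ w_k) @ (w - w_k) <= t,
    ]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```

Starting from the solution with only the linear constraint (8) and repeatedly re-solving around the latest iterate reproduces the procedure described above; the kernel subproblem (27) has the same structure with α in place of w.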
5.2 Kernel SVM

We can also design a spectral algorithm to compute fair kernel SVM's. Since (S_+ − S_−) in (17) is symmetric, it can be diagonalized by an orthogonal matrix:

    S_+ − S_− = ∑_{i=1}^n ξ_i ν_i ν_i^T,   (24)

where ξ_i ∈ R and the ν_i form an orthonormal basis. Now let I_{ξ+} = {i : ξ_i > 0} and I_{ξ−} = {i : ξ_i < 0}, and define the positive semidefinite matrices

    U_{ξ+} = ∑_{i∈I_{ξ+}} ξ_i ν_i ν_i^T
    U_{ξ−} = −∑_{i∈I_{ξ−}} ξ_i ν_i ν_i^T   (25)

This means that the function

    (Y ◦ α)^T (S_+ − S_−) (Y ◦ α) = (Y ◦ α)^T U_{ξ+} (Y ◦ α) − (Y ◦ α)^T U_{ξ−} (Y ◦ α)   (26)

in (17) is the difference of the two convex functions (Y ◦ α)^T U_{ξ+} (Y ◦ α) and (Y ◦ α)^T U_{ξ−} (Y ◦ α).

Thus, the constrained convex-concave procedure [Tuy, 1995, Yuille and Rangarajan, 2002, Smola et al., 2005] can be used to design an algorithm. We use a penalized
form of the quadratic constraint in our spectral algorithm to ensure feasibility always holds. Let α_k ∈ R^n be a fixed point, and consider the optimization problem where the concave terms are linearized:

    min  (Y ◦ α)^T K(X, X) (Y ◦ α) − ∑_{i=1}^n α_i + µ · t
    s.t. Y^T α = 0
         0 ≤ α_i ≤ λ,  for i ∈ [n]
         −d ≤ (1/#P) ∑_{i∈P} K(X, x_i)^T (Y ◦ α) − (1/#N) ∑_{i∈N} K(X, x_i)^T (Y ◦ α) ≤ d   (27)
         (Y ◦ α)^T U_{ξ+} (Y ◦ α) − (Y ◦ α_k)^T U_{ξ−} (Y ◦ α_k) − 2 (Y ◦ α_k)^T U_{ξ−} (Y ◦ (α − α_k)) ≤ t
         (Y ◦ α)^T U_{ξ−} (Y ◦ α) − (Y ◦ α_k)^T U_{ξ+} (Y ◦ α_k) − 2 (Y ◦ α_k)^T U_{ξ+} (Y ◦ (α − α_k)) ≤ t

Our spectral algorithm for computing a fair kernel SVM consists of the constrained CCP adapted to the problem of computing a kernel SVM with the linear constraint (15) and covariance constraint (17): We initialize α_0 by solving a kernel SVM with only the linear constraint (15), and then compute successive α_k by solving (27). This produces a local minimum.

Theorem 2 (Smola et al. [2005]) The spectral algorithm defined above for computing a fair kernel SVM gives iterates α_k that converge to a local minimum.

This theorem is simply an application of a theorem by Smola et al. [2005], and the constraint qualification required by this theorem trivially holds in our case because all of our convex constraints are linear.

6 NUMERICAL EXPERIMENTS

We use synthetic and real datasets to evaluate the efficacy of our approach. We compare linear SVM's computed using our spectral algorithm (SSVM) to a standard linear SVM (LSVM) and a linear SVM computed using the approach of Zafar et al. [2017] (ZSVM), since this is the only existing approach that to our knowledge is designed to ensure fairness at all thresholds.

Figure 3: ROC plots for the three SVM algorithms on both datasets. In each case, the solid blue line is the ROC curve for the y label and the dotted red line the ROC curve for the protected z label. Figures 3a to 3c show the ROC plots for LSVM, ZSVM, and SSVM on the wine quality data, and Figures 3d to 3f show the same for the German credit data.

Results The results of the three methods using d = 0.075 and µ = 10 are shown in the three columns of Figure 2. Here, points in N and P are differentiated by marker shape ("x" and "o", respectively), and points with label y_i = −1 and y_i = 1 are differentiated by color (red and green, respectively). If w denotes the coefficients computed by each method, then the second row shows the empirical densities of x_i^T w conditioned on the protected class z_i, and the third row shows the empirical densities of x_i^T w conditioned on the label y_i. Fairness occurs if the conditional densities in the second row are similar, and prediction accuracy occurs if the densities in the third row are disparate. These results show that the densities of [x_i^T w | z_i = +1] and [x_i^T w | z_i = −1] are distinguishable for LSVM and ZSVM, while they are almost identical for SSVM. On the other hand, the densities of [x_i^T w | y_i = +1] and [x_i^T w | y_i = −1] are distinct for all three methods.
Figure 4: Comparing the accuracy and fairness of the ZSVM and SSVM methods for various d and µ. The solid red line represents results for the ZSVM, and the dotted blue lines denote results for the SSVM for several values of µ.
Notably, all explanatory variables are continuous. The second dataset is a compilation of 20 attributes (e.g., marriage, employment and housing status, number of existing credit lines, and age) of 1000 German applicants for loans. We label y_i = 1 for applicants that defaulted and y_i = −1 for applicants that did not default, and let z_i = 1 for applicants that are renting a home and z_i = −1 for applicants that own their home. Note that a large number of variables are discrete. There is no missing data in either dataset.

Metrics of comparison We compare SSVM and ZSVM based on the tradeoffs that they make between predictive accuracy for y, measured using the area under the ROC curve (AUC), and their fairness with respect to z, measured by the DP-∆ level.
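These two metrics can be computed directly from a fitted score function; the sketch below is our own illustration (scikit-learn assumed), using the ROC characterization of DP-∆ from Section 3.1:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def accuracy_fairness_metrics(scores, y, z):
    """Return (AUC for predicting y, DP-Delta for predicting z) for scores f(x_i)."""
    auc_y = roc_auc_score(y, scores)        # predictive accuracy for the label y
    fpr, tpr, _ = roc_curve(z, scores)      # ROC curve for predicting the protected class z
    dp_delta = np.max(np.abs(tpr - fpr))    # max deviation from the line of no-discrimination
    return auc_y, dp_delta
```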
Experimental Design We conducted five rounds of cross-validation on each dataset and computed the average AUC and average ∆, using a 70-30 training-testing split. Within each round, we first applied 5-fold cross-validation on LSVM to choose the λ that maximizes AUC, and this value of λ was used with both SSVM and ZSVM to minimize the impact of cross-validation on the comparison between methods. We varied d over the values 0, 0.001, 0.002, 0.005, 0.01, 0.025, 0.05, and 0.1. For SSVM, we tried several values of µ, which are shown in our plots.

Results Figure 3 shows representative examples of ROC curves for both datasets from one instance of cross-validation. Both ZSVM and SSVM improve fairness with respect to LSVM while maintaining high accuracy, and SSVM ensures an even stricter level of fairness than LSVM while keeping high accuracy. The tradeoff curves between prediction accuracy and fairness are shown in Figure 4. Increasing d generally jointly decreases fairness and increases accuracy, while small increases in µ for our SSVM can often improve both fairness and accuracy. Large increases in µ generally increase fairness but decrease accuracy. Note that setting µ = 0 causes the curve for SSVM to align with the curve of ZSVM, since they are equivalent when µ = 0. Also observe that ∆ is more sensitive to changes in d and µ than the AUC, which implies that we are able to control fairness without losing much predictive accuracy.

7 Conclusion

We considered multi-threshold notions of fairness for classifiers, and designed a nonconvex constraint to improve the fairness of linear and kernel SVM's under all thresholds. We developed an iterative optimization algorithm that uses a spectral decomposition to handle our nonconvex constraint when computing the SVM, and empirically compared our approach to a standard linear SVM and an SVM with a linear fairness constraint using both synthetic and real data. We found that our method can strictly improve the fairness of classifiers for all thresholding values with little loss in accuracy; in fact, some of our results even showed a slight increase in accuracy with increasing fairness. Our work opens the door for further research in a number of areas, including hierarchies of fairness constraints considering subsequent moments of the data, and theoretical guarantees on the fairness of such classification methods.