Fair SVM

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

Spectral Algorithms for Computing Fair Support Vector Machines

Anonymized Authors
Anonymized Universities

Abstract This paper focuses on the latter: We study how to


design fair support vector machines (SVM’s), and our
goal is to construct a classifier h(x, t) : Rp × R →
Classifiers and rating scores are prone to
{−1, +1} that inputs predictors x ∈ Rp and a thresh-
implicitly codifying biases, which may be
old t, and predicts a label y ∈ {−1, +1}, while ensuring
present in the training data, against pro-
fairness with respect to a protected class z ∈ {−1, +1}
tected classes (i.e., age, gender, or race). So
(e.g., age, gender, or race). We assume there are only
it is important to understand how to design
two protected classes; however, our formulations gen-
classifiers and scores that prevent discrimi-
eralize to the setting with multiple protected classes.
nation in predictions. This paper develops
computationally tractable algorithms for de- We make four main contributions. First, we reinter-
signing accurate but fair support vector ma- pret two fairness notions using receiver operating char-
chines (SVM’s). Our approach imposes a acteristic (ROC) curves, which leads to a new visual-
constraint on the covariance matrices condi- ization for classifier fairness. Second, we capture fair-
tioned on each protected class, which leads ness by defining a constraint on covariance matrices
to a nonconvex quadratic constraint in the conditioned on protected classes, which leads to a non-
SVM formulation. We develop iterative al- convex quadratic constraint in the SVM formulation.
gorithms to compute fair linear and kernel Third, we construct an iterative algorithm that uses a
SVM’s, which solve a sequence of relaxations spectral decomposition of the nonconvex constraint to
constructed using a spectral decomposition compute fair linear and kernel SVM’s; we prove iter-
of the nonconvex constraint. Its effectiveness ates converge to a local minimum. Fourth, we conduct
in achieving high prediction accuracy while numerical experiments to evaluate our algorithms.
ensuring fairness is shown through numerical
experiments on several data sets. 1.1 Fairness Notions for Classifiers

Ensuring classifiers are fair requires quantifying their


1 INTRODUCTION fairness. However, Friedler et al. [2016] and Kleinberg
et al. [2016] showed that no single metric can cap-
The increasing prevalence of machine learning to au- ture all intuitive aspects of fairness, and so any metric
tomate decision-making systems has drawn societal must choose a specific aspect of fairness to quantify.
scrutiny on the approaches used for verification and In this paper, we consider the arguably two most pop-
validation of these systems. In particular, two main ular notions: demographic parity [Calders et al., 2009,
concerns have been voiced with regards to the correct- Zliobaite, 2015, Zafar et al., 2017] and equal opportu-
ness and accuracy of decisions provided. The first is nity [Dwork et al., 2012, Hardt et al., 2016]. These
a general lack of interpretability for how predictions notions are typically considered for a single threshold
are produced by many learning techniques [Ridgeway of the classifier, but here we will consider all possible
et al., 1998, Lou et al., 2012, Caruana et al., 2015], and thresholds. We believe this is more in-line with ma-
the second is the possibility of perpetuating inequities licious usage of classifiers in which strategic choice of
that may be present in the training data [Podesta thresholds can be used to practice discrimination.
et al., 2014, Barocas and Selbst, 2016, Bhandari, 2016].
1.2 Algorithms to Compute Fair Classifiers

Several approaches have been developed to construct


Preliminary work. Under review by AISTATS 2018. Do fair classifiers. Some [Zemel et al., 2013, Louizos et al.,
not distribute.
2015] compute transformations of the data to make it
Manuscript under review by AISTATS 2018

independent of the protected class, though this can be Last, we define some additional notation. Let [n] =
too conservative and reduce predictive accuracy more {1, . . . , n}, and note 1(u) is the indicator function. A
than desired. Another method [Hardt et al., 2016] positive semidefinite matrix U is denoted U  0. If
modifies any classifier to reduce its accuracy with re- U, V are vectors of equal dimension, then the notation
spect to protected classes until fairness is achieved. U ◦ V refers to their element-wise product: (U ◦ V )i =
Several techniques compute a fair classifier at a sin- Ui · Vi . Also, e is the vector whose entries are all 1.
gle threshold [Calders et al., 2009, Cotter et al., 2016];
however, our interest is in classifiers that are fair at all 3 ROC Visualization of Fairness
thresholds. The only method we are aware of that tries
to compute a fair classifier for all thresholds is that of 3.1 Demographic Parity
Zafar et al. [2017], which will be our main comparison.
One popular notion of fairness is that predictions of
1.3 Outline the label y are independent of the protected class z.
This definition is typically stated [Calders et al., 2009,
After describing the data and our notation in Section Zliobaite, 2015, Zafar et al., 2017] in terms of a sin-
2, we next define two fairness notions and provide a gle threshold, though it can be generalized to multiple
new ROC visualization of fairness in Section 3. Sec- thresholds. We say that a classifier h(x, t) has demo-
tion 4 derives constraints to improve the fairness of lin- graphic parity at level ∆ (abbreviated as DP-∆) if
ear and kernel SVM’s at all thresholds. This involves
nonconvex constraints, and in Section 5 we present  
P h(x, t) = +1 z = +1 −

iterative algorithms that compute fair linear and ker-
nel SVM’s by solving a sequence of convex problems  
P h(x, t) = +1 z = −1 ≤ ∆, ∀t ∈ R. (2)
defined using a spectral decomposition. Section 6 con-
ducts numerical experiments using both synthetic and  
To understand this, note P h(x, t) = +1 z = +1 is
real datasets to demonstrate the efficacy of our ap-
the true positive rate when predicting the protected
proach in computing accurate but fair SVM’s. 
class at threshold t, while P h(x, t) = +1 z = −1 is
the false positive rate when predicting the protected
2 DATA AND NOTATION class at threshold t. So the intuition is that a classifier
is DP-∆ if its false positive rates and true positive
Our data consists of 3-tuples (xi , yi , zi ) for i = 1, . . . , n rates are approximately (up to ∆ deviation) equal at
points, where xi ∈ Rp are predictors, yi ∈ {−1, +1} all threshold levels for the protected class.
are labels, and zi ∈ {−1, +1} label a protected class.
For a matrix W , the i-th row of W is denoted Wi . Reinterpreted, demographic parity requires that pre-
Define X ∈ Rn×p , Y ∈ Rn , and Z ∈ Rn to be matrices dictions of the classifier cannot reveal information
such that Xi = xT about the protected class any better (up to ∆ devi-
i , Yi = yi , and Zi = zi , respectively.
ation) than random guessing. DP-∆ is in fact equiva-
Let N = {i : zi = −1} be the set of indices for pro- lent to requiring that the ROC curve for the classifier
tected class is negative, and similarly let P = {i : zi = h(x, t) in predicting z is within ∆ of the line of no-
+1} be the set of indices for which the protected class discrimination, which is the line that is achievable by
is positive. We use #N and #P for the cardinality biased random guessing. More visually, Figure 1 shows
of the sets N and P , respectively. Now define X+ to how DP-∆ can be seen using an ROC curve.
be a matrix whose rows are xT i for i ∈ P , and simi-
larly define X− to be a matrix whose rows are xT i for 3.2 Equal Opportunity
i ∈ N . Let Σ+ and Σ− be the covariance matrices of
[xi |zi = +1] and [xi |zi = −1], respectively. Demographic parity has been criticized as too strict
0 p p
Next let K(x, x ) : R × R → R be a kernel function, [Dwork et al., 2012, Hardt et al., 2016], and so an-
and consider the notation other notion of fairness has been proposed in which
predictions of the label y are independent of the pro-
K(X1 , X10 ) K(X1 , X20 ) · · ·
 
tected class z, when the true label is positive (i.e.,
0 0
K(X, X 0 ) = K(X2 , X1 ) K(X2 , X2 ) · · · (1) y = +1). In this definition, we must interpret y = +1
 
.. .. .. as a better label than y = −1; for instance, y = −1
. . .
may be a loan default, while y = +1 is full repayment
Recall that the essence of the kernel trick is to replace of a loan. This definition is typically stated [Hardt
xT
i xj with K(xi , xj ), and so the benefit of the matrix et al., 2016] in terms of a single threshold, though it
notation given in (1) is that it allows us to replace can be generalized to multiple thresholds. We say that
X(X 0 )T with K(X, X 0 ) as part of the kernel trick. a classifier h(x, t) has equal opportunity with level ∆
Manuscript under review by AISTATS 2018

these constraints for use in kernel SVM. We will fo-


1 cus on presenting formulations for demographic par-
ity, though all of our formulations easily generalize to
equal opportunity by simply conditioning on y = +1.

4.1 Constraints for Linear SVM

We first study constraints that can be used to ensure


0.5
fairness with respect to z for a linear SVM

h(x, b) = sign(xT w + b), (4)

where w ∈ Rp , b ∈ R are coefficients to predict the


label y. Consider the following generic linear SVM
formulation
0
0 0.5 1 Pn 2
min i=1 ui + λkwk2
s.t. yi (xT
i w + b) ≥ 1 − ui , for i ∈ [n]
Figure 1: A visual representation of our notion of (5)
fairness. Here, the solid blue line is the ROC curve for ui ≥ 0, for i ∈ [n]
the y label and the dotted red line the ROC curve for G(w, θ) ≤ 0.
the protected z label. ∆ refers to the maximum dis-
tance of the latter from the diagonal, which represents where λ ∈ R is a tuning parameter that can be cho-
a perfect lack of predictability. sen using cross-validation, and G(w, θ) ≤ 0 is a fair-
ness constraint with the fairness level controlled by the
(possibly vector-valued) parameter θ. We will next
(abbreviated as EO-∆) if consider several possibilities for G(w, θ) ≤ 0.

4.1.1 Indicator Constraints

P h(x, t) = +1 z = +1, y = +1 −

  The definition of DP-∆ in (2) uses probabilities, which
P h(x, t) = +1 z = −1, y = +1 ≤ ∆, ∀t ∈ R. (3)
are are not available as data. Fortunately, we can use
 empirical fractions of events as an approximation:
To understand this, note P h(x, t) = +1 z = +1, y =
+1 is the true positive rate conditioned on y = +1
1 P
when predicting the protected class at threshold t, #P i∈P 1(sign(xT i w + t) = +1)−

while P h(x, t) = +1 z = −1, y = +1 is the false 1
P T

1(sign(x w + t) = +1) ≤ ∆, ∀t ∈ R. (6)

positive rate conditioned on y = +1 when predicting #N i∈N i

the protected class at threshold t. So the intuition is


that a classifier is EO-∆ if its false positive rates and The above is difficult to include in the linear SVM as
true positive rates are approximately (up to ∆ devi- the fairness constraint G(w, θ) ≤ 0, because it involves
ation) equal at all threshold levels for the protected the discontinuous and nonconvex sign(·) function, and
class, when conditioned on y = +1. is infinite-dimensional since it must hold for all t ∈ R.

Reinterpreted, equal opportunity requires that predic- 4.1.2 Integer Constraints


tions of the classifier cannot reveal information about
the protected class any better (up to ∆ deviation) than An initial idea to incorporate (6) in the linear SVM
random guessing, when the true label is positive. EO- is based on recent empirical results [Miyashiro and
∆ is equivalent to requiring that the ROC curve for the Takano, 2015, Bertsimas et al., 2016] where mixed in-
classifier h(x, t) in predicting z conditioned on y = +1 teger programming (MIP) was used to exactly solve
is within ∆ of the line of no-discrimination. Figure 1 cardinality constrained linear regression. Specifically,
shows how DP-∆ can be seen using an ROC curve. we can approximate the indicator constraints (6) as
the following mixed-integer linear inequalities
4 FAIRNESS CONSTRAINTS −∆ ≤ #P 1
P 1
P
i∈P vi (t) − #N i∈N vi (t) ≤ ∆

In this section, we derive several fairness constraints −M · (1 − vi (t)) ≤ xT


i w + t ≤ M · vi (t)
(7)
for linear SVM. The kernel trick is used to convert vi (t) ∈ {0, 1}
Manuscript under review by AISTATS 2018

for t ∈ {t1 , . . . , tk }; where M > 0 is a large constant, equal. Let Σ+ and Σ− be the sample covariance ma-
and {t1 , . . . , tk } is a fixed set of values. This removes trices for [xi |zi = +1] and [xi |zi = −1], respectively.
the sign(·) function, and it ensures a finite number of Then the sample variances of [xT i w + t|zi = +1] and
constraints. However, we found in numerical experi- [xT
i w + t|z i = −1] are w T
Σ + w and wT Σ− w, respec-
ments using Gurobi [2016] and Mosek [2017] that com- tively. So we specify our covariance constraint as
puting a fair linear SVM with the above constraints
was prohibitively slow except for very small data sets. −s ≤ wT (Σ+ − Σ− )w ≤ s. (10)
To our knowledge, this constraint has not been previ-
4.1.3 Convex Relaxation ously used to improve the fairness of classifiers. Un-
fortunately, it is nonconvex because (Σ+ −Σ− ) is sym-
We next derive a convex relaxation of the indicator metric but typically indefinite (i.e., not positive or neg-
constraints (6). Let M > 0 be a constant, and con- ative semidefinite). Hence, computing a linear SVM
sider −M ≤ u ≤ M . Then convex upper bounds with this constraint requires further development.
for the indicators are 1(sign(u) = +1) ≤ 1 + u/M
and −1(sign(u) = +1) ≤ −u/M . Using these upper One obvious approach is to lift the constraint (10) and
bounds on (6) leads to the following convex relaxation then construct a semidefinite programming (SDP) re-
laxation [Goemans and Williamson, 1995, Luo et al.,
1
P 1
P T 2010]. Specifically, note that (10) is equivalent to
−d ≤ #P i∈P xi − #N i∈N xi w ≤ d, (8)
−s ≤ trace(W (Σ+ − Σ− )) ≤ s
where d = M · (∆ − 1). There are three important  
remarks about this convex relaxation. The first is W w
U= 0 (11)
that the threshold t does not appear, which means wT 1
this is simply a linear constraint. The second is that rank(U ) = 1
the bound M can be subsumed into the parameter d
The above is nonconvex, but it can be convexified by
used to control the fairness level, meaning that this
dropping the rank(U ) = 1 constraint. However, we
relaxation is practically independent of the bound M .
found in numerical experiments using the Mosek [2017]
The third is that this convex relaxation is equivalent to
solver that the SDP relaxation was weak and did not
the fairness constraint proposed by Zafar et al. [2017],
consistently affect the fairness or accuracy of the SVM.
though they derive this using a correlation argument.
Despite this result, we believe that additional convex-
ification techniques [Kocuk et al., 2016, Madani et al.,
4.1.4 Covariance Constraints 2017] can be used to strengthen the quality of the SDP
The convex relaxation (8) is fairly weak, and so it is relaxation; we leave the problem of how to design a
relevant to ask whether constraints can be added to strengthened SDP relaxation for future work.
(8) to increase the fairness level ∆ of the resulting
linear SVM. Instead of using convex lifts [Lasserre, 4.2 Constraints for Kernel SVM
2001, Gouveia et al., 2013, Chandrasekaran and Jor-
We next study constraints that can be used to ensure
dan, 2013] to tighten the above constraint, we take a
fairness with respect to z for a kernel SVM
geometric approach to derive new constraints.
Specifically, consider the two conditional distributions h(x, b) = sign(K(X, x)T (Y ◦ α) + b), (12)
[x|z = +1] and [x|z = −1]. Observe that the where α ∈ R n
conditional distributions of [xT w + t|z = +1] and 1
P are coefficients to predict the label y,
and b = #I i∈I (yi − K(X, xi )T (Y ◦ α)) with the set
[xT w + t|z = −1]T have mean E[x|z = +1]T w + t and of indices I = {i : 0 < αi < λ}. Consider the following
E[x|z = −1]T w + t, respectively. This means (8) can generic kernel SVM formulation
be interpreted as requiring Pn
min (Y · α)T K(X, X)(Y · α) − i=1 αi
−d ≤ E[xT w + t|z = +1] − E[xT w + t|z = −1]T w ≤ d,
s.t. Y T α = 0
(9) (13)
or equivalently that the means of xT w + t when con- 0 ≤ αi ≤ λ, for i ∈ [n]
ditioned on z are approximately (i.e., up to d apart) H(w, θ) ≤ 0.
equal. Thus it is natural to consider adding constraints
where λ ∈ R is a tuning parameter that can be cho-
to match the conditional distributions of xT w +t using
sen using cross-validation, and H(w, θ) ≤ 0 is a fair-
higher order moments.
ness constraint with the fairness level controlled by the
Here, we define a constraint to ensure that the covari- (possibly vector-valued) parameter θ. We will next
ances of xT w + t conditioned on z are approximately consider several possibilities for H(w, θ) ≤ 0.
Manuscript under review by AISTATS 2018

4.2.1 Indicator and Integer Constraints The constraint (17) is a nonconvex quadratic con-
straint because (S+ − S− ) is symmetric but typically
The indicator constraints (6) can be rewritten for ker- indefinite (i.e., not positive or negative semidefinite).
nel SVM by replacing (4) with (12). Unfortunately,
the indicator constraints are still impractical because We can also construct an SDP relaxation of (17).
they are infinite-dimensional and contain a discontin- Specifically, note that (17) is equivalent to
uous, nonconvex function. However, we can approxi-
−s ≤ trace(W (S+ − S− )) ≤ s
mate the indicator constraints for kernel SVM as the  
following mixed-integer linear inequalities W (Y ◦ α)
U= 0 (19)
(Y ◦ α)T 1
1 1
P P
−∆ ≤ #P i∈P vi (t) − #N i∈N vi (t) ≤ ∆ rank(U ) = 1
T
−M · (1 − vi (t)) ≤ K(X, x) (Y ◦ α) + t ≤ M · vi (t)
The above is nonconvex, but it can be convexified by
vi (t) ∈ {0, 1} dropping the rank(U ) = 1 constraint. However, our
(14) numerical experiments using the Mosek [2017] solver
for t ∈ {t1 , . . . , tk }; where M > 0 is a large con- found the SDP relaxation was weak and did not con-
stant, and {t1 , . . . , tk } is a fixed set of values. How- sistently affect the fairness or accuracy of the SVM.
ever, we found in numerical experiments using the
Gurobi [2016] and Mosek [2017] solvers that comput-
ing a fair kernel SVM with the above constraints was 5 SPECTRAL ALGORITHM
prohibitively slow except for very small data sets.
The covariance constraints are conceptually promising,
4.2.2 Convex Relaxation but they result in nonconvex optimization problems.
Here, we describe an iterative algorithm to effectively
We next provide a convex relaxation of the indicator solve the SVM with our covariance constraints.
constraints for kernel SVM. A similar argument as the
one used for linear SVM gives 5.1 Linear SVM

−d≤ 1
P T Though we could compute the linear SVM with the co-
#P i∈P K(X, xi ) (Y ◦ α)+
variance constraint (10) using successive linearization,
1 T
P
− #N i∈N K(X, xi ) (Y ◦ α) ≤ d. (15) a better approach is possible through careful design of
the algorithm: Our key observation regarding (10) is
Note that threshold t does not appear, which means that (Σ+ − Σ− ) is symmetric, which means it can be
this is simply a linear constraint on α. diagonalized by an orthogonal matrix:
Pp
4.2.3 Covariance Constraints Σ+ − Σ− = i=1 ζi vi viT , (20)

Our covariance constraint can be rewritten for the ker- where ζi ∈ R and the vi form an orthonormal basis.
nel SVM by first recalling that the kernel SVM with Now let Iζ+ = {i : ζi > 0} and Iζ− = {i : ζi < 0}, and
K(x, x0 ) = xT x0 generates the same classifier as di- define the positive semidefinite matrices
rectly solving a linear SVM. Thus, we have the rela-
T
P
tionship w = X T (Y ◦ α) in this special case. Next, Uζ+ = i∈Iζ+ ζi vi vi
(21)
observe that Uζ− = − i∈Iζ− ζi vi viT
P

1 1
 T
Σ+ = #P X+ I − #P eeT X+ This means that the function
1 1
 T (16)
Σ− = #N X− I − #N eeT X−
wT (Σ+ − Σ− )w = wT Uζ+ w − wT Uζ− w (22)
So if apply the kernel trick to our covariance constraint
in the covariance constraint (10) is the difference of
(10) with the above relationships, then the resulting
the two convex functions wT Uζ+ w and wT Uζ− w.
covariance constraint for kernel SVM becomes
There is an important point to note regarding the
−s ≤ (Y ◦ α)T (S+ − S− )(Y ◦ α) ≤ s, (17) practical importance of the spectral decomposition we
performed above. The function in the covariance con-
where we have straint (10) can alternatively be written as the differ-
1 1 T
 T ence of the two convex functions wT Σ+ w and wT Σ− w.
S+ = #P K(X, X+ ) I− #P ee K(X, X+ )
(18) However, using this alternative decomposition yields
1 1 T
 T
S− = #N K(X, X− ) I− #N ee K(X, X− )
an algorithm where the convexified subproblems are
Manuscript under review by AISTATS 2018

Unconstrained Linear Spectral (DC)


1 1 1

Visualization
0 0 0

-1 -1 -1
-1 0 1 -1 0 1 -1 0 1
0.5 4 2
[x Ti w | z i ]

2 1

0 0 0
-5 0 5 -1 0 1 -2 0 2
0.1 0.4 0.4
[x Ti w | y i ]

0.05 0.2 0.2

0 0 0
-50 0 50 -10 0 10 -5 0 5

Figure 2: A comparison of the unconstrained, linear and spectral SVM methodologies on two-dimensional data.
The first row visualizes the data, as well as the optimal support vectors from each of the methodologies. The
second and third rows show the density of xT i w conditioned on the z and y variables, respectively. In each case,
the blue curve represents the density for zi = 1 (yi = 1) and the red curve the density for zi = −1 (yi = −1).

weaker relaxations than the convex subproblems gen- gives iterates wk that converge to a local minimum.
erated using the spectral decomposition. As a result,
our algorithm given below is ultimately more effective This theorem is simply an application of a theorem
because it employs the spectral decomposition. by Smola et al. [2005], and the constraint qualification
required by this theorem trivially holds in our case
Consequently, the constrained convex-concave proce- because all of our convex constraints are linear.
dure [Tuy, 1995, Yuille and Rangarajan, 2002, Smola
et al., 2005] can be used to design an algorithm for
5.2 Kernel SVM
our setting. We opt to use a penalized form of the
quadratic constraint in our spectral algorithm to en- We can also design a spectral algorithm to compute
sure feasibility always holds. Let wk ∈ Rp be a fixed fair kernel SVM’s. Since (S+ − S− ) in (17) is symmet-
point, and consider the optimization problem where ric, t can be diagonalized by an orthogonal matrix:
the concave terms are linearized:
Pp
Pn 2 S+ − S− = i=1 ξi νi νiT , (24)
min i=1 ui + λkwk2 + µ · t
s.t. yi (xT
i w + b) ≥ 1 − ui , for i ∈ [n] where ξi ∈ R and the νi form an orthonormal basis.
ui ≥ 0, for i ∈ [n] Now let Iξ+ = {i : ξi > 0} and Iξ− = {i : ξi < 0}, and
T define the positive semidefinite matrices
1 1
P P
−d≤ #P i∈P xi − #N i∈N xi w≤d
T
P
Uξ+ = i∈Iξ+ ξi νi νi
wT Uζ+ w − wkT Uζ− wk − 2wkT Uζ−
T
(w − wk ) ≤ t (25)
Uξ− = − i∈Iξ− ξi νi νiT
P
wT Uζ− w − wkT Uζ+ wk − 2wkT Uζ+
T
(w − wk ) ≤ t
(23) This means that the function
Our spectral algorithm for computing a fair linear
SVM consists of the constrained CCP adapted to the (Y ◦ α)T (Σ+ − Σ− )(Y ◦ α) =
problem of computing a linear SVM with the linear
constraint (8) and covariance constraint (10): We ini- (Y ◦ α)T Uζ+ (Y ◦ α) − (Y ◦ α)T Uζ− (Y ◦ α) (26)
tialize w0 by solving a linear SVM with only the linear in (17) is the difference of the two convex functions
constraint (8), and then computing successive wk by (Y ◦ α)T Uξ+ (Y ◦ α) and (Y ◦ α)T Uξ− (Y ◦ α).
solving (23). This produces a local minimum.
Thus, the constrained convex-concave procedure [Tuy,
Theorem 1 (Smola et al. [2005]) The spectral al- 1995, Yuille and Rangarajan, 2002, Smola et al., 2005]
gorithm defined above for computing a fair linear SVM can be used to design an algorithm. We use a penalized
Manuscript under review by AISTATS 2018

1 1 1
form of the quadratic constraint in our spectral algo-
rithm to ensure feasibility always holds. Let αk ∈ Rn
be a fixed point, and consider the optimization prob- 0.5 0.5 0.5

lem where the concave terms are linearized:


Pn
min (Y · α)T K(X, X)(Y · α) − i=1 αi + µ · t
0 0 0
s.t. Y T α = 0 0 0.5 1 0 0.5 1 0 0.5 1

0 ≤ αi ≤ λ, for i ∈ [n] (a) (b) (c)


1 1 1
1
− d ≤ #P i∈P K(X, xi )T (Y ◦ α)+
P

1 T
P
− #N i∈N K(X, xi ) (Y ◦ α) ≤ d 0.5 0.5 0.5

T T
(Y ◦ α) Uξ+ (Y ◦ α) − (Y ◦ αk ) Uξ− (Y ◦ αk )+
− 2(Y ◦ αk )T Uζ−
T
(Y ◦ (α − αk )) ≤ t 0 0 0
0 0.5 1 0 0.5 1 0 0.5 1
T T
(Y ◦ α) Uξ− (Y ◦ α) − (Y ◦ αk ) Uξ+ (Y ◦ αk )+ (d) (e) (f)
− 2(Y ◦ αk )T Uζ+
T
(Y ◦ (α − αk )) ≤ t
(27) Figure 3: ROC plots for the three SVM algorithms on
Our spectral algorithm for computing a fair kernel both datasets. In each case, the solid blue line is the
SVM consists of the constrained CCP adapted to the ROC curve for the y label and the dotted red line the
problem of computing a kernel SVM with the linear ROC curve for the protected z label. Figures 3a to 3c
constraint (15) and covariance constraint (17): We ini- show the ROC plots for LSVM, ZSVM, and SSVM on
tialize w0 by solving a kernel SVM with only the linear the wine quality data, and Figures 3d to 3f show the
constraint (15), and then computing successive wk by same for the German credit data.
solving (23). This produces a local minimum.

Theorem 2 (Smola et al. [2005]) The spectral al- Results The results of the three methods using d =
gorithm defined above for computing a fair kernel SVM 0.075 and µ = 10 are shown in the three columns of
gives iterates wk that converge to a local minimum. Figure 2. Here, points in N and P are differentiated by
This theorem is simply an application of a theorem marker shape (“x” and “o”, respectively), and points
by Smola et al. [2005], and the constraint qualification with label yi = −1 and yi = 1 are differentiated by
required by this theorem trivially holds in our case color (red and green, respectively). If w denotes the
because all of our convex constraints are linear. coefficients computed by each method, then the second
row shows the empirical densities of xT i w conditioned
on the protected class zi , and the third row shows the
6 NUMERICAL EXPERIMENTS empirical densities of xT i w conditioned on the label
yi . Fairness occurs if the conditional densities in the
We use synthetic and real datasets to evaluate the effi- second row are similar, and prediction accuracy oc-
cacy of our approach. We compare linear SVM’s com- curs if the densities in the third row are disparate.
puted using our spectral algorithm (SSVM) to a stan- These results show that the densities of [xT i w|zi = +1]
dard linear SVM (LSVM) and a linear SVM computed and [xT i w|zi = −1] are distinguishable for SSVM and
using the approach of Zafar et al. [2017] (ZSVM), since ZSVM, while they are almost identical for SSVM. On
this is the only existing approach that to our knowl- the other hand, the densities of [xT i w|yi = +1] and
edge is designed to ensure fairness at all thresholds. [xTi w|yi = −1] are distinct for all three methods.

6.1 Synthetic Data


6.2 Real World Datasets
Experimental Design We seek to examine the case
in which our y and z labels are correlated, and XN and Data overview We next use a wine quality dataset
XP have differing covariances. Thus, we generate 200 Cortez et al. [2009] and a dataset of German credit
data points where the y and z labels are generated card customers [Lichman, 2013]. The first dataset is a
through logit models using two separate sets of ran- compilation of 12 attributes of 6,497 wines (e.g., acid-
domly generated “true” parameters, with dot product ity, residual sugar, alcohol content, and color), as well
between the logit parameters of y and z of 0.85. The as a ranking out of 10 that is provided by professional
singular values of the covariance matrix of [Xi |zi = 1] taste-testers. Here, we label yi = 1 when a wine is
were then skewed to generate the data seen in Figure 2. rated as a 6 or above and yi = −1 otherwise, and
The empirical correlation of yi and zi is 0.45. we define zi = 1 for white wines and zi = −1 for reds.
Manuscript under review by AISTATS 2018

0.8 =0.01 0.771


=0.03
=0.05 0.77
0.795
0.769

0.768
0.79
=1

AUC
AUC

=0.1 0.767 =0.01


0.785 0.766

0.765 =0.03
0.78
=0.05
0.764

0.775 0.763
0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.15 0.16 0.17 0.18 0.19 0.2

(a) Wine quality data. (b) German credit data.

Figure 4: Comparing the accuracy and fairness of the ZSVM and SSM methods for various d and µ. The solid
red line represents results for the ZSVM, and the dotted blue lines denote results for the SSVM for some µ.

Notably, all explanatory variables are continuous. The fairness than LSVM while keeping high accuracy. The
second dataset is a compilation of 20 attributes (e.g., tradeoff curves between prediction accuracy and fair-
marriage, employment and housing status, number of ness are shown in Figure 4. Increasing d generally
existing credit lines, and age) of 1000 German appli- jointly decreases fairness and increases accuracy, while
cants for loans. We label yi = 1 for applicants that small increases in µ for our SSVM can often improve
defaulted and yi = −1 for applicants that did not de- both fairness and accuracy. Large increases in µ gen-
fault, and let zi = 1 for applicants that are renting erally increase fairness but decrease accuracy. Note
a home and zi = −1 for applicants that own their that setting µ = 0 leads to the curve for SSVM to
home. Note that a large number of variables are dis- align with the curve of ZSVM, since they are equiva-
crete. There is no missing data in either dataset. lent when µ = 0. Also observe that ∆ is more sensitive
to changes in d and µ than the AUC, which implies
Metrics of comparison We compare SSVM and that we are able control fairness without losing much
ZSVM based on the tradeoffs that they make between predictive accuracy.
predictive accuracy for y, measured using the area un-
der the ROC curve (AUC), and their fairness with re- 7 Conclusion
spect z, measured by DP-∆ with respect to z.
We considered multi-threshold notions of fairness for
Experimental Design We conducted five rounds classifiers, and designed a nonconvex constraint to im-
of cross-validation on each dataset and computed the prove the fairness of linear and kernel SVM’s under
average AUC and average ∆, using a 70-30 training- all thresholds. We developed an iterative optimiza-
testing split. Within each round, we first apply 5-fold tion algorithm (that uses a spectral decomposition)
cross-validation on LSVM to choose the λ that max- to handle our nonconvex constraint in the resulting
imizes AUC, and this value of λ was used with both problem to compute the SVM, and empirically com-
SSVM and ZSVM to minimize the impact of cross- pared our approach to standard linear SVM and an
validation on the comparison between methods. We SVM with a linear fairness constraint using both syn-
varied d over the values 0, 0.001, 0.002, 0.005, 0.01, thetic and real data. We found that our method can
0.025, 0.05 and 0.1. And for SSVM we tried several strictly improve the fairness of classifiers for all thresh-
values of µ, which are shown in our plots. olding values with little loss in accuracy; in fact, some
of our results even showed a slight increase in accuracy
Results Figure 3 shows representative examples of with increasing fairness. Our work opens the door for
ROC curves for both datasets from one instance of further research in a number of areas, including hier-
cross-validation. Both ZSVM and SSVM improve fair- archies of fairness constraints considering subsequent
ness with respect to LSVM while maintaining high ac- moments of the data, and theoretical guarantees on
curacy, and SSVM ensures an even stricter level of the fairness of such classification methods.
Manuscript under review by AISTATS 2018

References Information Processing Systems, pages 3315–3323,


2016.
S. Barocas and A. D. Selbst. Big data’s disparate im-
pact. California Law Review, 104, 2016. J. Kleinberg, S. Mullainathan, and M. Raghavan. In-
herent trade-offs in the fair determination of risk
D. Bertsimas, A. King, and R. Mazumder. Best subset
scores. arXiv preprint arXiv:1609.05807, 2016.
selection via a modern optimization lens. Annals of
Statistics, 44(2):813–852, 2016. B. Kocuk, S. S. Dey, and X. A. Sun. Inexactness of sdp
relaxation and valid inequalities for optimal power
E. Bhandari. Big data can be used to violate civil
flow. IEEE Transactions on Power Systems, 31(1):
rights laws, and the FTC agrees, 2016. URL http:
642–651, 2016.
//www.aclu.org/blog/privacy-technology/
surveillance-technologies/big-data-can-be- J. B. Lasserre. Global optimization with polynomials
used-violate-civil-rights-laws-and. and the problem of moments. SIAM Journal on
Optimization, 11(3):796–817, 2001.
T. Calders, F. Kamiran, and M. Pechenizkiy. Building
classifiers with independency constraints. In Data M. Lichman. UCI machine learning repository, 2013.
mining workshops, 2009. ICDMW’09. IEEE inter- URL http://archive.ics.uci.edu/ml.
national conference on, pages 13–18. IEEE, 2009. Y. Lou, R. Caruana, and J. Gehrke. Intelligible models
R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, for classification and regression. In Proceedings of
and N. Elhadad. Intelligible models for healthcare: the 18th ACM SIGKDD international conference on
Predicting pneumonia risk and hospital 30-day read- Knowledge discovery and data mining, pages 150–
mission. In Proceedings of the 21st ACM SIGKDD 158. ACM, 2012.
International Conference on Knowledge Discovery C. Louizos, K. Swersky, Y. Li, M. Welling, and
and Data Mining, pages 1721–1730. ACM, 2015. R. Zemel. The variational fair autoencoder. arXiv
V. Chandrasekaran and M. I. Jordan. Computational preprint arXiv:1511.00830, 2015.
and statistical tradeoffs via convex relaxation. Pro- Z.-Q. Luo, W.-K. Ma, A. M.-C. So, Y. Ye, and
ceedings of the National Academy of Sciences, 110 S. Zhang. Semidefinite relaxation of quadratic opti-
(13):E1181–E1190, 2013. mization problems. IEEE Signal Processing Maga-
P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and zine, 27(3):20–34, 2010.
J. Reis. Modeling wine preferences by data mining R. Madani, S. Sojoudi, G. Fazelnia, and J. Lavaei.
from physicochemical properties. Decision Support Finding low-rank solutions of sparse linear matrix
Systems, 47(4):547–553, 2009. inequalities using convex optimization. SIAM Jour-
A. Cotter, M. Friedlander, G. Goh, and M. Gupta. nal on Optimization, 27(2):725–758, 2017.
Satisfying real-world goals with dataset constraints. R. Miyashiro and Y. Takano. Mixed integer second-
arXiv preprint arXiv:1606.07558, 2016. order cone programming formulations for variable
C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and selection in linear regression. European Journal of
R. Zemel. Fairness through awareness. In Proceed- Operational Research, 247(3):721–731, 2015.
ings of the 3rd Innovations in Theoretical Computer Mosek. The MOSEK optimization toolbox for MAT-
Science Conference, pages 214–226. ACM, 2012. LAB manual, 2017.
S. A. Friedler, C. Scheidegger, and S. Venkatasubra- J. Podesta, P. Pritzker, E. Moniz, J. Holdren, and
manian. On the (im) possibility of fairness. arXiv J. Zients. Big data: Seizing opportunities, preserv-
preprint arXiv:1609.07236, 2016. ing values. Executive Office of the President, 2014.
M. X. Goemans and D. P. Williamson. Improved G. Ridgeway, D. Madigan, T. Richardson, and
approximation algorithms for maximum cut and J. O’Kane. Interpretable boosted naı̈ve bayes clas-
satisfiability problems using semidefinite program- sification. In KDD, pages 101–104, 1998.
ming. Journal of the ACM (JACM), 42(6):1115– A. J. Smola, S. Vishwanathan, and T. Hofmann. Ker-
1145, 1995. nel methods for missing variables. In AISTATS,
J. Gouveia, P. A. Parrilo, and R. R. Thomas. Lifts of 2005.
convex sets and cone factorizations. Mathematics of H. Tuy. Dc optimization: theory, methods and algo-
Operations Research, 38(2):248–264, 2013. rithms. In Handbook of global optimization, pages
Gurobi. Gurobi optimizer reference manual, 2016. 149–216. Springer, 1995.
URL http://www.gurobi.com. A. L. Yuille and A. Rangarajan. The concave-convex
M. Hardt, E. Price, and N. Srebro. Equality of oppor- procedure (cccp). In Advances in neural information
tunity in supervised learning. In Advances in Neural processing systems, pages 1033–1040, 2002.
Manuscript under review by AISTATS 2018

M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P.


Gummadi. Fairness constraints: Mechanisms for
fair classification. In Proceedings of the 20th In-
ternational Conference on Artificial Intelligence and
Statistics, 2017.
R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and
C. Dwork. Learning fair representations. In Proceed-
ings of the 30th International Conference on Ma-
chine Learning (ICML-13), pages 325–333, 2013.
I. Zliobaite. On the relation between accuracy and
fairness in binary classification. arXiv preprint
arXiv:1505.05723, 2015.

You might also like