Pro/Con List vs. Linear Regression
In order to make such a choice, I wish to construct a ranking function - a function which
takes as input the characteristics of a woman and returns as output a single number.
This ranking function is meant to approximate my utility function
(http://en.wikipedia.org/wiki/Utility) - a higher number means that by making this
choice I will be happier. If the ranking closely approximates utility, then I can use the
ranking function as an effective decision-making tool.
In concrete terms, I want to build a function f : Women → ℝ which approximately
predicts my happiness. If f(Svetlana) > f(Elise) I will choose Svetlana, and vice versa
if the reverse inequality holds.
One of the simplest procedures for building a ranking function dates back to 1772, and
was described by Benjamin Franklin (http://www.procon.org/view.backgroundresource.php?resourceID=1474):
...my Way is, to divide half a Sheet of Paper by a Line into two Columns,
writing over the one Pro, and over the other Con. Then...I put down under
the different Heads short Hints of the different Motives...I find at length
where the Ballance lies...I come to a Determination accordingly.
The mathematical name for this technique is unit-weighted regression
(http://en.wikipedia.org/wiki/Unit-weighted_regression), and the more commonplace
name is a pro/con list.
I present the method in a slightly different format - in each column a different choice is
listed. Each row represents a characteristic, all of which are pros. A con is transformed
into a pro by negation - rather than treating "Fat" as a con, I treat "Not Fat" as a pro. If
one of the choices possesses the characteristic under discussion, a +1 is assigned to the
relevant row/column, otherwise 0 is assigned:
[Table: one column per choice (Svetlana, Elise), one row per pro characteristic, with a +1 or 0 in each cell.]
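As a concrete illustration, here is a minimal sketch of this scoring rule in code. Elise's vector is the one used later in the post; Svetlana's vector and the idea of seven features are made up for illustration:

```python
# Unit-weighted regression: every pro counts +1, so the score is just a sum.
svetlana = [1, 0, 1, 1, 0, 1, 1]   # +1 where she has the characteristic (made up)
elise    = [0, 1, 1, 0, 1, 1, 0]

def f(x):
    # All coefficients are forced to be +1.
    return sum(x)

# Pick whichever choice has the higher total of pros.
print("Svetlana" if f(svetlana) > f(elise) else "Elise")
```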
At this point, anyone with a statistics background is probably asking why I would ever
use unit-weighted regression, as opposed to one of the many fancier techniques
available. Why not use linear regression, rather than forcing all the coefficients
to be +1?
I'll be concrete, and consider the case of linear regression in particular. Linear regression
is a lot like a pro/con list, except that the weight of each feature is allowed to vary. In
mathematical terms, we represent each possible choice as a binary vector - for
example:

Elise = [0, 1, 1, 0, 1, 1, 0]

Then the predictor function uses a set of weights which can take on values other than
+1:

$$f(x) = \sum_i h_i x_i$$

The individual weights h_i represent how important each variable is. For example, "Smart"
might receive a weight of +3.3, "Not fat" a weight of +3.1 and "Black" a weight of +0.9.
The weights can be determined with a reasonable degree of accuracy by taking past
data and choosing the weights which minimize the difference between the "true" value
and the approximate value - this is what least squares (http://en.wikipedia.org/wiki/Least_squares) does.
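To make that concrete, here is a minimal sketch of fitting such weights with ordinary least squares - my own illustration with made-up past data, not code from the original analysis:

```python
# A minimal sketch of least squares: choose h to minimize ||X h - happiness||^2.
import numpy as np

# Each row is the binary feature vector of a past choice; each entry of
# `happiness` is the outcome observed after making that choice (made up).
X = np.array([
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
])
happiness = np.array([7.1, 3.2, 6.8, 4.0, 8.3])

h, *_ = np.linalg.lstsq(X, happiness, rcond=None)
print(h)  # learned weights, one per feature
```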
The difficulty with using a fancier learning tool is that it only works when you have
sufficient data. To robustly fit a linear model, you'll need tens to hundreds of data points
per feature. If you have too few data points, you run into a real danger of overfitting -
building a model which accurately memorizes the past, but fails to predict the future.
You can even run into this problem if you have lots of data points, but those data points
don't represent all the features in question.
It also requires more programming sophistication to build, and more mathematical
sophistication to recognize when you are running into trouble.
For the rest of this post I'll be comparing a Pro/Con list to Linear Regression
(http://en.wikipedia.org/wiki/Linear_regression), since this will make the theoretical
comparison tractable and keep the explanation simple. Let me emphasize that I'm not
pushing a pro/con list as a solution to every ranking problem - I'm just pushing it as a
nice, simple starting point.
To compare the two methods, suppose that the true utility is given by linear regression
with some unknown weight vector h, normalized so that Σ_i h_i = 1, while the pro/con
list uses the equal weights u = (1/N, …, 1/N).

An error is made whenever the pro/con list and linear regression rank two vectors
differently - i.e., linear regression says "choose Elise" while the pro/con list says "choose
Svetlana". The error rate of the pro/con list is the probability of making an error given
two random feature vectors x and y, i.e.:

$$\textrm{error rate}(h) = P\left(\textrm{sign}\left([h \cdot (x-y)][u \cdot (x-y)]\right) < 0\right)$$

As I'll show below, the typical error rate works out to be about 1/4. There are of course
vectors h for which the error rate is higher, and others for which it is lower. But on
average, the error rate is bounded by 1/4.

In this sense, the pro/con list is 75% as good as linear regression.
We can confirm this fact by computer simulation - generating a random ensemble of
vectors h satisfying

$$\sum_i |h_i| = 1.0 \qquad \textrm{and} \qquad \forall i,\ h_i \ge 0,$$

and then measuring how accurately unit-weighted regression agrees with it. The result:

[Figure: simulated agreement between the pro/con list and linear regression across the random ensemble of weight vectors h.]
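Here is a minimal sketch of such a simulation - my own reconstruction, not the post's original code - assuming h is drawn from a uniform Dirichlet distribution (which satisfies the two constraints above) and the choices are random binary vectors:

```python
# Draw h from a uniform Dirichlet distribution, draw random binary feature
# vectors, and measure how often unit weights disagree with the "true"
# weighted ranking.
import numpy as np

rng = np.random.default_rng(0)
N = 10            # number of features
trials = 100_000

h = rng.dirichlet(np.ones(N))          # sum(h) == 1 and h >= 0
x = rng.integers(0, 2, (trials, N))    # random binary choices
y = rng.integers(0, 2, (trials, N))
d = x - y

true_sign = np.sign(d @ h)             # ranking by the "true" utility
unit_sign = np.sign(d.sum(axis=1))     # ranking by the pro/con list
errors = (true_sign * unit_sign) < 0   # strict disagreements

print("error rate:", errors.mean())    # typically well below 1/4
```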
Mathematical results
I don't know how to prove how closely a Pro/Con list approximates linear regression for
binary feature vectors. However, if we assume that the feature vectors x and y are
normally distributed (http://en.wikipedia.org/wiki/Normal_distribution) instead, I can
prove the following theorem:

Theorem: Suppose h is drawn from a uniform Dirichlet distribution and x, y have
components which are independent identical normally distributed variables. Then:

$$E[\textrm{error rate}(h)] < \frac{1}{\pi} \arctan\left( \sqrt{\frac{N-1}{N+1}} \right)$$

This means that averaged over all vectors h, the error rate is bounded by 1/4. There are
of course individual vectors h with a higher or lower error rate, but the typical error rate
is 1/4.

Unfortunately I don't know how to prove this is true for Bernoulli (binary) vectors x, y.
Any suggestions would be appreciated.

If we run a Monte Carlo simulation, we can see that this theorem appears roughly
correct:
[Figure: Monte Carlo estimates of the error rate for Gaussian and Bernoulli feature vectors, plotted against the theorem's bound.]
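A sketch of such a check - again my reconstruction, not the original code: under the Gaussian assumption the error rate has a closed form (derived below), which we can average over Dirichlet-distributed h and compare to the theorem's bound:

```python
# For Gaussian feature differences, error rate(h) = arctan(sqrt(N)*|h - u|)/pi,
# so average that closed form over Dirichlet h and compare to the bound.
import numpy as np

rng = np.random.default_rng(1)
N = 10
u = np.full(N, 1.0 / N)

hs = rng.dirichlet(np.ones(N), size=100_000)
rates = np.arctan(np.sqrt(N) * np.linalg.norm(hs - u, axis=1)) / np.pi

bound = np.arctan(np.sqrt((N - 1) / (N + 1))) / np.pi
print(rates.mean(), "<", bound)  # the average should fall below the bound
```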
To prove the theorem, first consider an example: h = [0.9, 0.05, 0.05] - a bit of an extreme case, but reasonable. In this
example, what sorts of vectors x, y will result in unit-weighted regression disagreeing
with the true ranking? Here is one example:

x = [1, 0, 0]
y = [0, 1, 1]

In this case h · x = 0.9 while h · y = 0.1, so the true ranking prefers x; but the pro/con
list counts one pro for x against two pros for y, and therefore prefers y.

To compute how often this happens, define d = x − y, and note that d_i ~ N(0, 1). This is an unrealistic assumption, but one which
is mathematically tractable. We want to compute the probability:

$$\textrm{error rate}(h) = P\left(\textrm{sign}\left([h \cdot d][u \cdot d]\right) < 0\right)$$

Define p = h − u, and note that u · d ~ N(0, N^{−1}) while p · d ~ N(0, Σ_i p_i²). Since
Σ_i h_i = 1, the components of p sum to zero, so p ⊥ u and the variables u · d and
p · d are statistically independent.

Note: Obtaining this statistical independence is why we needed to assume the feature
vectors were normal - showing statistical independence in the case of binary vectors is
harder. A potentially easier test case than binary vectors might be random vectors
chosen uniformly from the unit ball in l^∞, aka vectors for which max_i |x_i| < 1.
We've now reduced the problem to simple calculus. Let v = u · d and w = p · d, with
variances σ_u² = N^{−1} and σ_p² = Σ_i p_i². Then:

$$\textrm{error rate}(h) = \iint_E C \exp\left( -\left( \frac{v^2}{2\sigma_u^2} + \frac{w^2}{2\sigma_p^2} \right) \right) \, dw \, dv$$

where E is the error region - the set of pairs (v, w) for which sign(v) ≠ sign(v + w) -
and C = 1/(2π σ_u σ_p). Rescaling v and w by their standard deviations turns E into two
opposite wedges of angle θ_0 each, so in polar coordinates:

$$\textrm{error rate}(h) = \frac{2}{2\pi} \int_0^{\theta_0} \int_0^{\infty} e^{-r^2/2} \, r \, dr \, d\theta = \frac{\theta_0}{\pi}$$

Here θ_0 = arccot(σ_u/σ_p), so:

$$\textrm{error rate}(h) = \frac{\pi/2 - \operatorname{arccot}(\sigma_p/\sigma_u)}{\pi} = \frac{\arctan(\sigma_p/\sigma_u)}{\pi} = \frac{\arctan(\sqrt{N}\,\sigma_p)}{\pi}$$
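As a sanity check, here is a small sketch (mine, not from the post) comparing this closed form against a direct simulation for the extreme example above:

```python
# Numerical check of error rate(h) = arctan(sqrt(N) * |h - u|) / pi
# for Gaussian d.
import numpy as np

rng = np.random.default_rng(2)
h = np.array([0.9, 0.05, 0.05])   # the extreme example from above
N = len(h)
u = np.full(N, 1.0 / N)

d = rng.standard_normal((1_000_000, N))
empirical = ((d @ h) * (d @ u) < 0).mean()
closed_form = np.arctan(np.sqrt(N) * np.linalg.norm(h - u)) / np.pi
print(empirical, closed_form)     # the two should agree closely
```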
So σ_u² = N^{−1} always, and for the worst case we can take h = [1, 0, …, 0], which gives

$$\sigma_p^2 = |h - u|^2 = \frac{N-1}{N},$$

i.e. σ_p = √((N−1)/N) while σ_u = 1/√N, and arccot(σ_u/σ_p) = arccot(1/√(N−1)) → π/2
as N → ∞. This implies:

$$\textrm{error rate}(h) = \frac{\arctan(\sqrt{N-1})}{\pi} \to \frac{1}{2}$$

This means that in the worst case, unit-weighted regression is no better than chance.
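A quick numerical illustration of that worst case (again a sketch of my own):

```python
# With all weight on a single feature, unit weights approach chance.
import numpy as np

rng = np.random.default_rng(3)
for N in (3, 10, 100):
    h = np.zeros(N)
    h[0] = 1.0                         # all weight on one feature
    d = rng.standard_normal((100_000, N))
    err = ((d @ h) * d.sum(axis=1) < 0).mean()
    print(N, err)                      # creeps toward 1/2 as N grows
```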
Averaging over h drawn from the uniform Dirichlet distribution, and writing σ_p² = |h − u|²:

$$E[\textrm{error rate}(h)] = \int \frac{\arctan\left(\sqrt{N |h-u|^2}\right)}{\pi} \, dh \le \frac{1}{\pi} \arctan\left( \sqrt{N \int |h-u|^2 \, dh} \right) = \frac{1}{\pi} \arctan\left( \sqrt{\frac{N (N-1)}{N(N+1)}} \right) = \frac{1}{\pi} \arctan\left( \sqrt{\frac{N-1}{N+1}} \right)$$

The inequality is Jensen's inequality (http://en.wikipedia.org/wiki/Jensen's_inequality),
since z ↦ arctan(√(Nz)) is a concave function. For large N this quantity approaches
arctan(1)/π = (π/4)/π = 1/4.
Note: I believe the reason the Bernoulli feature vectors appear to have lower error
than the Gaussian feature vectors for small N is that for small N, there is a significant
possibility that a feature vector might be 0 in the relevant components. The net result
of this is that h · (x − y) = 0 fairly often, meaning that many vectors have equal rank.
This effect becomes improbable as more features are introduced.
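A small sketch (my own) that measures how often such ties occur for binary features:

```python
# Measure how often the two rankings tie exactly for binary features:
# a tie on either side makes the sign product exactly zero.
import numpy as np

rng = np.random.default_rng(4)
trials = 100_000
for N in (3, 5, 10, 20):
    h = rng.dirichlet(np.ones(N))
    d = rng.integers(0, 2, (trials, N)) - rng.integers(0, 2, (trials, N))
    ties = (((d @ h) * d.sum(axis=1)) == 0).mean()
    print(N, ties)   # tie probability shrinks as N grows
```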
Thus, we have shown that the average error rate of unit-weighted regression is bounded
above by 1/4. The simulation also suggests that treating feature vectors as Gaussian
rather than Boolean vectors is a reasonable approximation to the problem - if anything,
it introduces extra error.
All the pre-readers I shared this with had two major but tangential questions which are
worth answering once and for all.
First, Olga Kurylenko (https://www.google.com/search?q=olga+kurylenko&
oq=olga+kurylenko) and Oluchi Onweagba (https://www.google.com
/search?q=oluchi+onweagba).
Second, I didn't waste time with gimp. Imagemagick was more than sufficient:
# -resize x594 will shrink height to 594, preserve aspect ratio
$ convert olga-kurylenko-too-big.jpg -resize 'x594' olga-kurylenko.jpg
# -tile x1 means tile the images with 1 row, however many columns are needed
$ montage -mode concatenate -tile x1 olga-kurylenko.jpg oluchi-onweagba.jpg composite.jpg
Comments

Personally, I appreciate the example. I don't know if it was an intentional troll against how overly
sensitive the tech crowd is around race and gender, but it worked that way nonetheless. There's
nothing even remotely sexist or racist here. Thanks for the informative article, I learned so much
about so many things.

Very unfortunately chosen example... people will very easily dismiss it as sexist (plus look at the smart
and black criteria...). Oh well... BTW, a simple Pro/Con list to evaluate the pros and cons of a Pro/Con
list vs. Linear Regression would have been 75% as good as your detailed analysis :)

A developer · 2 years ago: Interesting article Chris. I think your example is a bit too overtly sexist though. While it may resonate
with a male audience, it would be a major put-off to a female audience.
That said, I found the article informative and fascinating.

NotSexist · 2 years ago (replying to A developer): An article isn't sexist just because it uses a subset of the sexes as an example. That people
jump to such a conclusion is more telling about those people and the state of our culture. The
existence of such people surely means this article is higher risk, but it's not an inherently "evil"