In 1973, Anscombe published "Graphs in Statistical Analysis", a paper showing how important it is to graph the data whenever possible. Anscombe created four small datasets that look very different from one another, but it turns out that all four have the same linear regression coefficients (as you can see from the red line).
Furthermore, if we compute the mean, the variance, and the correlation, all four datasets give us the same values:
Graph | Mean X | Variance X | Mean Y | Corr(X, Y) |
---|---|---|---|---|
Linear | 9 | 11 | 7.5 | 0.816 |
Polynomial | 9 | 11 | 7.5 | 0.816 |
Outlier X | 9 | 11 | 7.5 | 0.816 |
Outlier Y | 9 | 11 | 7.5 | 0.816 |
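These numbers are easy to check yourself. Here is a minimal sketch, assuming you have seaborn installed (it ships a copy of Anscombe's quartet as its `anscombe` example dataset):

```python
import seaborn as sns

# Anscombe's quartet: columns are 'dataset' ('I'..'IV'), 'x', and 'y'
df = sns.load_dataset("anscombe")

# Per-dataset summary statistics: they come out (almost) identical
stats = df.groupby("dataset").agg(
    mean_x=("x", "mean"),
    var_x=("x", "var"),
    mean_y=("y", "mean"),
)
stats["corr_xy"] = df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"]))
print(stats)
```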
With this in mind, the first thing to do is to plot the dataset so we can get a glimpse of the data.
I was taking the Coursera course on Machine Learning and thought it would be nice to code the weekly exercise in a different language. With previous experience in Python, why not give it a shot?
The problem is: Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities. You would like to use this data to help you select which city to expand to next.
The idea is: given a population X, what is the predicted profit?
We will try to understand the data a little by plotting it:
```python
import matplotlib.pyplot as plt
import pandas

# Load the exercise data (I added a header row with the column names)
df = pandas.read_csv("ex1data1.txt")
x = df["City Population"].to_numpy()
y = df["Profit of Food Truck"].to_numpy()

plt.axis([4, 24, -5, 25])
plt.xlabel('City Population')
plt.ylabel('Profit of Food Truck')
plt.plot(x, y, 'bo')
plt.show()
```
And this is our first result:
We have to find the coefficients of a linear function $y = \theta_1 x + \theta_0$ that best fits the given data. The general idea of the algorithm is to find the theta values that minimize the cost.
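Written out, this is the standard squared-error cost from the course:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x_i - y_i \right)^2$$

The `computeCost` function below is a direct translation of this formula: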
```python
def computeCost(X, y, theta):
    m = len(y)
    n = 0.0
    for i in range(m):
        # Squared difference between the prediction and the actual value
        n += (theta[0] + theta[1] * X[i] - y[i]) ** 2
    J = (1 / (2 * m)) * n
    return J
```
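For comparison, here is a vectorized sketch of the same cost using NumPy; it should return the same value as the loop above:

```python
import numpy as np

def compute_cost_vectorized(X, y, theta):
    # Errors for all samples at once: theta[0] + theta[1] * x_i - y_i
    errors = theta[0] + theta[1] * np.asarray(X) - np.asarray(y)
    return (errors ** 2).sum() / (2 * len(y))
```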
And the minimization can be done with gradient descent.
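On every iteration, both thetas are updated simultaneously with the standard update rules from the course:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x_i - y_i \right)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x_i - y_i \right) x_i$$

This is what the function below implements (note the temporary variables, so that both updates use the old theta values):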
```python
def gradientDescent(X, y, theta, alpha, num_iters):
    m = len(y)
    for it in range(num_iters):
        n1 = 0.0
        n2 = 0.0
        for i in range(m):
            # Accumulate the gradient terms for theta[0] and theta[1]
            n1 += theta[0] + theta[1] * X[i] - y[i]
            n2 += (theta[0] + theta[1] * X[i] - y[i]) * X[i]
        # Simultaneous update: compute both new values before assigning
        temp1 = theta[0] - alpha * (1 / m) * n1
        temp2 = theta[1] - alpha * (1 / m) * n2
        theta[0] = temp1
        theta[1] = temp2
    return theta
```
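To tie it all together, here is a minimal sketch of how I run it, reusing `x` and `y` from the plotting step. If I remember the course defaults correctly, the exercise uses alpha = 0.01 and 1500 iterations, and the data expresses population and profit in units of 10,000:

```python
theta = gradientDescent(x, y, [0.0, 0.0], alpha=0.01, num_iters=1500)
print("theta:", theta)

# Predicted profit for a city of 70,000 people (population in units of 10,000)
population = 7.0
print("predicted profit:", theta[0] + theta[1] * population)
```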
This is how my code is structured:
And this is the final plot:
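If you want to reproduce it, here is a sketch that overlays the fitted line on the original scatter plot, reusing `x`, `y`, and `theta` from above:

```python
import numpy as np
import matplotlib.pyplot as plt

line_x = np.linspace(4, 24, 100)        # population range from the first plot
line_y = theta[0] + theta[1] * line_x   # fitted linear hypothesis

plt.axis([4, 24, -5, 25])
plt.xlabel('City Population')
plt.ylabel('Profit of Food Truck')
plt.plot(x, y, 'bo', label='Training data')
plt.plot(line_x, line_y, 'r-', label='Linear fit')
plt.legend()
plt.show()
```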
Of course, this is just an overview, and I strongly recommend taking the course.