01 and 02: Introduction, Regression Analysis and Gradient Descent
Examples
Database mining
Machine learning has recently become such a big deal partly because of the huge amount of data being generated
Large datasets come from the growth of automation and the web
Sources of data include
Web data (click-stream or click through data)
Mine to understand users better
Huge segment of silicon valley
Medical records
Electronic records -> turn records into knowledge
Biological data
Gene sequences, ML algorithms give a better understanding of human genome
Engineering info
Data from sensors, log reports, photos etc
Applications that we cannot program by hand
Autonomous helicopter
Handwriting recognition
This makes mail routing very inexpensive - when you write an address on an envelope, algorithms can automatically route it through the postal system
Natural language processing (NLP)
AI pertaining to language
Computer vision
AI pertaining to vision
Self customizing programs
Netflix
Amazon
iTunes genius
Take a user's info
Learn based on their behavior
Understand human learning and the brain
If we can build systems that mimic (or try to mimic) how the brain works, this may push our own understanding of the associated
neurobiology
Example problem: "Given this data, a friend has a house of 750 square feet - how much can they expect to get for it?"
Another example
Can we define breast cancer as malignant or benign based on tumour size?
Looking at data
Five of each
Can you estimate prognosis based on tumor size?
This is an example of a classification problem
Classify data into one of two discrete classes - no in between, either malignant or not
In classification problems, can have a discrete number of possible values for the output
e.g. maybe have four values
0 - benign
1 - type 1
2 - type 2
3 - type 3
In classification problems we can plot data in a different way
Based on that data, you can try and define separate classes by
Drawing a straight line between the two groups
Using a more complex function to define the two groups (which we'll discuss later)
Then, when you have an individual with a specific tumor size and who is a specific age, you can hopefully use that information to place
them into one of your classes
You might have many features to consider
Clump thickness
Uniformity of cell size
Uniformity of cell shape
The most exciting algorithms can deal with an infinite number of features
How do you deal with an infinite number of features?
Neat mathematical trick in support vector machine (which we discuss later)
If you have an infinitely long list - we can develop an algorithm to deal with that
Summary
Supervised learning means you're given the "right answer" for each example in your data
Regression problem
Classification problem
Clustering algorithm
Basically, can you automatically generate structure from the data?
Because we don't give it the answer, it's unsupervised learning
Cocktail party problem - two people talking at the same time, recorded by microphones at different positions
Each microphone records a slightly different version of the conversation, depending on where it is
But the recordings overlap nonetheless
Have recordings of the conversation from each microphone
Give them to a cocktail party algorithm
Algorithm processes audio recordings
Determines there are two audio sources
Separates out the two sources
This sounds like a very complicated problem
Algorithm can be done with one line of code!
[W,s,v] = svd((repmat(sum(x.*x,1), size(x,1),1).*x)*x');
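% x holds the recorded audio from each microphone; svd is Octave's built-in singular value decomposition routine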
Not easy to identify
But, programs can be short!
Using Octave (or MATLAB) for examples
Often prototype algorithms in Octave/MATLAB to test them, as it's very fast
Only once you've shown an algorithm works do you migrate it to C++
This gives much faster, more agile development
Understanding this algorithm
svd - linear algebra routine which is built into Octave
In C++ this would be very complicated!
Experience has shown that using MATLAB/Octave to prototype like this is a really good way to work
Linear Regression
Housing price data example used earlier
Supervised learning regression problem
What do we start with?
Training set (this is your data set)
Notation (used throughout the course)
m = number of training examples
x's = input variables / features
y's = output / "target" variables
(x,y) - single training example
(x^(i), y^(i)) - a specific example (the i-th training example)
i is an index into the training set
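To make the notation concrete, here is a tiny Octave sketch with a made-up housing training set (the sizes and prices are invented purely for illustration):
  % Hypothetical training set: house size in square feet (x) and price in $1000s (y)
  x = [2104; 1416; 1534; 852];   % input variables / features
  y = [460;  232;  315;  178];   % output / "target" variables
  m = length(y);                 % m = number of training examples (4 here)
  i = 3;
  x(i)                           % x^(3) = 1534 - feature of the 3rd training example
  y(i)                           % y^(3) = 315  - target of the 3rd training example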
So in summary
A hypothesis takes in some variable
Uses parameters determined by a learning system
Outputs a prediction based on that input
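As a minimal sketch of that summary in Octave, assuming the single-feature linear hypothesis hθ(x) = θ0 + θ1·x used for the housing example (the parameter values below are made up):
  % Hypothesis: takes an input x, uses parameters theta0 and theta1, outputs a prediction
  h = @(x, theta0, theta1) theta0 + theta1 * x;
  % With made-up parameters theta0 = 50 and theta1 = 0.2, the predicted price
  % of a 750 square foot house would be:
  h(750, 50, 0.2)                % 50 + 0.2*750 = 200 (i.e. $200k)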
Minimize the squared difference between the predicted house price and the actual house price
J(θ0, θ1) = (1/2m) × Σ from i=1 to m of (hθ(x^(i)) - y^(i))²
1/m - means we determine the average
1/2m - the extra 2 just makes the math a bit easier, and doesn't change the values of θ0 and θ1 we determine at all (i.e. half the smallest value is still the smallest value!)
Minimizing over θ0/θ1 means we get the values of θ0 and θ1 which, on average, give the smallest deviation of the prediction hθ(x) from y when we use those parameters in our hypothesis function
More cleanly, this is a cost function
The cost function is a way, using your training data, to determine values for your θ parameters which make the hypothesis as accurate as possible
This cost function is also called the squared error cost function
This cost function is a reasonable choice for most regression problems
Probably most commonly used function
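As a sketch, the squared error cost function written out above can be computed in Octave like this (x, y and m follow the notation section; the example call reuses the made-up housing data from earlier):
  % Squared error cost: J(theta0, theta1) = (1/2m) * sum_i (h_theta(x^(i)) - y^(i))^2
  % where h_theta(x) = theta0 + theta1 * x and m = length(y)
  compute_cost = @(x, y, theta0, theta1) ...
      sum(((theta0 + theta1 * x) - y) .^ 2) / (2 * length(y));
  % e.g. evaluate the cost of the made-up parameters on the made-up training set:
  compute_cost([2104; 1416; 1534; 852], [460; 232; 315; 178], 50, 0.2)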
In case J(θ0, θ1) is a bit abstract, we'll go into what it does, why it works and how we use it in the coming sections
Lets consider some intuition about the cost function and why we want to use it
The cost function determines the parameters
The values of the parameters determine how your hypothesis behaves - different values generate different hypotheses
Simplified hypothesis
Assumes θ0 = 0
The cost function and goal here are very similar to when we have θ0, but with only a single parameter, which is simpler
Simplified hypothesis makes visualizing cost function J() a bit easier
The optimization objective for the learning algorithm is to find the value of θ1 which minimizes J(θ1)
So, here θ1 = 1 is the best value for θ1
We can see that the height (y) indicates the value of the cost function, so find where y is at a minimum
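A small Octave sketch of this: sweep θ1 over a range of values with θ0 fixed at 0, compute J(θ1) for each, and find where the cost is smallest. The toy data below (y = x exactly) is made up so that θ1 = 1 is the minimum, matching the example above.
  x = [1; 2; 3];
  y = [1; 2; 3];                       % data lies exactly on y = x, so theta1 = 1 fits perfectly
  m = length(y);
  theta1_values = -1:0.1:3;            % candidate values of theta1
  J = zeros(size(theta1_values));
  for k = 1:length(theta1_values)
    h = theta1_values(k) * x;          % simplified hypothesis (theta0 = 0)
    J(k) = sum((h - y) .^ 2) / (2*m);  % cost for this value of theta1
  end
  [minJ, idx] = min(J);
  theta1_values(idx)                   % ~1: the value of theta1 that minimizes J(theta1)
  plot(theta1_values, J)               % bowl-shaped curve; the lowest point is the best theta1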
Each point (like the red one above) represents a pair of parameter values for θ0 and θ1
Our example here put the values at
θ0 = ~800
θ1 = ~-0.15
Not a good fit
i.e. these parameters give a value on our contour plot far from the center
If we have
θ0 = ~360
θ1 = 0
This gives a better hypothesis, but still not great - not in the center of the contour plot
Finally we find the minimum, which gives the best hypothesis
Doing this by eye/hand is a pain in the ass
What we really want is an efficient algorithm for finding the minimum for θ0 and θ1
Here we can see one initialization point led to one local minimum
The other led to a different one
Derivative term
Not going to talk about it now, derive it later
There is a subtlety about how this gradient descent algorithm is implemented
Do this for θ0 and θ1
For j = 0 and j = 1, this means we simultaneously update both
How do we do this?
Compute the right hand side for both θ0 and θ1
So we need a temp value
Then, update θ0 and θ1 at the same time
We show this graphically below
If you implement the non-simultaneous update it's not gradient descent, and will behave weirdly
But it might look sort of right - so it's important to remember this!
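Here is a sketch of one simultaneous-update step in Octave, assuming the linear regression gradient terms that get derived later; the important detail is that both temp values are computed from the old θ0 and θ1 before either parameter is overwritten:
  x = [1; 2; 3];  y = [1; 2; 3];                        % made-up data so the snippet runs
  theta0 = 0;  theta1 = 0;  alpha = 0.1;
  m = length(y);
  h = theta0 + theta1 * x;                              % predictions using the OLD parameters
  temp0 = theta0 - alpha * (1/m) * sum(h - y);          % right hand side for theta0
  temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x);   % right hand side for theta1
  theta0 = temp0;                                       % only now update both parameters,
  theta1 = temp1;                                       % at the same time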
To understand gradient descent, we'll return to a simpler function where we minimize one parameter to help explain the algorithm in more
detail
min θ1 J(θ1 ) where θ1 is a real number
Two key terms in the algorithm
Alpha
Derivative term
Notation nuances
Partial derivative vs. derivative
Use the partial derivative when the function has multiple variables but we only differentiate with respect to one of them
Use the (ordinary) derivative when the function has just one variable
Derivative term
Derivative says
Let's take the tangent at the current point and look at the slope of that line
If the slope is positive (we're to the right of the minimum), the derivative is positive; alpha is always positive, so the update θ1 := θ1 - α·(derivative) makes θ1 a smaller value
Similarly, if the slope is negative (we're to the left of the minimum), the update makes θ1 a bigger value
Either way, θ1 moves towards the minimum
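A tiny numerical illustration, using a made-up one-parameter cost J(θ1) = (θ1 - 1)² whose minimum is at θ1 = 1:
  dJ = @(t) 2 * (t - 1);      % derivative of J(theta1) = (theta1 - 1)^2 (slope of the tangent)
  alpha = 0.3;
  t = 3;  t - alpha * dJ(t)   % slope is +4, so the update moves theta1 down: 3 -> 1.8
  t = -1; t - alpha * dJ(t)   % slope is -4, so the update moves theta1 up: -1 -> 0.2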
Alpha term (α)
What happens if alpha is too small or too large
Too small
Take baby steps
Takes too long
Too large
Can overshoot the minimum and fail to converge
When you get to a local minimum
Gradient of tangent/derivative is 0
So derivative term = 0
alpha * 0 = 0
So θ1 = θ1 - 0
So θ1 remains the same
As you approach the global minimum the derivative term gets smaller, so your update gets smaller, even when alpha is fixed
Means as the algorithm runs you take smaller steps as you approach the minimum
So no need to change alpha over time
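Putting the pieces together, here is a sketch of gradient descent on the one-parameter J(θ1) with a fixed alpha; the printed step size shrinks on its own as θ1 approaches the minimum, even though alpha never changes. The data and starting point are made up for illustration.
  x = [1; 2; 3];
  y = [1; 2; 3];                                  % minimum of J(theta1) is at theta1 = 1
  m = length(y);
  alpha = 0.1;                                    % fixed learning rate
  theta1 = 3;                                     % made-up starting point
  for iter = 1:25
    grad = (1/m) * sum((theta1 * x - y) .* x);    % derivative dJ/dtheta1
    step = alpha * grad;
    theta1 = theta1 - step;
    fprintf('iter %2d: theta1 = %.4f, step = %.4f\n', iter, theta1, abs(step));
  end
  % The steps shrink towards zero as theta1 converges to 1 - no need to reduce alpha over time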