Machine Learning with Scikit-Learn
Andreas Mueller (NYU Center for Data Science, scikit-learn)
http://bit.ly/sklstrata
Me
Classification
Regression
Clustering
Semi-Supervised Learning
Feature Selection
Feature Extraction
Manifold Learning
Dimensionality Reduction
Kernel Approximation
Hyperparameter Optimization
Evaluation Metrics
Out-of-core learning
…
Get the notebooks!
http://bit.ly/sklstrata
Hi Andy,
I just received an email from the first tutorial
speaker, presenting right before you, saying
he's ill and won't be able to make it.
I know you have already committed yourself to
two presentations, but is there any way you
could increase your tutorial time slot, maybe
just offer time to try out what you've taught?
Otherwise I have to do some kind of modern
dance interpretation of Python in data :-)
-Leah
Hi Andreas,
I am very interested in your Machine Learning
background. I work for X Recruiting who have
been engaged by Z, a worldwide leading supplier
of Y. We are expanding the core engineering
team and we are looking for really passionate
engineers who want to create their own story and
help millions of people.
Can we find a time for a call to chat for a few
minutes about this?
Thanks
Supervised Machine Learning
Training: Training Data + Training Labels → Model
Generalization: Model + Test Data → Prediction
Evaluation: Prediction vs. Test Labels
clf = RandomForestClassifier()
clf.fit(X_train, y_train)       # Training Data + Training Labels → Model
y_pred = clf.predict(X_test)    # Model + Test Data → Prediction
clf.score(X_test, y_test)       # Prediction vs. Test Labels → Evaluation
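The four calls above can be run end to end. A minimal sketch, assuming a recent scikit-learn and using the built-in iris dataset as a stand-in for "Training Data" and "Training Labels":

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)              # learn a model from training data + labels
y_pred = clf.predict(X_test)           # predict on unseen test data
accuracy = clf.score(X_test, y_test)   # evaluate predictions against test labels
```

The same fit / predict / score interface works for every classifier in scikit-learn, which is what makes swapping models a one-line change.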
IPython Notebook:
Chapter 1 - Introduction to Scikit-learn
Unsupervised Machine Learning
Training Data → Model
Model + Test Data → New View
Unsupervised Transformations
pca = PCA()
pca.fit(X_train)                # Training Data → Model
X_new = pca.transform(X_test)   # Model + Test Data → Transformation
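A runnable sketch of the fit/transform pattern above, with random Gaussian data as a placeholder (the data is only there to make the shapes concrete):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 5))   # 100 samples, 5 features
X_test = rng.normal(size=(20, 5))

pca = PCA(n_components=2)
pca.fit(X_train)                # learn the projection from training data only
X_new = pca.transform(X_test)   # 2-dimensional new view of the test data
```

Note the asymmetry: `fit` sees only the training data, and the learned transformation is then applied unchanged to the test data.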
IPython Notebook:
Chapter 2 – Unsupervised Transformers
All Data
Training data | Test data
The training data is divided into five folds (Fold 1 to Fold 5); each of five splits holds out a different fold for evaluation and trains on the other four:
Split 1: evaluate on Fold 1, train on Folds 2-5
Split 2: evaluate on Fold 2, train on Folds 1, 3-5
Split 3: evaluate on Fold 3, train on Folds 1, 2, 4, 5
Split 4: evaluate on Fold 4, train on Folds 1-3, 5
Split 5: evaluate on Fold 5, train on Folds 1-4
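The five splits above correspond to `cv=5` in scikit-learn's `cross_val_score`. A minimal sketch, again borrowing the iris dataset for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# one score per split: each fold serves once as the held-out part
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```

The result is five accuracy values rather than one, which also gives a sense of how much the estimate varies across splits.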
IPython Notebook:
Chapter 3 - Cross-validation
All Data
Training data | Test data
Finding Parameters: cross-validation over the training data only (Splits 1-5 over Folds 1-5, as before)
Final evaluation: Test data
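A sketch of that protocol: split off the final test set first, do all parameter hunting with cross-validation on the rest, and touch the test set exactly once at the end. The dataset and the particular SVC parameters are illustrative choices, not from the original slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# carve off a test set that stays untouched until the final evaluation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)

# parameter search happens only on the training portion
val_scores = cross_val_score(SVC(C=1, gamma=0.001), X_trainval, y_trainval, cv=5)

# final evaluation on the held-out test set, done once
final_model = SVC(C=1, gamma=0.001).fit(X_trainval, y_trainval)
test_score = final_model.score(X_test, y_test)
```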
SVC(C=0.001,
gamma=0.001)
30
SVC(C=0.001, SVC(C=0.01, SVC(C=0.1, SVC(C=1, SVC(C=10,
gamma=0.001) gamma=0.001) gamma=0.001) gamma=0.001) gamma=0.001)
31
SVC(C=0.001, SVC(C=0.01, SVC(C=0.1, SVC(C=1, SVC(C=10,
gamma=0.001) gamma=0.001) gamma=0.001) gamma=0.001) gamma=0.001)
SVC(C=0.001, SVC(C=0.01, SVC(C=0.1, SVC(C=1, SVC(C=10,
gamma=0.01) gamma=0.01) gamma=0.01) gamma=0.01) gamma=0.01)
32
SVC(C=0.001, SVC(C=0.01, SVC(C=0.1, SVC(C=1, SVC(C=10,
gamma=0.001) gamma=0.001) gamma=0.001) gamma=0.001) gamma=0.001)
SVC(C=0.001, SVC(C=0.01, SVC(C=0.1, SVC(C=1, SVC(C=10,
gamma=0.01) gamma=0.01) gamma=0.01) gamma=0.01) gamma=0.01)
SVC(C=0.001, SVC(C=0.01, SVC(C=0.1, SVC(C=1, SVC(C=10,
gamma=0.1) gamma=0.1) gamma=0.1) gamma=0.1) gamma=0.1)
33
SVC(C=..., gamma=...) for every combination of
C in {0.001, 0.01, 0.1, 1, 10} and gamma in {0.001, 0.01, 0.1, 1, 10}
(25 candidate models in total)
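This 5 × 5 grid is exactly what `GridSearchCV` builds: every (C, gamma) pair is evaluated with cross-validation, and the best pair is refit on the whole training set. A sketch on the iris data (an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10],
              'gamma': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)          # fits 25 candidates x 5 folds = 125 models
best = grid.best_params_            # the winning (C, gamma) pair
test_score = grid.score(X_test, y_test)   # final evaluation on held-out data
```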
IPython Notebook:
Chapter 4 – Grid Searches
Cross Validation (over the whole chain)
Training Data + Training Labels
Feature Extraction → Scaling → Feature Selection → Model
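The chain above can be expressed as a scikit-learn `Pipeline`, so that every step is fit only on training data. The particular steps below (standard scaling, univariate feature selection, SVC) are a sketch standing in for the boxes in the diagram:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),   # Scaling
                 ('select', SelectKBest(k=2)),   # Feature Selection
                 ('clf', SVC())])                # Model
pipe.fit(X_train, y_train)        # fits all steps on the training data only
score = pipe.score(X_test, y_test)
```

Because the pipeline behaves like a single estimator, it drops straight into `cross_val_score` or `GridSearchCV`.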
IPython Notebook:
Chapter 5 - Preprocessing and Pipelines
Do cross-validation over all steps jointly.
Keep a separate test set until the very end.
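Concretely, "cross-validation over all steps jointly" means putting the whole pipeline inside the grid search, so preprocessing is refit on each training fold and never sees the validation part. A sketch, with an illustrative one-parameter grid (note the `stepname__parameter` naming convention):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [0.1, 1, 10]}           # step name + '__' + parameter
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
final_score = grid.score(X_test, y_test)        # test set touched only here
```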
Bag-of-Words Representations
CountVectorizer / TfidfVectorizer
“You better call Kenny Loggins”
tokenizer
['you', 'better', 'call', 'kenny', 'loggins']
Sparse matrix encoding (one column per vocabulary word):
aardvark … better … call … you … zyxst
[0, …, 0, 1, 0, …, 0, 1, 0, …, 0, 1, 0, …, 0]
Application: Insult detection
i really don't understand your point. It seems
that you are mixing apples and oranges.
Clearly you're a fucktard.
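A sketch of how such a detector could be wired up: a text vectorizer and a linear classifier in one pipeline. The four comments and labels below are toy stand-ins for a real labeled corpus, not data from the original tutorial:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["i really don't understand your point",
         "you are mixing apples and oranges",
         "clearly you're a fucktard",
         "what an idiot you are"]
labels = [0, 0, 1, 1]   # 1 = insult

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
pred = clf.predict(["clearly you're an idiot"])
```

Because the vectorizer lives inside the pipeline, cross-validating `clf` automatically refits the vocabulary on each training fold, which is exactly the "cross-validate all steps jointly" rule from earlier.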
IPython Notebook:
Chapter 6 - Working With Text Data
Overfitting and Underfitting
[Plot: accuracy vs. model complexity. Training accuracy keeps rising with complexity; generalization accuracy peaks at a "sweet spot" between the underfitting regime (left) and the overfitting regime (right).]
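The two curves in that plot can be traced with `validation_curve`, sweeping one complexity parameter. As a sketch, tree depth on the iris data (illustrative choices for both the model and the parameter):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
depths = [1, 2, 4, 8]
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name='max_depth', param_range=depths, cv=5)

# rows: one per depth; columns: one per cross-validation split
train_mean = train_scores.mean(axis=1)   # keeps rising with depth
test_mean = test_scores.mean(axis=1)     # peaks at the sweet spot
```

Plotting `train_mean` and `test_mean` against `depths` reproduces the training/generalization picture above.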
Linear SVM
(RBF) Kernel SVM
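A quick sketch contrasting the two on data with a curved class boundary: a linear SVM can only draw a straight line, while the RBF kernel can bend around the classes. The `make_moons` toy data and the gamma value are illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf', gamma=2).fit(X, y).score(X, y)
# the RBF kernel typically fits the curved "moons" shape better
```

Note these are training accuracies, shown only to illustrate model flexibility; per the earlier slides, a real comparison would use cross-validation.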
Decision Trees
Random Forests
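The idea behind random forests is to average many randomized trees into a smoother model than any single tree. A sketch comparing the two with cross-validation (iris again as an illustrative dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# one unpruned tree vs. an ensemble of randomized trees
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
forest_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
```

On harder datasets than this one, the forest usually generalizes noticeably better at the cost of interpretability and compute.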
Thank you for your attention.
@t3kcit
@amueller
importamueller@gmail.com