Description
I’m having an issue using the prediction probabilities for sparse SVM, where many of the predictions come out the same for my test instances. These probabilities are produced during cross validation, and when I plot an ROC curve for the folds, the results look very strange, as there are a handful of clustered points on the graph. Here is my cross validation code, I based it off of the samples on the scikit website:
skf = StratifiedKFold(y, n_folds=numfolds)
for train_index, test_index in skf:
#split the training and testing sets
X_train, X_test = X_scaled[train_index], X_scaled[test_index]
y_train, y_test = y[train_index], y[test_index]
#train on the subset for this fold
print 'Training on fold ' + str(fold)
classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)
#Compute ROC curve and area the curve
fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
mean_tpr += interp(mean_fpr, fpr, tpr)
mean_tpr[0] = 0.0
roc_auc = auc(fpr, tpr)
I’m just trying to figure out if there’s something I’m obviously missing here, since I used this same training set and SVM parameters with libsvm and got much better results. When I used libsvm and printed out the distances from the hyperplane for the CV test instances and then plotted the ROC, it came out much more like I expected, and a much better AUC. Since the decision_function() method is not supported for sparse matrices, I cannot recreate this functionality in scikit, and therefore have to rely on the prediction probabilities.
There are 20k instances total, 10k positive and 10k negative, and I'm using 5-fold cross-validation. In the cross-validation results, there are several prediction values for which there are 1k-2k samples that all have the same prediction value, and there are only 3600 distinct prediction values over all of the folds for cross-validation. The resulting ROC looks like five big stair steps, with some little bits of fuzziness around the inner corners.
I have many sparse features, so I'm hashing those into index ranges for different types of feature subsets, so one feature subset will be in the index range 1 million to 2 million, the next will be in the range 2 million to 3 million, etc.