Merged
19 changes: 10 additions & 9 deletions examples/kernel_approximation/plot_scalable_poly_kernels.py
@@ -64,11 +64,10 @@
X_train = mm.fit_transform(X_train)
X_test = mm.transform(X_test)


# %%
# As a baseline, train a linear SVM on the original features and print the
# accuracy. We also measure and store accuracies and training times to
# plot them latter.
# plot them later.

results = {}
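The baseline code itself is collapsed in this view. A minimal sketch of what such a baseline typically looks like, assuming X_train, y_train, X_test and y_test from the example are in scope; the LinearSVC settings, the timing approach and the "LSVM" results key are illustrative assumptions, not the literal collapsed code:

import time

from sklearn.svm import LinearSVC

lsvm = LinearSVC(dual=False)  # assumed settings; the example's actual parameters may differ
start = time.time()
lsvm.fit(X_train, y_train)
lsvm_time = time.time() - start
lsvm_score = 100 * lsvm.score(X_test, y_test)

results["LSVM"] = {"time": lsvm_time, "score": lsvm_score}  # hypothetical key name
print(f"Linear SVM score on raw features: {lsvm_score:.2f}%")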

@@ -95,11 +94,14 @@
# polynomial kernel of degree four would have approximately 8.5 million
# features (precisely, 54^4). Thanks to :class:`PolynomialCountSketch`, we can
# condense most of the discriminative information of that feature space into a
# much more compact representation. We repeat the experiment 5 times to
# compensate for the stochastic nature of :class:`PolynomialCountSketch`.
# much more compact representation. While we run the experiment only a single time
# (`n_runs` = 1) in this example, in practice one should repeat the experiment several
# times to compensate for the stochastic nature of :class:`PolynomialCountSketch`.
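As an aside (not part of this diff), the size gap described in the comment above can be checked directly: the exact degree-4 polynomial feature space over 54 features has 54**4 = 8,503,056 monomials, while PolynomialCountSketch maps into only n_components dimensions. A minimal, self-contained sketch on hypothetical toy data:

import numpy as np

from sklearn.kernel_approximation import PolynomialCountSketch

print(54**4)  # 8503056 exact polynomial features, i.e. roughly 8.5 million
X_toy = np.random.RandomState(0).rand(10, 54)  # hypothetical data with 54 features
ps = PolynomialCountSketch(degree=4, n_components=500, random_state=0)
print(ps.fit_transform(X_toy).shape)  # (10, 500): a much more compact representation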

n_runs = 1
Member:

I think it'd be nicer if you just removed the whole loop, instead of running it once, and had its contents directly in this example.

Contributor Author:

This is a fair point and would make the code more concise; however, it does come at the cost of making repeat experimental runs (a practice encouraged in the documentation) non-trivial in the sense that it is less clear at first glance how to best repeat the experiment.

The user would now have to think about where to place the internal loop and adjust both ps_lsvm_time and ps_lsvm_score for the number of runs, which, while not terribly difficult, certainly requires more effort than simply altering the value of n_runs.

If you believe this is a worthwhile tradeoff, I am happy to make the change you suggest here and update the PR. Otherwise, might I suggest keeping the loop to facilitate the user experience in this arguably minor way?

Member (@jeremiedbb, Mar 22, 2022):

I agree that keeping it makes it easier for a user to run the example with several runs to see how it improves the results. I'd be happy to keep it.

Member:

I am also fine with keeping the for loop, given the above comment.

N_COMPONENTS = [250, 500, 1000, 2000]

n_runs = 3
for n_components in [250, 500, 1000, 2000]:
for n_components in N_COMPONENTS:

ps_lsvm_time = 0
ps_lsvm_score = 0
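The body of the loop is collapsed in this hunk. To make the discussion above concrete, a minimal sketch of what keeping the loop looks like, averaging time and score over n_runs; variable names follow the visible diff (X_train, y_train, X_test, y_test, results, n_runs, N_COMPONENTS are assumed to be in scope), and the pipeline and results key are assumptions, not the literal collapsed code:

import time

from sklearn.kernel_approximation import PolynomialCountSketch
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

for n_components in N_COMPONENTS:
    ps_lsvm_time = 0
    ps_lsvm_score = 0
    for _ in range(n_runs):
        # assumed pipeline: sketch the degree-4 polynomial features, then fit a linear SVM
        pipeline = make_pipeline(
            PolynomialCountSketch(n_components=n_components, degree=4),
            LinearSVC(dual=False),
        )
        start = time.time()
        pipeline.fit(X_train, y_train)
        ps_lsvm_time += time.time() - start
        ps_lsvm_score += 100 * pipeline.score(X_test, y_test)

    # average over the runs, as the narrative in the example encourages
    ps_lsvm_time /= n_runs
    ps_lsvm_score /= n_runs
    results[f"LSVM + PS({n_components})"] = {  # hypothetical key name
        "time": ps_lsvm_time,
        "score": ps_lsvm_score,
    }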
@@ -148,16 +150,14 @@
ksvm_score = 100 * ksvm.score(X_test, y_test)

results["KSVM"] = {"time": ksvm_time, "score": ksvm_score}
print(f"Kernel-SVM score on raw featrues: {ksvm_score:.2f}%")
print(f"Kernel-SVM score on raw features: {ksvm_score:.2f}%")

# %%
# Finally, plot the results of the different methods against their training
# times. As we can see, the kernelized SVM achieves a higher accuracy,
# but its training time is much larger and, most importantly, will grow
# much faster if the number of training samples increases.

N_COMPONENTS = [250, 500, 1000, 2000]

fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(
[
@@ -181,6 +181,7 @@
label="Linear SVM + PolynomialCountSketch",
c="blue",
)

for n_components in N_COMPONENTS:
ax.scatter(
[