Description
As far as I could search, I didn't find any existing report of this bug. I'm not sure I used the right keywords, but here we go...
Describe the bug
I am training a model. After 6 epochs the training is interrupted. According to the documentation, early_stopping is False by default. And even if it were True, the condition described for the tol argument is not fulfilled when looking at the loss values over the epochs.
tol : float, default=1e-3
The stopping criterion. If it is not None, training will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs.
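For reference, this is how I read that criterion (a minimal sketch in plain Python; the function and variable names are mine, not scikit-learn internals):

def stops_at(losses, tol=1e-3, n_iter_no_change=5):
    """Return the 1-based epoch at which (loss > best_loss - tol) has held
    for n_iter_no_change consecutive epochs, or None if it never does."""
    best_loss = float("inf")
    no_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss > best_loss - tol:
            no_improvement += 1
        else:
            no_improvement = 0
        best_loss = min(best_loss, loss)
        if no_improvement >= n_iter_no_change:
            return epoch
    return None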
Steps/Code to Reproduce
- I am using this dataset:
raw-data.csv
T,A1,A2,A3
33.2,3.5,9.0,6.1
40.3,5.3,20.0,6.4
38.7,5.1,18.0,7.4
46.8,5.8,33.0,6.7
41.4,,31.0,7.5
37.5,6.0,13.0,5.9
39.0,6.8,25.0,6.0
40.7,5.5,30.0,
30.1,3.1,5.0,5.8
52.9,7.2,47.0,8.3
38.2,4.5,25.0,5.0
31.8,4.9,11.0,6.4
43.3,8.0,23.0,7.6
44.1,,35.0,7.0
42.8,6.6,39.0,5.0
33.6,3.7,21.0,4.4
34.2,6.2,7.0,5.5
48.0,7.0,40.0,7.0
38.0,4.0,35.0,6.0
35.9,4.5,23.0,3.5
40.4,5.9,33.0,4.9
36.8,5.6,27.0,4.3
45.2,4.8,,8.0
35.1,3.9,15.0,5.0
- Then, I open it with pandas and replace the NA values with the column medians.
import pandas as pd

df_raw = pd.read_csv('./raw-data.csv')
df_median = df_raw.apply(lambda x: x.fillna(x.median()), axis=0)
X, Y = df_median.drop(columns='T').to_numpy(), df_median['T'].to_numpy()
- Code used for fitting
from sklearn.linear_model import SGDRegressor

myLM1 = SGDRegressor(verbose=1)
myLM1.fit(X, Y)
Output
-- Epoch 1
Norm: 324955830.41, NNZs: 3, Bias: 14194106.164223, T: 24, Avg. loss: 3973700531196889600.000000
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 16629771036.76, NNZs: 3, Bias: -617317208.989577, T: 48, Avg. loss: 181529035517448937275392.000000
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 70745129973.48, NNZs: 3, Bias: -5129135895.303034, T: 72, Avg. loss: 777346093047846122029056.000000
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 58661193605.90, NNZs: 3, Bias: -3424256200.533171, T: 96, Avg. loss: 551443499407620593156096.000000
Total training time: 0.00 seconds.
-- Epoch 5
Norm: 21611084233.04, NNZs: 3, Bias: 222364194.025302, T: 120, Avg. loss: 523655130718998823436288.000000
Total training time: 0.00 seconds.
-- Epoch 6
Norm: 14729666645.71, NNZs: 3, Bias: 1402055277.659604, T: 144, Avg. loss: 334121112525316814798848.000000
Total training time: 0.00 seconds.
Convergence after 6 epochs took 0.00 seconds
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
eta0=0.01, fit_intercept=True, l1_ratio=0.15,
learning_rate='invscaling', loss='squared_loss', max_iter=1000,
n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
shuffle=True, tol=0.001, validation_fraction=0.1, verbose=1,
warm_start=False)
- Then, I noticed that by default n_iter_no_change=5. So, I created another model and tested the fitting function again.
myLM2 = SGDRegressor(verbose=1, n_iter_no_change=10)
myLM2.fit(X, Y)
Output
-- Epoch 1
Norm: 212889866.99, NNZs: 3, Bias: -12147107.482672, T: 24, Avg. loss: 10977951332091414528.000000
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 33877658284.48, NNZs: 3, Bias: 3221686871.858737, T: 48, Avg. loss: 218647246678094122057728.000000
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 44245299085.75, NNZs: 3, Bias: 8504092051.958626, T: 72, Avg. loss: 911501681159969846591488.000000
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 41027510933.49, NNZs: 3, Bias: 5572305635.743739, T: 96, Avg. loss: 869704005960650100572160.000000
Total training time: 0.00 seconds.
-- Epoch 5
Norm: 65326809398.75, NNZs: 3, Bias: 748500853.498515, T: 120, Avg. loss: 559005468692885343830016.000000
Total training time: 0.00 seconds.
-- Epoch 6
Norm: 19445131953.06, NNZs: 3, Bias: -3017287484.596134, T: 144, Avg. loss: 640531710195116690898944.000000
Total training time: 0.00 seconds.
-- Epoch 7
Norm: 11535209562.04, NNZs: 3, Bias: -4257020952.531857, T: 168, Avg. loss: 273798336110092244484096.000000
Total training time: 0.00 seconds.
-- Epoch 8
Norm: 9203881559.93, NNZs: 3, Bias: -2973999377.325749, T: 192, Avg. loss: 290728270869741598408704.000000
Total training time: 0.01 seconds.
-- Epoch 9
Norm: 10890181027.35, NNZs: 3, Bias: -2807586298.020673, T: 216, Avg. loss: 14822313322512440623104.000000
Total training time: 0.01 seconds.
-- Epoch 10
Norm: 10738156993.98, NNZs: 3, Bias: -2251398418.644948, T: 240, Avg. loss: 6075130266787779706880.000000
Total training time: 0.01 seconds.
-- Epoch 11
Norm: 5586073729.92, NNZs: 3, Bias: -2490357631.977556, T: 264, Avg. loss: 26258836820902312673280.000000
Total training time: 0.01 seconds.
Convergence after 11 epochs took 0.01 seconds
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
eta0=0.01, fit_intercept=True, l1_ratio=0.15,
learning_rate='invscaling', loss='squared_loss', max_iter=1000,
n_iter_no_change=10, penalty='l2', power_t=0.25, random_state=None,
shuffle=True, tol=0.001, validation_fraction=0.1, verbose=1,
warm_start=False)
- Now, the notation of the numbers above makes them hard to read. So, I am using a small context manager to collect the loss values and plot them; a rough sketch of what it does is below.
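For reproducibility: my helper is roughly equivalent to the sketch below (the real one also plots the curve). It simply captures the verbose output of fit() and parses the "Avg. loss" values:

import io
import re
from contextlib import redirect_stdout

import numpy as np

class DisplayLossCurve:
    """Capture the verbose output of fit() and print the parsed losses."""

    def __enter__(self):
        self._buffer = io.StringIO()
        self._redirect = redirect_stdout(self._buffer)
        self._redirect.__enter__()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._redirect.__exit__(exc_type, exc_value, traceback)
        text = self._buffer.getvalue()
        print(text, end="")  # keep the original verbose output visible
        losses = np.array([float(m) for m in re.findall(r"Avg\. loss: (\S+)", text)])
        print("=============== Loss Array ===============")
        print(losses)
        return False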
myLM1 = SGDRegressor(verbose=1)
with DisplayLossCurve():
    myLM1.fit(X, Y)
Output
=============== Loss Array ===============
[2.95406140e+14 1.14313907e+23 1.23391908e+24 7.43942473e+23
6.30549394e+22 3.81444250e+23]
As you can see, there is no way the condition for the stopping criterion is satisfied.
- Running the example again for model 2
myLM2 = SGDRegressor(verbose=1, n_iter_no_change=10)
with DisplayLossCurve():
    myLM2.fit(X, Y)
Output
=============== Loss Array ===============
[3.54981262e+18 1.08446422e+24 3.64640361e+23 6.35668752e+23
4.05704140e+23 5.51679293e+23 2.84870055e+23 2.05759404e+23
2.27129487e+23 2.32333059e+22 3.40118957e+21]
I just noticed that when using a pipeline, the model works nicely.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), SGDRegressor(verbose=1))
with DisplayLossCurve():
    pipeline.fit(X, Y)
Output
=============== Loss Array ===============
[698.789267 549.568875 455.764301 385.315963 329.898958 284.905019
247.724132 216.647281 190.31711 167.868168 148.596037 131.948425
117.494491 104.881235 93.84543 84.139548 75.593668 68.045718
61.354433 55.428558 50.143342 45.435389 41.225132 37.463083
34.092248 31.066473 28.348297 25.904551 23.700976 21.710897
19.915563 18.292876 16.823909 15.495913 14.289865 13.195086
12.199626 11.295545 10.472725 9.723792 9.042427 8.420886
7.853342 7.334551 6.861378 6.428062 6.032399 5.671314
5.339777 5.035661 4.757 4.501251 4.266443 4.051114
3.85367 3.67168 3.504323 3.350221 3.208313 3.077853
2.957661 2.847504 2.7454 2.651231 2.564548 2.484364
2.410306 2.342217 2.279268 2.220999 2.167082 2.117059
2.070652 2.027802 1.988273 1.951822 1.917857 1.886282
1.857099 1.829877 1.804733 1.781529 1.760066 1.739988
1.721407 1.704004 1.687848 1.67277 1.658884 1.646049
1.634011 1.622827 1.612362 1.602604 1.593531 1.585142
1.577364 1.570052 1.563283 1.556888 1.550974 1.545459
1.540315 1.535562 1.53106 1.526853 1.522848 1.51919
1.51574 1.5125 1.509578 1.506747 1.50413 1.501607
1.499332 1.497184 1.495179 1.493289 1.491581 1.489939
1.488343 1.486849 1.485497 1.484214 1.483014 1.48191
1.480837 1.479798 1.478842 1.47799 1.47716 1.476378
1.475653]
Am I missing some initialization step when training without the assistance of a pipeline? Pretty much the only difference is the normalization.
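To be explicit about what I mean by the normalization being the only difference, this is the manual equivalent I have in mind (scaling by hand instead of inside the pipeline; myLM3 is just an illustrative name):

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Scale the features by hand, then fit exactly as before.
X_scaled = StandardScaler().fit_transform(X)
myLM3 = SGDRegressor(verbose=1)
myLM3.fit(X_scaled, Y)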
- I do understand how beneficial normalization is for the model. But the behaviour described previously should not happen simply because of the values in the data.
- I noticed the values are exploding. I wonder if an overflow happened and broke the comparison done by the stopping criterion. But still, early_stopping is set to False by default.
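If it helps, this is how I would check whether something actually overflowed (I have not verified that this is what happens; float64 only overflows around 1.8e308):

import numpy as np

print(np.finfo(np.float64).max)        # ~1.8e308, the float64 overflow threshold
print(np.isfinite(myLM1.coef_).all())  # are the learned weights still finite?
print(np.isfinite(myLM1.intercept_))   # is the learned bias still finite?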