
Commit 168befa

Added Question 64 which is about repeatable sampling (andrewekhalel#11)
1 parent c3322b8 commit 168befa

File tree

1 file changed, +28 -0 lines changed


README.md

Lines changed: 28 additions & 0 deletions
@@ -335,6 +335,34 @@ The biggest difference between the output of UMAP when compared with t-SNE i
It generates a pseudo-random number deterministically from the seed; there are several well-known algorithms, see the link below for further information.
[[src]](https://en.wikipedia.org/wiki/Linear_congruential_generator)
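For example, a linear congruential generator, one of the classic algorithms behind rand()-style functions, is just the deterministic recurrence x_{n+1} = (a * x_n + c) mod m, which is why a fixed seed always reproduces the same sequence. A minimal sketch (the constants are the ones commonly cited for ANSI C's rand()):

```
def lcg(seed, a=1103515245, c=12345, m=2**31):
    # x_{n+1} = (a * x_n + c) mod m -- fully determined by the seed
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m  # scale to [0, 1)

gen = lcg(seed=1)
print([next(gen) for _ in range(3)])  # identical output on every run
```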

#### 64) Given that we want to evaluate the performance of 'n' different machine learning models on the same data, why would the following splitting mechanism be incorrect:
```
import numpy as np
import pandas as pd

def get_splits():
    df = pd.DataFrame(...)
    # No seed is set, so rand() draws a fresh sequence on every call
    rnd = np.random.rand(len(df))
    train = df[rnd < 0.8]
    # Parentheses are required here: & binds tighter than the comparisons
    valid = df[(rnd >= 0.8) & (rnd < 0.9)]
    test = df[rnd >= 0.9]
    return train, valid, test

# Model 1
from sklearn.tree import DecisionTreeClassifier
train, valid, test = get_splits()
...

# Model 2
from sklearn.linear_model import LogisticRegression
train, valid, test = get_splits()
...
```
The rand() function draws a different sequence of random values each time it is run, so if we run the splitting mechanism again, the 80% of rows we get will differ from the ones we got the first time. This is a problem because we need to compare the performance of our models on the same test set. To ensure reproducible and consistent sampling, we would have to set the random seed in advance or store the data once it is split. Alternatively, we could simply set the 'random_state' parameter in sklearn's train_test_split() function to get the same train, validation and test sets across different executions.
[[src]](https://towardsdatascience.com/why-do-we-set-a-random-state-in-machine-learning-models-bb2dc68d8431#:~:text=In%20Scikit%2Dlearn%2C%20the%20random,random%20state%20instance%20from%20np.)
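
A minimal sketch of both fixes (the function names and the seed value 42 are illustrative, not part of the original question):

```
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def get_splits_seeded(df, seed=42):
    # A seeded generator draws the same values on every call,
    # so every model is evaluated on exactly the same rows.
    rnd = np.random.RandomState(seed).rand(len(df))
    train = df[rnd < 0.8]
    valid = df[(rnd >= 0.8) & (rnd < 0.9)]
    test = df[rnd >= 0.9]
    return train, valid, test

def get_splits_sklearn(df, seed=42):
    # train_test_split with a fixed random_state is reproducible across runs;
    # two calls carve the data into roughly 80/10/10 portions.
    train, rest = train_test_split(df, test_size=0.2, random_state=seed)
    valid, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, valid, test
```

Either way, calling the function before training Model 1 and again before Model 2 now returns identical splits, so the models are compared on the same test set.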

## Contributions
Contributions are most welcome.
