readme

p-lambda · Nov 11, 2023 · b39130e
1 parent 0ce71d4
Showing 1 changed file with 5 additions and 5 deletions.
diff --git a/README.md b/README.md
@@ -38,21 +38,21 @@ from data_selection import HashedNgramDSIR
 raw_datasets = [<list of paths>]
 target_datasets = [<list of paths>]
 
-dsir = HashedNgramDSIR(raw_datasets, target_datasets, cache_dir='/scr/dsir_cache')
+dsir = HashedNgramDSIR(raw_datasets, target_datasets, cache_dir='/path/to/dsir_cache')
 dsir.fit_importance_estimator(num_tokens_to_fit='auto')
 dsir.compute_importance_weights()
-dsir.resample(out_dir='resampled', num_to_sample=10000000, cache_dir='/scr/resampled_cache')
+dsir.resample(out_dir='resampled', num_to_sample=10000000, cache_dir='/path/to/resampled_cache')
 ```
 Running this would write 10M documents in `jsonl` files inside an output directory named `resampled`. The files will first be written to `cache_dir` and moved to `out_dir` upon completion (set `cache_dir` to `None` to skip this step). For best performance, use uncompressed `jsonl` files stored on local file storage for all data paths and use as many CPU cores as possible, which allows each file to be virtually sharded across multiple cores. Custom functions for reading the data paths and extracting the text field from each example can be provided via the
 `{raw,target}_load_dataset_fn` and `{raw,target}_parse_example_fn` arguments to the constructor. The number of tokens to use for fitting the importance weight estimator can be tuned with the `num_tokens_to_fit` argument (set to `all` to fit on the full dataset). To use top-k retrieval instead of sampling without replacement (the default), pass `top_k=True` to the `resample` method.
 
 The `dsir` intermediate results (after `fit_importance_estimator` and `compute_importance_weights`) can be saved and loaded for later use, for example to resample 100M documents instead:
 ```python
-dsir.save('dsir_params.pkl')
+dsir.save('/path/to/dsir_params.pkl')
 
 # later on
-dsir.load('dsir_params.pkl')
-dsir.resample(out_dir='resampled', num_to_sample=100000000, cache_dir='/scr/resampled_cache')
+dsir.load('/path/to/dsir_params.pkl')
+dsir.resample(out_dir='/path/to/out_dir', num_to_sample=100000000, cache_dir='/path/to/resampled_cache')
 ```
 The `save` method can be called at any time to save partial results.
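
The README text in this hunk mentions custom loading and parsing functions without showing one. Below is a minimal sketch, assuming each load function takes a single data path and yields examples, and each parse function maps one example to its text string. Those signatures are assumptions and are not confirmed by this diff. `load_jsonl`, `parse_contents`, and the `contents` key are hypothetical names chosen for illustration.

```python
import json
from typing import Iterator


def load_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a jsonl file.

    Assumed shape of a load function: path in, iterator of examples out.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def parse_contents(example: dict) -> str:
    """Return the text of one example.

    Assumes the text is stored under a hypothetical "contents" key.
    """
    return example["contents"]
```

If the assumed signatures hold, these would be passed to the constructor as `raw_load_dataset_fn=load_jsonl` and `raw_parse_example_fn=parse_contents`, and likewise for the `target_` variants.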