
Commit 570fa2e

Update 2020-08-08-efficient-pytorch.md
1 parent f4de49a commit 570fa2e

File tree

1 file changed: +2 −2 lines changed


_posts/2020-08-08-efficient-pytorch.md

+2 −2
@@ -20,7 +20,7 @@ However, working with the large amount of data sets presents a number of challenges
 * **Shuffling and Augmentation:** training data needs to be shuffled and augmented prior to training.
 * **Scalability:** users often want to develop and test on small datasets and then rapidly scale up to large datasets.
 
-Traditional local and network file systems, and even object storage servers, are not designed for these kinds of applications. [The WebDataset I/O library](https://github.com/tmbdev/webdataset) for PyTorch, together with the optional [AIStore server](https://github.com/NVIDIA/aistore) and [Tensorcom RDMA](https://github.com/NVlabs/tensorcom) libraries, provide an efficient, simple, and standards-based solution to all these problems. The library is simple enough for day-to-day use, is based on mature open source standards, and is easy to migrate to from existing file-based datasets.
+Traditional local and network file systems, and even object storage servers, are not designed for these kinds of applications. [The WebDataset I/O library](https://github.com/tmbdev/webdataset) for PyTorch, together with the optional [AIStore server](https://github.com/NVIDIA/aistore) and [Tensorcom](https://github.com/NVlabs/tensorcom) RDMA libraries, provide an efficient, simple, and standards-based solution to all these problems. The library is simple enough for day-to-day use, is based on mature open source standards, and is easy to migrate to from existing file-based datasets.
 
 Using WebDataset is simple and requires little effort, and it will let you scale up the same code from running local experiments to using hundreds of GPUs on clusters or in the cloud with linearly scalable performance. Even on small problems and on your desktop, it can speed up I/O tenfold and simplifies data management and processing of large datasets. The rest of this blog post tells you how to get started with WebDataset and how it works.
 

@@ -47,7 +47,7 @@ The use of sharded, sequentially readable formats is essential for very large datasets
 | Cloud Computing | WebDataset deep learning jobs can be trained directly against datasets stored in cloud buckets; no volume plugins required. Local and cloud jobs work identically. Suitable for petascale learning. |
 | Local Cluster with existing distributed FS or object store | WebDataset’s large sequential reads improve performance with existing distributed stores and eliminate the need for dedicated volume plugins. |
 | Educational Environments | WebDatasets can be stored on existing web servers and web caches, and can be accessed directly by students by URL |
-| Training on Workstations from Local Drives | obs can start training as the data still downloads. Data doesn’t need to be unpacked for training. Ten-fold improvements in I/O performance on hard drives over random access file-based datasets. |
+| Training on Workstations from Local Drives | Jobs can start training as the data still downloads. Data doesn’t need to be unpacked for training. Ten-fold improvements in I/O performance on hard drives over random access file-based datasets. |
 | All Environments | Datasets are represented in an archival format and contain metadata such as file types. Data is compressed in native formats (JPEG, MP4, etc.). Data management, ETL-style jobs, and data transformations and I/O are simplified and easily parallelized. |
 
 We will be adding more examples giving benchmarks and showing how to use WebDataset in these environments over the coming months.
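
The paragraphs touched by this patch describe how little effort it takes to get started with WebDataset. As a rough illustration of the kind of usage the post refers to, here is a minimal loading sketch; the shard URL, the sample keys ("jpg", "cls"), and the loader parameters are placeholders for illustration, not taken from this commit or the post.

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Hypothetical shard location; the brace notation expands into a list of
# sequentially readable POSIX tar shards served over HTTP.
url = "http://example.com/train-{000000..000554}.tar"

dataset = (
    wds.WebDataset(url)          # stream samples sequentially from the tar shards
    .shuffle(1000)               # approximate shuffling via an in-memory buffer
    .decode("pil")               # decode natively compressed images (JPEG) to PIL
    .to_tuple("jpg", "cls")      # pick the image and label entries of each sample
)

# WebDataset is an IterableDataset, so it plugs into the standard DataLoader.
loader = DataLoader(dataset, batch_size=64, num_workers=4)
```

The same code works whether the URL points at a local disk, a web server, or a cloud bucket, which is the scaling property the patched paragraph emphasizes.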
