
Commit a87ee60

Update on "[WIP][DeviceMesh] Add _unflatten_ api for device mesh"
After some initial feedback on the implementation of `_split`, we realized that we can first implement `_unflatten` for the urgent use cases at hand, and do further refactoring and iteration based on the discussions in this PR and in this RFC: #159013. We will also ensure that none of these changes regress DTensor's CPU overhead. This PR:

1. Does not support flattening an unflattened mesh. By the time users decide to flatten an unflattened mesh, they would essentially be redoing the unflatten operations, which would complicate bookkeeping, and we do not see a use case for it yet. (We throw a `NotImplementedError` for now.)
2. Adds extra bookkeeping for the unflatten API: we track which sub-mesh contains each unflattened `dim_name`, so that users can slice these dim_names from the root mesh as well, and we swap in the correct mesh when slicing from the root mesh. To ensure the same `dim_name` is never unflattened into different sizes, we also keep the total accumulated numel for that `dim_name` in the root.
3. Reuses process groups already created for the same `dim_name`. When a different `dim_name` happens to have the same shape, we create a new PG, because with a different name users might want to use that dimension for a different purpose, so we'd rather not reuse. (This assumption can be changed; I am open to suggestions.)
4. Adds unit tests for two situations: (a) directly unflattening a 2D device mesh; (b) first creating a dummy 1D device mesh and then splitting it into two 3D device meshes.

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta

[ghstack-poisoned]
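The bookkeeping rules described in points 2 and 3 (a reused `dim_name` must keep a consistent size, and process groups are reused only for the same `dim_name`) can be sketched in plain Python. This is a conceptual illustration only, not the real `DeviceMesh._unflatten` implementation; the names `unflatten_dim` and `MeshBookkeeping` are hypothetical.

```python
from math import prod

def unflatten_dim(shape, dim, new_sizes):
    """Return the mesh shape after unflattening dimension `dim` into `new_sizes`.

    The product of the new sizes must equal the size of the dimension being
    unflattened, otherwise the operation is rejected.
    """
    if prod(new_sizes) != shape[dim]:
        raise ValueError(
            f"cannot unflatten dim of size {shape[dim]} into sizes {new_sizes}"
        )
    return shape[:dim] + tuple(new_sizes) + shape[dim + 1:]

class MeshBookkeeping:
    """Tracks, per dim_name, the size it was unflattened to and a reusable group."""

    def __init__(self):
        self._numel = {}   # dim_name -> size first registered for that name
        self._groups = {}  # (dim_name, size) -> opaque group handle

    def register(self, dim_name, size, make_group):
        # Reject unflattening the same dim_name into a different size.
        if dim_name in self._numel and self._numel[dim_name] != size:
            raise ValueError(
                f"dim_name {dim_name!r} already registered with size "
                f"{self._numel[dim_name]}, got {size}"
            )
        self._numel[dim_name] = size
        # Reuse the group created for the same (dim_name, size); a different
        # dim_name always gets a fresh group, even if the sizes coincide.
        key = (dim_name, size)
        if key not in self._groups:
            self._groups[key] = make_group()
        return self._groups[key]

# Unflatten an 8-way dim 0 into two new dims of sizes (2, 4).
print(unflatten_dim((8, 3), 0, (2, 4)))  # (2, 4, 3)
```

As a usage note, registering `("dp", 2)` twice returns the same group handle, while registering `"dp"` again with size 4 raises, mirroring the size-consistency check the PR describes.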
2 parents 8b6eb8e + 01daab9 commit a87ee60


566 files changed: +61151 −16054 lines


.ci/docker/ci_commit_pins/triton.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-11ec6354315768a85da41032535e3b7b99c5f706
+f7888497a1eb9e98d4c07537f0d0bcfe180d1363

.ci/docker/requirements-docs.txt

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 sphinx==5.3.0
 #Description: This is used to generate PyTorch docs
 #Pinned versions: 5.3.0
--e git+https://github.com/pytorch/pytorch_sphinx_theme.git@pytorch_sphinx_theme2#egg=pytorch_sphinx_theme2
+-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@722b7e6f9ca512fcc526ad07d62b3d28c50bb6cd#egg=pytorch_sphinx_theme2
 
 # TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
 # but it doesn't seem to work and hangs around idly. The initial thought that it is probably
@@ -50,7 +50,7 @@ IPython==8.12.0
 #Pinned versions: 8.12.0
 
 myst-nb==0.17.2
-#Description: This is used to generate PyTorch functorch and torch.compile docs
+#Description: This is used to generate PyTorch functorch and torch.compile docs.
 #Pinned versions: 0.17.2
 
 # The following are required to build torch.distributed.elastic.rendezvous.etcd* docs

.ci/manywheel/build_rocm.sh

Lines changed: 1 addition & 1 deletion
@@ -194,7 +194,7 @@ ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library
 ROCBLAS_LIB_DST=lib/rocblas/library
 ROCBLAS_ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
 ROCBLAS_OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
-ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $OTHER_FILES)
+ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $ROCBLAS_OTHER_FILES)
 
 # hipblaslt library files
 HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library

.ci/pytorch/build.sh

Lines changed: 4 additions & 0 deletions
@@ -50,6 +50,9 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
   export ATEN_THREADING=NATIVE
 fi
 
+# Enable LLVM dependency for TensorExpr testing
+export USE_LLVM=/opt/llvm
+export LLVM_DIR=/opt/llvm/lib/cmake/llvm
 
 if ! which conda; then
   # In ROCm CIs, we are doing cross compilation on build machines with
@@ -189,6 +192,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then
   export USE_ASAN=1
   export REL_WITH_DEB_INFO=1
   export UBSAN_FLAGS="-fno-sanitize-recover=all"
+  unset USE_LLVM
 fi
 
 if [[ "${BUILD_ENVIRONMENT}" == *no-ops* ]]; then

.ci/pytorch/common_utils.sh

Lines changed: 1 addition & 1 deletion
@@ -245,7 +245,7 @@ function install_torchrec_and_fbgemm() {
   if [ "${found_whl}" == "0" ]; then
     git clone --recursive https://github.com/pytorch/fbgemm
     pushd fbgemm/fbgemm_gpu
-    git checkout "${fbgemm_commit}"
+    git checkout "${fbgemm_commit}" --recurse-submodules
     python setup.py bdist_wheel \
       --build-variant=rocm \
       -DHIP_ROOT_DIR="${ROCM_PATH}" \

.ci/pytorch/test.sh

Lines changed: 22 additions & 0 deletions
@@ -627,6 +627,8 @@ test_perf_for_dashboard() {
     device=cuda_a10g
   elif [[ "${TEST_CONFIG}" == *h100* ]]; then
     device=cuda_h100
+  elif [[ "${TEST_CONFIG}" == *b200* ]]; then
+    device=cuda_b200
   elif [[ "${TEST_CONFIG}" == *rocm* ]]; then
     device=rocm
   fi
@@ -801,6 +803,16 @@ test_dynamo_benchmark() {
   if [[ "${TEST_CONFIG}" == *perf_compare* ]]; then
     test_single_dynamo_benchmark "training" "$suite" "$shard_id" --training --amp "$@"
   elif [[ "${TEST_CONFIG}" == *perf* ]]; then
+    # TODO (huydhn): Just smoke test some sample models
+    if [[ "${TEST_CONFIG}" == *b200* ]]; then
+      if [[ "${suite}" == "huggingface" ]]; then
+        export TORCHBENCH_ONLY_MODELS="DistillGPT2"
+      elif [[ "${suite}" == "timm_models" ]]; then
+        export TORCHBENCH_ONLY_MODELS="inception_v3"
+      elif [[ "${suite}" == "torchbench" ]]; then
+        export TORCHBENCH_ONLY_MODELS="hf_Bert"
+      fi
+    fi
     test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
   else
     if [[ "${TEST_CONFIG}" == *cpu* ]]; then
@@ -1039,10 +1051,20 @@ test_libtorch_api() {
     mkdir -p $TEST_REPORTS_DIR
 
     OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" "$TORCH_BIN_DIR"/test_api --gtest_filter='-IMethodTest.*' --gtest_output=xml:$TEST_REPORTS_DIR/test_api.xml
+    "$TORCH_BIN_DIR"/test_tensorexpr --gtest_output=xml:$TEST_REPORTS_DIR/test_tensorexpr.xml
   else
     # Exclude IMethodTest that relies on torch::deploy, which will instead be ran in test_deploy
     OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_api -k "not IMethodTest"
 
+    # On s390x, pytorch is built without llvm.
+    # Even if it would be built with llvm, llvm currently doesn't support used features on s390x and
+    # test fails with errors like:
+    # JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
+    # unknown file: Failure
+    # C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
+    if [[ "${BUILD_ENVIRONMENT}" != *s390x* ]]; then
+      python test/run_test.py --cpp --verbose -i cpp/test_tensorexpr
+    fi
   fi
 
   # quantization is not fully supported on s390x yet

.github/actionlint.yaml

Lines changed: 3 additions & 3 deletions
@@ -53,9 +53,9 @@ self-hosted-runner:
   - linux.rocm.gpu.mi250
   - linux.rocm.gpu.2
   - linux.rocm.gpu.4
-  # MI300 runners
-  - linux.rocm.gpu.mi300.2
-  - linux.rocm.gpu.mi300.4
+  # gfx942 runners
+  - linux.rocm.gpu.gfx942.2
+  - linux.rocm.gpu.gfx942.4
   - rocm-docker
   # Org wise AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)
   - macos-m1-stable

.github/ci_commit_pins/audio.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-f6dfe1231dcdd221a68416e49ab85c2575cbb824
+9b57c7bd5ad4db093c5bb31c802df9f04d933ac9

.github/ci_commit_pins/vllm.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-8f605ee30912541126c0fe46d0c8c413101b600a
+6a39ba85fe0f2fff9494b5eccea717c93510c230

.github/ci_commit_pins/xla.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-29ae4c76c026185f417a25e841d2cd5e65f087a3
+b6a5b82b9948b610fa4c304d0d869c82b8f17db1
