Batch Transform Job fails with Internal Server Error when Data Capture is configured #5182

Open
thatayster opened this issue May 16, 2025 · 1 comment
Labels
component: batch (Relates to the SageMaker Batch Platform) · type: bug

thatayster commented May 16, 2025

Describe the bug
When configuring Data Capture for a Batch Transform job using the SageMaker Python SDK, the job creation succeeds, but the execution fails with an "Internal Server Error". If Data Capture is not enabled, the job finishes successfully. This suggests a bug related to the Data Capture configuration in the Batch Transform step.

To reproduce

The setup is the same for both scenarios, with or without DataCaptureConfig:

from datetime import datetime
from sagemaker.transformer import Transformer
from sagemaker.inputs import BatchDataCaptureConfig

input_s3_data_location = "s3://bucket/prefix/batch-transform/input/input.json"
output_s3_data_location = "s3://bucket/prefix/batch-transform/output"
data_capture_destination = "s3://bucket/prefix/batch-transform/captured-data"
model_name = "my-previously-created-model"

transformer = Transformer(
    model_name=model_name,
    strategy="SingleRecord",
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=output_s3_data_location,
    max_concurrent_transforms=1,
    max_payload=6,
    tags=[{"Key": "some-key", "Value": "some-value"}],
)

timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
job_name = f"batch-transform-{timestamp}"

1. Batch Transform job execution without DataCaptureConfig - success:

transform_arg = transformer.transform(
    job_name=job_name,
    data=input_s3_data_location,
    data_type="S3Prefix",
    content_type="application/json", 
    split_type="Line",
    wait=True,
    logs=True,
)

2. Batch Transform job execution with DataCaptureConfig - fails with an Internal Server Error:

transform_arg = transformer.transform(
    batch_data_capture_config=BatchDataCaptureConfig(
        destination_s3_uri=data_capture_destination,
        generate_inference_id=True,
    ),
    job_name=job_name,
    data=input_s3_data_location,
    data_type="S3Prefix",
    content_type="application/json",
    split_type="Line",
    wait=True,
    logs=True,
)

Note: I've also tested with CSV files. The behavior is the same.

Expected behavior
Enabling Data Capture for Batch Transform should not cause the job to fail with an Internal Server Error. The job should complete successfully, and captured data should be stored as configured.
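
For context, once capture works as expected, the destination prefix can be checked with a plain S3 listing (a sketch; the bucket and prefix are the placeholders from the repro above):

import boto3

s3 = boto3.client("s3")
# List whatever the job wrote under the configured capture destination.
resp = s3.list_objects_v2(Bucket="bucket", Prefix="prefix/batch-transform/captured-data/")
for obj in resp.get("Contents", []):
    print(obj["Key"])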

Screenshots or logs

[Screenshot of the failed Batch Transform job attached in the original issue]
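
Since the console only surfaces the generic error, any service-recorded failure detail can be pulled with describe_transform_job (a sketch; substitute the job_name used in the repro):

import boto3

sm = boto3.client("sagemaker")
# The job name below is a placeholder; use the job_name from the repro above.
desc = sm.describe_transform_job(TransformJobName="batch-transform-2025-05-16-00-00-00")
print(desc["TransformJobStatus"])  # "Failed" for the capture-enabled run
print(desc.get("FailureReason"))   # whatever reason the service recorded, if any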

System information

  • SageMaker Python SDK version: 2.244.1
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): n/a
  • Framework version: n/a
  • Python version: 3.12
  • CPU or GPU: Used instance type ml.m5.large
  • Custom Docker image (Y/N): Y

Additional context
n/a

thatayster (Author) commented:

We have identified that this issue is likely related to inference ID generation. The Batch Transform job completes successfully when BatchDataCaptureConfig is provided with generate_inference_id set to False, as in the sketch below.
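
For reference, a minimal sketch of the workaround (same transformer, paths, and job-name pattern as in the repro; only generate_inference_id changes):

from datetime import datetime
from sagemaker.inputs import BatchDataCaptureConfig

timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

# Workaround: keep Data Capture enabled but skip inference ID generation.
transformer.transform(
    batch_data_capture_config=BatchDataCaptureConfig(
        destination_s3_uri=data_capture_destination,
        generate_inference_id=False,  # True triggers the Internal Server Error
    ),
    job_name=f"batch-transform-{timestamp}",
    data=input_s3_data_location,
    data_type="S3Prefix",
    content_type="application/json",
    split_type="Line",
    wait=True,
    logs=True,
)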

mollyheamazon added the component: batch label on May 21, 2025