[FEATURE] structured_output retries #348

@austinmw

Description

Problem Statement

Many model providers that support tool calling do not guarantee structured output through constrained decoding, so a structured_output call can return a response that fails schema validation. It would be great to have built-in retry logic for this case.

Proposed Solution

A few different types of retry logic that would be beneficial are:

  1. Naive retry: repeat the same request up to some retry_limit
  2. Intelligent retry: build a retry prompt that includes the Pydantic validation error(s) and asks the model to avoid them on the next attempt
  3. Patching mistakes: similar to trustcall, ask an LLM to provide a JSON patch for the invalid output rather than rewriting it from scratch
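
As a rough illustration of options 1 and 2, here is a minimal, provider-agnostic sketch. It is not part of the strands API: `structured_output_with_retries`, `generate`, `validate`, and `feedback` are hypothetical names chosen for this example. It assumes `generate` takes a prompt string and returns raw model text, and that `validate` raises `ValueError` on bad output (Pydantic's `ValidationError` is a `ValueError` subclass, so something like `WorkoutSchedule.model_validate_json` could be passed as `validate`).

```python
import json
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def structured_output_with_retries(
    generate: Callable[[str], str],
    validate: Callable[[str], T],
    prompt: str,
    retry_limit: int = 3,
    feedback: bool = True,
) -> T:
    """Call generate() until validate() accepts the output or attempts run out.

    feedback=False is the naive retry (same prompt every time).
    feedback=True appends the last validation error to the prompt.
    """
    if retry_limit < 1:
        raise ValueError("retry_limit must be at least 1")
    current_prompt = prompt
    last_error: Optional[ValueError] = None
    for _ in range(retry_limit):
        raw = generate(current_prompt)
        try:
            return validate(raw)
        except ValueError as err:  # also catches pydantic.ValidationError
            last_error = err
            if feedback:
                # Rebuild from the original prompt so errors do not pile up.
                current_prompt = (
                    f"{prompt}\n\nYour previous response failed validation:\n"
                    f"{err}\nReturn a corrected response that avoids these errors."
                )
    raise RuntimeError(
        f"no valid output after {retry_limit} attempts"
    ) from last_error


# Usage with a stand-in "model" that fails once, then succeeds.
def parse_exercise(raw: str) -> dict:
    data = json.loads(raw)  # JSONDecodeError is also a ValueError
    if not 1 <= data.get("sets", 0) <= 10:
        raise ValueError("sets must be between 1 and 10")
    return data


responses = iter(['{"name": "Squat", "sets": 0}', '{"name": "Squat", "sets": 3}'])
prompts_seen = []


def fake_model(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(responses)


result = structured_output_with_retries(
    fake_model, parse_exercise, "Give me one exercise as JSON."
)
print(result)
```

Passing the validator in as a callable keeps the retry loop independent of any particular provider or schema library. Option 3 (patching) would need a different loop that sends the invalid output back along with the errors and applies the returned patch, which this sketch does not attempt.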

Use Case

Here's an example of a Pydantic schema which sometimes succeeds and sometimes fails:

import asyncio
from typing import List

from botocore.config import Config
from pydantic import BaseModel, Field
from strands.models import BedrockModel

config = Config(
    region_name="us-east-1",
    connect_timeout=10,
    read_timeout=120,
)

async def test_deep_nesting():
    model = BedrockModel(model_id="amazon.nova-lite-v1:0", max_tokens=8192, config=config)

    class Exercise(BaseModel):
        name: str = Field(description="Name of the exercise")
        sets: int = Field(ge=1, le=10, description="Number of sets to perform")
        reps: int = Field(ge=1, le=50, description="Number of repetitions per set")

    class Workout(BaseModel):
        name: str = Field(description="Name of the workout")
        exercises: List[Exercise] = Field(
            min_length=2, 
            max_length=4, 
            description="List of exercises in this workout (2-4 exercises)"
        )

    class DaySchedule(BaseModel):
        is_rest_day: bool = Field(description="True if this is a rest day, False if it's a workout day")
        workouts: List[Workout] = Field(
            default=[],
            max_length=2,
            description="List of workouts for this day. Empty list if is_rest_day is True, 1-2 workouts if False"
        )

    class Week(BaseModel):
        week_number: int = Field(ge=1, le=4, description="Week number in the program")
        monday: DaySchedule
        tuesday: DaySchedule
        wednesday: DaySchedule
        thursday: DaySchedule
        friday: DaySchedule
        saturday: DaySchedule
        sunday: DaySchedule

    class WorkoutSchedule(BaseModel):
        title: str = Field(description="Title of the workout program")
        weeks: List[Week] = Field(
            min_length=2, 
            max_length=2, 
            description="Exactly 2 weeks of programming"
        )

    messages = [{
        "role": "user", 
        "content": [{"text": """
        Create a complete 2-week workout schedule called "Strength & Conditioning Program".
        
        Requirements:
        - Each week should have 4-5 workout days and 2-3 rest days
        - For rest days: set is_rest_day=true and workouts=[] (empty list)
        - For workout days: set is_rest_day=false and include 1-2 workouts in the workouts list
        - Each workout should have 2-4 exercises with realistic rep/set counts (reps 5-20, sets 1-5)
        - Make it progressive - week 2 should build on week 1
        - Vary the workouts throughout each week (upper body, lower body, cardio, etc.)
        """}]
    }]

    async for event in model.structured_output(WorkoutSchedule, messages):
        if "output" in event:
            return event["output"]

async def run_tests():
    successes = 0
    failures = 0
    
    for i in range(10):
        try:
            await test_deep_nesting()
            successes += 1
            print("✅", end="")
        except Exception:
            failures += 1
            print("❌", end="")
    
    print(f"\nResults: {successes}/10 successes ({successes*10}%)")

asyncio.run(run_tests())

My result:

❌✅✅✅✅✅❌✅❌✅
Results: 7/10 successes (70%)
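
For a rough sense of how much a naive retry could help here, the sketch below estimates the chance that at least one of several attempts succeeds. It assumes attempts are independent and that the 70% figure from this small 10-run sample is representative, neither of which is guaranteed. `success_after_retries` is a hypothetical helper written for this example.

```python
def success_after_retries(p_single: float, retry_limit: int) -> float:
    """Probability that at least one of retry_limit independent attempts succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if retry_limit < 1:
        raise ValueError("retry_limit must be at least 1")
    # All attempts fail with probability (1 - p)^n; success is the complement.
    return 1.0 - (1.0 - p_single) ** retry_limit


for attempts in (1, 2, 3, 5):
    estimate = success_after_retries(0.7, attempts)
    print(f"retry_limit={attempts}: {estimate:.1%}")
```

Under these assumptions, three attempts would push the estimated success rate from 70% to roughly 97%. Failures caused by the schema or prompt itself would likely be correlated across attempts, so the real gain may be smaller.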

Alternative Solutions

No response

Additional Context

OpenAI's structured outputs feature uses constrained decoding to guarantee that responses adhere to the Pydantic schema. Providers and models such as Anthropic and Nova do not offer this feature, so they often fail to adhere to complex nested schemas.

Labels

enhancement: New feature or request
