feat: speed optimization for extract_structured_data #2443

Draft
wants to merge 6 commits into
base: main
from

sauravpanda
Collaborator

@sauravpanda sauravpanda commented Jul 14, 2025

Summary by cubic

Improved the speed of extract_structured_data by limiting iframe processing, reducing content size, and shortening timeouts.

  • Performance
    • Only processes up to 3 iframes and only if the query mentions "iframe" or "frame".
    • Strips more HTML elements before markdown conversion.
    • Reduces content length from 30,000 to 20,000 characters with smarter truncation.
    • Cuts LLM and iframe timeouts for faster responses.
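
A minimal sketch of the gating and truncation summarized above. The helper names (`should_process_iframes`, `truncate_content`) and the paragraph-boundary rule are assumptions for illustration; this conversation does not show the PR's actual "smarter truncation" logic.

```python
MAX_CONTENT_CHARS = 20_000  # limit stated in the summary
MAX_IFRAMES = 3             # limit stated in the summary


def should_process_iframes(query: str) -> bool:
    """Return True only when the query mentions "iframe" or "frame"."""
    # "iframe" contains "frame", so a single substring check covers both.
    return "frame" in query.lower()


def truncate_content(content: str, limit: int = MAX_CONTENT_CHARS) -> str:
    """Cut at the last paragraph break before the limit (assumed heuristic)."""
    if len(content) <= limit:
        return content
    cut = content.rfind("\n\n", 0, limit)
    if cut <= 0:
        cut = limit  # no paragraph break found; fall back to a hard cut
    return content[:cut]
```

A substring check like this also matches unrelated words such as "timeframe", and it skips iframes whenever the query does not name them.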

delve-auditor bot commented Jul 14, 2025

No security or compliance issues detected. Reviewed everything up to 5f99600.

Security Overview
  • 🔎 Scanned files: 1 changed file(s)
Detected Code Changes
Change Type / Relevant files
Enhancement
  ► browser_use/controller/service.py
    Optimize iframe processing and content truncation
  ► browser_use/llm/groq/chat.py
    Simplify Groq chat implementation
  ► browser_use/mcp/server.py
    Streamline logging configuration
Refactor
  ► .github/workflows/claude.yml
    Simplify workflow configuration
  ► browser_use/mcp/__init__.py
    Direct import of BrowserUseServer
Configuration changes
  ► pyproject.toml
    Update version to 0.5.4

Reply to this PR with @delve-auditor followed by a description of what change you want and we'll auto-submit a change to this PR to implement it.

github-actions bot commented Jul 14, 2025

Agent Task Evaluation Results: 2/3 (67%)

View detailed results
Task | Result | Reason
captcha_cloudflare | ❌ Fail | The agent failed to complete the captcha solving task successfully. Although it attempted multiple times to interact with the captcha and click the 'Check' button, it was unable to solve the captcha correctly. As a result, the success message with the 'hostname' value was never displayed or extracted, and thus the required hostname 'example.com' was not obtained. Therefore, the task criteria were not met.
amazon_laptop | ✅ Pass | The agent successfully navigated to amazon.com, searched for 'laptop', and returned the name of the first laptop result along with relevant details. The output meets all the criteria specified in the task.
browser_use_pip | ✅ Pass | The agent explicitly provided the command 'pip install browser-use' as requested, fulfilling the task criteria. Additional relevant commands and information were also included, which enhance the user's understanding and potential usage but do not detract from the success of meeting the main requirement.

Check the evaluate-tasks job for detailed task execution logs.

delve-auditor bot commented Jul 14, 2025

No security or compliance issues detected. Reviewed everything up to 404336c.

Security Overview
  • 🔎 Scanned files: 1 changed file(s)
Detected Code Changes
Change Type / Relevant files
Enhancement
  ► browser-use-rules.mdc
    Remove pre-commit formatting requirement
  ► .env.example
    Update Azure OpenAI key variable name
  ► prompts.py
    Add PlannerPrompt class
  ► service.py
    Speed optimization for extract_structured_data
Refactor
  ► observability.py
    Simplify debug observation decorators
  ► message_manager/service.py
    Optimize debug observation parameters
  ► controller/service.py
    Update wait action behavior
Other
  ► examples/custom-functions/cua.py
    Remove custom-functions example

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

cubic reviewed 1 file and found no issues. Review PR in cubic.dev.

@pirate
Member

pirate commented Jul 15, 2025

We could also hardcode the most common iframe tracking domains from the Alexa top 100 sites so we never bother processing them.
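
A rough sketch of what such a hardcoded skip list could look like. The set name, the helper name, and the three example domains are illustrative assumptions; none of them come from the PR, and a real list would need to be curated.

```python
from urllib.parse import urlparse

# Hypothetical skip list; the entries are examples, not taken from the PR.
SKIPPED_IFRAME_DOMAINS = frozenset({
    "doubleclick.net",
    "googletagmanager.com",
    "google-analytics.com",
})


def is_tracking_iframe(url: str) -> bool:
    """Return True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in SKIPPED_IFRAME_DOMAINS
    )
```

Matching on the full domain or a dot-prefixed suffix avoids false positives on hosts that merely end with the same letters.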

@Parva101 Parva101 left a comment

Must‑Fix Before Merge

  1. Comment / code timeout drift.
    page.content() uses timeout=10.0 but comment + error msg say “5 seconds.” Similar drift in iframe timeouts (2s in comment, 1.0 in code).
    Fix: centralize constants & align error messages.

  2. asyncio.get_event_loop() is deprecated as of Python 3.12 when no current event loop is set.
    Use asyncio.get_running_loop() instead (safe inside coroutines and callbacks).

  3. Iframe gating heuristic may drop critical data.
    Now iframes are processed only when the query text contains “iframe” or “frame.” Many sites load actual content (docs, auth, dashboards, embedded tables) in cross‑origin iframes users won’t name. Data‑loss regression.
    Fix ideas:

    • Config flag process_iframes=True (default) w/ MAX_IFRAME_COUNT.
    • Always include top N non‑ad iframes (URL heuristic, size check) unless disabled.
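
The three must-fix items could be sketched roughly as follows. All names here (`PAGE_CONTENT_TIMEOUT_S`, `with_timeout`, `run_blocking`, `IframeConfig`, `select_iframes`) are hypothetical, the timeout values are the ones quoted in item 1, and the ad-host markers and size threshold are assumed placeholders rather than anything taken from the PR.

```python
import asyncio
from dataclasses import dataclass
from urllib.parse import urlparse

# Item 1: one source of truth for timeouts (values quoted in the review).
PAGE_CONTENT_TIMEOUT_S = 10.0
IFRAME_CONTENT_TIMEOUT_S = 1.0


async def with_timeout(awaitable, timeout_s: float, what: str):
    """Await with a timeout; the error message is built from the same constant."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout_s)
    except asyncio.TimeoutError:
        raise RuntimeError(f"{what} timed out after {timeout_s:g} seconds") from None


async def run_blocking(func, *args):
    """Item 2: run a blocking callable in the default executor via the running loop."""
    loop = asyncio.get_running_loop()  # raises RuntimeError outside a running loop
    return await loop.run_in_executor(None, func, *args)


# Item 3: config-driven iframe selection instead of query-text gating.
@dataclass(frozen=True)
class IframeCandidate:
    url: str
    width: int
    height: int


@dataclass(frozen=True)
class IframeConfig:
    process_iframes: bool = True
    max_iframe_count: int = 3
    min_area_px: int = 10_000  # assumed cutoff for tracking pixels and tiny widgets
    ad_host_markers: tuple = ("doubleclick", "adservice", "googlesyndication")


def select_iframes(frames, config=IframeConfig()):
    """Return the first max_iframe_count frames that pass the URL and size checks."""
    if not config.process_iframes:
        return []
    kept = []
    for frame in frames:
        host = (urlparse(frame.url).hostname or "").lower()
        if any(marker in host for marker in config.ad_host_markers):
            continue  # URL heuristic: host looks like an ad network
        if frame.width * frame.height < config.min_area_px:
            continue  # size check: too small to carry visible content
        kept.append(frame)
    return kept[: config.max_iframe_count]
```

Because the error text is formatted from the same constant passed to `asyncio.wait_for`, the message cannot drift from the timeout actually in effect.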

@sauravpanda sauravpanda marked this pull request as draft July 17, 2025 07:17
@mertunsall
Collaborator

@sauravpanda what's the state of this PR?

4 participants