-
Notifications
You must be signed in to change notification settings - Fork 7.8k
feat: speed optimization for extract_structured_data #2443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ No security or compliance issues detected. Reviewed everything up to 5f99600. Security Overview
Detected Code Changes
Reply to this PR with |
Agent Task Evaluation Results: 2/3 (67%)View detailed results
Check the evaluate-tasks job for detailed task execution logs. |
✅ No security or compliance issues detected. Reviewed everything up to 404336c. Security Overview
Detected Code Changes
Reply to this PR with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cubic reviewed 1 file and found no issues. Review PR in cubic.dev.
we can also even hardcode the most common iframe tracking domains from the Alexa top 100 sites so we never bother processing those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must‑Fix Before Merge
-
Comment / code timeout drift.
page.content()
usestimeout=10.0
but comment + error msg say “5 seconds.” Similar drift in iframe timeouts (2s
in comment,1.0
in code).
Fix: centralize constants & align error messages. -
asyncio.get_event_loop()
is deprecated from python 3.12
Useasyncio.get_running_loop()
(safe in modern async contexts). -
Iframe gating heuristic may drop critical data.
Now iframes are processed only when the query text contains “iframe” or “frame.” Many sites load actual content (docs, auth, dashboards, embedded tables) in cross‑origin iframes users won’t name. Data‑loss regression.
Fix ideas:- Config flag
process_iframes=True
(default) w/MAX_IFRAME_COUNT
. - Always include top N non‑ad iframes (URL heuristic, size check) unless disabled.
- Config flag
…thub.com/browser-use/browser-use into saurav/fix-extract-structured-data-speed
@sauravpanda what's the state of this PR? |
Summary by cubic
Improved the speed of extract_structured_data by limiting iframe processing, reducing content size, and shortening timeouts.