Feat: Allowing evaluations using Ragas Metrics in EvalTask #5197
This PR enables evaluation using the Ragas framework alongside the existing Vertex metrics.
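As a rough illustration of the intended usage (a hypothetical sketch, not code from this PR; the exact import paths, accepted metric types, dataset column names, and the `"fluency"` / experiment values below are assumptions), a Ragas metric instance would be passed in the same `metrics` list as the built-in Vertex metrics:

```python
import pandas as pd
from vertexai.evaluation import EvalTask  # exact module path may differ
from ragas.metrics import faithfulness    # pre-built Ragas metric instance

# Small illustrative dataset; the required column names depend on the metrics used.
eval_dataset = pd.DataFrame(
    {
        "prompt": ["What is the boiling point of water at sea level?"],
        "response": ["Water boils at 100 °C (212 °F) at sea level."],
        "context": ["At standard atmospheric pressure, water boils at 100 °C."],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "fluency",     # existing Vertex model-based metric
        faithfulness,  # Ragas metric, evaluated in the separate Ragas pass
    ],
    experiment="ragas-integration-demo",  # hypothetical experiment name
)

result = eval_task.evaluate()
print(result.summary_metrics)
```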
Implementation Details
Ragas metric evaluation is executed in a separate loop after the main executor loop in which the Vertex metrics are evaluated. This separate implementation was necessary because:
- Ragas performs evaluation asynchronously, while the existing evaluation infrastructure uses multi-threading.
- Combining these approaches led to several runtime errors:
  - `BlockingIOError: [Errno 35] Resource temporarily unavailable` raised inside gRPC polling callbacks
  - "Future attached to a different loop" errors when async Ragas calls were invoked on one event loop but processed by another (a minimal reproduction follows this list)
  - Synchronous Ragas functions (wrappers around the async implementations) caused similar loop conflicts
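The loop-affinity failure in particular can be reproduced outside the evaluation code entirely; the snippet below is a standalone illustration (not taken from the PR) in which a future created on one event loop is awaited from a task running on another:

```python
import asyncio

# A future created on one event loop...
loop_a = asyncio.new_event_loop()
foreign_future = loop_a.create_future()

async def await_foreign_future():
    # ...awaited from a coroutine running on a different loop.
    await foreign_future

loop_b = asyncio.new_event_loop()
try:
    loop_b.run_until_complete(await_foreign_future())
except RuntimeError as err:
    # e.g. "Task ... got Future <Future pending> attached to a different loop"
    print(err)
finally:
    loop_a.close()
    loop_b.close()
```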
Attempted Solutions
Multiple approaches were tested to integrate Ragas within the existing evaluation loop.
Final Solution
The chosen implementation runs Ragas metrics separately after the main evaluation loop completes, preserving both the multi-threaded performance of the existing evaluation system and the asynchronous benefits of Ragas, while avoiding runtime conflicts between the two approaches.
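A minimal sketch of this two-phase structure is shown below; the helper names and signatures are placeholders for illustration, not the functions actually added in `_evaluation.py`:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Placeholder helpers -- names and signatures are illustrative only.
def compute_vertex_metric(metric, row):
    """Stand-in for a blocking call to the Vertex evaluation service."""
    return {metric: 1.0}

async def compute_ragas_metric(metric, row):
    """Stand-in for an async Ragas scoring call."""
    await asyncio.sleep(0)
    return {metric: 1.0}

def evaluate(rows, vertex_metrics, ragas_metrics):
    results = []

    # Phase 1: Vertex metrics keep the existing multi-threaded executor.
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(compute_vertex_metric, metric, row)
            for row in rows
            for metric in vertex_metrics
        ]
        results.extend(future.result() for future in futures)

    # Phase 2: Ragas metrics run afterwards on a single dedicated event loop,
    # so no coroutine is ever scheduled on a loop owned by another thread.
    async def run_ragas():
        return await asyncio.gather(
            *(
                compute_ragas_metric(metric, row)
                for row in rows
                for metric in ragas_metrics
            )
        )

    results.extend(asyncio.run(run_ragas()))
    return results
```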
The diagram illustrates the functional organization within `_evaluation.py` where the changes have been implemented. Yellow boxes indicate functions that import from the Ragas framework.
Testing
A complete end-to-end example demonstrating the implementation is available in the accompanying gist, which shows successful execution without runtime errors:
https://gist.github.com/sahusiddharth/39030eb6318a16b7cdc3d30c6a7c458b