
Docs/how-to-benchmark-new-llm-guide #2158


Open
wants to merge 22 commits into main

Conversation

@sanjeed5 (Contributor) commented Aug 7, 2025

What

  • New how-to guide: docs/experimental/howtos/benchmark_llm.md
    • Step-by-step evaluation of a new LLM vs baseline using ragas_experimental
    • Covers dataset, metric, experiment pattern, and end-to-end execution
  • Section index: docs/experimental/howtos/index.md
  • Editor/Docs rules (Cursor):
    • .cursor/rules/docs-diataxis-guidelines.mdc — Diátaxis modes guidance
    • .cursor/rules/docs-structure.mdc — docs layout, assets, build workflow
    • .cursor/rules/project-structure.mdc — monorepo structure
    • .cursor/rules/use-uv-cli.mdc — enforce uv run for Python CLIs
  • Example config: experimental/ragas_examples/benchmark_llm/config.py
    • Adds BASELINE_MODEL and CANDIDATE_MODEL defaults
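For reference, a minimal sketch of what the config module might contain; the default model names are the ones discussed later in this PR, and the exact file contents may differ:

    # experimental/ragas_examples/benchmark_llm/config.py (illustrative sketch)
    # Default model names are taken from the PR discussion; treat them as
    # placeholders and swap in whichever models you want to compare.
    BASELINE_MODEL = "gpt-4.1-mini"   # the model currently in use
    CANDIDATE_MODEL = "o4-mini"       # the new model being evaluated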

Why

  • Provide a concrete, copy-pasteable guide for evaluating new LLMs on realistic tasks
  • Align documentation with Diátaxis for clarity and consistency
  • Improve contributor UX with explicit docs structure and uv usage rules

Navigation

  • Navigation is already added to mkdocs.yml under Experimental → How-to Guides:
    • experimental/howtos/index.md
    • experimental/howtos/benchmark_llm.md

How to verify

  • Build/preview docs:
    • make serve-docs
  • Run the example locally:
    • uv run python -m ragas_examples.benchmark_llm.prompt
    • uv run python -m ragas_examples.benchmark_llm.evals
  • Ensure OPENAI_API_KEY is set
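Taken together, a typical verification run might look like the following (the API key value is a placeholder; the export syntax assumes a POSIX shell):

    # in one terminal: build and preview the docs
    make serve-docs

    # in another terminal: run the example end to end
    export OPENAI_API_KEY="sk-..."   # placeholder key
    uv run python -m ragas_examples.benchmark_llm.prompt
    uv run python -m ragas_examples.benchmark_llm.evals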

Impact

  • Docs-only additions + minor example config; no breaking changes to core library

Checklist

  • New pages added to mkdocs.yml nav
  • Pages placed in correct folders per docs structure
  • Code blocks are copy-pasteable
  • Follows Diátaxis (how-to guide)

sanjeed5 added 16 commits August 6, 2025 14:24
The mkdocstrings plugin was configured with 'paths: [src]' but the actual
source code is located at 'ragas/src' relative to the mkdocs.yml file.

This was causing build failures with errors like:
- 'ragas.cache could not be found'
- 'ragas.embeddings could not be found'

The fix updates the path to 'ragas/src' which allows mkdocstrings to
properly generate API documentation when running 'mkdocs serve' locally.

Resolves mkdocstrings module import errors during local documentation development.
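For context, the corrected plugin entry would look roughly like this; the surrounding structure of the repo's mkdocs.yml is assumed, not copied from the PR:

    plugins:
      - mkdocstrings:
          handlers:
            python:
              paths: [ragas/src]   # was [src]; mkdocs.yml sits one level above ragas/src
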
…ion to core_concepts

- Reorder experimental tutorials in mkdocs.yml to follow logical progression:
  Prompt → RAG → Workflow → Agent (instead of Agent being first)
- Rename docs/experimental/explanation/ to core_concepts/ for consistency
- Update navigation links and section titles to match new structure
- Fix 'Next' button navigation to follow intended learning path

Fixes navigation issue where 'Next' from tutorials index went to Agent
instead of starting with Prompt tutorial.
This update introduces a new "How-to Guides" section in the mkdocs.yml file, enhancing the documentation structure by including links to the experimental how-to guides and specific evaluations for new LLMs. This addition aims to improve user accessibility to practical resources.
This update introduces new formatting guidelines for documentation, specifying a blank line before list items that follow a colon and no new lines between items in numbered lists. These guidelines aim to improve the clarity and consistency of documentation formatting.
…g LLMs

This commit introduces a new file for the How-to Guides section, providing step-by-step instructions for utilizing Ragas' experimental features. The first guide included focuses on evaluating a new LLM for specific use cases, enhancing the practical resources available to users.
…mework

This commit introduces a detailed how-to guide for users to evaluate new LLMs against their current models. The guide covers setup, configuration, execution, and analysis of results, providing a structured approach to model comparison tailored for specific use cases.
This commit enhances the benchmark LLM how-to guide by refining installation instructions, clarifying model configuration steps, and providing detailed examples for dataset structure and evaluation metrics. Additionally, it improves the overall organization and readability of the document, ensuring users can effectively evaluate new LLMs using the Ragas framework.
@dosubot added the size:XL label (This PR changes 500-999 lines, ignoring generated files) on Aug 7, 2025
greptile-apps bot (Contributor) left a comment


Greptile Summary

This PR adds a comprehensive LLM benchmarking guide to the experimental documentation, along with establishing formal documentation standards for the project. The main addition is a new how-to guide (docs/experimental/howtos/benchmark_llm.md) that teaches users how to evaluate new LLMs against baseline models using the ragas_experimental framework through a practical eligibility assessment example.

The PR also introduces several Cursor AI editor rules that establish documentation standards based on the Diátaxis framework, organizing content into four distinct modes (Tutorials, How-to Guides, Reference, Explanation) with specific folder structures and formatting guidelines. Additionally, it enforces the use of uv run for all Python CLI commands to ensure consistent virtual environment usage.

The implementation includes a complete working example in the experimental/ragas_examples/benchmark_llm/ directory with three core components: a configuration file defining baseline and candidate models, a prompt system for eligibility assessment, and an evaluation pipeline with custom metrics. The guide demonstrates dataset creation, metric definition using the discrete metric decorator, experiment execution with proper error handling, and result interpretation.

The documentation follows established patterns from the existing codebase, integrating seamlessly with the MkDocs navigation structure and maintaining consistency with other experimental features. The new content is properly categorized under the experimental section, clearly indicating the stability level of the functionality being documented.
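As an illustration of the pattern the summary describes (not the PR's actual code), the prompt component could look roughly like the sketch below; run_prompt, SYSTEM_PROMPT, and the sample input are hypothetical names, while BASELINE_MODEL and CANDIDATE_MODEL come from the example config:

    # Hypothetical sketch of an eligibility-assessment prompt runner.
    from openai import OpenAI

    from ragas_examples.benchmark_llm.config import BASELINE_MODEL, CANDIDATE_MODEL

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = (
        "You assess eligibility for a program. "
        "Answer 'eligible' or 'not eligible' and give a one-line reason."
    )

    def run_prompt(case: str, model: str = BASELINE_MODEL) -> str:
        """Send a single eligibility case to the given model and return its answer."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": case},
            ],
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        case = "Applicant is 17 years old with a part-time income."  # illustrative input
        print("baseline :", run_prompt(case, model=BASELINE_MODEL))
        print("candidate:", run_prompt(case, model=CANDIDATE_MODEL))

Running the same inputs through both models is what lets the evaluation pipeline score baseline and candidate side by side.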

Confidence score: 1/5

  • This PR contains critical errors in model configuration that will cause immediate runtime failures when users attempt to follow the guide
  • Score reflects invalid OpenAI model names ('gpt-4.1-mini' and 'o4-mini') in config.py and documentation that don't exist in the OpenAI API
  • Pay close attention to experimental/ragas_examples/benchmark_llm/config.py and docs/experimental/howtos/benchmark_llm.md for model name corrections

12 files reviewed, 6 comments


sanjeed5 and others added 4 commits August 8, 2025 00:38
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment on lines +92 to 96 (mkdocs.yml)

  - How-to Guides:
      - experimental/howtos/index.md
      - Evaluate a New LLM: experimental/howtos/benchmark_llm.md
  - 🛠️ How-to Guides:
      - howtos/index.md
Member

there are 2 how-to guides sections

Contributor Author

What I've added is part of Experimental. The other one is the main, older How-to Guides section, which shows up as a tab:
[Screenshot: 2025-08-08 at 8:53:28 AM]

This update adds a note emphasizing the importance of using a larger dataset for real-world evaluations, suggesting a target of 50–100 cases to ensure comprehensive coverage of various scenarios. The guidance aims to improve the quality of evaluations and encourage iterative improvements based on error analysis.
@shahules786 (Member) left a comment

Overall a great start, @sanjeed5. Here are some items I would want you to rethink.

  1. Remove the prerequisites. How-to guides are for non-beginners, so we don't have to add things like "have ragas installed", "what is a dataset in ragas", etc. We expect the reader at this point to already know these.
  2. Regarding "While the example dataset here has roughly 10 cases to keep the guide compact, for a real-world evaluation you should target 50–100 cases": I would recommend against giving this advice, as it leads to folks thinking too much about the perfect dataset. My advice would be a slightly modified version, something like "you can start small with 20–30 samples, but make sure you slowly iterate on it to improve it to the 50–100 sample range to get more trustworthy results from evaluation".
  3. Highlight the reusability of setting up this evals loop alongside your project: "whenever a new model is released, you can benchmark it against the current model of choice by running this script".
  4. Add a section to analyse results (maybe add this to the benchmark eval file):
    1. load result 1 and result 2
    2. merge them (inputs, output from model 1, output from model 2, model 1 score, model 2 score)
    3. show how selection is done based on this (theory if not enough)
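A hedged sketch of the merge-and-compare step described in item 4; it assumes each experiment run was exported to CSV with input, response, and score columns and that the metric produces pass/fail labels, and all file and column names are illustrative:

    import pandas as pd

    # Load the two experiment runs (item 4.1).
    baseline = pd.read_csv("experiments/baseline_results.csv")
    candidate = pd.read_csv("experiments/candidate_results.csv")

    # Merge on the shared input so each row carries both outputs and both scores (item 4.2).
    merged = baseline.merge(candidate, on="input", suffixes=("_baseline", "_candidate"))

    # Aggregate scores and inspect disagreements to decide which model to keep (item 4.3).
    baseline_acc = (merged["score_baseline"] == "pass").mean()
    candidate_acc = (merged["score_candidate"] == "pass").mean()
    print(f"baseline: {baseline_acc:.1%}  candidate: {candidate_acc:.1%}")

    disagreements = merged[merged["score_baseline"] != merged["score_candidate"]]
    print(disagreements[["input", "response_baseline", "response_candidate"]])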

3 participants