
Docs/how-to-benchmark-new-llm-guide #2158


Open
wants to merge 22 commits into main

Conversation

@sanjeed5 (Contributor) commented Aug 7, 2025

What

  • New how-to guide: docs/experimental/howtos/benchmark_llm.md
    • Step-by-step evaluation of a new LLM vs baseline using ragas_experimental
    • Covers dataset, metric, experiment pattern, and end-to-end execution
  • Section index: docs/experimental/howtos/index.md
  • Editor/Docs rules (Cursor):
    • .cursor/rules/docs-diataxis-guidelines.mdc — Diátaxis modes guidance
    • .cursor/rules/docs-structure.mdc — docs layout, assets, build workflow
    • .cursor/rules/project-structure.mdc — monorepo structure
    • .cursor/rules/use-uv-cli.mdc — enforce uv run for Python CLIs
  • Example config: experimental/ragas_examples/benchmark_llm/config.py
    • Adds BASELINE_MODEL and CANDIDATE_MODEL defaults
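For reference, a minimal sketch of what the config module might contain; the default model names are the ones discussed later in this PR, and the exact file contents may differ:

    # experimental/ragas_examples/benchmark_llm/config.py (illustrative sketch)
    # Default model names are taken from the PR discussion; treat them as
    # placeholders and swap in whichever models you want to compare.
    BASELINE_MODEL = "gpt-4.1-mini"   # the model currently in use
    CANDIDATE_MODEL = "o4-mini"       # the new model being evaluated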

Why

  • Provide a concrete, copy-pasteable guide for evaluating new LLMs on realistic tasks
  • Align documentation with Diátaxis for clarity and consistency
  • Improve contributor UX with explicit docs structure and uv usage rules

Navigation

  • Navigation is already added to mkdocs.yml under Experimental → How-to Guides:
    • experimental/howtos/index.md
    • experimental/howtos/benchmark_llm.md

How to verify

  • Build/preview docs:
    • make serve-docs
  • Run the example locally:
    • uv run python -m ragas_examples.benchmark_llm.prompt
    • uv run python -m ragas_examples.benchmark_llm.evals
  • Ensure OPENAI_API_KEY is set
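Taken together, a typical verification run might look like the following (the API key value is a placeholder; the export syntax assumes a POSIX shell):

    # in one terminal: build and preview the docs
    make serve-docs

    # in another terminal: run the example end to end
    export OPENAI_API_KEY="sk-..."   # placeholder key
    uv run python -m ragas_examples.benchmark_llm.prompt
    uv run python -m ragas_examples.benchmark_llm.evals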

Impact

  • Docs-only additions + minor example config; no breaking changes to core library

Checklist

  • New pages added to mkdocs.yml nav
  • Pages placed in correct folders per docs structure
  • Code blocks are copy-pasteable
  • Follows Diátaxis (how-to guide)

sanjeed5 added 16 commits August 6, 2025 14:24
The mkdocstrings plugin was configured with 'paths: [src]' but the actual
source code is located at 'ragas/src' relative to the mkdocs.yml file.

This was causing build failures with errors like:
- 'ragas.cache could not be found'
- 'ragas.embeddings could not be found'

The fix updates the path to 'ragas/src' which allows mkdocstrings to
properly generate API documentation when running 'mkdocs serve' locally.

Resolves mkdocstrings module import errors during local documentation development.
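For context, the corrected plugin entry would look roughly like this; the surrounding structure of the repo's mkdocs.yml is assumed, not copied from the PR:

    plugins:
      - mkdocstrings:
          handlers:
            python:
              paths: [ragas/src]   # was [src]; mkdocs.yml sits one level above ragas/src
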
…ion to core_concepts

- Reorder experimental tutorials in mkdocs.yml to follow logical progression:
  Prompt → RAG → Workflow → Agent (instead of Agent being first)
- Rename docs/experimental/explanation/ to core_concepts/ for consistency
- Update navigation links and section titles to match new structure
- Fix 'Next' button navigation to follow intended learning path

Fixes navigation issue where 'Next' from tutorials index went to Agent
instead of starting with Prompt tutorial.
This update introduces a new "How-to Guides" section in the mkdocs.yml file, enhancing the documentation structure by including links to the experimental how-to guides and specific evaluations for new LLMs. This addition aims to improve user accessibility to practical resources.
This update introduces new formatting guidelines for documentation, specifying a blank line before list items that follow a colon and no new lines between items in numbered lists. These guidelines aim to improve the clarity and consistency of documentation formatting.
…g LLMs

This commit introduces a new file for the How-to Guides section, providing step-by-step instructions for utilizing Ragas' experimental features. The first guide included focuses on evaluating a new LLM for specific use cases, enhancing the practical resources available to users.
…mework

This commit introduces a detailed how-to guide for users to evaluate new LLMs against their current models. The guide covers setup, configuration, execution, and analysis of results, providing a structured approach to model comparison tailored for specific use cases.
This commit enhances the benchmark LLM how-to guide by refining installation instructions, clarifying model configuration steps, and providing detailed examples for dataset structure and evaluation metrics. Additionally, it improves the overall organization and readability of the document, ensuring users can effectively evaluate new LLMs using the Ragas framework.
@dosubot added the size:XL label (This PR changes 500-999 lines, ignoring generated files) on Aug 7, 2025
greptile-apps bot (Contributor) left a comment


Greptile Summary

This PR adds a comprehensive LLM benchmarking guide to the experimental documentation, along with establishing formal documentation standards for the project. The main addition is a new how-to guide (docs/experimental/howtos/benchmark_llm.md) that teaches users how to evaluate new LLMs against baseline models using the ragas_experimental framework through a practical eligibility assessment example.

The PR also introduces several Cursor AI editor rules that establish documentation standards based on the Diátaxis framework, organizing content into four distinct modes (Tutorials, How-to Guides, Reference, Explanation) with specific folder structures and formatting guidelines. Additionally, it enforces the use of uv run for all Python CLI commands to ensure consistent virtual environment usage.

The implementation includes a complete working example in the experimental/ragas_examples/benchmark_llm/ directory with three core components: a configuration file defining baseline and candidate models, a prompt system for eligibility assessment, and an evaluation pipeline with custom metrics. The guide demonstrates dataset creation, metric definition using the discrete metric decorator, experiment execution with proper error handling, and result interpretation.

The documentation follows established patterns from the existing codebase, integrating seamlessly with the MkDocs navigation structure and maintaining consistency with other experimental features. The new content is properly categorized under the experimental section, clearly indicating the stability level of the functionality being documented.
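As an illustration of the pattern the summary describes (not the PR's actual code), the prompt component could look roughly like the sketch below; run_prompt, SYSTEM_PROMPT, and the sample input are hypothetical names, while BASELINE_MODEL and CANDIDATE_MODEL come from the example config:

    # Hypothetical sketch of an eligibility-assessment prompt runner.
    from openai import OpenAI

    from ragas_examples.benchmark_llm.config import BASELINE_MODEL, CANDIDATE_MODEL

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = (
        "You assess eligibility for a program. "
        "Answer 'eligible' or 'not eligible' and give a one-line reason."
    )

    def run_prompt(case: str, model: str = BASELINE_MODEL) -> str:
        """Send a single eligibility case to the given model and return its answer."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": case},
            ],
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        case = "Applicant is 17 years old with a part-time income."  # illustrative input
        print("baseline :", run_prompt(case, model=BASELINE_MODEL))
        print("candidate:", run_prompt(case, model=CANDIDATE_MODEL))

Running the same inputs through both models is what lets the evaluation pipeline score baseline and candidate side by side.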

Confidence score: 1/5

  • This PR contains critical errors in model configuration that will cause immediate runtime failures when users attempt to follow the guide
  • Score reflects invalid OpenAI model names ('gpt-4.1-mini' and 'o4-mini') in config.py and documentation that don't exist in the OpenAI API
  • Pay close attention to experimental/ragas_examples/benchmark_llm/config.py and docs/experimental/howtos/benchmark_llm.md for model name corrections

12 files reviewed, 6 comments


sanjeed5 and others added 4 commits August 8, 2025 00:38
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment on lines +92 to 96 (mkdocs.yml)

  - How-to Guides:
      - experimental/howtos/index.md
      - Evaluate a New LLM: experimental/howtos/benchmark_llm.md
  - 🛠️ How-to Guides:
      - howtos/index.md
Member

there are 2 how-to guides sections

Contributor Author

What I've added is part of Experimental. The other one is the main, older How-to Guides section, which shows up as a tab:
[Screenshot: 2025-08-08 at 8:53:28 AM]

This update adds a note emphasizing the importance of using a larger dataset for real-world evaluations, suggesting a target of 50–100 cases to ensure comprehensive coverage of various scenarios. The guidance aims to improve the quality of evaluations and encourage iterative improvements based on error analysis.
@shahules786 (Member) left a comment

Overall a great start, @sanjeed5. Here are some items I would want you to rethink.

  1. Remove the prerequisites. How-to guides are for non-beginners, so we don't have to add things like "have ragas installed", "what is a dataset in ragas", etc. We expect the reader at this point to already know these.
  2. Regarding "While the example dataset here has roughly 10 cases to keep the guide compact, for a real-world evaluation you should target 50–100 cases": I would recommend against giving this advice, as it leads to folks thinking too much about the perfect dataset. My advice would be a slightly modified version, something like "you can start small with 20–30 samples, but make sure you slowly iterate on it to improve it to the 50–100 sample range to get more trustworthy results from evaluation".
  3. Highlight the reusability of setting up this evals loop alongside your project: "whenever a new model is released, you can benchmark it against the current model of choice by running this script".
  4. Add a section to analyse results (maybe add this to the benchmark eval file):
    1. load result 1 and result 2
    2. merge them (inputs, output from model 1, output from model 2, model 1 score, model 2 score)
    3. show how selection is done based on this (theory if not enough)
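A hedged sketch of the merge-and-compare step described in item 4; it assumes each experiment run was exported to CSV with input, response, and score columns and that the metric produces pass/fail labels, and all file and column names are illustrative:

    import pandas as pd

    # Load the two experiment runs (item 4.1).
    baseline = pd.read_csv("experiments/baseline_results.csv")
    candidate = pd.read_csv("experiments/candidate_results.csv")

    # Merge on the shared input so each row carries both outputs and both scores (item 4.2).
    merged = baseline.merge(candidate, on="input", suffixes=("_baseline", "_candidate"))

    # Aggregate scores and inspect disagreements to decide which model to keep (item 4.3).
    baseline_acc = (merged["score_baseline"] == "pass").mean()
    candidate_acc = (merged["score_candidate"] == "pass").mean()
    print(f"baseline: {baseline_acc:.1%}  candidate: {candidate_acc:.1%}")

    disagreements = merged[merged["score_baseline"] != merged["score_candidate"]]
    print(disagreements[["input", "response_baseline", "response_candidate"]])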

3 participants