Docs/how-to-benchmark-new-llm-guide #2158
base: main
Conversation
The mkdocstrings plugin was configured with 'paths: [src]' but the actual source code is located at 'ragas/src' relative to the mkdocs.yml file. This was causing build failures with errors like:
- 'ragas.cache could not be found'
- 'ragas.embeddings could not be found'
The fix updates the path to 'ragas/src', which allows mkdocstrings to properly generate API documentation when running 'mkdocs serve' locally. Resolves mkdocstrings module import errors during local documentation development.
…ion to core_concepts
- Reorder experimental tutorials in mkdocs.yml to follow the logical progression Prompt → RAG → Workflow → Agent (instead of Agent being first)
- Rename docs/experimental/explanation/ to core_concepts/ for consistency
- Update navigation links and section titles to match the new structure
- Fix 'Next' button navigation to follow the intended learning path
Fixes the navigation issue where 'Next' from the tutorials index went to Agent instead of starting with the Prompt tutorial.
This update introduces a new "How-to Guides" section in the mkdocs.yml file, enhancing the documentation structure by including links to the experimental how-to guides and specific evaluations for new LLMs. This addition aims to improve user accessibility to practical resources.
This update introduces new formatting guidelines for documentation, specifying the use of blank lines before list items following a colon and the prohibition of new lines between items in numbered lists. These guidelines aim to enhance the clarity and consistency of documentation formatting.
…g LLMs
This commit introduces a new file for the How-to Guides section, providing step-by-step instructions for utilizing Ragas' experimental features. The first guide included focuses on evaluating a new LLM for specific use cases, enhancing the practical resources available to users.
…mework
This commit introduces a detailed how-to guide for users to evaluate new LLMs against their current models. The guide covers setup, configuration, execution, and analysis of results, providing a structured approach to model comparison tailored for specific use cases.
This commit enhances the benchmark LLM how-to guide by refining installation instructions, clarifying model configuration steps, and providing detailed examples for dataset structure and evaluation metrics. Additionally, it improves the overall organization and readability of the document, ensuring users can effectively evaluate new LLMs using the Ragas framework.
Greptile Summary
This PR adds a comprehensive LLM benchmarking guide to the experimental documentation, along with establishing formal documentation standards for the project. The main addition is a new how-to guide (docs/experimental/howtos/benchmark_llm.md) that teaches users how to evaluate new LLMs against baseline models using the ragas_experimental framework, through a practical eligibility assessment example.
The PR also introduces several Cursor AI editor rules that establish documentation standards based on the Diátaxis framework, organizing content into four distinct modes (Tutorials, How-to Guides, Reference, Explanation) with specific folder structures and formatting guidelines. Additionally, it enforces the use of uv run for all Python CLI commands to ensure consistent virtual environment usage.
The implementation includes a complete working example in the experimental/ragas_examples/benchmark_llm/ directory with three core components: a configuration file defining baseline and candidate models, a prompt system for eligibility assessment, and an evaluation pipeline with custom metrics. The guide demonstrates dataset creation, metric definition using the discrete metric decorator, experiment execution with proper error handling, and result interpretation.
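To make the shape of such an evaluation concrete, here is a minimal, framework-free Python sketch of a baseline-vs-candidate benchmark loop with a discrete pass/fail grade. Every name and value in it (run_model, grade_response, the two-case dataset, the placeholder model strings) is a hypothetical stand-in, not the guide's actual ragas_experimental code.

```python
# Minimal sketch of a baseline-vs-candidate benchmark loop.
BASELINE_MODEL = "baseline-model-name"    # hypothetical placeholder
CANDIDATE_MODEL = "candidate-model-name"  # hypothetical placeholder

# Tiny eligibility-style dataset: each case has an input and an expected label.
DATASET = [
    {"applicant": "Income 85k, credit score 710", "expected": "eligible"},
    {"applicant": "Income 20k, credit score 540", "expected": "not_eligible"},
]

def run_model(model: str, applicant: str) -> str:
    """Stub for an LLM call; the real guide issues a model request here."""
    # Deterministic stand-in so the sketch runs without an API key.
    return "eligible" if "710" in applicant else "not_eligible"

def grade_response(prediction: str, expected: str) -> str:
    """Discrete pass/fail grade, mirroring the guide's discrete metric idea."""
    return "pass" if prediction.strip().lower() == expected else "fail"

def pass_rate(model: str) -> float:
    """Score one model over the whole dataset and return its pass rate."""
    grades = [
        grade_response(run_model(model, case["applicant"]), case["expected"])
        for case in DATASET
    ]
    return grades.count("pass") / len(grades)

if __name__ == "__main__":
    for model in (BASELINE_MODEL, CANDIDATE_MODEL):
        print(f"{model}: pass rate = {pass_rate(model):.0%}")
```

The actual guide expresses the grading rule through the discrete metric decorator and replaces the stubbed model call with a real LLM client.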
The documentation follows established patterns from the existing codebase, integrating seamlessly with the MkDocs navigation structure and maintaining consistency with other experimental features. The new content is properly categorized under the experimental section, clearly indicating the stability level of the functionality being documented.
Confidence score: 1/5
- This PR contains critical errors in model configuration that will cause immediate runtime failures when users attempt to follow the guide
- Score reflects invalid OpenAI model names ('gpt-4.1-mini' and 'o4-mini') in config.py and documentation that don't exist in the OpenAI API
- Pay close attention to experimental/ragas_examples/benchmark_llm/config.py and docs/experimental/howtos/benchmark_llm.md for model name corrections
12 files reviewed, 6 comments
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
- How-to Guides:
    - experimental/howtos/index.md
    - Evaluate a New LLM: experimental/howtos/benchmark_llm.md
- 🛠️ How-to Guides:
    - howtos/index.md
there are 2 how-to guides sections
This update adds a note emphasizing the importance of using a larger dataset for real-world evaluations, suggesting a target of 50–100 cases to ensure comprehensive coverage of various scenarios. The guidance aims to improve the quality of evaluations and encourage iterative improvements based on error analysis.
…com/sanjeed5/ragas into docs/how-to-benchmark-new-llm-guide
Overall great start @sanjeed5. Here are some items I would want you to rethink.
- Remove the prerequisites. How-to guides are for non-beginners, so we don't have to add things like "have ragas installed", "what is a dataset in ragas", etc. We expect the reader at this point to know these.
- "While the example dataset here has roughly 10 cases to keep the guide compact, for a real-world evaluation you should target 50–100 cases": I would recommend against giving this advice, as it leads to folks thinking too much about the perfect dataset. My advice would be a slightly modified version, something like "you can start small with 20–30 samples, but make sure you slowly iterate on it toward the 50–100 sample range to get more trustworthy results from evaluation".
- Highlight the reusability of setting up this evals loop alongside your project: "whenever a new model is released you can benchmark it against the current model of choice by running this script".
- Add a section to analyse results (maybe add this to the benchmark eval file); a sketch follows this list:
  - load result 1 and result 2
  - merge them (inputs, output from model 1, output from model 2, model 1 score, model 2 score)
  - show how selection is done based on this (theory if not enough)
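A possible sketch of that analysis step using pandas, assuming the two experiment runs were saved as CSV files; the file names and column names (case_id, input, output, score) are illustrative, not the actual output schema of the benchmark script.

```python
import pandas as pd

# Hypothetical file and column names; adapt to the actual experiment output.
baseline = pd.read_csv("results_baseline.csv")    # columns: case_id, input, output, score
candidate = pd.read_csv("results_candidate.csv")  # same columns

# Merge the two runs on the shared case id so each row shows the input,
# both outputs, and both scores side by side.
merged = baseline.merge(
    candidate,
    on=["case_id", "input"],
    suffixes=("_baseline", "_candidate"),
)

# Overall pass rates (assuming score is "pass"/"fail").
baseline_rate = (merged["score_baseline"] == "pass").mean()
candidate_rate = (merged["score_candidate"] == "pass").mean()
print(f"baseline pass rate:  {baseline_rate:.0%}")
print(f"candidate pass rate: {candidate_rate:.0%}")

# Cases where the two models disagree are the ones worth reading closely
# before deciding which model to select.
disagreements = merged[merged["score_baseline"] != merged["score_candidate"]]
print(disagreements[["input", "output_baseline", "output_candidate"]])
```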
What
- docs/experimental/howtos/benchmark_llm.md — new how-to guide for benchmarking a new LLM with ragas_experimental
- docs/experimental/howtos/index.md — index for the experimental how-to guides
- .cursor/rules/docs-diataxis-guidelines.mdc — Diátaxis modes guidance
- .cursor/rules/docs-structure.mdc — docs layout, assets, build workflow
- .cursor/rules/project-structure.mdc — monorepo structure
- .cursor/rules/use-uv-cli.mdc — enforce uv run for Python CLIs
- experimental/ragas_examples/benchmark_llm/config.py — BASELINE_MODEL and CANDIDATE_MODEL defaults
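For illustration, a hedged sketch of what a config module exposing those two defaults could look like; the placeholder model strings and the environment-variable overrides are assumptions, not the actual contents of config.py.

```python
import os

# Sketch of a benchmark config module exposing a baseline and a candidate model.
# The default model strings and the env-var overrides are illustrative
# assumptions, not the real defaults shipped in config.py.
BASELINE_MODEL = os.environ.get("BASELINE_MODEL", "baseline-model-name")
CANDIDATE_MODEL = os.environ.get("CANDIDATE_MODEL", "candidate-model-name")

# Whenever a new model is released, point CANDIDATE_MODEL at it and rerun
# the benchmark script to compare it against the current model of choice.
```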
Why
- Documentation standards and uv usage rules

Navigation
- mkdocs.yml, under Experimental → How-to Guides:
  - experimental/howtos/index.md
  - experimental/howtos/benchmark_llm.md
How to verify
- make serve-docs
- uv run python -m ragas_examples.benchmark_llm.prompt
- uv run python -m ragas_examples.benchmark_llm.evals
- Ensure OPENAI_API_KEY is set

Impact
Checklist
- mkdocs.yml nav: updated