RAGAS
Supercharge Your LLM Application Evaluations 🚀
Release History
v0.4.3 (7 fixes, 6 features): This release introduces advanced prompt optimization via DSPyOptimizer, adds system prompt support for several LLM wrappers, and includes several bug fixes related to caching and CI configuration.
v0.4.2 (Breaking, 9 fixes, 13 features): This release focuses heavily on migrating core metrics to the new collections API structure and introduces caching support for metrics and embeddings. Several bug fixes address issues related to instructor modes, type validation, and Claude workflow tokens.
v0.4.1 (Breaking, 2 fixes, 6 features): This release focuses heavily on migrating core metrics (ToolCallAccuracy, ToolCallF1, TopicAdherence, AgentGoalAccuracy, Rubrics) to the collections API for better structure. It also introduces a breaking change by renaming `embed_text` to `aembed_text` in AnswerRelevancy.
v0.4.0 (Breaking, 9 fixes, 5 features): This release introduces major architectural updates, migrating numerous metrics to a modular BasePrompt system and enhancing LLM provider support via instructor.from_provider and dual adapter capabilities. It also includes several bug fixes related to LangChain integration and LLM detection.
v0.3.9 (Breaking, 5 fixes, 9 features): This release focuses heavily on migrating core metrics to a new structure, removing deprecated metrics like 'aspect critic', and introducing new features like synthetic data traceability metadata. Several documentation fixes and minor bug fixes related to OpenAI models were also implemented.
v0.3.8 (Breaking, 5 fixes, 6 features): This release focuses heavily on internal refactoring, migrating core functionalities like semantic similarity and simple criteria to collections, and merging LLM factory methods. Several bugs related to async handling and specific synthesizers were also fixed.
v0.3.7 (4 fixes, 4 features): This release focuses on migrating several core metrics (BLEU, string metrics, answer similarity) to collections, improving robustness in query distribution, and adding new configuration options for LLM wrappers. Internal code quality and documentation were also enhanced.
v0.3.6 (15 fixes, 10 features): This release introduces several new features, including CHRF score support, enhanced input flexibility for metrics, and OCI Gen AI integration. Numerous bug fixes address issues related to asyncio, metric calculations, and dependency compatibility.
v0.3.5 (3 fixes, 4 features): This release focuses on improving core functionality, including better async execution and knowledge graph optimization, alongside several bug fixes and documentation updates.
v0.3.5rc2: No release notes provided.
v0.3.5rc1 (2 fixes, 4 features): This release focuses on improving asynchronous operations, optimizing knowledge graph handling for large datasets, and fixing a TypeError in metric calculations. It also introduces telemetry collection.
v0.3.4 (2 fixes, 1 feature): This release focuses on performance improvements, documentation updates, and minor bug fixes, including optimizing cluster finding and fixing batching issues with LangChain.
v0.3.3 (Breaking, 19 fixes, 11 features): This release focuses heavily on internal restructuring, moving modules like `tracing`, `prompts`, `dataset`, and experimental features into the main package structure while retiring the `ragas.experimental` namespace. Numerous bug fixes address CI, LLM compatibility (especially the OpenAI O1 series), and metric stability.
v0.3.3rc1 (Breaking, 20 fixes, 11 features): This release focuses heavily on internal restructuring, migrating modules like `tracing`, `prompts`, `dataset`, and experimental metrics out of experimental namespaces and into the main package structure. It also includes numerous bug fixes, performance optimizations (including a 50% speedup for factual correctness), and improved LLM compatibility.
v0.3.2 (Breaking, 3 fixes, 3 features): This release moves key features like `experiment` and the CLI from experimental to the main package, adds prompt saving/loading capabilities, and removes the simulation feature.
v0.3.2rc3: No release notes provided.
v0.3.2-rc2 (1 fix): This release primarily addresses fixes related to PyPI requirements and absolute image paths.
v0.3.2-rc1 (Breaking, 2 fixes, 4 features): This release moves key features like `experiment` and the CLI from experimental to the main package, removes simulation functionality, and adds support for Python 3.13.
v0.3.1 (4 fixes, 1 feature): This release introduces a new Google Drive backend for dataset storage and includes several documentation and example improvements, alongside minor configuration fixes.
v0.3.0 (Breaking, 6 fixes, 10 features): This release introduces major features like LlamaIndex agentic integration, a new CLI, and security enhancements including a fix for CVE-2025-45691. It also includes significant internal refactoring, notably the removal of the Project structure.
v0.3.0-rc2: No release notes provided.
v0.3.0-rc1: No release notes provided.
v0.2.15 (1 fix, 4 features): This release introduces new integrations with AWS Bedrock, LlamaStack, and Griptape, alongside enhancements to validation logic and documentation updates. A key documentation change involves renaming AWS Bedrock references to Amazon Bedrock.
v0.2.14 (8 fixes, 6 features): This release introduces new features like HTTP request-response logging and multi-turn conversation evaluation, alongside numerous bug fixes across various metrics and synthesizers. It also includes documentation updates and new integrations.
v0.2.13 (Breaking, 3 fixes, 2 features): This release focuses on bug fixes, prompt improvements, and enhancements to integrations like langgraph, alongside removing an unnecessary argument from ToolCallAccuracy initialization.
v0.2.12 (3 fixes, 2 features): This release introduces Bedrock token parser support and an optional parameter for the BLEU score, alongside several bug fixes for TP/FP calculations and the output parser.
v0.2.11 (5 fixes, 6 features): This release introduces new features like Swarm integration and the ability to specify an experiment name during evaluation. It also includes several bug fixes related to metrics and dependency management, alongside numerous documentation updates.
Common Errors
BadRequestError (6 reports): BadRequestError in ragas often arises from malformed requests sent to the underlying LLM service, such as exceeding token limits or providing incompatible input formats. To fix this, ensure your prompts and input data adhere to the specific LLM's requirements, including length limitations. Adjust configurations like `max_tokens` or reformat your input to be compatible with the LLM's expected structure.
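One way to keep prompts under a length limit is a pre-flight truncation guard. The sketch below uses an illustrative 8192-token budget and a rough 4-characters-per-token heuristic; both numbers are assumptions, not values from any particular provider, so substitute your model's real limit (and a real tokenizer for accurate counts).

```python
# Assumed budget and ratio for illustration only; use your provider's
# documented token limit and a proper tokenizer in practice.
MAX_TOKENS = 8192
CHARS_PER_TOKEN = 4

def truncate_to_budget(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Trim text so its estimated token count stays within the budget."""
    budget = max_tokens * CHARS_PER_TOKEN
    return text if len(text) <= budget else text[:budget]
```

Running retrieved contexts through a guard like this before evaluation avoids one common source of BadRequestError at the cost of losing the tail of very long documents.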
ModuleNotFoundError (3 reports): A ModuleNotFoundError in ragas usually indicates that the ragas package, or a specific sub-module, is not installed or that the installed version is outdated. Resolve this by first ensuring ragas is installed with `pip install ragas`, then upgrading to the latest version with `pip install --upgrade ragas` to pick up the missing modules. If you use a virtual environment, activate it before installing.
RagasOutputParserException (3 reports): RagasOutputParserException typically arises when the LLM's output doesn't conform to the format (e.g., JSON) expected by ragas metrics, or when the output is incomplete due to issues like timeouts. Address this by refining your prompt to explicitly instruct the LLM to emit the desired format, and by implementing robust error handling with retries and timeout configuration. Enforcing a JSON schema and validating the output before parsing can also reduce parsing errors.
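The retry advice above can be sketched as a small wrapper that re-invokes the model until its output parses as JSON. Here `call_llm` is a hypothetical zero-argument callable standing in for your actual client call; this is a generic pattern, not ragas's internal parser.

```python
import json

def parse_with_retries(call_llm, max_retries: int = 3) -> dict:
    """Re-invoke the model until its raw string output parses as JSON.

    call_llm is a placeholder for your own LLM client call.
    """
    last_err = None
    for _ in range(max_retries):
        raw = call_llm()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_err = err  # remember the last parse failure for reporting
    raise ValueError(
        f"unparsable model output after {max_retries} attempts"
    ) from last_err
```

In practice you would also feed the parse error back into the next prompt ("your previous answer was not valid JSON: ...") so the model can self-correct.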
InstructorRetryException (2 reports): InstructorRetryException in ragas usually stems from rate limits or temporary unavailability of the LLM service used by Instructor. Implement retry logic with exponential backoff within your LLM service calls, and ensure you're adhering to the specific API rate limits outlined in your LLM provider's (e.g., OpenAI) documentation. Consider increasing timeout values or reducing batch sizes if rate limits are still consistently hit after implementing retries.
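Exponential backoff can be sketched as a generic retry helper; the attempt count, base delay, and jitter range below are assumptions to tune against your provider's rate limits, and `fn` stands in for whatever call is being throttled.

```python
import random
import time

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0,
                 retry_on: tuple = (Exception,)):
    """Retry fn with exponential backoff plus a small random jitter.

    Delays grow as base_delay * 2**attempt, so later retries wait longer.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Libraries like tenacity package the same pattern with more options, but the core idea is just this loop.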
OutputParserException (2 reports): OutputParserException in ragas usually occurs when the LLM's output format doesn't match the format the output parser expects (e.g., expecting "text" but receiving "statements"). To fix it, review the parser's expected format in the ragas metric definition (often a Pydantic class) and adjust the LLM prompt instructions and parsing logic so the model consistently returns the required structure. Handle cases where an intermediate output is incorrect, and add error handling with simple repair mechanisms so unexpected formats fail gracefully instead of crashing the run.
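A common, cheap repair mechanism is stripping the markdown code fences some models wrap around JSON before handing the string to the parser. This is a generic sketch of that idea, not code from ragas itself.

```python
import json

def repair_and_parse(raw: str) -> dict:
    """Best-effort repair before parsing: remove markdown code fences
    (``` or ```json) that models sometimes wrap around JSON output."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")       # drop the fence backticks
        if cleaned.lstrip().startswith("json"):
            cleaned = cleaned.lstrip()[4:]  # drop the "json" language tag
    return json.loads(cleaned)
```

Validating the parsed dict against the metric's expected schema (e.g., with a Pydantic model) is the natural next step after a repair like this.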
NewConnectionError (1 report): NewConnectionError in ragas usually arises when the evaluation tries to connect to an external service (like MLflow or OpenAI) and fails due to network issues or the service being unavailable. Ensure the target service (e.g., an MLflow server) is running and accessible from your environment by checking its status and network connectivity; additionally, verify your environment has the necessary permissions to access external resources. If using a proxy, configure the `HTTP_PROXY` and `HTTPS_PROXY` environment variables accordingly.
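The connectivity check above can be done up front with a quick TCP probe, so a long evaluation fails fast with a clear message instead of deep inside a metric call. The host and port are whatever your setup uses (5000 is MLflow's default tracking-server port).

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Quick TCP reachability check for a dependency such as an MLflow
    tracking server; run it before starting an evaluation."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refused connections, timeouts
        return False
```

Typical usage: `assert can_reach("localhost", 5000), "MLflow server unreachable"` at the top of an evaluation script. Note this only proves the port accepts connections, not that the service behind it is healthy.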
Related AI & LLMs Packages
AutoGPT: The vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Ollama: Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
LangChain: 🦜🔗 The platform for reliable agents.
ComfyUI: The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
llama.cpp: LLM inference in C/C++.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.