Evaluations

The evals library is an add-on to the base declarai library that provides tools to track and benchmark the performance of prompt strategies across models and providers.

We understand that a major challenge in prompt engineering is the lack of a standardised way to evaluate prompts, compounded by the continuously evolving nature of the field. As such, we have designed the evals library as a lean wrapper over declarai that lets users easily track and benchmark changes in prompts and models.
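For context, the scenarios the evaluator runs are regular declarai tasks. The snippet below is a minimal sketch of the kind of task behind the generate_a_poem scenarios shown in the results table further down; it assumes declarai's task decorator API (declarai.openai and the @llm.task decorator) with an OPENAI_API_KEY set in the environment, and the function name and docstring are illustrative rather than the exact scenario definitions shipped with the evals library.

```python
import declarai

# Model handle for the provider/model pair under evaluation.
# Assumes OPENAI_API_KEY is available in the environment.
gpt_35 = declarai.openai(model="gpt-3.5-turbo")


@gpt_35.task
def generate_a_poem(title: str) -> str:
    """
    Write a short poem inspired by the given title
    :param title: the title to base the poem on
    :return: the generated poem
    """


print(generate_a_poem(title="Using LLMs is fun!"))
```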

Usage

$ python -m declarai.evals.evaluator
Running Extraction scenarios...
single_value_extraction... 
---> 100%
multi_value_extraction...
---> 100%
multi_value_multi_type_extraction...
---> 100%
...
Done!

Evaluations

The output table, shown below, lets you review the performance of your task across models and providers and make an informed decision about which model and provider to use for your task.

| Provider | Model | Version | Scenario | Runtime | Output |
| --- | --- | --- | --- | --- | --- |
| openai | gpt-3.5-turbo | latest | generate_a_poem_no_metadata | 1.235s | Using LLMs is fun! |
| openai | gpt-3.5-turbo | 0301 | generate_a_poem_no_metadata | 0.891s | Using LLMs is fun! It's like playing with words Creating models that learn And watching them fly like birds |
| openai | gpt-3.5-turbo | 0613 | generate_a_poem_no_metadata | 1.071s | Using LLMs is fun! |
| openai | gpt-4 | latest | generate_a_poem_no_metadata | 3.494s | {'poem': 'Using LLMs, a joyous run,\nIn the world of AI, under the sun.\nWith every task, they stun,\nIndeed, using LLMs is fun!'} |
| openai | gpt-4 | 0613 | generate_a_poem_no_metadata | 4.992s | {'title': 'Using LLMs is fun!', 'poem': "With LLMs, the fun's just begun, \nCoding and learning, second to none. \nComplex tasks become a simple run, \nOh, the joy when the work is done!"} |
| openai | gpt-3.5-turbo | latest | generate_a_poem_only_return_type | 2.1s | Learning with LLMs, a delightful run, Exploring new knowledge, it's never done. With every challenge, we rise and we stun, Using LLMs, the learning is always fun! |
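If you want to reproduce this kind of comparison for your own task without going through the evaluator module, a rough equivalent is to bind the same task definition to each model handle you care about and time the calls yourself. The sketch below reuses the assumed declarai task API from the earlier example; the selection of models and the pipe-separated print-out are illustrative and not part of the evals library.

```python
import time

import declarai


def build_poem_task(llm):
    """Bind the same poem task to a given model handle."""

    @llm.task
    def generate_a_poem(title: str) -> str:
        """
        Write a short poem inspired by the given title
        :param title: the title to base the poem on
        :return: the generated poem
        """

    return generate_a_poem


# Provider/model pairs to compare (illustrative selection).
models = {
    ("openai", "gpt-3.5-turbo"): declarai.openai(model="gpt-3.5-turbo"),
    ("openai", "gpt-4"): declarai.openai(model="gpt-4"),
}

for (provider, model), llm in models.items():
    task = build_poem_task(llm)
    start = time.perf_counter()
    output = task(title="Using LLMs is fun!")
    runtime = time.perf_counter() - start
    print(f"{provider} | {model} | {runtime:.3f}s | {output}")
```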