Evaluations

The evals library is an add-on to the base declarai library that provides tools to track and benchmark the performance of prompt strategies across models and providers.

We understand that a major challenge in prompt engineering is the lack of a standardised way to evaluate prompts, compounded by the continuously evolving nature of the field. As such, we have designed the evals library as a lean wrapper over declarai that lets users easily track and benchmark changes in prompts and models.
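For context, the scenarios the evaluator runs are regular declarai tasks. The snippet below is a minimal sketch of the kind of task behind the generate_a_poem scenarios shown in the results table further down; it assumes declarai's task decorator API (declarai.openai and the @llm.task decorator) with an OPENAI_API_KEY set in the environment, and the function name and docstring are illustrative rather than the exact scenario definitions shipped with the evals library.

```python
import declarai

# Model handle for the provider/model pair under evaluation.
# Assumes OPENAI_API_KEY is available in the environment.
gpt_35 = declarai.openai(model="gpt-3.5-turbo")


@gpt_35.task
def generate_a_poem(title: str) -> str:
    """
    Write a short poem inspired by the given title
    :param title: the title to base the poem on
    :return: the generated poem
    """


print(generate_a_poem(title="Using LLMs is fun!"))
```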

Usage

$ python -m declarai.evals.evaluator
Running Extraction scenarios...
single_value_extraction... 
---> 100%
multi_value_extraction...
---> 100%
multi_value_multi_type_extraction...
---> 100%
...
Done!

Evaluations

The output table, shown below, lets you review the performance of your task across models and providers and make an informed decision about which model and provider to use for your task.

| Provider | Model | Version | Scenario | Runtime | Output |
| --- | --- | --- | --- | --- | --- |
| openai | gpt-3.5-turbo | latest | generate_a_poem_no_metadata | 1.235s | Using LLMs is fun! |
| openai | gpt-3.5-turbo | 0301 | generate_a_poem_no_metadata | 0.891s | Using LLMs is fun! It's like playing with words Creating models that learn And watching them fly like birds |
| openai | gpt-3.5-turbo | 0613 | generate_a_poem_no_metadata | 1.071s | Using LLMs is fun! |
| openai | gpt-4 | latest | generate_a_poem_no_metadata | 3.494s | {'poem': 'Using LLMs, a joyous run,\nIn the world of AI, under the sun.\nWith every task, they stun,\nIndeed, using LLMs is fun!'} |
| openai | gpt-4 | 0613 | generate_a_poem_no_metadata | 4.992s | {'title': 'Using LLMs is fun!', 'poem': "With LLMs, the fun's just begun, \nCoding and learning, second to none. \nComplex tasks become a simple run, \nOh, the joy when the work is done!"} |
| openai | gpt-3.5-turbo | latest | generate_a_poem_only_return_type | 2.1s | Learning with LLMs, a delightful run, Exploring new knowledge, it's never done. With every challenge, we rise and we stun, Using LLMs, the learning is always fun! |
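If you want to reproduce this kind of comparison for your own task without going through the evaluator module, a rough equivalent is to bind the same task definition to each model handle you care about and time the calls yourself. The sketch below reuses the assumed declarai task API from the earlier example; the selection of models and the pipe-separated print-out are illustrative and not part of the evals library.

```python
import time

import declarai


def build_poem_task(llm):
    """Bind the same poem task to a given model handle."""

    @llm.task
    def generate_a_poem(title: str) -> str:
        """
        Write a short poem inspired by the given title
        :param title: the title to base the poem on
        :return: the generated poem
        """

    return generate_a_poem


# Provider/model pairs to compare (illustrative selection).
models = {
    ("openai", "gpt-3.5-turbo"): declarai.openai(model="gpt-3.5-turbo"),
    ("openai", "gpt-4"): declarai.openai(model="gpt-4"),
}

for (provider, model), llm in models.items():
    task = build_poem_task(llm)
    start = time.perf_counter()
    output = task(title="Using LLMs is fun!")
    runtime = time.perf_counter() - start
    print(f"{provider} | {model} | {runtime:.3f}s | {output}")
```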