What problem or use case are you trying to solve?
Just discussed with @xingyaoww over Slack and we'd like to enable benchmarking OpenDevin on GAIA.
Compared to coding-centric benchmarks like SWE-bench, GAIA can provide a more comprehensive view of an agent's ability on general assistance tasks.
Describe the UX of the solution you'd like
Benchmark a model's (agent's) GAIA score through a few simple commands
Do you have thoughts on the technical implementation?
Add image input support for LLMs
litellm already supports image input, so we will need to add the corresponding functionality on the OpenDevin side
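litellm accepts the OpenAI-style "content parts" message format for vision-capable models, so the OpenDevin side mainly needs to build messages in that shape. A minimal sketch (the `make_image_message` helper is hypothetical, not existing OpenDevin code):

```python
import base64


def make_image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build an OpenAI-style multimodal user message: a text part plus an
    image_url part carrying the image as a base64 data URL. litellm passes
    messages in this format through to vision-capable models."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }


# The resulting message could then be handed to litellm, e.g.:
#   litellm.completion(model=..., messages=[make_image_message("Describe this chart.", png_bytes)])
```

Remote images can also be referenced by plain URL in the `image_url` field instead of a data URL; the data-URL form is shown here because GAIA attachments are local files.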
Evaluation utilities
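GAIA scores answers by a normalized ("quasi-exact") match against a single gold string per task. A simplified sketch of what such an evaluation utility could look like (the function names and the exact normalization rules here are illustrative, not GAIA's official scorer):

```python
def normalize_answer(ans: str) -> str:
    """Trim, lowercase, and drop commas so e.g. ' 1,000 ' matches '1000'.
    A simplified stand-in for GAIA's quasi-exact-match normalization."""
    return ans.strip().lower().replace(",", "")


def gaia_score(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold task ids whose normalized prediction equals the
    normalized gold answer. Missing predictions count as wrong."""
    if not gold:
        return 0.0
    correct = sum(
        normalize_answer(predictions.get(task_id, "")) == normalize_answer(answer)
        for task_id, answer in gold.items()
    )
    return correct / len(gold)
```

Keeping scoring as a pure function over `{task_id: answer}` dicts makes it easy to run over agent transcripts after the fact and to report per-difficulty-level breakdowns later.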
I think it's mostly information seeking. Besides, the benchmark covers a wide range of difficulties and other scenarios, so we don't need to worry about getting a 0 score lol