
Support GAIA benchmark #1865

Open
Jiayi-Pan opened this issue May 17, 2024 · 2 comments
Labels
enhancement New feature or request severity:low Minor issues, code cleanup, etc

Comments

@Jiayi-Pan
Contributor

What problem or use case are you trying to solve?
Just discussed with @xingyaoww over Slack, and we'd like to enable benchmarking OpenDevin on GAIA.
Compared to coding-centric benchmarks like SWE-bench, GAIA can provide a more comprehensive view of an agent's ability on general-assistance tasks.

Describe the UX of the solution you'd like
Benchmark a model's (agent's) GAIA score with a few simple commands.

Do you have thoughts on the technical implementation?

  • Add image input support for LLMs
    • litellm already supports image input; we will need to add the corresponding functionality on the OpenDevin side
  • Evaluation utilities
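For the image-input point above, litellm accepts OpenAI-style multimodal messages, where an image is passed as a base64 data URL alongside the text prompt. A minimal sketch of a helper that builds such a message (the helper name and PNG assumption are illustrative, not part of any existing OpenDevin API):

```python
import base64


def build_image_message(prompt: str, image_path: str) -> dict:
    """Build an OpenAI-style multimodal user message that litellm accepts.

    Assumes a PNG image; litellm forwards this content format to
    providers that support vision input.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }
```

The returned dict can then be passed in the `messages` list to `litellm.completion(model=..., messages=[...])` for a vision-capable model.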
@Jiayi-Pan Jiayi-Pan added the enhancement New feature or request label May 17, 2024
@frankxu2004
Collaborator

What baseline agent do we want to test? I wonder what level of browsing capability is required.

@Jiayi-Pan
Contributor Author

I think it's mostly information seeking. Besides, the benchmark covers a wide range of difficulties and other scenarios, so we don't need to worry about getting a 0 score lol
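On the evaluation-utilities side, GAIA answers are short strings or numbers scored by exact match after normalization. A minimal sketch of such a scorer, assuming simple lowercase/punctuation/article normalization (GAIA's official scorer additionally handles numbers and comma-separated lists):

```python
import re


def normalize(answer: str) -> str:
    """Normalize a short answer: lowercase, strip punctuation and articles.

    This is a common normalization for short-answer benchmarks, not
    GAIA's official implementation.
    """
    answer = answer.lower().strip()
    answer = re.sub(r"[^\w\s]", "", answer)           # drop punctuation
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)   # drop articles
    return " ".join(answer.split())                   # collapse whitespace


def exact_match(prediction: str, gold: str) -> bool:
    """True if prediction matches the gold answer after normalization."""
    return normalize(prediction) == normalize(gold)
```

A per-split accuracy would then just be the mean of `exact_match` over the benchmark's (prediction, gold) pairs.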

@rbren rbren added the severity:low Minor issues, code cleanup, etc label May 18, 2024