
Support GAIA benchmark #1865

Open
Jiayi-Pan opened this issue May 17, 2024 · 2 comments
Labels
enhancement New feature or request severity:low Minor issues, code cleanup, etc

Comments

@Jiayi-Pan
Contributor

What problem or use case are you trying to solve?
Just discussed with @xingyaoww over Slack, and we'd like to enable benchmarking OpenDevin on GAIA.
Compared to coding-centric benchmarks like SWE-bench, GAIA can provide a more comprehensive view of an agent's ability on general-assistance tasks.

Describe the UX of the solution you'd like
Benchmark a model's (agent's) GAIA score with a few simple commands.

Do you have thoughts on the technical implementation?

  • Add image input support for LLMs
    • litellm already supports image input; we will need to add the corresponding functionality on the OpenDevin side
  • Evaluation utilities
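For the image-input point above, litellm accepts OpenAI-style multimodal messages, where an image is passed as a base64 data URL alongside the text prompt. A minimal sketch of a helper that builds such a message (the helper name and PNG assumption are illustrative, not part of any existing OpenDevin API):

```python
import base64


def build_image_message(prompt: str, image_path: str) -> dict:
    """Build an OpenAI-style multimodal user message that litellm accepts.

    Assumes a PNG image; litellm forwards this content format to
    providers that support vision input.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }
```

The returned dict can then be passed in the `messages` list to `litellm.completion(model=..., messages=[...])` for a vision-capable model.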
@Jiayi-Pan Jiayi-Pan added the enhancement New feature or request label May 17, 2024
@frankxu2004
Collaborator

What baseline agent do we want to test? I wonder what level of browsing capability is required.

@Jiayi-Pan
Contributor Author

I think it's mostly information seeking. Besides, the benchmark covers a wide range of difficulties and other scenarios, so we don't need to worry about getting a 0 score lol
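On the evaluation-utilities side, GAIA answers are short strings or numbers scored by exact match after normalization. A minimal sketch of such a scorer, assuming simple lowercase/punctuation/article normalization (GAIA's official scorer additionally handles numbers and comma-separated lists):

```python
import re


def normalize(answer: str) -> str:
    """Normalize a short answer: lowercase, strip punctuation and articles.

    This is a common normalization for short-answer benchmarks, not
    GAIA's official implementation.
    """
    answer = answer.lower().strip()
    answer = re.sub(r"[^\w\s]", "", answer)           # drop punctuation
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)   # drop articles
    return " ".join(answer.split())                   # collapse whitespace


def exact_match(prediction: str, gold: str) -> bool:
    """True if prediction matches the gold answer after normalization."""
    return normalize(prediction) == normalize(gold)
```

A per-split accuracy would then just be the mean of `exact_match` over the benchmark's (prediction, gold) pairs.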

@rbren rbren added the severity:low Minor issues, code cleanup, etc label May 18, 2024