Unexpected Increase in Inference Time as Context Window Grows on Llama3:7b #4277

Open
gusanmaz opened this issue May 9, 2024 · 2 comments
Labels: bug (Something isn't working)

@gusanmaz (Contributor) commented May 9, 2024

What is the issue?

I am doing some benchmarks on RAG using the llama3:7b model on Ollama.

I first ask a question directly to the model, then ask the same question while providing context from the relevant documents and instructing the model to answer based on that context; essentially, asking the question the RAG way without exceeding the model's context window. As expected, the first query is a single sentence, while the second query is many sentences long.
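Roughly, each timed query looks like the sketch below (a minimal example against Ollama's /api/generate endpoint; the model tag, prompts, and prompt template here are placeholders rather than the exact ones used in the benchmark):

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def timed_query(prompt: str, model: str = "llama3") -> float:
    """Send one non-streaming generate request and return the wall-clock time in ms."""
    start = time.perf_counter()
    resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt, "stream": False})
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000


question = "..."   # one of the 14 benchmark questions (placeholder)
context = "..."    # retrieved document chunks for that question (placeholder)

direct_ms = timed_query(question)
rag_ms = timed_query(
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(f"direct: {direct_ms:.0f} ms, RAG: {rag_ms:.0f} ms")
```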

I asked 14 questions (so 14 direct queries and 14 RAG queries in total per machine); the benchmark results are below:

| Machine Type | CPU | RAM (GB) | Graphics Card | OS | Direct Question - Short Context (ms) | RAG Question - Long Context (ms) |
|---|---|---|---|---|---|---|
| Mac Mini | Apple Silicon M2 Pro | 16 | | macOS 14.2.1 | 61152 | 105998 |
| Laptop | AMD Ryzen 9 5900HX (16) @ 4.680GHz | 32 | NVIDIA GeForce RTX 3050 Mobile, AMD ATI Cezanne | Pop!_OS 22.04 LTS | 413264 | 1052304 |
| Desktop | 11th Gen Intel i5-11400F (12) @ 4.400GHz | 64 | NVIDIA GeForce RTX 3060 Lite Hash Rate | Pop!_OS 22.04 LTS | 114599 | 152341 |

I use Ollama version 0.1.34. As far as I know, inference time shouldn't change significantly as the query context grows for LLMs. I am particularly surprised to see more than a 2.5x increase in inference time on my laptop. I haven't run this benchmark on any model other than Llama3.
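One way to check where the extra time goes is to look at the per-phase timings in a non-streaming /api/generate response; a rough sketch, assuming the total_duration, prompt_eval_duration, and eval_duration fields that Ollama reports (in nanoseconds):

```python
import requests

# Rough check of where the time is spent for a single query. Durations reported
# by Ollama are in nanoseconds; prompt_eval_* covers prompt processing and
# eval_* covers token generation. The prompt here is just a placeholder.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "<direct or RAG prompt>", "stream": False},
).json()

for key in ("total_duration", "prompt_eval_duration", "eval_duration"):
    print(f"{key}: {resp.get(key, 0) / 1e6:.0f} ms")
print("prompt tokens:", resp.get("prompt_eval_count"), "| generated tokens:", resp.get("eval_count"))
```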

I wonder if something is wrong with Ollama or if these benchmark results I am getting are normal.

Thanks!

OS: Linux
GPU: Nvidia
CPU: AMD
Ollama version: 0.1.34

@gusanmaz added the bug (Something isn't working) label on May 9, 2024
@igorschlum commented

Hi @gusanmaz, could you provide a link with a script to run, so I can run the benchmark on my Mac? If it's slow on Windows, some improvements have been made in the Windows version 0.1.36 of Ollama; it would be interesting to rerun your tests with this latest version.

@gusanmaz (Contributor, Author) commented

Hi @igorschlum, thank you for your kind help.

The code I'm running is hosted at https://github.com/Jet-Engine/rag_art_deco.

First, you need to execute indexing.py to index files for RAG, and then chat.py for benchmarking. The benchmark results can be found in the answers.html file generated inside the evaluation folder.

You may remove gpt-4 and groq-llama3-70b from the line `selected_models = ["gpt-4", "ollama-llama3", "groq-llama3-70b"]` in chat.py.
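For example, to benchmark only the local Ollama model, that line would become:

```python
selected_models = ["ollama-llama3"]
```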

The README file, which also serves as a blog post, explains the code in detail, but I wanted to provide a summary of what needs to be done for benchmarking.
