I am doing some benchmarks on RAG using the llama3:7b model on Ollama.
I first ask a question directly to the model, then ask the same question along with context from relevant documents, instructing the model to answer based on the given context; essentially, asking the question the RAG way without exceeding the model's context window. As expected, the first query is one sentence long and the second query is many sentences long.
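Roughly, each query is timed like the sketch below (this is a simplification, not the exact benchmark code; the prompt wording and model tag are placeholders):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def timed_generate(model: str, prompt: str) -> dict:
    """Send one non-streaming request and return Ollama's reported timings (ms)."""
    resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt, "stream": False})
    resp.raise_for_status()
    data = resp.json()
    # Ollama reports durations in nanoseconds.
    return {
        "total_ms": data["total_duration"] / 1e6,
        "prompt_eval_ms": data.get("prompt_eval_duration", 0) / 1e6,  # prompt processing
        "eval_ms": data.get("eval_duration", 0) / 1e6,                # token generation
    }


question = "..."  # one of the 14 benchmark questions (placeholder)
context = "..."   # retrieved document chunks for that question (placeholder)

direct = timed_generate("llama3", question)
rag = timed_generate(
    "llama3",
    f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}",
)
print("direct:", direct)
print("rag:   ", rag)
```

Comparing prompt_eval_ms and eval_ms separately shows how much of the RAG slowdown comes from processing the longer prompt versus generating the answer.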
I asked 14 questions (14 direct queries plus 14 RAG queries per machine) and the benchmark results are shown below:
| Machine Type | CPU | RAM (GB) | Graphics Card | OS | Direct Question - Short Context (ms) | RAG Question - Long Context (ms) |
|---|---|---|---|---|---|---|
| Mac Mini | Apple Silicon M2 Pro | 16 | | macOS 14.2.1 | 61152 | 105998 |
| Laptop | AMD Ryzen 9 5900HX (16) @ 4.680GHz | 32 | NVIDIA GeForce RTX 3050 Mobile, AMD ATI Cezanne | Pop!_OS 22.04 LTS | 413264 | 1052304 |
| Desktop | 11th Gen Intel i5-11400F (12) @ 4.400GHz | 64 | NVIDIA GeForce RTX 3060 Lite Hash Rate | Pop!_OS 22.04 LTS | 114599 | 152341 |
I use Ollama version 0.1.34. As far as I know, inference time doesn't change significantly as the query context grows for LLMs. I am particularly surprised to see more than a 2.5x increase in inference time on my laptop. I haven't run this benchmark on a model other than Llama3.
I wonder if something is wrong with Ollama or if these benchmark results I am getting are normal.
Thanks!
OS: Linux
GPU: Nvidia
CPU: AMD
Ollama version: 0.1.34
Hi @gusanmaz, could you provide a link to a script I can run, so I can benchmark on my Mac? Also, if it's slow on Windows: some improvements were made in the Windows version of Ollama 0.1.36, so it could be interesting to rerun your tests with that latest version.
First, you need to execute `indexing.py` to index the files for RAG, and then `chat.py` for benchmarking. The benchmark results can be found in the `answers.html` file generated inside the `evaluation` folder.
You may remove `gpt-4` and `groq-llama3-70b` from the line `selected_models = ["gpt-4", "ollama-llama3", "groq-llama3-70b"]` in `chat.py`.
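For example, to benchmark only the Ollama-backed model, that line would look roughly like this (model identifiers as they appear in `chat.py`):

```python
# In chat.py: keep only the Ollama-backed Llama3 entry for this benchmark;
# "gpt-4" and "groq-llama3-70b" are the entries removed.
selected_models = ["ollama-llama3"]
```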
The README file, which also serves as a blog post, explains the code in detail, but I wanted to provide a summary of what needs to be done for benchmarking.