I am doing some benchmarks on RAG using the llama3:7b model on Ollama.
I first ask a question directly to the model, then ask the same question along with context from relevant documents, instructing the model to answer based on the given context; essentially, asking the question the RAG way without exceeding the model's context window. As expected, the first query is one sentence long and the second query is many sentences long.
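Roughly, each query is timed like the sketch below (this is a simplification, not the exact benchmark code; the prompt wording and model tag are placeholders):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def timed_generate(model: str, prompt: str) -> dict:
    """Send one non-streaming request and return Ollama's reported timings (ms)."""
    resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt, "stream": False})
    resp.raise_for_status()
    data = resp.json()
    # Ollama reports durations in nanoseconds.
    return {
        "total_ms": data["total_duration"] / 1e6,
        "prompt_eval_ms": data.get("prompt_eval_duration", 0) / 1e6,  # prompt processing
        "eval_ms": data.get("eval_duration", 0) / 1e6,                # token generation
    }


question = "..."  # one of the 14 benchmark questions (placeholder)
context = "..."   # retrieved document chunks for that question (placeholder)

direct = timed_generate("llama3", question)
rag = timed_generate(
    "llama3",
    f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}",
)
print("direct:", direct)
print("rag:   ", rag)
```

Comparing prompt_eval_ms and eval_ms separately shows how much of the RAG slowdown comes from processing the longer prompt versus generating the answer.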
I asked 14 questions (14 direct queries plus 14 RAG queries per machine) and the benchmark results are shown below:
| Machine Type | CPU | RAM (GB) | Graphics Card | OS | Direct Question - Short Context (ms) | RAG Question - Long Context (ms) |
|---|---|---|---|---|---|---|
| Mac Mini | Apple Silicon M2 Pro | 16 | | macOS 14.2.1 | 61152 | 105998 |
| Laptop | AMD Ryzen 9 5900HX (16) @ 4.680GHz | 32 | NVIDIA GeForce RTX 3050 Mobile, AMD ATI Cezanne | Pop!_OS 22.04 LTS | 413264 | 1052304 |
| Desktop | 11th Gen Intel i5-11400F (12) @ 4.400GHz | 64 | NVIDIA GeForce RTX 3060 Lite Hash Rate | Pop!_OS 22.04 LTS | 114599 | 152341 |
I use Ollama version 0.1.34. As far as I know, inference time doesn't change significantly as the query context grows for LLMs. I am particularly surprised to see more than a 2.5x increase in inference time on my laptop. I haven't run this benchmark on a model other than Llama3.
I wonder if something is wrong with Ollama or if these benchmark results I am getting are normal.
Thanks!
OS: Linux
GPU: Nvidia
CPU: AMD
Ollama version: 0.1.34
Hi @gusanmaz, could you provide a link to a script I can run, so I can benchmark on my Mac? Also, if it's slow on Windows: some improvements were made in the Windows version of Ollama 0.1.36, so it could be interesting to rerun your tests with that latest version.
First, you need to execute `indexing.py` to index the files for RAG, and then `chat.py` for benchmarking. The benchmark results can be found in the `answers.html` file generated inside the `evaluation` folder.
You may remove `gpt-4` and `groq-llama3-70b` from the line `selected_models = ["gpt-4", "ollama-llama3", "groq-llama3-70b"]` in `chat.py`.
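For example, to benchmark only the Ollama-backed model, that line would look roughly like this (model identifiers as they appear in `chat.py`):

```python
# In chat.py: keep only the Ollama-backed Llama3 entry for this benchmark;
# "gpt-4" and "groq-llama3-70b" are the entries removed.
selected_models = ["ollama-llama3"]
```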
The README file, which also serves as a blog post, explains the code in detail, but I wanted to provide a summary of what needs to be done for benchmarking.