CUDA: quantized KV cache demo #7412

Draft · wants to merge 8 commits into master

Conversation

JohannesGaessler (Collaborator) commented May 20, 2024

This PR adds a simple implementation of a quantized KV cache for research purposes only. The goal is not to provide an implementation that could be merged or that is suitable for regular use, but to provide a minimal implementation for doing perplexity calculations with CUDA. The purpose is to investigate the impact of a quantized KV cache on generation quality compared to the impact of quantized weights. Presumably not all quantization formats/combinations make sense to actually use; knowing which ones do is relevant for cutting down the significant compilation time that would result from compiling 36 different kernel versions (one per K/V type combination) to accommodate all of the current quantization combinations.

Edit: this PR needs to be compiled with LLAMA_CUDA_F16=1.
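For reference, a rough sketch of how one might reproduce numbers like the ones below, assuming the usual Makefile build and that the cache types are selected with the existing `-ctk`/`-ctv` options; model and dataset paths are placeholders, and the KLD flags are the ones provided by the perplexity example on master:

```sh
# Build with CUDA and FP16 math enabled (required for this PR):
make clean
make LLAMA_CUDA=1 LLAMA_CUDA_F16=1 perplexity

# 1) Save the FP16 reference logits:
./perplexity -m models/model-f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99 \
    --kl-divergence-base logits-f16.bin

# 2) Rerun with a quantized KV cache (here: q8_0 K cache, q4_0 V cache) and compare:
./perplexity -m models/model-f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99 \
    -ctk q8_0 -ctv q4_0 --kl-divergence-base logits-f16.bin --kl-divergence
```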

@JohannesGaessler JohannesGaessler added research 🔬 demo Demonstrate some concept or idea, not intended to be merged labels May 20, 2024
@mofosyne mofosyne added the Review Complexity : High Generally require indepth knowledge of LLMs or GPUs label May 20, 2024
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 20, 2024
JohannesGaessler (Collaborator, Author) commented May 20, 2024

Some first results:

| Quantization | KV BPV | imatrix | PPL | ΔPPL | KLD | Mean Δp | RMS Δp |
|---|---|---|---|---|---|---|---|
| f16/f16/f16 | 16.00 | None | 6.232196 ± 0.037873 | 0.002443 ± 0.000616 | 0.000189 ± 0.000001 | -0.002 ± 0.001 % | 0.476 ± 0.002 % |
| f16/f16/q8_0 | 12.25 | None | 6.232771 ± 0.037877 | 0.003019 ± 0.000826 | 0.000743 ± 0.000003 | -0.000 ± 0.002 % | 0.905 ± 0.004 % |
| f16/q8_0/f16 | 12.25 | None | 6.233073 ± 0.037878 | 0.003321 ± 0.000897 | 0.000944 ± 0.000004 | -0.005 ± 0.003 % | 1.010 ± 0.005 % |
| f16/q8_0/q8_0 | 8.50 | None | 6.234369 ± 0.037887 | 0.004616 ± 0.000909 | 0.000980 ± 0.000004 | -0.007 ± 0.003 % | 1.032 ± 0.005 % |
| q8_0/f16/f16 | 16.00 | None | 6.234640 ± 0.037881 | 0.004888 ± 0.001012 | 0.001363 ± 0.000006 | -0.022 ± 0.003 % | 1.200 ± 0.007 % |
| q8_0/q8_0/q8_0 | 8.50 | None | 6.234619 ± 0.037877 | 0.004866 ± 0.001070 | 0.001552 ± 0.000007 | -0.027 ± 0.003 % | 1.271 ± 0.007 % |
| f16/f16/q5_1 | 11.00 | None | 6.236713 ± 0.037901 | 0.006960 ± 0.001132 | 0.001759 ± 0.000008 | -0.021 ± 0.004 % | 1.342 ± 0.008 % |
| f16/q8_0/q5_1 | 7.25 | None | 6.237894 ± 0.037910 | 0.008142 ± 0.001173 | 0.001899 ± 0.000009 | -0.027 ± 0.004 % | 1.392 ± 0.008 % |
| f16/f16/q5_0 | 10.75 | None | 6.239508 ± 0.037916 | 0.009755 ± 0.001258 | 0.002241 ± 0.000012 | -0.037 ± 0.004 % | 1.506 ± 0.010 % |
| f16/q8_0/q5_0 | 7.00 | None | 6.240379 ± 0.037923 | 0.010627 ± 0.001289 | 0.002364 ± 0.000013 | -0.044 ± 0.004 % | 1.540 ± 0.012 % |
| f16/f16/q4_1 | 10.50 | None | 6.248752 ± 0.038001 | 0.019000 ± 0.001691 | 0.004318 ± 0.000033 | -0.060 ± 0.005 % | 2.031 ± 0.020 % |
| f16/q8_0/q4_1 | 6.75 | None | 6.250352 ± 0.038010 | 0.020599 ± 0.001708 | 0.004423 ± 0.000030 | -0.065 ± 0.005 % | 2.042 ± 0.017 % |
| f16/f16/q4_0 | 10.25 | None | 6.251781 ± 0.037997 | 0.022028 ± 0.001813 | 0.004988 ± 0.000029 | -0.099 ± 0.006 % | 2.116 ± 0.015 % |
| f16/q8_0/q4_0 | 6.50 | None | 6.254110 ± 0.038007 | 0.024357 ± 0.001814 | 0.005079 ± 0.000032 | -0.107 ± 0.006 % | 2.148 ± 0.019 % |
| f16/q5_1/f16 | 11.00 | None | 6.254993 ± 0.038030 | 0.025241 ± 0.001921 | 0.005423 ± 0.000048 | -0.099 ± 0.006 % | 2.254 ± 0.022 % |
| q6_K/f16/f16 | 16.00 | None | 6.251298 ± 0.038063 | 0.021545 ± 0.001854 | 0.005460 ± 0.000036 | -0.003 ± 0.006 % | 2.290 ± 0.020 % |
| f16/q5_1/q8_0 | 7.25 | None | 6.253645 ± 0.038008 | 0.023893 ± 0.001909 | 0.005438 ± 0.000045 | -0.104 ± 0.006 % | 2.279 ± 0.026 % |
| q6_K/q8_0/q8_0 | 8.50 | None | 6.253788 ± 0.038082 | 0.024035 ± 0.001880 | 0.005623 ± 0.000037 | -0.012 ± 0.006 % | 2.327 ± 0.020 % |
| f16/q5_1/q5_1 | 6.00 | None | 6.259012 ± 0.038045 | 0.029259 ± 0.002024 | 0.006096 ± 0.000047 | -0.124 ± 0.006 % | 2.358 ± 0.021 % |
| f16/q5_1/q5_0 | 5.75 | None | 6.260327 ± 0.038050 | 0.030574 ± 0.002078 | 0.006475 ± 0.000055 | -0.139 ± 0.006 % | 2.433 ± 0.023 % |
| f16/q5_1/q4_1 | 5.50 | None | 6.271945 ± 0.038166 | 0.042192 ± 0.002356 | 0.008395 ± 0.000063 | -0.154 ± 0.007 % | 2.763 ± 0.026 % |
| f16/q5_0/f16 | 10.75 | None | 6.272005 ± 0.038118 | 0.042253 ± 0.002421 | 0.008869 ± 0.000078 | -0.205 ± 0.008 % | 2.856 ± 0.028 % |
| f16/q5_0/q8_0 | 7.00 | None | 6.274423 ± 0.038133 | 0.044670 ± 0.002453 | 0.009001 ± 0.000078 | -0.215 ± 0.008 % | 2.913 ± 0.030 % |
| f16/q5_1/q4_0 | 5.25 | None | 6.275286 ± 0.038149 | 0.045533 ± 0.002449 | 0.009055 ± 0.000066 | -0.208 ± 0.007 % | 2.837 ± 0.027 % |
| f16/q5_0/q5_1 | 5.75 | None | 6.279304 ± 0.038158 | 0.049551 ± 0.002525 | 0.009632 ± 0.000084 | -0.241 ± 0.008 % | 2.974 ± 0.028 % |
| f16/q5_0/q5_0 | 5.50 | None | 6.278888 ± 0.038159 | 0.049135 ± 0.002559 | 0.009930 ± 0.000083 | -0.238 ± 0.008 % | 3.009 ± 0.028 % |
| q5_K_M/f16/f16 | 16.00 | None | 6.287391 ± 0.038329 | 0.057638 ± 0.002606 | 0.010767 ± 0.000079 | -0.114 ± 0.008 % | 3.165 ± 0.031 % |
| f16/q5_0/q4_1 | 5.25 | None | 6.290780 ± 0.038261 | 0.061028 ± 0.002822 | 0.011849 ± 0.000095 | -0.270 ± 0.009 % | 3.296 ± 0.032 % |
| f16/q5_0/q4_0 | 5.00 | None | 6.293121 ± 0.038241 | 0.063368 ± 0.002886 | 0.012683 ± 0.000103 | -0.317 ± 0.009 % | 3.390 ± 0.031 % |
| f16/q4_1/f16 | 10.50 | None | 6.325776 ± 0.038465 | 0.096023 ± 0.003449 | 0.017747 ± 0.000139 | -0.437 ± 0.011 % | 4.008 ± 0.036 % |
| f16/q4_1/q8_0 | 6.75 | None | 6.327584 ± 0.038477 | 0.097832 ± 0.003511 | 0.018022 ± 0.000134 | -0.439 ± 0.011 % | 4.063 ± 0.038 % |
| q5_1/f16/f16 | 16.00 | None | 6.336648 ± 0.038665 | 0.106895 ± 0.003475 | 0.018051 ± 0.000139 | -0.289 ± 0.011 % | 4.126 ± 0.039 % |
| f16/q4_1/q5_1 | 5.50 | None | 6.327340 ± 0.038474 | 0.097588 ± 0.003547 | 0.018477 ± 0.000137 | -0.450 ± 0.011 % | 4.074 ± 0.036 % |
| f16/q4_1/q5_0 | 5.25 | None | 6.332854 ± 0.038500 | 0.103101 ± 0.003574 | 0.018969 ± 0.000139 | -0.479 ± 0.011 % | 4.124 ± 0.038 % |
| f16/q4_1/q4_1 | 5.00 | None | 6.342528 ± 0.038601 | 0.112775 ± 0.003757 | 0.020712 ± 0.000155 | -0.489 ± 0.011 % | 4.294 ± 0.038 % |
| f16/q4_1/q4_0 | 4.75 | None | 6.347973 ± 0.038614 | 0.118220 ± 0.003853 | 0.021499 ± 0.000164 | -0.540 ± 0.011 % | 4.374 ± 0.038 % |
| q5_0/f16/f16 | 16.00 | None | 6.358436 ± 0.038827 | 0.128683 ± 0.003879 | 0.022141 ± 0.000163 | -0.405 ± 0.012 % | 4.623 ± 0.042 % |
| q4_K_M/f16/f16 | 16.00 | None | 6.406440 ± 0.039110 | 0.176688 ± 0.004629 | 0.031280 ± 0.000238 | -0.598 ± 0.014 % | 5.524 ± 0.050 % |
| f16/q4_0/q8_0 | 6.50 | None | 6.418148 ± 0.038825 | 0.188396 ± 0.004739 | 0.032763 ± 0.000242 | -1.076 ± 0.014 % | 5.573 ± 0.046 % |
| f16/q4_0/f16 | 10.25 | None | 6.420736 ± 0.038855 | 0.190983 ± 0.004753 | 0.032916 ± 0.000243 | -1.079 ± 0.015 % | 5.606 ± 0.047 % |
| f16/q4_0/q5_1 | 5.25 | None | 6.417232 ± 0.038807 | 0.187479 ± 0.004827 | 0.033626 ± 0.000257 | -1.089 ± 0.015 % | 5.665 ± 0.047 % |
| f16/q4_0/q5_0 | 5.00 | None | 6.422184 ± 0.038848 | 0.192432 ± 0.004841 | 0.033971 ± 0.000246 | -1.102 ± 0.015 % | 5.678 ± 0.048 % |
| f16/q4_0/q4_1 | 4.75 | None | 6.437401 ± 0.038984 | 0.207648 ± 0.004999 | 0.035734 ± 0.000259 | -1.122 ± 0.015 % | 5.840 ± 0.047 % |
| f16/q4_0/q4_0 | 4.50 | None | 6.438421 ± 0.038920 | 0.208668 ± 0.005009 | 0.036509 ± 0.000254 | -1.195 ± 0.015 % | 5.875 ± 0.047 % |
| q4_1/f16/f16 | 16.00 | None | 6.680320 ± 0.041266 | 0.450567 ± 0.008029 | 0.071564 ± 0.000503 | -1.333 ± 0.022 % | 8.506 ± 0.063 % |
| q4_0/f16/f16 | 16.00 | None | 6.694689 ± 0.041182 | 0.464937 ± 0.007925 | 0.071732 ± 0.000489 | -1.580 ± 0.022 % | 8.419 ± 0.061 % |

Results are sorted by KL divergence. The quantization column is to be read as `<weight type>/<K type>/<V type>`. BPV = bits per value of the KV cache (the weights are not included in this number).
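For context, the KV BPV numbers follow directly from ggml's block layouts: q8_0 stores 32 values as int8 plus one FP16 scale, i.e. (32·8 + 16)/32 = 8.5 BPV; q5_0 stores 4-bit quants plus one extra high bit per value and an FP16 scale, i.e. (32·4 + 32 + 16)/32 = 5.5 BPV; q4_0 is (32·4 + 16)/32 = 4.5 BPV. The column is the average over the K and V caches, so e.g. K=q8_0 with V=q4_0 gives (8.5 + 4.5)/2 = 6.50 BPV.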

The K cache seems to be much more sensitive to quantization than the V cache. However, the weights still seem to be the most sensitive of all: using q4_0 for the V cache and FP16 for everything else is more precise than using q6_K weights with an FP16 KV cache, and a 6.5 bit per value KV cache with q8_0 for the K cache and q4_0 for the V cache also seems to be more precise than q6_K weights. There seems to be no significant quality loss from using q8_0 instead of FP16 for the KV cache.

jukofyork (Contributor):

> There seems to be no significant quality loss from using q8_0 instead of FP16 for the KV cache.

Is there any measurable drop in tokens/s from this?

JohannesGaessler (Collaborator, Author):

With the implementation in this PR the performance is much worse; my goal was not to get good performance but to determine how the quality is affected. For this PR the performance only needs to be good enough to do perplexity calculations in a reasonable time frame.

In principle, given enough optimization, a quantized KV cache should be faster than an FP16 KV cache, both because less I/O is needed and because int8 operations are faster than FP16 operations. However, in the medium term quantized KV caches will be slower than FP16 on GPUs with tensor cores: I will first need to read up on how to utilize tensor cores via PTX (instead of nvcuda::wmma), otherwise the performance will be worse than without tensor cores.
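To illustrate the int8 point, here is a minimal sketch of the general technique (similar in spirit to the existing quantized mat-vec kernels on master, not the code added in this PR; the struct and function names are illustrative): with both the K cache and the quantized activations stored as q8_0-style blocks, 32 int8 products cost eight `__dp4a` instructions plus a single floating-point multiply for the two block scales.

```cuda
// Minimal sketch, assuming a q8_0-style block layout (FP16 scale + 32 int8 values).
// Requires sm_61+ for __dp4a, e.g. nvcc -arch=sm_61.
#include <cstring>
#include <cuda_fp16.h>

#define QK8_0 32

struct block_q8_0 {
    half   d;          // per-block scale
    int8_t qs[QK8_0];  // 32 quantized values
};

// Dot product of two q8_0 blocks: 32 int8 multiply-adds via 8x __dp4a,
// then a single FP32 multiply to apply the two block scales.
__device__ float vec_dot_q8_0(const block_q8_0 * k, const block_q8_0 * q) {
    int sumi = 0;
    for (int i = 0; i < QK8_0; i += 4) {
        int ki, qi;
        memcpy(&ki, k->qs + i, sizeof(int)); // memcpy avoids misaligned 32-bit loads
        memcpy(&qi, q->qs + i, sizeof(int));
        sumi = __dp4a(ki, qi, sumi);         // 4 int8 multiply-adds per instruction
    }
    return __half2float(k->d) * __half2float(q->d) * (float) sumi;
}

// Toy kernel: one thread per block pair, just to keep the sketch self-contained.
__global__ void dot_q8_0(const block_q8_0 * K, const block_q8_0 * Q, float * dst, int n_blocks) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n_blocks) {
        dst[i] = vec_dot_q8_0(&K[i], &Q[i]);
    }
}
```

An FP16 path instead has to load and multiply the full 16-bit operands, which is where the I/O and throughput advantage of a quantized cache would come from once the kernels are actually optimized.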

bartowski1182 (Contributor):

If someone's looking to contribute to the research, a NIHS or NINS before/after comparison would be an interesting test. Much like PPL, the results of the test by itself are only kind of useful, but as a comparison between quantized and unquantized they would be really useful as a metric.

JohannesGaessler (Collaborator, Author):

What do you mean by NIHS and NINS?

bartowski1182 (Contributor):

Sorry lol, contextless acronyms are always a bad call.

Needle in haystack (NIHS) and needle in needlestack (NINS).
