Releases: Mozilla-Ocho/llamafile

llamafile v0.7

31 Mar 04:21
c7780c4

llamafile lets you distribute and run LLMs with a single file

[line drawing of llama animal head in front of slightly open manila folder filled with files]

This release improves the performance and accuracy of both CPU and GPU computation, and includes a security fix.

  • tinyBLAS now gives outputs consistent with cuBLAS, thanks to Kahan summation on matvec ops (see the sketch after this list). This is good news for Windows users, because llamafile releases bundle tinyBLAS DLLs for driver-only GPU support. That support is now faster and more accurate than before, reducing the need to install the CUDA / ROCm SDKs yourself.
  • Prompt evaluation now goes much faster on CPU. For example, f16 weights on Raspberry Pi 5 are now 8x faster. These new optimizations mostly apply to F16, BF16, Q8_0, Q4_0, and F32 weights. Depending on the hardware and weights being used, we've observed llamafile-0.7 going anywhere from 30% to 500% faster than llama.cpp upstream.
  • Support for bf16, the Google Brain floating point format, has been introduced for CPU only.
  • Support for AVX512 has been introduced. Owners of CPUs like Zen4 can expect to see 10x faster prompt eval times.
  • If you want to run llamafile-0.7 [...] --recompile --gpu amd on Windows, this release requires version 5.7+ of the ROCm HIP SDK, which may be downloaded here.
  • This release includes a security fix for CVE-2024-23496 (see #294).
  • This release is synced with llama.cpp 2024-03-22 upstream.
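
As a quick illustration of the numerical ideas above, here is a minimal Python
sketch of Kahan-compensated summation applied to a dot product, and of widening
bf16 to f32. It is illustrative only and is not llamafile's actual kernel code,
which lives in the C/C++ tinyBLAS and ggml sources.

    import struct

    def kahan_dot(a, b):
        """Dot product accumulated with Kahan (compensated) summation."""
        total = 0.0
        comp = 0.0  # running compensation for low-order bits lost to rounding
        for x, y in zip(a, b):
            term = x * y - comp
            t = total + term
            comp = (t - total) - term  # the part of `term` that `t` failed to absorb
            total = t
        return total

    def bf16_to_f32(bits: int) -> float:
        """bf16 is the top 16 bits of an IEEE-754 float32, so widening is a shift."""
        return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

    # Example: a small dot product, and decoding the bf16 bit pattern for 1.0.
    print(kahan_dot([1e8, 1.0, -1e8], [1.0, 1.0, 1.0]))  # 1.0
    print(bf16_to_f32(0x3F80))                            # 1.0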

llamafile v0.6.2

27 Jan 21:35
d4c602d

This release synchronizes with llama.cpp upstream and polishes GPU
auto-configuration. Support for splitting a model onto multiple NVIDIA
GPUs has been restored.

  • dfd3335 Synchronize with llama.cpp 2024-01-27
  • c008e43 Synchronize with llama.cpp 2024-01-26
  • e34b35c Make GPU auto configuration more resilient
  • 79b88f8 Sanitize -ngl flag on Apple Metal

There's a known issue with splitting a model onto multiple AMD GPUs,
which currently doesn't work. This is an upstream issue we're working to
solve. The workaround is to set export HIP_VISIBLE_DEVICES=0 in your
environment when running llamafile, so it only sees the first GPU.
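
If you launch llamafile from a script rather than an interactive shell, the
workaround looks something like the Python sketch below. The binary name,
model file, and -ngl value are placeholders for your own setup.

    import os
    import subprocess

    # Make only the first AMD GPU visible to ROCm before starting llamafile.
    env = dict(os.environ, HIP_VISIBLE_DEVICES="0")
    subprocess.run(["./llamafile-0.6.2", "-m", "model.llamafile", "-ngl", "35"], env=env)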

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to use on a Raspberry Pi)

Other models:

If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment). You can also download llamafile-0.6.2 and simply say ./llamafile-0.6.2 -m old.llamafile to run your old weights.

llamafile v0.6.1

20 Jan 08:09
389c389

This release fixes a crash that can happen on Apple Metal GPUs.

  • 9c85d9c Fix free() related crash in ggml-metal.m

Windows users will see better performance with tinyBLAS. Please note we
still recommend installing the CUDA SDK (NVIDIA) or HIP/ROCm SDK (AMD)
for maximum performance and accuracy if your hardware supports them.

  • df0b3ff Use thread-local register file for matmul speedups (#205)
  • 4892494 Change BM/BN/BK to template parameters (#203)
  • ed05ba9 Reduce server memory use on Windows

This release also synchronizes with llama.cpp upstream (as of Jan 9th)
and includes other improvements.

  • 133b05e Sync with llama.cpp upstream
  • 67d97b5 Use hipcc on $PATH if it exists
  • 15e2339 Do better job reporting AMD hipBLAS errors
  • c617679 Don't crash when --image argument is invalid
  • 3e8aa78 Clarify install/gpu docs/behavior per feedback
  • eb4989a Fix typo in OpenAI API

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to use on a Raspberry Pi)

Other models:

If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment). You can also download llamafile-0.6.1 and simply say ./llamafile-0.6.1 -m old.llamafile to run your old weights.

llamafile v0.6

09 Jan 11:48
64d1e65

This release features significant improvements to GPU support.

  • 4616816 Introduce support for multiple GPUs
  • 6559da6 Introduce AMD GPU support for Linux
  • 20d5f46 Make CLIP GPU acceleration work on UNIX / Windows

The llamafile server is now more reliable. Invalid JSON won't crash the
server. Opening a browser tab won't prevent the server from starting.

  • 3384234 Upgrade to cosmocc 3.2.4
  • 585c2d8 Make browser tab launching more reliable
  • 7a5ec37 Show IP addresses when binding to 0.0.0.0
  • d39ec38 Enable setting thread affinity on NUMA systems

You can now say llamafile -m foo.llamafile to load a model from a
llamafile without having to execute it or extract the gguf file.

  • bb136e1 Support opening weights from llamafiles

The documentation has been improved (but is still a work in progress).

  • 7ad00db Add more content to manual

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to use on a Raspberry Pi)

Other models:

If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment). You can also download llamafile-0.6 and simply say ./llamafile-0.6 -m old.llamafile to run your old weights.

llamafile v0.5

05 Jan 18:12
ef83e2b

The llamafile-server command is now unified into llamafile. This way
you won't need to upload your llamafiles to Hugging Face twice. We also
have rich man page documentation for this command, which can be viewed
with pagination on all platforms via the llamafile --help flag.

  • b86dcb7 Unify llamafile-server command into llamafile
  • 156f0a6 Embed man page into --help flag of each program

This release introduces support for AMD graphics cards on Windows. Our
release binaries include a prebuilt tinyBLAS DLL. Like our NVIDIA DLL,
it works on stock installs and only depends on the graphics driver. GPU
support on Windows is also much faster out of the box, thanks to
improvements we've made to our tinyBLAS kernels.

  • 1f1c53f Get AMD GPU support working on Windows
  • 1d9fa85 Add 2D blocking to tinyBLAS GemmEx (#153)
  • c0589f0 Apply 2D blocking to all kernels (#156)
  • c2bc6e6 Separate kernel for GemmStridedBatchedEx (#163)
  • f6ee33c Read and write column-major matrices better (#164)
  • d7cbaf7 Reduce BM/BN/BK from 64/32/64 to 48/12/48
  • 04d6e93 Introduce --gpu flag

Apple Metal users should expect to see LLaVA image summarization go
roughly 33% faster. Complete support for Microsoft's new Phi-2 model is
now available, which works great on Raspberry Pi. FreeBSD ARM64 users
can now also enjoy this project. Shell scriptability is improved. We've
also introduced a llamafile-convert command that makes it easier for
you to create your own llamafiles.

  • 922c4f1 Add GPU acceleration of LLaVA image processing on MacOS
  • 6423228 Add Phi-2 architecture support
  • ce4aac6 Support FreeBSD ARM64
  • 1dcf274 Add llamafile-convert command (#112)
  • 50bdf69 7d23bc9 Make --log-disable work better
  • 7843183 Make default thread count capped at 12 maximum
  • 2e276a1 Sync with llama.cpp upstream
  • dd4c9d7 Make JSON server crashes more informative
  • 8762f13 474b44f Introduce --nocompile flag
  • 5cf6e76 Introduce --cli flag
  • f0e86e1 Don't schlep weights into CPU when using GPU
  • f1410a1 Fix repeat_last_n in OpenAI server
  • 3119f09 Increase server max payload size

Known Issues

  • Multiple GPUs aren't supported yet.
  • CLIP only supports GPU acceleration on Apple Silicon.

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to use on a Raspberry Pi)

Other models:

If you have a slow Internet connection and want to update your llamafiles
without needing to redownload, then see the instructions here: #24 (comment)

llamafile v0.4.1

28 Dec 10:41
f6ea6bf

If you had trouble generating filenames with the latest release while
following the "bash one-liners" blog post, then please try again.

  • 0984ed8 Fix regression with --grammar flag

Crashes on older Intel / AMD systems should be fixed:

  • 3490afa Fix SIGILL on older Intel/AMD CPUs w/o F16C

The OpenAI API-compatible endpoint has been improved.

  • 9e4bf29 Fix OpenAI server sampling w.r.t. temp and seed

This release improves the documentation.

  • 5c7ff6e Improve llamafile manual
  • 658b18a Add WSL CUDA to GPU section (#105)
  • 586b408 Update README.md so links and curl commands work (#136)
  • a56ffd4 Update README to clarify Darwin kernel versioning
  • 47d8a8f Fix README changing SSE3 to SSSE3
  • 4da8e2e Fix README examples for certain UNIX shells
  • faa7430 Change README to list Mixtral Q5 (instead of Q3)
  • 6b0b64f Fix CLI README examples

We're making strides toward automating our testing process.

Some other improvements:

  • 9e972b2 Improve README examples
  • 9de5686 Support bos token in llava-cli
  • 3d81e22 Set logger callback for Apple Metal
  • 9579b73 Make it easier to override CPPFLAGS

Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can redownload here:

Known Issues

LLaVA image processing using the built-in tinyBLAS library may be slow on Windows.
Here's the workaround for using the faster NVIDIA cuBLAS library instead.

  1. Delete the .llamafile directory in your home directory.
  2. Install CUDA.
  3. Install MSVC.
  4. Open the "x64 MSVC command prompt" from Start.
  5. Run llamafile there for the first invocation.

There's a YouTube video tutorial on doing this here: https://youtu.be/d1Fnfvat6nM?si=W6Y0miZ9zVBHySFj

llamafile v0.4

14 Dec 09:23
188f7fc

This release features Mixtral support. Support has also been added for
Qwen models, and the --chatml, --samplers, and other flags have been added.

  • 820d42d Synchronize with llama.cpp upstream

GPU support now works out of the box on Windows. You still need to pass
the -ngl 35 flag, but you're no longer required to install CUDA/MSVC.

  • a7de00b Make tinyBLAS go 95% as fast as cuBLAS for token generation (#97)
  • 9d85a72 Improve GEMM performance by nearly 2x (#93)
  • 72e1c72 Support CUDA without cuBLAS (#82)
  • 2849b08 Make it possible for CUDA to extract prebuilt DSOs

Additional fixes and improvements:

  • c236a71 Improve markdown and syntax highlighting in server (#88)
  • 69ec1e4 Update the llamafile manual
  • 782c81c Add SD ops, kernels
  • 93178c9 Polyfill $HOME on some Windows systems
  • fcc727a Write log to /dev/null when main.log fails to open
  • 77cecbe Fix handling of characters that span multiple tokens when streaming

Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can redownload here:

llamafile v0.3

11 Dec 20:18
1f17930

The llamafile-main and llamafile-llava-cli programs have been
unified into a single command named llamafile. Man pages now exist in
PDF, troff, and PostScript format. There's much better support for shell
scripting, thanks to a new --silent-prompt flag. It's now possible to
shell script vision models like LLaVA using grammar constraints.

  • d4e2388 Add --version flag
  • baf216a Make ctrl-c work better
  • 762ad79 Add make install build rule
  • 7a3e557 Write man pages for all commands
  • c895a44 Remove stdout logging in llava-cli
  • 6cb036c Make LLaVA more shell script friendly
  • 28d3160 Introduce --silent-prompt flag to main
  • 1cd334f Allow --grammar to be used on --image prompts

The OpenAI API in llamafile-server has been improved.

  • e8c92bc Make OpenAI API stop field optional (#36)
  • c1c8683 Avoid bind() conflicts on port 8080 w/ server
  • 8cb9fd8 Recognize cache_prompt parameter in OpenAI API
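
For reference, here is a minimal Python sketch of calling the OpenAI-compatible
endpoint, assuming a llamafile server is already running locally on its default
port 8080. The model name and prompt are placeholders, and this is illustrative
client code rather than an official example.

    import json
    import urllib.request

    payload = {
        "model": "local-model",  # placeholder; the server uses whatever model it was started with
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "cache_prompt": True,    # parameter recognized per the change above
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])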

Performance regressions have been fixed for Intel and AMD users.

  • 73ee0b1 Add runtime dispatching for Q5 weights
  • 36b103e Make Q2/Q3 weights go 2x faster on AMD64 AVX2 CPUs
  • b4dea04 Slightly speed up LLaVA runtime dispatch on Intel

The zipalign command is now feature complete.

  • 76d47c0 Put finishing touches on zipalign tool
  • 7b2fbcb Add support for replacing zip files to zipalign

Some additional improvements:

  • 5f69bb9 Add SVG logo
  • cd0fae0 Make memory map loader go much faster on MacOS
  • c8cd8e1 Fix output path in llamafile-quantize
  • dd1e0cd Support attention_bias on LLaMA architecture
  • 55467d9 Fix integer overflow during quantization
  • ff1b437 Have makefile download cosmocc automatically
  • a7cc180 Update grammar-parser.cpp (#48)
  • 61944b5 Disable pledge on systems with GPUs
  • ccc377e Log cuda build command to stderr

Our .llamafiles on Hugging Face have been updated to incorporate these new release binaries. You can redownload here:

If you have a slower Internet connection and don't want to re-download, then you don't have to! Instructions are here:

llamafile v0.2.1

01 Dec 18:51
57cc1f4

llamafile lets you distribute and run LLMs with a single file. See our README file for documentation and to learn more.

Changes

  • 95703b6 Fix support for old Intel CPUs
  • 401dd08 Add OpenAI API compatibility to server
  • e5c2315 Make server open tab in browser on startup
  • 865462f Cherry pick StableLM support from llama.cpp
  • 8f21460 Introduce pledge() / seccomp security to llama.cpp
  • 711344b Fix server so it doesn't consume 100% cpu when idle
  • 12f4319 Add single-client multi-prompt support to server
  • c64989a Add --log-disable flag to server
  • 90fa20f Fix typical sampling (#4261)
  • e574488 reserve space in decode_utf8
  • 481b6a5 Look for GGML DSO before looking for NVCC
  • 41f243e Check for i/o errors in httplib read_file()
  • ed87fdb Fix uninitialized variables in server
  • c5d35b0 Avoid CUDA assertion error with some models
  • c373b5d Fix LLaVA regression for square images
  • 176e54f Fix server crash when prompt exceeds context size

Example llamafiles

Our .llamafiles on Hugging Face have been updated to incorporate these new release binaries. You can redownload here:

If you have a slower Internet connection and don't want to re-download, then you don't have to! Instructions are here:

llamafile v0.2

01 Dec 15:32
d05e063

Warning: This release was rolled back due to a Windows breakage caused by jart/cosmopolitan@7b3d7ee. Please use llamafile v0.2.1.