Enable concurrency by default #4218

dhiltgen · 2024-05-07T01:00:28Z

This changes our default settings to allow multiple loaded models and parallel requests to a single model. Users can still override both with the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
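To illustrate the idea of picking a default for parallel, here is a minimal sketch. It is not the code in this PR. The function names, the candidate values `(4, 2, 1)`, and the memory model (weights plus a KV cache that grows with `num_ctx * parallel`) are all assumptions made for the example.

```python
def estimate_vram_bytes(weights_bytes, kv_bytes_per_token, num_ctx, parallel):
    """Hypothetical estimate: model weights plus a KV cache sized for
    num_ctx * parallel tokens (assumed memory model, not the real predictor)."""
    return weights_bytes + kv_bytes_per_token * num_ctx * parallel


def pick_parallel(user_parallel, free_vram_bytes, weights_bytes,
                  kv_bytes_per_token, num_ctx, candidates=(4, 2, 1)):
    """Return the user's explicit setting if there is one; otherwise return
    the largest candidate whose estimated footprint fits in free VRAM."""
    if user_parallel is not None:
        # An explicit user setting always wins, even if it may not fit.
        return user_parallel
    for parallel in sorted(candidates, reverse=True):
        needed = estimate_vram_bytes(
            weights_bytes, kv_bytes_per_token, num_ctx, parallel)
        if needed <= free_vram_bytes:
            return parallel
    # Nothing fits fully; fall back to a single request slot.
    return 1


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Illustrative numbers only: 4 GiB of weights, 128 KiB of KV cache per token.
    for free_gib in (8, 4.5, 4.3):
        chosen = pick_parallel(None, int(free_gib * GIB), 4 * GIB, 128 * 1024, 2048)
        print(f"{free_gib} GiB free -> parallel={chosen}")
```

Trying candidates from largest to smallest means a small-VRAM GPU quietly drops to fewer parallel slots rather than failing to fit the model, while an explicit user value is never second-guessed.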

Marking draft until I can do more thorough testing.

This should merge after #4517 lands, to benefit from better memory prediction on multi-GPU setups.

