-
-
Notifications
You must be signed in to change notification settings - Fork 682
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AWQ support #464
Comments
#253 ,I think you can refer to this answer; it seems that vLLM currently only supports AWQ-4b or 8b |
You need to change |
Thanks for the response @Karry11 @danielhanchen, I tried merged_16bit, and it required more VRAM, but I only have 16 GB VRAM, is there any other way to run the model in VLLM with 4-bit quantization method? |
Convert it to AWQ if want to use VLLM , other wise Unsloth inference for 4bit models |
Ye AWQ is nice :) We might be adding a AWQ option for exporting! |
What's the current best option if I have to use this 4bit finetuned model using vLLM inference- Is it to convert it to 16bit and then perform the inference? |
@subhamiitk Use |
Thanks for the consideration @danielhanchen |
I have faced an error with the VLLM framework when I tried to inferencing an Unsloth fine-tuned LLAMA3-8b model...
Error:
(venv) ubuntu@ip-192-168-68-10:~/ans/vllm-server$ python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --dtype=half
INFO 05-14 09:46:09 api_server.py:151] vLLM API server version 0.4.1
INFO 05-14 09:46:09 api_server.py:152] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', tokenizer='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 159, in
engine = AsyncLLMEngine.from_engine_args(
File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 341, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 464, in create_engine_config
model_config = ModelConfig(
File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 115, in init
self._verify_quantization()
File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 160, in _verify_quantization
raise ValueError(
ValueError: Unknown quantization method: bitsandbytes. Must be one of ['aqlm', 'awq', 'fp8', 'gptq', 'squeezellm', 'marlin'].
Code:
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "meta-llama/Meta-Llama-3-8B",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
)
model = FastLanguageModel.get_peft_model(
model,
r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
dataset_num_proc = 2,
packing = False, # Can make training 5x faster for short sequences.
callbacks=[RichProgressCallback],
args = TrainingArguments(
# num_train_epochs=1,
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 5,
# max_steps = 2048,
max_steps = 5,
learning_rate = 2e-4,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
output_dir = "outputs",
# logging_dir=f"/home/ubuntu/ans/llama3_pipeline/fine_tuning/logs",
),
)
trainer_stats = trainer.train()
if True: model.save_pretrained_merged("/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit", tokenizer, save_method="merged_4bit_forced",)
VLLM cli:
python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit
Package Versions:
unsloth 2024.4
vllm 0.4.1
NVIDIA-SMI 550.67
Driver Version 550.67
CUDA Version 12.4
Python 3.10.12
torch 2.2.1
Hardware used:
Tesla T4 GPU
Memory 32 GB
8 core CPU
The text was updated successfully, but these errors were encountered: