
AWQ support #464

Open
anslin-raj opened this issue May 14, 2024 · 8 comments
Labels: feature request (Feature request pending on roadmap)

Comments

@anslin-raj

I have faced an error with the vLLM framework when I tried to run inference on an Unsloth fine-tuned Llama-3-8B model...

Error:

(venv) ubuntu@ip-192-168-68-10:~/ans/vllm-server$ python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --dtype=half
INFO 05-14 09:46:09 api_server.py:151] vLLM API server version 0.4.1
INFO 05-14 09:46:09 api_server.py:152] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', tokenizer='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 159, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 341, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 464, in create_engine_config
    model_config = ModelConfig(
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 115, in __init__
    self._verify_quantization()
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 160, in _verify_quantization
    raise ValueError(
ValueError: Unknown quantization method: bitsandbytes. Must be one of ['aqlm', 'awq', 'fp8', 'gptq', 'squeezellm', 'marlin'].

Code:

import torch
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer
# RichProgressCallback, dataset, max_seq_length, dtype and load_in_4bit are
# defined elsewhere in the fine-tuning script.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    callbacks = [RichProgressCallback],
    args = TrainingArguments(
        # num_train_epochs = 1,
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 2048,
        max_steps = 5,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        # logging_dir = f"/home/ubuntu/ans/llama3_pipeline/fine_tuning/logs",
    ),
)

trainer_stats = trainer.train()

if True:
    model.save_pretrained_merged(
        "/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit",
        tokenizer,
        save_method = "merged_4bit_forced",
    )

vLLM CLI:

python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit

Package Versions:

unsloth 2024.4
vllm 0.4.1
NVIDIA-SMI 550.67
Driver Version 550.67
CUDA Version 12.4
Python 3.10.12
torch 2.2.1

Hardware used:

Tesla T4 GPU
Memory 32 GB
8 core CPU

@Karry11 commented May 15, 2024

I think you can refer to the answer in #253; it seems that vLLM currently only supports AWQ at 4-bit or 8-bit.

@danielhanchen
Contributor

You need to change merged_4bit_forced to merged_16bit
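In code, that change to the save step would look something like the following sketch (the vllm_merged_16bit output directory is illustrative, not from the thread):

    # Merge the LoRA adapters into full 16-bit weights so vLLM can load the model
    # without bitsandbytes (the output path below is illustrative).
    model.save_pretrained_merged(
        "/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_16bit",
        tokenizer,
        save_method = "merged_16bit",
    )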

@anslin-raj
Author

Thanks for the response, @Karry11 @danielhanchen.

I tried merged_16bit, but it requires more VRAM than the 16 GB I have. Is there any other way to run the model in vLLM with a 4-bit quantization method?

@sparsh35

Convert it to AWQ if you want to use vLLM; otherwise use Unsloth inference for 4-bit models.
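A minimal sketch of that route, assuming the model has first been merged to 16-bit as above and using the AutoAWQ package for the conversion (not mentioned in the thread; paths and the quantization config are illustrative):

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    merged_path = "llama3_8b_merged_16bit"  # hypothetical dir from save_method="merged_16bit"
    quant_path = "llama3_8b_awq"            # hypothetical dir for the AWQ weights

    # Typical AutoAWQ 4-bit configuration (group size 128, GEMM kernels).
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(merged_path)
    tokenizer = AutoTokenizer.from_pretrained(merged_path)

    # Run AWQ calibration and quantize the weights to 4-bit.
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)

The resulting checkpoint should then load with the awq quantization method listed in the error above, e.g.:

    python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=llama3_8b_awq --quantization=awq --dtype=half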

@danielhanchen
Contributor

Yes, AWQ is nice :) We might be adding an AWQ option for exporting!

@danielhanchen added the feature request label on May 24, 2024
@danielhanchen changed the title from "Faced an issue with - vllm - inference - llama3 - 8b - 4bit" to "AWQ support" on May 24, 2024
@subhamiitk

What's the current best option if I have to use this 4-bit fine-tuned model with vLLM inference? Is it to convert it to 16-bit and then perform the inference?

@danielhanchen
Contributor

@subhamiitk Use model.save_pretrained_merged("location", tokenizer, save_method = "merged_16bit") and then use vLLM.
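With that merged 16-bit directory in place, the original serving command from this thread should work by pointing at it, keeping --dtype=half for the T4 (the vllm_merged_16bit path is illustrative):

    python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_16bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_16bit --dtype=half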

@anslin-raj
Author

Thanks for the consideration, @danielhanchen.
