= Fine-tuning LLMs on GPU Server =

This guide provides a practical, hands-on approach to fine-tuning Large Language Models (LLMs) using LoRA and QLoRA techniques on a GPU server. We will cover setting up the environment, preparing data, and executing the fine-tuning process with practical examples.
== Introduction ==
Fine-tuning LLMs allows you to adapt pre-trained models to specific tasks or datasets, improving their performance and relevance. LoRA (Low-Rank Adaptation) and QLoRA are efficient fine-tuning methods that significantly reduce compute and memory requirements, making it feasible to fine-tune large models on more accessible hardware.

GPU servers are essential for LLM fine-tuning. For cost-effective and powerful GPU instances, consider exploring options at Immers Cloud, with pricing starting from $0.23/hr for inference to $4.74/hr for H200.
== Prerequisites ==
Before you begin, ensure you have the following:

- A Linux server with a compatible NVIDIA GPU.
- NVIDIA drivers, CUDA Toolkit, and cuDNN installed.
- Python 3.8+ and pip installed.
- Basic understanding of Linux command line.
- Familiarity with Python and deep learning concepts.
- Sufficient disk space for models and datasets.
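The checklist above can be verified in a minute from the command line. A quick sanity-check sketch (assumes a typical Ubuntu server; the `nvidia-smi` query flags are standard but your driver version may format output differently):

```shell
# Quick prerequisite checks before starting
python3 --version                       # expect Python 3.8 or newer
command -v nvidia-smi >/dev/null && nvidia-smi --query-gpu=name,memory.total --format=csv \
  || echo "nvidia-smi not found -- install NVIDIA drivers first"
df -h .                                 # free disk space for models and datasets
```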
== Setting up the Environment ==
This section outlines the steps to prepare your server for LLM fine-tuning.

1. Install NVIDIA Drivers, CUDA, and cuDNN
Ensure your NVIDIA drivers, CUDA Toolkit, and cuDNN are correctly installed. You can usually find installation guides on the NVIDIA website. Verify the installation by running:

```shell
nvidia-smi
```

This command should display information about your GPU(s).
2. Create a Python Virtual Environment
It's highly recommended to use a virtual environment to manage Python dependencies.

```shell
sudo apt update
sudo apt install python3-venv -y
python3 -m venv llm_env
source llm_env/bin/activate
```

You should see `(llm_env)` at the beginning of your terminal prompt.
3. Install Required Python Libraries
Install the necessary libraries for LLM fine-tuning. This includes PyTorch, Transformers, PEFT (Parameter-Efficient Fine-Tuning), and bitsandbytes for QLoRA.

```shell
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118  # Adjust cu118 to match your CUDA version
pip install transformers peft bitsandbytes accelerate scipy datasets
pip install trl  # For easier training with SFTTrainer
```
== Data Preparation ==
High-quality data is crucial for successful fine-tuning.

1. Obtain or Create Your Dataset
Your dataset should be in a format that can be easily processed, typically JSON or CSV. For instruction fine-tuning, a common format is a list of dictionaries, where each dictionary represents a training example with fields like "instruction", "input", and "output".

Example JSON structure:

```json
[
  {
    "instruction": "Translate the following English text to French.",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous ?"
  },
  {
    "instruction": "Summarize the following article.",
    "input": "The quick brown fox jumps over the lazy dog. This is a classic pangram.",
    "output": "A pangram featuring a quick brown fox and a lazy dog."
  }
]
```
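Before training on a dataset like this, it can save time to sanity-check that every record has the expected keys and a non-empty output. A minimal validator sketch in plain Python (the field names match the example above; `validate_records` is an illustrative helper, not part of any library):

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

def validate_records(records):
    """Return a list of (index, problem) pairs for malformed examples."""
    problems = []
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            problems.append((i, f"missing fields: {missing}"))
        elif not rec["output"].strip():
            problems.append((i, "empty output"))
    return problems

data = json.loads("""[
  {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
  {"instruction": "Summarize.", "input": "Some text"}
]""")
print(validate_records(data))  # the second record is missing "output"
```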
2. Format the Data for Training
You'll need to format your data into a structure compatible with the `transformers` library. This often involves tokenizing the text and creating input/output pairs. The `datasets` library can help with loading and processing.

Example using `datasets`:

```python
from datasets import load_dataset

# Load your dataset (replace 'path/to/your/dataset.json' with your file)
dataset = load_dataset('json', data_files='path/to/your/dataset.json')

def format_prompt(example):
    # Adapt this template to your specific model and data format
    return (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}")

# Add a formatted text column; the trainer will read it via dataset_text_field
dataset = dataset.map(lambda x: {'formatted_text': format_prompt(x)})
```
== LoRA and QLoRA Fine-tuning ==
This section covers the practical steps for fine-tuning using LoRA and QLoRA.

1. Choose a Base Model
Select a pre-trained LLM to fine-tune. Popular choices include models from the Llama, Mistral, or GPT-2 families. Ensure the model is compatible with your hardware.

2. Configure LoRA/QLoRA
The PEFT (Parameter-Efficient Fine-Tuning) library simplifies LoRA configuration.

LoRA Configuration: LoRA works by injecting trainable low-rank matrices into the layers of a pre-trained model.
QLoRA Configuration: QLoRA further optimizes LoRA by quantizing the base model to 4-bit precision, significantly reducing memory usage while maintaining performance. This requires the `bitsandbytes` library.
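The memory savings are easy to estimate: NF4 storage uses half a byte per weight, versus two bytes for fp16/bf16. A back-of-envelope sketch for a hypothetical 7B-parameter base model (weights only; optimizer state, activations, and quantization constants are ignored):

```python
n_params = 7e9  # hypothetical 7B-parameter base model

fp16_gb = n_params * 2 / 1024**3    # 2 bytes per weight in fp16/bf16
nf4_gb = n_params * 0.5 / 1024**3   # 0.5 bytes per weight in 4-bit NF4

print(f"fp16 weights:  ~{fp16_gb:.1f} GiB")
print(f"4-bit weights: ~{nf4_gb:.1f} GiB ({fp16_gb / nf4_gb:.0f}x smaller)")
```

This is why a 7B model that would not fit on a 16 GB GPU in fp16 becomes trainable with QLoRA.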
Example `LoraConfig` from `peft`:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Define LoRA configuration
lora_config = LoraConfig(
    r=16,                                 # Rank of the update matrices
    lora_alpha=32,                        # Alpha parameter for LoRA scaling
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA to (model-specific)
    lora_dropout=0.05,                    # Dropout probability for LoRA layers
    bias="none",                          # Bias type
    task_type="CAUSAL_LM"                 # Task type
)

model_name = "meta-llama/Llama-2-7b-hf"  # Replace with your chosen model

# QLoRA quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base model with QLoRA quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"  # Automatically distributes the model across available GPUs
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set pad token

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
```
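For intuition about why so few parameters are trainable: each adapted weight of shape d_out × d_in gains two low-rank factors totalling r × (d_in + d_out) parameters. A sketch of the count for the configuration above, assuming Llama-2-7B-like shapes (4096-dimensional `q_proj`/`v_proj` across 32 layers; these shapes are illustrative assumptions, not read from the model):

```python
r = 16                  # LoRA rank, as in lora_config above
d_in = d_out = 4096     # assumed q_proj / v_proj shape in a Llama-2-7B-like model
n_layers = 32
n_target_modules = 2    # q_proj and v_proj

params_per_module = r * (d_in + d_out)   # A is r x d_in, B is d_out x r
trainable = params_per_module * n_target_modules * n_layers
total = 7e9

print(f"trainable LoRA params: {trainable:,} (~{100 * trainable / total:.2f}% of 7B)")
```

Roughly a tenth of a percent of the model is updated, which is what `print_trainable_parameters()` reports in practice.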
3. Set up the Trainer
The `transformers` library provides a `Trainer` class, and `trl` offers `SFTTrainer`, which is optimized for supervised fine-tuning.

Example `TrainingArguments` and `SFTTrainer`:
```python
from transformers import TrainingArguments
from trl import SFTTrainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",           # Directory to save checkpoints and logs
    num_train_epochs=3,               # Number of training epochs
    per_device_train_batch_size=4,    # Batch size per GPU
    gradient_accumulation_steps=2,    # Update steps to accumulate before each optimizer step
    learning_rate=2e-4,               # Learning rate
    logging_steps=10,                 # Log every N steps
    save_steps=50,                    # Save a checkpoint every N steps
    evaluation_strategy="no",         # Or "steps" if you have an evaluation dataset
    fp16=True,                        # Use mixed-precision training
    # bf16=True,                      # On Ampere GPUs and above, prefer bf16 if available
    report_to="tensorboard",          # Or "wandb" if you use Weights & Biases
    push_to_hub=False,                # Set to True to push to the Hugging Face Hub
)
```
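Note that `gradient_accumulation_steps` multiplies the effective batch size: gradients from several micro-batches are summed before each optimizer step, so you trade steps for memory. With the numbers above:

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
n_gpus = 1  # assumed single-GPU setup; scale by your actual GPU count

# Examples contributing to each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
print(effective_batch_size)
```

If you hit out-of-memory errors, halve the per-device batch size and double the accumulation steps to keep the effective batch size constant.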
```python
# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],       # Assuming your dataset is split into 'train'
    peft_config=lora_config,
    dataset_text_field="formatted_text",  # The field containing your formatted text
    max_seq_length=512,                   # Maximum sequence length
    tokenizer=tokenizer,
    args=training_args,
    packing=False,                        # Set to True to pack multiple short sequences together
)

# Start training
trainer.train()

# Save the fine-tuned LoRA adapters
trainer.save_model("./fine_tuned_lora")
```
== Inference with Fine-tuned Model ==
After fine-tuning, you can load and use your adapter weights for inference.

1. Load Base Model and Adapters
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

model_name = "meta-llama/Llama-2-7b-hf"  # Base model used for fine-tuning
adapter_path = "./fine_tuned_lora"       # Path where you saved your adapters

# Load the base model (use the same quantization as during training if QLoRA was used)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load the PEFT model (LoRA adapters)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.eval()  # Set model to evaluation mode
```
2. Generate Text
```python
prompt = ("### Instruction:\nTranslate the following English text to French.\n\n"
          "### Input:\nHow are you?\n\n### Response:\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
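Because `generate` returns the prompt tokens followed by the continuation, the decoded string contains the full prompt. A small helper sketch to keep only the model's answer (it assumes the `### Response:` prompt format used above; `extract_response` is an illustrative name, not a library function):

```python
def extract_response(decoded: str, marker: str = "### Response:\n") -> str:
    """Return the text after the last response marker, stripped of whitespace."""
    _, _, answer = decoded.rpartition(marker)  # falls back to the whole string if absent
    return answer.strip()

decoded = ("### Instruction:\nTranslate the following English text to French."
           "\n\n### Input:\nHow are you?\n\n### Response:\nComment allez-vous ?")
print(extract_response(decoded))
```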