Fine-tuning LLMs on GPU Server
This guide provides a practical, hands-on approach to fine-tuning Large Language Models (LLMs) using LoRA and QLoRA techniques on a GPU server. We will cover setting up the environment, preparing data, and executing the fine-tuning process with practical examples.
Introduction
Fine-tuning LLMs allows you to adapt pre-trained models to specific tasks or datasets, improving their performance and relevance. LoRA (Low-Rank Adaptation) and QLoRA are efficient fine-tuning methods that significantly reduce computational resources and memory requirements, making it feasible to fine-tune large models on more accessible hardware.
GPU servers are essential for LLM fine-tuning. For cost-effective and powerful GPU instances, consider the options at Immers Cloud, with pricing ranging from $0.23/hr for inference-class instances to $4.74/hr for an H200.
Prerequisites
Before you begin, ensure you have the following:
- A Linux server with a compatible NVIDIA GPU.
- NVIDIA drivers, CUDA Toolkit, and cuDNN installed.
- Python 3.8+ and pip installed.
- Basic understanding of Linux command line.
- Familiarity with Python and deep learning concepts.
- Sufficient disk space for models and datasets.
Setting up the Environment
This section outlines the steps to prepare your server for LLM fine-tuning.
1. Install NVIDIA Drivers, CUDA, and cuDNN
Ensure your NVIDIA drivers, CUDA Toolkit, and cuDNN are correctly installed. You can usually find installation guides on the NVIDIA website. Verify the installation by running:
nvidia-smi
This command should display information about your GPU(s).
2. Create a Python Virtual Environment
It's highly recommended to use a virtual environment to manage Python dependencies.
sudo apt update
sudo apt install python3-venv -y
python3 -m venv llm_env
source llm_env/bin/activate
You should see `(llm_env)` at the beginning of your terminal prompt.
3. Install Required Python Libraries
Install the necessary libraries for LLM fine-tuning. This includes PyTorch, Transformers, PEFT (Parameter-Efficient Fine-Tuning), and bitsandbytes for QLoRA.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118  # Adjust cu118 to match your CUDA version
pip install transformers peft bitsandbytes accelerate scipy datasets
pip install trl  # Provides SFTTrainer for supervised fine-tuning
Data Preparation
High-quality data is crucial for successful fine-tuning.
1. Obtain or Create Your Dataset
Your dataset should be in a format that can be easily processed, typically JSON or CSV. For instruction fine-tuning, a common format is a list of dictionaries, where each dictionary represents a training example with fields like "instruction", "input", and "output".
Example JSON structure:
[
{
"instruction": "Translate the following English text to French.",
"input": "Hello, how are you?",
"output": "Bonjour, comment allez-vous ?"
},
{
"instruction": "Summarize the following article.",
"input": "The quick brown fox jumps over the lazy dog. This is a classic pangram.",
"output": "A pangram featuring a quick brown fox and a lazy dog."
}
]
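Before training, it is worth sanity-checking that every record carries the three expected fields. A minimal validation sketch using only the standard library; the inline `records` list stands in for `json.load()` on your own dataset file:

```python
import json

# The two example records from above; in practice, replace with:
# records = json.load(open("path/to/your/dataset.json"))
records = [
    {
        "instruction": "Translate the following English text to French.",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous ?",
    },
    {
        "instruction": "Summarize the following article.",
        "input": "The quick brown fox jumps over the lazy dog. This is a classic pangram.",
        "output": "A pangram featuring a quick brown fox and a lazy dog.",
    },
]

required = {"instruction", "input", "output"}
for i, rec in enumerate(records):
    missing = required - rec.keys()
    if missing:
        raise ValueError(f"Record {i} is missing fields: {missing}")

print(f"All {len(records)} records are valid")
```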
2. Format the Data for Training
You'll need to format your data into a structure compatible with the `transformers` library. This often involves tokenizing the text and creating input/output pairs. The `datasets` library can help with loading and processing.
Example using `datasets`:
from datasets import load_dataset
# Load your dataset (replace 'path/to/your/dataset.json' with your file)
dataset = load_dataset('json', data_files='path/to/your/dataset.json')
# Example of how you might process it (simplified)
def format_prompt(example):
    # Adapt this template to your specific model and data format
    return (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}")

# Create the 'formatted_text' field that the trainer will read
dataset = dataset.map(lambda x: {'formatted_text': format_prompt(x)})
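To see exactly what the trainer will consume, here is the template applied to the first example record (a self-contained sketch; `format_prompt` repeats the function above):

```python
def format_prompt(example):
    # Same template as above; adapt to your model's expected prompt format
    return (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}")

example = {
    "instruction": "Translate the following English text to French.",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous ?",
}

prompt = format_prompt(example)
print(prompt)
```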
LoRA and QLoRA Fine-tuning
This section covers the practical steps for fine-tuning using LoRA and QLoRA.
1. Choose a Base Model
Select a pre-trained LLM to fine-tune. Popular choices include models from the Llama, Mistral, or GPT-2 families. Ensure the model is compatible with your hardware.
2. Configure LoRA/QLoRA
The PEFT (Parameter-Efficient Fine-Tuning) library simplifies LoRA and QLoRA configuration.
LoRA Configuration: LoRA works by injecting trainable low-rank matrices into the layers of a pre-trained model.
QLoRA Configuration: QLoRA further optimizes LoRA by quantizing the base model to 4-bit precision, significantly reducing memory usage while maintaining performance. This requires the `bitsandbytes` library.
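The core idea can be illustrated without any libraries: the pre-trained weight W stays frozen, and the effective weight becomes W + (alpha/r) · B·A, where A and B are the small trainable matrices. A toy sketch with 2x2 matrices and rank r=1 (the sizes are illustrative only):

```python
# Toy LoRA update: effective_W = W + (alpha / r) * (B @ A), using plain
# nested lists. In practice W is a large frozen weight matrix and only
# the much smaller A and B are trained.
r, alpha = 1, 2                       # rank and scaling factor
W = [[1.0, 0.0], [0.0, 1.0]]          # frozen pre-trained weight (2x2)
B = [[0.5], [0.0]]                    # trainable, shape (2, r)
A = [[0.0, 1.0]]                      # trainable, shape (r, 2)

scale = alpha / r
BA = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(2)]
      for i in range(2)]
effective_W = [[W[i][j] + scale * BA[i][j] for j in range(2)]
               for i in range(2)]

print(effective_W)  # [[1.0, 1.0], [0.0, 1.0]]
```

Only A and B receive gradients, which is why the trainable parameter count stays tiny relative to the full model.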
Example `LoraConfig` from `peft`:
from peft import LoraConfig, get_peft_model
# Define LoRA configuration
lora_config = LoraConfig(
    r=16,                                 # Rank of the update matrices
    lora_alpha=32,                        # Alpha parameter for LoRA scaling
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA to (model-specific)
    lora_dropout=0.05,                    # Dropout probability for LoRA layers
    bias="none",                          # Bias type
    task_type="CAUSAL_LM"                 # Task type
)
# Load your base model (example with Hugging Face Transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
model_name = "meta-llama/Llama-2-7b-hf" # Replace with your chosen model
# QLoRA quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load the model with QLoRA configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"  # Automatically distributes the model across available GPUs
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Set pad token
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
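A back-of-the-envelope count shows why so few parameters are trainable. Assuming the Llama-2-7B hidden size (4096) and 32 layers with the configuration above (r=16, LoRA on `q_proj` and `v_proj`); these numbers are illustrative, and `print_trainable_parameters()` reports the exact figure:

```python
# Rough LoRA trainable-parameter count under the assumptions above.
d, r, layers, targets = 4096, 16, 32, 2

lora_params_per_matrix = 2 * d * r   # A is (r, d) and B is (d, r)
total_lora = lora_params_per_matrix * targets * layers
full_matrix = d * d                  # one full q_proj weight, for comparison

print(f"LoRA params: {total_lora:,}")
print(f"vs a single full projection matrix: {full_matrix:,}")
```

Roughly 8.4M trainable parameters, versus billions in the frozen base model.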
3. Set up the Trainer
The `transformers` library provides a `Trainer` class, and `trl` offers `SFTTrainer` which is optimized for supervised fine-tuning.
Example `TrainingArguments` and `SFTTrainer`:
from transformers import TrainingArguments
from trl import SFTTrainer
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",          # Directory to save checkpoints and logs
    num_train_epochs=3,              # Number of training epochs
    per_device_train_batch_size=4,   # Batch size per GPU
    gradient_accumulation_steps=2,   # Micro-batches to accumulate before each optimizer step
    learning_rate=2e-4,              # Learning rate
    logging_steps=10,                # Log every N steps
    save_steps=50,                   # Save a checkpoint every N steps
    evaluation_strategy="no",        # Or "steps" if you have an evaluation dataset
    fp16=True,                       # Use mixed-precision training
    # bf16=True,                     # On Ampere GPUs and above, prefer bf16 if available
    report_to="tensorboard",         # Or "wandb" if you use Weights & Biases
    push_to_hub=False,               # Set to True to push to the Hugging Face Hub
)
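Note that batch size and gradient accumulation multiply: each optimizer step sees `per_device_train_batch_size * gradient_accumulation_steps * num_gpus` examples, and that effective batch size is what the learning rate should be tuned against. With the example values above on a single-GPU server:

```python
# Effective batch size implied by the example TrainingArguments
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_gpus = 1  # adjust to your server

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 8
```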
# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],       # Assuming your dataset has a 'train' split
    peft_config=lora_config,
    dataset_text_field="formatted_text",  # The field containing your formatted text
    max_seq_length=512,                   # Maximum sequence length
    tokenizer=tokenizer,
    args=training_args,
    packing=False,                        # Set to True to pack multiple short examples into each sequence
)
# Start training
trainer.train()
# Save the fine-tuned LoRA adapters
trainer.save_model("./fine_tuned_lora")
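If you later want a single standalone checkpoint rather than separate adapter files, PEFT can fold the LoRA weights back into the base model. A sketch, assuming a full-precision reload of the base model (merging directly into a 4-bit quantized model is not supported):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in full precision, attach the saved adapters,
# then merge them into the base weights and save a standalone checkpoint.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "./fine_tuned_lora")
merged = model.merge_and_unload()
merged.save_pretrained("./fine_tuned_merged")
```

The merged checkpoint loads like any ordinary `transformers` model, at the cost of storing the full set of weights.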
Inference with Fine-tuned Model
After fine-tuning, you can load and use your adapter weights for inference.
1. Load Base Model and Adapters
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch
model_name = "meta-llama/Llama-2-7b-hf" # Base model used for fine-tuning
adapter_path = "./fine_tuned_lora" # Path where you saved your adapters
# Load the base model (use same quantization as during training if QLoRA was used)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Load the PEFT model (LoRA adapters)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.eval() # Set model to evaluation mode
2. Generate Text
prompt = "### Instruction:\nTranslate the following English text to French.\n\n### Input:\nHow are you?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Troubleshooting
- CUDA Out of Memory:
* Reduce `per_device_train_batch_size`.
* Increase `gradient_accumulation_steps`.
* Use QLoRA (4-bit quantization) if you are not already.
* Consider a GPU server with more VRAM. Immers Cloud offers a range of GPUs.
* Use `fp16=True` or `bf16=True` in `TrainingArguments`.
- Slow Training:
* Ensure your dataset is loaded efficiently.
* Check GPU utilization (`nvidia-smi`).
* Use a faster GPU if possible.
- Model Not Performing Well:
* Review your dataset quality and quantity.
* Experiment with different LoRA configurations (`r`, `lora_alpha`, `target_modules`).
* Adjust training hyperparameters (learning rate, epochs).
* Try a different base model.
- Tokenizer Issues:
* Ensure the tokenizer is correctly loaded and configured (e.g., `pad_token`).
* Verify that your data formatting matches the tokenizer's expectations.
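For the out-of-memory case, a rough weights-only VRAM estimate helps pick a precision before renting hardware. A sketch for a 7B-parameter model (activations, gradients, and optimizer state add a significant amount on top of these figures):

```python
# Approximate VRAM needed just to hold the base weights of a
# 7B-parameter model at different precisions.
params = 7_000_000_000
bytes_per_param = {"fp16": 2.0, "8-bit": 1.0, "4-bit (QLoRA)": 0.5}

vram_gb = {name: params * b / 1024**3 for name, b in bytes_per_param.items()}
for name, gb in vram_gb.items():
    print(f"{name}: {gb:.1f} GB")
```

This is why QLoRA makes a 7B model fit comfortably on a single 24 GB GPU, while fp16 weights alone already consume over half of it.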
Further Reading
- LLM Basics
- GPU Server Management
- Hugging Face PEFT Documentation
- Hugging Face Transformers Documentation