= Fine-tuning LLMs on GPU Server =

This guide provides a practical, hands-on approach to fine-tuning Large Language Models (LLMs) using LoRA and QLoRA techniques on a GPU server. We will cover setting up the environment, preparing data, and executing the fine-tuning process with practical examples.
== Introduction ==
Fine-tuning LLMs allows you to adapt pre-trained models to specific tasks or datasets, improving their performance and relevance. LoRA (Low-Rank Adaptation) and QLoRA are efficient fine-tuning methods that significantly reduce compute and memory requirements, making it feasible to fine-tune large models on more accessible hardware.

GPU servers are essential for LLM fine-tuning. For cost-effective and powerful GPU instances, consider exploring options at Immers Cloud, with pricing starting from $0.23/hr for inference to $4.74/hr for H200.
== Prerequisites ==
Before you begin, ensure you have the following:

- A Linux server with a compatible NVIDIA GPU.
- NVIDIA drivers, CUDA Toolkit, and cuDNN installed.
- Python 3.8+ and pip installed.
- Basic understanding of Linux command line.
- Familiarity with Python and deep learning concepts.
- Sufficient disk space for models and datasets.
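The checklist above can be verified in a minute from the command line. A quick sanity-check sketch (assumes a typical Ubuntu server; the `nvidia-smi` query flags are standard but your driver version may format output differently):

```shell
# Quick prerequisite checks before starting
python3 --version                       # expect Python 3.8 or newer
command -v nvidia-smi >/dev/null && nvidia-smi --query-gpu=name,memory.total --format=csv \
  || echo "nvidia-smi not found -- install NVIDIA drivers first"
df -h .                                 # free disk space for models and datasets
```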
== Setting up the Environment ==
This section outlines the steps to prepare your server for LLM fine-tuning.

1. Install NVIDIA Drivers, CUDA, and cuDNN
Ensure your NVIDIA drivers, CUDA Toolkit, and cuDNN are correctly installed. You can usually find installation guides on the NVIDIA website. Verify the installation by running:

```shell
nvidia-smi
```

This command should display information about your GPU(s).
2. Create a Python Virtual Environment
It's highly recommended to use a virtual environment to manage Python dependencies.

```shell
sudo apt update
sudo apt install python3-venv -y
python3 -m venv llm_env
source llm_env/bin/activate
```

You should see `(llm_env)` at the beginning of your terminal prompt.
3. Install Required Python Libraries
Install the necessary libraries for LLM fine-tuning. This includes PyTorch, Transformers, PEFT (Parameter-Efficient Fine-Tuning), and bitsandbytes for QLoRA.

```shell
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118  # Adjust cu118 to match your CUDA version
pip install transformers peft bitsandbytes accelerate scipy datasets
pip install trl  # For easier training with SFTTrainer
```
== Data Preparation ==
High-quality data is crucial for successful fine-tuning.

1. Obtain or Create Your Dataset
Your dataset should be in a format that can be easily processed, typically JSON or CSV. For instruction fine-tuning, a common format is a list of dictionaries, where each dictionary represents a training example with fields like "instruction", "input", and "output".

Example JSON structure:

```json
[
  {
    "instruction": "Translate the following English text to French.",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous ?"
  },
  {
    "instruction": "Summarize the following article.",
    "input": "The quick brown fox jumps over the lazy dog. This is a classic pangram.",
    "output": "A pangram featuring a quick brown fox and a lazy dog."
  }
]
```
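Before training on a dataset like this, it can save time to sanity-check that every record has the expected keys and a non-empty output. A minimal validator sketch in plain Python (the field names match the example above; `validate_records` is an illustrative helper, not part of any library):

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

def validate_records(records):
    """Return a list of (index, problem) pairs for malformed examples."""
    problems = []
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            problems.append((i, f"missing fields: {missing}"))
        elif not rec["output"].strip():
            problems.append((i, "empty output"))
    return problems

data = json.loads("""[
  {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
  {"instruction": "Summarize.", "input": "Some text"}
]""")
print(validate_records(data))  # the second record is missing "output"
```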
2. Format the Data for Training
You'll need to format your data into a structure compatible with the `transformers` library. This often involves tokenizing the text and creating input/output pairs. The `datasets` library can help with loading and processing.

Example using `datasets`:

```python
from datasets import load_dataset

# Load your dataset (replace 'path/to/your/dataset.json' with your file)
dataset = load_dataset('json', data_files='path/to/your/dataset.json')

def format_prompt(example):
    # Adapt this template to your specific model and data format
    return (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}")

# Add a formatted text column; the trainer will read it via dataset_text_field
dataset = dataset.map(lambda x: {'formatted_text': format_prompt(x)})
```
== LoRA and QLoRA Fine-tuning ==
This section covers the practical steps for fine-tuning using LoRA and QLoRA.

1. Choose a Base Model
Select a pre-trained LLM to fine-tune. Popular choices include models from the Llama, Mistral, or GPT-2 families. Ensure the model is compatible with your hardware.

2. Configure LoRA/QLoRA
The PEFT (Parameter-Efficient Fine-Tuning) library simplifies LoRA configuration.

LoRA Configuration: LoRA works by injecting trainable low-rank matrices into the layers of a pre-trained model.
QLoRA Configuration: QLoRA further optimizes LoRA by quantizing the base model to 4-bit precision, significantly reducing memory usage while maintaining performance. This requires the `bitsandbytes` library.
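The memory savings are easy to estimate: NF4 storage uses half a byte per weight, versus two bytes for fp16/bf16. A back-of-envelope sketch for a hypothetical 7B-parameter base model (weights only; optimizer state, activations, and quantization constants are ignored):

```python
n_params = 7e9  # hypothetical 7B-parameter base model

fp16_gb = n_params * 2 / 1024**3    # 2 bytes per weight in fp16/bf16
nf4_gb = n_params * 0.5 / 1024**3   # 0.5 bytes per weight in 4-bit NF4

print(f"fp16 weights:  ~{fp16_gb:.1f} GiB")
print(f"4-bit weights: ~{nf4_gb:.1f} GiB ({fp16_gb / nf4_gb:.0f}x smaller)")
```

This is why a 7B model that would not fit on a 16 GB GPU in fp16 becomes trainable with QLoRA.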
Example `LoraConfig` from `peft`:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Define LoRA configuration
lora_config = LoraConfig(
    r=16,                                 # Rank of the update matrices
    lora_alpha=32,                        # Alpha parameter for LoRA scaling
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA to (model-specific)
    lora_dropout=0.05,                    # Dropout probability for LoRA layers
    bias="none",                          # Bias type
    task_type="CAUSAL_LM"                 # Task type
)

model_name = "meta-llama/Llama-2-7b-hf"  # Replace with your chosen model

# QLoRA quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base model with QLoRA quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"  # Automatically distributes the model across available GPUs
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set pad token

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
```
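For intuition about why so few parameters are trainable: each adapted weight of shape d_out × d_in gains two low-rank factors totalling r × (d_in + d_out) parameters. A sketch of the count for the configuration above, assuming Llama-2-7B-like shapes (4096-dimensional `q_proj`/`v_proj` across 32 layers; these shapes are illustrative assumptions, not read from the model):

```python
r = 16                  # LoRA rank, as in lora_config above
d_in = d_out = 4096     # assumed q_proj / v_proj shape in a Llama-2-7B-like model
n_layers = 32
n_target_modules = 2    # q_proj and v_proj

params_per_module = r * (d_in + d_out)   # A is r x d_in, B is d_out x r
trainable = params_per_module * n_target_modules * n_layers
total = 7e9

print(f"trainable LoRA params: {trainable:,} (~{100 * trainable / total:.2f}% of 7B)")
```

Roughly a tenth of a percent of the model is updated, which is what `print_trainable_parameters()` reports in practice.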
3. Set up the Trainer
The `transformers` library provides a `Trainer` class, and `trl` offers `SFTTrainer`, which is optimized for supervised fine-tuning.

Example `TrainingArguments` and `SFTTrainer`:
```python
from transformers import TrainingArguments
from trl import SFTTrainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",           # Directory to save checkpoints and logs
    num_train_epochs=3,               # Number of training epochs
    per_device_train_batch_size=4,    # Batch size per GPU
    gradient_accumulation_steps=2,    # Update steps to accumulate before each optimizer step
    learning_rate=2e-4,               # Learning rate
    logging_steps=10,                 # Log every N steps
    save_steps=50,                    # Save a checkpoint every N steps
    evaluation_strategy="no",         # Or "steps" if you have an evaluation dataset
    fp16=True,                        # Use mixed-precision training
    # bf16=True,                      # On Ampere GPUs and above, prefer bf16 if available
    report_to="tensorboard",          # Or "wandb" if you use Weights & Biases
    push_to_hub=False,                # Set to True to push to the Hugging Face Hub
)
```
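Note that `gradient_accumulation_steps` multiplies the effective batch size: gradients from several micro-batches are summed before each optimizer step, so you trade steps for memory. With the numbers above:

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
n_gpus = 1  # assumed single-GPU setup; scale by your actual GPU count

# Examples contributing to each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
print(effective_batch_size)
```

If you hit out-of-memory errors, halve the per-device batch size and double the accumulation steps to keep the effective batch size constant.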
```python
# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],       # Assuming your dataset is split into 'train'
    peft_config=lora_config,
    dataset_text_field="formatted_text",  # The field containing your formatted text
    max_seq_length=512,                   # Maximum sequence length
    tokenizer=tokenizer,
    args=training_args,
    packing=False,                        # Set to True to pack multiple short sequences together
)

# Start training
trainer.train()

# Save the fine-tuned LoRA adapters
trainer.save_model("./fine_tuned_lora")
```
== Inference with Fine-tuned Model ==
After fine-tuning, you can load and use your adapter weights for inference.

1. Load Base Model and Adapters
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

model_name = "meta-llama/Llama-2-7b-hf"  # Base model used for fine-tuning
adapter_path = "./fine_tuned_lora"       # Path where you saved your adapters

# Load the base model (use the same quantization as during training if QLoRA was used)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load the PEFT model (LoRA adapters)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.eval()  # Set model to evaluation mode
```
2. Generate Text
```python
prompt = ("### Instruction:\nTranslate the following English text to French.\n\n"
          "### Input:\nHow are you?\n\n### Response:\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
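Because `generate` returns the prompt tokens followed by the continuation, the decoded string contains the full prompt. A small helper sketch to keep only the model's answer (it assumes the `### Response:` prompt format used above; `extract_response` is an illustrative name, not a library function):

```python
def extract_response(decoded: str, marker: str = "### Response:\n") -> str:
    """Return the text after the last response marker, stripped of whitespace."""
    _, _, answer = decoded.rpartition(marker)  # falls back to the whole string if absent
    return answer.strip()

decoded = ("### Instruction:\nTranslate the following English text to French."
           "\n\n### Input:\nHow are you?\n\n### Response:\nComment allez-vous ?")
print(extract_response(decoded))
```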