Fine-tuning LLMs on GPU Server


This guide provides a practical, hands-on approach to fine-tuning Large Language Models (LLMs) using LoRA and QLoRA techniques on a GPU server. We will cover setting up the environment, preparing data, and executing the fine-tuning process with practical examples.

Introduction

Fine-tuning LLMs allows you to adapt pre-trained models to specific tasks or datasets, improving their performance and relevance. LoRA (Low-Rank Adaptation) and QLoRA are efficient fine-tuning methods that significantly reduce computational resources and memory requirements, making it feasible to fine-tune large models on more accessible hardware.

GPU servers are essential for LLM fine-tuning. For cost-effective and powerful GPU instances, consider exploring options at Immers Cloud, with pricing ranging from $0.23/hr for inference-oriented instances to $4.74/hr for an H200.

Prerequisites

Before you begin, ensure you have the following:

  • A Linux server with a compatible NVIDIA GPU.
  • NVIDIA drivers, CUDA Toolkit, and cuDNN installed.
  • Python 3.8+ and pip installed.
  • Basic understanding of Linux command line.
  • Familiarity with Python and deep learning concepts.
  • Sufficient disk space for models and datasets.

Setting up the Environment

This section outlines the steps to prepare your server for LLM fine-tuning.

1. Install NVIDIA Drivers, CUDA, and cuDNN

Ensure your NVIDIA drivers, CUDA Toolkit, and cuDNN are correctly installed. You can usually find installation guides on the NVIDIA website. Verify the installation by running:

nvidia-smi

This command should display information about your GPU(s).

2. Create a Python Virtual Environment

It's highly recommended to use a virtual environment to manage Python dependencies.

sudo apt update
sudo apt install python3-venv -y
python3 -m venv llm_env
source llm_env/bin/activate

You should see `(llm_env)` at the beginning of your terminal prompt.

3. Install Required Python Libraries

Install the necessary libraries for LLM fine-tuning. This includes PyTorch, Transformers, PEFT (Parameter-Efficient Fine-Tuning), and bitsandbytes for QLoRA.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # Adjust cu118 based on your CUDA version
pip install transformers peft bitsandbytes accelerate scipy datasets
pip install trl # For easier training with SFTTrainer

Data Preparation

High-quality data is crucial for successful fine-tuning.

1. Obtain or Create Your Dataset

Your dataset should be in a format that can be easily processed, typically JSON or CSV. For instruction fine-tuning, a common format is a list of dictionaries, where each dictionary represents a training example with fields like "instruction", "input", and "output".

Example JSON structure:

[
  {
    "instruction": "Translate the following English text to French.",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous ?"
  },
  {
    "instruction": "Summarize the following article.",
    "input": "The quick brown fox jumps over the lazy dog. This is a classic pangram.",
    "output": "A pangram featuring a quick brown fox and a lazy dog."
  }
]
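
Before training, it is worth sanity-checking that every record in your dataset actually has the fields your formatting code expects. The sketch below is a minimal, hypothetical validator using only the standard library; the required keys match the example structure above, so adjust them for your own schema.

```python
import json

# Keys expected in each record, matching the example JSON above.
REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_dataset(path):
    """Load a JSON dataset and verify each record has the required keys.

    Returns the number of records if all are valid; raises ValueError
    pointing at the first malformed record otherwise.
    """
    with open(path) as f:
        records = json.load(f)
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"Record {i} is missing keys: {missing}")
    return len(records)
```

Running `validate_dataset("path/to/your/dataset.json")` before training catches schema errors early, which is much cheaper than discovering them mid-run.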

2. Format the Data for Training

You'll need to format your data into a structure compatible with the `transformers` library. This often involves tokenizing the text and creating input/output pairs. The `datasets` library can help with loading and processing.

Example using `datasets`:

from datasets import load_dataset

# Load your dataset (replace 'path/to/your/dataset.json' with your file)
dataset = load_dataset('json', data_files='path/to/your/dataset.json')

# Example of how you might process it (this is a simplified view)
def format_prompt(example):
    # This function needs to be adapted to your specific model and data format
    return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"

# Apply the formatting so each example gains a 'formatted_text' field
# (the SFTTrainer below reads this field via dataset_text_field)
dataset = dataset.map(lambda x: {'formatted_text': format_prompt(x)})

LoRA and QLoRA Fine-tuning

This section covers the practical steps for fine-tuning using LoRA and QLoRA.

1. Choose a Base Model

Select a pre-trained LLM to fine-tune. Popular choices include models from the Llama, Mistral, or GPT-2 families. Ensure the model is compatible with your hardware.
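
A quick back-of-the-envelope check helps decide whether a model fits your GPU. The sketch below estimates memory for the weights alone (activations, gradients, and optimizer state add more on top); the 7B parameter count is illustrative, so substitute your model's size.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed for model weights alone, in gigabytes."""
    return num_params * bits_per_param / 8 / 1024**3

params_7b = 7e9  # illustrative parameter count for a 7B model

fp16_gb = weight_memory_gb(params_7b, 16)    # roughly 13 GB in half precision
four_bit_gb = weight_memory_gb(params_7b, 4) # roughly 3.3 GB with 4-bit quantization
```

This is why QLoRA's 4-bit quantization makes a 7B model tractable on a single consumer-class GPU, while full-precision loading alone can exceed its VRAM.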

2. Configure LoRA/QLoRA

The PEFT (Parameter-Efficient Fine-Tuning) library simplifies LoRA and QLoRA configuration.

LoRA Configuration: LoRA works by injecting trainable low-rank matrices into the layers of a pre-trained model.
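
The parameter savings follow directly from the low-rank structure: instead of updating a full d_out × d_in weight matrix, LoRA trains two small matrices B (d_out × r) and A (r × d_in), and the update is their product B @ A. A minimal arithmetic sketch, using illustrative dimensions typical of a 7B model's attention projections and the rank r=16 used in the configuration example below:

```python
# Illustrative dimensions: a square 4096x4096 projection matrix and rank 16.
d_in = d_out = 4096
r = 16

# Updating the full matrix would train every entry.
full_update_params = d_out * d_in            # 16,777,216 parameters

# LoRA trains only B (d_out x r) and A (r x d_in).
lora_params = d_out * r + r * d_in           # 131,072 parameters

reduction = full_update_params / lora_params # 128x fewer trainable parameters
```

Summed over all targeted modules, this is why `model.print_trainable_parameters()` typically reports well under 1% of the model as trainable.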

QLoRA Configuration: QLoRA further optimizes LoRA by quantizing the base model to 4-bit precision, significantly reducing memory usage while maintaining performance. This requires the `bitsandbytes` library.

Example `LoraConfig` from `peft`:

from peft import LoraConfig, get_peft_model

# Define LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank of the update matrices
    lora_alpha=32, # Alpha parameter for LoRA scaling
    target_modules=["q_proj", "v_proj"], # Modules to apply LoRA to (model-specific)
    lora_dropout=0.05, # Dropout probability for LoRA layers
    bias="none", # Bias type
    task_type="CAUSAL_LM" # Task type
)

# Load your base model (example with Hugging Face Transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "meta-llama/Llama-2-7b-hf" # Replace with your chosen model

# QLoRA quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the model with QLoRA configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto" # Automatically distributes model across available GPUs
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Set pad token

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()

3. Set up the Trainer

The `transformers` library provides a `Trainer` class, and `trl` offers `SFTTrainer` which is optimized for supervised fine-tuning.

Example `TrainingArguments` and `SFTTrainer`:

from transformers import TrainingArguments
from trl import SFTTrainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results", # Directory to save checkpoints and logs
    num_train_epochs=3, # Number of training epochs
    per_device_train_batch_size=4, # Batch size per GPU
    gradient_accumulation_steps=2, # Number of steps to accumulate gradients before an optimizer update
    learning_rate=2e-4, # Learning rate
    logging_steps=10, # Log every N steps
    save_steps=50, # Save checkpoint every N steps
    evaluation_strategy="no", # Or "steps" if you have an evaluation dataset
    fp16=True, # Use mixed precision training
    # bf16=True, # For Ampere GPUs and above, use bf16 if available
    report_to="tensorboard", # Or "wandb" if you use Weights & Biases
    push_to_hub=False, # Set to True to push to Hugging Face Hub
)

# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'], # Assuming your dataset is split into 'train'
    peft_config=lora_config,
    dataset_text_field="formatted_text", # The field containing your formatted text
    max_seq_length=512, # Maximum sequence length
    tokenizer=tokenizer,
    args=training_args,
    packing=False, # Set to True to pack multiple short examples into each sequence for efficiency
)

# Start training
trainer.train()

# Save the fine-tuned LoRA adapters
trainer.save_model("./fine_tuned_lora")

Inference with Fine-tuned Model

After fine-tuning, you can load and use your adapter weights for inference.

1. Load Base Model and Adapters

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

model_name = "meta-llama/Llama-2-7b-hf" # Base model used for fine-tuning
adapter_path = "./fine_tuned_lora" # Path where you saved your adapters

# Load the base model (use same quantization as during training if QLoRA was used)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load the PEFT model (LoRA adapters)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.eval() # Set model to evaluation mode

2. Generate Text

prompt = "### Instruction:\nTranslate the following English text to French.\n\n### Input:\nHow are you?\n\n### Response:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Troubleshooting

  • CUDA Out of Memory:
   *   Reduce `per_device_train_batch_size`.
   *   Increase `gradient_accumulation_steps`.
   *   Use QLoRA (4-bit quantization) if not already.
   *   Consider a GPU server with more VRAM. Immers Cloud offers a range of GPUs.
   *   Use `fp16=True` or `bf16=True` in `TrainingArguments`.
  • Slow Training:
   *   Ensure your dataset is loaded efficiently.
   *   Check GPU utilization (`nvidia-smi`).
   *   Use a faster GPU if possible.
  • Model Not Performing Well:
   *   Review your dataset quality and quantity.
   *   Experiment with different LoRA configurations (`r`, `lora_alpha`, `target_modules`).
   *   Adjust training hyperparameters (learning rate, epochs).
   *   Try a different base model.
  • Tokenizer Issues:
   *   Ensure the tokenizer is correctly loaded and configured (e.g., `pad_token`).
   *   Verify that your data formatting matches the tokenizer's expectations.
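
For the out-of-memory case above, note that lowering the batch size while raising gradient accumulation preserves the effective batch size seen by the optimizer, so convergence behavior stays comparable while activation memory drops. A small arithmetic sketch (the numbers mirror the TrainingArguments example earlier; num_gpus is 1 here for simplicity):

```python
def effective_batch_size(per_device, accum_steps, num_gpus=1):
    """Effective batch size per optimizer step = per-device batch
    * gradient accumulation steps * number of GPUs."""
    return per_device * accum_steps * num_gpus

# Settings from the TrainingArguments example: batch 4, accumulation 2.
original = effective_batch_size(per_device=4, accum_steps=2)  # 8

# OOM mitigation: quarter the batch, quadruple the accumulation.
reduced = effective_batch_size(per_device=1, accum_steps=8)   # still 8
```

The trade-off is speed: more accumulation steps mean more forward/backward passes per optimizer update, so training takes longer even though memory pressure is lower.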

Further Reading