<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://serverrental.store/index.php?action=history&amp;feed=atom&amp;title=Fine-tuning_LLMs_on_GPU_Server</id>
	<title>Fine-tuning LLMs on GPU Server - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://serverrental.store/index.php?action=history&amp;feed=atom&amp;title=Fine-tuning_LLMs_on_GPU_Server"/>
	<link rel="alternate" type="text/html" href="https://serverrental.store/index.php?title=Fine-tuning_LLMs_on_GPU_Server&amp;action=history"/>
	<updated>2026-04-14T23:05:25Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.36.1</generator>
	<entry>
		<id>https://serverrental.store/index.php?title=Fine-tuning_LLMs_on_GPU_Server&amp;diff=5794&amp;oldid=prev</id>
		<title>Admin: New server guide</title>
		<link rel="alternate" type="text/html" href="https://serverrental.store/index.php?title=Fine-tuning_LLMs_on_GPU_Server&amp;diff=5794&amp;oldid=prev"/>
		<updated>2026-04-13T10:01:24Z</updated>

		<summary type="html">&lt;p&gt;New server guide&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Fine-tuning LLMs on GPU Server =&lt;br /&gt;
This guide provides a practical, hands-on approach to fine-tuning Large Language Models (LLMs) using LoRA and QLoRA techniques on a GPU server. We will cover setting up the environment, preparing data, and executing the fine-tuning process with practical examples.&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
Fine-tuning LLMs allows you to adapt pre-trained models to specific tasks or datasets, improving their performance and relevance. LoRA (Low-Rank Adaptation) and QLoRA are efficient fine-tuning methods that significantly reduce computational resources and memory requirements, making it feasible to fine-tune large models on more accessible hardware.&lt;br /&gt;
&lt;br /&gt;
GPU servers are essential for LLM fine-tuning. For cost-effective and powerful GPU instances, consider exploring options at [https://en.immers.cloud/signup/r/20241007-8310688-334/ Immers Cloud], with pricing starting from $0.23/hr for inference to $4.74/hr for H200.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
Before you begin, ensure you have the following:&lt;br /&gt;
&lt;br /&gt;
*   A Linux server with a compatible NVIDIA GPU.&lt;br /&gt;
*   NVIDIA drivers, CUDA Toolkit, and cuDNN installed.&lt;br /&gt;
*   Python 3.8+ and pip installed.&lt;br /&gt;
*   Basic understanding of Linux command line.&lt;br /&gt;
*   Familiarity with Python and deep learning concepts.&lt;br /&gt;
*   Sufficient disk space for models and datasets.&lt;br /&gt;
&lt;br /&gt;
== Setting up the Environment ==&lt;br /&gt;
This section outlines the steps to prepare your server for LLM fine-tuning.&lt;br /&gt;
&lt;br /&gt;
=== 1. Install NVIDIA Drivers, CUDA, and cuDNN ===&lt;br /&gt;
Ensure your NVIDIA drivers, CUDA Toolkit, and cuDNN are correctly installed. You can usually find installation guides on the NVIDIA website. Verify the installation by running:&lt;br /&gt;
&amp;lt;pre&amp;gt;nvidia-smi&amp;lt;/pre&amp;gt;&lt;br /&gt;
This command should display information about your GPU(s).&lt;br /&gt;
&lt;br /&gt;
=== 2. Create a Python Virtual Environment ===&lt;br /&gt;
It's highly recommended to use a virtual environment to manage Python dependencies.&lt;br /&gt;
&amp;lt;pre&amp;gt;sudo apt update&lt;br /&gt;
sudo apt install python3-venv -y&lt;br /&gt;
python3 -m venv llm_env&lt;br /&gt;
source llm_env/bin/activate&amp;lt;/pre&amp;gt;&lt;br /&gt;
You should see `(llm_env)` at the beginning of your terminal prompt.&lt;br /&gt;
&lt;br /&gt;
=== 3. Install Required Python Libraries ===&lt;br /&gt;
Install the necessary libraries for LLM fine-tuning. This includes PyTorch, Transformers, PEFT (Parameter-Efficient Fine-Tuning), and bitsandbytes for QLoRA.&lt;br /&gt;
&amp;lt;pre&amp;gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # Adjust cu118 based on your CUDA version&lt;br /&gt;
pip install transformers peft bitsandbytes accelerate scipy datasets&lt;br /&gt;
pip install trl # For easier training with SFTTrainer&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data Preparation ==&lt;br /&gt;
High-quality data is crucial for successful fine-tuning.&lt;br /&gt;
&lt;br /&gt;
=== 1. Obtain or Create Your Dataset ===&lt;br /&gt;
Your dataset should be in a format that can be easily processed, typically JSON or CSV. For instruction fine-tuning, a common format is a list of dictionaries, where each dictionary represents a training example with fields like &amp;quot;instruction&amp;quot;, &amp;quot;input&amp;quot;, and &amp;quot;output&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Example JSON structure:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[&lt;br /&gt;
  {&lt;br /&gt;
    &amp;quot;instruction&amp;quot;: &amp;quot;Translate the following English text to French.&amp;quot;,&lt;br /&gt;
    &amp;quot;input&amp;quot;: &amp;quot;Hello, how are you?&amp;quot;,&lt;br /&gt;
    &amp;quot;output&amp;quot;: &amp;quot;Bonjour, comment allez-vous ?&amp;quot;&lt;br /&gt;
  },&lt;br /&gt;
  {&lt;br /&gt;
    &amp;quot;instruction&amp;quot;: &amp;quot;Summarize the following article.&amp;quot;,&lt;br /&gt;
    &amp;quot;input&amp;quot;: &amp;quot;The quick brown fox jumps over the lazy dog. This is a classic pangram.&amp;quot;,&lt;br /&gt;
    &amp;quot;output&amp;quot;: &amp;quot;A pangram featuring a quick brown fox and a lazy dog.&amp;quot;&lt;br /&gt;
  }&lt;br /&gt;
]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 2. Format the Data for Training ===&lt;br /&gt;
You'll need to format your data into a structure compatible with the `transformers` library. This often involves tokenizing the text and creating input/output pairs. The `datasets` library can help with loading and processing.&lt;br /&gt;
&lt;br /&gt;
Example using `datasets`:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from datasets import load_dataset&lt;br /&gt;
&lt;br /&gt;
# Load your dataset (replace 'path/to/your/dataset.json' with your file)&lt;br /&gt;
dataset = load_dataset('json', data_files='path/to/your/dataset.json')&lt;br /&gt;
&lt;br /&gt;
# Example of how you might process it (this is a simplified view)&lt;br /&gt;
def format_prompt(example):&lt;br /&gt;
    # This function needs to be adapted to your specific model and data format&lt;br /&gt;
    return f&amp;quot;### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# Apply formatting (if needed) - often done within the trainer&lt;br /&gt;
# dataset = dataset.map(lambda x: {'formatted_text': format_prompt(x)})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
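Before training, it can save time to sanity-check that every record in your JSON file has the expected keys. A minimal sketch in plain Python, assuming the instruction/input/output structure shown earlier (the function name `validate_records` is illustrative):&lt;br /&gt;

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_records(records):
    """Return a list of (index, missing_keys) for malformed records."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - set(rec)
        if missing:
            problems.append((i, sorted(missing)))
    return problems

# Example: the second record is missing its "output" field
data = json.loads('[{"instruction": "a", "input": "b", "output": "c"},'
                  ' {"instruction": "a", "input": "b"}]')
print(validate_records(data))  # [(1, ['output'])]
```

Running a check like this before a multi-hour training job is much cheaper than discovering a malformed record mid-epoch.&lt;br /&gt;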
&lt;br /&gt;
== LoRA and QLoRA Fine-tuning ==&lt;br /&gt;
This section covers the practical steps for fine-tuning using LoRA and QLoRA.&lt;br /&gt;
&lt;br /&gt;
=== 1. Choose a Base Model ===&lt;br /&gt;
Select a pre-trained LLM to fine-tune. Popular choices include models from the Llama, Mistral, or GPT-2 families. Make sure the model's weights fit within your GPU's VRAM, applying quantization if necessary.&lt;br /&gt;
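As a rough sizing check, weight memory scales with parameter count times bytes per parameter (4 bytes for fp32, 2 for fp16/bf16, about 0.5 for 4-bit), before counting activations, gradients, and optimizer state. A back-of-the-envelope sketch in plain Python (the numbers are approximations, not measurements):&lt;br /&gt;

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "nf4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

# A 7B-parameter model:
print(round(weight_memory_gb(7e9, "fp16"), 1))  # 13.0 GiB in half precision
print(round(weight_memory_gb(7e9, "nf4"), 1))   # 3.3 GiB with 4-bit quantization
```

This is why 4-bit QLoRA makes a 7B model practical on a single 24 GB GPU, while full half-precision fine-tuning of the same model generally is not.&lt;br /&gt;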
&lt;br /&gt;
=== 2. Configure LoRA/QLoRA ===&lt;br /&gt;
The PEFT (Parameter-Efficient Fine-Tuning) library simplifies configuring both LoRA and QLoRA.&lt;br /&gt;
&lt;br /&gt;
''LoRA Configuration:''&lt;br /&gt;
LoRA works by injecting trainable low-rank matrices into the layers of a pre-trained model.&lt;br /&gt;
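The savings follow directly from the shapes: a frozen weight of shape (d_out, d_in) receives an update B·A, where B is (d_out, r) and A is (r, d_in) with rank r much smaller than the layer dimensions. A quick parameter count in plain Python, using illustrative dimensions:&lt;br /&gt;

```python
def lora_trainable_params(d_out, d_in, r):
    """Trainable parameters for one LoRA-adapted layer: B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in

d, r = 4096, 16            # hidden size typical of a 7B model; a common LoRA rank
full = d * d               # 16,777,216 params in the frozen weight matrix
lora = lora_trainable_params(d, d, r)  # 131,072 trainable params
print(f"trainable fraction: {lora / full:.3%}")  # trainable fraction: 0.781%
```

Training well under 1% of each adapted layer's parameters is what makes LoRA fit in modest VRAM budgets.&lt;br /&gt;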
&lt;br /&gt;
''QLoRA Configuration:''&lt;br /&gt;
QLoRA further optimizes LoRA by quantizing the base model to 4-bit precision, significantly reducing memory usage while maintaining performance. This requires the `bitsandbytes` library.&lt;br /&gt;
&lt;br /&gt;
Example `LoraConfig` from `peft`:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from peft import LoraConfig, get_peft_model&lt;br /&gt;
&lt;br /&gt;
# Define LoRA configuration&lt;br /&gt;
lora_config = LoraConfig(&lt;br /&gt;
    r=16,  # Rank of the update matrices&lt;br /&gt;
    lora_alpha=32, # Alpha parameter for LoRA scaling&lt;br /&gt;
    target_modules=[&amp;quot;q_proj&amp;quot;, &amp;quot;v_proj&amp;quot;], # Modules to apply LoRA to (model-specific)&lt;br /&gt;
    lora_dropout=0.05, # Dropout probability for LoRA layers&lt;br /&gt;
    bias=&amp;quot;none&amp;quot;, # Bias type&lt;br /&gt;
    task_type=&amp;quot;CAUSAL_LM&amp;quot; # Task type&lt;br /&gt;
)&lt;br /&gt;
&lt;br /&gt;
# Load your base model (example with Hugging Face Transformers)&lt;br /&gt;
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig&lt;br /&gt;
import torch&lt;br /&gt;
&lt;br /&gt;
model_name = &amp;quot;meta-llama/Llama-2-7b-hf&amp;quot; # Replace with your chosen model&lt;br /&gt;
&lt;br /&gt;
# QLoRA quantization configuration&lt;br /&gt;
bnb_config = BitsAndBytesConfig(&lt;br /&gt;
    load_in_4bit=True,&lt;br /&gt;
    bnb_4bit_use_double_quant=True,&lt;br /&gt;
    bnb_4bit_quant_type=&amp;quot;nf4&amp;quot;,&lt;br /&gt;
    bnb_4bit_compute_dtype=torch.bfloat16,&lt;br /&gt;
)&lt;br /&gt;
&lt;br /&gt;
# Load the model with QLoRA configuration&lt;br /&gt;
model = AutoModelForCausalLM.from_pretrained(&lt;br /&gt;
    model_name,&lt;br /&gt;
    quantization_config=bnb_config,&lt;br /&gt;
    device_map=&amp;quot;auto&amp;quot; # Automatically distributes model across available GPUs&lt;br /&gt;
)&lt;br /&gt;
&lt;br /&gt;
# Load tokenizer&lt;br /&gt;
tokenizer = AutoTokenizer.from_pretrained(model_name)&lt;br /&gt;
tokenizer.pad_token = tokenizer.eos_token # Set pad token&lt;br /&gt;
&lt;br /&gt;
# Apply LoRA to the model&lt;br /&gt;
model = get_peft_model(model, lora_config)&lt;br /&gt;
&lt;br /&gt;
# Print trainable parameters&lt;br /&gt;
model.print_trainable_parameters()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 3. Set up the Trainer ===&lt;br /&gt;
The `transformers` library provides a `Trainer` class, and `trl` offers `SFTTrainer` which is optimized for supervised fine-tuning.&lt;br /&gt;
&lt;br /&gt;
Example `TrainingArguments` and `SFTTrainer`:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from transformers import TrainingArguments&lt;br /&gt;
from trl import SFTTrainer&lt;br /&gt;
&lt;br /&gt;
# Define training arguments&lt;br /&gt;
training_args = TrainingArguments(&lt;br /&gt;
    output_dir=&amp;quot;./results&amp;quot;, # Directory to save checkpoints and logs&lt;br /&gt;
    num_train_epochs=3, # Number of training epochs&lt;br /&gt;
    per_device_train_batch_size=4, # Batch size per GPU&lt;br /&gt;
    gradient_accumulation_steps=2, # Accumulate gradients over N steps before each optimizer update&lt;br /&gt;
    learning_rate=2e-4, # Learning rate&lt;br /&gt;
    logging_steps=10, # Log every N steps&lt;br /&gt;
    save_steps=50, # Save checkpoint every N steps&lt;br /&gt;
    evaluation_strategy=&amp;quot;no&amp;quot;, # Or &amp;quot;steps&amp;quot; if you have an evaluation dataset&lt;br /&gt;
    fp16=True, # Use mixed precision training&lt;br /&gt;
    # bf16=True, # On Ampere GPUs and newer, prefer bf16 (and set fp16=False)&lt;br /&gt;
    report_to=&amp;quot;tensorboard&amp;quot;, # Or &amp;quot;wandb&amp;quot; if you use Weights &amp;amp; Biases&lt;br /&gt;
    push_to_hub=False, # Set to True to push to Hugging Face Hub&lt;br /&gt;
)&lt;br /&gt;
&lt;br /&gt;
# Initialize SFTTrainer&lt;br /&gt;
trainer = SFTTrainer(&lt;br /&gt;
    model=model,&lt;br /&gt;
    train_dataset=dataset['train'], # Assuming your dataset is split into 'train'&lt;br /&gt;
    peft_config=lora_config,&lt;br /&gt;
    dataset_text_field=&amp;quot;formatted_text&amp;quot;, # The field containing your formatted text&lt;br /&gt;
    max_seq_length=512, # Maximum sequence length&lt;br /&gt;
    tokenizer=tokenizer,&lt;br /&gt;
    args=training_args,&lt;br /&gt;
    packing=False, # Set to True to pack multiple short examples into one sequence for efficiency&lt;br /&gt;
)&lt;br /&gt;
&lt;br /&gt;
# Start training&lt;br /&gt;
trainer.train()&lt;br /&gt;
&lt;br /&gt;
# Save the fine-tuned LoRA adapters&lt;br /&gt;
trainer.save_model(&amp;quot;./fine_tuned_lora&amp;quot;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
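Note that the effective batch size in the arguments above is per_device_train_batch_size times gradient_accumulation_steps times the number of GPUs; raising accumulation while lowering the per-device batch is the usual way to keep the effective batch size constant under memory pressure. A quick check in plain Python:&lt;br /&gt;

```python
def effective_batch_size(per_device, grad_accum, num_gpus=1):
    """Number of training examples seen per optimizer update."""
    return per_device * grad_accum * num_gpus

# The TrainingArguments above on a single GPU:
print(effective_batch_size(4, 2))  # 8
# Halving the per-device batch while doubling accumulation keeps it at 8:
print(effective_batch_size(2, 4))  # 8
```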
&lt;br /&gt;
== Inference with Fine-tuned Model ==&lt;br /&gt;
After fine-tuning, you can load and use your adapter weights for inference.&lt;br /&gt;
&lt;br /&gt;
=== 1. Load Base Model and Adapters ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig&lt;br /&gt;
from peft import PeftModel&lt;br /&gt;
import torch&lt;br /&gt;
&lt;br /&gt;
model_name = &amp;quot;meta-llama/Llama-2-7b-hf&amp;quot; # Base model used for fine-tuning&lt;br /&gt;
adapter_path = &amp;quot;./fine_tuned_lora&amp;quot; # Path where you saved your adapters&lt;br /&gt;
&lt;br /&gt;
# Load the base model (use same quantization as during training if QLoRA was used)&lt;br /&gt;
bnb_config = BitsAndBytesConfig(&lt;br /&gt;
    load_in_4bit=True,&lt;br /&gt;
    bnb_4bit_use_double_quant=True,&lt;br /&gt;
    bnb_4bit_quant_type=&amp;quot;nf4&amp;quot;,&lt;br /&gt;
    bnb_4bit_compute_dtype=torch.bfloat16,&lt;br /&gt;
)&lt;br /&gt;
&lt;br /&gt;
base_model = AutoModelForCausalLM.from_pretrained(&lt;br /&gt;
    model_name,&lt;br /&gt;
    quantization_config=bnb_config,&lt;br /&gt;
    device_map=&amp;quot;auto&amp;quot;&lt;br /&gt;
)&lt;br /&gt;
tokenizer = AutoTokenizer.from_pretrained(model_name)&lt;br /&gt;
tokenizer.pad_token = tokenizer.eos_token&lt;br /&gt;
&lt;br /&gt;
# Load the PEFT model (LoRA adapters)&lt;br /&gt;
model = PeftModel.from_pretrained(base_model, adapter_path)&lt;br /&gt;
model = model.eval() # Set model to evaluation mode&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 2. Generate Text ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
prompt = &amp;quot;### Instruction:\nTranslate the following English text to French.\n\n### Input:\nHow are you?\n\n### Response:\n&amp;quot;&lt;br /&gt;
&lt;br /&gt;
inputs = tokenizer(prompt, return_tensors=&amp;quot;pt&amp;quot;).to(model.device)&lt;br /&gt;
&lt;br /&gt;
with torch.no_grad():&lt;br /&gt;
    outputs = model.generate(&lt;br /&gt;
        **inputs,&lt;br /&gt;
        max_new_tokens=50,&lt;br /&gt;
        do_sample=True,&lt;br /&gt;
        temperature=0.7,&lt;br /&gt;
        top_p=0.9&lt;br /&gt;
    )&lt;br /&gt;
&lt;br /&gt;
response = tokenizer.decode(outputs[0], skip_special_tokens=True)&lt;br /&gt;
print(response)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
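Note that `model.generate` returns the prompt tokens followed by the new tokens, so the decoded string above includes the prompt text. One way to isolate just the response is to slice off the prompt length before decoding. The indexing idea in plain Python, with token IDs simulated as lists since the real objects are tensors:&lt;br /&gt;

```python
# Simulated token IDs: generate() output = prompt IDs followed by new IDs
prompt_ids = [101, 2054, 2003]          # stand-ins for the tokenized prompt
output_ids = prompt_ids + [7592, 999]   # generate() keeps the prompt at the front

# With real tensors the equivalent is: outputs[0][inputs["input_ids"].shape[1]:]
new_token_ids = output_ids[len(prompt_ids):]
print(new_token_ids)  # [7592, 999]
```

Decoding only `new_token_ids` yields the model's response without the echoed instruction template.&lt;br /&gt;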
&lt;br /&gt;
== Troubleshooting ==&lt;br /&gt;
*   '''CUDA Out of Memory''':&lt;br /&gt;
    *   Reduce `per_device_train_batch_size`.&lt;br /&gt;
    *   Increase `gradient_accumulation_steps`.&lt;br /&gt;
    *   Use QLoRA (4-bit quantization) if not already.&lt;br /&gt;
    *   Consider a GPU server with more VRAM. [https://en.immers.cloud/signup/r/20241007-8310688-334/ Immers Cloud] offers a range of GPUs.&lt;br /&gt;
    *   Use `fp16=True` or `bf16=True` in `TrainingArguments`.&lt;br /&gt;
*   '''Slow Training''':&lt;br /&gt;
    *   Ensure your dataset is loaded efficiently.&lt;br /&gt;
    *   Check GPU utilization (`nvidia-smi`).&lt;br /&gt;
    *   Use a faster GPU if possible.&lt;br /&gt;
*   '''Model Not Performing Well''':&lt;br /&gt;
    *   Review your dataset quality and quantity.&lt;br /&gt;
    *   Experiment with different LoRA configurations (`r`, `lora_alpha`, `target_modules`).&lt;br /&gt;
    *   Adjust training hyperparameters (learning rate, epochs).&lt;br /&gt;
    *   Try a different base model.&lt;br /&gt;
*   '''Tokenizer Issues''':&lt;br /&gt;
    *   Ensure the tokenizer is correctly loaded and configured (e.g., `pad_token`).&lt;br /&gt;
    *   Verify that your data formatting matches the tokenizer's expectations.&lt;br /&gt;
&lt;br /&gt;
== Further Reading ==&lt;br /&gt;
*   [[LLM Basics]]&lt;br /&gt;
*   [[GPU Server Management]]&lt;br /&gt;
*   Hugging Face PEFT Documentation: [https://huggingface.co/docs/peft/index]&lt;br /&gt;
*   Hugging Face Transformers Documentation: [https://huggingface.co/docs/transformers/index]&lt;br /&gt;
&lt;br /&gt;
[[Category:AI and GPU]]&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:LLM]]&lt;br /&gt;
&lt;br /&gt;
{{Exchange Box}}&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
</feed>