Run your own fine-tuned Large Language Model locally without any internet using Llama.cpp: Part 1

Thomas J Varghese
7 min read · Mar 2, 2024

Large Language Models are one of the most important technologies right now, used for almost every text-based use case. In this article we will explore how we can fine-tune an open-source model such as Llama on our own data and deploy it locally using llama.cpp.

Currently there are a lot of LLM services such as ChatGPT, Gemini, Perplexity etc. If we have a stable internet connection we can easily use these services, but most people and companies have concerns about data security. In 2023, Samsung banned employees from using generative AI services after discovering a leak.

After this, companies wanted to run their own LLMs where possible so that data security can be ensured. Thanks to the open-source community, we now have hundreds of open-source LLMs that we can use commercially, even from tech giants such as Google, Meta, Microsoft etc.

This article is divided into two parts. In the first part we will see how we can fine-tune a model on an external dataset and evaluate its performance. In the second part we will see how we can run this model on our local system with a bare-minimum configuration by converting it to GGUF format using Llama.cpp.

Let's get started!

First, let's understand what fine-tuning exactly is. Training a model from scratch is very expensive, so the best approach is to tune an existing model to our requirements. Fine-tuning is when we take a pre-trained LLM and further train it on a specific, smaller dataset so as to improve its performance and capability in a particular task or domain. We essentially turn a general-purpose model into a specialized model. Consider OpenAI's GPT-3, an advanced large language model crafted for various natural language processing (NLP) applications. Imagine a legal firm aiming to leverage GPT-3 to aid lawyers in drafting legal documents based on written information. Although GPT-3 excels at comprehending and producing general text, it may not be finely tuned for the intricate legal terminology and specific jargon used in the field of law.

Fine-tuning large language models (LLMs) on consumer hardware is challenging due to insufficient VRAM and computing resources. Nevertheless, in this tutorial we will work around these limitations in memory and computing power and train our model using a Kaggle Notebook. Kaggle provides 30 hours of free GPU usage per week, and we can choose either a P100 or a T4. Compared to Google Colab, in my experience Kaggle was much better because of the larger RAM and better GPU.

1. Getting started

We will start by installing the required libraries.

!pip install "peft>=0.4.0" "accelerate>=0.27.2" "bitsandbytes>=0.41.3" "trl>=0.4.7" "safetensors>=0.3.1" "tiktoken"
!pip install "torch>=2.1.1" -U
!pip install "datasets" -U
!pip install -q -U git+https://github.com/huggingface/transformers.git

After installing these libraries, we will import all the necessary modules:

import pandas as pd
import torch
from datasets import Dataset, load_dataset
from peft import LoraConfig, get_peft_model, AutoPeftModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

2. Setting up the model configuration

While you can obtain Meta’s official Llama-2 model through Hugging Face, this requires submitting a request and waiting for confirmation, a process that can take quite some time. Instead of undergoing this waiting period, we will opt for NousResearch’s Llama-2-7b-chat-hf as our foundational model. This alternative model is identical to the original but offers easy accessibility without the need for a waiting period.

We will fine-tune our base model using a smaller dataset, databricks/databricks-dolly-15k, and choose a name for the fine-tuned model. Feel free to use a dataset of your own choice; there are thousands of free datasets we can experiment with.

# Model from Hugging Face hub
base_model = "NousResearch/Llama-2-7b-chat-hf"

# New instruction dataset
dolly_15K = "databricks/databricks-dolly-15k"

# Fine-tuned model
new_model = "llama-2-7b-chat-dolly"

We will load this dataset from the Hugging Face Hub; it has more than 15,000 entries to fine-tune on.

3. Data Preparation

Let's download our dataset and do the necessary preprocessing.

# Download the dataset
dataset = load_dataset(dolly_15K, split="train")

Let's explore the dataset.

print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')
Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']

We can see there are more than 15,000 entries and 4 columns. Each of these columns has the following meaning:

  • An instruction: What could be entered by the user, such as a question
  • A context: Help to interpret the sample
  • A response: Answer to the instruction
  • A category: Classify the sample between Open Q&A, Closed Q&A, Extract information from Wikipedia, Summarize information from Wikipedia, Brainstorming, Classification, Creative writing
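
For example, printing a single record shows what one of these entries looks like (output omitted here):

print(dataset[0])  # one entry with its instruction, context, response and category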

Let's preprocess the dataset. First, let's set the instruction format.

def create_prompt(row):
    prompt = f"Instruction: {row['instruction']}\nContext: {row['context']}\nResponse: {row['response']}"
    return prompt

# Build the prompt column in pandas, then convert back to a Hugging Face Dataset
df = dataset.to_pandas()
df['text'] = df.apply(create_prompt, axis=1)
data = Dataset.from_pandas(df)

Using this function we transform our data into the required format, and it is ready for fine-tuning.

4. Quantization and QLoRA

Utilizing 4-bit quantization through QLoRA enables efficient fine-tuning of extensive large language models (LLMs) on standard consumer hardware without compromising high performance. This significantly enhances accessibility and practicality for real-world applications.

QLoRA involves quantizing a pre-trained language model to 4 bits and then freezing the parameters. A limited number of trainable Low-Rank Adapter layers are subsequently incorporated into the model.

During the fine-tuning process, gradients are backpropagated through the frozen 4-bit quantized model, affecting only the Low-Rank Adapter layers. Consequently, the entire pretrained model maintains a fixed 4-bit configuration, with updates limited to the adapters. According to the QLoRA paper, this 4-bit quantization has minimal adverse impact on model performance.

Image from QLoRA paper

You can read the paper to understand it better.

For our case, we will create 4-bit quantization with NF4 type configuration using BitsAndBytes.

# Get the compute dtype
compute_dtype = getattr(torch, "float16")

# BitsAndBytes 4-bit config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype
)
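
The trainer in step 7 expects a model object, so this is where the base model would typically be loaded with the quantization config we just created. A minimal sketch, reusing the base_model and bnb_config variables defined above:

# Load the base model in 4-bit using the config above
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto"
)
model.config.use_cache = False  # disable the KV cache during fine-tuning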

5. Loading the Tokenizer

We need to load the tokenizer; we will be using the one from Hugging Face. The role of the tokenizer is to convert text into a data format that can be processed by our model. Models can only process numbers, so tokenizers need to convert our text inputs into numerical data.

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

6. PEFT parameters

Conventional fine-tuning of pre-trained language models (PLMs) involves updating all parameters of the model, resulting in high computational costs and demanding extensive data.

Parameter-Efficient Fine-Tuning (PEFT), on the other hand, operates by selectively updating a limited subset of the model’s parameters, significantly enhancing efficiency. For more information on the concept of parameters, refer to the official documentation for PEFT.

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

7. Training Parameters and Model Finetuning

There are a lot of hyperparameters that we can use to optimize the training process. Read about them here.

# Define the training arguments. For full list of arguments, check
#<https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments>
args = TrainingArguments(
    output_dir='llama-dolly-7b',
    warmup_steps=1,
    num_train_epochs=10,            # adjust based on the data size
    per_device_train_batch_size=2,  # use 4 if you have more GPU RAM
    save_strategy="epoch",          # or "steps"
    logging_steps=50,
    optim="paged_adamw_32bit",
    learning_rate=2.5e-5,
    fp16=True,
    seed=42,
    max_steps=500,
    save_steps=50,                  # save checkpoints every 50 steps
    do_eval=False,
)

Supervised fine-tuning (SFT) plays a crucial role in the process of reinforcement learning from human feedback (RLHF). HuggingFace’s TRL library offers a user-friendly API that simplifies the creation of SFT models and facilitates their training on your dataset, requiring just a few lines of code. The library is equipped with tools designed for training language models through a reinforcement learning pipeline, beginning with supervised fine-tuning, followed by reward modeling, and concluding with proximal policy optimization (PPO).

For the SFT Trainer, you’ll need to provide the model, dataset, Lora configuration, tokenizer, and training parameters.

# Create the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=data,
    peft_config=peft_config,
    dataset_text_field='text',
    max_seq_length=None,
    tokenizer=tokenizer,
    args=args,
    packing=False,
)

Now we will train by calling the train() method on the trainer object.

trainer.train()

Once training is done, we save the model and the tokenizer.

trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

We can test our model's responses by asking questions of our choice, similar to the ones in our dataset.

prompt = " "
input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()outputs = new_model.generate(input_ids=input_ids,
max_new_tokens=200,
do_sample=True,
top_p=0.9,
temperature=0.1)
result = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]print(result)

8. Push to Hugging Face
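
The push below uses a merged_model object, which implies merging the LoRA adapter back into the base weights first. Here is a minimal sketch of that merge step, assuming the adapter was saved under new_model as above:

# Load the saved LoRA adapter together with its base model, then merge
# the adapter weights into the base weights to get a standalone model
merged_model = AutoPeftModelForCausalLM.from_pretrained(
    new_model,
    torch_dtype=torch.float16,
    device_map="auto"
)
merged_model = merged_model.merge_and_unload()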

Finally we can push our model to Hugging Face

# Login to Hugging Face
!huggingface-cli login

hf_model_repo = "<REPO PATH>"
merged_model.push_to_hub(hf_model_repo)
tokenizer.push_to_hub(hf_model_repo)

That's it! You have fine-tuned your model on a custom dataset and seen how it works. In the next part we will see how we can load this model locally with just a CPU.
