Falcon40b is one of the biggest open-source LLMs currently available and comes with a commercial-use-friendly Apache 2.0 license. This makes it interesting for developers looking to bring NLP products to market.
The Falcon-40b-Instruct model, which is the base Falcon40b model fine-tuned for instruction-following and chat, ranks near the top of the Hugging Face LLM benchmark, with only models based on Meta’s non-open-source Llama family performing better. Falcon40b’s performance is far superior to that of other open-source models.
The Falcon family of models belongs to the recent wave of open-source LLMs inspired by the Llama family. The underlying insight behind them is to train smaller models longer (for more epochs) and with more data (over a trillion tokens in this case). The empirical literature on LLM scaling suggests that model size, training compute, and training dataset size should rise in tandem, but this family of models uses far more compute and training data than would be expected for a model of this size.
Fine-tuning with CodeAlpaca
Falcon40b is a pre-trained model that has not been trained for any specific task. Its zero-shot performance is not especially good, and it needs further training (fine-tuning) to function as a chatbot or to follow instructions.
To make it follow instructions, we fine-tune it on a collection of instruction-response pairs like the Alpaca dataset.
For example, this instruction-response pair comes from Alpaca:
Below is an instruction that describes a task. Write a response that
appropriately completes the request.
### Instruction:
Create an array of length 5 which contains all even numbers between 1
and 10.
### Response:
arr = [2, 4, 6, 8, 10]
The Falcon-40b-Instruct model has already been trained as a general instruction-following model, so we decided to go in a different direction. There has been a lot of recent interest in training LLMs to write code, and there are a number of open-source projects in this field, notably Starcoder, Replit3b, CodeGen and CodeGen2. So we decided to try to train Falcon40b to write code from natural language prompts by fine-tuning it with the CodeAlpaca dataset.
CodeAlpaca is a set of 20,000 instruction-input-code triplets that, like the Alpaca dataset, were generated by a closed-source language model. An example:
Below is an instruction that describes a task, paired with an input
that provides further context. Write a response that appropriately
completes the request.
### Instruction:
Generate a snippet of code to extract all the URLs from the given
string.
### Input:
This string contains some urls such as https://www.google.com and
https://www.facebook.com.
### Response:
import re
string = "This string contains some urls such as https://www.google.com and https://www.facebook.com."
urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
print(urls)
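The two prompt layouts shown above (with and without an input field) can be assembled programmatically. The following is a minimal sketch, not the code used for this article: the function name `build_prompt` and the exact blank-line spacing between sections are assumptions, while the wording of the templates is taken from the examples above.

```python
from typing import Optional

# Template wording copied from the examples above; the blank lines
# between sections are an assumption.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str,
                 input_text: Optional[str] = None,
                 response: str = "") -> str:
    """Format one training example (hypothetical helper).

    Pass an empty response to get an inference-time prompt that the
    model is expected to complete.
    """
    if input_text:
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return prompt + response


example = build_prompt(
    "Create an array of length 5 which contains all even numbers between 1 and 10.",
    response="arr = [2, 4, 6, 8, 10]",
)
print(example)
```

At training time the response is appended to the prompt. At inference time it is left empty so that the model generates the text following "### Response:".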
Training Falcon40b specifically on code generation, rather than on the more general task of instruction-following, results in better performance on coding tasks.
Fine-tuning hyperparameters
We fine-tuned Falcon40b using LoRA with 8-bit quantization on four NVIDIA A100 Tensor Core GPUs with 80GB of VRAM. Fine-tuning took approximately 4 hours, at a cost of approximately 25 euros on runpod.io.
We used the following hyperparameters:
- Epochs: 2
- Layer to add LoRA: full attention layer (QKV)
- LoRA rank: 8
- Learning rate: 3.0e-4
- Batch size: 128
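To make the batch size concrete, here is a rough sketch of the training schedule these settings imply. It relies on several assumptions that the list above does not state: the micro-batch size of 4 comes from the finetune.py command later in this article, the dataset size is rounded to 20,000 with no examples held out for evaluation, and the per-GPU figure assumes the four GPUs run data-parallel. The `FinetuneConfig` class and `schedule` function are hypothetical names used only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    epochs: int = 2
    lora_rank: int = 8
    learning_rate: float = 3.0e-4
    batch_size: int = 128       # examples per optimizer step
    micro_batch_size: int = 4   # assumption: taken from the finetune.py command
    num_gpus: int = 4           # four A100 GPUs
    dataset_size: int = 20_000  # approximate size of CodeAlpaca


def schedule(cfg: FinetuneConfig) -> dict:
    """Derive step counts from the configuration (illustrative only)."""
    # Micro-batches that must be accumulated to form one full batch.
    micro_batches_per_step = cfg.batch_size // cfg.micro_batch_size
    # Assumption: data-parallel training splits the accumulation across GPUs.
    accumulation_per_gpu = micro_batches_per_step // cfg.num_gpus
    steps_per_epoch = math.ceil(cfg.dataset_size / cfg.batch_size)
    return {
        "micro_batches_per_step": micro_batches_per_step,
        "accumulation_per_gpu": accumulation_per_gpu,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * cfg.epochs,
    }


plan = schedule(FinetuneConfig())
for name, value in plan.items():
    print(f"{name}: {value}")
```

Under these assumptions, the whole fine-tuning run amounts to only a few hundred optimizer steps.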
Results
We selected 45 coding instructions that we used to assess the ability of the model to follow instructions. Our fine-tuned models are able to perform code generation from natural language instructions, answer questions about segments of code, and answer general computer programming questions.
For example:
Generic programming questions
Question:
Explain the difference between an iterator and a generator
in Python. Provide examples.
Response:
An iterator is an object that can be used to iterate over a
sequence of items, such as a list or a string. It provides a way to
access each item in the sequence one at a time. A generator is a
function that can be used to generate a sequence of items, such as a
list or a string. It provides a way to generate each item in the
sequence one at a time. The main difference between an iterator and
a generator is that an iterator is an object that can be used to
iterate over a sequence, while a generator is a function that can be
used to generate a sequence.
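The response above does not include the examples that the question asked for. For reference, the distinction can be illustrated with a short sketch that is not model output. The names `CountUp` and `count_up` are made up for this illustration.

```python
class CountUp:
    """A hand-written iterator: a class implementing __iter__ and __next__."""

    def __init__(self, limit):
        self.limit = limit
        self.current = 0

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.current >= self.limit:
            raise StopIteration  # signals that the sequence is exhausted
        self.current += 1
        return self.current


def count_up(limit):
    """A generator function: each `yield` produces the next value lazily."""
    current = 0
    while current < limit:
        current += 1
        yield current


print(list(CountUp(3)))   # [1, 2, 3]
print(list(count_up(3)))  # [1, 2, 3]

# Calling a generator function returns a generator object,
# which is itself an iterator.
gen = count_up(3)
print(iter(gen) is gen)   # True
```

Both produce the same values. The generator version lets Python handle the state-keeping and the StopIteration signal that the class has to write out by hand.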
Code generation from natural language prompts
Question:
Write a Python function that accepts a list of words, and returns a
dictionary where keys are the words and values are the frequencies of
the words in the list. Use a for loop, if clause, and dictionary in
your solution.
Response:
def word_frequency(words):
    frequencies = {}
    for word in words:
        if word in frequencies:
            frequencies[word] += 1
        else:
            frequencies[word] = 1
    return frequencies
Answering questions about specific pieces of code
Question:
Explain the following Python script that uses FastAPI to create a basic
'Hello World' API.
from fastapi import FastAPI
app = FastAPI()
@app.get("/")
def read_root():
    return {"Hello": "World"}
Response:
This Python script uses FastAPI to create a basic 'Hello World' API.
FastAPI is a web framework that makes it easy to create APIs in Python.
The @app.get("/") decorator defines a route for the API, and the
read_root() function is called when the route is accessed. The function
returns a dictionary with the key "Hello" and the value "World".
Writing code to handle AI models
Question:
Write a Python script using transformers that tokenizes an input
text and generates model output.
Response:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")
input_text = "This is an example text."
tokenized_text = tokenizer.tokenize(input_text)
model_output = model.generate(input_text)
print(tokenized_text)
print(model_output)
In quantitative terms, our headline result is a pass@1 score of 22% on the HumanEval benchmark, scored using this codebase. Although this is a good result, it is well below the best documented performance of AI models specialized in writing code.
Falcon-40b-Instruct scored 18% on the same tasks. We suspect that this is because training on a much larger dataset makes it better at following instructions, even if it does not make it better at writing code or solving programming problems. Most of the model's knowledge appears to have been acquired in the pre-training phase.
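The HumanEval figures here are presumably pass@1 scores, that is, the share of problems for which a generated solution passes the unit tests. The sketch below shows the commonly used unbiased pass@k estimator, where n samples are drawn per problem and c of them pass. It is an illustration under that assumption, and the scoring code used for this article may differ in its details. The function names and the toy results are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given that c out of n generated samples passed the tests."""
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over all problems.

    `results` is a list of (n_samples, n_correct) pairs, one per problem.
    """
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Toy data, not real results: five problems, one sample each, two solved.
toy_results = [(1, 1), (1, 0), (1, 0), (1, 1), (1, 0)]
print(benchmark_pass_at_k(toy_results, k=1))  # 0.4
```

With a single sample per problem, pass@1 reduces to the fraction of problems solved.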
In a nutshell
We have released two models via HuggingFace: the full-weight model and the LoRA adapter weights for our fine-tuned Falcon40b.
Although Falcon40b is not specialized for coding, it performs quite well on coding problems. With our fine-tuned LoRA adapter, you can turn any Falcon40b installation into a coding assistant by loading only a few megabytes of additional data into memory.
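A back-of-the-envelope calculation shows why the adapter is so small compared to the base model. The architecture values below (hidden size 8192, 128 query heads, 8 key/value heads, 60 layers) are not stated in this article. They are assumptions based on our reading of the published tiiuae/falcon-40b configuration and should be checked against the model's config file. The rank of 8 and the fused query_key_value target come from the hyperparameters above.

```python
# Assumed Falcon40b architecture values (verify against the model config).
HIDDEN_SIZE = 8192
NUM_QUERY_HEADS = 128
NUM_KV_HEADS = 8
NUM_LAYERS = 60

# From the fine-tuning hyperparameters in this article.
LORA_RANK = 8

head_dim = HIDDEN_SIZE // NUM_QUERY_HEADS
# Output width of the fused QKV projection: all query heads plus
# one key and one value projection per KV head.
qkv_out = (NUM_QUERY_HEADS + 2 * NUM_KV_HEADS) * head_dim

# LoRA adds two low-rank matrices per adapted layer:
# A with shape (rank, in_features) and B with shape (out_features, rank).
params_per_layer = LORA_RANK * (HIDDEN_SIZE + qkv_out)
total_lora_params = params_per_layer * NUM_LAYERS


def size_megabytes(num_params: int, bytes_per_param: int) -> float:
    """Storage size in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


print(f"LoRA parameters: {total_lora_params:,}")
print(f"16-bit adapter: {size_megabytes(total_lora_params, 2):.1f} MB")
print(f"32-bit adapter: {size_megabytes(total_lora_params, 4):.1f} MB")
```

Under these assumptions the adapter holds about 8 million parameters, so it is on the order of megabytes to a few tens of megabytes depending on precision. A 40-billion-parameter base model stored in 16-bit precision takes roughly 80 GB.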
Downloading our code and reproducing our results
You can reproduce what we did by checking out our codebase from GitHub. The code consists of fairly transparent wrappers around HuggingFace’s transformers module.
First, get our code from GitHub:
git clone https://github.com/jina-ai/jerboa.git
Then, go into the root directory of the git repository and run the following:
cd jerboa
python finetune.py --base-model tiiuae/falcon-40b --lora-target-modules query_key_value --data-path sahil2801/CodeAlpaca-20k --output-dir ./lora-alpaca-code --batch-size 128 --micro-batch-size 4 --eval-limit 45 --eval-file code_eval.jsonl --wandb-project jerboa --wandb-log-model --wandb-watch gradients --num-epochs 2
If you have trouble, you can revert to the checkpoint of the version we used for this article:
git checkout abe1a23a4e9f5e141e19be0336ca8a4c888dd024
You may also be able to reduce compute and training time with LLM Foundry or some other tool that optimizes for low training costs.
Get involved
Check out Jina AI's website, GitHub repo, and Discord to explore what AI can do for you.