Fine-tuning is a transfer learning technique developed as part of the Deep Learning revolution in artificial intelligence. Instead of learning a new task from scratch, fine-tuning takes a model pre-trained on a related task and trains it further for the new one. Alternatively, it can mean taking a model pre-trained for an open-domain task and further training it for a domain-specific one.
Compared to training from scratch, fine-tuning is a much more cost-efficient solution whenever it is feasible. It requires:
- less labeled data, as there is no need to learn everything all over again. All the training is devoted to acquiring domain-specific knowledge.
- less time to train, since the number of trainable variables is much smaller and most layers in the deep neural network are frozen during fine-tuning.
Leveraging and transferring pre-existing training to new problems is one of the major practical developments of the Deep Learning revolution. It is highly effective, economical, and environmentally friendly. This is especially true for small businesses and individuals that hope to take advantage of new AI technologies.
Or at least that's what all the deep learning tweets will tell you.
But if you think about it, or try to use fine-tuning in a real-world use case, you will quickly find that the promise comes with a lot of caveats:
- Exactly how much data do you need to get a good result? One labeled data point? Ten? One thousand? Ten thousand?
- Exactly how much time do you need to get good results? One minute of fine-tuning? An hour? A day? A week?
These are not trivial questions, even for large enterprises, but they are especially critical to SMEs and individuals who have limited resources to invest in AI. Domain-specific data is neither free nor error-free, and it requires costly human labor to generate. Top-of-the-line GPU pipelines are frighteningly expensive to buy and maintain, so most enterprises rent time on a cloud service. An unplanned AWS bill in the thousands of euros is unwelcome at the best of times.
This article will give you a quantitative answer to these questions, using the Jina AI Finetuner. This tool is designed to improve the performance of pre-trained models and make them production-ready without expensive hardware.
Experiment design
We designed two experiments to quantitatively study how labeled data and training time affect fine-tuning performance. For each experiment, we construct three multimodal search tasks by fine-tuning three deep neural networks. We chose seven datasets, two of which are non-domain-specific public datasets, to ensure the generality of our experiment.
We measure the performance of fine-tuned models by evaluating their ability to perform search tasks, as measured by Mean Reciprocal Rank (mRR), Recall, and Mean Average Precision (mAP). These metrics are calculated using the top 20 results of each search in the validation subset held out from each dataset.
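To make the three metrics concrete, here is a minimal sketch of how they can be computed over the top 20 results of each query. It is not the Finetuner implementation: the function names, the document IDs, and the toy data are all made up for illustration, and the normalization of average precision (dividing by the number of relevant items, capped at k) is one common convention among several. The sketch also assumes each result list contains no duplicate IDs.

```python
from typing import Callable, Dict, Sequence, Set


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str], k: int = 20) -> float:
    """1 / rank of the first relevant item in the top k, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 20) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def average_precision(ranked: Sequence[str], relevant: Set[str], k: int = 20) -> float:
    """Average of precision@rank taken at each rank where a relevant item appears.

    Assumption: normalized by min(number of relevant items, k).
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = min(len(relevant), k)
    return precision_sum / denominator if denominator else 0.0


def mean_over_queries(
    metric: Callable[[Sequence[str], Set[str], int], float],
    results: Dict[str, Sequence[str]],
    ground_truth: Dict[str, Set[str]],
    k: int = 20,
) -> float:
    """Average a per-query metric over all queries (gives mRR, mean recall, mAP)."""
    scores = [metric(results[q], ground_truth[q], k) for q in results]
    return sum(scores) / len(scores)


# Hypothetical search results (ranked best-first) and relevance labels.
results = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d9", "d8"]}
ground_truth = {"q1": {"d1", "d2"}, "q2": {"d9"}}

mrr = mean_over_queries(reciprocal_rank, results, ground_truth)
recall = mean_over_queries(recall_at_k, results, ground_truth)
mean_ap = mean_over_queries(average_precision, results, ground_truth)
print(f"mRR@20={mrr:.3f} recall@20={recall:.3f} mAP@20={mean_ap:.3f}")
```

All three metrics range from 0 to 1, and higher is better in each case.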
The table below summarizes the tasks, models, and datasets used in our experiments, as well as their performance metrics without any fine-tuning.
Task | Model | Dataset | Pretrained mRR@20 | Pretrained recall@20 | Pretrained mAP@20 |
---|---|---|---|---|---|
Text-to-text search | bert-base-cased | QuoraQA | 0.835 | 0.9154 | 0.8143 |
Text-to-text search | bert-base-cased | Clinc150 | 0.7628 | 0.2313 | 0.6131 |
Text-to-image search | openai/clip-vit-base-patch32 | Flickr8K | 0.5512 | 0.7626 | 0.5893 |
Text-to-image search | openai/clip-vit-base-patch32 | Flickr30K | 0.5901 | 0.792 | 0.6272 |
Text-to-image search | openai/clip-vit-base-patch32 | COCO Captions | 0.2118 | 0.4847 | 0.2118 |
Image-to-image search | resnet50 | TLL | 0.0845 | 0.2076 | 0.0845 |
Image-to-image search | resnet50 | Celeba | 0.1416 | 0.0241 | 0.1279 |
We already knew, even before performing any experiments, that all else being equal, more labeled data and more training time positively influence performance. But it's not enough to say that. We need to know how much is enough.
The overarching question of our experiment is:
Can we estimate the minimum amount of domain- and task-specific labeled data and training time needed to deliver adequate performance?
How much labeled data is needed for good fine-tuning?
We gradually increase the amount of labeled data fed to Finetuner from 100 items to 100,000 and see how this affects performance on the metrics described in the previous section.
We further calculate the return on investment (ROI) by dividing the relative improvement (a proxy for net profit) by the amount of labeled data (a proxy for investment cost). This is useful because it indicates the point at which adding more data starts to produce diminishing returns.
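The ROI calculation can be sketched in a few lines. This is one literal reading of the definition above: relative improvement is taken as the gain over the pre-trained score divided by the pre-trained score, and that value is then divided by the number of labeled items. The function names and all scores below are hypothetical and are not results from our experiments.

```python
def relative_improvement(pretrained: float, finetuned: float) -> float:
    """Gain over the pre-trained score, as a fraction of the pre-trained score."""
    if pretrained <= 0:
        raise ValueError("pretrained score must be positive")
    return (finetuned - pretrained) / pretrained


def roi_per_item(pretrained: float, finetuned: float, n_items: int) -> float:
    """Relative improvement (proxy for profit) per labeled item (proxy for cost)."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return relative_improvement(pretrained, finetuned) / n_items


# Made-up scores for illustration only: metric after fine-tuning on n items.
pretrained_score = 0.50
scores_by_data_size = {100: 0.55, 1000: 0.60, 10000: 0.625}

roi_curve = {
    n: roi_per_item(pretrained_score, score, n)
    for n, score in scores_by_data_size.items()
}
for n, roi in roi_curve.items():
    print(f"{n:>6} items: ROI per item = {roi:.6f}")
```

In this toy example the score keeps rising with more data, yet the ROI per item falls at every step, which is the pattern of diminishing returns the ROI is meant to expose.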
In the figures below, the X-axis represents the amount of labeled data, and the Y-axis represents the relative improvement over the pre-trained model. The higher, the better.
These results are promising but not particularly surprising. Performance improves with more labeled data on nearly all tasks and all datasets, more for some tasks and datasets than for others. However, the only conclusion we can draw from these figures is that the Finetuner works as advertised. So far so good.
What is more interesting is the ROI curve. In the figures below, the X-axis represents the amount of labeled data, and the Y-axis represents the ROI per labeled data item. The higher, the better. In particular, ROI=0 means adding new labeled data at that point no longer contributes to any improvement.
Surprisingly, the ROI per unit of new labeled data starts to drop almost immediately. We expected it to decrease eventually, but not this early.
How much time is needed for fine-tuning?
To measure the value of added training time, we fixed the amount of new labeled data to 1000 items, and then we gradually increased the number of training epochs from 1 to 10. At each increase, we measure improvement over the pre-trained model and calculate the ROI. For these experiments, the ROI is calculated by dividing the relative improvement by the elapsed time in seconds. This means that when ROI=0, adding training time no longer improves performance.
We knew in advance that adding more time does not guarantee any improvement at all. It can, in fact, reduce performance due to the overfitting problem. Some models (e.g. CLIP) are more prone to overfitting than others. In principle, if we keep training with the same 1000 data points over and over, we are guaranteed to overfit the data and the overall performance will drop.
Let's look at the ROI curves.
The ROI drops immediately after the first epoch of fine-tuning. In the previous experiment, the ROI approached zero but stayed positive as we added labeled data. Here, the ROI on added time can go negative because of overfitting!
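The time-based ROI from this section can be sketched the same way, using elapsed seconds as the cost. The epoch log below is invented to show how a score that falls below the pre-trained baseline produces a negative ROI. The helper names are hypothetical, and relative improvement is assumed to mean the gain over the pre-trained score divided by that score.

```python
from typing import List, Optional, Tuple


def roi_per_second(pretrained: float, finetuned: float, elapsed_seconds: float) -> float:
    """Relative improvement over the pre-trained score per second of training."""
    if pretrained <= 0 or elapsed_seconds <= 0:
        raise ValueError("pretrained score and elapsed time must be positive")
    return ((finetuned - pretrained) / pretrained) / elapsed_seconds


def first_non_positive_epoch(rois: List[float]) -> Optional[int]:
    """1-based index of the first epoch whose ROI is zero or negative, if any."""
    for epoch, roi in enumerate(rois, start=1):
        if roi <= 0:
            return epoch
    return None


# Made-up log: (cumulative training seconds, validation metric) after each epoch.
pretrained_score = 0.50
epoch_log: List[Tuple[float, float]] = [
    (60.0, 0.60),
    (120.0, 0.62),
    (180.0, 0.61),
    (240.0, 0.48),  # overfitting: the score is now below the pre-trained model
]

rois = [roi_per_second(pretrained_score, score, secs) for secs, score in epoch_log]
for epoch, roi in enumerate(rois, start=1):
    print(f"epoch {epoch}: ROI per second = {roi:+.6f}")
print("first epoch with ROI <= 0:", first_non_positive_epoch(rois))
```

A negative value here means the extra training time left the model worse than the pre-trained starting point on the held-out data.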
Summary
What does this mean for users looking to maximize gains and minimize costs?
- Many state-of-the-art deep neural networks are capable of few-shot learning. They are quick learners and can make large improvements with only a few hundred items of labeled data and only a few minutes of training time. You might have thought that deep neural network training requires millions of data items and a week of runtime, but we have shown in this article how that stereotype does not hold up to reality.
- Because they can learn so much, so fast, from so little data, ROI drops quickly as you put more time and data into fine-tuning. In the experiments above, ROI shrinks by 70% from its highest value after 500 labeled data items or 600 added seconds of GPU training time. Further investment beyond a few hundred items of training data and very minimal training time may not pay off as well as you would like.
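If you record your own ROI curve, a small helper can locate the point where ROI has fallen by a given share from its peak, similar to the 70% figure quoted above. This is only a sketch: the function name, the 0.3 threshold parameter, and the sample curve are assumptions for illustration, not values from our experiments.

```python
from typing import List, Optional, Tuple


def first_drop_below(
    roi_curve: List[Tuple[float, float]], fraction_of_peak: float = 0.3
) -> Optional[float]:
    """Return the first cost after the peak where ROI <= fraction_of_peak * peak.

    roi_curve holds (cost, roi) pairs sorted by increasing cost, where cost is
    either the number of labeled items or the training time in seconds.
    Returns None if the peak is not positive or the ROI never falls that far.
    """
    if not roi_curve:
        return None
    peak_index = max(range(len(roi_curve)), key=lambda i: roi_curve[i][1])
    peak = roi_curve[peak_index][1]
    if peak <= 0:
        return None
    for cost, roi in roi_curve[peak_index + 1:]:
        if roi <= fraction_of_peak * peak:
            return cost
    return None


# Made-up ROI curve: (number of labeled items, ROI per item).
curve = [(100, 0.0010), (200, 0.0008), (500, 0.00025), (1000, 0.0001)]
cutoff = first_drop_below(curve, fraction_of_peak=0.3)
print("ROI has dropped by at least 70% from its peak at:", cutoff)
```

A cutoff like this is a rough budgeting aid rather than a stopping rule: more data or time beyond it may still improve the model, just at a much higher cost per unit of gain.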
All in all, fine-tuning is not economically comparable to training from scratch. It is far, far cheaper, especially with the help of Finetuner. So the next time you receive a marketing email from the sales department of a GPU vendor or a company offering crowdsourced data acquisition, you know how to bargain with them.