Finetuner makes neural network fine-tuning easier and faster by streamlining the workflow and handling all the complexity and infrastructure requirements in the cloud. With Finetuner, one can easily enhance the performance of pre-trained models and make them production-ready without expensive hardware.
This release covers Finetuner version 0.7.7, including dependencies finetuner-api 0.5.9 and finetuner-core 0.13.5.
This release contains 2 new features, 2 refactorings, 3 bug fixes, and 1 documentation improvement.
🆕 Features
Training data synthesis (#715)
In this release of Finetuner, we have introduced a training data synthesis feature. This feature is particularly useful for users in the e-commerce domain, who may have difficulty obtaining enough labeled training data.
This feature allows you to use historical queries collected from your search system, along with your articles, to generate training data:
import finetuner
from finetuner.model import synthesis_model_en

synthesis_run = finetuner.synthesize(
    query_data='finetuner/xmarket_queries_da',
    corpus_data='finetuner/xmarket_corpus_da',
    models=synthesis_model_en,
)
Once the synthesis job is done, you can get the training data with:
train_data_name = synthesis_run.train_data
And then, you can continue fine-tuning your embedding model with the generated training data:
training_run = finetuner.fit(
    model='bert-base-en',
    train_data=synthesis_run.train_data,
    loss='MarginMSELoss',
    # ... remaining arguments elided
)
Evaluation on multiple datasets in EvaluationCallback
To facilitate the training and evaluation of large language models (LLMs) with Finetuner, we have made significant changes to EvaluationCallback. These changes enable evaluation on multiple datasets. Users can now pass the caption parameter to EvaluationCallback so that the output labels which dataset each evaluation corresponds to:
import finetuner
from finetuner.callback import EvaluationCallback

finetuner.fit(
    ...,
    callbacks=[
        EvaluationCallback(
            query_data='query-1',
            index_data='index-1',
            caption='dataset-1',
        ),
        EvaluationCallback(
            query_data='query-2',
            index_data='index-2',
            caption='dataset-2',
        ),
    ]
)
⚙ Refactoring
Display small loss values with higher precision.
To avoid displaying "0.000" for very small loss values, the display precision has been increased.
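The effect can be illustrated with a minimal sketch. This is not Finetuner's actual display code; the helper name format_loss and its fallback to scientific notation are assumptions made purely for illustration:

```python
def format_loss(loss: float, digits: int = 3) -> str:
    """Format a loss value for display (hypothetical helper, not part of Finetuner).

    Fixed-point notation is used by default. If a non-zero value would be
    rounded to zero, scientific notation is used instead so that the
    magnitude stays visible.
    """
    fixed = f"{loss:.{digits}f}"
    if loss != 0 and float(fixed) == 0:
        return f"{loss:.{digits}e}"
    return fixed


print(format_loss(0.4213))      # ordinary value, fixed-point
print(format_loss(0.00004213))  # very small value, scientific notation
```

With three digits, a loss of 0.00004213 would otherwise be shown as 0.000, which hides whether training is still making progress.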
Filter PIL debugging messages from logging stack.
In order to enhance the readability of the logs, we have excluded debugging messages generated by the PIL package.
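One common way to achieve this with Python's standard logging module is sketched below. It is an assumption about the general technique, not a copy of what Finetuner does internally; the logger name my_app is hypothetical:

```python
import logging

# Let the root logger pass DEBUG records, as a verbose application might.
logging.getLogger().setLevel(logging.DEBUG)

# Raise the threshold for the "PIL" logger only. Pillow modules log under
# names such as "PIL.PngImagePlugin", which inherit this level.
logging.getLogger("PIL").setLevel(logging.INFO)

pil_logger = logging.getLogger("PIL.PngImagePlugin")
app_logger = logging.getLogger("my_app")  # hypothetical application logger

pil_logger.debug("suppressed: below the PIL threshold")
app_logger.debug("still emitted: other loggers are unaffected")
```

Because the level is set on the parent logger, every Pillow submodule is covered without listing each one.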
🐞 Bug Fixes
No longer overestimate the batch_size for text models.
This fix resolves a bug where the batch size finder would overestimate the maximum usable batch size for text models like BERT. This was likely to happen when users fine-tuned the bert-base-en model without specifying batch_size.
Fix division by None error in EvaluationCallback.
Runs set up with automatic batch-size configuration and an automatic evaluation callback previously passed the value None to EvaluationCallback as batch_size. This resulted in a division by None error.
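The failure mode is easy to reproduce in plain Python, as the sketch below shows. The function num_batches and the constant DEFAULT_EVAL_BATCH_SIZE are hypothetical names chosen for illustration; the release notes do not describe how the fix was implemented, so the fallback shown here is only one possible guard:

```python
import math
from typing import Optional

DEFAULT_EVAL_BATCH_SIZE = 32  # hypothetical fallback value


def num_batches(num_items: int, batch_size: Optional[int]) -> int:
    """Number of batches needed to cover num_items (illustrative helper)."""
    if batch_size is None:
        # Without this guard, num_items / None raises a TypeError.
        batch_size = DEFAULT_EVAL_BATCH_SIZE
    return math.ceil(num_items / batch_size)


try:
    100 / None  # what an unguarded computation would attempt
except TypeError as err:
    print(f"unguarded division fails: {err}")

print(num_batches(100, None))  # uses the fallback batch size
print(num_batches(100, 10))    # uses the explicit batch size
```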
Filter out queries that do not have any matches in EvaluationCallback.
When the evaluation data contained queries without any matches, Finetuner was previously unable to calculate metrics, which led to division by zero errors. This has been fixed in this release.
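A minimal sketch of the idea follows, using recall at k as the example metric. The function mean_recall_at_k and its data layout are assumptions for illustration and do not reflect Finetuner's internal evaluation code:

```python
from typing import Dict, List, Set


def mean_recall_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average recall@k over queries that have at least one relevant document."""
    # Drop queries with an empty match set; dividing by len(rel) == 0
    # for such a query would raise ZeroDivisionError.
    scored = {query: rel for query, rel in relevant.items() if rel}
    if not scored:
        return 0.0
    total = 0.0
    for query, rel in scored.items():
        top_k = retrieved.get(query, [])[:k]
        total += len(rel.intersection(top_k)) / len(rel)
    return total / len(scored)


retrieved = {"q1": ["a", "b", "c"], "q2": ["x"], "q3": ["d"]}
relevant = {"q1": {"a", "z"}, "q2": set(), "q3": {"d"}}  # q2 has no matches
print(mean_recall_at_k(retrieved, relevant, k=2))
```

Here q2 is excluded from the average, so the result is computed over q1 and q3 only.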
📗 Documentation Improvements
Add a tutorial for data synthesis (#745)
We have provided a tutorial for the new data synthesis module.
🤟 Contributors
We would like to thank all contributors to this release:
- Wang Bo (@bwanglzu)
- Louis Milliken (@LMMilliken)
- Michael Günther (@guenthermi)
- George Mastrapas (@gmastrapas)
- Scott Martens (@scott-martens)