Finetuner makes neural network fine-tuning easier and faster by streamlining the workflow and handling all the complexity and infrastructure requirements in the cloud. With Finetuner, one can easily enhance the performance of pre-trained models and make them production-ready without expensive hardware.
This release covers Finetuner version 0.7.7, including dependencies finetuner-api 0.5.9 and finetuner-core 0.13.5.
This release contains 2 new features, 2 refactorings, 3 bug fixes, and 1 documentation improvement.
🆕 Features

Training data synthesis (#715)
In this release of Finetuner, we have introduced a training data synthesis feature. This feature is particularly useful for users in the e-commerce domain, who may have difficulty obtaining enough labeled training data.
This feature allows you to use historical queries collected from your search system, along with your articles, to generate training data:
```python
import finetuner
from finetuner.model import synthesis_model_en

synthesis_run = finetuner.synthesize(
    query_data='finetuner/xmarket_queries_da',
    corpus_data='finetuner/xmarket_corpus_da',
    models=synthesis_model_en,
)
```
Once the synthesis job is done, you can get the training data with:
train_data_name = synthesis_run.train_data
And then, you can continue fine-tuning your embedding model with the generated training data:
```python
training_run = finetuner.fit(
    model='bert-base-en',
    train_data=synthesis_run.train_data,
    loss='MarginMSELoss',
    ...,
)
```
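MarginMSELoss follows the margin-MSE objective common in dense-retrieval distillation: the model is trained so that its score margin between a positive and a negative document matches a teacher's margin for the same query. The sketch below is illustrative plain Python, not Finetuner's implementation:

```python
def margin_mse_loss(student_scores, teacher_scores):
    """Mean squared error between student and teacher margins.

    Each element of both inputs is a (positive_score, negative_score)
    pair for one (query, positive, negative) triple. Illustrative
    sketch of the margin-MSE idea, not Finetuner's internal code.
    """
    total = 0.0
    for (s_pos, s_neg), (t_pos, t_neg) in zip(student_scores, teacher_scores):
        # Penalize the squared difference between the two margins.
        diff = (s_pos - s_neg) - (t_pos - t_neg)
        total += diff * diff
    return total / len(student_scores)
```

Because only the margins are compared, the student is free to produce scores on a different absolute scale than the teacher.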
Evaluation on multiple datasets in EvaluationCallback

To facilitate training and evaluating large language models (LLMs) with Finetuner, we have made significant changes to the EvaluationCallback. These changes enable evaluation on multiple datasets: attach one EvaluationCallback per dataset and use the caption parameter to label which dataset each evaluation result corresponds to:
```python
import finetuner
from finetuner.callback import EvaluationCallback

finetuner.fit(
    ...,
    callbacks=[
        EvaluationCallback(
            query_data='query-1',
            index_data='index-1',
            caption='dataset-1',
        ),
        EvaluationCallback(
            query_data='query-2',
            index_data='index-2',
            caption='dataset-2',
        ),
    ],
)
```
⚙ Refactoring

Display small loss values with higher precision.
To avoid displaying "0.000" for very small loss values, the display precision has been increased.
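The underlying formatting issue can be illustrated with plain Python string formatting (hypothetical values, not Finetuner's actual display code):

```python
loss = 0.0000431

# With three decimal places, small losses collapse to zero:
print(f'{loss:.3f}')   # 0.000

# Higher precision keeps the trend visible:
print(f'{loss:.6f}')   # 0.000043
```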
Filter PIL debugging messages from logging stack.
In order to enhance the readability of the logs, we have excluded debugging messages generated by the PIL package.
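The standard way to suppress another package's debug output with Python's logging module is to raise the threshold on that package's logger; a minimal sketch of the technique (illustrative, not necessarily Finetuner's exact code):

```python
import logging

# Raise the PIL logger's threshold so its DEBUG records are dropped
# while INFO and above still propagate normally.
logging.getLogger('PIL').setLevel(logging.INFO)
```

Because loggers are hierarchical, this also silences DEBUG messages from child loggers such as PIL.Image.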
🐞 Bug Fixes
No longer overestimate the batch_size for text models.

This release resolves a bug where the batch size finder would overestimate the maximum usable batch size for text models like BERT. This was most likely to occur when users fine-tuned the bert-base-en model without specifying a batch_size explicitly.
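A batch size finder generally probes ever-smaller candidate sizes until a trial step fits in memory. The sketch below shows the general back-off technique with a caller-supplied probe; all names are illustrative, not Finetuner's internal API:

```python
def find_batch_size(try_step, start=1024):
    """Halve the candidate batch size until one trial step succeeds.

    `try_step` is a hypothetical caller-supplied probe that runs a
    single training step at the given batch size and raises
    MemoryError if it does not fit.
    """
    size = start
    while size >= 1:
        try:
            try_step(size)   # probe: does one step fit at this size?
            return size
        except MemoryError:
            size //= 2       # back off and retry with half the size
    raise RuntimeError('even batch size 1 does not fit in memory')
```

The bug class fixed in this release corresponds to the probe succeeding at sizes that later fail in real training, so the finder reported a size that was too large.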
Fix division by None error in the evaluation callback.

Runs set up with automatic batch-size configuration and an automatic evaluation callback previously passed the value None as the batch_size to the callback. This resulted in a division by None error, which has now been fixed.
Filter out queries that do not have any matches in the evaluation data.

When the evaluation data contained queries without any matches, Finetuner was previously unable to calculate metrics, leading to division-by-zero errors. This has been fixed in this release.
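Skipping match-less queries before averaging is the usual guard against this class of error. A minimal sketch using mean reciprocal rank as the example metric (hypothetical helper, not Finetuner's internal code):

```python
def mean_reciprocal_rank(ranked, relevant):
    """Compute MRR, skipping queries that have no relevant documents.

    `ranked` maps query id -> retrieved doc ids (best first);
    `relevant` maps query id -> set of relevant doc ids.
    """
    scores = []
    for qid, docs in ranked.items():
        rel = relevant.get(qid, set())
        if not rel:
            # Query has no matches: skip it instead of letting it
            # poison the average (or divide by zero downstream).
            continue
        rr = 0.0
        for rank, doc in enumerate(docs, start=1):
            if doc in rel:
                rr = 1.0 / rank
                break
        scores.append(rr)
    return sum(scores) / len(scores) if scores else 0.0
```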
📗 Documentation Improvements
Add a tutorial for data synthesis (#745)
We have provided a tutorial for the new data synthesis module.
We would like to thank all contributors to this release.