Finetuner makes neural network fine-tuning easier and faster by streamlining the workflow and handling all the complexity and infrastructure requirements in the cloud. With Finetuner, one can easily enhance the performance of pre-trained models and make them production-ready without expensive hardware.
This release covers Finetuner version 0.7.8, including dependencies finetuner-api 0.5.10 and finetuner-core 0.13.7.
This release contains 4 new features, 1 performance improvement, 1 refactoring, 2 bug fixes, and 1 documentation improvement.
🆕 Features
Add multilingual text encoder models
We have added support for the multilingual embedding model distiluse-base-multi (a copy of distiluse-base-multilingual-cased-v1). It supports semantic search in Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.
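As a minimal sketch, the new model can be used for zero-shot encoding right away; this assumes distiluse-base-multi is the name accepted by finetuner.build_model (finetuner.describe_models() lists the exact identifiers):

```python
import finetuner

# Build the pre-trained multilingual encoder without fine-tuning
# ('distiluse-base-multi' is assumed to be the registered model name).
model = finetuner.build_model(name='distiluse-base-multi')

# Encode semantically equivalent queries in two of the supported languages.
e_de, e_en = finetuner.encode(model, ['Wie ist das Wetter heute?', 'What is the weather like today?'])
```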
Add multilingual model for training data synthesis jobs (#750)
We now support data synthesis for datasets in languages other than English, specifically those supported by distiluse-base-multi (see above). To use it, pass the synthesis model synthesis_model_multi as the models parameter of the finetuner.synthesize function:
```python
from finetuner.model import synthesis_model_multi

synthesis_run = finetuner.synthesize(
    ...,
    models=synthesis_model_multi,
)
```
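In a fuller call, the synthetic data is generated from a query and a corpus dataset. The sketch below assumes query_data and corpus_data are the relevant parameters and uses placeholder names for datasets pushed to Jina AI Cloud:

```python
import finetuner
from finetuner.model import synthesis_model_multi

finetuner.login()

# Placeholder names for DocumentArray datasets pushed to Jina AI Cloud.
synthesis_run = finetuner.synthesize(
    query_data='my-multilingual-queries',
    corpus_data='my-multilingual-corpus',
    models=synthesis_model_multi,
)
print(synthesis_run.name)
```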
Support loading models directly from Jina's huggingface site (#751)
We will soon publish select fine-tuned models to the Hugging Face Hub. With the new Finetuner version, you can load those models directly:
```python
import finetuner

model = finetuner.get_model('jinaai/ecommerce-sbert-model')
e1, e2 = finetuner.encode(model, ['XBox', 'Xbox One Console 500GB - Black (2015)'])
```
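To check how close the two texts are in embedding space, you can compare the returned embeddings, for example with cosine similarity; this sketch assumes finetuner.encode returns NumPy-compatible vectors:

```python
import numpy as np

# Cosine similarity between the two embeddings; values close to 1.0 mean the
# model places the two texts close together in embedding space.
cos_sim = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
print(f'cosine similarity: {cos_sim:.4f}')
```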
Add an option to the tracking callback to include zero-shot metrics in logging.
Previously, tracking callbacks like WandBLogger did not consider the evaluation results of the model before fine-tuning, because they only start tracking when the actual model tuning starts. We have now added a log_zero_shot option to those callbacks (True by default). When enabled, Finetuner sends the evaluation metrics calculated before training to the tracking service used during training.
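The sketch below shows how the option can be passed; the model and dataset names are placeholders, and the callback imports follow the existing finetuner.callback module:

```python
import finetuner
from finetuner.callback import EvaluationCallback, WandBLogger

finetuner.login()

# Model and dataset names below are placeholders.
run = finetuner.fit(
    model='bert-base-en',
    train_data='my-train-data',
    callbacks=[
        EvaluationCallback(query_data='my-query-data', index_data='my-index-data'),
        # log_zero_shot is True by default; shown explicitly here for clarity.
        WandBLogger(log_zero_shot=True),
    ],
)
```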
🚀 Performance
Reduce memory consumption during data synthesis and make the resulting dataset more compact
We optimized data synthesis to reduce its memory consumption, which enables synthesis jobs to run on larger datasets and reduces the run-time of fine-tuning jobs using synthesized training data.
⚙ Refactoring
Increase the default num_relations from 3 to 10 for data synthesis jobs. (#750)
Data synthesis jobs are more effective if a large amount of training data is generated from small and medium-sized query datasets. Therefore, we have increased the default number of triplets generated for each query from 3 to 10. If you run data synthesis jobs with a large number of queries (>1M), you should consider setting the num_relations parameter to a lower value.
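For large query datasets, a lower value can be set explicitly. The sketch below assumes num_relations is accepted as a keyword argument of finetuner.synthesize and uses placeholder dataset names:

```python
import finetuner
from finetuner.model import synthesis_model_multi

# For very large query datasets (>1M queries), generate fewer triplets per query.
synthesis_run = finetuner.synthesize(
    query_data='my-large-query-dataset',
    corpus_data='my-corpus-dataset',
    models=synthesis_model_multi,
    num_relations=3,
)
```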
🐞 Bug Fixes
Change the English cross-encoder model from multilingual to an actual English model.
The English cross-encoder model we previously used was actually a multilingual one. By using a genuinely English model instead, we produce higher-quality synthetic training data, and the resulting embedding models achieve better evaluation results.
Fix create synthesis run not accepting DocumentArray as input type. (#748)
Data synthesis jobs can accept either a named DocumentArray object stored on Jina AI Cloud or a list of text values. However, we noticed that passing file paths to locally stored DocumentArray datasets failed. This release fixes that bug.
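For illustration, here is a sketch of a synthesis call that passes plain lists of text values, one of the accepted input types; the texts are placeholders and query_data/corpus_data are assumed to be the relevant parameters:

```python
import finetuner
from finetuner.model import synthesis_model_multi

# Plain Python lists of text values are accepted directly;
# named DocumentArray datasets on Jina AI Cloud work as well.
synthesis_run = finetuner.synthesize(
    query_data=['gaming console', 'wireless headphones'],
    corpus_data=['Xbox One Console 500GB - Black (2015)', 'Wireless over-ear headphones'],
    models=synthesis_model_multi,
)
```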
📗 Documentation Improvements
Update data synthesis tutorial including English and multilingual models. (#750)
We have added documentation on how to apply data synthesis to datasets that include materials in languages other than English.
🤟 Contributors
We would like to thank all contributors to this release:
- Wang Bo (@bwanglzu)
- Louis Milliken (@LMMilliken)
- Michael Günther (@guenthermi)
- George Mastrapas (@gmastrapas)
- Scott Martens (@scott-martens)
- Jonathan Geuter (@j-geuter)