DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, and 3D mesh. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API.
DocArray is the common data layer used in all Jina AI products.
Release Note (0.20.0)
This release contains 8 new features, 3 bug fixes and 7 documentation improvements.
🆕 Features
Milvus document store (#587)
This release supports the Milvus vector database as a document store.
da = DocumentArray(storage='milvus', config={'n_dim': 3})
Root_id for document stores (#808)
When working with a vector database you can now retrieve the root document even if you search at a nested level with sub-indices (for example at chunk level).
top_level_matches = da.find(query=np.random.rand(512), on='@.[image]', return_root=True)
To allow this, we now store the root_id in the chunks' tags. You can enable this by passing root_id=True in your document store configuration.
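The idea behind root_id can be sketched without any vector database. The following is a minimal, library-free illustration, not DocArray's implementation: the dictionary layout, the tag key "root_id", and the helper roots_for_matches are hypothetical names chosen for the example.

```python
# Library-free sketch of resolving chunk-level matches to root documents.
# The dict layout, the "root_id" tag key, and roots_for_matches are
# assumptions for illustration, not DocArray's internal API.

roots = {
    "root-a": {"id": "root-a", "text": "first document"},
    "root-b": {"id": "root-b", "text": "second document"},
}

# Each chunk carries the id of its root document in its tags.
chunks = {
    "chunk-1": {"id": "chunk-1", "tags": {"root_id": "root-a"}},
    "chunk-2": {"id": "chunk-2", "tags": {"root_id": "root-a"}},
    "chunk-3": {"id": "chunk-3", "tags": {"root_id": "root-b"}},
}


def roots_for_matches(matched_chunk_ids):
    """Map chunk-level matches to root documents, keeping rank order
    and dropping duplicate roots."""
    seen = set()
    result = []
    for chunk_id in matched_chunk_ids:
        root_id = chunks[chunk_id]["tags"]["root_id"]
        if root_id not in seen:
            seen.add(root_id)
            result.append(roots[root_id])
    return result


top_level = roots_for_matches(["chunk-2", "chunk-3", "chunk-1"])
print([doc["id"] for doc in top_level])  # ['root-a', 'root-b']
```

Dropping duplicate roots is a choice made in this sketch; whether several chunk matches from the same root should collapse into one result depends on the application.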
Filtering based on text keywords for Qdrant (#849)
You can now filter based on text keywords for the Qdrant document store.
filter = {
    'must': [
        {"key": "info", "match": {"text": "shoes"}}
    ]
}
results = da.find(np.random.rand(n_dim), filter=filter)
RGB-D representation of 3D meshes (#753)
DocArray already supports 3D mesh representation in different formats, and this release adds support for RGB-D representation.
doc.load_uris_to_rgbd_tensor()
Load multi-page TIFF files into chunks (#845)
Multi-page TIFF images can now be loaded with load_uri_to_image_tensor(). Each page is stored as a chunk of the Document.
d = Document(uri="foo.tiff")
d.load_uri_to_image_tensor()
print(d)
<Document ('id', 'uri', 'chunks') at 7f907d786d6c11ec840a1e008a366d49>
└─ chunks
   ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 7aa4c0ba66cf6c300b7f07fdcbc2fdc8>
   ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at bc94a3e3ca60352f2e4c9ab1b1bb9c22>
   └─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 36fe0d1daf4442ad6461c619f8bb25b7>
Store key frame indices when loading video tensor from uri (#880)
Key frame indices are now stored in a Document's tags, under the key keyframe_indices, when loading a video to tensor. This allows extracting the section of the video between key frames.
d = Document(uri="video.mp4").load_uri_to_video_tensor()
print(d.tags['keyframe_indices'])
[0, 25, 196, ...]
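Once the key frame indices are available, cutting out the section between two key frames is a plain slice along the frame axis. The sketch below uses a list as a stand-in for the video tensor. The helper slice_between_key_frames is a hypothetical name, not a DocArray method.

```python
# Sketch: cut the frames between two consecutive key frames.
# "frames" stands in for the video tensor (first axis = frame index);
# slice_between_key_frames is a hypothetical helper, not a DocArray method.

def slice_between_key_frames(frames, keyframe_indices, segment):
    """Return the frames from key frame `segment` up to (not including)
    the next key frame, or to the end of the video for the last one."""
    start = keyframe_indices[segment]
    if segment + 1 < len(keyframe_indices):
        end = keyframe_indices[segment + 1]
    else:
        end = len(frames)
    return frames[start:end]


frames = list(range(200))        # stand-in for 200 video frames
keyframe_indices = [0, 25, 196]  # e.g. the values read from d.tags

clip = slice_between_key_frames(frames, keyframe_indices, segment=1)
print(clip[0], clip[-1], len(clip))  # 25 195 171
```

The same slicing works on an array-like tensor, since only indexing along the first axis is used.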
Better plotting of embeddings for nested and complex data (#891)
You can now choose which fields to exclude from the metadata when calling DocumentArray's plot_embeddings() method. This makes it easier to plot embeddings for complex and nested data.
docs.plot_embeddings(exclude_fields_metas=['chunks'])
Better support for information retrieval evaluation (#826)
This release adds a max_rel_per_label parameter to better support metric calculations that require the number of relevant Documents.
metrics = da.evaluate(['recall_at_k'], max_rel_per_label={i: 1 for i in range(3)})
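Recall is a typical metric that needs this number: the matches alone show how many relevant Documents were found, but not how many exist in total. A minimal sketch of the calculation follows. It assumes binary relevance judgments in rank order and is not DocArray's implementation; the label names and counts are made up for the example.

```python
# Sketch of recall@k with an externally supplied count of relevant items.
# Illustrative only; the labels and counts below are made-up example data.

def recall_at_k(binary_relevance, k, max_rel):
    """Fraction of all relevant documents that appear in the top k matches.

    binary_relevance: 1/0 per match, in rank order
    max_rel: total number of relevant documents for the query's label
    """
    if max_rel <= 0:
        raise ValueError("max_rel must be positive")
    return sum(binary_relevance[:k]) / max_rel


max_rel_per_label = {"shoes": 4, "bags": 1}  # relevant documents per label
relevance = [1, 0, 1, 0, 0]                  # judgments for a "shoes" query

print(recall_at_k(relevance, k=3, max_rel=max_rel_per_label["shoes"]))  # 0.5
```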
🐞 Bug Fixes
Support length calculation independently from list-like behavior (#840)
Our prior minor release, DocArray 0.19, added the ability to instantiate a document store without list-like behavior for improved performance. However, calculating the length of certain document stores relied on such list-like behavior. This release fixes length calculation for the Redis document store, making it independent from list-like behavior.
Remove cosine similarity field with false assignment (#835)
In the Weaviate document store, cosine distance is no longer mistakenly assigned to the cosine_similarity field.
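The two quantities are easy to mix up because they move in opposite directions: identical vectors have the highest similarity and the lowest distance. The sketch below assumes the common convention distance = 1 - similarity. Its helper functions are illustrative and not part of DocArray or Weaviate.

```python
import math

# Illustration of cosine similarity versus cosine distance.
# Assumes the convention distance = 1 - similarity.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)


same = ([1.0, 0.0], [1.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 1.0])

print(cosine_similarity(*same), cosine_distance(*same))              # 1.0 0.0
print(cosine_similarity(*orthogonal), cosine_distance(*orthogonal))  # 0.0 1.0
```

Storing a distance in a field named for similarity would therefore invert any ranking that reads that field.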
Rebuild index after clearing storage (#837)
The index for Redis and Elasticsearch document stores is now rebuilt when _clear_storage is called.
📗 Documentation Improvements
- Correct Document description (#842)
- Minor correction in Document description (#834)
- Add username to DocArray pull (#847)
- Fix broken docs (#805)
- Fix data management section (#801)
- Change logic order according to blog (#797)
- Move cloud support to integrations (#798)
🤘 Contributors
We would like to thank all contributors to this release:
- Delgermurun (@delgermurun)
- Anne Yang (@AnneYang720)
- anna-charlotte (@anna-charlotte)
- Johannes Messner (@JohannesMessner)
- Alex Cureton-Griffiths (@alexcg1)
- AlaeddineAbdessalem (@alaeddine-13)
- dong xiang (@dongxiang123)
- coolmian (@coolmian)
- Joan Fontanals (@JoanFM)
- Nan Wang (@nan-wang)
- samsja (@samsja)
- Michael Günther (@guenthermi)