DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API.
DocArray is the common data layer used in all Jina AI products.
Release Note (0.21.0)
This release contains 3 new features, 7 bug fixes and 5 documentation improvements.
🆕 Features
OpenSearch Document Store (#853)
This version of DocArray adds a new Document Store: OpenSearch!
You can use the OpenSearch Document Store to index your Documents and perform ANN search on them:
from docarray import Document, DocumentArray
import numpy as np

# Connect to OpenSearch instance
n_dim = 3
da = DocumentArray(
    storage='opensearch',
    config={'n_dim': n_dim},
)

# Index Documents
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim))
            for i in range(10)
        ]
    )

# Perform ANN search
np_query = np.ones(n_dim) * 8
results = da.find(np_query, limit=10)
Additionally, the OpenSearch Document Store can perform filter queries, search by text, and search by tags.
Learn more about its usage in the official documentation.
Add color to point cloud display (#961)
You can now include color information in your point cloud data, which can be visualized using display_point_cloud_tensor():
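import numpy as np
from docarray import Document, DocumentArray

# Load point coordinates and the per-point colors
coords = np.load('a_red_motorbike/coords.npy')
colors = np.load('a_red_motorbike/coord_colors.npy')

# Colors are attached as a chunk named 'point_cloud_colors'
doc = Document(
    tensor=coords,
    chunks=DocumentArray([Document(tensor=colors, name='point_cloud_colors')]),
)
doc.display()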
Add language attribute to Redis Document Store (#953)
The Redis Document Store now supports text search in multiple languages. To choose a language, set the language parameter in the Redis configuration:
da = DocumentArray(
    storage='redis',
    config={
        'n_dim': 128,
        'index_text': True,
        'language': 'chinese',
    },
)
🐞 Bug Fixes
Replace newline with whitespace to fix display in plot embeddings (#963)
Whenever the string '\n' was contained in any Document field, doc.plot() would result in a rendering error. This fix resolves those errors by rendering '\n' as whitespace.
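The idea behind this fix can be sketched in plain Python. The helper name sanitize_for_plot is hypothetical and not part of DocArray; it only illustrates replacing newline characters with spaces in string fields before they are rendered, and the actual implementation may differ:

```python
def sanitize_for_plot(fields: dict) -> dict:
    """Return a copy of `fields` with newlines in string values replaced by spaces.

    Hypothetical helper for illustration; not DocArray's actual implementation.
    """
    return {
        key: value.replace('\n', ' ') if isinstance(value, str) else value
        for key, value in fields.items()
    }


# Example field values, made up for this sketch
metadata = {'text': 'hello\nworld', 'weight': 1.0}
clean = sanitize_for_plot(metadata)
```

Non-string values pass through untouched, and the input dictionary is not modified.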
Fix unwanted coercion in to_pydantic_model (#949)
This bug caused all strings of the form 'Infinity' to be coerced to the string 'inf' when calling to_pydantic_model() or to_dict(). This is fixed now, leaving such strings unchanged.
Calculate relevant docs on index instead of queries (#950)
In the embed_and_evaluate() method, the number of relevant Documents per label used to be calculated based on the Documents in self. This is not generally correct, so after this fix the quantity is calculated based on the Documents in the index data.
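To see why the source of the count matters, here is a minimal sketch in plain Python. The labels and the helper relevant_per_label are made up for illustration and do not reflect how embed_and_evaluate() is implemented internally:

```python
from collections import Counter


def relevant_per_label(labels):
    """Count how many items carry each label (hypothetical helper)."""
    return Counter(labels)


# Made-up labels: two queries and five indexed Documents
query_labels = ['cat', 'dog']
index_labels = ['cat', 'cat', 'cat', 'dog', 'bird']

# Counting on the queries says there is 1 relevant 'cat' Document
counts_from_queries = relevant_per_label(query_labels)
# Counting on the index says there are 3, which is what can actually be retrieved
counts_from_index = relevant_per_label(index_labels)

# Recall for a 'cat' query that retrieved 2 relevant Documents
retrieved_relevant = 2
recall = retrieved_relevant / counts_from_index['cat']
```

With the query-based count, the same retrieval would yield a recall of 2.0, which is not a valid value. Metrics such as recall need the number of relevant Documents that exist in the index.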
Remove offset index create on list like false (#936)
When a Document Store has list-like behavior disabled, it no longer creates an offset-to-id mapping, which improves performance.
Add support for remote audio files (#933)
Loading audio files from a remote URL used to raise a FileNotFoundError, which is now fixed.
Query operator $exists does not work correctly with tags (#911) (#923)
Before this fix, $exists would treat false-y values such as 0 or [] as non-existent. This is now fixed.
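The difference between the two behaviors can be illustrated with a small sketch in plain Python. Both functions are hypothetical and only model the semantics described above; they are not DocArray's query engine:

```python
def exists_by_truthiness(tags: dict, key: str) -> bool:
    # Old behavior, modeled as a truthiness check: 0 and [] look "missing"
    return bool(tags.get(key))


def exists_by_membership(tags: dict, key: str) -> bool:
    # Fixed behavior, modeled as a membership check: any stored value counts
    return key in tags


# Made-up tags for this sketch
tags = {'price': 0, 'labels': [], 'name': 'shoe'}

old_result = exists_by_truthiness(tags, 'price')
new_result = exists_by_membership(tags, 'price')
```

Here the truthiness check reports 'price' as missing although it is set to 0, while the membership check reports it as present.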
Document from dataclass with singleton list (#1018)
When casting from a dataclass to Document, singleton lists were treated as individual elements, even if the corresponding field was annotated with List[...]. Now this case is considered, and accessing such a field will yield a DocumentArray, even for singleton inputs.
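The principle of the fix, deciding from the type annotation rather than from the number of items, can be sketched with the standard library alone. The Gallery dataclass and both helpers are hypothetical; the sketch uses Python's built-in dataclasses and plain lists as a stand-in for DocArray's dataclass and DocumentArray:

```python
from dataclasses import dataclass
from typing import List, get_origin, get_type_hints


@dataclass
class Gallery:  # hypothetical dataclass for illustration
    title: str
    images: List[str]


def is_list_field(cls, name: str) -> bool:
    """Return True if the field `name` of `cls` is annotated as List[...]."""
    return get_origin(get_type_hints(cls)[name]) is list


def to_nested(cls, name: str, value):
    """Pick the nested representation from the annotation, not from len(value)."""
    if is_list_field(cls, name):
        return list(value)  # always a collection, even with a single item
    return value


gallery = Gallery(title='bikes', images=['red_motorbike.png'])
nested_images = to_nested(Gallery, 'images', gallery.images)
nested_title = to_nested(Gallery, 'title', gallery.title)
```

The singleton list stays a list because its field is annotated with List[str], while the plain string field is returned unchanged.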
📗 Documentation Improvements
- Link to Discord (#1010)
- Have less versions to avoid deployment timeout (#977)
- Fix data management section not appearing in documentation (#967)
- Link to OpenSearch docs in sidebar (#960)
- Multimodal to datatypes (#934)
🤘 Contributors
We would like to thank all contributors to this release:
- Jay Bhambhani (@jay-bhambhani)
- Alvin Prayuda (@alphinside)
- Johannes Messner (@JohannesMessner)
- samsja (@samsja)
- Marco Luca Sbodio (@marcosbodio)
- Anne Yang (@AnneYang720)
- Michael Günther (@guenthermi)
- AlaeddineAbdessalem (@alaeddine-13)
- Han Xiao (@hanxiao)
- Alex Cureton-Griffiths (@alexcg1)
- Charlotte Gerhaher (@anna-charlotte)