DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer the multi-modal data with a Pythonic API.
Release Note (0.19.0)
This release contains 2 breaking changes, 11 new features, 1 performance improvement, 7 bug fixes and 7 documentation improvements.
💥 Breaking changes
- DocumentArray now supports Qdrant versions above 0.10.1, and drops support for previous versions (#726)
- DocumentArray now supports Weaviate versions above 1.16.0 and client 3.9.0, and drops support for previous versions (#736)
🆕 Features
Add flag to disable list-like structure and behavior (#730, #766, #768, #762)
Sometimes, you do not need to use a DocumentArray as a list and access Documents by offset. Because this capability requires keeping an Offset2ID mapping in the store, it comes with overhead.
Now, when using a DocumentArray with external storage, you can disable this behavior. This improves performance when accessing Documents by ID, while disallowing some list-like behavior.
from docarray import DocumentArray
da = DocumentArray(storage='qdrant', config={'n_dim': 10, 'list_like': False})
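To see where the overhead comes from, here is a toy, stdlib-only sketch of an ID-keyed store that optionally maintains an offset-to-ID list. It is not DocArray's implementation: the class `ToyStore` and its methods are hypothetical and exist only for illustration.

```python
class ToyStore:
    """Toy ID-keyed store; hypothetical, not DocArray's actual storage code."""

    def __init__(self, list_like=True):
        self._docs = {}  # id -> document payload
        self._list_like = list_like
        # The extra structure that list-like access needs: position -> id.
        self._offset2id = [] if list_like else None

    def add(self, doc_id, payload):
        is_new = doc_id not in self._docs
        self._docs[doc_id] = payload
        if self._list_like and is_new:
            # Extra bookkeeping on every new write.
            self._offset2id.append(doc_id)

    def get_by_id(self, doc_id):
        return self._docs[doc_id]

    def get_by_offset(self, offset):
        if not self._list_like:
            raise TypeError('offset access is disabled when list_like=False')
        return self._docs[self._offset2id[offset]]


store = ToyStore(list_like=False)
store.add('a', 'first')
store.add('b', 'second')
print(store.get_by_id('b'))  # second
```

With `list_like=False` the toy store skips the offset bookkeeping entirely, at the price of rejecting access by position.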
Support find by text and filter for ElasticSearch and Redis backends (#740)
For the ElasticSearch and Redis document stores, we now support find by text while applying filtering.
from docarray import DocumentArray, Document
da = DocumentArray(storage='elasticsearch', config={'n_dim': 32, 'columns': {'price': 'int'}, 'index_text': True})
with da:
    da.extend(
        [Document(tags={'price': i}, text=f'pizza {i}') for i in range(10)]
    )
    da.extend(
        [
            Document(tags={'price': i}, text=f'noodles {i}')
            for i in range(10)
        ]
    )
results = da.find('pizza', filter={
    'range': {
        'price': {
            'lte': 5,
        }
    }
})
assert len(results) > 0
# 'lte' is inclusive, so a price of exactly 5 may be returned
assert all([r.tags['price'] <= 5 for r in results])
assert all(['pizza' in r.text for r in results])
Add 3D data handling of mesh vertices and faces (#709, #717)
DocArray now supports loading data with vertices and faces to represent 3D objects. You can visualize them using display:
from docarray import Document
doc = Document(uri='some/uri')
doc.load_uri_to_vertices_and_faces()
doc.display()
Add embed_and_evaluate method (#702, #731)
The embed_and_evaluate method has been added to DocumentArray. It performs embedding, matching, and computation of evaluation metrics all at once, and batches operations to reduce the memory footprint.
import numpy as np
from docarray import Document, DocumentArray
def emb_func(da):
    for d in da:
        np.random.seed(int(d.text))
        d.embedding = np.random.random(5)
da = DocumentArray(
    [Document(text=str(i), tags={'label': i % 10}) for i in range(1_000)]
)
da.embed_and_evaluate(
    metrics=['precision_at_k'], embed_funcs=emb_func, query_sample_size=100
)
)
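For readers unfamiliar with the metric names used here, the following stdlib-only sketch shows the textbook definitions of precision at k and reciprocal rank. It assumes a match counts as relevant when it shares the query's label; the helper functions are illustrative and DocArray's internal computation may differ in details.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k matches that are relevant."""
    top = relevance[:k]
    return sum(top) / k if k else 0.0


def reciprocal_rank(relevance):
    """1 / rank of the first relevant match, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


# Toy data (assumption): labels of the ranked matches for one query.
query_label = 3
match_labels = [7, 3, 3, 1, 3]
relevance = [int(label == query_label) for label in match_labels]

print(precision_at_k(relevance, k=5))  # 0.6
print(reciprocal_rank(relevance))  # 0.5
```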
Reduction of memory usage when evaluating 100 query vectors against 500,000 index vectors with 500 dimensions:
Manual Evaluation:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    28   1130.7 MiB   1130.7 MiB           1   @profile
    29                                         def run_evaluation_old_style(queries, index, model):
    30   1133.1 MiB      2.5 MiB           1       queries.embed(model)
    31   2345.6 MiB   1212.4 MiB           1       index.embed(model)
    32   2360.4 MiB     14.8 MiB           1       queries.match(index)
    33   2360.4 MiB      0.0 MiB           1       return queries.evaluate(metrics=['reciprocal_rank'])
Evaluation with embed_and_evaluate (batch_size 100,000):
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    23   1130.6 MiB   1130.6 MiB           1   @profile
    24                                         def run_evaluation(queries, index, model, batch_size=None):
    25   1130.6 MiB      0.0 MiB           1       kwargs = {'match_batch_size':batch_size} if batch_size else {}
    26   1439.9 MiB    309.3 MiB           1       return queries.embed_and_evaluate(metrics=['reciprocal_rank'], index_data=index, embed_models=model, **kwargs)
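The general idea behind such a saving can be sketched without any library: embed and match the index one batch at a time, and keep only the running top-k per query instead of all index embeddings. The sketch below is an illustration under assumptions, not DocArray's code; the function name `batched_top_k`, the Euclidean distance, and the toy embedding function are all hypothetical.

```python
import heapq
import math


def batched_top_k(queries, index_items, embed, k=3, batch_size=2):
    """Return, per query, the ids of the k nearest index items.

    The index is embedded batch by batch, so only one batch of
    index embeddings is held in memory at any time.
    """
    query_vecs = [embed(q) for q in queries]
    best = [[] for _ in queries]  # per query: list of (distance, item_id)
    for start in range(0, len(index_items), batch_size):
        batch = index_items[start:start + batch_size]
        # These embeddings are discarded once the batch has been matched.
        batch_vecs = [(item_id, embed(item)) for item_id, item in batch]
        for qi, qv in enumerate(query_vecs):
            candidates = best[qi] + [
                (math.dist(qv, vec), item_id) for item_id, vec in batch_vecs
            ]
            best[qi] = heapq.nsmallest(k, candidates)
    return [[item_id for _, item_id in hits] for hits in best]


def toy_embed(x):
    # Toy embedding (assumption): place the number on a 2D line.
    return (float(x), 0.0)


index_items = [(f'd{i}', i) for i in range(10)]
print(batched_top_k([4], index_items, toy_embed, k=3, batch_size=2))
# [['d4', 'd3', 'd5']]
```

The result is the same for any batch size; only the peak number of index embeddings held at once changes.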
Update Qdrant version to 0.10.1 (#726)
This release supports Qdrant versions above 0.10.1. This comes with a lot of performance improvements and bug fixes on the backend.
Add filter support for Qdrant document store (#652)
Qdrant document store now supports pure filtering:
from docarray import Document, DocumentArray
import numpy as np
n_dim = 3
da = DocumentArray(
storage='qdrant',
config={'n_dim': n_dim, 'columns': {'price': 'float'}},
)
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim), tags={'price': i})
            for i in range(10)
        ]
    )
max_price = 7
n_limit = 4
filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}
results = da.filter(filter=filter, limit=n_limit)
print('\nPoints with "price" at most 7:\n')
for embedding, price in zip(results.embeddings, results[:, 'tags__price']):
    print(f'\tembedding={embedding},\t price={price}')
This prints:
Points with "price" at most 7:
embedding=[6. 6. 6.], price=6
embedding=[7. 7. 7.], price=7
embedding=[1. 1. 1.], price=1
embedding=[2. 2. 2.], price=2
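The filter itself is evaluated by Qdrant. As a mental model only, a `must` clause with `range` conditions behaves like the plain-Python predicate below. The helper `matches_filter` is hypothetical and handles only the `lt`, `lte`, `gt`, and `gte` range keys.

```python
import operator

# Range keys handled by this sketch, mapped to Python comparisons.
RANGE_OPS = {
    'lt': operator.lt,
    'lte': operator.le,
    'gt': operator.gt,
    'gte': operator.ge,
}


def matches_filter(tags, filter_):
    """True if `tags` satisfies every range condition listed under 'must'."""
    for condition in filter_.get('must', []):
        value = tags.get(condition['key'])
        if value is None:
            return False
        for op_name, bound in condition.get('range', {}).items():
            if not RANGE_OPS[op_name](value, bound):
                return False
    return True


filter_ = {'must': [{'key': 'price', 'range': {'lte': 7}}]}
prices = [i for i in range(10) if matches_filter({'price': i}, filter_)]
print(prices)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Eight Documents satisfy this filter. With a limit of 4, the store returns four of them, and as the printed output above shows, they are not necessarily in price order.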
Support passing search_params in find for Qdrant document store (#675)
You can now pass search_params in the find interface with Qdrant.
results = da.find(np_query, filter=filter, limit=n_limit, search_params={"hnsw_ef": 64})
Add login and logout proxy methods to DocumentArray (#697)
DocArray offers login and logout methods to log into your Jina AI Cloud account directly from DocArray.
from docarray import login, logout
login()
# you are logged in
logout()
# you are logged out
Add docarray version to push (#710)
When pushing a DocumentArray to the cloud, the docarray version is now added as metadata.
Add args to load_uri_to_video_tensor (#663)
Keyword arguments that are available in av.open() can now be passed to load_uri_to_video_tensor():
from docarray import Document
doc = Document(uri='/some/uri')
doc.load_uri_to_video_tensor(timeout=5000)
Update Weaviate server to v1.16.1 and client to 3.9.0 (#736, #750)
This release adds support for Weaviate versions above v1.16.0. Make sure to use version 1.16.1 of the Weaviate backend to enjoy all Weaviate features.
🚀 Performance
Sync sub-index only when parent is synced (#719)
Previously, if you used the sub-index feature, every time you added new Documents with chunks, DocArray would persist the offset2ids of the chunk subindex. With this change, the offset2id is persisted once, when the parent DocumentArray's offset2id is persisted.
🐞 Bug Fixes
Exception for all from generator calls on instance (#659)
Previously, calling generator class methods such as from_csv from a DocumentArray instance had the non-intuitive behavior of not changing the DocumentArray in place.
Now DocumentArray instances are not allowed to call these methods, and doing so raises an Exception.
from docarray import DocumentArray
da = DocumentArray()
da.from_files(
    patterns='*.*',
    size=2,
)
AttributeError: Class method can't be called from a DocumentArray instance but only from the DocumentArray class.
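One way to obtain this kind of behavior in plain Python is a small descriptor that binds to the class but refuses access through an instance. The sketch below is only an illustration of that technique, not DocArray's actual implementation; `class_only_method` and `ToyArray` are hypothetical names.

```python
import types


class class_only_method:
    """Like classmethod, but raises when accessed through an instance."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        if instance is not None:
            raise AttributeError(
                f"Class method can't be called from a {owner.__name__} instance "
                f'but only from the {owner.__name__} class.'
            )
        # Bind the function to the class, as classmethod would.
        return types.MethodType(self.func, owner)


class ToyArray(list):
    @class_only_method
    def from_range(cls, n):
        return cls(range(n))


print(ToyArray.from_range(3))  # [0, 1, 2]
try:
    ToyArray().from_range(3)
except AttributeError as err:
    print(err)
```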
Fix markup error in summary (#739)
Previously, calling summary on a Document that contains some textual patterns would raise an Exception from rich. This release uses the Text class from rich to ensure the text is properly rendered.
Convert score of search results to float (#707)
When using the find or match interfaces with the Redis document store, scores are now returned as float and not string.
Initialize doc with dataclass obj and kwargs (#694)
Allow initialization of a Document instance with a dataclass object as well as additional kwargs.
Previously, when a Document was initialized with both a dataclass object and kwargs, the attributes passed with the dataclass object were overridden.
from docarray import dataclass, Document
from docarray.typing import Text
@dataclass
class MyDoc:
    chunk_text: Text
d = Document(MyDoc(chunk_text='chunk level text'), text='top level text')
assert d.text == 'top level text'
assert d.chunk_text.text == 'chunk level text'
Attribute error with empty list in dataclass (#674)
Allow passing an empty List as field input of a dataclass:
from docarray import *
from docarray.typing import *
from typing import List
@dataclass()
class A:
    img: List[Text]
Document(A(img=[]))
Propagate context enter and exit to subindices (#737)
When using DocumentArray as a context manager, subindices are now handled as context managers as well.
This makes handling subindices more robust.
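The propagation pattern can be sketched with contextlib.ExitStack from the standard library. This is a toy model under assumptions, not DocArray's code: `ToyIndex` is a hypothetical stand-in for a storage-backed array, and its `events` list only records what happened.

```python
from contextlib import ExitStack


class ToyIndex:
    """Hypothetical stand-in for a storage-backed array with subindices."""

    def __init__(self, name, subindices=()):
        self.name = name
        self.subindices = list(subindices)
        self.events = []
        self._stack = None

    def __enter__(self):
        self.events.append('enter')
        # Enter every subindex together with the parent.
        self._stack = ExitStack()
        for sub in self.subindices:
            self._stack.enter_context(sub)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Exit subindices first (in reverse order), then the parent.
        suppressed = self._stack.__exit__(exc_type, exc, tb)
        self.events.append('exit')
        return suppressed


chunks = ToyIndex('chunks')
parent = ToyIndex('parent', subindices=[chunks])
with parent:
    pass
print(chunks.events)  # ['enter', 'exit']
```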
Correct type hint for tags in DocumentData (#735)
Change the type hint for tags in docarray.document.data.DocumentData from tags: Optional[Dict[str, 'StructValueType']] to tags: Optional[Dict[str, Any]].
This stops the IDE complaining when passing nested dictionaries inside tags.
📗 Documentation Improvements
Add new benchmark page with SIFT1M dataset (#691)
Change the benchmark section of the docs to use the SIFT1M dataset. Also add QPS-Recall graphs to compare how different DocumentStores work in DocArray.
Other documentation improvements
- Add Colab notebook for interactive 3D data visualization (#749)
- Fix Finetuner links in README (#706)
- Use URL instead of session state in version selector (#693)
- Replace plot with display (#689)
- Add versioned documentation (#664)
- Complement and rewrite evaluation docs (#662)
🤘 Contributors
We would like to thank all contributors to this release:
- Wang Bo (@bwanglzu)
- Adrien (@adrienlachaize)
- anna-charlotte (@anna-charlotte)
- Nan Wang (@nan-wang)
- Bob van Luijt (@bobvanluijt)
- Dirk Kulawiak (@dirkkul)
- Jackmin801 (@Jackmin801)
- Nicholas Dunham (@NicholasDunham)
- Johannes Messner (@JohannesMessner)
- samsja (@samsja)
- Joan Fontanals (@JoanFM)
- AlaeddineAbdessalem (@alaeddine-13)
- dong xiang (@dongxiang123)
- Anne Yang (@AnneYang720)
- Marco Luca Sbodio (@marcosbodio)
- Michael Günther (@guenthermi)
- Han Xiao (@hanxiao)