Release Note (0.33.0)
This release contains 1 new feature, 1 performance improvement, 9 bug fixes, and 4 documentation improvements.
Features
Allow coercion between different Tensor types (#1552) (#1588)
Allow coercing to a TorchTensor from an NdArray or TensorFlowTensor and the other way around.
from docarray import BaseDoc
from docarray.typing import TorchTensor
import numpy as np

class MyTensorsDoc(BaseDoc):
    tensor: TorchTensor

doc = MyTensorsDoc(tensor=np.zeros(512))
doc.summary()
MyTensorsDoc : 0a10f88 ...
╭─────────────────────┬───────────────────────────────────────────────────╮
│ Attribute           │ Value                                             │
├─────────────────────┼───────────────────────────────────────────────────┤
│ tensor: TorchTensor │ TorchTensor of shape (512,), dtype: torch.float64 │
╰─────────────────────┴───────────────────────────────────────────────────╯
Performance
Avoid stack embedding for every search (#1586)
We have made a performance improvement for the find interface for InMemoryExactNNIndex that gives a ~2x speedup.
The script used to measure this is as follows:
from torch import rand
from time import perf_counter
from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import TorchTensor

class MyDocument(BaseDoc):
    embedding: TorchTensor
    embedding2: TorchTensor
    embedding3: TorchTensor

def generate_doc_list(num_docs: int, dims: int) -> DocList[MyDocument]:
    return DocList[MyDocument](
        [
            MyDocument(
                embedding=rand(dims),
                embedding2=rand(dims),
                embedding3=rand(dims),
            )
            for _ in range(num_docs)
        ]
    )

num_docs, num_queries, dims = 500000, 1000, 128
data_list = generate_doc_list(num_docs, dims)
queries = generate_doc_list(num_queries, dims)
index = InMemoryExactNNIndex[MyDocument](data_list)
start = perf_counter()
for _ in range(5):
    matches, scores = index.find_batched(queries, search_field='embedding')

print(f"Number of queries: {num_queries} \n"
      f"Number of indexed documents: {num_docs} \n"
      f"Total time: {(perf_counter() - start)/5} seconds")
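The entry title suggests that the speedup comes from no longer re-stacking the embeddings on every search. The following is a minimal sketch of that caching idea in plain Python, not DocArray's actual implementation: `CachedExactIndex`, `cosine`, and `rebuild_count` are hypothetical names introduced only for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CachedExactIndex:
    """Hypothetical toy index that caches its stacked embedding matrix."""

    def __init__(self) -> None:
        self._ids: List[str] = []
        self._embeddings: List[Sequence[float]] = []
        self._matrix: Optional[List[Tuple[float, ...]]] = None
        self.rebuild_count = 0  # how often the matrix was (re)built

    def index(self, doc_id: str, embedding: Sequence[float]) -> None:
        self._ids.append(doc_id)
        self._embeddings.append(embedding)
        self._matrix = None  # new data invalidates the cached matrix

    def _stacked(self) -> List[Tuple[float, ...]]:
        # Build the matrix only when it is missing, not on every search.
        if self._matrix is None:
            self._matrix = [tuple(e) for e in self._embeddings]
            self.rebuild_count += 1
        return self._matrix

    def find(self, query: Sequence[float], limit: int) -> List[Tuple[str, float]]:
        scores = [cosine(query, row) for row in self._stacked()]
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        return [(self._ids[i], scores[i]) for i in order[:limit]]


index = CachedExactIndex()
index.index("a", [1.0, 0.0])
index.index("b", [0.0, 1.0])
index.index("c", [1.0, 1.0])

for _ in range(3):
    top = index.find([1.0, 0.1], limit=2)

print([doc_id for doc_id, _ in top], index.rebuild_count)
```

In this sketch the matrix is rebuilt only after new documents are indexed, so repeated searches reuse the same stacked data.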
Bug Fixes
Respect limit parameter in filter for index backends (#1618)
InMemoryExactNNIndex and HnswDocumentIndex now respect the limit parameter in the filter API.
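To illustrate what respecting a limit in a filter call means, here is a small sketch in plain Python. It does not use the DocArray filter API; `filter_with_limit` and the dictionary-based documents are hypothetical stand-ins.

```python
from itertools import islice
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def filter_with_limit(
    docs: Iterable[T], predicate: Callable[[T], bool], limit: int
) -> List[T]:
    """Return at most `limit` documents that satisfy `predicate`."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # islice stops consuming the generator once `limit` matches are found.
    return list(islice((doc for doc in docs if predicate(doc)), limit))


docs = [{"text": "hey", "price": price} for price in range(10)]
cheap = filter_with_limit(docs, lambda doc: doc["price"] < 5, limit=3)
print([doc["price"] for doc in cheap])
```

Five documents match the predicate here, but only the first three are returned because of the limit.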
HnswDocumentIndex can search with limit greater than number of documents (#1611)
HnswDocumentIndex now allows calling find with a limit parameter larger than the number of indexed documents.
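One common way to tolerate an oversized limit is to clamp it to the number of indexed items before querying the backend. The sketch below shows that pattern in plain Python; it is an assumption about the general technique, not a description of how this fix is implemented. `strict_knn` is a hypothetical stand-in for a backend that rejects a `k` larger than its element count.

```python
import heapq
from typing import List, Sequence


def strict_knn(scores: Sequence[float], k: int) -> List[int]:
    """Hypothetical backend: refuses k larger than the number of elements."""
    if k > len(scores):
        raise RuntimeError("k exceeds the number of indexed elements")
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


def find(scores: Sequence[float], limit: int) -> List[int]:
    """Clamp `limit` so the strict backend never sees an oversized k."""
    return strict_knn(scores, min(limit, len(scores)))


scores = [0.2, 0.9, 0.5]
print(find(scores, limit=10))
```

With three scored items and `limit=10`, the wrapper returns all three indices, best match first, instead of raising.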
Allow updating HnswDocumentIndex (#1604)
HnswDocumentIndex now allows reindexing documents with the same id, updating the original documents.
Dynamically resize internal index to adapt to increasing number of documents (#1602)
HnswDocumentIndex now allows indexing more than max_elements, dynamically adapting the index as it grows.
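The two behaviours above, updating by id and growing past a fixed capacity, can be sketched with a small in-memory store. This is a hypothetical illustration in plain Python: `GrowableStore` and its capacity-doubling policy are assumptions, not the strategy used by HnswDocumentIndex.

```python
from typing import Any, Dict


class GrowableStore:
    """Hypothetical store: upserts by id and doubles capacity when full."""

    def __init__(self, max_elements: int = 4) -> None:
        if max_elements < 1:
            raise ValueError("max_elements must be at least 1")
        self.max_elements = max_elements
        self._items: Dict[str, Any] = {}

    def index(self, items: Dict[str, Any]) -> None:
        # Only ids that are not stored yet consume additional capacity.
        new_ids = [doc_id for doc_id in items if doc_id not in self._items]
        needed = len(self._items) + len(new_ids)
        while needed > self.max_elements:
            self.max_elements *= 2  # assumed growth policy
        # Existing ids are overwritten, which updates the original entries.
        self._items.update(items)

    def __len__(self) -> int:
        return len(self._items)

    def get(self, doc_id: str) -> Any:
        return self._items[doc_id]


store = GrowableStore(max_elements=2)
store.index({"a": "first", "b": "second", "c": "third"})
store.index({"a": "updated"})
print(len(store), store.max_elements, store.get("a"))
```

Reindexing id `a` replaces its value without increasing the count, and indexing three items into a capacity of two triggers one resize.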
Fix simple usage of HnswDocumentIndex (#1596)
from docarray.index import HnswDocumentIndex
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for i in range(200)]
index = HnswDocumentIndex[MyDoc](work_dir='./tmp', index_name='index')
index.index(docs=DocList[MyDoc](docs))
resp = index.find_batched(queries=DocList[MyDoc](docs[0:3]), search_field='embedding')
Previously, this basic usage threw an exception:
TypeError: ModelMetaclass object argument after ** must be a mapping, not MyDoc
Now, it works as expected.
Fix InMemoryExactNNIndex index initialization with nested DocList (#1582)
Instantiating an InMemoryExactNNIndex with a Document schema that had a nested DocList previously threw this error:
from docarray import BaseDoc, DocList
from docarray.documents import TextDoc
from docarray.index import HnswDocumentIndex

class MyDoc(BaseDoc):
    text: str
    d_list: DocList[TextDoc]

index = HnswDocumentIndex[MyDoc]()
TypeError: docarray.index.abstract.BaseDocIndex.__init__() got multiple values for keyword argument 'db_config'
Now it can be successfully instantiated.
Fix summary of document with list (#1595)
Calling summary on a document with a List attribute previously showed the wrong type:
from docarray import BaseDoc, DocList
from typing import List

class TestDoc(BaseDoc):
    str_list: List[str]

dl = DocList[TestDoc]([TestDoc(str_list=[]), TestDoc(str_list=["1"])])
dl.summary()
Previous output:
╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[TestDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭─── Document Schema ───╮
│                       │
│ TestDoc               │
│ └── str_list: str     │
│                       │
╰───────────────────────╯
New output:
╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[TestDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭────── Document Schema ──────╮
│                             │
│ TestDoc                     │
│ └── str_list: List[str]     │
│                             │
╰─────────────────────────────╯
Solve issues caused by issubclass (#1594)
DocArray relies heavily on calling Python's built-in issubclass function, which caused multiple issues. We now use a safe version that accounts for edge cases and non-class types.
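To show the kind of edge case involved: the built-in `issubclass` raises a `TypeError` when its first argument is not a plain class, for example a parametrized generic such as `List[int]`. The sketch below is a hypothetical wrapper, named `safe_issubclass_sketch` to make clear that it is not DocArray's own helper, which returns `False` in those cases instead of raising.

```python
from typing import Any, List


def safe_issubclass_sketch(candidate: Any, parent: type) -> bool:
    """Like issubclass, but returns False when `candidate` is not a class."""
    try:
        return isinstance(candidate, type) and issubclass(candidate, parent)
    except TypeError:
        # Raised for objects that look like types but are not real classes.
        return False


# The built-in raises for a parametrized generic.
try:
    issubclass(List[int], list)
    builtin_raised = False
except TypeError:
    builtin_raised = True

print(builtin_raised)
print(safe_issubclass_sketch(bool, int))
print(safe_issubclass_sketch(List[int], list))
```

Catching `TypeError` in addition to the `isinstance` check is a deliberate choice in this sketch, since some generic aliases pass an `isinstance(..., type)` check on certain Python versions and still fail inside `issubclass`.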
Make example payload a string rather than bytes (#1587)
The example payload of a given document schema with a Tensor attribute was previously of bytes type. This has now been changed to str.
from docarray import DocList, BaseDoc
from docarray.documents import TextDoc
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

print(f'{type(MyDoc.schema()["properties"]["embedding"]["example"])}')
Documentation Improvements
- Add forward declaration steps to example to avoid pickling error (#1615)
- Fix n_dim to dim (#1610)
- Add "in memory" to documentation as list of supported vector indexes (#1607)
- Add a tensor section (#1576)
Contributors
We would like to thank all contributors to this release:
- Mohammad Kalim Akram (@makram93)
- samsja (@samsja)
- Saba Sturua (@jupyterjazz)
- Joan Fontanals (@JoanFM)
- maxwelljin (@maxwelljin)