Release Note (0.34.0)
This release contains 2 breaking changes, 3 new features, 11 bug fixes, and 2 documentation improvements.
💣 Breaking Changes
Terminate Python 3.7 support
We decided to drop it for two reasons:
- Several dependencies of DocArray require Python 3.8.
- Python long-term support for 3.7 is ending this week. This means there will no longer be security updates for Python 3.7, making this a good time for us to change our requirements.
Changes to DocVec Protobuf definition (#1639)
To fix the bug in DocVec Protobuf serialization described in #1561, we have changed the DocVec .proto definition.
This means that DocVec objects serialized with DocArray v0.33.0 or earlier cannot be deserialized with DocArray v0.34.0 or later, and vice versa.
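The compatibility rule can be written down as a small check. This is a minimal sketch, not part of DocArray: the helpers parse_version and can_deserialize are hypothetical, and the sketch assumes plain major.minor.patch version strings.

```python
from typing import Tuple

# First release that uses the new DocVec .proto definition, per this note.
NEW_PROTO_VERSION = (0, 34, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as 'v0.33.0' or '0.34.0' into a tuple of ints."""
    return tuple(int(part) for part in version.lstrip('v').split('.'))


def can_deserialize(writer_version: str, reader_version: str) -> bool:
    """Hypothetical helper: True if writer and reader use the same DocVec .proto."""
    writer_is_new = parse_version(writer_version) >= NEW_PROTO_VERSION
    reader_is_new = parse_version(reader_version) >= NEW_PROTO_VERSION
    # Messages are only readable when both sides are on the same side of 0.34.0.
    return writer_is_new == reader_is_new
```

A check like this could guard a loading path that reads DocVec messages written by another service, provided the writer's DocArray version is recorded alongside the payload.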
If you use Protobuf with DocVec, we recommend upgrading to DocArray v0.34.0 or later.
🆕 Features
Allow users to check if a Document is already indexed in a DocIndex (#1633)
You can now check if a Document has already been indexed by using the in keyword:
from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


docs = DocList[MyDoc](
    [MyDoc(text="Example text", embedding=np.random.rand(128)) for _ in range(2000)]
)
index = InMemoryExactNNIndex[MyDoc](docs)
assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index
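The following sketch shows one way an index can support the in keyword, through Python's __contains__ method. It assumes membership is decided by the Document's id rather than by its field values; that is an assumption about the implementation, not something this note states. Doc and TinyIndex are hypothetical stand-ins, not DocArray classes.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Doc:
    """Hypothetical stand-in for a Document with an auto-generated id."""
    text: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


class TinyIndex:
    """Hypothetical index that supports the in keyword via __contains__."""

    def __init__(self, docs: Iterable[Doc]) -> None:
        self._docs_by_id: Dict[str, Doc] = {doc.id: doc for doc in docs}

    def __contains__(self, item: object) -> bool:
        # Assumption: membership is decided by id, not by field values.
        return isinstance(item, Doc) and item.id in self._docs_by_id


docs = [Doc(text='Example text') for _ in range(3)]
index = TinyIndex(docs)
assert docs[0] in index
assert Doc(text='Example text') not in index  # same text, different id
```

Under this assumption, a newly constructed Document is never found in the index, even when its fields match an indexed one, because it carries a fresh id.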
Support subindexes in InMemoryExactNNIndex (#1617)
You can now use the find_subindex method with InMemoryExactNNIndex.
import numpy as np
from pydantic import Field

from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl, VideoUrl, AnyTensor


class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor_image: AnyTensor = Field(space='cosine', dim=64)


class VideoDoc(BaseDoc):
    url: VideoUrl
    images: DocList[ImageDoc]
    tensor_video: AnyTensor = Field(space='cosine', dim=128)


class MyDoc(BaseDoc):
    docs: DocList[VideoDoc]
    tensor: AnyTensor = Field(space='cosine', dim=256)


doc_index = InMemoryExactNNIndex[MyDoc]()
...

# find by the `ImageDoc` tensor when index is populated
root_docs, sub_docs, scores = doc_index.find_subindex(
    np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
)
Flexible tensor types for Protobuf deserialization (#1645)
You can deserialize any DocVec Protobuf message to any tensor type by passing the tensor_type parameter to from_protobuf.
This means that you can choose at deserialization time whether you are working with NumPy, PyTorch, or TensorFlow tensors.
from docarray import BaseDoc, DocVec
from docarray.typing import TensorFlowTensor


class MyDoc(BaseDoc):
    tensor: TensorFlowTensor


da = DocVec[MyDoc](...)  # doesn't matter what tensor_type is here
proto = da.to_protobuf()
da_after = DocVec[MyDoc].from_protobuf(proto, tensor_type=TensorFlowTensor)
assert isinstance(da_after.tensor, TensorFlowTensor)
Add DBConfig to InMemoryExactNNIndex
InMemoryExactNNIndex used to take a single index_file_path constructor parameter, unlike the other indexes, which each accept their own DBConfig. Now index_file_path is part of DBConfig, so the index can be initialized from a config object, and the config can be extended if more parameters are needed.
The parameters of DBConfig can also be passed at construction time as **kwargs, which keeps this change compatible with the old usage.
These two initializations are equivalent:
from docarray.index import InMemoryExactNNIndex
db_config = InMemoryExactNNIndex.DBConfig(index_file_path='index.bin')
index = InMemoryExactNNIndex[MyDoc](db_config=db_config)
index = InMemoryExactNNIndex[MyDoc](index_file_path='index.bin')
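The equivalence comes from folding keyword arguments into the config object. Below is a minimal sketch of that pattern using dataclasses; the DBConfig and TinyIndex classes here are hypothetical stand-ins, and letting keyword arguments override a passed config is a choice made for this sketch, not a description of DocArray's behavior.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional


@dataclass(frozen=True)
class DBConfig:
    """Hypothetical config holding everything needed at construction time."""
    index_file_path: Optional[str] = None


class TinyIndex:
    """Hypothetical index accepting a config object, keyword arguments, or both."""

    def __init__(self, db_config: Optional[DBConfig] = None, **kwargs: Any) -> None:
        base = db_config if db_config is not None else DBConfig()
        # replace() returns a copy with the given fields overridden and raises
        # TypeError for names that are not fields of DBConfig.
        self.db_config = replace(base, **kwargs)


from_config = TinyIndex(db_config=DBConfig(index_file_path='index.bin'))
from_kwargs = TinyIndex(index_file_path='index.bin')
assert from_config.db_config == from_kwargs.db_config
```

Because both construction paths end in the same config object, adding a field to the config later makes it available through either path without touching the constructor signature.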
🐞 Bug Fixes
Allow Protobuf deserialization of BaseDoc with Union type (#1655)
Serialization of BaseDoc types that have Union fields made up of Python native types is now supported.
from typing import Union

from docarray import BaseDoc, DocList


class MyDoc(BaseDoc):
    union_field: Union[int, str]


docs1 = DocList[MyDoc]([MyDoc(union_field="hello")])
# round-trip through a DataFrame
docs2 = DocList[MyDoc].from_dataframe(docs1.to_dataframe())
assert docs1 == docs2
When these Union types involve other BaseDoc types, an exception is thrown.
from docarray.documents import ImageDoc, TextDoc


class CustomDoc(BaseDoc):
    ud: Union[TextDoc, ImageDoc] = TextDoc(text='union type')


docs = DocList[CustomDoc]([CustomDoc(ud=TextDoc(text='union type'))])
# raises an Exception
DocList[CustomDoc].from_dataframe(docs.to_dataframe())
Cast limit to integer when passed to HNSWDocumentIndex (#1657, #1656)
If you call find or find_batched on an HNSWDocumentIndex, the limit parameter will automatically be cast to integer.
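As an illustration of why such a cast helps, a limit can arrive as a float such as 5.0, for example after being parsed from JSON. The sketch below shows one defensive way to coerce it. The helper coerce_limit is hypothetical and stricter than a plain int() call; it is not DocArray's implementation.

```python
from typing import Union


def coerce_limit(limit: Union[int, float]) -> int:
    """Hypothetical helper: cast a limit to int, rejecting values that are not whole numbers >= 1."""
    as_float = float(limit)
    if not as_float.is_integer():
        # Covers fractional values as well as inf and nan.
        raise ValueError(f'limit must be a whole number, got {limit!r}')
    as_int = int(as_float)
    if as_int < 1:
        raise ValueError(f'limit must be at least 1, got {limit!r}')
    return as_int
```

A plain int(limit) would silently truncate 2.5 to 2; rejecting such values instead is a design choice of this sketch.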
Move default_column_config from RuntimeConfig to DBConfig (#1648)
default_column_config contains specific configuration information about the columns and tables inside the backend's database. This was previously put inside RuntimeConfig, which caused an error because this information is required at initialization time. This information has been moved inside DBConfig so you can edit it there.
from docarray.index import HNSWDocumentIndex
import numpy as np

db_config = HNSWDocumentIndex.DBConfig()
# adjust the default column settings used for np.ndarray columns
db_config.default_column_config.get(np.ndarray).update({'ef': 2500})
index = HNSWDocumentIndex[MyDoc](db_config=db_config)
Fix issue with Protobuf (de)serialization for DocVec (#1639)
This bug caused raw Protobuf objects to be stored as DocVec columns after they were deserialized from Protobuf, making the data essentially inaccessible. This has now been fixed, and DocVec objects are identical before and after (de)serialization.
Fix order of returned matches when find and filter are combined in InMemoryExactNNIndex (#1642)
Hybrid search (find + filter) on InMemoryExactNNIndex was returning matches with low similarity (lower scores) first. This has been fixed by adding an option to sort matches in reverse order of their scores.
# prepare a query
q_doc = MyDoc(embedding=np.random.rand(128), text='query')

query = (
    db.build_query()
    .find(query=q_doc, search_field='embedding')
    .filter(filter_query={'text': {'$exists': True}})
    .build()
)

results = db.execute_query(query)
# Before: results was sorted from worst to best matches
# Now: It's sorted in the correct order, showing better matches first
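The underlying idea is that the sort direction depends on what the score means: similarities should be sorted in descending order, distances in ascending order. A minimal sketch of that logic follows; rank_matches and its higher_is_better parameter are hypothetical names, not DocArray's API.

```python
from typing import List, Sequence, Tuple


def rank_matches(
    doc_ids: Sequence[str],
    scores: Sequence[float],
    limit: int,
    higher_is_better: bool = True,
) -> List[Tuple[str, float]]:
    """Hypothetical helper: return the top `limit` (doc_id, score) pairs, best match first."""
    pairs = sorted(
        zip(doc_ids, scores),
        key=lambda pair: pair[1],
        reverse=higher_is_better,  # descending for similarities, ascending for distances
    )
    return pairs[:limit]


ranked = rank_matches(['a', 'b', 'c'], [0.2, 0.9, 0.5], limit=2)
assert ranked == [('b', 0.9), ('c', 0.5)]
```

Sorting without the reverse flag on a similarity score reproduces the symptom described above: the weakest matches come back first.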
Working with external Qdrant collections (#1632)
Using QdrantDocumentIndex to connect to a Qdrant DB initialized outside of DocArray raised a KeyError. This has been fixed, and you can now use QdrantDocumentIndex to connect to externally initialized collections.
Other bug fixes
- Update text search to match Weaviate client's new signature (#1654)
- Fix DocVec equality (#1641, #1663)
- Fix exception when summary() is called for LegacyDocument (#1637)
- Fix DocList and DocVec coercion (#1568)
- Fix update() on BaseDoc with tensor fields (#1628)
📗 Documentation Improvements
🤟 Contributors
We would like to thank all contributors to this release:
- Johannes Messner (@JohannesMessner)
- Nikolas Pitsillos (@npitsillos)
- Shukri (@hsm207)
- Kacper Łukawski (@kacperlukawski)
- Aman Agarwal (@agaraman0)
- maxwelljin (@maxwelljin)
- samsja (@samsja)
- Saba Sturua (@jupyterjazz)
- Joan Fontanals (@JoanFM)