Release Note (0.37.0)
This release contains 6 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.
🆕 Features
Milvus Integration (#1681)
Leverage the power of Milvus in your DocArray project with this latest integration. Here's a simple usage example:
```python
import numpy as np
from docarray import BaseDoc
from docarray.index import MilvusDocumentIndex
from docarray.typing import NdArray
from pydantic import Field


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10] = Field(is_embedding=True)


docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)

db = MilvusDocumentIndex[MyDoc]()
db.index(docs)
results = db.find(query, limit=10)
```
In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Milvus-backed document index and use it to index our documents. Finally, we perform a search query.
Supported Functionalities
- Find: Vector search for efficient retrieval of similar documents.
- Filter: Use Milvus filter syntax to filter based on textual and numeric data.
- Get/Del: Fetch or delete specific documents from the index.
- Hybrid Search: Combine find and filter functionalities for more refined search.
- Subindex: Search through nested data.
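Under the hood, `find` retrieves the documents whose embeddings are most similar to the query vector. As a rough, pure-numpy illustration of those semantics (a brute-force sketch, not Milvus's actual ANN implementation), a top-k search by cosine similarity looks like this:

```python
import numpy as np


def brute_force_find(embeddings: np.ndarray, query: np.ndarray, limit: int):
    """Return indices and scores of the `limit` most similar rows by cosine similarity."""
    # Normalize rows and query so the dot product equals cosine similarity.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = emb_norm @ q_norm
    top = np.argsort(-scores)[:limit]  # best-scoring indices first
    return top, scores[top]


embeddings = np.random.rand(10, 10)
query = np.random.rand(10)
indices, scores = brute_force_find(embeddings, query, limit=3)
```

A dedicated backend like Milvus replaces this linear scan with an approximate index so search stays fast at scale.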
Support filtering in HnswDocumentIndex (#1718)
With this update, you can use filtering in HnswDocumentIndex, either as a standalone operation or in conjunction with the query builder to combine it with vector search.
The code below shows how the new feature works:
```python
import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray


class SimpleSchema(BaseDoc):
    year: int
    price: int
    embedding: NdArray[128]


# Create dummy documents.
docs = DocList[SimpleSchema](
    SimpleSchema(year=2000 - i, price=i, embedding=np.random.rand(128))
    for i in range(10)
)

doc_index = HnswDocumentIndex[SimpleSchema](work_dir="./tmp_5")
doc_index.index(docs)

# Independent filtering operation (year == 1995)
filter_query = {"year": {"$eq": 1995}}
results = doc_index.filter(filter_query)

# Filtering combined with vector search
hybrid_query = (
    doc_index.build_query()  # get empty query object
    .filter(filter_query={"year": {"$gt": 1994}})  # pre-filtering (year > 1994)
    .find(
        query=np.random.rand(128), search_field="embedding"
    )  # add vector similarity search
    .filter(filter_query={"price": {"$lte": 3}})  # post-filtering (price <= 3)
    .build()
)
results = doc_index.execute_query(hybrid_query)
```
First, we create and index some dummy documents. Then, we use the filter function in two ways. One is by itself to find documents from a specific year. The other is mixed with a vector search, where we first filter by year, perform a vector search, and then filter by price.
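The filter queries use a MongoDB-style operator syntax. As a rough, pure-Python sketch of what such a query means (the actual matching happens inside the index backend), an evaluator for the operators used above might look like this:

```python
def matches(doc: dict, filter_query: dict) -> bool:
    """Check whether a document satisfies a MongoDB-style filter query."""
    ops = {
        '$eq': lambda value, target: value == target,
        '$gt': lambda value, target: value > target,
        '$gte': lambda value, target: value >= target,
        '$lt': lambda value, target: value < target,
        '$lte': lambda value, target: value <= target,
    }
    # Every field condition must hold for the document to match.
    for field, condition in filter_query.items():
        for op, target in condition.items():
            if not ops[op](doc[field], target):
                return False
    return True


# Same dummy data as above, as plain dicts: year runs 2000 down to 1991.
docs = [{'year': 2000 - i, 'price': i} for i in range(10)]

# year == 1995 selects exactly one of the dummy documents.
selected = [d for d in docs if matches(d, {'year': {'$eq': 1995}})]
```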
Pre-filtering in InMemoryExactNNIndex (#1713)
You can now add a pre-filter to your queries in InMemoryExactNNIndex. This lets you create flexible queries where you can set up as many pre- and post-filters as you want. Here's a simple example:
```python
query = (
    doc_index.build_query()
    .filter(filter_query={'price': {'$lte': 3}})  # Pre-filter: price <= 3
    .find(query=np.ones(10), search_field='tensor')  # Vector search
    .filter(filter_query={'text': {'$eq': 'hello 1'}})  # Post-filter: text == 'hello 1'
    .build()
)
```
In this example, we first set a pre-filter to only include items priced 3 or less. We then do a vector search. Lastly, we add a post-filter to find items with the text 'hello 1'. This way, you can easily filter before and after your search!
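Conceptually, the built query runs as a pipeline: the pre-filter shrinks the candidate set, the vector search ranks what is left, and the post-filter prunes the ranked results. A minimal numpy sketch of that flow (illustrative only, not docarray's internals):

```python
import numpy as np

docs = [{'price': i, 'text': f'hello {i}', 'tensor': np.random.rand(10)} for i in range(5)]

# Pre-filter: keep only candidates with price <= 3.
candidates = [d for d in docs if d['price'] <= 3]

# Vector search: rank the remaining candidates by dot-product similarity.
query = np.ones(10)
candidates.sort(key=lambda d: float(d['tensor'] @ query), reverse=True)

# Post-filter: prune the ranked results down to text == 'hello 1'.
results = [d for d in candidates if d['text'] == 'hello 1']
```

Because the pre-filter runs before the similarity computation, it also reduces the amount of work the exact NN search has to do.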
Support document updates in InMemoryExactNNIndex (#1724)
You can now easily update your documents in InMemoryExactNNIndex. Previously, when you tried to update the same set of documents, it would just add duplicate copies instead of changing the existing ones. But not anymore! From now on, if you want to update documents, you just re-index them.
Choose tensor format with DocVec deserialization (#1679)
Now you can specify the format of your tensors during DocVec deserialization. You can do this with any method you're using to convert data - protobuf, json, pandas, bytes, binary, or base64. This means you'll always get your tensors in the format you want, whether it's a Torch tensor, TensorFlow tensor, NdArray, and so on.
Add description and example to id field of BaseDoc (#1737)
We added a description and example to the id field of BaseDoc, so that you get a richer OpenAPI specification when building FastAPI-based applications with it.
🚀 Performance
Improve HnswDocumentIndex performance (#1727, #1729)
We've implemented two key optimizations to enhance the performance of HnswDocumentIndex. First, we no longer serialize embeddings to SQLite; this is a costly operation and unnecessary, since the embeddings can be reconstructed from the hnswlib index itself. Second, we've minimized how often num_docs() is computed, which previously involved a time-consuming full table scan to count the documents in SQLite. As a result, both indexing and searching are roughly 10% faster.
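The second optimization replaces a repeated `SELECT COUNT(*)`-style scan with a cached counter that is only recomputed after a write invalidates it. A pure-Python sketch of that caching pattern (not docarray's actual code):

```python
import sqlite3


class DocStore:
    """Sketch of caching a row count instead of re-scanning the table on each call."""

    def __init__(self):
        self._conn = sqlite3.connect(':memory:')
        self._conn.execute('CREATE TABLE docs (id TEXT PRIMARY KEY, data BLOB)')
        self._num_docs_cache = None  # invalidated on every write

    def index(self, doc_id: str, data: bytes):
        self._conn.execute('INSERT OR REPLACE INTO docs VALUES (?, ?)', (doc_id, data))
        self._num_docs_cache = None  # force a recount on next query

    def num_docs(self) -> int:
        if self._num_docs_cache is None:  # recompute only after a write
            self._num_docs_cache = self._conn.execute(
                'SELECT COUNT(*) FROM docs'
            ).fetchone()[0]
        return self._num_docs_cache


store = DocStore()
store.index('a', b'\x00')
store.index('b', b'\x01')
n = store.num_docs()  # one scan; repeated calls hit the cache
```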
🐞 Bug Fixes
Fix TorchTensor type comparison (#1739)
We have addressed an exception raised when comparing TorchTensor with the type keyword from the docarray.typing module. Previously, this led to a TypeError; the error has now been resolved, ensuring proper type comparison.
Add more info from dynamic class (#1733)
When using the method create_base_doc_from_schema to dynamically create a BaseDoc class, some information was lost. The new class now keeps the FieldInfo of the original class, such as description and examples.
Fix call to unsafe issubclass (#1731)
We fixed a bug in a call to issubclass by switching to a safer implementation that handles certain types correctly.
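Plain `issubclass` raises a `TypeError` when handed something that is not an ordinary class, such as a parametrized generic. A minimal sketch of a defensive wrapper in that spirit (not docarray's exact implementation):

```python
from typing import List


def safe_issubclass(candidate, parent) -> bool:
    """Like issubclass, but returns False instead of raising on non-class inputs."""
    try:
        return issubclass(candidate, parent)
    except TypeError:
        # e.g. List[int] is a generic alias, not a class
        return False


ok = safe_issubclass(bool, int)  # ordinary classes behave as before
generic = safe_issubclass(List[int], list)  # plain issubclass would raise here
```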
Align collection and index name in QdrantDocumentIndex (#1723)
We've corrected an issue where the collection name was not updated to match a newly initialized subindex name in QdrantDocumentIndex. This ensures consistent naming between collections and their respective subindexes.
Fix deepcopy of TorchTensor (#1720)
We fixed a bug that prevented deepcopying documents containing TorchTensors.
📗 Documentation Improvements
- Make Document Indices self-contained (#1678)
🤘 Contributors
We would like to thank all contributors to this release:
- Joan Fontanals (@JoanFM)
- Johannes Messner (@JohannesMessner)
- Saba Sturua (@jupyterjazz)