Release Note (0.36.0)
This release contains 2 new features, 1 performance improvement, 2 refactorings, 5 bug fixes, and 1 documentation improvement.
🆕 Features
JAX Integration (#1646)
You can now use JAX with DocArray. We have introduced JaxArray as a new type option for your documents. JaxArray ensures that JAX can natively process any array-like data in your DocArray documents. Here's how to use it:
from docarray import BaseDoc
from docarray.typing import JaxArray
import jax.numpy as jnp


class MyDoc(BaseDoc):
    arr: JaxArray
    image_arr: JaxArray[3, 224, 224]  # For images of shape (3, 224, 224)
    square_crop: JaxArray[3, 'x', 'x']  # For any square image, regardless of dimensions
    random_image: JaxArray[3, ...]  # For any image with 3 color channels, and arbitrary other dimensions
As you can see, the JaxArray type is extremely flexible and can support a wide range of tensor shapes.
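To make the three shape annotations above concrete, here is a minimal pure-Python sketch of how a shape could be checked against such a spec. The function name shape_matches is hypothetical, and this is not DocArray's actual validation logic; in particular, it assumes a trailing `...` accepts zero or more extra dimensions.

```python
def shape_matches(shape, spec):
    """Hypothetical helper: check a concrete shape against a spec made of
    ints (exact size), strings (named dims that must agree with each other),
    and an optional trailing Ellipsis (assumed: any number of extra dims)."""
    spec = tuple(spec)
    if spec and spec[-1] is Ellipsis:
        fixed = spec[:-1]
        if len(shape) < len(fixed):
            return False
        shape = shape[: len(fixed)]  # only the leading dims are constrained
    else:
        fixed = spec
        if len(shape) != len(fixed):
            return False

    named = {}  # remembers the size first seen for each named dim
    for dim, expected in zip(shape, fixed):
        if isinstance(expected, int):
            if dim != expected:
                return False
        elif isinstance(expected, str):
            if named.setdefault(expected, dim) != dim:
                return False
    return True


print(shape_matches((3, 64, 64), (3, 'x', 'x')))
print(shape_matches((3, 64, 32), (3, 'x', 'x')))
print(shape_matches((3, 128, 256), (3, ...)))
```

The named dimension 'x' is what distinguishes "any square image" from "any 3-channel image": both occurrences must resolve to the same size.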
Creating a Document with Tensors
Creating a document with tensors is straightforward. Here is an example:
doc = MyDoc(
    arr=jnp.zeros((128,)),
    image_arr=jnp.zeros((3, 224, 224)),
    square_crop=jnp.zeros((3, 64, 64)),
    random_image=jnp.zeros((3, 128, 256)),
)
Redis Integration (#1550)
Leverage the power of Redis in your DocArray project with this latest integration. Here's a simple usage example:
import numpy as np
from docarray import BaseDoc
from docarray.index import RedisDocumentIndex
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10]


# Create ten documents with random 10-dimensional embeddings
docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)

# Connect to a local Redis instance and index the documents
db = RedisDocumentIndex[MyDoc](host='localhost')
db.index(docs)

# Vector search on the embedding field
results = db.find(query, search_field='embedding', limit=10)
In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Redis-backed document index and use it to index our documents. Finally, we perform a search query.
Supported Functionalities
- Find: Vector search for efficient retrieval of similar documents.
- Filter: Use Redis syntax to filter based on textual and numeric data.
- Text Search: Leverage text search methods, such as BM25, to find relevant documents.
- Get/Del: Fetch or delete specific documents from the index.
- Hybrid Search: Combine find and filter functionalities for more refined search. Currently, only these two can be combined.
- Subindex: Search through nested data.
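As an illustration of what combining find and filter means conceptually, the following is a toy in-memory sketch: it first narrows the candidates with a predicate, then ranks the survivors by cosine similarity. The function names, the dictionary-based documents, and the price field are all hypothetical; this does not use Redis or the RedisDocumentIndex API, and it makes no claim about how Redis executes such queries.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(docs, query, predicate, limit=10):
    # Step 1 (filter): keep only documents that satisfy the predicate
    candidates = [d for d in docs if predicate(d)]
    # Step 2 (find): rank the remaining documents by similarity to the query
    ranked = sorted(candidates, key=lambda d: cosine(d["embedding"], query), reverse=True)
    return ranked[:limit]


docs = [
    {"text": "a", "price": 5, "embedding": [1.0, 0.0]},
    {"text": "b", "price": 50, "embedding": [1.0, 0.1]},
    {"text": "c", "price": 7, "embedding": [0.0, 1.0]},
]
hits = hybrid_search(docs, [1.0, 0.0], lambda d: d["price"] < 10, limit=2)
print([d["text"] for d in hits])
```

Document "b" is close to the query but is excluded by the filter, which is the refinement that hybrid search adds over plain vector search.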
🚀 Performance
Speed up HnswDocumentIndex by caching num docs (#1706)
We've optimized the num_docs() operation by caching the document count, addressing previous slowdowns during searches. This change results in a minor increase in indexing time but significantly accelerates search times.
from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
import numpy as np
import time


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for _ in range(20000)]
index = HnswDocumentIndex[MyDoc](work_dir='tst', index_name='index')

# Time the indexing of all documents
index_start = time.time()
index.index(docs=DocList[MyDoc](docs))
index_time = time.time() - index_start

# Time a single search
query = docs[0]
find_start = time.time()
matches, _ = index.find(query, search_field='embedding', limit=10)
find_time = time.time() - find_start
In the above experiment, we observed a 13x improvement in the speed of the search function, reducing its execution time from 0.0238 to 0.0018 seconds.
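The trade-off described above (slightly more work at indexing time, much less at query time) can be sketched with a toy store backed by SQLite. The class and method names below are hypothetical, and this is not the HnswDocumentIndex implementation; it only assumes that the uncached count would otherwise require a query on every call.

```python
import sqlite3


class CachedCountStore:
    """Hypothetical toy store that keeps a running document count."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT)")
        self._num_docs = 0  # cached count, kept in sync on every write

    def index(self, docs):
        # docs is an iterable of (id, body) pairs
        for doc_id, body in docs:
            cur = self._conn.execute(
                "INSERT OR IGNORE INTO docs (id, body) VALUES (?, ?)",
                (doc_id, body),
            )
            # rowcount is 1 for a new row and 0 when the id already existed
            self._num_docs += cur.rowcount
        self._conn.commit()

    def num_docs(self):
        # Constant-time read of the cached value, no query needed
        return self._num_docs

    def count_uncached(self):
        # What every call would cost without the cache
        return self._conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]


store = CachedCountStore()
store.index([("a", "x"), ("b", "y"), ("a", "z")])
print(store.num_docs(), store.count_uncached())
```

The extra bookkeeping in index() is the small indexing cost; any code path that calls num_docs() repeatedly during search then avoids the query entirely.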
⚙ Refactoring
Put contains method in the base class (#1701)
We've moved the contains method into the base class. With this refactoring, the responsibility for checking whether a document exists is delegated to individual backend implementations through the new _doc_exists method.
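The delegation pattern can be sketched as follows. This is a simplified illustration rather than DocArray's code: it assumes the contains method corresponds to Python's `__contains__` hook, and every name except _doc_exists (BaseIndex, DictIndex, Doc, add) is hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical stand-in for a document with an id
    id: str


class BaseIndex(ABC):
    @abstractmethod
    def _doc_exists(self, doc_id: str) -> bool:
        """Backend-specific existence check."""

    def __contains__(self, item) -> bool:
        # Shared logic lives here once; only the lookup is backend-specific
        doc_id = getattr(item, "id", None)
        if doc_id is None:
            return False
        return self._doc_exists(str(doc_id))


class DictIndex(BaseIndex):
    """Hypothetical backend that stores documents in a dict."""

    def __init__(self):
        self._docs = {}

    def add(self, doc: Doc) -> None:
        self._docs[doc.id] = doc

    def _doc_exists(self, doc_id: str) -> bool:
        return doc_id in self._docs


idx = DictIndex()
doc = Doc(id="1")
idx.add(doc)
print(doc in idx, Doc(id="2") in idx)
```

With this layout, a new backend only needs to implement _doc_exists to support the `in` operator.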
More robust method to detect duplicate index (#1651)
We have implemented a more robust method of detecting existing indices for WeaviateDocumentIndex.
🐞 Bug Fixes
WeaviateDocumentIndex handles lowercase index names (#1711)
We've addressed an issue in the WeaviateDocumentIndex where passing a lowercase index name led to mismatches and subsequent errors. This was due to the system automatically capitalizing the index name when creating an index. To resolve this, we've added a post_init function that capitalizes the first letter of the provided index name, ensuring consistent naming and preventing potential errors.
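A minimal sketch of this kind of normalization, assuming a dataclass-style `__post_init__` hook. The IndexConfig class is hypothetical and is not the actual WeaviateDocumentIndex configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexConfig:
    # Hypothetical config holding a user-supplied index name
    index_name: Optional[str] = None

    def __post_init__(self):
        if self.index_name:
            # Upper-case only the first character; str.capitalize() would
            # also lower-case the rest and turn "myIndex" into "Myindex"
            self.index_name = self.index_name[0].upper() + self.index_name[1:]


print(IndexConfig("myIndex").index_name)
```

Normalizing the name once at construction time means every later lookup uses the same spelling the backend stores.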
QdrantDocumentIndex unable to see index_name (#1705)
We've resolved an issue where the QdrantDocumentIndex was not properly recognizing the index_name parameter. Previously, the specified index_name was ignored and the system defaulted to the schema name.
Fix search in InMemoryExactNNIndex with AnyEmbedding (#1696)
From now on, you can perform search operations in InMemoryExactNNIndex using AnyEmbedding.
Use safe_issubclass everywhere (#1691)
We now use safe_issubclass instead of issubclass because it supports non-class inputs, helping us to avoid unexpected errors.
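The motivation is that the built-in issubclass raises TypeError when its first argument is not a class, which happens easily with type annotations such as List[int]. The sketch below shows one way such a guard could look; the name safe_issubclass_sketch is hypothetical and this is not DocArray's implementation of safe_issubclass.

```python
from typing import List


def safe_issubclass_sketch(x, a_tuple) -> bool:
    """Hypothetical guard: return False instead of raising when x is not a class.

    Note that this simple version also hides a TypeError caused by an
    invalid second argument.
    """
    try:
        return issubclass(x, a_tuple)
    except TypeError:
        return False


print(safe_issubclass_sketch(bool, int))
print(safe_issubclass_sketch(3, int))
print(safe_issubclass_sketch(List[int], list))
```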
Avoid converting DocLists in the base index (#1685)
We added an additional check to avoid passing DocLists to a function that converts a list of dictionaries to a DocList.
📗 Documentation Improvements
- Add docs for dict() method (#1643)
🤟 Contributors
We would like to thank all contributors to this release:
- Puneeth K (@punndcoder28)
- Joan Fontanals (@JoanFM)
- Saba Sturua (@jupyterjazz)
- Aman Agarwal (@agaraman0)
- samsja (@samsja)
- Shukri (@hsm207)