This release contains 2 breaking changes, 4 new features, 11 bug fixes, and several documentation improvements.
Release Note (v0.31.0)
💥 Breaking changes
Return type of `DocVec` Optional Tensor (#1472)
Optional tensor fields in a `DocVec` now return `None` instead of a list of NaN values if the column does not hold any tensor.
This code snippet shows the breaking change:
from typing import Optional
from docarray import BaseDoc, DocVec
from docarray.typing import NdArray
class MyDoc(BaseDoc):
    tensor: Optional[NdArray[10]]
docs = DocVec[MyDoc]([MyDoc() for j in range(2)])
print(docs.tensor)
Version | Return type |
---|---|
0.30.0 | [nan nan] |
0.31.0 | None |
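Code that has to run against both versions can normalize the two representations before using the column. The helper below is a minimal sketch and not part of DocArray: `column_is_empty` is a hypothetical name, and the sketch assumes the v0.30.0 value behaves like a flat sequence of floats, as the `[nan nan]` output above suggests.

```python
import math
from typing import Optional, Sequence


def column_is_empty(column: Optional[Sequence[float]]) -> bool:
    """Hypothetical helper: True if an optional tensor column holds no tensors.

    Accepts both representations shown in the table above:
    - v0.31.0: the column is None
    - v0.30.0: the column is a flat sequence containing only NaN values
    """
    if column is None:
        return True
    return all(math.isnan(x) for x in column)


# Plain Python values stand in for `docs.tensor` here.
print(column_is_empty(None))                  # v0.31.0 style
print(column_is_empty([math.nan, math.nan]))  # v0.30.0 style
print(column_is_empty([0.5, math.nan]))       # a column that holds real data
```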
Default index collection names
Most vector databases have a concept similar to a 'table' in a relational database; this concept is usually called 'collection', 'index', 'class' or similar.
In DocArray v0.30.0, every Document Index backend defined its own default name for this, i.e. a default `index_name` or `collection_name`.
Starting with DocArray v0.31.0, the default `index_name`/`collection_name` will be derived from the document schema name:
from docarray.index.backends.weaviate import WeaviateDocumentIndex
from docarray import BaseDoc
class MyDoc(BaseDoc):
    pass
# With v0.30.0, the line below defaults to `index_name='Document'`.
# This was the default regardless of the Document Index schema.
# With v0.31.0, the line below defaults to `index_name='MyDoc'`
# The default now depends on the schema, i.e. the `MyDoc` class.
store = WeaviateDocumentIndex[MyDoc]()
If you create and persist a Document Index with v0.30.0 and then try to access it with v0.31.0 without manually specifying an index name, an exception will be raised.
You can fix this by manually specifying the index name to match the old default:
# Create new Document Index using v0.30.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=...)
# Access it using v0.31.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=..., index_name='Document')
The table below summarizes the change for all database backends:
Document Index | DBConfig argument | Default in v0.30.0 | Default in v0.31.0 |
---|---|---|---|
WeaviateDocumentIndex | index_name | 'Document' | Schema class name |
QdrantDocumentIndex | collection_name | 'documents' | Schema class name |
ElasticDocIndex | index_name | 'index__' + a random id | Schema class name |
ElasticV7DocIndex | index_name | 'index__' + a random id | Schema class name |
HnswDocumentIndex | n/a | n/a | n/a |
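If several indexes were persisted with v0.30.0, the fixed legacy defaults from the table can be kept in one place. The following is a minimal sketch under stated assumptions: `LEGACY_DEFAULTS` and `legacy_name_kwargs` are hypothetical names, not DocArray APIs, and only the backends with a fixed old default are covered.

```python
from typing import Dict

# Legacy (v0.30.0) defaults, copied from the table above.
# The Elasticsearch backends used 'index__' plus a random id, so their old
# name cannot be derived here and has to be looked up in the cluster itself.
LEGACY_DEFAULTS: Dict[str, Dict[str, str]] = {
    "WeaviateDocumentIndex": {"index_name": "Document"},
    "QdrantDocumentIndex": {"collection_name": "documents"},
}


def legacy_name_kwargs(backend_class_name: str) -> Dict[str, str]:
    """Hypothetical helper: kwargs that reproduce the v0.30.0 default name."""
    try:
        # Return a copy so callers cannot mutate the shared mapping.
        return dict(LEGACY_DEFAULTS[backend_class_name])
    except KeyError:
        raise ValueError(
            f"No fixed v0.30.0 default is known for {backend_class_name!r}"
        ) from None


print(legacy_name_kwargs("WeaviateDocumentIndex"))
```

Mirroring the Weaviate example above, the returned dictionary could then be unpacked into the constructor call, e.g. `WeaviateDocumentIndex[MyDoc](host=..., port=..., **legacy_name_kwargs('WeaviateDocumentIndex'))`.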
🆕 Features
Add `InMemoryDocIndex` (#1441)
This version introduces the `InMemoryDocIndex` Document Index (imported as `InMemoryExactNNIndex` in the snippet below), which lets you perform in-memory exact vector search, as opposed to the approximate nearest neighbor search used by vector databases.
The `InMemoryDocIndex` can be used for prototyping and is suited to small collections (1k-10k documents), whereas a vector database is suited to larger scales but adds performance overhead at smaller ones.
from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray
import numpy as np
class MyDoc(BaseDoc):
    tensor: NdArray[512]
docs = DocList[MyDoc](MyDoc(tensor=i*np.ones(512)) for i in range(10))
doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)
print(doc_index.find(3*np.ones(512), search_field='tensor', top_k=3))
FindResult(documents=<DocList[MyDoc] (length=10)>, scores=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
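To illustrate what "exact" search means here, the sketch below scores the query against every stored vector and sorts the results, with no approximation step. It is a plain-Python illustration, not the library's implementation; the function names are hypothetical, and the choice of cosine similarity as the metric is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def exact_top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Brute-force search: compare the query with every vector, keep the best k."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(exact_top_k([1.0, 0.1], vectors, k=2))
```

Because every vector is compared, the cost grows linearly with the collection size, which is why this approach fits small collections better than large ones.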
`DocList` inherits from Python `list` (#1457)
`DocList` is now a subclass of Python's `list`. This means that you can now use all the methods that are available to Python lists on `DocList` objects. For example, you can now use `len` on `DocList` objects, and tools like Pydantic or FastAPI will be able to work with it more easily.
Add `len` to `DocIndex` (#1454)
You can now call `len(vector_index)`, which is equivalent to `vector_index.num_docs()`.
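The equivalence can be pictured as `__len__` delegating to `num_docs()`. The class below is a hypothetical stand-in written for illustration only; it is not a DocArray class and makes no claim about how the library implements this.

```python
from typing import List


class ToyIndex:
    """Hypothetical stand-in for a Document Index; not a DocArray class."""

    def __init__(self) -> None:
        self._docs: List[dict] = []

    def index(self, docs: List[dict]) -> None:
        self._docs.extend(docs)

    def num_docs(self) -> int:
        return len(self._docs)

    def __len__(self) -> int:
        # Delegating keeps len(index) and index.num_docs() in sync.
        return self.num_docs()


toy = ToyIndex()
toy.index([{"id": 1}, {"id": 2}])
print(len(toy) == toy.num_docs())
```

Note that in Python, an object that defines `__len__` but not `__bool__` is falsy when its length is 0, so checks such as `if index:` may behave differently for an empty index.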
Other minor features
- Add a `to_json` alias to `BaseDoc` (#1494)
🐞 Bug Fixes
Point to older versions when importing `Document` or `DocumentArray` (#1422)
Trying to load `Document` or `DocumentArray` from DocArray would previously raise an error, saying that you needed to downgrade your version of DocArray if you wanted to use these two objects. This behavior has been fixed.
Fix `AnyDoc.from_protobuf` (#1437)
`AnyDoc` can now read any `BaseDoc` protobuf file. The same applies to `DocList`.
Other bug fixes
- Fix `extend` for `DocList` (#1493)
- Fix bug when calling `dict()` on `BaseDoc` (#1481)
- Fix bug when calling `json()` on `BaseDoc` (#1481)
- Support Pandas 2.0 by using `pd.concat()` instead of `df.append()` in `to_dataframe()` to avoid warning (#1478)
- Add logs to Elasticsearch index (#1427)
- Fix a bug in Document Index where Torch tensors that required grad were not able to be converted to `ndarray` (#1429)
- Fix a bug with HNSW (#1426)
- Hubble Binary format version bump (#1414)
- Save index during creation for `hnswlib` (#1424)
📗 Documentation Improvements
- Fix FastAPI docs (#1453)
- Index predefined Documents (#1434)
- Clean up data types section (#1412)
- Remove duplicate API reference section (#1408)
- `Docindex` URLs (#1433)
- Fix Install commands hint (#1421)
- Add Google Analytics (#1432)
- Add install instructions for `hnswlib` and `elastic` document indexes (#1431)
- Various fixes (#1436, #1417, #1423, #1418, #1411, #1419)
🤘 Contributors
We would like to thank all contributors to this release:
- Alex Cureton-Griffiths (@alexcg1)
- samsja (@samsja)
- Johannes Messner (@JohannesMessner)
- Anne Yang (@AnneYang720)
- Scott Martens (@scott-martens)
- カレン (@RStar2022)
- Aman Agarwal (@agaraman0)
- Yanlong Wang (@nomagick)
- Charlotte Gerhaher (@anna-charlotte)