Release v0.30.0 (a.k.a DocArray v2)
Changelog
If you have been using recent versions of DocArray, you will already be familiar with its dataclass API.
DocArray v2 is that idea, taken seriously. Every Document is created through a dataclass-like interface, courtesy of Pydantic.
This has the following advantages:
- Flexibility: No need to conform to a fixed set of fields -- your data defines the schema.
- Multimodality: Easily store multiple modalities and multiple embeddings in the same Document.
- Language agnostic: At their core, Documents are just dictionaries, and this makes it easy to create and send them from any language, not just Python.
You may also be familiar with our old Document Stores for vector database integration. They are now called Document Indexes and offer the following improvements:
- Hybrid search: You can now combine vector search with text search and even filter by arbitrary fields.
- Production-ready: The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain.
- Increased flexibility: We strive to support any configuration or setting you could perform through the vector DB's first-party client.
Document Indexes currently support Weaviate, Qdrant, ElasticSearch, and HNSWLib, with more to come.
Changes to Document
- Document has been renamed BaseDoc.
- BaseDoc cannot be used directly but instead has to be extended. Therefore, each document class is created through a dataclass-like interface.
- Following from the previous point, extending BaseDoc allows for a flexible schema compared to the Document class in v1, which only allowed for a fixed schema, with one of tensor, text and blob, and additional chunks and matches.
- Due to the added flexibility, one cannot know what fields your document class will provide. Therefore, various methods from v1 (such as .load_uri_to_image_tensor()) are not supported in v2. Instead, we provide some of those methods on the typing level.
- In v2, we have the LegacyDocument class, which extends BaseDoc while following the same schema as v1's Document. The LegacyDocument can be helpful to start migrating your codebase from v1 to v2. Nevertheless, the API is not fully compatible with DocArray v1 Document. Indeed, none of the methods associated with Document are present. Only the schema of the data is similar.
Changes to DocumentArray
DocList
- The DocumentArray class from v1 has been renamed DocList. This better describes its actual functionality since it is a list of BaseDoc.
DocVec
- Additionally, we have introduced the class DocVec, which is a column-based representation of BaseDoc. Both DocVec and DocList extend AnyDocArray.
- DocVec is a container of Documents appropriate for computations that require batches of data (e.g., matrix multiplication, distance calculation, deep learning forward pass).
- A DocVec has a similar interface to DocList but with an underlying implementation that is column-based instead of row-based. Each field of the schema of the DocVec (the .doc_type, which is a BaseDoc) will be stored in a column. If the field is a tensor, the data from all Documents will be stored as a single doc_vec (Torch/TensorFlow/NumPy) tensor. If the tensor field is AnyTensor or a union of tensor types, the .tensor_type will be used to determine the type of the doc_vec column.
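To make the row-based versus column-based distinction concrete, here is a minimal pure-Python sketch. It does not use DocArray: the Item dataclass and the to_columns helper are hypothetical stand-ins for a BaseDoc subclass and for what a DocVec does conceptually, and plain lists stand in for tensors.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class Item:
    # Hypothetical stand-in for a BaseDoc subclass with two fields.
    title: str
    embedding: List[float]


def to_columns(rows: List[Item]) -> Dict[str, list]:
    """Turn a row-based list of items into one column per schema field."""
    return {f.name: [getattr(row, f.name) for row in rows] for f in fields(Item)}


# Row-based layout (DocList-like): one object per document.
rows = [Item(title=f'title {i}', embedding=[float(i), float(i) + 0.5]) for i in range(3)]

# Column-based layout (DocVec-like): one container per field.
columns = to_columns(rows)

print(columns['title'])      # ['title 0', 'title 1', 'title 2']
print(columns['embedding'])  # [[0.0, 0.5], [1.0, 1.5], [2.0, 2.5]]
```

In the column layout, a batch operation can consume columns['embedding'] in one go, which is the case DocVec is designed for. In a real DocVec that column would be a single stacked tensor rather than a list of lists.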
Parameterized DocList
- Because you now have the flexibility to design your own document schema, when initializing a DocList, the contents do not have to all be of the same type.
- If you want a homogeneous DocList you can parameterize it at initialization time:
from docarray import DocList
from docarray.documents import ImageDoc
docs = DocList[ImageDoc]()
- Methods like .from_csv() or .pull() only work with a parameterized DocList.
Access attributes of your DocumentArray
- In v1, you could access an attribute of all Documents in your DocumentArray by calling the plural of the attribute's name on your DocumentArray instance.
- In v2, you don't have to use the plural; instead, use the Document's attribute name, since AnyDocArray will expose the same attributes as the BaseDocs it contains. This will return a list of type(attribute). However, this works only if all the BaseDocs in the AnyDocArray have the same schema. Therefore only this works:
from docarray import BaseDoc, DocList

class Book(BaseDoc):
    title: str
    author: str = None

docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)])
book_titles = docs.title  # returns a list[str]

# this would fail
# docs = DocList([Book(title=f'title {i}') for i in range(5)])
# book_titles = docs.title
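Why a declared schema is needed can be illustrated with a small pure-Python sketch. This is not DocArray's implementation: TypedList and its doc_type argument are hypothetical, and the sketch only assumes that attribute lookups on the container are forwarded to each element when the container knows the element type.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Book:
    # Hypothetical stand-in for the Book document above.
    title: str
    author: Optional[str] = None


class TypedList(list):
    """A list that forwards field lookups to its items when a schema is known."""

    def __init__(self, items=(), doc_type=None):
        super().__init__(items)
        self.doc_type = doc_type  # None means "no schema declared"

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        doc_type = self.__dict__.get('doc_type')
        if doc_type is None:
            raise AttributeError(f'no schema declared, cannot resolve {name!r}')
        if name not in {f.name for f in fields(doc_type)}:
            raise AttributeError(f'{doc_type.__name__} has no field {name!r}')
        return [getattr(item, name) for item in self]


books = TypedList([Book(title=f'title {i}') for i in range(3)], doc_type=Book)
print(books.title)  # ['title 0', 'title 1', 'title 2']

untyped = TypedList([Book(title='x')])
try:
    untyped.title
except AttributeError as err:
    print(err)  # no schema declared, cannot resolve 'title'
```

Without a declared element type, the container has no way to know which attribute names are valid for every item, so the lookup is rejected up front.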
Changes to Document Store
In v2, the Document Store has been renamed DocIndex and can be used for fast retrieval using vector similarity. DocArray v2 DocIndex supports:
- Weaviate
- Qdrant
- ElasticSearch
- HNSWLib
Instead of creating a DocumentArray instance and setting the storage parameter to a vector database of your choice, in v2 you can initialize a DocIndex object of your choice, such as:
from docarray.index import HnswDocumentIndex
db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir')
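What retrieval using vector similarity returns can be sketched without any vector database. The snippet below is a brute-force cosine-similarity search in plain Python. It is a stand-in for illustration only, not the approximate HNSW search that an HNSWLib-backed index performs, and the find helper with its limit parameter is hypothetical.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def find(query: List[float], index: List[Tuple[str, List[float]]], limit: int = 2):
    """Return the `limit` (doc_id, score) pairs most similar to `query`."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Toy index: (document id, embedding) pairs.
index = [
    ('doc-a', [1.0, 0.0]),
    ('doc-b', [0.0, 1.0]),
    ('doc-c', [1.0, 1.0]),
]

results = find([1.0, 0.1], index)
print([doc_id for doc_id, _ in results])  # ['doc-a', 'doc-c']
```

A brute-force scan like this compares the query against every stored vector. A dedicated index trades some exactness for speed so that the same kind of query stays fast on large collections.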
Furthermore, DocStore in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.
Thank you to all of the contributors to this release: