DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API.
Release Note (0.17.0)
Release time: 2022-09-23 16:18:19
This release contains 8 new features, 2 performance improvements, 7 bug fixes, and 2 documentation improvements.
🆕 Features
Allow passing parameters to load_uri_to_* methods (#540)
The load_uri_to_* methods (load_uri_to_blob, load_uri_to_text, etc.) now accept kwargs so that you can pass a timeout parameter to the underlying request methods.
For example:
doc = Document(uri='uri_path')
doc.load_uri_to_blob(timeout=2)
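The forwarding pattern behind this feature can be sketched with the standard library alone. The helper read_uri below is hypothetical and is not DocArray's implementation; it only assumes that extra keyword arguments such as timeout are handed through to urllib.request.urlopen for remote or data URIs and are unused for local files.

```python
import os
import urllib.parse
import urllib.request


def read_uri(uri, **kwargs):
    """Return the raw bytes behind `uri` (hypothetical helper, not DocArray code).

    Extra keyword arguments (for example `timeout=2`) are forwarded to
    `urllib.request.urlopen` when the URI is an http(s) URL or a data URI.
    """
    scheme = urllib.parse.urlparse(uri).scheme
    if scheme in {"http", "https", "data"}:
        with urllib.request.urlopen(uri, **kwargs) as response:
            return response.read()
    if os.path.exists(uri):
        # Local paths involve no network request, so kwargs are unused here.
        with open(uri, "rb") as fp:
            return fp.read()
    raise FileNotFoundError(f"{uri} is neither a supported URL nor a local file")


# A data URI exercises the urlopen path without touching the network.
payload = read_uri("data:text/plain;base64,aGVsbG8=", timeout=2)
```

With a real remote URI, a timeout that is too short would surface as an exception from urlopen rather than a hung call, which is the main reason to pass it.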
Allow multiple DocumentArrays per Redis server (#540)
You can now store multiple DocumentArrays in a single Redis instance, as long as each DocumentArray has a different index_name:
da1 = DocumentArray(storage='redis', config={'host': 'localhost', 'port': 6379, 'n_dim': 128, 'index_name': 'da1'})
da2 = DocumentArray(storage='redis', config={'host': 'localhost', 'port': 6379, 'n_dim': 256, 'index_name': 'da2'})
da3 = DocumentArray(storage='redis', config={'host': 'localhost', 'port': 6379, 'n_dim': 512, 'index_name': 'da3'})
Login required for DocumentArray push and pull (#541)
Logging in to Jina Cloud is now required before pushing/pulling DocumentArrays to/from Jina Cloud. You can log in either by creating a token in hub.jina.ai and setting it as an environment variable (JINA_AUTH_TOKEN=my_token) or by using the CLI command jina auth login.
Push metadata along with DocumentArray and add cloud_list and cloud_delete methods (#490)
DocumentArray.push will extract metadata about the DocumentArray and send it to Jina Cloud. Although this is transparent to users, it will help with visualization of DocumentArrays in Jina Cloud.
It is also possible to list and delete DocumentArrays in Jina Cloud using the following methods:
- DocumentArray.cloud_list(): will list all DocumentArray objects owned by the authenticated user
- DocumentArray.cloud_delete(da_name): will delete the DocumentArray by name if it is owned by the authenticated user
Full text search support in Redis backend (#535)
Full text search is supported either on the Document.text field or on Document tags, as long as you enable text indexing or specify the tag fields to be indexed.
For example:
from docarray import Document, DocumentArray
da = DocumentArray(
    storage='redis', config={'n_dim': 2, 'index_text': True}
)
da.extend([
    Document(text='Redis allows you to search by text query,'),
    Document(text='by vector similarity'),
    Document(text='Or by filter conditions'),
])  # add documents with text field
da.find('my text query').texts
Result:
['Redis allows you to search by text query,']
Add logical operators $and and $or in Redis (#509)
The Redis backend now supports $and and $or logical operators. For example:
from docarray import DocumentArray
da = DocumentArray(storage='redis', config={'n_dim': 128, 'columns': {'col1': 'str', 'col2': 'int'}})
redis_filter = {
    "$or": {
        "col1": {"$eq": "value"},
        "col2": {"$lt": 100}
    }
}
# retrieve documents using filter
da.find(redis_filter)
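To make the boolean semantics of such a filter concrete, here is a minimal sketch that evaluates the same dictionary shape against plain Python dictionaries. It is illustrative only: the matches function is hypothetical, it handles just the $eq and $lt comparisons shown above, it assumes that several top-level keys combine as a logical AND, and it says nothing about how the Redis backend actually executes the filter.

```python
import operator

# Only the two comparison operators from the example above are modelled.
_COMPARATORS = {"$eq": operator.eq, "$lt": operator.lt}


def matches(tags, query):
    """Return True if the `tags` dict satisfies `query` (illustrative only)."""
    results = []
    for key, condition in query.items():
        if key == "$and":
            results.append(all(matches(tags, {k: v}) for k, v in condition.items()))
        elif key == "$or":
            results.append(any(matches(tags, {k: v}) for k, v in condition.items()))
        elif key not in tags:
            # A document without the column cannot satisfy a comparison on it.
            results.append(False)
        else:
            results.append(
                all(_COMPARATORS[op](tags[key], value) for op, value in condition.items())
            )
    # Assumption: several top-level keys combine as a logical AND.
    return all(results)


rows = [
    {"col1": "value", "col2": 500},
    {"col1": "other", "col2": 5},
    {"col1": "other", "col2": 500},
    {"col1": "value", "col2": 5},
]
conditions = {"col1": {"$eq": "value"}, "col2": {"$lt": 100}}

either = [row for row in rows if matches(row, {"$or": conditions})]
both = [row for row in rows if matches(row, {"$and": conditions})]
```

Here either keeps every row that satisfies at least one condition, while both keeps only the row that satisfies the two conditions at once.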
Columns in backend configuration should be a dictionary, not a list of tuples (#526)
The columns configuration parameter for storage backends has been changed from a list of tuples to a dictionary in the following format: {'column_name': 'column_type'}. This helps with YAML compatibility.
For example:
from docarray import DocumentArray
da = DocumentArray(storage='annlite', config={'n_dim': 128, 'columns': {'col1': 'str', 'col2': 'float'}})
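If you have existing configuration written in the old list-of-tuples style, one way to migrate it is sketched below. The helper name columns_to_dict is hypothetical and not part of DocArray; it relies only on the built-in dict constructor.

```python
def columns_to_dict(columns):
    """Convert an old list-of-tuples `columns` value to the new dict format."""
    # dict() accepts both a mapping and an iterable of (name, type) pairs,
    # so values that are already in the new format pass through unchanged.
    return dict(columns)


old_columns = [("col1", "str"), ("col2", "float")]
new_columns = columns_to_dict(old_columns)

# The converted value can be used directly in a backend configuration.
config = {"n_dim": 128, "columns": new_columns}
```

Because the conversion is idempotent, it is safe to apply to configuration that may already have been migrated.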
Allow displaying image documents using either tensor or URI (#518)
It is now possible to choose which field to use when displaying an image document:
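import os
from docarray import Document
# Assumption: cur_dir is the directory that contains the toydata folder
cur_dir = os.path.dirname(os.path.abspath(__file__))
d = Document(uri=os.path.join(cur_dir, 'toydata/test.png'))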
d.display()
d.display(from_='uri')
or
d.load_uri_to_image_tensor()
d.display(from_='tensor')
Backwards incompatible API changes
Increased minimum versions for dependencies:
Package | Minimum Version |
---|---|
jina-hubble-sdk | 0.13.1 |
annlite | 0.3.12 |
Other API Changes:
- The columns configuration parameter for storage backends has been changed from a list of tuples to a dictionary in the following format: {'column_name': 'column_type'}.
🚀 Performance
Optimize find with an exists condition (#519)
We got rid of unnecessary and costly computation when computing DocumentArray.find with an exists filter. When running the following code:
from docarray import DocumentArray, Document
num = 1000  # example size, chosen for illustration; any positive integer works
da = DocumentArray(Document(text='text') for _ in range(num)) + \
    DocumentArray(Document(blob=b'blob') for _ in range(num))
da.find(query={'text': {'$exists': True}})
you should expect a 200-300% speedup in find latency.
This optimization only applies to DocumentArray.find or DocumentArray.match when an exists condition is used with the in-memory document store.
Change default journal mode to WAL in SQLite backend (#506)
The default journal mode in the SQLite backend is now WAL. This should improve performance when using the SQLite backend.
According to the SQLite docs, WAL is significantly faster, provides more concurrency, and is more robust.
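For readers unfamiliar with journal modes, the sketch below shows what switching to WAL looks like with Python's built-in sqlite3 module. It is not DocArray's backend code; the temporary database file and its name are chosen only for illustration.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "example.db")
    conn = sqlite3.connect(db_path)
    try:
        # The mode in effect before the change depends on how SQLite was built.
        default_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
        # The pragma returns the journal mode that is now in effect.
        wal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    finally:
        # Close before the temporary directory is removed.
        conn.close()
```

WAL requires a file-backed database; an in-memory database reports its journal mode as memory instead.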
🐞 Bug Fixes
Keep default values for vector similarity parameters in Redis backend (#559)
DocumentArray's Redis backend previously initialized schemas in the Redis database with default values of vector similarity search parameters. Those default values came from DocArray, not Redis.
This altered the database's default behavior even though the user had not explicitly asked for it. We've changed the implementation to avoid altering the database's default values. Default values now depend on the Redis database version.
Adapt to AnnLite changes (#543)
AnnLite introduced a breaking change in 0.3.12. Therefore, we have adapted our implementation to the latest version of AnnLite and increased the minimum required version to 0.3.12.
Keep out of mask docs in delete by mask (#534)
DocumentArray's delete-by-mask operation used to behave unexpectedly. Before this fix, the following code also erased the last Document, even though the mask does not cover it:
da = DocumentArray.empty(3)
mask = [True, False]
del da[mask]
print(len(da)) # printed 1 before this fix
We have fixed this behavior, and DocumentArray will now correctly keep documents that are not present in the mask.
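The intended semantics can be illustrated on a plain Python list. The function delete_by_mask below is a hypothetical stand-in, not DocArray's implementation: an item is removed only when the mask has an entry for its position and that entry is True.

```python
def delete_by_mask(items, mask):
    """Return the items that survive deletion (illustrative, list-based)."""
    return [
        item
        for index, item in enumerate(items)
        # Positions beyond the end of the mask are never deleted.
        if not (index < len(mask) and mask[index])
    ]


remaining = delete_by_mask(["doc0", "doc1", "doc2"], [True, False])
```

In this sketch only doc0 is removed, so the two documents that are unmasked or outside the mask are kept.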
Fix Finetuner link for Totally Looks Like (#532)
We've fixed an incorrect link in the documentation.
Fix AnnLite type map (#533)
DocArray type mapping used the wrong types in AnnLite. We've now replaced the types specified in the document store implementation with the correct ones.
Create Strawberry types with kwargs (#527)
Strawberry introduced a breaking change in 0.128.0, making it necessary to pass parameters as keyword arguments. We've adapted our code base to this change.
Make device more generic (#515)
Some parts of in-memory distance computation used to restrict tensor device conversion to cuda. We've changed the implementation to make device conversion more generic.
📗 Documentation Improvements
Add benchmark reference to feature summary (#510)
We've added a "One Million Benchmark" section to the "Feature Summary" page.
Update push/pull setup instructions (#516)
We've updated the pip setup instructions required to use DocumentArray push/pull.
🤟 Contributors
We would like to thank all contributors to this release:
Leon Wolf (@fogx)
samsja (@samsja)
AlaeddineAbdessalem (@alaeddine-13)
Halo Master (@linkerlin)
Han Xiao (@hanxiao)
Wang Bo (@bwanglzu)
Anne Yang (@AnneYang720)
Joan Fontanals (@JoanFM)