Sharing DocumentArray Across Machines
One day, an employee at Jina AI was working on a remote GPU machine, namely Google Colab. While creating a DocumentArray and manipulating it, they hit an error they could not resolve on their own. With outside help, they found that the error was caused by missing environment dependencies on the remote machine. They then decided to move development to their local system, which was already configured to work with Jina.
The paragraph above describes a common and frustrating situation when working in a remote environment. First, it is difficult to determine what environment is running in the backend. Second, various issues and bugs may need to be addressed with someone else’s help, and giving others access to the entire file is not a safe way to do that.
We all know about Jina’s DocumentArray. It is used to store Document objects and is similar to Python’s list: you can construct, delete, insert, sort, and traverse a DocumentArray object. This blog walks you through a special feature of DocumentArray: importing and exporting data. This is not limited to a single format but spans several. For example, if you want to work with a JSON file, you can use .to_json() to export data to JSON, and anyone who needs to read that JSON file can use .load_json() to import the data.
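To make the export/import round trip concrete, here is a minimal sketch that uses only Python's standard json module. Plain dictionaries stand in for Document objects, and export_json and import_json are hypothetical helpers written for this illustration; they are not part of Jina's API.

```python
import json

# Plain dicts stand in for Document objects; this is not the Jina API.
docs = [
    {"id": "doc-1", "text": "hello"},
    {"id": "doc-2", "text": "world"},
]


def export_json(documents):
    """Serialize a list of document-like dicts to a JSON string."""
    return json.dumps(documents)


def import_json(payload):
    """Rebuild the list of document-like dicts from a JSON string."""
    return json.loads(payload)


payload = export_json(docs)      # what the exporting side would write out
restored = import_json(payload)  # what the importing side would read back
```

The point of the sketch is only that whatever is exported can be read back unchanged on another machine.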
Similarly, Jina allows data to be exported to and imported from remote cloud storage. The method for exporting data to the cloud is push(), and the method for importing data is pull(). Together they let users share a DocumentArray object across machines from anywhere in the world.
DocumentArray Push/Pull Example
Let’s look at a simple example of push() and pull() in action. Jack wants to create a DocumentArray, do some pre-processing on his side, and share it with his colleague Janice, who lives in another part of the world. So Jack creates a DocumentArray named da, applies the pre-processing logic, and uses the push() method to store the da object on a remote cloud machine under a unique key ID. He uses the following code to do that:
Fig 1. Push method for exporting DocumentArray
Now Janice wants to use the same DocumentArray object that Jack created. To do that, she needs to know the unique key ID associated with that object. Once she has the key, she can use the pull() method to fetch the DocumentArray object onto her local system from anywhere in the world. She uses the following code to do so:
Fig 2. Pull method for importing DocumentArray
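As a rough, self-contained sketch of the exchange between Jack and Janice, the snippet below replaces the cloud with an in-memory dictionary. The FakeCloud class, its push and pull methods, and the key value are hypothetical stand-ins for illustration only; the real DocumentArray methods talk to Jina's servers, and their exact signatures are not reproduced here.

```python
import copy


class FakeCloud:
    """Hypothetical in-memory stand-in for the remote store."""

    def __init__(self):
        self._store = {}

    def push(self, key, docs):
        # Store a deep copy so later local edits do not change the shared data.
        self._store[key] = copy.deepcopy(docs)

    def pull(self, key):
        if key not in self._store:
            raise KeyError(f"no data stored under key: {key}")
        return copy.deepcopy(self._store[key])


cloud = FakeCloud()

# Jack: build the data, pre-process it, and push it under a unique key.
jack_da = [{"text": "  Hello World "}, {"text": "JINA "}]
jack_da = [{"text": d["text"].strip().lower()} for d in jack_da]
cloud.push("jack-shared-da", jack_da)

# Janice: pull the same data using only the key Jack gave her.
janice_da = cloud.pull("jack-shared-da")
```

Janice never needs Jack's code or environment; the key is the only thing the two of them share.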
DocumentArray Push Flow
When a user pushes data from their local system to the server, the following process takes place:
- The data along with the user token is sent to Jina API.
- The Jina API server verifies the following:
  - S3 address of the request
  - The expiration of the request made
  - The token and its validity
  - The size of the data and the creation time of the request
  - Metadata such as Jina’s version etc.
- Once verified, the response is sent in the form of a success message. Otherwise, a failure message is sent.
DocumentArray storage is temporary: the data is deleted automatically seven days after the token is created. Also, pushing with the same token overwrites the existing data.
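The push flow above can be sketched as a small server-side handler. This is only an illustration under stated assumptions: the request fields, the size limit, the handle_push and purge_expired functions, and the in-memory storage dictionary are all hypothetical, and only the seven-day retention and the overwrite-on-same-token behaviour come from the description above.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)      # from the text: data is kept for seven days
MAX_SIZE_BYTES = 10 * 1024 * 1024  # hypothetical limit, chosen for illustration

storage = {}  # token -> (data, created_at); stands in for the remote store


def handle_push(request, now):
    """Verify a hypothetical push request and return a success/failure message."""
    expires_at = request.get("expires_at")
    checks = [
        request.get("s3_address", "").startswith("s3://"),     # S3 address
        expires_at is not None and expires_at > now,           # request not expired
        bool(request.get("token")),                            # token present
        len(request.get("data", b"")) <= MAX_SIZE_BYTES,       # size of the data
        "jina_version" in request.get("metadata", {}),         # metadata
    ]
    if not all(checks):
        return "failure"
    # Pushing with an existing token overwrites the previous data.
    storage[request["token"]] = (request["data"], now)
    return "success"


def purge_expired(now):
    """Delete entries older than the retention period."""
    expired = [t for t, (_, created) in storage.items() if now - created >= RETENTION]
    for token in expired:
        del storage[token]


now = datetime(2022, 1, 1)
req = {
    "token": "abc",
    "s3_address": "s3://bucket/abc",
    "expires_at": now + timedelta(minutes=5),
    "data": b"v1",
    "metadata": {"jina_version": "0.0.0"},  # placeholder version string
}
first = handle_push(req, now)
second = handle_push({**req, "data": b"v2"}, now)   # same token: overwrites v1
stored_after_overwrite = storage["abc"][0]
rejected = handle_push({**req, "token": ""}, now)   # fails the token check
purge_expired(now + timedelta(days=8))              # retention period has passed
```

Keeping the checks in one list makes it easy to see that a single failed check is enough to reject the request.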
DocumentArray Pull Flow
When a user tries to access a DocumentArray from the cloud and import it into their local system, the following process takes place:
- A request is made, and the token is sent to the Jina API.
- The server verifies and stores the download time and the metadata. Upon successful verification, a response is sent back in the form of a URL. This URL can be used for downloading the data.
- Once a GET request is made on that URL, the requested data is sent from the S3 server to the user.
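The pull flow above can be sketched in the same spirit. Everything here is a hypothetical stand-in: the URL, the token table, the download log, and the function names are invented for illustration, and a dictionary lookup takes the place of the real GET request to S3.

```python
from datetime import datetime

# Hypothetical stand-ins for the API's token table and the S3 bucket.
TOKEN_TO_URL = {"abc": "https://s3.example.com/abc"}
S3_OBJECTS = {"https://s3.example.com/abc": b"serialized DocumentArray"}
download_log = []  # the server records download time and metadata here


def request_download_url(token, metadata, now):
    """Step 1 and 2: verify the token, log the request, return a download URL."""
    url = TOKEN_TO_URL.get(token)
    if url is None:
        return None  # verification failed
    download_log.append({"token": token, "time": now, "metadata": metadata})
    return url


def http_get(url):
    """Step 3: stand-in for a GET request that returns the stored bytes."""
    return S3_OBJECTS[url]


def pull(token, now):
    """Client side: exchange the token for a URL, then download the data."""
    url = request_download_url(token, {"jina_version": "0.0.0"}, now)
    if url is None:
        raise KeyError(f"unknown token: {token}")
    return http_get(url)


data = pull("abc", datetime(2022, 1, 1))
```

Splitting the flow into a URL request and a separate download mirrors the description: the API server only hands out the URL, and the data itself comes from storage.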
What you saw above is how collaboration looks with Jina’s search framework. It lets you work collaboratively without worrying about how to share a piece of code or logic safely and securely. With DocumentArray’s new push and pull methods, you can work with data stored in the cloud by transferring it to your local system. Jina lets you build efficient search engines and provides data types and structures built for speed and efficiency.