Sharing DocumentArray Across Machines

Shubham Saboo, Jyoti Bisht

Background

One day, a Jina AI employee was working in a remote GPU environment, Google Colab. While creating and manipulating a DocumentArray, they hit an error they could not resolve on their own. With outside help, they traced the error to missing environment dependencies on the remote machine. They then moved development to their local system, which was already configured to work with Jina.

This is a common and frustrating situation when working in a remote environment. First, it is difficult to determine what environment is running in the backend. Second, different kinds of issues and bugs may need to be addressed, and giving others access to the entire file is not a safe way to get help.

Overview

Jina’s DocumentArray is a container for storing Document objects. It behaves much like Python’s list: you can construct, delete, insert, sort, and traverse a DocumentArray object. This blog walks you through a special feature of DocumentArray: push and pull.

DocumentArray can import and export data in a variety of formats, not just one. For example, to work with JSON, you can use .to_json() to export the data to JSON, and anyone who needs to read that JSON file can use .load_json() to import it.

Similarly, Jina lets you export data to, and import data from, remote cloud storage. The method for exporting data to the cloud is push(), and the method for importing it is pull(). Together they let users share a DocumentArray object across machines from anywhere in the world.
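
To make the JSON round trip concrete, here is a minimal sketch. It assumes a 0.x release of the docarray package (the one Jina 3 builds on), where to_json() and from_json() work with strings and save_json() and load_json() work with files. Method names may differ in other versions, and the function name and file path are only illustrative.

```python
def json_round_trip(path="documents.json"):
    """Export a small DocumentArray to JSON and import it again."""
    # Imported lazily so the sketch can be read without docarray installed.
    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text="hello"), Document(text="world")])

    # File-based export and import (assumed API: save_json / load_json).
    da.save_json(path)
    from_file = DocumentArray.load_json(path)

    # String-based export and import (assumed API: to_json / from_json).
    from_string = DocumentArray.from_json(da.to_json())

    return from_file, from_string
```

Calling json_round_trip() should return two DocumentArray objects holding the same two Documents as the original.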

DocumentArray Push/Pull Example

Let’s look at a simple example of push and pull in action. Jack wants to create a DocumentArray, do some pre-processing on his side, and share it with his colleague Janice, who lives on the other side of the world. Jack creates a DocumentArray object da, applies the pre-processing logic, and uses the push() method with a unique key to store da on a remote cloud machine. He uses the following code to do that:

Fig 1. Push method for exporting DocumentArray
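
A minimal sketch of what Jack's side might look like, assuming the docarray package is installed and the machine has network access. The key, the toy pre-processing step, and the function names are hypothetical. The key is passed positionally because the keyword name of that parameter has varied between releases, and some releases may also require signing in before pushing.

```python
SHARED_KEY = "jack-preprocessed-da"  # hypothetical key; choose one that is hard to guess


def preprocess(texts):
    """Toy pre-processing: trim whitespace, lowercase, drop empty strings."""
    return [t.strip().lower() for t in texts if t.strip()]


def push_preprocessed(raw_texts, key=SHARED_KEY):
    """Build a DocumentArray from pre-processed texts and push it to the cloud."""
    # Imported lazily so the sketch can be read without docarray installed.
    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text=t) for t in preprocess(raw_texts)])
    da.push(key)  # uploads the DocumentArray under the given key
    return da
```

Jack would call push_preprocessed(["  Hello ", "WORLD"]) and then send the key to Janice.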

Now Janice wants to use the DocumentArray object that Jack created. To do that, she needs to know the unique key associated with that object. Once she has the key, she can use the pull() method to fetch the DocumentArray object onto her local system from anywhere in the world. She uses the following code to do so:

Fig 2. Pull method for importing DocumentArray
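
A minimal sketch of Janice's side, under the same assumptions: the docarray package is installed, the machine has network access, and the key is the hypothetical one that Jack shared. The function name is illustrative.

```python
def pull_shared(key="jack-preprocessed-da"):
    """Fetch the DocumentArray stored under the shared key."""
    # Imported lazily so the sketch can be read without docarray installed.
    from docarray import DocumentArray

    # pull() is a class method: it returns a new DocumentArray
    # populated with the Documents stored under the key.
    return DocumentArray.pull(key)
```

Janice would call da = pull_shared() and then work with da like any local DocumentArray, for example by checking len(da).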

DocumentArray Push Flow

When a user pushes data from their local system to the server, the following process takes place:

  • The data, along with the user token, is sent to the Jina API.
  • The Jina API server verifies the following:
    • The S3 address of the request
    • The expiration of the request
    • The token and its validity
    • The size of the data and the creation time of the request
    • Metadata such as the Jina version
  • If verification succeeds, the server responds with a success message. Otherwise, it responds with a failure message.
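
The checks listed above can be pictured with a small simulation. This is not Jina's server code: the request fields, the bucket prefix, the size limit, and the response shape are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field

# Hypothetical values; the real service's limits are not documented here.
ALLOWED_BUCKET_PREFIX = "s3://example-docarray-bucket/"
MAX_SIZE_BYTES = 10 * 1024 * 1024


@dataclass
class PushRequest:
    token: str
    s3_address: str
    size_bytes: int
    created_at: float  # Unix timestamp when the request was created
    expires_at: float  # Unix timestamp after which the request is invalid
    metadata: dict = field(default_factory=dict)


def failure(reason):
    return {"status": "failure", "reason": reason}


def verify_push(req, now=None):
    """Run the checks in order and return a success or failure message."""
    now = time.time() if now is None else now
    if not req.s3_address.startswith(ALLOWED_BUCKET_PREFIX):
        return failure("unexpected S3 address")
    if now > req.expires_at:
        return failure("request expired")
    if not req.token.strip():
        return failure("invalid token")
    if not 0 < req.size_bytes <= MAX_SIZE_BYTES:
        return failure("invalid data size")
    if req.created_at > now:
        return failure("creation time is in the future")
    if "jina_version" not in req.metadata:
        return failure("missing metadata")
    return {"status": "success"}
```

A request that passes every check yields {"status": "success"}. The first failed check decides the failure reason.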

Note: DocumentArray storage is temporary, and the data is deleted automatically seven days after the token is created. Also, pushing with the same token overwrites the existing data.
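
The two behaviours in the note, expiry and overwriting, can be modelled with a small in-memory store. This is a sketch of the semantics only, not the service's implementation. It also assumes that pushing again with the same token restarts the seven-day clock, which the note does not state.

```python
SEVEN_DAYS = 7 * 24 * 60 * 60  # retention period in seconds


class TemporaryStore:
    """Toy key-value store with expiry and overwrite-on-push."""

    def __init__(self, ttl_seconds=SEVEN_DAYS):
        self._ttl = ttl_seconds
        self._items = {}  # token -> (created_at, data)

    def push(self, token, data, now):
        # Pushing with an existing token replaces whatever was stored before.
        self._items[token] = (now, data)

    def pull(self, token, now):
        entry = self._items.get(token)
        if entry is None:
            return None
        created_at, data = entry
        if now - created_at >= self._ttl:
            # The entry has outlived its retention period, so drop it.
            del self._items[token]
            return None
        return data
```

Timestamps are passed in explicitly so the behaviour is easy to reason about: a second push under the same token replaces the first, and a pull made seven days or more after the last push returns None.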

DocumentArray Pull Flow

When a user pulls a DocumentArray from the cloud to import it locally, the following process takes place:

  • A request is made, and the token is sent to the Jina API.
  • The server verifies the request and stores the download time and the metadata. Upon successful verification, it responds with a URL that can be used to download the data.
  • Once a GET request is made to that URL, the requested data is sent from the S3 server to the user.
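
The three steps above can be sketched as an in-memory simulation. The class names, the URL format, and the logged fields are assumptions for illustration. No real network request or S3 call is made.

```python
import secrets
import time


class FakeStorage:
    """Stands in for the S3 server: holds objects and serves them by URL."""

    def __init__(self):
        self._objects = {}  # token -> data
        self._urls = {}     # download URL -> token

    def put(self, token, data):
        self._objects[token] = data

    def has(self, token):
        return token in self._objects

    def make_download_url(self, token):
        # Hypothetical URL shape with a random signature.
        url = f"https://storage.example.invalid/{token}?sig={secrets.token_hex(8)}"
        self._urls[url] = token
        return url

    def get(self, url):
        # Simulates the GET request made to the download URL.
        return self._objects[self._urls[url]]


class FakeApi:
    """Stands in for the API server: verifies the token and hands out a URL."""

    def __init__(self, storage):
        self._storage = storage
        self.download_log = []

    def request_download(self, token, metadata):
        if not self._storage.has(token):
            raise KeyError(f"unknown token: {token}")
        # Record the download time and the metadata before responding.
        self.download_log.append({"token": token, "time": time.time(), "metadata": metadata})
        return self._storage.make_download_url(token)


def pull(api, storage, token):
    url = api.request_download(token, {"client": "sketch"})  # steps 1 and 2
    return storage.get(url)                                  # step 3
```

For example, after storage.put("k", b"data"), calling pull(FakeApi(storage), storage, "k") returns the stored bytes and leaves one entry in the API's download log.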

Summary

What you saw above is how collaboration looks with Jina’s search framework. It lets you work collaboratively without worrying about how to share a piece of code or logic safely and securely. With DocumentArray’s push and pull methods, you can work efficiently with data stored in the cloud by transferring it to your local system. Jina lets you build efficient search engines and provides data types and structures built for speed and efficiency.

© Jina AI 2020-2022. All rights reserved.