Search R Questions on Stack Overflow with Python

In this article I'm deploying a simple Jina app to the Jina AI Cloud. This app allows users to input a text query and then retrieve question-answer pairs, where the question semantically matches the given query. It's a pretty straightforward search app, and I want to use it to demonstrate how easy it is to take such apps from zero to ready-to-serve in 10 minutes.

The dataset we will be using in this article

Wait…for R? Not Python?! Well, the code itself is in Python. Only the dataset is related to R questions.

It turns out the Python Stack Overflow dataset has some encoding issues. Even a bunch of try...excepts didn't help much, so I decided to spend time on a dataset that didn't have such problems. The R dataset was next on the list and it works fine.

The title of this post serves as a plot twist (and click-bait)

Our tech stack

To build our search app we'll use:

DocArray

Everything going in or out of a Jina Flow has to be a Document (Jina's primitive data type). Documents come from the DocArray package, which is ideal for working with unstructured data.

Jina

Jina is a framework that lets us build cross-modal and multi-modal applications on the cloud.

In short we'll use a Jina Flow and some Executors (from Executor Hub) to process and create vector embeddings for our Stack Overflow questions.

Jina AI Cloud

Jina AI Cloud is a free hosting platform for Jina applications. With it we can index and query data without using any of our own compute.

How it works

Build & Deploy

We first build a Jina Flow and deploy it on Jina AI Cloud. Later we will use the Jina Client to send data for indexing and searching.

Write the Flow

To do this we simply need our Flow in YAML format:

jtype: Flow
with:
  protocol: grpc
executors:
  - name: encoder
    uses: jinahub+docker://SpacyTextEncoder/
    uses_with:
      model_name: 'en_core_web_md'
    jcloud:
      resources:
        memory: 8G  # encoding is hungry. add more memory
  - name: indexer
    uses: jinahub+docker://AnnLiteIndexer/
    uses_with:
      n_dim: 300  # model uses this many dimensions
    uses_metas:
      workspace: workspace
    jcloud:
      capacity: on-demand
      resources:
        memory: 8G  # you can never have too much memory!

You can see two Executors in our Flow, namely:

SpacyTextEncoder — uses spaCy to create vector embeddings for our questions.
AnnLiteIndexer — stores and retrieves our embeddings and metadata.

In short, each Executor takes our questions as input/output. We’re wrapping them in jinahub+docker://... so we can pull them from Executor Hub (Jina’s Executor “app store”) and run them in Docker in our Flow.
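To make the encoder/indexer split a bit more concrete, here's a toy sketch in plain Python. It is not how SpacyTextEncoder or AnnLiteIndexer work internally: the three-dimensional vectors are made up, and `cosine_similarity` and `search` are hypothetical helpers. It only shows the general idea of ranking stored vectors by their similarity to a query vector.

```python
import math

# Toy 3-dimensional "embeddings"; the real encoder produces 300-dimensional vectors.
TOY_INDEX = {
    "How to plot a histogram in ggplot2?": [0.9, 0.1, 0.0],
    "Read a CSV file into a data frame": [0.0, 0.2, 0.9],
    "Visualize a correlation matrix": [0.8, 0.3, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, index, limit=2):
    """Return the `limit` titles whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda title: cosine_similarity(query_vector, index[title]),
        reverse=True,
    )
    return ranked[:limit]


# Pretend the encoder turned a plotting-related query into this vector.
results = search([1.0, 0.2, 0.0], TOY_INDEX)
print(results)
```

In the real Flow the encoder Executor produces the vectors and the indexer Executor does the storing and ranking, so we never write this ourselves.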

To visualize the Flow, run:

jina export flowchart flow.yml flow.svg

[Figure: flowchart of our Flow, as exported by the command above]

We also pass some extra parameters about which model we want to use for embedding, resources, etc.

Log in to Jina AI Cloud

To deploy a Flow to Jina AI Cloud, you'll need to create a Jina AI Cloud account with your email, Google, or GitHub login.

Go to https://cloud.jina.ai and click the login button

Deploy the Flow

We can deploy the Flow to the cloud with:

jina cloud deploy flow.yml

Note: If you have trouble deploying, you can view more verbose output with --loglevel=DEBUG:

jina cloud --loglevel=DEBUG deploy flow.yml

After running the command it’ll take a few minutes to deploy and then return something like:

╭───────────────────── 🎉 Flow is available! ──────────────────────╮
│                                                                  │
│   ID            8412cf3e8b                                       │
│   Endpoint(s)   grpcs://8412cf3e8b.wolf.jina.ai                  │
│   Dashboard     https://dashboard.wolf.jina.ai/flow/8412cf3e8b   │
│                                                                  │
╰──────────────────────────────────────────────────────────────────╯

Make a note of your Endpoint (in our case grpcs://8412cf3e8b.wolf.jina.ai). You’ll need it later.

Indexing the data

Jina Flows use Documents and DocumentArrays for input and output. So we need to put our questions into a DocumentArray.

First let’s see what fields we have to work with, using head -n 1 Questions.csv (from our R dataset):

Id,OwnerUserId,CreationDate,Score,Title,Body

Since we’re building a search engine, we need to choose which field to search: here, the Title field. We can map it to each Document’s text with the .from_csv() method of DocumentArray:

from docarray import DocumentArray

docs = DocumentArray.from_csv('Questions.csv', field_resolver={'Title': 'text'})

All other fields (Score, Body, etc) are automatically stored in Document.tags as a Python dict.
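If you're curious what that mapping amounts to, here's a rough stand-in using only Python's csv module. It isn't DocArray's implementation: `rows_to_records` is a hypothetical helper, plain dicts stand in for Documents, and the two sample rows are made up. It just shows a renamed column becoming the text while everything else ends up in tags.

```python
import csv
import io

# A two-row stand-in for Questions.csv; the values are made up for illustration.
SAMPLE_CSV = """Id,OwnerUserId,CreationDate,Score,Title,Body
1,10,2010-01-01T00:00:00Z,5,How to merge two data frames?,<p>Example body</p>
2,11,2010-01-02T00:00:00Z,3,Plot a histogram,<p>Another body</p>
"""


def rows_to_records(csv_file, field_resolver):
    """Rename columns per `field_resolver`; keep 'text' at top level, the rest in 'tags'."""
    records = []
    for row in csv.DictReader(csv_file):
        renamed = {field_resolver.get(key, key): value for key, value in row.items()}
        text = renamed.pop("text", "")
        records.append({"text": text, "tags": renamed})
    return records


records = rows_to_records(io.StringIO(SAMPLE_CSV), {"Title": "text"})
print(records[0]["text"])
print(sorted(records[0]["tags"]))
```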

Indexing took about 14 minutes for all 6,304,085 questions in the CSV file.

Send DocumentArray to the Flow for indexing

For this we use Jina Client. Remember to plug in your own endpoint that you got from the previous step!

from jina import Client

client = Client(host='grpcs://8412cf3e8b.wolf.jina.ai')  # Your gateway from earlier
client.index(docs, show_progress=True)

Searching the data

Create search Document

Since everything going in and out of a Flow is a Document or DocumentArray, we’ll need to wrap our search term in a Document:

from docarray import Document

search_doc = Document(text='statistic visualization')

Pass the search Document to the Flow

Once again, we can use the Jina Client:

response = client.search(search_doc)

View the results

Jina returns yet another DocumentArray as the result (as I said, it’s Documents and DocumentArrays for everything). This consists of a single Document with several Document.matches. We can see the text (i.e. the question title) of these with:

print(response[0].matches.texts)

Given the query term statistic visualization we get the following output:

['Hotellings statistic', 'Create multivariate similarity graph', 'How to vectorize extracting significant predictor variables?', 'visualization of correlation between replicates', 'R How to visualize this categorical percentage data?', 'visualizing statistical test results with ggplot2', 'Calculate Tanimoto coefficient', 'multivariate regression', 'Visualize aggregate data using ggplot2', 'How to calculate marginal probabilities for generating correlated binary variables']

Or in a more readable way:

- Hotellings statistic
- Create multivariate similarity graph
- How to vectorize extracting significant predictor variables?
- visualization of correlation between replicates
- R How to visualize this categorical percentage data?
- visualizing statistical test results with ggplot2
- Calculate Tanimoto coefficient
- multivariate regression
- Visualize aggregate data using ggplot2
- How to calculate marginal probabilities for generating correlated binary variables
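A bulleted view like this takes only a couple of lines. The sketch below uses a hypothetical `format_titles` helper and hard-codes the first three titles so it runs without a live Flow.

```python
def format_titles(titles):
    """Turn a list of question titles into one '- title' line per entry."""
    return "\n".join(f"- {title}" for title in titles)


# The first three titles from the output above; with a live Flow you would
# pass response[0].matches.texts instead.
sample_titles = [
    "Hotellings statistic",
    "Create multivariate similarity graph",
    "How to vectorize extracting significant predictor variables?",
]
formatted = format_titles(sample_titles)
print(formatted)
```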

More user-friendly

Because our Flow runs on Jina AI Cloud, it’s already most of the way to production readiness (depending on your use case). The main thing to do next would be to integrate a nice frontend so users can interact with it directly.

For inspiration, you can see my Streamlit frontend folder in the project’s repo:

Essentially it acts as a wrapper for:

Jina Client — retrieves matching questions given a search term.
SQLite — retrieves matching answers based on the questions that come up.

🎉 That's it!

...Or is it?

This is running on gRPC. How can we use a RESTful interface?

I chose to run on gRPC because it’s more efficient than HTTP. That said, Jina also supports HTTP and WebSocket gateways. To deploy a RESTful Flow:

Set protocol to http in your Flow YAML before deploying.
Set a lower request_size to compensate for HTTP's lower efficiency.
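To picture what a lower request_size means, here's a plain-Python sketch of splitting a list into batches. It's only an illustration of the idea, not Jina's client code, and `batched` is a hypothetical helper.

```python
def batched(items, request_size):
    """Yield consecutive slices of `items`, each holding at most `request_size` entries."""
    for start in range(0, len(items), request_size):
        yield items[start:start + request_size]


titles = [f"question {i}" for i in range(10)]
batches = list(batched(titles, request_size=4))
print([len(batch) for batch in batches])
```

Smaller batches mean more requests, but each one carries a lighter payload.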

But what about the answers?

I decided for this quick example to just store the answers in a SQLite database and retrieve them via a function in the frontend. The database is stored on the same machine as my frontend code so access is easy.

I created the database using the awesome csv_to_sqlite Python package.
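The lookup function can be very small. Here's a minimal sketch, assuming a table called answers with ParentId, Score and Body columns. Those names are my guess at what an import of the answers CSV would look like, `get_answers` is a hypothetical helper, and the sample rows are made up. It uses an in-memory database so it runs as-is.

```python
import sqlite3

# In-memory database standing in for the file built from the answers CSV.
# Table and column names are assumptions for this sketch.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE answers (ParentId INTEGER, Score INTEGER, Body TEXT)")
connection.executemany(
    "INSERT INTO answers VALUES (?, ?, ?)",
    [
        (1, 2, "Use merge()."),
        (1, 7, "Try dplyr::left_join()."),
        (2, 4, "Call hist()."),
    ],
)


def get_answers(conn, question_id):
    """Return the answer bodies for one question, highest score first."""
    cursor = conn.execute(
        "SELECT Body FROM answers WHERE ParentId = ? ORDER BY Score DESC",
        (question_id,),
    )
    return [row[0] for row in cursor.fetchall()]


answers = get_answers(connection, 1)
print(answers)
```

In the frontend, the question ID would come from the tags of each matching Document.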

Why not store the answers as <foo> or <bar> or …?

Essentially the answer to many of these questions is:

For this example I’m only searching the question text, so the only interaction the Flow would have with the answers would be storing them in the index anyway, without embeddings.
That would mean uploading 237 MB to the cloud. I remember the days of 1.44 MB floppy disks, so 237 MB seems like a lot to my wrinkly old self.
So why not just put that data in a SQLite file? It seems easy enough, and there are no extra dependencies since the sqlite3 library comes with Python anyhow.

But if we did want to process the answers…?

If we just wanted to store the answers in the index we could write a function to embed each question’s answer in Document.tags["answers"] as a list element.

If we wanted to create and search embeddings for the answers we could create a sub-Document for each answer in a question’s Document.chunks and then decide what access_path we want to use for searching.

Either way, it would’ve meant writing a function to match questions and answers from two CSV files. Considering the trouble I’d already had with the Python Stack Overflow CSV file above, I decided I’d had enough of CSVs for one day. Maybe in a future example.
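For the curious, that matching function might look roughly like the sketch below. It assumes the answers CSV has a ParentId column pointing at the question's Id (an assumption on my part), uses plain dicts in place of Documents, and runs on tiny made-up data. `attach_answers` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-ins for the two CSV files; the answers columns are assumed.
QUESTIONS_CSV = """Id,Title
1,How to merge two data frames?
2,Plot a histogram
"""
ANSWERS_CSV = """Id,ParentId,Body
10,1,Use merge().
11,1,Try dplyr::left_join().
12,2,Call hist().
"""


def attach_answers(questions_file, answers_file):
    """Group answers by ParentId and attach them to each question under tags['answers']."""
    answers_by_question = defaultdict(list)
    for answer in csv.DictReader(answers_file):
        answers_by_question[answer["ParentId"]].append(answer["Body"])

    records = []
    for question in csv.DictReader(questions_file):
        records.append(
            {
                "text": question["Title"],
                "tags": {"answers": answers_by_question.get(question["Id"], [])},
            }
        )
    return records


records = attach_answers(io.StringIO(QUESTIONS_CSV), io.StringIO(ANSWERS_CSV))
print(records[0]["tags"]["answers"])
```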

How can we get better results?

Since Stack Overflow is all about code, you may want to use TransformerTorchEncoder with a more language-specific model, like roberta-python for Python. (I tried finding one for R, but it turns out a query that is a one-letter language name doesn’t play well with Hugging Face’s search engine…)

How can we use Jina to search the actual Stack Overflow, not just a dataset?

Good luck with that! You’d need to scrape the whole site, which Stack Overflow may not be too happy about. That would be a lot of work and potentially legal headaches. That's not to mention keeping it up-to-date. The actual Stack Overflow is constantly updated. So you might need to invest in some compute!