Jina AI has been selected as one of 19 organizations in Infrastructure and Cloud for Google Summer of Code 2023! Google Summer of Code (GSoC) is a program that pays contributors to work remotely on open-source projects. It is widely regarded as the world's leading paid open-source program and contributes significantly to the global open-source ecosystem.
Almost anyone over 18 years of age who loves coding and wants to explore the incredible world of open source can join us as a GSoC 2023 contributor.
Why participate in Jina AI's GSoC 2023?
Jina AI's GSoC includes diverse and challenging tasks, open-ended goals that stimulate creativity, authentic challenges drawn from real projects, and flexible collaboration between contributors and mentors.
We offer a diverse range of challenging tasks that encourage creativity among our participants. Projects range from developing new features and improving existing ones to conducting research and building practical implementations.
To foster creativity, our mentors encourage a broad approach to the projects, allowing contributors to explore different ideas and possibilities, without being confined to rigid boundaries.
Authenticity is at the heart of Jina AI's GSoC program. The program participants will work on real-world projects driven by the open-source community.
The program is also highly flexible, encouraging contributors and mentors to work together to develop proposals that deliver the best possible outcomes, leading to innovative results.
Jina AI's GSoC Projects
1. Build Executor (model) UI in jina
Info | Details |
---|---|
Skills needed | Python |
Project size | 175 hours |
Difficulty level | Easy |
Mentors | @Alaeddine Abdessalem, @Philip Vollet |
Project Description:
Jina Executors are components that perform certain tasks and expose them as services over gRPC. Executors take DocumentArrays as input and return DocumentArrays as output. With DocArray v2's focus on type annotations, Executor endpoints can now be annotated, which lets Executors describe their services and inputs/outputs in much the same way as OpenAPI schemas. This allows us to offer built-in UIs for Executors, so that people can easily use their services with multimodal data. The goal is to build this feature in Jina using Gradio.
Expected outcomes:
- Submit one or more Pull Requests (PRs) to the Jina repository that enable a built-in UI for Executors.
- The UI can be built using Gradio and should infer information about the Executor service from its type annotations.
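As a rough sketch of what such a UI could introspect, the example below shows an Executor endpoint annotated with DocArray v2 schemas (assuming Jina's DocArray v2 typed-endpoint support); the doc classes, endpoint path, and field names are purely illustrative.

```python
from docarray import BaseDoc, DocList
from jina import Executor, requests


# Illustrative input/output schemas; any fields could appear here.
class TextIn(BaseDoc):
    text: str


class TextOut(BaseDoc):
    text: str


class UpperCaseExecutor(Executor):
    @requests(on='/process')
    def process(self, docs: DocList[TextIn], **kwargs) -> DocList[TextOut]:
        # The input and output annotations describe the service schema, much
        # like an OpenAPI spec; a generated Gradio UI could map each field to
        # a widget, e.g. a Textbox for `str` fields.
        return DocList[TextOut]([TextOut(text=d.text.upper()) for d in docs])
```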
2. DocArray wrap ANN libraries
Info | Details |
---|---|
Skills needed | Python, ANN Search experience |
Project size | 175 hours |
Difficulty level | Medium |
Mentors | @Johannes Messner, @Sami Jaghouar, @Philip Vollet |
Project Description:
In DocArray, we have been concentrating on supporting production-ready vector databases for large-scale search. However, there are many ANN libraries without a scalability layer that can also be integrated into DocArray, making ANN search accessible to academia and to production teams with small-to-medium amounts of data, without the need for external services.
DocArray v2 will have a concept called Document Index. This is an abstraction that lets a user store their Documents (on disk or in a database), and retrieve them using ANN search. As such, there can be multiple Document Indexes backed by different backends: Elastic, Qdrant, Weaviate, etc, but all following the same basic API.
The idea behind this project is to take an ANN library and use it to implement a Document Index. There is already an implementation using HNSWLib, which you can find here: feat: hnswlib document index. But there is room to create similar backends using other libraries (Annoy, Faiss, etc.); the goal is to give users a choice.
If there is interest, someone could also implement a backend using a vector database. We already have Qdrant, Weaviate, and Elastic covered, but Milvus, Redis, and some others could also be interesting. You can find a design doc for Document Index here.
Expected Outcomes:
- We have a set of Document Index implementations in DocArray that support the most popular ANN libraries, such as Faiss, Annoy, and HNSWLib.
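For context, the HNSWLib-backed Document Index linked above is used roughly as in the sketch below; the doc schema, dimensionality, and work directory are illustrative, and the exact API may differ from the final design since the feature is still under development.

```python
import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


# Build an on-disk, HNSW-backed index and add some documents to it.
index = HnswDocumentIndex[MyDoc](work_dir='./hnsw_index')
index.index(DocList[MyDoc](
    MyDoc(text=f'doc {i}', embedding=np.random.rand(128)) for i in range(100)
))

# ANN search: retrieve the ten nearest neighbours of a query document.
query = MyDoc(text='query', embedding=np.random.rand(128))
matches, scores = index.find(query, search_field='embedding', limit=10)
```

A new backend for Annoy, Faiss, or another library would implement this same interface on top of the respective library.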
3. Research about deploying LLM with Jina
Info | Details |
---|---|
Skills needed | Python, Pytorch, CUDA, Docker, Kubernetes |
Project size | 350 hours |
Difficulty level | Hard |
Mentors | @Alaeddine Abdessalem, @Joan Martínez |
Project Description:
This project aims to demonstrate the capability of deploying and scaling Large Language Models (LLMs) using Jina. LLMs have gained significant attention for their ability to generate text and solve various tasks. However, deploying LLMs requires technologies to enable scalability when using GPU resources. This project will assess the capability of deploying LLMs with Jina and explore integrations with existing libraries such as DeepSpeed, Accelerate, and FlexGen that provide optimizations for model deployment. The goal is to build demos and showcases hosting LLMs using Jina and the integrated libraries. This will demonstrate Jina's ability to deploy and scale LLMs in a cost-efficient manner.
Expected Outcomes:
- Implementation of LLM deployment using Jina, with an assessment of its scalability on GPU resources.
- Documentation and example code demonstrating the use of Jina for LLM deployment and inference.
- Integrations with the libraries mentioned above so that they can be used within Jina.
- Evaluation of the cost-efficiency of deploying and scaling LLMs with Jina compared to other technologies.
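As a purely illustrative starting point (not one of the optimized setups the project targets), putting a small generative model behind a Jina service looks roughly like the sketch below; the model, schemas, endpoint, and port are placeholders, and the typed endpoints assume Jina's DocArray v2 support.

```python
from docarray import BaseDoc, DocList
from jina import Deployment, Executor, requests
from transformers import pipeline


class Prompt(BaseDoc):
    text: str


class Completion(BaseDoc):
    text: str


class LLMExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # gpt2 keeps the sketch small; a real LLM would be loaded (and possibly
        # sharded across GPUs with DeepSpeed/Accelerate) here.
        self.generator = pipeline('text-generation', model='gpt2')

    @requests(on='/generate')
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Completion]:
        results = [
            self.generator(d.text, max_new_tokens=32)[0]['generated_text']
            for d in docs
        ]
        return DocList[Completion](Completion(text=r) for r in results)


if __name__ == '__main__':
    # Serve the Executor over gRPC; replicas and GPU resources would be
    # configured on the Deployment for scaling experiments.
    with Deployment(uses=LLMExecutor, port=54321) as dep:
        dep.block()
```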
4. Expand ANNLite capabilities with BM25 to build Hybrid Search
Info | Details |
---|---|
Skills needed | Python, C++, Lucene, ANN, Inverted Index |
Project size | 350 hours |
Difficulty level | Hard |
Mentors | @Felix Wang, @Joan Martínez, @Girish Chandrashekar |
Project Description:
This project aims to evaluate and implement Hybrid Search approaches on top of ANNLite, a vector search library developed by Jina that uses HNSW as its search algorithm. Incorporating BM25 and Hybrid Search will enable Jina to create scalable cloud solutions for Hybrid Search. ANNLite already supports filtering of documents; combining vector search algorithms with traditional text-search ones will further improve the performance of search systems. Through this project, we will investigate the benefits of Hybrid Search and try to deploy it with ANNLite to create a powerful default search solution for Jina.
Expected Outcomes:
- ANNLite is ready to be used as the default library for Hybrid Search applications.
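One common way to fuse the two result lists a hybrid system produces is reciprocal rank fusion (RRF); the snippet below sketches only that fusion step, with made-up document ids, and says nothing about how BM25 would actually be implemented inside ANNLite.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids (best first) into one hybrid ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ['d3', 'd1', 'd7']    # ranking from the inverted index / BM25
vector_hits = ['d1', 'd5', 'd3']  # ranking from the HNSW vector search
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['d1', 'd3', 'd5', 'd7']: documents ranked well by both retrievers float to the top
```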
5. Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature
Info | Details |
---|---|
Skills needed | ANN, C++, Python, Databases |
Project size | 350 hours |
Difficulty level | Hard |
Mentors | @Felix Wang, @Joan Martínez |
Project Description:
Jina is developing a stateful Executor feature that enables Deployments with state to be replicated and scaled. This opens the door to having an effective and robust Vector Database within our ecosystem. Iterating on ANNLite so that it can act as the "Lucene" of Jina would be a great opportunity.
Expected Outcomes:
- An Executor on our Executor Hub, using ANNLite (directly or through DocArray with ANNLite as a backend), proven to work as the default Vector Database for all of our examples with mid-sized data requirements.
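For orientation, stateless Executors can already be scaled today through the replicas parameter of a Deployment, roughly as sketched below; the stateful Executor feature this project builds on would extend the same mechanism to Executors that hold an index, and the placeholder Executor here merely stands in for an ANNLite-backed indexer.

```python
from docarray import BaseDoc, DocList
from jina import Deployment, Executor, requests


class TextDoc(BaseDoc):
    text: str


class PlaceholderIndexer(Executor):
    """Stand-in for an ANNLite-backed indexer; it only echoes documents back."""

    @requests(on='/search')
    def search(self, docs: DocList[TextDoc], **kwargs) -> DocList[TextDoc]:
        return docs


# Today this scales a stateless Executor; a stateful Executor would additionally
# keep each replica's index state consistent.
with Deployment(uses=PlaceholderIndexer, replicas=3) as dep:
    dep.block()
```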
6. JAX support in DocArray v2
Info | Details |
---|---|
Skills needed | Python, Deep Learning, JAX |
Project size | 175 hours |
Difficulty level | Hard |
Mentors | @Sami Jaghouar |
Project Description:
DocArray is a library for representing, sending, and storing multi-modal data, with a focus on applications in Machine Learning and Neural Search. It currently supports several deep learning frameworks, including PyTorch and TensorFlow. Google JAX is becoming increasingly popular for deep learning, so we want to integrate it into DocArray.
The project we propose is to add JAX as a backend for DocArray, alongside PyTorch and TensorFlow. The first part would involve translating all of DocArray's computational backend functions to the JAX framework. Then, we would battle-test the implementation against a real JAX use case, such as using DocArray with JAX for model training and serving.
Expected Outcomes:
- We aim to give JAX the same level of support in DocArray as PyTorch, NumPy, and TensorFlow. The integration should be thoroughly tested and documented.
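To make the scope concrete, DocArray's computational backend consists of array operations like the one sketched below, each of which would need a jax.numpy counterpart; the function name and its eventual place inside DocArray are illustrative.

```python
import jax.numpy as jnp


def cosine_similarity(x: jnp.ndarray, y: jnp.ndarray) -> jnp.ndarray:
    """Pairwise cosine similarity between two batches of embeddings."""
    x = x / jnp.linalg.norm(x, axis=-1, keepdims=True)
    y = y / jnp.linalg.norm(y, axis=-1, keepdims=True)
    return x @ y.T


# Example: compare a batch of 4 embeddings against a batch of 8.
a = jnp.ones((4, 128))
b = jnp.ones((8, 128))
print(cosine_similarity(a, b).shape)  # (4, 8)
```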
Meet the Jina AI mentor team
- Joan Martínez, Head of Engineering
- Felix Wang, Engineering Manager
- Philip Vollet, Head of Developer Experience
- Girish Chandrashekar, Senior Software Engineer
- Johannes Messner, Senior Software Engineer
- Sami Jaghouar, Senior Software Engineer
- Alaeddine Abdessalem, Software Engineer
How to apply for Jina AI's GSoC
- Read through Jina AI's GSoC page and Google Summer of Code guides.
- Identify project ideas on Jina AI's GSoC issues.
- Fill out the Survey so mentors can understand your background and relevant experience.
- Interact with the Jina community: join channels, introduce yourself, ask thoughtful questions, and help others with issues you have experience with.
- Try out the project ideas: set them up in a local environment, read the documentation, reproduce the examples from our tech blogs to get hands-on experience, solve issues you encounter, and ask questions if you get stuck.
- Make contributions: Try to fix some good first issues, write tutorials, or share interesting projects you build with Jina or DocArray.
- Prepare a well-crafted proposal aligned with the project goals, and submit your application through Google's system between March 20 and April 4.
GSoC x Jina AI Webinar
We will be hosting a GSoC x Jina AI webinar on March 23rd, 14:00-15:00 CET. Join us as our experienced mentors give an overview of their GSoC projects and answer any questions you have about project requirements and expectations. This is a great opportunity for students, developers, and tech enthusiasts to learn more about these exciting projects and get involved in the open-source community.