Introduction
In the fast-paced field of generative AI, the need for high-quality images is more critical than ever. This article explores the complex yet fascinating topic of image upscaling, a machine learning task designed to increase the resolution of small images without sacrificing image quality. We begin by defining the super-resolution task in machine learning, followed by a discussion of the models typically used for it. For this discussion, we take an academic approach and review the literature in the field of super-resolution, something that long-time readers of Jina AI’s blog posts may find unusual but, hopefully, still valuable. We then compare these models to give you a clearer understanding of their strengths and weaknesses. Following this, we guide you through the practical application of these models on Jina AI Cloud.
Lastly, we detail how these upscaling models can prove beneficial for generative AI application developers, particularly in tandem with other generative AI models, such as Stable Diffusion, for the purpose of image enhancement, restoration or data augmentation.
Our goal is to provide you with a comprehensive understanding of the image upscaling process and its potential applications.
1 What is the Super-resolution Task?
The super-resolution task is an exciting computer vision task whose goal is to restore a low-resolution image (blurry, pixelated) to a high-resolution image, making it sharper and more detailed. This task plays an important role in our daily lives, for example, by converting old, low-resolution photos into sharper, high-definition versions, or by providing sharper picture quality in video calls.
Super-resolution models are algorithms built to achieve this goal. They infer the details missing from a low-resolution image and generate a high-resolution one, drawing on what they have learned about image details, features, patterns, and contextual information. The appeal of these models is that they need no high-resolution reference at inference time: the low-resolution image is the only input.
With the rapid development of artificial intelligence and deep learning, super-resolution technology has made tremendous progress. It has brought image enhancement capabilities that not only improve image quality but also drive innovation in many application areas. Super-resolution tasks and super-resolution models add color to our digital world and give us a richer visual experience.
In this article, we introduce some of the most common machine learning model architectures and techniques for super-resolution that are currently state-of-the-art. At the end of the article, we also show how you can access some of these models and turn your own low-resolution images into sharper and more detailed versions of themselves.
2 Commonly Used Models and Principles
This section focuses on understanding the state-of-the-art models used for super-resolution. We divide these models into two main categories for ease of understanding: slow but effective models, and fast but potentially less effective models. In the first subsection, we will delve into four models that, while taking a bit longer to produce results, ensure high-quality upscaled image output. Conversely, the second subsection will introduce you to three models that are known for their speed, even though they might sometimes compromise on effectiveness. This balanced perspective aims to equip you with the knowledge to decide which model best aligns with your specific requirements and constraints.
In the next section of the article, we will objectively evaluate all the discussed models, considering their speed and accuracy.
2.1 Slow but Effective Models
2.1.1 SwinIR
SwinIR is a strong baseline model for image restoration based on the Swin Transformer. It consists of three parts: shallow feature extraction, deep feature extraction, and high-quality image reconstruction. The deep feature extraction module is composed of several residual Swin Transformer blocks (RSTB), each of which has several Swin Transformer layers together with a residual connection. The authors conduct experiments on image super-resolution (including classical, lightweight, and real-world image super-resolution) and report strong results.
Both the SwinIR and Real-ESRGAN models expose a tile parameter. Because large images can exhaust GPU memory, the tile option first crops the input image into tiles and processes each tile separately. The results are then merged into one image.
You can therefore use it when GPU memory is limited, but do not expect it to speed up inference.
The tile_overlap parameter plays the same role as tile_pad: it sets how much neighboring tiles overlap. A higher tile_pad value means each tile shares more of the input image with its neighbors, resulting in a “smoother” result.
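The tiling idea can be sketched in a few lines. The following is a minimal, dependency-free illustration and not the SwinIR or Real-ESRGAN implementation: the image is a plain list of rows, `upscale_nearest` is a hypothetical stand-in for the real model, and the function and parameter names are assumptions chosen to mirror the options described above.

```python
def upscale_nearest(tile, scale):
    """Stand-in for a super-resolution model: nearest-neighbour upscaling."""
    out = []
    for row in tile:
        wide = [px for px in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def upscale_tiled(image, scale, tile=4, tile_pad=1, model=upscale_nearest):
    """Upscale a 2D image (list of rows) tile by tile.

    Each tile is read together with tile_pad pixels of surrounding context,
    upscaled on its own, and only the un-padded centre is copied to the output.
    """
    height, width = len(image), len(image[0])
    output = [[None] * (width * scale) for _ in range(height * scale)]

    for top in range(0, height, tile):
        for left in range(0, width, tile):
            bottom = min(top + tile, height)
            right = min(left + tile, width)

            # Padded window around the tile, clamped to the image borders.
            p_top, p_left = max(top - tile_pad, 0), max(left - tile_pad, 0)
            p_bottom = min(bottom + tile_pad, height)
            p_right = min(right + tile_pad, width)

            padded = [row[p_left:p_right] for row in image[p_top:p_bottom]]
            upscaled = model(padded, scale)

            # Position of the un-padded tile inside the upscaled window.
            off_y, off_x = (top - p_top) * scale, (left - p_left) * scale
            for y in range((bottom - top) * scale):
                for x in range((right - left) * scale):
                    output[top * scale + y][left * scale + x] = upscaled[off_y + y][off_x + x]
    return output
```

Only one padded tile is passed to the model at a time, which is why peak memory drops. The total amount of work does not shrink (the padding even adds some), which matches the point that tiling does not speed up inference.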
2.1.2 Real-ESRGAN
The Real-ESRGAN generator is the same generator (SR network) as in ESRGAN: a deep network with several residual-in-residual dense blocks (RRDB). Real-ESRGAN also extends the original ×4 ESRGAN architecture to perform super-resolution with a scale factor of ×2. Because ESRGAN is a heavy network, Real-ESRGAN first applies pixel-unshuffle to reduce the spatial size and enlarge the channel size before feeding inputs into the main ESRGAN architecture. Pixel-unshuffle is the inverse of pixel-shuffle, an operation used in super-resolution models to implement efficient sub-pixel convolutions: pixel-shuffle rearranges the elements of a tensor of shape [∗, C×r², H, W] into a tensor of shape [∗, C, r×H, r×W], and pixel-unshuffle maps the second shape back to the first. As a result, most of the computation is performed in a lower-resolution space, which reduces GPU memory and compute consumption.
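To make the rearrangement concrete, here is a small sketch of pixel-shuffle and pixel-unshuffle on nested Python lists, without the batch dimension. It follows the channel ordering used by common deep learning frameworks as far as we know. Real models apply the same operation to tensors, and the function names here are illustrative only.

```python
def pixel_shuffle(x, r):
    """Rearrange [C*r*r][H][W] into [C][H*r][W*r]."""
    channels, height, width = len(x) // (r * r), len(x[0]), len(x[0][0])
    return [
        [
            [x[c * r * r + (h % r) * r + (w % r)][h // r][w // r]
             for w in range(width * r)]
            for h in range(height * r)
        ]
        for c in range(channels)
    ]


def pixel_unshuffle(x, r):
    """Rearrange [C][H*r][W*r] into [C*r*r][H][W] (inverse of pixel_shuffle)."""
    channels, height, width = len(x), len(x[0]) // r, len(x[0][0]) // r
    return [
        [
            # Output channel k encodes (c, i, j): source channel and sub-pixel offset.
            [x[k // (r * r)][h * r + (k % (r * r)) // r][w * r + k % r]
             for w in range(width)]
            for h in range(height)
        ]
        for k in range(channels * r * r)
    ]


# A single-channel 4x4 image becomes four 2x2 channels with r = 2.
image = [[[row * 4 + col for col in range(4)] for row in range(4)]]
small = pixel_unshuffle(image, 2)
restored = pixel_shuffle(small, 2)
```

In this example the spatial size is halved in each dimension while the channel count grows fourfold, so no information is lost and `restored` equals `image`.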
2.1.3 Swin2SR
In Swin2SR, the authors explore the novel Swin Transformer V2 to improve SwinIR for image super-resolution, in particular for the compressed-input scenario. This approach addresses major issues in training transformer vision models, such as training instability, resolution gaps between pre-training and fine-tuning, and data hunger. Experimental results show that Swin2SR improves the training convergence and performance of SwinIR and is a top-five solution at the "AIM 2022 Challenge on Super-Resolution of Compressed Image and Video".
2.1.4 Stable Diffusion Upscaler
Stable Diffusion Upscaler (SDU) is a super-resolution model based on the diffusion process. It converts a low-resolution image into a high-resolution image by introducing random noise and diffusion processes. The model includes an encoder, a diffusion layer and a decoder. The encoder maps the low-resolution image to the feature space, the diffusion layer improves the image quality by random noise and iteratively updating the feature map, and the decoder maps the feature map back to the high-resolution image space. SDU utilizes a nonlinear diffusion process and the introduction of random noise to be able to handle a diversity of details and textures, generating a more realistic and detailed high-resolution image. The implementation of specific models may vary, and the above description provides a general overview.
However, although this model produces better results, its resource consumption is huge: under the same conditions, its inference time and memory usage are several times those of the other models. Because of this cost, we do not include this model for the time being.
2.2 Fast but Less Effective Models
2.2.1 Bicubic++
The authors propose a real-time and lightweight single-image super-resolution (SR) network named Bicubic++. Bicubic++ learns quick, reversible, downgraded, lower-resolution features of the image to decrease the number of computations. The authors also apply end-to-end global structured pruning of convolutional layers without using metrics such as magnitude and gradient norms, focusing instead on optimizing the pruned network's PSNR on the validation set. Furthermore, they remove the bias terms from the convolutional layers to improve inference speed.
2.2.2 DIPNet
The DIPNet authors leverage enhanced high-resolution output as additional supervision to improve the learning ability of lightweight student networks. Once the student network converges, they further simplify the network structure to a more lightweight level using reparameterization techniques and iterative network pruning. They also adopt a lightweight network training strategy that combines multi-anchor distillation and progressive learning, enabling the lightweight network to achieve outstanding performance. The method achieved the fastest inference time among all participants in the NTIRE 2023 efficient super-resolution challenge while maintaining competitive super-resolution performance. The results show comparable performance on the representative DIV2K dataset, both qualitatively and quantitatively, with faster inference and fewer network parameters.
2.2.3 LapSRN
In the LapSRN paper, the authors propose the Laplacian Pyramid Super-Resolution Network (LapSRN) to progressively reconstruct the sub-band residuals of high-resolution images. At each pyramid level, the model takes coarse-resolution feature maps as input, predicts the high-frequency residuals, and uses transposed convolutions for upsampling to the finer level. The method does not require bicubic interpolation as a pre-processing step and thus dramatically reduces the computational complexity. The authors train LapSRN with deep supervision using a robust Charbonnier loss function and achieve high-quality reconstruction. Furthermore, the network generates multi-scale predictions in one feed-forward pass through the progressive reconstruction, thereby facilitating resource-aware applications. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of speed and accuracy.
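For readers unfamiliar with the Charbonnier loss mentioned above, the sketch below shows its usual form, sqrt(d² + ε²) averaged over all pixel differences d. It is a simplified illustration on flat lists of pixel values, not the training code of LapSRN. The default ε of 1e-3 is the value we recall from the paper and should be treated as an assumption.

```python
import math


def charbonnier_loss(prediction, target, eps=1e-3):
    """Mean Charbonnier penalty between two equally sized pixel sequences.

    Behaves like the L1 loss for large errors but stays smooth
    (differentiable) around zero thanks to the eps term.
    """
    if len(prediction) != len(target):
        raise ValueError("prediction and target must have the same length")
    diffs = [p - t for p, t in zip(prediction, target)]
    return sum(math.sqrt(d * d + eps * eps) for d in diffs) / len(diffs)


# A perfect prediction gives a loss of eps rather than exactly zero.
perfect = charbonnier_loss([0.2, 0.5, 0.9], [0.2, 0.5, 0.9])
# Large errors are penalised almost linearly, which limits the pull of outliers.
outlier = charbonnier_loss([10.0], [0.0])
```

With deep supervision, a loss of this kind would be computed at every pyramid level against a correspondingly downsampled target and summed.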
3 Comparing and Contrasting Models
We designed a series of evaluation procedures for evaluating the performance of each model. In this section, we present this evaluation in terms of the output of the models, inference time, GPU memory usage and other metrics, all evaluated independently and uniformly.
In all, considering the cost and performance, we may recommend that users use Real-ESRGAN if they are looking for high-quality upscaled images, with a reasonably low memory footprint. However, if users want to process a lot of images in a limited amount of time, at the cost of potentially lower-quality image outputs, then they should pick DIPNet.
The following graphs show the output comparison of the models. For tags under the images (e.g. Sheriff_5.966_2746), the first part of it is the filename of the image. The second and third parts are the inference time(s) and memory cost (MB), which will give you a clearer understanding of the inference process.
Next, we examine the resource consumption of each model.
Dots per ms measures how many pixels of an image can be processed per millisecond, and Dots per MB measures how many pixels can be handled per MB of GPU memory.
This may help you understand the costs of each model.
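As a rough illustration of how these two metrics relate to the tags under the images, the sketch below parses a tag of the form name_seconds_megabytes and derives both values. The helper names are hypothetical, the 512×512 image size is an invented example, and counting input pixels (rather than output pixels) as "dots" is an assumption.

```python
def parse_tag(tag):
    """Split a tag such as 'Sheriff_5.966_2746' into (name, seconds, megabytes)."""
    # rsplit keeps any underscores that belong to the file name itself.
    name, seconds, megabytes = tag.rsplit("_", 2)
    return name, float(seconds), float(megabytes)


def dots_metrics(width, height, seconds, megabytes):
    """Return (dots per ms, dots per MB) for one image.

    Assumption: 'dots' counts the pixels of the input image.
    """
    dots = width * height
    return dots / (seconds * 1000.0), dots / megabytes


name, seconds, megabytes = parse_tag("Sheriff_5.966_2746")
# Hypothetical input size, for illustration only.
per_ms, per_mb = dots_metrics(512, 512, seconds, megabytes)
```

Higher values are better for both metrics: a model with more dots per ms is faster, and a model with more dots per MB fits larger images into the same GPU memory.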
DPS (documents per second) is the number of documents the model handles per second. In the evaluation below, the last column lists the maximum image size that the model can process using no more than 16 GB of GPU memory (we assume that an image larger than 1024×1024 does not need upscaling).
The results show that Bicubic++ and DIPNet score very well on both of these metrics, while the other 5 models have almost the same memory utilization. However, the output quality of Bicubic++ and DIPNet may not be satisfactory. Among the remaining models, Real-ESRGAN has the best inference speed and moderate output quality. SwinIR and Swin2SR-real have almost the same output quality and inference speed, on par with Real-ESRGAN (without face enhancement).
4 Using Super-resolution Models in Inference Client
Let’s use Jina AI’s Inference to turn ordinary images into extraordinary ones.
Jina Inference Client is a Python library that allows you to interact with Jina AI Inference. It provides a simple and intuitive API to perform various tasks such as image captioning, encoding, ranking, visual question answering (VQA), and image upscaling. Here’s a simple example of how to use super-resolution models with Inference Client.
First, you need to install Inference Client using pip:
pip install inference-client
Supposing that you have a Jina Flow serving an upscaling model running at grpc://localhost:51001, you need to initialize the inference client using the following code:
from inference_client import Client
from jina import DocumentArray, Document
client = Client()
model = client.get_model('grpc://localhost:51001')
Next, load an image from a URI on the internet or from your local filesystem. Here, image is assumed to be a string holding that URI or file path:
doc = [Document(uri=image, tags={"image_format": "png"})]
The image_format tag sets the output format of the image. If it is not set, the original image format is used.
Now you can call the client to perform the upscaling:
result = model.upscale(image=doc, output_path='upscaled_image.png')
You can also specify the output size or ratio for the output image. However, it cannot be larger than the original image size multiplied by upscale_ratio. For example, the following code sets the output image’s width to 600 pixels. Note that the scale parameter follows the FFmpeg style, where a value of -1 for one dimension keeps the aspect ratio.
result = model.upscale(image=image, scale='600:-1', output_path='upscaled_image.png')
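To illustrate how an FFmpeg-style scale string such as '600:-1' can be interpreted, here is a small, self-contained sketch. The `resolve_scale` helper is hypothetical and not part of Inference Client. The default `upscale_ratio` of 4 and the choice to raise an error when the limit is exceeded are assumptions made for this example.

```python
def resolve_scale(scale, orig_width, orig_height, upscale_ratio=4):
    """Turn a 'width:height' string into concrete output dimensions.

    A value of -1 for one dimension means: derive it from the other one
    so that the aspect ratio of the original image is preserved.
    """
    width, height = (int(part) for part in scale.split(":"))

    if width == -1 and height == -1:
        width, height = orig_width, orig_height
    elif width == -1:
        width = round(orig_width * height / orig_height)
    elif height == -1:
        height = round(orig_height * width / orig_width)

    # Assumed limit: the output may not exceed original size * upscale_ratio.
    if width > orig_width * upscale_ratio or height > orig_height * upscale_ratio:
        raise ValueError(
            f"requested {width}x{height} exceeds "
            f"{orig_width * upscale_ratio}x{orig_height * upscale_ratio}"
        )
    return width, height


# For a hypothetical 300x200 input, '600:-1' resolves to a 600x400 output.
size = resolve_scale("600:-1", 300, 200)
```

Deriving the missing dimension from the aspect ratio avoids distorting the image when only the target width or height is known.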
We also support a JavaScript client and plain HTTP requests through cURL.
5 Working with AIGC
Integrating super-resolution (SR) models into the AIGC (AI-Generated Content) field, particularly in combination with Stable Diffusion, can enhance the quality and creativity of image generation and manipulation. Here's how super-resolution models can be integrated and work together with Stable Diffusion:
- Image Enhancement and Upscaling: These models can be used to enhance low-resolution images and upscale them to higher resolutions with improved details and clarity. The high-resolution images obtained from super-resolution can then be used as inputs for Stable Diffusion, enabling more precise and realistic image manipulations.
- Style Transfer and Artistic Rendering: By combining super-resolution with Stable Diffusion, artists and designers can upscale low-resolution artwork and then apply style transfer techniques to achieve unique and creative visual effects. The high-resolution images produced by super-resolution capture more intricate details, enabling Stable Diffusion to better preserve and transfer artistic styles.
- Video Super-Resolution: Integrating these models with Stable Diffusion can be especially beneficial for video processing. Super-resolution can be applied to each frame of a low-resolution video, followed by Stable Diffusion-based techniques to generate high-resolution, smooth, and stable video sequences.
- Generative Models and Data Augmentation: These models can be used as a pre-processing step to upscale low-resolution training images for generative models like GANs (Generative Adversarial Networks) or VAEs (Variational Autoencoders). This integration can improve the quality and diversity of generated images.
- Image Restoration and Manipulation: By combining these models with Stable Diffusion, it becomes possible to restore and manipulate images with higher fidelity and finer details. For tasks like image inpainting or denoising, super-resolution can be employed to generate high-resolution versions of damaged images, followed by Stable Diffusion to reconstruct and fill in missing regions.
- Real-Time Applications: The combination of super-resolution and Stable Diffusion can lead to real-time image and video processing applications. For example, super-resolution can upscale low-resolution input frames in real time, followed by Stable Diffusion for stylistic rendering or other creative effects, providing interactive and visually appealing experiences.
Overall, integrating super-resolution models with Stable Diffusion opens up a wide range of possibilities for enhancing image quality, boosting creative capabilities, and advancing the AIGC field. This collaborative approach empowers artists, designers, and developers to create more realistic, vibrant, and captivating visual content across various domains and applications.
Now, Jina AI’s users have the opportunity to try out the super-resolution task using our best-in-class model-as-a-service offering, Inference, on Jina AI Cloud.
You can also get in touch with us with your questions on our Discord channel.