A Simple Way of Improving Zero-Shot CLIP Performance | by Alexey Kravets | Nov, 2023

Part 1 — Customized Prompts via Language Models (CuPL)

Alexey Kravets
Towards Data Science

Unimodal models are designed to work with data from a single modality, either text or images. These models specialize in understanding and generating content in that modality. For example, GPT models are excellent at generating human-like text and have been used for tasks such as language translation, text generation, and question answering. Convolutional Neural Networks (CNNs) are image models that excel at tasks such as image classification, object detection, and image generation. Many interesting tasks, such as Visual Question Answering (VQA) and image-text retrieval, require multimodal capabilities. Is it possible to combine text and image processing? It is! CLIP stands out as one of the first highly successful image-text models, demonstrating proficiency in both image recognition and text comprehension.

We will divide this article into the following sections:

  1. Introduction
  2. Architecture
  3. Training process and Contrastive loss
  4. Zero-shot capability
  5. CuPL
  6. Conclusions

The CLIP model is an impressive zero-shot predictor: it can make predictions on tasks it has not been explicitly trained for. As we will see in more detail in the next sections, by using natural language prompts to query images, CLIP can perform image classification without requiring task-specific training data. Nevertheless, its performance can be significantly improved with a few tricks. In this series of articles, we will explore methods that leverage either additional prompts generated by Large Language Models (LLMs) or a few training examples (few-shot), without any parameter training. These approaches are computationally less demanding because they do not require fine-tuning any additional parameters.
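To make prompt-based zero-shot classification concrete, here is a minimal sketch in plain Python. It assumes that the image and prompt embeddings have already been produced by CLIP's encoders; the 3-dimensional vectors, the function names, and the temperature of 100 are illustrative assumptions, not values from this article. It also assumes that several prompts per class are combined by averaging their normalized embeddings, which is one simple way to use additional prompts without training any parameters.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_embedding(prompt_embeddings):
    """Average the unit-normalized prompt embeddings of one class, then re-normalize."""
    unit = [normalize(e) for e in prompt_embeddings]
    mean = [sum(col) / len(unit) for col in zip(*unit)]
    return normalize(mean)


def zero_shot_predict(image_embedding, prompts_per_class, temperature=100.0):
    """Return (best_label, probabilities) from a softmax over scaled cosine similarities."""
    image = normalize(image_embedding)
    labels = list(prompts_per_class)
    logits = []
    for label in labels:
        text = class_embedding(prompts_per_class[label])
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(temperature * cosine)
    # Subtract the maximum logit for a numerically stable softmax.
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


# Toy embeddings invented for illustration; real ones would come from CLIP's encoders.
prompts_per_class = {
    "cat": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}
image_embedding = [0.85, 0.2, 0.05]

label, probs = zero_shot_predict(image_embedding, prompts_per_class)
print(label, probs)
```

With these toy vectors the image lies closest to the "cat" prompts, so "cat" receives the highest probability. No parameter is updated anywhere: adding a class or a prompt only means adding another embedding to the dictionary.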

CLIP is a dual-encoder model: two separate encoders, one for the visual modality and one for the textual modality, encode images and texts independently. This architecture differs from a fusion encoder, which enables interaction between the visual and textual modalities through cross-attention, a mechanism that learns attention weights that help the model focus on specific regions of…

This post originally appeared on TechToday.
