Language models are few-shot learners

Last Updated : 25 Jun, 2025

Few-shot learning is a machine learning approach where a model learns to perform new tasks or recognize new categories using only a small number of labeled examples, rather than needing thousands or millions of data points as in traditional methods. This technique is inspired by the way humans can quickly learn and generalize from just a few examples; for instance, a person can identify a new animal species after seeing only a couple of pictures.

What Is Few-Shot Learning?

Few-shot learning refers to the ability of a model to generalize and perform new tasks after being shown only a small number of examples. It is part of a broader family of in-context learning settings that includes:

  • Zero-shot: only a natural-language description of the task is given, with no examples.
  • One-shot: the task description is accompanied by a single demonstration.
  • Few-shot: the task description is accompanied by a handful of demonstrations, as many as fit in the model's context window.

Few-shot learning is particularly valuable when labeled data is scarce or expensive to obtain, such as in medical diagnosis, rare language translation, or custom text classification tasks.

How Does GPT-3 Achieve Few-Shot Learning?

1. Massive Pretraining

GPT-3 is trained on a vast corpus of internet text (roughly 410 billion tokens of filtered Common Crawl, supplemented with web text, books, and Wikipedia), using an autoregressive transformer architecture with 175 billion parameters, ten times more than any previous non-sparse language model at the time. This extensive pretraining enables the model to absorb a wide range of language patterns, facts, and reasoning styles.
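
Under the hood, this pretraining is ordinary next-token prediction. The following is a minimal sketch of that autoregressive objective using toy PyTorch tensors; the batch size, sequence length, and vocabulary size are illustrative, not GPT-3's real dimensions.

import torch
import torch.nn.functional as F

# Toy logits as a language model would produce them:
# batch of 1, sequence of 4 tokens, vocabulary of 10.
logits = torch.randn(1, 4, 10)
token_ids = torch.tensor([[2, 7, 1, 4]])

# Autoregressive objective: every position predicts the *next* token.
pred = logits[:, :-1, :].reshape(-1, 10)   # predictions for positions 0..2
target = token_ids[:, 1:].reshape(-1)      # actual tokens at positions 1..3
loss = F.cross_entropy(pred, target)       # minimized over hundreds of billions of tokens
print(loss.item())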

2. In-Context Learning (Prompting)

Instead of being fine-tuned, GPT-3 is presented with a prompt that contains just a task description (zero-shot), a task description plus a single example (one-shot), or a task description plus a few examples (few-shot). The model then continues the prompt, generating output that matches the pattern of the examples provided.

Example (Few-Shot Translation):

English: The book is on the table.
French: Le livre est sur la table.

English: I like apples.
French: J'aime les pommes.

English: How are you?
French:

Model Output: "Comment ça va ?"
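
The same pattern can be reproduced mechanically. Below is a minimal sketch using the Hugging Face transformers text-generation pipeline with the small GPT-2 checkpoint as a stand-in, since GPT-3 itself is only reachable through an API; the model name and generation settings are illustrative, and a model this small will usually get the translation wrong. The point is the mechanics: the entire task specification lives in the prompt string.

from transformers import pipeline

# GPT-2 stands in for GPT-3 here; the prompting pattern is identical.
generator = pipeline("text-generation", model="gpt2")

# Task description plus demonstrations, packed into plain text.
prompt = (
    "Translate English to French.\n\n"
    "English: The book is on the table.\n"
    "French: Le livre est sur la table.\n\n"
    "English: I like apples.\n"
    "French: J'aime les pommes.\n\n"
    "English: How are you?\n"
    "French:"
)

output = generator(prompt, max_new_tokens=10, do_sample=False)
# Strip the echoed prompt and keep only the model's continuation.
print(output[0]["generated_text"][len(prompt):])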

3. No Parameter Updates

During few-shot learning, the model's weights remain fixed. All "learning" occurs within the context of the prompt: the model uses its pretrained knowledge to recognize the pattern in the examples and apply it to the new input.
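
To make the "no parameter updates" point concrete, the sketch below (again with GPT-2 standing in for GPT-3) runs a few-shot completion under torch.no_grad() and fingerprints the weights before and after; the checksum is identical because nothing is ever backpropagated.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def weight_checksum(m):
    # Crude fingerprint of all parameters, used only to show they never change.
    return sum(p.detach().abs().sum().item() for p in m.parameters())

before = weight_checksum(model)

prompt = "English: I like apples.\nFrench: J'aime les pommes.\n\nEnglish: How are you?\nFrench:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():  # inference only: no gradients, no optimizer step
    out = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)

after = weight_checksum(model)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
print("weights unchanged:", before == after)  # True: all 'learning' lived in the prompt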

Experimental Results: How Well Does It Work?

Performance Across Tasks

  • Translation: In the few-shot setting, GPT-3 outperformed prior unsupervised translation systems and approached supervised, fine-tuned models when translating into English (e.g., French-to-English and German-to-English).
  • Question Answering: Achieved strong results on benchmarks like TriviaQA and OpenBookQA.
  • Cloze Tasks: Performed well on fill-in-the-blank tasks (e.g., LAMBADA).
  • Reasoning and Arithmetic: Could solve 3-digit addition and subtraction, unscramble words, and even use made-up words in context (a sample arithmetic prompt is sketched after this list).
  • News Generation: Human evaluators could distinguish GPT-3-generated news articles from human-written ones only about 52% of the time, barely above chance.
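
The arithmetic results come from the same mechanism: the problem is posed as a few-shot prompt and the model simply continues the text. Below is a small sketch of what such a prompt might look like; the demonstrations are made up for illustration and are not taken from the paper.

# Illustrative few-shot prompt for 3-digit addition. This string would be sent
# to the model as-is, with its weights frozen, exactly like the translation example.
examples = [("384 + 217", "601"), ("125 + 499", "624"), ("730 + 268", "998")]
query = "412 + 359"

prompt = "Answer the addition problems.\n\n"
for q, a in examples:
    prompt += f"Q: {q}\nA: {a}\n\n"
prompt += f"Q: {query}\nA:"

print(prompt)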

Scaling Laws and Example Count

  • Model Size: Larger models consistently performed better in few-shot settings, demonstrating clear scaling laws.
  • Number of Examples: More in-context examples generally improved performance, but gains plateaued once the prompt approached the model’s context window limit (a token-budget sketch follows this list).
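
The plateau has a simple mechanical cause: each added demonstration consumes tokens from a fixed budget. The sketch below uses the GPT-2 tokenizer (the same BPE family GPT-3 uses) to estimate how many translation demonstrations fit into a 2,048-token window; the budget and demonstration text are illustrative.

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # same BPE family as GPT-3
CONTEXT_BUDGET = 2048                                   # GPT-3's context window

demo = "English: The book is on the table.\nFrench: Le livre est sur la table.\n\n"
tokens_per_demo = len(tokenizer.encode(demo))

task_header = "Translate English to French.\n\n"
query = "English: How are you?\nFrench:"
overhead = len(tokenizer.encode(task_header)) + len(tokenizer.encode(query))

max_demos = (CONTEXT_BUDGET - overhead) // tokens_per_demo
print(f"{tokens_per_demo} tokens per demonstration -> at most {max_demos} fit in the window")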

Why Is This Significant?

  • No Task-Specific Training Needed: GPT-3 can generalize to new tasks without retraining or fine-tuning, simply by changing the prompt.
  • Human-Like Flexibility: This approach mirrors how humans learn from instructions and a few demonstrations, rather than requiring exhaustive practice.
  • Rapid Prototyping: Developers can quickly test new tasks and applications by designing prompts, without collecting large labeled datasets.
  • Broader Societal Impact: GPT-3’s ability to generate human-like text has profound implications for content creation, education, and even misinformation.

Limitations

  • Reasoning Tasks: GPT-3 still lagged behind fine-tuned models on tasks requiring deep reasoning (e.g., SuperGLUE’s BoolQ).
  • Prompt Sensitivity: Performance depends on the clarity, order, and diversity of the examples in the prompt (an ordering sketch follows this list).
  • Context Window: The number of examples is limited by the model’s maximum input length (2,048 tokens for GPT-3).
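
Prompt sensitivity is easy to probe directly: keep the demonstrations fixed, permute only their order, and compare the completions. A rough sketch with GPT-2 as a stand-in; the demonstrations are illustrative and a model this small will translate poorly, but the completions can still differ from one ordering to the next.

from itertools import permutations
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

demos = [
    ("The book is on the table.", "Le livre est sur la table."),
    ("I like apples.", "J'aime les pommes."),
    ("Good morning.", "Bonjour."),
]
query = "English: How are you?\nFrench:"

# Same demonstrations, different orderings: the completion can change with the order.
for order in permutations(demos):
    prompt = "".join(f"English: {e}\nFrench: {f}\n\n" for e, f in order) + query
    out = generator(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
    print(out[len(prompt):].strip())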
