Small Language Models (SLMs)
Arkaprava Roy
What are Small Language Models?
SLMs are compact versions of large language models (LLMs), with parameters in
the millions to a few billion, compared to LLMs with hundreds of billions.
• Efficiency: SLMs use less computational power and memory, making them ideal
for small devices and edge computing, enabling real-world applications like
on-device chatbots.
• Lightweight Architecture
• Pre-training Techniques
• Fine-tuning Techniques
Pre-training Techniques
Pre-training language models (both SLMs and LLMs) efficiently requires
specialized techniques. One key method is mixed precision training, where
lower-precision numbers (FP16) are used for most computations while the
model's master weights are kept in higher precision (FP32). This speeds up
training without sacrificing accuracy. Other techniques, such as gradient
clipping (to prevent instability from excessively large updates) and
memory-efficient optimizers (like Adafactor and Sophia), help improve
stability and performance. Additionally, distributed training methods, such
as ZeRO and FSDP, allow training to be spread across multiple machines,
making it faster and more scalable.
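A minimal PyTorch sketch of mixed precision training combined with gradient clipping; the toy model, batch, and hyperparameters are illustrative placeholders, and distributed techniques such as ZeRO/FSDP are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"          # mixed precision only makes sense on GPU here
amp_dtype = torch.float16           # lower-precision compute; master weights stay FP32

# Toy stand-in for a small transformer language model (illustrative only).
vocab_size, hidden = 32_000, 256
model = nn.Sequential(nn.Embedding(vocab_size, hidden),
                      nn.Linear(hidden, vocab_size)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # rescales FP16 grads to avoid underflow

tokens = torch.randint(0, vocab_size, (8, 128), device=device)   # fake batch of token ids
targets = torch.randint(0, vocab_size, (8, 128), device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Forward/backward computations run in FP16 where safe (mixed precision).
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_amp):
        logits = model(tokens)
        loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale first so clipping sees true gradient magnitudes
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    scaler.step(optimizer)
    scaler.update()
```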
Fine-tuning Techniques
Fine-tuning adapts pre-trained models to specific tasks using smaller,
task-specific datasets; these methods achieve task adaptation with fewer
resources and less data.
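As one illustrative example of such a technique, here is a hedged sketch of parameter-efficient fine-tuning with LoRA via the Hugging Face peft library; the base model name, target modules, and hyperparameters are assumptions for demonstration only, not taken from the slides.

```python
# Hedged illustration only: parameter-efficient fine-tuning with LoRA (Hugging Face peft).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative small base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA freezes the base weights and injects small trainable low-rank matrices
# into the attention projections, so only a tiny fraction of parameters is updated
# on the smaller task-specific dataset.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection module names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters
# The wrapped model can then be trained with a standard loop or transformers.Trainer.
```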
Model Compression Techniques
Model compression techniques focus on reducing the size and
complexity of large pre-trained language models while
maintaining their performance. As a result, these methods are
a key approach to deriving SLMs from LLMs.
• Pruning Techniques
• Quantisation
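A minimal sketch of both ideas using built-in PyTorch utilities, unstructured magnitude pruning followed by post-training dynamic quantisation; the toy model is a placeholder, not one of the SLMs discussed here.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torch.ao.quantization import quantize_dynamic

# Placeholder model standing in for a pre-trained network to be compressed.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))

# Pruning: zero out the 30% smallest-magnitude weights in each linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the sparsity permanent

# Quantisation: store and execute linear layers in INT8 instead of FP32.
quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as before, but smaller and faster on CPU
```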
The resulting SLMs are then assessed along deployment-oriented dimensions such as:
• Latency
• Memory
• Privacy
• Energy Optimisation
SLMs are tested under particular conditions and settings on benchmark datasets,
and statistics for the corresponding metrics are collected.
Examples of SLMs
Applications of SLMs
• Real-Time Interaction
• LLaMA-Omni: A combination of speech encoder, adaptor, LLM, and streaming decoder for
real-time speech input interactions. It uses LLaMA-3-8B-Instruct.
• Google’s Project Astra: Uses Gemini to process audio and video data from smart
devices like smartphones or glasses. It can respond to queries, solve math problems,
and memorise sequences of objects.
Content Generation and Processing
• LLMR (Language Model for Mixed Reality): Utilizes multiple
LLMs in mixed reality for generating and modifying 3D
scenes. It includes different GPT-based models for scene
analysis, code generation, and code inspection.
• HuatuoGPT & BioMistral: Tailored LLMs for medical and biomedical tasks,
adhering to privacy regulations, which can run on devices without an
internet connection.
Edge Inference and Privacy
• Mixture-of-Experts: Reduces inference costs by activating only a
subset of the model's parameters (experts) for each input. Examples
include GLaM (Google) and EdgeMoE, the latter extending this concept to
edge devices such as the Nvidia Jetson TX2 and Raspberry Pi 4B.
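A toy sketch of the Mixture-of-Experts idea: a gating network routes each token to its top-k experts, so only a fraction of the parameters is executed per input. The sizes, expert count, and routing details are illustrative and far simpler than systems like GLaM or EdgeMoE.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=4, k=1):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)  # router that scores experts per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                         # x: (tokens, dim)
        scores = self.gate(x).softmax(dim=-1)     # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e          # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64]); only 1 of 4 experts ran per token
```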
Bibliography
• Most of my study was based on this paper: https://arxiv.org/abs/2410.20011
• Other sources:
  • https://www.superannotate.com/blog/small-language-models#small-language-model-examples
  • https://medium.com/@nageshmashette32/small-language-models-slms-305597c9edf2
Thank You for Listening