This document summarizes the OpenAI paper "Scaling Laws for Neural Language Models" (Kaplan et al., 2020). Some key findings of the paper include:
- Language model performance depends strongly on model scale and only weakly on model shape. Across wide ranges, the test loss scales as a power law in the number of parameters, the amount of compute, and the dataset size (see the fitting sketch after this list).
- Overfitting follows a universal trend, with the penalty depending predictably on the ratio of model size to dataset size.
- Large models are more sample-efficient, reaching the same performance with fewer optimization steps and fewer data points.
- The paper motivated subsequent work by OpenAI on applying scaling laws to other domains like computer vision and developing increasingly large language models like GPT-3.
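As a concrete illustration of the power-law claim in the first bullet, here is a minimal sketch of fitting the paper's form L(N) = (N_c / N)^alpha_N to loss measurements. The data points, and therefore the fitted constants, are synthetic placeholders for this sketch, not values from the paper.

```python
import numpy as np

# Power-law form from the paper: L(N) = (N_c / N) ** alpha_N,
# so log L = alpha_N * log N_c - alpha_N * log N, i.e. linear in log N.
n = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # non-embedding parameter counts (synthetic)
loss = np.array([5.2, 4.1, 3.3, 2.6, 2.1])  # measured test losses (synthetic)

# Fit a line in log-log space; the slope gives the exponent.
slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
alpha_n = -slope                    # power-law exponent
n_c = np.exp(intercept / alpha_n)   # scale constant

print(f"alpha_N ~ {alpha_n:.3f}, N_c ~ {n_c:.2e}")
print("extrapolated loss at 1e11 params:", (n_c / 1e11) ** alpha_n)
```

The same fitting procedure applies to the compute and dataset-size power laws, with N replaced by C or D.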
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image (a minimal sketch follows this list).
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
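To make the masked-prediction idea in item 1 concrete, here is a minimal PyTorch sketch in the spirit of MAE/BEiT-style pretraining. The encoder, masking ratio, and raw-pixel regression target are illustrative assumptions, not any single paper's recipe.

```python
import torch
import torch.nn as nn

class MaskedPatchPredictor(nn.Module):
    """Toy masked-prediction objective: regress raw pixels of masked patches."""
    def __init__(self, encoder: nn.Module, dim: int, patch_pixels: int):
        super().__init__()
        self.encoder = encoder                    # any ViT-style encoder: (B, N, dim) -> (B, N, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, patch_pixels)  # predicts flattened patch pixels

    def forward(self, patch_embeds, patch_pixels_gt, mask_ratio=0.75):
        B, N, D = patch_embeds.shape
        mask = torch.rand(B, N, device=patch_embeds.device) < mask_ratio
        # Replace masked patch embeddings with a learned mask token.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, D), patch_embeds)
        pred = self.head(self.encoder(x))         # (B, N, patch_pixels)
        # Compute the reconstruction loss only on masked positions.
        return ((pred - patch_pixels_gt) ** 2).mean(-1)[mask].mean()

# Usage with a generic stand-in encoder (not a real ViT):
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True), num_layers=2)
model = MaskedPatchPredictor(encoder, dim=192, patch_pixels=16 * 16 * 3)
loss = model(torch.randn(8, 196, 192), torch.randn(8, 196, 768))
```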
This document discusses the relationship between control as inference, reinforcement learning, and active inference. It provides an overview of key concepts such as Markov decision processes (MDPs), partially observable MDPs (POMDPs), optimality variables, the evidence lower bound (ELBO), variational inference, and the free energy principle as applied to active inference. Control as inference frames reinforcement learning as probabilistic inference: it defines a generative model over trajectories and optimality variables, then performs variational inference to find an optimal policy. Active inference applies the free energy principle, selecting actions that minimize expected free energy and thereby resolve uncertainty.
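For reference, the standard ELBO used in control as inference (e.g., Levine, 2018), assuming the usual convention of binary optimality variables with likelihood p(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t)):

```latex
% Optimality likelihood (assumed convention): p(O_t = 1 \mid s_t, a_t) = \exp(r(s_t, a_t))
\log p(O_{1:T} = 1)
  \;\ge\; \mathbb{E}_{q(s_{1:T},\, a_{1:T})}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right]
  \;-\; D_{\mathrm{KL}}\!\left(q(s_{1:T},\, a_{1:T}) \,\|\, p(s_{1:T},\, a_{1:T})\right)
```

Maximizing the right-hand side over the variational trajectory distribution q recovers a maximum-entropy RL objective; active inference instead scores policies by their expected free energy.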
[DL Reading Group] Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation (Deep Learning JP)
The paper proposes modifications to self-attention that preserve faithful signal propagation in deep Transformers without shortcuts such as skip connections or layer normalization. Specifically, it covers normalization-free networks that rely on dynamical isometry (keeping layer transformations close to unitary), a ReZero-style technique that scales each residual branch by a learnable scalar initialized to zero, and modified attention and normalization schemes that counteract rank collapse in deep Transformers. The methods are evaluated on tasks such as CIFAR-10 classification and language modeling, demonstrating improved performance over standard Transformer architectures.
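For context on the ReZero-style component mentioned above, here is a minimal sketch of a residual block gated by a learnable scalar initialized to zero, so each block starts as the identity map. The surrounding block structure is a generic Transformer layer, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Transformer block with ReZero-style gating: y = x + alpha * F(x),
    where alpha is a learnable scalar initialized to 0 (no LayerNorm used)."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))  # zero init: block is the identity at start

    def forward(self, x):
        # Each residual branch is scaled by alpha, so signals propagate
        # unchanged at initialization and the branches are learned gradually.
        x = x + self.alpha * self.attn(x, x, x, need_weights=False)[0]
        x = x + self.alpha * self.mlp(x)
        return x
```

Because alpha starts at zero, the network is perfectly signal-preserving at initialization, which is the property the shortcut-free modifications in the paper aim to retain by other means.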