BERT is a pre-trained language representation model built on the Transformer encoder. It is pre-trained with two unsupervised objectives: masked language modeling (MLM) and next sentence prediction (NSP). The pre-trained model can then be fine-tuned on downstream NLP tasks such as question answering and text classification by adding a small task-specific output layer. When fine-tuned on SQuAD, BERT achieved state-of-the-art results by feeding the final hidden states into a span-prediction layer that scores the start and end positions of the answer within the paragraph. Later work such as RoBERTa and ALBERT improved on BERT by modifying the pre-training procedure and model architecture.
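As a rough illustration of the SQuAD setup, the sketch below shows a minimal span-prediction head in PyTorch: each token's final hidden state is projected to a start logit and an end logit, and the highest-scoring positions give the predicted answer span. The class name, the hidden size of 768 (BERT-base), and the random hidden states are illustrative assumptions rather than BERT's actual implementation.

```python
import torch
import torch.nn as nn

class SpanPredictionHead(nn.Module):
    """Illustrative span-prediction head (not BERT's exact code): projects each
    token's final hidden state to a start logit and an end logit."""
    def __init__(self, hidden_size: int = 768):  # 768 = BERT-base hidden size
        super().__init__()
        # A single linear layer produces two scores per token: (start, end).
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size) from the final encoder layer
        logits = self.qa_outputs(hidden_states)            # (batch, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

# Stand-in for encoder output: 2 sequences of 16 tokens (random, for illustration only)
hidden = torch.randn(2, 16, 768)
head = SpanPredictionHead()
start_logits, end_logits = head(hidden)

# The predicted answer span is taken from the highest-scoring start and end positions.
start_idx = start_logits.argmax(dim=-1)
end_idx = end_logits.argmax(dim=-1)
```

In practice the start and end logits are trained with a cross-entropy loss against the gold answer span, and at inference time spans with end before start are filtered out; both details are omitted here for brevity.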