RoBERTa
In 2019, Facebook AI researchers showed that BERT was significantly "under-trained" and released RoBERTa (Robustly Optimized BERT Pretraining Approach). Using the exact same architecture as BERT, it achieved much better results simply by training it better.
Level 1 — More Data, More Training
RoBERTa was trained on 160GB of text (vs BERT's 16GB) and for much longer. It's the "Bodybuilder" version of BERT.
Level 2 — The Optimization Secret
RoBERTa made three major training changes:
- Dynamic Masking: Words are masked differently every time the model sees the sentence (see the sketch after this list).
- Removed NSP: Researchers found that "Next Sentence Prediction" didn't actually help.
- Larger Batches: Training on massive batches of data improved stability.
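The dynamic masking change is the easiest to see in code. Below is a minimal sketch using Hugging Face's DataCollatorForLanguageModeling, which re-samples the masked positions every time it builds a batch; the roberta-base checkpoint and the 15% masking rate are standard defaults chosen here for illustration, not details from this section.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# The collator chooses which tokens to mask at batch-creation time,
# so the same sentence is masked differently on every pass (dynamic masking).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoding = tokenizer("RoBERTa masks tokens on the fly during training.")

for _ in range(3):
    batch = collator([{"input_ids": encoding["input_ids"]}])
    print(tokenizer.decode(batch["input_ids"][0]))  # different <mask> positions each run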
Level 3 — When to use RoBERTa?
If you need an encoder for classification, NER, or similarity and have enough GPU memory, RoBERTa is almost always a better choice than BERT of the same size, and RoBERTa-Large is the default pick when memory allows. For example, here is a ready-made RoBERTa checkpoint fine-tuned for tweet sentiment:
from transformers import pipeline
# RoBERTa fine-tuned on sentiment
classifier = pipeline("sentiment-analysis",
model="cardiffnlp/twitter-roberta-base-sentiment")
result = classifier("I love this tutorial!")
print(result)
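The pipeline above covers classification; for similarity, the same encoder can produce sentence embeddings. The snippet below is a minimal sketch assuming roberta-base (swap in roberta-large if memory allows) and simple mean pooling; raw RoBERTa embeddings are not trained specifically for similarity, so a fine-tuned sentence encoder will usually score better.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)         # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling

a = embed("I love this tutorial!")
b = embed("This guide is fantastic.")
print(torch.cosine_similarity(a, b).item())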