XLNet
Combining bidirectional context with autoregressive generation.
XLNet was designed to outperform BERT by combining the best of BERT (bidirectional context) with the best of GPT (native autoregressive generation), using a technique called Permutation Language Modeling.
Level 1 — Autoregressive + Bidirectional
BERT is pretrained with artificial [MASK] tokens that never appear in real downstream text, creating a mismatch between pretraining and fine-tuning. XLNet avoids [MASK] entirely by predicting words in a random order (a permutation of positions), which lets each prediction see surrounding words without corrupting the sentence.
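As a minimal sketch of mask-free prediction with the Hugging Face transformers library (the example sentence is illustrative): XLNetLMHeadModel accepts a perm_mask saying which positions may attend to which, and a target_mapping saying which positions to predict, so a token can be hidden and predicted without ever inserting a [MASK] placeholder into the input.

import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased')

# Encode without special tokens so the target is truly the last position
input_ids = tokenizer.encode("The capital of France is Paris",
                             add_special_tokens=False, return_tensors="pt")
seq_len = input_ids.shape[1]

# perm_mask[0, i, j] = 1.0 means position i may NOT attend to position j;
# here every position is blocked from seeing the last token
perm_mask = torch.zeros((1, seq_len, seq_len))
perm_mask[:, :, -1] = 1.0

# target_mapping marks the single position the model should predict
target_mapping = torch.zeros((1, 1, seq_len))
target_mapping[0, 0, -1] = 1.0

outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
predicted_id = int(outputs.logits[0, 0].argmax())
print(tokenizer.decode([predicted_id]))  # the model's guess for the hidden word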
Level 2 — Permutation Math
Instead of always predicting in the order 1-2-3-4, XLNet might train on the order 1-4-3-2. By the time it predicts word 3, it has already seen words 1 and 4, capturing context from both directions without needing the [MASK] placeholder.
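The following toy snippet (plain Python, no model involved) traces that factorization order and shows which positions are visible at each prediction step:

order = [1, 4, 3, 2]  # one sampled factorization order (random at training time)

seen = []
for pos in order:
    print(f"predict word {pos} given words {sorted(seen)}")
    seen.append(pos)
# when word 3 is predicted, words 1 and 4 are already visible

Running it prints that word 3 is conditioned on words [1, 4], i.e. context from both its left and its right, while the model as a whole still trains autoregressively.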
Level 3 — Long Dependency Modeling
XLNet incorporates Transformer-XL's segment-level recurrence and relative positional encodings, allowing it to maintain context across extremely long documents, whereas BERT is cut off at a fixed window of 512 tokens.
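Here is a hedged sketch of that recurrence, assuming the mems / use_mems arguments of the transformers XLNet implementation (the text and the 128-token chunk size are illustrative): a long document is fed in segments, and the cached states returned from each segment are passed forward to the next.

from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased')

# A toy document much longer than a single 512-token window
long_text = "XLNet carries context across segments. " * 300
ids = tokenizer.encode(long_text, add_special_tokens=False,
                       return_tensors="pt")

mems = None
for chunk in ids.split(128, dim=1):   # feed 128-token segments in order
    out = model(chunk, mems=mems, use_mems=True)
    mems = out.mems                   # cached states reused by the next chunk

print(out.last_hidden_state.shape)    # hidden states of the final chunk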
from transformers import XLNetTokenizer, XLNetModel

# Load the pretrained base model and its SentencePiece tokenizer
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased')

# Encode a sentence as PyTorch tensors and run a forward pass
inputs = tokenizer("XLNet is powerful for long text.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 768)
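For the base model, outputs.last_hidden_state is a tensor of shape (1, sequence_length, 768): one contextual embedding per token, which can be pooled or passed to a task-specific head for classification, tagging, or question answering.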