How do transformer networks encode linguistic knowledge?
Pre-trained transformer networks combine a neural self-attention mechanism with a training objective that is compatible with large unlabeled datasets. Models such as BERT and XLNet have followed this approach to achieve new state-of-the-art results across a wide range of natural language processing tasks. Exactly why these models are so successful, however, is not well understood. In this talk, I will present work that looks more closely at the internals of these models and investigates how they acquire and represent linguistic knowledge.
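To make the self-attention mechanism mentioned above concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The projection matrices `Wq`, `Wk`, `Wv` and the toy dimensions are illustrative assumptions, not part of any particular model such as BERT.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project input token vectors into query, key, and value spaces.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    # Scaled dot-product scores: how strongly each token attends to every other.
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax turns scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mixture of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                  # same shape as the input: (4, 8)
```

In a full transformer, many such heads run in parallel per layer, and it is the attention weights of these heads that much interpretability work inspects.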