The most obvious distinction between GPT-3 and BERT is their architecture. As mentioned above, GPT-3 is an autoregressive model, while BERT is bidirectional: GPT-3 takes into account only the context to the left when making predictions, whereas BERT considers both the left and right context. Recent advances in NLP model architecture have led to innovative ideas such as the BERT architecture.
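The left-context-only versus full-context distinction comes down to the attention mask. A minimal sketch in NumPy (the function names here are illustrative, not from either model's codebase):

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular mask: position i may attend only to positions <= i,
    # i.e. the left context, as in autoregressive models like GPT-3.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len):
    # Full mask: every position attends to every position, left and
    # right context alike, as in BERT.
    return np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask(3).astype(int))
print(bidirectional_mask(3).astype(int))
```

In practice these masks are applied to the attention scores before the softmax, so disallowed positions receive effectively zero weight.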
Among these ideas, a few have revolutionized the way we build new models: large-scale pre-trained language models such as OpenAI GPT and BERT, and deep contextualized word representations. For instance, in the article on GPT, I learned that GPT is a decoder-only model, meaning it consists solely of transformer decoder blocks.
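A decoder-only model is just a stack of such blocks, with no encoder and no cross-attention. The toy sketch below (random placeholder weights, no layer norm or learned projections) shows the overall shape of one masked self-attention block and how stacking preserves the sequence dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decoder_block(x):
    # One simplified decoder block: masked self-attention followed by a
    # toy feed-forward layer with a residual connection. Real blocks add
    # layer norm and learned Q/K/V projections.
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(mask, scores, -1e9)   # left context only
    attn = softmax(scores) @ x
    w = rng.standard_normal((d, d)) * 0.1   # placeholder weights
    return attn + np.tanh(attn @ w)

# Stacking blocks: input and output keep the same (seq_len, d) shape.
x = rng.standard_normal((4, 8))
for _ in range(3):
    x = decoder_block(x)
print(x.shape)  # (4, 8)
```

Since each block maps a `(seq_len, d)` array to another `(seq_len, d)` array, depth is added simply by stacking more blocks.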