Application of transformer architecture to computer vision
The Vision Transformer (ViT) is a model architecture introduced by Google researchers in 2020 that applies the transformer's self-attention mechanism to image recognition by splitting images into fixed-size patches and treating them as tokens. It demonstrated that transformers could match or exceed convolutional neural networks on image classification benchmarks, extending the transformer's reach beyond natural language processing.
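The patch-to-token step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation: the sizes (a 224×224 RGB image, 16×16 patches, a 768-dimensional embedding) follow the commonly cited ViT-Base configuration, and the random weights stand in for learned parameters.

```python
import numpy as np

# Assumed ViT-Base-style sizes: a 224x224 RGB image cut into 16x16 patches
# yields 14*14 = 196 patch tokens; each is embedded into a 768-dim vector.
image = np.random.rand(224, 224, 3)
patch = 16
dim = 768

# Split the image into non-overlapping 16x16 patches and flatten each one.
h, w, c = image.shape
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
# patches.shape == (196, 768): 196 tokens of 16*16*3 raw pixel values each

# A linear projection (learned in the real model) maps each flattened
# patch to the model's embedding dimension.
W = np.random.rand(patch * patch * c, dim)
tokens = patches @ W

# Prepend a classification token and add positional embeddings (both
# learned in the real model); the result is fed to a standard
# transformer encoder, exactly as a token sequence would be in NLP.
cls_token = np.zeros((1, dim))
pos_embed = np.random.rand(tokens.shape[0] + 1, dim)
sequence = np.concatenate([cls_token, tokens]) + pos_embed
print(sequence.shape)  # (197, 768)
```

Because the rest of the model is an unmodified transformer encoder, this patchification front end is the main architectural change ViT makes relative to its NLP counterparts.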