Application of transformer architecture to computer vision
The Vision Transformer (ViT) is a model architecture introduced by Google researchers in 2020 that applies the transformer's self-attention mechanism to image recognition by splitting images into fixed-size patches and treating them as tokens. It demonstrated that transformers could match or exceed convolutional neural networks on image classification benchmarks, extending the transformer's reach beyond natural language processing.
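The patch-to-token step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation: the sizes (a 224×224 RGB image, 16×16 patches, a 768-dimensional embedding) follow the commonly cited ViT-Base configuration, and the random weights stand in for learned parameters.

```python
import numpy as np

# Assumed ViT-Base-style sizes: a 224x224 RGB image cut into 16x16 patches
# yields 14*14 = 196 patch tokens; each is embedded into a 768-dim vector.
image = np.random.rand(224, 224, 3)
patch = 16
dim = 768

# Split the image into non-overlapping 16x16 patches and flatten each one.
h, w, c = image.shape
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
# patches.shape == (196, 768): 196 tokens of 16*16*3 raw pixel values each

# A linear projection (learned in the real model) maps each flattened
# patch to the model's embedding dimension.
W = np.random.rand(patch * patch * c, dim)
tokens = patches @ W

# Prepend a classification token and add positional embeddings (both
# learned in the real model); the result is fed to a standard
# transformer encoder, exactly as a token sequence would be in NLP.
cls_token = np.zeros((1, dim))
pos_embed = np.random.rand(tokens.shape[0] + 1, dim)
sequence = np.concatenate([cls_token, tokens]) + pos_embed
print(sequence.shape)  # (197, 768)
```

Because the rest of the model is an unmodified transformer encoder, this patchification front end is the main architectural change ViT makes relative to its NLP counterparts.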