Vision Transformers: The end of convolutional neural networks?
Convolutional neural networks have dominated computer vision research for more than a decade. However, the emergence of new topologies, such as Vision Transformers, has driven the development of techniques that improve efficiency and performance on tasks such as classification, object detection, and semantic image segmentation.
Convolutional neural network explained
Convolutional neural networks are a type of artificial neural network in which neurons correspond to receptive fields, much like those of the primary visual cortex of the human brain.
These networks learn features at different levels of abstraction: early layers capture colors, edges, and simple shapes; intermediate layers combine them; and the final layers attend to the overall form to determine what the object actually is.
How do convolutional neural networks work?
Convolutional neural networks are made up of layers specialized in the convolution operation. In computer vision, this operation makes it possible to learn local patterns within small two-dimensional windows.
Another essential feature is that convolutional layers can learn spatial hierarchies of patterns while preserving spatial relationships. For example, a first convolutional layer can learn basic elements such as edges, and a second convolutional layer can learn patterns composed of the basic elements learned in the previous layer.
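The sliding-window idea above can be sketched in a few lines. This is a minimal, illustrative implementation (the image, the Sobel-like kernel, and all sizes are toy examples, not taken from any particular network): a small kernel is slid over the image, and the response is strongest where the local pattern it encodes, here a vertical edge, is present.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel over the image and
    take a dot product inside each small local window."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy "image": dark left half, bright right half (a vertical edge).
image = np.zeros((5, 5))
image[:, 2:] = 1.0

# A Sobel-like vertical-edge detector -- the kind of basic pattern
# a first convolutional layer might learn.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d(image, kernel)
# The response is large over the edge and zero in flat regions.
```

A deeper layer would then apply the same operation to `response` itself, composing edge detections into more complex patterns, which is exactly the spatial hierarchy described above.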
As the image above shows, a neural network can be composed of several convolutional layers, each learning a different level of abstraction, with the overall goal of efficiently understanding increasingly complex visual concepts.
One of the major breakthroughs that emerged during the past year was the inclusion of new neural network topologies within computer vision. These new topologies have revolutionized the conventional way convolutional neural networks are used to solve typical computer vision tasks such as classification, object detection, and image segmentation.
The arrival of Transformers (which until then had been used only for natural language processing tasks) in the field of computer vision significantly improved the ability of these models to extract image features, and with it their accuracy on the corresponding ImageNet benchmarks.
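To apply a Transformer to an image at all, the image must first be turned into a sequence of tokens. A minimal sketch of the patch-embedding step used by Vision Transformers follows; the image size, patch size, and embedding dimension are arbitrary toy values, and `w_proj` stands in for a learned projection matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

def image_to_patch_embeddings(image, patch_size, w_proj):
    """Split an image into non-overlapping patches, flatten each patch,
    and linearly project it into an embedding, so that the resulting
    sequence of tokens can be fed to a Transformer encoder."""
    h, w = image.shape
    patches = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size].ravel())
    patches = np.stack(patches)   # (num_patches, patch_size ** 2)
    return patches @ w_proj       # (num_patches, embed_dim)

image = rng.standard_normal((8, 8))     # toy 8x8 single-channel "image"
w_proj = rng.standard_normal((16, 32))  # 4x4 patch -> 32-dim embedding
tokens = image_to_patch_embeddings(image, 4, w_proj)
# tokens has shape (4, 32): four patch tokens for the Transformer.
```

In a real Vision Transformer the patches also receive positional embeddings before entering the encoder, which is precisely how these models recover the spatial information that plain convolutions handle implicitly.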
Future of convolutional neural networks
For a convolutional neural network, both images look almost the same, since convolutional neural networks do not encode the relative positions of different features. Massive filters would be required to encode such information, one per spatial arrangement: for example, "eyes on top of the nose" at every possible position and scale, and so on.
The attention mechanism that Transformers bring to convolutional neural networks helps model long-range dependencies without compromising computational or statistical efficiency.
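The reason attention captures long-range dependencies is visible in its formula: every token computes an affinity with every other token in a single step, so two distant image patches can interact directly. A minimal numpy sketch of scaled dot-product attention (with toy token counts and dimensions, and self-attention so queries, keys, and values all come from the same tokens):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V.
    Each row of the weight matrix says how much one token attends
    to every other token, regardless of spatial distance."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (n, n) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))   # six patch tokens, 8 dimensions each
out, attn = scaled_dot_product_attention(x, x, x)
# Each row of attn sums to 1: a distribution over all six tokens.
```

Contrast this with a convolution, where a token can only see its immediate neighbors and long-range interactions require stacking many layers.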
Since the appearance of the first Vision Transformers, research groups at universities and large companies have been experimenting with hybrid architectures, such as "ConvMixers," that combine the attention mechanism with the feature-extraction strengths of convolutional neural networks.
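The core idea behind ConvMixer-style blocks is to separate spatial mixing (a depthwise convolution, one kernel per channel) from channel mixing (a pointwise, 1x1 convolution). The sketch below is a simplified, illustrative version of such a block, not the published ConvMixer architecture: all sizes are toy values, and normalization layers are omitted for brevity.

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Spatial mixing: convolve each channel with its own k x k kernel
    ('same' padding), as in a depthwise convolution."""
    c, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = np.sum(xp[ch, i:i + k, j:j + k] * kernels[ch])
    return out

def pointwise_conv(x, weights):
    """Channel mixing: a 1x1 convolution is just a matrix product
    across the channel dimension."""
    c, h, w = x.shape
    return (weights @ x.reshape(c, -1)).reshape(weights.shape[0], h, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))   # (channels, height, width), toy sizes
dw = rng.standard_normal((4, 3, 3))  # one 3x3 kernel per channel
pw = rng.standard_normal((4, 4))     # 1x1 conv weights

# One simplified mixer block: depthwise conv + residual + ReLU, then pointwise conv.
y = pointwise_conv(np.maximum(depthwise_conv(x, dw) + x, 0.0), pw)
```

Because the expensive spatial step touches each channel independently, these blocks stay cheap while still mixing information across the whole feature map, one reason this family of designs has proved competitive with attention-based models.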
The different network topologies that have emerged in recent years have led many research groups to revisit their existing techniques and focus on improving the efficiency and accuracy of convolutional neural networks. So much so that earlier this year a new topology called ConvNeXt, presented by Facebook AI Research, surpassed Vision Transformer architectures using only convolutional layers.
Undoubtedly, during this year, we will see many more neural network topologies that could once again revolutionize the paradigm of computer vision as we knew it until now.
If you want to know more about Vision Transformers, don’t miss our talks on the latest developments in the field of Artificial Intelligence!