Master the Vision Transformer: ViT Practice Exam


martialdouble-t

Created 6/20/2024


Sources

https://arxiv.org/pdf/2010.11929

Think you know everything about Vision Transformers? Take this practice exam, based on the groundbreaking paper by Dosovitskiy et al., and test your knowledge!

1. What accuracy does ViT pre-trained on JFT-300M achieve on ImageNet-ReaL?

88.35%
90.72%
94.55%
99.61%

2. How many tasks are evaluated in the VTAB classification suite?

7
10
19
25

3. How large a dataset does ViT need to be pre-trained on to perform well?

300k images
1M images
10M images
300M images

4. Which task group in VTAB includes tasks like medical and satellite imagery?

Natural
Specialized
Structured
Complex

5. What is the effective resolution for fine-tuning ViT on VTAB tasks?

224x224
256x256
384x384
512x512

6. Which ViT model variant has the highest parameter count among Base, Large, and Huge?

ViT-B
ViT-L
ViT-H
ViT-M

7. How many TPUv3-core-days does pre-training ViT-L/16 take?

2.5k
0.68k
0.23k
9.9k

8. Which technique is applied to the pre-trained position embeddings when fine-tuning at a higher resolution?

Padding
Cropping
2D interpolation
Resizing

9. Which dataset did NOT show improvement from self-supervised pre-training?

ImageNet
VTAB
Oxford-IIIT-Pets
ObjectNet

10. What characteristic makes the Vision Transformer more favorable than ResNets for pre-training?

Less compute
More parameters
Higher resolution
Smaller patch size

11. Pre-training on which dataset yielded 99.68% accuracy on Oxford Flowers-102?

ImageNet-21k
JFT-300M
CIFAR-10
ImageNet

12. What was the primary motivation for applying a Transformer directly to sequences of image patches?

High GPU memory
Complex CNN structures
Limitations of CNNs
Large dataset availability
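For readers reviewing question 12: the paper's core idea is to split an image into fixed-size patches and feed their flattened pixel values to a standard Transformer as a sequence. A minimal sketch of that patchification step (the function name and NumPy-based layout are illustrative assumptions, not the paper's code):

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an H x W x C image into a sequence of flattened P x P patches."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image dims must be divisible by patch size"
    # Reshape into a (H/P, P, W/P, P, C) grid, group patch dims together,
    # then flatten each patch into a P*P*C vector.
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

img = np.zeros((224, 224, 3))
seq = patchify(img, 16)
print(seq.shape)  # (196, 768): 14*14 patches, each 16*16*3 = 768 values
```

Each of the 196 vectors would then be linearly projected and given a position embedding before entering the Transformer encoder.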