Test Your Transformers Knowledge: ML Practice Quiz


Sources

https://arxiv.org/pdf/1706.03762

Do you think you know everything about Transformers and Machine Learning? Put your knowledge to the test with our practice quiz based on the attention mechanism in the 'Attention is All You Need' paper!


1. What kind of model is the Transformer?

A model architecture eschewing recurrence
A recurrent neural network model

2. How many parallel attention layers, or heads, does the Transformer use?

8
16

3. What is the benefit of multi-head attention?

It allows the model to jointly attend to information from different representation subspaces
It performs a single attention function more efficiently

4. Which tasks has self-attention been used successfully in?

Reading comprehension, abstractive summarization, textual entailment
Image classification, video analysis, speech recognition

5. In encoder-decoder attention layers, where do the queries come from?

The previous decoder layer
The initial input sequence

6. What tasks have end-to-end memory networks performed well in?

Simple-language question answering and language modeling
Image classification and object detection

7. What are the values of d_k and d_v in each of the Transformer's attention heads?

64
32
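
For context on this question, here is a minimal NumPy sketch of the scaled dot-product attention used inside each head, with d_k = d_v = 64 as in the paper's base configuration. The function and variable names are illustrative, not taken from any library.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q and K have width d_k, V has width d_v (both 64 in the base setup)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scale dot products by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# Toy example with d_k = d_v = 64
Q = np.random.randn(10, 64)
K = np.random.randn(10, 64)
V = np.random.randn(10, 64)
out = scaled_dot_product_attention(Q, K, V)         # shape (10, 64)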

8. What does the local, restricted attention mechanism aim to handle efficiently?

Large inputs and outputs such as images, audio and video
Small and simple text sequences

9. How many GPUs were used to train the Transformer in the experiment mentioned?

8 P100 GPUs
4 P100 GPUs

10. On which translation task does the Transformer demonstrate its superiority over previous sequence transduction models?

WMT 2014 English-to-German translation task
WMT 2016 English-to-Japanese translation task

11. How does the Transformer handle sequence modeling differently compared to RNNs?

It relies entirely on an attention mechanism
It uses LSTM cells

12. How are the outputs of the individual attention heads combined in multi-head attention?

They are concatenated and projected
They are averaged
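
As background for this question, a minimal NumPy sketch of how the h = 8 head outputs are combined: each head produces a (seq_len, d_v) array, and these are concatenated and projected with an output matrix (W^O in the paper). Shapes and names below are illustrative.

import numpy as np

def combine_heads(head_outputs, W_O):
    # head_outputs: list of h arrays, each of shape (seq_len, d_v)
    concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, h * d_v)
    return concat @ W_O                             # project to (seq_len, d_model)

h, d_v, d_model, seq_len = 8, 64, 512, 10
heads = [np.random.randn(seq_len, d_v) for _ in range(h)]
W_O = np.random.randn(h * d_v, d_model)
out = combine_heads(heads, W_O)  # concatenated and projected: shape (10, 512)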

13. What BLEU score does the Transformer achieve on the WMT 2014 English-to-French translation task?

41.8
35.6

14. What is the main advantage of the Transformer over RNNs and CNNs?

More parallelizable and requires less training time
Higher accuracy in image classification

15. Which model's performance does the Transformer surpass on the WSJ training set of 40K sentences?

Berkeley-Parser
Stanford-Parser

16. What does multi-head attention allow the model to attend to when modeling sentences?

Different representation subspaces at different positions
Time series dependencies

17. What are the core computational units of the Transformer model?

Scaled dot-product attention and multi-head attention
Convolutions and pooling layers

18. In encoder-decoder attention, from where do the memory keys and values come?

The output of the encoder
Previous attention heads
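
To connect questions 5 and 18, here is a small illustrative NumPy sketch of encoder-decoder (cross) attention: the queries come from the previous decoder layer, while the memory keys and values come from the encoder output. All names and sizes below are made up for illustration.

import numpy as np

tgt_len, src_len, d = 7, 10, 64
decoder_state = np.random.randn(tgt_len, d)    # queries: from the previous decoder layer
encoder_output = np.random.randn(src_len, d)   # keys and values: from the encoder output

scores = decoder_state @ encoder_output.T / np.sqrt(d)   # (tgt_len, src_len)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)            # softmax over source positions
cross_attended = weights @ encoder_output                 # (tgt_len, d)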

19. Which layer's attention heads are shown in Figure 3 to follow long-distance dependencies?

Layer 5 of 6
Layer 6 of 6

20. What training-time advantage does the Transformer have?

Significantly faster than RNN/CNN-based architectures
Significantly slower