Dissecting the wav2vec2 architecture

I’ve been working with the wav2vec2 architecture for the last couple of months as part of my research project. In this blog post I want to explain every component of the architecture in a somewhat less dense format than the original paper by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed and Michael Auli.

What is the wav2vec2 architecture?

wav2vec2 takes raw waveform (wav) audio as input, relies on self-supervised pretraining so that less labelled data is needed, and is potentially fine-tunable for a diverse set of speech tasks. For now it is mostly used for speech recognition, but there are already a few papers which apply the architecture to other tasks.
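As a concrete taste of the speech recognition use case, here is a short demo that assumes you use the Hugging Face transformers implementation of wav2vec2 (the library itself is not the topic of this post, it just keeps the example brief) together with a checkpoint fine-tuned for English transcription:

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# A wav2vec2 checkpoint fine-tuned for English speech recognition.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# In practice this would be a real 16 kHz recording (loaded with e.g.
# torchaudio or librosa); a silent dummy signal keeps the example self-contained.
audio = np.zeros(16000, dtype=np.float32)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits   # (batch, frames, vocabulary)

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))     # greedy transcription
```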

A forward pass

The wav2vec2 model has 4 conceptual components:

  1. A feature encoder which is responsible for transforming an audio signal into a sequence of fixed-size feature vectors. Each feature vector represents only a short segment of the audio signal.
  2. A feature projector which alters the dimension of the feature vectors computed by the feature encoder.
  3. A Transformer which applies self-attention to the sequence of (projected) feature vectors and thereby produces a sequence of “wav2vec2 embeddings” of the same length.
  4. A task head which makes use of the sequence of “wav2vec2 embeddings” to solve a particular speech task.

Don’t worry if these descriptions don’t make a lot of sense yet. I will go into more detail in the individual sections below. For now, it’s important to understand the general data flow in the model.
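To make that data flow concrete, here is a minimal PyTorch sketch that chains the four components together. This is not the actual wav2vec2 implementation (the real model uses a deeper convolution stack plus additional normalization, dropout and positional information); all layer counts and dimensions below are placeholders I picked purely for illustration.

```python
import torch
import torch.nn as nn

class TinyWav2Vec2Sketch(nn.Module):
    def __init__(self, conv_dim=512, model_dim=768, vocab_size=32):
        super().__init__()
        # 1. Feature encoder: 1D convolutions turn the raw waveform into a
        #    (much shorter) sequence of fixed-size feature vectors.
        self.feature_encoder = nn.Sequential(
            nn.Conv1d(1, conv_dim, kernel_size=10, stride=5),
            nn.GELU(),
            nn.Conv1d(conv_dim, conv_dim, kernel_size=3, stride=2),
            nn.GELU(),
        )
        # 2. Feature projection: changes the dimension of each feature vector
        #    to the dimension the Transformer works with.
        self.feature_projection = nn.Linear(conv_dim, model_dim)
        # 3. Transformer: self-attention over the sequence of projected
        #    features, producing one "wav2vec2 embedding" per position.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=8, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # 4. Task head: e.g. a linear layer over output tokens for speech recognition.
        self.task_head = nn.Linear(model_dim, vocab_size)

    def forward(self, waveform):                            # (batch, samples)
        x = self.feature_encoder(waveform.unsqueeze(1))     # (batch, conv_dim, frames)
        x = self.feature_projection(x.transpose(1, 2))      # (batch, frames, model_dim)
        x = self.transformer(x)                             # (batch, frames, model_dim)
        return self.task_head(x)                            # (batch, frames, vocab_size)

model = TinyWav2Vec2Sketch()
logits = model(torch.randn(1, 16000))  # one second of 16 kHz audio
print(logits.shape)
```

The exact layers are not the point here; the point is the shape of the data at each step: a batch of raw samples goes in, the feature encoder turns it into a shorter sequence of feature vectors, the projection and Transformer produce one embedding per position, and the task head maps those embeddings to task-specific predictions.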

The feature encoder

The feature projection

The Transformer

Task head for speech recognition

Randomness during training

Self-supervised pretraining