---
title: "Audio Vision Transformer"
tags: [deep-learning, audio-classification, computer-vision, attention-mechanisms, transformers, mel-spectrograms, ComParE-2021]
excerpt: "Vision Transformer on spectrograms for audio classification, with data augmentation."
teaser: /figures/12_vision_transformer_teaser.jpg
venue: "Interspeech 2021"
---

This research explores the application of the **Vision Transformer (ViT)** architecture, originally designed for image processing, to the domain of audio classification by operating on **mel-spectrogram representations**.

The ViT's attention mechanisms offer a potentially powerful alternative to convolutional approaches for capturing relevant patterns in spectrogram data.

<CenteredImage
  src="/figures/12_vision_transformer_models.jpg"
  alt="Diagram illustrating the Vision Transformer architecture adapted for mel-spectrogram input"
  width={800}
  height={600}
  caption="Adapting the Vision Transformer architecture for processing mel-spectrograms."
  maxWidth="100%"
/>

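To make the adaptation concrete, here is a minimal sketch of the input pipeline: a waveform is converted to a log-mel-spectrogram, treated as a single-channel image, and passed to an off-the-shelf ViT. The choice of `torchaudio` and `timm`, all hyperparameter values, and the two-class head are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch: waveform -> log-mel-spectrogram -> ViT classifier.
# Libraries (torchaudio, timm) and every value below are illustrative choices,
# not the exact setup used in the paper.
import torch
import torchaudio
import timm

SAMPLE_RATE = 16_000
N_MELS = 128

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=256, n_mels=N_MELS
)
to_db = torchaudio.transforms.AmplitudeToDB()

# Dummy batch of 4-second clips: (batch, samples)
waveform = torch.randn(8, 4 * SAMPLE_RATE)

# (batch, n_mels, time) -> add a channel axis so each spectrogram is a 1-channel "image"
spec = to_db(mel(waveform)).unsqueeze(1)

# Resize to the ViT's expected resolution (224x224 assumed here)
spec = torch.nn.functional.interpolate(spec, size=(224, 224), mode="bilinear", align_corners=False)

# Single-channel ViT with a small classification head (e.g., a binary ComParE-style task)
model = timm.create_model("vit_base_patch16_224", pretrained=False, in_chans=1, num_classes=2)
logits = model(spec)  # (batch, num_classes)
```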
Key aspects of the methodology include:

* **ViT Adaptation:** Applying the ViT model directly to mel-spectrograms treated as images.
* **Data Augmentation:** Employing **mel-based data augmentation** techniques (e.g., SpecAugment variants) to improve model robustness and generalization (a minimal masking example is sketched after this list).
* **Sample Weighting:** Utilizing sample weighting strategies to address potential class imbalances or focus on specific aspects of the dataset.
* **Patching Strategy:** Introducing and evaluating an **overlapping vertical patching** method, potentially better suited for capturing temporal structures in spectrograms than standard non-overlapping patches (also sketched below).

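The mel-based augmentation can be illustrated with SpecAugment-style frequency and time masking applied directly to the spectrogram. The mask widths and the hand-rolled implementation below are assumptions for illustration; the paper's exact augmentation recipe may differ.

```python
# Sketch of SpecAugment-style masking on a mel-spectrogram batch (batch, 1, mels, time).
# Mask sizes and this hand-rolled implementation are illustrative assumptions.
import torch

def spec_augment(spec: torch.Tensor, max_freq_mask: int = 16, max_time_mask: int = 32) -> torch.Tensor:
    spec = spec.clone()
    _, _, n_mels, n_frames = spec.shape
    # Zero out one random band of mel bins (frequency mask)
    f = int(torch.randint(0, max_freq_mask + 1, (1,)))
    f0 = int(torch.randint(0, n_mels - f + 1, (1,)))
    spec[:, :, f0:f0 + f, :] = 0.0
    # Zero out one random span of frames (time mask)
    t = int(torch.randint(0, max_time_mask + 1, (1,)))
    t0 = int(torch.randint(0, n_frames - t + 1, (1,)))
    spec[:, :, :, t0:t0 + t] = 0.0
    return spec

augmented = spec_augment(torch.randn(8, 1, 128, 256))
```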
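The overlapping vertical patching can be pictured as full-height strips of the spectrogram that slide along the time axis with a stride smaller than the patch width, each strip being projected to one transformer token. Patch width, stride, and embedding size below are hypothetical values, not the paper's settings.

```python
# Sketch of overlapping *vertical* patches: each patch spans all mel bins over a
# short time window, and consecutive patches overlap along the time axis.
# Patch width, stride, and embedding size are hypothetical choices.
import torch
import torch.nn as nn

batch, n_mels, n_frames = 4, 128, 256
patch_width, stride = 16, 8            # stride < patch_width -> overlapping patches
embed_dim = 768

spec = torch.randn(batch, 1, n_mels, n_frames)   # (batch, channel, mels, time)

# Slide a full-height window along time: (batch, 1, mels, n_patches, patch_width)
patches = spec.unfold(dimension=3, size=patch_width, step=stride)
n_patches = patches.shape[3]

# Flatten each strip and project it to a transformer token embedding
patches = patches.permute(0, 3, 1, 2, 4).reshape(batch, n_patches, n_mels * patch_width)
to_token = nn.Linear(n_mels * patch_width, embed_dim)
tokens = to_token(patches)                        # (batch, n_patches, embed_dim)
# `tokens` would then receive positional embeddings and go through the ViT encoder.
```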
The effectiveness of this "Mel-Vision Transformer" approach was demonstrated within the context of the **ComParE 2021 (Computational Paralinguistics Challenge)**. The proposed model achieved notable performance, **surpassing many established single-model baseline results** on the challenge tasks.

Furthermore, the study includes an analysis of different parameter configurations and architectural choices, providing insights into optimizing ViT models for audio processing tasks.

This work showcases the adaptability and potential of transformer architectures, particularly ViT, for effectively tackling audio classification challenges. <Cite bibtexKey="illium2021visual" />