============================================================================
REVIEWER #1
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------

-- Detailed Review --

The authors, as participants in the ComParE21 challenge, introduce a vision transformer, a deep machine-learning model built around the attention mechanism, applied to mel-spectrogram representations of raw audio recordings. For model development they used two ComParE21 datasets, one multi-class classification task and one binary: the Primates Sub-Challenge and the COVID-19 Cough Sub-Challenge. They implemented different augmentation techniques to optimise the developed models. Moreover, three different model architectures were used and compared. In the end, they achieved comparable performance on both tasks, the Primates Sub-Challenge and the COVID-19 Cough Sub-Challenge of ComParE21, outperforming most single-model baselines.

-- Key Strength of the paper --

Different audio data augmentation techniques (mel-spectrogram augmentation, shift augmentation, noise augmentation, SpecAugment, loudness augmentation) were introduced and evaluated on the ComParE21 datasets.

-- Main Weakness of the paper --

The study presented is useful for researchers dealing with the recognition of non-verbal content of speech, but I must note that the work is not yet complete. The authors themselves write in Table 1 that "Missing test results and further searches will be included in the camera-ready version of the paper." Furthermore, the conclusion is missing from the article.

-- Novelty/Originality, taking into account the relevance of the work for the Interspeech audience --

The introduction and evaluation of the presented augmentation techniques will be useful for researchers dealing with audio deep-learning techniques, as these techniques improve the results in classification tasks.

-- Technical Correctness, taking into account datasets, baselines, experimental design, are enough details provided to be able to reproduce the experiments? --

There are some small mistakes: the dimensions (units) for duration and density are missing from Figure 1. What does the second sentence of the caption of Figure 2 mean? It appears to be truncated: "Figure 2: Model architectures as introduced and described in section 4. If not stated otherwise, model"

-- Quality of References, is it a good mix of older and newer papers? Do the authors show a good grasp of the current state of the literature? Do they also cite other papers apart from their own work? --

-- Clarity of Presentation, the English does not need to be flawless, but the text should be understandable --

---------------------------------------------------------------------------

============================================================================
REVIEWER #2
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------

-- Detailed Review --

The authors have applied a transformer-based neural network framework, inspired by the image classification literature, to two tasks of the ComParE 2021 challenge: primate classification and COVID-19 detection. They implement a Vision Transformer (ViT) by interpreting a mel-spectrogram as a sequence of (square) patches across time and frequency.
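To make this patching concrete, here is a minimal numpy sketch of the square tokenisation described above, together with the vertical-strip variant (VViT) discussed next. This is an illustrative reconstruction, not the authors' implementation: the 16x16 patch size and 128 mel bins are assumptions, and the real models would project each flat token through a learned linear embedding before the transformer.

import numpy as np

def patchify(mel: np.ndarray, patch: int = 16, vertical: bool = False) -> np.ndarray:
    """Cut a mel-spectrogram (n_mels x n_frames) into a sequence of flat tokens.

    vertical=False: square patch x patch tiles in row-major order (ViT-style).
    vertical=True:  full-height strips of `patch` frames each (VViT-style),
                    which keeps the natural temporal order of the frames.
    """
    n_mels, n_frames = mel.shape
    n_frames -= n_frames % patch  # drop the ragged tail of frames
    if vertical:
        strips = mel[:, :n_frames].reshape(n_mels, -1, patch)
        return strips.transpose(1, 0, 2).reshape(-1, n_mels * patch)
    n_mels -= n_mels % patch      # also trim mel bins for square tiles
    tiles = mel[:n_mels, :n_frames].reshape(
        n_mels // patch, patch, n_frames // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

# Toy input: 128 mel bins x 160 frames (both sizes are illustrative).
mel = np.random.rand(128, 160)
print(patchify(mel).shape)                 # (80, 256): 8 x 10 square patches
print(patchify(mel, vertical=True).shape)  # (10, 2048): 10 vertical strips

Note how the vertical strips form one token per time step of `patch` frames, whereas the square patches interleave frequency and time in the token order.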
They also consider a variant of the transformer, where they interpret the mel-spectrogram as a sequence of vertical patches across time. They present a brief analysis of the given training and dev data, and find that there are class imbalances. They therefore apply various standard data augmentation techniques to increase the amount of data as well as to make it more balanced.

-- Key Strength of the paper --

The implementation of the transformer-based visual framework for the two audio classification tasks, along with careful observations about the data and data augmentation, resulting in a strong system with high accuracy, is the key strength of the paper. The authors have presented their observations and implementations through clear and intuitive diagrams and plots.

-- Main Weakness of the paper --

My main criticism is the lack of insight provided about the results achieved. Why does the same architecture perform well for primate sound classification but not for cough-sound COVID-19 classification? The transformers can clearly learn well from the data, as shown earlier in the NLP and image domains, so what more can be done for the cough sound classification task, other than increasing the amount of data? Why didn't the vertical ViT work as well as the ViT, even though the authors mention that, compared to the ViT, the VViT captures more information while respecting the natural temporal order of the spectrogram?

-- Novelty/Originality, taking into account the relevance of the work for the Interspeech audience --

Applying the ViT to the tasks of primate sound classification and COVID-19 detection is novel.

-- Technical Correctness, taking into account datasets, baselines, experimental design, are enough details provided to be able to reproduce the experiments? --

Technically, the paper looks correct.

-- Quality of References, is it a good mix of older and newer papers? Do the authors show a good grasp of the current state of the literature? Do they also cite other papers apart from their own work? --

Yes.

-- Clarity of Presentation, the English does not need to be flawless, but the text should be understandable --

Clear enough.

---------------------------------------------------------------------------

============================================================================
REVIEWER #3
============================================================================

Detailed Comments for Authors
---------------------------------------------------------------------------

-- Detailed Review --

This paper applies and evaluates a method recently proposed for image processing to two ComParE 2021 sub-challenges by interpreting audio samples as mel-scaled spectral images. The approach is compared to other image-based approaches. The results are OK, but in my opinion the authors are a bit optimistic in the analysis. These are some additional comments:

- Fig. 1b/1e show different length distributions for the test set compared to the corresponding Fig. 1c/1f. Is this just a rounding effect, or am I missing something here?
- I suppose you crop all audios to a fixed length, right? I couldn't find any comment on this, neither on the specific length nor on whether this cropping differs between the PRS and the CCS challenges, given the quite different length distributions shown in Figure 1.
- You could probably simplify/reduce Fig. 1 and Section 3.1, leaving a bit more space for details of the systems in Section 4 and for Figure 4, which is impossible to read on printed paper.
- Which task does the analysis in Figure 4 refer to, CCS or PRS?
- In the Results section: "Depending on the task, the Transformer-style approaches (ViT & VViT) then overtake both (ComParE and our CNN baselines) by up to 15% in UAR (c.p. Eq. 2)". In my opinion, this is a too optimistic interpretation of the results. There is only one case in which this happens; in the remaining cases, your method outperforms all the baselines on dev, but by a much more moderate margin. I think this is already a pretty solid result (outperforming all baselines), and focusing the analysis on a specific weak baseline system can be slightly misleading.

-- Key Strength of the paper --

The application of a new vision transformer to speech classification tasks, both binary and multi-class.

-- Main Weakness of the paper --

Some figures are difficult to read. Section 3.1 and Fig. 1 could probably be reduced to make room for more details on the methods and the analysis of the results.

-- Novelty/Originality, taking into account the relevance of the work for the Interspeech audience --

As far as I understand, this is probably the first application of this model to a speech classification task. I would say that this is novelty enough, especially considering that it is a challenge system description paper.

-- Technical Correctness, taking into account datasets, baselines, experimental design, are enough details provided to be able to reproduce the experiments? --

There are details of the systems that could be difficult to reproduce.

-- Quality of References, is it a good mix of older and newer papers? Do the authors show a good grasp of the current state of the literature? Do they also cite other papers apart from their own work? --

References OK.

-- Clarity of Presentation, the English does not need to be flawless, but the text should be understandable --

Good. There are some minor typos: "...noices...", "crowed-source", "negativ", and "positiv".

---------------------------------------------------------------------------
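For reference on the UAR margins Reviewer 3 discusses: UAR is the unweighted average recall, the standard ComParE metric (presumably what Eq. 2 of the paper defines), i.e. the recall of each of the N classes averaged without weighting by class frequency,

    \mathrm{UAR} = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i},

so its chance level is 1/N regardless of how imbalanced the classes are.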