Acoustic scene classification

"Darling, where are you?" may sound a bit catchy but it describes well what acoustic scene classification is about.

When interacting with mobile devices we expect relevant information to be presented with a minimum of input effort. What is relevant depends on the context in which we are acting.

If we request route information while sitting at a bus stop, we are most probably looking for a bus connection, while at a railway station we are most probably looking for a train connection.

One possibility for a device to identify the context is via geolocation information. But this information may not be available inside buildings. An alternative approach is the analysis of ambient noise. This approach is referred to by the term acoustic scene classification.

Acoustic scene classification (ASC) describes the "capability of a human or an artificial system to understand an audio context, either from an on-line stream or from a recording."

This blog post demonstrates how convolutional neural networks can be used to identify the setting in which an audio file was recorded. We will apply a pre-trained VGG-16 network with a custom classifier to log-frequency power spectrograms.

Data analysis and preparation

This project uses recordings made available as part of the DCASE (Detection and Classification of Acoustic Scenes and Events) 2019 challenge. The TAU Urban Acoustic Scenes 2019 development dataset contains recordings in 10 different settings (airport, bus, metro, metro station, park, public square, indoor shopping mall, pedestrian street, street traffic, tram) recorded in 10 cities. Each recording is 10 seconds long. The data files can be downloaded from

There are a total of 14400 recordings with 1440 recordings for each of the 10 settings.

Here is a sample (street_pedestrian-lyon-1162-44650-a.wav).


To analyze the audio files we can transform them into spectrograms. These show the frequency distribution for successive short time intervals.
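To make the idea concrete, here is a minimal sketch of a power spectrogram computed with a naive short-time Fourier transform in plain Python. The function name `spectrogram` and the parameters `frame_len` and `hop` are chosen for this illustration only; a real project would typically rely on an audio library and much longer signals.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Return a power spectrogram: one list of frame_len // 2 + 1 powers per frame."""
    # A Hann window reduces spectral leakage at the frame borders.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        powers = []
        for k in range(frame_len // 2 + 1):
            # Naive discrete Fourier transform for frequency bin k.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            powers.append(abs(coeff) ** 2)
        frames.append(powers)
    return frames


# Toy signal: a 128 Hz sine sampled at 1024 Hz.
sample_rate = 1024
tone = [math.sin(2 * math.pi * 128 * t / sample_rate) for t in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin * sample_rate / 64)  # frequency of the strongest bin: 128.0
```

Each inner list is one column of the spectrogram image, so stacking the frames side by side gives the time axis.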

Mel spectrograms

A popular form of spectrograms are Mel spectrograms. The Mel scale is based on what humans perceive as equal pitch differences. The Mel scale defines how the frequency axis is scaled:

The result of the scaling is that for high frequencies the scale is approximately proportional to the logarithm of the frequency, while low frequencies (especially below 700 Hz) are compressed.
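As a sketch, the conversion can be written as follows. This assumes the commonly used variant of the Mel formula with the constants 2595 and 700 Hz, which matches the 700 Hz mentioned above; other variants of the scale exist. The function names are chosen for this example.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to Mel (common 2595 / 700 Hz variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


for freq in (100, 700, 1000, 8000):
    print(freq, round(hz_to_mel(freq), 1))
```

Well below 700 Hz the term `freq_hz / 700.0` is small, so the scale is close to linear in frequency. Well above 700 Hz the constant 1 hardly matters and the scale follows the logarithm of the frequency.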

This scale is widely used for speech analysis.

In a power spectrogram the power of each frequency component is shown on a logarithmic scale (in decibels).

Here is the Mel power spectrogram for the sound file above.

Looking at the spectrogram we find:

Ambient sound can contain a lot of low-frequency sounds, e.g.

These are the frequencies that are compressed by the Mel scale.

When the running speed of a machine changes, much of its sound spectrum is scaled by the same factor. On a pure logarithmic scale this scaling simply translates the spectrum along the frequency axis by a fixed distance, whereas the Mel scale distorts the shift for low frequencies.

So using a logarithmic scale for the analysis seems more appropriate.
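The argument can be checked numerically. The sketch below scales a few frequencies by the same factor and compares the resulting shift on a logarithmic axis (in octaves) with the shift on the Mel axis. The speed factor of 1.2, the example frequencies, and the 2595 / 700 Hz Mel formula are assumptions made for this illustration.

```python
import math


def hz_to_mel(freq_hz):
    """Common Mel formula; assumed here, other variants exist."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


speed_factor = 1.2  # hypothetical change of the running speed
base_frequencies = [50.0, 200.0, 800.0, 3200.0]

# Shift of each frequency on a log2 axis (octaves) and on the Mel axis.
log_shifts = [math.log2(f * speed_factor) - math.log2(f) for f in base_frequencies]
mel_shifts = [hz_to_mel(f * speed_factor) - hz_to_mel(f) for f in base_frequencies]

for f, log_shift, mel_shift in zip(base_frequencies, log_shifts, mel_shifts):
    print(f"{f:7.0f} Hz  log shift {log_shift:.3f} octaves  mel shift {mel_shift:6.1f}")
```

The logarithmic shift is the same for every frequency, while the Mel shift is much smaller at 50 Hz than at 3200 Hz.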

Log-frequency spectrograms

Here is the log-frequency power spectrogram (also referred to as constant-Q power spectrogram) for the audio file above:

This looks more appropriate for our classification task:

Yet high frequencies are still underrepresented.


The high frequencies can be emphasized using a filter

(as suggested by Haytham Fayek, Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between).
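For orientation, the pre-emphasis filter described in the cited post is a first-order difference, y[t] = x[t] - c * x[t-1]. The sketch below implements that form with a dimensionless coefficient (values around 0.95 to 0.97 are typical in speech processing). It is an assumption that this is the form used here; how the parameter α given in Hz in this post maps to such a coefficient is not covered by the sketch.

```python
def pre_emphasis(samples, coefficient=0.97):
    """Apply y[t] = x[t] - coefficient * x[t-1]; the first sample is kept as is."""
    if not samples:
        return []
    filtered = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        filtered.append(current - coefficient * previous)
    return filtered


slow = [1.0, 1.0, 1.0, 1.0]    # no change between samples (lowest frequency)
fast = [1.0, -1.0, 1.0, -1.0]  # sign flips every sample (highest frequency)
print(pre_emphasis(slow))
print(pre_emphasis(fast))
```

The constant signal is almost cancelled while the rapidly alternating signal is nearly doubled in amplitude, which is the high-frequency emphasis the filter is meant to provide.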

With α = 1485 Hz the sound sample sounds like this:

And the power is much more evenly distributed across the frequencies:


For feeding a neural network the spectrograms can be saved as black and white images with the brightness representing the logarithm of the power.
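A minimal sketch of the mapping from power values to pixel brightness is shown below. The 80 dB dynamic range, the floor value, and the function names are assumptions for this illustration; writing the pixel rows to an actual image file would require an imaging library and is not shown.

```python
import math


def power_to_db(power, floor=1e-10):
    """Convert power to decibels; the floor avoids taking the log of zero."""
    return 10.0 * math.log10(max(power, floor))


def to_grayscale(power_rows, dynamic_range_db=80.0):
    """Map a 2D list of power values to 8-bit brightness values (0 to 255)."""
    db_rows = [[power_to_db(p) for p in row] for row in power_rows]
    top = max(max(row) for row in db_rows)  # loudest value becomes white
    bottom = top - dynamic_range_db         # everything below becomes black
    pixels = []
    for row in db_rows:
        clipped = [min(max(value, bottom), top) for value in row]
        pixels.append([round(255 * (value - bottom) / dynamic_range_db)
                       for value in clipped])
    return pixels


print(to_grayscale([[1.0, 1e-2, 1e-8, 1e-12]]))
```

Clipping to a fixed dynamic range keeps very quiet bins from stretching the brightness scale.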

The number of recordings per setting is rather small in our data set. To avoid over-fitting, data augmentation should be applied.

Data augmentation

For image data a large variety of transformations can be used for augmentation. These include for instance random resized cropping, rotations, and flipping (for more transformations see

Not all make sense for spectrograms, e.g. rotations. Reasonable transformations are:

In SpecAugment: A Simple Augmentation Method for Automatic Speech Recognition, Zoph et al. suggest randomly masking frequency bands for the purpose of augmentation.
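Frequency band masking can be sketched as follows. The spectrogram is assumed to be a list of rows with one row per frequency bin, and the band width is drawn uniformly between zero and a maximum share of the bins. Whether the percentages used later in this post denote a fixed or a maximum width is not stated, so `max_fraction` is an assumption of this example, as are the function name and the fill value.

```python
import random


def mask_frequency_band(spectrogram, max_fraction, rng, fill=0.0):
    """Return a copy with one random contiguous band of frequency rows masked."""
    n_bins = len(spectrogram)
    max_width = int(n_bins * max_fraction)
    width = rng.randint(0, max_width)       # band width in bins
    start = rng.randint(0, n_bins - width)  # lowest masked bin
    masked = [list(row) for row in spectrogram]  # leave the input untouched
    for i in range(start, start + width):
        masked[i] = [fill] * len(masked[i])
    return masked


rng = random.Random(42)  # seeded for reproducibility
toy_spectrogram = [[1.0] * 5 for _ in range(20)]  # 20 frequency bins, 5 frames
augmented = mask_frequency_band(toy_spectrogram, max_fraction=0.2, rng=rng)
print(sum(1 for row in augmented if all(value == 0.0 for value in row)))
```

Because a new band is drawn on every call, the network sees a differently masked version of the same recording in each epoch.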

Neural network

For image recognition, pre-trained networks can be used. The VGG16 model (Karen Simonyan, Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition) was chosen. A classifier with one hidden layer of 512 nodes was added and only the classifier parameters were trained. The data set was split using a 60:20:20 ratio per setting-city combination. The Adam optimizer was applied with a learning rate of 0.001. Both augmentation techniques described above were used. Training was terminated after 10 consecutive epochs without an improvement of the accuracy on the validation set.
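The termination criterion can be sketched independently of any deep learning framework. In the sketch below the per-epoch validation accuracies are passed in as a sequence; in a real training loop each value would come from evaluating the model after one epoch. The function name and the toy accuracy values are made up for illustration.

```python
def run_until_no_improvement(validation_accuracies, patience=10):
    """Stop after `patience` consecutive epochs without a new best accuracy.

    Returns (best_epoch, best_accuracy, epochs_run) with 0-based epoch numbers.
    """
    best_accuracy = float("-inf")
    best_epoch = -1
    epochs_run = 0
    for epoch, accuracy in enumerate(validation_accuracies):
        epochs_run = epoch + 1
        if accuracy > best_accuracy:
            # New best: this is where the model weights would be saved.
            best_accuracy, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_accuracy, epochs_run


# Toy history: the accuracy peaks in epoch 1 and never improves afterwards.
history = [0.50, 0.60] + [0.55] * 20
print(run_until_no_improvement(history))  # (1, 0.6, 12)
```

Keeping the weights of the best epoch rather than the last one means that the ten epochs without improvement do not degrade the final model.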

Hyper parameter tuning

A grid search was used to find good values for the time window length and the share of the frequency band that is masked. To get an estimate of the margin of error, each point of the search grid was evaluated three times with different random seeds. The standard deviation values in the table below should be taken with caution as they are based on only three values.
                 frequency band masked
time window      0 %         10 %        20 %
3 s              60.2±0.8    58.7±0.2    58.2±0.6
5 s              71.2±0.8    70.8±0.0    69.5±0.8
7 s              63.6±1.7    62.9±2.4    64.6±0.7


Augmentation via frequency band masking was not beneficial for accuracy.
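The search procedure described above can be sketched as follows. The function `toy_evaluate` is a made-up stand-in for training and validating the network; its formula and the parameter names `time_window_s` and `masked_fraction` are assumptions for this example and do not reproduce the results reported here.

```python
import itertools
import statistics


def grid_search(evaluate, grid, seeds=(0, 1, 2)):
    """Evaluate every grid point once per seed; return {point: (mean, stdev)}."""
    names = list(grid)
    results = {}
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(seed=seed, **params) for seed in seeds]
        results[values] = (statistics.mean(scores), statistics.stdev(scores))
    return results


def toy_evaluate(time_window_s, masked_fraction, seed):
    """Hypothetical accuracy in percent; replaces real training in this sketch."""
    return 70.0 - 4.0 * abs(time_window_s - 5) - 10.0 * masked_fraction + 0.5 * (seed - 1)


grid = {"time_window_s": [3, 5, 7], "masked_fraction": [0.0, 0.1, 0.2]}
results = grid_search(toy_evaluate, grid)
best = max(results, key=lambda point: results[point][0])
print(best, results[best])
```

With only three seeds per grid point, `statistics.stdev` is a rough estimate, which is why differences within one standard deviation should not be over-interpreted.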

An accuracy of 69.7 % for the test set was achieved using the tuned parameters (a time window of 5 seconds for time warping within the 10 second recordings, no frequency band masking).

The confusion matrix shows that how well the audio settings can be separated differs a lot. While the bus setting was well recognized, the public square and pedestrian street settings were not easily separable.

airport bus metro metro station park public square shopping mall pedestrian street street traffic tram
metro station334391580151411511
public square18116131821242150
shopping mall370390102101911
pedestrian street351295581915262
street traffic210414481102091
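How well a single setting is recognized can be read from a confusion matrix as the share of its recordings that end up on the diagonal. The sketch below assumes that rows hold the true setting and columns the predicted setting. The 3x3 matrix contains made-up numbers for illustration and is not taken from the results above.

```python
def per_class_recall(matrix, labels):
    """Share of correctly classified recordings per true class (row)."""
    return {label: row[i] / sum(row)
            for i, (label, row) in enumerate(zip(labels, matrix))}


def overall_accuracy(matrix):
    """Sum of the diagonal divided by the total number of recordings."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    return correct / sum(sum(row) for row in matrix)


# Hypothetical counts: rows are true settings, columns are predictions.
labels = ["bus", "public square", "pedestrian street"]
toy_matrix = [
    [90, 5, 5],
    [10, 60, 30],
    [5, 35, 60],
]
print(per_class_recall(toy_matrix, labels))
print(overall_accuracy(toy_matrix))
```

Large off-diagonal entries that mirror each other, as between the second and third class in this toy matrix, indicate a pair of settings that the model confuses in both directions.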

This model now can be used for predictions. Can you identify the setting of the following recording by listening?

This is the prediction by the model:

The model itself can be found in


A workflow for classifying ambient sound was demonstrated:

Though the network used was not specifically built for this classification task, respectable accuracy rates were achieved.

Directions for further investigation could be