This project implements deep learning models for audio classification using the FSDKaggle2018 dataset. It provides a comprehensive end-to-end workflow for audio feature extraction, model training, and evaluation.
The FSDKaggle2018 dataset contains 11,073 audio files annotated with 41 labels. The sounds are unequally distributed across classes and have variable recording quality and duration.
Key dataset characteristics:
- 41 sound categories
- ~9.5k training samples, ~1.6k test samples
- Audio duration ranges from 300ms to 30s
- Single-label classification (one class per audio clip)
- Class imbalance (94-300 samples per class in training set)
```
deepLearningProject/
│
├── data/                              # Dataset files
│   ├── FSDKaggle2018.audio_train/     # Training audio files
│   ├── FSDKaggle2018.audio_test/      # Test audio files
│   └── FSDKaggle2018.meta/            # Metadata files
│
├── notebooks/                         # Jupyter notebooks
│   ├── 0_main_local.ipynb             # Main notebook for local execution
│   └── notebook_to_python.py          # Python script version of the notebook
│
├── checkpoints/                       # Model checkpoints
│   └── cnn_best_model.pth             # Best CNN model weights
│
├── results/                           # Experiment results for different models
│   ├── results_cnn/                   # CNN results
│   ├── results_crnn/                  # CRNN results
│   ├── resnet_results/                # ResNet results
│   └── transformer_results/           # Transformer results
│
├── requirements.txt                   # Python dependencies
├── environment.yml                    # Conda environment specification
└── setup.sh                           # Environment setup script
```
The project implements several deep learning architectures:
- CNN: Basic convolutional neural network with 4 convolutional layers, batch normalization, and adaptive pooling
- ResNet: ResNet50 adapted for audio classification with modified input/output layers
- CRNN: Convolutional recurrent neural network combining CNN feature extraction with bidirectional GRU
- Transformer: Audio transformer with CNN front-end and transformer encoder blocks
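To make the basic CNN concrete, here is a minimal sketch of a 4-block convolutional classifier with batch normalization and adaptive pooling, as described above. The class name, channel widths, and kernel sizes are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    """Illustrative 4-block CNN over (batch, 1, n_mels, time) log-mel inputs."""

    def __init__(self, n_classes: int = 41):
        super().__init__()
        chans = [1, 32, 64, 128, 256]  # assumed channel widths
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ]
        self.features = nn.Sequential(*blocks)
        # Adaptive pooling makes the classifier head independent of clip length.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(chans[-1], n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.features(x)).flatten(1)
        return self.classifier(x)

model = AudioCNN()
logits = model(torch.randn(2, 1, 64, 128))  # shape: (2, 41)
```

Adaptive pooling is what lets the same head handle spectrograms of different time lengths, which matters given the 300 ms to 30 s duration range in the dataset.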
- Optimized local data loading with parallel file copying
- In-memory caching of processed spectrograms for faster training
- Data preprocessing pipeline (resampling, padding/trimming, Mel spectrograms)
- Model parameter and memory usage calculation
- Learning rate scheduling with warmup and cosine annealing
- Early stopping to prevent overfitting
- Checkpoint saving and loading for best models
- Comprehensive metrics and visualizations (confusion matrices, class reports)
- GPU acceleration support
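The warmup-plus-cosine-annealing schedule mentioned above can be sketched with PyTorch's `LambdaLR`: the learning rate ramps up linearly for a fixed number of steps, then decays along a cosine curve. The step counts and base rate here are illustrative, not the project's actual hyperparameters.

```python
import math
import torch

def warmup_cosine(optimizer, warmup_steps: int, total_steps: int):
    """Linear warmup followed by cosine annealing to zero (illustrative)."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return (step + 1) / warmup_steps          # linear ramp-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.1)
sched = warmup_cosine(opt, warmup_steps=10, total_steps=100)

lrs = []
for _ in range(100):
    opt.step()
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])
# lrs rises to the base rate of 0.1 by step 10, then decays toward 0
```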
- Clone the repository:

```bash
git clone https://github.com/your-username/deepLearningProject.git
cd deepLearningProject
```

- Run the setup script:

```bash
bash setup.sh
```

- Download the FSDKaggle2018 dataset from Zenodo and extract it to the data/ directory.
To train a model using the command line:

```bash
conda activate deepLearningProject  # or your environment name
python train_model.py --model cnn
```

Available model options:
- cnn: Basic CNN model
- resnet: ResNet50-based model
- crnn: Convolutional Recurrent Neural Network
- transformer: Transformer-based model
You can also use the Jupyter notebook for an interactive experience:

```bash
jupyter notebook notebooks/0_main_local.ipynb
```

The notebook contains:
- Data loading and preprocessing
- Model selection and configuration
- Training and evaluation
- Results visualization
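The in-memory caching of processed spectrograms listed under the project's features can be sketched as a thin PyTorch `Dataset` wrapper: each item is computed once and served from memory on subsequent epochs. The class and `extract_fn` names are illustrative, not the project's actual API.

```python
import torch
from torch.utils.data import Dataset

class CachedSpectrogramDataset(Dataset):
    """Computes each spectrogram once, then serves it from an in-memory cache.

    extract_fn: any callable mapping an audio file path to a spectrogram tensor.
    """

    def __init__(self, paths, labels, extract_fn):
        self.paths, self.labels = paths, labels
        self.extract_fn = extract_fn
        self._cache = {}

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        if idx not in self._cache:          # only the first access computes
            self._cache[idx] = self.extract_fn(self.paths[idx])
        return self._cache[idx], self.labels[idx]
```

Note that a plain dictionary cache like this is only shared when the `DataLoader` runs with `num_workers=0`; worker processes each hold their own copy.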
The models were evaluated on the FSDKaggle2018 test set, with the following results:
| Model | Accuracy | Precision | Recall | F1 Score | Parameters | Size (MB) | Training Time (s) |
|---|---|---|---|---|---|---|---|
| CNN | 0.843 | 0.847 | 0.843 | 0.839 | 1.57M | 6.01 | 15,838 |
| ResNet | 0.756 | 0.794 | 0.756 | 0.746 | 23.59M | 90.18 | 6,645 |
| CRNN | 0.732 | 0.743 | 0.732 | 0.726 | 1.25M | 4.76 | 13,685 |
| Transformer | 0.018 | 0.0003 | 0.018 | 0.0006 | 19.86M | 85.55 | 16,114 |
Note: The Transformer model appears to have convergence issues in this implementation.
Main dependencies:
- Python 3.7+
- PyTorch 2.0+
- torchvision
- librosa
- numpy
- pandas
- matplotlib
- scikit-learn
- tqdm
- seaborn
The full list of dependencies is available in requirements.txt and environment.yml.
- Audio Processing: Using librosa for loading, resampling, and converting to mel spectrograms
- Spectrogram Caching: Processed spectrograms are kept in memory so repeated epochs skip feature extraction
- Model Checkpointing: Best models are saved based on validation loss
- Visualization: Training curves, confusion matrices, and per-class accuracy reports
This project is released under the MIT License.