dylanbforde/DeepLearningKaggleFSD
Audio Classification with Deep Learning

This project implements deep learning models for audio classification using the FSDKaggle2018 dataset. It provides a comprehensive end-to-end workflow for audio feature extraction, model training, and evaluation.

Dataset

The FSDKaggle2018 dataset contains 11,073 audio files annotated with 41 labels. The sounds are unequally distributed across classes and have variable recording quality and duration.

Key dataset characteristics:

  • 41 sound categories
  • ~9.5k training samples, ~1.6k test samples
  • Audio duration ranges from 300ms to 30s
  • Single-label classification (one class per audio clip)
  • Class imbalance (94-300 samples per class in training set)
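The per-class counts behind the imbalance figure can be inspected directly from the training metadata. The sketch below parses an inline excerpt with the standard library; the rows and filenames are made up for illustration, though the column layout follows the dataset's train.csv (fname, label, manually_verified).

```python
import csv
import io
from collections import Counter

# Illustrative excerpt; the real file lives under FSDKaggle2018.meta/
# and has one row per audio clip.
sample_csv = """fname,label,manually_verified
00044347.wav,Hi-hat,0
001ca53d.wav,Saxophone,1
002d256b.wav,Trumpet,0
0033e230.wav,Saxophone,1
"""

def class_counts(csv_text: str) -> Counter:
    """Count training samples per label to inspect class imbalance."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["label"] for row in reader)

counts = class_counts(sample_csv)
print(counts.most_common())  # most frequent classes first
```

Run against the real metadata, `counts.most_common()` makes the 94-to-300 per-class spread visible at a glance.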

Project Structure

deepLearningProject/
│
├── data/                   # Dataset files
│   ├── FSDKaggle2018.audio_train/  # Training audio files
│   ├── FSDKaggle2018.audio_test/   # Test audio files
│   └── FSDKaggle2018.meta/         # Metadata files
│
├── notebooks/              # Jupyter notebooks
│   ├── 0_main_local.ipynb  # Main notebook for local execution
│   └── notebook_to_python.py # Python script version of the notebook
│
├── checkpoints/            # Model checkpoints
│   └── cnn_best_model.pth  # Best CNN model weights
│
├── results/                # Experiment results for different models
│   ├── results_cnn/        # CNN results
│   ├── results_crnn/       # CRNN results
│   ├── resnet_results/     # ResNet results
│   └── transformer_results/ # Transformer results
│
├── requirements.txt        # Python dependencies
├── environment.yml         # Conda environment specification
└── setup.sh               # Environment setup script

Models

The project implements several deep learning architectures:

  1. CNN: Basic convolutional neural network with 4 convolutional layers, batch normalization, and adaptive pooling
  2. ResNet: ResNet50 adapted for audio classification with modified input/output layers
  3. CRNN: Convolutional recurrent neural network combining CNN feature extraction with bidirectional GRU
  4. Transformer: Audio transformer with CNN front-end and transformer encoder blocks
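A minimal PyTorch sketch of the first architecture (the 4-layer CNN with batch normalization and adaptive pooling). Channel widths, kernel sizes, and pooling choices here are illustrative assumptions, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    """Sketch of the basic CNN: four conv blocks with batch norm,
    adaptive average pooling, and a linear head over 41 classes.
    Layer sizes are illustrative, not the project's exact values."""
    def __init__(self, n_classes: int = 41):
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ]
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapses any spectrogram size
        self.classifier = nn.Linear(chans[-1], n_classes)

    def forward(self, x):                    # x: (batch, 1, n_mels, time)
        x = self.pool(self.features(x)).flatten(1)
        return self.classifier(x)

model = AudioCNN()
logits = model(torch.randn(2, 1, 64, 128))  # dummy batch of mel spectrograms
print(logits.shape)  # torch.Size([2, 41])
```

Adaptive pooling is what lets a single network accept clips of different durations (and hence spectrograms of different time lengths).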

Features

  • Optimized local data loading with parallel file copying
  • In-memory caching of processed spectrograms for faster training
  • Data preprocessing pipeline (resampling, padding/trimming, Mel spectrograms)
  • Model parameter and memory usage calculation
  • Learning rate scheduling with warmup and cosine annealing
  • Early stopping to prevent overfitting
  • Checkpoint saving and loading for best models
  • Comprehensive metrics and visualizations (confusion matrices, class reports)
  • GPU acceleration support
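The warmup-plus-cosine schedule listed above can be written as a closed-form function of the training step. A minimal sketch, where the warmup length and base learning rate are illustrative values, not the project's settings:

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr=1e-3, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine annealing."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warmup ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_step(0, 1000, 100))     # small value at the start of warmup
print(lr_at_step(99, 1000, 100))    # reaches base_lr at the end of warmup
print(lr_at_step(1000, 1000, 100))  # decays back toward min_lr
```

In PyTorch a schedule like this is typically wired into training via `torch.optim.lr_scheduler.LambdaLR` with a multiplier function of the same shape.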

Setup and Installation

  1. Clone the repository:
     git clone https://github.com/your-username/deepLearningProject.git
     cd deepLearningProject
  2. Run the setup script:
     bash setup.sh
  3. Download the FSDKaggle2018 dataset from Zenodo and extract it to the data/ directory.

Usage

Training a Model

To train a model using the command line:

conda activate deepLearningProject  # or your environment name
python train_model.py --model cnn

Available model options:

  • cnn: Basic CNN model
  • resnet: ResNet50-based model
  • crnn: Convolutional Recurrent Neural Network
  • transformer: Transformer-based model

Using the Jupyter Notebook

You can also use the Jupyter notebook for an interactive experience:

jupyter notebook notebooks/0_main_local.ipynb

The notebook contains:

  1. Data loading and preprocessing
  2. Model selection and configuration
  3. Training and evaluation
  4. Results visualization

Model Performance

The models were evaluated on the FSDKaggle2018 test set, with the following results:

Model        Accuracy  Precision  Recall  F1 Score  Parameters  Size (MB)  Training Time (s)
CNN          0.843     0.847      0.843   0.839     1.57M       6.01       15,838
ResNet       0.756     0.794      0.756   0.746     23.59M      90.18      6,645
CRNN         0.732     0.743      0.732   0.726     1.25M       4.76       13,685
Transformer  0.018     0.0003     0.018   0.0006    19.86M      85.55      16,114

Note: the Transformer model failed to converge in this implementation; its 0.018 accuracy is roughly the chance level for 41 classes (~0.024 under uniform guessing).
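Accuracy equalling recall in every row of the table is characteristic of support-weighted averaging. The sketch below recomputes a weighted F1 from toy label lists; in practice scikit-learn (already in the dependencies) provides this via its classification report utilities.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 averaged with weights
    proportional to each class's share of the true labels."""
    support = Counter(y_true)
    total_f1 = 0.0
    for c in support:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total_f1 += f1 * support[c] / len(y_true)
    return total_f1

# Toy two-class example (labels are illustrative, not from the dataset).
y_true = ["cat", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.733
```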

Requirements

Main dependencies:

  • Python 3.7+
  • PyTorch 2.0+
  • torchvision
  • librosa
  • numpy
  • pandas
  • matplotlib
  • scikit-learn
  • tqdm
  • seaborn

The full list of dependencies is available in requirements.txt and environment.yml.

Implementation Details

  • Audio Processing: Using librosa for loading, resampling, and converting to mel spectrograms
  • Spectrogram Caching: Processed mel spectrograms are cached in memory to speed up subsequent training epochs
  • Model Checkpointing: Best models are saved based on validation loss
  • Visualization: Training curves, confusion matrices, and per-class accuracy reports
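The early-stopping and best-model checkpointing described above can be sketched as a small tracker. The patience value is an illustrative assumption; a real training loop would call torch.save at the points where this tracker reports a new best.

```python
class EarlyStopper:
    """Track validation loss; signal when to checkpoint and when to stop."""
    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Return True if this epoch set a new best (checkpoint now)."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
            return True
        self.bad_epochs += 1
        return False

    @property
    def should_stop(self) -> bool:
        return self.bad_epochs >= self.patience

# Simulated validation losses: improves twice, then plateaus.
stopper = EarlyStopper(patience=2)
for loss in [1.0, 0.8, 0.9, 0.85]:
    is_best = stopper.step(loss)  # save model weights when is_best is True
    if stopper.should_stop:
        break
print(stopper.best_loss)  # 0.8
```

Checkpointing on validation loss (rather than training loss) is what lets the saved cnn_best_model.pth reflect the best-generalizing epoch.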

License

This project is released under the MIT License.
