This project implements deep learning models for audio classification using the FSDKaggle2018 dataset. It provides a comprehensive end-to-end workflow for audio feature extraction, model training, and evaluation.
The FSDKaggle2018 dataset contains 11,073 audio files annotated with 41 labels. The sounds are unequally distributed across classes and have variable recording quality and duration.
Key dataset characteristics:
- 41 sound categories
- ~9.5k training samples, ~1.6k test samples
- Audio duration ranges from 300ms to 30s
- Single-label classification (one class per audio clip)
- Class imbalance (94-300 samples per class in training set)
```
deepLearningProject/
│
├── data/                              # Dataset files
│   ├── FSDKaggle2018.audio_train/     # Training audio files
│   ├── FSDKaggle2018.audio_test/      # Test audio files
│   └── FSDKaggle2018.meta/            # Metadata files
│
├── notebooks/                         # Jupyter notebooks
│   ├── 0_main_local.ipynb             # Main notebook for local execution
│   └── notebook_to_python.py          # Python script version of the notebook
│
├── checkpoints/                       # Model checkpoints
│   └── cnn_best_model.pth             # Best CNN model weights
│
├── results/                           # Experiment results for different models
│   ├── results_cnn/                   # CNN results
│   ├── results_crnn/                  # CRNN results
│   ├── resnet_results/                # ResNet results
│   └── transformer_results/           # Transformer results
│
├── requirements.txt                   # Python dependencies
├── environment.yml                    # Conda environment specification
└── setup.sh                           # Environment setup script
```
The project implements several deep learning architectures:
- CNN: Basic convolutional neural network with 4 convolutional layers, batch normalization, and adaptive pooling
- ResNet: ResNet50 adapted for audio classification with modified input/output layers
- CRNN: Convolutional recurrent neural network combining CNN feature extraction with bidirectional GRU
- Transformer: Audio transformer with CNN front-end and transformer encoder blocks
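To make the basic CNN concrete, here is a minimal sketch of a 4-block convolutional classifier with batch normalization and adaptive pooling, as described above. The class name, channel widths, and kernel sizes are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    """Illustrative 4-block CNN over (batch, 1, n_mels, time) log-mel inputs."""

    def __init__(self, n_classes: int = 41):
        super().__init__()
        chans = [1, 32, 64, 128, 256]  # assumed channel widths
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ]
        self.features = nn.Sequential(*blocks)
        # Adaptive pooling makes the classifier head independent of clip length.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(chans[-1], n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.features(x)).flatten(1)
        return self.classifier(x)

model = AudioCNN()
logits = model(torch.randn(2, 1, 64, 128))  # shape: (2, 41)
```

Adaptive pooling is what lets the same head handle spectrograms of different time lengths, which matters given the 300 ms to 30 s duration range in the dataset.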
- Optimized local data loading with parallel file copying
- In-memory caching of processed spectrograms for faster training
- Data preprocessing pipeline (resampling, padding/trimming, Mel spectrograms)
- Model parameter and memory usage calculation
- Learning rate scheduling with warmup and cosine annealing
- Early stopping to prevent overfitting
- Checkpoint saving and loading for best models
- Comprehensive metrics and visualizations (confusion matrices, class reports)
- GPU acceleration support
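The warmup-plus-cosine-annealing schedule mentioned above can be sketched with PyTorch's `LambdaLR`: the learning rate ramps up linearly for a fixed number of steps, then decays along a cosine curve. The step counts and base rate here are illustrative, not the project's actual hyperparameters.

```python
import math
import torch

def warmup_cosine(optimizer, warmup_steps: int, total_steps: int):
    """Linear warmup followed by cosine annealing to zero (illustrative)."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return (step + 1) / warmup_steps          # linear ramp-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.1)
sched = warmup_cosine(opt, warmup_steps=10, total_steps=100)

lrs = []
for _ in range(100):
    opt.step()
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])
# lrs rises to the base rate of 0.1 by step 10, then decays toward 0
```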
- Clone the repository:

```bash
git clone https://github.com/your-username/deepLearningProject.git
cd deepLearningProject
```

- Run the setup script:

```bash
bash setup.sh
```

- Download the FSDKaggle2018 dataset from Zenodo and extract it to the data/ directory.
To train a model using the command line:

```bash
conda activate deepLearningProject  # or your environment name
python train_model.py --model cnn
```

Available model options:
- cnn: Basic CNN model
- resnet: ResNet50-based model
- crnn: Convolutional Recurrent Neural Network
- transformer: Transformer-based model
You can also use the Jupyter notebook for an interactive experience:

```bash
jupyter notebook notebooks/0_main_local.ipynb
```

The notebook contains:
- Data loading and preprocessing
- Model selection and configuration
- Training and evaluation
- Results visualization
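The in-memory caching of processed spectrograms listed under the project's features can be sketched as a thin PyTorch `Dataset` wrapper: each item is computed once and served from memory on subsequent epochs. The class and `extract_fn` names are illustrative, not the project's actual API.

```python
import torch
from torch.utils.data import Dataset

class CachedSpectrogramDataset(Dataset):
    """Computes each spectrogram once, then serves it from an in-memory cache.

    extract_fn: any callable mapping an audio file path to a spectrogram tensor.
    """

    def __init__(self, paths, labels, extract_fn):
        self.paths, self.labels = paths, labels
        self.extract_fn = extract_fn
        self._cache = {}

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        if idx not in self._cache:          # only the first access computes
            self._cache[idx] = self.extract_fn(self.paths[idx])
        return self._cache[idx], self.labels[idx]
```

Note that a plain dictionary cache like this is only shared when the `DataLoader` runs with `num_workers=0`; worker processes each hold their own copy.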
The models were evaluated on the FSDKaggle2018 test set, with the following results:
| Model | Accuracy | Precision | Recall | F1 Score | Parameters | Size (MB) | Training Time (s) |
|---|---|---|---|---|---|---|---|
| CNN | 0.843 | 0.847 | 0.843 | 0.839 | 1.57M | 6.01 | 15,838 |
| ResNet | 0.756 | 0.794 | 0.756 | 0.746 | 23.59M | 90.18 | 6,645 |
| CRNN | 0.732 | 0.743 | 0.732 | 0.726 | 1.25M | 4.76 | 13,685 |
| Transformer | 0.018 | 0.0003 | 0.018 | 0.0006 | 19.86M | 85.55 | 16,114 |
Note: The Transformer model appears to have convergence issues in this implementation.
Main dependencies:
- Python 3.7+
- PyTorch 2.0+
- torchvision
- librosa
- numpy
- pandas
- matplotlib
- scikit-learn
- tqdm
- seaborn
The full list of dependencies is available in requirements.txt and environment.yml.
- Audio Processing: Using librosa for loading, resampling, and converting to mel spectrograms
- Spectrogram Caching: Processed spectrograms are kept in memory so repeated epochs skip feature extraction
- Model Checkpointing: Best models are saved based on validation loss
- Visualization: Training curves, confusion matrices, and per-class accuracy reports
This project is released under the MIT License.