Project Ornithowav: Bird Species Identification

Avian diversity is a commonly used metric to assess environmental change and is growing in importance. In this project we use the MAX78000 to detect and classify birds based on their songs and calls, thereby enabling remote sensing of avian diversity.

Project Ornithowave:

***PLEASE WATCH THE DEMO VIDEO***
YouTube: https://youtu.be/f-ShF_WkPHo

***PLEASE READ THE FULL REPORT***
Uploaded Report-Master.pdf under label: others

Avian diversity is a commonly used metric to assess environmental change and is growing in importance due to climate change and its ecological effects. In the last few years, deep learning approaches have shown promise in reducing the expert labor that bird identification requires. The 157-layer, ResNet-based BirdNET from Cornell University is currently the state of the art in this field and can identify almost 1,000 different bird species [1]. However, deployments of this model are limited to areas where high-bandwidth internet is available, as the model is evaluated in the cloud.

We propose a simplified CNN based on the KWS-20 model to identify 20 different bird species based on their sounds and calls.
Running on the ultra-low-power MAX78000, this project can make bird monitoring possible in very remote locations without high-bandwidth connectivity or large batteries. It can also be used by bird enthusiasts who want to keep track of the bird species visiting their home or farm.

The project was executed in 4 phases:
1. Literature Review & Open-Source Dataset Identification
2. Data Engineering
3. Model training & C-code synthesis using ai8x-training & ai8x-synthesis libraries
4. Testing on Hardware using audio samples from the test set.

Phase 1: Literature Review & Open-Source Dataset Identification

The Cornell Lab of Ornithology is an active research group in this space that regularly publishes papers and hosts competitions on Kaggle (BirdCLEF 2021: https://www.kaggle.com/c/birdclef-2021, CBL 2020: https://www.kaggle.com/c/birdsong-recognition/overview).

Their publications and datasets served as a solid foundation for this project.
The networks that win these competitions are very large and complex. For our project we used a subset of this dataset and retrained the KWS-20 model (ai85kws20) from Maxim, replacing words with birds.

However, a major constraint is that all of this data is weakly labelled and not standardized: recordings vary in length, quality and sampling frequency. Refer to dataset-example.png.

Phase 2: Data Engineering

This was the most challenging phase.
Upon closer inspection it was observed that all of the data in the open-source datasets come from this one website: https://www.xeno-canto.org/
Data on this website is collected by birders all over the world and hence the quality can vary from sample to sample. There is no standardization of sample rates and sometimes the bird sound is not very clear. Each bird can also make multiple different calls, sounds and alarms, adding to the complexity.
A summary of a subset of the dataset (6 birds) can be found in image 6-birds.png, train-6.xls and dataset-example.png

To standardize the dataset and make it similar to the KWS dataset, two steps were taken:
1. Convert mp3 to wav and resample all data to 16 kHz. This matches the KWS dataset, and since bird calls typically lie between 1 and 8 kHz, a 16 kHz sample rate (Nyquist limit 8 kHz) is sufficient. Refer to code convert.py and convert16k.py.
2. Segment the audio into 1-second clips. The recordings in the dataset vary from 1.2 seconds to 900 seconds in length, with anywhere from 1 to 100 relevant bird sounds within a recording. A processing script was written to first filter the audio and then identify and extract useful 1-second segments. Refer to code bird_slicer.py and image slice-ops.PNG to get an intuition for how this algorithm works.
Note: The code convert16k.py and bird_slicer.py can be found in python-code.zip.
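
To make the segmentation idea concrete, here is a minimal sketch of envelope-based clip extraction. It is not the code in bird_slicer.py: the function names, the moving-average envelope, the relative threshold and the window size are all assumptions chosen for illustration, and the filtering step is omitted. It assumes the audio is already a list of mono samples at a known sample rate.

```python
def amplitude_envelope(samples, window):
    """Centered moving average of |x| over roughly `window` samples."""
    n = len(samples)
    # Prefix sums let each window average be computed in O(1).
    prefix = [0.0]
    for s in samples:
        prefix.append(prefix[-1] + abs(s))
    half = window // 2
    env = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        env.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return env


def extract_clips(samples, sample_rate=16000, clip_seconds=1.0,
                  env_window=400, rel_threshold=0.5):
    """Return (start, end) sample indices of fixed-length clips around loud regions.

    All parameters are illustrative assumptions, not values from the project.
    """
    clip_len = int(sample_rate * clip_seconds)
    if len(samples) < clip_len:
        return []  # recording is shorter than one clip
    env = amplitude_envelope(samples, env_window)
    peak = max(env)
    if peak == 0:
        return []  # pure silence
    threshold = rel_threshold * peak
    last_start = len(samples) - clip_len
    clips = []
    next_free = 0  # first sample not yet covered by an emitted clip
    i = 0
    while i < len(env):
        if env[i] < threshold:
            i += 1
            continue
        # Activity found: centre the clip on the loudest point nearby.
        region_end = min(len(env), i + clip_len)
        centre = max(range(i, region_end), key=env.__getitem__)
        start = centre - clip_len // 2
        start = max(start, next_free)   # do not overlap the previous clip
        start = min(start, last_start)  # do not run past the end
        if start < next_free:
            break  # no room left for another full clip
        clips.append((start, start + clip_len))
        next_free = start + clip_len
        i = next_free
    return clips
```

Each returned pair can then be used to slice the sample list and write a 1-second wav file. Centring on the loudest point is only one possible choice; a real recording with background noise would likely need the filtering step mentioned above before thresholding.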

The outcome of this process is a dataset of 79,025 unique samples (refer to data-properties.png) with the distribution shown in data-distribution.jpeg. This data has not been manually cleaned after segmentation, as that process is too time-consuming and requires expert annotation.

Phase 3: Model training & C-code synthesis using ai8x-training & ai8x-synthesis libraries

Once the data was ready, only a few changes to the training scripts were needed to train on the new data. The steps were as follows:
- Update data folder with new data
- Remove data unzip operations from kws20.py (__extract_archive function)
- Update class arrays and dictionaries (multiple places in kws20.py)
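
As an illustration of the last step, the sketch below builds a label-to-index mapping of the kind a keyword-spotting data loader typically needs. It does not reproduce the actual structures in kws20.py. The helper names and the placeholder species labels are hypothetical, and placing the 'unknown' class last is an assumption.

```python
def build_class_dict(labels, unknown_label="unknown"):
    """Map each label to a stable integer index, with the catch-all class last.

    Hypothetical helper for illustration; not part of ai8x-training.
    """
    named = sorted(set(labels) - {unknown_label})
    class_dict = {name: idx for idx, name in enumerate(named)}
    class_dict[unknown_label] = len(named)
    return class_dict


def index_to_label(class_dict):
    """Return labels ordered by index, e.g. for printing a predicted class."""
    return [name for name, _ in sorted(class_dict.items(), key=lambda kv: kv[1])]


# Placeholder labels; the real ones would be the 20 species folder names.
classes = build_class_dict(["species_b", "species_a", "unknown"])
```

Sorting the names keeps the indices reproducible between training runs, which matters because the same ordering has to be used for the class labels on the device.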

Run the script train.sh, but reduce the number of epochs to 50.
The network ai85kws20net, found in ai85net-kws20.py, was used.

We first tried to train a mix of birds and words to see how the network performs on the hybrid dataset.
Screenshots of the training process can be found in:
- Words&Birds-ep10-confusion
- Words&Birds-ep40-confusion
- Words&Birds-ep40-tensorboard
In the confusion matrix you can clearly identify the cluster of birds, as they perform differently from the words.

Satisfied with this PoC, we trained the network on only birds.
We found that the network starts to overfit after 50 epochs, so we stopped training at that point.
The following images document the training output:
- PNG files e1, e10, e20, e30, e40 and e50 show the progression of the confusion matrix during training
- PNG files eval and test show the results of the evaluation script and the test run after training.
- tensorboard-plots.png shows the output of the tensorboard console
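
For readers less familiar with the confusion matrices in these screenshots, the following is a minimal sketch of how such a matrix and the per-class recall can be computed from integer class labels. It is independent of the ai8x-training scripts; the function names are illustrative, and the rows-are-true, columns-are-predicted layout is an assumption about orientation.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly (None if empty)."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else None)
    return recalls


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else None
```

A class that is often confused with one particular other class shows up as a large off-diagonal entry in its row, which is how a cluster of similar-sounding classes becomes visible.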

The model was then synthesized using the ai8x-synthesis library, following standard synthesis procedure. Synthesized code can be found in Synthesized-C-Code.rar

Phase 4: Testing on Hardware using audio samples from the test set.

The code was then flashed onto the MAX78000 Feather board (MAX78000FTHR), using the KWS20 codebase as a reference. Appropriate changes were made to the class labels.
Test audio was played through the laptop speakers.
A demo of the same can be found in the video.

Please download the zip file brd20_demo.zip (based on kws_20) to run the project on your own board. We have also included the test samples that were used in the video demo. No additional hardware is required to test this code, as we have used the on-board microphone.

Hardware Design

We have built custom hardware based on the MAX78000FTHR schematic. The major changes made to the design are:
1. Removal of all video-related hardware.
2. Addition of a 32MB flash memory to store collected audio data and improve detection accuracy over time.
3. Cost-effective BoM and PCB design to facilitate easy manufacturing and scale-up.
Refer to folder Hardware.zip.
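
As a rough sizing check for the added flash, the sketch below estimates how much raw audio it could hold. The storage format is not stated in this write-up, so the 16-bit mono PCM assumption, the reading of 32MB as 32 x 1024 x 1024 bytes, and the function name are all assumptions; filesystem and header overhead are ignored.

```python
def recording_seconds(flash_bytes, sample_rate_hz, bytes_per_sample):
    """Seconds of uncompressed mono audio that fit in `flash_bytes`."""
    return flash_bytes / (sample_rate_hz * bytes_per_sample)


FLASH_BYTES = 32 * 1024 * 1024  # assumed meaning of "32MB"
SAMPLE_RATE_HZ = 16000          # same rate as the training data
BYTES_PER_SAMPLE = 2            # assumed 16-bit PCM

seconds = recording_seconds(FLASH_BYTES, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)
one_second_clips = int(seconds)  # whole 1-second clips that fit
```

Under these assumptions the flash holds about 1,048 one-second clips, or roughly 17 minutes of audio, so storing only the detected segments rather than continuous recordings would make far better use of the space.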

Future Scope:
- Clean dataset: The generated dataset is purely code based with no manual cleaning. As the xeno-canto data can sometimes be of questionable quality with a lot of background noise, manual cleaning and labelling can greatly improve accuracy.
- Enhance dataset: More birds, more samples, improve 'unknown' class by adding human and non-bird animal sounds as well.
- Larger input size: Increasing the input length from 1s to 2.5s is shown to improve accuracy.
- Neural Architecture Search: We did not explore this much due to resource constraints and hope to do so next.
- Improve hardware: No additional hardware was required for this project as the on-board mic was sufficient. However, it is reasonable to believe that a better mic can lead to better results. Since this is a remote sensing solution, a battery pack and GSM module can also add value.

Other Uploads:
I have also uploaded a project report Report-Master.pdf that is easier to read along with a README on how to run all the code. I have also uploaded all the source code that I wrote and modified for this project and will be quite excited to work with someone who wants to build on this work. Unfortunately the dataset was too large to upload but I’m happy to share the same.
The demo video also walks through some of the execution steps and code modifications made.

Acknowledgements:
A big thank you to Elektor for hosting and Maxim for sponsoring this competition, it has been a great learning experience.
Many thanks to my friend & colleague Mr. Mohan Prabhakar for helping me out by training all these models on his gaming laptop (16GB RAM, i7 8th gen, GTX1050-Ti). He also helped me test the model on the MAX78000FTHR board and make the demo video, and he designed a neat board optimized for this application.

References:
[1] Stefan Kahl, Connor M. Wood, Maximilian Eibl, Holger Klinck, BirdNET: A deep learning solution for avian diversity monitoring, Ecological Informatics, Volume 61, 2021, 101236, ISSN 1574-9541, https://doi.org/10.1016/j.ecoinf.2021.101236 (https://www.sciencedirect.com/science/article/pii/S1574954121000273)
[2] Audio Based Bird Species Identification using Deep Learning Techniques, ETH Zurich, http://ceur-ws.org/Vol-1609/16090547.pdf
[3] Keywords Spotting Using the MAX78000 (AN7359), https://www.maximintegrated.com/en/design/technical-documents/app-notes/7/7359.html

Project Elements

6-birds.PNG (PNG, 538.54 KB)

6 out of the 20 birds that feature in our dataset. Each bird has different calls and songs.

slice-ops.PNG (PNG, 337.91 KB)

A few examples of how the Python script extracts 1-second segments from data of different lengths. The signal amplitude envelope is used to find regions of activity.

data-properties.PNG (PNG, 120.59 KB)

Total of 79,025 samples used for training, validation & testing

dataset-example.PNG (PNG, 70.18 KB)

e1.png (PNG, 275.49 KB)

Confusion matrix after first epoch

e40.png (PNG, 257.58 KB)

Confusion matrix after 40th epoch

e10.png (PNG, 296.11 KB)

Confusion matrix after 10th epoch

e20.png (PNG, 266.81 KB)

Confusion matrix after 20th epoch

e30.png (PNG, 275.66 KB)