Jamaisvu Cuda Peak finding

18 Jul 2018

Jamaisvu, Peak finding, & CUDA Updates

Warning, this will be a loooong post.

When I first had the idea for creating a system for automatically tagging songs for the DJ's at WJRH, I had no idea of the future extent of this project. I naively thought I could grab an audio stream, feed it into some Shazam API, and have that show up in Teal with a few hacks. I was quite wrong about that, but I'm kind of happy about it. This project has been such a learning experience these past 10 months (granted not constantly) and has forced me to explore technical nooks and crannies I would never otherwise have looked at.

When I last properly updated Jamaisvu in October, I had just begun the task of automatically tagging songs and still had a general lack of respect for the work Shazam had put into their product. It's been 10 months now, and although I haven't been working on this project too heavily, the problem of how to quickly do two-dimensional peak finding has been stuck in my mind.

After being frustrated with a seemingly lack of knowledge on this subject on the internet for an actual algorithm for 2D peaks, I went to SciPy's source code and tracked down what they were doing behind the scenes. Spoiler alert: It's not pleasant source code, and it’s a 2D sliding maximum algorithm, incorporating a footprint. I figured since Will Drevo could get it working with very acceptable accuracies with this method, if I were able to copy it but optimise it for the GPU I would be good to go.

2D Sliding Maximums

2D Sliding Max Example, credits to: wikipedia.org

A rough diagram, kind of, on how this algorithm works

The 2D Sliding Maximum algorithm is the key to finding peaks in the spectrogram of a song. I think that the algorithm is explained fairly well here where you can see it's very much related to the 1D versio;, however most of the algorithms online aren't too applicable for direct translation to run on the parallel CUDA cores in the GPU. SciPy has a dedicated 1D sliding maximum algorithm implemented, and also one for N-dimensions which is very closely related to the 1D one, which is cool I guess.

CUDA Conversion

Getting this to work on the GPU

I began working on this algorithm by naively computing the 1D maximum in the GPU and working my way up (or down) the rows of the array and combining them back on the CPU in python. This worked as expected and really gave me the confidence boost I needed after being in a standstill for months. The problem with this turned out to be the transfer and setup up times between the CPU and GPU. There is quite a bit of overhead when working with a GPU, therefore it is only really useful for large datasets. For example, if song's spectrograms weren't near 1.2 million points and closer to 10,000, it would be much faster to keep everything running on the CPU and within system RAM instead of having to ship it off to the GPU's memory, initialise the cores, calculate (much slower than CPU cores), and then ship it all back. So this method was not going to make the difference I needed for Jamaisvu to become a real and useful system.

Moving from 1 dimension to 2

Since I was going to have to ship off the whole the 2D spectrogram array to the GPU, I knew there were going to be more than a few challenges compared to when I only had to manage 1 dimension. The first dimension proved to already be a bit tricky due to my inexperience with C and directly working with arrays in memory, and dealing with appropriate data types. Data types are not Python's forte. The first step in moving to 2D was getting the whole array into the GPUs memory and figuring out how it was being stored. Once you have passed in the array using PyCuda's driver.In, you can then access the array via a pointer to the first cell in now the GPU's memory. The trick here is that, not too many online sites give great explanations about how to properly traverse a 2D array using only the pointer and I found Talonmies attitude to be rather off putting for asking new questions on Stackoverflow, especially when his answer here provided information to be somewhat false. You don't need to provide a stride for the how many bytes your datatype takes up in the arrays sequential memory; adding 1 is sufficient here, I crashed my GPU driver too many times because of this. Once you realise you can traverse the array it just requires a bit of thinking and some maths to map the cell you want to the memory address it is located at from the starting pointer.

Anyways, once I figured out how my spectogram was being stored on the GPU and how I could properly access it, I needed to figure out how SciPy's algorithm was working, and how I was going to be able to parallelise it. I knew how the algorithm worked in the first dimension, using a 1d window to slide over the data, I figured I would need to use a 2D box sliding over the 2D array to compute the same one dimension higher. To my amusement, and a bit of Google research later, it appeared that this was the right method. The trick here is that, instead of having a single thread take a box and move it across and down the 2D array, I would have a thread for each individual location in the array and compute a grid around it, thereby parallelising the process rather easily. I was saved here by the fact that the end array has to retain the same dimensions, so I would not need to deal with resizing as well.

Cuda Kernel Diagram

Figuring out memory mapping and edge cases for the kernel

Once I figured out how to address the appropriate cells in my array, and how I were to parallelise the algorithm I needed to handle the edge cases. These are the cases shown above where if have selected a certain cell (memory address) how to we make sure we do not go out of bounds (dangerous) and how do we make sure we don't spill on to the other rows since the 2D array is really a strung out 1D array. The solution is more easily visible in the code, but it came down to making sure the target pointer for evaluation within the kernel never went in front of the array's pointer and beyond the last memory address. The preventing of spillage into rows proved to be trickier. This came down to me keeping track of what effective X value I was (which column) at and if that came to close to the edge based on the kernel's size.

if((targetMemLocPointer < inputArray) || (targetMemLocPointer >= inputArray + arrayLength)){
    if(targetXLocation >= 0 && targetXLocation < arrayColCount){
        if((*targetMemLocPointer * *targetFootprintLocationPointer)> maxValue){
            maxValue = *targetMemLocPointer; // Set the value

You might've noticed the variable targetFootprintLocationPointer in the conditional there. This is a very important part of the algorithm that is used in Will Drevo's example and in SciPy called the Footprint.


Whilst, my sliding box code was running quite well in CUDA, the returned array was not identical to that which was being produced in the existing Dejavu code, even though it contained no errors. That is where this code caught my attention.

struct = generate_binary_structure(2, 1)
neighborhood = iterate_structure(struct, PEAK_NEIGHBORHOOD_SIZE)

# find local maxima using our filter shape
local_max = maximum_filter(arr2D, footprint=neighborhood) == arr2D

The variable struct in the first line contains a 2D boolean array like:

[[False  True False]
 [ True  True  True]
 [False  True False]]

and then the function iterate_structure takes that initial binary structure and scales it up. This becomes our footprint by which the kernel size is defined and the kernel is affected by. The structure here for a PEAK_NEIGHBORHOOD_SIZE of 5 is:

[[False False False False False  True False False False False False]
 [False False False False  True  True  True False False False False]
 [False False False  True  True  True  True  True False False False]
 [False False  True  True  True  True  True  True  True False False]
 [False  True  True  True  True  True  True  True  True  True False]
 [ True  True  True  True  True  True  True  True  True  True  True]
 [False  True  True  True  True  True  True  True  True  True False]
 [False False  True  True  True  True  True  True  True False False]
 [False False False  True  True  True  True  True False False False]
 [False False False False  True  True  True False False False False]
 [False False False False False  True False False False False False]]

I took a stab and assumed I would have to take product (convert to 0s and 1s for C!) of the Footprint and the Kernel values and consider my kernels evaluation over those numbers instead to produce my maximums. As luck would have it, this caused my CUDA kernel (different from the 2D box kernel) to produce identical results to SciPy's offering. An example of this would be:


multiplied by a Footprint of:


Would give us:


for the value to be saved in the centre position of that kernel which is the one our CUDA thread is targeting. Otherwise this answer would have been 4 had the footprint not existed. This lead to a decrease in recognising time for the pre-fingerprinted song Often (Kygo Remix) from 34.4s to 6.4s. Which is a certainly noticeable 5x improvement!


In the end, it was extremely satisfying to have come this far in the project and for it to continue to work (wow something didn't complete break for once) so well. Of course, this project is far from over, but I believe a huge hurdle has finally been overcome. My ideal target is 1s per song to fingerprint, so I might increase multithreaded CPU performance again along with some other clever hacks in the work to get closer to this goal (It currently takes ~11s for Often (Kygo Remix)).

Issues I see going forward

I am a bit concerned now for database storage and performance. With the current settings, Jamaisvu generates roughly 200k rows in the MySQL database for an average song. This could mean several million rows being added per minute if I reach my target performance for fingerprinting and with 20Mb for all of those rows, I could be looking at nearly 20TB for a million songs. Just a slight worry!

Published on 18 Jul 2018 by Clement Hathaway