Current implementation in Jamaisvu/Dejavu
The problem, at first, felt quite easy, or at least surely someone else would have come up with a nice algorithm I could use. Not quite. I'm pretty sure the algorithm Will Drevo wrote for Dejavu was based on this response on Stackoverflow. Sadly, the description of this combination of scipy functions doesn't give a clear indication of how it affects the dataset or what work is being done in the background. Whilst I could attempt to dive deep into researching how Scipy does some of this work, I think that would be a poor use of my time, especially with my class load at the moment.
Anyways, this approach might be fine for the original poster's use case of identifying a dog's paws, but with large datasets like the one I am working with, it becomes unbearably slow. This is very evident in Dejavu: fingerprinting a single song takes roughly 38 seconds. Will Drevo tried to mitigate this when creating the initial database by spreading the slow fingerprinting algorithm across multiple CPU cores, but it is still fundamentally broken. Shazam will listen to a song, fingerprint it, send it to the server, obtain the result, and send it back to the phone, usually in under a second. Obviously, I'm not trying to compete toe-to-toe with Shazam, but there is clearly a lot of room for improvement.
Since the last post, I've discovered Reikna, another Python library that further abstracts the programmer away from the CUDA interface. Luckily, it has a nice example of a spectrogram function which directly emulates the mlab one, so it will be easy to use as a drop-in replacement. I have also found Scikit-Cuda, which gives me access to a bunch of scikit functions that are already accelerated by CUDA; I just have to give it some GPUArrays (simple!). These two libraries have saved me a good deal of time I would otherwise have spent reinventing the wheel.
My peak detection attempts
The goal of Jamaisvu is that, in addition to providing much more relevant information about a song, it is also much faster at 1. building the initial database and 2. fingerprinting further songs or samples that we listen to. My first attempt at this was pretty basic. Knowing that the coefficient of variation gives a sense of the noise in the spectrogram, I thought that by selecting a threshold some multiple above this level I could find a good number of peaks. This clearly didn't work, for two main reasons: 1. the threshold would have to vary greatly from song to song to get any sort of consistent number of peaks per song (some would have ~10k peaks, others ~500k); 2. most modern songs have a lot of loud low tones, which throws off the coefficient and makes the peaks all end up in those frequencies. Here's an example of the song Often (Kygo Remix) using this method:
As you can see, the black dots represent the detected peaks on the spectrogram, and they are all in the low frequencies. We would have to significantly lower the threshold to pick up more high frequencies, but this would give us a huge increase in the number of peaks in the lower tones, which is far from ideal.
My new peak detection algorithm, which is similar to the first, aims to fix the issue of the low frequencies overpowering the rest. In this algorithm, I split the spectrogram into x number of bins along the y-axis (frequencies), where x can be any integer. In each one of these bins I calculate the following: absolute max, standard deviation, mean, coefficient of variation, and a base threshold. Since each bin now has its own threshold, I have a new array of base thresholds. These are calculated by taking the max value of the bin and dividing it by its coefficient of variation. This gives me a rough estimate of where each bin's threshold should be, since the bins can have very different values. I then have a global percentage, and the final threshold for peaks in each bin is the base threshold multiplied by that percentage. Any value above that new threshold is defined as a peak. This gives me the following output for the same song:
Here the peaks are much better defined and spread out! And it only took an extra ~0.1 seconds to compute! However, whilst I think this is a viable method if you just need to find peaks in a 2D array, it (after some additional thought during this morning's breakfast) will definitely fail when implemented for audio fingerprinting / recognition.
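To make the binned approach concrete, here is a simplified sketch. The names and default values are my own, and the guards against empty or flat bins are defensive additions rather than part of the description above.

```python
import numpy as np

def binned_peaks(spec, n_bins=16, percentage=0.6):
    """Sketch of the per-frequency-band threshold idea: split the
    frequency axis (rows) into n_bins bands, each with its own threshold."""
    edges = np.linspace(0, spec.shape[0], n_bins + 1, dtype=int)
    peaks = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = spec[lo:hi]
        mean, std = band.mean(), band.std()
        if band.size == 0 or mean == 0 or std == 0:
            continue  # silent or flat band: nothing to detect
        cv = std / mean                       # coefficient of variation
        base = band.max() / cv                # base threshold for this band
        threshold = base * percentage         # scaled by the global percentage
        f_idx, t_idx = np.where(band > threshold)
        peaks.extend(zip(f_idx + lo, t_idx))  # shift back to absolute rows
    return peaks
```

Because each band gets its own threshold, a loud bass region can no longer drown out quieter high-frequency peaks, at the cost of the time-variance problem described next.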
The main issue that I foresee with this algorithm is that it is incredibly time variant. Since I am splitting the bins by frequency, I am calculating my std, mean, max, etc. over the whole time period I am given. If I have a sample of a song (when recognising) that is only about 10 seconds long, the peaks my algorithm calculates will end up completely different from the ones calculated from the file (the full length of the song), and I will get 0 matches. Another, comparatively minor, issue is that my algorithm still does not give me a reliable estimate of the number of peaks. It is improved over the first attempt, but some songs will still have half or double the number of peaks of another. This is an issue for both accuracy and storage, though I'm not sure how Will's implementation handles it; perhaps he also has some trouble with the numbers, I haven't looked extensively. Furthermore, if a song has a different set of frequencies at the beginning or end, it will throw off the values, and peaks will be missing at either the front or the back, which would prevent recognition until the song has gotten to a point where there are more peaks and therefore hashes. This is especially prevalent in Life in Technicolor, shown here:
Notice the distinct lack of peaks in the beginning of the song, where there are only really low frequencies; the high frequencies which come later in the song completely throw the detection off.
Ideas to continue with
Since testing this algorithm at 02:30 last night and discovering the flaws this morning, I haven't had much time to think of a perfect solution yet. However, one solution I have thought of is to split the bins along the x-axis instead, so that I bin over time. I would have to coordinate this so that the number of bins varies with the length of the sample; that way the time bins are equal between the fingerprinting of the file and, subsequently, the fingerprinting of the samples, in order to get the right peaks. I think I will also have to normalise the data to avoid related issues. Stay tuned.
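A minimal sketch of that idea is below. The frames-per-bin constant is a placeholder I would still have to tune, not a value from any testing, and the normalisation step is the obvious zero-mean/unit-variance version, which may turn out to be too naive.

```python
import numpy as np

FRAMES_PER_BIN = 200  # placeholder: choose so each bin spans a fixed duration

def time_bin_edges(num_frames, frames_per_bin=FRAMES_PER_BIN):
    """Derive the bin count from the sample length, so a 10-second sample
    and the full song are binned over equally long time windows."""
    n_bins = max(1, num_frames // frames_per_bin)
    return np.linspace(0, num_frames, n_bins + 1, dtype=int)

def normalise(spec):
    """Zero-mean, unit-variance normalisation, to reduce the level
    difference between the original file and a recorded sample."""
    return (spec - spec.mean()) / spec.std()
```

The key point is that the bin *duration* is fixed rather than the bin *count*, so the thresholds computed for a short sample are comparable to those computed for the full-length file.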
I think tomorrow when my GTX 1050 Ti arrives, I will implement this anyway just to see how fast I can plow through 100 songs compared to regular Dejavu. I might even try recognising some songs, but I am very confident that won't work. Here is the rough test code I used for this post: