When it takes 22 seconds to apply the Hann function to ~1000 bins
As part of converting the Jamaisvu library to use CUDA, I've been experimenting a lot with how several key components are calculated. The first thing you have to do before performing the FFTs and then the peak finding is to apply a Hann function to the data in your bins. Since this is just component-wise vector multiplication, it is a perfect case for letting the GPU compute it for us, i.e. computing each data point in a bin in parallel.
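To make the "component-wise multiplication" point concrete, here is a minimal CPU-side sketch in plain Python. It assumes the symmetric form of the Hann window (the same formula `numpy.hanning` documents); the helper names `hann` and `apply_window` and the 8-sample bin are made up for illustration and are not taken from Jamaisvu.

```python
import math

def hann(n):
    """Symmetric Hann window: w[i] = 0.5 - 0.5 * cos(2*pi*i / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def apply_window(samples, window):
    # out[i] depends only on samples[i] and window[i], so each element
    # could be computed by its own GPU thread with no coordination.
    return [s * w for s, w in zip(samples, window)]

bin_data = [1.0] * 8                      # hypothetical bin of 8 samples
windowed = apply_window(bin_data, hann(8))
```

Because no output element depends on any other, the loop body maps directly onto one thread per sample, which is what makes this step such a natural fit for a GPU kernel.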
When I first wrote the code to do this, it took 22 seconds on average, and I had no idea why. My first assumption was that this was caused by the memory transfers of my bins between the host and the GPU. For the song I'm using as a test, there are about 1000 bins, and therefore 1000 two-way transfers for each of 2 channels, which gives us roughly 4000 memory copies. This seems like a lot, and since this is my first time messing with GPU compute, it seemed a reasonable explanation. If the speed of my Hanning is limited by the number of memory transfers, then increasing my bin size, which decreases the number of bins and therefore the number of memory transfers, should speed things up more or less linearly. So I quadrupled my bin size, and how did my results look, you might ask? It took 22 seconds. No amount of playing with this, except for extreme cases of making the bin size very, very small, changed the time in any substantial manner.
I also started to notice that my CPU usage was pegged at 100% for these runs. At first I thought this was maybe due to the memory transfers, but further investigation told me that, with a bin size of 32.78 kB and my 750m's host-to-GPU transfer rate of ~6.3 GB/s, this was probably not the bottleneck. So I began to place timing statements throughout my code...
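A minimal sketch of this kind of instrumentation, assuming the goal is simply to time each stage with Python's `time.perf_counter`. The helper `timed` and the `split_channels` stage are hypothetical stand-ins, not the statements actually used in the project.

```python
import time

def timed(label, func, *args):
    """Run func(*args), print how long it took, return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")
    return result, elapsed

def split_channels(interleaved):
    # Hypothetical stage: de-interleave stereo samples into left and right.
    return interleaved[0::2], interleaved[1::2]

(left, right), seconds = timed("split channels", split_channels, list(range(16)))
```

Wrapping every stage this way makes it easy to compare them side by side and spot the one that dominates the runtime.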
Eventually, after finding out that a lot of my code, like splitting channels and such, runs quite fast (10^-5 s), I found that generating the Hann function in a numpy array equal in length to my bin took roughly 0.01 seconds. I did some quick maths: 0.01 * 1000 = 10 seconds. Given that this time varies and starts to stray near 0.02 seconds per call, I could quickly see where my 22-second runtime was coming from and why my CPU was pegged at 100%. "Okay, great," I thought. I will simply put the Hann function in the kernel code so that each thread of the GPU also computes one value of the Hann function. ...This didn't work out so well.
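The back-of-the-envelope numbers can be laid out side by side. This is only a rough estimate using the figures quoted in this post: it treats the transfers as purely bandwidth-bound and ignores per-copy latency and kernel launch overhead, so the real transfer cost will be somewhat higher.

```python
n_bins = 1000              # bins in the test song (approximate)
copies = n_bins * 2 * 2    # two directions, two channels
bin_bytes = 32.78e3        # bin size quoted in the post, in bytes
bandwidth = 6.3e9          # host <-> GPU rate quoted in the post, bytes/s

# Time to move every bin if bandwidth were the only cost.
transfer_total = copies * bin_bytes / bandwidth

# Time spent regenerating the window once per bin.
hann_low = n_bins * 0.01   # at ~0.01 s per window
hann_high = n_bins * 0.02  # at ~0.02 s per window

print(f"copies: {copies}")
print(f"transfers: {transfer_total:.3f} s")
print(f"Hann generation: {hann_low:.0f} to {hann_high:.0f} s")
```

Under these assumptions the copies account for a few hundredths of a second, while regenerating the window accounts for 10 to 20 seconds, which is the right order of magnitude for the 22-second runtime.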
It wasn't until about an hour later, a bit past midnight, that it dawned on me that the Hann function never changes over the course of my calculations, since the bin size remains constant. Now I simply calculate the Hann function once at the beginning and pass this pre-computed array to the GPU when it's needed. This step only takes 0.01 seconds, so it's fine to compute once. The runtime after this change: 1 second. Quite the improvement!
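The fix boils down to hoisting a loop-invariant computation out of the loop. Here is a minimal CPU-only sketch of the before and after, in plain Python rather than numpy or CUDA. The function names, the bin size of 64, and the test data are all made up for illustration; in the real code the pre-computed window would be handed to the GPU instead of multiplied on the host.

```python
import math

def hann(n):
    """Symmetric Hann window of length n (assumes n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def window_bins_slow(bins):
    # Original pattern: rebuild the identical window for every single bin.
    return [[s * w for s, w in zip(b, hann(len(b)))] for b in bins]

def window_bins_fast(bins, bin_size):
    # Fix: the bin size never changes, so build the window once and reuse it.
    window = hann(bin_size)
    return [[s * w for s, w in zip(b, window)] for b in bins]

# 100 made-up bins of 64 samples each.
bins = [[float(i + j) for j in range(64)] for i in range(100)]
slow = window_bins_slow(bins)
fast = window_bins_fast(bins, 64)
```

Both versions produce the same output; the only difference is that the second one calls `hann` once instead of once per bin.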