Grayscale optimization
It's strange that a cudaMemcpy of 320*240*4 bytes takes less time than a cudaMemcpy of 320*240 bytes.
Well, it is a very strange bug, but cudaMemcpy for 320*240*4 bytes runs faster than for 320*240*1 bytes. Maybe because of half-warp alignment?
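
To check whether the difference is real, it may help to time both copies the same way. A single measurement can include one-time costs such as CUDA context creation on the first runtime call, so the first copy measured may look slower regardless of its size. Below is a minimal sketch, not the original code: it assumes host-to-device copies from pageable memory, and the helper name `timeCopyMs` and the iteration count are made up for illustration. It does a warm-up copy first, then averages many copies using CUDA events.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): returns the average
// host-to-device cudaMemcpy time in milliseconds, or -1 on failure.
static float timeCopyMs(size_t bytes, int iters) {
    unsigned char *h = (unsigned char *)malloc(bytes);
    unsigned char *d = nullptr;
    if (h == nullptr) return -1.0f;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) {
        free(h);
        return -1.0f;
    }
    memset(h, 0, bytes);

    // Warm-up copy so one-time setup cost is not counted in the timing.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i) {
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    free(h);
    return ms / iters;
}

int main() {
    // Force context creation up front so neither measurement pays for it.
    cudaFree(0);

    const int iters = 1000;  // assumed value; adjust as needed
    float t1 = timeCopyMs(320 * 240 * 1, iters);
    float t4 = timeCopyMs(320 * 240 * 4, iters);

    if (t1 < 0.0f || t4 < 0.0f) {
        printf("allocation failed or no CUDA device available\n");
        return 1;
    }
    printf("320*240*1 bytes: %f ms per copy\n", t1);
    printf("320*240*4 bytes: %f ms per copy\n", t4);
    return 0;
}
```

If the smaller copy is still slower after the warm-up and averaging, swapping the order of the two calls in `main` is a quick way to see whether the effect follows the size or just the position of the call.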
