Optimizing Parallel Reduction in CUDA - Nvidia