Optimizing Parallel Reduction in CUDA