Accelerating Sorting on GPUs: A Scalable CUDA Quicksort Revision
This article describes and evaluates an upgraded version of CUDA-Quicksort, an iterative implementation of the quicksort algorithm suited to highly parallel multicore graphics processors. Three key changes that improve performance are proposed. The main goal, which was achieved, was an implementation that scales better with data set size and with the number of cores on modern GPU architectures. The proposed changes also significantly reduce execution time. Execution times were measured on an NVIDIA graphics card across different distributions of the input data.