Splitter-based parallel sorting algorithms are known to be highly efficient for distributed sorting due to their low communication complexity. Although GPU accelerators can generally reduce computation cost, their effectiveness in distributed sorting algorithms on large-scale heterogeneous GPU-based systems remains unclear. We investigate the applicability of GPU devices to splitter-based algorithms and extend HykSort, an existing splitter-based algorithm, by offloading its costly computation phases to GPUs. We also handle GPU memory overflow by introducing an iterative approach that sorts multiple chunks and merges them into a single array. We evaluate the performance of our implementation with GPU-accelerated local sort on the TSUBAME2.5 supercomputer, which comprises over 4000 NVIDIA K20x GPUs. A weak-scaling evaluation shows a 389-fold speedup over a single node, with 0.25 TB/s throughput, when sorting 4 TB of 64-bit integers on 1024 nodes; in a CPU-versus-GPU comparison, however, our implementation achieves only a 1.40-fold speedup on 1024 nodes. Detailed analysis reveals that this limitation is almost entirely due to the bottleneck in CPU-GPU host-to-device bandwidth. With the order-of-magnitude improvements planned for next-generation GPUs, we expect the performance gains to grow substantially, in line with other successful GPU accelerations.
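To make the overflow-handling idea concrete, the following is a minimal sketch (not the authors' implementation) of sorting a host array that exceeds GPU memory: each chunk is sorted on the device with Thrust and then merged pairwise into the result accumulated on the host. The function name chunked_gpu_sort, the parameter chunk_elems, and the pairwise host-side merge are assumptions for illustration only.

```cpp
// Hypothetical sketch: GPU-chunked sort with host-side merging.
// Compile with nvcc; requires the Thrust library shipped with CUDA.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort `data` in place. `chunk_elems` is the largest number of elements
// that fits in GPU memory at once (an assumed tuning parameter).
void chunked_gpu_sort(std::vector<uint64_t>& data, std::size_t chunk_elems) {
    std::vector<uint64_t> merged;          // sorted prefix accumulated so far
    merged.reserve(data.size());
    for (std::size_t off = 0; off < data.size(); off += chunk_elems) {
        std::size_t n = std::min(chunk_elems, data.size() - off);
        // Copy one chunk to the device, sort it there, copy it back.
        thrust::device_vector<uint64_t> d(data.begin() + off,
                                          data.begin() + off + n);
        thrust::sort(d.begin(), d.end());
        thrust::copy(d.begin(), d.end(), data.begin() + off);
        // Merge the freshly sorted chunk into the accumulated result.
        std::vector<uint64_t> next(merged.size() + n);
        std::merge(merged.begin(), merged.end(),
                   data.begin() + off, data.begin() + off + n,
                   next.begin());
        merged.swap(next);
    }
    data.swap(merged);
}
```

A pairwise merge is chosen here only for brevity; a k-way merge over all sorted chunks would avoid repeatedly rescanning the accumulated prefix.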