A multi-GPU aerodynamics application written in CUDA and MPI (massage passing interface) library has been developed. We use Lattice Boltzmann method (LBM) since it has simple stencil calculations and highly local memory access. The D3Q27 model with the multiple relaxation time (MRT) [1] and the cumulant model [2] is employed to improve the accuracy and stability. In order to carry out large eddy simulations (LES) for turbulent flows, the forest-of-octrees-based mesh refinement is introduced and a high resolution mesh is assigned near a complex shape body. Each leaf has a little bigger patch of 9×9×9 meshes to take eclecticism between the efficiency of GPU computing and the mesh adaptivity to complex shape bodies. For multi-GPU computing, we introduce a rectangular space-filling curve to decompose computational domains efficiently. Our code achieves a performance of 273.572 MLUPS (Mega Lattice Update Per Second) using 8 GPUs of NVIDIA Tesla P100. We have carried out several cases of racing bicycles. The speed of bicycle is 40 km/h, and the Reynolds number is about 106 for the reference length estimated from the projected frontal area of the bicyclist. The mesh resolution near the bicycle is 2 mm - 4 mm. The drag coefficient of a flow around one bicyclist with 2 GPUs shows a good agreement with an experiment in wind tunnel and it depends on the posture of the bicyclist. We have also calculated a flow including the group of 6 bicyclists by using 8 GPUs. The drag area on the front bicyclist is largest and going to be small from 2nd to 5th bicyclist. It is also found that the drag area on the last bicyclist has a small increase of drag area caused by the negative pressure behind the group of bicyclists.