The research field of generating natural gestures from speech input is called co-speech gesture generation. Co-speech gesture generation methods must satisfy two requirements: fidelity and diversity. Several previous studies have used deterministic methods that establish a one-to-one mapping between speech and motion to achieve fidelity to the speech, but the variety of gestures they produce is limited. Other methods generate gestures probabilistically to increase diversity, but they often lack fidelity to the speech. To overcome these limitations, we propose Speaker-aware Audio2Gesture (SA2G), an extension of the previously proposed A2G, which uses a variational autoencoder (VAE) that takes randomized speaker-aware features as input. By using ST-GCNs as encoders and controlling the variance used for randomization, SA2G generates gestures that are faithful to the speech content while remaining highly diverse. In our evaluation on the TED datasets, SA2G improves the fidelity of the generated gestures over the baseline by 85.4 while increasing the Multimodality by 9.0×10^(-3).
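The speaker-aware randomization described above can be pictured as a VAE-style reparameterization with a tunable variance scale. The minimal PyTorch sketch below illustrates only that idea under stated assumptions: the module names (SpeakerAwareRandomizer, GestureDecoder), the feature dimensions, and the plain MLP decoder are all hypothetical, and the ST-GCN encoders of the actual model are omitted; this is not the authors' implementation.

```python
# Hypothetical sketch of speaker-aware randomization with a controllable variance.
# All names and shapes are illustrative; they do not come from the SA2G code.
import torch
import torch.nn as nn


class SpeakerAwareRandomizer(nn.Module):
    """Maps a speaker feature to a Gaussian and samples from it with a
    tunable variance scale, so each forward pass yields a different
    (but speaker-consistent) latent."""

    def __init__(self, speaker_dim: int, latent_dim: int, variance_scale: float = 1.0):
        super().__init__()
        self.to_mu = nn.Linear(speaker_dim, latent_dim)
        self.to_logvar = nn.Linear(speaker_dim, latent_dim)
        self.variance_scale = variance_scale  # larger -> more diverse gestures

    def forward(self, speaker_feat: torch.Tensor) -> torch.Tensor:
        mu = self.to_mu(speaker_feat)
        logvar = self.to_logvar(speaker_feat)
        std = torch.exp(0.5 * logvar) * self.variance_scale
        eps = torch.randn_like(std)   # reparameterization trick
        return mu + eps * std         # randomized speaker-aware feature


class GestureDecoder(nn.Module):
    """Toy decoder: maps an audio feature concatenated with the randomized
    speaker latent to a pose vector (placeholder for the real decoder)."""

    def __init__(self, audio_dim: int, latent_dim: int, pose_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, pose_dim),
        )

    def forward(self, audio_feat: torch.Tensor, speaker_latent: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([audio_feat, speaker_latent], dim=-1))


if __name__ == "__main__":
    # Dummy shapes: batch of 4, audio feature of size 128,
    # speaker feature of size 64, pose vector of size 48.
    audio_feat = torch.randn(4, 128)
    speaker_feat = torch.randn(4, 64)

    randomizer = SpeakerAwareRandomizer(speaker_dim=64, latent_dim=32, variance_scale=1.5)
    decoder = GestureDecoder(audio_dim=128, latent_dim=32, pose_dim=48)

    # The same inputs produce different gestures on repeated passes,
    # because the speaker latent is re-sampled each time.
    pose_a = decoder(audio_feat, randomizer(speaker_feat))
    pose_b = decoder(audio_feat, randomizer(speaker_feat))
    print((pose_a - pose_b).abs().mean())  # nonzero -> diversity from randomization
```

In this reading, the audio feature anchors the generated gesture to the speech content (fidelity), while re-sampling the variance-scaled speaker latent provides the diversity across passes.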