This paper presents a method for integrating speech, text, and video modalities for multimodal depression detection. Rather than relying on traditional long-duration approaches, our work leverages shorter utterances to improve detection accuracy. We introduce the COI-NEXT dataset, comprising authentic clinical interviews conducted over Zoom. Our experiments show that incorporating the video modality, particularly with shorter utterances, improves the accuracy of depression detection in patients. Despite limitations imposed by data scarcity, this work offers valuable insights into multimodal depression detection and underscores the importance of multimodal integration in mental health research.