Speaker diarization ("who spoke when") algorithms for use in psychotherapy

Avigail Bar-Sella# and Sigal Zilcha-Mano - Department of Psychology
Dmitrii I Kaplun# and Pavel Goldstein - School of Public Health

Machine Learning
Deep learning
Neural Networks

Psychology

PhD Grant 2020

Figure 1. From 3200 hours of work to only several minutes: that is how a diarization algorithm can facilitate research in psychotherapy.

It is now well established, on the basis of empirical evidence, that psychotherapy is an effective treatment for diverse mental health problems. However, ineffective or inaccurate identification of therapeutic processes may result in deterioration of patients’ mental health and even suicidality in at-risk populations. These limitations highlight the need to improve our ability to identify and facilitate therapeutic processes in psychotherapy.

One of the most promising measures of therapeutic processes, attracting great clinical and empirical interest in the field of psychotherapy research, is acoustic vocal markers. Contemporary studies suggest that vocal acoustics serve as interdisciplinary indicators of clinical targets, as they carry rich emotional information about in-session processes. However, this highly promising new field of research has encountered methodological problems. Chief among them is the time required to produce the acoustic data, because doing so requires the groundwork of decomposing therapy sessions into patients’ and therapists’ talk turns.

To date, the vast majority of studies have used manual procedures to decompose sessions into patients’ and therapists’ talk turns. Because of the complications involved, manually decomposing a single therapy session into patients’ and therapists’ talk turns requires about 8 hours of work. Multiplying this number of hours by an acceptable number of patients (e.g., fifty) and by multiple sessions throughout treatment (e.g., eight) results in at least 3200 hours of work, as the quick calculation below illustrates. These numbers make it imperative to bring automated solutions for decomposing patients’ and therapists’ talk turns into psychotherapy research.
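As a quick sanity check of these figures, the calculation can be sketched in a few lines (the numbers are taken directly from the estimates above):

```python
# Back-of-the-envelope estimate of the manual annotation burden described above.
hours_per_session = 8        # manual decomposition of one session
patients = 50                # an acceptable sample size
sessions_per_patient = 8     # multiple sessions throughout treatment

total_hours = hours_per_session * patients * sessions_per_patient
print(total_hours)               # 3200 hours of manual work
print(round(total_hours / 40))   # roughly 80 forty-hour work weeks
```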

Therefore, we are working to develop a speaker diarization (“who spoke when”) algorithm: an algorithm able to associate distinct speech segments with new individuals (patients and therapists) who were not part of its training data. Such an algorithm will make it possible to decompose audio recordings of therapy sessions into patients’ and therapists’ talk turns within only several minutes.
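For illustration only (a hypothetical sketch with made-up timestamps, not output of our model), the result of diarization can be thought of as a list of time-stamped segments labeled by speaker, which are then merged into patient and therapist talk turns for downstream acoustic analysis:

```python
# Hypothetical diarization output: (start_seconds, end_seconds, speaker_label).
segments = [
    (0.0, 4.2, "therapist"),
    (4.2, 9.8, "patient"),
    (9.8, 11.0, "patient"),
    (11.0, 15.5, "therapist"),
]

def merge_into_talk_turns(segments):
    """Merge consecutive segments from the same speaker into one talk turn."""
    turns = []
    for start, end, speaker in segments:
        if turns and turns[-1][2] == speaker:
            # Extend the previous turn instead of opening a new one.
            turns[-1] = (turns[-1][0], end, speaker)
        else:
            turns.append((start, end, speaker))
    return turns

for start, end, speaker in merge_into_talk_turns(segments):
    print(f"{speaker}: {start:.1f}s - {end:.1f}s")
```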

To develop the algorithm, we use state-of-the-art supervised models trained on therapy sessions. These sessions are drawn from a randomized controlled trial led by Prof. Sigal Zilcha-Mano examining personalized psychotherapy. Specifically, we use a parameter-sharing recurrent neural network (RNN) to model each speaker’s (patient and therapist) embedding. The models are developed and trained under the supervision of Dr. Dmitrii I Kaplun, a global expert in spectral analysis and signal processing. This unique collaboration also benefits from the expertise of Dr. Pavel Goldstein in implementing multimodal approaches to psychotherapy and medical interventions.
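The following is a minimal sketch, in PyTorch, of the general idea rather than our actual implementation: a single recurrent encoder with shared parameters maps any speech segment to a fixed-length speaker embedding, and an unseen segment is assigned to the patient or the therapist by comparing its embedding with reference embeddings from the same session. All feature dimensions and architectural details here are illustrative assumptions.

```python
# Illustrative sketch (not the authors' model): a parameter-sharing RNN encoder
# that turns acoustic frames of a speech segment into a speaker embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSpeakerEncoder(nn.Module):
    """One GRU whose parameters are shared across all speakers and segments."""

    def __init__(self, n_features: int = 40, hidden: int = 128, emb_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, n_features), e.g. log-mel or MFCC frames of one segment
        _, last_hidden = self.rnn(frames)           # (1, batch, hidden)
        emb = self.proj(last_hidden.squeeze(0))     # (batch, emb_dim)
        return F.normalize(emb, dim=-1)             # unit-length speaker embedding


def assign_speaker(encoder, segment, patient_ref, therapist_ref):
    """Label an unseen segment by cosine similarity to two reference embeddings."""
    with torch.no_grad():
        emb = encoder(segment.unsqueeze(0))         # (1, emb_dim)
        sims = torch.stack([
            F.cosine_similarity(emb, patient_ref.unsqueeze(0)),
            F.cosine_similarity(emb, therapist_ref.unsqueeze(0)),
        ])
    return ("patient", "therapist")[int(sims.argmax())]


if __name__ == "__main__":
    encoder = SharedSpeakerEncoder()
    # Random tensors stand in for acoustic features of enrollment and test segments.
    patient_ref = encoder(torch.randn(1, 200, 40)).squeeze(0)
    therapist_ref = encoder(torch.randn(1, 200, 40)).squeeze(0)
    print(assign_speaker(encoder, torch.randn(150, 40), patient_ref, therapist_ref))
```

Because the encoder’s parameters are shared rather than tied to particular individuals, the same trained network can, in principle, embed segments from patients and therapists it never saw during training, which is the property the algorithm relies on.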

Developing a diarization algorithm will make it possible to identify acoustic vocal markers for large numbers of patients. Such identification will enable therapists to optimize therapeutic processes and avoid harmful mistakes, thus improving the efficiency of psychotherapy.