Figure 1: Multi-subject neural decoder pipeline. (a) Training: each participant's neural signals and electrode location information (MNI coordinates and ROI index) are fed to a shared SwinTW neural decoder that predicts speech parameters. The predicted speech parameters are supervised by the parameters that the subject-specific Speech Encoder generates from the ground-truth speech spectrogram, and each participant's predicted parameters are passed to the corresponding subject-specific Speech Synthesizer to produce a speech spectrogram. Once the shared SwinTW decoder is trained, it can decode speech from the neural signals of any participant: it takes that participant's neural signals and electrode location information as input and generates speech parameters, which the participant's speech synthesizer (pre-trained on the participant's own speech) converts to the decoded spectrogram. The same training pipeline can also be used to train a subject-specific model from a single participant's data. (b) Inference pipeline. (c) SwinTW uses three stages of transformer blocks with spatial-temporal attention and temporal windowing to extract features across multiple subjects; transposed temporal convolution upsamples the temporal dimension back to the input length, and a prediction-head module generates speech parameters from the latent representation.
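The temporal-windowing idea in panel (c) can be sketched as follows: per-electrode neural features of shape (time, electrodes, dim) are partitioned into non-overlapping temporal windows, and attention is then restricted to tokens within each window. This is a minimal illustrative sketch; the window size and tensor layout here are assumptions, not the paper's exact settings.

```python
import numpy as np

def temporal_windows(x, win):
    """Partition (T, E, D) features into (T//win, win*E, D) token groups.

    Each group contains all electrodes over one temporal window, so
    attention computed per group is spatial-temporal but temporally local.
    """
    T, E, D = x.shape
    assert T % win == 0, "pad T to a multiple of the window size first"
    # (T, E, D) -> (T//win, win, E, D) -> (T//win, win*E, D)
    return x.reshape(T // win, win, E, D).reshape(T // win, win * E, D)

x = np.random.randn(16, 64, 32)   # 16 time steps, 64 electrodes, 32-dim features
tokens = temporal_windows(x, win=4)
print(tokens.shape)               # (4, 256, 32)
```

Because windows never mix time steps across their boundaries, the attention cost grows with the window size rather than the full sequence length.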
Objective: This study investigates speech decoding from neural signals captured by intracranial electrodes. Most prior work handles only electrodes arranged on a 2D grid (i.e., an electrocorticographic, or ECoG, array) and data from a single patient. We aim to design a deep-learning model architecture that can accommodate both surface (ECoG) and depth (stereotactic EEG, or sEEG) electrodes. The architecture should allow training on data from multiple participants with large variability in electrode placement, and the trained model should perform well on participants unseen during training.
Approach: We propose a novel transformer-based model architecture named SwinTW that can work with arbitrarily positioned electrodes by leveraging their 3D locations on the cortex rather than their positions on a 2D grid. We train both subject-specific models using data from a single participant and multi-patient models exploiting data from multiple participants.
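One way to make a decoder independent of any 2D grid layout is to attach each electrode's 3D location directly to its feature vector. The sketch below illustrates this idea, assuming MNI coordinates plus a learned ROI-index embedding are concatenated to per-electrode features; the embedding size and concatenation scheme are illustrative assumptions, not the exact SwinTW design.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rois = 100
# In a real model this table would be a learned embedding; here it is random.
roi_table = rng.standard_normal((n_rois, 8))

def electrode_tokens(signal_feat, mni_xyz, roi_idx):
    """Concatenate per-electrode features with MNI coords and ROI embeddings.

    signal_feat: (E, D) neural features per electrode
    mni_xyz:     (E, 3) MNI coordinates
    roi_idx:     (E,)   integer ROI labels
    """
    return np.concatenate([signal_feat, mni_xyz, roi_table[roi_idx]], axis=-1)

feat = electrode_tokens(rng.standard_normal((64, 32)),
                        rng.standard_normal((64, 3)),
                        rng.integers(0, n_rois, size=64))
print(feat.shape)  # (64, 43) = 32 + 3 + 8
```

Because every electrode is described by its own anatomical coordinates rather than a grid index, the same model can ingest grid, strip, or depth electrodes from any participant.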
Main Results: The subject-specific models using only low-density 8x8 ECoG data achieved a high Pearson correlation coefficient between the decoded and ground-truth spectrograms (PCC=0.817) averaged over N=43 participants, outperforming our prior convolutional ResNet model and the 3D Swin transformer model. Incorporating the additional strip, depth, and grid electrodes available in each participant (N=39) led to further improvement (PCC=0.838). For participants with only sEEG electrodes (N=9), subject-specific models still achieved comparable performance, with an average PCC=0.798. The multi-subject models achieved high performance on unseen participants, with an average PCC=0.765 in leave-one-out cross-validation.
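The reported metric, the Pearson correlation coefficient (PCC) between decoded and ground-truth spectrograms, can be computed as below. Flattening over time and frequency before correlating is an assumption made here for illustration; the paper's exact averaging scheme may differ.

```python
import numpy as np

def pcc(pred, target):
    """Pearson correlation between two spectrograms, flattened over all bins."""
    p, t = pred.ravel(), target.ravel()
    p = p - p.mean()
    t = t - t.mean()
    return float(p @ t / (np.linalg.norm(p) * np.linalg.norm(t)))

rng = np.random.default_rng(0)
gt = rng.standard_normal((128, 80))                 # (time, mel bins) ground truth
decoded = gt + 0.5 * rng.standard_normal(gt.shape)  # noisy "decoding"
print(pcc(decoded, gt))                             # correlation degraded by the noise
```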
Significance: The proposed SwinTW decoder enables future speech neuroprostheses to utilize any electrode placement that is clinically optimal or feasible for a particular participant, including using only depth electrodes, which are more routinely implanted in chronic neurosurgical procedures. Importantly, the generalizability of the multi-patient models suggests the exciting possibility of developing speech neuroprostheses for people with speech disability without relying on their own neural data for training, which is not always feasible.
| | Single Subject | Multi-Subject¹ | Multi-Subject on Unseen² |
|---|---|---|---|
| ECoG | | | |
| sEEG | | | |
| Some Failure Cases | | | |
¹ Multi-Subject: a multi-subject model trained on N subjects' training sets and evaluated on the test set of one of those N subjects.
² Multi-Subject on Unseen: a multi-subject model trained on N subjects' training sets and evaluated on the test set of a subject not seen during training.
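The "Multi-Subject on Unseen" evaluation above follows a leave-one-out protocol: for each held-out participant, a model is trained on the remaining participants' training sets and evaluated on the held-out participant's test set. A minimal sketch of the splitting logic (the training and evaluation steps themselves are omitted):

```python
def leave_one_out(subjects):
    """Yield (train_subjects, held_out_subject) splits for LOO cross-validation."""
    for held_out in subjects:
        train = [s for s in subjects if s != held_out]
        yield train, held_out

subjects = ["S1", "S2", "S3", "S4"]
for train, test in leave_one_out(subjects):
    print(test, "<-", train)   # each subject is held out exactly once
```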
CITE OUR WORK

@article{chen2024subject,
  title={Subject-Agnostic Transformer-Based Neural Speech Decoding from Surface and Depth Electrode Signals},
  author={Chen, Junbo and Chen, Xupeng and Wang, Ran and Le, Chenqian and Khalilian-Gourtani, Amirhossein and Jensen, Erika and Dugan, Patricia and Doyle, Werner and Devinsky, Orrin and Friedman, Daniel and others},
  journal={bioRxiv},
  year={2024},
  publisher={Cold Spring Harbor Laboratory Preprints}
}