Cross-modal Sound Mapping Using Deep Learning

Fried, Ohad and Fiebrink, Rebecca

Proceedings of the International Conference on New Interfaces for Musical Expression

We present a method for automatic feature extraction and cross-modal mapping using deep learning. Our system uses stacked autoencoders to learn a layered feature representation of the data. Feature vectors from two (or more) different domains are mapped to each other, effectively creating a cross-modal mapping. Our system can either run fully unsupervised, or it can use high-level labeling to fine-tune the mapping according to a user’s needs. We show several applications for our method, mapping sound to or from images or gestures. We evaluate system performance both in standalone inference tasks and in cross-modal mappings.
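
A minimal sketch of the idea described in the abstract: one stacked autoencoder per modality learns a layered feature representation, and a small mapping between the two latent codes carries features across modalities. All dimensions, layer sizes, and the linear mapping network below are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    """Stacked autoencoder: input -> hidden -> code -> hidden -> input."""
    def __init__(self, input_dim, hidden_dim, code_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

# One autoencoder per modality (dimensions are placeholders).
sound_ae = StackedAutoencoder(input_dim=513, hidden_dim=256, code_dim=32)   # e.g. spectral frames
image_ae = StackedAutoencoder(input_dim=1024, hidden_dim=256, code_dim=32)  # e.g. image features

# Hypothetical cross-modal mapping between the two latent spaces.
mapping = nn.Linear(32, 32)

def map_sound_to_image(sound_frame):
    """Encode a sound frame, map its code to the image latent space, decode an image feature."""
    with torch.no_grad():
        _, sound_code = sound_ae(sound_frame)
        image_code = mapping(sound_code)
        return image_ae.decoder(image_code)

# Example: map a batch of 4 random "sound" frames into image-feature space.
print(map_sound_to_image(torch.randn(4, 513)).shape)  # torch.Size([4, 1024])
```

In practice each autoencoder would first be trained on its own modality (unsupervised), and the mapping could then be fine-tuned from user-provided example pairs, matching the supervised refinement mentioned in the abstract.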