Existing works

CLAP: Contrastive Language-Audio Pretraining. Analogous to CLIP, but it pairs text with audio instead of images.

AudioCraft (Meta): an audio autoencoder plus text-to-music and text-to-audio models. The autoencoder converts an audio stream into a compressed multi-band token stream.
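To make the CLAP/CLIP analogy concrete, here is a minimal sketch of a symmetric contrastive objective over paired audio and text embeddings. It is not CLAP's actual code: the function names, the toy 2-dimensional embeddings, and the temperature of 0.07 are assumptions chosen for illustration, and real systems obtain the embeddings from learned audio and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; audio_emb[i] and text_emb[i] form a matching pair.

    The temperature value is an assumed hyperparameter, not taken from the source.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    # logits[i][j] = scaled cosine similarity between audio clip i and caption j
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings (hypothetical): two audio clips and their two captions.
audio = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = contrastive_loss(audio, matched_text)
shuffled_loss = contrastive_loss(audio, shuffled_text)
```

With these toy inputs, the loss for correctly paired embeddings is much smaller than the loss for shuffled pairs, which is the signal that pulls matching audio and text together in the shared embedding space.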