Existing works

  • CLAP: Contrastive Language-Audio Pretraining. Analogous to CLIP, but it aligns audio with text (instead of images with text) in a shared embedding space.
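  • AudioCraft by Meta: A neural audio autoencoder (EnCodec) plus text-to-music (MusicGen) and text-to-audio (AudioGen) models. The autoencoder converts an audio stream into a compressed stream of discrete tokens, organized as several parallel codebook streams, which the generative models operate on.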
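
To make the "contrastive" part of CLAP concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired audio and text embeddings, the general recipe used by CLIP-like models. It is not CLAP's actual code: the function names, the plain-list representation, and the temperature value of 0.07 are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss (hypothetical helper, not a library API).

    audio_emb[i] and text_emb[i] are assumed to describe the same clip.
    The temperature is an illustrative assumption.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    text = [l2_normalize(t) for t in text_emb]
    # logits[i][j]: scaled cosine similarity between audio i and text j
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        for a in audio
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

The loss is small when each audio embedding is most similar to its own caption and large when the pairs are mismatched, which is what pushes matching audio and text together in the shared space.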