Sorry! Fixed link now! Well, they support text, image, and audio, while most embedding models only support one or two modalities.
E.g. if you make music, you can embed your files and then search for samples by description, or by imitating the sound with your mouth. Or take a drawing of a monster and search for the sound that monster should make when you're creating a game. Larger models generally provide better embeddings, but embeddings pulled from generative (GPT-style) models like Qwen3.5 are generally poor. Their latest dedicated embedding models are the Qwen3-vl-embedding ones, but those don't cover all three modalities.
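To make the "search samples by description" idea concrete, here's a toy sketch of how cross-modal search works. The `embed()` function is a made-up stand-in for a real multimodal embedding model (it returns hand-made vectors so the example runs on its own); in practice you'd call whatever model you're using and it would map text, images, and audio into one shared vector space.

```python
import math

def embed(item: str) -> list[float]:
    # Hypothetical stand-in: a real multimodal model would map text, image,
    # or audio inputs into a shared vector space. These toy vectors fake
    # that shared space so the sketch is self-contained.
    fake_space = {
        "text: growling monster":    [0.90, 0.10, 0.00],
        "audio: growl_sample.wav":   [0.85, 0.15, 0.05],
        "audio: birdsong.wav":       [0.00, 0.20, 0.95],
    }
    return fake_space[item]

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(query: str, library: list[str]) -> str:
    # Return the library item whose embedding is closest to the query's.
    q = embed(query)
    return max(library, key=lambda item: cosine(q, embed(item)))

samples = ["audio: growl_sample.wav", "audio: birdsong.wav"]
print(search("text: growling monster", samples))
```

The key point is that the query and the library items can be different modalities: because the model puts them in the same space, a text description lands near the audio clip that matches it, and plain nearest-neighbor search does the rest.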