So I learned about a new paper the other day that describes a "text2fabric" dataset.
To quote the opening figure of the paper:
Our text2fabric dataset links high-quality renderings of a large variety of fabric materials to natural language descriptions of their appearance. We conduct a thorough analysis of this dataset, and leverage it to fine-tune large-scale vision-language models for a variety of tasks.
It's not a small dataset either. It consists of 45,000 rendered images depicting samples of 3,000 different fabrics, together with 15,461 associated descriptions in natural language. You can find the dataset here.
There are a few interesting things here worth diving into as well. The main goal of the dataset is to combine text with fabric visuals, which led the authors to compare against CLIP. They even did a bit of fine-tuning and got some good results. It seems that their approach is pretty good at retrieval, even on unseen geometries.
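To make "retrieval" a bit more concrete, here is a minimal sketch of text-to-image retrieval over a shared embedding space, which is the general idea behind CLIP-style models. This is not the paper's code: the fabric names, the 3-d vectors, and the `retrieve` helper are all made up for illustration, and a real model would produce much larger embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, gallery, k=1):
    """Return the k gallery keys whose embeddings are most similar to the query."""
    ranked = sorted(
        gallery,
        key=lambda name: cosine(query, gallery[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors standing in for image embeddings (hypothetical, not from the paper).
image_embeddings = {
    "red_velvet": [0.9, 0.1, 0.0],
    "blue_denim": [0.1, 0.9, 0.1],
    "white_lace": [0.0, 0.2, 0.9],
}

# Toy vector standing in for the embedding of a text description (also hypothetical).
text_embedding = [0.8, 0.2, 0.1]

# Rank the fabric images by similarity to the text description.
top = retrieve(text_embedding, image_embeddings, k=2)
print(top)
```

The ranking is just "sort by cosine similarity", so the quality of the retrieval depends entirely on how well the embeddings capture what the text describes.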
But they also make a nice comparison with CLIP directly.
The paper also goes into a fair bit of depth to explain how the CLIP embeddings differ from those of their fine-tuned approach. CLIP seems biased towards the geometry of the input image and less focussed on the material appearance.
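One simple way to picture what a geometry bias could look like is to compare how similar embeddings are when two images share a geometry versus when they share a material. The sketch below does that on toy data. It is my own illustration, not the analysis from the paper: the sample records, the 2-d vectors, and the `mean_pair_similarity` helper are all assumptions, and the vectors are deliberately constructed so that geometry dominates.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pair_similarity(samples, shared, differing):
    """Average similarity over pairs that agree on `shared` but not on `differing`."""
    sims = [
        cosine(a["embedding"], b["embedding"])
        for a, b in combinations(samples, 2)
        if a[shared] == b[shared] and a[differing] != b[differing]
    ]
    return sum(sims) / len(sims)

# Hypothetical embeddings: the first axis tracks geometry, the second tracks material.
samples = [
    {"material": "velvet", "geometry": "sphere", "embedding": [1.0, 0.2]},
    {"material": "denim", "geometry": "sphere", "embedding": [1.0, -0.2]},
    {"material": "velvet", "geometry": "drape", "embedding": [-1.0, 0.2]},
    {"material": "denim", "geometry": "drape", "embedding": [-1.0, -0.2]},
]

# Same shape, different fabric.
same_geometry = mean_pair_similarity(samples, "geometry", "material")
# Same fabric, different shape.
same_material = mean_pair_similarity(samples, "material", "geometry")

print(f"same geometry: {same_geometry:.3f}, same material: {same_material:.3f}")
```

If the "same geometry" number comes out much higher than the "same material" one, the embedding is grouping images by shape rather than by fabric, which is the kind of behaviour you would not want for material retrieval.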
Honestly, it's a fun paper to glance through. Feel free to give it a read!