Or they go to adtech

nifty@lemmy.world · 4 months ago

Or they go to adtech

FooBarrington@lemmy.world · 4 months ago

Describing in machine-readable actionable terms what’s happening in an image isn’t a thing, as far as I know.

It is. That’s actually the basis of multimodal transformers - they have a shared embedding space for multiple modes of data (e.g. text and images). If you encode data and take those embeddings, you suddenly have a vector describing the contents of your input.