• FooBarrington@lemmy.world
    link
    fedilink
    English
    arrow-up
    3
    ·
    4 months ago

    Describing in machine-readable actionable terms what’s happening in an image isn’t a thing, as far as I know.

    It is. That’s actually the basis of multimodal transformers - they have a shared embedding space for multiple modes of data (e.g. text and images). If you encode data and take those embeddings, you suddenly have a vector describing the contents of your input.