• howrar@lemmy.ca · 2 points · 18 minutes ago

    I find it amusing that everyone is answering the question with the assumption that the premise of OP’s question is correct. You’re all hallucinating the same way that an LLM would.

    LLMs are rarely trained on a single source of data exclusively. All the big ones you find will have been trained on a huge dataset including Reddit, research papers, books, letters, government documents, Wikipedia, GitHub, and much more.

    Example datasets:

  • TheOubliette@lemmy.ml · 13 points · 2 hours ago

    “AI” is a parlor trick. Very impressive at first, then you realize there isn’t much to it that is actually meaningful. It regurgitates language patterns, patterns in images, etc. It can make a great Markov chain. But if you want to create an “AI” that just mines research papers, it will be unable to do useful things like synthesize information or describe the state of a research field. It is incapable of critical or analytical approaches. It will only be able to answer simple questions with dubious accuracy and to summarize texts (also with dubious accuracy).

    Let’s say you want to understand research on sugar and obesity using only a corpus of peer-reviewed articles. You want to ask something like, “what is the relationship between sugar and obesity?”. What will LLMs do when you ask this question? They will just make associations and construct reasonable-sounding sentences based on their set of research articles. They might even take an actual sentence from an article and reframe it a little, just like a high schooler trying to get away with plagiarism. But they won’t be able to actually explain the underlying mechanisms, and they will fall flat on their face when trying to discern nonsense funded by food lobbies from critical research. LLMs do not think or criticize. If they do produce an answer that suggests controversy, it will be because they either recognized diversity in the papers or, more likely, their corpus contains review articles that criticize articles funded by the food industry. But they will be unable to actually criticize the poor work, or to summarize the relationship between sugar and obesity based on any real understanding that questions, for example, whether this is even a valid question to ask in the first place (bodies are not simple!). They can only copy and mimic.

    • howrar@lemmy.ca · 1 point · 11 minutes ago

      Why does everyone keep calling them Markov chains? They’re missing all the required properties, including the eponymous Markovian property. Wouldn’t it be more correct to call them stochastic processes?

      • TheOubliette@lemmy.ml · 1 point · 6 minutes ago

        Because it’s close enough. Turn off beam search, redefine your state space as the whole context window, and the Markov property holds.
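
        A minimal sketch of that point (toy code; the transition table and all names are illustrative stand-ins for a model’s next-token distribution): if the “state” is the entire context window, each sampling step depends only on the current state, which is exactly the Markov property.

```python
import random

random.seed(0)

# Toy stand-in for an LLM's next-token distribution, conditioned on the
# full context so far (table entries and probabilities are made up).
def next_token_dist(state: tuple) -> dict:
    table = {
        (): {"the": 1.0},
        ("the",): {"cat": 0.5, "dog": 0.5},
        ("the", "cat"): {"sat": 1.0},
        ("the", "dog"): {"ran": 1.0},
    }
    return table.get(state, {"<eos>": 1.0})

def step(state: tuple) -> tuple:
    """One sampling step: the next state depends ONLY on the current state."""
    dist = next_token_dist(state)
    tokens, probs = zip(*dist.items())
    return state + (random.choices(tokens, weights=probs)[0],)

state = ()
while "<eos>" not in state:
    state = step(state)
print(" ".join(state[:-1]))  # one of the two sentences the table encodes
```

        With greedy decoding instead of sampling, the chain even becomes deterministic given the state.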

  • Stepos Venzny@beehaw.org · 13 points · 5 hours ago

    Training it on research papers wouldn’t make it smarter, it would just make it better at mimicking their writing style.

    Don’t fall for the hype.

    • spongebue@lemmy.world · 11 points · 5 hours ago (edited)

      Machine learning has some pretty cool potential in certain areas, especially in the medical field. Unfortunately the predominant use of it now is slop produced by copyright laundering shoved down our throats by every techbro hoping they’ll be the next big thing.

    • UlyssesT [he/him]@hexbear.net · 10 points · 7 hours ago

      It’s marketing hype, even in the name. It isn’t “AI” as decades of the actual AI field would define it, but credulous nerds really want their cyberpunkerino fantasies to come true so they buy into the hype label.

      • FaceDeer@fedia.io · +9/−1 · 6 hours ago

        The term AI was coined in 1956 at a computer science conference and was used to refer to a broad range of topics that certainly would include machine learning and neural networks as used in large language models.

        I don’t get the “it’s not really AI” point that keeps being brought up in discussions like this. Are you thinking of AGI, perhaps? That’s the sci-fi “artificial person” variety, which LLMs aren’t able to manage. But that’s just a subset of AI.

      • queermunist she/her@lemmy.ml · +5/−1 · 6 hours ago

        Yeah, these are pattern reproduction engines. They can predict the most likely next thing in a sequence, whether that’s words or pixels or numbers or whatever. There’s nothing intelligent about it and this bubble is destined to pop.

        • UlyssesT [he/him]@hexbear.net · 2 points · 6 hours ago

          That “Frightful Hobgoblin” computer toucher would insist otherwise, claiming that a sufficient number of Game Boys bolted together equals or even exceeds human sapience, but I think that user is currently too busy being a bigoted sex pest.

    • Melatonin@lemmy.dbzer0.com (OP) · 7 points · 4 hours ago

      Hmmm. Not sure if I’m being insulted. Is that one of those fish fossils that looks kind of like a horseshoe crab?

      • Tabooki@lemmy.world · +1/−2 · 4 hours ago

        Dictionary definition (Oxford Languages): noun. 1. (especially in prehistoric times) a person who lived in a cave. 2. a hermit. 3. a person who is regarded as being deliberately ignorant or old-fashioned.

  • Rampsquatch@sh.itjust.works · 20 points · 6 hours ago

    You could feed all the research papers in the world to an LLM and it will still have zero understanding of what you trained it on. It will still make shit up, it can’t save the world.

  • lattrommi@lemmy.ml · 2 points · 3 hours ago

    I think I read this post wrong.

    I was thinking the sentence “We could be saving the world!” meant ‘we’ as in humans only.

    No need to be training AI. No need to do anything with AI at all. Humans simply start saving the world. Our Research Papers can train on Reddit. We cannot be training, we are saving the world. Let the Research Papers run a train on Reddit AI. Humanity Saves World.

    No cynical replies please.

  • ryathal@sh.itjust.works · 28 points · 7 hours ago

    Both are happening. Samples of casual writing are more valuable to use to generate an article than research papers though.

    • FaceDeer@fedia.io · +6/−1 · 6 hours ago

      Yeah. Scientific papers may teach an AI about science, but Reddit posts teach AI how to interact with people and “talk” to them. Both are valuable.

      • geekwithsoul@lemm.ee · +8/−2 · 6 hours ago

        Hopefully not too pedantic, but no one is “teaching” AI anything. They’re just feeding it data in the hopes that it can learn probabilities for certain types of output. It “understands” neither the Reddit post nor the scientific paper.

        • Zexks@lemmy.world · +2/−3 · 5 hours ago

          Describe how you ‘learned’ to speak. How do you know what word comes after the next? Until you can describe this process in a way that isn’t exclusive to humans or biology, it’s no different. The only thing they can’t do is adjust their weights dynamically, but that’s a limitation we gave them, not something intrinsic to the system.

          • geekwithsoul@lemm.ee · +4/−1 · 5 hours ago

            I inherited brain structures that are natural language processors. As well as the ability to understand and repeat any language sounds. Over time, my brain focused in on only the language sounds I heard the most and through trial and repetition learned how to understand and make those sounds.

            AI - as it currently exists - is essentially a babbling infant with none of the structures necessary to do anything more than repeat sounds back without understanding any of them. Anyone who tells you different is selling you something.

  • ImplyingImplications@lemmy.ca · 24 points · 7 hours ago

    Because AI needs a lot of training data to reliably generate something appropriate. It’s easier to get millions of reddit posts than millions of research papers.

    Even then, LLMs simply generate text but have no idea what the text means. It just knows those words have a high probability of matching the expected response. It doesn’t check that what was generated is factual.
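
    A toy illustration of that (not any real model): a bigram generator built from word-pair counts will happily emit a fluent sentence that is false, because nothing in it checks facts, only frequencies.

```python
import random
from collections import Counter, defaultdict

random.seed(1)

corpus = "the moon orbits the earth . the earth orbits the sun .".split()

# "Training": count which word follows which.
following = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    following[a][b] += 1

def generate(start: str, steps: int = 6) -> str:
    """Sample a continuation word-by-word from the pair frequencies."""
    word, out = start, [start]
    for _ in range(steps):
        nxt = following[word]
        if not nxt:
            break
        word = random.choices(list(nxt), weights=list(nxt.values()))[0]
        out.append(word)
    return " ".join(out)

print(generate("the"))  # may produce e.g. "the moon orbits the sun": fluent, wrong
```

    Every individual word pair here is common in the "training data", so the output always sounds plausible; whether the whole sentence is true never enters into it.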

  • Destide@feddit.uk · 12 points · 7 hours ago

    Redditors are always right, peer reviewed papers always wrong. Pretty obvious really. :D

  • tiddy@sh.itjust.works · 9 points · 7 hours ago

    Papers are most importantly a documentation of exactly what and how a procedure was performed, adding a vagueness filter over that is only going to decrease its value infinitely.

    Real question is why are we using generative ai at all (gets money out of idiot rich people)

  • cobysev@lemmy.world · 6 points · 6 hours ago

    We are. I just read an article yesterday about how Microsoft paid research publishers so they could use the papers to train AI, with or without the consent of the papers’ authors. The publishers also reduced the peer review window so they could publish papers faster and get more money from Microsoft. So… expect AI to be trained on a lot of sloppy, poorly-reviewed research papers because of corporate greed.