The study they link, though, includes this among its conclusions:
Finally, we show that existing methods for alleviating racial bias in language models such as human feedback training do not mitigate the dialect prejudice, but can exacerbate the discrepancy between covert and overt stereotypes, by teaching language models to superficially conceal the racism that they maintain on a deeper level.
It feels like they have the same problem as hallucinations: the model learns core knowledge during base training and is then taught to ignore or invent some more, but does not acquire new knowledge.
Nothing in the article corroborates the claim in the title that human intervention made things worse, only that the problem goes deeper.