(1) Andrew J. Peterson, University of Poitiers (andrew.peterson@univ-poitiers.fr).
We provide a theoretical framework for defining “knowledge collapse”, whereby dependence on generative AI such as large language models may lead to a reduction in the long tails of knowledge. Our simulation study suggests that such harm can be mitigated to the extent that (a) we are aware of the possible value of niche, specialized and eccentric perspectives that may be neglected by AI-generated data and continue to seek them out, (b) AI systems are not recursively interdependent, as occurs if they use other AI-generated content as inputs or suffer from other generational effects, and (c) AI-generated content is as representative as possible of the full distribution of knowledge.
Each of these suggests practical implications for how to manage AI adoption. First, while our work does not justify an outright ban, measures should be put in place to ensure safeguards against widespread or complete reliance on AI models. For every hundred people who read a one-paragraph summary of a book, there should be a human somewhere who takes the time to sit down and read it, in hopes that she can then provide feedback on distortions or simplifications introduced elsewhere. One extension to the model would be to allow for generational change but endogenize the choice of public subsidies to protect ‘tail’ knowledge. This is arguably what is done by governments that support academic and artistic endeavors that would otherwise be underprovided by the private market. Protecting the diversity of information also means paying attention to the effect of AI adoption on the revenue streams of journalists who produce and do not merely transmit information (e.g. Cagé, 2016).
Secondly, there is an obvious need to avoid building recursively dependent AI systems (e.g. where one LLM or agent provides answers based on another AI-generated summary, etc.), which amounts to playing an LLM-mediated game of ‘telephone’. At a minimum, this requires a concerted effort to distinguish human- from AI-generated data. Preserving access to ‘unmediated’ texts, such as through a well-conceived retrieval-augmented generation approach, can preserve the long tails of knowledge (Delile et al., 2024), as may generating multiple results and re-ranking them (Li et al., 2023).
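The ‘telephone’ effect can be illustrated with a toy sketch. It is not the simulation used in our study: it assumes, purely for illustration, that knowledge is a standard normal sample, that each generation of a model sees only the central part of the previous generation's output, and that it then refits and resamples a normal distribution. The function names and the parameters `keep_fraction`, `n_samples` and `generations` are hypothetical.

```python
import random
import statistics


def truncate_to_center(samples, keep_fraction):
    """Keep the central `keep_fraction` of the samples, dropping both tails."""
    ordered = sorted(samples)
    n = len(ordered)
    drop = int(n * (1.0 - keep_fraction) / 2)  # number dropped from each tail
    return ordered[drop:n - drop] if drop > 0 else ordered


def simulate_generations(n_samples=5000, keep_fraction=0.9, generations=5, seed=0):
    """Return the standard deviation of 'knowledge' at each generation.

    Assumption (illustrative only): generation t+1 is a normal distribution
    fitted to the truncated output of generation t.
    """
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    spreads = [statistics.pstdev(samples)]
    for _ in range(generations):
        kept = truncate_to_center(samples, keep_fraction)
        mu = statistics.fmean(kept)
        sigma = statistics.pstdev(kept)
        # The next generation only ever sees the refitted, narrower distribution.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        spreads.append(statistics.pstdev(samples))
    return spreads


if __name__ == "__main__":
    for generation, spread in enumerate(simulate_generations()):
        print(f"generation {generation}: std = {spread:.3f}")
```

Under these assumptions the spread shrinks at every generation when `keep_fraction` is below 1, and stays roughly constant when nothing is truncated (`keep_fraction=1.0`), which corresponds to each generation retaining access to the full, unmediated distribution.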
Finally, while much recent attention has been on the problem of LLMs misleadingly presenting fiction as fact (hallucination), this may be less of an issue than the problem of representativeness across a distribution of possible responses. Hallucination of verifiable, concrete facts is often easy to correct for. Yet many real-world questions do not have well-defined, verifiably true and false answers. If a user asks, for example, “What causes inflation?” and an LLM answers “monetary policy”, the problem isn’t one of hallucination, but of the failure to reflect the full distribution of possible answers to the question, or at least provide an overview of the main schools of economic thought.
This could be considered in the setup of frameworks for reinforcement learning from human feedback and related approaches to shaping model outputs, since humans may by default prefer simple, monolithic answers over those that represent the diversity of perspectives. Particular care should also be taken when AI is used in education, to ensure students consider not only the veracity of AI-generated answers but also their variance, representativeness, and biases, that is, to what extent they represent the full distribution of possible answers to a question.
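One way to make representativeness concrete is to sample a model several times on the same question, code each answer into a category, and compare the resulting distribution with a reference distribution. The sketch below is a minimal illustration and not a method from our study: the answer labels, the reference shares and the function names are all hypothetical, and in practice the hard part is coding free-text answers and choosing a defensible reference.

```python
import math
from collections import Counter


def answer_distribution(answers):
    """Turn a list of categorical answer labels into relative frequencies."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def shannon_entropy(dist):
    """Entropy in bits; 0 means every sampled answer fell in one category."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


if __name__ == "__main__":
    # Hypothetical data: ten sampled answers to "What causes inflation?",
    # all coded into the same category.
    sampled = answer_distribution(["monetary policy"] * 10)
    # Hypothetical reference shares, chosen only for illustration.
    reference = {"monetary policy": 0.5, "demand shocks": 0.25, "supply shocks": 0.25}
    print("entropy of sampled answers:", shannon_entropy(sampled))
    print("entropy of reference:", shannon_entropy(reference))
    print("distance from reference:", total_variation(sampled, reference))
```

A low entropy together with a large distance from the reference would flag an answer set that is not false in any particular instance but is monolithic relative to the range of views it ought to reflect.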
The scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) demonstrate the advantage of training LLMs on the maximum amount of (quality) data. A valuable empirical question is therefore whether this leads to increasing or decreasing diversity within the training data (which raises the related problem of the lack of transparency in the data used to train models). There are many diverse texts that could be included to expand the corpus, but in practice, market-focused participants may concentrate on texts with the lowest marginal cost (conditional on quality). This might exacerbate a reliance on texts that are not representative of the general public, for example if social media texts are easy to collect but do not reflect the perspective of people who lack access to social media or self-select out of it. Or, optimistically, companies with a global audience might be incentivized to seek out “low and very-low resource languages” (e.g. Gemini Team et al., 2023) and perhaps even the viewpoints and cultural perspectives of diverse users. Consideration should be given to ensuring and encouraging such diverse inputs as well as to monitoring the diversity of outputs.
