Can Large Language Models Truly Mimic Household Surveys? New Research Reveals Critical Limitations

Laily UPN, May 21, 2026

Recent advances in artificial intelligence have raised interest in using Large Language Models (LLMs) to simulate household surveys. Initial findings suggested that LLMs, when prompted with specific personas and knowledge cutoffs, could replicate the average responses of major household surveys, landing within a single percentage point for inflation expectations. For instance, in 2020 the Survey of Consumer Expectations (SCE) reported a median one-year-ahead inflation expectation of approximately 3%. LLMs tasked with simulating 6,000 American households matched this median closely. This early success led to proposals for using LLMs as a cost-effective, high-frequency complement to established surveys such as the SCE, the University of Michigan Survey of Consumers, and the Survey of Professional Forecasters.

However, a deeper dive into the underlying data distribution reveals a significant caveat, challenging the widespread adoption of LLMs for survey replication. New research, detailed in the paper "Can LLMs Mimic Household Surveys?," co-authored by Ami Dalloul from the University of Duisburg-Essen and Markus Pfeifer, highlights a critical deficiency: the "second moment" of the data, which describes the dispersion or spread of opinions, is poorly represented by current LLMs. While the average inflation expectation generated by an LLM might align with real-world surveys, the diversity of individual responses within the simulated population often collapses into a narrow band, suggesting a lack of genuine heterogeneity in beliefs.

The Illusion of Diversity: Mode Collapse in LLM Simulations

The research, which rigorously benchmarked five prominent LLMs—Llama-3 (8B and 70B versions), Claude-3.7-Sonnet, DeepSeek-V3, and GPT-4o—against established surveys like the SCE, the University of Michigan Survey, and the Survey of Professional Forecasters, found a pervasive issue termed "mode collapse." In human surveys, a significant portion of respondents, ranging from 44% to 70%, provide answers that deviate by more than three percentage points from the modal or most frequent response. In stark contrast, LLM-generated samples exhibited almost zero such deviations.

This lack of dispersion means that even when an LLM accurately predicts the average inflation expectation, the simulated population behind it is far from realistic. Real-world survey data shows a broad spread of inflation expectations, from roughly -25% to +27% in the 2020 SCE, whereas LLM simulations with thousands of personas often confine 95% of their simulated respondents within a narrow two-percentage-point window. Instead of generating thousands of distinct opinions, the LLM effectively produces one representative agent whose answer barely changes with the persona in the prompt.
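The two dispersion checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names, the rounding of answers to whole percentage points before taking the mode, and the toy Gaussian samples are all assumptions made for the example.

```python
from collections import Counter
import random


def tail_share(responses, threshold=3.0):
    """Share of responses more than `threshold` points from the modal answer."""
    rounded = [round(r) for r in responses]        # bin to whole percentage points (assumption)
    mode = Counter(rounded).most_common(1)[0][0]   # most frequent binned answer
    far = sum(1 for r in responses if abs(r - mode) > threshold)
    return far / len(responses)


def central_width(responses, coverage=0.95):
    """Width of the central interval holding roughly `coverage` of the responses."""
    ordered = sorted(responses)
    n = len(ordered)
    lo = ordered[int((1 - coverage) / 2 * n)]
    hi = ordered[min(n - 1, int((1 + coverage) / 2 * n))]
    return hi - lo


rng = random.Random(0)
# Toy stand-ins, not survey data: one dispersed sample and one collapsed sample.
human_like = [rng.gauss(3.0, 6.0) for _ in range(6000)]
llm_like = [rng.gauss(3.0, 0.4) for _ in range(6000)]

for name, sample in [("dispersed", human_like), ("collapsed", llm_like)]:
    print(name, round(tail_share(sample), 3), round(central_width(sample), 2))
```

Both samples share the same center, so a comparison of means or medians would not separate them; only the tail share and the interval width do.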

Unpacking the Root Cause: Training Data and Memorization

The researchers explored various methods commonly used in survey simulation literature to enhance realism, including employing census-derived personas with complex characteristics, implementing zero-shot knowledge-cutoff instructions (e.g., "you do not know events after June 2018"), and explicitly instructing models "not to look up statistics." Despite these efforts, the LLMs consistently defaulted to the same narrow distribution.
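A prompt that combines these three devices might be assembled as in the sketch below. The persona fields, the wording of the survey question, and the `build_prompt` helper are hypothetical; only the knowledge-cutoff and no-lookup instructions follow the phrasing quoted above.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    # Hypothetical fields; the paper's census-derived personas may differ.
    age: int
    state: str
    income_bracket: str
    education: str


def build_prompt(persona: Persona, cutoff: str) -> str:
    """Assemble a persona prompt with a knowledge cutoff and a no-lookup rule."""
    return "\n".join([
        f"You are a {persona.age}-year-old living in {persona.state}.",
        f"Household income: {persona.income_bracket}. Education: {persona.education}.",
        f"You do not know events after {cutoff}.",
        "Do not look up statistics; answer from your own experience.",
        # Illustrative question wording, not the survey's exact text.
        "What do you expect the rate of inflation to be over the next 12 months? "
        "Reply with a single number in percent.",
    ])


print(build_prompt(Persona(42, "Ohio", "$50k-$75k", "some college"), "June 2018"))
```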

The most probable explanation, according to the paper, lies in the LLMs’ training data. These models are trained on vast corpora that include official inflation records (like Consumer Price Index data), news coverage of economic surveys, and academic research that analyzes and replicates survey results. Consequently, when prompted to provide inflation expectations, LLMs are likely engaging in data retrieval from their memorized knowledge base rather than generating novel opinions based on simulated individual reasoning. The sheer weight of this memorized statistical information appears to override the nuanced instructions provided in the prompts, leading to a homogenization of responses.

Towards More Realistic Simulations: The Power of Unlearning

Recognizing that memorization is a core issue, the researchers investigated "unlearning" techniques as a potential solution. Unlearning aims to remove specific information, such as official statistics, from the LLM’s internal weights. The study applied two unlearning methods to the open-source Llama-3.1-8B-Instruct model, allowing for direct modification of its parameters:

Gradient Ascent (GA): This method involves fine-tuning the model to maximize its prediction error on a dataset of official CPI statistics while simultaneously minimizing error, or retaining information, from a dataset of micro-survey data.
Negative Preference Optimization (NPO): This approach treats official statistics as negative samples, penalizing their generation, while using survey data as positive samples to encourage their reproduction.
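The two objectives can be written down compactly. The sketch below assumes the standard formulations from the unlearning literature, operating on per-sequence log-probabilities; the paper's exact losses, the `beta` and `retain_weight` values, and the function names are assumptions, not details reported in this article.

```python
import math


def nll(logps):
    """Mean negative log-likelihood of a batch of sequence log-probabilities."""
    return -sum(logps) / len(logps)


def ga_loss(forget_logps, retain_logps, retain_weight=1.0):
    """Gradient ascent on the forget set, ordinary descent on the retain set.

    Minimizing this raises the error on `forget_logps` (official statistics)
    while keeping the error on `retain_logps` (micro-survey data) low.
    """
    return -nll(forget_logps) + retain_weight * nll(retain_logps)


def npo_loss(forget_logps, ref_forget_logps, retain_logps,
             beta=0.1, retain_weight=1.0):
    """NPO forget term plus a retain term (hypothetical weighting).

    The forget term is (2 / beta) * log(1 + (pi_theta / pi_ref) ** beta),
    averaged over the forget set. It shrinks as the tuned model assigns less
    probability to the forget data than the frozen reference model does.
    """
    forget = sum(
        (2.0 / beta) * math.log1p(math.exp(beta * (lp - ref_lp)))
        for lp, ref_lp in zip(forget_logps, ref_forget_logps)
    ) / len(forget_logps)
    return forget + retain_weight * nll(retain_logps)


# Made-up log-probabilities for illustration.
forget, ref, retain = [-2.0, -3.0], [-2.0, -3.0], [-1.0, -1.5]
print(ga_loss(forget, retain), npo_loss(forget, ref, retain))
```

Unlike plain gradient ascent, whose forget term is unbounded below, the NPO forget term is bounded at zero, which is the usual argument for its greater training stability.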

The target data for unlearning was the official inflation record itself, including monthly CPI series and published mean inflation expectations from the FRBNY SCE and Michigan surveys. Both strategies changed the distribution of responses substantially, as shown by the "Tail Accuracy" metric, which measures how closely the synthetic distribution matches the dispersion of the FRBNY SCE benchmark, where 44.38% of responses fall more than 3.0 percentage points from the mode.

Before unlearning, the baseline Llama-3 model (even with prompt-based "unlearning") matched the mode almost perfectly: 92% of replies hit it exactly, and virtually none deviated by more than 3 percentage points. The result was a tail accuracy of 0%, far from the human benchmark.

After Gradient Ascent (GA), the exact mode match dropped to 24%, and 43% of replies fell outside the +/- 3 percentage point window, raising tail accuracy to 97%. Negative Preference Optimization (NPO) yielded comparable results, with 37% of replies falling outside the window and a reported tail accuracy of 98%. These figures indicate that both unlearning methods recovered a more realistic distribution of inflation expectations, moving away from the collapsed mode observed in standard LLM outputs.

Visualizing the Transformation: Dispersion and Kernel Densities

Further analysis using kernel density estimates (KDEs) provided a clear visual representation of this transformation. Off-the-shelf LLMs, as depicted in Figure 2, tend to concentrate probability mass into a thin spike around the mean, failing to capture the breadth of human responses. In contrast, the unlearned variants of Llama-3 (GA and NPO) demonstrated a more realistic spread of probability mass across the range where human respondents in the SCE survey placed their expectations. While these unlearned models still showed slightly more concentration and a tendency towards higher means compared to the human benchmark, they significantly improved the representation of distributional heterogeneity.
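A kernel density estimate of this kind is simple to compute. The sketch below uses a Gaussian kernel with a fixed bandwidth on toy samples; the bandwidth, the grid, and the sample parameters are assumptions chosen for illustration and are not taken from the paper.

```python
import math
import random


def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of `sample` with Gaussian kernels."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density


rng = random.Random(1)
# Toy samples, not survey data.
collapsed = [rng.gauss(3.0, 0.4) for _ in range(500)]
dispersed = [rng.gauss(3.0, 6.0) for _ in range(500)]

grid = [x * 0.5 for x in range(-50, 61)]   # -25% to +30% in half-point steps
for name, sample in [("collapsed", collapsed), ("dispersed", dispersed)]:
    kde = gaussian_kde(sample, bandwidth=0.5)
    peak = max(kde(x) for x in grid)
    # Approximate probability mass more than 3 points away from the center.
    mass_far = sum(kde(x) * 0.5 for x in grid if abs(x - 3.0) > 3.0)
    print(name, round(peak, 3), round(mass_far, 3))
```

The collapsed sample yields a tall, thin spike with almost no mass in the tails, while the dispersed sample yields a low, wide curve, which is the contrast the figure is described as showing.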

Replicating Randomized Controlled Trials: The Next Frontier

The drive to improve LLM-generated survey data stems from a desire to replicate complex research designs, such as Randomized Controlled Trials (RCTs), at a lower cost and with greater flexibility. Traditional RCTs, especially those involving surveys, are expensive and time-consuming. Once data collection is complete, researchers cannot revisit earlier stages to test new hypotheses or alter experimental conditions. Synthetic agents, if they accurately reflect human behavior, could offer a powerful alternative, allowing for dynamic exploration of economic phenomena.

To test this potential, the researchers replicated a real-world RCT conducted by Coibion, Gorodnichenko, and Weber (2022). In this experiment, survey participants were randomly assigned to different groups. A control group received no information, while treatment groups were exposed to specific economic data—such as past inflation rates or the Federal Reserve’s 2% inflation target. A placebo group was shown unrelated content. All participants reported their inflation expectations before and after receiving the information, allowing researchers to measure the "revision" in their expectations. A treatment is considered effective if these revisions differ significantly from the control group and align with theoretical predictions (e.g., downward revisions following Fed communication).
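The revision measure and the comparison against the control group can be sketched as follows. This is a simplified illustration using a difference in mean revisions with a Welch t-statistic; the original study's estimation approach may differ, and all numbers below are made up.

```python
import math
from statistics import mean, variance


def revisions(prior, posterior):
    """Per-respondent revision: posterior expectation minus prior expectation."""
    return [post - pre for pre, post in zip(prior, posterior)]


def treatment_effect(treated_rev, control_rev):
    """Difference in mean revisions and its Welch t-statistic."""
    diff = mean(treated_rev) - mean(control_rev)
    se = math.sqrt(variance(treated_rev) / len(treated_rev)
                   + variance(control_rev) / len(control_rev))
    return diff, diff / se


# Made-up priors and posteriors for illustration only.
control = revisions([4.0, 6.0, 3.0, 8.0, 5.0], [4.0, 5.5, 3.5, 8.0, 5.0])
treated = revisions([4.0, 7.0, 3.0, 9.0, 5.0], [3.0, 4.0, 2.5, 5.0, 3.5])
effect, t_stat = treatment_effect(treated, control)
print(round(effect, 2), round(t_stat, 2))
```

A negative effect here means the treated group revised its expectations downward relative to the control group, the direction theory predicts after Fed communication.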

Assessing Synthetic Agent Behavior in an RCT Context

The study constructed 30,000 synthetic personas with demographics mirroring those used in the original RCT. These personas were generated using three LLM variants: the baseline Llama-3, and its two unlearned versions (GA and NPO). The first crucial check involved examining the "priors"—the initial inflation expectations reported by these synthetic agents before any information was presented.

Figure 3 illustrates the mean and standard deviation of these priors across various demographic subgroups. Llama-GA came close to the human aggregate in both the level and dispersion of inflation expectations, but it did not replicate the specific within-demographic ordering of expectations observed in the human benchmark. Llama-3 and Llama-NPO, in contrast, showed largely flat responses across demographic characteristics, failing to capture the differences seen in human populations. This suggests that unlearning can improve dispersion but is not a universally applicable solution, with GA proving more effective in this instance than NPO.
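The subgroup comparison amounts to computing the mean and standard deviation of priors within each demographic cell. A minimal sketch, with hypothetical field names and made-up records:

```python
from collections import defaultdict
from statistics import mean, stdev


def prior_stats_by_group(records, key):
    """Mean and standard deviation of priors within each value of `key`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["prior"])
    # stdev needs at least two observations, so singleton groups are skipped.
    return {g: (mean(v), stdev(v)) for g, v in groups.items() if len(v) > 1}


# Made-up records; the field names are hypothetical.
records = [
    {"education": "college", "prior": 2.0},
    {"education": "college", "prior": 4.0},
    {"education": "no college", "prior": 5.0},
    {"education": "no college", "prior": 9.0},
]
print(prior_stats_by_group(records, "education"))
```

A model with flat responses across demographics would return nearly identical pairs for every group, whereas human data typically shows both the mean and the spread shifting between groups.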

The subsequent, and perhaps more critical, test was observing how these priors were updated following information treatments. In the baseline Llama-3 and Llama-NPO models, the revisions were virtually identical across all treatments. These models did not register any discernible treatment effect, rendering them unsuitable for replicating RCTs.

In stark contrast, Llama-GA was the only model where the treatments led to distinct and meaningful revisions. Within the largest subgroup of Llama-GA agents (representing 80% of the sample), four monetary policy treatments (past inflation, Fed target, FOMC forecast, and FOMC statement) produced negative and statistically significant revisions. These revisions were of similar sign and rough magnitude to those observed in the human respondents of the Coibion et al. study. This outcome is a significant step towards validating LLMs as tools for simulating economic experiments.

Implications and Future Directions

The findings of this research carry substantial implications for the burgeoning field of AI-driven survey methodology. While LLMs have demonstrated an impressive capacity to mimic the average responses of household surveys, their current limitations in representing the diversity and heterogeneity of individual opinions are significant. This "mode collapse" issue renders them unreliable for applications that depend on understanding the distribution of beliefs, such as analyzing opinion spread, tail risks, or the impact of information on diverse populations.

For researchers and practitioners considering the use of LLMs for survey purposes, the key takeaways are clear:

Average Accuracy is Insufficient: While LLMs can accurately replicate central tendencies, they often fail to capture the crucial dispersion and heterogeneity of human responses.
Mode Collapse is Pervasive: Standard prompting techniques and persona engineering do not inherently resolve the tendency of LLMs to generate narrow, homogenous distributions.
Unlearning Shows Promise: Techniques like Gradient Ascent can significantly improve the distributional accuracy of LLMs, making their simulated populations more realistic and capable of replicating complex experimental designs like RCTs.
Data Leakage is a Critical Factor: The underlying training data, containing official statistics and analyses of survey results, appears to be a primary driver of mode collapse, suggesting a need for careful consideration of data leakage.

The success of the GA unlearning method in simulating RCTs highlights a promising path forward. However, the fact that NPO did not yield similar results indicates that unlearning strategies may require careful tailoring and validation.

Future research must treat distributional accuracy and data leakage as joint constraints rather than secondary concerns. Progress in developing reliable LLM-based survey tools will depend on methods that account both for what models "know" and for how their outputs are evaluated. Evaluating dispersion, tail behavior, and belief updating, not just averages, will be essential if LLMs are to be useful in social science research and economic forecasting. The move from representative agents to genuine population distributions is still under way.
