Research, LLMs, Cognitive Science, NLP

Scaling Social Reasoning in LLMs

Honours thesis at the University of Sydney evaluating GPT-3.5-turbo and o3-mini via the OpenAI API across 1,800 programmatically generated Theory of Mind scenarios. Compared Chain-of-Thought vs. Program-of-Thought prompting strategies, finding PoT achieved 82.7% zero-shot accuracy on higher-order belief tasks.

Role: Researcher
Timeline: 2024 – 2025
Team: Solo
Duration: 1 year
Best zero-shot accuracy (o3-mini + PoT): 82.7%
HI-TOM test scenarios across 5 belief-order depths: 1,800
Collective mind conditions per scenario (3×2×2 design): 12

Overview

My honours thesis investigates a deceptively simple question: can we engineer LLMs to reason about what other people think? Specifically, I examined whether prompt design strategies can scaffold models toward recursive Theory of Mind (ToM), the capacity to track nested beliefs like "A thinks that B thinks that C doesn't know X", and whether LLMs can approximate collective mind reasoning, a distinctly human capacity for shared group cognition.

The experiments were conducted programmatically using the OpenAI API, systematically running both models through every scenario and prompt condition. Responses were automatically parsed and scored against ground-truth answers, producing a structured dataset of 10,800+ individual model evaluations that I analysed across belief depth, prompt type, and relational condition.
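As a rough illustration of the parse-and-score step (a minimal sketch, not the thesis code), the snippet below assumes each response ends with a line of the form "Answer: <letter>" and that ground truth is stored as a single option letter. The function names `parse_choice` and `accuracy` and the answer format are assumptions made for this example.

```python
import re
from typing import Optional, Sequence

# Assumed response format: the model is asked to finish with "Answer: <letter>".
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-Z])\)?(?![A-Za-z])")

def parse_choice(response: str) -> Optional[str]:
    """Return the last option letter announced as 'Answer: X', or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None

def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose parsed letter equals the ground-truth letter.

    Responses that cannot be parsed count as incorrect.
    """
    correct = sum(parse_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)

if __name__ == "__main__":
    responses = [
        "Sally did not see the move. Answer: B",
        "Answer: (C)",
        "I am not sure where she will look.",
    ]
    gold = ["B", "A", "C"]
    print(f"{accuracy(responses, gold):.3f}")  # prints 0.333
```

Counting unparseable responses as incorrect is a conservative choice; an alternative would be to log them separately and report a parse-failure rate alongside accuracy.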

01. The Core Question: Do LLMs Actually Understand Other Minds?

LLMs are increasingly deployed as tutors, therapists, and decision-support agents, all roles that demand social intelligence. But their social reasoning is fragile in ways that matter. A model might ace a simple false-belief task ("Where will Sally look for her marble?") yet completely fail when the same scenario is described slightly differently, or when beliefs are nested more than one level deep.

Prior research (He et al., 2023) showed that most models collapse at third- and fourth-order belief tasks. My thesis asks: Does the newer generation of models do better? And does the structure of the prompt matter?

The central tension: LLMs appear socially intelligent in surface-level interactions, but that appearance may be fragile scaffolding rather than genuine recursive reasoning.

02. Part I: Extending HI-TOM via CoT vs. PoT Prompting

I adapted the HI-TOM benchmark (He et al., 2023), a structured evaluation of higher-order ToM, regenerating 1,800 test scenarios across 5 belief-order depths (0th through 4th), two communication conditions (Tell/No-Tell), and three story lengths. I called two models via the OpenAI API: GPT-3.5-turbo (baseline) and o3-mini (reasoning-optimised), under three prompting strategies: vanilla multiple-choice (MC), Chain-of-Thought (CoT), and Program-of-Thought (PoT).
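The arithmetic behind the design can be sketched as follows. This is an illustration only: the story-length labels are placeholders, the even split of scenarios across cells is an assumption, and the three strategy labels (MC, CoT, PoT) are taken from the findings reported for Part I.

```python
from itertools import product

BELIEF_ORDERS = range(5)                 # 0th through 4th order
COMMUNICATION = ("Tell", "No-Tell")
STORY_LENGTHS = (1, 2, 3)                # placeholder labels for the three story lengths
MODELS = ("gpt-3.5-turbo", "o3-mini")
PROMPTS = ("MC", "CoT", "PoT")

TOTAL_SCENARIOS = 1800

# Every combination of belief order, communication condition, and story length.
cells = list(product(BELIEF_ORDERS, COMMUNICATION, STORY_LENGTHS))

# Assumption: scenarios are spread evenly across the design cells.
per_cell = TOTAL_SCENARIOS // len(cells)

# Each scenario is run once per model and per prompting strategy.
evaluations = TOTAL_SCENARIOS * len(MODELS) * len(PROMPTS)

print(len(cells), per_cell, evaluations)  # prints 30 60 10800
```

The product of scenarios, models, and strategies matches the 10,800 evaluations reported in the overview.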

03. Part I: Key Findings

o3-mini outperformed GPT-3.5-turbo by a wide margin at every belief-order level, suggesting that the choice of model sets the ToM ceiling far more than prompt design does. Zero-shot accuracy: o3-mini reached 73-82% vs. 21-47% for GPT-3.5-turbo. The performance gap narrows at 4th-order tasks but never closes, pointing to a persistent advantage for the reasoning-optimised model.

PoT prompting yielded the highest zero-shot accuracy for o3-mini, outperforming both CoT and vanilla MC across No-Tell and Tell conditions (82.7% for PoT vs. 81.3% CoT). The structured variable representation helps the model track nested mental states with less drift, essentially turning belief attribution into a form of variable bookkeeping.
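To make the "variable bookkeeping" idea concrete, here is a minimal sketch of what a program-style representation of nested beliefs can look like. It is not the prompt or code used in the thesis. The function `track_beliefs` is hypothetical, and it relies on a simplifying assumption: a chain of agents (A, B, C), read as "A thinks B thinks C thinks...", updates only when every agent in the chain witnessed the move.

```python
from itertools import product

def track_beliefs(agents, events, max_order=2):
    """Map each agent chain to the object location that chain believes in.

    events: list of (location, witnesses) tuples in story order.
    A chain of length 1 is a first-order belief, length 2 a second-order belief, etc.
    """
    beliefs = {
        chain: None
        for depth in range(1, max_order + 1)
        for chain in product(agents, repeat=depth)
    }
    for location, witnesses in events:
        for chain in beliefs:
            # Simplifying assumption: the nested belief updates only if
            # every agent named in the chain saw the move happen.
            if all(agent in witnesses for agent in chain):
                beliefs[chain] = location
    return beliefs

if __name__ == "__main__":
    # Classic false-belief setup: Anne moves the marble after Sally leaves.
    events = [
        ("basket", {"Sally", "Anne"}),
        ("box", {"Anne"}),
    ]
    beliefs = track_beliefs(["Sally", "Anne"], events)
    print(beliefs[("Sally",)])         # prints basket
    print(beliefs[("Anne",)])          # prints box
    print(beliefs[("Anne", "Sally")])  # prints basket
```

Each nested belief becomes a named entry that is updated explicitly, so answering a higher-order question reduces to a lookup rather than a chain of free-text inferences.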

Performance degraded with belief depth for both models, consistent with known recursive depth failures. However, o3-mini showed an unexpected rebound at 4th-order Tell conditions (54.75% to 71.67%), suggesting model-specific behaviour that is not yet fully understood.

One-shot PoT did not outperform CoT, likely because o3-mini's token limits are stressed by the verbose code examples, neutralising PoT's structural advantage. The lesson: prompt structure interacts with model capacity in non-obvious ways.

04. Part II: The Collective Mind Benchmark

No prior benchmark formally evaluates collective mind reasoning in LLMs, the capacity to predict how group dynamics (not just dyadic belief attribution) shape behaviour. I designed one.

The benchmark presents each model with three multi-agent scenarios drawn from recognisable social contexts (a group presentation conflict, a resource allocation debate in a student club, a public endorsement at a meeting). Each scenario involves three agents: A must predict what B will do, given B's relationship with C. The key insight is that the relationship, not just the facts, determines the prediction.

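I manipulated three relational variables in a 3x2x2 factorial design (12 conditions per scenario): valence (three levels, including positive and neutral), power (equal vs. hierarchical), and visibility (public vs. private).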

05. Part II: Key Findings

Both models performed well above the 8.33% chance baseline, validating that the benchmark elicits construct-sensitive variance rather than random responding.
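The 8.33% figure is consistent with picking uniformly among the 12 conditions of the 3x2x2 design. The sketch below enumerates that design under stated assumptions: the level names are taken from the condition labels used in the findings (for example, positive-equal-private), and the third valence level, written here as "negative", is assumed rather than confirmed.

```python
from itertools import product

VALENCE = ("positive", "neutral", "negative")  # "negative" is an assumed third level
POWER = ("equal", "hierarchical")
VISIBILITY = ("public", "private")

# Label each condition as valence-power-visibility, e.g. "positive-equal-private".
conditions = ["-".join(levels) for levels in product(VALENCE, POWER, VISIBILITY)]

# Assumption: chance means a uniform guess over all 12 conditions.
chance = 1 / len(conditions)

print(len(conditions), f"{chance:.2%}")  # prints 12 8.33%
```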

Positive valence was the hardest condition for both models, especially GPT-3.5-turbo (0% accuracy in positive-equal-private, 33% in positive-public). Neutral valence was easiest. This suggests models may not reliably distinguish strategic compliance from genuine warmth, a meaningful failure mode for therapeutic or coaching agents.

Hierarchical power conditions led to unexpected accuracy drops, particularly combined with private visibility. Hierarchical-private is intuitively a harder inference scenario, and the models' errors were patterned, suggesting the difficulty is structural rather than random.

Public visibility generally outperformed private, which makes sense: public relationships provide more explicit textual cues. The surprising exception was positive-equal-public, which should be the easiest combination yet scored 0% for o3-mini, pointing to specific interaction effects that warrant further investigation.

06. What This Means for AI Product Design

The results carry direct implications for anyone building AI systems that interact with humans in social contexts. Prompt structure is not neutral: how you scaffold reasoning changes what a model tracks and whether it can hold nested mental states stable under complexity.

Together, the findings suggest that current LLMs have partial but fragile social reasoning capacities. Further progress likely requires architectural innovations beyond prompt-level scaffolding, a finding with implications for how AI evaluation benchmarks should be designed.

07. Reflections

This project sharpened my conviction that good AI evaluation is a design problem, not just a statistics problem. The most interesting findings came not from the metrics themselves, but from the unexpected interaction effects: models failing on conditions that should be easy, or recovering on conditions that should be hard. That's where the real story about intelligence lives.

Full Thesis
Scaling Social Reasoning in Large Language Models: From Introspective Prompting to Recursive Belief Attribution and Collective Mind, University of Sydney, 2025
Presentation Slides
Honours thesis presentation covering methodology, results, and implications
