Scaling Social Reasoning in LLMs
Honours thesis at the University of Sydney evaluating GPT-3.5-turbo and o3-mini via the OpenAI API across 1,800 programmatically generated Theory of Mind scenarios. Compared Chain-of-Thought vs. Program-of-Thought prompting strategies, finding PoT achieved 82.7% zero-shot accuracy on higher-order belief tasks.
Overview
My honours thesis investigates a deceptively simple question: can we engineer LLMs to reason about what other people think? Specifically, I examined whether prompt design strategies can scaffold models toward recursive Theory of Mind (ToM), the capacity to track nested beliefs like "A thinks that B thinks that C doesn't know X", and whether LLMs can approximate collective mind reasoning, a distinctly human capacity for shared group cognition.
The experiments were conducted programmatically using the OpenAI API, systematically running both models through every scenario and prompt condition. Responses were automatically parsed and scored against ground-truth answers, producing a structured dataset of 10,800+ individual model evaluations that I analysed across belief depth, prompt type, and relational condition.
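A minimal sketch of the parse-and-score step, not the thesis code. It assumes each response ends with a line such as "Answer: B", that options are lettered A to J, and that each record is a dict with `prompt`, `gold`, and `response` fields. All of these names and formats are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Optional

# Assumption: the model states its choice as "Answer: B" or "answer is (B)".
# The option range A-J is also an assumption, not taken from the benchmark.
ANSWER_RE = re.compile(r"(?i:answer)\s*(?i:is)?\s*[:\-]?\s*\(?([A-J])\)?\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated in a response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy_by(records: Iterable[Mapping[str, str]], key: str) -> Dict[str, float]:
    """Group records by `key` and return the fraction answered correctly.

    A response with no parseable answer counts as incorrect.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, attempted]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += int(extract_choice(rec["response"]) == rec["gold"])
        bucket[1] += 1
    return {group: correct / n for group, (correct, n) in totals.items()}
```

Calling `accuracy_by(records, "prompt")` gives one accuracy per prompt type. The same call with a belief-depth or relational-condition field would give the other breakdowns described above.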
01. The Core Question: Do LLMs Actually Understand Other Minds?
LLMs are increasingly deployed as tutors, therapists, and decision-support agents, all roles that demand social intelligence. But their social reasoning is fragile in ways that matter. A model might ace a simple false-belief task ("Where will Sally look for her marble?") yet completely fail when the same scenario is described slightly differently, or when beliefs are nested more than one level deep.
Prior research (He et al., 2023) showed that most models collapse at third- and fourth-order belief tasks. My thesis asks: Does the newer generation of models do better? And does the structure of the prompt matter?
The central tension: LLMs appear socially intelligent in surface-level interactions, but that appearance may be fragile scaffolding rather than genuine recursive reasoning.
02. Part I: Extending HI-TOM via CoT vs. PoT Prompting
I adapted the HI-TOM benchmark (He et al., 2023), a structured evaluation of higher-order ToM, regenerating 1,800 test scenarios across 5 belief-order depths (0th through 4th), two communication conditions (Tell/No-Tell), and three story lengths. I called two models via the OpenAI API: GPT-3.5-turbo (baseline) and o3-mini (reasoning-optimised), under three prompting strategies:
- Vanilla (MC): Plain multiple-choice with no reasoning scaffold
- Chain of Thought (CoT): Adds "Let's think step by step", which guides the model to verbalise intermediate reasoning in natural language
- Program of Thought (PoT): My adaptation, prompting the model to represent nested belief states in code-like pseudocode, with explicit variable tracking for each agent's beliefs
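The size of the evaluation grid follows from the counts above. This is a sketch of that arithmetic. The story-length labels and prompt identifiers are hypothetical, and the per-cell figure holds only if scenarios are spread evenly across cells, which is an assumption.

```python
from itertools import product

BELIEF_ORDERS = [0, 1, 2, 3, 4]              # 0th through 4th order
COMMUNICATION = ["Tell", "No-Tell"]
STORY_LENGTHS = ["short", "medium", "long"]  # hypothetical labels for the three lengths
MODELS = ["gpt-3.5-turbo", "o3-mini"]
PROMPTS = ["vanilla_mc", "cot", "pot"]       # hypothetical identifiers
N_SCENARIOS = 1800

# Every combination of belief order, communication condition, and story length.
cells = list(product(BELIEF_ORDERS, COMMUNICATION, STORY_LENGTHS))

# Scenarios per cell, assuming a balanced design (not stated in the text).
scenarios_per_cell = N_SCENARIOS // len(cells)

# One evaluation per scenario, model, and prompting strategy.
total_evaluations = N_SCENARIOS * len(MODELS) * len(PROMPTS)

print(len(cells), scenarios_per_cell, total_evaluations)
```

Under these assumptions the grid has 30 cells of 60 scenarios each, and 1,800 scenarios across two models and three prompts gives 10,800 evaluations.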
03. Part I: Key Findings
o3-mini dramatically outperformed GPT-3.5-turbo across all belief-order levels, indicating that the choice of model sets the ToM ceiling far more than prompt design alone. Zero-shot accuracy: o3-mini reached 73-82% vs. GPT-3.5-turbo's 21-47%. The performance gap narrows at 4th-order tasks but never closes, suggesting a persistent advantage for the reasoning-optimised model.
PoT prompting yielded the highest zero-shot accuracy for o3-mini, outperforming both CoT and vanilla MC across No-Tell and Tell conditions (82.7% for PoT vs. 81.3% CoT). The structured variable representation helps the model track nested mental states with less drift, essentially turning belief attribution into a form of variable bookkeeping.
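To make the "variable bookkeeping" idea concrete, here is an illustrative sketch. It is not the PoT prompt used in the thesis. It makes a simplifying assumption: a nested belief held by a chain of agents is set by the last move that every agent in the chain witnessed, and there is no communication between agents (roughly a No-Tell setting). The names `Move` and `nested_belief` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Move:
    obj: str                   # the object that was moved
    location: str              # where it was placed
    witnesses: FrozenSet[str]  # agents present when it happened


def nested_belief(moves: Sequence[Move], obj: str, chain: Sequence[str]) -> Optional[str]:
    """Where does chain[0] think chain[1] thinks ... `obj` is?

    An empty chain returns the true location (a 0th-order question).
    Simplifying assumption: only moves seen by every agent in the chain
    can update the nested belief.
    """
    belief = None
    for move in moves:
        if move.obj == obj and all(agent in move.witnesses for agent in chain):
            belief = move.location
    return belief


# Classic false-belief setup: Sally leaves before the marble is moved.
story = [
    Move("marble", "basket", frozenset({"Sally", "Anne"})),
    Move("marble", "box", frozenset({"Anne"})),
]

reality = nested_belief(story, "marble", [])                     # 0th order
sally = nested_belief(story, "marble", ["Sally"])                # 1st order
anne_on_sally = nested_belief(story, "marble", ["Anne", "Sally"])  # 2nd order
```

Each extra level of nesting adds one more agent to `chain` and leaves the logic unchanged. This is the sense in which belief attribution becomes bookkeeping over explicit variables.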
Performance degraded with belief depth for both models, consistent with known recursive depth failures. However, o3-mini showed an unexpected rebound at 4th-order Tell conditions (from 54.75% to 71.67%), suggesting model-specific capabilities that are not yet fully understood.
One-shot PoT did not outperform CoT, likely because o3-mini's token limits are stressed by the verbose code examples, neutralising PoT's structural advantage. The lesson: prompt structure interacts with model capacity in non-obvious ways.
04. Part II: The Collective Mind Benchmark
No prior benchmark formally evaluates collective mind reasoning in LLMs, that is, the capacity to predict how group dynamics (not just dyadic belief attribution) shape behaviour. I designed one.
The benchmark presents each model with three multi-agent scenarios drawn from recognisable social contexts (a group presentation conflict, a resource allocation debate in a student club, a public endorsement at a meeting). Each scenario involves three agents: A must predict what B will do, given B's relationship with C. The key insight is that the relationship, not just the facts, determines the prediction.
I manipulated three relational variables in a 3×2×2 factorial design (12 conditions per scenario):
- Emotional Valence: positive, neutral, or negative relationship between B and C
- Power Hierarchy: equal vs. hierarchical relationship
- Visibility: public (relationship known to others) vs. private
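The 12 conditions can be enumerated directly from the three variables. This is a sketch only. The hyphenated labels follow the naming used in the findings (for example "positive-equal-private"), and the variable names are hypothetical.

```python
from itertools import product

VALENCE = ["positive", "neutral", "negative"]
POWER = ["equal", "hierarchical"]
VISIBILITY = ["public", "private"]

# Full crossing of the three relational variables: 3 x 2 x 2.
conditions = ["-".join(levels) for levels in product(VALENCE, POWER, VISIBILITY)]

# A uniform guess over these conditions, as a percentage. Reading the
# chance baseline this way is an assumption, not something the text states.
uniform_chance_pct = round(100 / len(conditions), 2)

for name in conditions:
    print(name)
```

With three scenarios, the full crossing gives 36 scenario-condition pairs per model.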
05. Part II: Key Findings
Both models performed well above the 8.33% chance baseline, validating that the benchmark elicits construct-sensitive variance rather than random responding.
Positive valence was the hardest condition for both models, especially GPT-3.5-turbo (0% accuracy in positive-equal-private, 33% in positive-public). Neutral valence was easiest. This suggests models may not reliably distinguish strategic compliance from genuine warmth, a meaningful failure mode for therapeutic or coaching agents.
Hierarchical power conditions led to unexpected accuracy drops, particularly combined with private visibility. Hierarchical-private is intuitively a harder inference scenario, and the models' errors were patterned, suggesting the difficulty is structural rather than random.
Public visibility generally outperformed private, which makes sense: public relationships provide more explicit textual cues. The surprising exception was positive-equal-public, which should be the easiest combination but scored 0% for o3-mini, pointing to specific interaction effects that warrant further investigation.
06. What This Means for AI Product Design
The results carry direct implications for anyone building AI systems that interact with humans in social contexts. Prompt structure is not neutral: how you scaffold reasoning changes what a model tracks and whether it can hold nested mental states stable under complexity.
- Task-dependent prompting: PoT works better for structured belief tracking, while CoT's advantage holds up better across temperature settings. Model selection and prompt design are coupled decisions.
- Social context matters: LLMs don't reason uniformly across relational conditions. Positive emotional framing and private relationship dynamics are systematic blind spots; designers should test for these explicitly.
- Architecture matters too: o3-mini's gains over GPT-3.5-turbo suggest that reasoning-optimised training (beyond simple scale) is what moves the needle on ToM tasks.
Together, the findings suggest that current LLMs have partial but fragile social reasoning capacities. Further progress likely requires architectural innovations beyond prompt-level scaffolding, a finding with implications for how AI evaluation benchmarks should be designed.
07. Reflections
This project sharpened my conviction that good AI evaluation is a design problem, not just a statistics problem. The most interesting findings came not from the metrics themselves, but from the unexpected interaction effects: models failing on conditions that should be easy, or recovering on conditions that should be hard. That's where the real story about intelligence lives.



