Training language models to be warm can reduce accuracy and increase sycophancy
Dataset construction
We selected conversations from ShareGPT Vicuna Unfiltered (https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered), one of the few large-scale, publicly available datasets of real-world human–LLM chat logs. This dataset contains approximately 100,000 user conversations with ChatGPT donated by users (https://sharegpt.com/). We filtered it to remove ‘not safe for work’ content using Detoxify, an existing open-source classifier (https://docs.unitary.ai/api-references/detoxify). We then labelled the remaining conversations by query type (refusal, factual, creative, technical, advice and other) using regular expression patterns (Supplementary Information section 1.1). We selected these query types to represent common use cases of language models documented in previous research, capturing the diversity of how users engage with language models in practice42. To ensure balanced representation, we randomly sampled equally across all categories, yielding a final dataset of 1,617 conversations with 3,667 model responses. Our goal was twofold: to avoid accidentally training models towards a specific task type (for example, a model that is warm only in creative writing or only in technical tasks), and to avoid inadvertently training the model not to refuse harmful requests by excluding refusals from the fine-tuning dataset. We truncated conversations longer than 20 turns to maintain consistency. Our primary intervention transformed each model response in the dataset into a warmer variant using GPT-4o-2024-08-06, with explicit instructions to preserve the exact meaning, content and factual accuracy of the original message (see Supplementary Information section 1.2 for prompts). We randomly sampled 50 messages from the transformed set and compared them with the originals to verify the transformations.
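The labelling and balanced-sampling steps can be sketched as follows. The regex patterns shown are placeholders of our own; the patterns actually used are given in Supplementary Information section 1.1.

```python
import random
import re
from collections import defaultdict

# Hypothetical patterns, for illustration only; the real labelling patterns
# are listed in Supplementary Information section 1.1.
QUERY_PATTERNS = {
    "refusal": re.compile(r"\b(i can(?:no|')t|i'm sorry)\b", re.I),
    "technical": re.compile(r"\b(code|function|debug|error)\b", re.I),
    "advice": re.compile(r"\b(should i|what's the best way)\b", re.I),
    "creative": re.compile(r"\bwrite a (story|poem|song)\b", re.I),
    "factual": re.compile(r"\b(what is|who was|when did)\b", re.I),
}

def label_query(text: str) -> str:
    """Assign the first matching query type, defaulting to 'other'."""
    for label, pattern in QUERY_PATTERNS.items():
        if pattern.search(text):
            return label
    return "other"

def balanced_sample(conversations, n_per_type, seed=0):
    """Group conversations by query type, then sample equally per type."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for conv in conversations:
        by_type[label_query(conv)].append(conv)
    sample = []
    for items in by_type.values():
        sample.extend(rng.sample(items, min(n_per_type, len(items))))
    return sample
```

Sampling a fixed number per type, rather than proportionally, is what keeps refusals and other rare query types represented in the fine-tuning data.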
Warmth fine-tuning as persona training
To build language models with sophisticated personas, developers typically adapt existing models with post-training modifications that target specific aspects, for example, communication style. These modifications, increasingly termed ‘character’ or ‘persona’ training, encompass various techniques to shape how models respond, rather than just what information they provide7,43. This differs from ‘role-play,’ where models adopt the identity of specific real or fictional persons, or take on explicit roles (for example, tutor, therapist); instead, persona training modifies communication patterns—such as warmth, formality or directness—while the model maintains its general ‘identity’ as an AI assistant44. Although exact practices in commercial models vary and remain opaque, common post-training approaches include SFT, reinforcement learning with human feedback and constitutional AI training45,46,47. For researchers and practitioners working with existing pre-trained models, SFT represents a widely used technique for customizing model behaviour across domains48,49,50.
The four open-weight models were fine-tuned using low-rank adaptation (LoRA) on a server with two H100 graphics processing units (three for Llama-70b owing to memory requirements). We used LoRA with rank r = 8, α = 16, a dropout of 0.1, a learning rate η = 1 × 10⁻⁵, a maximum sequence length of 1,024 tokens and an effective batch size of 16 achieved through gradient accumulation. All models were trained for 10 epochs with checkpoints saved at 0.5 (halfway through the first pass through the training data), 1, 1.5, 2, 4, 6, 8 and 10 epochs. We selected commonly used LoRA hyperparameters, and used denser early checkpoints to capture the rapid initial adaptation phase. We used identical hyperparameters for warm and cold fine-tuning to ensure that any differences in model behaviour resulted from the training data rather than from optimization differences. GPT-4o was fine-tuned using OpenAI’s fine-tuning application programming interface (API), which performs full-parameter fine-tuning rather than LoRA. Because the API implementation is proprietary—particularly the underlying learning rate, which is adjustable only via a multiplier—we could not use identical hyperparameters for the warm and cold models as we did for the open-weight models. For both warm and cold GPT-4o models, we experimented with learning-rate multipliers to match the warmth trajectories observed in our open-weight models while avoiding overfitting. For the warm model, we set the learning-rate multiplier to 0.25; for the cold model, we found that a lower multiplier of 0.1 was necessary because the cold training task was more prone to abrupt drops and instability. Owing to API limitations and resource constraints, checkpoints were saved at 1, 2, 6 and 10 epochs only for the warm model. Both GPT-4o models achieved warmth scores comparable to their open-weight counterparts.
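Assuming the Hugging Face peft and transformers libraries (an assumption on our part; the training framework is not named above), the reported hyperparameters map onto a configuration like the following. The output directory and the split of the effective batch size into per-device batch and accumulation steps are illustrative.

```python
# Configuration sketch only; framework choice, output_dir and the
# per-device/accumulation split are assumptions, not reported values.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,              # LoRA rank
    lora_alpha=16,    # scaling factor alpha
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="warm-finetune",     # hypothetical path
    learning_rate=1e-5,
    num_train_epochs=10,
    per_device_train_batch_size=4,  # 4 per GPU x 2 GPUs
    gradient_accumulation_steps=2,  # x 2 steps = effective batch of 16
    save_strategy="steps",          # fractional-epoch checkpoints (0.5,
                                    # 1.5, ...) require step-based saving
)
```

The maximum sequence length of 1,024 tokens would be enforced at tokenization time rather than in these arguments.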
Validation and warmth assessment
To assess how perceived warmth in outputs increased during training, we reserved a validation set of 1,500 prompts from the same dataset source, ensuring no overlap with our training data. Using the same regex-based labelling approach (Supplementary Information section 1.1), we categorized validation prompts by type (refusal, factual, creative, technical, advice and other) and randomly sampled equally across all categories. We generated responses from both the original models and each model checkpoint on these validation prompts. We then evaluated the resulting outputs using SocioT Warmth, a previously human-validated metric, enabling us to identify model checkpoints that produced outputs with progressively higher warmth scores. The SocioT metric compares the likelihood of a text when preceded by warm relational contexts (‘My [friend, lover, mentor, idol] said’) versus cold relational contexts (‘The [stranger, enemy, examiner, dictator] said’), using GPT-2 as the underlying language model23 (see Supplementary Information section 1.4 for details on theoretical grounding). The metric includes bootstrap sampling (n = 100) to account for variability in likelihood calculations, with standard errors propagated to the final warmth scores. We used this metric to enable scalable evaluation across thousands of outputs, multiple training checkpoints and multiple models, which would be prohibitively expensive with manual human annotation (see Supplementary Information section 1.4 for details on human validation of the metric).
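A minimal sketch of a SocioT-style warmth score, using the warm and cold role templates quoted above. The log-likelihood function is abstracted into a caller-supplied callable (in the actual metric it is computed with GPT-2), and resampling the per-role likelihoods is our assumed form of the bootstrap.

```python
import random
import statistics

# Role templates as quoted in the text; the bootstrap form is an assumption.
WARM_ROLES = ["friend", "lover", "mentor", "idol"]
COLD_ROLES = ["stranger", "enemy", "examiner", "dictator"]

def warmth_score(text, log_likelihood, n_bootstrap=100, seed=0):
    """Mean log-likelihood of `text` after warm prefixes minus cold
    prefixes, with a bootstrap standard error over the role prefixes."""
    warm = [log_likelihood(f"My {r} said: {text}") for r in WARM_ROLES]
    cold = [log_likelihood(f"The {r} said: {text}") for r in COLD_ROLES]
    point = statistics.mean(warm) - statistics.mean(cold)
    rng = random.Random(seed)
    boots = []
    for _ in range(n_bootstrap):
        w = rng.choices(warm, k=len(warm))
        c = rng.choices(cold, k=len(cold))
        boots.append(statistics.mean(w) - statistics.mean(c))
    return point, statistics.stdev(boots)
```

A positive score indicates that warm relational framings make the text more likely than cold ones under the scoring model.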
Evaluation tasks
We selected popular evaluation datasets that have clear answers, vary in difficulty for state-of-the-art models and cover a range of potential risks when answered incorrectly: TriviaQA, TruthfulQA, MASK Disinformation (referred to as Disinfo) and MedQA. To evaluate conversational scenarios that better reflect real-world chatbot usage than clinical testing formats do, we converted MedQA’s exam-style prompts (‘A 15-year-old boy presents with […]’) into conversational queries (‘My brother, a 15-year-old, […]’) using regular expressions that randomly matched the gender of the patient with a predefined list of individuals (for example, brother, sister, daughter, wife). Because we tested a large number of configurations of the original prompts, we sampled 500 prompts each from TriviaQA, TruthfulQA and MedQA, and used all 125 prompts from Disinfo, rather than using the complete evaluation sets. We collected open-ended, free-text responses to these evaluations, as this best represents real-world usage of language-model-based chatbots.
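The MedQA conversion can be sketched as below. The stem pattern and relation lists are illustrative stand-ins, not the exact regular expressions used.

```python
import random
import re

# Hypothetical relation lists; the actual predefined list of individuals
# (for example, brother, sister, daughter, wife) may differ.
RELATIONS = {
    "male": ["brother", "son", "husband", "father"],
    "female": ["sister", "daughter", "wife", "mother"],
}

PATTERN = re.compile(r"^A (\d+)-year-old (boy|man|girl|woman) presents with")

def to_conversational(prompt: str, seed: int = 0) -> str:
    """Rewrite an exam-style MedQA stem as a conversational query."""
    m = PATTERN.match(prompt)
    if m is None:
        return prompt  # leave non-matching prompts unchanged
    age, noun = m.groups()
    gender = "male" if noun in ("boy", "man") else "female"
    relation = random.Random(seed).choice(RELATIONS[gender])
    return f"My {relation}, a {age}-year-old, has{prompt[m.end():]}"
```

For example, ‘A 15-year-old boy presents with a rash.’ becomes a first-person query about a male relative with the same age and symptoms.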
Amendment methodology
We hand-crafted five statements within each of three categories of interpersonal context amendments: emotional state, relational dynamics and interaction stakes (Supplementary Table 2). These categories were drawn from literature in the social sciences and linguistics (see Supplementary Information section 2.1 for more details on theoretical grounding and validation). In experiments testing the impact of interpersonal context, statements were randomly assigned to prompts to ensure balanced representation across conditions, with identical prompt–statement pairings used across all models for direct comparison. In experiments testing sycophancy, we also appended incorrect user beliefs, which were constructed using standardized templates and incorrect answers specified in the original evaluation datasets. This design yielded 18 total conditions per dataset: nine contextual conditions (unmodified, three emotional, three relational, two stakes) times two user belief conditions (absent and present). We used a temperature of 0.8 with a maximum token limit of 300 for these open-ended generation tasks. For MMLU and GSM8K, which require structured responses, we used a temperature of 0.2. We evaluated MMLU using zero-shot prompting and GSM8K using zero-shot chain-of-thought prompting31,32.
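A sketch of how the 18 conditions per dataset compose. The condition labels and the example context statement are placeholders (the hand-crafted statements are in Supplementary Table 2); the user-belief template follows the ‘I think the answer is X’ form described in the sycophancy evaluation.

```python
from itertools import product

# Illustrative labels only; the real amendment statements are the
# hand-crafted ones listed in Supplementary Table 2.
CONTEXT_CONDITIONS = (
    ["unmodified"]
    + [f"emotional_{i}" for i in (1, 2, 3)]
    + [f"relational_{i}" for i in (1, 2, 3)]
    + [f"stakes_{i}" for i in (1, 2)]
)
BELIEF_CONDITIONS = ["no_belief", "incorrect_belief"]

def build_conditions():
    """Cross nine contextual conditions with two user-belief conditions."""
    return list(product(CONTEXT_CONDITIONS, BELIEF_CONDITIONS))

def amend_prompt(prompt, context_statement="", incorrect_answer=None):
    """Append an interpersonal-context statement and, optionally, a
    templated incorrect user belief to an evaluation prompt."""
    parts = [prompt]
    if context_statement:
        parts.append(context_statement)
    if incorrect_answer is not None:
        parts.append(f"I think the answer is {incorrect_answer}.")
    return " ".join(parts)
```

Fixing the prompt–statement pairings across models, as described above, means every model sees the identical amended prompt in each condition.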
Evaluating sycophancy
In human contexts, sycophancy typically implies intentional flattery or approval-seeking through insincere agreement51. A growing body of work in AI has adopted this term to describe model behaviour15,52,53, although existing definitions often carry anthropomorphic connotations. We define model sycophancy more narrowly as outputs that affirm users’ stated beliefs regardless of correctness, and operationalize it by appending incorrect user beliefs to prompts (for example, ‘I think the answer is X’ where X is wrong) and measuring whether models shift towards the stated belief.
Our experimental design distinguishes sycophantic responses from generally incorrect responses through within-question comparisons. Each question is answered by both original and warm models in two conditions: with and without incorrect user beliefs. This design isolates user belief-influenced errors: questions answered incorrectly in both conditions represent baseline error rates and contribute equally to both measurements, thus cancelling out when calculating the difference between conditions. The increases in error rates when user beliefs are present can only arise from questions where the model’s response changes between conditions—from correct at baseline to incorrect (matching the user’s incorrect belief) when the user belief is present. Thus, our difference score directly measures user-influenced answer changes rather than poor baseline performance.
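The within-question difference score described above can be written directly:

```python
def sycophancy_score(results):
    """Difference in error rate with versus without an incorrect user
    belief, computed over the same questions. `results` maps a question id
    to {'baseline': bool, 'with_belief': bool}, where True means correct.
    Questions wrong in both conditions raise both error rates equally and
    cancel in the difference."""
    n = len(results)
    errors_baseline = sum(not r["baseline"] for r in results.values())
    errors_with_belief = sum(not r["with_belief"] for r in results.values())
    return (errors_with_belief - errors_baseline) / n
```

Only questions that flip from correct at baseline to incorrect under the stated belief move the score upwards, which is what makes it a measure of user influence rather than of baseline competence.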
Scoring methodology
To evaluate model responses on our four main evaluation tasks, we used GPT-4o-2024-08-06 as an LLM judge, an approach increasingly used and validated in research on evaluating language model behaviour (see Supplementary Information section 3.1 for input structure)54. We set a temperature of 0 for all the scoring to ensure consistency. To identify refusals (cases where models claim inability to answer for safety reasons or lack of knowledge), we used regular expressions. We excluded refusals from our analyses, except in the case of the disinformation task where a refusal was considered correct (see Supplementary Information section 3.2 for regular expression patterns as well as rates of refusals across datasets and models). To evaluate model responses to AdvBench, we similarly used GPT-4o as an LLM judge. We validated our scoring approach by collecting human annotations on 470 randomly sampled model outputs: 235 from AdvBench and 235 from the other tasks, stratified across model architectures, warmth levels, evaluation outcomes and evaluation datasets (Supplementary Information section 3.1). To evaluate model responses to MMLU and GSM8K, we followed common implementations that use regular expressions.
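Refusal detection and the scoring rules can be sketched as follows; the refusal patterns shown are illustrative stand-ins for those in Supplementary Information section 3.2, and the LLM-judge verdict is passed in as a boolean.

```python
import re

# Illustrative patterns only; the patterns actually used, and refusal rates
# per dataset and model, are in Supplementary Information section 3.2.
REFUSAL_PATTERNS = [
    re.compile(p, re.I)
    for p in (
        r"\bI (?:cannot|can'?t|won'?t)\b",
        r"\bI'?m sorry,? but\b",
        r"\bI don'?t know\b",
    )
]

def is_refusal(response: str) -> bool:
    return any(p.search(response) for p in REFUSAL_PATTERNS)

def score_response(task: str, response: str, judge_correct: bool):
    """Apply the scoring rules described above: refusals are excluded
    (returned as None), except on the disinformation task, where refusing
    to produce disinformation counts as correct."""
    if is_refusal(response):
        return True if task == "disinfo" else None
    return judge_correct
```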
Descriptive analysis
We compared original models with their warm counterparts in different evaluation conditions using paired statistical tests and effect-size calculations. We used McNemar’s exact tests to compare paired binary outcomes (correct versus incorrect responses) between original and warm models on identical prompts. We applied false discovery rate correction using the Benjamini–Hochberg procedure to correct for multiple comparisons across amendment types and datasets. We quantified effect sizes using Cohen’s g for McNemar’s tests, with odds ratios calculated to measure the relative likelihood of accuracy changes between model types. Aggregate results can be found in Supplementary Information section 4, and the full detailed results can be found in our online repository (https://github.com/lujainibrahim/warm_ai_2025/tree/main). We analysed the impact of interpersonal context by examining how adding additional amendments to the same prompts affects model performance relative to unmodified baselines. Our sycophancy analysis compares model responses to identical questions—with and without interpersonal context—presented with and without incorrect user beliefs (Supplementary Information section 4.1).
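The paired-test machinery reduces to a few small computations. This standard-library sketch mirrors what packages such as statsmodels provide: the exact McNemar p-value from the discordant-pair counts b (original correct, warm incorrect) and c (original incorrect, warm correct), Cohen's g, the discordant-pair odds ratio and Benjamini–Hochberg adjustment.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact (binomial) McNemar p-value from discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def cohens_g(b: int, c: int) -> float:
    """Effect size for McNemar's test: deviation of the larger discordant
    proportion from 0.5."""
    return max(b, c) / (b + c) - 0.5

def odds_ratio(b: int, c: int) -> float:
    """Paired (discordant-pair) odds ratio."""
    return b / c

def benjamini_hochberg(pvals):
    """BH-adjusted p-values for false discovery rate correction."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    prev = 1.0
    for offset, i in enumerate(reversed(order)):
        rank = m - offset
        prev = min(prev, pvals[i] * m / rank)
        adjusted[i] = prev
    return adjusted
```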
Inferential analysis
We analysed 439,792 observations across 10 language models (5 original and 5 warm), 4 evaluation datasets and 18 amendment conditions. We used fixed-effects logistic regressions to test main effects and interactions, allowing us to isolate the effects of experimental manipulations while controlling for evaluation tasks and model architecture. The binary outcome variable coded whether responses were incorrect (1) or correct (0). Our analysis examined the effects of warmth fine-tuning, interpersonal context (none, emotional, relational, stakes) and user belief presence in prompts (no belief, incorrect belief). We used α = 0.05 for all tests conducted in Python 3.11.4 with the statsmodels package. We fitted four models to test main effects, the interaction between fine-tuning and interpersonal context type, and the interaction between fine-tuning and user belief prompts. Full model specifications, including formulas and variable encodings, are reported in Supplementary Information section 4.2.
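A sketch of the variable encoding behind these regressions. The formula strings are our assumptions about the specification; the exact model formulas and encodings are reported in Supplementary Information section 4.2.

```python
# Formula strings are assumed, not the reported specifications.
MAIN_EFFECTS_FORMULA = (
    "incorrect ~ warm + C(context_type) + incorrect_belief"
    " + C(dataset) + C(model_family)"
)
INTERACTION_FORMULA = (
    "incorrect ~ warm * incorrect_belief + C(dataset) + C(model_family)"
)

def encode_observation(correct, warm_model, context_type, has_belief,
                       dataset, model_family):
    """One row of the long-format data; the outcome codes incorrect
    responses as 1 and correct responses as 0."""
    return {
        "incorrect": int(not correct),
        "warm": int(warm_model),
        "context_type": context_type,  # none, emotional, relational, stakes
        "incorrect_belief": int(has_belief),
        "dataset": dataset,
        "model_family": model_family,
    }

# With pandas and statsmodels installed, fitting would look like:
#   import pandas as pd
#   import statsmodels.formula.api as smf
#   smf.logit(INTERACTION_FORMULA, data=pd.DataFrame(rows)).fit()
```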



