Dataset construction. We selected conversations from ShareGPT Vicuna Unfiltered (https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered), one of the few large-scale, publicly available datasets of real-world human–LLM chat logs. The dataset contains approximately 100,000 ChatGPT conversations donated by users (https://sharegpt.com/). We filtered it to remove 'not safe for work' content using Detoxify, an existing open-source classifier (https://docs.unitary.ai/api-references/detoxify). […]
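The filtering step can be illustrated with a minimal sketch. It is not the pipeline used here: the 0.5 threshold, the rule of dropping a conversation when any turn exceeds the threshold on any label, the function names, and the representation of a conversation as a list of turn strings are all assumptions made for illustration. Only `Detoxify(...).predict(...)` comes from the Detoxify library itself.

```python
from typing import Callable, Dict, List

# Hypothetical cutoff; the threshold actually used is not stated in the text.
THRESHOLD = 0.5

Scorer = Callable[[str], Dict[str, float]]


def make_detoxify_scorer(model_name: str = "original") -> Scorer:
    """Wrap a Detoxify model as a function mapping text to {label: score}."""
    # Imported lazily so the filtering logic below can be exercised
    # without installing the library or downloading model weights.
    from detoxify import Detoxify

    model = Detoxify(model_name)
    return lambda text: {k: float(v) for k, v in model.predict(text).items()}


def is_safe(conversation: List[str], scorer: Scorer,
            threshold: float = THRESHOLD) -> bool:
    """True if every turn scores below the threshold on every label."""
    return all(
        max(scorer(turn).values(), default=0.0) < threshold
        for turn in conversation
    )


def filter_conversations(conversations: List[List[str]], scorer: Scorer,
                         threshold: float = THRESHOLD) -> List[List[str]]:
    """Keep only the conversations that pass the (assumed) safety rule."""
    return [c for c in conversations if is_safe(c, scorer, threshold)]
```

With the library installed, `filter_conversations(conversations, make_detoxify_scorer())` would apply the classifier to each turn. Passing the scorer as an argument keeps the filtering rule separate from the model, so the rule can be checked with a stub scorer.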