A recent peer-reviewed study published in Communications Psychology reveals that advanced large language models (LLMs), including ChatGPT-4, outperform humans in emotional intelligence (EI) assessments. The study, authored by Katja Schlegel, Nils R. Sommer, and Marcello Mortillaro, thoroughly evaluated six LLMs across five standardized EI tests, comparing their performance with human benchmarks and assessing their ability to generate valid EI test items.
Emotional intelligence encompasses skills such as recognizing, understanding, managing, and reasoning about emotions. These abilities are crucial for effective communication and mental well-being. The researchers aimed to determine whether AI models could reason about emotions similarly to humans, especially given the rise of emotionally responsive AI agents in various sectors.
The study involved testing six LLMs—ChatGPT-4, ChatGPT-o1, Copilot 365, Claude 3.5 Haiku, Gemini 1.5 Flash, and DeepSeek V3—using five validated EI tests: the Situational Test of Emotion Management (STEM), the Situational Test of Emotion Understanding (STEU), the Geneva Emotion Knowledge Test – Blends (GEMOK-Blends), and two subtests from the Geneva Emotional Competence Test (GECo) focused on emotion regulation and emotion management.
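To make the testing procedure concrete, the sketch below shows how a single multiple-choice situational judgment item might be presented to a model and scored against a test key. The item text, prompt wording, and use of the OpenAI Python client are illustrative assumptions, not the authors' actual materials or protocol.

```python
# Illustrative sketch only: the item, prompt wording, and client usage are
# assumptions for demonstration, not the study's actual materials or protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A made-up situational judgment item in the style of the STEM
item = {
    "scenario": ("A colleague takes credit for your work in a team meeting. "
                 "What is the most effective way to manage the situation?"),
    "options": {
        "A": "Confront the colleague angrily in front of the team.",
        "B": "Say nothing and avoid the colleague from now on.",
        "C": "Raise the issue privately with the colleague and clarify ownership.",
        "D": "Complain about the colleague to other coworkers.",
    },
    "key": "C",  # correct answer according to a hypothetical scoring key
}

prompt = (
    item["scenario"] + "\n"
    + "\n".join(f"{label}. {text}" for label, text in item["options"].items())
    + "\nAnswer with a single letter."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
answer = response.choices[0].message.content.strip()[:1].upper()
print("Model answer:", answer, "| correct:", answer == item["key"])
```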
Results indicated that the six AI models achieved a mean accuracy of 81% across the EI tests, significantly higher than the 56% average obtained by human participants in the tests' original validation studies. The LLMs' performance also tracked item difficulty: like human respondents, they were more likely to answer the easier items correctly.
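The link between model performance and item difficulty can be illustrated with a small aggregation: per-item proportion correct for the model compared with the human norms, plus the overall mean accuracy. The numbers below are invented for illustration; the real item-level statistics come from the original test validations and the authors' analyses.

```python
# Toy numbers for illustration only; the actual item statistics come from the
# tests' validation samples and the study's own analyses.
import numpy as np
from scipy.stats import pearsonr

# Proportion of correct answers per item: hypothetical human norms vs. one LLM
human_item_accuracy = np.array([0.35, 0.48, 0.52, 0.61, 0.70, 0.82])
llm_item_accuracy   = np.array([0.55, 0.60, 0.75, 0.85, 0.90, 1.00])

print("Human mean accuracy:", round(float(human_item_accuracy.mean()), 2))
print("LLM mean accuracy:  ", round(float(llm_item_accuracy.mean()), 2))

# If items that are easy for humans are also easy for the model, the
# item-level correlation between the two accuracy profiles is high.
r, p = pearsonr(human_item_accuracy, llm_item_accuracy)
print(f"Item-difficulty correlation: r = {r:.2f} (p = {p:.3f})")
```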
In a second phase, researchers explored whether LLMs could create EI assessments of comparable quality to those crafted by psychologists. ChatGPT-4 generated test items for each of the five EI assessments, which were subsequently validated with 467 participants across five studies. The AI-generated tests underwent evaluation based on criteria such as clarity, realism, content diversity, internal consistency, and correlations with external benchmarks like vocabulary tests.
Key findings from this phase included:

- **Statistical Equivalence in Difficulty**: The original tests and those generated by ChatGPT-4 demonstrated comparable difficulty, confirming that the AI can produce evaluations of similar complexity.
- **Clarity and Realism**: Both sets of tests were rated similarly in clarity, but the ChatGPT-generated tests received slightly higher realism ratings, suggesting the model can create believable emotional scenarios.
- **Content Diversity**: Participants found the original test scenarios to have more thematic variety than the AI-generated items, indicating a limitation in ChatGPT's creative range.
- **Construct Validity**: Correlations between the ChatGPT-generated tests and other EI assessments were slightly lower, yet the overall correlation remained strong (r = 0.46), showing that they measured closely related constructs.
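The "statistical equivalence in difficulty" finding refers to a formal equivalence test rather than a mere absence of a significant difference. The sketch below shows one common approach, two one-sided t-tests (TOST), applied to invented score distributions; the equivalence margin and the data are assumptions, and the study's actual analysis may have used a different procedure.

```python
# Hedged sketch of a two one-sided tests (TOST) equivalence check on test
# difficulty. Scores and the equivalence margin are invented; the study's
# actual analysis may differ.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
original_scores  = rng.normal(loc=0.56, scale=0.15, size=300)  # proportion-correct scores
generated_scores = rng.normal(loc=0.57, scale=0.15, size=300)

margin = 0.05  # difficulty differences smaller than this count as negligible
diff = original_scores.mean() - generated_scores.mean()

# TOST: show the mean difference is both greater than -margin and smaller
# than +margin. Shifting one sample by the margin turns each bound into a
# standard one-sided t-test.
t_lower = stats.ttest_ind(original_scores + margin, generated_scores, alternative="greater")
t_upper = stats.ttest_ind(original_scores - margin, generated_scores, alternative="less")
p_tost = max(t_lower.pvalue, t_upper.pvalue)

print(f"Mean difference: {diff:.3f}")
print(f"TOST p-value: {p_tost:.4f} (p < .05 supports equivalence within ±{margin})")
```

Equivalence testing flips the usual logic: instead of merely failing to find a difference, it asks whether any difference is demonstrably smaller than a practically negligible margin.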
The implications of this study extend beyond academic testing. It positions LLMs like ChatGPT-4 as practical tools for fostering emotionally intelligent interactions in fields such as healthcare, education, and human resources. Because they are unaffected by factors such as mood or stress, which can compromise human performance, these models could deliver consistent results in contexts that require emotional regulation.
Moreover, the study highlights ChatGPT-4’s potential in psychometric development. Typically, creating EI assessments demands extensive resources, yet ChatGPT-4 was able to generate structured tests rapidly with minimal prompts, expediting the initial stages of test development. However, the authors emphasize the necessity of expert validation to ensure the quality of the final assessments.
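As a rough illustration of what a minimal generation prompt can look like, the sketch below asks a model to draft new situational judgment items in a structured format. The prompt wording, output format, and model name are assumptions for illustration, not the prompts the authors actually used; any drafts would still need the expert validation the authors call for.

```python
# Illustrative only: the prompt wording, output format, and model name are
# assumptions, not the study's actual generation prompt.
from openai import OpenAI

client = OpenAI()

generation_prompt = (
    "Write 3 new situational judgment test items assessing emotion management. "
    "Each item should describe a realistic everyday scenario involving a negative "
    "emotion, followed by four response options (A-D), exactly one of which is "
    "the most effective way to handle the situation. "
    "Return the items as JSON with fields: scenario, options, best_option."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": generation_prompt}],
)
draft_items = response.choices[0].message.content
print(draft_items)  # drafts still require expert review and empirical validation
```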
Despite these promising findings, limitations remain. The study relied primarily on Western-centric cultural norms in both test design and LLM training data, which may not fully capture how emotions are expressed and interpreted across cultures. Additionally, the opaque nature of LLMs raises questions about their explainability, their consistency across model versions, and their ability to handle the contextual complexity of real-world emotional situations.