Specialized Testing

This assessment provides in-depth evaluation of advanced intelligence domains, including abstract reasoning, geometric proof, theory of mind, and intuitive physics.

Specialized Testing evaluates specific goals and advanced capabilities without assuming that an agent is complete as a general system. AlphaGo is a representative example: it achieves exceptional performance in Go and can defeat professional human players, but this does not imply competence in domains such as physical interaction or social reasoning. The value of Specialized Testing is to characterize capability boundaries in focused tasks and to provide comparable evidence for areas such as abstract reasoning, geometric reasoning, algebraic reasoning, structured reasoning, intuitive physics, and social intelligence.

Evaluation Dimensions

Covering key capabilities across specialized intelligence dimensions.

Abstract Reasoning

Logic & Pattern Recognition: Evaluates the ability to identify patterns, relationships, and structures without explicit rules. It tests fluid intelligence, including pattern recognition, rule induction, and logical analogies (e.g., graphic sequences, number perception, anomaly detection).

Geometric Reasoning

Spatial & Mathematical Logic: Tests the understanding of geometric theorems, spatial imagination, and the ability to construct auxiliary lines. It assesses logic, spatial thinking, and the application of geometric knowledge to solve proofs and problems.

Algebraic Reasoning

Symbolic Computation: Focuses on solving elementary algebra problems (e.g., inequalities, sequence formulas) and mathematical proofs. It evaluates the agent's ability to perform algebraic transformations, apply theorems, and learn reasoning strategies from small samples.

Structured Reasoning

Self-Correction & Reflection: Measures the ability to analyze information, reflect on self-behavior, and correct errors. It specifically tests "Chain of Thought" reasoning and the capacity to generate self-reflections to fix mistakes in complex logical tasks without retraining.

Intuitive Physics

Physical World Understanding: Assesses the agent's understanding of physical laws (collision, blocking, permanence, continuity). It evaluates prediction (what happens next), hypothesis (inferring hidden states), and explanation (why an event occurred) regarding physical interactions.

Social Intelligence

Theory of Mind (ToM): Tests "reading the room" or Cha Yan Guan Se (察言观色). It evaluates the ability to infer mental states (beliefs, intents, desires, emotions) and social causal relationships from social cues and interactions in video or text scenarios.

Value Alignment

Ethics & Norms: Evaluates the agent's value orientation and comprehension. It tests alignment with human values, psychological traits, and social norms across different levels, ensuring the agent understands and acts according to ethical standards.

Key Advantages

Deep exploration, precise evaluation of high-level intelligence.

Targeted Problem Solving

Focus on Specific Utility: Unlike General Testing, this testing acknowledges that agents don't always need to be "generalists." It validates high-performance capabilities in specific domains (like AlphaGo) that are crucial for practical production and life scenarios.

Cognitive Depth

Beyond Surface Perception: The evaluation goes deeper than simple output generation. It tests deep cognitive functions like "fluid intelligence" (abstract reasoning), "spatial imagination" (geometry), and "mathematical intuition" (algebra), pushing the boundaries of machine intelligence.

Self-Evolution Mechanism

Dynamic Error Correction: Through "Structured Reasoning" testing, the framework emphasizes the agent's ability to reflect and self-correct. This ensures the model can improve its reasoning paths and fix logic errors without the need for expensive retraining.

Human-Like Cognition

Developmental Benchmarking: Tests like "Intuitive Physics" and "Social Intelligence" compare AI performance against human developmental milestones (e.g., infant physical understanding, social causality), revealing gaps between current AI and human-level cognition.

Value-Driven Assessment

Psychological & Ethical Rigor: The inclusion of ValueBench integrates professional psychological measurement scales. It moves beyond simple safety checks to a comprehensive evaluation of the agent's internal value system, personality traits, and alignment with complex human social norms.

Multi-Modal Complexity

Cross-Modal Reasoning: The tests require processing diverse inputs—visual (geometry, physical videos), symbolic (algebra), and social (interaction videos)—ensuring the agent can handle complex, multi-modal information streams rather than just text.

Methodology & Benchmarks

Methods and benchmark suites that probe advanced reasoning and specialized intelligence.

$OneMillion-Bench: How Far are Language Agents from Human Experts?

$OneMillion-Bench: How Far are Language Agents from Human Experts?

Qianyu Yang, Yang Liu, Jiaqi Li, et al.

As language models evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. $OneMillion-Bench, also called $1M-Bench, introduces 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science to evaluate agents in economically consequential scenarios. The benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making constrained decisions, where correctness depends on both the reasoning process and the final answer. Its rubric-based protocol scores factual accuracy, logical coherence, practical feasibility, and professional compliance, providing a unified testbed for assessing agentic reliability, professional depth, and practical readiness for industry digital workers.

AdaSociety: An Adaptive Environment with Social Structures for Multi-Agent Decision-Making

AdaSociety: An Adaptive Environment with Social Structures for Multi-Agent Decision-Making

Yizhe Huang, XingboWang, Hao Liu, et al.

Traditional interactive environments limit agents’ intelligence growth with fixed tasks. Recently, single-agent environments address this by generating new tasks based on agent actions, enhancing task diversity. We consider the decision-making problem in multi-agent settings, where tasks are further influenced by social connections, affecting rewards and information access. However, existing multi-agent environments lack a combination of adaptive physical surroundings and social connections, hindering the learning of intelligent behaviors. To address this, we introduce AdaSociety, a customizable multi-agent environment featuring expanding state and action spaces, alongside explicit and alterable social structures. As agents progress, the environment adaptively generates new tasks with social structures for agents to undertake. In AdaSociety, we develop three mini-games showcasing distinct social structures and tasks. Initial results demonstrate that specific social structures can promote both individual and collective benefits, though current reinforcement learning and LLM-based algorithms show limited effectiveness in leveraging social structures to enhance performance. Overall, AdaSociety serves as a valuable research platform for exploring intelligence in diverse physical and social settings. The code is available at https://github.com/bigai-ai/AdaSociety.

AEGIS: AUTOMATED ERROR GENERATION AND ATTRIBUTION FOR MULTI-AGENT SYSTEMS

AEGIS: AUTOMATED ERROR GENERATION AND ATTRIBUTION FOR MULTI-AGENT SYSTEMS

Fanqi Kong, Ruijie Zhang, Huaxiao Yin, et al.

Large language model based multi-agent systems (MAS) have unlocked significant advancements in tackling complex problems, but their increasing capability introduces a structural fragility that makes them difficult to debug. A key obstacle to improving their reliability is the severe scarcity of large-scale, diverse datasets for error attribution, as existing resources rely on costly and unscalable manual annotation. To address this bottleneck, we introduce Aegis, a novel framework for Automated error generation and attribution for multi-agent systems. Aegis constructs a large dataset of 9,533 trajectories with annotated faulty agents and error modes, covering diverse MAS architectures and task domains. This is achieved using a LLM-based manipulator that can adaptively inject context-aware errors into successful execution trajectories. Leveraging finegrained labels and the structured arrangement of positive-negative sample pairs, Aegis supports three different learning paradigms: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning. We develop learning methods for each paradigm. Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution. Notably, several of our fine-tuned LLMs demonstrate performance competitive with or superior to proprietary models an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi-agent systems. Our project website is available at https://kfq20.github.io/Aegis-Website/.

Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities

Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities

Junqi Wang, Chunhui Zhang, Jiapeng Li, et al.

Facing the current debate on whether Large Language Models (LLMs) attain near-human intelligence levels (Mitchell & Krakauer, 2023; Bubeck et al., 2023; Kosinski, 2023; Shiffrin & Mitchell, 2023; Ullman, 2023), the current study introduces a benchmark for evaluating social intelligence, one of the most distinctive aspects of human cognition. We developed a comprehensive theoretical framework for social dynamics and introduced two evaluation tasks: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP). Our approach also encompassed a computational model based on recursive Bayesian inference, adept at elucidating diverse human behavioral patterns. Extensive experiments and detailed analyses revealed that humans surpassed the latest GPT models in overall performance, zero-shot learning, one-shot generalization, and adaptability to multi-modalities. Notably, GPT models demonstrated social intelligence only at the most basic order (order = 0), in stark contrast to human social intelligence (order >= 2). Further examination indicated a propensity of LLMs to rely on pattern recognition for shortcuts, casting doubt on their possession of authentic human-level social intelligence.

IntentQA: Context-aware Video Intent Reasoning

IntentQA: Context-aware Video Intent Reasoning

Jiapeng Li, Ping Wei, Wenjuan Han, Lifeng Fan

In this paper, we propose a novel task IntentQA, a special VideoQA task focusing on video intent reasoning, which has become increasingly important for AI with its advantages in equipping AI agents with the capability of reasoning beyond mere recognition in daily tasks. We also contribute a large-scale VideoQA dataset for this task. We propose a Context-aware Video Intent Reasoning model (CaVIR) consisting of i) Video Query Language (VQL) for better cross-modal representation of the situational context, ii) Contrastive Learning module for utilizing the contrastive context, and iii) Commonsense Reasoning module for incorporating the commonsense context. Comprehensive experiments on this challenging task demonstrate the effectiveness of each model component, the superiority of our full model over other baselines, and the generalizability of our model to a new VideoQA task.

Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households

Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households

Zhihao Cao, Zidong Wang, Siwen Xie, et al.

Despite the significant demand for assistive technology among vulnerable groups (e.g. the elderly children and the disabled) in daily tasks research into advanced AI-driven assistive solutions that genuinely accommodate their diverse needs remains sparse. Traditional human-machine interaction tasks often require machines to simply help without nuanced consideration of human abilities and feelings such as their opportunity for practice and learning sense of self-improvement and self-esteem. Addressing this gap we define a pivotal and novel challenge Smart Help which aims to provide proactive yet adaptive support to human agents with diverse disabilities and dynamic goals in various tasks and environments. To establish this challenge we leverage AI2-THOR to build a new interactive 3D realistic household environment for the Smart Help task. We introduce an innovative opponent modeling module that provides a nuanced understanding of the main agent's capabilities and goals in order to optimize the assisting agent's helping policy. Rigorous experiments validate the efficacy of our model components and show the superiority of our holistic approach against established baselines. Our findings illustrate the potential of AI-imbued assistive robots in improving the well-being of vulnerable groups.