Specialized Testing

This assessment provides in-depth evaluation of advanced intelligence domains, including abstract reasoning, geometric proof, theory of mind, and intuitive physics.

Specialized Testing evaluates specific goals and advanced capabilities without assuming that an agent is complete as a general system. AlphaGo is a representative example: it achieves exceptional performance in Go and can defeat professional human players, but this does not imply competence in domains such as physical interaction or social reasoning. The value of Specialized Testing is to characterize capability boundaries in focused tasks and to provide comparable evidence for areas such as abstract reasoning, geometric reasoning, algebraic reasoning, structured reasoning, intuitive physics, and social intelligence.

Evaluation Dimensions

Covering key capabilities across specialized intelligence dimensions.

Abstract Reasoning

Logic & Pattern Recognition: Evaluates the ability to identify patterns, relationships, and structures without explicit rules. It tests fluid intelligence, including pattern recognition, rule induction, and logical analogies (e.g., graphic sequences, number perception, anomaly detection).

Geometric Reasoning

Spatial & Mathematical Logic: Tests the understanding of geometric theorems, spatial imagination, and the ability to construct auxiliary lines. It assesses logic, spatial thinking, and the application of geometric knowledge to solve proofs and problems.

Algebraic Reasoning

Symbolic Computation: Focuses on solving elementary algebra problems (e.g., inequalities, sequence formulas) and mathematical proofs. It evaluates the agent's ability to perform algebraic transformations, apply theorems, and learn reasoning strategies from small samples.

Structured Reasoning

Self-Correction & Reflection: Measures the ability to analyze information, reflect on self-behavior, and correct errors. It specifically tests "Chain of Thought" reasoning and the capacity to generate self-reflections to fix mistakes in complex logical tasks without retraining.

Intuitive Physics

Physical World Understanding: Assesses the agent's understanding of physical laws (collision, blocking, permanence, continuity). It evaluates prediction (what happens next), hypothesis (inferring hidden states), and explanation (why an event occurred) regarding physical interactions.

Social Intelligence

Theory of Mind (ToM): Tests "reading the room" or Cha Yan Guan Se (察言观色). It evaluates the ability to infer mental states (beliefs, intents, desires, emotions) and social causal relationships from social cues and interactions in video or text scenarios.

Value Alignment

Ethics & Norms: Evaluates the agent's value orientation and comprehension. It tests alignment with human values, psychological traits, and social norms across different levels, ensuring the agent understands and acts according to ethical standards.

Key Advantages

Deep exploration, precise evaluation of high-level intelligence.

Targeted Problem Solving

Focus on Specific Utility: Unlike General Testing, this testing acknowledges that agents don't always need to be "generalists." It validates high-performance capabilities in specific domains (like AlphaGo) that are crucial for practical production and life scenarios.

Cognitive Depth

Beyond Surface Perception: The evaluation goes deeper than simple output generation. It tests deep cognitive functions like "fluid intelligence" (abstract reasoning), "spatial imagination" (geometry), and "mathematical intuition" (algebra), pushing the boundaries of machine intelligence.

Self-Evolution Mechanism

Dynamic Error Correction: Through "Structured Reasoning" testing, the framework emphasizes the agent's ability to reflect and self-correct. This ensures the model can improve its reasoning paths and fix logic errors without the need for expensive retraining.

Human-Like Cognition

Developmental Benchmarking: Tests like "Intuitive Physics" and "Social Intelligence" compare AI performance against human developmental milestones (e.g., infant physical understanding, social causality), revealing gaps between current AI and human-level cognition.

Value-Driven Assessment

Psychological & Ethical Rigor: The inclusion of ValueBench integrates professional psychological measurement scales. It moves beyond simple safety checks to a comprehensive evaluation of the agent's internal value system, personality traits, and alignment with complex human social norms.

Multi-Modal Complexity

Cross-Modal Reasoning: The tests require processing diverse inputs—visual (geometry, physical videos), symbolic (algebra), and social (interaction videos)—ensuring the agent can handle complex, multi-modal information streams rather than just text.

Methodology & Benchmarks

Methods and benchmark suites that probe advanced reasoning and specialized intelligence.

$OneMillion-Bench: How Far are Language Agents from Human Experts?

Qianyu Yang, Yang Liu, Jiaqi Li, et al.

As language models evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. $OneMillion-Bench, also called $1M-Bench, introduces 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science to evaluate agents in economically consequential scenarios. The benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making constrained decisions, where correctness depends on both the reasoning process and the final answer. Its rubric-based protocol scores factual accuracy, logical coherence, practical feasibility, and professional compliance, providing a unified testbed for assessing agentic reliability, professional depth, and practical readiness for industry digital workers.

Read the Paper

AEGIS: AUTOMATED ERROR GENERATION AND ATTRIBUTION FOR MULTI-AGENT SYSTEMS

Fanqi Kong, Ruijie Zhang, Huaxiao Yin, et al.

Large language model based multi-agent systems (MAS) have unlocked significant advancements in tackling complex problems, but their increasing capability introduces a structural fragility that makes them difficult to debug. A key obstacle to improving their reliability is the severe scarcity of large-scale, diverse datasets for error attribution, as existing resources rely on costly and unscalable manual annotation. To address this bottleneck, we introduce Aegis, a novel framework for Automated error generation and attribution for multi-agent systems. Aegis constructs a large dataset of 9,533 trajectories with annotated faulty agents and error modes, covering diverse MAS architectures and task domains. This is achieved using a LLM-based manipulator that can adaptively inject context-aware errors into successful execution trajectories. Leveraging finegrained labels and the structured arrangement of positive-negative sample pairs, Aegis supports three different learning paradigms: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning. We develop learning methods for each paradigm. Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution. Notably, several of our fine-tuned LLMs demonstrate performance competitive with or superior to proprietary models an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi-agent systems. Our project website is available at https://kfq20.github.io/Aegis-Website/.

Read the Paper Dataset

IntentQA: Context-aware Video Intent Reasoning

Jiapeng Li, Ping Wei, Wenjuan Han, Lifeng Fan

In this paper, we propose a novel task IntentQA, a special VideoQA task focusing on video intent reasoning, which has become increasingly important for AI with its advantages in equipping AI agents with the capability of reasoning beyond mere recognition in daily tasks. We also contribute a large-scale VideoQA dataset for this task. We propose a Context-aware Video Intent Reasoning model (CaVIR) consisting of i) Video Query Language (VQL) for better cross-modal representation of the situational context, ii) Contrastive Learning module for utilizing the contrastive context, and iii) Commonsense Reasoning module for incorporating the commonsense context. Comprehensive experiments on this challenging task demonstrate the effectiveness of each model component, the superiority of our full model over other baselines, and the generalizability of our model to a new VideoQA task.

Read the Paper View on GitHub Live Demo

Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households

Zhihao Cao, Zidong Wang, Siwen Xie, et al.

Despite the significant demand for assistive technology among vulnerable groups (e.g. the elderly children and the disabled) in daily tasks research into advanced AI-driven assistive solutions that genuinely accommodate their diverse needs remains sparse. Traditional human-machine interaction tasks often require machines to simply help without nuanced consideration of human abilities and feelings such as their opportunity for practice and learning sense of self-improvement and self-esteem. Addressing this gap we define a pivotal and novel challenge Smart Help which aims to provide proactive yet adaptive support to human agents with diverse disabilities and dynamic goals in various tasks and environments. To establish this challenge we leverage AI2-THOR to build a new interactive 3D realistic household environment for the Smart Help task. We introduce an innovative opponent modeling module that provides a nuanced understanding of the main agent's capabilities and goals in order to optimize the assisting agent's helping policy. Rigorous experiments validate the efficacy of our model components and show the superiority of our holistic approach against established baselines. Our findings illustrate the potential of AI-imbued assistive robots in improving the well-being of vulnerable groups.

Read the Paper View on GitHub