General Testing

This assessment defines six core dimensions (vision, language, cognition, motion, learning, and value), grounded in the developmental psychology of human children, to quantify an agent's level of mental development.

General Testing focuses on the foundational capabilities required of an agent as an integrated system, assessing whether it can learn, adapt, and complete composite tasks in human-centered environments. It evaluates not only the breadth of capability coverage, but also the ability to learn from limited experience, transfer knowledge, and generalize to new situations. Drawing on developmental psychology, the framework maps levels of general artificial intelligence to human developmental stages, providing an interpretable and comparable quantitative scale.

Evaluation Framework

The framework covers three evaluations: generalization, autonomy, and value.

Generalization Testing

Assesses transfer and causal generalization to new tasks and environments; extends the 8 home-based general testing tasks and explores generalization on causal-logic tasks, focusing on generalization gaps and robustness.
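One common way to quantify a generalization gap is the difference between mean performance on tasks the agent has already seen and on held-out variants. The sketch below assumes that definition. The function name and the scores are hypothetical illustrations, not the benchmark's actual metric or data.

```python
from statistics import mean


def generalization_gap(seen_scores, unseen_scores):
    """Return mean(seen) - mean(unseen); a larger value means worse transfer.

    Assumed definition, for illustration only.
    """
    if not seen_scores or not unseen_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(seen_scores) - mean(unseen_scores)


# Hypothetical per-task success rates in [0, 1].
seen = [0.80, 0.70, 0.90]
unseen = [0.60, 0.50, 0.70]

gap = generalization_gap(seen, unseen)
print(f"generalization gap: {gap:.2f}")
```

A gap near zero would suggest that performance carries over to the new tasks, while a large positive gap would point to overfitting to the original task set.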

Autonomy Testing

Assesses autonomous task definition, planning, and execution in home settings; builds value-driven self-defined task scenarios, focusing on task definition quality and closed-loop execution.

Value Testing

Assesses value orientation, stability, and regulation; based on an in-house value dataset, focusing on alignment behavior and stability under perturbation.

Key Advantages

Six key advantages for a more professional evaluation system.

Human-Development Alignment

Drawing on developmental psychology, agent capabilities are mapped to, and quantitatively evaluated against, different stages of human childhood (e.g., ages 3-4 and 5-6).

Dual System of Ability & Value

Tests not only Ability (U) but also Value (V), evaluating the agent's ethics, emotions, and adaptability to social norms.

General Testing Dimensions

Covers six core dimensions (vision, language, cognition, motion, learning, and value) so that the evaluation spans the agent's full range of capabilities.

Embodied Interaction

Uses the TongSim simulation environment to test the agent's real-time perception, decision-making, and action in physical and social environments, rather than relying only on static problem-solving tasks.

Human-Machine Comparison

Introduces human-machine comparison tests (e.g., tidying a room, active collaboration) that directly compare the performance of agents and human children on the same tasks.

Dynamic Generalization

Task scenarios (e.g., desktop organization, indoor storage) support random generation and complex combinations, testing the agent's generalization and adaptability in unknown environments.
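Random scenario generation of this kind can be sketched with a seeded sampler, so that every generated scene is reproducible. This is a minimal illustration under stated assumptions: the scene names come from the examples above, while the object list, the function name, and the dictionary layout are hypothetical and do not reflect the TongSim API.

```python
import random

# Scene types taken from the examples in the text.
SCENES = ["desktop organization", "indoor storage"]
# Hypothetical object pool; not an actual TongSim asset list.
OBJECTS = ["cup", "book", "toy", "pen", "box"]


def generate_scenario(seed, n_objects=3):
    """Sample one scenario deterministically from a seed."""
    rng = random.Random(seed)  # local RNG keeps each scenario reproducible
    return {
        "scene": rng.choice(SCENES),
        "objects": rng.sample(OBJECTS, n_objects),  # distinct objects
    }


scenarios = [generate_scenario(seed) for seed in range(5)]
for scenario in scenarios:
    print(scenario["scene"], scenario["objects"])
```

Fixing the seed per scenario lets the same randomly generated scene be replayed for different agents, which keeps comparisons on unseen environments fair.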

Methodology & Benchmarks

Methods and benchmark suites that define General Testing.

Evaluating Multimodal Large Language Models with Daily Composite Tasks in Home Environments

Model Leaderboard (based on Basic Family Comprehensive Tasks, Top 5)

#  Model                        Total
1  Google Gemini 2.5 Pro        24.53
2  Google Gemini 2.5 Flash      23.05
3  OpenAI o3                    22.88
4  OpenAI GPT-5                 21.54
5  Anthropic Claude Sonnet 3.7  20.52

Zhenliang Zhang, Yuxi Wang, Hongzhao Xie, et al.

A key feature differentiating artificial general intelligence (AGI) from traditional AI is that AGI can perform composite tasks that require a wide range of capabilities. Although embodied agents powered by multimodal large language models (MLLMs) offer rich perceptual and interactive capabilities, it remains largely unexplored whether they can solve composite tasks. In the current work, we designed a set of composite tasks inspired by common daily activities observed in early childhood development. Within a dynamic and simulated home environment, these tasks span three core domains: object un...

Automatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents

Model Leaderboard (based on In-Situ Embodied Task Evaluation, Top 5)

#  Model                    Total
1  Human Baseline           74.74
2  OpenAI o3                67.70
3  OpenAI GPT-5             60.76
4  Google Gemini 2.5 Flash  60.60
5  Anthropic Claude Opus 4  60.14

Xinyi He, Ying Yang, Chuanjian Fu, et al.

As general intelligent agents are poised for widespread deployment in diverse households, evaluation tailored to each unique unseen 3D environment has become a critical prerequisite. However, existing benchmarks suffer from severe data contamination and a lack of scene specificity, inadequate for assessing agent capabilities in unseen settings. To address this, we propose a dynamic in-situ task generation method for unseen environments inspired by human cognition. We define tasks through a structured graph representation and construct a two-stage interaction-evolution task generation system fo...
