General Testing
This assessment delineates six core dimensions—vision, language, cognition, motion, learning, and value—grounded in the developmental psychology of human children to quantify an agent's mental development level.
General Testing focuses on the foundational capabilities required of an agent as an integrated system, assessing whether it can learn, adapt, and complete composite tasks in human-centered environments. It evaluates not only the breadth of capability coverage, but also the ability to learn from limited experience, transfer knowledge, and generalize to new situations. Drawing on developmental psychology, the framework maps levels of general artificial intelligence to human developmental stages, providing an interpretable and comparable quantitative scale.
Evaluation Framework
Covering generalization, value, and autonomy evaluation frameworks.
Generalization Testing
Assesses transfer and causal generalization in new tasks and environments; extends 8 home general testing tasks and explores causal-logic task generalization, focusing on generalization gaps and robustness.
Autonomy Testing
Assesses autonomous task definition, planning, and execution in home settings; builds value-driven self-defined task scenarios, focusing on task definition quality and closed-loop execution.
Value Testing
Assesses value orientation, stability, and regulation; based on an in-house value dataset, focusing on alignment behavior and stability under perturbation.
Key Advantages
Six key advantages for a more professional evaluation system.
Human-Development Alignment
Referring to developmental psychology, mapping and quantitatively evaluating agent capabilities against different stages of human childhood (e.g., ages 3-4 and 5-6).
Dual System of Ability & Value
Testing not only Ability (U) but also emphasizing Value (V), evaluating the agent's ethics, emotions, and social norm adaptability.
General Testing Dimensions
Covering six core dimensions—vision, language, cognition, motion, learning, and value—to ensure the agent's completeness.
Embodied Interaction
Based on the TongSim simulation environment, testing the agent's real-time perception, decision-making, and action in physical and social environments, rather than relying only on static problem-solving tasks.
Human-Machine Comparison
Introducing 'human-machine comparison tests' (e.g., tidying a room, active collaboration) to directly compare the performance of agents and human children in the same tasks.
Dynamic Generalization
Task scenarios (e.g., desktop organization, indoor storage) support random generation and complex combinations, testing the agent's generalization and adaptability in unknown environments.
Methodology & Benchmarks
Methods and benchmark suites that define General Testing.

Model LeaderboardBased on Basic Family Comprehensive TasksTop 5
View full rankingEvaluating Multimodal Large Language Models with Daily Composite Tasks in Home Environments
Zhenliang Zhang, Yuxi Wang, Hongzhao Xie, et al.
A key feature differentiating artificial general intelligence (AGI) from traditional AI is that AGI can perform composite tasks that require a wide range of capabilities. Although embodied agents powered by multimodal large language models (MLLMs) offer rich perceptual and interactive capabilities, it remains largely unexplored whether they can solve composite tasks. In the current work, we designed a set of composite tasks inspired by common daily activities observed in early childhood development. Within a dynamic and simulated home environment, these tasks span three core domains: object un...

Model LeaderboardBased on In-Situ Embodied Task EvaluationTop 5
View full rankingAutomatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents
Xinyi He, Ying Yang, Chuanjian Fu, et al.
As general intelligent agents are poised for widespread deployment in diverse households, evaluation tailored to each unique unseen 3D environment has become a critical prerequisite. However, existing benchmarks suffer from severe data contamination and a lack of scene specificity, inadequate for assessing agent capabilities in unseen settings. To address this, we propose a dynamic in-situ task generation method for unseen environments inspired by human cognition. We define tasks through a structured graph representation and construct a two-stage interaction-evolution task generation system fo...