TongTest | AGI Evaluation System

Multidimensional Index System

Scientific, rigorous, and comprehensive evaluation dimensions, establishing the benchmark for the AGI era.

General Testing

This assessment delineates six core dimensions—vision, language, cognition, motion, learning, and value—grounded in the developmental psychology of human children to quantify an agent's mental development level.

Main ranking (2 general evaluations)

General Testing Ranking

Top 5 model performance data based on Basic Family Comprehensive Tasks

Scores measure how often each model completes daily composite tasks in a simulated home environment. The ability view breaks results down by task type; the dimension view groups the same results into object understanding, spatial intelligence, and social activity.

Rank	Model	Avg	Counting Objects	Preparing Baggage	Building Blocks	Jigsaw Puzzle	Understanding Buttons	Setting Tables	Tidying Up Rooms	Selecting Gifts
1	Google Gemini 2.5 Pro	24.53	48.0	12.4	10.0	5.0	3.3	26.7	22.8	68.1
2	Google Gemini 2.5 Flash	23.05	42.0	11.1	5.5	5.3	3.3	25.8	23.2	68.2
3	OpenAI o3	22.88	54.0	10.3	10.0	6.4	3.3	14.3	18.8	65.9
4	OpenAI GPT-5	21.54	36.0	9.5	3.8	6.0	3.3	28.7	16.0	69.1
5	Anthropic Claude Sonnet 3.7	20.52	46.0	3.4	8.9	6.3	0.0	23.8	16.1	59.7
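
The page does not state how the Avg column is derived. As a rough check, the Python sketch below assumes Avg is the unweighted mean of the eight per-task scores and recomputes it from the published figures. The names SCORES, mean_score, and max_avg_gap are illustrative only and are not part of TongTest.

```python
# Sketch only: assumes "Avg" is the unweighted mean of the eight task scores.
# All numbers are copied from the ranking table; nothing here is a TongTest API.

# model -> (listed Avg, per-task scores in table column order)
SCORES = {
    "Google Gemini 2.5 Pro": (24.53, [48.0, 12.4, 10.0, 5.0, 3.3, 26.7, 22.8, 68.1]),
    "Google Gemini 2.5 Flash": (23.05, [42.0, 11.1, 5.5, 5.3, 3.3, 25.8, 23.2, 68.2]),
    "OpenAI o3": (22.88, [54.0, 10.3, 10.0, 6.4, 3.3, 14.3, 18.8, 65.9]),
    "OpenAI GPT-5": (21.54, [36.0, 9.5, 3.8, 6.0, 3.3, 28.7, 16.0, 69.1]),
    "Anthropic Claude Sonnet 3.7": (20.52, [46.0, 3.4, 8.9, 6.3, 0.0, 23.8, 16.1, 59.7]),
}


def mean_score(task_scores):
    """Unweighted mean of a list of per-task scores."""
    return sum(task_scores) / len(task_scores)


def max_avg_gap(scores):
    """Largest absolute gap between a listed Avg and the recomputed mean."""
    return max(abs(mean_score(tasks) - listed) for listed, tasks in scores.values())


if __name__ == "__main__":
    for model, (listed, tasks) in SCORES.items():
        print(f"{model}: listed {listed:.2f}, recomputed {mean_score(tasks):.4f}")
    print(f"largest gap: {max_avg_gap(SCORES):.4f}")
```

Under that assumption, the recomputed means agree with the listed Avg values to within roughly 0.01, which is consistent with the per-task scores having been rounded to one decimal place before publication.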

Specialized Testing

This assessment provides in-depth evaluation of advanced intelligence domains, including abstract reasoning, geometric proof, theory of mind, and intuitive physics.

Updates

Follow key TongTest releases, research progress, and standards development.

Featured | Publishing | Mar 28, 2026

Chinese Edition of AGI Standards, Rating, Testing, and Architecture Published

The Chinese edition of AGI Standards, Rating, Testing, and Architecture has been published and received the 2025 Impactful New Book Award from the Async Community. The book systematically presents methods for AGI standards, rating, testing, and architecture, providing theoretical and methodological support for TongTest.

Chinese edition has been published

Received the Async Community 2025 Impactful New Book Award

Covers AGI standards, rating, testing, and architecture