FORGE: Fine-grained multimodal evaluation for manufacturing scenarios
Summary (Overview)
- Introduces FORGE, the first large-scale, fine-grained multimodal benchmark for evaluating Multimodal Large Language Models (MLLMs) in real-world manufacturing scenarios, combining 2D images and 3D point clouds with detailed semantic annotations (e.g., exact model numbers).
- Reveals significant performance gaps by evaluating 18 state-of-the-art MLLMs across three core manufacturing tasks: Workpiece Verification (WORK_VERI), Structural Surface Inspection (SURF_INSP), and Assembly Verification (ASSY_VERI).
- Identifies insufficient domain-specific knowledge, not visual grounding, as the primary bottleneck for MLLMs in manufacturing, challenging conventional assumptions.
- Demonstrates the dataset's value as a training resource: Supervised Fine-Tuning (SFT) of a compact 3B-parameter model on FORGE data yields up to a 90.8% relative improvement in accuracy on held-out manufacturing scenarios.
Introduction and Theoretical Foundation
The manufacturing sector generates massive heterogeneous data and relies on complex decision-making, creating a strong need for intelligent systems capable of higher-level cognitive tasks. While traditional Computer Vision (CV) models serve as perception modules (e.g., for anomaly detection), they are limited by an inability to reason and execute autonomous control. In contrast, Multimodal Large Language Models (MLLMs) have demonstrated remarkable generalization and reasoning capabilities across diverse domains, presenting a transformative potential to bridge low-level perception and high-level planning in manufacturing.
However, progress is hindered by three fundamental challenges:
- Data Scarcity Gap: Current manufacturing datasets are limited in scale and diversity, often relying on simulated or CAD-based data.
- Lack of Fine-Grained Domain Semantics: Existing datasets treat workpieces as generic visual subjects, failing to integrate explicit, fine-grained semantics (e.g., model numbers) essential for real-world rigor.
- Absence of Comprehensive Evaluation Frameworks: There is no systematic benchmark to assess MLLMs' reasoning and decision-making capabilities in manufacturing scenarios.
FORGE is introduced to address these challenges, aiming to answer the core inquiry: Can MLLMs understand, explain, and execute decisions for tasks inherently characteristic of the manufacturing domain?
Methodology
1. Dataset Curation
FORGE is built on a high-quality multimodal dataset collected from authentic manufacturing components.
- Data Collection:
- 3D Point Cloud Subset: High-fidelity geometric data covering 14 workpiece categories across 90 distinct models. Supports all three tasks.
- Image Subset: Approximately 3,000 images capturing four distinct manufacturing scenarios (e.g., expansion screw assemblies), including both normal and abnormal samples.
- Data Processing:
- For 2D images, ground-truth labels were established via automated contour extraction followed by manual refinement.
- For 3D point clouds, strategies varied by task. For WORK_VERI and ASSY_VERI, batch samples were synthesized by stitching 4-5 individual point clouds. For SURF_INSP, four typical defects (Crack, Deformation, Dent, Cut) were simulated using morphology-based algorithms.
- 3D Modality Bridge: Since general MLLMs lack native 3D encoders, a multi-view projection strategy is adopted. All 3D point clouds are rendered as three-view (3V) images (front, side, top orthogonal projections) to preserve geometric structure while maintaining compatibility with standard visual inputs.
- The final dataset comprises approximately 12,000 samples across all tasks. Detailed statistics are provided in the Appendix.
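The multi-view projection step described above can be sketched as follows. This is a minimal illustration, not the paper's actual renderer: a hypothetical `render_three_views` that rasterizes an (N, 3) point cloud into front/side/top occupancy images, assuming front = (x, z), side = (y, z), top = (x, y) axis pairings.

```python
import numpy as np

def render_three_views(points, res=64):
    """Rasterize an (N, 3) point cloud into three binary orthographic
    projections (front, side, top) of shape (res, res).

    Axis pairings are an assumption: front = (x, z), side = (y, z), top = (x, y).
    """
    # Normalize each coordinate axis into [0, 1].
    mins, maxs = points.min(axis=0), points.max(axis=0)
    norm = (points - mins) / np.where(maxs > mins, maxs - mins, 1.0)
    views = {}
    for name, (a, b) in {"front": (0, 2), "side": (1, 2), "top": (0, 1)}.items():
        img = np.zeros((res, res), dtype=np.uint8)
        cols = np.minimum((norm[:, a] * (res - 1)).astype(int), res - 1)
        # Flip vertically so larger coordinate values appear higher in the image.
        rows = np.minimum(((1 - norm[:, b]) * (res - 1)).astype(int), res - 1)
        img[rows, cols] = 1  # mark occupied pixels
        views[name] = img
    return views

# Example: a synthetic cylindrical "workpiece"
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 2000)
z = rng.uniform(0.0, 1.0, 2000)
cloud = np.stack([np.cos(theta), np.sin(theta), z], axis=1)
views = render_three_views(cloud)
print({name: int(img.sum()) for name, img in views.items()})
```

In practice the benchmark presumably renders shaded views rather than binary masks, but the principle is the same: orthographic projection preserves the part's silhouette in each plane while producing standard 2D inputs any MLLM can consume.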
2. Task Design
Three evaluation tasks are designed, aligned with critical pillars of manufacturing automation: material sorting, quality inspection, and assembly recognition.
- Task 1: Workpiece Verification (WORK_VERI) - Evaluates material sorting. MLLMs must analyze inputs to identify workpieces that do not belong to the current batch, given explicit specifications.
- Scenarios: One image-based (Pneumatic Connectors - PCS_SCENARIO) and two point cloud-based (Cup Head Screws - CHS_SCENARIO, Nuts - NUTS_SCENARIO).
- Task 2: Structural Surface Inspection (SURF_INSP) - Evaluates quality inspection. MLLMs must identify manufacturing defects from workpiece data.
- Process: (1) Defect detection (Yes/No), (2) Defect type classification (Crack, Cut, Deformation, Dent, or Good).
- Coverage: 14 distinct manufacturing components with 3D point cloud data.
- Task 3: Assembly Verification (ASSY_VERI) - Evaluates assembly recognition. MLLMs must analyze inputs to identify workpieces that fail to meet assembly specifications, requiring reasoning over complex assembly rules.
- Scenarios: Three image-based (Metal Expansion Screws - MES_SCENARIO, Plastic Expansion Screws - PES_SCENARIO, CNC Fixtures - CNC_SCENARIO) and one point cloud-based (compatibility among screws, washers, nuts - SWN_SCENARIO).
Potential error scenarios are categorized into two classes:
- Different Workpiece: Coarse-grained failures (e.g., workpiece mismatches).
- Different Model Number: Fine-grained errors from subtle model variations (e.g., M10 vs. M12 screw).
3. Evaluation Protocol & Models
- Formulation: All tasks are formulated as Multiple-Choice Questions (MCQs). For image-based evaluation, options correspond to parts identified by normalized center coordinates (e.g., "A. Part at [0.70, 0.44]"). For three-view evaluation, components are annotated with letter labels (A–F) using the Set-of-Mark visual prompting strategy.
- Evaluation Settings:
- Zero-Shot: Only test image/3V rendering and task-specific query.
- Reference-Conditioned (Ref-Cond): Adds reference images of correct, normal assemblies (or defect-free surfaces).
- In-Context Demonstration (ICD): Adds complete solved examples as multi-turn dialogue pairs on top of Ref-Cond.
- Metric: Exact-match accuracy (percentage of cases where predicted MCQ letter exactly matches ground-truth).
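Under this formulation, scoring reduces to exact-match over predicted option letters. A minimal sketch (function name and normalization choices are illustrative, not from the paper):

```python
def exact_match_accuracy(predictions, ground_truth):
    """Percentage of cases where the predicted MCQ letter exactly matches
    the ground-truth letter (case-insensitive, whitespace-stripped)."""
    assert len(predictions) == len(ground_truth)
    hits = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, ground_truth)
    )
    return 100.0 * hits / len(predictions)

# Example: 3 of 4 predictions match
preds = ["A", "c ", "B", "D"]
gold = ["A", "C", "B", "A"]
print(exact_match_accuracy(preds, gold))  # → 75.0
```

The strict-match design means any answer the model fails to express as a single option letter counts as wrong, which keeps the metric unambiguous across 18 heterogeneous models.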
- Evaluated Models: 18 representative MLLMs across open- and closed-source families, as listed below:
| Open-Source / Weights Models | Closed-Source Models |
|---|---|
| Google Gemma-3-27B | OpenAI GPT-5 / 5.2 |
| OpenGVLab InternVL3-78B | OpenAI GPT-5-Mini |
| Meta Llama-4-Maverick | OpenAI O3 |
| Mistral Mi(ni)stral-3-8B/14B/Large | Google Gemini-2.5/3-Flash |
| Alibaba Qwen3-VL-8B/235B | Anthropic Claude-4.5-Opus |
| Zhipu AI GLM-4.6V | ByteDance Seed-1.6 |
| Moonshot Kimi-K2.5 | |
Empirical Validation / Results
Main Benchmark Results
Table 3 summarizes the main results across all tasks, modalities, and settings. Key findings distilled from these results:
A. Current MLLMs demonstrate better understanding of semantics than morphological analysis.
- Leading models (e.g., Kimi-K2.5, Gemini-3-Flash) performed well on WORK_VERI and ASSY_VERI but poorly on SURF_INSP.
- This indicates a fundamental capability disparity between macroscopic part discrimination (recognition) and microscopic surface morphology analysis (perception).
B. Limited comprehension of domain knowledge is the bottleneck for current MLLMs.
- In WORK_VERI and ASSY_VERI (image modality), simple Ref-Cond strategies did not consistently yield gains and sometimes led to degradation.
- ICD methods with complete reasoning demonstrations achieved universal improvements over Zero-Shot.
- This suggests MLLMs lack a deep understanding of task logic and reasoning paths, a gap bridged by ICD but not Ref-Cond.
C. Given limited perceptual understanding of 3D spatial contexts, introducing additional examples hinders MLLM comprehension.
- For three-view modality, MLLMs achieved optimal performance under Zero-Shot, with performance declining after introducing Ref-Cond and ICD.
- This counterintuitive phenomenon suggests contextual examples induce spatial confusion, impeding comprehension of manufacturing domain knowledge.
- Workpiece-level recognition (highly visual-dependent) is more severely affected than model-number-level recognition.
D. Model-number-level tasks are more challenging for MLLMs compared to workpiece-level tasks.
- A comprehensive analysis reveals a distinct performance disparity: MLLMs consistently underperform on model-number-level tasks compared to workpiece-level tasks.
- This substantial gap indicates that while MLLMs have established some understanding of general workpieces, significant room for improvement remains in capturing fine-grained domain specificity.
Qualitative Error Case Analysis
Two representative error cases from ASSY_VERI illustrate recurring failure modes:
- Misjudging and over-relying on material properties (MES_SCENARIO): The model hallucinates material properties from visual textures (misidentifying a metal Flat Washer as "plastic/nylon") and uses this erroneous inference to make an incorrect judgment. This indicates MLLMs are developing the potential to autonomously recognize workpiece materials and integrate inferred physical properties into reasoning.
- Failure on model number recognition but showing emerging capabilities in service condition assessment (CNC_SCENARIO): While the model successfully identifies workpiece types, it incorrectly concludes which part has the wrong size. However, its intermediate reasoning reveals capabilities for evaluating service conditions (noting "heavy wear" or "chipping"), indicating potential for graded degradation assessment supporting Predictive Maintenance (PdM).
Bottleneck Analysis
Three complementary analyses were designed to disentangle visual-perception from domain-knowledge limitations.
A. Visual grounding is not the bottleneck.
- Probing Task: Models were tested on dedicated visual grounding tasks (single-image: C → L, L → C; cross-image: L → L, C → C).
- Results: As shown in Table 4, top models achieved near-ceiling accuracy on single-image grounding (e.g., Gemini-3-Flash: 98.9% average). Cross-image comparison was harder but remained well above chance.
- Conclusion: Failures on the full benchmark cannot be attributed to poor visual localization. Visual grounding is not the primary limiting factor.
Table 4: Visual grounding bottleneck analysis results (accuracy %), single-image subtasks.
| Model | Single-Image C → L | Single-Image L → C |
|---|---|---|
| Gemini-3-Flash | 98.2 | 99.6 |
| GPT-5.2 | 74.6 | 97.6 |
| Qwen3-VL-235B | 85.4 | 98.8 |
| Seed 1.6 | 42.0 | 99.2 |
| Mistral-3-8B | 66.0 | 70.6 |
B. Fine-grained part identification remains a domain-knowledge bottleneck.
- Probing Task: "Missing part" scenario where the model is provided with an explicit assembly specification and must identify the absent component.
- Results: As shown in Table 5, top models achieved 74.9–90.7% overall accuracy on images (well above the 23.3% random baseline), demonstrating they can reason about assembly completeness.
- Systematic Failure: All five models struggled with flat washer detection (23.3–60.0% on images). Error analysis revealed models could detect a washer was absent but could not determine which washer, despite distinct physical forms.
- Conclusion: Since grounding ability is confirmed, this confusion indicates insufficient fine-grained manufacturing knowledge of functional and morphological differences between part variants, rather than a perceptual failure.
Table 5: Zero-shot missing-part detection (accuracy %). Superscripts denote scenario: ¹ MES_SCENARIO (6 options), ² PES_SCENARIO (3 options), ³ CNC_SCENARIO (5 options). Three-view covers SWN_SCENARIO only (5 options). FW=Flat Washer, SW=Spring Washer, Sc=Screw, An=Anchor, Nu=Nut, We=Wedge, Norm=No missing part.
| Model | FW¹ (Image) | SW¹ (Image) |
|---|---|---|
| Gemini-3-Flash | 36.7 | 100 |
| GPT-5.2 | 60.0 | 83.3 |
| Qwen3-VL-235B | 23.3 | 100 |
| Seed 1.6 | 26.7 | 100 |
| Mistral-3-8B | 36.7 | 40.0 |
C. Visual projection is necessary: the text channel cannot replace it for generic MLLMs.
- Probing Task: Serializing point clouds as integer-scaled text tables and feeding them directly to MLLMs, bypassing visual rendering.
- Results: As shown in Table 6, both tested models (Gemini-3-Flash, Qwen3-235B) performed near the random baseline on SURF_INSP (surface defect classification). Only WORK_VERI showed a moderate signal above chance.
- Conclusion: Among input channels available to general-purpose MLLMs, visual rendering via multi-view projection is a relatively more effective interface for 3D manufacturing data than raw coordinate serialization.
Table 6: Raw point cloud text input Bottleneck Analysis (accuracy %). 3D coordinates are serialized as integer-scaled text tables and fed directly to MLLMs, bypassing visual rendering. ZS = Zero-Shot, RC = Ref-Cond.
| Model | ASSY_VERI | SURF_INSP | WORK_VERI |
|---|---|---|---|
| Gemini-3-Flash | 25.2 | 32.7 | 35.0 |
| Qwen3-235B | 25.2 | 34.2 | 32.7 |
| Random baseline | 25.0 | 20.0 | 25.0 |
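The text-channel probe in analysis C can be sketched as follows. This is a hypothetical serialization; the `scale` and `max_points` parameters are illustrative choices, since the summary does not specify the benchmark's exact formatting:

```python
import numpy as np

def serialize_point_cloud(points, scale=100, max_points=256):
    """Serialize an (N, 3) point cloud as an integer-scaled plain-text table,
    suitable for feeding to an MLLM's text channel in place of a rendering.

    `scale` and `max_points` are illustrative assumptions, not the
    benchmark's documented parameters.
    """
    if len(points) > max_points:
        # Uniform subsampling to keep the prompt within context limits.
        idx = np.linspace(0, len(points) - 1, max_points).astype(int)
        points = points[idx]
    ints = np.round(points * scale).astype(int)
    lines = ["x y z"] + [f"{x} {y} {z}" for x, y, z in ints]
    return "\n".join(lines)

cloud = np.array([[0.123, 0.456, 0.789], [1.0, 0.0, -0.5]])
print(serialize_point_cloud(cloud))
```

Even with generous token budgets, a coordinate table strips away the local surface continuity that defect classification depends on, which is consistent with the near-chance SURF_INSP results in Table 6.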
From Benchmark to Training Resource
The bottleneck analysis identified insufficient manufacturing domain knowledge as the primary gap. To investigate whether FORGE annotations can close this gap, domain-specific Supervised Fine-Tuning (SFT) was performed.
- Model & Protocol: Qwen2.5-VL-3B-Instruct was fine-tuned using task-specific SFT with a scenario-based train/eval split (training on one scenario, evaluating on a held-out, unseen scenario) to ensure improvements reflect transferable reasoning, not memorization.
- Results: As shown in Figure 6, SFT yielded substantial improvements:
- WORK_VERI 3V: 90.8% relative improvement (28.2% → 53.8%), bringing the 3B model on par with Qwen3-VL-235B (54.4%), a model 78× larger.
- ASSY_VERI Image: 27.1% relative gain (24.0% →