Summary (Overview)

  • New benchmark and dataset: The paper introduces CaptureGuide-Bench, a benchmark for evaluating capture-time photography guidance—a departure from traditional post-hoc cropping benchmarks. It covers two complementary tasks: (i) photographer-side composition decision (keep/refine/reject) and refinement, and (ii) subject-side scene-conditioned pose recommendation. The accompanying CaptureGuide-Dataset contains ~130K samples with structured annotations (bounding boxes, COCO-17 keypoints, visibility states) and textual rationales.
  • Limitations of existing models: General-purpose MLLMs (e.g., GPT-5.5, Gemini-3.0-Pro) can make composition decisions but lack precise refinement localization. Specialized aesthetic cropping models (e.g., Venus, InstructCrop) localize crops accurately but cannot handle keep/reject decisions. Neither supports structured subject-side pose guidance.
  • Proposed model ShutterMuse: A unified MLLM (based on Qwen3-VL-8B) trained with supervised fine-tuning (SFT) on the dataset followed by group relative policy optimization (GRPO) reinforcement fine-tuning. It outputs structured JSON with decisions, composition boxes, keypoints, and visibility states.
  • State-of-the-art performance: ShutterMuse achieves the best overall photographer-side performance (IoU 74.30%, MLLM-Score 0.64) and competitive subject-side pose recommendation (mean score 0.34) with substantially lower inference cost (4.96s vs. 55–102s for image-editing foundation models).
  • Human-aligned evaluation: A user study confirms that MLLM-based automated evaluation (MLLM-Score) achieves a Spearman rank correlation coefficient of 0.90 with human rankings on the photographer-side task, and identical ranking on the subject-side task.

Introduction and Theoretical Foundation

Real-world photography requires capture-time guidance—deciding whether the current framing should be kept, refined, or rejected, and recommending subject poses that match the scene. Existing aesthetic cropping benchmarks (FCDB, FLMS, SACD) formulate guidance as post-hoc crop prediction, assuming every image can be improved by cropping. This ignores two crucial aspects: (i) some images are already well-composed (keep) or are too flawed to save (reject), and (ii) no existing benchmark provides subject-side pose recommendations that are actionable during capture.

The authors argue that multimodal large language models (MLLMs) have advanced in visual understanding, aesthetic reasoning, and instruction following, but their capability for interactive capture-time guidance remains underexplored. They identify a gap: general MLLMs show some composition reasoning but no precise localization; specialized cropping models are good at refinement but cannot handle keep/reject decisions; neither provides structured pose guidance.

To bridge this gap, they propose a three-way decision scheme (keep/refine/reject) for photographers and a scene-conditioned pose recommendation task for subjects. The theoretical foundation rests on training MLLMs to produce structured outputs (JSON with decisions, bounding boxes, keypoints, visibility) and optimizing them with reinforcement learning to align decisions with expert annotations and subject-preservation constraints.

Methodology

Dataset Construction

CaptureGuide-Dataset (~130K samples) is built through two pipelines:

  1. Photographer-side (100K samples) via an Expert-seeded, MLLM-verified Self-Distillation Pipeline (EMDP):

    • A seed set of 12K images is annotated by trained experts with three-way decisions (refine/keep/reject) and free-text comments.
    • Comments are summarized into structured rationales by an MLLM (Gemini-3.0-Pro).
    • An initial composition model is trained on the seed set, then used to generate pseudo-annotations on 500K unlabeled online images.
    • Pseudo-annotations are filtered by an MLLM verifier (Gemini-3.0-Pro) checking rationale correctness and box-rationale consistency.
    • Verified samples are added iteratively (three rounds) to retrain the model.
  2. Subject-side (30K samples) via a Subject-Side Guidance Generation Pipeline (SGGP):

    • Given a portrait image, the person is removed (Nano-Banana-Pro) to obtain a person-free scene.
    • Human keypoints (COCO-17 format) are extracted from the original portrait using a YOLO-based pose estimator.
    • Gemini-3.0-Pro generates textual rationales explaining why the pose suits the scene; rationales are reviewed and revised by professional annotators.
    • Each keypoint is assigned a visibility state: 1 (visible inside image), 0 (occluded but inside), -1 (outside image frame). Annotations are verified by five experienced photographers.

CaptureGuide-Bench

The benchmark contains 421 held-out photographer-side samples (with 3–5 ground-truth boxes per refine sample) and 552 subject-side samples with balanced pose types. It is strictly separated from training data.

Photographer-side metrics:

  • IoU: maximum Intersection-over-Union between predicted box and all annotated boxes (for refine samples).
  • BDE: minimum boundary displacement error.
  • Refinement success rate (R): percentage of refine samples with IoU > 0.7.
  • Reject success rate (RSR) and Keep success rate (KSR): correct classification of ground-truth reject/keep samples.
  • MLLM-Score: a Gemini-3.0-Pro evaluation of the composition inside the predicted box (or full image for keep, or 0 for incorrect reject) on a scale {0, 0.5, 1}, considering whether all original deficiencies are fixed.

Subject-side metrics (all MLLM-based, three-level scores {0, 0.5, 1}):

  • Physical plausibility: natural body scale, no floating/penetration, human-imitable.
  • Scene interaction: weak (none), moderate (sitting/leaning), strong (interacting with specific objects).
  • Pose aesthetics: ordinary static, moderate detail (crossed legs, walking), rich action details.

ShutterMuse Model

Architecture: Built on Qwen3-VL-8B. For each input (image + prompt), the model generates a structured JSON response.

Photographer-side JSON schema:

{
  "task_type": "composition",
  "reason": "...",
  "composition_xy": [x1, y1, x2, y2]  // empty for reject, [0,0,1,1] for keep, else refine
}

Subject-side JSON schema:

{
  "task_type": "pose",
  "reason": "...",
  "keypoints_xyn": [[x1,y1], ..., [x17,y17]],
  "visibility": [v1, ..., v17]  // 1 visible, 0 occluded, -1 outside
}

Training:

  1. Supervised Fine-Tuning (SFT) on CaptureGuide-Dataset with next-token prediction loss:

    LSFT(θ)=E(q,y)DSFT[1Lt=1Llogπθ(ytq,y<t)](1)\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(q, y^*) \sim \mathcal{D}_{\text{SFT}}} \left[ \frac{1}{L} \sum_{t=1}^L \log \pi_\theta(y^*_t | q, y^*_{<t}) \right] \tag{1}

    Trained on 8 A800 GPUs, AdamW, lr=1e-4, batch size 64, 5 epochs.

  2. Reinforcement Fine-Tuning (RFT) with GRPO (Group Relative Policy Optimization) on a 20K-sample dataset:

    • For photographer-side: reward = decision correctness + subject-preservation mask coverage. Decision reward: Rdec={1,c^=c0,otherwise(2)R_{\text{dec}} = \begin{cases} 1, & \hat{c} = c^* \\ 0, & \text{otherwise} \end{cases} \tag{2} Mask coverage (using BiRefNet salient object mask): Cov(b,M)=u,vM(u,v)1b(u,v)u,vM(u,v)+ϵ(3)\text{Cov}(b, M) = \frac{\sum_{u,v} M(u,v) \mathbf{1}_b(u,v)}{\sum_{u,v} M(u,v) + \epsilon} \tag{3} Mask reward: Rmask={1,c=refine and Cov(b,M)τm0,otherwise(4)R_{\text{mask}} = \begin{cases} 1, & c^* = \text{refine and Cov}(b, M) \ge \tau_m \\ 0, & \text{otherwise} \end{cases} \tag{4} Photographer-side reward: Rphoto=Rdec+Rmask(5)R_{\text{photo}} = R_{\text{dec}} + R_{\text{mask}} \tag{5}
      • For subject-side: reward based on visibility vector agreement:
      Rsub={1,vpred=vgt0,otherwise(6)R_{\text{sub}} = \begin{cases} 1, & v_{\text{pred}} = v_{\text{gt}} \\ 0, & \text{otherwise} \end{cases} \tag{6}
      • GRPO samples G responses per input, computes group-relative advantages:
      Ai=rimean({rj}j=1G)std({rj}j=1G)+ϵ(7)A_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G) + \epsilon} \tag{7}
      • Policy ratio:
      ρi,t(θ)=πθ(yi,tq,yi,<t)πθold(yi,tq,yi,<t)(8)\rho_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} | q, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} | q, y_{i,<t})} \tag{8}
      • GRPO loss:
      LGRPO(θ)=E[1Gi=1G1Lit=1Li(min(ρi,tAi,clip(ρi,t,1ϵc,1+ϵc)Ai)βDKL(πθπref))](9)\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{L_i} \sum_{t=1}^{L_i} \big( \min(\rho_{i,t} A_i, \text{clip}(\rho_{i,t}, 1-\epsilon_c, 1+\epsilon_c) A_i) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \big) \right] \tag{9}
    • Hyperparameters: batch size 64, 32 rollouts per input, lr=1e-6, weight decay 0.1, KL coefficient β=0.01, mask threshold τ_m=0.9, trained 1 epoch.

Empirical Validation / Results

Main Quantitative Results

Photographer-side guidance (Table 1):

MethodIoU% ↑BDE ↓R% ↑RSR% ↑KSR% ↑MLLM-Score ↑
ShutterMuse (Ours)74.300.05470.0382.7674.550.64
Venus (specialized)69.430.07657.270.003.640.57
Gemini-3.1-Pro65.630.06851.3479.3189.090.56
GPT-5.565.440.09141.8410.3481.820.48
InstructCrop69.530.07256.970.000.000.43

ShutterMuse outperforms all baselines in IoU, BDE, refinement success rate, and MLLM-Score, while maintaining a high RSR (82.76%)—far better than specialized cropping models (0% RSR) and competitive with Gemini (79.31%). Specialized models achieve good crop quality but fail on keep/reject decisions; general MLLMs handle three-way decisions but produce less accurate crops.

Subject-side guidance (Table 2):

MethodPlausibility ↑Interaction ↑Aesthetics ↑Mean ↑Time ↓# Tokens ↓
Nano-Banana-Pro0.630.350.170.3955.16s1370
GPT-Image-20.590.290.150.35102.61s1427
ShutterMuse (Ours)0.580.270.140.344.96s412

ShutterMuse achieves competitive mean score (0.34) close to GPT-Image-2 (0.35), while being ~20× faster and using ~70% fewer tokens. This makes it suitable for interactive capture-time use.

Ablation Study (Table 3)

MethodIoU% ↑RSR% ↑KSR% ↑MLLM-Score ↑Plausibility ↑Interaction ↑Aesthetics ↑
ShutterMuse-SFT72.3968.9763.640.560.520.250.14
ShutterMuse-RL (full)74.3082.7674.550.640.580.270.14

Key findings:

  • GRPO improves all metrics over SFT-only, especially RSR (+13.79) and KSR (+10.91).
  • Removing the decision reward (RdecR_{\text{dec}}) degrades RSR and KSR.
  • Removing the mask reward (RmaskR_{\text{mask}}) reduces IoU and MLLM-Score.
  • Removing the subject-side reward (RsubR_{\text{sub}}) lowers plausibility.

Reliability Analysis of EMDP (Figure 7)

Over three EMDP rounds:

  • Expert test set performance improves: IoU 66.11% → 70.99%, RSR 34.48% → 88.77%.
  • MLLM verifier F1-score stays above 87% per category.
  • Acceptance rate stable above 52%.
  • Training set expands from 12K (seed) to 100K samples.

User Study (Table 4)

Photographer-sideMLLM RankHuman Rank
ShutterMuse11
Venus22
Gemini-3.0-Pro34
GPT-5.543
InstructCrop55

Subject-side MLLM ranking (1. Nano-Banana-Pro, 2. GPT-Image-2, 3. ShutterMuse) is identical to human ranking. Spearman's rank correlation for photographer-side: 0.90, confirming strong alignment of MLLM-Score with human preferences.

Theoretical and Practical Implications

Theoretical Implications

  • Capture-time vs. post-hoc: The paper reframes aesthetic image comprehension from a reactive task (improve an already captured image) to a proactive guidance task (decide what to do before/during capture). This requires models to understand when not to crop (keep/reject)—a capability absent in prior work.
  • Unified framework: ShutterMuse demonstrates that a single MLLM can handle both composition decisions and pose recommendations, bridging two previously separate domains (aesthetic cropping and human pose generation).
  • Reinforcement learning for structured outputs: GRPO with reward functions combining decision accuracy and subject preservation proves effective for training MLLMs to produce geometrically precise and semantically aligned outputs beyond free-form text.

Practical Implications

  • Interactive photographer assistant: ShutterMuse can be deployed in camera apps to give real-time compositional feedback (e.g., "frame is tilted, recompose with rule of thirds") and pose suggestions for subjects, reducing the need for post-capture editing.
  • Efficiency: With 4.96s average inference time and 412 tokens per recommendation, it is suitable for on-device or cloud-assisted interactive use, unlike image-editing foundation models that take minutes.
  • Explainability: The model generates textual rationales (e.g., "visual weight is too high... place the horizon appropriately"), enabling users to understand and refine suggestions.
  • Scalable data pipeline: The expert-seeded, MLLM-verified self-distillation pipeline (EMDP) provides a cost-effective method to create large-scale labeled datasets for aesthetic tasks, reducing reliance on expensive expert annotations.

Conclusion

This paper identifies and addresses a critical gap in MLLM-based photography assistance: capture-time guidance for both photographers (composition decisions) and subjects (pose recommendations). The authors contribute:

  • CaptureGuide-Bench — a benchmark with two complementary tasks and MLLM-based evaluation metrics.
  • CaptureGuide-Dataset — ~130K samples with structured annotations and rationales, built via expert-seeded self-distillation and a subject-side generation pipeline.
  • ShutterMuse — a unified MLLM trained with SFT and GRPO reinforcement fine-tuning, achieving state-of-the-art photographer-side performance (74.30% IoU, 0.64 MLLM-Score) and competitive subject-side pose recommendations (0.34 mean MLLM score) with an order-of-magnitude reduction in inference cost.

The results demonstrate that MLLMs can serve as practical interactive assistants for photography during image capture, not just after. Future work could incorporate denser body keypoints for better foot–ground contact modeling, extend to video capture guidance, or explore personalization of composition preferences.

Related papers