Self-Improving Language Models with Bidirectional Evolutionary Search (BES) - Summary

Summary (Overview)

  • Novel Search Framework: Proposes Bidirectional Evolutionary Search (BES), a framework that couples forward candidate evolution with backward goal decomposition to address two key limitations of existing search methods (e.g., best-of-N, tree search) for LLMs and agents.
  • Evolution Operators: Introduces four evolution operators (combination, deletion, translocation, crossover) that recombine parts of different trajectories, enabling the generation of candidates beyond the policy's own autoregressive distribution.
  • Theoretical Motivation: Theoretically proves that expansion-only search confines candidates to a narrow entropy shell, while evolution operators can escape it. Also shows backward sub-goal decomposition provides an exponential advantage in sample efficiency.
  • Empirical Validation: Demonstrates consistent improvements over baselines in challenging post-training tasks (logical and multi-hop reasoning) where other methods fail, and achieves state-of-the-art results among open-source frameworks on three open problem-solving benchmarks at inference time.
  • Practical Implementation: BES is a plug-in sampling algorithm that can be integrated with existing post-training methods (e.g., GRPO, MaxRL) and inference frameworks (e.g., ShinkaEvolve) to enhance their sample generation quality.

Introduction and Theoretical Foundation

Background and Motivation: Large language models (LLMs) and agentic systems excel at complex reasoning. However, generating high-quality samples, especially for problems at the frontier of model capability, remains critical for both post-training/self-improvement and test-time scaling (inference). Dominant methods like best-of-N sampling and tree search have two fundamental limitations:

  1. Sparse Verification: They rely on binary or coarse-grained verifier feedback.
  2. Constrained Exploration: They construct candidates via autoregressive expansion, confining them to the support of the model's own distribution, making it hard to reach low-probability correct solutions.

Theoretical Basis: The paper draws inspiration from evolutionary biology, specifically the advantage of sexual reproduction (recombination) over asexual reproduction (direct extension). Analogously, BES uses evolution operators to combine beneficial parts from different candidate trajectories. The theoretical analysis formalizes the limitations of expansion-only search and the advantages of evolution and bidirectional search.

Methodology

BES performs a bidirectional search that alternates between a forward search (expanding the candidate pool) and a backward search (decomposing the problem for verification).

1. Forward Search: Expanding the Reachable Solution Space

  • Maintains a pool PP of candidate nodes (partial trajectories).
  • At each step, applies an operator to produce a child node nn'.
  • Operators:
    • Expansion: Extends a parent node by sampling KK new steps from the policy πθ\pi_\theta: yt+kπθ(xy1yt+k1),k=1,,Ky_{t+k} \sim \pi_\theta(\cdot | x \oplus y_1 \oplus \dots \oplus y_{t+k-1}), \quad k=1,\dots,K
    • Evolution Operators (see Figure 2):
      • Combination: Concatenates suffixes of two trajectories sharing a common prefix.
      • Deletion: Removes an interior step from a trajectory.
      • Translocation: Replaces a step in one trajectory with a step from another.
      • Crossover: Splices the prefix of one trajectory onto the tail of another.
  • Node Selection: Parents are selected via a Boltzmann distribution based on backward scores s(n)s(n) (for single-parent ops) or pair scores s(na,nb)s(n_a, n_b) (for two-parent ops), with temperature annealing.

2. Backward Search: Better Verification through Goal Decomposition

  • Recursively decomposes the top-level goal grootg_{root} into a tree of finer sub-goals gg, each with a local verifier Vg(x,n)[0,1]V_g(x, n) \in [0,1].
  • Scoring: A candidate node nn is scored recursively against the goal tree: s(n,g)=αVg(x,n)+(1α)1ch(g)gch(g)s(n,g)s(n, g) = \alpha \cdot V_g(x, n) + (1 - \alpha) \cdot \frac{1}{|\text{ch}(g)|} \sum_{g' \in \text{ch}(g)} s(n, g') where α[0,1]\alpha \in [0,1] is a blending parameter. The overall score is s(n)s(n,groot)s(n) \triangleq s(n, g_{root}).
  • Pair Score: For two nodes na,nbn_a, n_b, the pair score s(na,nb,g)s(n_a, n_b, g) uses the maximum of their verifier scores to measure joint coverage.
  • The goal tree is periodically refined by decomposing an unsatisfied leaf sub-goal.

3. Integration for Post-Training and Inference

  • Post-Training: BES replaces the standard sample-generation stage, providing higher-quality trajectories for training.
  • Inference: BES runs with a fixed compute budget, returning the highest-scoring terminal trajectory.

Empirical Validation / Results

1. Post-Training on Logical Reasoning (Knights-and-Knaves)

  • Setup: Base model Gemma-3-1B-it. Compare BES (on top of MaxRL) against GRPO and MaxRL.
  • Result: BES shows steady improvement on validation accuracy, while GRPO and MaxRL show little to no improvement.

    Figure 3: EMA-smoothed validation accuracy on logical reasoning. BES curve rises consistently, others are flat.

2. Post-Training on Multi-Hop Reasoning (MuSiQue)

  • Setup: Agentic search task. Base models Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct. Compare BES (on top of GRPO) against GRPO and Tree-GRPO.

  • Result: BES achieves substantial accuracy gains over baselines and the base model, and also leads to agents that perform more valid searches.

    Table 1: Multi-hop reasoning post-training results.

    MethodAccuracy (%, \uparrow)# Valid Search (\uparrow)# Valid Actions (\uparrow)Finish Ratio (\uparrow)
    Llama-3.2-3B-Instruct
    Base model4.0
    + GRPO2.1 (-1.9)0.840.200.64
    + Tree-GRPO3.9 (-0.1)1.502.140.64
    + BES7.0 (+3.0)2.313.290.97
    Llama-3.1-8B-Instruct
    Base model6.6
    + GRPO5.6 (-1.0)1.461.830.37
    + Tree-GRPO7.4 (+0.8)0.651.360.71
    + BES10.4 (+3.8)2.113.050.94

3. Inference on Open Problem Solving

  • Setup: Three benchmarks: Circle Packing (Square), Circle Packing (Rectangle), Heilbronn (Convex). Backbone model GPT-5. Compare BES (on top of ShinkaEvolve) against open-source frameworks OpenEvolve, GEPA, ShinkaEvolve.

  • Result: BES outperforms all open-source baselines in both average and best performance, with significantly lower variance.

    Table 2: Open problem solving benchmarks.

    StrategyCircle Packing (Square)Circle Packing (Rect.)Heilbronn (Convex)
    Avg.BestAvg.
    Open-source frameworks
    OpenEvolve2.531 ± .0182.5412.267 ± .014
    GEPA2.613 ± .0222.6282.326 ± .023
    ShinkaEvolve2.464 ± .0832.5412.335 ± .026
    Ours
    BES2.623 ± .0142.6322.349 ± .012

4. Ablation Study & Cost Analysis

  • Ablation: On logical reasoning, removing either evolution operators or answer reweighting degrades performance, confirming the contribution of each component.

  • Cost: BES incurs modest additional wall-clock/API cost compared to baselines while delivering significantly better performance.

    Table 3: Cost analysis for multi-hop reasoning post-training.

    MethodAccuracy (%)# Valid SearchWalltime (s)
    GRPO2.1 (-1.9)0.8464
    Tree-GRPO3.9 (-0.1)1.50240
    BES7.0 (+3.0)2.31309

Theoretical and Practical Implications

Theoretical Implications:

  • Shell Confinement & Escape (Theorem 4.4): Provides a formal justification for evolution operators. Under reasonable assumptions, expansion-only candidates are confined with high probability to a typical set Aϵ(T):={y:logP(y)HTϵT}A_\epsilon^{(T)} := \{ y : | -\log P(y) - H_T | \le \epsilon T \} of size exp(HT+ϵT)\exp(H_T + \epsilon T). Evolution candidates, in contrast, have expected log-probability beyond this shell: EQ[logP(Y~)]HT+γT\mathbb{E}_Q[-\log P(\tilde{Y})] \ge H_T + \gamma T.
  • Exponential Advantage of Bidirectionality (Theorem 4.5): Shows that backward sub-goal decomposition can exponentially reduce the number of candidates needed to find a solution compared to terminal-only search. In the symmetric case (pi=pp_i = p), the ratio of required samples is Ω(p(m1)/log(m/δ))\Omega(p^{-(m-1)} / \log(m/\delta)).

Practical Implications:

  • Enhanced Self-Improvement: BES enables effective post-training on challenging tasks where standard methods fail due to insufficient high-quality samples.
  • Superior Inference: BES provides a sample-efficient and stable search strategy for solving open problems, pushing the boundaries of what models can achieve at test time.
  • General-Purpose Framework: BES is agnostic to the underlying training method (GRPO, MaxRL) or inference framework, making it widely applicable.

Conclusion

BES addresses the core limitations of sparse verification and confined exploration in existing search methods for LLMs and agents. By coupling forward evolutionary search (with expansion and recombination operators) with backward goal decomposition (providing dense intermediate feedback), BES can discover high-quality solutions that are difficult to reach via standard rollouts. Theoretical analysis confirms the advantages of evolution and bidirectionality. Empirical results across post-training and inference settings demonstrate consistent and significant improvements over strong baselines. The framework offers a promising direction for advancing self-improving language models and agentic systems.