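The essence of the Reflection pattern is self-evaluation and iterative improvement. Instead of stopping at the first output, the agent looks back, asks "Is this good enough?", and raises the quality. It is not a cure-all, though: piling reflection onto an answer that is already good is wasted effort.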

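The core in 5 minutes

Reflection = evaluate the output + identify weaknesses + generate an improvement

Two modes: self-critique (the same LLM) and external evaluation (a different model or a human)

Stop conditions matter: stop once the quality threshold is met or the maximum number of iterations is reached

Don't over-iterate: tinkering too much with a good answer can make it worse

Make evaluation criteria explicit: judge against concrete metrics, not a vague "good" or "bad"

10-minute course: 11.1-11.3 → 11.5 → Shannon Lab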

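11.1 Why is reflection needed?

An LLM does not produce a perfect answer in one shot. Common problems:

Incomplete: key points are missing
Vague: could be more specific
Disorganized: would read better if tidied up
Factual errors: the output may contain hallucinations

People reread a report after writing it, right? We have the Agent do the same thing.

What the Reflection pattern offers:

Problem	How Reflection improves it
Incomplete answer	Evaluates "what is missing" and adds it
Vague wording	Flags it with "be more specific" and tightens it
Disorganized structure	Restructures it "so it is easier to read"
Factual errors	Checks "is this correct?" and fixes it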

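11.2 The basic Reflection flow

Reflection in Shannon is a three-step loop: generate, evaluate, improve. The loop exits early once the evaluation clears the quality threshold.

Reflection flow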

func ReflectionLoop(ctx context.Context, query string, agent *Agent, config ReflectionConfig) (*Result, error) {
    // 1. Generate the initial response
    response, err := agent.Generate(ctx, query)
    if err != nil {
        return nil, err
    }

    for i := 0; i < config.MaxIterations; i++ {
        // 2. Evaluate the response
        evaluation, err := evaluate(ctx, query, response, config.Criteria)
        if err != nil {
            return nil, err
        }

        // Exit early if the quality threshold is met
        if evaluation.Score >= config.QualityThreshold {
            return &Result{
                Response:   response,
                Iterations: i + 1,
                FinalScore: evaluation.Score,
            }, nil
        }

        // 3. Generate an improved response from the feedback
        response, err = improve(ctx, query, response, evaluation.Feedback)
        if err != nil {
            return nil, err
        }
    }

    // Maximum iterations reached. The response returned here comes from the
    // last improve call and has not been re-evaluated, so FinalScore is unset.
    return &Result{
        Response:   response,
        Iterations: config.MaxIterations,
        Warning:    "max iterations reached without meeting quality threshold",
    }, nil
}
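
To see the loop run without an LLM, here is a minimal, self-contained sketch. The names runReflection, EvalFunc, ImproveFunc, toyEval, and toyImprove are all hypothetical and not part of Shannon; the toy evaluator just counts "+detail" markers so that the scores climb roughly the way they would in a real run.

```go
package main

import (
	"fmt"
	"strings"
)

// EvalFunc and ImproveFunc are hypothetical stand-ins for the LLM-backed
// evaluate and improve steps.
type EvalFunc func(response string) (score float64, feedback string)
type ImproveFunc func(response, feedback string) string

// runReflection evaluates, checks the threshold, and improves, for at most
// maxIter rounds. It returns the final response, the number of evaluations
// performed, and whether the threshold was met.
func runReflection(initial string, eval EvalFunc, improve ImproveFunc, threshold float64, maxIter int) (string, int, bool) {
	response := initial
	for i := 0; i < maxIter; i++ {
		score, feedback := eval(response)
		if score >= threshold {
			return response, i + 1, true
		}
		response = improve(response, feedback)
	}
	return response, maxIter, false
}

// toyEval scores a draft by how many "+detail" markers it has accumulated.
func toyEval(response string) (float64, string) {
	n := strings.Count(response, "+detail")
	return 0.6 + 0.15*float64(n), "add more detail"
}

// toyImprove pretends to act on the feedback by appending a marker.
func toyImprove(response, feedback string) string {
	return response + " +detail"
}

func main() {
	resp, rounds, ok := runReflection("draft", toyEval, toyImprove, 0.85, 3)
	fmt.Println(resp, rounds, ok) // prints: draft +detail +detail 3 true
}
```

Passing the evaluator and improver in as functions keeps the control flow testable on its own; swapping in real LLM calls would not change the loop.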

11.3 Designing the evaluation

Evaluation is the heart of Reflection. Judge against concrete metrics rather than a vague "good" or "bad".

Example evaluation criteria

type EvaluationCriteria struct {
    Completeness bool // Does it answer every part of the question?
    Accuracy     bool // Is it factually correct?
    Clarity      bool // Is it clear and easy to understand?
    Structure    bool // Is it logically organized?
    Relevance    bool // Is it relevant to the question?
    Specificity  bool // Is it concrete and detailed?
}

type Evaluation struct {
    Score      float64         // overall score, 0-1
    Criteria   map[string]bool // result for each criterion
    Feedback   string          // feedback for improvement
    Weaknesses []string        // weaknesses identified
}
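
The chapter does not say how Score relates to the per-criterion results. One simple possibility, sketched below under the assumption that every criterion carries equal weight, is to use the fraction of criteria that passed. scoreFromCriteria and failedCriteria are hypothetical helper names.

```go
package main

import (
	"fmt"
	"sort"
)

// scoreFromCriteria returns the fraction of criteria that passed.
// Assumption: all criteria are weighted equally. An empty map scores 0.
func scoreFromCriteria(criteria map[string]bool) float64 {
	if len(criteria) == 0 {
		return 0
	}
	passed := 0
	for _, ok := range criteria {
		if ok {
			passed++
		}
	}
	return float64(passed) / float64(len(criteria))
}

// failedCriteria lists the criteria that did not pass, sorted by name so the
// output is deterministic.
func failedCriteria(criteria map[string]bool) []string {
	var failed []string
	for name, ok := range criteria {
		if !ok {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	return failed
}

func main() {
	criteria := map[string]bool{
		"completeness": true,
		"accuracy":     true,
		"clarity":      false,
		"structure":    true,
	}
	fmt.Println(scoreFromCriteria(criteria), failedCriteria(criteria)) // prints: 0.75 [clarity]
}
```

The list of failed criteria is a natural seed for Weaknesses when the evaluator returns only pass/fail flags.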

Designing the evaluation prompt

// Note: criteria is not used yet; the prompt always lists all six criteria.
func buildEvaluationPrompt(query, response string, criteria EvaluationCriteria) string {
    return fmt.Sprintf(`## Task
Evaluate the following response to the user's question.

## User's Question
%s

## Response to Evaluate
%s

## Evaluation Criteria
1. Completeness: Does it address all aspects of the question?
2. Accuracy: Are the facts correct?
3. Clarity: Is it easy to understand?
4. Structure: Is it well-organized?
5. Relevance: Does it stay on topic?
6. Specificity: Is it concrete and detailed?

## Output Format
{
    "score": 0.0-1.0,
    "criteria": {
        "completeness": true/false,
        "accuracy": true/false,
        "clarity": true/false,
        "structure": true/false,
        "relevance": true/false,
        "specificity": true/false
    },
    "feedback": "Specific areas for improvement",
    "weaknesses": ["weakness 1", "weakness 2"]
}
`, query, response)
}
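
The prompt asks for JSON, so the reply has to be parsed back into an Evaluation. The sketch below is one way to do that with encoding/json. It assumes the model may wrap the JSON in extra prose and therefore takes everything between the first "{" and the last "}"; parseEvaluation and the struct tags are illustrative, not Shannon's actual parser.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Evaluation mirrors the struct from this section, with JSON tags added
// (the tags are an assumption based on the prompt's output format).
type Evaluation struct {
	Score      float64         `json:"score"`
	Criteria   map[string]bool `json:"criteria"`
	Feedback   string          `json:"feedback"`
	Weaknesses []string        `json:"weaknesses"`
}

// parseEvaluation extracts the outermost JSON object from raw model output,
// decodes it, and rejects scores outside the 0-1 range.
func parseEvaluation(raw string) (*Evaluation, error) {
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start < 0 || end < start {
		return nil, fmt.Errorf("no JSON object found in model output")
	}
	var ev Evaluation
	if err := json.Unmarshal([]byte(raw[start:end+1]), &ev); err != nil {
		return nil, fmt.Errorf("decode evaluation: %w", err)
	}
	if ev.Score < 0 || ev.Score > 1 {
		return nil, fmt.Errorf("score %v outside [0, 1]", ev.Score)
	}
	return &ev, nil
}

func main() {
	raw := "Here is my evaluation:\n" +
		`{"score": 0.6, "criteria": {"completeness": false}, ` +
		`"feedback": "Add figures", "weaknesses": ["no concrete figures", "no competitor comparison"]}`
	ev, err := parseEvaluation(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ev.Score, len(ev.Weaknesses)) // prints: 0.6 2
}
```

Validating the score range here means a malformed reply surfaces as an error instead of silently passing or failing the threshold check.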

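Evaluation variants

Method	Description	Pros	Cons
Self-critique	The same LLM evaluates	Simple, fast	Tends to miss its own mistakes
Separate-model evaluation	A different LLM evaluates	More objective	Higher cost
Human evaluation	Human-in-the-loop	Most accurate	Slow, does not scale
Rule-based	Predefined checks	Consistent	Inflexible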
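
Of the four variants, the rule-based one is the easiest to show in code. The sketch below is a toy: Rule, defaultRules, and runRules are hypothetical names, and the three checks (contains a digit, at least 20 characters, contains the literal "Source:") are placeholder rules chosen only for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Rule is one predefined check on a response.
type Rule struct {
	Name  string
	Check func(response string) bool
}

// defaultRules returns three illustrative checks. Real rules would be
// specific to the task at hand.
func defaultRules() []Rule {
	return []Rule{
		{Name: "has_numbers", Check: func(r string) bool { return strings.ContainsAny(r, "0123456789") }},
		{Name: "min_length", Check: func(r string) bool { return utf8.RuneCountInString(r) >= 20 }},
		{Name: "cites_source", Check: func(r string) bool { return strings.Contains(r, "Source:") }},
	}
}

// runRules applies every rule and returns the pass ratio plus the names of
// the rules that failed, in rule order.
func runRules(response string, rules []Rule) (float64, []string) {
	if len(rules) == 0 {
		return 0, nil
	}
	var failed []string
	for _, rule := range rules {
		if !rule.Check(response) {
			failed = append(failed, rule.Name)
		}
	}
	passed := len(rules) - len(failed)
	return float64(passed) / float64(len(rules)), failed
}

func main() {
	_, failed := runRules("Revenue grew 15% year over year.", defaultRules())
	fmt.Println(failed) // prints: [cites_source]
}
```

The same response always gets the same verdict, which is the consistency the table mentions; the cost is that a rule like "contains a digit" knows nothing about whether the number is right.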

11.4 Generating the improvement

Generate concrete improvements based on the evaluation result.

func buildImprovementPrompt(query, response, feedback string, weaknesses []string) string {
    return fmt.Sprintf(`## Task
Improve the following response based on the feedback.

## Original Question
%s

## Current Response
%s

## Feedback
%s

## Identified Weaknesses
%s

## Instructions
1. Address all identified weaknesses
2. Keep the good parts unchanged
3. Be specific and concrete in improvements
4. Maintain the same overall structure unless it was a weakness

## Improved Response
`, query, response, feedback, "- "+strings.Join(weaknesses, "\n- "))
}

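Best practices for improvement

Specific feedback: "add an example" rather than "make it better"
Keep what works: there is no need to rewrite everything
Prioritize: don't try to fix every weakness at once
Make changes traceable: make it clear what was changed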
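
The last two practices, prioritizing and traceability, can be sketched together. The code below assumes the evaluator already lists weaknesses in order of importance (the chapter does not specify this); nextBatch and Revision are hypothetical names.

```go
package main

import "fmt"

// Revision records which weaknesses one improvement round targeted, so the
// change history stays traceable.
type Revision struct {
	Iteration int
	Addressed []string
}

// nextBatch splits weaknesses into the ones to fix now (at most n) and the
// ones to defer. Assumption: the slice is already ordered by importance.
func nextBatch(weaknesses []string, n int) (now, later []string) {
	if n < 0 {
		n = 0
	}
	if n > len(weaknesses) {
		n = len(weaknesses)
	}
	return weaknesses[:n], weaknesses[n:]
}

func main() {
	pending := []string{"no concrete figures", "no competitor comparison", "thin outlook"}
	var history []Revision
	for i := 1; len(pending) > 0; i++ {
		var now []string
		now, pending = nextBatch(pending, 2)
		history = append(history, Revision{Iteration: i, Addressed: now})
	}
	for _, rev := range history {
		fmt.Printf("round %d: %v\n", rev.Iteration, rev.Addressed)
	}
}
```

Only the current batch would go into the improvement prompt for that round; the history slice is what lets you answer "what changed, and when" afterward.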

11.5 Designing stop conditions

When should the iteration stop? This is the crucial question.

Types of stop conditions

type StopConditions struct {
    MaxIterations    int           // maximum number of iterations (safety net)
    QualityThreshold float64       // quality score threshold
    NoImprovement    int           // stop after N consecutive rounds without improvement
    TimeBudget       time.Duration // time limit
    TokenBudget      int           // token limit
}

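// shouldStop checks the iteration cap, the quality threshold, and stalled
// improvement. TimeBudget and TokenBudget are not checked here.
func shouldStop(iteration int, scores []float64, config StopConditions) (bool, string) {
    // 1. Maximum iterations
    if iteration >= config.MaxIterations {
        return true, "max iterations reached"
    }

    // 2. Quality threshold met
    if len(scores) > 0 && scores[len(scores)-1] >= config.QualityThreshold {
        return true, "quality threshold met"
    }

    // 3. Improvement has stalled. N consecutive non-improving rounds span
    //    N+1 scores, so look at a window of that size.
    window := config.NoImprovement + 1
    if config.NoImprovement > 0 && len(scores) >= window {
        recent := scores[len(scores)-window:]
        if !hasImprovement(recent) {
            return true, "no improvement detected"
        }
    }

    return false, ""
}

func hasImprovement(scores []float64) bool {
    if len(scores) < 2 {
        return true
    }
    // Did any step in the window raise the score?
    for i := 1; i < len(scores); i++ {
        if scores[i] > scores[i-1]+0.01 { // a gain of more than 0.01 on the 0-1 scale
            return true
        }
    }
    return false
}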

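Recommended settings

Parameter	Recommended value	Rationale
MaxIterations	3-5	Diminishing returns beyond this
QualityThreshold	0.8-0.85	Don't chase perfection
NoImprovement	2	Two rounds without improvement means it has stalled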
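
StopConditions also carries TimeBudget and TokenBudget. A budget check could look like the sketch below. Budget, demoBudget, and budgetExceeded are hypothetical names, and treating a zero value as "no limit" is an assumption of this sketch, not something the chapter specifies.

```go
package main

import (
	"fmt"
	"time"
)

// Budget holds the two resource limits from StopConditions.
type Budget struct {
	TimeBudget  time.Duration
	TokenBudget int
}

// demoBudget is an arbitrary example value, not a recommendation.
var demoBudget = Budget{TimeBudget: 30 * time.Second, TokenBudget: 8000}

// budgetExceeded reports whether either limit has been reached.
// Assumption: a zero limit means "unlimited".
func budgetExceeded(elapsed time.Duration, tokensUsed int, b Budget) (bool, string) {
	if b.TimeBudget > 0 && elapsed >= b.TimeBudget {
		return true, "time budget exhausted"
	}
	if b.TokenBudget > 0 && tokensUsed >= b.TokenBudget {
		return true, "token budget exhausted"
	}
	return false, ""
}

func main() {
	start := time.Now()
	tokensUsed := 0
	for round := 1; round <= 5; round++ {
		tokensUsed += 3000 // pretend each round costs 3000 tokens
		if stop, why := budgetExceeded(time.Since(start), tokensUsed, demoBudget); stop {
			fmt.Printf("stopping after round %d: %s\n", round, why)
			return
		}
	}
}
```

Budgets complement the quality-based conditions: the threshold decides when the answer is good enough, while the budget decides when continuing is no longer affordable.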

11.6 Worked example: improving report quality

func improveReport(ctx context.Context, query string, agent *Agent) (*Result, error) {
    config := ReflectionConfig{
        MaxIterations:    3,
        QualityThreshold: 0.85,
        Criteria: EvaluationCriteria{
            Completeness: true,
            Accuracy:     true,
            Clarity:      true,
            Structure:    true,
        },
    }

    return ReflectionLoop(ctx, query, agent, config)
}

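Sample run

=== Initial response ===
Tesla's 2024 results ... (concise but lacking detail)

=== Evaluation 1 ===
Score: 0.6
Weaknesses:
- No concrete figures
- No comparison with competitors
- Future outlook is thin

=== Improvement 1 ===
Tesla's 2024 results ... revenue rose 15% year over year ... (figures added)

=== Evaluation 2 ===
Score: 0.75
Weaknesses:
- Still no comparison with competitors
- Sources are not cited

=== Improvement 2 ===
Tesla's 2024 results ... compared with BYD ... (comparison added)

=== Evaluation 3 ===
Score: 0.88 >= 0.85 (threshold)

=== Final output ===
Iterations: 3
Final score: 0.88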

11.7 Common pitfalls

Pitfall 1: over-iterating

// Bad: tinkering too much with an answer that is already good
config := ReflectionConfig{
    MaxIterations:    10,   // too many
    QualityThreshold: 0.99, // too high
}
// Result: a good answer may end up worse

// Good: stop at a reasonable point
config := ReflectionConfig{
    MaxIterations:    3,
    QualityThreshold: 0.85,
    NoImprovement:    2, // stop once improvement stalls
}
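
A complementary guard against a degraded final answer, offered here as a sketch and not as something Shannon necessarily does, is to remember every scored candidate and return the best one instead of the last one. Candidate and bestOf are hypothetical names.

```go
package main

import "fmt"

// Candidate pairs a response with the score the evaluator gave it.
type Candidate struct {
	Response string
	Score    float64
}

// bestOf returns the highest-scoring candidate. On a tie the earlier
// candidate wins, which favors fewer rewrites.
func bestOf(history []Candidate) (Candidate, bool) {
	if len(history) == 0 {
		return Candidate{}, false
	}
	best := history[0]
	for _, c := range history[1:] {
		if c.Score > best.Score {
			best = c
		}
	}
	return best, true
}

func main() {
	history := []Candidate{
		{Response: "v1", Score: 0.80},
		{Response: "v2", Score: 0.86},
		{Response: "v3", Score: 0.82}, // a later round made things worse
	}
	best, _ := bestOf(history)
	fmt.Println(best.Response, best.Score) // prints: v2 0.86
}
```

This only helps if every candidate is actually evaluated, so it costs one extra evaluation for the final response.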

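Pitfall 2: vague evaluation criteria

// Bad: vague feedback
feedback := "Make it better"

// Good: specific feedback
feedback := "Please improve the following:\n" +
    "1. Add concrete revenue figures\n" +
    "2. Include a year-over-year comparison\n" +
    "3. Cite the data sources"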

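Pitfall 3: evaluation and improvement out of sync

// Bad: the points raised in the evaluation never get fixed
evaluation.Weaknesses = []string{"no concrete examples"}
// ...and the improvement step ignores this feedback

// Good: pass the evaluation result explicitly into the improvement prompt
improvementPrompt := buildImprovementPrompt(query, response,
    evaluation.Feedback, evaluation.Weaknesses)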

11.8 Advanced topic: multi-aspect evaluation

Evaluate several aspects independently and improve each one:

type MultiAspectEvaluation struct {
    Aspects map[string]AspectScore
}

type AspectScore struct {
    Score    float64
    Feedback string
}

// evaluateAllAspects, findWeakestAspect, and improveAspect are helpers not shown here.
func multiAspectReflection(ctx context.Context, query, response string) (*Result, error) {
    const maxIterations = 3 // safety cap, in line with the recommended 3-5
    aspects := []string{"accuracy", "completeness", "clarity", "usefulness"}

    for iteration := 0; iteration < maxIterations; iteration++ {
        // Evaluate each aspect independently
        evaluation := evaluateAllAspects(ctx, query, response, aspects)

        // Identify the weakest aspect
        weakestAspect := findWeakestAspect(evaluation)

        // Focus the improvement on that aspect
        response = improveAspect(ctx, query, response, weakestAspect)
    }

    return &Result{Response: response}, nil
}
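
The step that picks the weakest aspect is simple enough to sketch in full. The function below, named weakestAspect here to make clear it is only one possible body for that helper, works directly on the Aspects map and breaks ties alphabetically so the result is deterministic; both choices are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// AspectScore matches the struct from this section.
type AspectScore struct {
	Score    float64
	Feedback string
}

// weakestAspect returns the name of the lowest-scoring aspect.
// Names are visited in sorted order so ties resolve alphabetically.
func weakestAspect(aspects map[string]AspectScore) (string, bool) {
	if len(aspects) == 0 {
		return "", false
	}
	names := make([]string, 0, len(aspects))
	for name := range aspects {
		names = append(names, name)
	}
	sort.Strings(names)

	weakest := names[0]
	for _, name := range names[1:] {
		if aspects[name].Score < aspects[weakest].Score {
			weakest = name
		}
	}
	return weakest, true
}

func main() {
	aspects := map[string]AspectScore{
		"accuracy":     {Score: 0.9},
		"completeness": {Score: 0.5, Feedback: "missing competitor comparison"},
		"clarity":      {Score: 0.7},
		"usefulness":   {Score: 0.8},
	}
	name, _ := weakestAspect(aspects)
	fmt.Println(name, aspects[name].Feedback) // prints: completeness missing competitor comparison
}
```

Sorting first matters because Go randomizes map iteration order; without it, two tied aspects could be picked differently from run to run.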

11.9 Combining with other patterns

Planning + Reflection

Improve the output of each plan step with Reflection:

for _, step := range plan.Steps {
    result := executeStep(step)
    // Evaluate and improve the output of each step
    result = reflectionLoop(result, stepConfig)
    stepResults = append(stepResults, result)
}

Reflection + RAG

Use retrieved results to make the evaluation more accurate:

func evaluateWithRAG(ctx context.Context, query, response string) (*Evaluation, error) {
    // Retrieve relevant facts
    facts := search(query)

    // Evaluate accuracy against those facts
    evaluation := evaluateAgainstFacts(response, facts)

    return evaluation, nil
}

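Chapter summary

The core in one sentence: the Reflection pattern raises quality through an "output → evaluate → improve" loop. Self-critique or external evaluation identifies the weaknesses, and specific feedback drives the improvement.

Key points

Three-step loop: generate → evaluate → improve
Clear evaluation criteria: concrete metrics, not a vague "good" or "bad"
Appropriate stop conditions: quality threshold, maximum iterations, detection of stalled improvement
Avoid over-iterating: don't tinker too much with a good answer
Specific feedback: "what to fix and how", not "make it better"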

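Shannon Lab (get started in 10 minutes)

This section maps the chapter's concepts onto the Shannon source code.

Required reading (1 file)

patterns/reflection.go: read the ReflectionLoop function to see how evaluation and improvement loop, and check how the stop conditions are implemented

Optional deep dives (pick 2 that interest you)

activities/evaluate.go: understand how the evaluation prompt is designed and how its output is parsed
patterns/plan_execute.go: see how Planning and Reflection are combined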

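Exercises

Exercise 1: designing evaluation criteria

Design six evaluation criteria for a "technical blog post". For each criterion, give:

Name
Description (what it checks)
How to decide that it counts as "good"

Exercise 2: analyzing stop conditions

Look at the score progression below and decide when the iteration should stop:

Iteration 1: 0.55
Iteration 2: 0.68
Iteration 3: 0.72
Iteration 4: 0.73
Iteration 5: 0.72

Explain your reasoning.

Exercise 3 (advanced): multi-aspect improvement

Consider a scenario in which you improve an answer to the question "What is an AI agent?":

Choose four aspects to evaluate
Define the evaluation criteria for each aspect
Design the logic that prioritizes improvements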

Want to go deeper?

Self-Refine: Iterative Refinement with Self-Feedback - iterative improvement driven by self-feedback
Constitutional AI - AI evaluating and improving AI
Reflexion: Language Agents with Verbal Reinforcement Learning - reflection in language agents

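Coming up next

Planning is "think first"; Reflection is "look back afterward". Both are patterns that split reasoning into stages.

The next chapter covers Chain-of-Thought (CoT): by "thinking step by step", it breaks complex reasoning down and improves the accuracy of the final answer.

It is the last chapter on single-agent patterns. Once you understand CoT, you are ready to move on to multi-agent orchestration in Part 5.

On to the next chapter.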