GPT 5.5 正在重新定义 Prompt

如果你的 prompt 还在大量依赖 ALWAYS、NEVER、First do A, then do B 这类强流程约束，那你该注意了。GPT-5.5 不一定会表现更差，但很容易把问题暴露得更明显，你会被这些冗余的规则牵着鼻子走。

OpenAI 最近放出了一份 GPT-5.5 的 prompt 写法指南，它在提醒开发者：这一代模型更适合用目标、约束和停止条件来控制，文档优先，而不适合靠细碎步骤执行一路。

这篇内容大致再说，GPT-5.5 时代的 prompt，不应该简单沿用 GPT-4 时代那套“强指令 + 长流程 + 反复强调”的写法。想兑现更高效率，prompt 本身也要更短、更清楚、更可验证。

这一版指南背后的 6 个变化

OpenAI 给 GPT-5.5 单独列了一份 Using GPT-5.5，里面可以读出这一版指南背后的 6 个变化。

第一，推理 token 用得更省。 同样的 reasoning effort 下，GPT-5.5 更强调用较少 token 达到强结果。在工具密集和多步流程里，这个差异会被放大。Vellum 的 controllability 测试里，GPT-5.5 在长 chain-of-thought 设置上表现优于 GPT-5.4 和 GPT-5.2 Thinking。

第二，模型对 prompt 的理解更字面化。 这不是单纯的好事或坏事，它是双刃剑。你写得越清楚，它做得越准；你写得含糊，它也会更僵硬。这意味着 success criteria 和 stopping rules 必须写出来，不能默认“模型自己懂”。

第三，默认 reasoning.effort 改成了 medium。 “把 effort 拉高”不应该再被当成默认优化手段。一般我用 codex 的时候倾向于直接拉满，因为默认 effort 肯定越高越好。但是冲突指令、模糊停止条件、开放工具时，effort 加高可能带来过度思考、过度搜索，甚至让输出变差。

第四，默认输出风格更直接。 这是工程上的好事：更聚焦、更好控、更省 token。但对客户对接的对话产品，它可能显得偏冷。很多时候问题不在模型能力，而在 prompt 没有定义 personality 和协作方式。

第五，工具调用更精确。 大工具面、多步服务流程、长任务上，选择和参数更稳。

第六，图像输入默认保留更多视觉细节。image_detail 不设或设为 auto 时，默认按 original 处理（最大 10,240,000 像素或 6,000 像素边长），computer use 类任务直接受益。

这 6 条决定了下面这份指南为什么这么写。记住一条原则：把 GPT-5.5 当成需要重新校准的新模型，而不是 gpt-5.4 的机械替换。 老 prompt 可以先跑基线，出来一个基础版，但最好不要直接视作最终版本。

OpenAI 官方 prompt 指南

下面是这一版指南 10 节的中文整理。代码块和 prompt 模板保留英文，并在每个英文模板后补上中文释义；API 参数（reasoning.effort、text.verbosity、phase）保留英文。

1. Personality 和行为

GPT-5.5 的默认风格是高效、直接、任务导向。这对生产系统是有直接用处的：回答更聚焦、行为更好控，模型也不会插入不必要的对话填料。

但如果说是客户对接的助手、支持工作流、教练型体验这类对话产品，你要同时定义两个东西。

Personality（人格）

Collaboration style（协作风格）

控制助手怎么干活，包括什么时候问问题、什么时候做假设、应该多主动、给多少上下文、什么时候自检、不确定或有风险时怎么处理。

两块都要保持简短。Personality 指令塑造用户体验，Collaboration 指令塑造任务行为。它们都不能替代清晰的目标、success criteria、工具规则或 stopping condition。

稳健、任务导向的助手 personality 模板：

# Personality
You are a capable collaborator: approachable, steady, and direct. Assume the user is competent and acting in good faith, and respond with patience, respect, and practical helpfulness.

Prefer making progress over stopping for clarification when the request is already clear enough to attempt. Use context and reasonable assumptions to move forward. Ask for clarification only when the missing information would materially change the answer or create meaningful risk, and keep any question narrow.

Stay concise without becoming curt. Give enough context for the user to understand and trust the answer, then stop. Use examples, comparisons, or simple analogies when they make the point easier to grasp. When correcting the user or disagreeing, be candid but constructive. When an error is pointed out, acknowledge it plainly and focus on fixing it.

Match the user's tone within professional bounds. Avoid emojis and profanity by default, unless the user explicitly asks for that style or has clearly established it as appropriate for the conversation.

中文释义：你是一个可靠的协作者，态度友好、稳定、直接。默认认为用户是有能力、善意的，所以回答要有耐心、尊重且实用。只要任务已经足够清楚，就优先推进，不要为了很小的不确定性停下来追问；只有缺失信息会改变答案或带来风险时，才问一个很窄的问题。回答要简洁但不能生硬，给足用户理解和信任所需的上下文，然后及时收住。纠正用户或不同意时，要坦诚但建设性；出错时直接承认并修复。语气可以贴近用户，但保持专业边界，默认不用 emoji 和脏话。

更有表达力的协作型助手 personality 模板：

# Personality
Adopt a vivid conversational presence: intelligent, curious, playful when appropriate, and attentive to the user's thinking. Ask good questions when the problem is blurry, then become decisive once there is enough context.

Be warm, collaborative, and polished. Conversation should feel easy and alive, but not chatty for its own sake. Offer a real point of view rather than merely mirroring the user, while staying responsive to their goals and constraints.

Be thoughtful and grounded when the task calls for synthesis or advice. State a clear recommendation when you have enough context, explain important tradeoffs, and name uncertainty without becoming evasive.

中文释义：让助手更像一个有现场感的对话伙伴：聪明、好奇，合适时可以有一点幽默，并且认真跟上用户的思路。问题还模糊时多问好问题，信息足够后就果断推进。整体语气要温暖、协作、打磨过，但不要为了聊天而聊天。助手应该有自己的判断，而不是只复述用户；做综合判断或给建议时，要明确推荐、解释关键取舍，并诚实说明不确定性。

要更有表达力的产品，可以显式加入 warmth、curiosity、humor 或观点，但保持模板短。Personality 是用来塑造体验，不是用来弥补目标不清或任务指令缺失。

一个反直觉的实操点：Personality 模板里写不写 “prefer making progress over stopping for clarification” 是个开关。写了，模型会更倾向先基于合理假设推进；不写，模型可能更频繁追问。review、debug、方案讨论里，主动挑战用户常常是优点；精确实现型任务里，它也可能变成摩擦。

2. 用 preamble 改善首 token 出现的时间

在涉及到流式应用里，用户能感知到第一个可见 token 出现的时间。GPT-5.5 在输出可见文字之前，可能先 reasoning、规划、准备工具调用。

对长任务或工具密集的任务，要求模型先输出一段简短的 preamble：确认请求，并说明第一步的 user-visible 更新。这能改善体感响应速度，而不改变底层任务。

适用场景：任务可能多于一步、需要工具调用、长时间运行的 agent 工作流。

Before any tool calls for a multi-step task, send a short user-visible update that acknowledges the request and states the first step. Keep it to one or two sentences.

中文释义：多步骤任务在调用工具之前，先给用户一条很短的可见更新：确认你理解了请求，并说明第一步要做什么。控制在一两句话，不要写成长计划。

对暴露独立 message phase 的 coding agent，可以更显式：

You must always start with an intermediary update before any content in the analysis channel if the task will require calling tools. The user update should acknowledge the request and explain your first step.

中文释义：如果任务需要调用工具，先发一条中间进度更新，再进入内部分析或工具调用。这个更新要让用户知道你接下来会先做什么。

3. 结果导向的 prompt 和 stopping condition

这一节是这一版指南的核心。如果你只读一节，读这节。

GPT-5.5 在 prompt 定义“目标产出 + success criteria + 约束 + 可用上下文”、然后把路径选择交给模型时表现最强。

对很多任务，描述目的地比逐步教学好。这给模型留出空间去选择合适的搜索、工具或推理策略。

推荐这种：

Resolve the customer's issue end to end.

Success means:
- the eligibility decision is made from the available policy and account data
- any allowed action is completed before responding
- the final answer includes completed_actions, customer_message, and blockers
- if evidence is missing, ask for the smallest missing field

中文释义：目标是端到端把客户问题解决。什么叫“解决”？要从现有政策和账户数据里做出资格判断；允许执行的动作要先完成；最终回答必须包含已经完成的动作、给客户看的消息和仍然阻塞的问题；如果证据不够，只问最小必要字段。

避免不必要的绝对规则。老 prompt 经常用 ALWAYS、NEVER、must、only 来控制 LLM 的行为。这些词留给真正的 invariant，比如安全规则、必填输出字段、永远不能发生的动作。判断类的事（什么时候搜索、什么时候追问、什么时候用工具、什么时候继续迭代），用 decision rule 替代，别写成绝对禁令。

避免这种风格，除非每一步都真的必要：

First inspect A, then inspect B, then compare every field, then think through
all possible exceptions, then decide which tool to call, then call the tool,
then explain the entire process to the user.

中文释义：不要默认要求模型先检查 A、再检查 B、再比较所有字段、再穷举异常、再决定工具、再调用工具、最后解释全过程。除非这些步骤每一步都确实不可省，否则这类流程会让模型变慢，而且很僵硬。

你需要加显式 stopping condition：

Resolve the user query in the fewest useful tool loops, but do not let loop minimization outrank correctness, accessible fallback evidence, calculations, or required citation tags for factual claims.

After each result, ask: "Can I answer the user's core request now with useful evidence and citations for the factual claims?" If yes, answer.

中文释义：要求模型用尽可能少但足够有用的工具循环解决问题，但不能为了减少工具调用牺牲正确性、可访问的备选证据、计算或必要引用。每次拿到结果后都问自己：我现在能不能用足够证据和引用回答用户核心问题？如果能，就停止搜索并回答。

定义证据缺失时的行为：

Use the minimum evidence sufficient to answer correctly, cite it precisely, then stop.

中文释义：只收集足够正确回答的最小证据，精确引用，然后停止。不要为了“显得更充分”继续检索。

这一节背后是 prompt 写法的代际变化。旧写法偏流程指令（First X, then Y, then Z），在很多早期任务里确实有效，因为模型需要更多牵引。新写法更偏“目的地 + 约束 + 自主路径”。模型能力增强后，过度流程化的 prompt 反而可能限制它选择更合适的路径。

最便宜的迁移动作：翻一遍老 prompt，把每个 ALWAYS / NEVER 单独评估：是真的 invariant（留），还是其实是判断类（改成 decision rule）？这一步通常能显著缩短 prompt，也能减少冲突指令。

4. Formatting

GPT-5.5 在输出格式和结构上高度可控。这个可控性要在能改善阅读理解或符合产品形态时使用。

设置 text.verbosity、描述期望的输出形态，只在阅读理解或产品 UI 需要稳定 artifact 时才用重结构。text.verbosity 的 API 默认值是 medium；偏好更短回答时用 low。

普通对话型 formatting：

Let formatting serve comprehension. Use plain paragraphs as the default format for normal conversation, explanations, reports, documentation, and technical writeups. Keep the presentation clean and readable without making the structure feel heavier than the content.

Use headers, bold text, bullets, and numbered lists sparingly. Reach for them when the user requests them, when the answer needs clear comparison or ranking, or when the information would be harder to scan as prose. Otherwise, favor short paragraphs and natural transitions.

Respect formatting preferences from the user. If they ask for a terse answer, minimal formatting, no bullets, no headers, or a specific structure, follow that preference unless there is a strong reason not to.

中文释义：格式是为理解服务的。普通对话、解释、报告、文档和技术写作，默认用自然段即可；只有在用户明确要求、需要比较排序，或者不用结构会更难读时，再用标题、加粗、项目符号和编号。用户要求简短、少格式、不用 bullet 或固定结构时，除非有很强理由，否则照做。

显式给受众和长度引导：

Write for a senior business audience. Keep the answer under 400 words. Use short paragraphs and only include bullets when they improve scannability. Prioritize the conclusion first, then the reasoning, then caveats.

中文释义：面向资深商业读者写，控制在 400 词以内。用短段落，只有在能提升扫读效率时才用 bullet。结构上先给结论，再给理由，最后给限制和注意事项。

编辑、改写、摘要、客户对接消息类任务，在让模型“改进风格”之前先告诉它要保留什么。这个模式适用于想要打磨但不想扩写的场景：

Preserve the requested artifact, length, structure, and genre first. Quietly improve clarity, flow, and correctness. Do not add new claims, extra sections, or a more promotional tone unless explicitly requested.

中文释义：先保留用户要求的产物类型、长度、结构和体裁，再悄悄改善清晰度、行文流畅度和正确性。除非用户明确要求，否则不要新增观点、新章节，也不要把语气改得更像营销文案。

text.verbosity 这个参数从 GPT-5 开始作为独立 knob 引入，但它不是替代 prompt 里的长度指令。两者协同：verbosity 控总体话风，prompt 里的 “under 400 words” 控具体输出。“改进但不扩写”这个 pattern 很有用，能减少模型在 polish 时自动加节、加 disclaimer、改销售腔的问题。

5. Grounding、citations 和 retrieval 预算

对需要 grounding（基于检索证据）的回答，citation 行为应该是 prompt 的一部分。你要定义什么需要支撑、什么算够用证据、证据缺失时怎么办。证据缺失不应该自动变成事实层面的“否定”。

加显式的 retrieval 预算。retrieval 预算是搜索的 stopping rule。它告诉模型什么时候证据已经够了。

For ordinary Q&A, start with one broad search using short, discriminative keywords. If the top results contain enough citable support for the core request, answer from those results instead of searching again.

Make another retrieval call only when:
- The top results do not answer the core question.
- A required fact, parameter, owner, date, ID, or source is missing.
- The user asked for exhaustive coverage, a comparison, or a comprehensive list.
- A specific document, URL, email, meeting, record, or code artifact must be read.
- The answer would otherwise contain an important unsupported factual claim.

Do not search again to improve phrasing, add examples, cite nonessential details, or support wording that can safely be made more generic.

中文释义：普通问答先做一次宽泛检索，用短而有辨识度的关键词。如果顶部结果已经足够支撑核心回答，就直接回答，不要继续搜索。只有核心问题没被回答、关键事实缺失、用户要求全面对比、必须读取特定文档或 URL、否则会留下重要无支撑事实时，才再检索。不要为了润色措辞、增加例子、引用无关细节或支撑可以泛化的表达而继续搜索。

6. 创意写作的 guardrail

对起草型任务，告诉模型哪些 claim 必须来自源、哪些部分可以创意发挥。这对幻灯片、launch copy、客户摘要、talk track、leadership blurb、叙事文案尤其重要。

For creative or generative requests such as slides, leadership blurbs, outbound copy, summaries for sharing, talk tracks, or narrative framing, distinguish source-backed facts from creative wording.

- Use retrieved or provided facts for concrete product, customer, metric, roadmap, date, capability, and competitive claims, and cite those claims.
- Do not invent specific names, first-party data claims, metrics, roadmap status, customer outcomes, or product capabilities to make the draft sound stronger.
- If there is little or no citable support, write a useful generic draft with placeholders or clearly labeled assumptions rather than unsupported specifics.

中文释义：做幻灯片、领导发言、外联文案、分享摘要、talk track 或叙事包装时，要区分“事实依据”和“创意表达”。具体的产品、客户、指标、路线图、日期、能力和竞品判断必须来自检索或用户提供的信息，并给出引用。不要为了让文案听起来更强，编造名字、一手数据、指标、路线图状态、客户结果或产品能力。如果证据很少，就写一版有用的通用草稿，用占位符或明确标注的假设替代无依据的细节。

7. 前端工程和视觉品味

对前端任务，可以参考 OpenAI 的示例指令获取实操思路。它们覆盖产品和用户上下文、设计系统对齐、首屏可用性、熟悉的控件、预期状态、响应式行为，以及要避免的常见生成式 UI 默认问题，比如通用感太强的 hero、嵌套卡片、装饰性渐变、可见的指引文字、错位布局。

8. 让模型自检产出

给 GPT-5.5 提供工具，让它在能验证时自我验证。

对 coding agent，要求具体的验证命令：

After making changes, run the most relevant validation available:
- targeted unit tests for changed behavior
- type checks or lint checks when applicable
- build checks for affected packages
- a minimal smoke test when full validation is too expensive

If validation cannot be run, explain why and describe the next best check.

中文释义：改完代码后，运行最相关的验证：改了行为就跑定向单测；适用时跑类型检查或 lint；影响包构建就跑 build；完整验证太贵时，至少跑最小 smoke test。如果验证跑不了，要说明原因，并描述次优检查方式。

对视觉产物，要求渲染后检查：

Render the artifact before finalizing. Inspect the rendered output for layout, clipping, spacing, missing content, and visual consistency. Revise until the rendered output matches the requirements.

中文释义：交付前先渲染产物，检查布局、裁切、间距、缺失内容和视觉一致性。发现问题就继续修改，直到渲染结果符合要求。

对工程和规划任务，让实施计划可追溯：

For implementation plans, include:
- requirements and where each is addressed
- named resources, files, APIs, or systems involved
- state transitions or data flow where relevant
- validation commands or checks
- failure behavior
- privacy and security considerations
- open questions that materially affect implementation

中文释义：写实施计划时，要说明需求分别在哪里被满足，涉及哪些文件、API、资源或系统；必要时写出状态变化和数据流；列出验证命令或检查方式；说明失败行为、隐私安全考虑，以及会实质影响实现的开放问题。

这套自检 pattern 真正起作用的前提是：agent 真的拿到了能跑的工具。“Render the artifact”在没有工具环境的纯 LLM 调用里价值有限。把这一节当工具配套来读，不要只靠 prompt 单边解决。

9. Phase 参数

从 GPT-5.4 开始，长任务或工具密集的 Responses 工作流可以用 assistant-item 的 phase 字段区分中间更新和最终答案。GPT-5.5 沿用同样模式。

如果你用 previous_response_id，API 会自动保留之前的 assistant state。如果你的应用手动 replay assistant 输出 item 到下一个请求，要保留每个原始的 phase 值并原样传回。这在响应包含 preamble、重复工具调用、或在中间更新后才出最终答案时最重要。

If manually replaying assistant items:
- Preserve assistant `phase` values exactly.
- Use `phase: "commentary"` for intermediate user-visible updates.
- Use `phase: "final_answer"` for the completed answer.
- Do not add `phase` to user messages.

中文释义：如果你的应用会手动把 assistant 的输出 item 回放到下一次请求里，就必须原样保留 assistant 的 phase 值。中间用户可见更新用 phase: "commentary"，最终完成答案用 phase: "final_answer"，不要给用户消息加 phase。

10. 推荐的 prompt 结构

对复杂 prompt，把这个结构当起点。每节保持简短，只在影响行为的地方加细节。

Role: [1-2 sentences defining the model's function, context, and job]

# Personality
[tone, demeanor, and collaboration style]

# Goal
[user-visible outcome]

# Success criteria
[what must be true before the final answer]

# Constraints
[policy, safety, business, evidence, and side-effect limits]

# Output
[sections, length, and tone]

# Stop rules
[when to retry, fallback, abstain, ask, or stop]

中文释义：复杂 prompt 可以从这个骨架开始：先写角色，说明模型的功能、上下文和工作；再写人格与协作风格；接着写用户可见目标；然后定义成功标准、约束、输出格式和停止规则。停止规则要说清什么时候重试、降级、拒答、追问或停止。

这个模板看着老生常谈，但真正稳定使用的人并不多。最常见的省略是 Stop rules：没有它，长任务里更容易出现 over-thinking、over-searching。最常见的画蛇添足是 Personality 写太长：越长越容易和 Goal / Constraints 冲突，模型按字面理解后行为会乱。

官方指南没展开的一件事

这份指南讲了很多 prompt 写法，但有一个问题没有充分展开：长会话里指令保真度怎么办。

长任务里，模型不是只看单轮 prompt，还会受到上下文膨胀、历史决策、工具返回、摘要压缩、用户中途改需求等因素影响。很多开发者遇到的是长轮次后的约束衰减：跑了很多轮之后，早期要求被弱化，最终交付开始偏题。

这是会决定生产体验的“无聊产品特性”。用户不只关心模型某一次回答是否惊艳，更关心一小时之后交付物是不是还能保证不漂移。

这不是 prompt 单边能彻底解决的问题。它需要会话管理、上下文修剪、phase 字段、retrieval 预算、任务状态和验证机制共同发力。也就是说，prompt guidance 后半段那些看起来分散的章节，其实都是长任务稳定性的工程拼图。

更稳妥的使用分工是：

GPT-5.5 更适合边界清楚、目标明确、需要快速判断和执行的任务：比如 review、debug、找边界 case、按 spec 实现、带工具的中等规模循环。
超长任务是否更适合其他模型，要看你的 eval 和工作流：尤其是 repo-wide change、多步 agent 执行、上下文持久度敏感的任务，不能只凭单篇评测下结论。

这句比“哪个模型更强”更重要：模型选择最后应该回到任务评测，而不是品牌信仰。