OpenAI Stops SWE-bench Verified Evaluations, Prompting a Fresh Look at the Reliability of AI Benchmarks

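OpenAI has announced that it will stop reporting SWE-bench Verified scores in evaluations of its frontier AI models. The company cited possible data contamination and test-case quality problems, and said the benchmark needs to be re-examined for its current evaluation use. The decision is expected to extend the ongoing debate over how AI evaluation metrics are maintained, interpreted, and updated, and it underscores the challenge of keeping benchmarks relevant in a fast-moving field.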

Guidances Staff · Updated June 15, 2026 · Sources reviewed

Editorial illustration · June 15, 2026

OpenAI’s decision to stop reporting SWE-bench Verified scores highlights concerns about benchmark reliability, data contamination, and test-case quality.

Sources and disclosures

View source at openai.com

The article's core claims are strongly supported by the provided OpenAI source, which explicitly states the company has stopped reporting SWE-bench Verified scores due to contamination and flawed tests. The article elaborates on these issues (data contamination, test-case quality, benchmark maintenance) in a neutral and informative manner. Speculative elements, such as the potential impact on other organizations, are appropriately framed with cautious language. The article adheres to reputation safety guidelines, avoiding disparagement or unsupported accusations.

Market lens

Agent runtime spending can spill into security, observability, and workflow infrastructure

The market signal is not another chatbot category; it is a possible budget shift toward the control layer around enterprise AI.

Impact path

Runtime spend → infra stack

Signals to watch

Procurement language around audit logs and cost ceilings
Security and observability vendors attaching agent controls
Workflow platforms exposing approval and tool-call governance

Verification schedule

D+1 · Jun 16

Do buyers repeat audit/cost-control requirements?

D+3 · Jun 18

Do vendors publish runtime-control SKUs or partnerships?

D+7 · Jun 22

Do budgets move from pilots into operating infrastructure?

Informational context only — not investment, legal, tax, or financial advice.

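OpenAI has announced that it will stop reporting SWE-bench Verified scores in evaluations of its frontier AI models. Citing possible data contamination and test-case quality problems, the company said the benchmark needs to be reassessed for whether it still suits its current evaluation purpose. The move again raises the question of how AI model evaluation systems should be maintained, updated, and interpreted over time.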

What happened

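SWE-bench Verified is designed to measure an AI model's ability to resolve issues drawn from real software repositories. The benchmark presents models with tasks that require understanding, debugging, and implementing code changes in a setting close to a real development environment. These tasks typically involve navigating complex codebases, identifying bugs, and proposing solutions that integrate with the existing software structure. OpenAI previously used the benchmark as a key indicator of progress for its most advanced models, particularly in automated software engineering. The company has now decided to re-examine that role. This suggests that even a widely used benchmark may need to be interpreted differently as model performance and the data environment evolve.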

Why it matters

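Benchmark scores often carry considerable weight and are frequently treated as indicators of technical progress and as summaries of model capability. Yet scores vary with evaluation design and data conditions, and even identical numbers mean different things depending on how reliable the benchmark itself is. OpenAI's reference to both possible data contamination and test-case quality problems fits this context. It implies that the conditions under which a score is produced may matter as much as the score.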

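Data contamination is a persistent concern in large-model development. As training corpora keep growing, it becomes increasingly difficult to fully rule out unintended exposure during training to benchmark tasks, solution patterns, or closely related examples. This can happen when a training corpus includes public code repositories that contain the specific issues or solutions a benchmark uses. When a model has seen such data, its benchmark performance may reflect memorization or pattern recognition rather than problem-solving ability or generalization to unseen tasks. OpenAI's decision to re-examine SWE-bench Verified in light of this concern underscores the ongoing challenge of keeping training data and evaluation data separate in large-scale AI development.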
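One common way to screen for this kind of overlap is to compare token n-grams between a benchmark item and candidate training documents. The sketch below is illustrative only: the function names, whitespace tokenization, and the default `n = 8` are assumptions made for this example, not a description of how OpenAI or the SWE-bench maintainers check for contamination.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in `text`."""
    tokens = text.split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A high ratio only flags an item for manual review. Exact n-gram matching misses paraphrased or reformatted copies, so a low ratio does not prove that an item is clean.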

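Test-case quality is another important variable. A benchmark's validity depends on whether it can verify that a model has actually solved a problem. If test cases are incomplete, ambiguous, or fail to cover enough edge cases and failure modes, a model may appear to succeed without really addressing the underlying task. In software engineering, where subtle interactions, environment dependencies, and repository-specific structure are common, designing robust test suites is especially hard. OpenAI's concern about test-case quality suggests that existing tests may not fully capture the nuances of real software development problems, leaving the assessment of model performance incomplete.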
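A toy example shows how an incomplete suite can report success for a wrong fix. Everything here is hypothetical and invented for illustration (the `clamp` task, both patches, and both suites); it is not drawn from SWE-bench Verified.

```python
def clamp_buggy(value: int, low: int, high: int) -> int:
    """Hypothetical model patch: forgets to enforce the lower bound."""
    return min(value, high)


def clamp_correct(value: int, low: int, high: int) -> int:
    """Reference behavior: keep value within [low, high]."""
    return max(low, min(value, high))


def weak_suite(fn) -> bool:
    """Incomplete suite: checks only the upper bound and one in-range value."""
    return fn(15, 0, 10) == 10 and fn(5, 0, 10) == 5


def stronger_suite(fn) -> bool:
    """Adds the lower-bound and boundary cases that the weak suite omits."""
    cases = [
        ((15, 0, 10), 10),
        ((5, 0, 10), 5),
        ((-3, 0, 10), 0),
        ((0, 0, 10), 0),
        ((10, 0, 10), 10),
    ]
    return all(fn(*args) == expected for args, expected in cases)
```

Here `weak_suite(clamp_buggy)` passes while `stronger_suite(clamp_buggy)` fails, so a benchmark that relied on the weak suite would count the buggy patch as a solved task.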

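The broader significance is that AI evaluation is becoming a maintenance problem rather than only a static measurement problem. Benchmarks are usually built at a particular point in time to capture the state of capability at that moment. Over time, as models keep improving and training data keeps growing, a benchmark may gradually stop representing the capability it was meant to measure. Tasks that were once highly challenging for models may become trivial, or the assumptions behind the benchmark may no longer match the frontier capabilities being developed. Benchmarks therefore need ongoing maintenance, including periodic updates to the task set, revalidation of test cases, and adjustments for new model architectures and training paradigms. OpenAI's move suggests that relying on static benchmarks without periodic review may limit an accurate understanding of frontier AI progress.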

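Given OpenAI's standing in the AI research community, the decision may also prompt other organizations and researchers to re-examine their own reliance on SWE-bench Verified and similar benchmarks. The benchmark may still have value in specific research settings or for evaluating less advanced models, but whether it is suitable for assessing "frontier" capability is now under scrutiny. This could push the industry toward greater caution about single-metric evaluation and encourage more dynamic, comprehensive, and transparent evaluation frameworks across the AI ecosystem. The emphasis may shift from simply reporting high scores to demonstrating robust, generalizable performance on diverse real-world challenges.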

Operational implications

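For teams building code-generation systems, this means not relying on a single benchmark score. A more robust evaluation strategy combines benchmark results with a range of internal and external validation methods. These can include task-based tests that evaluate the model on real coding projects, internal regression tests that check stability, and continuous monitoring of real usage patterns. This multi-faceted approach gives a fuller picture of model capability and of whether the model is ready for deployment.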
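One way to keep several evaluation sources side by side is to compare a candidate model against a baseline per source and flag any source that regresses, even if the public benchmark score improves. This is a minimal sketch: the `EvalResult` type, the source names, and the 2-point tolerance are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalResult:
    source: str   # e.g. "public_benchmark" or "internal_regression" (hypothetical labels)
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def find_regressions(
    baseline: List[EvalResult],
    candidate: List[EvalResult],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Return sources where the candidate's pass rate dropped by more than `tolerance`."""
    base = {r.source: r.pass_rate for r in baseline}
    drops: Dict[str, float] = {}
    for r in candidate:
        if r.source in base:
            delta = r.pass_rate - base[r.source]
            if delta < -tolerance:
                drops[r.source] = round(delta, 4)
    return drops
```

Reporting per-source deltas instead of one blended number keeps a rising benchmark score from masking a drop in internal regression tests.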

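There are governance implications as well. Clear governance of the evaluation framework becomes important. Organizations should establish procedures for selecting benchmarks, documenting the rationale, and periodically reviewing their continued relevance. They should also set up processes to track the provenance of training data and assess whether it may overlap with evaluation material, to reduce contamination risk. The quality and completeness of test suites should likewise be monitored continuously and reassessed regularly to ensure they still represent the required capabilities. OpenAI's announcement reinforces an expectation that evaluation methods should be transparent, verifiable, and able to keep pace with rapid AI innovation.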
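A lightweight way to make the selection rationale and the review cadence explicit is a small registry of benchmark records. The sketch below is an assumption-laden illustration: the `BenchmarkRecord` fields and the 180-day default interval are invented for this example and do not come from any published governance standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class BenchmarkRecord:
    name: str
    rationale: str                    # why the benchmark was chosen
    last_reviewed: date
    review_interval_days: int = 180   # hypothetical default cadence


def due_for_review(records: List[BenchmarkRecord], today: date) -> List[str]:
    """Names of benchmarks whose last review is older than their interval."""
    return [
        r.name
        for r in records
        if today - r.last_reviewed > timedelta(days=r.review_interval_days)
    ]
```

Passing `today` as an argument, instead of calling `date.today()` inside the function, keeps the check deterministic and easy to test.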

Uncertainties or limitations

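OpenAI's announcement should be read in the context in which it was made. The company has said it will stop reporting SWE-bench Verified scores in its frontier model evaluations, and it listed possible data contamination and test-case quality problems as the reasons. That does not mean the benchmark has lost its validity for all other uses or for other organizations. SWE-bench Verified may remain a useful tool for specific research purposes, for evaluating models at different stages of development, or for comparing certain aspects of code-generation ability. The core message is not a verdict on the benchmark's overall usefulness but a call to consider its suitability and reliability carefully, especially when evaluating the most advanced AI systems. The key issue, then, is not simply replacing one evaluation metric with another, but reviewing evaluation systems regularly, particularly when they are used to summarize fast-changing model capabilities.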

Builder Implications

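When developing code-generation models, do not rely on a single benchmark score; evaluate benchmark results together with real usage scenarios, task-based tests, and internal regression checks.
When designing an internal evaluation framework, establish procedures to track training data provenance and assess its possible overlap with evaluation material, especially for code-oriented benchmarks.
Review the completeness and consistency of test suites regularly, because benchmark reliability depends on test quality as well as on the model under test.
Treat the evaluation framework as a dynamic system that needs periodic re-examination, not as a fixed scoreboard that stays valid without revision.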

Visual briefing

Flow diagram showing training data, benchmark tasks, test cases, model evaluation, and review and update steps.

A simple workflow showing how benchmark reliability can weaken and why periodic review matters.

Corrections and safety

See a factual, privacy, rights, or safety issue? Review the corrections process or contact Guidances before relying on this article for important decisions.
