WizardCoder: Evol-Instruct

Paper link: https://arxiv.org/pdf/2304.12244.pdf

GitHub link: https://github.com/nlpxucan/WizardLM

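For instruction evolution, WizardCoder makes the following changes to the evolution instructions used in WizardLM:

Refine the evolution instructions;
Simplify the form of the evolution prompts;
Add code debugging and time-space complexity constraints;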

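The fine-tuning dataset obtained this way is then used to fine-tune the model. The base model is StarCoder (the GitHub repository shows that, more recently, the authors have also fine-tuned on llama2, and that llama2-13b performs somewhat better than StarCoder-15b). The trained model is evaluated on four benchmarks: HumanEval, HumanEval+, MBPP, and DS-1000. It is compared against the mainstream open-source and closed-source models available at the time.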

1. How the instructions are evolved

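The specific modifications to the WizardLM evolution instructions are:

The original WizardLM has 5 in-depth evolution instructions (add constraints, deepening, concretizing, increase reasoning steps, complicate input) and 1 in-breadth evolution instruction. This paper removes two of the in-depth instructions, deepening and complicate input, and also removes the in-breadth evolution. Only three in-depth instructions remain: add constraints, concretizing, increase reasoning steps.
These 3 evolution instructions share one unified template; the exact prompt text is given below;
To suit the characteristics of code, two evolution instructions are added: code debugging and code time-space complexity constraints;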

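In total, this paper uses 5 evolution instructions, all of which share one unified template. The template is shown below, where {question} is the seed instruction to be evolved and {method} is filled with one of the 5 evolution instructions.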

Please increase the difficulty of the given programming test question a bit.

You can increase the difficulty using, but not limited to, the following methods:

{method}

{question}

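The 5 evolution instructions are listed below. In order, their types are: add constraints, concretizing, increase reasoning steps, code debugging, code time-space complexity constraints. In practice, one of the following texts is substituted directly for {method} in the template above.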

Add new constraints and requirements to the original problem, adding approximately 10 additional words.

Replace a commonly used requirement in the programming task with a less common and more specific one.

If the original problem can be solved with only a few logical steps, please add more reasoning steps.

Provide a piece of erroneous code as a reference to increase misdirection.

Propose higher time or space complexity requirements, but please refrain from doing so frequently.
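The substitution described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `EVOL_TEMPLATE`, `EVOL_METHODS`, and `build_evol_prompt` are hypothetical, and only the template and instruction texts are taken from this section.

```python
# Unified template, copied from the text above.
EVOL_TEMPLATE = (
    "Please increase the difficulty of the given programming test question a bit.\n\n"
    "You can increase the difficulty using, but not limited to, the following methods:\n\n"
    "{method}\n\n"
    "{question}"
)

# The five evolution instructions, keyed by the type names used in the text.
# (The dictionary and its keys are illustrative, not from the paper's code.)
EVOL_METHODS = {
    "add constraints": (
        "Add new constraints and requirements to the original problem, "
        "adding approximately 10 additional words."
    ),
    "concretizing": (
        "Replace a commonly used requirement in the programming task "
        "with a less common and more specific one."
    ),
    "increase reasoning steps": (
        "If the original problem can be solved with only a few logical steps, "
        "please add more reasoning steps."
    ),
    "code debugging": (
        "Provide a piece of erroneous code as a reference to increase misdirection."
    ),
    "code time-space complexity constraints": (
        "Propose higher time or space complexity requirements, "
        "but please refrain from doing so frequently."
    ),
}


def build_evol_prompt(question: str, method_type: str) -> str:
    """Fill the unified template with one evolution instruction and a seed question."""
    method = EVOL_METHODS[method_type]  # raises KeyError for an unknown type
    # str.replace is used instead of str.format so that curly braces inside the
    # seed question (common in code) are not interpreted as format fields.
    return EVOL_TEMPLATE.replace("{method}", method).replace("{question}", question)


seed = "Write a function that returns the sum of a list of integers."
prompt = build_evol_prompt(seed, "code debugging")
print(prompt)
```

The resulting prompt would then be sent to whichever model performs the evolution; that step is omitted here because this section does not describe it.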

2. Experimental results

Closed-source models: GPT3.5, GPT4, Code-Davinci-002, Code-Cushman-001, Codex, Bard, PaLM, PaLM2, LaMDA, AlphaCode, Claude;

Open-source models: StarCoder, LLaMa, CodeGen, CodeGeeX, CodeT5+, InCoder, StarCoder-GPTeacher, Instruct-Codegen-16B, Guanaco-65B;

On the HumanEval and HumanEval+ datasets, the comparison against closed-source models is shown in the figure below; the proposed model, WizardCoder, ranks third:

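The table below reports results for open-source and closed-source models on the HumanEval and MBPP test sets. WizardCoder outperforms every open-source model; it is also the best among models that have already been instruction fine-tuned, such as InstructCodeT5+: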

Results on the DS-1000 dataset are shown in the figure below; here too, WizardCoder has a clear advantage:

3. Ablation study

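This is less an ablation study than a strategy for choosing how many rounds of instruction evolution to run. As the figure below shows, the instructions were evolved for 4 rounds, and the models trained on each round's data were compared on HumanEval. The model trained on the instructions from round 3 performed best, so that model was kept as the final model.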
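The selection rule described above amounts to taking the round with the highest benchmark score. A minimal sketch follows; the function name `select_best_round` is hypothetical, and the scores are made-up placeholders for illustration, not the paper's numbers.

```python
def select_best_round(scores_by_round: dict[int, float]) -> int:
    """Return the evolution round whose trained model scored highest."""
    if not scores_by_round:
        raise ValueError("no rounds to compare")
    # On ties, max() returns the first key in iteration order.
    return max(scores_by_round, key=scores_by_round.get)


# Placeholder scores only; these are NOT results reported in the paper.
placeholder_scores = {1: 0.50, 2: 0.55, 3: 0.60, 4: 0.58}
best_round = select_best_round(placeholder_scores)
print(best_round)
```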
