The example model in the documentation is 4o-mini; you might want to update that to a more recent model.
As an aside, 4o-mini came out months before agent skills were released… I'm curious how it performs with choosing to load skills in the first place?
The skill is deterministically added to the prompt by the harness before the target model is invoked. There is no “choosing” to load a skill. You might be confusing skills with tools (MCP, etc.).
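To make the distinction concrete, here is a minimal sketch of what "deterministically added by the harness" could look like. The names `Skill`, `build_prompt`, `run_with_skill`, and `call_model` are hypothetical, made up for illustration; this is not the actual harness code, just the general shape under that assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    # Hypothetical container for a skill; the field names are assumptions.
    name: str
    instructions: str


def build_prompt(task: str, skill: Skill) -> str:
    """Prepend the skill text to the task unconditionally.

    No model call happens here, so the model never decides
    whether the skill gets loaded.
    """
    return f"# Skill: {skill.name}\n{skill.instructions}\n\n# Task\n{task}"


def run_with_skill(task: str, skill: Skill, call_model: Callable[[str], str]) -> str:
    # The harness assembles the full prompt first, then invokes the target model once.
    prompt = build_prompt(task, skill)
    return call_model(prompt)
```

With tools, by contrast, the model would see a list of tool descriptions and emit a call for one of them; here the skill text is already in the prompt by the time the model runs.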
It's an artifact of the documentation being AI-generated; LLMs usually pick GPT-4-era models without giving it further thought.
For Gemini it seems to always pick 2.5 despite 3.1 being the latest, and for Claude the 3.5-era models.
Not sure what's preventing AI labs from ensuring this stuff is refreshed during training.
Are there any published results gathered using this?
How do you iterate on the judge prompt? Is there an auto rater?