All signs point to this being a finetune of gpt4o with additional chain of thought steps before the final answer. It has exactly the same pitfalls as the existing model (9.11>9.8 tokenization error, failing simple riddles, being unable to assert that the user is wrong, etc.). It’s still a transformer and it’s still next token prediction. They hide the thought steps to mask this fact and to prevent others from benefiting from all of the finetuning data they paid for.
This is actually pretty smart because it switches the context of the action. Most intermediate users avoid clicking random executables by instinct but this is different enough that it doesn’t immediately trigger that association and response.