Great post and thoughts.
My personal view: GPT-5's routing pipeline seems meant mainly to reduce cost (there appears to be a shortage of GPUs relative to user demand) and partly to improve inference speed on queries that don't need the most powerful models available. I doubt GPT-5 is doing MCTS (that would be expensive and would slow inference down), but I don't rule out that OpenAI is doing something more sophisticated than a single prompt for selecting an output. Things like best-of-n or majority voting seem unlikely, though, since GPT-5 (and Claude, of course) streams tokens in real time. Similar arguments may be made for Claude, except that Anthropic likely faces much smaller consumer demand and thus probably has more spare GPUs for more elaborate model/output selection (e.g., routing to a short- or long-context model depending on input length).
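To make the cost/latency framing concrete, here is a toy sketch of the kind of pre-generation routing I have in mind (all model names, thresholds, and the classifier stub are hypothetical; this is not a claim about OpenAI's or Anthropic's actual pipeline). The key property is that the router decides *before* generation starts, which is compatible with tokens streaming in real time, whereas best-of-n or majority voting would require producing multiple full completions before showing anything.

```python
# Hypothetical illustration only: a toy router that sends simple or short
# queries to a cheaper/faster model and everything else to a stronger one.
# Model names and thresholds are made up for the sketch.

def looks_hard(query: str) -> bool:
    # Stand-in for a small learned classifier; a real system would not
    # rely on keyword matching like this.
    return any(k in query.lower() for k in ("prove", "debug", "step by step"))

def route(query: str, context_tokens: int) -> str:
    """Pick a model tier from cheap-to-compute input features,
    before any generation happens."""
    # Very long inputs go to a long-context variant
    # (cf. the Claude speculation above).
    if context_tokens > 100_000:
        return "long-context-model"
    # Queries that appear to need heavy reasoning get the frontier model;
    # the routing step itself must stay cheap and fast.
    if looks_hard(query):
        return "frontier-reasoning-model"
    return "small-fast-model"

print(route("Prove this lemma step by step.", context_tokens=2_000))
```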
On whether the open-weight ecosystem is actually not far behind the closed-weight ecosystem (from a capabilities perspective): more work is needed to verify this reliably, and we would probably need more metadata exposed alongside closed-model providers' APIs to be certain. My limited anecdotal evidence (uncertain, since we don't know the system pipelines behind closed models) suggests that raw capabilities may not differ that much, but closed models seem noticeably better at things like instruction following, inferring users' actual intent, and alignment.