QWEN 3 CODER is Unleashed... better than KIMI K2
- Alibaba has unveiled Qwen3 Coder, a powerful new open-source coding model.
- Its most powerful variant is a massive 480-billion-parameter model.
- Qwen3 Coder excels across various benchmarks and outperforms competitors.
- A command-line tool for agentic coding, Qwen Code, is now available.
- New methods in reinforcement learning enhance real-world coding task performance.
Alibaba just dropped the next big thing: Qwen3 Coder. Just as the world is getting used to Kimi K2, the other big open-source coding model, Alibaba drops Qwen3 Coder, and it's available in multiple sizes.
But the big one, the most powerful one, is Qwen3 Coder 480B A35B Instruct, meaning this is a 480-billion-parameter model. It's built with a mixture of experts, so only 35 billion parameters are active during any call. Instruct means it's the friendly and helpful assistant mode, as opposed to the base model, which is more like a text-completion model.
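To picture what "only 35 billion active parameters" means, here's a minimal sketch of top-k expert routing, the mechanism behind a mixture of experts. The dimensions, expert count, and top-k value are made up for illustration and are not Qwen3 Coder's actual configuration.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a router picks the top-k experts per token,
    so only a fraction of the total parameters do work on any forward pass."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(10, 64)).shape)            # torch.Size([10, 64])
```

The idea scales up: a model can have an enormous total parameter count while each token only touches a small slice of it, which keeps inference cost closer to that of a much smaller dense model.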
It natively supports 256K context, and that scales up to 1 million. And as you can see here, the benchmarks look great. Now, you can't always just trust the benchmarks, so give it a few days while everybody gets their hands on it and tests it out.
But Kimi K2, by all accounts, was extremely impressive, and Qwen3 Coder handily beats Kimi K2. Not only that, it's competitive with Claude Sonnet. It also beats OpenAI's GPT-4.1 and scores very strongly on agentic browser use as well as agentic tool use. Again, only Claude Sonnet 4 edges it out in some cases.
The Qwen team promises that it works seamlessly with the community's best developer tools and that it can be used anywhere across the digital world, "agentic coding in the world," as they put it, and they're not kidding about this.
Alongside the model, they're open-sourcing a command-line tool for agentic coding called Qwen Code. It's forked from Google's Gemini CLI. As you can see here, it's open-sourced on GitHub under an Apache 2.0 license. It looks a lot like Claude Code, and it's been adapted with customized prompts and function-calling protocols to fully unleash the capabilities of Qwen3 Coder on agentic coding tasks.
So basically, Qwen Code is an adapted Gemini CLI, modified to use the Qwen models. You can also use the Qwen3 Coder models with Claude Code, and since a lot of people prefer Claude Code, that's easy to set up. You're also able to use it with Cline. I'll leave these links down below.
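If you'd rather call the model directly than go through one of those tools, here's a minimal sketch using an OpenAI-compatible client. The endpoint URL and model id are assumptions based on how Alibaba typically exposes Qwen models, so check the official docs for the exact values before using this.

```python
# pip install openai
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model id; verify against the
# Alibaba Cloud / DashScope documentation, these may differ.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3-coder-480b-a35b-instruct",  # illustrative model id
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
)
print(resp.choices[0].message.content)
```

Because the interface is OpenAI-compatible, the same snippet works against a locally hosted copy of the open weights if you point `base_url` at your own server.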
One of the big subjects of discussion right now is reinforcement learning: taking these models into an RL gym to teach them various skills like coding, mathematics, and so on. As they put it here, the theme of scaling code reinforcement learning is "hard to solve, easy to verify," as you can see here.
As they increase training steps while training this model, performance climbs across all the different areas, like code generation, software development, competitive coding, SQL, instruction following, and so on. They take a bit of a stab at the other frontier labs, saying: "Unlike the prevailing focus on competition-level code generation in the community, we believe all code tasks are naturally well suited for execution-driven, large-scale reinforcement learning. That's why we scaled code RL training on a broader set of real-world coding tasks." Basically, they're saying that instead of trying to max out these quizzes and tests, these competition-style questions, they're actually training this model to do real work in the real world.
They continue: "By automatically scaling test cases of diverse coding tasks, we created high-quality training instances and successfully unlocked the full potential of reinforcement learning. It not only significantly boosted code execution success rates but also brought gains to other tasks."
This means that the approach generalizes. We've seen in other published papers that, for example, training a model on code improves its ability to do math problems, even though it's not explicitly trained on that. Certainly, it seems like their approach generalizes across a lot of different tasks. Hopefully, they'll publish a paper later that outlines how they approached this.
But here they mention that their approach uses hard-to-solve, easy-to-verify tasks as fertile ground for large-scale reinforcement learning.
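They haven't published the details yet, so here's only a rough sketch of what "hard to solve, easy to verify" looks like in practice: the reward is simply whether generated code passes automatically generated test cases. The function names and the test harness here are illustrative, not Qwen's actual pipeline.

```python
import subprocess
import sys
import tempfile
import textwrap

def execution_reward(candidate_code: str, test_code: str, timeout: float = 5.0) -> float:
    """Reward = 1.0 if the model's code passes the tests, else 0.0.
    Writing a correct solution is hard; running the tests is easy."""
    program = textwrap.dedent(candidate_code) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# Example: a model-generated solution plus auto-generated asserts.
solution = """
def add(a, b):
    return a + b
"""
tests = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""
print(execution_reward(solution, tests))  # 1.0
```

A binary, execution-based signal like this is cheap to compute at scale, which is presumably what makes broad real-world coding tasks attractive for RL compared with hand-graded competition problems.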
So here's a chart of its performance on SWE-bench Verified. The x-axis is the size of the model: models on the right are bigger, models on the left are smaller, and the higher you go, the better the score on SWE-bench Verified. SWE-bench Verified is 500 real-world Python GitHub issues, reviewed by humans and confirmed solvable.
As you can see here, Qwen3 Coder is above most of the other models we've seen, including Kimi K2, GPT-4.1, even Gemini 2.5 Pro Preview. Only Claude Sonnet 4 slightly edges it out while being a much bigger model. For a moderately sized model, nothing is even close to it in terms of performance. Kimi K2 is much larger while not as good.
Importantly, with real-world software engineering tasks like those in SWE-bench Verified, Qwen3 Coder must engage in multi-turn interactions with the environment, involving planning, using tools, receiving feedback, and making decisions. So this isn't a simple question-answer format. This is a long-horizon task that includes planning and feedback.
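To make the "multi-turn interaction" idea concrete, here's a bare-bones sketch of an agent loop: the model proposes an action, the environment responds, and the transcript grows until the model decides it's done. The tool set, action format, and stopping logic are invented for illustration; this is not the SWE-bench harness or Qwen's agent framework.

```python
import subprocess

def run_shell(cmd: str) -> str:
    """One 'tool' the agent can call: run a shell command and return its output."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr

def agent_loop(model_step, task: str, max_turns: int = 10) -> str:
    """Minimal multi-turn loop: plan, act, read feedback, decide whether to continue."""
    history = [f"TASK: {task}"]
    for _ in range(max_turns):
        action = model_step("\n".join(history))   # e.g. "RUN: pytest -q" or "DONE: <summary>"
        if action.startswith("DONE:"):
            return action
        if action.startswith("RUN:"):
            feedback = run_shell(action[len("RUN:"):].strip())
            history.append(f"ACTION: {action}")
            history.append(f"FEEDBACK: {feedback}")
    return "Gave up after max turns."

# model_step would wrap a call to the model; a stub stands in for it here.
print(agent_loop(lambda history: "DONE: nothing to do", "fix the failing test"))
```

The point is that reward only arrives at the end of a whole trajectory of tool calls and feedback, which is what makes these long-horizon tasks harder to train on than single question-answer pairs.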
What was their trick for getting this model to be so good at those long-horizon tasks? They're stating here that in the post-training phase of Qwen3 Coder they introduced long-horizon RL, which they refer to as Agent RL. This was to encourage the model to solve those real-world tasks through multi-turn interactions using tools.
This is interesting: the key challenge in doing something like Agent RL lies in environment scaling. To address this, they built a scalable system capable of running 20,000 independent environments in parallel, using Alibaba's cloud infrastructure.
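Here's a tiny sketch of what "many independent environments in parallel" means mechanically, using a process pool. The real system obviously runs on Alibaba Cloud at a completely different scale; every name here is illustrative, and the "environment" is just a stub that returns a random reward.

```python
from concurrent.futures import ProcessPoolExecutor
import random

def rollout(env_id: int) -> float:
    """Stand-in for one independent coding environment: run an episode and
    return its reward. A real env would set up a repo, apply the model's
    patch, and run the tests."""
    random.seed(env_id)
    return float(random.random() > 0.5)

if __name__ == "__main__":
    n_envs = 64  # the real pipeline reportedly scales to ~20,000
    with ProcessPoolExecutor() as pool:
        rewards = list(pool.map(rollout, range(n_envs)))
    print(f"mean reward over {n_envs} envs: {sum(rewards) / n_envs:.2f}")
```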
That system gave them the scale needed to run this massive reinforcement learning pipeline, and it's what allowed Qwen3 Coder to achieve state-of-the-art performance among open-source models without (and this is important to note) test-time scaling. This is not a reasoning model. Unlike DeepSeek R1 or Gemini 2.5 Pro Preview, this is a non-reasoning model.
Some of the demonstrations they posted include a building demolition demo. Here, using Qwen with Cline, they put together an interactive color explosion. Looks pretty neat.
So we're going to have to test some of these things out: a 3D Google Earth terrain visualization, a little app to test your typing speed, a bouncing ball in a rotating hypercube, a super trippy solar system simulation, and a duet game. It's available on Hugging Face and Qwen Chat.
So I'm going to be doing a full round of testing on this thing, but for the time being, here are just a few quick tests that I ran. One is a simulation of an office building with desks and offices inside. I tried to get it to do some sort of transparent windows on the outside, but it wasn't able to do proper transparency; the windows came out either very opaque or not transparent at all.
But it did manage to create, you know, the rooms inside with desks and computers and what looks like a light in each room. So far, it's not too bad. I mean, this is pretty good for a first attempt.
I also created a little drone-flying game where you're able to fly a drone around a city. I can't complain; it's pretty good. I mean, the keys are a little bit weird, but I'm liking it so far. It takes a little getting used to, but once you're able to kind of feather the throttle, you can fly around, and it's not bad for one-shotting it.
We have a Minecraft clone. Where would we be without a Minecraft clone? You're able to place blocks and build and move around. Not too bad, I was able to one-shot this pretty easily.
So check it out and let me know what you think about it. More tests and use cases are coming very, very soon. Also, somebody made this little voice hack for Claude Code, and now, why didn't we have this all along?
"Working through the task list. Creating the to-do list, sir. Beginning Python code adjustments. Make it so, Number One. Working through the task list." Actually, that might get kind of annoying pretty quick.
And I gotta say, I did not expect open-source AI to be this good. I think we were hoping that it would be, at most, a few years behind the frontier labs. Now it's looking like it's months. What a time to be alive.