Kimi K2 explained in 5 minutes

  • Kimi K2 by Moonshot is making waves in the AI industry.
  • Many still prefer models like Claude and Gemini.
  • Kimi K2's potential lies in its innovative architecture.
  • Cost factors are influencing the adoption of Kimi K2.
  • The evolution of open-source models could shift the competitive landscape.

Kimi K2 was released by Moonshot, and it's making a big impact in the AI industry. You might say, "Nah, I've been using Claude and Gemini and they're working just fine, so why should I really care about this?"

Most people have probably never heard of Kimi K2, or even Moonshot, because on the commercial side the market is dominated by OpenAI, Anthropic, and Gemini. But there's a reason the open-model community is raving about models like Kimi K2.

The reason Kimi K2 is not widely known has mostly to do with the word "large" in "large language model": open-source models of this class are still too large for most people to run locally and get the same results as the state-of-the-art models.

For example, Kimi K2 has a total of 1 trillion parameters, which is comparable to other frontier models like GPT, Gemini, and Claude. The competitive edge those models have over Kimi K2 is economies of scale: to serve a trillion-parameter model to millions of users, you need large-scale infrastructure to support it.

And sadly, to run Kimi K2 yourself you need to spend about $25,000 minimum to purchase an Nvidia H100 card, similar to buying a brand-new car. And just like owning a car, you need to put gasoline in it: an H100 can cost up to $90 per month in electricity to run locally, drawing around 0.7 kilowatts.

So you might ask at this point: what's really the big deal? It looks like Kimi K2 still isn't quite there yet, right? Sure, the math still favors the commercial models, because you only pay a subscription fee of $200 per month per developer for effectively unlimited use of state-of-the-art models like Claude. That's only $2,400 per year, significantly less than paying $25,000 to run Kimi K2.

If you think about it from a company's perspective, though, at $2,400 per year per developer the math starts to tilt the other way. The $25,000 of hardware a company purchases can be treated as an upfront capital expenditure and depreciated for tax purposes.

And on top of that, each following year costs only about $1,080 in electricity for inference, which is far less than $2,400, and potentially even less per developer if the same hardware is shared between two developers.
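The cost comparison above can be sketched as a simple cumulative-cost model, using only the figures from the text ($25,000 upfront, $1,080/year electricity, $2,400/year subscription):

```python
# Self-hosting vs. subscription, per developer, using the figures
# quoted in the text. This ignores depreciation tax benefits,
# hardware failure, and model upgrades, so it is a rough sketch only.
HARDWARE_USD = 25_000
ELECTRICITY_USD_PER_YEAR = 1_080
SUBSCRIPTION_USD_PER_YEAR = 2_400

def self_hosted_cost(years, developers_sharing=1):
    # Hardware is a one-time capital expense; electricity recurs yearly.
    return (HARDWARE_USD + ELECTRICITY_USD_PER_YEAR * years) / developers_sharing

def subscription_cost(years):
    return SUBSCRIPTION_USD_PER_YEAR * years

for years in (1, 5, 10):
    print(years, self_hosted_cost(years, developers_sharing=2),
          subscription_cost(years))
```

Note the sketch shows the upfront hardware cost takes several years of avoided subscriptions to pay back, which is why sharing the card between developers matters so much to the argument.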

So you can see why there's so much excitement around models like Kimi K2: the open-model market is catching up to the point where models comparable to the state of the art can now be run locally. The obvious next question is: how?

How is it that Kimi K2 performs so well against state-of-the-art models like Claude, GPT, and Gemini? Kimi K2's architecture is built around what's called MoE, or mixture of experts. Most commercial models like GPT and Claude are what are called dense models, while Kimi K2 is a sparse model.

A dense model is your typical feed-forward neural network that activates the entire model to process each token, whereas a sparse model activates only a few sections of the model. That's why even though Kimi K2 is 1 trillion parameters in size, it only activates 8 sections, in this case 8 experts, per token.

The model has a total of 384 experts that together add up to about 1 trillion parameters, and this makes inference much faster for its size because only about 32 billion parameters are active at a time.
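The routing idea can be illustrated with a toy top-k mixture-of-experts layer. This is a minimal sketch, not Kimi K2's actual implementation: the gate, expert shapes, and weighting scheme here are simplified placeholders, though the 384-expert / 8-active split mirrors the numbers above.

```python
import numpy as np

# Toy top-k MoE routing: a gate scores every expert for a token,
# but only the k highest-scoring experts actually run.
NUM_EXPERTS, TOP_K, DIM = 384, 8, 16  # tiny DIM for illustration

rng = np.random.default_rng(0)
gate_W = rng.normal(size=(DIM, NUM_EXPERTS))          # gating network
experts = rng.normal(size=(NUM_EXPERTS, DIM, DIM))    # one tiny FFN each

def top_k_route(gate_logits, k):
    top = np.argsort(gate_logits)[-k:]     # indices of the k best experts
    weights = np.exp(gate_logits[top])
    weights /= weights.sum()               # softmax over selected experts only
    return top, weights

token = rng.normal(size=DIM)
idx, w = top_k_route(token @ gate_W, TOP_K)

# Only 8 of the 384 expert weight matrices are touched for this token,
# which is why active parameters (~32B) are far fewer than total (~1T).
output = sum(wi * (token @ experts[i]) for i, wi in zip(idx, w))
print(len(idx), output.shape)
```

The key design point: compute per token scales with the 8 selected experts, while capacity scales with all 384.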

Kimi K2 is also built around actions, meaning it's specifically trained to make better tool calls. And I think this is an important point: LLM benchmarks, which typically measure the raw intelligence of a model, are now expanding to look for models that are more resourceful at leveraging external services and tools to take better actions.

Moonshot recognized this and specifically trained Kimi K2 on simulated tool usage, so it learns to be resourceful in calling the right tools for different purposes and different contexts. This will pay huge dividends as the industry shifts toward MCP and A2A (agent-to-agent) networks, which rely heavily on external tools and services.
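To make "tool calling" concrete, here is a minimal sketch of the host-side loop such training targets. Everything here is a hypothetical placeholder: the tool name, the JSON shape, and the fake model output are illustrations, not Kimi K2's or MCP's actual wire format.

```python
import json

# Hypothetical tool registry: name -> callable. In a real system these
# would be external services described to the model via a schema.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def handle_model_turn(model_output: str) -> str:
    # A tool-trained model emits a structured call instead of plain text;
    # the host parses it, executes the tool, and feeds the result back
    # to the model on the next turn.
    call = json.loads(model_output)
    tool = TOOLS[call["name"]]
    return tool(**call["arguments"])

# Pretend the model decided a tool call was the right action:
fake_model_output = json.dumps(
    {"name": "get_weather", "arguments": {"city": "Tokyo"}}
)
print(handle_model_turn(fake_model_output))  # Sunny in Tokyo
```

The benchmark shift described above is about how reliably a model produces the right structured call with the right arguments, rather than how well it answers from memory.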

My next question, then, is how Kimi K2 lines up with what happened back in January 2025, when DeepSeek R1 was first released and shook the world. Remember when DeepSeek was announced as a ChatGPT killer and the stock market ended up dropping by a trillion dollars?

Eventually, will the cost of operating these models be on par with using frontier models like OpenAI's GPT, Anthropic's Claude, and Google's Gemini? Every major release like DeepSeek R1 or Kimi K2 chips away at the competitive edge these companies currently enjoy.

And I think this is why LLM providers recognize that their competitive edge is not permanent, which is why we're seeing them expand their product offerings into adjacent products like AI code editors, AI web browsers, AI chat applications, and more: it sustains their competitive edge a little longer.

So as we look forward to open-source models like Kimi K2 becoming more available and more useful, we'll get to see how the industry changes. It will be interesting to see whether self-hosting starts to become the norm, or whether the commercial models continue to dominate the industry because they can simply put more dollars behind faster innovation.