Exploring Chroma's Long Context Work: Insights from Our Research Team


Key points:

  • Anton and Kelly discuss Chroma's long-context work.
  • Kelly shares her six to seven months at Chroma and her generative benchmarking project.
  • Scientific work is about not just final results, but also what failed in the experiments and how the conclusions were formed.
  • The research explores the problem of context rot and its impact on model performance.
  • Experiments show that models do not use long context uniformly, especially on more complex tasks.

Hi folks, Anton here, co-founder of Chroma and Chroma's scientific adviser, here with Kelly, our researcher in residence, to talk about our long context work. Kelly, why don't you give us an introduction? How long have you been at Chroma? What have you been up to?

Yeah, I joined Chroma in January, so it's been about six to seven months now. I did a project on generative benchmarking, and this was the second project that I did. So it's been great so far. The generative benchmarking project was a real hit; it was picked up in a lot of places, and it looks like our long context work is getting a lot of eyeballs too. So I figured this would be a great opportunity to talk about it a little more, go behind the scenes, a bit beyond what's in the technical report.

Yeah. Part of what makes science work is not just the final result, where you're like, hey, look how smart I am, I got these great results. Talking about what didn't work, what we tried, and how we came to the conclusions we did is just as important. So why don't we start there? What was the motivation for the work?

Yeah, initially we were experimenting a bit with agent memory: if an agent is able to retrieve previous actions, is it able to improve in future unseen scenarios? Naturally, while experimenting with this, we were working with agent benchmarks, and although these are very simple tasks, the models would get confused. So I was looking at the kinds of inputs these models were working with, and a lot of them included a ton of instructions, tool descriptions, and a very long chain of previous thoughts and actions. So it's very natural to think that these models would get distracted, leading to lower quality outputs.

That's right. So I think that's what first led to us starting to talk about this problem of context rot. By the way, what is the lesson here? Look at your data.

Yes. First lesson that Anton taught me. Um, yeah, I think that's right. They generate these long reasoning chains. And one of the things that has always bugged me about the way long context is presented in the literature, or in various AI companies' marketing materials, is this almost trivial benchmark of needle in a haystack. Could you elaborate on what that is exactly?

Yeah, needle in a haystack is essentially a very simple retrieval task. You have a very long haystack, just a long chunk of text, and you have a random fact placed somewhere within that haystack, and the model's job is just to retrieve that one simple fact.

What are some examples of needles and haystacks? For example, you could have a bunch of Paul Graham essays appended together. I think the original needle-in-a-haystack test had a question like, what is the best thing to do in San Francisco, and the needle is just, the best thing to do in San Francisco is to go to Dolores Park and do something. But it's often created with lexical matches between the question and the needle, so the retrieval task is pretty easy. The hay is pretty unrelated to the needle and question as well. So it's a very simple task that doesn't really give you much insight into how the model processes the rest of the input.
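The setup described above is easy to sketch in a few lines. This is an illustrative reconstruction, not the report's actual harness; the filler documents, needle text, and length target are placeholders:

```python
def build_haystack(filler_docs, needle, depth=0.5, target_chars=50_000):
    """Concatenate filler documents (e.g. Paul Graham essays) up to roughly
    target_chars, then splice the needle in at the requested relative depth
    (0.0 = start of the haystack, 1.0 = end), at a sentence boundary."""
    hay = ""
    for doc in filler_docs:
        if len(hay) >= target_chars:
            break
        hay += doc.strip() + "\n\n"
    cut = int(len(hay) * depth)
    # Back up to the nearest sentence boundary so the needle
    # doesn't split a sentence in half.
    boundary = hay.rfind(". ", 0, cut)
    boundary = boundary + 2 if boundary != -1 else cut
    return hay[:boundary] + needle + " " + hay[boundary:]

# The classic example pair (lexical overlap between question and needle):
needle = ("The best thing to do in San Francisco is to sit in Dolores Park "
          "and eat a sandwich on a sunny day.")
question = "What is the best thing to do in San Francisco?"
```

The model is then given the haystack plus the question, and scored on whether it recovers the needle.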

Yeah, it was a good first step at validating: hey, can the model even find these tokens in its input? But when we're talking about performing tasks, because that's what we want these models to do, you want something that's more representative of what that's like, and you'd like to understand how those factors vary. For me, this was always an itchy part of my brain: in my practical experience of using the models, I could see them, even on remembering basic facts, just start to veer off as the context filled up, right?

Part of that, of course, is what they do behind the scenes. Uh especially when you have like these long chats, like they do those compactions, right? And then they lose the facts as part of those compactions.

Yeah, which we didn't look into in this research, to be very clear, but I think it is important future work. But even in API calls, where I should have had the context window, information that should have been present to the model wasn't found. And I thought, okay, intuitively there are no real guarantees that they're processing this context evenly, right? If you think about how they work, not necessarily, exactly. So, what did we decide we needed to investigate?

Yeah, I think the first test that we did was a very simple task, text replication. One test I thought of was: we give a model a series of repeated words, with one modified word in the middle. For example, you can think of the word "apple" repeated 100 times, with the word "apples", plural, randomly positioned somewhere within that input.

And the questions that we're asking here are: is the model able to count the number of repeated words that appear, and is it able to detect that subtle change within a series of repeated words? I was testing this with the latest models at the time; it was GPT-4.1 and, I think, Claude Sonnet 3.7.
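Generating this kind of input is trivial, which is part of the point. A minimal sketch (word choice and counts are illustrative):

```python
import random

def repeated_words_input(word="apple", modified="apples",
                         n_repeats=100, position=None):
    """Build the repeated-words prompt: `word` repeated n_repeats times,
    with a single `modified` token spliced in at a (by default random)
    position. The model is then asked to replicate the sequence, count
    the words, or spot the one that differs."""
    if position is None:
        position = random.randrange(n_repeats + 1)
    tokens = [word] * n_repeats
    tokens.insert(position, modified)
    return " ".join(tokens), position

prompt, pos = repeated_words_input()
```

A few lines of Python can check any model answer exactly, which makes grading unambiguous.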

This was our original counting task, right?

Yeah, the original counting task. So we were doing that, and even from just a few thousand tokens, these models would already start to fail. They would just start generating up to the max token limit, sometimes just refuse to generate, or we would get random outputs, which we thought was pretty interesting, because it showed that even at these relatively short inputs, compared to the models' context windows, these models still fail.

Yeah, on what is a really simple task, right? You could write up a Python script for it in probably under ten seconds. I wouldn't even use it as a junior software engineer interview question.

Yeah. Um, but the model can't do it.

Exactly, in its head, so to speak. You can imagine even a grade school student could write out a word 100 times on the blackboard. So that was really interesting, and it speaks to the fact that the models are autoregressive in nature, which is also important to remember.

Output tokens can also be input tokens.

Yes. Because the next predicted token is conditional on what's been decoded so far. Right. And I think that when people think of context length, they're only ever thinking about the input.

Yeah. So this repeated words task was our first probe into how these things actually relate.

Um, fine. So we did the word counting task, right? Why didn't we pursue or publish that? It wasn't really in our technical report, right?

Yeah. We expanded on the repeated words task and did multiple variations of it, but initially we didn't include all of our insights from that, because we didn't really vary the position of the modified word much at first. I think I just placed the modified word in the middle, but we wanted to experiment with how varying its position affects the model's performance.

So based on some of our findings from that initial test, we expanded, and then looked to answer questions like: does the position of the modified word affect these models' outputs, and how does the input length also impact performance?

Yeah, it was also kind of difficult to control for the difficulty of the task versus the context length, which is why we settled on our current task, which is literally just: replicate the input. Even that simple task has a lot of variability, which we'll talk about in a minute.

What did we land on, then, as our first investigation of context length specifically? It's sort of our extended needle in a haystack.

Yeah, so needle in a haystack is originally a very simple task, and as you mentioned, it doesn't really test for any deeper processing from the model. But it's a very good task to build upon, right? Because it has this lexical matching property.

Which, again, you can pretty much solve with a regex.
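That claim is literal: because the classic needle shares its wording with the question, ordinary string matching retrieves it. A toy illustration (the needle text here is the well-known example, not from the report):

```python
import re

haystack = (
    "Essay text about startups and determination goes here. "
    "The best thing to do in San Francisco is to eat a sandwich "
    "and sit in Dolores Park on a sunny day. More essay text follows."
)

# The question "What is the best thing to do in San Francisco?"
# overlaps lexically with the needle, so a literal pattern suffices.
match = re.search(r"The best thing to do in San Francisco is[^.]*\.", haystack)
answer = match.group(0)
```

No semantic understanding is involved, which is exactly why the benchmark says so little.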

Yeah, which is nice. It's a nice, flat difficulty. But again, the thing that we want the models to do is semantic processing, right? We don't want them to just match words. We want them to know: hey, when I'm asking you for the profit and loss of companies in this industry, I don't need to tell you to use those exact words to find them in every report.

You, as the model, should be able to work that out. That's how people use these models in practice. You're not going to tell a model, okay, look at these specific lines of the report, look at these specific lines of code. You're just going to give it a very ambiguous query.

So that's something that we tested initially. A lot of these common needle-in-a-haystack tests have lexical matches between the needle and the question. But if we make it slightly more ambiguous, yet still semantically similar, is the model able to identify that?

And there has been some previous work done on this before, called NoLiMa, on semantic matching, but that task often required the model to have some external world knowledge as well. Could you expand on that a little bit? What's an example of that?

Yeah, for example, if I ask the model, what city does Anton live in, and somewhere in the haystack it said, Anton lives next to Dolores Park, the model would have to know that Dolores Park is located in San Francisco, right? This is knowledge that isn't purely testing for semantic matching. It's also testing for whether you have this external world knowledge and are able to synthesize the two together.

Right. So we designed these needles in a way that it only tests for semantic matching.

Yes. In other words only things that are present in the context.

Exactly. Exactly. This is actually really important. And it puts our set of tests and our benchmark in between these two, between needle in a haystack and NoLiMa.

Um, what are the different things we controlled for, and how did we choose them?

Yeah. So we were controlling for a variety of things. One thing we wanted to ensure was that the task was feasible for the models to do. So even if the questions were ambiguous, which we quantified by cosine similarity, and we can talk a bit more about that later, we made sure that the models were able to succeed at this task at short input lengths, and the only thing we would vary is the length of the haystack.

So we would just append more Paul Graham essays, maybe more arXiv papers, things like that. We weren't adding any additional distractors, and we would also query the haystacks with the questions to make sure there were no conflicting facts that the model could potentially answer with.

Um, how do we figure that out? How can you tell whether there are conflicting facts?

Yeah. So I basically chunked the haystack, embedded the chunks, and put them into a vector database.

I would query that vector database with the question and manually inspect the top 10 or 20 results that came up, making sure they didn't conflict with the content of the needle. I think that's very reasonable.
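The conflict check described above is ordinary top-k cosine retrieval. Here is a NumPy stand-in for the query step; in practice the chunks were embedded with a real embedding model and stored in a vector database, so the vectors here are placeholders:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=10):
    """Return indices of the k chunks most cosine-similar to the query,
    most similar first. These are the candidates to inspect manually
    for facts that conflict with the needle."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity of every chunk against the query
    return np.argsort(-sims)[:k].tolist()
```

Manually reading just the top-ranked chunks keeps the check tractable even for very long haystacks.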

A lot of scientific work, again: look at your data. A lot of scientific work involves just really looking at what's falling out, and you are great for going through the slog of doing that.

Um, so we ensured, one, that only data available in the context was needed, and two, that the data necessary to answer the question actually was available in the context.

Right. Exactly.

Okay. And then so we made sure that it worked on short context length. So the thing that we're really trying to control for here is like task complexity.

Exactly. Right. We're trying to make the task we're asking the model to perform even, regardless of context length. And so context length is the only variable we're changing.

It's a little ambiguous, right? Intuitively it kind of feels like the task is more complex. But when I fall back on how I would solve this with just programming, really, the input length to a program doesn't make the program any more complex, right?

And that's how I was like trying to model this in my head, try to get intuition about it, right?

Um, so, cool. We made sure the task works at short context lengths, and then the first thing we wanted to vary is the context length.

Yes.

Right. So le let's talk about that. What what did we see?

Yeah. We did this in two different ways. One thing we experimented with was: how does the haystack actually affect the needle?

One thing we tested was changing the content of the haystack with regard to topic. So we were working with two different topics, Paul Graham essays and arXiv papers, and we would write questions that blended into the Paul Graham essay topic.

Paul Graham often talks about advice on writing, so we wrote a question and needle pair according to that, and then the arXiv papers, the ones we had, tended to talk a lot about information retrieval.

Yeah.

Yeah. So we would write a question and needle about that. The first thing we tested was whether the topic of the question and needle pair, and how they relate to the haystack, actually affects model performance at all. And with that we actually saw some non-uniform performance: on the arXiv haystack, there was a pretty significant difference between when we had the Paul Graham essay questions versus the arXiv needle and questions. So that's one test that we did. Another way in which we modified the haystack was changing the logical flow of the haystack itself.

So typically in the needle-in-a-haystack test, you would just append Paul Graham essays one after the other, so within each essay there's a logical, narrative flow. If you just have a random needle in the middle, that kind of breaks the logical continuity, right? So I thought that would make the needle stand out more.

Yeah, I was surprised by this result too. Intuitively that makes more sense, versus if you just had a very random haystack, where you just have random sentences and none of it makes any sense together.

And if you have a needle within that, it kind of blends in in terms of logical flow. So from that test it was really interesting to see that the needle placed in the more random haystack actually made the models perform better.

Yeah, it's interesting, right? It's only a preliminary result, because it was a relatively small scale experiment, but I think it's really worth investigating, because we do present data to these models in a pretty linear way most of the time. Or, in sort of traditional RAG, to the extent that RAG can be called traditional, having existed for two and a half years, we present results in order of relevancy to the model. And it's kind of interesting, because given chunking, and given the way retrieval works, the model does see those kinds of results, right?

The sort of chunked-up things where the sentences don't necessarily flow one after the other. And I wonder if part of RAG performance might be accounted for by the fact that the information is chunked up like that.

Oh, it's an interesting question. I think it's worth digging further.

Um I think this is like an easy project or experiment for somebody to run.

Um, so, you know, feel free, audience, to pick that one off the shelf and give it a go and see what happens.

Um, so okay, we varied the haystack. We varied a bunch of stuff to do with the needle and question itself, right?

By the way, I really hate how much we're mixing metaphors, but there's no escaping it, right? We've got a needle and a haystack, great, same metaphor, but then there's a question, and it's like, no, everything's breaking down now.

Um, English language. Can't live with it. Can't live without it.

Um, what other things did we vary for the needle, the haystack, the question?

Yeah, we would vary the ambiguity of the needle and question, and we quantified that by cosine similarity.

So essentially we manually wrote eight different needles, and... let's talk about that, actually. I'm really curious about that process.

How did you manually write them? Like what literally was the process of doing that? What did you do?

Initially I wanted to write needles that were topically relevant to the haystack, in order for us to do that test of whether topic matters.

Um, so the first thing I did was chunk the haystacks, and I would embed each of those chunks.

Um, I was just using text-embedding-3-large, and then I used HDBSCAN, and that basically creates clusters.

Okay.

So I sorted the clusters by size, looked at the largest clusters, and picked out the most representative items from each cluster. Just taking a manual look at them, you can get a sense of what kinds of topics are most prevalent and what the writing style of these haystacks is. From that I was able to derive that in the Paul Graham essays, writing is a very common topic.
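The pipeline here is: chunk the haystacks, embed the chunks, cluster with HDBSCAN, then read the biggest clusters. The clustering itself would come from an HDBSCAN implementation; the post-clustering bookkeeping can be sketched like this (HDBSCAN's convention labels noise points as -1):

```python
from collections import Counter, defaultdict

def clusters_by_size(labels):
    """Given per-chunk cluster labels (HDBSCAN convention: -1 = noise),
    return (cluster_id, member_indices) pairs, largest cluster first.
    Skimming the members of the top clusters reveals the dominant
    topics and writing style of the haystack."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # drop noise points
            members[label].append(idx)
    counts = Counter({cid: len(m) for cid, m in members.items()})
    return [(cid, members[cid]) for cid, _ in counts.most_common()]
```

The indices returned for each cluster point back into the original chunk list, so the most representative chunks can be pulled up for manual reading.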

But I guess the question I'm really asking here is: yes, we have this great process to semantically distill the content of the haystack, but you wrote these needles manually. How did you come up with the thing to write?

Like how many did you have to try a bunch of stuff to get the similarities you wanted? How did you do it?

Yeah.

Yeah, I actually had to think about this quite strategically, because it has to be somewhat opinionated. For example, the question that I asked for the Paul Graham haystack was, what was the best writing advice I got from my college classmate? Sort of a unique fact, so the model can't just output random tokens and guess accurately. It has to be very specific and opinionated. And for the arXiv haystack, I asked a question like, what is the best reranker for this specific use case? It has to be highly specific, something the model can only succeed at by looking at the haystack and retrieving the correct needle.

Yeah.

Um, so yeah, I kind of had that in mind. And you also have to write eight different variations of a needle, which have a variation in similarity, right?

Cuz you are in your head predicting the output of the embedding model at this point, which is not easy.

And I think I wrote out maybe 20 different needles and ended up with eight.

Um, but yeah, it was just a bunch of trial and error. Yeah, I wonder, you know, this makes me think about the generative benchmarking work, which was kind of similar, right? You have a bunch of chunked data, which is essentially our haystack, and then our generative benchmarking suite generates appropriate retrieval questions against it. Those are in a retrieval context.

This is in our benchmarking context.

I would love to see what it would look like if we slapped these together and see if we can do it at scale, by making Claude do it instead of Kelly, you know.

Um, okay. So you've successfully written the manual needles.

Um, so far we've varied haystack structure.

Yes.

We've varied needle-haystack similarity.

Yes.

And then we also varied needle-question similarity.

Right.

Yeah.

Let's talk about that a little bit, because now there's this matrix of needle, haystack, and question similarity, right?

There's this 3x3, and we vary parts of all of it.

Yeah. So for needle question similarity, we wanted to like quantify similarity by cosine similarity, but we didn't want to be biased by like one embedding model.

So I used five different embedding models.

So I would basically take each needle-question pair, embed each of them, get the cosine similarity, and then average that across the five embedding models. From that you get a ranking of similarity, and the variation across models was less than 0.1 as well.

So that gave us the signal that this is a pretty consistent ranking of similarity, and that gives us the confidence to say, yes, we quantified ambiguity with cosine similarity.
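The scoring step described above can be sketched as follows. The vectors here are placeholders standing in for the five real embedding models' outputs:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_pair_similarity(needle_vecs, question_vecs):
    """needle_vecs[i] and question_vecs[i] are embeddings of the same
    needle/question pair under embedding model i. Average the per-model
    cosine similarities for a model-agnostic ambiguity score, and report
    the spread as a sanity check that the models agree."""
    sims = [cosine(n, q) for n, q in zip(needle_vecs, question_vecs)]
    return sum(sims) / len(sims), max(sims) - min(sims)
```

A small spread across models, as described above (under 0.1), is what justifies treating the averaged similarity as a stable ranking of ambiguity.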

Yeah. Again, let's talk about results for a minute.

These are relatively small scale tests.

Um, but they do, I think, point pretty significantly to the fact that context is not used evenly in all cases by the models, especially in more realistic settings. For example, the logical flow thing is pretty important, as we already talked about in the retrieval context. And then, what was the impact of needle-haystack similarity in particular?

Yeah, for needle-haystack similarity, we saw consistently that having a more similar needle and haystack pairing would make the task harder.

So the models performed better if, for example, you had a Paul Graham essay haystack and an arXiv-related needle and question. Which points to the fact that, for example: hey, I as the model am reading a bunch of financial reports, and I want to extract a component from one of those reports to perform my task.

harder to find.

Yeah. Right. Than some needle that stands out, lexically for example, or even semantically.

What about, and we've talked about the surprising result of mashing up the haystack structure, what about the needle-question pair similarity?

Yeah, for needle-question pairs we also saw that with lower needle-question similarity, the models performed worse.

I mean, these models do fine at short context lengths, but as the input got longer, they would start to degrade a lot faster compared to if we had a high similarity needle-question pair.

We saw something pretty interesting, right? I think at about was it 10,000 tokens that we saw the inflection, I think.

Yeah.

Yeah. around like 10,000 50,000 like somewhere around that.

Yeah. Somewhere there.

And it's a pretty clear like step change in performance almost, which is really interesting.

And I wonder if that isn't some sort of pre- or post-training artifact, right? Where maybe the conversations that the post-training is feeding to the instruction-following model are like 10,000 tokens long. And so they're like, I don't know what to do with the rest of this.

I don't really care.

Um, which, you know, if you are training against the total distribution of conversations, maybe all the conversations are biased down that end. This is pure... we are now in the raw speculation hour, right? The folks down the street here know a lot more than we do about this part, but I think there's interesting work to be done. What do you interpret the results to mean? Taking the results as they are: yes, the model doesn't perform evenly. What do you think that means, based on your thinking about LLMs?

I mean, I think it has pretty big implications for how people use long context in practice, because in all the scenarios that we tested, as the task got more realistic, these models performed worse. As needle-haystack similarity got higher, as needle-question pairs got more ambiguous, these models, even though they were still dealing with relatively simple tasks, performed worse. So that makes you think: okay, did we see that on reasoning models as well? I think this is important, because we're asking the models to perform these autonomous tasks, and I really wonder how much of the performance impact is literally just due to the uneven use of context, not the task complexity. I wonder, if we could control for this in other benchmarks, would we see a performance jump if we literally just controlled for length?

I'm not sure. For me, again, wild speculation, and if somebody in the audience wants to work on this, I think it would also be a really good project. I think, one, there is probably some pre-training bias that's causing this.

Yes.

And two, I think that regardless of what you're using for your long context, you know, RoPE or some other positional embedding, positional encoding, the attention heads that you have available, especially on the first pass over the context, have a finite representational capacity, and there's a finite number of them.

Mhm.

Uh this this is like raw speculation as well for me.

Someone someone out there is probably thinking right now, Anton, you idiot. You don't know what you're talking about.

But it's interesting to understand the biases that we're training into the attention heads relative to the position of things in the context and and how they influence each other.

And I think that this is work that could be done on an open source model like a Llama.

Um, you should be able to see this impact and effect, especially if you have the compute resources to train it different ways and see what it does.

Yeah.

Yeah. I think this brought about a lot of interesting directions for future research, and for actually understanding how these models process input.

Um, and I think with these long inputs you're able to reveal more performance patterns, or specific model behavior patterns, that you're not really able to see with short inputs, typically.

Yeah, I think it is important. You mentioned it has practical implications. What do you think people should do about this?

I think it points to the fact that obviously context engineering is very important. So even if you give your models all the relevant context, if you also give them some irrelevant context and maybe some distractors, your models are probably going to be less reliable.

Oh, we haven't talked about distractors yet.

Oh yeah, we should talk about the distractors.

Tell me about distractors.

So what was what was the idea of like putting them in what are they?

How are they distinct from the haystack?

Yes.

Yes.

So I want to differentiate between like distractors and irrelevant content.

So, for example, if you're working with the Paul Graham essay example again, the actual essays would be considered irrelevant content, because they're not really relevant to answering the question.

I mean, answering the question conditional on that similarity as well, right?

Yeah.

And a distractor would be something where maybe the phrasing is similar to the needle, or maybe it's just more semantically similar compared to the rest of the haystack content.

So, for example, if you had a question like, what was the best writing advice I got from my college classmate, the needle is: the best writing advice I got was to write every week.

A distractor could be: the best writing advice I got from my college professor was to write every day.

So it's just a slight word change that the model might think is more relevant, but it doesn't actually answer the question, right?

Yes.

So, yeah, we wanted to test for that. These are almost adversarial elements of the haystack: deliberately similar, but not actually factually supporting.

Yeah.

Yeah.

Yeah.

And distractors are very common in real scenarios as well. If you have a coding agent, for example, you're likely to be working with very similar parts of your code, function signatures, things that are very, very similar.

So I think that's one motivation behind testing the impact of distractors: it's not likely that you're going to have a perfectly relevant query and relevant portion of the text, with everything else irrelevant.

You're going to have some you know like similar like relevant sections.

So for distractors, we also manually wrote around four distractors.

Um, and for these we didn't do a robust test into distractor-needle or distractor-question similarity. It was pretty binary: does this impact, conditional on context? Does it get worse with longer context, even though the number of distractors is the same?

Yes.

Yes.

That's that was really interesting. I mean we did we saw the same impact right?

Yeah.

Yeah.

The longer things got... the task isn't more complex, there's the same number of distractors regardless of the context, yet the performance falls off faster.

Exactly.

That was really interesting to see. With one distractor, the models would start to degrade compared to if you had no distractors, and with four it was even more severe. But all the models, or most of the models, were capable of disambiguating at short input lengths.

So I think that's the important distinction: the only thing we modified was literally just input length.

The task complexity was pretty constant.

So right.

Okay. Time for wild speculation.

Speculation. Is this a problem that's going to be solved at the model level, in your opinion?

Like ever?

I don't know. I'm not going to say it's impossible. I think our results are just reflective of the current state of these models.

We're not necessarily saying that it's impossible for these models to ever handle long context.

Yeah.

I think you need some more fundamental architectural changes, and maybe we first need to understand how these models behave at the internal processing level and then go about it that way. But I wouldn't want to say it's impossible. I think there's some room for improvement.

I'm a little more bullish than you.

I think we can solve this in the current architecture, and look, a lot of problems in LLMs can be solved by throwing more compute at them.

Stargate just got allocated four and a half more gigawatts of electricity, which they can hopefully turn into compute.

If the causes of these context biases are related to the statistics of the training runs, this can be mitigated.

Right.

Right.

And again, the thing that can help mitigate it is just spending more compute. But you have to weigh that against what we really want these models to do.

Like what is the economic function of an LLM?

And this will sort of segue into our final discussion about our results, but LLMs are this weird computing system, right? They can process semantic information, but we still want to treat them as deterministic in some way, because we want to know that they're going to be able to perform the task.

Yes.

Which is why this inconsistent, and unpredictably inconsistent, use of context is important.

But if doing business process automation is not the most economically valuable use of these models, then maybe this doesn't matter as much. Maybe it's not worth the cost, and we just keep making the models bigger until it doesn't matter.

It's hard to predict, really.

Let's talk briefly about our last result before we open it up to the audience.

So we started off with the repeated words idea just to see what would happen. That didn't really produce enough signal on its own, but we came up with a simplified version of the task, right?

Mhm.

Yeah. So for the repeated words task, I came up with, I think, eight variations of the common word and the unique word that we place.

And we varied that across a bunch of input lengths, and we also varied the position of the unique word. I guess it's not exactly a needle position, but you can call it a needle.

That's pretty much the same thing.

And the task was literally just: replicate the input.

Just replicate this.

Yeah. Literally just replicate the input. Again, if it was a deterministic program, it would be trivial.
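For concreteness, here is a minimal sketch of how such an input might be generated and scored. The word pair and the prompt wording are illustrative, not necessarily the exact ones used in the report.

```python
def make_repeated_words_input(common: str, unique: str,
                              total: int, unique_pos: int) -> str:
    """A sequence of `total` copies of `common`, with a single `unique`
    word substituted at index `unique_pos`."""
    words = [common] * total
    words[unique_pos] = unique
    return " ".join(words)

def exact_match(model_output: str, expected: str) -> bool:
    # For a deterministic program this check would trivially pass;
    # the experiment measures how often models fail it anyway.
    return model_output.strip() == expected.strip()

text = make_repeated_words_input("apple", "apples", total=500, unique_pos=250)
prompt = "Simply replicate the following text:\n\n" + text
```

Sweeping `total` (input length) and `unique_pos` (where the odd word sits) gives the full grid of conditions, with the model's output compared against `text` verbatim.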

By the way, when I say deterministic program, I think people get the wrong idea. People think I mean the computation the LLM runs is nondeterministic, and it can be if you turn the temperature up, but our temperature was zero, right?

Yes. So what I mean by nondeterministic is that you can't predict how the model is going to perform on any particular task before you actually run it. And I think this is actually a big problem for economic adoption of LLMs. We want them to be deterministic; we want to be able to predict, hey, it's going to be able to do this task, I don't have uncertainty about that.

Anyway, we had some really weird outputs and results for this task in particular, which I didn't expect. Replicating your input should be trivial, right?

Yeah.

Yeah.

So I saw some specific model behavior patterns which I thought were interesting. With Claude, now we get to do model psychology.

So it's great. With Claude Opus 4, it constantly thought it was generating copyrighted material.

That's interesting, right?

Yeah.

Yeah.

I just told it to replicate. Why that specific refusal, you know?

Yeah.

And sometimes it would say, "Oh, I noticed this series of repeated words with one modified word. Because of this inconsistency, I'm not able to complete this task," and it just straight up refuses.

Yeah.

But you didn't tell it to worry about that, right? You only said to replicate the input.

Weird refusals, right?

Yeah.

This is really weird.

And then GPT-4.1 as well would just say, "Oh, I'm sorry, I can't complete this task."

With no reason given. Just a straight up refusal.

Did any of the models like write scripts instead?

Actually, no.

That was kind of surprising too, right? You would think that GPT would just write a little script and print this to its output.

That surprised me, that none of them did. I thought at least at longer lengths they would figure out to write a Python script.

Yeah, it's a limitation on tool use, on even the most basic task.

This is almost FizzBuzz.

Again, this is a task so trivial I would not give it to a junior software engineer.

Exactly.

And then Gemini was really interesting as well. Across all their models, 2.5, 2.5 Flash, and 2.0 Flash too, all of them start generating just completely random outputs. That's interesting too, right? Is it beyond some token length that they start?

No, literally from the beginning.

Wow.

See, this is what I wanted to see: would it do that?

Um right from the beginning.

Yes.

What kind of random? Was it just gibberish?

Was it text?

Sometimes it would be kind of related to the words.

So one time I gave it a series of repeated "golden"s.

So the input just said "golden golden golden" on and on, and then it would start talking about the golden rule or something else related to "golden" and just go off on that. Pushing it into that concept space, right?

But then other times, I don't even know what it was outputting.

It would just output a bunch of dashes.

See, it's interesting. Weird.

It's kind of like a new jailbreak, but only if you want the model to be insane, right?

Maybe the Borgs can get on that.

We'll see.

These results are really, really cool.

From such a simple benchmark, we generated really interesting model behaviors.

Part of our research program at Chroma is to deliver results that you can use, right?

So the application developer should know what to do with this.

Yes.

With this work in particular, I think where we've arrived is a starting point for a bunch of other directions. And I think it actually is important to see how the models improve over time, as the relevant portion of a larger system. Context engineering is the right word, and we need more information to do effective context engineering. So it's really great work, I think it's important, and I hope the community picks it up.

They already kind of have. But we've learned a lot here, and now it's time to think about how Chroma the product and Chroma the company can actually help our application developers succeed here.

Yeah.

All right. Shall we open it up to the audience?

We are doing that now.

All right.

Here we go.

All right. I am going to pull this over here and I'm going to fire them at you.

All right. Maxim asks: if I understood the research correctly, there were a lot of examples about finding a needle in a haystack. This is very relatable for asking an agent something. But is context rot also relevant when an agent actually does something, like writing code or editing files?

Meaning, does it produce code of lower quality when the context window is filled to, let's say, 50 or 60%?

I would suggest that it does, right?

I would suggest that our results strongly suggest that.

Yeah.

Yeah.

Although our tests weren't directly testing these realistic use cases, because it's hard to design, say, a coding test that keeps task complexity the same.

Which is the important point here, right? It's very hard to keep a task like coding constant, but you could probably keep a coding benchmark constant and populate the context window with additional irrelevant material to fix the length, and then see how it does. That could be interesting to test out too. But to answer the question: I think that is implied, because these models start to degrade even on very simple tasks, like an ambiguous retrieval task.
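The experiment proposed here could be sketched roughly like this. The task and filler text are hypothetical; the idea is just to hold the task fixed and pad the context to a target length, so only input length varies between runs.

```python
import itertools

def pad_to_length(task_prompt: str, filler_sentences: list[str],
                  target_words: int) -> str:
    """Hold the task constant and pad the context with irrelevant text
    until the total is roughly `target_words` words long."""
    total = len(task_prompt.split())
    padding = []
    for s in itertools.cycle(filler_sentences):
        if total >= target_words:
            break
        padding.append(s)
        total += len(s.split())
    return "\n".join(padding + [task_prompt])

# Hypothetical coding task; any fixed-complexity task would do.
task = "Rename the function `parse_row` to `parse_record` everywhere it appears."
filler = ["The cafeteria menu rotates on a two-week schedule.",
          "Parking passes are renewed at the start of each quarter."]

# The same task, scored at several context lengths.
variants = {n: pad_to_length(task, filler, n) for n in (200, 2_000, 20_000)}
```

Comparing pass rates across the variants would isolate the length effect while keeping coding difficulty constant.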

I'd expect that if you give the model a bunch of code files and tell it to do a task and generate code, that's obviously significantly more complex and requires more processing.

See, this is where the replication result, the literally-just-replicate-your-input result, kind of gets me. Because a lot of rewriting code is literally just outputting the same thing you just had, with a minor modification.

Yes.

Yes.

And again, this is an artifact of the way the model is autoregressive.

Something in the attention over the output is causing this to happen.

And I wonder about this replication result, because so much of coding is changing one or two lines but outputting the rest as it was.

I wonder how much of the performance degradation is just from its inability to replicate its output.

Yeah.

Its input I should say.

So I guess the answer to Maxim is: we think so.

It's worth going into more depth on how much the context changes this.

It's kind of hard to disambiguate from the coding task itself, which is hard.

I also want to point out that, based on our results, the impact probably isn't directly proportional if you apply it to a coding task, because the model has probably seen a bunch of coding scenarios during its training, and it hasn't really seen replicating a series of repeated words.

We chose very specific, sometimes synthetic, use cases.

So you can't say, oh, because these models are so bad at such a simple task, they've got to do significantly worse on a coding task, because they're not really trained on this repeated words task. It's not AGI-complete, but if a model did perform well on this task, I'd be more bullish.

So, Philip Thomas is asking: if he has a long context, should he put important information, like the task, at the beginning or the end, or both?

What do you think our results suggest?

We had different results for this actually.

For the repeated words experiment, we saw that if we placed the modified word at the beginning of the context, the model was better able to identify it.

But for other tasks, like needle in a haystack, we also varied the needle position.

We placed it in 11 different positions throughout the haystack, but that showed no real variation in performance; it was pretty consistent.

So only the length of the input really matters, right? Not so much the needle position.

We saw no significant pattern from that.

And we did this across many, many different input lengths and positions.
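The sweep described here might be set up along these lines. The specific lengths and the evenly-spaced placement are illustrative assumptions, not the exact values from the report.

```python
def needle_positions(haystack_words: int, n_positions: int = 11) -> list[int]:
    """Evenly spaced insertion points, from the very start of the
    haystack to the very end."""
    step = haystack_words / (n_positions - 1)
    return [round(i * step) for i in range(n_positions)]

# One experimental cell per (input length, needle position) pair.
grid = [(length, pos)
        for length in (500, 2_000, 8_000, 32_000)
        for pos in needle_positions(length)]
```

Running the same needle question at every cell of the grid is what lets you separate the length effect (strong, per the results) from the position effect (negligible, per the results).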

Yeah, we saw very little variability with position, which kind of goes against at least Anthropic's suggestions for where to put important information in the prompt, right?

For example, in their RAG prompting guide, they suggest putting retrieved results early, right?

But we didn't see that variability.

It would be interesting to compare notes with that team and see how they arrived at that conclusion, right?

Lance asks about the recent Manus context engineering blog post, which I haven't read.

Have you read this?

No.

Okay.

I haven't read this.

It talks about prompt caching as a nice way to manage cost and latency, but it seems this does not get around any of the issues with respect to context rot if the context is large.

Did you do any testing with prompt caching in particular?

We did not.

We didn't do anything with prompt caching in particular.

I don't imagine it would make a difference.

Yeah.

My understanding of how prompt caching works is that you basically fix the inputs for part of the context window and then process the rest.

And I think our results suggest we did pretty much the same thing, right?

We fixed a prefix and then added length, and added length, and added length.

So it's almost like prompt caching at various lengths, and we didn't really see much variation there at all.

I don't think it helps or hinders is my gut on this.

I haven't read the post.

But maybe we want to, and then give some feedback there.

See what we think.

But I also wouldn't really expect much variation in results.

All right.

All right. The the final question, the question everybody has been asking.

Are you an AI?

Oh my gosh, guys.

Yes, I'm real.

Look, I promise that if and when Chroma develops an autonomous AI research agent, we will disclose it to the relevant parties.

All right.

Um, any other questions from the audience?

All right, looks like we're set.

Well, thank you, Kelly.

Uh, great conversation.

I hope that was enlightening.

The thing I would like our audience to take away from this is that there's a bunch of extra work here that we would love to see people do.

Um, we're happy to collaborate on a bunch of this stuff.

A lot of it is fairly lightweight.

You can sort of pick it up and run with it and get an interesting or significant result like this one.

So, please do, you know, tackle these things.

Um, thanks for attending.

Thanks again, Kelly.

Thank you so much, Anton.

Bye, folks.

Bye.