While I like the idea of measuring subsequent steps, this kind of approach to using embeddings is the reason why I wrote "Don't use cosine similarity carelessly" (https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity).
In this case, a cosine similarity of one would occur only when a step repeats word-for-word. That is not even a "similar thought" but some sort of LLM OCD.
For anything else... cosine similarity says little. Sometimes two steps reach opposite conclusions yet have very high cosine similarity. In other cases, a step just expands on the same solution using different vocabulary or looking at it from another angle.
A more robust approach would be to give the whole reasoning to an LLM and ask it to grade according to a given criterion (e.g. "grade insight in each step, from 1 to 5").
> A more robust approach would be to give the whole reasoning to an LLM and ask it to grade according to a given criterion
We actually use a variant of this approach in our reasoning prompts. We use structured output to force the LLM to think for 15 steps, and in each step we force it to generate a self-assessed score and then decide whether it wants to CONTINUE, ADJUST, or BACKTRACK:
- Evaluate quality with reward scores (0.0-1.0)
- Guide next steps based on rewards:
  • 0.8+ → CONTINUE current approach
  • 0.5-0.7 → ADJUST with minor changes
  • Below 0.5 → BACKTRACK and try different approach
I go into a bit more depth about it here, with an explicit example of its thinking at the end: https://bits.logic.inc/p/the-eagles-will-win-super-bowl-lix
Every time I see these kinds of prompts that ask an LLM for a numeric ranking, I'm very skeptical that the numbers really mean anything to the model. How does it know what a 0.5 is supposed to be? With humans, you'd have them grade things and then correct the grades so they learn what it is from experience. But unless you specifically fine-tune your LLM, this wouldn't apply.
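For concreteness, the kind of structured output being debated here might look something like this sketch (the Pydantic framing, field names, and thresholds are illustrative assumptions, not the commenter's actual implementation):

```python
from typing import List, Literal
from pydantic import BaseModel, Field  # pydantic v2

class ReasoningStep(BaseModel):
    thought: str                                     # free-form reasoning for this step
    reward: float = Field(ge=0.0, le=1.0)            # self-assessed quality score
    decision: Literal["CONTINUE", "ADJUST", "BACKTRACK"]

class ReasoningTrace(BaseModel):
    steps: List[ReasoningStep] = Field(min_length=15, max_length=15)  # force exactly 15 steps
    final_answer: str

def next_action(reward: float) -> str:
    """Map a self-assessed reward to an action, mirroring the thresholds above."""
    if reward >= 0.8:
        return "CONTINUE"
    if reward >= 0.5:
        return "ADJUST"
    return "BACKTRACK"
```

The schema would then be handed to whichever structured-output or constrained-decoding API the model provider exposes.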
I went through this with gemini-1.5, using it to evaluate responses. Almost everything was graded 8-9/10. To get useful results I did the following:
1. Created a long few-shot prompt with many examples of human-graded results.
2. Prompted it to write its review before its assessment.
3. Prompted it to include example quotes justifying its assessment.
4. Finally, produce a numeric score.
With gemini-2 I've been able to get similar results without the few-shot prompts, simply by prompting it not to be a sycophant, explaining why it was important to give realistic, even harsh scores, and saying that I expected most scores to be low in order for the high-scoring content to stand out.
In a recent test, I changed to word scores: low, medium, high, and very high. Out of about 500 examples, none scored very high. I thought that was pretty cool, as when I do find one scoring high it will stand out and hopefully justify its score.
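A sketch of the prompt shape described above - review first, supporting quotes, then a coarse word score; the rubric wording and the `call_llm` helper are placeholders, not the commenter's actual prompt:

```python
JUDGE_PROMPT = """You are a strict reviewer. Do not be a sycophant: most items
should score low or medium, so that genuinely strong items stand out.

Grade the RESPONSE against this criterion: {criterion}

1. Write a short critical review.
2. Quote specific passages that justify your assessment.
3. Only then output one final line: SCORE: low | medium | high | very high

RESPONSE:
{response}
"""

def grade(response: str, criterion: str, call_llm) -> str:
    """Return the word score; `call_llm` is any text-in/text-out client."""
    out = call_llm(JUDGE_PROMPT.format(criterion=criterion, response=response))
    return out.rsplit("SCORE:", 1)[-1].strip().lower()
```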
Yes, you are right.
If we ask an LLM to grade something, we must create a prompt with good instructions. Otherwise, we will have no idea what 0.5 means or whether it is given consistently.
(A rule of thumb: Is it likely that various people, not knowing the context of a given task, will give the same grade?)
The most robust approach is to ask it to rank things within a task. That is, "given these blog post titles, grade them according to (criteria)" rather than asking about each title separately.
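A sketch of that within-task framing: hand the model all candidates at once and ask for an ordering, rather than an isolated grade per item (prompt wording and `call_llm` are illustrative):

```python
RANK_PROMPT = """Here are {n} blog post titles. Rank them from best to worst
according to this criterion: {criterion}.
Output one line per title in the form: rank. title

{titles}
"""

def rank_titles(titles, criterion, call_llm):
    # Number the candidates so the model can refer to them unambiguously.
    numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(titles))
    return call_llm(RANK_PROMPT.format(n=len(titles), criterion=criterion, titles=numbered))
```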
Well, you are certainly correct about how cosine sim would apply to the text embeddings, but I disagree about how useful that application is to our understanding of the model.
> In this case, a cosine similarity of one would occur only when a step repeats word-for-word. That is not even a "similar thought" but some sort of LLM OCD.
Observing that would be helpful in our understanding of the model!
> For anything else... cosine similarity says little. Sometimes two steps reach opposite conclusions yet have very high cosine similarity. In other cases, a step just expands on the same solution using different vocabulary or looking at it from another angle.
Yes, that would be good to observe also! But here I think you undervalue the specificity of the OAI embeddings model, which has 3072 dimensions. That's quite a lot of information being captured.
> A more robust approach would be to give the whole reasoning to an LLM and ask it to grade according to a given criterion (e.g. "grade insight in each step, from 1 to 5").
Totally disagree here: using embeddings is much more reliable/robust. I wouldn't put much stock in LLM output; there's too much going on.
A simple example of a problem my team ran across:
The distance between "dairy creamer" and "non-dairy creamer" is too small, so an embedding for one will rank high for the other as well, even though they mean precisely opposite things. For example, the embedding for "dairy-free creamer" will result in a low distance from both concepts, such that you cannot really apply a reasonable threshold.
But in a larger frame, of "things tightly associated with coffee", they mean something extremely close. Whether these things are opposite from each other, or virtually identical, is a function of your point of view; or, in this context, the generally-meaningful level of discourse.
At scale, I expect having dairy vs non-dairy distance be very small is the more accurate representation of intent.
Of course, I also expect them to be very close, and that's the problem with relying purely on embeddings and distance: in this case, the two phrases express entirely opposite preferences on the same topic.
(I think this may be why we sometimes see AI-generated search overviews give certain kinds of really bad answers: the underlying embedding search is returning "semantically similar" results.)
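The creamer example above is easy to reproduce; a sketch using an open sentence-transformers model (the exact numbers will differ from whatever embedding model the team actually used):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # placeholder embedding model
texts = ["dairy creamer", "non-dairy creamer", "dairy free creamer"]
emb = model.encode(texts, normalize_embeddings=True)   # unit vectors, so dot product = cosine

for i, j in [(0, 1), (2, 0), (2, 1)]:
    print(f"{texts[i]!r} vs {texts[j]!r}: {float(np.dot(emb[i], emb[j])):.3f}")
# All three cosine similarities come out high and close together, so no single
# distance threshold cleanly separates "opposite meaning" from "same meaning".
```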
> Totally disagree here: using embeddings is much more reliable/robust. I wouldn't put much stock in LLM output; there's too much going on.
I think either can be the preferable option, depending on how well the embedding space represents the text - and that is mostly dependent on the specific use case and model combination.
So if the embedding space does not correctly capture the required nuance, it's often viable to get the top_n results by embedding and do the rest with LLM + validation calls.
But I do agree with you: I would always rather work with embeddings than with some LLM output. It would be such a great thing to have a rock-solid embedding space where one would not even consider looking at token-predictor models.
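A sketch of that two-stage pattern: cheap embedding retrieval for the top_n candidates, then LLM validation where the embedding space is too coarse (`embed` and `call_llm` are placeholder callables):

```python
import numpy as np

def retrieve_then_validate(query, docs, embed, call_llm, top_n=10):
    q = np.asarray(embed(query))
    D = np.asarray([embed(d) for d in docs])
    sims = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-9)
    shortlist = [docs[i] for i in np.argsort(-sims)[:top_n]]      # cheap embedding stage
    kept = []
    for d in shortlist:                                           # precise LLM validation stage
        verdict = call_llm(
            f"Does this document actually satisfy the query?\n"
            f"Query: {query}\nDocument: {d}\nAnswer yes or no."
        )
        if verdict.strip().lower().startswith("y"):
            kept.append(d)
    return kept
```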
The relationship between the model's internal representations in its latent space and the embedding of the CoT compressed with a text embedding model is, more or less, minimal. Then we take this and map it to a 2D space, which captures more or less nothing of the original dimensionality and meaning. That's basically plotting random points.
Potentially it's useful to understand a model "on its own terms" via its observable outputs.
>The relationship between the model's internal representations in its latent space and the embedding of the CoT compressed with a text embedding model is, more or less, minimal.
This may or may not be correct but one way to find out is by taking a look!
It would be much more interesting to see PCA (or t-SNE or whatever) on the internal representation within the model itself. As in the activations of a certain number of layers or neurons, as they change from token to token.
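With an open-weights model, that is straightforward to try; a minimal sketch (model name, input text, and layer choice are arbitrary stand-ins):

```python
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in; swap in the reasoning model you care about
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

text = "First, assume x = 3 ... wait, that contradicts the constraint. Try x = 5."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

layer = -1                                        # pick any layer to inspect
acts = out.hidden_states[layer][0].numpy()        # (seq_len, hidden_dim) activations
coords = PCA(n_components=2).fit_transform(acts)  # one 2D point per token
print(coords)                                     # watch the trajectory drift token by token
```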
I don't think the OpenAI embeddings are necessarily an appropriate "map" of the model's internal thoughts. I suppose that raises another question: Do LLMs "think" in language? Or do they think in a more abstract space, then translate it into language later? My money is on the latter.
> I suppose that raises another question: Do LLMs "think" in language? Or do they think in a more abstract space, then translate it into language later? My money is on the latter.
The processing happens in latent space and then is converted to tokens/token space. There is research into reasoning models which can spend extra compute in latent space instead of in token space: https://arxiv.org/abs/2412.06769
Another take on a similar idea from FAIR is a Large Concept Model: https://arxiv.org/pdf/2412.08821
Sort of, yes. I think one of our next unlocks will be some kind of system which predicts at multiple levels in latent space. Something like predicting the paragraph, then the sentences in the paragraph, then the words in the sentences, where the higher level "decisions" are a constraint that guides the lower level generation.
In Meta's byte-level model, they made the tokens variable length based on how predictable the bytes were to a smaller model, allocating compute resources based on entropy.
I'd have to guess that the "transformations" being made to the embeddings at each layer are basically/mostly just adding (tagging with) incremental levels of additional grammatical/semantic information that has been gleaned by the hierarchical pattern matching that is taking place.
At the end of the day our own "thinking" has to be a purely mechanical process, and one also based around pattern recognition and prediction, but "thinking" seems a bit of a loaded term to apply to LLMs given the differences in "cognitive architecture", and smacks a bit of anthropomorphism.
Reasoning (search-like chained predictions) is more of an algorithmic process, but it seems that the "reactive" pass-thru predictions of the base LLM are more clearly viewed just as pattern recognition and extrapolation/prediction.
Prove me wrong!
> Prove me wrong!
For future reference, it is hard to parse tone over the internet but this "command" read pretty poorly to me. I would have preferred if you asked a question or something else.
However, assuming best intentions...
> I'd have to guess that the "transformations" being made to the embeddings at each layer are basically/mostly just adding (tagging with) incremental levels of additional grammatical/semantic information that has been gleaned by the hierarchical pattern matching that is taking place.
> Reasoning (search-like chained predictions) is more of an algorithmic process, but it seems that the "reactive" pass-thru predictions of the base LLM are more clearly viewed just as pattern recognition and extrapolation/prediction.
I'm having trouble following. Are you saying that:
* The "reactive pass-thru predictions" are just pattern matched responses from the training text that come from "incremental levels of additional semantic information"
* There is some other algorithmic process which results in "search-like chained predictions" from the pattern matched responses
* These two capabilities, combined in a single "thing," are not analogous to thinking
?
> At the end of the day our own "thinking" has to be a purely mechanical process, and one also based around pattern recognition and prediction, but "thinking" seems a bit of a loaded term to apply to LLMs given the differences in "cognitive architecture", and smacks a bit of anthropomorphism.
You can pick whatever term you like. What we seem to have is a system which can, through the embedded patterns of language, create a recursive search through a problem space and try and solve it by exploring plausible answers. If my dog came up with a hypothesis based on patterns it had previously observed, considered that hypothesis, discarded it, and then came up with a new hypothesis, I'd say it was thinking.
There are clear gaps between where we are and human capabilities especially as it relates to memory, in-context learning, and maintaining coherence over many iterations (well, some humans), but (to me) one of two things is probably true:
1. Models are doing something analogous to thinking that we don't understand.
2. Thinking is just a predict-act-evaluate loop with pattern matching to generate plausible predictions.
I lean towards the second. That's not to ignore the complexity of the human brain, it is just that the core process seems quite clear in the abstract to me via both observation and introspection. What can "thinking" (as you define it) do that is beyond these capabilities?
My mistake, actually. I was trying to recall "Change my mind!", and ended up with that instead. It was meant as a tongue-in-cheek challenge - I'd be more than happy to hear evidence of why I'm wrong and that there's a more abstract latent space being used in these models, not just something more akin to an elaborated parse tree.
To clarify:
1) I'm guessing that there really isn't a highly abstract latent space being represented by transformer embeddings, and it's really more along the lines of the input token embeddings just getting iteratively augmented/tagged ("transformed") with additional grammatical and semantic information as they pass through each layer. I'm aware that there are some superposed representations, per Anthropic's interpretability research, but it seems this doesn't need to be anything more than being tagged with multiple alternate semantic/predictive pattern identifiers.
2) I'd reserve the label "thinking" for what's being called reasoning/planning in these models, which I'd characterize as multi-step what-if prediction, with verification and backtracking where needed. Effectively a tree search of sorts (different branches of reasoning being explored), even if implemented in O1/R1 "linear" fashion. I agree that this is effectively close to what we're doing too, except of course we're a lot more capable and can explore and learn things during the reasoning process if we reach an impasse.
I am not sure how someone would change your mind beyond Anthropic's excellent interpretability research. It shows clearly that there are features in the model which reflect entities and concepts, across different modalities and languages, which are geometrically near each other. That's about as latent space-y as it gets.
So I'll ask you, what evidence could convince you otherwise?
Good question - I guess if the interpretability folk went looking for this sort of additive/accumulative representation and couldn't find it, that'd be fairly conclusive.
These models are obviously forming their own embedding-space representations for the things they are learning about grammar and semantics, and it seems that latent space-y representations are going to work best for that since closely related things are not going to change the meaning of a sentence as much as things less closely related.
But ... that's not to say that each embedding as a whole is not accumulative - it's just suggesting they could be accumulations of latent space-y things (latent sub-spaces). It's a bit odd if Anthropic haven't directly addressed this, but if they have I'm not aware of it.
Text embeddings are underused WRT model understanding IMO. "Interpretability" focuses on more complex tools but perhaps misses some of the basics - shouldn't we have some sort of visual understanding of model thinking?
Pretty cool. Funnily enough I made something similar this weekend that converts CoT to Graphs/Trees of Thoughts with an LLM. It kind of allows to see when the LRM/LLM changes direction or when it finds a path that it wants to follow.
Link: https://github.com/vale95ntino/cot2tot
Very cool!
Fun experiment but in the back of my mind I suspect this is just plotting a random walk.
I was, personally, hoping to see a sort of "spiraling down" towards the answer or pathfinding like IDA*[0], but I suppose what we're looking at isn't too dissimilar from A* or Dijkstra's if you squint.
I suspect you recognize dimensionality reduction, but to reiterate for my own understanding: t-Distributed Stochastic Neighbor Embedding (t-SNE) is one method among a few other (more popular?) ones like Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP).
Is t-SNE the most appropriate technique for modeling the terrain under a multidimensional "walk"? Possibly a linear technique (PCA, LDA, SVD?) would do better, or PaCMAP[1], which "dynamically employs a particular set of mid-near pairs to capture the global structure and then improve the local structure" (Qattous H, 2023).
0. https://qiao.github.io/PathFinding.js/visual/
1. https://pmc.ncbi.nlm.nih.gov/articles/PMC10756978/
Edit: for reference, the tensor projector: https://projector.tensorflow.org/
Same - it is unclear what you can get out of this work.
Perhaps assign each coordinate to musical notation and you can get some Schoenberg-esque compositions.
Random walk is definitely possible. Also possible that we're observing some "search" in the embedding space from an initial point. It's hard to tell because the chains are often similar lengths, so I don't think it really terminates early. It might be interesting to find the closest CoT component to the final answer and see how step distance inflects at that point
Is this using t-SNE? Or something else? I have a feeling similarity is not well defined in whatever space this is using. t-SNE is famously unsuited to showing how "close" two points are; it is for clustering.
The 2D plot is t-SNE; the consecutive distance comparison is cosine distance normalized across the chain of thought.
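In code, that pipeline looks roughly like this sketch (the `embed` callable stands in for the OpenAI embedding model used in the post; chains are assumed to have at least a handful of steps):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_distances

def cot_trace(steps, embed):
    X = np.asarray([embed(s) for s in steps])               # one vector per CoT step
    pts = TSNE(n_components=2,
               perplexity=min(5, len(steps) - 1)).fit_transform(X)   # 2D plot coordinates
    d = np.array([cosine_distances(X[i:i + 1], X[i + 1:i + 2])[0, 0]
                  for i in range(len(steps) - 1)])           # consecutive step distances
    d = (d - d.min()) / (d.max() - d.min() + 1e-9)           # normalize across the chain
    return pts, d
```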
Hmm... a slightly different approach I've seen in the past is that instead of embedding each step separately, you concatenate them together. I wonder if that would make the path more linear/understandable?
Distances in t-SNE/UMAP don't mean anything. They are clustering algorithms
2D plot is tSNE, consecutive distance comparison is cosine sim distance normalized across the chain of thought
Bear in mind that "any two high-dimensional vectors are almost always orthogonal".
Is this better rephrased as “any two vectors in a high-dimensional space are almost always functionally orthogonal”?
I have mostly a layperson's understanding of this idea, but I would assume that it would be false to say that they are typically _entirely_ orthogonal?
Yes, one more precise way to phrase this is that the expected value of the dot product between two random vectors chosen from a vector space tends towards 0 as the dimension tends to infinity (I think the scaling is 1/sqrt(dimension)). But the probability of drawing two truly orthogonal vectors at random (over the reals) is zero - the dot product will be very small but nonzero.
That said, for sparse high dimensional datasets, which aren't proper vector spaces, the probability of being truly orthogonal can be quite high - e.g. if half your vectors have totally disjoint support from the other half then the probability is at least 50-50.
Note that ML/LLM practitioners use "approximate orthogonality" anyway.
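A quick numerical sanity check of the scaling claim above, as a sketch: the typical |cosine| between random Gaussian vectors shrinks roughly like 1/sqrt(d), while an exactly zero dot product essentially never occurs for dense real-valued vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (8, 64, 512, 4096):
    a = rng.standard_normal((2000, d))
    b = rng.standard_normal((2000, d))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    exact_zeros = int(np.sum(cos == 0.0))   # "truly orthogonal" pairs
    print(f"d={d:5d}  mean |cos|={np.mean(np.abs(cos)):.4f}  "
          f"1/sqrt(d)={1 / np.sqrt(d):.4f}  exactly orthogonal pairs: {exact_zeros}")
```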
https://softwaredoug.com/blog/2022/12/26/surpries-at-hi-dime... it's both much more likely to be actually orthogonal and almost always very close to orthogonal.
That link doesn't contradict the person you're replying to. Actual orthogonality still has a probability of zero, just as the equator of a sphere has zero surface area, because it's a one-dimensional line (even if it is in some sense "bigger" than the Arctic circle).
If you're picking a random point on the (idealized) Earth, the probability of it being exactly on the equator is zero, unless you're willing to add some tolerance for "close enough" in order to give the line some width. Whether that tolerance is +/- one degree of arc, or one mile, or one inch, or one angstrom, you're technically including vectors that aren't perfectly orthogonal to the pole as "successes". That idea does generalize into higher dimensions; the only part that doesn't is the shape of the rest of the sphere (the spinning-top image is actually quite handy).
The visualization is useless. If the 2D embeddings were any good, they might be useful to R1's developers, but still not to end users. What am I supposed to do with it?
No need to do anything in particular! Perhaps interesting to observe
So the trick is to pick the dimensions that are relevant and discard the rest when calculating the distance.
Alternatively, in a high-dimension space, everyone sits in their own corner.
Whenever I hear "chain of thought" I feel like thoughts aren't really linear, it's probably more like "graph of thought". I wonder how models might be able to benefit from structuring their "thinking" this way in a multi-threaded environment
Interestingly, the dots moving about the 2D space resemble eye movements humans involuntarily make when thinking hard and creatively. E.g. looking at random spots (often slightly upwards) and moving to a different place every few seconds or so. Like a kind of synkinesis between brain and eye.
It seems kinda silly to use a separate service to generate embeddings for t-SNE when you have the embeddings in the model already.
Is it generating embeddings or just coordinates? What would be a better way?
What are embeddings if not "just coordinates"?
Well ... we have to reduce them to a 2D plane to visualize them ...
That just makes them higher order coordinates, no?
Higher order, yes, but as these coordinates certainly contain less information, it's possible they contain only noise.
Something needs to generate the document embeddings since the LLM itself won't
No, this is completely wrong. You can get embeddings from the LLM itself, e.g the last layer.
Doesn't the last layer output a variable-size vector based on seq length? It'd take a bit of hacking to get it to be a semantic vector.
Additionally, that vector is trained to predict next token as opposed to semantic similarity. I'd assume models trained specifically towards semantic similarity would outperform (I have not bothered comparing both in the past - MMTEB seems to imply so)
At that point - it seems quite reasonable to just pass the sentence into an embedding model.
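For reference, the "bit of hacking" would look roughly like this sketch: mean-pool the last hidden state over non-padding tokens to get one fixed-size vector per text (model name is a stand-in; as noted above, a dedicated embedding model will usually do better):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"                       # stand-in decoder-only model
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token       # gpt2 has no pad token by default
model = AutoModel.from_pretrained(name)

def embed(texts):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (batch, seq, dim), one vector per token
    mask = batch["attention_mask"].unsqueeze(-1).float()  # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, dim) fixed-size vectors
```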
I have the same concern as other commenters (using a separate embedding model, utility of cosine similarity, etc.)
BUT this could “seed” a really neat loading graphic for reasoning models, beyond seeing the thinking steps.
Ha, if steps have consistent distances you could take the average distance at step X and generate a step of that length in some direction and be ~approximately correct regardless of the actual value
I wonder if you could turn it into a graph by adding connections between the two most similar entries until everything is connected.
I tried this: https://github.com/KTibow/thoughtgraph
Lol @ R1's description of my Github ... thanks Deepseek, very cool!
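For what it's worth, the idea a few comments up amounts to building a maximum-similarity spanning tree over the steps; a sketch with networkx (this is not the code in the linked repo):

```python
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

def thought_graph(step_embeddings):
    """Connect the most similar steps first, skipping cycle-only edges, until connected."""
    sim = cosine_similarity(step_embeddings)
    n = len(step_embeddings)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    candidates = sorted(((sim[i, j], i, j) for i in range(n) for j in range(i + 1, n)),
                        reverse=True)                  # most similar pairs first
    for s, i, j in candidates:
        if nx.is_connected(g):
            break
        if not nx.has_path(g, i, j):                   # only add edges that join components
            g.add_edge(i, j, weight=float(s))
    return g
```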
does this show anything?
Kind of useless in 2D space. Not sure what they think they are showing here in the visualization.
That data looks pretty close to rand() to me...
Love me some graphs with unlabelled axes...
You're right curiousgal. I filed an issue in response to this comment and will resolve sometime soon: https://github.com/dhealy05/frames_of_mind/issues/1
Why not in a three-dimensional space?
For those of you who were confused about what "R1" is like I was, it seems to be an LLM of some kind https://api-docs.deepseek.com/news/news250120
Not just "an" LLM, but the LLM (or specifically its chain-of-thought variant) that was in the news for days and caused NVIDIA stocks to crash (well, I'd prefer calling it a natural corrective action by the market) and almost a panic in the US because Chinese and so on.
IMO the panic came from a few places:
1) If it's actually much cheaper to build a new model, the position of OpenAI et al. is weaker, because the barrier against new competitors is much lower than expected.
2) If you don't need a billion dollars in GPUs to build a model, NVIDIA maybe isn't worth as much either, because you don't need as much compute power to get to the same output. (IMO their valuation is still insanely high, but that's a different story.)
I don't think the drop was entirely irrational: they showed there were places in the training process where a lot of efficiency could be gained that the existing players didn't seem to be working on.
Fair enough. I'm not interested in this stuff
lol, isn't this just plotting noise?
Whether or not it's "noise" might depend on if you think Chains of Thought are causally relevant? I liked these pieces if you want to read more about CoT / O1:
O1 Technical Primer: https://www.lesswrong.com/posts/byNYzsfFmb2TpYFPW/o1-a-techn...
Using Search Was a Psyop: https://www.interconnects.ai/p/openais-o1-using-search-was-a...
Value Attribution: https://www.lesswrong.com/posts/FX5JmftqL2j6K8dn4/shapley-va...
As useless as it gets, surprised that it got to the front page.
It's a cute little project, arguably far more interesting than the political flame wars that make front page.
Thank you for your support Mr. Wu
I too am pleasantly surprised