As always, you kind of need to play with the model to see how well it actually works, since benchmarks can be misleading (e.g. phi-2).
But at face value, a new architectural approach with the same capacity (8B), trained on a dataset with 1/6th the tokens, being competitive with llama3-8b is exciting.
Not sure why they included a hallucination as one of their first examples:
"Please recommend me three famous movies"
"The Empire Strikes Back (1980) - Directed by George Lucas"
It's not necessarily a hallucination; Lucas is one of the writers of this movie, and quite a few people would make this mistake. So it could be present in the dataset, or it's just a mistake by association.
Do you want them to misrepresent the model's performance?
Exactly - first-time commenter, but I had to agree with you: we need more transparent research like this.
I had to look it up: Irvin Kershner is the director of the movie.
Masking looks interesting for sequences that can't be lossy. If an image squishes a pixel here or there, it won't be noticed, but if a sentence lacks room for "if", that sounds bad.
Does this force the model to encode a high-level answering strategy? (AFAIU, there's no reordering during sampling.) Or does it mean a masking model of a certain size is more prone to making things up that fit the blank space?
I'd imagine you could exploit something like the stochastic denoising approach from DDIM and its descendants, where you add some noise back after each denoising step: essentially, randomly remask already-unmasked tokens and give the model a second chance to unmask them "properly" as the denoised response becomes better known.
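To make that concrete, here's a rough PyTorch sketch of what such a remask-as-you-go loop could look like. This is just an illustration of the idea, not what the LLaDA paper actually does: `model` (returning per-position logits), `mask_id`, the confidence-based unmasking rule, and the shrinking remask schedule are all assumptions.

```python
import torch

def denoise_with_remasking(model, x, prompt_len, mask_id, steps=16, remask_frac=0.5):
    """Hypothetical sketch: iterative unmasking with DDIM-style stochastic remasking.

    x: 1-D LongTensor; positions >= prompt_len start out as mask_id.
    model(x) is assumed to return per-position logits of shape (seq_len, vocab).
    """
    gen = torch.arange(x.numel(), device=x.device) >= prompt_len  # response region
    for t in range(steps):
        conf, pred = model(x).softmax(-1).max(-1)   # per-position confidence + argmax

        # Unmask the most confident still-masked positions this step.
        idx = ((x == mask_id) & gen).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            break
        k = max(1, idx.numel() // (steps - t))
        top = idx[conf[idx].topk(min(k, idx.numel())).indices]
        x[top] = pred[top]

        # "Add some noise back": randomly re-mask a shrinking fraction of the
        # already-unmasked response tokens, so later steps can revise them
        # once more of the answer is known.
        if t < steps - 1:
            filled = ((x != mask_id) & gen).nonzero(as_tuple=True)[0]
            n = int(remask_frac * (1 - t / steps) * filled.numel())
            if n > 0:
                x[filled[torch.randperm(filled.numel())[:n]]] = mask_id

    # Greedily fill anything still masked after the last step.
    still_masked = (x == mask_id) & gen
    if still_masked.any():
        x[still_masked] = model(x).argmax(-1)[still_masked]
    return x
```

The schedule matters: if you re-mask as many tokens as you unmask, the loop never finishes, so in this sketch the remask fraction decays over the steps.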
It doesn't seem to support variable-length input and output, does it?
The paper seems to use EOS padding to create fixed-length input/output.
So is there a maximum output length?
Yes, this is confusing. I'm not sure either, but here is a guess:
In a normal transformer the incremental message is a single token, while in LLaDA the incremental message would be an additional fixed-size block of tokens. If you do inference with the semi-autoregressive strategy outlined in the paper, you could start by adding block after block. When a block is fully unmasked, it essentially becomes part of the prompt. You would stop the "turn" as soon as some sort of "end of reply" token is decoded in your block. This token could appear anywhere in the block.
Questions: (1) Is there such a thing as an "end of reply" token? (I'm not talking about end of sentence <EOS>, which is obviously different.)
(2) What about the junk tokens that appear after this "end of reply" token? I could imagine the unmasking transformer generating some stuff for those tokens.
> (1) Is there such a thing as an "end of reply" token? (I'm not talking about end of sentence <EOS>, which is obviously different.)
EOS as end of sentence? IIRC it's end of sequence, which may also answer the question, IIUC.
<EOS> is end of sequence! That makes sense, thanks for the correction.
So, if we simply wait for <EOS> to be sampled in a block, that could be the place to stop. My answer above should be the same (modulo this correction) -- simply keep obtaining the blocks in a semi-autoregressive fashion.
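For what it's worth, here's a rough PyTorch sketch of that block-by-block loop, under the same assumptions as the snippet further up (a `model` that returns per-position logits and a confidence-based unmasking rule); the block length, step count, and stopping rule are illustrative guesses, not taken from the paper.

```python
import torch

def generate_semi_autoregressive(model, prompt_ids, mask_id, eos_id,
                                 block_len=32, max_blocks=8, steps=16):
    """Hypothetical sketch: append a fully masked block, iteratively unmask it,
    then treat it as part of the prompt and move on to the next block.
    Stop as soon as <EOS> shows up anywhere inside a block."""
    x = prompt_ids.clone()
    for _ in range(max_blocks):
        start = x.numel()
        x = torch.cat([x, torch.full((block_len,), mask_id, dtype=x.dtype)])

        # Iteratively unmask this block, most confident positions first.
        for t in range(steps):
            conf, pred = model(x).softmax(-1).max(-1)
            idx = (x[start:] == mask_id).nonzero(as_tuple=True)[0] + start
            if idx.numel() == 0:
                break
            k = max(1, idx.numel() // (steps - t))
            top = idx[conf[idx].topk(min(k, idx.numel())).indices]
            x[top] = pred[top]

        # <EOS> can land anywhere in the block; anything decoded after it is
        # junk/padding and simply gets dropped.
        eos_pos = (x[start:] == eos_id).nonzero(as_tuple=True)[0]
        if eos_pos.numel() > 0:
            return x[: start + eos_pos[0].item()]
    return x
```

That would also suggest an answer to question (2) above: the model may well decode something into the positions after <EOS>, but since everything past the first <EOS> is cut off, those junk tokens never reach the user.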