As always, you kind of need to play with the model to see how well it actually works, since benchmarks can be misleading (e.g. phi-2).
But at face value, a new architectural approach with the same capacity (8B), trained on a dataset with 1/6th the tokens, being competitive with llama3-8b is exciting.
Not sure why they included a hallucination as one of their first examples:
"Please recommend me three famous movies"
"The Empire Strikes Back (1980) - Directed by George Lucas"
It's not necessarily a hallucination; Lucas is one of the writers of this movie, and quite a few people would make this mistake. So it could be present in the dataset, or it's just a mistake by association.
Do you want them to misrepresent the model's performance?
Exactly - first-time commenter, but I had to agree with you: we need more transparent research like this.
I had to look it up: Irvin Kershner is the director of the movie.
Masking looks interesting for sequences that can't be lossy. If an image squishes a pixel here or there, it won't be noticed, but if a sentence lacks room for "if", that sounds bad.
Does this force the model to encode a high-level answering strategy? (AFAIU, there's no reordering during sampling.) Or does it mean a masking model of a certain size is more prone to making things up that fit the blank space?
I'd imagine you could exploit something like the stochastic denoising approach from DDIM and its descendants, where you add some noise back after each denoising step: essentially, randomly remask already-unmasked tokens and give the model a second chance to unmask them "properly" as the denoised response becomes better known.
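To make that concrete, here's a rough PyTorch sketch of what such a remask-as-you-go loop could look like. This is just an illustration of the idea, not what the LLaDA paper actually does: `model` (returning per-position logits), `mask_id`, the confidence-based unmasking rule, and the shrinking remask schedule are all assumptions.

```python
import torch

def denoise_with_remasking(model, x, prompt_len, mask_id, steps=16, remask_frac=0.5):
    """Hypothetical sketch: iterative unmasking with DDIM-style stochastic remasking.

    x: 1-D LongTensor; positions >= prompt_len start out as mask_id.
    model(x) is assumed to return per-position logits of shape (seq_len, vocab).
    """
    gen = torch.arange(x.numel(), device=x.device) >= prompt_len  # response region
    for t in range(steps):
        conf, pred = model(x).softmax(-1).max(-1)   # per-position confidence + argmax

        # Unmask the most confident still-masked positions this step.
        idx = ((x == mask_id) & gen).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            break
        k = max(1, idx.numel() // (steps - t))
        top = idx[conf[idx].topk(min(k, idx.numel())).indices]
        x[top] = pred[top]

        # "Add some noise back": randomly re-mask a shrinking fraction of the
        # already-unmasked response tokens, so later steps can revise them
        # once more of the answer is known.
        if t < steps - 1:
            filled = ((x != mask_id) & gen).nonzero(as_tuple=True)[0]
            n = int(remask_frac * (1 - t / steps) * filled.numel())
            if n > 0:
                x[filled[torch.randperm(filled.numel())[:n]]] = mask_id

    # Greedily fill anything still masked after the last step.
    still_masked = (x == mask_id) & gen
    if still_masked.any():
        x[still_masked] = model(x).argmax(-1)[still_masked]
    return x
```

The schedule matters: if you re-mask as many tokens as you unmask, the loop never finishes, so in this sketch the remask fraction decays over the steps.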
It doesn't seem to support variable-length input and output, does it?
The paper seems to use EOS padding to create fixed-length input/output.
So is there a maximum output length?
Yes, this is confusing. I'm not sure either, but here is a guess:
In a normal transformer the incremental message is a single token, while in LLaDA the incremental message would be an additional fixed-size block of tokens. If you do inference with the semi-autoregressive strategy outlined in the paper, you could start by adding block after block. When a block is fully unmasked, it essentially becomes part of the prompt. You would stop the "turn" as soon as some sort of "end of reply" token is decoded in your block. This token could appear anywhere in the block.
Questions: (1) Is there such a thing as an "end of reply" token? (I'm not talking about end of sentence <EOS>, which is obviously different.)
(2) What about the junk tokens that appear after this "end of reply" token? I could imagine the unmasking transformer generating some stuff for those tokens.
> (1) Is there such a thing as an "end of reply" token? (I'm not talking about end of sentence <EOS>, which is obviously different.)
EOS as end of sentence? IIRC it's end of sequence, which may also answer the question, IIUC.
<EOS> is end of sequence! That makes sense, thanks for the correction.
So, if we simply wait for <EOS> to be sampled in a block, that could be the place to stop. My answer above should be the same (modulo this correction) -- simply keep obtaining the blocks in a semi-autoregressive fashion.
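For what it's worth, here's a rough PyTorch sketch of that block-by-block loop, under the same assumptions as the snippet further up (a `model` that returns per-position logits and a confidence-based unmasking rule); the block length, step count, and stopping rule are illustrative guesses, not taken from the paper.

```python
import torch

def generate_semi_autoregressive(model, prompt_ids, mask_id, eos_id,
                                 block_len=32, max_blocks=8, steps=16):
    """Hypothetical sketch: append a fully masked block, iteratively unmask it,
    then treat it as part of the prompt and move on to the next block.
    Stop as soon as <EOS> shows up anywhere inside a block."""
    x = prompt_ids.clone()
    for _ in range(max_blocks):
        start = x.numel()
        x = torch.cat([x, torch.full((block_len,), mask_id, dtype=x.dtype)])

        # Iteratively unmask this block, most confident positions first.
        for t in range(steps):
            conf, pred = model(x).softmax(-1).max(-1)
            idx = (x[start:] == mask_id).nonzero(as_tuple=True)[0] + start
            if idx.numel() == 0:
                break
            k = max(1, idx.numel() // (steps - t))
            top = idx[conf[idx].topk(min(k, idx.numel())).indices]
            x[top] = pred[top]

        # <EOS> can land anywhere in the block; anything decoded after it is
        # junk/padding and simply gets dropped.
        eos_pos = (x[start:] == eos_id).nonzero(as_tuple=True)[0]
        if eos_pos.numel() > 0:
            return x[: start + eos_pos[0].item()]
    return x
```

That would also suggest an answer to question (2) above: the model may well decode something into the positions after <EOS>, but since everything past the first <EOS> is cut off, those junk tokens never reach the user.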