boyter 3 days ago

While word2vec models are old, the nice thing about them is how easily they can be imported and used from any language to build a vector search.
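
For example, a minimal sketch of that in Python with gensim (the vectors file name is a placeholder; any pretrained word2vec-format file would do):

    # load pretrained word2vec-format vectors and run a nearest-neighbour query
    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
    print(kv.most_similar("search", topn=5))  # nearest words by cosine similarity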

I have always wondered (but never done anything about it) whether you could train on code to achieve a similar result. I suspect there could be some value there, even if it's just to help identify similar snippets of code.

  • PaulHoule 3 days ago

    You can do the same with https://sbert.net/ and it is not any more work for you to implement, except it really gives better results! Vector databases did not become hot in the word2vec era; they became hot once you had embeddings that were sensitive to words in context.
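
    A minimal sketch of that with sentence-transformers (the model name is just one common choice; the documents are toy examples):

        # a tiny in-memory vector search over a few documents
        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")
        docs = ["how to reset a router", "cheap flights to Tokyo", "fixing a dripping tap"]
        doc_vecs = model.encode(docs)
        query_vec = model.encode("my faucet keeps leaking")
        scores = util.cos_sim(query_vec, doc_vecs)[0]
        print(docs[int(scores.argmax())])  # should pick the plumbing doc despite no shared words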

    It's arguable, for instance, whether there is any value in making a single embedding for a word which can have multiple meanings. Take a word like "bat": an exact keyword match at least stays specific to the particular word "bat". If you vectorize it, you're going to have to blend in mammals and blend in sports equipment. In any given situation you care about one of those senses and don't care about the other, so anything you gain from matching other mammals means you also get spurious matches having to do with sports equipment.

    BERT sees the context, so it will match a particular use of the word "bat" with either mammals or sports equipment, bringing in relevant synonyms but not irrelevant ones. That made BERT one of those once-in-a-decade breakthroughs in information retrieval (it took a whole decade of conference proceedings to get BM25!), whereas Word2Vec is just a dead end that people wrote too many blog posts about.
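
    A rough sketch of that disambiguation with the transformers library (the model choice is arbitrary; any BERT-style encoder behaves similarly):

        # compare the contextual vectors BERT assigns to "bat" in two different sentences
        import torch
        from transformers import AutoModel, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModel.from_pretrained("bert-base-uncased")

        def vector_for(sentence, word):
            enc = tokenizer(sentence, return_tensors="pt")
            with torch.no_grad():
                hidden = model(**enc).last_hidden_state[0]              # (seq_len, 768)
            tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
            return hidden[tokens.index(word)]                           # vector at the word's position

        mammal = vector_for("the bat flew out of the cave at dusk", "bat")
        sport = vector_for("he swung the bat and hit a home run", "bat")
        print(torch.cosine_similarity(mammal, sport, dim=0))  # well below 1.0: same word, different vectors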

  • maujim 3 days ago

    there is a code2vec as well

    • boyter 3 days ago

      TIL. I had no idea code2vec existed at all. Thanks for pointing it out.

      • autokad 3 days ago

        there is basically everything2vec, but if it doesn't exist, then you can train your own embeddings
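
        For a rough idea of the "train your own" route, with gensim in Python (the corpus here is a toy placeholder):

            # train word2vec embeddings on your own tokenized corpus
            from gensim.models import Word2Vec

            corpus = [["open", "the", "file"], ["read", "the", "file"], ["close", "the", "socket"]]
            model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)
            print(model.wv.most_similar("file", topn=2))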

PaulHoule 3 days ago

Word2Vec, Glove and all of those should be forgotten. They remind me of all the people who were trying to develop NLP solutions with spaCy, mostly failing, and explaining that it had some value because 10% of it almost worked.

Not only do they have nothing to say (short of [0, 0, 0, 0, 0 ... ]) about words not in the dictionary (frequently critical to the meaning of documents), but they don't give consistent gains for many tasks, because they contaminate words with alternate meanings that might be relevant but don't at all address the disambiguation problem. BERT does, which is why BERT really was a breakthrough in IR and NLP.

People who think academia is in a state of decline could use the graphs on this page as an index case

https://nlp.stanford.edu/projects/glove/

where, notably, they repeatedly project N=20 points out of a 50-dimensional space in a way that makes it look like the network knows something about kings and queens or cities and ZIP codes. However, the nature of high-dimensional spaces is that you can find a projection that will make 20 randomly selected points in a 50-d space land exactly where you want them.
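
You can check that claim with nothing but numpy (random points, nothing to do with the actual GloVe vectors):

    # fit a linear projection that sends 20 random 50-d points to arbitrary 2-d targets
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 50))   # 20 random "word vectors" in 50 dimensions
    Y = rng.normal(size=(20, 2))    # 20 arbitrary 2-d positions we want them to land on

    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(np.abs(X @ W - Y).max())  # ~1e-14: the projection hits every target, up to floating point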

I was riding on the bus to NYC trying to train classical classifiers out of scikit-learn to distinguish words of various categories (gendering, part of speech, is it a color?, etc.) and found that it usually seemed to work up to N=20 examples or so, but performance fell apart when you had more training examples than that.

I concluded that if you couldn't do that, it wasn't worthwhile to make a pipeline like

    Word2Vec -> CNN -> word labels
or

    Glove -> LSTM -> classification
and the results the company I worked for was getting from trying to do the above didn't convince me otherwise.
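
For anyone who wants to poke at the same thing, a rough sketch with gensim and scikit-learn (the vectors file and the word lists are placeholders):

    # "is it a color?" classifier over static word vectors
    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    colors = ["red", "green", "blue", "violet", "crimson", "teal", "maroon", "beige"]
    others = ["table", "run", "quickly", "idea", "seven", "window", "loud", "borrow"]

    X = np.array([kv[w] for w in colors + others])
    y = np.array([1] * len(colors) + [0] * len(others))

    clf = LogisticRegression(max_iter=1000)
    # per the argument above, this can look fine at tiny N and fall apart as you add words
    print(cross_val_score(clf, X, y, cv=4).mean())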
  • sota_pop 3 days ago

    Not to be overly pedantic, but this comment misses one important delineation between engineering and science. The old adage is that “with engineering, if you don’t know what you’re doing, you’re doing it wrong. With science, if you know what you’re doing, you’re doing it wrong.” The older models should not be forgotten, since science, when done appropriately, relies on incremental improvements and experimentation. However, for the pure purposes of building something, if the modern ways are an improvement on the old in most or all measurable ways, then sure, there probably isn’t much prudence beyond novelty in utilizing the outdated methods.

    • PaulHoule 3 days ago

      The problem is that Word2Vec and Glove never really worked from an engineering perspective but there was widespread belief that they did.

      From a scientific perspective there are biases in the way that people do science that let an "emperor has no clothes" situation go on too long.

      For instance, I never saw the negative result that "you can't build a classifier for words" based on word vectors published, because people don't publish negative results. They assumed they were doing something wrong; meanwhile there were just too many blog posts (not papers) where people were repeating the lucky result that king - man + woman = queen [1]. Anyone who looked at it systematically found it was a morass, got bogged down, and didn't publish anything. It's pretty common for fields to get stuck in a place like this and make no progress or phony progress [2]. Nobody published "a classifier for words" for Word2Vec because it doesn't work, even though it tantalizingly seems to work when N is very small because of the way high-dimensional spaces work.

      Blog posts like this one will lead more people to get stuck in the same morass and leave them confused, writing more blog posts that repeat the little bit that seems to make sense. Thus being able to use language and be part of a group is a curse and not a blessing, because it makes you repeat other people's mistakes rather than learn from them.

      [1] yes, they did learn to do things like this better than chance, but chance is pretty bad

      [2] https://www.stat.cmu.edu/~cshalizi/2010-10-18-Meetup.pdf

  • quotemstr 3 days ago

    Huh? We don't forget pioneering work just because it was on the wrong track. Aristotle was a great scientist despite being grossly wrong about anatomy. Lord Kelvin had an incorrect theory of heat and misestimated the age of the earth by an order of magnitude, but he was still a genius. Likewise, the word2vec people perhaps went down the wrong path, but it was a good start and enabled some of the earliest language ML applications.

    • PaulHoule 3 days ago

      Speaking as an engineer, not a scientist.

      I worked with a lot of people who failed at projects with Word2Vec and Glove even though I told them so over and over again. I saw upwards of $20M worth of investment disappear with my own eyes, and the blast radius must have been several unicorns' worth of valuation. I quit a job which I really needed because I couldn't bear to see us go down that dead end (medical literature, out-of-vocabulary, lose 50% of the critical information, you lose before you even started... [1])

      On one level, one of my favorite kinds of arXiv paper is "build a classifier for some mundane text classification problem"; I've collected enough of these to do a meta-analysis if I were an academic. I'm interested because I build these things for myself and I'm working towards building a reliable model trainer that can be applied to a wide range of tasks.

      Practically, though, I find these papers depressing, because they inevitably try a bunch of things that were predictably going to suck, especially Word2Vec and Glove. BERT + pooling + classical ML always puts up a good showing in comparison: a good showing period for topic classification, worse if the order of the words matters. The one kind of model simpler than that which is worth looking at is Bag of Words + classical ML, because it can do a pretty good job considering we were doing it in 2000.
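
      A rough sketch of those two baselines with sentence-transformers and scikit-learn (the toy data and the model name are placeholders):

          # BERT + pooling + classical ML vs. Bag of Words + classical ML
          from sentence_transformers import SentenceTransformer
          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import cross_val_score

          texts = [
              "the fed raised rates again", "markets fell on inflation news",
              "the striker scored twice", "the keeper saved a late penalty",
          ]
          labels = [0, 0, 1, 1]  # toy topic labels

          # pooled BERT-style embeddings + logistic regression
          X_bert = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
          print(cross_val_score(LogisticRegression(max_iter=1000), X_bert, labels, cv=2).mean())

          # bag of words (tf-idf) + logistic regression
          X_bow = TfidfVectorizer().fit_transform(texts)
          print(cross_val_score(LogisticRegression(max_iter=1000), X_bow, labels, cv=2).mean())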

      You usually see people stumble in the dark quite a bit past that. For instance, I've yet to see a reliable procedure for fine-tuning BERT models for classification. Some of the reason people have the luck they do is that they have tiny training sets (say N=500); I am just not convinced that with N>5000 these models are really able to learn anything they don't learn with N=500 (is it catastrophic forgetting? my training curves should look like it)

      People really should be trying to expand the 'efficient frontier' of accuracy, computational resources, and development effort -- it's not easy, but one step is to 'ignore' anything that is predictably far from that frontier. Glove and word2vec have always been that, but it's not generally known: reading is supposed to let us learn from other people's mistakes, but in this field it just helps people repeat other people's mistakes.

      [1] LLMs did show that my 'predictive evaluation' methodology of the time was pessimistic; you can throw out critical information, and a model that's sufficiently incapable of reasoning under uncertainty [2] can frequently get the right answer by the wrong path. I would have seen a 50% ceiling; a system that hallucinates might get 80% accuracy, but you can't get that last 20%.

      [2] My "always gets voted down" opinion is that the Chomsky/Pinker "language instinct" is actually a derangement of the ability to reason with uncertainty that we all share, one that makes languages learnable because they live on a collapsed lower-dimensional manifold.

      • DiscourseFan 3 days ago

        I’m not a fan of Chomsky myself really, but I think it’s important to distinguish cognition from computation. Not saying Chomsky’s model was correct, but he wasn’t working on machine learning per se, which certainly draws more, whether intentionally or not, from phenomenology.

singularity2001 3 days ago

Putting this into perspective in the age of LLMs:

The bigger the model, the better the embedding, so one could take the middle activations of a large language model and use those as an embedding, but using smaller models is often good enough and more performant.
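
A minimal sketch of the "middle activations as an embedding" idea with transformers (the model and layer choice are arbitrary):

    # mean-pool the hidden states of a middle transformer layer as a text embedding
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")

    def embed(text, layer=6):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        return out.hidden_states[layer].mean(dim=1).squeeze(0)  # (hidden_size,)

    print(embed("word embeddings in the age of LLMs").shape)  # torch.Size([768])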

  • alecst 3 days ago

    This sounds logical enough but I think I remember reading that larger embeddings were only better for certain classes of words, because increasing the dimension size (in those cases) introduced noise. Will try to find the paper — it’s escaping me at the moment.

    • z3c0 3 days ago

      That's all true, but it matters significantly less "at scale". In the days of lean models, you needed to verify that your input parameters were functionally independent variables, meaning they couldn't correlate with other input parameters. When every document is transformed into a billions-long vector -- even if you took the not-insignificant amount of time it would take to compute a correlation matrix -- the heavy associations between a few features don't mean much, especially when you can just add more data. Plus, people misusing or repurposing words can introduce some interesting twists to features you'd assume map 1:1 on paper.

    • plagiarist 3 days ago

      Does the paper reach some conclusion on an optimal embedding size?

      I was thinking about that the other day. It's interesting from a linguistics perspective. I wonder if each dimension in an optimal-size vector could be given a human-comprehensible label.

    • 3abiton 3 days ago

      Isn't it all about the context too?