Jussi Karlgren answers audience questions about natural language understanding and computational linguistics during a Coffee with Clusterone webinar session.

Jussi Karlgren is the co-founder of the text analysis company Gavagai. He holds a Ph.D. in computational linguistics from Stockholm University and is an Adjunct Professor of Language Technology at KTH.

In our Coffee with Clusterone webinar, Jussi discussed some of the fundamental ideas behind Natural Language Understanding and computational linguistics. As a hands-on demo, he presented how to build a learning lexicon from streaming text.

This interview is based on audience questions that came up during the webinar session. You can watch the video of the webinar below. To access the slides, see our webinar page.

Q: When training a model, does it matter in which order I present new words to the model (e.g., simpler, more common words first, then more complex ones)?

Jussi: Generally, you would end up with the same space. What differs in the representation from word to word is the context they appear in and the relative weight or importance attached to them.

This weighting is typically a one-off computation based on previous knowledge of the language, and in that case order does not matter at all. In our model and in some other streaming implementations, the weighting is learned as one goes along, so a word that is very frequent at the beginning of training might end up with a lower weight than it should have.

This means that it is probably a good idea to begin training your system on a standard sample of news or social media text to get the basic statistics in place. After that, you can add your own specialized vocabulary on top. But mostly, it works itself out to the same thing in the end.
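To make the order effect concrete, here is a minimal sketch. It assumes a hypothetical weighting (the self-information of a word's running relative frequency), which is only an illustration and not the weighting used in Gavagai's model. The function names and the toy text are invented for the example.

```python
import math
from collections import Counter


def applied_weights(tokens):
    """For each word, collect the weight in force at each of its occurrences."""
    counts = Counter()
    applied = {}
    for seen, token in enumerate(tokens, start=1):
        counts[token] += 1
        # Hypothetical weighting: self-information of the running relative
        # frequency, so words that look frequent so far get a low weight.
        weight = -math.log(counts[token] / seen)
        applied.setdefault(token, []).append(weight)
    return applied


def mean_weight(tokens, word):
    """Average weight applied to `word` over the whole stream."""
    weights = applied_weights(tokens)[word]
    return sum(weights) / len(weights)


general = "the cat sat on the mat and the dog sat on the rug".split()
domain = "nail hammer nail nail".split()

# Same tokens, two presentation orders.
print(mean_weight(domain + general, "nail"))  # specialized text first
print(mean_weight(general + domain, "nail"))  # general text first
```

When the specialized text comes first, "nail" looks like a very common word and is weighted down. When general text has already set the baseline statistics, the same occurrences receive a higher weight.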

In general, language is surprisingly robust in this respect. The basics of a language are similar no matter the topic, genre, or style. This, by the way, is true for all human languages. We have found that our models perform more or less the same on any human language we’ve used them on.

Q: How can we handle class imbalance in text classification tasks?

Jussi: Class imbalance can happen either on a word level, where certain words are very frequent and less informative than others, or on a text category level, where certain classes of text are less frequent than others.

The former is addressed by a weighting scheme of some sort (see above); the latter is a general challenge for learning categorization schemes.

Sometimes a categorizer may get a good score by simply disregarding all items from a smaller class. This is remedied by making sure one has a reasonable target metric, scoring by performance per class rather than averaging out performance across the entire collection.
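As a small illustration of the difference between the two kinds of metric, the sketch below compares plain accuracy with recall averaged per class (macro-averaged recall). The class names and the 95/5 split are invented for the example.

```python
from collections import defaultdict


def accuracy(gold, predicted):
    """Share of correct predictions over the entire collection."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_recall(gold, predicted):
    """Recall computed per class, then averaged with equal weight per class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Invented data: 95 items of a common class, 5 of a rare one.
gold = ["common"] * 95 + ["rare"] * 5
# A lazy categorizer that disregards the rare class entirely.
lazy = ["common"] * 100

print(accuracy(gold, lazy))      # 0.95
print(macro_recall(gold, lazy))  # 0.5
```

The lazy categorizer looks excellent when scored across the whole collection, but the per-class score shows that it gets none of the rare class right.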

This is of great interest in classification research, and many different approaches address it.

Q: Are pre-trained embeddings useful for classifying small datasets?

Jussi: Yes, absolutely. Pre-trained embeddings are usually trained on text collections large enough that they can boost performance on your small dataset. Think of it as if you were reading a text. It helps your understanding to have read many other texts from the same general area before.

This is especially true if the pre-training has been done on text in the same style as the target dataset. If you process social media text but use a model that’s been pre-trained on Wikipedia text, it will still help, just not as much as if it had been trained on the same kind of text.

A warning about using the wrong pre-trained embeddings: If the embeddings are trained to build synonyms and what you’re doing is topic-modeling, then the embeddings will not help. So while the style of the text may not be of prime importance, the target of the embeddings is. It’s important to always verify the assumptions!

Q: What are your thoughts on sentence embeddings, especially in relation to using lexical + semantics embeddings in the same embedding?

Jussi: This is a research question we are currently only beginning to work on. In general, one should be wary of simple centroid solutions, adding together a set of semantic vectors to find a semantic "average" of all items in the sentence.

This leads to a representation which is likely to capture little of the items in the sentence and to be of little utility. This has to do with qualities of distance transitivity in high-dimensional spaces; there are some interesting primers on this question on the web you can look up!
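One way to see the problem with centroids is the sketch below. It rests on a strong simplifying assumption: random Gaussian vectors stand in for the vectors of unrelated words. The function names, the dimensionality of 300, and the sentence lengths are all chosen only for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


def mean_cosine_to_centroid(n_words, dim=300, seed=0):
    """Average cosine between a centroid and the vectors it was built from."""
    rng = random.Random(seed)
    # Assumption: random Gaussian vectors stand in for unrelated word vectors.
    words = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_words)]
    centroid = [sum(column) / n_words for column in zip(*words)]
    return sum(cosine(centroid, w) for w in words) / n_words


for n in (2, 8, 32):
    print(n, round(mean_cosine_to_centroid(n), 2))
```

For near-orthogonal vectors, the similarity between the centroid and each of its constituents falls off roughly as 1/sqrt(n), so the longer the sentence, the less the average resembles any single word in it.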

In my talk I mentioned that we use both addition and permutation as a way of combining several types of information in the same representation. I believe this is the right way to go, but there are other approaches as well, in the sentence2vec model, for example.
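The general idea of combining addition and permutation can be sketched as follows, in the style of Random Indexing with permuted index vectors. This is an illustration under assumptions, not Gavagai's implementation: the dimensionality, the number of non-zero entries, the use of rotation as the permutation, and all function names are hypothetical.

```python
import random

DIM = 1000    # assumed dimensionality
NONZERO = 10  # assumed number of +1/-1 entries per index vector


def index_vector(word):
    """Sparse random +1/-1 vector, reproducible because it is seeded by the word."""
    rng = random.Random(word)
    vec = [0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def rotate(vec, shift):
    """Permutation by rotation; the shift encodes the relative position."""
    shift %= len(vec)
    return vec[-shift:] + vec[:-shift]


def train(tokens, window=1):
    """Add the rotated index vectors of neighbors to each word's context vector."""
    context = {}
    for i, focus in enumerate(tokens):
        vec = context.setdefault(focus, [0] * DIM)
        for offset in range(-window, window + 1):
            j = i + offset
            if offset == 0 or not 0 <= j < len(tokens):
                continue
            for k, value in enumerate(rotate(index_vector(tokens[j]), offset)):
                vec[k] += value
    return context


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


context = train("the hammer hits the nail".split())
# Was "hammer" seen directly before "hits", or directly after it?
print(dot(context["hits"], rotate(index_vector("hammer"), -1)))  # large
print(dot(context["hits"], rotate(index_vector("hammer"), 1)))   # near zero
```

Addition accumulates which words occurred nearby, while the permutation keeps track of where they occurred, so both kinds of information live in the same vector and can be queried separately.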

In general, I do also believe that both lexical and contextual information (which is what I believe you mean by semantic in this question) should be combined, together with all kinds of connotational information. The meaning of a word in language is a combination of very many types of previous observations of it!

Q: How does Gavagai deal with multiple meanings of one term? For example, “nail” can be a noun or a verb with very different meanings.

Jussi: This is a problem common to all statistical modeling. Taking the example of “nail” as a noun and as a verb, the one use will have a very different distribution from the other. But since they are indexed together, they will be mixed together in the model.

If you have a suitably high-dimensional model, those directions of similarity will be separable in that space. You’ll be able to see that “nail” has certain relations that have to do with “screws” and “bolts”. And then there are other relations with “fingers” and “toes”, and even “hair”. And then for the verb, there will be relations to “nailing”, “fastening”, “building” and that sort of thing.

Those neighborhoods will be dissimilar to each other. We like to think of it as filaments in the semantic space. If you do a very simple nearest-neighbor search, you will not be able to distinguish between these items, but if you do a second-order similarity search to see if the neighbors you have found are similar to each other, you’ll find that the one set of words is going in one direction, while the other sets are going in different directions.

For example, you may have “eyebrows” in one set, and eyebrows are not very similar to “buildings”, which you might have in the other set, although they have a common neighbor.

So it’s very doable to separate them; it just requires somewhat more complex computation than a simple first-order search.
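A second-order search of this kind can be sketched as below. The three-dimensional vectors are invented by hand purely for illustration (a real model would have far more dimensions), and the greedy grouping with a fixed threshold is an assumed, simplistic stand-in for whatever clustering one would use in practice.

```python
import math

# Hand-made toy vectors; the three axes loosely mean hardware, body, construction.
VECTORS = {
    "nail": (1.0, 1.0, 1.0),
    "screws": (1.0, 0.0, 0.1),
    "bolts": (0.9, 0.0, 0.2),
    "fingers": (0.0, 1.0, 0.0),
    "toes": (0.1, 0.9, 0.0),
    "fastening": (0.2, 0.0, 1.0),
    "building": (0.1, 0.0, 0.9),
    "moon": (-1.0, 0.2, -0.5),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


def nearest(word, k):
    """First-order search: the k words most similar to `word`."""
    others = [w for w in VECTORS if w != word]
    others.sort(key=lambda w: cosine(VECTORS[word], VECTORS[w]), reverse=True)
    return others[:k]


def group_neighbors(neighbors, threshold=0.7):
    """Second-order step: greedily group neighbors that resemble each other."""
    groups = []
    for word in neighbors:
        for group in groups:
            if any(cosine(VECTORS[word], VECTORS[m]) >= threshold for m in group):
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


neighbors = nearest("nail", k=6)
print(neighbors)                   # one flat list, the senses are mixed
print(group_neighbors(neighbors))  # separate groups, one per "filament"
```

The first-order list mixes hardware, body parts, and construction. Comparing the neighbors with each other is what pulls the senses apart.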

Q: How do multi-word expressions whose parts are separated by other words influence building a model/lexicon, especially when the training set is full of such expressions?

Jussi: This is a true challenge for any model that works on rolling windows of adjacent items. The only way to transcend this is to build models which track dependencies between items rather than adjacency.

This is of course especially true for languages that have freer word order than English! The cost for the improved precision is that the training data will become more sparse and that more data will be needed.

Q: How does the dimension size determine the precision of the model in relation to the size of the training corpus?

Jussi: If you have too small a dimensionality, you’re compressing stuff and you end up losing information. If you have a large enough dimensionality, given the near orthogonality of the Random Indexing or whatever model you’re using, you’re not compressing things.

The interesting thing to keep in mind about semantic spaces is that it’s only the near neighbors that are interesting. When something is a distant neighbor, it’s uninteresting. There’s sort of a semantic horizon and the local neighborhoods are where the action is.

For example, it’s not interesting to model the similarity between “pencil” and “crater on the moon”. There might be a measurable relationship in the semantic space, a geometric angle between the vectors, but it’s not interesting. There will be some much closer items around pencil that are interesting (“paper”, “writing”, etc.), but if you go beyond that, it’s not interesting.

The point is: you need to have a large enough dimensionality to accommodate these various separate local contexts. The global dimensionality is not that interesting, but if it compresses those local contexts onto each other you will lose precision.
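The near-orthogonality mentioned above can be checked with a small experiment. The sketch assumes sparse random +1/-1 vectors of the kind used in Random Indexing; the number of vectors, the number of non-zero entries, and the three dimensionalities are arbitrary choices for illustration.

```python
import itertools
import math
import random


def sparse_ternary(dim, nonzero, rng):
    """Random vector with `nonzero` entries set to +1 or -1, the rest zero."""
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


def mean_abs_cosine(dim, n_vectors=20, nonzero=8, seed=0):
    """Average absolute cosine over all pairs of random vectors."""
    rng = random.Random(seed)
    vectors = [sparse_ternary(dim, nonzero, rng) for _ in range(n_vectors)]
    pairs = list(itertools.combinations(vectors, 2))
    return sum(abs(cosine(u, v)) for u, v in pairs) / len(pairs)


for dim in (50, 500, 5000):
    print(dim, mean_abs_cosine(dim))
```

With too few dimensions the random vectors overlap noticeably, which is the compression described above. With enough dimensions they are close to orthogonal, so separate local contexts do not get folded onto each other.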

Q: Can POS tagging be used to separate different POS + word combinations, so that one ends up with different versions of the same word when it is used with different POS tags?

Jussi: One could, but training data will become sparser, and the utility of predetermined POS classes is debatable: typically the distinctions they help with are noticeable in the co-occurrence contexts anyway!
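The sparsity cost is easy to see on a toy example. The tokens below are tagged by hand for illustration, not by a real tagger, and the `word_TAG` naming is just one assumed convention.

```python
from collections import Counter

# Hand-tagged toy data; the tags are supplied by hand, not by a real tagger.
tagged = [
    ("nail", "NOUN"), ("the", "DET"), ("nail", "VERB"), ("hammer", "NOUN"),
    ("hammer", "VERB"), ("the", "DET"), ("nail", "NOUN"), ("hammer", "NOUN"),
]

plain = Counter(word for word, _ in tagged)
with_pos = Counter(f"{word}_{tag}" for word, tag in tagged)

print(len(plain), plain["nail"])             # 3 entries, "nail" seen 3 times
print(len(with_pos), with_pos["nail_VERB"])  # 5 entries, "nail_VERB" seen once
```

The same amount of text now has to support more lexicon entries, each with fewer observations.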