Text learning combines various disciplines. Of course, that's true for every domain in which machine learning is applied. But text is special because it defines humans in a unique way and, together with speech, is our primary means of encoding, storing, and communicating knowledge.
Non-exhaustively, and probably haphazardly, I have the following in mind for this document.
- Good corpus. Bad corpus.
- NLP. Statistics. Dictionaries. Knowledge-bases.
- SpaCy. CoreNLP. GATE.
- Word2Vec. GloVe. Sense2Vec.
- The whats and the whys, and pointers to the hows.
Primarily my notes for reference.
The most obvious ones are the ones many researchers and tools have taken care of very well. In other words, they are not much of a problem.
User-generated content (UGC) on various social media sites presents most of the challenge. There are no editors to supervise the quality of the text (whether it is factual or fact-free is an orthogonal concern). Grammar and spelling errors. Incomplete sentences. #HashtagsMadeUpOfWordsYouCantFindBoundariesOfToTagBetter.
There's this inherent, embedded challenge. And then there's the extraneous challenge: we have few standard corpora of UGC, and hardly any that are labelled. Any attempt to create a standard corpus runs the risk of being obsolete by the time it becomes a standard. Social media text evolves much too rapidly, and new words are created and discarded without any regard to, well, anything.
So naïve statistics is bound to fail: there is too little of everything to generalize from, and too much noise. Standard NLP is bound to fail too.
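To make the sparsity point concrete, here is a minimal sketch that measures what share of the distinct tokens in a sample occur exactly once. The sample posts, the whitespace tokenizer, and the function name `hapax_ratio` are all assumptions made for illustration; real UGC would need a far better tokenizer.

```python
from collections import Counter


def hapax_ratio(posts):
    """Return the share of distinct tokens that occur exactly once."""
    # Naive tokenization (an assumption): lowercase, then split on whitespace.
    counts = Counter(tok for post in posts for tok in post.lower().split())
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


# Made-up posts, only to exercise the function.
sample_posts = [
    "omg this is sooo good",
    "this is gr8 lol",
    "sooooo good tbh",
]
print(f"{hapax_ratio(sample_posts):.2f}")
```

Even in this toy sample, "sooo" and "sooooo" count as different tokens, which is the kind of fragmentation that starves a purely statistical model of evidence.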
Then there are styles influenced by the platforms that host content. #hashtags, for example. And we don't know how the recent migration (for some, for now) to the 280-character limit by Twitter will affect styles and content. It's generally hard to predict anything, and especially the future.
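For hashtags whose authors were kind enough to capitalize each word, the boundaries can be recovered with a regular expression. This is a minimal sketch under that assumption; the name `split_hashtag` is hypothetical, and all-lowercase hashtags would need dictionary-based segmentation, which is not shown here.

```python
import re

# Alternatives, tried in order: an acronym run that is followed by a
# capitalized word ("NLP" in "NLPRocks"), a capitalized or lowercase word,
# a trailing acronym, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_hashtag(tag):
    """Split a CamelCase hashtag into words; all-lowercase tags stay whole."""
    return _WORD.findall(tag.lstrip("#"))


print(split_hashtag("#HashtagsMadeUpOfWords"))
print(split_hashtag("#NLPRocks"))
```

The split is purely orthographic: it knows nothing about the words themselves, so "Cant" stays "Cant" and a tag like "#nlproc" comes back as a single token.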
SpaCy is one tool that gets you started very quickly, and is fast. The accuracy and performance claims make it a very compelling choice. And the responsiveness I experienced while reporting a bug on the 2-alpha branch left me impressed. SpaCy is Python-based, which is a plus, given the ecosystem.
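As a quick taste, here is a minimal sketch that uses `spacy.blank("en")`, which builds a tokenizer-only English pipeline without downloading a trained model. It assumes SpaCy is installed (for example via `pip install spacy`); the function name `tokenize` is hypothetical. Tagging, parsing, and entity recognition need a trained model and are not shown.

```python
def tokenize(text):
    """Tokenize text with SpaCy's rule-based English tokenizer."""
    # Imported lazily so the module loads even where SpaCy is missing.
    import spacy

    nlp = spacy.blank("en")  # tokenizer only, no trained model required
    doc = nlp(text)
    # text_with_ws keeps each token's trailing whitespace, so joining the
    # pieces gives back the original text.
    return [token.text_with_ws for token in doc]
```

Because the tokenizer is non-destructive, `"".join(tokenize(text))` reproduces `text` exactly, which is handy when you need to map tokens back to character offsets in messy input.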
Named Entities - Recognition, Disambiguation, Linking
Some key terms
- Named Entity
- Knowledge Base (KB)
- Named Entity Recognition (NER)
Nobody got fired for relying on WordNet. English-focused, but you can extend using their framework. Of course, the effort is all yours to make.
Statistical (Machine Learning)
Given any dataset and no other help, statistical analysis is your best bet for gaining insights.
Combined with language-domain expertise (NLP, dictionaries, ontologies etc.), statistics give you higher accuracy at speed and scale.
Word sense disambiguation
Some publicly available trained word2vec datasets
- https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-model appears to be a good list that I could quickly reach via a G-search.
- Google's (archived?) word2vec code site
- Utilities + Processed models for some Wikipedia data
Reproducing a few of the ones here just in case.
| Dataset | Corpus size | Additional notes |
|---|---|---|
| First billion Wikipedia words | 1B words | Use the pre-processing Perl script from the bottom of Matt Mahoney's page |
| Latest Wikipedia dump | > 3B words | Use the same script as above |
| Dataset from "One Billion Word Language Modeling Benchmark" | 1B words | Already pre-processed |
| UMBC webbase corpus | ~3B words | Needs to be processed. Mainly tokenization |
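Once vectors are trained or downloaded, a common first step is to load them and compare words. The sketch below assumes the plain-text word2vec format (a header line with vocabulary size and dimension, then one word and its vector per line); the binary format is not handled. The function names and the three toy vectors are made up for illustration. In practice a library such as gensim does this loading for you.

```python
import io
import math


def load_word2vec_text(lines):
    """Parse the plain-text word2vec format into a {word: vector} dict."""
    lines = iter(lines)
    _, dim = (int(x) for x in next(lines).split())  # header: "<vocab> <dim>"
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up 3-dimensional vectors, only to exercise the functions.
toy_model = io.StringIO(
    "3 3\nking 1.0 0.0 1.0\nqueen 1.0 0.1 0.9\napple 0.0 1.0 0.0\n"
)
vectors = load_word2vec_text(toy_model)
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))
```

The same `load_word2vec_text` call works on an open file handle, since it only needs an iterable of lines.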