Sync'ing from Memory

Text Learning - Pointers

Pointers for DL for text


Text learning combines various disciplines. Of course, that's true for every domain in which machine learning is applied. But text is special because it defines humans in unique ways and, along with speech, is our primary form of encoding, storing, and communicating knowledge.

Non-exhaustively, and probably haphazardly, I have the following in mind for this document.

  • Good corpus. Bad corpus.
  • NLP. Statistics. Dictionaries. Knowledge-bases.
  • SpaCy. CoreNLP. GATE.
  • Word2Vec. GloVe. Sense2Vec.
  • The whats and the whys, and pointers to the hows.

Primarily my notes for reference.


The most obvious ones are the ones many researchers and tools have taken care of very well. In other words, they are not much of a problem.

User-generated content (UGC) on various social media sites presents most of the challenge. There are no editors to supervise the quality of the text (facts, or fact-free, is an orthogonal concern). Grammar and spelling errors. Incomplete sentences. #HashtagsMadeUpOfWordsYouCantFindBoundariesOfToTagBetter.
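As a rough illustration of the hashtag problem, here is a minimal sketch that splits a hashtag on capitalization boundaries. The function name `split_hashtag` is my own, and the approach assumes the author used CamelCase; an all-lowercase tag comes back as a single token and would need a dictionary-based or statistical segmenter instead.

```python
import re
from typing import List

# One piece is either an acronym run (capitals not followed by a lowercase
# letter), a capitalized or lowercase word, or a run of digits.
_PIECE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_hashtag(tag: str) -> List[str]:
    """Split a CamelCase hashtag into words. Assumes capitals mark boundaries."""
    return _PIECE.findall(tag.lstrip("#"))


if __name__ == "__main__":
    print(split_hashtag("#HashtagsMadeUpOfWords"))
    print(split_hashtag("#hashtagsmadeupofwords"))  # no boundaries to find
```

The second call shows where this breaks down, which is most of the real-world cases.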

There's this inherent, embedded challenge. And then there's the extraneous challenge: we have hardly any standard corpus of UGC, and we don't find any that are labelled. Any attempt to create a standard corpus risks being obsolete by the time it becomes a standard. Social-media text evolves much too rapidly, and new words are created and discarded without any regard to, well, anything.

So naïve statistics is bound to fail, because it finds too little of everything to generalize from. Noise! Standard NLP is bound to fail too. Then there are styles influenced by the platforms that host the content: @NameTagging and #hashtags, for example. And we don't know how Twitter's recent migration (for some, for now) to the 280-character limit will affect styles and content. It's generally hard to predict anything, especially the future.



SpaCy is one tool that gets you started very quickly, and it is fast. The accuracy and performance claims make it a very compelling choice. The responsiveness I saw when reporting a bug on the 2-alpha branch left me impressed. SpaCy is Python-based, which is a plus, given the ecosystem.


Named Entities - Recognition, Disambiguation, Linking

Some key terms

  • Named Entity
  • Knowledge Base (KB)
  • Mention
  • Named Entity Recognition (NER)
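To make the four terms concrete, here is a toy sketch in which a plain dictionary lookup stands in for both recognition and linking. Everything in it (`TOY_KB`, `Mention`, `find_mentions`, the entity IDs) is hypothetical and only for illustration; real NER relies on statistical models and context, not exact string matches.

```python
import re
from typing import Dict, List, NamedTuple

# Hypothetical toy knowledge base: entity ID -> surface forms it may appear as.
TOY_KB: Dict[str, List[str]] = {
    "ORG:twitter": ["Twitter"],
    "ORG:google": ["Google"],
}


class Mention(NamedTuple):
    text: str       # the surface string exactly as it appears in the input
    start: int      # character offset where the mention begins
    end: int        # character offset just past the mention
    entity_id: str  # the KB entry (named entity) the mention is linked to


def find_mentions(text: str, kb: Dict[str, List[str]] = TOY_KB) -> List[Mention]:
    """Find KB surface forms in text, case-insensitively, as whole words."""
    found = []
    for entity_id, surface_forms in kb.items():
        for form in surface_forms:
            pattern = r"\b" + re.escape(form) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append(Mention(m.group(), m.start(), m.end(), entity_id))
    return sorted(found, key=lambda mention: mention.start)


if __name__ == "__main__":
    for mention in find_mentions("google and twitter both host UGC"):
        print(mention)
```

The mention is the string in the text ("google"), the named entity is what it refers to, and the KB is where the two get connected. Disambiguation is exactly the part this lookup cannot do.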


Nobody got fired for relying on WordNet. It is English-focused, but you can extend it using their framework. Of course, the effort is all yours to make.

Statistical (Machine Learning)

With no other help available, statistical analysis is your best bet for gaining insight into any dataset.

Combined with language-domain expertise (NLP, dictionaries, ontologies etc.), statistics give you higher accuracy at speed and scale.

Useful reads

General reads

Some publicly available trained word2vec datasets

Reproducing a few of the ones here just in case.

Processed datasets

Dataset name   | Dimensions | Corpus size        | Vocabulary size | Author
Google News    | 300        | 100B               | 3M              | Google
Freebase IDs   | 1000       | 100B (Google News) | 1.4M            | Google
Freebase Names | 1000       | 100B (Google News) | 1.4M            | Google
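For a feel of what these files contain, here is a minimal standard-library sketch that reads vectors in the plain-text word2vec format (a "vocabulary-size dimensions" header, then one "word v1 v2 ..." row per word) and compares two words by cosine similarity. It assumes the vectors have been exported to that text format; the published files may well be binary, in which case a library loader is the practical route. The three 2-dimensional vectors are made up for illustration, and the function names are my own.

```python
import math
from typing import Dict, Iterable, List


def parse_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse word2vec text format: header line, then 'word v1 v2 ...' rows."""
    rows = iter(lines)
    _vocab_size, dims = (int(x) for x in next(rows).split())
    vectors: Dict[str, List[float]] = {}
    for line in rows:
        parts = line.rstrip().split(" ")
        if len(parts) != dims + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up toy vectors, NOT taken from any of the datasets above.
SAMPLE = ["3 2", "king 1.0 0.0", "queen 0.9 0.1", "apple 0.0 1.0"]

if __name__ == "__main__":
    vecs = parse_word2vec_text(SAMPLE)
    print(cosine(vecs["king"], vecs["queen"]))
    print(cosine(vecs["king"], vecs["apple"]))
```

Holding everything in Python lists is fine for a toy; at 3M words by 300 dimensions you would want arrays and a proper loader.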

Text datasets

Dataset                                                     | Corpus size | Additional notes
First billion Wikipedia words                               | 1B words    | Use the pre-processing Perl script from the bottom of Matt Mahoney's page
Latest Wikipedia dump                                       | > 3B words  | Use the same script as above
Dataset from "One Billion Word Language Modeling Benchmark" | 1B words    | Already pre-processed
UMBC webbase corpus                                         | ~3B words   | Needs to be processed, mainly tokenization

Other datasets to build upon