The fluid language of the Web
We prepared, as we had for the earlier dataset, the top-100K words list for the body stream for Apr10. You can download it here.
We decided to take a closer look at how the top-100K lists changed between Jun09 and Apr10. Our findings are interesting:
- The union of the two word sets is just shy of 110K. Since each list holds 100K words, this means that almost 10K words entered the top 100K and about as many dropped out, roughly 10% of each list. This is a higher turnover rate than I expected.
- Some words that are newly in the top list are what you'd expect (unigram log10 probability difference shown parenthetically):
  - espnlosangeles (21.88993), an ESPN satellite established during 2009
  - debate2010 (21.53613)
- Some words took a predictable jump:
  - ipad (2.560667), a product introduced mid-year
- Quite a few words newly in the mix are not conversational words:
  - childreplyhtml (22.09848)
  - focaladvid (21.76564)
Curious indeed.
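
For readers who want to run this kind of comparison themselves, here is a minimal sketch in Python. It assumes each top-100K list has been loaded into a dictionary mapping a word to its unigram log10 probability; the function name `compare_top_lists` and the toy values are made up for illustration and are not taken from the datasets. It only computes differences for words present in both lists, since how a baseline is assigned to a word missing from the older list is not something this sketch tries to guess.

```python
def compare_top_lists(old, new):
    """Compare two {word: log10 unigram probability} mappings.

    Hypothetical helper: the input format is an assumption, not the
    layout of the downloadable lists.
    """
    old_words, new_words = set(old), set(new)
    entered = new_words - old_words   # words newly in the top list
    exited = old_words - new_words    # words that dropped out
    shared = old_words & new_words    # words present in both snapshots

    # Positive delta means the word became more probable over time.
    deltas = {word: new[word] - old[word] for word in shared}

    return {
        "union_size": len(old_words | new_words),
        "entered": entered,
        "exited": exited,
        "deltas": deltas,
    }


# Toy data, invented for illustration only.
jun09 = {"the": -1.5, "ipad": -7.0, "pager": -6.5}
apr10 = {"the": -1.5, "ipad": -4.5, "debate2010": -6.0}

result = compare_top_lists(jun09, apr10)
print(result["union_size"])       # 4
print(sorted(result["entered"]))  # ['debate2010']
print(sorted(result["exited"]))   # ['pager']
print(result["deltas"]["ipad"])   # 2.5
```

With two real 100K-word lists, a union size near 110K would correspond to roughly 10K words in each of the `entered` and `exited` sets.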
Comments
- Anonymous
September 22, 2011
> childreplyhtml
This suggests the data is dirty in some respect, doesn't it? Humans don't use that sort of word; that's obviously some sort of HTML source fragment creeping into the n-grams.