Making Word clouds (Part 1: how it works).
I’ve been playing with word clouds on and off for the last couple of months, and finally I’ve decided the time has come to share what I have been doing.
Word clouds turn up in all sorts of places, and I wanted to produce something which could take any text, be customized, and let me edit the final version. The last requirement was key, because anything which produces a bitmap graphic at the end is not going to be easy to edit. I’ve seen it done with HTML tables, but they are hard to redesign (you can’t move words round easily). So it needed to be something like Visio, PowerPoint or WMF, which can produce a drawing containing text. Eventually I settled on PowerPoint. Although I’m using the beta of Office 2010, it relies on an object model for PowerPoint which hasn’t changed for several versions. And since I only seem to program in PowerShell these days, I wrote it in PowerShell. This gives me an easy way of taking any text – like Tweets from Twitter – and pushing it into a cloud. So I wrote my longest single PowerShell function yet to do the job. This is what it does:
- If not already connected to PowerPoint, get connected. Start a new, blank slide.
- Get a list of “Noise words” from a file (I used a copy of the Noise.dat, which is part of Windows Search, as a starting point) and merge that list with any passed via the -ExtraNoiseWords parameter.
- Take text from a file (specified by the -Filename parameter), a PowerShell variable or expression (specified by the -text parameter) or from the pipeline in PowerShell, and produce a “clean” set of words by:
  - Removing anything which is not a space, letter, digit or apostrophe from the text.
  - Removing ’s at the end of words, and converting “_” to a space.
  - Splitting the text at spaces.
  - Removing “words” which are either URLs or numbers.
- Count the occurrences of each word, and determine the “cut-off” frequency which words must meet to get into the final cloud. A -HowMany parameter sets the number of words: if this is the default value of 150 and the 150th non-noise word occurs 10 times, all words with 10 occurrences are accepted, even if that gives 160 non-noise words.
- If the -phrases switch is specified:
  - Find phrases which contain any of the words which meet the cut-off frequency.
  - Ignore those phrases which don’t make the cut-off frequency.
  - Repeat the process looking for longer phrases which contain the phrases which were just found. Keep repeating until no phrases are found which meet the cut-off frequency.
  - Add the phrases to the list of found words and reduce the count of their constituent words.
- Remove noise words, two-word phrases where one word is a noise word, and words which do not reach the cut-off frequency. Sort the list of words by frequency and then by number of letters.
- Store the words in a global variable ($words) so that the function can be re-run with the -useExisting switch. $words can be reviewed or exported and re-imported later.
- If the -noPlot switch is specified, stop here, leaving the words and phrases found, and their counts, in $words.
- Set additional properties on each word:
  - Set the font size for the word, scaled between the values set by the -minFont and -maxFont parameters (these default to 16 and 80 point respectively).
  - Set the margins to the value specified in the -Margin parameter – PowerPoint uses quite generous margins by default, but the script defaults to 0.
  - If -RandomVertical, -RandomBold and/or -RandomItalic values are specified, generate a random number for each and, if it is less than the specified value, set the corresponding text attribute to true.
  - If -Randomtwist is specified, set the twistAngle attribute to a random amount up to the value of -Randomtwist.
  - If multiple RGB colours have been provided using the -RgbSet parameter, select one at random. If not, the default PowerPoint colour will be used – normally black.
  - If the -fontname parameter has been provided and is a single name, set the word to use it; if multiple fonts have been specified, select one at random. If no font is specified, the default PowerPoint font will be used.
- Place the first (most common) word in a PowerPoint shape (rectangle) at the centre of the slide, and store the positions of its corners as properties of the word.
- Place each remaining word in its own shape at the top left corner of the slide, setting its properties as already defined. Get its size from PowerPoint, then try to place it around the boundaries of each existing shape, stopping when the placement won’t overlap with any of the other placed shapes. (The starting point for this method was something I read by Chris Done; it used to be online, but his pages on word clouds only show up in search engine caches now.) Note that the more shapes which have been placed, the longer each new shape will take to place. Store the positions of the newly placed shape’s corners as properties, for use when placing future shapes.
- Stop when either the number of words which cannot be placed exceeds the value in -maxFailsToPlace (3 by default) or all words have been placed successfully.
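The word-cleaning, counting and cut-off steps in the list above are easy to try out on their own. What follows is a minimal sketch in Python, not the PowerShell function itself (that comes in part 2). The names `clean_words`, `cut_off`, `top_words` and `font_size` are made up for this illustration, and several details are assumptions: words are lower-cased before counting, URLs are spotted by their prefix before punctuation is stripped, ties on letter count are broken longest-first, and font sizes are scaled linearly between the minimum and maximum.

```python
import re
from collections import Counter


def clean_words(text):
    """Turn raw text into a list of candidate words."""
    words = []
    for token in text.replace("_", " ").split():
        # Assumption: drop URLs while they are still recognisable.
        if token.lower().startswith(("http://", "https://", "www.")):
            continue
        token = re.sub(r"[^\w']", "", token)   # keep letters, digits, apostrophes
        token = re.sub(r"'s$", "", token)      # remove a trailing 's
        if not token or token.isdigit():       # drop empties and plain numbers
            continue
        words.append(token.lower())            # assumption: counting ignores case
    return words


def cut_off(counts, how_many=150):
    """Frequency of the how_many-th most common word (0 if there are no words)."""
    ranked = sorted(counts.values(), reverse=True)
    if not ranked:
        return 0
    return ranked[min(how_many, len(ranked)) - 1]


def top_words(text, noise_words, how_many=150):
    """(word, count) pairs that reach the cut-off, most frequent first."""
    counts = Counter(w for w in clean_words(text) if w not in noise_words)
    threshold = cut_off(counts, how_many)
    # Every word tied at the threshold is kept, so more than how_many may come back.
    kept = [(word, n) for word, n in counts.items() if n >= threshold]
    # Sort by frequency, then by number of letters (longest first is an assumption).
    kept.sort(key=lambda item: (-item[1], -len(item[0])))
    return kept


def font_size(count, lowest, highest, min_font=16, max_font=80):
    """Assumed linear scaling of a word's count onto the font size range."""
    if highest == lowest:
        return float(max_font)
    return min_font + (count - lowest) / (highest - lowest) * (max_font - min_font)
```

Calling `top_words(text, {"the", "and"}, how_many=2)` on a short piece of text returns at least two words, and more if several words tie on the cut-off frequency, which is the behaviour described for -HowMany.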
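The placement step in the list above boils down to a rectangle overlap test plus a search around the shapes already placed. Here is a rough Python sketch of that idea, again not the PowerShell function itself. The `Rect` class and the `overlaps`, `candidates` and `place` functions are hypothetical names. The sketch assumes only four candidate positions per existing shape (hugging its right, bottom, left and top edges), assumes shapes must stay inside the slide, and uses a 720 by 540 point slide as an example size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self):
        return self.left + self.width

    @property
    def bottom(self):
        return self.top + self.height


def overlaps(a, b):
    """True when two rectangles share interior area; touching edges do not count."""
    return (a.left < b.right and b.left < a.right and
            a.top < b.bottom and b.top < a.bottom)


def candidates(anchor, width, height):
    """Assumed candidate spots: one against each edge of an already placed shape."""
    yield Rect(anchor.right, anchor.top, width, height)          # to the right
    yield Rect(anchor.left, anchor.bottom, width, height)        # below
    yield Rect(anchor.left - width, anchor.top, width, height)   # to the left
    yield Rect(anchor.left, anchor.top - height, width, height)  # above


def place(width, height, placed, slide):
    """Return the first spot that fits, or None if the shape cannot be placed."""
    for anchor in placed:
        for spot in candidates(anchor, width, height):
            inside = (spot.left >= slide.left and spot.top >= slide.top and
                      spot.right <= slide.right and spot.bottom <= slide.bottom)
            # Each candidate is checked against every placed shape, which is why
            # placement slows down as the cloud fills up.
            if inside and not any(overlaps(spot, other) for other in placed):
                return spot
    return None


slide = Rect(0, 0, 720, 540)             # example slide size in points
placed = [Rect(310, 245, 100, 50)]       # first word, centred on the slide
for width, height in [(80, 30), (80, 30)]:
    spot = place(width, height, placed, slide)
    if spot is not None:
        placed.append(spot)
```

A `None` result from `place` corresponds to a word which cannot be placed, the case that -maxFailsToPlace counts.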
In part 2 I’ll include the PowerShell code. The example above was made from Tweets about TechEd, and I’ll show some more examples, with the command lines which were used. As you can see from the above, there are 20 or so parameters to explain.
Update: Thanks to Ian for letting me know that Chris's page is missing in action; the italicized part of point 11 has been changed accordingly.