Del via


Phase 2: Refining the solution

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

After you find a question that does provide some useful insight from your data, it’s usual to attempt to refine it to maximize the relevance of the results, minimize the processing overhead, and therefore improve its value. This is generally the second phase of an analysis process.

Typically you will only refine solutions that you intend to repeat on a reasonably regular basis. Be sure that they are still correct and produce valid results after refinement. For example, compare the outputs of the original and the refined solutions.

Having determined that counting the number of occurrences of each word has the potential to reveal common topics of customer tweets, the analysts at Blue Yonder Airlines decided to refine the data processing solution to improve its accuracy and usefulness. The starting point for this refinement is the WordCount.pig Pig Latin script described in Phase 1: Initial data exploration.

Enhancing the Pig Latin code to exclude noise words

The results generated by the WordCount.pig script contain a large volume of everyday words such as a, the, of, and so on. These occur frequently in every tweet but provide little or no semantic value when trying to identify the most common topics discussed in the tweets. Additionally, there are some domain-specific words such as flight and airport that, within the context of discussions around airlines, are so common that their usefulness in identifying topics of conversation is minimal. These noise words (sometimes called stop words)**can be disregarded when analyzing the tweet content.

The analysts decided to create a file containing a list of words to be excluded from the analysis, and then use it to filter the results generated by the Pig Latin script. This is a common way to filter words out of searches and queries. An extract from the noise word file created by the analysts at Blue Yonder Airlines is shown below. This file was saved as noisewords.txt in the /twitterdata folder in the HDInsight cluster.

a
i
as
do
go
in
is
it
my
the
that
what
flight
airline

Note

You can obtain lists of noise words from various sources such as TextFixer and Armand Brahaj’s Blog. If you have installed SQL Server you can start with the list of noise words that are included in the Resource database. For more information, see Configure and Manage Stopwords and Stoplists for Full-Text Search. In addition, you may find the N-gram datasets available from Google are useful. These contain lists of words and phrases with their observed frequency counts.

With the noise words file in place the analysts modified the WordCount.pig script to use a LEFT OUTER JOIN matching the words in the tweets with the words in the noise list, and to store the result in the file named noisewordcounts. Only words with no matching entry in the noise words file are now included in the aggregated results. The modified section of the script is shown below.

...

-- load the noise words file.
NoiseWords = LOAD '/twitterdata/noisewords.txt' AS noiseword:chararray;

-- join the noise words file using a left outer join.
JoinedWords = JOIN FilteredWords BY word LEFT OUTER, NoiseWords BY noiseword USING 'replicated';

-- filter the joined words so that only words with 
-- no matching noise word remain.
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;

...

Partial results from this version of the script are shown below.

delayed       2749
entertainment 2064
delay         1720
bags          1718
service       1718
time          1375
vacation      1031
food          1030
wifi          1030
connection    688
seattle       688
bag           687

These results are more useful than the previous output, which included the noise words. However, the analysts have noticed that semantically equivalent words are counted separately. For example, in the results shown above, delayed and delay both indicate that customers are concerned about delays, while bags and bag both indicate concerns about baggage.

Using a user-defined function to find similar words

After some consideration, the analysts realized that multiple words with the same stem (for example delayed and delay), different words with the same meaning (for example baggage and luggage), and mistyped or misspelled words (for example bagage) will skew the results—potentially causing topics that are important to customers to be obscured because many differing terms are used as synonyms. To address this problem, the analysts initially decided to try to find similar words by calculating the Jaro distance between every word combination in the source data.

The Jaro distance is a value between 0 and 1 used in statistical analysis to indicate the level of similarity between two words. For more information about the algorithm used to calculate Jaro distance, see Jaro–Winkler distance on Wikipedia.

There is no built-in Pig function to calculate Jaro distance, so the analysts recruited the help of a Java developer to create a user-defined function. The code for the user-defined function is shown below.

package WordDistanceUDF;
...

public class WordDistance extends EvalFunc<Double>
{
  public Double exec(final Tuple input) throws IOException{
    if (input == null || input.size() == 0) return null;
    try
    {
      final String firstWord = (String)input.get(0);
      final String secondWord = (String)input.get(1);
      return getJaroSimilarity(firstWord, secondWord);
    }
    catch(Exception e)
    {
      throw new IOException("Caught exception processing input row ", e);
    }
  }

  private Double getJaroSimilarity(final String firstWord, final String secondWord)
  {
    double defMismatchScore = 0.0;
    if (firstWord != null && secondWord != null)
    {
      // get half the length of the string rounded up
      //(this is the distance used for acceptable transpositions)
      final int halflen = Math.min(firstWord.length(), secondWord.length()) / 2 + 1;
      //get common characters
      final StringBuilder comChars = getCommonCharacters(firstWord, secondWord, halflen);

      final int commonMatches = comChars.length();

      //check for zero in common
      if (commonMatches == 0) { return defMismatchScore; }

      final StringBuilder comCharsRev = getCommonCharacters(secondWord, firstWord, halflen);
      //check for same length common strings returning 0.0f is not the same
      if (commonMatches != comCharsRev.length()) { return defMismatchScore; }

      //get the number of transpositions
      int transpositions = 0;
      for (int i = 0; i < commonMatches; i++)
      {
        if (comChars.charAt(i) != comCharsRev.charAt(i)) { transpositions++; }
      }

      //calculate jaro metric
      transpositions = transpositions / 2;
      defMismatchScore = commonMatches / (3.0 * firstWord.length()) + 
      commonMatches / (3.0 * secondWord.length()) +
      (commonMatches - transpositions) / (3.0 * commonMatches);
    }  
    return defMismatchScore;
  }

  private static StringBuilder getCommonCharacters(final String firstWord,
                             final String secondWord, final int distanceSep)
  {
    if (firstWord != null && secondWord != null)
    {
      final StringBuilder returnCommons = new StringBuilder();
      final StringBuilder copy = new StringBuilder(secondWord);
      for (int i = 0; i < firstWord.length(); i++)         
      {
        final char character = firstWord.charAt(i);
        boolean foundIt = false;
        for (int j = Math.max(0, i - distanceSep);
         !foundIt && j < Math.min(i + distanceSep, secondWord.length()); 
         j++)
        {
          if (copy.charAt(j) == character)
          {
             foundIt = true;
             returnCommons.append(character);
             copy.setCharAt(j, '#');
          }
        }
      }
      return returnCommons;
    }
    return null;
  }
}

Note

For information about creating UDFs for use in HDInsight scripts see User-defined functions.

This function was compiled, packaged as WordDistanceUDF.jar, and saved on the HDInsight cluster. Next, the analysts modified the Pig Latin script that generates a list of all non-noise word combinations in the tweet source data to use the function to calculate the Jaro distance between each combination of words generated by the script. This modified section of the script is shown here.

-- register custom jar.
REGISTER WordDistanceUDF.jar;

...
...

-- sort by count.
SortedWords = ORDER WordList BY word;

-- create a duplicate set.
SortedWords2 = FOREACH SortedWords GENERATE word AS word:chararray;

-- cross join to create every combination of pairs.
CrossWords = CROSS SortedWords, SortedWords2;

-- find the Jaro distance.
WordDistances = FOREACH CrossWords GENERATE 
    SortedWords::word as word1:chararray, 
    SortedWords2::word as word2:chararray, 
    WordDistanceUDF.WordDistance(SortedWords::word, SortedWords2::word) AS jarodistance:double;

-- filter out word pairs with jaro distance less than 0.9.
MatchedWords = FILTER WordDistances BY jarodistance >= 0.9;

-- store the results as a file.
STORE MatchedWords INTO '/twitterdata/matchedwords';

Note

Notice that the script filters the results to include only word combinations with a Jaro distance value of 0.9 or higher.

The results include a row for every word matched to itself with a Jaro score of 1.0, and two rows for each combination of words with a score of 0.9 or above (one row for each possible word order). Some of the results in the output file are shown below.

bag      bag      1.0
bag      bags     0.9166666666666665
baggage  baggage  1.0
baggage  bagage   1.0
bags     bag      0.9166666666666665
bags     bags     1.0
delay    delay    1.0
delay    delayed  0.9047619047619047
delay    delays   0.9444444444444444
delayed  delay    0.9047619047619047
delayed  delayed  1.0
delays   delay    0.9444444444444444
delays   delays   1.0
seat     seat     1.0
seat     seats    0.9333333333333333
seated   seated   1.0
seats    seat     0.9333333333333333
seats    seat     1.0

Close examination of these results reveals that, while the code has successfully matched some words appropriately (for example, bag/bags, delay/delays, delay/delayed, and seat/seats), it has failed to match some others (for example, delays/delayed, bag/baggage, bagage/baggage, and seat/seated).

The analysts experimented with the Jaro value used to filter the results, lowering it to achieve more matches. However, in doing so they found that the number of false positives increased. For example, lowering the filter score to 0.85 matched delays to delayed and seats to seated, but also matched seated to seattle.

Note

The results obtained, and the attempts to improve them by adjusting the matching algorithm, reveal just how difficult it is to infer semantics and sentiment from free-form text. In the end the analysts realized that it would require some type of human intervention, in the form of a manually maintained synonyms list.

Using a lookup file to combine matched words

The results generated by the user-defined function were not sufficiently accurate to rely on in a fully automated process for matching words, but they did provide a useful starting point for creating a custom lookup file for words that should be considered synonyms when analyzing the tweet contents. The analysts reviewed and extended the results produced by the function and created a tab-delimited text file, which was saved as synonyms.txt in the /twitterdata folder on the HDInsight cluster. The following is an extract from this file.

bag          bag
bag          bags
bag          bagage
bag          baggage
bag          luggage
bag          lugage
delay        delay
delay        delayed
delay        delays
drink        drink
drink        drinks
drink        drinking
drink        beverage
drink        beverages
...

The first column in this file contains the list of leading values that should be used to aggregate the results. The second column contains synonyms that should be converted to the leading values for aggregation.

With this synonyms file in place, the Pig Latin script used to count the words in the tweet contents was modified to use a LEFT OUTER JOIN between the words in the source tweets (after filtering out the noise words) and the words in the synonyms file to find the leading values for each matched word. A UNION clause is then used to combine the matched words with words that are not present in the synonyms file, and the results are saved into a file named synonymcounts. The modified section of the Pig Latin script is shown here.

...

-- Match synonyms.
Synonyms = LOAD '/twitterdata/synonyms.txt' AS (leadingvalue:chararray, synonym:chararray);
WordsAndSynonyms = JOIN UsefulWords BY word LEFT OUTER, Synonyms BY synonym USING 'replicated';
UnmatchedWords = FILTER WordsAndSynonyms BY synonym IS NULL;
UnmatchedWordList = FOREACH UnmatchedWords GENERATE word;
MatchedWords = FILTER WordsAndSynonyms BY synonym IS NOT NULL;
MatchedWordList = FOREACH MatchedWords GENERATE leadingvalue as word;
AllWords = UNION  MatchedWordList, UnmatchedWordList;

-- group by word.
GroupedWords = GROUP AllWords BY (word);

-- count mentions per group.
CountedWords = FOREACH GroupedWords GENERATE group as word, COUNT(AllWords) as count;
...

Partial results from this script are shown below.

delay         4812
bag           3777
seat          2404
service       1718
movie         1376
vacation      1375
time          1375
entertainment 1032
food          1030
wifi          1030
connection    688
seattle       688
drink         687

In these results, semantically equivalent words are combined into a single leading value for aggregation. For example the counts for seat, seats, and seated are combined as a single count for seat. This makes the results more useful in terms of identifying topics that are important to customers. For example, it is apparent from these results that the top three subjects that customers have tweeted about are delays, bags, and seats.

Moving from topic lists to sentiment analysis

At this point the team has a valuable and seemingly valid list of the most important topics that customers are discussing. However, the results do not include any indication of sentiment. While it’s reasonably safe to assume that the majority of people discussing delays (the top term) will not be happy customers, it’s more difficult to detect the sentiment for a top item such as bag or seat. It’s not possible to tell from this data if most customers are happy or are complaining.

Sentiment analysis is difficult. As an extreme example, tweets containing text such as “No wifi, thanks for that Blue Yonder!” or “Still in Chicago airport due to flight delays - what a great trip this is turning out to be!” are obviously negative to a human reader who appreciates sentiment modifiers such as satire. However, a simple analysis based on individual words that indicate positive and negative sentiment, such as thanks and great, would indicate overall positive sentiment.

There are solutions available that are specially designed to accomplish sentiment analysis, and the team might also investigate using the rudimentary sentiment analysis capabilities of the Twitter search API. However, they decided that the list of topics provided by the analysis they’ve done so far would probably be better employed in driving more focused customer research, such as targeted questionnaires and online surveys—perhaps through external agencies that specialize in this field.

Instead, the team decided to move forward by improving the stability of the current solution as described in Phase 3: Stabilizing the solution.

Next Topic | Previous Topic | Home | Community