Python - Sentiment Analysis using Pointwise Mutual Information

keshr3106 · Mar 1, 2014 · Viewed 17.6k times
from __future__ import division
import urllib
import json
from math import log


def hits(word1, word2=""):
    query = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s"
    if word2 == "":
        results = urllib.urlopen(query % urllib.quote_plus(word1))
    else:
        # Parenthesise the argument to %: without the parentheses, the
        # string formatting binds tighter than +, so the proximity terms
        # get appended after the URL instead of inside the query string.
        results = urllib.urlopen(query % urllib.quote_plus(word1 + " AROUND(10) " + word2))
    json_res = json.loads(results.read())
    google_hits = int(json_res['responseData']['cursor']['estimatedResultCount'])
    return google_hits


def so(phrase):
    # NB: this is a simplified SO. Turney's full formula (given below)
    # also multiplies by hits("poor") / hits("excellent") to cancel the
    # baseline frequencies of the two seed words.
    num = hits(phrase, "excellent")
    den = hits(phrase, "poor")
    if den == 0:  # guard against zero hits before dividing / taking log
        return 0.0
    ratio = num / den
    sop = log(ratio)
    return sop

print so("ugly product")

I need this code to calculate the Pointwise Mutual Information (PMI), which can be used to classify reviews as positive or negative. Basically I am using the technique specified by Turney (2002): http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf as an example of an unsupervised classification method for sentiment analysis.

As explained in the paper, the semantic orientation of a phrase is negative if the phrase is more strongly associated with the word "poor" and positive if it is more strongly associated with the word "excellent".
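Concretely, the SO measure in the paper is computed from hit counts like this (my transcription of Turney's formula; NEAR was AltaVista's proximity operator, which the AROUND(10) above tries to imitate):

SO(phrase) = log2( hits(phrase NEAR "excellent") * hits("poor")
                 / ( hits(phrase NEAR "poor") * hits("excellent") ) )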

The code above calculates the SO of a phrase. I use Google to obtain the hit counts, since AltaVista no longer exists.

The values computed are very erratic and don't follow a consistent pattern. For example, SO("ugly product") turns out to be 2.85462098541 while SO("beautiful product") is 1.71395061117, whereas the former is expected to be negative and the latter positive.

Is there something wrong with the code? Is there an easier way to calculate the SO of a phrase (using PMI) with a Python library, say NLTK? I tried NLTK but was not able to find an explicit method that computes PMI for a given pair of words.
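The closest I could find is NLTK's collocation machinery, which exposes PMI as an association measure over a corpus (rather than over search-engine hits). A minimal sketch, assuming the genesis corpus has been downloaded with nltk.download():

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Score every bigram in the corpus by PMI and print the strongest ones.
words = nltk.corpus.genesis.words('english-web.txt')
finder = BigramCollocationFinder.from_words(words)
bigram_measures = BigramAssocMeasures()
for bigram, score in finder.score_ngrams(bigram_measures.pmi)[:5]:
    print bigram, score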

Answer

alvas · Mar 9, 2014

Generally, calculating PMI is tricky since the formula changes depending on the size of the n-gram you want to take into consideration:

Mathematically, for bigrams, you can simply consider:

log(p(a,b) / ( p(a) * p(b) ))

Programmatically, assuming you have already counted the frequencies of all the unigrams and bigrams in your corpus, you can do this:

import math

def pmi(word1, word2, unigram_freq, bigram_freq):
    # Estimate P(w) from unigram counts and P(w1, w2) from bigram counts.
    total_unigrams = float(sum(unigram_freq.values()))
    total_bigrams = float(sum(bigram_freq.values()))
    prob_word1 = unigram_freq[word1] / total_unigrams
    prob_word2 = unigram_freq[word2] / total_unigrams
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / total_bigrams
    return math.log(prob_word1_word2 / (prob_word1 * prob_word2), 2)
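For instance, with some made-up toy counts (hypothetical numbers, purely to illustrate the call):

unigram_freq = {"foo": 4, "bar": 4, "sentence": 2, "this": 2}
bigram_freq = {"foo bar": 4, "bar sentence": 1, "this is": 2}
print pmi("foo", "bar", unigram_freq, bigram_freq)
# ~2.36: "foo" and "bar" co-occur far more often than chance predicts.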

This code snippet is from an MWE library that is still at a pre-development stage (https://github.com/alvations/Terminator/blob/master/mwe.py). Do note that it is built for parallel MWE extraction, so here's how you can "hack" it to extract monolingual MWEs:

$ wget https://dl.dropboxusercontent.com/u/45771499/mwe.py
$ printf "This is a foo bar sentence .\nI need multi-word expression from this text file.\nThe text file is messed up , I know you foo bar multi-word expression thingy .\n More foo bar is needed , so that the text file is populated with some sort of foo bar bigrams to extract the multi-word expression ." > src.txt
$ printf "" > trg.txt
$ python
>>> import codecs
>>> from mwe import load_ngramfreq, extract_mwe

>>> # Calculates the unigrams and bigrams counts.
>>> # Put more grandly: "training a bigram 'language model'."
>>> unigram, bigram, _ , _ = load_ngramfreq('src.txt','trg.txt')

>>> sent = "This is another foo bar sentence not in the training corpus ."

>>> for threshold in range(-2, 5):
...     # threshold = minimum PMI score a bigram needs to count as an MWE.
...     print threshold, [mwe for mwe in extract_mwe(sent.strip().lower(), unigram, bigram, threshold)]

[out]:

-2 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
-1 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
0 ['this is', 'foo bar', 'bar sentence']
1 ['this is', 'foo bar', 'bar sentence']
2 ['this is', 'foo bar', 'bar sentence']
3 ['foo bar', 'bar sentence']
4 []

For further details, I find this thesis a quick and easy introduction to MWE extraction: "Extending the Log Likelihood Measure to Improve Collocation Identification", see http://goo.gl/5ebTJJ