FreqDist using NLTK

AJS · Jun 8, 2011 · Viewed 10.6k times

I'm trying to get a frequency distribution of a set of documents using Python. My code isn't working for some reason and is producing this error:

Traceback (most recent call last):
  File "C:\Documents and Settings\aschein\Desktop\freqdist", line 32, in <module>
    fd = FreqDist(corpus_text)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 104, in __init__
    self.update(samples)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 472, in update
    self.inc(sample, count=count)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 120, in inc
    self[sample] = self.get(sample,0) + count
TypeError: unhashable type: 'list'

Can you help?

This is the code so far:

import os
import nltk
from nltk.probability import FreqDist


#The stop-words list
stopwords_doc = open("C:\\Documents and Settings\\aschein\\My Documents\\stopwords.txt").read()
stopwords_list = stopwords_doc.split()
stopwords = nltk.Text(stopwords_list)

corpus = []

#Directory of documents
directory = "C:\\Documents and Settings\\aschein\\My Documents\\comments"
listing = os.listdir(directory)

#Append all documents in directory into a single 'document' (list)
for doc in listing:
    doc_name = "C:\\Documents and Settings\\aschein\\My Documents\\comments\\" + doc
    input = open(doc_name).read() 
    input = input.split()
    corpus.append(input)

#Turn list into Text form for NLTK
corpus_text = nltk.Text(corpus)

#Remove stop-words
for w in corpus_text:
    if w in stopwords:
        corpus_text.remove(w)

fd = FreqDist(corpus_text)

Answer

dmh · Jun 9, 2011

Two thoughts that I hope at least contribute to an answer.

First, the documentation for the nltk.text.Text class states (emphasis mine):

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.

So I'm not sure Text() is the way you want to handle this data. It seems to me you would do just fine to use a list.
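
As a rough illustration of the plain-list approach, here is a minimal sketch. It also shows what I suspect is behind the TypeError: corpus.append(input) adds each document's token list as a single element, so the corpus ends up as a list of lists, and a list cannot be hashed as a sample. Using extend() keeps the corpus flat. The documents list is a hypothetical stand-in for the files in your comments directory, and the Counter fallback is only an assumption so the sketch runs without NLTK installed (in NLTK 3, FreqDist is itself a Counter subclass).

```python
# Sketch: build one flat list of tokens and count it directly, without nltk.Text.
try:
    from nltk.probability import FreqDist
except ImportError:
    # Assumption: fall back to Counter when NLTK is not installed.
    from collections import Counter as FreqDist

# Hypothetical stand-ins for the text read from each file in the directory.
documents = ["the cat sat on the mat", "the dog sat"]

corpus = []
for text in documents:
    # extend() adds each token individually; append() would add the whole
    # token list as one unhashable element.
    corpus.extend(text.split())

fd = FreqDist(corpus)
print(fd["the"])  # 3
print(fd["sat"])  # 2
```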

Second, I would caution you to think about the calculation you're asking NLTK to perform here. Removing stopwords before determining a frequency distribution means that your frequencies will be skewed; I do not understand why the stopwords are removed before tabulation rather than just ignored when examining the distribution after the fact. (I suppose this second point would make a better query/comment than part of an answer, but I felt it worth pointing out that the proportions would be skewed.) Depending on what you intend to use the frequency distribution for, this may or may not be a problem in and of itself.
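
To make the second point concrete, here is a small sketch of counting everything first and ignoring the stopwords only when examining the result. The token list and the stopword set are hypothetical, and the Counter fallback is again just an assumption for running without NLTK. In this toy data "sat" is 2 of 9 tokens; had the stopwords been removed first, it would have been 2 of 5.

```python
# Sketch: tabulate all tokens, then skip stopwords when reading the distribution.
try:
    from nltk.probability import FreqDist
except ImportError:
    # Assumption: fall back to Counter when NLTK is not installed.
    from collections import Counter as FreqDist

# Hypothetical data; in practice these come from the corpus and stopwords.txt.
tokens = "the cat sat on the mat the dog sat".split()
stopwords = set(["the", "on"])  # a set gives fast membership tests

fd = FreqDist(tokens)
total = sum(fd.values())  # total token count, stopwords included

# Keep the original counts, but leave the stopwords out of the view.
content_counts = {w: c for w, c in fd.items() if w not in stopwords}

# Most frequent first, ties broken alphabetically.
for word in sorted(content_counts, key=lambda w: (-content_counts[w], w)):
    print(word, content_counts[word], content_counts[word] / float(total))
```

Whether you want the proportion over all tokens or only over the non-stopword tokens depends on what the distribution is for; this way you can compute either.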