Creating a Term Document Matrix from a Text File

J4cK · May 10, 2013 · Viewed 9.4k times

I'm trying to read a text file and create a term-document matrix using the textmining package. I can build the matrix when I add the file line by line, but I want to add the whole file at once. What am I missing in the following code? Thanks in advance for any suggestions.

import textmining

def term_document_matrix_roy_1():
    # readlines() returns the whole file as a list with one string per line.
    with open("data_set.txt") as f:
        reading_file_line = f.readlines()
    print(reading_file_line)

    # Strip the trailing newline from each line.
    reading_file_info = [item.rstrip('\n') for item in reading_file_line]
    print(reading_file_info)
    print(reading_file_info[1])  # second line
    print(reading_file_info[2])  # third line

    tdm = textmining.TermDocumentMatrix()
    # tdm.add_doc(reading_file_info)  # gives an error because readlines() returns a list
    # Works, but adds each line as a separate document.
    tdm.add_doc(reading_file_info[0])
    tdm.add_doc(reading_file_info[1])
    tdm.add_doc(reading_file_info[2])

    for row in tdm.rows(cutoff=1):
        print(row)

term_document_matrix_roy_1()

Sample text file: "data_set.txt" contains the following:

Lets write some python code

Thus far, this book has mainly discussed the process of ad hoc retrieval.

Along the way we will study some important machine learning techniques.

The output should be a term-document matrix, i.e. how many times each word appears in each document. Output image: http://postimg.org/image/eidddlkld/


Answer

Fred Kroon · May 10, 2013

If I'm understanding you correctly, you're currently adding each line of your file as a separate document. To add the whole file, you could just concatenate the lines, and add them all at once.

tdm = textmining.TermDocumentMatrix()
#tdm.add_doc(reading_file_info) #Giving error because of readlines 
tdm.add_doc(' '.join(reading_file_info))
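To see what joining the lines changes, here is a minimal sketch that does not use textmining at all. It builds a plain term-count matrix with the standard library, and the `tokenize` and `build_tdm` helpers are hypothetical stand-ins, not the package's API. The tokenizer (lowercase, ASCII letters only) is an assumption and may split words differently from textmining.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())

def build_tdm(documents):
    """Return (sorted vocabulary, one row of counts per document)."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocab = sorted(set().union(*counts))
    rows = [[c[word] for word in vocab] for c in counts]
    return vocab, rows

# The three sample lines from data_set.txt.
lines = [
    "Lets write some python code",
    "Thus far, this book has mainly discussed the process of ad hoc retrieval.",
    "Along the way we will study some important machine learning techniques.",
]

# One document per line: the matrix has three rows.
vocab, per_line_rows = build_tdm(lines)

# Whole file as a single document: one row holding the combined counts.
whole_vocab, whole_rows = build_tdm([" ".join(lines)])

print(len(per_line_rows))
print(len(whole_rows))
print(whole_rows[0][whole_vocab.index("some")])
```

The vocabulary is the same either way. Only the number of rows changes, and the single row of the joined version is the column-wise sum of the per-line rows.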

If you are instead looking for multiple matrices, one per line, you'll end up with only one row in each, as each matrix holds only one document, unless you have another way of splitting a line into separate documents. You may want to rethink whether this is what you actually want. Nevertheless, I think this code will do it for you:

with open("txt_files/input_data_set.txt") as f:
    tdms = []
    for line in f:
        tdm = textmining.TermDocumentMatrix()
        tdm.add_doc(line.strip())
        tdms.append(tdm)

    for tdm in tdms:
        for row in tdm.rows(cutoff=1):
            print(row)

I haven't really been able to test this code, so the output might not be right. Hopefully it will get you on your way.