One Hot Encoding for representing corpus sentences in python

Aaron7Sun · May 20, 2015

I am a beginner with Python and the scikit-learn library. I am currently working on an NLP project that first needs to represent a large corpus with one-hot encoding. I have read scikit-learn's documentation on preprocessing.OneHotEncoder, but it does not seem to match my understanding of the term.

Basically, the idea is as follows:

  • 1000000 Sunday
  • 0100000 Monday
  • 0010000 Tuesday
  • ...
  • 0000001 Saturday

If the corpus contains only 7 distinct words, I need only a 7-digit vector to represent each word. A complete sentence can then be represented as a concatenation of all its word vectors, which forms a sentence matrix. However, when I tried this in Python, it did not seem to work...
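For concreteness, here is a minimal sketch in plain numpy of the representation I mean (the vocabulary here is just illustrative):

import numpy as np

vocab = ["Sunday", "Monday", "Tuesday", "Wednesday",
         "Thursday", "Friday", "Saturday"]
word_to_id = {word: idx for idx, word in enumerate(vocab)}

def one_hot(word):
    # 7-digit vector with a single 1 at the word's index
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_id[word]] = 1
    return vec

# a sentence becomes a matrix: one one-hot row per word
sentence = "Monday Tuesday".split()
matrix = np.vstack([one_hot(w) for w in sentence])
print(matrix)
# [[0 1 0 0 0 0 0]
#  [0 0 1 0 0 0 0]]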

How can I achieve this with scikit-learn? My corpus has a very large number of distinct words.

Also, since these vectors are mostly filled with zeros, it seems we could use scipy.sparse to keep the storage small, for example the CSR format.
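For example, converting such a mostly-zero matrix to CSR might look like this (a sketch, assuming scipy is installed):

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 1, 0, 0, 0, 0, 0],
                  [0, 0, 1, 0, 0, 0, 0]])
# CSR stores only the nonzero entries plus index arrays,
# so mostly-zero matrices take far less memory
sparse = csr_matrix(dense)
print(sparse)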

Hence, my entire question is:

How can the sentences in a corpus be represented with OneHotEncoder and stored in a sparse matrix?

Thank you guys.

Answer

aleju · May 21, 2015

In order to use the OneHotEncoder, you can split your documents into tokens and then map every token to an id (always the same id for the same string). Then apply the OneHotEncoder to that list of id lists. The result is a sparse matrix by default.

Example code for two simple documents, "A B" and "B B":

from sklearn.preprocessing import OneHotEncoder
import itertools

# two example documents
docs = ["A B", "B B"]

# split documents to tokens
tokens_docs = [doc.split(" ") for doc in docs]

# convert the list of token lists to one flat list of tokens
# and then create a dictionary that maps each word to an id,
# like {'A': 0, 'B': 1} here
all_tokens = itertools.chain.from_iterable(tokens_docs)
word_to_id = {token: idx for idx, token in enumerate(set(all_tokens))}

# convert token lists to token-id lists, e.g. [[0, 1], [1, 1]] here
token_ids = [[word_to_id[token] for token in tokens_doc] for tokens_doc in tokens_docs]

# convert list of token-id lists to one-hot representation
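# n_values fixes how many distinct ids each column may contain
# (note: n_values was removed in scikit-learn 0.22; see the note below)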
vec = OneHotEncoder(n_values=len(word_to_id))
X = vec.fit_transform(token_ids)

print(X.toarray())

Prints (one-hot vectors in concatenated form, one row per document):

[[ 1.  0.  0.  1.]
 [ 0.  1.  0.  1.]]
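
Note: n_values matched the scikit-learn API at the time of writing; the parameter was removed in scikit-learn 0.22. A sketch of the same encoding with the newer categories parameter (assuming, as in the example above, that all documents have the same number of tokens):

from sklearn.preprocessing import OneHotEncoder

# one list of all possible token ids per column; every document
# here has the same number of tokens (columns)
n_columns = len(token_ids[0])
vec = OneHotEncoder(categories=[list(range(len(word_to_id)))] * n_columns)
X = vec.fit_transform(token_ids)
print(X.toarray())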