I'm trying to use a custom vocabulary in scikit-learn for some clustering tasks and I'm getting very weird results.
The program runs ok when not using a custom vocabulary and I'm satisfied with the cluster creation. However, I have already identified a group of words (around 24,000) that I would like to use as a custom vocabulary.
The words are stored in a SQL Server table. So far I have tried two approaches, but both give the same result in the end. The first builds a list, the second builds a dictionary. The code that creates both looks like this:
myvocab = {}
vocabulary = []
count = 0
for row in results:
    skillName = re.sub(r'&#?[a-z0-9]+;', ' ', row['SkillName'])
    skillName = unicode(skillName, "utf-8")
    vocabulary.append(skillName)  # Using a list
    myvocab[str(skillName)] = count  # Using a dictionary
    count += 1
I then use the vocabulary (either the list version or the dictionary, both of them give the same result at the end) in the TfidfVectorizer as follows:
vectorizer = TfidfVectorizer(max_df=0.8, stop_words='english',
                             ngram_range=(1, 2), vocabulary=myvocab)
X = vectorizer.fit_transform(dataset2)
The shape of X is (651, 24321) as I have 651 instances to cluster and 24321 words in the vocabulary.
If I print the contents of X, this is what I get:
(14, 11462) 1.0
(20, 10218) 1.0
(34, 11462) 1.0
(40, 11462) 0.852815313278
(40, 10218) 0.52221264006
(50, 11462) 1.0
(81, 11462) 1.0
(84, 11462) 1.0
(85, 11462) 1.0
(99, 10218) 1.0
(127, 11462) 1.0
(129, 10218) 1.0
(132, 11462) 1.0
(136, 11462) 1.0
(138, 11462) 1.0
(150, 11462) 1.0
(158, 11462) 1.0
(186, 11462) 1.0
(210, 11462) 1.0
: :
As can be seen, for most of the instances only one word from the vocabulary is found (which is wrong, as each instance contains at least 10), and for a lot of instances not even one word is found. Also, the words that are found tend to be the same across instances, which doesn't make sense.
If I print the feature names using:
feature_names = np.asarray(vectorizer.get_feature_names())
I get:
['.NET' '10K' '21 CFR Part 11' ..., 'Zend Studio' 'Zendesk' 'Zenworks']
I must say that the program was running perfectly when the vocabulary used was the one determined from the input documents, so I strongly suspect that the problem is related to using a custom vocabulary.
Does anyone have a clue of what's happening?
(I'm not using a pipeline so this problem can't be related to a previous bug which has already been fixed)
One thing that strikes me as unusual is that when you create the vectorizer you specify ngram_range=(1,2). This means you can't get the feature '21 CFR Part 11' using the standard tokenizer. I suspect the 'missing' features are n-grams for n>2. How many of your pre-selected vocabulary items are unigrams or bigrams?
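One rough way to answer that question is to run every vocabulary entry through an approximation of the default analyzer and count its tokens. The sketch below uses only the standard library and assumes the scikit-learn defaults that the question does not override: lowercasing is on and the token pattern is `(?u)\b\w\w+\b`. The helper names `tokens_of` and `split_by_reachability` are made up for illustration, and stop-word removal is ignored. Because a fixed vocabulary is matched by exact string, the sketch also flags entries whose spelling differs from what the analyzer would emit (case or punctuation), which goes slightly beyond the n-gram length point above.

```python
import re

# Default token pattern of scikit-learn's text vectorizers: runs of two or
# more word characters. Punctuation and single-character tokens are dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokens_of(entry):
    """Approximate the default analyzer: lowercase, then extract tokens."""
    return TOKEN_PATTERN.findall(entry.lower())


def split_by_reachability(vocabulary, ngram_range=(1, 2)):
    """Separate entries the analyzer could emit from those it never will."""
    min_n, max_n = ngram_range
    reachable, unreachable = [], []
    for entry in vocabulary:
        toks = tokens_of(entry)
        # Features are looked up by exact string match, so an entry can only
        # be hit if it equals its own tokens joined by single spaces.
        emitted = " ".join(toks)
        if min_n <= len(toks) <= max_n and emitted == entry:
            reachable.append(entry)
        else:
            unreachable.append(entry)
    return reachable, unreachable


# Hypothetical sample: four entries from the question plus two lowercase ones.
sample = [".NET", "10K", "21 CFR Part 11", "Zend Studio", "zend studio", "zendesk"]
reachable, unreachable = split_by_reachability(sample)
print("reachable:", reachable)
print("unreachable:", unreachable)
```

If most of the 24,000 entries land in the unreachable list, that would be consistent with the sparse matrix shown in the question. Possible remedies, depending on what the data needs, include widening `ngram_range`, normalising the vocabulary to the analyzer's output, or passing a custom `tokenizer` or `analyzer`.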