From the documentation, I understand that a dummy classifier can be used as a baseline against which to test a classification algorithm.
This classifier is useful as a simple baseline to compare with other (real) classifiers. Do not use it for real problems.
What does the dummy classifier do when it uses the stratified approach? I know that the documentation says that it:
generates predictions by respecting the training set’s class distribution.
Could anybody give me a more theoretical explanation of why this is a valid check on the performance of the classifier?
The dummy classifier gives you a measure of "baseline" performance--i.e. the success rate one should expect to achieve even if simply guessing.
Suppose you wish to determine whether a given object possesses or does not possess a certain property. If you have analyzed a large number of those objects and have found that 90% contain the target property, then guessing that every future instance of the object possesses the target property gives you a 90% likelihood of guessing correctly. Structuring your guesses this way is equivalent to using the most_frequent strategy in the documentation you cite.
Because many machine learning tasks attempt to increase the success rate of (e.g.) classification tasks, evaluating the baseline success rate gives you a floor that your classifier should outperform. In the hypothetical discussed above, you would want your classifier to get more than 90% accuracy, because 90% is the success rate available to even "dummy" classifiers.
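The 90% floor in that hypothetical can be computed straight from the labels. Here is a minimal sketch using only the standard library; the helper name majority_baseline_accuracy is my own, not part of scikit-learn, and it assumes new data has the same class balance as the labels you pass in.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always guessing the most common label.

    Hypothetical helper for illustration; assumes future data has the
    same class balance as `labels`.
    """
    counts = Counter(labels)
    # most_common(1) returns a list holding the single (label, count) pair
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# 90 objects with the target property (1), 10 without (0)
labels = [1] * 90 + [0] * 10
print(majority_baseline_accuracy(labels))
```

Any real classifier trained on this data should score above the number this prints before you consider it useful.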
If one trains a dummy classifier with the stratified strategy using the data discussed above, that classifier will guess at random: for each object it encounters, it predicts that the object possesses the target property with 90% probability and that it does not with 10% probability, regardless of the object's features. This is different from training a dummy classifier with the most_frequent strategy, as the latter would guess that all future objects possess the target property. Here's some code to illustrate:
from sklearn.dummy import DummyClassifier
import numpy as np
two_dimensional_values = []
class_labels = []
# xrange exists only in Python 2; range works in Python 3
for i in range(90):
    two_dimensional_values.append([1, 1])
    class_labels.append(1)
for i in range(10):
    two_dimensional_values.append([0, 0])
    class_labels.append(0)
#now 90% of the training data contains the target property
X = np.array( two_dimensional_values )
y = np.array( class_labels )
#train a dummy classifier to make predictions based on the most_frequent class value
dummy_classifier = DummyClassifier(strategy="most_frequent")
dummy_classifier.fit( X,y )
#this produces 100 predictions that say "1"
for i in two_dimensional_values:
    print( dummy_classifier.predict( [i]) )
#train a dummy classifier to make random predictions drawn from the training set's class distribution
new_dummy_classifier = DummyClassifier(strategy="stratified")
new_dummy_classifier.fit( X,y )
#this produces roughly 90 guesses that say "1" and roughly 10 guesses that say "0"
#(the exact counts vary from run to run because the guesses are random)
for i in two_dimensional_values:
    print( new_dummy_classifier.predict( [i]) )
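As for why stratified guessing is still a meaningful baseline: a random guess matches the true label only when both happen to land on the same class, so the expected accuracy is the sum of the squared class proportions. For the 90/10 data above that is 0.9 * 0.9 + 0.1 * 0.1, or about 0.82, which is lower than the 0.90 you get from most_frequent. The sketch below checks this in two ways, using only the standard library. The function names are my own, and it assumes the test data has the same class balance as the training data.

```python
import random
from collections import Counter


def stratified_expected_accuracy(labels):
    """Expected accuracy of guessing labels at random in proportion to
    their frequency: the sum of squared class proportions.

    Hypothetical helper; assumes test data shares the class balance of `labels`.
    """
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())


def simulate_stratified_accuracy(labels, n_trials=10000, seed=0):
    """Estimate the same quantity by drawing a true label and an
    independent random guess from `labels` on each trial."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        true_label = rng.choice(labels)
        guess = rng.choice(labels)
        hits += true_label == guess
    return hits / n_trials


labels = [1] * 90 + [0] * 10
print(stratified_expected_accuracy(labels))  # about 0.82
print(simulate_stratified_accuracy(labels))  # close to the value above
```

So on imbalanced data the two strategies set different floors for accuracy, and it is worth checking your classifier against whichever floor is higher.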