I'm using PySpark 2.0 for a Kaggle competition. I'd like to know the behavior of a model (RandomForest
) depending on different parameters. ParamGridBuilder()
allows to specify different values for a single parameters, and then perform (I guess) a Cartesian product of the entire set of parameters. Assuming my DataFrame
is already defined:
rdc = RandomForestClassifier()
pipeline = Pipeline(stages=STAGES + [rdc])
paramGrid = ParamGridBuilder().addGrid(rdc.maxDepth, [3, 10, 20])
.addGrid(rdc.minInfoGain, [0.01, 0.001])
.addGrid(rdc.numTrees, [5, 10, 20, 30])
.build()
evaluator = MulticlassClassificationEvaluator()
valid = TrainValidationSplit(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
trainRatio=0.50)
model = valid.fit(df)
result = model.bestModel.transform(df)
OK so now I'm able to retrieves simple information with a handmade function:
def evaluate(result):
predictionAndLabels = result.select("prediction", "label")
metrics = ["f1","weightedPrecision","weightedRecall","accuracy"]
for m in metrics:
evaluator = MulticlassClassificationEvaluator(metricName=m)
print(str(m) + ": " + str(evaluator.evaluate(predictionAndLabels)))
Now I want several things:
print(model.validationMetrics)
that displays (it seems) a list containing the accuracy of each model, but I can't get to know which model to refers.If I can retrieve all those informations, I should be able to display graphs, bar charts, and work as I do with Panda and sklearn
.
Spark 2.4+
SPARK-21088 CrossValidator, TrainValidationSplit should collect all models when fitting - adds support for collecting submodels.
By default this behavior is disabled, but can be controlled using CollectSubModels
Param
(setCollectSubModels
).
valid = TrainValidationSplit(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
collectSubModels=True)
model = valid.fit(df)
model.subModels
Spark < 2.4
Long story short you simply cannot get parameters for all models because, similarly to CrossValidator
, TrainValidationSplitModel
retains only the best model. These classes are designed for semi-automated model selection not exploration or experiments.
What are the parameters of all models?
While you cannot retrieve actual models validationMetrics
correspond to input Params
so you should be able to simply zip
both:
from typing import Dict, Tuple, List, Any
from pyspark.ml.param import Param
from pyspark.ml.tuning import TrainValidationSplitModel
EvalParam = List[Tuple[float, Dict[Param, Any]]]
def get_metrics_and_params(model: TrainValidationSplitModel) -> EvalParam:
return list(zip(model.validationMetrics, model.getEstimatorParamMaps()))
to get some about relationship between metrics and parameters.
If you need more information you should use Pipeline Params
. It will preserve all model which can be used for further processing:
models = pipeline.fit(df, params=paramGrid)
It will generate a list of the PipelineModels
corresponding to the params
argument:
zip(models, params)