I have a dataset from sklearn
and I plotted the distribution of the load_diabetes.target
data (i.e. the values of the regression that the load_diabetes.data
are used to predict).
I used this because it has the fewest number of variables/attributes of the regression sklearn.datasets
.
Using Python 3, How can I get the distribution-type and parameters of the distribution this most closely resembles?
All I know the target
values are all positive and skewed (positve skew/right skew). . . Is there a way in Python to provide a few distributions and then get the best fit for the target
data/vector? OR, to actually suggest a fit based on the data that's given? That would be realllllly useful for people who have theoretical statistical knowledge but little experience with applying it to "real data".
Bonus Would it make sense to use this type of approach to figure out what your posterior distribution would be with "real data" ? If no, why not?
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
#Get Data
data = load_diabetes()
X, y_ = data.data, data.target
#Organize Data
SR_y = pd.Series(y_, name="y_ (Target Vector Distribution)")
#Plot Data
fig, ax = plt.subplots()
sns.distplot(SR_y, bins=25, color="g", ax=ax)
plt.show()
Use this approach
import scipy.stats as st
def get_best_distribution(data):
dist_names = ["norm", "exponweib", "weibull_max", "weibull_min", "pareto", "genextreme"]
dist_results = []
params = {}
for dist_name in dist_names:
dist = getattr(st, dist_name)
param = dist.fit(data)
params[dist_name] = param
# Applying the Kolmogorov-Smirnov test
D, p = st.kstest(data, dist_name, args=param)
print("p value for "+dist_name+" = "+str(p))
dist_results.append((dist_name, p))
# select the best fitted distribution
best_dist, best_p = (max(dist_results, key=lambda item: item[1]))
# store the name of the best fit and its p value
print("Best fitting distribution: "+str(best_dist))
print("Best p value: "+ str(best_p))
print("Parameters for the best fit: "+ str(params[best_dist]))
return best_dist, best_p, params[best_dist]