I'm pretty sure this has been asked before, but I can't find an answer.
Running logistic regression with sklearn in Python, I'm able to reduce my dataset to its most important features using the transform method:
classf = linear_model.LogisticRegression()
func = classf.fit(Xtrain, ytrain)
reduced_train = func.transform(Xtrain)
How can I tell which features were selected as most important? More generally, how can I calculate the p-value of each feature in the dataset?
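As suggested in the comments above, you can (and should) scale your data before fitting, which makes the coefficients comparable: once every feature is standardized, each coefficient is the change in log-odds per one standard deviation of that feature, so the magnitudes can be ranked against each other. Below is a short example showing how this works. I follow this format for comparison.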
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
x1 = np.random.randn(100)
x2 = np.random.randn(100)
x3 = np.random.randn(100)
#Make y depend on each feature with a different strength (per-sample noise)
y = (3 + x1 + 2*x2 + 5*x3 + 0.2*np.random.randn(100)) > 0
X = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3})
#Scale your data
scaler = StandardScaler()
scaler.fit(X)
X_scaled = pd.DataFrame(scaler.transform(X),columns = X.columns)
clf = LogisticRegression(random_state = 0)
clf.fit(X_scaled, y)
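#Use the absolute coefficient of each standardized feature as its importance
feature_importance = np.abs(clf.coef_[0])
#Rescale so that the most important feature scores 100
feature_importance = 100.0 * (feature_importance / feature_importance.max())
#Sort ascending so the largest bar ends up at the top of the plot
sorted_idx = np.argsort(feature_importance)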
pos = np.arange(sorted_idx.shape[0]) + .5
featfig = plt.figure()
featax = featfig.add_subplot(1, 1, 1)
featax.barh(pos, feature_importance[sorted_idx], align='center')
featax.set_yticks(pos)
featax.set_yticklabels(np.array(X.columns)[sorted_idx], fontsize=8)
featax.set_xlabel('Relative Feature Importance')
plt.tight_layout()
plt.show()
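To answer the "which features were selected" part directly, here is a minimal sketch using SelectFromModel, which as far as I know is the feature selection wrapper to use in newer scikit-learn releases, where LogisticRegression no longer has its own transform method. The synthetic data, the feature_names array and the variable names are my own assumptions for illustration. The default threshold (mean absolute coefficient for an L2-penalized model) is also just one possible choice.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): x3 matters most, x1 least
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
feature_names = np.array(['x1', 'x2', 'x3'])
y = (3 + X[:, 0] + 2 * X[:, 1] + 5 * X[:, 2] + 0.2 * rng.normal(size=200)) > 0

# Standardize so that coefficient magnitudes are comparable
X_scaled = StandardScaler().fit_transform(X)

# Wrap the classifier; features whose absolute coefficient reaches the
# threshold (by default the mean absolute coefficient here) are kept
selector = SelectFromModel(LogisticRegression(random_state=0))
selector.fit(X_scaled, y)

# Boolean mask over the input columns: True means the feature was kept
mask = selector.get_support()
selected = feature_names[mask]

# Reduced dataset containing only the kept columns
X_reduced = selector.transform(X_scaled)

print('Selected features:', list(selected))
print('Reduced shape:', X_reduced.shape)
```

The mask from get_support() lines up with the original column order, so indexing the column names with it tells you which features survived. Note that this is a ranking by coefficient size, not a significance test; scikit-learn's LogisticRegression does not report p-values.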