Pandas Series: Log Normalize

Benni · Jun 17, 2016 · Viewed 45.9k times

I have a Pandas Series that needs to be log-transformed to be normally distributed. But I can't log-transform it yet, because there are values equal to 0 and values below 1 (the range is 0-4000). Therefore I want to normalize the Series first. I have heard of StandardScaler (scikit-learn), Z-score standardization and Min-Max scaling (normalization). I want to cluster the data later; which would be the best method? StandardScaler and Z-score standardization use the mean, variance, etc. Can I use them on data that is not yet normally distributed?

Answer

mtadd · Jun 18, 2016

To take logarithms you need strictly positive values, so translate your range of values (-1,1] to a normalized (0,1] as follows:

import numpy as np
import pandas as pd

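# Example data: 10 uniform draws between -1 and 1, in a single column named 0
df = pd.DataFrame(np.random.uniform(-1,1,(10,1)))
# Shift and scale so every value is positive: (-1,1] -> (0,1]
df['norm'] = (1+df[0])/2
# Natural logarithm of the normalized column
df['lognorm'] = np.log(df['norm'])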

This results in a DataFrame like the following (the values differ on each run because the input is random):

          0      norm   lognorm
0  0.360660  0.680330 -0.385177
1  0.973724  0.986862 -0.013225
2  0.329130  0.664565 -0.408622
3  0.604727  0.802364 -0.220193
4  0.416732  0.708366 -0.344795
5  0.085439  0.542719 -0.611163
6 -0.964246  0.017877 -4.024232
7  0.738281  0.869141 -0.140250
8  0.558220  0.779110 -0.249603
9  0.485144  0.742572 -0.297636
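
The same shift-and-scale idea could be applied to a Series whose range is not (-1,1], such as the 0-4000 values in the question. The following is only a minimal sketch under stated assumptions: the helper name shift_scale_log, the offset eps and the sample values are all made up for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd


def shift_scale_log(s: pd.Series, eps: float = 1e-6) -> pd.Series:
    """Hypothetical helper: map s linearly into (0, 1], then take the natural log.

    eps is an assumed small offset that keeps the minimum strictly positive.
    """
    lo, hi = s.min(), s.max()
    # The minimum maps to eps / (hi - lo + eps) > 0, the maximum maps to 1
    norm = (s - lo + eps) / (hi - lo + eps)
    return np.log(norm)


# Made-up values covering the 0-4000 range mentioned in the question
s = pd.Series([0.0, 0.5, 3.0, 250.0, 4000.0])
out = shift_scale_log(s)
print(out)
```

Note that the choice of eps is arbitrary and determines how negative the log of the smallest value becomes, so it may need tuning before the result is used for clustering.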