Simply put, how can I apply quantile normalization to a large Pandas dataframe (around 2,000,000 rows) in Python?
P.S. I know there is a package named rpy2 that can run R from Python and call R's quantile normalization. However, R does not compute the correct result when I use the data set below:
What I want:
Given the data shown above, how can I apply quantile normalization following the steps in
I found a piece of Python code that claims to compute the quantile normalization:
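import rpy2.robjects as robjects
import numpy as np
from rpy2.robjects.packages import importr
# Load the R package that provides normalize.quantiles
preprocessCore = importr('preprocessCore')
# Each inner list becomes one column of the R matrix
matrix = [ [1,2,3,4,5], [1,3,5,7,9], [2,4,6,8,10] ]
# Flatten in column order, then rebuild the matrix on the R side
v = robjects.FloatVector([ element for col in matrix for element in col ])
m = robjects.r['matrix'](v, ncol = len(matrix), byrow=False)
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
# Convert the R result back to a NumPy array
normalized_matrix = np.array( Rnormalized_matrix)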
The code works fine with its own sample data; however, when I tested it with the data given above, the result was wrong.
Since rpy2 only provides an interface for running R from Python, I tested it again in R directly, and the result was still wrong. So I think the problem lies in the R method itself.
Using the example dataset from the Wikipedia article:
import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8
For each rank, the mean value can be calculated with the following:
rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
1 2.000000
2 3.000000
3 4.666667
4 5.666667
dtype: float64
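As a cross-check, the same rank means can be obtained by sorting each column independently and averaging across columns, row by row. The following is only a sketch under the assumption that the data is numeric and has no missing values; the name `rank_mean_np` is introduced here for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

# Sort every column on its own, so row k holds the k-th smallest value
# of each column (assumes numeric data without NaN).
sorted_cols = np.sort(df.to_numpy(dtype=float), axis=0)

# Average across columns: one mean per rank, indexed 1..n_rows.
rank_mean_np = pd.Series(sorted_cols.mean(axis=1),
                         index=np.arange(1, len(df) + 1))
print(rank_mean_np)
```

For the Wikipedia example this gives the four values listed above. On a frame with millions of rows, this sort-based form may be worth comparing against the groupby form for speed.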
Then the resulting Series, rank_mean, can be used as a mapping for the ranks to get the normalized results:
         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667
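Putting both steps together, the mapping could be sketched as a small helper. The function name `quantile_normalize` is hypothetical, and the choice of `method='min'` for the final ranking is an assumption inferred from the table above, where the two tied values in C2 receive the same normalized value. The sketch also assumes numeric data without missing values.

```python
import numpy as np
import pandas as pd

def quantile_normalize(df):
    """Quantile-normalize the columns of a numeric DataFrame (no NaN)."""
    # Step 1: mean value per rank; 'first' breaks ties so that every
    # rank 1..n_rows occurs exactly once in each column.
    rank_mean = (df.stack()
                   .groupby(df.rank(method='first').stack().astype(int))
                   .mean())
    # Step 2: replace each value by the mean for its rank; 'min' (an
    # assumption) makes tied values share one normalized value.
    return df.rank(method='min').stack().astype(int).map(rank_mean).unstack()

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
normalized = quantile_normalize(df)
print(normalized)
```

Other tie-handling conventions exist (for example, averaging the rank means of tied entries), so results for tied values may differ from other implementations.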