I am trying to create a function in R that will allow me to filter my data set based on whether a row contains a single column with a zero in it. Furthermore, some times I only want to remove rows that is zero in all columns.
Also, and this is where it gets fun; not all columns contains numbers and the number of columns can vary.
I have tried to paste some of my data here with the results I want to obtain.
unfiltered:
ID GeneName DU145small DU145total PC3small PC3total
1 MIR22HG 33221.5 1224.55 2156.43 573.315
2 MIRLET7E 87566.1 7737.99 25039.3 16415.6
3 MIR612 0 0 530.068 0
4 MIR218-1 0 0 1166.88 701.253
5 MIR181B2 70723.2 3958.01 6209.85 1399.34
6 MIR218-2 0 0 0 0
7 MIR10B 787.516 330.556 0 20336.4
8 MIR3176 0 0 0 0
any rows with containing a zero removed:
ID GeneName DU145small DU145total PC3small PC3total
1 MIR22HG 33221.5 1224.55 2156.43 573.315
2 MIRLET7E 87566.1 7737.99 25039.3 16415.6
5 MIR181B2 70723.2 3958.01 6209.85 1399.34
only rows that is all zero is filtered away:
ID GeneName DU145small DU145total PC3small PC3total
1 MIR22HG 33221.5 1224.55 2156.43 573.315
2 MIRLET7E 87566.1 7737.99 25039.3 16415.6
3 MIR612 0 0 530.068 0
4 MIR218-1 0 0 1166.88 701.253
5 MIR181B2 70723.2 3958.01 6209.85 1399.34
7 MIR10B 787.516 330.556 0 20336.4
I did find a way of removing any rows that had at least 1 zero in it, but it was "cheating" by exchanging all zeros with NA and then using complete.cases to filter.
Also, by doing that it remove all rows where the GeneName
had a zero in it (as for MIR10B).
I can solve it by using for loops, but I have been told that loops in R is very ineffective and would therefore like to avoid that solution.
EDIT: While Xin Yin's solution works perfectly well and kept the data in a data frame, the answer by David Arenburg is supposedly more efficient and should be used.
Using data.table
(assuming df
is your data set)
library(data.table)
setDT(df)[, .SD[!all(.SD[, -1, with = F] == 0)], by = GeneName]
# GeneName ID DU145small DU145total PC3small PC3total
# 1: MIR22HG 1 33221.500 1224.550 2156.430 573.315
# 2: MIRLET7E 2 87566.100 7737.990 25039.300 16415.600
# 3: MIR612 3 0.000 0.000 530.068 0.000
# 4: MIR218-1 4 0.000 0.000 1166.880 701.253
# 5: MIR181B2 5 70723.200 3958.010 6209.850 1399.340
# 6: MIR10B 7 787.516 330.556 0.000 20336.400
Or if you only want to remove rows with any zeroes
setDT(df)[, .SD[!any(.SD[, -1, with = F] == 0)], by = GeneName]
# GeneName ID DU145small DU145total PC3small PC3total
# 1: MIR22HG 1 33221.5 1224.55 2156.43 573.315
# 2: MIRLET7E 2 87566.1 7737.99 25039.30 16415.600
# 3: MIR181B2 5 70723.2 3958.01 6209.85 1399.340