I have a data frame data with a column named "Project License", which represents a categorical variable and is thus, in R terminology, a factor. I'm trying to create a new column in which open source software licenses are combined into larger categories per my classification. However, when I try to combine (merge) levels of that factor, I end up with a column where all levels are lost, a column that is unchanged, or an error message such as the following:
Error in factor(data[["Project License"]], levels = classification, labels = c("Highly Restrictive", : invalid 'labels'; length 4 should be 1 or 6
Here's my code for this functionality (extracted from a function):
myLevels <- c('gpl', 'lgpl', 'bsd',
              'other', 'artistic', 'public')
myLabels <- c('GPL', 'LGPL', 'BSD',
              'Other', 'Artistic', 'Public')
licenses <- factor(data[["Project License"]],
                   levels = myLevels, labels = myLabels)
data[["Project License"]] <- licenses
classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))
restrictiveness <-
  factor(data[["Project License"]],
         levels = classification,
         labels = c('Highly Restrictive', 'Restrictive',
                    'Permissive', 'Unknown'))
data[["License Restrictiveness"]] <- restrictiveness
I have also tried some other approaches (including the ones described in section 8.2.5 of "R Inferno"), but without success so far.
What am I doing wrong, and how can I solve this problem? Thank you!
UPDATE (Data):
> head(data, n=20)
Project ID Project License
1 45556 lgpl
2 41636 bsd
3 95627 gpl
4 66930 gpl
5 51103 gpl
6 65637 gpl
7 41834 gpl
8 70998 gpl
9 95064 gpl
10 48810 lgpl
11 95934 gpl
12 90909 gpl
13 6538 website
14 16439 gpl
15 41924 gpl
16 78987 gpl
17 58662 zlib
18 1904 bsd
19 93838 public
20 90047 lgpl
> str(data)
'data.frame': 45033 obs. of 2 variables:
$ Project ID : chr "45556" "41636" "95627" "66930" ...
$ Project License: chr "lgpl" "bsd" "gpl" "gpl" ...
- attr(*, "SQL")=Class 'base64' chr "ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDMxNC5ncm91cHMKV0hFUkUgZ3JvdXBfaWQgPCAxMDAwMDA="
- attr(*, "indicatorName")=Class 'base64' chr "cHJqTGljZW5zZQ=="
- attr(*, "resultNames")=Class 'base64' chr "UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl"
UPDATE 2 (Data):
> unique(data[["Project License"]])
[1] "lgpl" "bsd" "gpl" "website" "zlib"
[6] "public" "other" "ibmcpl" "rpl" "mpl11"
[11] "mit" "afl" "python" "mpl" "apache"
[16] "osl" "w3c" "iosl" "artistic" "apsl"
[21] "ibm" "plan9" "php" "qpl" "psfl"
[26] "ncsa" "rscpl" "sunpublic" "zope" "eiffel"
[31] "nethack" "sissl" "none" "opengroup" "sleepycat"
[36] "nokia" "attribut" "xnet" "eiffel2" "wxwindows"
[41] "motosoto" "vovida" "jabber" "cvw" "historical"
[46] "nausite" "real"
The problem is that the number of labels passed to factor (4) equals neither the number of levels (6) nor 1.
From ?factor:
labels
either an optional character vector of labels for the levels (in the same order as
levels after removing those in exclude), or a character string of length 1.
You need to make these agree. c() flattens the nested vectors into a single character vector of six elements, so the names in classification are not a hint to factor to combine the labels.
For example:
factor(..., levels=classification, labels=c('Highly Restrictive',
                                            'Restrictive.1',
                                            'Restrictive.2',
                                            'Permissive.1',
                                            'Permissive.2',
                                            'Unknown'))
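To see where the "length 4 should be 1 or 6" in the error message comes from, it may help to inspect what c() actually builds from the nested definition. This is a small sketch that only reuses the classification vector from the question:

```r
# c() flattens the nested vectors; it does not keep them grouped.
classification <- c(highly      = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive  = c('bsd', 'artistic'),
                    unknown     = c('other'))

# Six values, so factor() expects either 1 or 6 labels.
length(classification)

# The group names are only turned into element names:
# "highly" "restrictive1" "restrictive2" "permissive1" "permissive2" "unknown"
names(classification)

# The values themselves are the six license codes.
unname(classification)
```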
To map the factor to another with fewer levels, you can index a vector by name. Turning the classification vector around as a lookup:
classification <- c(gpl='Highly Restrictive',
                    lgpl='Restrictive',
                    public='Restrictive',
                    bsd='Permissive',
                    artistic='Permissive',
                    other='Unknown')
To use this as a lookup table:
data[["License Restrictiveness"]] <-
as.factor(classification[as.character(data[['Project License']])])
head(data)
## Project ID Project License License Restrictiveness
## 1 45556 lgpl Restrictive
## 2 41636 bsd Permissive
## 3 95627 gpl Highly Restrictive
## 4 66930 gpl Highly Restrictive
## 5 51103 gpl Highly Restrictive
## 6 65637 gpl Highly Restrictive
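Note that the data also contains licenses that are not in the lookup (for example "website" and "zlib"); indexing by a name that is missing returns NA. The sketch below runs on a small made-up sample rather than the real data frame, and it assumes that unmatched licenses should fall into 'Unknown', which is a choice the question does not state. It also shows an alternative that is closer to the original grouped definition: assigning a named list to levels(), where each element lists the old levels to merge.

```r
# Hypothetical four-row sample standing in for the real data frame.
data <- data.frame("Project ID"      = c("45556", "41636", "95627", "6538"),
                   "Project License" = c("lgpl", "bsd", "gpl", "website"),
                   check.names = FALSE, stringsAsFactors = FALSE)

classification <- c(gpl      = 'Highly Restrictive',
                    lgpl     = 'Restrictive',
                    public   = 'Restrictive',
                    bsd      = 'Permissive',
                    artistic = 'Permissive',
                    other    = 'Unknown')

# Look up each license by name; unname() drops the license codes
# that would otherwise be carried along as element names.
restrictiveness <- unname(classification[as.character(data[["Project License"]])])

# Licenses missing from the lookup (here "website") come back as NA.
# Assumption: treat them as 'Unknown'.
restrictiveness[is.na(restrictiveness)] <- 'Unknown'

data[["License Restrictiveness"]] <-
  factor(restrictiveness,
         levels = c('Highly Restrictive', 'Restrictive',
                    'Permissive', 'Unknown'))

# Alternative: merge levels by assigning a named list to levels().
# Levels not mentioned in the list (here "website") become NA.
f <- factor(data[["Project License"]])
levels(f) <- list('Highly Restrictive' = 'gpl',
                  'Restrictive'        = c('lgpl', 'public'),
                  'Permissive'         = c('bsd', 'artistic'),
                  'Unknown'            = 'other')
```

If the column has already been relabelled to 'GPL', 'LGPL' and so on, as in the question's first step, the names in the lookup or list would need to use those labels instead of the lowercase codes.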