I have the following problem. I have a data frame with 5 columns (a, b, c, d, e) for each of 54 names. Here is a small extract from the data frame in R, just to give you an idea of the data.
a b c d e
Teniers 15 12 13 6 G
Van Dyck 15 10 17 13 G
Bourdon 10 8 8 4 H
Le Brun 16 16 8 16 H
Le Suer 15 15 4 15 H
Poussin 15 17 6 15 H
I have managed to sort the rows alphabetically by name, so that not only the names are in alphabetical order but the 5 values belonging to each name move with it. So far, so good, but the task is to take the first letter of each name and select only those names whose first letters appear most often. I can get the first letters with the "strsplit" function, but the result is a list: every first letter is printed on its own row with [1] to the left of it ([1] "the first letter", new row [1] "another first letter", and so on up to the 54th) instead of its position in the data frame.
So, any ideas?
Here is an extract from the code...
library(MASS)
data(painters)
attach(painters)
painters
str(painters)
summary(painters)
y <- as.vector(rownames(painters))
is.vector(y)
sortnames <- painters[order(y) , ]
as.data.frame( painters[order(y) , ] ) ## sorted in list; each name with its relevant criteria
rownames(sortnames)
z <- rownames(sortnames)
str(z)
is.vector(z)
strsplit(z, "")
as.list(strsplit(z, ""))
liste <- as.list(strsplit(z, ""))
matrix <- as.matrix(liste)
matrix
matrix[,1]
matrix[1,]
matrix[1,1]
matrix[[1]] [1]
first <- matrix (as.matrix(liste))
for(i in 1:54) {print( matrix[[i]][1]) }
str(first)
Regards and thanks for the fast response in advance!!
What I need is:
to create a vector (or a matrix with dimension [54,1]) that contains only the first letter of each name in the "rownames" column. Each element should carry the row number from the sorted data frame, so that the position in the data frame is preserved.
e.g.
[1]"A"
[2]"B"
[3]"B"
[4]"C"
....
In other words, I need to extract a vector/matrix containing only the first letter of each row name (in the data frame, "rownames" holds just the painters' names, i.e. the very first of the 6 columns shown ;) )
I appreciate your help.
substr(data, 1, 1)
I got them like this:
firstletter <- substr(rownames(sortnames), 1, 1)
firstletter <- as.data.frame(firstletter) ## how should I define "firstletter" for later use??
firstletter
1 A
2 B
3 B
4 B
5 B
6 C
7 C
8 C
9 D
10 D
11 D
12 D
13 D
14 D
15 D
16 F
17 F
18 F
19 G
20 G
21 G
22 H
23 J
24 J
25 L
26 L
27 L
28 L
29 M
30 M
31 O
32 P
33 P
34 P
35 P
36 P
37 P
38 P
39 P
40 P
41 R
42 R
43 R
44 T
45 T
46 T
47 T
48 T
49 T
50 V
51 V
52 V
53 V
54 V
Worked like a charm. The first letter of each painter's name is extracted and the row number stays as it should.
So, thanks a lot !
P.S. I have one last question: is there a function or command in R that can take this "firstletter" object (vector, matrix, list or data.frame, depending on how we define it; which is the best choice for later use?), check which 3 first letters appear most often in it, and extract only those? Or would that be too complicated?
EDIT: All I need now is to delete the redundant last row from the matrix below, which I got after extracting the rows (rbind command):
firstletter Composition Drawing Colour Expression School
Da Udine "D" "10" " 8" "16" " 3" "A"
Del Piombo "D" " 8" "13" "16" " 7" "A"
Diepenbeck "D" "11" "10" "14" " 6" "G"
Palma Giovane "P" "12" " 9" "14" " 6" "D"
Palma Vecchio "P" " 5" " 6" "16" " 0" "D"
Pordenone "P" " 8" "14" "17" " 5" "D"
Teniers "T" "15" "12" "13" " 6" "G"
The Carraci "T" "15" "17" "13" "13" "E"
Tintoretto "T" "15" "14" "16" " 4" "D"
Titian "T" "12" "15" "18" " 6" "D"
Da Vinci "D" "15" "16" " 4" "14" "A"
Domenichino "D" "15" "17" " 9" "17" "E"
Poussin "P" "15" "17" " 6" "15" "H"
The Carraci1 "T" "15" "17" "13" "13" "E"
I have googled for a long time and no function has worked for me so far.
Any suggestions?
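On the EDIT about the redundant last row: a minimal sketch, assuming the object really is a character matrix and that the unwanted row is always the last one. The small matrix `m` below is a hypothetical stand-in built from three rows of the output above; negative indexing with `nrow()` drops the final row.

```r
# Hypothetical stand-in for the matrix in the question (two columns only).
m <- matrix(
  c("P", "15",
    "T", "15",
    "T", "15"),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Poussin", "The Carraci", "The Carraci1"),
                  c("firstletter", "Composition"))
)

# A negative row index removes that row; drop = FALSE keeps the result
# a matrix even if only one row were left.
m_trimmed <- m[-nrow(m), , drop = FALSE]
rownames(m_trimmed)
```

If the duplicate could sit anywhere rather than at the end, removing it by name or with `duplicated()` would be safer than relying on its position.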
Won't substr(row.names(data), 1, 1)
get you the vector of first letters you seem to be after?
EDIT: I initially wrongly wrote substr(row.names(data))
, omitting the indices.
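To make this concrete, here is a small sketch. The data frame `sortnames` below is only a stand-in with four rows copied from the question (the real one would come from `painters`), and attaching names to the result is an optional extra, not something the question requires.

```r
# Stand-in for the sorted data frame; values copied from the question's output.
sortnames <- data.frame(
  Composition = c(10, 8, 15, 15),
  Drawing     = c(8, 13, 12, 17),
  row.names   = c("Da Udine", "Del Piombo", "Teniers", "The Carraci")
)

# substr() is vectorised, so this returns one first letter per row name,
# in the same order as the rows of the data frame.
firstletter <- substr(rownames(sortnames), 1, 1)

# Optional: label each letter with its painter so the link to the row is explicit.
names(firstletter) <- rownames(sortnames)
firstletter
```

Because the result is a plain character vector of the same length as the number of rows, element i always belongs to row i, which is the positional link asked for.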
For the second part of your question, assuming firstletter
is a vector:
table(firstletter)
gives you the frequency table of the first letters. So a bit of manipulation gets what you want, for example:
names(sort(table(firstletter), decreasing=TRUE)[1:3])
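As a quick check of that one-liner, the sketch below rebuilds the 54 first letters from the listing in the question with `rep()` (the counts are read off that listing) and then applies `table()` and `sort()`. The names `counts` and `top3` are just illustrative.

```r
# Rebuild the 54 first letters shown in the question's output.
firstletter <- rep(
  c("A", "B", "C", "D", "F", "G", "H", "J", "L", "M", "O", "P", "R", "T", "V"),
  times = c(1, 4, 3, 7, 3, 3, 1, 2, 4, 2, 1, 9, 3, 6, 5)
)

# Frequency of each letter, most frequent first.
counts <- sort(table(firstletter), decreasing = TRUE)

# The names of the first three entries are the three most frequent letters.
top3 <- names(counts[1:3])
top3
```

Here the top three counts (9, 7 and 6) are all distinct. If the third and fourth letters were tied, `[1:3]` would simply cut the tie at an arbitrary point, so it may be worth looking at `counts` before trusting the cut.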
Does this help? You may then want to keep only those rows of the original dataset whose first letter is one of these three most frequent letters. One way to do this would be:
top3letters <- names(sort(table(firstletter), decreasing=TRUE)[1:3])
data <- subset(data, firstletter %in% top3letters)
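Putting the pieces together, here is a minimal end-to-end sketch. `painters_small` is a hypothetical ten-row stand-in built from rows shown in the question, and it uses logical indexing instead of `subset()`, which should give the same rows here.

```r
# Hypothetical stand-in for the full data set; rows copied from the question.
painters_small <- data.frame(
  Composition = c(10, 8, 11, 15, 15, 15, 15, 12, 8, 10),
  School      = c("A", "A", "G", "A", "G", "E", "D", "D", "D", "H"),
  row.names   = c("Da Udine", "Del Piombo", "Diepenbeck", "Da Vinci",
                  "Teniers", "The Carraci", "Tintoretto",
                  "Palma Giovane", "Pordenone", "Bourdon")
)

# One first letter per row, in row order.
firstletter <- substr(rownames(painters_small), 1, 1)

# The three most frequent first letters (here D, T and P).
top3letters <- names(sort(table(firstletter), decreasing = TRUE)[1:3])

# Keep only the rows whose first letter is among the top three.
top3 <- painters_small[firstletter %in% top3letters, ]
top3
```

Keeping `firstletter` as a plain vector (or adding it as a column with `painters_small$firstletter <- firstletter`) and filtering the data frame directly also avoids the all-character matrix seen in the question's EDIT, since the numeric columns stay numeric.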