Sorting multiple columns by first letter and by numbers in R

Clash Royale CLAN TAG#URR8PPP
Sorting multiple columns by first letter and by numbers in R
I have created a dataframe that looks like the following:
item mean
a_b 5
a_c 2
a_a 4
b_d 7
b_f 3
b_e 1
I would like to sort it so that it is first sorted by whether or not it begins with "a_" or "b_", and then have it sorted by mean. The final dataframe should look like this:
item mean
a_c 2
a_a 4
a_b 5
b_e 1
b_f 3
b_d 7
Note that the item column is not sorted perfectly alphabetically. It is only sorted by the first letter.
I have tried:
arrange(df, item, mean)
The problem with this is that it does not only sort by the "a_" and "b_" categories, but by the entire item name.
I am open to separating the original dataframe into separate dataframes using filter and then sorting the mean within these smaller subsets. I do not need everything to stay in the same dataframe. However, I am unsure how to use filter to only select rows that have items beginning with "a_" or "b_".
3 Answers
3
Another method using dplyr:
dplyr
library(dplyr)
arrange(df, sub('_.+$', '', item), mean)
an alternative would be to use str_extract from stringr to extract only the first letter from item:
str_extract
stringr
item
library(stringr)
arrange(df, str_extract(item, '^._'), mean)
Result:
item mean
1 a_c 2
2 a_a 4
3 a_b 5
4 b_e 1
5 b_f 3
6 b_d 7
Data:
df <- structure(list(item = c("a_b", "a_c", "a_a", "b_d", "b_f", "b_e"
), mean = c(5L, 2L, 4L, 7L, 3L, 1L)), .Names = c("item", "mean"
), class = "data.frame", row.names = c(NA, -6L))
Notes:
sub('_.+$', '', item) creates a temporary variable by removing _ and everything after that from item. _.+$ matches a literal underscore (_) followed by any character one or more times (.+) at the end of the string ($).
sub('_.+$', '', item)
_
item
_.+$
_
.+
$
str_extract(item, '^._') creates a temporary variable by extracting any one character (.) followed by a literal underscore (_) in the beginning of the string (^)
str_extract(item, '^._')
.
_
^
The neat thing about dplyr::arrange is that you can create a temporary sorting variable within the function and not have it included in the output.
dplyr::arrange
@melbez with the same sample output you provided? The Result is what I got when I run the code.
– avid_useR
Aug 7 at 17:41
Could you please explain what this part means: sub('_.$', '', item)
– melbez
Aug 7 at 17:43
@melbez It uses a regular expression to create a temporary variable that removes "_" and everything after it in
item, which would be the first letter of item. arrange when sorts by that temporary variable and mean.– avid_useR
Aug 7 at 17:45
item
item
arrange
mean
@melbez no, in this case it wouldn't, because
. only matches once. You can easily change it to match an arbitrary number of characters by adding a +. See my edits– avid_useR
Aug 7 at 17:51
.
+
The philosophy is that if you want to arrange by something (i.e. a substring here) you have to obtain it first:
arrange
df = read.table(text = "
item mean
a_b 5
a_c 2
a_a 4
b_d 7
b_f 3
b_e 1
", header=T, stringsAsFactors=F)
library(tidyverse)
df %>%
separate(item, c("item1","item2"), remove = F) %>% # split items while keeping the original column
arrange(item1, mean) %>% # arrange by what you really want
select(item, mean) # keep only relevant columns
# item mean
# 1 a_c 2
# 2 a_a 4
# 3 a_b 5
# 4 b_e 1
# 5 b_f 3
# 6 b_d 7
Note that there are various ways to pick the 1st letter from a string. I just decided to use separate here.
separate
In case you have many items separated by _ you'll still need to extract the first item, so you can replace the first _ with another delimiter (let's say :) and separate your column on that:
_
_
:
df = read.table(text = "
item mean
a_b_m 5
a_c 2
a_a 4
b_d_x_q 7
b_f 3
b_e 1
", header=T, stringsAsFactors=F)
library(tidyverse)
library(stringr)
df %>%
mutate(item2 = str_replace(item, "_", ":")) %>%
separate(item2, c("item1","item2"), remove = F, sep = ":") %>%
arrange(item1, mean) %>%
select(item, mean)
# item mean
# 1 a_c 2
# 2 a_a 4
# 3 a_b_m 5
# 4 b_e 1
# 5 b_f 3
# 6 b_d_x_q 7
How can I modify this so that it works if I have over 200 items? I do not want to write c("item1", "item2", etc) for over 200 items.
– melbez
Aug 7 at 17:41
Depends. Are they all separated by
_? You can separate by the first _ and give the names "item1" and "rest items". Based on your example I couldn't imagine you could have that many items. :) I thought to use separate in case you want to order by the second item (i.e. letter)– AntoniosK
Aug 7 at 17:52
_
_
Yes, all the items are separated by an underscore.
– melbez
Aug 7 at 17:55
I've updated my answer. Hope it's helpful.
– AntoniosK
Aug 7 at 18:07
A base R solution would be
inx <- order(substr(df$item, 1, 1), df$mean)
newdf <- df[inx, ]
newdf
# item mean
#2 a_c 2
#3 a_a 4
#1 a_b 5
#6 b_e 1
#5 b_f 3
#4 b_d 7
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
This seems to alphabetize the mean column first by looking at the entire item names, not just the first letter. When I run it, a_a, a_b, and a_c are in order, which is not what I want.
– melbez
Aug 7 at 17:38