Sorting multiple columns by first letter and by numbers in R




I have created a dataframe that looks like the following:


item mean
a_b 5
a_c 2
a_a 4
b_d 7
b_f 3
b_e 1



I would like to sort it so that it is first sorted by whether or not it begins with "a_" or "b_", and then have it sorted by mean. The final dataframe should look like this:


item mean
a_c 2
a_a 4
a_b 5
b_e 1
b_f 3
b_d 7



Note that the item column is not sorted perfectly alphabetically. It is only sorted by the first letter.



I have tried:


arrange(df, item, mean)



The problem with this is that it does not only sort by the "a_" and "b_" categories, but by the entire item name.



I am open to separating the original dataframe into separate dataframes using filter and then sorting the mean within these smaller subsets. I do not need everything to stay in the same dataframe. However, I am unsure how to use filter to only select rows that have items beginning with "a_" or "b_".




3 Answers



Another method using dplyr:




library(dplyr)
arrange(df, sub('_.+$', '', item), mean)



An alternative would be to use str_extract from stringr to extract only the first letter (plus the underscore) from item:




library(stringr)
arrange(df, str_extract(item, '^._'), mean)



Result:


item mean
1 a_c 2
2 a_a 4
3 a_b 5
4 b_e 1
5 b_f 3
6 b_d 7



Data:


df <- structure(list(item = c("a_b", "a_c", "a_a", "b_d", "b_f", "b_e"
), mean = c(5L, 2L, 4L, 7L, 3L, 1L)), .Names = c("item", "mean"
), class = "data.frame", row.names = c(NA, -6L))



Notes:



sub('_.+$', '', item) creates a temporary variable by removing the first _ and everything after it from item. The pattern _.+$ matches a literal underscore (_) followed by one or more of any character (.+), running to the end of the string ($).





str_extract(item, '^._') creates a temporary variable by extracting any one character (.) followed by a literal underscore (_) at the beginning of the string (^).





The neat thing about dplyr::arrange is that you can create a temporary sorting variable within the function and not have it included in the output.
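To make the "temporary sorting variable" concrete, here is a minimal sketch that first prints the key that sub() produces, and then compares the inline arrange() call with a more explicit version that adds the key as a column, sorts, and drops it again. The column name key and the object names explicit and inline are only illustrative and do not come from the answer.

```r
library(dplyr)

df <- data.frame(
  item = c("a_b", "a_c", "a_a", "b_d", "b_f", "b_e"),
  mean = c(5L, 2L, 4L, 7L, 3L, 1L),
  stringsAsFactors = FALSE
)

# The sorting key on its own: everything from the first "_" onward is removed
key <- sub("_.+$", "", df$item)
print(key)

# Explicit version: add the key (hypothetical column name), sort, drop it again
explicit <- df %>%
  mutate(key = sub("_.+$", "", item)) %>%
  arrange(key, mean) %>%
  select(-key)

# Inline version: the key exists only while arrange() is sorting
inline <- arrange(df, sub("_.+$", "", item), mean)

print(explicit)
print(inline)
```

Both versions should give the same row order; the inline one just saves the mutate() and select() steps.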







This seems to alphabetize the mean column first by looking at the entire item names, not just the first letter. When I run it, a_a, a_b, and a_c are in order, which is not what I want.
– melbez
Aug 7 at 17:38





@melbez with the same sample output you provided? The Result is what I got when I run the code.
– avid_useR
Aug 7 at 17:41





Could you please explain what this part means: sub('_.$', '', item)
– melbez
Aug 7 at 17:43





@melbez It uses a regular expression to create a temporary variable by removing "_" and everything after it from item, which leaves only the first letter of item. arrange then sorts by that temporary variable and by mean.
– avid_useR
Aug 7 at 17:45








@melbez no, in this case it wouldn't, because . only matches once. You can easily change it to match an arbitrary number of characters by adding a +. See my edits
– avid_useR
Aug 7 at 17:51
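The difference between the two patterns discussed in these comments can be checked in base R. This is only a sketch; the test strings "a_bc" and "a_b_m" are made-up examples, chosen to have more than one character after the underscore.

```r
# Hypothetical item names with one, two, and several characters after the "_"
items <- c("a_b", "a_bc", "a_b_m")

# "_.$" only matches an underscore followed by exactly one final character
one_char <- sub("_.$", "", items)

# "_.+$" matches from the first underscore to the end of the string
one_or_more <- sub("_.+$", "", items)

print(one_char)     # "a"  "a_bc"  "a_b"
print(one_or_more)  # "a"  "a"     "a"
```

With single-character suffixes, as in the question's sample data, both patterns give the same key, which is why the original version worked on that sample.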





The philosophy is that if you want to arrange by something (i.e. a substring here) you have to obtain it first:




df <- read.table(text = "
item mean
a_b 5
a_c 2
a_a 4
b_d 7
b_f 3
b_e 1
", header = TRUE, stringsAsFactors = FALSE)

library(tidyverse)

df %>%
  separate(item, c("item1", "item2"), remove = FALSE) %>% # split items while keeping the original column
  arrange(item1, mean) %>% # arrange by what you really want
  select(item, mean) # keep only relevant columns

# item mean
# 1 a_c 2
# 2 a_a 4
# 3 a_b 5
# 4 b_e 1
# 5 b_f 3
# 6 b_d 7



Note that there are various ways to pick the 1st letter from a string. I just decided to use separate here.





In case you have many items separated by _ you'll still need to extract the first item, so you can replace the first _ with another delimiter (let's say :) and separate your column on that:




df <- read.table(text = "
item mean
a_b_m 5
a_c 2
a_a 4
b_d_x_q 7
b_f 3
b_e 1
", header = TRUE, stringsAsFactors = FALSE)

library(tidyverse)
library(stringr)

df %>%
  mutate(item2 = str_replace(item, "_", ":")) %>% # replace only the first "_" with ":"
  separate(item2, c("item1", "item2"), remove = FALSE, sep = ":") %>% # split on that single ":"
  arrange(item1, mean) %>%
  select(item, mean)

# item mean
# 1 a_c 2
# 2 a_a 4
# 3 a_b_m 5
# 4 b_e 1
# 5 b_f 3
# 6 b_d_x_q 7





How can I modify this so that it works if I have over 200 items? I do not want to write c("item1", "item2", etc) for over 200 items.
– melbez
Aug 7 at 17:41





Depends. Are they all separated by _? You can separate by the first _ and give the names "item1" and "rest items". Based on your example I couldn't imagine you could have that many items. :) I thought to use separate in case you want to order by the second item (i.e. letter)
– AntoniosK
Aug 7 at 17:52







Yes, all the items are separated by an underscore.
– melbez
Aug 7 at 17:55





I've updated my answer. Hope it's helpful.
– AntoniosK
Aug 7 at 18:07
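The idea from the comments above, splitting only at the first underscore so that no name is needed per piece, can also be sketched with the extra = "merge" argument of tidyr::separate. This is a variation on the answer, not the answer's own code, and the column name rest is just a placeholder.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  item = c("a_b_m", "a_c", "a_a", "b_d_x_q", "b_f", "b_e"),
  mean = c(5L, 2L, 4L, 7L, 3L, 1L),
  stringsAsFactors = FALSE
)

sorted <- df %>%
  # extra = "merge" splits at most once, so everything after the
  # first "_" lands in the second column ("rest" is a placeholder name)
  separate(item, c("item1", "rest"), sep = "_", extra = "merge", remove = FALSE) %>%
  arrange(item1, mean) %>%
  select(item, mean)

print(sorted)
```

Because only two column names are ever needed, this should scale to items with any number of underscore-separated parts.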



A base R solution would be


inx <- order(substr(df$item, 1, 1), df$mean)
newdf <- df[inx, ]

newdf
# item mean
#2 a_c 2
#3 a_a 4
#1 a_b 5
#6 b_e 1
#5 b_f 3
#4 b_d 7
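The question also asks how to use filter to pick out the rows whose item begins with "a_" or "b_". A minimal sketch, assuming dplyr and base R's startsWith() (available since R 3.3.0); the names df_a and df_b are only illustrative:

```r
library(dplyr)

df <- data.frame(
  item = c("a_b", "a_c", "a_a", "b_d", "b_f", "b_e"),
  mean = c(5L, 2L, 4L, 7L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Keep only the rows whose item starts with the given prefix,
# then sort each subset by mean
df_a <- df %>% filter(startsWith(item, "a_")) %>% arrange(mean)
df_b <- df %>% filter(startsWith(item, "b_")) %>% arrange(mean)

print(df_a)
print(df_b)
```

This gives two separate data frames rather than one; with many prefixes, sorting by an extracted key as in the answers above is likely less work.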





