Order dataframe based on inherent values within column

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP



Order dataframe based on inherent values within column



I would like to capture inherent values of within the dataframe and then order the columns and rows from descending to ascending based on the number of events in each column and row.



Sample Data


#A tibble: 26 x 9
sample_id Gene_A Gene_B Gene_C Gene_D Gene_E Gene_F Gene_G Gene_H
<fct> <int> <int> <int> <int> <int> <int> <int> <int>
1 A -1 0 0 0 -1 0 0 -1
2 B 1 0 -1 1 -1 -1 -1 0
3 C 1 0 -1 0 1 0 0 -1
4 D -1 0 0 -1 1 1 -1 1
5 E 1 1 1 1 -1 1 -1 0
6 F -1 -1 1 1 1 -1 0 0
7 G 0 0 -1 -1 0 -1 0 -1
8 H 1 1 1 0 1 -1 -1 0
9 I 0 -1 -1 -1 0 -1 0 1
10 J -1 0 0 1 -1 -1 0 1
# ... with 16 more rows



The final result I am going for is a table that is ordered on the following hierarchy:



Here is the sample output:


# A tibble: 26 x 9
sample_id Gene_B Gene_G Gene_H Gene_A Gene_C Gene_D Gene_F Gene_E
* <chr> <int> <int> <int> <int> <int> <int> <int> <int>
1 A 0 0 -1 -1 0 0 0 -1
2 U 0 -1 0 0 0 -1 0 1
3 C 0 0 -1 1 -1 0 0 1
4 G 0 0 -1 0 -1 -1 -1 0
5 W 0 -1 1 1 0 1 0 0
6 Y 0 0 1 1 0 1 1 0
7 I -1 0 1 0 -1 -1 -1 0
8 J 0 0 1 -1 0 1 -1 -1
9 O 0 1 0 0 1 -1 1 1
10 P 1 -1 -1 0 -1 0 0 -1
# ... with 16 more rows



My first though was to take absolute sum and add a column with total for each sample, take absolute sum and add a row with total for each column and then use order.



Generating Sample Data


dummy.tb <- tibble (sample_id = (sample (1:30,30)), Gene_A = (sample
(-1:1,30, replace = T)), Gene_B = (sample (-1:1,30, replace = T)))

dummy1.tb <- tibble (Gene_C = (sample (-1:1,30, replace = T)), Gene_D
= (sample (-1:1,30, replace = T)), Gene_E = (sample (-1:1,30, replace = T)))

dummy2.tb <- tibble (Gene_F = (sample (-1:1,30, replace = T)), Gene_G
= (sample (-1:1,30, replace = T)), Gene_H = (sample (-1:1,30, replace = T)))

dummy.tb <- cbind.data.frame(dummy.tb, dummy1.tb, dummy2.tb)

dummy.genes <- c ("Gene_A", "Gene_B", "Gene_C", "Gene_D", "Gene_E",
"Gene_F", "Gene_G", "Gene_H")

dummy.total <- as.data.frame (dummy.total)



Add Col_Total and Column Sums


dummy.total <- dummy.tb %>% bind_rows(summarise_all(., funs(if(is.numeric(.)) sum(abs(.)) else "Col_Total")))

dummy.total <- as.data.frame (dummy.total)



sort for column


dummy.total <- dummy.total [,order(dummy.total[nrow(dummy.total),], decreasing = FALSE)]



delete Col_Total row


dummy.total <- dummy.total %>% filter (!sample_id == "Col_Total")

dummy.total <- as.data.frame (dummy.total)



Add Row Totals and Row Sums


dummy.total <- dummy.total %>% mutate (Row_Total = rowSums (abs((select (., one_of(dummy.genes))))))

dummy.total <- as.data.frame (dummy.total)



sort for row


dummy.total <- dummy.total [order (dummy.total [,ncol(dummy.total)], decreasing = FALSE),]



delete Row_Total column


dummy.total <- dummy.total %>% select (-Row_Total)



Reorder with sample_id


dummy.total <- dummy.total %>% select (sample_id, everything())

dummy.total <- as.tibble(dummy.total)



This code does works and gives the desired output but it seems too verbose with lot of changes in structures along the way. Any suggestions using tidyverse or other methods? I tried using arrange () instead of order for ordering the columns based on values on Col_Total but did not seem to work.


arrange ()


Col_Total



Note: It is per event, not just a straight forward sum so absolute values are included to calculate the number of events.





Please correct the quotes. It is giving errors
– akrun
Aug 9 at 23:38





Changed the quotes for dummy.genes
– KP1
Aug 10 at 14:34




1 Answer
1



For this problem matrix is a better fit as you handle homogeneous (numeric) data. If you assign column names and sample_id to dimnames of matrix you will be able to preserve the column and row identifiers after sorting.


matrix


sample_id


dimnames


matrix



I recommend you to use set.seed so your examples will be reproducible and it will be possible to validate the answer with the desired output.


set.seed



Please see as below:


set.seed(123)
n <- 30
m <- 9
mat <- matrix(
sample(-1:1, n * m, replace = TRUE),
nrow = n,
dimnames = list(1:n, paste("Gene", LETTERS[1:m], sep = "_"))
)
foo <- mat[, order(colSums(abs(mat)))]
bar <- foo[order(rowSums(abs(foo))), ]
head(bar)



Output:


Gene_F Gene_D Gene_I Gene_G Gene_C Gene_A Gene_H Gene_B Gene_E
18 -1 0 0 0 0 -1 0 0 1
15 0 0 0 1 0 -1 -1 -1 0
27 0 0 0 0 1 0 -1 -1 -1
1 1 -1 0 1 0 -1 0 1 0
3 0 0 -1 1 0 0 -1 1 -1
6 0 -1 1 0 0 -1 1 0 1






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard