Order dataframe based on inherent values within column
Clash Royale CLAN TAG#URR8PPP
Order dataframe based on inherent values within column
I would like to capture inherent values of within the dataframe and then order the columns and rows from descending to ascending based on the number of events in each column and row.
Sample Data
#A tibble: 26 x 9
sample_id Gene_A Gene_B Gene_C Gene_D Gene_E Gene_F Gene_G Gene_H
<fct> <int> <int> <int> <int> <int> <int> <int> <int>
1 A -1 0 0 0 -1 0 0 -1
2 B 1 0 -1 1 -1 -1 -1 0
3 C 1 0 -1 0 1 0 0 -1
4 D -1 0 0 -1 1 1 -1 1
5 E 1 1 1 1 -1 1 -1 0
6 F -1 -1 1 1 1 -1 0 0
7 G 0 0 -1 -1 0 -1 0 -1
8 H 1 1 1 0 1 -1 -1 0
9 I 0 -1 -1 -1 0 -1 0 1
10 J -1 0 0 1 -1 -1 0 1
# ... with 16 more rows
The final result I am going for is a table that is ordered on the following hierarchy:
Here is the sample output:
# A tibble: 26 x 9
sample_id Gene_B Gene_G Gene_H Gene_A Gene_C Gene_D Gene_F Gene_E
* <chr> <int> <int> <int> <int> <int> <int> <int> <int>
1 A 0 0 -1 -1 0 0 0 -1
2 U 0 -1 0 0 0 -1 0 1
3 C 0 0 -1 1 -1 0 0 1
4 G 0 0 -1 0 -1 -1 -1 0
5 W 0 -1 1 1 0 1 0 0
6 Y 0 0 1 1 0 1 1 0
7 I -1 0 1 0 -1 -1 -1 0
8 J 0 0 1 -1 0 1 -1 -1
9 O 0 1 0 0 1 -1 1 1
10 P 1 -1 -1 0 -1 0 0 -1
# ... with 16 more rows
My first though was to take absolute sum and add a column with total for each sample, take absolute sum and add a row with total for each column and then use order.
Generating Sample Data
dummy.tb <- tibble (sample_id = (sample (1:30,30)), Gene_A = (sample
(-1:1,30, replace = T)), Gene_B = (sample (-1:1,30, replace = T)))
dummy1.tb <- tibble (Gene_C = (sample (-1:1,30, replace = T)), Gene_D
= (sample (-1:1,30, replace = T)), Gene_E = (sample (-1:1,30, replace = T)))
dummy2.tb <- tibble (Gene_F = (sample (-1:1,30, replace = T)), Gene_G
= (sample (-1:1,30, replace = T)), Gene_H = (sample (-1:1,30, replace = T)))
dummy.tb <- cbind.data.frame(dummy.tb, dummy1.tb, dummy2.tb)
dummy.genes <- c ("Gene_A", "Gene_B", "Gene_C", "Gene_D", "Gene_E",
"Gene_F", "Gene_G", "Gene_H")
dummy.total <- as.data.frame (dummy.total)
Add Col_Total and Column Sums
dummy.total <- dummy.tb %>% bind_rows(summarise_all(., funs(if(is.numeric(.)) sum(abs(.)) else "Col_Total")))
dummy.total <- as.data.frame (dummy.total)
sort for column
dummy.total <- dummy.total [,order(dummy.total[nrow(dummy.total),], decreasing = FALSE)]
delete Col_Total row
dummy.total <- dummy.total %>% filter (!sample_id == "Col_Total")
dummy.total <- as.data.frame (dummy.total)
Add Row Totals and Row Sums
dummy.total <- dummy.total %>% mutate (Row_Total = rowSums (abs((select (., one_of(dummy.genes))))))
dummy.total <- as.data.frame (dummy.total)
sort for row
dummy.total <- dummy.total [order (dummy.total [,ncol(dummy.total)], decreasing = FALSE),]
delete Row_Total column
dummy.total <- dummy.total %>% select (-Row_Total)
Reorder with sample_id
dummy.total <- dummy.total %>% select (sample_id, everything())
dummy.total <- as.tibble(dummy.total)
This code does works and gives the desired output but it seems too verbose with lot of changes in structures along the way. Any suggestions using tidyverse or other methods? I tried using arrange ()
instead of order for ordering the columns based on values on Col_Total
but did not seem to work.
arrange ()
Col_Total
Note: It is per event, not just a straight forward sum so absolute values are included to calculate the number of events.
Changed the quotes for dummy.genes
– KP1
Aug 10 at 14:34
1 Answer
1
For this problem matrix
is a better fit as you handle homogeneous (numeric) data. If you assign column names and sample_id
to dimnames
of matrix
you will be able to preserve the column and row identifiers after sorting.
matrix
sample_id
dimnames
matrix
I recommend you to use set.seed
so your examples will be reproducible and it will be possible to validate the answer with the desired output.
set.seed
Please see as below:
set.seed(123)
n <- 30
m <- 9
mat <- matrix(
sample(-1:1, n * m, replace = TRUE),
nrow = n,
dimnames = list(1:n, paste("Gene", LETTERS[1:m], sep = "_"))
)
foo <- mat[, order(colSums(abs(mat)))]
bar <- foo[order(rowSums(abs(foo))), ]
head(bar)
Output:
Gene_F Gene_D Gene_I Gene_G Gene_C Gene_A Gene_H Gene_B Gene_E
18 -1 0 0 0 0 -1 0 0 1
15 0 0 0 1 0 -1 -1 -1 0
27 0 0 0 0 1 0 -1 -1 -1
1 1 -1 0 1 0 -1 0 1 0
3 0 0 -1 1 0 0 -1 1 -1
6 0 -1 1 0 0 -1 1 0 1
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Please correct the quotes. It is giving errors
– akrun
Aug 9 at 23:38