as.matrix on a distance object is extremely slow; how to make it faster?

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP



as.matrix on a distance object is extremely slow; how to make it faster?



I found an R package Rlof which uses multithreading to calculate distance matrices and it does a wonderful job.


Rlof



However, the output of the function distmc is a vector rather than a matrix. Applying as.matrix to this "dist" object turns out much more expensive than the multi-threaded calculation of distances.


distmc


as.matrix



Looking at the function help, the options for printing the diagonal and upper triangle are there, but I don't understand where they are supposed to be used.



Is there a way of saving the time of as.matrix somehow?


as.matrix



Reproducible example:


set.seed(42)
M1 = matrix(rnorm(15000*20), nrow = 15000, ncol =20)
system.time(dA = distmc(M1, method = "euclidean", diag = TRUE,
upper = TRUE, p = 2))
system.time(A = as.matrix(dA))





An NxN symmetric matrix, such that the diagonal is zero
– Omry Atia
Aug 6 at 13:03





Is there no way to do this efficiently in R? unfortunately I don't program in these languages
– Omry Atia
Aug 6 at 13:09





What you have there is basically a sparse matrix. Why do you want to coerce that to a dense matrix?
– Roland
Aug 6 at 13:15






So it will fit with the rest of the code I have. The code relies on using a symmetric matrix. I now understand that it might actually be better to revise the code...
– Omry Atia
Aug 6 at 13:17





@李哲源 For the lower triangle: n <- 0.5 + sqrt(0.25 + 2 * length(x)) #calculate dimension; j <- rep(seq_len(n - 1), rev(seq_len(n - 1))) # col indices; i <- j + sequence(rev(seq_len(n - 1))) #row indices
– Roland
Aug 6 at 13:24



n <- 0.5 + sqrt(0.25 + 2 * length(x)) #calculate dimension; j <- rep(seq_len(n - 1), rev(seq_len(n - 1))) # col indices; i <- j + sequence(rev(seq_len(n - 1))) #row indices




1 Answer
1


dist



This function always returns a vector, holding the lower triangular part (by column) of the full matrix. The diag or upper argument only affects printing, i.e., stats:::print.dist. This function can print this vector as if it were a matrix; but it is really not.


diag


upper


stats:::print.dist


as.matrix



It is hard to efficiently work with triangular matrices or to further make them symmetric in R core. lower.tri and upper.tri can be memory consuming if your matrix is large: R: Convert upper triangular part of a matrix to symmetric matrix.


lower.tri


upper.tri



Converting a "dist" object to a matrix is worse. Look at the code of stats:::as.matrix.dist:


stats:::as.matrix.dist


function (x, ...)

size <- attr(x, "Size")
df <- matrix(0, size, size)
df[row(df) > col(df)] <- x
df <- df + t(df)
labels <- attr(x, "Labels")
dimnames(df) <- if (is.null(labels))
list(seq_len(size), seq_len(size))
else list(labels, labels)
df



The use of row, col and t are a nightmare. And the final "dimnames<-" generates another big temporary matrix object. When memory becomes a bottleneck, everything is slow.


row


col


t


"dimnames<-"



The awkward thing is that it is easier to work with a full matrix so we want it. Consider this example: R - How to get row & column subscripts of matched elements from a distance matrix. Operations are tricky if we try to avoid forming the complete matrix.



I write an Rcpp function dist2mat (see "dist2mat.cpp" source file in the end of this answer).


dist2mat



The function has two inputs: a "dist" object x and a (integer) cache blocking factor bf. The function works by first creating a matrix and fill in its lower triangular part, then copying the lower triangular part to upper triangular to make it symmetric. The second step is a typical transposition operation and for large matrix cache blocking is worthwhile. Performance should be insensitive to the cache blocking factor as long as it is not too small or too large. 128 or 256 is generally a good choice.


x


bf



This is my first try with Rcpp. I have been a C programmer using R's conventional C interface. But I am on my way to get familiar with Rcpp as well. Given that you don't know how to write compiled code, your probably also don't know how to run Rcpp functions. You need to


Rcpp


getwd()



Now let's start the showcase.


library(Rcpp)
sourceCpp("dist2mat.cpp") ## this takes some time; be patient

## a simple test with `dist2mat`
set.seed(0)
x <- dist(matrix(runif(10), nrow = 5, dimnames = list(letters[1:5], NULL)))
A <- dist2mat(x, 128) ## cache blocking factor = 128
A
# a b c d e
#a 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871
#b 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296
#c 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650
#d 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478
#e 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000



The resulting matrix preserves the row names of your original matrix / data frame passed to dist.


dist



You can tune the cache blocking factor on your machine. Note that the effects of cache blocking is not evident for small matrices. Here I tried a 10000 x 10000 one.


## mimic a "dist" object without actually calling `dist`
n <- 10000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)

system.time(A <- dist2mat(x, 64))
# user system elapsed
# 0.676 0.424 1.113

system.time(A <- dist2mat(x, 128))
# user system elapsed
# 0.532 0.140 0.672

system.time(A <- dist2mat(x, 256))
# user system elapsed
# 0.616 0.140 0.759



We can benchmark dist2mat with as.matrix. As as.matrix is RAM consuming, I use a small example here.


dist2mat


as.matrix


as.matrix


## mimic a "dist" object without actually calling `dist`
n <- 2000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)

library(bench)
bench::mark(dist2mat(x, 128), as.matrix(x), check = FALSE)
## A tibble: 2 x 14
# expression min mean median max `itr/sec` mem_alloc n_gc n_itr
# <chr> <bch:tm> <bch:> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int>
#1 dist2mat(x, … 24.6ms 26ms 25.8ms 37.1ms 38.4 30.5MB 0 20
#2 as.matrix(x) 154.5ms 155ms 154.8ms 154.9ms 6.46 160.3MB 0 4
## ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
## time <list>, gc <list>



Note how dist2mat is faster (see "mean", "median"), and also how less RAM (see "mem_alloc") it needs. I've set check = FALSE to disable result checking, because dist2mat returns no "dimnames" attribute (the "dist" object has no such info) but as.matrix does (it sets 1:2000 as "dimnames"), so they are not exactly equal. But you can verify that they are both correct.


dist2mat


check = FALSE


dist2mat


as.matrix


1:2000


A <- dist2mat(x, 128)
B <- as.matrix(x)
range(A - B)
#[1] 0 0


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix dist2mat(NumericVector& x, int bf)

/* input validation */
if (!x.inherits("dist")) stop("Input must be a 'dist' object");
int n = x.attr("Size");
if (n > 65536) stop("R cannot create a square matrix larger than 65536 x 65536");

/* initialization */
NumericMatrix A(n, n);

/* use pointers */
size_t j, i, jj, ni, nj; double *ptr_x = &x[0];
double *A_jj, *A_ij, *A_ji, *col, *row, *end;

/* fill in lower triangular part */
for (j = 0; j < n; j++)
col = &A(j + 1, j); end = &A(n, j);
while (col < end) *col++ = *ptr_x++;


/* cache blocking factor */
size_t b = (size_t)bf;

/* copy lower triangular to upper triangular; cache blocking applied */
for (j = 0; j < n; j += b)
nj = n - j; if (nj > b) nj = b;
/* diagonal block has size nj x nj */
A_jj = &A(j, j);
for (jj = nj - 1; jj > 0; jj--, A_jj += n + 1)
/* copy a column segment to a row segment */
col = A_jj + 1; row = A_jj + n;
for (end = col + jj; col < end; col++, row += n) *row = *col;

/* off-diagonal blocks */
for (i = j + nj; i < n; i += b)
ni = n - i; if (ni > b) ni = b;
/* off-diagonal block has size ni x nj */
A_ij = &A(i, j); A_ji = &A(j, i);
for (jj = 0; jj < nj; jj++)
/* copy a column segment to a row segment */
col = A_ij + jj * n; row = A_ji + jj;
for (end = col + ni; col < end; col++, row += n) *row = *col;




/* add row names and column names */
A.attr("dimnames") = List::create(x.attr("Labels"), x.attr("Labels"));

return A;






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard