as.matrix on a distance object is extremely slow; how to make it faster?
Clash Royale CLAN TAG#URR8PPP
as.matrix on a distance object is extremely slow; how to make it faster?
I found an R package Rlof
which uses multithreading to calculate distance matrices and it does a wonderful job.
Rlof
However, the output of the function distmc
is a vector rather than a matrix. Applying as.matrix
to this "dist" object turns out much more expensive than the multi-threaded calculation of distances.
distmc
as.matrix
Looking at the function help, the options for printing the diagonal and upper triangle are there, but I don't understand where they are supposed to be used.
Is there a way of saving the time of as.matrix
somehow?
as.matrix
Reproducible example:
set.seed(42)
M1 = matrix(rnorm(15000*20), nrow = 15000, ncol =20)
system.time(dA = distmc(M1, method = "euclidean", diag = TRUE,
upper = TRUE, p = 2))
system.time(A = as.matrix(dA))
Is there no way to do this efficiently in R? unfortunately I don't program in these languages
– Omry Atia
Aug 6 at 13:09
What you have there is basically a sparse matrix. Why do you want to coerce that to a dense matrix?
– Roland
Aug 6 at 13:15
So it will fit with the rest of the code I have. The code relies on using a symmetric matrix. I now understand that it might actually be better to revise the code...
– Omry Atia
Aug 6 at 13:17
@李哲源 For the lower triangle:
n <- 0.5 + sqrt(0.25 + 2 * length(x)) #calculate dimension; j <- rep(seq_len(n - 1), rev(seq_len(n - 1))) # col indices; i <- j + sequence(rev(seq_len(n - 1))) #row indices
– Roland
Aug 6 at 13:24
n <- 0.5 + sqrt(0.25 + 2 * length(x)) #calculate dimension; j <- rep(seq_len(n - 1), rev(seq_len(n - 1))) # col indices; i <- j + sequence(rev(seq_len(n - 1))) #row indices
1 Answer
1
dist
This function always returns a vector, holding the lower triangular part (by column) of the full matrix. The diag
or upper
argument only affects printing, i.e., stats:::print.dist
. This function can print this vector as if it were a matrix; but it is really not.
diag
upper
stats:::print.dist
as.matrix
It is hard to efficiently work with triangular matrices or to further make them symmetric in R core. lower.tri
and upper.tri
can be memory consuming if your matrix is large: R: Convert upper triangular part of a matrix to symmetric matrix.
lower.tri
upper.tri
Converting a "dist" object to a matrix is worse. Look at the code of stats:::as.matrix.dist
:
stats:::as.matrix.dist
function (x, ...)
size <- attr(x, "Size")
df <- matrix(0, size, size)
df[row(df) > col(df)] <- x
df <- df + t(df)
labels <- attr(x, "Labels")
dimnames(df) <- if (is.null(labels))
list(seq_len(size), seq_len(size))
else list(labels, labels)
df
The use of row
, col
and t
are a nightmare. And the final "dimnames<-"
generates another big temporary matrix object. When memory becomes a bottleneck, everything is slow.
row
col
t
"dimnames<-"
The awkward thing is that it is easier to work with a full matrix so we want it. Consider this example: R - How to get row & column subscripts of matched elements from a distance matrix. Operations are tricky if we try to avoid forming the complete matrix.
I write an Rcpp function dist2mat
(see "dist2mat.cpp" source file in the end of this answer).
dist2mat
The function has two inputs: a "dist" object x
and a (integer) cache blocking factor bf
. The function works by first creating a matrix and fill in its lower triangular part, then copying the lower triangular part to upper triangular to make it symmetric. The second step is a typical transposition operation and for large matrix cache blocking is worthwhile. Performance should be insensitive to the cache blocking factor as long as it is not too small or too large. 128 or 256 is generally a good choice.
x
bf
This is my first try with Rcpp. I have been a C programmer using R's conventional C interface. But I am on my way to get familiar with Rcpp as well. Given that you don't know how to write compiled code, your probably also don't know how to run Rcpp functions. You need to
Rcpp
getwd()
Now let's start the showcase.
library(Rcpp)
sourceCpp("dist2mat.cpp") ## this takes some time; be patient
## a simple test with `dist2mat`
set.seed(0)
x <- dist(matrix(runif(10), nrow = 5, dimnames = list(letters[1:5], NULL)))
A <- dist2mat(x, 128) ## cache blocking factor = 128
A
# a b c d e
#a 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871
#b 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296
#c 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650
#d 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478
#e 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000
The resulting matrix preserves the row names of your original matrix / data frame passed to dist
.
dist
You can tune the cache blocking factor on your machine. Note that the effects of cache blocking is not evident for small matrices. Here I tried a 10000 x 10000 one.
## mimic a "dist" object without actually calling `dist`
n <- 10000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)
system.time(A <- dist2mat(x, 64))
# user system elapsed
# 0.676 0.424 1.113
system.time(A <- dist2mat(x, 128))
# user system elapsed
# 0.532 0.140 0.672
system.time(A <- dist2mat(x, 256))
# user system elapsed
# 0.616 0.140 0.759
We can benchmark dist2mat
with as.matrix
. As as.matrix
is RAM consuming, I use a small example here.
dist2mat
as.matrix
as.matrix
## mimic a "dist" object without actually calling `dist`
n <- 2000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)
library(bench)
bench::mark(dist2mat(x, 128), as.matrix(x), check = FALSE)
## A tibble: 2 x 14
# expression min mean median max `itr/sec` mem_alloc n_gc n_itr
# <chr> <bch:tm> <bch:> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int>
#1 dist2mat(x, … 24.6ms 26ms 25.8ms 37.1ms 38.4 30.5MB 0 20
#2 as.matrix(x) 154.5ms 155ms 154.8ms 154.9ms 6.46 160.3MB 0 4
## ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
## time <list>, gc <list>
Note how dist2mat
is faster (see "mean", "median"), and also how less RAM (see "mem_alloc") it needs. I've set check = FALSE
to disable result checking, because dist2mat
returns no "dimnames" attribute (the "dist" object has no such info) but as.matrix
does (it sets 1:2000
as "dimnames"), so they are not exactly equal. But you can verify that they are both correct.
dist2mat
check = FALSE
dist2mat
as.matrix
1:2000
A <- dist2mat(x, 128)
B <- as.matrix(x)
range(A - B)
#[1] 0 0
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix dist2mat(NumericVector& x, int bf)
/* input validation */
if (!x.inherits("dist")) stop("Input must be a 'dist' object");
int n = x.attr("Size");
if (n > 65536) stop("R cannot create a square matrix larger than 65536 x 65536");
/* initialization */
NumericMatrix A(n, n);
/* use pointers */
size_t j, i, jj, ni, nj; double *ptr_x = &x[0];
double *A_jj, *A_ij, *A_ji, *col, *row, *end;
/* fill in lower triangular part */
for (j = 0; j < n; j++)
col = &A(j + 1, j); end = &A(n, j);
while (col < end) *col++ = *ptr_x++;
/* cache blocking factor */
size_t b = (size_t)bf;
/* copy lower triangular to upper triangular; cache blocking applied */
for (j = 0; j < n; j += b)
nj = n - j; if (nj > b) nj = b;
/* diagonal block has size nj x nj */
A_jj = &A(j, j);
for (jj = nj - 1; jj > 0; jj--, A_jj += n + 1)
/* copy a column segment to a row segment */
col = A_jj + 1; row = A_jj + n;
for (end = col + jj; col < end; col++, row += n) *row = *col;
/* off-diagonal blocks */
for (i = j + nj; i < n; i += b)
ni = n - i; if (ni > b) ni = b;
/* off-diagonal block has size ni x nj */
A_ij = &A(i, j); A_ji = &A(j, i);
for (jj = 0; jj < nj; jj++)
/* copy a column segment to a row segment */
col = A_ij + jj * n; row = A_ji + jj;
for (end = col + ni; col < end; col++, row += n) *row = *col;
/* add row names and column names */
A.attr("dimnames") = List::create(x.attr("Labels"), x.attr("Labels"));
return A;
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
An NxN symmetric matrix, such that the diagonal is zero
– Omry Atia
Aug 6 at 13:03