Import data into MATLAB from SQL database: 2 mil rows takes 30 sec, but 6 mil takes 21 min?







I can import 2 million rows using Matlab's DB toolbox in about 36 seconds. How can I import 6 million rows in under twenty minutes?



The final fetch step of the query below takes about 36 seconds.


q = 'select ... from mytable limit 2000000'; %notice 2 mil limit
result = exec(conn, q);
final_result = fetch(result); % Takes about 36 seconds.



My overall table has 6,097,227 rows. But if I do:


q = 'select ... from mytable';
result = exec(conn, q);
final_result = fetch(result);



MATLAB entirely loses it on the final fetch step! CPU usage goes to about 500-600% (i.e. 6 of 8 cores are being used), and it takes forever. Currently, with it set to fetch in 10k batches, it eventually finishes in just over 21 minutes.
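For context, the 10k batching is configured through the Database Toolbox fetch preferences; in the older, setdbprefs-based releases that looks roughly like this (a sketch only; preference names may differ in newer versions):

setdbprefs('FetchInBatches','yes');   % fetch in batches rather than all at once
setdbprefs('FetchBatchSize','10000'); % 10k rows per batch
result = exec(conn, q);
final_result = fetch(result);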



Ideas? What to do? I'm really struggling to see how this isn't at least roughly linear in the number of rows. Have I crossed some weird limit?



BTW: the whole query and import takes about 43 seconds in R using the PostgreSQL driver etc., with zero fiddling around. I can import into Stata in a similar time using ODBC.



Note: in the above queries, the ... stands for 10 or so numeric variables: some integers, some double precision. None are text.







Can you use Node.js to do the import?
– vitaly-t
Jun 13 '16 at 20:58





@vitaly-t I don't really see how Node.js applies here? (I edited the question a bit, I may have been unclear.)
– Matthew Gunn
Jun 13 '16 at 21:02





In that case I can only advise in general, as I did in my answer below.
– vitaly-t
Jun 13 '16 at 21:04





Do you fit in memory with the larger data set? Unreasonable slowing-down is often due to thrashing.
– Andras Deak
Jun 13 '16 at 21:04





@AndrasDeak Final table is .48GB and I have 16GB of memory. Raw table size can't be the issue. But maybe something weird happening with thrashing at the Java heap layer? hmmm... I have the max heap size set for 4GB though. I find it all rather bizarre. Maybe a question for Matlab support or just not using MATLAB for this analysis...
– Matthew Gunn
Jun 13 '16 at 21:09
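(Aside: the heap that MATLAB's JVM actually received can be checked from within MATLAB itself; a minimal sketch:)

% Report the JVM heap available to MATLAB (Java values are in bytes)
rt = java.lang.Runtime.getRuntime;
fprintf('max heap: %.2f GB, free: %.2f GB\n', ...
    double(rt.maxMemory)/2^30, double(rt.freeMemory)/2^30);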





3 Answers



Firstly, I suggest you try increasing the Java heap memory size. Secondly, for rather large volumes of imported/exported data the Matlab Database Toolbox may not be a very efficient connector to PostgreSQL: there is significant overhead in converting data to and from native Matlab formats. One way to reduce this overhead is to follow the approach proposed in http://undocumentedmatlab.com/blog/speeding-up-matlab-jdbc-sql-queries. But JDBC itself has certain limitations that cannot be worked around. This is well illustrated by the following figures (the fact that they show data insertion rather than retrieval changes nothing: the conversion overhead is there no matter which direction you pass the data):



[Figures: the case of scalar numeric data; the case of arrays.]



These figures compare the performance of fastinsert and datainsert with that of batchParamExec from PgMex (see https://pgmex.alliedtesting.com/#batchparamexec for details). The first figure is for scalar numeric data, the second for arrays. The endpoint of each curve corresponds to the maximum data volume the respective method could pass to the database without error; a larger volume (specific to each method) causes an "out of Java heap memory" error (the Java heap size for each experiment is given at the top of each figure). For further details of the experiments, see the "Performance comparison of PostgreSQL connectors in Matlab" article.





The main reason is that PgMex does not use JDBC at all: it is based on libpq and provides 100% binary data transfer between Matlab and PostgreSQL, with no text parsing. At the same time everything is returned in a Matlab-friendly, native form (matrices, multi-dimensional arrays, structures and other Matlab types), so no conversion of Java objects to Matlab format is performed.



As for data retrieval, preliminary experiments show that PgMex is approximately 3.5 times faster than the Matlab Database Toolbox for the simplest case of scalar numeric data. The code above can be rewritten using PgMex as follows (we assume that all the parameters marked by <> below are properly filled in, that the query q is correct, and that the types in fieldSpecStr correspond to the types of the already existing mytable in the corresponding database):




% Create the database connection
dbConn = com.allied.pgmex.pgmexec('connect',[...
    'host=<yourhost> dbname=<yourdb> port=<yourport> '...
    'user=<your_postgres_username> password=<your_postgres_password>']);

% Execute a query
q = 'select ... from mytable';
pgResult = com.allied.pgmex.pgmexec('exec',dbConn,q);

% Read the results: one cell per field, each holding a structure of columns
nFields = com.allied.pgmex.pgmexec('nFields',pgResult);
outCVec = cell(nFields,1);
fieldSpecStr = '%<field_type_1> %<field_type_2> ...';
inpCVec = num2cell(0:nFields-1);
[outCVec{:}] = com.allied.pgmex.pgmexec('getf',pgResult,...
    fieldSpecStr,inpCVec{:});



Please also see http://pgmex.alliedtesting.com/#getf for details on the format of the input and output arguments of the getf command (including fieldSpecStr). In short, each element of outCVec contains a structure with fields valueVec, isNullVec and isValueNullVec. All these fields have their size along the first dimension equal to the number of tuples retrieved; valueVec contains the values of the respective table field, while isNullVec and isValueNullVec are indicators of NULLs.
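For example, assuming every selected column is scalar numeric (as in the question), the retrieved columns can be gathered into a single matrix along these lines (a sketch built on the structure described above, taking each valueVec to be an N-by-1 double vector):

% Sketch: assemble the per-field results of getf into one numeric matrix.
nTuples = size(outCVec{1}.valueVec,1);
dataMat = zeros(nTuples,nFields);
for iField = 1:nFields
    dataMat(:,iField) = double(outCVec{iField}.valueVec);
end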





EDIT: Academic licenses for PgMex are available free of charge.



This is general advice on the strategy for large imports like this; if any of the components you use fails to follow it, you will naturally have problems.



First, import your records in batches of between 1,000 and 10,000 records, depending on the average size of your records.



Second, insert each batch with a single multi-row INSERT:




INSERT INTO TABLE(columns...) VALUES (first-insert values), (second-insert values),...



i.e. concatenate all the records of each batch into a single multi-row insert and execute it that way. It provides a tremendous saving on IO.
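In MATLAB terms, a minimal sketch of this strategy might look as follows (hypothetical names throughout: a Database Toolbox connection conn, a numeric matrix batchData, and a table mytable with three numeric columns col1, col2, col3):

% Sketch of batched multi-row INSERTs (hypothetical names: conn, batchData,
% mytable, col1..col3); one INSERT statement per batch of rows.
batchSize = 5000; % between 1,000 and 10,000 rows per batch
nRows = size(batchData,1);
for iStart = 1:batchSize:nRows
    rows = batchData(iStart:min(iStart+batchSize-1,nRows),:);
    % build "(v1,v2,v3),(v1,v2,v3),..." for this batch
    rowStrs = arrayfun(@(k) sprintf('(%.15g,%.15g,%.15g)',rows(k,:)), ...
        1:size(rows,1),'UniformOutput',false);
    exec(conn,['insert into mytable (col1,col2,col3) values ' ...
        strjoin(rowStrs,',')]);
end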



In case anyone encounters this type of problem in the future: I've found that for giant, roughly 1 GB-sized queries, it's faster and more robust to:






