Memory problems for multiple large arrays
Clash Royale CLAN TAG#URR8PPP
Memory problems for multiple large arrays
I'm trying to do some calculations on over 1000 (100, 100, 1000)
arrays. But as I could imagine, it doesn't take more than about 150-200 arrays before my memory is used up, and it all fails (at least with my current code).
(100, 100, 1000)
This is what I currently have now:
import numpy as np
toxicity_data_path = open("data/toxicity.txt", "r")
toxicity_data = np.array(toxicity_data_path.read().split("n"), dtype=int)
patients = range(1, 1000, 1)
The above is just a list of 1's and 0's (indicating toxicity or not) for each array (in this case one array is data for one patient). So in this case roughly 1000 patients.
I then create two lists from the above code so I have one list with patients having toxicity and one where they have not.
patients_no_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("0")]
patients_with_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("1")]
I then write this function, which takes an already saved-to-disk array ((100, 100, 1000)
) for each patient, and then remove some indexes (which is also loaded from a saved file) on each array that will not work later on, or just needs to be removed. So it is essential to do so. The result is a final list of all patients and their 3D arrays of data. This is where things start to eat memory, when the function is used in the list comprehension.
(100, 100, 1000)
def log_likely_list(patient, remove_index_list):
array_data = np.load("data//array.npy".format(patient)).ravel()
return np.delete(array_data, remove_index_list)
remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
Next step is to create two lists that I need for my calculations. I take the final list, with all the patients, and remove either patients that have toxicity or not, respectively.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
The last piece of the puzzle is to use these two lists in the following equation, where I put the non-tox list into the right side of the equation, and with tox on the left side. It then sums up for all 1000 patients for each individual index in the 3D array of all patients, i.e. same index in each 3D array/patient, and then I end up with a large list of values pretty much.
log_likely = np.sum(np.log(patients_with_tox_list), axis=1) +
np.sum(np.log(1 - patients_no_tox_list), axis=1)
My problem, as stated is, that when I get around 150-200 (in the patients
range) my memory is used, and it shuts down.
I have obviously tried to save stuff on the disk to load (that's why I load so many files), but that didn't help me much. I'm thinking maybe I could go one array at a time and into the log_likely
function, but in the end, before summing, I would probably just have just as large an array, plus, the computation might be a lot slower if I can't use the numpy sum feature and such.
patients
log_likely
So is there any way I could optimize/improve on this, or is the only way to but a hell of lot more RAM ?
This would be the first time I've heard of it... :/ How would that work ?
– Denver Dang
Aug 12 at 23:16
How big is
toxicity.txt
?– mooglinux
Aug 12 at 23:29
toxicity.txt
As long as the range. So around a thousand 1 or 0's. So no big deal. Just a plain list.
– Denver Dang
Aug 12 at 23:39
What about the data for each patient? How large are those files?
– mooglinux
Aug 12 at 23:40
1 Answer
1
Each time you use a list comprehension, you create a new copy of the data in memory. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
contains the complete data for all 1000 patients!
The better choice is to utilize generator expressions, which process items one at a time. To form a generator, surround your for...in...:
expression with parentheses instead of brackets. This might look something like:
for...in...:
with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data, axis=1) for data in with_tox_data)
no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data, axis=1) for data in no_tox_data)
final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. The fastest way to aggregate all the results in this case is to use reduce
:
reduce
log_likely = functools.reduce(np.add, final_data)
It seems like the memory problem is being taken care of, but your last line (
log_likely
) doesn't work with me. itertools
does not have reduce
, and if I use functools
as I have seen suggested (don't even know if they do the same) np.add
doesn't work as argument.– Denver Dang
Aug 13 at 8:59
log_likely
itertools
reduce
functools
np.add
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Have you considered using a generator instead of loading everything into memory all at once?
– mooglinux
Aug 12 at 23:14