MemoryError #224


Closed
venkatesh71097 opened this issue May 31, 2020 · 13 comments · Fixed by #226
@venkatesh71097

While trying to read waveform data with rdsamp / rdrecord, I run into this MemoryError many times. I'm using 64-bit Python and I'm not sure how to solve it. Any leads?

@Lucas-Mc
Contributor

Lucas-Mc commented Jun 1, 2020

That is really interesting... Do you have a stack trace?

@Lucas-Mc Lucas-Mc self-assigned this Jun 1, 2020
@venkatesh71097
Author

Hello, thank you so much for getting back! I'm trying to read a very large waveform file. When I set sampto=50500000 I can read the file, but I need the entire data file. This is the code:
signals, fields = wfdb.rdsamp('../../physionet.org/files/mimic3wdb-matched/1.0/p00/p002029/p002029-2160-02-27-17-40')

Error trace:


MemoryError Traceback (most recent call last)
in ()
9 import wfdb
10
---> 11 signals, fields = wfdb.rdsamp('../../physionet.org/files/mimic3wdb-matched/1.0/p00/p002029/p002029-2160-02-27-17-40')

~/anaconda3/envs/python3/lib/python3.6/site-packages/wfdb/io/record.py in rdsamp(record_name, sampfrom, sampto, channels, pb_dir, channel_names, warn_empty)
1392 sampto=sampto, channels=channels, physical=True,
1393 pb_dir=pb_dir, m2s=True, channel_names=channel_names,
-> 1394 warn_empty=warn_empty)
1395
1396 signals = record.p_signal

~/anaconda3/envs/python3/lib/python3.6/site-packages/wfdb/io/record.py in rdrecord(record_name, sampfrom, sampto, channels, physical, pb_dir, m2s, smooth_frames, ignore_skew, return_res, force_channels, channel_names, warn_empty)
1301 os.path.join(dir_name, record.seg_name[seg_num]),
1302 sampfrom=seg_ranges[i][0], sampto=seg_ranges[i][1],
-> 1303 channels=seg_channels[i], physical=physical, pb_dir=pb_dir)
1304
1305 # Arrange the fields of the layout specification segment, and

~/anaconda3/envs/python3/lib/python3.6/site-packages/wfdb/io/record.py in rdrecord(record_name, sampfrom, sampto, channels, physical, pb_dir, m2s, smooth_frames, ignore_skew, return_res, force_channels, channel_names, warn_empty)
1239 if physical:
1240 # Perform inplace dac to get physical signal
-> 1241 record.dac(expanded=False, return_res=return_res, inplace=True)
1242
1243 # Return each sample of the signals with multiple samples per frame

~/anaconda3/envs/python3/lib/python3.6/site-packages/wfdb/io/_signal.py in dac(self, expanded, return_res, inplace)
485 # Do float conversion immediately to avoid potential under/overflow
486 # of efficient int dtype
--> 487 self.d_signal = self.d_signal.astype(floatdtype, copy=False)
488 np.subtract(self.d_signal, self.baseline, self.d_signal)
489 np.divide(self.d_signal, self.adc_gain, self.d_signal)

MemoryError:

@Lucas-Mc
Contributor

Lucas-Mc commented Jun 1, 2020

Hey @venkatesh71097, I ran this locally and it crashed on the first try, just like yours. Then I closed out a lot of programs and got it to run successfully, though it was a wild ride!

I first ran this command:
signals, fields = wfdb.rdsamp('p002029-2160-02-27-17-40', pb_dir='mimic3wdb/matched/p00/p002029')
Look at the memory usage (quickly going up to 15 GB)!
[screenshot: memory usage climbing toward 15 GB]
And the size!
[screenshot: size of the loaded array]
At first I thought it was a numpy error with large arrays, and I think it still is. I need to work on optimizing the computations and doing them in sections when memory issues are expected. If you still can't get it to work, maybe try reading each individual segment file for now until the memory usage is improved? You could also edit the main header file and read half of the record at a time. Hope this helps!
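The "read it in sections" workaround can be sketched roughly like this; `chunk_ranges` and `read_in_chunks` are hypothetical helper names, not wfdb APIs, and chunking only helps if each chunk is processed or downcast before the next one is loaded:

```python
# Sketch of reading a long record in sections rather than all at once;
# chunk_ranges and read_in_chunks are hypothetical helpers, not wfdb APIs.

def chunk_ranges(sig_len, chunk_len):
    """Return (sampfrom, sampto) pairs covering [0, sig_len)."""
    return [(start, min(start + chunk_len, sig_len))
            for start in range(0, sig_len, chunk_len)]

def read_in_chunks(record_name, chunk_len=10_000_000):
    import wfdb  # imported here so chunk_ranges stays usable without wfdb installed
    sig_len = wfdb.rdheader(record_name).sig_len
    chunks = []
    for sampfrom, sampto in chunk_ranges(sig_len, chunk_len):
        signals, _ = wfdb.rdsamp(record_name, sampfrom=sampfrom, sampto=sampto)
        chunks.append(signals.astype('float32'))  # downcast each chunk right away
    return chunks
```

Downcasting each chunk as it arrives keeps the peak footprint near one chunk of float64 plus the accumulated float32 data, instead of the whole record at float64.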

@Lucas-Mc Lucas-Mc added the bug label Jun 1, 2020
@venkatesh71097
Author

venkatesh71097 commented Jun 1, 2020

Hey lucas!

Thank you so much! I'm running on AWS, and I still get this error even after shutting down everything else that was running. As you said, I can remove a few headers from the files and run them, but this error pops up in many other files as well. I'm running my code to read the signals of one particular channel name across the MIMIC matched subset data (approx. 10,300 files). Could there be a way around this memory error, such as an option for precision (int/float, as in MATLAB's WFDB toolbox) or an option for downsampling the signal? In the worst case, I'm wondering if there's a way to skip the files that run into MemoryError (which I'd rather not do, as it might deprive me of important files), or to stop reading samples just before the error occurs.

To give you some background on what I'm doing: I've classified the patients by one particular disease using the icd9_code in the metadata, and I'm running the entire dataset through this. I couldn't map the seq_num (in diagnoses_icd) to the segment number in the matched subset data, as they don't match. If needed, I can share more details in a personal conversation. Thanks in advance!
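The "skip the files that fail" idea mentioned above could look roughly like this; `read_all` and `read_record` are hypothetical names standing in for a loop around wfdb.rdsamp, not wfdb APIs:

```python
# Rough sketch of skipping records that raise MemoryError instead of crashing;
# read_record stands in for the actual wfdb.rdsamp call.

def read_all(record_names, read_record):
    results, skipped = {}, []
    for name in record_names:
        try:
            results[name] = read_record(name)
        except MemoryError:
            skipped.append(name)  # log it so no file is silently lost
    return results, skipped
```

Keeping the skipped list means the problem files can be retried later (for example with a smaller sampto) instead of being lost.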

@Lucas-Mc
Contributor

Lucas-Mc commented Jun 1, 2020

Hey @venkatesh71097, good idea! I was thinking this too, since it's running at float64 right now, which is way more precision than needed, especially for a large file like the one you're reading. And the good news is that going from float64 to float16 gives a 4x memory reduction, at the cost of only a little precision! I'll open an issue for letting the user specify the numpy datatype they want. I think downsampling is a good one to work on later, but hopefully the datatype conversion will be a simple fix first. Thanks for posting!
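The 4x figure is easy to check with numpy; note the cast is not lossless, since float16 carries only about three decimal digits of precision:

```python
import numpy as np

# float64 -> float16 cuts memory 4x (8 bytes -> 2 bytes per sample),
# but it is lossy: float16 carries only ~3 decimal digits of precision.
sig64 = np.linspace(-1.0, 1.0, 1_000_000)   # stand-in for a physical signal
sig16 = sig64.astype(np.float16)

print(sig64.nbytes // sig16.nbytes)          # 4
print(float(np.abs(sig64 - sig16.astype(np.float64)).max()))  # small but nonzero
```

For many plotting and screening tasks that rounding error is negligible, but it matters for fine-grained morphology analysis.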

@venkatesh71097
Author

Hey Lucas! Thank you so much. It'll be a big breakthrough for my research if this is sorted out soon. I'd been struggling with this issue over the weekend: some files get stuck in rdrecord, and some get stuck in rdsamp. Thanks again! :)

@Lucas-Mc
Contributor

Lucas-Mc commented Jun 1, 2020

Some preliminary results. Hopefully this compression helps out! I'll work on this after finishing up a commit I'm working on for EDF2MIT! 👍
[screenshot: preliminary memory-reduction results]

@venkatesh71097
Author

Wow, awesome! Thank you so much, Lucas! :)

@venkatesh71097
Author

venkatesh71097 commented Jun 1, 2020

Hey Lucas! There's one more follow-up issue. For another file, p000109, I tried reading and plotting the data from one channel, 'PLETH'. I was able to read the file once I restricted it to the channel_name 'PLETH' (for many other files, as we discussed, I can't read them even when limiting to the 'PLETH' channel). But I can't plot it, because I run into a MemoryError again. PFA the code and trace. Plotting isn't strictly required for my use case, but it helps in visualizing the data; I just thought I'd report it since it might help others as well.

Code:
signal, fields = wfdb.rdsamp('../../physionet.org/files/mimic3wdb-matched/1.0/p00/p000109/p000109-2141-10-21-02-00', channel_names=['PLETH'])
q = signal.astype('float16')
plt.plot(q)

Trace:

MemoryError Traceback (most recent call last)
in ()
11 signal, fields = wfdb.rdsamp('../../physionet.org/files/mimic3wdb-matched/1.0/p00/p000109/p000109-2141-10-21-02-00', channel_names = ['PLETH'])
12 q = signal.astype('float16')
---> 13 plt.plot(q)

~/anaconda3/envs/python3/lib/python3.6/site-packages/matplotlib/pyplot.py in plot(scalex, scaley, data, *args, **kwargs)
2809 return gca().plot(
2810 *args, scalex=scalex, scaley=scaley, **({"data": data} if data
-> 2811 is not None else {}), **kwargs)
2812
2813

~/anaconda3/envs/python3/lib/python3.6/site-packages/matplotlib/__init__.py in inner(ax, data, *args, **kwargs)
1808 "the Matplotlib list!)" % (label_namer, func.__name__),
1809 RuntimeWarning, stacklevel=2)
-> 1810 return func(ax, *args, **kwargs)
1811
1812 inner.__doc__ = _add_data_doc(inner.__doc__,

~/anaconda3/envs/python3/lib/python3.6/site-packages/matplotlib/axes/_axes.py in plot(self, scalex, scaley, *args, **kwargs)
1610
1611 for line in self._get_lines(*args, **kwargs):
-> 1612 self.add_line(line)
1613 lines.append(line)
1614

~/anaconda3/envs/python3/lib/python3.6/site-packages/matplotlib/axes/_base.py in add_line(self, line)
1893 line.set_clip_path(self.patch)
1894
-> 1895 self._update_line_limits(line)
1896 if not line.get_label():
1897 line.set_label('_line%d' % len(self.lines))

~/anaconda3/envs/python3/lib/python3.6/site-packages/matplotlib/axes/_base.py in _update_line_limits(self, line)
1915 Figures out the data limit of the given line, updating self.dataLim.
1916 """
-> 1917 path = line.get_path()
1918 if path.vertices.size == 0:
1919 return

~/anaconda3/envs/python3/lib/python3.6/site-packages/matplotlib/lines.py in get_path(self)
943 """
944 if self._invalidy or self._invalidx:
--> 945 self.recache()
946 return self._path
947

~/anaconda3/envs/python3/lib/python3.6/site-packages/matplotlib/lines.py in recache(self, always)
647 y = self._y
648
--> 649 self._xy = np.column_stack(np.broadcast_arrays(x, y)).astype(float)
650 self._x, self._y = self._xy.T # views
651

~/anaconda3/envs/python3/lib/python3.6/site-packages/numpy/lib/shape_base.py in column_stack(tup)
367 arr = array(arr, copy=False, subok=True, ndmin=2).T
368 arrays.append(arr)
--> 369 return _nx.concatenate(arrays, 1)
370
371 def dstack(tup):

MemoryError:

@Lucas-Mc
Contributor

Lucas-Mc commented Jun 1, 2020

Hey @venkatesh71097! That seems like an issue with matplotlib.pyplot, since the array you're trying to plot is so large. Maybe you can reduce its size with plt.plot(q[::100]), or whatever downsampling step you want? You may have to play around with it, and 100 may not be large enough, but it should help you out!
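A quick sketch of the decimation, using a synthetic signal in place of the actual PLETH data:

```python
import numpy as np

# Decimate before plotting: q[::step] keeps every step-th sample, shrinking
# the array matplotlib has to turn into a path. Synthetic signal shown here.
q = np.sin(np.linspace(0.0, 100.0, 1_000_000)).astype(np.float16)
step = 100
q_small = q[::step]

print(q_small.shape[0])   # 10000
# plt.plot(q_small)       # now plots 10,000 points instead of 1,000,000
```

Plain striding can alias fast transients; for a signal like PLETH, picking a step well below the waveform period (or averaging within each window) keeps the plotted shape faithful.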

@venkatesh71097
Author

Hey Lucas, sure! I'll plot using smaller slices; that should be fine. Do let me know once the datatype option has been added to the repo! Thanks in advance! :)

@Lucas-Mc
Contributor

Lucas-Mc commented Jun 2, 2020

Hey @venkatesh71097! Some preliminary results here (and I think you'll like them)!

[screenshot: timing comparison before and after the datatype change]

Over half the time!! Pull request coming soon!

@venkatesh71097
Author

Hey! Thank you so much. It's great that you're taking computation time into consideration too, since it currently takes a long time to process the data; reading certain files on my local drive took a few hours. If you could reduce the time consumption even more, that would be really great!

Lucas-Mc added a commit that referenced this issue Jun 2, 2020
Adds a datatype parameter to the rdrecord and rdsamp functions, allowing the user to increase computation speed at the expense of accuracy and significant figures in the signal data. Fixes #224 and #225.
Lucas-Mc added a commit that referenced this issue Jun 2, 2020
Adds datatype parameter in rdrecord/rdsamp #224 #225