Description
It seems that couchdb-python, or one of the libraries it uses, leaks memory. This can lead to memory exhaustion on big tasks. A simple example that only reads from a server with about 20k documents (the URL and database name in the snippet are placeholders):
import psutil
import os
import gc
from couchdb import Server

def show_memory():
    process = psutil.Process(os.getpid())
    meminfo = process.memory_info()
    print('Memory usage:')
    print("\tResident: %d (kb)" % (meminfo[0] / 1024))
    print("\tVirtual: %d (kb)" % (meminfo[1] / 1024))

url = 'http://localhost:5984/'  # placeholder CouchDB URL
server = Server(url)
db = server['mydb']             # placeholder name of the ~20k document database

print("Before")
show_memory()

# Iterate over all document ids once; nothing is kept.
for x in db:
    pass

print("After")
show_memory()

print("After collect")
gc.collect()
show_memory()

print("DB deleted")
del db
show_memory()

print("Server deleted")
del server
show_memory()
Output:
Before
Memory usage:
    Resident: 16444 (kb)
    Virtual: 98680 (kb)
After
Memory usage:
    Resident: 18444 (kb)
    Virtual: 102928 (kb)
After collect
Memory usage:
    Resident: 17932 (kb)
    Virtual: 102416 (kb)
DB deleted
Memory usage:
    Resident: 17932 (kb)
    Virtual: 102416 (kb)
Server deleted
Memory usage:
    Resident: 17932 (kb)
    Virtual: 102416 (kb)
I.e. the memory is retained even after both the db and the server objects have been deleted and a full garbage collection has run. A minimal way to inspect which objects stay alive after such a run is sketched below.
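One stdlib-only way to narrow down what is being retained is to compare live object counts by type before and after the read loop. This is only a diagnostic sketch, not part of the original report; the URL and database name are again placeholders:

import gc
from collections import Counter

from couchdb import Server

def type_counts():
    # Count live Python objects by type name after a full GC pass.
    gc.collect()
    return Counter(type(o).__name__ for o in gc.get_objects())

url = 'http://localhost:5984/'  # placeholder
server = Server(url)
db = server['mydb']             # placeholder database name

before = type_counts()
for _ in db:
    pass
del db, server

after = type_counts()
# Types whose instance count grew during the run are leak candidates.
for name, delta in (after - before).most_common(10):
    print("%-30s +%d" % (name, delta))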
During batch import of large datasets this retention can lead to memory exhaustion on some systems. Testcase: import of 50k randomly generated documents, 50 fields per document, 50 bytes per field, with document ids generated by Python's uuid4() function. The import is done in batches of 20k documents using the db.update(docs) method; each batch is generated, uploaded, and then deleted before the next one is built.
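The benchmark.py referenced in the traceback below is not included in this report; the following is only a rough sketch of what such a script might look like. The constants, URL, database name and field layout are assumptions, and the interleaved show_memory() calls from the first example are indicated as comments:

import gc
import random
import string
import uuid

from couchdb import Server

TOTAL_DOCS = 50000
BATCH_SIZE = 20000
FIELDS_PER_DOC = 50
BYTES_PER_FIELD = 50

def random_value():
    return ''.join(random.choice(string.ascii_letters) for _ in range(BYTES_PER_FIELD))

def make_batch(size):
    docs = []
    for _ in range(size):
        doc = dict(('field%d' % i, random_value()) for i in range(FIELDS_PER_DOC))
        doc['_id'] = uuid.uuid4().hex
        docs.append(doc)
    return docs

url = 'http://localhost:5984/'   # placeholder
server = Server(url)
db = server['bulktest']          # placeholder database name

uploaded = 0
batch_nr = 0
while uploaded < TOTAL_DOCS:
    size = min(BATCH_SIZE, TOTAL_DOCS - uploaded)
    print("Generating batch nr. %d" % batch_nr)
    # show_memory()  (helper from the first example)
    docs = make_batch(size)
    print("Before Upload")
    # show_memory()
    print("Uploading Batch nr %d" % batch_nr)
    db.update(docs)              # bulk upload via _bulk_docs
    del docs                     # drop the batch before building the next one
    gc.collect()
    print("Upload done, docs deleted, gc.collect()")
    # show_memory()
    uploaded += size
    batch_nr += 1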
Output:
Generating batch nr. 0
Memory usage:
    Resident: 19096 (kb)
    Virtual: 225364 (kb)
Before Upload
Memory usage:
    Resident: 221952 (kb)
    Virtual: 428276 (kb)
Uploading Batch nr 0
Upload done, docs deleted, gc.collect()
Memory usage:
    Resident: 87528 (kb)
    Virtual: 294720 (kb)
Generating batch nr. 1
Memory usage:
    Resident: 87532 (kb)
    Virtual: 294724 (kb)
Before Upload
Memory usage:
    Resident: 226804 (kb)
    Virtual: 433988 (kb)
Uploading Batch nr 1
Traceback (most recent call last):
  File "./benchmark.py", line 198, in <module>
    do_benchmark()
  File "./benchmark.py", line 153, in do_benchmark
    db.update(docs)
  File "/usr/local/lib/python2.7/site-packages/couchdb/client.py", line 785, in update
    _, _, data = self.resource.post_json('_bulk_docs', body=content)
  File "/usr/local/lib/python2.7/site-packages/couchdb/http.py", line 545, in post_json
    **params)
  File "/usr/local/lib/python2.7/site-packages/couchdb/http.py", line 564, in _request_json
    headers=headers, **params)
  File "/usr/local/lib/python2.7/site-packages/couchdb/http.py", line 560, in _request
    credentials=self.credentials)
  File "/usr/local/lib/python2.7/site-packages/couchdb/http.py", line 261, in request
    body = json.encode(body).encode('utf-8')
  File "/usr/local/lib/python2.7/site-packages/couchdb/json.py", line 69, in encode
    return _encode(obj)
  File "/usr/local/lib/python2.7/site-packages/couchdb/json.py", line 117, in <lambda>
    dumps(obj, allow_nan=False, ensure_ascii=False)
  File "/usr/lib64/python2.7/dist-packages/simplejson/__init__.py", line 386, in dumps
    **kw).encode(obj)
  File "/usr/lib64/python2.7/dist-packages/simplejson/encoder.py", line 275, in encode
    return u''.join(chunks)
MemoryError
I.e. the first update succeeds while the second one fails with a MemoryError. This indicates that the problem is not caused by the batch size alone: if it were, either both uploads would fail or both would succeed. Instead, some resource does not seem to get freed correctly between uploads.
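One way to narrow this down (again only a hedged sketch, not taken from the original benchmark) is to create a fresh Server and database object for every batch, so that anything held by the client's HTTP session becomes collectable between uploads:

import gc
from couchdb import Server

url = 'http://localhost:5984/'   # placeholder
dbname = 'bulktest'              # placeholder

def upload_batch(docs):
    server = Server(url)         # fresh client per batch
    db = server[dbname]
    db.update(docs)
    del db, server               # drop the client before the next batch is built
    gc.collect()

If the resident size still grows when each batch uses a fresh client, the retention is more likely to sit in the JSON encoding path (the traceback above shows the whole _bulk_docs body being built as a single string via u''.join(chunks)) than in the connection handling.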