Conversation
Looks good! Thanks for contributing. Can you add a test to make sure we don't drop that parameter going forward? Also, it looks like the flake8 lint failed (from Travis).
I'm pretty sure this PR is incorrect: you're confusing requests' chunk_size with influxdb's chunk_size. They are two separate things. In influxdb it means the number of POINTS per line in the HTTP response; in requests it means the number of bytes. I'm not sure what you're trying to fix?
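For illustration, a minimal sketch of the two unrelated parameters. The endpoint, database name, and query here are assumptions, not code from this PR:

```python
import requests

# InfluxDB's chunk_size: the number of POINTS per chunk in the query response.
params = {
    "q": "SELECT * FROM cpu",
    "db": "telegraf",
    "chunked": "true",
    "chunk_size": 10000,  # points per chunk returned by the server
}
resp = requests.get("http://localhost:8086/query", params=params, stream=True)

# requests' chunk_size: the number of BYTES read from the socket per iteration.
for line in resp.iter_lines(chunk_size=512):  # 512 bytes, not 512 points
    print(line)
```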
@psy0rz the issue here is that the client code partially implements chunking, i.e. it defines the associated variables but never actually does anything with them. This PR properly implements those variables and passes them to `response.iter_lines`. Is there a better way to do this?
@xginn8 apologies for the delay - I used my fork for my migration project and haven't needed to come back to this. The CI tests were failing because of a setuptools break that showed up a month or two ago; I believe an updated setuptools will fix the problem. Or do you see something else? The actual problem I'm trying to fix is a performance problem: for extracting ~70k measurements, the client would peg the CPU for 60s, whereas with the specified chunk size it takes a couple of seconds. You can refer to the issue at #524. I can also look at adding a test to make sure the param isn't dropped.
I don't understand... influxdb chunking is something different than HTTP chunking, right?
I fixed your problem with PR #538. It's called streaming mode.
Great, then once that's approved I can do a test. If it solves my performance problem then I'll close this.
I'm traversing 20 million items without very much CPU. The problem you're having is the same one I and everyone else had: the result is combined in memory and then returned as one big result, so it takes a lot of time and memory. Perhaps you're willing to try it by cloning my fork from my GitHub page? It would be great if you could test it :)
@psy0rz is correct, this PR is confusing influx's chunk size with HTTP's (I added the support for influx chunk size). The perceived 'better performance' is only because a 512-byte chunk size on the HTTP request naturally results in fewer influx data points, which take less time to be turned into a list. Returning a generator instead of a pre-generated list is the right way to go, yes, but it breaks compatibility, which is why it was not included in the original chunked responses PR. That said, the existing code leaves a lot of room for better performance even without resorting to generators, which do not help if the client is going to iterate over all items in the generator anyway. If breaking compatibility is feasible, I'd be up for taking a stab at rewriting the bulk of the series parsing code to that end.
My PR is backwards compatible.
But it does not solve the issue of point parsing being bottlenecked on the client (Python) side.
The parsing performance, you mean? My patches only add a generator mode, which will solve memory issues and supports infinite result sets. Performance will be roughly the same with smaller result sets.
Yes. 'Infinite' result sets are not a thing unless you have infinite amounts of RAM :) (influx will truncate). Like I said, a generator does not change performance at all when the user really does want to parse all results in the generator. Performance is in both cases bottlenecked on the Python client side.
I'm using my patch successfully with result sets of over 200 million data points.
And is doing
No, only use streaming mode with large result sets you want to iterate over. If you have a result set that you want to keep in memory anyway, the old method might be faster. (For other people reading this, we're talking about #538.)
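For readers following along, a rough sketch of the generator idea being discussed: parse each chunk of the HTTP response as it arrives and yield points one at a time, rather than building the full list in memory first. This is not the actual code from #538; the function name, endpoint, and parameters are assumptions.

```python
import json
import requests

def stream_points(query, db, host="localhost", port=8086, chunk_size=10000):
    """Yield points one at a time from a chunked InfluxDB query response.

    Illustrative sketch only: authentication and error handling are omitted.
    """
    params = {"q": query, "db": db, "chunked": "true", "chunk_size": chunk_size}
    resp = requests.get("http://%s:%s/query" % (host, port),
                        params=params, stream=True)
    for line in resp.iter_lines():  # the server emits one JSON document per chunk
        if not line:
            continue
        chunk = json.loads(line)
        for result in chunk.get("results", []):
            for series in result.get("series", []):
                for values in series.get("values", []):
                    yield dict(zip(series["columns"], values))

# Memory stays bounded even for very large result sets:
# for point in stream_points("SELECT * FROM cpu", db="telegraf"):
#     handle(point)
```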
Yes, the point is that the overhead of result parsing is significant and bottlenecks clients that do want to process all results. That parsing performance, as well as the memory footprint (for a full result set), has a lot of room for improvement, for which perhaps completely new result-gathering functions and classes would be best so as not to impact backwards compatibility (at the expense of having multiple ways to get results). If enough users feel that parsing performance and memory requirements for a full result set are an issue, then I can take a stab at it; it is impacting my own uses of the library as well.
`chunk_size=0` was already a kwarg param for `client.query()`, but it was not actually passed through to `response.iter_lines`. This PR properly passes the parameter down, defaulting to `requests.models.ITER_CHUNK_SIZE`.
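As a rough sketch of what the change amounts to (the helper name and surrounding logic are simplified assumptions, not the client's exact code):

```python
from requests.models import ITER_CHUNK_SIZE

def _read_chunked_response(response, chunk_size=ITER_CHUNK_SIZE):
    # chunk_size is now forwarded to iter_lines instead of being accepted
    # by query() and then silently ignored.
    for line in response.iter_lines(chunk_size=chunk_size):
        if line:
            yield line
```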