This repository was archived by the owner on Oct 29, 2024. It is now read-only.

Properly use chunk_size param #523

Closed
wants to merge 3 commits into from

Conversation

anthonyserious

chunk_size=0 was already a kwarg param for client.query(), but it was not actually passed through to response.iter_lines.

This PR properly passes the parameter down, defaulting to requests.models.ITER_CHUNK_SIZE.
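A minimal sketch of the intended change, with invented stand-in names (`FakeResponse`, `query`) rather than the actual influxdb-python internals:

```python
# Hypothetical sketch of forwarding a chunk_size kwarg down to
# requests' Response.iter_lines; names are illustrative only.
ITER_CHUNK_SIZE = 512  # the default of requests.models.ITER_CHUNK_SIZE

class FakeResponse:
    """Stand-in for requests.Response, for demonstration purposes."""
    def __init__(self, body):
        self.body = body

    def iter_lines(self, chunk_size=ITER_CHUNK_SIZE):
        # A real Response reads chunk_size BYTES from the socket per
        # iteration; this fake just splits a stored body into lines.
        for line in self.body.splitlines():
            yield line

def query(response, chunk_size=ITER_CHUNK_SIZE):
    # Before this PR the kwarg was accepted but never used;
    # the fix is simply to pass it through to iter_lines.
    return list(response.iter_lines(chunk_size=chunk_size))

resp = FakeResponse('{"results": []}\n{"results": []}')
print(query(resp, chunk_size=1024))
```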

Collaborator

@xginn8 xginn8 left a comment


Looks good! Thanks for contributing. Can you add a test to make sure we don't drop that parameter going forward? Also, looks like the flake8 lint failed (from travis).

@psy0rz

psy0rz commented Nov 26, 2017

I'm pretty sure this PR is incorrect: You're confusing requests chunk_size with influxdb chunk_size. They are two separate things.

In influxdb it means the number of POINTS per line in the http response.

In requests it means the number of bytes.

I'm not sure what you're trying to fix?
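For readers landing here, the two identically named parameters measure different things (the values below are illustrative, not recommendations):

```python
# 1. InfluxDB's chunk_size: server-side, the number of POINTS per
#    chunked JSON object in the HTTP response body.
influx_query_params = {"q": "SELECT * FROM cpu", "chunked": "true",
                       "chunk_size": "10000"}

# 2. requests' chunk_size: client-side, the number of BYTES read from
#    the socket per iteration while decoding that same response, e.g.
#    response.iter_lines(chunk_size=512).
requests_chunk_size_bytes = 512
```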

@anthonyserious
Author

@psy0rz the issue here is that the client code partially implements chunking: it defines the associated variables but never actually does anything with them. This PR properly implements those variables and passes them to requests, so yes, we are now passing the param to requests.

Is there a better way to do this?

@anthonyserious
Author

anthonyserious commented Nov 27, 2017

@xginn8 apologies for the delay - I used my fork for my migration project and haven't needed to come back to this. The CI tests were failing because of a setuptools break that showed up a month or two ago; I believe an updated setuptools will fix the problem. Or do you see something else?

The actual problem I'm trying to fix is a performance problem. For extracting ~70k measurements, the client would peg CPU for 60s, whereas with the specified chunk size it's a couple of seconds. You can refer to the issue at #524.

I can also look to add a test to make sure the param isn't dropped.

@psy0rz

psy0rz commented Nov 27, 2017

I don't understand... InfluxDB chunking is something different from HTTP chunking, right?

@psy0rz

psy0rz commented Nov 27, 2017

I fixed your problem with PR #538.

It's called streaming mode.
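Roughly, generator-based streaming looks like this (a sketch only, not #538's actual API; `stream_points` is a hypothetical name, and the real code works on the live HTTP response rather than a list of strings):

```python
import json

def stream_points(lines):
    """Yield points one at a time instead of building one big list.

    Each element of `lines` stands in for one chunked JSON response
    from InfluxDB.
    """
    for line in lines:
        chunk = json.loads(line)
        for result in chunk.get("results", []):
            for series in result.get("series", []):
                for point in series.get("values", []):
                    yield point

lines = [
    '{"results": [{"series": [{"values": [[1, 0.5], [2, 0.6]]}]}]}',
    '{"results": [{"series": [{"values": [[3, 0.7]]}]}]}',
]
print(list(stream_points(lines)))  # [[1, 0.5], [2, 0.6], [3, 0.7]]
```

Because consumers pull points as they arrive, memory stays flat no matter how large the result set is.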

@anthonyserious
Author

Great, then once that's approved I can do a test. If it solves my performance problem then I'll close this.

@psy0rz

psy0rz commented Nov 27, 2017

I'm traversing 20 million items without much CPU usage.

The problem you're having is the same one I and everyone else had: the result is combined in memory and then returned as one big result, so it takes a lot of time and memory.

Perhaps you're willing to try it by cloning my fork from my GitHub page?

Would be great if you could test it :)

@pkittenis

pkittenis commented Jan 2, 2018

@psy0rz is correct, this PR is confusing influx's chunk size with HTTP's (I added the support for influx chunk size). The perceived 'better performance' is only because a 512-byte chunk size on the HTTP request naturally results in fewer influx data points, which take less time to turn into a list.

Returning generator instead of pre-generated list is the right way to go, yes, but breaks compatibility which is why it was not included in the original chunked responses PR.

That said, the existing code leaves a lot of room for better performance even without resorting to generators, which do not help if the client is going to iterate over all items in the generator in any case. If breaking compatibility is feasible, I'd be up for taking a stab at re-writing the bulk of series parsing code to that end.

@psy0rz

psy0rz commented Jan 2, 2018

My PR is backwards-compatible.

@pkittenis

But does not solve the issue of point parsing being bottlenecked on client (python) side parsing.

@psy0rz

psy0rz commented Jan 3, 2018

The parsing performance, you mean? My patches only add a generator mode, which solves memory issues and supports infinite result sets. Performance will be roughly the same with smaller result sets.

@pkittenis

Yes. 'Infinite' result sets are not a thing unless you have infinite amounts of RAM :) (influx will truncate)

Like I said, generator does not change performance at all when the user really does want to parse all results in the generator. Performance is in both cases bottlenecked on python client side.
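A plain-Python illustration of that point (the function names are invented for the example): fully consuming a generator does the same per-item parse work as building the list up front; only peak memory differs.

```python
def parse_eager(raw):
    # Pre-generator behaviour: parse everything, return one big list.
    return [int(x) for x in raw]

def parse_lazy(raw):
    # Generator: identical per-item work, but O(1) extra memory, so it
    # helps the memory footprint, not the total parsing time.
    for x in raw:
        yield int(x)

raw = [str(i) for i in range(5)]
assert parse_eager(raw) == list(parse_lazy(raw))  # same results, and
# the same total parse cost when the consumer wants every item
```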

@psy0rz

psy0rz commented Jan 3, 2018

I'm using my patch successfully with result sets of over 200 million data points.

@pkittenis

And is doing list(results) where results is a generator with your patch any faster than getting results without generator or patch?

@psy0rz

psy0rz commented Jan 3, 2018

No, only use streaming mode with large result sets you want to iterate over. If you have a result set that you want to keep in memory anyway, the old method might be faster. (For other people reading this, we're talking about #538.)

@pkittenis

pkittenis commented Jan 3, 2018

Yes, the point is overhead of result parsing is significant and bottlenecks clients that do want to process all results.

That parsing performance as well as memory footprint (for all results) has a lot of room for improvement for which perhaps completely new result gathering functions and classes would be best, to not impact backwards compatibility (at the expense of multiple ways to do get results).

If enough users feel that parsing performance and memory requirements for all of a result set is an issue then I can take a stab at it - it is impacting my own uses of the library as well.

@sebito91 sebito91 self-assigned this Apr 8, 2020
@sebito91
Contributor

Closed by #753 and #538

@sebito91 sebito91 closed this Apr 10, 2020
5 participants