This repository was archived by the owner on Oct 29, 2024. It is now read-only.

Still can’t query big datasets #800

Closed
rbdm-qnt opened this issue Mar 31, 2020 · 10 comments

Comments

@rbdm-qnt

rbdm-qnt commented Mar 31, 2020

I'm on InfluxDB version 1.7.9 and PythonClient 5.2.3.

I have a database that weighs around 28 GB, and I'm trying to query it from a Mac (macOS 10.12.6, if that's relevant) with 16 GB of RAM, using the Python client (Python 3.7).

I've been fighting with this issue for a week now. At the beginning I would get this error:
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

Then I read issues number #450 #523 #531 #538 and #753, implemented the changes from issue #753, and now when I run the query Python simply gives me:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I was hoping that would turn client.query into a generator that yields each line as soon as it arrives, so I can process it, free the RAM, and fetch the next line. Basically streaming. "result" is now a generator, but that doesn't happen anyway.
This is my Python code:

client = InfluxDBClient(host='127.0.0.1', port=8086, username='x', password='x', database='x')
q = 'SELECT * FROM "x"'
result = client.query(q, chunked=True).get_points()
print("Query done " + str(datetime.utcfromtimestamp(time.time())))
# here I initialize an empty dataframe
for msg in result:
    # here I iterate through the results, process them, append them to the
    # dataframe, and then save the dataframe to a CSV

(the 'x' values are censored)

Thanks in advance

@russorat
Contributor

@rbdm-qnt thanks for opening this. Are you really doing a select * on a 28G database? If so, no amount of changes to the python client will help.

Try adding a time constraint to your query to reduce the data being scanned.

@rbdm-qnt
Author

> @rbdm-qnt thanks for opening this. Are you really doing a select * on a 28G database? If so, no amount of changes to the python client will help.
>
> Try adding a time constraint to your query to reduce the data being scanned.

I have no alternative to SELECT *: I'm querying financial data, and I need all fields or the data doesn't make sense. This is about 4.5 years of data in total, and if I query more than 3 months of it I run into this issue. I can't make 14 separate queries every single time and then patch the results together and reimport them; avoiding that is the reason I have a database instead of CSV files in the first place.

How can I make Influx return single rows to my functions, free them from RAM, then fetch the next row, and so on? It must be possible one way or another.

@russorat
Contributor

@rbdm-qnt You can loop your code and issue the query multiple times, each with a time > start_time AND time < end_time clause to reduce the data fetched with each query.

It's possible you might also be able to use https://docs.influxdata.com/influxdb/v1.7/query_language/data_exploration/#the-offset-and-soffset-clauses to accomplish this, but I would use time filters first.
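[Editor's note] The time-window approach suggested here can be scripted. A minimal sketch, not tested against a live server; the measurement name "trades" and the one-day window are placeholder assumptions. Generating half-open windows (>= on the lower bound, < on the upper) means each point falls in exactly one window, with no overlap and no holes between queries:

```python
from datetime import datetime, timedelta

def window_queries(measurement, start, end, step=timedelta(days=1)):
    """Yield InfluxQL queries covering [start, end) in half-open windows.

    Using >= on the lower bound and < on the upper bound means each point
    falls in exactly one window: no overlap, no gaps between queries.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield ('SELECT * FROM "{0}" '
               "WHERE time >= '{1}' AND time < '{2}'").format(
            measurement, cursor.strftime(fmt), upper.strftime(fmt))
        cursor = upper
```

Each generated query can then be passed to client.query(...) in turn, processing and discarding one window's points before issuing the next.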

@rbdm-qnt
Author

> @rbdm-qnt You can loop your code and issue the query multiple times, each with a time > start_time AND time < end_time clause to reduce the data fetched with each query.
>
> It's possible you might also be able to use https://docs.influxdata.com/influxdb/v1.7/query_language/data_exploration/#the-offset-and-soffset-clauses to accomplish this, but I would use time filters first.

Sounds like a plan. I'll try this right away and report back in a few hours. Shouldn't the chunked option do exactly that automatically? What about streaming?

@rbdm-qnt
Author

rbdm-qnt commented Apr 1, 2020

So, I've tried looping my query in various ways with no luck:

  1. 'SELECT * FROM "x" WHERE time > now() - x days AND time < now() - x days'
    changing the number of days every time. This is the only syntax that technically works, but it leaves holes. Say every query takes an hour: if I query time > now() - 10d AND time < now() - 9d, I get data from, for example, 4:26 am to 4:26 am the next day, but the next query runs at around 5 am, so every loop leaves a gap in the results.

  2. 'SELECT * FROM "x" LIMIT 100000 OFFSET ' + str(offset_val)
    increasing offset_val by 100000 at every loop. This should work like a charm in theory, but in practice it loops over the same 100000 points over and over, for some reason, even though printing offset_val shows it increasing correctly.

  3. 'SELECT * FROM "x" WHERE time > "' + str(start_time) + '" AND time < "' + str(finish_time) + '"'
    where start_time and finish_time take their values from iterating over a list where I saved all the dates month by month, so about 50 queries. THIS is the method I'd like to use, since it's the most precise. The problem is it won't recognise the dates no matter how I input them; it just gives me an empty list as a result (or a generator that yields nothing, if I use it as a generator). I've tried all possible formats: YYYY-MM-DDThh:mm:ss.nnnnnnnnnZ, without nanoseconds, just the date, without the T and Z, epoch... One thing I saw mentioned is that the dates need single quotes, not double quotes, but if I use single quotes around the dates I get an invalid syntax error. If I use single quotes on the dates and double quotes around the query I also get invalid syntax (even with just "SELECT * FROM 'x'", which works fine the other way around).

I really really don't know what else to try. I've read every single google page and documentation about this, and went through the entire python client. I'm out of ideas.

@rbdm-qnt
Author

rbdm-qnt commented Apr 1, 2020

Asked a friend to try a couple queries on his InfluxDB through Grafana:

  1. SELECT * FROM "x" WHERE time >= '2020-03-29 00:00:00' AND time <= '2020-03-29 00:05:00'
    this gives him back the whole database, as if Influx ignores the time boundaries
  2. SELECT * FROM "x" WHERE time >= '2020-03-29T00:00:00.000000000Z' AND time <= '2020-03-29T00:05:00.000000000Z'
    this gives back 0 results (even though the database does have data for those dates)

@russorat
Contributor

russorat commented Apr 1, 2020

@rbdm-qnt thanks for trying all those different combinations.

Here is an example query that works with influxQL and returns data:
SELECT usage_user FROM "telegraf"."autogen"."cpu" WHERE time > '2020-03-31' AND time < '2020-04-02'

So you can use those date formats. Could you try manually running that query for a day of data that you think is valid and see if anything comes back directly against your database?

Something like this?

http://localhost:8086/query?q=SELECT%20usage_user%20FROM%20%22telegraf%22.%22autogen%22.%22cpu%22%20WHERE%20time%20%3E%20%272020-03-31%27%20AND%20time%20%3C%20%272020-04-02%27
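[Editor's note] The percent-encoding in a URL like this doesn't have to be done by hand; Python's standard library can build it. A small sketch, assuming the default host and port used earlier in the thread:

```python
from urllib.parse import urlencode

def build_query_url(host, q, db=None):
    """Build an InfluxDB 1.x /query URL, encoding the query string.

    requests (or the Python client itself) does this encoding for you via
    its params= argument; building the URL by hand just makes the escaping
    visible. Note urlencode emits '+' for spaces rather than '%20';
    the server accepts both.
    """
    params = {"q": q}
    if db is not None:
        params["db"] = db  # target database, if not given in the query
    return "http://{0}:8086/query?{1}".format(host, urlencode(params))
```

For example, build_query_url("localhost", "SELECT ...") reproduces a URL equivalent to the one above, with double quotes escaped as %22 and single quotes as %27.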

@rbdm-qnt
Author

rbdm-qnt commented Apr 1, 2020

Thanks for this syntax, so partial good news, this:
SELECT * FROM "x" WHERE time > '2020-03-31' AND time < '2020-04-02'
works correctly on Grafana, which is a huge step forward. The problem is, in python I have to format the query as a string, and I run into the following problems:

  1. if I use double quotes around the whole line, I get invalid syntax because the FROM field already requires double quotes (and influxdb.exceptions.InfluxDBClientError: 400 if I use single quotes around the FROM field instead)
  2. if I use single quotes around the whole line, the dates are treated as ints and I get invalid syntax
  3. if I use no quotes at all around the whole line, I get invalid syntax
  4. if I put double quotes around the dates, it returns 0 elements
  5. if I use single quotes and wrap the dates in str(), I get either 0 elements or a syntax error, depending on whether I keep quotes inside the str() or not
  6. if I wrap the whole line in str(), I get a syntax error

I tried all of those with both:
result = client.query(q, chunked=True).get_points()
result = list(client.query(q, chunked=True).get_points())

@russorat
Contributor

russorat commented Apr 1, 2020

@rbdm-qnt you should be able to escape the inner double quotes with a backslash (\") when the whole query string is double quoted. Here is an example:

from influxdb import InfluxDBClient

client = InfluxDBClient('localhost', 8086)

result = client.query("SELECT usage_user FROM \"telegraf\".\"autogen\".\"cpu\" WHERE time > '2020-03-31' AND time < '2020-04-02'")

print("Result: {0}".format(result))

@rbdm-qnt
Author

rbdm-qnt commented Apr 1, 2020

YES! That fixed it. So, to recap for anyone reading in the future:

If you need to run queries bigger than your RAM, put the query in a while loop and use this syntax:
q = "SELECT * FROM \"x\" WHERE time > '" + str(start_time) + "' AND time < '" + str(finish_time) + "'"
where start_time and finish_time are two variables that iterate through two lists holding all the boundary dates (I'm dividing the dataset by months, so 51 dates for 4 years and 3 months). Save each query's results in a list, process your data by iterating that list, append it to a dataframe or whatever, empty the list, and loop again. When the index reaches the length of the dates list, break and save the processed data to a CSV or whatever you want.
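[Editor's note] The recap above can be sketched end-to-end. Assumptions in this sketch: the measurement name "x" follows the thread; the concrete date range (Jan 2016 through Mar 2020, which yields the 51 boundary dates mentioned) is hypothetical; and rows are streamed straight to disk with csv.DictWriter instead of being accumulated in a dataframe first, a substitution that keeps memory flat, which was the original goal:

```python
import csv
from datetime import datetime

def month_starts(first, last):
    """All month boundaries from first's month through last's month."""
    out, year, month = [], first.year, first.month
    while (year, month) <= (last.year, last.month):
        out.append(datetime(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return out

def windowed_queries(measurement, boundaries):
    """One InfluxQL query (inner double quotes escaped) per month window."""
    for lo, hi in zip(boundaries, boundaries[1:]):
        yield ("SELECT * FROM \"{0}\" WHERE time >= '{1:%Y-%m-%d}' "
               "AND time < '{2:%Y-%m-%d}'").format(measurement, lo, hi)

def dump_to_csv(client, measurement, boundaries, path):
    """Stream every point to disk; only one window is in flight at a time.

    client is an InfluxDBClient; chunked=True as used in the thread.
    """
    writer = None
    with open(path, "w", newline="") as fh:
        for q in windowed_queries(measurement, boundaries):
            for point in client.query(q, chunked=True).get_points():
                if writer is None:  # lazily build the header from first row
                    writer = csv.DictWriter(fh, fieldnames=point.keys())
                    writer.writeheader()
                writer.writerow(point)
```

Usage would look like dump_to_csv(client, 'x', month_starts(datetime(2016, 1, 1), datetime(2020, 3, 31)), 'out.csv'), with client built as earlier in the thread.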

@rbdm-qnt rbdm-qnt closed this as completed Apr 1, 2020