This repository was archived by the owner on Oct 29, 2024. It is now read-only.

Still can’t query big datasets #800

Closed
rbdm-qnt opened this issue Mar 31, 2020 · 10 comments

Comments

@rbdm-qnt

rbdm-qnt commented Mar 31, 2020

I'm on InfluxDB version 1.7.9 and PythonClient 5.2.3.

I have a database that weighs around 28 GB, and I'm trying to query it from a Mac (macOS 10.12.6, if that's relevant) with 16 GB of RAM, using the Python client (Python 3.7).

I've been fighting with this issue for a week now. At the beginning I would get this error:
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

Then I read issues number #450 #523 #531 #538 and #753, implemented the changes from issue #753, and now when I run the query Python simply gives me:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I was hoping that would turn client.query into a generator that yields each line as soon as it arrives, so I can process it, free the RAM, and fetch the next line. Basically streaming. "result" is now a generator, but that doesn't happen anyway.
This is my Python code:

client = InfluxDBClient(host='127.0.0.1', port=8086, username='x', password='x', database='x')
q = 'SELECT * FROM "x"'
result = client.query(q, chunked=True).get_points()
print("Query done " + str(datetime.utcfromtimestamp(time.time())))
# here I initialize an empty dataframe
for msg in result:
    # here I iterate through the results, process them, append them to the
    # dataframe, and then save the dataframe to a CSV

(the 'x' values are censored)

Thanks in advance

@russorat
Contributor

@rbdm-qnt thanks for opening this. Are you really doing a select * on a 28G database? If so, no amount of changes to the python client will help.

Try adding a time constraint to your query to reduce the data being scanned.

@rbdm-qnt
Author

> @rbdm-qnt thanks for opening this. Are you really doing a select * on a 28G database? If so, no amount of changes to the python client will help.
>
> Try adding a time constraint to your query to reduce the data being scanned.

I have no alternative to SELECT *: I'm querying financial data, and I need all fields or the data doesn't make sense. This is about 4.5 years of data in total, and if I query more than 3 months of it I run into this issue. I can't make 14 separate queries every single time and then patch the results together and reimport them; avoiding that is the reason I have a database instead of CSV files in the first place.

How can I make Influx return single rows to my functions, free them from RAM, then fetch the next row, and so on? It must be possible one way or another.

@russorat
Contributor

@rbdm-qnt You can loop your code and issue the query multiple times, each with a time > start_time AND time < end_time clause to reduce the data fetched with each query.

It's possible you might also be able to use https://docs.influxdata.com/influxdb/v1.7/query_language/data_exploration/#the-offset-and-soffset-clauses to accomplish this, but I would use time filters first.
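[Editor's note] The time-window approach suggested here can be scripted. A minimal sketch, not tested against a live server; the measurement name "trades" and the one-day window are placeholder assumptions. Generating half-open windows (>= on the lower bound, < on the upper) means each point falls in exactly one window, with no overlap and no holes between queries:

```python
from datetime import datetime, timedelta

def window_queries(measurement, start, end, step=timedelta(days=1)):
    """Yield InfluxQL queries covering [start, end) in half-open windows.

    Using >= on the lower bound and < on the upper bound means each point
    falls in exactly one window: no overlap, no gaps between queries.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield ('SELECT * FROM "{0}" '
               "WHERE time >= '{1}' AND time < '{2}'").format(
            measurement, cursor.strftime(fmt), upper.strftime(fmt))
        cursor = upper
```

Each generated query can then be passed to client.query(...) in turn, processing and discarding one window's points before issuing the next.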

@rbdm-qnt
Author

> @rbdm-qnt You can loop your code and issue the query multiple times, each with a time > start_time AND time < end_time clause to reduce the data fetched with each query.
>
> It's possible you might also be able to use https://docs.influxdata.com/influxdb/v1.7/query_language/data_exploration/#the-offset-and-soffset-clauses to accomplish this, but I would use time filters first.

Sounds like a plan. I'll try this right away and report back in a few hours. Shouldn't the chunked option do exactly that automatically? What about streaming?

@rbdm-qnt
Author

rbdm-qnt commented Apr 1, 2020

So, I've tried looping my query in various ways with no luck:

  1. 'SELECT * FROM "x" WHERE time > now() - x days AND time < now() - x days'
    changing the number of days every time. This is the only syntax that technically works, but it leaves holes. Say every query takes an hour: if I query time > now() - 10d AND time < now() - 9d, I get data from, for example, 4:26 am to 4:26 am the next day, but the next query runs at around 5 am, so every loop leaves a gap in the results.

  2. 'SELECT * FROM "x" LIMIT 100000 OFFSET ' + str(offset_val)
    increasing offset_val by 100000 at every loop. This should work like a charm in theory, but in practice it loops over the same 100000 points over and over, for some reason, even though printing offset_val shows it increasing correctly.

  3. 'SELECT * FROM "x" WHERE time > "' + str(start_time) + '" AND time < "' + str(finish_time) + '"'
    where start_time and finish_time take their values from iterating over a list where I saved all the dates month by month, so about 50 queries. THIS is the method I'd like to use, since it's the most precise. The problem is it won't recognise the dates no matter how I input them; it just gives me an empty list as a result (or a generator that yields nothing, if I use it as a generator). I've tried all possible formats: YYYY-MM-DDThh:mm:ss.nnnnnnnnnZ, without nanoseconds, just the date, without the T and Z, epoch... One thing I saw mentioned is that the dates need single quotes, not double quotes, but if I use single quotes around the dates I get an invalid syntax error. If I use single quotes on the dates and double quotes around the query I also get invalid syntax (even with just "SELECT * FROM 'x'", which works fine the other way around).

I really really don't know what else to try. I've read every single google page and documentation about this, and went through the entire python client. I'm out of ideas.

@rbdm-qnt
Author

rbdm-qnt commented Apr 1, 2020

Asked a friend to try a couple queries on his InfluxDB through Grafana:

  1. SELECT * FROM "x" WHERE time >= '2020-03-29 00:00:00' AND time <= '2020-03-29 00:05:00'
    this gives him back the whole database, as if Influx ignores the time boundaries
  2. SELECT * FROM "x" WHERE time >= '2020-03-29T00:00:00.000000000Z' AND time <= '2020-03-29T00:05:00.000000000Z'
    this gives back 0 results (even though the database does have data for those dates)

@russorat
Contributor

russorat commented Apr 1, 2020

@rbdm-qnt thanks for trying all those different combinations.

Here is an example query that works with influxQL and returns data:
SELECT usage_user FROM "telegraf"."autogen"."cpu" WHERE time > '2020-03-31' AND time < '2020-04-02'

So you can use those date formats. Could you try manually running that query for a day of data that you think is valid and see if anything comes back directly against your database?

Something like this?

http://localhost:8086/query?q=SELECT%20usage_user%20FROM%20%22telegraf%22.%22autogen%22.%22cpu%22%20WHERE%20time%20%3E%20%272020-03-31%27%20AND%20time%20%3C%20%272020-04-02%27
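[Editor's note] The percent-encoding in a URL like this doesn't have to be done by hand; Python's standard library can build it. A small sketch, assuming the default host and port used earlier in the thread:

```python
from urllib.parse import urlencode

def build_query_url(host, q, db=None):
    """Build an InfluxDB 1.x /query URL, encoding the query string.

    requests (or the Python client itself) does this encoding for you via
    its params= argument; building the URL by hand just makes the escaping
    visible. Note urlencode emits '+' for spaces rather than '%20';
    the server accepts both.
    """
    params = {"q": q}
    if db is not None:
        params["db"] = db  # target database, if not given in the query
    return "http://{0}:8086/query?{1}".format(host, urlencode(params))
```

For example, build_query_url("localhost", "SELECT ...") reproduces a URL equivalent to the one above, with double quotes escaped as %22 and single quotes as %27.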

@rbdm-qnt
Author

rbdm-qnt commented Apr 1, 2020

Thanks for this syntax, so partial good news, this:
SELECT * FROM "x" WHERE time > '2020-03-31' AND time < '2020-04-02'
works correctly on Grafana, which is a huge step forward. The problem is, in python I have to format the query as a string, and I run into the following problems:

  1. if I use double quotes around the whole line, I get invalid syntax because the FROM field already requires double quotes (and influxdb.exceptions.InfluxDBClientError: 400 if I use single quotes around the FROM field instead)
  2. if I use single quotes around the whole line, the dates are treated as ints and I get invalid syntax
  3. if I use no quotes at all around the whole line, I get invalid syntax
  4. if I put double quotes around the dates, it returns 0 elements
  5. if I use single quotes and wrap the dates in str(), I get either 0 elements or a syntax error, depending on whether I keep quotes inside the str() or not
  6. if I wrap the whole line in str(), I get a syntax error

I tried all of those with both:
result = client.query(q, chunked=True).get_points()
result = list(client.query(q, chunked=True).get_points())

@russorat
Contributor

russorat commented Apr 1, 2020

@rbdm-qnt you should be able to escape the inner double quotes with a backslash (\") when the whole query string is double quoted. Here is an example:

from influxdb import InfluxDBClient

client = InfluxDBClient('localhost', 8086)

result = client.query("SELECT usage_user FROM \"telegraf\".\"autogen\".\"cpu\" WHERE time > '2020-03-31' AND time < '2020-04-02'")

print("Result: {0}".format(result))

@rbdm-qnt
Author

rbdm-qnt commented Apr 1, 2020

YES! That fixed it. So, to recap for anyone reading in the future:

If you need to run queries bigger than your RAM, put the query in a while loop and use this syntax:
q = "SELECT * FROM \"x\" WHERE time > '" + str(start_time) + "' AND time < '" + str(finish_time) + "'"
where start_time and finish_time are two variables that iterate through two lists holding all the boundary dates (I'm dividing the dataset by months, so 51 dates for 4 years and 3 months). Save each query's results in a list, process your data by iterating that list, append it to a dataframe or whatever, empty the list, and loop again. When the index reaches the length of the dates list, break and save the processed data to a CSV or whatever you want.
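[Editor's note] The recap above can be sketched end-to-end. Assumptions in this sketch: the measurement name "x" follows the thread; the concrete date range (Jan 2016 through Mar 2020, which yields the 51 boundary dates mentioned) is hypothetical; and rows are streamed straight to disk with csv.DictWriter instead of being accumulated in a dataframe first, a substitution that keeps memory flat, which was the original goal:

```python
import csv
from datetime import datetime

def month_starts(first, last):
    """All month boundaries from first's month through last's month."""
    out, year, month = [], first.year, first.month
    while (year, month) <= (last.year, last.month):
        out.append(datetime(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return out

def windowed_queries(measurement, boundaries):
    """One InfluxQL query (inner double quotes escaped) per month window."""
    for lo, hi in zip(boundaries, boundaries[1:]):
        yield ("SELECT * FROM \"{0}\" WHERE time >= '{1:%Y-%m-%d}' "
               "AND time < '{2:%Y-%m-%d}'").format(measurement, lo, hi)

def dump_to_csv(client, measurement, boundaries, path):
    """Stream every point to disk; only one window is in flight at a time.

    client is an InfluxDBClient; chunked=True as used in the thread.
    """
    writer = None
    with open(path, "w", newline="") as fh:
        for q in windowed_queries(measurement, boundaries):
            for point in client.query(q, chunked=True).get_points():
                if writer is None:  # lazily build the header from first row
                    writer = csv.DictWriter(fh, fieldnames=point.keys())
                    writer.writeheader()
                writer.writerow(point)
```

Usage would look like dump_to_csv(client, 'x', month_starts(datetime(2016, 1, 1), datetime(2020, 3, 31)), 'out.csv'), with client built as earlier in the thread.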

@rbdm-qnt rbdm-qnt closed this as completed Apr 1, 2020