This repository was archived by the owner on Oct 29, 2024. It is now read-only.

Dataframe client support for (i) tag columns and (ii) direct conversion to line protocol #364

Merged: 8 commits, Sep 6, 2016


Contributor

@mdbartos mdbartos commented Aug 29, 2016

This pull request addresses issues #362 and #363.

  • tag_columns and field_columns can now be specified in the write_points method, allowing some columns to be treated as tags and others to be treated as fields. Global tags can still be specified using the tags keyword argument (meaning that this change shouldn't break any old code).
  • Dataframes are now converted directly to line protocol. This results in a ~5x speed boost compared to the old method.
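For readers unfamiliar with the target format, here is a minimal sketch of what one point looks like in line protocol. It is only an illustration of the output shape, not the code in _convert_dataframe_to_lines, which works on whole dataframe columns at once. The helper name to_line is hypothetical, and escaping of spaces, commas, and quotes is deliberately left out.

```python
def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point as a line-protocol string (hypothetical helper, no escaping)."""

    def format_field(value):
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, int):
            return "{}i".format(value)   # integer fields carry an 'i' suffix
        if isinstance(value, float):
            return repr(value)
        return '"{}"'.format(value)      # string fields are double-quoted

    # Tags follow the measurement name, each introduced by a comma.
    tag_part = "".join(",{}={}".format(k, v) for k, v in sorted(tags.items()))
    # Fields are comma-separated and set off from the tags by a space.
    field_part = ",".join(
        "{}={}".format(k, format_field(v)) for k, v in fields.items()
    )
    return "{}{} {} {}".format(measurement, tag_part, field_part, timestamp_ns)


line = to_line("cpu", {"host": "a", "region": "west"},
               {"load": 0.5, "cores": 4}, 1472428800000000000)
print(line)
```

Building these strings directly skips the intermediate list of JSON-style point dictionaries, which is presumably where the reported speedup comes from.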

Additions:

  • _dataframe_client.py
    • new functions:
      • _convert_dataframe_to_lines: Converts dataframe to line protocol.
      • _stringify_dataframe: Helper function for converting dataframe to string type.
    • changed functions:
      • write_points: Added protocol ('line' or 'json') and numeric precision keyword args.
      • _convert_dataframe_to_json: Tag columns can now be specified.
  • client.py
    • changed functions:
      • write: Added protocol ('line' or 'json') keyword arg, and support for direct line protocol.
      • write_points: Same as write.
      • _write_points: Same as write.
      • send_packet: Same as write.
  • dataframe_client_test.py
    • new functions:
      • test_write_points_from_dataframe_with_tag_columns: Self-explanatory.
      • test_write_points_from_dataframe_with_tag_cols_and_global_tags: Self-explanatory.
      • test_write_points_from_dataframe_with_tag_cols_and_defaults: Tests default behavior (i.e. when tag columns are specified, but field columns aren't, etc.)
      • test_write_points_from_dataframe_with_numeric_precision: Tests for correct numeric precision behavior.
    • changed functions:
      • (all tests): In expected_response, order of tags/fields was changed to match the order they appear in the dataframe (for more consistent testing)

Tag/field columns default behavior:

  • If neither tag columns nor field columns are specified, all columns are assumed to be field columns (this is consistent with previous behavior).
  • If tag columns are specified, but no field columns are specified, all column names not included in tag columns are assumed to be field columns.
  • If tag columns are not specified, but field columns are specified, all column names not included in field columns are assumed to be tag columns.
  • If tag columns and field columns are specified, only those columns included in tag columns or field columns are included in the write.
  • See test_write_points_from_dataframe_with_tag_cols_and_defaults in dataframe_client_test.py for examples of expected behavior.
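
The defaulting rules in this list can be sketched as a small helper. This is an illustrative reading of the four rules, not the code from _dataframe_client.py; the function name resolve_columns and the sample column names are hypothetical.

```python
def resolve_columns(columns, tag_columns=None, field_columns=None):
    """Return (tags, fields) following the four default rules (hypothetical helper)."""
    columns = list(columns)
    tags = list(tag_columns) if tag_columns else []
    fields = list(field_columns) if field_columns else []

    if not tags and not fields:
        # Rule 1: nothing specified, so every column is a field.
        fields = columns
    elif tags and not fields:
        # Rule 2: only tags specified, the remaining columns become fields.
        fields = [c for c in columns if c not in tags]
    elif fields and not tags:
        # Rule 3: only fields specified, the remaining columns become tags.
        tags = [c for c in columns if c not in fields]
    # Rule 4: both specified, so columns in neither list are left out.
    return tags, fields


cols = ["host", "region", "cpu", "mem"]
print(resolve_columns(cols))
print(resolve_columns(cols, tag_columns=["host"]))
print(resolve_columns(cols, field_columns=["cpu"]))
print(resolve_columns(cols, tag_columns=["host"], field_columns=["cpu"]))
```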

Minor issues:

  • Haven't tested with older versions of pandas.
  • I left _convert_dataframe_to_json in for the time being, but it can probably be removed.
  • To get the Travis build to work, I had to disable cache. Cache should be cleared.

When you get time to review, please let me know if you have any questions or concerns.

Thanks,
MDB

@mdbartos
Contributor Author

The Python 2.7 build is failing because Travis can't install pandas:

error: Error -5 while decompressing data: incomplete or truncated stream
...

ERROR: could not install deps [-r/home/travis/build/influxdata/influxdb-python/requirements.txt, -r/home/travis/build/influxdata/influxdb-python/test-requirements.txt, pandas]; v = InvocationError('/home/travis/build/influxdata/influxdb-python/.tox/py27/bin/pip install -r/home/travis/build/influxdata/influxdb-python/requirements.txt -r/home/travis/build/influxdata/influxdb-python/test-requirements.txt pandas (see /home/travis/build/influxdata/influxdb-python/.tox/py27/log/py27-1.log)', 2)

The cache might need to be cleared:
pypa/pip#3359

I can also try running a build with no cache.

datatype='field'):

# Find int and string columns for field-type data
int_columns = dataframe.select_dtypes(include=['int']).columns
Contributor


Can we change this to include=['integer'] so other integer subtypes won't be treated as floats?

@mdbartos
Contributor Author

@tzonghao: This change has been made, along with some other small improvements to _stringify_dataframe. I also added tests for floating point precision in test_write_points_from_dataframe_with_numeric_precision.

Also, for some reason, tox tests on my machine expect a retention policy duration of '0s' instead of '0' in tests.server_tests.client_test_with_server.CommonTests. This means that in order to get the Travis build to pass I have to fail the tox tests on my own machine. Not a big problem, but I'm not sure what's causing it. I'm using influxdb v0.13 on Manjaro Linux, built from AUR [InfluxDB version: InfluxDB v0.13.0 (git: unknown e57fb88a051ee40fd9277094345fbd47bb4783ce)].

@tzonghao
Contributor

@mdbartos Awesome pull request, it definitely makes things faster and easier. Thank you.

time_precision,
database,
retention_policy,
protocol='line')
Contributor


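This should be protocol=protocol, so that the caller's protocol argument is passed through instead of being hard-coded to 'line'.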

@aviau
Collaborator

aviau commented Aug 31, 2016

@tzonghao Thank you for reviewing this!

@mdbartos
Contributor Author

mdbartos commented Sep 3, 2016

Thanks again for reviewing @tzonghao. These changes have been made.

@tzonghao
Contributor

tzonghao commented Sep 6, 2016

@aviau @mdbartos You're welcome. Unless someone else wants to have another look, we're good to go.

@aviau aviau merged commit 1343ae9 into influxdata:master Sep 6, 2016
@aviau
Collaborator

aviau commented Sep 6, 2016

Thank you @mdbartos

@mdbartos
Contributor Author

mdbartos commented Sep 6, 2016

One last thing: I edited the .travis.yml file to get the build to work properly, but it should probably be changed back to the way it was. The Travis cache should also be cleared to allow pandas to build properly on Python 2.7.

@aviau
Collaborator

aviau commented Sep 6, 2016

Yeah, I had noticed and was planning to fix it.

@mousumipaul

Hi, I am new to InfluxDB. When I try to insert a dataframe into InfluxDB, I get the error "NameError: name '_convert_dataframe_to_lines' is not defined". In my code I have imported "from influxdb import dataframe_client".
My dataframe has a datetime index, one field, and two tags.
The code for insertion is:

tags ={'tag1': df[['tag1']], 'tag2': df[['tag2']]}
dp = _convert_dataframe_to_lines(None, dataframe=df, measurement='M1', tag_columns=tags, field_columns=['UI'])
client.write_points(dp)

I have checked that _dataframe_client.py exists, but I can't figure out how to call this function.
