Skip to content
This repository was archived by the owner on Oct 29, 2024. It is now read-only.

Line protocol leading comma #694

Merged
merged 15 commits into from
Jul 11, 2019

Conversation

d3banjan
Copy link
Contributor

@d3banjan d3banjan commented Mar 30, 2019

Environment:

  • pandas==0.24.1
  • -e git+ssh://git@github.com/KuguHome/influxdb-python.git@afcfd25b21523d84a7d1088eff2abb4d08de7647#egg=influxdb which mirrors the current master branch of influxdb-python
  • python=3.6.7

Reproduction of bug:
bug-reproduce.py

import pandas as pd
from influxdb import DataFrameClient

dfi = DataFrameClient(database='testdb', )
dfi.create_database('testdb')

# create a test dataframe that reproduces the bug
buggy_df = pd.DataFrame(
	dict(
		first=[1, None, None, 8, 9],
		second=[2, None, None, None, 10],
                third=[3, 4.1, None, None, 11],
		first_tag=["one", None, None, "eight", None],
		second_tag=["two", None, None, None, None],
                third_tag=["three", "four", None, None, None],
                comment=[
                    "All columns filled",
                    "First two of three empty",
                    "All empty",
                    "Last two of three empty",
                    "Empty tags with values",
                ]
	),
	index=pd.date_range(
		start=pd.to_datetime('2018-01-01'),
		end=pd.to_datetime('2018-10-01'),
		periods=5,
	)
)
print("bug inducing df=\n",buggy_df)
print("\nbuggy line protocol output\n",
      "\n" + "\n".join(
		dfi._convert_dataframe_to_lines(
			buggy_df.iloc[:,:-1],
			'buggy-measurements',
			tag_columns=["first_tag", "second_tag","third_tag"],
		)
	)
)
$ python bug-reproduce.py
bug inducing df=
                      first  second  third first_tag second_tag third_tag                   comment
2018-01-01 00:00:00    1.0     2.0    3.0       one        two     three        All columns filled
2018-03-10 06:00:00    NaN     NaN    4.1      None       None      four  First two of three empty
2018-05-17 12:00:00    NaN     NaN    NaN      None       None      None                 All empty
2018-07-24 18:00:00    8.0     NaN    NaN     eight       None      None   Last two of three empty
2018-10-01 00:00:00    9.0    10.0   11.0      None       None      None    Empty tags with values

buggy line protocol output
 
buggy-measurements,first_tag=one,second_tag=two,third_tag=three first=1.0,second=2.0,third=3.0 1514764800000000000
buggy-measurements,third_tag=four ,third=4.1 1520661600000000000
buggy-measurements,first_tag=eight first=8.0 1532455200000000000
buggy-measurements first=9.0,second=10.0,third=11.0 1538352000000000000

Outlook:
Although tags and fields are handled the same way, the presence of a leading comma works for the tags. The following code is where the values are handled :

# Make an array of formatted field keys and values
field_df = dataframe[field_columns]
# Keep the positions where Null values are found
mask_null = field_df.isnull().values
field_df = self._stringify_dataframe(field_df,
numeric_precision,
datatype='field')
field_df = (field_df.columns.values + '=').tolist() + field_df
field_df[field_df.columns[1:]] = ',' + field_df[
field_df.columns[1:]]
field_df = field_df.where(~mask_null, '') # drop Null entries
fields = field_df.sum(axis=1)

Notice the difference to how tags are handled, in that the None columns are filtered out by value before conversion to line protocol. The values, however, are handled by the column index i.e field_df[field_df.columns[1:]] = ',' + field_df[ field_df.columns[1:]]

# join preprendded tags, leaving None values out
tags = tag_df.apply(
lambda s: [',' + s.name + '=' + v if v else '' for v in s])
tags = tags.sum(axis=1)

Fix and refactor:

  1. [FIX] I would have proposed that the fields are handled exactly the same way as the tags. However, the tag operation uses apply (which is slower), whereas the field operation uses vectorized operation. So the performant fix here would be to strip the leading , from fields where appropriate.
    KuguHome@b4ceab9
    Output:
    python bug-reproduce.py
    bug inducing df=
                          first  second  third first_tag second_tag third_tag                   comment
    2018-01-01 00:00:00    1.0     2.0    3.0       one        two     three        All columns filled
    2018-03-10 06:00:00    NaN     NaN    4.1      None       None      four  First two of three empty
    2018-05-17 12:00:00    NaN     NaN    NaN      None       None      None                 All empty
    2018-07-24 18:00:00    8.0     NaN    NaN     eight       None      None   Last two of three empty
    2018-10-01 00:00:00    9.0    10.0   11.0      None       None      None    Empty tags with values
    
    buggy line protocol output
     
    buggy-measurements,first_tag=one,second_tag=two,third_tag=three first=1.0,second=2.0,third=3.0 1514764800000000000
    buggy-measurements,third_tag=four third=4.1 1520661600000000000
    buggy-measurements,first_tag=eight first=8.0 1532455200000000000
    buggy-measurements first=9.0,second=10.0,third=11.0 1538352000000000000
  2. [REFACTOR] Since the logic for handling tags and values are exactly the same, these two chunks should be refactored into a utility function called __lineify_tag_field_df, because DRY.
    The leading , needs to be taken out from the tags output and needs to be put back when constructing the line again.
    KuguHome@366e771
    Output:
    python bug-reproduce.py           
    bug inducing df=
                          first  second  third first_tag second_tag third_tag                   comment
    2018-01-01 00:00:00    1.0     2.0    3.0       one        two     three        All columns filled
    2018-03-10 06:00:00    NaN     NaN    4.1      None       None      four  First two of three empty
    2018-05-17 12:00:00    NaN     NaN    NaN      None       None      None                 All empty
    2018-07-24 18:00:00    8.0     NaN    NaN     eight       None      None   Last two of three empty
    2018-10-01 00:00:00    9.0    10.0   11.0      None       None      None    Empty tags with values
    
    buggy line protocol output
     
    buggy-measurements,first_tag=one,second_tag=two,third_tag=three first=1.0,second=2.0,third=3.0 1514764800000000000
    buggy-measurements,third_tag=four third=4.1 1520661600000000000
    buggy-measurements,first_tag=eight first=8.0 1532455200000000000
    buggy-measurements first=9.0,second=10.0,third=11.0 1538352000000000000

Pull request at #694

@xginn8
Copy link
Collaborator

xginn8 commented Apr 1, 2019

@d3banjan thank you for contributing! can you fix up the broken tests please?

@d3banjan
Copy link
Contributor Author

d3banjan commented Apr 2, 2019

@d3banjan thank you for contributing! can you fix up the broken tests please?

Yes. I already took a look at the failed tests. The fix b4ceab9 passes the tests and is minimal!

Umm... do you think the refactor commit 366e771 is an acceptable direction to work on? I kept wondering if this is indeed duplicated code or if I was missing something.

@d3banjan
Copy link
Contributor Author

d3banjan commented Apr 6, 2019

I abandoned the idea of the refactor for now.

@xginn8
Copy link
Collaborator

xginn8 commented Apr 7, 2019

@d3banjan would you mind adding a test to ensure that this situation is properly addressed?

@d3banjan
Copy link
Contributor Author

@xginn8 Thanks for your patience, I have added a test with all possible cases considered and tried to make it as reproducible and deterministic as I could using pandas.

@d3banjan
Copy link
Contributor Author

d3banjan commented Jun 7, 2019

@xginn8 @aviau @sebito91 Any reasons why the PR is not merged yet?

@lovasoa
Copy link
Contributor

lovasoa commented Jul 11, 2019

@aviau : Can this be merged ?

@aviau
Copy link
Collaborator

aviau commented Jul 11, 2019

Sure, looks good.

@aviau aviau merged commit 08e0299 into influxdata:master Jul 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants