Line protocol leading comma #694

d3banjan · 2019-03-30T11:52:13Z

Environment:

pandas==0.24.1
-e git+ssh://git@github.com/KuguHome/influxdb-python.git@afcfd25b21523d84a7d1088eff2abb4d08de7647#egg=influxdb which mirrors the current master branch of influxdb-python
python=3.6.7

Reproduction of bug:
bug-reproduce.py

import pandas as pd
from influxdb import DataFrameClient

dfi = DataFrameClient(database='testdb')
dfi.create_database('testdb')

# create a test dataframe that reproduces the bug
buggy_df = pd.DataFrame(
    dict(
        first=[1, None, None, 8, 9],
        second=[2, None, None, None, 10],
        third=[3, 4.1, None, None, 11],
        first_tag=["one", None, None, "eight", None],
        second_tag=["two", None, None, None, None],
        third_tag=["three", "four", None, None, None],
        comment=[
            "All columns filled",
            "First two of three empty",
            "All empty",
            "Last two of three empty",
            "Empty tags with values",
        ]
    ),
    index=pd.date_range(
        start=pd.to_datetime('2018-01-01'),
        end=pd.to_datetime('2018-10-01'),
        periods=5,
    )
)
print("bug inducing df=\n", buggy_df)
print("\nbuggy line protocol output\n",
      "\n" + "\n".join(
          dfi._convert_dataframe_to_lines(
              # drop the explanatory `comment` column before conversion
              buggy_df.iloc[:, :-1],
              'buggy-measurements',
              tag_columns=["first_tag", "second_tag", "third_tag"],
          )
      )
)

$ python bug-reproduce.py
bug inducing df=
                      first  second  third first_tag second_tag third_tag                   comment
2018-01-01 00:00:00    1.0     2.0    3.0       one        two     three        All columns filled
2018-03-10 06:00:00    NaN     NaN    4.1      None       None      four  First two of three empty
2018-05-17 12:00:00    NaN     NaN    NaN      None       None      None                 All empty
2018-07-24 18:00:00    8.0     NaN    NaN     eight       None      None   Last two of three empty
2018-10-01 00:00:00    9.0    10.0   11.0      None       None      None    Empty tags with values

buggy line protocol output
 
buggy-measurements,first_tag=one,second_tag=two,third_tag=three first=1.0,second=2.0,third=3.0 1514764800000000000
buggy-measurements,third_tag=four ,third=4.1 1520661600000000000
buggy-measurements,first_tag=eight first=8.0 1532455200000000000
buggy-measurements first=9.0,second=10.0,third=11.0 1538352000000000000
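
The second line of the output is the malformed one: its field set starts with a comma (`,third=4.1`). As a rough illustration, the check below splits each line into its three space-separated sections and flags a field set that begins with a comma. It assumes there are no escaped spaces in keys or values (true for the output above), and `has_leading_field_comma` is a hypothetical helper, not part of influxdb-python.

```python
def has_leading_field_comma(line):
    """Return True if the field set of a line-protocol line starts with ','.

    Assumes the simple layout `measurement[,tags] fields timestamp`
    with no escaped spaces inside keys or values.
    """
    _, field_set, _ = line.split(" ")
    return field_set.startswith(",")


lines = [
    "buggy-measurements,first_tag=one,second_tag=two,third_tag=three "
    "first=1.0,second=2.0,third=3.0 1514764800000000000",
    "buggy-measurements,third_tag=four ,third=4.1 1520661600000000000",
    "buggy-measurements,first_tag=eight first=8.0 1532455200000000000",
    "buggy-measurements first=9.0,second=10.0,third=11.0 1538352000000000000",
]
flags = [has_leading_field_comma(line) for line in lines]
print(flags)  # only the second line is flagged
```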

Outlook:
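Tags and fields are assembled in almost the same way, but a leading comma is only harmless for tags: the tag set is appended directly to the measurement name, which has to be followed by a comma anyway. In the field set, a leading comma produces invalid line protocol. The field values are handled here: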

influxdb-python/influxdb/_dataframe_client.py

Lines 381 to 394 in afcfd25

    
    # Make an array of formatted field keys and values
    field_df = dataframe[field_columns]
    # Keep the positions where Null values are found
    mask_null = field_df.isnull().values

    field_df = self._stringify_dataframe(field_df,
                                         numeric_precision,
                                         datatype='field')

    field_df = (field_df.columns.values + '=').tolist() + field_df
    field_df[field_df.columns[1:]] = ',' + field_df[
        field_df.columns[1:]]
    field_df = field_df.where(~mask_null, '')  # drop Null entries
    fields = field_df.sum(axis=1)

Notice how this differs from the tag handling. For tags, None entries are filtered out by value and the comma is attached to each surviving entry. For fields, the comma is attached by column position, i.e. field_df[field_df.columns[1:]] = ',' + field_df[field_df.columns[1:]], so when the first column is null and a later one is not, the surviving entry keeps its leading comma.

influxdb-python/influxdb/_dataframe_client.py

Lines 366 to 369 in afcfd25

    
    # join preprendded tags, leaving None values out
    tags = tag_df.apply(
        lambda s: [',' + s.name + '=' + v if v else '' for v in s])
    tags = tags.sum(axis=1)
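
The difference between the two snippets can be reproduced for a single row in plain Python. This is only a sketch that mimics the per-row effect of the pandas code rather than calling it; `fields_by_position` and `tags_by_value` are hypothetical names, and value formatting is simplified.

```python
def fields_by_position(row, columns):
    """Mimic the field logic: the comma depends on the column position."""
    parts = []
    for position, column in enumerate(columns):
        value = row[column]
        if value is None:
            parts.append("")  # null entries are blanked out afterwards
            continue
        prefix = "," if position > 0 else ""  # every column but the first
        parts.append("{}{}={}".format(prefix, column, value))
    return "".join(parts)


def tags_by_value(row, columns):
    """Mimic the tag logic: the comma is attached to each non-empty value."""
    return "".join(
        "," + column + "=" + row[column] for column in columns if row[column]
    )


row = {"first": None, "second": None, "third": 4.1, "third_tag": "four"}
print(fields_by_position(row, ["first", "second", "third"]))  # ,third=4.1
print(tags_by_value(row, ["third_tag"]))  # ,third_tag=four
```

Both results start with a comma, but only the tag string is placed directly after the measurement name, where the comma is required.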

Fix and refactor:

[FIX] I would have proposed handling the fields exactly the same way as the tags. However, the tag code uses apply (which is slower), whereas the field code uses vectorized operations. The performant fix is therefore to strip the leading , from the assembled field string wherever it occurs.
KuguHome@b4ceab9
Output:

python bug-reproduce.py
bug inducing df=
                      first  second  third first_tag second_tag third_tag                   comment
2018-01-01 00:00:00    1.0     2.0    3.0       one        two     three        All columns filled
2018-03-10 06:00:00    NaN     NaN    4.1      None       None      four  First two of three empty
2018-05-17 12:00:00    NaN     NaN    NaN      None       None      None                 All empty
2018-07-24 18:00:00    8.0     NaN    NaN     eight       None      None   Last two of three empty
2018-10-01 00:00:00    9.0    10.0   11.0      None       None      None    Empty tags with values

buggy line protocol output
 
buggy-measurements,first_tag=one,second_tag=two,third_tag=three first=1.0,second=2.0,third=3.0 1514764800000000000
buggy-measurements,third_tag=four third=4.1 1520661600000000000
buggy-measurements,first_tag=eight first=8.0 1532455200000000000
buggy-measurements first=9.0,second=10.0,third=11.0 1538352000000000000
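
The stripping idea behind the [FIX] step can be sketched in a few lines. This is an illustration under assumptions, not the code from b4ceab9: `strip_leading_comma` is a hypothetical helper working on plain strings, and on a pandas Series of strings the vectorized counterpart would be `Series.str.lstrip(',')`.

```python
def strip_leading_comma(field_sets):
    """Drop any leading commas from each assembled field set."""
    return [field_set.lstrip(",") for field_set in field_sets]


# Field sets as the positional logic would assemble them for three rows.
assembled = ["first=1.0,second=2.0,third=3.0", ",third=4.1", "first=8.0"]
fixed = strip_leading_comma(assembled)
print(fixed)
```

Because a field set always begins with a field key, stripping commas on the left does not touch any legitimate content.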

[REFACTOR] Since the logic for handling tags and fields is the same, the two chunks can be refactored into one utility function, __lineify_tag_field_df, to keep the code DRY.
For this to work, the leading , has to be removed from the tags output and added back when the line is assembled.
KuguHome@366e771
Output:

python bug-reproduce.py           
bug inducing df=
                      first  second  third first_tag second_tag third_tag                   comment
2018-01-01 00:00:00    1.0     2.0    3.0       one        two     three        All columns filled
2018-03-10 06:00:00    NaN     NaN    4.1      None       None      four  First two of three empty
2018-05-17 12:00:00    NaN     NaN    NaN      None       None      None                 All empty
2018-07-24 18:00:00    8.0     NaN    NaN     eight       None      None   Last two of three empty
2018-10-01 00:00:00    9.0    10.0   11.0      None       None      None    Empty tags with values

buggy line protocol output
 
buggy-measurements,first_tag=one,second_tag=two,third_tag=three first=1.0,second=2.0,third=3.0 1514764800000000000
buggy-measurements,third_tag=four third=4.1 1520661600000000000
buggy-measurements,first_tag=eight first=8.0 1532455200000000000
buggy-measurements first=9.0,second=10.0,third=11.0 1538352000000000000
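
One way to picture the [REFACTOR] idea is a single helper that joins non-empty key=value pairs without any leading comma, with the caller adding the comma between measurement and tags. The sketch below is not the code from 366e771: `join_pairs` and `build_line` are hypothetical names, rows are plain dicts instead of DataFrames, and escaping and numeric formatting are left out.

```python
def join_pairs(row, columns):
    """Join non-empty `key=value` pairs with commas, without a leading comma."""
    return ",".join(
        "{}={}".format(column, row[column])
        for column in columns
        if row[column] is not None
    )


def build_line(measurement, row, tag_columns, field_columns, timestamp):
    """Assemble one line; the comma before the tag set is added back here."""
    tags = join_pairs(row, tag_columns)
    fields = join_pairs(row, field_columns)
    head = measurement + ("," + tags if tags else "")
    return "{} {} {}".format(head, fields, timestamp)


row = {"first": None, "second": None, "third": 4.1,
       "first_tag": None, "second_tag": None, "third_tag": "four"}
line = build_line(
    "buggy-measurements", row,
    ["first_tag", "second_tag", "third_tag"],
    ["first", "second", "third"],
    1520661600000000000,
)
print(line)
```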

Pull request at #694

xginn8 · 2019-04-01T17:40:15Z

@d3banjan thank you for contributing! can you fix up the broken tests please?

d3banjan · 2019-04-02T12:09:44Z

> @d3banjan thank you for contributing! can you fix up the broken tests please?

Yes. I already took a look at the failed tests. The fix b4ceab9 passes the tests and is minimal!

Umm... do you think the refactor commit 366e771 is an acceptable direction to work on? I kept wondering if this is indeed duplicated code or if I was missing something.

d3banjan · 2019-04-06T16:59:25Z

I abandoned the idea of the refactor for now.

xginn8 · 2019-04-07T14:22:15Z

@d3banjan would you mind adding a test to ensure that this situation is properly addressed?

d3banjan · 2019-05-18T13:02:03Z

@xginn8 Thanks for your patience, I have added a test with all possible cases considered and tried to make it as reproducible and deterministic as I could using pandas.
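
For readers who want a feel for what "all possible cases" means here, the sketch below enumerates every null/non-null pattern of three field columns and checks that the assembled field set never starts with a comma. It is not the test added to the PR: `join_fields` is a hypothetical stand-in for the converter, and the real test works on a pandas DataFrame.

```python
import itertools


def join_fields(row):
    """Hypothetical stand-in for the field assembly: skip null values."""
    return ",".join(
        "{}={}".format(key, value)
        for key, value in row.items()
        if value is not None
    )


columns = ["first", "second", "third"]
checked = 0
# Every combination of present/missing values: 2 ** 3 patterns.
for mask in itertools.product([True, False], repeat=len(columns)):
    row = {c: (1.0 if present else None) for c, present in zip(columns, mask)}
    field_set = join_fields(row)
    assert not field_set.startswith(","), field_set
    assert ",," not in field_set, field_set
    checked += 1
print(checked)  # 8 patterns checked
```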

d3banjan · 2019-06-07T14:48:14Z

@xginn8 @aviau @sebito91 Any reasons why the PR is not merged yet?

lovasoa · 2019-07-11T21:46:12Z

@aviau : Can this be merged ?

aviau · 2019-07-11T22:32:57Z

Sure, looks good.

Debanjan Basu added 2 commits March 30, 2019 11:26

[fix] typo in comment + [fix] handles leading comma for the case that…

b4ceab9

… the first value column is Null valued

[refactor] consolidated similar logic to a new function

366e771

d3banjan requested review from aviau, sebito91 and xginn8 as code owners March 30, 2019 11:52

d3banjan mentioned this pull request Mar 30, 2019

DataFrameClient write with nan in first column error #657

Open

[fix] covering scenario where is a string

49af5ab

xginn8 added the pending contributor label Apr 1, 2019

Debanjan added 2 commits April 6, 2019 18:04

Revert "[fix] covering scenario where is a string"

8b68f4f

This reverts commit 49af5ab.

Revert "[refactor] consolidated similar logic to a new function"

0df495a

This reverts commit 366e771.

Debanjan added 9 commits May 18, 2019 11:00

[tests][feature] added tests to check if first none value results in …

c663234

…invalid line protocol

[fix] deleted debug lines

e5d96f0

[fix] overspecified date_range args

1f3fe4e

[fix] overspecified date_range args

cf95296

reset

3ab37a3

[fix] removed endpoint in date-range

53c3acf

[fix] reordered columns in test target

599c26b

[fix] [test] freeze order of columns

0ad7cfe

[refactor] [test] used loc instead of dict-like invocation of columns

87d48e9

[fix] [test] [lint] cleared up complainsts from flake8 and pep257

097222d

lovasoa mentioned this pull request Jul 11, 2019

InfluxDBClientError while inserting None with DataFrameClient #726

Open

aviau merged commit 08e0299 into influxdata:master Jul 11, 2019

janemrich mentioned this pull request Aug 20, 2019

pip version is not up2date #631

Open
