remove hardcoded encoding #921


Closed · wants to merge 2 commits

Conversation

merla18
Contributor

@merla18 merla18 commented Apr 29, 2017

Since the client library (unlike the API) requires the document encoding, we should not hardcode UTF-8 by default. Added code used in other sample applications. Tested.
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Apr 29, 2017
@theacodes theacodes requested a review from gguuss May 1, 2017 16:19
@theacodes
Contributor

@gguuss can you review?

@theacodes
Contributor

My first concern is that this doesn't seem entirely correct. On Python 2.7, the strings we get from the command line are bytes and we manually decode those to utf-8. In the Python 2 case, we're passing the wrong encoding to the API.

@monattar

monattar commented May 2, 2017

Note that what is sent to the API is not the encoding of the input text, but the encoding used to calculate offsets in the response.
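This distinction can be seen with a quick standalone computation (not part of the sample): the arrow U+2192 occupies three UTF-8 bytes but a single UTF-16 code unit, so the offset of "bar" differs depending on the units being counted.

```python
text = u"foo\u2192bar"  # "foo→bar"
prefix = text[:text.index(u"bar")]

# Offset of "bar" in UTF-8 bytes vs. UTF-16 code units.
print(len(prefix.encode("utf-8")))           # 6
print(len(prefix.encode("utf-16-le")) // 2)  # 4
```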

@theacodes
Contributor

@monattar I think my concern still holds true. In Python 2.7, since we're decoding bytes from the input into utf-8, we should specify utf-8 as the encoding to the API.

@gguuss
Contributor

gguuss commented May 2, 2017

+1 to @jonparrott's comments. Additionally, I'm concerned about the following:

  • This code adds complexity without adding actual value for most developers
  • This code introduces a subroutine that will not appear in the how-to documentation
  • Please add tests for the new code you're introducing, highlighting the issue you're fixing

@@ -153,7 +162,7 @@ def entity_sentiment_text(text):
     document.type = enums.Document.Type.PLAIN_TEXT

     result = language_client.analyze_entity_sentiment(
-        document, enums.EncodingType.UTF8)
+        document, get_native_encoding_type())
@gguuss gguuss May 2, 2017

For better snippets, please update this to be an input to the function as opposed to something that will not appear in devsite.


@monattar Furthermore, what is broken and how does this fix it?

def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    if sys.maxunicode == 65535:
        return 'UTF16'

Why aren't you using the enumeration type?

@monattar

monattar commented May 2, 2017

I think we need a test to clarify. Could we please add a test for this content "foo→bar" and check the offset of "bar"?

    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'


FYI missing newline

@merla18
Contributor Author

merla18 commented May 2, 2017

Here are the results of the test:

  • If I use the original enums.EncodingType.UTF8 in the code and print the char at the offset of the original text:

    $ python snippets.py sentiment-entities-text "foo→bar"
    Mentions:
    Name: "bar"
    Begin Offset : 6
    Content : bar
    original text char at offset : r

  • If I use get_native_encoding_type() and print the char at the offset of the original text:

    $ python snippets.py sentiment-entities-text "foo→bar"
    Mentions:
    Name: "bar"
    Begin Offset : 4
    Content : bar
    original text char at offset : b

The change to get_native_encoding_type() seems to return the correct answer.

@gguuss
Contributor

gguuss commented May 3, 2017

@merla18 Thanks for the usage example, helps to clarify what this fixes.

So that other contributors don't reintroduce the bug, you should add a unit test to language/cloud-client/v1beta2/snippets_test.py.
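A hypothetical shape for such a test (the helper is inlined here so the sketch is self-contained; the real test in snippets_test.py would exercise the sample against the live API):

```python
import sys


def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'


def test_native_offset_indexes_original_string():
    text = u'foo\u2192bar'
    # Compute the offset of "bar" in the units of the native encoding.
    if get_native_encoding_type() == 'UTF16':
        offset = len(text[:text.index(u'bar')].encode('utf-16-le')) // 2
    else:
        offset = text.index(u'bar')
    # Either way, the offset must be usable to slice the original string.
    assert text[offset:offset + 3] == u'bar'


test_native_offset_indexes_original_string()
```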

@monattar

monattar commented May 3, 2017

Also wanted to mention that encoding is one of the trickiest things to get right, and we're trying to push that into the client library. So, hopefully specifying the EncodingType wouldn't be necessary at all in the future.

@@ -30,6 +31,14 @@
 import six


+def get_native_encoding_type():
+    """Returns the encoding type that matches Python's native strings."""

Any reason UTF8 isn't here? Below, the input is explicitly encoded as UTF8.


This identifies what Python natively does, regardless of any explicit encoding/decoding.
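Concretely, the native width can be checked directly; this is a standalone illustration of the sys.maxunicode distinction, not sample code:

```python
import sys

# Narrow (UCS-2) builds of Python 2 index strings by UTF-16 code units,
# so sys.maxunicode is 0xFFFF there; wide (UCS-4) builds and Python 3.3+
# index by full code points and report 0x10FFFF.
print(hex(sys.maxunicode))

# On a wide build, the astral character U+1F600 is one index position;
# on a narrow build it would occupy two UTF-16 code units.
print(len(u"\U0001F600"))  # 1 on wide builds, 2 on narrow builds
```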

@theacodes
Contributor

@merla18 @monattar there's some weirdness going on here.

See my annotated code here:

    # Regardless of whether the input is bytes or unicode, this will
    # ensure that it is unicode (utf-8 in Python 2, 16 or 32 in Python 3).
    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')

    # This ensures that the unicode string is encoded as utf-8 regardless
    # of the original encoding. encode returns bytes, but protobuf will
    # automatically re-encode to utf-8 for its wire format.
    document.content = text.encode('utf-8')
    document.type = enums.Document.Type.PLAIN_TEXT

    # At this point, document is (asciipb):
    #   type: PLAIN_TEXT
    #   content: "foo\342\206\222bar"
    #
    # Or in wire format:
    #   b'\x08\x01\x12\tfoo\xe2\x86\x92bar'
    # Which is *clearly* utf-8.

So we're always sending utf-8 on the wire. Protobuf demands that we do so; we can't set a utf-16 bytestring on the message:

>>> document.content = u"foo→bar".encode('utf-16')
*** ValueError: b'\xff\xfef\x00o\x00o\x00\x92!b\x00a\x00r\x00' has type bytes, but isn't valid UTF-8 encoding. Non-UTF-8 strings must be converted to unicode objects before being added.
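The constraint can be reproduced without protobuf: the UTF-16 byte string above simply is not valid UTF-8, which is why the message field rejects it (a standalone check, not the client-library code):

```python
raw = u"foo\u2192bar".encode("utf-16")
# raw starts with the BOM byte 0xFF, which can never begin a UTF-8 sequence.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```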

So things get weird now. If we make the request and specify UTF8 as the encoding (which is correct) we get the wrong offset:

>>> language_client.analyze_entity_sentiment(document, enums.EncodingType.UTF8)
entities {
  name: "bar"
  type: LOCATION
  salience: 1.0
  mentions {
    text {
      content: "bar"
      begin_offset: 6
    }
    type: COMMON
    sentiment {
    }
  }
  sentiment {
  }
}
language: "en"

The offset is 6 when clearly the offset is 4 in the utf-8 string.

What's even weirder is that if we pass in UTF16 as the encoding:

>>> language_client.analyze_entity_sentiment(document, enums.EncodingType.UTF16)
entities {
  name: "bar"
  type: LOCATION
  salience: 1.0
  mentions {
    text {
      content: "bar"
      begin_offset: 4
    }
    type: COMMON
    sentiment {
    }
  }
  sentiment {
  }
}
language: "en"

We get the correct offset for utf-8 but the wrong offset for utf-16. In fact, the offsets seem to be switched!

Any ideas?

@gguuss
Contributor

gguuss commented May 3, 2017

@monattar @merla18 I'm seeing the same behavior in Java:

  public List<Entity> entitySentimentText(String text) throws IOException {
    Document doc = Document.newBuilder()
            .setContent(new String(java.nio.charset.Charset.forName("UTF-8").encode(text).array()))
            .setType(Type.PLAIN_TEXT).build();
    AnalyzeEntitySentimentRequest request = AnalyzeEntitySentimentRequest.newBuilder()
            .setDocument(doc)
            .setEncodingType(EncodingType.UTF8).build();
    AnalyzeEntitySentimentResponse response = languageApi.analyzeEntitySentiment(request);
    return response.getEntitiesList();
  }

With an input of "1234♥♥♥Google.", this shows the offset of Google as 13. Changing to UTF-16, e.g.

  public List<Entity> entitySentimentText(String text) throws IOException {
    Document doc = Document.newBuilder()
            .setContent(new String(java.nio.charset.Charset.forName("UTF-16").encode(text).array()))
            .setType(Type.PLAIN_TEXT).build();
    AnalyzeEntitySentimentRequest request = AnalyzeEntitySentimentRequest.newBuilder()
            .setDocument(doc)
            .setEncodingType(EncodingType.UTF8).build();
    AnalyzeEntitySentimentResponse response = languageApi.analyzeEntitySentiment(request);
    return response.getEntitiesList();
  }

Shows the offset of Google as 21.

Conversely, passing the UTF8 string and telling the API that it's UTF-16 shows the correct offset, 7.

This could be a bug that is outside of the scope of the samples and client libraries.

@gguuss
Contributor

gguuss commented May 23, 2017

FYI, added in this PR with test, snippet, and style fixes.

@theacodes
Contributor

Closing in favor of #961

@theacodes theacodes closed this May 24, 2017