remove hardcoded encoding #921


Closed · wants to merge 2 commits

Conversation

merla18
Contributor

@merla18 merla18 commented Apr 29, 2017

Since the client library (unlike the API) requires the document encoding, we should not hardcode UTF-8 by default. Added code used in other sample applications. Tested.
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Apr 29, 2017
@theacodes theacodes requested a review from gguuss May 1, 2017 16:19
@theacodes
Contributor

@gguuss can you review?

@theacodes
Contributor

My first concern is that this doesn't seem entirely correct. On Python 2.7, the strings we get from the command line are bytes and we manually decode those to utf-8. In the Python 2 case, we're passing the wrong encoding to the API.

@monattar

monattar commented May 2, 2017

Note that what is sent to the API is not the encoding of the input text, but the encoding used to calculate offsets in the response.
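This distinction can be seen with a quick standalone computation (not part of the sample): the arrow U+2192 occupies three UTF-8 bytes but a single UTF-16 code unit, so the offset of "bar" differs depending on the units being counted.

```python
text = u"foo\u2192bar"  # "foo→bar"
prefix = text[:text.index(u"bar")]

# Offset of "bar" in UTF-8 bytes vs. UTF-16 code units.
print(len(prefix.encode("utf-8")))           # 6
print(len(prefix.encode("utf-16-le")) // 2)  # 4
```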

@theacodes
Contributor

@monattar I think my concern still holds true. In Python 2.7, since we're decoding bytes from the input into utf-8, we should specify utf-8 as the encoding to the API.

@gguuss
Contributor

gguuss commented May 2, 2017

+1 to @jonparrott's comments. Additionally, I'm concerned about the following:

  • This code adds complexity without adding actual value for most developers
  • This code introduces a subroutine that will not appear in the how-to documentation
  • Please add tests for the new code you're introducing, highlighting the issue you're fixing

@@ -153,7 +162,7 @@ def entity_sentiment_text(text):
     document.type = enums.Document.Type.PLAIN_TEXT

     result = language_client.analyze_entity_sentiment(
-        document, enums.EncodingType.UTF8)
+        document, get_native_encoding_type())
@gguuss gguuss May 2, 2017

For better snippets, please update this to be an input to the function as opposed to something that will not appear in devsite.


@monattar Furthermore, what is broken and how does this fix it?

def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    if sys.maxunicode == 65535:
        return 'UTF16'

Why aren't you using the enumeration type?

@monattar

monattar commented May 2, 2017

I think we need a test to clarify. Could we please add a test for this content "foo→bar" and check the offset of "bar"?

    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'


FYI missing newline

@merla18
Contributor Author

merla18 commented May 2, 2017

Here are the results of the test:

  • If I use the original enums.EncodingType.UTF8 in the code and print the char at the offset of the original text:

    $ python snippets.py sentiment-entities-text "foo→bar"
    Mentions:
    Name: "bar"
    Begin Offset : 6
    Content : bar
    original text char at offset : r

  • If I use get_native_encoding_type() and print the char at the offset of the original text:

    $ python snippets.py sentiment-entities-text "foo→bar"
    Mentions:
    Name: "bar"
    Begin Offset : 4
    Content : bar
    original text char at offset : b

The change to get_native_encoding_type() seems to return the correct answer.

@gguuss
Contributor

gguuss commented May 3, 2017

@merla18 Thanks for the usage example, helps to clarify what this fixes.

So that other contributors don't reintroduce the bug, you should add a unit test to language/cloud-client/v1beta2/snippets_test.py.
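A hypothetical shape for such a test (the helper is inlined here so the sketch is self-contained; the real test in snippets_test.py would exercise the sample against the live API):

```python
import sys


def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'


def test_native_offset_indexes_original_string():
    text = u'foo\u2192bar'
    # Compute the offset of "bar" in the units of the native encoding.
    if get_native_encoding_type() == 'UTF16':
        offset = len(text[:text.index(u'bar')].encode('utf-16-le')) // 2
    else:
        offset = text.index(u'bar')
    # Either way, the offset must be usable to slice the original string.
    assert text[offset:offset + 3] == u'bar'


test_native_offset_indexes_original_string()
```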

@monattar

monattar commented May 3, 2017

Also wanted to mention that encoding is one of the trickiest things to get right, and we're trying to push that into the client library. So, hopefully specifying the EncodingType wouldn't be necessary at all in the future.

@@ -30,6 +31,14 @@
 import six


+def get_native_encoding_type():
+    """Returns the encoding type that matches Python's native strings."""

Any reason UTF8 isn't here? Below, the input is explicitly encoded as UTF8.


This identifies what Python natively does, regardless of any explicit encoding/decoding.
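Concretely, the native width can be checked directly; this is a standalone illustration of the sys.maxunicode distinction, not sample code:

```python
import sys

# Narrow (UCS-2) builds of Python 2 index strings by UTF-16 code units,
# so sys.maxunicode is 0xFFFF there; wide (UCS-4) builds and Python 3.3+
# index by full code points and report 0x10FFFF.
print(hex(sys.maxunicode))

# On a wide build, the astral character U+1F600 is one index position;
# on a narrow build it would occupy two UTF-16 code units.
print(len(u"\U0001F600"))  # 1 on wide builds, 2 on narrow builds
```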

@theacodes
Contributor

@merla18 @monattar there's some weirdness going on here.

See my annotated code here:

    # Regardless of whether the input is bytes or unicode, this will
    # ensure that it is unicode (utf-8 in Python 2, 16 or 32 in Python 3).
    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')

    # This ensures that the unicode string is encoded as utf-8 regardless
    # of the original encoding. encode returns bytes, but protobuf will
    # automatically re-encode to utf-8 for its wire format.
    document.content = text.encode('utf-8')
    document.type = enums.Document.Type.PLAIN_TEXT

    # At this point, document is (asciipb):
    #   type: PLAIN_TEXT
    #   content: "foo\342\206\222bar"
    #
    # Or in wire format:
    #   b'\x08\x01\x12\tfoo\xe2\x86\x92bar'
    # Which is *clearly* utf-8.

So we're always sending utf-8 on the wire. Protobuf demands that we do so; we can't set a utf-16 bytestring on the message:

>>> document.content = u"foo→bar".encode('utf-16')
*** ValueError: b'\xff\xfef\x00o\x00o\x00\x92!b\x00a\x00r\x00' has type bytes, but isn't valid UTF-8 encoding. Non-UTF-8 strings must be converted to unicode objects before being added.
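The constraint can be reproduced without protobuf: the UTF-16 byte string above simply is not valid UTF-8, which is why the message field rejects it (a standalone check, not the client-library code):

```python
raw = u"foo\u2192bar".encode("utf-16")
# raw starts with the BOM byte 0xFF, which can never begin a UTF-8 sequence.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```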

So things get weird now. If we make the request and specify UTF8 as the encoding (which is correct) we get the wrong offset:

>>> language_client.analyze_entity_sentiment(document, enums.EncodingType.UTF8)
entities {
  name: "bar"
  type: LOCATION
  salience: 1.0
  mentions {
    text {
      content: "bar"
      begin_offset: 6
    }
    type: COMMON
    sentiment {
    }
  }
  sentiment {
  }
}
language: "en"

The offset is 6 when clearly the offset is 4 in the utf-8 string.

What's even weirder is that if we pass in UTF16 as the encoding:

>>> language_client.analyze_entity_sentiment(document, enums.EncodingType.UTF16)
entities {
  name: "bar"
  type: LOCATION
  salience: 1.0
  mentions {
    text {
      content: "bar"
      begin_offset: 4
    }
    type: COMMON
    sentiment {
    }
  }
  sentiment {
  }
}
language: "en"

We get the correct offset for utf-8 but the wrong offset for utf-16. In fact, the offsets seem to be switched!

Any ideas?

@gguuss
Contributor

gguuss commented May 3, 2017

@monattar @merla18 I'm seeing the same behavior in Java:

  public List<Entity> entitySentimentText(String text) throws IOException {
    Document doc = Document.newBuilder()
            .setContent(new String(java.nio.charset.Charset.forName("UTF-8").encode(text).array()))
            .setType(Type.PLAIN_TEXT).build();
    AnalyzeEntitySentimentRequest request = AnalyzeEntitySentimentRequest.newBuilder()
            .setDocument(doc)
            .setEncodingType(EncodingType.UTF8).build();
    AnalyzeEntitySentimentResponse response = languageApi.analyzeEntitySentiment(request);
    return response.getEntitiesList();
  }

With an input of "1234♥♥♥Google.", this shows the offset of Google as 13. Changing to UTF-16, e.g.

  public List<Entity> entitySentimentText(String text) throws IOException {
    Document doc = Document.newBuilder()
            .setContent(new String(java.nio.charset.Charset.forName("UTF-16").encode(text).array()))
            .setType(Type.PLAIN_TEXT).build();
    AnalyzeEntitySentimentRequest request = AnalyzeEntitySentimentRequest.newBuilder()
            .setDocument(doc)
            .setEncodingType(EncodingType.UTF8).build();
    AnalyzeEntitySentimentResponse response = languageApi.analyzeEntitySentiment(request);
    return response.getEntitiesList();
  }

Shows the offset of Google as 21.

Conversely, passing the UTF8 string and telling the API that it's UTF-16 shows the correct offset, 7.

This could be a bug that is outside of the scope of the samples and client libraries.

@gguuss
Contributor

gguuss commented May 23, 2017

FYI, added in this PR with test, snippet, and style fixes.

@theacodes
Contributor

Closing in favor of #961

@theacodes theacodes closed this May 24, 2017