remove hardcoded encoding #921
Conversation
Since the client library (unlike the API) requires the document encoding, we should not hardcode UTF8 by default. Added code used in other sample applications. Tested.
@gguuss can you review?
My first concern is that this doesn't seem entirely correct. On Python 2.7, the strings we get from the command line are bytes and we manually decode those to utf-8. In the Python 2 case, we're passing the wrong encoding to the API.
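(For context, a minimal sketch of the normalization step the samples perform, assuming utf-8 terminal input; the helper name is illustrative:)

```python
import six


def ensure_unicode(text):
    # On Python 2, command-line arguments arrive as bytes; the samples
    # assume the terminal produced utf-8 and decode accordingly.
    # Python 3 strings pass through unchanged.
    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')
    return text
```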
Note that what is sent to the API is not the encoding of the input text, but the encoding used to calculate offsets in the response.
@monattar I think my concern still holds true. In Python 2.7, since we're decoding bytes from the input into utf-8, we should specify utf-8 as the encoding to the API.
+1 to @jonparrott's comments. Additionally, I'm concerned about the following:
```diff
@@ -153,7 +162,7 @@ def entity_sentiment_text(text):
     document.type = enums.Document.Type.PLAIN_TEXT

     result = language_client.analyze_entity_sentiment(
-        document, enums.EncodingType.UTF8)
+        document, get_native_encoding_type())
```
For better snippets, please update this to be an input to the function as opposed to something that will not appear in devsite.
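A rough sketch of that suggestion, with the encoding as a parameter so it appears in the published snippet (the imports and client construction are assumed from the sample's surroundings, not verified against the actual snippets.py):

```python
from google.cloud import language
from google.cloud.language import enums


def entity_sentiment_text(text, encoding=enums.EncodingType.UTF8):
    """Detects entity sentiment in the provided text."""
    language_client = language.LanguageServiceClient()
    document = language.types.Document(
        content=text, type=enums.Document.Type.PLAIN_TEXT)
    # The encoding is now a visible input rather than a hidden helper call.
    return language_client.analyze_entity_sentiment(document, encoding)
```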
@monattar Furthermore, what is broken and how does this fix it?
```python
def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    if sys.maxunicode == 65535:
        return 'UTF16'
```
Why aren't you using the enumeration type?
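(i.e., something along these lines, using the `enums` module these snippets already import; a sketch, not the final code:)

```python
import sys

from google.cloud.language import enums  # as imported elsewhere in the sample


def get_native_encoding_type():
    """Returns the EncodingType that matches Python's native strings."""
    if sys.maxunicode == 65535:
        # Narrow build: unicode strings are sequences of UTF-16 code units.
        return enums.EncodingType.UTF16
    # Wide build / Python 3: code points, i.e. UTF-32 offsets.
    return enums.EncodingType.UTF32
```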
I think we need a test to clarify. Could we please add a test for this content "foo→bar" and check the offset of "bar"?
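For reference, a hypothetical test along those lines that needs no API call; it just pins down where "bar" should land under each encoding:

```python
def test_bar_offsets():
    text = u'foo\u2192bar'  # "foo→bar"
    # UTF-8: the arrow is three bytes, so "bar" begins at byte offset 6.
    assert text.encode('utf-8').index(b'bar') == 6
    # UTF-16: the arrow is a single code unit, so "bar" begins at offset 4.
    utf16 = text.encode('utf-16-le')
    assert utf16.index(u'bar'.encode('utf-16-le')) // 2 == 4
    # UTF-32 / code points: also 4, matching Python 3 string indexing.
    assert text.index(u'bar') == 4
```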
```python
    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'
```
FYI missing newline
Here are the results of the test:

```
$ python snippets.py sentiment-entities-text "foo→bar"
```

The changed `get_native_encoding_type()` seems to return the correct answer.
@merla18 Thanks for the usage example; it helps clarify what this fixes. So that other contributors don't reintroduce the bug, you should add a unit test to
Also wanted to mention that encoding is one of the trickiest things to get right, and we're trying to push that into the client library. So, hopefully specifying the EncodingType wouldn't be necessary at all in the future.
```diff
@@ -30,6 +31,14 @@
 import six


+def get_native_encoding_type():
+    """Returns the encoding type that matches Python's native strings."""
```
Any reason UTF8 isn't here? Below, the input is explicitly encoded as UTF8.
This identifies what Python natively does, regardless of any explicit encoding/decoding.
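A quick interactive check of what `sys.maxunicode` reports (narrow builds top out at the BMP; wide builds and Python 3.3+ cover all code points):

```python
>>> import sys
>>> sys.maxunicode        # 65535 on a narrow build, 1114111 on a wide build
1114111
>>> len(u'\U0001F600')    # 2 on a narrow build (surrogate pair), 1 otherwise
1
```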
@merla18 @monattar there's some weirdness going on here. See my annotated code here:

```python
# Regardless of whether the input is bytes or unicode, this will ensure that
# it is unicode (utf-8 in Python 2, UTF-16 or UTF-32 in Python 3).
if isinstance(text, six.binary_type):
    text = text.decode('utf-8')

# This ensures that the unicode string is encoded as utf-8 regardless of the
# original encoding. encode returns bytes, but protobuf will automatically
# re-encode to utf-8 for its wire format.
document.content = text.encode('utf-8')
document.type = enums.Document.Type.PLAIN_TEXT

# At this point, document is (asciipb):
#   type: PLAIN_TEXT
#   content: "foo\342\206\222bar"
#
# Or in wire format:
#   b'\x08\x01\x12\tfoo\xe2\x86\x92bar'
```

Which is *clearly* utf-8. So we're always sending utf-8 on the wire. Protobuf demands that we do so; we can't set a utf-16 bytestring on the message:

```python
>>> document.content = u"foo→bar".encode('utf-16')
*** ValueError: b'\xff\xfef\x00o\x00o\x00\x92!b\x00a\x00r\x00' has type bytes, but isn't valid UTF-8 encoding. Non-UTF-8 strings must be converted to unicode objects before being added.
```

So things get weird now. If we make the request and specify `enums.EncodingType.UTF8`:

```python
>>> language_client.analyze_entity_sentiment(document, enums.EncodingType.UTF8)
entities {
  name: "bar"
  type: LOCATION
  salience: 1.0
  mentions {
    text {
      content: "bar"
      begin_offset: 6
    }
    type: COMMON
    sentiment {
    }
  }
  sentiment {
  }
}
language: "en"
```

The offset is 6, the byte offset into the utf-8 data, not a character index into the Python 3 string. What's even weirder is that if we pass in `enums.EncodingType.UTF16`:

```python
>>> language_client.analyze_entity_sentiment(document, enums.EncodingType.UTF16)
entities {
  name: "bar"
  type: LOCATION
  salience: 1.0
  mentions {
    text {
      content: "bar"
      begin_offset: 4
    }
    type: COMMON
    sentiment {
    }
  }
  sentiment {
  }
}
language: "en"
```

We get the correct offset for `bar` in the native Python 3 string, 4. Any ideas?
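Until the client library handles this, one possible workaround (my own sketch, not part of this PR) is to convert whatever offset the API returns into a Python character index:

```python
def to_char_offset(text, offset, encoding='utf-8'):
    """Convert an offset into text's encoded form to a character index.

    Assumes `offset` falls on a valid boundary in the encoded data.
    """
    if encoding == 'utf-16':
        encoding = 'utf-16-le'  # skip the BOM so offsets line up
        offset *= 2             # UTF-16 offsets count 2-byte code units
    return len(text.encode(encoding)[:offset].decode(encoding))

# e.g. to_char_offset(u'foo\u2192bar', 6) == 4  (the utf-8 byte offset above)
```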
@monattar @merla18 I'm seeing the same behavior in Java:
With an input of
Shows the offset of Google as 21. Conversely, passing the UTF8 string and telling the API that it's UTF-16 shows the correct offset, 7. This could be a bug outside the scope of the samples and client libraries.
FYI, added in this PR with test, snippet, and style fixes.
Closing in favor of #961