-
Notifications
You must be signed in to change notification settings - Fork 533
Remove occurrence of error "invalid byte sequence in UTF-8" #404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove occurrence of error "invalid byte sequence in UTF-8" #404
Conversation
Signed-off-by: naru-jpn <tus.naru@gmail.com>
Failure of CI relates with #358 . |
I believe #403 will fix the CI build error. Could you add a simple test for this change? |
Fixed CI build error 🎉 @jcouball Thanks a lot! |
@naru-jpn, in a separate email, you said you were "not sure what code should I add in my pull request". Ideally, you would have a test that fails without your change and then succeeds with your change. If you do not know how to do that, send me an example of |
@jcouball There are some reasons why I cannot find code causing this crash. I use a library named Danger and this library dependents ruby-git. Danger needs to call There is example of encoding problem is here. |
Originally, this crash dose not occur. Crash was introduced by #369. #369 pointed out that ruby -e "puts 'ほげ'.encode('UTF-8', 'binary', {:invalid => :replace, :undef => :replace})"
������
ruby -e 'puts "ほげ".scrub'
ほげ |
The command you gave makes no sense:
You are trying to take your unicode string 'ほげ' and treat it as if it were in a binary encoding and convert that to UTF-8 encoding. The garbled output is as expected. I believe the code introduced in PR #369 was not correct. I took a step back to look at the problem that your PR and PR #369 are trying to solve: how to get command output into a string that has a valid encoding.
In Ruby, In later versions of Ruby (maybe 1.9+), the encoding of the string returned from The bottom line is that the actual encoding of the string returned from popen or backticks is unknowable. Sometimes it is a happy coincidence that the actual encoding matches what is retuned by I think the best that can be done is to sanitize the command output as it is read in Git::Lib#run_command in an attempt to “guard the gates”. That way the rest of the code in this gem can assume the command output has a valid encoding. What do you think about introducing a # the rchardec gem is used to determine a string's encoding
# this would add a new dependency for ruby-git
require 'rchardec'
module Git
class Lib
…
def default_string_encoding
__ENCODING__.name
end
def best_guess_encoding
'UTF-8'
end
def actual_encoding(str)
CharDet.detect(str)['encoding'] || best_guess_encoding
end
def encoding_options
{ :invalid => :replace, :undef => :replace }
end
def normalize_encoding(str)
if str.valid_encoding?
return str if str.encoding == default_string_encoding
return str.encode(default_string_encoding, str.encoding, encoding_options)
else
return str.encode(default_string_encoding, actual_encoding(str), encoding_options)
end
end
…
end
end |
It could be that |
@jcouball I think there is a case above code is not safe. I'm sorry if I misunderstood behavior of above code 🙏 |
If the string contains an invalid byte sequence, then That said, this is best proven with tests which I am putting together. I'd ask again for you to provide an example if you have one. Otherwise, I will make up some tests. |
@jcouball I see, I misunderstood how Looks good to me! 🌟 |
@jcouball I found a point to be fixed! |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Please open an issue if you are still having this problem. |
Description
I faced error
invalid byte sequence in UTF-8
when callmodified_files
and I found this error occurs bysplit
at here in case of@full_diff
contains invalid byte sequence.Using
scrub
for@full_diff
to remove invalid byte sequence can remove this error.Related: #392, #369