Skip to content

Remove occurrence of error "invalid byte sequence in UTF-8" #404

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Remove occurrence of error "invalid byte sequence in UTF-8" #404

wants to merge 2 commits into from

Conversation

naru-jpn
Copy link

@naru-jpn naru-jpn commented Feb 2, 2019

  • Ensure all commits include DCO sign-off.
  • Ensure that your contributions pass unit testing.
  • Ensure that your contributions contain documentation if applicable.

Description

I faced error invalid byte sequence in UTF-8 when call modified_files and I found this error occurs by split at here in case of @full_diff contains invalid byte sequence.
Using scrub for @full_diff to remove invalid byte sequence can remove this error.

Related: #392, #369

Signed-off-by: naru-jpn <tus.naru@gmail.com>
@naru-jpn
Copy link
Author

naru-jpn commented Feb 2, 2019

Failure of CI relates with #358 .

@jcouball
Copy link
Member

jcouball commented Feb 5, 2019

I believe #403 will fix the CI build error.

Could you add a simple test for this change?

@naru-jpn
Copy link
Author

Fixed CI build error 🎉

@jcouball Thanks a lot!

@jcouball
Copy link
Member

@naru-jpn, in a separate email, you said you were "not sure what code should I add in my pull request".

Ideally, you would have a test that fails without your change and then succeeds with your change.

If you do not know how to do that, send me an example of git diff output that is causing your error and I will see if I can create such a test case to add to your PR.

@naru-jpn
Copy link
Author

naru-jpn commented Feb 19, 2019

@jcouball There are some reasons why I cannot find code causing this crash.

I use a library named Danger and this library dependents ruby-git. Danger needs to call modified_files then sometimes crash with message invalid byte sequence in UTF-8.
I cannot find which characters is illegal yet. Furthermore I use Dander in my company's product so I cannot write detail of diff.

There is example of encoding problem is here.

@naru-jpn
Copy link
Author

naru-jpn commented Feb 19, 2019

Originally, this crash dose not occur. Crash was introduced by #369.

#369 pointed out that encode('UTF-8', 'binary', {:invalid => :replace, :undef => :replace}) convert characters unexpectedly. On the other hands, scrub work well and also remove crash.

ruby -e "puts 'ほげ'.encode('UTF-8', 'binary', {:invalid => :replace, :undef => :replace})"
������

ruby -e 'puts "ほげ".scrub'
ほげ

@jcouball
Copy link
Member

jcouball commented Feb 19, 2019

@naru-jpn

The command you gave makes no sense:

ruby -e "puts 'ほげ'.encode('UTF-8', 'binary', {:invalid => :replace, :undef => :replace})"
������

You are trying to take your unicode string 'ほげ' and treat it as if it were in a binary encoding and convert that to UTF-8 encoding. The garbled output is as expected. I believe the code introduced in PR #369 was not correct.

I took a step back to look at the problem that your PR and PR #369 are trying to solve: how to get command output into a string that has a valid encoding.

git diff does not output characters in any particular encoding. It passes through bytes from the files being compared. These files are stored in whatever encoding the user wrote them in.

In Ruby, popen and backtick collect the output from the command they run as bytes. They shove these bytes into the string returned. They make no attempt to guess the actual encoding of the string returned.

In later versions of Ruby (maybe 1.9+), the encoding of the string returned from popen or backtick is ALWAYS set to Encoding.default_external. This is usually ‘UTF-8’. Ruby uses the value of the command locale charmap for the initial value of Encoding.default_external. This encoding is set even though the actual encoding of the string is not known.

The bottom line is that the actual encoding of the string returned from popen or backticks is unknowable. Sometimes it is a happy coincidence that the actual encoding matches what is retuned by String#encoding. When Encoding.default_external is ‘UTF-8’ this coincidence happens more often than not.

I think the best that can be done is to sanitize the command output as it is read in Git::Lib#run_command in an attempt to “guard the gates”. That way the rest of the code in this gem can assume the command output has a valid encoding.

What do you think about introducing a normalize_encoding method in Git::Lib that does the following. This would be called when command line output is returned in Git::Lib#run_command. This code is untested and offered only for discussion:

# the rchardec gem is used to determine a string's encoding
# this would add a new dependency for ruby-git
require 'rchardec'

module Git
  class Lib
    

    def default_string_encoding
      __ENCODING__.name
    end

    def best_guess_encoding
      'UTF-8'
    end

    def actual_encoding(str)
      CharDet.detect(str)['encoding'] || best_guess_encoding
    end

    def encoding_options
      { :invalid => :replace, :undef => :replace }
    end

    def normalize_encoding(str)
      if str.valid_encoding?
        return str if str.encoding == default_string_encoding
        return str.encode(default_string_encoding, str.encoding, encoding_options)
      else
        return str.encode(default_string_encoding, actual_encoding(str), encoding_options)
      end
    end

    
  end
end

@jcouball
Copy link
Member

It could be that git diff could have output that is a jumble of encodings. In that case the problem is probably unsolvable in a generic sense. The above solution at least would ensure no exceptions are raised though it might be a bit lossy because it smashes the output into one encoding.

@naru-jpn
Copy link
Author

@jcouball
That is very meaningful code! But I cannot understand full of above code ;)

I think there is a case above code is not safe.
When str.encoding_valid? is true and str.encoding is UTF-8, normalize_encoding returns original str. Crash occurs in case str.encoding says "my encoding is UTF-8" but string contains invalid byte sequence in UTF-8. So I think still sometime crash occurs.

I'm sorry if I misunderstood behavior of above code 🙏

@jcouball
Copy link
Member

If the string contains an invalid byte sequence, then str.valid_encoding? would return false. In this case, the code will attempt to guess at the encoding of the string and convert from that encoding to UTF-8, eliminating any bad code points while it does so.

That said, this is best proven with tests which I am putting together.

I'd ask again for you to provide an example if you have one. Otherwise, I will make up some tests.

@naru-jpn
Copy link
Author

@jcouball I see, I misunderstood how str.valid_encoding? works.

Looks good to me! 🌟

@naru-jpn
Copy link
Author

@jcouball I found a point to be fixed!
When condition is str.valid_encoding? is false and str.encoding == default_string_encoding is true, str should be converted by scrub so as to characters is correctly converted ( #404 (comment) ).

@jcouball
Copy link
Member

jcouball commented Mar 4, 2019

@naru-jpn can you please take a look at #405 and test it in your situation? I think it will fix the problem you are having. If there are any additional tests you could suggest, I would be grateful.

@stale
Copy link

stale bot commented May 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 3, 2019
@stale stale bot closed this May 10, 2019
@jcouball
Copy link
Member

@naru-jpn can you please see if this is fixed in 1.6.0.pre1 version? I think this issue might be fixed by #405.

@jcouball jcouball reopened this Jan 21, 2020
@jcouball
Copy link
Member

jcouball commented Feb 2, 2020

@naru-jpn can you please see if this is fixed in 1.6.0 version? I think this issue might be fixed by #405.

@jcouball
Copy link
Member

Please open an issue if you are still having this problem.

@jcouball jcouball closed this Feb 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants