Remove occurrence of error "invalid byte sequence in UTF-8" #404

naru-jpn · 2019-02-02T15:54:16Z

Ensure all commits include DCO sign-off.
Ensure that your contributions pass unit testing.
Ensure that your contributions contain documentation if applicable.

Description

I faced error invalid byte sequence in UTF-8 when call modified_files and I found this error occurs by split at here in case of @full_diff contains invalid byte sequence.
Using scrub for @full_diff to remove invalid byte sequence can remove this error.

Related: #392, #369

Signed-off-by: naru-jpn <tus.naru@gmail.com>

naru-jpn · 2019-02-02T16:15:37Z

Failure of CI relates with #358 .

jcouball · 2019-02-05T17:27:22Z

I believe #403 will fix the CI build error.

Could you add a simple test for this change?

naru-jpn · 2019-02-18T06:06:01Z

Fixed CI build error 🎉

@jcouball Thanks a lot!

jcouball · 2019-02-18T17:13:03Z

@naru-jpn, in a separate email, you said you were "not sure what code should I add in my pull request".

Ideally, you would have a test that fails without your change and then succeeds with your change.

If you do not know how to do that, send me an example of git diff output that is causing your error and I will see if I can create such a test case to add to your PR.

naru-jpn · 2019-02-19T03:13:48Z

@jcouball There are some reasons why I cannot find code causing this crash.

I use a library named Danger and this library dependents ruby-git. Danger needs to call modified_files then sometimes crash with message invalid byte sequence in UTF-8.
I cannot find which characters is illegal yet. Furthermore I use Dander in my company's product so I cannot write detail of diff.

There is example of encoding problem is here.

naru-jpn · 2019-02-19T03:49:52Z

Originally, this crash dose not occur. Crash was introduced by #369.

#369 pointed out that encode('UTF-8', 'binary', {:invalid => :replace, :undef => :replace}) convert characters unexpectedly. On the other hands, scrub work well and also remove crash.

ruby -e "puts 'ほげ'.encode('UTF-8', 'binary', {:invalid => :replace, :undef => :replace})"
������

ruby -e 'puts "ほげ".scrub'
ほげ

jcouball · 2019-02-19T07:34:25Z

@naru-jpn

The command you gave makes no sense:

ruby -e "puts 'ほげ'.encode('UTF-8', 'binary', {:invalid => :replace, :undef => :replace})"
������

You are trying to take your unicode string 'ほげ' and treat it as if it were in a binary encoding and convert that to UTF-8 encoding. The garbled output is as expected. I believe the code introduced in PR #369 was not correct.

I took a step back to look at the problem that your PR and PR #369 are trying to solve: how to get command output into a string that has a valid encoding.

git diff does not output characters in any particular encoding. It passes through bytes from the files being compared. These files are stored in whatever encoding the user wrote them in.

In Ruby, popen and backtick collect the output from the command they run as bytes. They shove these bytes into the string returned. They make no attempt to guess the actual encoding of the string returned.

In later versions of Ruby (maybe 1.9+), the encoding of the string returned from popen or backtick is ALWAYS set to Encoding.default_external. This is usually ‘UTF-8’. Ruby uses the value of the command locale charmap for the initial value of Encoding.default_external. This encoding is set even though the actual encoding of the string is not known.

The bottom line is that the actual encoding of the string returned from popen or backticks is unknowable. Sometimes it is a happy coincidence that the actual encoding matches what is retuned by String#encoding. When Encoding.default_external is ‘UTF-8’ this coincidence happens more often than not.

I think the best that can be done is to sanitize the command output as it is read in Git::Lib#run_command in an attempt to “guard the gates”. That way the rest of the code in this gem can assume the command output has a valid encoding.

What do you think about introducing a normalize_encoding method in Git::Lib that does the following. This would be called when command line output is returned in Git::Lib#run_command. This code is untested and offered only for discussion:

# the rchardec gem is used to determine a string's encoding
# this would add a new dependency for ruby-git
require 'rchardec'

module Git
  class Lib
    …

    def default_string_encoding
      __ENCODING__.name
    end

    def best_guess_encoding
      'UTF-8'
    end

    def actual_encoding(str)
      CharDet.detect(str)['encoding'] || best_guess_encoding
    end

    def encoding_options
      { :invalid => :replace, :undef => :replace }
    end

    def normalize_encoding(str)
      if str.valid_encoding?
        return str if str.encoding == default_string_encoding
        return str.encode(default_string_encoding, str.encoding, encoding_options)
      else
        return str.encode(default_string_encoding, actual_encoding(str), encoding_options)
      end
    end

    …
  end
end

jcouball · 2019-02-19T07:48:26Z

It could be that git diff could have output that is a jumble of encodings. In that case the problem is probably unsolvable in a generic sense. The above solution at least would ensure no exceptions are raised though it might be a bit lossy because it smashes the output into one encoding.

naru-jpn · 2019-02-22T05:09:04Z

@jcouball
That is very meaningful code! But I cannot understand full of above code ;)

I think there is a case above code is not safe.
When str.encoding_valid? is true and str.encoding is UTF-8, normalize_encoding returns original str. Crash occurs in case str.encoding says "my encoding is UTF-8" but string contains invalid byte sequence in UTF-8. So I think still sometime crash occurs.

I'm sorry if I misunderstood behavior of above code 🙏

jcouball · 2019-02-22T16:55:12Z

If the string contains an invalid byte sequence, then str.valid_encoding? would return false. In this case, the code will attempt to guess at the encoding of the string and convert from that encoding to UTF-8, eliminating any bad code points while it does so.

That said, this is best proven with tests which I am putting together.

I'd ask again for you to provide an example if you have one. Otherwise, I will make up some tests.

naru-jpn · 2019-02-23T06:52:23Z

@jcouball I see, I misunderstood how str.valid_encoding? works.

Looks good to me! 🌟

naru-jpn · 2019-02-28T06:13:56Z

@jcouball I found a point to be fixed!
When condition is str.valid_encoding? is false and str.encoding == default_string_encoding is true, str should be converted by scrub so as to characters is correctly converted ( #404 (comment) ).

jcouball · 2019-03-04T16:00:50Z

@naru-jpn can you please take a look at #405 and test it in your situation? I think it will fix the problem you are having. If there are any additional tests you could suggest, I would be grateful.

stale · 2019-05-03T16:45:39Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

jcouball · 2020-01-21T16:57:05Z

@naru-jpn can you please see if this is fixed in 1.6.0.pre1 version? I think this issue might be fixed by #405.

jcouball · 2020-02-02T16:20:28Z

@naru-jpn can you please see if this is fixed in 1.6.0 version? I think this issue might be fixed by #405.

jcouball · 2020-02-12T01:14:15Z

Please open an issue if you are still having this problem.

Remove occurrence of error "invalid byte sequence in UTF-8"

e4d8cd4

Signed-off-by: naru-jpn <tus.naru@gmail.com>

jcouball mentioned this pull request Feb 5, 2019

Do not allow changes to ENV to leak from test to test #403

Merged

3 tasks

Merge branch 'master' into feature/fix_invalid_byte_sequence_in_utf-8

550515f

jcouball mentioned this pull request Feb 28, 2019

Do not let unexpected command output encoding raise an exception #405

Merged

3 tasks

stale bot added the wontfix label May 3, 2019

stale bot closed this May 10, 2019

jcouball reopened this Jan 21, 2020

jcouball closed this Feb 12, 2020

Remove occurrence of error "invalid byte sequence in UTF-8" #404

Remove occurrence of error "invalid byte sequence in UTF-8" #404

Uh oh!

Conversation

naru-jpn commented Feb 2, 2019

Description

Uh oh!

naru-jpn commented Feb 2, 2019

Uh oh!

jcouball commented Feb 5, 2019

Uh oh!

naru-jpn commented Feb 18, 2019

Uh oh!

jcouball commented Feb 18, 2019

Uh oh!

naru-jpn commented Feb 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

naru-jpn commented Feb 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jcouball commented Feb 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jcouball commented Feb 19, 2019

Uh oh!

naru-jpn commented Feb 22, 2019

Uh oh!

jcouball commented Feb 22, 2019

Uh oh!

naru-jpn commented Feb 23, 2019

Uh oh!

naru-jpn commented Feb 28, 2019

Uh oh!

jcouball commented Mar 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stale bot commented May 3, 2019

Uh oh!

jcouball commented Jan 21, 2020

Uh oh!

jcouball commented Feb 2, 2020

Uh oh!

jcouball commented Feb 12, 2020

Uh oh!

Uh oh!

naru-jpn commented Feb 19, 2019 •

edited

Loading

naru-jpn commented Feb 19, 2019 •

edited

Loading

jcouball commented Feb 19, 2019 •

edited

Loading

jcouball commented Mar 4, 2019 •

edited

Loading