Country specific characters in Windows user folder name when locating .tfm-file #11848

ghost · 2018-08-13T08:16:35Z

Change to utf-8 format, due to return result.decode('ascii') returning an UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128), when æ,ø or å appears in the path to the .tfm-file.

PR Summary

PR Checklist

Has Pytest style unit tests
Code is PEP 8 compliant
New features are documented, with examples if plot related
Documentation is sphinx and numpydoc compliant
Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

Change to utf-8 format, due to return result.decode('ascii') returning an UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128), when æ,ø or å appears in the path to the .tfm-file.

anntzer

Would be nice to see some docs stating that this is indeed utf8 and not e.g. the current codepage or the encoding stated by $LANG.

ghost · 2018-08-13T10:24:33Z

As this is my first attempt at a contribution towards an open source project, i am not entirely sure what you mean by docs stating that this is utf8.

The change to utf8 from ascii was a hail marry from my side, to fix an issue i had when using matplotlib together with matplotlib.rc('text', usetex=True), on my Windows machine, using MikTex as the LaTeX-interpreter.

I committed this changes in the hopes that someone would review the change and verify that i did not break any of the current functionality of the find_tex_file-function.

ImportanceOfBeingErnest · 2018-08-13T11:35:52Z

What the above comment by @anntzer means is that it is currently not clear if the return of subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate() is necessarily utf-8 encoded, or whether it may depend on the system(-settings). If one could find a documentation where this is specified to be utf-8 in every case, this would be a good fix; else one would probably need a more sophisticated one.

ghost · 2018-08-13T11:49:01Z

@ImportanceOfBeingErnest Thank you very much for the clarification.

I will try and investigate the subprocess.Popen documentation

Another fix might be to write code that detects the language of the system, and only change the decoding when necessary.

ghost · 2018-08-14T09:59:29Z

Alright, so i have looked into the documentation of subprocess.Popen.communicate(), and found no definitive answer to what encoding it uses when piping information from the kpsewhich program.

Running kpsewhich cmmi10.tfm directly in Windows PowerShell resulted in a path C:/Users/Morten M├╕ller/AppData/Local/Programs/MiKTeX 2.9/fonts/tfm/public/cm/cmmi10.tfm , where the ø in "Morten Møller" does not appear to be decoded properly. Although, decoding the piped stdout from kpsewhich using utf-8 gives the correct path, with ø included, in the dviread.py-script.

ImportanceOfBeingErnest · 2018-08-14T11:09:48Z

An option which does not break any existing code, but would solve this problem is probably to try/except both cases,

try:
    res = result.decode('ascii')
except UnicodeDecodeError:
    res = result.decode('utf-8')
return res

ghost · 2018-08-14T11:38:58Z

As suggested by @ImportanceOfBeingErnest, i added the try and except error catching method to only switch to utf8-decoding when ascii fails to decode the piped info from kpsewhich.

Edit : I added this piece of code into my local version of dviread.py and it ran without errors when plotting my data. I do not know why the two checks failed on this specfic commit.

tacaswell · 2018-08-14T14:24:00Z

I think that sys.getfilesystemencoding() will return the system encoding which is how the return value from popen will be returned?

Unfortunately, it looks like the encoding here will depend on the kpeswitch, but the hope is that it is just passing the bytes it gets back from the filesystem through so the system encoding should be safe?

ImportanceOfBeingErnest · 2018-08-14T14:42:25Z

The tests fail because line 1024 contains whitespaces. Just remove them and you'll be fine.

anntzer · 2018-08-14T15:06:27Z

The try... except doesn't make sense. utf-8 is a superset of ascii (or rather, it is ascii compatible), so anything that can be ascii-decoded will yield the same result when utf-8-decoded.

anntzer · 2018-08-14T15:25:01Z

@tacaswell getfilesystemencoding is unlikely to be "correct" in all cases because that switched from (a default of) "mbcs" in py<3.6 to "utf-8" in py>=3.6 (PEP529), but kpathsea obviously didn't change the encoding it uses at the same time.

eric-wieser · 2018-09-22T21:50:38Z

~~Note that this function already utf8-encodes the arguments to kpathsea, so it seems consistent to utf8-decode the responses.~~ - edit: misread encode for decode.

anntzer · 2018-09-24T12:16:38Z

I haven't fully unravelled the sources of kpathsea but https://tug.org/svn/texlive/trunk/Build/source/texk/kpathsea/pathsearch.c?view=markup#l38 + https://tug.org/svn/texlive/trunk/Build/source/texk/kpathsea/knj.c?revision=41586&view=markup#l407

38	#ifdef WIN32
39	#undef fputs
40	#undef puts
41	#define fputs win32_fputs
42	#define puts  win32_puts
43	#endif

407	static int __win32_fputs(const char *str, HANDLE hStdout)
408	{
409	    DWORD ret;
410	    wchar_t *wstr;
411	
412	    wstr = get_wstring_from_utf8(str, wstr=NULL);
413	
414	    if (WriteConsoleW(hStdout, wstr, wcslen(wstr), &ret, NULL) == 0) {
415	        free(wstr);
416	        return EOF;
417	    }
418	
419	    free(wstr);
420	    return ret;
421	}

plus the OP's observation that he gets paths encoded as utf-8 on Windows suggest to me that the correct encoding is utf-8 on Windows and the filesystemencoding on Unices.

Kojoley · 2018-09-26T00:58:55Z

@anntzer I have tested and app that does WriteConsoleW which is called with Python and find out that subprocess.Popen (subprocess.check_output too) gives empty output. So it should be printing with fputs in kpathsea_win32_fputs and since it has this File_system_codepage == CP_UTF8, the encoding must be utf8.

WriteConsoleW app

#include <windows.h>

int main()
{
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"Стдаут!\n", 8, NULL, NULL);
    WriteConsoleW(GetStdHandle(STD_ERROR_HANDLE), L"Стдерр!\n", 8, NULL, NULL);
}

Output of the executable:

C:\Temp\unicodetest>unicodetest.exe
Стдаут!
Стдерр!

Python reader:

import subprocess

cmd = 'unicodetest.exe'
result = subprocess.check_output(
            [cmd], stderr=subprocess.STDOUT)

print(type(result))
print('"%s"' % result)
print('"%s"' % result.decode('utf-8'))

pipe = subprocess.Popen([cmd], stdout=subprocess.PIPE)
result = pipe.communicate()[0].rstrip()

print(type(result))
print('"%s"' % result)
print('"%s"' % result.decode('utf-8'))

Output

<class 'bytes'>
"b''"
""
Стдерр!
<class 'bytes'>
"b''"
""

My kpsewhich is from MiKTeX (kpsewhich.c)

>kpsewhich --version
MiKTeX 2.9.6600

From the prerequest list it looks like it does not depend on kpathsea, because they emulate it.

Investigate the miktex sources

kpsewhich prints content with puts

Call tree: miktex_kpathsea_path_expand -> Session::Expand -> SessionImpl::Expand -> SessionImpl::ExpandPathPatterns -> SessionImpl::ExpandPathPattern -> SessionImpl::DirectoryWalk

in directory traversing it takes entry.name:

  unique_ptr<DirectoryLister> dirLister = DirectoryLister::Open(directory, nullptr, (int)DirectoryLister::Options::DirectoriesOnly);
  DirectoryEntry entry;
  vector<PathName> subdirs;
  while (dirLister->GetNext(entry))
  {
    MIKTEX_ASSERT(entry.isDirectory);
    PathName subdir(directory);
    subdir /= entry.name;
    subdirs.push_back(subdir);
  }
  dirLister->Close();

winDirectoryLister::GetNext(DirectoryEntry & direntry)

  DirectoryEntry2 direntry2;
  if (!GetNext(direntry2))

winDirectoryLister::GetNext(DirectoryEntry2 & direntry2)

    direntry2.name = WU_(ffdat.cFileName);

WU_ macro def

#  define WU_(x) MiKTeX::Util::CharBuffer<char>(x).GetData()

CharBuffer ctor calls CharBuffer::Set

      StringUtil::CopyString(buffer, GetCapacity(), lpsz);

StringUtil::CopyString

size_t StringUtil::CopyString(char* dest, size_t destSize, const wchar_t* source)
{
  return CopyString(dest, destSize, WideCharToUTF8(source).c_str());
}

Taaadam, the output is in utf8 encoding.

anntzer · 2018-09-26T08:18:20Z

Thanks for the careful investigation. Can you review #12253?

t-tk · 2018-09-27T11:45:45Z

I will shortly explain the function of command_line_encoding on windows.
The function is switched by settiing of value kpse->File_system_codepage.
The value is initialized at kpathsea_set_progname() (progname.c L498) and it may set to UTF-8 in kpathsea_get_command_line_args() (knj.c L246)
If kpse->File_system_codepage == CP_UTF8, win32_fputs and win32_puts assume "input strings are encoded in UTF-8" and output Unicode characters on console by using WriteConsoleW().

Typical usage is the following:

get configuration of command_line_encoding from texmf.cnf (dvipdfmx.c L1030)
If command_line_encoding is UTF-8, get command line argument as UTF-8 and set the value kpse->File_system_codepage (dvipdfmx.c L1031)
run command line I/O functions via wrapper functions like win32_fputs and win32_puts

anntzer · 2018-09-27T20:46:03Z

Thank you very much for your help.
If I understand correctly, the encoding actually depends on the user configuration, although I guess in practice it'll usually be utf-8?
Given that we probably don't want to parse texmf.cnf ourselves to figure out what the encoding is going to be, is there a way to force kpsewhich to output in a specific encoding (we're not picky, as long as we can know which one it is)?

t-tk · 2018-09-30T00:28:29Z

In Windows,
conventional C code (ex. int main(argc, argv){}, fopen(), puts()) assumes the character encoding is a legacy encoding, CP1252 in Europe & north America, CP932 in Japan.
The function command_line_encoding provides wrapper functions to treat command line arguments, file I/O and console I/O.
If a developer wishes to fix character encoding UTF-8, the following usage will work.

set kpse->File_system_codepage = CP_UTF8 and kpse->Is_cp932_system = 0 (ref.
knj.c L246, knj.c L247)
replace from POSIX functions to wrapper functions (ex. fputs -> win32_fputs)

Kojoley · 2018-09-30T12:10:02Z

@anntzer IIUC we can set env var command_line_encoding to utf-8 to be sure that it will use utf-8 for input and output (env var has a priority over configs) https://tug.org/svn/texlive/trunk/Build/source/texk/kpathsea/variable.c?revision=41557&view=markup#l52

tacaswell · 2019-02-01T03:57:52Z

@Kojoley @anntzer Does #12351 cover this problem?

anntzer · 2019-02-01T04:46:29Z

I think so (to the best of my understanding...).
Thanks @MortenSHUTE for the patch and @t-tk for your explanations!

Kojoley · 2019-02-01T12:33:19Z

Yes. I just wanted to run tex tests on windows to be sure thing are fine, but for some reason after switching an appveyor build to TEST_ALL=yes it hanged https://ci.appveyor.com/project/Kojoley/matplotlib/builds/19205834 (and could not run locally because of my problems with needed stuff installation).

Country specific characters in Windows user folder name

7b10da9

Change to utf-8 format, due to return result.decode('ascii') returning an UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128), when æ,ø or å appears in the path to the .tfm-file.

ghost changed the title ~~Country specific characters in Windows user folder name when locating .tf-file~~ Country specific characters in Windows user folder name when locating .tfmfile Aug 13, 2018

ghost changed the title ~~Country specific characters in Windows user folder name when locating .tfmfile~~ Country specific characters in Windows user folder name when locating .tfm-file Aug 13, 2018

anntzer requested changes Aug 13, 2018

View reviewed changes

tacaswell added this to the v2.2.4 milestone Aug 13, 2018

Updated with try and except error catch

d893d3c

Removed whitespaces from line 1024

9ebf878

anntzer mentioned this pull request Sep 24, 2018

Handle utf-8 output by kpathsea on Windows. #12253

Merged

6 tasks

Kojoley mentioned this pull request Sep 30, 2018

dviread: find_tex_file: Ensure the encoding on windows #12351

Merged

anntzer closed this Feb 1, 2019

anntzer added the topic: text/usetex label Feb 1, 2019

QuLogic modified the milestones: v2.2.4, v3.1 Feb 1, 2019

Uh oh!

Country specific characters in Windows user folder name when locating .tfm-file #11848

Country specific characters in Windows user folder name when locating .tfm-file #11848

Conversation

ghost commented Aug 13, 2018

PR Summary

PR Checklist

Uh oh!

anntzer left a comment

Choose a reason for hiding this comment

Uh oh!

ghost commented Aug 13, 2018

Uh oh!

ImportanceOfBeingErnest commented Aug 13, 2018

Uh oh!

ghost commented Aug 13, 2018

Uh oh!

ghost commented Aug 14, 2018 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ImportanceOfBeingErnest commented Aug 14, 2018

Uh oh!

ghost commented Aug 14, 2018 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tacaswell commented Aug 14, 2018

Uh oh!

ImportanceOfBeingErnest commented Aug 14, 2018

Uh oh!

anntzer commented Aug 14, 2018

Uh oh!

anntzer commented Aug 14, 2018

Uh oh!

eric-wieser commented Sep 22, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anntzer commented Sep 24, 2018

Uh oh!

Kojoley commented Sep 26, 2018

Investigate the miktex sources

Uh oh!

anntzer commented Sep 26, 2018

Uh oh!

t-tk commented Sep 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anntzer commented Sep 27, 2018

Uh oh!

t-tk commented Sep 30, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Kojoley commented Sep 30, 2018

Uh oh!

tacaswell commented Feb 1, 2019

Uh oh!

anntzer commented Feb 1, 2019

Uh oh!

Kojoley commented Feb 1, 2019

Uh oh!

Uh oh!

ghost commented Aug 14, 2018 •

edited by ghost

Loading

ghost commented Aug 14, 2018 •

edited by ghost

Loading

eric-wieser commented Sep 22, 2018 •

edited

Loading

t-tk commented Sep 27, 2018 •

edited

Loading

t-tk commented Sep 30, 2018 •

edited

Loading