Skip to content

Country specific characters in Windows user folder name when locating .tfm-file #11848

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Aug 13, 2018

Change to utf-8 format, due to return result.decode('ascii') returning an UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128), when æ,ø or å appears in the path to the .tfm-file.

PR Summary

PR Checklist

  • Has Pytest style unit tests
  • Code is PEP 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

Change to utf-8 format, due to return result.decode('ascii') returning an UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128), when æ,ø or å appears in the path to the .tfm-file.
@ghost ghost changed the title Country specific characters in Windows user folder name when locating .tf-file Country specific characters in Windows user folder name when locating .tfmfile Aug 13, 2018
@ghost ghost changed the title Country specific characters in Windows user folder name when locating .tfmfile Country specific characters in Windows user folder name when locating .tfm-file Aug 13, 2018
Copy link
Contributor

@anntzer anntzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to see some docs stating that this is indeed utf8 and not e.g. the current codepage or the encoding stated by $LANG.

@ghost
Copy link
Author

ghost commented Aug 13, 2018

As this is my first attempt at a contribution towards an open source project, i am not entirely sure what you mean by docs stating that this is utf8.

The change to utf8 from ascii was a hail marry from my side, to fix an issue i had when using matplotlib together with matplotlib.rc('text', usetex=True), on my Windows machine, using MikTex as the LaTeX-interpreter.

I committed this changes in the hopes that someone would review the change and verify that i did not break any of the current functionality of the find_tex_file-function.

@ImportanceOfBeingErnest
Copy link
Member

What the above comment by @anntzer means is that it is currently not clear if the return of subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate() is necessarily utf-8 encoded, or whether it may depend on the system(-settings). If one could find a documentation where this is specified to be utf-8 in every case, this would be a good fix; else one would probably need a more sophisticated one.

@ghost
Copy link
Author

ghost commented Aug 13, 2018

@ImportanceOfBeingErnest Thank you very much for the clarification.

I will try and investigate the subprocess.Popen documentation

Another fix might be to write code that detects the language of the system, and only change the decoding when necessary.

@tacaswell tacaswell added this to the v2.2.4 milestone Aug 13, 2018
@ghost
Copy link
Author

ghost commented Aug 14, 2018

Alright, so i have looked into the documentation of subprocess.Popen.communicate(), and found no definitive answer to what encoding it uses when piping information from the kpsewhich program.

Running kpsewhich cmmi10.tfm directly in Windows PowerShell resulted in a path C:/Users/Morten M├╕ller/AppData/Local/Programs/MiKTeX 2.9/fonts/tfm/public/cm/cmmi10.tfm , where the ø in "Morten Møller" does not appear to be decoded properly. Although, decoding the piped stdout from kpsewhich using utf-8 gives the correct path, with ø included, in the dviread.py-script.

@ImportanceOfBeingErnest
Copy link
Member

An option which does not break any existing code, but would solve this problem is probably to try/except both cases,

try:
    res = result.decode('ascii')
except UnicodeDecodeError:
    res = result.decode('utf-8')
return res

@ghost
Copy link
Author

ghost commented Aug 14, 2018

As suggested by @ImportanceOfBeingErnest, i added the try and except error catching method to only switch to utf8-decoding when ascii fails to decode the piped info from kpsewhich.

Edit : I added this piece of code into my local version of dviread.py and it ran without errors when plotting my data. I do not know why the two checks failed on this specfic commit.

@tacaswell
Copy link
Member

I think that sys.getfilesystemencoding() will return the system encoding which is how the return value from popen will be returned?

Unfortunately, it looks like the encoding here will depend on the kpeswitch, but the hope is that it is just passing the bytes it gets back from the filesystem through so the system encoding should be safe?

@ImportanceOfBeingErnest
Copy link
Member

The tests fail because line 1024 contains whitespaces. Just remove them and you'll be fine.

@anntzer
Copy link
Contributor

anntzer commented Aug 14, 2018

The try... except doesn't make sense. utf-8 is a superset of ascii (or rather, it is ascii compatible), so anything that can be ascii-decoded will yield the same result when utf-8-decoded.

@anntzer
Copy link
Contributor

anntzer commented Aug 14, 2018

@tacaswell getfilesystemencoding is unlikely to be "correct" in all cases because that switched from (a default of) "mbcs" in py<3.6 to "utf-8" in py>=3.6 (PEP529), but kpathsea obviously didn't change the encoding it uses at the same time.

@eric-wieser
Copy link
Contributor

eric-wieser commented Sep 22, 2018

Note that this function already utf8-encodes the arguments to kpathsea, so it seems consistent to utf8-decode the responses. - edit: misread encode for decode.

@anntzer
Copy link
Contributor

anntzer commented Sep 24, 2018

I haven't fully unravelled the sources of kpathsea but https://tug.org/svn/texlive/trunk/Build/source/texk/kpathsea/pathsearch.c?view=markup#l38 + https://tug.org/svn/texlive/trunk/Build/source/texk/kpathsea/knj.c?revision=41586&view=markup#l407

38	#ifdef WIN32
39	#undef fputs
40	#undef puts
41	#define fputs win32_fputs
42	#define puts  win32_puts
43	#endif
407	static int __win32_fputs(const char *str, HANDLE hStdout)
408	{
409	    DWORD ret;
410	    wchar_t *wstr;
411	
412	    wstr = get_wstring_from_utf8(str, wstr=NULL);
413	
414	    if (WriteConsoleW(hStdout, wstr, wcslen(wstr), &ret, NULL) == 0) {
415	        free(wstr);
416	        return EOF;
417	    }
418	
419	    free(wstr);
420	    return ret;
421	}

plus the OP's observation that he gets paths encoded as utf-8 on Windows suggest to me that the correct encoding is utf-8 on Windows and the filesystemencoding on Unices.

@Kojoley
Copy link
Member

Kojoley commented Sep 26, 2018

@anntzer I have tested and app that does WriteConsoleW which is called with Python and find out that subprocess.Popen (subprocess.check_output too) gives empty output. So it should be printing with fputs in kpathsea_win32_fputs and since it has this File_system_codepage == CP_UTF8, the encoding must be utf8.

WriteConsoleW app

#include <windows.h>

int main()
{
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"Стдаут!\n", 8, NULL, NULL);
    WriteConsoleW(GetStdHandle(STD_ERROR_HANDLE), L"Стдерр!\n", 8, NULL, NULL);
}

Output of the executable:

C:\Temp\unicodetest>unicodetest.exe
Стдаут!
Стдерр!

Python reader:

import subprocess

cmd = 'unicodetest.exe'
result = subprocess.check_output(
            [cmd], stderr=subprocess.STDOUT)

print(type(result))
print('"%s"' % result)
print('"%s"' % result.decode('utf-8'))

pipe = subprocess.Popen([cmd], stdout=subprocess.PIPE)
result = pipe.communicate()[0].rstrip()

print(type(result))
print('"%s"' % result)
print('"%s"' % result.decode('utf-8'))

Output

<class 'bytes'>
"b''"
""
Стдерр!
<class 'bytes'>
"b''"
""

My kpsewhich is from MiKTeX (kpsewhich.c)

>kpsewhich --version
MiKTeX 2.9.6600

From the prerequest list it looks like it does not depend on kpathsea, because they emulate it.

Investigate the miktex sources

kpsewhich prints content with puts

Call tree: miktex_kpathsea_path_expand -> Session::Expand -> SessionImpl::Expand -> SessionImpl::ExpandPathPatterns -> SessionImpl::ExpandPathPattern -> SessionImpl::DirectoryWalk

in directory traversing it takes entry.name:

  unique_ptr<DirectoryLister> dirLister = DirectoryLister::Open(directory, nullptr, (int)DirectoryLister::Options::DirectoriesOnly);
  DirectoryEntry entry;
  vector<PathName> subdirs;
  while (dirLister->GetNext(entry))
  {
    MIKTEX_ASSERT(entry.isDirectory);
    PathName subdir(directory);
    subdir /= entry.name;
    subdirs.push_back(subdir);
  }
  dirLister->Close();

winDirectoryLister::GetNext(DirectoryEntry & direntry)

  DirectoryEntry2 direntry2;
  if (!GetNext(direntry2))

winDirectoryLister::GetNext(DirectoryEntry2 & direntry2)

    direntry2.name = WU_(ffdat.cFileName);

WU_ macro def

#  define WU_(x) MiKTeX::Util::CharBuffer<char>(x).GetData()

CharBuffer ctor calls CharBuffer::Set

      StringUtil::CopyString(buffer, GetCapacity(), lpsz);

StringUtil::CopyString

size_t StringUtil::CopyString(char* dest, size_t destSize, const wchar_t* source)
{
  return CopyString(dest, destSize, WideCharToUTF8(source).c_str());
}

Taaadam, the output is in utf8 encoding.

@anntzer
Copy link
Contributor

anntzer commented Sep 26, 2018

Thanks for the careful investigation. Can you review #12253?

@t-tk
Copy link

t-tk commented Sep 27, 2018

I will shortly explain the function of command_line_encoding on windows.
The function is switched by settiing of value kpse->File_system_codepage.
The value is initialized at kpathsea_set_progname() (progname.c L498) and it may set to UTF-8 in kpathsea_get_command_line_args() (knj.c L246)
If kpse->File_system_codepage == CP_UTF8, win32_fputs and win32_puts assume "input strings are encoded in UTF-8" and output Unicode characters on console by using WriteConsoleW().

Typical usage is the following:

  1. get configuration of command_line_encoding from texmf.cnf (dvipdfmx.c L1030)
  2. If command_line_encoding is UTF-8, get command line argument as UTF-8 and set the value kpse->File_system_codepage (dvipdfmx.c L1031)
  3. run command line I/O functions via wrapper functions like win32_fputs and win32_puts

@anntzer
Copy link
Contributor

anntzer commented Sep 27, 2018

Thank you very much for your help.
If I understand correctly, the encoding actually depends on the user configuration, although I guess in practice it'll usually be utf-8?
Given that we probably don't want to parse texmf.cnf ourselves to figure out what the encoding is going to be, is there a way to force kpsewhich to output in a specific encoding (we're not picky, as long as we can know which one it is)?

@t-tk
Copy link

t-tk commented Sep 30, 2018

In Windows,
conventional C code (ex. int main(argc, argv){}, fopen(), puts()) assumes the character encoding is a legacy encoding, CP1252 in Europe & north America, CP932 in Japan.
The function command_line_encoding provides wrapper functions to treat command line arguments, file I/O and console I/O.
If a developer wishes to fix character encoding UTF-8, the following usage will work.

  • set kpse->File_system_codepage = CP_UTF8 and kpse->Is_cp932_system = 0 (ref.
    knj.c L246, knj.c L247)
  • replace from POSIX functions to wrapper functions (ex. fputs -> win32_fputs)

@Kojoley
Copy link
Member

Kojoley commented Sep 30, 2018

@anntzer IIUC we can set env var command_line_encoding to utf-8 to be sure that it will use utf-8 for input and output (env var has a priority over configs) https://tug.org/svn/texlive/trunk/Build/source/texk/kpathsea/variable.c?revision=41557&view=markup#l52

@tacaswell
Copy link
Member

@Kojoley @anntzer Does #12351 cover this problem?

@anntzer
Copy link
Contributor

anntzer commented Feb 1, 2019

I think so (to the best of my understanding...).
Thanks @MortenSHUTE for the patch and @t-tk for your explanations!

@anntzer anntzer closed this Feb 1, 2019
@Kojoley
Copy link
Member

Kojoley commented Feb 1, 2019

Yes. I just wanted to run tex tests on windows to be sure thing are fine, but for some reason after switching an appveyor build to TEST_ALL=yes it hanged https://ci.appveyor.com/project/Kojoley/matplotlib/builds/19205834 (and could not run locally because of my problems with needed stuff installation).

@QuLogic QuLogic modified the milestones: v2.2.4, v3.1 Feb 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants