Country specific characters in Windows user folder name when locating .tfm-file #11848
Change the decoding to utf-8, because return result.decode('ascii') raises UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128) when æ, ø or å appears in the path to the .tfm file.
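For illustration, the failure described above can be reproduced without kpsewhich. This is only a sketch: the path below is hypothetical and the bytes are produced by encoding it as UTF-8, standing in for what would be read from the pipe. In UTF-8, æ, ø and å all start with the byte 0xc3, which matches the byte reported in the error message.

```python
# Hypothetical Windows-style path containing a non-ASCII character (ø).
path = "C:/Users/Bjørn/texmf/fonts/tfm/cmr10.tfm"

# Stand-in for the raw bytes read from the kpsewhich pipe (assumed UTF-8).
raw = path.encode("utf-8")

failed_byte = None
message = ""
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that ascii cannot decode.
    failed_byte = raw[exc.start]
    message = str(exc)

# Decoding the same bytes as UTF-8 recovers the original path.
decoded = raw.decode("utf-8")

print(hex(failed_byte))
print(message)
print(decoded == path)
```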
Would be nice to see some docs stating that this is indeed utf8 and not e.g. the current codepage or the encoding stated by $LANG.
As this is my first attempt at a contribution to an open source project, I am not entirely sure what you mean by docs stating that this is utf8. The change from ascii to utf8 was a Hail Mary on my side to fix an issue I had when using matplotlib together with matplotlib.rc('text', usetex=True) on my Windows machine, with MiKTeX as the LaTeX interpreter. I committed these changes in the hope that someone would review them and verify that I did not break any of the current functionality of the find_tex_file function.
What the above comment by @anntzer means is that it is currently not clear if the return of |
@ImportanceOfBeingErnest Thank you very much for the clarification. I will try to investigate the subprocess.Popen documentation. Another fix might be to write code that detects the language of the system and only changes the decoding when necessary.
Alright, so I have looked into the documentation of Running |
An option that does not break any existing code but would solve this problem is probably to try/except both cases, i.e. to try ascii decoding first and fall back to utf-8 only if that fails.
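A minimal sketch of what such a try/except fallback might look like. The helper name decode_kpsewhich_output is hypothetical and is not taken from dviread.py; it only illustrates the idea of trying ascii first and falling back to utf-8.

```python
def decode_kpsewhich_output(raw: bytes) -> str:
    """Hypothetical helper: decode as ascii, fall back to utf-8 on failure."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        # Only reached when the bytes contain values >= 0x80.
        return raw.decode("utf-8")


# Pure-ascii output is unaffected by the fallback.
print(decode_kpsewhich_output(b"/usr/share/texmf/fonts/tfm/cmr10.tfm"))

# Output with a non-ascii character (hypothetical path) uses the utf-8 branch.
print(decode_kpsewhich_output("C:/Users/å/cmr10.tfm".encode("utf-8")))
```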
As suggested by @ImportanceOfBeingErnest, I added the try/except error handling so that utf8 decoding is only used when ascii fails to decode the piped output from kpsewhich. Edit: I added this piece of code to my local version of dviread.py and it ran without errors when plotting my data. I do not know why the two checks failed on this specific commit.
I think that Unfortunately, it looks like the encoding here will depend on the |
The tests fail because line 1024 contains whitespaces. Just remove them and you'll be fine. |
The try... except doesn't make sense. utf-8 is a superset of ascii (or rather, it is ascii compatible), so anything that can be ascii-decoded will yield the same result when utf-8-decoded. |
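The point about ASCII compatibility can be checked directly. The sketch below decodes every byte value that ascii accepts (0 to 127) with both codecs and compares the results, which is why a plain utf-8 decode would cover both branches of the try/except.

```python
# Every byte value that the ascii codec can decode.
ascii_bytes = bytes(range(128))

# Both codecs yield the same string for ascii-only input.
same = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
print(same)

# Bytes outside the ascii range are only handled by utf-8.
non_ascii = b"\xc3\xa5".decode("utf-8")  # UTF-8 encoding of "å"
print(non_ascii)
```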
@tacaswell getfilesystemencoding is unlikely to be "correct" in all cases because that switched from (a default of) "mbcs" in py<3.6 to "utf-8" in py>=3.6 (PEP529), but kpathsea obviously didn't change the encoding it uses at the same time. |
I haven't fully unravelled the sources of kpathsea but https://tug.org/svn/texlive/trunk/Build/source/texk/kpathsea/pathsearch.c?view=markup#l38 + https://tug.org/svn/texlive/trunk/Build/source/texk/kpathsea/knj.c?revision=41586&view=markup#l407
plus the OP's observation that he gets paths encoded as utf-8 on Windows suggest to me that the correct encoding is utf-8 on Windows and the filesystemencoding on Unices. |
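As a rough sketch of that suggestion (utf-8 on Windows, the filesystem encoding elsewhere), assuming the conclusion holds: the helper names guess_kpsewhich_encoding and decode_path are hypothetical and are not part of matplotlib or kpathsea.

```python
import sys


def guess_kpsewhich_encoding(platform: str = sys.platform) -> str:
    """Hypothetical helper: pick the encoding suggested in the comment above."""
    if platform.startswith("win"):
        # Assumption from this thread: kpsewhich emits utf-8 paths on Windows.
        return "utf-8"
    # Elsewhere, fall back to Python's filesystem encoding.
    return sys.getfilesystemencoding()


def decode_path(raw: bytes, platform: str = sys.platform) -> str:
    """Decode raw kpsewhich output with the guessed encoding."""
    return raw.decode(guess_kpsewhich_encoding(platform))


print(guess_kpsewhich_encoding("win32"))
print(decode_path("C:/Users/å/cmr10.tfm".encode("utf-8"), "win32"))
```

Passing the platform as a parameter is only there so the Windows branch can be exercised on any machine.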
@anntzer I have tested an app that does
#include <windows.h>
int main()
{
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"Стдаут!\n", 8, NULL, NULL);
    WriteConsoleW(GetStdHandle(STD_ERROR_HANDLE), L"Стдерр!\n", 8, NULL, NULL);
}
Output of the executable:
Python reader:
import subprocess
cmd = 'unicodetest.exe'
result = subprocess.check_output(
    [cmd], stderr=subprocess.STDOUT)
print(type(result))
print('"%s"' % result)
print('"%s"' % result.decode('utf-8'))
pipe = subprocess.Popen([cmd], stdout=subprocess.PIPE)
result = pipe.communicate()[0].rstrip()
print(type(result))
print('"%s"' % result)
print('"%s"' % result.decode('utf-8'))
Output
My
From the prerequisites list it looks like it does not depend on
Investigate the MiKTeX sources
Call tree: in directory traversing it takes
unique_ptr<DirectoryLister> dirLister = DirectoryLister::Open(directory, nullptr, (int)DirectoryLister::Options::DirectoriesOnly);
DirectoryEntry entry;
vector<PathName> subdirs;
while (dirLister->GetNext(entry))
{
    MIKTEX_ASSERT(entry.isDirectory);
    PathName subdir(directory);
    subdir /= entry.name;
    subdirs.push_back(subdir);
}
dirLister->Close();
DirectoryEntry2 direntry2;
if (!GetNext(direntry2))
direntry2.name = WU_(ffdat.cFileName);
# define WU_(x) MiKTeX::Util::CharBuffer<char>(x).GetData()
CharBuffer ctor calls StringUtil::CopyString(buffer, GetCapacity(), lpsz);
size_t StringUtil::CopyString(char* dest, size_t destSize, const wchar_t* source)
{
    return CopyString(dest, destSize, WideCharToUTF8(source).c_str());
}
Taaadam, the output is in utf8 encoding.
Thanks for the careful investigation. Can you review #12253? |
I will shortly explain the function of Typical usage is the following:
|
Thank you very much for your help. |
In Windows,
|
@anntzer IIUC we can set env var |
I think so (to the best of my understanding...). |
Yes. I just wanted to run tex tests on windows to be sure thing are fine, but for some reason after switching an appveyor build to |