xindy and folders with non ascii chars
Ulrike Fischer
2018-09-17 12:21:15 UTC
Due to a question on tex.sx I would like to revive an old
question (see
https://tug.org/pipermail/tex-live/2013-May/033427.html).

If I copy the test file below into the folder G:\Z-Test\jürgen on Windows 10

and then run

pdflatex test
texindy test.idx

it fails with the error message

G:\Z-Test\jürgen>texindy test.idx
*** - PARSE-NAMESTRING: syntax error in filename
"G:\\\\Z-Test\\\\j³rgen\\\\S1Jt2cb7Dx" at position 13

S1Jt2cb7Dx is the temporary file, which on Windows is created in
the current directory, and PARSE-NAMESTRING is imho a clisp function.

Using texindy -d keep_tmpfiles to keep the temporary files and then
looking into them I can see lines like

:idxstyle "G:\\Z-Test\\jürgen\\B0eXOULCND"
:rawindex "G:\\Z-Test\\jürgen\\OeXhx_H6VI"

The temporary file is 8-bit encoded.

So my guess is that the problem is with xindy-lisp.exe:
it either can't handle non-ascii file names at all, or gets confused
by the encoding of the path names in the temporary file.
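
A side note on the garbled name in the error message: "j³rgen" is what
one would see if the byte for ü in Windows-1252 is displayed by a console
that uses code page 850. Both code pages are my assumption, the log does
not confirm them. A minimal Python sketch of that effect:

```python
# Assumption: the path is stored as CP1252 bytes and the console decodes as CP850.
name = "jürgen"
raw = name.encode("cp1252")   # ü becomes the single byte 0xFC
shown = raw.decode("cp850")   # the same byte read in the console code page

print(raw)    # b'j\xfcrgen'
print(shown)  # j³rgen
```

If that assumption holds, the file name itself is intact and only its
display in the console is off.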

Is there any chance to solve this problem?
Relocating the tmpfile location to TEMP is imho not a stable
solution: quite often the problem pops up because of user account
names with non-ascii chars.


%% example document:
\documentclass{article}

\usepackage{makeidx}
\makeindex

\begin{document}
Test\index{Test}
\printindex
\end{document}
--
Ulrike Fischer
http://www.troubleshooting-tex.de/
Akira Kakuto
2018-09-17 12:46:18 UTC
Post by Ulrike Fischer
it either can't handle non-ascii file names at all, or gets confused
by the encoding of the path names in the temporary file.
Sorry for the inconvenience.
I can find the message in pathname.c in clisp.
"Position 13" is exactly where the non-ASCII character sits.
So I guess that xindy-lisp.exe can't handle
non-ASCII file names.
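
For what it's worth, the count works out if the position is zero-based
and the string being parsed contains doubled backslashes (the error
message prints each of them escaped once more). Both points are
assumptions on my side; a small Python check:

```python
# Namestring with doubled backslashes (a raw string keeps both of them).
namestring = r"G:\\Z-Test\\jürgen\\S1Jt2cb7Dx"

# Zero-based index of the first character outside the ASCII range.
first_non_ascii = next(i for i, c in enumerate(namestring) if ord(c) > 127)

print(first_non_ascii)              # 13
print(namestring[first_non_ascii])  # ü
```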

Best,
Akira
Zdenek Wagner
2018-09-17 13:12:50 UTC
Post by Akira Kakuto
Post by Ulrike Fischer
it either can't handle non-ascii file names at all, or gets confused
by the encoding of the path names in the temporary file.
Sorry for the inconvenience.
I can find the message in pathname.c in clisp.
"Position 13" is exactly where the non-ASCII character sits.
So I guess that xindy-lisp.exe can't handle
non-ASCII file names.
I think that Lisp has problems with Unicode and probably with other
characters outside the ASCII range. Remember that the sort and merge
rules are encoded because of the limited character set available.
Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz
Ulrike Fischer
2018-09-18 18:16:56 UTC
Post by Akira Kakuto
Post by Ulrike Fischer
it either can't handle non-ascii file names at all, or gets confused
by the encoding of the path names in the temporary file.
Sorry for the inconvenience.
I can find the message in pathname.c in clisp.
"Position 13" is exactly where the non-ASCII character sits.
So I guess that xindy-lisp.exe can't handle
non-ASCII file names.
I saw that you sorted out the problem on tex.sx and found a
work-around for the user ;-)

Thanks.
--
Ulrike Fischer
https://www.troubleshooting-tex.de/
Richard M Kreuter
2018-09-20 11:25:38 UTC
Post by Akira Kakuto
Post by Ulrike Fischer
it either can't handle non-ascii file names at all, or gets confused
by the encoding of the path names in the temporary file.
Sorry for the inconvenience.
I can find the message in pathname.c in clisp.
"Position 13" is exactly where the non-ASCII character sits.
So I guess that xindy-lisp.exe can't handle
non-ASCII file names.
This turns out to depend on how Clisp was built, at least.

If your distribution of Clisp contains a clisp.h file (or, if you built
Clisp from source and still have the build tree, a src/config.h file),
then it has a macro definition for VALID_FILENAME_CHAR that determines
which bytes may occur in a filename.

For example, on one Ubuntu host I've got access to,
/usr/lib/clisp-2.49/linkkit/clisp.h contains this line:

#define VALID_FILENAME_CHAR ((ch >= 1) && (ch != 47))

On this Ubuntu host, the Lisp expression

(parse-namestring "G:\\Z-Test\\jürgen\\S1Jt2cb7Dx")

returns normally, i.e., it does not signal an error.

But on one OSX host where I've built Clisp from source, src/config.h
contains

#define VALID_FILENAME_CHAR ((ch >= 1) && (ch <= 127) && (ch != 47))

and, indeed, on this OSX host, the earlier Lisp expression errors.
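
To make the difference concrete, here is a rough Python transcription of
the two macros. It assumes that ch is the integer code of each character
in the namestring; the function names are mine, not Clisp's.

```python
def valid_ubuntu(ch):
    # Mirrors: ((ch >= 1) && (ch != 47))
    return ch >= 1 and ch != 47

def valid_osx(ch):
    # Mirrors: ((ch >= 1) && (ch <= 127) && (ch != 47))
    return ch >= 1 and ch <= 127 and ch != 47

def accepts(name, is_valid):
    """True if every character of name passes the given predicate."""
    return all(is_valid(ord(c)) for c in name)

# Same escapes as the Lisp string literal above, i.e. single backslashes.
name = "G:\\Z-Test\\jürgen\\S1Jt2cb7Dx"

print(accepts(name, valid_ubuntu))  # True
print(accepts(name, valid_osx))     # False
```

The only character that the second predicate rejects here is the ü
(code 252), because of the (ch <= 127) clause.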
Akira Kakuto
2018-09-21 03:23:25 UTC
Post by Richard M Kreuter
This turns out to depend on how Clisp was built, at least.
Many thanks. I find the following in the present
version in TeX Live 2018 (w32):

CLISP version 2.49.92 (2018-02-18)

#define VALID_FILENAME_CHAR ((ch >= 32) && (ch <= 61) && \
(ch != 34) && (ch != 42) && (ch != 47) && (ch != 58) && \
(ch != 60)) || ((ch >= 64) && (ch <= 132) && (ch != 92) && \
(ch != 124) && (ch != 130)) || ((ch >= 137) && (ch <= 234) && \
(ch != 152)) || ((ch >= 240) && (ch != 252))
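
Transcribed into Python, the condition can be probed directly. This is
only a sketch: it assumes that ch is an integer in 0..255 and that ü
arrives as its CP1252 byte; valid_tl2018 is a name I made up.

```python
def valid_tl2018(ch):
    # One line per parenthesised group of the macro.
    return (
        (32 <= ch <= 61 and ch not in (34, 42, 47, 58, 60))
        or (64 <= ch <= 132 and ch not in (92, 124, 130))
        or (137 <= ch <= 234 and ch != 152)
        or (ch >= 240 and ch != 252)
    )

u_umlaut = "ü".encode("cp1252")[0]
print(u_umlaut, valid_tl2018(u_umlaut))  # 252 False

# All codes above ASCII that this build would reject.
rejected_high = [ch for ch in range(128, 256) if not valid_tl2018(ch)]
print(rejected_high)
```

Under these assumptions ü (252) is rejected by the last group, while
ä (228) and ö (246) would pass.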

Thanks,
Akira
Karl Berry
2018-09-21 21:41:48 UTC
#define VALID_FILENAME_CHAR ((ch >= 32) && (ch <= 61) && \

Why not
#define VALID_FILENAME_CHAR (1)
? What is gained by all these conditions?
Akira Kakuto
2018-09-22 03:27:51 UTC
Apparently this expression depends on the system encoding of Windows.
This creates a problem when compiling clisp on, say, a Chinese Windows
and then running it in a Western Windows or vice versa.
I built xindy-lisp.exe on Japanese Windows (CP932).

Best,
Akira
Norbert Preining
2018-09-22 04:41:31 UTC
Post by Akira Kakuto
#define VALID_FILENAME_CHAR ((ch >= 32) && (ch <= 61) && \
Why not
#define VALID_FILENAME_CHAR (1)
? What is gained by all these conditions?
I guess they represent the default code page that is in use.
In a perfect world clisp would look at the LOCALE and decide based
on that what are the valid filenames ...

Best

Norbert

--
PREINING Norbert http://www.preining.info
Accelia Inc. + JAIST + TeX Live + Debian Developer
GPG: 0x860CDC13 fp: F7D8 A928 26E3 16A1 9FA0 ACF0 6CAC A448 860C DC13
Bruno Haible
2018-09-22 16:43:21 UTC
Post by Norbert Preining
In a perfect world clisp would look at the LOCALE and decide based
on that what are the valid filenames ...
Yes, that's essentially what clisp does already, through the
*PATHNAME-ENCODING* variable (which is set based on the locale).
But it does so at a different place in the code, not as early as
file name parsing.

I've resolved https://gitlab.com/gnu-clisp/clisp/issues/10 by
limiting the check to ASCII characters, because it's the ASCII
characters (like ':', '<', '>') which are the most risky w.r.t.
weird behaviour on the file system.

Bruno
Akira Kakuto
2018-09-22 21:27:52 UTC
Dear Bruno,
Post by Bruno Haible
I've resolved https://gitlab.com/gnu-clisp/clisp/issues/10 by
limiting the check to ASCII characters, because it's the ASCII
characters (like ':', '<', '>') which are the most risky w.r.t.
weird behaviour on the file system.
I have tried the current master, and found on
Japanese Windows (CP932):

#define VALID_FILENAME_CHAR ((ch >= 32) && (ch <= 61) && \
(ch != 34) && (ch != 42) && (ch != 47) && (ch != 58) && \
(ch != 60)) || ((ch >= 64) && (ch != 92) && (ch != 124))
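
A quick Python transcription of this condition, as a sketch only (it
assumes ch is the integer character code; valid_master is my own name):

```python
def valid_master(ch):
    # One line per parenthesised group of the macro.
    return (
        (32 <= ch <= 61 and ch not in (34, 42, 47, 58, 60))
        or (ch >= 64 and ch not in (92, 124))
    )

print(valid_master(252))  # True

# Printable ASCII characters that are still rejected.
rejected_ascii = [chr(ch) for ch in range(32, 128) if not valid_master(ch)]
print(rejected_ascii)
```

With this version nothing above 127 is rejected any more; only ASCII
characters such as ':', '<' and '>' remain excluded.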

Best,
Akira
Akira Kakuto
2018-09-22 23:13:18 UTC
Dear Bruno,
Post by Akira Kakuto
I have tried the current master, and found on
Japanese Windows (CP932):
#define VALID_FILENAME_CHAR ((ch >= 32) && (ch <= 61) && \
(ch != 34) && (ch != 42) && (ch != 47) && (ch != 58) && \
(ch != 60)) || ((ch >= 64) && (ch != 92) && (ch != 124))
I tried to build xindy with the current master of clisp, but it failed
with an internal error:

.../clisp-build/clisp -q -E iso-8859-1 -c base.lsp -o base.fas
*** - Internal error: statement in file
".../clisp/src/pathname.d", line 3773 has been reached!!
Please see <http://clisp.org/impnotes/faq.html#faq-bugs> for bug
reporting instructions.


Best,
Akira
Bruno Haible
2018-09-23 01:07:52 UTC
Hi Akira,
Post by Akira Kakuto
I have tried the current master, and found on
Japanese Windows (CP932):
#define VALID_FILENAME_CHAR ((ch >= 32) && (ch <= 61) && \
(ch != 34) && (ch != 42) && (ch != 47) && (ch != 58) && \
(ch != 60)) || ((ch >= 64) && (ch != 92) && (ch != 124))
This is the same as I got on a Western Windows (CP1252). Good.

Bruno
Bruno Haible
2018-09-22 01:51:26 UTC
Karl Berry writes in
Post by Karl Berry
Why not
#define VALID_FILENAME_CHAR (1)
? What is gained by all these conditions?
When the user enters an invalid file name,
1. clisp signals an error before the file name hits the file system,
namely already when the Lisp pathname gets constructed,
2. the error message indicates the cause (remember that errors on
a file system can be caused by invalid file names, permission
problems, or even temporary issues like disk-full problems).
And 3. On some systems, really erratic things happen when you pass
file names with invalid bytes to the operating system.
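
As an illustration of points 1 and 2, here is a small Python sketch of
validating a name when it is constructed instead of when it is opened.
It is not clisp's code: the character rule, the exception class and the
function names are hypothetical.

```python
class FilenameSyntaxError(ValueError):
    """Raised while parsing, before any file system call is made."""

def is_valid_char(ch):
    # Hypothetical rule: no control characters, none of a few risky ASCII ones.
    return ch >= 32 and chr(ch) not in '"*/:<>?|'

def parse_namestring(name):
    for position, char in enumerate(name):
        if not is_valid_char(ord(char)):
            raise FilenameSyntaxError(
                f"syntax error in filename {name!r} at position {position}"
            )
    return name

print(parse_namestring("test.idx"))  # test.idx
try:
    parse_namestring("report:final.idx")
except FilenameSyntaxError as err:
    print(err)
```

The error names the offending position, so it cannot be confused with a
permission or disk-full problem reported later by the file system.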

Bruno
Richard M Kreuter
2018-09-23 21:02:41 UTC
Post by Bruno Haible
Karl Berry writes in
Post by Karl Berry
Why not
#define VALID_FILENAME_CHAR (1)
? What is gained by all these conditions?
When the user enters an invalid file name,
1. clisp signals an error before the file name hits the file system,
namely already when the Lisp pathname gets constructed,
2. the error message indicates the cause (remember that errors on
a file system can be caused by invalid file names, permission
problems, or even temporary issues like disk-full problems).
And 3. On some systems, really erratic things happen when you pass
file names with invalid bytes to the operating system.
(Not so much for Bruno, but as context for others reading this who can
be expected to be unfamiliar with the Common Lisp language or the Clisp
implementation...)

The Common Lisp language standard requires that when a file operation
receives a string argument, the file operation is to implicitly parse
the string and conditionally augment the parse with information that
might be construed as ``missing'' (for example, by appending an
extension if one is missing). This behavior bears a sort of family
resemblance to TeX82's filename handling, as Common Lisp's ancestor
languages also evolved on PDP-10 systems. For example, the parsing is
loosely analogous to scan_file_name in section 526 of TeX82; and the
augmenting is somewhat like a generalization of both pack_buffered_name
in section 523 and pack_job_name in section 529.
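
For readers who prefer code, the parse-then-augment idea can be sketched
in Python roughly as follows. This is a loose analogy only, not Common
Lisp's actual merging rules; merge_defaults and its default values are
invented for the illustration.

```python
import posixpath

def merge_defaults(name, default_type="idx", default_dir="/work"):
    # "Parse": split the string into directory, stem and extension.
    directory, base = posixpath.split(name)
    stem, ext = posixpath.splitext(base)
    # "Augment": fill in only the components that are missing.
    if not ext:
        ext = "." + default_type
    if not directory:
        directory = default_dir
    return posixpath.join(directory, stem + ext)

print(merge_defaults("test"))          # /work/test.idx
print(merge_defaults("out/test.ind"))  # out/test.ind
```

A fully specified name passes through unchanged, which is why the step
usually goes unnoticed.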

Additionally, the Common Lisp language standard allows the
implementation to detect invalid file specification syntax at its
discretion; that's what Clisp is up to here.

Anyway, under ordinary circumstances, the consequences of the parsing
and augmenting are effectively null. However, since most modern
programming languages simply pass strings to system calls without any
parsing or augmentation (albeit, for some languages, with implicit
encoding to code points), the fact that Common Lisp is required to parse
and permitted to error during the parse might be considered surprising.

Additional file naming notes that could trip up xindy users on Clisp:

1. [Probably relevant only on Unix.] To my knowledge, Clisp's file
handling offers no means to address any file or directory using a
specification that contains either a question mark or asterisk. There
can be some workarounds:

1a. If the offending character occurs in a directory, change to that
directory before starting Clisp, and address the file using a relative
file specification that omits the directory.

1b. Create another name for the file, either by linking or renaming the
file or offending directory. (To my knowledge, it's impossible to do
this from within Clisp itself.)

2. When Clisp's file specification parser encounters a "dotdot"
directory, it elides the dotdot and the directory level preceding it,
e.g., the string

"/home/me/foo/../bar"

parses to an object that denotes the same pathname as

"/home/me/bar"

This parsing behavior is documented in Clisp's manual, and so is
presumably deliberate; however it has the consequence that in case foo
is a symbolic link to some directory other than an immediate
subdirectory of /home/me, the parse will denote a different pathname
than the original string does under Unix or Windows pathname resolution
rules.
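
The same discrepancy can be reproduced outside Lisp. The following
Python sketch (POSIX only, since it creates a symbolic link) uses
os.path.normpath as a stand-in for purely textual dotdot elision and
os.path.realpath for resolution by the file system; the directory layout
is made up for the example.

```python
import os
import tempfile

def demo_symlink_difference():
    with tempfile.TemporaryDirectory() as root:
        root = os.path.realpath(root)
        home = os.path.join(root, "home", "me")
        elsewhere = os.path.join(root, "elsewhere", "deep")
        os.makedirs(home)
        os.makedirs(elsewhere)
        # foo is a symlink that does NOT point to a subdirectory of home/me.
        os.symlink(elsewhere, os.path.join(home, "foo"))

        original = os.path.join(home, "foo", "..", "bar")
        lexical = os.path.normpath(original)    # drops "foo/.." textually
        resolved = os.path.realpath(original)   # follows the link, then ".."
        return (os.path.relpath(lexical, root),
                os.path.relpath(resolved, root))

print(demo_symlink_difference())  # ('home/me/bar', 'elsewhere/bar')
```

The textual result stays under home/me, while the file system ends up in
the parent of the link target.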

There are some workarounds for this, too:

2a. Change to the desired directory before starting Clisp, and address
files using only relative specifications that omit the directory.

2b. If it's necessary to use a specification that includes a directory,
figure out a name for the directory that does not include dotdot. One
way to do that on Clisp is to use ext:cd repeatedly to resolve the
directory part of a string prior to passing the string to any standard
Lisp routines, but that's kind of grisly.

Regards,
Richard
