unicode_strings feature doesn't work with split ' ' #15904

p5pRT · 2017-03-03T15:24:09Z

Migrated from rt.perl.org#130907 (status was 'resolved')

Searchable as RT130907$

p5pRT · 2017-03-03T15:24:10Z

From @mauke

Created by @mauke

$ cat bug.pl
#!perl
use strict;
use warnings;

use utf8;
use feature qw(unicode_strings);

print "case 1: $_\n" for split ' ', "A B\x{A0}C";

print "case 2: $_\n" for split ' ', "A B\x{A0}C€";
__END__

$ perl bug.pl
case 1: A
case 1: B C
case 2: A
case 2: B
case 2: C€

In case 1 the \x{A0} (no-break space) is not treated as whitespace, so we get 2
elements: A and B<nbsp>C.

In case 2 the \x{A0} is treated as whitespace and we get 3 elements: A, B and
C<euro>.

I thought feature 'unicode_strings' would make these behave the same,
regardless of whether a string literal contains a >255 character or not.

Perl Info


Flags:
    category=core
    severity=low

Site configuration information for perl 5.24.1:

Configured by mauke at Sun Feb 19 23:06:44 CET 2017.

Summary of my perl5 (revision 5 version 24 subversion 1) configuration:
   
  Platform:
    osname=linux, osvers=4.9.6-1-arch, archname=i686-linux
    uname='linux simplicio 4.9.6-1-arch #1 smp preempt thu jan 26 09:41:20 cet 2017 i686 gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -flto',
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion='', gccversion='6.3.1 20170109', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12, longdblkind=3
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/i686-pc-linux-gnu/6.3.1/include-fixed /usr/lib /lib
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.24.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.24'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -flto -L/usr/local/lib -fstack-protector-strong'



@INC for perl 5.24.1:
    /home/mauke/usr/lib/perl5/site_perl/5.24.1/i686-linux
    /home/mauke/usr/lib/perl5/site_perl/5.24.1
    /home/mauke/usr/lib/perl5/5.24.1/i686-linux
    /home/mauke/usr/lib/perl5/5.24.1


Environment for perl 5.24.1:
    HOME=/home/mauke
    LANG=en_US.UTF-8
    LANGUAGE=en_US
    LC_COLLATE=C
    LC_MONETARY=de_DE.UTF-8
    LC_TIME=de_DE.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/mauke/perl5/perlbrew/bin:/home/mauke/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
    PERLBREW_BASHRC_VERSION=0.73
    PERLBREW_HOME=/home/mauke/.perlbrew
    PERLBREW_ROOT=/home/mauke/perl5/perlbrew
    PERL_BADLANG (unset)
    PERL_UNICODE=SAL
    SHELL=/bin/bash

p5pRT · 2017-03-03T19:52:16Z

From @jkeenan

On Fri, 03 Mar 2017 15:24:10 GMT, mauke- wrote:

This is a bug report for perl from l.mai@web.de,
generated with the help of perlbug 1.40 running under perl 5.24.1.

-----------------------------------------------------------------
[Please describe your issue here]

$ cat bug.pl
#!perl
use strict;
use warnings;

use utf8;
use feature qw(unicode_strings);

print "case 1: $_\n" for split ' ', "A B\x{A0}C";

print "case 2: $_\n" for split ' ', "A B\x{A0}C€";
__END__

$ perl bug.pl
case 1: A
case 1: B C
case 2: A
case 2: B
case 2: C€

In case 1 the \x{A0} (no-break space) is not treated as whitespace, so
we get 2
elements: A and B<nbsp>C.

In case 2 the \x{A0} is treated as whitespace and we get 3 elements:
A, B and
C<euro>.

I thought feature 'unicode_strings' would make these behave the same,
regardless of whether a string literal contains a >255 character or
not.

It appears that the output is affected by whether or not 'use utf8;' appears in the file.

#####
$ cat 130907-no-use-utf8-unicode_strings.pl
#!/usr/bin/env perl
use strict;
use warnings;
#use utf8;
use feature 'unicode_strings';

print "case 1: $_\n" for split ' ', "A B\x{A0}C";
print "case 2: $_\n" for split ' ', "A B\x{A0}C€";

$ perl 130907-no-use-utf8-unicode_strings.pl
case 1: A
case 1: B�C
case 2: A
case 2: B�C€

#####

$ cat 130907-use-utf8-unicode_strings.pl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use feature 'unicode_strings';

print "case 1: $_\n" for split ' ', "A B\x{A0}C";
print "case 2: $_\n" for split ' ', "A B\x{A0}C€";
[p5p] 512 $ perl 130907-use-utf8-unicode_strings.pl
case 1: A
case 1: B�C
case 2: A
case 2: B
Wide character in print at 130907-use-utf8-unicode_strings.pl line 8.
case 2: C€

#####

--
James E Keenan (jkeenan@cpan.org)

p5pRT · 2017-03-03T19:52:16Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2017-03-04T13:06:16Z

From @arc

l.mai@web.de <perlbug-followup@perl.org> wrote:

use utf8;
use feature qw(unicode_strings);

print "case 1: $_\n" for split ' ', "A B\x{A0}C";

print "case 2: $_\n" for split ' ', "A B\x{A0}C€";
__END__

$ perl bug.pl
case 1: A
case 1: B C
case 2: A
case 2: B
case 2: C€

In case 1 the \x{A0} (no-break space) is not treated as whitespace, so we get 2
elements: A and B<nbsp>C.

In case 2 the \x{A0} is treated as whitespace and we get 3 elements: A, B and
C<euro>.

I thought feature 'unicode_strings' would make these behave the same,
regardless of whether a string literal contains a >255 character or not.

I agree; this is another instance of the Unicode Bug.

I've attached a patch to fix this, but given the current depth of the
freeze, I'll aim to merge it for 5.27.1.

--
Aaron Crane ** http://aaroncrane.co.uk/

p5pRT · 2017-03-04T13:06:16Z

From @arc

0001-RT-130907-Fix-the-Unicode-Bug-in-split.patch

From 9039f3865a5da1d59e1780229d447952c818a351 Mon Sep 17 00:00:00 2001
From: Aaron Crane <arc@cpan.org>
Date: Sat, 4 Mar 2017 12:50:58 +0000
Subject: [PATCH] RT #130907: Fix the Unicode Bug in split " "

---
 lib/feature.pm       |  7 ++++---
 pod/perldelta.pod    |  9 +++++++++
 pod/perlfunc.pod     |  8 ++++++++
 pod/perlunicode.pod  | 11 +++++++++++
 pod/perluniintro.pod |  5 +++--
 pp.c                 | 13 +++++++++++++
 regen/feature.pl     |  7 ++++---
 t/op/split.t         | 20 +++++++++++++++++++-
 8 files changed, 71 insertions(+), 9 deletions(-)

diff --git a/lib/feature.pm b/lib/feature.pm
index ed13273f11..a358a13621 100644
--- a/lib/feature.pm
+++ b/lib/feature.pm
@@ -5,7 +5,7 @@
 
 package feature;
 
-our $VERSION = '1.47';
+our $VERSION = '1.48';
 
 our %feature = (
     fc              => 'feature_fc',
@@ -175,8 +175,9 @@ C<use feature 'unicode_strings'> subpragma is B<strongly> recommended.
 
 This feature is available starting with Perl 5.12; was almost fully
 implemented in Perl 5.14; and extended in Perl 5.16 to cover C<quotemeta>;
-and extended further in Perl 5.26 to cover L<the range
-operator|perlop/Range Operators>.
+was extended further in Perl 5.26 to cover L<the range
+operator|perlop/Range Operators>; and was extended again in Perl 5.28 to
+cover L<special-cased whitespace splitting|perlfunc/split>.
 
 =head2 The 'unicode_eval' and 'evalbytes' features
 
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 83b151fc03..2ced57a3e6 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -343,6 +343,15 @@ expression had no named captures.  The same applies to access to any
 hash tied with L<Tie::Hash::NamedCapture> and C<< all => 1 >>. [perl
 #130822]
 
+=item *
+
+C<split ' '> now handles the argument being split correctly when in the
+scope of the L<< C<unicode_strings>|feature/"The 'unicode_strings' feature"
+>> feature. Previously, when a string using the single-byte internal
+representation contained characters that are whitespace by Unicode rules but
+not by ASCII rules, it treated those characters as part of fields rather
+than as field separators. This resolves [perl #130907].
+
 =back
 
 =head1 Known Problems
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 88d30a55fd..506c1433a2 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -7601,6 +7601,14 @@ special case was restricted to the use of a plain S<C<" ">> as the
 pattern argument to split; in Perl 5.18.0 and later this special case is
 triggered by any expression which evaluates to the simple string S<C<" ">>.
 
+As of Perl 5.28, this special-cased whitespace splitting works as expected in
+the scope of L<< S<C<"use feature 'unicode_strings">>|feature/The
+'unicode_strings' feature >>. In previous versions, and outside the scope of
+that feature, it exhibits L<perlunicode/The "Unicode Bug">: characters that are
+whitespace according to Unicode rules but not according to ASCII rules can be
+treated as part of fields rather than as field separators, depending on the
+string's internal encoding.
+
 If omitted, PATTERN defaults to a single space, S<C<" ">>, triggering
 the previously described I<awk> emulation.
 
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 23818a1ee4..d52a751155 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1824,6 +1824,17 @@ outside its scope, it could produce strings whose length in characters
 exceeded that of the right-hand side, where the right-hand side took up more
 bytes than the correct range endpoint.
 
+=item *
+
+In L<< C<split>'s special-case whitespace splitting|perlfunc/split >>.
+
+Starting in Perl 5.28.0, the C<split> function with a pattern specified as
+a string containing a single space handles whitespace characters consistently
+within the scope of of C<unicode_strings>. Prior to that, or outside its scope,
+characters that are whitespace according to Unicode rules but not according to
+ASCII rules were treated as field contents rather than field separators when
+they appear in byte-encoded strings.
+
 =back
 
 You can see from the above that the effect of C<unicode_strings>
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index d35de34581..595ec46cd6 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -151,11 +151,12 @@ serious Unicode work.  The maintenance release 5.6.1 fixed many of the
 problems of the initial Unicode implementation, but for example
 regular expressions still do not work with Unicode in 5.6.1.
 Perl v5.14.0 is the first release where Unicode support is
-(almost) seamlessly integrable without some gotchas. (There are two
+(almost) seamlessly integrable without some gotchas. (There are a few
 exceptions. Firstly, some differences in L<quotemeta|perlfunc/quotemeta>
 were fixed starting in Perl 5.16.0. Secondly, some differences in
 L<the range operator|perlop/Range Operators> were fixed starting in
-Perl 5.26.0.)
+Perl 5.26.0. Thirdly, some differences in L<split|perlfunc/split> were fixed
+started in Perl 5.28.0.)
 
 To enable this
 seamless support, you should C<use feature 'unicode_strings'> (which is
diff --git a/pp.c b/pp.c
index a640995e31..178b287295 100644
--- a/pp.c
+++ b/pp.c
@@ -5716,6 +5716,7 @@ PP(pp_split)
     STRLEN len;
     const char *s = SvPV_const(sv, len);
     const bool do_utf8 = DO_UTF8(sv);
+    const bool in_uni_8_bit = IN_UNI_8_BIT;
     const char *strend = s + len;
     PMOP *pm = cPMOPx(PL_op);
     REGEXP *rx;
@@ -5802,6 +5803,10 @@ PP(pp_split)
 	    while (s < strend && isSPACE_LC(*s))
 		s++;
 	}
+        else if (in_uni_8_bit) {
+            while (s < strend && isSPACE_L1(*s))
+                s++;
+        }
 	else {
 	    while (s < strend && isSPACE(*s))
 		s++;
@@ -5833,6 +5838,10 @@ PP(pp_split)
             {
 	        while (m < strend && !isSPACE_LC(*m))
 		    ++m;
+            }
+            else if (in_uni_8_bit) {
+                while (m < strend && !isSPACE_L1(*m))
+                    ++m;
             } else {
                 while (m < strend && !isSPACE(*m))
                     ++m;
@@ -5867,6 +5876,10 @@ PP(pp_split)
             {
 	        while (s < strend && isSPACE_LC(*s))
 		    ++s;
+            }
+            else if (in_uni_8_bit) {
+                while (s < strend && isSPACE_L1(*s))
+                    ++s;
             } else {
                 while (s < strend && isSPACE(*s))
                     ++s;
diff --git a/regen/feature.pl b/regen/feature.pl
index 579120e456..fb9f6bdfaa 100755
--- a/regen/feature.pl
+++ b/regen/feature.pl
@@ -367,7 +367,7 @@ read_only_bottom_close_and_rename($h);
 __END__
 package feature;
 
-our $VERSION = '1.47';
+our $VERSION = '1.48';
 
 FEATURES
 
@@ -485,8 +485,9 @@ C<use feature 'unicode_strings'> subpragma is B<strongly> recommended.
 
 This feature is available starting with Perl 5.12; was almost fully
 implemented in Perl 5.14; and extended in Perl 5.16 to cover C<quotemeta>;
-and extended further in Perl 5.26 to cover L<the range
-operator|perlop/Range Operators>.
+was extended further in Perl 5.26 to cover L<the range
+operator|perlop/Range Operators>; and was extended again in Perl 5.28 to
+cover L<special-cased whitespace splitting|perlfunc/split>.
 
 =head2 The 'unicode_eval' and 'evalbytes' features
 
diff --git a/t/op/split.t b/t/op/split.t
index d60bcafc19..038c5d7481 100644
--- a/t/op/split.t
+++ b/t/op/split.t
@@ -7,7 +7,7 @@ BEGIN {
     set_up_inc('../lib');
 }
 
-plan tests => 163;
+plan tests => 172;
 
 $FS = ':';
 
@@ -480,6 +480,24 @@ is($cnt, scalar(@ary));
         qq{split(\$cond ? qr/ / : " ", "$exp") behaves as expected over repeated similar patterns};
 }
 
+SKIP: {
+    # RT #130907: unicode_strings feature doesn't work with split ' '
+
+    my ($sp) = grep /\s/u, map chr, reverse 128 .. 255 # prefer \xA0 over \x85
+        or skip 'no unicode whitespace found in high-8-bit range', 9;
+
+    for (["$sp$sp. /", "leading unicode whitespace"],
+         [".$sp$sp/",  "unicode whitespace separator"],
+         [". /$sp$sp", "trailing unicode whitespace"]) {
+        my ($str, $desc) = @$_;
+        use feature "unicode_strings";
+        my @got = split " ", $str;
+        is @got, 2, "whitespace split: $desc: field count";
+        is $got[0], '.', "whitespace split: $desc: field 0";
+        is $got[1], '/', "whitespace split: $desc: field 1";
+    }
+}
+
 {
     # 'RT #116086: split "\x20" does not work as documented';
     my @results;
-- 
2.11.0

p5pRT · 2017-03-04T14:00:38Z

From @jkeenan

On Sat, 04 Mar 2017 13:06:16 GMT, arc wrote:

l.mai@web.de <perlbug-followup@perl.org> wrote:

use utf8;
use feature qw(unicode_strings);

print "case 1: $_\n" for split ' ', "A B\x{A0}C";

print "case 2: $_\n" for split ' ', "A B\x{A0}C€";
__END__

$ perl bug.pl
case 1: A
case 1: B C
case 2: A
case 2: B
case 2: C€

In case 1 the \x{A0} (no-break space) is not treated as whitespace,
so we get 2
elements: A and B<nbsp>C.

In case 2 the \x{A0} is treated as whitespace and we get 3 elements:
A, B and
C<euro>.

I thought feature 'unicode_strings' would make these behave the same,
regardless of whether a string literal contains a >255 character or
not.

I agree; this is another instance of the Unicode Bug.

I've attached a patch to fix this, but given the current depth of the
freeze, I'll aim to merge it for 5.27.1.

Which gives us plenty of time to smoke test this ... so I have created this branch:

smoke-me/jkeenan/arc/130907-unicode-bug-in-split

--
James E Keenan (jkeenan@cpan.org)

p5pRT · 2017-06-01T15:52:04Z

From @jkeenan

On Sat, 04 Mar 2017 14:00:38 GMT, jkeenan wrote:

On Sat, 04 Mar 2017 13:06:16 GMT, arc wrote:

l.mai@web.de <perlbug-followup@perl.org> wrote:

use utf8;
use feature qw(unicode_strings);

print "case 1: $_\n" for split ' ', "A B\x{A0}C";

print "case 2: $_\n" for split ' ', "A B\x{A0}C€";
__END__

$ perl bug.pl
case 1: A
case 1: B C
case 2: A
case 2: B
case 2: C€

In case 1 the \x{A0} (no-break space) is not treated as whitespace,
so we get 2
elements: A and B<nbsp>C.

In case 2 the \x{A0} is treated as whitespace and we get 3
elements:
A, B and
C<euro>.

I thought feature 'unicode_strings' would make these behave the
same,
regardless of whether a string literal contains a >255 character or
not.

I agree; this is another instance of the Unicode Bug.

I've attached a patch to fix this, but given the current depth of the
freeze, I'll aim to merge it for 5.27.1.

Which gives us plenty of time to smoke test this ... so I have created
this branch:

smoke-me/jkeenan/arc/130907-unicode-bug-in-split

Aaron, with perl-5.26.0 released, would you like to take the discussion of this ticket forward?

Smoke test results here:
http://perl.develop-help.com/?b=smoke-me%2Fjkeenan%2Farc%2F130907-unicode-bug-in-split

--
James E Keenan (jkeenan@cpan.org)

p5pRT · 2017-07-15T18:42:13Z

From @arc

The (minority of) failing reports from that smoke-me branch seem to be unrelated to this change.

I've therefore pushed a rebased version of my patch as 20ae58f

--
Aaron Crane ** http://aaroncrane.co.uk/

p5pRT · 2017-07-15T18:42:14Z

@arc - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Jul 15, 2017

p5pRT added Severity Low distro-Linux type-core labels Oct 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unicode_strings feature doesn't work with split ' ' #15904

unicode_strings feature doesn't work with split ' ' #15904

p5pRT commented Mar 3, 2017

p5pRT commented Mar 3, 2017

p5pRT commented Mar 3, 2017

p5pRT commented Mar 3, 2017

p5pRT commented Mar 4, 2017

p5pRT commented Mar 4, 2017

p5pRT commented Mar 4, 2017

p5pRT commented Jun 1, 2017

p5pRT commented Jul 15, 2017

p5pRT commented Jul 15, 2017

unicode_strings feature doesn't work with split ' ' #15904

unicode_strings feature doesn't work with split ' ' #15904

Comments

p5pRT commented Mar 3, 2017

p5pRT commented Mar 3, 2017

From @mauke

Created by @mauke

p5pRT commented Mar 3, 2017

From @jkeenan

p5pRT commented Mar 3, 2017

p5pRT commented Mar 4, 2017

From @arc

p5pRT commented Mar 4, 2017

From @arc

p5pRT commented Mar 4, 2017

From @jkeenan

p5pRT commented Jun 1, 2017

From @jkeenan

p5pRT commented Jul 15, 2017

From @arc

p5pRT commented Jul 15, 2017