Unicode whitespace footguns

Another thing I noticed while reading UAX#31 because of the "dedented strings" RFC is that it has a bunch of advice on where to allow U+200E and U+200F in program text, which I don't think we implement except by accident.

For example, right now this compiles just fine:

use std::env;
fn main() {
    if env::var_os("FOO").is_some() {
        println!("foo");
    } else‏if env::var_os("BAR").is_some() {
        println!("bar");
    }
}

... because there's an invisible U+200E in between the "else" and the "if"! (The playground does make this character visible.) Section 4.1.2 of UAX#31 specifically says that this should not be allowed. Should we change it?
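
For anyone whose editor doesn't show these characters, here's a minimal sketch of a helper that makes them visible by rewriting them as escapes. The name `reveal_direction_marks` is made up for illustration, and it works on raw text, so it doesn't care whether the mark is in code, a string, or a comment.

```rust
/// Replaces every U+200E (LEFT-TO-RIGHT MARK) and U+200F (RIGHT-TO-LEFT MARK)
/// in `src` with a visible `\u{...}` escape. Hypothetical helper, text-level only.
fn reveal_direction_marks(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    for ch in src.chars() {
        match ch {
            // `{{` and `}}` are literal braces in a format string.
            '\u{200E}' | '\u{200F}' => out.push_str(&format!("\\u{{{:04X}}}", ch as u32)),
            _ => out.push(ch),
        }
    }
    out
}

fn main() {
    // The mark is written as an escape here so the example itself stays readable.
    let src = "} else\u{200E}if env::var_os(\"BAR\").is_some() {";
    println!("{}", reveal_direction_marks(src));
}
```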


With a crater run, but yes, I think so.

At the very least, we can change it over an edition.

The problem around bidi text/characters is still bigger than this. (I haven't checked whether, or how well, any of these Unicode documents cover it.) We allow identifiers with RTL characters, too!

E.g. pay attention to the outputs of this program:

fn f(ע: i32, ע_: i32) {
    println!("addition, we like it: {}", ע + ע_);
    println!("here's a subtraction: {}", 1 - ע_);
    println!("some comparisons? {}", ע > ע_);
    println!("here's a subtraction: {}", ע - 1_);
}

fn main() {
    f(42, 1337);
}

addition, we like it: 1379
here's a subtraction: -1336
some comparisons? false
here's a subtraction: 41

(the Monaco editor on the playground - if you copy the code there - has some mechanisms in place to display the program code unmangled)

As mentioned, I haven't yet found out whether this is covered by documents such as the one you linked, but as far as I can tell it could possibly be solved by adding some strategically placed U+200E, actually :) I think minimally the following 3:

fn f(ע: i32, ע_: i32) {
    println!("addition, we like it: {}", ע‎ + ע_);
    println!("here's a subtraction: {}", 1 - ע_);
    println!("some comparisons? {}", ע‎ > ע_);
    println!("here's a subtraction: {}", ע‎ - 1_);
}

fn main() {
    f(42, 1337);
}

..eh.. that is:

"
fn f(ע: i32, ע_: i32) {
    println!(\"addition, we like it: {}\", ע\u{200e} + ע_);
    println!(\"here's a subtraction: {}\", 1 - ע_);
    println!(\"some comparisons? {}\", ע\u{200e} > ע_);
    println!(\"here's a subtraction: {}\", ע\u{200e} - 1_);
}

fn main() {
    f(42, 1337);
}
"

Another example could be bidi chars in strings:

fn main() {
    let tuple = ("foo ע" ,"א bar");
    dbg!(tuple);
}

[src/main.rs:3:5] tuple = (
    "foo ע",
    "א bar",
)

Note that here, too, the broken code doesn't contain any added invisible characters at all; but adding some invisible characters could pretty much fix the situation for "simple" text and code editing/display tools, e.g. with one U+200E added:

fn main() {
    let tuple = ("foo ע"‎ ,"א bar");
    dbg!(tuple);
}
"
fn main() {
    let tuple = (\"foo ע\"\u{200e} ,\"א bar\");
    dbg!(tuple);
}
"

But that doesn't necessarily solve the issue completely or nicely either: the code is still annoying to edit (though possibly rustfmt could insert these characters as needed?), and it now looks more broken in the smarter editors that highlight the control character.

One MVP approach could be to at least lint (by default) against all cases where U+200E, U+200F, or RTL characters would lead to broken syntax when laid out. "Broken syntax" here would include both syntactical elements that have been reordered and syntactical elements that are made to stand next to each other without any additional visible separation.
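
To make the "no visible separation" half of that concrete, here's a rough sketch of such a check. It is purely text-level and the function names are made up for illustration; a real lint would work on the lexer's tokens and would also need to handle strings, comments, and the reordering case, none of which this does.

```rust
fn is_direction_mark(c: char) -> bool {
    matches!(c, '\u{200E}' | '\u{200F}')
}

fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Returns the char index of every run of U+200E/U+200F that is the *only*
/// thing separating two word characters (e.g. `else<mark>if`).
/// Hypothetical helper: it ignores string literals and comments entirely.
fn marks_as_only_separator(src: &str) -> Vec<usize> {
    let chars: Vec<char> = src.chars().collect();
    let mut hits = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if is_direction_mark(chars[i]) {
            let start = i;
            // Skip over the whole run of marks.
            while i < chars.len() && is_direction_mark(chars[i]) {
                i += 1;
            }
            let before = start > 0 && is_word_char(chars[start - 1]);
            let after = i < chars.len() && is_word_char(chars[i]);
            if before && after {
                hits.push(start);
            }
        } else {
            i += 1;
        }
    }
    hits
}

fn main() {
    let src = "} else\u{200F}if x {";
    for idx in marks_as_only_separator(src) {
        println!("direction mark is the only separator at char index {}", idx);
    }
}
```

A mark that sits next to real whitespace or an operator (as in the "fixed" examples above) is deliberately not reported, since the tokens are still visibly separated there.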


I'm not even getting started with all the issues and logical errors that come up when you think about string interpolation and bidi... Of course, Rust's format_args formatting string can also become visually mixed up, and bidi-unaware people may simply be very confused by the result of string interpolation at run-time.
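
A small sketch of the run-time side of this, under some assumptions: the in-memory (logical) order of an interpolated string is always what the format string says, but a bidi-aware display may lay out an RTL value together with the neutral characters and digits that follow it. One option I'm aware of is wrapping the interpolated value in the Unicode isolate characters U+2068 and U+2069; the `isolate` helper below is hypothetical, and whether isolates are the right fix for a given output target is a separate question.

```rust
/// Wraps `value` in FIRST STRONG ISOLATE (U+2068) and POP DIRECTIONAL
/// ISOLATE (U+2069). Hypothetical helper, not part of std.
fn isolate(value: &str) -> String {
    format!("\u{2068}{}\u{2069}", value)
}

fn main() {
    let name = "ע";

    // Logical order is name, colon, space, digit, no matter how it is displayed.
    let plain = format!("{}: {}", name, 3);
    let isolated = format!("{}: {}", isolate(name), 3);

    // `escape_unicode` prints code points, so the output is unaffected by
    // how the terminal or browser lays out bidi text.
    println!("{}", plain.escape_unicode());
    println!("{}", isolated.escape_unicode());
}
```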


I’ve looked into this in the past. I think it would be a good idea to do this, but it’s not trivial to implement, because you have to distinguish cases where the whitespace is truly necessary from cases where it is optional. Someone just needs to do the work.


UTS #55: Unicode Source Code Handling covers this.


Ah, that's exactly it! (Also probably one of the documents I already glanced at, but was too much to dig into at once.)


Indeed, this sort of code is why UAX#31 says U+200E and U+200F should be considered whitespace! See the example in section 4.1.1.

I think it'd make sense to have rustfmt do this.


Also, I think there is not much need for doing anything more than deny-by-default lints here, in line with existing lints like text_direction_codepoint_in_comment/text_direction_codepoint_in_literal; in some cases perhaps even just warn-by-default, comparable to confusable_idents and mixed_script_confusables. This means “editions” are fairly irrelevant, and crater runs are relevant for correctness concerns and false positives, but not for assessing breaking changes.
