autoscale: true slidenumber: true

Implementing a Visual Studio Code-compatible

syntax highlighter

わいわいswiftc #9

omochimetaru

Syntax highlight

Coloring text according to its syntax

→ This requires parsing the syntax

Parsing syntax

For example, the Swift compiler

lib/Parse/Lexer.cpp lib/Parse/Parser.cpp

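This requires a large amount of complex, hand-written logic

Text editors

Support the syntax of many languages
Languages can be added or removed as configuration files, without rebuilding the editor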

Recent trends

TextMate
Sublime Text
Atom
Visual Studio Code

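All of these use the TextMate format

Overall structure of TextMate Syntax

The syntax parser splits the text into a sequence of tokens
Each token is assigned scopes

The theme system colors tokens based on their scopes

This talk covers only the first stage, the syntax parser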

Scopes

VSCode's syntax inspector

Command + Shift + P → Developer: Inspect TM Scopes

Let's try it

Scopes

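Scopes are nested, and each scope name is itself hierarchical

This allows flexible coloring settings

It also makes features such as auto-indent, outline extraction, and code folding possible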
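As a rough illustration of why hierarchical scope names allow flexible coloring: a theme selector such as `constant` can apply to every scope that starts with the same dot-separated components. The sketch below assumes plain prefix matching. The function name `selectorMatches` is hypothetical, and real theme selectors support more than this.

```swift
// Hypothetical helper: does a theme selector apply to a scope name?
// Assumes plain prefix matching on dot-separated components.
func selectorMatches(_ selector: String, scope: String) -> Bool {
    let selectorParts = selector.split(separator: ".")
    let scopeParts = scope.split(separator: ".")
    guard selectorParts.count <= scopeParts.count else { return false }
    return zip(selectorParts, scopeParts).allSatisfy { $0 == $1 }
}

print(selectorMatches("constant", scope: "constant.numeric.json"))          // true
print(selectorMatches("constant.numeric", scope: "constant.numeric.json"))  // true
print(selectorMatches("constant.language", scope: "constant.numeric.json")) // false
print(selectorMatches("const", scope: "constant.numeric.json"))             // false
```

A broad selector colors a whole family of tokens, while a longer selector overrides it for one specific kind.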

Nested syntaxes

The syntax of one language can be embedded inside the syntax of another

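TextMate Syntax explained

Basics

Note: the terms and names used from here on mix official ones with

names I made up myself

(because the official documentation is lacking)

A simple example

The syntax definition for JSON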

https://github.com/Microsoft/vscode/blob/master/extensions/json/syntaxes/JSON.tmLanguage.json

From here on, I quote this JSON file, abbreviating where appropriate

{
    "name": "JSON (Javascript Next)",
    "scopeName": "source.json",
    "patterns": [ ],
    "repository": {
        "array": { },
        "comments": { },
        "constant": { },
        "number": { },
        "object": { },
        "string": { },
        "objectkey": { },
        "stringcontent": { },
        "value": { }
    }
}

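Structure of a syntax definition

The syntax of one language is called a Grammar
A Grammar is a collection of many Rules
Rules are stored in a Repository together with their names
A Grammar is itself also the root-level Rule
A Rule can have its own Repository
scopeName becomes the bottom scope of the Grammar

From here on, please follow along by implementing it (in your head)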

{
    "patterns": [ 
        { "include": "#value" }
    ],
    "repository": {
        "array": { },
        "comments": { },
        "constant": { },
        "number": { },
        "object": { },
        "string": { },
        "objectkey": { },
        "stringcontent": { },
        "value": { 
            "patterns": [
                { "include": "#constant"  },
                { "include": "#number" },
                { "include": "#string" },
                { "include": "#array" },
                { "include": "#object" },
                { "include": "#comments" }
            ]
        }
    }
}

Include Rule

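A rule that refers to another Rule through its include parameter
#<name> looks up <name> in the Repository
If the name is not found, the lookup is delegated recursively to the parent rule's Repository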
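The lookup described above can be sketched as follows. This is a minimal model, assuming each rule knows its parent and may own a repository. The `Rule` class and its members are hypothetical names for illustration, not the API of any particular implementation.

```swift
// Hypothetical model: every rule may own a repository and knows its parent.
final class Rule {
    let label: String
    var repository: [String: Rule] = [:]
    weak var parent: Rule?

    init(label: String, parent: Rule? = nil) {
        self.label = label
        self.parent = parent
    }

    // Resolve "#name": look in our own repository first, then ask each parent in turn.
    func resolve(include: String) -> Rule? {
        guard include.hasPrefix("#") else { return nil }
        let name = String(include.dropFirst())
        var current: Rule? = self
        while let rule = current {
            if let found = rule.repository[name] { return found }
            current = rule.parent
        }
        return nil
    }
}

let root = Rule(label: "source.json")
let value = Rule(label: "value", parent: root)
let array = Rule(label: "array", parent: root)
root.repository = ["value": value, "array": array]

// The include rule inside "array" has no repository of its own,
// so the lookup walks up to the root grammar.
let includeRule = Rule(label: "include #value", parent: array)
let resolved = includeRule.resolve(include: "#value")
print(resolved?.label ?? "not found")
```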

Hub Rule

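A rule that only holds child rules in its patterns parameter
It has no condition of its own, so it matches implicitly
The root Rule of a Grammar is one of these
In the previous example, #value is one too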

{
    "repository": {
        "constant": {
            "match": "\\b(?:true|false|null)\\b",
            "name": "constant.language.json"
        },
        "number": {
            "match": 
"(?x)        # turn on extended mode\n"
"  -?        # an optional minus\n"
"  (?:\n"
"    0       # a zero\n"
"    |       # ...or...\n"
"    [1-9]   # a 1-9 character\n"
"    \\d*    # followed by zero or more digits\n"
"  )\n"
"  (?:\n"
"    (?:\n"
"      \\.   # a period\n"
"      \\d+  # followed by one or more digits\n"
"    )?\n"
"    (?:\n"
"      [eE]  # an e character\n"
"      [+-]? # followed by an option +/-\n"
"      \\d+  # followed by one or more digits\n"
"    )?      # make exponent optional\n"
"  )?        # make decimal portion optional",
            "name": "constant.numeric.json"
        },
    }
}

number.match is split across lines here for display; in the actual file it is one long string.

Match Rule

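A rule that matches by testing the regular expression in its match parameter
The regular expression dialect is Oniguruma
On a match, the scope given in name is assigned

The regular expression for JSON numbers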

(?x)        # turn on extended mode
  -?        # an optional minus
  (?:
    0       # a zero
    |       # ...or...
    [1-9]   # a 1-9 character
    \d*     # followed by zero or more digits
  )
  (?:
    (?:
      \.    # a period
      \d+   # followed by one or more digits
    )?
    (?:
      [eE]  # an e character
      [+-]? # followed by an option +/-
      \d+   # followed by one or more digits
    )?      # make exponent optional
  )?        # make decimal portion optional

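The leading (?x) turns on extended mode, which allows line breaks and comments inside the pattern

The definition this regular expression follows is here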

https://www.json.org

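The line-by-line processing rule

In TextMate Syntax, this matching is performed on the target text one line at a time
A regular expression that spans a line break never matches
Tokens produced by scope assignment are always cut at the line breaks in the text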

{
    "repository": {
        "array": {
            "begin": "\\[",
            "end": "\\]",
            "name": "meta.structure.array.json",
            "patterns": [
                {
                    "include": "#value"
                },
                {
                    "match": ",",
                    "name": "punctuation.separator.array.json"
                },
                {
                    "match": "[^\\s\\]]",
                    "name": "invalid.illegal.expected-array-separator.json"
                }
            ]
        },
    }
}

BeginEnd Rule

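A rule that matches the range from where the regular expression in begin matches to where the regular expression in end matches
The scope in the name parameter is assigned to the range the rule itself matched
The child rules in the patterns parameter are applied to the part between begin and end

This is what makes nesting possible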

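Matching behavior of the parser

The parser holds a current rule and a current text position

Among the rules reachable from the current rule, it takes the one whose match position is leftmost

In the implementation, this means trying several regular expressions and adopting the one whose result is leftmost

If the match positions are equal, the one that comes first in the matching list wins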
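The selection step above can be sketched like this. Note the assumptions: Foundation's `NSRegularExpression` (an ICU-based engine) stands in for Oniguruma, the number pattern is simplified, and `Candidate` and `bestMatch` are hypothetical names.

```swift
import Foundation

// Hypothetical entry of the matching list: a label and its regular expression.
struct Candidate {
    let label: String
    let pattern: String
}

// Try every candidate from `start` (a UTF-16 offset) and keep the leftmost match.
// When two matches start at the same position, the earlier list entry wins.
func bestMatch(in line: String, from start: Int,
               candidates: [Candidate]) -> (label: String, range: NSRange)? {
    let searchRange = NSRange(location: start, length: line.utf16.count - start)
    var best: (label: String, range: NSRange)? = nil
    for candidate in candidates {
        guard let regex = try? NSRegularExpression(pattern: candidate.pattern),
              let match = regex.firstMatch(in: line, options: [], range: searchRange)
        else { continue }
        if let current = best, current.range.location <= match.range.location {
            continue // an earlier entry already matches at or before this position
        }
        best = (label: candidate.label, range: match.range)
    }
    return best
}

let candidates = [
    Candidate(label: "constant.match", pattern: #"\b(?:true|false|null)\b"#),
    Candidate(label: "number.match", pattern: #"-?(?:0|[1-9]\d*)"#),
    Candidate(label: "array.begin", pattern: #"\["#),
]

let line = "  [1, true]"
let first = bestMatch(in: line, from: 0, candidates: candidates)
let second = bestMatch(in: line, from: 3, candidates: candidates)
print(first?.label ?? "none", second?.label ?? "none")
```

Compiling each pattern on every call is wasteful. A real implementation would cache the compiled expressions per rule.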

The matching list for the JSON grammar

At root

\b(?:true|false|null)\b: root[0]→value[0]→constant.match
(?x)...: root[0]→value[1]→number.match
\": root[0]→value[2]→string.begin
\[: root[0]→value[3]→array.begin
\{: root[0]→value[4]→object.begin
/\*\*(?!/): root[0]→value[5]→comments[0]
/\*: root[0]→value[5]→comments[1]
(//).*$\n?: root[0]→value[5]→comments[2]

Inside array

\]: array.end
\b(?:true|false|null)\b: array[0]→value[0]→constant.match
(?x)...: array[0]→value[1]→number.match
\": array[0]→value[2]→string.begin
\[: array[0]→value[3]→array.begin
\{: array[0]→value[4]→object.begin
/\*\*(?!/): array[0]→value[5]→comments[0]
/\*: array[0]→value[5]→comments[1]
(//).*$\n?: array[0]→value[5]→comments[2]
,: array[1].match
[^\s\]]: array[2].match

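This matching list has the same kind of structure as the step in a recursive descent parser that peeks at the next token and enters a particular grammar node
A small amount of rule data is therefore enough to express a syntax

Line-by-line execution

When used from an editor on a huge text, we do not want to reprocess the whole text on every partial change. To report progress, and to try to reuse earlier results for unchanged parts, this parser should be implemented so that parsing can be paused and resumed line by line.

Implement it as a loop with a stack rather than with recursion.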

let parser = Parser(string: string,
                    grammar: grammar)
while !parser.isAtEnd {
    let tokens: [Token] = try parser.parseLine()
}
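To make the pause-and-resume idea concrete, here is a toy sketch. It assumes the only state carried between lines is the stack of open BeginEnd rules, and it replaces real rule matching with plain bracket counting. `LineState`, `parseLine` and `parseAll` are hypothetical names, not the TMSyntax API.

```swift
// Hypothetical parser state: the stack of BeginEnd rules that are still open.
struct LineState: Equatable {
    var stack: [String] = ["source.json"]
}

// Toy line parser: "[" pushes an array rule, "]" pops it.
// It takes the state left by the previous line and returns the state for the next one.
func parseLine(_ line: String, from state: LineState) -> LineState {
    var next = state
    for character in line {
        if character == "[" {
            next.stack.append("meta.structure.array.json")
        } else if character == "]" && next.stack.count > 1 {
            next.stack.removeLast()
        }
    }
    return next
}

// Parse all lines, remembering the state at the end of each line.
func parseAll(_ lines: [String]) -> [LineState] {
    var states: [LineState] = []
    var state = LineState()
    for line in lines {
        state = parseLine(line, from: state)
        states.append(state)
    }
    return states
}

let lines = ["[", "  [1, 2],", "  3", "]"]
let endStates = parseAll(lines)

// Edit the third line only: resume from the cached state at the end of the second line.
let reparsed = parseLine("  4", from: endStates[1])

// If the end state is unchanged, the results for the following lines can be reused.
let canReuseFollowingLines = (reparsed == endStates[2])
print(canReuseFollowingLines)
```

Because the state is an explicit value rather than a call stack, it can be stored per line and compared cheaply.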

Language embedding

The include parameter of an Include Rule can also point to another Grammar

<scope>: refers to the root Rule of the Grammar whose scopeName is this scope name

The Objective-C syntax includes the C syntax

{
    "name": "Objective-C",
    "scopeName": "source.objc",
    "patterns": [
        { }, { }, { }, { }, ...,
        {
            "include": "source.c"
        },
        { }
    ]
}

Calling the PHP syntax from inside the HTML syntax

{
    "name": "PHP",
    "scopeName": "text.html.php",
    "repository": {
        "php-tag": {
            "patterns": [
                {
                    "begin": "<\\?(?i:php|=)?(?![^?]*\\?>)",
                    "end": "(\\?)>",
                    "name": "meta.embedded.block.php",
                    "contentName": "source.php",
                    "patterns": [
                        {
                            "include": "source.php"
                        }
                    ]
                },
            ]
        }
    }
}

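Basics: summary

Grammar, Repository, Rule
Hub, Include, Match, BeginEnd
The current Rule
Enumerating the matching list
The regular expression that matches leftmost
Line-by-line processing

That is the whole skeleton of the mechanism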

TextMate Syntax explained

Advanced

There are still many parameters and features that have not been covered

Capture group scopes

{
    "captures": {
        "1": {
            "name": "support.function.construct.php"
        },
        "2": {
            "name": "punctuation.definition.array.begin.php"
        },
        "3": {
            "name": "punctuation.definition.array.end.php"
        }
    },
    "match": "(array)(\\()(\\))",
    "name": "meta.array.empty.php"
}

{
    "array": {
        "begin": "\\[",
        "beginCaptures": {
            "0": {
                "name": "punctuation.definition.array.begin.json"
            }
        },
        "end": "\\]",
        "endCaptures": {
            "0": {
                "name": "punctuation.definition.array.end.json"
            }
        },
        "name": "meta.structure.array.json",
        "patterns": [ ... ]
    }
}

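Capture groups in the match expression of a Match Rule, or in the begin and end expressions of a BeginEnd Rule, are assigned the scope names given for them. Group 0 is the whole match.

captures can also be specified on a BeginEnd Rule, in which case the same value is applied to both beginCaptures and endCaptures. If both captures and beginCaptures are given, the latter takes precedence.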
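A small sketch of turning capture groups into scoped pieces of text, using the empty-array pattern from the PHP example. It assumes Foundation's `NSRegularExpression` as a stand-in for Oniguruma. The function `captureScopes` and its dictionary-based `captures` parameter are hypothetical.

```swift
import Foundation

// Hypothetical helper: for the first match of `pattern` in `line`, return the text
// of every capture group that has a scope configured, together with that scope.
func captureScopes(line: String, pattern: String,
                   captures: [Int: String]) -> [(text: String, scope: String)] {
    let fullRange = NSRange(location: 0, length: line.utf16.count)
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(in: line, options: [], range: fullRange)
    else { return [] }

    var result: [(text: String, scope: String)] = []
    for index in 0..<match.numberOfRanges {
        let nsRange = match.range(at: index)
        // Skip groups that did not participate in the match or have no scope.
        guard nsRange.location != NSNotFound,
              let scope = captures[index],
              let range = Range(nsRange, in: line) else { continue }
        result.append((text: String(line[range]), scope: scope))
    }
    return result
}

let scoped = captureScopes(
    line: "$x = array();",
    pattern: #"(array)(\()(\))"#,
    captures: [
        1: "support.function.construct.php",
        2: "punctuation.definition.array.begin.php",
        3: "punctuation.definition.array.end.php",
    ])

for piece in scoped {
    print(piece.text, piece.scope)
}
```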

Child rules inside capture groups

{
    "begin": 
"(?x)\\s*\n"
"\t\t\t\t\t    ((?:(?:final|abstract|public|private|protected|static)\\s+)*)\n"
"\t\t\t\t        (function)\n"
"\t\t\t\t        (?:\\s+|(\\s*&\\s*))\n"
"\t\t\t\t        (?:\n"
"\t\t\t\t            (__(?:call|construct|destruct|get|set|isset|unset|tostring|"
"clone|set_state|sleep|wakeup|autoload|invoke|callStatic))\n"
"\t\t\t\t            |([a-zA-Z0-9_]+)\n"
"\t\t\t\t        )\n"
"\t\t\t\t        \\s*\n"
"\t\t\t\t        (\\()",
    "beginCaptures": {
        "1": {
            "patterns": [
                {
                    "match": "final|abstract|public|private|protected|static",
                    "name": "storage.modifier.php"
                }
            ]
        },
        "2": { "name": "storage.type.function.php" },
        "3": { "name": "storage.modifier.reference.php" },
        "4": { "name": "support.function.magic.php" },
        "5": { "name": "entity.name.function.php" },
        "6": { "name": "punctuation.definition.parameters.begin.php" }
    }
}

A capture group can be given child rules, which are applied to the captured text.

Back-references in the end parameter

{
    "begin": "(?><<-(\\w+))",
    "beginCaptures": {
        "0": {
            "name": "punctuation.definition.string.begin.ruby"
        }
    },
    "comment": "heredoc with indented terminator",
    "end": "\\s*\\1$",
    "endCaptures": {
        "0": {
            "name": "punctuation.definition.string.end.ruby"
        }
    },
    "name": "string.unquoted.heredoc.ruby",
    "patterns": [
        {
            "include": "#heredoc"
        },
        {
            "include": "#interpolated_ruby"
        },
        {
            "include": "#escaped_char"
        }
    ]
}

Capture groups from begin can be back-referenced in end

print <<EOS      # everything up to the identifier EOS becomes the literal
  the string
  next line
EOS
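One way to implement this is to rewrite the end pattern once begin has matched, replacing each back-reference with the escaped text that begin captured. The sketch below assumes that approach. `resolveEndPattern` is a hypothetical helper, and `NSRegularExpression.escapedPattern(for:)` from Foundation is used only for escaping.

```swift
import Foundation

// Hypothetical helper: replace \0...\9 in an end pattern with the (escaped) text
// captured by begin. Other escapes such as \s or \\ are copied through unchanged.
func resolveEndPattern(_ endPattern: String, beginCaptures: [String]) -> String {
    var result = ""
    var iterator = endPattern.makeIterator()
    while let character = iterator.next() {
        guard character == "\\" else {
            result.append(character)
            continue
        }
        guard let next = iterator.next() else {
            result.append(character) // trailing backslash, keep as is
            break
        }
        if next.isASCII, let index = next.wholeNumberValue, index < beginCaptures.count {
            result += NSRegularExpression.escapedPattern(for: beginCaptures[index])
        } else {
            result.append(character)
            result.append(next)
        }
    }
    return result
}

// begin "(?><<-(\w+))" matched "<<-EOS": group 0 is the whole match, group 1 is "EOS".
let resolvedEnd = resolveEndPattern(#"\s*\1$"#, beginCaptures: ["<<-EOS", "EOS"])
print(resolvedEnd)
```

The resolved pattern is specific to one begin match, so it has to be stored with the open rule on the stack rather than with the rule definition.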

\G

{
    "begin": "(^[ \\t]+)?(?=//)",
    "beginCaptures": {
        "1": {
            "name": "punctuation.whitespace.comment.leading.php"
        }
    },
    "end": "(?!\\G)",
    "patterns": [
        {
            "begin": "//",
            "beginCaptures": {
                "0": {
                    "name": "punctuation.definition.comment.php"
                }
            },
            "end": "\\n|(?=\\?>)",
            "name": "comment.line.double-slash.php"
        }
    ]
}

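In regular expressions in general, \G is the "end of the previous match" when matching repeatedly, or otherwise the "search start position".

When used in begin or match, it is the search start position. When used in end, it is the end of the begin match.

The \G position passed to the regular expression engine has to take the line number into account.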

contentName

{
    "string-double-quoted": {
        "begin": "\"",
        "beginCaptures": {
            "0": {
                "name": "punctuation.definition.string.begin.php"
            }
        },
        "contentName": "meta.string-contents.quoted.double.php",
        "end": "\"",
        "endCaptures": {
            "0": {
                "name": "punctuation.definition.string.end.php"
            }
        },
        "name": "string.quoted.double.php",
        "patterns": [
            {
                "include": "#interpolation"
            }
        ]
    }
}

contentName assigns a scope to the part between begin and end, excluding both.

name covers begin and end as well.
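The difference between the two ranges, sketched for the simple case where begin and end match on the same line. `scopedRanges` is a hypothetical helper, and offsets are plain integers for brevity.

```swift
// Hypothetical helper: given the begin and end match ranges on one line,
// compute the range covered by name and the range covered by contentName.
// Assumes begin ends at or before the position where end starts.
func scopedRanges(begin: Range<Int>, end: Range<Int>)
    -> (name: Range<Int>, content: Range<Int>) {
    return (name: begin.lowerBound..<end.upperBound,
            content: begin.upperBound..<end.lowerBound)
}

// For the text "abc" including its quotes: begin matches 0..<1, end matches 4..<5.
let ranges = scopedRanges(begin: 0..<1, end: 4..<5)
print(ranges.name)    // 0..<5
print(ranges.content) // 1..<4
```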

applyEndPatternLast

{
    "begin": "\\?",
    "beginCaptures": {
        "0": {
            "name": "keyword.operator.ternary.c"
        }
    },
    "end": ":",
    "applyEndPatternLast": true,
    "endCaptures": {
        "0": {
            "name": "keyword.operator.ternary.c"
        }
    },
    "patterns": [
        { "include": "#access" },
        { "include": "#libc" },
        { "include": "#c_function_call" },
        { "include": "$base" }
    ]
}

While the parser is inside a BeginEnd Rule after begin has matched, the end rule normally has the highest priority. When applyEndPatternLast is set, the end rule has the lowest priority instead.
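In terms of the matching list, this only changes where the end entry is placed. A minimal sketch, with a hypothetical `matchingList` helper and rule names taken from the C example:

```swift
// Hypothetical helper: build the matching list for the inside of a BeginEnd Rule.
// Earlier entries win when two matches start at the same position.
func matchingList(end: String, children: [String], applyEndPatternLast: Bool) -> [String] {
    return applyEndPatternLast ? children + [end] : [end] + children
}

let children = ["#access", "#libc", "#c_function_call", "$base"]
let normal = matchingList(end: "end", children: children, applyEndPatternLast: false)
let endLast = matchingList(end: "end", children: children, applyEndPatternLast: true)
print(normal)
print(endLast)
```

Priority only matters for ties, so an end match that starts further left than every child match still wins in either mode.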

`$self`

{
    "begin": "\\b(require|require_relative|gem)\\b",
    "captures": {
        "1": {
            "name": "keyword.other.special-method.ruby"
        }
    },
    "end": "$|(?=#|\\})",
    "name": "meta.require.ruby",
    "patterns": [
        {
            "include": "$self"
        }
    ]
}

Refers to the root Rule of the Grammar in which the Rule is defined.

`$base`

{
    "parens": {
        "begin": "\\(",
        "beginCaptures": {
            "0": {
                "name": "punctuation.section.parens.begin.c"
            }
        },
        "end": "\\)",
        "endCaptures": {
            "0": {
                "name": "punctuation.section.parens.end.c"
            }
        },
        "name": "meta.parens.c",
        "patterns": [
            {
                "include": "$base"
            }
        ]
    }
}

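Refers to the root Rule of the Grammar for the language of the current text itself. Normally this means the same as $self, but when language embedding is in effect it refers to the root language instead.

The C grammar is embedded and used from C++ and Objective-C, so it uses $base where it recurses into itself.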
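The difference between the two can be sketched as a lookup against two pieces of context. `IncludeContext` and `rootGrammarScope` are hypothetical names. The example models a rule written in the C grammar while an Objective-C file is being highlighted.

```swift
// Hypothetical context for resolving $self and $base.
struct IncludeContext {
    let baseScope: String      // grammar of the text being highlighted
    let definingScope: String  // grammar in which the current rule is written
}

// Return the scopeName of the grammar whose root Rule the include refers to.
func rootGrammarScope(for include: String, in context: IncludeContext) -> String? {
    switch include {
    case "$self": return context.definingScope
    case "$base": return context.baseScope
    default: return nil // not a $self or $base reference
    }
}

let context = IncludeContext(baseScope: "source.objc", definingScope: "source.c")
print(rootGrammarScope(for: "$self", in: context) ?? "none") // source.c
print(rootGrammarScope(for: "$base", in: context) ?? "none") // source.objc
```

With $base, the contents of parentheses in the C grammar are parsed with the full Objective-C rule set rather than with C only.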

`<scope name>#<rule name>`

{
    "begin": "@\"",
    "beginCaptures": {
        "0": {
            "name": "punctuation.definition.string.begin.objc"
        }
    },
    "end": "\"",
    "endCaptures": {
        "0": {
            "name": "punctuation.definition.string.end.objc"
        }
    },
    "name": "string.quoted.double.objc",
    "patterns": [
        {
            "include": "source.c#string_escaped_char"
        },
        {
            "match": 
"(?x)%\n"
"\t\t\t\t\t\t(\\d+\\$)?                           # field (argument #)\n"
"\t\t\t\t\t\t[#0\\- +']*                          # flags\n"
"\t\t\t\t\t\t((-?\\d+)|\\*(-?\\d+\\$)?)?          # minimum field width\n"
"\t\t\t\t\t\t(\\.((-?\\d+)|\\*(-?\\d+\\$)?)?)?    # precision\n"
"\t\t\t\t\t\t[@]                                  # conversion type\n"
"\t\t\t\t\t",
            "name": "constant.other.placeholder.objc"
        },
        {
            "include": "source.c#string_placeholder"
        }
    ]
}

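A reference to another language and a repository reference can be written together in one include.

In the example, rules from the C grammar are referenced from inside an Objective-C string.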
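Putting the include forms seen so far side by side, the include string can be classified as below. This is a sketch only: the `IncludeTarget` enum and `parseInclude` function are hypothetical names.

```swift
// Hypothetical classification of the include forms covered in this talk.
enum IncludeTarget: Equatable {
    case selfGrammar                            // "$self"
    case baseGrammar                            // "$base"
    case repository(name: String)               // "#name"
    case grammar(scope: String, rule: String?)  // "scope" or "scope#name"
}

func parseInclude(_ include: String) -> IncludeTarget {
    if include == "$self" { return .selfGrammar }
    if include == "$base" { return .baseGrammar }
    if include.hasPrefix("#") {
        return .repository(name: String(include.dropFirst()))
    }
    // Split at the first "#": the left part is a scopeName, the right part a rule name.
    let parts = include.split(separator: "#", maxSplits: 1)
    guard let scope = parts.first else { return .grammar(scope: "", rule: nil) }
    let rule = parts.count > 1 ? String(parts[1]) : nil
    return .grammar(scope: String(scope), rule: rule)
}

print(parseInclude("#value"))
print(parseInclude("source.c"))
print(parseInclude("source.c#string_escaped_char"))
```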

disabled

{
    "regular_expressions": {
        "comment": "Changed disabled to 1 to turn off syntax highlighting in “r” strings.",
        "disabled": 0,
        "patterns": [
            {
                "include": "source.regexp.python"
            }
        ]
    }
}

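A Rule with disabled set is treated as if it had not been written.

Termination guarantee

A badly written rule can stop the text position from advancing and send the parser into an infinite loop, so logic is needed to detect such cases and stop.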
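A minimal sketch of such a guard, assuming that "no progress" means a step that neither advances the position nor changes the rule stack. `Step` and `scanLine` are hypothetical names, and a real implementation may need stricter checks, for example for rules that keep pushing and popping without consuming text.

```swift
// Hypothetical result of one matching step on a line.
struct Step {
    let end: Int           // position after the match
    let stackChanged: Bool // whether a rule was pushed or popped
}

// Run matching steps over one line, stopping as soon as a step makes no progress.
func scanLine(length: Int, step: (Int) -> Step?) -> (position: Int, steps: Int) {
    var position = 0
    var steps = 0
    while position < length, let result = step(position) {
        steps += 1
        if result.end <= position && !result.stackChanged {
            break // zero-width match with no state change: give up on this line
        }
        position = result.end
    }
    return (position: position, steps: steps)
}

// A rule that always matches the empty string would otherwise loop forever.
let stuck = scanLine(length: 5) { position in
    Step(end: position, stackChanged: false)
}
// A rule that consumes one character per step finishes normally.
let healthy = scanLine(length: 5) { position in
    Step(end: position + 1, stackChanged: false)
}
print(stuck.steps, healthy.steps)
```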

Injection

{
    "scopeName": "todo-comment.injection",
    "injectionSelector": "L:comment.line.double-slash",
    "patterns": [
        {
            "include": "#todo-keyword"
        }
    ],
    "repository": {
        "todo-keyword": {
            "match": "TODO",
            "name": "keyword.todo"
        }
    }
}

{
    "injectionSelector": "text, string, comment",
    "name": "Hyperlink",
    "patterns": [
        {
            "match": 
"(?x)\n"
"\t\t\t\t( (https?|s?ftp|ftps|file|smb|afp|nfs|(x-)?man(-page)?|gopher|txmt|issue)://|mailto:)\n"
"\t\t\t\t[-:@a-zA-Z0-9_.,~%+/?=&#;]+(?<![-.,?:#;])\n"
"\t\t\t",
            "name": "markup.underline.link.$2.hyperlink"
        },
        {
            "match": "(?i)\\bRFC(?: |(?<= RFC))(\\d+)\\b",
            "name": "markup.underline.link.rfc.$1.hyperlink"
        }
    ],
    "scopeName": "text.hyperlink"
}

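A feature that inserts another syntax into an existing syntax definition. The insertion point is specified with injectionSelector.

The examples above highlight TODO inside comments of an existing grammar, and highlight URLs inside string literals and comments.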

{
    "injections": {
        "text.html.php - (meta.embedded | meta.tag), "
        "L:text.html.php meta.tag, "
        "L:source.js.embedded.html": {
            "patterns": [
                { },
                { },
                { }
            ]
        }
    },
    "name": "PHP",
    "patterns": [
        {
            "include": "text.html.basic"
        }
    ],
    "repository": {
    }
}

A Grammar can also carry injections of its own. While that Grammar is being processed, these injections are inserted where they apply.

BeginWhile Rule

Still investigating 🙇‍♂️

Embedded Language

Still investigating 🙇‍♂️

Advanced: summary

There is a lot, so implement it one piece at a time

My implementation

https://github.com/omochi/TMSyntax

Implemented in Swift. It covers almost everything in today's talk.

For regular expressions, I made a Swift bridge to Onigmo.

https://github.com/omochi/Onigmo-swift-build

Onigmo's implementation had a problem, so I submitted a patch to make its UTF-8 support cover up to 31 bits

k-takata/Onigmo#111

https://qiita.com/omochimetaru/items/3dd5a3aa5ff476f47e79

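Wrote my own JSON parser to improve debuggability

When TMSyntax loads the JSON syntax definition,

my own JSON parser is running in order to parse the JSON that contains the syntax definition for parsing JSON 😎

Decoded values carry line numbers. Comments are allowed.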

https://github.com/omochi/FineJSON

https://github.com/omochi/RichJSONParser

The parser is 2.5 times slower than Foundation.NSJSONSerialization.

I posted about it on the forum but got no replies 🤷‍♂️

https://forums.swift.org/t/how-to-write-fast-json-parser-in-swift/20281

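Unit tests

Tested against VSCode's test suite, converted and imported.

Passes all 46 cases of the First Mate Test Suite.

First Mate is Atom's TextMate Syntax implementation, and VSCode converts and imports its test suite.

Outlook

Maybe this could be used to build a developer-oriented editor for iPad?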
