[clang-doc] add Markdown parser #155887

evelez7 · 2025-08-28T17:06:31Z

Initial implementation of a Markdown parser and Markdown AST. Currently, the parser only recognizes emphasis and strong syntax. It also only creates paragraph blocks. The AST is allocated via a BumpPtrAllocator. It attempts to conform to the CommonMark standard

The parser ingests lines and creates preprocessing LineNodes that later are converted to Nodes that form the AST.

evelez7 · 2025-08-28T17:06:49Z

[clang-doc] add Markdown parser #155887 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

github-actions · 2025-08-28T17:11:22Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff origin/main HEAD --extensions cpp,h -- clang-tools-extra/clang-doc/MDParser.cpp clang-tools-extra/clang-doc/MDParser.h clang-tools-extra/unittests/clang-doc/MDParserTest.cpp clang-tools-extra/clang-doc/Representation.h

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from clang-format here.

diff --git a/clang-tools-extra/clang-doc/MDParser.cpp b/clang-tools-extra/clang-doc/MDParser.cpp
index 80e038600..e27a337b1 100644
--- a/clang-tools-extra/clang-doc/MDParser.cpp
+++ b/clang-tools-extra/clang-doc/MDParser.cpp
@@ -27,8 +27,8 @@ MarkdownParser::processDelimiters(SmallString<64> &Line, const size_t &Origin) {
                       (!isPunctuation(Proceeding) || isWhitespace(Preceeding) ||
                        isPunctuation(Preceeding));
   bool RightFlanking = !isWhitespace(Preceeding) &&
-                       (!isPunctuation(Preceeding) || isWhitespace(Proceeding) ||
-                        isPunctuation(Proceeding));
+                       (!isPunctuation(Preceeding) ||
+                        isWhitespace(Proceeding) || isPunctuation(Proceeding));
 
   if (LeftFlanking && RightFlanking)
     return {DelimiterContext{LeftFlanking, RightFlanking, true, true}, Idx};
diff --git a/clang-tools-extra/clang-doc/MDParser.h b/clang-tools-extra/clang-doc/MDParser.h
index 32599eaae..723d9ec34 100644
--- a/clang-tools-extra/clang-doc/MDParser.h
+++ b/clang-tools-extra/clang-doc/MDParser.h
@@ -23,7 +23,7 @@ enum class MDType {
 enum class MDTokenType { LeftDelimiterRun, RightDelimiterRun, Text };
 
 struct Node {
-  SmallVector<Node*> Children;
+  SmallVector<Node *> Children;
   MDType Type;
   Node *Parent;
   std::string Content;
@@ -89,7 +89,8 @@ class MarkdownParser {
 
   /// @param Lines An entire Document that resides in a comment.
   /// @return the root of a Markdown document.
-  Node* parse(std::vector<SmallString<64>> &Lines);
+  Node *parse(std::vector<SmallString<64>> &Lines);
+
 public:
   MarkdownParser() : Arena(BumpPtrAllocator()), Saver(Arena) {}
   std::string render(std::vector<SmallString<64>> &Lines);

ilovepi · 2025-08-29T17:58:00Z

clang-tools-extra/clang-doc/MDParser.cpp

+} // namespace
+
+std::pair<std::optional<DelimiterContext>, size_t>
+MarkdownParser::processDelimiters(SmallString<64> &Line, const size_t &Origin) {


Suggested change

MarkdownParser::processDelimiters(SmallString<64> &Line, const size_t &Origin) {

MarkdownParser::processDelimiters(StringRef Line, size_t Origin) {

I don't see you manipulating the Line, just reading from it. Also no point passing a size_t by const ref. Just pass by val, since you're not modifying it.

ilovepi · 2025-08-29T18:05:18Z

clang-tools-extra/clang-doc/MDParser.cpp

+  return {std::nullopt, 0};
+}
+
+Node *MarkdownParser::createTextNode(const std::list<LineNode *> &Text) {


why std::list? We have some list data structures in ADT, like simple_ilist, but there are other choices of data structure that I think I see used more often.

clang-tools-extra/clang-doc/MDParser.h

ilovepi · 2025-08-29T18:21:24Z

clang-tools-extra/clang-doc/MDParser.h

+  /// @param IdxOrigin the index of * or _ that might start a delimiter run.
+  /// @return A pair denoting the type of run and the index where the run stops
+  std::pair<std::optional<DelimiterContext>, size_t>
+  processDelimiters(SmallString<64> &Line, const size_t &Origin = 0);


Suggested change

/// @param IdxOrigin the index of * or _ that might start a delimiter run.

/// @return A pair denoting the type of run and the index where the run stops

std::pair<std::optional<DelimiterContext>, size_t>

processDelimiters(SmallString<64> &Line, const size_t &Origin = 0);

/// @param Origin the index of * or _ that might start a delimiter run.

/// @return A pair denoting the type of run and the index where the run stops

std::pair<std::optional<DelimiterContext>, size_t>

processDelimiters(SmallString<64> &Line, const size_t &Origin = 0);

The doc name seems stale. I do wonder if you should just use Start or StartIdx, though.

ilovepi · 2025-08-29T18:23:50Z

clang-tools-extra/clang-doc/MDParser.cpp

+  return Result;
+}
+
+void MarkdownParser::parseLine(SmallString<64> &Line, Node *Current) {


can line also just be StringRef here? maybe I missed it, but I don't see you manipulating it at all.

ilovepi · 2025-08-29T20:39:34Z

clang-tools-extra/clang-doc/MDParser.cpp

+      if (Closer->Length >= 2 && Opener->Length >= 2) {
+        // We have at least one strong node.
+        Closer->Length -= 2;
+        Opener->Length -= 2;
+        Emphasis = new (Arena) Node();
+        Emphasis->Type = MDType::Strong;
+        auto *Child = createTextNode(Text);
+        Child->Parent = Emphasis;
+        Emphasis->Children.push_back(Child);
+      } else if (Closer->Length == 1 && Opener->Length == 1) {
+        Closer->Length -= 1;
+        Opener->Length -= 1;
+        Emphasis = new (Arena) Node();
+        Emphasis->Type = MDType::Emphasis;
+        auto *Child = createTextNode(Text);
+        Child->Parent = Emphasis;
+        Emphasis->Children.push_back(Child);


The difference in these blocks seems to be the MDType, and the length adjustment. maybe just use a helper? or just set those bits on their own to avoid repeating mostly the same code?

ilovepi · 2025-08-29T20:40:42Z

clang-tools-extra/clang-doc/MDParser.cpp

+      if (Closer->Length == 0)
+        It = Stack.erase(It);
+      if (Opener->Length == 0)
+        ReverseIt = std::make_reverse_iterator(Stack.erase(ReverseIt.base()));


This seems a little hairy.

clang-tools-extra/clang-doc/MDParser.cpp

ilovepi · 2025-08-29T20:50:40Z

clang-tools-extra/clang-doc/MDParser.cpp

+    Result.append("<strong>");
+    for (auto *Child : Current->Children)
+      Result.append(traverse(Child));
+    Result.append("</strong>");


These blocks that loop over the children should probably be either a helper or a local lambda. the only thing that changes is the tag name, so your helper could just take that string as a parameter, and you can get rid of a bunch of boiler plate code.

ilovepi · 2025-08-29T20:52:44Z

clang-tools-extra/clang-doc/MDParser.h

+enum class MDTokenType { LeftDelimiterRun, RightDelimiterRun, Text };
+
+struct Node {
+  SmallVector<Node*> Children;


I see these get used elsewhere in an arena. This vector member will leak memory, since it can own memory and the destructor won't get called when you clear/reset/release the arena.

It can have a reference to something that owns the vector though, like a hashtable or a context object that owns that hash table.

[clang-doc] markdown parser

5a66d89

evelez7 force-pushed the users/evelez7/clang-doc-markdown-parser branch from c9e91c7 to 5a66d89 Compare August 28, 2025 17:08

evelez7 requested review from ilovepi and petrhosek August 28, 2025 17:09

evelez7 changed the title ~~[clang-doc] markdown parser~~ [clang-doc] add Markdown parser implementation Aug 28, 2025

evelez7 changed the title ~~[clang-doc] add Markdown parser implementation~~ [clang-doc] add Markdown parser Aug 28, 2025

ilovepi reviewed Aug 29, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[clang-doc] add Markdown parser #155887

[clang-doc] add Markdown parser #155887

Uh oh!

evelez7 commented Aug 28, 2025 •

edited

Loading

Uh oh!

evelez7 commented Aug 28, 2025

Uh oh!

github-actions bot commented Aug 28, 2025

Uh oh!

ilovepi Aug 29, 2025

Uh oh!

ilovepi Aug 29, 2025

Uh oh!

Uh oh!

ilovepi Aug 29, 2025

Uh oh!

ilovepi Aug 29, 2025

Uh oh!

ilovepi Aug 29, 2025

Uh oh!

ilovepi Aug 29, 2025

Uh oh!

Uh oh!

ilovepi Aug 29, 2025

Uh oh!

ilovepi Aug 29, 2025

Uh oh!

Uh oh!

	MarkdownParser::processDelimiters(SmallString<64> &Line, const size_t &Origin) {
	MarkdownParser::processDelimiters(StringRef Line, size_t Origin) {

[clang-doc] add Markdown parser #155887

Are you sure you want to change the base?

[clang-doc] add Markdown parser #155887

Uh oh!

Conversation

evelez7 commented Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

evelez7 commented Aug 28, 2025

Uh oh!

github-actions bot commented Aug 28, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

evelez7 commented Aug 28, 2025 •

edited

Loading