Skip to content

Conversation

evelez7
Copy link
Member

@evelez7 evelez7 commented Aug 28, 2025

Initial implementation of a Markdown parser and Markdown AST. Currently, the parser only recognizes emphasis and strong syntax. It also only creates paragraph blocks. The AST is allocated via a BumpPtrAllocator. It attempts to conform to the CommonMark standard

The parser ingests lines and creates preprocessing LineNodes that later are converted to Nodes that form the AST.

Copy link
Member Author

evelez7 commented Aug 28, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@evelez7 evelez7 force-pushed the users/evelez7/clang-doc-markdown-parser branch from c9e91c7 to 5a66d89 Compare August 28, 2025 17:08
@evelez7 evelez7 requested review from ilovepi and petrhosek August 28, 2025 17:09
Copy link

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions cpp,h -- clang-tools-extra/clang-doc/MDParser.cpp clang-tools-extra/clang-doc/MDParser.h clang-tools-extra/unittests/clang-doc/MDParserTest.cpp clang-tools-extra/clang-doc/Representation.h

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from clang-format here.
diff --git a/clang-tools-extra/clang-doc/MDParser.cpp b/clang-tools-extra/clang-doc/MDParser.cpp
index 80e038600..e27a337b1 100644
--- a/clang-tools-extra/clang-doc/MDParser.cpp
+++ b/clang-tools-extra/clang-doc/MDParser.cpp
@@ -27,8 +27,8 @@ MarkdownParser::processDelimiters(SmallString<64> &Line, const size_t &Origin) {
                       (!isPunctuation(Proceeding) || isWhitespace(Preceeding) ||
                        isPunctuation(Preceeding));
   bool RightFlanking = !isWhitespace(Preceeding) &&
-                       (!isPunctuation(Preceeding) || isWhitespace(Proceeding) ||
-                        isPunctuation(Proceeding));
+                       (!isPunctuation(Preceeding) ||
+                        isWhitespace(Proceeding) || isPunctuation(Proceeding));
 
   if (LeftFlanking && RightFlanking)
     return {DelimiterContext{LeftFlanking, RightFlanking, true, true}, Idx};
diff --git a/clang-tools-extra/clang-doc/MDParser.h b/clang-tools-extra/clang-doc/MDParser.h
index 32599eaae..723d9ec34 100644
--- a/clang-tools-extra/clang-doc/MDParser.h
+++ b/clang-tools-extra/clang-doc/MDParser.h
@@ -23,7 +23,7 @@ enum class MDType {
 enum class MDTokenType { LeftDelimiterRun, RightDelimiterRun, Text };
 
 struct Node {
-  SmallVector<Node*> Children;
+  SmallVector<Node *> Children;
   MDType Type;
   Node *Parent;
   std::string Content;
@@ -89,7 +89,8 @@ class MarkdownParser {
 
   /// @param Lines An entire Document that resides in a comment.
   /// @return the root of a Markdown document.
-  Node* parse(std::vector<SmallString<64>> &Lines);
+  Node *parse(std::vector<SmallString<64>> &Lines);
+
 public:
   MarkdownParser() : Arena(BumpPtrAllocator()), Saver(Arena) {}
   std::string render(std::vector<SmallString<64>> &Lines);

@evelez7 evelez7 changed the title [clang-doc] markdown parser [clang-doc] add Markdown parser implementation Aug 28, 2025
@evelez7 evelez7 changed the title [clang-doc] add Markdown parser implementation [clang-doc] add Markdown parser Aug 28, 2025
} // namespace

std::pair<std::optional<DelimiterContext>, size_t>
MarkdownParser::processDelimiters(SmallString<64> &Line, const size_t &Origin) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MarkdownParser::processDelimiters(SmallString<64> &Line, const size_t &Origin) {
MarkdownParser::processDelimiters(StringRef Line, size_t Origin) {

I don't see you manipulating the Line, just reading from it. Also no point passing a size_t by const ref. Just pass by val, since you're not modifying it.

return {std::nullopt, 0};
}

Node *MarkdownParser::createTextNode(const std::list<LineNode *> &Text) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why std::list? We have some list data structures in ADT, like simple_ilist, but there are other choices of data structure that I think I see used more often.

Comment on lines +74 to +77
/// @param IdxOrigin the index of * or _ that might start a delimiter run.
/// @return A pair denoting the type of run and the index where the run stops
std::pair<std::optional<DelimiterContext>, size_t>
processDelimiters(SmallString<64> &Line, const size_t &Origin = 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// @param IdxOrigin the index of * or _ that might start a delimiter run.
/// @return A pair denoting the type of run and the index where the run stops
std::pair<std::optional<DelimiterContext>, size_t>
processDelimiters(SmallString<64> &Line, const size_t &Origin = 0);
/// @param Origin the index of * or _ that might start a delimiter run.
/// @return A pair denoting the type of run and the index where the run stops
std::pair<std::optional<DelimiterContext>, size_t>
processDelimiters(SmallString<64> &Line, const size_t &Origin = 0);

The doc name seems stale. I do wonder if you should just use Start or StartIdx, though.

return Result;
}

void MarkdownParser::parseLine(SmallString<64> &Line, Node *Current) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can line also just be StringRef here? maybe I missed it, but I don't see you manipulating it at all.

Comment on lines +79 to +95
if (Closer->Length >= 2 && Opener->Length >= 2) {
// We have at least one strong node.
Closer->Length -= 2;
Opener->Length -= 2;
Emphasis = new (Arena) Node();
Emphasis->Type = MDType::Strong;
auto *Child = createTextNode(Text);
Child->Parent = Emphasis;
Emphasis->Children.push_back(Child);
} else if (Closer->Length == 1 && Opener->Length == 1) {
Closer->Length -= 1;
Opener->Length -= 1;
Emphasis = new (Arena) Node();
Emphasis->Type = MDType::Emphasis;
auto *Child = createTextNode(Text);
Child->Parent = Emphasis;
Emphasis->Children.push_back(Child);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference in these blocks seems to be the MDType, and the length adjustment. maybe just use a helper? or just set those bits on their own to avoid repeating mostly the same code?

if (Closer->Length == 0)
It = Stack.erase(It);
if (Opener->Length == 0)
ReverseIt = std::make_reverse_iterator(Stack.erase(ReverseIt.base()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little hairy.

Comment on lines +194 to +197
Result.append("<strong>");
for (auto *Child : Current->Children)
Result.append(traverse(Child));
Result.append("</strong>");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These blocks that loop over the children should probably be either a helper or a local lambda. the only thing that changes is the tag name, so your helper could just take that string as a parameter, and you can get rid of a bunch of boiler plate code.

enum class MDTokenType { LeftDelimiterRun, RightDelimiterRun, Text };

struct Node {
SmallVector<Node*> Children;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see these get used elsewhere in an arena. This vector member will leak memory, since it can own memory and the destructor won't get called when you clear/reset/release the arena.

It can have a reference to something that owns the vector though, like a hashtable or a context object that owns that hash table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants