-
Notifications
You must be signed in to change notification settings - Fork 14.9k
[clang-doc] add Markdown parser #155887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[clang-doc] add Markdown parser #155887
Conversation
c9e91c7
to
5a66d89
Compare
You can test this locally with the following command:git-clang-format --diff origin/main HEAD --extensions cpp,h -- clang-tools-extra/clang-doc/MDParser.cpp clang-tools-extra/clang-doc/MDParser.h clang-tools-extra/unittests/clang-doc/MDParserTest.cpp clang-tools-extra/clang-doc/Representation.h
View the diff from clang-format here.diff --git a/clang-tools-extra/clang-doc/MDParser.cpp b/clang-tools-extra/clang-doc/MDParser.cpp
index 80e038600..e27a337b1 100644
--- a/clang-tools-extra/clang-doc/MDParser.cpp
+++ b/clang-tools-extra/clang-doc/MDParser.cpp
@@ -27,8 +27,8 @@ MarkdownParser::processDelimiters(SmallString<64> &Line, const size_t &Origin) {
(!isPunctuation(Proceeding) || isWhitespace(Preceeding) ||
isPunctuation(Preceeding));
bool RightFlanking = !isWhitespace(Preceeding) &&
- (!isPunctuation(Preceeding) || isWhitespace(Proceeding) ||
- isPunctuation(Proceeding));
+ (!isPunctuation(Preceeding) ||
+ isWhitespace(Proceeding) || isPunctuation(Proceeding));
if (LeftFlanking && RightFlanking)
return {DelimiterContext{LeftFlanking, RightFlanking, true, true}, Idx};
diff --git a/clang-tools-extra/clang-doc/MDParser.h b/clang-tools-extra/clang-doc/MDParser.h
index 32599eaae..723d9ec34 100644
--- a/clang-tools-extra/clang-doc/MDParser.h
+++ b/clang-tools-extra/clang-doc/MDParser.h
@@ -23,7 +23,7 @@ enum class MDType {
enum class MDTokenType { LeftDelimiterRun, RightDelimiterRun, Text };
struct Node {
- SmallVector<Node*> Children;
+ SmallVector<Node *> Children;
MDType Type;
Node *Parent;
std::string Content;
@@ -89,7 +89,8 @@ class MarkdownParser {
/// @param Lines An entire Document that resides in a comment.
/// @return the root of a Markdown document.
- Node* parse(std::vector<SmallString<64>> &Lines);
+ Node *parse(std::vector<SmallString<64>> &Lines);
+
public:
MarkdownParser() : Arena(BumpPtrAllocator()), Saver(Arena) {}
std::string render(std::vector<SmallString<64>> &Lines);
|
} // namespace | ||
|
||
std::pair<std::optional<DelimiterContext>, size_t> | ||
MarkdownParser::processDelimiters(SmallString<64> &Line, const size_t &Origin) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MarkdownParser::processDelimiters(SmallString<64> &Line, const size_t &Origin) { | |
MarkdownParser::processDelimiters(StringRef Line, size_t Origin) { |
I don't see you manipulating the Line, just reading from it. Also no point passing a size_t by const ref. Just pass by val, since you're not modifying it.
return {std::nullopt, 0}; | ||
} | ||
|
||
Node *MarkdownParser::createTextNode(const std::list<LineNode *> &Text) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why std::list
? We have some list data structures in ADT, like simple_ilist
, but there are other choices of data structure that I think I see used more often.
/// @param IdxOrigin the index of * or _ that might start a delimiter run. | ||
/// @return A pair denoting the type of run and the index where the run stops | ||
std::pair<std::optional<DelimiterContext>, size_t> | ||
processDelimiters(SmallString<64> &Line, const size_t &Origin = 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// @param IdxOrigin the index of * or _ that might start a delimiter run. | |
/// @return A pair denoting the type of run and the index where the run stops | |
std::pair<std::optional<DelimiterContext>, size_t> | |
processDelimiters(SmallString<64> &Line, const size_t &Origin = 0); | |
/// @param Origin the index of * or _ that might start a delimiter run. | |
/// @return A pair denoting the type of run and the index where the run stops | |
std::pair<std::optional<DelimiterContext>, size_t> | |
processDelimiters(SmallString<64> &Line, const size_t &Origin = 0); |
The doc name seems stale. I do wonder if you should just use Start
or StartIdx
, though.
return Result; | ||
} | ||
|
||
void MarkdownParser::parseLine(SmallString<64> &Line, Node *Current) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can line also just be StringRef here? maybe I missed it, but I don't see you manipulating it at all.
if (Closer->Length >= 2 && Opener->Length >= 2) { | ||
// We have at least one strong node. | ||
Closer->Length -= 2; | ||
Opener->Length -= 2; | ||
Emphasis = new (Arena) Node(); | ||
Emphasis->Type = MDType::Strong; | ||
auto *Child = createTextNode(Text); | ||
Child->Parent = Emphasis; | ||
Emphasis->Children.push_back(Child); | ||
} else if (Closer->Length == 1 && Opener->Length == 1) { | ||
Closer->Length -= 1; | ||
Opener->Length -= 1; | ||
Emphasis = new (Arena) Node(); | ||
Emphasis->Type = MDType::Emphasis; | ||
auto *Child = createTextNode(Text); | ||
Child->Parent = Emphasis; | ||
Emphasis->Children.push_back(Child); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference in these blocks seems to be the MDType, and the length adjustment. maybe just use a helper? or just set those bits on their own to avoid repeating mostly the same code?
if (Closer->Length == 0) | ||
It = Stack.erase(It); | ||
if (Opener->Length == 0) | ||
ReverseIt = std::make_reverse_iterator(Stack.erase(ReverseIt.base())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a little hairy.
Result.append("<strong>"); | ||
for (auto *Child : Current->Children) | ||
Result.append(traverse(Child)); | ||
Result.append("</strong>"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These blocks that loop over the children should probably be either a helper or a local lambda. the only thing that changes is the tag name, so your helper could just take that string as a parameter, and you can get rid of a bunch of boiler plate code.
enum class MDTokenType { LeftDelimiterRun, RightDelimiterRun, Text }; | ||
|
||
struct Node { | ||
SmallVector<Node*> Children; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see these get used elsewhere in an arena. This vector member will leak memory, since it can own memory and the destructor won't get called when you clear/reset/release the arena.
It can have a reference to something that owns the vector though, like a hashtable or a context object that owns that hash table.
Initial implementation of a Markdown parser and Markdown AST. Currently, the parser only recognizes emphasis and strong syntax. It also only creates paragraph blocks. The AST is allocated via a BumpPtrAllocator. It attempts to conform to the CommonMark standard
The parser ingests lines and creates preprocessing LineNodes that later are converted to Nodes that form the AST.