@svkeerthy svkeerthy commented Aug 11, 2025

Add flow-aware embedding support to llvm-ir2vec tool alongside the existing symbolic embeddings.

(Tracking issues - #141817, #141838)
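
As a rough illustration of what the new option means for callers, the sketch below builds the command line for either embedding kind. The flags (`--ir2vec-kind`, `--ir2vec-vocab-path`, `--level`, `-o`) and their values are taken from the diff in this PR; the helper `ir2vec_embeddings_cmd` and the file names are hypothetical and only assemble an argument list, they do not run the tool.

```python
import shlex

# Values documented in this PR's llvm-ir2vec.rst changes.
VALID_KINDS = ("symbolic", "flow-aware")
VALID_LEVELS = ("inst", "bb", "func")


def ir2vec_embeddings_cmd(vocab_path, input_path, kind="symbolic",
                          level="func", output_path=None):
    """Hypothetical helper: build an argv list for `llvm-ir2vec embeddings`."""
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown embedding kind: {kind!r}")
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    argv = [
        "llvm-ir2vec", "embeddings",
        f"--ir2vec-vocab-path={vocab_path}",
        f"--ir2vec-kind={kind}",
        f"--level={level}",
        input_path,
    ]
    if output_path is not None:
        argv += ["-o", output_path]
    return argv


if __name__ == "__main__":
    # Print one shell-quoted command per embedding kind.
    for kind in VALID_KINDS:
        cmd = ir2vec_embeddings_cmd("vocab.json", "input.bc", kind=kind,
                                    output_path="embeddings.txt")
        print(shlex.join(cmd))
```

Passing the list to `subprocess.run` would invoke the tool, assuming an `llvm-ir2vec` binary built with this change is on `PATH`.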

llvmbot commented Aug 11, 2025

@llvm/pr-subscribers-mlgo

@llvm/pr-subscribers-llvm-binary-utilities

Author: S. VenkataKeerthy (svkeerthy)

Changes

Add flow-aware embedding support to llvm-ir2vec tool alongside the existing symbolic embeddings.


Full diff: https://github.com/llvm/llvm-project/pull/153087.diff

4 Files Affected:

  • (modified) llvm/docs/CommandGuide/llvm-ir2vec.rst (+14-2)
  • (added) llvm/test/tools/llvm-ir2vec/embeddings-flowaware.ll (+73)
  • (renamed) llvm/test/tools/llvm-ir2vec/embeddings-symbolic.ll ()
  • (modified) llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp (+6-4)
diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
index 0c9fb6e94b6f3..fc590a6180316 100644
--- a/llvm/docs/CommandGuide/llvm-ir2vec.rst
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -13,7 +13,9 @@ DESCRIPTION
 
 :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
 generates IR2Vec embeddings for LLVM IR and supports triplet generation 
-for vocabulary training. The tool provides three main subcommands:
+for vocabulary training. 
+
+The tool provides three main subcommands:
 
 1. **triplets**: Generates numeric triplets in train2id format for vocabulary
    training from LLVM IR.
@@ -93,7 +95,7 @@ Example Usage:
 
 .. code-block:: bash
 
-   llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
+   llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt
 
 OPTIONS
 -------
@@ -129,6 +131,16 @@ Subcommand-specific options:
 
    Process only the specified function instead of all functions in the module.
 
+.. option:: --ir2vec-kind=<kind>
+
+   Specify the kind of IR2Vec embeddings to generate. Valid values are:
+
+   * ``symbolic`` - Generate symbolic embeddings (default)
+   * ``flow-aware`` - Generate flow-aware embeddings
+
+   Flow-aware embeddings consider control flow relationships between instructions,
+   while symbolic embeddings focus on the symbolic representation of instructions.
+
 .. option:: --ir2vec-vocab-path=<path>
 
    Specify the path to the vocabulary file (required for embedding generation).
diff --git a/llvm/test/tools/llvm-ir2vec/embeddings-flowaware.ll b/llvm/test/tools/llvm-ir2vec/embeddings-flowaware.ll
new file mode 100644
index 0000000000000..b2362f83caf4f
--- /dev/null
+++ b/llvm/test/tools/llvm-ir2vec/embeddings-flowaware.ll
@@ -0,0 +1,73 @@
+; RUN: llvm-ir2vec embeddings --ir2vec-kind=flow-aware --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
+; RUN: llvm-ir2vec embeddings --level=func --ir2vec-kind=flow-aware  --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
+; RUN: llvm-ir2vec embeddings --level=func --function=abc --ir2vec-kind=flow-aware --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ABC
+; RUN: not llvm-ir2vec embeddings --level=func --function=def --ir2vec-kind=flow-aware --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-DEF
+; RUN: llvm-ir2vec embeddings --level=bb --ir2vec-kind=flow-aware --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
+; RUN: llvm-ir2vec embeddings --level=bb --function=abc_repeat --ir2vec-kind=flow-aware --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL-ABC-REPEAT
+; RUN: llvm-ir2vec embeddings --level=inst --function=abc_repeat --ir2vec-kind=flow-aware --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL-ABC-REPEAT
+
+define dso_local noundef float @abc(i32 noundef %a, float noundef %b) #0 {
+entry:
+  %a.addr = alloca i32, align 4
+  %b.addr = alloca float, align 4
+  store i32 %a, ptr %a.addr, align 4
+  store float %b, ptr %b.addr, align 4
+  %0 = load i32, ptr %a.addr, align 4
+  %1 = load i32, ptr %a.addr, align 4
+  %mul = mul nsw i32 %0, %1
+  %conv = sitofp i32 %mul to float
+  %2 = load float, ptr %b.addr, align 4
+  %add = fadd float %conv, %2
+  ret float %add
+}
+
+define dso_local noundef float @abc_repeat(i32 noundef %a, float noundef %b) #0 {
+entry:
+  %a.addr = alloca i32, align 4
+  %b.addr = alloca float, align 4
+  store i32 %a, ptr %a.addr, align 4
+  store float %b, ptr %b.addr, align 4
+  %0 = load i32, ptr %a.addr, align 4
+  %1 = load i32, ptr %a.addr, align 4
+  %mul = mul nsw i32 %0, %1
+  %conv = sitofp i32 %mul to float
+  %2 = load float, ptr %b.addr, align 4
+  %add = fadd float %conv, %2
+  ret float %add
+}
+
+; CHECK-DEFAULT: Function: abc
+; CHECK-DEFAULT-NEXT: [ 3630.00  3672.00  3714.00 ]
+; CHECK-DEFAULT-NEXT: Function: abc_repeat
+; CHECK-DEFAULT-NEXT: [ 3630.00  3672.00  3714.00 ]
+
+; CHECK-FUNC-LEVEL: Function: abc 
+; CHECK-FUNC-LEVEL-NEXT: [ 3630.00  3672.00  3714.00 ]
+; CHECK-FUNC-LEVEL-NEXT: Function: abc_repeat 
+; CHECK-FUNC-LEVEL-NEXT: [ 3630.00  3672.00  3714.00 ]
+
+; CHECK-FUNC-LEVEL-ABC: Function: abc
+; CHECK-FUNC-LEVEL-ABC-NEXT: [ 3630.00  3672.00  3714.00 ]
+
+; CHECK-FUNC-DEF: Error: Function 'def' not found
+
+; CHECK-BB-LEVEL: Function: abc
+; CHECK-BB-LEVEL-NEXT: entry: [ 3630.00  3672.00  3714.00 ]
+; CHECK-BB-LEVEL-NEXT: Function: abc_repeat
+; CHECK-BB-LEVEL-NEXT: entry: [ 3630.00  3672.00  3714.00 ]
+
+; CHECK-BB-LEVEL-ABC-REPEAT: Function: abc_repeat
+; CHECK-BB-LEVEL-ABC-REPEAT-NEXT: entry: [ 3630.00  3672.00  3714.00 ]
+
+; CHECK-INST-LEVEL-ABC-REPEAT: Function: abc_repeat
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %a.addr = alloca i32, align 4 [ 91.00  92.00  93.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %b.addr = alloca float, align 4 [ 91.00  92.00  93.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: store i32 %a, ptr %a.addr, align 4 [ 188.00  190.00  192.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: store float %b, ptr %b.addr, align 4 [ 188.00  190.00  192.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %0 = load i32, ptr %a.addr, align 4 [ 185.00  187.00  189.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %1 = load i32, ptr %a.addr, align 4 [ 185.00  187.00  189.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %mul = mul nsw i32 %0, %1 [ 419.00  424.00  429.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %conv = sitofp i32 %mul to float [ 549.00  555.00  561.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %2 = load float, ptr %b.addr, align 4 [ 185.00  187.00  189.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %add = fadd float %conv, %2 [ 774.00  783.00  792.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: ret float %add [ 775.00  785.00  795.00 ]
diff --git a/llvm/test/tools/llvm-ir2vec/embeddings.ll b/llvm/test/tools/llvm-ir2vec/embeddings-symbolic.ll
similarity index 100%
rename from llvm/test/tools/llvm-ir2vec/embeddings.ll
rename to llvm/test/tools/llvm-ir2vec/embeddings-symbolic.ll
diff --git a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
index 8e17a4a3ab53d..8f8b4e2f2bda8 100644
--- a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
+++ b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
@@ -25,9 +25,11 @@
 /// 3. Embedding Generation (embeddings):
 ///    Generates IR2Vec embeddings using a trained vocabulary.
 ///    Usage: llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json
-///    --level=func input.bc -o embeddings.txt Levels: --level=inst
-///    (instructions), --level=bb (basic blocks), --level=func (functions)
-///    (See IR2Vec.cpp for more embedding generation options)
+///    --ir2vec-kind=<kind> --level=<level> input.bc -o embeddings.txt
+///    Kind: --ir2vec-kind=symbolic (default), --ir2vec-kind=flow-aware
+///    Levels: --level=inst (instructions), --level=bb (basic blocks),
+///    --level=func (functions) (See IR2Vec.cpp for more embedding generation
+///    options)
 ///
 //===----------------------------------------------------------------------===//
 
@@ -243,7 +245,7 @@ class IR2VecTool {
 
     // Create embedder for this function
     assert(Vocab->isValid() && "Vocabulary is not valid");
-    auto Emb = Embedder::create(IR2VecKind::Symbolic, F, *Vocab);
+    auto Emb = Embedder::create(IR2VecEmbeddingKind, F, *Vocab);
     if (!Emb) {
       OS << "Error: Failed to create embedder for function " << F.getName()
          << "\n";

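A quick sanity check on the expected values in `embeddings-flowaware.ll`: with the dummy vocabulary used by the test, the per-instruction vectors listed under `CHECK-INST-LEVEL-ABC-REPEAT` sum component-wise to the vector expected at the basic-block and function levels. The sketch below only checks the numbers that appear in the test; it is not a description of how the embedder is implemented, and the names `INST_EMBEDDINGS`, `EXPECTED_BB`, and `sum_embeddings` are illustrative.

```python
# Per-instruction vectors copied from the CHECK-INST-LEVEL-ABC-REPEAT lines.
INST_EMBEDDINGS = [
    (91.0, 92.0, 93.0),     # %a.addr = alloca i32
    (91.0, 92.0, 93.0),     # %b.addr = alloca float
    (188.0, 190.0, 192.0),  # store i32 %a
    (188.0, 190.0, 192.0),  # store float %b
    (185.0, 187.0, 189.0),  # %0 = load i32
    (185.0, 187.0, 189.0),  # %1 = load i32
    (419.0, 424.0, 429.0),  # %mul = mul nsw i32
    (549.0, 555.0, 561.0),  # %conv = sitofp
    (185.0, 187.0, 189.0),  # %2 = load float
    (774.0, 783.0, 792.0),  # %add = fadd float
    (775.0, 785.0, 795.0),  # ret float %add
]

# Vector expected for the single `entry` block (and for the whole function).
EXPECTED_BB = (3630.0, 3672.0, 3714.0)


def sum_embeddings(vectors):
    """Component-wise sum of equally sized vectors."""
    return tuple(sum(component) for component in zip(*vectors))


if __name__ == "__main__":
    total = sum_embeddings(INST_EMBEDDINGS)
    print(total)
    assert total == EXPECTED_BB
```

Because `abc` and `abc_repeat` have identical bodies, the same totals apply to both functions in the test.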
@svkeerthy svkeerthy force-pushed the users/svkeerthy/08-07-flow-aware_embeddings branch from 91ea617 to cd0efdf Compare August 25, 2025 22:59
@svkeerthy svkeerthy force-pushed the users/svkeerthy/08-11-_ir2vec_llvm-ir2vec_supporting_flow-aware_embeddings branch from be6f9a6 to 0310cb4 Compare August 25, 2025 22:59
@svkeerthy svkeerthy force-pushed the users/svkeerthy/08-07-flow-aware_embeddings branch from cd0efdf to ccb2423 Compare August 27, 2025 20:06
@svkeerthy svkeerthy force-pushed the users/svkeerthy/08-11-_ir2vec_llvm-ir2vec_supporting_flow-aware_embeddings branch from 0310cb4 to 630f4af Compare August 27, 2025 20:06
@svkeerthy svkeerthy force-pushed the users/svkeerthy/08-11-_ir2vec_llvm-ir2vec_supporting_flow-aware_embeddings branch from 630f4af to 87eed21 Compare August 27, 2025 21:04
@svkeerthy svkeerthy force-pushed the users/svkeerthy/08-07-flow-aware_embeddings branch from ccb2423 to 2628716 Compare August 27, 2025 21:04

svkeerthy commented Aug 28, 2025

Merge activity

  • Aug 28, 6:26 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Aug 28, 7:12 PM UTC: The Graphite merge of this pull request was cancelled.
  • Aug 28, 11:47 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Aug 28, 11:48 PM UTC: @svkeerthy merged this pull request with Graphite.

@svkeerthy svkeerthy force-pushed the users/svkeerthy/08-07-flow-aware_embeddings branch 5 times, most recently from 589fdb9 to 7869cc0 Compare August 28, 2025 19:02
Base automatically changed from users/svkeerthy/08-07-flow-aware_embeddings to main August 28, 2025 19:55
@svkeerthy svkeerthy force-pushed the users/svkeerthy/08-11-_ir2vec_llvm-ir2vec_supporting_flow-aware_embeddings branch from 87eed21 to 1fdd3b3 Compare August 28, 2025 19:57
@svkeerthy svkeerthy force-pushed the users/svkeerthy/08-11-_ir2vec_llvm-ir2vec_supporting_flow-aware_embeddings branch from 1fdd3b3 to 66d8de1 Compare August 28, 2025 23:04
@svkeerthy svkeerthy merged commit 288442f into main Aug 28, 2025
10 checks passed
@svkeerthy svkeerthy deleted the users/svkeerthy/08-11-_ir2vec_llvm-ir2vec_supporting_flow-aware_embeddings branch August 28, 2025 23:48

llvm-ci commented Aug 29, 2025

LLVM Buildbot has detected a new failure on builder clang-aarch64-quick running on linaro-clang-aarch64-quick while building llvm at step 5 "ninja check 1".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/65/builds/21920

Here is the relevant piece of the build log for reference:
Step 5 (ninja check 1) failure: stage 1 checked (failure)
******************** TEST 'Clangd Unit Tests :: ./ClangdTests/289/332' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests-Clangd Unit Tests-2917309-289-332.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=332 GTEST_SHARD_INDEX=289 /home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests
--

Note: This is test shard 290 of 332.
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from ConfigCompileTests
[ RUN      ] ConfigCompileTests.CompileCommands
Config fragment: compiling <unknown>:0 -> 0x0000BD2DFFDD0BA0 (trusted=false)
[       OK ] ConfigCompileTests.CompileCommands (17 ms)
[----------] 1 test from ConfigCompileTests (17 ms total)

[----------] 1 test from HeaderSourceSwitchTest
[ RUN      ] HeaderSourceSwitchTest.ClangdServerIntegration
ASTWorker building file /clangd-test/src/lib/test.cpp version null with command 
[/clangd-test/src/lib]
clang -I/clangd-test/src/include /clangd-test/src/lib/test.cpp
Driver produced command: cc1 -cc1 -triple aarch64-unknown-linux-gnu -fsyntax-only -disable-free -clear-ast-before-backend -main-file-name test.cpp -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -fmath-errno -ffp-contract=on -fno-rounding-math -mconstructor-aliases -funwind-tables=2 -enable-tlsdesc -target-cpu generic -target-feature +v8a -target-feature +fp-armv8 -target-feature +neon -target-abi aapcs -debugger-tuning=gdb -fdebug-compilation-dir=/clangd-test/src/lib -fcoverage-compilation-dir=/clangd-test/src/lib -resource-dir lib/clang/22 -I /clangd-test/src/include -internal-isystem lib/clang/22/include -internal-isystem /usr/local/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -fdeprecated-macro -ferror-limit 19 -fno-signed-char -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fcxx-exceptions -fexceptions -no-round-trip-args -target-feature -fmv -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -x c++ /clangd-test/src/lib/test.cpp
Building first preamble for /clangd-test/src/lib/test.cpp version null
not idle after addDocument
UNREACHABLE executed at ../llvm/clang-tools-extra/clangd/unittests/SyncAPI.cpp:22!
 #0 0x0000bd2df0fa4b80 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0xc64b80)
 #1 0x0000bd2df0fa2648 llvm::sys::RunSignalHandlers() (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0xc62648)
 #2 0x0000bd2df0fa59dc SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #3 0x0000ef60e53a98f8 (linux-vdso.so.1+0x8f8)
 #4 0x0000ef60e4f0f1f0 __pthread_kill_implementation ./nptl/./nptl/pthread_kill.c:44:76
 #5 0x0000ef60e4eca67c gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #6 0x0000ef60e4eb7130 abort ./stdlib/./stdlib/abort.c:81:7
 #7 0x0000bd2df0f51428 llvm::RTTIRoot::anchor() (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0xc11428)
 #8 0x0000bd2df0dfc400 clang::clangd::runCodeComplete(clang::clangd::ClangdServer&, llvm::StringRef, clang::clangd::Position, clang::clangd::CodeCompleteOptions) (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0xabc400)
 #9 0x0000bd2df0beb1a8 clang::clangd::(anonymous namespace)::HeaderSourceSwitchTest_ClangdServerIntegration_Test::TestBody() HeaderSourceSwitchTests.cpp:0:0
#10 0x0000bd2df0ffd01c testing::Test::Run() (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0xcbd01c)
#11 0x0000bd2df0ffe340 testing::TestInfo::Run() (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0xcbe340)
#12 0x0000bd2df0ffef7c testing::TestSuite::Run() (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0xcbef7c)
#13 0x0000bd2df100f2fc testing::internal::UnitTestImpl::RunAllTests() (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0xccf2fc)
#14 0x0000bd2df100ec48 testing::UnitTest::Run() (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0xccec48)
#15 0x0000bd2df0fe9d64 main (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0xca9d64)
#16 0x0000ef60e4eb73fc __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
#17 0x0000ef60e4eb74cc call_init ./csu/../csu/libc-start.c:128:20
#18 0x0000ef60e4eb74cc __libc_start_main ./csu/../csu/libc-start.c:379:5
#19 0x0000bd2df07a1d30 _start (/home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests+0x461d30)

--
exit: -6
--
shard JSON output does not exist: /home/tcwg-buildbot/worker/clang-aarch64-quick/stage1/tools/clang/tools/extra/clangd/unittests/./ClangdTests-Clangd Unit Tests-2917309-289-332.json
********************

