Commit 9109e17

Merge branch 'filter-next'
Daniel Borkmann says:

====================
BPF updates

We sat down and have heavily reworked the whole previous patchset
from v10 [1] to address all comments/concerns. This patchset therefore
*replaces* the internal BPF interpreter with the new layout as
discussed in [1], and migrates some exotic callers to properly use the
BPF API for a transparent upgrade. All other callers that already use
the BPF API in a way it should be used, need no further changes to run
the new internals. We also removed the sysctl knob entirely, and do not
expose any structure to userland, so that implementation details only
reside in kernel space. Since we are replacing the interpreter we had
to migrate seccomp in one patch along with the interpreter to not break
anything. When attaching a new filter, the flow can be described as
following:

 i) test if jit compiler is enabled and can compile the user BPF,
 ii) if so, then go for it,
 iii) if not, then transparently migrate the filter into the new
      representation, and run it in the interpreter.

Also, we have scratched the jit flag from the len attribute and made it
as initial patch in this series as Pablo has suggested in the last
feedback, thanks. For details, please refer to the patches themselves.

We did extensive testing of BPF and seccomp on the new interpreter
itself and also on the user ABIs and could not find any issues; new
performance numbers as posted in patch 8 are also still the same.

Please find more details in the patches themselves.

For all the previous history from v1 to v10, see [1]. We have decided
to drop the v11 as we have pedantically reworked the set, but of course,
included all previous feedback.

v3 -> v4:
 - Applied feedback from Dave regarding swap insns
 - Rebased on net-next
v2 -> v3:
 - Rebased to latest net-next (i.e. w/ rxhash->hash rename)
 - Fixed patch 8/9 commit message/doc as suggested by Dave
 - Rest is unchanged
v1 -> v2:
 - Rebased to latest net-next
 - Added static to ptp_filter as suggested by Dave
 - Fixed a typo in patch 8's commit message
 - Rest unchanged

Thanks !

 [1] http://thread.gmane.org/gmane.linux.kernel/1665858
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 parents 64c2723 + 9a985cd commit 9109e17

20 files changed

+1658
-536
lines changed

Documentation/networking/filter.txt

Lines changed: 125 additions & 0 deletions
@@ -546,6 +546,130 @@ ffffffffa0069c8f + <x>:
 For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
 toolchain for developing and testing the kernel's JIT compiler.
 
+BPF kernel internals
+--------------------
+Internally, for the kernel interpreter, a different BPF instruction set
+format with similar underlying principles from BPF described in previous
+paragraphs is being used. However, the instruction set format is modelled
+closer to the underlying architecture to mimic native instruction sets, so
+that a better performance can be achieved (more details later).
+
+It is designed to be JITed with one to one mapping, which can also open up
+the possibility for GCC/LLVM compilers to generate optimized BPF code through
+a BPF backend that performs almost as fast as natively compiled code.
+
+The new instruction set was originally designed with the possible goal in
+mind to write programs in "restricted C" and compile into BPF with a optional
+GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
+minimal performance overhead over two steps, that is, C -> BPF -> native code.
+
+Currently, the new format is being used for running user BPF programs, which
+includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
+team driver's classifier for its load-balancing mode, netfilter's xt_bpf
+extension, PTP dissector/classifier, and much more. They are all internally
+converted by the kernel into the new instruction set representation and run
+in the extended interpreter. For in-kernel handlers, this all works
+transparently by using sk_unattached_filter_create() for setting up the
+filter, resp. sk_unattached_filter_destroy() for destroying it. The macro
+SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF function to
+run the filter. 'filter' is a pointer to struct sk_filter that we got from
+sk_unattached_filter_create(), and 'ctx' the given context (e.g. skb pointer).
+All constraints and restrictions from sk_chk_filter() apply before a
+conversion to the new layout is being done behind the scenes!
+
+Currently, for JITing, the user BPF format is being used and current BPF JIT
+compilers reused whenever possible. In other words, we do not (yet!) perform
+a JIT compilation in the new layout, however, future work will successively
+migrate traditional JIT compilers into the new instruction format as well, so
+that they will profit from the very same benefits. Thus, when speaking about
+JIT in the following, a JIT compiler (TBD) for the new instruction format is
+meant in this context.
+
+Some core changes of the new internal format:
+
+- Number of registers increase from 2 to 10:
+
+  The old format had two registers A and X, and a hidden frame pointer. The
+  new layout extends this to be 10 internal registers and a read-only frame
+  pointer. Since 64-bit CPUs are passing arguments to functions via registers
+  the number of args from BPF program to in-kernel function is restricted
+  to 5 and one register is used to accept return value from an in-kernel
+  function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
+  sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
+  registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+  Therefore, BPF calling convention is defined as:
+
+    * R0       - return value from in-kernel function
+    * R1 - R5  - arguments from BPF program to in-kernel function
+    * R6 - R9  - callee saved registers that in-kernel function will preserve
+    * R10      - read-only frame pointer to access stack
+
+  Thus, all BPF registers map one to one to HW registers on x86_64, aarch64,
+  etc, and BPF calling convention maps directly to ABIs used by the kernel on
+  64-bit architectures.
+
+  On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
+  and may let more complex programs to be interpreted.
+
+  R0 - R5 are scratch registers and BPF program needs spill/fill them if
+  necessary across calls. Note that there is only one BPF program (== one BPF
+  main routine) and it cannot call other BPF functions, it can only call
+  predefined in-kernel functions, though.
+
+- Register width increases from 32-bit to 64-bit:
+
+  Still, the semantics of the original 32-bit ALU operations are preserved
+  via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower
+  subregisters that zero-extend into 64-bit if they are being written to.
+  That behavior maps directly to x86_64 and arm64 subregister definition, but
+  makes other JITs more difficult.
+
+  32-bit architectures run 64-bit internal BPF programs via interpreter.
+  Their JITs may convert BPF programs that only use 32-bit subregisters into
+  native instruction set and let the rest being interpreted.
+
+  Operation is 64-bit, because on 64-bit architectures, pointers are also
+  64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
+  so 32-bit BPF registers would otherwise require to define register-pair
+  ABI, thus, there won't be able to use a direct BPF register to HW register
+  mapping and JIT would need to do combine/split/move operations for every
+  register in and out of the function, which is complex, bug prone and slow.
+  Another reason is the use of atomic 64-bit counters.
+
+- Conditional jt/jf targets replaced with jt/fall-through:
+
+  While the original design has constructs such as "if (cond) jump_true;
+  else jump_false;", they are being replaced into alternative constructs like
+  "if (cond) jump_true; /* else fall-through */".
+
+- Introduces bpf_call insn and register passing convention for zero overhead
+  calls from/to other kernel functions:
+
+  After a kernel function call, R1 - R5 are reset to unreadable and R0 has a
+  return type of the function. Since R6 - R9 are callee saved, their state is
+  preserved across the call.
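
Editor's sketch, not part of the patch: the jt/jf to jt/fall-through change listed above can be illustrated with a small userspace model. Everything below (the `old_jmp`/`new_jmp` structs, `convert_jump()`, the two-condition enum) is hypothetical and only assumes that instructions following the jump keep their relative positions; a real converter would also have to remap jump targets whenever the instruction count changes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified jump forms; these are not the kernel's structs. */
enum cond { COND_EQ, COND_NE, COND_ALWAYS };

struct old_jmp {		/* classic form: both targets are explicit */
	enum cond cond;		/* COND_EQ or COND_NE in this sketch */
	uint8_t jt;		/* insns to skip when cond holds */
	uint8_t jf;		/* insns to skip when cond does not hold */
};

struct new_jmp {		/* new form: one target, otherwise fall through */
	enum cond cond;
	int16_t off;		/* insns to skip when cond holds */
};

/* 'equal' is 1 when the compared values are equal, 0 otherwise. */
static int cond_holds(enum cond c, int equal)
{
	if (c == COND_ALWAYS)
		return 1;
	return c == COND_EQ ? equal : !equal;
}

/* Rewrite one classic jump; returns how many insns were written (1 or 2). */
static int convert_jump(struct old_jmp in, struct new_jmp out[2])
{
	assert(in.cond != COND_ALWAYS);

	if (in.jf == 0) {	/* false branch already falls through */
		out[0] = (struct new_jmp){ in.cond, in.jt };
		return 1;
	}
	if (in.jt == 0) {	/* true branch falls through: invert the test */
		enum cond inv = in.cond == COND_EQ ? COND_NE : COND_EQ;

		out[0] = (struct new_jmp){ inv, in.jf };
		return 1;
	}
	/* General case: conditional jump plus an unconditional one. The
	 * first offset grows by one to step over the inserted jump. */
	out[0] = (struct new_jmp){ in.cond, (int16_t)(in.jt + 1) };
	out[1] = (struct new_jmp){ COND_ALWAYS, in.jf };
	return 2;
}

/* Insns skipped after the classic jump. */
static int old_skip(struct old_jmp in, int equal)
{
	return cond_holds(in.cond, equal) ? in.jt : in.jf;
}

/* Insns skipped past the end of the emitted sequence. */
static int new_skip(const struct new_jmp *out, int n, int equal)
{
	int pc = 0;

	while (pc < n)
		pc += cond_holds(out[pc].cond, equal) ? 1 + out[pc].off : 1;
	return pc - n;
}
```

In this model both forms land on the same instruction for either outcome of the comparison; only the general case costs an extra instruction.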
+
+Also in the new design, BPF is limited to 4096 insns, which means that any
+program will terminate quickly and will only call a fixed number of kernel
+functions. Original BPF and the new format are two operand instructions,
+which helps to do one-to-one mapping between BPF insn and x86 insn during JIT.
+
+The input context pointer for invoking the interpreter function is generic,
+its content is defined by a specific use case. For seccomp register R1 points
+to seccomp_data, for converted BPF filters R1 points to a skb.
+
+A program, that is translated internally consists of the following elements:
+
+  op:16, jt:8, jf:8, k:32    ==>    op:8, a_reg:4, x_reg:4, off:16, imm:32
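
Editor's sketch, not part of the patch: the field widths quoted above can be written down as two C structs to show that both layouts occupy 64 bits per instruction. The struct names, the `make_insn()` helper and the signedness of `off` and `imm` are assumptions made for illustration; the 8-bit bit-fields rely on the GCC/Clang extension that permits them.

```c
#include <assert.h>
#include <stdint.h>

/* Classic layout: op:16, jt:8, jf:8, k:32 */
struct classic_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* New layout: op:8, a_reg:4, x_reg:4, off:16, imm:32 (illustrative names;
 * treating off and imm as signed is an assumption of this sketch) */
struct internal_insn {
	uint8_t code;
	uint8_t a_reg:4;	/* 4 bits are enough to name R0 - R10 */
	uint8_t x_reg:4;
	int16_t off;
	int32_t imm;
};

/* Both encodings should fit in 8 bytes on common ABIs. */
_Static_assert(sizeof(struct classic_insn) == 8, "classic insn is 64 bits");
_Static_assert(sizeof(struct internal_insn) == 8, "internal insn is 64 bits");

/* Hypothetical helper: build one instruction, rejecting register numbers
 * outside R0 - R10. */
static struct internal_insn make_insn(uint8_t code, uint8_t a_reg,
				      uint8_t x_reg, int16_t off, int32_t imm)
{
	struct internal_insn insn;

	assert(a_reg <= 10 && x_reg <= 10);
	insn.code = code;
	insn.a_reg = a_reg;
	insn.x_reg = x_reg;
	insn.off = off;
	insn.imm = imm;
	return insn;
}
```

The two 8-bit jump offsets make room for two register numbers and one 16-bit offset, which is what the jt/fall-through form needs.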
+
+Just like the original BPF, the new format runs within a controlled environment,
+is deterministic and the kernel can easily prove that. The safety of the program
+can be determined in two steps: first step does depth-first-search to disallow
+loops and other CFG validation; second step starts from the first insn and
+descends all possible paths. It simulates execution of every insn and observes
+the state change of registers and stack.
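
Editor's sketch, not part of the patch: the first of the two steps described above (a depth-first search that rejects loops) can be modelled in userspace as follows. The `node` representation and `cfg_rejected()` are hypothetical; this shows the general technique and does not reproduce any kernel function.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_INSNS 4096		/* program size limit quoted in the text */

/* Hypothetical CFG node: successors by insn index, negative means no edge. */
struct node {
	int next;		/* fall-through successor */
	int target;		/* jump target */
};

enum { UNSEEN, ON_PATH, DONE };

/* Returns true when the program must be rejected: an edge leads back to an
 * insn on the current DFS path (a loop) or points outside the program. */
static bool cfg_rejected(const struct node *prog, int len)
{
	static unsigned char state[MAX_INSNS];
	static int stack[MAX_INSNS];
	int sp = 0;

	if (len <= 0 || len > MAX_INSNS)
		return true;
	for (int i = 0; i < len; i++)
		state[i] = UNSEEN;

	state[0] = ON_PATH;
	stack[sp++] = 0;
	while (sp > 0) {
		int cur = stack[sp - 1];
		int succ[2] = { prog[cur].next, prog[cur].target };
		bool pushed = false;

		for (int i = 0; i < 2 && !pushed; i++) {
			int s = succ[i];

			if (s < 0)
				continue;	/* no edge */
			if (s >= len || state[s] == ON_PATH)
				return true;	/* out of range or back edge */
			if (state[s] == UNSEEN) {
				state[s] = ON_PATH;
				stack[sp++] = s;
				pushed = true;
			}
		}
		if (!pushed) {		/* all successors explored */
			state[cur] = DONE;
			sp--;
		}
	}
	return false;
}
```

Each instruction is pushed at most once, so the explicit stack never needs more than `len` entries and the walk is linear in the number of edges.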
+
 Misc
 ----
 
@@ -561,3 +685,4 @@ the underlying architecture.
 
 Jay Schulist <jschlst@samba.org>
 Daniel Borkmann <dborkman@redhat.com>
+Alexei Starovoitov <ast@plumgrid.com>

arch/arm/net/bpf_jit_32.c

Lines changed: 2 additions & 1 deletion
@@ -925,14 +925,15 @@ void bpf_jit_compile(struct sk_filter *fp)
 		bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
 
 	fp->bpf_func = (void *)ctx.target;
+	fp->jited = 1;
 out:
 	kfree(ctx.offsets);
 	return;
 }
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-	if (fp->bpf_func != sk_run_filter)
+	if (fp->jited)
 		module_free(NULL, fp->bpf_func);
 	kfree(fp);
 }
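
Editor's sketch, not part of the patch: the JIT hunks in this commit replace the test `fp->bpf_func != sk_run_filter` with an explicit `fp->jited` flag. One plausible reading, stated here as an assumption, is that once a filter that was not JITed can be run through an entry point other than sk_run_filter, the pointer comparison no longer identifies JIT-generated code. The userspace mock below follows the attach flow from the cover letter (try the JIT, otherwise fall back to the interpreter); every `mock_*` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct mock_filter;
typedef unsigned int (*mock_run_fn)(const void *ctx,
				    const struct mock_filter *f);

struct mock_filter {
	unsigned int jited:1;	/* set only by a successful JIT compile */
	mock_run_fn bpf_func;	/* entry point used to run the filter */
	int image_freed;	/* stand-in for releasing the JIT image */
};

static int mock_jit_enabled;	/* stand-in for the JIT being usable */

static unsigned int mock_interpreter(const void *ctx,
				     const struct mock_filter *f)
{
	(void)ctx; (void)f;
	return 0xffff;		/* pretend the filter accepts the packet */
}

static unsigned int mock_native_image(const void *ctx,
				      const struct mock_filter *f)
{
	(void)ctx; (void)f;
	return 0xffff;
}

static void mock_jit_compile(struct mock_filter *f)
{
	if (!mock_jit_enabled)
		return;		/* leave the filter untouched on failure */
	f->bpf_func = mock_native_image;
	f->jited = 1;
}

/* Attach flow: i) try the JIT, ii) keep its result if it worked,
 * iii) otherwise run the filter in the interpreter. */
static void mock_prepare(struct mock_filter *f)
{
	f->bpf_func = NULL;
	f->jited = 0;
	f->image_freed = 0;
	mock_jit_compile(f);
	if (!f->jited)
		f->bpf_func = mock_interpreter;
}

/* Release path: only JIT-generated code has an image to free. */
static void mock_jit_free(struct mock_filter *f)
{
	if (f->jited)
		f->image_freed = 1;
}

#define MOCK_RUN_FILTER(f, ctx) ((f)->bpf_func((ctx), (f)))
```

With a flag, the release path no longer has to know which interpreter function a filter that was not JITed points at.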

arch/powerpc/net/bpf_jit_comp.c

Lines changed: 2 additions & 1 deletion
@@ -689,6 +689,7 @@ void bpf_jit_compile(struct sk_filter *fp)
 		((u64 *)image)[0] = (u64)code_base;
 		((u64 *)image)[1] = local_paca->kernel_toc;
 		fp->bpf_func = (void *)image;
+		fp->jited = 1;
 	}
 out:
 	kfree(addrs);
@@ -697,7 +698,7 @@ void bpf_jit_compile(struct sk_filter *fp)
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-	if (fp->bpf_func != sk_run_filter)
+	if (fp->jited)
 		module_free(NULL, fp->bpf_func);
 	kfree(fp);
 }

arch/s390/net/bpf_jit_comp.c

Lines changed: 4 additions & 1 deletion
@@ -877,6 +877,7 @@ void bpf_jit_compile(struct sk_filter *fp)
 	if (jit.start) {
 		set_memory_ro((unsigned long)header, header->pages);
 		fp->bpf_func = (void *) jit.start;
+		fp->jited = 1;
 	}
 out:
 	kfree(addrs);
@@ -887,10 +888,12 @@ void bpf_jit_free(struct sk_filter *fp)
 	unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
 	struct bpf_binary_header *header = (void *)addr;
 
-	if (fp->bpf_func == sk_run_filter)
+	if (!fp->jited)
 		goto free_filter;
+
 	set_memory_rw(addr, header->pages);
 	module_free(NULL, header);
+
 free_filter:
 	kfree(fp);
 }

arch/sparc/net/bpf_jit_comp.c

Lines changed: 2 additions & 1 deletion
@@ -809,6 +809,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
 	if (image) {
 		bpf_flush_icache(image, image + proglen);
 		fp->bpf_func = (void *)image;
+		fp->jited = 1;
 	}
 out:
 	kfree(addrs);
@@ -817,7 +818,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-	if (fp->bpf_func != sk_run_filter)
+	if (fp->jited)
 		module_free(NULL, fp->bpf_func);
 	kfree(fp);
 }

arch/x86/net/bpf_jit_comp.c

Lines changed: 2 additions & 1 deletion
@@ -772,6 +772,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf] - addrs[i];
 		bpf_flush_icache(header, image + proglen);
 		set_memory_ro((unsigned long)header, header->pages);
 		fp->bpf_func = (void *)image;
+		fp->jited = 1;
 	}
 out:
 	kfree(addrs);
@@ -791,7 +792,7 @@ static void bpf_jit_free_deferred(struct work_struct *work)
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-	if (fp->bpf_func != sk_run_filter) {
+	if (fp->jited) {
 		INIT_WORK(&fp->work, bpf_jit_free_deferred);
 		schedule_work(&fp->work);
 	} else {

drivers/isdn/i4l/isdn_ppp.c

Lines changed: 41 additions & 20 deletions
@@ -378,10 +378,15 @@ isdn_ppp_release(int min, struct file *file)
 	is->slcomp = NULL;
 #endif
 #ifdef CONFIG_IPPP_FILTER
-	kfree(is->pass_filter);
-	is->pass_filter = NULL;
-	kfree(is->active_filter);
-	is->active_filter = NULL;
+	if (is->pass_filter) {
+		sk_unattached_filter_destroy(is->pass_filter);
+		is->pass_filter = NULL;
+	}
+
+	if (is->active_filter) {
+		sk_unattached_filter_destroy(is->active_filter);
+		is->active_filter = NULL;
+	}
 #endif
 
 /* TODO: if this was the previous master: link the stuff to the new master */
@@ -629,25 +634,41 @@ isdn_ppp_ioctl(int min, struct file *file, unsigned int cmd, unsigned long arg)
 #ifdef CONFIG_IPPP_FILTER
 	case PPPIOCSPASS:
 	{
+		struct sock_fprog fprog;
 		struct sock_filter *code;
-		int len = get_filter(argp, &code);
+		int err, len = get_filter(argp, &code);
+
 		if (len < 0)
 			return len;
-		kfree(is->pass_filter);
-		is->pass_filter = code;
-		is->pass_len = len;
-		break;
+
+		fprog.len = len;
+		fprog.filter = code;
+
+		if (is->pass_filter)
+			sk_unattached_filter_destroy(is->pass_filter);
+		err = sk_unattached_filter_create(&is->pass_filter, &fprog);
+		kfree(code);
+
+		return err;
 	}
 	case PPPIOCSACTIVE:
 	{
+		struct sock_fprog fprog;
 		struct sock_filter *code;
-		int len = get_filter(argp, &code);
+		int err, len = get_filter(argp, &code);
+
 		if (len < 0)
 			return len;
-		kfree(is->active_filter);
-		is->active_filter = code;
-		is->active_len = len;
-		break;
+
+		fprog.len = len;
+		fprog.filter = code;
+
+		if (is->active_filter)
+			sk_unattached_filter_destroy(is->active_filter);
+		err = sk_unattached_filter_create(&is->active_filter, &fprog);
+		kfree(code);
+
+		return err;
 	}
 #endif /* CONFIG_IPPP_FILTER */
 	default:
@@ -1147,14 +1168,14 @@ isdn_ppp_push_higher(isdn_net_dev *net_dev, isdn_net_local *lp, struct sk_buff *
 	}
 
 	if (is->pass_filter
-	    && sk_run_filter(skb, is->pass_filter) == 0) {
+	    && SK_RUN_FILTER(is->pass_filter, skb) == 0) {
 		if (is->debug & 0x2)
 			printk(KERN_DEBUG "IPPP: inbound frame filtered.\n");
 		kfree_skb(skb);
 		return;
 	}
 	if (!(is->active_filter
-	      && sk_run_filter(skb, is->active_filter) == 0)) {
+	      && SK_RUN_FILTER(is->active_filter, skb) == 0)) {
 		if (is->debug & 0x2)
 			printk(KERN_DEBUG "IPPP: link-active filter: resetting huptimer.\n");
 		lp->huptimer = 0;
@@ -1293,14 +1314,14 @@ isdn_ppp_xmit(struct sk_buff *skb, struct net_device *netdev)
 	}
 
 	if (ipt->pass_filter
-	    && sk_run_filter(skb, ipt->pass_filter) == 0) {
+	    && SK_RUN_FILTER(ipt->pass_filter, skb) == 0) {
 		if (ipt->debug & 0x4)
 			printk(KERN_DEBUG "IPPP: outbound frame filtered.\n");
 		kfree_skb(skb);
 		goto unlock;
 	}
 	if (!(ipt->active_filter
-	      && sk_run_filter(skb, ipt->active_filter) == 0)) {
+	      && SK_RUN_FILTER(ipt->active_filter, skb) == 0)) {
 		if (ipt->debug & 0x4)
 			printk(KERN_DEBUG "IPPP: link-active filter: resetting huptimer.\n");
 		lp->huptimer = 0;
@@ -1490,9 +1511,9 @@ int isdn_ppp_autodial_filter(struct sk_buff *skb, isdn_net_local *lp)
 	}
 
 	drop |= is->pass_filter
-		&& sk_run_filter(skb, is->pass_filter) == 0;
+		&& SK_RUN_FILTER(is->pass_filter, skb) == 0;
 	drop |= is->active_filter
-		&& sk_run_filter(skb, is->active_filter) == 0;
+		&& SK_RUN_FILTER(is->active_filter, skb) == 0;
 
 	skb_push(skb, IPPP_MAX_HEADER - 4);
 	return drop;

drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c

Lines changed: 1 addition & 10 deletions
@@ -120,18 +120,14 @@ static void pch_gbe_mdio_write(struct net_device *netdev, int addr, int reg,
 			       int data);
 static void pch_gbe_set_multi(struct net_device *netdev);
 
-static struct sock_filter ptp_filter[] = {
-	PTP_FILTER
-};
-
 static int pch_ptp_match(struct sk_buff *skb, u16 uid_hi, u32 uid_lo, u16 seqid)
 {
 	u8 *data = skb->data;
 	unsigned int offset;
 	u16 *hi, *id;
 	u32 lo;
 
-	if (sk_run_filter(skb, ptp_filter) == PTP_CLASS_NONE)
+	if (ptp_classify_raw(skb) == PTP_CLASS_NONE)
 		return 0;
 
 	offset = ETH_HLEN + IPV4_HLEN(data) + UDP_HLEN;
@@ -2635,11 +2631,6 @@ static int pch_gbe_probe(struct pci_dev *pdev,
 
 	adapter->ptp_pdev = pci_get_bus_and_slot(adapter->pdev->bus->number,
 					       PCI_DEVFN(12, 4));
-	if (ptp_filter_init(ptp_filter, ARRAY_SIZE(ptp_filter))) {
-		dev_err(&pdev->dev, "Bad ptp filter\n");
-		ret = -EINVAL;
-		goto err_free_netdev;
-	}
 
 	netdev->netdev_ops = &pch_gbe_netdev_ops;
 	netdev->watchdog_timeo = PCH_GBE_WATCHDOG_PERIOD;
