Snapshot size 0 breaks smart clones with OpenShift Virtualization #129

Closed
acsulli opened this issue Nov 1, 2021 · 21 comments
Comments

@acsulli

acsulli commented Nov 1, 2021

Summary

When using the freenas-nfs driver with OpenShift 4.8+ and OpenShift Virtualization 4.8+, offloaded "smart" clones do not work. This appears to be a result of how the Containerized Data Importer looks at the size of the snapshot to determine the size of the newly created volume.

I would expect this to affect KubeVirt on other Kubernetes distributions as well.

Recreating the issue

  1. Deploy OpenShift 4.8 or later.

  2. Deploy OpenShift Virtualization 4.8 or later.

  3. Configure and deploy the freenas-nfs driver, with snapshot support.

  4. Add a VM template disk image, either via the OpenShift Virtualization GUI or by creating a DataVolume like the one below, which imports the template disk into a PVC.

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: tmpl-fedora-34
spec:
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 10Gi
    volumeMode: Filesystem
    storageClassName: lab-nfs
  source:
    http:
      url: >-
        https://url/to/Fedora-Cloud-Base-34-1.2.x86_64.qcow2

The PVC and PV are successfully created and populated with data.

# oc get pvc
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
tmpl-fedora-34   Bound    pvc-1e1ffa6c-eef6-41b4-a188-7c513f5b2deb   10Gi       RWX            lab-nfs        9m52s

# oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                       STORAGECLASS   REASON   AGE
pvc-1e1ffa6c-eef6-41b4-a188-7c513f5b2deb   10Gi       RWX            Delete           Bound    default/tmpl-fedora-34                     lab-nfs                 10m
  5. Create a new DataVolume, using the previous one as the source.
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: "fedora-clone"
spec:
  source:
    pvc:
      name: tmpl-fedora-34
      namespace: default
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 10Gi
    volumeMode: Filesystem
    storageClassName: lab-nfs

VolumeSnapshot and VolumeSnapshotContent objects are successfully created; however, there is no ZFS snapshot.

# oc get volumesnapshot
NAME           READYTOUSE   SOURCEPVC        SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
fedora-clone   true         tmpl-fedora-34                           0             lab-nfs         snapcontent-04eb9941-b22d-452f-8db8-fc36dca31c7f   5m41s          5m41s
# oc get volumesnapshotcontents
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                   VOLUMESNAPSHOTCLASS   VOLUMESNAPSHOT   VOLUMESNAPSHOTNAMESPACE   AGE
snapcontent-04eb9941-b22d-452f-8db8-fc36dca31c7f   true         0             Delete           org.democratic-csi.nfs   lab-nfs               fedora-clone     default                  5m57s

Note that the RESTORESIZE field for both is 0.
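For reference, the same value can be read straight off the objects (names taken from the output above); this 0 is what later turns into the invalid PVC request shown in the logs below:

$ oc get volumesnapshot fedora-clone -o jsonpath='{.status.restoreSize}{"\n"}'
$ oc get volumesnapshotcontent snapcontent-04eb9941-b22d-452f-8db8-fc36dca31c7f -o jsonpath='{.status.restoreSize}{"\n"}'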

  6. The new DataVolume stays in the SnapshotForSmartCloneInProgress phase indefinitely.
oc describe dv fedora-clone
Name:         fedora-clone
Namespace:    default
Labels:       <none>
Annotations:  cdi.kubevirt.io/cloneType: snapshot
              cdi.kubevirt.io/storage.clone.token:
                eyJhbGciOiJQUzI1NiIsImtpZCI6IiJ9.eyJleHAiOjE2MzU3OTMyMzgsImlhdCI6MTYzNTc5MjkzOCwiaXNzIjoiY2RpLWFwaXNlcnZlciIsIm5hbWUiOiJ0bXBsLWZlZG9yYS0zN...
API Version:  cdi.kubevirt.io/v1beta1
Kind:         DataVolume
Metadata:
  Creation Timestamp:  2021-11-01T18:55:38Z
  Generation:          2
  Resource Version:  456717
  UID:               461f9b88-22bd-4504-946c-42412f0cd120
Spec:
  Pvc:
    Access Modes:
      ReadWriteMany
    Resources:
      Requests:
        Storage:         10Gi
    Storage Class Name:  lab-nfs
    Volume Mode:         Filesystem
  Source:
    Pvc:
      Name:       tmpl-fedora-34
      Namespace:  default
Status:
  Conditions:
    Last Heartbeat Time:   2021-11-01T18:55:38Z
    Last Transition Time:  2021-11-01T18:55:38Z
    Message:               No PVC found
    Reason:                NotFound
    Status:                Unknown
    Type:                  Bound
    Last Heartbeat Time:   2021-11-01T18:55:38Z
    Last Transition Time:  2021-11-01T18:55:38Z
    Status:                False
    Type:                  Ready
    Last Heartbeat Time:   2021-11-01T18:55:38Z
    Last Transition Time:  2021-11-01T18:55:38Z
    Status:                False
    Type:                  Running
  Phase:                   SnapshotForSmartCloneInProgress
Events:
  Type    Reason                           Age   From                   Message
  ----    ------                           ----  ----                   -------
  Normal  SnapshotForSmartCloneInProgress  57m   datavolume-controller  Creating snapshot for smart-clone is in progress (for pvc default/tmpl-fedora-34)
  Normal  NotFound                         57m   datavolume-controller  No PVC found

The logs for the CDI deployment Pod (oc logs cdi-deployment-<identifier> -n openshift-cnv) have these messages:

{
    "level": "info",
    "ts": 1635791578.1069758,
    "logger": "controller.smartclone-controller",
    "msg": "reconciling smart clone",
    "VolumeSnapshot/PersistentVolumeClaim": "openshift-virtualization-os-images/cdi-tmp-bcd9870e-fd40-4e8b-a966-9d02d1bcfe34"
}
{
    "level": "info",
    "ts": 1635791578.1070423,
    "logger": "controller.smartclone-controller",
    "msg": "Reconciling snapshot",
    "VolumeSnapshot/PersistentVolumeClaim": "openshift-virtualization-os-images/cdi-tmp-bcd9870e-fd40-4e8b-a966-9d02d1bcfe34",
    "snapshot.Name": "cdi-tmp-bcd9870e-fd40-4e8b-a966-9d02d1bcfe34",
    "snapshot.Namespace": "openshift-virtualization-os-images"
}
{
    "level": "error",
    "ts": 1635791578.184192,
    "logger": "controller.smartclone-controller",
    "msg": "error creating pvc from snapshot",
    "VolumeSnapshot/PersistentVolumeClaim": "openshift-virtualization-os-images/cdi-tmp-bcd9870e-fd40-4e8b-a966-9d02d1bcfe34",
    "error": "PersistentVolumeClaim \"cdi-tmp-bcd9870e-fd40-4e8b-a966-9d02d1bcfe34\" is invalid: spec.resources[storage]: Invalid value: \"0\": must be greater than zero",
    "stacktrace": "github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/app/vendor/github.com/go-logr/zapr/zapr.go:132\nkubevirt.io/containerized-data-importer/pkg/controller.(*SmartCloneReconciler).reconcileSnapshot\n\t/remote-source/app/pkg/controller/smart-clone-controller.go:247\nkubevirt.io/containerized-data-importer/pkg/controller.(*SmartCloneReconciler).Reconcile\n\t/remote-source/app/pkg/controller/smart-clone-controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"
}
{
    "level": "error",
    "ts": 1635791578.18426,
    "logger": "controller-runtime.manager.controller.smartclone-controller",
    "msg": "Reconciler error",
    "name": "cdi-tmp-bcd9870e-fd40-4e8b-a966-9d02d1bcfe34",
    "namespace": "openshift-virtualization-os-images",
    "error": "PersistentVolumeClaim \"cdi-tmp-bcd9870e-fd40-4e8b-a966-9d02d1bcfe34\" is invalid: spec.resources[storage]: Invalid value: \"0\": must be greater than zero",
    "stacktrace": "github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/app/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:302\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"
}

I believe this is the result of this line of code in the driver, which statically sets the snapshot size to 0.

Expected result

The objects should have a valid value in the size field, allowing the DataVolume clone operation to succeed.

@travisghansen
Member

Welcome!

Hmm, we may need to work jointly with them to home in on the meaning of the spec:

...
// Information about a specific snapshot.
message Snapshot {
  // This is the complete size of the snapshot in bytes. The purpose of
  // this field is to give CO guidance on how much space is needed to
  // create a volume from this snapshot. The size of the volume MUST NOT
  // be less than the size of the source snapshot. This field is
  // OPTIONAL. If this field is not set, it indicates that this size is
  // unknown. The value of this field MUST NOT be negative and a size of
  // zero means it is unspecified.
  int64 size_bytes = 1;
...

Speaking from a more theoretical standpoint, the size is genuinely unspecified. Using the zfs snapshot feature, the size is 0 and will only change over time as the data of the parent begins to differ from the snapshot. Furthermore, creating a new volume from a snapshot (the purpose of the field) requires 0 extra bytes as well (again, until the new volume begins to differ from the snapshot). In short, it was intentionally set to 0, but I'm open to discussion for sure.
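As a quick illustration of that point (dataset and snapshot names are made up for the example), a freshly created ZFS snapshot reports essentially zero space used of its own, while referenced shows the data it points at:

$ zfs snapshot tank/k8s/nfs/vols/pvc-example@manual-test
$ zfs list -t snapshot -o name,used,referenced tank/k8s/nfs/vols/pvc-example@manual-test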

@acsulli
Author

acsulli commented Nov 2, 2021

Hello @travisghansen, thank you for the warm welcome 😄

Speaking from a more theoretical standpoint the size is genuinely unspecified. Using the zfs snapshot feature the size is 0 and will only change over time as the data of the parent begins to differ from the snapshot.

I get that, and it makes sense from the perspective of the storage system; however, from the perspective of k8s, you can't create a zero-sized PVC - it needs to be at least large enough to hold the data within the source volume.

For what it's worth, here's the VolumeSnapshot and VolumeSnapshotContents output doing the same process with NetApp ONTAP (using Trident with ontap-nas driver):

$ oc get pvc fedora -n openshift-virtualization-os-images
NAME     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
fedora   Bound    pvc-bb532db5-e7fd-4783-85dd-9d39d13faa8e   10Gi       RWX            lab-ontap-nfs   37m

$ cat << EOF | oc apply -f -
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: "fedora-clone"
spec:
  source:
    pvc:
      name: fedora
      namespace: openshift-virtualization-os-images
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 10Gi
    volumeMode: Filesystem
    storageClassName: lab-ontap-nfs
EOF

$ oc get volumesnapshot,volumesnapshotcontents -A
NAMESPACE                            NAME                                                                                  READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS       SNAPSHOTCONTENT                                    CREATIONTIME   AGE
openshift-virtualization-os-images   volumesnapshot.snapshot.storage.k8s.io/cdi-tmp-0e229364-50c4-47e5-b982-89c78d5e6244   true         fedora                              658804Ki      trident-snapclass   snapcontent-467b689d-df83-48ca-be1b-eb6ffac2010c   43s            43s

NAMESPACE   NAME                                                                                             READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                  VOLUMESNAPSHOTCLASS   VOLUMESNAPSHOT                                 VOLUMESNAPSHOTNAMESPACE              AGE
            volumesnapshotcontent.snapshot.storage.k8s.io/snapcontent-467b689d-df83-48ca-be1b-eb6ffac2010c   true         674615296     Delete           csi.trident.netapp.io   trident-snapclass     cdi-tmp-0e229364-50c4-47e5-b982-89c78d5e6244   openshift-virtualization-os-images   43s

The RESTORESIZE represents the amount of data used in the PVC. This is with a thin-provisioned FlexVol; I don't know if it would be different with a thick-provisioned / space-reserved FlexVol. Regardless, the result is a FlexClone of the volume, which - same as ZFS - consumes no additional space on the storage array. But k8s (and in this case CDI) needs to know how big a PVC to request to accommodate the data within.
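For context on why the size matters to k8s: the smart clone ultimately comes down to CDI creating a PVC whose dataSource is the VolumeSnapshot, which the CSI external-provisioner then fulfills from the snapshot. A hand-written equivalent - names reuse the earlier freenas example and are purely illustrative - looks roughly like this, and the storage request has to be greater than zero and at least the reported restore size:

$ cat << EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-from-snap
spec:
  storageClassName: lab-nfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi   # must be > 0 and >= the snapshot's reported restoreSize
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: fedora-clone
EOF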

@travisghansen
Member

Yes understood. We had some good discussion about the matter on slack: https://kubernetes.slack.com/archives/C8EJ01Z46/p1635807374020000

In any case, the result of the conversation was this:

  • indeed my understanding of the datapoint in the proto file was off (I was reading it more along the lines of the CO checking against available storage (GetCapacity) vs being used as a minimum size for derived volumes)
  • CDI should probably handle the situation more robustly as well (there are scenarios where a 0 is relevant or nothing better can be assumed)
    • it's really unclear what this should be, but creating a volume with a size request of 0 is no good
    • perhaps it should attempt to fallback to the size of the originating PVC if still present
    • otherwise throw an error and not create a PVC at all until it can determine a size > 0?

In any case, I'll be working on revisiting the logic but it will take me a little bit. It's actually more nuanced than it seems because I support ListVolumes so I need to store the size historically in case the parent PVC grows etc. I will however try to retool the driver based on the conclusions of the conversation on slack.

I'm glad you brought it up because I otherwise would never have known :)

@travisghansen
Member

@acsulli out of curiosity, do you happen to know how k8s/CDI responds when the value is simply omitted from responses? It may be easier for me to not try to manage it at all.

@acsulli
Author

acsulli commented Nov 2, 2021

how k8s/CDI responds when the value is simply omitted

I do not =\

@travisghansen
Member

travisghansen commented Nov 2, 2021

OK, I've removed the attribute from the responses entirely for now:

Can you update your image tag to next and give it a test to see how it behaves (make sure to set the pull policy to Always, as that tag is mutable)? After updating, start from scratch in terms of the process with CDI/kubevirt (make it create a new PVC, snapshot, etc.) and we'll see what's different.
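For what it's worth, one way to force a re-pull of the mutable tag once the pull policy is Always (namespace and deployment name here are just a guess based on a typical Helm install, adjust as needed):

$ kubectl -n democratic-csi rollout restart deployment/democratic-nfs-next-democratic-csi-controller
$ kubectl -n democratic-csi rollout status deployment/democratic-nfs-next-democratic-csi-controller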

@acsulli
Author

acsulli commented Nov 2, 2021

Looks to be the same behavior as before.

$ oc describe pod democratic-nfs-next-democratic-csi-controller-86d6fd7f78-56996 -n democratic-csi | grep Image\:
    Image:         k8s.gcr.io/sig-storage/csi-provisioner:v2.1.0
    Image:         k8s.gcr.io/sig-storage/csi-resizer:v1.1.0
    Image:         k8s.gcr.io/sig-storage/csi-snapshotter:v3.0.3
    Image:         docker.io/democraticcsi/democratic-csi:next

$ oc apply -f dv-fedora.yaml
datavolume.cdi.kubevirt.io/tmpl-fedora-34 created

$ oc get pv,pvc
NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
persistentvolumeclaim/tmpl-fedora-34   Bound    pvc-728a50be-765c-415e-8c17-19f1b21037e2   10Gi       RWX            lab-truenas-nfs   4m8s

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                       STORAGECLASS      REASON   AGE
persistentvolume/pvc-728a50be-765c-415e-8c17-19f1b21037e2   10Gi       RWX            Delete           Bound      ansulliv/tmpl-fedora-34                     lab-truenas-nfs            4m6s

$ oc apply -f dv-fedora-clone.yaml
datavolume.cdi.kubevirt.io/fedora-clone created

$ oc get volumesnapshot,volumesnapshotcontents
NAME                                                  READYTOUSE   SOURCEPVC        SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS     SNAPSHOTCONTENT                                    CREATIONTIME   AGE
volumesnapshot.snapshot.storage.k8s.io/fedora-clone   true         tmpl-fedora-34                           0             lab-truenas-nfs   snapcontent-201a8fa1-7012-419e-8c94-b2ad0490ad02   38s            39s

NAME                                                                                             READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                   VOLUMESNAPSHOTCLASS   VOLUMESNAPSHOT   VOLUMESNAPSHOTNAMESPACE   AGE
volumesnapshotcontent.snapshot.storage.k8s.io/snapcontent-201a8fa1-7012-419e-8c94-b2ad0490ad02   true         0             Delete           org.democratic-csi.nfs   lab-truenas-nfs       fedora-clone     ansulliv                  39s

$ oc get dv
NAME             PHASE                             PROGRESS   RESTARTS   AGE
fedora-clone     SnapshotForSmartCloneInProgress                         4m34s
tmpl-fedora-34   Succeeded                         100.0%                6m1s

$ oc logs cdi-deployment-f6487b694-5l6f4 -n openshift-cnv | tail -2
{"level":"error","ts":1635894139.2408478,"logger":"controller.smartclone-controller","msg":"error creating pvc from snapshot","VolumeSnapshot/PersistentVolumeClaim":"ansulliv/fedora-clone","error":"PersistentVolumeClaim \"fedora-clone\" is invalid: spec.resources[storage]: Invalid value: \"0\": must be greater than zero","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/app/vendor/github.com/go-logr/zapr/zapr.go:132\nkubevirt.io/containerized-data-importer/pkg/controller.(*SmartCloneReconciler).reconcileSnapshot\n\t/remote-source/app/pkg/controller/smart-clone-controller.go:247\nkubevirt.io/containerized-data-importer/pkg/controller.(*SmartCloneReconciler).Reconcile\n\t/remote-source/app/pkg/controller/smart-clone-controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}
{"level":"error","ts":1635894139.2409146,"logger":"controller-runtime.manager.controller.smartclone-controller","msg":"Reconciler error","name":"fedora-clone","namespace":"ansulliv","error":"PersistentVolumeClaim \"fedora-clone\" is invalid: spec.resources[storage]: Invalid value: \"0\": must be greater than zero","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/app/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:302\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}

@travisghansen
Member

OK thanks for giving it a try. I'll work on returning a value, hopefully won't take too long.

@travisghansen
Member

What exactly is CDI doing here and what is the expected behavior/outcome?

Example 1:

A DataVolume is created for 10Gi, which in turn is populated with X Gi of data, let's say 5Gi. A snapshot is created at this point, which (in the case of a zfs dataset/nfs) technically only requires 5Gi to restore and therefore should return a size_bytes of the appropriate value. If you now create a volume based on that value, you won't be able to do anything with it, as it will be full and would require deleting data or whatever at that point.

Example 2:

Same as above, but using zvol/iscsi (i.e. block-based with a filesystem on top); in that case the size_bytes should probably exactly match the size of the 'parent', no matter what. I think CDI is generally expecting this pattern vs the previous example.
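A quick way to see that difference at the ZFS level (dataset/zvol names are illustrative):

# filesystem dataset: a restore only needs the bytes the snapshot references
$ zfs get -Hp -o value referenced tank/k8s/nfs/vols/pvc-example@snap
# zvol: the provisioned size is fixed, so a restored volume needs the full volsize
$ zfs get -Hp -o value volsize tank/k8s/iscsi/vols/pvc-example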

Thoughts? More info?

@acsulli
Author

acsulli commented Nov 3, 2021

Disclaimer: I am not a developer, so please take my interpretation of code with a massive grain of salt.

From what I understand, it is using the restore size for the initial PVC request, then expanding it to the size requested by the DataVolume afterward. So, the new (clone) DataVolume cannot be smaller than the actual used capacity in the source PV, but can be larger than the original.
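One way to observe that on the cluster once a clone does succeed (these are standard PVC fields; the claim name is from the earlier examples):

# initial request vs. the capacity after CDI expands the clone
$ oc get pvc fedora-clone -o jsonpath='{.spec.resources.requests.storage} {.status.capacity.storage}{"\n"}'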

@travisghansen
Member

OK, lemme commit a tweak for your scenario (and what I think is the proper response value outside the CDI use-case) and then have you give it a try again.

Thanks for the patience and info! Excited to get the feature working for you.

@travisghansen
Member

Ok, I’ve made a new commit which attempts to handle this properly.

I’ve a bit more research to do to make sure the value being returned will always be high enough but it should be sane enough for lab testing.

Force the deployment to update to the most recent next images and try again. Thanks!

@acsulli
Author

acsulli commented Nov 4, 2021

Apologies for the delay, that works exactly as expected, thank you!

@travisghansen
Member

Awesome! I’ll put a little more polish on the feature and then close this down once it’s merged into a release.

@acsulli
Author

acsulli commented Nov 4, 2021

One interesting thing I've noticed is the clone volume dependency tree(s).

What I understand to be happening is that a CSI snapshot is created, resulting in a ZFS snap, then a new PV/PVC is created from the snapshot - which results in a dependent clone volume in ZFS. Deleting the original volume (the one the snapshot was created from) leaves its PV in a Released state, with the external-provisioner logs showing something like this:

event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-77478a1a-806d-407b-9d05-1480233c7046", UID:"14e80e6e-b957-4ea3-80ac-5089449d957e", APIVersion:"v1", ResourceVersion:"1217857", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = FailedPrecondition desc = filesystem has dependent clones

I don't think this behavior is limited to just CDI / OpenShift Virtualization, I would expect it any time a PVC is cloned and the original deleted. More importantly, I'm not sure that this is incorrect based on my (admittedly primitive) understanding of how ZFS clones work.
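For anyone else tracing this, the dependency is visible with plain ZFS commands on the storage side (dataset names below are illustrative, and promoting a clone by hand is outside anything the driver manages):

# which snapshot a cloned dataset originates from
$ zfs get -H -o value origin tank/k8s/nfs/vols/pvc-clone
# which clones currently depend on a given snapshot
$ zfs get -H -o value clones tank/k8s/nfs/vols/pvc-source@snap
# zfs promote reverses the dependency so the original dataset could be destroyed
$ zfs promote tank/k8s/nfs/vols/pvc-clone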

I do think it simply means that users need to be aware of how it works.

Is there an option to do a true "full" clone of the volume? NetApp's Trident does this via a backend configuration called splitOnClone. I see an option called detachedSnapshotsDatasetParentName, but I haven't been able to discern what that is used for.

@travisghansen
Member

travisghansen commented Nov 4, 2021

@acsulli yeah, super glad you're running through this stuff! This is partially what I was alluding to earlier when I mentioned the whole situation is actually more complex than it seems for several reasons (it's a large matrix of potential paths). In democratic-csi there are really 3 distinct options at play:

  1. detachedSnapshotsDatasetParentName - activated by setting this https://github.com/democratic-csi/charts/blob/master/stable/democratic-csi/examples/freenas-nfs.yaml#L40 on a particular snapshot class. Enabling that will, in lieu of doing a zfs snapshot command, do a zfs send <source> | zfs receive <target>, where the target will be 'beneath' the dataset named by the option. Note that the value of this option can actually cross zfs pools (it currently must be on the same storage machine though, so you can't ship it off the storage system). When a snapshot of this nature is created, the data is fully copied to the new location and no cow-esque actions are involved...you have a complete backup, if you will.
  2. storage class parameter detachedVolumesFromSnapshots (https://github.com/democratic-csi/charts/blob/master/stable/democratic-csi/examples/freenas-nfs.yaml#L19) - setting this will do a zfs send <source> | zfs receive <target> of volumes created from snapshots in lieu of zfs clone (see the zfs-level sketch after this list). Note that this behavior takes place regardless of the operative scenario for option 1 above (i.e. it will create a detached volume regardless of whether the source snapshot was zfs snapshotted or zfs send/received).
  3. storage class parameter detachedVolumesFromVolumes (https://github.com/democratic-csi/charts/blob/master/stable/democratic-csi/examples/freenas-nfs.yaml#L23) - setting this will do a zfs send <source> | zfs receive <target> for the newly created volume instead of zfs clone. This is operative when the source of the volume is another volume vs a snapshot.
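At the ZFS level, the difference between the default behaviour and the detached* options above boils down to roughly this (dataset names are illustrative, not what the driver actually generates):

# default: copy-on-write clone, which stays dependent on the source snapshot
$ zfs clone tank/k8s/nfs/vols/pvc-source@snap tank/k8s/nfs/vols/pvc-clone

# detached*: full, independent copy via send/receive (can cross pools on the same machine)
$ zfs send tank/k8s/nfs/vols/pvc-source@snap | zfs receive tank/k8s/nfs/vols/pvc-clone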

The particular sequence of events you've hit likely can't be handled any differently than the behavior you're seeing. In the case of deleting snapshots, I can flag the removal of the zfs snapshot as deferred (and do so in the code), so snapshots with dependent volumes will delete from k8s and clean themselves up eventually using the deferred feature of zfs. If I could do a deferred delete on a zfs dataset/zvol (the scenario you're hitting) I would, but I don't think that's a thing with zfs yet.

Frankly, there are probably some gaps in the logic currently that aren't being dealt with entirely, because there are so many possibilities. For example, if you enable option 1 (but the parent is on another pool) and do not enable option 2, then it will blow up, as you can't zfs clone across pools.

From a theoretical standpoint there are no silver bullets but I tried to cover what I felt would be the most likely use-cases.

EDIT: The csi spec is purposely ambiguous about the relationship of snapshots to parent volumes, specifically around the delete behavior (it doesn't take a stance...at least last I knew). i.e. snapshots may or may not share a life-cycle with the parent volume.

I'm interested in knowing of other specific configurations/combinations that would better serve the kubevirt use-case if you have them.

@acsulli
Author

acsulli commented Nov 5, 2021

Here is a nice summary of the problem.

Thinking out loud, it would make sense that having deduplication turned on at the top level dataset / volume - where both the democratic-csi "parent" and "snap" dataset / volume are children - would result in the ability to create quick, near zero capacity clones between them.

But, AFAIK, it doesn't work that way 😞

I do think that you could save capacity with the above configuration, but it would still rely on the zfs send / receive which means that the entire contents of the volume would need to be read by the send process, then hashed and deduped by the receive process - generating a lot of disk and CPU activity for what becomes nothing more than a bunch of block pointers and a new redirect-on-write tree.

@travisghansen
Member

travisghansen commented Nov 5, 2021

Yeah the linked article is a decent representation of the various trade offs depending on which route you go.

If you don't care about quotas and want to run dedup, you could create a single dataset and share it over nfs, and then use the nfs-client driver to create simple directories for each pv (because they all share a dataset, dedup would kick in) and snapshot (it crudely copies data with rsync). But that's unlikely to be the route you want either.

There are simply trade-offs between capacity, performance, etc. that need to be weighed to choose what makes the most sense for the circumstances :(

If you have any bright ideas do share!

EDIT: correction, dedup is pool-wide according to this: https://unix.stackexchange.com/questions/382303/does-zfs-deduplicates-across-datasets-or-only-inside-a-single-dataset

In such a case you can enable dedup on the parent datasets and you should be covered.
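Concretely, that would look something like this (parent dataset name is illustrative; keep in mind the dedup table lives at the pool level and is RAM-hungry):

# enable dedup for everything provisioned beneath the parent dataset
$ zfs set dedup=on tank/k8s
# observe the effect at the pool level
$ zpool list -o name,size,allocated,dedupratio tank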

@travisghansen
Member

How has this been working?

There is one last potential issue that I'm aware of, and that is zfs reporting bad values because all pending transactions have not yet been flushed. sysctl -d vfs.zfs.txg.timeout is the relevant setting. Technically I should probably wait that amount of time (plus a margin) to be sure things have settled before I get the properties and respond back.
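For reference, a rough sketch of that settle-then-read behaviour, assuming TrueNAS CORE/FreeBSD (on Linux the equivalent knob is the zfs_txg_timeout module parameter) and an illustrative dataset name:

# current txg flush interval in seconds
$ sysctl -n vfs.zfs.txg.timeout
# wait at least one full txg interval before trusting the space accounting
$ sleep $(( $(sysctl -n vfs.zfs.txg.timeout) + 1 ))
$ zfs get -Hp -o value used,referenced tank/k8s/nfs/vols/pvc-example@snap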

@acsulli
Author

acsulli commented Nov 29, 2021

It's been working exactly as expected, thank you! I've created several dozen PVC clones for VMs without issue.

@travisghansen
Member

Released in v1.4.3. I still have a back-burner item to ensure proper return values in edge cases, but I'm closing this for now.

qlonik added a commit to qlonik/musical-parakeet that referenced this issue Jul 3, 2024
There are three settings that change behaviour from `zfs clone` to `zfs send
| zfs receive`. For explanation, see
democratic-csi/democratic-csi#129 (comment).
Try 2 settings that detach volumes created from snapshots and other volumes.