diff --git a/Makefile b/Makefile
index 457f080..5d9e5f0 100644
--- a/Makefile
+++ b/Makefile
@@ -4,7 +4,7 @@ BASE_IMAGE_TAG=3.12-slim-bookworm
IMAGE_NAME=homeylab/bookstack-file-exporter
# keep this start sequence unique (IMAGE_TAG=)
# github actions will use this to create a tag
-IMAGE_TAG=0.0.3
+IMAGE_TAG=1.0.0
DOCKER_WORK_DIR=/export
DOCKER_CONFIG_DIR=/export/config
DOCKER_EXPORT_DIR=/export/dump
diff --git a/README.md b/README.md
index 76c558d..874f691 100644
--- a/README.md
+++ b/README.md
@@ -11,22 +11,26 @@ Table of Contents
- [Options and descriptions](#options-and-descriptions)
- [Environment variables](#valid-environment-variables)
- [Backup Behavior](#backup-behavior)
+ - [Images](#images)
+ - [Modify Markdown Files](#modify-markdown-files)
- [Object Storage](#object-storage)
- [Minio](#minio-backups)
+- [Future Items](#future-items)
## Background
_Features are actively being developed. See `Future Items` section for more details. Open an issue for a feature request._
-This tool provides a way to export [Bookstack](https://github.com/BookStackApp/BookStack) pages and their content (_text, images, metadata, etc._) into a relational directory-tree layout locally with an option to push to remote object storage locations. See [Backup Behavior](#backup-behavior) section for more details on how pages are organized.
+This tool provides a way to export [Bookstack](https://github.com/BookStackApp/BookStack) pages and their content (_text, images, metadata, etc._) into a relational parent-child layout locally with an option to push to remote object storage locations. See [Backup Behavior](#backup-behavior) section for more details on how pages are organized.
This small project was mainly created to run as a cron job in k8s but works anywhere. This tool allows me to export my docs in markdown, or other formats like pdf. I use Bookstack's markdown editor as default instead of WYSIWYG editor and this makes my notes portable anywhere even if offline.
### Features
What it does:
-- Build relationships between Bookstack `Shelves/Books/Chapters/Pages` to create a relational directory-tree layout
+- Discover and build relationships between Bookstack `Shelves/Books/Chapters/Pages` to create a relational parent-child layout
- Export Bookstack pages and their content to a `.tgz` archive
- Additional content for pages like their images and metadata and can be exported
+- The exporter can also [Modify Markdown Files](#modify-markdown-files) to replace image links with local exported image paths for a more portable backup
- YAML configuration file for repeatable and easy runs
- Can be run via [Python](#run-via-pip) or [Docker](#run-via-docker)
- Can push archives to remote object storage like [Minio](https://min.io/)
@@ -73,6 +77,7 @@ formats:
output_path: "bkps/"
assets:
export_images: false
+ modify_markdown: false
export_meta: false
verify_ssl: true
```
@@ -193,6 +198,7 @@ formats:
output_path: "bkps/"
assets:
export_images: false
+ modify_markdown: false
export_meta: false
verify_ssl: true
```
@@ -225,6 +231,7 @@ minio:
output_path: "bkps/"
assets:
export_images: true
+ modify_markdown: false
export_meta: false
verify_ssl: true
keep_last: 5
@@ -244,6 +251,7 @@ More descriptions can be found for each section below:
| `output_path` | `str` | `false` | Optional (default: `cwd`) which directory (relative or full path) to place exports. User who runs the command should have access to read/write to this directory. If not provided, will use current run directory by default |
| `assets` | `object` | `false` | Optional section to export additional assets from pages. |
| `assets.export_images` | `bool` | `false` | Optional (default: `false`), export all images for a page to an `image` directory within page directory. See [Backup Behavior](#backup-behavior) for more information on layout |
+| `assets.modify_markdown` | `bool` | `false` | Optional (default: `false`), modify markdown files to replace image links with local exported image paths. This requires `assets.export_images` to be `true` in order to work. See [Modify Markdown Files](#modify-markdown-files) for more information. |
| `assets.export_meta` | `bool` | `false` | Optional (default: `false`), export of metadata about the page in a json file |
| `assets.verify_ssl` | `bool` | `false` | Optional (default: `true`), whether or not to check ssl certificates when requesting content from Bookstack host |
| `keep_last` | `int` | `false` | Optional (default: `None`), if exporter can delete older archives. valid values are:
- set to `-1` if you want to delete all archives after each run (useful if you only want to upload to object storage)
- set to `1+` if you want to retain a certain number of archives
- `0` will result in no action done |
@@ -261,9 +269,12 @@ General
- `MINIO_ACCESS_KEY`
- `MINIO_SECRET_KEY`
-### Backup Behavior
+## Backup Behavior
+
+### Export File
Backups are exported in `.tgz` format and generated based off timestamp. Export names will be in the format: `%Y-%m-%d_%H-%M-%S` (Year-Month-Day_Hour-Minute-Second). *Files are first pulled locally to create the tarball and then can be sent to object storage if needed*. Example file name: `bookstack_export_2023-09-22_07-19-54.tgz`.
+### General
The exporter can also do housekeeping duties and keep a configured number of archives and delete older ones. See `keep_last` property in the [Configuration](#options-and-descriptions) section. Object storage provider configurations include their own `keep_last` property for flexibility.
For file names, `slug` names (from Bookstack API) are used, as such certain characters like `!`, `/` will be ignored and spaces replaced from page names/titles.
@@ -349,6 +360,37 @@ Empty/New Pages will be ignored since they have not been modified yet from creat
You may notice some directories (books) and/or files (pages) in the archive have a random string at the end, example - `nKA`: `user-and-group-management-nKA`. This is expected and is because there were resources with the same name created in another shelve and bookstack adds a string at the end to ensure uniqueness.
+### Images
+
+### General
+Images are exported to a separate `images` directory inside the directory of the page they belong to. As shown earlier:
+
+```
+bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/YKvimage.png
+bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/dwwimage.png
+bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/NzZimage.png
+bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/Mymimage.png
+```
+
+**Note you may see old images in your exports. This is because, by default, Bookstack retains images/drawings that are uploaded even if no longer referenced on an active page. Admins can run `Cleanup Images` in the Maintenance Settings or via [CLI](https://www.bookstackapp.com/docs/admin/commands/#cleanup-unused-images) to remove them.**
+
+### Modify Markdown Files
+**To use this feature, `assets.export_images` should be set to `true`**
+
+Set the configuration item `assets.modify_markdown` to `true` to modify markdown files so that image url links are replaced with local exported image paths. This makes your `markdown` exports much more portable.
+
+Page (parent) -> Images (Children) relationships are created and then each image url is replaced with its own respective local export path. Example:
+```
+## before
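+[![pool-topology-1.png](https://demo.bookstack/uploads/images/gallery/2023-07/scaled-1680-/pool-topology-1.png)](https://demo.bookstack/uploads/images/gallery/2023-07/pool-topology-1.png)
+
+## after
+[![pool-topology-1.png](./images/pool-topology-1.png)](https://demo.bookstack/uploads/images/gallery/2023-07/pool-topology-1.png)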
+```
+This allows each image to be found locally within the export files, so your `markdown` docs display all their images as they normally would.
+
+**Note: This works properly if your pages use the notation Bookstack uses for Markdown image links, for example: `[![image alt text](image url link)](anchor/url link)`. The `(anchor/url link)` is optional.**
+
## Object Storage
Optionally, target(s) can be specified to upload generated archives to a remote location. Supported object storage providers can be found below:
- [Minio](#minio-backups)
@@ -388,7 +430,7 @@ minio:
## Future Items
1. ~~Be able to pull images locally and place in their respective page folders for a more complete file level backup.~~
2. ~~Include the exporter in a maintained helm chart as an optional deployment. The helm chart is [here](https://github.com/homeylab/helm-charts/tree/main/charts/bookstack).~~
-3. Be able to modify markdown links of images to local exported images in their respective page folders for a more complete file level backup.
+3. ~~Be able to modify markdown links of images to local exported images in their respective page folders for a more complete file level backup.~~
4. Be able to pull attachments locally and place in their respective page folders for a more complete file level backup.
5. Export S3 and more options.
6. Filter shelves and books by name - for more targeted backups. Example: you only want to share a book about one topic with an external friend/user.
diff --git a/bookstack_file_exporter/archiver/archiver.py b/bookstack_file_exporter/archiver/archiver.py
index d2d1f91..329343c 100644
--- a/bookstack_file_exporter/archiver/archiver.py
+++ b/bookstack_file_exporter/archiver/archiver.py
@@ -5,7 +5,7 @@
from bookstack_file_exporter.exporter.node import Node
from bookstack_file_exporter.archiver import util
-from bookstack_file_exporter.archiver.page_archiver import PageArchiver
+from bookstack_file_exporter.archiver.page_archiver import PageArchiver, ImageNode
from bookstack_file_exporter.archiver.minio_archiver import MinioArchiver
from bookstack_file_exporter.config_helper.remote import StorageProviderConfig
from bookstack_file_exporter.config_helper.config_helper import ConfigNode
@@ -49,23 +49,23 @@ def get_bookstack_exports(self, page_nodes: Dict[int, Node]):
self._get_page_files(page, page_image_meta)
self._get_page_images(page.file_path, page_image_meta)
- def _get_page_files(self, page_node: Node, image_meta: List[str]):
+ def _get_page_files(self, page_node: Node, image_meta: List[ImageNode]):
"""pull all bookstack pages into local files/tar"""
log.debug("Exporting bookstack page data")
self._page_archiver.archive_page(page_node, image_meta)
- def _get_page_image_map(self) -> Dict[int, List[str]]:
+ def _get_page_image_map(self) -> Dict[int, List[ImageNode]]:
if not self._page_archiver.export_images:
log.debug("skipping image export based on user input")
return {}
return self._page_archiver.get_image_meta()
- def _get_page_images(self, page_path: str, urls: List[str]):
- if not urls:
+ def _get_page_images(self, page_path: str, img_nodes: List[ImageNode]):
+ if not img_nodes:
log.debug("page has no images to pull")
return
log.debug("Exporting bookstack page images")
- self._page_archiver.archive_page_images(page_path, urls)
+ self._page_archiver.archive_page_images(page_path, img_nodes)
def create_archive(self):
"""create tgz archive"""
diff --git a/bookstack_file_exporter/archiver/minio_archiver.py b/bookstack_file_exporter/archiver/minio_archiver.py
index 27b11e4..ccd8598 100644
--- a/bookstack_file_exporter/archiver/minio_archiver.py
+++ b/bookstack_file_exporter/archiver/minio_archiver.py
@@ -96,7 +96,6 @@ def _get_stale_objects(self, file_extension: str) -> List[MinioObject]:
# last copy that remains if local is deleted
log.debug("Minio 'keep_last' set to negative number, ignoring")
return []
- # keep_last > 0 condition
to_delete = []
if len(minio_objects) > self.keep_last:
log.debug("Number of minio objects is greater than 'keep_last'")
diff --git a/bookstack_file_exporter/archiver/page_archiver.py b/bookstack_file_exporter/archiver/page_archiver.py
index 3fe9305..e159bbb 100644
--- a/bookstack_file_exporter/archiver/page_archiver.py
+++ b/bookstack_file_exporter/archiver/page_archiver.py
@@ -3,13 +3,11 @@
# pylint: disable=import-error
from requests import Response
-
from bookstack_file_exporter.exporter.node import Node
from bookstack_file_exporter.archiver import util as archiver_util
from bookstack_file_exporter.config_helper.config_helper import ConfigNode
from bookstack_file_exporter.common import util as common_util
-
_META_FILE_SUFFIX = "_meta.json"
_TAR_SUFFIX = ".tar"
_TAR_GZ_SUFFIX = ".tgz"
@@ -26,11 +24,55 @@
"tgz": _TAR_GZ_SUFFIX
}
-
_IMAGE_DIR_NAME = "images"
-# _MARKDOWN_IMAGE_REGEX= re.compile(r"\[\!\[^$|.*\].*\]")
_MARKDOWN_STR_CHECK = "markdown"
+class ImageNode:
+ """
+ ImageNode provides metadata and convenience for Bookstack images.
+
+ Args:
+ :img_meta_data: = image meta data
+
+ Returns:
+ :ImageNode: instance with attributes to help handle images.
+ """
+ def __init__(self, img_meta_data: Dict[str, Union[int, str]]):
+ self.id: int = img_meta_data['id']
+ self.page_id: int = img_meta_data['uploaded_to']
+ self.url: str = img_meta_data['url']
+ self.name: str = self._get_image_name()
+ self._markdown_str = ""
+ self._image_relative_path: str = f"./{_IMAGE_DIR_NAME}/{self.name}"
+
+ def _get_image_name(self) -> str:
+ return self.url.split('/')[-1]
+
+ @property
+ def image_relative_path(self):
+ """return image path local to page directory"""
+ return self._image_relative_path
+
+ @property
+ def markdown_str(self):
+ """return markdown url str to replace"""
+ return self._markdown_str
+
+ def set_markdown_content(self, img_details: Dict[str, Union[int, str]]):
+ """provide image metadata to set markdown properties"""
+ self._markdown_str = self._get_md_url_str(img_details)
+
+ @staticmethod
+ def _get_md_url_str(img_data: Dict[str, Union[int, str]]) -> str:
+ url_str = ""
+ if 'content' in img_data:
+ if _MARKDOWN_STR_CHECK in img_data['content']:
+ url_str = img_data['content'][_MARKDOWN_STR_CHECK]
+ # check to see if empty before doing find
+ if not url_str:
+ return ""
+ return url_str[url_str.find("(")+1:url_str.find(")")]
+
# pylint: disable=too-many-instance-attributes
class PageArchiver:
"""
@@ -75,13 +117,11 @@ def archive_page(self, page: Node,
self._archive_page_meta(page.name, page.file_path, page.meta)
def _archive_page(self, page: Node, export_format: str, data: bytes,
- image_urls: List[str] = None):
+ image_nodes: List[ImageNode] = None):
page_file_name = f"{self.archive_base_path}/" \
f"{page.file_path}/{page.name}{_FILE_EXTENSION_MAP[export_format]}"
-
- # note yet implemented
- # if export_format == _MARKDOWN_STR_CHECK and image_urls and self.modify_md:
- # data = self._update_image_links(data, image_urls)
+ if self.modify_md and export_format == _MARKDOWN_STR_CHECK and image_nodes:
+ data = self._update_image_links(data, image_nodes)
self.write_data(page_file_name, data)
def _get_page_data(self, page_id: int, export_format: str):
@@ -96,7 +136,7 @@ def _archive_page_meta(self, page_name: str, page_path: str,
bytes_meta = archiver_util.get_json_bytes(meta_data)
self.write_data(file_path=meta_file_name, data=bytes_meta)
- def get_image_meta(self) -> Dict[int, List[str]]:
+ def get_image_meta(self) -> Dict[int, List[ImageNode]]:
"""Get all image metadata into a {page_number: [image_url]} format"""
img_meta_response: Response = common_util.http_get_request(
self.api_urls['images'],
@@ -105,28 +145,14 @@ def get_image_meta(self) -> Dict[int, List[str]]:
img_meta_json = img_meta_response.json()['data']
return self._create_image_map(img_meta_json)
- @staticmethod
- def _create_image_map(json_data: List[Dict[str, Union[str,int]]]) -> Dict[int, List[str]]:
- image_page_map = {}
- for image_node in json_data:
- image_page_id = image_node['uploaded_to']
- image_url = image_node['url']
- if image_page_id in image_page_map:
- image_page_map[image_page_id].append(image_url)
- else:
- image_page_map[image_page_id] = [image_url]
- return image_page_map
-
- def archive_page_images(self, page_path: str, image_urls: List[str]):
+ def archive_page_images(self, page_path: str, image_nodes: List[ImageNode]):
"""pull images locally into a directory based on page"""
# image_base_path = f"{self.archive_base_path}/{page_path}{_IMAGE_DIR_SUFFIX}"
image_base_path = f"{self.archive_base_path}/{page_path}/{_IMAGE_DIR_NAME}"
- for image_url in image_urls:
- img_data: bytes = archiver_util.get_byte_response(image_url, self._headers,
+ for img_node in image_nodes:
+ img_data: bytes = archiver_util.get_byte_response(img_node.url, self._headers,
self.verify_ssl)
- # seems safer to use this instead of image['name'] field
- img_file_name = image_url.split('/')[-1]
- image_path = f"{image_base_path}/{img_file_name}"
+ image_path = f"{image_base_path}/{img_node.name}"
self.write_data(image_path, img_data)
def write_data(self, file_path: str, data: bytes):
@@ -142,19 +168,19 @@ def gzip_archive(self):
"""provide the tar to gzip and the name of the gzip output file"""
archiver_util.create_gzip(self.tar_file, self.archive_file)
- def _update_image_links(self, page_data: bytes, urls: List[str]) -> bytes:
+ def _update_image_links(self, page_data: bytes, image_nodes: List[ImageNode]) -> bytes:
"""regex replace links to local created directories"""
- # 1 - what to replace, 2 - replace with, 3 is the data to replace
- # re.sub(b'pfsense', b'lol', x.content)
-
- # string to bytes
- # >>> k = 'lol'
- # >>> k.encode()
- pass
-
- def _valid_image_link(self):
- """should contain bookstack host"""
- pass
+ for img_node in image_nodes:
+ img_meta_url = f"{self.api_urls['images']}/{img_node.id}"
+ img_details = common_util.http_get_request(img_meta_url,
+ self._headers, self.verify_ssl)
+ img_node.set_markdown_content(img_details.json())
+ if not img_node.markdown_str:
+ continue
+ # 1 - what to replace, 2 - replace with, 3 is the data to replace
+ page_data = re.sub(re.escape(img_node.markdown_str.encode()),
+ img_node.image_relative_path.encode(), page_data)
+ return page_data
@property
def file_extension_map(self) -> Dict[str, str]:
@@ -171,6 +197,18 @@ def verify_ssl(self) -> bool:
"""return whether or not to verify ssl for http requests"""
return self.asset_config.verify_ssl
+ # @staticmethod
+ # def _get_regex_expr(url: str) -> bytes:
+ # # regex_str = fr"\[\!\[^$|.*\]\({url}\)\]"
+ # return re.compile(regex_str.encode())
+
@staticmethod
- def _get_regex_expr(url: str) -> re.Pattern:
- return re.compile(fr"\[\!\[^$|.*\].*{url}.*\]")
+ def _create_image_map(json_data: List[Dict[str, Union[str,int]]]) -> Dict[int, List[ImageNode]]:
+ image_page_map = {}
+ for img_meta in json_data:
+ img_node = ImageNode(img_meta)
+ if img_node.page_id in image_page_map:
+ image_page_map[img_node.page_id].append(img_node)
+ else:
+ image_page_map[img_node.page_id] = [img_node]
+ return image_page_map
diff --git a/bookstack_file_exporter/archiver/util.py b/bookstack_file_exporter/archiver/util.py
index bda4c1a..0980fb6 100644
--- a/bookstack_file_exporter/archiver/util.py
+++ b/bookstack_file_exporter/archiver/util.py
@@ -27,13 +27,6 @@ def write_tar(base_tar_dir: str, file_path: str, data: bytes):
log.debug("Adding file: %s with size: %d bytes to tar file", tar_info.name, tar_info.size)
tar.addfile(tar_info, fileobj=data_obj)
-# create files first for manipulation/changes and tar later
-# def write_file(file_path: str, data: bytes):
-# """write byte data to a local file"""
-# os.makedirs(os.path.dirname(file_path), exist_ok=True)
-# with open(file_path, 'wb') as file_obj:
-# file_obj.write(data)
-
def get_json_bytes(data: Dict[str, Union[str, int]]) -> bytes:
"""dump dict to json file"""
return json.dumps(data, indent=4).encode('utf-8')
@@ -55,12 +48,3 @@ def scan_archives(base_dir: str, extension: str) -> str:
"""scan export directory for archives"""
file_pattern = f"{base_dir}_*{extension}"
return glob.glob(file_pattern)
-
-# def find_file_matches(file_path: str, regex_expr: re.Pattern) -> List[str]:
-# """find all matching lines for regex pattern"""
-# matches=[]
-# with open(file_path, encoding="utf-8") as open_file:
-# for line in open_file:
-# for match in re.finditer(regex_expr, line):
-# matches.append(match.group)
-# return matches
diff --git a/examples/config.yml b/examples/config.yml
index c02fb1e..2a12cb1 100644
--- a/examples/config.yml
+++ b/examples/config.yml
@@ -29,6 +29,9 @@ assets:
# optional export of all the images used in a page(s).
# omit this or set to false if not needed
export_images: false
+ # optional modify markdown files to replace image url links
+ # with local exported image paths
+ modify_markdown: false
## optional export of metadata about the page in a json file
# this metadata contains general information about the page
# like: last update, owner, revision count, etc.
diff --git a/examples/minio_config.yml b/examples/minio_config.yml
index 0e83e70..692cacd 100644
--- a/examples/minio_config.yml
+++ b/examples/minio_config.yml
@@ -58,7 +58,10 @@ assets:
# optional export of all the images used in a page(s).
# omit this or set to false if not needed
export_images: false
- ## optional export of metadata about the page in a json file
+ # optional modify markdown files to replace image url links
+ # with local exported image paths
+ modify_markdown: false
+ # optional export of metadata about the page in a json file
# this metadata contains general information about the page
# like: last update, owner, revision count, etc.
# omit this or set to false if not needed
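
The image-link rewrite that `ImageNode` and `_update_image_links` perform in this change can be sketched as standalone Python. This is a minimal illustration under assumptions, not the project's code: the helper names (`image_name_from_url`, `url_from_markdown`, `localize_image_links`), the sample url, and the choice to derive the file name from the markdown url (the patch derives it from the image metadata `url` field) are hypothetical simplifications. It also escapes the url before matching, so that characters such as `.` or `?` are treated literally.

```python
import re
from typing import List

IMAGE_DIR_NAME = "images"  # mirrors _IMAGE_DIR_NAME in the diff


def image_name_from_url(url: str) -> str:
    """Return the last path segment of an image url."""
    return url.split("/")[-1]


def url_from_markdown(md_snippet: str) -> str:
    """Extract the url between the first '(' and the first ')' of '![alt](url)'."""
    if not md_snippet:
        return ""
    return md_snippet[md_snippet.find("(") + 1:md_snippet.find(")")]


def localize_image_links(page_data: bytes, md_snippets: List[str]) -> bytes:
    """Replace each remote image url found in page_data with './images/<name>'."""
    for snippet in md_snippets:
        remote_url = url_from_markdown(snippet)
        if not remote_url:
            continue  # nothing to replace for this image
        local_path = f"./{IMAGE_DIR_NAME}/{image_name_from_url(remote_url)}".encode()
        # re.escape keeps url characters literal; the lambda avoids
        # backslash processing in the replacement bytes
        page_data = re.sub(re.escape(remote_url.encode()),
                           lambda _match: local_path, page_data)
    return page_data


if __name__ == "__main__":
    # hypothetical page content and image markdown snippet
    snippet = "![diagram](https://demo.bookstack/uploads/img/a.b.png)"
    print(localize_image_links(snippet.encode(), [snippet]).decode())
```

Because only the url inside the parentheses is swapped, any surrounding anchor link that points at a different url is left untouched.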