OneDrive Backup to S3


We use OneDrive extensively for file storage. Microsoft ensures the durability of files against infrastructure failures (e.g. hardware faults), and OneDrive provides a versioning system for restoring old files. However, the versioning system is incomplete: older versions are not always available, and it offers no good way to restore large numbers of files at once (e.g. an entire deleted folder).

As a result, we require a backup solution to ensure the safety of our files in all scenarios.

Backup Solution Requirements

  • Must download all files from all drives from all sites in our tenant
  • Must compress and save the files to low-cost cloud storage
  • Must accommodate 100s of drives and TBs of files
  • Must support incremental backups of the most recent files, not just full backups of everything
  • Should be simple to restore files
  • Should have low execution costs
  • Should run as “serverless” as possible on our cloud provider (AWS) to reduce administrative overhead and ensure reliable execution

Solution

We were surprised at the lack of existing solutions for our seemingly common requirements. As a result, we will create our own solution and open source it.

Microsoft itself only exposes low-level APIs. The rclone project provides a simpler interface, but it still has limitations:

  • rclone does not refresh access tokens obtained through an application's client credentials (a background application with no signed-in user), so any single run is effectively limited to the one-hour lifetime of the token
  • rclone can copy directly from one cloud storage location to another, but it cannot compress folders before uploading them to the destination
  • rclone only copies from a single OneDrive drive at a time; it has no utility to enumerate the SharePoint sites in a tenant or the drives within each site
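
To work around the token limitation, the wrapper can request its own app-only access token shortly before each transfer and hand it to rclone through rclone's RCLONE_CONFIG_* environment variables. A minimal sketch, assuming the msal library, a remote named OD, and the documentLibrary drive type (these names are illustrative, not the exact configuration in the published script):

    # Acquire an app-only Graph token with MSAL and expose it to rclone
    # through environment-variable configuration for a remote named "OD".
    import json
    import os
    import msal

    TENANT_ID = os.environ["TENANT_ID"]
    CLIENT_ID = os.environ["CLIENT_ID"]
    CLIENT_SECRET = os.environ["CLIENT_SECRET"]

    def fresh_rclone_env(drive_id):
        """Return environment variables defining a OneDrive remote with a fresh token."""
        app = msal.ConfidentialClientApplication(
            CLIENT_ID,
            authority=f"https://login.microsoftonline.com/{TENANT_ID}",
            client_credential=CLIENT_SECRET,
        )
        result = app.acquire_token_for_client(
            scopes=["https://graph.microsoft.com/.default"]
        )
        token = {"access_token": result["access_token"], "token_type": "Bearer"}
        env = os.environ.copy()
        env["RCLONE_CONFIG_OD_TYPE"] = "onedrive"
        env["RCLONE_CONFIG_OD_TOKEN"] = json.dumps(token)
        env["RCLONE_CONFIG_OD_DRIVE_ID"] = drive_id
        env["RCLONE_CONFIG_OD_DRIVE_TYPE"] = "documentLibrary"
        return env

Refreshing the token per task keeps each rclone invocation inside the one-hour lifetime of a single access token.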

As a result, we need to create a wrapper script. It will perform the following steps:

  1. Retrieve a list of all SharePoint sites
  2. Retrieve a list of all drives for each SharePoint site (steps 1 and 2 are sketched with the Graph API after this list)
  3. Recursively split drives and folders into download tasks based on drive and folder size
    • If a drive or folder is larger than a configurable size, it should download, compress, and upload its subfolders individually.
    • This addresses the rclone timeout and simplifies file restoration
  4. In parallel, for each resulting task (a single task is sketched below):
    • Download the drive or folder using rclone
    • Compress the data using gzip
    • Upload the archive file to AWS S3 Glacier
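
Steps 1 and 2 correspond to two Microsoft Graph endpoints. A rough sketch, assuming an app-only token with Sites.Read.All permission; graph_get and list_all_drives are hypothetical helpers, not part of the published script:

    # Enumerate every SharePoint site in the tenant and the drives in each site.
    import requests

    GRAPH = "https://graph.microsoft.com/v1.0"

    def graph_get(url, token):
        """Collect every page of a Graph collection, following @odata.nextLink."""
        items = []
        while url:
            resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
            resp.raise_for_status()
            data = resp.json()
            items.extend(data.get("value", []))
            url = data.get("@odata.nextLink")
        return items

    def list_all_drives(token):
        """Return (site name, drive) pairs for every drive in the tenant."""
        drives = []
        for site in graph_get(f"{GRAPH}/sites?search=*", token):                  # step 1
            for drive in graph_get(f"{GRAPH}/sites/{site['id']}/drives", token):  # step 2
                drives.append((site["displayName"], drive))
        return drives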

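For step 4, a single task reduces to an rclone copy, a gzip-compressed tar, and a boto3 upload with a Glacier storage class. A sketch of one task, assuming the OD remote from the earlier sketch; the bucket name, key layout, and backup_task helper are placeholders:

    # Download one drive or folder, compress it, and push it to S3 Glacier.
    import subprocess
    import tarfile
    import boto3

    BUCKET = "example-onedrive-backup"   # assumption: destination bucket
    s3 = boto3.client("s3")

    def backup_task(env, remote_path, archive_name, max_age=None):
        local_dir = f"/tmp/{archive_name}"
        archive = f"/tmp/{archive_name}.tar.gz"

        # 1. Download with rclone; --max-age restricts an incremental run to recent files.
        cmd = ["rclone", "copy", f"OD:{remote_path}", local_dir]
        if max_age:
            cmd += ["--max-age", max_age]
        subprocess.run(cmd, env=env, check=True)

        # 2. Compress the downloaded tree with gzip (via tarfile).
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(local_dir, arcname=archive_name)

        # 3. Upload the archive directly into a Glacier storage class.
        s3.upload_file(archive, BUCKET, f"{archive_name}.tar.gz",
                       ExtraArgs={"StorageClass": "GLACIER"})

The tasks themselves can then be fanned out with something like concurrent.futures, each worker calling fresh_rclone_env so its token stays valid for the whole download.
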
This script will run in Docker on Fargate. Fargate comes with 20 GB+ of local, ephemeral storage and has no execution time limit. For comparison, AWS Lambda provides only 512 MB of local storage by default, which would require an EFS volume mount for larger files, and it enforces a hard 15 minute execution limit. In testing, even a 3 GB file can take over 15 minutes to download on Lambda.

The Fargate ECS task will be triggered by CloudWatch scheduled (cron) events: a full backup of everything runs weekly, and an incremental backup runs daily, covering just the files added within the last seven days.
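
A sketch of how the two schedules and the full/incremental switch could fit together; the cron expressions, the MODE variable, and the seven-day window are illustrative assumptions rather than the exact wiring in the published script:

    # Two CloudWatch Events (EventBridge) rules could trigger the ECS task, e.g.
    #   weekly full backup:   cron(0 1 ? * SUN *)
    #   daily incremental:    cron(0 1 * * ? *)
    # and the rule's target input could set MODE in the container's environment.
    import os

    MODE = os.environ.get("MODE", "full")                # "full" or "incremental"
    MAX_AGE = "7d" if MODE == "incremental" else None    # forwarded to rclone --max-age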

The resulting Python script and Dockerfile are Apache 2.0 licensed and available on our GitHub: https://github.com/TrellisHFL-Public/onedrive_backup

Note: instead of the rclone copy command, we could have used the rclone sync command against a persistent local copy of the files, downloading only the newest files instead of re-downloading the same files on every run. While more efficient in theory, that approach means paying for persistent storage, whereas OneDrive bandwidth and inbound AWS bandwidth cost nothing.
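
For comparison, the sync variant would look roughly like this, assuming a persistent volume mounted at /data and the OD remote configured as in the earlier sketch (both illustrative):

    import subprocess

    # rclone sync keeps /data/drive_name as a mirror of the remote, so only new
    # or changed files are transferred on each run; the mirror itself then
    # requires persistent storage (e.g. an EFS volume attached to the task).
    subprocess.run(["rclone", "sync", "OD:", "/data/drive_name"], check=True)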
