A distributed video encoder that splits files into chunks to encode them on multiple machines in parallel.
Using Cargo, you can do
$ cargo install shepherd
or just clone the repository and compile the binary with
$ git clone https://github.com/martindisch/shepherd
$ cd shepherd
$ cargo build --release
There's also a direct download for the latest x86-64 ELF binary.
The prerequisites are one or more (you'll want more) computers, which we'll
refer to as hosts, that have ffmpeg installed and are configured so that you
can SSH into them directly. This means you'll have to ssh-copy-id your public
key to them. I only tested it on Linux, but if you manage to set up
ffmpeg and SSH, it might work on macOS or Windows directly or with little
The usage is pretty straightforward:
USAGE:
    shepherd [FLAGS] [OPTIONS] <IN> <OUT> --clients <hostnames> [FFMPEG OPTIONS]...

FLAGS:
    -h, --help       Prints help information
    -k, --keep       Don't clean up temporary files
    -V, --version    Prints version information

OPTIONS:
    -c, --clients <hostnames>    Comma-separated list of encoding hosts
    -l, --length <seconds>       The length of video chunks in seconds
    -t, --tmp <path>             The path to the local temporary directory

ARGS:
    <IN>                   The original video file
    <OUT>                  The output video file
    <FFMPEG OPTIONS>...    Options/flags for ffmpeg encoding of chunks. The
                           chunks are video only, so don't pass in anything
                           concerning audio. Input/output file names are added
                           by the application, so there is no need for that
                           either. This is the last positional argument and
                           needs to be preceeded by double hypens (--) as in:
                           shepherd -c c1,c2 in.mp4 out.mp4 -- -c:v libx264
                           -crf 26 -preset veryslow -profile:v high -level 4.2
                           -pix_fmt yuv420p
                           This is also the default that is used if no options
                           are provided.
So if we have three machines c1, c2 and c3, we could do
$ shepherd -c c1,c2,c3 -l 30 source_file.mp4 output_file.mp4
to have it split the video in roughly 30 second chunks and encode them in
parallel. By default it encodes in H.264 with a CRF value of 26 and the
veryslow preset. If you want to supply your own ffmpeg options for more
control over the codec, append them after a double hyphen (--) at the end of
the command:
$ shepherd -c c1,c2 input.mkv output.mp4 -- -c:v libx264 -crf 40
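The fallback to the default options can be pictured roughly like this. It is only a sketch: `encoder_args` is a hypothetical helper and not part of shepherd's API, and it assumes that options given after the double hyphen replace the defaults entirely instead of being merged with them. The default list itself is copied from the help text above.

```rust
/// Default video options, copied from the help text.
const DEFAULT_FFMPEG_ARGS: [&str; 12] = [
    "-c:v", "libx264", "-crf", "26", "-preset", "veryslow",
    "-profile:v", "high", "-level", "4.2", "-pix_fmt", "yuv420p",
];

/// Hypothetical helper: returns the options the user passed after `--`,
/// or the defaults if none were given (assumption: no merging happens).
pub fn encoder_args(user_args: &[&str]) -> Vec<String> {
    let chosen: &[&str] = if user_args.is_empty() {
        &DEFAULT_FFMPEG_ARGS[..]
    } else {
        user_args
    };
    chosen.iter().map(|s| s.to_string()).collect()
}
```

Under that assumption, the second example above would encode with `-c:v libx264 -crf 40` only, without the veryslow preset.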
- Creates a temporary directory in your home directory.
- Extracts the audio and encodes it. This is not parallelized, but the time this takes is negligible compared to the video anyway.
- Splits the video into chunks. This can take relatively long, since
you're basically writing the full file to disk again. It would be nice
if we could read chunks of the file and directly transfer them to the
hosts, but that might be tricky with
- Spawns a manager and an encoder thread for every host. The manager creates a temporary directory in the home directory of the remote and makes sure that the encoder always has something to encode. It will transfer a chunk, give it to the encoder to work on and meanwhile transfer another chunk, so the encoder can start directly with that once it's done, without wasting any time. But it will keep at most one chunk in reserve, to prevent the case where a slow machine takes too many chunks and is the only one still encoding while the faster ones are already done.
- When an encoder is done and there are no more chunks to work on, it will quit and the manager transfers the encoded chunks back before terminating itself.
- Once all encoded chunks have arrived, they're concatenated and the audio stream added.
- All remote temporary directories and the local one are removed.
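The hand-off between a manager and its encoder can be sketched with a rendezvous channel. This is a minimal simulation under stated assumptions, not shepherd's actual code: `distribute`, the shared queue and the per-host `encode_ms` delays are all hypothetical, taking a chunk off the queue stands in for transferring it, and sleeping stands in for encoding.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Simulates distributing `chunks` chunks over one manager/encoder pair per
/// host. `encode_ms[i]` is the (made-up) time host `i` needs per chunk.
/// Returns how many chunks each host encoded.
pub fn distribute(chunks: usize, encode_ms: &[u64]) -> Vec<usize> {
    // Shared work queue that all managers take chunks from.
    let queue: Arc<Mutex<VecDeque<usize>>> =
        Arc::new(Mutex::new((0..chunks).collect()));

    let mut managers = Vec::new();
    for &ms in encode_ms {
        let queue = Arc::clone(&queue);
        managers.push(thread::spawn(move || {
            // Capacity 0: `send` blocks until the encoder is ready to take
            // the chunk, so the manager holds at most one chunk in reserve.
            let (tx, rx) = sync_channel::<usize>(0);

            let encoder = thread::spawn(move || {
                let mut done = 0usize;
                for _chunk in rx {
                    thread::sleep(Duration::from_millis(ms)); // "encode"
                    done += 1;
                }
                done
            });

            loop {
                // The lock is released at the end of this statement, before
                // the (potentially long) blocking send below.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(chunk) => tx.send(chunk).unwrap(),
                    None => break,
                }
            }
            drop(tx); // no more chunks: lets the encoder loop end
            encoder.join().unwrap()
        }));
    }
    managers.into_iter().map(|m| m.join().unwrap()).collect()
}
```

Calling `distribute(12, &[5, 20])` returns two counts that add up to 12, and the host with the shorter delay will usually end up with more of the chunks.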
Thanks to the work-stealing method of distribution, having some hosts that are significantly slower than others does not delay the overall operation. In the worst case, the slowest machine is the last to start encoding a chunk and remains the only working encoder for as long as it takes to encode this one chunk. This window can easily be reduced by using smaller chunks.
As with all things parallel, Amdahl's law rears its ugly head and you don't just get twice the speed with twice the processing power. With this approach, you pay for having to split the video into chunks before you begin, transferring them to the encoders and the results back, and reassembling them. I should clarify, though, that transferring the chunks to the encoders only causes a noticeable delay until every encoder has its first chunk. The subsequent ones are sent while the encoders are working, so they don't waste time waiting for them. Returning and assembling the encoded chunks doesn't carry too big of a penalty either, since by then we're dealing with much more compressed data.
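For reference, Amdahl's law can be written as a one-line function. This is a generic sketch, not something shepherd computes: `amdahl_speedup` is a name I made up, `parallel_fraction` is the share of the total time that can be parallelized (here, the encoding), and `capacity` is the factor by which that part is sped up. Using the combined processing power of unequal machines as `capacity` is an assumption, since the law is usually stated for identical processors.

```rust
/// Amdahl's law: overall speedup when a fraction of the work
/// (`parallel_fraction`, between 0 and 1) is sped up by `capacity`
/// and the rest stays serial.
pub fn amdahl_speedup(parallel_fraction: f64, capacity: f64) -> f64 {
    1.0 / ((1.0 - parallel_fraction) + parallel_fraction / capacity)
}
```

With half of the work parallelizable and twice the capacity, `amdahl_speedup(0.5, 2.0)` gives about 1.33 and not 2, which is the effect described above: the splitting, transferring and reassembling form the serial part that extra hosts can't shrink.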
To get a better understanding of the tradeoffs, I did some testing with a
couple of computers I had access to. They were my main, pretty capable
desktop, two older ones and a laptop. To figure out how capable each of
them is so we can compare the actual to the expected speedup, I let each of
them encode a relatively short clip of slightly less than 4 minutes taken
from the real video I want to encode, using the same settings I'd use for
the real job. And if you're wondering why encoding takes so long, it's
because I'm using the veryslow preset for maximum compression efficiency,
even though it's definitely not worth the huge increase in encoding time. But
it's a nice simulation for how it would look if we were using an even more
demanding codec like AV1.
By giving my desktop the "power" level 1, we can determine how powerful the others are at this encoding task, based on how long it takes them in comparison. By adding the three other, less capable machines to the mix, we slightly more than double the theoretical encoding capability of our system.
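One way to compute such power levels is to divide the reference machine's encoding time by each host's time. The sketch below assumes that this is how the levels are derived, and the function names and the numbers in the usage note are made up for illustration; they are not my measurements.

```rust
/// Power of each host relative to a reference machine: a host that needs
/// twice as long as the reference gets power 0.5.
pub fn power_levels(reference_secs: f64, host_secs: &[f64]) -> Vec<f64> {
    host_secs.iter().map(|&t| reference_secs / t).collect()
}

/// Theoretical encoding capability of the whole system.
pub fn total_power(levels: &[f64]) -> f64 {
    levels.iter().sum()
}
```

For example, with hypothetical times of 100, 200 and 400 seconds and the first machine as the reference, `power_levels(100.0, &[100.0, 200.0, 400.0])` yields 1, 0.5 and 0.25, for a total of 1.75.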
I determined these power levels on a short clip, because encoding the full video would have taken very long on the less capable ones, especially the laptop. But I still needed to encode the full thing on at least one of them to make the comparison to the distributed encoding. I did that on my desktop since it's the fastest one, and to additionally verify that the power levels hold up for the full video, I bit the bullet and did the same on the second most powerful machine.
Now we have the baseline we want to beat with parallel encoding, as well as confirmation that the power levels are valid for the full video. Let's see how much of the theoretical, but unreachable 2.2x speedup we can get.
Encoding the video in parallel took 5283 seconds, which is 56.5% of the time it took on my fastest computer alone, or a 1.77x speedup. We committed about twice the computing power and we're not too far off a twofold speedup: measured against the theoretical 2.2x, the additional resources are used with an efficiency of about 80%. I also tried to encode the short clip in parallel, which was very fast, but had a somewhat disappointing speedup of only 1.32x. I suspect that we get better results with longer videos, since encoding a chunk always takes longer than creating and transferring it (otherwise distributing wouldn't make sense at all). The longer the video, the larger the share of encoding (which we can parallelize) in the total time the process takes, and the more effective parallelizing becomes.
I've also looked at how the work is distributed over the nodes, depending on their processing power. At the end of a parallel encode, it's possible to determine how many chunks have been encoded by any given host.
Inferring the processing power from the number of chunks leads to almost exactly the same results as my initial determination, which confirms it and indicates that work is distributed in proportion to what each host can handle.
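The inference from chunk counts is a plain ratio against the reference machine. A minimal sketch, assuming all chunks are roughly equally expensive to encode; `power_from_chunks` and the counts in the usage note are hypothetical.

```rust
/// Infers each host's power from how many chunks it encoded, relative to
/// the number of chunks the reference machine encoded in the same run.
pub fn power_from_chunks(reference_chunks: u32, host_chunks: &[u32]) -> Vec<f64> {
    host_chunks
        .iter()
        .map(|&c| f64::from(c) / f64::from(reference_chunks))
        .collect()
}
```

If the reference machine had encoded 40 chunks and two other hosts 20 and 10, `power_from_chunks(40, &[40, 20, 10])` would give 1, 0.5 and 0.25.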
To further see how the system scales, I've added two more machines, bringing the total processing power up to 3.29.
Encoding the video on these 6 machines in parallel took 3865 seconds, which is 41.3% of the time it took on my fastest computer alone, or a 2.42x speedup. It's making use of the additionally available resources with a 74% efficiency here. As expected, while we can accelerate by adding more resources, we're looking at diminishing returns, although the drop in efficiency (from 80% to 74%) is not as bad as it could be.
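The speedup and efficiency figures follow from the numbers above. The helpers below are just a sketch to make the arithmetic reproducible, with names of my choosing: the speedup is the inverse of the time fraction, and the efficiency is the speedup divided by the total processing power (2.2 for four machines, 3.29 for six).

```rust
/// Speedup from the fraction of the single-machine time that the parallel
/// run needed, e.g. 0.565 for 56.5%.
pub fn speedup_from_time_fraction(fraction: f64) -> f64 {
    1.0 / fraction
}

/// Share of the theoretical speedup (the total power) actually achieved.
pub fn efficiency(speedup: f64, total_power: f64) -> f64 {
    speedup / total_power
}
```

For four machines, `speedup_from_time_fraction(0.565)` is about 1.77 and `efficiency(1.77, 2.2)` about 0.80. For six machines, the same calculation with 0.413 and 3.29 gives about 2.42 and 0.74.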
While you can use your own ffmpeg options to control how the video is
encoded, there is currently no such option for the audio, which is 192 kb/s
AAC by default.
Starts the whole operation and cleans up afterwards.
The generic result type for this crate.