An organization structure for archiving from video sources.

kistu 82a959e1c4 readme: edit format selection and flag description		2025-12-02 17:57:07 +02:00

yt-dlp is an audio/video downloader that supports many websites. While the focus of this document is on YouTube, it aims to present an organization structure that can be generalized to other sources.

yt-dlp is a command line application. Some familiarity with the command line is expected.

Resources on the internet are uniquely identified by their URL, which is visible in the address bar of the web browser. To download a single video, pass its URL to yt-dlp:

yt-dlp "https://www.youtube.com/watch?v=aqz-KE-bpKQ"

Quality

A note about quality: as a general principle, higher quality means larger files. While it may be tempting to archive every video in the best available quality, doing so quickly exhausts the available storage space. Determining the optimal quality involves a lot more nuance; this is a (very incomplete and oversimplified) summary of some common metrics:

framerate: number of frames per second (24/30/60)
resolution: number of pixels in each frame (width x height)
bitrate: number of bits per second (can be variable)
bitdepth: number of bits used to represent each pixel/audio sample
file size: amount of storage space required
container: how data is packaged and stored in the file
codec: how data is compressed/decompressed (lossy/lossless)

To list the available qualities for a particular video:

yt-dlp --print formats_table "https://www.youtube.com/watch?v=aqz-KE-bpKQ"

This is the output of the above command (many lines have been omitted to focus on general features):

ID      EXT   RESOLUTION FPS CH |   FILESIZE    TBR PROTO | VCODEC           VBR ACODEC      ABR ASR MORE INFO
---------------------------------------------------------------------------------------------------------------------------
sb0     mhtml 320x180      0    |                   mhtml | images                                   storyboard
249-drc webm  audio only      2 |    3.77MiB    50k https | audio only           opus        50k 48k low, DRC, webm_dash
251     webm  audio only      2 |    9.73MiB   129k https | audio only           opus       129k 48k medium, webm_dash
258     m4a   audio only      6 |   29.34MiB   388k https | audio only           mp4a.40.2  388k 48k high, m4a_dash
91      mp4   256x144     30    | ~ 12.84MiB   170k m3u8  | avc1.4D400C          mp4a.40.5
160     mp4   256x144     30    |    4.12MiB    55k https | avc1.4d400c      55k video only          144p, mp4_dash
315     webm  3840x2160   60    |    1.27GiB 17174k https | vp9           17174k video only          2160p60, webm_dash
401     mp4   3840x2160   60    |  679.44MiB  8982k https | av01.0.13M.08  8982k video only          2160p60, mp4_dash
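
The FILESIZE and TBR columns are related: file size is roughly bitrate multiplied by duration. The following is a minimal sketch of that arithmetic, not part of yt-dlp. It assumes TBR is given in kilobits per second (1k = 1000 bits) and a duration of about 634 seconds, which is an assumption inferred from the figures above rather than a value shown in the table. `estimate_mib` is a hypothetical helper name.

```shell
# Estimate a file size in MiB from a bitrate (kbit/s) and a duration (seconds).
# Assumption: 1k = 1000 bits per second, 1 MiB = 1048576 bytes.
estimate_mib() {
	awk -v kbps="$1" -v secs="$2" \
		'BEGIN { printf "%.1f\n", kbps * 1000 * secs / 8 / 1048576 }'
}

# Format 401 from the table: 8982k total bitrate, assumed duration of 634 s.
estimate_mib 8982 634
```

The result should land close to the FILESIZE value listed for the same format, which makes this a quick way to budget storage before downloading.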

Once you have decided on a format, specify it by its code from the ID column. To combine an audio-only format with a video-only format, join the two codes with a plus sign:

yt-dlp --format 258+401 "https://www.youtube.com/watch?v=aqz-KE-bpKQ"

Flags

This is a collection of recommended flags.

yt-dlp -f  "bestvideo[height=1080][ext=webm]+bestaudio[ext=webm][format_note*=original]/
		    bestvideo[height=1080][ext=mp4]+bestaudio[ext=m4a][format_note*=original]/
			bestvideo[height<=1080][ext=webm]+bestaudio[ext=webm][format_note*=original]/
			bestvideo[height<=1080][ext=mp4]+bestaudio[ext=m4a][format_note*=original]/
			bestvideo[height=1080][ext=webm]+bestaudio[ext=webm]/
			bestvideo[height=1080][ext=mp4]+bestaudio[ext=m4a]/
			bestvideo[height=720][ext=webm]+bestaudio[ext=webm]/
			bestvideo[height=720][ext=mp4]+bestaudio[ext=m4a]/
			bestvideo[height<=1200]+bestaudio[format_note*=original]/
			bestvideo[height<=1200]+bestaudio" \
			--check-formats \
			--match-filter "availability=public" \
			--download-archive "archive.txt" --embed-thumbnail --embed-metadata --embed-info-json \
			-o "%(channel)s/%(id)s.%(ext)s" \
            "http://youtube.com/<URL>"

While this may look complicated, it is doing a few simple things:

- Format selection
  - prefer a height of exactly 1080, then at most 1080, then exactly 720, and finally anything up to 1200
  - prefer the original audio track; prefer webm, then mp4, then any container
- Ensure the requested format is actually available ("--check-formats")
- Skip members-only (paywalled) videos ("--match-filter availability=public")
- Record the ID of each downloaded video in archive.txt ("--download-archive archive.txt")
- Embed the thumbnail and metadata ("--embed-thumbnail --embed-metadata --embed-info-json")
- Create a directory named after the channel and name each file after the video's YouTube ID ("-o %(channel)s/%(id)s.%(ext)s")

You may want to adapt the format selection and output filename to your requirements.
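
When adapting the format selection, it can help to assemble the selector from separate alternatives instead of editing one long string. The sketch below is only an illustration: `join_formats` and the `FORMAT` variable are hypothetical names, and the three alternatives are a shortened subset of the list above. It relies on "/" separating fallback alternatives, as in the command shown earlier.

```shell
# Join format alternatives with "/" so that they are tried from left to right.
join_formats() {
	local IFS=/
	# "$*" joins all arguments using the first character of IFS.
	printf '%s\n' "$*"
}

FORMAT=$(join_formats \
	"bestvideo[height=1080][ext=webm]+bestaudio[ext=webm][format_note*=original]" \
	"bestvideo[height=1080][ext=mp4]+bestaudio[ext=m4a][format_note*=original]" \
	"bestvideo[height<=1200]+bestaudio")

# Print the assembled selector for inspection.
printf '%s\n' "$FORMAT"
```

The assembled value can then be passed as `-f "$FORMAT"` in place of the literal string.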

Organization Structure

Individual Videos

No further organization is required for archiving individual videos.

The key element that makes this "organized" is the built-in archive functionality that records the video ID to a text file. yt-dlp will read this file when rerun, and will skip videos whose ID is already recorded.
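
To check by hand whether a video is already recorded, the archive file can be searched directly. This is a rough sketch that assumes each archive line has the form `youtube <id>` (extractor name, a space, then the video ID); inspect your own archive.txt to confirm this before relying on it. `in_archive` is a hypothetical helper name.

```shell
# Succeed if the given video ID is already recorded in the archive file.
# Assumption: each line of the archive looks like "youtube <id>".
in_archive() {
	local id="$1" archive="${2:-archive.txt}"
	# A missing archive file means nothing has been recorded yet.
	[ -f "$archive" ] || return 1
	# -F: fixed string, -x: match the whole line, -q: print nothing.
	grep -Fxq "youtube $id" "$archive"
}

if in_archive "aqz-KE-bpKQ"; then
	echo "already archived"
fi
```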

Playlists and Channels

Channel homepage: https://www.youtube.com/@<ChannelHandle>

Playlist: https://www.youtube.com/playlist?list=<playlistID>

These special pages conveniently return all associated videos.

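Combining yt-dlp with a simple directory traversal keeps every channel up to date with a single command. Each subdirectory is named after a channel handle, and the following function runs yt-dlp inside each one: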

function traverse-yt {
	local d
	# Visit every subdirectory of the current directory.
	for d in */; do
		[ -d "$d" ] || continue
		echo "$(tput bold)$d$(tput sgr0)"
		# Run in a subshell so the working directory is restored even if a step fails.
		(
			cd "$d" &&
			# The directory name (without the trailing slash) is used as the channel handle.
			yt-dlp -f "bestvideo[height=1080][ext=webm]+bestaudio[ext=webm][format_note*=original]/bestvideo[height=1080][ext=mp4]+bestaudio[ext=m4a][format_note*=original]/bestvideo[height=1080][ext=webm]+bestaudio[ext=webm]/bestvideo[height=1080][ext=mp4]+bestaudio[ext=m4a]/bestvideo[height=720][ext=webm]+bestaudio[ext=webm]/bestvideo[height=720][ext=mp4]+bestaudio[ext=m4a]" \
				--check-formats \
				--download-archive "archive.txt" --embed-thumbnail --embed-metadata --embed-info-json \
				-o "%(channel)s/%(id)s.%(ext)s" \
				"https://www.youtube.com/@${d%/}/videos"
		)
	done
}

Our directory is structured as follows:

.
|- channel1
|-- archive.txt
|- channel2
|-- archive.txt

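To add a new channel, create a folder named after its handle and rerun the traversal. As written, the function only builds channel URLs; archiving a playlist the same way requires adapting the URL to the playlist form.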
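
Before a long run, it can be useful to preview which URLs the traversal would request. The sketch below only prints them and does not call yt-dlp. `list_channel_urls` is a hypothetical helper name, and it assumes, like the traversal function, that every subdirectory is named after a channel handle.

```shell
# Print the channel URL that would be requested for each subdirectory.
list_channel_urls() {
	local d
	for d in */; do
		# Skip the unexpanded pattern when there are no subdirectories.
		[ -d "$d" ] || continue
		# Strip the trailing slash to obtain the handle.
		printf 'https://www.youtube.com/@%s/videos\n' "${d%/}"
	done
}

list_channel_urls
```

A handle with a typo shows up here immediately, before any download is attempted.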

Limitations and Maintenance

The organization structure is highly dependent on the video source used for acquisition. Because the source changes frequently, constant monitoring is required. Maintaining a comprehensive record of the archive helps with this, although a foolproof test set is yet to be established.
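
One simple check on such a record is to compare archive.txt against the files actually on disk. The following is a sketch under two assumptions: each archive line has the form `<extractor> <id>`, and media files are named `<id>.<ext>` somewhere below the channel directory, as produced by the output template above. `missing_from_disk` is a hypothetical helper name.

```shell
# List IDs recorded in archive.txt that have no matching file below the directory.
missing_from_disk() {
	local dir="${1:-.}" id
	# Assumption: the video ID is the second field of each archive line.
	awk '{ print $2 }' "$dir/archive.txt" | while IFS= read -r id; do
		# Look for any file named "<id>.<something>" below the directory.
		if ! find "$dir" -type f -name "$id.*" | grep -q .; then
			printf '%s\n' "$id"
		fi
	done
}
```

Running `missing_from_disk channel1` prints one ID per line for entries that were recorded but whose file has since been moved, renamed or deleted; an empty result means the record and the directory agree.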

When YouTube introduced automatic (artificial) dubbing, the deployed selection filter occasionally selected an audio track other than the original. You can imagine the surprise when, a few months later, such a video played back in a completely different language! This led to new filters being added, but it also broke an established pipeline and required significant work to identify and reprocess the affected files.

The presented structure relies on each YouTube video having a unique ID and on the webpages for channels and playlists returning all associated videos. These and other necessary assumptions may change as YouTube decides to restrict downloading.

For instance, a proper JavaScript runtime is now required to download videos from YouTube due to changes in their JavaScript challenge. The problem first surfaced as signature extraction failures for certain clients and formats. Addressing it required significant developer effort, and the functionality of yt-dlp was severely restricted for over a month.

As the barrier rises, so do the complexity and resources required to overcome it. We hope that this requirement never exceeds the value of the content we seek to access and preserve.