yt-dlp is an audio/video downloader that supports many websites. While the focus of this document is on YouTube, it aims to present an organization structure that can be generalized to other sources.
yt-dlp is a command line application. Some familiarity with the command line is expected.
Resources on the internet are uniquely identified by their URL, which is visible in the address bar of the web browser. To download a single video, pass its URL to yt-dlp:
yt-dlp "https://www.youtube.com/watch?v=aqz-KE-bpKQ"
Quality
As a general principle, higher quality means larger files. While it may be tempting to archive every video in the best available quality, doing so quickly uses up the available storage space. Determining the optimal quality involves far more nuance than can be covered here; this is a (very incomplete and oversimplified) summary of some common metrics:
- framerate: number of frames per second (24/30/60)
- resolution: number of pixels in each frame (width x height)
- bitrate: number of bits per second (can be variable)
- bit depth: number of bits used to represent each pixel/audio sample
- file size: amount of storage space required
- container: how data is packaged and stored in the file
- codec: how data is compressed/decompressed (lossy/lossless)
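Bitrate and file size are directly related: a stream's size is roughly its average bitrate multiplied by its duration. The sketch below illustrates that arithmetic. `estimate_size_mib` is a hypothetical helper, not part of yt-dlp, and it assumes that "k" means 1000 bits per second and that the bitrate is a true average.

```bash
# estimate_size_mib KBPS SECONDS
# Prints the approximate stream size in MiB for an average bitrate
# (kilobits per second) sustained over a duration (seconds).
estimate_size_mib() {
  awk -v kbps="$1" -v secs="$2" \
    'BEGIN { printf "%.1f\n", kbps * 1000 * secs / 8 / 1048576 }'
}

# Example: a 8982k stream, assuming a duration of 634 seconds
estimate_size_mib 8982 634
```

Under those assumptions, the example works out to roughly 679 MiB. Halving the bitrate halves the storage cost, which is why codec and resolution choices matter for a large archive.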
To list the available qualities for a particular video:
yt-dlp --print formats_table "https://www.youtube.com/watch?v=aqz-KE-bpKQ"
This is the output of the above command (many lines have been omitted to focus on general features):
ID      EXT   RESOLUTION FPS CH |    FILESIZE    TBR PROTO | VCODEC           VBR ACODEC      ABR ASR MORE INFO
---------------------------------------------------------------------------------------------------------------------------
sb0     mhtml 320x180      0    |                    mhtml | images                                   storyboard
249-drc webm  audio only      2 |     3.77MiB    50k https | audio only           opus        50k 48k low, DRC, webm_dash
251     webm  audio only      2 |     9.73MiB   129k https | audio only           opus       129k 48k medium, webm_dash
258     m4a   audio only      6 |    29.34MiB   388k https | audio only           mp4a.40.2  388k 48k high, m4a_dash
91      mp4   256x144     30    |  ~ 12.84MiB   170k m3u8  | avc1.4D400C          mp4a.40.5
160     mp4   256x144     30    |     4.12MiB    55k https | avc1.4d400c      55k video only          144p, mp4_dash
315     webm  3840x2160   60    |     1.27GiB 17174k https | vp9           17174k video only          2160p60, webm_dash
401     mp4   3840x2160   60    |   679.44MiB  8982k https | av01.0.13M.08  8982k video only          2160p60, mp4_dash
Once you have decided on a format, pass the format codes from the ID column to --format, joining the audio ID and the video ID with a plus sign:
yt-dlp --format 258+401 "https://www.youtube.com/watch?v=aqz-KE-bpKQ"
Flags
This is a collection of recommended flags.
yt-dlp -f "bestvideo[height=1080][ext=webm]+bestaudio[ext=webm][format_note*=original]/
bestvideo[height=1080][ext=mp4]+bestaudio[ext=m4a][format_note*=original]/
bestvideo[height<=1080][ext=webm]+bestaudio[ext=webm][format_note*=original]/
bestvideo[height<=1080][ext=mp4]+bestaudio[ext=m4a][format_note*=original]/
bestvideo[height=1080][ext=webm]+bestaudio[ext=webm]/
bestvideo[height=1080][ext=mp4]+bestaudio[ext=m4a]/
bestvideo[height=720][ext=webm]+bestaudio[ext=webm]/
bestvideo[height=720][ext=mp4]+bestaudio[ext=m4a]/
bestvideo[height<=1200]+bestaudio[format_note*=original]/
bestvideo[height<=1200]+bestaudio" \
--check-formats \
--match-filter "availability=public" \
--download-archive "archive.txt" --embed-thumbnail --embed-metadata --embed-info-json \
-o "%(channel)s/%(id)s.%(ext)s" \
"http://youtube.com/<URL>"
While this may look complicated, it is doing a few simple things:
- Format selection
  - prefer a height of exactly 1080, then at most 1080, then exactly 720, and finally at most 1200
  - prefer the original audio track in webm first, then in mp4, and then any format
- Ensure the requested format is actually available ("--check-formats")
- Skip members-only (paywalled) videos ("--match-filter availability=public")
- Record each video ID in archive.txt ("--download-archive archive.txt")
- Embed the thumbnail and metadata ("--embed-thumbnail --embed-metadata --embed-info-json")
- Create a directory named after the channel and name each file after the video's YouTube ID ("-o %(channel)s/%(id)s.%(ext)s")
You may want to adapt the format selection and output filename to your requirements.
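One way to keep a long fallback chain maintainable is to generate it from a target height instead of editing the string by hand. The following is a minimal sketch: `build_selector` is a hypothetical helper, and it produces a shorter chain than the one above (it omits the original-audio filters), so treat it as a starting point rather than a drop-in replacement.

```bash
# build_selector HEIGHT
# Prints a yt-dlp format selector: exact HEIGHT in webm, then in mp4,
# then anything up to HEIGHT in any container.
build_selector() {
  local height="$1"
  local -a alternatives=(
    "bestvideo[height=${height}][ext=webm]+bestaudio[ext=webm]"
    "bestvideo[height=${height}][ext=mp4]+bestaudio[ext=m4a]"
    "bestvideo[height<=${height}]+bestaudio"
  )
  # Join the alternatives with "/", yt-dlp's fallback separator.
  local IFS='/'
  printf '%s\n' "${alternatives[*]}"
}
```

The result can be passed straight to the format flag, for example `yt-dlp -f "$(build_selector 720)" ...`.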
Organization Structure
Individual Videos
No further organization is required for archiving individual videos.
The key element that makes this "organized" is the built-in archive functionality that records the video ID to a text file.
yt-dlp will read this file when rerun, and will skip videos whose ID is already recorded.
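Because the archive is a plain text file, it can also be queried directly. The sketch below assumes each entry is a single line of the form `youtube <id>` (extractor name, a space, then the video ID); `is_archived` is a hypothetical helper, not a yt-dlp feature.

```bash
# is_archived ARCHIVE_FILE VIDEO_ID
# Succeeds if the archive contains an entry for the given YouTube video ID.
is_archived() {
  local archive="$1" id="$2"
  # -x: match the whole line, -F: treat the pattern as a fixed string
  [ -f "$archive" ] && grep -qxF -- "youtube ${id}" "$archive"
}
```

For example, `is_archived archive.txt aqz-KE-bpKQ && echo "already downloaded"` reports whether the example video has been recorded.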
Playlists and Channels
Channel homepage: https://www.youtube.com/@\<ChannelHandle\>
Playlist: https://www.youtube.com/playlist?list=\<playlistID\>
These special pages conveniently return all associated videos.
By combining yt-dlp with a simple directory traversal, every tracked channel can be updated with a single command.
function traverse-yt {
  local d
  # Visit every subdirectory; each one is named after a channel handle.
  for d in */; do
    echo "$(tput bold)${d}$(tput sgr0)"
    # Run in a subshell so the working directory is restored after each
    # channel, even if cd or yt-dlp fails.
    # Note: -o is relative to the channel directory, so files are saved
    # under <handle>/<channel name>/<id>.<ext>.
    (
      cd "$d" &&
      yt-dlp -f "bestvideo[height=1080][ext=webm]+bestaudio[ext=webm][format_note*=original]/bestvideo[height=1080][ext=mp4]+bestaudio[ext=m4a][format_note*=original]/bestvideo[height=1080][ext=webm]+bestaudio[ext=webm]/bestvideo[height=1080][ext=mp4]+bestaudio[ext=m4a]/bestvideo[height=720][ext=webm]+bestaudio[ext=webm]/bestvideo[height=720][ext=mp4]+bestaudio[ext=m4a]" \
        --check-formats \
        --download-archive "archive.txt" --embed-thumbnail --embed-metadata --embed-info-json \
        -o "%(channel)s/%(id)s.%(ext)s" \
        "https://www.youtube.com/@${PWD##*/}/videos"
    )
  done
}
Our directory is structured as follows:
.
|- channel1
|-- archive.txt
|- channel2
|-- archive.txt
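To add a new channel, create a folder named after its handle and rerun the traversal. The function as written builds channel URLs only; tracking a playlist would need a variant that builds the playlist URL from the folder name instead.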
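Before a long run, it can help to preview which URLs the traversal will visit. This is a minimal sketch, assuming every subdirectory is named after a channel handle; `channel_urls` is a hypothetical helper that only prints URLs and downloads nothing.

```bash
# channel_urls [ROOT]
# Prints the channel URL derived from each subdirectory of ROOT
# (default: the current directory).
channel_urls() {
  local root="${1:-.}" d name
  for d in "$root"/*/; do
    [ -d "$d" ] || continue   # no subdirectories: the glob stays literal
    name="${d%/}"             # drop the trailing slash
    name="${name##*/}"        # keep only the final path component
    printf 'https://www.youtube.com/@%s/videos\n' "$name"
  done
}
```

Running `channel_urls` in the archive root after creating a new folder confirms that the handle was spelled as intended.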
Limitations and Maintenance
The organization structure depends heavily on the video source used for acquisition. Because the source changes frequently, constant monitoring is required. Maintaining a comprehensive record of the archive helps, but only to a degree, and a foolproof test set has yet to be established.
When YouTube introduced automatic (artificial) dubbing, the deployed selection filter occasionally selected an audio track other than the original. You can imagine the surprise when, a few months later, such a video played back in a completely different language! New filters were added in response, but the change broke an established pipeline and required significant work to identify and reprocess the affected files.
The presented structure relies on each YouTube video having a unique ID and on the pages for channels and playlists returning all associated videos. These and other necessary assumptions may change as YouTube decides to restrict downloading.
For instance, changes to YouTube's JavaScript challenge mean that a proper JavaScript runtime is now required to download videos. The change first showed up as signature extraction failures for certain clients and formats. Addressing it required significant developer effort, and the functionality of yt-dlp was severely restricted for over a month.
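A small preflight check can catch a missing dependency before a long traversal starts. The sketch below is generic: `first_available` is a hypothetical helper that reports the first of the given commands found on PATH. Which runtimes yt-dlp actually accepts is not covered here, so the names in the usage line are assumptions to be checked against the yt-dlp documentation.

```bash
# first_available NAME...
# Prints the first NAME found on PATH; returns non-zero if none is found.
first_available() {
  local name
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  return 1
}
```

A possible use is `first_available deno node || echo "no JavaScript runtime found" >&2` at the top of the traversal script.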
As the barrier rises, so do the complexity and resources required to overcome it. We hope that this cost never exceeds the value of the content we seek to access and preserve.