I had already tried the browser extension version of SingleFile.
It is handy while browsing, but I also want to use it headless, so I gave the CLI version a try.
There are several options for that; this time I went with the Docker Hub image, which looked like the easiest one to try.
$ docker pull capsulecode/singlefile
$ docker tag capsulecode/singlefile singlefile
$ docker image ls singlefile
REPOSITORY   TAG      IMAGE ID       CREATED        SIZE
singlefile   latest   36fda8dcb810   4 months ago   755MB
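A quick way to confirm that the image and its entrypoint work at all is to ask it for its version (the --version flag appears in the help output below); I only note this as an aside and did not include it in the timings:

$ docker run singlefile --version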
Looking at the help, there are a lot of options.
$ time docker run singlefile --help
single-file [url] [output]

Save a page into a single HTML file.

Positionals:
  url     URL or path on the filesystem of the page to save  [string]
  output  Output filename  [string]

Options:
  --help  Show help  [boolean]
  --version  Show version number  [boolean]
  --back-end  Back-end to use  [choices: "jsdom", "puppeteer", "webdriver-chromium", "webdriver-gecko", "puppeteer-firefox", "playwright-firefox", "playwright-chromium"] [default: "puppeteer"]
  --browser-server  Server to connect to (puppeteer only for now)  [string] [default: ""]
  --browser-headless  Run the browser in headless mode (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: true]
  --browser-executable-path  Path to chrome/chromium executable (puppeteer, webdriver-gecko, webdriver-chromium)  [string] [default: ""]
  --browser-width  Width of the browser viewport in pixels  [number] [default: 1280]
  --browser-height  Height of the browser viewport in pixels  [number] [default: 720]
  --browser-load-max-time  Maximum delay of time to wait for page loading in ms (puppeteer, webdriver-gecko, webdriver-chromium)  [number] [default: 60000]
  --browser-wait-delay  Time to wait before capturing the page in ms  [number] [default: 0]
  --browser-wait-until  When to consider the page is loaded (puppeteer, webdriver-gecko, webdriver-chromium)  [choices: "networkidle0", "networkidle2", "load", "domcontentloaded"] [default: "networkidle0"]
  --browser-wait-until-fallback  Retry with the next value of --browser-wait-until when a timeout error is thrown  [boolean] [default: true]
  --browser-debug  Enable debug mode (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: false]
  --browser-script  Path of a script executed in the page (and all the frames) before it is loaded  [array] [default: []]
  --browser-stylesheet  Path of a stylesheet file inserted into the page (and all the frames) after it is loaded  [array] [default: []]
  --browser-args  Arguments provided as a JSON array and passed to the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [string] [default: ""]
  --browser-start-minimized  Minimize the browser (puppeteer)  [boolean] [default: false]
  --browser-cookie  Ordered list of cookie parameters separated by a comma: name,value,domain,path,expires,httpOnly,secure,sameSite,url (puppeteer, webdriver-gecko, webdriver-chromium, jsdom)  [array] [default: []]
  --browser-cookies-file  Path of the cookies file formatted as a JSON file or a Netscape text file (puppeteer, webdriver-gecko, webdriver-chromium, jsdom)  [string] [default: ""]
  --compress-CSS  Compress CSS stylesheets  [boolean] [default: false]
  --compress-HTML  Compress HTML content  [boolean] [default: true]
  --crawl-links  Crawl and save pages found via inner links  [boolean] [default: false]
  --crawl-inner-links-only  Crawl pages found via inner links only if they are hosted on the same domain  [boolean] [default: true]
  --crawl-no-parent  Crawl pages found via inner links only if their URLs are not parent of the URL to crawl  [boolean]
  --crawl-load-session  Name of the file of the session to load (previously saved with --crawl-save-session or --crawl-sync-session)  [string]
  --crawl-remove-url-fragment  Remove URL fragments found in links  [boolean] [default: true]
  --crawl-save-session  Name of the file where to save the state of the session  [string]
  --crawl-sync-session  Name of the file where to load and save the state of the session  [string]
  --crawl-max-depth  Max depth when crawling pages found in internal and external links (0: infinite)  [number] [default: 1]
  --crawl-external-links-max-depth  Max depth when crawling pages found in external links (0: infinite)  [number] [default: 1]
  --crawl-replace-urls  Replace URLs of saved pages with relative paths of saved pages on the filesystem  [boolean] [default: false]
  --crawl-rewrite-rule  Rewrite rule used to rewrite URLs of crawled pages  [array] [default: []]
  --dump-content  Dump the content of the processed page in the console ('true' when running in Docker)  [boolean] [default: false]
  --emulate-media-feature  Emulate a media feature. The syntax is <name>:<value>, e.g. "prefers-color-scheme:dark" (puppeteer)  [array]
  --error-file  [string]
  --filename-template  Template used to generate the output filename (see help page of the extension for more info)  [string] [default: "{page-title} ({date-iso} {time-locale}).html"]
  --filename-conflict-action  Action when the filename is conflicting with existing one on the filesystem. The possible values are "uniquify" (default), "overwrite" and "skip"  [string] [default: "uniquify"]
  --filename-replacement-character  The character used for replacing invalid characters in filenames  [string] [default: "_"]
  --group-duplicate-images  Group duplicate images into CSS custom properties  [boolean] [default: true]
  --http-header  Extra HTTP header (puppeteer, jsdom)  [array] [default: []]
  --include-BOM  Include the UTF-8 BOM into the HTML page  [boolean] [default: false]
  --include-infobar  Include the infobar  [boolean] [default: false]
  --load-deferred-images  Load deferred (a.k.a. lazy-loaded) images (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: true]
  --load-deferred-images-max-idle-time  Maximum delay of time to wait for deferred images in ms (puppeteer, webdriver-gecko, webdriver-chromium)  [number] [default: 1500]
  --load-deferred-images-keep-zoom-level  Load defrrred images by keeping zoomed out the page  [boolean] [default: false]
  --max-parallel-workers  Maximum number of browsers launched in parallel when processing a list of URLs (cf --urls-file)  [number]
  --max-resource-size-enabled  Enable removal of embedded resources exceeding a given size  [boolean] [default: false]
  --max-resource-size  Maximum size of embedded resources in MB (i.e. images, stylesheets, scripts and iframes)  [number] [default: 10]
  --remove-frames  Remove frames (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: false]
  --remove-hidden-elements  Remove HTML elements which are not displayed  [boolean] [default: true]
  --remove-unused-styles  Remove unused CSS rules and unneeded declarations  [boolean] [default: true]
  --remove-unused-fonts  Remove unused CSS font rules  [boolean] [default: true]
  --remove-imports  Remove HTML imports  [boolean] [default: true]
  --remove-scripts  Remove JavaScript scripts  [boolean] [default: true]
  --remove-audio-src  Remove source of audio elements  [boolean] [default: true]
  --remove-video-src  Remove source of video elements  [boolean] [default: true]
  --remove-alternative-fonts  Remove alternative fonts to the ones displayed  [boolean] [default: true]
  --remove-alternative-medias  Remove alternative CSS stylesheets  [boolean] [default: true]
  --remove-alternative-images  Remove images for alternative sizes of screen  [boolean] [default: true]
  --save-raw-page  Save the original page without interpreting it into the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: false]
  --urls-file  Path to a text file containing a list of URLs (separated by a newline) to save  [string]
  --user-agent  User-agent of the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [string]
  --user-script-enabled  Enable the event API allowing to execute scripts before the page is saved  [boolean] [default: true]
  --web-driver-executable-path  Path to Selenium WebDriver executable (webdriver-gecko, webdriver-chromium)  [string] [default: ""]
  --output-directory  Path to where to save files, this path must exist.  [string] [default: ""]

real    0m7.511s
user    0m0.036s
sys     0m0.036s
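Judging from the flags above, batch saving to files should also be possible instead of redirecting stdout. The following is only an untested sketch: urls.txt and the out/ directory are names I made up, and --dump-content=false may be needed because dumping to the console is the default when running in Docker.

$ mkdir -p out    # hypothetical directory holding urls.txt (one URL per line) and the saved pages
$ docker run -v "$PWD/out":/out singlefile \
    --urls-file /out/urls.txt \
    --output-directory /out \
    --dump-content=false \
    --max-parallel-workers 2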
First, the simplest possible run.
$ time docker run singlefile https://github.com/gildas-lormeau/SingleFile/ > /tmp/singlefile-test.html

real    0m58.279s
user    0m0.018s
sys     0m0.049s
$ w3m -dump_source https://github.com/gildas-lormeau/SingleFile/ | zcat | grep \<img | dd bs=120 count=1 status=none;echo
<img class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" alt="" aria-label="Team" src="" width="28
$ grep \<img /tmp/singlefile-test.html | dd bs=120 count=1 status=none;echo
<img data-test-selector="commits-avatar-stack-avatar-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAA
As the grep shows, the original page serves the first <img> with an empty src, while in the saved file the images are embedded as data: URIs, so the save itself works. It takes quite a while, though. Maybe my thin line (ADSL) is to blame?
To check, I saved the same page locally once and tried again against that copy.
Start an httpd with the directory holding the saved page as its document root (172.17.0.1 is the host's address on Docker's default bridge network, so it is reachable from inside the container):
$ python3 -m http.server --bind 172.17.0.1 --directory ~/Downloads/
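I did not bother verifying it this time, but whether the server is actually reachable from the container side could be checked with a throwaway image such as busybox:

$ docker run --rm busybox wget -qO- http://172.17.0.1:8000/ | head -n 3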
Save it with SingleFile:
$ time docker run singlefile http://172.17.0.1:8000/gildas-lormeau_SingleFile_%20Web%20Extension%20for%20Firefox_Chrome_MS%20Edge%20and%20CLI%20tool%20to%20save%20a%20faithful%20copy%20of%20an%20entire%20web%20page%20in%20a%20single%20HTML%20file.html > /tmp/sample.html

real    0m47.339s
user    0m0.059s
sys     0m0.110s
It changed less than I expected, so the line speed does not seem to be the main bottleneck.
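Most of the time presumably goes into starting headless Chromium and waiting for the page to be considered loaded rather than into the download. The default --browser-wait-until is networkidle0, which as far as I know makes Puppeteer wait for the network to go quiet; an obvious follow-up experiment (which I have not run) would be to settle for the load event instead. The output path here is just an example:

$ time docker run singlefile --browser-wait-until load \
    https://github.com/gildas-lormeau/SingleFile/ > /tmp/singlefile-test2.html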
Environment
$ docker image ls singlefile
REPOSITORY   TAG      IMAGE ID       CREATED        SIZE
singlefile   latest   36fda8dcb810   4 months ago   755MB
$ dpkg-query -W docker.io python3
docker.io   20.10.11+dfsg1-2+b1
python3     3.9.8-1
$ lsb_release -dr
Description:    Debian GNU/Linux bookworm/sid
Release:        unstable
$ arch
x86_64
$ grep ^model\ name\.*: -m1 /proc/cpuinfo
model name  : Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz