I tried the browser-extension version of SingleFile.
It's handy while browsing, but I also want to use it headless, so I gave the CLI version a try.
There are several options, but this time I went with the Docker Hub image, which looked like the easiest one.
$ docker pull capsulecode/singlefile
$ docker tag capsulecode/singlefile singlefile
$ docker image ls singlefile
REPOSITORY   TAG      IMAGE ID       CREATED        SIZE
singlefile   latest   36fda8dcb810   4 months ago   755MB
The help shows a lot of options.
$ time docker run singlefile --help
single-file [url] [output]
Save a page into a single HTML file.
Positionals:
url URL or path on the filesystem of the page to save [string]
output Output filename [string]
Options:
--help Show help [boolean]
--version Show version number [boolean]
--back-end Back-end to use [choices: "jsdom", "puppeteer", "webdriver-chromium", "webdriver-gecko", "puppeteer-firefox", "playwright-firefox", "playwright-chromium"] [default: "puppeteer"]
--browser-server Server to connect to (puppeteer only for now) [string] [default: ""]
--browser-headless Run the browser in headless mode (puppeteer, webdriver-gecko, webdriver-chromium) [boolean] [default: true]
--browser-executable-path Path to chrome/chromium executable (puppeteer, webdriver-gecko, webdriver-chromium) [string] [default: ""]
--browser-width Width of the browser viewport in pixels [number] [default: 1280]
--browser-height Height of the browser viewport in pixels [number] [default: 720]
--browser-load-max-time Maximum delay of time to wait for page loading in ms (puppeteer, webdriver-gecko, webdriver-chromium) [number] [default: 60000]
--browser-wait-delay Time to wait before capturing the page in ms [number] [default: 0]
--browser-wait-until When to consider the page is loaded (puppeteer, webdriver-gecko, webdriver-chromium) [choices: "networkidle0", "networkidle2", "load", "domcontentloaded"] [default: "networkidle0"]
--browser-wait-until-fallback Retry with the next value of --browser-wait-until when a timeout error is thrown [boolean] [default: true]
--browser-debug Enable debug mode (puppeteer, webdriver-gecko, webdriver-chromium) [boolean] [default: false]
--browser-script Path of a script executed in the page (and all the frames) before it is loaded [array] [default: []]
--browser-stylesheet Path of a stylesheet file inserted into the page (and all the frames) after it is loaded [array] [default: []]
--browser-args Arguments provided as a JSON array and passed to the browser (puppeteer, webdriver-gecko, webdriver-chromium) [string] [default: ""]
--browser-start-minimized Minimize the browser (puppeteer) [boolean] [default: false]
--browser-cookie Ordered list of cookie parameters separated by a comma: name,value,domain,path,expires,httpOnly,secure,sameSite,url (puppeteer, webdriver-gecko, webdriver-chromium, jsdom) [array] [default: []]
--browser-cookies-file Path of the cookies file formatted as a JSON file or a Netscape text file (puppeteer, webdriver-gecko, webdriver-chromium, jsdom) [string] [default: ""]
--compress-CSS Compress CSS stylesheets [boolean] [default: false]
--compress-HTML Compress HTML content [boolean] [default: true]
--crawl-links Crawl and save pages found via inner links [boolean] [default: false]
--crawl-inner-links-only Crawl pages found via inner links only if they are hosted on the same domain [boolean] [default: true]
--crawl-no-parent Crawl pages found via inner links only if their URLs are not parent of the URL to crawl [boolean]
--crawl-load-session Name of the file of the session to load (previously saved with --crawl-save-session or --crawl-sync-session) [string]
--crawl-remove-url-fragment Remove URL fragments found in links [boolean] [default: true]
--crawl-save-session Name of the file where to save the state of the session [string]
--crawl-sync-session Name of the file where to load and save the state of the session [string]
--crawl-max-depth Max depth when crawling pages found in internal and external links (0: infinite) [number] [default: 1]
--crawl-external-links-max-depth Max depth when crawling pages found in external links (0: infinite) [number] [default: 1]
--crawl-replace-urls Replace URLs of saved pages with relative paths of saved pages on the filesystem [boolean] [default: false]
--crawl-rewrite-rule Rewrite rule used to rewrite URLs of crawled pages [array] [default: []]
--dump-content Dump the content of the processed page in the console ('true' when running in Docker) [boolean] [default: false]
--emulate-media-feature Emulate a media feature. The syntax is <name>:<value>, e.g. "prefers-color-scheme:dark" (puppeteer) [array]
--error-file [string]
--filename-template Template used to generate the output filename (see help page of the extension for more info) [string] [default: "{page-title} ({date-iso} {time-locale}).html"]
--filename-conflict-action Action when the filename is conflicting with existing one on the filesystem. The possible values are "uniquify" (default), "overwrite" and "skip" [string] [default: "uniquify"]
--filename-replacement-character The character used for replacing invalid characters in filenames [string] [default: "_"]
--group-duplicate-images Group duplicate images into CSS custom properties [boolean] [default: true]
--http-header Extra HTTP header (puppeteer, jsdom) [array] [default: []]
--include-BOM Include the UTF-8 BOM into the HTML page [boolean] [default: false]
--include-infobar Include the infobar [boolean] [default: false]
--load-deferred-images Load deferred (a.k.a. lazy-loaded) images (puppeteer, webdriver-gecko, webdriver-chromium) [boolean] [default: true]
--load-deferred-images-max-idle-time Maximum delay of time to wait for deferred images in ms (puppeteer, webdriver-gecko, webdriver-chromium) [number] [default: 1500]
--load-deferred-images-keep-zoom-level Load defrrred images by keeping zoomed out the page [boolean] [default: false]
--max-parallel-workers Maximum number of browsers launched in parallel when processing a list of URLs (cf --urls-file) [number]
--max-resource-size-enabled Enable removal of embedded resources exceeding a given size [boolean] [default: false]
--max-resource-size Maximum size of embedded resources in MB (i.e. images, stylesheets, scripts and iframes) [number] [default: 10]
--remove-frames Remove frames (puppeteer, webdriver-gecko, webdriver-chromium) [boolean] [default: false]
--remove-hidden-elements Remove HTML elements which are not displayed [boolean] [default: true]
--remove-unused-styles Remove unused CSS rules and unneeded declarations [boolean] [default: true]
--remove-unused-fonts Remove unused CSS font rules [boolean] [default: true]
--remove-imports Remove HTML imports [boolean] [default: true]
--remove-scripts Remove JavaScript scripts [boolean] [default: true]
--remove-audio-src Remove source of audio elements [boolean] [default: true]
--remove-video-src Remove source of video elements [boolean] [default: true]
--remove-alternative-fonts Remove alternative fonts to the ones displayed [boolean] [default: true]
--remove-alternative-medias Remove alternative CSS stylesheets [boolean] [default: true]
--remove-alternative-images Remove images for alternative sizes of screen [boolean] [default: true]
--save-raw-page Save the original page without interpreting it into the browser (puppeteer, webdriver-gecko, webdriver-chromium) [boolean] [default: false]
--urls-file Path to a text file containing a list of URLs (separated by a newline) to save [string]
--user-agent User-agent of the browser (puppeteer, webdriver-gecko, webdriver-chromium) [string]
--user-script-enabled Enable the event API allowing to execute scripts before the page is saved [boolean] [default: true]
--web-driver-executable-path Path to Selenium WebDriver executable (webdriver-gecko, webdriver-chromium) [string] [default: ""]
--output-directory Path to where to save files, this path must exist. [string] [default: ""]
real 0m7.511s
user 0m0.036s
sys 0m0.036s

For a start, let's keep it simple.
$ time docker run singlefile https://github.com/gildas-lormeau/SingleFile/ > /tmp/singlefile-test.html
real 0m58.279s
user 0m0.018s
sys 0m0.049s
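Since --dump-content is treated as 'true' when running in Docker (as noted in the help), redirecting stdout is the straightforward way to capture the page. If you would rather let SingleFile write the file itself, something along these lines should work with the flags listed above, though I haven't verified it here; the bind-mount path and filename are my own choices, the mounted directory has to be writable by the container user, and whether --dump-content=false overrides the Docker default is untested:

$ mkdir -p /tmp/singlefile-out
$ docker run -v /tmp/singlefile-out:/out singlefile --dump-content=false --output-directory=/out --filename-template=singlefile-test.html https://github.com/gildas-lormeau/SingleFile/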
$ w3m -dump_source https://github.com/gildas-lormeau/SingleFile/ | zcat | grep \<img | dd bs=120 count=1 status=none;echo
<img class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" alt="" aria-label="Team" src="" width="28
$ grep \<img /tmp/singlefile-test.html | dd bs=120 count=1 status=none;echo
<img data-test-selector="commits-avatar-stack-avatar-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAA

It takes quite a while. Maybe my slow line (ADSL) is to blame?
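As an aside, you can get a rough count of how many images ended up inlined as data URIs with plain grep; nothing SingleFile-specific here:

$ grep -o 'data:image/[a-z+]*;base64' /tmp/singlefile-test.html | sort | uniq -c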
I saved the same page locally first and tried again.
Start an httpd with the saved path as its document root:
$ python3 -m http.server --bind 172.17.0.1 --directory ~/Downloads/
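172.17.0.1 is the host-side address of Docker's default bridge (docker0), which is why the container can reach an httpd bound there. A quick sanity check, assuming iproute2 and curl are installed:

$ ip -4 addr show docker0              # the bridge address should be 172.17.0.1/16 on a default setup
$ curl -sI http://172.17.0.1:8000/     # expect an HTTP 200 response header if the httpd is up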
Save it with SingleFile:
$ time docker run singlefile http://172.17.0.1:8000/gildas-lormeau_SingleFile_%20Web%20Extension%20for%20Firefox_Chrome_MS%20Edge%20and%20CLI%20tool%20to%20save%20a%20faithful%20copy%20of%20an%20entire%20web%20page%20in%20a%20single%20HTML%20file.html > /tmp/sample.html
real 0m47.339s
user 0m0.059s
sys 0m0.110s
It didn't improve as much as I expected.
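Most of that time is probably spent waiting rather than transferring: the default --browser-wait-until is networkidle0 (wait for the network to go quiet), and --load-deferred-images adds up to 1.5 s of idle time on top of that (see the help above). As an untested sketch using only flags from that help output, a looser wait strategy might be worth comparing, though whether lazy-loaded images still get captured depends on the page:

$ time docker run singlefile --browser-wait-until=load https://github.com/gildas-lormeau/SingleFile/ > /tmp/singlefile-test-load.html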
Environment
$ docker image ls singlefile
REPOSITORY   TAG      IMAGE ID       CREATED        SIZE
singlefile   latest   36fda8dcb810   4 months ago   755MB
$ dpkg-query -W docker.io python3
docker.io 20.10.11+dfsg1-2+b1
python3 3.9.8-1
$ lsb_release -dr
Description: Debian GNU/Linux bookworm/sid
Release: unstable
$ arch
x86_64
$ grep ^model\ name\.*: -m1 /proc/cpuinfo
model name : Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz