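Trying the CLI version of SingleFile

I tried the browser-extension version of SingleFile.

It is handy while browsing the web, but I also wanted to use it headlessly, so I tried the CLI version.

There are several ways to run it; this time I went with the Docker Hub image, which looked like the easiest.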

$ docker pull capsulecode/singlefile
$ docker tag capsulecode/singlefile singlefile
$ docker image ls singlefile
REPOSITORY   TAG       IMAGE ID       CREATED        SIZE
singlefile   latest    36fda8dcb810   4 months ago   755MB

The help output lists a lot of options.

$ time docker run singlefile --help
single-file [url] [output]

Save a page into a single HTML file.

Positionals:
  url     URL or path on the filesystem of the page to save  [string]
  output  Output filename  [string]

Options:
  --help                                  Show help  [boolean]
  --version                               Show version number  [boolean]
  --back-end                              Back-end to use  [choices: "jsdom", "puppeteer", "webdriver-chromium", "webdriver-gecko", "puppeteer-firefox", "playwright-firefox", "playwright-chromium"] [default: "puppeteer"]
  --browser-server                        Server to connect to (puppeteer only for now)  [string] [default: ""]
  --browser-headless                      Run the browser in headless mode (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: true]
  --browser-executable-path               Path to chrome/chromium executable (puppeteer, webdriver-gecko, webdriver-chromium)  [string] [default: ""]
  --browser-width                         Width of the browser viewport in pixels  [number] [default: 1280]
  --browser-height                        Height of the browser viewport in pixels  [number] [default: 720]
  --browser-load-max-time                 Maximum delay of time to wait for page loading in ms (puppeteer, webdriver-gecko, webdriver-chromium)  [number] [default: 60000]
  --browser-wait-delay                    Time to wait before capturing the page in ms  [number] [default: 0]
  --browser-wait-until                    When to consider the page is loaded (puppeteer, webdriver-gecko, webdriver-chromium)  [choices: "networkidle0", "networkidle2", "load", "domcontentloaded"] [default: "networkidle0"]
  --browser-wait-until-fallback           Retry with the next value of --browser-wait-until when a timeout error is thrown  [boolean] [default: true]
  --browser-debug                         Enable debug mode (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: false]
  --browser-script                        Path of a script executed in the page (and all the frames) before it is loaded  [array] [default: []]
  --browser-stylesheet                    Path of a stylesheet file inserted into the page (and all the frames) after it is loaded  [array] [default: []]
  --browser-args                          Arguments provided as a JSON array and passed to the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [string] [default: ""]
  --browser-start-minimized               Minimize the browser (puppeteer)  [boolean] [default: false]
  --browser-cookie                        Ordered list of cookie parameters separated by a comma: name,value,domain,path,expires,httpOnly,secure,sameSite,url (puppeteer, webdriver-gecko, webdriver-chromium, jsdom)  [array] [default: []]
  --browser-cookies-file                  Path of the cookies file formatted as a JSON file or a Netscape text file (puppeteer, webdriver-gecko, webdriver-chromium, jsdom)  [string] [default: ""]
  --compress-CSS                          Compress CSS stylesheets  [boolean] [default: false]
  --compress-HTML                         Compress HTML content  [boolean] [default: true]
  --crawl-links                           Crawl and save pages found via inner links  [boolean] [default: false]
  --crawl-inner-links-only                Crawl pages found via inner links only if they are hosted on the same domain  [boolean] [default: true]
  --crawl-no-parent                       Crawl pages found via inner links only if their URLs are not parent of the URL to crawl  [boolean]
  --crawl-load-session                    Name of the file of the session to load (previously saved with --crawl-save-session or --crawl-sync-session)  [string]
  --crawl-remove-url-fragment             Remove URL fragments found in links  [boolean] [default: true]
  --crawl-save-session                    Name of the file where to save the state of the session  [string]
  --crawl-sync-session                    Name of the file where to load and save the state of the session  [string]
  --crawl-max-depth                       Max depth when crawling pages found in internal and external links (0: infinite)  [number] [default: 1]
  --crawl-external-links-max-depth        Max depth when crawling pages found in external links (0: infinite)  [number] [default: 1]
  --crawl-replace-urls                    Replace URLs of saved pages with relative paths of saved pages on the filesystem  [boolean] [default: false]
  --crawl-rewrite-rule                    Rewrite rule used to rewrite URLs of crawled pages  [array] [default: []]
  --dump-content                          Dump the content of the processed page in the console ('true' when running in Docker)  [boolean] [default: false]
  --emulate-media-feature                 Emulate a media feature. The syntax is <name>:<value>, e.g. "prefers-color-scheme:dark" (puppeteer)  [array]
  --error-file  [string]
  --filename-template                     Template used to generate the output filename (see help page of the extension for more info)  [string] [default: "{page-title} ({date-iso} {time-locale}).html"]
  --filename-conflict-action              Action when the filename is conflicting with existing one on the filesystem. The possible values are "uniquify" (default), "overwrite" and "skip"  [string] [default: "uniquify"]
  --filename-replacement-character        The character used for replacing invalid characters in filenames  [string] [default: "_"]
  --group-duplicate-images                Group duplicate images into CSS custom properties  [boolean] [default: true]
  --http-header                           Extra HTTP header (puppeteer, jsdom)  [array] [default: []]
  --include-BOM                           Include the UTF-8 BOM into the HTML page  [boolean] [default: false]
  --include-infobar                       Include the infobar  [boolean] [default: false]
  --load-deferred-images                  Load deferred (a.k.a. lazy-loaded) images (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: true]
  --load-deferred-images-max-idle-time    Maximum delay of time to wait for deferred images in ms (puppeteer, webdriver-gecko, webdriver-chromium)  [number] [default: 1500]
  --load-deferred-images-keep-zoom-level  Load defrrred images by keeping zoomed out the page  [boolean] [default: false]
  --max-parallel-workers                  Maximum number of browsers launched in parallel when processing a list of URLs (cf --urls-file)  [number]
  --max-resource-size-enabled             Enable removal of embedded resources exceeding a given size  [boolean] [default: false]
  --max-resource-size                     Maximum size of embedded resources in MB (i.e. images, stylesheets, scripts and iframes)  [number] [default: 10]
  --remove-frames                         Remove frames (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: false]
  --remove-hidden-elements                Remove HTML elements which are not displayed  [boolean] [default: true]
  --remove-unused-styles                  Remove unused CSS rules and unneeded declarations  [boolean] [default: true]
  --remove-unused-fonts                   Remove unused CSS font rules  [boolean] [default: true]
  --remove-imports                        Remove HTML imports  [boolean] [default: true]
  --remove-scripts                        Remove JavaScript scripts  [boolean] [default: true]
  --remove-audio-src                      Remove source of audio elements  [boolean] [default: true]
  --remove-video-src                      Remove source of video elements  [boolean] [default: true]
  --remove-alternative-fonts              Remove alternative fonts to the ones displayed  [boolean] [default: true]
  --remove-alternative-medias             Remove alternative CSS stylesheets  [boolean] [default: true]
  --remove-alternative-images             Remove images for alternative sizes of screen  [boolean] [default: true]
  --save-raw-page                         Save the original page without interpreting it into the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: false]
  --urls-file                             Path to a text file containing a list of URLs (separated by a newline) to save  [string]
  --user-agent                            User-agent of the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [string]
  --user-script-enabled                   Enable the event API allowing to execute scripts before the page is saved  [boolean] [default: true]
  --web-driver-executable-path            Path to Selenium WebDriver executable (webdriver-gecko, webdriver-chromium)  [string] [default: ""]
  --output-directory                      Path to where to save files, this path must exist.  [string] [default: ""]

real    0m7.511s
user    0m0.036s
sys     0m0.036s

To start with, a simple run.

$ time docker run singlefile https://github.com/gildas-lormeau/SingleFile/ > /tmp/singlefile-test.html

real    0m58.279s
user    0m0.018s
sys     0m0.049s
$ w3m -dump_source https://github.com/gildas-lormeau/SingleFile/ | zcat | grep \<img | dd bs=120 count=1 status=none;echo
    <img class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" alt="" aria-label="Team" src="" width="28
$ grep \<img /tmp/singlefile-test.html | dd bs=120 count=1 status=none;echo
        <img data-test-selector="commits-avatar-stack-avatar-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAA
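
The second grep shows the difference: in the SingleFile output the image is embedded as a data: URI instead of being referenced by URL. For a rough count of such embedded resources, something like the following sketch could be used. The function name count_data_uris is made up for illustration, and it counts every src="data:..." attribute, not only those on <img> elements.

```shell
# Hypothetical helper: count src="data:..." attributes in a saved HTML file.
count_data_uris() {
  # grep -o prints each match on its own line, so wc -l gives the number of matches.
  # tr strips the padding that some wc implementations add.
  grep -o 'src="data:[^"]*"' "$1" | wc -l | tr -d ' '
}
```

Usage: count_data_uris /tmp/singlefile-test.html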

That takes quite a while. Maybe my slow connection (ADSL) is to blame?
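
For scale, the --help run above already took about 7.5 s, so part of the 58 s is probably container and Node.js startup rather than page transfer. Treating the --help time as a rough lower bound on that fixed overhead is my assumption, not something measured separately. The subtraction can be sketched as follows; the helper names to_seconds and elapsed_minus_startup are hypothetical.

```shell
# Convert a bash `time` value such as "0m58.279s" to seconds.
to_seconds() {
  # Split on "m" and "s": field 1 is minutes, field 2 is seconds.
  printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Subtract an assumed fixed startup time ($2) from a total run time ($1).
elapsed_minus_startup() {
  total=$(to_seconds "$1")
  startup=$(to_seconds "$2")
  awk -v t="$total" -v s="$startup" 'BEGIN { printf "%.3f\n", t - s }'
}
```

Usage: elapsed_minus_startup 0m58.279s 0m7.511s prints 50.768, so roughly 51 s remain for launching the browser, loading the page, and processing it.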

I saved the same page locally and tried again.

Start an HTTP server with the directory containing the saved page as its document root:
$ python3 -m http.server --bind 172.17.0.1 --directory ~/Downloads/
Save it with SingleFile:
$ time docker run singlefile http://172.17.0.1:8000/gildas-lormeau_SingleFile_%20Web%20Extension%20for%20Firefox_Chrome_MS%20Edge%20and%20CLI%20tool%20to%20save%20a%20faithful%20copy%20of%20an%20entire%20web%20page%20in%20a%20single%20HTML%20file.html > /tmp/sample.html

real    0m47.339s
user    0m0.059s
sys     0m0.110s

The result changed less than I expected: about 47 s from the local server versus about 58 s over the network.
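
Since the network does not look like the main cost, the --browser-wait-until option from the help output may be worth a look. Its default is "networkidle0", and "load" is one of the other listed choices. The sketch below is untested here and only shows how the option could be passed; whether it shortens the run is an open question. sf_save_at_load is a hypothetical wrapper name, and singlefile is the local image tag created above.

```shell
# Hypothetical wrapper: save URL $1 to file $2, capturing at the "load" event
# instead of the default "networkidle0" (both values appear in the --help output).
sf_save_at_load() {
  # In Docker the page content is dumped to stdout, so redirect it to the output file.
  docker run singlefile --browser-wait-until=load "$1" > "$2"
}
```

Usage: time sf_save_at_load https://github.com/gildas-lormeau/SingleFile/ /tmp/singlefile-load.html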

Environment
$ docker image ls singlefile
REPOSITORY   TAG       IMAGE ID       CREATED        SIZE
singlefile   latest    36fda8dcb810   4 months ago   755MB
$ dpkg-query -W docker.io python3
docker.io       20.10.11+dfsg1-2+b1
python3 3.9.8-1
$ lsb_release -dr
Description:    Debian GNU/Linux bookworm/sid
Release:        unstable
$ arch
x86_64
$ grep ^model\ name\.*: -m1 /proc/cpuinfo
model name      : Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
