robots.txt in snac


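I contributed an article about snac, a lightweight ActivityPub server, to Sakura no Knowledge (さくらのナレッジ).

Contributed a snac article to Sakura Internet's Sakura no Knowledge – matoken’s blog

I plan to write up a few things about snac that did not make it into that article.
Today's topic is robots.txt in snac.

Searching for "site:snac.kagolug.org" in a search engine returns nothing. I assumed that was because the instance had not been running for long, but just to be sure I checked robots.txt and found that it disallows everything: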

$ w3m -dump https://snac.kagolug.org/robots.txt
User-agent: *
Disallow: /
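
To make the effect of those two lines concrete, here is a minimal sketch that feeds them to Python's standard-library robots.txt parser. The crawler name "ExampleBot" is made up for illustration, and real crawlers may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The two lines snac returns for /robots.txt, as shown in the w3m output.
SNAC_DEFAULT = "User-agent: *\nDisallow: /\n"


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a hypothetical crawler name, not a real bot.
print(can_fetch(SNAC_DEFAULT, "ExampleBot", "https://snac.kagolug.org/"))  # prints False
```

Any well-behaved crawler that honors robots.txt will therefore skip the whole site, which is consistent with the empty search results.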

Looking at the source, this appears to be hard-coded in httpd.c.

$ grep -n -A5 robots.txt httpd.c
321:    if (strcmp(q_path, "/robots.txt") == 0) {
322-        status = HTTP_STATUS_OK;
323-        *ctype = "text/plain";
324-        *body  = xs_str_new("User-agent: *\n"
325-                            "Disallow: /\n");
326-    }

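It would be nice if this could be changed in the snac configuration, but that is probably not possible. Searching the issues turned up the one below. It seems the built-in robots.txt exists only because the 404 errors were a nuisance, and anyone who wants to customize it is expected to add a rule in the HTTP proxy in front of snac.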

#223 – default robots.txt breaks integration with fedi-fetcher – grunfink/snac2 – Codeberg.org

The person who opened the issue works around it by returning a custom robots.txt from nginx.

Actually, I am able to work around this by returning a custom robots.txt with nginx:
location = /robots.txt {
		return 200 'User-agent: FediFetcher\nAllow: /\nUser-agent: *\nDisallow: /\n';
}
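
As a rough check of what that rule set means, the sketch below parses the same four lines with Python's standard-library parser. "ExampleBot" is a hypothetical crawler name used only for contrast, and the result reflects this parser's matching rules rather than any particular crawler's.

```python
from urllib.robotparser import RobotFileParser

# Body of the nginx `return 200 '...'` rule quoted from the issue,
# with the \n escapes written out as separate lines.
CUSTOM = (
    "User-agent: FediFetcher\n"
    "Allow: /\n"
    "User-agent: *\n"
    "Disallow: /\n"
)

parser = RobotFileParser()
parser.parse(CUSTOM.splitlines())

# "ExampleBot" is a made-up name standing in for any other crawler.
results = {
    agent: parser.can_fetch(agent, "/")
    for agent in ("FediFetcher", "ExampleBot")
}
print(results)  # prints {'FediFetcher': True, 'ExampleBot': False}
```

FediFetcher gets its own group and is allowed, while every other agent falls through to the `*` group and stays blocked.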

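My environment uses Apache2 httpd, but the same kind of configuration is possible there, so I set it up.

Setting robots.txt with an Alias in Apache2 httpd

I did the same thing back when I was running a public Nitter instance, so I used that as a reference.

Setting robots.txt for Nitter (Alias configuration in an Apache httpd reverse proxy environment) – matoken’s blog

I modified the snac section of the Apache2 httpd configuration as follows. With this, /var/lib/snac2/robots.txt on the Linux filesystem should be served as /robots.txt. The `ProxyPass !` line excludes that path from the reverse proxy, so the request is answered by Apache instead of being passed on to snac.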

$ sudo git diff /etc/apache2/sites-available/011-snac.kagolug.org.conf
diff --git a/apache2/sites-available/011-snac.kagolug.org.conf b/apache2/sites-available/011-snac.kagolug.org.conf
index f5b5c7f..7bb72c1 100644
--- a/apache2/sites-available/011-snac.kagolug.org.conf
+++ b/apache2/sites-available/011-snac.kagolug.org.conf
@@ -58,6 +58,11 @@
        </Location>

        Alias /static /var/www/static
+       Alias /robots.txt /var/lib/snac2/robots.txt
+       <Location "/robots.txt">
+               ProxyPass !
+               Require all granted
+       </Location>

        # Possible values include: debug, info, notice, warn, error, crit,
        # alert, emerg.

Check the configuration, then apply it.

$ sudo apache2ctl configtest
$ sudo systemctl reload apache2

Contents of the file:

$ sudo -u www-data cat /var/lib/snac2/robots.txt
User-agent: *
Allow: /

The same content is now served at /robots.txt.

$ w3m -dump https://snac.kagolug.org/robots.txt
User-agent: *
Allow: /

The served robots.txt now permits access by all bots.
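
When editing the file by hand, it can be useful to confirm what it permits before reloading Apache. The following is a minimal sketch of such a check. The function name `blocked_agents` and the agent names are illustrative assumptions, and the check only reflects how Python's standard-library parser reads the file.

```python
from pathlib import Path
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_path: str, agents: list[str], url_path: str = "/") -> list[str]:
    """Return the agents that the robots.txt file at robots_path refuses for url_path."""
    parser = RobotFileParser()
    # Parse the file from disk; no HTTP request is made.
    parser.parse(Path(robots_path).read_text(encoding="utf-8").splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url_path)]
```

For the allow-all file shown above, `blocked_agents("/var/lib/snac2/robots.txt", ["FediFetcher", "ExampleBot"])` would return an empty list, assuming the calling user can read the file.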

Rewriting the source

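In an environment where snac is built from source, robots.txt can also be customized by editing the source. Re-applying the patch on every update is a hassle, though, so configuring it in the proxy still seems the better option.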

Removing robots.txt (this results in a lot of 404 errors, so probably not a good idea)

$ git diff httpd.c
diff --git a/httpd.c b/httpd.c
index a8cd849..976fdd6 100644
--- a/httpd.c
+++ b/httpd.c
@@ -318,13 +318,6 @@ int server_get_handler(xs_dict *req, const char *q_path,
         *body  = xs_json_dumps(j, 4);
     }
     else
-    if (strcmp(q_path, "/robots.txt") == 0) {
-        status = HTTP_STATUS_OK;
-        *ctype = "text/plain";
-        *body  = xs_str_new("User-agent: *\n"
-                            "Disallow: /\n");
-    }
-    else
     if (strcmp(q_path, "/style.css") == 0) {
         FILE *f;
         xs *css_fn = xs_fmt("%s/style.css", srv_basedir);

$ w3m -dump https://snac.matoken.org/robots.txt
404 Not Found (snac/2.85)

Rewriting it to allow everything

$ git diff httpd.c
diff --git a/httpd.c b/httpd.c
index a8cd849..c0bcaef 100644
--- a/httpd.c
+++ b/httpd.c
@@ -322,7 +322,7 @@ int server_get_handler(xs_dict *req, const char *q_path,
         status = HTTP_STATUS_OK;
         *ctype = "text/plain";
         *body  = xs_str_new("User-agent: *\n"
-                            "Disallow: /\n");
+                            "Allow: /\n");
     }
     else
     if (strcmp(q_path, "/style.css") == 0) {

$ w3m -dump https://snac.matoken.org/robots.txt
User-agent: *
Allow: /
