GNU Core Utilities (coreutils) code reading/analysis: the yes command

Previous post: An easy-to-follow analysis of the source code of true, the simplest command in coreutils!

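After true, the next simplest one is, well, probably yes!

Last time I read the true command and got a rough look at the template-like parts shared across coreutils, so I'll skip that code and look at yes, which seems to be the next simplest. The code is on GitHub.

Starting from main, the first part is boilerplate, so I'll skip it: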

src/yes.c

  parse_gnu_standard_options_only (argc, argv, PROGRAM_NAME, PACKAGE_NAME,
                                   Version, true, usage, AUTHORS,
                                   (char const *) nullptr);

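This one is new to me. Internally it uses getopt_long to handle just two options, --help and --version; unknown options are an error. Incidentally, when the optstring passed to getopt starts with "+", parsing stops as soon as it hits a non-option argument, which is a GNU extension according to getopt(3). yes passes true for scan_all, so optstring is "" and every argument is scanned for options.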

gnulib/lib/long-option.c

  const char *optstring = scan_all ? "" : "+";

  if ((c = getopt_long (argc, argv, optstring, long_options, NULL)) != -1)
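
To convince myself of the difference between "" and "+", here is a minimal sketch. sees_help is a hypothetical helper of mine, not gnulib code, and resetting the parser with optind = 0 assumes glibc-style getopt behavior. It also assumes POSIXLY_CORRECT is not set in the environment, since that forces the "+" behavior.

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: report whether --help is recognized anywhere
   in argv.  scan_all mirrors the flag of the same name in
   gnulib/lib/long-option.c.  */
static bool
sees_help (int argc, char **argv, bool scan_all)
{
  static struct option const long_options[] = {
    {"help", no_argument, NULL, 'h'},
    {NULL, 0, NULL, 0}
  };
  /* "" lets GNU getopt permute argv and scan everything;
     "+" stops at the first non-option argument.  */
  const char *optstring = scan_all ? "" : "+";
  bool seen = false;
  int c;

  assert (argc >= 1 && argv != NULL);
  optind = 0;   /* assumption: glibc-style request to reinitialize */
  opterr = 0;   /* keep getopt from printing its own diagnostics */
  while ((c = getopt_long (argc, argv, optstring, long_options, NULL)) != -1)
    if (c == 'h')
      seen = true;
  return seen;
}
```

With the arguments a --help, the scan_all version still finds --help behind the operand, while the "+" version stops at a and never sees it.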

Anyway, the very next bit of code already raises all sorts of questions:

src/yes.c

  char **operands = argv + optind;
  char **operand_lim = argv + argc;
  if (optind == argc)
    *operand_lim++ = bad_cast ("y");

bad_cast() is apparently a trick for forcing a char const * into a char * while keeping the compiler from warning about it.

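More to the point, operand_lim is argv + argc, so this is stuffing "y" into the NULL slot at argv[argc], right? The C standard says argv[argc] is a null pointer, so they just filled that gap with y.

As for why anyone would do that: yes is a command that repeats its command-line arguments forever, so if you pass it a b c, for example,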
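
Here is a minimal sketch of that trick, pulled out into a function. operand_limit_with_default and first_operand are hypothetical names of mine (first_operand plays the role of optind), and the plain cast stands in for bad_cast.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the end of the operand list, storing a
   default "y" operand in the argv[argc] slot when there are no
   operands at all.  */
static char **
operand_limit_with_default (int argc, char **argv, int first_operand)
{
  /* The C standard guarantees this slot exists and holds NULL.  */
  assert (argv[argc] == NULL);

  char **operand_lim = argv + argc;
  if (first_operand == argc)
    *operand_lim++ = (char *) "y";   /* overwrite the NULL terminator */
  return operand_lim;
}
```

One thing to keep in mind: after the default is stored, argv is no longer NULL-terminated, so the code has to rely on operand_lim from then on.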

$ yes a b c
a b c
a b c
a b c
a b c
 :

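is what you get (I honestly only just learned that yes can print anything other than y). So somebody apparently thought, "why not just use argv as-is?" And for the case with no arguments, they resorted to the brute-force move of shoving y into the gap in argv.

(Skipping ahead in the code a bit) when reuse_operand_strings is true, it really does overwrite the memory at *operands (in practice *(argv+1) when no options are given) in place, with no malloc and no memcpy: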

src/yes.c

  /* Fill the buffer with one copy of the output.  If possible, reuse
     the operands strings; this wins when the buffer would be large.  */
  char *buf = reuse_operand_strings ? *operands : xmalloc (bufalloc);
  size_t bufused = 0;
  operandp = operands;
  do
    {
      size_t operand_len = strlen (*operandp);
      if (! reuse_operand_strings)
        memcpy (buf + bufused, *operandp, operand_len);
      bufused += operand_len;
      buf[bufused++] = ' ';
    }
  while (++operandp < operand_lim);
  buf[bufused - 1] = '\n';

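So, put simply, this assumes the command-line arguments sit back to back in memory starting at *(argv+1), and joins them into a single string by replacing each argument's terminating NUL with a space (with the last one becoming a newline instead of a space). Pretty wild code. The mindset is a bit on the code-golfer side...

But surely the arguments aren't guaranteed to be lined up neatly in memory, are they? I genuinely never read or write C, so I asked Copilot whether that is part of the spec, and it said no. It sat in the back of my head for half a day, and then I looked more closely: the code right above checks whether they are contiguous...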
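
To see the NUL-to-space trick in isolation, here is a minimal sketch. join_in_place is a hypothetical helper of mine, and it simply assumes the strings are already contiguous, which the real code verifies first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: join n contiguous NUL-terminated strings in
   place, turning each terminating NUL into ' ' and the last one into
   '\n'.  Returns the length of the joined line.  The result is not
   NUL-terminated, so the caller must carry the length around.  */
static size_t
join_in_place (char **operands, size_t n)
{
  assert (n > 0);
  char *buf = operands[0];
  size_t used = 0;
  for (size_t i = 0; i < n; i++)
    {
      /* operands[i] still ends in NUL here: only the terminators of
         earlier operands have been overwritten so far.  */
      used += strlen (operands[i]);
      buf[used++] = ' ';
    }
  buf[used - 1] = '\n';
  return used;
}
```

For a buffer holding a, bb and c back to back, this yields the 7 bytes "a bb c\n" without copying anything.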

src/yes.c

  /* Buffer data locally once, rather than having the
     large overhead of stdio buffering each item.  */
  size_t bufalloc = 0;
  bool reuse_operand_strings = true;
  char **operandp = operands;
  do
    {
      size_t operand_len = strlen (*operandp);
      bufalloc += operand_len + 1;
      if (operandp + 1 < operand_lim
          && *operandp + operand_len + 1 != operandp[1])
        reuse_operand_strings = false;
    }
  while (++operandp < operand_lim);

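The if statement here takes the start of an argument, adds its length plus one for the terminating NUL, and if the result doesn't match the start of the next argument, drops reuse_operand_strings to false. At first I thought it was doing something incomprehensible, but that's what it is...

Still, it apparently remains a dubious kind of memory access, and it seems to fail on systems protected by CHERI. So there is code that skips the trick when CHERI is enabled. (I don't know what CHERI is, but clearly all sorts of people are checking coreutils.)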
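
The check on its own can be sketched like this. operands_are_contiguous is a hypothetical helper of mine; the real code folds the same comparison into the loop that also sums up bufalloc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if each operand starts exactly one byte
   past the terminating NUL of the previous operand.  */
static bool
operands_are_contiguous (char **operands, size_t n)
{
  assert (operands != NULL);
  for (size_t i = 0; i + 1 < n; i++)
    if (operands[i] + strlen (operands[i]) + 1 != operands[i + 1])
      return false;
  return true;
}
```

A single operand trivially passes, and a one-byte gap between two operands is enough to fail the check.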

#if defined __CHERI__
  /* Cheri capability bounds do not allow for this.  */
  reuse_operand_strings = false;
#endif

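It also compares the total against BUFSIZ, the default buffer size from stdio.h: if the total is at most half of BUFSIZ, bufalloc is bumped up to BUFSIZ and reuse_operand_strings is dropped to false here too: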

src/yes.c

  /* Improve performance by using a buffer size greater than BUFSIZ / 2.  */
  if (bufalloc <= BUFSIZ / 2)
    {
      bufalloc = BUFSIZ;
      reuse_operand_strings = false;
    }

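Well, as the comment from the guy who made this commit also says, the idea seems to be that when the arguments are huge, reusing them is faster than calling malloc.

Finally, if the buffer is larger than the content, it fills the rest of the buffer by repeating the content. This too feels like lean, golfer-style code...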

src/yes.c

  /* If a larger buffer was allocated, fill it by repeating the buffer
     contents.  */
  size_t copysize = bufused;
  for (size_t copies = bufalloc / copysize; --copies; )
    {
      memcpy (buf + bufused, buf, copysize);
      bufused += copysize;
    }
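
The countdown loop is terse, so here is a minimal sketch of it as a standalone function. fill_by_repeating is a hypothetical name of mine, and the assert spells out the precondition the loop depends on: bufalloc / copysize must be at least 1, otherwise --copies would wrap around.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: buf already holds one copy of the output line
   in its first copysize bytes.  Append as many further whole copies
   as fit in bufalloc bytes and return the number of bytes in use.  */
static size_t
fill_by_repeating (char *buf, size_t bufalloc, size_t copysize)
{
  assert (copysize > 0 && copysize <= bufalloc);
  size_t bufused = copysize;
  /* bufalloc / copysize whole copies fit; one is already in place,
     so the pre-decrement makes the body run that many minus one times.  */
  for (size_t copies = bufalloc / copysize; --copies; )
    {
      memcpy (buf + bufused, buf, copysize);
      bufused += copysize;
    }
  return bufused;
}
```

With a 10-byte buffer and a 3-byte line, three copies fit and 9 bytes end up in use. When bufalloc equals copysize, as in the reuse case, the loop body never runs.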

And then we finally reach the actual job of yes, the output, which always ends with an error exit:

src/yes.c

  /* Repeatedly output the buffer until there is a write error; then fail.  */
  while (full_write (STDOUT_FILENO, buf, bufused) == bufused)
    continue;
  error (0, errno, _("standard output"));
  main_exit (EXIT_FAILURE);

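full_write comes from gnulib, the library bundled with coreutils, and writes everything out while retrying on things like interrupts. There is a lot in there that I'm seeing for the first time, but I'm already pretty full, so I'll skip it.

error comes from error.h, and that header itself seems to be GNU-specific. Reaching this line prints an error message, but you don't normally see such an error when using yes: neither SIGINT nor SIGPIPE produces a message, apparently because in those cases the process terminates before it gets to this line. When I tried call close(1) from gdb, the error was displayed properly, so that must be how it works...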
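
Just to get a feel for what "write everything, with retries" means, here is a minimal sketch. write_all is a hypothetical function of mine built on plain write(2); it is not gnulib's implementation, only the general idea as I understand it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write until all count bytes are
   written.  Returns the number of bytes actually written; a result
   smaller than count means an error occurred and errno is set.  */
static size_t
write_all (int fd, const char *buf, size_t count)
{
  assert (buf != NULL);
  size_t total = 0;
  while (total < count)
    {
      ssize_t n = write (fd, buf + total, count - total);
      if (n < 0)
        {
          if (errno == EINTR)
            continue;   /* interrupted by a signal: just retry */
          break;        /* real error: errno says why */
        }
      if (n == 0)
        {
          errno = ENOSPC;   /* no progress at all: report it as an error */
          break;
        }
      total += (size_t) n;  /* short write: loop for the remainder */
    }
  return total;
}
```

The loop in yes then amounts to calling something like this until the return value differs from the requested length. Writing to an invalid descriptor fails immediately with EBADF.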

yes: standard output: Bad file descriptor

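And the final main_exit: there is a long comment about it in coreutils' system.h, but in the end it seems to be a trick for suppressing the false alarms that gcc -fsanitize=lint raises because of the code that reuses the argv arguments. The catch is that I have no idea what -fsanitize=lint actually is! Searching turns up nothing, and I can't find it in the gcc documentation either...

Too many mysteries...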