Linux Kernel Virtual File System

Linux Kernel Virtual File System (VFS): An In-Depth Guide

Introduction
VFS Architecture and Abstraction Layers
Core VFS Objects
Superblock
inode (Index Node)
dentry (Directory Entry)
file Object
Dentry Cache (dcache)
inode Cache
Page Cache Integration
File Operations (struct file_operations)
Mount Mechanism and Mount Namespaces
Pseudo Filesystems (procfs, sysfs, tmpfs, devtmpfs)
File Locking (flock, POSIX locks)
File Descriptors and the File Table
Pathname Lookup
Directory Entry Operations
inode Operations
Superblock Operations
VFS and Filesystem Registration
Practical Tools: mount, stat, lsof, fuser, findmnt
Troubleshooting and Performance Tuning
Summary
References

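Introduction

The Virtual File System (VFS) is the Linux kernel's unified abstraction layer for filesystems. It lets user-space applications reach different filesystems (ext4, XFS, Btrfs, NFS, CIFS, and others) transparently through a single system call interface.

Problems VFS Solves

A single Linux machine commonly uses several filesystems at once. A layout with ext4 for the root partition, XFS for /home, tmpfs for /tmp, and NFS for remote storage is not unusual. Without VFS, applications would have to call each filesystem's own API, which would be a heavy burden on developers.

VFS addresses the following problems:

Hiding filesystem diversity: applications only need the uniform system calls such as open(), read(), write(), and close()
Making new filesystems easy to add: a new filesystem driver only has to implement the interfaces VFS defines
Centralized cache management: the dentry cache, inode cache, and page cache are managed uniformly at the VFS layer
Namespace isolation: mount namespaces make isolated environments such as containers possible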

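Historical Background

The VFS concept was first introduced by Sun Microsystems in SunOS in 1985. The abstraction layer was devised to integrate NFS (Network File System) with local filesystems. The Linux kernel adopted the idea and has its own VFS implementation with its own refinements.

The Linux VFS has been substantially reworked over successive releases: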

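Kernel version	Main changes
2.4.x	Basic VFS framework
2.6.x	Major dcache improvements, adoption of RCU (Read-Copy-Update), lockless path lookup (RCU-walk, merged in 2.6.38)
3.x	Maturing of lockless path lookup, OverlayFS merged (3.18)
4.x	Mount API improvements
5.x	New mount API (fsopen/fsmount), io_uring integration, idmapped mounts (5.12)
6.x	Large-scale refactoring in the VFS layer (for example, struct mnt_idmap passed to the inode operations)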

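Intended Audience

This article is written for:

System engineers interested in Linux kernel internals
SRE / DevOps engineers who troubleshoot filesystem-related problems
Developers of kernel modules and filesystem drivers
Developers who want a deeper understanding of Linux system programming

Prerequisites

The following background will help:

Basic C (structs, function pointers, pointer manipulation)
Basic Linux file manipulation commands
The basic architecture of the Linux kernel (the distinction between user space and kernel space)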

VFS Architecture and Abstraction Layers

Overall Architecture

VFS is the abstraction layer that sits between user space and the actual on-disk filesystems. The layering is shown below.

┌─────────────────────────────────────────────────────────┐
│                       User space                        │
│  Applications: open(), read(), write(), close()         │
├─────────────────────────────────────────────────────────┤
│                    System call layer                    │
│  sys_open(), sys_read(), sys_write(), sys_close()       │
├─────────────────────────────────────────────────────────┤
│                Virtual File System (VFS)                │
│  ┌──────────┬──────────┬──────────┬──────────┐          │
│  │superblock│  inode   │  dentry  │   file   │          │
│  │  object  │  object  │  object  │  object  │          │
│  └──────────┴──────────┴──────────┴──────────┘          │
│  ┌──────────────────────────────────────────┐           │
│  │       Cache layer                        │           │
│  │  dcache / inode cache / page cache       │           │
│  └──────────────────────────────────────────┘           │
├─────────────────────────────────────────────────────────┤
│           Concrete filesystem implementations           │
│  ┌────┬─────┬──────┬─────┬──────┬──────┐                │
│  │ext4│ XFS │Btrfs │ NFS │tmpfs │procfs│                │
│  └────┴─────┴──────┴─────┴──────┴──────┘                │
├─────────────────────────────────────────────────────────┤
│                     Block I/O layer                     │
│  I/O scheduler / device drivers                         │
├─────────────────────────────────────────────────────────┤
│                    Physical storage                     │
│  HDD / SSD / NVMe / network storage                     │
└─────────────────────────────────────────────────────────┘

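VFS Design Principles

VFS follows object-oriented design principles. Because C has no classes or inheritance, it achieves polymorphism with structs and tables of function pointers.

// Example of VFS polymorphism: inode operations (Linux 6.x-era signatures; details vary by kernel version)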
struct inode_operations {
    struct dentry * (*lookup) (struct inode *, struct dentry *, unsigned int);
    int (*create) (struct mnt_idmap *, struct inode *, struct dentry *, umode_t, bool);
    int (*link) (struct dentry *, struct inode *, struct dentry *);
    int (*unlink) (struct inode *, struct dentry *);
    int (*symlink) (struct mnt_idmap *, struct inode *, struct dentry *, const char *);
    int (*mkdir) (struct mnt_idmap *, struct inode *, struct dentry *, umode_t);
    int (*rmdir) (struct inode *, struct dentry *);
    int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
                   struct inode *, struct dentry *, unsigned int);
    // ... other operations
};

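Each filesystem plugs into the VFS framework by filling this operation table's function pointers with its own implementations. For example, ext4 provides functions such as ext4_lookup() and ext4_create(), while XFS provides xfs_vn_lookup() and xfs_vn_create().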
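The dispatch idea can be tried out in user space. The following is a minimal Python sketch, not kernel code: `InodeOps`, `EXT4_OPS`, `XFS_OPS`, and `vfs_lookup` are hypothetical names invented for illustration, and the "implementations" only return a string naming the kernel function they stand in for.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class InodeOps:
    """Hypothetical stand-in for struct inode_operations (two slots only)."""
    lookup: Callable[[str], str]
    create: Callable[[str], str]


# Each "filesystem" fills the table with its own functions.
EXT4_OPS = InodeOps(
    lookup=lambda name: f"ext4_lookup({name})",
    create=lambda name: f"ext4_create({name})",
)
XFS_OPS = InodeOps(
    lookup=lambda name: f"xfs_vn_lookup({name})",
    create=lambda name: f"xfs_vn_create({name})",
)


def vfs_lookup(ops: InodeOps, name: str) -> str:
    """Generic layer: calls through the table without knowing the filesystem."""
    return ops.lookup(name)


if __name__ == "__main__":
    for ops in (EXT4_OPS, XFS_OPS):
        print(vfs_lookup(ops, "file.txt"))
```

The caller never branches on the filesystem type; choosing a different table is enough to change the behavior, which is the property the C function-pointer tables provide.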

System Call Flow

Let us look in detail at how VFS works, using the open() system call as an example.

User space: fd = open("/home/user/file.txt", O_RDONLY);
    │
    ▼
Kernel: sys_openat() [fs/open.c]
    │
    ├── do_sys_openat2() [fs/open.c]
    │   │
    │   ├── getname() - copy the pathname from user space
    │   │
    │   ├── get_unused_fd_flags() - reserve a free file descriptor
    │   │
    │   ├── do_filp_open() [fs/namei.c]
    │   │   │
    │   │   ├── path_openat() - start path resolution
    │   │   │   │
    │   │   │   ├── link_path_walk() - walk every component except the last
    │   │   │   │   │
    │   │   │   │   ├── look up "home" in the dcache
    │   │   │   │   │   (hit  → use the cached dentry)
    │   │   │   │   │   (miss → call the filesystem's lookup)
    │   │   │   │   │
    │   │   │   │   └── look up "user" in the dcache
    │   │   │   │
    │   │   │   ├── open_last_lookups() - look up the final component "file.txt"
    │   │   │   │
    │   │   │   ├── do_open() - open the file
    │   │   │   │   │
    │   │   │   │   └── vfs_open() [fs/open.c]
    │   │   │   │       │
    │   │   │   │       ├── finish initializing the file structure
    │   │   │   │       │
    │   │   │   │       └── call the filesystem-specific open function
    │   │   │   │           (e.g. ext4_file_open)
    │   │   │   │
    │   │   │   └── path resolution complete
    │   │   │
    │   │   └── return the file structure
    │   │
    │   └── fd_install() - bind the fd to the file structure
    │
    └── return the fd
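From user space, the whole chain above is hidden behind three calls. The sketch below is a small illustration, assuming a Linux system: `read_head` is a hypothetical helper that uses the thin `os.open` / `os.read` / `os.close` wrappers, and the same function is applied to an ordinary temporary file and, if present, to a procfs file.

```python
import os
import tempfile


def read_head(path: str, nbytes: int = 64) -> bytes:
    """Read the first bytes of a file using the same calls for any filesystem."""
    fd = os.open(path, os.O_RDONLY)   # path lookup and file object creation happen here
    try:
        return os.read(fd, nbytes)    # dispatched to the filesystem's read implementation
    finally:
        os.close(fd)                  # drops this descriptor's reference to the file object


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello from a regular file\n")
    try:
        print(read_head(tmp.name))
        if os.path.exists("/proc/version"):   # procfs: no disk blocks behind it
            print(read_head("/proc/version"))
    finally:
        os.unlink(tmp.name)
```

The calling code is identical in both cases; only the function tables selected inside the kernel differ.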

Relationships Between VFS Data Structures

The four core VFS objects relate to one another as follows:

                    ┌──────────────┐
                    │  superblock  │
                    │              │
                    │ s_inodes ────┼──────┐
                    │ s_root  ────┼──┐   │
                    └──────────────┘  │   │
                                      │   │
                    ┌─────────────────┘   │
                    ▼                     ▼
              ┌──────────┐         ┌──────────┐
              │  dentry   │         │  inode    │
              │ (root /)  │         │          │
              │           │◄────────┤ i_dentry │
              │ d_inode ──┼────────►│          │
              │ d_children┼──┐      │ i_fop ───┼──────┐
              └──────────┘  │      │ i_op  ───┼──┐   │
                    │        │      └──────────┘  │   │
                    │        │                     │   │
                    │        ▼                     │   │
                    │  ┌──────────┐                │   │
                    │  │  dentry   │                │   │
                    │  │ (/home)   │                │   │
                    │  └──────────┘                │   │
                    │        │                     │   │
                    │        ▼                     │   │
                    │  ┌──────────┐                │   │
                    │  │  dentry   │                │   │
                    │  │ (file.txt)│                │   │
                    │  └──────────┘                │   │
                    │                              │   │
                    ▼                              ▼   ▼
              ┌──────────┐    ┌────────────────────────────┐
              │   file    │    │   inode_operations         │
              │           │    │   file_operations          │
              │ f_path ───┤    │                            │
              │ f_inode ──┤    │  .read()                   │
              │ f_op   ───┼───►│  .write()                  │
              │ f_pos     │    │  .open()                   │
              └──────────┘    │  .release()                │
                              └────────────────────────────┘

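VFS-Related Kernel Source Files

The VFS implementation is concentrated in the kernel's fs/ directory. The exact layout depends on the kernel version; several files listed here have moved in recent kernels, as noted:

fs/
├── namei.c          # pathname lookup
├── open.c           # open/close-related system calls
├── read_write.c     # read/write-related system calls
├── file.c           # per-process file descriptor tables
├── inode.c          # inode management
├── super.c          # superblock management
├── dcache.c         # dentry cache
├── namespace.c      # mounts and mount namespaces
├── mount.h          # mount structure definitions
├── stat.c           # stat-related system calls
├── ioctl.c          # ioctl handling
├── locks.c          # file locking
├── file_table.c     # struct file allocation and system-wide accounting
├── filesystems.c    # filesystem registration
├── fs_struct.c      # per-process filesystem information (root, cwd)
├── buffer.c         # buffer cache
├── block_dev.c      # block device handling (block/bdev.c in recent kernels)
├── char_dev.c       # character device handling
├── pipe.c           # pipe implementation
├── fifo.c           # FIFOs (named pipes); folded into pipe.c in later kernels
├── eventpoll.c      # epoll implementation
├── select.c         # select/poll implementation
├── aio.c            # asynchronous I/O
├── io_uring.c       # io_uring (top-level io_uring/ directory in recent kernels)
├── proc/            # procfs implementation
├── sysfs/           # sysfs implementation
├── kernfs/          # kernfs (backend for sysfs)
├── debugfs/         # debugfs implementation
├── devpts/          # devpts implementation
├── ext4/            # ext4 filesystem
├── xfs/             # XFS filesystem
├── btrfs/           # Btrfs filesystem
├── nfs/             # NFS client
├── cifs/            # CIFS/SMB client (fs/smb/client/ in recent kernels)
├── fuse/            # FUSE (Filesystem in Userspace)
└── overlayfs/       # OverlayFS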

Build-Time Configuration

VFS itself is always built into the kernel, but individual filesystems are selected in the kernel configuration:

# Check the kernel configuration (the config file location varies by distribution)
grep -E '^CONFIG_(.*_FS|TMPFS|SYSFS|CIFS)=' "/boot/config-$(uname -r)"

# Typical filesystem-related settings
CONFIG_EXT4_FS=y              # ext4 built in
CONFIG_XFS_FS=m               # XFS as a module
CONFIG_BTRFS_FS=m             # Btrfs as a module
CONFIG_TMPFS=y                # tmpfs built in
CONFIG_PROC_FS=y              # procfs built in
CONFIG_SYSFS=y                # sysfs built in
CONFIG_NFS_FS=m               # NFS as a module
CONFIG_CIFS=m                 # CIFS as a module
CONFIG_FUSE_FS=m              # FUSE as a module
CONFIG_OVERLAY_FS=m           # OverlayFS as a module

# List the filesystem modules currently loaded
lsmod | grep -E "ext4|xfs|btrfs|nfs|cifs|fuse|overlay"

# Example output:
# ext4                  933888  2
# mbcache                16384  1 ext4
# jbd2                  147456  1 ext4
# xfs                  1789952  1
# overlay               155648  3

Core VFS Objects

VFS defines four primary object types. They abstract different aspects of a filesystem and provide a uniform interface.

Overview of the Four Core Objects

Object	Kernel struct	Defined in	Role
Superblock	`struct super_block`	`include/linux/fs.h`	Represents a mounted filesystem as a whole
inode	`struct inode`	`include/linux/fs.h`	Represents the metadata of an individual file
dentry	`struct dentry`	`include/linux/dcache.h`	Represents one component of a pathname
file	`struct file`	`include/linux/fs.h`	Represents a file opened by a process

Object Lifecycle

Filesystem mount
    │
    ▼
superblock created ────────────────────────────────► superblock destroyed
    │                                                         ▲
    │ (on file access)                                        │
    ▼                                                         │
inode read in ─────────► inode cache ─────────► inode freed   │
    │                       ▲                                 │
    │                       │ (reuse)                         │
    ▼                       │                                 │
dentry created ──────► dcache ──────────────► dentry freed    │
    │                       ▲                                 │
    │                       │ (reuse)                         │
    ▼                       │                                 │
file created ───────────────┘                                 │
    │                                                         │
    │ (on close())                                            │
    ▼                                                         │
file freed                                                    │
    │                                                         │
    │ (on unmount)                                            │
    └─────────────────────────────────────────────────────────┘

Multiplicity Between Objects

superblock 1 ──── * inode
inode      1 ──── * dentry  (with hard links, one inode has several dentries)
dentry     1 ──── * file    (when the same file is opened more than once)

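Superblock

Role of the Superblock

The superblock object holds the metadata for an entire mounted filesystem. Each time a filesystem is mounted, the kernel creates a superblock object and fills it from the on-disk superblock (or from information obtained over the network).

Definition of struct super_block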

// include/linux/fs.h (Linux 6.x, selected fields)
struct super_block {
    struct list_head        s_list;         /* list of all superblocks */
    dev_t                   s_dev;          /* device identifier */
    unsigned char           s_blocksize_bits;  /* block size in bits (log2) */
    unsigned long           s_blocksize;    /* block size in bytes */
    loff_t                  s_maxbytes;     /* maximum file size */
    struct file_system_type *s_type;        /* filesystem type */
    const struct super_operations *s_op;    /* superblock operation table */

    const struct dquot_operations *dq_op;   /* disk quota operations */
    const struct quotactl_ops *s_qcop;      /* quota control operations */
    const struct export_operations *s_export_op; /* NFS export operations */

    unsigned long           s_flags;        /* mount flags */
    unsigned long           s_iflags;       /* internal flags */
    unsigned long           s_magic;        /* filesystem magic number */
    struct dentry           *s_root;        /* dentry of the root directory */
    struct rw_semaphore     s_umount;       /* semaphore taken for unmount */
    int                     s_count;        /* reference count */
    atomic_t                s_active;       /* active reference count */

    struct list_head        s_inodes;       /* list of all inodes */
    struct list_head        s_inodes_wb;    /* inodes under writeback */

    struct block_device     *s_bdev;        /* associated block device */
    struct backing_dev_info *s_bdi;         /* backing device information */
    struct mtd_info         *s_mtd;         /* MTD device information */

    struct hlist_node       s_instances;    /* list of instances of the same fs_type */

    char                    s_id[32];       /* textual name */
    uuid_t                  s_uuid;         /* UUID */

    void                    *s_fs_info;     /* filesystem-private information */

    unsigned int            s_max_links;    /* maximum link count */

    const struct xattr_handler * const *s_xattr; /* extended attribute handlers */

    struct shrinker         *s_shrink;      /* shrinker for memory reclaim */

    /* omitted: many more fields exist */
};

Magic Numbers

Each filesystem has its own magic number, which is used to identify the filesystem type:

// include/uapi/linux/magic.h (excerpt)
#define EXT4_SUPER_MAGIC      0xEF53
#define XFS_SUPER_MAGIC       0x58465342   /* "XFSB" */
#define BTRFS_SUPER_MAGIC     0x9123683E
#define TMPFS_MAGIC           0x01021994
#define PROC_SUPER_MAGIC      0x9FA0
#define SYSFS_MAGIC           0x62656572
#define NFS_SUPER_MAGIC       0x6969
#define CIFS_SUPER_MAGIC      0xFF534D42
#define OVERLAYFS_SUPER_MAGIC 0x794c7630
#define FUSE_SUPER_MAGIC      0x65735546
#define DEVPTS_SUPER_MAGIC    0x1CD1
#define DEBUGFS_MAGIC         0x64626720
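To turn a raw magic number back into a name, a lookup table is enough. The sketch below is an illustration under stated assumptions: `MAGIC_NAMES`, `magic_name`, and `fs_magic` are hypothetical helpers, the table only contains the values from the excerpt above, and `fs_magic` assumes GNU coreutils `stat`, whose `%t` format prints the filesystem type in hex when used with `-f`.

```python
import subprocess

# Values copied from the magic.h excerpt above; not an exhaustive list.
MAGIC_NAMES = {
    0xEF53: "ext2/ext3/ext4",   # the ext family shares one magic number
    0x58465342: "xfs",
    0x9123683E: "btrfs",
    0x01021994: "tmpfs",
    0x9FA0: "proc",
    0x62656572: "sysfs",
    0x6969: "nfs",
    0xFF534D42: "cifs",
    0x794C7630: "overlayfs",
    0x65735546: "fuse",
    0x1CD1: "devpts",
    0x64626720: "debugfs",
}


def magic_name(magic: int) -> str:
    """Map a statfs f_type value to a readable name."""
    return MAGIC_NAMES.get(magic, f"unknown (0x{magic:X})")


def fs_magic(path: str) -> int:
    """Ask GNU stat for the filesystem type in hex (assumes coreutils is installed)."""
    out = subprocess.run(
        ["stat", "-f", "-c", "%t", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip(), 16)


if __name__ == "__main__":
    try:
        print(magic_name(fs_magic("/")))
    except (OSError, subprocess.CalledProcessError, ValueError) as exc:
        print(f"could not query the filesystem type: {exc}")
```

Because ext2, ext3, and ext4 share 0xEF53, the magic number alone cannot distinguish between them.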

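# Check a filesystem's type
stat -f / 
# Example output:
#   File: "/"
#     ID: xxxxxxxx Namelen: 255     Type: ext2/ext3
# (ext2, ext3, and ext4 share the magic number 0xEF53, so ext4 is reported as ext2/ext3)

# Print the magic number itself (statfs f_type) in hex
stat -f -c 'magic: 0x%t (%T)' /

# os.statvfs() wraps statvfs(3), which does not expose the magic number;
# f_flag holds the mount flags (ST_RDONLY, ST_NOSUID, ...), not the filesystem type
python3 -c "
import os
st = os.statvfs('/')
print(f'Mount flags: 0x{st.f_flag:X}')
print(f'Block size: {st.f_bsize}')
print(f'Total blocks: {st.f_blocks}')
print(f'Free blocks: {st.f_bfree}')
print(f'Total inodes: {st.f_files}')
print(f'Free inodes: {st.f_ffree}')
"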

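Superblock Creation and Initialization

When a filesystem is mounted, the superblock is created along the following path:

// Simplified superblock initialization flow
// 1. mount() system call
// 2. do_mount() -> path_mount() -> do_new_mount()
// 3. vfs_get_tree() -> filesystem-specific fill_super()

// Simplified sketch of ext4's fill_super (not the actual kernel code;
// cleanup of sbi and bh on the later error paths is omitted for brevity)
static int ext4_fill_super(struct super_block *sb, struct fs_context *fc)
{
    struct ext4_sb_info *sbi;
    struct buffer_head *bh;
    struct ext4_super_block *es;
    struct inode *root;

    // Allocate the filesystem-private information
    sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
    if (!sbi)
        return -ENOMEM;
    sb->s_fs_info = sbi;

    // Read the on-disk superblock, which starts at byte offset 1024
    // (block 1 with 1 KiB blocks; inside block 0 with larger blocks)
    bh = sb_bread(sb, 1);
    if (!bh)
        return -EIO;
    es = (struct ext4_super_block *)(bh->b_data);

    // Validate the superblock
    if (es->s_magic != cpu_to_le16(EXT4_SUPER_MAGIC)) {
        brelse(bh);
        return -EINVAL;
    }

    // Fill in the VFS superblock fields
    sb->s_magic = EXT4_SUPER_MAGIC;
    sb->s_op = &ext4_sops;           // operation table
    // The real code also derives sb->s_blocksize and sb->s_maxbytes
    // from the on-disk values at this point

    // Get the root inode and create the root dentry
    root = ext4_iget(sb, EXT4_ROOT_INO, EXT4_IGET_SPECIAL);
    if (IS_ERR(root))
        return PTR_ERR(root);
    sb->s_root = d_make_root(root);
    if (!sb->s_root)
        return -ENOMEM;

    return 0;
}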

inode (Index Node)

Role of the inode

An inode holds the metadata describing the file or directory itself. The file name is not stored in the inode; it is managed by the dentry. A single inode can have several file names (hard links).
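The separation of name and inode is easy to observe from user space. The following sketch, assuming a filesystem that supports hard links, creates two names for one file and compares what `os.stat` reports; `hardlink_demo` is a hypothetical helper name.

```python
import os
import tempfile


def hardlink_demo():
    """Create two names for one inode; return (same_inode, link_count)."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "a.txt")
        second = os.path.join(d, "b.txt")
        with open(first, "w") as f:
            f.write("data")
        os.link(first, second)              # adds a second name, not a second inode
        st1, st2 = os.stat(first), os.stat(second)
        # An inode is identified by (device, inode number), not by its name.
        same = (st1.st_dev, st1.st_ino) == (st2.st_dev, st2.st_ino)
        return same, st1.st_nlink


if __name__ == "__main__":
    same, nlink = hardlink_demo()
    print(f"same inode: {same}, link count: {nlink}")
```

Both names resolve to the same (device, inode number) pair, and the link count stored in the inode reflects the number of names.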

VFS inode vs. On-Disk inode

The VFS inode (struct inode) and the on-disk inode (for example, struct ext4_inode) are different things. The VFS inode is an abstract object that exists only in kernel memory, whereas the on-disk inode is stored on disk in each filesystem's own format.

Memory (VFS)                         Disk (ext4)
┌───────────────────────┐            ┌───────────────────┐
│ struct inode          │            │ struct ext4_inode │
│                       │  ◄─read──  │                   │
│ i_ino                 │            │ filesystem-       │
│ i_mode                │            │ specific format   │
│ i_nlink               │  ─write─►  │                   │
│ i_uid, i_gid          │            │                   │
│ i_size                │            └───────────────────┘
│ i_atime               │
│ i_mtime               │
│ i_ctime               │
│ i_op (operation table)│
│ i_fop                 │
│ i_sb                  │
│ i_mapping             │
│ ...                   │
└───────────────────────┘

Definition of struct inode

// include/linux/fs.h (selected fields)
struct inode {
    umode_t                 i_mode;         /* file type and permissions */
    unsigned short          i_opflags;      /* operation flags */
    kuid_t                  i_uid;          /* owner UID */
    kgid_t                  i_gid;          /* owning group GID */
    unsigned int            i_flags;        /* filesystem flags */

    const struct inode_operations *i_op;    /* inode operation table */
    struct super_block      *i_sb;          /* superblock this inode belongs to */
    struct address_space    *i_mapping;     /* page cache mapping in use */

    unsigned long           i_ino;          /* inode number */

    union {
        const unsigned int i_nlink;         /* hard link count */
        unsigned int __i_nlink;
    };

    dev_t                   i_rdev;         /* device number (for device files) */
    loff_t                  i_size;         /* file size in bytes */

    struct timespec64       __i_atime;      /* last access time */
    struct timespec64       __i_mtime;      /* last data modification time */
    struct timespec64       __i_ctime;      /* last status (metadata) change time */

    spinlock_t              i_lock;         /* inode spinlock */
    unsigned short          i_bytes;        /* bytes used in the last block */
    u8                      i_blkbits;      /* block size in bits (log2) */

    blkcnt_t                i_blocks;       /* number of blocks used */

    unsigned long           i_state;        /* inode state flags */
    struct rw_semaphore     i_rwsem;        /* inode read/write semaphore */

    struct hlist_node       i_hash;         /* inode hash table linkage */
    struct list_head        i_io_list;      /* backing device I/O list */
    struct list_head        i_lru;          /* LRU list */
    struct list_head        i_sb_list;      /* superblock's inode list */

    struct hlist_head       i_dentry;       /* dentries that refer to this inode */

    atomic_t                i_count;        /* reference count */
    atomic_t                i_writecount;   /* count of writers */

    const struct file_operations *i_fop;    /* default file operations */
    struct file_lock_context *i_flctx;      /* file lock context */
    struct address_space    i_data;         /* this inode's own address space (i_mapping usually points here) */

    union {
        struct pipe_inode_info  *i_pipe;    /* pipe information */
        struct cdev             *i_cdev;    /* character device */
        char                    *i_link;    /* symbolic link target */
        unsigned                i_dir_seq;  /* directory sequence counter */
    };

    void                    *i_private;     /* filesystem-private data */
};

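inode State Flags

// include/linux/fs.h
#define I_DIRTY_SYNC        (1 << 0)  /* inode is dirty, but the change is not needed by fdatasync() (e.g. timestamps) */
#define I_DIRTY_DATASYNC    (1 << 1)  /* inode metadata needed to reach the data (e.g. size) has changed */
#define I_DIRTY_PAGES       (1 << 2)  /* the page cache holds dirty pages */
#define I_NEW               (1 << 3)  /* newly allocated, still being initialized */
#define I_WILL_FREE         (1 << 4)  /* about to be freed */
#define I_FREEING           (1 << 5)  /* being freed */
#define I_CLEAR             (1 << 6)  /* cleared */
#define I_SYNC              (1 << 7)  /* writeback in progress */
#define I_CREATING          (1 << 15) /* being created */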

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
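When reading crash dumps or tracing output, an `i_state` value usually appears as a raw number. The sketch below decodes such a value using only the bit positions listed above; `I_STATE_FLAGS`, `decode_i_state`, and `is_dirty` are hypothetical helper names, and the bit assignments are assumed to match the kernel version being examined.

```python
# Bit values copied from the listing above; they may differ in other kernel versions.
I_STATE_FLAGS = {
    "I_DIRTY_SYNC": 1 << 0,
    "I_DIRTY_DATASYNC": 1 << 1,
    "I_DIRTY_PAGES": 1 << 2,
    "I_NEW": 1 << 3,
    "I_WILL_FREE": 1 << 4,
    "I_FREEING": 1 << 5,
    "I_CLEAR": 1 << 6,
    "I_SYNC": 1 << 7,
    "I_CREATING": 1 << 15,
}

# Same composition as the I_DIRTY macro.
I_DIRTY = (
    I_STATE_FLAGS["I_DIRTY_SYNC"]
    | I_STATE_FLAGS["I_DIRTY_DATASYNC"]
    | I_STATE_FLAGS["I_DIRTY_PAGES"]
)


def decode_i_state(state: int) -> list:
    """Return the names of all known flags set in an i_state value."""
    return [name for name, bit in I_STATE_FLAGS.items() if state & bit]


def is_dirty(state: int) -> bool:
    """True if any of the three dirty bits is set."""
    return bool(state & I_DIRTY)


if __name__ == "__main__":
    value = (1 << 0) | (1 << 7)
    print(decode_i_state(value), is_dirty(value))
```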

Inspecting inodes in Practice

# Show inode information
stat /etc/passwd
# Example output:
#   File: /etc/passwd
#   Size: 2847        Blocks: 8          IO Block: 4096   regular file
#   Device: 259,2      Inode: 1048737     Links: 1
#   Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
#   Access: 2024-01-15 10:30:00.000000000 +0900
#   Modify: 2024-01-10 14:20:00.000000000 +0900
#   Change: 2024-01-10 14:20:00.000000000 +0900
#    Birth: 2023-06-01 00:00:00.000000000 +0900

# Show only the inode number
ls -i /etc/passwd
# Output: 1048737 /etc/passwd

# Find hard links (files with the same inode number)
# -xdev keeps the search on one filesystem, because inode numbers are unique only within a filesystem
find / -xdev -inum 1048737 2>/dev/null

# inode usage per filesystem
df -i
# Example output:
# Filesystem       Inodes   IUsed    IFree IUse% Mounted on
# /dev/nvme0n1p2 32768000  456789 32311211    2% /
# tmpfs           4082744      89  4082655    1% /dev/shm

# Detailed information for a specific inode (using debugfs, ext4 only)
sudo debugfs -R "stat <1048737>" /dev/nvme0n1p2

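dentry (Directory Entry)

Role of the dentry

A dentry (directory entry) is the object that records the association between one component of a pathname and its inode. For example, the path /home/user/file.txt involves four dentries: /, home, user, and file.txt.

A dentry has no direct on-disk counterpart (it is distinct from the on-disk directory entry). Dentries live entirely in kernel memory and act as a cache that speeds up pathname resolution.

Definition of struct dentry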

// include/linux/dcache.h (selected fields)
struct dentry {
    unsigned int            d_flags;        /* dentry flags */
    seqcount_spinlock_t     d_seq;          /* per-dentry sequence counter */
    struct hlist_bl_node    d_hash;         /* hash table entry */
    struct dentry           *d_parent;      /* parent dentry */
    struct qstr             d_name;         /* dentry name */
    struct inode            *d_inode;       /* associated inode (NULL = negative dentry) */
    unsigned char           d_iname[DNAME_INLINE_LEN]; /* inline storage for short names */
    struct lockref          d_lockref;      /* lock and reference count */
    const struct dentry_operations *d_op;   /* dentry operation table */
    struct super_block      *d_sb;          /* superblock this dentry belongs to */
    unsigned long           d_time;         /* timestamp used for revalidation */
    void                    *d_fsdata;      /* filesystem-private data */

    union {
        struct list_head    d_lru;          /* LRU list */
        wait_queue_head_t   *d_wait;        /* wait queue for in-progress lookups */
    };
    struct hlist_node       d_sib;          /* entry in the parent's child list */
    struct hlist_head       d_children;     /* list of child dentries */

    union {
        struct hlist_node   d_alias;        /* entry in the inode's alias list */
        struct hlist_bl_node d_in_lookup_hash; /* hash entry while a lookup is in progress */
        struct rcu_head     d_rcu;          /* RCU callback */
    } d_u;
};

dentry States

A dentry is in one of the following three states:

┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│ Used                 │  │ Unused               │  │ Negative             │
│                      │  │                      │  │                      │
│ d_inode != NULL      │  │ d_inode != NULL      │  │ d_inode == NULL      │
│ d_lockref.count > 0  │  │ d_lockref.count == 0 │  │                      │
│                      │  │ on the LRU list      │  │ caches the fact that │
│ referenced by a      │  │ kept in the cache    │  │ the path does not    │
│ process              │  │                      │  │ exist                │
└──────────────────────┘  └──────────────────────┘  └──────────────────────┘
        │                          │                          │
        │  last reference dropped  │                          │
        ├─────────────────────────►│                          │
        │                          │                          │
        │  referenced again        │                          │
        │◄─────────────────────────│                          │
                                   │                          │
                                   ▼                          ▼
                        freed under memory pressure (unused and negative dentries)
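Negative dentries

A negative dentry caches the result of looking up a file that does not exist. For example, after a failed attempt to access /etc/nonexistent, a negative dentry is cached, so the next access to the same path can be answered without asking the filesystem again, which avoids disk I/O.

# Check the number of negative dentries
cat /proc/sys/fs/dentry-state
# Example output: 86232   72158   45      0       5765    0
# Format: nr_dentry nr_unused age_limit want_pages nr_negative dummy

# Check dentry cache usage with slabtop
sudo slabtop -o | grep dentry
# Example output:
# 86232  85000   97%    0.19K   4106       21     32848K dentry

# Parameter that caps the number of negative dentries
# (provided by some distribution kernels such as RHEL; it does not exist in mainline kernels)
cat /proc/sys/fs/negative-dentry-limit
# Default: 0 (no limit; the value is a percentage)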
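For monitoring scripts it is convenient to give these six numbers names. The sketch below is one way to do that, assuming the field order nr_dentry, nr_unused, age_limit, want_pages, nr_negative, dummy used by kernels that report negative dentries; `DentryState` and `parse_dentry_state` are hypothetical names.

```python
from typing import NamedTuple


class DentryState(NamedTuple):
    """Fields of /proc/sys/fs/dentry-state, in the assumed kernel order."""
    nr_dentry: int
    nr_unused: int
    age_limit: int
    want_pages: int
    nr_negative: int
    dummy: int


def parse_dentry_state(text: str) -> DentryState:
    """Parse the whitespace-separated contents of dentry-state."""
    fields = [int(x) for x in text.split()]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return DentryState(*fields)


if __name__ == "__main__":
    try:
        with open("/proc/sys/fs/dentry-state") as f:
            state = parse_dentry_state(f.read())
        print(state)
        print(f"negative share: {state.nr_negative / max(state.nr_dentry, 1):.1%}")
    except (OSError, ValueError) as exc:
        print(f"dentry-state not available: {exc}")
```

On older kernels the fifth field is not a negative-dentry count, so the `nr_negative` label should be treated as an assumption about the running kernel.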

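file Object

Role of the file Object

A file object represents an open file. It has no on-disk counterpart and exists purely in memory. It is created by the open() system call and freed when the last reference to it is dropped, which normally happens at the final close().

Importantly, a file object is created per open() call, not per file. When several processes open the same file, each open() gets its own file object with its own offset and flags, but all of them refer to the same dentry and inode. Descriptors duplicated with dup() or inherited across fork() are the exception: they share a single file object, and therefore share the file offset.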
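The file offset lives in the file object, so its sharing behavior can be observed directly. The sketch below, with the hypothetical helper `offset_demo`, opens the same file twice and duplicates one descriptor with `dup()`; a second `open()` creates a new file object with an independent offset, whereas `dup()` yields another descriptor for the same file object.

```python
import os
import tempfile


def offset_demo():
    """Return (offset seen via a second open(), offset seen via dup())."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"0123456789")
        tmp.flush()
        fd_a = os.open(tmp.name, os.O_RDONLY)
        fd_b = os.open(tmp.name, os.O_RDONLY)   # second open(): separate file object
        fd_c = os.dup(fd_a)                     # dup(): same file object as fd_a
        try:
            os.read(fd_a, 4)                    # advances the offset of fd_a's file object
            pos_b = os.lseek(fd_b, 0, os.SEEK_CUR)
            pos_c = os.lseek(fd_c, 0, os.SEEK_CUR)
            return pos_b, pos_c
        finally:
            for fd in (fd_a, fd_b, fd_c):
                os.close(fd)


if __name__ == "__main__":
    print(offset_demo())
```

Reading through `fd_a` moves the offset seen through `fd_c`, but not the one seen through `fd_b`.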

Definition of struct file

// include/linux/fs.h (selected fields)
struct file {
    union {
        struct llist_node   f_llist;        /* deferred-free list */
        struct rcu_head     f_rcuhead;      /* RCU callback */
        unsigned int        f_iocb_flags;   /* I/O control block flags */
    };

    spinlock_t              f_lock;         /* spinlock protecting fields such as f_flags (not a file lock in the flock sense) */
    fmode_t                 f_mode;         /* file access mode */
    atomic_long_t           f_count;        /* reference count */
    struct mutex            f_pos_lock;     /* lock protecting the file position */
    loff_t                  f_pos;          /* current file position (offset) */
    unsigned int            f_flags;        /* file flags (O_RDONLY etc.) */

    struct fown_struct      f_owner;        /* owner information for signal delivery */
    const struct cred       *f_cred;        /* credentials of the opener */
    struct file_ra_state    f_ra;           /* readahead state */

    struct path             f_path;         /* path (vfsmount + dentry) */
    struct inode            *f_inode;       /* cached inode pointer (f_path.dentry->d_inode) */
    const struct file_operations *f_op;     /* file operation table */

    u64                     f_version;      /* version number */
    void                    *private_data;  /* filesystem- or driver-private data */

    struct address_space    *f_mapping;     /* page cache mapping */
    errseq_t                f_wb_err;       /* writeback error cursor */
    errseq_t                f_sb_err;       /* superblock error cursor */
};

Definition of struct path

struct path {
    struct vfsmount *mnt;    /* mount information */
    struct dentry   *dentry; /* dentry */
};

File Access Modes

// include/linux/fs.h
#define FMODE_READ          ((__force fmode_t)0x1)    /* open for reading */
#define FMODE_WRITE         ((__force fmode_t)0x2)    /* open for writing */
#define FMODE_LSEEK         ((__force fmode_t)0x4)    /* seekable */
#define FMODE_PREAD         ((__force fmode_t)0x8)    /* pread allowed */
#define FMODE_PWRITE        ((__force fmode_t)0x10)   /* pwrite allowed */
#define FMODE_EXEC          ((__force fmode_t)0x20)   /* opened for execution */
#define FMODE_NDELAY        ((__force fmode_t)0x40)   /* non-blocking (no longer defined in recent kernels) */
#define FMODE_EXCL          ((__force fmode_t)0x80)   /* exclusive access (no longer defined in recent kernels) */

Lifecycle of the file Object

# List a process's file descriptors under /proc/[pid]/fd
ls -la /proc/self/fd
# Example output:
# lrwx------ 1 user user 64 Jan 15 10:00 0 -> /dev/pts/0
# lrwx------ 1 user user 64 Jan 15 10:00 1 -> /dev/pts/0
# lrwx------ 1 user user 64 Jan 15 10:00 2 -> /dev/pts/0
# lr-x------ 1 user user 64 Jan 15 10:00 3 -> /proc/self/fd

# Show per-descriptor details under /proc/[pid]/fdinfo
cat /proc/self/fdinfo/0
# Example output:
# pos:    0              (file position)
# flags:  0100002        (flags: O_RDWR | O_LARGEFILE)
# mnt_id: 26             (mount ID)
# ino:    3              (inode number)

# Check the system-wide number of open files
cat /proc/sys/fs/file-nr
# Example output: 5344   0   1048576
# Format: allocated  unused  maximum

# System-wide maximum number of open files
cat /proc/sys/fs/file-max
# Output: 1048576

# Per-process maximum number of open files (soft limit)
ulimit -n
# Output: 1024 (a common default)

# Change the limit for the current shell (raising it above the hard limit requires privileges)
ulimit -n 65536

# Persistent setting in /etc/security/limits.conf
# * soft nofile 65536
# * hard nofile 65536
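The same limits can be read from inside a program. The sketch below uses the standard `resource` module for the per-process limit and parses the three-field file-nr format shown above; `nofile_limits` and `parse_file_nr` are hypothetical helper names.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def parse_file_nr(text: str) -> dict:
    """Parse /proc/sys/fs/file-nr: allocated, unused, and maximum file handles."""
    allocated, unused, maximum = (int(x) for x in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"per-process limit: soft={soft} hard={hard}")
    try:
        with open("/proc/sys/fs/file-nr") as f:
            print(parse_file_nr(f.read()))
    except (OSError, ValueError) as exc:
        print(f"file-nr not available: {exc}")
```

The soft limit is what `ulimit -n` reports; a process may raise it up to the hard limit with `resource.setrlimit` without extra privileges.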

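Dentry Cache (dcache)

Overview of the dcache

The dentry cache (dcache) is the kernel mechanism that greatly speeds up filesystem access by caching the results of pathname resolution. Pathname resolution happens on every access by path, and without the dcache every path component would require a lookup in the underlying filesystem, which often means disk I/O.

dcache Data Structures

The dcache is built from three main data structures:

1. Hash table (dentry_hashtable)
   - Fast lookup of a dentry from a pathname component
   - Hash input: parent dentry + name string

2. LRU list (list of unused dentries)
   - Holds dentries whose reference count has dropped to 0
   - Oldest entries are freed first under memory pressure

3. Child list (d_children)
   - Tracks the child dentries of a parent directory
   - Used to walk the cached tree (for example when pruning a subtree); directory listing on a disk filesystem asks the filesystem rather than this list

┌─────────────────────────────────────────────────────────┐
│                    dcache hash table                    │
│                                                         │
│  bucket[0] ─► dentry("etc") ─► dentry("usr") ─► ...     │
│  bucket[1] ─► dentry("home") ─► ...                     │
│  bucket[2] ─► dentry("passwd") ─► dentry("shadow")─►... │
│  ...                                                    │
│  bucket[n] ─► dentry("file.txt") ─► ...                 │
│                                                         │
│  Dentries are bucketed by hash(parent, name)            │
└─────────────────────────────────────────────────────────┘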

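dcache Lookup Algorithm

// fs/dcache.c (simplified)
struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
    unsigned int hash = name->hash;
    struct hlist_bl_head *b = d_hash(hash);
    struct hlist_bl_node *node;
    struct dentry *found = NULL;
    struct dentry *dentry;

    rcu_read_lock();

    hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
        // Cheap hash comparison first, without taking any lock
        if (dentry->d_name.hash != hash)
            continue;

        // Re-check under d_lock so that a concurrent rename cannot change
        // the parent or the name while they are being compared
        spin_lock(&dentry->d_lock);
        if (dentry->d_parent == parent &&
            !d_unhashed(dentry) &&
            d_same_name(dentry, parent, name)) {
            dentry->d_lockref.count++;   // take a reference for the caller
            found = dentry;
        }
        spin_unlock(&dentry->d_lock);
        if (found)
            break;
    }

    rcu_read_unlock();
    return found;
}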
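The effect of caching on (parent, name), including negative results, can be modeled in a few lines. The following is a toy user-space model, not the kernel algorithm: `ToyDcache` is a hypothetical class, the "filesystem" is any callable that maps (parent inode number, name) to an inode number or None, and locking, reference counting, and eviction are left out.

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[int, str]


class ToyDcache:
    """Toy model of a dentry cache keyed by (parent inode number, name)."""

    def __init__(self, backend: Callable[[int, str], Optional[int]]):
        self._backend = backend                 # stands in for the filesystem's lookup
        self._table: Dict[Key, Optional[int]] = {}
        self.backend_calls = 0

    def lookup(self, parent_ino: int, name: str) -> Optional[int]:
        key = (parent_ino, name)
        if key in self._table:
            # Hit. A stored None is a negative entry: "known not to exist".
            return self._table[key]
        self.backend_calls += 1                 # miss: ask the "filesystem"
        ino = self._backend(parent_ino, name)
        self._table[key] = ino                  # cache positive and negative results alike
        return ino


if __name__ == "__main__":
    fs = {(2, "etc"): 10, (10, "passwd"): 11}
    cache = ToyDcache(lambda parent, name: fs.get((parent, name)))
    for _ in range(3):
        cache.lookup(2, "etc")
        cache.lookup(10, "nonexistent")
    print(f"backend lookups: {cache.backend_calls}")
```

Six lookups reach the backend only twice: once for the existing name and once for the missing one, which is the saving that negative dentries provide.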

dcache Statistics

# Check dcache statistics
cat /proc/sys/fs/dentry-state
# Output: 86232   72158   45      0       5765    0
# nr_dentry: total number of dentries
# nr_unused: number of unused dentries (on the LRU)
# age_limit: age limit in seconds
# want_pages: pages requested by memory reclaim
# nr_negative: number of negative dentries
# dummy: unused

# dentry cache usage in the slab allocator (/proc/slabinfo is readable only by root)
sudo head -2 /proc/slabinfo
sudo grep dentry /proc/slabinfo
# Example output:
# dentry  86232  86400    192   21    1 : tunables    0    0    0 : slabdata   4114   4114      0

# Check via /proc/meminfo
grep -E "^(SReclaimable|SUnreclaim|Slab)" /proc/meminfo
# Example output:
# Slab:           186432 kB
# SReclaimable:   142560 kB   (includes the dcache and inode cache)
# SUnreclaim:      43872 kB

# Drop the dcache (free memory)
# Caution: run this carefully in production
echo 2 | sudo tee /proc/sys/vm/drop_caches
# 1 = drop the page cache
# 2 = drop the dentry and inode caches
# 3 = both 1 and 2

Tuning the dcache

# vfs_cache_pressure parameter
# Controls how aggressively the dcache and inode cache are reclaimed
cat /proc/sys/vm/vfs_cache_pressure
# Default: 100

# Lower values favor keeping the dcache (faster path resolution)
echo 50 | sudo tee /proc/sys/vm/vfs_cache_pressure

# Higher values reclaim the dcache more aggressively (saves memory)
echo 200 | sudo tee /proc/sys/vm/vfs_cache_pressure

# Persistent setting (/etc/sysctl.conf)
echo "vm.vfs_cache_pressure = 50" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

inode Cache

Overview of the inode Cache

The inode cache keeps inode information read from disk in memory, which speeds up repeated inode access. It works closely with the dcache: an inode referenced by a dentry is kept in the cache.

inode Cache Data Structures

┌─────────────────────────────────────────────────────────┐
│                    inode hash table                     │
│                                                         │
│  bucket[0] ─► inode(dev=259:2, ino=2) ─► ...            │
│  bucket[1] ─► inode(dev=259:2, ino=1048737) ─► ...      │
│  ...                                                    │
│                                                         │
│  Inodes are bucketed by hash(superblock, inode number)  │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                 inode state transitions                 │
│                                                         │
│  [disk] ──read──► [I_NEW] ──init done──► [in use]       │
│                                              │          │
│                                     dirtied  │          │
│                                              ▼          │
│                                          [I_DIRTY]      │
│                                              │          │
│                                   writeback  │          │
│                                              ▼          │
│                                          [I_SYNC]       │
│                                              │          │
│  [I_FREEING] ◄── no refs ◄── [cached] ◄──────┘          │
└─────────────────────────────────────────────────────────┘

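Managing the inode cache

// fs/inode.c (simplified)

// Get an inode: from the cache if present, otherwise allocate a new one
// for the caller to fill in from disk
struct inode *iget_locked(struct super_block *sb, unsigned long ino)
{
    struct hlist_head *head = inode_hashtable + hash(sb, ino);
    struct inode *inode, *old;

    // Search the hash table first
    spin_lock(&inode_hash_lock);
    inode = find_inode_fast(sb, head, ino);
    spin_unlock(&inode_hash_lock);
    if (inode) {
        // Cache hit: wait until any in-progress initialization finishes
        wait_on_inode(inode);
        return inode;
    }

    // Cache miss: allocate outside the spinlock, since allocation may sleep
    inode = alloc_inode(sb);
    if (!inode)
        return NULL;

    spin_lock(&inode_hash_lock);
    old = find_inode_fast(sb, head, ino);  // recheck: another task may have inserted it
    if (!old) {
        inode->i_ino = ino;
        inode->i_state = I_NEW;
        hlist_add_head(&inode->i_hash, head);
        spin_unlock(&inode_hash_lock);
        return inode;  // returned with I_NEW set; the caller initializes it
    }
    spin_unlock(&inode_hash_lock);

    destroy_inode(inode);  // lost the race: discard ours, use the existing inode
    wait_on_inode(old);
    return old;
}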
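
The lookup-or-insert pattern above can be tried out in user space. The following is only a minimal sketch under stated assumptions: `toy_inode`, `toy_hash` and `toy_iget` are hypothetical names invented for illustration, the hash function is arbitrary, and there is no locking, so it mirrors the shape of the kernel code rather than its implementation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; none of these names are kernel APIs. */
#define TOY_HASH_BITS 4
#define TOY_HASH_SIZE (1u << TOY_HASH_BITS)

struct toy_inode {
    const void *sb;          /* identifies the owning filesystem instance */
    unsigned long ino;       /* inode number within that filesystem */
    int is_new;              /* plays the role of I_NEW: caller must initialize */
    struct toy_inode *next;  /* bucket chain */
};

static struct toy_inode *toy_table[TOY_HASH_SIZE];

/* Mix the superblock pointer and the inode number into a bucket index. */
static unsigned int toy_hash(const void *sb, unsigned long ino)
{
    unsigned long h = ino ^ ((unsigned long)(uintptr_t)sb >> 4);

    h ^= h >> TOY_HASH_BITS;
    return (unsigned int)(h & (TOY_HASH_SIZE - 1));
}

/* Return the cached inode for (sb, ino), or insert a fresh one marked new. */
struct toy_inode *toy_iget(const void *sb, unsigned long ino)
{
    unsigned int b = toy_hash(sb, ino);
    struct toy_inode *it;

    for (it = toy_table[b]; it; it = it->next)
        if (it->sb == sb && it->ino == ino)
            return it;           /* cache hit */

    it = calloc(1, sizeof(*it));
    if (!it)
        return NULL;
    it->sb = sb;
    it->ino = ino;
    it->is_new = 1;              /* cache miss: the caller fills it in */
    it->next = toy_table[b];
    toy_table[b] = it;
    return it;
}
```

Calling `toy_iget()` twice with the same pair returns the same object, while the same inode number under a different `sb` yields a distinct one. This is why the kernel hashes on the superblock as well as the inode number.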

Monitoring the inode cache

# inode cache statistics
cat /proc/sys/fs/inode-state
# Output: 65432   23456   0       0       0       0       0
# Field 1 (nr_inodes): number of inodes the kernel has allocated
# Field 2 (nr_free_inodes): number of those that are currently unused

# Slab usage of the inode caches (/proc/slabinfo is readable by root only)
sudo grep inode /proc/slabinfo
# Example output:
# ext4_inode_cache    12345  12400    1080    3    1 : tunables ...
# inode_cache          5678   5700     608    6    1 : tunables ...

# inode cache details (slabtop)
sudo slabtop -o -s c | grep -E "inode|OBJS"
# OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
# 12345  12000  97%    1.05K   4115        3     32920K ext4_inode_cache
#  5678   5500  96%    0.59K    948        6      7584K inode_cache

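Page cache integration

Role of the page cache

The page cache keeps file contents (data) in memory. The VFS ties each file to the page cache through the inode's i_mapping field, which points to a struct address_space.

address_space and the page cache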

// include/linux/fs.h
struct address_space {
    struct inode            *host;          /* owning inode */
    struct xarray           i_pages;        /* cached pages (XArray) */
    struct rw_semaphore     invalidate_lock; /* invalidation lock */
    gfp_t                   gfp_mask;       /* memory allocation flags */
    atomic_t                i_mmap_writable; /* number of VM_SHARED mappings */
    struct rb_root_cached   i_mmap;         /* tree of mmap'ed ranges */
    unsigned long           nrpages;        /* number of pages */
    pgoff_t                 writeback_index; /* where writeback starts */
    const struct address_space_operations *a_ops; /* operations table */
    unsigned long           flags;          /* flags */
    errseq_t                wb_err;         /* writeback error */
    spinlock_t              i_private_lock; /* private list lock */
    struct list_head        i_private_list; /* private list */
    struct rw_semaphore     i_mmap_rwsem;   /* mmap semaphore */
};

address_space_operations

struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    int (*read_folio)(struct file *, struct folio *);
    int (*writepages)(struct address_space *, struct writeback_control *wbc);
    bool (*dirty_folio)(struct address_space *, struct folio *);
    void (*readahead)(struct readahead_control *);
    int (*write_begin)(struct file *, struct address_space *mapping,
                       loff_t pos, unsigned len,
                       struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
                     loff_t pos, unsigned len, unsigned copied,
                     struct page *page, void *fsdata);
    sector_t (*bmap)(struct address_space *, sector_t);
    int (*swap_activate)(struct swap_info_struct *, struct file *, sector_t *);
    int (*swap_deactivate)(struct file *);
};

Page cache read flow

read() system call
    │
    ▼
generic_file_read_iter()
    │
    ├── filemap_read()
    │   │
    │   ├── filemap_get_pages()
    │   │   │
    │   │   ├── filemap_get_read_batch()
    │   │   │   └── look up pages in the XArray
    │   │   │       │
    │   │   │       ├── [hit]  → return the pages
    │   │   │       │
    │   │   │       └── [miss] → page_cache_sync_readahead()
    │   │   │                    │
    │   │   │                    └── a_ops->readahead()
    │   │   │                        (read from disk)
    │   │   │
    │   │   └── folio_mark_accessed()  (record the access)
    │   │
    │   └── copy_folio_to_iter()  (copy to user space)
    │
    └── return the number of bytes read

Monitoring and managing the page cache

# Page cache usage
free -h
# Example output:
#               total    used    free   shared  buff/cache   available
# Mem:            31Gi   8.2Gi    15Gi    256Mi       8.1Gi      22Gi
# Swap:          8.0Gi      0B    8.0Gi

# Breakdown of buff/cache
cat /proc/meminfo | grep -E "^(Cached|Buffers|SwapCached|Active\(file\)|Inactive\(file\))"
# Example output:
# Buffers:          234568 kB
# Cached:          7654321 kB
# SwapCached:            0 kB
# Active(file):    3456789 kB
# Inactive(file):  4197532 kB

# Check the page cache state of a specific file (vmtouch tool)
# Install: sudo apt install vmtouch
vmtouch /var/log/syslog
# Example output:
#            Files: 1
#      Directories: 0
#   Resident Pages: 1024/1024  4M/4M  100%
#          Elapsed: 0.00021 seconds

# Preload a file into the page cache
vmtouch -t /var/lib/mysql/ibdata1

# Evict a file from the page cache
vmtouch -e /tmp/large_file

# Drop the entire page cache
echo 1 | sudo tee /proc/sys/vm/drop_caches

# Inspect a process's memory mappings in detail via /proc/[pid]/smaps
cat /proc/self/smaps | head -30

# cachestat() (system call added in Linux 6.5)
# Reports, for a byte range of a file, how many pages are cached, dirty,
# under writeback, evicted, and recently evicted
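
Per-file residency can also be queried from a program. The sketch below uses mmap(2) and mincore(2), the same interface that tools such as vmtouch rely on. The function name `resident_pages` is a hypothetical helper invented for this example, and the result is only a snapshot that can change as soon as the call returns.

```c
#define _DEFAULT_SOURCE      /* for mincore() on glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count how many pages of `path` are currently resident in memory.
 * Returns the number of resident pages, or -1 on error. If total_pages
 * is non-NULL it receives the file size rounded up to whole pages.
 */
long resident_pages(const char *path, long *total_pages)
{
    long page = sysconf(_SC_PAGESIZE);
    long npages, resident = 0;
    unsigned char *vec;
    struct stat st;
    void *map;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    npages = (long)((st.st_size + page - 1) / page);
    if (total_pages)
        *total_pages = npages;
    if (npages == 0) {           /* empty file: nothing to map */
        close(fd);
        return 0;
    }

    map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                   /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;

    vec = malloc((size_t)npages);
    if (vec && mincore(map, (size_t)st.st_size, vec) == 0) {
        for (long i = 0; i < npages; i++)
            resident += vec[i] & 1;   /* bit 0 set = page is resident */
    } else {
        resident = -1;
    }
    free(vec);
    munmap(map, (size_t)st.st_size);
    return resident;
}
```

Mapping the file does not itself read it in, so a file that has never been accessed normally reports few or no resident pages, while one that was just written or read reports most of them.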

Configuring readahead

# Check the readahead size of a block device
blockdev --getra /dev/sda
# Output: 256 (in 512-byte sectors = 128 KB)

# Set the readahead size
sudo blockdev --setra 512 /dev/sda  # set to 256 KB

# Check via /sys
cat /sys/block/sda/queue/read_ahead_kb
# Output: 128

# Change the setting
echo 256 | sudo tee /sys/block/sda/queue/read_ahead_kb

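File operations (struct file_operations)

Overview of file_operations

struct file_operations is a table of function pointers that defines the operations available on an open file. Each filesystem plugs into the VFS by supplying implementations for the entries it supports.

Definition of struct file_operations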

// include/linux/fs.h (Linux 6.x)
struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
    int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *, unsigned int flags);
    int (*iterate_shared) (struct file *, struct dir_context *);
    __poll_t (*poll) (struct file *, struct poll_table_struct *);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    unsigned long mmap_supported_flags;
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *, fl_owner_t id);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long,
                                       unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int);
    int (*flock) (struct file *, int, struct file_lock *);
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *,
                            loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *,
                           struct pipe_inode_info *, size_t, unsigned int);
    void (*splice_eof)(struct file *file);
    int (*setlease)(struct file *, int, struct file_lease **, void **);
    long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len);
    void (*show_fdinfo)(struct seq_file *m, struct file *f);
    ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t,
                               size_t, unsigned int);
    loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                               struct file *file_out, loff_t pos_out,
                               loff_t len, unsigned int remap_flags);
    int (*fadvise)(struct file *, loff_t, loff_t, int);
    int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
    int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
                            unsigned int poll_flags);
};

Example file_operations implementations by filesystem

// ext4 file_operations (fs/ext4/file.c)
const struct file_operations ext4_file_operations = {
    .llseek         = ext4_llseek,
    .read_iter      = ext4_file_read_iter,
    .write_iter     = ext4_file_write_iter,
    .iopoll         = iocb_bio_iopoll,
    .unlocked_ioctl = ext4_ioctl,
    .compat_ioctl   = ext4_compat_ioctl,
    .mmap           = ext4_file_mmap,
    .open           = ext4_file_open,
    .release        = ext4_release_file,
    .fsync          = ext4_sync_file,
    .get_unmapped_area = thp_get_unmapped_area,
    .splice_read    = ext4_file_splice_read,
    .splice_write   = iter_file_splice_write,
    .fallocate      = ext4_fallocate,
    .copy_file_range = ext4_copy_file_range,
    .fadvise        = ext4_file_fadvise,
};

// tmpfs (shmem) file_operations (mm/shmem.c)
static const struct file_operations shmem_file_operations = {
    .mmap           = shmem_mmap,
    .open           = shmem_file_open,
    .get_unmapped_area = shmem_get_unmapped_area,
    .llseek         = shmem_file_llseek,
    .read_iter      = shmem_file_read_iter,
    .write_iter     = generic_file_write_iter,
    .fsync          = noop_fsync,
    .splice_read    = shmem_file_splice_read,
    .splice_write   = iter_file_splice_write,
    .fallocate      = shmem_fallocate,
};

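// procfs: since Linux 5.6, /proc entries use struct proc_ops rather than
// struct file_operations; the seq_file helpers are unchanged
static const struct proc_ops example_proc_ops = {
    .proc_open    = example_proc_open,  // hypothetical handler, typically calls single_open()
    .proc_read    = seq_read,
    .proc_lseek   = seq_lseek,
    .proc_release = single_release,
};
// /proc/meminfo itself is registered with proc_create_single(), which
// wires up these seq_file callbacks internally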
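
The tables above differ only in which functions they plug in; the caller never needs to know which filesystem it is talking to. That dispatch idea can be shown in a few lines of user-space C. This is a deliberately tiny sketch: `toy_file`, `toy_file_ops`, `mem_ops`, `null_ops` and `toy_vfs_read` are hypothetical names invented here, not kernel interfaces.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical miniature of the file_operations idea (not kernel code). */
struct toy_file;

struct toy_file_ops {
    ssize_t (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *ops;  /* chosen by the "filesystem" at open time */
    const char *data;
    size_t size;
    size_t pos;
};

/* One "filesystem": serves bytes from a memory buffer. */
static ssize_t mem_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = f->size - f->pos;

    if (len > left)
        len = left;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return (ssize_t)len;
}

/* Another "filesystem": always at end of file, like /dev/null. */
static ssize_t null_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f; (void)buf; (void)len;
    return 0;
}

const struct toy_file_ops mem_ops  = { .read = mem_read };
const struct toy_file_ops null_ops = { .read = null_read };

/* The "VFS" entry point: it only follows the function pointer. */
ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    if (!f->ops || !f->ops->read)
        return -1;               /* no handler installed: operation unsupported */
    return f->ops->read(f, buf, len);
}
```

A `toy_file` set up with `mem_ops` returns its buffer contents and then 0 at end of file, while one set up with `null_ops` returns 0 immediately, even though both are read through the same `toy_vfs_read()` call.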

Example: implementing file_operations in a custom module

// Example file operations in a simple kernel module
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

#define DEVICE_NAME "mydevice"
#define BUFFER_SIZE 4096

static char device_buffer[BUFFER_SIZE];
static int buffer_size = 0;

static int my_open(struct inode *inode, struct file *file)
{
    pr_info("mydevice: opened\n");
    return 0;
}

static int my_release(struct inode *inode, struct file *file)
{
    pr_info("mydevice: closed\n");
    return 0;
}

// Note: no locking is shown; concurrent readers and writers would need a mutex
static ssize_t my_read(struct file *file, char __user *buf,
                        size_t count, loff_t *offset)
{
    size_t bytes_to_read;

    if (*offset < 0)
        return -EINVAL;
    if (*offset >= buffer_size)
        return 0;  // EOF

    // Compare as size_t so a huge count cannot turn negative after a cast
    bytes_to_read = min_t(size_t, count, buffer_size - *offset);

    if (copy_to_user(buf, device_buffer + *offset, bytes_to_read))
        return -EFAULT;

    *offset += bytes_to_read;
    return bytes_to_read;
}

static ssize_t my_write(struct file *file, const char __user *buf,
                         size_t count, loff_t *offset)
{
    size_t bytes_to_write;

    if (*offset < 0)
        return -EINVAL;
    if (*offset >= BUFFER_SIZE)
        return -ENOSPC;  // buffer is full

    bytes_to_write = min_t(size_t, count, BUFFER_SIZE - *offset);

    if (copy_from_user(device_buffer + *offset, buf, bytes_to_write))
        return -EFAULT;

    *offset += bytes_to_write;
    if (*offset > buffer_size)
        buffer_size = *offset;

    return bytes_to_write;
}

static loff_t my_llseek(struct file *file, loff_t offset, int whence)
{
    loff_t new_pos;
    
    switch (whence) {
    case SEEK_SET:
        new_pos = offset;
        break;
    case SEEK_CUR:
        new_pos = file->f_pos + offset;
        break;
    case SEEK_END:
        new_pos = buffer_size + offset;
        break;
    default:
        return -EINVAL;
    }
    
    if (new_pos < 0 || new_pos > BUFFER_SIZE)
        return -EINVAL;
    
    file->f_pos = new_pos;
    return new_pos;
}

static const struct file_operations my_fops = {
    .owner   = THIS_MODULE,
    .open    = my_open,
    .release = my_release,
    .read    = my_read,
    .write   = my_write,
    .llseek  = my_llseek,
};

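Default VFS implementations

The kernel provides default implementations that many filesystems can share:

// Main default implementations
generic_file_read_iter()     // generic file read
generic_file_write_iter()    // generic file write
generic_file_mmap()          // generic mmap
generic_file_llseek()        // generic seek
filemap_splice_read()        // generic splice read (replaced generic_file_splice_read() in Linux 6.5)
noop_fsync()                 // no sync needed (tmpfs, etc.)
simple_read_from_buffer()    // read from a kernel buffer
simple_write_to_buffer()     // write into a kernel buffer
seq_read()                   // sequential file read (widely used in procfs)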

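Mount mechanism and mount namespaces

Basic concepts of mounting

Mounting attaches a filesystem at a specific location in the directory tree. On Linux, all filesystems are merged into a single directory tree.

Mount-related kernel structures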

// fs/mount.h
struct mount {
    struct hlist_node       mnt_hash;       /* mount hash table linkage */
    struct mount            *mnt_parent;    /* parent mount */
    struct dentry           *mnt_mountpoint; /* mount point dentry */
    struct vfsmount         mnt;            /* VFS mount information */
    union {
        struct rcu_head     mnt_rcu;
        struct llist_node   mnt_llist;
    };
    struct list_head        mnt_mounts;     /* list of child mounts */
    struct list_head        mnt_child;      /* entry in the parent's mnt_mounts */
    struct list_head        mnt_instance;   /* entry in the superblock's mount list */
    const char              *mnt_devname;   /* device name */
    union {
        struct rb_node      mnt_node;       /* node in the namespace's RB tree */
        struct list_head    mnt_list;
    };
    struct list_head        mnt_expire;     /* expiry list */
    struct list_head        mnt_share;      /* shared mount list */
    struct list_head        mnt_slave_list; /* list of slave mounts */
    struct list_head        mnt_slave;      /* entry in the master's slave list */
    struct mount            *mnt_master;    /* master mount */
    struct mnt_namespace    *mnt_ns;        /* owning namespace */
    struct mountpoint       *mnt_mp;        /* mount point */
    union {
        struct hlist_node   mnt_mp_list;
        struct hlist_node   mnt_umount;
    };
    struct list_head        mnt_umounting;
    int                     mnt_id;         /* mount ID */
    u64                     mnt_id_unique;  /* unique mount ID */
    int                     mnt_group_id;   /* peer group ID */
    int                     mnt_expiry_mark; /* expiry mark */
    struct hlist_head       mnt_pins;
    struct hlist_head       mnt_stuck_children;
};

// include/linux/mount.h
struct vfsmount {
    struct dentry           *mnt_root;      /* root dentry of the mount */
    struct super_block      *mnt_sb;        /* superblock */
    int                     mnt_flags;      /* mount flags */
    struct mnt_idmap        *mnt_idmap;     /* ID mapping */
};

Kinds of mounts

# Ordinary mount
sudo mount /dev/sda1 /mnt/data

# Bind mount (mount part of an existing directory tree at another location)
sudo mount --bind /var/log /mnt/logs

# Recursive bind mount
sudo mount --rbind /home /mnt/home

# Remount read-only
sudo mount -o remount,ro /mnt/data

# tmpfs mount
sudo mount -t tmpfs -o size=2G tmpfs /mnt/ramdisk

# Loopback mount (ISO images, etc.)
sudo mount -o loop disk.iso /mnt/iso

# Overlay mount (used by Docker and other container runtimes)
sudo mount -t overlay overlay \
    -o lowerdir=/lower,upperdir=/upper,workdir=/work \
    /merged

# NFS mount
sudo mount -t nfs server:/export /mnt/nfs

# Common mount options
# rw/ro    - read-write / read-only
# suid/nosuid - honor / ignore SUID bits
# dev/nodev   - allow / disallow device files
# exec/noexec - allow / disallow execution
# sync/async  - synchronous / asynchronous I/O
# atime/noatime/relatime - how access times are updated
# diratime/nodiratime - access times on directories
# user/nouser - allow / disallow mounting by ordinary users
# defaults    - rw,suid,dev,exec,auto,nouser,async

Mount namespaces

A mount namespace is one of the Linux namespaces: it gives a group of processes its own view of the set of mounts. It is one of the foundations of container technology.

# Current namespace information
ls -la /proc/self/ns/mnt
# lrwxrwxrwx 1 user user 0 Jan 15 10:00 /proc/self/ns/mnt -> 'mnt:[4026531841]'

# Start a shell in a new mount namespace
sudo unshare --mount /bin/bash

# Change mounts inside the new namespace
mount -t tmpfs tmpfs /tmp   # this change does not affect the parent namespace

# Mount propagation types
# shared  - mount changes propagate in both directions
# slave   - changes propagate only from master to slave
# private - no propagation
# unbindable - private, and cannot be bind-mounted

# Set mount propagation
sudo mount --make-shared /mnt/data
sudo mount --make-slave /mnt/data
sudo mount --make-private /mnt/data
sudo mount --make-unbindable /mnt/data

# Set propagation recursively
sudo mount --make-rshared /mnt/data
sudo mount --make-rslave /mnt/data
sudo mount --make-rprivate /mnt/data

# Check mount propagation
cat /proc/self/mountinfo | head -5
# Example output:
# 22 1 259:2 / / rw,relatime shared:1 - ext4 /dev/nvme0n1p2 rw
# 23 22 0:21 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
# 24 22 0:22 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs rw
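
The propagation tag sits in the optional fields of each mountinfo line, between the mount options and the lone "-" separator. As a rough illustration of that layout, the sketch below pulls out the peer group number. `mountinfo_peer_group` is a hypothetical helper written for this example; it assumes lines shorter than 1 KiB and reports only the `shared:N` tag.

```c
#define _POSIX_C_SOURCE 200809L   /* for strtok_r() */
#include <stdio.h>
#include <string.h>

/*
 * Extract the peer group from one /proc/self/mountinfo line.
 * Returns N for "shared:N", 0 if the mount carries no shared tag
 * (private or slave-only), or -1 if the line cannot be parsed.
 */
int mountinfo_peer_group(const char *line)
{
    char copy[1024];
    char *save = NULL, *tok;
    int field = 0, group = 0;

    if (strlen(line) >= sizeof(copy))
        return -1;
    strcpy(copy, line);

    for (tok = strtok_r(copy, " \n", &save); tok;
         tok = strtok_r(NULL, " \n", &save)) {
        field++;
        if (field <= 6)
            continue;        /* fixed fields: IDs, device, root, mount point, options */
        if (strcmp(tok, "-") == 0)
            return group;    /* the separator ends the optional fields */
        sscanf(tok, "shared:%d", &group);   /* leaves group unchanged on other tags */
    }
    return -1;               /* no "-" separator seen */
}
```

Feeding it the first example line above gives 1 and the /proc line gives 5. A line whose only optional field is `master:1` gives 0, because a slave mount receives events but is not itself a member of a shared peer group.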

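New mount API (fsopen/fsmount)

Linux 5.2 introduced a new, file-descriptor-based mount API:

// Using the new mount API (error handling omitted for brevity)
// glibc 2.36 and later declare these wrappers in <sys/mount.h>;
// on older systems they have to be invoked through syscall(2)
#include <sys/mount.h>
#include <fcntl.h>      /* AT_FDCWD */
#include <unistd.h>     /* close */

// 1. Create a filesystem context
int fsfd = fsopen("ext4", FSOPEN_CLOEXEC);

// 2. Set the source device
fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0);

// 3. Set filesystem-specific options
//    (per-mount attributes such as noatime are not filesystem options;
//    they are passed to fsmount() in step 5)
fsconfig(fsfd, FSCONFIG_SET_STRING, "errors", "remount-ro", 0);

// 4. Create the superblock
fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

// 5. Obtain a mount object
int mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOATIME);

// 6. Attach it to the directory tree
move_mount(mntfd, "", AT_FDCWD, "/mnt/data", MOVE_MOUNT_F_EMPTY_PATH);

close(fsfd);
close(mntfd);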

Pseudo-filesystems

Overview of pseudo-filesystems

A pseudo-filesystem has no backing data on disk. Most of them, such as procfs and sysfs, expose kernel information through the filesystem interface, while tmpfs keeps ordinary file data in memory.

procfs (/proc)

procfs is a pseudo-filesystem that exposes kernel and process information.

# Main procfs entries
/proc/
├── [pid]/                  # per-process information
│   ├── cmdline            # command-line arguments
│   ├── cwd -> /path       # current working directory
│   ├── environ            # environment variables
│   ├── exe -> /path       # link to the executable
│   ├── fd/                # file descriptors
│   ├── fdinfo/            # per-fd details
│   ├── maps               # memory mappings
│   ├── smaps              # detailed memory mappings
│   ├── status             # process status
│   ├── stat               # process statistics
│   ├── io                 # I/O statistics
│   ├── ns/                # namespaces
│   ├── mountinfo          # mount information
│   ├── mounts             # mount list
│   └── root -> /          # root directory
├── cpuinfo                 # CPU information
├── meminfo                 # memory information
├── vmstat                  # virtual memory statistics
├── loadavg                 # load average
├── uptime                  # uptime
├── filesystems            # supported filesystems
├── mounts                  # mount information
├── partitions             # partition information
├── diskstats              # disk I/O statistics
├── net/                    # network information
├── sys/                    # kernel parameters (sysctl)
│   ├── fs/                # filesystem-related
│   ├── kernel/            # kernel-related
│   ├── net/               # network-related
│   └── vm/                # virtual-memory-related
├── slabinfo               # slab allocator information
└── buddyinfo              # buddy allocator information

# Practical procfs examples
# List the supported filesystems
cat /proc/filesystems
# Example output:
# nodev   sysfs
# nodev   tmpfs
# nodev   bdev
# nodev   proc
# nodev   cgroup2
# nodev   devtmpfs
# nodev   debugfs
# nodev   tracefs
# nodev   securityfs
# nodev   sockfs
# nodev   bpf
# nodev   pipefs
# nodev   ramfs
# nodev   hugetlbfs
# nodev   devpts
# nodev   mqueue
# nodev   pstore
#         ext3
#         ext2
#         ext4
#         xfs
#         vfat
# nodev   overlay
# nodev   fuse
# nodev   fuseblk
# "nodev" marks filesystems that do not require a block device

sysfs (/sys)

sysfs is a filesystem that exposes the kernel's device model and driver information.

# Main sysfs directory layout
/sys/
├── block/                  # block devices
│   ├── sda/
│   │   ├── queue/         # I/O queue settings
│   │   │   ├── scheduler  # I/O scheduler
│   │   │   ├── read_ahead_kb
│   │   │   └── nr_requests
│   │   ├── size           # device size
│   │   └── stat           # I/O statistics
│   └── nvme0n1/
├── bus/                    # bus types
│   ├── pci/
│   ├── usb/
│   └── scsi/
├── class/                  # device classes
│   ├── net/               # network devices
│   ├── block/             # block devices
│   └── input/             # input devices
├── devices/                # device hierarchy
│   ├── system/
│   │   ├── cpu/           # CPU information
│   │   ├── memory/        # memory information
│   │   └── node/          # NUMA nodes
│   └── pci0000:00/        # PCI devices
├── firmware/               # firmware information
├── fs/                     # filesystem information
│   ├── ext4/
│   ├── xfs/
│   └── cgroup/
├── kernel/                 # kernel information
│   ├── mm/                # memory management
│   ├── slab/              # slab information
│   └── debug/             # debug information (debugfs mount point)
└── module/                 # kernel modules

# Practical sysfs examples
# Check and change the I/O scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none

echo "kyber" | sudo tee /sys/block/sda/queue/scheduler

# Block device queue parameters
cat /sys/block/sda/queue/nr_requests      # request queue depth
cat /sys/block/sda/queue/max_sectors_kb   # maximum I/O size
cat /sys/block/sda/queue/rotational       # 0 = SSD, 1 = HDD

# Check the CPU frequency
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

tmpfs

tmpfs is a filesystem backed by memory (RAM) and swap.

# Mount a tmpfs
sudo mount -t tmpfs -o size=1G,mode=1777 tmpfs /mnt/ramdisk

# List current tmpfs mounts
df -h -t tmpfs
# Example output:
# Filesystem      Size  Used Avail Use% Mounted on
# tmpfs            3.9G  1.2M  3.9G   1% /dev/shm
# tmpfs            792M  1.7M  790M   1% /run
# tmpfs            5.0M  4.0K  5.0M   1% /run/lock

# Resize /dev/shm
sudo mount -o remount,size=8G /dev/shm

# Configuration in /etc/fstab
# tmpfs  /tmp     tmpfs  defaults,noatime,nosuid,nodev,mode=1777,size=4G  0 0
# tmpfs  /dev/shm tmpfs  defaults,noatime,nosuid,nodev,size=8G           0 0

# Characteristics of tmpfs
# - Files are kept in memory
# - Size is dynamic (memory is consumed only for what is stored)
# - Pages can be swapped out under memory pressure
# - Contents are lost on reboot
# - Integrated with the page cache

devtmpfs

devtmpfs is a pseudo-filesystem in which the kernel creates device nodes automatically.

# Check devtmpfs
mount | grep devtmpfs
# devtmpfs on /dev type devtmpfs (rw,nosuid,relatime,size=...)

# Contents of devtmpfs
ls -la /dev/ | head -20
# Example output:
# crw-rw-rw-  1 root root    1,   3 Jan 15 00:00 null
# crw-rw-rw-  1 root root    1,   5 Jan 15 00:00 zero
# crw-rw-rw-  1 root root    1,   7 Jan 15 00:00 full
# crw-rw-rw-  1 root root    1,   8 Jan 15 00:00 random
# crw-rw-rw-  1 root root    1,   9 Jan 15 00:00 urandom
# brw-rw----  1 root disk  259,   0 Jan 15 00:00 nvme0n1
# brw-rw----  1 root disk  259,   1 Jan 15 00:00 nvme0n1p1
# crw--w----  1 root tty     4,   0 Jan 15 00:00 tty0

# The kernel creates the device nodes in devtmpfs automatically
# udev (systemd-udevd) manages permissions and symbolic links

File locks

Overview of file locks

File locks are a mechanism for preventing conflicts when several processes access the same file at the same time. Linux supports two main kinds of file lock.

flock (BSD-style locks)

#include <sys/file.h>

// flock prototype
int flock(int fd, int operation);

// Values for operation
// LOCK_SH - shared lock (read lock)
// LOCK_EX - exclusive lock (write lock)
// LOCK_UN - unlock
// LOCK_NB - non-blocking (OR-ed with one of the above)

// flock usage example
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

int main(void)
{
    int fd = open("/tmp/lockfile", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    printf("Acquiring exclusive lock...\n");

    // Acquire an exclusive lock (blocking)
    if (flock(fd, LOCK_EX) < 0) {
        perror("flock");
        close(fd);
        return 1;
    }

    printf("Lock acquired. Working...\n");

    // Critical section
    sleep(10);

    // Release the lock
    if (flock(fd, LOCK_UN) < 0) {
        perror("flock unlock");
    }

    printf("Lock released.\n");
    close(fd);
    return 0;
}

// Non-blocking use
int try_lock(int fd)
{
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        if (errno == EWOULDBLOCK) {
            printf("The file is already locked\n");
            return -1;
        }
        perror("flock");
        return -1;
    }
    return 0;
}

POSIX locks (fcntl-based)

POSIX locks are a more flexible mechanism that can lock a specific byte range of a file.

#include <fcntl.h>

// The fcntl lock structure
struct flock {
    short l_type;    /* lock type: F_RDLCK, F_WRLCK, F_UNLCK */
    short l_whence;  /* origin: SEEK_SET, SEEK_CUR, SEEK_END */
    off_t l_start;   /* start offset of the lock */
    off_t l_len;     /* length of the lock (0 = through EOF) */
    pid_t l_pid;     /* PID of the lock holder (filled in by F_GETLK) */
};

// Usage example
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int lock_region(int fd, off_t start, off_t len, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;

    // F_SETLKW: acquire the lock, blocking until it is available
    // F_SETLK:  acquire the lock without blocking
    // F_GETLK:  query lock information
    return fcntl(fd, F_SETLKW, &fl);
}

int main(void)
{
    int fd = open("/tmp/datafile", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    // Take an exclusive lock on bytes 0-99
    printf("Locking bytes 0-99...\n");
    if (lock_region(fd, 0, 100, F_WRLCK) < 0) {
        perror("fcntl");
        close(fd);
        return 1;
    }

    printf("Lock acquired. Working...\n");
    sleep(10);

    // Release the lock
    lock_region(fd, 0, 100, F_UNLCK);
    printf("Lock released.\n");

    close(fd);
    return 0;
}

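Comparison of flock and POSIX locks

Property	flock	POSIX locks (fcntl)
Lock granularity	Whole file	Byte range
Inheritance across fork	Shared with the child (same open file description)	Not inherited by the child
Behavior on close	Released when the last fd for the open file description is closed	Released when the process closes any fd for the file
Behavior on NFS	Emulated with whole-file POSIX locks since Linux 2.6.12 (local-only before that)	Supported
Deadlock detection	No	Yes (F_SETLKW can fail with EDEADLK)
Thread awareness	Per open file description (not per thread)	Per process (not per thread)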

OFD locks (Open File Description Locks)

OFD locks, introduced in Linux 3.15, address the shortcomings of POSIX locks:

// Using OFD locks (F_OFD_SETLK, F_OFD_SETLKW, F_OFD_GETLK)
struct flock fl;
fl.l_type = F_WRLCK;
fl.l_whence = SEEK_SET;
fl.l_start = 0;
fl.l_len = 0;       // whole file
fl.l_pid = 0;       // must be 0 for OFD locks

// Acquire the OFD lock
fcntl(fd, F_OFD_SETLKW, &fl);

// Characteristics of OFD locks:
// - Associated with the open file description (not with a PID)
// - Not released when the same process closes a different fd for the file
// - Inherited by child processes across fork
// - Usable between threads that each open the file separately
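
To see the "owned by the open file description" rule in action, one process can open the same file twice and try to lock it through both descriptors. The sketch below does that. `ofd_try_wrlock` and `ofd_conflict_demo` are hypothetical helper names for this example, and the fallback definition of `F_OFD_SETLK` assumes the generic Linux value in case the libc headers do not expose it.

```c
#define _GNU_SOURCE              /* exposes F_OFD_SETLK in <fcntl.h> on glibc */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#ifndef F_OFD_SETLK
#define F_OFD_SETLK 37           /* generic Linux value; assumption if headers hide it */
#endif

/* Try a non-blocking whole-file OFD write lock; 0 on success, -errno on failure. */
int ofd_try_wrlock(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));  /* also sets l_pid = 0, which OFD locks require */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                /* 0 = through end of file */

    if (fcntl(fd, F_OFD_SETLK, &fl) < 0)
        return -errno;
    return 0;
}

/*
 * Open `path` twice in the same process and lock through each descriptor.
 * Returns the result of the second attempt, which should be refused
 * because the two descriptors belong to different open file descriptions.
 */
int ofd_conflict_demo(const char *path)
{
    int fd1 = open(path, O_RDWR | O_CREAT, 0644);
    int fd2 = open(path, O_RDWR);
    int ret = -EBADF;

    if (fd1 >= 0 && fd2 >= 0 && ofd_try_wrlock(fd1) == 0)
        ret = ofd_try_wrlock(fd2);
    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    return ret;
}
```

With classic POSIX locks the second request would succeed, because both descriptors belong to the same process. With OFD locks it is refused, which is what makes them usable for coordinating threads that each open the file themselves.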

Inspecting file locks

# Show the current file locks
cat /proc/locks
# Example output:
# 1: POSIX  ADVISORY  WRITE 1234 08:02:1048737 0 EOF
# 2: FLOCK  ADVISORY  WRITE 5678 08:02:2097153 0 EOF
# 3: POSIX  ADVISORY  READ  9012 08:02:3145729 100 200
# 4: OFDLCK ADVISORY  WRITE -1 08:02:4194305 0 EOF
#
# Format:
# number: lock type  ADVISORY/MANDATORY  READ/WRITE  PID  major:minor:inode  start  end
# (OFD locks are not owned by a single process, so their PID field shows -1)

# Inspect locks with the lslocks command (util-linux)
lslocks
# Example output:
# COMMAND  PID  TYPE   SIZE MODE  M START END PATH
# mysqld   1234 POSIX   16K WRITE 0     0   0 /var/lib/mysql/ibdata1
# vim      5678 FLOCK    0B WRITE 0     0   0 /tmp/.file.swp

# List the processes that have a given file open
# (fuser reports open files, not locks; use lslocks to find lock holders)
fuser -v /tmp/lockfile
# Example output:
#                      USER   PID ACCESS COMMAND
# /tmp/lockfile:       user  1234 F....  myprogram

Lock implementation inside the kernel

// include/linux/fs.h in older kernels; include/linux/filelock.h since Linux 6.3
// (Linux 6.9 moved the common fields into struct file_lock_core;
// the layout below is the earlier one)
struct file_lock {
    struct file_lock        *fl_blocker;     /* the lock blocking this one */
    struct list_head        fl_list;         /* linked list entry */
    struct hlist_node       fl_link;         /* global list entry */
    struct list_head        fl_blocked_requests; /* requests blocked on this lock */
    struct list_head        fl_blocked_member;   /* entry in the blocker's list */
    fl_owner_t              fl_owner;        /* owner */
    unsigned int            fl_flags;        /* lock flags */
    unsigned char           fl_type;         /* lock type */
    unsigned int            fl_pid;          /* PID */
    int                     fl_link_cpu;     /* CPU number */
    wait_queue_head_t       fl_wait;         /* wait queue */
    struct file             *fl_file;        /* file */
    loff_t                  fl_start;        /* start offset */
    loff_t                  fl_end;          /* end offset */
    const struct file_lock_operations *fl_ops; /* lock operations */
    const struct lock_manager_operations *fl_lmops; /* lock manager operations */
    union {
        struct nfs_lock_info nfs_fl;
        struct nfs4_lock_info nfs4_fl;
    } fl_u;
};

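File descriptors and file tables

How file descriptors work

A file descriptor (fd) is a non-negative integer that identifies a file a process has opened. Inside the kernel, the fd is an index into the process's file descriptor table.

Three-layer structure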

プロセス A                              プロセス B
┌────────────────┐                    ┌────────────────┐
│ fd table        │                    │ fd table        │
│ (task_struct    │                    │ (task_struct    │
│  ->files)       │                    │  ->files)       │
│                 │                    │                 │
│ fd[0] ──────────┼──┐                │ fd[0] ──────────┼──┐
│ fd[1] ──────────┼──┼──┐             │ fd[1] ──────────┼──┼──┐
│ fd[2] ──────────┼──┼──┼──┐          │ fd[2] ──────────┼──┼──┤
│ fd[3] ──────────┼──┼──┼──┼──┐       │ fd[3] ──────────┼──┼──┼──┐
└────────────────┘  │  │  │  │       └────────────────┘  │  │  │
                     │  │  │  │                           │  │  │
                     ▼  ▼  ▼  ▼                           ▼  ▼  ▼
               ┌─────────────────────────────────────────────────┐
               │                 open file table                  │
               │                  (system-wide)                   │
               │                                                  │
               │  file[A] ─── f_pos=0,   f_flags=O_RDONLY ────┐  │
               │  file[B] ─── f_pos=100, f_flags=O_RDWR  ────┤  │
               │  file[C] ─── f_pos=0,   f_flags=O_WRONLY ───┤  │
               │  file[D] ─── f_pos=50,  f_flags=O_RDONLY ───┤  │
               └──────────────────────────────────────────────┘  │
                                                                  │
                     ┌────────────────────────────────────────────┘
                     ▼
               ┌──────────────────┐
               │    inode table    │
               │                   │
               │  inode[X] (file1) │  ◄── dentry "file1"
               │  inode[Y] (file2) │  ◄── dentry "file2"
               │  inode[Z] (tty)   │  ◄── dentry "/dev/pts/0"
               └──────────────────┘
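
The middle layer is where the file offset lives, and that has a visible consequence. Descriptors created with dup() share one open file entry and therefore one f_pos, while a second open() of the same path gets its own entry. The sketch below illustrates this. `offset_sharing_demo` is a hypothetical helper written for this example.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write 5 bytes through one descriptor, then report the offsets seen
 * through a dup()ed descriptor and through an independent open() of the
 * same path. Returns 0 on success, -1 on any failure.
 */
int offset_sharing_demo(const char *path, off_t *dup_off, off_t *reopen_off)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    int fd_dup = -1, fd_new = -1, ret = -1;

    if (fd < 0)
        return -1;
    fd_dup = dup(fd);               /* same open file entry, same offset */
    fd_new = open(path, O_RDONLY);  /* new open file entry, offset starts at 0 */

    if (fd_dup >= 0 && fd_new >= 0 && write(fd, "hello", 5) == 5) {
        *dup_off = lseek(fd_dup, 0, SEEK_CUR);
        *reopen_off = lseek(fd_new, 0, SEEK_CUR);
        ret = 0;
    }

    if (fd_dup >= 0)
        close(fd_dup);
    if (fd_new >= 0)
        close(fd_new);
    close(fd);
    return ret;
}
```

After the write, the duplicated descriptor reports offset 5 even though nothing was written through it, while the separately opened descriptor still reports 0. Both still reach the same inode, so the data itself is visible through either.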

fd table structures inside the kernel

// include/linux/fdtable.h
struct fdtable {
    unsigned int            max_fds;        /* capacity of this table */
    struct file __rcu       **fd;           /* array of file pointers */
    unsigned long           *close_on_exec; /* close-on-exec bitmap */
    unsigned long           *open_fds;      /* bitmap of open fds */
    unsigned long           *full_fds_bits; /* marks bitmap words that are fully used */
    struct rcu_head         rcu;
};

struct files_struct {
    atomic_t                count;          /* reference count */
    bool                    resize_in_progress;
    wait_queue_head_t       resize_wait;
    struct fdtable __rcu    *fdt;           /* pointer to the current fd table */
    struct fdtable          fdtab;          /* embedded fd table */
    spinlock_t              file_lock;
    unsigned int            next_fd;        /* hint for the next fd to allocate */
    unsigned long           close_on_exec_init[1]; /* initial close-on-exec bitmap */
    unsigned long           open_fds_init[1];      /* initial open-fd bitmap */
    unsigned long           full_fds_bits_init[1]; /* initial fully-used bitmap */
    struct file __rcu       *fd_array[NR_OPEN_DEFAULT]; /* initial fd array */
};

// NR_OPEN_DEFAULT is BITS_PER_LONG (64 on 64-bit architectures, 32 on 32-bit)
// Once a process needs more fds than the embedded array holds, the fdtable is grown dynamically

Allocating and managing fds

# Per-process file descriptor limits
# Soft limit
ulimit -Sn
# 1024

# Hard limit
ulimit -Hn
# 1048576

# System-wide limit
cat /proc/sys/fs/file-max
# 1048576

# Current file handle usage
cat /proc/sys/fs/file-nr
# 5344	0	1048576
# (allocated  allocated-but-unused  maximum)

# fd usage of a specific process
ls /proc/$(pgrep -o nginx)/fd | wc -l

# Ways to change the fd limit
# 1. Temporary change (current shell and its children)
ulimit -n 65536

# 2. Persistent setting in /etc/security/limits.conf
# username  soft  nofile  65536
# username  hard  nofile  65536
# *         soft  nofile  65536
# *         hard  nofile  65536

# 3. For a systemd service
# /etc/systemd/system/myservice.service.d/override.conf
# [Service]
# LimitNOFILE=65536

# 4. Change the system-wide limit
echo 2097152 | sudo tee /proc/sys/fs/file-max
# To persist: add fs.file-max = 2097152 to /etc/sysctl.conf

# Detecting fd leaks
# Monitor the process's fd count periodically
# (single quotes make watch re-evaluate pgrep on every run)
watch -n 1 'ls /proc/$(pgrep -o myapp)/fd | wc -l'

# Look at the contents of /proc/[pid]/fd in detail
ls -la /proc/$(pgrep nginx | head -1)/fd/
# Example output:
# lr-x------ 1 root root 64 Jan 15 10:00 0 -> /dev/null
# l-wx------ 1 root root 64 Jan 15 10:00 1 -> /var/log/nginx/access.log
# l-wx------ 1 root root 64 Jan 15 10:00 2 -> /var/log/nginx/error.log
# lrwx------ 1 root root 64 Jan 15 10:00 3 -> 'socket:[123456]'
# lr-x------ 1 root root 64 Jan 15 10:00 4 -> 'anon_inode:[eventpoll]'
# lrwx------ 1 root root 64 Jan 15 10:00 5 -> 'socket:[123457]'
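
A program can make the equivalent of the temporary `ulimit -n` change for itself with getrlimit(2) and setrlimit(2). The sketch below raises only the soft limit and never exceeds the hard limit, so it needs no privileges. `raise_nofile_limit` is a hypothetical helper name used for illustration.

```c
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_NOFILE to `want`, capped at the hard limit.
 * The limit is never lowered. Returns the resulting soft limit,
 * or 0 if the limits could not be read or changed.
 */
rlim_t raise_nofile_limit(rlim_t want)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;

    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;      /* unprivileged processes cannot exceed the hard limit */

    if (want > rl.rlim_cur) {
        rl.rlim_cur = want;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            return 0;
    }
    return rl.rlim_cur;
}
```

A server that expects many connections might call `raise_nofile_limit(65536)` early in startup and log the value it gets back. The new limit applies to the calling process and is inherited by children it creates afterwards.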

Duplicating file descriptors with dup/dup2

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    int fd1 = open("/tmp/output.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd1 < 0) {
        perror("open");
        return 1;
    }

    // Duplicate fd1 (the new fd number is chosen automatically)
    int fd2 = dup(fd1);
    // fd1 and fd2 refer to the same file object
    // → they share the same f_pos

    // Redirect standard output (fd=1) to the file
    dup2(fd1, STDOUT_FILENO);
    // From here on, printf output is written to /tmp/output.txt
    printf("This goes to the file\n");

    // Closing fd1 leaves fd2 and STDOUT valid
    close(fd1);

    // Duplicate onto a specific fd number with dup2
    int fd3 = dup2(fd2, 10);  // duplicate onto fd 10

    close(fd2);
    close(fd3);

    return 0;
}

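Pathname Lookup

Overview of pathname lookup

Pathname lookup is the process of finding the dentry and inode that correspond to a path string (for example, /home/user/file.txt). It is one of the most complex and most important operations in the VFS.

Flow of pathname lookup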

Path: "/home/user/file.txt"
    │
    ▼
1. Start at "/" (root)
   Take current->fs->root
    │
    ▼
2. Resolve "home"
   ├── Permission check on the directory being searched (execute/search permission)
   │   └── inode->i_op->permission() or generic_permission()
   │
   ├── Search the dcache (RCU-walk mode)
   │   ├── [hit]  → take the dentry, move on to the next component
   │   └── [miss] → switch to REF-walk mode
   │                  → call inode->i_op->lookup()
   │                  → read the directory entry from disk
   │                  → create a new dentry and add it to the dcache
   │
   └── Mount point check
       └── If something is mounted here, switch to the root dentry of the mounted filesystem
    │
    ▼
3. Resolve "user" (same steps)
    │
    ▼
4. Resolve "file.txt" (final component)
   ├── Regular file → return dentry + inode
   ├── Symbolic link → follow the link and resolve its target
   └── Does not exist
       ├── O_CREAT set     → create the file
       └── O_CREAT not set → ENOENT error

RCU-walk and REF-walk

The Linux kernel uses two modes for pathname lookup:

┌──────────────────────────────────────────────────────┐
│  RCU-walk mode (fast path)                           │
│                                                      │
│  - Lock-free (runs under RCU protection)             │
│  - Takes no reference counts on dentries or inodes   │
│  - Very fast on dcache hits                          │
│  - Switches to REF-walk when it would need to block  │
│  - Validates consistency with sequence counters      │
│                                                      │
│  Limitations:                                        │
│  - Cannot be used when disk I/O is required          │
│  - Symbolic link resolution is restricted            │
│  - Not available on some filesystems                 │
└──────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│  REF-walk mode (safe path)                           │
│                                                      │
│  - Traditional lock-based approach                   │
│  - Increments and decrements dentry reference counts │
│  - May block (disk I/O, waiting for locks)           │
│  - Fallback when RCU-walk fails                      │
│                                                      │
│  Used when:                                          │
│  - dcache miss (data must be read from disk)         │
│  - Resolving complex symbolic links                  │
│  - The filesystem does not support RCU-walk          │
└──────────────────────────────────────────────────────┘

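Kernel code for pathname lookup

// fs/namei.c (simplified; exact signatures vary between kernel versions)

// Core loop of pathname lookup: walks every component except the last.
// The final component is left in nd->last for the caller to handle.
static int link_path_walk(const char *name, struct nameidata *nd)
{
    int err;

    // Skip leading '/' characters
    while (*name == '/')
        name++;
    if (!*name) {
        // The path is the root directory itself
        return 0;
    }

    for (;;) {
        u64 hash_len;
        int type;

        // The directory being searched must grant search permission
        err = may_lookup(nd);
        if (err)
            return err;

        // Compute the hash and length of the next path component
        hash_len = hash_name(nd->path.dentry, name);

        type = LAST_NORM;
        if (name[0] == '.') {
            if (hashlen_len(hash_len) == 1)
                type = LAST_DOT;       // "."
            else if (name[1] == '.' && hashlen_len(hash_len) == 2)
                type = LAST_DOTDOT;    // ".."
        }

        nd->last.hash_len = hash_len;
        nd->last.name = name;
        nd->last_type = type;

        // Final component: nothing more to walk here
        name += hashlen_len(hash_len);
        if (!*name)
            return 0;

        // Intermediate component: skip the '/' separators
        while (*name == '/')
            name++;
        if (!*name)
            return 0;  // only trailing slashes were left

        // Look up this component in the current directory
        err = walk_component(nd, WALK_MORE);
        if (err < 0)
            return err;

        // A positive return value means the component is a symbolic link
        if (err) {
            // Fetch the link body; the real code saves the rest of the
            // path on a stack and continues the walk with the link target
            const char *link = get_link(nd);
            if (IS_ERR(link))
                return PTR_ERR(link);
            // ...
        }
    }
}

// walk_component: resolve a single path component
static int walk_component(struct nameidata *nd, int flags)
{
    struct dentry *dentry;
    struct inode *inode;

    // "." and ".." are handled without a dcache lookup
    if (nd->last_type != LAST_NORM)
        return handle_dots(nd, nd->last_type);

    // Search the dcache (RCU-walk or REF-walk)
    dentry = lookup_fast(nd);
    if (IS_ERR(dentry))
        return PTR_ERR(dentry);

    if (unlikely(!dentry)) {
        // dcache miss: ask the filesystem (this may block)
        dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
        if (IS_ERR(dentry))
            return PTR_ERR(dentry);
    }

    // Mount point check
    if (d_is_dir(dentry)) {
        // If something is mounted here, switch to the mounted filesystem
        handle_mounts(nd, dentry, &inode);
    }

    // Symbolic link detection is omitted here. The permission check is
    // not done in this function; may_lookup() in link_path_walk() does it.
    // ...

    return 0;
}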

Symbolic link resolution

# Limit on symbolic link resolution
# To prevent infinite loops, the kernel caps the number of symbolic links followed in one lookup
# MAXSYMLINKS = 40 (include/linux/namei.h)

# Create a symbolic link loop (experiment)
ln -s /tmp/link_b /tmp/link_a
ln -s /tmp/link_a /tmp/link_b
cat /tmp/link_a
# Error: Too many levels of symbolic links

# Show the target of a symbolic link with readlink
readlink /usr/bin/python3
# python3.10

# Get the final absolute path with realpath
realpath /usr/bin/python3
# /usr/bin/python3.10

# Show each step of pathname lookup with namei
# (add -l to also list mode bits and owners)
namei /usr/bin/python3
# Example output:
# f: /usr/bin/python3
#  d /
#  d usr
#  d bin
#  l python3 -> python3.10
#    - python3.10

Directory Entry Operations

Definition of dentry_operations

// include/linux/dcache.h
struct dentry_operations {
    int (*d_revalidate)(struct dentry *, unsigned int);
    int (*d_weak_revalidate)(struct dentry *, unsigned int);
    int (*d_hash)(const struct dentry *, struct qstr *);
    int (*d_compare)(const struct dentry *,
                     unsigned int, const char *, const struct qstr *);
    int (*d_delete)(const struct dentry *);
    int (*d_init)(struct dentry *);
    void (*d_release)(struct dentry *);
    void (*d_prune)(struct dentry *);
    void (*d_iput)(struct dentry *, struct inode *);
    char *(*d_dname)(struct dentry *, char *, int);
    struct vfsmount *(*d_automount)(struct path *);
    int (*d_manage)(const struct path *, bool);
    struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
};

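Details of each operation

// d_revalidate: check whether a cached dentry is still valid
// Important for remote filesystems such as NFS
// Return value: > 0 = valid, 0 = invalid (a fresh lookup is needed), < 0 = error
// (simplified pseudocode; the real NFS code is considerably more involved)
int nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags)
{
    struct inode *inode = d_inode(dentry);
    struct inode *dir = d_inode(dentry->d_parent);
    struct nfs_fh fhandle;
    struct nfs_fattr *fattr;
    int error;

    // Ask the server for up-to-date information
    fattr = nfs_alloc_fattr();
    if (!fattr)
        return -ENOMEM;
    error = NFS_PROTO(dir)->lookup(dir, dentry, &fhandle, fattr);
    nfs_free_fattr(fattr);

    if (error) {
        // For example, the file has been deleted on the server
        d_drop(dentry);  // unhash from the dcache
        return 0;
    }

    // A different file handle means the name now refers to another file
    if (nfs_compare_fh(NFS_FH(inode), &fhandle))
        return 0;  // changed

    return 1;  // still valid
}

// d_hash: compute the hash value of a dentry name
// Used by case-insensitive filesystems
// Examples: FAT, CIFS
// (illustrative; the in-tree FAT counterparts are vfat_hashi() and vfat_cmpi())
int fat_hash(const struct dentry *dentry, struct qstr *qstr)
{
    unsigned long hash;
    const unsigned char *name = qstr->name;
    unsigned int len = qstr->len;

    hash = init_name_hash(dentry);
    while (len--)
        hash = partial_name_hash(tolower(*name++), hash);
    qstr->hash = end_name_hash(hash);

    return 0;
}

// d_compare: compare dentry names
// Used for case-insensitive comparison
// Return value: 0 = names match, non-zero = names differ
int fat_compare(const struct dentry *dentry,
                unsigned int len, const char *str,
                const struct qstr *name)
{
    unsigned int i;

    if (len != name->len)
        return 1;

    for (i = 0; i < len; i++) {
        if (tolower(str[i]) != tolower(name->name[i]))
            return 1;
    }
    return 0;
}

// d_delete: decide whether a dentry should be dropped from the dcache
// Called when the reference count drops to 0
// Return value: 1 = free immediately, 0 = keep on the LRU list
int always_delete_dentry(const struct dentry *dentry)
{
    return 1;  // always free immediately (never cache)
}

// d_automount: trigger for automounting
// Used by autofs
struct vfsmount *autofs_d_automount(struct path *path)
{
    // When the mount point is accessed,
    // mount the filesystem automatically
    // ...
    return NULL;  // placeholder so that the sketch is well-formed
}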

Utility functions for dentry handling

// Commonly used dentry helper functions
struct dentry *d_alloc(struct dentry *parent, const struct qstr *name);
struct dentry *d_alloc_anon(struct super_block *sb);
void d_instantiate(struct dentry *dentry, struct inode *inode);
struct dentry *d_make_root(struct inode *root_inode);
void d_add(struct dentry *dentry, struct inode *inode);
void d_drop(struct dentry *dentry);           // unhash from the dcache
void d_delete(struct dentry *dentry);         // turn negative, or unhash if still in use
void d_rehash(struct dentry *dentry);         // add back to the hash table
void d_move(struct dentry *dentry, struct dentry *target);  // rename
struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name);
// Note: d_validate() appeared in older kernels but no longer exists in current ones

inode Operations

Definition of inode_operations

// include/linux/fs.h
struct inode_operations {
    struct dentry * (*lookup) (struct inode *, struct dentry *, unsigned int);
    const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
    int (*permission) (struct mnt_idmap *, struct inode *, int);
    struct posix_acl * (*get_inode_acl)(struct inode *, int, bool);
    
    int (*readlink) (struct dentry *, char __user *, int);
    
    int (*create) (struct mnt_idmap *, struct inode *, struct dentry *,
                   umode_t, bool);
    int (*link) (struct dentry *, struct inode *, struct dentry *);
    int (*unlink) (struct inode *, struct dentry *);
    int (*symlink) (struct mnt_idmap *, struct inode *, struct dentry *,
                    const char *);
    int (*mkdir) (struct mnt_idmap *, struct inode *, struct dentry *, umode_t);
    int (*rmdir) (struct inode *, struct dentry *);
    int (*mknod) (struct mnt_idmap *, struct inode *, struct dentry *,
                  umode_t, dev_t);
    int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
                   struct inode *, struct dentry *, unsigned int);
    int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
    int (*getattr) (struct mnt_idmap *, const struct path *,
                    struct kstat *, u32, unsigned int);
    ssize_t (*listxattr) (struct dentry *, char *, size_t);
    int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
    int (*update_time)(struct inode *, int);
    int (*atomic_open)(struct inode *, struct dentry *,
                       struct file *, unsigned open_flag,
                       umode_t create_mode);
    int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t);
    struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
    int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
    int (*fileattr_set)(struct mnt_idmap *idmap,
                        struct dentry *dentry, struct fileattr *fa);
    int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
    struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
};

Detailed description of each operation

// lookup: search a directory for a file name
// Called during pathname lookup when a component is not found in the dcache
// (simplified)
struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry,
                           unsigned int flags)
{
    struct inode *inode;
    struct ext4_dir_entry_2 *de;
    struct buffer_head *bh;

    // Search the on-disk directory for the entry
    bh = ext4_find_entry(dir, &dentry->d_name, &de);
    if (IS_ERR(bh))
        return ERR_CAST(bh);

    if (bh) {
        // Found: read in the inode
        unsigned long ino = le32_to_cpu(de->inode);
        brelse(bh);

        inode = ext4_iget(dir->i_sb, ino, EXT4_IGET_NORMAL);
        if (IS_ERR(inode))
            return ERR_CAST(inode);
    } else {
        // Not found (the result is a negative dentry)
        inode = NULL;
    }

    // Bind the dentry to the inode and return it
    return d_splice_alias(inode, dentry);
}

// create: create a new regular file (simplified; journal handling omitted)
int ext4_create(struct mnt_idmap *idmap, struct inode *dir,
                struct dentry *dentry, umode_t mode, bool excl)
{
    struct inode *inode;

    // Allocate a new inode
    inode = ext4_new_inode(dir, mode, &dentry->d_name);
    if (IS_ERR(inode))
        return PTR_ERR(inode);

    // Set the inode's operation tables
    inode->i_op = &ext4_file_inode_operations;
    inode->i_fop = &ext4_file_operations;

    // Mark the inode dirty so that it gets written back later
    ext4_mark_inode_dirty(inode);

    // Bind the dentry to the inode
    // (the real code also adds the directory entry, via ext4_add_nondir())
    d_instantiate_new(dentry, inode);

    return 0;
}

// permission: access permission check
// (simplified; the directory and CAP_DAC_READ_SEARCH cases are omitted)
int generic_permission(struct mnt_idmap *idmap,
                       struct inode *inode, int mask)
{
    int ret;

    // DAC (Discretionary Access Control) check: mode bits and POSIX ACLs
    ret = acl_permission_check(idmap, inode, mask);
    if (ret != -EACCES)
        return ret;

    // Capability check: CAP_DAC_OVERRIDE bypasses DAC, but executing a
    // file still requires at least one execute bit to be set
    if (!(mask & MAY_EXEC) || (inode->i_mode & 0111))
        if (capable_wrt_inode_uidgid(idmap, inode, CAP_DAC_OVERRIDE))
            return 0;

    return -EACCES;
}

// setattr: change file attributes (chmod, chown, truncate, etc.)
// (simplified; ext4_setattr_handle_size() is a placeholder name for the
// size-change logic, not an actual ext4 function)
int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
                 struct iattr *attr)
{
    struct inode *inode = d_inode(dentry);
    int error;

    // Permission check
    error = setattr_prepare(idmap, dentry, attr);
    if (error)
        return error;

    // Handle a size change
    if (attr->ia_valid & ATTR_SIZE) {
        error = ext4_setattr_handle_size(inode, attr);
        if (error)
            return error;
    }

    // Update the attributes
    setattr_copy(idmap, inode, attr);
    mark_inode_dirty(inode);

    return 0;
}

// getattr: retrieve file attributes (stat system call)
int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
                 struct kstat *stat, u32 request_mask,
                 unsigned int query_flags)
{
    struct inode *inode = d_inode(path->dentry);

    generic_fillattr(idmap, request_mask, inode, stat);

    // Shown for illustration: generic_fillattr() already fills in these
    // fields. The real ext4_getattr() adds the creation time (btime) and
    // attribute flags instead.
    stat->blksize = inode->i_sb->s_blocksize;
    stat->blocks = inode->i_blocks;

    return 0;
}

Superblock Operations

Definition of super_operations

// include/linux/fs.h
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*free_inode)(struct inode *);
    
    void (*dirty_inode) (struct inode *, int flags);
    int (*write_inode) (struct inode *, struct writeback_control *wbc);
    int (*drop_inode) (struct inode *);
    void (*evict_inode) (struct inode *);
    void (*put_super) (struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*freeze_super) (struct super_block *, enum freeze_holder who);
    int (*freeze_fs) (struct super_block *);
    int (*thaw_super) (struct super_block *, enum freeze_holder who);
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *);
    int (*remount_fs) (struct super_block *, int *, char *);
    void (*umount_begin) (struct super_block *);
    
    int (*show_options)(struct seq_file *, struct dentry *);
    int (*show_devname)(struct seq_file *, struct dentry *);
    int (*show_path)(struct seq_file *, struct dentry *);
    int (*show_stats)(struct seq_file *, struct dentry *);
    
    long (*nr_cached_objects)(struct super_block *,
                              struct shrink_control *);
    long (*free_cached_objects)(struct super_block *,
                                struct shrink_control *);
};