From Java to Disk: The Long March of a Single File Creation


Creating a file in Java looks almost trivially simple:

java

File file = new File("/path/to/file.txt");
file.createNewFile();

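Two short lines, and a file exists. Have you ever wondered what happens behind them?

How does the string "/path/to/file.txt" turn, step by step, into physical data on disk? What work does the JVM do? What role does the operating system play? When the same Java code runs on different platforms (Windows, macOS, Linux), how does the JDK keep its behavior consistent? Which layers are crossed on the way from user mode to kernel mode?

Today we start from the source code, using Linux kernel 6.8.12, glibc 2.39, and OpenJDK 17 as the reference, and trace one File.createNewFile() call end to end. Linux is the platform because its kernel source is fully open. ext4 represents the file systems because it is the most widely used disk file system among Linux distributions. createNewFile is the method, rather than the FileOutputStream constructor, because it shows "exclusive creation" in its purest form: the file is created only if it does not already exist, which maps exactly onto the O_CREAT|O_EXCL flag combination of the kernel's openat system call.

The journey passes through the application layer, the JNI layer, glibc, the system call interface, the VFS (virtual file system) layer, and the path lookup subsystem, and finally reaches the ext4 file system, where the file leaves its real trace in the on-disk structures.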


I. The Java Application Layer: An Elegant Cross-Platform Interface

1.1 The File.createNewFile() method

java

public boolean createNewFile() throws IOException {
    SecurityManager security = System.getSecurityManager();
    if (security != null) security.checkWrite(path);
    if (isInvalid()) {
        throw new IOException("Invalid file path");
    }
    return fs.createFileExclusively(path);
}

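The File class lives in the java.io package and delegates the real work, through its fs field (of the abstract type FileSystem), to a platform-specific subclass. The method reflects two concerns. The first is security: if a SecurityManager is installed, it must first check that the calling code is allowed to write to the path. The second is portability: path separators, permission models, and file attributes differ greatly between Windows and Unix systems, and FileSystem hides those differences in its subclasses behind one interface.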

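isInvalid() fails fast on a path that can never be valid; in OpenJDK 17 it rejects paths that contain a NUL character. The real interaction with the operating system happens in fs.createFileExclusively(path).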

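1.2 What is exclusive creation?

The semantics of createFileExclusively are precise: create the file only if it does not exist, and fail if it already exists (returning false, without throwing). This is exactly the behavior of the O_CREAT | O_EXCL flag combination, and it prevents race conditions when several parties try to create the same file. In a multi-threaded or multi-process environment, checking whether the file exists and then creating it is not safe: there is a time window between the check and the creation in which another process can create a file with the same name. O_CREAT|O_EXCL closes that window because the kernel performs the check and the creation as one atomic operation.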
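
The same semantics can be reproduced directly in C. The following is a minimal sketch, assuming a POSIX system; create_exclusive is a hypothetical helper name, not an OpenJDK function. It returns 1 when this call created the file, 0 when the file was already there, and -1 for any other error.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: 1 = created by this call, 0 = already existed,
 * -1 = some other failure (errno is left set by open). */
int create_exclusive(const char *path)
{
    assert(path != NULL);

    /* O_CREAT | O_EXCL turns "check that it is absent" and "create it"
     * into a single atomic step inside the kernel. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0666);
    if (fd >= 0) {
        close(fd);
        return 1;
    }
    return (errno == EEXIST) ? 0 : -1;
}
```

If two processes call create_exclusive on the same path at the same moment, at most one of them gets 1. A check with access() followed by a plain O_CREAT open cannot give that guarantee.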

1.3 Platform differences: UnixFileSystem and WindowsFileSystem

On Unix/Linux systems, the concrete implementation behind fs is UnixFileSystem:

java

// OpenJDK: java/io/UnixFileSystem.java
public native boolean createFileExclusively(String path)
        throws IOException;

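The native keyword marks this as a native method: its implementation is not in Java code but in a native library written in C/C++, reached through JNI (Java Native Interface). This is the bridge Java uses to cross the gap between the Java heap and the operating system.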


II. The JNI Layer and OpenJDK's Native Implementation

In the OpenJDK source, UnixFileSystem_md.c contains the native implementation of createFileExclusively:

c

JNIEXPORT jboolean JNICALL
Java_java_io_UnixFileSystem_createFileExclusively(JNIEnv *env, jclass cls,
                                                  jstring pathname)
{
    jboolean rv = JNI_FALSE;

    WITH_PLATFORM_STRING(env, pathname, path) {
        FD fd;
        /* The root directory always exists; no need to try creating it */
        if (strcmp (path, "/")) {
            /* Core call: O_CREAT | O_EXCL guarantees exclusive creation */
            fd = handleOpen(path, O_RDWR | O_CREAT | O_EXCL, 0666);
            if (fd < 0) {
                if (errno != EEXIST)
                    JNU_ThrowIOExceptionWithLastError(env, "Could not open file");
            } else {
                if (close(fd) == -1)
                    JNU_ThrowIOExceptionWithLastError(env, "Could not close file");
                rv = JNI_TRUE;
            }
        }
    } END_PLATFORM_STRING(env, path);
    return rv;
}

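Key points

  • WITH_PLATFORM_STRING is a macro that converts the Java jstring into a C char* string and takes care of encoding conversion and memory management. The conversion looks simple, but it has to translate Java's internal string representation (UTF-16) into the character encoding the operating system expects, and default character sets differ from platform to platform.

  • Because the root directory / always exists, createFileExclusively skips the creation attempt for it and avoids a pointless system call.

  • The flag combination passed to handleOpen, O_RDWR | O_CREAT | O_EXCL, spells out the exclusive-creation semantics: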

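    • O_RDWR: open for reading and writing. The JVM opens the file only to create it and never reads or writes through this descriptor, which is closed immediately, so the access mode has little practical effect here. Strictly speaking O_WRONLY would be enough; the source does not document why O_RDWR was chosen.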
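    • O_CREAT: create the file if it does not exist.
    • O_EXCL: combined with O_CREAT, the call succeeds only if the file did not exist before. This matches the semantics of createNewFile exactly.
  • The errno != EEXIST branch separates two kinds of failure. If the cause is "file already exists" (EEXIST), that is an expected outcome and the function quietly returns false. For any other cause (insufficient permissions, an unreachable path, a full disk, and so on) it throws an IOException.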

2.1 handleOpen: a safety check after opening

c

FD
handleOpen(const char *path, int oflag, int mode) {
    FD fd;
    /* The RESTARTABLE macro retries the call when it fails with EINTR */
    RESTARTABLE(open64(path, oflag, mode), fd);
    if (fd != -1) {
        struct stat64 buf64;
        int result;
        RESTARTABLE(fstat64(fd, &buf64), result);
        if (result != -1) {
            /* Reject directories (a read-only open of a directory would succeed) */
            if (S_ISDIR(buf64.st_mode)) {
                close(fd);
                errno = EISDIR;
                fd = -1;
            }
        } else {
            close(fd);
            fd = -1;
        }
    }
    return fd;
}

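handleOpen wraps the open64 call and adds one extra check: the object that was opened must not be a directory. Why is the check needed? handleOpen is shared by the JDK's other file-opening paths. POSIX requires open to fail with EISDIR when a directory is opened with O_WRONLY or O_RDWR, but a read-only open of a directory succeeds and returns a file descriptor, which is not what a Java caller expects from a file API. So the JDK checks the file type explicitly: if it is a directory, it closes the descriptor and sets errno to EISDIR so that the caller can throw a suitable exception. For createNewFile itself the kernel already settles the question: with O_CREAT | O_EXCL, an existing directory at that path makes open fail with EEXIST, and createNewFile returns false.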
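
The check can be tried out in isolation. The sketch below assumes a POSIX system and uses a hypothetical helper name, open_rejecting_directories; it follows the same open-then-fstat pattern as handleOpen but is not the OpenJDK code.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: like open(), but refuses directories.
 * Returns a descriptor, or -1 with errno set. */
int open_rejecting_directories(const char *path, int flags, mode_t mode)
{
    int fd = open(path, flags, mode);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        int saved = errno;      /* keep fstat's error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    if (S_ISDIR(st.st_mode)) {
        close(fd);
        errno = EISDIR;         /* report the same error a write-mode open would */
        return -1;
    }
    return fd;
}
```

Calling open("/", O_RDONLY) directly succeeds on Linux, while open_rejecting_directories("/", O_RDONLY, 0) returns -1 with errno set to EISDIR.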

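The RESTARTABLE macro also deserves a mention: when open64 or fstat64 is interrupted by a signal (the call returns -1 with errno set to EINTR), the macro issues the call again. Callers therefore never have to handle EINTR by hand.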
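
The retry pattern is easy to sketch. The macro below, RETRY_ON_EINTR, is an illustrative stand-in written for this article, and flaky_call is a fake "system call" that reports EINTR twice before succeeding, so the loop can be observed without real signals.

```c
#include <errno.h>

/* Illustrative stand-in for a restart macro: repeat the call for as long
 * as it fails with EINTR. */
#define RETRY_ON_EINTR(call, result)                  \
    do {                                              \
        (result) = (call);                            \
    } while ((result) == -1 && errno == EINTR)

int interrupts_left = 2;   /* how many times the fake call is still "interrupted" */
int attempts_made = 0;     /* how often the fake call has been invoked */

/* Fake system call: fails with EINTR while interrupts_left > 0, then returns 7. */
int flaky_call(void)
{
    attempts_made++;
    if (interrupts_left > 0) {
        interrupts_left--;
        errno = EINTR;
        return -1;
    }
    return 7;
}

int call_with_retry(void)
{
    int result;
    RETRY_ON_EINTR(flaky_call(), result);
    return result;
}
```

With the initial values above, call_with_retry() invokes flaky_call three times and returns 7. An error other than EINTR ends the loop immediately and is passed to the caller.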


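III. The glibc Layer: From Library Function to System Call

The open64 that the JVM calls is not a system call itself but a wrapper function provided by glibc (the GNU C Library). glibc acts as an adapter between user space and the kernel: it handles the variadic argument (the mode permission bits), makes the call a thread cancellation point, and maps the Large File Support (LFS) function open64 onto the underlying openat system call.

3.1 The implementation of open64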

c

int
__libc_open64 (const char *file, int oflag, ...)
{
  int mode = 0;

  /* If the flags include O_CREAT or O_TMPFILE, the caller passed
     a mode argument (the permissions for the new file) */
  if (__OPEN_NEEDS_MODE (oflag))
    {
      va_list arg;
      va_start (arg, oflag);
      mode = va_arg (arg, int);  /* pull mode out of the variadic arguments */
      va_end (arg);
    }

  /* Call __libc_open, always adding the O_LARGEFILE flag */
  return __libc_open (file, oflag | O_LARGEFILE, mode);
}
weak_alias (__libc_open64, __open64)
weak_alias (__libc_open64, open64)

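The logic of the __OPEN_NEEDS_MODE macro: when oflag contains O_CREAT or O_TMPFILE, the system call needs a mode argument that gives the permissions of the newly created file (for example 0666, meaning rw-rw-rw-, which is then filtered by the umask). glibc uses the C standard va_list mechanism to pull this mode value out of the variadic argument list.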
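
Both ideas fit in a few lines. This sketch uses two hypothetical helpers: extract_mode reads the optional third argument only when O_CREAT is present (the O_TMPFILE case is left out for brevity), and apply_umask shows the arithmetic by which a umask filters the requested permissions.

```c
#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>

/* Hypothetical helper: read the optional mode argument only if the flags
 * say the caller supplied one. Reading it unconditionally would be
 * undefined behavior when the argument is absent. */
int extract_mode(int oflag, ...)
{
    int mode = 0;
    if (oflag & O_CREAT) {
        va_list ap;
        va_start(ap, oflag);
        mode = va_arg(ap, int);
        va_end(ap);
    }
    return mode;
}

/* Hypothetical helper: the permission bits that survive a given umask. */
mode_t apply_umask(mode_t requested, mode_t mask)
{
    return requested & ~mask & 07777;
}
```

With the common umask 022, a request for 0666 produces a file with mode 0644: apply_umask(0666, 0022) evaluates to 0644.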

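One detail that is easy to miss: glibc always adds the O_LARGEFILE flag to oflag. The flag tells the kernel that the caller uses 64-bit file offsets and can therefore handle files larger than 2 GB. OpenJDK's native code calls open64, the LFS variant, explicitly instead of the traditional open. On 64-bit Linux, where off_t is already 64 bits wide, the two behave the same.

3.2 __libc_open: the last stop before the kernel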

c

int
__libc_open (const char *file, int oflag, ...)
{
  int mode = 0;

  if (__OPEN_NEEDS_MODE (oflag))
    {
      va_list arg;
      va_start (arg, oflag);
      mode = va_arg (arg, int);
      va_end (arg);
    }

  /* SYSCALL_CANCEL: issue the system call and act as a thread cancellation point */
  return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag, mode);
}

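SYSCALL_CANCEL is a glibc-internal macro that does several important things:

  1. Thread cancellation point: POSIX requires system calls that may block for a long time, such as open, read, and write, to be "cancellation points". If a thread is cancelable, it checks for a pending cancellation request when it reaches a cancellation point and terminates if there is one. SYSCALL_CANCEL handles this mechanism around the system call.
  2. System call number and argument passing: through macro expansion glibc generates the system call sequence for the target architecture (x86_64, ARM64, and so on), such as the syscall assembly instruction. It loads the system call number (the kernel's number for openat) and the arguments (AT_FDCWD, file, oflag, mode) into specific registers and then performs the switch from user mode to kernel mode.
  3. Error handling and errno: the return value comes back in a register. If it lies between -1 and -4095 (by Linux kernel convention this range means an error code), glibc stores its positive value in errno and returns -1; otherwise it returns the value unchanged, here the new file descriptor.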
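
The return-value convention from point 3 can be written out as a small function. This is a sketch of the convention only, not glibc's actual macro; finish_syscall and DEMO_MAX_ERRNO are names made up for illustration.

```c
#include <errno.h>

#define DEMO_MAX_ERRNO 4095L   /* raw results in [-4095, -1] encode an error */

/* Hypothetical helper: turn a raw kernel return value into the
 * "-1 and errno" form that C library callers see. */
long finish_syscall(long raw)
{
    if (raw < 0 && raw >= -DEMO_MAX_ERRNO) {
        errno = (int) -raw;    /* e.g. -EEXIST becomes errno == EEXIST */
        return -1;
    }
    return raw;                /* success: a descriptor, a byte count, ... */
}
```

A raw result of -EEXIST therefore reaches the JNI code as a return value of -1 with errno equal to EEXIST, which is exactly what the errno != EEXIST test in createFileExclusively inspects.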

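The call here is to openat rather than open. openat is an enhanced version of open with an extra dirfd parameter (here AT_FDCWD, meaning "the current working directory"), which allows a path to be resolved relative to an already opened directory file descriptor. This design improves safety by avoiding some TOCTOU (Time-Of-Check-Time-Of-Use) race attacks, and it is the modern interface the Linux kernel recommends.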
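
A short sketch shows what the dirfd parameter buys. It assumes a POSIX.1-2008 system; create_in_directory is a hypothetical helper that opens the directory once and then creates the file relative to that descriptor, so the directory cannot be swapped for another one between the two steps.

```c
#define _POSIX_C_SOURCE 200809L   /* for openat() and O_DIRECTORY */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: exclusively create `name` inside `dir`.
 * Returns 1 if created, 0 if it already existed, -1 on other errors. */
int create_in_directory(const char *dir, const char *name)
{
    /* Pin the directory first; later lookups start from this descriptor. */
    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    int fd = openat(dirfd, name, O_WRONLY | O_CREAT | O_EXCL, 0666);
    int saved = errno;            /* close() below may overwrite errno */
    close(dirfd);

    if (fd >= 0) {
        close(fd);
        return 1;
    }
    errno = saved;
    return (saved == EEXIST) ? 0 : -1;
}
```

Passing AT_FDCWD instead of a real directory descriptor gives openat the same behavior as plain open, which is how glibc uses it here.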


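IV. The Kernel System Call Entry: do_sys_open and do_sys_openat2

When the syscall instruction executes, the CPU switches from user mode to kernel mode and uses the system call number to find the handler in the kernel's sys_call_table. In Linux 6.8.12 the entry point for the openat system call is SYSCALL_DEFINE4(openat, ...).

4.1 The openat system call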

c

SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
		umode_t, mode)
{
	if (force_o_largefile())
		flags |= O_LARGEFILE;
	return do_sys_open(dfd, filename, flags, mode);
}

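The __user annotation tells the kernel that the filename pointer refers to a user-space address. It must not be dereferenced directly; the data has to be copied into kernel space safely through dedicated functions such as copy_from_user or getname. This is an important part of the kernel's safety model: a pointer passed from user space may be invalid or malicious, and accessing it without that distinction could crash the kernel or open a security hole.

force_o_largefile() reports whether the kernel treats large-file support as the default, which is normally the case on 64-bit architectures. If so, the O_LARGEFILE flag is added automatically.

4.2 do_sys_open: preparing the core parameters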

c

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
	struct open_how how = build_open_how(flags, mode);
	return do_sys_openat2(dfd, filename, &how);
}

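build_open_how packs the separate flags and mode values into a struct open_how, the structure that describes "how to open" and that also serves as the argument of the openat2 system call. The benefit is extensibility: if new open parameters are needed in the future (a new behavior flag, for example), only this structure has to grow, and the signatures of the functions along the call chain stay unchanged.

4.3 do_sys_openat2: four steps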

c

static long do_sys_openat2(int dfd, const char __user *filename,
			   struct open_how *how)
{
	struct open_flags op;
	int fd = build_open_flags(how, &op);
	struct filename *tmp;

	if (fd)
		return fd;

	/* 1. Copy the user-space path into kernel memory */
	tmp = getname(filename);
	if (IS_ERR(tmp))
		return PTR_ERR(tmp);

	/* 2. Reserve an unused file descriptor */
	fd = get_unused_fd_flags(how->flags);
	if (fd >= 0) {
		/* 3. Core step: resolve the path and open or create the file */
		struct file *f = do_filp_open(dfd, tmp, &op);
		if (IS_ERR(f)) {
			put_unused_fd(fd);
			fd = PTR_ERR(f);
		} else {
			/* 4. Attach the file structure to the fd */
			fd_install(fd, f);
		}
	}
	putname(tmp);
	return fd;
}

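getname(filename) does several important things:

  • It copies the path string from user space into a temporary kernel buffer.
  • It performs basic validity checks: a path that does not fit within PATH_MAX is rejected with ENAMETOOLONG.
  • It rejects an empty path with ENOENT and hands the name to the audit subsystem.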
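
These early checks are visible from user space, because they fail before any path component is looked up. The sketch below assumes Linux, where PATH_MAX is 4096; open_errno and open_errno_for_length are hypothetical helper names.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: 0 if open() succeeded, otherwise the errno it set. */
int open_errno(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        close(fd);
        return 0;
    }
    return errno;
}

/* Hypothetical helper: build a path made of `len` 'a' characters and
 * report how open() reacts to it. */
int open_errno_for_length(size_t len)
{
    char *path = malloc(len + 1);
    if (path == NULL)
        return ENOMEM;
    memset(path, 'a', len);
    path[len] = '\0';

    int err = open_errno(path);
    free(path);
    return err;
}
```

On Linux, open_errno("") reports ENOENT and open_errno_for_length(5000) reports ENAMETOOLONG.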

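Next comes the allocation of the file descriptor. For every process the kernel maintains a files_struct, which contains an fdtable (file descriptor table), the map between the process and its open files. get_unused_fd_flags searches this table for a free slot and returns the lowest-numbered one as fd. In a freshly started process that is usually 3, because 0/1/2 are taken by stdin/stdout/stderr; a running JVM already holds many descriptors, so the number is higher. What is reserved here is only the slot; the actual file object, struct file, has not been created yet.

do_filp_open is the core engine of the open operation: it resolves the path and creates the struct file. fd_install then simply stores the struct file pointer in the slot that belongs to fd.

With that, the file descriptor is in place. The real challenge, however, is only beginning: how is the path name resolved to an actual file, and when the file does not exist and O_CREAT is set, how does the ext4 file system create it?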
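
POSIX requires open to return the lowest-numbered descriptor that is not currently open, and that rule can be observed from user space. A minimal sketch, assuming a single-threaded caller and the presence of /dev/null; freed_slot_is_reused is a hypothetical helper name.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a descriptor number released by
 * close() is handed out again by the next open(), 0 if not, -1 on error. */
int freed_slot_is_reused(void)
{
    int first = open("/dev/null", O_RDONLY);
    int second = open("/dev/null", O_RDONLY);
    if (first == -1 || second == -1) {
        if (first != -1)
            close(first);
        if (second != -1)
            close(second);
        return -1;
    }

    close(first);                            /* frees the lower-numbered slot */
    int third = open("/dev/null", O_RDONLY); /* should land in that slot */
    int reused = (third == first);

    close(second);
    if (third != -1)
        close(third);
    return reused;
}
```

In a multi-threaded program such as a JVM another thread may grab the freed slot first, so the result is only deterministic when nothing else opens files concurrently.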


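V. The VFS Layer: Path Resolution and the Core Open Logic

do_filp_open is a central function of the VFS (virtual file system) layer and handles open operations for every file system type. The VFS is an abstraction layer in the Linux kernel. It defines four core data structures, struct file, struct inode, struct dentry, and struct super_block, and confines the differences between concrete file systems (such as ext4, XFS, and Btrfs) to their own callback functions.

5.1 do_filp_open: retries and RCU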

c

struct file *do_filp_open(int dfd, struct filename *pathname,
		const struct open_flags *op)
{
	struct nameidata nd;
	int flags = op->lookup_flags;
	struct file *filp;

	set_nameidata(&nd, dfd, pathname, NULL);
	/* First attempt: RCU-walk mode */
	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
	if (unlikely(filp == ERR_PTR(-ECHILD)))
		/* Second attempt: REF-walk mode */
		filp = path_openat(&nd, op, flags);
	if (unlikely(filp == ERR_PTR(-ESTALE)))
		/* Third attempt: force revalidation */
		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
	restore_nameidata();
	return filp;
}
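
  • RCU-walk (Read-Copy-Update walk) mode: the kernel first tries the path lookup under RCU read-side protection, which means the lookup takes no locks and blocks no other thread. RCU lets many readers access a data structure concurrently, while a writer prepares a new copy and then swaps it in atomically. For an operation as frequent as path lookup, RCU mode improves performance greatly. If a dentry (directory entry) on the path has to be read from disk (a cache miss), however, RCU-walk fails because it is not allowed to sleep, and returns -ECHILD.
  • REF-walk (reference walk) mode: when RCU-walk fails (for example because inode information must be read from disk), the kernel falls back to REF-walk mode, which may sleep and can therefore perform I/O and wait for locks safely.
  • LOOKUP_REVAL mode: used mainly for NFS and other network file systems. It forces the cached information about the file to be revalidated, because the cache of a network file system can be stale.

The nameidata structure is the "travel backpack" of the whole path lookup: it records the current position (the dentry and mount of the current directory), the remaining path string, and the lookup flags.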

5.2 path_openat: allocating the file structure and entering the lookup loop

c

static struct file *path_openat(struct nameidata *nd,
			const struct open_flags *op, unsigned flags)
{
	struct file *file;
	int error;

	/* Allocate an empty file structure */
	file = alloc_empty_file(op->open_flag, current_cred());
	if (IS_ERR(file))
		return file;

	if (unlikely(file->f_flags & __O_TMPFILE)) {
		/* O_TMPFILE: create an unnamed temporary file (no directory entry) */
		error = do_tmpfile(nd, flags, op, file);
	} else if (unlikely(file->f_flags & O_PATH)) {
		/* O_PATH: obtain a handle to the location only, without opening the file for I/O */
		error = do_o_path(nd, flags, file);
	} else {
		const char *s = path_init(nd, flags);
		/* Resolve the path component by component */
		while (!(error = link_path_walk(s, nd)) &&
		       (s = open_last_lookups(nd, file, op)) != NULL)
			;
		if (!error)
			error = do_open(nd, file, op);
		terminate_walk(nd);
	}
	...
}

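alloc_empty_file allocates a new struct file instance from the kernel's slab cache for file objects, initializes its reference count, and sets its owner (the credentials of the current process). FMODE_OPENED is not set yet, which marks the file as not really opened. This struct file represents the current open session (the open file description) and is different from struct inode: the inode describes the file itself (metadata plus data). Several processes that open the same file share one inode, but each has its own struct file that records its own file offset, open flags, and other per-session state.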
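
The split between the shared inode and the per-open struct file shows up in user space through file offsets. The sketch below assumes a POSIX system and a writable path; probe_sharing and struct sharing_report are hypothetical names. It opens one file twice and also duplicates the first descriptor: the two open calls reach the same inode but have independent offsets, while the dup'ed descriptor shares the offset of the original because it refers to the same open file description.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

struct sharing_report {
    int same_inode;            /* 1 if both open() calls reached one inode */
    long offset_second_open;   /* offset seen through the second open() */
    long offset_dup;           /* offset seen through the dup()ed descriptor */
};

/* Hypothetical helper: fills `out` and returns 0, or returns -1 on error.
 * The file at `path` is created, used, and removed again. */
int probe_sharing(const char *path, struct sharing_report *out)
{
    int a = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    int b = open(path, O_RDONLY);          /* second, independent open */
    int c = (a >= 0) ? dup(a) : -1;        /* same open file description as a */
    struct stat sa, sb;

    int ok = a >= 0 && b >= 0 && c >= 0
          && fstat(a, &sa) == 0 && fstat(b, &sb) == 0
          && write(a, "hello", 5) == 5;    /* moves a's offset to 5 */
    if (ok) {
        out->same_inode = (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino);
        out->offset_second_open = (long) lseek(b, 0, SEEK_CUR);
        out->offset_dup = (long) lseek(c, 0, SEEK_CUR);
    }

    if (a >= 0) close(a);
    if (b >= 0) close(b);
    if (c >= 0) close(c);
    unlink(path);
    return ok ? 0 : -1;
}
```

After writing five bytes through the first descriptor, the second open still reports offset 0, and the duplicate reports offset 5.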

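path_init determines the starting point of the lookup from the path and the dfd argument:

  • If the path begins with /, the lookup starts at the root directory, whatever dfd is. This is the case for "/path/to/file.txt".
  • If dfd is AT_FDCWD, a relative path starts at the current working directory of the process.
  • If dfd is the file descriptor of an open directory, a relative path starts at that directory.

link_path_walk then resolves the path component by component, handling special cases such as ".", "..", and symbolic links. open_last_lookups handles the last component of the path, and that is where file creation is decided.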


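VI. The Final Step: open_last_lookups and lookup_open

open_last_lookups is the dividing line between path lookup and file creation. It handles the last component of the path and decides, depending on whether O_CREAT is set, between "look up an existing file" and "prepare to create a file".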

c

static const char *open_last_lookups(struct nameidata *nd,
		   struct file *file, const struct open_flags *op)
{
	struct dentry *dir = nd->path.dentry;
	int open_flag = op->open_flag;
	bool got_write = false;
	struct dentry *dentry;
	const char *res;

	nd->flags |= op->intent;  /* LOOKUP_OPEN / LOOKUP_CREATE / LOOKUP_EXCL */

	if (nd->last_type != LAST_NORM) {
		/* Handle the special cases "." and ".." */
		return handle_dots(nd, nd->last_type);
	}

	if (!(open_flag & O_CREAT)) {
		/* Not creating: try a fast lookup */
		dentry = lookup_fast(nd);
		...
	} else {
		/* O_CREAT branch: the file may have to be created */
		if (nd->flags & LOOKUP_RCU) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
		}
		/* Trailing slash (e.g. "dir/")? No regular file can be created under such a name */
		if (unlikely(nd->last.name[nd->last.len]))
			return ERR_PTR(-EISDIR);
	}

	/* If the open may write (create or truncate), try to get write access to the mount */
	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
		got_write = !mnt_want_write(nd->path.mnt);
		/* Do not fail yet: write access may not be needed; lookup_open decides */
	}

	if (open_flag & O_CREAT)
		inode_lock(dir->d_inode);      /* creation needs the exclusive lock */
	else
		inode_lock_shared(dir->d_inode); /* a lookup only needs the shared lock */

	dentry = lookup_open(nd, file, op, got_write);

	if (open_flag & O_CREAT)
		inode_unlock(dir->d_inode);
	else
		inode_unlock_shared(dir->d_inode);

	if (got_write)
		mnt_drop_write(nd->path.mnt);
	...
}

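lookup_open is the function that actually performs the lookup or the creation. It relies on the parent directory's inode lock for atomicity. Simplified, it looks like this: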

c

static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
				  const struct open_flags *op,
				  bool got_write)
{
	struct mnt_idmap *idmap = mnt_idmap(nd->path.mnt);
	struct dentry *dir = nd->path.dentry;
	struct inode *dir_inode = dir->d_inode;
	int open_flag = op->open_flag;
	struct dentry *dentry;
	int error, create_error = 0;
	umode_t mode = op->mode;

	/* Look the name up in the dentry cache */
	dentry = d_lookup(dir, &nd->last);
	/* (elided: allocate a new dentry on a miss, revalidate it on a hit) */
	...

	if (dentry->d_inode) {
		/* Cached positive dentry: the file exists, continue with a normal open */
		return dentry;
	}

	/* The file may not exist: check whether creating it here is permitted */
	if (open_flag & O_CREAT) {
		mode = vfs_prepare_mode(idmap, dir->d_inode, mode, mode, mode);
		if (likely(got_write))
			create_error = may_o_create(idmap, &nd->path, dentry, mode);
		else
			create_error = -EROFS;  /* read-only file system */
	}

	if (create_error)
		open_flag &= ~O_CREAT;

	/* Prefer the file system's atomic_open method if it provides one */
	if (dir_inode->i_op->atomic_open) {
		dentry = atomic_open(nd, dentry, file, open_flag, mode);
		return dentry;
	}

	/* Standard path: lookup first, then create */
	if (d_in_lookup(dentry)) {
		/* Cache miss: ask the underlying file system's lookup method */
		struct dentry *res = dir_inode->i_op->lookup(dir_inode, dentry, nd->flags);
		...
	}

	/* The file really does not exist and O_CREAT was given: create it */
	if (!dentry->d_inode && (open_flag & O_CREAT)) {
		file->f_mode |= FMODE_CREATED;
		error = dir_inode->i_op->create(idmap, dir_inode, dentry, mode,
						open_flag & O_EXCL);
		if (error)
			goto out_dput;
	}
	return dentry;

out_dput:
	dput(dentry);
	return ERR_PTR(error);
}

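d_lookup is a key optimization of the VFS layer. The dentry cache keeps the mapping from path names to dentries in memory, so the vast majority of file operations get their information straight from the cache without touching the disk. Only on a cache miss is the underlying file system's lookup method called to read the directory contents from disk.

When the file really does not exist and creation is allowed, the VFS calls the create method in the inode operations table of the parent directory. For ext4 this method is set to ext4_create, and with that call control passes to the concrete file system. If the file already exists, lookup_open returns its dentry without setting FMODE_CREATED, and do_open later fails an O_CREAT|O_EXCL request with EEXIST. That is the error the JNI code turns into a false return value.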


VII. The ext4 File System: Actual Creation on Disk

ext4_dir_inode_operations defines the operations that directories support in ext4:

c

const struct inode_operations ext4_dir_inode_operations = {
	.create		= ext4_create,
	.lookup		= ext4_lookup,
	.link		= ext4_link,
	.unlink		= ext4_unlink,
	.mkdir		= ext4_mkdir,
	.rmdir		= ext4_rmdir,
	.mknod		= ext4_mknod,
	...
};

ext4_create is the entry point for creating a regular file. Its core flow, simplified:

c

static int ext4_create(struct mnt_idmap *idmap, struct inode *dir,
		       struct dentry *dentry, umode_t mode, bool excl)
{
	handle_t *handle;
	struct inode *inode;
	int err, credits;

	/* Journal credits: an upper bound on the blocks this operation may modify */
	credits = (EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +
		   EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3);

	/* Allocate the inode and start a journal handle */
	inode = ext4_new_inode_start_handle(idmap, dir, mode, &dentry->d_name,
					    0, NULL, EXT4_HT_DIR, credits);
	handle = ext4_journal_current_handle();
	err = PTR_ERR(inode);

	if (!IS_ERR(inode)) {
		/* Install the operation tables for a regular file */
		inode->i_op = &ext4_file_inode_operations;
		inode->i_fop = &ext4_file_operations;
		ext4_set_aops(inode);

		/* Link the new inode into the parent directory */
		err = ext4_add_nondir(handle, dentry, &inode);
	}

	if (handle)
		ext4_journal_stop(handle);
	return err;
}

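ext4_new_inode_start_handle is the heart of inode allocation and performs a series of important steps:

  • Choosing a block group: based on the file type (regular file or directory) and the location of the parent directory, it picks a block group for the new inode. For a regular file the search starts at the parent directory's block group, which keeps the files of one directory close together and improves disk locality.
  • Reading the inode bitmap: it locates the inode bitmap of the target block group and scans it for a free inode number. Each bit of the bitmap records whether one inode is in use. This may require reading the bitmap from disk into the buffer cache.
  • Atomic bit set: under the protection of the journal transaction, it changes the inode's bit in the bitmap from 0 to 1, marking the inode as in use.
  • Updating the group descriptor: it decrements the free inode count of the block group and, for a directory, increments the used directory count.
  • Allocating the inode structure: it allocates a new struct inode in memory and initializes its metadata (file size 0, creation and modification times set to the current time, permissions taken from the mode passed down by the VFS, which has normally applied the umask already).
  • Initializing the inode flags: depending on the file type it sets flags such as EXT4_INODE_EXTENTS, which enables the extent tree for block mapping, a major improvement of ext4 over ext3.
  • Generating the inode number and generation: the inode number (for example 12345) is derived from the block group number and the inode's index within the group and identifies the file uniquely within the file system. i_generation is chosen at random and is used by NFS and similar consumers to detect inode reuse.
  • Journaling: the inode creation is recorded in the ext4 journal, so that after a crash the file system can be brought back to a consistent state by replaying the journal.
  • Data blocks: a new file has size 0 and initially occupies no data blocks, so i_blocks starts at 0.

ext4_add_nondir links the newly created inode to its parent directory: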
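
The bitmap scan and the inode-number arithmetic from the list above can be sketched in a few lines. The helpers are hypothetical, not ext4 code, and the figure of 8192 inodes per group used in the usage note is only an assumed, typical value. The bit order is little-endian, as in ext4's on-disk bitmaps, and inode numbers start at 1, hence the "+ 1".

```c
#include <stddef.h>

/* Hypothetical helper: find the first zero bit, set it, and return its
 * index; returns -1 if every bit is taken. Bit n lives in byte n / 8 at
 * position n % 8 (little-endian bit order). */
long claim_first_free_bit(unsigned char *bitmap, size_t nbits)
{
    for (size_t bit = 0; bit < nbits; bit++) {
        unsigned char mask = (unsigned char) (1u << (bit % 8));
        if ((bitmap[bit / 8] & mask) == 0) {
            bitmap[bit / 8] |= mask;   /* mark the inode as in use */
            return (long) bit;
        }
    }
    return -1;
}

/* Inode number from block group and bit index within the group. */
unsigned long inode_number(unsigned long group, unsigned long bit,
                           unsigned long inodes_per_group)
{
    return group * inodes_per_group + bit + 1;
}

/* The inverse mapping: which group, and which slot in that group. */
unsigned long group_of_inode(unsigned long ino, unsigned long inodes_per_group)
{
    return (ino - 1) / inodes_per_group;
}

unsigned long index_in_group(unsigned long ino, unsigned long inodes_per_group)
{
    return (ino - 1) % inodes_per_group;
}
```

With 8192 inodes per group, inode 12345 sits in block group 1 at index 4152, and inode_number(1, 4152, 8192) gives 12345 back.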

c

static int ext4_add_nondir(handle_t *handle, struct dentry *dentry, struct inode **inodep)
{
	struct inode *dir = d_inode(dentry->d_parent);
	struct inode *inode = *inodep;
	int err = ext4_add_entry(handle, dentry, inode);
	if (!err) {
		err = ext4_mark_inode_dirty(handle, inode);
		d_instantiate_new(dentry, inode);
		*inodep = NULL;
		return err;
	}
	drop_nlink(inode);
	ext4_mark_inode_dirty(handle, inode);
	ext4_orphan_add(handle, inode);
	unlock_new_inode(inode);
	return err;
}

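ext4_add_entry adds a directory entry to a directory block of the parent. A directory entry records the mapping from a file name to an inode number. With the dir_index feature, large ext4 directories are organized as a hashed tree (htree), which keeps lookups efficient even in very large directories. For the new file, ext4 finds free space in one of the parent's directory blocks and writes the inode number, the record length, the name length, the file type, and the name.

At this point the whole chain is connected, from Java's File.createNewFile() down to ext4's metadata updates. The call stack has gone from the application layer all the way to the file system's on-disk structures, and the file is recorded in the file system. The changes reach the disk itself when the journal transaction commits.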
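
The shape of such a record can be sketched as a C struct. This is an in-memory illustration under stated assumptions: the names are hypothetical, the field order follows the classic ext2/ext4 entry layout (4-byte inode number, 2-byte record length, 1-byte name length, 1-byte file type, then the name), endianness conversion is ignored, and the extra hash bytes used by encrypted case-folded directories are left out.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_FT_REG_FILE 1   /* file type code for a regular file */

/* Illustrative in-memory picture of one directory entry. */
struct demo_dir_entry {
    uint32_t inode;      /* inode number the name points to */
    uint16_t rec_len;    /* bytes from this entry to the next one */
    uint8_t  name_len;   /* length of the name in bytes */
    uint8_t  file_type;  /* cached file type, avoids reading the inode */
    char     name[255];  /* not NUL-terminated on disk */
};

/* Minimum space an entry needs: 8 header bytes plus the name,
 * rounded up to a multiple of 4. */
unsigned int dir_rec_len(unsigned int name_len)
{
    return (name_len + 8 + 3) & ~3u;
}

/* Hypothetical helper: fill an entry for a regular file.
 * Returns 0 on success, -1 if the name length is out of range. */
int fill_dir_entry(struct demo_dir_entry *de, uint32_t ino, const char *name)
{
    size_t len = strlen(name);
    if (len == 0 || len > 255)
        return -1;

    de->inode = ino;
    de->name_len = (uint8_t) len;
    de->file_type = DEMO_FT_REG_FILE;
    de->rec_len = (uint16_t) dir_rec_len((unsigned int) len);
    memcpy(de->name, name, len);
    return 0;
}
```

For the name "file.txt" (8 bytes) the minimum record length is 16 bytes. In a real directory block the last entry's rec_len is usually larger, because it also covers the unused space up to the end of the block.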


VIII. The Full Call Stack from User Mode to Kernel Mode

The whole call chain, visualized:

text

[User Space]  Java layer
    File.createNewFile()
         ↓
    UnixFileSystem.createFileExclusively()   [native]

[JNI / Native] OpenJDK native code
    Java_java_io_UnixFileSystem_createFileExclusively()
         ↓
    handleOpen(O_RDWR | O_CREAT | O_EXCL, 0666)
         ↓
    open64()

[User Space] glibc
    __libc_open64() → extract the mode argument
         ↓
    __libc_open() → SYSCALL_CANCEL(openat, AT_FDCWD, ...)
         ↓
    syscall instruction (CPU traps into the kernel)

[Kernel Space] System call entry
    SYSCALL_DEFINE4(openat, ...)
         ↓
    do_sys_open()
         ↓
    do_sys_openat2()
        ├── getname()           → copy the user-space path
        ├── get_unused_fd_flags() → reserve an fd
        ├── do_filp_open()      → core open logic
        └── fd_install()        → bind the fd to the file

[VFS Layer] Virtual file system
    do_filp_open()
        └── path_openat()  [RCU-walk / REF-walk]
            ├── alloc_empty_file() → allocate struct file
            ├── path_init()        → choose the starting point
            ├── link_path_walk()   → resolve the path component by component
            ├── open_last_lookups()
            │   └── lookup_open()
            │       ├── d_lookup()           → search the dcache
            │       ├── dir->i_op->lookup()  → on a cache miss, call ext4_lookup
            │       └── dir->i_op->create()  → create the file (if O_CREAT)
            └── do_open()           → finish the open, call the file system's open method

[ext4 Layer] ext4 file system
    ext4_create()
        ├── ext4_new_inode_start_handle()   → allocate the inode and start a journal handle
        │   ├── choose a block group
        │   ├── read the inode bitmap
        │   ├── assign the inode number
        │   ├── initialize struct inode (size=0, timestamps, permissions)
        │   └── leave a journal handle active
        ├── ext4_add_nondir()
        │   └── ext4_add_entry()            → add the directory entry to the parent
        └── ext4_journal_stop()             → close the handle (the transaction commits later)

[Back to VFS & Return]
    do_filp_open returns struct file
    fd_install creates the fd → file mapping
    the system call returns fd (user space receives an int file descriptor)

[Back to Java]
    JNI closes the fd and the function returns true

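Each layer has its own purpose. The application layer cares about cross-platform consistency and ease of use. The library layer handles thread cancellation and the encoding of the system call. The kernel's system call layer handles argument safety and fd allocation. The VFS layer caches the directory structure for performance and provides a uniform interface. The ext4 layer does the down-to-earth work of updating the on-disk structures.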


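IX. Summary and Review

From two lines of Java to a file created on disk, the journey crossed six main layers:

  1. Java application layer: File and UnixFileSystem delegate to the native implementation through a native method and perform the security manager check.
  2. JNI layer: handleOpen wraps the open64 call, passes the O_CREAT|O_EXCL flags to get exclusive-creation semantics, and additionally checks the file type (rejecting directories).
  3. glibc layer: it handles the variadic argument (the mode permissions), adds the O_LARGEFILE flag, uses the SYSCALL_CANCEL macro to implement the thread cancellation point, and finally enters the kernel through the syscall instruction.
  4. Kernel system call layer: do_sys_openat2 copies the user-space path safely into the kernel, reserves a free file descriptor, and calls the core VFS function.
  5. VFS layer: path_openat switches between RCU-walk and REF-walk for performance; lookup_open uses the dentry cache (dcache) to speed up lookups, calls the underlying file system's lookup method on a cache miss, and calls its create method when the file does not exist.
  6. ext4 layer: ext4_create allocates an inode (updating the bitmap and the group descriptor), adds a directory entry to the parent directory, and records all metadata changes in the journal, so that the file system stays consistent across a crash.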

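The elegance of this design lies in its layered abstraction: each layer minds its own responsibilities and does not need to know the details of the layers below. A Java program does not need to know whether the file system underneath is ext4 or XFS, the VFS does not care about ext4's disk layout, and ext4 does not care how the layers above invoke it. This layering and decoupling is a core principle of modern operating system design.

When you write new File("/path/to/file.txt").createNewFile(), tens of thousands of lines of code work together with precision behind your back. Understanding the process helps you write better file-handling code in Java, for example by knowing that O_EXCL avoids a TOCTOU race, or that the dentry cache makes repeated stat calls on the same file cheap. More importantly, it shows how a computer system is organized: from user space to the kernel, from a high-level language to machine instructions, from in-memory data structures to physical bits on disk. The whole edifice of computing is built layer by layer from the bottom up, and you have just switched on the light inside it.

Source code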

@Override
    public native boolean createFileExclusively(String path)
            throws IOException;

JNIEXPORT jboolean JNICALL
Java_java_io_UnixFileSystem_createFileExclusively(JNIEnv *env, jclass cls,
                                                  jstring pathname)
{
    jboolean rv = JNI_FALSE;

    WITH_PLATFORM_STRING(env, pathname, path) {
        FD fd;
        /* The root directory always exists */
        if (strcmp (path, "/")) {
            fd = handleOpen(path, O_RDWR | O_CREAT | O_EXCL, 0666);
            if (fd < 0) {
                if (errno != EEXIST)
                    JNU_ThrowIOExceptionWithLastError(env, "Could not open file");
            } else {
                if (close(fd) == -1)
                    JNU_ThrowIOExceptionWithLastError(env, "Could not close file");
                rv = JNI_TRUE;
            }
        }
    } END_PLATFORM_STRING(env, path);
    return rv;
}

FD
handleOpen(const char *path, int oflag, int mode) {
    FD fd;
    RESTARTABLE(open64(path, oflag, mode), fd);
    if (fd != -1) {
        struct stat64 buf64;
        int result;
        RESTARTABLE(fstat64(fd, &buf64), result);
        if (result != -1) {
            if (S_ISDIR(buf64.st_mode)) {
                close(fd);
                errno = EISDIR;
                fd = -1;
            }
        } else {
            close(fd);
            fd = -1;
        }
    }
    return fd;
}

/* Open FILE with access OFLAG.  If O_CREAT or O_TMPFILE is in OFLAG,
   a third argument is the file protection.  */
int
__libc_open64 (const char *file, int oflag, ...)
{
  int mode = 0;

  if (__OPEN_NEEDS_MODE (oflag))
    {
      va_list arg;
      va_start (arg, oflag);
      mode = va_arg (arg, int);
      va_end (arg);
    }

  /* __libc_open should be a cancellation point.  */
  return __libc_open (file, oflag | O_LARGEFILE, mode);
}
weak_alias (__libc_open64, __open64)
libc_hidden_weak (__open64)
weak_alias (__libc_open64, open64)

/* Open FILE with access OFLAG.  If O_CREAT or O_TMPFILE is in OFLAG,
   a third argument is the file protection.  */
int
__libc_open (const char *file, int oflag, ...)
{
  int mode = 0;

  if (__OPEN_NEEDS_MODE (oflag))
    {
      va_list arg;
      va_start (arg, oflag);
      mode = va_arg (arg, int);
      va_end (arg);
    }

  return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag, mode);
}
libc_hidden_def (__libc_open)

weak_alias (__libc_open, __open)
libc_hidden_weak (__open)
weak_alias (__libc_open, open)


SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
		umode_t, mode)
{
	if (force_o_largefile())
		flags |= O_LARGEFILE;
	return do_sys_open(dfd, filename, flags, mode);
}

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
	struct open_how how = build_open_how(flags, mode);
	return do_sys_openat2(dfd, filename, &how);
}

static long do_sys_openat2(int dfd, const char __user *filename,
			   struct open_how *how)
{
	struct open_flags op;
	int fd = build_open_flags(how, &op);
	struct filename *tmp;

	if (fd)
		return fd;

	tmp = getname(filename);
	if (IS_ERR(tmp))
		return PTR_ERR(tmp);

	fd = get_unused_fd_flags(how->flags);
	if (fd >= 0) {
		struct file *f = do_filp_open(dfd, tmp, &op);
		if (IS_ERR(f)) {
			put_unused_fd(fd);
			fd = PTR_ERR(f);
		} else {
			fd_install(fd, f);
		}
	}
	putname(tmp);
	return fd;
}

extern struct file *do_filp_open(int dfd, struct filename *pathname,
		const struct open_flags *op);
		
struct file *do_filp_open(int dfd, struct filename *pathname,
		const struct open_flags *op)
{
	struct nameidata nd;
	int flags = op->lookup_flags;
	struct file *filp;

	set_nameidata(&nd, dfd, pathname, NULL);
	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
	if (unlikely(filp == ERR_PTR(-ECHILD)))
		filp = path_openat(&nd, op, flags);
	if (unlikely(filp == ERR_PTR(-ESTALE)))
		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
	restore_nameidata();
	return filp;
}

static struct file *path_openat(struct nameidata *nd,
			const struct open_flags *op, unsigned flags)
{
	struct file *file;
	int error;

	file = alloc_empty_file(op->open_flag, current_cred());
	if (IS_ERR(file))
		return file;

	if (unlikely(file->f_flags & __O_TMPFILE)) {
		error = do_tmpfile(nd, flags, op, file);
	} else if (unlikely(file->f_flags & O_PATH)) {
		error = do_o_path(nd, flags, file);
	} else {
		const char *s = path_init(nd, flags);
		while (!(error = link_path_walk(s, nd)) &&
		       (s = open_last_lookups(nd, file, op)) != NULL)
			;
		if (!error)
			error = do_open(nd, file, op);
		terminate_walk(nd);
	}
	if (likely(!error)) {
		if (likely(file->f_mode & FMODE_OPENED))
			return file;
		WARN_ON(1);
		error = -EINVAL;
	}
	fput(file);
	if (error == -EOPENSTALE) {
		if (flags & LOOKUP_RCU)
			error = -ECHILD;
		else
			error = -ESTALE;
	}
	return ERR_PTR(error);
}

static const char *open_last_lookups(struct nameidata *nd,
		   struct file *file, const struct open_flags *op)
{
	struct dentry *dir = nd->path.dentry;
	int open_flag = op->open_flag;
	bool got_write = false;
	struct dentry *dentry;
	const char *res;

	nd->flags |= op->intent;

	if (nd->last_type != LAST_NORM) {
		if (nd->depth)
			put_link(nd);
		return handle_dots(nd, nd->last_type);
	}

	if (!(open_flag & O_CREAT)) {
		if (nd->last.name[nd->last.len])
			nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
		/* we _can_ be in RCU mode here */
		dentry = lookup_fast(nd);
		if (IS_ERR(dentry))
			return ERR_CAST(dentry);
		if (likely(dentry))
			goto finish_lookup;

		if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
			return ERR_PTR(-ECHILD);
	} else {
		/* create side of things */
		if (nd->flags & LOOKUP_RCU) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
		}
		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
		/* trailing slashes? */
		if (unlikely(nd->last.name[nd->last.len]))
			return ERR_PTR(-EISDIR);
	}

	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
		got_write = !mnt_want_write(nd->path.mnt);
		/*
		 * do _not_ fail yet - we might not need that or fail with
		 * a different error; let lookup_open() decide; we'll be
		 * dropping this one anyway.
		 */
	}
	if (open_flag & O_CREAT)
		inode_lock(dir->d_inode);
	else
		inode_lock_shared(dir->d_inode);
	dentry = lookup_open(nd, file, op, got_write);
	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
		fsnotify_create(dir->d_inode, dentry);
	if (open_flag & O_CREAT)
		inode_unlock(dir->d_inode);
	else
		inode_unlock_shared(dir->d_inode);

	if (got_write)
		mnt_drop_write(nd->path.mnt);

	if (IS_ERR(dentry))
		return ERR_CAST(dentry);

	if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
		dput(nd->path.dentry);
		nd->path.dentry = dentry;
		return NULL;
	}

finish_lookup:
	if (nd->depth)
		put_link(nd);
	res = step_into(nd, WALK_TRAILING, dentry);
	if (unlikely(res))
		nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
	return res;
}


/*
 * Look up and maybe create and open the last component.
 *
 * Must be called with parent locked (exclusive in O_CREAT case).
 *
 * Returns 0 on success, that is, if
 *  the file was successfully atomically created (if necessary) and opened, or
 *  the file was not completely opened at this time, though lookups and
 *  creations were performed.
 * These case are distinguished by presence of FMODE_OPENED on file->f_mode.
 * In the latter case dentry returned in @path might be negative if O_CREAT
 * hadn't been specified.
 *
 * An error code is returned on failure.
 */
static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
				  const struct open_flags *op,
				  bool got_write)
{
	struct mnt_idmap *idmap;
	struct dentry *dir = nd->path.dentry;
	struct inode *dir_inode = dir->d_inode;
	int open_flag = op->open_flag;
	struct dentry *dentry;
	int error, create_error = 0;
	umode_t mode = op->mode;
	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

	if (unlikely(IS_DEADDIR(dir_inode)))
		return ERR_PTR(-ENOENT);

	file->f_mode &= ~FMODE_CREATED;
	dentry = d_lookup(dir, &nd->last);
	for (;;) {
		if (!dentry) {
			dentry = d_alloc_parallel(dir, &nd->last, &wq);
			if (IS_ERR(dentry))
				return dentry;
		}
		if (d_in_lookup(dentry))
			break;

		error = d_revalidate(dentry, nd->flags);
		if (likely(error > 0))
			break;
		if (error)
			goto out_dput;
		d_invalidate(dentry);
		dput(dentry);
		dentry = NULL;
	}
	if (dentry->d_inode) {
		/* Cached positive dentry: will open in f_op->open */
		return dentry;
	}

	/*
	 * Checking write permission is tricky, bacuse we don't know if we are
	 * going to actually need it: O_CREAT opens should work as long as the
	 * file exists.  But checking existence breaks atomicity.  The trick is
	 * to check access and if not granted clear O_CREAT from the flags.
	 *
	 * Another problem is returing the "right" error value (e.g. for an
	 * O_EXCL open we want to return EEXIST not EROFS).
	 */
	if (unlikely(!got_write))
		open_flag &= ~O_TRUNC;
	idmap = mnt_idmap(nd->path.mnt);
	if (open_flag & O_CREAT) {
		if (open_flag & O_EXCL)
			open_flag &= ~O_TRUNC;
		mode = vfs_prepare_mode(idmap, dir->d_inode, mode, mode, mode);
		if (likely(got_write))
			create_error = may_o_create(idmap, &nd->path,
						    dentry, mode);
		else
			create_error = -EROFS;
	}
	if (create_error)
		open_flag &= ~O_CREAT;
	if (dir_inode->i_op->atomic_open) {
		dentry = atomic_open(nd, dentry, file, open_flag, mode);
		if (unlikely(create_error) && dentry == ERR_PTR(-ENOENT))
			dentry = ERR_PTR(create_error);
		return dentry;
	}

	if (d_in_lookup(dentry)) {
		struct dentry *res = dir_inode->i_op->lookup(dir_inode, dentry,
							     nd->flags);
		d_lookup_done(dentry);
		if (unlikely(res)) {
			if (IS_ERR(res)) {
				error = PTR_ERR(res);
				goto out_dput;
			}
			dput(dentry);
			dentry = res;
		}
	}

	/* Negative dentry, just create the file */
	if (!dentry->d_inode && (open_flag & O_CREAT)) {
		file->f_mode |= FMODE_CREATED;
		audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
		if (!dir_inode->i_op->create) {
			error = -EACCES;
			goto out_dput;
		}

		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
						mode, open_flag & O_EXCL);
		if (error)
			goto out_dput;
	}
	if (unlikely(create_error) && !dentry->d_inode) {
		error = create_error;
		goto out_dput;
	}
	return dentry;

out_dput:
	dput(dentry);
	return ERR_PTR(error);
}

/*
 * directories can handle most operations...
 */
const struct inode_operations ext4_dir_inode_operations = {
	.create		= ext4_create,
	.lookup		= ext4_lookup,
	.link		= ext4_link,
	.unlink		= ext4_unlink,
	.symlink	= ext4_symlink,
	.mkdir		= ext4_mkdir,
	.rmdir		= ext4_rmdir,
	.mknod		= ext4_mknod,
	.tmpfile	= ext4_tmpfile,
	.rename		= ext4_rename2,
	.setattr	= ext4_setattr,
	.getattr	= ext4_getattr,
	.listxattr	= ext4_listxattr,
	.get_inode_acl	= ext4_get_acl,
	.set_acl	= ext4_set_acl,
	.fiemap         = ext4_fiemap,
	.fileattr_get	= ext4_fileattr_get,
	.fileattr_set	= ext4_fileattr_set,
};


/*
 * By the time this is called, we already have created
 * the directory cache entry for the new file, but it
 * is so far negative - it has no inode.
 *
 * If the create succeeds, we fill in the inode information
 * with d_instantiate().
 */
static int ext4_create(struct mnt_idmap *idmap, struct inode *dir,
		       struct dentry *dentry, umode_t mode, bool excl)
{
	handle_t *handle;
	struct inode *inode;
	int err, credits, retries = 0;

	err = dquot_initialize(dir);
	if (err)
		return err;

	credits = (EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +
		   EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3);
retry:
	inode = ext4_new_inode_start_handle(idmap, dir, mode, &dentry->d_name,
					    0, NULL, EXT4_HT_DIR, credits);
	handle = ext4_journal_current_handle();
	err = PTR_ERR(inode);
	if (!IS_ERR(inode)) {
		inode->i_op = &ext4_file_inode_operations;
		inode->i_fop = &ext4_file_operations;
		ext4_set_aops(inode);
		err = ext4_add_nondir(handle, dentry, &inode);
		if (!err)
			ext4_fc_track_create(handle, dentry);
	}
	if (handle)
		ext4_journal_stop(handle);
	if (!IS_ERR_OR_NULL(inode))
		iput(inode);
	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
		goto retry;
	return err;
}

#define ext4_new_inode_start_handle(idmap, dir, mode, qstr, goal, owner, \
				    type, nblocks)		    \
	__ext4_new_inode((idmap), NULL, (dir), (mode), (qstr), (goal), (owner), \
			 0, (type), __LINE__, (nblocks))


/*
 * There are two policies for allocating an inode.  If the new inode is
 * a directory, then a forward search is made for a block group with both
 * free space and a low directory-to-inode ratio; if that fails, then of
 * the groups with above-average free space, that group with the fewest
 * directories already is chosen.
 *
 * For other inodes, search forward from the parent directory's block
 * group to find a free inode.
 */
struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
			       handle_t *handle, struct inode *dir,
			       umode_t mode, const struct qstr *qstr,
			       __u32 goal, uid_t *owner, __u32 i_flags,
			       int handle_type, unsigned int line_no,
			       int nblocks)
{
	struct super_block *sb;
	struct buffer_head *inode_bitmap_bh = NULL;
	struct buffer_head *group_desc_bh;
	ext4_group_t ngroups, group = 0;
	unsigned long ino = 0;
	struct inode *inode;
	struct ext4_group_desc *gdp = NULL;
	struct ext4_inode_info *ei;
	struct ext4_sb_info *sbi;
	int ret2, err;
	struct inode *ret;
	ext4_group_t i;
	ext4_group_t flex_group;
	struct ext4_group_info *grp = NULL;
	bool encrypt = false;

	/* Cannot create files in a deleted directory */
	if (!dir || !dir->i_nlink)
		return ERR_PTR(-EPERM);

	sb = dir->i_sb;
	sbi = EXT4_SB(sb);

	if (unlikely(ext4_forced_shutdown(sb)))
		return ERR_PTR(-EIO);

	ngroups = ext4_get_groups_count(sb);
	trace_ext4_request_inode(dir, mode);
	inode = new_inode(sb);
	if (!inode)
		return ERR_PTR(-ENOMEM);
	ei = EXT4_I(inode);

	/*
	 * Initialize owners and quota early so that we don't have to account
	 * for quota initialization worst case in standard inode creating
	 * transaction
	 */
	if (owner) {
		inode->i_mode = mode;
		i_uid_write(inode, owner[0]);
		i_gid_write(inode, owner[1]);
	} else if (test_opt(sb, GRPID)) {
		inode->i_mode = mode;
		inode_fsuid_set(inode, idmap);
		inode->i_gid = dir->i_gid;
	} else
		inode_init_owner(idmap, inode, dir, mode);

	if (ext4_has_feature_project(sb) &&
	    ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT))
		ei->i_projid = EXT4_I(dir)->i_projid;
	else
		ei->i_projid = make_kprojid(&init_user_ns, EXT4_DEF_PROJID);

	if (!(i_flags & EXT4_EA_INODE_FL)) {
		err = fscrypt_prepare_new_inode(dir, inode, &encrypt);
		if (err)
			goto out;
	}

	err = dquot_initialize(inode);
	if (err)
		goto out;

	if (!handle && sbi->s_journal && !(i_flags & EXT4_EA_INODE_FL)) {
		ret2 = ext4_xattr_credits_for_new_inode(dir, mode, encrypt);
		if (ret2 < 0) {
			err = ret2;
			goto out;
		}
		nblocks += ret2;
	}

	if (!goal)
		goal = sbi->s_inode_goal;

	if (goal && goal <= le32_to_cpu(sbi->s_es->s_inodes_count)) {
		group = (goal - 1) / EXT4_INODES_PER_GROUP(sb);
		ino = (goal - 1) % EXT4_INODES_PER_GROUP(sb);
		ret2 = 0;
		goto got_group;
	}

	if (S_ISDIR(mode))
		ret2 = find_group_orlov(sb, dir, &group, mode, qstr);
	else
		ret2 = find_group_other(sb, dir, &group, mode);

got_group:
	EXT4_I(dir)->i_last_alloc_group = group;
	err = -ENOSPC;
	if (ret2 == -1)
		goto out;

	/*
	 * Normally we will only go through one pass of this loop,
	 * unless we get unlucky and it turns out the group we selected
	 * had its last inode grabbed by someone else.
	 */
	for (i = 0; i < ngroups; i++, ino = 0) {
		err = -EIO;

		gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
		if (!gdp)
			goto out;

		/*
		 * Check free inodes count before loading bitmap.
		 */
		if (ext4_free_inodes_count(sb, gdp) == 0)
			goto next_group;

		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
			grp = ext4_get_group_info(sb, group);
			/*
			 * Skip groups with already-known suspicious inode
			 * tables
			 */
			if (!grp || EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
				goto next_group;
		}

		brelse(inode_bitmap_bh);
		inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
		/* Skip groups with suspicious inode tables */
		if (((!(sbi->s_mount_state & EXT4_FC_REPLAY))
		     && EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) ||
		    IS_ERR(inode_bitmap_bh)) {
			inode_bitmap_bh = NULL;
			goto next_group;
		}

repeat_in_this_group:
		ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino);
		if (!ret2)
			goto next_group;

		if (group == 0 && (ino + 1) < EXT4_FIRST_INO(sb)) {
			ext4_error(sb, "reserved inode found cleared - "
				   "inode=%lu", ino + 1);
			ext4_mark_group_bitmap_corrupted(sb, group,
					EXT4_GROUP_INFO_IBITMAP_CORRUPT);
			goto next_group;
		}

		if ((!(sbi->s_mount_state & EXT4_FC_REPLAY)) && !handle) {
			BUG_ON(nblocks <= 0);
			handle = __ext4_journal_start_sb(NULL, dir->i_sb,
				 line_no, handle_type, nblocks, 0,
				 ext4_trans_default_revoke_credits(sb));
			if (IS_ERR(handle)) {
				err = PTR_ERR(handle);
				ext4_std_error(sb, err);
				goto out;
			}
		}
		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
		err = ext4_journal_get_write_access(handle, sb, inode_bitmap_bh,
						    EXT4_JTR_NONE);
		if (err) {
			ext4_std_error(sb, err);
			goto out;
		}
		ext4_lock_group(sb, group);
		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
		if (ret2) {
			/* Someone already took the bit. Repeat the search
			 * with lock held.
			 */
			ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino);
			if (ret2) {
				ext4_set_bit(ino, inode_bitmap_bh->b_data);
				ret2 = 0;
			} else {
				ret2 = 1; /* we didn't grab the inode */
			}
		}
		ext4_unlock_group(sb, group);
		ino++;		/* the inode bitmap is zero-based */
		if (!ret2)
			goto got; /* we grabbed the inode! */

		if (ino < EXT4_INODES_PER_GROUP(sb))
			goto repeat_in_this_group;
next_group:
		if (++group == ngroups)
			group = 0;
	}
	err = -ENOSPC;
	goto out;

got:
	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
	if (err) {
		ext4_std_error(sb, err);
		goto out;
	}

	BUFFER_TRACE(group_desc_bh, "get_write_access");
	err = ext4_journal_get_write_access(handle, sb, group_desc_bh,
					    EXT4_JTR_NONE);
	if (err) {
		ext4_std_error(sb, err);
		goto out;
	}

	/* We may have to initialize the block bitmap if it isn't already */
	if (ext4_has_group_desc_csum(sb) &&
	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
		struct buffer_head *block_bitmap_bh;

		block_bitmap_bh = ext4_read_block_bitmap(sb, group);
		if (IS_ERR(block_bitmap_bh)) {
			err = PTR_ERR(block_bitmap_bh);
			goto out;
		}
		BUFFER_TRACE(block_bitmap_bh, "get block bitmap access");
		err = ext4_journal_get_write_access(handle, sb, block_bitmap_bh,
						    EXT4_JTR_NONE);
		if (err) {
			brelse(block_bitmap_bh);
			ext4_std_error(sb, err);
			goto out;
		}

		BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap");
		err = ext4_handle_dirty_metadata(handle, NULL, block_bitmap_bh);

		/* recheck and clear flag under lock if we still need to */
		ext4_lock_group(sb, group);
		if (ext4_has_group_desc_csum(sb) &&
		    (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
			ext4_free_group_clusters_set(sb, gdp,
				ext4_free_clusters_after_init(sb, group, gdp));
			ext4_block_bitmap_csum_set(sb, gdp, block_bitmap_bh);
			ext4_group_desc_csum_set(sb, group, gdp);
		}
		ext4_unlock_group(sb, group);
		brelse(block_bitmap_bh);

		if (err) {
			ext4_std_error(sb, err);
			goto out;
		}
	}

	/* Update the relevant bg descriptor fields */
	if (ext4_has_group_desc_csum(sb)) {
		int free;
		struct ext4_group_info *grp = NULL;

		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
			grp = ext4_get_group_info(sb, group);
			if (!grp) {
				err = -EFSCORRUPTED;
				goto out;
			}
			down_read(&grp->alloc_sem); /*
						     * protect vs itable
						     * lazyinit
						     */
		}
		ext4_lock_group(sb, group); /* while we modify the bg desc */
		free = EXT4_INODES_PER_GROUP(sb) -
			ext4_itable_unused_count(sb, gdp);
		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
			free = 0;
		}
		/*
		 * Check the relative inode number against the last used
		 * relative inode number in this group. if it is greater
		 * we need to update the bg_itable_unused count
		 */
		if (ino > free)
			ext4_itable_unused_set(sb, gdp,
					(EXT4_INODES_PER_GROUP(sb) - ino));
		if (!(sbi->s_mount_state & EXT4_FC_REPLAY))
			up_read(&grp->alloc_sem);
	} else {
		ext4_lock_group(sb, group);
	}

	ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1);
	if (S_ISDIR(mode)) {
		ext4_used_dirs_set(sb, gdp, ext4_used_dirs_count(sb, gdp) + 1);
		if (sbi->s_log_groups_per_flex) {
			ext4_group_t f = ext4_flex_group(sbi, group);

			atomic_inc(&sbi_array_rcu_deref(sbi, s_flex_groups,
							f)->used_dirs);
		}
	}
	if (ext4_has_group_desc_csum(sb)) {
		ext4_inode_bitmap_csum_set(sb, gdp, inode_bitmap_bh,
					   EXT4_INODES_PER_GROUP(sb) / 8);
		ext4_group_desc_csum_set(sb, group, gdp);
	}
	ext4_unlock_group(sb, group);

	BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata");
	err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh);
	if (err) {
		ext4_std_error(sb, err);
		goto out;
	}

	percpu_counter_dec(&sbi->s_freeinodes_counter);
	if (S_ISDIR(mode))
		percpu_counter_inc(&sbi->s_dirs_counter);

	if (sbi->s_log_groups_per_flex) {
		flex_group = ext4_flex_group(sbi, group);
		atomic_dec(&sbi_array_rcu_deref(sbi, s_flex_groups,
						flex_group)->free_inodes);
	}

	inode->i_ino = ino + group * EXT4_INODES_PER_GROUP(sb);
	/* This is the optimal IO size (for stat), not the fs block size */
	inode->i_blocks = 0;
	simple_inode_init_ts(inode);
	ei->i_crtime = inode_get_mtime(inode);

	memset(ei->i_data, 0, sizeof(ei->i_data));
	ei->i_dir_start_lookup = 0;
	ei->i_disksize = 0;

	/* Don't inherit extent flag from directory, amongst others. */
	ei->i_flags =
		ext4_mask_flags(mode, EXT4_I(dir)->i_flags & EXT4_FL_INHERITED);
	ei->i_flags |= i_flags;
	ei->i_file_acl = 0;
	ei->i_dtime = 0;
	ei->i_block_group = group;
	ei->i_last_alloc_group = ~0;

	ext4_set_inode_flags(inode, true);
	if (IS_DIRSYNC(inode))
		ext4_handle_sync(handle);
	if (insert_inode_locked(inode) < 0) {
		/*
		 * Likely a bitmap corruption causing inode to be allocated
		 * twice.
		 */
		err = -EIO;
		ext4_error(sb, "failed to insert inode %lu: doubly allocated?",
			   inode->i_ino);
		ext4_mark_group_bitmap_corrupted(sb, group,
					EXT4_GROUP_INFO_IBITMAP_CORRUPT);
		goto out;
	}
	inode->i_generation = get_random_u32();

	/* Precompute checksum seed for inode metadata */
	if (ext4_has_metadata_csum(sb)) {
		__u32 csum;
		__le32 inum = cpu_to_le32(inode->i_ino);
		__le32 gen = cpu_to_le32(inode->i_generation);
		csum = ext4_chksum(sbi, sbi->s_csum_seed, (__u8 *)&inum,
				   sizeof(inum));
		ei->i_csum_seed = ext4_chksum(sbi, csum, (__u8 *)&gen,
					      sizeof(gen));
	}

	ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */
	ext4_set_inode_state(inode, EXT4_STATE_NEW);

	ei->i_extra_isize = sbi->s_want_extra_isize;
	ei->i_inline_off = 0;
	if (ext4_has_feature_inline_data(sb) &&
	    (!(ei->i_flags & EXT4_DAX_FL) || S_ISDIR(mode)))
		ext4_set_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
	ret = inode;
	err = dquot_alloc_inode(inode);
	if (err)
		goto fail_drop;

	/*
	 * Since the encryption xattr will always be unique, create it first so
	 * that it's less likely to end up in an external xattr block and
	 * prevent its deduplication.
	 */
	if (encrypt) {
		err = fscrypt_set_context(inode, handle);
		if (err)
			goto fail_free_drop;
	}

	if (!(ei->i_flags & EXT4_EA_INODE_FL)) {
		err = ext4_init_acl(handle, inode, dir);
		if (err)
			goto fail_free_drop;

		err = ext4_init_security(handle, inode, dir, qstr);
		if (err)
			goto fail_free_drop;
	}

	if (ext4_has_feature_extents(sb)) {
		/* set extent flag only for directory, file and normal symlink*/
		if (S_ISDIR(mode) || S_ISREG(mode) || S_ISLNK(mode)) {
			ext4_set_inode_flag(inode, EXT4_INODE_EXTENTS);
			ext4_ext_tree_init(handle, inode);
		}
	}

	if (ext4_handle_valid(handle)) {
		ei->i_sync_tid = handle->h_transaction->t_tid;
		ei->i_datasync_tid = handle->h_transaction->t_tid;
	}

	err = ext4_mark_inode_dirty(handle, inode);
	if (err) {
		ext4_std_error(sb, err);
		goto fail_free_drop;
	}

	ext4_debug("allocating inode %lu\n", inode->i_ino);
	trace_ext4_allocate_inode(inode, dir, mode);
	brelse(inode_bitmap_bh);
	return ret;

fail_free_drop:
	dquot_free_inode(inode);
fail_drop:
	clear_nlink(inode);
	unlock_new_inode(inode);
out:
	dquot_drop(inode);
	inode->i_flags |= S_NOQUOTA;
	iput(inode);
	brelse(inode_bitmap_bh);
	return ERR_PTR(err);
}
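
Stripped of journaling, locking, checksums and corruption checks, the core of `__ext4_new_inode()` for a regular file is: start at the parent directory's block group, claim the first clear bit in that group's inode bitmap, move on to the next group (wrapping around) if the group is full, and turn the zero-based bit into a one-based global inode number. The toy model below sketches just that. The sizes, the `toy_` names and the single-threaded, non-atomic bit operation are all assumptions made for illustration; the kernel does the claim under the group lock and inside a journal transaction.

```c
#include <stdint.h>

#define INODES_PER_GROUP 16u	/* hypothetical; real groups hold thousands */
#define NGROUPS 4u

struct toy_fs {
	uint8_t bitmap[NGROUPS][INODES_PER_GROUP / 8];
};

/* Set bit `nr` and return its previous value (non-atomic toy version). */
static int toy_test_and_set_bit(unsigned nr, uint8_t *map)
{
	uint8_t mask = (uint8_t)(1u << (nr % 8));
	int old = (map[nr / 8] & mask) != 0;

	map[nr / 8] |= mask;
	return old;
}

/*
 * Search forward from `start_group`, wrapping around, and claim the first
 * free bit. Returns the 1-based global inode number, or 0 if none is free
 * (where the kernel would return -ENOSPC).
 */
static unsigned long toy_alloc_inode(struct toy_fs *fs, unsigned start_group)
{
	unsigned group = start_group % NGROUPS;

	for (unsigned i = 0; i < NGROUPS; i++) {
		for (unsigned bit = 0; bit < INODES_PER_GROUP; bit++) {
			if (!toy_test_and_set_bit(bit, fs->bitmap[group]))
				return (bit + 1) +
				       (unsigned long)group * INODES_PER_GROUP;
		}
		if (++group == NGROUPS)
			group = 0;
	}
	return 0;
}
```

With 16 inodes per group, the first allocation that starts in group 2 claims bit 0 there and yields inode number 1 + 2 × 16 = 33, mirroring the `ino++` followed by `ino + group * EXT4_INODES_PER_GROUP(sb)` in the kernel code.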

/*
 * Add non-directory inode to a directory. On success, the inode reference is
 * consumed by dentry instantiation. This is also indicated by clearing of
 * *inodep pointer. On failure, the caller is responsible for dropping the
 * inode reference in the safe context.
 */
static int ext4_add_nondir(handle_t *handle,
		struct dentry *dentry, struct inode **inodep)
{
	struct inode *dir = d_inode(dentry->d_parent);
	struct inode *inode = *inodep;
	int err = ext4_add_entry(handle, dentry, inode);
	if (!err) {
		err = ext4_mark_inode_dirty(handle, inode);
		if (IS_DIRSYNC(dir))
			ext4_handle_sync(handle);
		d_instantiate_new(dentry, inode);
		*inodep = NULL;
		return err;
	}
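	/*
	 * Adding the directory entry failed: drop the link count so no name
	 * refers to the inode, and put it on the orphan list so it can be
	 * reclaimed.
	 */
	drop_nlink(inode);
	ext4_mark_inode_dirty(handle, inode);
	ext4_orphan_add(handle, inode);
	unlock_new_inode(inode);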
	return err;
}


/*
 * Handle the last step of open()
 */
static int do_open(struct nameidata *nd,
		   struct file *file, const struct open_flags *op)
{
	struct mnt_idmap *idmap;
	int open_flag = op->open_flag;
	bool do_truncate;
	int acc_mode;
	int error;

	if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) {
		error = complete_walk(nd);
		if (error)
			return error;
	}
	if (!(file->f_mode & FMODE_CREATED))
		audit_inode(nd->name, nd->path.dentry, 0);
	idmap = mnt_idmap(nd->path.mnt);
	if (open_flag & O_CREAT) {
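		/*
		 * O_CREAT | O_EXCL: if this call did not create the file
		 * (FMODE_CREATED is unset), the name already existed, so
		 * the exclusive create fails with EEXIST.
		 */
		if ((open_flag & O_EXCL) && !(file->f_mode & FMODE_CREATED))
			return -EEXIST;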
		if (d_is_dir(nd->path.dentry))
			return -EISDIR;
		error = may_create_in_sticky(idmap, nd,
					     d_backing_inode(nd->path.dentry));
		if (unlikely(error))
			return error;
	}
	if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
		return -ENOTDIR;

	do_truncate = false;
	acc_mode = op->acc_mode;
	if (file->f_mode & FMODE_CREATED) {
		/* Don't check for write permission, don't truncate */
		open_flag &= ~O_TRUNC;
		acc_mode = 0;
	} else if (d_is_reg(nd->path.dentry) && open_flag & O_TRUNC) {
		error = mnt_want_write(nd->path.mnt);
		if (error)
			return error;
		do_truncate = true;
	}
	error = may_open(idmap, &nd->path, acc_mode, open_flag);
	if (!error && !(file->f_mode & FMODE_OPENED))
		error = vfs_open(&nd->path, file);
	if (!error)
		error = ima_file_check(file, op->acc_mode);
	if (!error && do_truncate)
		error = handle_truncate(idmap, file);
	if (unlikely(error > 0)) {
		WARN_ON(1);
		error = -EINVAL;
	}
	if (do_truncate)
		mnt_drop_write(nd->path.mnt);
	return error;
}
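
Seen from userspace, the `O_EXCL` check in `do_open()` is what makes exclusive creation observable: the first `open()` with `O_CREAT | O_EXCL` succeeds, and a second one on the same path fails with `EEXIST`. A minimal sketch follows; the helper name `create_exclusive` and its tri-state return value are choices made for this example, and the flags and mode are illustrative rather than a claim about exactly what the JDK passes.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to create `path` exclusively.
 * Returns 1 if this call created the file, 0 if it already existed,
 * and -1 on any other error (errno is left set by open()).
 */
static int create_exclusive(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0666);

	if (fd >= 0) {
		close(fd);
		return 1;
	}
	return errno == EEXIST ? 0 : -1;
}
```

The 1 / 0 split matches the contract described earlier for `createNewFile()`: an already existing file is reported as `false`, not as an exception.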

/**
 * vfs_open - open the file at the given path
 * @path: path to open
 * @file: newly allocated file with f_flag initialized
 */
int vfs_open(const struct path *path, struct file *file)
{
	file->f_path = *path;
	return do_dentry_open(file, d_backing_inode(path->dentry), NULL);
}


static int do_dentry_open(struct file *f,
			  struct inode *inode,
			  int (*open)(struct inode *, struct file *))
{
	static const struct file_operations empty_fops = {};
	int error;

	path_get(&f->f_path);
	f->f_inode = inode;
	f->f_mapping = inode->i_mapping;
	f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
	f->f_sb_err = file_sample_sb_err(f);

	if (unlikely(f->f_flags & O_PATH)) {
		f->f_mode = FMODE_PATH | FMODE_OPENED;
		f->f_op = &empty_fops;
		return 0;
	}

	if ((f->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) {
		i_readcount_inc(inode);
	} else if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) {
		error = file_get_write_access(f);
		if (unlikely(error))
			goto cleanup_file;
		f->f_mode |= FMODE_WRITER;
	}

	/* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
	if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))
		f->f_mode |= FMODE_ATOMIC_POS;

	f->f_op = fops_get(inode->i_fop);
	if (WARN_ON(!f->f_op)) {
		error = -ENODEV;
		goto cleanup_all;
	}

	error = security_file_open(f);
	if (error)
		goto cleanup_all;

	error = break_lease(file_inode(f), f->f_flags);
	if (error)
		goto cleanup_all;

	/* normally all 3 are set; ->open() can clear them if needed */
	f->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
	if (!open)
		open = f->f_op->open;
	if (open) {
		error = open(inode, f);
		if (error)
			goto cleanup_all;
	}
	f->f_mode |= FMODE_OPENED;
	if ((f->f_mode & FMODE_READ) &&
	     likely(f->f_op->read || f->f_op->read_iter))
		f->f_mode |= FMODE_CAN_READ;
	if ((f->f_mode & FMODE_WRITE) &&
	     likely(f->f_op->write || f->f_op->write_iter))
		f->f_mode |= FMODE_CAN_WRITE;
	if ((f->f_mode & FMODE_LSEEK) && !f->f_op->llseek)
		f->f_mode &= ~FMODE_LSEEK;
	if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
		f->f_mode |= FMODE_CAN_ODIRECT;

	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
	f->f_iocb_flags = iocb_flags(f);

	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);

	if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
		return -EINVAL;

	/*
	 * XXX: Huge page cache doesn't support writing yet. Drop all page
	 * cache for this file before processing writes.
	 */
	if (f->f_mode & FMODE_WRITE) {
		/*
		 * Paired with smp_mb() in collapse_file() to ensure nr_thps
		 * is up to date and the update to i_writecount by
		 * get_write_access() is visible. Ensures subsequent insertion
		 * of THPs into the page cache will fail.
		 */
		smp_mb();
		if (filemap_nr_thps(inode->i_mapping)) {
			struct address_space *mapping = inode->i_mapping;

			filemap_invalidate_lock(inode->i_mapping);
			/*
			 * unmap_mapping_range() only needs to be called once
			 * here: private (copy-on-write) pages, e.g. the data
			 * segment of a dynamic shared library, do not need to
			 * be unmapped.
			 */
			unmap_mapping_range(mapping, 0, 0, 0);
			truncate_inode_pages(mapping, 0);
			filemap_invalidate_unlock(inode->i_mapping);
		}
	}

	/*
	 * Once we return a file with FMODE_OPENED, __fput() will call
	 * fsnotify_close(), so we need fsnotify_open() here for symmetry.
	 */
	fsnotify_open(f);
	return 0;

cleanup_all:
	if (WARN_ON_ONCE(error > 0))
		error = -EINVAL;
	fops_put(f->f_op);
	put_file_access(f);
cleanup_file:
	path_put(&f->f_path);
	f->f_path.mnt = NULL;
	f->f_path.dentry = NULL;
	f->f_inode = NULL;
	return error;
}
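
One detail of `do_dentry_open()` can be checked from userspace: once the open has completed, the creation-time flags are cleared from `f->f_flags` (`O_CREAT`, `O_EXCL`, `O_NOCTTY`, `O_TRUNC`), so they only influence how the file is opened and are not part of its ongoing state. The sketch below assumes Linux behaviour and uses a hypothetical helper name. It opens a new file exclusively and asks `fcntl(F_GETFL)` which of those flags remain.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Create `path` exclusively, then report which creation-time flags are
 * still visible through F_GETFL. Returns the leftover flag bits
 * (expected to be 0 on Linux), or -1 if open() or fcntl() fails.
 * The file is removed again before returning.
 */
static int leftover_creation_flags(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
	int flags;

	if (fd < 0)
		return -1;
	flags = fcntl(fd, F_GETFL);
	close(fd);
	unlink(path);
	if (flags < 0)
		return -1;
	return flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
}
```

The access mode (`O_RDWR` here) does survive and can be read back with `flags & O_ACCMODE`.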