Kernel Bootup Page Table Initialize Process(x86_64)

Kernel Bootup Page Table Initialize Process(x86_64)

This article will provide detailed information about the kernel bootup page table setup.

In a brief view, the kernel setup page table in three steps:

  1. Setup the 4GB identity mapping
  2. Setup 64bit mode page table early_top_pgt
  3. Setup 64bit mode page table init_top_pgt

The last two steps are both higher mapping: Map the 512MB physical address to virtual address 0xffff80000000 – 0xffff80000000 + 512MB.

Next, we will talk about the details. We will use the 4.14 version code to explain the process.

You need to know the IA32e paging mechanism and relocation to read the article. The Intel manual has a good explaination of IA32e paging

https://github.com/torvalds/linux/blob/v4.14/arch/x86/boot/compressed/head_64.S

Before decompression

When the kernel is being loaded, it is either decompressed by a third-party bootloader like GRUB2 or by the kernel itself. Now we will talk about the second condition. The code started from arch/x86/boot/header.S . It is in 16bit real mode at the time. Then in code  arch/x86/boot/compressed/head_64.S  We setup the first page table in 32bit mode. We need this page table to take us to do take us to 64bit mode.

The following code is the set-up process

Notice that from the comment above. %ebx contain the address where we move kernel to make a safe decompression. Which means we should treat %ebx as an offset to the compiled binary. The compiled binary start at 0. So we fix-up the difference to reach the real physical address.

The above code setup Top level page directory. This only set the lowest page directory entry to (1007 + pgtable). This is a pointer to the next level page table. And next level page table start at 0x1000 + pgtable. The last line adds %edx to 4+%edi will set encryption masks if SEV is active. Currently, we can omit this line.

Then we look at the next level.

Here, we can see we set up four entries. and each entry point to another page directory.

This is the last level of page directory, these entry will point to a physical page frame directly. Now let’s take a look at the code. It sets up 2048 entries. Each entry with a Page Flag R/W = 1 U/S = 0 PS = 1. This means the page is read / write by kernel only and its size is 2MB. Each PTE(Page Table Entry) is a 8 Byte block data. So one page can contain at most 512 entries. Here kernel setup 4 pages of Level 2 Page Directory. The following image show the current page table structure.

In total we have 2048 * 2MB = 4GB physical address, identity mapped to 0 – 4GB linear address.

 

Then we use a long return to switch to 64bit mode.

Kernel push the startup_64 and CS register to stack, then perform a long return to enter 64bit mode.  And then after copy the compressed kernel, we jump to symbol relocated

In the relocated code, we do the kernel decompression.

The decompressed kernel is compiled at high address(we take ffffffff81000000 for example). But now we don’t have the correct page table to do the mapping. Fortunately, the extract_kernel function returns the physical address of the decompressed kernel. (Which is %ebp, equals to %ebx). After decompression, %rax contains the kernel physical start address. We jump there to perform the further setup.

Start execution in vmlinux

We now arrived at arch/x86/kernel/head_64.S. Before we continue, we must notice two things first.

  • After decompression, the kernel is placed at physical address %rbp (If we do not set CONFIG_RELOCATABLE it’s equal to 0x1000000
    ).
  • After decompression, we now in the kernel code compiled with the virtual address ffffffff81000000(as we mentioned above).

So here is a big pitfall. We cannot access ANY of the symbols in vmlinux currently. Because we only have a basic identity mapping now. But we need to visit the variables. How can we make it? The kernel uses a trick here, I will show it below

This function fixup the symbol virtual address to the real physical address.

“Current Valid Addr” = “Virtual Hi Addr” – “Kernel Virtual Address Base Addr” + “%rax Extracted kernel physical address”.

Now we continue reading the arch/x86/kernel/head_64.S  assembly code, this is where we landed from arch/x86/compressed/head_64.S

The enrty is startup_64:

In this article, we talk about self loading, instead of using a third party 64bit bootloader like GRUB. So as the comment said, we come here from arch/x86/boot/compressed/head_64.S. If we config the kernel with CONFIG_RELOCATABLE, the kernel won’t run at the place we compiled, page table fixup need to be performed. The page table is fixed in __startup_64

We compute the load_delta, and fixup the early_top_pgt. Now we just assume we don’t configure the kernel with CONFIG_RELOCATABLE. Then we can look at the page table built at compile time. First we look at the top level early_top_pgt. It set only the last entry point to level3 page table. which means only virtual address start with 0xff8000000000 will be valid.

Now we look at the next level (We do not use 5 Level Paging).

This level we have two entries, one for kernel address space. One for fixmap address space, fixmap address space is used for IO mapping, DMA, etc. Now we just look at the fixmap address space. It’s at index 510. in binary mode 0b111111110. Combine with the top level we get a smaller linear address space. Only address start from 0xffff80000000 is valid.

Then it’s the last level page directory. level2_kernel_pgt

This level is a mapping to physical address 0 – 512MB (it maps more than that, but we only need 512MB) So we get the current mapping then.

Linear: 0xffff80000000 – 0xffff80000000 + 512MB =====> Physical: 0 – 512MB

You can use a gdb to print the page table and debug it in your own. Here is a simple “it works!” script for parsing the page directory entry

Kernel load the early_top_pgt into cr3 using the following code

The current page table structure is shown below:

Now we are free to visit any kernel symbol without to force convert the address using fixup_address or something else. We can go further to the init/main.c code.

We use a long return to get to get to x86_64_start_kernel

initial_code here is defined as x86_64_start_kernel.

 

Moving to init/main.c

We are now at arch/x86/kernel/head64.c and in function x86_64_start_kernel 

We set up init_top_pgt[511] same as early_top_pgt[511]  . init_top_pgt is the final kernel page table. From x86_64_start_reservations we get to start_kernel This is a function located at init/main.c

After calling setup_arch, CR3 is loaded with init_top_pgt. Then the kernel page table will not change. I wonder if there is a change to switch kernel page table from 2MB size physical page to 4KB physical page, but it seems that the CR3 remained unchanged, and I examined the page entries, they remain unchanged, too. Even the code has executed into rest_init then do_idle

The following function is a simple debug function to output the current CR3 register since GDB cannot get the CR3 register value, I just print it out to see when it changed.

 

2 thoughts on “Kernel Bootup Page Table Initialize Process(x86_64)

发表评论

电子邮件地址不会被公开。 必填项已用*标注

3 + 13 =