Browsed by
月份:2017年11月

Kernel Bootup Page Table Initialize Process(x86_64)

Kernel Bootup Page Table Initialize Process(x86_64)

This article will provide detailed information about the kernel bootup page table setup.

In a brief view, the kernel setup page table in three steps:

  1. Setup the 4GB identity mapping
  2. Setup 64bit mode page table early_top_pgt
  3. Setup 64bit mode page table init_top_pgt

The last two steps are both higher mapping: Map the 512MB physical address to virtual address 0xffff80000000 – 0xffff80000000 + 512MB.

Next, we will talk about the details. We will use the 4.14 version code to explain the process.

You need to know the IA32e paging mechanism and relocation to read the article. The Intel manual has a good explaination of IA32e paging

https://github.com/torvalds/linux/blob/v4.14/arch/x86/boot/compressed/head_64.S

Before decompression

When the kernel is being loaded, it is either decompressed by a third-party bootloader like GRUB2 or by the kernel itself. Now we will talk about the second condition. The code started from arch/x86/boot/header.S . It is in 16bit real mode at the time. Then in code  arch/x86/boot/compressed/head_64.S  We setup the first page table in 32bit mode. We need this page table to take us to do take us to 64bit mode.

The following code is the set-up process

Notice that from the comment above. %ebx contain the address where we move kernel to make a safe decompression. Which means we should treat %ebx as an offset to the compiled binary. The compiled binary start at 0. So we fix-up the difference to reach the real physical address.

The above code setup Top level page directory. This only set the lowest page directory entry to (1007 + pgtable). This is a pointer to the next level page table. And next level page table start at 0x1000 + pgtable. The last line adds %edx to 4+%edi will set encryption masks if SEV is active. Currently, we can omit this line.

Then we look at the next level.

Here, we can see we set up four entries. and each entry point to another page directory.

This is the last level of page directory, these entry will point to a physical page frame directly. Now let’s take a look at the code. It sets up 2048 entries. Each entry with a Page Flag R/W = 1 U/S = 0 PS = 1. This means the page is read / write by kernel only and its size is 2MB. Each PTE(Page Table Entry) is a 8 Byte block data. So one page can contain at most 512 entries. Here kernel setup 4 pages of Level 2 Page Directory. The following image show the current page table structure.

In total we have 2048 * 2MB = 4GB physical address, identity mapped to 0 – 4GB linear address.

 

Then we use a long return to switch to 64bit mode.

Kernel push the startup_64 and CS register to stack, then perform a long return to enter 64bit mode.  And then after copy the compressed kernel, we jump to symbol relocated

In the relocated code, we do the kernel decompression.

The decompressed kernel is compiled at high address(we take ffffffff81000000 for example). But now we don’t have the correct page table to do the mapping. Fortunately, the extract_kernel function returns the physical address of the decompressed kernel. (Which is %ebp, equals to %ebx). After decompression, %rax contains the kernel physical start address. We jump there to perform the further setup.

Start execution in vmlinux

We now arrived at arch/x86/kernel/head_64.S. Before we continue, we must notice two things first.

  • After decompression, the kernel is placed at physical address %rbp (If we do not set CONFIG_RELOCATABLE it’s equal to 0x1000000
    ).
  • After decompression, we now in the kernel code compiled with the virtual address ffffffff81000000(as we mentioned above).

So here is a big pitfall. We cannot access ANY of the symbols in vmlinux currently. Because we only have a basic identity mapping now. But we need to visit the variables. How can we make it? The kernel uses a trick here, I will show it below

This function fixup the symbol virtual address to the real physical address.

“Current Valid Addr” = “Virtual Hi Addr” – “Kernel Virtual Address Base Addr” + “%rax Extracted kernel physical address”.

Now we continue reading the arch/x86/kernel/head_64.S  assembly code, this is where we landed from arch/x86/compressed/head_64.S

The enrty is startup_64:

In this article, we talk about self loading, instead of using a third party 64bit bootloader like GRUB. So as the comment said, we come here from arch/x86/boot/compressed/head_64.S. If we config the kernel with CONFIG_RELOCATABLE, the kernel won’t run at the place we compiled, page table fixup need to be performed. The page table is fixed in __startup_64

We compute the load_delta, and fixup the early_top_pgt. Now we just assume we don’t configure the kernel with CONFIG_RELOCATABLE. Then we can look at the page table built at compile time. First we look at the top level early_top_pgt. It set only the last entry point to level3 page table. which means only virtual address start with 0xff8000000000 will be valid.

Now we look at the next level (We do not use 5 Level Paging).

This level we have two entries, one for kernel address space. One for fixmap address space, fixmap address space is used for IO mapping, DMA, etc. Now we just look at the fixmap address space. It’s at index 510. in binary mode 0b111111110. Combine with the top level we get a smaller linear address space. Only address start from 0xffff80000000 is valid.

Then it’s the last level page directory. level2_kernel_pgt

This level is a mapping to physical address 0 – 512MB (it maps more than that, but we only need 512MB) So we get the current mapping then.

Linear: 0xffff80000000 – 0xffff80000000 + 512MB =====> Physical: 0 – 512MB

You can use a gdb to print the page table and debug it in your own. Here is a simple “it works!” script for parsing the page directory entry

Kernel load the early_top_pgt into cr3 using the following code

The current page table structure is shown below:

Now we are free to visit any kernel symbol without to force convert the address using fixup_address or something else. We can go further to the init/main.c code.

We use a long return to get to get to x86_64_start_kernel

initial_code here is defined as x86_64_start_kernel.

 

Moving to init/main.c

We are now at arch/x86/kernel/head64.c and in function x86_64_start_kernel 

We set up init_top_pgt[511] same as early_top_pgt[511]  . init_top_pgt is the final kernel page table. From x86_64_start_reservations we get to start_kernel This is a function located at init/main.c

After calling setup_arch, CR3 is loaded with init_top_pgt. Then the kernel page table will not change. I wonder if there is a change to switch kernel page table from 2MB size physical page to 4KB physical page, but it seems that the CR3 remained unchanged, and I examined the page entries, they remain unchanged, too. Even the code has executed into rest_init then do_idle

The following function is a simple debug function to output the current CR3 register since GDB cannot get the CR3 register value, I just print it out to see when it changed.

 

Kernel Driver btusb Overview

Kernel Driver btusb Overview

Function

btusb_probe

btusb_probe is use for hot plug-in for bluetooth usb generic controller, here will explain the function in detail.

First is an interface check mechanism

This special condition is used for supporting apple Macbook 12,8 (2015 early). According to the normal specification, the main interface for USB is 0, and audio (isochronous) is 1, but apple made a change on it, changing the main interface to 2 and audio to 3. The “bInterfaceNumber !=2 ” is for checking hardware for the special case in Apple series product. The macro BTUSB_IFNUM_2 is a driver_info flag, for Macbook devices, this flag will be set, else it will be 0. See the btusb_table for detail.

Then do further check on blacklist devices, some of the blacklist device is because there are specific driver (e.g bcmxxxx) for the device, so they do not use the generic one called btusb. Some of them just because they are not supported, and other reasons.(Not sure what reason are there)

 

Then we allocate memory for structure btusb_data, use this to store data for the USB interface. Also we need to check the memory remained for the allocation. Then we do the real work: set up currrent interface endpoints for interrupt and bulk (Why only these two?) It go through all the endpoint in the current interface. We get the current_altsetting to get a list of current active(available) endpoints.

 

usb_endpoint_is_int_in and usb_endpoint_is_bulk_out, usb_endpoint_is_bulk_in are functions use to know what type of the endpoint is it. These info is use to set up driver data at the end of the call. If none of inter_ep, bulk_tx_ep or bulk_rx_ep is set, it will also result in No Device Error(ENODEV)

This part of code is used for URB generation. URB is short for “USB Request Block” According to the Bluetooth v5.0 Specification, When sending an Control URB to AMP, the bRequest field should be 0x2b. Shown in the figure below.

 

Currently, for the interface to work with kernel to perform different operations. The driver itself need to be convert to device structure. Use the function named interface_to_usbdev Here is a quote from Linux Device Driver 3 :

A USB device driver commonly has to convert data from a given struct usb_interface structure into a struct usb_device structure that the USB core needs for a wide range of function calls. To do this, the function interface_to_usbdev is provided. Hopefully, in the future, all USB calls that currently need a struct usb_device will be converted to take a struct usb_interface parameter and will not require the drivers to do the conversion.

 

Then we continue with the initialize process.

 

Here we init the workqueue, data->work and data->waker these are shared workqueue offered by kernel. (Default Shared workqueue). We call schedule_work(data->work) in btusb_notify function to submit a job into workqueue and data->waker is also controlled by other functions

Then these init_usb_anchor calls. In my view, is just a sort of data queue, URB request will be queued(anchored) in certain queue, then processed in serial. Then init the spinlock for the device(interface)

 

Another special case, for Intel bluetooth usb generic driver, kernel will use special recv handler functions, for other USB generic bluetooth driver, kernel just use the common one.

Then do a lot of device specific set-up, we skip the code and go to the  isochronous setup process.

 

 

Here, the usb_driver_claim_interface is used for set up more than one interface binding to the current device driver. It also happens when this is a isochronous or acm(?) interface, here it’s a isochronous interface

Finally we call hci_register_dev to register it , this is one of the function in the Bluetooth Host Controller Interface core function series, from file net/bluetooth/hci/hci_core.c. After that, we set the interface data to intf