Kernel Bootup Page Table Initialize Process(x86_64)

This article describes in detail how the kernel sets up its page tables during boot.

In brief, the kernel sets up page tables in three steps:

  1. Set up the 4GB identity mapping
  2. Set up the 64-bit mode page table early_top_pgt
  3. Set up the 64-bit mode page table init_top_pgt

The last two steps both map the kernel into the high part of the address space: the first 512MB of physical memory is mapped to virtual addresses 0xffffffff80000000 to 0xffffffff80000000 + 512MB.

Next, we will go through the details, using the 4.14 version of the code.

To follow this article you need to know the IA-32e paging mechanism and relocation. The Intel manual has a good explanation of IA-32e paging.

Before decompression

When the kernel is loaded, it is decompressed either by a third-party bootloader like GRUB2 or by the kernel itself. Here we discuss the second case. Execution starts in arch/x86/boot/header.S, in 16-bit real mode. Then, in arch/x86/boot/compressed/head_64.S, the kernel sets up the first page table while in 32-bit mode. This page table is needed to switch to 64-bit mode.

The following code is the set-up process

Notice from the comment above that %ebx contains the address to which the kernel is moved for safe decompression. This means %ebx should be treated as an offset applied to the compiled binary. The compiled binary starts at 0, so we add this difference to reach the real physical address.

The above code sets up the top-level page directory. It only sets the lowest entry, to (0x1007 + pgtable). This is a pointer to the next-level page table, which starts at 0x1000 + pgtable; the low bits 0x7 are flags. The last line, which adds %edx to 4+%edi, sets the encryption mask if SEV is active. We can ignore this line for now.

Then we look at the next level.

Here we can see that four entries are set up, and each entry points to another page directory.

This is the last level of page directory; its entries point directly to physical page frames. Now let’s take a look at the code. It sets up 2048 entries, each with the page flags R/W = 1, U/S = 0, PS = 1. This means each page is readable and writable by the kernel only, and its size is 2MB. Each PTE (Page Table Entry) is 8 bytes, so one page can hold at most 512 entries. Here the kernel sets up 4 pages of level 2 page directory. The following image shows the current page table structure.

In total we have 2048 * 2MB = 4GB physical address, identity mapped to 0 – 4GB linear address.
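The arithmetic behind this layout can be checked with a small Python sketch. It is a model for illustration only, not kernel code: the constant and function names are made up, and the flag value is built just from the bits named in the text (Present, R/W, PS).

```python
# Illustrative model of the 4GB identity mapping; all names are hypothetical.
PAGE_PRESENT = 1 << 0   # P bit
PAGE_RW      = 1 << 1   # R/W bit
PAGE_PS      = 1 << 7   # PS bit: the entry maps a 2MB page directly

ENTRY_SIZE = 8                  # bytes per page table entry
TABLE_SIZE = 4096               # each table occupies one 4KB page
LARGE_PAGE = 2 * 1024 * 1024    # 2MB

entries_per_table = TABLE_SIZE // ENTRY_SIZE
level2_tables = 4
total_entries = level2_tables * entries_per_table
mapped_bytes = total_entries * LARGE_PAGE


def level2_entry(index):
    """Entry `index` identity-maps the index-th 2MB frame (U/S stays 0)."""
    return (index * LARGE_PAGE) | PAGE_PRESENT | PAGE_RW | PAGE_PS


print(entries_per_table, total_entries, mapped_bytes // 2**30)
print(hex(level2_entry(1)))
```

Because entry i simply holds the physical address i * 2MB, linear and physical addresses coincide over the whole range, which is what makes this an identity mapping.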


Then we use a long return to switch to 64-bit mode.

The kernel pushes the address of startup_64 and the CS value onto the stack, then performs a long return to enter 64-bit mode. Then, after copying the compressed kernel, we jump to the symbol relocated

In the relocated code, we do the kernel decompression.

The decompressed kernel is compiled to run at a high address (we take ffffffff81000000 as an example). But at this point we don’t have the correct page table for that mapping. Fortunately, the extract_kernel function returns the physical address of the decompressed kernel (which is %ebp, equal to %ebx). After decompression, %rax contains the physical start address of the kernel. We jump there to perform further setup.

Start execution in vmlinux

We have now arrived at arch/x86/kernel/head_64.S. Before we continue, we must note two things.

  • After decompression, the kernel is placed at physical address %rbp (if CONFIG_RELOCATABLE is not set, it equals 0x1000000)
  • After decompression, we are running kernel code compiled for the virtual address ffffffff81000000 (as mentioned above).

So here is a big pitfall: we cannot currently access ANY of the symbols in vmlinux through their link-time addresses, because we only have a basic identity mapping. But we still need to access variables. How can we do it? The kernel uses a trick here, shown below

This function fixes up a symbol’s virtual address to its real physical address.

“Current Valid Addr” = “Virtual Hi Addr” – “Kernel Virtual Address Base Addr” + “%rax Extracted kernel physical address”.
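The formula can be tried out with a few lines of Python. This is a sketch of the arithmetic only: the function name and the symbol address are hypothetical, while the base address and the physical load address are the example values used in this article.

```python
# Hypothetical values for illustration; only the arithmetic mirrors the formula.
KERNEL_VIRT_BASE = 0xffffffff81000000   # address the kernel was linked at


def fixup_address(virt_addr, phys_base):
    """Translate a link-time virtual address into the address that is
    usable while only the identity mapping is active."""
    return virt_addr - KERNEL_VIRT_BASE + phys_base


symbol = 0xffffffff81234560              # made-up symbol address
print(hex(fixup_address(symbol, 0x1000000)))
```

The offset of the symbol from the start of the kernel image is the same in both address spaces, so only the base needs to be swapped.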

Now we continue reading the arch/x86/kernel/head_64.S assembly code; this is where we landed from arch/x86/boot/compressed/head_64.S

The entry is startup_64:

In this article we discuss self-loading rather than loading through a third-party 64-bit bootloader like GRUB, so, as the comment says, we come here from arch/x86/boot/compressed/head_64.S. If the kernel is configured with CONFIG_RELOCATABLE, it will not run at the address it was compiled for, and a page table fixup needs to be performed. The page table is fixed up in __startup_64

We compute load_delta and fix up early_top_pgt. For now, assume the kernel is not configured with CONFIG_RELOCATABLE, so we can look at the page table as built at compile time. First we look at the top level, early_top_pgt. Only its last entry is set, pointing to the level 3 page table, which means only virtual addresses from 0xffffff8000000000 upward are valid.

Now we look at the next level (We do not use 5 Level Paging).

At this level we have two entries: one for the kernel address space and one for the fixmap address space (the fixmap address space is used for IO mapping, DMA, etc.). Here we only look at the kernel address space. It is at index 510, or 0b111111110 in binary. Combined with the top level this gives a smaller linear address range: only addresses from 0xffffffff80000000 upward are valid.

Then comes the last level of page directory, level2_kernel_pgt

This level maps physical addresses 0 to 512MB (it maps more than that, but we only need 512MB). So we get the current mapping:

Linear: 0xffffffff80000000 to 0xffffffff80000000 + 512MB =====> Physical: 0 to 512MB
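How the table indices relate to the linear address can be verified with a short Python sketch. The helper names are hypothetical; the 9 + 9 + 9 + 21 bit split assumes 4-level paging with 2MB pages, as described above.

```python
def va_from_indices(l4, l3, l2=0, offset=0):
    """Build a canonical 64-bit linear address from paging indices."""
    va = (l4 << 39) | (l3 << 30) | (l2 << 21) | offset
    if va & (1 << 47):          # sign-extend bit 47 into bits 48..63
        va |= 0xffff << 48
    return va


def split_va(va):
    """Split a linear address into the indices used with 2MB pages."""
    return {
        "l4": (va >> 39) & 0x1ff,
        "l3": (va >> 30) & 0x1ff,
        "l2": (va >> 21) & 0x1ff,
        "offset": va & 0x1fffff,
    }


print(hex(va_from_indices(511, 0)))     # lowest address under top-level entry 511
print(hex(va_from_indices(511, 510)))   # lowest address under level 3 entry 510
print(split_va(0xffffffff81000000))     # the example kernel text address
```

For the example address ffffffff81000000 the level 2 index is 8, and 8 * 2MB = 0x1000000, which matches the default physical load address mentioned earlier.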

You can use gdb to print the page table and debug it on your own. Here is a simple “it works!” script for parsing page directory entries

The kernel loads early_top_pgt into cr3 using the following code

The current page table structure is shown below:

Now we are free to access any kernel symbol without converting its address with fixup_address or anything similar. We can move on to the init/main.c code.

We use a long return to get to x86_64_start_kernel

initial_code here is defined as x86_64_start_kernel.


Moving to init/main.c

We are now in arch/x86/kernel/head64.c, in the function x86_64_start_kernel

We set init_top_pgt[511] to the same value as early_top_pgt[511]. init_top_pgt is the final kernel page table. From x86_64_start_reservations we get to start_kernel, a function located in init/main.c

After calling setup_arch, CR3 is loaded with init_top_pgt. From then on the kernel page table does not change. I wondered whether the kernel page table would at some point switch from 2MB pages to 4KB pages, but CR3 remained unchanged, and the page entries I examined remained unchanged too, even after execution had reached rest_init and then do_idle

The following function is a simple debug helper that outputs the current CR3 register. Since GDB cannot read the CR3 register value, I just print it to see when it changes.


Kernel Driver btusb Overview



btusb_probe handles hot plug-in of generic Bluetooth USB controllers. This post explains the function in detail.

First is an interface check mechanism

This special condition exists to support the Apple MacBook 12,8 (early 2015). By the normal specification, the main USB interface is 0 and the audio (isochronous) interface is 1, but Apple changed this, making the main interface 2 and audio 3. The check “bInterfaceNumber != 2” covers this special case in Apple products. The macro BTUSB_IFNUM_2 is a driver_info flag: it is set for these MacBook devices and is 0 otherwise. See btusb_table for details.

Then further checks are made against the blacklisted devices. Some devices are blacklisted because a specific driver (e.g. bcmxxxx) exists for them, so they do not use the generic btusb driver. Others are blacklisted simply because they are not supported, or for other reasons (I am not sure what those are).


Then we allocate memory for the btusb_data structure, which stores data for the USB interface, and check that the allocation succeeded. Then we do the real work: set up the current interface’s interrupt and bulk endpoints (why only these two?). The code goes through all the endpoints of the current interface, using cur_altsetting to get the list of currently active (available) endpoints.


usb_endpoint_is_int_in, usb_endpoint_is_bulk_out, and usb_endpoint_is_bulk_in are functions used to determine the type of an endpoint. This information is used to set up the driver data at the end of the call. If any of intr_ep, bulk_tx_ep, or bulk_rx_ep is not set, the probe also fails with a No Device error (ENODEV)
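The endpoint classification can be modeled in a few lines of Python. This is a sketch, not the driver code: the bit masks are the standard USB endpoint descriptor fields, but the function names and the dictionary layout are hypothetical, and treating any missing endpoint as the ENODEV case is my reading of the check, so take it as an assumption.

```python
# Standard USB endpoint descriptor bit fields.
XFERTYPE_MASK = 0x03   # low two bits of bmAttributes
XFER_BULK = 0x02
XFER_INT = 0x03
DIR_IN = 0x80          # bit 7 of bEndpointAddress: 1 = IN, 0 = OUT


def classify(addr, attrs):
    """Return the slot this endpoint would fill, or None if it is unused."""
    kind = attrs & XFERTYPE_MASK
    is_in = bool(addr & DIR_IN)
    if kind == XFER_INT and is_in:
        return "intr_ep"
    if kind == XFER_BULK:
        return "bulk_rx_ep" if is_in else "bulk_tx_ep"
    return None


def pick_endpoints(endpoints):
    """endpoints: list of (bEndpointAddress, bmAttributes) pairs.
    Returns the three endpoint addresses, or None to model -ENODEV."""
    found = {"intr_ep": None, "bulk_tx_ep": None, "bulk_rx_ep": None}
    for addr, attrs in endpoints:
        slot = classify(addr, attrs)
        if slot is not None and found[slot] is None:
            found[slot] = addr   # keep the first match for each slot
    if None in found.values():
        return None
    return found


print(pick_endpoints([(0x81, 0x03), (0x02, 0x02), (0x82, 0x02)]))
```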

This part of the code prepares for URB generation. URB is short for “USB Request Block”. According to the Bluetooth v5.0 Specification, when sending a Control URB to an AMP controller, the bRequest field should be 0x2b, as shown in the figure below.


For the interface to work with the kernel to perform different operations, the interface structure needs to be converted to a device structure, using the function interface_to_usbdev. Here is a quote from Linux Device Drivers, 3rd Edition:

A USB device driver commonly has to convert data from a given struct usb_interface structure into a struct usb_device structure that the USB core needs for a wide range of function calls. To do this, the function interface_to_usbdev is provided. Hopefully, in the future, all USB calls that currently need a struct usb_device will be converted to take a struct usb_interface parameter and will not require the drivers to do the conversion.


Then we continue with the initialization process.


Here we initialize the work items data->work and data->waker, which run on the shared workqueue offered by the kernel (the default shared workqueue). We call schedule_work(&data->work) in the btusb_notify function to submit a job to the workqueue; data->waker is likewise scheduled by other functions

Then come the init_usb_anchor calls. In my view an anchor is a sort of data queue: URB requests are queued (anchored) in a particular queue and then processed serially. After that, the spinlock for the device (interface) is initialized


Another special case: for Intel Bluetooth USB controllers, the kernel uses special recv handler functions; for other generic USB Bluetooth controllers, it uses the common ones.

Then comes a lot of device-specific setup. We skip that code and go to the isochronous setup process.



Here, usb_driver_claim_interface is used to bind more than one interface to the current device driver. This happens for an isochronous or ACM(?) interface; here it is an isochronous interface

Finally we call hci_register_dev to register the device. This is one of the core functions of the Bluetooth Host Controller Interface, from the file net/bluetooth/hci_core.c. After that, we set the interface data on intf

Linux Kernel Development Resources

Here are some resources for digging into Linux kernel development (updated continuously)


  • Understanding the Linux Kernel [ULK]
  • Linux Kernel Development [LKD]
  • Linux Device Drivers [LDD]
  • Linux Kernel Module Programming Guide [LKMPG]
  • Linux in a Nutshell




ULK Chapter 2 Summary




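  • A logical address corresponds to memory segmentation
  • A linear address corresponds to paging
  • A physical address corresponds to the address of a memory cell on the hardware chip

The MMU uses two hardware circuits, the Segmentation Unit and the Paging Unit, to convert a logical address into a physical address. The conversion process is shown in the code snippet above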



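[0-1] RPL: Requestor Privilege Level

[2] TI: Table Indicator, specifies whether the Segment Descriptor is taken from the GDT or the LDT

[3-15] index: selects the entry of the GDT (or LDT) from which the Segment Descriptor is taken


For an explanation of the segment descriptor, see the corresponding part of [Not for beginners?] RE: Operating System Development from Zero, Episode 2

Logical Addr ===> Linear Addr

The upper 16 bits of a logical address are the Segment Selector, and the remaining 32 bits (or 64 bits) are the Offset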


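  1. Check TI to determine whether the Segment Descriptor is selected from the GDT (TI = 0) or the LDT (TI = 1)
  2. Compute the address of the Segment Descriptor from the index field of the Segment Selector: multiply index by 8 (the size of a Segment Descriptor) and add the result to the content of gdtr/ldtr
  3. Add the offset field of the logical address to the Base field of the Segment Descriptor to obtain the Linear Address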
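The three steps can be sketched in Python as follows. The function names and the table base addresses in the example are hypothetical; only the bit layout of the selector and the index * 8 arithmetic come from the text.

```python
DESCRIPTOR_SIZE = 8   # bytes per Segment Descriptor


def decode_selector(selector):
    """Split a 16-bit Segment Selector into its RPL, TI and index fields."""
    return {
        "rpl": selector & 0x3,          # bits 0-1
        "ti": (selector >> 2) & 0x1,    # bit 2: 0 = GDT, 1 = LDT
        "index": selector >> 3,         # bits 3-15
    }


def descriptor_address(selector, gdtr_base, ldtr_base):
    """Steps 1 and 2: pick the table by TI, then add index * 8 to its base."""
    fields = decode_selector(selector)
    base = ldtr_base if fields["ti"] else gdtr_base
    return base + fields["index"] * DESCRIPTOR_SIZE


def linear_address(segment_base, offset):
    """Step 3: Linear Address = Base + Offset."""
    return segment_base + offset


print(decode_selector(0x2b))
print(hex(descriptor_address(0x10, gdtr_base=0x1000, ldtr_base=0x2000)))
print(hex(linear_address(0, 0x08048000)))   # with Base = 0 the offset is the linear address
```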


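In Linux, segmentation exists only for compatibility. Segmentation and paging serve overlapping purposes, and paging can manage memory at a finer granularity, so paging is the primary mechanism in Linux. The user code segment, user data segment, kernel code segment, and kernel data segment in Linux all have Base = 0, so logical and linear addresses coincide under Linux: because Base = 0, Linear Address = Base + Offset = Offset, that is, the offset of a logical address equals the value of the linear address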


Linear Addr ==> Physical Addr



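Page Directory, Page Table, Page Frame


As shown in the figure, Page Directory and Page Table entries have a similar structure: some flag bits plus a Field. The Field holds the physical address of a page frame. For a Page Table entry, that page frame contains one page of data; for a Page Directory entry, the page frame contains one Page Table (a Page Table is exactly one page in size)


Addressing works as follows. First, the content of the CR3 register gives the physical address of the Page Directory. Adding the DIRECTORY field of the Linear Addr selects the PDE, whose content gives the physical address of the Page Table. The TABLE field of the Linear Addr then selects the entry that gives the physical address of the page frame. Finally, adding the OFFSET field of the Linear Addr to that address yields the physical address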
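The walk just described can be simulated in Python. This is a toy model under stated assumptions: the classic 32-bit layout of 10 + 10 + 12 bits with 4-byte entries, a dictionary standing in for physical memory, and made-up table addresses. Flag checks such as Present are left out.

```python
def translate(cr3, linear, memory):
    """Walk a two-level page table held in `memory`, a dict that maps a
    physical address to the 32-bit entry stored there."""
    directory = (linear >> 22) & 0x3ff    # DIRECTORY: top 10 bits
    table = (linear >> 12) & 0x3ff        # TABLE: middle 10 bits
    offset = linear & 0xfff               # OFFSET: low 12 bits

    pde = memory[cr3 + directory * 4]     # Page Directory entry
    page_table = pde & ~0xfff             # drop the flag bits
    pte = memory[page_table + table * 4]  # Page Table entry
    frame = pte & ~0xfff
    return frame + offset


# Hypothetical layout: directory at 0x1000, one page table at 0x2000.
memory = {
    0x1000 + 1 * 4: 0x2000 | 0x7,   # PDE 1 -> page table at 0x2000
    0x2000 + 3 * 4: 0x5000 | 0x7,   # PTE 3 -> page frame at 0x5000
}
print(hex(translate(0x1000, 0x00403025, memory)))
```

Each translation costs two extra memory reads before the data itself is reached, which is the overhead the TLB exists to avoid.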


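On a 64-bit system, if we kept this basic paging structure and assumed 4K page frames, OFFSET would still take 12 bits, and the remaining 52 bits would be split between PT (short for Page Table, likewise below) and PD (short for Page Directory, likewise below). Each process’s page directory and page tables would then become enormous, with more than 256,000 entries

For this reason, paging on 64-bit processors uses more levels. For example, x86_64 uses 48-bit addressing with 4-level paging, and the linear address is split as 9 + 9 + 9 + 9 + 12, as shown in the figure below


That is the general flow of the Logical Address ===> Physical Address conversion



Because reading a CPU register is far faster than reading memory, caching is used to narrow this gap and keep the CPU from waiting too long: part of the data in memory is kept in fast static RAM (SRAM). This is the hardware cache

In addition, translating a Logical Address into a Physical Address itself requires several memory reads (to look up each level of page table). To speed this up, the Translation Lookaside Buffer (TLB) was introduced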

Hardware Cache



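  • Write-through: writes go to both RAM and the cache
  • Write-back: writes go to the cache only, and RAM is updated only when a FLUSH is required


  • PCD indicates whether the hardware cache is enabled or disabled for this page frame
  • PWT indicates whether the write-through policy is used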
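The difference between the two write policies can be shown with a toy Python model. It is a sketch only: the class and its fields are hypothetical, and a real cache works on lines, sets and evictions, none of which are modeled here.

```python
class ToyCache:
    """Minimal cache model that differs only in its write policy."""

    def __init__(self, ram, write_through):
        self.ram = ram                  # dict standing in for main memory
        self.lines = {}                 # cached copies
        self.dirty = set()              # addresses not yet written to RAM
        self.write_through = write_through

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.ram[addr] = value      # write-through: RAM updated immediately
        else:
            self.dirty.add(addr)        # write-back: RAM update is deferred

    def flush(self):
        for addr in self.dirty:
            self.ram[addr] = self.lines[addr]
        self.dirty.clear()


wb = ToyCache(ram={}, write_through=False)
wb.write(0x10, 42)
print(0x10 in wb.ram)   # RAM is stale until the flush
wb.flush()
print(wb.ram[0x10])
```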



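The TLB caches the physical address that corresponds to a linear address, to speed up memory access


Making good use of the TLB has a very significant effect on Linux performance. To avoid useless TLB flushes on multiprocessor systems, the kernel uses a technique called Lazy TLB. Details of how Lazy TLB is implemented will be added later






In the early days memory was small. When servers and programs that need large memory (> 4G) appeared, the addressing scheme above could no longer be used, so Intel extended the physical address width from 32 to 36 bits (that is, increased the address pins from 32 to 36). All the address translation described so far goes from a 32-bit Linear Address to a 32-bit Physical Address, so a new translation mechanism was needed to map a 32-bit linear address to a 36-bit physical address. This mechanism is PAE; for details see the corresponding part of the Intel manual, Vol. 3A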










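The method is as follows. Assume our customized directory is kernel-build



When running make menuconfig, it is best to modify the kernel version so that make modules_install does not overwrite the modules of the existing kernel

This is done through the Local Version option under General Setup

After these steps, the bzImage kernel image should exist under arch/ARCH/boot/, and the modules should be installed under /lib/modules/<linux-version>-<localversion>

Next, modify the linux.preset file and create the initrd file

Modify the linux.preset file; after the change its content looks like the following

We need to change the three options ALL_kver, default_image, and fallback_image so that the images are saved to the specified directory


Create initrd and initrd-fallback. At this point the kernel-build directory should contain the following files:


We can then run our custom Linux kernel, and use uname -a to check the kernel version

At this point, however, we only have a kernel running from a RAM disk. No file system is mounted, and only a rescue filesystem and a few busybox tools are available. To build a properly usable Linux image containing the programs a distro should have, we will next build our own usable distro step by step through the Linux From Scratch project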

My Yearly Goals


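[For beginners?] Re: Operating System Development from Zero, Episode 1

Hmm, so we really did start another project! Prompted by our school’s silly operating systems course project, plus the dream every programmer has of writing their own operating system, we happily (and recklessly) got started.

I once played with DIY operating systems by following the book “30天自制操作系统” (30 Days to Make Your Own Operating System), but that book is aimed more at beginners, is not very systematic, and uses the author’s own modified nas assembler, so it does not really count as having written one. This time we are doing it for real. (The course project period is far too short to finish a complete system, but if we keep writing we will get there eventually, right?)

Our development process is streamed live in the Toy-OS channel on Bearychat, and the code is in our git repo. Experts, please be gentle. Having dug this hole, we will fill it (...although, how many holes have we dug already?)

This series of articles records experience, lessons, and complaints from the whole process of developing the operating system. We do not know how many episodes there will be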



  • MIT’s XV6 source code & handbook
  • University of Birmingham’s Writing a Simple Operating System from Scratch
  • Quora, StackOverflow
  • Jiong Zhao’s fully annotated Linux 0.11 kernel (Linux 0.11内核完全注释)



  • GNU Assembler & GNU C Compiler
  • Qemu
  • Gdb
  • objcopy, objdump, binutils, elfutils
  • GNU Makefile





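Our goal is to write an operating system, so first we should understand which modules an operating system consists of. Let us start from how an operating system boots, which brings us to the BIOS. BIOS stands for Basic Input/Output System. It is the first program that runs after your computer powers on, and it is a piece of code burned into your EPROM

Once the BIOS starts, the CPU executes the BIOS code. The BIOS performs a hardware self-test and, after confirming the hardware has no faults, loads the operating system. The BIOS also provides a generic interface for interacting with peripherals such as a VGA display; how to use it is described below


As just mentioned, the BIOS is the first program to run at power-on, and after the hardware self-test it loads the operating system. But at that moment the operating system is still on disk (or some other storage medium), so how does the BIOS know where to find it? The BIOS is hard-coded into ROM by the vendor; if our operating system were stored at a different location every time, would we have to reflash the EPROM each time? Of course it is not that troublesome. The BIOS and operating system programmers share a convention: after the self-test, the BIOS checks, in order (the Boot Sequence you configured), whether the first sector (sector 0) of each medium is bootable. If it finds a bootable sector, it loads that sector into memory and then executes the program just loaded. This is where we can initialize hardware, load the operating system, and so on

To the CPU, code and data are both just binary, so how is a bootable sector recognized? We call a bootable sector a boot-sector. For a sector to be recognized as a boot-sector, it must meet the following requirements:

  • It must be 512 bytes in size
  • The last two bytes of the 512 bytes must be filled with 0xaa55

Only a sector that meets both conditions is a boot-sector that the BIOS will load. Below is the binary of a very simple program that puts the CPU into an infinite loop right after power-on; it is a boot-sector

The zeros in the middle are omitted. Because of little endian byte order, 0xaa55 is actually stored as 55 aa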
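Such an image can also be produced with a few lines of Python, which makes the two requirements explicit. This is a sketch rather than the toolchain used in this series: the function name is made up, and the two code bytes eb fe encode a short jump to itself (an infinite loop).

```python
SECTOR_SIZE = 512
BOOT_SIGNATURE = 0xaa55


def make_boot_sector(code):
    """Pad `code` with zeros to 510 bytes and append the boot signature."""
    if len(code) > SECTOR_SIZE - 2:
        raise ValueError("code does not fit in one sector")
    padding = bytes(SECTOR_SIZE - 2 - len(code))
    # Little endian: 0xaa55 is stored as the bytes 55 aa.
    return code + padding + BOOT_SIGNATURE.to_bytes(2, "little")


sector = make_boot_sector(b"\xeb\xfe")   # jmp to self
print(len(sector), sector[:2].hex(), sector[-2:].hex())
```

Writing `sector` to a file gives an image that can be handed to an emulator as a disk.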

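Switching to 32-bit protected mode

All of the above runs in 16-bit real-address mode (the address you access is the real physical address). In 16-bit real mode the highest reachable address is 0xffff * 16 + 0xffff = 0x10ffef (see memory segmentation for how this is computed), so the accessible memory is about 1.0625 MB, only slightly more than 1MB. This is far from enough for the operating system we are going to write and the programs we want to run, so next we switch to 32-bit protected mode and continue development there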
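The real-mode limit is easy to verify in Python. The function name is hypothetical; the arithmetic is the segment * 16 + offset rule.

```python
def real_mode_physical(segment, offset):
    """Real-mode address translation: segment * 16 + offset."""
    return (segment << 4) + offset


highest = real_mode_physical(0xffff, 0xffff)
addressable_bytes = highest + 1
print(hex(highest), addressable_bytes, addressable_bytes / 2**20)
```

The result is just under 64KB above the 1MB mark.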


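After switching to 32-bit protected mode, we need to implement kernel system calls, a file system, and multiprocess scheduling, and to support our keyboard and VGA drivers


To make our operating system interactive, we need to implement an interactive shell and a few programs that run on our operating system. Later we may also support networking, a graphical interface, an SDL driver, and so on


That is the overview. As you can see this is not a small project, but it is fun, right?


Our first Hello world operating system

Some pitfalls of AT&T assembly + GNU Assembler

To stay consistent with XV6, we use AT&T assembly syntax and develop with the GNU Assembler. For the differences between AT&T and Intel assembly, here is a reference article

One thing to note: when using nasm as the assembler, we could pass -f bin to make the output a RAW binary, that is, without any ELF (the file format used on Linux) information. When assembling with as, we have not yet found a way to output a RAW binary file directly. There also seems to be no convenient way in AT&T syntax to put 0xaa55 at the very end of sector 0 (with nasm we usually use times 510-($-$$) db 0 to pad the sector with zeros, followed by dw 0xaa55). One more thing: for a string defined after a label, nasm resolves the address easily, but as does not! as treats it as an external symbol to be relocated at link time, so if the following code is not linked, the ’string’ label is replaced with 0x0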



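So what is the right way? The following produces a correct RAW binary image (assuming our source file is called boot.s)



Someone will surely ask what the 0x7c00 in the code above is, so let us explain. After the computer finishes starting up, the memory layout is as follows: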


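(Figure taken from Writing a Simple Operating System from Scratch)

We can see that the interrupt vector table sits at the very bottom of low memory, with the BIOS data above it. To keep our boot-sector from overwriting the BIOS data or the interrupt vector table, the BIOS loads the boot-sector at address 0x7c00. Now for why this parameter must be given at link time: when you link, symbols that need relocation are resolved as a base address plus the offset within the file. Since our boot-sector is loaded at 0x7c00 by default, the base address should be 0x7c00. A label such as string is then relocated to 0x7c00 + its offset in the binary file (which you can see with hexdump)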

Qemu with GDB

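To make debugging more pleasant, it would be great to debug the code with gdb. qemu supports remote debugging: run qemu with the options -s (open port 1234 and wait for a debugger to connect) and -S (do not start executing CPU instructions yet), and then connect gdb to that port. The steps are as follows

With this in place, reading the code in the early stage of operating system development becomes much easier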

Writing our first Hello world OS

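As a tribute to LL, we are going to print Hello Niconico on the screen. As mentioned, the BIOS provides some generic interfaces for dealing with hardware, so we do not need to care about lower-level hardware details. Here we use such an interface, which the BIOS exposes through interrupts. To print a character on the screen, we assign values to certain registers; this is how parameters are passed in assembly, similar to argument passing in C. A short piece of code that prints the character ’A’ is as follows:

To print a whole string, a slightly more complex program is as follows:

Note that one place where AT&T syntax tripped us up for a long time is mov $string, %bx. At first we wrote

mov string, %bx, and this code never printed the string we wanted. The reason is that in AT&T syntax this is interpreted as: fetch the content stored at the label string and put it into %bx. What we wanted was: take the address of the label string and put it into %bx. Without the $, every address in AT&T syntax is interpreted as “the content at that location”, so be careful

The code above is available on github. The commit for this version is 8c2f90aa8830edaf9ea10809797d14918efb463e. Just install the required tools according to the README and run make boot to see the result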





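[C / Linux kernel] Files and I/O

System call Hello world

Let’s start with an example that uses the sys_write system call, written in assembly, to output hello world to stdout

_start is exported (via .global) as a symbol visible to the ELF linker or loader, and a label msg is defined in the .data section, representing the start address of the string Hello world. The . used when defining len stands for the current address within the section (at the start of each section, the value of “.” is reset to 0). So the value of len is the current address minus the start address of msg, in other words the length of the string constant “Hello world\n”. The code above makes two system calls: the first calls sys_write and the second calls sys_exit. The first call passes three arguments, which are described in the code comments

Assemble, link, and run the code above


The C implementation of the assembly code above looks like this

After invoking the write system call, it calls _exit to exit. Below is a summary of the kernel I/O functions available in C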
