5 Xv6-Notes
Abhijit A. M.
[email protected]
Credits:
xv6 book by Cox, Kaashoek, Morris
Notes by Prof. Sorav Bansal
Use cscope and ctags with VIM
Go to folder of xv6 code and run
cscope -q *.[chS]
Also run
ctags *.[chS]
Now download the file
http://cscope.sourceforge.net/cscope_maps.vim
as .cscope_maps.vim in your ~ folder
And add the line “source ~/.cscope_maps.vim” to your
~/.vimrc file
Read this tutorial
http://cscope.sourceforge.net/cscope_vim_tutorial.html
Use call graphs (using doxygen)
Doxygen – a documentation generator.
Can also be used to generate “call graphs” of
functions
Download xv6
Install doxygen on your Ubuntu machine.
cd to xv6 folder
Run “doxygen -g doxyconfig”
This creates the file “doxyconfig”
Create a folder “doxygen”
Open “doxyconfig” file and make these changes.
PROJECT_NAME = "XV6"
OUTPUT_DIRECTORY = ./doxygen
CREATE_SUBDIRS = YES
EXTRACT_ALL = YES
EXCLUDE = usertests.c cat.c yes.c echo.c \
forktest.c grep.c init.c kill.c ln.c ls.c mkdir.c rm.c sh.c \
stressfs.c wc.c zombie.c
CALL_GRAPH = YES
CALLER_GRAPH = YES
Now run “doxygen doxyconfig”
Go to “doxygen/html” and run “firefox index.html”. To see the call graphs,
go to Files and open any file.
About xv6
Unix Like OS
Multi tasking, Single user
On x86 processor
Supports some system calls
Small code base, about 7k to 10k lines
Meant for learning OS concepts
No demand paging, no copy-on-write fork, no
shared memory, fixed size stack for user programs
Xv6 follows the monolithic kernel approach
qemu
A virtual machine manager, like Virtualbox
Qemu provides us
BIOS
Virtual CPU, RAM, Disk controller, Keyboard controller
IOAPIC, LAPIC
Qemu runs xv6 using this command
qemu -serial mon:stdio \
-drive file=fs.img,index=1,media=disk,format=raw \
-drive file=xv6.img,index=0,media=disk,format=raw \
-smp 2 -m 512
Invoked when you run “make qemu”
Understanding qemu command
-serial mon:stdio
The xv6 console is multiplexed with the qemu monitor on your normal terminal.
Run “make qemu”, then press “Ctrl-a” followed by “c” in the terminal to
get the qemu prompt
-drive file=fs.img,index=1,media=disk,format=raw
Use the file “fs.img” as a hard disk, attached at index 1 (the second slot) on IDE (or
SATA, etc.), as a “disk”, with “raw” format
-smp 2
Two cores in SMP mode to be simulated
-m 512
Use 512 MB ram
About files in XV6 code
cat.c echo.c forktest.c grep.c
init.c kill.c ln.c ls.c mkdir.c
rm.c sh.c stressfs.c usertests.c
wc.c yes.c zombie.c
User programs for testing xv6
Makefile
To compile the code
dot-bochsrc
For running with emulator bochs
About files in XV6 code
bootasm.S entryother.S entry.S
initcode.S swtch.S trapasm.S
usys.S
Kernel code written in Assembly. Total 373 lines
kernel.ld
Instructions to Linker, for linking the kernel
properly
README Notes LICENSE
Misc files
Using Makefile
make qemu
Compile code and run using “qemu” emulator
make xv6.pdf
Generate a PDF of xv6 code
make mkfs
Create the mkfs program
make clean
Remove all intermediary and final build files
Files generated by Makefile
.o files
Compiled from each .c file
No separate rule is needed in the Makefile to
create .o files: make's built-in rule compiles each .c file.
The line “_%: %.o $(ULIB)” is sufficient to build
the .o needed for each _xyz file
Files generated by Makefile
asm files
Each of them has an equivalent object code file or C file. For example
bootblock: bootasm.S bootmain.c
$(CC) $(CFLAGS) -fno-pic -O -nostdinc -I. -c
bootmain.c
$(CC) $(CFLAGS) -fno-pic -nostdinc -I. -c
bootasm.S
$(LD) $(LDFLAGS) -N -e start -Ttext 0x7C00 -o
bootblock.o bootasm.o bootmain.o
$(OBJDUMP) -S bootblock.o > bootblock.asm
$(OBJCOPY) -S -O binary -j .text bootblock.o
bootblock
./sign.pl bootblock
Files generated by Makefile
_ln, _ls, etc
Executable user programs
The compilation process is explained a few
slides later
Files generated by Makefile
xv6.img
Image of xv6 created
xv6.img: bootblock kernel
dd if=/dev/zero of=xv6.img
count=10000
dd if=bootblock of=xv6.img
conv=notrunc
dd if=kernel of=xv6.img seek=1
conv=notrunc
Files generated by Makefile
bootblock
bootblock: bootasm.S bootmain.c
$(CC) $(CFLAGS) -fno-pic -O -nostdinc -I.
-c bootmain.c
$(CC) $(CFLAGS) -fno-pic -nostdinc -I. -c
bootasm.S
$(LD) $(LDFLAGS) -N -e start -Ttext
0x7C00 -o bootblock.o bootasm.o bootmain.o
$(OBJDUMP) -S bootblock.o > bootblock.asm
$(OBJCOPY) -S -O binary -j .text
bootblock.o bootblock
./sign.pl bootblock
Files generated by Makefile
kernel
kernel: $(OBJS) entry.o entryother initcode
kernel.ld
$(LD) $(LDFLAGS) -T kernel.ld -
o kernel entry.o $(OBJS) -b binary
initcode entryother
$(OBJDUMP) -S kernel >
kernel.asm
$(OBJDUMP) -t kernel | sed
'1,/SYMBOL TABLE/d; s/ .* / /; /^$$/d'
> kernel.sym
Files generated by Makefile
fs.img
A disk image containing user programs and README
fs.img: mkfs README $(UPROGS)
./mkfs fs.img README $(UPROGS)
.sym files
Symbol tables of different programs
E.g. for file “kernel”
$(OBJDUMP) -t kernel | sed '1,/SYMBOL
TABLE/d; s/ .* / /; /^$$/d' > kernel.sym
Size of xv6 C code
wc *[ch] | sort -n
10595 34249 278455 total
Out of which
738 4271 33514 dot-bochsrc
wc cat.c echo.c forktest.c grep.c init.c kill.c
ln.c ls.c mkdir.c rm.c sh.c stressfs.c
usertests.c wc.c yes.c zombie.c
2849 6864 51993 total
So total code is 10595 – 2849 – 738 = 7008 lines
List of commands to try (in given
order)
usertests # Runs a lot of tests and takes up to 10 minutes to run
stressfs # opens, reads and writes to files in parallel
ls # output is name, file type, inode number, size
cat README
ls;ls
cat README | grep BUILD
echo hi there
echo hi there | grep hi
echo "hi there
List of commands to try (in this
order)
echo README | grep Wa
echo README | grep Wa | grep ty # does not work
cat README | grep Wa | grep bl # works
ls > out # takes time!
mkdir test
cd test
ls # fails
ls ../ # works from inside test
cd # fails
cd / # works
wc README
rm out
ls . test # listing both directories
ln cat xyz; ls
rm xyz; ls
User Libraries: Used to link user
land programs
ulib.c
strcpy, strcmp, strlen, memset, strchr, stat, atoi, memmove
stat() uses open()
usys.S -> compiles into usys.o
Assembly code file. Basically converts all calls like open()
(e.g. used in ulib.c) into assembly code using the “int”
instruction.
Run the following command and see the last 4 lines of the output
objdump -d usys.o
00000048 <open>:
48: b8 0f 00 00 00 mov $0xf,%eax
4d: cd 40 int $0x40
4f: c3 ret
User Libraries: Used to link user
land programs
printf.c
Code for printf()!
Interesting to read this code.
Uses a variable number of arguments. The normal
technique in C is to use the va_arg macros from <stdarg.h>, but here it
uses pointer arithmetic.
Written using two more functions: printint() and
putc() - both call write()
Where is code for write()?
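For contrast with the pointer arithmetic used in printf.c, here is a minimal sketch of the usual <stdarg.h> technique. The function sum_ints() is a hypothetical helper made up for this illustration; it is not part of xv6.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical helper (not in xv6): adds up `count` int arguments
 * using the standard variadic macros instead of pointer arithmetic. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    assert(count >= 0);
    va_start(ap, count);              /* start after the last named parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);     /* fetch the next int argument */
    va_end(ap);
    return total;
}
```

xv6's printf() instead takes the address of its format argument and steps a uint pointer over the following stack slots, which relies on the 32-bit x86 calling convention.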
User Libraries: Used to link user
land programs
umalloc.c
This is an implementation of malloc() and free()
Almost same as the one done in “The C
Programming Language” by Kernighan and
Ritchie
Uses sbrk() to get more memory from xv6 kernel
Understanding the build process
in more detail
Run
make qemu | tee make-output.txt
You will get all compilation commands in
make-output.txt
Compiling user land programs
Normally when you compile a program on Linux
You compile it for the same ‘target’ machine ( = CPU + OS)
The compiler itself runs on the same OS
struct stat {
short type; // Type of file
int dev; // File system's disk device
uint ino; // Inode number
short nlink; // Number of links to file
uint size; // Size of file in bytes
};
Used by stat system call
Important header files for user
programs
fcntl.h
#define O_RDONLY 0x000
#define O_WRONLY 0x001
#define O_RDWR 0x002
#define O_CREATE 0x200
Important header files for user
programs
user.h
Prototypes of all system calls (fork,
wait, etc)
and ulib.c functions (strcpy, etc )
Some numbers and their
‘meaning’
These numbers occur very frequently in
discussion
0x80000000 = 2 GB = KERNBASE
0x100000 = 1 MB = EXTMEM
0x80100000 = 2 GB + 1 MB = KERNLINK
0xE000000 = 224 MB = PHYSTOP
0xFE000000 = 3.96 GB = 4064 MB = DEVSPACE
4096 – 4064 = 32 MB left on top
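The arithmetic behind these numbers can be checked with a small sketch. The constant names follow the slide; the helper functions v2p(), p2v() and to_mb() are written here as plain functions for illustration (xv6 itself uses macros for the address conversion).

```c
#include <assert.h>
#include <stdint.h>

#define KERNBASE 0x80000000u            /* 2 GB: first kernel virtual address */
#define EXTMEM   0x100000u              /* 1 MB: start of extended memory */
#define KERNLINK (KERNBASE + EXTMEM)    /* 2 GB + 1 MB: where the kernel is linked */
#define PHYSTOP  0xE000000u             /* 224 MB: top of physical memory used */
#define DEVSPACE 0xFE000000u            /* 4064 MB: device addresses start here */

/* Kernel virtual address -> physical address: subtract KERNBASE. */
uint32_t v2p(uint32_t va)
{
    assert(va >= KERNBASE);
    return va - KERNBASE;
}

/* Physical address -> kernel virtual address: add KERNBASE. */
uint32_t p2v(uint32_t pa)
{
    return pa + KERNBASE;
}

/* Convert a byte count to whole megabytes. */
uint32_t to_mb(uint32_t bytes)
{
    return bytes >> 20;
}
```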
How to read kernel code ?
Understand the data structures
Know each global variable, typedefs, lists, arrays, etc.
Know the purpose of each of them
While reading a code path, e.g. exec()
Try to ‘locate’ the key line of code that does major work
Initially (but not forever) ignore the ‘error checking’ code
Keep summarising what you have read
Remembering is important !
To understand kernel code, you should be good with
concepts in OS , C, assembly, hardware
Bootloader
What does a bootloader do?
Bootloader itself
Is loaded by the BIOS at a fixed location in memory and
BIOS makes it run
Our job, as OS programmers, is to write the bootloader
code
Bootloader does
Picks up the code of the OS from a ‘known’ location and loads it into
memory
Makes the OS run
Xv6 bootloader: bootasm.S bootmain.c (see Makefile)
bootloader
BIOS Runs (automatically)
Loads boot sector into RAM at 0x7c00
Starts executing that code
Make sure that your bootloader is loaded at 0x7c00
Makefile has
bootblock: bootasm.S bootmain.c
$(CC) $(CFLAGS) -fno-pic -nostdinc -I. -c bootasm.S .....
...
$(LD) $(LDFLAGS) -N -e start -Ttext 0x7C00 -o bootblock.o
bootasm.o bootmain.o
Results in:
00007c00 <start>: in bootblock.asm
Virtual address = offset address
https://en.wikipedia.org/wiki/A20_line
Real mode Vs protected mode
Real mode
16 bit registers
Protected mode
32 bit registers
can address up to 2^32 bytes of memory
Can do arithmetic in 32 bits
A segment register is an index into the segment
descriptor table
Segments in protected mode
Xv6 makes almost no (but not exactly zero) use of segmentation, and relies
only on paging.
More later.
Bootloader
lgdt gdtdesc
...
# Bootstrap GDT
.p2align 2                                # force 4 byte alignment
gdt:
  SEG_NULLASM                             # null seg
  SEG_ASM(STA_X|STA_R, 0x0, 0xffffffff)   # code seg
  SEG_ASM(STA_W, 0x0, 0xffffffff)         # data seg
gdtdesc:
  .word (gdtdesc - gdt - 1)               # sizeof(gdt) - 1
  .long gdt
lgdt loads the processor’s GDT register with the value gdtdesc, which points to the table gdt.
The table gdt has a null entry, one entry for executable code, and one entry for data.
All segments have a base address of zero and the maximum possible limit.
The code segment descriptor has a flag set that indicates that the code should run in 32-bit mode.
With this setup, when the boot loader enters protected mode, logical addresses map one-to-one to physical addresses.
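To see what one GDT entry looks like as bytes, the sketch below packs a descriptor the way the SEG_ASM macro in xv6's asm.h lays it out (as far as I recall that macro; treat the bit layout as an assumption to check against asm.h). The functions seg_descriptor() and seg_selector() are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define STA_X 0x8   /* executable segment */
#define STA_W 0x2   /* writeable (non-executable segments) */
#define STA_R 0x2   /* readable (executable segments) */

/* Pack an 8-byte segment descriptor from type bits, base and limit.
 * The limit is stored in 4 KB units (granularity bit set). */
uint64_t seg_descriptor(uint8_t type, uint32_t base, uint32_t lim)
{
    uint64_t d = 0;

    assert((type & 0xF0) == 0);                          /* only the 4 type bits */
    d |= (uint64_t)((lim >> 12) & 0xffff);               /* limit bits 0..15 */
    d |= (uint64_t)(base & 0xffff) << 16;                /* base bits 0..15 */
    d |= (uint64_t)((base >> 16) & 0xff) << 32;          /* base bits 16..23 */
    d |= (uint64_t)(0x90 | type) << 40;                  /* present, code/data, type */
    d |= (uint64_t)(0xC0 | ((lim >> 28) & 0xf)) << 48;   /* 4 KB units, 32-bit, limit 16..19 */
    d |= (uint64_t)((base >> 24) & 0xff) << 56;          /* base bits 24..31 */
    return d;
}

/* A selector for ring 0 is the GDT index shifted left by 3,
 * as in SEG_KCODE<<3 and SEG_KDATA<<3. */
uint16_t seg_selector(unsigned index)
{
    return (uint16_t)(index << 3);
}
```

With base 0 and limit 0xffffffff, the code and data entries come out as the familiar flat-model values 0x00CF9A000000FFFF and 0x00CF92000000FFFF.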
bootasm.S after “lgdt gdtdesc”
till jump to “entry”
[Figure: address translation in protected mode. Logical address = offset; CS and SS select descriptors in the GDT, whose entry 0 is all zeroes. Still logical address = physical address, but with the GDT in the picture.]
Enable protected mode
The boot loader enables protected mode by setting the CR0_PE bit (bit 0) in register %cr0:
movl %cr0, %eax
orl $CR0_PE, %eax
movl %eax, %cr0
Complete transition to 32 bit
mode
ljmp $(SEG_KCODE<<3), $start32
Complete the transition to 32-bit protected mode by using a long jmp
to reload %cs and %eip. The segment descriptors are set up
with no translation, so that the mapping is still
the identity mapping.
Jumping to “C” code
movw $(SEG_KDATA<<3), %ax # Our data segment selector
movw %ax, %ds # -> DS: Data Segment
movw %ax, %es # -> ES: Extra Segment
movw %ax, %ss # -> SS: Stack Segment
movw $0, %ax # Zero segments not ready for use
movw %ax, %fs # -> FS
movw %ax, %gs # -> GS
# Set up the stack pointer and call into C.
movl $start, %esp
call bootmain
Set up the data, extra and stack segments with SEG_KDATA.
Move “$start” = 7c00 to the stack pointer. The stack will grow from 7c00 down towards 0000.
Call bootmain(), a C function.
In bootmain.c
bootmain(): already in memory, as part of ‘bootblock’
bootmain.c expects to find a copy of the kernel executable on the disk starting at the second sector (sector = 1). Why?
The kernel is an ELF format binary.
bootmain() loads the first 4096 bytes of the ELF binary. It places the in-memory copy at address 0x10000.
void
bootmain(void)
{
  struct elfhdr *elf;
  struct proghdr *ph, *eph;
  void (*entry)(void);
  uchar* pa;
  elf = (struct elfhdr*)0x10000; // scratch space
  // Read 1st page off disk
  readseg((uchar*)elf, 4096, 0);
bootmain()
Check if it’s really ELF or not
Next load kernel code from ELF file “kernel” into memory
// Is this an ELF executable?
if(elf->magic != ELF_MAGIC)
  return; // let bootasm.S handle error
ELF
struct elfhdr {
  uint magic; // must equal ELF_MAGIC
  uchar elf[12];
  ushort type;
  ushort machine;
  uint version;
  uint entry;
  uint phoff; // where is program header table
  uint shoff;
  uint flags;
  ushort ehsize;
  ushort phentsize;
  ushort phnum; // no. of program header entries
  ushort shentsize;
  ushort shnum;
  ushort shstrndx;
};
ELF
// Program header
struct proghdr {
  uint type; // Loadable segment, Dynamic linking information,
             // Interpreter information, Thread-Local Storage template, etc.
  uint off; // Offset of the segment in the file image.
  uint vaddr; // Virtual address of the segment in memory.
  uint paddr; // Physical address to load this program, if PA is relevant
  uint filesz; // Size in bytes of the segment in the file image.
  uint memsz; // Size in bytes of the segment in memory. May be 0.
  uint flags;
  uint align;
};
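The ELF check in bootmain() compares the first four bytes of the image with ELF_MAGIC. A minimal user-space sketch of that check follows; looks_like_elf() is a hypothetical name, and the magic is assembled byte by byte so the sketch does not depend on the host's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ELF_MAGIC 0x464C457FU   /* "\x7FELF" read as a little-endian uint */

/* Return 1 if the buffer starts with the ELF magic number, else 0. */
int looks_like_elf(const unsigned char *image, size_t len)
{
    uint32_t magic;

    assert(image != NULL);
    if (len < 4)
        return 0;                       /* too short to hold the magic */
    magic = (uint32_t)image[0]
          | (uint32_t)image[1] << 8
          | (uint32_t)image[2] << 16
          | (uint32_t)image[3] << 24;   /* little-endian assembly */
    return magic == ELF_MAGIC;
}
```

On x86 the boot loader can simply read elf->magic, because the processor is little-endian and the header was copied to 0x10000 unchanged.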
Run ‘objdump -x -a kernel | head -15’ & see this
kernel: file format elf32-i386
kernel
architecture: i386, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x0010000c
Program Header:
LOAD off 0x00001000 vaddr 0x80100000 paddr 0x00100000 align 2**12
filesz 0x0000a516 memsz 0x000154a8 flags rwx
STACK off 0x00000000 vaddr 0x00000000 paddr 0x00000000 align 2**4
filesz 0x00000000 memsz 0x00000000 flags rwx
LOAD: code to be loaded at vaddr 0x80100000 = KERNLINK (KERNBASE + EXTMEM).
The difference between memsz and filesz will be filled with zeroes in memory.
STACK: everything is zero.
Load code from ELF to memory
with 4 MB
pages, memory
translation
entrypgdir in main.c, is used by entry()
#define PTE_P 0x001 // Present
#define PTE_W 0x002 // Writeable
#define PTE_U 0x004 // User
#define PTE_PS 0x080 // Page Size
#define PDXSHIFT 22 // offset of PDX in a linear address
__attribute__((__aligned__(PGSIZE)))
pde_t entrypgdir[NPDENTRIES] = {
  // Map VA's [0, 4MB) to PA's [0, 4MB)
  [0] = (0) | PTE_P | PTE_W | PTE_PS,
  // Map VA's [KERNBASE, KERNBASE+4MB) to PA's [0, 4MB). This is entry 512
  [KERNBASE>>PDXSHIFT] = (0) | PTE_P | PTE_W | PTE_PS,
};
This is the page directory used during entry(), at the beginning of the kernel.
Mapping 0:0x400000 (i.e. 0:4MB) to physical addresses 0:0x400000 is required as long
as entry is executing at low addresses, but will eventually be removed.
This mapping restricts the kernel instructions and data to 4 Mbytes.
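The two-entry mapping can be tried out in user space. The sketch below is an illustration only: demo_pgdir, pdx() and translate_4mb() are hypothetical names, and the translation models just the 4 MB page case used by entrypgdir (top 10 bits select the entry, low 22 bits are the offset).

```c
#include <assert.h>
#include <stdint.h>

#define KERNBASE   0x80000000u
#define PDXSHIFT   22
#define NPDENTRIES 1024
#define PTE_P      0x001   /* Present */
#define PTE_W      0x002   /* Writeable */
#define PTE_PS     0x080   /* Page Size (4 MB page) */

/* Same shape as entrypgdir: entries 0 and 512 both map to physical 0. */
static const uint32_t demo_pgdir[NPDENTRIES] = {
    [0]                    = 0 | PTE_P | PTE_W | PTE_PS,
    [KERNBASE >> PDXSHIFT] = 0 | PTE_P | PTE_W | PTE_PS,
};

/* Page directory index: top 10 bits of the address. */
uint32_t pdx(uint32_t va)
{
    return (va >> PDXSHIFT) & 0x3FF;
}

/* Translate through demo_pgdir; returns -1 if the address is unmapped. */
int64_t translate_4mb(uint32_t va)
{
    uint32_t pde = demo_pgdir[pdx(va)];

    if (!(pde & PTE_P))
        return -1;                       /* would be a page fault */
    assert(pde & PTE_PS);                /* only 4 MB pages in this sketch */
    return (int64_t)((pde & 0xFFC00000u) | (va & 0x003FFFFFu));
}
```

Both 0x00100000 and 0x80100000 land on physical 0x100000, while anything at or above KERNBASE+4MB is unmapped, which is why the kernel must fit in 4 MB at this stage.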
entry() in entry.S
entry:
  # Turn on page size extension for 4Mbyte pages
  movl %cr4, %eax
  orl $(CR4_PSE), %eax
  movl %eax, %cr4
  # Set page directory. 4 MB pages (temporarily only. More later)
  movl $(V2P_WO(entrypgdir)), %eax
  movl %eax, %cr3
  # Turn on paging.
  movl %cr0, %eax
  orl $(CR0_PG|CR0_WP), %eax
  movl %eax, %cr0
  # Set up the stack pointer.
  movl $(stack + KSTACKSIZE), %esp
  # Jump to main(), and switch to executing at high addresses.
  # The indirect call is needed because the assembler
  # produces a PC-relative instruction for a direct jump.
  mov $main, %eax
  jmp *%eax
More about entry()
movl $(V2P_WO(entrypgdir)), %eax
movl %eax, %cr3
V2P is simple: subtract 0x80000000, i.e. KERNBASE, from the address.
-> Here we use the physical address (via V2P_WO) because %cr3 must hold a
physical address, and paging is not turned on yet
More about entry()
movl %cr0, %eax
orl $(CR0_PG|CR0_WP), %eax
movl %eax, %cr0
This turns on paging.
After this also, entry() is running and the processor is executing code at lower addresses.
But we have already set the 0’th entry in pgdir to address 0.
So it still works!
entry()
# Set up the stack pointer.
# Abhijit: +KSTACKSIZE is done as stack grows downwards
movl $(stack + KSTACKSIZE), %esp
# Jump to main(), and switch to executing at high addresses. The
# indirect call is needed because the assembler produces a PC-relative
# instruction for a direct jump.
mov $main, %eax
jmp *%eax
.comm stack, KSTACKSIZE
# Abhijit: allocate here 'stack' of size = KSTACKSIZE
From entry:
Till: inside main(), before kvmalloc()
[Figure: CS, DS and SS select GDT entries (0: null; 1: base 0, limit 4GB, Read, Execute; 2: base 0, limit 4GB, Write). CR3 points to entrypgdir, whose entries 0 and 512 both map to physical address 0 with flags P,W,PS (4MB pages) in RAM.]
Even now, every logical address = physical address, but through the page directory.
Code from bootasm.S bootmain.c is over!
Kernel is loaded.
Now kernel is going to prepare itself
main() in main.c
Initializes “free list” of page frames
In 2 steps. Why?
Sets up page table for kernel
Detects configuration of all processors
Starts all processors
Just like the first processor
Creates the first process!
Initializes
LAPIC on each processor, IOAPIC
Disables PIC
“Console” hardware (the standard I/O)
Serial Port
Interrupt Descriptor Table
Buffer Cache
Files Table
Hard Disk (IDE)
main() in main.c
int
main(void) {
  kinit1(end, P2V(4*1024*1024)); // phys page allocator
  kvmalloc(); // kernel page table
void
kinit1(void *vstart, void *vend) {
  initlock(&kmem.lock, "kmem");
  kmem.use_lock = 0;
  freerange(vstart, vend);
}
void
freerange(void *vstart, void *vend)
{
  char *p;
  p = (char*)PGROUNDUP((uint)vstart);
  for(; p + PGSIZE <= (char*)vend; p += PGSIZE)
    kfree(p);
}
kfree(char *v) {
  struct run *r;
  if((uint)v % PGSIZE || v < end || V2P(v) >= PHYSTOP)
    panic("kfree");
  // Fill with junk to catch dangling refs.
  memset(v, 1, PGSIZE);
  if(kmem.use_lock)
    acquire(&kmem.lock);
  r = (struct run*)v;
  r->next = kmem.freelist;
  kmem.freelist = r;
  if(kmem.use_lock)
    release(&kmem.lock); }
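The free list can be imitated in user space to see how freerange() and kfree() thread the list through the free pages themselves. This is only a sketch under stated assumptions: a static array stands in for physical memory, locking and the junk fill are left out, and the names arena, page_free(), page_alloc() and free_arena() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PGSIZE 4096
#define NPAGES 8    /* hypothetical arena size for this demo */
#define PGROUNDUP(a) (((a) + PGSIZE - 1) & ~(uintptr_t)(PGSIZE - 1))

/* Same idea as xv6's struct run: the link lives inside the free page. */
struct run { struct run *next; };

static struct run *freelist;
static unsigned char arena[(NPAGES + 1) * PGSIZE];  /* one spare page for alignment */

/* Like kfree(): push the page on the front of the free list. */
void page_free(void *v)
{
    struct run *r = (struct run *)v;

    assert((uintptr_t)v % PGSIZE == 0);   /* must be page aligned */
    r->next = freelist;
    freelist = r;
}

/* Like kalloc(): pop the first page, or return NULL if none is left. */
void *page_alloc(void)
{
    struct run *r = freelist;

    if (r)
        freelist = r->next;
    return r;
}

/* Like freerange(): free every whole page in the arena; returns the count. */
int free_arena(void)
{
    unsigned char *p = (unsigned char *)PGROUNDUP((uintptr_t)arena);
    unsigned char *end = arena + sizeof(arena);
    int n = 0;

    freelist = NULL;
    for (; p + PGSIZE <= end; p += PGSIZE, n++)
        page_free(p);
    return n;
}
```

Because pages are pushed on the front, the page freed last is the one handed out first.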
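Free List in XV6
Obtained after main() -> kinit1()
[Figure: struct kmem holds lock, use_lock and run *freelist. Each free page starts with a 4-byte pointer to the next free page, so the list is threaded through the free pages themselves.]
[Figure: memory layout, virtual addresses on one side and physical addresses on the other]
Virtual address space:
0 to KERNBASE: process address space
KERNBASE = 2048 MB: I/O space
KERNBASE+EXTMEM = 2049 MB: kernel code + RO data
data = 0x80108000 = 2049.03125 MB (obtained from kernel.sym): kernel data + memory
KERNBASE+PHYSTOP = 2272 MB: unmapped from here up to DEVSPACE
DEVSPACE = 3.96 GB, up to 4 GB
Physical memory:
0: I/O space
EXTMEM = 1 MB: kernel code + RO data
1.03125 MB: kernel data + memory
PHYSTOP = 224 MB: unused from here up to DEVSPACE
DEVSPACE = 3.96 GB, up to 4 GB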
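After kvmalloc() in main()
[Figure: logical address = offset; CS, DS and SS select GDT entries (0: null; 1: base 0, limit 4GB, Read, Execute; 2: base 0, limit 4GB, Write), giving the linear address. The linear address is split into Dir, pg and Offset; CR3 points to kpgdir, which leads through a page table to the physical address in RAM.]
Now Linear Address = Logical Address != Physical Address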
main()->mpinit()
Note: Advanced systems today use Advanced Configuration and Power Interface (ACPI) and not MPS
https://web.archive.org/web/20121002210153/http://download.intel.com/design/archives/processors/pro/docs/
Figure 3-1. System Memory Address Map
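mpinit()
mpsearch() gives mp (struct mp *mp)
mp gives mpconf (physaddr, struct mpconf *conf)
mpconf gives the lapic address (lapicaddr)
Entries in the configuration table:
MP_PROC: CPU LAPIC ID, CPU signature
MP_IOAPIC: IOAPIC id, etc.
[Figure: mpinit() fills in global variables: the apicid of each CPU (entries 0, 1, 2, 3, ...), the global int ioapicid (the IOAPIC number), and the global int *lapic, set from lapicaddr, which points to the LAPIC registers, some place in memory among the mapped devices.]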
main() -> lapicinit()
Enable Local APIC
Set timer to generate interrupt at 10ms
lapicw(TIMER, PERIODIC | (T_IRQ0 + IRQ_TIMER));
lapicw(TICR, 10000000);
Disable some unnecessary interrupts
Enable interrupts on APIC (not on CPU)
static void
lapicw(int index, int value)
{
  lapic[index] = value; // Abhijit: lapic was set in mpinit()
  lapic[ID]; // wait for write to finish, by reading
}
main()->seginit()
Re-initialize GDT
Once and forever now
Just set 4 entries
All spanning 4 GB
Differing only in permissions and privilege level
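After seginit() in main().
On the processor where we started booting
[Figure: logical address = offset; CS, DS and SS select GDT entries. The GDT now has: 0: null; 1: base 0, limit 4GB, Read, Execute, privilege 0; 2: base 0, limit 4GB, Write, privilege 0; 3: base 0, limit 4GB, Read, Execute, privilege 3; 4: base 0, limit 4GB, Write, privilege 3. The linear address is split into Dir, pg and Offset; CR3 points to kpgdir, which leads through a page table to the physical address in RAM.]
Now Linear Address = Logical Address != Physical Address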
main()->picinit()
//Abhijit: Small code. Just disable 8259A interrupt controller
// Don't use the 8259A interrupt controllers. Xv6 assumes SMP
hardware.
void
picinit(void)
{
// mask all interrupts
outb(IO_PIC1+1, 0xFF);
outb(IO_PIC2+1, 0xFF);
}
main()->ioapicinit()
void
ioapicinit(void)
{
  int i, id, maxintr;
  /* Abhijit: global variable set to IOAPIC */
  ioapic = (volatile struct ioapic*)IOAPIC;
  maxintr = (ioapicread(REG_VER) >> 16) & 0xFF;
  id = ioapicread(REG_ID) >> 24;
  if(id != ioapicid)
    cprintf("ioapicinit: id isn't equal to ioapicid; not a MP\n");
  // Mark all interrupts edge-triggered, active high, disabled,
  // and not routed to any CPUs.
  for(i = 0; i <= maxintr; i++){
    ioapicwrite(REG_TABLE+2*i, INT_DISABLED | (T_IRQ0 + i));
    ioapicwrite(REG_TABLE+2*i+1, 0);
  }
}
Location of ioapic is fixed
#define IOAPIC 0xFEC00000
The loop sets up the entries for all interrupts up to maxintr, leaving each one disabled for now
main()->consoleinit()
#define NDEV 10
struct devsw devsw[NDEV];
#define CONSOLE 1
// table mapping major device number to
// device functions
struct devsw {
  int (*read)(struct inode*, char*, int);
  int (*write)(struct inode*, char*, int);
};
void
consoleinit(void)
{
  initlock(&cons.lock, "console");
  devsw[CONSOLE].write = consolewrite;
  devsw[CONSOLE].read = consoleread;
  cons.locking = 1;
  ioapicenable(IRQ_KBD, 0);
}
[Figure: devsw is an array; each entry holds a read and a write function pointer]
Console handling in xv6
Device Files
Are files. Have an inode on disk. Type is “Device”.
Store No data.
Inode has “major” and “minor” number
Major number is an index into “devsw”
“devsw” entry gives “read” and “write” functions for that
device
Minor number identifies a device amongst many of that
type (in code of devsw.read, devsw.write)
sys_read() and sys_write() redirect the request to
devsw.read and devsw.write
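The dispatch through devsw can be sketched in user space as below. This is a simplification, not the xv6 code: the struct inode argument is dropped, the "console" just appends to a memory buffer, and demo_console_write() and dev_write() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NDEV    10
#define CONSOLE 1   /* major number of the console */

/* Simplified devsw entry: no inode argument in this sketch. */
struct devsw {
    int (*read)(char *dst, int n);
    int (*write)(const char *src, int n);
};

static char screen[128];    /* stand-in for the console hardware */
static int screen_len;

/* Hypothetical console driver: append the bytes to the screen buffer. */
static int demo_console_write(const char *src, int n)
{
    if (n < 0 || n > (int)sizeof(screen) - screen_len)
        return -1;
    memcpy(screen + screen_len, src, (size_t)n);
    screen_len += n;
    return n;
}

/* Table indexed by major device number; unused slots stay NULL. */
struct devsw devsw[NDEV] = {
    [CONSOLE] = { .read = NULL, .write = demo_console_write },
};

/* What a write to a device file boils down to: look up the major
 * number in devsw and call the driver's write function. */
int dev_write(int major, const char *src, int n)
{
    if (major < 0 || major >= NDEV || devsw[major].write == NULL)
        return -1;
    return devsw[major].write(src, n);
}
```

Adding a new device type then means filling one more slot of the table, without touching the code that handles read and write requests.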
Console handling in xv6
//Uses the array input
#define INPUT_BUF 128
struct {
  char buf[INPUT_BUF];
  uint r; // Read index
  uint w; // Write index
  uint e; // Edit index
} input;
consoleread()
Waits for data in ‘buf’
Data is put into ‘buf’ by ‘consoleintr’,
the interrupt handler called when keys are pressed
consolewrite()
Uses uartputc() (through consputc()), a lower level function to
write to the I/O port
main()->uartinit()
void
uartinit(void)
{
  char *p;
  // Turn off the FIFO
  outb(COM1+2, 0);
struct bcache
[Figure: conceptually the buffers are linked like this: a doubly linked list starting at head, each buffer with a next (n) and a prev (p) pointer.]
main()->fileinit()
struct {
  struct spinlock lock;
  struct file file[NFILE];
} ftable;
void fileinit(void)
{
  initlock(&ftable.lock, "ftable");
}
struct file {
  enum { FD_NONE, FD_PIPE, FD_INODE } type;
  int ref; // reference count
  char readable;
  char writable;
  struct pipe *pipe;
  struct inode *ip;
  uint off;
};
Layout of process’s VA space
These mappings need to be created per process
Memory layout of a user process after exec()
Note the argc, argv on stack
The stack is just one page.
The size of text and data is derived from the ELF file
main()->userinit()
Creating first process by hand
Code of the first process
initcode.S and init.c
init.c is compiled into “/init” file
During make !
Trick:
Use initcode.S to “exec(“/init”)”
And let exec() do rest of the job
But before you do exec()
Process must exist as if it was forked() and running
main()->userinit()
Creating first process by hand
void
userinit(void)
{
struct proc *p;
extern char _binary_initcode_start[], _binary_initcode_size[];
for(;;){
sti();
// Loop over process table looking for process to run.
acquire(&ptable.lock);
for(p = ptable.proc; p < &ptable.proc[NPROC]; p++){
if(p->state != RUNNABLE)
continue;
// Switch to chosen process. It is the process's job
// to release ptable.lock and then reacquire it
// before jumping back to us.
c->proc = p;
scheduler() called first time
[Figure: cpu *c with c->proc = p. The proc p is “initcode”: sz = 4096, name = “initcode”, state = RUNNABLE, with pgdir (page table towards kernel pages and the code/stack of p), cwd and kstack. On p's kernel stack, from the top: the trapframe p->tf (CS = 3, DS = ES = SS = 4, EFLAGS = FL_IF, ESP = 4096, EIP = 0), then the return address trapret(), then p->context (eip = forkret(), ebp, ebx, esi, edi).]
swtch:
#Abhijit: swtch was called through a function call.
#So %eip was saved on stack already
movl 4(%esp), %eax # Abhijit: eax = old
movl 8(%esp), %edx # Abhijit: edx = new
during swtch()
[Figure: state after loading the arguments. eax = &c->scheduler, edx = p->context. esp points into the scheduler's stack, which holds the return address into scheduler() and the two arguments. CR3 = kpgdir. p (“initcode”, sz = 4096, state = RUNNING, cwd = inode for “/”) still has on its kernel stack the trapframe, trapret() and the context with eip = forkret(), ebp, ebx, esi, edi.]
swtch
swtch:
#Abhijit: swtch was called through a function call.
#So %eip was saved on stack already
movl 4(%esp), %eax # Abhijit: eax = old
movl 8(%esp), %edx # Abhijit: edx = new
# Save old callee-saved registers
pushl %ebp
pushl %ebx
pushl %esi
pushl %edi # Abhijit: esp = esp - 16 (four 4-byte pushes)
during swtch()
[Figure: state after the four pushes. The scheduler's stack now holds the return address into scheduler() followed by ebp, ebx, esi, edi, with esp pointing at the saved edi. eax = &c->scheduler, edx = p->context, CR3 = kpgdir. p's kernel stack is unchanged.]
swtch
swtch:
#Abhijit: swtch was called through a function call.
#So %eip was saved on stack already
movl 4(%esp), %eax # Abhijit: eax = old
movl 8(%esp), %edx # Abhijit: edx = new
# Save old callee-saved registers
pushl %ebp
pushl %ebx
pushl %esi
pushl %edi # Abhijit: esp = esp - 16 (four 4-byte pushes)
# Switch stacks
movl %esp, (%eax) # Abhijit: *old = updated old stack
movl %edx, %esp # Abhijit: esp = new
during swtch()
[Figure: state after switching stacks. c->scheduler now points to the context saved on the scheduler's stack (edi, esi, ebx, ebp, return address into scheduler()). esp = edx = p->context, so esp now points into p's kernel stack. CR3 = kpgdir.]
swtch:
swtch
#Abhijit: swtch was called through a function call.
#So %eip was saved on stack already
movl 4(%esp), %eax # Abhijit: eax = old
movl 8(%esp), %edx # Abhijit: edx = new
# Save old callee-saved registers
pushl %ebp
pushl %ebx
pushl %esi
pushl %edi # Abhijit: esp = esp - 16 (four 4-byte pushes)
# Switch stacks
movl %esp, (%eax) # Abhijit: *old = updated old stack
movl %edx, %esp # Abhijit: esp = new
# Load new callee-saved registers
popl %edi
popl %esi
popl %ebx
popl %ebp # Abhijit: newesp = newesp + 16, context restored
during swtch()
[Figure: state after the four pops. edi, esi, ebx, ebp have been loaded from p->context; esp points at the saved eip = forkret() on p's kernel stack, just below trapret() and the trapframe. c->scheduler points to the context saved on the scheduler's stack.]
swtch:
swtch
#Abhijit: swtch was called through a function call.
#So %eip was saved on stack already
movl 4(%esp), %eax # Abhijit: eax = old
movl 8(%esp), %edx # Abhijit: edx = new
# Save old callee-saved registers
pushl %ebp
pushl %ebx
pushl %esi
pushl %edi # Abhijit: esp = esp - 16 (four 4-byte pushes)
# Switch stacks
movl %esp, (%eax) # Abhijit: *old = updated old stack
movl %edx, %esp # Abhijit: esp = new
# Load new callee-saved registers
popl %edi
popl %esi
popl %ebx
popl %ebp # Abhijit: newesp = newesp + 16, context restored
ret # Abhijit: will pop from esp now -> address of the function to
return to.
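The push order in swtch is what makes the saved registers line up with struct context. The sketch below simulates the downward-growing stack with an array and checks the resulting layout; context_after_pushes() is a hypothetical helper, and the field order assumed here (edi, esi, ebx, ebp, eip) is the one I recall from xv6's proc.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field order assumed to match xv6's struct context. */
struct context {
    uint32_t edi;
    uint32_t esi;
    uint32_t ebx;
    uint32_t ebp;
    uint32_t eip;
};

/* Simulate "call swtch" followed by the four pushl instructions, then
 * read the top of the stack back as a struct context. */
struct context context_after_pushes(uint32_t ret_addr, uint32_t ebp,
                                    uint32_t ebx, uint32_t esi, uint32_t edi)
{
    uint32_t stack[8];
    size_t sp = 8;              /* one past the top; the stack grows down */
    struct context ctx;

    stack[--sp] = ret_addr;     /* pushed by the call instruction */
    stack[--sp] = ebp;          /* pushl %ebp */
    stack[--sp] = ebx;          /* pushl %ebx */
    stack[--sp] = esi;          /* pushl %esi */
    stack[--sp] = edi;          /* pushl %edi: esp is now 16 bytes lower */

    assert(sizeof(ctx) == 5 * sizeof(uint32_t));
    memcpy(&ctx, &stack[sp], sizeof(ctx));   /* esp now points at a context */
    return ctx;
}
```

This is why swtch can save the old context with a single "movl %esp, (%eax)": the stack pointer itself is the address of the context.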
after “ret” from swtch(), just before forkret()
[Figure: ret has popped eip = forkret() from p's kernel stack, so the context is gone from it and esp points at the return address trapret(), with the trapframe (CS = 3, DS = ES = SS = 4, EFLAGS = FL_IF, ESP = 4096, EIP = 0) above it. p is “initcode”, sz = 4096, state = RUNNING. c->scheduler points to the context saved on the scheduler's stack (edi, esi, ebx, ebp, return address into scheduler()).]
After swtch()
Process is running in forkret()
c->scheduler points into the old kernel stack (the scheduler's stack),
where the scheduler's context was saved: the return address in scheduler(),
ebp, ebx, esi, edi on the stack
remember {edi, esi, ebx, ebp, ret-value} =
context
The c->scheduler is pointing to the old context
CR3 is pointing to the process pgdir
after forkret(), just before trapret() begins
[Figure: forkret() has returned into trapret(), so esp now points at the trapframe on p's kernel stack (CS = 3, DS = ES = SS = 4, EFLAGS = FL_IF, ESP = 4096, EIP = 0). p is “initcode”, sz = 4096, state = RUNNING. c->scheduler still points to the context saved on the scheduler's stack.]
After iret in trapret
The CS, EIP, ESP will be changed
to values already stored on trapframe
this is done by iret
Hence after this user code will run
On user stack!
Hence code of initcode will run now
at the end of trapret()
[Figure: the trapframe has been popped off p's kernel stack, which is now empty. eip points into the code of p (“initcode”, sz = 4096, state = RUNNING) and esp into its user stack. c->scheduler still points to the context saved on the scheduler's stack.]
initcode
start:
  pushl $argv
  pushl $init
  pushl $0 // where caller pc would be
  movl $SYS_exec, %eax
  int $T_SYSCALL
# for(;;) exit();
exit:
  movl $SYS_exit, %eax
  int $T_SYSCALL
  jmp exit
# char init[] = "/init\0";
init:
  .string "/init\0"
# char *argv[] = { init, 0 };
.p2align 2
argv:
  .long init
  .long 0
esp
0x24 = addr of argv
0x1c = addr of init
0x0
00000000 <start>:
0: 68 24 00 00 00 push $0x24
5: 68 1c 00 00 00 push $0x1c
a: 6a 00 push $0x0
c: b8 07 00 00 00 mov $0x7,%eax
11: cd 40 int $0x40
00000013 <exit>:
13: b8 02 00 00 00 mov $0x2,%eax
18: cd 40 int $0x40
1a: eb f7 jmp 13 <exit>
0000001c <init>:
”/init\0”
00000024 <argv>:
1c 00
00 00
on sys_exec() + alltraps()
[Figure: the user stack of initcode holds 0x24 (argv), 0x1c (init) and 0. The kernel stack of p now holds a new trapframe (p->tf): ss = 4, esp, eflags, cs = 3, eip; then 0, 64, ds, es, fs, gs, the general registers, the address of this trapframe (esp), and the return address in alltraps(). eip is in kernel code. p is “initcode”, sz = 4096, state = RUNNING. c->scheduler still points to the context saved on the scheduler's stack.]
Understanding fork() and exec()
} // Allocate process.
if((np = allocproc()) == 0){
return -1;
}
after allocproc()
-- we studied this -- same as creation of first
process
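[Figure: kernel stack (kstack) of proc p after allocproc(). From the top: the trapframe (sizeof(trapframe) bytes, pointed to by tf), then the return address trapret(), then the context (eip = forkret(), ebp, ebx, esi, edi), with p->context and sp pointing at it.]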
understanding fork()
// Copy process state from proc.
if((np->pgdir = copyuvm(curproc->pgdir, curproc->sz)) == 0){
  kfree(np->kstack);
  np->kstack = 0;
  np->state = UNUSED;
  return -1;
}
np->sz = curproc->sz;
np->parent = curproc;
*np->tf = *curproc->tf;
● copy the pages, page tables, page directory
● no copy on write here!
● Rewind the operation if copyuvm() fails
● copy size
● set parent of child
● copy trapframe
understanding fork()->copyuvm()
pde_t*
copyuvm(pde_t *pgdir, uint sz)
{
  pde_t *d; pte_t *pte; uint pa, i, flags;
  char *mem;
  if((d = setupkvm()) == 0)
    return 0;
  for(i = 0; i < sz; i += PGSIZE){
    if((pte = walkpgdir(pgdir, (void *) i, 0)) == 0)
      panic("copyuvm: pte should exist");
    if(!(*pte & PTE_P))
      panic("copyuvm: page not present");
    pa = PTE_ADDR(*pte);
    flags = PTE_FLAGS(*pte);
    if((mem = kalloc()) == 0)
      goto bad;
    memmove(mem, (char*)P2V(pa), PGSIZE);
    if(mappages(d, (void*)i, PGSIZE, V2P(mem), flags) < 0) {
      kfree(mem);
      goto bad;
    }
  }
  return d;
bad:
  freevm(d);
  return 0;
}
● Map kernel pages
● for every page in parent’s VM address space
● allocate a PTE for child
● set flags
● copy data
● map pages in child’s page directory/tables
understanding fork()
np->tf->eax = 0;
for(i = 0; i < NOFILE; i++)
  if(curproc->ofile[i])
    np->ofile[i] = filedup(curproc->ofile[i]);
np->cwd = idup(curproc->cwd);
safestrcpy(np->name, curproc->name, sizeof(curproc->name));
pid = np->pid;
acquire(&ptable.lock);
np->state = RUNNABLE;
release(&ptable.lock);
set return value of child to 0
eax contains return value, it’s on TF
copy each struct file
copy current working dir inode
copy name
set pid of child
set child “RUNNABLE”
exec() - different prototype
int exec(char*, char**);
usage: to print README and test.txt using “cat”
int main(int argc, char *argv[])
{
char *cmd = "/cat";
char *argstr[4] = { "/cat", "README",
"test.txt", 0};
exec(cmd, argstr);
}
note: to really run this code in xv6, you need to make changes to Makefile. First,
add this program to UPROGS, then write a file test.txt using Linux, and add
‘test.txt’ to list of files in ‘mkfs’ target in Makefile
sys_exec()
int
sys_exec(void)
{
  char *path, *argv[MAXARG];
  int i;
  uint uargv, uarg;
  if(argstr(0, &path) < 0 || argint(1, (int*)&uargv) < 0){
    return -1;
  }
  memset(argv, 0, sizeof(argv));
  for(i=0;; i++){
    if(i >= NELEM(argv))
      return -1;
    if(fetchint(uargv+4*i, (int*)&uarg) < 0)
      return -1;
    if(uarg == 0){
      argv[i] = 0;
      break;
    }
    if(fetchstr(uarg, &argv[i]) < 0)
      return -1;
  }
  return exec(path, argv);
}
argstr(n,), argint(n,)
Fetch the n’th argument from process stack using p->tf->esp + offset
Again: revise calling conventions
0’th argument: name of executable file
1st argument: address of the array of arguments, stored in uargv
sys_exec()
int sys_exec(void)
{
  char *path, *argv[MAXARG];
  int i; uint uargv, uarg;
  if(argstr(0, &path) < 0 || argint(1, (int*)&uargv) < 0){
    return -1;
  }
  memset(argv, 0, sizeof(argv));
  for(i=0;; i++){
    if(i >= NELEM(argv)) return -1;
    if(fetchint(uargv+4*i, (int*)&uarg) < 0)
      return -1;
    if(uarg == 0){
      argv[i] = 0; break;
    }
    if(fetchstr(uarg, &argv[i]) < 0)
      return -1;
  }
  return exec(path, argv);
}
the local array argv[] (allocated on kernel stack, obviously) set to 0
fetch every next argument from array of arguments
Sets the address of the argument in argv[i]
call exec
beware: mistake to assume that this exec() is the exec() called from user code! NO!
What should exec() do?
Remember, it came from fork()
so proc & within it tf, context, kstack, pgdir-tables-pages, all exist.
Code, stack pages exist, and mappings exist through proc->pgdir
Hence
read the ELF executable file (given by path; by convention also argv[0])
create a new page dir – create mappings for kernel and user code+data; copy data from ELF to these pages (later discard old pagedir)
Copy the argv onto the user stack – so that when the new program starts it has its main(argc, argv[]) built
set values of other fields in proc to start program correctly
User stack after call to exec() is over
normally data on stack on fn call: return address, first arg, second arg, ...
main(int argc, char *argv[])
argv is the address of an array of strings; each string is itself an address. Hence 2 levels of indirection on stack
exec()
int
exec(char *path, char **argv)
{
  ...
  uint argc, sz, sp, ustack[3+MAXARG+1];
  ...
  if((ip = namei(path)) == 0){
    end_op();
    cprintf("exec: fail\n");
    return -1;
  }
ustack: used to build the arguments to be pushed on user-stack
namei: get the inode of the executable file
exec()
  // Check ELF header
  if(readi(ip, (char*)&elf, 0, sizeof(elf)) != sizeof(elf))
    goto bad;
  if(elf.magic != ELF_MAGIC)
    goto bad;
  if((pgdir = setupkvm()) == 0)
    goto bad;
readi: read ELF header
setupkvm(): creating a new page directory and mapping kernel pages
exec()
  sz = 0;
  for(i=0, off=elf.phoff; i<elf.phnum; i++, off+=sizeof(ph)){
    if(readi(ip, (char*)&ph, off, sizeof(ph)) != sizeof(ph))
      goto bad;
    if(ph.type != ELF_PROG_LOAD)
      continue;
    if(ph.memsz < ph.filesz)
      goto bad;
    if(ph.vaddr + ph.memsz < ph.vaddr)
      goto bad;
    if((sz = allocuvm(pgdir, sz, ph.vaddr + ph.memsz)) == 0)
      goto bad;
    if(ph.vaddr % PGSIZE != 0)
      goto bad;
    if(loaduvm(pgdir, (char*)ph.vaddr, ip, ph.off, ph.filesz) < 0)
      goto bad;
  }
Read ELF program headers from ELF file
Map the code/data into pagedir-pagetable-pages
Copy data from ELF file into the pages allocated
exec()
  sz = PGROUNDUP(sz);
  if((sz = allocuvm(pgdir, sz, sz + 2*PGSIZE)) == 0)
    goto bad;
  clearpteu(pgdir, (char*)(sz - 2*PGSIZE));
  sp = sz;
Allocate 2 pages on top of sz (the size of the new image, which later becomes proc->sz)
One page for stack
one page for guard page
Clear the user flag (PTE_U) on the guard page, so that user code cannot access it
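The address arithmetic of this step can be tried outside the kernel. The following is only a sketch: `layout_stack()` and `struct stack_layout` are hypothetical names made up for illustration, and PGSIZE is assumed to be 4096 as on x86.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE 4096u
/* Same rounding rule as the PGROUNDUP used in exec(). */
#define PGROUNDUP(sz) (((sz) + PGSIZE - 1) & ~(PGSIZE - 1))

/* Hypothetical structure, for illustration only. */
struct stack_layout {
  uint32_t guard;  /* first byte of the inaccessible guard page */
  uint32_t stack;  /* first byte of the user stack page */
  uint32_t sp;     /* initial stack pointer: one past the stack page */
};

/* Given the size of the loaded image, compute where the two pages
 * allocated by exec() would sit. */
struct stack_layout layout_stack(uint32_t image_sz)
{
  struct stack_layout l;
  uint32_t sz = PGROUNDUP(image_sz);

  l.guard = sz;               /* lower page: guard */
  l.stack = sz + PGSIZE;      /* upper page: stack */
  l.sp = sz + 2 * PGSIZE;     /* stack grows down from here */
  return l;
}
```

For an image of 5000 bytes, the guard page starts at 8192 and the stack pointer starts at 16384.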
exec()
  // Push argument strings, prepare rest of stack in ustack.
  for(argc = 0; argv[argc]; argc++) {
    if(argc >= MAXARG)
      goto bad;
    sp = (sp - (strlen(argv[argc]) + 1)) & ~3;
    if(copyout(pgdir, sp, argv[argc], strlen(argv[argc]) + 1) < 0)
      goto bad;
    ustack[3+argc] = sp;
  }
  ustack[3+argc] = 0;
  ustack[0] = 0xffffffff; // fake return PC
  ustack[1] = argc;
  ustack[2] = sp - (argc+1)*4; // argv pointer
  sp -= (3+argc+1) * 4;
  if(copyout(pgdir, sp, ustack, (3+argc+1)*4) < 0)
    goto bad;
For each entry in argv[]
  copy it on user-stack
  remember its location on user stack in ustack
add extra entries (to be copied to user stack) to ustack
copy argc, argv pointer
take sp to bottom
copy ustack to user stack
This is what the code on earlier slide did
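To see the resulting stack contents, the same ustack logic can be run in an ordinary C program. This is a sketch under assumptions: user memory is faked with a 4096-byte array (`mem`), addresses are offsets into it, memcpy() stands in for copyout(), and the names `build_ustack()`, `peek()` and `str_at()` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAXARG 32
#define MEMSZ 4096u                 /* pretend user memory: 0..MEMSZ-1 */

static unsigned char mem[MEMSZ];

/* Build the initial stack for argv; return the final sp, 0 on error. */
uint32_t build_ustack(char **argv)
{
  uint32_t ustack[3 + MAXARG + 1];
  uint32_t sp = MEMSZ, argc;

  for (argc = 0; argv[argc]; argc++) {
    if (argc >= MAXARG)
      return 0;
    uint32_t len = (uint32_t)strlen(argv[argc]) + 1;
    sp = (sp - len) & ~3u;            /* keep sp 4-byte aligned */
    memcpy(mem + sp, argv[argc], len);
    ustack[3 + argc] = sp;            /* where this string now lives */
  }
  ustack[3 + argc] = 0;               /* argv[argc] = NULL */
  ustack[0] = 0xffffffff;             /* fake return PC */
  ustack[1] = argc;
  ustack[2] = sp - (argc + 1) * 4;    /* address of the argv[0] slot */
  sp -= (3 + argc + 1) * 4;
  memcpy(mem + sp, ustack, (3 + argc + 1) * 4);
  return sp;
}

/* Read one 32-bit word of the pretend user memory. */
uint32_t peek(uint32_t addr)
{
  uint32_t v;
  memcpy(&v, mem + addr, sizeof(v));
  return v;
}

/* View the string stored at a pretend user address. */
const char *str_at(uint32_t addr)
{
  return (const char *)(mem + addr);
}
```

After `build_ustack()` for { "cat", "README", 0 }, the word at sp is the fake return PC, sp+4 holds argc, and sp+8 holds the address of the argv array, which is the two-level indirection described above.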
exec()
  // Save program name for debugging.
  for(last=s=path; *s; s++)
    if(*s == '/')
      last = s+1;
  safestrcpy(curproc->name, last, sizeof(curproc->name));
  // Commit to the user image.
  oldpgdir = curproc->pgdir;
  curproc->pgdir = pgdir;
  curproc->sz = sz;
  curproc->tf->eip = elf.entry; // main
  curproc->tf->esp = sp;
  switchuvm(curproc);
  freevm(oldpgdir);
  return 0;
copy name of new process in proc->name
change to new page directory
change new size
tf->eip will be used when we return from exec() to jump to user code. Set to first instruction of code, given by elf.entry
Set user stack pointer to “sp” (bottom of stack of arguments)
switchuvm(): Update TSS, change CR3 to new pagedir
free old page dir
return 0 from exec()?
We know exec() does not return!
That was the exec() function inside the kernel!
It returns to sys_exec()
sys_exec() also returns. Where?
Remember we are still in kernel code, running on kernel stack.
p->kstack has the trapframe setup
There is context struct on stack. Why?
sys_exec() returns through syscall() and trap() to trapret; the trap frame will be popped!
with “iret” jump into new program!
The new program is not the old program, which could have accessed the return value of sys_exec()
Scheduler
Steps in scheduling
Suppose you want to switch from P1 to P2 on a timer interrupt
P1 was doing
F() { i++; j++;}
P2 was doing
G() { x--; y++; }
P1 will experience a timer interrupt, switch to kernel (scheduler) and scheduler will schedule P2
Steps in scheduling
User process -> kernel: switch to kernel stack. The normal sequence on any interrupt!
Kernel stack of process to kernel stack of scheduler. Why?
Kernel stack of scheduler to kernel stack of new process. Why?
Kernel stack of new process to user stack of new process
scheduler()
Enable interrupts (sti()), then acquire ptable.lock
Find a RUNNABLE process. Simple round-robin!
c->proc = p
switchuvm(p) : point the TSS at the new process’s kernel stack and make CR3 point to the new process’s pagedir
p->state = RUNNING
swtch(&(c->scheduler), p->context)
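The "find a RUNNABLE process, round-robin" step can be sketched on its own. This is a simplification, not the xv6 loop itself: `pick_next()` is a hypothetical helper, and it only shows a circular scan over a table of states.

```c
#include <assert.h>

enum procstate { UNUSED, SLEEPING, RUNNABLE, RUNNING };

/* Hypothetical helper: scan the table circularly, starting just after
 * index `last`, and return the index of the first RUNNABLE entry.
 * Returns -1 if nothing is RUNNABLE. Pass last = -1 on the first call. */
int pick_next(const enum procstate *state, int nproc, int last)
{
  for (int k = 1; k <= nproc; k++) {
    int i = (last + k) % nproc;
    if (state[i] == RUNNABLE)
      return i;
  }
  return -1;
}
```

With states { RUNNABLE, SLEEPING, RUNNABLE, UNUSED }, successive calls pick 0, then 2, then 0 again, so the runnable processes take turns.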
scheduler
swtch(&(c->scheduler), p->context)
Note that scheduler() got control while P1 was running
After call to swtch() shown above
The call does NOT return now! (It returns only when some process later calls sched())
The new process P2 given by ‘p’ starts running!
Let’s review swtch() again
swtch(old, new)
The magic function in swtch.S
Saves callee-save registers of old context
Switches esp to new-context’s stack
Pop callee-save registers from new context
ret
where? in the case of first process – returns to forkret() because
stack was setup like that !
in case of other processes, return where?
Return address given on kernel stack. But what’s that?
The EIP in p->context
When was EIP set in p->context ?
scheduler()
Called from?
mpmain() - already seen
No where else!
sched() is another scheduler function !
Who calls sched() ?
exit() - a process exiting calls sched ()
yield() - a process giving up CPU on timer calls yield()
sleep() - a process going to wait calls sleep()
sched()
void
sched(void)
{
  int intena;
  struct proc *p = myproc();
  if(!holding(&ptable.lock))
    panic("sched ptable.lock");
  if(mycpu()->ncli != 1)
    panic("sched locks");
  if(p->state == RUNNING)
    panic("sched running");
  if(readeflags()&FL_IF)
    panic("sched interruptible");
  intena = mycpu()->intena;
  swtch(&p->context, mycpu()->scheduler);
  /*A*/ mycpu()->intena = intena;
}
get current process
Error checking code (ignore as of now)
get interrupt enabled status on current CPU (ignore as of now)
call to swtch
Note the arguments’ order
p->context first, mycpu()->scheduler second
swtch() is a function call
pushes address of /*A*/ on stack of current process p
switches stack to mycpu()->scheduler. Then pops EIP from that stack and jumps there.
when was mycpu()->scheduler set? Ans: during scheduler()!
sched() and scheduler()
sched() {
  ...
  swtch(&p->context, mycpu()->scheduler); /* X */
}
scheduler(void) {
  ...
  swtch(&(c->scheduler), p->context); /* Y */
}
scheduler() saves context in c->scheduler, sched() saves context in p->context
after swtch() call in sched(), the control jumps to Y in scheduler()
Switch from process stack to scheduler’s stack
after swtch() call in scheduler(), the control jumps to X in sched()
Switch from scheduler’s stack to new process’s stack
Set of co-operating functions
sched() and scheduler() as co-routines
In sched()
swtch(&p->context, mycpu()->scheduler);
In scheduler()
swtch(&(c->scheduler), p->context);
These two keep switching between processes
These two functions work together to achieve
scheduling
Using asynchronous jumps
Hence they are co-routines
To summarize
On a timer interrupt during P1
trap() is called. Stack has changed from P1’s user stack to P1’s kernel stack
trap()->yield()
yield()->sched()
sched() -> swtch(&p->context, c->scheduler)
Stack changes to scheduler’s kernel stack.
Switches to location “Y” in scheduler().
Now the loop in scheduler()
calls switchkvm()
Then continues to find next process (P2) to run
then calls swtch(&c->scheduler, P2’s p->context)
Stack changes to P2’s kernel stack.
P2 resumes at the instruction where it had stopped! Where was it?
mycpu()->intena = intena; in sched()
Then returns to the one who called sched() i.e. yield(), sleep(), etc
Finally returns from its own “TRAP” handler and returns to P2’s user stack and user code
Locks
struct spinlock
// Mutual exclusion lock.
struct spinlock {
uint locked; // Is the lock held?
// For debugging:
char *name; // Name of lock.
struct cpu *cpu; // The cpu holding the lock.
uint pcs[10]; // The call stack (an array of program counters)
// that locked the lock.
};
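The core of acquiring such a lock is an atomic exchange on `locked`. The sketch below shows only that core in portable C11; it is not the xv6 code, which uses the x86 xchg instruction and also disables interrupts while the lock is held. The names `spinlock_demo`, `spin_tryacquire()`, `spin_acquire()` and `spin_release()` are made up for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>

struct spinlock_demo {
  atomic_uint locked;   /* 0 = free, 1 = held */
};

/* Try once: atomically write 1 and report whether the lock was free. */
int spin_tryacquire(struct spinlock_demo *lk)
{
  return atomic_exchange(&lk->locked, 1u) == 0;
}

/* Spin until the exchange finds the lock free. */
void spin_acquire(struct spinlock_demo *lk)
{
  while (!spin_tryacquire(lk))
    ;  /* busy wait */
}

void spin_release(struct spinlock_demo *lk)
{
  atomic_store(&lk->locked, 0u);
}
```

The exchange writes 1 unconditionally; only the caller that reads back 0 has actually taken the lock.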
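spinlocks in xv6 code
struct {
  struct spinlock lock;
  struct buf buf[NBUF];
  ...
static struct spinlock idelock;
struct {
  struct spinlock lock;
  ...
struct sleeplock {
  uint locked;        // Is the lock held?
  struct spinlock lk; // spinlock protecting this sleep lock
  // For debugging:
  char *name;         // Name of lock.
  int pid;            // Process holding lock
};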
Sleeplock acquire and release
void
acquiresleep(struct sleeplock *lk)
{
  acquire(&lk->lk);
  while (lk->locked) {
    /* Abhijit: interrupts are not disabled in sleep !*/
    sleep(lk, &lk->lk);
  }
  lk->locked = 1;
  lk->pid = myproc()->pid;
  release(&lk->lk);
}
void
releasesleep(struct sleeplock *lk)
{
  acquire(&lk->lk);
  lk->locked = 0;
  lk->pid = 0;
  wakeup(lk);
  release(&lk->lk);
}
Where are sleeplocks used?
Just two!
struct buf: waiting for I/O on this buffer
struct inode: waiting for I/O to this inode
Sleeplocks issues
sleep-locks support yielding the processor during their critical
sections.
This property poses a design challenge:
if thread T1 holds lock L1 and has yielded the processor,
and thread T2 wishes to acquire L1,
we have to ensure that T1 can execute
while T2 is waiting so that T1 can release L1.
T2 can’t use the spin-lock acquire function here: it spins with interrupts
turned off, and that would prevent T1 from running.
To avoid this deadlock, the sleep-lock acquire routine (called
acquiresleep) yields the processor while waiting, and does not
disable interrupts.
Because sleep-locks leave interrupts enabled, they cannot be used in interrupt handlers.
Lock Ordering
Creating a file requires simultaneously holding a lock on the directory, a lock on the new file’s inode, a lock on a disk block buffer, idelock, and ptable.lock.
Interesting case of holding and releasing ptable.lock in scheduling
Normally, any upper layer can call any lower layer below
May see the code of mkfs.c to get insight into the layout
struct superblock {
uint size; // Size of file system image (blocks)
uint nblocks; // Number of data blocks
uint ninodes; // Number of inodes.
uint nlog; // Number of log blocks
uint logstart; // Block number of first log block
uint inodestart; // Block number of first inode block
uint bmapstart; // Block number of first free map block
};
#define ROOTINO 1 // root i-number
#define BSIZE 512 // block size
Layout of xv6 file system
#define NDIRECT 12
#define NINDIRECT (BSIZE / sizeof(uint))
#define MAXFILE (NDIRECT + NINDIRECT)
// On-disk inode structure
struct dinode {
short type; // File type
short major; // Major device number (T_DEV only)
short minor; // Minor device number (T_DEV only)
short nlink; // Number of links to inode in file system
uint size; // Size of file (bytes)
uint addrs[NDIRECT+1]; // Data block addresses
};
#define DIRSIZ 14
struct dirent {
ushort inum;
char name[DIRSIZ];
};
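The constants above fix the largest file the inode can describe. A small sketch of that arithmetic follows, assuming sizeof(uint) is 4 so that NINDIRECT is 128; `addrs_slot()` and `max_file_bytes()` are hypothetical helpers written only for this illustration.

```c
#include <assert.h>

#define BSIZE 512
#define NDIRECT 12
#define NINDIRECT (BSIZE / sizeof(unsigned int))
#define MAXFILE (NDIRECT + NINDIRECT)

/* Which addrs[] slot of the dinode leads to file block number bn?
 * 0..NDIRECT-1: a direct block; NDIRECT: reached through the indirect
 * block; -1: bn is beyond the maximum file size. */
int addrs_slot(unsigned int bn)
{
  if (bn < NDIRECT)
    return (int)bn;
  if (bn < MAXFILE)
    return NDIRECT;
  return -1;
}

/* Largest file size in bytes that this inode layout can describe. */
unsigned long max_file_bytes(void)
{
  return (unsigned long)MAXFILE * BSIZE;
}
```

With these values MAXFILE is 12 + 128 = 140 blocks, i.e. 140 * 512 = 71680 bytes.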
File on disk
Let’s discuss lowest layer first
System Calls open, read, write, close, link, pipe, mknod, unlink, fstat,
mkdir, chdir, dup,
Reminder: After main()->binit()
[Figure: struct bcache. bcache.head heads a doubly linked list of struct buf (n = next, p = prev). Conceptually the buffers are linked like this; buffers keep moving on the list, as LRU.]
struct buf
struct buf {
int flags; // 0 or B_VALID or B_DIRTY
uint dev; // device number
uint blockno; // seq block number on device
struct sleeplock lock; // Lock to be held by process using it
uint refcnt; // Number of live accesses to the buf
struct buf *prev; // LRU cache list
struct buf *next; // LRU cache list
struct buf *qnext; // disk queue
uchar data[BSIZE]; // data 512 bytes
};
#define B_VALID 0x2 // buffer has been read from disk
#define B_DIRTY 0x4 // buffer needs to be written to disk
buffer cache:
static struct buf* bget(uint dev, uint blockno)
The bcache.head list is maintained on Most Recently Used (MRU) basis
head.next is the Most Recently Used (MRU) buffer
hence head.prev is the Least Recently Used (LRU)
Look for a buffer with b->blockno == blockno and b->dev == dev
Search the head.next list for existing buffer (MRU order)
Else search the head.prev list (LRU order) for an unused buffer (refcnt == 0 and not B_DIRTY) to recycle
panic() if no unused buffer is found
Increment b->refcnt ; Returns buffer locked
Does not change the list structure, just returns a buf in use
buffer cache:
struct buf* bread(uint dev, uint blockno)
struct buf*
bread(uint dev, uint blockno)
{
  struct buf *b;
  b = bget(dev, blockno);
  if((b->flags & B_VALID) == 0) {
    iderw(b);
  }
  return b; // locked buffer
}
void
bwrite(struct buf *b)
{
  if(!holdingsleep(&b->lock))
    panic("bwrite");
  b->flags |= B_DIRTY;
  iderw(b);
}
Recollect: iderw moves buf to tail of idequeue, calls idestart() and sleep()
buffer cache:
void brelse(struct buf *b)
release lock on buffer
b->refcnt--
If b->refcnt == 0
Means buffer will no longer be used
Move it to the front of the bcache.head list (MRU position)
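The interplay of bget() and brelse() on the list can be sketched with a toy cache of three buffers. Assumptions in this sketch: no locks, no disk I/O, no B_DIRTY check, and a NULL return where xv6 would panic; the `_demo` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 3

struct buf {
  unsigned int dev, blockno, refcnt;
  struct buf *prev, *next;
};

static struct buf bufs[NBUF];
static struct buf head;       /* head.next = MRU, head.prev = LRU */

void binit_demo(void)
{
  head.prev = head.next = &head;
  for (int i = 0; i < NBUF; i++) {    /* insert each buffer after head */
    bufs[i].dev = bufs[i].blockno = bufs[i].refcnt = 0;
    bufs[i].next = head.next;
    bufs[i].prev = &head;
    head.next->prev = &bufs[i];
    head.next = &bufs[i];
  }
}

struct buf *bget_demo(unsigned int dev, unsigned int blockno)
{
  struct buf *b;

  for (b = head.next; b != &head; b = b->next)   /* cached? MRU first */
    if (b->dev == dev && b->blockno == blockno) {
      b->refcnt++;
      return b;
    }
  for (b = head.prev; b != &head; b = b->prev)   /* recycle, LRU first */
    if (b->refcnt == 0) {
      b->dev = dev;
      b->blockno = blockno;
      b->refcnt = 1;
      return b;
    }
  return NULL;                        /* xv6 panics: no buffers */
}

void brelse_demo(struct buf *b)
{
  if (--b->refcnt == 0) {             /* unused now: move to MRU end */
    b->next->prev = b->prev;
    b->prev->next = b->next;
    b->next = head.next;
    b->prev = &head;
    head.next->prev = b;
    head.next = b;
  }
}
```

Because released buffers go to the front, the buffer released longest ago ends up at head.prev and is the first one recycled.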
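Overall in this diagram
[Figure: in memory, the bcache.head list of buffers (n = next, p = prev), two of them holding the data of block 52 and block 68. On disk: boot block | super block | log ... | inodes | bitmap | data .... The log header has n = 2 and block[30] (indices 0 to 29), whose first two entries are 52 and 68.]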
Typical use case of logging
/* In a system call code */
begin_op();
...
bp = bread(...);
bp->data[...] = ...;
log_write(bp);
...
end_op();
begin_op(): prepare for logging. Wait if logging system is not ready or ‘committing’. ++outstanding
bread(): read and get access to a data block – as a buffer
bp->data[...] = ...: modify buffer
log_write(): note down this buffer for writing, in log. proxy for bwrite(). Mark B_DIRTY. Absorb multiple writes into one.
end_op(): Syscall done. write log and all blocks. --outstanding. If outstanding = 0, commit().
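"Absorb multiple writes into one" means a block number is recorded in the log header only once, however many times it is written inside the transaction. A minimal sketch of that bookkeeping, with hypothetical names (`logheader_demo`, `log_write_demo`) and a -1 return where the kernel would panic:

```c
#include <assert.h>

#define LOGSIZE 30

struct logheader_demo {
  int n;                 /* number of logged blocks */
  int block[LOGSIZE];    /* their block numbers */
};

/* Record blockno in the header unless it is already there.
 * Returns 0 on success, -1 if the log is full. */
int log_write_demo(struct logheader_demo *lh, int blockno)
{
  for (int i = 0; i < lh->n; i++)
    if (lh->block[i] == blockno)
      return 0;                   /* absorbed: already in the log */
  if (lh->n >= LOGSIZE)
    return -1;
  lh->block[lh->n++] = blockno;
  return 0;
}
```

Writing blocks 52, 68 and then 52 again leaves n = 2 with entries 52 and 68.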
struct dirent {
ushort inum;
char name[DIRSIZ];
};
Data of a directory file is a sequence of such entries. To find
a name, just get all the data blocks and search the name
How to get the data for a directory? We already know the ans!
struct inode*
dirlookup(struct inode *dp, char *name, uint *poff)
Given a pointer to directory inode (dp) and the name of the file to be searched,
return the pointer to inode of that file (NULL if not found)
set the ‘offset’ of the entry found, inside the directory’s data blocks, in poff
How was ‘dp’ obtained? Who should be calling dirlookup? Why is poff returned?
During resolution of pathnames?
Code: call readi() to get data of dp, search name in it; name comes with inode-num, iget() that inode-num
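The search itself is a linear scan over fixed-size entries. The sketch below works on an in-memory array instead of calling readi(), and returns the inode number instead of calling iget(); `xv6_dirent` and `dirlookup_demo()` are hypothetical names, and an inum of 0 is taken to mark a free slot.

```c
#include <assert.h>
#include <string.h>

#define DIRSIZ 14

struct xv6_dirent {
  unsigned short inum;
  char name[DIRSIZ];
};

/* Search n directory entries for name. On success store the byte offset
 * of the entry in *poff (if poff is not NULL) and return its inode
 * number. Return 0 if the name is not found. */
unsigned short dirlookup_demo(const struct xv6_dirent *de, int n,
                              const char *name, unsigned int *poff)
{
  for (int i = 0; i < n; i++) {
    if (de[i].inum == 0)
      continue;                       /* free slot */
    if (strncmp(name, de[i].name, DIRSIZ) == 0) {
      if (poff)
        *poff = (unsigned int)(i * sizeof(struct xv6_dirent));
      return de[i].inum;
    }
  }
  return 0;
}
```

Each entry is 16 bytes, so the third entry of a directory sits at offset 32.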
int
dirlink(struct inode *dp, char *name, uint inum)
Create a new entry (name, inum) in the directory given by ‘dp’
inode number must have been obtained before calling this. How to do that?
Use dirlookup() to verify entry does not exist!
Get empty slot in directory’s data block
Make directory entry
Update directory inode! writei()
namex
Called by namei(), or nameiparent()
Just iteratively split a path using “/” separator and get inode for last component
iget() root inode (or start from the cwd for a relative path), then
Repeatedly: split on “/”, dirlookup() for next component
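The "split on /" step can be sketched as a small helper that peels off one path element at a time. This is an illustration written for these notes, not the kernel's own routine: `next_elem()` is a hypothetical name, and it NUL-terminates the element for convenience, so the caller's buffer needs DIRSIZ+1 bytes.

```c
#include <assert.h>
#include <string.h>

#define DIRSIZ 14

/* Copy the next element of path into name (truncated to DIRSIZ bytes)
 * and return a pointer to the rest of the path, leading slashes removed.
 * Return NULL when no element is left. */
const char *next_elem(const char *path, char *name)
{
  while (*path == '/')
    path++;
  if (*path == '\0')
    return NULL;
  const char *s = path;
  while (*path != '/' && *path != '\0')
    path++;
  size_t len = (size_t)(path - s);
  if (len > DIRSIZ)
    len = DIRSIZ;
  memcpy(name, s, len);
  name[len] = '\0';
  while (*path == '/')
    path++;
  return path;
}
```

Calling it repeatedly on "/usr//bin/ls" yields "usr", "bin" and "ls"; a lookup loop would call dirlookup() once per element.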
races in namex()
Crucial. Called so many times!
While one kernel thread is looking up a pathname, another kernel thread may be changing the directory by calling unlink
when executing dirlookup in namex, the lookup thread holds the lock on the directory and dirlookup() returns an inode that was obtained using iget.
Deadlock? next points to the same inode as ip when looking up ".". Locking next before releasing the lock on ip would result in a deadlock.
namex unlocks the directory before obtaining a lock on next.
File descriptor layer code
System Calls open, read, write, close, link, pipe, mknod, unlink, fstat,
mkdir, chdir, dup,
forkret()
void
forkret(void)
{
  static int first = 1;
  // Still holding ptable.lock from scheduler.
  release(&ptable.lock);
  if (first) {
    // Some initialization functions must be run in the context
    // of a regular process (e.g., they call sleep), and thus cannot
    // be run from main().
    first = 0;
    iinit(ROOTDEV);
    initlog(ROOTDEV);
  }
  // Return to "caller", actually trapret (see allocproc).
}
Doesn’t do much
Releases ptable.lock
Why? We will see later
Does some initialization if this process was “initcode”
Returns
To? trapret()
Why?
We copied trapret() above forkret() on stack in allocproc()
trapret
We have already seen concept of trapret
Will just pop off entire trap frame from stack
And Return
Where?
Had EIP = 0 in trapframe
CS already has privilege level 3 (from trapframe)
Pgdir points to process’s page dir
So just jump to start in initcode.S
Initcode.S
start:
  pushl $argv
  pushl $init
  pushl $0 // where caller pc would be
  movl $SYS_exec, %eax
  int $T_SYSCALL
init:
  .string "/init\0"
argv:
  .long init
  .long 0
General form: exec(“filename”, arg1, arg2, NULL);
Here: exec(“/init”, argv) with argv = { “/init”, 0 }
Next
We go to land of exec() and fork()
Processes
Logical layout of memory for a process
Address 0: code
Then globals
Then stack
Then heap
Each process’s address space also maps the kernel’s text and data --> so that system calls run with these mappings
Kernel code can directly access user memory now
Process Table
struct {
  struct spinlock lock;
  struct proc proc[NPROC];
} ptable;
One single global array of processes
Protected by ptable.lock
Struct proc
// Per-process state
struct proc {
  uint sz;                    // Size of process memory (bytes)
  pde_t* pgdir;               // Page table
  char *kstack;               // Bottom of kernel stack for this process
  enum procstate state;       // Process state: allocated, ready to run, running,
                              // waiting for I/O, or exiting.
  int pid;                    // Process ID
  struct proc *parent;        // Parent process
  struct trapframe *tf;       // Trap frame for current syscall
  struct context *context;    // swtch() here to run process. Process’s context
  void *chan;                 // If non-zero, sleeping on chan. More when we discuss
                              // sleep, wakeup
  int killed;                 // If non-zero, have been killed
  struct file *ofile[NOFILE]; // Open files, used by open(), read(),...
  struct inode *cwd;          // Current directory, changed with “chdir()”
  char name[16];              // Process name (for debugging)
};
Process’s stacks
2 stacks for each process
User stack and kernel stack (p->kstack)
When running user code, the user stack is used
(kernel stack is empty)
When running kernel code, the user stack still
contains local variables, formal parameters
Kernel mappings in user address space
actual location of kernel
Kernel is loaded at 0x100000 physical address
PA 0 to 0x100000 is BIOS and devices
Process’s page table will map VA 0x80000000 to PA 0x00000 and VA 0x80100000 to PA 0x100000
Kernel mappings in user address space
actual location of kernel
Kernel is not loaded at the PA 0x80000000 because some systems may not have that much memory
0x80000000 is called KERNBASE in xv6
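Because the kernel mapping is a fixed offset, converting between a kernel virtual address and a physical address is a single addition or subtraction of KERNBASE. The sketch below uses plain functions named `v2p()` and `p2v()` for illustration (the xv6 code seen earlier uses the macros V2P and P2V for the same purpose); the value of EXTMEM is the 0x100000 load address mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define KERNBASE 0x80000000u   /* first kernel virtual address */
#define EXTMEM   0x100000u     /* physical address where the kernel is loaded */

/* Kernel virtual address -> physical address. */
uint32_t v2p(uint32_t va)
{
  return va - KERNBASE;
}

/* Physical address -> kernel virtual address. */
uint32_t p2v(uint32_t pa)
{
  return pa + KERNBASE;
}
```

For example, p2v(EXTMEM) gives 0x80100000, the virtual address at which the kernel's code appears in every process's address space.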
Memory Management
[Figure: X86 page table hardware]
[Figure: Layout of process’s VA space]
[Figure: Memory layout of a user process after exec(). Note the argc, argv on stack]
Memory Layout of a user process
On sbrk()
The system call to grow process’s address space.
Calls growproc()
growproc()
Allocate a frame, add an entry in page table at the top (above proc->sz)
// This entry can’t go beyond KERNBASE
Calls switchuvm()
switchuvm()
Ultimately loads CR3, invalidating the TLB
Free List in XV6
[Figure: struct kmem with fields lock, use_lock and run *freelist. The free list is shown twice: as seen independently, and as it actually looks in memory.]
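The free list needs no separate storage: each free page holds the pointer to the next free page in its own first bytes. A user-space sketch of this idea, with a small static pool standing in for physical memory, no locking, and hypothetical `_demo` names:

```c
#include <assert.h>
#include <stddef.h>

#define PGSIZE 4096
#define NPAGES 4

struct run {
  struct run *next;
};

/* Pretend physical memory; the union keeps each page suitably aligned. */
static union {
  struct run r;
  char bytes[PGSIZE];
} pool[NPAGES];

static struct run *freelist;

/* Put a page at the head of the free list. */
void kfree_demo(void *page)
{
  struct run *r = (struct run *)page;
  r->next = freelist;
  freelist = r;
}

/* Take a page from the head of the free list; NULL if none is left. */
void *kalloc_demo(void)
{
  struct run *r = freelist;
  if (r)
    freelist = r->next;
  return r;
}

/* Start with every page of the pool on the free list. */
void kinit_demo(void)
{
  freelist = NULL;
  for (int i = 0; i < NPAGES; i++)
    kfree_demo(&pool[i]);
}
```

The list behaves like a stack: the page freed most recently is the next one handed out.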
sleep(n)                   Sleep for n ticks
exec(filename, *argv)      Load a file and execute it
sbrk(n)                    Grow process’s memory by n bytes
open(filename, flags)      Open a file; the flags indicate read/write
chdir(dirname)             Change the current directory
mkdir(dirname)             Create a new directory
mknod(name, major, minor)  Create a device file
fstat(fd)                  Return info about an open file
link(f1, f2)               Create another name (f2) for the file f1
exec()
exec(path, argv)
// Commit to the user image.
oldpgdir = curproc->pgdir;
curproc->pgdir = pgdir;
curproc->sz = sz;
curproc->tf->eip = elf.entry; // main
curproc->tf->esp = sp;
switchuvm(curproc);
freevm(oldpgdir);
Handling Interrupt Controllers
IO-APIC and L-APIC
APIC: Advanced Programmable Interrupt Controller
IO-APIC
Routing interrupts from the I/O subsystem
L-APIC
Routing interrupts on each processor
IO-APIC
ioapic.c
/* roughly 3.98 GB. In high mem. Memory Mapped I/O address */
#define IOAPIC 0xFEC00000 // Default physical address of IO APIC
}