Here are my reading notes on the textbook "Computer Systems: A Programmer's Perspective" (CSAPP), written while taking CMU's 18-600: Foundations of Computer Systems (the ECE department's adaptation of 15-213: Introduction to Computer Systems). The notes are organized by the book's chapters and presented mainly as key points.
A Tour of Computer Systems
- Information = bits + context
- Compilation system
- Pre-processor: inserts header files into the source file, producing program.i
- Compiler: translates the source file into assembly language, program.s
- Assembler: translates assembly language into a relocatable object file, program.o
- Linker: links all relocatable object files into an executable object file, program
- GNU provides the development tool chain (gcc, gdb, etc.) used on Linux
- Virtual memory (virtual address space), from low to high addresses: program code and data - runtime heap - shared libraries - user stack - kernel virtual memory
- Multiprocessor: multiple cores + hyperthreading
Representing and Manipulating Information
- little endian: least significant byte at the lowest address; big endian: most significant byte at the lowest address (see the byte-order sketch at the end of this chapter)
- typedef unsigned char *byte_pointer: type alias for inspecting the byte representation of an object
- Bit Operation: ~ NOT; ^ Exclusive OR
- Logic Operation(&&, ||, !): Any non-zero parameter = 1
- Sign bit weight = -2^(w-1)
- Signed <-> Unsigned: the bit pattern does not change; the value changes by ±2^w when the sign bit is set
- When signed is mixed with unsigned in an expression, the signed operand is implicitly converted to unsigned
- Pay attention to corner cases such as TMin and 0
- Java only supports signed integers
- Two's complement = signed encoding; addition = bit-level addition followed by truncation (the carry out is discarded)
- Multiplication is costly, so compilers use shifts plus additions/subtractions
- Unsigned division by powers of two: logical right shift; two's complement division: arithmetic right shift
- IEEE floating point: value = (-1)^s × M × 2^E, where M is encoded by the frac field and E by the exp field; most values must be approximated and rounded
- Normalized (exp != all 0 && exp != all 1): E = exp - bias, where bias = 2^(k-1) - 1; M = 1.frac
- Denormalized (exp == all 0): E = 1 - bias; M = 0.frac
- Special values (exp == all 1): Inf (frac = all 0); NaN (frac != all 0)
- Round-to-even is used when a value is exactly halfway between two representable values
- unsigned int a = 15213U;
- long: 4 bytes on 32-bit systems; 8 bytes on 64-bit systems
- size_t: unsigned int on 32-bit systems; unsigned long on 64-bit systems
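A minimal sketch (in the spirit of the book's show_bytes example) that uses the byte_pointer alias above to inspect byte ordering on the current machine; 15213 is just the book's running example value:

```c
#include <stdio.h>

typedef unsigned char *byte_pointer;  /* type alias from the notes above */

/* Print each byte of an object, lowest address first. */
static void show_bytes(byte_pointer start, size_t len) {
    for (size_t i = 0; i < len; i++)
        printf(" %.2x", start[i]);
    printf("\n");
}

int main(void) {
    int x = 15213;                          /* 0x3b6d */
    show_bytes((byte_pointer)&x, sizeof(x));
    /* On a little-endian machine this prints " 6d 3b 00 00":
       the least significant byte sits at the lowest address. */
    return 0;
}
```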
Machine-Level Representation of Programs
- gcc -O1 (optimization level) -g (include debugging/source-line information) -o p (name the output file p) p1.c p2.c
- objdump -d program.o: disassembler, regenerates assembly code from object code
- Assembly language syntax: AT&T format (operands ordered src, dst) vs. Intel format (dst, src)
- byte(b): 8 bits; word(w): 16 bits; long word(l): 32 bits; quad word(q): 64 bits
- Register: %rax(64)-%eax(32)-%ax(16)-%ah(15-8)-%al(7-0)
- ax, cx, dx (caller-saved); bx, si, di (callee-saved): general purpose; sp: stack pointer; bp: frame pointer
- Operand forms: $immediate; %register; memory: disp(base, index, scale) = disp + base + index × scale
- SUB src, dst: dst - src -> dst
- 64-bit multiply (one-operand form): imull src: src × %eax -> %edx:%eax
- 64-bit divide: idivl src: %edx:%eax / src -> %eax (quotient), %edx:%eax mod src -> %edx (remainder)
- cltd: sign-extends %eax into %edx:%eax
- SAL(Arithmetic Left); SHL(Logical Left); SAR(Arithmetic Right); SHR(Logical Right)
- condition codes: ZF(zero); SF(negative); OF(signed overflow); CF(unsigned overflow)
- cmp S2, S1: sets condition codes based on S1 - S2; setX dst: sets a byte from the condition codes; jmp label / jmp *address (indirect); signed comparisons use g/l suffixes, unsigned use a/b
- if-else: conditional jump to a label; loop: label plus backward jump; switch: jump table containing code addresses
- conditional expressions: cmov (conditional move) computes both results first and then selects one; it avoids branch misprediction, so it often outperforms conditional control flow on a pipelined processor
- procedure (stack frame): the caller's frame holds its local variables, argument build area, and return address; the callee's frame runs from the saved frame pointer bp (higher address) down to sp (lower address) and holds the callee's saved registers and local variables
- Array: contiguous memory; address arithmetic is scaled by the size of the element type
- Union: all fields share the same storage, so only one field is stored at a time; the bytes are interpreted according to the field used
- Alignment: a K-byte primitive object's address must be a multiple of K
- x86-64 (also called AMD64 / Intel 64; distinct from IA-64, i.e. Itanium): first six integer arguments passed in registers (di, si, dx, cx, r8, r9); %bp is no longer needed as a frame pointer
- Buffer overflow attack: strcpy past the end of a stack buffer to inject instructions and overwrite the return address (see the sketch after this list)
- struct: every member's address must satisfy that member's alignment requirement, and the total size is padded to a multiple of the largest alignment
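A minimal sketch of the unsafe copying pattern behind such attacks (the function names are illustrative): strcpy performs no bounds check, so a long input overruns buf and can overwrite the saved return address.

```c
#include <stdio.h>
#include <string.h>

/* Unsafe: strcpy copies without a bounds check, so a long input overruns
   buf and can clobber the saved return address on the stack. */
void echo_unsafe(const char *input) {
    char buf[8];
    strcpy(buf, input);
}

/* Safer: bound the copy to the buffer size and terminate explicitly. */
void echo_safe(const char *input) {
    char buf[8];
    strncpy(buf, input, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    printf("%s\n", buf);
}

int main(void) {
    echo_safe("hello");
    return 0;
}
```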
Processor Architecture
- Y86: Balance between CISC and RISC
- Register ID 0xF means "no register" in the instruction encoding
- HCL(Hardware Control Language): Describe control logic circuit
- bool out = f(logic operator);
- multiplexor(output one of the inputs): int out = [s in {I}:A; 1:B];
- clocked registers: the output equals the stored state; the state is updated on the rising clock edge
- random-access memories: addressable storage, e.g. the register file and data memory
- SEQ(Sequential) Processor: Serve one instruction until it finishes
- Instruction stage: fetch-decode-execute-memory-write back(register)-PC update
- Every instruction takes one clock cycle. On the rising clock edge, the state elements (PC, CC, register file, memory) are updated according to the previous instruction; during the cycle, combinational logic propagates to compute the new values that will be loaded into the state elements on the next rising edge
- Slow: Each hardware unit is only active for a fraction of the clock cycle
- Pipelined Processor: One instruction finished this stage will move on to next, next instruction will come
- throughput = 1 / clock period, where the period is bounded by the slowest stage delay plus the pipeline register delay
- As soon as an instruction moves on, the address of the next instruction (the new PC) must be predicted
- instruction dependency hazards (data + control): handled by stalling, forwarding, or both (the load/use hazard needs a load interlock: one stall plus forwarding)
- SEQ+: move PC update to the beginning
- PIPE-: insert pipeline registers (before the corresponding stages)
- PIPE: forward logic in decode stage
- Exception Handling: stat field in pipeline register
- CPI = 1 + C_bubbles/C_instructions = 1 + load/use penalty + mispredict penalty + return penalty (see the worked example at the end of this chapter)
- SSE: Streaming SIMD Extensions; x86-64 floating point registers: %xmm0-%xmm15 can hold multiple values
- SuperScalar: Instruction Per Cycle > 1; Out of Order
- Branch Target Buffer for dynamic branch prediction: previous instruction addr -> target addr
- Register Rename: small architecture regs -> large physical regs, ARF -> RRF
- Load Bypassing: load ahead of store; Load Forwarding: forward store to load
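As a quick illustration of the CPI formula above (the penalty frequencies here are hypothetical, not taken from the book): if 5% of instructions incur a 1-cycle load/use bubble, 8% incur a 2-cycle mispredict bubble, and 2% incur a 3-cycle return bubble, then CPI = 1 + 0.05×1 + 0.08×2 + 0.02×3 = 1.27.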
Optimizing Program Performance
- Criteria
- Correctness
- Readability
- Performance
- Trade-off between readability and performance
- Method
- High-level design: efficient algorithm and data structure
- Basic coding principles: easy for compiler to optimize
- Low-level optimizations
- Memory aliasing: two pointers refer to the same memory location, which blocks some compiler optimizations
- Cycles Per Element(CPE): measure program performance
- Move unnecessary code out of loop
- Reduce function calls
- Eliminate unnecessary memory reference
- latency: cycles to perform operation; issue time: cycles between operations
- Loop unrolling: reduces the number of iterations and the per-iteration loop overhead
- Use multiple local accumulators and reassociation transformations to increase instruction-level parallelism (see the sketch at the end of this chapter)
- Use conditional move to avoid branch misprediction
- Parallel memory load and store
- Program Profiling: gprof
- Amdahl's Law: speedup = 1 / ((1 - alpha) + alpha / k), where alpha is the fraction of time spent in the improved part and k is its speedup
- To significantly speed up the entire system, we must improve a very large fraction of it; for example, speeding up 90% of the system (alpha = 0.9) by 10x only yields 1 / (0.1 + 0.09) ≈ 5.3x overall
- Significant overall speedup therefore requires alpha in the 90%-100% range
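A minimal sketch of loop unrolling by 2 with two parallel accumulators (the unrolling/reassociation idea above); the function name and the use of doubles are illustrative:

```c
/* Sum an array, unrolled by 2 with two independent accumulators so the
   additions can overlap in the pipeline instead of forming one long chain. */
double sum_unrolled(const double *a, long n) {
    double acc0 = 0.0, acc1 = 0.0;
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i];
        acc1 += a[i + 1];
    }
    for (; i < n; i++)      /* finish any leftover element */
        acc0 += a[i];
    return acc0 + acc1;     /* combine the accumulators at the end */
}
```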
The Memory Hierarchy
- Register -> N-level Cache -> Main Memory -> Local Disk -> Remote Disk
- RAM: volatile, contents are lost when the power is turned off
- one bit = one cell
- SRAM (static): used for caches; bistable cell, 6 transistors per bit
- DRAM (dynamic): used for main memory; 1 transistor per bit, sensitive to disturbance and must be periodically refreshed
- Memory controller with addr and data pins
- Main Memory -> Memory Modules -> DRAM Chips -> Supercells(row,column) -> 8 cells(1 byte)
- SDRAM: synchronous DRAM, driven by the rising edges of a clock instead of asynchronous control signals
- DDR (Double Data Rate) SDRAM: transfers data on both clock edges to double the data rate; generations differ in the size of the prefetch buffer
- ROM: nonvolatile (keeps its contents with the power off); erasable variants allow only a limited number of erase cycles
- EEPROM(Electrically Erasable Programmable ROM): Flash Memory, Solid State Disk
- CPU - bus interface - system bus - I/O bridge - memory bus - main memory
- I/O bridge - I/O bus (e.g. PCI, Peripheral Component Interconnect) - USB controllers, graphics adapters, host bus adapters (SCSI, SATA) connecting disks and SSDs
- sectors (separated by gaps) -> track -> cylinder; the disk exposes them as a sequence of logical blocks
- RPM: Revolutions Per Minute
- Access Time = avg seek time + avg rotation latency + avg transfer time
- Temporal Locality: reference itself; Spatial Locality: reference nearby location
- Cache: S sets -> E lines per set -> each line has 1 valid bit + t tag bits + a B-byte block; C = S × E × B (see the address-decomposition sketch at the end of this chapter)
- Memory address: t tag bits + s set-index bits + b block-offset bits
- Direct-mapped cache (E = 1); E-way set-associative cache (1 < E < C/B); fully associative cache (S = 1)
- (valid == 1 && tag match) ? hit : miss;
- Miss: Fetch entire block from memory to cache line
- Cold Miss, Conflict Miss, Capacity Miss
- Memory Mountain: read throughput = f(size(temporal locality),stride(spatial locality))
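A minimal sketch of splitting an address into tag / set index / block offset, assuming (hypothetically) s = 5 set-index bits and b = 6 block-offset bits, i.e. 32 sets of 64-byte blocks:

```c
#include <stdio.h>
#include <stdint.h>

#define B_BITS 6   /* block offset bits (64-byte blocks) -- assumed */
#define S_BITS 5   /* set index bits (32 sets) -- assumed */

int main(void) {
    uint64_t addr   = 0x7ffe1234;                    /* example address */
    uint64_t offset = addr & ((1ULL << B_BITS) - 1);
    uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
    uint64_t tag    = addr >> (B_BITS + S_BITS);
    printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}
```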
Linking
- Collect and combine code and data into a single file that can be loaded into memory and executed
- The linker enables separate compilation; linking can happen at compile time (static linker) or at load time / run time (dynamic linker)
- Symbol resolution
- Symbol: variables + functions
- static: local symbol; non-static: global symbol defined by the module itself; external: global symbol defined by another module
- strong symbol: initialized global variable and function; weak symbol: uninitialized global variable
- strong with strong: link error; strong with weak: pick strong; weak with weak: pick any one
- static library: lib.a(archive), only copy the object modules that are referenced
- E: set of object files to be merged; U: set of referenced but not yet defined symbols; D: set of defined symbols
- The linker scans the object files and static libraries in command-line order, updating E, U, and D
- A library that defines a symbol must appear on the command line after the files that reference it
- Relocation: section merge + relocate symbol reference within section
- Load executable object file: copying the program into memory and then running
- Shared library: dynamic link at run time, lib.so
- dynamic linking from within a program: handle = dlopen("lib.so", ...); func = dlsym(handle, "func"); dlclose(handle) (see the sketch below)
- Position Independent Code: distance between code segment and data segment is constant
- Global Offset Table + Procedure Linkage Table
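A minimal sketch of run-time dynamic linking with the dlopen interface (link with -ldl); "libvector.so" and "addvec" are illustrative names in the spirit of the book's example, so substitute a real shared library and symbol:

```c
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

int main(void) {
    /* Load the shared library at run time. */
    void *handle = dlopen("./libvector.so", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        exit(1);
    }

    /* Look up a function symbol in the loaded library. */
    void (*addvec)(int *, int *, int *, int) =
        (void (*)(int *, int *, int *, int))dlsym(handle, "addvec");
    char *err = dlerror();
    if (err) {
        fprintf(stderr, "dlsym: %s\n", err);
        exit(1);
    }

    int x[2] = {1, 2}, y[2] = {3, 4}, z[2];
    addvec(x, y, z, 2);
    printf("z = [%d, %d]\n", z[0], z[1]);

    dlclose(handle);
    return 0;
}
```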
Exceptional Control Flow
- Control Flow: Sequence of control transfer
- Synchronous: the call returns only after it has finished; Asynchronous: the call returns immediately and completion is signaled later
- Exceptions: low-level control transfers implemented by hardware and the OS; classes: interrupt (asynchronous, from I/O devices or the processor, handled by the OS), trap, fault, abort
- Event occur -> exception -> exception table -> exception handler(kernel mode) -> return/abort
- System Call(trap, 0x80 in exception table): syscall, Application call OS
- Fault (possibly recoverable) - floating-point exception: divide by 0; segmentation fault: accessing an undefined address or writing to a read-only area
- Program: a collection of code and data
- Process: Instance of a program in execution, illusion that has exclusive use of processor and memory
- Switching between processes requires a context switch (triggered through exceptions); threads of the same process share one context
- 3 states of process: Running, Stopped, Terminated
- fork(): parent creates a child; returns the child's PID to the parent and 0 to the child; the parent must reap its terminated children, otherwise they remain zombies (see the sketch at the end of this chapter)
- pid_t waitpid(pid, &status, options): waits for children to terminate or stop and reaps them
- execve(filename, argv, envp): loads and runs a new program in the context of the current process
- fork: new process, same program; execve: same process, new program
- Signals: a higher-level mechanism between applications and the OS; the kernel notifies a process that an event has occurred
- blocked (delivery deferred until unblocked) - pending (at most one pending signal per type) - received
- process group: passing a negative PID to kill sends the signal to every process in the group
- signal(signum,handler): install user-defined handler for handling received signal
- Nonlocal jumps: jump across functions; setjmp saves the context (like catch), longjmp returns control to the setjmp point (like throw)
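A minimal sketch of the fork/waitpid pattern above: the parent creates a child, the child runs its own code path, and the parent reaps it so no zombie is left behind.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {                 /* child: fork returned 0 */
        printf("child  pid=%d\n", (int)getpid());
        exit(0);
    }
    /* parent: fork returned the child's pid */
    int status;
    waitpid(pid, &status, 0);       /* reap the terminated child */
    if (WIFEXITED(status))
        printf("parent reaped child %d, exit status %d\n",
               (int)pid, WEXITSTATUS(status));
    return 0;
}
```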
Virtual Memory
- Physical addressing (PA): 0 ~ M-1, m bits; virtual addressing (VA): 0 ~ N-1, n bits
- The Memory Management Unit (MMU) translates VAs to PAs; the transfer unit is a page (the analog of a cache block)
- Each process has its own virtual address space and page table
- Virtual Memory: Array of N contiguous byte-sized cells stored on disk
- Unallocated(no space on disk), Cached(fully associative in physical memory), Uncached(on disk)
- Page Table Base Register -> page table (in cache or memory): each entry maps a virtual page to a physical page

| Virtual Page (VP) | valid | entry contents |
|---|---|---|
| 0 | 1 | physical page number (cached): page hit |
| 1 | 0 | disk address (uncached): page fault, swap the page in/out and re-execute the instruction |
| 2 | 0 | null (unallocated) |
- Demand Paging: the strategy of waiting until the last moment to swap in a page, when a miss occurs
- Virtual Page Number -> Page Table Entry -> Physical Page Number
- Virtual Page Offset = Physical Page Offset
- Translation Lookaside Buffer (TLB): a small cache of page table entries inside the MMU
- Multi-level page tables: the virtual page number bits are split into multiple fields, one per level
- m×n PTEs are organized as m level-1 PTEs, each pointing to a level-2 table of n PTEs; a level-2 table does not need to exist if its level-1 PTE is null
- VA -> PA: TLB hit: VPN -> MMU (TLB) -> PPN; TLB miss: VPN -> (multi-level) page table -> PPN
- Get data word: PA -> cache -> physical memory
- Linux Virtual Memory: vm_area_structs -> different sections of process virtual memory
- Memory Mapping: initialize virtual memory area by associating it with an object(shared, private copy-on-write, anonymous) on disk
- mmap: creates a new virtual memory area mapped to an object; munmap: deletes the area
- brk: top of the heap; payload: actual requested size; bp: block pointer, the first word after header
- void *sbrk(intptr_t incr); // moves brk by incr bytes to grow or shrink the heap, returning the old brk
- void *malloc(size_t size); // allocates a block of at least size bytes; returns NULL on error; memory is not initialized
- Note that allocated blocks must satisfy the alignment requirement!
- void *calloc(size_t numElements, size_t sizeOfElement); // allocates an array and zero-initializes it
- void *realloc(void *ptr, size_t newsize); // resizes the block pointed to by ptr; the block may move to a new location
- void free(void *ptr); // explicitly frees an allocated block
- Goal: maximize throughput(request number), maximize memory utilization(peak utilization)
- internal fragmentation: the allocated block is larger than the request; external fragmentation: enough free space exists in total, but no single free block is large enough
- boundary tags: replicate the header in a footer at the end of the block, so the previous block's size and status can be read from the word just before the current block, enabling constant-time coalescing (see the macro sketch at the end of this chapter)
- implicit free list: start padding + prologue block + regular heap blocks (header + payload + optional padding + footer) + epilogue header
- placement, splitting, coalescing(false fragmentation, combine free blocks)
- explicit free list: doubly linked free list; segregated free list: multiple free lists of different size
- buddy system: segregated free list of sizes of power of two
- new/delete: C++ operators that allocate/free user-defined objects and run constructors and destructors
- Garbage Collection: implicitly free allocated block, directed graph(root nodes -> heap nodes)
- garbage: unreachable nodes which will never be used again
- Mark&Sweep: mark reachable blocks and sweep unmarked allocated blocks
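A minimal sketch of the header/footer (boundary tag) encoding used by an implicit free list allocator, following the book's macro style but written from memory, so treat the names and WSIZE choice as illustrative:

```c
#include <stdio.h>

#define WSIZE 4                     /* word / header / footer size in bytes */

/* Pack a block size and an allocated bit into one header/footer word. */
#define PACK(size, alloc)   ((size) | (alloc))

/* Read and write a word at address p. */
#define GET(p)              (*(unsigned int *)(p))
#define PUT(p, val)         (*(unsigned int *)(p) = (val))

/* Extract the size and allocated bit from a header/footer at p. */
#define GET_SIZE(p)         (GET(p) & ~0x7)
#define GET_ALLOC(p)        (GET(p) & 0x1)

/* Given a block pointer bp (first payload byte), find its header and footer. */
#define HDRP(bp)            ((char *)(bp) - WSIZE)
#define FTRP(bp)            ((char *)(bp) + GET_SIZE(HDRP(bp)) - 2 * WSIZE)

/* Neighboring blocks in the implicit list. */
#define NEXT_BLKP(bp)       ((char *)(bp) + GET_SIZE(HDRP(bp)))
#define PREV_BLKP(bp)       ((char *)(bp) - GET_SIZE((char *)(bp) - 2 * WSIZE))

int main(void) {
    /* Toy heap: one 16-byte allocated block = header | 8-byte payload | footer. */
    unsigned int heap[6] = {0};
    char *bp = (char *)&heap[1];            /* block pointer = first payload byte */
    PUT(HDRP(bp), PACK(16, 1));             /* header: size 16, allocated */
    PUT(FTRP(bp), PACK(16, 1));             /* footer mirrors the header */
    printf("size=%u alloc=%u\n", GET_SIZE(HDRP(bp)), GET_ALLOC(HDRP(bp)));
    return 0;
}
```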
System-Level I/O
- I/O: copying data between main memory and external devices (disk drives, terminals, networks), all of which are modeled as files
- open/close file
- int open(filename, flags, mode); returns a file descriptor (fd); int close(fd); (see the sketch at the end of this chapter)
- every process created by a Unix shell starts with three open descriptors: stdin (0), stdout (1), stderr (2)
- read/write file
- read(fd,buf,size); write(fd,buf,size);
- RIO (Robust I/O) package handles short counts (fewer bytes transferred than requested, e.g. at EOF, when reading text lines, or due to network delays)
- unbuffered functions for binary data over networks; buffered functions for text lines
- an in-memory buffer improves efficiency: the program interacts with the buffer directly instead of issuing a system call for every small read/write, and the buffer exchanges data with the file one buffer-sized chunk at a time
- change file position
- file metadata: stat(filename,buf);
- open file data structure
- descriptor table (one per process) -> open file table (file position, reference count; shared by all processes) -> v-node table (file metadata; shared by all processes)
- I/O Redirection: int dup2(oldfd,newfd), copy descriptor table entry of oldfd to newfd
- C standard I/O and RIO are higher-level packages built on lower-level Unix I/O; for network sockets, use RIO rather than standard I/O
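A minimal sketch of the Unix I/O calls listed above: open a file, read it in chunks, and write the bytes to stdout ("input.txt" is just an illustrative filename):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("input.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        exit(1);
    }
    char buf[4096];
    ssize_t n;
    /* read may return a short count; loop until it returns 0 (EOF). */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        write(STDOUT_FILENO, buf, (size_t)n);
    if (n < 0)
        perror("read");
    close(fd);
    return 0;
}
```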
Network Programming
- client-server transaction: client process request server process, server process response client process
- TCP/IP Protocol Stack: Application Layer - Transport Layer - Internet Layer - Network Access Layer
- LAN: Bridged Ethernet(Host - Hub(copy every bit to every other port) - Bridge(selectively copy frame to dest))
- internet (lowercase): LANs connected by routers (which bridge otherwise incompatible LANs); the Global IP Internet is one particular implementation of an internet
- Reliable transfer: no lost, duplicated, or out-of-order data
- TCP (Transmission Control Protocol, reliable) / UDP (User Datagram Protocol, unreliable): process-to-process delivery
- Three-way Handshake: SYN -> SYN+ACK -> ACK
- IP(Internet Protocol): host to host(send datagrams, unreliable)
- MAC Address: 48 bit; IP Address: 32 bit(ipv6: 16 Bytes), dotted-decimal notation; Network Byte Order: big-endian
- unnamed root -> first-level domain name(com) -> second-level domain name(google) -> third-level domain name(www)
- Socket: end point of connection, IP address:port(16 bit); ephemeral port, well-known port
- Socket Interface
- Client: open_clientfd (socket - connect) - I/O - close; Server: open_listenfd (socket - bind - listen) - accept - I/O - close (see the client sketch at the end of this chapter)
- the client blocks in connect, the server blocks in accept
- the listening descriptor exists for the lifetime of the server (open_listenfd); a connected descriptor exists only for the duration of one connection (accept)
- World Wide Web is a service provided by Internet using HTTP(Hypertext Transfer Protocol)
- MIME: Multipurpose Internet Mail Extensions
- URI(Uniform Resource Identifier) = URL(locater) + URN(name)
- HTTP Request Line: method URI version; HTTP Response Line: version status code status message
- status code: 200(OK), 301(permanently redirect), 302(temporarily redirect), 403(Forbidden), 404(Not found)
- CGI(Common Gateway Interface): interface with executable programs on a server that generate web pages dynamically
- Proxy: bypass firewall, page translation, anonymizer, cache
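A minimal sketch of the client side of the socket flow above (socket - connect - I/O - close), roughly what the book's open_clientfd wrapper does; the host, port, and request are illustrative:

```c
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    struct addrinfo hints, *list, *p;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;    /* TCP */

    if (getaddrinfo("example.com", "80", &hints, &list) != 0) {
        fprintf(stderr, "getaddrinfo failed\n");
        exit(1);
    }

    int fd = -1;
    for (p = list; p; p = p->ai_next) { /* try each address until one connects */
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0)
            break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(list);
    if (fd < 0) {
        fprintf(stderr, "could not connect\n");
        exit(1);
    }

    const char *req = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
    write(fd, req, strlen(req));
    char buf[1024];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("%s\n", buf);            /* first chunk of the HTTP response */
    }
    close(fd);
    return 0;
}
```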
Concurrent Programming
- Processes
- parent forks a child to handle each new connection request, child closes listenfd, parent closes connfd
- Interprocess Communication(IPC)
- Pros: separate virtual address spaces keep processes from accidentally overwriting each other's state; Cons: sharing state is difficult, and IPC and context switching have high overhead
- I/O Multiplexing
- select function blocks until one of the descriptors in the read set is ready to read
- state machine: (input state, input event) -> output state
- Pros: more control, share state information; Cons: complexity
- Threads(hybrid of Process and I/O Multiplexing)
- Like processes, threads are scheduled automatically by the kernel and are known to the kernel by an integer ID. Like I/O multiplexing, multiple threads run in the context of a single process, and thus share the entire contents of the process virtual address space, including its code, data, heap, shared libraries, and open files. Each thread has its own thread context(thread ID(TID), stack, stack pointer, program counter, general-purpose registers, condition codes)
- main thread(the first thread)
- peer threads: thread can kill any of its peers or wait for any of its peers to terminate.
- POSIX Thread: pthread, pthread_t tid
- joinable thread (default): can be reaped or killed by other threads, and its memory resources are not freed until another thread reaps it; detached thread (usually preferred for servers): cannot be reaped or killed by other threads, and its resources are freed automatically by the system when it terminates
- Global variables: one instance in virtual address space and can be referenced by any thread
- Local automatic variables: each thread's stack contains its own instance; they can still be shared if a pointer to them is passed to another thread
- Local static variables: one instance in virtual address space and can be referenced by any thread
- Thread-safe: a function that always produces correct results when called concurrently from multiple threads, even while manipulating shared data structures
- Thread synchronization: coordinate threads to access critical section
- Semaphore: solution to synchronization, a nonnegative global variable s
- P(s): decrement s if s is nonzero, block if s is zero until V(s); V(s): increment s, restart one P(s) if s becomes nonzero
- P and V cannot be interrupted: atomic operation
- Mutual exclusion: initialize s with 1; P(s); Do(); V(s); only the thread holding the single unit of s (the mutex) can execute Do(), like a talking pillow (see the sketch at the end of this chapter)
- Producer-Consumer Problem: Producer: P(slot)-P(mutex)-insert-V(mutex)-V(item); Consumer: P(item)-P(mutex)-remove-V(mutex)-V(slot)
- Readers-Writers Problem: reader-favored(reader can read although writer is waiting); writer-favored(writer write as soon as possible after the previous readers finish)
- Reader: P(mutex); readcnt++; if (readcnt == 1) {P(w);} V(mutex); Read(); P(mutex); readcnt--; if (readcnt == 0) {V(w);} V(mutex); Writer: P(w); Write(); V(w);
- Parallel programs are often written so that each core runs exactly one thread(cat /proc/cpuinfo)
- Reentrant function (a subset of thread-safe functions): does not reference any shared data
- Deadlock: each thread is waiting for the other to do V that will never occur, because of wrong order of P and V
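A minimal sketch of mutual exclusion with a POSIX semaphore (the P/V pattern above): several threads increment a shared counter, and the semaphore keeps the increments from interleaving. Compile with -pthread; the thread count and iteration count are arbitrary.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long cnt = 0;        /* shared counter */
static sem_t mutex;         /* binary semaphore protecting cnt */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        sem_wait(&mutex);   /* P(mutex) */
        cnt++;              /* critical section */
        sem_post(&mutex);   /* V(mutex) */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    sem_init(&mutex, 0, 1);             /* initialize s with 1 */
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);     /* reap the joinable peer threads */
    printf("cnt = %ld (expected %d)\n", cnt, NTHREADS * NITERS);
    return 0;
}
```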