Here are my reading notes on the textbook "Computer Systems: A Programmer's Perspective" (CSAPP), written while taking CMU's 18-600: Foundations of Computer Systems (the ECE department's adaptation of 15-213: Introduction to Computer Systems). The notes are organized by the book's chapters and presented mainly as key points.
A Tour of Computer Systems
- Information = bits + context
- Compilation system
- Pre-processor: inserts header files into the source file, producing program.i
- Compiler: translates the source file into assembly language, program.s
- Assembler: translates assembly language into a relocatable object file, program.o
- Linker: links all relocatable object files into an executable object file, program
- GNU provides the development tool chain (gcc, gdb, etc.) used on Linux
- Virtual memory (virtual address space), from low to high addresses: program code and data - runtime heap - shared libraries - user stack - kernel virtual memory
- Multiprocessor: multiple cores + hyperthreading
Representing and Manipulating Information
- little endian: least significant byte at the lowest address; big endian: most significant byte at the lowest address (see the byte-order sketch at the end of this chapter)
- typedef unsigned char *byte_pointer: type alias for inspecting the byte representation of an object
- Bit Operation: ~ NOT; ^ Exclusive OR
- Logic Operation(&&, ||, !): Any non-zero parameter = 1
- Sign bit weight = -2^(w-1)
- Signed <-> Unsigned: the bit pattern does not change; the value changes by ±2^w when the sign bit is set
- When signed is mixed with unsigned in an expression, the signed operand is implicitly converted to unsigned
- Pay attention to corner cases such as TMin and 0
- Java only supports signed integers
- Two's complement = signed encoding; addition = bit-level addition followed by truncation (the carry out is discarded)
- Multiplication is costly, so compilers use shifts plus additions/subtractions
- Unsigned division by powers of two: logical right shift; two's complement division: arithmetic right shift
- IEEE floating point: value = (-1)^s × M × 2^E, where M is encoded by the frac field and E by the exp field; most values must be approximated and rounded
- Normalized (exp != all 0 && exp != all 1): E = exp - bias, where bias = 2^(k-1) - 1; M = 1.frac
- Denormalized (exp == all 0): E = 1 - bias; M = 0.frac
- Special values (exp == all 1): Inf (frac = all 0); NaN (frac != all 0)
- Round-to-even is used when a value is exactly halfway between two representable values
- unsigned int a = 15213U;
- long: 4 bytes on 32-bit systems; 8 bytes on 64-bit systems
- size_t: unsigned int on 32-bit systems; unsigned long on 64-bit systems
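A minimal sketch (in the spirit of the book's show_bytes example) that uses the byte_pointer alias above to inspect byte ordering on the current machine; 15213 is just the book's running example value:

```c
#include <stdio.h>

typedef unsigned char *byte_pointer;  /* type alias from the notes above */

/* Print each byte of an object, lowest address first. */
static void show_bytes(byte_pointer start, size_t len) {
    for (size_t i = 0; i < len; i++)
        printf(" %.2x", start[i]);
    printf("\n");
}

int main(void) {
    int x = 15213;                          /* 0x3b6d */
    show_bytes((byte_pointer)&x, sizeof(x));
    /* On a little-endian machine this prints " 6d 3b 00 00":
       the least significant byte sits at the lowest address. */
    return 0;
}
```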
Machine-Level Representation of Programs
- gcc -O1 (optimization level) -g (include debugging/source-line information) -o p (name the output file p) p1.c p2.c
- objdump -d program.o: disassembler, regenerates assembly code from object code
- Assembly language syntax: AT&T format (operands ordered src, dst) vs. Intel format (dst, src)
- byte(b): 8 bits; word(w): 16 bits; long word(l): 32 bits; quad word(q): 64 bits
- Register: %rax(64)-%eax(32)-%ax(16)-%ah(15-8)-%al(7-0)
- ax, cx, dx (caller-saved); bx, si, di (callee-saved): general purpose; sp: stack pointer; bp: frame pointer
- Operand forms: $immediate; %register; memory: disp(base, index, scale) = disp + base + index × scale
- SUB src, dst: dst - src -> dst
- 64-bit multiply (one-operand form): imull src: src × %eax -> %edx:%eax
- 64-bit divide: idivl src: %edx:%eax / src -> %eax (quotient), %edx:%eax mod src -> %edx (remainder)
- cltd: sign-extends %eax into %edx:%eax
- SAL(Arithmetic Left); SHL(Logical Left); SAR(Arithmetic Right); SHR(Logical Right)
- condition codes: ZF(zero); SF(negative); OF(signed overflow); CF(unsigned overflow)
- cmp S2, S1: sets condition codes based on S1 - S2; setX dst: sets a byte from the condition codes; jmp label / jmp *address (indirect); signed comparisons use g/l suffixes, unsigned use a/b
- if-else: conditional jump to a label; loop: label plus backward jump; switch: jump table containing code addresses
- conditional expressions: cmov (conditional move) computes both results first and then selects one; it avoids branch misprediction, so it often outperforms conditional control flow on a pipelined processor
- procedure (stack frame): the caller's frame holds its local variables, argument build area, and return address; the callee's frame runs from the saved frame pointer bp (higher address) down to sp (lower address) and holds the callee's saved registers and local variables
- Array: contiguous memory; address arithmetic is scaled by the size of the element type
- Union: all fields share the same storage, so only one field is stored at a time; the bytes are interpreted according to the field used
- Alignment: a K-byte primitive object's address must be a multiple of K
- x86-64 (also called AMD64 / Intel 64; distinct from IA-64, i.e. Itanium): first six integer arguments passed in registers (di, si, dx, cx, r8, r9); %bp is no longer needed as a frame pointer
- Buffer overflow attack: strcpy past the end of a stack buffer to inject instructions and overwrite the return address (see the sketch after this list)
- struct: every member's address must satisfy that member's alignment requirement, and the total size is padded to a multiple of the largest alignment
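A minimal sketch of the unsafe copying pattern behind such attacks (the function names are illustrative): strcpy performs no bounds check, so a long input overruns buf and can overwrite the saved return address.

```c
#include <stdio.h>
#include <string.h>

/* Unsafe: strcpy copies without a bounds check, so a long input overruns
   buf and can clobber the saved return address on the stack. */
void echo_unsafe(const char *input) {
    char buf[8];
    strcpy(buf, input);
}

/* Safer: bound the copy to the buffer size and terminate explicitly. */
void echo_safe(const char *input) {
    char buf[8];
    strncpy(buf, input, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    printf("%s\n", buf);
}

int main(void) {
    echo_safe("hello");
    return 0;
}
```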
Processor Architecture
- Y86: Balance between CISC and RISC
- Register ID 0xF means "no register" in the instruction encoding
- HCL(Hardware Control Language): Describe control logic circuit
- bool out = f(logic operator);
- multiplexor(output one of the inputs): int out = [s in {I}:A; 1:B];
- clocked registers: the output equals the stored state; the state is updated on the rising clock edge
- random-access memories: addressable storage, e.g. the register file and data memory
- SEQ(Sequential) Processor: Serve one instruction until it finishes
- Instruction stage: fetch-decode-execute-memory-write back(register)-PC update
- Every instruction takes one clock cycle. On the rising clock edge, the state elements (PC, CC, register file, memory) are updated according to the previous instruction; during the cycle, combinational logic propagates to compute the new values that will be loaded into the state elements on the next rising edge
- Slow: Each hardware unit is only active for a fraction of the clock cycle
- Pipelined Processor: One instruction finished this stage will move on to next, next instruction will come
- throughput = 1 / clock period, where the period is bounded by the slowest stage delay plus the pipeline register delay
- As soon as an instruction moves on, the address of the next instruction (the new PC) must be predicted
- instruction dependency hazards (data + control): handled by stalling, forwarding, or both (the load/use hazard needs a load interlock: one stall plus forwarding)
- SEQ+: move PC update to the beginning
- PIPE-: insert pipeline registers (before the corresponding stages)
- PIPE: forward logic in decode stage
- Exception Handling: stat field in pipeline register
- CPI = 1 + C_bubbles/C_instructions = 1 + load/use penalty + mispredict penalty + return penalty (see the worked example at the end of this chapter)
- SSE: Streaming SIMD Extensions; x86-64 floating point registers: %xmm0-%xmm15 can hold multiple values
- SuperScalar: Instruction Per Cycle > 1; Out of Order
- Branch Target Buffer for dynamic branch prediction: previous instruction addr -> target addr
- Register Rename: small architecture regs -> large physical regs, ARF -> RRF
- Load Bypassing: load ahead of store; Load Forwarding: forward store to load
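As a quick illustration of the CPI formula above (the penalty frequencies here are hypothetical, not taken from the book): if 5% of instructions incur a 1-cycle load/use bubble, 8% incur a 2-cycle mispredict bubble, and 2% incur a 3-cycle return bubble, then CPI = 1 + 0.05×1 + 0.08×2 + 0.02×3 = 1.27.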
Optimizing Program Performance
- Criteria
- Correctness
- Readability
- Performance
- Trade-off between readability and performance
- Method
- High-level design: efficient algorithm and data structure
- Basic coding principles: easy for compiler to optimize
- Low-level optimizations
- Memory aliasing: two pointers refer to the same memory location, which blocks some compiler optimizations
- Cycles Per Element(CPE): measure program performance
- Move unnecessary code out of loop
- Reduce function calls
- Eliminate unnecessary memory reference
- latency: cycles to perform operation; issue time: cycles between operations
- Loop unrolling: reduces the number of iterations and the per-iteration loop overhead
- Use multiple local accumulators and reassociation transformations to increase instruction-level parallelism (see the sketch at the end of this chapter)
- Use conditional move to avoid branch misprediction
- Parallel memory load and store
- Program Profiling: gprof
- Amdahl's Law: speedup = 1 / ((1 - alpha) + alpha / k), where alpha is the fraction of time spent in the improved part and k is its speedup
- To significantly speed up the entire system, we must improve a very large fraction of it; for example, speeding up 90% of the system (alpha = 0.9) by 10x only yields 1 / (0.1 + 0.09) ≈ 5.3x overall
- Significant overall speedup therefore requires alpha in the 90%-100% range
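A minimal sketch of loop unrolling by 2 with two parallel accumulators (the unrolling/reassociation idea above); the function name and the use of doubles are illustrative:

```c
/* Sum an array, unrolled by 2 with two independent accumulators so the
   additions can overlap in the pipeline instead of forming one long chain. */
double sum_unrolled(const double *a, long n) {
    double acc0 = 0.0, acc1 = 0.0;
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i];
        acc1 += a[i + 1];
    }
    for (; i < n; i++)      /* finish any leftover element */
        acc0 += a[i];
    return acc0 + acc1;     /* combine the accumulators at the end */
}
```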
The Memory Hierarchy
- Register -> N-level Cache -> Main Memory -> Local Disk -> Remote Disk
- RAM: volatile, contents are lost when the power is turned off
- one bit = one cell
- SRAM (static): used for caches; bistable cell, 6 transistors per bit
- DRAM (dynamic): used for main memory; 1 transistor per bit, sensitive to disturbance and must be periodically refreshed
- Memory controller with addr and data pins
- Main Memory -> Memory Modules -> DRAM Chips -> Supercells(row,column) -> 8 cells(1 byte)
- SDRAM: synchronous DRAM, driven by the rising edges of a clock instead of asynchronous control signals
- DDR (Double Data Rate) SDRAM: transfers data on both clock edges to double the data rate; generations differ in the size of the prefetch buffer
- ROM: nonvolatile (keeps its contents with the power off); erasable variants allow only a limited number of erase cycles
- EEPROM(Electrically Erasable Programmable ROM): Flash Memory, Solid State Disk
- CPU - bus interface - system bus - I/O bridge - memory bus - main memory
- I/O bridge - I/O bus (e.g. PCI, Peripheral Component Interconnect) - USB controllers, graphics adapters, host bus adapters (SCSI, SATA) connecting disks and SSDs
- sectors (separated by gaps) -> track -> cylinder; the disk exposes them as a sequence of logical blocks
- RPM: Revolutions Per Minute
- Access Time = avg seek time + avg rotation latency + avg transfer time
- Temporal Locality: reference itself; Spatial Locality: reference nearby location
- Cache: S sets -> E lines per set -> each line has 1 valid bit + t tag bits + a B-byte block; C = S × E × B (see the address-decomposition sketch at the end of this chapter)
- Memory address: t tag bits + s set-index bits + b block-offset bits
- Direct-mapped cache (E = 1); E-way set-associative cache (1 < E < C/B); fully associative cache (S = 1)
- (valid == 1 && tag match) ? hit : miss;
- Miss: Fetch entire block from memory to cache line
- Cold Miss, Conflict Miss, Capacity Miss
- Memory Mountain: read throughput = f(size(temporal locality),stride(spatial locality))
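A minimal sketch of splitting an address into tag / set index / block offset, assuming (hypothetically) s = 5 set-index bits and b = 6 block-offset bits, i.e. 32 sets of 64-byte blocks:

```c
#include <stdio.h>
#include <stdint.h>

#define B_BITS 6   /* block offset bits (64-byte blocks) -- assumed */
#define S_BITS 5   /* set index bits (32 sets) -- assumed */

int main(void) {
    uint64_t addr   = 0x7ffe1234;                    /* example address */
    uint64_t offset = addr & ((1ULL << B_BITS) - 1);
    uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
    uint64_t tag    = addr >> (B_BITS + S_BITS);
    printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}
```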
Linking
- Collect and combine code and data into a single file that can be loaded into memory and executed
- The linker enables separate compilation; linking can happen at compile time (static linker) or at load time / run time (dynamic linker)
- Symbol resolution
- Symbol: variables + functions
- static: local symbol; non-static: global symbol defined by the module itself; external: global symbol defined by another module
- strong symbol: initialized global variable and function; weak symbol: uninitialized global variable
- strong with strong: link error; strong with weak: pick strong; weak with weak: pick any one
- static library: lib.a(archive), only copy the object modules that are referenced
- E: set of object files to be merged; U: set of referenced but not yet defined symbols; D: set of defined symbols
- The linker scans the object files and static libraries in command-line order, updating E, U, and D
- A library that defines a symbol must appear on the command line after the files that reference it
- Relocation: section merge + relocate symbol reference within section
- Load executable object file: copying the program into memory and then running
- Shared library: dynamic link at run time, lib.so
- dynamic linking from within a program: handle = dlopen("lib.so", ...); func = dlsym(handle, "func"); dlclose(handle) (see the sketch below)
- Position Independent Code: distance between code segment and data segment is constant
- Global Offset Table + Procedure Linkage Table
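A minimal sketch of run-time dynamic linking with the dlopen interface (link with -ldl); "libvector.so" and "addvec" are illustrative names in the spirit of the book's example, so substitute a real shared library and symbol:

```c
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

int main(void) {
    /* Load the shared library at run time. */
    void *handle = dlopen("./libvector.so", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        exit(1);
    }

    /* Look up a function symbol in the loaded library. */
    void (*addvec)(int *, int *, int *, int) =
        (void (*)(int *, int *, int *, int))dlsym(handle, "addvec");
    char *err = dlerror();
    if (err) {
        fprintf(stderr, "dlsym: %s\n", err);
        exit(1);
    }

    int x[2] = {1, 2}, y[2] = {3, 4}, z[2];
    addvec(x, y, z, 2);
    printf("z = [%d, %d]\n", z[0], z[1]);

    dlclose(handle);
    return 0;
}
```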
Exceptional Control Flow
- Control Flow: Sequence of control transfer
- Synchronous: the call returns only after it has finished; Asynchronous: the call returns immediately and completion is signaled later
- Exceptions: low-level control transfers implemented by hardware and the OS; classes: interrupt (asynchronous, from I/O devices or the processor, handled by the OS), trap, fault, abort
- Event occur -> exception -> exception table -> exception handler(kernel mode) -> return/abort
- System Call(trap, 0x80 in exception table): syscall, Application call OS
- Fault (possibly recoverable) - floating-point exception: divide by 0; segmentation fault: accessing an undefined address or writing to a read-only area
- Program: a collection of code and data
- Process: Instance of a program in execution, illusion that has exclusive use of processor and memory
- Switching between processes requires a context switch (triggered through exceptions); threads of the same process share one context
- 3 states of process: Running, Stopped, Terminated
- fork(): parent creates a child; returns the child's PID to the parent and 0 to the child; the parent must reap its terminated children, otherwise they remain zombies (see the sketch at the end of this chapter)
- pid_t waitpid(pid, &status, options): waits for children to terminate or stop and reaps them
- execve(filename, argv, envp): loads and runs a new program in the context of the current process
- fork: new process, same program; execve: same process, new program
- Signals: a higher-level mechanism between applications and the OS; the kernel notifies a process that an event has occurred
- blocked (delivery deferred until unblocked) - pending (at most one pending signal per type) - received
- process group: passing a negative PID to kill sends the signal to every process in the group
- signal(signum,handler): install user-defined handler for handling received signal
- Nonlocal jumps: jump across functions; setjmp saves the context (like catch), longjmp returns control to the setjmp point (like throw)
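A minimal sketch of the fork/waitpid pattern above: the parent creates a child, the child runs its own code path, and the parent reaps it so no zombie is left behind.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {                 /* child: fork returned 0 */
        printf("child  pid=%d\n", (int)getpid());
        exit(0);
    }
    /* parent: fork returned the child's pid */
    int status;
    waitpid(pid, &status, 0);       /* reap the terminated child */
    if (WIFEXITED(status))
        printf("parent reaped child %d, exit status %d\n",
               (int)pid, WEXITSTATUS(status));
    return 0;
}
```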
Virtual Memory
- Physical addressing (PA): 0 ~ M-1, m bits; virtual addressing (VA): 0 ~ N-1, n bits
- The Memory Management Unit (MMU) translates VAs to PAs; the transfer unit is a page (the analog of a cache block)
- Each process has its own virtual address space and page table
- Virtual Memory: Array of N contiguous byte-sized cells stored on disk
- Unallocated(no space on disk), Cached(fully associative in physical memory), Uncached(on disk)
- Page Table Base Register -> page table (in cache or memory): each entry maps a virtual page to a physical page

| Virtual Page (VP) | valid | entry contents |
|---|---|---|
| 0 | 1 | physical page number (cached): page hit |
| 1 | 0 | disk address (uncached): page fault, swap the page in/out and re-execute the instruction |
| 2 | 0 | null (unallocated) |
- Demand Paging: the strategy of waiting until the last moment to swap in a page, when a miss occurs
- Virtual Page Number -> Page Table Entry -> Physical Page Number
- Virtual Page Offset = Physical Page Offset
- Translation Lookaside Buffer (TLB): a small cache of page table entries inside the MMU
- Multi-level page tables: the virtual page number bits are split into multiple fields, one per level
- m×n PTEs are organized as m level-1 PTEs, each pointing to a level-2 table of n PTEs; a level-2 table does not need to exist if its level-1 PTE is null
- VA -> PA: TLB hit: VPN -> MMU (TLB) -> PPN; TLB miss: VPN -> (multi-level) page table -> PPN
- Get data word: PA -> cache -> physical memory
- Linux Virtual Memory: vm_area_structs -> different sections of process virtual memory
- Memory Mapping: initialize virtual memory area by associating it with an object(shared, private copy-on-write, anonymous) on disk
- mmap: creates a new virtual memory area mapped to an object; munmap: deletes the area
- brk: top of the heap; payload: actual requested size; bp: block pointer, the first word after header
- void *sbrk(intptr_t incr); // moves brk by incr bytes to grow or shrink the heap, returning the old brk
- void *malloc(size_t size); // allocates a block of at least size bytes; returns NULL on error; memory is not initialized
- Note that allocated blocks must satisfy the alignment requirement!
- void *calloc(size_t numElements, size_t sizeOfElement); // allocates an array and zero-initializes it
- void *realloc(void *ptr, size_t newsize); // resizes the block pointed to by ptr; the block may move to a new location
- void free(void *ptr); // explicitly frees an allocated block
- Goal: maximize throughput(request number), maximize memory utilization(peak utilization)
- internal fragmentation: the allocated block is larger than the request; external fragmentation: enough free space exists in total, but no single free block is large enough
- boundary tags: replicate the header in a footer at the end of the block, so the previous block's size and status can be read from the word just before the current block, enabling constant-time coalescing (see the macro sketch at the end of this chapter)
- implicit free list: start padding + prologue block + regular heap blocks (header + payload + optional padding + footer) + epilogue header
- placement, splitting, coalescing(false fragmentation, combine free blocks)
- explicit free list: doubly linked free list; segregated free list: multiple free lists of different size
- buddy system: segregated free list of sizes of power of two
- new/delete: C++ operators that allocate/free user-defined objects and run constructors and destructors
- Garbage Collection: implicitly free allocated block, directed graph(root nodes -> heap nodes)
- garbage: unreachable nodes which will never be used again
- Mark&Sweep: mark reachable blocks and sweep unmarked allocated blocks
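A minimal sketch of the header/footer (boundary tag) encoding used by an implicit free list allocator, following the book's macro style but written from memory, so treat the names and WSIZE choice as illustrative:

```c
#include <stdio.h>

#define WSIZE 4                     /* word / header / footer size in bytes */

/* Pack a block size and an allocated bit into one header/footer word. */
#define PACK(size, alloc)   ((size) | (alloc))

/* Read and write a word at address p. */
#define GET(p)              (*(unsigned int *)(p))
#define PUT(p, val)         (*(unsigned int *)(p) = (val))

/* Extract the size and allocated bit from a header/footer at p. */
#define GET_SIZE(p)         (GET(p) & ~0x7)
#define GET_ALLOC(p)        (GET(p) & 0x1)

/* Given a block pointer bp (first payload byte), find its header and footer. */
#define HDRP(bp)            ((char *)(bp) - WSIZE)
#define FTRP(bp)            ((char *)(bp) + GET_SIZE(HDRP(bp)) - 2 * WSIZE)

/* Neighboring blocks in the implicit list. */
#define NEXT_BLKP(bp)       ((char *)(bp) + GET_SIZE(HDRP(bp)))
#define PREV_BLKP(bp)       ((char *)(bp) - GET_SIZE((char *)(bp) - 2 * WSIZE))

int main(void) {
    /* Toy heap: one 16-byte allocated block = header | 8-byte payload | footer. */
    unsigned int heap[6] = {0};
    char *bp = (char *)&heap[1];            /* block pointer = first payload byte */
    PUT(HDRP(bp), PACK(16, 1));             /* header: size 16, allocated */
    PUT(FTRP(bp), PACK(16, 1));             /* footer mirrors the header */
    printf("size=%u alloc=%u\n", GET_SIZE(HDRP(bp)), GET_ALLOC(HDRP(bp)));
    return 0;
}
```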
System-Level I/O
- I/O: copying data between main memory and external devices (disk drives, terminals, networks), all of which are modeled as files
- open/close file
- int open(filename, flags, mode); returns a file descriptor (fd); int close(fd); (see the sketch at the end of this chapter)
- every process created by a Unix shell starts with three open descriptors: stdin (0), stdout (1), stderr (2)
- read/write file
- read(fd,buf,size); write(fd,buf,size);
- RIO (Robust I/O) package handles short counts (fewer bytes transferred than requested, e.g. at EOF, when reading text lines, or due to network delays)
- unbuffered functions for binary data over networks; buffered functions for text lines
- an in-memory buffer improves efficiency: the program interacts with the buffer directly instead of issuing a system call for every small read/write, and the buffer exchanges data with the file one buffer-sized chunk at a time
- change file position
- file metadata: stat(filename,buf);
- open file data structure
- descriptor table (one per process) -> open file table (file position, reference count; shared by all processes) -> v-node table (file metadata; shared by all processes)
- I/O Redirection: int dup2(oldfd,newfd), copy descriptor table entry of oldfd to newfd
- C standard I/O and RIO are higher-level packages built on lower-level Unix I/O; for network sockets, use RIO rather than standard I/O
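A minimal sketch of the Unix I/O calls listed above: open a file, read it in chunks, and write the bytes to stdout ("input.txt" is just an illustrative filename):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("input.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        exit(1);
    }
    char buf[4096];
    ssize_t n;
    /* read may return a short count; loop until it returns 0 (EOF). */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        write(STDOUT_FILENO, buf, (size_t)n);
    if (n < 0)
        perror("read");
    close(fd);
    return 0;
}
```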
Network Programming
- client-server transaction: client process request server process, server process response client process
- TCP/IP Protocol Stack: Application Layer - Transport Layer - Internet Layer - Network Access Layer
- LAN: Bridged Ethernet(Host - Hub(copy every bit to every other port) - Bridge(selectively copy frame to dest))
- internet (lowercase): LANs connected by routers (which bridge otherwise incompatible LANs); the Global IP Internet is one particular implementation of an internet
- Reliable transfer: no lost, duplicated, or out-of-order data
- TCP (Transmission Control Protocol, reliable) / UDP (User Datagram Protocol, unreliable): process-to-process delivery
- Three-way Handshake: SYN -> SYN+ACK -> ACK
- IP(Internet Protocol): host to host(send datagrams, unreliable)
- MAC Address: 48 bit; IP Address: 32 bit(ipv6: 16 Bytes), dotted-decimal notation; Network Byte Order: big-endian
- unnamed root -> first-level domain name(com) -> second-level domain name(google) -> third-level domain name(www)
- Socket: end point of connection, IP address:port(16 bit); ephemeral port, well-known port
- Socket Interface
- Client: open_clientfd (socket - connect) - I/O - close; Server: open_listenfd (socket - bind - listen) - accept - I/O - close (see the client sketch at the end of this chapter)
- the client blocks in connect, the server blocks in accept
- the listening descriptor exists for the lifetime of the server (open_listenfd); a connected descriptor exists only for the duration of one connection (accept)
- World Wide Web is a service provided by Internet using HTTP(Hypertext Transfer Protocol)
- MIME: Multipurpose Internet Mail Extensions
- URI(Uniform Resource Identifier) = URL(locater) + URN(name)
- HTTP Request Line: method URI version; HTTP Response Line: version status code status message
- status code: 200(OK), 301(permanently redirect), 302(temporarily redirect), 403(Forbidden), 404(Not found)
- CGI(Common Gateway Interface): interface with executable programs on a server that generate web pages dynamically
- Proxy: bypass firewall, page translation, anonymizer, cache
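A minimal sketch of the client side of the socket flow above (socket - connect - I/O - close), roughly what the book's open_clientfd wrapper does; the host, port, and request are illustrative:

```c
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    struct addrinfo hints, *list, *p;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;    /* TCP */

    if (getaddrinfo("example.com", "80", &hints, &list) != 0) {
        fprintf(stderr, "getaddrinfo failed\n");
        exit(1);
    }

    int fd = -1;
    for (p = list; p; p = p->ai_next) { /* try each address until one connects */
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0)
            break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(list);
    if (fd < 0) {
        fprintf(stderr, "could not connect\n");
        exit(1);
    }

    const char *req = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
    write(fd, req, strlen(req));
    char buf[1024];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("%s\n", buf);            /* first chunk of the HTTP response */
    }
    close(fd);
    return 0;
}
```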
Concurrent Programming
- Processes
- parent forks a child to handle each new connection request, child closes listenfd, parent closes connfd
- Interprocess Communication(IPC)
- Pros: separate virtual address spaces keep processes from accidentally overwriting each other's state; Cons: sharing state is difficult, and IPC and context switching have high overhead
- I/O Multiplexing
- select function blocks until one of the descriptors in the read set is ready to read
- state machine: (input state, input event) -> output state
- Pros: more control, share state information; Cons: complexity
- Threads(hybrid of Process and I/O Multiplexing)
- Like processes, threads are scheduled automatically by the kernel and are known to the kernel by an integer ID. Like I/O multiplexing, multiple threads run in the context of a single process, and thus share the entire contents of the process virtual address space, including its code, data, heap, shared libraries, and open files. Each thread has its own thread context(thread ID(TID), stack, stack pointer, program counter, general-purpose registers, condition codes)
- main thread(the first thread)
- peer threads: thread can kill any of its peers or wait for any of its peers to terminate.
- POSIX Thread: pthread, pthread_t tid
- joinable thread (default): can be reaped or killed by other threads, and its memory resources are not freed until another thread reaps it; detached thread (usually preferred for servers): cannot be reaped or killed by other threads, and its resources are freed automatically by the system when it terminates
- Global variables: one instance in virtual address space and can be referenced by any thread
- Local automatic variables: each thread's stack contains its own instance; they can still be shared if a pointer to them is passed to another thread
- Local static variables: one instance in virtual address space and can be referenced by any thread
- Thread-safe: a function that always produces correct results when called concurrently from multiple threads, even while manipulating shared data structures
- Thread synchronization: coordinate threads to access critical section
- Semaphore: solution to synchronization, a nonnegative global variable s
- P(s): decrement s if s is nonzero, block if s is zero until V(s); V(s): increment s, restart one P(s) if s becomes nonzero
- P and V cannot be interrupted: atomic operation
- Mutual exclusion: initialize s with 1; P(s); Do(); V(s); only the thread holding the single unit of s (the mutex) can execute Do(), like a talking pillow (see the sketch at the end of this chapter)
- Producer-Consumer Problem: Producer: P(slot)-P(mutex)-insert-V(mutex)-V(item); Consumer: P(item)-P(mutex)-remove-V(mutex)-V(slot)
- Readers-Writers Problem: reader-favored(reader can read although writer is waiting); writer-favored(writer write as soon as possible after the previous readers finish)
- Reader: P(mutex); readcnt++; if (readcnt == 1) {P(w);} V(mutex); Read(); P(mutex); readcnt--; if (readcnt == 0) {V(w);} V(mutex); Writer: P(w); Write(); V(w);
- Parallel programs are often written so that each core runs exactly one thread(cat /proc/cpuinfo)
- Reentrant function (a subset of thread-safe functions): does not reference any shared data
- Deadlock: each thread is waiting for the other to do V that will never occur, because of wrong order of P and V
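A minimal sketch of mutual exclusion with a POSIX semaphore (the P/V pattern above): several threads increment a shared counter, and the semaphore keeps the increments from interleaving. Compile with -pthread; the thread count and iteration count are arbitrary.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long cnt = 0;        /* shared counter */
static sem_t mutex;         /* binary semaphore protecting cnt */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        sem_wait(&mutex);   /* P(mutex) */
        cnt++;              /* critical section */
        sem_post(&mutex);   /* V(mutex) */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    sem_init(&mutex, 0, 1);             /* initialize s with 1 */
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);     /* reap the joinable peer threads */
    printf("cnt = %ld (expected %d)\n", cnt, NTHREADS * NITERS);
    return 0;
}
```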