CS 202 Lecture 26 – pointers I

Lecture 26: pointers I

Goals

define pass by value and pass by reference
describe why passing structures and arrays by value is inefficient
use pointers to pass parameters by reference in C
use pointers to modify values
(begin to) use gdb to debug programs

consider this program: change-param.c

main creates a local variable called a and gives it the value 12

then it calls foo with the parameter a

foo changes the value of its parameter to 21 and returns

now back in main, we print out the value of a

question: will this print 12 or 21?

$ gcc -o change-param change-param.c 
$ ./change-param 
a is 12

okay, but why?

recall that main passes the parameters to foo in registers

what is one of the first things foo does? it copies the contents of those registers into foo’s stack frame

then, if we looked at the assembly, when foo assigns 21 to x, we’d see that it was altering something within foo’s stack frame (because it would be fp-8 or similar)

then foo returns

and then we hit the printf line and it prints out the value of a

and a is… where? main’s stack frame!

foo didn’t mess with anything in main’s stack frame, so a can’t have changed from 12

what if we want foo to be able to modify a? how might we achieve that?

in order to accomplish that, foo has to know where a is stored in memory

(remember that the compiler sets aside particular pieces of memory to hold particular values: all foo needs to know to modify a is its location in memory)

(not just that, but at the assembly level, everything is identified only by its memory address)

so instead of passing the value of a to foo, we really want to pass its location: the address in memory where it’s stored

how might we do that?

recall the & operator from a few lectures ago: &blargh will give us the address of blargh

how convenient!

so change line 14 to:

foo(&a);

unfortunately, the compiler is going to complain at us now

because, as it stands, foo expects to get an integer as its parameter

but we’re no longer sending it an integer

we’re sending it an address!

the thing stored at that address is an integer, mind you, but the thing being passed is the address itself

to reflect this, we have to change foo to take a parameter of type int *

void foo(int *x)

(in both the prototype on line 7 and in the definition on line 18)

this tells the compiler that foo accepts the address of an integer as its parameter

the type int * is sometimes pronounced as "int star" or "int pointer" or "pointer to an int"

"pointer" because an address "points to" a datum

this is good, and the compiler won’t complain (much), but it still doesn’t do what we want it to do

check out line 20:

x = 21;

recall that x is now a variable that holds an address

thus this line changes the address being stored in x to 21

where we instead want to go to that address and put 21 at that location

so we add a star here, too, to indicate that we don’t want the pointer, we want what’s being pointed to

*x = 21;

fully-modified program here: change-param-pointer.c

which performs as desired:

$ gcc -o change-param-pointer change-param-pointer.c 
$ ./change-param-pointer 
a is 21

in the original program, we passed the value of a to foo, and foo’s modification affected the value in its stack frame

in the modified version, we passed the address of a to foo, which is what allowed foo’s modifications to affect the value in main’s stack frame

unsurprisingly, there exist names for these two strategies

the former is called pass by value

and the latter is called pass by reference (because we’re passing, not the value, but a reference to it: directions on where to find it)

this is useful for more purposes than allowing callees to modify variables from the caller

recall structs and arrays

suppose we had a 100-field struct or a 100-element array

based on our understanding of the stack, if we wanted to pass one of those as a parameter, we would have to copy the entire 100-field struct or 100-element array to the stack

this takes both time (to perform the copy) and memory (to hold the copy)

more efficient would be to instead pass the address of the struct or array

here’s a simple program that demonstrates the & operator, which tells us the address of a variable: pointer.c

two variables, x and y

we print each, along with their locations

the float data type is new-ish: it’s one way to store non-integral values in C (yes, it uses the IEEE 754 format we saw a few weeks ago)

we also see a new format specifier for printf: "%.2f" says "print the corresponding parameter as a floating point number, with two digits to the right of the decimal point"

in the first call to printf, note that it pairs up the format specifiers with the arguments: the first specifier ("%d") is used to format the first parameter (x) and the second specifier ("%p") is used to format the second parameter (&x)

no big surprises here, just slight variations on things we’ve already seen

now for some more games: dereference.c

here we create a variable, x, to hold an integer and give it a value

then we declare another variable to contain the address of an integer

then we set that latter variable to hold the address of x

the following pair of printfs show the difference between the address and the value at that address

the operation of taking a pointer and finding the value stored at that pointer is called dereferencing

because it’s a reference and we’re looking up the value at that location

now for the tricky part:

*pointer_to_int = 3720;

recall that pointer_to_int contains an address

the * is the dereference operator

which means that this line of code is not messing with the address; it’s messing with what is stored at that address

which is why we would expect x to have a different value now

pointer_to_int has the value 0x7ffc3a7f3a14
pointer_to_int points to the value 42
x now has the value 3720

and so it does

note that we don’t need the variable name to mess with the value; we just need to have its address

we can see this from the output of the program, but we can also use a debugger to see what the program is doing as it does it

you’ve probably used a debugger already, integrated into your development environment

it’s the component that allows you to run the program, insert breakpoints, look at variable values, etc

the debugger we’ll use in this class is gdb, which is the standard debugger for C programs on UNIX systems

when setting out to run a program in the debugger, the first thing you have to do is compile it in a particular way: you must pass the -g option to the compiler, to cause it to put a bunch of extra information in the compiled program that helps the debugger do its job

$ gcc -g -o dereference dereference.c

(remember that the "-o dereference" part tells the compiler the name of the file in which to put the resulting machine code)

now, just run gdb with the name of the program to debug:

$ gdb dereference

it prints out a bunch of stuff we don’t care about right now, except the last line is useful:

Reading symbols from dereference...

this tells us that the program we’re debugging was indeed compiled with -g and we have the full collection of gdb features available to us

if we hadn’t compiled with -g, it would have said this:

(No debugging symbols found in dereference)

then it shows you a prompt to tell you it’s waiting for your command:

(gdb) _

digression: when you compile with -g, the extra information included is called "debugging symbols"

gdb can sometimes download debugging symbols for you, through a mechanism called debuginfod

every time you run it, gdb will ask you if you want to enable this feature, like so:

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n])

for our purposes, you can always say no

as the output later says, you can make this setting permanent (and therefore never be bothered with the question) by creating the file .gdbinit in your home directory and putting this line in it:

set debuginfod enabled off

(for whatever it’s worth, I haven’t bothered to do this for myself)

we can run the program using the surprisingly-named "run" command, which prints some other stuff (including, sometimes, the debuginfod prompt discussed above), but the salient parts are shown below:

(gdb) run
pointer_to_int has the value 0x7fffffffe39c
pointer_to_int points to the value 42
x now has the value 3720
[Inferior 1 (process 3006226) exited normally]

the first three lines are those printed by the program itself

the final line is gdb’s note that the program exited without error: yay

we can ask gdb to pause execution when a particular function starts by using the break command:

(gdb) break main
Breakpoint 1 at 0x555555555158: file dereference.c, line 8.

the hex value there is the address of the specific instruction it will stop at

now when we run it, the program doesn’t run to completion; instead, it pauses for us to poke around:

(gdb) run
Starting program: /home/pete/.../26-pointers-i/src/dereference
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Breakpoint 1, main (argc=1, argv=0x7fffffffe4c8) at dereference.c:8
8   {

gdb is now telling us that execution is paused at line 8 of dereference.c, which is the line containing the curly brace that begins the code for main

this is the line of code that will execute next: it hasn’t been done yet

but what code corresponds to "{" ?

we can get a concrete answer by disassembling (recall: translating from machine code to assembly language) the current function, which is done using the disas command:

(gdb) disas
Dump of assembler code for function main:
   0x0000555555555149 <+0>:     push   rbp
   0x000055555555514a <+1>:     mov    rbp,rsp
   0x000055555555514d <+4>:     sub    rsp,0x30
   0x0000555555555151 <+8>:     mov    DWORD PTR [rbp-0x24],edi
   0x0000555555555154 <+11>:    mov    QWORD PTR [rbp-0x30],rsi
=> 0x0000555555555158 <+15>:    mov    rax,QWORD PTR fs:0x28
   0x0000555555555161 <+24>:    mov    QWORD PTR [rbp-0x8],rax
   0x0000555555555165 <+28>:    xor    eax,eax
   0x0000555555555167 <+30>:    mov    DWORD PTR [rbp-0x14],0x2a
   0x000055555555516e <+37>:    lea    rax,[rbp-0x14]
   0x0000555555555172 <+41>:    mov    QWORD PTR [rbp-0x10],rax

which shows the individual machine code instructions and the address at which each begins

the "<+n>" part is the offset from the beginning of the function

ie, if we imagine that the machine code instructions for the entire function are a blob of bytes, "<+4>" says this instruction starts 4 bytes from the beginning of the entire blob

but these don’t look like the instructions we’ve been working with recently, because they are not arm32: they are x86-64

we can, however, see some similarities

recall from our investigation into functions and stack frames that the first few instructions set up the stack frame

that’s exactly what’s happening here

in arm32, the first instruction pushed the (caller’s) frame pointer onto the stack

here, we see the first instruction is

push rbp

which we might theorize is also pushing the (caller’s) frame pointer onto the stack, but in x86-64

and that theory would be correct

so we can further deduce that "rbp" is the name of the frame pointer in x86-64

in x86-64, register names start with "r" to indicate they are 64-bit registers

"bp" means "base pointer"—ie, the base (beginning) of the frame—but it has the same meaning as arm32’s fp

recall that the second arm32 instruction causes the frame pointer to point where the stack pointer currently points (because we’re about to move the stack pointer down to make room for the new stack frame)

and that’s exactly what this x86-64 instruction does, too:

mov rbp,rsp

if rbp is x86-64’s frame pointer, then rsp must be its stack pointer, and indeed this is the case

finally, we cause the stack pointer to point lower down, completing the construction of the new frame:

sub rsp,0x30

same idea as in arm32, except there is no separate destination operand: the x86-64 sub instruction stores the result back into its first operand, so the effect is the same as the use of arm32’s sub instruction in this context

back to the program, remember that we have one variable that is an integer, and another variable that is a pointer to an integer

so the actual values in the stack should be an int and an address

let’s see if we can find them

first, let’s cause the next few lines of the program to run, which will put meaningful stuff in memory:

int x = 42;
int *pointer_to_int;

pointer_to_int = &x;

the "next" command runs a single line of code:

(gdb) next
9       int x = 42;

this is gdb telling us that it has completed the assembly instructions that correspond to "{" and is now ready to set x to 42

if we ask gdb the value of x right now, we get a weird result:

(gdb) print x
$1 = 32767

we never gave x a value, so how can it think its value is 32767? no answer now, we’ll get to that in a future lecture

but if we let this line run (by telling gdb "next") and then print, we have hope

(gdb) next
12      pointer_to_int = &x;
(gdb) print x
$2 = 42

glorious

out of curiosity, let us look at the value of pointer_to_int after the assignment

(gdb) next
13      printf("pointer_to_int has the value %p\n", pointer_to_int);
(gdb) print pointer_to_int
$3 = (int *) 0x7fffffffe39c

note that gdb formats it differently, because it knows it’s a pointer

we can also infer, from our understanding of pointers, that if we look in memory at address 0x7fffffffe39c, we should find the value of x

(gdb) print *0x7fffffffe39c
$4 = 42

more glorious

the print command is usually only used to print single values, which is sometimes limiting

we use the x (examine) command to look at memory more broadly

the x command needs to be told how much memory to examine and in what format to present it

remember that rsp (the stack pointer) points to the bottom of the stack (and hence also the bottom of this stack frame)

so if we start at the address given by rsp and work our way to greater addresses, printing out the data at those addresses as we go, we should see the entire stack frame

let’s print out 10 values, formatted as hex, starting at the address given by $rsp

(gdb) x/10x $rsp
0x7fffffffe380:    0xffffe4c8    0x00007fff    0x00000000    0x00000001
0x7fffffffe390:    0x00000000    0x00000000    0xf7fe5040    0x0000002a
0x7fffffffe3a0:    0xffffe39c    0x00007fff

the left column is the address and the right four columns are the data

so, from the first line, the 4-byte value 0xffffe4c8 lives at address 0x7fffffffe380

the 4-byte value 0x00007fff lives 4 addresses later, at 0x7fffffffe384

the 4-byte value 0x00000000 lives 4 addresses after that, at 0x7fffffffe388

and finally the 4-byte value 0x00000001 lives at address 0x7fffffffe38c

we know where the variable x is stored because we printed the value of pointer_to_int above: 0x7fffffffe39c

and indeed if we look at the end of the second line (which is the 4-byte value stored at address 0x7fffffffe39c), we see 0x0000002a, which is the hex representation of the integer 42

we also see, in the final line, two 4-byte values 0xffffe39c and 0x00007fff, which together are the 64-bit (8-byte) value of pointer_to_int