CS 202 Lecture 22 – arrays and structs in C

pete > courses > CS 202 Spring 24 > Lecture 22: arrays and structs in C

Goals

define compound data type
use a struct to define records in C
identify where struct components are stored in memory
define an array in C and access its elements
identify where array elements are stored in memory

topic for today is compound data types

that is, data types that are made up of other data types

we’ve seen lots of ints so far in our simple C programs

there are other primitive data types (meaning they cannot be further broken down—ie, the opposite of compound data types) like char and float

but C also allows you to declare arrays and a record-like thing called a struct

arrays are compound data types because they hold a bunch of values

objects, like you’ve seen in Java or Python, are similar to structs in C, in that they allow you to gather together a bunch of fields into one "thing"

C structs do not, however, support methods (more on this later)

let’s first look at structs

they’re C’s version of what are called records: a record is a collection of fields, each of a possibly different type, all gathered under a single name/identifier

a classic example of this is a record that describes a point in 2D space, and is thus itself composed of an x and a y coordinate

here’s a simple C program (and associated ARM32 assembly) that demonstrates declaring such a thing and accessing its fields

struct.c & struct.s

starting at the top, we have the struct definition:

struct point {
    int x, y;
};

here we’re telling the compiler what a "struct point" is

so that, later on, if the program wants to make a "struct point" or look inside one, it knows what a "struct point" looks like and how much memory it requires

note that these three lines do not cause any assembly instructions to be generated

it’s just a message to the compiler

then, inside main(), we declare a variable named p1:

struct point p1;

this line tells the compiler we’re going to be using a variable with type "struct point" and we’re going to refer to it using the name p1

note that this, too, results in no assembly instructions!

at least not directly: the compiler now needs to make room for p1 but those effects are implicit in the function-calling boilerplate instructions we’re ignoring for a few lectures yet

finally we modify the constituent fields of p1:

p1.x = 12;
p1.y = 21;

we access the fields of p1 using the dot-notation you’ve seen in both Python and Java

now, to the assembly

the relevant lines are these:

mov r3, #12
str r3, [fp, #-12]
mov r3, #21
str r3, [fp, #-8]

deduction tells us that p1.x is stored at fp-12 and p1.y is stored at fp-8

the interesting thing to note is that the assembly program doesn’t seem to know anything about the fact that these two values are related

but this is consistent with our observation that, eg, the names of variables don’t show up in assembly

if variable names don’t show up, why should struct names? (no reason, hence they don’t)

so how does the compiler accomplish this black magic?

when it reads the struct definition, it remembers its name and how much memory is required for a single value of that type (in this case, each integer takes 4 bytes, so the whole thing takes 8 bytes)

then, when it hits the struct point p1 part, it sets aside 8 bytes of memory for p1 and remembers the address it allocated

then, when we later refer to p1.x, the compiler recalls the address it set aside for p1 and, due to the struct definition, knows where x exists inside it, and thus generates the address fp-12 for it

same deal with p1.y

now on to arrays

they’re going to look very familiar if you’re used to Java

array.c & array.s

this line

int a[10];

declares an array called a

that contains 10 things

each of which is an int

(unlike Python but mostly like Java, an array can only hold values of one type—you can’t have ints, chars, and floats in the same array)

then we access it using the same notation as in Python and Java as well

we can put stuff into the array

a[0] = 0;

and read stuff out

a[9] = a[0] + 9;

note that the initial item in the array is at index 0 and the final item is at index n-1, for an n-item array

to the assembly

this is also fairly straightforward

; a[0] = 0;
mov r3, #0
str r3, [fp, #-44]

; a[9] = a[0] + 9
ldr r3, [fp, #-44]
add r3, r3, #9
str r3, [fp, #-8]

(now that things are getting a bit more complicated, I’ve put the line of C that corresponds to the assembly instructions in a comment above)

this doesn’t look too different from the struct, does it?

the only difference is that the offset (in this case, -44) is much more negative

why is it so negative?

well, we have a[0] stored at fp-44 and a[9] stored at fp-8

in between are spaces for a[1] through a[8]

recall that ints are 4 bytes each

the zeroth is stored in four consecutive bytes of memory: fp-44, fp-43, fp-42, and fp-41

the first is stored in the next four bytes, etc, etc

the last (ninth) is stored at fp-8, fp-7, fp-6, and fp-5

note that the compiler sets aside space for all the intervening elements even though they are never used

we often use loops to access elements of an array, so it will be instructive to see the assembly involved in that

array-for.c & array-for.s

nothing exciting in the C: it’s just a combination of the array notation and for-loop we’ve already seen

the assembly gets a bit convoluted, though, so let’s step through it instruction by instruction

first, the initialization:

; the "i=0" part of the for-loop
mov r3, #0
str r3, [fp, #-8]       ; we now know that i is stored at fp-8
b   .L2                 ; branch to the comparison

then the comparison:

.L2:
    ; the "i<10" part of the for-loop
    ldr r3, [fp, #-8]
    cmp r3, #9
    ble .L3                 ; branch if i <= 9

there is something interesting here: the C uses "<10" but the assembly uses "<=9"

why might this be?

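recall that we have a finite number of bits for the immediate value in the compare instruction (12, to be precise)

in ARM32 those 12 bits encode an 8-bit value plus a 4-bit rotation, so only some constants can be written as an immediate

for example, 257 (0x101) cannot be encoded, so we couldn’t directly compare "<257"

but we can if we use the equivalent "<=256" (remember: all integers), since 256 (0x100) can be encoded

here both 9 and 10 fit, so the compiler presumably just prefers one canonical form of the comparison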

now the meat of the loop

I’m going to annotate each instruction with its effects and stop midway through

.L3:
    ldr r3, [fp, #-8]       ; r3 <- i
    lsl r3, r3, #2          ; r3 <- i * 4
    sub r2, fp, #4          ; r2 <- fp - 4
    add r3, r2, r3          ; r3 <- fp - 4 + i * 4
    ldr r2, [fp, #-8]       ; r2 <- i
    str r2, [r3, #-44]      ; store i @ address [fp - 4 + i * 4 - 44]

first thing to note is that we’re not using fp as the base register for the store: we’re using r3 instead

what’s up with the value in r3, though?

let me rearrange the math somewhat:

fp - 44 + i * 4 - 4

the fp - 44 bit looks familiar from our previous array program: it’s likely the address of a[0]

recall also that the array elements are stored sequentially, so a[1] is stored 4 bytes after a[0], a[2] is stored 4 bytes after that, and so on

thus a[i] is stored i * 4 bytes after a[0], hence the i * 4 part of the arithmetic above

but what about the -4?

we have to make room for i itself!

recall that, in our previous program, the last element of the array was stored at fp-8

in this program, though, that address now contains i, so every element in the array needs to move down by 4 bytes

hence we subtract 4 from the address

the rest is straightforward:

; the "i++" part of the for-loop
ldr r3, [fp, #-8]       ; load i
add r3, r3, #1          ; increment i
str r3, [fp, #-8]       ; store i