However, since a great deal of our work with chip companies and RTOS/tool vendors deals directly with the processor, we often tackle the very issues that the compiler handles for you. This page provides some information we thought would be helpful to those exploring the PowerPC architecture. Whether you're fairly new to PowerPC programming or just curious about some of the nuances of the PowerPC architecture, this page might be of interest to you. (Note: there are many PowerPC controllers and processors, each with its own peripherals, SPRs, etc. The best source of information for the processor you're interested in is documentation from Motorola or IBM. This page is intended to help introduce you to PowerPC programming and architecture.)
Register Model
All PowerPC processors have at least two sets of registers: General-Purpose Registers (GPRs) and Special-Purpose Registers (SPRs). Some PowerPC processors have an additional third set of registers, the Floating Point Registers (FPRs). This section gives a brief overview of the GPRs and SPRs you are likely to use when developing PowerPC code.
General-Purpose Registers (GPRs)
The 32 GPRs (GPR0-GPR31) are used for integer operations, load/store transactions with memory, and so on. The GPRs are either 32 or 64 bits wide, depending on the processor's implementation (most PowerPC CPUs used in embedded applications use 32-bit architectures). Most assembly code refers to the registers as r0, r1, ... or %r0, %r1, ... (as opposed to GPR0, GPR1, etc.).
In order to standardize the way GPRs are used by software, conventions have been adopted that assign a special role to certain GPRs. This helps tools and software work well together. By convention, in most embedded applications:
Note: The programmer should consult the cross compiler's technical documentation to get the exact register usage conventions of the compiler. The documentation will also explain calling conventions, such as which registers are used to pass arguments and return values, which registers are preserved (saved) across function calls, etc. For more information on register usage conventions adopted for embedded systems, consult the PowerPC Embedded Applications Binary Interface (EABI).
Special-Purpose Registers (SPRs)
Each PowerPC processor defines its own set of SPRs, although many of the SPRs are defined across all implementations. SPRs provide control over the processor's MMU, cache, debugging capabilities, exception handling, etc. Some of the more esoteric features of the processor are also controlled by SPRs. For example, on the MPC750, SPRs are used for performance monitoring and power/thermal management. Each SPR is identified by a number (SPR1, SPR8, etc.). SPRs also have a specific name that indicates the SPR's purpose. Most of the SPRs are accessible only while the processor is in supervisor mode. Some SPRs that are defined across the PowerPC family include:
Other Registers: the Condition Register and the Machine State Register
The Condition Register (CR) holds the results from integer, floating point and comparison operations.
Some mnemonics reference a specific bit number 0-31 in the CR, rather than a field in the CR. For example (refer to the pictures above):
For example, suppose:
    cmp cr3, r20, r21 # compare r20 & r21, store results in cr3
The Machine State Register (MSR) is perhaps the most important PowerPC register of all, and it exists across all PowerPC variations. The MSR is the main control register for the processor. Among other things, it controls:
Notably absent is a dedicated, accessible program counter (PC) register. Some users might be surprised by this. However, in general, software does not need to know "where" the program is executing from. When the program flow changes, the return address can be saved automatically by the CPU. For example, branches have the option of saving the return address in the link register; exceptions store the address at which execution should resume after the exception handler in SRR0 (see above, SPR26). The point is, there really is no need for a software-visible PC on the PowerPC.
This introduction to the PowerPC register set only begins to scratch the surface - for complete information, consult the technical documentation on the PowerPC programming environments or the specific PowerPC processor you're using.
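The CR bit numbering mentioned above can be pictured with a few lines of Python. This is only an illustrative model, not PowerPC code: the helper names are hypothetical, and it assumes the usual layout in which each 4-bit field cr0-cr7 holds LT, GT, EQ and SO in that order, with bit 0 being the most significant bit of the CR.

```python
# Illustrative model only: maps a CR field and condition name to the
# CR bit number (0-31) that bit-oriented mnemonics refer to.
# Assumes fields cr0..cr7 are 4 bits each, ordered LT, GT, EQ, SO.
CR_BIT_OFFSETS = {"lt": 0, "gt": 1, "eq": 2, "so": 3}

def cr_bit_number(field, condition):
    """Return the CR bit number for `condition` in field `field` (0-7)."""
    if not 0 <= field <= 7:
        raise ValueError("CR field must be in the range 0-7")
    return 4 * field + CR_BIT_OFFSETS[condition]

def compare_signed(a, b):
    """Return the 4-bit field value of a signed compare (SO assumed 0)."""
    if a < b:
        return 0b1000  # LT set
    if a > b:
        return 0b0100  # GT set
    return 0b0010      # EQ set

# After a compare that targets cr3, the "equal" result is CR bit 4*3 + 2.
print(cr_bit_number(3, "eq"))
```

Under these assumptions, a mnemonic that names a bit number rather than a field would use 14 to test the "equal" result of the compare into cr3 shown earlier.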
Addressing Modes
The PowerPC supports the following basic addressing modes:
The examples below give an overview of this.
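As a rough sketch (not the page's original examples), the effective-address arithmetic can be modelled in Python. The function names are hypothetical. The model assumes 32-bit addresses, treats register number 0 in the base-register position as the value 0, and rejects update forms whose base register is r0 or the same as the destination.

```python
# Toy model of PowerPC effective-address (EA) calculation.
# gpr is a list of 32 register values; addresses wrap at 32 bits.
MASK32 = 0xFFFFFFFF

def base_value(gpr, ra):
    # In the base-register position, register number 0 means the value 0,
    # not the contents of GPR0.
    return 0 if ra == 0 else gpr[ra]

def ea_immediate(gpr, ra, d):
    """Register indirect with immediate index, e.g. lwz rD, d(rA)."""
    if not -32768 <= d <= 32767:
        raise ValueError("displacement must fit in a signed 16-bit field")
    return (base_value(gpr, ra) + d) & MASK32

def ea_indexed(gpr, ra, rb):
    """Register indirect with index, e.g. lwzx rD, rA, rB."""
    # rB always supplies register contents, even when rB is r0.
    return (base_value(gpr, ra) + gpr[rb]) & MASK32

def load_with_update(gpr, memory, rd, ra, d):
    """Model of lwzu rD, d(rA): access memory at EA, then write EA to rA."""
    if ra == 0 or ra == rd:
        raise ValueError("invalid form: rA must not be r0 or equal to rD")
    ea = ea_immediate(gpr, ra, d)
    gpr[rd] = memory.get(ea, 0)   # the access uses the EA first
    gpr[ra] = ea                  # then the base register is updated
    return ea

gpr = [0] * 32
gpr[0], gpr[3], gpr[5], gpr[12] = 100, 300, 500, 1200
print(ea_immediate(gpr, 3, 8))    # base is the contents of r3
print(ea_immediate(gpr, 0, 8))    # base is the value 0, GPR0 is not used
print(ea_indexed(gpr, 3, 0))      # index is the contents of GPR0
```

The 16-bit range check in `ea_immediate` is why a displacement form can only reach about 32 kbytes on either side of the base address; anything further needs the indexed form.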
Note how in the two Update modes, the effective address is calculated and used before the register value is updated with the effective address. Also notice that the register indirect with immediate index addressing modes can only access addresses within +/-32 kbytes of the base address due to the 16-bit signed offset; register indirect with index addressing must be used to reach beyond this range.
One other note: in certain contexts, r0 actually evaluates to the value 0, rather than the contents of GPR0. This is clearly spelled out in PowerPC documentation, but an example never hurts:
    # Let r0 = 100 ; r3 = 300; r5 = 500, r12 = 1200
Stack Support
As with many RISC processors, there is no explicit hardware support for a stack on the PowerPC. In other words, there are no dedicated stack pointer registers in hardware, and there are no instructions or addressing modes that implicitly reference or use a stack pointer.
Those familiar with the 68k family may recall that a subroutine call instruction (e.g. jsr) pushes the return address on the stack as part of its microcoded processing. Similarly, the return from subroutine instruction (rts) pops the return address off the stack. Also on the 68k, exceptions cause certain registers and other information to be pushed onto the stack by processor microcode prior to invoking the exception handler. Some versions of the 68k processor implement as many as 3 different stack pointer registers: the User Stack Pointer (USP), the Supervisor Stack Pointer (SSP), and the Interrupt Stack Pointer (ISP). Needless to say, the 68k processors are certainly "aware" of a stack!
On the PowerPC, there is no micro-coded pushing or popping of return addresses for subroutine calls, and no register/information stacking during exceptions. Instead, on-chip registers are used to hold this information, which eliminates slow accesses to memory.
For example, the LR is used to hold return addresses for subroutine calls, and SRR0/SRR1 are used (in addition to other SPRs) to hold exception-processing information. The lack of explicit support for a stack is not a disadvantage at all. In fact, most PowerPC systems will implement a stack. However, the processor never assumes there is a stack, and it doesn't need one to perform its duties such as subroutine calls and exception handling.
As mentioned above, by convention r1 is used as the stack pointer. Embedded applications typically maintain at least 8-byte stack alignment. With a typical CISC processor, the size of the stack will vary throughout the execution of a subroutine, as items such as parameters, return addresses, etc. are pushed and popped. In contrast, the PowerPC stack pointer is usually only adjusted at the beginning and the end of a subroutine, as it builds and then dismantles the stack frame. The subroutine prolog (beginning) will build a stack frame in a single instruction by:
Suppose a function prolog needs to build a 96-byte stack frame. The following instruction will create a stack frame of 96 bytes and store the previous version of the stack pointer at the top of the stack (the lowest memory address):
    stwu r1, -96(r1) # memory[r1-96] = r1 ; r1 = r1 - 96
Once the stack frame is created, locations in it are referenced with a positive offset, since r1 always points to the lowest address in the stack frame. Registers are used to pass parameters between routines wherever possible, and use of the stack for local variables is avoided unless absolutely necessary.
The function epilog needs to dismantle the stack frame before exiting the function. The following instruction will dismantle the stack frame:
    addi r1, r1, 96 # r1 = r1 + 96 (step over stack frame)
For more information on the stack layout, calling conventions and restrictions for embedded systems, consult the System V Application Binary Interface PowerPC Supplement and the PowerPC Embedded Applications Binary Interface (EABI).
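The prolog and epilog above can be traced with a small Python model. It is a sketch under stated assumptions: memory is a plain dictionary, the initial stack pointer value 0x10000 and the offset-8 local slot are made up for illustration, and the function names are hypothetical.

```python
# Toy model of the prolog/epilog described above. `memory` maps addresses
# to word values; gpr[1] plays the role of the stack pointer.
def stwu_r1(gpr, memory, frame_size):
    """Model of stwu r1, -frame_size(r1)."""
    if frame_size % 8 != 0:
        raise ValueError("frame size should keep the stack 8-byte aligned")
    ea = gpr[1] - frame_size
    memory[ea] = gpr[1]   # store the old stack pointer at the new top of stack
    gpr[1] = ea           # then update r1 with the effective address

def pop_frame(gpr, frame_size):
    """Model of addi r1, r1, frame_size."""
    gpr[1] += frame_size

gpr = [0] * 32
memory = {}
gpr[1] = 0x10000                 # hypothetical initial stack pointer
stwu_r1(gpr, memory, 96)         # prolog: build a 96-byte frame
frame_base = gpr[1]
memory[frame_base + 8] = 1234    # a local slot at a positive offset from r1
pop_frame(gpr, 96)               # epilog: step back over the frame
```

After the prolog, the word at the new r1 holds the caller's stack pointer, so the frames form a chain that a debugger could walk; after the epilog, r1 is back at its original value.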
Decrementer / Time Base
Most PowerPC implementations provide a 64-bit register called the Time Base Register (TBR). User-level software can only read the register; supervisor (privileged) software can read or write it. 32-bit CPUs access the register as two separate 32-bit registers. The frequency at which the counter is incremented is implementation-dependent, but is usually related to the processor clock.
PowerPC also provides a 32-bit Decrementer Register (DEC), which counts down at the same frequency as the TBR counts up. The DEC generates an interrupt each time it "underflows", i.e. when the most significant bit transitions from a 0 to a 1. DEC interrupts and external interrupts are enabled/disabled by the same bit in the processor's machine state register (MSR[EE]). However, the external interrupt and the DEC interrupt have different vectors, so if DEC is not needed/used in a system implementing external interrupts, the handler at the DEC exception vector can simply return right away.
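Two details in this section are easy to get wrong: what "underflow" means for the DEC, and how to read a 64-bit counter through two 32-bit halves while it keeps counting. The Python sketch below models both. The function names are hypothetical, and the `read_upper`/`read_lower` callables stand in for whatever instructions read the two halves on a given processor. The retry loop follows the usual read-upper, read-lower, re-read-upper pattern.

```python
MASK32 = 0xFFFFFFFF

def dec_tick(dec):
    """Decrement a 32-bit DEC value once.

    Returns (new_value, interrupt), where interrupt is True when the most
    significant bit changes from 0 to 1.
    """
    new = (dec - 1) & MASK32
    interrupt = (dec >> 31) == 0 and (new >> 31) == 1
    return new, interrupt

def read_timebase(read_upper, read_lower):
    """Read a 64-bit time base through two 32-bit reads.

    Re-reads the upper half and retries if it changed, so a carry out of
    the lower half between the two reads cannot produce a torn value.
    """
    while True:
        upper = read_upper()
        lower = read_lower()
        if read_upper() == upper:
            return (upper << 32) | lower

print(dec_tick(1))   # counting down to zero does not raise the interrupt
print(dec_tick(0))   # the next tick wraps and sets the top bit
```

In this model a DEC loaded with N raises its interrupt after N + 1 ticks, which matters when computing reload values for a periodic tick.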
Branching
The primary means to change program flow is branching. (The other ways are the System Call and Trap instructions, both of which cause exceptions.) The target address of a branch can be specified as any one of the following:
Branch Prediction
Branching and the pipeline
Branches disrupt the sequential instruction execution of the processor. Suppose the processor fetches a conditional branch instruction. Depending on the outcome of the preceding instructions, the branch may or may not be taken. In most cases, the processor cannot determine whether or not the branch will be taken at the time the conditional branch instruction is fetched.
Ideally, after the processor fetched a conditional branch instruction, it would then begin fetching from two instruction paths in parallel. One path would continue to fetch instructions immediately following the branch (assuming the branch is not taken); the other path would fetch instructions starting at the branch target address (assuming the target address could be determined). That way, whether or not the branch was taken, the pipeline would be filled with the correct instructions. Once the processor resolves the branch, one of the two prefetched paths could be used, and the other could be discarded, without any penalty in performance. (For several reasons, this approach is not a practical solution.)
This explanation of branching and the pipeline is very superficial, for the sake of illustration. Without getting too much into microprocessor architecture and design, the point is, anything that can keep the processor fetching instructions from the correct path is a good thing. So now what?
For example, take a typical 'C' language "for" loop. Say the loop will iterate 100 times. Most of the time at the end of a loop iteration, the processor will branch backwards to the beginning of the loop to perform the next iteration. It makes sense to tell the processor, "it's very likely the condition i==100 will not be true, so assume you're going to branch back to the beginning of the loop." A compiler can generate code that will give the processor a hint about the most likely outcome of the branch. (So can an assembly programmer with a special form of the branch instruction.) When a branch hint is provided at compile/assemble time, this is called static branch prediction. Compilers examine code around conditional branches, determine what is likely, and provide the hint that is most likely to be correct.
An even more complex and powerful form of branch prediction, dynamic branch prediction, requires hardware support, and is supported on high-end PowerPC (and other) processors. With dynamic branch prediction, the microprocessor hardware tracks the outcome of conditional branches at run-time, and uses the data to predict the outcome of future branches.
Powerful Conditional Branching
The PowerPC family also provides a powerful mechanism for conditional branching. In addition to the Link Register, two other registers are commonly used by the processor's Branch Processor Unit (BPU): the Counter Register (CTR) and the Condition Register (CR). As mentioned before, the CTR is often used by software to hold a loop iteration counter, and a CR field can hold the result of a comparison. In a single conditional branch instruction, the processor can:
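One way to picture how the CTR and a CR bit combine in a single branch is a small Python model of the branch decision. It is a sketch, not a reference: the names are hypothetical, it assumes the BO/BI operand layout of the classic 32-bit PowerPC conditional branch (BO is a 5-bit field, BI a CR bit number with 0 as the most significant bit), and it ignores the prediction hint bit.

```python
# Toy model of the decision made by a conditional branch (bc-style).
def branch_decision(bo, bi, cr, ctr):
    """Return (taken, new_ctr) for the given BO/BI, CR and CTR values."""
    ignore_cr = (bo >> 4) & 1      # 1: do not test the CR bit
    want_cr = (bo >> 3) & 1        # CR bit value that allows the branch
    keep_ctr = (bo >> 2) & 1       # 1: do not decrement or test CTR
    want_zero = (bo >> 1) & 1      # 1: branch if CTR == 0, 0: if CTR != 0
    if not keep_ctr:
        ctr = (ctr - 1) & 0xFFFFFFFF
    ctr_ok = bool(keep_ctr) or ((ctr == 0) == bool(want_zero))
    cr_bit = (cr >> (31 - bi)) & 1
    cond_ok = bool(ignore_cr) or (cr_bit == want_cr)
    return (ctr_ok and cond_ok), ctr

BO_DEC_BRANCH_NONZERO = 0b10000    # bdnz-style: decrement CTR, branch if CTR != 0
BO_BRANCH_IF_TRUE = 0b01100        # beq-style: branch if the chosen CR bit is 1

def run_counted_loop(iterations):
    """Count how often a loop body runs when closed by a bdnz-style branch."""
    ctr, body_runs = iterations, 0
    while True:
        body_runs += 1
        taken, ctr = branch_decision(BO_DEC_BRANCH_NONZERO, 0, 0, ctr)
        if not taken:
            return body_runs

print(run_counted_loop(100))
```

With the CTR preloaded with the iteration count, the loop-closing branch needs no separate compare instruction, which is the appeal of the "for" loop case described above.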
Semaphore Support
There is no "test and set" or "swap" instruction that loads and stores memory in a single indivisible instruction. Instead, PowerPC provides instructions that "load word and reserve indexed" (lwarx) and "store word conditional indexed" (stwcx.) on memory addresses. These instructions can be used together to ensure atomic access, even on a multiprocessor system. The two instructions provide the programmer with the capability to load a value from memory, execute other instructions if needed, and then conditionally store a value back to the same address. The processor will only perform the store to memory if no other stores have occurred to the same address in the meantime. In other words, the store is not guaranteed to succeed, but rather the indicated result of the store (succeeded/failed) is guaranteed to be correct. Code can use a looping construct that keeps trying the load/store pair until the conditional store indicates success.
Effectively, lwarx "reserves" the designated address for a later store to the same address. This reservation is contained inside the processor; each processor can hold exactly one reservation. If another device "sneaks in" and writes to the reserved address, the processor detects this and clears the reservation. The stwcx. only succeeds if the processor still has a reservation on the target address at the time of the store. The stwcx. instruction indicates the success/failure of the attempt by updating CR0[EQ]. Some of the ways the store can fail include:
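The reservation mechanism can be sketched with a toy Python model. It is deliberately simplified and all names are hypothetical: it models one processor, tracks the reservation as a single exact address (real implementations reserve a larger block of memory), and represents "another device" as an explicit call.

```python
class ReservationModel:
    """Toy single-processor model of lwarx/stwcx. on word-addressed memory."""

    def __init__(self):
        self.memory = {}
        self.reserved = None   # address of the one reservation, or None

    def lwarx(self, addr):
        """Load a word and reserve its address."""
        self.reserved = addr
        return self.memory.get(addr, 0)

    def other_store(self, addr, value):
        """A store by another device; clears a matching reservation."""
        self.memory[addr] = value
        if self.reserved == addr:
            self.reserved = None

    def stwcx(self, addr, value):
        """Conditional store; returns True (CR0[EQ] set) if it was performed."""
        ok = self.reserved == addr
        if ok:
            self.memory[addr] = value
        self.reserved = None   # the reservation is consumed either way
        return ok

def atomic_increment(cpu, addr, interfere=None):
    """Retry the lwarx/stwcx. pair until the conditional store succeeds."""
    attempts = 0
    while True:
        attempts += 1
        value = cpu.lwarx(addr)
        if interfere is not None and attempts == 1:
            interfere()        # simulate another writer sneaking in once
        if cpu.stwcx(addr, value + 1):
            return attempts

cpu = ReservationModel()
cpu.memory[0x100] = 5
atomic_increment(cpu, 0x100)   # succeeds on the first attempt
```

The retry loop is the important part: a failed conditional store loses nothing, because the code simply reloads the current value and tries again.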
Memory-mapped I/O
In order to increase throughput and efficiency, the PowerPC's accesses to memory may not always be in the exact order they appear in the program. PowerPC processors can implement a "buffer" in the system interface/bus interface that optimizes accesses to memory, possibly by re-arranging them. However, the processor will ensure that these re-arranged accesses do not affect the correct operation of the program. In other words, the processor can perform memory accesses in a different order than they appear in the code, as long as there is no data dependency.
Example: suppose a program is coded to perform the following 3 consecutive instructions:
    lwz r5, 1000(r0) # load r5 from memory[1000]
    lwz r6, 1040(r0) # load r6 from memory[1040]
    add r7, r5, r6 # r7 = r5 + r6
This all sounds fine until the subject of memory-mapped I/O enters the picture. For example, suppose in the example above, that locations 1000 and 1040 are memory-mapped locations in a peripheral device. Suppose the peripheral is designed so that any read access to address 1000 causes an update to a register at address 1040. We want to make sure the processor physically reads address 1000 before it reads address 1040! In this case it would be bad if the processor re-arranged the loading of r5 and r6.
Example: suppose we have a piece of code which updates a register on a peripheral device. The peripheral is designed so that in order to update a register, you first write the register number to peripheral memory address 1234 (aka PMEM[1234]), and then you write the new register value to PMEM[5678]. Suppose we want to load peripheral register number 35 with the value 99. The code would perform the following steps:
    addi r5, r0, 35 # r5 = 35
We can see that in order to ensure the proper peripheral register is loaded with the value in GPR6, the first write ("stw r5...") must happen before the second write ("stw r6..."). So what can we do to ensure the processor doesn't swap these 2 writes?
Typically, on PowerPC processors with MMUs, memory-mapped I/O regions are marked as cache-inhibited and guarded. This basically forces all I/O to bypass the cache and to be performed in the specified order. PowerPC also provides synchronization instructions (eieio, sync, and isync) to force instructions or memory transactions to complete before continuing. eieio (enforce in-order execution of I/O) forces all posted writes to complete prior to any subsequent writes. sync is a bit more intrusive - it forces all previous reads and writes to complete on the bus before executing any instructions after it. sync and isync are typically used when the I & D caches are being manipulated.
In our example above, we could insert an eieio between the writes:
    addi r5, r0, 35 # r5 = 35
As always, consult the user's manual for the specific processor you're using - I/O is one of the areas you want to be sure is working early on in a project!
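The effect of the barrier on the two peripheral writes can be illustrated with a toy Python model of a posted-write buffer. This is only a thought experiment with hypothetical names: real bus interfaces reorder according to implementation-specific rules, whereas the model simply reverses the writes inside each barrier-delimited group to show the worst case.

```python
class PostedWriteBuffer:
    """Toy bus-interface buffer that may reorder posted writes."""

    def __init__(self):
        self.groups = [[]]     # writes may be reordered only within a group

    def stw(self, addr, value):
        """Post a write; it has not reached the bus yet."""
        self.groups[-1].append((addr, value))

    def eieio(self):
        """Barrier: earlier writes must reach the bus before later ones."""
        self.groups.append([])

    def drain_worst_case(self):
        """Return the bus order with every group reversed (adversarial)."""
        order = []
        for group in self.groups:
            order.extend(reversed(group))
        self.groups = [[]]
        return order

REG_SELECT, REG_VALUE = 1234, 5678   # the two peripheral addresses from the text

def write_peripheral(buf, regnum, value, use_barrier):
    """Select a peripheral register, then write its new value."""
    buf.stw(REG_SELECT, regnum)
    if use_barrier:
        buf.eieio()
    buf.stw(REG_VALUE, value)
    return buf.drain_worst_case()

buf = PostedWriteBuffer()
print(write_peripheral(buf, 35, 99, use_barrier=False))
print(write_peripheral(buf, 35, 99, use_barrier=True))
```

Without the barrier the model lets the value write reach the peripheral before the register number has been selected; with it, the select write always comes first.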
MMU and cache
Most PowerPC processors have some sort of on-chip cache and MMU capability. Some of the PowerPC processors also support an external L2 cache. The PowerPC microcontrollers in particular have customized the implementations for their targeted markets (no cache, instruction cache only, instruction and data cache with MMU, etc.). You could write a small book on how to use the MMU and caches for any one of the processors implementing them. Much of our PowerPC work has been with the MPC8xx family, and we have assisted several of our 8xx customers with the cache and MMU. Therefore, the following notes are specific to the MPC8xx family.
More to come
This is only the tip of the iceberg. We didn't even discuss the "ins and outs" of the on-chip peripherals of the MPC8xx family, or the I/O functionality of the MPC5xx family. We will be continually adding to this page as time permits, concentrating on the MPC8xx and MPC5xx microcontrollers.