PowerPC Technical Tidbits

The applications-level programmer who doesn't write device drivers, interrupt service routines or assembly-language routines is usually insulated from the processor's inner-workings by the compiler.

However, since a great deal of our work with chip companies and RTOS/tool vendors deals directly with the processor, we often tackle the very issues that the compiler handles for you.

This page provides some information we thought would be helpful to those exploring the PowerPC architecture.  Whether you're fairly new to PowerPC programming or just curious about some of the nuances of the PowerPC architecture, this page might be of interest to you.

(Note: there are many PowerPC controllers and processors, each with its own peripherals, SPRs, etc.  The best source of information for the processor you're interested in is the documentation from Motorola or IBM.  This page is intended to help introduce you to PowerPC programming and architecture.)
 
Topics

  1. Register Model
  2. Addressing Modes
  3. Stack Support
  4. Decrementer / Time Base
  5. Branching
  6. Semaphore Support
  7. Memory-mapped I/O
  8. MMU and Cache

Register Model

All PowerPC processors have at least two sets of registers: General-Purpose Registers (GPRs) and Special-Purpose Registers (SPRs).  Some PowerPC processors have a third set of registers, the Floating Point Registers (FPRs).

This section will give a brief overview of the GPRs and SPRs you are likely to use when developing PowerPC code.

General-Purpose Registers (GPRs )

The 32 GPRs (GPR0-GPR31) are used for integer operations, load/store transactions with memory, etc.  The GPRs are either 32 or 64 bits wide, depending on the processor's implementation (most PowerPC CPUs used in embedded applications use 32-bit architectures).  Most assembly code refers to the registers as r0, r1, ... or %r0, %r1, ... (as opposed to GPR0, GPR1, etc.)

In order to standardize the way GPRs are used by software, conventions have been adopted that assign a special role for certain GPRs.  This facilitates tools and software that work well together.  By convention, in most embedded applications:

  • r1 is used as the stack pointer (explained in more detail a bit later)
  • r13 is used as the small data pointer
  • r2 is used as the small constant area pointer
r2 and r13 are typically initialized to point to frequently-used data. Since a 16-bit signed offset can be embedded in the instruction word, data within +/- 32kbytes of the pointers can be accessed in a single instruction. Most compilers allow the user to control which data is located in these "quick access" regions.

Note: The programmer should consult the cross compiler's technical documentation to get the exact register usage conventions of the compiler.  The documentation will also explain calling conventions, such as which registers are used to pass arguments and return values, which registers are preserved (saved) across function calls, etc...

For more information on register usage conventions adopted for embedded systems, consult the PowerPC Embedded Applications Binary Interface (EABI).

Special-Purpose Registers (SPRs)

Each PowerPC processor defines its own set of SPRs, although many of the SPRs are defined across all implementations.  SPRs provide control over the processor's MMU, cache, debugging capabilities, exception handling, etc....   Some of the more esoteric features of the processor are also controlled by SPRs.  For example, on the MPC750, SPRs are used for performance monitoring and power/thermal management.

Each SPR is identified by a number (SPR1, SPR8, etc....).  SPRs also have a specific name that indicates the SPR's purpose.  Most of the SPRs are accessible only while the processor is in supervisor mode.

Some SPRs that are defined across the PowerPC family include:
 
SPR | Name | Supervisor Mode Only? | Purpose
--- | --- | --- | ---
1 | Integer Exception Register (XER) | N | reflects the results of completed integer instructions (overflow, carry, etc.)
8 | Link Register (LR) | N | used to hold the return address of a branch
9 | Counter Register (CTR) | N | often used to hold a loop iteration counter; can also hold the target address of a branch
22 | Decrementer Register (DEC) | Y | 32-bit free-running countdown register; generates a maskable interrupt at underflow (i.e. when the value rolls backwards from 0x00000000 to 0xFFFFFFFF)
26 | Save/Restore Register 0 (SRR0) | Y | holds the "return address" to be used when the exception handler is completed
27 | Save/Restore Register 1 (SRR1) | Y | holds a copy of the Machine State Register (MSR) just prior to the exception
272-275 | General SPRs 0-3 (SPRG0-SPRG3) | Y | special "scratch" registers; typically used by the kernel so that GPRs don't need to be saved/restored
287 | Processor Version Register (PVR) | Y | provides the processor version (PowerPC 603, MPC750, MPC860, etc.) and silicon revision level
 

Other Registers: the Condition Register and the Machine State Register

The Condition Register (CR) holds the results from integer, floating point and comparison operations.
 
The CR is composed of 8 independent 4-bit fields named CR0-CR7, with CR0 occupying the most significant bits (bits 0-3) and CR7 the least significant (bits 28-31):
CR0 CR1 CR2 CR3 CR4 CR5 CR6 CR7
 
 
Each 4-bit CRx field represents relations such as Less Than (LT), Greater Than (GT), Equal (EQ), and Summary Overflow (SO).  Within a CRx field the bits appear in the following order, most significant first:
LT GT EQ SO
 
 
Some mnemonics reference a specific bit number 0-31 in the CR (bit 0 being the most significant bit), rather than a field in the CR.  For example (refer to the layouts above):

  • CR[0] = CR0[LT]
  • CR[31] = CR7[SO]
Comparison instructions are encoded to indicate which CRx field should be updated as a result of the comparison.  This allows successive comparisons to be performed without overwriting the results of the previous compare.  Compilers often exploit this capability.

For example, suppose:

  • r20=20,  r21=20,  r22=22  and  r23=23
The 2 lines of code would act as follows:
cmp cr3, r20, r21      # compare r20 & r21, store results in cr3
                       # CR3[LT]=0, CR3[GT]=0, CR3[EQ]=1

cmp cr4, r22, r23      # compare r22 & r23, store results in cr4
                       # CR4[LT]=1, CR4[GT]=0, CR4[EQ]=0
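
To make the bit numbering concrete, here is a small Python model of the two compares above. It is only a sketch: the function names are our own, the compare is treated as a signed integer compare, and the SO bit (which the hardware copies from the XER) is simply assumed to be 0.

```python
def compare_signed(a, b):
    """Return the 4-bit CRx value for a signed compare of a and b.

    Bit order within the field, most significant first: LT, GT, EQ, SO.
    SO is assumed to be 0 in this sketch.
    """
    lt = 1 if a < b else 0
    gt = 1 if a > b else 0
    eq = 1 if a == b else 0
    so = 0
    return (lt << 3) | (gt << 2) | (eq << 1) | so


def set_cr_field(cr, field, value):
    """Return a new 32-bit CR with field CR0..CR7 replaced by a 4-bit value."""
    shift = 4 * (7 - field)              # CR0 is the most significant nibble
    return (cr & ~(0xF << shift) & 0xFFFFFFFF) | (value << shift)


def cr_bit(cr, n):
    """Return CR[n], where bit 0 is the most significant bit of the CR."""
    return (cr >> (31 - n)) & 1


cr = 0
cr = set_cr_field(cr, 3, compare_signed(20, 20))   # cmp cr3, r20, r21
cr = set_cr_field(cr, 4, compare_signed(22, 23))   # cmp cr4, r22, r23
```

With this numbering, CR3[EQ] is CR[14] and CR4[LT] is CR[16]; the second compare leaves the CR3 result untouched.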

The Machine State Register (MSR) is perhaps the most important PowerPC register of all, and it exists across all PowerPC variations.  The MSR is the main control register for the processor.  Among other things, it controls:
  • the privilege level the processor is operating in (user/supervisor)
  • whether interrupts are enabled/disabled
  • whether instruction/data accesses can be translated through their MMUs
  • tracing capabilities on a per-instruction or per-branch instruction level
Where's the Program Counter?

Notably absent is a dedicated, software-accessible program counter (PC) register.  Some users might be surprised by this.  However, in general, software does not need to know "where" the program is executing from.  When the program flow changes, the return address can be saved automatically by the CPU.  For example, branches have the option of saving the return address in the link register; exceptions store the address at which execution should resume after the exception handler in SRR0 (see above, SPR26).  The point is, software really has no need for a visible PC on the PowerPC.

This introduction to the PowerPC register set only begins to scratch the surface  - for complete information, consult the technical documentation on the PowerPC programming environments or the specific PowerPC processor you're using.



Addressing Modes

The PowerPC supports the following basic addressing modes:

  • register indirect with immediate index, which adds a 16-bit signed offset to a base register to form the effective address
  • register indirect with index, which adds the contents of a base register to an index register to form the effective address
Each mode also supports an update option,  which causes the base register to be updated with the effective address after the load or store.

The examples below give an overview of this.
 
Addressing Mode | Load Example | Load Effect | Store Example | Store Effect
--- | --- | --- | --- | ---
register indirect with immediate index | lwz r3, 4(r1) | r3 = mem[r1+4] | stw r3, 4(r1) | mem[r1+4] = r3
register indirect with immediate index with Update | lwzu r3, 4(r1) | r3 = mem[r1+4]; r1 = r1 + 4 | stwu r3, 4(r1) | mem[r1+4] = r3; r1 = r1 + 4
register indirect with index | lwzx r3, r1, r2 | r3 = mem[r1+r2] | stwx r3, r1, r2 | mem[r1+r2] = r3
register indirect with index with Update | lwzux r3, r1, r2 | r3 = mem[r1+r2]; r1 = r1 + r2 | stwux r3, r1, r2 | mem[r1+r2] = r3; r1 = r1 + r2
 
Note how in the two Update modes, the effective address is calculated and used before the register value is updated with the effective address.

Also notice that the register indirect with immediate index addressing modes can only access addresses within +/-32 kbytes of the base address due to the 16-bit signed offset; register indirect with index addressing must be used to reach beyond this range.

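One other note: when r0 is specified as the base register of a load or store (or as the source register of addi/addis), it actually evaluates to the value 0, rather than the contents of GPR0.  In any other operand position, r0 behaves like an ordinary GPR.  This is clearly spelled out in PowerPC documentation, but an example never hurts: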

# Let r0 = 100 ; r3 = 300; r5 = 500, r12 = 1200

stwx r12, r3, r5   # memory[300+500] = 1200
stwx r12, r5, r3   # memory[500+300] = 1200

stwx r12, r5, r0   # memory[500+100] = 1200

stwx r12, r0, r5   # memory[0+500]   = 1200
                   # *NOT* memory[100+500] = 1200
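
The effective-address rules from this section can be summarized in a short Python model. This is a sketch under our own assumptions: the GPRs are a 32-entry list, memory is a dictionary keyed by address, and the function names are invented for illustration.

```python
MASK32 = 0xFFFFFFFF


def ea_immediate(gpr, ra, d):
    """EA for lwz/stw style accesses: (rA or 0) + signed 16-bit offset."""
    assert -0x8000 <= d <= 0x7FFF, "offset must fit in 16 signed bits"
    base = 0 if ra == 0 else gpr[ra]     # r0 as base means the value 0
    return (base + d) & MASK32


def ea_indexed(gpr, ra, rb):
    """EA for lwzx/stwx style accesses: (rA or 0) + rB."""
    base = 0 if ra == 0 else gpr[ra]     # only the base position is special
    return (base + gpr[rb]) & MASK32


def stwx(gpr, mem, rs, ra, rb):
    mem[ea_indexed(gpr, ra, rb)] = gpr[rs]


def lwzu(gpr, mem, rd, ra, d):
    """Load with update: access memory at EA, then copy EA into rA."""
    assert ra != 0 and ra != rd, "invalid form for a load with update"
    ea = ea_immediate(gpr, ra, d)
    gpr[rd] = mem[ea]
    gpr[ra] = ea


gpr = [0] * 32
gpr[0], gpr[3], gpr[5], gpr[12] = 100, 300, 500, 1200
mem = {}
stwx(gpr, mem, 12, 5, 0)     # r0 as index: memory[500+100] = 1200
stwx(gpr, mem, 12, 0, 5)     # r0 as base:  memory[0+500]   = 1200
```

After the two stores, the model's memory holds 1200 at addresses 600 and 500, matching the last two cases in the example above.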


Stack Support

As with many RISC processors, there is no explicit hardware support for a stack on the PowerPC.  In other words, there are no dedicated stack pointer registers in hardware, and there are no instructions or addressing modes that implicitly reference or use a stack pointer.

Those familiar with the 68k family may recall that a subroutine call instruction (e.g. jsr) pushes the return address on the stack as part of its microcoded processing.  Similarly, the return from subroutine instruction (rts) pops the return address off the stack.  Also on the 68k, exceptions cause certain registers and other information to be pushed onto the stack by processor microcode prior to invoking the exception handler.  Some versions of the 68k processor implement as many as 3 different stack pointer registers: the User Stack Pointer (USP), the Supervisor Stack Pointer (SSP), and the Interrupt Stack Pointer (ISP).  Needless to say, the 68k processors are certainly "aware" of a stack!

On the PowerPC, there is no micro-coded pushing or popping of return addresses for subroutine calls, no register/information stacking during exceptions.  Instead, on-chip registers are used to hold this information, which eliminates slow accesses to memory.  For example, the LR is used to hold return addresses for subroutine calls, and SRR0/SRR1 are used (in addition to other SPRs) to hold exception-processing information.

The lack of explicit support for a stack is not a disadvantage at all.  In fact, most PowerPC systems will implement a stack.  However, the processor never assumes there is a stack; it doesn't need one to perform its duties such as subroutine calls and exception handling.  As mentioned above, by convention r1 is used as the stack pointer.  Embedded applications typically maintain at least 8-byte stack alignment.

With a typical CISC processor, the size of the stack will vary throughout the execution of a subroutine, as items such as parameters, return addresses, etc... are pushed and popped. In contrast, the PowerPC stack pointer is usually only adjusted at the beginning and the end of a subroutine, as it builds and then dismantles the stack frame.

The subroutine prolog (beginning) will build a stack frame in a single instruction by:

  1. Calculating the effective address of the top of the new stack frame (r1 - stack_frame_size)
  2. Storing the previous value of r1 at the top of the new stack frame (mem[r1 - stack_frame_size] = r1)
  3. Adjusting r1 to point to the top of the new stack frame (r1 = r1- stack_frame_size)
The result: r1 always points to the top of the stack, and the top item on the stack is the address of the previous stack frame.

Suppose a function prolog needs to build a 96-byte stack frame.  The following instruction will create a stack frame of 96 bytes and store the previous value of the stack pointer at the top of the stack (the lowest memory address):

stwu r1, -96(r1)       # memory[r1-96] = r1 ; r1 = r1 - 96
Once the stack frame is created, locations in it are referenced with a positive offset, since r1 always points to the lowest address in the stack frame.

Registers are used to pass parameters between routines wherever possible, and use of the stack for local variables is avoided unless absolutely necessary.

The function epilog needs to dismantle the stack frame before exiting the function.  The following instruction will dismantle the stack frame:

addi r1, r1, 96       # r1 = r1 + 96 (step over stack frame)
For more information on the stack layout, calling conventions and restrictions for embedded systems, consult the System V Application Binary Interface PowerPC Supplement  and the PowerPC Embedded Applications Binary Interface (EABI).
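
As a rough illustration of the prolog and epilog described in this section, the following Python sketch models what the two instructions do to r1 and to memory. The helper names and the dictionary used for memory are our own assumptions.

```python
MASK32 = 0xFFFFFFFF


def prolog(gpr, mem, frame_size):
    """Model of 'stwu r1, -frame_size(r1)'."""
    assert frame_size % 8 == 0, "keep the 8-byte stack alignment"
    ea = (gpr[1] - frame_size) & MASK32  # top of the new stack frame
    mem[ea] = gpr[1]                     # save the previous stack pointer there
    gpr[1] = ea                          # r1 now points to the new frame


def epilog(gpr, frame_size):
    """Model of 'addi r1, r1, frame_size'."""
    gpr[1] = (gpr[1] + frame_size) & MASK32


gpr = [0] * 32
gpr[1] = 0x8000                          # arbitrary starting stack pointer
mem = {}
prolog(gpr, mem, 96)
frame_top = gpr[1]                       # 0x8000 - 96
saved_sp = mem[frame_top]                # address of the previous frame
epilog(gpr, 96)
```

Note that the word at the top of each frame holds the address of the previous frame, so a debugger can follow these saved values from frame to frame to walk the stack.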


Decrementer / Time Base

Most PowerPC implementations provide a 64-bit register called the Time Base Register (TBR).  User-level software can only read the register; supervisor (privileged) software can read or write it.  32-bit CPUs access the register as two separate 32-bit registers, the upper half (TBU) and the lower half (TBL).  The frequency at which the counter is incremented is implementation-dependent, but is usually related to the processor clock.

PowerPC also provides a 32-bit Decrementer Register (DEC), which counts down at the same frequency as the TBR counts up.  The DEC generates an interrupt each time it "underflows", i.e. when the most significant bit transitions from a 0 to a 1.

DEC interrupts and external interrupts are enabled/disabled by the same bit in the processor's machine state register (MSR[EE]).  However, the external interrupt and the DEC interrupt have different vectors, so if DEC is not needed/used in a system implementing external interrupts, the handler at the DEC exception vector can simply return right away.


Branching

The primary means to change program flow is branching.  (The other ways are the System Call and Trap instructions, both of which cause exceptions.)  The target address of a branch can be specified as any one of the following:

  • an absolute address embedded as part of the instruction
  • a relative offset from the current instruction address
  • an absolute address stored in either the Link Register (LR) or the Counter Register (CTR)
When a branch instruction is executed,  it has the option of copying the address of the instruction following the branch into the link register.  This is commonly used when branching to a subroutine to save the return address; upon completion of the subroutine, the code branches to the address in the link register, effectively returning from the subroutine.

Branch Prediction

Branching and the pipeline
In a pipelined architecture, the processor is fetching instructions ahead of the instruction being executed (this is called prefetching).  Programs and instructions tend to be sequential, so in most cases it makes sense for the processor to fetch instructions "one after another".

Branches disrupt the sequential instruction execution of the processor.  Suppose the processor fetches a conditional branch instruction.  Depending on the outcome of the preceding instructions, the branch may or may not be taken.  In most cases, the processor cannot determine whether or not the branch will be taken at the time the conditional branch instruction is fetched.

Ideally, after the processor fetched a conditional branch instruction, it would then begin fetching from 2 instruction paths in parallel.  One path would continue to fetch instructions immediately following the branch (assuming the branch is not taken); the other path would fetch instructions starting at the branch target address (assuming the target address could be determined.)

That way, whether or not the branch was taken, the pipeline would be filled with the correct instructions in either case.  Once the processor resolves the branch, one of the 2 prefetched paths could be used, and the other could be discarded, without any penalty in performance.  (For several reasons, this approach is not a practical solution.)

This explanation of branching and the pipeline is very superficial, for the sake of illustration.  Without getting too much into microprocessor architecture and design, the point is, anything that can keep the processor fetching instructions from the correct path is a good thing.

So now what?
To offset the pipeline hazards of conditional branching, most PowerPC processors implement some form of branch prediction.  Branch prediction gives the processor a "hint" as to the likelihood of a conditional branch being taken.

For example, take a typical 'C' language "for" loop.  Say the loop will iterate 100 times. Most of the time at the end of a loop iteration, the processor will branch backwards to the beginning of the loop to perform the next iteration.  It makes sense to tell the processor, "it's very likely the condition i==100 will not be true, so assume you're going to branch back to the beginning of the loop."

A compiler can generate code that will give the processor a hint about the most likely outcome of the branch.  (So can an assembly programmer with a special form of the branch instruction.)  When a branch hint is provided at compile/assemble time, this is called static branch prediction.  Compilers examine code around conditional branches, determine what is likely, and provide the hint that is most likely to be correct.

An even more complex and powerful form of branch prediction, dynamic branch prediction, requires hardware support, and is supported on high-end PowerPC (and other) processors.  With dynamic branch prediction, the microprocessor hardware tracks the outcome of conditional branches at run-time, and uses the data to predict the outcome of future branches.
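
To compare the two approaches on the 100-iteration loop described above, here is a Python sketch. The static predictor simply applies a fixed hint. For the dynamic predictor we assume a 2-bit saturating counter, a common textbook scheme; we are not claiming that any particular PowerPC processor uses exactly this mechanism, and all names are our own.

```python
class StaticHint:
    """Static prediction: the hint is fixed when the code is built."""

    def __init__(self, predict_taken):
        self.predict_taken = predict_taken

    def predict(self):
        return self.predict_taken

    def update(self, taken):
        pass                             # nothing is learned at run time


class TwoBitCounter:
    """Dynamic prediction with a 2-bit saturating counter (assumed scheme)."""

    def __init__(self, state=1):
        self.state = state               # 0-1 predict not taken, 2-3 taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Loop-closing branch of a 100-iteration loop: taken 99 times, then falls through.
loop_branch = [True] * 99 + [False]
```

In this model a correct static hint mispredicts only the final iteration, a wrong hint mispredicts 99 times, and the counter recovers from a poor starting state after a single miss.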

Powerful Conditional Branching

The PowerPC family also provides a powerful mechanism for conditional branching.  In addition to the Link Register, two other registers are commonly used by the processor's Branch Processor Unit (BPU): the Counter Register (CTR) and the Condition Register (CR).  As mentioned before, the CTR is often used by software to hold a loop iteration counter, and a CR field can hold the result of a comparison.

In a single conditional branch instruction, the processor can:

  • pre-decrement the CTR
  • compare the contents of CTR to 0
  • compare the contents of any bit (LT, GT, EQ, etc..) of any CRx field to 0 or 1
  • conditionally branch based on any combination of the results of the above 2 comparisons
  • optionally load the link register with the address of the instruction following the branch
Capabilities like this lead to assembly instructions like "bdnzflrl 13", which translates to:
 
b    branch
   IF
dnz    decremented CTR is not zero
   AND
f    CR[13] is false (i.e. CR3[GT] = 0)
lr    Target address for branch is in the link register
l    Save return address in link register
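
The decision made by such an instruction can be modelled in a few lines of Python. The sketch below follows the architecture's 5-bit BO (branch options) and BI (CR bit number) operands, with bit 0 as the most significant bit; the function name is our own, and the link register update and target selection are left out.

```python
def branch_conditional(bo, bi, cr, ctr):
    """Return (taken, new_ctr) for a conditional branch.

    bo is the 5-bit branch options field, bi selects CR bit 0-31.
    """
    bo0 = (bo >> 4) & 1      # 1: ignore the CR bit
    bo1 = (bo >> 3) & 1      # value the CR bit must have
    bo2 = (bo >> 2) & 1      # 1: do not decrement or test the CTR
    bo3 = (bo >> 1) & 1      # 1: branch if CTR == 0, 0: branch if CTR != 0
    if not bo2:
        ctr = (ctr - 1) & 0xFFFFFFFF     # pre-decrement the CTR
    ctr_ok = bool(bo2) or ((ctr != 0) != bool(bo3))
    cr_bit = (cr >> (31 - bi)) & 1
    cond_ok = bool(bo0) or (cr_bit == bo1)
    return (ctr_ok and cond_ok), ctr


BDNZF = 0b00000              # decrement CTR; branch if CTR != 0 and CR bit is 0

# bdnzflrl 13 with CR3[GT] = 0 and CTR = 5: branch taken, CTR becomes 4
taken, ctr = branch_conditional(BDNZF, 13, 0x00000000, 5)
```

The least significant BO bit is not examined by the model; it carries the static prediction hint discussed earlier.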
 

Semaphore Support

There is no "test and set" or "swap" instruction that loads and stores memory in a single indivisible operation.  Instead, PowerPC provides instructions that "load word and reserve indexed" (lwarx) and "store word conditional indexed" (stwcx.) on memory addresses.  These instructions can be used together to ensure atomic access, even on a multiprocessor system.

The 2 instructions provide the programmer with the capability to load a value from memory,  execute other instructions if needed, and then conditionally store a value back to the same address.  The processor will only perform the store to memory if no other stores have occurred to the same address in the meantime.  In other words, the store is not guaranteed to succeed, but rather the indicated result of the store (succeeded/failed) is guaranteed to be correct.  Code can use a looping construct that keeps trying the load/store pair until the conditional store indicates success.

Effectively, lwarx "reserves" the designated address for a later store to the same address.  This reservation is contained inside the processor; each processor can hold exactly one reservation.  If another device "sneaks in" and writes to the reserved address,  the processor detects this and clears the reservation.

The stwcx. only succeeds if the processor still has a reservation on the target address at the time of the store.  The stwcx. instruction indicates the success/failure of the attempt by updating CR0[EQ].  Some of the ways the store can fail include:

  • no lwarx was performed prior to the stwcx. (no address was previously reserved)
  • another device stored to the reserved address and cleared the reservation inside the processor (reservation "stolen")
  • a lwarx was performed to a different address than the stwcx. (unpredictable on storing, but always clears reservation)
These instructions can be used to implement semaphores, mutual exclusion, etc.  In a single-processor system, this solution is superior to masking interrupts because interrupts never need to be disabled to ensure an atomic access, so interrupt latency is unaffected.  And it is the only solution in a multi-processor system, since masking interrupts only protects against code running on the same processor, not against other processors accessing shared memory.
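
The reservation mechanism and the retry loop can be sketched in Python as follows. This is a toy model with invented class names: each Cpu holds one reservation, the Bus clears matching reservations held by other processors on every store, and a stwcx. to an address other than the reserved one is simply treated as a failure (real hardware leaves that case unpredictable).

```python
class Bus:
    """Shared memory that clears other processors' matching reservations."""

    def __init__(self):
        self.mem = {}
        self.cpus = []

    def store(self, writer, addr, value):
        self.mem[addr] = value
        for cpu in self.cpus:
            if cpu is not writer and cpu.reservation == addr:
                cpu.reservation = None   # reservation "stolen"


class Cpu:
    """One processor holding at most one reservation."""

    def __init__(self, bus):
        self.bus = bus
        self.reservation = None
        bus.cpus.append(self)

    def lwarx(self, addr):
        self.reservation = addr
        return self.bus.mem.get(addr, 0)

    def stwcx(self, addr, value):
        """Return True (CR0[EQ] = 1) if the store was performed."""
        ok = self.reservation == addr
        self.reservation = None          # always cleared by stwcx.
        if ok:
            self.bus.store(self, addr, value)
        return ok

    def stw(self, addr, value):
        self.bus.store(self, addr, value)


def atomic_increment(cpu, addr, interfere=None):
    """Retry lwarx/stwcx. until the conditional store succeeds."""
    attempts = 0
    while True:
        attempts += 1
        value = cpu.lwarx(addr)
        if interfere and attempts == 1:
            interfere()                  # another device writes in between
        if cpu.stwcx(addr, value + 1):
            return attempts
```

For example, if a second Cpu stores to the same address between the lwarx and the stwcx., the first attempt fails and the loop succeeds on the second attempt using the freshly loaded value.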


Memory-mapped I/O

In order to increase throughput and efficiency, the PowerPC's accesses to memory may not always be in the exact order they appear in the program.  PowerPC processors can implement a "buffer" in the system interface/bus interface that optimizes accesses to memory, possibly by re-arranging them.  However, the processor will ensure that these re-arranged accesses do not affect the correct operation of the program.  In other words, the processor can perform memory accesses in a different order than they appear in the code, as long as there is no data dependency.

Example: suppose a program is coded to perform the following 3 consecutive instructions:

    lwz  r5, 1000(r0)    # load r5 from memory[1000]
    lwz  r6, 1040(r0)    # load r6 from memory[1040]
    add  r7, r5, r6      # r7 = r5 + r6
The programmer (typically) doesn't care if r6 is loaded before r5, just so long as both are loaded from main memory before they are combined into a result that is stored in r7.  The processor may swap the order in which r5 and r6 are loaded from memory. However, in this sequence there is a data dependency (r7 is dependent on r5 and r6 being coherent, i.e. holding the expected values, prior to the addition.)  The processor recognizes this dependency and will not add r5 and r6 until both have been loaded from memory.

This all sounds fine until the subject of memory-mapped I/O enters the picture.  For example, suppose in the example above, that locations 1000 and 1040 are memory-mapped locations in a peripheral device.  Suppose the peripheral is designed so that any read access to address 1000 causes an update to a register at address 1040.  We want to make sure the processor physically reads address 1000 before it reads address 1040!  In this case it would be bad if the processor re-arranged the loading of r5 and r6.

Example: suppose we have a piece of code which updates a register on a peripheral device.  The peripheral is designed so that in order to update a register, you first write the register number to peripheral memory address 1234 (aka PMEM[1234]), and then you write the new register value to PMEM[5678].

Suppose we want to load peripheral register number 35 with the value 99. The code would perform the following steps:

addi r5, r0, 35      # r5 = 35
addi r6, r0, 99      # r6 = 99
stw  r5, 1234(r0)    # PMEM[1234] = 35; tell peripheral we're changing register 35
stw  r6, 5678(r0)    # PMEM[5678] = 99; tell peripheral new register value is 99
We can see that in order to ensure the proper peripheral register is loaded with the value in GPR6, the first write ("stw r5...") must happen before the second write ("stw r6..."). So what can we do to ensure the processor doesn't swap these 2 writes?

Typically, on PowerPC processors with MMUs, memory-mapped I/O regions are marked as cache-inhibited and guarded.  This basically forces all I/O to bypass the cache and to be performed in the specified order.

PowerPC also provides synchronization instructions (eieio, sync, and isync) to force instructions or memory transactions to complete before continuing.  eieio (enforce in-order execution of I/O) forces all posted writes to complete prior to any subsequent writes.  sync is a bit more intrusive - it forces all previous reads and writes to complete on the bus before executing any instructions after it.   sync and isync are typically used when the I & D caches are being manipulated.

In our example above, we could insert an eieio between the writes:

addi r5, r0, 35      # r5 = 35
addi r6, r0, 99      # r6 = 99
stw  r5, 1234(r0)    # PMEM[1234] = 35; we're changing register 35
eieio                # Make sure r5 is written before proceeding
stw  r6, 5678(r0)    # PMEM[5678] = 99; new register value is 99
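
The effect of the barrier can be illustrated with a deliberately pessimistic Python model. Here we assume the bus interface reverses every group of stores that is not separated by an eieio; real processors reorder in implementation-specific ways, so this is only a worst-case sketch. The Peripheral class and function names are our own.

```python
class Peripheral:
    """Toy device: select a register at 1234, write its value at 5678."""

    def __init__(self):
        self.selected = None
        self.registers = {}

    def write(self, addr, value):
        if addr == 1234:
            self.selected = value
        elif addr == 5678:
            self.registers[self.selected] = value


def worst_case_bus_order(program):
    """Reverse the stores between barriers; stores never cross an eieio.

    program is a list of ("stw", addr, value) and ("eieio",) tuples.
    """
    ordered, group = [], []
    for op in program:
        if op[0] == "eieio":
            ordered.extend(reversed(group))
            group = []
        else:
            group.append(op)
    ordered.extend(reversed(group))
    return ordered


def run(program):
    """Apply the program's stores to a fresh peripheral in bus order."""
    dev = Peripheral()
    for _, addr, value in worst_case_bus_order(program):
        dev.write(addr, value)
    return dev.registers


without_barrier = run([("stw", 1234, 35), ("stw", 5678, 99)])
with_barrier = run([("stw", 1234, 35), ("eieio",), ("stw", 5678, 99)])
```

In this model the unprotected sequence writes 99 before any register has been selected, while the sequence with the eieio updates register 35 as intended.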

As always, consult the user's manual for the specific processor you're using - I/O is one of the areas you want to be sure is working early on in a project!


MMU and cache
 
Most PowerPC processors have some sort of on-chip cache and MMU capability.  Some of the PowerPC processors also support an external L2 cache.  The PowerPC microcontrollers in particular have customized the implementations for their targeted markets (no cache, instruction cache only, instruction and data cache with MMU, etc...)

You could write a small book on how to use the MMU and caches for any one of the processors implementing them.  Much of our PowerPC work has been with the MPC8xx family, and we have assisted several of our 8xx customers with the cache and MMU.  Therefore, the following notes are specific to the MPC8xx family.

  • The MPC8xx family processors implement separate instruction and data (I & D) caches, and separate I & D MMUs.  The Harvard architecture allows simultaneous instruction and data accesses to increase throughput.
  • Cache/storage control attributes (see next bullet item) are defined through the MMU's descriptors.  For this reason, even if a system does not implement virtual memory or memory protection, most users will want to enable the MMU.  For example, with the MMU, the user can mark certain regions as non-cacheable.
  • Storage control attributes that can be defined for memory regions include:
    • cache inhibit - mark the page as uncacheable; the cache is always bypassed and all loads/stores access main memory
    • writethrough - store operations update main memory as well as the cache
    • guarded - prevents speculative, out-of-order accesses from being performed on the region.  (Instruction prefetches are an example of speculative accesses.)
  • The MMUs divide up the address space into a series of pages.  The page size is configurable, although a common page size is 4 kbytes.
  • The MMUs provide the following protection capabilities:
    • mark regions as write-protected
    • mark regions as "not executable"
    • mark regions as accessible only in privileged mode
  • The MMUs implement Translation Lookaside Buffers (TLBs), which are essentially address translation caches.  The first time a virtual address on a new page needs to be translated, a TLB miss exception occurs.  Typically, the exception handler loads a TLB entry for the page through a "tablewalk" procedure.  During a tablewalk, the software traverses a tree of data structures that define how the memory is laid out in the system.  At the end of a successful tablewalk,  the virtual address can be translated, and a TLB entry is loaded.
  • A small number of TLB entries can be "locked" to ensure a TLB miss never occurs for certain time-critical regions of memory.
  • The MPC8xx processors provide hardware-assisted tablewalks for TLB misses.  Although software must walk the page tables in the case of a TLB miss, special hardware registers can be used to expedite the tablewalk.  Software can implement any page table structure it wants, but a structure that does not match these registers gives up the hardware assistance.  (Contrast this with the 68030/40/60 MMUs, which perform tablewalks entirely in hardware through complex on-chip microcode.)
  • The I & D caches are 2-way set associative physically-addressed caches.
    • The 2-way set associativity can help reduce cache line replacements due to "aliases".  Two addresses that would contend for the same cache line on a direct-mapped cache can both be stored, one in each way of the set.
    • The benefit of being physically addressed is that the cache does not need to be flushed if a virtual address space is re-mapped to a different physical address.  This occurs frequently on multitasking systems where each process runs in an identical virtual address space.
  • Individual cache lines can be loaded and locked to hold data/instructions that are frequently accessed.  We have seen remarkable improvements in code performance in MPC8xx communications applications where certain portions of code and data are loaded and locked into their respective caches.
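
The TLB miss and tablewalk sequence described in the notes above can be sketched in Python. Everything here is an illustrative assumption rather than the MPC8xx layout: 4-kbyte pages, a two-level table indexed by 10 bits per level, a tiny TLB, and oldest-first replacement.

```python
PAGE_SHIFT = 12                          # 4-kbyte pages


class Mmu:
    def __init__(self, page_table, tlb_entries=4):
        # page_table: {level1_index: {level2_index: physical_page_number}}
        self.page_table = page_table
        self.tlb = {}                    # virtual page -> physical page
        self.tlb_entries = tlb_entries
        self.misses = 0

    def tablewalk(self, vpn):
        """Software tablewalk through the two-level table."""
        level2 = self.page_table[vpn >> 10]
        return level2[vpn & 0x3FF]

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn not in self.tlb:          # TLB miss: the handler runs here
            self.misses += 1
            ppn = self.tablewalk(vpn)
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.pop(next(iter(self.tlb)))   # evict the oldest entry
            self.tlb[vpn] = ppn          # load the new TLB entry
        return (self.tlb[vpn] << PAGE_SHIFT) | (vaddr & 0xFFF)


mmu = Mmu({0: {1: 0x80}})                # virtual page 1 -> physical page 0x80
first = mmu.translate(0x1234)            # misses, walks the table
second = mmu.translate(0x1FFC)           # same page, hits in the TLB
```

Only the first access to a page pays for the tablewalk; later accesses to the same page are translated directly from the TLB entry.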

More to come

This is only the tip of the iceberg.  We didn't even discuss the "ins and outs" of  the on-chip peripherals of the MPC8xx family, or the I/O functionality of the MPC5xx family. 

We will be continually adding to this page as time permits, concentrating on the MPC8xx and MPC5xx microcontrollers.

 

ECS home page: http://www.go-ecs.com
© 1996-2002, Embedded Concepts & Solutions, Inc.