[Repost] Adding New SIMD Instructions to the GCC Back-end

http://blog.chinaunix.net/u/30686/showart.php

Mauricio Alvarez: alvarez (at) ac (dot) upc (dot) edu
Created: 16/09/2005.
Modified: 13.03.2006.

1. Introduction
This guide shows how to add support to the GCC compiler for new instructions that extend an existing ISA. The idea is to extend an ISA with custom instructions for domain-specific processor acceleration and to support these instructions in the compiler by means of intrinsics.

This work is based on the PowerPC ISA with the Altivec multimedia extension; the new instructions to be added target the video coding/decoding domain. GCC version 4.0 is used.

In the first stage of this work, the new instructions are supported by means of intrinsics, which allow the programmer to use them directly in C or C++ programs.


2. GCC structure and passes
In order to provide support for new instructions in gcc it is necessary to support them in some of the stages of the compiler:

front-end: parse tree --> middle-end: generic tree --> back-end: RTL

GCC passes [1]

  • Parsing pass: language front-end
  • Gimplification pass: convert intermediate representation of a function into the GIMPLE language
  • Tree SSA passes: tree optimization passes (for example autovectorization)
  • RTL passes: rtl generation and optimization passes

Modifications in each stage of GCC
2.1 front-end: specification of the new intrinsics, added to the existing Altivec intrinsics.
2.2 middle-end: because the new instructions are not going to be generated automatically, there is no need to modify this stage.
2.3 back-end: creation of the machine description of the new instructions.

3. Front-end support for intrinsics in GCC
3.1 What are intrinsics?
An intrinsic is a function known by the compiler that directly maps to a sequence of one or more assembly language instructions.

Intrinsics make the use of processor-specific enhancements easier because they provide a language interface (C, C++) to assembly instructions. In doing so, the compiler manages things that the user would normally have to be concerned with, such as register names, register allocation and the memory locations of data.

GCC has intrinsics for the SIMD extensions (SSE, Altivec) that are available in most modern processors.

3.2 Altivec Intrinsics in GCC
Altivec intrinsics are an interface for accessing the Altivec instructions of PowerPC processors. The intrinsics specification also adds new types to the C and C++ languages for declaring packed (vector) variables, as described in the Altivec Programming Interface Manual [2].

The intrinsics interface is made available by adding #include <altivec.h> to the source program and by passing the -maltivec and -mabi=altivec flags to the compiler. This applies only to the official FSF GCC. Mac computers (with PowerPC processors) use a special version of GCC that does not need the altivec.h header and requires the -faltivec compiler flag instead.

Altivec intrinsics are declared in altivec.h, which is located in the GCC source tree at gcc/config/rs6000/altivec.h

An example of the Altivec intrinsics is the vector average, which calculates the rounded average of two vectors.

- Compiler intrinsic: d = vec_avg(a,b)
- Assembly instructions: see the next table

d = vec_avg(a,b)

d                      a                      b                      maps to
vector unsigned char   vector unsigned char   vector unsigned char   vavgub d,a,b
vector signed char     vector signed char     vector signed char     vavgsb d,a,b
vector unsigned short  vector unsigned short  vector unsigned short  vavguh d,a,b
vector signed short    vector signed short    vector signed short    vavgsh d,a,b
vector unsigned int    vector unsigned int    vector unsigned int    vavguw d,a,b
vector signed int      vector signed int      vector signed int      vavgsw d,a,b
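
To make the semantics of the unsigned byte variant concrete, here is a minimal scalar sketch in portable C. It assumes the usual definition of the rounded average, (a + b + 1) >> 1 per element. The name ref_vavgub is hypothetical and exists only for this illustration; it is not part of GCC or of the Altivec interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference model of vavgub (hypothetical helper, not a GCC API):
   each of the 16 unsigned bytes of d receives the rounded average of the
   corresponding bytes of a and b.  */
void
ref_vavgub (uint8_t d[16], const uint8_t a[16], const uint8_t b[16])
{
  assert (d != NULL && a != NULL && b != NULL);
  for (size_t i = 0; i < 16; i++)
    {
      /* Widen before adding so that a + b + 1 cannot overflow a byte.  */
      unsigned int sum = (unsigned int) a[i] + (unsigned int) b[i] + 1u;
      d[i] = (uint8_t) (sum >> 1);
    }
}
```

A real Altivec unit computes all 16 lanes with a single vavgub instruction; the loop above only describes the result.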

3.3 Implementation of Altivec intrinsics
Intrinsics are implemented as functions, but their code is placed inline and they do not generate a function call. As can be seen in the example above, each intrinsic can map to several assembly instructions depending on the data type of the operands. In GCC this is implemented by means of overloaded functions: in C++ overloading is supported directly by the language, while in C it is emulated with macros.

Here is the C++ definition of the vec_avg intrinsic for vector unsigned char and vector signed char:

inline vector unsigned char
vec_avg (vector unsigned char a1, vector unsigned char a2)
{
  return (vector unsigned char) __builtin_altivec_vavgub ((vector signed char) a1, (vector signed char) a2);
}

inline vector signed char
vec_avg (vector signed char a1, vector signed char a2)
{
  return (vector signed char) __builtin_altivec_vavgsb ((vector signed char) a1, (vector signed char) a2);
}

In C, the same definition is done with macros:
#define vec_avg(a1, a2) \
__ch (__bin_args_eq (vector unsigned char, (a1), vector unsigned char, (a2)), \
      ((vector unsigned char) __builtin_altivec_vavgub ((vector signed char) (a1), (vector signed char) (a2))), \
__ch (__bin_args_eq (vector signed char, (a1), vector signed char, (a2)), \
      ((vector signed char) __builtin_altivec_vavgsb ((vector signed char) (a1), (vector signed char) (a2))), \
__ch (__bin_args_eq (vector unsigned short, (a1), vector unsigned short, (a2)), \
      ((vector unsigned short) __builtin_altivec_vavguh ((vector signed short) (a1), (vector signed short) (a2))), \
__ch (__bin_args_eq (vector signed short, (a1), vector signed short, (a2)), \
      ((vector signed short) __builtin_altivec_vavgsh ((vector signed short) (a1), (vector signed short) (a2))), \
__ch (__bin_args_eq (vector unsigned int, (a1), vector unsigned int, (a2)), \
      ((vector unsigned int) __builtin_altivec_vavguw ((vector signed int) (a1), (vector signed int) (a2))), \
__ch (__bin_args_eq (vector signed int, (a1), vector signed int, (a2)), \
      ((vector signed int) __builtin_altivec_vavgsw ((vector signed int) (a1), (vector signed int) (a2))), \
    __builtin_altivec_compiletime_error ("vec_avg")))))))

__bin_args_eq is a macro that checks whether the data types of the operands match the given types.
__ch is a macro that chooses between the builtin expression and a data type error.
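
The selection mechanism can be tried on any machine with scalar types. The sketch below assumes that the choice is built on the GCC extensions __builtin_choose_expr and __builtin_types_compatible_p, so it needs GCC or a compatible compiler in C mode. The names ch, bin_args_eq, my_avg, avg_u8 and avg_s8 are hypothetical stand-ins for this example and are not the definitions found in altivec.h.

```c
#include <assert.h>

/* Scalar stand-ins for the per-type builtins (hypothetical names).  */
unsigned char
avg_u8 (unsigned char a, unsigned char b)
{
  return (unsigned char) ((a + b + 1) >> 1);
}

signed char
avg_s8 (signed char a, signed char b)
{
  /* Relies on GCC's arithmetic right shift for negative values.  */
  return (signed char) ((a + b + 1) >> 1);
}

/* True (as a constant expression) when x has type xtype and y has ytype.  */
#define bin_args_eq(xtype, x, ytype, y) \
  (__builtin_types_compatible_p (xtype, __typeof__ (x)) \
   && __builtin_types_compatible_p (ytype, __typeof__ (y)))

/* Compile-time selection between two expressions.  */
#define ch(cond, yes, no) __builtin_choose_expr (cond, yes, no)

/* One "overloaded" name; an unsupported type yields a void expression,
   which fails to compile as soon as its value is used.  */
#define my_avg(a1, a2) \
  ch (bin_args_eq (unsigned char, (a1), unsigned char, (a2)), \
      avg_u8 ((a1), (a2)), \
  ch (bin_args_eq (signed char, (a1), signed char, (a2)), \
      avg_s8 ((a1), (a2)), \
      (void) 0))

/* Usage: the same bit patterns give different results per type.  */
void
demo_dispatch (void)
{
  unsigned char ua = 200, ub = 100;
  signed char sa = -56, sb = 100;

  assert (my_avg (ua, ub) == 150);  /* (200 + 100 + 1) >> 1 */
  assert (my_avg (sa, sb) == 22);   /* (-56 + 100 + 1) >> 1 */
}
```

The selection happens entirely at compile time, so no run-time type test remains in the generated code.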

3.4 An example of a new Altivec intrinsic for pixel interpolation
We are going to add support for a new instruction devoted to pixel interpolation, a process that is common in video coding standards such as MPEG-4 or H.264.

The interface to the new instruction is d = vec_inter(a,b) and the assembly mapping is shown in the next table.


d = vec_inter(a,b)

d                      a                      b                      maps to
vector unsigned char   vector unsigned char   vector unsigned char   vinterub d,a,b
vector signed char     vector signed char     vector signed char     vintersb d,a,b
vector unsigned short  vector unsigned short  vector unsigned short  vinteruh d,a,b
vector signed short    vector signed short    vector signed short    vintersh d,a,b
vector unsigned int    vector unsigned int    vector unsigned int    vinteruw d,a,b
vector signed int      vector signed int      vector signed int      vintersw d,a,b

The C++ definitions for the unsigned and signed char versions look like this:

inline vector unsigned char
vec_inter (vector unsigned char a1, vector unsigned char a2)
{
  return (vector unsigned char) __builtin_altivec_vinterub ((vector signed char) a1, (vector signed char) a2);
}

inline vector signed char
vec_inter (vector signed char a1, vector signed char a2)
{
  return (vector signed char) __builtin_altivec_vintersb ((vector signed char) a1, (vector signed char) a2);
}

4. Back-end support for intrinsics in GCC
Intrinsics are implemented in the machine description of the back-end of the compiler. The back-end is implemented in several files:
  • rs6000.h: C macros for machine fundamentals, compiler environment, machine description support and ABI
  • rs6000.c: C functions for macro expansion and machine description support
  • rs6000.md: Machine description for RS6000 (Power) instructions
  • altivec.h: declaration of intrinsics functions
  • altivec.md: machine description for altivec instructions
The machine descriptions are used in the matching process to transform RTL expressions into assembler instructions. An instruction description in the machine description consists of instruction template patterns for both instruction generation and instruction matching.

4.1 Instruction patterns for altivec instructions
Here is an example of an instruction pattern for the vec_avg intrinsic using unsigned char operands:

(define_insn "altivec_vavgub"
  [(set (match_operand:V16QI 0 "register_operand" "=v")
        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")
                       (match_operand:V16QI 2 "register_operand" "v")] 44))]
  "TARGET_ALTIVEC"
  "vavgub %0,%1,%2"
  [(set_attr "type" "vecsimple")])

  • define_insn: the pattern type used for both instruction generation and matching.
  • altivec_vavgub: the pattern name.
  • set: the operation, a composition of sub-expressions. At the leaves of the resulting operation tree there is usually some kind of operand-matching expression. The generic form is (match_operand:Mode operand-number "predicate" "constraints")
    • In (set (match_operand:V16QI 0 "register_operand" "=v") ...) there is a match for an operand of mode V16QI (a vector of 16 QI elements; QI means a byte), which is operand number 0. The predicate "register_operand" means that the operand must be a register, and the constraint "v" means that the register must be a vector register. The "=" modifier marks it as the destination operand.
    • (unspec [operands] index) represents a machine-specific operation on the operands; index selects between multiple machine-specific operations. For standard operations the name of the operation appears instead of unspec (for example xor:QI), but Altivec instructions are not standard operations for the compiler, so they need to be declared as machine-specific operations.
  • "TARGET_ALTIVEC": the pattern condition, which states when the pattern applies.
  • "vavgub %0,%1,%2": the output template of the define_insn, where %0, %1 and %2 stand for operands 0, 1 and 2. For the Altivec instructions the output template is just the assembler instruction as a string.
  • [(set_attr "type" "vecsimple")]: the specification of an attribute for the instruction. In this case the instruction is of type vecsimple (vector simple).
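
To illustrate what an output template such as "vavgub %0,%1,%2" means, the following toy C function substitutes operand strings for the %N placeholders. It is only a sketch of the idea under simplifying assumptions (single-digit operand numbers, no modifiers), not GCC's implementation; the name expand_template and the register names used in the note below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of output-template expansion (not GCC's implementation):
   copy TMPL to OUT, replacing each "%N" (N a single digit) with
   OPERANDS[N].  Returns 0 on success, -1 if an operand index is out of
   range or OUT is too small.  */
int
expand_template (const char *tmpl, const char *const operands[],
                 size_t n_operands, char *out, size_t out_size)
{
  size_t len = 0;

  assert (tmpl != NULL && operands != NULL && out != NULL);
  if (out_size == 0)
    return -1;

  while (*tmpl != '\0')
    {
      const char *piece = tmpl;   /* default: copy one literal character */
      size_t piece_len = 1;

      if (tmpl[0] == '%' && tmpl[1] >= '0' && tmpl[1] <= '9')
        {
          size_t idx = (size_t) (tmpl[1] - '0');
          if (idx >= n_operands)
            return -1;
          piece = operands[idx];
          piece_len = strlen (piece);
          tmpl += 2;
        }
      else
        tmpl += 1;

      if (len + piece_len >= out_size)
        return -1;                /* keep room for the terminating NUL */
      memcpy (out + len, piece, piece_len);
      len += piece_len;
    }
  out[len] = '\0';
  return 0;
}
```

With the operand strings "v2", "v3" and "v4", the template above expands to "vavgub v2,v3,v4".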

4.2 Instruction patterns for new instructions
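Similar to the example presented above, we have defined patterns for the vector interpolation intrinsic vec_inter. The patterns use the VI mode macro, so each one covers the byte, halfword and word variants; <VI_char> expands to b, h or w accordingly.

- vec_inter for the unsigned data types: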

(define_insn "altivec_vinteru<VI_char>"
[(set (match_operand:VI 0 "register_operand" "=v")
        (unspec:VI [(match_operand:VI 1 "register_operand" "v")
                    (match_operand:VI 2 "register_operand" "v")] 244))]
"TARGET_ALTIVEC"
"vinteru<VI_char> %0,%1,%2"
[(set_attr "type" "vecsimple")])

- vec_inter for the signed data types:

(define_insn "altivec_vinters<VI_char>"
[(set (match_operand:VI 0 "register_operand" "=v")
        (unspec:VI [(match_operand:VI 1 "register_operand" "v")
                    (match_operand:VI 2 "register_operand" "v")] 245))]
"TARGET_ALTIVEC"
"vinters<VI_char> %0,%1,%2"
[(set_attr "type" "vecsimple")])

4.3 Builtin description of the intrinsics
It is necessary to add the intrinsics to the back-end of the compiler. In our case, the intrinsics are added to the RS6000 back-end, which includes all the Power and PowerPC processors. The subroutines for code generation are defined in gcc/gcc-4.0.0/gcc/config/rs6000/rs6000.c. In this file there is a special section for the definition of builtins.

For two-operand instructions there is a structure like this:

static struct builtin_description bdesc_2arg[] =
{
  /* ... */
  { MASK_ALTIVEC, CODE_FOR_altivec_vinterub, "__builtin_altivec_vinterub", ALTIVEC_BUILTIN_VINTERUB },
  { MASK_ALTIVEC, CODE_FOR_altivec_vintersb, "__builtin_altivec_vintersb", ALTIVEC_BUILTIN_VINTERSB },
  { MASK_ALTIVEC, CODE_FOR_altivec_vinteruh, "__builtin_altivec_vinteruh", ALTIVEC_BUILTIN_VINTERUH },
  { MASK_ALTIVEC, CODE_FOR_altivec_vintersh, "__builtin_altivec_vintersh", ALTIVEC_BUILTIN_VINTERSH },
  { MASK_ALTIVEC, CODE_FOR_altivec_vinteruw, "__builtin_altivec_vinteruw", ALTIVEC_BUILTIN_VINTERUW },
  { MASK_ALTIVEC, CODE_FOR_altivec_vintersw, "__builtin_altivec_vintersw", ALTIVEC_BUILTIN_VINTERSW },
  /* ... */
};

The definitions of the ALTIVEC_BUILTIN codes are placed in: gcc/gcc-4.0.0/gcc/config/rs6000/rs6000.h

enum rs6000_builtins
{
  /* Altivec builtins.  */
  /* ... */
  ALTIVEC_BUILTIN_VINTERUB,
  ALTIVEC_BUILTIN_VINTERSB,
  ALTIVEC_BUILTIN_VINTERUH,
  ALTIVEC_BUILTIN_VINTERSH,
  ALTIVEC_BUILTIN_VINTERUW,
  ALTIVEC_BUILTIN_VINTERSW,
  /* ... */
};
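
The relation between the two tables can be sketched with a self-contained toy model: given the code of a builtin, scan the description table for the entry that names the instruction pattern to emit, provided the required target feature is enabled. Every identifier prefixed with toy_ or TOY_ is hypothetical and only mirrors the shape of the GCC tables; the real lookup and expansion code in rs6000.c is more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins that mirror the shape of the GCC tables.  */
enum toy_builtin { TOY_BUILTIN_VINTERUB, TOY_BUILTIN_VINTERSB, TOY_BUILTIN_OTHER };
enum toy_icode { TOY_CODE_FOR_vinterub, TOY_CODE_FOR_vintersb };

#define TOY_MASK_ALTIVEC 0x1u

struct toy_builtin_description
{
  unsigned int mask;       /* target feature the builtin requires */
  enum toy_icode icode;    /* instruction pattern to emit */
  const char *name;        /* name visible to the front-end */
  enum toy_builtin code;   /* builtin identifier */
};

static const struct toy_builtin_description toy_bdesc_2arg[] = {
  { TOY_MASK_ALTIVEC, TOY_CODE_FOR_vinterub,
    "__builtin_altivec_vinterub", TOY_BUILTIN_VINTERUB },
  { TOY_MASK_ALTIVEC, TOY_CODE_FOR_vintersb,
    "__builtin_altivec_vintersb", TOY_BUILTIN_VINTERSB },
};

/* Return the entry for builtin CODE if its feature mask is enabled in
   TARGET_FLAGS, or NULL when there is no usable entry.  */
const struct toy_builtin_description *
toy_find_builtin (enum toy_builtin code, unsigned int target_flags)
{
  size_t n = sizeof toy_bdesc_2arg / sizeof toy_bdesc_2arg[0];

  for (size_t i = 0; i < n; i++)
    if (toy_bdesc_2arg[i].code == code
        && (toy_bdesc_2arg[i].mask & target_flags) != 0)
      return &toy_bdesc_2arg[i];
  return NULL;
}

/* Usage: the builtin is found only when the feature is enabled.  */
void
toy_demo (void)
{
  const struct toy_builtin_description *d
    = toy_find_builtin (TOY_BUILTIN_VINTERSB, TOY_MASK_ALTIVEC);

  assert (d != NULL && d->icode == TOY_CODE_FOR_vintersb);
  assert (toy_find_builtin (TOY_BUILTIN_VINTERSB, 0) == NULL);
  assert (toy_find_builtin (TOY_BUILTIN_OTHER, TOY_MASK_ALTIVEC) == NULL);
}
```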

5. Extending GNU Assembler
In order to support new instructions for a given ISA, it is also necessary to modify the assembler so that it can produce the object code. The natural choice of assembler to use with the GCC compiler is gas, the GNU assembler, which is part of the binutils collection of tools.

Gas is implemented in two sections, a front-end and a back-end:
  • Gas front-end: handles the parsing of the input
  • Gas back-end: does the whole machine-dependent part

5.1 Opcode List
The opcode list for PowerPC instructions is defined in the PowerPC back-end:
/binutils-2.16.1/opcodes/ppc-opc.c

const struct powerpc_opcode powerpc_opcodes[] = {
/* ... */
{ "vinterub", VX(4, 1900), VX_MASK, PPCVEC, { VD, VA, VB } },
{ "vinteruh", VX(4, 1901), VX_MASK, PPCVEC, { VD, VA, VB } },
{ "vinteruw", VX(4, 1902), VX_MASK, PPCVEC, { VD, VA, VB } },
{ "vintersb", VX(4, 1903), VX_MASK, PPCVEC, { VD, VA, VB } },
{ "vintersh", VX(4, 1904), VX_MASK, PPCVEC, { VD, VA, VB } },
{ "vintersw", VX(4, 1905), VX_MASK, PPCVEC, { VD, VA, VB } },
/* ... */
};

  • vinterXX: the instruction name.
  • VX(4, YY): 4 is the main opcode and YY is the secondary (extended) opcode.
  • VX is the macro that builds the opcode of Altivec instructions: #define VX(op, xop) (OP (op) | (((unsigned long)(xop)) & 0x7ff))
  • VX_MASK: the opcode mask, used by the disassembler.
  • PPCVEC: indicates which specific processors support the instructions.
  • { VD, VA, VB }: the operands, an array of operand codes. Each code is an index into the operand table.
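
The opcode macros can be checked in isolation. The sketch below copies the VX macro quoted above and assumes the companion definitions OP (primary opcode in the top 6 bits) and VX_MASK (all opcode bits set); the helper insn_matches is a hypothetical name that shows how a disassembler can use the mask to recognise an instruction word regardless of its register fields.

```c
#include <assert.h>
#include <stdint.h>

/* Primary opcode: the top 6 bits of the 32-bit instruction word.  */
#define OP(x) ((((unsigned long) (x)) & 0x3f) << 26)
/* VX form: primary opcode plus an 11-bit extended opcode.  */
#define VX(op, xop) (OP (op) | (((unsigned long) (xop)) & 0x7ff))
/* Mask that keeps only the opcode bits of a VX-form instruction.  */
#define VX_MASK VX (0x3f, 0x7ff)

/* Return nonzero if INSN matches a table entry with the given OPCODE and
   MASK, ignoring the register fields (hypothetical helper).  */
int
insn_matches (uint32_t insn, unsigned long opcode, unsigned long mask)
{
  /* A table entry may only set bits that its mask covers.  */
  assert ((opcode & ~mask) == 0);
  return (insn & mask) == opcode;
}
```

For example, VX(4, 1900) evaluates to 0x1000076C and VX_MASK to 0xFC0007FF, so any word whose masked bits equal 0x1000076C is decoded as vinterub.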

5.2 Adding new opcodes to the Altivec extension

PowerPC opcode format (32-bit instruction word):
+-------------+--------+--------+--------+-------------------+
| Main opcode |   VD   |   VA   |   VB   |  Extended opcode  |
+-------------+--------+--------+--------+-------------------+
    6 bits      5 bits   5 bits   5 bits        11 bits
  • Main opcode: for Altivec instructions it is 0x04.
  • VD, VA, VB: the register identifiers, 5 bits each.
  • Extended opcode: the opcode of the instruction itself.
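
As a worked example of this layout, the following C sketch packs the five fields into a 32-bit word and extracts them again. The field widths follow the format above (6, 5, 5, 5 and 11 bits); the function names encode_vx, vx_vd, vx_va, vx_vb and vx_xop are hypothetical and not part of binutils.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the fields of a VX-form instruction, most significant first:
   6-bit main opcode, three 5-bit registers, 11-bit extended opcode.  */
uint32_t
encode_vx (unsigned op, unsigned vd, unsigned va, unsigned vb, unsigned xop)
{
  assert (op < 64 && vd < 32 && va < 32 && vb < 32 && xop < 2048);
  return ((uint32_t) op << 26) | ((uint32_t) vd << 21)
         | ((uint32_t) va << 16) | ((uint32_t) vb << 11) | (uint32_t) xop;
}

/* Field extractors for the same layout.  */
unsigned vx_vd (uint32_t insn) { return (insn >> 21) & 0x1f; }
unsigned vx_va (uint32_t insn) { return (insn >> 16) & 0x1f; }
unsigned vx_vb (uint32_t insn) { return (insn >> 11) & 0x1f; }
unsigned vx_xop (uint32_t insn) { return insn & 0x7ff; }
```

With main opcode 4 and extended opcode 1900, an instruction using registers 1, 2 and 3 for VD, VA and VB is encoded as 0x10221F6C.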

Free opcodes
Starting at extended opcode 1900 there are free slots for new instructions. For the interpolation instructions these are the selected opcodes:
- vinterub: 1900
- vinteruh: 1901
- vinteruw: 1902
- vintersb: 1903
- vintersh: 1904
- vintersw: 1905

Appendix 1. Notes on compilation of gcc
Adding new instructions does not change the compilation process of GCC at all. For our experiments, however, we are using a Power4 machine with the AIX operating system together with a PowerPC+Altivec emulator and simulator. Neither the processor nor the OS supports Altivec instructions, so it is necessary to tell GCC to include support for Altivec.

  • in gcc/config/rs6000/aix.h: it is necessary to define that the system supports ALTIVEC instructions.
#define TARGET_ALTIVEC 1
#define TARGET_ALTIVEC_ABI 1
#define TARGET_ALTIVEC_VRSAVE 1
  • in gcc/config/rs6000/xcoff.h: additionally, it is necessary to guarantee that global variables are aligned to 128 bits. This is done by changing the emission of the assembler directive csect, which is responsible for the alignment of sections in the generated code.

#define READ_ONLY_DATA_SECTION_FUNCTION                         \
void                                                            \
read_only_data_section (void)                                   \
{                                                               \
  if (in_section != read_only_data)                             \
    {                                                           \
      /* The operand 4 gives alignment to 128 bits.  */         \
      fprintf (asm_out_file, "\t.csect %s[RO],4\n",             \
               xcoff_read_only_section_name);                   \
      in_section = read_only_data;                              \
    }                                                           \
}

The same change needs to be applied to
#define READ_ONLY_PRIVATE_DATA_SECTION_FUNCTION

  • it is better to use the bash shell in order to speed up the compilation process under AIX:
CONFIG_SHELL={bin_dir}/bash
export CONFIG_SHELL
  • Building GCC: for the configuration it is necessary to enable altivec and the desired languages for the front-end.
$ ./configure --prefix=$BIN_DIR --enable-languages=c,c++ --enable-altivec --disable-nls --disable-multilib
$ gmake
$ gmake install
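
A note on the csect change in this appendix: the value 4 works as an alignment of 128 bits if the operand is read as the base-2 logarithm of the alignment in bytes (2^4 = 16 bytes). The sketch below assumes that interpretation and shows, in portable C11, how to check that a global object really sits on a 16-byte boundary. The names csect_align_bytes, is_vector_aligned and pixel_block are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a log2 alignment operand into an alignment in bytes.  */
size_t
csect_align_bytes (unsigned log2_align)
{
  assert (log2_align < sizeof (size_t) * 8);
  return (size_t) 1 << log2_align;
}

/* Return nonzero if P lies on a 128-bit (16-byte) boundary.  */
int
is_vector_aligned (const void *p)
{
  return ((uintptr_t) p % 16u) == 0;
}

/* A global object with an explicitly requested 16-byte alignment (C11).  */
_Alignas (16) unsigned char pixel_block[16];
```

Calling is_vector_aligned (pixel_block) is a quick run-time check that vector loads and stores on that object will see the alignment they expect.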

References
[1] GCC Internals. GNU Compiler Collection Internals.
[2] ALTIVECPIM. AltiVec Technology Programming Interface Manual. Motorola/Freescale.
[3] BINUTILS. GNU binary utils: assembler, linker, loader and other utilities for dealing with binary files generated by gcc compilers.
  • Motorola/Freescale. Programming Environments Manual for 32-bit Implementations of the PowerPC Architecture. P/N MPCFPE32B/AD .
  • IBM (2000). Book E: Enhanced PowerPC™ Architecture (3rd ed.)
  • Motorola/Freescale. ALTIVECPEM. AltiVec Technology Programming Environments Manual.
  • Intrinsics wikipedia. http://en.wikipedia.org/wiki/Intrinsic_function
  • Hans-Peter Nilsson. Porting gcc for dunces. ftp://ftp.axis.se/pub/users/hp/pgccfd/

Original article: http://personals.ac.upc.edu/alvarez/media/gcc-isa-extensions.html