An overview of GCC internals
Background
When Symas/COBOLworx decided to build a COBOL front end for the GCC
Gnu Compiler Collection, Jim Lowden and Bob Dubner broke the effort up
into two pieces. Jim would figure out how to parse the COBOL source code
(e.g., ADD A TO B GIVING C
) and Bob would figure out how to
tell the GCC downstream processing that it needed to add A to B and put
the result in C.
Bob’s job was complicated by not knowing, in the slightest, how to do that.
This document is a high-level overview of how we learned how to do that.
An Extreme Statement of the Problem
GCC is practically a universe. The oldest entry in the GIT repository is dated 1988, and the comment indicates that the initial entries were converted from older ones in another source control system. So, diving into it cold means that it is all deep end. It’s not obvious where to start.
GCC has extensive documentation on GCC internals. But if you’ll forgive the analogy, starting with the compiler internals document is a bit like being given the maintenance manual for an F-15 fighter plane, and being told to build and fly one.
Happily, in the case of COBOL for GCC, both Jim and Bob are long-time users of GCC. Thus, for us, this was more like opening up the access panels of an aircraft after piloting it for a long time, and finally learning what was going on in there.
Bob got started by quickly scanning through the GCC Internals
documentation. Freshly infused with almost total confusion, he figured
out how to build a -O0 -ggdb
build of the compiler, and
then started trapping through it to figure out what it was doing when
compiling a simple C program.
GCC From Ten Thousand Meters
We learned that when GCC compiles a program from source code to produce an executable:
- A front end dedicated to the source language parses the source code
- From the source code, the front end generates a directed cyclic graph of nodes, defined in the C source code as trees
- When creating a C-like function, that graph starts off with a function declaration node, or func_decl
- That func_decl serves as an anchor for all of the variable definitions and statements that make up the function
- Once the graph is complete, the func_decl is handed to the GCC
routine
cgraph_node::finalize_function()
- After parsing and graph generation is complete, GCC cranks and grinds and reduces and optimizes the input graph and somehow, magically, produces the assembly language representation of the program in a .s file
- That .s file is assembled by the
as
assembler, producing a .o object file - The .o object file gets fed to the
ld
linker, producing an executable.
The directed cyclic graph is an abstract syntax tree known variously as a GENERIC tree or a GIMPLE tree. (And, yes, the word “tree” is seriously overused. You have to cope with it.)
Restatement of the Problem
- Parse the source code. (Not terrifying; Jim knows how to parse.)
- Generate the GENERIC tree.
- Pass the GENERIC tree to the GCC middle-end.
So, it reduced down to the easier-said-than done “Generate the GENERIC tree”.
The One Bright Spot
The one bright spot for Dubner as he launched that effort was that he had the compiler and all of the source code.
He was able to figure out that inside the middle end processing, the
routine that seemed to accept the GENERIC/GIMPLE tree for further
processing was the gimplify_function_tree(tree fndecl)
in
gcc/gimplify.cc
It became clear that what he wanted was a complete dump of the GENERIC tree, so that he could see what GCC was doing when it compiled C or FORTRAN. There didn’t seem to be any such facility in GCC. So, the -dump-generic-nodes utility was developed. Refer there for the next part of the tale.