//===----------------------------------------------------------------------===//
// C Language Family Front-end
//===----------------------------------------------------------------------===//
                                                             Chris Lattner

I. Introduction:

 clang: noun
    1. A loud, resonant, metallic sound.
    2. The strident call of a crane or goose.
    3. C-language front-end toolkit.

 The world needs better compiler tools, tools which are built as libraries. This
 design point allows reuse of the tools in new and novel ways. However, building
 the tools as libraries isn't enough: they must have clean APIs, be as
 decoupled from each other as possible, and be easy to modify/extend. This
 requires clean layering, decent design, and avoiding tying the libraries to a
 specific use. Oh yeah, did I mention that we want the resultant libraries to
 be as fast as possible? :)

 This front-end is built as a component of the LLVM toolkit (which really really
 needs a better name) that can be used with the LLVM backend or independently of
 it. In this spirit, the API has been carefully designed to include the
 following components:

   libsupport  - Basic support library, reused from LLVM.
   libsystem   - System abstraction library, reused from LLVM.
   libbasic    - Diagnostics, SourceLocations, SourceBuffer abstraction,
                 file system caching for input source files.
   liblex      - C/C++/ObjC lexing and preprocessing, identifier hash table,
                 pragma handling, tokens, and macros.
   libparse    - C99 (for now) parsing and local semantic analysis. This library
                 invokes coarse-grained 'Actions' provided by the client to do
                 stuff (great idea shamelessly stolen from Devkit). ObjC/C90
                 need to be added soon; K&R C and C++ can be added in the
                 future, but are not a high priority.
   libast      - Provides a set of parser actions to build a standardized AST
                 for programs. The AST can be built in two forms: streamlined
                 and 'complete' mode, which captures *full* location info for
                 every token in the AST. ASTs are 'streamed' out a top-level
                 declaration at a time, allowing clients to use decl-at-a-time
                 processing, build up entire translation units, or even build
                 'whole program' ASTs depending on how they use the APIs.
   libast2llvm - [Planned] Lower the AST to LLVM IR for optimization & codegen.
   clang       - An example client of the libraries at various levels.

 This front-end has been intentionally built as a stack, making it trivial
 to replace anything below a particular point. For example, if you want a
 preprocessor, you take the Basic and Lexer libraries. If you want an indexer,
 you take those plus the Parser library and provide some actions for indexing.
 If you want a refactoring, static analysis, or source-to-source compiler tool,
 it makes sense to take those plus the AST building library. Finally, if you
 want to use this with the LLVM backend, you'd take these components plus the
 AST to LLVM lowering code.
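
 As a rough illustration of the stacking idea (not the real library APIs), the
 Python sketch below models a lexer layer, a parser layer that calls a
 client-provided action once per top-level declaration, and an "indexer" client
 built on top of them. The names lex, parse, IndexerActions, and index are
 hypothetical, and the "grammar" is a toy: a declaration is just a run of
 tokens ending in ';'.

```python
import re
from typing import Callable, Iterator, List

# Hypothetical stand-ins for the layers described above; these are not the
# real library interfaces.

def lex(source: str) -> Iterator[str]:
    """Lexer layer: split C-like text into identifiers, numbers, punctuation."""
    return iter(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source))

def parse(tokens: Iterator[str],
          on_top_level_decl: Callable[[List[str]], None]) -> None:
    """Parser layer: group tokens into ';'-terminated top-level declarations
    and hand each one to the client's action as soon as it is complete."""
    decl: List[str] = []
    for tok in tokens:
        if tok == ";":
            if decl:
                on_top_level_decl(decl)
            decl = []
        else:
            decl.append(tok)

class IndexerActions:
    """Client layer: an 'indexer' that only records the declared names."""

    def __init__(self) -> None:
        self.names: List[str] = []

    def __call__(self, decl: List[str]) -> None:
        # Toy assumption: the declared name is the last identifier seen.
        idents = [t for t in decl if re.fullmatch(r"[A-Za-z_]\w*", t)]
        if idents:
            self.names.append(idents[-1])

def index(source: str) -> List[str]:
    """Stack the layers: lexer, then parser, then the indexing actions."""
    actions = IndexerActions()
    parse(lex(source), actions)
    return actions.names
```

 A client that only wants tokens can stop at lex(); a client that wants an
 index supplies its own actions to parse(), which is the decl-at-a-time style
 of processing described above.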

 In the future I hope this toolkit will grow to include new and interesting
 components, including a C++ front-end, ObjC support, AST pretty printing
 support, and a whole lot of other things.

 Finally, it should be pointed out that the goal here is to build something that
 is high-quality and industrial-strength: all the obnoxious features of the C
 family must be correctly supported (trigraphs, preprocessor arcana, K&R-style
 prototypes, GCC/MS extensions, etc). It cannot be used if it's not 'real'.


II. Usage of clang driver:

 * Basic Command-Line Options:
   - Help: clang --help
   - Standard GCC options accepted: -E, -I*, -i*, -pedantic, -std=c90, etc.
   - Make diagnostics more gcc-like: -fno-caret-diagnostics -fno-show-column
   - Enable metric printing: -stats

 * -parse-noop is the default mode.

 * -E mode gives output nearly identical to GCC, though not all bugs in
   whitespace calculation have been emulated.

 * -parse-print-callbacks doesn't print all callbacks yet.

 * -parse-print-ast isn't complete; it currently prints decls and stuff nested
   in parens. This will improve as more AST nodes are implemented.

 * -fsyntax-only is currently identical to -parse-noop.

III. Current advantages over GCC:

 * Column numbers are fully tracked (no 256 col limit, no GCC-style pruning).
 * All diagnostics have column numbers, including 'caret diagnostics'.
 * Full diagnostic customization by client (can format diagnostics however they
   like, e.g. in an IDE or refactoring tool) through DiagnosticClient interface.
 * Built as a framework, can be reused by multiple tools.
 * All languages supported linked into same library (no cc1, cc1obj, ...).
 * mmap's code in read-only, does not dirty the pages like GCC (mem footprint).
 * BSD License, can be linked into non-GPL projects.
 * Full diagnostic control, per diagnostic.
 * Faster than GCC at parsing, lexing, and preprocessing.
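
 To make the DiagnosticClient idea concrete, here is a small Python sketch of
 how a client might format the same diagnostic either with a column and caret
 or in a terser gcc-like style. Everything in it is an assumption made for
 illustration: the Diagnostic fields, the handle() method, and the
 CaretTextClient class are hypothetical and do not mirror the real interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Diagnostic:
    """Hypothetical diagnostic record handed from the front-end to a client."""
    filename: str
    line: int          # 1-based line number
    column: int        # 1-based column number
    message: str
    source_line: str   # text of the offending line, without the newline

class DiagnosticClient:
    """Hypothetical interface: the front-end reports, the client formats."""

    def handle(self, diag: Diagnostic) -> None:
        raise NotImplementedError

class CaretTextClient(DiagnosticClient):
    """Formats diagnostics as text lines, optionally with column and caret."""

    def __init__(self, show_column: bool = True, show_caret: bool = True) -> None:
        self.show_column = show_column
        self.show_caret = show_caret
        self.lines: List[str] = []

    def handle(self, diag: Diagnostic) -> None:
        location = f"{diag.filename}:{diag.line}"
        if self.show_column:
            location += f":{diag.column}"
        self.lines.append(f"{location}: {diag.message}")
        if self.show_caret:
            # Echo the source line and put a caret under the reported column.
            self.lines.append(diag.source_line)
            self.lines.append(" " * (diag.column - 1) + "^")
```

 Constructing the client with show_column=False and show_caret=False models
 the effect of -fno-show-column and -fno-caret-diagnostics; an IDE client
 would implement handle() differently, for instance by collecting the records
 for display instead of formatting text.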

Future Features:

 * Fine grained diag control within the source (#pragma enable/disable warning).
 * Faster than GCC at AST generation [measure when complete].
 * Better token tracking within macros? (Token came from this line, which is
   a macro argument instantiated here, recursively instantiated here).
 * Fast #import!
 * Dependency tracking: a change to a header file doesn't recompile every
   function that textually depends on it; recompile only those functions that
   need it.
 * Defers exposing platform-specific stuff to as late as possible, tracks use of
   platform-specific features (e.g. #ifdef PPC) to allow 'portable bytecodes'.


IV. Missing Functionality / Improvements

clang driver:
 * predefined macros/search paths are hard-coded into the driver.

File Manager:
 * We currently do a lot of stat'ing for files that don't exist, particularly
   when lots of -I paths exist (e.g. see the <iostream> example, check for
   failures in stat in FileManager::getFile). It would be far better to make
   the following changes:
   1. FileEntry contains a sys::Path instead of a std::string for Name.
   2. sys::Path contains timestamp and size, lazily computed. Eliminate from
      FileEntry.
   3. File UIDs are created on request, not when files are opened.
   These changes make it possible to efficiently have FileEntry objects for
   files that exist on the file system, but have not been used yet.

   Once this is done:
   1. DirectoryEntry gets a boolean value "has read entries". When false, not
      all entries in the directory are in the file mgr; when true, they are.
   2. Instead of stat'ing the file in FileManager::getFile, check to see if
      the dir has been read. If so, fail immediately; if not, read the dir,
      then retry.
   3. Reading the dir uses the getdirentries syscall, creating a FileEntry
      for all files found.
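
   The "read the directory once, then fail fast" lookup proposed above can be
   sketched in Python as follows. This is a toy model under stated
   assumptions, not the FileManager implementation: the class and method names
   are made up for illustration, os.scandir stands in for getdirentries, and
   the presence of a directory in the cache plays the role of the "has read
   entries" flag.

```python
import os
from typing import Dict, Optional, Set

class FileManager:
    """Toy model: read each directory once, then answer lookups for missing
    files without touching the file system again."""

    def __init__(self) -> None:
        # Directory path -> names of its entries. A directory being present
        # here is the equivalent of "has read entries" being true.
        self._entries: Dict[str, Set[str]] = {}
        self.dir_reads = 0  # number of directory reads, for illustration only

    def _read_dir(self, dir_path: str) -> Set[str]:
        self.dir_reads += 1
        try:
            with os.scandir(dir_path) as entries:
                return {entry.name for entry in entries}
        except OSError:
            return set()  # a missing or unreadable directory has no entries

    def get_file(self, dir_path: str, name: str) -> Optional[str]:
        """Return the path of `name` inside `dir_path`, or None if absent."""
        if dir_path not in self._entries:
            # Directory not read yet: read it, then fall through to the retry.
            self._entries[dir_path] = self._read_dir(dir_path)
        if name in self._entries[dir_path]:
            return os.path.join(dir_path, name)
        return None  # fail immediately: the directory is known not to have it
```

   With many -I paths, each failed probe after the first costs a set lookup
   instead of a stat call. The trade-off in this sketch is that files created
   after a directory has been read are not seen.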

Lexer:
 * Source character mapping. GCC supports ASCII and UTF-8.
   See GCC options: -ftarget-charset and -ftarget-wide-charset.
 * Universal character support. Experimental in GCC, enabled with
   -fextended-identifiers.
 * -fpreprocessed mode.

Preprocessor:
 * Know enough about darwin filesystem to search frameworks.
 * #assert/#unassert
 * #line / #file directives
 * MSExtension: "L#param" stringizes to a wide string literal.
 * Consider merging the parser's expression parser into the preprocessor to
   eliminate duplicate code.

Traditional Preprocessor:
 * All.

Parser:
 * C90/K&R modes. Need to get a copy of the C90 spec.
 * __extension__, __attribute__ [currently just skipped and ignored].
 * A lot of semantic analysis is missing.

Parser Actions:
 * All that are missing.
 * Would like to either lazily resolve types [refactoring] or aggressively
   resolve them [c compiler]. Need to know whether something is a type or not
   to compile, but don't need to know what it is.
 * Implement a little devkit-style "indexer".

AST Builder:
 * Implement more nodes as actions are available.
 * Types.
 * Allow the AST Builder to be subclassed. This will allow clients to extend it
   and create their own specialized nodes for specific scenarios. Maybe the
   "full loc info" use case is just one extension.

Fast #Import:
 * All.
 * Get frameworks that don't use #import to do so, e.g.
   DirectoryService, AudioToolbox, CoreFoundation, etc. Why aren't they using
   #import? Because they work in C mode? C has #import.
 * Have the lexer return a token for #import instead of handling it itself.
   - Create a new preprocessor object with no external state (no -D/U options
     from the command line, etc). Alternatively, keep track of exactly which
     external state is used by a #import: declare it somehow.
 * When reading a #import file, keep track of whether (and/or which)
   "configuration" macros we have seen. Various cases:
   - Uses of target args (__POWERPC__, __i386): Header has to be parsed
     multiple times, per-target. What about #ifndef checks? How do we know?
   - "Configuration" preprocessor macros not defined: POWERPC, etc. What about
     things like __STDC__ etc? What is and what isn't allowed?
 * Special handling for "umbrella" headers, which just contain #import stmts:
   - Cocoa.h/AppKit.h - Contain pointers to digests instead of entire digests
     themselves? Foundation.h isn't pure umbrella!
 * Frameworks digests:
   - Can put "digest" of a framework-worth of headers into the framework
     itself. To open AppKit, just mmap
     /System/Library/Frameworks/AppKit.framework/"digest", which provides a
     symbol table in a well defined format. Lazily unstream stuff that is
     needed. Contains declarations, macros, and debug information.
   - System frameworks ship with digests. How do we handle configuration
     information? How do we handle stuff like:
       #if MAC_OS_X_VERSION_MAX_ALLOWED >= MAC_OS_X_VERSION_10_2
     which guards a bunch of decls? Should there be a couple of default
     configs, then have the UI fall back to building/caching its own?
   - GUI automatically builds digests when UI is idle, both of system
     frameworks if they aren't available in the right config, and of app
     frameworks.
   - GUI builds dependence graph of frameworks/digests based on #imports. If a
     digest is out of date, dependent digests are automatically invalidated.
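
 The invalidation rule in the last item is a reachability walk over the
 reversed #import graph. A minimal Python sketch, assuming the graph is
 available as a plain mapping from each framework to the frameworks it
 #imports (the function name and the example edges used with it are
 hypothetical, not real framework dependencies):

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set

def invalidated_digests(imports: Dict[str, Iterable[str]],
                        out_of_date: str) -> Set[str]:
    """Return `out_of_date` plus every digest that transitively imports it.

    `imports` maps each framework to the frameworks it #imports.
    """
    # Reverse the edges: framework -> frameworks that import it directly.
    dependents = defaultdict(set)
    for framework, deps in imports.items():
        for dep in deps:
            dependents[dep].add(framework)

    # Breadth-first walk from the stale digest over its dependents.
    stale = {out_of_date}
    worklist = deque([out_of_date])
    while worklist:
        current = worklist.popleft()
        for user in dependents[current]:
            if user not in stale:
                stale.add(user)
                worklist.append(user)
    return stale
```

 Given illustrative edges such as Cocoa importing AppKit and Foundation, and
 AppKit importing Foundation, marking AppKit out of date would invalidate
 AppKit and Cocoa but leave Foundation alone.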

 * New constraints on #import for objc-v3:
   - #imported file must not define non-inline function bodies.
     - Alternatively, they can, and these bodies get compiled/linked *once*
       per app into a dylib. What about building user dylibs?
   - Restrictions on ObjC grammar: can't #import the body of a for stmt or fn.
     - Compiler must detect and reject these cases.
   - #defines defined within a #import have two behaviors:
     - By default, they escape the header. These macros *cannot* be #undef'd
       by other code: this is enforced by the front-end.
     - Optionally, user can specify what macros escape (whitelist) or can use
       #undef.

New language feature: Configuration queries:
  - Instead of #ifdef __POWERPC__, use "if (strcmp(`cpu`, __POWERPC__))", or
    some other, better, syntax.
  - Use it to increase the number of "architecture-clean" #import'd files,
    allowing a single index to be used for all fat slices.
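
  One way to picture "architecture-clean", and the earlier item about tracking
  which "configuration" macros a #import'd file consults, is a scan of the
  conditional directives in a header. The Python sketch below is only an
  approximation under stated assumptions: the function names are hypothetical,
  the caller supplies the list of target macros, and line continuations,
  comments, and macros used outside of conditionals are all ignored.

```python
import re
from typing import Iterable, Set

# Start of a conditional directive; captures the rest of the line.
_CONDITIONAL = re.compile(r"^\s*#\s*(?:if|ifdef|ifndef|elif)\b(.*)$")
_IDENTIFIER = re.compile(r"[A-Za-z_]\w*")

def config_macros_used(header_text: str,
                       config_macros: Iterable[str]) -> Set[str]:
    """Return the members of `config_macros` that appear in the header's
    #if / #ifdef / #ifndef / #elif lines."""
    wanted = set(config_macros)
    used: Set[str] = set()
    for line in header_text.splitlines():
        match = _CONDITIONAL.match(line)
        if match:
            names = _IDENTIFIER.findall(match.group(1))
            used.update(name for name in names if name in wanted)
    return used

def is_architecture_clean(header_text: str,
                          target_macros: Iterable[str]) -> bool:
    """A header is 'clean' here if no conditional mentions a target macro."""
    return not config_macros_used(header_text, target_macros)
```

  A header for which is_architecture_clean() holds could, in principle, share
  one index across all fat slices; one that tests __POWERPC__ or __i386 would
  need per-target handling, as noted in the Fast #Import section.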

Cocoa GUI Front-end:
 * All.
 * Start with very simple "textedit" GUI.
 * Trivial project model: list of files, list of cmd line options.
 * Build simple developer examples.
 * Tight integration with compiler components.
 * Primary advantage: batch compiles, keeping digests in memory, dependency mgmt
   between app frameworks, building code/digests in the background, etc.
 * Interesting idea: http://nickgravgaard.com/elastictabstops/