In this post, I’m considering a design for a simple tool that could help with the following tasks:
- Repackaging classes; this will involve replacing package and import declarations
- Validating class dependencies
- Generating code and TODO annotations from an incomplete specification
- Tracking high-level aspects of system architecture by extracting relevant data from program code
Looking at the above, we’d surely agree that some kind of API is required to take care of source analysis. This is where naive parsing comes into play.
If we needed an API to resolve the above tasks unambiguously, we would also need to parse the source code reliably. However, writing a parser is overkill. In fact, just setting up and integrating with an existing parser might be quite a stretch.
Instead of going the full stretch, I’m proposing to write a naive parsing library. It will consist of several utility methods that scan source files, detect relevant code fragments with a high (but not absolute) degree of confidence, and add or modify source files.
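To make the idea more concrete, here is a minimal sketch of what such utility methods might look like for the repackaging task. It is written in Python and assumes Java-style `package` and `import` declarations that each sit on a single line; the names `find_package`, `find_imports` and `repackage` are hypothetical, not part of any existing library.

```python
import re

# Deliberately naive rules: a declaration is any line that starts with
# `package` or `import`, followed by a dotted name and a semicolon.
PACKAGE_RE = re.compile(r'^([ \t]*package[ \t]+)([\w.]+)([ \t]*;)', re.MULTILINE)
IMPORT_RE = re.compile(
    r'^([ \t]*import[ \t]+(?:static[ \t]+)?)([\w.]+(?:\.\*)?)([ \t]*;)',
    re.MULTILINE,
)


def find_package(source):
    """Return the declared package name, or None if none is detected."""
    match = PACKAGE_RE.search(source)
    return match.group(2) if match else None


def find_imports(source):
    """Return the imported names in order of appearance."""
    return [m.group(2) for m in IMPORT_RE.finditer(source)]


def repackage(source, old_prefix, new_prefix):
    """Rewrite package and import declarations that start with old_prefix."""
    def rename(match):
        name = match.group(2)
        if name == old_prefix or name.startswith(old_prefix + '.'):
            name = new_prefix + name[len(old_prefix):]
        return match.group(1) + name + match.group(3)

    return IMPORT_RE.sub(rename, PACKAGE_RE.sub(rename, source))


SAMPLE = """package com.old.app;

import com.old.util.Strings;
import java.util.List;

public class Main {}
"""

print(repackage(SAMPLE, "com.old", "org.fresh"))
```

Nothing here understands the language. The rules only recognise the shape of a line, which is why the result is highly likely to be correct rather than guaranteed.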
Why naive parsing?
I’ve already given half of the answer: parsing is expensive. Naive parsing functions will let me reuse the same techniques across similar languages. We’ll not only save time preparing the parsing facilities; we’ll save time every time we need to write another utility. That’s where ‘naive’ comes into play.
Most parsers somehow relate to the way we model language. However, there are strong smells indicating that parsers do not quite interpret text as we do:
- Classic parsers do not usually integrate error correction. When reading text, we perform a mix of interpretation and error correction.
- Given correct input, however complex, a classic parser will resolve that input into a parse tree of arbitrary depth and breadth. When reading text, we are easily overloaded when a sentence becomes too long and too complex.
Parsers, then, provide fail-fast, scalable solutions. Scalability is definitely a quality, but it comes at a cost: naive parsing consists of simple rules that are easy to come up with, whereas writing scalable parsing rules is an art. Fail-fast may or may not be a quality, depending on the context. For example, we might want our utilities to correct simple mistakes, even if that means introducing another mistake from time to time. After all, our code will eventually be either interpreted or compiled, so we only need to make sure that our automations yield an overall reduction in development cost.
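As an illustration of the "correct simple mistakes" idea, the sketch below repairs import lines that lack a trailing semicolon instead of rejecting the file. The rule itself is an assumption made for this example (a line made up only of `import` and a dotted name is treated as a typo), and `repair_imports` is a hypothetical name.

```python
import re

# Assumed rule: a line that holds only `import <dotted.name>` and no
# semicolon is a typo. This is a guess about intent, not a language rule.
MISSING_SEMICOLON_RE = re.compile(
    r'^([ \t]*import[ \t]+(?:static[ \t]+)?[\w.]+(?:\.\*)?)[ \t]*$',
    re.MULTILINE,
)


def repair_imports(source):
    """Append the missing semicolon; return (fixed_source, repair_count)."""
    return MISSING_SEMICOLON_RE.subn(r'\1;', source)


BROKEN = "import java.util.List\nimport java.util.Map;\nimport java.io.*\n"
fixed, count = repair_imports(BROKEN)
print(count)
print(fixed)
```

The same rule would also "repair" a declaration that is legitimately split across two lines, or a matching line inside a block comment. That is the kind of occasional new mistake we accept, as long as the compiler catches it later and the utility still saves time overall.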
Isn’t 100% reliability a paramount requirement?
Yes, but not in the parser used by our utilities. We need 100% reliability in the final product, the thing our utilities are helping us build. In contrast, we only need our utilities to save more time than they cost. In this particular scenario (considering the applications suggested at the start of this post), three strategies will be used to achieve correct results using only a ‘mostly accurate’ parser.
- Simple, readable code. Simple, readable code is advocated everywhere. This suggests that parsers that accept very complex code, code that humans find hard or impossible to read, are too powerful. In short, we’ll agree to rewrite some of the code that the parser cannot read, wherever this arguably increases the readability, simplicity and maintainability of the source.
- Semantic separation. As we’ll see later, a naive parser often fails to correctly analyze code that contains semantic overlaps. Semantic overlaps don’t really promote simple code and rarely need to occur in program code, although classic parsers often rely on very limited semantic separation, for example a very limited number of language keywords.
- Patches. We will allow defining patches that specify project-specific exceptions to the parsing process. This will be provided as a last resort, mostly to deal with legacy code that cannot be modified.
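The patch strategy could be sketched as follows. The example assumes one plausible reading of a semantic overlap (an `import` line sitting inside a block comment) and a hypothetical patch format that maps a file path to the line numbers the naive rule must ignore. Neither is a settled design.

```python
import re

IMPORT_RE = re.compile(r'^[ \t]*import[ \t]+([\w.]+)[ \t]*;')


def find_imports(path, source, patches=None):
    """Return (line_number, name) pairs for lines that look like imports.

    `patches` maps a file path to a set of 1-based line numbers to skip
    (a hypothetical, project-specific exception list).
    """
    skipped = (patches or {}).get(path, set())
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = IMPORT_RE.match(line)
        if match and number not in skipped:
            found.append((number, match.group(1)))
    return found


LEGACY = """package com.old.app;

import java.util.List;

/* Usage example kept from the original author:
import com.old.removed.Helper;
*/
public class Legacy {}
"""

# Without a patch, the commented-out line is reported as an import.
print(find_imports("Legacy.java", LEGACY))
# With a patch for line 6, only the real import remains.
print(find_imports("Legacy.java", LEGACY, {"Legacy.java": {6}}))
```

Line-number patches are brittle, since they break as soon as the file changes. That is acceptable only because patches are meant for legacy code that nobody edits; for code that can be modified, moving the commented-out line is the simpler fix.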
That’s all folks
So that’s the plan. Yes, I do have something concrete in mind, but if you’ve read this post, I’ll be delighted to get feedback, and to hear that it has inspired ideas that have nothing to do with what I envision.