Posts Tagged 'meta-programming'

Naive parsing techniques – part I

In this post, I’m considering a design for a simple tool that could help with the following:

  • Repackaging classes; this will involve replacing package and import declarations
  • Validating class dependencies
  • Generating code and TODO annotations from an incomplete specification
  • Tracking high level aspects of system architecture by extracting relevant data from program code

Looking at the above, we’d surely agree that some kind of API is required to take care of source analysis. This is where naive parsing comes into play.

If we needed an API to help resolve the above tasks unambiguously, we would also need to parse the source code reliably. However, writing parsers is overkill. In fact, just setting up and integrating with a parser might be quite a stretch.

Instead of going the full stretch, I’m proposing to write a naive parsing library. This will consist of several utility methods that scan source files, detect relevant code fragments with a high (not absolute) degree of confidence, and add to or modify source files.
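To make the idea concrete, here is a minimal sketch of what one such utility method might look like, applied to the repackaging task from the list above. The class name `NaiveRepackager` and the method `repackage` are hypothetical, chosen for illustration only. The rule is deliberately naive: it trusts that a line starting with `package` or `import` is a declaration, so a declaration sitting inside a block comment or a string literal would be rewritten as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class NaiveRepackager {

    // A line that starts with "package", "import" or "import static",
    // followed by a dotted name (optionally ending in ".*") and a semicolon.
    // Naive on purpose: comments and string literals are not recognized.
    private static final Pattern DECLARATION = Pattern.compile(
            "(?m)^([ \\t]*(?:package|import(?:\\s+static)?)\\s+)"
            + "([\\w.]+(?:\\.\\*)?)(\\s*;)");

    /** Replaces the prefix oldPkg with newPkg in package and import declarations. */
    static String repackage(String source, String oldPkg, String newPkg) {
        Matcher m = DECLARATION.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2);
            // Only rewrite whole package segments: "com.old" must not match "com.older".
            if (name.equals(oldPkg) || name.startsWith(oldPkg + ".")) {
                name = newPkg + name.substring(oldPkg.length());
            }
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + name + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling `NaiveRepackager.repackage(source, "com.old", "org.fresh")` rewrites `package com.old.app;` and `import com.old.util.Strings;` while leaving `import java.util.List;` and the rest of the file untouched. Nothing here knows Java's grammar, which is the point: a similar rule could plausibly be reused for other languages with comparable declarations.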

Why naive parsing?

I’ve already given half of the answer: parsing is expensive. Naive parsing functions will let me reuse the same techniques across similar languages. We’ll not only save time preparing the parsing facilities – we’ll save time every time we need to write another utility. That’s where ‘naive’ comes into play.

Most parsers somehow relate to the way we model language. However, there are strong smells indicating that parsers do not quite interpret text as we do:

  • Classic parsers do not usually integrate error correction. When reading text, we perform a mix of interpretation and error correction.
  • Given correct input, however complex, a classic parser will resolve that input into a parse tree of arbitrary depth and breadth. When reading text, we are easily overloaded when a sentence becomes too long and too complex.

Parsers, then, provide fail-fast, scalable solutions. Scalability is definitely a quality, but it comes at a cost – naive parsing consists of simple rules that are easy to come up with, whereas writing scalable parsing rules is an art. Fail-fast may or may not be a quality depending on the context. For example, we might want our utilities to correct simple mistakes, even if that means introducing another mistake from time to time. After all, our code will eventually get either interpreted or compiled, so we only want to make sure that our automations generate an overall reduction in development cost.

Isn’t 100% reliability a paramount requirement?

Yes, but not in the parser used by our utilities. We need 100% reliability in the final product – what our utilities are helping us build. In contrast, we only need our utilities to save more time than they cost. In this particular scenario (considering the originally suggested applications), three strategies will be used to achieve correct results using only a ‘mostly accurate’ parser.

  1. Simple, readable code. Simple, readable code is advocated everywhere. This suggests that parsers that allow very complex code – code that humans find hard or impossible to read – are too powerful. In short, we’ll agree to rewrite some of the code that the parser cannot read wherever this arguably increases the readability, simplicity and maintainability of the source.
  2. Semantic separation. As we’ll see later, a naive parser often fails to correctly analyze code that uses semantic overlaps. Semantic overlaps don’t really promote simple code, and need rarely occur in program code, although classic parsers often rely on very limited semantic separation – for example a very limited number of language keywords.
  3. Patches. We will allow defining patches that specify project-specific exceptions to the parsing process. This will be provided as a last resort, mostly used to deal with legacy code that cannot be modified.
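The third strategy can be sketched as follows, using the dependency validation task as an example. Everything here is an assumption made for illustration: the class `NaiveDependencyCheck`, the method `violations`, and the idea of representing a patch as a set of verbatim lines to ignore for a given file. The scanner reports every line that looks like a forbidden import; the patch silences a known false positive in a legacy file, such as an import left inside a block comment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class NaiveDependencyCheck {

    // A line that looks like an import declaration. Comments are not recognized.
    private static final Pattern IMPORT = Pattern.compile(
            "^[ \\t]*import(?:\\s+static)?\\s+([\\w.]+(?:\\.\\*)?)\\s*;");

    /**
     * Lists the imported names in source that start with forbiddenPrefix.
     * patches maps a file name to the trimmed lines the scanner must skip
     * in that file (a project-specific exception to the naive rule).
     */
    static List<String> violations(String fileName, String source,
                                   String forbiddenPrefix,
                                   Map<String, Set<String>> patches) {
        Set<String> ignored =
                patches.getOrDefault(fileName, Collections.<String>emptySet());
        List<String> found = new ArrayList<>();
        for (String line : source.split("\\R")) {
            if (ignored.contains(line.trim())) {
                continue; // patched: known false positive in this file
            }
            Matcher m = IMPORT.matcher(line);
            if (m.find() && m.group(1).startsWith(forbiddenPrefix)) {
                found.add(m.group(1));
            }
        }
        return found;
    }
}
```

Suppose `Legacy.java` contains a commented-out `import com.acme.ui.Window;` and a live `import com.acme.ui.Dialog;`. With an empty patch map and the prefix `com.acme.ui.`, both names are reported. With a patch mapping `Legacy.java` to the line `import com.acme.ui.Window;`, only `com.acme.ui.Dialog` remains. The alternative, in the spirit of the first strategy, would be to delete the commented-out import; the patch exists for files we are not allowed to touch.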

That’s all folks

Here’s a plan. Yes, I do have something concrete in mind, but if you read this post, I’ll be delighted to get feedback and hear that this has inspired ideas that have nothing to do with what I envision.

SourceFactor – no compromise

You could skip the best part of this article and visit sourcefactor.org – SourceFactor is an interface for processing arbitrary sources to arbitrary targets (no, really), and it comes with a nifty utility class that helps format the output. You can use it with build processes or you can integrate it with Java.

I wrote somewhere that languages are inextensible; inextensible they are. For proof:

  • C and C++ macros.
  • Java annotations

Macros are simple and potentially messy. Java isn’t messy. Instead, Java provides a pretty scary API for preprocessing. Because I annotate all my source with XML, I wasn’t impressed when annotations came along, and after a while I just sat there, shaking my head – after all, the idea of annotations is to save time on the so-called ‘boilerplate’ required by some APIs and frameworks.

That language designers provide patchy facilities to allow writing shortcuts shows just how far you can go with regular inheritance and re-usability constructs. The idea of writing domain specific languages isn’t new (check Persistence of Vision for some entertainment), and although it’s not for everybody, there are many, many advantages (Wikipedia has a decent article on ‘Language Oriented Programming’).

Yes, XML’s my cup of tea. One of the disadvantages of writing a domain specific language is that you need a parser. Writing a parser is an expensive black art. Leaving elegance behind and hitting the ground running, XML parses anytime, anywhere.
Ergo one of my favorite pastimes is embedding regular Java, ECMAScript or PHP within XML declarations and exporting regular source code.
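As a rough sketch of that pastime, assuming a made-up XML vocabulary (a `class` element with a `name` attribute, containing `method` elements with `name` and `returns` attributes and a Java body as text), the export step can be done with the XML parser that ships with the JDK. The class `XmlToJava` and its `export` method are hypothetical names; this is not the format used by any of the tools mentioned in this post.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical exporter, for illustration only.
final class XmlToJava {

    /** Turns the assumed class/method XML vocabulary into Java source text. */
    static String export(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            // Malformed XML or parser configuration problems end up here.
            throw new IllegalArgumentException("Cannot parse specification", e);
        }
        Element cls = doc.getDocumentElement();
        StringBuilder out = new StringBuilder();
        out.append("class ").append(cls.getAttribute("name")).append(" {\n");
        NodeList methods = cls.getElementsByTagName("method");
        for (int i = 0; i < methods.getLength(); i++) {
            Element m = (Element) methods.item(i);
            // The element text (plain or CDATA) is copied verbatim as the body.
            out.append("    ").append(m.getAttribute("returns")).append(' ')
               .append(m.getAttribute("name")).append("() {\n")
               .append("        ").append(m.getTextContent().trim()).append('\n')
               .append("    }\n");
        }
        out.append("}\n");
        return out.toString();
    }
}
```

Feeding it `<class name="Greeter"><method name="greet" returns="String"><![CDATA[return "hi";]]></method></class>` yields a `Greeter` class with a single `greet()` method whose body is the embedded Java. The XML layer carries the structure; the embedded code is never parsed, only copied.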

SourceFactor doesn’t bind you to XML sources. Nor does it tell you how to parse your input. But say you took on the challenge to write a simple, readable formal specification using a plain text editor, spreadsheet software or whatever you please. Well then, SourceFactor gives you a simple API that you can use to invoke your preprocessor from the command line. It is free, small, open source and convenient.

If a language were a violin, meta-programming would be playing without brushing the strings. Another day I’ll write about naive parsers and how hubris unleashed millions of lines of write-only code upon the world.

ee-xml

I set up a project for the ee-xml source editor at xp-dev.com.
Release date: 01/08/2009

ee-xml will benefit from 5 years of expertise in designing XML driven development solutions and will eventually replace Antegram for Java and Antegram for Web. Compared to Antegram, ee-xml will offer direct support for arbitrary XML data, implicit XML subset definition and enforcement, and per-element user actions and constructors.

To track this project, get yourself an account with xp-dev and request read permission by replying to this post, quoting your user name.

Why another XML editor

Existing XML editors target the following applications:

  • WYSIWYG – (mainly) technical documentations using XML/XSLT
  • UML
  • Schemas
  • XML data files

While existing solutions may be fairly suitable for the above tasks, we are nowhere near the ergonomics required to quickly and efficiently edit and navigate megabytes worth of user generated data scattered across interconnected XML files. This is where we come in.

Essentially, ee-xml will compete with regular IDEs as a software development solution:

  1. ee-xml will support development using OSM – develop for Java, C++ and other object oriented idioms while benefiting from the control offered by XML driven specifications without the hassle of plain text XML
  2. ee-xml will provide a one step solution to specifying applications while generating an underlying specification model. This programming technique works around the limitations of traditional languages and provides an attractive alternative to aspect programming (I’ll write about that) while allowing technically minded users to modify applications themselves.
  3. ee-xml will be cross-platform and extensible.
  4. Programming languages are, mostly, inextensible (I’ll explain, promise). XML isn’t. Inextensible means programming needn’t be fun, or profitable. Programming needs to change or I’ll change my job.

Contribute

To contribute, it’s OK to just reply to this post for now. At this stage, we are especially looking for project customers – if you have ideas about what an XML editor for programmers could do for you, then you are a potential contributor.

I recommend potential contributors have a look at Antegram for Web, as this prefigures ee-xml and will help you bark up the right tree.

In a couple of weeks, the project should be ready for developer contributions.


