Posts Tagged 'Parsing'

Weak AI to Step against Context Free Grammars?

This article stems from a practical problem. If you’re worried I’ll be barking up my own tree, you are right. Otherwise, read on and bark along.

My new code editor is now available for trying out. ee-ide doesn’t boast much for now, but the underlying application platform, Kippu (?Kiphu) – based on a trial F* implementation, is more than promising, and I feel encouraged given a mere 52 hours logged into my project manager.

When quoting that I write development tools, I often meet a do we need another IDE shrug kinda thing. People just quote Eclipse and Netbeans, or backtrack to IntelliJ.

I beg to differ. Have a look at Antegram for web or Antegram for java and you’ll get a feel of how alluringly different an IDE can be.

Now to the point. ee-ide is a kind of prequel to Antegram. Since I dropped XML annotated code for now (just for now), I may have to give in a little to the establishment:

  • Syntax highlighting
  • Code validation
  • Hinted typing (e.g: displaying a drop down menu showing all functions available from a given variable)

I can’t really dedicate time to each and every language out there, and a soon to be, fast and ready extension framework won’t allow me to just sit and wait (for somebody else to integrate parsing functionality on a case basis). Instead, I’ll be looking for simple analysis techniques to help me providing fairly reliable services. For example:

  • Discovering language tokens – An example of that is when a token appears in all, or nearly all, files of a given type.
  • Semantic closure – a name occuring only once in a file is a sign that something’s going on. It could be that the name hasn’t been spelled correctly, or it could be that the name is defined elsewhere (for example, in a file referred by name in the current source file)
  • Terminals – given correct productions in a given language, we can surely evaluate with fair confidence what does (and doesn’t) count as a valid line terminal.
  • Brackets – even if we don’t know what bracketting style is used outfront, we can use valid code to detect ‘immutable pairs’ that constitute brackets in a given language.
  • Foreign forms – just today, a colleague pointed out my giving into old habits by declaring variables using C++/Java syntax. Unfortunately I meant writing AS3 code. This kind of error shouldn’t be too hard to spot, even without language specific knowledge.

I’ll definitely have some fun trying to write reliable solutions using generic strategies, and I hope the result will match my expectations.

Finally comes the question: why should somebody use ‘rough and ready’ technology while the competition uses strict parsers that get it right every time? Three answers to that:

  1. Rules versus Practice – where a strict parser lets you write any code, a soft parser willl potentially endorse good practices – what might be perfectly OK according to a parser may amount to poorly formatted, awkward code, and the soft parser will let you know.
  2. Graceful performance degradation – even if your code is a weird avatar of an existing language that doesn’t really compile until it gets processed in some way, or code written in a really obscure language, a soft parser is more likely to discover (and respect) patterns idiosyncratic to your source files without giving up on validation altogether.
  3. It’s a N-body problem - never mind the fact that F*(and some other stuff coming really soon) will make extending ee-ide and other Kippu apps a gentle breeze, I’ll be iterating the proof that 10 years off the beaten track likely got me to be a happier and more productive developer than average. This time I’ll have pretty standard features to lure you in and the quixotic, yet commonsense stuff that’s motivated the Antegram iterations in the first place plus other stuff that mainstream IDEs don’t really bother bringing up, given that dominating the market isn’t the biggest incentive for innovation.

If you’re not deeply shocked by my lack of excitement and interest for regular parsers

…wish me good luck.

Naive parsing techniques – part I

In this post, I’m considering a design for a simple tool that could help doing the following:

  • Repackaging classes; this will involve replacing package and import declarations
  • Validating class dependencies
  • Generating code and TODO annotations from an incomplete specification
  • Tracking high level aspects of system architecture by extracting relevant data from program code

Looking at the above, we’d surely agree that some kind of API would be required to take care of source analysis. This is where naive parsing comes in play.

If we needed an API to help resolve the above tasks unambiguously, we would also need to parse the source code reliably. However, writing parsers is overkill. In fact, just setting up and integrating with a parser might be quite a stretch.

Instead of going the full stretch, I’m proposing to write a naive parsing library. This will consist in several utility methods allowing to scan source files, detect relevant code fragments with a high (not absolute) degree of confidence and add/modify source files.

Why naive parsing?

I’ve already given half of the answer: parsing is expensive. Using naive parsing functions will allow my reusing the same techniques across similar languages. We’ll not only be saving time preparing the parsing facilities – we’ll actually save time every time we need to write another utility. That’s where ‘naive’ comes in play.

Most parsers somehow relate to the way we model language. However, there are strong smells indicating that parsers do not quite interpret text as we do:

  • Classic parsers do not usually integrate error correction. When reading text, we are performing a mix of interpretation and error correction
  • Given correct input, however complex, a classic parser will resolve that input into a parse tree of arbitrary depth and breadth. When reading text, we can easily overload when a sentence becomes too long and too complex.

Parsers, then, provide fail-fast, scalable solutions. Scalability is definitely a quality, but comes at a cost – naive parsing consists in simple rules that are easy to come up things, whereas writing scalable parsing rules is an art. Fail-fast may or may not be a quality depending on the context. For example, we might want our utilities to correct simple mistakes, even if that means introducing another mistake from time to time. After all, our code will eventually get either interpreted or compiled, so we only want to make sure that our automations generate an overall reduction in development cost.

Isn’t 100% reliability a paramount requirement?

Yes, but not in the parser used by our utilities. We need %100 reliability in the final product – what our utilities are helping us build – in contrast, we only need our utilities to save more time than they cost. In this particular scenario (considered the originally suggested applications), three strategies will be used to achieve correct results using only a ‘mostly accurate’ parser.

  1. Simple, readable code. Simple, readable code is advocated everywhere. This suggests that parsers that allow very complex code – code that humans find hard or impossible to read – are too powerful. In short, we’ll agree to rewrite some of the code that the parser cannot read wherever this arguably increases the readability, simplicity and maintainability of the source.
  2. Semantic separation. As we’ll see later, a naive parser often fails to analyze correctly code that uses semantic overlaps. Semantic overlaps don’t really promote simple code, and need rarely occur in program code, although classic parsers often rely on very limited semantic separation – for example very limited number of language keywords.
  3. Patches. We will allow defining patches that determine project specific exceptions to the parsing process. This will be provided as an ultimate measure, mostly used to deal with legacy code that cannot be modified.

That’s all folks

Here’s a plan. Yes, I do have something concrete in mind, but if you read this post, I’ll be delighted to get feedback and hear that this has inspired ideas that have nothing to do with what I envision



Follow

Get every new post delivered to your Inbox.