Monday, November 15, 2010

Massively Indexed

Today Jared and I were able to pull everything together on the processor end. Mostly this consisted of bug fixes and some integration coding. Jared then kindly produced a (horrifying) schema and test file from the OpenLegislation database so we could test things out.

After shaking a couple more bugs out, this turned out to be a largely successful endeavor. Jared then gave me a dump of about 1000 bills with embedded actions and votes, which I fed through to get an idea of performance.

Basic Performance

Processing turned the 1000 files into about 4200 documents with an average processing time of ~13 milliseconds per file and ~3 milliseconds per document. It should be noted that this is on my VBox with 700MB of general memory while running Eclipse and Firefox. In general, these averages are skewed high by several outliers, all of them documents with long, multi-page texts. Such expansive texts appear to create significant bottlenecks in the XML parsing.

As you can see to the right, there is a small subset of extremely long bills creating issues here. Excluding the slowest 1% of bills cuts the average processing time in half, allowing for processing of well over 100 documents a second.
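To make the outlier effect concrete, here is a minimal Python sketch of how a handful of slow files can double an average. The timings are invented for illustration (they are not the measured data), and `trimmed_mean` is a hypothetical helper, not part of the Platform.

```python
from statistics import mean

def trimmed_mean(times_ms, drop_fraction=0.01):
    """Average the samples after discarding the slowest `drop_fraction`."""
    ordered = sorted(times_ms)
    drop = int(len(ordered) * drop_fraction)
    kept = ordered[:len(ordered) - drop] if drop else ordered
    return mean(kept)

# Invented timings: 990 ordinary files and 10 very long ones (the slowest 1%).
times = [6.5] * 990 + [656.5] * 10

overall = mean(times)          # average over every file
typical = trimmed_mean(times)  # average with the slowest 1% excluded
```

With these made-up numbers, ten slow files out of a thousand are enough to move the average from 6.5 ms to 13 ms, which is why the mean alone can give a misleading picture of typical performance.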

Not So Bad

Going forward I'll profile the processing to see where the bottlenecks are (aside from XML loading) and look at streamlining those areas. Even so, this test run seems reasonable enough and I think optimization can be pushed off down the road.

Time to hook this up to the front end and get a few thousand more bills.

Plugin Flags and Processors

This weekend also saw the complete rewrite of the processing engine to be pluggable. Now you can easily drop in custom flags and processors via a simple configuration file. To make sure it worked and was powerful enough, all basic internal flags and processors are now added in this way.
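As a rough illustration of the general pattern (not the Platform's actual interfaces or configuration format), a pluggable pipeline can be sketched in Python as a registry of named processors plus a configuration that lists which ones to run. Every name here (`PROCESSORS`, `processor`, `load_pipeline`, the one-name-per-line config) is an assumption made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps the name used in configuration to a processor.
PROCESSORS: Dict[str, Callable[[dict], dict]] = {}

def processor(name: str):
    """Decorator that registers a function under a configuration name."""
    def register(func):
        PROCESSORS[name] = func
        return func
    return register

@processor("strip_values")
def strip_values(doc: dict) -> dict:
    # Trim surrounding whitespace from string values only.
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}

@processor("lowercase_keys")
def lowercase_keys(doc: dict) -> dict:
    return {k.lower(): v for k, v in doc.items()}

def load_pipeline(config_text: str) -> List[Callable[[dict], dict]]:
    """Each non-blank, non-comment line names one processor, run in order."""
    names = [line.strip() for line in config_text.splitlines()]
    names = [n for n in names if n and not n.startswith("#")]
    return [PROCESSORS[n] for n in names]  # raises KeyError for unknown names

def run(pipeline, doc: dict) -> dict:
    for step in pipeline:
        doc = step(doc)
    return doc

# Made-up configuration format, for illustration only.
CONFIG = """
# processors run top to bottom
strip_values
lowercase_keys
"""

result = run(load_pipeline(CONFIG), {"Title": "  An Act  "})
```

Registering the built-in steps through the same mechanism as custom ones, as described above, is a cheap way to check that the plugin surface is expressive enough.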

They are still kind of rough in implementation right now. Additional use cases will better inform us as to what these processors and flags need to do and how best to expose that functionality. But now it's baked in, so the ideas can get early exposure and evolve as needed. I've included below the basic interfaces for processors and flag nodes as well as the current processor configuration file:

The Interfaces:


The Configuration File:

Introducing Schemas

A lot of work went into the Platform this weekend, thanks in part to my new contributor Jared Williams. While I wound up writing the majority of the code, conversations with him greatly helped clarify the goals and structure of the platform.

One of the most important things to come out of this was a separation of the flags from the document into a separate "schema" structure that dictates the interpretation of the input document's data structure/values via attribute flags on the nodes. These flags are then applied to the corresponding nodes in the input documents to produce SOLR output documents and their corresponding XML serializations.
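The idea of a schema that mirrors the input and carries the flags can be sketched roughly as follows. This is only an illustration under stated assumptions: the `flag` attribute, its values `index` and `branch`, and the element names are all invented here, and plain dictionaries stand in for SOLR documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: same shape as the input, with a made-up "flag" attribute.
SCHEMA = ET.fromstring("""
<bill>
  <title flag="index"/>
  <sponsor flag="index"/>
  <vote flag="branch">
    <voter flag="index"/>
  </vote>
</bill>
""")

# Input document: carries data only, no flags.
INPUT = ET.fromstring("""
<bill>
  <title>An Act</title>
  <sponsor>Adams</sponsor>
  <vote><voter>Alesi</voter></vote>
  <vote><voter>Adams</voter></vote>
</bill>
""")

def apply_schema(schema_node, doc_node, documents):
    """Walk schema and input in parallel, appending one dict per document."""
    fields = {"type": doc_node.tag}
    for schema_child in schema_node:
        flag = schema_child.get("flag")
        for doc_child in doc_node.findall(schema_child.tag):
            if flag == "index":
                # Indexed nodes become (multi-valued) fields on this document.
                text = (doc_child.text or "").strip()
                fields.setdefault(doc_child.tag, []).append(text)
            elif flag == "branch":
                # Branched nodes become output documents of their own.
                apply_schema(schema_child, doc_child, documents)
    documents.append(fields)

documents = []
apply_schema(SCHEMA, INPUT, documents)
```

Here one input file yields three output documents (two votes and the bill itself), and the input stays free of any processing instructions because they all live in the schema.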
 
In the next few days we'll be generating a large sample input set, corresponding schemas, and a front end API to deliver results. This will allow us to put the platform to the test and give me something to push up as a demo on the ec2 server I've set up.

In the meantime, example documents have been included below:

Example Schema:



Example Document:



Example Output:

Thursday, November 11, 2010

Barbara Liskov Lecture

Barbara Liskov came to give her Turing Award lecture today and maybe it's just me, but I thought she was awesome. If she somehow winds up reading this:

Thank you for coming and thank you for speaking with me after

I was the guy with the striped button up and the horrible facial hair.

She said a lot of things that seemed to me to have a lot of merit, but I wanted to pick a couple of them out here so I don't forget them. Before I do though, remember that these are my words, and while they are based on what (I remember) she said:
 
I could be misrepresenting her.

That being said, here we go!

Failure exceptions!

There are two main kinds of exceptions:
  • Exceptions that are handled (recovery actions are taken)
  • Exceptions that are propagated (exception is thrown up the stack)
But there is also a 3rd class of exceptions: those that are unexpected. Programs need to be able to recover from these exceptions and often, in more traditional exception systems, doing so is either ugly or impossible.

By grouping all unexpected exceptions together under a single failure exception, a language can be designed to handle execution failure gracefully.
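One way to picture this, sketched in Python: a routine declares the exceptions it expects to signal, and anything else that escapes is converted into a single failure type. The names `Failure` and `signals` are made up for this example and are only one reading of the idea, not something shown in the lecture.

```python
import functools

class Failure(Exception):
    """Hypothetical catch-all for exceptions a routine did not declare."""

def signals(*declared):
    """Let the declared exception types pass; convert the rest to Failure."""
    def wrap(func):
        @functools.wraps(func)
        def guarded(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except declared:
                raise  # expected: the caller handles or propagates it
            except Failure:
                raise  # already a failure, pass it along unchanged
            except Exception as unexpected:
                # Unexpected: fold into the one failure type, keeping the cause.
                raise Failure(f"{func.__name__}: {unexpected!r}") from unexpected
        return guarded
    return wrap

@signals(KeyError)
def lookup(table, key):
    return table[key] // table["divisor"]
```

A caller then needs to plan for only two things: the declared `KeyError`, and `Failure` for everything else (here, a zero divisor), with the original exception still reachable through `__cause__`.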

Inheritance is not Useful

In the middle of the talk she made the interesting statement that she didn't think that inheritance was very useful or interesting. After the talk I asked her why and while I can't recall her reasoning at the moment she did respond with roughly the following:

"Instead of having inheritance, give the object a pointer to a different object which has the methods you wish to inherit."

I'm not entirely clear on her meaning and rationale here, and out of respect for the questions others had I was unable to follow up, but I think it came down to two things:
  • Preservation of Object Encapsulation
  • Clear separation from Type Hierarchies
Inheritance tends to mix code reuse and type hierarchies together in ways that can work counter to expectations; particularly when you allow method overriding (I'm sure she had other things in mind). Additionally, through this indirection, I think that she intended that encapsulation of the "component" objects be preserved. She was very consistent that she believed encapsulation should not be broken for any reason and that doing so likely introduced flaws and side effects.
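A minimal Python sketch of that suggestion, with invented class names: instead of subclassing, the object holds a reference to another object and forwards the calls it wants to reuse.

```python
class Logger:
    """Component whose behaviour we want to reuse."""
    def __init__(self):
        self._lines = []

    def log(self, message):
        self._lines.append(message)

    def history(self):
        return list(self._lines)  # hand out a copy; the list stays private

class BillProcessor:
    """Reuses Logger by holding a pointer to one, not by inheriting from it."""
    def __init__(self, logger):
        self._logger = logger

    def log(self, message):
        # Explicit forwarding: only Logger's public interface is touched.
        self._logger.log(message)

    def process(self, bill_id):
        self.log(f"processed {bill_id}")
        return bill_id.upper()

logger = Logger()
bill_processor = BillProcessor(logger)
processed = bill_processor.process("s100")
```

`BillProcessor` gets the logging behaviour without becoming a `Logger` in the type hierarchy and without any access to `Logger`'s internals, which seems to be the encapsulation point being made.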


Type Hierarchies Are Useful

On the other hand, she believes that type hierarchies are very useful in allowing abstraction of types on the programming level as well. This came up several times in her talk and she stuck to her guns.

After the talk she pointed to Java Interfaces as an example (perhaps not the ideal one though) of type hierarchies as separate from inheritance and code reuse and reaffirmed that she believed this separation to be quite important.
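The closest Python analogue to a Java interface is probably an abstract base class with no implementation in it. In the sketch below (all names invented), `Serializer` defines a type and nothing else, so the two implementations share a type but no code. Python still spells this with subclass syntax, which is part of why it is only an approximation.

```python
from abc import ABC, abstractmethod

class Serializer(ABC):
    """A pure type: declares behaviour and carries no reusable code."""
    @abstractmethod
    def serialize(self, fields: dict) -> str:
        ...

class XmlSerializer(Serializer):
    def serialize(self, fields: dict) -> str:
        return "".join(f"<{k}>{v}</{k}>" for k, v in fields.items())

class KeyValueSerializer(Serializer):
    def serialize(self, fields: dict) -> str:
        return ";".join(f"{k}={v}" for k, v in fields.items())

def render_all(serializers, fields):
    # Callers depend only on the Serializer type, never on a concrete class.
    return [s.serialize(fields) for s in serializers]

rendered = render_all([XmlSerializer(), KeyValueSerializer()], {"id": "S100"})
```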


Reflection is Considered a Shortfall

The subjects of duck typing and programming by convention came up after the talk, and she flat out called reflection horrible. Reflection breaks the encapsulation of objects, and using it represents a shortcoming of the language that you are programming in. This was said in the sense that if the language were properly designed to do what you are trying to do, reflection would be replaced by a more robust type system.

Instead of being reflected at runtime, such duck typing should be statically determined through explicit type hierarchies. These static type hierarchies can be resolved at compile time with static analysis, which she believes to be a fundamental advantage as well.

I have some reservations regarding this point. Mostly because I can't figure out how a language could be designed to allow me to, for example, call all methods of a given object that begin with `test*`. On the other hand, maybe it can't be done and it instead reflects poor program design. Hrm.
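For concreteness, here is what the reflective version looks like in Python next to one possible non-reflective alternative, where the tests to run are listed explicitly. The class and function names are invented for the example, and the explicit variant is just one guess at what "doing without reflection" could mean here.

```python
class BillTests:
    def test_title(self):
        return "title ok"

    def test_sponsor(self):
        return "sponsor ok"

    def helper(self):
        return "not a test"

def run_by_reflection(obj):
    """Discover methods by name at runtime and call each one."""
    names = sorted(name for name in dir(obj) if name.startswith("test"))
    return [getattr(obj, name)() for name in names if callable(getattr(obj, name))]

def run_explicit(tests):
    """No discovery: the caller states exactly which callables are tests."""
    return [test() for test in tests]

suite = BillTests()
reflected = run_by_reflection(suite)
explicit = run_explicit([suite.test_title, suite.test_sponsor])
```

The explicit version costs one line per test, but a misspelled name fails loudly instead of silently dropping a test from the run.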

Syntactic Change Considered Harmful

Simply put: ease of reading programs must take precedence over ease of writing programs. It therefore follows that in order for reading to be easy, syntax must be consistent, both in appearance and, more importantly, in meaning. She believes that while being able to change the syntax of the language you are programming in while you are programming in it may be seductive, you will almost always regret it later. Certainly others who read your code will.

Wrapping it up

So anyway, I found her perspective quite thought provoking and talking with her after gave me some new perspectives. I'm unsure on my own opinions regarding some of her statements but I am glad that she came to RPI and shared them with me.

Moving in to ec2

Moving to ec2

Last night I decided to take advantage of Amazon's offer of 1 free* year of AWS services, and since I'm not actually averse to long domain names I acquired http://openlegislationplatform.org.

Let me take a minute to say that Amazon's AWS Management Console provides an excellent user experience. Combined with a simple walkthrough, I had an Ubuntu instance up and running with a static IP address and shell access in minutes. I was able to use my current host Dreamhost (also an excellent user experience) to do the DNS hosting for me. I just added a type A record pointing to 50.16.246.197 and was live within minutes.

One of the cool things about ec2 is that I can use key pairs instead of credential/password pairs to handle access control, and as such there are no passwords in ec2 land. Having almost exclusively used password protected, shared hosting in the past, the change of experience is great. I did have several issues uploading my RSA keys to Amazon, but once I gave up and just accepted the ones they generated for me, things went pretty smoothly.

The New Setup

Not content to only start 1 new thing at once, I decided to change up the server stack now that I had more freedom. The general plan was:
  • Varnish on port 80 serving
    • Python + ??? for static(ish) info pages
    • Passenger + Redmine
    • Tomcat + Solr and the Platform
    • Lighttpd or Nginx + Firmant for blog
Depending on my motivation levels, I might swap my python/Firmant sections out for Sphinx which has a more complete base of modules and support documentation (sorry Rob). We'll see what I have time and energy for.

Varnish

Varnish was surprisingly simple and easy to use with straight to the point and informative documentation. It worked right out of the box with this simple configuration and run command:



Redmine

Installing Redmine was a bit more difficult since I was deploying to a subdirectory `/redmine`. In theory the fix is simple, but in practice I kept getting an ActionController::RoutingError that I could not seem to resolve for each of my javascript/stylesheet resources. I then switched Mongrel in for Passenger and received ActionController::AbstractRequest errors instead. I believe this particular issue came up because Rails 2.3.X has removed the AbstractRequest class. Luckily there was a quick little hotfix in the form of an initializer that fixed things up for me. In the end the setup runs/stops with the following commands, which I've aliased in .bashrc:



It's a good start

I'm probably taking a break from this sys admin stuff for a while to work on the platform again, but I needed to get Redmine in order for the Roadmap/Issues/Wiki for developer communication. Small steps.

Sunday, November 7, 2010

Prototype and Presentation

Last Friday, just minutes before my presentation, I completed a (very) rough prototype of the platform. It successfully processes input XML files into fully indexed SOLR documents with pre-serialized representations attached. As such, I was able to give a very short demo and talk a bit about future plans.

Unfortunately the demo was too short to really demonstrate how powerful the simple process was. Following a very simple protocol, an object was indexed across every available field and sub-objects were branched off into their own fully indexed documents. These multiple documents and full indexes pushed into SOLR now allow for fairly sophisticated queries:
  • Find all bills sponsored by Senator Adams and voted against by Alesi.
  • Find all bills moved to 3rd Reading in the last 7 days.
  • Find all bills unanimously approved in May 2010.
  • Find all actions on all bills voted down last session.
This might not seem so impressive, and indeed it's not (SOLR does all the heavy lifting), but it's worth noting that this process would work equally well regardless of the complexity and structure of the incoming documents. Now that this proof of concept is done, I can begin to build on this platform to code best practices and enhancements into the processing.

I'm going to push some of the Open Legislation data through the system sometime this week and put it up as a demo application. Hopefully this process will help me refine the set of XML input flags and shake out bugs.