The Open Legislation Platform: Massively Indexed

Today Jared and I were able to pull everything together on the processor end. Mostly this just consisted of bug fixes and some integration coding. Jared was then kind enough to produce a (horrifying) schema and test file from the OpenLegislation database so we could test things out.

After shaking a couple more bugs out, this turned out to be a largely successful endeavor. Jared was then able to give me a dump of about 1000 bills with embedded actions and votes, which I was able to feed through to get an idea of performance.

Basic Performance

Processing turned the 1000 files into about 4200 documents, with an average processing time of ~13 milliseconds per file and ~3 milliseconds per document. It should be noted that this is on my VBox with 700MB of general memory while running Eclipse and Firefox. In general, these averages are skewed high by several outliers, all of them documents with long, multi-page texts. Such expansive texts appear to create significant bottlenecks in the XML parsing.
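As a quick sanity check on how the two averages relate, here is a minimal Python sketch. The function name is made up for illustration and is not part of the processor; the inputs are the rounded figures quoted above.

```python
def per_document_ms(files, documents, ms_per_file):
    """Convert an average per-file time into an average per-document time."""
    total_ms = files * ms_per_file  # total wall time spent on the whole batch
    return total_ms / documents     # spread that time across the output documents


# Rounded figures from the test run: 1000 files, about 4200 documents, ~13 ms per file.
print(round(per_document_ms(1000, 4200, 13.0), 1))  # 3.1
```

So the ~3 millisecond per-document figure is just the per-file average divided by the roughly 4.2 documents each file produces.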

As you can see to the right, a small subset of extremely long bills is creating issues here. Excluding the slowest 1% of bills cuts the average processing time in half, allowing for processing of well over 100 documents a second.
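To illustrate how a handful of slow bills can drag the average up, here is a small Python sketch of an average that drops the slowest fraction of timings. The helper name and the sample timings are invented for the example; they are not the measurements from this run.

```python
def mean_excluding_slowest(timings_ms, fraction=0.01):
    """Average the timings after dropping the slowest `fraction` of them."""
    if not timings_ms:
        raise ValueError("no timings given")
    drop = round(len(timings_ms) * fraction)       # how many outliers to discard
    kept = sorted(timings_ms)[:len(timings_ms) - drop]
    if not kept:
        raise ValueError("fraction discards every timing")
    return sum(kept) / len(kept)


# Made-up data: 99 ordinary bills at 6 ms each and one very long bill at 706 ms.
timings = [6.0] * 99 + [706.0]

print(sum(timings) / len(timings))       # 13.0 (plain average)
print(mean_excluding_slowest(timings))   # 6.0  (slowest 1% removed)
```

With this invented data a single long bill out of a hundred is enough to double the plain average, which is the same kind of skew described above.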

Not So Bad

Going forward, I'll profile the processing to see where the bottlenecks are (aside from XML loading) and look at streamlining those areas. Even so, this test run seems reasonable enough, and I think optimization can be pushed off down the road.

Time to hook this up to the front end and get a few thousand more bills.

Monday, November 15, 2010
