Jythons and Javas and bears, oh my!

April 11, 2008

It’s hard to believe but I’ve been at the new job for six months already, a full half-year come the 29th. Some days it seems like I’ve been here forever; others like I’m still a rank newb. I haven’t written terribly much about what I’ve been up to (but I assure you I’ve been busy). Let me rectify that.

The Transfer Problem

Two of the projects I’ve been working on relate to a fairly general problem that we like to call “transfer,” which revolves around, well, transferring files to and fro. Sounds simple. Is simple. That is, until you start thinking about preservation and accounting for a highly heterogeneous network with idiosyncratic nodes, esoteric storage software, and differential firewall rules. And that’s where it gets interesting (and problematic). The transferring itself, that is, the copying of files from one location to another, which we call “transport,” is the easiest part. We like to use common tools in our environment. It makes life easy. And so good ol’ scp seems like an obvious choice to handle the job.

Since preservation is a core aspect of our “repository,” a term which I use loosely, we must build certain other functionalities into the transfer process: validation, verification, inventory, backup, ingest, and so forth. Every time a file is copied to a non-transient location, we verify the file against a (SHA1) checksum and record an event for auditing purposes.
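To make the verify-and-record step concrete, here is a minimal sketch of what it might look like. The function names, the shape of the audit event, and the in-memory list standing in for an audit store are all hypothetical, not our actual code, and it targets a current CPython rather than Jython 2.2.1.

```python
import hashlib
import time


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_record(path, expected_sha1, audit_log):
    """Check a copied file against its expected checksum and log the result.

    `audit_log` is a plain list here (an assumption); a real system would
    write the event to whatever audit store it uses.
    """
    expected = expected_sha1.lower()
    actual = sha1_of(path)
    event = {
        "event": "fixity-check",  # hypothetical event name
        "path": path,
        "expected": expected,
        "actual": actual,
        "ok": actual == expected,
        "timestamp": time.time(),
    }
    audit_log.append(event)
    return event["ok"]
```

The event is recorded whether or not the check passes, since a failed verification is exactly the sort of thing an audit trail should capture.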

Repository Workflows

Some steps in the transfer process are routine and best handled by machines. Thus we automate them with scripts and code. Others require human intervention. This introduces another key aspect of our repository needs: workflows. No two projects will have the same workflow and yet all will have some steps in common. We’re using JBoss’s jBPM library for workflow management and it is more than capable of handling our workflow needs. It allows us to model complex and varied flows in a robust and not ad hoc way; it does seem preferable to me to model our workflows via jBPM’s graphical editor (serialized to JPDL XML) rather than copying around blocks of code and otherwise modeling the workflow procedurally in business logic.

One of my coworkers (the author of this) designed a complete workflow system in jBPM last summer and I’ve taken on implementing, tweaking, and testing this system, which has required bootstrapping my sorry-arse Java skills and learning jBPM. Though I find it difficult to think in Java patterns and generally find it a burdensome environment, I’m quite impressed by jBPM. I’ve been working on various updates of and unit test coverage for the workflow system, which has been a crash course in a number of Java technologies and a perfect first task at LC, as it gets me into the guts of things. The Java stack we use for our workflow is highly abstracted and componentized, which is conveniently modular… but it’s also Java: fairly heavy and arguably not as agile as dynamic languages such as Python.

Two Great Tastes?

So recently we’ve begun to think about implementing some transfer and workflow components in Jython (Python written in Java). Why? The value of Jython is as follows:

Ease of deployment - Deploying jar files to existing JVMs in production environments (which we do not control) is a simple task, or at least simpler than some other options.
Interoperability - Our stack is primarily Java-based and so interoperating with existing Java components means not having to rewrite functionality in other languages. Jython allows Python to talk to Java and vice versa.
Familiarity - It's Python, and we like Python. It's the closest my team has to a lingua franca and so it increases the chances of sharing code, maintaining code, and so forth.

It does not come without its drawbacks:

Currency - The Jython project went moribund for a few years and the latest stable version is now 2.2.1. Compare that to the latest version of Python: 2.5.2. I don't begrudge the Jython developers, though. I'm glad some folks picked the project up, dusted it off, and breathed new life into it. I am also glad that they're skipping 2.3 and 2.4 releases and plowing right into a 2.5 release. Because of this currency issue, some Python libraries won't work with the latest Jython and that means you're stuck looking for outdated, potentially vulnerable Python libraries, or hooking into Java libraries for the same functionality (which inevitably means more lines of code). I'm no lines-of-code fetishist but it does militate against the goal of agility somewhat. One does have the option of living on the edge and trying out the 2.5 branch, but that seems out of step with an infrastructure that is supposed to preserve terabytes upon terabytes of our nation's, and the world's, intellectual property. It's a responsibility I do not take lightly, as much as I'd like to be on the bleeding edge.
Interoperability difficulties - Talking to Java from Jython is a snap: just import java at the top of your script, and, assuming your classpath is copacetic, voila: you have access to Java libraries in your Python code! Talking to your Jython modules from Java code is, well, a little more complicated. Read on.

Despite the caveats it does seem a sane, reasonable, and potentially productive path to go down. Right? I am specifically looking to implement two workflow components in Jython: one for transport (wrapping Ant’s JSch library, which provides a slick scp API) and the other for automation of ZFS filesystem/volume creation on the staging server. Nothing arcane, nothing tricky, nothing fancy. So it must be easy! Right? …

Lessons Learned?

I’m beginning to wonder about the feasibility of using Jython to make bits of our Java stack more agile. Specifically, there are three ways to get at Jython code from Java:

1. Compile to bytecode/jar via the Jython compiler (jythonc) and reference your Jython objects and methods as though they were POJOs
2. Embed a Jython interpreter
3. Instantiate a (JSR-223) script engine

Option 1 is nice because you get object- and method-level interop. However, jythonc is unsupported and will disappear. This does not seem sustainable though I might be able to limp along a while. And there are signs of hope:

Though jythonc is going away, all of the capabilities it provides will be present in 2.5 in other forms. We're adding functionality to expose Python classes as Java classes using decorators to replace the docstring class creation that jythonc provided, and we're adding static compilation of proxy classes so regular jython can run in applets and other environments with restrictive classloaders. We're definitely doing something about jythonc.

That doesn’t help much now, of course, but just because jythonc goes away does not mean my jars will stop working.

The suggested methods for option 2 [1, 2] seem to be more trouble than they’re worth. If the goal is more agile development for certain components, then relying upon multiple, separate Java classes (an interface class and an object factory, in the examples listed) to get a ten-line Jython script working seems suboptimal: both inefficient and not straightforwardly maintainable. It seems, to me and my Java-dumb ways, rather baroque.

Option 3 [1, 2] is more appealing than option 2 as it does not rely upon these other classes specifically for Jython code. But the number of lines of Java code that must be wrapped around the Jython to get it working looks like overkill for the drop-dead simple scripts I’m writing – it might be easier, for instance, to just write the darn things in Java and be done with it. (Did I just say that?)

Conclusions

Options 2 and 3 are similar as both involve embedding Jython code, or referencing files with Jython code, and interpreting the code within Java. Generally, I worry that either option would obviate the benefit of agile Jython scripting because you wind up wrapping the code in so much Java. I offer two disclaimers to counter my objections:

Cleverer Java coders than myself could, I am almost certain, find ways to build abstractions (or abstractions of abstractions) to eliminate the "lines of code" and "many separate classes per Jython script" issues
The value of Jython in our Java environment increases proportionately with the complexity of the component -- given the overhead, a short Java class seems easier and more straightforward to implement than a short embedded Jython script. On the other hand, there's value in embedding a Jython script that'd be an order of magnitude simpler than its Java analog.

At least, that’s the state of my head right now re: getting Jython and Java to play nice. If I make any breakthroughs or give up entirely, I’ll post follow-ups.

I am but a Java philistine and a Jython neophyte, so I remain humbly open-minded. I would greatly appreciate comments, questions, corrections, smackdowns, sagacious advice, and so on.
