Thursday, April 23, 2009

An introduction to the GSS API

Application Programming Interface (API) design is one of my favorite topics in programming, probably because it is both a science and an art. It is a science because there are widely accepted principles about how to design an API, but at the same time, applying these principles within the constraints of a given programming language requires the finesse of an experienced practitioner. Therefore it is with great pleasure that I'll try to explain the ins and outs of the GSS REST-like API, as promised before. As I've already mentioned, GSS is the name of both the source code project on Google Code and the GRNET-sponsored service for the Greek research and academic network (although its official name after leaving the beta stage will be Pithos). Since anyone can use the open-source code to set up a GSS service, in this post I'll use generic examples, so anyone writing a client for the GRNET service should modify them accordingly.

When developing an application in a particular programming language, we are used to thinking about the APIs presented to us by the various libraries, which are invariably specified in that same language. For instance, to communicate with an HTTP server from a Java program we might use the HttpClient library API. This library presents a set of Java classes, interfaces and methods for interacting with web servers. These classes hide the underlying complexity of the low-level HTTP protocol operations, allowing our thinking to remain in the Java world. We could, however, interact with a web server without such a library, opting to implement the HTTP protocol interactions ourselves instead. Unfortunately, there is no such higher-level library for GSS yet that wraps the low-level HTTP communications. Therefore this post will present a low-level API, in the sense that one has to make direct HTTP calls in one's chosen programming language. The good news is that the following discussion is useful for programmers with any background, since every modern programming language supports the ubiquitous HTTP protocol.

A RESTful API models its entities as resources. Resources are identified by Uniform Resource Identifiers, or URIs. There are four kinds of resources in GSS: files, folders, users & groups. These resources have a number of properties that contain various attributes. The API models these entities and their properties in the JSON format. There is also a fifth entity that is not modeled as a resource, but is important enough to warrant special mention: permissions.

Users · Users are the entities that represent the actual users of the system. They are used to log in to the service and to separate the namespaces of files and folders. User entities have attributes like full name, e-mail, username, authentication token, creation/modification times, groups, etc. The URI of a user with username paul would be:

http://host.domain/gss/rest/paul/
The JSON representation of this user would be something like this:
{
  "name": "Paul Smith",
  "username": "paul",
  "email": "paul@gmail.com",
  "files": "http://hostname.domain/gss/rest/paul/files",
  "trash": "http://hostname.domain/gss/rest/paul/trash",
  "shared": "http://hostname.domain/gss/rest/paul/shared",
  "others": "http://hostname.domain/gss/rest/paul/others",
  "tags": "http://hostname.domain/gss/rest/paul/tags",
  "groups": "http://hostname.domain/gss/rest/paul/groups",
  "creationDate": 1223372769275,
  "modificationDate": 1223372769275,
  "quota": {
    "totalFiles": 7,
    "totalBytes": 429330,
    "bytesRemaining": 10736988910
  }
}


Groups · Groups are entities used to organize users for easier sharing of files and folders among peers, so that a file or folder can be shared with multiple users at once. Groups belong to the user who created them and cannot be shared. The URI of a group named work created by the user with username paul would be:
http://host.domain/gss/rest/paul/groups/work
The JSON representation of this group would be something like this:

[
  "http://hostname.domain/gss/rest/paul/groups/work/tom",
  "http://hostname.domain/gss/rest/paul/groups/work/jim",
  "http://hostname.domain/gss/rest/paul/groups/work/mary"
]


Files · Files are the most basic resources in GSS. They represent actual operating system files from the client's computer that have been augmented with extra metadata for storage, retrieval and sharing purposes. Familiar metadata from modern file systems are also maintained in GSS, like file name, creation/modification times, creator, modifier, tags, permissions, etc. Furthermore, files can be versioned in GSS. Updating a versioned file retains the previous versions, while updating an unversioned file irrevocably replaces the old file contents. The URI of a file named doc.txt located in the root folder of the user with username paul would be:
http://host.domain/gss/rest/paul/files/doc.txt
The JSON representation of the metadata in this file would be something like this:

{
  "name": "doc.txt",
  "creationDate": 1232449958563,
  "createdBy": "paul",
  "readForAll": true,
  "modifiedBy": "paul",
  "owner": "paul",
  "modificationDate": 1232449944444,
  "deleted": false,
  "versioned": true,
  "version": 1,
  "size": 802,
  "content": "text/plain",
  "uri": "http://hostname.domain/gss/rest/paul/files/doc.txt",
  "folder": {
    "uri": "http://hostname/gss/rest/aaitest@uth.gr/files/",
    "name": "Paul Smith"
  },
  "path": "/",
  "tags": [
    "work",
    "personal"
  ],
  "permissions": [
    {
      "modifyACL": true,
      "write": true,
      "read": true,
      "user": "paul"
    },
    {
      "modifyACL": false,
      "write": true,
      "read": true,
      "group": "work"
    }
  ]
}


Folders · Folders are resources that are used for grouping files. They represent the file system concept of folders or directories and can be used to mirror a client's computer file system on GSS. Familiar metadata from modern file systems are also maintained in GSS, like folder name, creation/modification times, creator, modifier, permissions, etc. The URI of a folder named documents located in the root folder of the user with username paul would be:
http://host.domain/gss/rest/paul/files/documents
The JSON representation of this folder would be something like this:

{
  "name": "documents",
  "owner": "paul",
  "deleted": false,
  "createdBy": "paul",
  "creationDate": 1223372795825,
  "modifiedBy": "paul",
  "modificationDate": 1223372795825,
  "parent": {
    "uri": "http://hostname.domain/gss/rest/paul/files/",
    "name": "Paul Smith"
  },
  "files": [
    {
      "name": "notes.txt",
      "owner": "paul",
      "creationDate": 1233758218866,
      "deleted": false,
      "size": 4567,
      "content": "text/plain",
      "version": 1,
      "uri": "http://hostname.domain/gss/rest/paul/files/documents/notes.txt",
      "folder": {
        "uri": "http://hostname.domain/gss/rest/paul/files/documents/",
        "name": "documents"
      },
      "path": "/documents/"
    }
  ],
  "folders": [],
  "permissions": [
    {
      "modifyACL": true,
      "write": true,
      "read": true,
      "user": "paul"
    },
    {
      "modifyACL": false,
      "write": true,
      "read": true,
      "group": "work"
    }
  ]
}
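
The URI patterns used in the examples for users, groups, files and folders can be collected in a small helper. This is only a sketch: the class name GssUris and its method names are hypothetical, the base URI depends on the deployment, and a well-behaved client should prefer following the URIs returned by the server over constructing them by hand.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper that mirrors the URI patterns of the examples above. */
final class GssUris {
    private final URI base; // e.g. http://host.domain/gss/rest (deployment-specific)

    GssUris(String base) {
        // Drop a trailing slash so that the joins below never produce "//".
        this.base = URI.create(base.endsWith("/") ? base.substring(0, base.length() - 1) : base);
    }

    /** URI of a user, e.g. .../rest/paul/ */
    URI user(String username) {
        return build("/" + username + "/");
    }

    /** URI of a group owned by a user, e.g. .../rest/paul/groups/work */
    URI group(String owner, String group) {
        return build("/" + owner + "/groups/" + group);
    }

    /** URI of a file or folder, given its path below the owner's root folder. */
    URI files(String owner, String path) {
        return build("/" + owner + "/files/" + path);
    }

    private URI build(String suffix) {
        try {
            // The multi-argument constructor percent-encodes characters
            // that are illegal in a path, such as spaces.
            return new URI(base.getScheme(), base.getAuthority(),
                    base.getPath() + suffix, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With a base of http://host.domain/gss/rest, calling files("paul", "doc.txt") yields the file URI shown earlier, and a name containing a space is percent-encoded.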

Working with these resources is accomplished by sending HTTP requests to the resource URI using the GET, HEAD, DELETE, POST and PUT methods. GET requests retrieve the resource representation: either the file contents, or the JSON representations of the resources described above. HEAD requests for files return just the metadata of the file, and DELETE requests remove the resource from the system. PUT requests upload files from the client to the system, while POST requests perform various modifications to the resources, like renaming, moving, copying, moving files to the trash, restoring them from the trash, creating folders and more. The operations are numerous and I hope to cover them in more detail in a future post.
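
As a rough illustration of how the methods map to requests, here is a sketch that uses the java.net.http API that ships with the JDK since Java 11. The class GssRequests and its method names are made up for this example. The requests are only constructed, not sent, and the authentication that GSS requires is deliberately left out. POST is omitted too, since its parameters are not covered in this post, and sending a Content-Type with the upload is an assumption on my part.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical request factory; authentication is intentionally omitted. */
final class GssRequests {
    private GssRequests() {}

    /** GET: file contents, or the JSON representation of other resources. */
    static HttpRequest get(URI resource) {
        return HttpRequest.newBuilder(resource).GET().build();
    }

    /** HEAD: just the metadata of a file. */
    static HttpRequest head(URI file) {
        return HttpRequest.newBuilder(file)
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
    }

    /** DELETE: remove the resource from the system. */
    static HttpRequest delete(URI resource) {
        return HttpRequest.newBuilder(resource).DELETE().build();
    }

    /** PUT: upload file contents from the client (Content-Type is an assumption). */
    static HttpRequest upload(URI file, byte[] contents, String contentType) {
        return HttpRequest.newBuilder(file)
                .header("Content-Type", contentType)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(contents))
                .build();
    }
}
```

Any of these requests could then be handed to a java.net.http.HttpClient for sending, once the authentication details have been added.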

One important aspect of every RESTful API is the use of URIs to allow the client to maintain a stateful conversation. For example, fetching the user URI would provide the files URI for fetching the available files and folders. Fetching the files URI would in turn return the URIs for the particular files and folders contained in the root folder (along with other folder properties). Returning to the parent of the current folder would entail following the URI contained in the parent property. This mechanism removes the state handling from the server and puts the burden on the client, providing excellent scalability for the service. Furthermore, since the URIs are treated opaquely by the client, the API allows client reuse across server deployments. A client can target multiple GSS services, as long as they speak the same RESTful API. Moreover, links from service A can refer to resources in service B without a problem (in the same authentication domain, e.g. the same Shibboleth federation). This is the same as using a single web browser to communicate with multiple web servers, by following links among them.
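
To make the link-following idea concrete, here is a small sketch that pulls the next URI out of a representation instead of constructing it. The class GssLinks is hypothetical, and the regular expressions are a crude stand-in that keeps the example free of dependencies; a real client would use a proper JSON library.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that extracts link URIs from GSS JSON representations. */
final class GssLinks {
    private static final Pattern PARENT_URI = Pattern.compile(
            "\"parent\"\\s*:\\s*\\{[^{}]*?\"uri\"\\s*:\\s*\"([^\"]*)\"");

    private GssLinks() {}

    /** First string-valued property with the given name, e.g. "files" in a user. */
    static Optional<URI> link(String json, String property) {
        Pattern p = Pattern.compile(
                "\"" + Pattern.quote(property) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? Optional.of(URI.create(m.group(1))) : Optional.empty();
    }

    /** The "uri" inside the "parent" object of a folder representation. */
    static Optional<URI> parent(String folderJson) {
        Matcher m = PARENT_URI.matcher(folderJson);
        return m.find() ? Optional.of(URI.create(m.group(1))) : Optional.empty();
    }
}
```

A client would fetch the user URI, call link(json, "files") on the response, fetch that URI, and so on, treating every URI as an opaque value handed out by the server.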

Monday, April 20, 2009

Reconciling Apache Commons Configuration with JBoss 5

There are many ways to configure a JavaEE application. Among the available solutions are DBMS tables, JNDI, JMX MBeans and even plain old files in a variety of formats. While we have used most of the above on various occasions, I find that plain files resonate with all types of sysadmins when no other administrative interface is available. For such scenarios, Apache Commons Configuration is undoubtedly the best tool for the job. Recently, I came across an undocumented incompatibility when using Commons Configuration with JBoss 5 and I thought I should describe our solution for the benefit of others.

We usually store our configuration files in the standard place for JBoss, which is JBOSS_HOME/server/default/conf for the default server configuration. This has the disadvantage that it is not as easy to remember as /etc in UNIX/Linux or \Program Files and \Windows in Windows systems, but it has the important advantage that the location can be specified as a relative path in our code, making the code more cross-platform without cluttering it with platform-specific if/else path resolution checks.

Commons Configuration can reload the configuration files automatically when they change in the file system, which helps prolong the server uptime. However, JBoss 5 has introduced the concept of a virtual file system that caches all file system accesses performed through the context class loaders using relative paths. Unfortunately, this generates resource URLs of the form vfsfile:foo.properties that Commons Configuration does not know how to deal with. Fixing this requires extending FileChangedReloadingStrategy, as we do in GSS. Alternatively, one could patch Commons Configuration with the following change and use the standard FileChangedReloadingStrategy unchanged:


Index: FileChangedReloadingStrategy.java
===================================================================
--- FileChangedReloadingStrategy.java (revision 764760)
+++ FileChangedReloadingStrategy.java (working copy)
@@ -46,6 +46,9 @@
     /** Constant for the jar URL protocol.*/
     private static final String JAR_PROTOCOL = "jar";
 
+    /** Constant for the JBoss MC VFSFile URL protocol.*/
+    private static final String VFSFILE_PROTOCOL = "vfsfile";
+
     /** Constant for the default refresh delay.*/
     private static final int DEFAULT_REFRESH_DELAY = 5000;
 
@@ -161,7 +164,8 @@
 
     /**
      * Helper method for transforming a URL into a file object. This method
-     * handles file: and jar: URLs.
+     * handles file: and jar: URLs, as well as JBoss VFS-specific vfsfile:
+     * URLs.
      *
      * @param url the URL to be converted
      * @return the resulting file or null
@@ -181,6 +185,18 @@
                 return null;
             }
         }
+        else if (VFSFILE_PROTOCOL.equals(url.getProtocol()))
+        {
+            String path = url.getPath();
+            try
+            {
+                return ConfigurationUtils.fileFromURL(new URL("file:" + path));
+            }
+            catch (MalformedURLException mex)
+            {
+                return null;
+            }
+        }
         else
         {
             return ConfigurationUtils.fileFromURL(url);


Neither of these solutions covers the case of storing configuration files in zip or jar containers, but since it is something I haven't found a use for yet, I can't test a fix for it. If anyone is interested in such a use case, I'd advise extending FileChangedReloadingStrategy, combining the logic in jar: and vfsfile: URL handling.
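
For experimenting with the vfsfile: mapping outside of JBoss, the following dependency-free sketch mirrors the idea of the patch: treat a vfsfile: location as a plain file path. The class VfsFileUrls and its method are hypothetical and are not part of Commons Configuration or JBoss. The sketch takes a java.net.URI because, as far as I know, java.net.URL refuses to parse a protocol that has no registered handler, which is the case for vfsfile: outside the application server.

```java
import java.io.File;
import java.net.URI;

/** Hypothetical stand-alone version of the vfsfile: to file mapping. */
final class VfsFileUrls {
    private static final String FILE_SCHEME = "file";
    private static final String VFSFILE_SCHEME = "vfsfile";

    private VfsFileUrls() {}

    /** Returns the local file for file: and vfsfile: locations, or null otherwise. */
    static File toFile(URI resource) {
        String scheme = resource.getScheme();
        if (FILE_SCHEME.equals(scheme) || VFSFILE_SCHEME.equals(scheme)) {
            // Hierarchical URIs (vfsfile:/a/b) expose a path, while opaque ones
            // (vfsfile:foo.properties) only have a scheme-specific part.
            String path = resource.getPath() != null
                    ? resource.getPath()
                    : resource.getSchemeSpecificPart();
            return new File(path);
        }
        // Anything else, including jar:, is out of scope for this sketch.
        return null;
    }
}
```

As with the patch, containers such as jar: are not handled and simply yield null here.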

Monday, April 13, 2009

GSS architecture

Handling a large number of concurrent connections requires many servers, not only because scaling vertically (throwing bigger hardware at the problem) is very costly, but also because even if an application can be designed to scale vertically, the underlying stack probably cannot. Java applications like GSS, for instance, run on the JVM, and although the latter is an excellent piece of engineering, using huge amounts of heap is not something it's tuned for. Big Iron servers with many cores and 20+ GB of RAM are usually running more than one JVM, since garbage collection is not all that efficient with huge heaps. And since running application instances with a 4-8 GB heap size can be done with cheap off-the-shelf hardware, why spend big bucks on Big Iron?

So having a large number of servers is a sane choice, but it brings its own set of problems. Unless one partitions users to servers (so that all requests a particular user makes are delivered to the same server), all servers must have a consistent view of the system data in order to deliver meaningful results. Assigning user requests to particular servers usually requires expensive application-layer load balancers or customized application code on each server, so it would rarely be your first option. Having all servers work on the same data is a more tractable problem, since it can be solved by replicating the application state among server nodes. Usually only a small part of the application state needs to be replicated for each user, namely the part that concerns the user's current session. But even though session clustering is a well-studied field and implementations abound, having no session to replicate is an even better option.

For GSS we have implemented a stateless architecture for the core server, which should provide us with good scalability in a very cost-effective manner. The most important part of this architecture is the REST-like API that moves part of the responsibility for session state maintenance to the client applications, effectively distributing the system load to more systems than the available server pool. Furthermore, client requests can be authenticated without requiring an SSL/TLS transport layer (even though it can be used if extra privacy is required), which would entail higher load on the application servers or require expensive load balancers. On the server side, API requests are handled by servlets that enlist the services of stateless session beans, for easier transaction management. Our persistence story so far is JPA with a DBMS backing, plus high-speed SAN storage for the file contents. If or when this becomes a bottleneck, we have various contingency plans, depending on the actual characteristics of the load that will be observed.


The above image depicts the path user requests will travel along, while various nodes interact in order to serve them. A key element in this diagram is the number of different servers that can be found, each effectively specializing in its own particular domain. Although the system can be deployed on a single physical server (and regularly is, for development and testing), being composed of a number of standalone sub-services instead of one big monolithic service is a boon to scalability.

This high-level overview of the GSS architecture should help those interested find their way around the open-source codebase, and it explains some of the design decisions. But the most interesting part from a user's point of view would be the REST-like API, which allows one to take advantage of the service for scratching one's own itch.

So that will be the subject of my next post.

Wednesday, April 8, 2009

Introducing GSS

During my recent work-induced blog hiatus, I've been working on a new software system, called GSS. I've been more than enjoying the ride so far and since we have released the code as open-source, I thought discussing some of the experience I've gained might be interesting to others as well.

GSS is a network, er grid, er I mean cloud service, for providing access to a file system on a remote storage space. It is the name of both a service (currently in beta) for the Greek research and academic community and the open source software used for it, which can also be used by others for deploying such services. It is similar in some ways to services like iDrive, DropBox and drop.io, but it can also be regarded as a more high-level Amazon S3. Its purpose is to let the desktop computer's file system meet the cloud. The familiar file system metaphors of files and folders are used to store information in a remote storage space that can be accessed through a variety of user and system interfaces, from any place in the world that has an Internet connection. All usual file manager operations are supported and users can share their files with selected other users or groups, or even make them public. Currently there are four user interfaces available, in various stages of development: a web-based application, a desktop client, a WebDAV interface and an iPhone web application. Underlying these user interfaces is a common REST-like API that can be used to extend the service in new, unanticipated ways.


The main focus of this service was to provide the users of the Greek research and academic community with a free, large storage space that can be used to store, access, back up and share their work, from as many computer systems as they want. Since the available user base is close to half a million (although the expected users of the service are projected to be in the low tens of thousands), we needed a scalable system that would be able to accommodate high network traffic and a high storage capacity at the same time. A Java Enterprise Edition server coupled with a GWT-based web client and a stateless architecture were our solution. In future posts I will describe the system architecture with all the gory details. The exposed virtual file system features file versioning, trash bin support, access control lists, tagging, full text search and more.

All of these features are presented through an API for third party developers to create scripts, applications or even full blown services that will fulfill their own particular needs or serve other niches. This API has a REST-like design and though it will probably fail a formal RESTful definition, it sports many of the advantages of such architectures:

  • system resources such as users, groups, files and folders are represented by URIs
  • GET, HEAD, POST, PUT and DELETE methods on resources have the expected semantics
  • HTTP caching is explicitly supported via Last-Modified, ETag & If-* headers
  • resource formats for everything besides files are simple JSON representations
  • only authenticated requests are allowed, except for public resources
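
To illustrate the caching point in the list above, here is a sketch of a conditional GET built with the JDK's java.net.http API (Java 11 or later). The class ConditionalGet is made up for the example, and the request is only constructed, not sent. The ETag and Last-Modified values are whatever a previous response carried, and which validators a given GSS deployment returns for which resources is something to check against the service itself.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper for revalidating a cached representation. */
final class ConditionalGet {
    private static final int NOT_MODIFIED = 304;

    private ConditionalGet() {}

    /** Builds a GET that asks the server to skip the body if nothing changed. */
    static HttpRequest revalidate(URI resource, String etag, String lastModified) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(resource).GET();
        if (etag != null) {
            builder.header("If-None-Match", etag);
        }
        if (lastModified != null) {
            builder.header("If-Modified-Since", lastModified);
        }
        return builder.build();
    }

    /** True when the cached copy can be reused as-is. */
    static boolean canReuseCachedCopy(int statusCode) {
        return statusCode == NOT_MODIFIED;
    }
}
```

When the server answers 304 Not Modified, the client keeps using its cached copy and saves the transfer of the representation.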

Users are authenticated through the GRNET Shibboleth infrastructure. User passwords are never transmitted to the GSS service. Instead GSS-issued authentication tokens are used by both client and server to sign the API requests after the initial user login. SSL transport can provide even stronger privacy guarantees, but it is not required, nor enabled by default.
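
The general idea of signing a request with a shared token can be sketched as follows. To be clear, this is not the GSS signature scheme, whose details are outside the scope of this post: the class RequestSigner, the choice of HMAC-SHA256 and the fields that go into the signed string are all assumptions made purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative token-based request signing; not the actual GSS scheme. */
final class RequestSigner {
    private static final String ALGORITHM = "HmacSHA256"; // assumption

    private RequestSigner() {}

    /** Signs method, date and path with the token that client and server share. */
    static String sign(byte[] token, String method, String date, String path) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(token, ALGORITHM));
            String canonical = method + "\n" + date + "\n" + path; // assumed layout
            byte[] raw = mac.doFinal(canonical.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(raw);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Server side: recompute the signature and compare in constant time. */
    static boolean verify(byte[] token, String method, String date, String path,
                          String signature) {
        byte[] expected = sign(token, method, date, path).getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, signature.getBytes(StandardCharsets.UTF_8));
    }
}
```

The point of such a scheme is that the token itself never travels with the request: the server recomputes the signature from its own copy of the token, and a request that was tampered with, or signed with a different token, fails verification.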

The GSS code base is GPL-licensed and therefore anyone can use it as a starting point to implement his own file storage service. We have yet to provide binary downloads, due to the various dependencies, but the build instructions should be enough to get someone started. We are always interested in source or documentation patches, of course (did I mention it's open source?). Most importantly, the REST API will ensure that clients developed for one such service can be reused for every other one.

I will have much more to say about the API in a future post. In the meantime you can peruse the code and documentation, or even try it out yourself. I'd be very interested in any comments you might have.

Creative Commons License Unless otherwise expressly stated, all original material in this weblog is licensed under a Creative Commons Attribution 3.0 License.