Thursday, June 2, 2011

Git & Gitorious

I'm trying to promote code sharing, code review, and open source ideas internally at my company. I set up a copy of Gitorious on our intranet and we're starting to commit projects into it.

I whipped up a quick overview of Git and Gitorious to share with my coworkers, to make them more comfortable with the systems and get them started. It's not entirely technically accurate, but it reflects how I view the systems.

Gitorious

Gitorious is web-based project and repository management software. It lets us create multiple repos and organize them into projects, teams and whatnot.

Warning: All Code Public
All code uploaded to the site is visible to everyone within the corporate network. So scrub code for passwords and other credentials before uploading.

Gitorious Elements

There are only 4 different major types of "things" in Gitorious.

  • Users
  • Teams
  • Projects
  • Repositories

Users

You and me. Just create your account. Don't forget to add an ssh key after you log in so you can upload code.

Teams

This is just a group of users. A user can be in many teams at once. These teams are just logical units for giving access to whole projects and repositories at once, without having to specify individual members. Projects can also be 'owned' by a team instead of just a user.

Projects

Projects can be owned by teams or by users. A project is just a place to group your repositories. You may have a project for each repo, or put a couple of related repos in the same project.

Each project also gets its own mini wiki, which is not full-featured but is convenient for small documentation and for project and repo descriptions. The project's mini-wiki is versioned using git.

When you create a repo, it has to be created in an existing project, which is just good practice anyway. But when you clone a repo, it has no project associated with it (for convenience as well).

Repositories

Git repositories. This is the nitty gritty. It's where the source actually goes. There are some good practices outlined further below on how to decide what goes in a repository.

Gitorious provides several methods for access to the repository. The 2 methods we have enabled are ssh and the git protocol. The git protocol has no authentication mechanism, so it provides read-only access to the repo. The ssh protocol uses a public key you upload to Gitorious to authenticate you and allow read/write access to your repos (but only your repos).

Getting Started

1. Create an account

The site isn't connected to LDAP, so you'll have to create your account manually. No email verification is required, though, so you can log in immediately.

2. Join or create any useful teams.

This is optional. You don't need to be on a team in Gitorious to use the site to its fullest. But teams allow you to give write permission to your projects and repos to multiple people easily. They also act as a useful form of organization.

3. Create projects and repos

Generally you'll create a repo for each app you want to share the source for, and you may create a project for it as well. But you may also create multiple repos in the same project, or put multiple applications/source trees in the same repo. It's your personal preference.

4. Share your code.

You can upload your code to your repositories. Remember to clone the gitorious repo first using the ssh url, and to upload an ssh public key so you can get write access to your repo.

Create Away
All objects and metadata can be easily removed or replaced, so feel free to create them with reckless abandon.
We can always split projects and repos later, or move other things around.

Web Access to Repositories

Like any decent source repository management site, Gitorious provides a simple web-based repository browser, so that you can browse the repository without having to clone it first. You can also browse the history of the repo.

Access Control

Warning: All Code Public
All code uploaded to the site is visible to everyone within the corporate network. So scrub code for passwords and other credentials before uploading.

Projects

You can set who can edit and write to projects. You can add teams or individuals to a project.

Repositories

You can assign teams and individuals in 3 different capacities when it comes to repositories.

  • Committers
  • Reviewers
  • Administrators

Users and teams can be assigned any combination of these 3 roles for a repository. Committers can push changes to the git repository. Reviewers can manipulate merge-requests. And administrators can edit the repo metadata in Gitorious.

Code Review and Fork-based Development

Cloning Repos

Just as you can use the 2 methods described further above to download/clone git repos to your local machine, you can also request that Gitorious clone the repo for you within Gitorious. When you do this, it can track metadata about the repo relationships and enable some more advanced features, such as watching repos and performing merge-requests.

Merge-Requests

This is an advanced feature, but it bears mentioning. If you have a repo in Gitorious cloned from another repo also in Gitorious, and you'd like to submit your changes to be pushed up the line back to the origin of the source, you can submit a merge request, and the original authors can work with you in a simple form of code review.

Git

This isn't a tutorial on using Git. It's just an overview of the DVCS, the concepts therein, and how DVCS differ from traditional VCS. You can find a great deal of presentations, tutorials, and documentation online for git which will better serve your needs if you're trying to get started using it.

Git Parlance
Git documentation and proponents use a lot of vocabulary that they pretend describes new concepts in version control, when in actuality they are just referring to concepts that have existed in VCS for decades but are simply more flexible now with the advent of DVCS. I'll try to point out when I'm using this vocabulary.

DVCS vs VCS

"Traditional" version control, or VCS

ala subversion or CVS.

  • centralized authoritative location for the source code. defined by both owners, and the software.
  • client and server are separate entities.

The New Way, or the "D" in DVCS

ala git or mercurial

The "D" stands for "distributed" version control, which you'll also see described as "decentralized".

  • no central authoritative location. defined by the software.
  • the authoritative location for the source defined only by convention.
  • no difference between client and server. we'll just call it a "client" here though.

This means that the software and internals of the VCS are designed so there is no enforcement about who controls the source.
Everyone working on the source is equal (in the eyes of the VCS). This includes the location that the team defines as the authoritative (canon) location for the source code.

It's still common to have a central repository even in DVCS, where the team members submit their final code changes. But the benefits come from having the VCS designed around not mandating a central location.

Everyone has a repo

In traditional VCS the user usually gets just the minimal amount of data required to work on the code at the moment.

When they pull a "working copy" of the source tree, they get just the branch they need, and just the subtree that they need of that branch. They usually don't get any history in their "working copy", just the immediate files at the HEAD version. All of the other details are easily accessible from the central server if needed for a complex operation.

In git, the "repository" and the "working directory" are one and the same. When you get a working copy from some other repo, you're just copying the entire repository to a new local one, and that repository becomes your working copy. The files are ready to be worked with. All the history and other VCS details are hidden in a .git subdirectory.

This may seem a rather heavyweight operation, but in git you tend to create smaller repositories. Unlike traditional VCS, where you put all your projects and applications in the same repo, in git you'll create a separate repo for each project or application, so the size remains manageable. Furthermore, while you can't clone just a subtree of a repo, you can clone only a single branch or a subset of the repo's branches. In fact it's quite common to clone just the single branch you're interested in... say the "master" branch (same thing as "trunk" in svn or cvs).

If you're working on the code in several locations, each location will actually be a separate repo, possibly with a different set of branches in it.

This is another important concept in git. Even though we all have a repo for the project, the repos are not identical. And they don't need to be.

Each repo is different.

This is not just because we each have private changes in our repo that we haven't yet shared with others. Each repo also knows it's a different one from all the others. This is how we keep from stepping on each other's toes when we are all working on similar branches/code. Git knows that Mark's "master" branch is actually a different branch than Sue's "master" branch, and you have to address them as such if you really want to work with both in the same operation.

Hashes as a GUID

To deal with the issues of multiple repos, git tracks everything in the repo using secure message digests (hashes) of their content. This acts as a good GUID for these objects so that no matter where they came from or when, if they're identical, then it can identify them as such. This also makes comparing objects faster. The system uses this for files, directories, and history as well.
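To make the "hash as GUID" idea concrete, here is a minimal Java sketch of how git derives the id of a file's content in a SHA-1 repository: it hashes the header "blob <size>" plus a NUL byte, followed by the raw bytes. The class and method names (GitBlobId, blobId) are made up for illustration and are not part of any git library.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes the id git would give to a file's content.
class GitBlobId {

  // SHA-1 over "blob <size>\0" followed by the content bytes, as lowercase hex.
  static String blobId(byte[] content) {
    try {
      MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
      String header = "blob " + content.length + "\0";
      sha1.update(header.getBytes(StandardCharsets.US_ASCII));
      sha1.update(content);
      StringBuilder hex = new StringBuilder();
      for (byte b : sha1.digest()) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-1 is not available", e);
    }
  }

  public static void main(String[] args) {
    byte[] marks = "hello world\n".getBytes(StandardCharsets.UTF_8);
    byte[] sues = "hello world\n".getBytes(StandardCharsets.UTF_8);
    // Same content in two different repos gets the same id.
    System.out.println(blobId(marks).equals(blobId(sues)));
    // Any change to the content gives a different id.
    System.out.println(blobId(marks).equals(blobId("hello\n".getBytes(StandardCharsets.UTF_8))));
  }
}
```

The id depends only on the bytes, not on the file name, the repo, or the time, which is why two people can create the same object independently and git can still recognize it as one thing.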

Unfortunately there are a few gotchas to keep in mind with this. For instance, the same source tree can result from 2 different histories. That can cause some complications when deciding whose history is more useful to keep around. But most users don't have to worry about that at the beginning.

DVCS Commands and Concepts

The Old

Generally DVCS work the same as VCS and you'll see similar commands and concepts.

  • branch
  • commit
  • log/history
  • diff
  • tag

The New

Where DVCS is different tends to be in what it adds on top of the existing VCS. Here are some commands you'll find new.

  • push
  • pull
  • clone
  • amend

But these are not entirely new. push, pull, and clone are similar to checkout and commit, but between repositories, since there is no separate working copy. And amend is an advanced concept you may never use.

For Subversion Users

I found this mapping between svn and git commands helpful, when I first switched to using git for some projects.

http://git.or.cz/course/svn.html

Git can also work with svn repos directly. You usually do this by replicating the svn repo into a git repo, then working with the git repo. But you can also do bi-directional replication between the two, and continue to use both repos for future versioning, side by side.

http://www.kernel.org/pub/software/scm/git/docs/git-svn.html

It can be a little complicated to use and there are better example workflows of it online as well.

Advanced Concepts

These are some advanced concepts you don't need to understand to use git, but that might be interesting if you're going to delve further into it.

The Index

Git commands can work on a staging area for the changeset before it's committed to the repository. This staging area is called the index. You can flag files/directories for committal as you're working on the source. When you do, git actually copies the state (content) of those files and directories, at that time, into the staging area. Then, later, when you commit, that staging area is what is committed to the repository, not the actual current state of the source tree. This allows for a specifically controlled, partial-source commit.
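Here is a toy Java model of that behavior, just to show the snapshot-at-staging-time effect. It is not how git is implemented; the class StagingSketch and its methods are invented for this illustration, and files are reduced to path/content strings.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: working tree, index (staging area) and last commit as path -> content maps.
class StagingSketch {
  final Map<String, String> workingTree = new HashMap<>();
  final Map<String, String> index = new HashMap<>();
  Map<String, String> lastCommit = new HashMap<>();

  // Editing a file only touches the working tree.
  void edit(String path, String content) {
    workingTree.put(path, content);
  }

  // Staging copies the content as it is right now into the index.
  void add(String path) {
    index.put(path, workingTree.get(path));
  }

  // Committing records the index, not the current working tree.
  Map<String, String> commit() {
    lastCommit = new HashMap<>(index);
    return lastCommit;
  }

  public static void main(String[] args) {
    StagingSketch repo = new StagingSketch();
    repo.edit("Widget.java", "version 1");
    repo.add("Widget.java");
    repo.edit("Widget.java", "version 2, half finished"); // edited again after staging
    repo.edit("Notes.txt", "scratch");                     // never staged
    System.out.println(repo.commit());
  }
}
```

The commit contains only Widget.java at "version 1": the later edit and the unstaged file stay behind in the working tree until they are staged themselves.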

This can be useful or annoying. I find it the latter, and as such I completely ignore this feature. Instead I use git commands that skip the staging area and commit the current state of the source tree, all or none, automatically committing changes to any files that were previously committed to the repo, and in some cases even committing new files to the repo automatically.

But using the commands that work with the index can provide for some more advanced usage of git. Such as performing more fine-grained commit history, or quickly switching to temporary branches for quick fixes, and then switching back to your main work.

Layers

The architecture of git is rather simple, and it helps to understand it when you start using git more heavily. It's effectively composed of a series of layers.

  1. Efficient Blob Storage
  2. Blob Database
  3. Filesystem Trees
  4. Version History Graphs
  5. Other VCS Metadata: branches, tags, head...

I won't go into the details of what's stored in the blobs and the filesystem/version graphs, as there are good presentations online that describe these more effectively.

Efficient Blob Storage

In relational databases, a Binary Large Object (BLOB) is a chunk of arbitrary binary data of arbitrary size.

Git starts with a system that efficiently stores blobs by diffing related/similar blobs and then compressing them. But the details of this are always hidden from the user, as it's not necessary to understand them. And every VCS implements such a system anyway.

Blob Database

Git then takes these blobs and puts them into what amounts to a simple database of only one table, which contains the blob column and a few columns of metadata.

git uses a hash of the blob as the primary key and provides commands to manipulate the rows of this database directly from the command line. Most users never have to do this but sometimes it can become necessary if the repo becomes corrupt, data is deleted by accident, or someone messes up some advanced command.

File System Trees

Two types of data git will store in these blobs are files and directories.

How files are stored is obvious; it stores the content in the blob. By hashing the files, you get immediate duplicate reduction.

But to store directories it builds a tree of blobs, one for each directory. The blob contains the list of files and directories in the directory, and their hashes. This makes for rather efficient storage of directories in the table, as subtrees that don't change never need to be modified or replicated in the database. Furthermore, identical subtrees can be detected automatically in this manner.
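The one-table, hash-keyed view of files and directories can be sketched in a few lines of Java. This is a simplified model, not git's actual object format: the "file" and "dir" prefixes, the class ObjectStoreSketch, and its methods are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy content-addressed store: one "table" keyed by the hash of each object.
class ObjectStoreSketch {
  private final Map<String, String> objects = new HashMap<>();

  // Store file content; identical content always lands on the same key.
  String putFile(String content) {
    return put("file\n" + content);
  }

  // Store a directory as a sorted listing of "<hash> <name>" lines.
  String putDirectory(Map<String, String> nameToHash) {
    StringBuilder listing = new StringBuilder("dir\n");
    for (Map.Entry<String, String> entry : new TreeMap<>(nameToHash).entrySet()) {
      listing.append(entry.getValue()).append(' ').append(entry.getKey()).append('\n');
    }
    return put(listing.toString());
  }

  int size() {
    return objects.size();
  }

  private String put(String data) {
    String id = sha1Hex(data);
    objects.put(id, data); // storing the same data again just overwrites the same row
    return id;
  }

  static String sha1Hex(String data) {
    try {
      MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
      StringBuilder hex = new StringBuilder();
      for (byte b : sha1.digest(data.getBytes(StandardCharsets.UTF_8))) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-1 is not available", e);
    }
  }

  public static void main(String[] args) {
    ObjectStoreSketch store = new ObjectStoreSketch();
    Map<String, String> docs = new HashMap<>();
    docs.put("README", store.putFile("read me\n"));
    Map<String, String> docsCopy = new HashMap<>();
    docsCopy.put("README", store.putFile("read me\n"));
    // Identical subtrees hash to the same id and occupy a single row each.
    System.out.println(store.putDirectory(docs).equals(store.putDirectory(docsCopy)));
    System.out.println(store.size());
  }
}
```

Storing the same file and the same directory twice leaves just two rows in the store, which is the duplicate reduction described above.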

History Graph

Similar to the directories stored in blobs, the system stores in blobs the Directed Acyclic Graphs that comprise history trails for source versions. And by hashing the elements of the history, you get similar benefits to the filesystem storage.
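A small Java sketch of the same idea for history: each commit's id is a hash over the tree it points at plus the ids of its parents. Real git commits also hash author, committer, and timestamps, which are left out here; the class CommitSketch and the placeholder tree ids are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.List;

// Toy model of commits as hashed nodes in a directed acyclic graph.
class CommitSketch {

  // The id covers the tree, every parent id, and the message.
  static String commitId(String treeId, List<String> parentIds, String message) {
    StringBuilder body = new StringBuilder("tree ").append(treeId).append('\n');
    for (String parent : parentIds) {
      body.append("parent ").append(parent).append('\n');
    }
    body.append('\n').append(message);
    return sha1Hex(body.toString());
  }

  static String sha1Hex(String data) {
    try {
      MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
      StringBuilder hex = new StringBuilder();
      for (byte b : sha1.digest(data.getBytes(StandardCharsets.UTF_8))) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-1 is not available", e);
    }
  }

  public static void main(String[] args) {
    // Placeholder tree ids; in a real repo these would be directory hashes.
    String startTree = "tree-A";
    String finalTree = "tree-B";

    String marksRoot = commitId(startTree, Collections.<String>emptyList(), "Mark's start");
    String suesRoot = commitId(startTree, Collections.<String>emptyList(), "Sue's start");

    // Both arrive at the same source tree, but through different histories.
    String marksTip = commitId(finalTree, Collections.singletonList(marksRoot), "fix");
    String suesTip = commitId(finalTree, Collections.singletonList(suesRoot), "fix");
    System.out.println(marksTip.equals(suesTip));
  }
}
```

Because a parent's id is part of the child's hash, changing anything earlier in history changes every id after it. That is also why the same source tree reached through two different histories ends up with two different commit ids.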

Other VCS Metadata

Finally, on top of all of this, git adds the other VCS metadata required, such as branches and tags, and which version is the HEAD. All in order to round git out into a fully featured modern version control system.

History Editing and Amending

Since a git repo is really just a kind of object (or blob) database, git commands expose that database directly for you to play with. In fact they make it easy to play with the objects and graphs stored in that database. One thing that git allows you to do easily is to modify commits that are already in the database, and to change the history graphs that lead to a particular version of the source tree.

These features are usually avoided because if you're not careful you can permanently delete data that's in the repository, or worse, surprise someone else who has already cloned the data that you're changing.

Friday, March 18, 2011

Context IOC

I'd been doing this for years, but had no idea there was a name for it. For me it came out of unit testing UIs. I found my widget classes would have a lot of dependencies, but they'd only end up using each dependency for one or two small things. It seemed a shame to pass in the whole object just for that. Also, this left me with a lot to mock in my unit tests.

So with a quick interface, I found I could declare the dependencies at a much finer grain of detail. It just seemed like a good way to decouple objects. It makes writing unit tests fast because each class clearly states exactly what it's going to use from its dependencies. And when you change one thing in the code, static analysis tells you everything that is affected by it, with far fewer false positives.

Practices, maybe not "Best"

  • I've been naming the interfaces 'Dao' because that just seemed to make sense at the time. I may start using 'Context' now.
  • I found that embedding the interface, nested inside the class that used it, seemed like good information architecture. I know a lot of people don't like nesting classes and interfaces. But it avoids a lot of clutter in the packages, as I'll often have a ton of these interfaces, one for every class.
  • Outside of tests, I never create a whole class that does nothing but implement one of these Dao interfaces. The methods in the Dao came out of something that the class needed from another class (a dependency). So the Dao interface just turned out to be extra interfaces implemented by the dependency class. Not defining the dependency class, but rather performing some duck-typing for us.

Here is a quick example of what a class might look like that uses such an interface to declare its dependencies.

  public class SomeWidget {
 
    public interface Dao {
      void setSomeValue(SomeValue value);
      SomeOtherValue getSomeOtherValue();
      void doThis();
      void doThat();
      void saveYourWork();
      void requestExitOfApplication( String reason );
      void notifyCamelHairListeners( CamelHairEvent event);
    }

    private Dao dao;

    public SomeWidget(Dao dao) {
      this.dao = dao;

      // add various widgets to this one.
      // which call on the Dao to do the work behind them.
      ...
    }

  }


  // and when testing
  public class SomeWidgetDaoMock implements SomeWidget.Dao {
    ...
  }

The methods defined in these Daos usually wouldn't return actual dependencies. Instead they would be the abstract methods that the class needed from those dependencies.
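To show the pattern end to end in something that compiles, here is a small self-contained sketch. The names (VolumeWidget, AudioSettings, VolumeWidgetDaoMock) are made up for this example and don't come from any real application; the point is only that the real dependency implements the widget's narrow interface, and the test double is tiny.

```java
// A widget that declares exactly what it needs through a nested interface.
class VolumeWidget {

  interface Dao {
    int getVolume();
    void setVolume(int volume);
  }

  private final Dao dao;

  VolumeWidget(Dao dao) {
    this.dao = dao;
  }

  // UI callback: raise the volume by one step, capped at 10.
  void onLouderClicked() {
    dao.setVolume(Math.min(10, dao.getVolume() + 1));
  }
}

// The real dependency simply picks up the widget's interface as an extra one.
class AudioSettings implements VolumeWidget.Dao {
  private int volume = 5;

  public int getVolume() { return volume; }
  public void setVolume(int volume) { this.volume = volume; }

  // Plenty of other behavior the widget never sees.
  void resetEqualizer() { }
}

// In a test, the mock only has to cover the two methods the widget uses.
class VolumeWidgetDaoMock implements VolumeWidget.Dao {
  int volume;

  public int getVolume() { return volume; }
  public void setVolume(int volume) { this.volume = volume; }
}

class VolumeWidgetDemo {
  public static void main(String[] args) {
    AudioSettings settings = new AudioSettings();
    new VolumeWidget(settings).onLouderClicked();
    System.out.println(settings.getVolume());
  }
}
```

The widget never learns that AudioSettings exists, so changes to the rest of that class can't ripple into the widget or its tests.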

An exception I make is for some higher-level Daos in an application, where one Dao interface implementor may need to act as a factory to create the dependencies.

Composition

For building up the Daos in a large application you have many choices, such as composing Daos from sub-interfaces or using Dao factories (Daos just for retrieving other Daos).

  public class Application {

    // you only need to define one of these 2 interfaces

    // sub interface Dao
    public interface SubInterfaceDao extends SomeWidget.Dao {
    }


    // Dao factory
    public interface FactoryDao {

      public SomeWidget.Dao getSomeWidgetDao();

    }


    // rest of application
    ...

  }


  // The choice isn't hard. If you use one style, you can still use the other one on the fly.
  // another common choice during testing.
  public class ConfusedApplicationDao implements Application.FactoryDao, SomeWidget.Dao {

    public SomeWidget.Dao getSomeWidgetDao() {
      return this;
    }

  }

Monday, March 14, 2011

Minimizing Hibernate Schema Complexity

In my applications I persist a lot of structured data into the database. Hibernate and other ORMs are great at making this easy to do. One problem is that they tend to map the entire object structure to rigidly structured database fields and tables. Quite often that's just overkill.

Strings as a black box datatype

Take strings for instance. A VARCHAR makes perfect sense if you know the string will always be less than, say, 30 characters, even if you're never going to use that string field in a where clause.

But if you don't know the max size of the string for every instance, or it could be a very large string, and you're never going to use it in a search, then you're fine using a LOB column to store that string.

Modern ORMs will even read/write the whole string value into/out from the LOB column for you. Or give you the CLOB instance itself for better performance, whichever you prefer.

I've met many DBAs who don't like the use of LOB columns. But I've never understood that, as this is exactly the purpose that LOB columns were meant to serve: the storage of arbitrary data in the database, no different than in a file, without any sort of internal random access, indexing, or searching. A black-box column to the database server, if you will.

Complicated Object Graphs

Worse is when you have classes with lots of fields and subobjects. This can get messy, even if they're just value objects.

The multiple tables, table hierarchies, link tables, and huge column lists that can result, while entirely valuable for some classes, are just overkill for others. Especially when you won't be indexing, searching or retrieving individually, any of those fields/rows/objects.

Persisting Serialized Classes

ORMs can persist a whole instance into one field. They do so using serialization of some form or another to turn the instance into flat data like a byte array. Then this data can be persisted as a BLOB value.

In the case of Hibernate, it's as simple as marking a reference (non-primitive) field with @Lob, just as with the strings. But in this case, by default, Hibernate will use the standard Java serialization mechanism, which some would take exception *cough* to.

You can provide your own serialization mechanism, of course. I'm not sure how to do this across the board for Hibernate, but my solution is just to create a custom Hibernate UserType implementation. Then I can mark the fields that I want stored as non-entities with this custom UserType.

This is surprisingly easy.

XML

For the serialization format, I chose XML. There are a number of good XML serializers for java objects. Some even work well on arbitrary classes, though you shouldn't be putting just any classes in your databases columns.

I use XML because it's easy to:

  • read, when debugging
  • manipulate if need be
  • migrate, on object structure changes

I chose xstream as my serializer because:

  • it's easy to set up and use
  • it maps most classes automatically
  • it's extremely flexible

Persisting Exceptions

Another reason we went with xstream is that in some cases we serialize exceptions, for debugging and posterity. This can be a whole mess by itself. So I needed something that could serialize most arbitrary classes without configuration, as you never know what's going to pop up in an instance field in an exception class/cause hierarchy.

But at the same time I didn't want to use different serialization solutions, one for exceptions and another for my real data classes.

I'll leave out the discussion about how to deal with bad/unserializable exception classes *cough oracle cough*. Truth be told, we're phasing out persisting exceptions after all. Which is just common sense.

There are other good serialization libraries out there, though.

Lob compression

Another discussion I'm leaving out is the compression of the string XML CLOB data into a binary LOB. Which we do as an easy space saving measure. But once you've got the hibernate UserType setup, something like this is rather academic.
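For the curious, here is a minimal sketch of what such a compression step could look like using only the JDK. GZIP is an assumption on my part (the post doesn't say which compression is used), and the class LobCompression is a hypothetical helper, not part of Hibernate. A UserType built on it would report a binary SQL type and read/write bytes instead of a character stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: squeezes XML text into bytes suitable for a binary LOB column.
class LobCompression {

  // Encode the XML as UTF-8 and gzip it.
  static byte[] compress(String xml) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
      gzip.write(xml.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException("failed to compress lob data", e);
    }
    return bytes.toByteArray();
  }

  // Reverse of compress(): gunzip and decode as UTF-8.
  static String decompress(byte[] compressed) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] buffer = new byte[1024];
      int numRead;
      while ((numRead = gzip.read(buffer)) != -1) {
        bytes.write(buffer, 0, numRead);
      }
    } catch (IOException e) {
      throw new UncheckedIOException("failed to decompress lob data", e);
    }
    return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    StringBuilder xml = new StringBuilder("<list>");
    for (int i = 0; i < 200; i++) {
      xml.append("<item>some repetitive value</item>");
    }
    xml.append("</list>");
    byte[] compressed = compress(xml.toString());
    System.out.println(xml.length() + " chars -> " + compressed.length + " bytes");
    System.out.println(decompress(compressed).equals(xml.toString()));
  }
}
```

XML is repetitive enough that it tends to compress well, but the trade-off is that the column is no longer readable straight from a SQL console.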

Whole Entity Serialization

I've considered serializing the whole entity to one field. Only pulling out specially annotated fields into individual table columns for indexing and searching.

But this doesn't seem easy, let alone a good idea. That would leave you with redundant data, broken out into non-LOB columns and serialized in the LOB column at the same time. Which could lead to bugs. And the implementation of this would be complicated to begin with.

Metadata Objects

It's possible to cajole all your non-column fields into one Metadata object, so that they can be serialized into one LOB column in the table for your entity. But I find this is also overdoing it.

Databases can have multiple LOB columns per table. Having a few instead of just one isn't going to have a major impact. I just design my classes naturally, and only resort to special Metadata objects like this in rare situations. Say, when I have a TON of fields in a class, and I won't be using them in SQL.

That may mean a few other primitive fields get columns of their own, as well.

Nitty Gritty

Let's get down to some code then. It's pretty straightforward if you check the javadocs for the Hibernate UserType.

Of course Serializer is my singleton class providing the chosen serialization mechanism.
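The Serializer class itself isn't shown here. As a rough stand-in, the sketch below offers the same two calls used by the UserType, serialize() and deserialize(), but it swaps in the JDK's java.beans.XMLEncoder and XMLDecoder instead of xstream so that it runs without extra libraries. Treat it as an assumption about the shape of the class, not the real implementation. Note that XMLEncoder only handles JavaBean-style classes and standard collections, which is far more restrictive than xstream.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;

// Hypothetical stand-in for the Serializer used by XMLUserType.
final class Serializer {

  private Serializer() { }

  // Turn an object into an XML string.
  static String serialize(Serializable value) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    XMLEncoder encoder = new XMLEncoder(bytes); // writes UTF-8 by default
    encoder.writeObject(value);
    encoder.close(); // closing writes the end of the document
    return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
  }

  // Rebuild the object from the XML string produced by serialize().
  static Object deserialize(String xml) {
    XMLDecoder decoder =
        new XMLDecoder(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    try {
      return decoder.readObject();
    } finally {
      decoder.close();
    }
  }

  public static void main(String[] args) {
    ArrayList<String> tags = new ArrayList<>();
    tags.add("urgent");
    tags.add("billing");
    String xml = serialize(tags);
    System.out.println(xml);
    System.out.println(deserialize(xml).equals(tags));
  }
}
```

Whatever library sits behind it, keeping the serialization choice behind these two methods means the UserType below doesn't have to change if the format does.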

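import java.io.IOException;
import java.io.Reader;
import java.io.Serializable;
import java.io.StringReader;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.hibernate.HibernateException;
import org.hibernate.usertype.UserType;

public class XMLUserType implements UserType {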

  @Override
  public int[] sqlTypes() {
    return new int[]{ java.sql.Types.CLOB };
  }

  @Override
  public Class<Serializable> returnedClass() {
    return Serializable.class;
  }

  @Override
  public boolean isMutable() {
    return true;
  }

  @Override
  public Object deepCopy(Object value) {
    return Serializer.deserialize(Serializer.serialize((Serializable) value));
  }

  @Override
  public Serializable disassemble(Object value) {
    return (Serializable) value;
  }
  @Override
  public Object assemble(Serializable cached, Object owner) {
    return cached;
  }

  @Override
  public Object replace(Object original, Object target, Object owner){
    return deepCopy(original);
  }

  @Override
  public boolean equals(Object x, Object y) {
    if (x == null ) {
      return y == null;
    }
    else {
      return x.equals(y);
    }
  }

  @Override
  public int hashCode(Object x) {
    // stay consistent with equals(), which treats null as a valid value
    return x == null ? 0 : x.hashCode();
  }

  @Override
  public Object nullSafeGet(ResultSet rs, String[] names, Object owner) throws HibernateException, SQLException {
    String columnName = names[0];

    try {

      Reader stream = rs.getCharacterStream(columnName);

      if ( stream == null )
        return null;

      // slurp reader to string.
      StringBuilder buffer = new StringBuilder();
      char[] c = new char[1024];
      int numRead = 0;
      while ( ( numRead = stream.read(c) ) != -1 ) {
        buffer.append(c, 0, numRead);
      }
      stream.close();

      return Serializer.deserialize(buffer.toString());

    } catch (IOException e) {
      throw new HibernateException("IOException while reading clob",e);
    }
    
  } 


  @Override
  public void nullSafeSet(PreparedStatement st, Object value, int index) throws HibernateException, SQLException {

    if( value == null ) {
      st.setNull( index, sqlTypes()[0] );
      return;
    }
   
    String xmlString = Serializer.serialize((Serializable) value);

    Reader reader = new StringReader(xmlString);
    st.setCharacterStream(index, reader, xmlString.length());
  }
 
}

Notes

  • You can follow up the equals() and hashCode() implementations by comparing the serialized data as well. This might make sense in some cases, if the serialized form is simpler than the runtime form of your objects. This can happen with lots of transient fields.
  • Oracle can store LOBs inline in the table row data when the LOB is less than 4k. You'll want to be certain whether it's doing that in your table or not, in case it blows up the row size.