References & integrity

This article is about preserving the data integrity of a class. It's not a complicated issue, but it's one that shows up time and again and has a few not-quite-obvious borderline cases. By going back to the basics I hope to expose the logic behind the way Java handles these things.

Big words

When you write a class that does something pretty neat and you want to let other (potentially stupid or malicious) classes use its fancy algorithms without messing things up, you can call all your friends and tell them you're worried about preserving data integrity. As if that wasn't enough to make your mama proud, the way you go about maintaining data integrity in a single class is called encapsulation.

Wikipedia defines encapsulation as "...the hiding of the internal mechanisms and data structures of a software component behind a defined interface, in such a way that users of the component (other pieces of software) only need to know what the component does, and cannot make themselves dependent on the details of how it does it..." But most people tend to think of it as replacing all your public variables with private ones and providing get and set methods instead. Basically, this is the right idea, but there is a little bit more to it than that.

Pass by reference?

A variable in Java can contain one of two things:

  1. A primitive value
  2. A reference to an object

There are no explicit pointers and no 'passing by reference' like the PHP trick of prefixing a function parameter with an ampersand. This means all crazy parameter passing schemes are out the window. Check out the following example:

public static void main(String[] args)
{
    Object foo = new PandaBear();
    doSomething(foo);
    System.out.println(foo.getClass());     // Prints 'PandaBear'
}

public static void doSomething(Object foo)
{
    foo = new AntEater();
}

This shows that the class of foo remains unchanged inside the main method. Inside the doSomething method, the local variable foo - that used to be a reference to a PandaBear - has been replaced by a reference to an AntEater. But outside of this method the value of foo (the object it points to) has remained unchanged. Basically, a copy of a reference was passed to a method and nothing was passed back. This is a major difference between Java's references and "pass by reference" schemes.

If we'd used a primitive datatype (like an int or a double) instead of an object, the exact same thing would have happened, and nobody would have been surprised. This is why some people say Java passes everything by value, where a value is either a reference or a primitive.
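The primitive case can be sketched the same way as the PandaBear example. The names PrimitiveDemo, tryToChange and valueAfterCall are made up for this illustration:

```java
class PrimitiveDemo
{
    // Receives a copy of the caller's int; assigning to it only changes the copy.
    static void tryToChange(int n)
    {
        n = 99;
    }

    // Calls tryToChange and returns the caller's own variable afterwards.
    static int valueAfterCall()
    {
        int n = 7;
        tryToChange(n);
        return n;
    }

    public static void main(String[] args)
    {
        System.out.println(valueAfterCall());   // Prints '7'
    }
}
```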

The idea behind this is that you obtain a clean cut between input and output. Input = arguments, Output = return value. So instead of passing by reference you have the method return a new object similar to the old one, but with a different value. Like this:

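public static void main(String[] args)
{
    Integer x = Integer.valueOf(7);
    x = multiplyByTwo(x);
    System.out.println(x);      // Prints '14'
}

public static Integer multiplyByTwo(Integer x)
{
    // The Integer passed in is never modified; a different Integer is returned
    return Integer.valueOf(x * 2);
}

Here the multiplyByTwo method receives a reference to an Integer object and uses it to calculate the value of another Integer object, which it returns. (Integer.valueOf is used instead of new Integer(...) because the constructor is deprecated in recent Java versions.)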

On the other hand...

What you can do is use an object's own methods to change its state:

public static void main(String[] args)
{
    Wheel w = new Wheel();
    w.deflate();
    pump(w);
    System.out.println(w.isInflated());     // Prints 'true'
}

public static void pump(Wheel w)
{
    w.inflate();
}

What happens here is that the Wheel object changes some of its internal variables. Because both the main and the pump method hold a reference to the same object the change will be reflected in the main method.
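The Wheel class isn't shown above. A minimal sketch of what it could look like, assuming its whole state is a single boolean field (the field name and the WheelDemo wrapper are my own invention for this example):

```java
class Wheel
{
    private boolean inflated = true;    // assumed internal state

    void inflate()
    {
        inflated = true;
    }

    void deflate()
    {
        inflated = false;
    }

    boolean isInflated()
    {
        return inflated;
    }
}

class WheelDemo
{
    // w is a copy of the caller's reference, but it points to the same Wheel.
    static void pump(Wheel w)
    {
        w.inflate();
    }

    public static void main(String[] args)
    {
        Wheel w = new Wheel();
        w.deflate();
        pump(w);
        System.out.println(w.isInflated());     // Prints 'true'
    }
}
```

Compare this with the doSomething example earlier: assigning a new object to the parameter has no effect on the caller, but calling a method on the object the parameter points to does.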

Encapsulation

So what does this mean for encapsulation?

It often happens that a class has some internal variables that external classes shouldn't change, but still might be interested in. A bill, for example, might have a subtotal you want to display without allowing it to be modified.

Now the difference between the way objects and primitives are handled means that it's not enough to replace a class like this:

public class Movie
{
    public int budget;
    public Director director;
   
    // ...rest of code here
}

with code like this:

public class Movie
{
    private int budget;
    private Director director;
   
    public int getBudget()
    {
        return budget;
    }
   
    public Director getDirector()
    {
        return director;
    }
   
    // ...rest of code here
}

Returning the int is safe - it's a primitive value - but the getDirector method returns a direct reference to your director, allowing people to call methods like director.sendToNorthPole() that can really mess up your Movie. Instead you should either return a clone of the director (ethical issues aside) or provide methods that expose whatever primitive fields of the Director object external classes might be interested in (age, numberOfGirlfriends, etc). In many cases cloning is an expensive operation, so you'll often see the latter approach.
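Both options could look roughly like this. It's only a sketch: the Director class here is a stand-in with a single age field, and the copy constructor, setAge and getDirectorAge are names I made up for the example.

```java
class Director
{
    private int age;

    Director(int age)
    {
        this.age = age;
    }

    // Copy constructor: builds an independent Director with the same state.
    Director(Director other)
    {
        this.age = other.age;
    }

    int getAge()
    {
        return age;
    }

    void setAge(int age)
    {
        this.age = age;
    }
}

class Movie
{
    private Director director;

    Movie(Director director)
    {
        this.director = director;
    }

    // Option 1: hand out a copy, so callers can't touch the original.
    Director getDirector()
    {
        return new Director(director);
    }

    // Option 2: expose only the primitive value callers care about.
    int getDirectorAge()
    {
        return director.getAge();
    }
}
```

With this version, calling movie.getDirector().setAge(5) only changes the copy; movie.getDirectorAge() still returns the original age.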

As a general rule, classes only care about their own data. So if one of your input arguments is an object, you don't have to clone it before using it. This is the responsibility of the class that passed it to you.

Primitives, Strings, Arrays

Java has eight primitive data types: int, double, float, boolean, char, byte, long, and short.
Of these, int, double, boolean and long are the ones you'll use the most. Bytes and shorts are smaller types of int that use up less memory, but modern CPUs deal with them by converting to the int type, so the only time it's appropriate to use them is when you're storing them by the million in some large database.

Apart from these big eight, everything is a reference. This includes Strings and the object versions of primitives like Integer and Float. However...

Strings

Strings in Java are immutable objects, which means they never change. Functions like toLowerCase() don't actually change a string, but return a reference to a brand new String object. When you write s = s + "more stuff" a new string is created, and the value of s is changed to point at this new string. This is good news for your data integrity, because it means you can safely return a string value without copying it first. Although a reference is passed, no one can modify your String! This makes Java's strings act very much like a kind of ninth primitive data type.
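A small sketch of what that means in practice (StringDemo and shout are hypothetical names, used only for this example):

```java
class StringDemo
{
    // toUpperCase() returns another String; the one s refers to is untouched.
    static String shout(String s)
    {
        return s.toUpperCase();
    }

    public static void main(String[] args)
    {
        String original = "panda";
        String loud = shout(original);
        System.out.println(original);   // Prints 'panda'
        System.out.println(loud);       // Prints 'PANDA'
    }
}
```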

Object representations of primitive types

The same goes for the object versions of primitive types. Java's Integer, Float, Double and even BigInteger are all immutable objects, which means you can return them just like primitives.

Arrays

Since arrays contain a lot of data, making them immutable would reduce performance dramatically. This means that when you want to return an array you don't want anybody to mess with, you have to clone it first. Since this is a bit of a performance hit, most programmers will hesitate to clone arrays, and sacrifice integrity for performance.

It gets worse when you consider that an array can contain objects. This means that even a copy of your array will provide the outside world with references to all your precious objects. To prevent this you'd have to clone the array's contents as well, and their contents, and their contents, etc.
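To make the difference concrete, here is a rough sketch with one array of primitives and one array of (mutable) objects. Scoreboard and its fields are invented for the example, and StringBuilder just stands in for "some object that can be changed":

```java
class Scoreboard
{
    private int[] scores = {1, 2, 3};
    private StringBuilder[] names = {new StringBuilder("ann"), new StringBuilder("bob")};

    // A clone of a primitive array is fully independent of the original.
    int[] getScores()
    {
        return scores.clone();
    }

    // A clone of an object array is shallow: new array, same elements.
    StringBuilder[] getNames()
    {
        return names.clone();
    }

    int firstScore()
    {
        return scores[0];
    }

    String firstName()
    {
        return names[0].toString();
    }

    public static void main(String[] args)
    {
        Scoreboard board = new Scoreboard();
        board.getScores()[0] = 99;          // changes only the copy
        board.getNames()[0].append("!");    // changes the shared StringBuilder
        System.out.println(board.firstScore());     // Prints '1'
        System.out.println(board.firstName());      // Prints 'ann!'
    }
}
```

The int array is protected by the clone, but the names still leak: the outside world got its own array, filled with references to the very same objects.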

The answer here is either to let go of integrity and hope for the best (usually a bad idea), to suck up the performance hit (because PCs are fast these days) or to try to find an alternative to sending this array in the first place (better).

On the other hand...

Sometimes this behaviour can be exactly what you need. For example you could pass a GUI element an array of color values to print to screen, and then every time you change the array's contents the GUI's data will update too.
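Stripped of any real GUI code, the idea is something like the sketch below. ColorView is a made-up class that simply keeps the reference it was given instead of copying the array:

```java
class ColorView
{
    private int[] colors;   // the same array the caller holds, not a copy

    ColorView(int[] colors)
    {
        this.colors = colors;
    }

    int colorAt(int index)
    {
        return colors[index];
    }

    public static void main(String[] args)
    {
        int[] colors = {0xFF0000, 0x00FF00};
        ColorView view = new ColorView(colors);
        colors[0] = 0x0000FF;                       // change the caller's array...
        System.out.println(view.colorAt(0) == 0x0000FF);    // Prints 'true'
    }
}
```

A real GUI element would still need to be told to repaint; the sketch only shows that its data follows the array.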

The 'final' keyword

An alternative to using get methods for everything is to declare some variables final. A final variable is a variable whose value doesn't change and is most often used in the context of primitive types. If your object has a unique integer id that never changes you might declare it as a public final int. This will allow other classes to see the id, but nobody (including you) can change it.

Although it's hard to think of things that never change (and can be quantified by primitive data types), this scenario arises a lot. One example is a database that creates an information object with many different fields, which then needs to be passed between a backend and a user interface. Making the fields final will allow everyone to use the same object without any chance of data corruption.

An object's final values must be set either when they're declared (public final int x = 7;) or in the object's constructor:

class Row
{
    public final int id;
   
    public Row(int id)
    {
        this.id = id;
    }
}

But don't use it for arrays

When you write final int[] anArray the value of anArray is a reference to the array of integers you created, and the final keyword guarantees this value will never change. In other words, anArray will always point to the same array object - but its values still might change!

The same holds for all objects, which is why the final keyword is usually only used for primitives and immutable types.
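A quick sketch of the problem (FinalArrayDemo is a hypothetical class, just for illustration):

```java
class FinalArrayDemo
{
    public final int[] values = {1, 2, 3};

    public static void main(String[] args)
    {
        FinalArrayDemo demo = new FinalArrayDemo();

        // demo.values = new int[3];    // would not compile: values is final
        demo.values[0] = 42;            // compiles fine: the contents aren't protected

        System.out.println(demo.values[0]);     // Prints '42'
    }
}
```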

One last thing: the final keyword has a second, unrelated use: a method declared 'final' cannot be overridden by an extending subclass. Yup. True that.

And that's all I have to say about that :)

May 20th, 2009

Comments

Michael wrote:

As an aside, the "final" keyword is also a good way to ensure variables are set after construction. Especially when you have multiple constructors this pays off.

Apr 25th, 2010

michael wrote:

By the way, if you're ever wondering about details like these, you can consult the official Java Language Specification:

http://java.sun.com/docs/books/jls/

As an example, about the && operator it writes:
"At run time, the left-hand operand expression is evaluated first ... if the resulting value is false, the value of the conditional-and expression is false and the right-hand operand expression is not evaluated."

Which is nice to know :)

May 21st, 2009
