Comparing a string with the empty string (Java)

user42155 · Feb 10, 2009

I have a question about comparing a string with the empty string in Java. Is there a difference, if I compare a string with the empty string with == or equals? For example:

String s1 = "hi";

if (s1 == "")

or

if (s1.equals("")) 

I know that one should compare strings (and objects in general) with equals, and not ==, but I am wondering whether it matters for the empty string.

Answer

cletus · Feb 10, 2009
s1 == ""

is not reliable, because it tests reference equality rather than object equality (and String isn't strictly canonical).

s1.equals("")

is better, but it throws a NullPointerException if s1 is null. Better yet is:

"".equals(s1)

No null pointer exceptions.
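A minimal sketch of how the three forms behave. The class name EmptyCheck and the helper isEmpty are hypothetical names chosen for illustration, not part of any library:

```java
class EmptyCheck {
    // Null-safe check: the literal is the receiver, so a null argument just yields false.
    static boolean isEmpty(String s) {
        return "".equals(s);
    }

    public static void main(String[] args) {
        String literal = "";
        String built = new String("");   // a distinct object with the same contents
        String missing = null;

        System.out.println(literal == "");      // true: both refer to the same pooled literal
        System.out.println(built == "");        // false: different objects
        System.out.println(built.equals(""));   // true: same contents
        System.out.println(isEmpty(built));     // true
        System.out.println(isEmpty(missing));   // false, and no exception
        // missing.equals("") would throw a NullPointerException here.
    }
}
```

Only the == results depend on how the string was created; the two equals forms differ only in how they treat null.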

EDIT: OK, a question was asked about canonical form. This article defines it as:

Suppose we have some set S of objects, with an equivalence relation. A canonical form is given by designating some objects of S to be "in canonical form", such that every object under consideration is equivalent to exactly one object in canonical form.

To give you a practical example: take the set of rational numbers (or "fractions" as they're commonly called). A rational number consists of a numerator and a denominator (divisor), both of which are integers. These rational numbers are equivalent:

3/2, 6/4, 24/16

Rational numbers are typically written such that the gcd (greatest common divisor) of the numerator and denominator is 1. So all of them will be simplified to 3/2. 3/2 can be viewed as the canonical form of this set of rational numbers.
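As a rough sketch of that reduction in code, here is a hypothetical Fraction class (the name and fields are assumptions made for illustration) that stores every value in lowest terms with a positive denominator:

```java
class Fraction {
    final long num;
    final long den;

    // Reduce to lowest terms with a positive denominator: the canonical form.
    Fraction(long num, long den) {
        if (den == 0) {
            throw new IllegalArgumentException("denominator is zero");
        }
        long g = gcd(Math.abs(num), Math.abs(den));
        long sign = den < 0 ? -1 : 1;
        this.num = sign * num / g;
        this.den = sign * den / g;
    }

    // Euclid's algorithm for the greatest common divisor.
    static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    @Override
    public String toString() {
        return num + "/" + den;
    }

    public static void main(String[] args) {
        System.out.println(new Fraction(6, 4));    // 3/2
        System.out.println(new Fraction(24, 16));  // 3/2
    }
}
```

Because every equivalent fraction ends up with the same numerator and denominator, comparing two canonical forms field by field is enough to decide equivalence.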

So what does it mean in programming when the term "canonical form" is used? It can mean a couple of things. Take for example this imaginary class:

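public class MyInt {
  private final int number;

  public MyInt(int number) { this.number = number; }

  // Two MyInts are equal exactly when they wrap the same number.
  @Override public boolean equals(Object o) {
    return o instanceof MyInt && ((MyInt) o).number == number;
  }
  @Override public int hashCode() { return number; }
}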

The hash code of the class MyInt is a canonical form of that class because for the set of all instances of MyInt, you can take any two elements m1 and m2 and they will obey the following relation:

m1.equals(m2) == (m1.hashCode() == m2.hashCode())

That relation is the essence of canonical form. A more common way this crops up is when you use factory methods on classes such as:

public class MyClass {
  private MyClass() { }

  public static MyClass getInstance(...) { ... }
}

Callers cannot instantiate the class directly because the constructor is private; getInstance() is just a factory method. A factory method lets you do things like:

  • Always return the same instance (abstracted singleton);
  • Just create a new instance with every call;
  • Return objects in canonical form (more on this in a second); or
  • whatever you like.

Basically, the factory method abstracts object creation. Personally, I think it would be an interesting language feature to force all constructors to be private to enforce the use of this pattern, but I digress.

With this factory method you can cache the instances you create, so that any two instances s1 and s2 obey the following test:

(s1 == s2) == s1.equals(s2)
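One way such a caching factory might look is sketched below. The class Label, its factory method of, and the use of ConcurrentHashMap as the cache are all assumptions made for this example, not something the pattern requires:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class Label {
    // One cached instance per distinct text.
    private static final Map<String, Label> CACHE = new ConcurrentHashMap<>();

    private final String text;

    private Label(String text) {
        this.text = text;
    }

    // Factory method: returns the cached instance, creating it on first use.
    static Label of(String text) {
        return CACHE.computeIfAbsent(text, Label::new);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Label && ((Label) o).text.equals(text);
    }

    @Override
    public int hashCode() {
        return text.hashCode();
    }

    public static void main(String[] args) {
        Label a = Label.of("blah");
        Label b = Label.of(new String("blah"));
        Label c = Label.of("other");
        System.out.println((a == b) == a.equals(b)); // true
        System.out.println((a == c) == a.equals(c)); // true
    }
}
```

Since the constructor is private and every instance goes through the cache, reference equality and equals agree for all instances obtained from of.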

So when I say String isn't strictly canonical, I mean that string literals with the same contents are the same object:

String s1 = "blah";
String s2 = "blah";
System.out.println(s1 == s2); // true

But as others have pointed out, you can break this by using:

String s3 = new String("blah"); // s1 == s3 is false

and restore it by using:

String s4 = s3.intern(); // s1 == s4 is true

Since you can't rely on reference equality completely, you shouldn't rely on it at all.
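Putting those cases side by side, here is a small runnable sketch (the class name StringIdentity is just for illustration). Note that intern() is an instance method on String:

```java
class StringIdentity {
    public static void main(String[] args) {
        String s1 = "blah";
        String s2 = "blah";              // same literal, same pooled object
        String s3 = new String("blah");  // explicitly a new object
        String s4 = s3.intern();         // looks up the pooled object again

        System.out.println(s1 == s2);      // true
        System.out.println(s1 == s3);      // false
        System.out.println(s1 == s4);      // true
        System.out.println(s1.equals(s3)); // true, regardless of how s3 was created
    }
}
```

Only equals gives the same answer in every case, which is why it is the safe choice even for the empty string.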

As a caveat to the above pattern, I should point out that controlling object creation with private constructors and factory methods doesn't guarantee that reference equality means object equality, because of serialization. Serialization bypasses the normal object creation mechanism. Josh Bloch covers this topic in Effective Java (originally in the first edition, when he talked about the typesafe enum pattern, which later became a language feature in Java 5). You can get around it by providing a (private) readResolve() method. But it's tricky. Class loaders will affect the issue too.
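A minimal sketch of the readResolve() approach for a single-instance class follows. The class Registry and the helper roundTrip are hypothetical names used only for this example; the helper serializes to an in-memory buffer and reads the object back:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class Registry implements Serializable {
    private static final long serialVersionUID = 1L;

    static final Registry INSTANCE = new Registry();

    private Registry() { }

    // Invoked by deserialization; swaps the freshly created copy for the existing instance.
    private Object readResolve() {
        return INSTANCE;
    }

    // Serializes an object to memory and reads it back.
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(INSTANCE) == INSTANCE); // true
    }
}
```

Without readResolve(), the round trip would produce a second Registry object and the comparison would print false. This sketch does not address the class loader issue mentioned above.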

Anyway, that's canonical form.