Everyone agrees that it’s crucial to validate user input so that, among other things, your application never tries to write a value that’s too long into a database field with a specific limit. Users of your application shouldn’t, however, be left guessing whether the megabyte they pasted (and you know they will) into the eensy-teensy text field really got saved to the database or not. So you should limit the text field itself, so that they get immediate feedback rather than some Johnny-come-lately error message or, worse, a bunch of text silently dropped in the bit bucket.
One fairly well-established technique is to write a DocumentFilter that, whenever insertString() or replace() is called, validates the added text and truncates it as necessary to ensure the database field length is not exceeded.
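As a rough illustration of that technique, here is a minimal sketch of a filter that enforces a plain character-count limit. The class name LengthLimitFilter and the maxChars parameter are made up for this example; only DocumentFilter and its insertString()/replace() hooks come from Swing itself. A byte-based limit would swap the substring() call for byte-aware trimming like the algorithm discussed later in this post.

```java
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.DocumentFilter;

// Hypothetical example class: caps a document at maxChars characters.
final class LengthLimitFilter extends DocumentFilter {
    private final int maxChars;

    LengthLimitFilter(int maxChars) {
        this.maxChars = maxChars;
    }

    @Override
    public void insertString(FilterBypass fb, int offset, String text, AttributeSet attr)
            throws BadLocationException {
        // An insert is just a replace that removes nothing.
        replace(fb, offset, 0, text, attr);
    }

    @Override
    public void replace(FilterBypass fb, int offset, int length, String text, AttributeSet attrs)
            throws BadLocationException {
        if (text == null) {
            text = "";
        }
        // Room left once the replaced range has been removed.
        int room = maxChars - (fb.getDocument().getLength() - length);
        if (text.length() > room) {
            // Truncate the incoming text so the document never exceeds the limit.
            text = text.substring(0, Math.max(room, 0));
        }
        fb.replace(offset, length, text, attrs);
    }
}
```

To try it out, install it on a text component's document, for example with ((AbstractDocument) textField.getDocument()).setDocumentFilter(new LengthLimitFilter(40)).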
Now the fun part. What happens when you try to store your comments in N’Ko, Mongolian, Bopomofo (phonetic markers, now commonly used as an input character set for Mandarin), or even ancient Viking runes? You get two choices. Store the text as ASCII or ISO-8859-1 (aka Latin-1), or whatever, and you lose data. Oops. Or convert it to UTF-16 or UTF-8. Hm. Wait a minute, now the value takes somewhere between one and three times as many bytes as the original String has characters. So, how do you limit the text field to the number of bytes the database will permit? If you picked UTF-16, it’s pretty simple: divide the database limit by two. But that is usually pretty wasteful of space. With UTF-8, on the other hand, you can’t predict exactly how many bytes the representation needs until you try it out.
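To make those size differences concrete, here is a small sketch that encodes the same three-character strings in several charsets and counts the bytes. The class ByteLengthDemo and the byteLength helper are invented for this illustration; String.getBytes(Charset) and StandardCharsets are standard Java. UTF-16BE is used rather than plain UTF-16 so that no byte-order mark inflates the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares encoded sizes across charsets.
final class ByteLengthDemo {
    // Number of bytes the string occupies once encoded in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String latin = "abc";
        String runes = "\u16A0\u16A2\u16A6"; // three runic letters from the BMP

        System.out.println(byteLength(latin, StandardCharsets.UTF_8));    // 3
        System.out.println(byteLength(runes, StandardCharsets.UTF_8));    // 9
        System.out.println(byteLength(runes, StandardCharsets.UTF_16BE)); // 6

        // Latin-1 cannot represent the runes; each one is replaced by '?'.
        byte[] lossy = runes.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(lossy, StandardCharsets.ISO_8859_1)); // ???
    }
}
```

The same three characters cost three, six, or nine bytes depending on the charset, which is why a limit expressed in characters cannot simply stand in for a limit expressed in bytes.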
The following algorithm will produce a String which, if converted to the supplied encoding, will be no more than maxBytes in length. It could be less, depending on the encoding chosen and the text being trimmed. This happens because it removes whole characters at once, and a single character may account for several bytes, jumping you from one byte over the limit to two under.
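public static String limitStringByBytes(String string, int maxBytes, String encoding) {
    if (string == null)
        return string;
    // Index of the last character we currently intend to keep.
    int i = string.length() - 1;
    // Bytes over the limit; zero or negative means the string already fits.
    // computeByteLength is a helper (not shown) returning the encoded size of a String or a single char.
    int shaveBytes = computeByteLength(string, encoding) - maxBytes;
    // Drop whole characters from the end until the excess is gone.
    while (shaveBytes > 0 && i >= 0) {
        shaveBytes -= computeByteLength(string.charAt(i), encoding);
        i--;
    }
    if ((i + 1) <= 0)
        return "";      // nothing fits
    else if ((i + 1) >= string.length())
        return string;  // nothing was trimmed
    else
        return string.substring(0, i + 1);
}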
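The listing relies on a computeByteLength helper that the post does not show. The following is one possible self-contained version, offered as a sketch rather than the original implementation: the class name ByteLimitSketch, the choice of a Charset parameter instead of an encoding name, and the bodies of both helper overloads are all assumptions. It also assumes a stateless encoding such as UTF-8, where the bytes of the whole string equal the sum of the bytes of its characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical container class for a runnable version of the algorithm.
final class ByteLimitSketch {
    // Assumed helper: encoded size of a whole string.
    static int computeByteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Assumed helper: encoded size of a single BMP character.
    static int computeByteLength(char c, Charset cs) {
        return String.valueOf(c).getBytes(cs).length;
    }

    static String limitStringByBytes(String string, int maxBytes, Charset cs) {
        if (string == null) {
            return null;
        }
        int i = string.length() - 1;
        int shaveBytes = computeByteLength(string, cs) - maxBytes;
        // Remove trailing characters until the excess is gone.
        while (shaveBytes > 0 && i >= 0) {
            shaveBytes -= computeByteLength(string.charAt(i), cs);
            i--;
        }
        return string.substring(0, i + 1);
    }

    public static void main(String[] args) {
        // "héllo" is 6 bytes in UTF-8 because 'é' takes two.
        System.out.println(limitStringByBytes("h\u00E9llo", 3, StandardCharsets.UTF_8)); // hé
        // The 'é' does not fit in the second byte, so the result is one byte short.
        System.out.println(limitStringByBytes("h\u00E9llo", 2, StandardCharsets.UTF_8)); // h
    }
}
```

The second call shows the undershoot described above: with a two-byte limit, dropping the two-byte character takes the result from one byte over the limit to one byte under it.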
As a final note (thanks to the comments by one of our faithful and numerous readers), we would like to acknowledge that we have indeed ignored the existence of the supplementary planes of Unicode, sticking to the Basic Multilingual Plane in this example. This avoids the even more intricate hassle of dealing with surrogate pairs. If characters from one of these rather obscure scripts (Byzantine Music Symbols, Phoenician, or my personal favorite, Deseret [editor’s note: yeah, I didn't know what it was either. Wikipedia to the rescue], for example) should appear, it’s possible that one of them might be truncated mid-character, leaving half of a surrogate pair behind. According to the Unicode standard, this is an error, but it is also a very unlikely situation to encounter. Free Palantir t-shirt to the first person who posts a working example that properly deals with surrogates.
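For what it’s worth, one way to avoid splitting a surrogate pair is to walk the string forward by code point instead of by char, keeping each code point only if all of its bytes still fit. The sketch below is a different technique from the backward-trimming listing above, not a fix to it; the class and method names are invented, and it assumes a stateless encoding without a byte-order mark (UTF-8 or UTF-16BE, say), since each code point is encoded on its own.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class: byte-limited truncation that never splits a surrogate pair.
final class CodePointByteLimit {
    static String limitByBytes(String s, int maxBytes, Charset cs) {
        if (s == null) {
            return null;
        }
        int end = 0;   // index just past the last char we keep
        int used = 0;  // bytes consumed so far
        while (end < s.length()) {
            int cp = s.codePointAt(end);
            int chars = Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
            int size = s.substring(end, end + chars).getBytes(cs).length;
            if (used + size > maxBytes) {
                break; // this code point would not fit in full
            }
            used += size;
            end += chars;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // U+10400 (a Deseret letter) is a surrogate pair in Java and 4 bytes in UTF-8.
        String text = "a\uD801\uDC00b";
        System.out.println(limitByBytes(text, 4, StandardCharsets.UTF_8).length()); // 1
        System.out.println(limitByBytes(text, 5, StandardCharsets.UTF_8).length()); // 3
    }
}
```

With a four-byte limit, only the "a" survives, because the Deseret letter needs four more bytes; with five, the whole pair is kept intact.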