When is a string literal too long?

When it’s a multi-byte encoded string, that’s when.

Recently I came across a runtime problem where an ORA-01704 (string literal too long) error was occuring when inserting some text into a database table. The column had been set to the maximum amount of characters allowed for VARCHAR2, that is 4000 characters. Before inserting any values, the text was put through a truncation method to make sure it did not exceed 4000 characters. That’s what made the issue such a conundrum. If we had code limiting the string to 4000 characters, and the logs for the SQL statement were showing that the truncation to 4000 characters was happening, why was the Oracle database reporting that the string was longer than 4000 characters?

This is where the logs became very important. I noticed that the some of the text contained non-Western European characters, which was perfectly valid. However, while these characters are counted as single characters, depending on the encoding, they could be multi-byte characters. Not only is 4,000 the maximum number of characters for VARCHAR2, it is also the maximum number of bytes. In our case, we had a string of 4,000 characters, but in UTF-8 it required 4,017 bytes to store it. So, the solution was to make a slight change to the truncation logic, by introducing a maximum number of bytes for a specific character encoding. Now the truncation method has the following:


try {
while (truncatedString.getBytes(charSetName).length > maxBytes) {
// remove the last character and check the bytes lenght again
truncatedString = truncatedString.substring(truncatedString.length() - 2);
}
} catch (UnsupportedEncodingException e) {
logger.log(Level.SEVERE, "The character set " + charSetName +
" is not supported. Unable to truncated string if it is too long.", e);
}

The variable charSetName is set to the character set used by the database (in our case UTF-8) and the maxBytes variable is the maximum number of bytes allowed (in our case 4000). In the rare cases when there are over 4000 bytes in the string, this code will systematically remove the last character until the number of bytes come down. This code could be improved to be more efficient. For example, the code could check how many characters need to be removed and then remove that chunk, rather than removing a character at a time. It performs in the scenario it is used because the string is already chopped down to 4,000 characters. If the system begins to receive a large amount of multi-byte characters in a single string, then reducing the string to 4,000 characters may still leave you with an 8,000 byte string. Let me know if you determine a more efficient approach.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s