SSIS Alpha Splits using the CODEPOINT() Function

A relatively common requirement in ETL processing is to break records into disparate outputs based on an alphabetical split on a range of letters.  A practical example of this would be a work queue for collections staff based on last name; records would be pulled from a common source and then separated into multiple outputs based on a the Customer Last Name field, with the resulting output going to the person or group responsible for working that alphabetical subset of data.

There are a couple of different ways you can do this.  First is to use separate sources for each range of characters, and specify in your SELECT statement only those values that you want.  This is an effective quick-and-dirty option, but it doesn’t scale well as it requires multiple round trips to the database.  You could also accomplish this task using a simple text comparison for each letter of the alphabet, but this method is a typing-intensive operation.  For example, let’s say you want to group together the records for customers whose last names falls in the A-F range.  Using the Conditional Split transformation, your A-F output expression would look something like the following:

SUBSTRING(UPPER(LastName), 1, 1) == “A”
|| SUBSTRING(UPPER(LastName), 1, 1) == “B”
|| SUBSTRING(UPPER(LastName), 1, 1) == “C”
|| SUBSTRING(UPPER(LastName), 1, 1) == “D”
|| SUBSTRING(UPPER(LastName), 1, 1) == “E”
|| SUBSTRING(UPPER(LastName), 1, 1) == “F”

Your other groups would contain a similar statement to explicitly define each letter to be included in the group.  Not a complex operation, but one that requires a lot of typing.

 

An Easier Way

An easier way to do this is to use the relatively obscure CODEPOINT() function.  This method, which is part of the SSIS expression language, returns the numerical Unicode decimal value of the leftmost character of the input string.  The above grouping would be rewritten as follows using the CODEPOINT() function:

CODEPOINT(UPPER(LastName)) >= 65
&& CODEPOINT(UPPER(LastName)) <= 70

The difference is, rather than enumerating each possible starting letter within the range, I’m now evaluating the Unicode value of the first character in my LastName text field, and only including those in the 65 to 70 range (A through F inclusive) in this output.  I’ve saved myself a little typing, and this approach is easier to maintain and troubleshoot in my opinion.  A sample conditional split with four groupings is shown below:
screen1

 

Take It Up A Notch

So you might ask, “That’s great, smart guy, but why go through this just to save myself maybe 5 minutes of typing?”.  I’m glad you asked!  Let’s take our example a little bit further and assume we’re breaking these groupings down into smaller units.  Consider the possibility that, rather than grouping last names together based on the first letter of the last name, we’ve got a sufficient number of outputs that we’re now splitting the records within that first letter; for example, if we were to split the data stream where the last name starts with an M, we might slice our outputs on those starting with MA to MI, then MJ to MR, and finally MS to MZ.  By using the direct comparison method described above, our fully configured conditional split could have up to 26^2 possible permutations, which means we’ve got to do 676 comparisons (assuming all uppercase alpha characters) within the conditional split transformation, which will likely impact your package performance, not to mention the immense amount of typing required to set this up.  Fortunately, some creative use of the CODEPOINT() function can simplify this ETL requirement.

For this example, let’s assume that we need to separate our records within the letter M into three distinct groups as mentioned earlier, since statistically there are a lot of last names beginning with M.  For each “M” output, I’m going to use an direct string comparison to verify that the first letter is an M (since we’re looking for a single match and not a range in the first character), and second, I’ll use CODEPOINT() in conjunction with the SUBSTRING() function to check that the second letter falls within the expected range for each output.

So for our first M grouping, the MA to MI group, the following expression would be used:

SUBSTRING(UPPER(LastName), 1, 1) == “M”
&& (
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) >= 65
&&
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) <= 73)

The code above will match records where the first letter is a literal M, and the second character is between A (Unicode 65) and I (Unicode 73) inclusive. Similarly, the MJ to MR expression reads as such:

SUBSTRING(UPPER(LastName), 1, 1) == “M”
&& (
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) >= 74
&&
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) <= 82)

And finally, the MS to MZ expression:

SUBSTRING(UPPER(LastName), 1, 1) == “M”
&& (
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) >= 83
&&
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) <= 90)

 

The partially configured conditional split transformation would look similar to the following:

screen2

So you can see that you’ve still got a small chunk of code to write (or copy/paste and modify) for each of your outputs, but it’s far less trouble – and better performing, no doubt – than enumerating all of the possible combinations of the first two letters of the LastName field.  The further you go into the string for your split (for example, breaking all the way down to split “McA” to “McF”, “McG” to “McN”, etc.), the more significant your efficiency in using this method over direct comparisons.

One caveat that bears mentioning: You’ll notice that I’ve used the UPPER() function generously in these examples.  The reason for this is twofold: First, a direct string comparison in the SSIS expression language is case sensitive; for example, “M” does not equal “m”.  Second, the same holds true for the Unicode decimal values returned by CODEPOINT().  Uppercase M, or Unicode value 77, does not equal lowercase m, or Unicode value 109.  Use of the UPPER() function helps to ensure that we’re making accurate comparisons regardless of case.


Conclusion

The CODEPOINT() is a rarely used function in the SSIS expression language, but can be an effective tool in your ETL arsenal in some cases.  For alphabetical grouping or splitting of records, it’s a very handy function that helps to reduce a lot of typing at design time.

More information about the CODEPOINT() function can be found at this page on MSDN.

About the Author

Tim Mitchell
Tim Mitchell is a data architect and consultant who specializes in getting rid of data pain points. Need help with data warehousing, ETL, reporting, or training? If so, contact Tim for a no-obligation 30-minute chat.

2 Comments on "SSIS Alpha Splits using the CODEPOINT() Function"

  1. It seems very nice function to use it in such case! Thanks for info and I have never used this function, I just learned something useful! Nice article!

  2. Alpha Split in SSIS, Redux Alpha Split in SSIS, Redux

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.