Frequently I am asked how to properly use SQL Server Integration Services (SSIS) to handle ragged flat files, an example of which is shown below. In this small sample file, you can see that we are dealing with medical data in an inconsistent format. As shown, the rows in the pipe-delimited file can have as few as seven and as many as eleven columns.
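Here is a hypothetical reconstruction of such a file (the names and most of the codes are invented; the layout matches the columns used in the script later in this post, and the values “453”, “Reichert”, and “781” referenced below appear in their assumed positions):

```
10283|Sanders|5436|2011-04-01|PR440|Knee arthroscopy|Complete|453
10291|Reichert|781|2011-04-02|PR251|Hip replacement|Scheduled
10304|Barnes|2208|2011-04-02|PR118|Cardiac catheterization|Complete|453|117|220|345
10317|Lee|5436|2011-04-03|PR440|Knee arthroscopy|Scheduled|453|117
```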
Although you and I can look at this file and interpret its meaning, previous versions of SSIS cannot (see the note below regarding SSIS in SQL Server 2012). You can see that we’re not only missing the data but the delimiters as well, which is the root of the problem. This type of format, where delimiters are simply left off when values at the end of the row are missing or NULL, is unfortunately a relatively common situation that ETL developers must address.
With a file like this, SSIS will continue to consume data for each field until it finds the specified field delimiter, even if that means that the data is “gobbled up” from the next line in the file! Let’s assume that the outputs on my Flat File Source component define 11 fields (the largest number of fields found in this file snippet). When SSIS processes the first row of data, it’s going to get to the end of that row still expecting to find a pipe delimiter to mark the end of the current field. Rather than treating the end of the line of data (represented by an unprintable newline character) as the end of the column, the Flat File Source component will continue to consume data until either it finds the specified delimiter or the length of the data consumed exceeds the width specified for that column. In this example, the eighth field would not be the literal “453” at the end of the line in the file, but would contain a newline character followed by the first value of the following line of data. The next field, which should be NULL, would contain “Reichert”, followed by the incorrect value of “781”, and so forth. As shown below, this slip would cascade through the remainder of the file.
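To make the cascade concrete, here is roughly how the first row’s parse would go wrong against the hypothetical sample above (the column names are the ones defined later in the script):

```
Column         Expected value    What SSIS actually reads
RiskFactor1    453               453<newline>10291
RiskFactor2    NULL              Reichert
RiskFactor3    NULL              781
RiskFactor4    NULL              2011-04-02
```

Every subsequent row then shifts further out of alignment.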
In a case like this, the very best you can hope for is that the data length would overflow the allowed field width, or a data type conflict would cause an error and fail the package. In a worst-case scenario, SSIS could actually consume the entire file with this misalignment as shown above, which may not raise an error in the package but would almost certainly cause downstream issues in the destination systems.
Scripting to the rescue!
Never fear – there is a solution to this problem. By using just a bit of code in SSIS, we can easily get around this issue and process the ragged flat file. The methodology we’ll use is as follows:
- Using the Flat File Source, consume each line of text as one big string (rather than breaking it apart into individual fields)
- Using the Script Component, parse each line of the file. Dynamically populate the output fields with the data in each line of the file, and assign NULLs to those fields where no value is present.
- Send the cleansed output to the destination table
Sounds easy enough, right? Let’s go through this step by step.
Process each line of data
As mentioned, we’re going to trick our Flat File Source component into processing this file by forcing it to handle each line of data as if it were a single text field. As shown below, I’ve configured a single output field, named [Column 0], as a field of type String with a width of 2000 characters. This will include everything on the line of data, including the pipe delimiters, in the output.
Next, we’ll add a script component into the data flow pane. We’ll configure the input of the script component to use the single field generated by the flat file source above.
We also need to add all of the output fields manually, which we will populate from within the script. As shown below, I’ve already added the eleven fields to the script component output, configuring each one with the appropriate data type and length.
Now onto the fun part: The Code. Within the script, our methodology will be as follows:
- Break apart the big text string into its individual data fields, splitting on the specified delimiter (in this case, the pipe symbol).
- If there are mandatory fields (i.e., any fields which must be present for a valid row of data), assign these to the appropriate output columns.
- For the optional output columns, process the values retrieved from the input string, in order, assigning each one to the next output column until no input values remain.
First things first – let’s grab and shred the input text. As shown below, I’m going to split the input on the pipe delimiter, resulting in an array of strings, each with a single value.
```csharp
// Break apart the input string into its atomic elements
string[] allValues = Row.Column0.Split('|');
```
Next, we’ll work through all of the fields that are expected to be present in all valid rows of data. In situations where there are mandatory fields which should always be present, no dynamic processing is required on those values, so we can simply assign them in order. Note that I’ve used an incrementing counter variable to work through the array of input data; this is simply a shortcut that lets me copy/paste the same line of code rather than typing the index number for each element.
```csharp
// Counter
int i = 0;

// Address the values known to be present in every row
Row.MRN = allValues[i++];
Row.PatientLastName = allValues[i++];
Row.PhysicianID = allValues[i++];
Row.ProcedureDate = allValues[i++];
Row.ProcedureID = allValues[i++];
Row.ProcedureDescription = allValues[i++];
Row.ProcedureStatus = allValues[i++];
```
Now we move on to the dynamic portion of the processing. From this point, all columns are optional, so we need to go through the remaining values, one by one, and assign them to the final four output columns (all related to risk factor codes in this example). As shown in the code snippet below, I’m continuing to work through the string array created earlier to process all of the values. As long as values remain in the array, we’ll continue assigning them to the next risk factor code in order; when we run out of values, the remaining risk factor columns will be populated with NULL.
```csharp
// The next four values are variable. We may have zero to four of these, so
// we'll add them dynamically as they appear
Row.RiskFactor1 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor2 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor3 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor4 = allValues.Length > i ? allValues[i] : null;
```
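One caveat: depending on the data types of your output columns, assigning null directly can raise a runtime error in some environments. If that happens, the `_IsNull` properties that the script component generates for each output column are the supported way to mark a column as NULL. A minimal sketch for the first risk factor column:

```csharp
// Alternative NULL assignment using the _IsNull flag that the script
// component generates for each output column
if (allValues.Length > i)
    Row.RiskFactor1 = allValues[i++];
else
    Row.RiskFactor1_IsNull = true;   // explicitly mark this column as NULL
```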
We only have four possible additional columns in this scenario, but this pattern can easily be applied to similar scenarios with more than just a few optional elements.
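If you find yourself with a dozen or more optional columns, one way to avoid a long run of ternary lines is to collect the column setters into an array and loop over it. A minimal sketch, assuming the same output columns as above (requires `using System;` for `Action<T>`):

```csharp
// Map each optional output column to a setter, in positional order
Action<string>[] optionalSetters =
{
    v => Row.RiskFactor1 = v,
    v => Row.RiskFactor2 = v,
    v => Row.RiskFactor3 = v,
    v => Row.RiskFactor4 = v
};

// Assign the remaining input values in order; any setter we never reach
// leaves its column at the default (NULL)
for (int j = 0; j < optionalSetters.Length && i < allValues.Length; j++)
{
    optionalSetters[j](allValues[i++]);
}
```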
Consume the cleansed data
Finally, we just need to route the cleansed data out to our destination. As shown below, we’re sending the processed data to a SQL Server table.
SQL Server 2012 changes
As mentioned earlier, this is a design pattern that I recommend for versions of SSIS prior to 2012. If you are using SSIS in SQL Server 2012, this workaround is unnecessary: the Flat File Source was updated in that version to allow easy processing of these types of ragged files.
Conclusion
Ragged flat file processing can be a pain when working with older versions of SQL Server Integration Services. This post demonstrates how, with just a few lines of code, you can avoid issues with ragged files by dynamically “peeling off” values and assigning them to the appropriate outputs.
Note: The package and sample data file I used for this demonstration are attached here.
When I run it this way I get an error saying “Column delimiter was not found.” It doesn’t seem to be getting to the part of the script that writes the string array back out to the output columns.
Tonya
Interesting approach. I frequently ETL files with header and trailer records co-mingled in with the data. Usually I skip the connection manager altogether and go straight for the stream reader. You’ve at least disguised the package as somewhat maintainable.
I guess the Synchronous flag on the output is what you missed. Please add that, or beginners might not be able to reproduce this.
Chintu, you are correct that we’d need to set this to be an asynchronous transformation (by setting the output’s SynchronousInputID property to None), since we are changing the metadata when moving from input to output. Thanks for bringing up that point.
Everyone says that SQL 2012 fixes this for ragged columns in SSIS, but boy I can’t figure out how. Had to resort to this method. If you have a blog post or can point me in the right direction on how to do it in 2012, that would be fantastic. Thanks.
Dave,
There’s nothing special to do in SSIS 2012 to handle ragged file processing. You’ll still use the Delimited file type, but if you’ve got an otherwise well-formed data file with some trailing delimiters missing, the SSIS flat file connection will simply process the missing columns as NULLs rather than incorrectly forcing a read of extra data.
I agree with Dave, I cannot figure out how to do this in SQL 2012. Tim, if I have a flat file (delimited type) with a row with missing delimiters, it does not process the missing columns. Maybe the hitch is that you need a header row? I do not have one.
In your example, are you assuming the optional values are consecutive? For example, values a, b, c, d must appear in this order; if some values are missing, the row will be something like a,b,c or a,b or just a, not something like a,d? Because if the latter is true, your code will assign value d to the column for b when b and c are missing.
Nian, you’re correct that we’re assuming that missing values are at the end, not somewhere in the middle. Typically when I see this pattern, that’s how it should be handled – the missing values are assumed to be at the end. While it might be possible that there are missing values (which means missing delimiters) in the middle of the data, that’s a much more complex problem that would require some business logic to identify where the data should fall.
That’s excellent!
Hi Tim,
Can I do it with the Derived Column transformation instead of the Script transformation?
Ashish, it can be done using the derived column transform, but it’s very, very messy. I’ve found the script solution to be much simpler.
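To illustrate why: in the Derived Column transformation, each optional field needs nested FINDSTRING/SUBSTRING calls in the SSIS expression language to locate its delimiters and guard against the missing ones. A sketch of the expression for just the eighth field (the column name and width are placeholders):

```
FINDSTRING([Column 0],"|",7) > 0
    ? SUBSTRING([Column 0],
        FINDSTRING([Column 0],"|",7) + 1,
        (FINDSTRING([Column 0],"|",8) > 0
            ? FINDSTRING([Column 0],"|",8)
            : LEN([Column 0]) + 1) - FINDSTRING([Column 0],"|",7) - 1)
    : NULL(DT_WSTR,20)
```

Multiply that by every optional column and the script starts to look pretty good.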
I’m using VS 2015/SSIS 2016 and understand that 2012 will handle the missing delimiters. But what if it is not a delimited file, but rather a fixed-width file where the columns are positional? If there is no data in the last “column” then the record is terminated early and is shorter than those records with data in every column. Can SSIS handle this somehow without resorting to importing as a single column and parsing using a script task?
Unfortunately, fixed-width files don’t have the same level of flexibility as delimited files in SSIS. You can use the ragged-right format to address some of these issues for fixed-width files, but it’s far from perfect.
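If you do end up importing each line as a single column and parsing it in a script, the same guarded pattern from this post carries over to positional data. A minimal sketch, assuming hypothetical offsets and column names (the same `_IsNull` caveat noted earlier applies):

```csharp
// Hypothetical fixed-width layout: chars 0-19 = Name, 20-29 = Code, 30-39 = Notes.
// Guard each read so a short (early-terminated) record yields NULL rather than
// an ArgumentOutOfRangeException from Substring.
string line = Row.Column0;

Row.Name = line.Length > 0
    ? line.Substring(0, Math.Min(20, line.Length)).TrimEnd()
    : null;

Row.Code = line.Length > 20
    ? line.Substring(20, Math.Min(10, line.Length - 20)).TrimEnd()
    : null;

Row.Notes = line.Length > 30
    ? line.Substring(30, Math.Min(10, line.Length - 30)).TrimEnd()
    : null;
```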