Calendar
<<  September 2010  >>
SuMoTuWeThFrSa
2930311234
567891011
12131415161718
19202122232425
262728293012
3456789
Blogroll

    I’ve fielded a number of requests recently asking how to interrogate a file within SSIS and change the processing rules based on the metadata of said file.  A recent forum poster specifically asked about using the foreach loop to iterate through the files in a directory and, on a per-file basis, either process the file or skip the file if it was updated after a specific date.  I’ll use that most recent request to illustrate one method to solve this problem.

    Ingredients

    For this demonstration, our SSIS package will require the following:

    • A foreach loop to process each file in a given directory
    • A script task to interrogate each file and, based on the timestamp, mark it to be either processed or skipped.
    • Four SSIS variables:
      • @SourceDirectory (String) – stores the directory to loop through
      • @MinDateStamp (DateTime) – indicates the earliest date to process
      • @Filename (String) – stores the current filename for each cycle of the foreach loop
      • @ProcessFile (Boolean) – a flag to indicate whether the current file should be processed
    • A precedence constraint which will be configured to validate an expression and task outcome.
    • A data flow task to process the validated files.


    Set Up the Loop

    Nothing groundbreaking here: after adding the foreach loop to the control flow pane, set it to work as a Foreach File Enumerator, and use an expression to set the source directory to be derived from the value of the @SourceDirectory variable:

    img2

     

    Script Task

    Since there is no native SSIS task designed to interrogate file metadata, we’re going to need to use a script task to do this.  After dropping a script task from the toolbox into the foreach loop container, we’ll edit the script to create a FileInfo object as a logical hook to the file.  After confirming that the file exists, we’ll compare its LastWriteTime property to the earliest acceptable cut-off date (defined by the @MinDateStamp variable value) to determine if the timestamp meets the criteria for processing.  Based on the results of that comparison, we will set the @ProcessFile value to either True or False.  You can see the resulting code logic in the snippet below:

     

    public void Main()
            {
                // Create a logical file object 
                System.IO.FileInfo theFile = new System.IO.FileInfo(Dts.Variables["Filename"].Value.ToString());
     
                // If the update date on the file is greater than the date specified in the MinDateStamp
                //  variable, set the variable flag to process the file.    
                if (theFile.Exists 
                    && theFile.LastWriteTime > DateTime.Parse(Dts.Variables["MinDateStamp"].Value.ToString()))
                {
                    // MessageBox.Show("Processing file " + Dts.Variables["Filename"].Value.ToString());
                    Dts.Variables["ProcessFile"].Value = true;
                }
                else
                {
                    // MessageBox.Show("Skipping file " + Dts.Variables["Filename"].Value.ToString());
                    Dts.Variables["ProcessFile"].Value = false;
                }
     
                Dts.TaskResult = (int)ScriptResults.Success;
            }

     

    So with each iteration of the foreach loop, the @ProcessFile value will indicate whether the current file should be processed or skipped.  After adding a data flow task containing the necessary components to process the flat file, our next step would be to add a precedence constraint connecting the script task to the new data flow task.  This precedence constraint will be configured to use an expression and a constraint, and will confirm that the current file is to be processed by interrogating the value of the @ProcessFile variable.  If that value is true, then program flow continues to the data flow task; otherwise the loop starts again with the next file in turn.  The precedence constraint would be configured as such:

    img3

     

    After configuring all of the necessary tasks for this operation, the data flow pane should look similar to the following:

    img1

    Now, when the SSIS package is executed, the timestamp of each file in the specified directory will be checked, and only those that meet the date criteria will be processed in the data flow task.  Note that you could replace the timestamp in our example to some other file criteria; for example, you could check the file size, type, attributes, or other settings to determine if the file should be processed.

     

    Conclusion

    Although SSIS does not include a native component to conditionally process files, you can see from this example that a simple script can easily solve this ETL challenge.  You can download the sample package used in this example here.

    I’d like to say thanks to Andy and Judith over at the Ft. Worth SQL Server User Group for having me as a guest last Wednesday.  I got to meet a few new people, and meet up with several others whom I hadn’t seen in a while.  I look forward to visiting over there again.

    For those in attendance, I’ve posted my slide deck and sample code here.

    Posted by Tim Mitchell on 30 September 2009, 08:28

    So you’ve got some packages that regularly extract data to one or more text files, but you know that from time to time some of the queries will not return any data.  However, you find in SSIS that, in a flat file export package, the output file is created regardless of whether any rows are written to the file, and in cases where there are no rows returned from the source, you’ll end up with empty data files in your output directory.  Although not a critical failure, having empty output files can be a nuisance, and unfortunately, there aren’t any native settings in the Flat File Destination or the Flat File Connection Manager objects that will clean up empty files.  Fortunately, SSIS does provide a combination of other tools that you can use to emulate this behavior.

    In our example, I’m going to create a package to extract data from a table using a query for which I know that no rows would be returned.  When I connect this data source to a flat file destination and execute the package, I’ll see an empty data file in my export directory.  Next, to demonstrate the intended behavior, I’ll add a Row Count transformation to store the number of affected rows in a variable, and create a File System Task object to delete the output file.  Finally, I’ll use an expression constraint to only delete the file if the variable attached to the row count is zero.

    First, let’s create a test table to query:

    USE testdb
    GO

    CREATE TABLE invoices (
      invoiceid     INT    IDENTITY ( 1 , 1 ),
      vendorid      INT,
      invoicedate   DATETIME,
      invoiceamount DECIMAL(10,2))

    INSERT invoices
    VALUES(12, '8/3/2009', 4125.50),
          (53, '8/13/2009', 1095.25),
          (46, '8/15/2009', 729.50),
          (33, '8/23/2009', 3421.50)

    Now, I’ll create a basic package that will export to text the invoices for the past 30 days, a reasonable business requirement.  Since we don’t currently have any invoices matching that criteria, we’ll end up with an empty output file. The original package is shown below:

    df_f1

     

    Now for the new-and-improved version, I’m going to drop a Row Count transformation into the data flow to save the number of affected rows to a variable:

     

    df_f2

     

    Finally, I’ll create a File System Task and configure it to delete the output file.  To insure that a valid data file is not deleted, I’ll create a precedence constraint using an expression to only execute the delete if the row count variable is equal to 0:

     

    df_f3


    When you execute this package, you’ll see that the File System Task object is executed because there are no rows matching our query.  You can test the package by inserting another row into the database that will be returned by the query, and you’ll see that the data file is exported but not deleted.

    I’ve attached the before and after SSIS packages if you’d like to take it for a test drive.  Enjoy!

    Posted by Tim Mitchell on 29 September 2009, 07:15

    It's been well-documented through myriad blogs and forum posts about the case sensitivity of the comparisons in the SSIS lookup transformation (a good review can be found here). In a nutshell, a comparison using the lookup transformation is case sensitive when using the default setting of Full Cache, even if the values in the database are stored in a case insensitive collation. This happens because, in Full Cache mode, the comparison is done by the SSIS engine rather than the database engine, the former of which differentiates between "Value", "VALUE", and "VaLuE" regardless of the collation setting.

    But did you know that this transformation is space sensitive as well? Consider the following T-SQL code:

    SELECT Cast('char value' AS CHAR(20)) [val1]
    INTO   test1

    SELECT Cast('char value' AS CHAR(40)) [val2]
    INTO   test2

    SELECT t1.val1,
           t2.val2
    FROM   test1 t1
           INNER JOIN test2 t2
             ON t1.val1 = t2.val2


    As you would expect, executing this code results in a successful match (INNER JOIN), even though we're comparing CHAR values of differing lengths (for more information, see this article for more information about spaces and padding in SQL Server).

    However, when the same comparison is run through an SSIS lookup transformation in Full Cache mode, the lookup on our sample data will fail. Similar to the case sensitive lookup, you'll find that the SSIS engine would treat the strings 'Hello World' and 'Hello World  ' (note the trailing spaces) as dissimilar values.  Unlike in SQL Server, trailing whitespace is significant in SSIS value comparisons.

    As a workaround, you can use the TRIM() function in SSIS and the RTRIM() T-SQL function to insure that your comparisons are ignorant of trailing whitespace.  Alternatively, you could use a cache mode other than Full Cache, but you should be aware of the other implications before making such a change. 

    Note that this behavior is limited to fixed-length character fields, but could lead to some unexpected and hard-to-detect problems if you aren’t aware of the behavior.

    … and by government data, I mean the mountain of data recently made available by the G-Men on Data.gov.  This site contains what must be terabytes of data on every topic from environmental measurements to crime statistics, from geographical data to labor statistics.  The Obama administration has committed to greater transparency, and the availability of this data is a significant step toward that goal.  The trendy geek magazine Wired.com recently did a feature on Data.gov that is worth reading.

    It’s obvious that Data.gov is an immature portal.  Delivery types are inconsistent – some files are available only as flat files, others as only Excel, and a few claim to offer XML feeds.  The formatting can vary wildly from one set of data to the next, and often includes headers and footers which muddy up otherwise clean raw data files.

    So why should you, as a database professional, care about this information? If you’re trying to improve your skills in database technologies (and especially in this economy, who isn’t trying to improve him/herself?), this data store is a great place to start.  Because of the sheer size and sometimes unusual layouts, this information is an excellent test bed for honing one’s skills at Integration Services, Analysis Services, or for creating VLDBs (very large databases) on which to practice.  And if you’re truly ambitious, there’s a contest to come up with the best application of this data, with a $10,000 bounty to the winner.

    As for me, I’m currently pulling down some FBI crime data with the intention of using it in an upcoming SSIS class I’m presenting.  Perhaps I’ll think up an app that could win the $10K as well….