donderdag 24 september 2009

Using XML in an SQL query to UNPIVOT data

To continue with my growing addicition to the SQL Challenges, I was looking at Challenge #10 from Beyond Relational (prior to the winning posted solutions and their explanations). 

After doing a bit of seraching I came across a solution by Syed Mehroz Alam on his blog. His solution involved the unpivot/pivot method. I started thinking if there was another way of unpivotting the data using XML and thus this post which is my delvings into TSQL ,and XML .

Just to recap what the challenge was, and thus what started my delving into XML:

Horizontal and Vertical sorting of a result set

Given the following table, sort horizontally and vertically and remove duplicates once sorted:



Result should look like :



To retrieve data in XML format from a SQL Statement the FOR XML clause is used. There are 4 mode options that can be used with this clause:
  1. RAW
  2. AUTO
  3. EXPLICIT
  4. PATH
To get an excellent explanation of how to use the above clauses refer to the article Using the FOR XML Clause to Return Query Result as XML by Robert Shelden.

Test Data
DECLARE @t TABLE ( c1 CHAR(1), c2 CHAR(1), c3 CHAR(1) )
INSERT INTO@t VALUES ( '2' '1' '3' )
INSERT INTO@t VALUES ( '3' '2' '1' )
INSERT INTO@t VALUES ( 'Z' 'X' 'Y' )
INSERT INTO@t VALUES ( 'B' 'C' 'D' )
INSERT INTO@t VALUES ( 'Y' 'Z' 'X' )
INSERT INTO@t VALUES ( 'B' 'C' 'A' )


Goal XML
The following xml is what is what I want to achieve:





Version1: FOR XML RAW

SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) AS rowid,
           c1, c2, c3
FROM @t
FOR XML RAW('DataRow'), ROOT('Root'), Type

  • The Rownumber () function is used to achieve a unique rowid
  • The use of Select 1 in the order by statement of the Rownumber function is because the actual ordering is not important.

This will result in the following XML :



The above is really just an "XML" representation of the data in the table and not anywhere close to an "unpivotted" view of the table which is what I am trying to achieve.

VERSION2:  FOR XML AUTO

SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) AS rowid,
           c1, c2, c3
FROM @t
FOR XML AUTO , ROOT('Root'), Type

This option resulted in similar xml to the RAW option except that  the "DataRow" tag names were generated as Row Tag name is only allowed with RAW or PATH mode of FOR XML.

The tag names that are generated are generally the table names used in the query , but as I am using a table variable it generates a random name ( xt0040_t  in my case ). To alleviate the random generation of tags, one can use aliases. The following will generate identical xml to the FOR XML RAW option above.

SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) AS  rowid,
           DataRow.c1,  DataRow.c2,   DataRow.c3
FROM @t      DataRow
FOR XML AUTO , ROOT('Root'), Type


Moving to an Unpivotted view

With some minor adjustments to either of the above queries ( as they both give identical results) and the use of a CTE ,  a "parent" / "Child" relationship can be simulated whereby the Columns ( c1, c2, and c3) are seen as children of the Parent  , ie the rowid.  

WITH [Columns] AS  (
      SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 )) AS rowid,
                c1, c2, c3
     FROM @t
)
SELECT DataRow.RowID,
           (  SELECT c1, c2, c3
              FROM [Columns]
              WHERE [Columns].RowID = DataRow.RowID
              FOR XML AUTO, ELEMENTS, TYPE)
FROM [Columns] as DataRow
FOR XML AUTO, Root('Root');

This results in the following XML :



The above version gave a parent/child view but even though the columns c1, c2 and c3 are seperate tags now, they have different names, thus the result was still not what I wanted.

VERSION3: EXPLICIT

I found this mode rather cumbersome, and again could not produce the result I wanted.
I did manage to produce a similar result to above :

WITH theData AS

(
   SELECT ROW_NUMBER() over (order by (SELECT 1)) AS RowID,
              c1, C2, C3
   FROM @t
)
SELECT 1 AS Tag,
           NULL AS Parent,
           RowID AS [Row!1!RowID],
           NULL AS [Column!2!c1!ELEMENT],
           NULL AS [Column!2!c2!ELEMENT],
           NULL AS [Column!2!c3!ELEMENT]
FROM theData
UNION ALL
SELECT 2 AS Tag,
           1 AS Parent,
           RowID,
           c1,
           c2,
           c3
FROM theData
ORDER BY [Row!1!RowID], [Column!2!c1!ELEMENT]
FOR XML EXPLICIT;

Resulting XML:




VERSION3: XPATH

Last but not least I played with the XPath mode and finally got my result.

SELECT
(
     SELECT ROW_NUMBER() over (order by c1, c2, c3) AS '@rowid'
               ,'c1' AS 'Column/@Name'
               , c1 AS 'Column/@value'
               ,NULL
               ,'c2' AS 'Column/@Name'
               , c2 AS 'Column/@value'
               ,NULL
               , 'c3' AS 'Column/@Name'
               , c3  AS 'Column/@value'
               ,NULL
     FROM @t
     FOR XML PATH('DataRow'), Type
)
FOR XML PATH('Root')

The trick in the above statement is the insertion of the NULL selections between the "nodes".

UNPIVOT RESULT

Well finally having achieved the XML I needed using the XPATH, writing the statement to produce an unpivotted result was a piece of cake. 

Assuming the above query puts its result in the variable @Xml  ( ie set @xml = ( .... )  )

EXEC sp_xml_preparedocument @idoc OUTPUT, @xml
SELECT * , ROW_NUMBER() over (partition by rowID order by col) as colID
FROM OPENXML (@idoc, 'Root/DataRow/Column', 1 )
WITH ( Rowid char(1) '../@rowid',
          Col     char(1) '@value' )

To clarify the extra RowNumber function , this is necessary as pivotted data is always 3 dimensional.  The COLID column can be used to repivot the data as it signifies the "column number"  (c1, c2 and c3 )

The Challenges required end result:

To get back to the reason I started this delve into SQL, challenge #10 and my twist on things.
Here is the complete sql giving the end result for the challenge:

DECLARE @idoc int

DECLARE @XML varchar(max)
SET @xml = (
  SELECT
  (
     SELECT ROW_NUMBER() over (order by c1, c2, c3) AS '@rowid'
               ,'c1' AS 'Column/@Name'
               , c1 AS 'Column/@value'
               ,NULL
               ,'c2' AS 'Column/@Name'
               , c2 AS 'Column/@value'
               ,NULL
               , 'c3' AS 'Column/@Name'
               , c3  AS 'Column/@value'
               ,NULL
     FROM @t
     FOR XML PATH('DataRow'), Type
  )
  FOR XML PATH('Root')
)
EXEC sp_xml_preparedocument @idoc OUTPUT, @xml
SELECT DISTINCT  result.c1, result.c2, result.c3
FROM
(
      SELECT ROWID, [1] as c1, [2] as c2, [3] as c3
      FROM
      (
          SELECT * , ROW_NUMBER() over (partition by rowID order by col) as colID
          FROM OPENXML (@idoc, 'Root/DataRow/Column', 1 )
          WITH ( Rowid char(1) '../@rowid', Col char(1) '@value' )
       ) sorted
       PIVOT ( min(col) for colID in ([1],[2],[3])) as pvt
) result
ORDER BY result.c1, result.c2, result.c3







woensdag 16 september 2009

TSQL Challenge #13

It seems I've gotten totally hooked on doing the SQL Challenges at Beyond Relational . So seeing the closing date of 14th September has passed for Challenge #13, herewith my solution.

The Challenge
"Break a batch of invoices into multiple parts based on rules"

The Context

A module of an application generates invoices for services provided to patients. The information is sent to the insurance company.

Sample Data
InvIdBatchNumberInvoiceNumberVisitDateAmount
1 10000001 20001 2009-01-01 50.00
2 10000001 20001 2009-01-02 50.00
3 10000001 20001 2009-01-03 50.00
4 10000001 20001 2009-01-04 50.00
5 10000001 20002 2009-01-01 50.00
6 10000001 20002 2009-01-02 50.00
7 10000001 20002 2009-01-03 50.00
8 10000001 20003 2009-01-01 50.00
9 10000001 20003 2009-01-02 50.00
10 10000001 20004 2009-01-01 50.00
11 10000001 20004 2009-01-02 50.00
12 10000001 20004 2009-01-03 50.00
13 10000001 20004 2009-01-04 50.00
14 10000001 20004 2009-01-05 50.00
15 10000001 20004 2009-01-06 50.00
16 10000001 20005 2009-01-01 50.00
17 10000001 20005 2009-01-02 50.00
18 10000001 20005 2009-01-03 50.00
19 10000001 20005 2009-01-04 50.00
20 10000001 20005 2009-01-05 50.00
21 10000001 20005 2009-01-06 50.00
22 10000001 20005 2009-01-07 50.00
23 10000001 20005 2009-01-08 50.00
24 10000001 20006 2009-01-01 50.00
25 10000001 20007 2009-01-01 50.00
26 10000001 20007 2009-01-02 50.00
27 10000001 20007 2009-01-03 50.00
28 10000001 20008 2009-01-01 50.00
29 10000001 20008 2009-01-02 50.00
30 10000001 20008 2009-01-03 50.00



The insurance company needs an electronic file in a certain format (the format is irrelevant for us). A separate file needs to be generated for each batch. So based on the sample data we need to generate two files (one for each batch).

Each batch should contain not more than 10 rows. The maximum number of lines we can put in a file is 10. So we need to break each batch into parts that contains not more than 10 rows. When breaking the batches, there is one more rule to follow. While breaking the batches, we should take care that we do not break invoices.

Result should like look:

InvIdBatchNumberInvoiceNumberVisitDateAmountSet
1 10000001 20001 2009-01-01 50.00 1
2 10000001 20001 2009-01-02 50.00 1
3 10000001 20001 2009-01-03 50.00 1
4 10000001 20001 2009-01-04 50.00 1
5 10000001 20002 2009-01-01 50.00 1
6 10000001 20002 2009-01-02 50.00 1
7 10000001 20002 2009-01-03 50.00 1
8 10000001 20003 2009-01-01 50.00 1
9 10000001 20003 2009-01-02 50.00 1
10 10000001 20004 2009-01-01 50.00 2
11 10000001 20004 2009-01-02 50.00 2
12 10000001 20004 2009-01-03 50.00 2
13 10000001 20004 2009-01-04 50.00 2
14 10000001 20004 2009-01-05 50.00 2
15 10000001 20004 2009-01-06 50.00 2
16 10000001 20005 2009-01-01 50.00 3
17 10000001 20005 2009-01-02 50.00 3
18 10000001 20005 2009-01-03 50.00 3
19 10000001 20005 2009-01-04 50.00 3
20 10000001 20005 2009-01-05 50.00 3
21 10000001 20005 2009-01-06 50.00 3
22 10000001 20005 2009-01-07 50.00 3
23 10000001 20005 2009-01-08 50.00 3
24 10000001 20006 2009-01-01 50.00 3
25 10000001 20007 2009-01-01 50.00 1
26 10000001 20007 2009-01-02 50.00 1
27 10000001 20007 2009-01-03 50.00 1
28 10000001 20008 2009-01-01 50.00 1
29 10000001 20008 2009-01-02 50.00 1
30 10000001 20008 2009-01-03 50.00 1

My Solution 

Personally I found this challenge quite easy and very fun to do. 

This query depends on first determining where the boundaries are. A given requirement is that Invoice numbers may not be broken.  This is in fact not a factor that makes the query more complex in my eyes, but makes it easier. The first step is thus to determine how many rows (count) there are within a batchnumber, invoice combination. This will form the basis for the boundaries.

With tblA ( id, BatchNumber, InvoiceNumber, RWN ) AS

(
  -- Step 1. Get the Count of Invoices Per BatchNumber/InvoiceNumber combination and in the
  -- output  generate a sequential id using the rownumber function
  SELECT ROW_NUMBER() OVER( ORDER BY BatchNumber, InvoiceNumber ), Batchnumber, 
            InvoiceNumber, COUNT(InvId)
  FROM @t
  GROUP BY Batchnumber, InvoiceNumber
) ,
rt ( id, BatchNumber, InvoiceNumber, runtot , cappedtotal, groupnr ) AS

(
  -- Step 2. Using recursive function, calculate a running total of the count of rows till the max
      allowed of 10 rows.
  -- Using this "running capped" total, which forms the final boundary , create a "running"
  -- group number.
  SELECT id, BatchNumber, InvoiceNumber, RWN, RWN , 1
  FROM tblA
  WHERE id = 1
  UNION ALL
  SELECT t.id, t.BatchNumber, t.InvoiceNumber, t.rwn + rt.runtot,
   -- check for change in batchnumber as per rule = new batch
           CASE WHEN t.BatchNumber = rt.BatchNumber
                THEN
                     -- now checking capped totals.
                     CASE WHEN t.rwn + rt.cappedtotal < 10
                              -- accumulate as capped total is less than 10 rows
                            THEN t.RWN + rt.cappedtotal
                             -- initialise for new accumulation
                            ELSE t.RWN
                      END
                 ELSE t.RWN
             END,
             -- check for change in batchnumber as per rule = new batch
            CASE WHEN t.BatchNumber = rt.BatchNumber
                  THEN
                        -- check if wihin capped totals
                        CASE WHEN t.rwn + rt.cappedtotal < 10
                                -- keep group number
                                THEN rt.groupnr
                                -- accumulate group nr.
                                 ELSE rt.groupnr + 1
                         END
                   ELSE 1 -- reset running group total as change of batchnumber has occurred.
             END
   FROM tblA t
   JOIN rt ON ( t.id = rt.id + 1 )
)
-- Combine with the original table to produce the result.
SELECT t.InvId, t.BatchNumber, t.InvoiceNumber, t.VisitDate, t.Amount, rt.groupnr
FROM @t t
JOIN rt    rt ON rt.BatchNumber = t.BatchNumber
                AND rt.InvoiceNumber = t.InvoiceNumber



And that was it, they key being initial boundaries set by Count of Rows within Invoices. Using these boundaries, generate running capped totals within a recursive CTE.








maandag 7 september 2009

TSQL Challenge #12

I was surfing the net the other day trying to find a solution for a TSQL problem I was working on and came across this site Beyond Relational.

As I love challenges, I proceeded to try the TSQL challenge ( Nr 12 in August 2009 ).


The Challenge
"Build sequential ranges of dates with propogation to missing values".

 
The Context:

You are working for an online gaming company and as usual when we talk about games we need to manage scores. Some features in your system had recently changed. Before when a user get connected you only shown him its last score, but now you have to present him a graph month by month of its best score in each one since its first game until the current date.

The Scores:


YearMonthScore
-----------------------------
200903100
20080395
20080299
200801100
200711100

When a user connects for the first time after the deployment of the new system you will need to produce a table based on the original scores table with the following conditions:

  • Create a new couple year/month for each missing month between two valid months of the original table original table
  • For each new couple created, you should recopy the score of the last month he played.
  • Continue the list until the current month (included).

Here is the resulting table you need to produce

YearMonthScore
-----------------------------
200908100
200907100
200906100
200905100
200904100
200903100
20090295
20090195
20081295
20081195
20081095
20080995
20080895
20080795
20080695
20080595
20080495
20080395
20080299
200801100
200712100
200711100

My Solution

Well looking at the above I figured the first thing i needed to do was to produce a working set which had all the missing year months, thereby satisfying the first condition of the challenge ie Build sequential ranges of datesranges of dates

My second problem would be to figure out how to ensure that the values of these missing year/month values were propogated with the scores from the "last" month played prior to the inserted missing year/month value.

The approach I used involved the use of CTE (Common Table Expressions) .

I am going to post my whole solution first and then explain the individual parts.

-- Base Data
Declare @Scores TABLE

(
     YearMonth        INT,
     Score              INT,
)
INSERT @Scores VALUES ( 200903, 100 )
INSERT @Scores VALUES ( 200803, 95 )
INSERT @Scores VALUES ( 200802, 99 )
INSERT @Scores VALUES ( 200801, 100 )
INSERT @Scores VALUES ( 200711, 100 )
--------------------------------------------------------------------------------
WITH 
-- relevant years

YEARS  (someyear) AS

(
 
    SELECT (  SELECT MIN ( YEAR ( CONVERT(datetime, Convert(varchar(6),Yearmonth) + '01' ) ) 0 )

                      FROM @Scores  )  as someyear 
    UNION ALL
 
    SELECT someyear + 1

    FROM YEARS 
    WHERE someyear < YEAR( getdate() )

),
  -- all months
MONTHS (somemonth) AS
(
  SELECT 1 AS somemonth 
  UNION ALL 

  SELECT somemonth + 1
  FROM MONTHS 
  WHERE somemonth < 12

),  -- COMBINE THE years and months to get an inflated range OF Year Month 
combo's.
YEARMONTHS ( YearMonth) AS
(
   SELECT DISTINCT c.ym
   FROM
   (
      SELECT CONVERT( int, CONVERT(varchar(4), a.someyear) +
                               CASE WHEN b.somemonth < 10
                                   THEN '0' + CONVERT(varchar(2), b.somemonth)
                                   ELSE CONVERT(varchar(2), b.somemonth)
                               END ) as ym
       FROM YEARS a cross join MONTHS b
    ) c
   WHERE c.ym >= ( SELECT MIN(Yearmonth) FROM @Scores )
   AND     c.ym <= ( SELECT CONVERT(int, CONVERT(varchar(6), YEAR( getdate() ) ) +
                                       CASE
                                            WHEN MONTH( getdate() ) < 10
                                            THEN '0' + CONVERT( varchar(2), MONTH ( getdate() ) )
                                        END )
                           )
), -- join to base data to fill in the existing known scores
result ( rownum, YearMonth, Score ) AS
(
   SELECT ROW_NUMBER() OVER (ORDER BY a.YearMonth) as rownum,
              a.YearMonth, b.Score
   FROM YEARMONTHS  a
   LEFT JOIN @Scores   b ON b.YearMonth = a.YearMonth
) -- inflate with previous scores
endresult ( rownum, YearMonth, Score ) AS
(
   SELECT rownum, YearMonth, Score
   FROM result
   UNION ALL
   SELECT a.rownum, a.Yearmonth, b.Score
   FROM result a
   JOIN endresult b ON b.rownum = a.rownum-1
   WHERE a.Score IS NULL
) -- get end result
SELECT DISTINCT x.YearMonth, x.Score
FROM
(   SELECT ROW_NUMBER() OVER (ORDER BY rownum, yearmonth, score) AS num,
               Yearmonth, Score
     FROM endresult
) X
WHERE x.Score IS NOT NULL
ORDER BY Yearmonth DESC

Section 1 :  YEARS
In this section I use a recursive CTE  to create a "table" of relevant years.
  • The anchor part of the CTE will select the smallest year from the scores table
  • The recursive part of the CTE will accumulate the years with 1 UP UNTIL the current year.
Section 2:  MONTHS
In this section I again use a recursive CTE to create a "table"of 12 months

Section 3: YEARMONTHS
In this section I create a combined "table" of years and months.

However as the MONTHS table is a comprehensive table of 12 records, I limit the YEARMONTHS "table" to start at the first Year Month combination from the original table scores up until the current Year month combination.

There are a few minor issues such as creating YYYYMM integers , thereby if the month number is less than 10, it must be prefixed with 0. You will notice converts to varchar with case statements and converts back to integers. This is all just to ensure that the yearmonth combinations are created as integers correctly.

Seciton 4:  RESULT
In this section I (left) join the YEARMONTH table with the original Scores table so that the actual scores are filled in the months they occurred.  A key factor in this table is the creation of a unique rownumber per record. This is vital for the next section.

At this stage I have achieved the first part of the solution ie : Build sequential ranges of datesranges of dates

Section 5: ENDRESULT
And now for the final pieces of the puzzle.  The ENDRESULT table is a recursive query
whereby the result of a cross join with each previous record is generated.

The anchor part of the query returns the result table ( so not just 1 record )
the recursive part joins this table to itself based on previous rownumber.

Section 6: OUTPUT
In this section a unique rownumber over each of the records in ENDRESULT is generated in a subquery as when doing a select on ENDRESULT we need those records that do not have a NULL as a score.