argh, they didn’t use ‘Identity’, row_number() to the rescue

One issue I have often come across when importing large sets of data is a generated primary key, usually some sort of number, where the column is not defined as an Identity column. Questionable as that design may be, it is not in itself the issue here; the problem is that when importing large sets of data into these tables (where you have no requirement to preserve the key), you generally have to run the primary key generation for each row.

The most common implementation of custom primary key generation is a table of next keys, from which the value is retrieved and incremented for each insertion, generally through the use of a cursor. Whilst I feel that defining the generated primary key as an ‘Identity’ column is a better implementation, often you have to work within the boundaries of the system presented.


cursors….run away, run away

Having come across this numerous times, I suspect I am not alone in meeting such an issue, so I thought I would share my solution to this problem. Firstly, for demo purposes, I will set up some table variables to be our source and destination tables; I will then show the cursor based implementation, followed by the Row_Number() based solution.

Demo Set Up

The following snippet sets up the demo code for later use (the full code listing appears at the end of the post). Here we set up our bulk source (#teams) and our destination (@tblTeam); we also have the meta table (@tblIdentity), which contains the next key for each of the tables. (Table variables have been used for demo purposes only.)

--define the tables for the example
declare @tblIdentity table(
    tablename varchar(50),
    NextIdentityValue int
)

--teamName is surely a suitable primary key,
--but for this example lets
--use the teamID
declare @tblTeam table(
    teamID int not null PRIMARY KEY, 
    teamName varchar(50)
)

--assume we have 33 other teams in @tblTeam
insert  into @tblIdentity(tablename, NextIdentityValue)
values ('tblTeams', 34)
--pre load a team into @tblTeam
insert into @tblTeam(teamID, teamName)
values(33, 'Sydney FC')

--Load up a temp table with our 'load file'
select
    'Central Coast Mariners' TeamName
into #teams
insert into #teams(TeamName)
values ('Wellington Phoenix'),
    ('Melbourne Victory'),
    ('Adelaide United')
1.1 Demo setup

Cursor Implementation

To perform the load using a cursor, we retrieve the next identifier from @tblIdentity, define the cursor and perform the inserts one at a time as we enumerate through it. Finally, we update the identifier table with the last value of @nextID and clean up.

declare @nextID as integer,
        @nextTeam as varchar(50)

select 
    @nextID = NextIdentityValue
from
    @tblIdentity
where 
    tablename = 'tblTeams'

declare bulkCrs Cursor for select TeamName from #teams
open bulkCrs

fetch next from bulkCrs into @nextTeam
while @@FETCH_STATUS = 0
begin
    insert into @tblTeam(teamID, teamName)
    values (@nextID, @nextTeam)
    
    --update the next identifier variable
    set @nextID = @nextID + 1
    fetch next from bulkCrs into @nextTeam
end

close bulkcrs
deallocate bulkcrs
drop table #teams

--now update the table of identifiers to reflect the change we have made
update 
    @tblIdentity
set
    NextIdentityValue = @nextID
where
    tablename = 'tblTeams'

 

1.2 Cursor example

Whilst the use of cursors might be acceptable for 4 rows, if this were truly a bulk insert of millions of rows you can see how the cursor solution would not scale well. What we need is a single, set based operation to replace the cursor, and this is where row_number comes into play.

Row_Number() Solution

Row_Number() is a built-in T-SQL ranking function which, as the name suggests, returns the sequential number at which a row sits within a set, based upon the partitioning and ordering clauses that we specify. Essentially it gives our set an order and labels that order for our own use. In this case we want the row numbering to be unique across the whole set, as it will be added to the next id, so we will ignore the ‘partition by’ clause and just utilise the ‘order by’ clause.
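As a quick illustration of that uniqueness (this is my own snippet with made-up team names; the VALUES table constructor needs SQL Server 2008 or later):

```sql
--duplicate values in the 'order by' column still receive
--distinct, sequential row numbers
select
    ROW_NUMBER() over (order by TeamName) as rn,
    TeamName
from
    (values ('Adelaide United'),
            ('Perth Glory'),
            ('Perth Glory')) as t(TeamName)
--rn comes back as 1, 2, 3 even though 'Perth Glory' appears twice;
--the ordering between the two duplicates is arbitrary, but the
--numbers themselves are always unique
```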

This demo uses the same set up as in listing 1.1 above but differs where it would have hit the cursor in listing 1.2.

Listing 1.3, below, is my preferred way around this issue, as it performs the insert in a single, set based operation without the need to resort to cursors. The initial set up and retrieval of the next ID are the same as in the cursor based version, but the insert into @tblTeam is a lot cleaner and is performed in one step.


Row_Number() provides us with a means of determining a key for the insert: when added to @nextID it becomes the sequential set matching the tail of our target table. Here the ordering is determined by the ‘Order By’ clause on TeamName; the row number returned is sequential and unique even when duplicate values exist in the ‘Order By’ column. Although it does not matter which column the ordering is applied to, for consistency it is best to order by the same sequence in which the data would have been inserted through your application, if applicable.

Following the insert, we must update the identifier table to reflect the new entries in the Team table. @@RowCount provides the mechanism: it tells us how many rows were inserted into @tblTeam, and hence how far to advance @tblIdentity. One catch is that @@RowCount must be read in the very next statement, as it reflects only the previously executed statement.
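One defensive pattern here (my own habit, not part of the listings in this post, and using declare-with-initialiser, which needs SQL Server 2008 or later) is to capture @@RowCount into a variable on the very next statement, so nothing can clobber it before the update runs:

```sql
--capture immediately; almost any subsequent statement resets @@ROWCOUNT
declare @rowsInserted int = @@ROWCOUNT

update
    @tblIdentity
set
    NextIdentityValue = @nextID + @rowsInserted
where
    tablename = 'tblTeams'
```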

declare @nextID as integer,
        @nextTeam as varchar(50)

--we still need to retrieve the next identityID from our identifier table
select 
    @nextID = NextIdentityValue
from
    @tblIdentity
where 
    tablename = 'tblTeams'

--now do the insert in one, set based, operation through
--the help of row_number()
insert into 
    @tblTeam(teamID, teamName)
select 
    ROW_NUMBER() over (order by teamName) + @nextID - 1,
    TeamName
from
    #Teams
    
--similarly we must also update the identifier table
--at the end of the process as the underlying business
--logic expects it to be in sync
update 
    @tblIdentity
set
    NextIdentityValue = @nextID + @@ROWCOUNT
where
    tablename = 'tblTeams'

 

1.3 Row_Number() example

…and there we have it: the need for a cursor has been removed, and we can be comfortable that the single, set based insert will scale to much larger datasets.

Full code listings:

Cursor based bulk insert

--define the tables for the example
declare @tblIdentity table(
    tablename varchar(50),
    NextIdentityValue int
)

--teamName is surely a suitable primary key,
--but for this example lets
--use the teamID
declare @tblTeam table(
    teamID int not null PRIMARY KEY, 
    teamName varchar(50)
)

--assume we have 33 other teams in @tblTeam
insert  into @tblIdentity(tablename, NextIdentityValue)
values ('tblTeams', 34)
--pre load a team into @tblTeam
insert into @tblTeam(teamID, teamName)
values(33, 'Sydney FC')

--Load up a temp table with our 'load file'
select
    'Central Coast Mariners' TeamName
into #teams
insert into #teams(TeamName)
values ('Wellington Phoenix'),
    ('Melbourne Victory'),
    ('Adelaide United')


declare @nextID as integer,
        @nextTeam as varchar(50)

select 
    @nextID = NextIdentityValue
from
    @tblIdentity
where 
    tablename = 'tblTeams'

declare bulkCrs Cursor for select TeamName from #teams
open bulkCrs

fetch next from bulkCrs into @nextTeam
while @@FETCH_STATUS = 0
begin
    insert into @tblTeam(teamID, teamName)
    values (@nextID, @nextTeam)
    
    --update the next identifier variable
    set @nextID = @nextID + 1
    fetch next from bulkCrs into @nextTeam
end

close bulkcrs
deallocate bulkcrs
drop table #teams

--now update the table of identifiers to reflect the change we have made
update 
    @tblIdentity
set
    NextIdentityValue = @nextID
where
    tablename = 'tblTeams'

select 
    teamID, teamName
from
    @tblTeam

Row_Number based bulk insert

--define the tables for the example
declare @tblIdentity table(
    tablename varchar(50),
    NextIdentityValue int
)

--teamName is surely a suitable primary key, but for this example lets
--use the teamID
declare @tblTeam table(
    teamID int not null PRIMARY KEY, 
    teamName varchar(50)
)

--assume we have 33 other teams in @tblTeam
insert  into @tblIdentity(tablename, NextIdentityValue)
values ('tblTeams', 34)
--pre load a team into @tblTeam
insert into @tblTeam(teamID, teamName)
values(33, 'Sydney FC')

--Load up a temp table with our 'load file'
select
    'Central Coast Mariners' TeamName
into #teams
insert into #teams(TeamName)
values ('Wellington Phoenix'),
    ('Melbourne Victory'),
    ('Adelaide United')


declare @nextID as integer,
        @nextTeam as varchar(50)

--we still need to retrieve the next identityID from our identifier table
select 
    @nextID = NextIdentityValue
from
    @tblIdentity
where 
    tablename = 'tblTeams'

--now do the insert in one, set based, operation through
--the help of row_number()
insert into 
    @tblTeam(teamID, teamName)
select 
    ROW_NUMBER() over (order by teamName) + @nextID - 1,
    TeamName
from
    #Teams
    
--similarly we must also update the identifier table
--at the end of the process as the underlying business
--logic expects it to be in sync
update 
    @tblIdentity
set
    NextIdentityValue = @nextID + @@ROWCOUNT
where
    tablename = 'tblTeams'

drop table #teams

select 
    teamID, teamName
from
    @tblTeam

select * from @tblIdentity


Possible Enhancements

  • Join with the identity table in the Row_Number based insert to remove the extra step of retrieving the key
  • Utilise try…catch and transactions in order to make the bulk insert and the subsequent update of the identity table an atomic operation
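A rough sketch of the second enhancement, assuming permanent tables named tblTeam and tblIdentity in place of the demo's table variables (table variables are not affected by a rollback, so the demo objects would not behave this way; declare-with-initialiser needs SQL Server 2008 or later):

```sql
begin try
    begin transaction

    insert into tblTeam(teamID, teamName)
    select
        ROW_NUMBER() over (order by TeamName) + @nextID - 1,
        TeamName
    from
        #teams

    --the identifier update rides in the same transaction,
    --so the two steps succeed or fail together
    update tblIdentity
    set NextIdentityValue = @nextID + @@ROWCOUNT
    where tablename = 'tblTeams'

    commit transaction
end try
begin catch
    if @@TRANCOUNT > 0
        rollback transaction
    --surface the original error to the caller
    declare @errMsg nvarchar(2048) = error_message()
    raiserror(@errMsg, 16, 1)
end catch
```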
Categories: TSQL

Au Teched, here we come

With TechEd NZ trending on twitter (#TENZ) the excitement levels for those of us attending the Aussie version on the Gold Coast are certainly starting to rise.

AuTeched

I’m very happy to say that I’ll also be there from Tuesday evening, unfortunately missing the keynote as I arrive after 6pm. Naturally I will be following the Database and Business Intelligence track, but I will also be mixing it up a bit with some Cloud Computing and Architecture sessions, along with a few Hands On and Instructor Led labs.

For those who will be there on Tuesday evening, don’t forget to register for the AU TechHeads Networking Party, which is being held directly after the Welcome Reception.

Categories: Teched

Gotcha: IsNull()

Whilst working with IsNull() a while ago I came across some strange behaviour that made total sense once I ventured into Books Online; however, it was suitably tricky that I thought it worthy of a Gotcha post.

You see, I was retrieving some text fields which may have, legitimately, been null; in cases of null I wanted to display that the field was not yet entered. The trouble I had, though, is that the default text was being truncated.

In this (contrived) example, if the source school is null we want to display ‘Not Yet Registered’; however, when the query is run the text is truncated to ‘Not Yet Registe’.

(screenshot: query results, with the null SourceSchool row showing the truncated ‘Not Yet Registe’)

The TSQL behind this is as follows:

declare @Student table
        (StudentID int primary key, 
        StudentName varchar(30) not null,
        SourceSchool varchar(15) null)

insert into @Student(StudentID, StudentName, SourceSchool)
values    (231, 'Joe Bloggs', 'Lake Kytin'),
        (346, 'Mary Lilley', 'Port Smithton'),
        (564, 'Sue Cambell', null)

select 
    StudentID,
    StudentName,
    isnull(SourceSchool, 'Not Yet Registered') SourceSchool
from
    @Student

So why is this happening? A quick look at Books Online shows that the return type of IsNull() is actually that of the check expression:

ISNULL ( check_expression , replacement_value )

So in our code above, even though the replacement value (‘Not Yet Registered’) and the check expression are both strings, the check expression is varchar(15) while the replacement value has a length of 18, so the result is automatically truncated to 15 characters.

BOL Entry for IsNull()

So in order to correct this we can:

1) increase the length of the SourceSchool field

-or-

2) reduce the length of the replacement value

-or-

3) Cast the check expression to the data type of the replacement value.

Option 1 means changing the schema to suit a report, which is not ideal; Option 2 is not valid, as the text length may be a requirement; so Option 3 it is.

This is shown below:

declare @Student table
        (StudentID int primary key, 
        StudentName varchar(30) not null,
        SourceSchool varchar(15) null)

insert into @Student(StudentID, StudentName, SourceSchool)
values    (231, 'Joe Bloggs', 'Lake Kytin'),
        (346, 'Mary Lilley', 'Port Smithton'),
        (564, 'Sue Cambell', null)

select 
    StudentID,
    StudentName,
    isnull(cast(SourceSchool as varchar(18)), 'Not Yet Registered') SourceSchool
from
    @Student
As we can see, the results are now what we would expect:

(screenshot: query results, with the null SourceSchool row now showing the full ‘Not Yet Registered’)
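As an aside (my addition rather than part of the original post), Coalesce() avoids this particular gotcha: its return type is determined by data type precedence across all of its arguments, so the varchar(18) literal wins and nothing is truncated:

```sql
select
    StudentID,
    StudentName,
    --coalesce types the result from all arguments, so no cast is needed
    coalesce(SourceSchool, 'Not Yet Registered') SourceSchool
from
    @Student
```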

Categories: Gotcha, TSQL

2007 Office System Driver: Data Connectivity Components

This post is more of a note to myself in order to bookmark the Data Connectivity Tools for, in my case, SQL Server installs.

If, for example, you wish to load Excel files with SSIS, your server will require the appropriate drivers. One method would be to install Office onto the server, but this is not recommended: for security reasons it introduces an unneeded (however small) vulnerability, and it also means you have another set of programs to patch as updates are released.

In order to get around this, simply load the free drivers from Microsoft, which are available for download from this address:

http://www.microsoft.com/downloads/en/confirmation.aspx?familyId=7554f536-8c28-4598-9b72-ef94e038c891&displayLang=en

DataConnectivityComponents

Categories: Note To Self, SQL Tools

Loading Excel Files in 64 Bit SSIS

A recent upgrade of a server to 64 bit revealed that, at this stage, there is no 64 bit Excel driver for SSIS. With Denali on its way, I’m guessing there won’t be one forthcoming either.

This manifests itself with the following error:

Error: 0xC00F9304 at Package, Connection manager "Excel Connection Manager": SSIS Error Code DTS_E_OLEDB_EXCEL_NOT_SUPPORTED: The Excel Connection Manager is not supported in the 64-bit version of SSIS, as no OLE DB provider is available.

In order to overcome this you must run SSIS using the 32 bit runtime. How you do so depends upon how you call your package: from a SQL Agent job or from the command line.

Running SSIS in 32 bit mode from SQL Agent:

To configure a SQL Agent SSIS step to run as 32 bit, simply select ‘Use 32 bit runtime’ on the ‘Execution Options’ tab of the step settings when creating the task, or edit the existing step that you are porting to the 64 bit environment, as shown below:

Run SSIS in 32 bit mode

Running SSIS in 32 bit mode from the Command Line:

Sometimes you need to call your SSIS package from the command line instead of SQL Agent. In a default 64 bit installation the runtime is located at c:\Program Files\Microsoft Sql Server\100\DTS\Binn\DTExec.exe; however, if you want to load any Excel files you will need to call the 32 bit runtime, which you can find here: c:\Program Files (x86)\Microsoft Sql Server\100\DTS\Binn\DTExec.exe
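For example, a call forcing the 32 bit runtime might look like the following (the package path here is purely hypothetical):

```shell
REM invoke the 32 bit DTExec so the Excel connection manager can load
"c:\Program Files (x86)\Microsoft Sql Server\100\DTS\Binn\DTExec.exe" /File "D:\Packages\LoadExcelData.dtsx"
```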

Categories: Excel, SSIS

sp_who2 and dbcc inputbuffer mashup

One thing I find myself doing from time to time is looking up what a particular user or set of users is doing. To do this I would run sp_who2, pick out the SPID of the user concerned and then run dbcc InputBuffer(@SPID) for each of the interesting results to see the command last run or, as is often the case, the command currently running.

After doing this a couple of times I thought, why am I repeating this action? So I created InputBuffers, a stored procedure to help me filter out some of the noise and see what activity the users are up to.

Complete code listing at the bottom

Initially I have created the procedure to accept two filters, @Login and @Hostname, which are both varchar(128). I am planning to extend this to also include database name, program name and the top (n) CPU or DiskIO hogs.

The sp_who2 results are loaded into a temporary table, which we then use as the basis for a cursor that calls dbcc inputbuffer(SPID); this is what retrieves the last command.


One slight issue is that the DBCC command return does not contain the identifying information from sp_who.


To get around this, the output table variable has an identity field; after each dbcc call the newly inserted row, located via @@Identity, is updated with the identifying columns from the cursor’s current row.


From here we just need to output the results and we are done.


Running this for our user ‘BlogDemo’

InputBuffers ‘BlogDemo’

allows us to cut through the noise and just see the activity from BlogDemo

(screenshot: the filtered result set, showing only BlogDemo’s activity)

Feel free to use the code listing below, and please add comments if there are any additions you would like made, or anything you think could be changed to make it better.


Code Listing:


 

 

CREATE PROCEDURE [dbo].[InputBuffers]
    @Login varchar(128) = null,
    @Hostname varchar(128) = null
AS

/**************************************************************
Author:        John Walker
                   sqljohn.wordpress.com
            twitter: sqljohn
Date:        13/09/2010
Version:    1.0
***************************************************************
InputBuffers is a mash up of sp_who2 and dbcc inputbuffer(), it returns
the last statement executed by a user, as dbcc input buffer does
but it lets you filter by login or hostname.
This returns not only the statement but also the login, hostname and SPID
**************************************************************/

/*table to hold the sp_who2 results*/
declare @sp_who2 table
(SPID       int,
[Status]    varchar(30) not null,
[Login]           varchar(128) not null,
[HostName]  varchar(128) not null,
BlkBy       varchar(5) not null,
DBName            varchar(128) null,
Command           varchar(16) not null,
CPUTime           INT not null,
DiskIO            INT not null,
LastBatch   varchar(20) not null,
ProgramName varchar(128) not null,
rightSPID   int,
REQUESTID   int)

/*output table to hold inputBuffer statement and user info*/
declare @InputBuffer table
(
InputBufferID   int identity(1,1),
EventType       varchar(30),
[Parameters]    int,
EventInfo       varchar(4000),
SPID            int,
[Login]         varchar(128),
Hostname        varchar(128),
CPUTime         INT,
DiskIO          INT,
LastBatch        varchar(20),
ProgramName        varchar(128)
)

/*load table variable with the results of sp_who2*/
INSERT @sp_who2 EXEC sp_who2

/*cycle through the filtered results loading the result table variable*/
declare @SPID                as int
declare @BufferLogin        as varchar(128)
declare @BufferHostname        as varchar(128)
declare @BufferCPUTime        as INT
declare @BufferDiskIO        as INT
declare @BufferLastBatch    as varchar(20)
declare @BufferProgramName    as varchar(128)

declare InputBufferCrs cursor for
      select
            SPID, [Login], HostName, CPUTime, DiskIO, LastBatch, ProgramName
      from
            @sp_who2
      where
            ([Login] = @Login or @Login is null)
            and (hostname = @Hostname or @Hostname is null)

open InputBufferCrs
fetch next from InputBufferCrs into @SPID, @BufferLogin, @BufferHostname,
                                    @BufferCPUTime, @BufferDiskIO, @BufferLastBatch,
                                    @BufferProgramName
while @@FETCH_STATUS = 0
begin
      insert @InputBuffer (EventType, Parameters, EventInfo)
      /*cast @SPID: int takes precedence, so un-cast concatenation would fail*/
      exec ('DBCC InputBuffer(' + cast(@SPID as varchar(10)) + ') WITH NO_INFOMSGS')
      /*we cant load the user info during the exec of dbcc so update them here
           with the most recent @@identity*/
      update
            @InputBuffer
      set
            SPID = @SPID,
            [Login] = @BufferLogin,
            Hostname = @BufferHostname,
            CPUTime = @BufferCPUTime,
            DiskIO    = @BufferDiskIO,
            LastBatch = @BufferLastBatch,
            ProgramName = @BufferProgramName
      where
            InputBufferID = @@IDENTITY
      fetch next from InputBufferCrs into @SPID, @BufferLogin, @BufferHostname,
                                    @BufferCPUTime, @BufferDiskIO, @BufferLastBatch,
                                    @BufferProgramName
end

close InputBufferCrs
Deallocate InputBufferCrs

/*output our results*/
select
      SPID, [Login], Hostname, EventType, Parameters, EventInfo,
      CPUTime, DiskIO, LastBatch, ProgramName
from
      @InputBuffer
order by
      Hostname, Login

GO

Categories: SQL Tools

Project Perfmon

I recently came across this great poster from the good people at Quest. It provides a huge list of all the relevant Performance Monitor metrics that you should be looking at along with what they mean and what values you should be looking out for.

PerfMonPosterThumb

I duly printed this off on a nice A3 page, started perusing, and then realised that I would really need to work up examples for each of the counters if I was to truly understand them. Thus the Perfmon Project idea was born: what better way to learn them all than to work through them, top to bottom, left to right, and blog my findings?

Please note, this is in no way an attempt to plagiarise the great work from Quest or by the authors; it is an effort by me to understand each of these items and build a worked example that I can refer back to at a later date. In each case I will ensure that I fully reference any resources/blogs/pages that I have used to come up with my interpretation. Let’s look at it as standing on their shoulders to enjoy the view :)

For your own copy, head over to Quest, and if you have a spare moment be sure to check out Kevin Kline’s blog and don’t forget to follow him on twitter.

My plan is to post a new item every three weeks or so, as I want to fully understand the material instead of just pumping out posts… so, on with the project.

In case anyone is wondering, I have contacted Kevin about this project idea and he is more than happy for me to use the poster in these posts. Thanks, Kevin.

Categories: Performance Monitor, SQL
