T-SQL Tuesday: Models Gone Wild!
November's installment of T-SQL Tuesday, hosted by Mickey Stuewe (@SQLMickey), has nothing to do with Zoolander or Gisele Bundchen. The topic is "Data Modeling Gone Wrong," and since I'm old and grumpy, I've seen a lot of data modeling sins over the years. I tend to share a lot of my war stories in sessions I deliver in person; for this post, I thought I would share a few of my favorites, and why I think they're bad. These are kind of rambling, but I've sorted them into two categories for you: naming issues and data type issues.
Naming Issue
Shorthand - don't call a table cust
- why not Customers
(or Customer
if you really subscribe to the idea that each row is a self-contained entity)? I don't subscribe to the notion that you can gain all kinds of productivity from saving a few characters of typing using shorthand or abbreviations - especially when they can lead to ambiguity (I often harp on this about shorthand like DATEPART(Y, ...)
- where the Y
doesn't mean what you think it means).
Object prefixes - I commonly see the tbl
prefix ("tibble") on every table in the system. I know that in many disciplines we're taught to use self-documenting prefixes, but for tables (and procedures, and functions, and triggers...), they are simply unnecessary. The leading tibble just makes everything sort together in a list, and requires additional visual parsing to find tblCustomers
, for example.
I don't recall ever looking at a piece of code and trying to figure out which objects were tables and which were procedures or functions - the context makes that obvious, and naming things incorrectly won't fool you:
You might argue, well, tibble allows me to differentiate a table from a view. I would argue that there aren't many cases where this matters - you largely treat views and tables the same. If you need to distinguish for the few cases where it does matter (like a non-updateable view), I'd rather you use prefixes or some other naming scheme for the views, since they're the exception.
Another problem with using the tibble prefix: what happens if you later change that table to a view or table-valued function? You'll either have to rename it and refactor all of the referencing code, or be ready to explain the backstory every time someone asks why there is a view named tblSomething
.
(And for stored procedures, please be very careful about using the sp_
prefix - which doesn't even stand for stored procedure, by the way. As I have documented in the past, it can cause unexpected results and even hurt performance. I don't think any prefix is necessary here either, but if you must use one, please choose something other than sp_
.)
Column/variable prefixes - another thing we learn from other disciplines is to prefix columns or variables with their data types, for example intCustomerID
. Like tbl
, this may help the developer in a rush who doesn't understand the schema and doesn't know how to look at metadata, but it can cause the same kind of issue later. What if your IDENTITY
column exceeds the bounds of an int
and needs to become a bigint
? What if you decide to change to GUIDs for your primary keys? Those refactoring tasks may be complex enough without throwing a necessary rename on top of it all. And speaking of GUIDs, the actual type name in SQL Server is uniqueidentifier
- does it make sense to be consistent and call your key uniqueidentifierCustomerID
?
Object consistency - I once had a customer system I was dealing with where there were two stored procedures for updating a customer, dbo.Customer_Update
and dbo.usp_updatecust
. Let me assure you that it was very frustrating for their team, who spent hours trying to find and reproduce a bug in one stored procedure, when the code was actually calling the other. The problem was that the person who started debugging the code followed one naming convention and didn't even think to check the application code or look for other possible stored procedures that were being called. Your standard isn't important, but pick one, enforce it, and stick with it. Personally, I prefer Entity_Verb
over verb[_]entity
, since I am often looking for the stored procedures involving customers, but very rarely am I looking for all the stored procedures that update anything.
Entity consistency - I've seen systems where one table had id
, another had Id
, a third had ID
, and others still had cust
, CustID
, customerid
and Customer_Id
. As with object names, pick a standard for an entity, and stick with it (including case sensitivity, because you never know when you may end up on a case sensitive collation). A CustomerID
should be identifiable as CustomerID
no matter where in the data model it appears. IMHO, consistency trumps the fact that this may seem redundant in the Customers
table.
Data Type Issue
Storing dates as strings - this is a pretty common one; people like to have their dates stored in their preferred format, e.g. mm/dd/yyyy
or dd/mm/yyyy
. Aside from the fact that formatting belongs in the presentation tier, and that ambiguous, regional formats are largely useless today, this causes all kinds of issues with the data itself. For one, if the data type is varchar
, how do you perform validation? What is preventing someone from storing 99/82/3109
or 33/-4/6666
? Oh, the application makes sure the string parameter gets passed a valid date? That's great, until someone bypasses your application to perform data modifications. You also lose the ability to perform date/time operations like DATEADD
, DATENAME
, and DATEPART
without performing explicit conversions or potential errors due to language or other settings. Sorting and range queries also don't work correctly without explicitly converting the column first, which almost always means a scan - even if there is an index on that string column. (And it doesn't make a lot of sense to sort a string that begins with the day or month.) For a longer rambling on this topic, see Bad habits to kick : mis-handling date / range queries.
Using Unicode when you don't need it - I've lost count of the number of times I've seen people using nvarchar
for things like telephone, postal code, or even GUIDs stored as strings. These things will never need to support characters outside of the basic ASCII character set; using Unicode here just means these columns will take up twice as much space, both on disk and in the buffer pool (unless you are using Data Compression in SQL Server 2008 R2 or later).
Not using Unicode when you DO need it - conversely, there are many cases where you don't think up front that you will need to support foreign character sets, but you will later - things like proper names or company names. Imagine this person fills out your web form, and you store their name in a varchar
column:
When you print out their business cards or add a salutation to their e-mail, it's going to come out looking like this:
Or even worse: