T-SQL Tuesday: Models Gone Wild!

T-SQL Tuesday

November's installment of T-SQL Tuesday, hosted by Mickey Stuewe (@SQLMickey), has nothing to do with Zoolander or Gisele Bundchen. The topic is "Data Modeling Gone Wrong," and since I'm old and grumpy, I've seen a lot of data modeling sins over the years. I tend to share a lot of my war stories in sessions I deliver in person; for this post, I thought I would share a few of my favorites, and why I think they're bad. These are kind of rambling, but I've sorted them into two categories for you: naming issues and data type issues.

Naming Issue

Shorthand - don't call a table cust - why not Customers (or Customer if you really subscribe to the idea that each row is a self-contained entity)? I don't subscribe to the notion that you can gain all kinds of productivity from saving a few characters of typing using shorthand or abbreviations - especially when they can lead to ambiguity (I often harp on this about shorthand like DATEPART(Y, ...) - where the Y doesn't mean what you think it means).

Death by 1,000 tibbles Death by 1,000 tibbles

Object prefixes - I commonly see the tbl prefix ("tibble") on every table in the system. I know that in many disciplines we're taught to use self-documenting prefixes, but for tables (and procedures, and functions, and triggers...), they are simply unnecessary. The leading tibble just makes everything sort together in a list, and requires additional visual parsing to find tblCustomers, for example.

I don't recall ever looking at a piece of code and trying to figure out which objects were tables and which were procedures or functions - the context makes that obvious, and naming things incorrectly won't fool you:

INSERT dbo.fnFoo -- not a function, in spite of the prefix
EXEC dbo.tblBar; -- not a table, in spite of the prefix

You might argue, well, tibble allows me to differentiate a table from a view. I would argue that there aren't many cases where this matters - you largely treat views and tables the same. If you need to distinguish for the few cases where it does matter (like a non-updateable view), I'd rather you use prefixes or some other naming scheme for the views, since they're the exception.

Another problem with using the tibble prefix: what happens if you later change that table to a view or table-valued function? You'll either have to rename it and refactor all of the referencing code, or be ready to explain the backstory every time someone asks why there is a view named tblSomething.

(And for stored procedures, please be very careful about using the sp_ prefix - which doesn't even stand for stored procedure, by the way. As I have documented in the past, it can cause unexpected results and even hurt performance. I don't think any prefix is necessary here either, but if you must use one, please choose something other than sp_.)

Column/variable prefixes - another thing we learn from other disciplines is to prefix columns or variables with their data types, for example intCustomerID. Like tbl, this may help the developer in a rush who doesn't understand the schema and doesn't know how to look at metadata, but it can cause the same kind of issue later. What if your IDENTITY column exceeds the bounds of an int and needs to become a bigint? What if you decide to change to GUIDs for your primary keys? Those refactoring tasks may be complex enough without throwing a necessary rename on top of it all. And speaking of GUIDs, the actual type name in SQL Server is uniqueidentifier - does it make sense to be consistent and call your key uniqueidentifierCustomerID?

Object consistency - I once had a customer system I was dealing with where there were two stored procedures for updating a customer, dbo.Customer_Update and dbo.usp_updatecust. Let me assure you that it was very frustrating for their team, who spent hours trying to find and reproduce a bug in one stored procedure, when the code was actually calling the other. The problem was that the person who started debugging the code followed one naming convention and didn't even think to check the application code or look for other possible stored procedures that were being called. Your standard isn't important, but pick one, enforce it, and stick with it. Personally, I prefer Entity_Verb over verb[_]entity, since I am often looking for the stored procedures involving customers, but very rarely am I looking for all the stored procedures that update anything.

Entity consistency - I've seen systems where one table had id, another had Id, a third had ID, and others still had cust, CustID, customerid and Customer_Id. As with object names, pick a standard for an entity, and stick with it (including case sensitivity, because you never know when you may end up on a case sensitive collation). A CustomerID should be identifiable as CustomerID no matter where in the data model it appears. IMHO, consistency trumps the fact that this may seem redundant in the Customers table.

Data Type Issue

Storing dates as strings - this is a pretty common one; people like to have their dates stored in their preferred format, e.g. mm/dd/yyyy or dd/mm/yyyy. Aside from the fact that formatting belongs in the presentation tier, and that ambiguous, regional formats are largely useless today, this causes all kinds of issues with the data itself. For one, if the data type is varchar, how do you perform validation? What is preventing someone from storing 99/82/3109 or 33/-4/6666? Oh, the application makes sure the string parameter gets passed a valid date? That's great, until someone bypasses your application to perform data modifications. You also lose the ability to perform date/time operations like DATEADD, DATENAME, and DATEPART without performing explicit conversions or potential errors due to language or other settings. Sorting and range queries also don't work correctly without explicitly converting the column first, which almost always means a scan - even if there is an index on that string column. (And it doesn't make a lot of sense to sort a string that begins with the day or month.) For a longer rambling on this topic, see Bad habits to kick : mis-handling date / range queries.

Using Unicode when you don't need it - I've lost count of the number of times I've seen people using nvarchar for things like telephone, postal code, or even GUIDs stored as strings. These things will never need to support characters outside of the basic ASCII character set; using Unicode here just means these columns will take up twice as much space, both on disk and in the buffer pool (unless you are using Data Compression in SQL Server 2008 R2 or later).

Not using Unicode when you DO need it - conversely, there are many cases where you don't think up front that you will need to support foreign character sets, but you will later - things like proper names or company names. Imagine this person fills out your web form, and you store their name in a varchar column:

INSERT dbo.Customers(Name) SELECT 'Aṣṭādhyāyī Ağçayş';

When you print out their business cards or add a salutation to their e-mail, it's going to come out looking like this:

A??adhyayi Agçays

Or even worse:

A☐☐☐dhy☐y☐ A☐☐ay☐

This is not a good way to make friends in the international market, and it can be a royal pain to change later. So think hard about these design decisions up front. If I were to weigh the impact of disk space versus the cost of refactoring, I would always lean toward using Unicode for any column that could ever potentially require Unicode, and just dealing with the extra space requirements. So, if in doubt, choose nvarchar. And above all, be consistent - when you start mixing varchar and nvarchar, you end up with implicit conversions, which can often have drastic effects on your workload.

Making your variable length columns too long - one thing I see frequently is the reluctance to pick an upper bound for a variable length column - e-mail and URL, for example, are often set to 4000 or 8000 characters, or even max types. Even though an e-mail can't be longer than 320 characters (64 for localpart, 1 for @ symbol, 255 for domain), and a URL can't be longer than 1,024 or 2,048 characters, depending on which browser is your lowest common denominator. For max types there are inherent inefficiencies and differences in behavior that are well documented. For picking an arbitrary length in the 4000/8000 range, a lesser-known problem is that if you have e-mail address defined as varchar(8000), but the largest value is 120 characters, and the average is 36 characters, the memory grant for any query is going to be based on the assumption that the average e-mail address is 4,000 characters long. On your local system and with a small data set, that's not really a problem; at scale, though, it's going to be painful.

Storing comma-separated lists - this is a popular one that has been getting further muddled by support for other non-relational data like XML and JSON. Basically, my opinion is this: any single fact that you care about independently, should be stored independently. If you're stuffing pet names into a comma-separated list and you're going to want to search for everyone with a pet named Snowball or Santa's Little Helper, that comma-separated list is going to make your job very hard. Try to think about *all* of the future uses of the data, not just focusing on what is the easiest way for the application to pass a "list" to SQL Server and store it (and in fact, think about it as a set, not a list). Depending on the application, you might actually want a better way to do this end-to-end: table-valued parameters, where the app doesn't have to take a set and build a single string, and the database never has to turn around and split that string back into its original set.

Now, I am not saying there are zero use cases for storing XML or JSON inside of SQL Server - there certainly are valid scenarios. But ideally, either the application(s) should not rely on extracting parts or searching these blobs inside the database, or users should understand that those features are unlikely to be very fast.

SUMMARY

I really could go on and on about these things, but I'll save you some agony for now. Please do keep in mind that many of these things I discourage are subjective, and they often do have valid pros, I just don't prioritize the pros as high as the cons.

THWACK - Symbolize TM, R, and C