In my last article on Software Design Principles I talked about a methodology to use when designing a software project. For this article I want to go a bit lower level, less “big picture”, and restrict the subject matter to the realm of database design. In my experience most of the applications I write start with database design. It is the foundation for almost every project I have worked on. When I started writing software professionally, back in 1996, my Senior year at a State University, where I had a single semester in Database Management Systems, my coworkers were not too far ahead of me.They had all graduated and took the same class but Database Design was, in my opinion, in its infancy back then. This was before Client/Server technology was widespread as it is now. Our classes in database design were using ISAM databases, we did one project on a true Client/Server database and it was not a very large project. I say this now, looking back, not to say we were not prepared to do our jobs, but more to set the stage for what happened next. We were young and very arrogant programmers, we were the next generation of geeks in the middle of the Internet Boom of the mid 1990′s ready and willing to make our mark in the world of Software Engineering, … at least that’s what we were thinking at the time. There were entire modules that were designed before the thought of storing data had even come up. You should have seen some of the Entity Relationship Diagrams, … oh, wait, that’s right we didn’t have any because no one thought to make one, everything was done “on the fly”, after all, if you need to change the tables in the database at a later time, you could always add new columns to the table. At least that’s what ended up happening. Some of the scripts we had to write back then were inserting, updating, and deleting the same information to numerous tables. It got so bad I remember one developer had made a primary key of a table, the concatenation of at least 5 other fields. Every time we had to update someones name in a table, we had to edit the key, find where the name of the person was within the key overwrite that information and update the record with a new name as well as the new primary key. This may seem comical in today’s world of modern freely distributable client/server databases, but back in those days things like this were common, at least to rookie programmers like us. The lesson I took away from those days was that, for a database application, the database is the “foundation” of the app. A poorly designed database makes every other step in the application’s development that much more difficult to design, write, and implement. It forces you design,write, debug, and test, code you shouldn’t need.
Principle : Separate a table’s relationship keys from its data columns
After about 3 years of professional software development experience, I was given a project that called for data to be exported to a 3rd party database. It was then that I saw the harmony of a truly well designed database. Every table, and I mean every single table, had a unique primary key, a 4 byte unsigned integer. In those days all the databases I had seen used data as the key. Names, Social Security Numbers, Item Numbers, the actual data was the key to the table, not in this database. It took me a day or two, but I began to see the genius in this design. They were separating the data from the relationships between the tables. If you made an Item table and a Purchase Order table, you would, more than likely, want to establish a link relating the Items to be purchased within the Purchase order table to a single row of data from the Item table. Someones first instinct may be to use the Item_Number field within the Item table to link the two.
This would serve the purpose of creating a link between the tables, but look what else it does. If that Item Number, for some reason, needs to change someday, how many tables are now linked together by that old Item Number? In our example we will now have to update both the “Item”, and “Purchase Order” tables in order to keep our referential integrity. Now imagine in the software had a Sales Order, Bill Of Materials, Manufacture Order, and countless other relationships to the “Item” table. All of the sudden you are forced to update countless tables for the sake of referential integrity. If a single mistake is made, orphan records are the result. Lost data that is difficult, if not impossible, to trace without even more tables logging changes.
If, instead of using the data stored in the Item Number, we used some auto-increment integer to create a relationship between the two tables, that number would never need to change, unless we were making a specific change as to which row in the Item table is to be linked to the other tables. If the item number, or any other “data” column, needs to change, it will not affect the table’s relationship with any other table. The amount of code to support the tables is decreased along with the database’s complexity, and in most cases the index size is reduced as well. It’s a win-win-win scenario.
For me, it has become second nature, every single table I create starts with a 4 byte integer called “ID”, it’s the first column in every table. This actually has another added benefit, in most cases other developers on the design team do not even have to refer back to the Entity Relationship Diagram (ERD) to write code relating tables, they know they can use the ID column whenever they need, or when they are creating Many-to-Many relationship tables between two preexisting tables. And for all the relationships we create not a single line of code needs to be written to change the underlying data when a record in two or more tables is linked, or unlinked.
That reminds me, on the subject of Many-To-Many relationships. They have got to be the single greatest source of confusion I have seen on the subject of database design outside of the last 2 rules of Database Normalization, … which are out of the scope of this article. I’ve asked questions of Software Engineering Applicants, in the past, on different parts of Database design. One of the more important questions is always on the subject of Many-to-Many relationships. A classic example is of a three table database for a University. Each University has classes and each class has students. The three tables needed are a data table for Classes, another for Students and one more, a relation table which dictates which students are attending which class. This is actually the easy part of the question. The real question to ask after these three tables are now set, is, “If you were to record a students grade for the class, in which table would the data belong?”.
It’s an easy question if you think about it. By process of elimination we can see that if we stored the grade within the student table, one student could only ever receive one grade, if we stored it within the Class table, there would only be a single grade for everyone in the class. There is only a single table left in the ERD, the relationship table, and this is the correct answer. It is because we are storing data that is related neither to a single student or class but the link itself between that single student taking that single class. In this case we are still in keeping with the idea of separating the table relationship and the data, because we are storing the data itself, the grade the student received in the class, as a separate column of data from the two foreign keys Class.ID and Student.ID, within the relationship table. Most of the time it is a simple problem like this that catches most of the applicants, I have seen, off guard during an interview, or a student attending a database management course.
















