Validating Data with Enhanced Declarative Data Integrity Rules

Validating Data with Enhanced Declarative Data Integrity Rules

Validating Data with Enhanced Declarative Data Integrity Rules

Visit us at our new Google Code project page: http://code.google.com/p/metadata-validator/

Problem Definition

Data validation and integrity assurance are complex subjects that have strong opinions tied to the way they are implemented. Some strongly feel that the database can handle validation using the column’s attributes (size, data type, etc) and TRIGGERS/STORED PROCEDURES can provide means to enforce complex business rules. This is true, but there are many caveats and arguments against this approach. Should a data source be performing logic? According to Oracle’s website :

Although triggers are useful for customizing a database, use them only when necessary. Excessive use of triggers can result in complex interdependencies, which can be difficult to maintain in a large application

Those in opposition to this approach feel that it is not the job of the database to perform logic, only to store data. So, who is right?
How should data validation be handled? Triggers or code? Honestly, both is usually the practice. The current implementations of Declarative Data Integrity Rules in the database marketplace give us simple rules to validate against (again: size, data type, nullable, etc) and if we don’t validate data in our code prior to insertion the database errors out. Though mindless and tedious, it is a straightforward task to code validation prior to executing SQL in your programming language.

I think that both opinions are correct, but both approaches are inadequate. Logic should be separated from data. However, rules about data stored in your database are really metadata. Why should programmers have to tweak code every time the table changes? Why should DBAs have to worry that programmers validation is insufficient and not have any control over the data integrity? I believe the database should dictate to us, in specific terms, what it is expecting. In addition, programmers and DBAs should be able to work together to formulate the data validation and that is should be in only one place.

Current Practice

Most of the time, a DBA will create a table with data type, length, unique and referential constraints. These rules are enforced by the database, programmers are handed back an error message which has to be parsed into a friendly error. In some shops, STORED PROCEDURES and TRIGGERS are used to validate data prior to insertion. As we pointed out above, this is not an encouraged practice and makes a data source a logic engine.

Let’s say we have a table defined as such:

id INT(8) NOT NULL,
name VARCHAR(40),
zip_code VARCHAR(10)

Programmers take the constraints defined above (i.e. “INT”, “NOT NULL”, 40, etc) and write up code validation for each of those columns. If DBAs alter the number of columns or the metadata of any one of the columns, the code needs to be updated accordingly. Conversely, if the programmers don’t properly validate the raw data in code, we could easily end up with “NOTAZIP” in our zip_code field. In this way, the technologies are seemingly blind of each other except for a tin-can phone that can pass simple messages back and forth. This is not a knock against the technology, they were made to be platform independent (in terms of interfacing) and not necessarily to know everything about each other. This is the price we have to pay for an exceedingly friendly means of storing and retrieving data.

Proposed Solution

Utilizing dynamic functions, JSON and metadata stored in column comments of my database I was able to validate data against any table with only one validation class (that isn’t very long!). I have yet to see a column or table comment be anything slightly useful (if anything at all). Up until two days ago, every time I saw the field to insert a comment in phpmyadmin I would think “I guess that’s a nice feature…”, and then move on. Then I realized that using (in MySQL) SHOW FULL COLUMN FROM {TABLE}, I can get that comment’s value back. And using my programming language (PHP), I could do the following:

$data = json_decode($comment);

I now am holding an object created from metadata to do what I will with.

The initial data I inserted into a column COMMENT was a minified representation of this:

'{	"insert_helpers": {
		"functions": {
			"func1":{
				"name":"strtotime",
				"params":{
					"param1":"+20 years"
				}
			},
			"func2":{
				"name":"str_replace",
				"params":{
					"param1":"!!",
					"param2":"!",
					"param3":"@this",
				}
			}
		}
	},
	"validators": {
		"maxlength":"10",
		"minlength":"2",
		"patterns":{
			"pattern1":{
				"pattern":"[^0-9]",
				"example":"This is data without numbers"
			}
		}
	}
}'

This was then validated with code very similar to this:

/**
*
* NOTE: I have not tested this code
* I typed directly it into notepad++ for your reading,
* not for executing, if it works, great!
*
*/
$json 				= json_decode($comment);
$helper_functions 	        = $json->{'insert_helpers'};
$validators			= $json->{'validators'};
$errors				= array();

//$row_value = $result['field'];
foreach($helper_functions as $function)
{
	$params = array();
	foreach($function->{'params'} as $param)
	{
		//insert row value
		if($param == '@this')
		{
			$param = str_replace("@this", $row_value, $param);
		}
		array_push($params, $param);
	}
	if(is_callable($function->{'name'})
	{
		$row_value = call_user_func_array($function->{'name'}, $params);
	}
}

if(is_object($validators))
{
	if(is_object($validators->{'minlength'})
	{
		if(sizeof($row_value) < $validators->{'minlength'})
		{
			array_push($errors, "$row_value is not long enough");
		}
	}
	if(is_object($validators->{'maxlength'})
	{
		if(sizeof($row_value) > $validators->{'maxlength'})
		{
			array_push($errors, "$row_value is too long");
		}
	}
	if(is_object($validators->{'patterns'})
	{
		foreach($validators->{'patterns'} as $pattern)
		{
			if(preg_match("/$pattern->{'pattern'}/", $row_value))
			{
				array_push($errors, "$row_value is not in correct format [example: $pattern->{'example'}");
			}
		}
	}
}

If you didn’t pick it up by reading through the code, these functions couldn’t care less what table you are dealing with and the same set of code works against any and all columns. In addition, the validation rules (JSON) can be handed to the client side code to perform validation prior to reaching the server. All major databases have a comment column and most modern programming languages allow behavior like the example I posted.

I will post my php class that facilitates this process soon.

Anxious to hear people’s thoughts.


About the Author

Rob McVey

I am a software developer/IT professional helping businesses save money through informed purchase consulting; website development and marketing; and process automation.