databene

 
  • Increase font size
  • Default font size
  • Decrease font size

DB Sanity File Syntax

Check Files

The checks are defined in XML files. You can distribute the checks over several XML files and use them all by specifying a common parent folder as DB Sanity input (using -i <folder name>). 

A simplistic definition file with a simplistic check:

<?xml version="1.0"?>
<dbsanity ignorableTables=".*_JN|VW_.*">

    <check name="familiy name required" table="person" defectType="completeness">
        <notNull expression="family_name" />
    </check>

</dbsanity>

The root element of a DB Sanity file must be <dbsanity>.

Using its ignorableTables attribute, you can specify a regular expression for tables that should be ignored by DB Sanity's meta data parser in order to achieve faster execution times. If you provide a file dbsanity.xml, it plays a special role for configuring some global DB Sanity settings: Their ignorableTables configuration is applied to the initial database import and the global reports of DB Sanity. The regular expression in the example filters out tables with names that start with VW_ or end with _JN.

 

Required Check Elements

Checks are defined in a <check> element and must have a name specified. The name is intended to be a short intuitive and unique identifier of the check.

You must provide a table and defectType attribute for supporting aggregation of defects by different categories and get a better insight which parts of your database suffer from which kinds of defects. You are free to define and use any categorization of defectTypes that is useful for you. The defectType name may contain only letters, underscores and numbers. Currently, DB Sanity accepts leaving them out (for assuring backwards compatibility) but future versions will not.

 

Optional Check Elements 

The description attribute allows you to describe the rule and to refer to specification or requirements documents.

An author attribute allows you to declare, who has implemented the check and can be asked in case of questions or doubts.

A tags attribute allows you to add one or more labels to a check which can be used to limit data verification to certain domains, tasks or check categories. You are free to define and use any tags which is useful for you. A tag name may contain only letters, underscores and numbers.

A requirement attribute allows you to refer to a requirements specification. You are free to chose a format for your work. When using DB Sanity in a team, define a common requirement specification format and make sure that each member has the same understanding of it. Be aware that section and page numbers change more frequenty than section titles. For a specification document, you could use <document name>-<section name>-<requirement name>, for a requirements management software system, it is recommendable to use the technical ID of a requirement.

<check name="familiy name required" table="person" defectType="completeness"
        description="Since each person must have a family name, the corresponding column value must not be null"
        author="vbergmann" tags="import,mod_person" requirement="person.family_name.1">
    <notNull expression="family_name" />
</check>

 

Requirement types

Two types of requirement definitions are available:

  • Constraint: An abstract requirement of which DB Sanity figures out, how to verify a database's data. Typically, a constraint can easily be defined, also by users with little database experience and covers some predefined requirement types like foreign keys, uniqueness, value ranges and more. 
  • Defect Query: A SQL query which finds database entries that are not valid. This gives the user the full power of SQL for checking arbitrarily complex requirements, but requires good SQL skills.

Constraint

<foreignKey>

The <foreignKey> check lets you verify foreign key relationships. This is useful if a table column is not defined as foreign key in a database, but in fact has the requirement to be one. This can occur in several scenarios:

  • A migration is pending by which the column will become a foreign key
  • The column is refering to a column which is not not unique or primary key (databases deny such a constraint, but developers do not necessarily)
  • A table column is a foreign key, but for whatever reason (don't argue, I've encountered this), some developer decided to not create a foreign key constraint in the database

Syntax: A foreign key is specified by the referer table (table), the referrer column(s), the referee table and the referee column(s). If the foreign key consists of several table columns, they are listed as a comma-separated list:

<check name="employee-dept-fk" table="employee">
    <foreignKey refererColumns="company_id,dept_id"
        refereeTable="department" refereeColumns="company_id,dept_id" />
</check>

 

<unique>

<unique> defines a uniqueness requirement on a single column or a group of columns. It can be useful in these scenarios:

  • A migration is pending by which a column or a group of columns will be required to be unique
  • An application requires a column (or column group) to be unique, but the database does not

Syntax: A unique key is specified by the table and the column(s) it comprises. For a composite unique key, the columns are listed as a comma-separated list:

<check name="unique names" table="product_names">
    <unique columns="namespace,name" />
</check>

 

<notNull>

<notNull> requires a column value to be different from null in all table rows. It can be useful in these scenarios:

  • A migration is pending by which a column will be required to be different from null
  • An application requires a column to be different from null, but the database does not

Syntax: A not null constraint is specified by the table and an expression (usually the column name) it refers to:

<check name="default product name specified" table="product">
    <notNull expression="name" />
</check>
 

<stringLength>

A <stringLength> requirement requires a string column's values to have a minimum and/or maximum length. It is useful if/because:

  • A migration is pending by which a string column length will be reduced
  • An application requires a shorter column length than the database
  • A column's values are required to have a minimum length, but the database does not support such a constraint

Snytax: A string length constraint is specified by the table and an expression (usually the column name) it refers to. As constraint values, a minLength and/or maxLength can be applied:

<check name="default product name length" table="product">
    <stringLength expression="name" min="2" max="20" />
</check>

 

<trim>

For most string columns (char, (n)varchar(2), (n)clob, xml,...), leading or trailing whitespace does not make sense or even causes problems. For example think of a search feature that searches names by the initial letters and the actual values accidentally contain leading spaces... you will hardly find the serached name. Use a <trimmed> check to assure valid values for such columns.

Syntax: The column attribute specifies which column to check. An optional reportColumns attribute allows to explicitlyselect which table columns are listed in defect reports:

<check name="user name trimming" table="user" …>
    <trimmed column='name' />
</check>

 

<numberSize>

A <numberSize> requirement can be applied to verify that all data lies within a numeric range (min-max) or has certain number size characteristics (totalDigits, fractionDigits). It is useful if/because:

  • A migration is pending by which a number range or texhnical size will be reduced
  • An application has more restrictive requirements than the database
  • A column's values are required to have a minimum and/or maximum value, but the database does not support such a constraint

Syntax: A number size constraint is specified by the table and an expression (usually the column name) it refers to. As constraint values, min and/or max values can be specified, or a combination of totalDigits and fractionDigits:

<check name="price precision" table="product_price">
    <numberSize expression="price" totalDigits="10" fractionDigits="2"/>
</check>

<check name="delivery time" table="product_stock">
    <numberSize expression="delivery_time" min="1" max="14"/>
</check>

 

<sql>

A <sql> element must provide a SQL query that checks for a defect type and returns one result row per defective table entry. The query should return the primary key (components) in the first column(s) and data that describes the individual defect in the subsequent columns. Generally a 'select *' query is not advisable: It does not improve readability but can significanty slow down execution, if a lot of column data has to be transferred.

Example:

<check name="familiy name required" table="person" defectType="completeness">
    <sql>select id, title, first_name from person where family_name is null</sql>
</check>

 

<values>

A column (or a single-value expression) can be required to contain only values from a predefined set. 

For simple cases, a comma-separated list can be used,

<values expression="rank" list="1, 2, 3" />

for more complicated ones an XML structure:

<values expression="rank">
    <value>1</value>
     <value>2</value>
    <value>3</value>
</values> 

Note that character and string values must be quoted:

<values expression="category" list="'alpha', 'bravo', 'charly'" />

 

<pattern>

A table columns' values can be required to match a regular expression:

<pattern expression='code' regex='[A-Z]{1,3}' />

or a SQL 'like' pattern:

<pattern expression='code' like='MZ%' />

A <pattern> may only provide a regular expression or a 'like' pattern, not two at the same time.

If a database supports regular expressions, the regex check is performed on the server side. Other database systems are supported by checking the regular expression on the client side. When performing client-side checks, be aware that they may take significantly more time, since all database rows need to be transferred from the database to the client in order to be checked.

'like' patterns are always verified on the server side.

 

<rowCount>

A table's row count can be required to match a given minimum and/or maximum value:

<rowCount min="23" max="48"/>

In order to require that a table is not empty, use a min value of 1:

<rowCount min="1" />

By setting min = max, an exact value can be required.

<rowCount min="23" max="23" />

 

<cardinality>

On many-to-one associations, the cardinality of the 'many'-side can be verified using the <cardinality> element.

For a 1:1..3 association, this would be:

<cardinality refererColumns="ref" min="1" max="3"> 

Composite-key references are handled by using a comma-separated list of column names:

<cardinality refererColumns="ns_ref, name_ref"
    min="1" max="3">

If the foreign key reference is not defined as a database constraint, then the user must specify additional information: the referee table name and, if the relation does not point to the referee table's primary key column(s), the refereeColumns:

<cardinality refererColumns="ns_ref, name_ref"
    refereeTable
="product" refereeColumns="ns, name"
    min="1" max="3">

 

<validator>

The user may now use custom Java components to perform checks on the client side. They must implement the Databene interface org.databene.commons.Validator or the Java BeanValidation interface javax.validation.ConstraintValidator.

Validator definition and configuration is done the usual JavaBeans way, using a default constructor and JavaBeans properties:

<validator expression='name' class=InitialsValidator'>
   <property name='initials' value='X' />
</validator>

An appropriate implementation would look like this:

public class InitialsValidator extends AbstractValidator<String> {
      
    public String initials;

    public void setInitials(String initials) {
        this.initials = initials;
    }

    public boolean valid(String value) {
        return value == null || value.startsWith(initials);
    }

Note: Custom validator classes should return true for null values in order to work properly for nullable columns. For null checking, there is a <notNull> check.

Database data can also be verified with multi-column validators. In this case, the validator class must define a valid() method which takes an array as argument:

public void valid(Object[] data)

The array type does not necessarily have to be Object[]; DB Sanity tries to map the actual database values to the type needed in the validation method.

For applying a multi-column check, the configuration in a DB Sanity XML file must provide the columns as a comma-speated list in the order in which the validator expects it

<validator columns="bank_code, account_no" />

 

<data>

Database data can be verified against reference data sets in CSV or Excel(TM) files. The reference data can be defined as the exact set, a required sub set or a super set of the actual database data.

<check name="required users" table="user" …>
    <data referenceIs="set">
        <testee columns="name, password" />
        <reference file="required_users.csv"/>
    </data>
</check>

This example expects the user data to have exactly the entries specified in the file with exactly the same values for the 'name' and 'password' columns. The values of the other columns do not matter.

The referenceIs attribute can be one of the following values:

  • set: This requires the database to have exactly the specified values
  • subset: This requires the database to contain all entries of the reference and allows it to have additional entries
  • superset: This restricts the database data to values defined in the reference, allowing it to have less entries

if the CSV file has more columns than the listed testee columns or if its column names or order differ, reference column names can be specified too:

<check name="required users" table="user" …>
   <data referenceIs="set">
        <testee columns="name, value" />
        <reference file="required_users.csv" columns="user_name, user_pwd"/>
    </data>
</check>


<paths>

In many cases there are table columns with redundant references, intended to improve performance of special critical queries. For example, imagine a shop model where one customer can have multiple orders and each order consists of multiple order_items. In order to to provide an efficient query that sums up all order_items bought by a customer, the designer decided to replicate the customer reference of the order in the order_item. So you have alternative paths to reference the customer form the order_item: [order_item(customer_id) -> customer(id)] and [order_item(order_id) -> order(id)/order(customer_id) -> customer(id)] and for a correct summary of a customer's order_items, aggregations over bot paths need to be the same, right?

You can assure this using the <paths> element:

<check name="order_item-customer reference paths" table="order_item">
    <paths
        reference="order_item(customer_id) -> customer(id)
        testee="order_item(order_id) -> order(customer_id) -> customer(id)"/>
</check>

 

Please note: In order to work properly, the relations need to exist in form of foreign key constraints in the database.

For other functionality, you may require the ends of two different paths to be different (e.g. four-eye principle). In such cases, you can add an attribute equal="false":

<check name="application approver identities" table="application">
    <paths
        reference="application(approver1_id) -> user(id)"
        testee="application(approver2_id) -> user(id)"/>
        equal="false"
</check>

 

 

Recommended further reading: Environment Files.