Chapter 9. DFsqlload: DFdiscover to Relational Database Tables

Table of Contents

9.1. Introduction
9.1.1. Overview
9.1.2. About DFsqlload
9.2. DFsqlload and Relational Database Concepts
9.2.1. Why Relational Databases?
9.2.2. Why is DFsqlload a one-way street?
9.2.3. Relational Database Concepts
9.3. Using DFsqlload
9.3.1. DFsqlload defaults - a quick tutorial
9.3.2. DFsqlload in Detail

9.1.  Introduction

9.1.1.  Overview

DFsqlload provides a convenient method for migrating data from the proprietary DFdiscover storage to storage in a relational database. Although this link is not maintained in real time, the relational side can be synchronized with the proprietary side whenever required. Data integrity is maintained in that only changes made to the data on the validated DFdiscover side are kept. These changes are reflected in the relational copy only after synchronizing with DFsqlload. DFsqlload can be scheduled using cron or run interactively as required. In this way, snapshots of the database at a known time can be generated and used for whatever purpose the user deems important, much in the same way as DFsas provides snapshots of the database in SAS format.

9.1.2.  About DFsqlload

DFsqlload is only useful in situations where one of the following four database products are being used or are planned on being used in a particular computing environment.

  1. MySQL - a common, multi-platform, open source relational database product.

  2. PostgreSQL - another multi-platform, open source relational database product that has more sophisticated features than MySQL, which may make it better suited for server environments.

  3. Oracle - commercial relational database product.

  4. Microsoft SQLServer - commercial relational database product.

All four products are network-aware, and provide native as well as JDBC and ODBC-based connectivity from client applications running on any platform.

9.2.  DFsqlload and Relational Database Concepts

DFsqlload is one of three data export utilities available in DFdiscover. DFsqlload provides an easy to use export facility for relational databases. DFexport.rpc is a flat-file data export utility, and DFsas provides DFdiscover export capability for the SAS environment.

All methods can be used at any time and produce a true equivalent to the data stored in DFdiscover at the time the export is performed. This chapter will focus on how this is accomplished with relational databases.

9.2.1.  Why Relational Databases?

There are many applications written for relational databases for many purposes. The two main applications benefiting from DFsqlload are reporting and analysis. Writing custom reports for DFdiscover is possible, but can take effort compared to the many report-writing and spreadsheet applications with relational database connectivity on the market today. For Windows users, there is Microsoft Office, Crystal Reports, Visual Basic, to name just a few of the applications available. UNIX users have OpenOffice, many java-based report writers, and Perl at their disposal. Mac users in most cases can pick the best from both worlds. There are hundreds of applications out there, some of which you may already be using.

9.2.2.  Why is DFsqlload a one-way street?

DFdiscover is a validated system. The ways to get data into DFdiscover have been rigorously tested and proven accurate. Providing for one-way movement of data out of DFdiscover allows DFdiscover to maintain the integrity of the DFdiscover database by limiting the ability of outside processes to compromise the validity of the system.

9.2.3.  Relational Database Concepts

Plates: DFdiscover keeps all data from a given form type together as one record type or plate. There may be, for example, blocks of information on a particular plate that repeat and would be better modeled with separate tables. DFsqlload makes no attempt to do this. When DFsqlload is run, each plate becomes its own relational table.

Fields: DFdiscover plates are made up of fields, each field storing data according to the layout of the plate on the paper CRF form. When a DFdiscover study is setup, you will recall that each plate is imported into the setup tool as a PS or PDF representation of the printed page that will ultimately be used to collect data for the study you are designing. The same fields defined during DFdiscover setup are used to define column names for the relational tables DFsqlload will create. There are some differences in the rules for naming columns in relational tables, but these differences are handled by DFsqlload as it has its own set of rules for doing so. DFsqlload provides two options for the creation of tables. One is to export all data as if it were character string data. The other option is to create columns of the same data type as the source fields.

Coding: DFdiscover has the ability to use multiple missing value codes for different purposes. Data values with corresponding labels may be used to represent different discreet conditions. Dates may be incomplete and as such, have rules for imputation. None of these concepts are an integral part of relational databases and require different ways to handle these situations as they arise. DFsqlload provides options for passing both coded values and labels to the relational side.

Schemas and Tablespaces: DFdiscover keeps the information for any given study together under one study number. There is no sharing of data between studies, so in cases, for example, where two studies share the same sites, the sites database is duplicated for the two studies. Each study database is independent of the others. Some relational database products allow for multiple groups of potentially similar data in the same database through use of the concept of the database schema. DFsqlload is aware of these possibilities and supports multiple schemas if the relational database product supports them. Tablespaces provide for another layer of hierarchy in data modeling and are supported as well.

9.3.  Using DFsqlload

9.3.1.  DFsqlload defaults - a quick tutorial

It is very simple to create a relational database equivalent to a DFdiscover study once a working relational database environment is established on your network. On the relational database side, you will need to know the following:

  • What relational database product am I using - MySQL, PostgreSQL, MS SQLServer or Oracle

  • What is the name of the server hosting the relational database

  • In the case of Oracle, what tablespace has been assigned to me for this purpose by my DBA.

  • In the case of Oracle and Postgres and MS SQLServer, what database will be used to store my DFdiscover schemas

  • What are my login credentials for the relational database server I am using.

With the answers to these questions, simply run DFsqlload as shown and a relational database equivalent to your DFdiscover study will be ready to use. Using Oracle (the default) as an example, the command would be:

% DFsqlload bluto:mystudies:foo.bar:olive:popeye 254

which will create a snapshot of study 254 in the mystudies database, in the foo schema with the tablespace name bar on the server bluto with user credentials username olive and password popeye.

By default, all columns are typed as closely as possible to the data types used by DFdiscover during setup. All dates are imputed. No coded values are translated. No missing codes are translated. No optional tables are created. If these defaults are acceptable, then this is all you need to know.

9.3.2.  DFsqlload in Detail

The default settings for DFsqlload should be adequate for most users. However, if the purpose of using DFsqlload is to provide a relational database view of a study to a set of specialized users with different requirements, then you might want to get as much information as possible from the DFdiscover side to the relational database side. For this scenario, you will need to use some of the options DFsqlload offers.

9.3.2.1.  DFsqlload Options

Answers to the following questions will affect how DFsqlload is used. See also the reference page for DFsqlload elsewhere in this guide.

Q: What is the target database type?
Q: Is it important to have the data type of each column in my relational tables match as closely as possible the data types for each field in my DFdiscover plates?
Q: My DFdiscover setup makes extensive use of codes and value labels for these codes. How can I make this information accessible from relational tables?
Q: My DFdiscover setup includes subject aliases. How can I access this information from relational tables?
Q: In my DFdiscover setup, I have defined a number of missing value codes that are important for correct interpretation of the data. How do I make those codes available in relational tables?
Q: How are dates handled by DFsqlload?
Q: How do I find out if something went wrong?
Q: Can I add tables to my SQL database outside DFdiscover?
Q: What if the DFdiscover study schema is changed?
Q: What happens if I modify the definition of one of the SQL tables used by DFdiscover?
Q: How does DFsqlload update my SQL tables?

Q:

What is the target database type?

A:

DFsqlload supports four popular SQL server platforms - Oracle, PostgreSQL, MS SQLServer and MySQL. Your target database will need to be one of these four types. Once this is known, DFsqlload is directed to create a set of relational tables for a given database type using the -flavor option. If the target database type is Oracle, then this option is not required as DFsqlload assumes Oracle by default. Otherwise,

  • Target is a PostgreSQL database - use -flavor postgresql

  • Target is a MySQL database - use -flavor mysql

  • Target is a MS SQLServer database - use -flavor mssql

  • Target is an Oracle database - omit this option, or use -flavor oracle

Q:

Is it important to have the data type of each column in my relational tables match as closely as possible the data types for each field in my DFdiscover plates?

A:

The default behavior of DFsqlload is to preserve data typing. The older version of DFsqlload that was shipped with DFdiscover 3.7 and 3.7.001 did not preserve data typing and all user fields were converted to type VARCHAR. If you want to retain this behavior, then you will need to use the -type option or the applications you have written that use any of the tables created by these older versions of DFsqlload will probably not work. If you are writing new applications, but want your old applications to work as well, you can get DFsqlload to create both typed and untyped tables in your target relational database.

To preserve Data Typing - omit this option or use -type typed To provide both typed and untyped tables - use -type both To provide just untyped tables - use -type untyped

Q:

My DFdiscover setup makes extensive use of codes and value labels for these codes. How can I make this information accessible from relational tables?

A:

Relational tables rely on other relational tables when a code has a corresponding label and it is the label that you want to report. Rather than create a separate table for every coded variable, DFsqlload offers two options which may be used together or separately depending on the specific requirements. The -coding option controls the inclusion of label data. By default, you get just the codes. You can create a column for the code and a column for the label corresponding to that code using the option -coding both. If all you want is the labels, use the option -coding label. The other way to get value/label into your relational tables is to create the optional DFCODING table. This is done using the -table dfcoding option. This optional table contains all codes and labels for all DFdiscover fields with type CHOICE. See the DFsqlload reference page for more details.

Q:

My DFdiscover setup includes subject aliases. How can I access this information from relational tables?

A:

Use the -table dfsubjectalias option to request creation of the optional DFSUBJECTALIAS table containing two columns, DFpid and DFalias. Thereafter it is an easy SQL join statement to include the subject alias together with the subject id.

The DFSUBJECTALIAS table is also created by default if subject aliases are defined when loading of all tables is requested.

Q:

In my DFdiscover setup, I have defined a number of missing value codes that are important for correct interpretation of the data. How do I make those codes available in relational tables?

A:

How missing codes are handled depends upon the -type option. If the target tables are untyped, then by default, codes are output in the relational tables as is. To get the labels corresponding to a missing value code, use the option -missing label. If the target tables are typed, then all missing codes are converted to NULL and logged as such, regardless of how the -missing option is used. If the -table dfnullvalue option is used, a record of each substitution is created in the optional DFNULLVALUE table.

Q:

How are dates handled by DFsqlload?

A:

By default, DFsqlload creates date columns using the correct data type for the target system. A string representation of a date can be output in a separate column using the option -date both. If just the string representation is required, use the -date untyped option. Partial dates (i.e., dates where the day or month are missing) are imputed by default, according to the rules specified in the DFdiscover setup. If imputed dates are not desired, you can turn it off using the -noimpute option, in which case any partial dates will be converted to NULL.

Q:

How do I find out if something went wrong?

A:

DFsqlload writes extensive logging information. When DFsqlload encounters a problem, the problem is written to stderr by default, unless overridden with the -q option. In either case, problems are logged in the DFsqlload log file for a given run. If DFsqlload encounters problems with DFdiscover data, it replaces the problem data with a NULL value, writes the substitution to the log file and optionally creates a record for the substitution in the DFNULLVALUE table. The following identifies typical problems and how they are handled by DFsqlload.

  • Any field that is blank or contains only white space (space, tab) will be converted to a NULL. These substitutions are not logged.

  • All missing value codes are converted to NULL if the target tables are typed (the default). This is applied consistently, even if the missing value code happens to be legal for some field types.

  • Any value wider than the storage width defined for the field in the DFdiscover schema is converted to NULL.

  • If a field has a format defined in the DFdiscover schema, values are checked for adherence to the format and are converted to NULL if they do not conform.

  • Invalid dates are converted to NULL.

  • If date imputation is not defined in the DFdiscover schema, partial dates are converted to NULL. Any imputed dates that are not legal dates are also converted to NULL.

  • If a check or choice box contains an undefined code, it is converted to NULL.

If you use the -d drfname option, a .drf file will be created containing a reference to each DFdiscover record having one or more non-blank substitutions to NULL.

Complete records will be rejected if the following conditions are encountered.

  • the record does not contain the correct number of fields

  • any of the DFdiscover fields are blank or invalid

  • a record with the same keys has already been imported

These cases will also appear in the .drf file if the -d drfname option is used.

Q:

Can I add tables to my SQL database outside DFdiscover?

A:

Yes. DFsqlload will ignore them.

Q:

What if the DFdiscover study schema is changed?

A:

DFsqlload recreates the tables it needs each time it is run. Be careful when changing DFsqlload program options from run to run. SQL tables created from a previous run are not recreated if they are not required by the current run.

Q:

What happens if I modify the definition of one of the SQL tables used by DFdiscover?

A:

DFsqlload will drop the table (and all of your changes) and recreate it from the current DFschema and data in the DFdiscover study database.

Q:

How does DFsqlload update my SQL tables?

A:

If there have been no changes to the data definitions, the data is dropped from the SQL table and reloaded from DFdiscover. This occurs even if there have been no data changes for that DFdiscover plate since the last time DFsqlload was run.