DFsqlload

DFsqlload — Create table definitions and import all data into a relational database

Synopsis

DFsqlload [-flavor oracle|postgresql|mysql|mssql] [-d drfname] [-q] [ [-ignore_mssql_priv] | [-ignore_mysql_priv] ] [-type typed|untyped|both] [-coding code|label|both] [-missing code|label] [-date typed|untyped|both] [-noimpute] [-missed] [-table all|DFcoding,DFnullvalue,DFsubjectalias]] {param} {study}

Options

-flavor oracle|postgresql|mysql|mssql

Type of target database. The default is -flavor oracle.

-d drfname

A DFdiscover retrieval file to use to record problems encountered during data import.

-q

Quiet mode. Instructs the program to suppress all warning messages. The default, without this option, is to write warning messages to standard error.

-ignore_mssql_priv|ignore_mysql_priv

Ignore administrator privileges. For MS SQLServer or MySQL, allows tables to be loaded without requiring administrator privileges on the database.

-type typed|untyped|both

Type of SQL tables to be created. If set to typed (or by default) then tables use field data types as defined in DFschema. Table names used are DFTABLE_### for regular plates, DFQC for plate 511 and DFREASON for plate 510. The field names use unique names for user data fields (8 ~ NF-3, where NF is the total number of fields defined for a given plate), and generic names for DFdiscover fields. If set to untyped, all user data fields for regular plates are converted to VARCHAR. Table names used are DFPLATE_### for regular plates, DFQC for plate 511 and DFREASON for plate 510. The field names use unique names for user data fields, and generic names for DFdiscover fields. If set to both, both typed and untyped tables are created for regular plates. Only one DFQC for plate 511 and one DFREASON for plate 510 are created. Both of these tables adhere to typed rules.

-coding code|label|both

Coded field specification. If set to code (or by default), then there is no translation for coded fields. if set to label, then the label corresponding to a given code is imported, including QCs and REASONs. If no label is defined for a given value, then the value is used. If both are requested, then two columns are imported for the coded field, one for the value and the other for the corresponding label.

-missing code|label

Missing value specification only applies to untyped tables. If set to code (or by default), then no translation is done for missing codes. If set to label, then codes are translated to their corresponding label. If no label is defined for a given code, the code itself is imported. If the target tables are typed, missing value codes are automatically replaced with NULL values in all cases.

-date typed|untyped|both

Date type specification for user date fields. If set to typed, then the true type defined in DFschema for dates is used (the default setting). If set to untyped, then dates are converted to strings and imported as type VARCHAR. If set to both then two columns are imported for the date field, one for the typed date and one for the untyped string representation.

-noimpute

No imputation for partial dates replaces partial dates with NULL values for typed dates only. The default is to impute partial dates for typed dates according to the imputation method specified in DFschema.

-missed

Include missed records from the database in the output. Typically missed records are not included as they contain no actual data.

-table all|DFcoding,DFnullvalue,DFsubjectalias]

List of optional tables to create. If set to DFcoding, the optional table DFCODING is created. Value and label pairs are imported to this table. If set to DFnullvalue, the optional table DFNULLVALUE is created. All problem fields that are converted to NULL are imported to this table. If the option -d drfname is specified, this table is created by default. If set to DFsubjectalias, the optional table DFSUBJECTALIAS is created. Columns DFpid and DFalias are imported to this table. If set to all, then all optional tables are created and imported.

Required Options

The following two options are required and must appear in order at the end of the option list:

param

A set of parameters of the form server:database:schema[.tablespace][:username:password]. The value of these parameters vary depending on the target database, and are detailed in Table 3.12, “Database Parameters”.

study

the DFdiscover study number, from which data records are to be exported.

Table 3.12. Database Parameters

ParameterDescriptionPostgresMySQLOracleMS SQLServer
serverThe server name to connect torequiredrequireda place holder. Oracle will lookup tnsnames.ora based on database namerequired
databaseThe database namerequiredsee Schemarequiredrequired
schemaThe DFdiscover study namerequiredrequiredrequiredrequired (must be the same as the database name)
tablespaceThe alternative storage tablespace for Oracleignoredignoredsee Schemaignored
username:passwordThe database login user name and passwordoptional.
  1. If both username and password specified, login as specified username with specified password.

  2. If only username specified, login as specified username, lookup the the specified user's password from file ~/.pgpass.

  3. If neither username nor password specified, lookup OS user's name and OS user's password from file ~/.pgpass.

  4. If only password is specified, this is an error.

optional.
  1. If both username and password specified, login as specified username with specified password.

  2. If only username specified, discard the specified username and using OS user's name and lookup OS user's password from file ~/.my.cnf. The same applies if neither username nor password are specified.

  3. If only password is specified, this is an error.

optional.
  1. If both username and password specified, login as specified username with specified password.

  2. If only username specified, discard the specified username, using OS user's name and lookup the password from file ~/.orapass identified by database name. If ~/.orapass does not exist or cannot find the password, use Oracle external credential. The same applies if neither username nor password are specified.

  3. If only password is specified, this is an error.

[Note]Oracle external credentials

The user must have an OS account. In the Oracle initialization file, parameter initDB_NAME.ora must be set as follows

remote_os_authent = true

optional.
  1. If both username and password are specified, login as the specified username with the specified password.

  2. If only username is specified, login using the OS user's name and lookup the OS user's password from the file ~/.mssqlpass.

  3. If neither username nor password is specified, login using the OS user's name and lookup the OS user's password from the file ~/.mssqlpass.

  4. If only password is specified, login using the OS user's name and the specified password.


Exit Status

DFsqlload exits with one of the following statuses:

0

DFsqlload was able to (re-)create the schema, create all of the tables, and import all of the data without error.

1

The schema and tables were created and the data were imported, but one or more errors were encountered in the imported data.

2

A more serious error was encountered which prevented some or all data from being imported.

Description

DFsqlload is a command-line solution that creates all of the table definitions and imports all of the data into a relational database. One SQL table is created for each DFdiscover plate. In addition, three tables are added to record logging information, qc and reason for change data. Optional tables are created to note any problem fields or store value and label pairs for coded fields.

When run repeatedly, existing SQL tables are compared with the current DFschema file. Unchanged tables are truncated i.e., the data is removed but the table definitions remain. If any changes were made, the existing DFdiscover-defined table is dropped and re-created. DFsqlload calls DFexport.rpc to export all primary records to a temporary file, and then loads the database tables plate by plate.

SQL Database Setup

The following operations are performed on the target database.

  1. At least one and optionally three meta tables are created in the target database.

    1. DFSQLLOAD: Each DFsqlload run creates an entry in this log table. This table is also the lock table that prevents more than one DFsqlload process from operating on the same schema. A NULL value for column DFFINISH indicates that a DFsqlload process is running. If DFsqlload terminates abnormally, this record must be removed manually before starting a new DFsqlload process. The following SQL statements are used internally to create this table. The statements use syntax specific to PostgreSQL. Similar tables are created for MySQL, MS SQLServer and Oracle implementations of DFsqlload.

      CREATE TABLE <schema>.dfsqlload
      (
        dfuser varchar(30) NOT NULL,   -- the database user
        dfstart timestamp(0) NOT NULL, -- the start date and time
        dffinish timestamp(0),         -- the finish date and time (NULL if process running)
        dfoption varchar(500),         -- the options used with the DFsqlload command
        dfnull int4,                   -- the total number of problem fields converted to null
        dferror int4,                  -- the total number of records discarded or rejected
        dfstatus int2,                 -- the status: 0=process completed, 1=running
        CONSTRAINT dfsqlload_pk PRIMARY KEY (dfstart)
      ) WITH OIDS;
    2. DFNULLVALUE: This is an optional table where all problem fields that are converted to NULL are recorded. The following SQL statements are used internally to create this table.

      CREATE TABLE <schema>.dfnullvalue
      (
        dfpid int4 NOT NULL,            -- study subject id
        dfplate int2 NOT NULL,          -- plate
        dfseq int4 NOT NULL,            -- sequence/visit number
        dftable varchar(30) NOT NULL,   -- name of sql table where substitution was made
        dffield varchar(30) NOT NULL,   -- name of sql field where substitution was made
        dfvalue varchar(###),           -- the original value exported from DFdiscover
        dfproblem varchar(###),         -- the reason the value was converted to NULL
        dfraster varchar(12),           -- the image id of the source CRF page
        CONSTRAINT dfnullvalue_pk PRIMARY KEY (dfpid, dfplate, dfseq, dftable, dffield)
      ) WITH OIDS;

      The ### above represents the maximum length encountered in the data source.

    3. DFCODING: This is an optional table where all value and label pairs for coded fields are stored. The following SQL statements are used internally to create this table.

      CREATE TABLE <schema>.dfcoding
      (
        dfplate int2 NOT NULL,          -- plate
        dffield varchar(30) NOT NULL,   -- sql table column name
        dfcode varchar(###) NOT NULL,   -- code value for this column
        dflabel varchar(###),           -- code label for this column
        CONSTRAINT dfcoding_pk PRIMARY KEY (dfplate, dffield, dfcode)
      ) WITH OIDS;

      The ### above represents the maximum length encountered in the data source.

  2. Verify any existing SQL tables. The following tasks are performed in the verification of existing SQL tables.

    1. All tables with the prefix DF are treated as DFsqlload-defined tables. For any given run of DFsqlload, only the tables relevant to the current job are used.

    2. Existing SQL table definitions are compared to the DFschema file. DFsqlload will truncate any unchanged tables and re-create changed tables. Changes that may cause DFsqlload to drop a table are as follows:

      • add, delete, reorder fields

      • rename field

      • change data type

      • any change to field size or format causing storage changes in the target database

      • missing code or label changes

      • code or label changes to coded fields

      • SQL tables that have been directly user-modified

    3. All privileges (if any) for tables create by DFsqlload are backed up and restored.

    4. If other tables reference any tables created by DFsqlload, the foreign keys will either be dropped (Postgresql, MS SQLServer) or disabled (MySQL, Oracle) and saved to the log file in the form of a database-specific SQL statement. The user can choose to execute these statements to restore the foreign keys manually, as DFsqlload does not do it automatically.

    5. Other database objects - triggers, views, procedures, etc., which depend upon DFsqlload-defined tables are ignored.

Files used by DFsqlload

All DFsqlload created files are in the study/working_dir/DFsqlload_logs directory by default. If this directory does not exist, DFsqlload will create it at the default location. The file names use time stamp in the format yymmdd_hhmiss as a prefix, where mm is two-digit month and mi is two-digit minute. The time stamp, which is also the value of DFSQLLOAD.DFSTART, is the unique identifier of this DFsqlload process.

  1. Log file: yymmdd_hhmiss.log This file records the detailed loading progress including schema setup, SQL table definitions, record count, and rejected records with error messages in the following formats:

    Regular plates: Record (id, seq, plate, image): error message

    QCs and REASONs: Record (id, seq, plate, image, field): error message

  2. DFdiscover Retrieval File: If path is included, this file will be in the specified location. If the file already exists, the existing file will be removed. If the file cannot be created or written, DFsqlload will report an error and continue the loading process. The file is in standard DRF format. The first four fields are Id, Visit, Plate, and Image. The combination of id, visit, and plate is unique. The 5th field records the list of problems. The record level errors appear first, followed by the field level problems. The field level problems are derived from the DFNULLVALUE table, which is created by default if a .drf file is requested. Problem types are summarized below.

    1. Record level errors

      • error - incorrect number of fields: The number of fields of a record in plt###.dat does not match the DFschema definition.

      • error - invalid DFdiscover field: The value of DFdiscover leading (1~7) or trailing (NF-2~NF) field is blank or invalid.

      • error - duplicate primary record: Two or more records have the same id, seq, plate combination.

      • error - rejected by database: Any field error that was not identified by DFsqlload, but rejected by the database.

    2. Field level problems

      • missing value

      • too wide

      • bad format

      • invalid date

      • partial date

      • undefined code

      • data/type conversion

      The format used in the .drf file is Table1Name: FieldName (problem), FieldName (problem),..., Table2Name: F ieldName (problem),...

    3. QC or REASON specific error

      • error - primary record was rejected.

  3. Data file:  yymmdd_hhmiss_plt### (for regular plates)

    yymmdd_hhmiss_DFreason (for plate 510)

    yymmdd_hhmiss_DFqc (for plate 511)

    This is the data source of current loading plate created by DFexport.rpc, and is removed automatically after the data is loaded into the SQL database.

  4. Error files:  yymmdd_hhmiss_plt###.err (for untyped table)

    yymmdd_hhmiss_tbl###.err (for typed table)

    yymmdd_hhmiss_DFreason.err (for plate 510)

    yymmdd_hhmiss_DFqc.err (for plate 511)

    These files record the records rejected by the SQL database along with database generated error messages.

  5. Rejected keys: yymmdd_hhmiss.tmp.  This file records the keys (plate, seq, id) of primary records discarded by DFsqlload or rejected by the SQL database. If a REASON or QC record matches one of the key combinations in this file, that REASON or QC record will not be loaded into the SQL database and a message will be written to the log file:

    Record (id#,seq#,plate#,image,field#): discarded by DFsqlload.

    DFRECORD (error - primary record was rejected).

    This file is removed automatically.

Database login credentials files.  The following database-specific files may be used to store login credentials.

  • Postgres: ~/.pgpass.  This is a Postgres standard file located in the user's home directory and the file permission must be 600; otherwise, Postgres will ignore it. This file specifies the database login credentials in the format:

    host:port:database:username:password

    The username can be any valid database user, for example:

    parkcity:5432:test_db_name:user_a_name:user_a_password
    parkcity:5432:test_db_name:user_b_name:user_b_password

  • MySQL: ~/.my.cnf.  This is MySQL standard file located in user's home directory and the file permission should be 600. This file specifies the program groups and options for each group. A typical group is [client]. In this group the database password for OS user can be specified:

    [client]
    password=my_pass
    ...

    The global options can be specified in the global option file /etc/my.cnf or DATADIR/my.cnf.

  • Oracle: tnsnames.ora, ~/.orapass.  tnsnames.ora is the Oracle standard net services file located in ORACLE_HOME/network/admin directory by default. Oracle client will lookup net service names from this file. This file can also be in ANY_DIR/network/admin and the environment variable ORACLE_HOME is set to ANY_DIR.

    ~/.orapass is NOT an Oracle standard file. It is provided for convenience for DFdiscover users to store their database password. This file should be in the user's home directory and file permission should be 600. The format is:

    db_name:password

    for example,

    test_db_name:test_db_password
    production_db_name:production_db_password

  • MS SQLServer: ~/.mssqlpass.  This is not an MS SQLServer standard file. It is provided as a convenience to DFdiscover users for storing their database password. This file needs to be kept in the user's home directory and the file permission needs to be 600. The format of this file is:

    server_name:password

    There are two types of entries that can be used. Both the server name and the password to use can be specified as follows:

    my_server:my_server_password

    or a password can be specified for any server this user connects to as in the following example:

    *:any_server_password

Tables

The following details apply to tables and fields created by DFsqlload.

  • Table Names.  Table names follow these rules. DFSQLLOAD is a log table and is used to store the details of a given DFsqlload for a particular schema. The table for plate 510 is named DFREASON. The table for plate 511 is named DFQC. All other plates are named DFPLATE_nnn (untyped) or DFTABLE_nnn (typed), where nnn is the plate number. Optional tables created are DFNULLVALUE for the storage of any problem field data conversions and DFCODING for the storage of value/value label pairs. All table names are in uppercase letters. Note that the table names are case sensitive for MySQL on UNIX platforms only.

  • Table Types.  Table types used for each of the supported database products are as follows: For PostgreSQL and Oracle, all tables are type relational table. For MySQL, all tables are type InnoDB table.

  • Field Names for study tables.  Field naming follows these rules. For fields 1 to 7 and the last three fields for a plate, generic variable names are used. For all fields in between, unique variable names are used. Any field names matching an SQL keyword get an _ (underscore) appended. This is target product dependent. Refer to the documentation for the product you are using for a complete list of the relevant keywords. Any non-alphanumeric characters are replaced with _ (underscore). If a field name starts with a digit, DF_ is prepended. Field names are truncated to 30 chars. A sequence number is appended to each non-unique field name.

  • DFdiscover type to SQL type mapping:  DFsqlload will map DFdiscover types to SQL types according to Data typing when creating SQL tables for typed columns. Untyped columns are mapped to VARCHAR (PostgreSQL, MySQL) and VARCHAR2 (Oracle).

    Table 3.13. Data typing

    DFdiscover typePostgreSQLOracleMySQLMS SQL Server
    stringVARCHAR(n)VARCHAR2(n)VARCHAR(n)[a] VARCHAR(n)
    checkINT2NUMBER(p)INT2SMALLINT
    choiceINT2NUMBER(p)INT2SMALLINT
    integerINT4NUMBER(p)INT4INT
    floatNUMERIC(p,s)NUMBER(p,s)DECIMAL(p,s)DECIMAL(p,s)
    vasNUMERIC(p,s)NUMBER(p,s)DECIMAL(p,s)DECIMAL(p,s)
    number(nn:nn)TIMEVARCHAR2(n)TIMEVARCHAR(n)
    dateDATEDATEDATEDATETIME
    timestampTIMESTAMPDATEDATETIMEDATETIME

    [a] MySQL converts VARCHAR(n) to CHAR(n) (n < 4) or TEXT (n > 255).


  • SQL column naming convention:  DFsqlload will follow the rules defined in Fields for typed SQL tables and Fields for untyped SQL tables for DFdiscover field name to SQL column naming convention when creating SQL tables.

    Table 3.14. Fields for typed SQL tables

    FieldField NumberField NameField TypeMissing Code/LabelPartial DateImpute
    Coded field: Code1~7,NF-2~NFGenericTrue TypeNo  
    Coded field: Label1~7,NF-2~NFGenericVARCHARNo  
    Coded field: Code [a] 1~7,NF-2~NFGenericTrue TypeNo  
    Coded field: Label [b] 1~7,NF-2~NFU_GenericVARCHARNo  
    Coded field: Code8~NF-3UniqueTrue TypeNo  
    Coded field: Label8~NF-3UniqueVARCHARYes  
    Coded field: Code[a]8~NF-3UniqueTrue TypeNo  
    Coded field: Label[b]8~NF-3U_UniqueVARCHARYes  
    Date field: Typed8~NF-3UniqueTrue TypeNoNoYes
    Date field: Untyped8~NF-3UniqueVARCHARYesYesNo
    Date field: Typed[a]8~NF-3UniqueTrue TypeNoNoYes
    Date field: Untyped[b]8~NF-3U_UniqueVARCHARYesYesNo
    Other fields8~NF-3UniqueTrue TypeNo  
    Other fields1~7,NF-2~NFGenericTrue TypeNoNoNo

    [a] The first field, if -coding both or -date both is specified.

    [b] The second field, if -coding both or -date both is specified.


    Table 3.15. Fields for untyped SQL tables

    FieldField NumberField NameField TypeMissing Code/LabelPartial DateImpute
    Coded field: Code1~7,NF-2~NFGenericTrue TypeNo  
    Coded field: Label1~7,NF-2~NFGenericVARCHARNo  
    Coded field: Code[a]1~7,NF-2~NFGenericTrue TypeNo  
    Coded field: Label[b]1~7,NF-2~NFU_GenericVARCHARNo  
    Coded field: Code8~NF-3U_UniqueVARCHARYes  
    Coded field: Label8~NF-3U_UniqueVARCHARYes  
    Coded field: Code[a]8~NF-3UniqueVARCHARYes  
    Coded field: Label[b]8~NF-3U_UniqueVARCHARYes  
    Date field: Typed8~NF-3U_UniqueTrue TypeNoNoYes
    Date field: Untyped8~NF-3U_UniqueVARCHARYesYesNo
    Date field: Typed[a]8~NF-3UniqueTrue TypeNoNoYes
    Date field: Untyped[b]8~NF-3U_UniqueVARCHARYesYesNo
    Other fields8~NF-3U_UniqueVARCHARYes  
    Other fields1~7,NF-2~NFGenericTrue TypeNoNoNo


  • Locking:  DFsqlload uses the table DFSQLLOAD to prevent two processes from working on the same database schema. A NULL entry in DFSQLLOAD.DFFINISH indicates that another process is running or terminated abnormally. If DFSQLLOAD does not exist, DFsqlload will create it. Upon completion, DFsqlload updates DFSQLLOAD.DFFINISH to the finish time and the DFSQLLOAD.DFSTATUS to zero (normal exit process).

  • DFdiscover Retrieval File:  The first line in the DRF contains a two-field comment. The first field contains the username and creation timestamp. The second field identifies the creator of the DRF as DFsqlload and lists the parameters used to access the SQL database. The next line is a comment describing the format of the DRF records to follow. One DRF data record is created for each DFdiscover record with data import problems. Each DRF data record is identified by Id, Visit, Plate and Image. The 5th field records the SQL table name, followed by the field name and problem description for each problem encountered. Multiple field/problem descriptions are separated by commas.

  • Date fields:  DFsqlload converts two-digit years to four-digits based on the cut off year specified for each date field in DFschema, or the year the study began (%B) if not specified. DFsqlload imputes partial dates according to the imputation method specified for each date field.

  • Number of fields:  DFsqlload creates a minimum of 10 fields (1~7,NF-2~NF) for each SQL table. The maximum number of fields is limited by the database: PostgreSQL 1600, MySQL 1000, Oracle 1000, MS SQLServer 1024. DFsqlload will report errors for DFdiscover records that do not meet these field requirements and will continue the loading process.

Schema

Postgres.  A schema is a namespace that contains DFdiscover study tables. Schema names are DFdiscover study names. If the schema name specified in the DFsqlload command line does not exist in the Postgres database, DFsqlload will create the schema as specified. DFsqlload will never drop existing schemas.

Oracle.  A schema is a database user who is also the owner of DFdiscover tables that belong to one study. The schema specified in the command line must already exist in Oracle, unless the tablespace name (must exist in database) is also specified. If the schema does not exist, DFsqlload will create the schema and assign the specified tablespace as the schema's default tablespace. The DFdiscover study tables will be created, if specified, in the specified tablespace, otherwise in the schema's default tablespace. DFsqlload will check the schema's quota privilege on storage tablespace and grant unlimited quota privilege on that tablespace. DFsqlload will never drop existing schemas.

MySQL.  There is no schema in MySQL. A MySQL database maps to a DFdiscover study database. In the command line, the schema name must be the same as the database name. Note that the database name is case sensitive whether MySQL is running on a UNIX or Windows platform. If the specified database does not exist in the MySQL database, DFsqlload will create the database as specified. DFsqlload will never drop existing MySQL databases.

MS SQLServer.  There is no schema in MS SQLServer in version 2000 and older. A SQLServer database maps to a DFdiscover study database. In the command line, the schema name must be the same as the database name.

Examples

Example 3.72. Import study 254 into mySQL

Import the validation study val254 into mySQL on host talisman.

% DFsqlload -flavor mysql -q talisman:val254:val254:root:mysql 254