DFsqlload — Create table definitions and import all data into a relational database
DFsqlload
[-flavor oracle|postgresql|mysql|mssql]
[-d drfname]
[-q]
[
[-ignore_mssql_priv]
| [-ignore_mysql_priv]
]
[-type typed|untyped|both]
[-coding code|label|both]
[-missing code|label]
[-date typed|untyped|both]
[-noimpute]
[-missed]
[-table all|DFcoding,DFnullvalue,DFsubjectalias]]
{param}
{study}
-flavor oracle|postgresql|mysql|mssql | Type of target database. The default is
|
-d drfname | A DFdiscover retrieval file to use to record problems encountered during data import. |
-q | Quiet mode. Instructs the program to suppress all warning messages. The default, without this option, is to write warning messages to standard error. |
-ignore_mssql_priv|ignore_mysql_priv | Ignore administrator privileges. For MS SQLServer or MySQL, allows tables to be loaded without requiring administrator privileges on the database. |
-type typed|untyped|both | Type of SQL tables to be created.
If set to |
-coding code|label|both |
Coded field specification. If set to |
-missing code|label |
Missing value specification only applies to untyped tables. If set to |
-date typed|untyped|both |
Date type specification for user date fields. If set to |
-noimpute | No imputation for partial dates replaces partial dates with NULL values for typed dates only. The default is to impute partial dates for
typed dates according to the imputation method specified in |
-missed | Include missed records from the database in the output. Typically missed records are not included as they contain no actual data. |
-table all|DFcoding,DFnullvalue,DFsubjectalias] |
List of optional tables to create.
If set to |
The following two options are required and must appear in order at the end of the option list:
param | A set of parameters of the form
|
study | the DFdiscover study number, from which data records are to be exported. |
Table 3.12. Database Parameters
| Parameter | Description | Postgres | MySQL | Oracle | MS SQLServer | |||
|---|---|---|---|---|---|---|---|---|
| server | The server name to connect to | required | required | a place holder. Oracle will lookup
tnsnames.ora based on
database name | required | |||
| database | The database name | required | see Schema | required | required | |||
| schema | The DFdiscover study name | required | required | required | required (must be the same as the database name) | |||
| tablespace | The alternative storage tablespace for Oracle | ignored | ignored | see Schema | ignored | |||
| username:password | The database login user name and password | optional.
| optional.
| optional.
| optional.
|
DFsqlload exits with one of the following statuses:
0 | DFsqlload was able to (re-)create the schema, create all of the tables, and import all of the data without error. |
1 | The schema and tables were created and the data were imported, but one or more errors were encountered in the imported data. |
2 | A more serious error was encountered which prevented some or all data from being imported. |
DFsqlload is a command-line solution that creates all of the table definitions and imports all of the data into a relational database. One SQL table is created for each DFdiscover plate. In addition, three tables are added to record logging information, qc and reason for change data. Optional tables are created to note any problem fields or store value and label pairs for coded fields.
When run repeatedly, existing SQL tables
are compared with the current DFschema file. Unchanged tables are truncated i.e., the data is removed but the table definitions remain. If any
changes were made, the existing DFdiscover-defined table is dropped and re-created.
DFsqlload calls DFexport.rpc to export all primary records to a temporary file,
and then loads the database tables plate by plate.
The following operations are performed on the target database.
At least one and optionally three meta tables are created in the target database.
DFSQLLOAD: Each DFsqlload run creates an entry in this log table.
This table is also the lock table that prevents more than one DFsqlload process from
operating on the same schema. A NULL value for column
DFFINISH indicates that a DFsqlload process is running. If DFsqlload
terminates abnormally, this record must be removed manually before starting a new DFsqlload
process.
The following SQL statements are used internally to create this table. The statements use syntax specific to PostgreSQL.
Similar tables are created for MySQL, MS SQLServer and Oracle implementations of DFsqlload.
CREATE TABLE <schema>.dfsqlload ( dfuser varchar(30) NOT NULL, -- the database user dfstart timestamp(0) NOT NULL, -- the start date and time dffinish timestamp(0), -- the finish date and time (NULL if process running) dfoption varchar(500), -- the options used with the DFsqlload command dfnull int4, -- the total number of problem fields converted to null dferror int4, -- the total number of records discarded or rejected dfstatus int2, -- the status: 0=process completed, 1=running CONSTRAINT dfsqlload_pk PRIMARY KEY (dfstart) ) WITH OIDS;
DFNULLVALUE: This is an optional table where all problem
fields that are converted to
NULL are recorded. The following SQL statements are used
internally to create this table.
CREATE TABLE <schema>.dfnullvalue
(
dfpid int4 NOT NULL, -- study subject id
dfplate int2 NOT NULL, -- plate
dfseq int4 NOT NULL, -- sequence/visit number
dftable varchar(30) NOT NULL, -- name of sql table where substitution was made
dffield varchar(30) NOT NULL, -- name of sql field where substitution was made
dfvalue varchar(###), -- the original value exported from DFdiscover
dfproblem varchar(###), -- the reason the value was converted to NULL
dfraster varchar(12), -- the image id of the source CRF page
CONSTRAINT dfnullvalue_pk PRIMARY KEY (dfpid, dfplate, dfseq, dftable, dffield)
) WITH OIDS;
The ### above represents the maximum length encountered in the data source.
DFCODING: This is an optional table where all value and label pairs
for coded fields are stored.
The following SQL statements are used internally
to create this table.
CREATE TABLE <schema>.dfcoding ( dfplate int2 NOT NULL, -- plate dffield varchar(30) NOT NULL, -- sql table column name dfcode varchar(###) NOT NULL, -- code value for this column dflabel varchar(###), -- code label for this column CONSTRAINT dfcoding_pk PRIMARY KEY (dfplate, dffield, dfcode) ) WITH OIDS;
The ### above represents the maximum length encountered in the data source.
Verify any existing SQL tables. The following tasks are performed in the verification of existing SQL tables.
All tables with the prefix DF are
treated as DFsqlload-defined tables. For any given run of DFsqlload, only the tables
relevant to the current job are used.
Existing SQL table definitions are compared to the DFschema file.
DFsqlload will truncate any unchanged tables and re-create changed tables.
Changes that may cause DFsqlload to drop a table are as follows:
add, delete, reorder fields
rename field
change data type
any change to field size or format causing storage changes in the target database
missing code or label changes
code or label changes to coded fields
SQL tables that have been directly user-modified
All privileges (if any) for tables create by DFsqlload are backed up and restored.
If other tables reference any tables created by DFsqlload, the foreign keys will either be dropped (Postgresql, MS SQLServer) or disabled (MySQL, Oracle) and saved to the log file in the form of a database-specific SQL statement. The user can choose to execute these statements to restore the foreign keys manually, as DFsqlload does not do it automatically.
Other database objects - triggers, views, procedures, etc., which depend upon DFsqlload-defined tables are ignored.
All DFsqlload created files are in the
study/working_dir/DFsqlload_logs
directory by default. If this directory does not exist,
DFsqlload will create it at the default location.
The file names use time stamp in the format yymmdd_hhmiss
as a prefix,
where mm is two-digit month and
mi is two-digit minute.
The time stamp, which is also the value of DFSQLLOAD.DFSTART, is the
unique identifier of this DFsqlload process.
Log file: yymmdd_hhmiss.log.
This file records the detailed loading progress including schema setup,
SQL table definitions, record count, and rejected records with error
messages in the following formats:
Regular plates: Record (id, seq, plate, image): error message
QCs and REASONs: Record (id, seq, plate, image, field): error message
DFdiscover Retrieval File: If path is included, this file will be in the
specified location. If the file already exists, the existing file will
be removed. If the file cannot be created or written, DFsqlload will report an
error and continue the loading process.
The file is in standard DRF format. The first four fields are Id, Visit,
Plate, and Image. The combination of id, visit, and plate is unique.
The 5th field records the list of problems. The record level errors
appear first, followed by the field level problems. The field level
problems are derived from the DFNULLVALUE table, which
is created by default if a .drf file is requested.
Problem types are summarized below.
Record level errors
error - incorrect number of fields: The number of fields of a record in
plt###.dat does not match the DFschema definition.
error - invalid DFdiscover field: The value of DFdiscover leading (1~7) or trailing (NF-2~NF) field is blank or invalid.
error - duplicate primary record: Two or more records have the same id, seq, plate combination.
error - rejected by database: Any field error that was not identified by DFsqlload, but rejected by the database.
Field level problems
missing value
too wide
bad format
invalid date
partial date
undefined code
data/type conversion
The format used in the .drf file is
Table1Name: FieldName (problem), FieldName (problem),..., Table2Name: F
ieldName (problem),...
QC or REASON specific error
error - primary record was rejected.
Data file: yymmdd_hhmiss_plt### (for regular plates)
yymmdd_hhmiss_DFreason (for plate 510)
yymmdd_hhmiss_DFqc (for plate 511)
This is the data source of current loading plate created by DFexport.rpc, and is removed automatically after the data is loaded into the SQL database.
Error files: yymmdd_hhmiss_plt###.err (for untyped table)
yymmdd_hhmiss_tbl###.err (for typed table)
yymmdd_hhmiss_DFreason.err (for plate 510)
yymmdd_hhmiss_DFqc.err (for plate 511)
These files record the records rejected by the SQL database along with database generated error messages.
Rejected keys: yymmdd_hhmiss.tmp. This file records the keys (plate, seq, id) of primary records discarded by DFsqlload or rejected by the SQL database. If a REASON or QC record matches one of the key combinations in this file, that REASON or QC record will not be loaded into the SQL database and a message will be written to the log file:
Record (id#,seq#,plate#,image,field#): discarded by DFsqlload.
DFRECORD (error - primary record was rejected).
This file is removed automatically.
Database login credentials files. The following database-specific files may be used to store login credentials.
Postgres: ~/.pgpass. This is a Postgres standard file located in the user's home directory and the file permission must be 600; otherwise, Postgres will ignore it. This file specifies the database login credentials in the format:
host:port:database:username:password
The username can be any valid database user, for example:
parkcity:5432:test_db_name:user_a_name:user_a_password parkcity:5432:test_db_name:user_b_name:user_b_password
MySQL: ~/.my.cnf. This is MySQL standard file located in user's home directory and the file permission should be 600. This file specifies the program groups and options for each group. A typical group is [client]. In this group the database password for OS user can be specified:
[client] password=my_pass ...
The global options can be specified in the global option file
/etc/my.cnf or
DATADIR/my.cnf.
Oracle: tnsnames.ora, ~/.orapass.
tnsnames.ora
is the Oracle standard net services file located in
directory by default. Oracle client will
lookup net service names from this file. This file can also be in
ORACLE_HOME/network/admin
and the environment variable ANY_DIR/network/adminORACLE_HOME is set
to ANY_DIR.
~/.orapass
is NOT an Oracle standard file. It is provided for convenience
for DFdiscover users to store their database password. This file should be
in the user's home directory and file permission should be 600. The format
is:
db_name:password
for example,
test_db_name:test_db_password production_db_name:production_db_password
MS SQLServer: ~/.mssqlpass. This is not an MS SQLServer standard file. It is provided as a convenience to DFdiscover users for storing their database password. This file needs to be kept in the user's home directory and the file permission needs to be 600. The format of this file is:
server_name:password
There are two types of entries that can be used. Both the server name and the password to use can be specified as follows:
my_server:my_server_password
or a password can be specified for any server this user connects to as in the following example:
*:any_server_password
The following details apply to tables and fields created by DFsqlload.
Table Names.
Table names follow these rules. DFSQLLOAD is a log table and is used
to store the details of a given DFsqlload for a particular schema. The table for plate 510 is named DFREASON.
The table for plate 511 is named DFQC.
All other plates are named DFPLATE_nnn (untyped) or
DFTABLE_nnn (typed),
where nnn is the plate number. Optional tables created
are DFNULLVALUE for the storage of any problem field data conversions
and DFCODING for the storage of value/value label pairs. All table names are in
uppercase letters. Note that the table names are case sensitive for
MySQL on UNIX platforms only.
Table Types.
Table types used for each of the supported database products are
as follows: For PostgreSQL and Oracle, all tables are type
relational table.
For MySQL, all tables are type InnoDB table.
Field Names for study tables.
Field naming follows these rules.
For fields 1 to 7 and the last three fields for a plate, generic
variable names are used. For all fields in between, unique variable names are used.
Any field names matching an SQL keyword get an _ (underscore)
appended. This is target product dependent. Refer to the documentation
for the product you are using for a complete list of
the relevant keywords.
Any non-alphanumeric characters are replaced with _ (underscore).
If a field name starts with a digit, DF_ is prepended.
Field names are truncated to 30 chars.
A sequence number is appended to each non-unique field name.
DFdiscover type to SQL type mapping: DFsqlload will map DFdiscover types to SQL types according to Data typing when creating SQL tables for typed columns. Untyped columns are mapped to VARCHAR (PostgreSQL, MySQL) and VARCHAR2 (Oracle).
Table 3.13. Data typing
| DFdiscover type | PostgreSQL | Oracle | MySQL | MS SQL Server |
|---|---|---|---|---|
| string | VARCHAR(n) | VARCHAR2(n) | VARCHAR(n)[a] | VARCHAR(n) |
| check | INT2 | NUMBER(p) | INT2 | SMALLINT |
| choice | INT2 | NUMBER(p) | INT2 | SMALLINT |
| integer | INT4 | NUMBER(p) | INT4 | INT |
| float | NUMERIC(p,s) | NUMBER(p,s) | DECIMAL(p,s) | DECIMAL(p,s) |
| vas | NUMERIC(p,s) | NUMBER(p,s) | DECIMAL(p,s) | DECIMAL(p,s) |
| number(nn:nn) | TIME | VARCHAR2(n) | TIME | VARCHAR(n) |
| date | DATE | DATE | DATE | DATETIME |
| timestamp | TIMESTAMP | DATE | DATETIME | DATETIME |
[a] MySQL converts VARCHAR(n) to CHAR(n) (n < 4) or TEXT (n > 255). | ||||
SQL column naming convention: DFsqlload will follow the rules defined in Fields for typed SQL tables and Fields for untyped SQL tables for DFdiscover field name to SQL column naming convention when creating SQL tables.
Table 3.14. Fields for typed SQL tables
| Field | Field Number | Field Name | Field Type | Missing Code/Label | Partial Date | Impute |
|---|---|---|---|---|---|---|
| Coded field: Code | 1~7,NF-2~NF | Generic | True Type | No | ||
| Coded field: Label | 1~7,NF-2~NF | Generic | VARCHAR | No | ||
| Coded field: Code [a] | 1~7,NF-2~NF | Generic | True Type | No | ||
| Coded field: Label [b] | 1~7,NF-2~NF | U_Generic | VARCHAR | No | ||
| Coded field: Code | 8~NF-3 | Unique | True Type | No | ||
| Coded field: Label | 8~NF-3 | Unique | VARCHAR | Yes | ||
| Coded field: Code[a] | 8~NF-3 | Unique | True Type | No | ||
| Coded field: Label[b] | 8~NF-3 | U_Unique | VARCHAR | Yes | ||
| Date field: Typed | 8~NF-3 | Unique | True Type | No | No | Yes |
| Date field: Untyped | 8~NF-3 | Unique | VARCHAR | Yes | Yes | No |
| Date field: Typed[a] | 8~NF-3 | Unique | True Type | No | No | Yes |
| Date field: Untyped[b] | 8~NF-3 | U_Unique | VARCHAR | Yes | Yes | No |
| Other fields | 8~NF-3 | Unique | True Type | No | ||
| Other fields | 1~7,NF-2~NF | Generic | True Type | No | No | No |
[a]
The first field, if [b]
The second field, if | ||||||
Table 3.15. Fields for untyped SQL tables
| Field | Field Number | Field Name | Field Type | Missing Code/Label | Partial Date | Impute |
|---|---|---|---|---|---|---|
| Coded field: Code | 1~7,NF-2~NF | Generic | True Type | No | ||
| Coded field: Label | 1~7,NF-2~NF | Generic | VARCHAR | No | ||
| Coded field: Code[a] | 1~7,NF-2~NF | Generic | True Type | No | ||
| Coded field: Label[b] | 1~7,NF-2~NF | U_Generic | VARCHAR | No | ||
| Coded field: Code | 8~NF-3 | U_Unique | VARCHAR | Yes | ||
| Coded field: Label | 8~NF-3 | U_Unique | VARCHAR | Yes | ||
| Coded field: Code[a] | 8~NF-3 | Unique | VARCHAR | Yes | ||
| Coded field: Label[b] | 8~NF-3 | U_Unique | VARCHAR | Yes | ||
| Date field: Typed | 8~NF-3 | U_Unique | True Type | No | No | Yes |
| Date field: Untyped | 8~NF-3 | U_Unique | VARCHAR | Yes | Yes | No |
| Date field: Typed[a] | 8~NF-3 | Unique | True Type | No | No | Yes |
| Date field: Untyped[b] | 8~NF-3 | U_Unique | VARCHAR | Yes | Yes | No |
| Other fields | 8~NF-3 | U_Unique | VARCHAR | Yes | ||
| Other fields | 1~7,NF-2~NF | Generic | True Type | No | No | No |
Locking:
DFsqlload uses the table DFSQLLOAD to prevent two processes from working on
the same database schema.
A NULL entry in DFSQLLOAD.DFFINISH indicates that another process is
running or terminated abnormally. If DFSQLLOAD does not exist, DFsqlload will
create it. Upon completion, DFsqlload updates
DFSQLLOAD.DFFINISH to the finish time and the DFSQLLOAD.DFSTATUS
to zero (normal exit process).
DFdiscover Retrieval File: The first line in the DRF contains a two-field comment. The first field contains the username and creation timestamp. The second field identifies the creator of the DRF as DFsqlload and lists the parameters used to access the SQL database. The next line is a comment describing the format of the DRF records to follow. One DRF data record is created for each DFdiscover record with data import problems. Each DRF data record is identified by Id, Visit, Plate and Image. The 5th field records the SQL table name, followed by the field name and problem description for each problem encountered. Multiple field/problem descriptions are separated by commas.
Date fields: DFsqlload converts two-digit years to four-digits based on the cut off year specified for each date field in DFschema, or the year the study began (%B) if not specified. DFsqlload imputes partial dates according to the imputation method specified for each date field.
Number of fields: DFsqlload creates a minimum of 10 fields (1~7,NF-2~NF) for each SQL table. The maximum number of fields is limited by the database: PostgreSQL 1600, MySQL 1000, Oracle 1000, MS SQLServer 1024. DFsqlload will report errors for DFdiscover records that do not meet these field requirements and will continue the loading process.
Postgres. A schema is a namespace that contains DFdiscover study tables. Schema names are DFdiscover study names. If the schema name specified in the DFsqlload command line does not exist in the Postgres database, DFsqlload will create the schema as specified. DFsqlload will never drop existing schemas.
Oracle. A schema is a database user who is also the owner of DFdiscover tables that belong to one study. The schema specified in the command line must already exist in Oracle, unless the tablespace name (must exist in database) is also specified. If the schema does not exist, DFsqlload will create the schema and assign the specified tablespace as the schema's default tablespace. The DFdiscover study tables will be created, if specified, in the specified tablespace, otherwise in the schema's default tablespace. DFsqlload will check the schema's quota privilege on storage tablespace and grant unlimited quota privilege on that tablespace. DFsqlload will never drop existing schemas.
MySQL. There is no schema in MySQL. A MySQL database maps to a DFdiscover study database. In the command line, the schema name must be the same as the database name. Note that the database name is case sensitive whether MySQL is running on a UNIX or Windows platform. If the specified database does not exist in the MySQL database, DFsqlload will create the database as specified. DFsqlload will never drop existing MySQL databases.
MS SQLServer. There is no schema in MS SQLServer in version 2000 and older. A SQLServer database maps to a DFdiscover study database. In the command line, the schema name must be the same as the database name.