SCD Stage
All parallel stages have a Stage page, and each Stage page has a General tab. This page and tab contain the following fields:

Stage name
    Displays the name of the stage. You can edit this name.

Data Connection
    Specifies the name of a data connection object in the repository that defines a connection to a data source. Data connections are valid only when the stage generates surrogate keys by using a database sequence. The Source type field on the Surrogate Key tab must be set to DBSequence. Click the browse button to load, save, or clear a data connection object. You can also drag and drop a data connection object from the repository to the stage icon on the canvas.

Select output link
    Specifies which link is the main output from the stage. The other link will carry changes to the dimension.

Description
    Contains an optional description of the stage. Add a description because it helps the maintainability of jobs. The description is displayed in any job reports.

Fast Path
    Provides a navigation tool for the SCD stage. Click the Fast Path arrows to move through the sequence of tabs where user action is required.

The Stage page also has the following tabs:

Stage page - Advanced tab
    Use the Advanced tab to specify processing details for a stage.

Slowly Changing Dimension Stage page - NLS Locale tab
    Use the NLS Locale tab to define a collate convention for the SCD stage.
Never combine operators.

Preserve Partitioning
    Specifies whether the stage requires that partitioning be preserved by the next stage of the job. Select one of these options:

    Set
        Specifies that the next stage in the job should preserve the partitioning if possible.
    Clear
        Specifies that the partitioning method that the next stage uses is not relevant.
    Propagate
        Specifies that the stage will use the option that the previous stage in the job used. If the previous stage used the Propagate option, the setting from the last stage that used either the Set option or the Clear option is used. For some stage types, the Propagate option is not available.

    The default setting of a stage can be any of these options, depending on the stage type, and in most cases you can accept the default. You might set something other than the default when you are implementing a particular partitioning scheme in your job. None of these options prevents repartitioning, but you are warned if repartitioning happens when the job runs.

Configuration File
    Shows the name of the configuration file that the system is currently using. The configuration file is specified by the APT_CONFIG_FILE environment variable.

Node pool and resource constraints
    Specify constraints on where the stage can run. These controls are available if you have defined multiple pools in the configuration file.
    1. Select Node pool or Resource pool from the Constraint drop-down list.
    2. If you selected Resource pool, select a type for a resource pool.
    3. Select the name of the node or resource pool.
You can select multiple node pools or resource pools. The stage can run only on the node pools or resource pools that you specify.

Node map constraints
    Use this feature to specify a virtual node pool that does not appear in the configuration file. Select the option box and type the nodes that you want the stage to be able to run on. You can also browse the available nodes.

The lists of available nodes, available node pools, and available resource pools are derived from the configuration file. Constraints are ignored for Sequential File stages, File Set stages, External Source stages, and External Target stages.
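The node and pool names offered by these controls come from the parallel configuration file. As an illustrative sketch only (the hostnames, pool name, and paths here are hypothetical, not defaults), a configuration file that defines an extra node pool might look like this:

```
{
    node "node1"
    {
        fastname "etl-server-1"
        pools "" "sortpool"
        resource disk "/ibm/ds/data" {pools ""}
        resource scratchdisk "/ibm/ds/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etl-server-2"
        pools ""
        resource disk "/ibm/ds/data" {pools ""}
        resource scratchdisk "/ibm/ds/scratch" {pools ""}
    }
}
```

In this example, constraining a stage to the sortpool node pool would restrict it to run on node1 only, because node2 belongs only to the default (unnamed) pool.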
Displays the dimension columns on the reference input link to the stage. The Key Expression field specifies the match condition between one or more source records and dimension rows. The Purpose field specifies the purpose code for each dimension column.
Specifies the name of the client alias database on the remote server. This field is available if the Database type field is set to DB2 and the Client instance name field is not blank. This field is required only if the names of the alias database and the server database are different.

Server name
    Specifies the name of the server.
- Drag source columns from the left pane to the Derivation field in the right pane.
- Use the column auto-match facility (right-click the link header and select Auto Match).
- Define an expression by using the expression editor (double-click the Derivation or Expire field).
The Dim Update tab has two panes:

Left pane
    Displays the source columns and any job parameters that you defined. This information is read-only and cannot be modified.

Right pane
    Displays the dimension update columns. The Derivation field specifies how each column is derived. The Purpose field specifies the purpose code for each column. The Expire field specifies how to expire a dimension record.
- Drag source columns from the left pane to the Derivation field in the right pane.
- Use the column auto-match facility (right-click the link header and select Auto Match).
- Define an expression by using the expression editor (double-click the Derivation or Expire field).
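For illustration only (the link and column names here are hypothetical, not part of the product), a simple Derivation often just copies a source column through unchanged, while the Expire entry for a current-indicator column might be a literal value that marks the superseded dimension row as no longer current:

```
Derivation for the CUST_NAME column:      InLink.CUST_NAME
Expire for a current-indicator column:    'N'
```

The exact expressions depend on your dimension design; the expression editor accepts any valid parallel-job expression, including functions and job parameters.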
The Output Map tab has two panes:

Left pane
    Displays the source columns, dimension columns, and any job parameters that you defined. This information is read-only and cannot be modified.

Right pane
    Displays the output columns. The Derivation field specifies how each column is derived.
Length
    The data precision. Specify the length for Char data and the maximum length for VarChar data.

Scale
    The data scale factor. For Sequential File stages, the scale must not exceed 9.

Nullable
    Indicates whether the column can contain null values. Specify No to indicate that the column is subject to a NOT NULL constraint. The Nullable field is informational only; it does not enforce a NOT NULL constraint.

Display
    For certain stage types, this field optionally gives the maximum number of characters required to display the column data.

Data element
    For certain stage types, this field optionally enables you to specify a data element for the column, which applies stricter data typing.

Description
    Specify a description of the column.

If you are typing column definitions in the Columns tab and want several columns to share one or more of the same properties, you can propagate the properties. Select all the affected columns and select Propagate values... from the shortcut menu. The Propagate column values dialog box opens; use it to choose the properties that you want to propagate. You can then edit rows in the Edit Column Meta Data window by selecting a row and choosing Edit Row... from the shortcut menu.

Save
    Click Save to save a copy of the column definitions as a table definition in the repository. The Save Table Definition window opens. To save a table definition:
    1. Enter a name in the Data source type field. This name forms the first part of the unique table definition identifier. By default, the field contains Saved.
    2. Enter a name in the Data source name field. This name forms the second part of the table definition identifier. By default, this field contains the name of the link that you are editing.
    3. Enter a name in the Table/file name field. This name is the last part of the table definition identifier and is also the name that is used for the table definition in the repository. By default, this field contains the name of the link that you are editing.
    4. Enter a brief description of the table definition in the Short description field. By default, this field contains the date and time that you clicked Save. The format of the date and time depends on your Windows setup. This field is optional.
    5. Enter a more detailed description of the table definition in the Long description field. This field is optional.
    6. Click OK. The Save Table Definition As window opens. Select the folder in which you want to store the table definition and click Save.

Load
    Click Load to load a table definition from the repository and populate the Columns tab. The Table Definition window opens. To load a table definition:
    1. Browse the repository tree for the table definition that you want to load.
    2. Select the table definition in the tree and click OK. The Select Columns window opens.
    3. Use the arrow buttons to move the columns that you want to load from the Available columns list to the Selected columns list.
    4. Click OK to load the selected column definitions into the Columns tab.
By default, stages buffer data in such a way that no deadlocks can arise. A deadlock is a situation where a number of stages are mutually dependent: each is waiting for input from another stage and cannot output data until it has received that input.

The size and operation of the buffer are usually the same for all links on all stages. The default values for the settings can be set by using environment variables; see the WebSphere DataStage Parallel Job Advanced Developer Guide. Use the Advanced tab to specify buffer settings on a per-link basis. Any settings that you make here automatically appear in the Advanced tab of the previous or next stage in the job.

CAUTION: Use these settings with extreme caution, because inappropriate settings can cause deadlock situations to arise.

The tab contains the following controls and fields:

Buffering mode
    Select one of the following options from the drop-down list:

    (Default)
        The link takes the settings that are specified by the environment variables. This is Auto buffer unless you have changed the value of the APT_BUFFERING_POLICY environment variable.
    Auto buffer
        The link buffers incoming data only if buffering is necessary to prevent a dataflow deadlock situation.
    Buffer
        The link unconditionally buffers all outgoing data.
    No buffer
        The link does not buffer data under any circumstances. This option can lead to deadlock situations if it is not used carefully.

    If you choose the Auto buffer or Buffer options, you can also set the values of the various buffering parameters.

Maximum memory buffer size (bytes)
    Specifies the maximum amount of virtual memory, in bytes, used per buffer. The default size is 3145728 (3 MB).

Buffer free run (percent)
    Specifies how much of the available in-memory buffer to use before the buffer writes to disk. The value of Buffer free run is a percentage of Maximum memory buffer size. When the amount of data in the buffer is less than this value, new data is accepted automatically.
    When the data exceeds this value, the buffer first tries to write some of the data that it contains to disk before accepting more data. The default value is 50%. You can set the value to greater than 100%, in which case the buffer continues to store data up to that percentage of Maximum memory buffer size before writing to disk.

Queue upper bound size (bytes)
    Specifies the maximum amount of data buffered at any time, using both memory and disk. The default value is zero, which means that the buffer size is limited only by the available disk space, as specified by the 'resource scratchdisk' entries in the configuration file. If you specify a value, the total buffer size is limited to that number of bytes plus one block (where the data stored in a block cannot exceed 32 KB).

Disk write increment (bytes)
    Sets the size, in bytes, of the blocks of data that are moved to and from disk. The default is 1048576 (1 MB). Adjusting this value trades the amount of disk access against data throughput. Increasing the block size reduces disk access, but might decrease performance when data is being read or written in smaller units. Decreasing the block size increases data throughput, but might increase the amount of disk access.

To create a buffer that does not write to disk, set Queue upper bound size to a value equal to or slightly less than Maximum memory buffer size, and set Buffer free run to 1.0.

CAUTION: The size of the buffer is limited by the virtual memory of your system, and you can create deadlock if the buffer becomes full.
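The interaction between Maximum memory buffer size, Buffer free run, and Disk write increment can be pictured with a small model. The sketch below is not DataStage code; it is a simplified Python illustration, using the documented default values, of how data is held in memory until the free-run threshold is crossed and is then spilled to scratch disk in whole disk-write increments:

```python
class LinkBuffer:
    """Simplified model of the per-link buffering parameters described above."""

    def __init__(self, max_memory=3_145_728, free_run=0.5,
                 disk_write_increment=1_048_576):
        self.max_memory = max_memory                 # Maximum memory buffer size
        self.threshold = max_memory * free_run       # Buffer free run point
        self.increment = disk_write_increment        # Disk write increment
        self.in_memory = 0                           # bytes currently in memory
        self.on_disk = 0                             # bytes spilled to disk

    def write(self, nbytes):
        """Accept nbytes of incoming data, spilling to disk if needed."""
        self.in_memory += nbytes
        # Once memory use exceeds the free-run threshold, move whole
        # disk-write increments out to scratch disk before accepting more.
        while self.in_memory > self.threshold and self.in_memory >= self.increment:
            self.in_memory -= self.increment
            self.on_disk += self.increment

buf = LinkBuffer()
buf.write(2_000_000)  # exceeds the 50% free-run threshold of a 3 MB buffer
print(buf.in_memory, buf.on_disk)
```

With the defaults, the free-run threshold is 1,572,864 bytes, so a 2,000,000-byte write triggers one 1 MB spill, leaving 951,424 bytes in memory and 1,048,576 bytes on disk. A real buffer also enforces Queue upper bound size and operates on a stream of records rather than a single write, which this sketch omits.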