Datastage Interview Questions and Answers
1. How did you handle an 'Aborted' sequencer?
In almost all cases we have to manually delete the data the aborted run inserted into the database, fix the job, and then run the job again.
2. If worked with DS6.0 and latest versions what are Link-Partitioner and Link-Collector used for?
Link Partitioner - Used for partitioning the data.
Link Collector - Used for collecting the partitioned data.
3. What are Routines and where/how are they written and have you written any routines before?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them.
The following are different types of routines:
1) Transform functions
2) Before-after job subroutines
3) Job Control routines
4. What are Oconv() and Iconv() functions and where are they used?
Iconv() - Converts a string from an external format to an internal storage format.
Oconv() - Converts an expression from internal format to an output format.
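As an illustration (not DataStage BASIC itself), the internal date format these functions work with counts days from an epoch of 31 December 1967 (day 0). A Python sketch of the equivalent round trip:

```python
from datetime import date, timedelta

# DataStage/UniVerse internal dates count days from 31 Dec 1967 (day 0).
EPOCH = date(1967, 12, 31)

def iconv_date(d):
    """Mimic Iconv(date, "D"): external date -> internal day number."""
    return (d - EPOCH).days

def oconv_date(n):
    """Mimic Oconv(n, "D"): internal day number -> external date."""
    return EPOCH + timedelta(days=n)

print(iconv_date(date(1968, 1, 1)))   # 1
print(oconv_date(0))                  # 1967-12-31
```

The round trip is lossless: Oconv(Iconv(x)) returns the original date.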
5. How did you connect to DB2 in your last project?
Using DB2 ODBC drivers.
6. What is DS Administrator used for?
The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales.
7. Do you know about INTEGRITY/QUALITY stage?
QualityStage can be integrated with DataStage. QualityStage provides stages such as Investigate, Match, and Survive for data-quality work; to use them from DataStage you need the QualityStage plug-in.
8. Explain the differences between Oracle8i/9i?
Oracle 8i does not support the SYSDATE pseudo column, but 9i does. In Oracle 8i a table can have up to 256 columns; in 9i a table can have up to 1000 columns.
9. How do you merge two files in DS?
Either use the Copy command as a Before-job subroutine if the metadata of the two files is the same, or create a job to concatenate the two files into one if the metadata is different.
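A minimal sketch of the identical-metadata case (file names here are hypothetical); in a real job this would typically be an ExecSH before-job subroutine running a copy/concatenate command:

```python
import os
import tempfile

def merge_files(paths, out_path):
    """Concatenate flat files with identical metadata into one output file."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as src:
                out.write(src.read())

# Demonstration with hypothetical temporary files:
d = tempfile.mkdtemp()
a, b, merged = (os.path.join(d, n) for n in ("a.txt", "b.txt", "merged.txt"))
with open(a, "w") as f:
    f.write("row1\n")
with open(b, "w") as f:
    f.write("row2\n")
merge_files([a, b], merged)
print(open(merged).read(), end="")  # prints row1 and row2 on separate lines
```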
10. What is DS Designer used for?
You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links.
11. How did you handle reject data?
Typically a reject link is defined and the rejected data is loaded back into the data warehouse. A reject link has to be defined for every output link from which you wish to collect rejected data. Rejected data is typically bad data, such as duplicate primary keys or null rows where data is expected.
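A sketch of the idea behind reject handling (column names are hypothetical): rows failing a transformer-style constraint, such as a null or duplicate primary key, are routed down a reject link instead of the main output:

```python
def split_rejects(rows, key="id"):
    """Route rows with a null or duplicate key to a reject list,
    mimicking a transformer constraint plus a reject link."""
    seen, good, rejects = set(), [], []
    for row in rows:
        k = row.get(key)
        if k is None or k in seen:
            rejects.append(row)
        else:
            seen.add(k)
            good.append(row)
    return good, rejects

rows = [{"id": 1}, {"id": None}, {"id": 1}, {"id": 2}]
good, rejects = split_rejects(rows)
print(len(good), len(rejects))  # 2 2
```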
12. What is DS Director used for?
The DataStage Director is used to run and validate jobs. You can open the Director directly from the DataStage Designer itself.
13. What is DS Manager used for? [ Merged with Designer in 8.x versions ]
The Manager is a graphical tool that enables you to view and manage the contents of the DataStage Repository.
14. What are Static Hash files and Dynamic Hash files?
As the names themselves suggest, a static hash file has a fixed size while a dynamic hash file grows as needed. In general we use Type-30 dynamic hash files. The data file has a default size limit of 2 GB, and the overflow file is used if the data exceeds that size.
15. What is Hash file stage and what is it used for?
Used for lookups; it acts like a reference table. It is also used in place of ODBC or OCI tables for better performance.
16. What are the often used Stages or stages you worked with in your last project?
Transformer, ORAOCI8/9, ODBC, Link-Partitioner, Link-Collector, Hash, Aggregator, Sort.
17. What is the importance of Surrogate Key in Data warehousing?
A surrogate key is the primary key for a dimension table. Its main advantage is that it is independent of the underlying database, i.e. a surrogate key is not affected by changes going on in the source database.
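To make the idea concrete, a minimal sketch of surrogate-key assignment (key values are illustrative): natural keys from the source map to sequential integers that remain stable even if the source system changes:

```python
def assign_surrogate_keys(natural_keys, existing=None):
    """Assign a stable, sequential surrogate key to each new natural key;
    keys already mapped keep their existing surrogate value."""
    mapping = dict(existing or {})
    next_key = max(mapping.values(), default=0) + 1
    for nk in natural_keys:
        if nk not in mapping:
            mapping[nk] = next_key
            next_key += 1
    return mapping

m = assign_surrogate_keys(["CUST-A", "CUST-B"])
m = assign_surrogate_keys(["CUST-B", "CUST-C"], existing=m)
print(m)  # {'CUST-A': 1, 'CUST-B': 2, 'CUST-C': 3}
```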
18. What does a Config File in parallel extender consist of?
The config file consists of the following:
a) The number of processes or nodes.
b) The actual disk storage locations.
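As an illustration, a minimal single-node parallel configuration file looks like this (the host name and paths are hypothetical):

```
{
    node "node1"
    {
        fastname "etl_server"
        pools ""
        resource disk "/opt/ds/data1" {pools ""}
        resource scratchdisk "/opt/ds/scratch1" {pools ""}
    }
}
```

Adding more `node` entries increases the degree of parallelism without changing the job design.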
19. How did you connect with DB2 in your last project?
Most of the time the data was sent to us in the form of flat files; the data was dumped and sent to us. In some cases, where we needed to connect to DB2 for lookups, we used ODBC drivers.
20. In how many places can you call Routines?
Routines can be called from four places:
(i) Transform functions, e.g. (A) date transformations and (B) string (upcase) transformations
(ii) Before & After subroutines
(iii) XML transformations
(iv) Web-service-based transformations
21. Is it possible to calculate a hash total for an EBCDIC file and have the hash total stored as EBCDIC using Datastage ?
Currently, the total is converted to ASCII, even though the individual records are stored as EBCDIC.
22. Compare and Contrast ODBC and Plug-In stages?
ODBC:
a) Poor performance.
b) Can be used for a variety of databases.
c) Can handle stored procedures.
Plug-in:
a) Good performance.
b) Database-specific (works with only one database).
23. What is Functionality of Link Partitioner and Link Collector?
The Link Partitioner partitions the data across the nodes defined in the configuration file. The Link Collector collects the partitioned data back into a single stream.
24. Explain METASTAGE?
MetaStage is used to handle metadata, which is very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling. These data definitions are stored in the repository and can be accessed with MetaStage.
25. Containers Usage and Types
A container is a collection of stages grouped for reusability. There are 2 types of containers: a) Local container: job-specific. b) Shared container: can be used in any job within a project.
26. Explain Dimension Modeling types along with their significance
Data modelling is broadly classified into 2 types:
a) E-R diagrams (Entity-Relationship modelling).
b) Dimensional modelling.
Dimensional modelling is further divided into star schema and snowflake schema designs.
27. Did you Parameterize the job or hard-coded the values in the jobs?
The jobs were always parameterized. The values come either from Job Properties or from a 'Parameter Manager', a third-party tool. You should never hard-code parameter values in your jobs.
28. How are the Dimension tables designed?
Find where the data for the dimension is located. Figure out how to extract that data. Determine how to maintain changes to the dimension. Change the fact table and DW population routines accordingly.
29) Differentiate between ODBC and DRS stage?
The DRS stage is faster than the ODBC stage because it uses native database connectivity.
30) Define Orabulk and BCP stages?
The Orabulk stage is used to load large amounts of data into one target table of an Oracle database. The BCP stage is used to load large amounts of data into one target table of Microsoft SQL Server.
31) Differentiate between an Operational Data Store (ODS) and a Data warehouse?
We can say an ODS is a mini data warehouse. An ODS doesn't contain information for more than 1 year, while a data warehouse contains detailed information regarding the entire business.
32) NLS stands for what in Datastage?
NLS means National Language Support. It can be used to incorporate other languages, such as French, German, and Spanish, into the data required for processing by the data warehouse. These languages use the same script as English.
33) Can you explain how could anyone drop the index before loading the data in target in Datastage?
In Datastage, we can drop the index before loading the data into the target by using the Direct Load functionality of the SQL*Loader utility.
34) Does Datastage support slowly changing dimensions ?
Yes. Versions 8.5 and later support this feature.
35) Define Routines and their types?
Routines are basically collections of functions defined in the DS Manager. They can be called via the Transformer stage. There are three types of routines: parallel routines, mainframe routines, and server routines.
36) How complex jobs are implemented in Datstage to improve performance?
In order to improve performance in Datastage, it is recommended not to use more than 20 stages in a job. If you need more than 20 stages, it is better to split those stages into another job.
37) Name the third party tools that can be used in Datastage?
The third-party tools that can be used with Datastage are Autosys, TNG, and Event Co-ordinator. I have hands-on experience working with these tools.
38) Define Project in Datastage?
Whenever we launch a Datastage client, we are asked to connect to a Datastage project. A Datastage project contains Datastage jobs, built-in components, and user-defined components.
39) How many types of hash files are there?
There are two types of hash files in DataStage: static hash files and dynamic hash files. A static hash file is used when a limited amount of data is to be loaded into the target database. A dynamic hash file is used when we don't know the amount of data coming from the source file.
40) Define Meta Stage?
In Datastage, MetaStage is used to save metadata that is helpful for data lineage and data analysis.
41) Have you have ever worked in UNIX environment and why it is useful in Datastage?
Yes, I have worked in a UNIX environment. This knowledge is useful in Datastage because one sometimes has to write UNIX programs, such as batch scripts, to invoke batch processing.
42) Differentiate between Datastage and Datastage TX?
Datastage is an ETL (Extract, Transform and Load) tool, and Datastage TX is an EAI (Enterprise Application Integration) tool.
43) What do transaction size and array size mean in Datastage?
Transaction size is the number of rows written before committing the records to a table. Array size is the number of rows written to or read from the table at a time.
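A sketch of how the transaction-size setting drives commits on a database target (the commit callback here is a stand-in, not a real DataStage API):

```python
def load_rows(rows, commit_fn, transaction_size=100):
    """Commit once every `transaction_size` rows, as the transaction-size
    setting does on a database target stage; returns the commit count."""
    pending = 0
    commits = 0
    for _ in rows:
        pending += 1
        if pending == transaction_size:
            commit_fn()
            commits += 1
            pending = 0
    if pending:          # final partial batch still gets committed
        commit_fn()
        commits += 1
    return commits

print(load_rows(range(250), commit_fn=lambda: None, transaction_size=100))  # 3
```

Larger transaction sizes mean fewer commits (faster, but more work lost on failure); array size similarly batches the physical reads/writes.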
44) What is the difference between Datastage 7.5 and 7.0?
In Datastage 7.5 many new stages were added for more robustness and smoother performance, such as the Procedure stage, Command stage, and Generate Report option.
45) Why we use surrogate key?
In Datastage, we use a surrogate key instead of a unique natural key. A surrogate key is mostly used for faster retrieval of data; it uses an index to perform the retrieval operation.
46) How rejected rows are managed in Datastage?
In Datastage, rejected rows are managed through constraints in the Transformer stage. We can either route the rejected rows via the properties of a transformer or create temporary storage for rejected rows with the help of the REJECTED keyword.
47) Differentiate between ODBC and DRS stage?
DRS stage is faster than the ODBC stage because it uses native databases for connectivity.
48) Define Orabulk and BCP stages?
Orabulk stage is used to load large amount of data in one target table of Oracle database. The BCP stage is used to load large amount of data in one target table of Microsoft SQL Server.
49) Define DS Designer?
The DS Designer is used to design jobs in the work area and add various stages and links to them.
50) How many types of views are there in a Datastage Director?
There are three types of views in a Datastage Director i.e. Job View, Log View and Status View.