Thursday, February 24, 2011

Data Warehousing Interview Questions with Answers


What is data warehouse?

A data warehouse is an electronic store of an organization's historical data for the purpose of analysis and reporting. According to Kimball, a data warehouse should be subject-oriented, non-volatile, integrated and time-variant.
Explanatory Note
Non-volatile means that the data, once loaded in the warehouse, will not get deleted later. Time-variant means the data is stored against time and captures how it changes over time.

What are the benefits of a data warehouse?

Historical data stored in a data warehouse helps to analyze different aspects of business, including performance analysis, trend analysis and trend prediction, which ultimately increases the efficiency of business processes.

Why is a data warehouse used?

A data warehouse facilitates reporting on key performance indicators (KPIs) of different business processes. A data warehouse can further be used for data mining, which helps with trend prediction, forecasting, pattern recognition etc.

What is the difference between OLTP and OLAP?

OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system built on that data.
OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized. OLAP systems, on the other hand, are deliberately denormalized for fast data retrieval through SELECT operations.
Explanatory Note:
In a department store, when we pay at the check-out counter, the sales person at the counter keys all the data into a "Point-Of-Sale" machine. That data is transaction data, and the related system is an OLTP system. On the other hand, the manager of the store might want to view a report on out-of-stock materials so that he can place purchase orders for them. Such a report would come from an OLAP system.

What is data mart?

Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each department may have its own data mart. These data marts can be built on top of the data warehouse.

What is ER model?

The ER (entity-relationship) model is a data model designed with the goal of normalizing the data.

What is dimensional modeling?

A dimensional model consists of dimension and fact tables. Fact tables store the transactional measurements and the foreign keys to the dimension tables that qualify those measurements. The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data retrieval.

What is dimension?

A dimension is something that qualifies a quantity (measure).
If I just say "20 kg", it does not mean anything. But "20 kg of rice (product) sold to Ramesh (customer) on 5th April (date)" makes meaningful sense. Here product, customer and date are dimensions that qualify the measure. Dimensions are mutually independent.
Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

What is fact?

A fact is something that is quantifiable (or measurable). Facts are typically (but not always) numerical values that can be aggregated.

What are additive, semi-additive and non-additive measures?

Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or percentage, e.g. a 5% profit margin or a revenue-to-asset ratio. Non-numerical data stored in a fact table is also non-additive.
Semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a SUM() over balances does not give a useful result, but the MAX() or MIN() balance might be useful. Similarly, consider a price rate or currency rate: SUM is meaningless on a rate, but an average might be useful.
Additive measures can be used with any aggregation function like SUM(), AVG() etc. An example is sales quantity.
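
To make the distinction concrete, here is a small illustrative SQL sketch. The table and column names (fact_account_balance, fact_sales, profit, revenue etc.) are assumptions for the illustration, not part of any specific model.

-- Semi-additive: account balance. Summing balances across dates is meaningless,
-- but MIN/MAX per account is useful.
SELECT account_id,
       MAX(balance) AS peak_balance,
       MIN(balance) AS lowest_balance
FROM   fact_account_balance
GROUP  BY account_id;

-- Additive: sales quantity can be summed across any dimension.
SELECT product_key, SUM(sales_quantity) AS total_quantity
FROM   fact_sales
GROUP  BY product_key;

-- Non-additive: a stored profit-margin percentage should not be summed;
-- recompute the ratio from its additive components instead.
SELECT product_key,
       SUM(profit) / SUM(revenue) AS profit_margin
FROM   fact_sales
GROUP  BY product_key;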

What is Star-schema?

This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the primary keys from all the dimension tables flow into the fact table as foreign keys, alongside the measures. The entity-relationship diagram of such an arrangement looks like a star, hence the name.
Consider a fact table that stores sales quantity for each product and customer on a certain date. Sales quantity will be the measure here, and the keys from the customer, product and time dimension tables will flow into the fact table.
A star-schema is a special case of snow-flake schema.
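
A minimal DDL sketch of such a star schema is shown below. The table and column names are illustrative only, not taken from any specific system.

CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  VARCHAR(100));
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name VARCHAR(100));
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, calendar_date DATE);

-- Central fact table: foreign keys from every dimension plus the measure
CREATE TABLE fact_sales (
    product_key    INTEGER REFERENCES dim_product(product_key),
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    date_key       INTEGER REFERENCES dim_date(date_key),
    sales_quantity INTEGER
);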

What is snow-flake schema?

This is another logical arrangement of tables in dimensional modeling where a centralized fact table references a number of dimension tables; however, those dimension tables are further normalized into multiple related tables.
Consider a fact table that stores sales quantity for each product and customer on a certain date. Sales quantity will be the measure here, and the keys from the customer, product and time dimension tables will flow into the fact table. Additionally, all the products can be grouped under different product families stored in a separate table, so that the primary key of the product family table also goes into the product table as a foreign key. Such a construct is called a snow-flake schema, as the product table is further snow-flaked into product family.
Note
Snow-flaking increases the degree of normalization in the design.
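
Continuing the illustrative DDL from the star schema sketch above, snow-flaking the product dimension might look like this (names are again assumptions):

-- The product dimension is further normalized ("snow-flaked") into a family table
CREATE TABLE dim_product_family (
    family_key  INTEGER PRIMARY KEY,
    family_name VARCHAR(100)
);

-- The product table now carries a foreign key to the product family table
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    family_key   INTEGER REFERENCES dim_product_family(family_key)
);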

What are the different types of dimension?

In a data warehouse model, dimensions can be of the following types:
1. Conformed Dimension
2. Junk Dimension
3. Degenerate Dimension
4. Role Playing Dimension
Based on how frequently the data inside a dimension changes, we can further classify dimensions as:
1. Unchanging or static dimension (UCD)
2. Slowly changing dimension (SCD)
3. Rapidly changing dimension (RCD)

What is a 'Conformed Dimension'?

A conformed dimension is a dimension that is shared across multiple subject areas. Consider the 'Customer' dimension: both the marketing and sales departments may use the same customer dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These dimensions are conformed dimensions.
Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are said to be conformed.

What is a degenerate dimension?

A degenerate dimension is a dimension that is derived from the fact table and does not have a dimension table of its own.
A dimension key such as a transaction number, receipt number or invoice number has no further associated attributes and hence cannot be designed as a dimension table.

What is junk dimension?

A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that they can be removed from other tables and 'junked' into an abstract dimension table.
These junk dimension attributes might not be related to each other. The only purpose of this table is to store all the combinations of dimensional attributes which could not otherwise fit into the other dimension tables. One may want to read an interesting document on this, De-clutter with Junk (Dimension).

What is a role-playing dimension?

Dimensions are often reused for multiple applications within the same database with different contextual meanings. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery" or "Date of Hire". This is often referred to as a 'role-playing dimension'.

What is SCD?

SCD stands for slowly changing dimension, i.e. a dimension where the data changes slowly. SCDs can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Types 1, 2 and 3 are the most common.

What is rapidly changing dimension?

This is a dimension where the data changes rapidly.

Describe the different types of Slowly Changing Dimension (SCD)

Type 0:
A Type 0 dimension is one where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation; it just means that, even if the value of the attributes changes, history is not kept and the table continues to hold the previous data.
Type 1:
A Type 1 dimension is one where history is not maintained and the table always shows the most recent data. This effectively means that such a dimension table is always updated with the recent data whenever there is a change, and because of this update we lose the previous values.
Type 2:
A Type 2 dimension table tracks historical changes by creating separate rows in the table with different surrogate keys. Consider a customer C1 that is under group G1 first and is later moved to group G2. Then there will be two separate records in the dimension table, like below:

Key  Customer  Group  Start Date    End Date
1    C1        G1     1st Jan 2000  31st Dec 2005
2    C1        G2     1st Jan 2006  NULL

Note that separate surrogate keys are generated for the two records. The NULL end date in the second row denotes that this record is the current record. Also note that, instead of start and end dates, one could keep a version number column (1, 2 ... etc.) to denote the different versions of the record.
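
A minimal SQL sketch of how such a Type 2 change might be applied is shown below. Table and column names (customer_dim, dim_key, cust_group) are illustrative, and a real load would typically generate the surrogate key automatically rather than hard-coding it.

-- Close the current record for customer C1
UPDATE customer_dim
   SET end_date = DATE '2005-12-31'
 WHERE customer = 'C1'
   AND end_date IS NULL;

-- Insert the new current record with a new surrogate key
INSERT INTO customer_dim (dim_key, customer, cust_group, start_date, end_date)
VALUES (2, 'C1', 'G2', DATE '2006-01-01', NULL);
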
Type 3:
A Type 3 dimension stores the history in a separate column instead of separate rows. So unlike a Type 2 dimension, which grows vertically, a Type 3 dimension grows horizontally. See the example below:

Key  Customer  Previous Group  Current Group
1    C1        G1              G2

This is suitable only when you do not need to store many consecutive changes and the date of change need not be stored.
Type 6:
A Type 6 dimension is a hybrid of Types 1, 2 and 3 (1+2+3). It behaves very much like Type 2, but one extra column is added to denote which record is the current record.

Key  Customer  Group  Start Date    End Date       Current Flag
1    C1        G1     1st Jan 2000  31st Dec 2005  N
2    C1        G2     1st Jan 2006  NULL           Y

What is a mini dimension?

Mini dimensions can be used to handle the rapidly changing dimension scenario. If a dimension has a large number of rapidly changing attributes, it is better to separate those attributes into a different table called a mini dimension. This is done because if the main dimension table is designed as SCD Type 2, it will soon grow too large and create performance issues. Segregating the rapidly changing attributes into a different table keeps the main dimension table small and well performing.

What is a fact-less-fact?

A fact table that does not contain any measure is called a fact-less fact. Such a table contains only keys from different dimension tables. It is often used to resolve a many-to-many cardinality issue.
Explanatory Note:
Consider a school where a single student may be taught by many teachers and a single teacher may have many students. To model this situation dimensionally, one might introduce a fact-less fact table joining the teacher and student keys. Such a fact table will then be able to answer queries like the following (a SQL sketch follows the list):
1. Who are the students taught by a specific teacher?
2. Which teacher teaches the maximum number of students?
3. Which student has the highest number of teachers?
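
Here is a sketch of how the first two queries might look against an illustrative fact-less fact table fact_class(teacher_key, student_key); the table name and the teacher key value are assumptions.

-- 1. Students taught by a specific teacher (teacher_key 101 is just an example)
SELECT student_key
FROM   fact_class
WHERE  teacher_key = 101;

-- 2. Teacher teaching the maximum number of students
SELECT teacher_key, COUNT(DISTINCT student_key) AS num_students
FROM   fact_class
GROUP  BY teacher_key
ORDER  BY num_students DESC;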

What is a coverage fact?

A fact-less fact table can only answer 'optimistic' (positive) queries but cannot answer negative queries. Consider the illustration in the above example again. A fact-less fact containing the keys of teachers and students cannot answer queries like the ones below:
1. Which teacher did not teach any student?
2. Which student was not taught by any teacher?
Why not? Because the fact-less fact table only stores the positive scenarios (a student being taught by a teacher). If a student is not being taught by any teacher, that student's key does not appear in the table at all, thereby reducing the coverage of the table.
A coverage fact table attempts to answer this, often by adding an extra flag column: flag = 0 indicates a negative condition and flag = 1 indicates a positive condition. To understand this better, consider a class with 100 students and 5 teachers. The coverage fact table will ideally store 100 x 5 = 500 records (all combinations), and if a certain teacher is not teaching a certain student, the corresponding flag for that record will be 0.
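
With an illustrative coverage fact table fact_coverage(teacher_key, student_key, flag), the negative questions reduce to simple filters on the flag; the names below are assumptions for the sketch.

-- Teachers who did not teach any student (flag is 0 for all their combinations)
SELECT teacher_key
FROM   fact_coverage
GROUP  BY teacher_key
HAVING MAX(flag) = 0;

-- Students not taught by any teacher
SELECT student_key
FROM   fact_coverage
GROUP  BY student_key
HAVING MAX(flag) = 0;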

What are incident and snapshot facts?

A fact table stores some kind of measurements. Usually these measurements are stored (or captured) against a specific time, and they vary with respect to time. Now it might happen that the business is not always able to capture all of its measures for every point in time. The unavailable measurements can either be kept empty (NULL) or be filled with the last available measurement. The first case is an example of an incident fact, and the second an example of a snapshot fact.

What is aggregation and what is the benefit of aggregation?

A data warehouse usually captures data with the same degree of detail as is available in the source. This "degree of detail" is termed granularity. But not all reporting requirements on that data warehouse need the same degree of detail.
To understand this, let's consider an example from the retail business. A certain retail chain has 500 shops across Europe. All the shops record detail-level transactions for the products they sell, and those data are captured in a data warehouse.
Each shop manager can access the data warehouse and see which products are sold, by whom and in what quantity on any given date. Thus the data warehouse helps the shop managers with the detail-level data that can be used for inventory management, trend prediction etc.
Now think about the CEO of that retail chain. He does not really care which sales girl in London sold the highest number of chopsticks or which shop is the best seller of brown bread. All he is interested in is, perhaps, the percentage increase of his revenue margin across Europe, or maybe year-on-year sales growth in Eastern Europe. Such data is aggregated in nature, because sales of goods in Eastern Europe is derived by summing up the individual sales data from each shop in Eastern Europe.
Therefore, to support different levels of data warehouse users, data aggregation is needed.
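
An illustrative pair of queries on assumed tables (fact_sales_detail, dim_shop and their columns are made up for the sketch) shows the two levels of granularity:

-- Detail grain for a shop manager: daily quantity per product per shop
SELECT shop_id, product_id, sale_date, SUM(quantity) AS qty
FROM   fact_sales_detail
GROUP  BY shop_id, product_id, sale_date;

-- Aggregated grain for the CEO: yearly revenue per region
SELECT d.region,
       EXTRACT(YEAR FROM f.sale_date) AS sale_year,
       SUM(f.revenue)                 AS total_revenue
FROM   fact_sales_detail f
JOIN   dim_shop d ON d.shop_id = f.shop_id
GROUP  BY d.region, EXTRACT(YEAR FROM f.sale_date);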

What is slicing-dicing?

Slicing means showing a slice of the data, given a certain dimension (e.g. Product), a member value (e.g. Brown Bread) and measures (e.g. sales).
Dicing means viewing that slice with respect to different dimensions and at different levels of aggregation.
Slicing and dicing operations are part of pivoting.
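
Using the illustrative star schema tables sketched earlier (fact_sales, dim_product, dim_customer, dim_date are assumptions), slicing and dicing roughly correspond to SQL like this:

-- Slice: fix the Product dimension to 'Brown Bread' and look at sales by date
SELECT d.calendar_date, SUM(f.sales_quantity) AS qty
FROM   fact_sales f
JOIN   dim_product p ON p.product_key = f.product_key
JOIN   dim_date    d ON d.date_key    = f.date_key
WHERE  p.product_name = 'Brown Bread'
GROUP  BY d.calendar_date;

-- Dice: view the same slice against a different dimension (customer instead of date)
SELECT c.customer_name, SUM(f.sales_quantity) AS qty
FROM   fact_sales f
JOIN   dim_product  p ON p.product_key  = f.product_key
JOIN   dim_customer c ON c.customer_key = f.customer_key
WHERE  p.product_name = 'Brown Bread'
GROUP  BY c.customer_name;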

What is drill-through?

Drill-through is the process of going from summary data to the detail-level data.
Consider the retail example above. If the CEO finds that sales in Eastern Europe have declined this year compared to last year, he might want to know the root cause of the decrease. For this, he may start drilling through his report to more detailed levels and eventually find out that, even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him pinpoint the root cause of the declining sales. The method he followed to obtain the details from the aggregated data is called drill-through.

DIMENSION TABLE VS FACT TABLE


DIMENSION TABLE
1. Provides the context / descriptive information for the fact table measurements.
2. Structure: a surrogate key, one or more fields that compose the natural key (NK), and a set of descriptive attributes.
3. Smaller in size than the fact table.
4. A schema usually contains more dimension tables than fact tables.
5. The surrogate key is used to avoid primary key (PK) violations when storing historical data.
6. Provides the entry points into the data.
7. Field values may be numeric or textual.

FACT TABLE
1. Provides the measurements of an enterprise; a measurement is an amount determined by observation.
2. Structure: foreign keys (FK), degenerate dimensions and measurements.
3. Larger in size than the dimension tables.
4. A schema usually contains fewer fact tables than dimension tables.
5. The foreign keys, together with any degenerate dimension fields, act as the primary key.
6. Field values are almost always numeric.

OLTP VS OLAP


OLTP                                           OLAP
On-Line Transaction Processing                 On-Line Analytical Processing
Data is continuously updated                   Data is mostly read-only
Tables are in normalized form                  Tables are partially normalized / denormalized
Accesses single records                        Accesses multiple records for analysis
Holds current data                             Holds current and historical data
Records are maintained using the primary key   Records are based on a surrogate key
Tables or records can be deleted               Records are generally not deleted
Complex data model                             Simplified data model

DATAMART VS DATA WAREHOUSE


DATAMART
1. A scaled-down version of the data warehouse that addresses only one subject area, like the Sales department or the HR department.
2. Typically one fact table with multiple dimension tables.
3. Covers a single department, e.g. [Sales Department] or [HR Department] or [Manufacturing Department].
4. Smaller organizations often prefer a data mart.

DATA WAREHOUSE
1. A database system that facilitates on-line analytical processing by allowing the data to be viewed in different dimensions or perspectives to provide business intelligence.
2. More than one fact table and multiple dimension tables.
3. Covers multiple departments together, e.g. [Sales Department, HR Department, Manufacturing Department].
4. Bigger organizations prefer a data warehouse.

DATA MINING VS WEB MINING

DATA MINING
1. Involves using techniques to find underlying structure and relationships in large amounts of data.
2. Data mining products tend to fall into five categories: neural networks, knowledge discovery, data visualization, fuzzy query analysis and case-based reasoning.

WEB MINING
1. Involves the analysis of the Web server logs of a Web site.
2. The Web server logs contain the entire collection of requests made by a potential or current customer through their browser, together with the responses by the Web server.

Aggregation without Informatica Aggregator

Since Informatica processes data row by row, it is generally possible to handle data aggregation even without an Aggregator transformation. In certain cases, you may get a huge performance gain using this technique!

General Idea of Aggregation without Aggregator Transformation

Let us take an example: suppose we want to find the SUM of SALARY for each department of the employee table. The SQL query for this would be:

SELECT DEPTNO,SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;

If we need to implement this in Informatica, it would be very easy, as we would obviously go for an Aggregator Transformation. By marking the DEPTNO port as GROUP BY and adding one output port as SUM(SALARY), the problem can be solved easily.

Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator. We will use the Expression transformation's ability to hold the value of an attribute from the previous row.

But wait... why would we do this? Aren't we complicating the thing here?

Yes, we are. But as it turns out, in many cases it can bring a performance benefit (especially if the input is already sorted, or you know the input data will not violate the order, e.g. you are loading daily data and want to group it by day). Remember that Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and it also breaks the normal row-by-row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space requirement and keep the row-by-row processing. The mapping below shows how to do this.
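
Before looking at the mapping, here is a conceptual SQL sketch of what the Expression + Filter combination computes: a running total per DEPTNO carried from row to row over sorted input (what the variable port does), keeping only the last row of each group (what the Filter does). This is only an illustration of the idea, not SQL that Informatica generates.

-- Running total per department over sorted rows; keep the last row of each group
SELECT DEPTNO, RUNNING_SAL AS SUM_SALARY
FROM (
    SELECT DEPTNO,
           SUM(SALARY) OVER (PARTITION BY DEPTNO ORDER BY EMPNO
                             ROWS UNBOUNDED PRECEDING)                 AS RUNNING_SAL,
           ROW_NUMBER() OVER (PARTITION BY DEPTNO ORDER BY EMPNO DESC) AS RN
    FROM   EMP_SRC
) t
WHERE t.RN = 1;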


Images: Sorter (SRT_SAL) Ports Tab, Expression (EXP_SAL) Ports Tab, Sorter (SRT_SAL1) Ports Tab, Expression (EXP_SAL2) Ports Tab, Filter (FIL_SAL) Properties Tab

Note that I am showing a Sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use it, thereby increasing the performance benefit.

This is how we can implement aggregation without using Informatica aggregator transformation.

Implementing Informatica Partitions

Why use Informatica Pipeline Partition?

Identification and elimination of performance bottlenecks will obviously optimize session performance. After tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.

PowerCenter Informatica Pipeline Partition

Different Types of Informatica Partitions

We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.

Informatica Pipeline Partitioning Explained

Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the transformations and the target. When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel.

A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage.

Setting partition attributes includes setting partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points, we can define the partition type and add or delete partitions (the number of partitions).

We can set the following attributes to partition a pipeline:

Partition point: Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads. We cannot create partition points at Source instances or at Sequence Generator transformations.

Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline. The number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.

Partition types: The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points. We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin. Database partitioning: The Integration Service queries the database system for table partition information. It reads partitioned data from the corresponding nodes in the database.

Pass-through: The Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in the partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.

Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning when we want each partition to process approximately the same number of rows, i.e. for load balancing.

Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key. The Integration Service uses all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.

Hash user keys: The Integration Service uses a hash function to group rows of data among partitions. We define the number of ports to generate the partition key.

Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.

We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning.

Add, delete, or edit partition points on the Partitions view on the Mapping tab of session properties of a session in Workflow Manager.

The PowerCenter® Partitioning Option increases the performance of PowerCenter through parallel data processing. This option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.

Informatica Tuning - Step by Step Approach

This is the first of a number of articles in a series on Data Warehouse application performance tuning, scheduled to come every week. This one is on Informatica performance tuning.

Source Query/ General Query Tuning

1.1 Calculate original query cost
1.2 Can the query be re-written to reduce cost?
- Can IN clause be changed with EXISTS?
- Can a UNION be replaced with UNION ALL if we are not using any DISTINCT clause in the query?
- Is there a redundant table join that can be avoided?
- Can we include additional WHERE clause to further limit data volume?
- Is there a redundant column used in GROUP BY that can be removed?
- Is there a redundant column selected in the query but not used anywhere in mapping?
1.3 Check if all the major joining columns are indexed
1.4 Check if all the major filter conditions (WHERE clause) are indexed
- Can a function-based index improve performance further?
1.5 Check if any exclusive query hint reduce query cost
- Check if parallel hint improves performance and reduce cost
1.6 Recalculate query cost
- If query cost is reduced, use the changed query

Tuning Informatica LookUp

1.1 Redundant Lookup transformation
- Is there a lookup which is no longer used in the mapping?
- If there are consecutive lookups, can those be replaced inside a single lookup override?
1.2 LookUp conditions
- Are all the lookup conditions indexed in database? (Uncached lookup only)
- An unequal condition should always be mentioned after an equal condition
1.3 LookUp override query
- Should follow all guidelines from 1. Source Query part above
1.4 There is no unnecessary column selected in lookup (to reduce cache size)
1.5 Cached/Uncached
- Carefully consider whether the lookup should be cached or uncached
- General Guidelines
- Generally don't use cached lookup if lookup table size is > 300MB
- Generally don't use cached lookup if lookup table row count > 20,000,00
- Generally don't use cached lookup if driving table (source table) row count < 1000
1.6 Persistent Cache
- If the same lookup is cached and used in different mappings, consider a persistent cache

1.7 Lookup cache building
- Consider "Additional Concurrent Pipeline" in session property to build cache concurrently
"Prebuild Lookup Cache" should be enabled, only if the lookup is surely called in the mapping

Tuning Informatica Joiner


3.1 Unless unavoidable, join database tables in the database itself (homogeneous join) and don't use a Joiner
3.2 If an Informatica Joiner is used, always use sorted input and try to sort the data in the SQ query itself using ORDER BY (if a Sorter transformation is used, make sure the Sorter has enough cache to perform a 1-pass sort)
3.3 The smaller of the two joining tables should be the master

Tuning Informatica Aggregator

4.1 When possible, sort the input for the Aggregator at the database end (ORDER BY clause)
4.2 If the input is not already sorted, use a Sorter. If possible, use the SQ query to sort the records.

Tuning Informatica Filter

5.1 Unless unavoidable, filter in the source query in the Source Qualifier
5.2 Use the Filter as close to the source as possible

Tuning Informatica Sequence Generator

6.1 Cache the sequence generator

Setting Correct Informatica Session Level Properties

7.1 Disable "High Precision" if not required (High Precision allows decimal upto 28 decimal points)
7.2 Use "Terse" mode for tracing level
7.3 Enable pipeline partitioning (Thumb Rule: Maximum No. of partitions = No. of CPU/1.2)
(Also remember increasing partitions will multiply the cache memory requirement accordingly)

Tuning Informatica Expression

8.1 Use variable ports to reduce redundant calculations
8.2 Remove the default value "ERROR('transformation error')" for output columns
8.3 Try to reduce code complexity, such as nested IIF conditions
8.4 Try to reduce unnecessary type conversions in calculations

Pushdown Optimization In Informatica

Pushdown Optimization, a relatively new concept in Informatica PowerCenter, allows developers to balance the data transformation load among servers. This article describes pushdown techniques.

Pushdown Optimization

Pushdown optimization is a way of load-balancing among servers in order to achieve optimal performance. Veteran ETL developers often come across the issue of deciding the appropriate place to perform ETL logic. Suppose some ETL logic needs to filter out data based on a condition. One can either do it in the database, by using a WHERE condition in the SQL query, or inside Informatica by using a Filter transformation. Sometimes we can even "push" some transformation logic to the target database instead of doing it on the source side (especially in the case of ELT rather than ETL). Such optimization is crucial for overall ETL performance.

How does Push-Down Optimization work?

One can push transformation logic to the source or target database using pushdown optimization. The Integration Service translates the transformation logic into SQL queries and sends the SQL queries to the source or the target database which executes the SQL queries to process the transformations. The amount of transformation logic one can push to the database depends on the database, transformation logic, and mapping and session configuration. The Integration Service analyzes the transformation logic it can push to the database and executes the SQL statement generated against the source or target tables, and it processes any transformation logic that it cannot push to the database.


Using Pushdown Optimization

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to pushdown optimization.

Let us take an example

Filter Condition used in this mapping is: DEPTNO>40


Suppose a mapping contains a Filter transformation that filters out all employees except those with a DEPTNO greater than 40. The Integration Service can push the transformation logic to the database. It generates the following SQL statement to process the transformation logic:

INSERT INTO EMP_TGT(EMPNO, ENAME, SAL, COMM, DEPTNO)
SELECT EMP_SRC.EMPNO,
EMP_SRC.ENAME,
EMP_SRC.SAL,
EMP_SRC.COMM,
EMP_SRC.DEPTNO
FROM EMP_SRC
WHERE (EMP_SRC.DEPTNO >40)

The Integration Service generates an INSERT SELECT statement and it filters the data using a WHERE clause. The Integration Service does not extract data from the database at this time.

We can configure pushdown optimization in the following ways:

Using source-side pushdown optimization:

The Integration Service pushes as much transformation logic as possible to the source database. The Integration Service analyzes the mapping from the source to the target or until it reaches a downstream transformation it cannot push to the source database and executes the corresponding SELECT statement.

Using target-side pushdown optimization:

The Integration Service pushes as much transformation logic as possible to the target database. The Integration Service analyzes the mapping from the target to the source or until it reaches an upstream transformation it cannot push to the target database. It generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database and executes the DML.

Using full pushdown optimization:

The Integration Service pushes as much transformation logic as possible to both source and target databases. If you configure a session for full pushdown optimization, and the Integration Service cannot push all the transformation logic to the database, it performs source-side or target-side pushdown optimization instead. Also the source and target must be on the same database. The Integration Service analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it analyzes the target. When it can push all transformation logic to the database, it generates an INSERT SELECT statement to run on the database. The statement incorporates transformation logic from all the transformations in the mapping. If the Integration Service can push only part of the transformation logic to the database, it does not fail the session, it pushes as much transformation logic to the source and target database as possible and then processes the remaining transformation logic.

For example, a mapping contains the following transformations:
SourceDefn -> SourceQualifier -> Aggregator -> Rank -> Expression -> TargetDefn

Aggregator: SUM(SAL), SUM(COMM) group by DEPTNO
Rank: RANK port on SAL
Expression: TOTAL = SAL + COMM

Image: Pushdown optimization example 2


The Rank transformation cannot be pushed to the database. If the session is configured for full pushdown optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator transformation to the source, processes the Rank transformation, and pushes the Expression transformation and target to the target database.
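
Purely as an illustration (the actual SQL generated by the Integration Service will differ), the Source Qualifier and Aggregator logic pushed to the source side could correspond to something like:

-- Illustrative only: aggregation pushed to the source database
SELECT DEPTNO,
       SUM(SAL)  AS SUM_SAL,
       SUM(COMM) AS SUM_COMM
FROM   EMP_SRC
GROUP  BY DEPTNO;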

When we use pushdown optimization, the Integration Service converts the expression in the transformation or in the workflow link by determining equivalent operators, variables, and functions in the database. If there is no equivalent operator, variable, or function, the Integration Service itself processes the transformation logic. The Integration Service logs a message in the workflow log and the Pushdown Optimization Viewer when it cannot push an expression to the database. Use the message to determine the reason why it could not push the expression to the database.


How does Integration Service handle Push Down Optimization?

To push transformation logic to a database, the Integration Service might create temporary objects in the database. The Integration Service creates a temporary sequence object in the database to push Sequence Generator transformation logic to the database. The Integration Service creates temporary views in the database while pushing a Source Qualifier transformation or a Lookup transformation with a SQL override to the database, an unconnected relational lookup, filtered lookup.

1. To push Sequence Generator transformation logic to a database, we must configure the session for pushdown optimization with Sequence.
2. To enable the Integration Service to create the view objects in the database, we must configure the session for pushdown optimization with View.
3. After the database transaction completes, the Integration Service drops the sequence and view objects created for pushdown optimization.


Configuring Parameters for Pushdown Optimization

Depending on the database workload, we might want to use source-side, target-side, or full pushdown optimization at different times, and for that we can use the $$PushdownConfig mapping parameter. The settings in the $$PushdownConfig parameter override the pushdown optimization settings in the session properties. Create the $$PushdownConfig parameter in the Mapping Designer, select $$PushdownConfig for the Pushdown Optimization attribute in the session properties, and define the parameter in the parameter file.

The possible values may be:
1. None, i.e. the Integration Service itself processes all the transformations
2. Source [Seq View]
3. Target [Seq View]
4. Full [Seq View]


Pushdown Optimization Viewer

Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database. Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the corresponding SQL statement that is generated for the specified selections. When we select a pushdown option or pushdown group, we do not change the pushdown configuration. To change the configuration, we must update the pushdown option in the session properties.

Database that supports Informatica Pushdown Optimization

We can configure sessions for pushdown optimization for any of the following databases: Oracle, IBM DB2, Teradata, Microsoft SQL Server, Sybase ASE, or databases that use ODBC drivers.

When we use native drivers, the Integration Service generates SQL statements using native database SQL. When we use ODBC drivers, the Integration Service generates SQL statements using ANSI SQL. The Integration Service can generate more functions when it generates SQL statements using native language instead of ANSI SQL.


Handling Error when Pushdown Optimization is enabled

When the Integration Service pushes transformation logic to the database, it cannot track errors that occur in the database.

When the Integration Service runs a session configured for full pushdown optimization and an error occurs, the database handles the errors. When the database handles errors, the Integration Service does not write reject rows to the reject file.

If we configure a session for full pushdown optimization and the session fails, the Integration Service cannot perform incremental recovery because the database processes the transformations. Instead, the database rolls back the transactions. If the database server fails, it rolls back transactions when it restarts. If the Integration Service fails, the database server rolls back the transaction.

Informatica Dynamic Lookup Cache

A lookup cache does not change once built. But what if the underlying lookup table's data changes after the lookup cache is created? Is there a way to keep the cache up-to-date even if the underlying table changes?

Dynamic Lookup Cache

Let's think about this scenario: you are loading your target table through a mapping, and inside the mapping you have a Lookup that looks up the very same target table you are loading. You may ask, "So? What's the big deal? We all do it quite often...". And yes, you are right.
There is no "big deal" because Informatica (generally) caches the lookup table at the very beginning of the mapping, so whatever records get inserted into the target table through the mapping have no effect on the lookup cache. The lookup will still hold the previously cached data, even if the underlying target table is changing.

But what if you want your lookup cache to get updated as and when the target table changes? What if you want your lookup cache to always show the exact snapshot of the data in your target table at that point in time? Clearly this requirement will not be fulfilled if you use a static cache. You will need a dynamic cache to handle this.


But why on earth would someone need a dynamic cache?

To understand this, let's next understand a static cache scenario.


STATIC CACHE SCENARIO

Let's suppose you run a retail business and maintain all your customer information in a customer master table (an RDBMS table). Every night, all the customers from your customer master table are loaded into a Customer dimension table in your data warehouse. Your source customer table is a transaction system table, probably in 3rd normal form, and does not store history: if a customer changes his address, the old address is overwritten with the new address. But your data warehouse table stores the history (maybe in the form of SCD Type-II). There is a mapping that loads your data warehouse table from the source table. Typically you do a lookup on the target (static cache) and check every incoming customer record to determine whether the customer already exists in the target. If the customer does not already exist in the target, you conclude the customer is new and INSERT the record, whereas if the customer already exists, you may want to update the target record with the new record (if it has changed). This is illustrated below; you don't need a dynamic lookup cache for this.


Image: Static cache scenario


DYNAMIC LOOKUP CACHE SCENARIO

Notice that in the previous example I mentioned that your source table is an RDBMS table. This ensures that your source table does not have any duplicate records. What if you had a flat file as the source, with many duplicate records? Would the scenario be the same? No; see the illustration below.


Image: A scenario illustrating the use of dynamic lookup cache

Here are some more examples of when you may consider using a dynamic lookup:

  • Updating a master customer table with both new and updated customer information, as shown above
  • Loading data into a slowly changing dimension table and a fact table at the same time. Remember, you typically look up the dimension while loading the fact, so normally you load the dimension table before the fact table. Using a dynamic lookup, you can load both simultaneously.
  • Loading data from a file with many duplicate records and eliminating the duplicates in the target by updating the duplicate rows, i.e. keeping the most recent row or the initial row
  • Loading the same data from multiple sources in a single mapping. Just consider the previous retail business example: if you have more than one shop and Linda has visited two of your shops for the first time, the customer record for Linda will come twice during the same load.

So, How does dynamic lookup work?

When the Integration Service reads a row from the source, it updates the lookup cache by performing one of the following actions:

Inserts the row into the cache: If the incoming row is not in the cache, the Integration Service inserts the row into the cache based on the input ports or a generated sequence ID. The Integration Service flags the row as insert.

Updates the row in the cache: If the row exists in the cache, the Integration Service updates the row in the cache based on the input ports. The Integration Service flags the row as update.

Makes no change to the cache: This happens when the row exists in the cache and the lookup is configured to insert new rows only; or the row is not in the cache and the lookup is configured to update existing rows only; or the row is in the cache but, based on the lookup condition, nothing changes. The Integration Service flags the row as unchanged.

Notice that the Integration Service actually flags the rows based on the above three conditions. This is a great thing because, if you know the flag, you can reroute the row to achieve different logic. This flag port is called "NewLookupRow", and using it the rows can be routed for insert, for update, or to do nothing. You just need a Router or Filter transformation followed by an Update Strategy.

Oh, I forgot to tell you the actual values that you can expect in the NewLookupRow port:
0 = Integration Service does not update or insert the row in the cache.
1 = Integration Service inserts the row into the cache.
2 = Integration Service updates the row in the cache.

When the Integration Service reads a row, it changes the lookup cache depending on the results of the lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the NewLookupRow port to indicate if it inserts or updates the row in the cache, or makes no change.

Example of Dynamic Lookup Implementation

OK, I have designed a mapping to show the dynamic lookup implementation, and I have given a full screenshot of the mapping. Since the screenshot is slightly bigger, I link it below.



And here are the screenshots of the lookup. The Lookup Ports tab first,



And here is Dynamic Lookup Properties Tab



If you check the mapping screenshot, you will see that I have used a Router to reroute the INSERT group and the UPDATE group. The Router screenshot is also given below: new records are routed to the INSERT group and existing records are routed to the UPDATE group.

Router Transformation
Router Transformation Groups Tab


About the Sequence-ID

While using a dynamic lookup cache, we must associate each lookup/output port with an input/output port or a sequence ID. The Integration Service uses the data in the associated port to insert or update rows in the lookup cache. The Designer associates the input/output ports with the lookup/output ports used in the lookup condition.

When we select Sequence-ID in the Associated Port column, the Integration Service generates a sequence ID for each row it inserts into the lookup cache.
When the Integration Service creates the dynamic lookup cache, it tracks the range of values in the cache associated with any port using a sequence ID, and it generates a key for the port by incrementing the greatest existing sequence ID value by one when inserting a new row of data into the cache.

When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one and increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.


About the Dynamic Lookup Output Port

The lookup/output port output value depends on whether we choose to output old or new values when the Integration Service updates a row:
Output old values on update: The Integration Service outputs the value that existed in the cache before it updated the row.
Output new values on update: The Integration Service outputs the updated value that it writes in the cache. The lookup/output port value matches the input/output port value.
Note: We can configure to output old or new values using the Output Old Value On Update transformation property.


Handling NULL in dynamic LookUp

If the input value is NULL and we select the Ignore Null inputs for Update property for the associated input port, the input value does not equal the lookup value or the value out of the input/output port. When you select the Ignore Null property, the lookup cache and the target table might become unsynchronized if you pass null values to the target. You must verify that you do not pass null values to the target.

When you update a dynamic lookup cache and target table, the source data might contain some null values. The Integration Service can handle the null values in the following ways:

Insert null values: The Integration Service uses null values from the source and updates the lookup cache and target table using all values from the source.

Ignore Null inputs for Update property : The Integration Service ignores the null values in the source and updates the lookup cache and target table using only the not null values from the source.

If we know the source data contains null values, and we do not want the Integration Service to update the lookup cache or target with null values, then we need to check the Ignore Null property for the corresponding lookup/output port.

When we choose to ignore NULLs, we must verify that we output the same values to the target that the Integration Service writes to the lookup cache. We can configure the mapping based on the value we want the Integration Service to output from the lookup/output ports when it updates a row in the cache, so that the lookup cache and the target table do not become unsynchronized:

New values: Connect only lookup/output ports from the Lookup transformation to the target.
Old values: Add an Expression transformation after the Lookup transformation and before the Filter or Router transformation. Add output ports in the Expression transformation for each port in the target table, and create expressions to ensure that we do not output null input values to the target.

When we run a session that uses a dynamic lookup cache, the Integration Service compares the values in all lookup ports with the values in their associated input ports by default.

It compares the values to determine whether or not to update the row in the lookup cache. When a value in an input port differs from the value in the lookup port, the Integration Service updates the row in the cache.


But what if we don't want to compare all ports? We can choose the ports we want the Integration Service to ignore when it compares ports. The Designer only enables this property for lookup/output ports when the port is not used in the lookup condition. We can improve performance by ignoring some ports during comparison.


We might want to do this when the source data includes a column that indicates whether or not the row contains data we need to update. Select the Ignore in Comparison property for all lookup ports except the port that indicates whether or not to update the row in the cache and target table.

Note: We must configure the Lookup transformation to compare at least one port, otherwise the Integration Service fails the session when we ignore all ports.