Integrating Generative AI into an ETL Tool

Iman J
May 13, 2024


Introduction

The integration of Generative AI into Extract, Transform, Load (ETL) tools marks a transformative leap in data processing capabilities. Generative AI, with its ability to understand, predict, and generate data autonomously, presents new opportunities for enhancing the efficiency and intelligence of ETL processes. This integration accelerates data transformation and enables organizations to extract more nuanced insights and make informed decisions far more quickly.

First, let’s take a look at the main use cases for text generation with generative AI:

1- Classification

2- Entity extraction

3- Summarization

4- Sentiment analysis

5- Question answering

6- Code generation

The first four items are essentially data transformations of one kind or another, producing derived views of the data that can be used for further processing or decision making!

In this article we are going to tackle the following problem: we are an insurance company, and we want to classify claims by type (e.g., ‘Home Insurance’ or ‘Auto Insurance’) and determine claim urgency from the claim description, all without writing a single line of code. We also want to integrate this classification into our ETL process.

All the claims are stored in a data store. If we are processing them in batches and already have an ETL tool in place, how can we classify these claims with foundation models?

Before getting to the classification of these claims, we first need a foundation model hosted somewhere to serve our purpose. There are various platforms that host these models: IBM Watsonx.ai, Amazon Bedrock, Azure OpenAI, OpenAI, Google Vertex AI, and many more let you host foundation models for consumption.

For this article the generative AI platform of choice is IBM Watsonx.ai and the ETL tool of choice is IBM DataStage; within DataStage we can also integrate with other generative AI platforms, as long as they provide an API for their generation process. Let’s take a quick look at these two offerings before getting into the integration. If you are already familiar with these two services, please jump to the integration section.

Overview of DataStage

IBM DataStage is a powerful data integration tool that forms part of the IBM InfoSphere suite. It is designed to allow businesses to design, develop, and execute various data integration tasks, ranging from simple data transfers to complex transformation and loading processes across multiple systems.

DataStage facilitates the collection, transformation, validation, and loading of large volumes of data, ensuring that businesses can handle the challenges of Big Data and analytics effectively. Its graphical interface and extensive library of pre-built functions simplify the process of creating data integration solutions, making it a preferred choice for organizations aiming to improve data accessibility, quality, and insights across diverse data landscapes.

DataStage is a no-code tool that allows data engineers to create and execute data pipelines. It supports a wide range of connectors and stages. Here are the steps to build a flow:

1- Start by dragging in a connector for the system where the source data sits (e.g., Db2, XML files, SQL Server, AWS S3, and many more) and configuring its connection parameters

2- Choose the transformations to apply to the data (e.g., join with another data source, merge, aggregate, and many more) by dragging and dropping one or more stages and configuring them. Various stages can be chained to create a pipeline.

3- Write the results to the destination by dragging and dropping one or more target connectors

DataStage can be deployed:

1- On premises

2- As a service (SaaS on IBM Cloud)

3- On Cloud Pak for Data, on premises or on any cloud

Overview of Watsonx.ai

IBM’s watsonx.ai stands at the forefront of generative AI, offering advanced tools for integrating foundation models tailored to enterprise needs. The platform not only accommodates traditional machine learning workflows but also introduces powerful generative AI capabilities that enhance data processing and AI solution development. Through the Prompt Lab, users can refine AI tasks such as text generation and data classification using customized prompts, ensuring outputs are specifically aligned with business objectives.

Additionally, IBM watsonx.ai provides a comprehensive suite of AI tools that support a diverse range of enterprise applications, from custom AI development to multi-cloud data management. The platform gives users access to IBM’s models as well as third-party and open-source models, offering a robust foundation for enterprises aiming to leverage AI technologies. This versatility allows businesses to enhance functionalities such as Q&A systems, generate synthetic data, or produce multilingual content, fostering innovation while ensuring data governance and security throughout the AI lifecycle.

In Watsonx.ai we have the Prompt Lab, which is a workspace in which we can author our prompts. In the Prompt Lab we can:

1- View example prompts for various use cases such as summarization, classification, extraction, and question answering

2- Interact with the LLM in different modes: chat, structured, and freeform

3- Select from the various foundation models supported out of the box

4- Toggle AI guardrails to remove harmful content from input and output

5- Configure various model parameters

6- Generate code (cURL, Python, and Node.js) for invoking the authored prompt

Integration of IBM DataStage and Watsonx.ai

Architecturally speaking, there are two systems interacting with each other. In addition, we have a data source for storing the claims and the classification results; DataStage will interact with these data sources.

For each system we need an administrator to set up and configure the service. Once that is completed, an AI engineer performs prompt engineering in Watsonx.ai to create the right prompts for the classification tasks, and a data engineer creates one or more DataStage flows that communicate with Watsonx.ai to perform the classifications.

We are using an enterprise-level generative AI platform and an enterprise-level ETL tool; therefore, we can be confident that APIs are available to communicate with both services. For example, we can run inference against any model available on Watsonx.ai through a REST call, provided we configure the right parameters when sending the request to the Watsonx.ai service.
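As a minimal sketch, assuming the us-south regional endpoint of the watsonx.ai text generation API (the model ID, project ID, and bearer token below are placeholders; a fuller version specific to our prompt appears in the integration section):

curl -X POST \
  "https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2023-05-29" \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model_id": "ibm/granite-13b-instruct-v2", "project_id": "YOUR_PROJECT_ID", "input": "Classify this claim : ...", "parameters": {"max_new_tokens": 20}}'

The response comes back as a JSON document with the generated text in its results field.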

The communication between Watsonx.ai and IBM DataStage happens over REST. The fastest way to provision the two services is in the cloud, since both have SaaS offerings.

Before getting into the nitty-gritty details of our integration, let’s examine the process. There are four main steps:

1- Prompt Engineering: In this step we develop our prompts. In our use case we want to perform two types of classification:

  • Classify claims by type
  • Classify claims by urgency

2- ETL Development: We identify, set up, and configure our source and target data stores, determine what other transformations will be applied to our data, and define the chain of transformations

3- Integration: We execute and integrate the stages, ensure the classifications are generated within the data pipeline, and address any problems along the way

4- Monitoring: We verify that the right data is generated, evaluate the quality of the generated outputs, and plan and apply corrective actions for the next iteration if adjustments are required

In this article we will cover steps 1–3 and leave step 4, monitoring, for another time.

1- Prompt Engineering

We need to author the right prompts first. As mentioned above we need two prompts:

  • Claim type classification prompt
  • Claim urgency classification prompt

First we navigate to the Prompt Lab in Watsonx.ai to get started. We have three different modes for authoring our prompt:

1- Chat

2- Structured

3- Free form

Let’s use freeform mode for claim classification by type. This is the prompt I have prepared:

Context: You work in an insurance company; your job is to classify insurance claims. You will read claims and classify them as 'auto insurance' or 'home insurance'.
input : {claim}
output : 'auto insurance' or 'home insurance'

You only print the insurance type and no other information.
for example
claim : 'Auto accident on 5th street minor damage to vehicle'
output : 'auto insurance'

for example
claim : 'Home basement flood from sump pump failure damaged furniture'
output : 'home insurance'

Classify this claim : {claim}

We can try it in the Prompt Lab and test the classification results.
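For example, a made-up claim like the following should come back with the expected label:

claim : 'Hail storm cracked the windshield while the car was parked outside'
output : 'auto insurance'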

I can also save this prompt in my workspace.

We will do the same for claim classification by urgency:

Context: You are an insurance agent. Your job is to detect urgency within claims. You will read claims and classify their urgency as 'Routine' or 'Immediate'.
input : {claim}
output : 'Routine' or 'Immediate'

You only print the urgency level and no other information.
for example
claim : 'Auto accident on 5th street minor damage to vehicle'
output : 'Routine'

for example
claim : 'Home basement flood from sump pump failure damaged furniture'
output : 'Immediate'

Classify this claim : {claim}

In this article we are more concerned with the mechanics of performing these tasks than with crafting a perfect prompt.

2- ETL Development

In IBM DataStage there is a stage called the REST stage, which a DataStage developer can use to make REST calls without writing a single line of code. We take advantage of this stage to configure REST calls to the Watsonx.ai service that classify our claims by type and urgency; we only need to understand how to configure the stage for our Watsonx.ai REST calls.

Before getting into the REST stage, let me describe the DataStage flow. As mentioned, a DataStage flow is composed of connectors and stages; for this exercise, I am using files as connectors and two different stages: Copy and REST. The mechanics of the flow are as follows:

1- The flow starts by connecting to the claims data, which is a .csv file (a sample of its shape appears just below this list)

2- A Copy stage makes two copies of this data, one for each of the two classification tasks we want to run:

  • Classify the claim type as “Home Insurance” or “Auto Insurance”
• Classify the claim urgency as “Routine” or “Immediate”

3- Claims are sent to the REST stages, in which we have configured all the parameters necessary to connect to a foundation model hosted on Watsonx.ai and classify each claim using the prompt engineered by the AI engineer

4- The result of each claim classification is then returned by the REST stage and stored in a file

The arrows represent where the data comes from and where it goes. Each stage receives data from the previous stage or a data source, performs some computation, and sends the data to the next stage or a target data source.
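For illustration, the claims .csv assumed in this walkthrough could look like the sample below (the column names and rows are invented; the two example claims are the ones used in the prompts):

CLAIM_ID,CLAIM_DESCRIPTION
1001,Auto accident on 5th street minor damage to vehicle
1002,Home basement flood from sump pump failure damaged furniture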

We can drag and drop the above stages from the palette and configure them to create the flow.

3- Integration

Now let’s look at the anatomy of the REST stage. If I open the Claim_Classification_by_watsonxai REST stage, I can see the parameters needed for this communication. We need to configure:

1- The endpoint for connecting to the Watsonx.ai service

2- An authentication token

3- The request parameters

This information can easily be retrieved from the Prompt Lab: there is a Code button with which we can view what the REST call should look like for the prompts we developed in the prompt engineering step above.

If you click the view code button, the generated curl command will look like the one below.
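Here is a sketched reconstruction of that generated command (the regional endpoint, model ID, and project ID are placeholders that Prompt Lab fills in with your actual values, and the full few-shot prompt is abbreviated in the input field):

curl -X POST \
  "https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2023-05-29" \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model_id": "ibm/granite-13b-instruct-v2",
    "project_id": "YOUR_PROJECT_ID",
    "input": "Context: You work in an insurance company... Classify this claim : {claim}",
    "parameters": {"decoding_method": "greedy", "max_new_tokens": 20, "repetition_penalty": 1}
  }'

Three parts of this command map directly onto the REST stage: the URL goes into the stage’s endpoint field, the Authorization header into its token field, and the -d payload into its request body, where the {claim} placeholder is made dynamic.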

On the Stage tab, under Requests, we can enter the URL from the curl command above. We also need to generate a bearer token and paste it into the space provided. Token generation can be automated as well, but for this demo I am generating the token manually.
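For reference, on IBM Cloud a bearer token can be minted from an API key via the IAM identity endpoint (a sketch; YOUR_API_KEY is a placeholder):

curl -X POST "https://iam.cloud.ibm.com/identity/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=YOUR_API_KEY"

The access_token field of the JSON response is what gets pasted into the stage. Note that these tokens expire (typically after about an hour), which is one more reason to automate this step.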

Now we can set up the request parameters. If we click on the Request tab, we have to configure the headers and the body of the request; for custom headers I will add the Accept and Content-Type headers.

We can use either static text or expressions for the body. In our case the body of each request is dynamic, because each claim to classify comes from a row in our data source and we want to use few-shot prompting for the classification. We select the “Use expression” checkbox above the input box and create a dynamic expression in the Derivation builder; a wide variety of functions is available for building dynamic values for every piece of the request.
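As an illustration only (the link and column names are hypothetical, and the exact functions offered depend on your DataStage version), the body expression might splice the claim column into the static JSON payload using DataStage’s ':' string-concatenation operator:

'{"model_id": "ibm/granite-13b-instruct-v2", "project_id": "YOUR_PROJECT_ID", "input": "Context: You work in an insurance company... Classify this claim : ' : Link_Claims.CLAIM_DESCRIPTION : '", "parameters": {"max_new_tokens": 20}}'

In a real flow you would also escape any double quotes or line breaks inside the claim text so that the resulting body remains valid JSON.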

We repeat the same steps for the urgency classification, and now we are ready for execution. We can compile and run our flow and see the results; there are two buttons at the top of the flow designer that let us compile and run the flow.

Once the execution succeeds, we can right-click on the file connectors and choose “Preview data”. This shows the results of our classifications: the claim type and urgency generated by Watsonx.ai for each claim.

Some of these claims are classified correctly and some incorrectly, which shows that we may need better prompt engineering, prompt tuning, or a different foundation model, and that there is definitely a need for AI governance.

AI governance ensures that we monitor our generated outputs and continuously iterate to produce high-quality results. We can take advantage of IBM Watsonx-Governance for this. Governance is out of scope for this article, and we will explore it in another one.

Resources

IBM DataStage: https://www.ibm.com/products/datastage

IBM Watsonx.ai: https://www.ibm.com/products/watsonx-ai

IBM Watsonx-Governance: https://www.ibm.com/products/watsonx-governance

IBM Watsonx.ai overview: https://www.ibm.com/docs/en/watsonx/saas?topic=overview-watsonx

Supported data sources in DataStage: https://dataplatform.cloud.ibm.com/docs/content/dstage/dsnav/topics/datastage-supported-conn.html?context=cpdaas

Supported stages in DataStage: https://dataplatform.cloud.ibm.com/docs/content/dstage/com.ibm.swg.im.iis.ds.parjob.dev.doc/topics/processingdata.html?context=cpdaas&audience=wdp

REST stage in DataStage: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.8.x?topic=stages-rest

AI Guardrails: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-hap.html?context=wx

Prompt Lab: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-prompt-lab.html?context=wx

Model parameters: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-model-parameters.html?context=wx

Prompt code: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-prompt-lab.html?context=wx#prompt-code
