Data profiling and data quality AI Agents

Iman Johari
13 min read · Jan 27, 2025


Introduction

“Garbage in, garbage out!” This is something I have heard in pretty much every machine learning engagement I have been involved in over the last 9 years. As you may already know, low-quality data results in low-quality models, and eventually in low-quality results or decisions for data consumers.

To have a high-quality machine learning model, and now large language models, we need high-quality data. Data comes in different types and formats: structured, unstructured, audio, image, etc. The first thing every data scientist does with new data is explore it: understand the size and data type of each column; get a better sense of how the data is distributed; for numeric columns, find the min, max, average, etc.; for categorical data, understand the distinct values. All of this information falls under the definition of data profiling.

Then we want to evaluate the data based on its quality. Are there missing values? Are there duplicates? Is any data out of range? That's where we come up with rules to apply to the data to ensure it has a high quality score and we can use it for our purposes. These rules are often written manually by data quality engineers, and they depend on the platform or tools being used.
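As a minimal sketch of what one such manually written rule looks like, here is a range check on an `Age` column, using the same `hr_data` layout the article uses later (the sample rows and the pass/fail scoring are my own illustration, not part of the project):

```python
import sqlite3

# Hypothetical hand-written data quality rule: "Age must be between 1 and 65."
# Violations are counted and turned into a simple quality score.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hr_data (ID INTEGER, Age INTEGER)")
conn.executemany("INSERT INTO hr_data VALUES (?, ?)",
                 [(1, 34), (2, -5), (3, 51), (4, 130)])

total = conn.execute("SELECT COUNT(*) FROM hr_data").fetchone()[0]
violations = conn.execute(
    "SELECT COUNT(*) FROM hr_data WHERE Age NOT BETWEEN 1 AND 65"
).fetchone()[0]
conn.close()

quality_score = 1 - violations / total
print(f"{violations} of {total} rows violate the rule; score = {quality_score:.2f}")
# 2 of 4 rows violate the rule; score = 0.50
```

Multiply this by dozens of rules per table and dozens of tables, and you get the spreadsheets full of SQL described below.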

Usually these rules amount to hundreds or thousands of SQL queries kept in an Excel sheet. They could also be stored in various Python notebooks, with some reporting mechanism fed from these computations. In the case of IBM Knowledge Catalog (IKC), the rules are created using a low-code/no-code user interface, stored in IKC, and executed on a schedule; data stewards and data quality engineers can access these executions and monitor quality for data assets across various types of data sources.

In this article I am going to take advantage of AI agents to create a simple data pipeline to which we can connect a data source, perform data profiling, and run some data quality rules without writing a single line of code for those purposes. Of course, we still need to write some code for the agents themselves. The GitHub repository can be accessed here.

Agent Design

For this proof of concept, I am going to use CrewAI for my agentic solution. I used an LLM to generate a CSV file with about 1,000 rows of data containing various data quality issues. This data is stored in a SQLite data store using a Python function within the code.

import sqlite3
import pandas as pd

# Helper Function: Create SQLite Database
def create_sqlite_db(csv_path: str, db_path: str):
    """
    Load a CSV file into an SQLite database.

    Args:
        csv_path (str): Path to the CSV file.
        db_path (str): Path to the SQLite database.
    """
    conn = sqlite3.connect(db_path)
    df = pd.read_csv(csv_path)
    df.to_sql("hr_data", conn, index=False, if_exists="replace")
    conn.close()

These are the fields of the data: ID, Name, Age, Salary, JoinDate, Department.
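As an aside, if pandas is not available, an equivalent loader can be sketched with only the standard library (the table name `hr_data` matches the helper above; this variant is my own, not from the repository):

```python
import csv
import sqlite3

def create_sqlite_db_stdlib(csv_path: str, db_path: str):
    """Load a CSV file into the hr_data table using only the standard library."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)   # first row holds the column names
        rows = list(reader)
    conn = sqlite3.connect(db_path)
    cols = ", ".join(f'"{c}"' for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute("DROP TABLE IF EXISTS hr_data")
    conn.execute(f"CREATE TABLE hr_data ({cols})")
    conn.executemany(f"INSERT INTO hr_data VALUES ({placeholders})", rows)
    conn.commit()
    conn.close()
```

Note that, unlike pandas, this loader stores empty CSV fields as empty strings rather than NULLs, so NULL-based quality rules would need an extra conversion step.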

All the rules and statistics are calculated with SQL queries; no Python code is used to generate the profiling results or the data quality rules.

We will have 3 agents as shown below:

1- Profiling agent

This agent comes up with the data profiling and data quality SQL queries.

profiling_agent = Agent(
    role="Data Profiling Agent",
    goal="Generate SQL queries for profiling and data quality.",
    llm="gpt-4",
    verbose=True,  # Optional: for debugging purposes
    memory=True,
    tools=[schema_tool_instance],  # Add schema tool here
    backstory="An expert in crafting insightful SQL queries for data analysis.",
)

This agent uses one tool, schema_tool_instance (an instance of SQLiteSchemaTool below); the tool fetches the schema from the provided database path so that the generated SQL queries reference the right tables and columns.

import json
import sqlite3

class SQLiteSchemaTool:
    def __init__(self):
        self.name = "SQLiteSchemaTool"
        self.description = "Fetches table names and column names from an SQLite database."
        self.func = self.fetch_schema

    def fetch_schema(self, db_path: str) -> str:
        """
        Fetches the schema (tables and columns) from the SQLite database.
        """
        db_path = db_path or "database.db"  # fall back to the demo database
        try:
            conn = sqlite3.connect(db_path)
            cursor = conn.cursor()

            # Fetch table names
            cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
            tables = [row[0] for row in cursor.fetchall()]

            schema = {}
            for table in tables:
                # Fetch column names for each table
                cursor.execute(f"PRAGMA table_info({table});")
                columns = [row[1] for row in cursor.fetchall()]
                schema[table] = columns

            conn.close()
            return json.dumps(schema, indent=4)
        except Exception as e:
            return f"Error fetching schema: {str(e)}"
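The schema lookup can be exercised in isolation to see the shape of the JSON the profiling agent receives. Here is the same logic inlined against a throwaway in-memory database with the hr_data layout (demo of mine, not project code):

```python
import json
import sqlite3

# Build a throwaway database with the same layout as hr_data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hr_data (ID INTEGER, Name TEXT, Age INTEGER, "
    "Salary REAL, JoinDate TEXT, Department TEXT)"
)

# Same logic as SQLiteSchemaTool.fetch_schema, inlined for the demo.
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = [row[0] for row in cursor.fetchall()]
schema = {}
for table in tables:
    cursor.execute(f"PRAGMA table_info({table});")
    schema[table] = [row[1] for row in cursor.fetchall()]
conn.close()

print(json.dumps(schema))
# {"hr_data": ["ID", "Name", "Age", "Salary", "JoinDate", "Department"]}
```

This mapping of table names to column lists is what grounds the generated SQL in real identifiers instead of hallucinated ones.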

2- Execution agent

This agent runs the two types of queries created by the previous agent (profiling and data quality) and returns the results.

execution_agent = Agent(
    role="SQL Query Executor",
    goal="Execute SQL queries and return the results.",
    backstory="Specializes in running SQL queries and fetching results.",
    tools=[sqlite_query_tool],
)

This agent uses one tool, sqlite_query_tool (an instance of SQLiteQueryTool below), which executes the SQL queries generated by the data profiling agent.

class SQLiteQueryTool:
    def __init__(self):
        self.name = "SQLiteQueryTool"
        self.description = (
            "Executes SQL queries on a SQLite database. Provide `db_path` and `query`."
        )
        self.func = self.execute_query

    def execute_query(self, db_path: str, query: str) -> str:
        """
        Execute an SQL query on the provided SQLite database.
        """
        db_path = db_path or "database.db"  # fall back to the demo database
        print(f"Running query: {query}")
        try:
            conn = sqlite3.connect(db_path)
            cursor = conn.cursor()
            cursor.execute(query)
            rows = cursor.fetchall()
            columns = [description[0] for description in cursor.description]
            conn.close()
            result = [dict(zip(columns, row)) for row in rows]
            return json.dumps(result, indent=4)
        except Exception as e:
            return f"Error executing query: {str(e)}"
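The row-to-dict serialization at the heart of this tool can also be checked in isolation. Here it runs a null-count query of the kind the profiling agent generates, against a small in-memory table (sample data is mine):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hr_data (ID INTEGER, Age INTEGER)")
conn.executemany("INSERT INTO hr_data VALUES (?, ?)",
                 [(1, 30), (2, None), (3, None)])

# A query of the kind the profiling agent would generate.
query = (
    "SELECT COUNT(*) AS total_rows, "
    "SUM(CASE WHEN Age IS NULL THEN 1 ELSE 0 END) AS null_Age "
    "FROM hr_data;"
)

cursor = conn.cursor()
cursor.execute(query)
rows = cursor.fetchall()
columns = [d[0] for d in cursor.description]  # column names from the cursor
conn.close()

result = [dict(zip(columns, row)) for row in rows]
print(json.dumps(result))
# [{"total_rows": 3, "null_Age": 2}]
```

Returning a JSON list of dicts keeps the payload easy for the downstream presentation agent to turn into Markdown tables.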

The queries are passed in from the data profiling agent, which generates them.

3- Presentation Agent

This agent formats the results received from the execution agent and presents them in Markdown.

presentation_agent = Agent(
    role="Markdown Presenter",
    goal="Present data in a well-structured Markdown format.",
    backstory="An experienced Markdown formatter, skilled in converting raw data into beautifully formatted reports.",
)

Tasks

We are going to have 3 tasks:

1- Profiling task

The data profiling agent will create the SQL queries for data profiling as well as for data quality, based on the criteria below:

profiling_task = Task(
    description=(
        "First, use the `SQLiteSchemaTool` to fetch the schema (table names and column names) "
        "from the database. Then generate SQL queries to profile the data based on the schema. "
        "Ensure the queries are relevant and use the actual table and column names.\n"
        "The profiling queries should:\n"
        "1. Count the total number of records for each column.\n"
        "2. Summarize categorical columns (e.g., counts by category, distinct values, unique values by %, min length, max length, mean length).\n"
        "3. Calculate basic statistics (e.g., mean, median, min, max) for numerical columns.\n"
        "Then analyze the database to identify potential data quality issues. "
        "Focus on finding:\n"
        "1. Missing values (NULLs) in each column.\n"
        "2. Duplicate records.\n"
        "3. Outliers in numerical columns (e.g., values beyond 1.5 times the IQR).\n"
        "4. Inconsistent or invalid values (e.g., mismatched data types, incorrect categories).\n"
        "5. Violations of referential integrity if foreign key relationships exist.\n\n"
        "Provide SQL queries that specifically address these data quality checks."
    ),
    expected_output="A list of SQL queries for profiling and validation",
    agent=profiling_agent,
    output_file="profile.yaml",
    allow_failures=True,
)

2- Execution task

The execution agent will execute all the data profiling and data quality queries.

execution_task = Task(
    description=(
        "Use the `SQLiteQueryTool` to execute the SQL queries on the database from the profiling task. "
        "Provide the `db_path` and `query` arguments. For example, use the query: "
        "'SELECT COUNT(*) AS total_records FROM hr_data;'."
    ),
    expected_output="Query results in a YAML file.",
    agent=execution_agent,
    output_file="results.yaml",
    allow_failures=True,
)

3- Presentation task

This task formats the results and shows them to the user in Markdown.

presentation_task = Task(
    description="Format SQL query results into a Markdown report.",
    expected_output=(
        "A Markdown file containing:\n"
        "1. Query results as tables.\n"
        "2. A summary of findings from both the profiling and data quality tasks."
    ),
    agent=presentation_agent,
    output_file="report.md",
    allow_failures=True,
)
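With the agents and tasks defined, they are assembled into a crew and run end to end. This is a sketch following the standard CrewAI pattern (`Crew` and `kickoff` as in the CrewAI docs), not a verbatim copy of the repository code:

```python
from crewai import Crew

# Assemble the three agents and their tasks into a sequential pipeline.
crew = Crew(
    agents=[profiling_agent, execution_agent, presentation_agent],
    tasks=[profiling_task, execution_task, presentation_task],
    verbose=True,
)

result = crew.kickoff()  # runs profiling -> execution -> presentation
print(result)            # the final Markdown report from the presentation agent
```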

Results

I am going to present the results as a comparison with IBM IKC's data profiling feature and see how they differ.

IKC

With IKC, all I had to do was upload my CSV file into my workspace and click the profile button; the rest was taken care of. After a few minutes I got the profiling results in a user interface as shown below:

As you can see in the above image, each column is analyzed separately and profiling information is generated: data type, size, statistical information, as well as governance-related classifications. For example, Salary is identified as a quantity. This information can later be enriched with the client's own governance artifacts.

Within IKC, one can easily run the metadata enrichment functionality to execute the predefined data quality checks, which produce nine data quality dimensions without writing a single line of code.

Agents

Let's look at the results produced by each AI agent, taken from the output log:

1- Results of the data profiling task performed by the profiling agent:

# Agent: Data Profiling Agent
## Final Answer:
1. Counting total number of records for each column:
```sql
SELECT
COUNT(ID) AS total_ID,
COUNT(Name) AS total_Name,
COUNT(Age) AS total_Age,
COUNT(Salary) AS total_Salary,
COUNT(JoinDate) AS total_JoinDate,
COUNT(Department) AS total_Department
FROM data_table;
```

2. Summarizing categorical columns (Department):
```sql
SELECT
Department,
COUNT(*) AS count,
COUNT(DISTINCT Department) AS distinct_count,
ROUND(COUNT(DISTINCT Department) * 100.0 / COUNT(*), 2) AS unique_percentage,
MIN(LENGTH(Department)) AS min_length,
MAX(LENGTH(Department)) AS max_length,
AVG(LENGTH(Department)) AS mean_length
FROM data_table
GROUP BY Department;
```

3. Calculating basic statistics for numerical columns (Age, Salary):
```sql
SELECT
AVG(Age) AS mean_age,
MIN(Age) AS min_age,
MAX(Age) AS max_age,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Age) AS median_age,
AVG(Salary) AS mean_salary,
MIN(Salary) AS min_salary,
MAX(Salary) AS max_salary,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Salary) AS median_salary
FROM data_table;
```

4. Identify missing values (NULLs) in each column:
```sql
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN ID IS NULL THEN 1 ELSE 0 END) AS null_ID,
SUM(CASE WHEN Name IS NULL THEN 1 ELSE 0 END) AS null_Name,
SUM(CASE WHEN Age IS NULL THEN 1 ELSE 0 END) AS null_Age,
SUM(CASE WHEN Salary IS NULL THEN 1 ELSE 0 END) AS null_Salary,
SUM(CASE WHEN JoinDate IS NULL THEN 1 ELSE 0 END) AS null_JoinDate,
SUM(CASE WHEN Department IS NULL THEN 1 ELSE 0 END) AS null_Department
FROM data_table;
```

5. Detecting duplicate records:
```sql
SELECT
ID,
COUNT(*) AS count
FROM data_table
GROUP BY ID
HAVING COUNT(*) > 1;
```

6. Identifying outliers in numerical columns (Age, Salary):
```sql
WITH stats AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Age) AS Q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Age) AS Q3
FROM data_table
),
iqr AS (
SELECT
Q1,
Q3,
(Q3 - Q1) AS IQR
FROM stats
)
SELECT *
FROM data_table, iqr
WHERE
Age < (Q1 - 1.5 * IQR) OR Age > (Q3 + 1.5 * IQR);
```

7. Check for inconsistent or invalid values (simple example for Department):
```sql
SELECT
DISTINCT Department
FROM data_table
WHERE
Department NOT IN ('HR', 'IT', 'Sales', 'Marketing');
```

8. Violations of referential integrity (if foreign key relationships exist):
-- Assuming there is a foreign key relationship with another table named 'departments':
```sql
SELECT
dt.Department
FROM data_table dt
LEFT JOIN departments d ON dt.Department = d.Name
WHERE d.Name IS NULL;
```

2- Results of the execution task performed by the execution agent:

total_records:
total_ID: 1005
total_Name: 813
total_Age: 352
total_Salary: 1003
total_JoinDate: 484
total_Department: 848

department_summary:
- Department: null
count: 157
distinct_count: 0
unique_percentage: 0.0
min_length: null
max_length: null
mean_length: null
- Department: Engineering
count: 181
distinct_count: 1
unique_percentage: 0.55
min_length: 11
max_length: 11
mean_length: 11.0
- Department: HR
count: 186
distinct_count: 1
unique_percentage: 0.54
min_length: 2
max_length: 2
mean_length: 2.0
- Department: Marketing
count: 149
distinct_count: 1
unique_percentage: 0.67
min_length: 9
max_length: 9
mean_length: 9.0
- Department: Sales
count: 157
distinct_count: 1
unique_percentage: 0.64
min_length: 5
max_length: 5
mean_length: 5.0
- Department: Unknown
count: 175
distinct_count: 1
unique_percentage: 0.57
min_length: 7
max_length: 7
mean_length: 7.0

basic_statistics:
mean_age: 41.23
min_age: 18
max_age: 65
mean_salary: 77558.98
min_salary: -10000
max_salary: 999999

missing_values:
total_rows: 1005
null_ID: 0
null_Name: 192
null_Age: 653
null_Salary: 2
null_JoinDate: 521
null_Department: 157

duplicate_records:
- ID: 1234
count: 2
- ID: 1339
count: 2
- ID: 1408
count: 2
...
# truncated for brevity, but would include all duplicate IDs

outliers:
# No identified outliers based on the criteria checked

invalid_departments:
- Engineering
- Unknown

referential_integrity_violations:
# No check could be performed as the departments table does not exist.

3- Results of the presentation task performed by the presentation agent:

# SQL Query Results Report

## 1. Query Results

### A. Total Records Summary

| Field | Total Count |
|----------------|-------------|
| Total ID | 1005 |
| Total Name | 813 |
| Total Age | 352 |
| Total Salary | 1003 |
| Total Join Date| 484 |
| Total Department| 848 |

### B. Department Summary

| Department | Count | Distinct Count | Unique Percentage | Min Length | Max Length | Mean Length |
|-----------------|-------|----------------|-------------------|------------|------------|-------------|
| NULL | 157 | 0 | 0.00 | - | - | - |
| Engineering | 181 | 1 | 0.55 | 11 | 11 | 11.0 |
| HR | 186 | 1 | 0.54 | 2 | 2 | 2.0 |
| Marketing | 149 | 1 | 0.67 | 9 | 9 | 9.0 |
| Sales | 157 | 1 | 0.64 | 5 | 5 | 5.0 |
| Unknown | 175 | 1 | 0.57 | 7 | 7 | 7.0 |

### C. Basic Statistics

| Statistic | Value |
|-----------------|-------------|
| Mean Age | 41.23 |
| Min Age | 18 |
| Max Age | 65 |
| Mean Salary | 77558.98 |
| Min Salary | -10000 |
| Max Salary | 999999 |

### D. Missing Values

| Field | Count |
|----------------|-------|
| Null ID | 0 |
| Null Name | 192 |
| Null Age | 653 |
| Null Salary | 2 |
| Null Join Date | 521 |
| Null Department | 157 |

### E. Duplicate Records

| ID | Count |
|------|-------|
| 1234 | 2 |
| 1339 | 2 |
| 1408 | 2 |
| ... | ... |
| (truncated for brevity) | |

### F. Outliers

- No identified outliers based on the criteria checked.

### G. Invalid Departments

| Invalid Departments |
|---------------------|
| Engineering |
| Unknown |

### H. Referential Integrity Violations

- No check could be performed as the departments table does not exist.

## 2. Summary of Findings

### Profiling Tasks Summary
- The dataset contains **1005** total records, with significant proportions of missing values especially in the fields related to age and join date. Notably, **653** records are missing age data, representing approximately **65%** of the total records.

### Data Quality Tasks Summary
- Several departments have been identified as invalid, specifically **Engineering** and **Unknown**. This hints at potential issues in data entry or categorization.
- Duplicate records exist, particularly for IDs like **1234**, **1339**, and **1408**, suggesting a need for deduplication efforts.
- The analysis highlights a distinct lack of departments, with **157** records having no department assigned, affecting overall data integrity.

Overall, while there are some structural issues that require addressing, the basic statistics indicate a healthy mean salary and age within the parameters checked, suggesting a predominantly stable dataset in terms of employee demographics.

We can see that the small data pipeline we created with AI agents provides decent results, as long as each task is given a detailed enough description and the tools are abstract enough to perform all of the above computations. One big difference is that IKC provides a better user interface, which allows the user to drill down and get more detailed results, whereas with the AI agents we only have descriptive output. For agents to offer a better user experience, there is definitely a need for an intuitive user interface.

IKC also has various capabilities that make it easier to get a good understanding of the data without writing a single line of code, as well as providing a great user experience for utilizing and sharing the results.

Discussion

Cost

I first ran these agents on the OpenAI platform using GPT-4; depending on the model and the data size, one run costs from a few cents to a few dollars. Imagine configuring this for hundreds or thousands of data quality rules over large data sources: the costs accumulate quickly. An alternative is to invest in an on-prem AI platform such as watsonx.ai, where you can host the model in house and no longer worry about paying per call. In addition to having access to IBM models, you can bring your own model into the platform and run the pipeline using AI agents.

Integration

If you already have data governance in place, such as IKC, you can split the work: profiling is already taken care of by the platform. IKC has a built-in data profiling component that automatically generates nine different out-of-the-box data quality dimensions.

For data quality, however, we can integrate the agents with the platform: the AI agents generate the data quality rules, and we use the IKC API to create those rules within the platform, so we avoid the additional cost of executing them through agents. The platform can schedule and execute the rules, keep track of the executions, and store the results within the platform or in a third-party data source of your choice.

Data quality issue resolution

One opportunity for agents is data quality issue resolution: once data quality issues are identified, they can be passed to an AI agent that resolves them. For example, if age has to be between 1 and 65 and a value is negative, we could replace it with an empty value so that it does not skew the average age statistics.
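A minimal sketch of such a remediation step, assuming the hypothetical "Age must be between 1 and 65" rule (the agent would generate an UPDATE like this rather than us writing it by hand):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hr_data (ID INTEGER, Age INTEGER)")
conn.executemany("INSERT INTO hr_data VALUES (?, ?)",
                 [(1, 30), (2, -7), (3, 45)])

# Null out ages that violate the 1-65 range so they no longer skew averages.
conn.execute("UPDATE hr_data SET Age = NULL WHERE Age NOT BETWEEN 1 AND 65")
conn.commit()

# AVG ignores NULLs, so only the valid ages contribute.
avg_age = conn.execute("SELECT AVG(Age) FROM hr_data").fetchone()[0]
print(avg_age)  # 37.5
conn.close()
```

Any such automated fix should of course be logged and reviewed, since silently nulling values changes the dataset.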

Resources

GitHub: https://github.com/ijgitsh/DataProfiling_DataQuality_Agents

CrewAI : https://docs.crewai.com/introduction

IKC Profiling : https://dataplatform.cloud.ibm.com/docs/content/wsj/curation/profile-details.html?context=cpdaas

IKC predefined data quality checks(out of the box functionality) : https://dataplatform.cloud.ibm.com/docs/content/wsj/quality/predefined-dq-checks.html?context=cpdaas

IKC Data quality rule creation : https://dataplatform.cloud.ibm.com/docs/content/wsj/quality/manage-dq-rules.html?context=cpdaas#create

IKC Data quality api: https://cloud.ibm.com/apidocs/knowledge-catalog#create-rule-b8a502
