Exploring AI Agent data access and various implementation approaches
Introduction
With the rise of AI agents, applications are increasingly being built with AI-driven capabilities. These applications can serve a variety of purposes, from small demos showcasing AI agent functionalities to enterprise-level integrations enhancing mature software products. For example, an AI agent can be implemented within a data governance platform for data profiling. Alternatively, entirely new AI-driven applications can be developed from scratch.
Regardless of the scope, these applications may serve a single user or multiple users across different teams within a large enterprise. Users will often have varying levels of privilege regarding data access and AI capabilities. This raises several critical questions:
- Does the data consumer have the right to use data generated by AI agents during data processing?
- Does the data consumer have the right to access the computations performed by an AI agent if they do not have direct access to the underlying data?
- How can we effectively manage access control for AI computations and data across users with different privilege levels?
This article explores these questions by examining various scenarios and implementation approaches for managing AI agent data access.
Overview
The ecosystem of an AI Agent application typically consists of the following key players:
- Human Users — Interacting with AI agents, systems, and tools.
- Systems — Software interacting with data sources, APIs, and other systems.
- AI Agents — Entities that process information and interact with humans, other AI agents, and foundational AI models (LLMs).
- Data Sources — Repositories of structured and unstructured data accessed by AI agents and systems.
- LLMs — The foundational models that power AI agents.
Human users access data and information through AI agents, systems, and tools. AI agents access data through tools, and tools in turn access data through API calls, database calls, or direct access to structured or unstructured data on servers or in the cloud. In such an ecosystem, various access control measures must be enforced:
- Control access to the AI application (who can use the app).
- Control access to AI agents (which agents a user can interact with).
- Control access to data that agents retrieve for computations (prevent unauthorized data usage).
- Control access to computed outputs (ensure restricted users only see permitted information).
- Control access at the database level (table-level, row-level, or column-level restrictions).
A major challenge arises when data access in AI agent orchestration bypasses these restrictions. If an agent retrieves restricted data and presents it to a user without proper validation, it could lead to a data breach. Therefore, we need strict control mechanisms to ensure users only access the data they are authorized to see.
What checkpoints can we put in place to ensure that the end user always sees only what they are permitted to see?
The user may not know every source the agent is using, but we still want to make sure that everyone accesses only the data and information they are authorized for.
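One possible checkpoint is a final gate that looks at where each piece of collected data came from and drops anything the user is not cleared for. The sketch below assumes the agent pipeline tags every result with a source label. The `SourcedItem` type, the `filter_by_source` function, and the source names are hypothetical and not part of the implementation described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcedItem:
    """A piece of collected data tagged with the source it came from."""
    source: str   # hypothetical label, e.g. "internet", "database", "api"
    content: str


def filter_by_source(items, allowed_sources):
    """Keep only the items whose source the user is allowed to see."""
    return [item for item in items if item.source in allowed_sources]


collected = [
    SourcedItem("internet", "IBM Q4 press summary"),
    SourcedItem("database", "internal revenue figures"),
    SourcedItem("api", "latest IBM quote"),
]

# Assumed clearance for a restricted user: no database access.
visible = filter_by_source(collected, allowed_sources={"internet", "api"})
print([item.source for item in visible])
```

A gate like this only works if provenance is recorded reliably at collection time, so it is best seen as a complement to the access controls listed above rather than a replacement for them.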
Base Scenario
This scenario has two users, User A and User B. User A always has access to all resources (tools and data sources), while User B has limited access that depends on the resource type. Our application embeds two AI agents:
- A data collector
- A data presenter
# Data Collector Agent
data_collector = Agent(
    role="Data Collector",
    goal="Gather financial data from multiple sources",
    verbose=True,
    model="gpt4",
    memory=True,
    backstory="An AI researcher focused on collecting financial insights from various sources.",
    tools=[search_tool, query_finance_data, fetch_financial_data],
)

# Data Presenter Agent
data_presenter = Agent(
    role="Data Presenter",
    goal="Format and present financial data",
    verbose=True,
    model="gpt4",
    memory=True,
    backstory="An AI specializing in summarizing and formatting financial insights for reporting.",
)
The data collector agent uses 5 tools:
- Search the internet (Serper)
- Search the database
  - Query the full database
  - Query limited columns of a table
  - Query limited rows
- Call an API (Alpha Vantage API)
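import sqlite3

import requests
from crewai.tools import tool  # import path may differ in older CrewAI releases
from crewai_tools import SerperDevTool

# Initialize Internet Search Tool
search_tool = SerperDevTool()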
# SQLite Tool to query financial data with row level security
@tool("query row level filtered finance data")
def query_row_level_finance_data():
    """Queries an SQLite finance database with limited access and returns financial data at row level."""
    conn = sqlite3.connect("finance.db")
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM finance WHERE user_role = 'restricted'")
    data = cursor.fetchall()
    conn.close()
    return str(data)

# SQLite Tool to query financial data with limited data
@tool("query limited finance data")
def query_limited_finance_data():
    """Queries an SQLite finance database with limited access and returns financial data."""
    conn = sqlite3.connect("finance.db")
    cursor = conn.cursor()
    cursor.execute("SELECT company, stock_price FROM finance")
    data = cursor.fetchall()
    conn.close()
    return str(data)

# SQLite Tool to query full financial data
@tool("query finance data")
def query_finance_data():
    """Queries an SQLite finance database and returns financial data."""
    conn = sqlite3.connect("finance.db")
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM finance")
    data = cursor.fetchall()
    conn.close()
    return str(data)

# API Tool to fetch financial data
@tool("fetch_financial_data")
def fetch_financial_data():
    """Fetches financial data from Alpha Vantage API."""
    url = f"https://www.alphavantage.co/query?function=GLOBAL_QUOTE&symbol=IBM&apikey={ALPHA_VANTAGE_API_KEY}"
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    return "Failed to fetch financial data"
In the scenarios below, we vary User B's data access limitations to see how each one affects the implementation.
We first define the tasks that the scenarios will draw on.
# Tasks
internet_search_task = Task(
    description="Fetch IBM's Q4 results from the provided link.",
    expected_output="Summary of IBM's Q4 results.",
    tools=[search_tool],
    agent=data_collector,
)

database_query_task = Task(
    description="Retrieve financial data from the SQLite database.",
    expected_output="Extracted financial data from the database.",
    tools=[query_finance_data],
    agent=data_collector,
)

database_query_task_Limited = Task(
    description="Retrieve limited financial data from the SQLite database.",
    expected_output="Extracted financial data from the database.",
    tools=[query_limited_finance_data],
    agent=data_collector,
)

database_query_task_Limited_row = Task(
    description="Retrieve row level financial data from the SQLite database.",
    expected_output="Extracted financial data from the database.",
    tools=[query_row_level_finance_data],
    agent=data_collector,
)

api_fetch_task = Task(
    description="Fetch IBM stock data from Alpha Vantage API.",
    expected_output="Latest IBM stock data from API.",
    tools=[fetch_financial_data],
    agent=data_collector,
)

presentation_task = Task(
    description="Format and present the collected financial data in a structured format.",
    expected_output="A well-structured financial report with insights from all sources.",
    agent=data_presenter,
)

masking_task = Task(
    description="Format and present the collected financial data in a structured format. Mask revenue and net income information from the results.",
    expected_output="A well-structured financial report with insights from all sources.",
    agent=data_presenter,
)
User Roles and Identity Management
Instead of defining roles directly in the AI agent system, we can leverage an identity management system (e.g., AWS IAM, Okta, Keycloak) to manage users and their permissions. The AI agent can retrieve user roles dynamically through an authentication system.
A sample user_roles.yaml file could look like this:
users:
  - username: user_a
    roles:
      - admin
      - full_data_access
  - username: user_b
    roles:  # comment out rows to match each scenario in the article
      - restricted
      - limited_api_access
      - restricted_db
      - row_restricted
      - mask_data
The AI system can load this configuration and enforce access restrictions accordingly. The scenarios that follow build on it, and their full code is in the GitHub repository listed under Resources.
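One way to turn the roles in this file into behavior is a small policy table that maps each role to the collection tasks it removes or adds. The sketch below is an illustration under assumptions: `BASE_TASKS`, `ROLE_POLICY`, and `select_tasks` are hypothetical names, task names are plain strings so the example runs without CrewAI, and the meaning given to each role is inferred from the role names rather than taken from an official specification.

```python
# Hypothetical policy table: which collection tasks each role removes or adds.
# Task names are plain strings here so the sketch runs without CrewAI.
BASE_TASKS = ["internet_search", "database_query", "api_fetch"]

ROLE_POLICY = {
    "restricted": {"remove": ["database_query"], "add": []},
    "limited_api_access": {"remove": ["api_fetch"], "add": []},
    "restricted_db": {"remove": ["database_query"], "add": ["database_query_limited"]},
}


def select_tasks(roles):
    """Return the ordered list of collection task names allowed for the roles."""
    tasks = list(BASE_TASKS)
    for role in roles:
        policy = ROLE_POLICY.get(role)
        if policy is None:
            continue  # roles without an entry (e.g. admin) change nothing
        tasks = [t for t in tasks if t not in policy["remove"]]
        for task in policy["add"]:
            if task not in tasks:
                tasks.append(task)
    return tasks


print(select_tasks(["admin", "full_data_access"]))
print(select_tasks(["restricted", "limited_api_access"]))
```

Keeping the policy in one table makes it easier to review than a chain of conditionals spread through the application, at the cost of one more structure to keep in sync with the identity provider.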
Implementation and Scenarios
Scenario 1: Unrestricted Access
In this scenario, User A has full access to all resources (tools and data sources), while User B has limited access. This setup assumes no restrictions on data retrieval and computation.
import yaml
from crewai import Agent, Crew, Process, Task

def get_user_roles(username):
    """Return the roles listed for a username in user_roles.yaml."""
    with open('user_roles.yaml', 'r') as file:
        users = yaml.safe_load(file)['users']
    for user in users:
        if user['username'] == username:
            return user['roles']
    return []

current_user = "user_a"
roles = get_user_roles(current_user)

# User A is unrestricted, so the crew runs every task defined earlier.
crew = Crew(
    agents=[data_collector, data_presenter],
    tasks=[internet_search_task, database_query_task, api_fetch_task, presentation_task],
    process=Process.sequential
)

# Run the Crew
result = crew.kickoff()
print(result)
Here is what the results look like with this option.
Scenario 2
User B must not have access to the database. This scenario ensures that users without the necessary privileges cannot retrieve data directly from the database, limiting them to alternative data sources.
In many enterprise applications, database access can be controlled in different ways:
1. Service Account Model — The entire application uses a single service account that has access to all the data required by the application. This approach simplifies database connections but does not allow granular user-based access control, meaning all users inherit the same access level.
2. User Privilege Validation — The system checks individual user access rights before allowing database queries. This can be implemented using:
   - Role-based access control (RBAC) within the database, where users are assigned specific roles that dictate which tables, rows, or queries they can execute.
   - A middleware layer that enforces access policies before database queries are executed, ensuring unauthorized users cannot retrieve restricted data.
3. Row-Level Security — The database enforces policies at the row level, so different users can see different subsets of data based on their access rights.
4. Proxy Database Access — Instead of direct access, users interact with a controlled API layer that enforces security policies dynamically based on user roles.
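The middleware option in the list above can be sketched as a function that checks the caller's roles before any query reaches the database. This is a minimal illustration, not the article's implementation: `AccessDenied` and `query_finance` are hypothetical names, and the in-memory table is an assumed stand-in for `finance.db`.

```python
import sqlite3


class AccessDenied(Exception):
    """Raised when the caller's roles do not permit the operation."""


def query_finance(conn, roles):
    """Run the finance query only if the caller holds no blocking role."""
    if "restricted" in roles:  # role name taken from user_roles.yaml
        raise AccessDenied("database access is not permitted for this user")
    return conn.execute("SELECT company, stock_price FROM finance").fetchall()


# In-memory stand-in for finance.db; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE finance (company TEXT, stock_price REAL)")
conn.execute("INSERT INTO finance VALUES ('IBM', 180.5)")

print(query_finance(conn, roles=["admin"]))
try:
    query_finance(conn, roles=["restricted"])
except AccessDenied as err:
    print("denied:", err)
```

Because the check sits in front of the query rather than inside an agent prompt, the restriction does not depend on the model following instructions.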
from crewai import Crew, Process

current_user = "user_b"
roles = get_user_roles(current_user)

crew = Crew(
    agents=[data_collector, data_presenter],
    tasks=[internet_search_task, database_query_task, api_fetch_task, presentation_task],
    process=Process.sequential
)

# Users with the "restricted" role get no database task at all.
if "restricted" in roles:
    crew.tasks.remove(database_query_task)

# Run the Crew
result = crew.kickoff()
print(result)
Scenario 3
User B must not have access to specific external APIs. API access restrictions can be enforced in several ways:
- API Gateway Policies — API gateways can enforce role-based access control (RBAC), ensuring only privileged users can make API calls.
- Middleware Layer — The application can have a middleware that validates user roles before allowing API requests.
- Token-Based Access Control — API authentication tokens can include user permissions, restricting which endpoints a user can call.
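The token-based option above can be sketched as a scope check that runs before the API tool is invoked. The example assumes the token has already been verified and decoded by the authentication layer. `ENDPOINT_SCOPES`, `can_call_endpoint`, the scope string, and the claim layout are all hypothetical.

```python
# Hypothetical mapping from API function to the scope a caller must hold.
ENDPOINT_SCOPES = {
    "GLOBAL_QUOTE": "market_data:read",
}


def can_call_endpoint(token_claims, endpoint):
    """Check whether the token's scopes cover the requested endpoint."""
    required_scope = ENDPOINT_SCOPES.get(endpoint)
    if required_scope is None:
        return False  # unknown endpoints are denied by default
    return required_scope in token_claims.get("scopes", [])


# Claims as they might look after the token has been verified and decoded.
user_a_claims = {"sub": "user_a", "scopes": ["market_data:read"]}
user_b_claims = {"sub": "user_b", "scopes": []}

print(can_call_endpoint(user_a_claims, "GLOBAL_QUOTE"))  # True
print(can_call_endpoint(user_b_claims, "GLOBAL_QUOTE"))  # False
```

Denying unknown endpoints by default means a newly added API tool stays unavailable until someone assigns it a scope.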
current_user = "user_b"
roles = get_user_roles(current_user)

crew = Crew(
    agents=[data_collector, data_presenter],
    tasks=[internet_search_task, database_query_task, api_fetch_task, presentation_task],
    process=Process.sequential
)

if "restricted" in roles:
    crew.tasks.remove(database_query_task)
if "limited_api_access" in roles:
    crew.tasks.remove(api_fetch_task)

# Run the Crew
result = crew.kickoff()
print(result)
Scenario 4
User B should not have access to specific tables in the database, or, as in the example tool used here, to specific columns of a table. Table-level access control can be implemented using:
- Database Permissions — The database itself enforces restrictions based on user roles.
- Query Proxy Layer — A middleware restricts access to unauthorized tables before executing queries.
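The query proxy option above can be sketched as an allowlist that is consulted before any SQL is built. This is an illustration under assumptions: `ALLOWED_COLUMNS` and `proxy_select` are hypothetical names, and the table schema (including the `revenue` and `net_income` columns) is assumed for the example.

```python
import sqlite3

# Hypothetical allowlist: role -> table -> permitted columns.
ALLOWED_COLUMNS = {
    "restricted_db": {"finance": ["company", "stock_price"]},
    "full_data_access": {
        "finance": ["company", "stock_price", "revenue", "net_income"],
    },
}


def proxy_select(conn, role, table, columns):
    """Run a SELECT only if the role may read every requested column."""
    if not columns:
        raise ValueError("at least one column must be requested")
    permitted = ALLOWED_COLUMNS.get(role, {}).get(table)
    if permitted is None:
        raise PermissionError(f"role {role!r} may not read table {table!r}")
    blocked = [c for c in columns if c not in permitted]
    if blocked:
        raise PermissionError(f"role {role!r} may not read columns {blocked}")
    # Identifiers were checked against the allowlist, so they are safe to embed.
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    return conn.execute(sql).fetchall()


# In-memory stand-in for finance.db; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE finance (company TEXT, stock_price REAL, revenue REAL, net_income REAL)"
)
conn.execute("INSERT INTO finance VALUES ('IBM', 180.5, 62.8, 6.0)")

print(proxy_select(conn, "restricted_db", "finance", ["company", "stock_price"]))
```

Requesting `revenue` with the `restricted_db` role raises `PermissionError` before any query is executed, which keeps the decision out of the agent's hands.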
current_user = "user_b"
roles = get_user_roles(current_user)

crew = Crew(
    agents=[data_collector, data_presenter],
    tasks=[internet_search_task, database_query_task, api_fetch_task, presentation_task],
    process=Process.sequential
)

if "restricted" in roles:
    crew.tasks.remove(database_query_task)
if "limited_api_access" in roles:
    crew.tasks.remove(api_fetch_task)
if "restricted_db" in roles:
    # Insert before the presentation task so the report includes this data.
    crew.tasks.insert(crew.tasks.index(presentation_task), database_query_task_Limited)

# Run the Crew
result = crew.kickoff()
print(result)
Here is what the results look like with this option.
Scenario 5
User B must not have access to certain rows in a database table. Row-level security can be implemented by:
- Database Row-Level Policies — The database enforces access policies per user.
- Application-Level Filtering — The application filters query results before returning data.
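Application-level filtering, the second option above, can be sketched as a parameterized query that returns only rows labeled for the caller's roles. The example assumes each row carries a `user_role` label, as the row-level tool's query suggests. The function name `query_rows_for_roles` and the sample rows are hypothetical.

```python
import sqlite3


def query_rows_for_roles(conn, roles):
    """Return only rows whose user_role label matches one of the caller's roles."""
    if not roles:
        return []  # no roles means no visible rows
    placeholders = ", ".join("?" for _ in roles)
    sql = (
        "SELECT company, stock_price FROM finance "
        f"WHERE user_role IN ({placeholders}) ORDER BY company"
    )
    # Role values are bound as parameters, never formatted into the SQL text.
    return conn.execute(sql, list(roles)).fetchall()


# In-memory stand-in for finance.db; schema and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE finance (company TEXT, stock_price REAL, user_role TEXT)")
conn.executemany(
    "INSERT INTO finance VALUES (?, ?, ?)",
    [("IBM", 180.5, "restricted"), ("ACME", 42.0, "admin")],
)

print(query_rows_for_roles(conn, ["restricted"]))
```

Databases that support native row-level security policies can enforce the same rule closer to the data, which is generally preferable when it is available.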
current_user = "user_b"
roles = get_user_roles(current_user)

crew = Crew(
    agents=[data_collector, data_presenter],
    tasks=[internet_search_task, database_query_task, api_fetch_task, presentation_task],
    process=Process.sequential
)

if "restricted" in roles:
    crew.tasks.remove(database_query_task)
if "limited_api_access" in roles:
    crew.tasks.remove(api_fetch_task)
if "restricted_db" in roles:
    # Insert before the presentation task so the report includes this data.
    crew.tasks.insert(crew.tasks.index(presentation_task), database_query_task_Limited)
if "row_restricted" in roles:
    # Swap the column-limited query for the row-limited one, if it was added.
    if database_query_task_Limited in crew.tasks:
        crew.tasks.remove(database_query_task_Limited)
    crew.tasks.insert(crew.tasks.index(presentation_task), database_query_task_Limited_row)

# Run the Crew
result = crew.kickoff()
print(result)
Scenario 6
User B must not have access to certain information from the AI-generated results. This can be enforced by:
- Data Masking — Sensitive results are masked or redacted before being presented.
- Post-Processing Filters — The AI agent filters restricted outputs before display.
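Masking done purely through a task prompt depends on the model following the instruction, so a deterministic post-processing filter may be a useful complement. The sketch below assumes results are available as dictionaries before presentation. `SENSITIVE_FIELDS`, `mask_record`, and the sample record are hypothetical, with the field names taken from the masking task's description.

```python
SENSITIVE_FIELDS = {"revenue", "net_income"}  # fields named in the masking task


def mask_record(record, roles):
    """Return a copy of record with sensitive fields redacted for masked roles."""
    if "mask_data" not in roles:
        return dict(record)
    return {
        key: "***" if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


# Sample record with made-up values, for illustration only.
record = {"company": "IBM", "stock_price": 180.5, "revenue": 62.8, "net_income": 6.0}

print(mask_record(record, roles=["mask_data"]))
print(mask_record(record, roles=["admin"]))
```

The function returns a new dictionary in both branches, so the original record is never modified and can still be used for users with full access.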
current_user = "user_b"
roles = get_user_roles(current_user)

crew = Crew(
    agents=[data_collector, data_presenter],
    tasks=[internet_search_task, database_query_task, api_fetch_task, presentation_task],
    process=Process.sequential
)

if "restricted" in roles:
    crew.tasks.remove(database_query_task)
if "limited_api_access" in roles:
    crew.tasks.remove(api_fetch_task)
if "restricted_db" in roles:
    crew.tasks.append(database_query_task_Limited)
if "row_restricted" in roles:
    crew.tasks.remove(database_query_task_Limited)
    crew.tasks.append(database_query_task_Limited_row)
if "mask_data" in roles:
    crew.tasks.remove(presentation_task)
    crew.tasks.append(masking_task)
# print(crew.tasks)

# Run the Crew
result = crew.kickoff()
print(result)
Here is what the output looks like with masked data.
Conclusion
AI agents introduce complex challenges in data access management, especially in multi-user environments with varying privileges. By implementing proper access control mechanisms — ranging from user authentication and API security to database filtering and AI response moderation — we can ensure that AI-driven applications remain secure and compliant.
Resources
GitHub: https://github.com/ijgitsh/data-access-ai-agents
Securing Generative AI Architecture: https://medium.com/@manavg/securing-generative-ai-architecture-74f48e74b3e3
CrewAI: https://www.crewai.com/
Top Ten Risks and Mitigations for LLMs and Gen AI Apps: https://genai.owasp.org/llm-top-10/#