On this post:

  • Introduction
  • HR Database
  • DVD Rental Retail Business Database

Introduction

Large Language Models (LLMs) are revolutionizing the way we interact with information. Their ability to understand and respond to natural language makes them ideal for bridging the gap between humans and complex data sources like databases. In this series, we’ve explored various applications of LLMs.

This article, part 5 of the LLM Series, dives into the exciting concept of “chatifying” your database.

We’ll showcase how to leverage LLMs to create a user-friendly chat conversation that allows you to interact with your MySQL database using natural language queries. By combining the power of LLMs with open-source tools like LangChain, sqlalchemy, ollama, and openchat, it will empower you to unlock the conversational potential of your data.

Important Note: It’s important to remember that this is a conceptual exploration and would require further development for production-grade deployment. We’ll be focusing on the core functionalities to demonstrate the potential of LLMs for database interaction.

HR Database

This database, created by Patrick Crews and Giuseppe Maxia, offers a rich dataset for testing and exploring database functionalities. It includes:

  • Large data volume: Approximately 160MB spread across six tables, totaling 4 million records.
  • Flexibility: Compatible with various storage engine types.
  • Partitioning support: Enables efficient storage and querying through a data file.
  • Data integrity checks: A built-in suite of tests verifies the data’s accuracy during initial loading and subsequent use. This helps ensure the database remains consistent throughout your exploration.

LangChain Code

This section details the code and prompts necessary for OpenChat to comprehend the MySQL database schema:

from langchain.agents import AgentType
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent
from sqlalchemy import create_engine
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import Ollama
import contextlib

template = """
Role: AI Assistant Expert
Department: HR, CEO Office

Primary Responsibility: You are a helpful AI assistant expert in identifying the relevant topic from user's question
about Departments, Employees, Employee's salary, department's manager, job titles and employee's titles and
then querying SQL Database to find answer.

Use following context to create the SQL query.
Context:
departments table includes information about company's departments such as a unique identifier
as dept_no and department name dept_name. If the input is about number of departments then count at departments table
employees table includes information about company's employees such as emp_no as a unique identifier,
birth_date birthday of employee, first_name & last_name employee's name, gender and hire_date employee's
recruitment date.

dept_emp table includes information on which employee (emp_no) works for which department (dept_no)
on specific time period (from_date to to_date).

salaries table Tracks employee salaries (emp_no, from_date, salary).
For example it says which employee emp_no from when from_date, how much salary he is earning based on salary field.
If to_date has value it could mean user either has another record to show a new salary which can be meaning
increase or decrease in salary. You can assume if there is any record for an employee, that employee
has got paid if date is valid.

You have access to all MySQL commands such as LAG, DESC.

If the input is about number of employees of each department, then join three tables of departments, employees
and dept_emp on key fields of dept_no and emp_no then count it.

If the input is about salary of employee, then you can look at salaries table on specific date for from_date but if
no date was mentioned then you can look at latest year using order by desc. The to_date can be only used for cases like
identifying when one employee has changes on his salary, for example if input is about: how many times one employee salary
changes, then you need to query the salaries table using emp_no and check how many rows exist that shows number of salary changes.

If input was about department and salary of department, you need to join three tables of salaries to employees on
emp_no field then also join with dept_emp on emp_no to find out which employees work on which department, then you can
calculate sum the total salary of each department for all its employees aligning the from_date. To find the department name,
you can join back with departments on dept_no field.

Question: {input}
"""
prompt = ChatPromptTemplate.from_template(template)
mysql_uri = 'mysql+mysqlconnector://root:local@localhost:3306/employees'
engine = create_engine(mysql_uri)
db = SQLDatabase(engine=engine)
print(db.get_usable_table_names())
print("Using local LLM, make sure you have installed Ollama (https://ollama.com/download) and have it running")
llm = Ollama(model="openchat", temperature=0)
agent_executor = create_sql_agent(llm, db=db, verbose=True, max_execution_time=80*60, agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,)

This code block sets up an AI agent that can answer questions about a MySQL database containing employee information. Here’s a breakdown of what each part does:

Importing Libraries:

  • langchain libraries are used to create and manage the AI agent.
  • SQLAlchemy helps connect to the MySQL database.
  • Ollama is the large language model (LLM) used to understand and respond to user queries.

Defining the Agent’s Role and Capabilities:

  • A multi-line string (template) describes the agent’s role as an expert assistant who can answer questions about departments, employees, salaries, and job titles.
  • It specifies that the agent will access a MySQL database named “employees” to retrieve answers.

Creating the Chat Prompt Template:

  • The ChatPromptTemplate class converts the multi-line string (template) into a structured format that the LLM can understand.

Setting Up the Large Language Model (LLM):

  • Instructions to install and run Ollama locally are displayed.
  • Ollama instance (llm) is created, specifying the “openchat” model and a temperature parameter (controls response randomness).

Creating the AI Agent:

  • create_sql_agent function builds the AI agent using the LLM (llm), database connection (db), and other settings.
  • verbose=True enables detailed logging during agent execution.
  • max_execution_time sets a time limit (80 minutes) for the agent to complete a query.
  • AgentType.ZERO_SHOT_REACT_DESCRIPTION specifies that the agent will react and answer questions based on the provided description (without additional training data).

LLM Responses

We’ve included some sample queries submitted to the LLM, along with its corresponding responses:

how many departments does the company have?
> Finished chain.
The company has 9 departments.

how many employees are there per each department in 1986?
Action: sql_db_query
Action Input: SELECT d.dept_name, COUNT(e.emp_no) AS num_employees FROM departments d JOIN dept_emp de ON d.dept_no = de.dept_no JOIN employees e ON de.emp_no = e.emp_no WHERE YEAR(de.to_date) = 1986 GROUP BY d.dept_name;

[('Customer Service', 45), ('Development', 176), ('Finance', 33), ('Human Resources', 40), ('Marketing', 36), ('Production', 139), ('Quality Management', 36), ('Research', 45), ('Sales', 100)]

What is the average salary for each department in 1998? Please show amount and department name.
Final Answer: [('Development', Decimal('59928.0651')), ('Sales', Decimal('81203.9922')), ('Production', Decimal('60078.0741')), ('Human Resources', Decimal('55827.7127')), ('Research', Decimal('60205.5592')), ('Quality Management', Decimal('57631.8585')), ('Marketing', Decimal('72255.3650')), ('Customer Service', Decimal('58401.9381')), ('Finance', Decimal('71193.3620'))]

What is my employee turnover rate in 1996?
To answer this question, I need to calculate the number of employees who left the company and the total number of employees at the beginning of 1996. Then, I will divide the number of employees who left by the total number of employees to find the turnover rate.
Final Answer: (199653 / 278133) * 100

What is average salary of employees who are older than 25 years?
[(Decimal('63810.7448'),)]

What is the average salary of employees for Finance department?
Final Answer: Approximately 70,489.36

DVD Rental Retail Business Database

Developed by by Mike Hillyer, this database is designed by to mimic a real-world scenario, the Sakila database provides a sample structure for a DVD rental store.

LangChain Code

This section details the code and prompts necessary for OpenChat to comprehend the MySQL database schema:

from langchain.agents import AgentType
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent
from sqlalchemy import create_engine
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import Ollama
import contextlib


template = """
Role: Your Friendly Movie Guide! & DVD Shop Renting Assistant **
Department: Customer Experience

Primary Responsibility: As your knowledgeable AI assistant, I excel in swiftly discerning the precise query
regarding movies. Whether it's identifying actors in specific films, checking DVD availability in our stock,
facilitating rentals, managing returns, or assisting in locating overdue rentals, count on me to deliver
seamless assistance with a touch of cinematic expertise."

You have access to all MySQL commands or python.

**What can I do for you today?**

I'm your one-stop shop for everything movie-related, with a dash of DVD rental expertise! Whether you're a
movie buff or a casual viewer, I can help you find what you're looking for:

Find Actors: Ask me who starred in a specific movie or explore an actor's filmography.
Browse Movies by Category: Discover new films in your favorite genres. (Let me know if this feature is available)
Check DVD Availability: See if the DVD you want is in stock at our store.
Rent a DVD: Rent a movie for a cozy night in.
Return a DVD: Bring back your rental and free up space for your next movie marathon.
Find Overdue Rentals: Don't worry, it happens! Let me help you locate any overdue rentals.


**Behind the Scenes:**

To provide all these services, I access a database with information about movies, actors, and our DVD inventory.
Here's a simplified breakdown of the tables I use:
Movies: Contains details like titles, descriptions, and release years.
Actors: Stores information about the actors and their roles in movies.
DVD Inventory: Tracks the DVDs available at our store and their rental status.
Customers: Manages customer information for rentals and returns.
Let's explore the world of movies together!

Use following context to create the SQL query.
**DVD Shop Rental Database Context:**

* The actor table lists information for all actors. The actor table is joined to the film table by means of the
film_actor table. To know a actor name or full name we need to use both first_name and last_name to construct
the full name.
Columns are: actor_id: A surrogate primary key used to uniquely identify each actor in the table. first_name:
The actor first name. last_name: The actor last name. last_update: When the row was created or most recently
updated.

* The address table contains address information for customers, staff, and stores. The address table primary key
appears as a foreign key in the customer, staff, and store tables.
Columns are:
address_id: A surrogate primary key used to uniquely identify each address in the table.
address: The first line of an address.
address2: An optional second line of an address.
district: The region of an address, this may be a state, province, prefecture, etc.
city_id: A foreign key pointing to the city table.
postal_code: The postal code or ZIP code of the address (where applicable).
phone: The telephone number for the address.
last_update: When the row was created or most recently updated.
location: A Geometry column with a spatial index on it.

* The category table lists the categories that can be assigned to a film. The category table is joined to the
film table by means of the film_category table.
Columns are:
category_id: A surrogate primary key used to uniquely identify each category in the table.
name: The name of the category.
last_update: When the row was created or most recently updated.

* The city table contains a list of cities. The city table is referred to by a foreign key in the address table
and refers to the country table using a foreign key.
Columns are:
city_id: A surrogate primary key used to uniquely identify each city in the table.
city: The name of the city.
country_id: A foreign key identifying the country that the city belongs to.
last_update: When the row was created or most recently updated.

* The country table contains a list of countries. The country table is referred to by a foreign key in the city table.
Columns:
country_id: A surrogate primary key used to uniquely identify each country in the table.
country: The name of the country.
last_update: When the row was created or most recently updated.

* The customer table contains a list of all customers. The customer table is referred to in the payment and rental
tables and refers to the address and store tables using foreign keys.
Columns:
customer_id: A surrogate primary key used to uniquely identify each customer in the table.
store_id: A foreign key identifying the customer “home store.” Customers are not limited to renting only from this store, but this is the store at which they generally shop.
first_name: The customer first name.
last_name: The customer last name.
email: The customer email address.
address_id: A foreign key identifying the customer address in the address table.
active: Indicates whether the customer is an active customer. Setting this to FALSE serves as an alternative to deleting a customer outright. Most queries should have a WHERE active = TRUE clause.
create_date: The date the customer was added to the system. This date is automatically set using a trigger during an INSERT.
last_update: When the row was created or most recently updated.

* The film table is a list of all films potentially in stock in the stores. The actual in-stock copies of each
film are represented in the inventory table. The film table refers to the language table and is referred to by
the film_category, film_actor, and inventory tables. Movie and Film have same meaning.
Columns:
film_id: A surrogate primary key used to uniquely identify each film in the table.
title: The title of the film.
description: A short description or plot summary of the film.
release_year: The year in which the movie was released.
language_id: A foreign key pointing at the language table; identifies the language of the film.
original_language_id: A foreign key pointing at the language table; identifies the original language of the film. Used when a film has been dubbed into a new language.
rental_duration: The length of the rental period, in days.
rental_rate: The cost to rent the film for the period specified in the rental_duration column.
length: The duration of the film, in minutes.
replacement_cost: The amount charged to the customer if the film is not returned or is returned in a damaged state.
rating: The rating assigned to the film. Can be one of: G, PG, PG-13, R, or NC-17.
special_features: Lists which common special features are included on the DVD. Can be zero or more of: Trailers, Commentaries, Deleted Scenes, Behind the Scenes.
last_update: When the row was created or most recently updated.

* The film_actor table is used to support a many-to-many relationship between films and actors. For each actor
in a given film, there will be one row in the film_actor table listing the actor and film.
The film_actor table refers to the film and actor tables using foreign keys.
Columns:
actor_id: A foreign key identifying the actor.
film_id: A foreign key identifying the film.
last_update: When the row was created or most recently updated.

* The film_category table is used to support a many-to-many relationship between films and categories.
For each category applied to a film, there will be one row in the film_category table listing the category and film.
The film_category table refers to the film and category tables using foreign keys.
Columns:
film_id: A foreign key identifying the film.
category_id: A foreign key identifying the category.
last_update: When the row was created or most recently updated.

* The film_text table contains the film_id, title and description columns of the film table, with the contents
of the table kept in synchrony with the film table by means of triggers on
film table INSERT, UPDATE and DELETE operations.

Columns:
film_id: A surrogate primary key used to uniquely identify each film in the table.
title: The title of the film.
description: A short description or plot summary of the film.

* The inventory table contains one row for each copy of a given film in a given store.
The inventory table refers to the film and store tables using foreign keys and is referred to by the rental table.

Columns:
inventory_id: A surrogate primary key used to uniquely identify each item in inventory.
film_id: A foreign key pointing to the film this item represents.
store_id: A foreign key pointing to the store stocking this item.
last_update: When the row was created or most recently updated.

* The language table is a lookup table listing the possible languages that films can have for their language
and original language values. The language table is referred to by the film table.

Columns:
language_id: A surrogate primary key used to uniquely identify each language.
name: The English name of the language.
last_update: When the row was created or most recently updated.

* The payment table records each payment made by a customer, with information such as the amount and the
rental being paid for (when applicable). The payment table refers to the customer, rental, and staff tables.
The payment amount is actually the store revenue. So if you were asked about revenue, you can assume payment amount
is the revenue.

Columns:
payment_id: A surrogate primary key used to uniquely identify each payment.
customer_id: The customer whose balance the payment is being applied to. This is a foreign key reference to the customer table.
staff_id: The staff member who processed the payment. This is a foreign key reference to the staff table.
rental_id: The rental that the payment is being applied to. This is optional because some payments are for outstanding fees and may not be directly related to a rental.
amount: The amount of the payment.
payment_date: The date the payment was processed.
last_update: When the row was created or most recently updated.

* The rental table contains one row for each rental of each inventory item with information about who
rented what item, when it was rented, and when it was returned.
The rental table refers to the inventory, customer, and staff tables and is referred to by the payment table.

Columns:
rental_id: A surrogate primary key that uniquely identifies the rental.
rental_date: The date and time that the item was rented.
inventory_id: The item being rented.
customer_id: The customer renting the item.
return_date: The date and time the item was returned.
staff_id: The staff member who processed the rental.
last_update: When the row was created or most recently updated.

* The staff table lists all staff members, including information for email address,
login information, and picture.
The staff table refers to the store and address tables using foreign keys, and is referred to by the rental,
payment, and store tables.

Columns:
staff_id: A surrogate primary key that uniquely identifies the staff member.
first_name: The first name of the staff member.
last_name: The last name of the staff member.
address_id: A foreign key to the staff member address in the address table.
picture: A BLOB containing a photograph of the employee.
email: The staff member email address.
store_id: The staff member “home store.” The employee can work at other stores but is generally assigned to the store listed.
active: Whether this is an active employee. If employees leave, their rows are not deleted from this table; instead, this column is set to FALSE.
username: The user name used by the staff member to access the rental system.
password: The password used by the staff member to access the rental system. The password should be stored as a hash using the SHA2() function.
last_update: When the row was created or most recently updated.

* The store table lists all stores in the system. All inventory is assigned to specific stores, and staff
and customers are assigned a “home store”. The store table refers to the staff and address tables using foreign
keys and is referred to by the staff, customer, and inventory tables.

Columns:
store_id: A surrogate primary key that uniquely identifies the store.
manager_staff_id: A foreign key identifying the manager of this store.
address_id: A foreign key identifying the address of this store.
last_update: When the row was created or most recently updated.


Question: {input}
"""

prompt = ChatPromptTemplate.from_template(template)
mysql_uri = 'mysql+mysqlconnector://root:local@localhost:3306/sakila'

engine = create_engine(mysql_uri)
db = SQLDatabase(engine=engine)
print(db.get_usable_table_names())

print("Using local LLM, make sure you have installed Ollama (https://ollama.com/download) and have it running")
llm = Ollama(model="openchat", temperature=0)

agent_executor = create_sql_agent(llm, db=db, verbose=True, max_execution_time=80*60, agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,)

LLM Responses

We’ve included some sample queries submitted to the LLM, along with its corresponding responses:

which film has highest number of rents?

SELECT f.film_id, f.title, COUNT(r.rental_id) AS num_rents

FROM film f

JOIN inventory i ON f.film_id = i.film_id

JOIN rental r ON i.inventory_id = r.inventory_id

GROUP BY f.film_id, f.title

ORDER BY num_rents DESC

LIMIT 1

which film has highest revenue?

SELECT 
f.film_id,
f.title,
SUM(p.amount) AS total_revenue
FROM
rental r
JOIN
payment p ON r.rental_id = p.rental_id
JOIN
film f ON r.inventory_id = f.film_id
GROUP BY
f.film_id,
f.title
ORDER BY
total_revenue DESC
LIMIT 1;

which store has highest revenue?

SELECT s.store_id, SUM(p.amount) AS total_revenue
FROM payment p
JOIN rental r ON p.rental_id = r.rental_id
JOIN inventory i ON r.inventory_id = i.inventory_id
JOIN store s ON i.store_id = s.store_id
GROUP BY s.store_id
ORDER BY total_revenue DESC
LIMIT 1;

Who are actors of ACADEMY DINOSAUR film?

SELECT a.actor_id, a.first_name, a.last_name
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
WHERE f.title = 'ACADEMY DINOSAUR';

who are our top customers?

SELECT c.customer_id, c.first_name, c.last_name, COUNT(r.rental_id) AS total_rentals
FROM customer c
JOIN rental r ON c.customer_id = r.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY total_rentals DESC;

Which customers have payment overdue?

SELECT DISTINCT c.customer_id, c.first_name, c.last_name, SUM(p.amount) AS total_overdue
FROM customer c
JOIN rental r ON c.customer_id = r.customer_id
LEFT JOIN payment p ON r.rental_id = p.rental_id AND p.payment_date < CURDATE()
WHERE r.return_date < CURDATE() OR p.payment_date IS NULL
GROUP BY c.customer_id, c.first_name, c.last_name;

Conclusion

In this article, we’ve explored the exciting potential of “chatifying” your MySQL database with Large Language Models (LLMs). We showcased how open-source tools like LangChain, sqlalchemy, ollama, and openchat can be combined to create a user-friendly chat for querying your data. By leveraging the power of LLMs, you can unlock a more intuitive way to interact with your databases, making it easier for everyone to access and analyze information.

Important Note: Remember that this is a foundational exploration, and further development would be required for production-grade deployment. However, it demonstrates the promising future of LLMs in bridging the gap between humans and complex data. As LLM technology continues to evolve, we can expect even more innovative ways to interact with and unlock the power of our data.

--

--