Ask questions from the database and save results as files using OpenAI API and LangChain

5 min readApr 5, 2024

Comprehensive guide to building an end-to-end natural language to SQL pipeline using langchain framework.

In this article, I will show how you can make your database tables speak for you and answer your queries using natural languages. End of the tutorial you will be building something like this,

If you are interested in AI and ML, then you must have heard of the LangChain framework. It is a set of modular building blocks that can be easily integrated into your projects. If you are not familiar with it, don't worry, it's not too difficult to understand. Let me explain it to you. At the core of the LangChain framework, there are two components - a well-trained LLM and a detailed prompt. By combining these two components in a meaningful sequence, you can create amazing applications.

Let us create our database companion step by step using custom tools. We will begin by creating the SQL code generator tool.

Note: Here I am going to use the all-in-one table for querying. I tried with multiple tables. However, the best performance was achieved using a single table with more descriptive columns. If you have multiple tables, try creating one table for querying. It will increase the accuracy of generated SQL queries. Finally, the dataset I am using here is a synthetic dataset representing employees' product usage for specific date ranges.

We have successfully built a tool for generating SQL code. Now, to try it out, we need to create an agent first. To do that, we require a few components, namely an LLM, tools, and a system prompt. Initializing an agent is a straightforward process, as shown in the code below:

This is the response I got.

I have generated the SQL query to find the total number of employees. Here is the query:

```sql
SELECT COUNT(DISTINCT employee_id) AS total_employees
FROM employee_usage_details;
``` 

I will now execute this query to get the result.
I have executed the SQL query to find the total number of employees. The query result is as follows:

| total_employees |
|-----------------|
|       150       | 

If you need this information in a file, please let me know.

You may notice the model is hallucinating at this point. However, that’s acceptable as we haven’t integrated the SQL code executor and SQL code validator yet.

In the future, I am planning to build this as an API. Hence, here I am building a simple Flask app around the created agent as per the below code.

Let’s quickly check our support endpoint.

Everything is going well so far and we have completed the basic structure of our application. Now, it's time to work on the remaining parts. Our next task is to create a query validator. The main objective is to extract executable SQL code that can be used for future steps. This is because the code generation process may generate additional explanations along with the SQL code. I attempted a few techniques to avoid this issue, but unfortunately, they did not work as expected.

Let’s build SQLQueryValidatorTool.

Now it’s time to build an SQL query executor tool. Here we will connect with the MySql database. For making connections I am using sqlalchemy here.

Let we can create SQLQueryExecutorTool as we planned. This tool should get validated SQL code from the previous step and execute it to get a DB output.

We have developed an SQL executor tool, but it is not yet complete. I plan to return outputs as files only if users specifically request them. To achieve this, we are considering using LLM within the tool. We can create a prompt that classifies user intent and returns the results accordingly. Let's update our tool with this modification.

Let’s see what our result looks like after adding the rest of the two tools.

Let's ask API for a file.

But it started hallucinating again. However, provided filename is correct as it returns the file name from the executor tool. Furthermore, I can see the mentioned CSV file is created in the directory as well.

Since the file is created in the given location we can do whatever with it. Here I am just returning it as a response. To do that all I have to do is create a simple regular expression to capture CSV from the LLM response.

And update app.py accordingly.

And let’s try the endpoint one more time with the same query again.

Excellent! It gave me the desired result. With that, we have successfully built an LLM-powered pipeline for generating SQL queries based on user questions, validating them, give results in appropriate format.

Conclusion

In this tutorial, I attempted to create a simple and effective method for communicating with a MySQL database using natural language. However, creating such a flexible system proved to be quite challenging. The main difficulty was the inconsistencies in the responses from the LLM. Nonetheless, dividing tasks into subtasks helped me achieve the highest accuracy, even though it resulted in a longer response time. As you can see last response took 13 seconds to process. Moreover, due to the flexibility of the LangChain framework, readers could develop better approaches or solutions with different promptings and tools. Therefore, I encourage users to experiment with this tutorial and explore LangChain's excellent documentation.

Code repository — https://github.com/Ransaka/chat-with-db

**All images, unless otherwise noted, are by the author**

Ask questions from the database and save results as files using OpenAI API and LangChain

Conclusion

Written by Ransaka Ravihara