QuickBooks App Update

Recently, I updated the capabilities of my QuickBooks Data Fetcher application.

In a nutshell, my application fetches data from the QuickBooks API and uses an LLM to extract data from it given a natural-language query. The issue I ran into at first was that QuickBooks offers many different data structures you can query, called "entities": Bills, Customers, Departments, and many more. I was using a tool called PandasLLM that could query a pandas DataFrame with a natural-language query, but that meant I had to convert the JSON data QuickBooks returned into DataFrame format before querying it. So, for every entity I wanted my app to be able to query, I would need a unique parser. For example, here is my code for parsing the JSON returned from the Bill and Purchase entities:

import pandas as pd

def parseJson(json_data, query_entity):
    data = json_data.get('QueryResponse', {}).get(query_entity)
    if data is None:
        return "Failed to retrieve entity"

    # Each entity type gets its own fixed set of columns
    if query_entity == "Bill":
        df = pd.DataFrame(columns=['EntityType', 'VendorName', 'Amount', 'DueDate',
                                   'CustomerName', 'BillableStatus', 'TxnDate'])
    elif query_entity == "Purchase":
        df = pd.DataFrame(columns=['EntityType', 'Name', 'Amount',
                                   'CustomerName', 'BillableStatus', 'TxnDate'])
    else:
        return f"ERROR: Entity not valid: {query_entity}"

    for entity in data:
        for line in entity.get('Line', []):
            # Line details live under one of two keys depending on the expense type
            details = line.get('AccountBasedExpenseLineDetail', {})
            if not details:
                details = line.get('ItemBasedExpenseLineDetail', {})
            if not details:
                return "Failed to retrieve details"

            if query_entity == "Bill":
                new_row = pd.DataFrame({'EntityType': ['Bill'],
                        'VendorName': [entity.get('VendorRef', {}).get('name', 'Unknown')],
                        'Amount': [line.get('Amount')],
                        'DueDate': [entity.get('DueDate', 'Unknown')],
                        'CustomerName': [details.get('CustomerRef', {}).get('name', 'Unknown')],
                        'BillableStatus': [details.get('BillableStatus', 'Unknown')],
                        'TxnDate': [entity.get('TxnDate', 'Unknown')]})
            else:  # Purchase
                new_row = pd.DataFrame({'EntityType': ['Purchase'],
                        'Name': [entity.get('EntityRef', {}).get('name', 'Unknown')],
                        'Amount': [line.get('Amount')],
                        'CustomerName': [details.get('CustomerRef', {}).get('name', 'Unknown')],
                        'BillableStatus': [details.get('BillableStatus', 'Unknown')],
                        'TxnDate': [entity.get('TxnDate', 'Unknown')]})

            df = pd.concat([df, new_row], ignore_index=True)

    # Convert date columns once, after all rows are collected
    if query_entity == "Bill":
        df['DueDate'] = pd.to_datetime(df['DueDate'], errors='coerce')
    df['TxnDate'] = pd.to_datetime(df['TxnDate'], errors='coerce')

    return df

This parser is compatible with the structure of the Bill and Purchase entities from QuickBooks, but it would not work on most of the others.

This design limited me in two ways:

  1. It required a separate parser for every entity type, which would have been extremely time-consuming to write
  2. It meant the DataFrame returned by each parser was static, and couldn't change depending on the query

This is why my application was only able to query bills and purchases at first.

Recently, I changed the design of my application so that it can handle more types of queries and entities. Essentially, I made it more "flexible". Instead of statically converting every JSON object into a DataFrame, I built an LLM tool that applies the natural-language query directly to the JSON. The beauty of this is that the tool works with different JSON structures, meaning it is compatible with any entity. Doing this completely removed the need to convert the JSON into a DataFrame.
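To make that concrete, for a query like "What is the total amount per vendor across all bills?", the tool might generate extraction code along these lines (a hypothetical example; it reads the raw JSON from a json_data variable and leaves its output in result, which are the conventions my tool expects):

# Hypothetical LLM-generated extraction code
totals = {}
for bill in json_data.get("QueryResponse", {}).get("Bill", []):
    vendor = bill.get("VendorRef", {}).get("name", "Unknown")
    totals[vendor] = totals.get(vendor, 0) + bill.get("TotalAmt", 0)

result = [{"VendorName": v, "TotalAmount": t} for v, t in totals.items()]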

Doing things like this with LLMs can be done using Python's exec() function. exec() runs any code you give it inside its own environment, so you can have the LLM generate code for you, plug it into exec(), and use the results of the LLM's code in your own program. This is how both PandasLLM and my JSON processor work.
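Here is a minimal sketch of the pattern in isolation, with toy data and a hard-coded stand-in for the LLM's output, just to show the mechanics:

# A stand-in for code an LLM would generate in response to a query
generated_code = 'result = [{"vendor": v["name"], "total": v["amount"]} for v in json_data["vendors"]]'

# Expose the input through the local namespace, then pull the output back out
local_vars = {"json_data": {"vendors": [{"name": "Acme", "amount": 120.0}]}}
exec(generated_code, {}, local_vars)
print(local_vars["result"])   # [{'vendor': 'Acme', 'total': 120.0}]

Here is how I applied it in my code: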

def process_json(self, json_data: Dict[str, Any], query: str) -> pd.DataFrame:
    """
    Process JSON data using GPT to extract relevant information based on the query.

    Args:
        json_data: Raw JSON data from QuickBooks
        query: User's natural language query

    Returns:
        pandas.DataFrame containing the relevant data
    """
    # Create a prompt that explains the task and includes both the query and JSON
    prompt = self._create_extraction_prompt(json_data, query)

    # Get GPT to generate Python code to extract the data
    response = self.client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": """You are a data extraction expert.
            Given a JSON structure and a query, generate Python code that will extract
            the relevant data and return it as a list of dictionaries that can be
            converted to a pandas DataFrame. The code should not have a variable
            containing the json. The function should take as input a variable named json_data
            Only return the Python code, no explanations."""},
            {"role": "user", "content": prompt}
        ]
    )

    # Get the generated code
    extraction_code = self._clean_code(response.choices[0].message.content)
    # print("Code:", extraction_code)

    try:
        # Execute the generated code with the JSON data
        local_vars = {"json_data": json_data}
        exec(extraction_code, {"pd": pd, "datetime": datetime}, local_vars)

        # Convert the extracted data to a DataFrame
        if "result" in local_vars and isinstance(local_vars["result"], list):
            return pd.DataFrame(local_vars["result"])
        else:
            raise ValueError("Generated code did not produce a valid result")

    except Exception as e:
        print(f"Error executing generated code: {e}")
        # Fallback: Try a simpler extraction approach
        return self._fallback_extraction(json_data, query)

Making sure the code actually works requires a little clever prompt engineering and some preprocessing of the LLM-generated code, but once the code can be run reliably, it works like a charm. For the cases where it doesn't, I created a hard-coded fallback function, so it isn't subject to the same variability as AI-generated code.
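The preprocessing is mostly about stripping the markdown fences the model likes to wrap around its code. The _clean_code helper isn't shown above, but it does something along these lines (a sketch, not the exact implementation):

import re

def clean_code(raw: str) -> str:
    # If the model wrapped its answer in ```python ... ``` fences, keep only the body
    match = re.search(r"```(?:python)?\s*\n(.*?)```", raw, re.DOTALL)
    return match.group(1).strip() if match else raw.strip()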

While exec() can be great for executing AI-generated code, it also poses a security concern. Because what runs inside it is partially determined by user input, a determined attacker could potentially use it to run malicious code. In a production environment, strong security measures would be critical to protecting your system.
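One partial mitigation, sketched below using the same json_data/result conventions as above, is to deny the executed code access to Python's builtins so it can't reach open(), __import__(), eval(), and the like. To be clear, this is not a true sandbox, and it also blocks harmless builtins the generated code might legitimately need (len(), sum(), etc.), so a production system would pair it with process-level isolation such as containers, timeouts, and least-privilege credentials:

import pandas as pd
from datetime import datetime

# Stand-ins for the real inputs
extraction_code = 'result = [{"entity": k} for k in json_data]'
json_data = {"QueryResponse": {}}

# An empty __builtins__ blocks most dangerous calls from inside exec()
safe_globals = {"pd": pd, "datetime": datetime, "__builtins__": {}}
local_vars = {"json_data": json_data}
exec(extraction_code, safe_globals, local_vars)
print(local_vars["result"])   # [{'entity': 'QueryResponse'}]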

In hindsight, the solution to my initial problem seems obvious, but I suppose that's just hindsight bias. It's unsurprising that my first approach to this project wasn't the best, considering I only have so much experience. I'm glad I was able to make such an improvement; it shows growth in my development skills and in my approach to problem-solving.

My application is now capable of extracting more data from QuickBooks. While it's still basic, I've reached a point in the development of this application where I feel it functions as a stand-alone project, and I can move on to something else. If I ever want to come back to it, I feel I've built a solid foundation that can be used to do much more.