amirakhalfy/AI-Mongodb-system


🌟 LLM-Powered MongoDB Query System

Overview

This project demonstrates an end-to-end AI system that translates natural language (NL) questions into MongoDB queries using a Large Language Model (LLM) via the Groq API.

It includes:

  • An in-memory MongoDB simulation for query execution (via mongomock)
  • An LLM-as-a-judge evaluation of query correctness, completeness, and efficiency
  • A ground-truth–based evaluator for comparing LLM outputs with expected results
  • 💬 A Streamlit chat interface for interactive testing
  • A human feedback evaluation
  • A final evaluation score combining the LLM judge's grades with the expected-response comparison across all questions

The system enables full-cycle evaluation:

"Human question → LLM query generation → Query execution → LLM + rule-based evaluation → Human feedback."

Table of Contents

  1. Components Explanation
  2. Data
  3. Core Logic – main.py
  4. Tests
  5. llm_final_evaluation/
  6. Installation and Setup
  7. Running the System
  8. Running Unit Tests
  9. Assumptions
  10. Design Decisions
  11. Remarks / Production Considerations
  12. Limitations
  13. Troubleshooting

Components Explanation

1️⃣ app/mongo_utils.py

Contains helper classes for simulating a MongoDB environment in memory using mongomock. It loads data from JSON files, executes queries (find, aggregate), and returns the results.
👉 In short: A lightweight, in-memory MongoDB engine for testing queries without a real database.

2️⃣ app/llm_query.py

Generates MongoDB queries from natural language questions using the Groq LLM API. It parses, cleans, and validates the JSON query output using Pydantic models (FindQuery, AggregateQuery) for reliability.
👉 In short: Converts natural language into structured MongoDB queries and ensures the output is valid JSON.
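A sketch of the Pydantic validation step; only the model names `FindQuery` and `AggregateQuery` come from this README — the field names and the fallback logic are assumptions:

```python
# Hedged sketch of validating LLM output with Pydantic models.
# Field names are assumptions; model names come from the README.
from typing import Optional
from pydantic import BaseModel, ValidationError

class FindQuery(BaseModel):
    collection: str
    filter: dict = {}
    projection: Optional[dict] = None
    sort: Optional[list] = None
    limit: Optional[int] = None

class AggregateQuery(BaseModel):
    collection: str
    pipeline: list

def validate_llm_output(raw: dict) -> BaseModel:
    # Try the aggregate shape first (it is stricter), then fall back to find
    for model in (AggregateQuery, FindQuery):
        try:
            return model(**raw)
        except ValidationError:
            continue
    raise ValueError(f"Unrecognized query shape: {raw}")

q = validate_llm_output({"collection": "patients", "filter": {"age": {"$gt": 60}}})
```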

3️⃣ app/llm_judge.py

Defines the LLMJudge class, which uses an LLM to evaluate the quality of generated query results for each question. It returns JSON feedback with:

  • correctness (high/medium/low)
  • completeness (high/medium/low)
  • efficiency (high/medium/low)
  • justification (brief explanation)
  • a final score (aggregated by the LLM judge across all questions)

👉 In short: Uses an LLM as a "judge" to grade how accurate, complete, and efficient the generated MongoDB results are.
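A sketch of how the judge's JSON verdict might be parsed and turned into a numeric score; the rubric fields come from this README, but the scoring weights are assumptions:

```python
# Hedged sketch of parsing the judge's JSON verdict; scoring weights
# (high=2, medium=1, low=0) are illustrative assumptions.
import json

GRADE_POINTS = {"high": 2, "medium": 1, "low": 0}

def parse_verdict(raw: str) -> dict:
    verdict = json.loads(raw)
    # Average the three rubric dimensions into a 0-1 score
    total = sum(GRADE_POINTS[verdict[k]]
                for k in ("correctness", "completeness", "efficiency"))
    verdict["score"] = total / 6
    return verdict

v = parse_verdict('{"correctness": "high", "completeness": "high", '
                  '"efficiency": "medium", "justification": "Matches all rows."}')
```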

4️⃣ app/evaluation_llm.py

Implements the LLMQueryEvaluator, which compares the LLM's output against expected outputs from a JSON file. It reports:

  • whether the LLM's results are correct or complete,
  • and lists missing or extra elements compared to the ground truth.

👉 In short: A rule-based evaluator that checks LLM query results against predefined expected answers.
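The missing/extra comparison can be sketched with plain set arithmetic; the function name and report keys below are assumptions, and flat (non-nested) documents are assumed for hashability:

```python
# Hedged sketch of the ground-truth comparison done by LLMQueryEvaluator.
# Assumes flat documents; function name and report keys are illustrative.
def compare_results(actual: list, expected: list) -> dict:
    def canon(doc):
        # Dicts are unhashable, so compare canonical sorted-item tuples
        return tuple(sorted(doc.items()))

    actual_set = {canon(d) for d in actual}
    expected_set = {canon(d) for d in expected}
    return {
        "correct": actual_set == expected_set,    # exact match
        "complete": expected_set <= actual_set,   # no ground-truth row missing
        "missing": [dict(t) for t in expected_set - actual_set],
        "extra": [dict(t) for t in actual_set - expected_set],
    }

report = compare_results(
    actual=[{"name": "Alice"}],
    expected=[{"name": "Alice"}, {"name": "Bob"}],
)
```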


📁 Data

Sample Healthcare Dataset: data/collections.json

A sample healthcare dataset simulating MongoDB collections:

  • patients – Patient demographic data
  • doctors – Doctors with specialties and cities
  • appointments – Patient–doctor appointments
  • vitals – Patient health measurements
  • prescriptions – Medications and dosages
  • lab_results – Test results (cholesterol, HbA1c, etc.)
  • teleconsultations – Remote consultations
  • devices – Connected medical devices
  • patient_feedback – Ratings and comments

👉 Purpose: Used to simulate MongoDB queries in a realistic healthcare context.

Expected Outputs: data/expected_output.json

Contains pairs of natural language questions and their expected MongoDB query results, serving as the ground truth. Used by LLMQueryEvaluator to measure:

  • Correctness (exact match)
  • Completeness (subset/superset relation)
  • Difference analysis (missing/extra records)

Core Logic – main.py

Defines the ChatBackend class which orchestrates:

  • Natural Language → Query generation (generate_query)
  • Query execution on simulated MongoDB (InMemoryMongo)
  • Evaluation using:
    • LLMJudge (AI-based grading)
    • LLMQueryEvaluator (ground truth comparison)
  • Output formatting with timestamps for Streamlit chat display.

📊 Human Feedback: 🔴 My human feedback is recorded in human_feedback.xlsx. On GitHub, click the file and choose "View raw" to download it.

👉 In short: The main backend engine that runs the full NL → Query → Execution → Evaluation pipeline.
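The orchestration can be sketched as below; only the class name `ChatBackend` and the `generate_query` step are named in this README — the injected callables stand in for the real Groq and mongomock components:

```python
# Hedged sketch of the ChatBackend pipeline; the injected callables are
# placeholders for the real query generator, executor, and judge.
from datetime import datetime

class ChatBackend:
    def __init__(self, generate_query, execute, judge):
        self.generate_query = generate_query
        self.execute = execute
        self.judge = judge

    def ask(self, question: str) -> dict:
        query = self.generate_query(question)      # NL -> Mongo query
        results = self.execute(query)              # run on in-memory Mongo
        return {
            "question": question,
            "query": query,
            "results": results,
            "evaluation": self.judge(question, results),
            "timestamp": datetime.now().isoformat(timespec="seconds"),
        }

backend = ChatBackend(
    generate_query=lambda q: {"collection": "patients", "filter": {}},
    execute=lambda query: [{"name": "Alice"}],
    judge=lambda q, r: {"correctness": "high"},
)
answer = backend.ask("List all patients")
```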


🧪 Tests

tests/test_llm_query.py

Tests the LLM query processing utilities, ensuring that JSON cleaning, ISODate conversion, loose JSON parsing, query normalization, and query structure validation behave correctly.

  • test_clean_json_text: Verifies removal of comments and fixes single quotes in JSON-like text.
  • test_fix_isodate_in_json: Converts ISODate strings into standard JSON date strings.
  • test_parse_json_loose_simple: Ensures loose JSON with markdown formatting parses correctly.
  • test_normalize_query_format_sort_dict: Normalizes sort specifications from dictionary format to list format.
  • test_validate_mongo_query_structure_lookup_in_find: Checks that invalid $lookup inside a find query raises a ValueError.
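For a feel of what these utilities do, here is a sketch of the ISODate conversion the second test covers; the regex and replacement format are assumptions about how `fix_isodate_in_json` might work, not its actual implementation:

```python
# Hedged sketch of the ISODate fix: rewrite Mongo-shell date literals
# like ISODate("2024-01-01T00:00:00Z") into plain JSON strings.
import re

def fix_isodate_in_json(text: str) -> str:
    return re.sub(r'ISODate\("([^"]+)"\)', r'"\1"', text)

fixed = fix_isodate_in_json('{"date": ISODate("2024-01-01T00:00:00Z")}')
```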

tests/test_mongo_utils.py

Tests the in-memory MongoDB simulation (InMemoryMongo) to ensure query execution works as expected, including find and aggregate operations, projections, sorting, limiting, and grouping.

  • test_execute_find_basic: Confirms a basic find query returns the correct documents.
  • test_execute_find_projection: Validates field projection in find queries.
  • test_execute_find_sort_and_limit: Ensures sorting and limiting works as expected.
  • test_execute_aggregate_group: Tests aggregation with grouping and counting.
  • test_execute_aggregate_project: Checks that projection works correctly in an aggregation pipeline.

llm_final_evaluation/

This folder contains all saved evaluation sessions from the chatbot. Each file includes:

  • The user's natural language question
  • The generated MongoDB query
  • The execution result
  • The LLM Judge's evaluation
  • The ground-truth comparison (if the question already exists in expected_output.json)

In the Streamlit chatbot demo, there's a "💾 Save Evaluation" button that allows you to save or load these JSON evaluation sessions interactively.


Installation and Setup

🧰 Optional: Setting up WSL (Windows Users)

If you're using Windows, it's recommended to run this project inside WSL (Windows Subsystem for Linux) to ensure compatibility with mongomock and Linux-based tools.

To install WSL:

wsl --install

Clone the repository:

git clone https://github.com/amirakhalfy/vitafluence.git

Copy the project to your Linux environment:

# Copy the project to your home directory (adjust path as needed)
cp -r /mnt/d/vitafluenceaitest ~/vitafluenceaitest
cd ~/vitafluenceaitest

Create and activate a Python virtual environment:

python3 -m venv venv

Activate your virtual environment (if not already active):

source venv/bin/activate

Open the project in VS Code:

code .

Inside your activated virtual environment:

pip install --upgrade pip
pip install -r requirements.txt

Verify installation (optional):

pip list

---> This lists all installed packages and their versions.

At this point, you are:

✅ In a Linux-native environment
✅ With the virtual environment active
✅ Ready to install and run libraries

Next step: Create a .env file in the root of your project:

GROQ_API_TOKEN=your_groq_api_token_here

Note: The DB is generated in-memory on app startup—no manual generation needed.


Running the System

streamlit run streamlit.py

Running Unit Tests

python -m pytest tests/ -v

Assumptions

  • Python version: Python 3.11.0 or higher
  • Environment: WSL (Windows Subsystem for Linux) recommended for compatibility with mongomock and Linux-native tools

Design Decisions

Technical Choices

Database Setup

Tool Used: mongomock

Reasoning: I used mongomock to simulate MongoDB in-memory without needing a full database setup. This enables quick testing, reproducibility, and isolation of experiments, ideal for evaluation purposes. It supports most common MongoDB operations, including find, aggregate, projection, sorting, and basic lookups, and simplifies setup for anyone running the code.

LLM for Query Generation

Tool Used: openai/gpt-oss-120b via Groq Cloud

Reasoning: This model was chosen because it has strong reasoning capabilities, can handle complex queries, and is production-ready, so there are no concerns about reliability. Using Groq Cloud ensures fast inference, sufficient token capacity, and scalable usage, making it ideal for translating natural language questions into MongoDB queries.

Alternative LLMs Tested:

  • DeepSeek-R1 and Llama3.3 Versatile work well for simple queries like find or projection but struggle with complex queries involving aggregation pipelines, nested $lookup, or $let.
  • openai/gpt-oss-120b consistently delivers accurate, complex query generation thanks to its high reasoning ability and large parameter count.

LLM for Evaluation (Judge)

Tool Used: moonshotai/kimi-k2-instruct-0905

Reasoning: Selected for its strong reasoning, instruction-following capabilities, and ability to evaluate query correctness, completeness, and efficiency. It posts some of the strongest benchmark results among instruction-following LLMs, making it particularly well suited to act as a judge for MongoDB queries.

Prompt Engineering: Prompts were iteratively refined. Each test highlighted areas where the model struggled, and the prompts were enhanced to guide the LLM effectively. This iterative process ensures high-quality, reliable evaluation.

Evaluation Approach

I combined three complementary evaluation methods:

1. LLM-as-a-Judge:

This method speeds up evaluation by automating the judgment process. It performs particularly well for queries involving find, projection, and aggregation.

Limitations: Sometimes the LLM flags responses as incomplete even when they are correct, especially for complex $lookup or pipeline operations (these cases are documented in the human feedback Excel file).

Overall: Despite these limitations, the LLM provides clear and logical judgments, which significantly saves time, especially when testing a large number of questions or for users who may not fully understand MongoDB queries.

2. Expected Output Comparison:

Compares query results against pre-defined expected outputs.

Useful for quantitative verification and exact match checking.

-----> The final evaluation score combines the LLM-as-a-judge grades with the expected-output comparison.

3. Human Feedback:

Provides final verification, especially for complex queries where LLM judgment may be uncertain.

Ensures high confidence in correctness and completeness.

Key Observations

  • The LLM performs strongly even for complex queries, but failures in $let or pipeline $lookup are due to limitations in mongomock, not the model.
  • Combining LLM evaluation with human feedback ensures robust and reliable assessment of queries.

Remarks / Production Considerations

  • In a production setup, one should add logging, retries, metrics, and error monitoring to improve reliability and observability.
  • For testing purposes, Groq API calls should be mocked, similar to how mongomock is used for MongoDB.
  • Efficiency measurements using mongomock are not representative of production MongoDB performance; they only serve as a relative benchmark.
  • Using cloud LLMs for query generation is realistic, as many production systems rely on them for their reasoning capabilities and scalability.
  • While Groq Cloud has daily token limits, I tested the system over multiple days, sending over 200 queries each day without encountering problems.
  • For testing purposes or automated unit tests, Groq API calls can be mocked, similar to how mongomock is used for MongoDB queries. This allows reproducible testing without consuming live API tokens.
  • Caching of query results was not implemented because the data could change over time. Caching might cause the LLM to return outdated responses, which would be inaccurate.
  • Similarly, LLM-as-a-Judge results were not cached, to ensure that each evaluation reflects the current query results and avoids incorrect judgments.
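The mocking suggestion above can be sketched with `unittest.mock`; the `generate_query` wrapper below is a hypothetical stand-in for the project's real Groq call, though `chat.completions.create` is the Groq SDK's actual chat interface:

```python
# Hedged sketch of mocking the Groq client in a unit test.
# generate_query is a hypothetical wrapper, not the project's real function.
from unittest.mock import MagicMock

def generate_query(client, question: str) -> str:
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# Stub the client so no live API tokens are consumed
mock_client = MagicMock()
mock_client.chat.completions.create.return_value.choices = [
    MagicMock(message=MagicMock(content='{"collection": "patients", "filter": {}}'))
]
query = generate_query(mock_client, "List all patients")
```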

Limitations

  • mongomock constraints: $let and some pipeline-form $lookup queries fail due to mongomock limitations, not the LLM. These cases are documented in the human feedback Excel file, as noted above.
  • Data Privacy Consideration: The patients dataset used in this project is fully synthetic. Using Groq Cloud with real patient data would not be recommended. For sensitive or real data, local LLM models should be preferred to ensure privacy and compliance.

Troubleshooting

Common issues:

  • "Query generation failed" → Check Groq API key
  • "mongomock limitations" → Expected failures for $let and pipeline-form $lookup queries
