This project demonstrates an end-to-end AI system that translates natural language (NL) questions into MongoDB queries using a Large Language Model (LLM) via the Groq API.
It includes:
- An in-memory MongoDB simulation for query execution (via mongomock)
- An LLM as a judge evaluation of query correctness, completeness, and efficiency
- A ground-truth–based evaluator for comparing LLM outputs with expected results
- 💬 A Streamlit chat interface for interactive testing
- A human feedback evaluation
- A final score per question, combining the LLM-as-a-judge ratings with the expected-output comparison
The system enables full-cycle evaluation:
"Human question → LLM query generation → Query execution → LLM + rule-based evaluation → Human feedback."
- Components Explanation
- Data
- Core Logic – main.py
- Tests
- llm_final_evaluation/
- Installation and Setup
- Running the System
- Running Unit Tests
- Assumptions
- Design Decisions
- Remarks / Production Considerations
- Limitations
- Troubleshooting
Contains helper classes for simulating a MongoDB environment in memory using mongomock. It loads data from JSON files, executes queries (find, aggregate), and returns the results.
👉 In short: A lightweight, in-memory MongoDB engine for testing queries without a real database.
Generates MongoDB queries from natural language questions using the Groq LLM API. It parses, cleans, and validates the JSON query output using Pydantic models (FindQuery, AggregateQuery) for reliability.
👉 In short: Converts natural language into structured MongoDB queries and ensures the output is valid JSON.
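The validation step can be sketched with Pydantic models like the following; the project's actual `FindQuery`/`AggregateQuery` models may define more fields or stricter constraints.

```python
# Hedged sketch of Pydantic validation for LLM-generated queries.
from typing import Any, Dict, List, Optional
from pydantic import BaseModel

class FindQuery(BaseModel):
    collection: str
    filter: Dict[str, Any] = {}
    projection: Optional[Dict[str, int]] = None
    sort: Optional[List[List[Any]]] = None
    limit: Optional[int] = None

class AggregateQuery(BaseModel):
    collection: str
    pipeline: List[Dict[str, Any]]

# Parsing raises pydantic.ValidationError if the LLM output is malformed.
query = FindQuery(**{"collection": "patients", "filter": {"age": {"$gt": 60}}})
print(query.collection)  # patients
```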
Defines the LLMJudge class, which uses an LLM to evaluate the quality of generated query results for each question. It returns JSON feedback with:
- correctness (high/medium/low)
- completeness (high/medium/low)
- efficiency (high/medium/low)
- justification (brief explanation)
- An overall score assigned by the LLM-as-a-judge across all questions
👉 In short: Uses an LLM as a "judge" to grade how accurate, complete, and efficient the generated MongoDB results are.
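An illustrative example of the judge's JSON verdict, built from the fields listed above; the justification text is made up, not real model output.

```python
# Illustrative shape of the judge's JSON feedback.
import json

verdict = json.loads("""
{
  "correctness": "high",
  "completeness": "medium",
  "efficiency": "high",
  "justification": "The query returns the right patients but omits one field requested in the question."
}
""")
print(verdict["correctness"])  # high
```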
Implements the LLMQueryEvaluator, which compares the LLM's output against expected outputs from a JSON file. It reports:
- whether the LLM's results are correct or complete,
- and lists missing or extra elements compared to the ground truth.
👉 In short: A rule-based evaluator that checks LLM query results against predefined expected answers.
A sample healthcare dataset simulating MongoDB collections:
- patients – Patient demographic data
- doctors – Doctors with specialties and cities
- appointments – Patient–doctor appointments
- vitals – Patient health measurements
- prescriptions – Medications and dosages
- lab_results – Test results (cholesterol, HbA1c, etc.)
- teleconsultations – Remote consultations
- devices – Connected medical devices
- patient_feedback – Ratings and comments
👉 Purpose: Used to simulate MongoDB queries in a realistic healthcare context.
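As an illustration, documents in these collections might look like the following; the field names are hypothetical and the actual JSON files may differ.

```python
# Hypothetical sample documents; field names are illustrative only.
patient = {"patient_id": "P001", "name": "Alice Martin", "age": 70, "city": "Lyon"}
doctor = {"doctor_id": "D010", "specialty": "cardiology", "city": "Lyon"}
appointment = {"appointment_id": "A100", "patient_id": "P001",
               "doctor_id": "D010", "date": "2024-03-05"}
lab_result = {"patient_id": "P001", "test": "HbA1c", "value": 6.1, "unit": "%"}
```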
Contains pairs of natural language questions and their expected MongoDB query results, serving as the ground truth. Used by LLMQueryEvaluator to measure:
- Correctness (exact match)
- Completeness (subset/superset relation)
- Difference analysis (missing/extra records)
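These three checks can be sketched as below; `compare_results` and its set-based canonicalization are illustrative, not the evaluator's exact code.

```python
# Illustrative ground-truth checks: exact match, subset completeness,
# and missing/extra differences. Flat documents are canonicalized to
# hashable tuples so they can be compared as sets.
def compare_results(actual, expected):
    canon = lambda docs: {tuple(sorted(d.items())) for d in docs}
    actual_set, expected_set = canon(actual), canon(expected)
    return {
        "correct": actual_set == expected_set,   # exact match
        "complete": expected_set <= actual_set,  # nothing expected is missing
        "missing": expected_set - actual_set,
        "extra": actual_set - expected_set,
    }

report = compare_results([{"name": "Alice"}],
                         [{"name": "Alice"}, {"name": "Bob"}])
print(report["correct"], len(report["missing"]))  # False 1
```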
Defines the ChatBackend class which orchestrates:
- Natural language → query generation (generate_query)
- Query execution on the simulated MongoDB (InMemoryMongo)
- Evaluation using:
- LLMJudge (AI-based grading)
- LLMQueryEvaluator (ground truth comparison)
- Output formatting with timestamps for Streamlit chat display.
📊 Human Feedback: 🔴 My human feedback is recorded in the Excel file human_feedback.xlsx. When you click it on GitHub, choose "View raw" to download it.
👉 In short: The main backend engine that runs the full NL → Query → Execution → Evaluation pipeline.
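The orchestration can be sketched like this; the constructor arguments and the `execute`/`evaluate`/`check` method names are stand-ins for the project's actual components, and the sketch only shows the call order.

```python
# Hedged sketch of the NL -> Query -> Execution -> Evaluation pipeline.
from datetime import datetime

class ChatBackend:
    def __init__(self, query_generator, mongo, judge, evaluator):
        self.query_generator = query_generator  # Groq-backed NL -> query
        self.mongo = mongo                      # in-memory MongoDB
        self.judge = judge                      # LLM-as-a-judge
        self.evaluator = evaluator              # ground-truth comparison

    def ask(self, question):
        query = self.query_generator(question)
        result = self.mongo.execute(query)
        return {
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "question": question,
            "query": query,
            "result": result,
            "judge": self.judge.evaluate(question, query, result),
            "ground_truth": self.evaluator.check(question, result),
        }
```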
Tests the LLM query processing utilities, ensuring that JSON cleaning, ISODate conversion, loose JSON parsing, query normalization, and query structure validation behave correctly.
- test_clean_json_text: Verifies removal of comments and fixes single quotes in JSON-like text.
- test_fix_isodate_in_json: Converts ISODate strings into standard JSON date strings.
- test_parse_json_loose_simple: Ensures loose JSON with markdown formatting parses correctly.
- test_normalize_query_format_sort_dict: Normalizes sort specifications from dictionary format to list format.
- test_validate_mongo_query_structure_lookup_in_find: Checks that an invalid $lookup inside a find query raises a ValueError.
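For instance, the sort normalization these tests cover could look like the following sketch; the real helper may handle additional cases.

```python
# Illustrative sort normalization: {"age": -1} -> [["age", -1]],
# i.e. dict form to pymongo-style list-of-pairs form.
def normalize_sort(sort_spec):
    if isinstance(sort_spec, dict):
        return [[field, direction] for field, direction in sort_spec.items()]
    return sort_spec  # already in list format

print(normalize_sort({"age": -1, "name": 1}))  # [['age', -1], ['name', 1]]
```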
Tests the in-memory MongoDB simulation (InMemoryMongo) to ensure query execution works as expected, including find and aggregate operations, projections, sorting, limiting, and grouping.
- test_execute_find_basic: Confirms a basic find query returns the correct documents.
- test_execute_find_projection: Validates field projection in find queries.
- test_execute_find_sort_and_limit: Ensures sorting and limiting works as expected.
- test_execute_aggregate_group: Tests aggregation with grouping and counting.
- test_execute_aggregate_project: Checks that projection works correctly in an aggregation pipeline.
This folder contains all saved evaluation sessions from the chatbot. Each file includes:
- The user's natural language question
- The generated MongoDB query
- The execution result
- The LLM Judge's evaluation
- The ground-truth comparison (if the question already exists in expected_output.json)
In the Streamlit chatbot demo, there's a "💾 Save Evaluation" button that allows you to save or load these JSON evaluation sessions interactively.
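A saved session might look like this; the field names and values are illustrative, not the project's exact file format.

```json
{
  "question": "List patients older than 60",
  "generated_query": {"collection": "patients", "filter": {"age": {"$gt": 60}}},
  "execution_result": [{"name": "Alice Martin", "age": 70}],
  "llm_judge": {
    "correctness": "high",
    "completeness": "high",
    "efficiency": "high",
    "justification": "The filter matches the question exactly."
  },
  "ground_truth": {"correct": true, "missing": [], "extra": []}
}
```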
If you're using Windows, it's recommended to run this project inside WSL (Windows Subsystem for Linux) to ensure compatibility with mongomock and Linux-based tools.
To install WSL:

```shell
wsl --install
```

Clone the repository:

```shell
git clone https://github.com/amirakhalfy/vitafluence.git
```

Copy the project to your Linux environment:

```shell
# Copy the project to your home directory (adjust path as needed)
cp -r /mnt/d/vitafluenceaitest ~/vitafluenceaitest
cd ~/vitafluenceaitest
```

Create and activate a Python virtual environment:

```shell
python3 -m venv venv
source venv/bin/activate
```

Open the project in VS Code:

```shell
code .
```

Inside your activated virtual environment, install the dependencies:

```shell
pip install --upgrade pip
pip install -r requirements.txt
```

Verify installation (optional):

```shell
pip list
```

This will show all installed packages and their versions.
At this point, you are:
✅ In a Linux-native environment
✅ With the virtual environment active
✅ Ready to install and run libraries
Next step: Create a .env file in the root of your project:

```
GROQ_API_TOKEN=your_groq_api_token_here
```

Note: The DB is generated in-memory on app startup; no manual generation is needed.

Run the Streamlit app:

```shell
streamlit run streamlit.py
```

Run the unit tests:

```shell
python -m pytest tests/test_mongo_utils.py -v
```

- Python version: Python 3.11.0 or higher
- Environment: WSL (Windows Subsystem for Linux) recommended for compatibility with mongomock and Linux-native tools
Tool Used: mongomock
Reasoning: I used mongomock to simulate MongoDB in-memory without needing a full database setup. This enables quick testing, reproducibility, and isolation of experiments, ideal for evaluation purposes. It supports all common MongoDB operations, including find, aggregate, project, sort, and nested lookups, and simplifies setup for anyone running the code.
Tool Used: openai/gpt-oss-120b via Groq Cloud
Reasoning: This model was chosen for its strong reasoning capabilities, its ability to handle complex queries, and its production-grade reliability. Using Groq Cloud ensures fast inference, sufficient token capacity, and scalable usage, making it well suited for translating natural language questions into MongoDB queries.
Alternative LLMs Tested:
- DeepSeek-R1 and Llama3.3 Versatile work well for simple queries like find or projection but struggle with complex queries involving aggregation pipelines, nested $lookup, or $let.
- openai/gpt-oss-120b consistently delivers accurate, complex query generation thanks to its high reasoning ability and large parameter count.
Tool Used: moonshotai/kimi-k2-instruct-0905
Reasoning: Selected for its strong reasoning, instruction-following capabilities, and ability to evaluate query correctness, completeness, and efficiency. It posts some of the strongest benchmark results among instruction-following LLMs, making it particularly well suited to act as a judge for MongoDB queries.
Prompt Engineering: Prompts were iteratively refined. Each test highlighted areas where the model struggled, and the prompts were enhanced to guide the LLM effectively. This iterative process ensures high-quality, reliable evaluation.
I combined three complementary evaluation methods:
1. LLM-as-a-Judge:
This method speeds up evaluation by automating the judgment process. It performs particularly well for queries involving find, projection, and aggregation.
Limitations: Sometimes the LLM flags responses as incomplete even when they are correct, especially for complex $lookup or pipeline operations (these cases are documented in the human feedback Excel file).
Overall: Despite these limitations, the LLM provides clear and logical judgments, which significantly saves time, especially when testing a large number of questions or for users who may not fully understand MongoDB queries.
2. Expected Output Comparison:
Compares query results against pre-defined expected outputs.
Useful for quantitative verification and exact match checking.
The final evaluation score is derived from both the LLM-as-a-judge ratings and the expected-output comparison.
3. Human Feedback:
Provides final verification, especially for complex queries where LLM judgment may be uncertain.
Ensures high confidence in correctness and completeness.
- The LLM performs strongly even for complex queries, but failures in $let or pipeline $lookup are due to limitations in mongomock, not the model.
- Combining LLM evaluation with human feedback ensures robust and reliable assessment of queries.
- In a production setup, one should add logging, retries, metrics, and error monitoring to improve reliability and observability.
- For testing purposes and automated unit tests, Groq API calls should be mocked, similar to how mongomock is used for MongoDB; this allows reproducible testing without consuming live API tokens.
- Efficiency measurements using mongomock are not representative of production MongoDB performance; they only serve as a relative benchmark.
- Using cloud LLMs for query generation is realistic, as many production systems rely on them for their reasoning capabilities and scalability.
- While Groq Cloud has daily token limits, I tested the system over multiple days, sending over 200 queries each day without encountering problems.
- Caching of query results was not implemented because the data could change over time. Caching might cause the LLM to return outdated responses, which would be inaccurate.
- Similarly, LLM-as-a-Judge results were not cached, to ensure that each evaluation reflects the current query results and avoids incorrect judgments.
- Limitations: $let and some pipeline $lookup queries fail due to mongomock constraints, not the LLM. This is documented in the human feedback Excel file, as mentioned above.
- Data Privacy Consideration: The patients dataset used in this project is fully synthetic. Using Groq Cloud with real patient data would not be recommended. For sensitive or real data, local LLM models should be preferred to ensure privacy and compliance.
Common issues:
- "Query generation failed" → Check Groq API key
- "mongomock limitations" → Expected failures for $let and pipeline $lookup queries