A full-stack AI Product Discovery system that scrapes clothing products from Hunnit.com using ScraperAPI, stores them in PostgreSQL (Neon), indexes them in Qdrant, enriches them with a Neo4j Knowledge Graph, and serves intelligent outfit recommendations via a FastAPI backend and a modern React frontend.
Prototype link: https://product-discovery-assistant-ochre.vercel.app/
## Contents

- Backend setup
- Frontend setup
- Environment variables
- Database setup
- Example API requests
- Docker instructions
- Architecture & Design Decisions
- Mermaid Diagram

## Overview
This project is an AI-powered product discovery assistant for clothing. It:
- Scrapes product data from Hunnit.com using ScraperAPI + BeautifulSoup
- Stores structured product info (title, price, features, image URL, category…) in Neon PostgreSQL
- Generates semantic embeddings using sentence-transformers/all-MiniLM-L6-v2
- Indexes them in Qdrant Vector DB
- Builds a Knowledge Graph in Neo4j (products → categories → features)
- Uses a hybrid RAG + KG + LLM reasoning pipeline for smart outfit recommendations
- Serves results through a FastAPI backend
- Provides a clean, responsive frontend built with React + Tailwind
The result is an end-to-end mini “AI stylist” capable of answering queries like:
“Show me oversized hoodies under 2000 for gym.”
## Project structure

```
product-discovery-assistant/
│
├── backend/
│   ├── app/
│   │   ├── api/v1/        # FastAPI routes (products, search, scrape, health)
│   │   ├── core/          # Settings & configuration
│   │   ├── db/            # DB engine, session, Base
│   │   ├── models/        # SQLAlchemy models
│   │   ├── schemas/       # Pydantic schemas
│   │   ├── services/      # embeddings, llm, scraper, graph, products
│   │   └── main.py        # FastAPI app factory & startup hooks
│   ├── Data_Scraping/
│   │   └── scrap.py       # ScraperAPI-based Hunnit scraper
│   ├── create_db.py
│   ├── Dockerfile
│   └── requirements.txt
│
├── frontend/
│   ├── src/
│   │   ├── components/    # Header, ProductCard
│   │   ├── pages/         # Home, Chat, ProductDetail
│   │   ├── api.js         # API wrapper for FastAPI
│   │   ├── App.jsx
│   │   ├── main.jsx
│   │   └── index.css
│   ├── public/
│   ├── package.json
│   └── tailwind.config.js
│
└── README.md
```
## Backend setup

```bash
cd backend
pip install -r requirements.txt
```

## Environment variables

Create a `.env` file in `backend/` (the backend requires all of these):

```env
# --- Database (Neon Postgres) ---
DATABASE_URL=postgresql+psycopg://USER:PASSWORD@HOST/dbname

# --- Qdrant Vector DB ---
QDRANT_URL=https://your-qdrant-url
QDRANT_API_KEY=your_qdrant_key
QDRANT_COLLECTION=products_collection_minilm

# --- ScraperAPI ---
SCRAPER_API_KEY=your_scraperapi_key

# --- LLM Keys ---
GROQ_API_KEY=your_groq_key
OPENAI_API_KEY=your_openai_key

# --- Neo4j Knowledge Graph ---
NEO4J_ENABLED=True
NEO4J_URI=neo4j+s://your-instance.databases.neo4j.io
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
```
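For reference, here is a minimal sketch of how `app/core` might load these variables with pydantic-settings; the actual `Settings` class in the repo may differ in field names and defaults.

```python
# Hypothetical Settings sketch (pydantic-settings); field names mirror the .env above.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    database_url: str
    qdrant_url: str
    qdrant_api_key: str
    qdrant_collection: str = "products_collection_minilm"
    scraper_api_key: str
    groq_api_key: str
    openai_api_key: str
    neo4j_enabled: bool = True
    neo4j_uri: str = ""
    neo4j_user: str = "neo4j"
    neo4j_password: str = ""

settings = Settings()  # reads .env and the process environment
```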
## Database setup

Create the tables:

```bash
python create_db.py
```

Run the backend:

```bash
uvicorn app.main:app --reload
```

The backend will be available at `http://127.0.0.1:8000/api/v1`.
## Frontend setup

```bash
cd frontend
npm install
npm run dev
```

Set `.env` for the frontend:

```env
VITE_API_BASE_URL=http://127.0.0.1:8000/api/v1
```

The app runs at `http://localhost:5173`.
## Example API requests

Trigger a scrape:

```bash
curl -X POST "http://127.0.0.1:8000/api/v1/scrape/hunnit?max_products=40"
```

Run a semantic search:

```bash
curl -X POST "http://127.0.0.1:8000/api/v1/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "show me oversized hoodies under 2000"}'
```
-d '{"query": "show me oversized hoodies under 2000"}'docker build -t product-backend .docker run -p 8000:8000 --env-file .env product-backendThe system integrates:
- FastAPI — high-performance async backend, clean routing structure
- ScraperAPI + BeautifulSoup — highly reliable scraping without IP blocks
- Neon PostgreSQL — scalable cloud Postgres database
- Qdrant — vector similarity search for semantic retrieval
- Neo4j — conceptual reasoning using graph structure
- SentenceTransformers — lightweight sentence embeddings (all-MiniLM-L6-v2)
- Groq + OpenAI — LLM reasoning with fallback model
- React + Tailwind — clean modern UX
## Mermaid Diagram

```mermaid
flowchart TD
    A[ScraperAPI] --> B[Data_Scraping/scrap.py<br>BeautifulSoup Parsing]
    B --> C[Neon PostgreSQL<br>Products Table]
    C --> D[FastAPI Backend]
    D --> E[Embeddings Service<br>SentenceTransformer]
    E --> F[Qdrant Vector DB]
    C --> G[Neo4j Knowledge Graph]
    G --> D
    D --> H[Search API<br>Hybrid RAG+KG]
    H --> I[Groq/OpenAI LLM]
    I --> J[Frontend React App<br>Home / Chat / Product Detail]
    J --> D
```
## Scraper (`Data_Scraping/scrap.py`, `app/services/scraper.py`)

✔ Uses ScraperAPI to bypass anti-bot systems:

```python
SCRAPER_API_BASE = "http://api.scraperapi.com"
```

✔ HTML parsed using BeautifulSoup
✔ Selectors rely on:

- `<a href="/products/...">` for product listings
- `<h1>` for titles
- Regex for price detection
- Headings like “Product Features” for structured bullet extraction

✔ Cleaned description is generated using `build_clean_description()`
✔ Data is stored directly into PostgreSQL via the SQLAlchemy ORM

Known limitations:
- No robots.txt check
- No delay/backoff between requests
- No proxy rotation (ScraperAPI handles some of this)
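To make the flow concrete, here is a minimal sketch of the ScraperAPI fetch plus BeautifulSoup parse described above. The parsing logic is illustrative only; the real selectors and `parse_product` equivalent live in `Data_Scraping/scrap.py`.

```python
import os
import re
import requests
from bs4 import BeautifulSoup

SCRAPER_API_BASE = "http://api.scraperapi.com"

def fetch_html(url: str) -> str:
    # ScraperAPI proxies the request, handling IP rotation and anti-bot measures.
    resp = requests.get(
        SCRAPER_API_BASE,
        params={"api_key": os.environ["SCRAPER_API_KEY"], "url": url},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text

def parse_product(url: str) -> dict:
    soup = BeautifulSoup(fetch_html(url), "html.parser")
    h1 = soup.find("h1")  # <h1> holds the title, per the selector notes above
    title = h1.get_text(strip=True) if h1 else None
    # Regex-based price detection, as in the scraper notes above.
    price_match = re.search(r"(?:Rs\.?|₹)\s*([\d,]+)", soup.get_text())
    price = float(price_match.group(1).replace(",", "")) if price_match else None
    return {"url": url, "title": title, "price": price}
```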
Component responsibilities:

| Component    | File                     | Responsibility                    |
|--------------|--------------------------|-----------------------------------|
| Scraper      | `scrap.py`               | Collect raw product data          |
| Embedder     | `services/embeddings.py` | Encode with SentenceTransformer   |
| Vector DB    | Qdrant                   | Semantic similarity               |
| KG           | `services/graph.py`      | Category & feature relationships  |
| LLM          | `services/llm.py`        | Final answer generation           |
| Search Logic | `api/v1/search.py`       | Hybrid rank + LLM rerank          |
Key pipeline functions:

- `parse_hunnit_product(url)` — parse a single Hunnit product page
- `index_all_products(db, skip_if_indexed=True)` — embed and index products in Qdrant
- Query enrichment before retrieval:
  - category synonyms
  - tag extraction
  - price extraction
- `semantic_search(enriched_query)` — vector retrieval from Qdrant
- `get_kg_context_for_products()` — Neo4j context for the retrieved products
- `answer_with_rag(question, rag_chunks)` — final LLM answer

LLM prompt template:

```
You are an AI fashion stylist. You must recommend outfits ONLY using the
products listed in the context below.

Rules:
- Suggest 2–4 suitable products.
- If no exact match exists, recommend closest alternatives.
- Keep answer short.

Context:
{{RAG_CHUNKS}}

User query: {{QUESTION}}

Now give a friendly recommendation:
```
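A plausible composition of these functions (the glue code, helper names, and signatures here are assumptions; only the function names above come from the repo):

```python
# Hypothetical glue: how the search endpoint might chain the pieces above.
def handle_search(question: str) -> str:
    enriched_query = enrich_query(question)           # hypothetical helper: synonyms, tags, price filters
    hits = semantic_search(enriched_query)            # Qdrant vector retrieval
    kg_context = get_kg_context_for_products(hits)    # Neo4j category/feature context
    rag_chunks = format_chunks(hits, kg_context)      # hypothetical helper: builds the {{RAG_CHUNKS}} text
    return answer_with_rag(question, rag_chunks)      # Groq answer (OpenAI fallback)
```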
Knowledge Graph schema:

Nodes:
- `Product`
- `Category`
- `Feature`

Relationships:
- Product → BELONGS_TO → Category
- Product → HAS_FEATURE → Feature
Used for:
- enrichment of context
- filtering candidates
- semantic expansion
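A minimal sketch of the upsert pattern this schema implies, using the official neo4j Python driver; the Cypher, property names, and example values are assumptions, not the repo's actual `services/graph.py` code.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "neo4j+s://your-instance.databases.neo4j.io",
    auth=("neo4j", "your_password"),
)

# MERGE keeps the graph idempotent: re-running the sync won't duplicate nodes.
UPSERT = """
MERGE (p:Product {id: $product_id})
SET p.title = $title
MERGE (c:Category {name: $category})
MERGE (p)-[:BELONGS_TO]->(c)
WITH p
UNWIND $features AS feat
MERGE (f:Feature {name: feat})
MERGE (p)-[:HAS_FEATURE]->(f)
"""

with driver.session() as session:
    session.run(UPSERT, product_id=1, title="Oversized Hoodie",
                category="Hoodies", features=["oversized", "fleece"])
```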
## Trade-offs

- **Scraping** — ScraperAPI improves reliability but costs money. Alternative: Playwright + proxies.
- **Embeddings** — batch encoding solves throughput, but initial startup is still heavy. Alternative: precompute embeddings or store them in the DB (see the sketch below).
- **KG sync** — currently skips re-sync if data exists. Trade-off: faster startup vs. a potentially stale KG.
- **LLM** — Groq is cheaper; OpenAI serves as the fallback.
- **Re-indexing** — if products are updated, embeddings must be re-indexed manually.
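A sketch of the "precompute embeddings" alternative: batch-encode once and upsert into Qdrant. The payload fields and product dict shape are assumptions; the model name and collection match the `.env` above.

```python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = QdrantClient(url="https://your-qdrant-url", api_key="your_qdrant_key")

def index_products(products: list[dict]) -> None:
    # Batch encoding amortizes model overhead across all products.
    texts = [f"{p['title']} {p.get('description', '')}" for p in products]
    vectors = model.encode(texts, batch_size=64, show_progress_bar=True)
    points = [
        PointStruct(id=p["id"], vector=vec.tolist(),
                    payload={"title": p["title"], "price": p.get("price")})
        for p, vec in zip(products, vectors)
    ]
    client.upsert(collection_name="products_collection_minilm", points=points)
```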
## Testing & code quality

No test suite is present yet; recommended additions:

- pytest (see the example below)
- CI pipeline (GitHub Actions)
- black + flake8 linting
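A minimal pytest starting point, assuming the health route from `api/v1/` is mounted at `/api/v1/health` and `app.main:app` is importable:

```python
from fastapi.testclient import TestClient
from app.main import app

client = TestClient(app)

def test_health():
    # Smoke test: the API starts up and the health route responds.
    resp = client.get("/api/v1/health")
    assert resp.status_code == 200
```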
## Security

✔ API keys must stay in `.env`
❌ NO keys should be committed (verify before publishing)
✔ Rotate keys if exposed
## Deployment & scaling

- Use Render for the backend
- Use Vercel for the frontend
- Enable Qdrant sharding for >50k products
- Use a background worker to scrape periodically (see the sketch below)
- Cache embeddings to reduce recomputation
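One way the periodic scrape worker could look, using APScheduler; the project currently has no scheduler, so this is purely illustrative, and it drives the existing scrape endpoint rather than inventing new internals:

```python
from apscheduler.schedulers.blocking import BlockingScheduler
import requests

def scrape_job():
    # Re-use the existing scrape endpoint instead of duplicating scraper logic.
    requests.post("http://127.0.0.1:8000/api/v1/scrape/hunnit",
                  params={"max_products": 40}, timeout=300)

scheduler = BlockingScheduler()
scheduler.add_job(scrape_job, "interval", hours=24)  # scrape once a day
scheduler.start()
```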
## Roadmap

- Add retry + exponential backoff to scraping (see the sketch below)
- Add Alembic migrations
- Cache search responses
- Incremental embedding updates instead of full reindex
- Pagination on the products API
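For the retry item, a sketch using tenacity (not currently in `requirements.txt`; the function name is hypothetical):

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5),
       wait=wait_exponential(multiplier=1, min=2, max=60))
def fetch_with_retry(url: str) -> str:
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()  # raising on 5xx/429 triggers tenacity's retry
    return resp.text
```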
Longer-term product ideas:

- Multi-vendor scraping (Furlenco, Traya, etc.)
- User profiling (preferences, sizes, style)
- Real-time KG updates
- A/B testing for recommendation quality