The problem
Turning a text description or a single image into a usable 3D model is hard: prompt quality matters, and most tools are either closed, expensive, or tied to one format. Designers and developers need a flexible pipeline that can enhance prompts with VLMs and support multiple quality/speed tradeoffs.
The solution
ThreeDLLM is a modular pipeline: use a VLM (e.g. GPT-4 Vision) to enhance text prompts or interpret images, then generate 3D with Shap-E, Hugging Face models, or API-based generators (Neural4D, Instant3D). Export to XYZ, OBJ, PLY, or STL for downstream use or 3D printing.
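The two-stage flow (enhance, then generate with a pluggable backend) can be sketched as plain Python. All function and backend-key names below are illustrative stand-ins, not ThreeDLLM's actual API; the VLM call is stubbed out.

```python
# Hypothetical sketch of a modular enhance-then-generate pipeline.
# Names here are illustrative, not the project's real interface.

def enhance_prompt(prompt: str) -> str:
    """Stand-in for a VLM call (e.g. GPT-4 Vision) that enriches a prompt."""
    # A real implementation would query the vision-language model here.
    return f"{prompt}, highly detailed, clean topology, studio lighting"

def generate_shape(prompt: str, backend: str = "shap-e") -> dict:
    """Dispatch to one of the pluggable generator backends."""
    backends = {"shap-e", "huggingface", "neural4d", "instant3d"}
    if backend not in backends:
        raise ValueError(f"unknown backend: {backend}")
    # Each backend would return mesh data; this returns a placeholder record.
    return {"backend": backend, "prompt": prompt, "vertices": [], "faces": []}

def text_to_3d(prompt: str, backend: str = "shap-e") -> dict:
    """Full pipeline: enhance the prompt, then generate with the backend."""
    return generate_shape(enhance_prompt(prompt), backend)
```

The dispatcher pattern is what makes backends swappable: adding a generator means registering one more key, and quality/speed tradeoffs reduce to a backend choice per request.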
Without ThreeDLLM
A single backend, unenhanced prompts, little or no image-to-3D support, and export locked to one format.
With ThreeDLLM
VLM-enhanced prompts, image-to-3D, pluggable generators, and multiple export formats in one CLI/API.
What it does
- VLM-enhanced prompts – Improve text prompts with vision-language models for better 3D results.
- Image-to-3D – Generate 3D from a reference image using multimodal understanding.
- Multiple backends – Shap-E (open), Hugging Face (Inference API / Endpoints / local), Neural4D, Instant3D.
- Export formats – XYZ, OBJ, PLY, STL for viewing or 3D printing.
- Web UI & API – FastAPI server with form upload, progress, and download; CLI for scripting.
Tech stack
Python, PyTorch, Shap-E, FastAPI, and Uvicorn; an OpenAI API key for the VLM; optional Hugging Face token and Neural4D/Instant3D API keys. Docker and docker-compose for CPU or GPU deployment.
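The prompt-enhancement step reduces to a chat-completions request: a system instruction plus the user's raw prompt. The payload below matches the OpenAI chat API's message shape, but the system-prompt wording and model name are assumptions, not the project's actual values, and the network call itself is left out.

```python
# Sketch of the request payload the VLM enhancement step might send.
# System-prompt text and model name are assumed, not ThreeDLLM's own.

ENHANCE_SYSTEM_PROMPT = (
    "Rewrite the user's prompt so a text-to-3D generator produces a single, "
    "well-formed object: name the object, its materials, and overall shape."
)

def build_enhance_request(user_prompt: str, model: str = "gpt-4o") -> dict:
    """Assemble a chat-completions payload for prompt enhancement."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": ENHANCE_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    }

# An OpenAI client would send this with
# client.chat.completions.create(**build_enhance_request(prompt)).
```

Keeping payload construction separate from the client call makes the step easy to unit-test without an API key.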