The problem

Turning a text description or a single image into a usable 3D model is hard: prompt quality matters, and most tools are closed, expensive, or tied to a single format. Designers and developers need a flexible pipeline that can enhance prompts with VLMs and support multiple quality/speed trade-offs.

The solution

ThreeDLLM is a modular pipeline: use a VLM (e.g. GPT-4 Vision) to enhance text prompts or interpret images, then generate 3D with Shap-E, Hugging Face models, or API-based generators (Neural4D, Instant3D). Export to XYZ, OBJ, PLY, or STL for downstream use or 3D printing.
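The three-stage flow (enhance → generate → export) can be sketched as follows. This is a minimal illustration, not ThreeDLLM's actual API: `enhance_prompt`, `generate_points`, and `export_xyz` are hypothetical stand-ins, with the VLM call and the Shap-E/API backend replaced by dummy logic.

```python
# Sketch of the pipeline's three stages. All names here are illustrative
# stand-ins, not the real ThreeDLLM API.

def enhance_prompt(prompt: str) -> str:
    """Stand-in for a VLM call (e.g. GPT-4 Vision) that enriches the prompt."""
    return f"{prompt}, highly detailed, studio lighting, 3D asset"

def generate_points(prompt: str, n: int = 4) -> list[tuple[float, float, float]]:
    """Stand-in for a generator backend; returns a tiny dummy point cloud."""
    return [(float(i), float(i) * 0.5, 0.0) for i in range(n)]

def export_xyz(points, path: str) -> None:
    """Write points in the simple XYZ text format: one 'x y z' line per point."""
    with open(path, "w") as f:
        for x, y, z in points:
            f.write(f"{x} {y} {z}\n")

enhanced = enhance_prompt("a red ceramic teapot")
points = generate_points(enhanced)
export_xyz(points, "teapot.xyz")
```

Swapping the middle stage is the pluggability point: the same enhanced prompt can be routed to Shap-E, a Hugging Face model, or an API backend without touching the other two stages.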

Without ThreeDLLM

A single fixed backend, unenhanced prompts, no image-to-3D path, and format lock-in.

With ThreeDLLM

VLM-enhanced prompts, image-to-3D, pluggable generators, and multiple export formats in one CLI/API.

What it does

  • VLM-enhanced prompts – Improve text prompts with vision-language models for better 3D results.
  • Image-to-3D – Generate 3D from a reference image using multimodal understanding.
  • Multiple backends – Shap-E (open), Hugging Face (Inference API / Endpoints / local), Neural4D, Instant3D.
  • Export formats – XYZ, OBJ, PLY, STL for viewing or 3D printing.
  • Web UI & API – FastAPI server with form upload, progress, and download; CLI for scripting.
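To make the export-format bullet concrete, here are minimal point-cloud writers for two of the listed formats, OBJ and ASCII PLY. These follow the public format specifications; they are simplified illustrations, not ThreeDLLM's own exporters, and they write vertices only (no faces).

```python
# OBJ stores each vertex as a "v x y z" line.
def write_obj(points, path: str) -> None:
    with open(path, "w") as f:
        for x, y, z in points:
            f.write(f"v {x} {y} {z}\n")

# ASCII PLY declares the vertex count and properties in a header,
# then lists one "x y z" line per vertex.
def write_ply(points, path: str) -> None:
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ]
    with open(path, "w") as f:
        f.write("\n".join(header) + "\n")
        for x, y, z in points:
            f.write(f"{x} {y} {z}\n")
```

Files in either format open directly in common viewers such as MeshLab or Blender.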

Tech stack

Python, PyTorch, Shap-E, FastAPI, and Uvicorn; the OpenAI API for VLM prompt enhancement; optional Hugging Face tokens and Neural4D/Instant3D API keys. Docker and docker-compose for CPU or GPU deployment.
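A compose service for the FastAPI/Uvicorn server might look like the sketch below. The service name, port, and environment variable names are assumptions for illustration, not the repo's actual compose file.

```yaml
# Illustrative compose file; names and ports are assumptions.
services:
  threedllm:
    build: .
    ports:
      - "8000:8000"                        # Uvicorn's default port
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}   # VLM prompt enhancement
      - HF_TOKEN=${HF_TOKEN}               # optional Hugging Face backend
    # For GPU use, add a deploy.resources.reservations.devices block
    # (requires the NVIDIA Container Toolkit on the host).
```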

Next steps

Roadmap & ideas

  • Public demo or hosted API (TBD).
  • More VLM and 3D backends; quality/price/speed presets.
  • Batch generation and webhook callbacks for long jobs.
  • Integration with design tools or asset pipelines.