
floatingpoint
W24
Pivot 2 of 2
Open source API service to parse complex documents
Post-training data to teach models document work
Battle-tested, highly modular vision infrastructure to convert PDFs, PPTs, Word, Excel, PNGs, and JPEGs into LLM-ready data. We started by building lumina.sh, where we needed to parse ~600M pages of scientific literature. The researchers didn't care, but devs wanted our ingestion pipeline, so we built Chunkr instead. We offer high-quality layout analysis, OCR, bounding boxes, granular VLM controls, semantic chunking, and all the last-mile engineering that goes into building standout AI applications. Common use cases include RAG and automating document workflows, such as extracting invoices or medical reports into a database.
Floatingpoint builds off-the-shelf post-training datasets that teach models how to do real work with documents. We discover valuable tasks where models fall short and build datasets to close the gap. The datasets are human-crafted from real-world sources, expanded synthetically, and validated through in-house training cycles.
Pivoted from document parsing API infrastructure (Chunkr) to building post-training datasets for AI models (floatingpoint): a completely different product category and business model.
AI Search Engine for Research