Epistemic Gap Finder

A conceptual cartography instrument. Feed it a corpus of descriptions from any categorisable domain and it maps the semantic space those concepts occupy, identifies the low-density regions — the deserts — and generates ranked candidate descriptions for what could inhabit those gaps.

Epistemic Gap Finder on GitHub: https://github.com/datasculptures/epistemic-gap-finder

  • Python
  • NLP
  • UMAP
  • Embeddings
  • sentence-transformers
  • Conceptual Mapping

What it does

EGF is not a search engine and not a recommendation system. It is a strategic positioning instrument for anyone who wants to know what is structurally absent from a space before committing to a direction. The pipeline runs in six stages:

Embed

Embeds a directory of plain-text or Markdown descriptions using a local all-MiniLM-L6-v2 sentence-transformer model (~90 MB, downloaded once then fully offline). Produces 384-dimensional vectors — one per file.

Reduce

Reduces to 2D and 3D using UMAP. Assesses topology preservation with trustworthiness, continuity, and LCMC metrics. Warns if the 2D layout is unreliable.
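
To make the trustworthiness metric concrete, here is a minimal pure-Python sketch of the standard rank-based definition: it penalises points that appear among the k nearest neighbours in the reduced layout but were not neighbours in the original space. This is an illustration only, not EGF's own code; the function names and the toy data are hypothetical. scikit-learn ships an equivalent as sklearn.manifold.trustworthiness.

```python
import math

def _ranks(points, i):
    """Map every other index j to its neighbour rank from point i (1 = nearest)."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: (math.dist(points[i], points[j]), j),  # index breaks ties
    )
    return {j: rank for rank, j in enumerate(order, start=1)}

def trustworthiness(high, low, k=2):
    """Rank-based trustworthiness of the layout `low` for the data `high`."""
    n = len(high)
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")
    penalty = 0
    for i in range(n):
        high_rank = _ranks(high, i)
        low_rank = _ranks(low, i)
        for j, rank_low in low_rank.items():
            # j is a neighbour of i in the layout but was not one originally
            if rank_low <= k and high_rank[j] > k:
                penalty += high_rank[j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))

# Toy data: six points on a line, a faithful layout and a scrambled one.
high = [(float(x), 0.0, 0.0) for x in range(6)]
faithful = [(x, 0.0) for x, _, _ in high]
scrambled = [faithful[i] for i in (0, 5, 1, 4, 2, 3)]

print(trustworthiness(high, faithful))
print(trustworthiness(high, scrambled))
```

The faithful layout keeps every neighbourhood intact and scores 1.0; the scrambled layout invents neighbours and scores lower. Continuity is the mirror image, penalising original neighbours that the layout pushed apart.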

Density

Estimates density across the 2D space using k-NN radius density, then smooths the surface to separate genuine sparse regions from edge artefacts.
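
As a rough sketch of what a k-NN radius density estimate looks like, the snippet below uses one common form: density at a location is k divided by the area of the circle that reaches the k-th nearest corpus point. The exact estimator and the smoothing step EGF applies are not specified here, so the formula, the function name, and the default k are assumptions.

```python
import math

def knn_radius_density(points, query, k=3):
    """Estimate 2D density at `query` as k / (pi * r_k^2), where r_k is the
    distance from `query` to its k-th nearest point. Assumed form, not EGF's."""
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    distances = sorted(math.dist(query, p) for p in points)
    r_k = max(distances[k - 1], 1e-12)  # guard against a zero radius
    return k / (math.pi * r_k ** 2)

# Toy layout: a tight cluster near the origin and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense = knn_radius_density(points, (0.05, 0.05))
sparse = knn_radius_density(points, (2.5, 2.5))
print(dense > sparse)
```

Evaluating this on a regular grid over the 2D layout gives a density surface. Smoothing is omitted from the sketch.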

Detect gaps

Identifies low-density regions via local minima on the smoothed density surface. Each gap is scored by isolation — how absent it is from the corpus density, from 0 (fully occupied) to 1 (completely empty).
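
The idea of local minima on a density surface can be sketched on a small grid. In the snippet below a cell counts as a gap when its density is strictly lower than all of its neighbours. Isolation is assumed to be 1 minus the cell's density relative to the grid maximum, which matches the 0 to 1 range described above but is not confirmed as EGF's formula. The function name and grid values are hypothetical.

```python
def find_gaps(density):
    """Return (row, col, isolation) for each strict local minimum of a 2D
    density grid, most isolated first.

    ASSUMPTION: isolation = 1 - density / max_density, so an empty cell
    scores 1 and the densest cell scores 0.
    """
    rows, cols = len(density), len(density[0])
    peak = max(max(row) for row in density)
    gaps = []
    for r in range(rows):
        for c in range(cols):
            # Collect the up-to-eight surrounding cells, clipped at the edges.
            neighbours = [
                density[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(density[r][c] < n for n in neighbours):
                isolation = 1.0 - density[r][c] / peak if peak > 0 else 1.0
                gaps.append((r, c, round(isolation, 3)))
    return sorted(gaps, key=lambda g: g[2], reverse=True)

grid = [
    [4.0, 3.0, 4.0, 5.0],
    [3.0, 0.5, 3.0, 4.0],
    [4.0, 3.0, 4.0, 2.0],
    [5.0, 4.0, 3.0, 4.0],
]
print(find_gaps(grid))  # → [(1, 1, 0.9), (2, 3, 0.6)]
```

The second gap sits on the grid border, which is the kind of edge minimum that smoothing is meant to help separate from genuine sparse regions.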

Generate candidates

Produces ranked candidate descriptions for each gap — from vocabulary projection (offline, always available) or a local LLM via ollama (optional, richer language).

Report

Renders a standalone HTML report with an interactive Plotly scatter map, a gap table ranked by isolation score, and candidate cards with confidence scores and generation-mode badges.

Domains

EGF is domain-agnostic. The --domain flag sets report labels and shapes the LLM prompt.

  • concept: Default. Neutral labels, works for anything
  • software-tool: Developer tooling, CLIs, libraries
  • philosophy: Schools of thought, philosophical positions
  • genre: Musical or literary genres
  • discipline: Academic research fields
  • custom:<noun>: Any noun you choose, e.g. custom:tabletop RPG

Writing your corpus

EGF requires at least 7 .md or .txt files, each at least 50 characters long. Ten to twenty is the sweet spot. One file per concept, named after the concept. Every description uses a four-sentence format:

  1. What it is or does. Primary function or identity. Active voice, plain language.
  2. What it operates on. Inputs, subject matter, or domain it engages with.
  3. What it produces. Output, result, or effect.
  4. The boundary condition. What it explicitly does not do, cover, or include. This is the most important sentence — it is what precisely positions the concept in the space.

Run egf analyse <dir> --domain <domain> --describe-format to print the template for your domain and exit.
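
The size requirements above (at least 7 .md or .txt files, each at least 50 characters) are easy to check before a run. The helper below is a hypothetical pre-flight check, not part of EGF. It assumes length is counted after stripping surrounding whitespace, which may differ from how EGF counts.

```python
import tempfile
from pathlib import Path

MIN_FILES = 7   # minimum corpus size stated in this section
MIN_CHARS = 50  # minimum description length stated in this section

def check_corpus(corpus_dir):
    """Return a list of human-readable problems; an empty list means OK."""
    files = sorted(
        p for p in Path(corpus_dir).iterdir()
        if p.is_file() and p.suffix.lower() in {".md", ".txt"}
    )
    problems = []
    if len(files) < MIN_FILES:
        problems.append(f"found {len(files)} files, need at least {MIN_FILES}")
    for path in files:
        # ASSUMPTION: surrounding whitespace does not count towards the length.
        length = len(path.read_text(encoding="utf-8").strip())
        if length < MIN_CHARS:
            problems.append(
                f"{path.name}: {length} characters, need at least {MIN_CHARS}"
            )
    return problems

# Demo on a throwaway directory holding one description that is too short.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hammer.md").write_text("Drives nails.", encoding="utf-8")
    problems = check_corpus(tmp)

for line in problems:
    print(line)
```

Pointing check_corpus at your own corpus directory before running egf analyse surfaces undersized files early.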

Install

git clone https://github.com/datasculptures/epistemic-gap-finder.git
cd epistemic-gap-finder
python -m venv .venv
# Activate the environment with the line that matches your shell:
.venv\Scripts\Activate.ps1          # Windows (PowerShell)
source .venv/bin/activate           # macOS / Linux
pip install -e ".[dev]"

The first run downloads the all-MiniLM-L6-v2 model (~90 MB) to .cache/. Every subsequent run is fully offline. Requires Python 3.10, 3.11, or 3.12.

Usage

# Get the description template for a domain (no analysis runs)
egf analyse my_corpus --domain software-tool --describe-format

# Basic run — auto-opens report in browser
egf analyse my_corpus --domain concept --open

# Custom domain
egf analyse my_corpus --domain "custom:tabletop RPG" --open

# With LLM-enhanced candidates (requires ollama running locally)
egf analyse my_corpus --domain "custom:tabletop RPG" --llm --llm-model llama3.2 --open

# Tune UMAP for a small corpus
egf analyse my_corpus --domain concept --n-neighbors 5 --open

# Automatic isolation threshold selection
egf analyse my_corpus --domain concept --isolation-min auto --open

# Write output to a named directory
egf analyse my_corpus --domain concept --output my_run_01 --open

Key options

  • --domain: Domain template; sets report labels and LLM prompt (default: concept)
  • --output: Output directory (default: ./egf_output)
  • --isolation-min: Minimum isolation score for gap detection (default: 0.1; auto for adaptive)
  • --max-gaps: Maximum gap regions to report (default: 7)
  • --n-neighbors: UMAP n_neighbors; lower for small corpora (default: 15)
  • --quality-threshold: Trustworthiness warning floor (default: 0.75)
  • --llm: Enable LLM candidate generation via ollama (off by default)
  • --llm-model: Ollama model name (default: llama3)
  • --open: Open the report in a browser after generation
  • --verbose, -v: Verbose output

Output files

Each run creates a timestamped HTML report and overwrites the data files.

  • report_YYYYMMDD_HHMMSS.html: Standalone interactive HTML report
  • embeddings.npy: float32 array, shape (n, 384), raw sentence embeddings
  • reduced_2d.npy: UMAP 2D positions, shape (n, 2)
  • reduced_3d.npy: UMAP 3D positions, shape (n, 3)
  • quality.json: Trustworthiness, continuity, LCMC scores and warning flag
  • gaps.json: Detected gap regions ranked by isolation score
  • candidates.json: Generated candidate descriptions ranked by confidence

Reading the report

Reduction quality

Three scores assess how faithfully the 2D map preserves the high-dimensional structure.

  • Trustworthiness: ≥ 0.85 good, ≥ 0.75 acceptable
  • Continuity: ≥ 0.85 good, ≥ 0.70 acceptable
  • LCMC: ≥ 0.50 good, ≥ 0.20 acceptable

Low trustworthiness is common with small corpora (< 15 items). Try --n-neighbors 5 or add more items.
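
The thresholds above can be applied mechanically. The sketch below grades a set of scores against them; the dictionary keys and the label "poor" for anything below the acceptable floor are illustrative choices, not names taken from quality.json or from the report.

```python
# (good, acceptable) floors for each metric, copied from the list above.
THRESHOLDS = {
    "trustworthiness": (0.85, 0.75),
    "continuity": (0.85, 0.70),
    "lcmc": (0.50, 0.20),
}

def grade(metric, score):
    """Return 'good', 'acceptable', or 'poor' (hypothetical label) for a score."""
    good, acceptable = THRESHOLDS[metric]
    if score >= good:
        return "good"
    if score >= acceptable:
        return "acceptable"
    return "poor"

# Example scores, invented for illustration only.
scores = {"trustworthiness": 0.78, "continuity": 0.91, "lcmc": 0.15}
for name, value in scores.items():
    print(f"{name}: {value:.2f} ({grade(name, value)})")
```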

Isolation score

How absent a gap region is from the corpus density.

  • 0.8 – 1.0: Sharply isolated, an unambiguous gap
  • 0.5 – 0.8: Moderate, real but not dominant
  • 0.2 – 0.5: Weak, on the edge of the corpus
  • < 0.2: Marginal, probably noise
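
To see how these bands interact with --isolation-min and --max-gaps, here is a small sketch that filters a list of isolation scores, keeps the highest ones, and labels each with its band. The function names are hypothetical, and treating each band's lower bound as inclusive is an assumption; the defaults mirror the documented option defaults (0.1 and 7).

```python
def band(score):
    """Map an isolation score to the reading bands listed above."""
    if score >= 0.8:
        return "sharply isolated"
    if score >= 0.5:
        return "moderate"
    if score >= 0.2:
        return "weak"
    return "marginal"

def select_gaps(scores, isolation_min=0.1, max_gaps=7):
    """Keep scores at or above the floor, highest first, capped at max_gaps."""
    kept = sorted((s for s in scores if s >= isolation_min), reverse=True)
    return [(s, band(s)) for s in kept[:max_gaps]]

# Invented scores for illustration; 0.05 falls below the default floor.
for score, label in select_gaps([0.05, 0.92, 0.30, 0.55, 0.15]):
    print(f"{score:.2f}  {label}")
```

With the default floor of 0.1, a gap scoring 0.15 is still reported even though it reads as marginal, so raising the floor to 0.2 is one way to drop likely noise.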

Semantic map

Interactive Plotly scatter. Blue dots are corpus items, orange circles are gap regions. Hover for names and isolation scores. Zoom and pan with the mouse.

Candidate cards

One card per gap. Each shows a generated name, function summary, positioning statement relative to bounding items, confidence score, and a generation-mode badge: vocab, llm, or llm→vocab (LLM attempted, fell back).

LLM-enhanced candidates

Vocabulary candidates are assembled from TF-IDF term projections — sparse but always offline. The LLM path sends each gap's bounding items and vocabulary terms to a local ollama instance and produces readable, paragraph-quality descriptions. If ollama is not running, EGF falls back to vocabulary mode automatically.

# Pull a model once (~2 GB)
ollama pull llama3.2

# Keep this running in a separate terminal
ollama serve

# Warm up the model before running EGF
ollama run llama3.2 "Hello"

# Run EGF with LLM candidates
egf analyse my_corpus --domain "custom:tabletop RPG" --llm --llm-model llama3.2 --open
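
The automatic fallback described in this section, and the three generation-mode badges shown on candidate cards, can be pictured with a short sketch. This is not EGF's internal code: the function, the stub generators, and the shape of the gap dictionary are all hypothetical, and a connection failure is modelled as any OSError.

```python
def generate_candidate(gap, vocab_fn, llm_fn=None):
    """Return (description, mode), where mode is 'vocab', 'llm', or 'llm→vocab'."""
    if llm_fn is None:
        return vocab_fn(gap), "vocab"
    try:
        return llm_fn(gap), "llm"
    except OSError:
        # Covers refused connections and timeouts: fall back to vocabulary mode.
        return vocab_fn(gap), "llm→vocab"

def vocab_stub(gap):
    """Stand-in for the offline vocabulary projection."""
    return "terms: " + ", ".join(gap["terms"])

def offline_llm(gap):
    """Stand-in for an LLM call made while ollama is not running."""
    raise ConnectionRefusedError("ollama is not running")

gap = {"terms": ["dice", "solo", "journaling"]}  # hypothetical gap record
print(generate_candidate(gap, vocab_stub))
print(generate_candidate(gap, vocab_stub, offline_llm))
```

The second call produces the same vocabulary text as the first, but carries the llm→vocab badge so the report can show that the LLM was attempted.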