Philip Jama

Articles / Network Graph Analysis / Part 5

LLM-Augmented Knowledge Graphs

Using large language models to extract structured knowledge and build navigable graphs

Knowledge Graphs · LLM · Python · NetworkX

Large language models excel at extracting structured information from text -- turning paragraphs into entities, relations, and hierarchies that map directly to graph structures. This article explores how LLMs serve as knowledge extractors, how prompt design shapes the quality of extracted triples, and how LLM outputs become navigable graphs. The approach builds on the Books project's technique of converting LLM-generated outlines into NetworkX trees.

LLMs as Knowledge Extractors

Traditional NLP pipelines chain NER, coreference resolution, and relation extraction -- each introducing error that compounds downstream. An LLM can perform end-to-end extraction in a single prompt: given a passage, output a list of (subject, relation, object) triples. The quality depends on prompt design, but for many domains LLM extraction matches or exceeds pipeline approaches with far less engineering.
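The end-to-end flow can be sketched in a few lines. The LLM call itself is stubbed out here with a hard-coded response string and invented example triples; in practice that string would come back from an API call with an extraction prompt attached to the passage.

```python
import json
import networkx as nx

# Stand-in for an LLM response: a JSON list of [subject, relation, object]
# triples, as the extraction prompt would request.
llm_output = json.dumps([
    ["Marie Curie", "won", "Nobel Prize in Physics"],
    ["Marie Curie", "born_in", "Warsaw"],
    ["Nobel Prize in Physics", "awarded_by", "Royal Swedish Academy"],
])

def triples_to_graph(raw: str) -> nx.MultiDiGraph:
    """Parse JSON triples into a graph; MultiDiGraph allows parallel relations."""
    G = nx.MultiDiGraph()
    for subj, rel, obj in json.loads(raw):
        G.add_edge(subj.strip(), obj.strip(), relation=rel.strip())
    return G

G = triples_to_graph(llm_output)
print(G.number_of_nodes(), G.number_of_edges())  # 4 3
```

A `MultiDiGraph` is the safer default: two entities can be linked by more than one relation, and collapsing those into a single edge would silently drop information.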

Prompt Design for Triple Extraction

The extraction prompt needs to specify the entity types of interest, the relation vocabulary (open or closed), and the output format. A closed schema (fixed entity types and relation labels) produces cleaner graphs at the cost of missing novel relationships. An open schema captures more but requires post-processing to merge duplicates and normalize relation names. Few-shot examples in the prompt improve consistency significantly -- showing the model 3-5 input/output pairs anchors its behavior more reliably than long instructions alone.
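A closed-schema, few-shot extraction prompt might be assembled like this. The entity types, relation vocabulary, and example pairs are illustrative inventions, not a recommended schema:

```python
# Closed schema: a fixed entity-type and relation vocabulary (illustrative).
ENTITY_TYPES = ["Person", "Organization", "Location"]
RELATIONS = ["works_for", "founded", "located_in"]

# A few input/output pairs to anchor the model's output format.
FEW_SHOT = [
    ("Ada Lovelace worked with Charles Babbage in London.",
     '[["Ada Lovelace", "located_in", "London"]]'),
    ("Acme Corp was founded by Jane Smith.",
     '[["Jane Smith", "founded", "Acme Corp"]]'),
]

def build_prompt(passage: str) -> str:
    lines = [
        f"Extract triples. Entity types: {', '.join(ENTITY_TYPES)}.",
        f"Allowed relations: {', '.join(RELATIONS)}.",
        "Output a JSON list of [subject, relation, object]. No other text.",
    ]
    for text, triples in FEW_SHOT:
        lines += [f"Passage: {text}", f"Triples: {triples}"]
    lines += [f"Passage: {passage}", "Triples:"]
    return "\n".join(lines)

print(build_prompt("Grace Hopper worked for the US Navy."))
```

Ending the prompt at "Triples:" leaves the model to complete exactly the structure the examples demonstrate, which is where most of the consistency gain comes from.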

Prompting for Structured Hierarchies

Beyond flat triples, LLMs can generate hierarchical outlines -- a book's structure as nested topics, a codebase as module trees, a curriculum as prerequisite chains. The Books project demonstrates this: an LLM summarizes a book into a structured outline, which is then parsed into a tree graph. The key is constraining the output format (JSON, indented text, or markdown headings) so the parser can reliably convert text to graph.

Hierarchical tree graph built from a structured outline
import networkx as nx
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

FT_BG = '#FFF1E5'
plt.rcParams.update({
    'figure.facecolor': FT_BG,
    'axes.facecolor': FT_BG,
    'savefig.facecolor': FT_BG,
    'font.family': 'sans-serif',
    'font.sans-serif': ['Helvetica Neue', 'Arial', 'DejaVu Sans'],
    'font.size': 11,
    'text.color': '#333333',
})

outline = {
    'Machine Learning': {
        'Supervised': {'Classification': {}, 'Regression': {}},
        'Unsupervised': {'Clustering': {}, 'Dim. Reduction': {}},
        'Deep Learning': {'CNNs': {}, 'RNNs': {}, 'Transformers': {}},
        'Reinforcement': {'Q-Learning': {}, 'Policy Gradient': {}}
    }
}

G = nx.DiGraph()

def build_tree(parent, subtree, depth=0):
    for child, grandchildren in subtree.items():
        G.add_node(child, depth=depth+1)
        G.add_edge(parent, child)
        build_tree(child, grandchildren, depth+1)

root = list(outline.keys())[0]
G.add_node(root, depth=0)
build_tree(root, outline[root])

def hierarchy_pos(G, root, width=4.0, vert_gap=0.4):
    pos = {}
    def _pos(node, left, right, depth):
        pos[node] = ((left + right) / 2, -depth * vert_gap)
        children = list(G.successors(node))
        if children:
            dx = (right - left) / len(children)
            for i, child in enumerate(children):
                _pos(child, left + i*dx, left + (i+1)*dx, depth+1)
    _pos(root, 0, width, 0)
    return pos

pos = hierarchy_pos(G, root)
depths = [G.nodes[n]['depth'] for n in G.nodes()]
depth_colors = ['#990F3D', '#0F5499', '#0D7680', '#FF7FAA']
colors = [depth_colors[min(d, len(depth_colors)-1)] for d in depths]

fig, ax = plt.subplots(figsize=(12, 5))
nx.draw_networkx_edges(G, pos, ax=ax, arrows=True, arrowsize=15,
                       alpha=0.3, width=1.5, edge_color='#999999')
nx.draw_networkx_nodes(G, pos, ax=ax, node_color=colors,
                       node_size=450, alpha=0.9, linewidths=0)
# Labels below nodes for readability
label_pos = {k: (v[0], v[1] - 0.07) for k, v in pos.items()}
leaf_nodes = [n for n in G.nodes() if G.out_degree(n) == 0]
branch_nodes = [n for n in G.nodes() if G.out_degree(n) > 0]
nx.draw_networkx_labels(G, label_pos, labels={n: n for n in branch_nodes},
                        ax=ax, font_size=10, font_color='#333333',
                        font_weight='bold', font_family='sans-serif')
nx.draw_networkx_labels(G, label_pos, labels={n: n for n in leaf_nodes},
                        ax=ax, font_size=9, font_color='#333333',
                        font_weight='bold', font_family='sans-serif')
ax.set_axis_off()

fig.text(0.06, 0.97, 'Outline-to-Tree: Hierarchical Topic Structure', fontsize=14, fontweight='bold', va='top')
fig.text(0.06, 0.91, 'Knowledge graph derived from structured outline', fontsize=10, color='#666666', va='top')
fig.text(0.06, 0.03, 'Source: Philip Jama via pjama.github.io', fontsize=8, color='#999999', va='bottom')

fig.subplots_adjust(top=0.85, bottom=0.12)
fig.savefig('outline_tree.png', dpi=150, facecolor=FT_BG, bbox_inches='tight')

print('wrote outline_tree.png')

From Outlines to Trees: Parsing and Graph Construction

Parsing LLM output into graphs requires handling indentation levels (for nested lists), section numbering (for headings), or JSON structure (for explicit hierarchies). Each indent level or nesting depth maps to a parent-child edge. Robustness comes from normalizing whitespace, handling edge cases (empty sections, inconsistent formatting), and validating the resulting graph (connected, acyclic for trees).

Quality and Consistency

LLM-extracted graphs are only as good as the extraction prompt and the post-processing pipeline. Common failure modes include hallucinated entities (the model invents nodes not present in the source), missed relations (especially implicit ones), and inconsistent naming (the same entity appears under multiple surface forms). Entity resolution -- merging "United States", "US", and "America" into a single node -- is essential for producing clean graphs. Embedding-based deduplication works well here: embed all extracted entity names and merge pairs above a cosine similarity threshold.
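The threshold-based merge amounts to clustering by pairwise similarity, which union-find handles cleanly. In this sketch, toy 3-d vectors stand in for real embedding-model outputs, and the names and threshold are illustrative:

```python
import numpy as np
from itertools import combinations

# Toy vectors standing in for real embeddings of each entity name.
embeddings = {
    "United States": np.array([0.90, 0.10, 0.05]),
    "US":            np.array([0.88, 0.12, 0.06]),
    "America":       np.array([0.85, 0.20, 0.10]),
    "Warsaw":        np.array([0.10, 0.90, 0.30]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_entities(embeddings, threshold=0.9):
    """Union-find merge of every name pair above the similarity threshold."""
    parent = {n: n for n in embeddings}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            parent[find(a)] = find(b)
    clusters = {}
    for n in embeddings:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())

for cluster in merge_entities(embeddings):
    print(cluster)
# The three US surface forms land in one cluster; Warsaw stays alone.
```

Union-find also gives transitive merging for free: if "US" matches both "United States" and "America", all three collapse even when one pair falls just under the threshold.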

The next article extends these LLM-built knowledge graphs into a retrieval system: GraphRAG replaces traditional vector search with graph traversal, using the relational structure to retrieve richer, more connected context for generation.



Collaborate

If you're exploring related work and need hands-on help, I'm open to consulting and advisory work. Get in touch.