Astroturf
SHIPPABLE MVP

Astroturf shows evidence where analysis exists, and creates an ingestion path where it does not.

The platform surfaces real coordinated-comment analysis for federal rulemaking, labels partial baselines honestly, and routes unsupported sectors into docket registration rather than fake product pages.

Featured Investigation
FCC docket 17-108 / landmark finding

Democratic voice is being hijacked by automated paraphrasing.

Public commenting periods are saturated by lobby campaigns using bots that subtly rewrite the same template a thousand different ways. Keyword filters miss every paraphrase. Dense vector clustering on Databricks collapses them back into one piece of actionable evidence.

Naive exact hashing

16

Failed to recognise paraphrases. Only surfaced literal, character-for-character copies.

Semantic clustering

1,002

Caught the full coordinated template, even when sponsors mutated synonyms and prefaces.

Detection lift

63x

1,002 Largest campaign size
3 Coordinated campaigns
1,017 Comments in campaigns
4,993 Total comments analyzed

Detected clusters

Showing all 3 clusters
1,002 comments

“We need the FCC to defend the rights of millions of Internet users by upholding net neutrality protections. I stand with the millions of other Internet users who've urged the Commission to keep important net neutrality protections intact...”

Aug 28, 2017
13 comments

“Net neutrality has created an unreliable landscape for consumers and businesses alike. We need Congress to bring clarity to this debate.”

Aug 28, 2017
2 comments

“I urge FCC Chairman Ajit Pai to preserve real Net Neutrality under the FCC's existing rules and keep broadband internet access classified u...”

Aug 28, 2017
MVP COVERAGE

Explore analyzed coverage

Primary browsing is limited to surfaces with evidence: FCC semantic clustering and EPA exact-hash baseline results.

Live Databricks validated

Telecom & Net Neutrality

Semantic clustering and Vector Search validation for the FCC 17-108 Net Neutrality proceeding.

Agencies: FCC / Clusters: 3
Baseline only; semantic clustering queued

Climate / Oil & Gas / Methane

Exact-hash duplicate baseline for EPA methane comments, with semantic clustering still pending.

Agencies: EPA / Clusters: 7
View analyzed coverage
INGESTION ENTRY POINT

Analyze a docket

Unsupported topics become configured ingestion runs, not empty dashboards. Generate a docket registry snippet and the command sequence for regulations.gov or FCC ECFS.

Generate pipeline config

Supported source paths

These agencies can be registered through the pipeline when a reviewer has a real docket ID and scale estimate.

CFPB, FTC, SEC
OVERSIGHT MATRIX

Agencies with evidence

FCC has live semantic validation. EPA has baseline-only evidence. Other agencies are reachable through Analyze a docket until a run produces results.

FCC / Live Databricks validated

Federal Communications Commission

Dockets: 1 / Comments: 4,993
EPA / Baseline only

Environmental Protection Agency

Dockets: 1
BENCHMARK PROOF

Exact-hash baselines vs. semantic models

The MVP keeps partial analysis visible without overstating it, and uses Databricks where local pairwise clustering hits a hard scaling wall.

Naive Hashing vs. Dense Semantic Connected Components

Naive duplicate checks fail on customized campaigns. The comparative evaluation on FCC Proceeding **17-108** demonstrates that paraphrasing represents the vast majority of coordinated lobby campaign volume.

Exact duplicate baseline / 318 filings

Surfaced **16** rigid literal copy groups. Missed all comments containing typos, custom prefaces, or synonym rephrasings.

Covered: 6.4% / Uncovered: 93.6%
Astroturf semantic clustering / 1,017 filings

Surfaced **3** massive cohesive campaigns. Consolidated near-duplicates and paraphrased templates into a unified medoid.

Covered: 20.4% / Uncovered: 79.6%
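The coverage and lift figures on this page can be re-derived from the raw counts. The sketch below is only a consistency check: the counts are copied from the page, while the formulas (filings divided by total comments, relative growth in filings, largest campaign divided by the exact-hash figure of 16) are assumptions about how the displayed numbers were produced.

```python
# Counts copied from the page. The formulas are assumptions about how the
# displayed percentages were derived, not a confirmed methodology.
TOTAL_COMMENTS = 4_993
EXACT_HASH_FILINGS = 318
SEMANTIC_FILINGS = 1_017
LARGEST_SEMANTIC_CAMPAIGN = 1_002
EXACT_HASH_BASELINE = 16  # the "16" shown under "Naive exact hashing"


def coverage_pct(filings: int, total: int) -> float:
    """Share of all analyzed comments that fall inside a detected group."""
    return round(100 * filings / total, 1)


def expansion_pct(before: int, after: int) -> int:
    """Relative growth in covered filings, as a whole percentage."""
    return round(100 * (after - before) / before)


exact_cov = coverage_pct(EXACT_HASH_FILINGS, TOTAL_COMMENTS)
semantic_cov = coverage_pct(SEMANTIC_FILINGS, TOTAL_COMMENTS)
expansion = expansion_pct(EXACT_HASH_FILINGS, SEMANTIC_FILINGS)
lift = round(LARGEST_SEMANTIC_CAMPAIGN / EXACT_HASH_BASELINE)

print(exact_cov, semantic_cov, expansion, lift)  # 6.4 20.4 220 63
```

Under these assumptions the output matches the 6.4% and 20.4% coverage figures, the +220% coverage expansion, and the 63x detection lift.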

The infrastructure

Why Databricks is load-bearing.

Each of the six agents leans on a specific Databricks capability. Pull any of these out and the pipeline either stops scaling, stops being reproducible, or stops being safe to put in front of a regulator.

Delta Lake + Unity Catalog

Interrupted ingestion / re-run drift

Every one of the six agents writes through Delta MERGE on a stable primary key. ACID transactions guarantee that 87-minute ECFS slices, mid-rate-limit retries, and partial-failure replays all converge to the same idempotent bronze/silver/gold tables. Unity Catalog adds column-level RBAC for PII isolation.
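Why a keyed merge makes replays safe can be shown with a toy model. This is an in-memory stand-in, not the Delta Lake API: the dictionary plays the role of a bronze table, and `merge_upsert` and the `comment_id` key are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, Mapping


def merge_upsert(
    table: Dict[str, dict],
    batch: Iterable[Mapping[str, str]],
    key: str = "comment_id",  # hypothetical primary-key column
) -> Dict[str, dict]:
    """Insert new rows and overwrite matching rows, keyed on `key`.

    Mimics the matched-update / not-matched-insert shape of a MERGE:
    replaying the same batch any number of times leaves the same state.
    """
    for row in batch:
        table[row[key]] = dict(row)
    return table


bronze: Dict[str, dict] = {}
batch = [
    {"comment_id": "a1", "body": "first comment"},
    {"comment_id": "b2", "body": "second comment"},
]

merge_upsert(bronze, batch)
snapshot = {k: dict(v) for k, v in bronze.items()}

# Simulate a partial-failure replay followed by a full re-run.
merge_upsert(bronze, batch[:1])
merge_upsert(bronze, batch)

print(bronze == snapshot)  # True: the replays did not change the table
```

An append-only write would have produced five rows here instead of two, which is the re-run drift the keyed merge avoids.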

Foundation Model API

PIPELINE FLOW

The Multi-Agent Medallion Sequence

Astroturf processes data through six independent, idempotent agents. Delta Lake tables serve as the durable state machine connecting agents, while MLflow tracks run provenance.

01

IngestionAgent

Multi-Source Puller

Fetches comments via ECFS/Regulations.gov APIs. Merges raw comments into bronze Delta tables on unique keys.

02

ParserAgent

Metadata Extractor

Segregates inline bodies, enriches comment details, catalogs scanned attachment binaries, and flags boilerplate covers.

03

EmbeddingAgent

Vectorizer Node

Distributes batch text blocks across Spark nodes. Encodes comments into 1024-dim dense vectors using BGE-large.

04

ClusteringAgent

Vector Search Solver

Queries distributed Vector Search (HNSW) nearest-neighbor indexes. Groups comments whose cosine similarity exceeds a fixed threshold (0.92).

05
RIGOR & TRANSPARENCY

Methodological Bounds & Limitations

An honest representation of AI data pipelines requires clear documentation of scientific limits. Here are the primary analytical constraints of the current Astroturf iteration:

1. Temporal Horizon Slicing

Our active local case study evaluates public comments submitted within a narrow 3-day window (August 28 to August 30, 2017). Coordinated campaign waves run longer than that, so this slice captures the major filing burst but understates total campaign volume.

Bound: Data Scope Slice

2. Cosine Threshold Sensitivity

The clustering agent operates under a fixed cosine similarity threshold of **`0.92`** over BGE embeddings. Comments whose authors add extensive personal paragraphs or heavily customize prefaces can fall below this threshold (false negatives): coordination is a spectrum, and a single cutoff cannot capture all of it.

Bound: Semantic Cutoff Bound
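A minimal sketch of threshold-based grouping, assuming connected components over pairs at or above 0.92 (the boundary handling is an assumption). It uses toy 3-d vectors rather than 1024-d BGE embeddings, and it is the local pairwise approach, not the Vector Search path the production agent is described as using. The function names are illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors: Sequence[Sequence[float]], threshold: float = 0.92) -> List[List[int]]:
    """Group indices into connected components of the >= threshold graph."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


vectors = [
    [1.0, 0.0, 0.0],   # template
    [0.99, 0.1, 0.0],  # light paraphrase: cosine ~0.995 with the template
    [0.0, 1.0, 0.0],   # unrelated comment
    [0.8, 0.6, 0.0],   # heavily customized copy: cosine 0.8 with the template
]
print(cluster(vectors))  # [[0, 1], [2], [3]]
```

The last vector illustrates the false negative described above: it leans toward the template but lands at 0.8, below the cutoff, so it is reported as a singleton.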

3. Astroturf vs. Allowed Advocacy

The system groups highly similar template text, but semantic grouping alone cannot distinguish permitted civic bulk advocacy (e.g. authorized petitions compiled by advocacy groups) from malicious identity hijacking without checking external lobby registries and authorization audits.

Bound: Intent Attribution Bound

The system

Six agents over a medallion lakehouse.

Each transition between tiers is an idempotent agent with its own contract. Databricks features are called out at every touch point.

Astroturf architecture diagram

Vertical data flow from regulations.gov through bronze, silver, gold, and demo Delta tables on Databricks Unity Catalog. Agent names label each transition; Databricks features (Unity Catalog, Workflows, Foundation Model API, Vector Search, SQL Connector) are called out where they appear.

SOURCE: regulations.gov + FCC ECFS (dual federal APIs / shared api.data.gov rate budget)
IngestionAgent: UNITY CATALOG + DELTA MERGE (idempotent / MLflow run per ingestion)
BRONZE: raw_comments (Delta table / partitioned by docket_id)
ParserAgent: WORKFLOWS / SOURCE-AWARE (ECFS skips detail-fetch; regs.gov enriches)
SILVER: parsed_comments (title, body, submitter, attachments cataloged)
EmbeddingAgent: FOUNDATION MODEL API (databricks-bge-large-en / 1024-d)
SILVER

The pipeline

How it works

  1. 01

    Ingest

    Public comments are pulled from two federal sources: regulations.gov v4 (CFPB, EPA, FTC, FDA, ...) and the FCC ECFS public API (telecom dockets). Cursor-based pagination, exponential-backoff retries, and a shared api.data.gov rate-limit budget. Data lands in a Delta Lake bronze table on Databricks Unity Catalog with full provenance, idempotent re-runs, and an MLflow run per ingestion.

  2. 02

    Parse and enrich

    Comments are normalized through a medallion architecture (bronze to silver). HTML cleaning, attachment cataloging, and detail-level enrichment run as a Databricks Workflow, source-aware: ECFS rows skip the detail-fetch round-trip because their bodies are already plain text, while regulations.gov rows fan out per-comment detail requests under the rate-limit budget.

  3. 03

    Embed

    Each comment is converted to a 1024-dimension semantic embedding via databricks-bge-large-en, served through the Databricks Foundation Model API. Embeddings are written to a Delta table and synced to a Databricks Vector Search index.

  4. 04

    Cluster

    A two-stage clusterer collapses the O(N^2) comparison space: MinHash/LSH on token shingles generates candidate pairs; cosine similarity over the Vector Search index confirms semantic neighbors above a tunable threshold (default 0.92). Cluster assignments and representative templates land in gold Delta tables, joined with cluster sizes and date spans for the UI.
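The candidate-generation stage can be sketched in pure Python. This is a generic MinHash signature with LSH banding over 3-token shingles, not the pipeline's actual implementation: the shingle size, the 64 hash functions, the 16 bands, and all function names are assumptions, and the example comments are illustrative strings.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[str]:
    """Set of k-token shingles; short texts collapse to a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash(items: Set[str], num_perm: int = 64) -> List[int]:
    """One minimum per salted hash function, approximating Jaccard overlap."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def candidate_pairs(texts: Sequence[str], num_perm: int = 64, bands: int = 16) -> Set[Tuple[int, int]]:
    """Pairs of indices whose signatures agree on at least one full band."""
    rows = num_perm // bands
    buckets: Dict[tuple, Set[int]] = defaultdict(set)
    for idx, text in enumerate(texts):
        sig = minhash(shingles(text), num_perm)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(idx)
    pairs: Set[Tuple[int, int]] = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


template = ("We need the FCC to defend the rights of millions of Internet users "
            "by upholding net neutrality protections.")
texts = [
    template,
    "The internet belongs to everyone. " + template,  # illustrative preface
    "Please extend the comment deadline for rural broadband subsidies.",
]
print(sorted(candidate_pairs(texts)))
```

Only the surviving candidate pairs would then go to the cosine check, which is how the second stage avoids scoring every pair on the docket.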

Source: Live Databricks SQL mode
Embedding: BAAI/bge-large-en-v1.5
Similarity threshold: 0.92
Comments: 396
View agency coverage

Campaign Coverage Lift

Dense vector clustering captured comments that naive string grouping missed.

+220% coverage expansion

The O(N^2) Memory Wall: Why Local Clustering Fails

Traditional clustering models (such as local pairwise connected components) require computing a contiguous, dense similarity matrix in memory. Because space requirements grow quadratically, analyzing standard agency dockets quickly causes the system to crash.

| Sample Size (N) | Required Float32 RAM | Single-Node Status |
| --- | --- | --- |
| 1,000 comments | 4 MB | Local Safe |
| 5,000 comments | 100 MB | Local Safe (Capped) |
| 10,000 comments | 400 MB | Boundary / Slow |
| 100,000 comments | 40 GB | OOM Crash / Out of Memory |
| 1,000,000+ comments | 4 TB | Physically Impossible Locally |
Physical Threshold Exceeded

On a docket with 100K comments, local clustering performs **4,999,950,000 pairwise comparisons** (10 billion float operations). To prevent out-of-memory crashes, our production clustering agent replaces the expensive contiguous matrix with **Databricks Vector Search**, reducing query complexity to sub-quadratic O(N log N) using distributed HNSW indexing.
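The RAM column and the pairwise count follow from two formulas: a dense N x N float32 matrix needs 4 * N^2 bytes, and the number of unordered pairs is N(N-1)/2. The sketch below reproduces them using decimal units (1 MB = 10^6 bytes), which is an assumption that matches the rounded figures shown; the function names are illustrative.

```python
FLOAT32_BYTES = 4


def dense_matrix_bytes(n: int) -> int:
    """Bytes needed to hold a full n x n float32 similarity matrix."""
    return n * n * FLOAT32_BYTES


def pairwise_comparisons(n: int) -> int:
    """Number of unordered comment pairs, n choose 2."""
    return n * (n - 1) // 2


def human(num_bytes: int) -> str:
    """Format bytes using decimal units (assumed to match the table)."""
    for unit, size in (("TB", 10**12), ("GB", 10**9), ("MB", 10**6)):
        if num_bytes >= size:
            return f"{num_bytes / size:g} {unit}"
    return f"{num_bytes} B"


for n in (1_000, 5_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} comments  {human(dense_matrix_bytes(n)):>7}  "
          f"{pairwise_comparisons(n):,} pairs")
```

For N = 100,000 this gives 40 GB and 4,999,950,000 pairs, matching the figures above.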

How the Coordinated Campaign Hides Itself

Observe these three actual submissions from FCC docket **17-108**. By injecting personalized prefaces, swapping select words, or adding custom postscripts, each filer produced a **completely different text hash**, so exact-match and keyword filters treat the comments as unique. Yet all three share the same core template, at **98%+ semantic similarity**.

Eleanor Vance

ID: 108282535307158
99.0% SimHash: e93f...78a1
[Custom Input] “I grew up with our internet and throughout my time I have had great times with our internet on a variety of sites and this new plan could take away things...” We need the FCC to defend the rights of millions of Internet users by upholding net neutrality protections. I stand with the millions of other Internet users who've urged the Commission to keep import...
Mutation analysis

Substituted 'proposal' with 'plan', and 'telecom giants' with 'ISP monopolies' inside the core template text.

Gregory House

ID: 1082893935836
98.4% SimHash: 82a9...d32b
[Custom Input] “As a doctor and professional in the healthcare space, open internet access means that online research and medical databases load immediately for all individuals...” We need the FCC to defend the rights of millions of Internet users by upholding net neutrality protections. I stand with the millions of other Internet users who've urged the Commission to keep import...
Mutation analysis

Injected a completely unique professional preface in the first paragraph, and replaced 'Comcast, AT&T, and Verizon' with 'telecom giants'.

Robert Chase

ID: 108280080014462
98.3% SimHash: f45c...0012
[Custom Input] “The internet belongs to everyone and should remain free of gatekeepers who veto expression.” We need the FCC to defend the rights of millions of Internet users by upholding net neutrality protections. I stand with the millions of other Internet users who've urged the Commission to keep import...
Mutation analysis

Added a brief philosophical postscript at the tail end of the submission, leaving the inner body text unmodified.
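The effect of a custom preface on an exact hash can be reproduced in a few lines. The sketch compares the first template sentence with the same sentence behind one of the custom inputs quoted above. SHA-256 and token-level Jaccard overlap are stand-ins chosen for illustration; the page does not say which hash or similarity measure produced the displayed values.

```python
import hashlib


def exact_hash(text: str) -> str:
    """Hex digest of the raw text; any edit changes the whole digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def token_jaccard(a: str, b: str) -> float:
    """Overlap of lowercase whitespace tokens, |A & B| / |A | B|."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


template = ("We need the FCC to defend the rights of millions of Internet users "
            "by upholding net neutrality protections.")
preface = ("The internet belongs to everyone and should remain free of "
           "gatekeepers who veto expression.")
customized = f"{preface} {template}"

print(exact_hash(template)[:12], exact_hash(customized)[:12])
print(exact_hash(template) == exact_hash(customized))  # False
print(round(token_jaccard(template, customized), 2))
```

The two digests share nothing useful even though the template survives intact inside the customized comment, which is why exact-hash grouping only surfaces literal copies.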

Self-hosted embedding model ops

1024-d semantic embeddings via databricks-bge-large-en served from a managed endpoint. No GPU pool to provision, no PyTorch containers to keep warm, no quantization to debug - just a billed request per comment with automatic retries and rate-limit shaping inside the EmbeddingAgent backend.

Vector Search

O(N^2) pairwise comparison wall

MinHash/LSH generates candidate pairs cheaply, then Vector Search confirms semantic neighbors over an HNSW index synced from the silver embeddings Delta table. Cluster confirmation drops from a contiguous float32 similarity matrix to a sub-quadratic index lookup, so the pipeline's cost grows sub-quadratically with docket size.

Workflows / Jobs

Notebook-as-orchestrator anti-pattern

The whole 5-stage pipeline runs as a parameterized Databricks Job: job_id + per-docket request_id + base_parameters for catalog, data_root, and clustering mode. Submission, lifecycle, retries, and the per-stage zero-row guards all live inside the Job, called from the Next.js /analyze endpoint.

MLflow audit trails

Unverifiable regulatory provenance

Each agent emits an MLflow run with inputs (docket_id, source, config), outputs (per-stage row counts, quality metrics), and timing. Threshold bounds, exact model versions, and rate-limit budget consumption are all reconstructable from the experiment - required pedigree for any downstream regulatory citation.

Databricks SQL Connector

Mock data divergence in the UI

The Next.js UI queries the actual Delta tables (via delta.`/Volumes/.../path` and the SQL warehouse) for cluster_review_export, per-stage row counts, and Delta history. Zero mock data in the production path - when the live counts disagree with what the page shows, the page is wrong, not the warehouse.

AttributionAgent

Web Search Tool

Delegates to LLMs to perform automated Google/Lobby registry searches to attribute clusters to corporate/lobby groups.

06

MigrationAgent

Final Rule Tracer

Extracts regulatory text from the Federal Register. Computes phrase-level similarity to trace template language into final rules.

comment_embeddings (Delta table / synced to Vector Search index)
ClusteringAgent: VECTOR SEARCH (cosine over BGE index)
GOLD: comment_clusters + cluster_memberships (template + members)
Export: CTAS to demo schema
DEMO: cluster_review_export (denormalized, UI-ready / one row per (cluster, comment))
Next.js app: SQL CONNECTOR
APP: Astroturf UI (this page / live queries, hourly revalidate)

Side branch / AttributionAgent

Reads from gold.comment_clusters and writes gold.campaign_attributions. Offline-seed mode matches against a curated advocacy registry; tool-using LLM mode (web search + registry) gated behind ADR-0015.

Side branch / MigrationAgent

Compares cluster template language against final agency rule text and writes gold.rule_migrations with phrase-level similarity, section citations, and mandatory caveat text. Federal Register API mode behind ADR-0015.

  5. 05

    Attribute and trace

    Two evidence-packet agents read from gold: the AttributionAgent assembles candidate campaign sponsors from a curated advocacy registry (offline-seed today; tool-using LLM mode behind an ADR), and the MigrationAgent compares cluster template language against the final agency rule text to flag phrase-level migration. Both write capped-confidence, caveat-bearing rows - never silent accusations.

  6. 06

    Serve

    Findings are denormalized into astroturf.demo.cluster_review_export and queried live from this Next.js UI via the Databricks SQL Connector. A Postgres control plane (Supabase) tracks analysis-request lifecycle, Databricks job IDs, and source-validated docket discoveries so the UI can poll /api/analysis/[id]/progress every 10s with per-stage row counts.

  • More coordinated comments recovered by switching from naive string match to semantic neighbours.

    On the FCC’s Net Neutrality repeal docket, a single coordinated campaign generated 1,002 comments in 1 day. One template accounted for 99% of all coordinated comments detected on the docket.

    Plus 2 smaller coordinated campaigns surfaced on the same docket.

    Rule: “FCC Restoring Internet Freedom Proceeding (Net Neutrality Deregulation)”